Overview

Timeloop (Sparseloop) is an infrastructure that aims to provide modeling, mapping and code-generation for Explicitly-Decoupled Data Orchestration (EDDO) architectures, with a focus on dense and sparse tensor-algebra workloads. It is built from two modular components:

  • A fast analytical model that can emulate a range of EDDO architecture designs and provide performance and energy projections
  • A mapper that searches for an optimal mapping in the space of mappings of a tensor-algebra problem on a given architecture

An analytical tool is critical to explore the design space of accelerator architectures. The regularity of tensor algebra algorithms allows for (a) building a single unified modeling infrastructure capable of emulating a range of architectural topologies, and (b) building a fast analytical model that can be used in-line as the cost-model for an automated mapper. Finding an optimized mapping of an algorithm on a well-known hardware architecture is itself a challenging problem because of the sheer number of mappings that can exploit data reuse via various spatio-temporal hardware reuse mechanisms. On the architecture-design side, the problem is compounded because architectural design-space exploration and mapping-space exploration are intimately inter-related. To evaluate a proposed architecture, one needs to know how the optimized compiled version of a range of applications would perform on that architecture. Timeloop provides the ability to rapidly explore this co-design space, allowing architects to produce better-optimized designs.
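To give a feel for the size of the mapping space, the sketch below counts one narrow slice of it: the ways to split each loop bound into ordered per-level tiling factors, multiplied by the loop orderings available at each level. The problem shape, the number of levels, and the counting rule itself are illustrative assumptions; this is not how Timeloop constructs or counts its mapspace.

```python
from math import factorial


def ordered_factorizations(n: int, parts: int) -> int:
    """Count ordered tuples of `parts` positive integers whose product is n."""
    if parts == 1:
        return 1
    return sum(
        ordered_factorizations(n // d, parts - 1)
        for d in range(1, n + 1)
        if n % d == 0
    )


def mapspace_size(bounds: dict, levels: int) -> int:
    """Toy count: (tilings of every loop bound) x (loop orderings at every level)."""
    tilings = 1
    for bound in bounds.values():
        tilings *= ordered_factorizations(bound, levels)
    # Each level may order its loops independently (assumed counting rule).
    orderings = factorial(len(bounds)) ** levels
    return tilings * orderings


if __name__ == "__main__":
    # Hypothetical 64x64x64 matrix multiply on a hypothetical 3-level hierarchy.
    print(mapspace_size({"M": 64, "N": 64, "K": 64}, levels=3))
```

Even this restricted count, which ignores spatial partitioning and bypass choices, comes to about 4.7 million candidate mappings for a small problem, which is why a fast cost model matters.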

Timeloop can be used to systematically model a variety of tensor-algebra accelerators, to evaluate millions of non-trivial architectural dataflows and algorithmic mappings, to explore the interplay of data-movement tradeoffs, to compare existing architectures and explore the design space of proposed architectures, and to explore the opportunities provided by sparsity.

Timeloop analytically models working sets that flow through an architecture over time. Workload shape (tensors and their access patterns) and hardware architecture parameters/constraints are provided as inputs. Technology parameters such as multiplier area/energy, SRAM area/access energy for a variety of sizes, wire energy per unit of distance moved, etc. are used to model the costs of buffer-access, wire-transfer and arithmetic operations. Applying the results of the working-set analysis to the access-cost models provides estimates of performance, area and energy.

An optimization solver can be used to run the model through the space of solutions to find an optimal solution within user-specified constraints (e.g., an area bound). The model can be validated using cycle-level simulation, product-design data, and synthesis experiments on specific design points, which can further refine the modeling engine or help establish expected error bounds.
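The flow described above (access counts from a working-set analysis, multiplied by per-access costs, then searched under a constraint) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the two-level DRAM-plus-buffer hierarchy, the output-stationary tiling of a matrix multiply, the per-operation energy numbers, and the function names `evaluate` and `search`. It is not Timeloop's model or API, and a buffer-capacity limit stands in for an area bound.

```python
from itertools import product

# Hypothetical per-operation energies in picojoules (placeholders, not technology data).
ENERGY_PJ = {"dram": 200.0, "buffer": 6.0, "mac": 1.0}


def divisors(n):
    """All positive divisors of n, used as candidate tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


def evaluate(M, N, K, tm, tn, tk):
    """Return (energy_pj, buffer_words) for one tiling of C[M,N] += A[M,K] * B[K,N]."""
    macs = M * N * K
    # A is re-fetched once per column tile of C, B once per row tile of C;
    # each C element stays in the buffer until complete and is written once.
    dram = M * K * (N // tn) + K * N * (M // tm) + M * N
    buffer = 3 * macs  # assumed: one A read, one B read, one C update per MAC
    energy = (
        dram * ENERGY_PJ["dram"]
        + buffer * ENERGY_PJ["buffer"]
        + macs * ENERGY_PJ["mac"]
    )
    footprint = tm * tk + tk * tn + tm * tn  # words held on chip at once
    return energy, footprint


def search(M, N, K, buffer_capacity):
    """Exhaustively pick the lowest-energy tiling that fits in the buffer."""
    best = None
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        energy, footprint = evaluate(M, N, K, tm, tn, tk)
        if footprint > buffer_capacity:
            continue  # violates the user-specified constraint
        if best is None or energy < best[0]:
            best = (energy, (tm, tn, tk))
    return best


if __name__ == "__main__":
    print(search(64, 64, 64, buffer_capacity=1024))
```

Exhaustive search is only practical here because the toy space is small; the point is the structure, in which the cost model is cheap enough to be called once per candidate mapping.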