Mapping

Timeloop outputs a pretty-printed form of the mapping in a file with the .map.txt suffix. Below is an example mapping file from Timeloop exercise 04-model-conv1d-oc-3levelspatial, using the C-partitioned mapping conv1d+oc+ic-3levelspatial-cp-ws.map.yaml.

MainMemory [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
---------------------------------------------------------------------
| for P in [0:1)

GlobalBuffer [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
-----------------------------------------------------------------------
|   for C in [0:2)
|     for K in [0:32)
|       for R in [0:3)
|         for C in [0:16) (Spatial-X)

RegisterFile [ Weights:1 (1) Inputs:16 (16) Outputs:16 (16) ]
-------------------------------------------------------------
|           for P in [0:16)
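
To make the printout concrete, the following is a minimal Python sketch of the loop nest it describes, assuming a 1D-convolution workload of the form O[k][p] += W[k][c][r] * I[c][p + r]. The variable names are chosen for illustration, and the Spatial-X loop is serialized here for readability; in hardware its 16 iterations run on 16 parallel instances.

import numpy as np

# Sizes read off the loop bounds: K = 32, C = 2 * 16 = 32, R = 3, P = 1 * 16 = 16.
K, C, R, P = 32, 32, 3, 16
W = np.random.rand(K, C, R)       # weights
I = np.random.rand(C, P + R - 1)  # inputs (halo of R - 1 for the sliding window)
O = np.zeros((K, P))              # outputs

for p1 in range(1):                       # MainMemory:   P in [0:1)
    for c1 in range(2):                   # GlobalBuffer: C in [0:2)
        for k in range(32):               #               K in [0:32)
            for r in range(3):            #               R in [0:3)
                for c0 in range(16):      #               C in [0:16) (Spatial-X)
                    c = c1 * 16 + c0
                    for p0 in range(16):  # RegisterFile: P in [0:16)
                        p = p1 * 16 + p0
                        O[k, p] += W[k, c, r] * I[c, p + r]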

Mapping Decisions

The mapping above demonstrates three major mapping decisions: loop tiling, loop permutation, and spatial execution.

Loop Tiling

Loop tiling partitions a loop's iteration space into smaller tiles, splitting one loop into multiple loops whose bounds are the tiling factors. Assume we have the following mapping printout:

|   for C in [0:2)
    ...
|     for C in [0:16)

This mapping tiles the iteration space along the C problem dimension with tiling factors 2 and 16. The access pattern of the channel index c_idx is as follows:

for c1 in range(0, 2):      # outer loop over the 2 tiles
  for c0 in range(0, 16):   # inner loop within a tile of size 16
    c_idx = c1 * 16 + c0    # reconstructed index along C
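
As a quick check, the two tiled loops together enumerate the original 32-element iteration space exactly once and in order:

tiled = [c1 * 16 + c0 for c1 in range(0, 2) for c0 in range(0, 16)]
assert tiled == list(range(32))   # tiling preserves the full index space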

Loop Permutation

Loop permutation reorders the loops in the nest to vary the data access pattern. It is a critical mapping decision because it determines data locality and data reuse opportunities. In this weight-stationary mapping, the innermost temporal loop traverses the P dimension, which corresponds to different output data but the same weight data; the nest then traverses the weight-related problem dimensions R, K, and C. Note that spatial loops execute in parallel and are therefore not part of the temporal loop permutation decision.
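
To see why the permutation matters for reuse, here is an illustrative sketch that counts fetches into a hypothetical one-entry weight register under two permutations (the C dimension is omitted for brevity):

from itertools import product

def weight_fetches(order):
    """Simulate a one-entry weight register and count how often it must be
    refilled; `order` lists loop dimensions from outermost to innermost."""
    bounds = {'K': 32, 'R': 3, 'P': 16}
    fetches, last = 0, None
    for point in product(*(range(bounds[d]) for d in order)):
        idx = dict(zip(order, point))
        w = (idx['K'], idx['R'])   # the weight depends only on K and R
        if w != last:              # register miss: fetch a new weight
            fetches += 1
            last = w
    return fetches

print(weight_fetches(['K', 'R', 'P']))  # 96: P innermost keeps each weight stationary
print(weight_fetches(['P', 'K', 'R']))  # 1536: P outermost refetches every weight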

Spatial Execution

Spatial execution specifies how tensor data are partitioned among spatial resources. Loops annotated with (Spatial-X) or (Spatial-Y) are mapped to spatial execution. For the C-partitioned convolution in this example, the weights and inputs are partitioned among 16 spatial resources. Each resource computes a partial sum for the same output, and the partial sums are then reduced into a single output value.
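
A minimal sketch of one such step follows, assuming the same conv1d workload; spatial_c_step is a hypothetical helper name, not Timeloop API, and the 16 partial products it serializes would be computed on 16 parallel instances in hardware.

def spatial_c_step(W, I, k, c1, r, p):
    """One temporal step of the C-partitioned mapping: 16 spatial instances
    each compute a partial product for the same output O[k][p] along a
    different channel c, and the partials are then reduced to one value."""
    partials = [W[k, c1 * 16 + c0, r] * I[c1 * 16 + c0, p + r]
                for c0 in range(16)]   # computed in parallel across Spatial-X
    return sum(partials)               # reduction of 16 partial sums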

Buffer Bypass and Utilization

Each memory level that the mapping tiles for is shown in the following format:

MemoryLevelName [ TensorName#0: Utilization for Dense Tensor (Utilization for Sparse Tensor) TensorName#1:...]

For example, the following memory description shows the bypass and utilization information at the RegisterFile level.

RegisterFile [ Weights:1 (1) Inputs:16 (16) Outputs:16 (16) ]
-------------------------------------------------------------

In this example, all three tensors (Weights, Inputs, and Outputs) are kept in the RegisterFile, and no tensor is bypassed. If a tensor bypasses a memory level, its name is not printed at that level. The corresponding maximum working-set sizes / buffer utilizations of the three tensors are 1, 16, and 16 words, respectively. Since we assume a dense workload, the sparse tensor utilization is the same as the dense tensor utilization. For more details about sparse workload specifications, please refer to the Sparseloop Tutorial.
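
As a rough check of these numbers, the RegisterFile working sets follow from the single loop below that level (P in [0:16)) with all other indices held fixed above it. This is a back-of-the-envelope sketch, not Timeloop's actual tile analysis:

P0 = 16                # bound of the RegisterFile-level loop
weights_ws = 1         # W[k][c][r]: all of k, c, r are fixed inside the P loop
inputs_ws  = P0        # I[c][p + r]: p sweeps 16 values while c and r are fixed
outputs_ws = P0        # O[k][p]:     p sweeps 16 values while k is fixed
print(weights_ws, inputs_ws, outputs_ws)   # 1 16 16, matching the printout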