Stats

Buffer and Arithmetic Levels

This part of the stats file reports the architecture specification and the corresponding performance breakdown for each level of the hardware hierarchy.

The following sections explain the statistics for the mapping of Timeloop exercise 04-model-conv1d-oc-3levelspatial using the C-partitioned mapping conv1d+oc+ic-3levelspatial-cp-ws.map.yaml. In this problem, P is the output tensor width, R is the weight filter width, C is the number of input channels, and K is the number of output channels.

MainMemory [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
---------------------------------------------------------------------
| for P in [0:1)

GlobalBuffer [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
-----------------------------------------------------------------------
|   for C in [0:2)
|     for K in [0:32)
|       for R in [0:3)
|         for C in [0:16) (Spatial-X)

RegisterFile [ Weights:1 (1) Inputs:16 (16) Outputs:16 (16) ]
-------------------------------------------------------------
|           for P in [0:16)

Arithmetic Level

Level 0
-------
=== MACC ===

    SPECS
    -----
    Word bits             : 8
    Instances             : 16 (16*1) // (mesh-X*mesh-Y)
    Compute energy        : 0.56 pJ // energy per compute 

    STATS
    -----
    Utilized instances           : 16
    Cycles                       : 3072
    Algorithmic Computes (total) : 49152
    Actual Computes (total)      : 49152
    Gated Computes (total)       : 0
    Skipped Computes (total)     : 0
    Energy (total)               : 27564.44 pJ
    Area (total)                 : 5316.00 um^2

The specified architecture has 16 8-bit MACC instances operating at 0.56 pJ per compute, and all 16 are utilized by the example mapping. The cycle count is the product of the temporal loop bounds in the mapping (2*32*3*16 = 3072). The algorithmic compute count for the 1D convolution is P*R*C*K (16*3*32*32 = 49152). This example assumes dense input and weight tensors, so there are no gated or skipped computes. The total energy is the energy per compute multiplied by the number of computes (approximately 0.56*49152; the 0.56 pJ shown is a rounded value, which presumably accounts for the small difference from the reported 27564.44 pJ).
The total area is the area per MACC instance multiplied by the number of MACC instances. If the Accelergy plugin is enabled, the energy per operation and the area per instance are reported in the .ERT and .ART files, respectively.
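
As a quick check of the arithmetic above, the following sketch recomputes the cycle count, the compute count, and an approximate energy from the loop bounds of the example mapping. The variable names are illustrative and not part of Timeloop, and the energy uses the rounded 0.56 pJ figure, so it only approximates the reported total.

```python
from math import prod

# Temporal loop bounds copied from the example mapping, outermost first:
# P (MainMemory), C, K, R (GlobalBuffer), P (RegisterFile).
temporal_bounds = [1, 2, 32, 3, 16]
# Spatial loop bounds: C in [0:16) (Spatial-X).
spatial_bounds = [16]

# One temporal iteration takes one cycle on every utilized MACC.
cycles = prod(temporal_bounds)
computes = cycles * prod(spatial_bounds)

# Cross-check against the problem shape: P * R * C * K.
P, R, C, K = 16, 3, 32, 32
algorithmic_computes = P * R * C * K

# Rounded energy per compute as printed in the SPECS block (assumption:
# the stats file uses a more precise value internally).
energy_per_compute_pj = 0.56
approx_energy_pj = energy_per_compute_pj * computes

print(cycles, computes, algorithmic_computes, round(approx_energy_pj, 2))
```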

Buffer Levels

Level 1
-------
=== RegisterFile ===
...  

Level 2
-------
=== GlobalBuffer ===

    SPECS
    -----
        Technology                   : SRAM
        Data storage size            : 262144 // Bytes 
        Data word bits               : 8
        Data block size              : 32
        Metadata storage width(bits) : 0 // metadata used for sparse data formats 
        Metadata storage depth       : -
        Cluster size                 : 1
        Instances                    : 1 (1*1)
        Shared bandwidth             : -
        Read bandwidth               : -
        Write bandwidth              : -
        Multiple buffering           : 1.00
        Effective data storage size  : 262144
        Min utilization              : 0.00
        Vector read energy           : 221.43 pJ
        Vector write energy          : 221.43 pJ
        Vector metadata read energy  : 0.50 pJ
        Vector metadata write energy : 0.50 pJ
        (De)compression energy       : 0.00 pJ
        Area                         : 679274.00 um^2

Multiple buffer levels can be specified in the arch file. The GlobalBuffer level in the example above is a 256KB SRAM. The read, write, and shared bandwidths are not specified, so the buffer is assumed to supply data at the rate the computation requires. Vector read/write energy is a function of memory width, depth, cluster size, etc. It can also be set directly using the vector-access-energy attribute of the buffer in the architecture specification file.
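
The sketch below relates a few of the SPECS values to each other. It assumes that one vector access moves a full block of 32 words, so that the per-word cost is the vector energy divided by the block size; this assumption is consistent with the "Energy (per-scalar-access)" value of 6.92 pJ in the STATS block, but it is an inference from the numbers rather than a statement of Timeloop's internals.

```python
# Values copied from the GlobalBuffer SPECS block.
storage_bytes = 262144
word_bits = 8
block_size_words = 32
vector_read_energy_pj = 221.43

# 262144 bytes is 256 KB (using 1 KB = 1024 bytes).
storage_kb = storage_bytes / 1024

# Capacity expressed in data words.
capacity_words = storage_bytes * 8 // word_bits

# Assumed amortization: one vector access serves block_size_words words.
energy_per_word_pj = vector_read_energy_pj / block_size_words

print(storage_kb, capacity_words, round(energy_per_word_pj, 2))
```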

    MAPPING
    -------
    Loop nest:
      for C in [0:2)
        for K in [0:32)
          for R in [0:3)
            for C in [0:16) (Spatial-X)

The mapping section reflects the access pattern of data stored in the memory level of interest. In this example, the data is partitioned along the C dimension and the fanout is 16 from this memory level to the next memory level.
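
To illustrate how the fanout follows from the loop nest, here is a minimal sketch that parses the printed loops and multiplies the bounds of the spatial ones. The regular expression and helper names are hypothetical conveniences for reading this text format and are not a Timeloop API.

```python
import re
from math import prod

# Loop nest copied from the GlobalBuffer MAPPING block.
loop_nest = """
for C in [0:2)
  for K in [0:32)
    for R in [0:3)
      for C in [0:16) (Spatial-X)
"""

# Matches "for <dim> in [<lo>:<hi>)" with an optional spatial tag.
pattern = re.compile(r"for (\w+) in \[(\d+):(\d+)\)\s*(\(Spatial-[XY]\))?")

temporal, spatial = [], []
for line in loop_nest.strip().splitlines():
    m = pattern.search(line)
    if m is None:
        continue
    dim, lo, hi, tag = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)
    (spatial if tag else temporal).append((dim, hi - lo))

# Fanout to the next level is the product of the spatial loop bounds.
fanout = prod(bound for _, bound in spatial)
# Temporal iterations executed at this level.
temporal_iterations = prod(bound for _, bound in temporal)

print(fanout, temporal_iterations)
```

The fanout of 16 obtained here matches the Fanout field reported for the GlobalBuffer <==> RegisterFile network.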

    STATS
    -----
    Cycles               : 3072
    Bandwidth throttling : 1.00
    Weights:
        Partition size                                              : 3072
        Tile density distribution                                   : fixed-structured
        Data tile shape                                             : 3072
        Max utilized data storage capacity                          : 3072
        Metadata format                                             : none
        Max utilized metadata storage capacity
        Utilized instances (max)                                    : 1
        Utilized clusters (max)                                     : 1
        Algorithmic scalar reads (per-instance)                     : 3072
        Actual scalar reads (per-instance)                          : 3072
        Gated scalar reads (per-instance)                           : 0
        Skipped scalar reads (per-instance)                         : 0
        Algorithmic scalar fills (per-instance)                     : 3072
        Actual scalar fills (per-instance)                          : 3072
        Gated scalar fills (per-instance)                           : 0
        Skipped scalar fills (per-instance)                         : 0
        Algorithmic scalar updates (per-instance)                   : 0
        Actual scalar updates (per-instance)                        : 0
        Gated scalar updates (per-instance)                         : 0
        Skipped scalar updates (per-instance)                       : 0
        Scalar decompression counts (per-cluster)                   : 0
        Scalar compression counts (per-cluster)                     : 0
        Temporal reductions (per-instance)                          : 0
        Address generations (per-cluster)                           : 6144
        Energy (per-scalar-access)                                  : 6.92 pJ
        Energy (per-instance)                                       : 42514.01 pJ
        Energy (total)                                              : 42514.01 pJ
        Temporal Reduction Energy (per-instance)                    : 0.00 pJ
        Temporal Reduction Energy (total)                           : 0.00 pJ
        Address Generation Energy (per-cluster)                     : 0.00 pJ
        Address Generation Energy (total)                           : 0.00 pJ
        Read Bandwidth (per-instance)                               : 1.00 words/cycle
            Breakdown (Data, Format): (100.00%, 0.00%)
   ...

The statistics section reports, for each data tensor, the access counts and the resulting energy breakdown at the memory level where the tensor is kept.
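
The Weights figures above can be cross-checked with a short calculation. This sketch assumes that the per-instance energy is the sum of actual reads, fills, and updates multiplied by the per-scalar access energy, and that the per-scalar energy is the vector energy divided by the block size. Both are inferences from the reported numbers, which they reproduce to within rounding.

```python
# Values copied from the GlobalBuffer SPECS and STATS (Weights) blocks.
vector_access_energy_pj = 221.43   # rounded value as printed
block_size_words = 32
actual_reads = 3072
actual_fills = 3072
actual_updates = 0
reported_energy_pj = 42514.01

# Total scalar accesses; this also equals the reported address generations.
scalar_accesses = actual_reads + actual_fills + actual_updates

# Assumed model: energy = accesses * (vector energy / block size).
energy_per_scalar_pj = vector_access_energy_pj / block_size_words
estimated_energy_pj = scalar_accesses * energy_per_scalar_pj

print(scalar_accesses, round(estimated_energy_pj, 2))
```

The small gap between the estimate and the reported 42514.01 pJ is what one would expect from using the rounded 221.43 pJ vector energy.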

Network Levels

Network 0
---------
GlobalBuffer <==> RegisterFile

    SPECS
    -----
        Type            : Legacy
        Legacy sub-type :
        ConnectionType  : 3
        Word bits       : 8
        Router energy   : - pJ
        Wire energy     : - pJ/b/mm

    STATS
    -----
    Weights:
        Fanout                                  : 16
        Fanout (distributed)                    : 0
        Multicast factor                        : 1
        Ingresses                               : 3072
            @multicast 1: 3072
        Link transfers                          : 0
        Spatial reductions                      : 0
        Average number of hops                  : 2.00
        Energy (per-hop)                        : 0.00 fJ
        Energy (per-instance)                   : 0.00 pJ
        Energy (total)                          : 0.00 pJ
        Link transfer energy (per-instance)     : 0.00 pJ
        Link transfer energy (total)            : 0.00 pJ
        Spatial Reduction Energy (per-instance) : 0.00 pJ
        Spatial Reduction Energy (total)        : 0.00 pJ
    Inputs:
        Fanout                                  : 16
        Fanout (distributed)                    : 0
        Multicast factor                        : 1
        Ingresses                               : 18432
            @multicast 1: 18432
        Link transfers                          : 0
        Spatial reductions                      : 0
        Average number of hops                  : 2.00
        Energy (per-hop)                        : 0.00 fJ
        Energy (per-instance)                   : 0.00 pJ
        Energy (total)                          : 0.00 pJ
        Link transfer energy (per-instance)     : 0.00 pJ
        Link transfer energy (total)            : 0.00 pJ
        Spatial Reduction Energy (per-instance) : 0.00 pJ
        Spatial Reduction Energy (total)        : 0.00 pJ
    Outputs:
        Fanout                                  : 16
        Fanout (distributed)                    : 0
        Multicast factor                        : 16
        Ingresses                               : 1024
            @multicast 16: 1024
        Link transfers                          : 0
        Spatial reductions                      : 15360
        Average number of hops                  : 15.50
        Energy (per-hop)                        : 0.00 fJ
        Energy (per-instance)                   : 0.00 pJ
        Energy (total)                          : 0.00 pJ
        Link transfer energy (per-instance)     : 0.00 pJ
        Link transfer energy (total)            : 0.00 pJ
        Spatial Reduction Energy (per-instance) : 0.00 pJ
        Spatial Reduction Energy (total)        : 0.00 pJ

The network statistics report the communication cost of each data tensor between memory levels.
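
One relationship visible in these numbers is that the spatial reduction count equals the ingress count multiplied by (multicast factor - 1): combining 16 partial sums into one output takes 15 additions. The sketch below checks this against the three tensors. The formula is an observation that fits this example, not a documented Timeloop definition.

```python
# (multicast factor, ingresses, reported spatial reductions) per tensor,
# copied from the Network 0 STATS block.
network_stats = {
    "Weights": (1, 3072, 0),
    "Inputs": (1, 18432, 0),
    "Outputs": (16, 1024, 15360),
}

def expected_reductions(multicast: int, ingresses: int) -> int:
    """Assumed relation: each ingress merges `multicast` values into one."""
    return ingresses * (multicast - 1)

results = {
    name: expected_reductions(multicast, ingresses)
    for name, (multicast, ingresses, _) in network_stats.items()
}

for name, (_, _, reported) in network_stats.items():
    print(name, results[name], reported)
```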