This part of the stats file reports the architecture specifications and the corresponding performance and energy breakdown for each level of the hardware hierarchy.
The following section explains the statistics for the mapping of Timeloop exercise 04-model-conv1d-oc-3levelspatial using the C-partitioned mapping conv1d+oc+ic-3levelspatial-cp-ws.map.yaml.
In this problem, P is the output tensor width, R is the weight filter width, C is the input channel size, and K is the output channel size.
MainMemory [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
---------------------------------------------------------------------
| for P in [0:1)
GlobalBuffer [ Weights:3072 (3072) Inputs:576 (576) Outputs:512 (512) ]
-----------------------------------------------------------------------
| for C in [0:2)
| for K in [0:32)
| for R in [0:3)
| for C in [0:16) (Spatial-X)
RegisterFile [ Weights:1 (1) Inputs:16 (16) Outputs:16 (16) ]
-------------------------------------------------------------
| for P in [0:16)
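As a quick sanity check, the problem dimensions can be recovered from the loop bounds above (multiplying the temporal and spatial bounds of each dimension across all levels) and used to reproduce the tile sizes printed at MainMemory. The sketch below assumes a stride-1, unpadded conv1d, so the input width is P + R - 1:

# Sketch: derive the problem dimensions from the loop bounds above and
# check them against the tile sizes reported at MainMemory.
P = 1 * 16        # MainMemory P loop x RegisterFile P loop
R = 3             # GlobalBuffer R loop
C = 2 * 16        # GlobalBuffer temporal C loop x spatial C loop
K = 32            # GlobalBuffer K loop

weights = R * C * K          # 3 * 32 * 32 = 3072
inputs  = C * (P + R - 1)    # 32 * 18     = 576 (stride-1, no padding)
outputs = K * P              # 32 * 16     = 512

print(weights, inputs, outputs)   # 3072 576 512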
Level 0
-------
=== MACC ===
SPECS
-----
Word bits : 8
Instances : 16 (16*1) // (mesh-X*mesh-Y)
Compute energy : 0.56 pJ // energy per compute
STATS
-----
Utilized instances : 16
Cycles : 3072
Algorithmic Computes (total) : 49152
Actual Computes (total) : 49152
Gated Computes (total) : 0
Skipped Computes (total) : 0
Energy (total) : 27564.44 pJ
Area (total) : 5316.00 um^2
There are 16 8-bit MACC instances operating at 0.56 pJ per compute in the specified architecture, and all 16 of them are utilized by the example mapping. The cycle count is the product of the temporal loop bounds in the mapping (2*32*3*16). The algorithmic compute count for the conv1d is P*R*C*K (16*3*32*32). This example assumes dense input and weight tensors, so there are no gated or skipped computes. The total energy is the energy per compute multiplied by the number of computes (0.56*49152), and the total area is the area per MACC instance multiplied by the number of MACC instances. If the Accelergy plugin is enabled, the energy per action and area per instance used here are reported in the generated .ERT and .ART files.
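The same arithmetic can be written out directly. The sketch below reuses the values printed in the SPECS and STATS above; the small gap between 0.56*49152 and the reported 27564.44 pJ comes from the per-compute energy being rounded to two decimals in the printout, and the per-instance area is simply derived from the reported total.

# Sketch of the MACC-level arithmetic described above.
P, R, C, K = 16, 3, 32, 32
instances = 16
energy_per_compute_pJ = 0.56

cycles = 2 * 32 * 3 * 16                 # product of the temporal loop bounds = 3072
computes = P * R * C * K                 # algorithmic MACCs = 49152
total_energy_pJ = energy_per_compute_pJ * computes    # ~27.5 nJ
area_per_instance_um2 = 5316.00 / instances           # derived from the reported total
total_area_um2 = area_per_instance_um2 * instances    # 5316.00 um^2

print(cycles, computes, round(total_energy_pJ, 2), total_area_um2)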
Level 1
-------
=== RegisterFile ===
...
Level 2
-------
=== GlobalBuffer ===
SPECS
-----
Technology : SRAM
Data storage size : 262144 // Bytes
Data word bits : 8
Data block size : 32
Metadata storage width(bits) : 0 // metadata used for sparse data formats
Metadata storage depth : -
Cluster size : 1
Instances : 1 (1*1)
Shared bandwidth : -
Read bandwidth : -
Write bandwidth : -
Multiple buffering : 1.00
Effective data storage size : 262144
Min utilization : 0.00
Vector read energy : 221.43 pJ
Vector write energy : 221.43 pJ
Vector metadata read energy : 0.50 pJ
Vector metadata write energy : 0.50 pJ
(De)compression energy : 0.00 pJ
Area : 679274.00 um^2
Multiple buffer levels can be specified in the architecture file. The GlobalBuffer level in the example above is a 256 KB SRAM. The read/write/shared bandwidths are not specified, meaning the buffer is assumed to supply data at whatever rate the computation demands. The vector read/write energy is a function of the memory width, depth, cluster size, etc.; it can also be set directly using the vector-access-energy attribute of the buffer in the architecture specification file.
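As a rough illustration, assuming the per-scalar access energy reported in the STATS below is obtained by dividing the vector access energy by the data block size:

# Sketch: relate the vector access energy in the SPECS above to the
# per-scalar access energy in the STATS below (an assumption about the model).
vector_read_energy_pJ = 221.43
block_size = 32                        # words per vector access

energy_per_scalar_access_pJ = vector_read_energy_pJ / block_size
print(round(energy_per_scalar_access_pJ, 2))   # ~6.92 pJ, matching the STATS below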
MAPPING
-------
Loop nest:
for C in [0:2)
for K in [0:32)
for R in [0:3)
for C in [0:16) (Spatial-X)
The mapping section reflects the access pattern of the data stored in the memory level of interest. In this example, the data is partitioned along the C dimension across the spatial instances below this level, and the fanout from this memory level to the next one is 16.
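The fanout is simply the product of the spatial loop bounds at this level, as the short sketch below illustrates:

# Sketch: fanout from GlobalBuffer to the RegisterFile level is the product
# of the spatial loop bounds at this level (only the C spatial loop here).
spatial_bounds = [16]      # for C in [0:16) (Spatial-X)

fanout = 1
for bound in spatial_bounds:
    fanout *= bound
print(fanout)              # 16, matching the Fanout reported in Network 0 below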
STATS
-----
Cycles : 3072
Bandwidth throttling : 1.00
Weights:
Partition size : 3072
Tile density distribution : fixed-structured
Data tile shape : 3072
Max utilized data storage capacity : 3072
Metadata format : none
Max utilized metadata storage capacity
Utilized instances (max) : 1
Utilized clusters (max) : 1
Algorithmic scalar reads (per-instance) : 3072
Actual scalar reads (per-instance) : 3072
Gated scalar reads (per-instance) : 0
Skipped scalar reads (per-instance) : 0
Algorithmic scalar fills (per-instance) : 3072
Actual scalar fills (per-instance) : 3072
Gated scalar fills (per-instance) : 0
Skipped scalar fills (per-instance) : 0
Algorithmic scalar updates (per-instance) : 0
Actual scalar updates (per-instance) : 0
Gated scalar updates (per-instance) : 0
Skipped scalar updates (per-instance) : 0
Scalar decompression counts (per-cluster) : 0
Scalar compression counts (per-cluster) : 0
Temporal reductions (per-instance) : 0
Address generations (per-cluster) : 6144
Energy (per-scalar-access) : 6.92 pJ
Energy (per-instance) : 42514.01 pJ
Energy (total) : 42514.01 pJ
Temporal Reduction Energy (per-instance) : 0.00 pJ
Temporal Reduction Energy (total) : 0.00 pJ
Address Generation Energy (per-cluster) : 0.00 pJ
Address Generation Energy (total) : 0.00 pJ
Read Bandwidth (per-instance) : 1.00 words/cycle
Breakdown (Data, Format): (100.00%, 0.00%)
...
The statistics section reports the per-tensor breakdown of accesses and energy for the data kept at each memory level.
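For example, the Weights numbers above fit together as sketched below: the per-instance energy is roughly the number of actual scalar accesses (reads + fills + updates) times the per-scalar access energy, with the small difference from the reported 42514.01 pJ coming from rounding of the printed per-scalar value.

# Sketch tying the Weights entry at GlobalBuffer together.
reads, fills, updates = 3072, 3072, 0
energy_per_scalar_access_pJ = 221.43 / 32      # ~6.92 pJ

accesses = reads + fills + updates             # 6144, also the address generation count
energy_per_instance_pJ = accesses * energy_per_scalar_access_pJ
print(round(energy_per_instance_pJ, 2))        # ~42514.56 pJ vs. 42514.01 pJ reported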
Network 0
---------
GlobalBuffer <==> RegisterFile
SPECS
-----
Type : Legacy
Legacy sub-type :
ConnectionType : 3
Word bits : 8
Router energy : - pJ
Wire energy : - pJ/b/mm
STATS
-----
Weights:
Fanout : 16
Fanout (distributed) : 0
Multicast factor : 1
Ingresses : 3072
@multicast 1: 3072
Link transfers : 0
Spatial reductions : 0
Average number of hops : 2.00
Energy (per-hop) : 0.00 fJ
Energy (per-instance) : 0.00 pJ
Energy (total) : 0.00 pJ
Link transfer energy (per-instance) : 0.00 pJ
Link transfer energy (total) : 0.00 pJ
Spatial Reduction Energy (per-instance) : 0.00 pJ
Spatial Reduction Energy (total) : 0.00 pJ
Inputs:
Fanout : 16
Fanout (distributed) : 0
Multicast factor : 1
Ingresses : 18432
@multicast 1: 18432
Link transfers : 0
Spatial reductions : 0
Average number of hops : 2.00
Energy (per-hop) : 0.00 fJ
Energy (per-instance) : 0.00 pJ
Energy (total) : 0.00 pJ
Link transfer energy (per-instance) : 0.00 pJ
Link transfer energy (total) : 0.00 pJ
Spatial Reduction Energy (per-instance) : 0.00 pJ
Spatial Reduction Energy (total) : 0.00 pJ
Outputs:
Fanout : 16
Fanout (distributed) : 0
Multicast factor : 16
Ingresses : 1024
@multicast 16: 1024
Link transfers : 0
Spatial reductions : 15360
Average number of hops : 15.50
Energy (per-hop) : 0.00 fJ
Energy (per-instance) : 0.00 pJ
Energy (total) : 0.00 pJ
Link transfer energy (per-instance) : 0.00 pJ
Link transfer energy (total) : 0.00 pJ
Spatial Reduction Energy (per-instance) : 0.00 pJ
Spatial Reduction Energy (total) : 0.00 pJ
The network statistics report the communication costs of transferring each data tensor between adjacent memory levels.
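As an illustration, the Outputs entry above is consistent with each output ingress being accumulated from all 16 spatially partitioned partial sums, i.e. spatial reductions = ingresses * (multicast factor - 1):

# Sketch of the Outputs entry in Network 0.
ingresses = 1024               # output partials crossing this network
multicast_factor = 16          # all 16 RegisterFile instances share each output

spatial_reductions = ingresses * (multicast_factor - 1)
print(spatial_reductions)      # 15360, matching the STATS above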