1
Lecture: Eyeriss Dataflow
• Topics: Eyeriss architecture and dataflow (digital CNN
accelerator)
We had previously seen basic ANNs that used tiling/buffers/NFUs (DianNao), multiple chips to
distribute the work and eliminate memory accesses (DaDianNao), and a few examples that exploited
sparsity (Deep compression, EIE, Cnvlutin). Now, we’ll examine a spatial architecture that exploits
various dataflows to reduce energy overheads.
Spatial architectures get inputs from their neighbors, thus helping computations that exhibit
producer-consumer relationships and/or data reuse. The next slide shows the overall Eyeriss
architecture. The chip has a global buffer that holds a few hundred kilobytes of data, plus an
array of processing elements (PEs), connected by a mesh network.
Each PE has a single MAC unit and a register file (that can store a few hundred values). There may
also be FIFO input/output buffers within and between PEs, but we won’t discuss those today since
they aren’t essential.
2
Dataflow Optimizations
Data can be fetched from four different locations, each with a different energy cost: off-chip DRAM, the
global buffer, a neighboring PE's regfile, and the local regfile. The key is to find a dataflow that reduces
the overall energy cost of a large CNN computation.
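To make the cost hierarchy concrete, here is a small Python sketch of an energy model. The ratios are
roughly the normalized access energies reported in the Eyeriss paper (regfile 1x, neighboring PE 2x,
global buffer 6x, DRAM 200x); treat them as order-of-magnitude guides, and the names are mine.

# Rough relative energy per access, normalized to one local regfile access.
ENERGY = {
    "local_regfile": 1,     # cheapest: data already in the PE
    "neighbor_regfile": 2,  # one hop over the mesh
    "global_buffer": 6,
    "dram": 200,            # dominates if accessed often
}

def dataflow_energy(access_counts):
    """Total energy for a dict like {"dram": 100, "local_regfile": 50000}."""
    return sum(ENERGY[src] * n for src, n in access_counts.items())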
3
Overall Spatial Architecture
4
One Primitive
Let's first look at a single PE in Eyeriss, focusing on a single 1D convolution (i.e., the
computation required by one row of a 2D convolution). This is defined as one primitive, and one PE is
responsible for one primitive. Before the computation starts, the PE loads its register file with 1 row
of kernel weights (size R) and 1 row of an input feature map (size H). In the example above, a 3-entry
kernel is applied on a 5-entry row. 9 computations are performed sequentially to produce 3 output
pixels. The resulting outputs are only partial sums (the pink boxes labeled 1, 2, 3). This computation
has a lot of reuse (kernel and ifmap), and all of that reuse is exploited through the regfile.
5
One Primitive
Consider a larger example. Take a row of the input feature map with 64 elements. Consider one row
of the kernel with 3 elements. Once these 67 elements are in the regfile, we perform computations
that produce 62 partial sum outputs. This is done with 186 MAC operations. The register file may or
may not have to store these partial sums (depending on how bypassing is done).
This is called row-stationary, i.e., a row sits in a PE and performs all required computations. The
computations are done sequentially on a single MAC unit.
6
Dataflow for a 2D Convolution
Now let’s carefully go through the 2D convolution. Let’s discuss “where” first; we’ll later discuss
“when”. See the figures on the next slide to follow along.
If the 2D convolution involves a 3x3 kernel, we’ll partition it into 3 “primitives”, with each primitive
responsible for 1 row of the input feature map and 1 row of the kernel (btw, remember that we use
the terms kernel/filter/weights interchangeably). Take the first column of PEs. PE1,1 receives filter
row 1 and input feature map row 1, and produces partial sums for the first row of the output feature
map. Similarly, PE2,1 and PE3,1 deal with rows 2 and 3 of the kernel/ifmap. The partial sums for all
three PEs must be added to produce the first row of the ofmap (see the red dataflow).
Next, to produce the second row of the ofmap, we’ll move to the second column of PEs. To produce
the second row, the following rows must be convolved: ifmap-row2 and filter-row1; ifmap-row3 and
filter-row2; ifmap-row4 and filter-row3. To make sure the right rows collide at the second column of
PEs, the ifmap and filter rows follow the blue and green paths shown in the figure. And this
process continues.
To recap, inputs arrive at the first column. Each PE does a row’s worth of work (1 primitive). The
partial sums in the column are added to produce the first row of the ofmap. Then, the ifmap and
filter rows shift diagonally and laterally to the second column. The partial sums of the second column
are aggregated to produce the second row of the ofmap. If we are working with a 64x64 ifmap and a
3x3 filter, we’ll need a 62x3 array of PEs.
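The mapping is easy to state functionally. Below is a Python sketch of the "where" (my reconstruction
of the slide, not Eyeriss RTL): PE(i, j) runs one primitive on filter row i and ifmap row i+j, and each
column's partial sums are summed vertically to yield one ofmap row.

def conv2d_row_stationary(ifmap, filt):
    def prim(xrow, wrow):                   # the 1D primitive sketched earlier
        R = len(wrow)
        return [sum(xrow[o + r] * wrow[r] for r in range(R))
                for o in range(len(xrow) - R + 1)]

    R = len(filt)                           # 3 filter rows -> 3 PE rows
    n_cols = len(ifmap) - R + 1             # 62 ofmap rows -> 62 PE columns
    ofmap = []
    for j in range(n_cols):                 # PE column j produces ofmap row j
        # PE(i, j) holds filter row i and ifmap row i + j
        col = [prim(ifmap[i + j], filt[i]) for i in range(R)]
        ofmap.append([sum(v) for v in zip(*col)])   # vertical psum adds
    return ofmap

ifmap = [[1] * 64 for _ in range(64)]
out = conv2d_row_stationary(ifmap, [[1] * 3 for _ in range(3)])
print(len(out), len(out[0]), out[0][0])     # 62 62 9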
7
Dataflow for a 2D Convolution
The next slide also shows the required computations for our 2D convolution example. We’ll refer to
the first column and the last row of the PE array as “edge PEs”. To perform their computation, these
PEs must first receive 64 ifmap elements and 3 filter elements. The 64 ifmap elements must be
fetched from the global buffer. We next perform 186 MAC operations, each requiring regfile reads
and regfile writes (for partial sums). When the final result is produced, it is sent to the neighboring
PE. In non-edge PEs, the computations are similar. The key difference is that the 64 ifmap elements
are read from the neighboring PE, not from the global buffer. We see here that most reads are from
the local regfile or from a neighboring regfile; a few accesses are from the global buffer. We’ve thus
created a dataflow that again exploits locality and reuse. This is in addition to the reuse we’ve already
exploited within 1 primitive.
Note that eventually, we want to do a 4D computation (lots of ifmaps, lots of ofmaps, multiple
images). We did this by first doing a 1D computation efficiently in one PE. We then set up a grid of
PEs on a 2D chip to process a 2D convolution efficiently. Unfortunately, we can’t have a 3D chip, so
we have to stop with this 2D convolution primitive. To perform the 4D computation, we have to make
repeated calls to a 2D convolution primitive with the following nested loops:
for (images 1 to 20)
  for (ofmaps 1 to 8)
    for (ifmaps 1 to 4)
      perform 2D convolution
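Written out as runnable Python (a sketch under my naming; conv2d here stands in for the whole
PE-array computation we just described), the loop nest and the partial-sum aggregation across
ifmaps look like this:

import numpy as np

def conv2d(x, w):
    """Valid 2D convolution, standing in for one pass over the PE array."""
    R, S = w.shape
    out = np.zeros((x.shape[0] - R + 1, x.shape[1] - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + R, j:j + S] * w)
    return out

def conv4d(ifmaps, filters):
    """ifmaps: (N, C, H, W); filters: (M, C, R, S)."""
    N, C = ifmaps.shape[:2]
    M = filters.shape[0]
    ofmaps = []
    for n in range(N):                  # for (images 1 to 20)
        per_image = []
        for m in range(M):              #   for (ofmaps 1 to 8)
            acc = None
            for c in range(C):          #     for (ifmaps 1 to 4)
                psum = conv2d(ifmaps[n, c], filters[m, c])
                acc = psum if acc is None else acc + psum  # aggregate psums
            per_image.append(acc)
        ofmaps.append(per_image)
    return ofmaps

# Shrunken sizes so this runs quickly; the lecture's example is (20,4,64,64).
ofm = conv4d(np.ones((2, 4, 8, 8)), np.ones((3, 4, 3, 3)))
print(len(ofm), len(ofm[0]), ofm[0][0].shape)   # 2 3 (6, 6)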
8
Dataflow for a 2D Convolution
Given these nested loops, we can compute the total number of DRAM / global buffer / PE / regfile
accesses and compute energy. Note that the outputs of each 2D-conv are now partial sums and have
to be aggregated as we walk through these nested loops. The compiler has to try different kinds of
nested loops to figure out the best way to reduce energy.
The entire 4D computation has to thus be carefully decomposed into multiple 2D convolutions (and
each 2D convolution is itself decomposed into multiple 1D convolutions, 1 per PE). Further, the 62x3
PE grid that we just constructed has to be mapped onto the actual PE grid that exists on the chip
(e.g., the fabricated Eyeriss chip has a 12x14 grid).
Having addressed the “where”, let’s now talk about “when”. To simplify the discussion, I said that one
column of PEs is fully processed before we move to the next column. In reality, once an element of
the ifmap arrives at a PE, we can also pass it to the next PE column in the next cycle. So the second
column can start one cycle after the first column. We are thus pipelining the computations in all the
columns. Once we reach steady state, all the columns will be busy, thus achieving very high
parallelism.
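A back-of-the-envelope model of the "when" (my simplification, assuming one MAC per PE per cycle
and one-cycle hops between columns): column c starts one cycle after column c-1, so the pipeline
fill time is tiny compared to the work time.

def pipelined_cycles(num_cols=62, macs_per_primitive=186):
    fill = num_cols - 1                    # cycles until the last column starts
    return fill + macs_per_primitive       # ~247 cycles once pipelined

def sequential_cycles(num_cols=62, macs_per_primitive=186):
    return num_cols * macs_per_primitive   # ~11,532 if columns ran one at a time

print(pipelined_cycles(), sequential_cycles())   # 247 11532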
9
Row Stationary Dataflow for one 2D Convolution
Example: 4 64x64 inputs; 4x3x3 kernel wts; 8 62x62 outputs; 20 image batch
• Edge prim: (glb) 64 inp, 3 wts; (reg) 186 MACs → psums (124 rg/by) (62 → PE)
• Other prims: (PE) 64 inp, 3 wts; (reg) 186 MACs → psums (124 rg/by) (62 → PE)
• The edge primitive is executed ~64 times; the non-edge primitive ~122 times
• Eventually: 4K outputs to global buffer or DRAM
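Folding the slide's counts into the earlier energy model gives a rough per-convolution estimate. This
is my reading of the slide's shorthand ("rg/by" as regfile-or-bypass psum accesses, "62 → PE" as
psums handed to a neighbor), so treat the bookkeeping as illustrative.

ENERGY = {"regfile": 1, "neighbor": 2, "glb": 6}   # rough Eyeriss ratios

def primitive_cost(edge):
    fetch = (64 + 3) * (ENERGY["glb"] if edge else ENERGY["neighbor"])
    macs = 186 * 2 * ENERGY["regfile"]             # ~2 regfile ops per MAC
    psums = 124 * ENERGY["regfile"] + 62 * ENERGY["neighbor"]
    return fetch + macs + psums

# ~64 edge primitives and ~122 interior primitives per 2D convolution
total = 64 * primitive_cost(edge=True) + 122 * primitive_cost(edge=False)
print(total)   # energy for one 2D conv, in units of one regfile access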
10
Folding
• May have to fold a 2D convolution over a small physical
set of PEs
• Must eventually take the 4D convolution and fold it into
multiple 2D convolutions – the 2D convolution has to be done
C (input channels) x M (output channels) x N (image batch) times
• Can exploit global buffer reuse and register reuse depending
on which order you do this (note that you have to deal with
inputs, weights, and partial sums)
11
Weight Stationary
Example: 4 64x64 inputs; 4x3x3 kernel wts; 8 62x62 outputs; 20 image batch
• Weight Stationary: the weights stay in place
• Assume that we lay out 3x3 weights across 9 PEs
• Let the inputs stream over these – each weight has to be used 62x62 times
-- there is no easy way to move the pixels around to promote this reuse
There are other reasonable dataflows as well. In weight stationary, the 9 weights in a filter may
occupy a 3x3 grid of PEs. The ifmap has to flow through these PEs. Note that each element in the
ifmap has to combine with each of the weights, so there’s plenty of required data movement.
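As a functional sketch (my illustration of the idea, not the setup evaluated in the paper): each of
the 9 PEs pins one weight, and the loop order makes every ifmap pixel visit every weight. It computes
the same convolution as before; only the loop order, i.e., what stays put, changes.

def conv2d_weight_stationary(ifmap, filt):
    R, S = len(filt), len(filt[0])
    H, W = len(ifmap), len(ifmap[0])
    out = [[0] * (W - S + 1) for _ in range(H - R + 1)]
    for r in range(R):                    # PE(r, s) holds filt[r][s] forever
        for s in range(S):
            w = filt[r][s]
            for i in range(H - R + 1):    # the whole ifmap streams past it:
                for j in range(W - S + 1):  # each weight is used 62x62 times
                    out[i][j] += ifmap[i + r][j + s] * w
    return out

ifmap = [[1] * 64 for _ in range(64)]
print(conv2d_weight_stationary(ifmap, [[1] * 3] * 3)[0][0])   # 9, as before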
12
Output Stationary
Example: 4 64x64 inputs; 4x3x3 kernel wts; 8 62x62 outputs; 20 image batch
• Output Stationary: the output neuron stays in place
• Need to use all PEs to compute a subset of a 4D space of neurons at a
time – can move inputs around to promote reuse
In output stationary, each PE is responsible for producing one output value. So all the necessary
ifmap and filter values have to find their way to all necessary PEs.
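The corresponding sketch for output stationary (again my illustration): the accumulator for each
output pixel never moves, and all the ifmap and filter values it needs are routed to its PE.

def conv2d_output_stationary(ifmap, filt):
    R, S = len(filt), len(filt[0])
    H, W = len(ifmap), len(ifmap[0])
    out = []
    for i in range(H - R + 1):            # one PE per output pixel (i, j)
        row = []
        for j in range(W - S + 1):
            acc = 0                       # the accumulator stays in place
            for r in range(R):            # inputs and weights come to it
                for s in range(S):
                    acc += ifmap[i + r][j + s] * filt[r][s]
            row.append(acc)
        out.append(row)
    return out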
13
Terminology
14
Energy Estimates
• Most ops in convolutional layers
• Most energy in ALU and RF in convs; most energy in buffer in FC
• More storage = more delay
The first figure shows that the CONV layers account for 90%
of all energy. Most energy is in the RF and ALUs. The FC
layers have lower reuse and therefore see more energy in
the global buffer. The optimal design has a good balance of
storage and PEs.
15
Summary
• All about reduced energy and data movement; assume
PEs are busy most of the time (except edge effects)
• Reduced data movement → low energy and low area
(from fewer interconnects)
• While Row Stationary is best, a detailed design space
exploration is needed to identify the best traversal through the 4D array
• It’s not always about reducing DRAM accesses; even
global buffer accesses must be reduced
• More PEs allow for better data reuse, so adding PEs is not terrible
even if it means a smaller global buffer
• Convs are 90% of all ops and growing
• The paper's best designs use 256 PEs, a 0.5KB regfile/PE, and a
128KB global buffer; the regfile is split filter/psum/act = 224/24/12
16
WAX
Recent work makes the argument that the storage hierarchy in Eyeriss is sub-optimal.
Accessing a 0.5KB register file is expensive, as is accessing a 108KB global buffer. A more
efficient hierarchy (as shown here) is to have a handful of registers per PE and a local
buffer of size 8KB. This forms one tile; if more data is required, it is fetched from nearby
tiles. This is also an example of near-data processing where an array of 32 MACs is placed
next to an 8KB buffer (subarray). A row of 32 values read from the buffer gets placed in
either a W register (weights) or A register (activations). These registers feed the 32 MACs
in parallel in one cycle. In the next cycle, the A register performs a shift operation. This
enables high reuse while performing a convolution operation. The WAX paper constructs
a number of dataflows to reduce the overall energy required by a deep network.
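Here is a sketch of that compute step in Python (my reconstruction from this summary, not the
paper's design; I also gloss over whether a filter tap is broadcast or stored per lane by simply
replicating one weight across all 32 lanes):

def wax_tile_step(W, A, acc):
    """One cycle: 32 parallel MACs, then shift A by one for the next tap."""
    acc = [acc[i] + W[i] * A[i] for i in range(32)]   # 32 MACs in one cycle
    A = A[1:] + [0]                                   # A register shifts
    return A, acc

A = list(range(32))             # 32 activations read from the 8KB buffer
acc = [0] * 32
for w in [1, 2, 3]:             # three taps of a 1D convolution
    A, acc = wax_tile_step([w] * 32, A, acc)
print(acc[0])                   # = 1*a[0] + 2*a[1] + 3*a[2] = 8
# (the last two lanes are edge cases, since zeros shift in)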
17
References
• “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks,” Y.-H. Chen et al., ISCA 2016
• “Wire-Aware Architecture and Dataflow for CNN Accelerators,”
S. Gudaparthi et al., MICRO 2019