The document discusses a lecture on hardware architectures for deep learning given by Joel Emer and Vivienne Sze at MIT. The goals of the lecture are to discuss popular DNN models, including transformers, and how to evaluate and compare DNN models and hardware. The lecture covers key metrics for evaluation like accuracy, throughput, latency, energy, and flexibility. Popular datasets for evaluation like MNIST, CIFAR-10, and ImageNet are also discussed.
A machine learning and data science pipeline for real companies (DataWorks Summit)
Comcast is one of the largest cable and telecommunications providers in the country built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks as well as both the private and public cloud across multiple vendors.
Within such an environment and given such challenges, how do we get AI, machine learning, and data science platforms built so our company can respond to the market, predict our customers’ needs and create new revenue generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, and Amazon, what are technologies, strategies, and toolkits you can employ to bring together disparate data sets and quickly get them into the hands of your data scientists and then into your own production systems for use by your customers and business partners?
We’ll explore our journey and evolution, look at the specific technologies and decisions that have gotten us to where we are today, and demo how our platform works.
Speaker
Ray Harrison, Comcast, Enterprise Architect
Prashant Khanolkar, Comcast, Principal Architect Big Data
RAMSES: Robust Analytic Models for Science at Extreme Scales (Ian Foster)
RAMSES: A new project in data-driven analytical modeling of distributed systems
RAMSES is a new DOE-funded project on the end-to-end analytical performance modeling of science workflows in extreme-scale science environments. It aims to link multiple threads of inquiry that have not, until now, been adequately connected: namely, first-principles performance modeling within individual sub-disciplines (e.g., networks, storage systems, applications), and data-driven methods for evaluating, calibrating, and synthesizing models of complex phenomena. What makes this fusion necessary is the drive to explain, predict, and optimize not just individual system components but complex end-to-end workflows. In this talk, I will introduce the goals of the project and some aspects of our technical approach.
Netflix - Pig with Lipstick by Jeff Magnusson (Hakka Labs)
In this talk Manager of Data Platform Architecture Jeff Magnusson from Netflix discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at Samsung R&D.
While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open sourced Lipstick solves this problem. Jeff emphasizes the architecture, implementation, and future of Lipstick, as well as various use cases around using Lipstick at Netflix (e.g. examples of using Lipstick to improve speed of development and efficiency of new and existing scripts).
Jeff manages the Data Platform Architecture group at Netflix where he is helping to build a service oriented architecture that enables easy access to large scale cloud based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida focusing on database system implementation.
Slides from the Big Data Gurus meetup at Samsung R&D, August 14, 2013
This presentation covers the high level architecture of the Netflix Data Platform with a deep dive into the architecture, implementation, use cases, and future of Lipstick (https://github.com/Netflix/Lipstick) - our open source tool for graphically analyzing and monitoring the execution of Apache Pig scripts.
Netflix uses Apache Pig to express many complex data manipulation and analytics workflows. While Pig provides a great level of abstraction between MapReduce and data flow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. To address this problem, we created (and open sourced) a tool named Lipstick that visualizes and monitors the progress and performance of Pig scripts.
Design and Implementation of Multiplier Using Kcm and Vedic Mathematics by Us... (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
This chapter discusses the various classifications attributed to parallel architectures. It also introduces related parallel programming models and presents how these models behave on parallel architectures, covering notions such as data parallelism, task parallelism, tightly and loosely coupled systems, UMA/NUMA, multicore computing, symmetric multiprocessing, distributed computing, cluster computing, and shared memory with and without threads.
2021 04-20 apache arrow and its impact on the database industry.pptx (Andrew Lamb)
The talk will motivate why Apache Arrow and related projects (e.g. DataFusion) are a good choice for implementing modern analytic database systems. It reviews the major components of most databases, explains where Apache Arrow fits in, and describes the additional integration benefits of using Arrow.
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ... (Safe Software)
University and college campuses are complex environments. A campus comprises many physical sub-systems, such as buildings, outdoor spaces, utilities, and transportation, which are maintained by several divisions using multiple IT tools and different formats. Campus-wide analytics requires bringing all these data elements and different formats (CAD, GIS, BIM) together to create a comprehensive common operating picture. In this presentation we will demonstrate how FME is a key technology for creating a campus-wide data warehouse.
Enabling Spatial Decision Support and Analytics on a Campus Scale with FME Te... (Safe Software)
University and college campuses are complex environments. A campus comprises many physical sub-systems, such as buildings, outdoor spaces, utilities, and transportation, which are maintained by several divisions using multiple IT tools and different formats. Campus-wide analytics requires bringing all these data elements and different formats (CAD, GIS, BIM) together to create a comprehensive common operating picture. In this presentation we will demonstrate that FME Technology is a key component in data automation and integration at campus scale.
Immunizing Image Classifiers Against Localized Adversary Attacks (gerogepatton)
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks (CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations. When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10 and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing accuracy improvements over previous techniques. The results indicate that the combination of the volumetric input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating adversarial training.
Final project report on grocery store management system..pdf (Kamal Acharya)
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner, especially if your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website that retails various grocery products. The project allows viewing the various products available, enables registered users to purchase desired products instantly using the Paytm and UPI payment processors (Instant Pay), and also lets them place orders using the Cash on Delivery (Pay Later) option. It provides easy access for administrators and managers to view orders placed using the Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of technologies must be studied and understood. These include multi-tiered architecture, server- and client-side scripting techniques, implementation technologies, programming languages (such as PHP, HTML, CSS, and JavaScript), and MySQL relational databases. The objective of this project is to develop a basic shopping-cart website and to understand the technologies used to develop such a website.
This document will discuss each of the underlying technologies used to create and implement an e-commerce website.
Hierarchical Digital Twin of a Naval Power System (Kerry Sado)
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Welcome to WIPAC Monthly, the magazine brought to you by the LinkedIn group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news, and to celebrate 13 years since the group was created, we have articles including:
A case study of the use of Advanced Process Control at the wastewater treatment works at Lleida in Spain
A look back at an article on smart wastewater networks, to see how the industry has measured up in the interim in its adoption of Digital Transformation in the Water Industry.
1. L04-1
Sze and Emer
6.5930/1
Hardware Architectures for Deep Learning
Popular DNN Models (Transformers),
DNN Evaluation and Training
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 15, 2023
2. L04-2
Goals of Today’s Lecture
• Popular DNN Models (cont’d)
– Transformers
• Evaluate and compare DNN models and hardware
– Key Metrics and Design Objectives
– Datasets and benchmarks
• Training
• Recommended Reading: Chapter 2 & 3
– https://doi.org/10.1007/978-3-031-01766-7
4. L04-4
Important Computational Properties
• Latency and energy affected by a variety of properties including
– Number of operations that need to be done sequentially
• How many operations can be done in parallel?
– Amount of data storage (memory capacity)
• How much data is needed to perform the compute?
– Number of memory accesses (memory bandwidth)
• How often does data change? How often can data be reused for multiple operations?
• More details on latency and energy in “Key Metrics and Design
Objectives” section of today’s lecture
5. L04-5
Computation Properties of CONV Layers
• Do operands change during inference?
– Inputs change (dynamic)
– Usually, weights do not change with input (static)
• There are exceptions (e.g., Squeeze-and-Excitation)
• Complexity based on shape of layer
– Number of MACs scales linearly with size of inputs
– Storage: number of weights (R x S x C x M), number of input activations (H x W x C), number of output activations (P x Q x M)
• Sparse connections – no connections outside of the spatial dimensions of the filter (R x S)
– Fewer weights reduces amount of data storage required
– Fewer MAC operations reduces number of operations
• Shared weights across different spatial locations (input)
– Data reuse reduces number of memory accesses
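These counts follow directly from the layer shape; a minimal sketch using the lecture's notation, with made-up layer dimensions:

```python
# Sketch: operation and storage counts for a CONV layer, using the
# lecture's shape notation (layer dimensions here are illustrative).
R, S = 3, 3        # filter height / width
C, M = 64, 128     # input channels, number of filters
H, W = 56, 56      # input feature map height / width
P, Q = 54, 54      # output feature map height / width (stride 1, no padding)
N = 1              # batch size

macs    = N * M * P * Q * R * S * C   # one MAC per filter tap per output
weights = R * S * C * M               # weight storage
inputs  = N * H * W * C               # input activation storage
outputs = N * P * Q * M               # output activation storage
print(macs, weights, inputs, outputs)
```

Note that the MAC count scales linearly with the input size (H, W), while weight storage is independent of it — which is why the same weights can be reused across spatial locations.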
6. L04-6
Computation Properties of CONV Layers
• Multiple forms of parallelism possible
– Across M (apply multiple filters at the same time – reuse input feature maps)
– Across N (filter multiple input feature maps at the same time – reuse filters)
– Across R, S, C (compute MACs at same time, but need to accumulate – reuse both input activations
and filter weights)
• Shape can change per layer (R, S, C, M, H, W) - Flexible hardware required!
[Figure: three forms of data reuse]
– Convolutional reuse (activations and weights): CONV layers only (sliding window)
– Fmap reuse (activations): CONV and FC layers
– Filter reuse (weights): CONV and FC layers (batch size > 1)
7. L04-7
Convolution versus Attention Mechanism
• Convolution
– Only models dependencies between spatial neighbors
– Uses a layer sparsely connected to spatial neighbors; no support for dependencies outside of the spatial dimensions of the filter (R x S)
– Fixed region of interest in input for given output
• Attention
– “Allows modeling of [global] dependencies without regard to their distance” [Vaswani,
NeurIPS 2017]
– However, fully connected layer too expensive; develop mechanism to bias “the allocation of
available computational resources towards the most informative components of a signal”
[Hu, CVPR 2018]
– Dynamically select region of interest in input for given output
8. L04-8
Transformers
• Built from Attention Mechanism [Vaswani,
NeurIPS 2017]
• Widely used for natural language processing
(e.g., GPT-3 [Brown, NeurIPS 2020]), since long
dependencies between words exist
• Recently used for processing other forms of data,
including
– audio (e.g., AST [Gong, Interspeech 2021])
– vision (e.g., ViT [Dosovitskiy, ICLR 2021])
Image Source: [Vaswani, NeurIPS 2017]
9. L04-9
Format of Input into Attention Mechanism
• Break input into chunks referred to as tokens
– For a sentence, each word is a token
– For an image, each patch of pixels is a token
– For audio, each patch of spectrogram is a token
• Support variable sized input by processing a
sequence of “tokens” one at a time
Image Source: https://towardsdatascience.com/why-are-there-so-many-tokenization-methods-for-transformers-a340e493b3a8
AST [Gong, Interspeech 2021]
ViT [Dosovitskiy, ICLR 2021]
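The patch-tokenization idea for images can be sketched in a few lines of numpy (the patch size and image shape below are illustrative, not from the lecture):

```python
import numpy as np

# Sketch: turning an image into a sequence of patch tokens, ViT-style.
img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
P = 16  # patch size -> (224 / 16)^2 = 196 tokens

h, w, c = img.shape
# Split height and width into (blocks, within-block) axes, then group
# the two block axes together so each patch becomes one row.
patches = img.reshape(h // P, P, w // P, P, c).swapaxes(1, 2)
tokens = patches.reshape(-1, P * P * c)  # one flattened patch per token
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` is then linearly embedded before entering the attention layers; the sequence length (196 here) is what the attention cost scales with.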
10. L04-10
Attention Mechanism
• Submit query (q) based on input token
• Assign attention weights for query (q)
against each key (k1, k2…km) based
on a function ⍺(q,ki) that captures their
dependencies
• Scale values (v1, v2…vm) associated
with each key (k1, k2…km) using
attention weights
• Output sum of scaled values
Image Source: https://d2l.ai/
(Note: q, ki, and vi are vectors)
Can be viewed as database of key, value pairs
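The database-lookup view above can be written out directly; a minimal sketch with random stand-in vectors, using scaled dot-product plus softmax as the scoring function ⍺ (sizes are illustrative):

```python
import numpy as np

# Sketch: attention as a soft database lookup for one query vector.
d, v, m = 4, 4, 3                 # key/query length, value length, #pairs
rng = np.random.default_rng(0)
q = rng.standard_normal(d)        # query derived from the input token
K = rng.standard_normal((m, d))   # keys k1..km
V = rng.standard_normal((m, v))   # values v1..vm

scores = K @ q / np.sqrt(d)                       # one score per key
weights = np.exp(scores) / np.exp(scores).sum()   # attention weights in (0, 1)
out = weights @ V                                 # sum of scaled values
print(out.shape)  # (4,)
```

Unlike a hard database lookup, every value contributes to the output, weighted by how well its key matches the query.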
11. L04-11
Attention Mechanism
The attention weights are typically computed as follows
1. Compute the dot product between the query (q) and key (ki) vectors, and scale by the square root of the vector length (√d), i.e., ⍺(q, ki) = q·ki / √d. This is referred to as the scaled dot-product attention scoring function.
2. Use softmax to scale the weights to be between 0 and 1
Source: https://d2l.ai/
Image: https://towardsdatascience.com/sigmoid-and-softmax-functions-in-5-minutes-f516c80ea1f9
12. L04-12
Attention Mechanism
• To process multiple tokens at a time within the same sequence, the vectors q, ki, and vi are combined to form matrices Q, K, and V
• Therefore, the scaled dot-product attention becomes softmax(QKᵀ/√d)V
• Key operation in the attention mechanism is matrix multiplication
Image Source: [Vaswani, NeurIPS 2017]
where
• n is the number of queries processed in parallel,
• d is the length of the queries and key vectors,
• v is the length of the values vector, and
• m is the number of key-value pairs in the database, such that Q is n×d, K is m×d, and V is m×v
Source: https://d2l.ai/
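A minimal sketch of the matrix form, with the shapes named on this slide (the sizes themselves are invented):

```python
import numpy as np

# Sketch: scaled dot-product attention in matrix form,
# softmax(Q K^T / sqrt(d)) V.
n, m, d, v = 5, 7, 8, 6           # illustrative sizes
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))   # n queries
K = rng.standard_normal((m, d))   # m keys
V = rng.standard_normal((m, v))   # m values

scores = Q @ K.T / np.sqrt(d)                  # (n, m) pairwise scores
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
Z = weights @ V                                # (n, v) outputs
print(Z.shape)  # (5, 6)
```

Both matrix multiplications here are exactly the operations a DNN accelerator must execute efficiently, which is why the key operation in attention is matrix multiplication.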
13. L04-13
Attention Mechanism
• Desirable to capture different behaviors
– e.g., shorter range versus longer range
• Allow for different transforms of Q, K, and V.
– This is referred to as multi-head attention, where h is the
number of heads.
• Transforms are performed with linear projections
– Three matrix multiplications (WQ, WK, WV) per head
• Outputs are concatenated and undergo a linear projection
– Another matrix multiplication (WO)
• The weights of these projections are learned
Source: https://d2l.ai/
[Vaswani, NeurIPS 2017]
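The multi-head computation can be sketched end to end; the random matrices below stand in for the learned projections WQ, WK, WV, and WO, and all sizes are illustrative:

```python
import numpy as np

# Sketch: multi-head self-attention — h independent projections plus
# attention, concatenated and projected by WO.
n, d_model, h = 5, 64, 4
d_k = d_model // h                # per-head query/key/value length
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))        # token embeddings (self-attention)
WQ = rng.standard_normal((h, d_model, d_k))  # stand-ins for learned weights
WK = rng.standard_normal((h, d_model, d_k))
WV = rng.standard_normal((h, d_model, d_k))
WO = rng.standard_normal((h * d_k, d_model))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]       # three matmuls per head
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
Z = np.concatenate(heads, axis=1) @ WO               # concat, then WO projection
print(Z.shape)  # (5, 64)
```

The per-head loop is embarrassingly parallel, which is one of the sources of parallelism noted later in the lecture.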
14. L04-14
Attention Mechanism
Example of compute per head, where X is an “embedding” of an input token.
This is referred to as “self-attention” since Q, K, and V are derived from X
Source: https://jalammar.github.io/illustrated-transformer/
15. L04-15
Attention Mechanism
Source: https://jalammar.github.io/illustrated-transformer/
Example of compute for multiple heads (Note: concatenate results Zi before multiplying with WO)
The output Z is processed by an FC layer to generate the input R to the next attention layer. X is only the input to the first layer; R is the input to subsequent attention layers.
16. L04-16
Computation Properties of Attention Mechanism
• Many matrix multiplications
– Three matrix multiplications for input projections per head for multi-head attention
– One matrix multiplication for output projections for multi-head attention
– Two matrix multiplications for computing scaled-dot product attention
– A total of five matrix multiplications per head plus one for output projection
• Parallel processing across matrix multiplications
– Across projections of Q, K, and V
– Across heads
• Sequential dependency between matrix multiplications
– Within attention: QKT then multiply by V
– Between projection and attention: Input projections → Attention → Output projection
17. L04-17
Computation Properties of Attention Mechanism
• Do operands change?
– Matrices Q, K, and V change with input (dynamic)
– Matrices WQ, WK, WV, and WO do not change with input (static)
• Reuse WQ, WK, WV, and WO across input tokens
• Complexity based on size of matrices and number of tokens
– Number of MACs scales quadratically with the number of input tokens
– Storage: Number of weights in WQ, WK, WV and WO (multiplied by number of heads),
Intermediate matrices (Q, K, V), Input token matrix X, Output token matrix Z
• Matrix multiplications can be different sizes (design choices: n, d, v, m, h, …)
- Flexible hardware required!
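The quadratic scaling can be made concrete with a back-of-the-envelope MAC count, assembled from the earlier five-matmuls-per-head accounting (three input projections and two attention matmuls per head, plus one output projection); all sizes below are illustrative:

```python
# Sketch: MAC counts for one multi-head self-attention layer.
# The n*n terms (Q K^T and weights*V) make the cost quadratic in the
# number of tokens n; the projections are only linear in n.
def attention_macs(n, d_model, h):
    d_k = d_model // h                      # per-head vector length
    proj_in = 3 * h * n * d_model * d_k     # WQ, WK, WV projections
    attn = 2 * h * n * n * d_k              # Q K^T, then (weights) V
    proj_out = n * (h * d_k) * d_model      # WO output projection
    return proj_in + attn + proj_out

for n in (128, 256, 512):
    print(n, attention_macs(n, d_model=512, h=8))
```

Doubling the token count doubles the projection cost but quadruples the attention cost, so long sequences are increasingly dominated by the n² terms.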
18. L04-18
Key Metrics and Design Objectives
How can we compare designs?
19. L04-19
GOPS/W or TOPS/W?
• GOPS = giga (109) operations per second
• TOPS = tera (1012) operations per second
• GOPS/Watt or TOPS/Watt commonly reported in hardware literature to show
efficiency of design
• However, it does not provide sufficient insight into hardware capabilities and
limitations (especially if based on peak throughput/performance)
Example: high TOPS per watt can be achieved with an inverter (ring oscillator)
20. L04-20
Key Metrics: Much more than OPS/W!
• Accuracy
– Quality of result
• Throughput
– Analytics on high volume data
– Real-time performance (e.g., video at 30 fps)
• Latency
– For interactive applications (e.g., autonomous navigation)
• Energy and Power
– Embedded devices have limited battery capacity
– Data centers have a power ceiling due to cooling cost
• Hardware Cost
– $$$
• Flexibility
– Range of DNN models and tasks
• Scalability
– Scaling of performance with amount of resources
[Figure: example tasks and platforms – MNIST and ImageNet (accuracy); computer vision and speech recognition (throughput); data center and embedded device (energy/power)] [Sze, CICC 2017]
21. L04-21
Evaluating Accuracy
• Important to measure accuracy when considering co-design of algorithm and
hardware
• Datasets help provide a way to evaluate and compare different DNN models
and training algorithms
• Not all accuracy is the same
– Must consider difficulty of task and dataset to get fair comparison
22. L04-22
Image Classification Datasets
• Image Classification/Recognition
– Given an entire image → Select 1 of N classes
– No localization (detection)
Image Source: Stanford cs231n
Datasets affect difficulty of task
23. L04-23
MNIST
LeNet in 1998 (0.95% error)
ICML 2013 (0.21% error)
http://yann.lecun.com/exdb/mnist/
Digit Classification
28x28 pixels (B&W)
10 Classes
60,000 Training
10,000 Testing
28. L04-28
Next Tasks: Localization and Detection
[Russakovsky, IJCV 2015]
29. L04-29
Effectiveness of More Data
Accuracy increases logarithmically with the amount of training data
Results from Google Internal Dataset
JFT-300M (300M images, 18291 categories)
Orders of magnitude larger than ImageNet
[Sun, ICCV 2017]
[Figure panels: Object Detection, Semantic Segmentation]
“Disclaimer – Large scale learning: We would like to highlight that the training regime, learning schedules and parameters used in this paper are based on our understanding of training ConvNets with 1M images. Searching the right set of hyper-parameters requires significant more effort: even training a JFT model for 4 epochs needed 2 months on 50 K-80 GPUs. Therefore, in some sense the quantitative performance reported in this paper underestimates the impact of data for all reported image volumes.”
Recently Introduced Datasets
• Google Open Images (~9M images)
– https://github.com/openimages/dataset
• YouTube-8M (8M videos)
– https://research.google.com/youtube8m/
• AudioSet (2M sound clips)
– https://research.google.com/audioset/index.html
Kaggle
A platform for predictive modeling competitions
Over 3,500 competition submissions per day
Over 2,000 datasets!
Starting in 2018, the ImageNet Challenge is hosted by Kaggle
https://www.kaggle.com/c/imagenet-object-localization-challenge
Key Design Objectives of DNN Processor
• Increase Throughput and Reduce Latency
– Reduce time per MAC
• Reduce critical path → increase clock frequency
• Reduce instruction overhead
– Avoid unnecessary MACs (save cycles)
– Increase number of processing elements (PE) → more MACs in parallel
• Increase area density of PE or area cost of system
– Increase PE utilization* → keep PEs busy
• Distribute workload to as many PEs as possible
• Balance the workload across PEs
• Sufficient memory bandwidth to deliver workload to PEs (reduce idle cycles)
• Low latency has an additional constraint of small batch size
*(100% = peak performance)
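As a rough illustration of how these factors combine, here is a back-of-the-envelope throughput model (a sketch, not from the lecture; the function name and numbers are hypothetical, loosely based on Eyeriss-like PE counts and clock rate):

```python
# Sketch: effective throughput of a DNN processor as the product of
# the number of PEs, clock frequency, and PE utilization.
def effective_throughput(num_pes, clock_hz, utilization):
    """Effective MACs/second; utilization = 1.0 means peak performance."""
    assert 0.0 <= utilization <= 1.0
    return num_pes * clock_hz * utilization

# Hypothetical accelerator: 168 PEs at 200 MHz
peak = effective_throughput(168, 200e6, 1.0)    # peak MACs/s
actual = effective_throughput(168, 200e6, 0.75) # idle cycles reduce this
```

The gap between `peak` and `actual` is exactly why the slide stresses workload distribution, load balance, and memory bandwidth: they all act through the utilization term.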
Key Design Objectives of DNN Processor
• Reduce Energy and Power Consumption
– Reduce data movement as it dominates energy consumption
• Exploit data reuse
– Reduce energy per MAC
• Reduce switching activity and/or capacitance
• Reduce instruction overhead
– Avoid unnecessary MACs
• Power consumption is limited by heat dissipation, which limits the maximum # of MACs in parallel (i.e., throughput)
Operation | Energy (pJ)
8b Add | 0.03
16b Add | 0.05
32b Add | 0.1
16b FP Add | 0.4
32b FP Add | 0.9
8b Multiply | 0.2
32b Multiply | 3.1
16b FP Multiply | 1.1
32b FP Multiply | 3.7
32b SRAM Read (8KB) | 5
32b DRAM Read | 640
(Relative energy cost spans roughly 1 to 10^4)
[Horowitz, ISSCC 2014]
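The per-operation costs above can be turned into a first-order energy model. The sketch below (assumptions: an 8b MAC is approximated as one 8b multiply plus one 32b accumulate; the operation counts are hypothetical) shows why data movement dominates:

```python
# Per-operation energy in picojoules, from [Horowitz, ISSCC 2014]
ENERGY_PJ = {
    "8b_mult": 0.2, "32b_add": 0.1,
    "32b_sram_read": 5.0, "32b_dram_read": 640.0,
}

def layer_energy_pj(num_macs, dram_reads, sram_reads):
    # Assumption: one 8b MAC ~ one 8b multiply + one 32b add (accumulate)
    mac = ENERGY_PJ["8b_mult"] + ENERGY_PJ["32b_add"]
    return (num_macs * mac
            + dram_reads * ENERGY_PJ["32b_dram_read"]
            + sram_reads * ENERGY_PJ["32b_sram_read"])

# 1M MACs of pure compute vs. just 10k DRAM reads (hypothetical counts)
compute = layer_energy_pj(1_000_000, 0, 0)
dram = layer_energy_pj(0, 10_000, 0)
```

Even with 100× fewer operations, the DRAM accesses cost far more energy than the compute, which is why reducing data movement is listed as the first objective.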
Key Design Objectives of DNN Processor
• Flexibility
– Reduce overhead of supporting flexibility
– Maintain efficiency across wide range of DNN models
• Different layer shapes impact the amount of
– Required storage and compute
– Available data reuse that can be exploited
• Different precision across layers & data types (weight, activation, partial sum)
• Different degrees of sparsity (number of zeros in weights or activations)
• Types of DNN layers and computation beyond MACs (e.g., activation functions)
Key Design Objectives of DNN Processor
• Scalability
– Increase how performance (i.e., throughput, latency, energy, power) scales with increase in amount of resources (e.g., number of PEs, amount of memory, etc.)
Specifications to Evaluate Metrics
• Accuracy
– Difficulty of dataset and/or task should be considered
– Difficult tasks typically require more complex DNN models
• Throughput
– Number of PEs with utilization (not just peak performance)
– Runtime for running specific DNN models
• Latency
– Batch size used in evaluation
• Energy and Power
– Power consumption for running specific DNN models
– Off-chip memory access (e.g., DRAM)
• Hardware Cost
– On-chip storage, # of PEs, chip area + process technology
• Flexibility
– Report performance across a wide range of DNN models
– Define range of DNN models that are efficiently supported
Evaluation Process
The evaluation process for whether a DNN system is a viable solution for a given application might go as follows:
1. Accuracy determines if it can perform the given task
2. Latency and throughput determine if it can run fast enough and in real-time
3. Energy and power consumption primarily dictate the form factor of the device where the processing can operate
4. Cost, which is primarily dictated by the chip area and external interfaces, determines how much one would pay for this solution
5. Flexibility determines the range of tasks it can support
Example: Metrics of Eyeriss Chip
Metric | Units | Input
Name of CNN Model | Text | AlexNet
Top-5 error (classification on ImageNet) | # | 19.8
Supported Layers | | All CONV
Bits per weight | # | 16
Bits per input activation | # | 16
Batch Size | # | 4
Runtime | ms | 115.3
Power | mW | 278
Off-chip Access per Image Inference | MBytes | 3.85
Number of Images Tested | # | 100

ASIC Specs | Input
Process Technology | 65nm LP TSMC (1.0V)
Total Core Area (mm^2) | 12.25
Total On-Chip Memory (kB) | 192
Number of Multipliers | 168
Clock Frequency (MHz) | 200
Core area (mm^2) / multiplier | 0.073
On-chip memory (kB) / multiplier | 1.14
Measured or Simulated | Measured
Comprehensive Coverage for Evaluation
• All metrics should be reported for a fair evaluation of design tradeoffs
• Examples of what can happen if a certain metric is omitted:
– Without the accuracy given for a specific dataset and task, one could run a simple DNN and claim low power, high throughput, and low cost – however, the processor might not be usable for a meaningful task
– Without reporting the off-chip bandwidth, one could build a processor with only multipliers and claim low cost, high throughput, high accuracy, and low chip power – however, when evaluating system power, the off-chip memory access would be substantial
• Are results measured or simulated? On what test data?
• Hardware should be evaluated on a wide range of DNNs
– No guarantee that a DNN algorithm designer will use a given DNN model or a given reduced-complexity approach. Need flexible hardware!
MLPerf: Workloads for Benchmarking
• A broad suite of DNN models to serve as a common set of benchmarks to measure the performance and enable fair comparison of various software frameworks, hardware accelerators, and cloud platforms for both training and inference of DNNs. (Edge compute in the works!)
• The suite includes a wide range of DNNs (e.g., CNN, RNN, etc.) for a variety of tasks, including image classification, object identification, translation, speech-to-text, recommendation, sentiment analysis, and reinforcement learning.
• Categories: cloud/edge; training/inference; closed/open
• Categories: cloud/edge; training/inference; closed/open
https://mlperf.org/ (first results in Dec 2018; 23 companies and 7 institutions)
Specifications of DNN Models
• Accuracy
– Define task and dataset
• Shape of DNN Model (“Network Architecture”)
– # of layers, filter size (R, S), # of channels (C), # of filters (M)
• # of Weights & Activations (storage capacity)
– Number of non-zero (NZ) weights and activations
• # of MACs (operations)
– Number of non-zero (NZ) MACs
Specifications of DNN Models
Metrics | AlexNet | VGG-16 | GoogLeNet (v1) | ResNet-50
Accuracy (top-5 error)* | 19.8 | 8.80 | 10.7 | 7.02
# of CONV Layers | 5 | 16 | 21 | 49
CONV: # of Weights | 2.3M | 14.7M | 6.0M | 23.5M
CONV: # of MACs | 666M | 15.3G | 1.43G | 3.86G
CONV: # of NZ MACs** | 394M | 7.3G | 806M | 1.5G
# of FC Layers | 3 | 3 | 1 | 1
FC: # of Weights | 58.6M | 124M | 1M | 2M
FC: # of MACs | 58.6M | 124M | 1M | 2M
FC: # of NZ MACs** | 14.4M | 17.7M | 639k | 1.8M
Total Weights | 61M | 138M | 7M | 25.5M
Total MACs | 724M | 15.5G | 1.43G | 3.9G
Total # of NZ MACs** | 409M | 7.3G | 806M | 1.5G

*Single-crop results: https://github.com/jcjohnson/cnn-benchmarks
**# of NZ MACs computed based on 50,000 validation images
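The per-layer counts behind such tables follow directly from the shape parameters (R, S = filter height/width, C = input channels, M = filters, P, Q = output height/width). A minimal sketch (function names are illustrative; bias terms are ignored; the example uses AlexNet CONV1's published shape):

```python
# Sketch: counting weights and MACs for a CONV layer from its shape
def conv_weights(R, S, C, M):
    # one R x S x C filter per output channel
    return R * S * C * M

def conv_macs(R, S, C, M, P, Q):
    # each of the M * P * Q output points needs R * S * C MACs
    return R * S * C * M * P * Q

# AlexNet CONV1: 11x11 filters, 3 input channels, 96 filters, 55x55 output
w = conv_weights(11, 11, 3, 96)       # 34,848 weights
m = conv_macs(11, 11, 3, 96, 55, 55)  # 105,415,200 MACs
```

Note that MACs grow with the output size while weights do not, which is why CONV layers dominate the MAC counts and FC layers dominate the weight counts in the table.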
Weights & MACs à Energy & Latency
• Warning: fewer weights and MACs (indirect metrics) do not necessarily result in lower energy consumption or latency (direct metrics). Other factors are also important, such as filter shape, batch size, and hardware mapping.
[Yang, CVPR 2017], [Chen, SysML, 2018], [Yang, ECCV 2018]
Example: AlexNet vs. SqueezeNet
[Chart: # of weights and normalized energy, AlexNet vs. SqueezeNet — SqueezeNet has 51.8× fewer weights but only 1.3× lower normalized energy. Source: www.movidius.com]

IS IT WORTH IT? Incremental Accuracy and the Power Efficiency Cost

[Chart: accuracy (%) vs. power efficiency (GFLOPS/W) for AlexNet (2012), VGG-16 (2013), GoogLeNet (2014), and SqueezeNet (2016) on mobile (XU4, TK1, TX1) and server (x86, K40) platforms. Notes: ImageNet, batch = 10/64, using active cooling. [Movidius, Hot Chips 2016]]

Energy breakdown (SqueezeNet v1.0, batch size = 48): output feature map 47%, input feature map 23%, weights 21%, computation 10%
DNN Processor Evaluation Tools
• Require a systematic way to
– Evaluate and compare a wide range of DNN processor designs
– Rapidly explore the design space
• Accelergy [Wu, ICCAD 2019]
– Early-stage energy estimation tool at the architecture level
• Estimates energy consumption based on architecture-level components (e.g., # of PEs, memory size, on-chip network)
– Evaluates architecture-level energy impact of emerging devices
• Plug-ins for different technologies
• Timeloop [Parashar, ISPASS 2019]
– DNN mapping tool
– Performance simulator → action counts
Open-source code available at:
http://accelergy.mit.edu
[Diagram: Timeloop (DNN mapping tool & performance simulator) takes an architecture description and the DNN model shape (workload) and produces action counts; Accelergy (energy estimator tool) combines the action counts with compound component descriptions and technology plug-ins to produce an energy estimate. Used in the labs and final project.]
Summary
• Evaluate hardware using the appropriate DNN model and dataset
– Difficult tasks typically require larger models
– Different datasets for different tasks
– Number of datasets growing at a rapid pace
• A comprehensive set of metrics should be considered when
comparing DNN hardware to fully understand design tradeoffs
Training vs. Inference
• Training: Determine weights
– Supervised
• Training set has inputs and outputs, i.e., labeled
– Unsupervised (Self-Supervised)
• Training set is unlabeled
– Semi-supervised
• Training set is partially labeled
– Reinforcement
• Output assessed via rewards and punishments
• Inference: Apply weights to determine output
Unsupervised Learning
Finds structure in unlabeled data
Reinforcement Learning
Given the state of the current environment, learn a policy that decides what action the agent should take next to maximize expected rewards. However, the rewards might not be available immediately after an action, but instead only after a series of actions.
Training versus Inference
[Diagram: large datasets feed training (determine weights); the resulting weights are then used for inference.]
Gradient Descent
• Goal: Determine set of weights to minimize loss
• Use gradient descent to incrementally update weights to reduce loss
– Compute derivative of loss relative to weights to indicate how to change weights (linear approximation of loss function)
[Figure: gradient descent on the loss surface L(w), stepping down toward the minimum L_min(w); the step size is set by the learning rate. Image: sebastianraschka.com]
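The update can be sketched in a few lines. A toy example (a hypothetical 1-D loss L(w) = (w − 3)², not from the lecture, whose gradient is dL/dw = 2(w − 3)):

```python
# Sketch: gradient descent on L(w) = (w - 3)^2
def grad(w):
    # analytic derivative of the toy loss
    return 2.0 * (w - 3.0)

def gd_step(w, lr):
    # move against the gradient, scaled by the learning rate
    return w - lr * grad(w)

w = 0.0
for _ in range(100):
    w = gd_step(w, lr=0.1)
# w converges toward the minimizer w = 3
```

With too large a learning rate (here, lr > 1.0) the same loop diverges, which is why the learning rate appears as an explicit knob in the figure.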
Computing Gradients for DNN
Recall the method to update weights during training: $w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}$, where $\alpha$ is the learning rate. An efficient way to compute the gradient (a.k.a. partial derivative) for a DNN is a process called back propagation.
Training DNN
[Diagram: forward propagation* takes the input to class scores; a loss function computes the loss; back propagation carries the gradient $\frac{\partial L}{\partial w_{ij}}$ backward through the network. *Inference also uses forward propagation.]
Back-Propagation of Weights (per Layer)
Determine how loss changes w.r.t. the weights. For a layer $y = Wx + b$ (i.e., $y_j = \sum_i w_{ij} x_i + b_j$), the chain rule gives

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial w_{ij}} = \frac{\partial L}{\partial y_j} x_i, \quad \text{since} \quad \frac{\partial y_j}{\partial w_{ij}} = x_i$$

Need to compute $\frac{\partial L}{\partial y_j}$ (obtained via backpropagation); the input activations $x_i$ come from forward propagation.
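For a whole layer, the weight-gradient formula above reduces to an outer product. A minimal NumPy sketch (shapes and values are made up; W is taken in the (output, input) convention, so `dW[j, i]` holds $\frac{\partial L}{\partial y_j} x_i$):

```python
import numpy as np

# Sketch: per-layer weight gradient for a fully connected layer y = W @ x + b
def weight_grad(dLdy, x):
    """dLdy: gradient w.r.t. outputs, shape (out,); x: inputs, shape (in,).
    Returns dL/dW with the same shape as W, i.e. (out, in)."""
    return np.outer(dLdy, x)

x = np.array([1.0, 2.0, 3.0])       # stored forward activations
dLdy = np.array([0.5, -1.0])        # gradient arriving from the next layer
dW = weight_grad(dLdy, x)           # dW[j, i] = dLdy[j] * x[i]
```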
Back-Propagation of Activations (per Layer)
Determine how loss changes w.r.t. the input activations:

$$\frac{\partial L}{\partial x_i} = \sum_j w_{ij} \frac{\partial L}{\partial y_j}$$

Similar in form to the computation used for inference: the gradients $\frac{\partial L}{\partial y_j}$ of one layer propagate backward through the weights to give the gradients $\frac{\partial L}{\partial x_i}$ at that layer's input.
Back Propagation Across All Layers
For layer $n$, $y^n_j = \sum_i w^n_{ij} x^n_i + b_j$, and the output of one layer is the input to the next: $x^{n+1}_i = y^n_i$.

Gradient w.r.t. weights:

$$\frac{\partial L}{\partial w^n_{ij}} = \frac{\partial L}{\partial y^n_j} \frac{\partial y^n_j}{\partial w^n_{ij}} = \frac{\partial L}{\partial y^n_j} x^n_i$$

Gradient w.r.t. activations:

$$\frac{\partial L}{\partial y^n_i} = \sum_j w^{n+1}_{ij} \frac{\partial L}{\partial y^{n+1}_j}$$

Note: need to store the activations from forward propagation!
Demo of CIFAR-10 CNN Training
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
References
• Chapter 2 & 3 in Book
– https://doi.org/10.1007/978-3-031-01766-7
• Other Works Cited in Lecture
– Russakovsky, Olga, et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision 115.3 (2015): 211–252.
– Sun, Chen, et al., “Revisiting unreasonable effectiveness of data in deep learning era,” arXiv preprint arXiv:1707.02968 (2017).
– Shrivastava, Ashish, et al., “Learning from simulated and unsupervised images through adversarial training,” arXiv preprint arXiv:1612.07828 (2016).
– T.-J. Yang et al., “NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications,” ECCV 2018.
– Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks,” SysML Conference 2018.
– T.-J. Yang et al., “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” CVPR 2017.
– Chen et al., Eyexam, https://arxiv.org/abs/1807.07928
– Williams et al., “Roofline: An insightful visual performance model for floating-point programs and multicore architectures,” CACM 2009.
– Wu et al., “Accelergy: An architecture-level energy estimation methodology for accelerator designs,” ICCAD 2019.
– Parashar et al., “Timeloop: A systematic approach to DNN accelerator evaluation,” ISPASS 2019.