1
Accelerating DL
inference with
(Open)CAPI and posit
numbers
Louis Ledoux
Marc Casas
Lyon – OpenPOWER Summit
2
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
3
Ideas behind posit numbers (1)
• Like IEEE-754 floating-point numbers, they are intended to represent real numbers
• The claim: to be a drop-in replacement for classical floating point
• More precision and dynamic range in many cases
• The main ideas:
• Fewer bits per datum by increasing the entropy carried by each bit
• Arithmetic driven by an application / use case
• Tackle design flaws of an old standard from the 80’s
• Repeatable / portable
• Laws of algebra (associativity, distributivity)
• Wasted patterns (NaNs)
• Et cetera...
• A standardized Kulisch-like accumulator : the Quire
• Fused Dot Product
[1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
[1]
4
posit representation
• Many configurations of posits
• denoted posit<N,ES> where N is the width of a posit and ES is exponent size
• allows choosing the most suitable configuration for a specific use case
• aka tapered precision
• Golomb-Rice exponent encoding
• Variable-length internal fields
• sign (1b)
• regime
• exponent (if any)
• fraction (if any)
[1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
[1]
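To make the effect of choosing N and ES concrete, here is a minimal sketch of the dynamic range of a posit<N,ES>. It relies on the definitions from the posit paper [1] (useed = 2^(2^ES), maxpos = useed^(N-2), minpos = 1/maxpos); the function names are illustrative and not part of the talk.

```python
from fractions import Fraction


def useed(es: int) -> int:
    """Regime scaling factor of posit<N,ES>: useed = 2^(2^ES)."""
    return 2 ** (2 ** es)


def maxpos(n: int, es: int) -> Fraction:
    """Largest finite positive posit<N,ES> value: useed^(N-2)."""
    return Fraction(useed(es)) ** (n - 2)


def minpos(n: int, es: int) -> Fraction:
    """Smallest positive posit<N,ES> value: the reciprocal of maxpos."""
    return 1 / maxpos(n, es)


if __name__ == "__main__":
    # Configurations chosen for illustration only.
    for n, es in [(4, 0), (8, 0), (16, 1)]:
        print(f"posit<{n},{es}>: minpos = {minpos(n, es)}, maxpos = {maxpos(n, es)}")
```

A larger ES widens the dynamic range at the cost of fraction bits, which is the trade-off the posit<N,ES> notation exposes.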
5
Ideas behind posit numbers (2)
• Fine tuning (tapering) the arithmetic leads to saving bits for:
• exponent(dynamic range)
• mantissa(precision)
• Example: a 32-bit posit could beat a 64-bit IEEE-754 floating-point number in a scientific application, reducing:
• amount of caches, memory, and storage
• needed throughput to and from processors
• Hardware resources in compute units (still being discussed) [1]
• Power consumption
[1] Yohann Uguen, Luc Forget, Florent de Dinechin. Evaluating the hardware cost of the posit number system. FPL 2019 - 29th International Conference on Field-Programmable Logic and Applications (FPL), Sep 2019, Barcelona, Spain. pp. 1-8. hal-02130912v4
[2] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
[2]
6
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
7
Communication dominates arithmetic
• Plethora of µArch sharing a common aspect:
• Non data/memory-centric, the computation is performed far away from the data itself
• As a consequence: energy waste + performance loss
• 62.7% of the total system energy is spent on data movement [1]
[1] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks", Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
[2] M. Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb 2014
[2]
[2]
8
Problems seen as motivations
• Reducing communication?
• in/near-memory computing
• Quantization
• Transprecision
• Study impacts of arithmetic on an overall System
• Our focus is AI
• The amount of data keeps growing, hence the interest in neural networks
• Posits are good candidates for neural networks
• Activation functions and the “golden zone” (1)
• Fast sigmoid (2)
• Large vector dot products (3)
• Less focus on the hardware cost of the compute units
“How does the denser representation of numbers offered by posits tackle the aforementioned problems?”
9
1° Activation functions and the “golden zone”
• IEEE-754
• Accuracy is roughly constant over a large dynamic range (about 80 orders of magnitude)
• Induces low entropy/bit
• Posits
• Pyramid shape, normal/Gaussian distribution
• Due to Golomb-Rice
• Match NN weights distribution
• Activation functions allow rescaling into the golden zone
• Improves the information amount carried per bit
• Entropy/bit
• More information/datum
• Less communication (PCI-e / DDR)
• Less storage, stay on chip
• Compression Ratio
• Shannon entropy
[1]
[1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
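One possible way to put a number on "entropy per bit" is to take the Shannon entropy of the code words a tensor actually uses and divide it by the word width. The sketch below assumes the data is available as a list of integer code words; `entropy_per_bit` is a hypothetical helper, not something defined in the talk.

```python
import math
from collections import Counter


def entropy_per_bit(codes, width):
    """Shannon entropy (in bits) of the observed code words, divided by the word width.

    A value of 1.0 means every bit of the word carries one full bit of information;
    values near 0 mean most bit patterns are wasted on this data.
    """
    counts = Counter(codes)
    total = len(codes)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / width


if __name__ == "__main__":
    # Toy data: all four 2-bit patterns used equally often vs. only two of them.
    print(entropy_per_bit([0, 1, 2, 3], width=2))
    print(entropy_per_bit([0, 0, 1, 1], width=2))
```

Under this measure, a number format whose value distribution matches the weight distribution uses more of its bit patterns and scores closer to 1.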
10
2° Fast sigmoid
• IEEE-754 has its bit-level tricks (0x5F3759DF [1])… but so do posits
• $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Popular activation function in neural networks
• Approximate sigmoid
• 1. flip the sign bit (MSB)
• 2. logical right shift by 2
• “Et voilà”: few gates/LUTs and ~1 clock
• Example
• a = 0 = Posit<8,0>(0b00000000)
• 1. tmp = 0b10000000
• 2. res = 0b00100000
• Posit<8,0>(0b00100000) = 1/2
[1] https://en.wikipedia.org/wiki/Fast_inverse_square_root
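The two steps above can be sketched in software as follows. This is only an illustration for posit<8,0> (the trick is described for ES = 0 in the posit paper); the small decoder is included so the result can be read as a real number, and both function names are assumptions made for this example.

```python
import math


def decode_posit8_es0(bits: int) -> float:
    """Decode a posit<8,0> bit pattern into a Python float."""
    bits &= 0xFF
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return math.inf            # the 10...0 pattern
    sign = bits >> 7
    if sign:
        bits = (-bits) & 0xFF      # negative posits are stored in two's complement
    body = bits & 0x7F             # the 7 bits after the sign
    first = (body >> 6) & 1
    run, i = 0, 6
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run  # regime value
    i -= 1                          # skip the bit that terminates the regime
    frac_bits = max(i + 1, 0)
    frac = body & ((1 << frac_bits) - 1)
    value = 2.0 ** k * (1 + frac / (1 << frac_bits))
    return -value if sign else value


def fast_sigmoid_posit8(bits: int) -> int:
    """Approximate sigmoid: flip the MSB, then logical right shift by 2."""
    return ((bits ^ 0x80) & 0xFF) >> 2


if __name__ == "__main__":
    for pattern in (0x00, 0x40, 0xC0):          # 0, +1 and -1 in posit<8,0>
        x = decode_posit8_es0(pattern)
        approx = decode_posit8_es0(fast_sigmoid_posit8(pattern))
        exact = 1.0 / (1.0 + math.exp(-x))
        print(f"x = {x:+.2f}  fast = {approx:.4f}  exact = {exact:.4f}")
```

The approximation is coarse, but it costs one inverter and some wiring, which is why it maps to so few gates/LUTs.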
11
3° Large vector dot products
• Most common computation in NNs is MAC [1]
• Neuron (basic node) ↔ MAC unit / large vector dot product
• The rounding error is also accumulated
• After each product and accumulation
• After each accumulation when FMA is used
• The hardware yields an inexact result
• The posit standard proposes an exact MAC unit: the QUIRE
• Postpone rounding: a single rounding at the end
• Well suited to very low precision (<8 bits)
[1] A Dynamic Approach to Accelerate Deep Learning
Training. Submitted to International conference on Learning
Representations. 2020. Under review.
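The effect of postponing rounding can be illustrated with a toy model. The sketch below does not implement posits or a real quire: `quantize` is a stand-in that rounds to a few significant bits, and the exact accumulator is emulated with Python's `Fraction`. All names and the input vectors are chosen for illustration.

```python
import math
from fractions import Fraction


def quantize(x: float, p: int = 4) -> float:
    """Round x to p significant bits (stand-in for a low-precision format)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2 ** p), e - p)


def mac_rounded(xs, ws, p: int = 4) -> float:
    """Dot product with rounding after every product and every accumulation."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = quantize(acc + quantize(x * w, p), p)
    return acc


def mac_exact(xs, ws, p: int = 4) -> float:
    """Dot product accumulated exactly (quire-like), rounded once at the end."""
    acc = Fraction(0)
    for x, w in zip(xs, ws):
        acc += Fraction(x) * Fraction(w)
    return quantize(float(acc), p)


# One large term followed by 32 small products of 1/32 each.
xs = [1.0] + [0.25] * 32
ws = [1.0] + [0.125] * 32

if __name__ == "__main__":
    print("rounded at every step:", mac_rounded(xs, ws))
    print("single final rounding:", mac_exact(xs, ws))
```

With per-step rounding each small product is absorbed by the large accumulator and lost, whereas the exact accumulator keeps all of them until the final rounding.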
12
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
13
Our complete system
• MareNostrum 4
• AC922
• 2xPOWER9 SMT4 Monza
• 512GiB DDR4-2666
• 2xAlpha-Data 9V3
• XCVU3P-2
• 788k FFs
• 394k LUTs
• 2280 DSPs
• 25.3Mb BRAM
• 90.0Mb URAM
• 8GiB DDR4-2400
• PCI Express® Gen3 x16 / Gen4 x8 or OpenCAPI
14
Accelerating the acceleration
15
Accelerating the acceleration: CAPI
16
Accelerating the acceleration: SNAP
17
Accelerating the acceleration: My helpers
18
Verification methodology
• 1 Testbench/module
• PSLSE
19
Multi Layer Perceptron (MLP)
20
Multi Layer Perceptron (MLP) Hardware
• SDFG
• Direct mapping
• EMAC unit ↔ Neuron
• 1 big pipeline
• AXI-Stream
• Configurable FC (fully connected) layer IP
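As a software reference for the structure listed above, an MLP can be modelled as a chain of fully connected layers, each layer being a row of neurons. The sketch uses ordinary Python floats, not posit arithmetic, and the function names and layer encoding are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """One neuron: a dot product (MAC loop) plus bias, followed by the activation."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def fc_layer(inputs, weight_rows, biases):
    """Fully connected layer: every neuron sees the same input vector."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def mlp(inputs, layers):
    """Feed the output of each layer into the next one, like a pipeline."""
    for weight_rows, biases in layers:
        inputs = fc_layer(inputs, weight_rows, biases)
    return inputs


if __name__ == "__main__":
    # Hypothetical 2-input network with a 2-neuron hidden layer and 1 output.
    layers = [
        ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
        ([[1.0, -1.0]], [0.0]),
    ]
    print(mlp([1.0, 2.0], layers))
```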
21
Neuron<N,ES> 1: QUIRE
$\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
22
Neuron<N,ES> 2: NO FMA / NO QUIRE
$\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
23
Neuron<N,ES> 2: NO FMA / NO QUIRE
$\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
24
Mult<N,ES>
25
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
26
Results: Resource utilization
Module              | LUTs | FFs
Neuron<8,0,NOQUIRE> | 336  | 126
Decoder<8,0>        | 27   | 0
Mult<8,0>           | 82   | 32
Adder<8,0>          | 133  | 57
Encoder+Round<8,0>  | 38   | 0

Module              | LUTs | FFs
Neuron<4,0,QUIRE>   | 116  | 72
Decoder<4,0>        | 3    | 0
Mult<4,0>           | 6    | 0
Quire19<4,0>        | 88   | 56
Encoder+Round<4,0>  | 7    | 0
• These 2 versions maintain > 90% accuracy
• The framework allows evaluating all combinations of neurons
• How many bits per datum can we save with an exact accumulator?
27
Saturate the PCI-e link: Data parallelism
• loopback @250MHz and 512b words
• ≈ 16 GB/s in theory
• ≈ 12.5 GB/s in real life
• But an MLP consumes only one N-bit word
• instantiate many of them
• Frames per second
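• Let $M$ be the number of NNs
• $\mathit{maxNN} = \frac{512}{N}$
• $\mathit{fps}(n, M) \approx \frac{M}{\mathit{maxNN}} \times \frac{12.5\ \text{GB/s}}{784 \times \frac{n}{8}\ \text{B/frame}}$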
• Why saturate the link?
• Show that, for a given throughput, the lower the precision, the higher the FPS.
• Use the full power enabled by CAPI+SNAP
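The frame-rate estimate above can be evaluated with a few lines of code. The sketch assumes 1 GB = 10^9 bytes and 784 inputs per frame, and it exposes the link bandwidth as a parameter because the deck quotes both ≈ 12.5 GB/s and ≈ 12 GB/s; the function names are illustrative.

```python
def max_nn(n_bits: int, word_bits: int = 512) -> int:
    """Number of MLP instances that fit side by side in one 512-bit word."""
    return word_bits // n_bits


def fps(n_bits: int, m: int, link_bytes_per_s: float = 12.5e9,
        inputs_per_frame: int = 784) -> float:
    """Frames per second for m parallel MLPs working on n_bits-wide data."""
    bytes_per_frame = inputs_per_frame * n_bits / 8
    link_share = m / max_nn(n_bits)          # fraction of the link actually used
    return link_share * link_bytes_per_s / bytes_per_frame


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        full = fps(n, max_nn(n), link_bytes_per_s=12e9)
        print(f"{n:2d}-bit data: up to {max_nn(n):3d} MLPs, "
              f"about {full / 1e6:.1f} million frames/s on a saturated 12 GB/s link")
```

With a fixed number of MLPs the frame rate does not depend on the data width, only the share of the link in use does; the gain from narrower data appears once the link is saturated.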
28
Overall system performance: 16 MLPs
Configuration      | Accuracy | Mem alloc size | Throughput | Bandwidth occupancy | FPS
posit<4,0,quire>   | ~91%     | ~23.5 MB       | ~1.56 GB/s | 12.5%               | 3.8 × 10^6
posit<8,0,noquire> | ~91%     | ~47 MB         | ~3.1 GB/s  | 25%                 | 3.8 × 10^6
posit<8,0,quire>   | ~91%     | ~47 MB         | ~3.1 GB/s  | 25%                 | 3.8 × 10^6
posit<16,0,quire>  | ~91%     | ~94 MB         | ~6.2 GB/s  | 50%                 | 3.8 × 10^6
ieee<4,1>          | ~10%     | ~23.5 MB       | ~1.56 GB/s | 12.5%               | 3.8 × 10^6
ieee<8,4>          | ~50%     | ~47 MB         | ~3.1 GB/s  | 25%                 | 3.8 × 10^6
ieee<16,5>         | ~91%     | ~94 MB         | ~6.2 GB/s  | 50%                 | 3.8 × 10^6
ieee<32,8>         | ~91%     | ~189 MB        | ~12.5 GB/s | 100%                | 3.8 × 10^6
29
Ideal Hardware (12.5 GB/s) / fictive max MLPs
Configuration      | Accuracy | Mem alloc size | # MLPs | FPS
posit<4,0,quire>   | ~91%     | ~23.5 MB       | 128    | 30.6 × 10^6
posit<8,0,noquire> | ~91%     | ~47 MB         | 64     | 15.3 × 10^6
posit<8,0,quire>   | ~91%     | ~47 MB         | 64     | 15.3 × 10^6
posit<16,0,quire>  | ~91%     | ~94 MB         | 32     | 7.65 × 10^6
ieee<4,1>          | ~10%     | ~23.5 MB       | 128    | 30.6 × 10^6
ieee<8,4>          | ~50%     | ~47 MB         | 64     | 15.3 × 10^6
ieee<16,5>         | ~91%     | ~94 MB         | 32     | 7.65 × 10^6
ieee<32,8>         | ~91%     | ~189 MB        | 16     | 3.8 × 10^6
30
A bit of colors: Floorplanning 16 MLP<8,0,QUIRE>
31
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
32
Conclusions
• CAPI+SNAP allow very fast implementation of RTL logic in a working and powerful System
• We built a framework to evaluate impacts of arithmetic
• Only posits
• Early results look promising
• Need to go further on the evaluation (especially with IEEE floating point)
• Same accuracy with 4x fewer bits
• 4x less memory storage
• 4x less power consumption
• 4x less communication time
• So: “Communication dominates less arithmetic”
• Lack of data concerning power consumption of the System
33
Thank you
louis.ledoux@bsc.es
34
Example of posit decoding
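• $x = \begin{cases} 0 & \text{when } 0\ldots0 \\ \pm\infty & \text{when } 10\ldots0 \\ (-1)^{s} \times \mathit{useed}^{k} \times 2^{e} \times \left(1 + \frac{f_{m-1}\ldots f_{0}}{2^{m}}\right) & \text{otherwise} \end{cases}$
• where $\mathit{useed} = 2^{2^{es}}$, $m$ = fraction width, $k$ = regime run-length
• Chosen posit: posit<5,1>(0b00101)
• $k = -1$ because the regime is a single 0 bit followed by the terminating regime bit (1)
• $x = (-1)^{0} \times (2^{2^{1}})^{-1} \times 2^{0} \times \left(1 + \frac{1}{2^{1}}\right)$
• $x = 1 \times \frac{1}{4} \times 1 \times \frac{3}{2}$
• $x = \frac{3}{8}$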
Pictures from: Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
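The decoding rule can be turned into a small reference decoder. This is a sketch, not the hardware decoder from the talk: it returns exact `Fraction` values, returns `None` for the 10...0 pattern, and assumes (as in the posit paper) that negative posits are two's-complemented before their fields are read. The function name is hypothetical.

```python
from fractions import Fraction


def decode_posit(bits: int, n: int, es: int):
    """Decode an n-bit posit<n,es> pattern into an exact Fraction (None for 10...0)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None                                   # the special 10...0 pattern
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask                         # two's complement of negatives
    rest = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]   # bits after the sign
    first = rest[0]
    run = 1
    while run < len(rest) and rest[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run               # regime value
    tail = rest[run + 1:]                             # skip the terminating bit
    exp_bits, frac_bits = tail[:es], tail[es:]
    e = 0
    for b in exp_bits:
        e = (e << 1) | b
    e <<= es - len(exp_bits)                          # truncated exponent bits read as 0
    f = 0
    for b in frac_bits:
        f = (f << 1) | b
    m = len(frac_bits)
    useed = 2 ** (2 ** es)
    value = Fraction(useed) ** k * Fraction(2) ** e * (1 + Fraction(f, 2 ** m))
    return -value if sign else value


if __name__ == "__main__":
    print(decode_posit(0b00101, 5, 1))      # the worked example: 3/8
    print(decode_posit(0b00100000, 8, 0))   # the fast-sigmoid result: 1/2
```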
35
Accuracy / values distribution
36
Posits now
• Rex computing
• LLNL showed good performance for LULESH and Euler2D
37
Overall system performance
• Baseline: ≈ 12 GB/s with a pipelined loopback @250MHz and 512b words
• To saturate the PCIe link we use data parallelism, 1 MLP / picture
• Can classify x pictures independently
• For posit<8,0>: x = 512/8 = 64 MLPs
• For posit<4,0>: x = 512/4 = 128 MLPs
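• Frames per second
• $\mathit{fps}(n) \approx \frac{12\ \text{GB/s}}{784 \times \frac{n}{8}\ \text{B/frame}}$
• $\mathit{fps}(4) \approx 30.6 \times 10^{6}$
• $\mathit{fps}(8) \approx 15.3 \times 10^{6}$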
• Time spent in PCI-e transfer divided by 2
• Allocation size on HOST DDR divided by 2
• Last results show that Half-precision IEEE float is needed to maintain 90% accuracy
• $\mathit{fps}(16) \approx 7.6 \times 10^{6}$
“Communication dominates less arithmetic”
38
1 MLP
Configuration      | Accuracy | Mem alloc size | Throughput  | Bandwidth occupancy | FPS
posit<4,0,quire>   | ~91%     | ~23.5 MB       | ~0.096 GB/s | 0.8%                | 2.4 × 10^5
posit<8,0,noquire> | ~91%     | ~47 MB         | ~0.19 GB/s  | 1.6%                | 2.4 × 10^5
posit<8,0,quire>   | ~91%     | ~47 MB         | ~0.19 GB/s  | 1.6%                | 2.4 × 10^5
posit<16,0,quire>  | ~91%     | ~94 MB         | ~0.39 GB/s  | 3.2%                | 2.4 × 10^5
ieee<4,1>          | ~10%     | ~23.5 MB       | ~0.096 GB/s | 0.8%                | 2.4 × 10^5
ieee<8,4>          | ~50%     | ~47 MB         | ~0.19 GB/s  | 1.6%                | 2.4 × 10^5
ieee<16,5>         | ~91%     | ~94 MB         | ~0.39 GB/s  | 3.6%                | 2.4 × 10^5
ieee<32,8>         | ~91%     | ~189 MB        | ~0.8 GB/s   | 6.4%                | 2.4 × 10^5

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

04 accelerating dl inference with (open)capi and posit numbers

  • 1. 1 Accelerating DL inference with (Open)CAPI and posit numbers Louis Ledoux Marc Casas Lyon – OpenPOWER Summit
  • 2. 2 Outline • Overview of posit number representation • Problems and motivations • Implementation: configurable posit-based AI IPs • Results • Conclusions
  • 3. 3 Ideas behind posit numbers (1) • Like IEEE-754 floating point, they are intended to represent real numbers • The claim: to be a drop-in replacement for classical floating point • More precision and dynamic range in many cases • The main ideas: • Fewer bits per datum by increasing their entropy • Arithmetic driven by an application / use case • Tackle design flaws of an old standard from the 1980s • Repeatable / portable • Laws of algebra (associativity, distributivity) • Wasted patterns (NaNs) • Et cetera... • A standardized Kulisch-like accumulator: the Quire • Fused Dot Product [1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
  • 4. 4 posit representation • Many configurations of posits • denoted posit<N,ES>, where N is the width of a posit and ES is the exponent size • lets one choose the most suitable configuration for a specific use case • aka tapered precision • Golomb-Rice exponent encoding • Variable-length internal fields • sign (1b) • regime • exponent (if any) • fraction (if any) [1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
  • 5. 5 Ideas behind posit numbers (2) • Fine tuning (tapering) the arithmetic leads to saving bits for: • exponent (dynamic range) • mantissa (precision) • Example: a 32-bit posit could beat 64-bit IEEE-754 floating point in a scientific application, reducing: • amount of caches, memory, and storage • needed throughput to and from processors • Hardware resources in compute units (still being discussed) [1] • Power consumption [1] Yohann Uguen, Luc Forget, Florent de Dinechin. Evaluating the hardware cost of the posit number system. FPL 2019 - 29th International Conference on Field-Programmable Logic and Applications (FPL), Sep 2019, Barcelona, Spain. pp.1-8. hal-02130912v4 [2] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
  • 6. 6 Outline • Overview of posit number representation • Problems and motivations • Implementation: configurable posit-based AI IPs • Results • Conclusions
  • 7. 7 Communication dominates arithmetic • Plethora of µArch sharing a common aspect: • Non data/memory-centric, the computation is performed far away from the data itself • As a consequence: energy waste + performance loss • 62.7% of the total system energy is spent on data movement [1] [1] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018 [2] M. Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb 2014
  • 8. 8 Problems seen as motivations • Reducing communications? • in/near-memory computing • Quantization • Transprecision • Study the impact of arithmetic on an overall system • Our focus is AI • Data is increasing, thus the interest in neural networks • Posits are good candidates for neural networks • Activation functions and the “golden zone” (1) • Fast sigmoid (2) • Large vector dot products (3) • Not too much focus on the hardware cost of compute units “How does the denser representation of numbers offered by posits tackle the aforementioned problems?”
  • 9. 9 1° Activation functions and the “golden zone” • IEEE-754 • Accuracy is constant & large (80 orders of magnitude) • Induces low entropy/bit • Posits • Pyramid shape, normal/Gaussian distribution • Due to Golomb-Rice • Matches the NN weights distribution • Activation functions allow rescaling into the golden zone • Improves the amount of information carried per bit • Entropy/bit • More information/datum • Less communication (PCI-e / DDR) • Less storage, stay on chip • Compression Ratio • Shannon entropy [1] [1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
  • 10. 10 2° Fast sigmoid • IEEE-754 has HW tricks (0x5F3759DF [1])… but so do posits • $\sigma(x) = \frac{1}{1+e^{-x}}$ • Popular activation function in neural networks • Approximate sigmoid • 1. invert the sign bit (MSB) • 2. logical right shift by 2 • “Et voilà”: few gates/LUTs and ~1 clock • Example • a = 0 = Posit<8,0>(0b00000000) • 1. tmp = 0b10000000 • 2. res = 0b00100000 • Posit<8,0>(0b00100000) = 1/2 [1] https://en.wikipedia.org/wiki/Fast_inverse_square_root
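The two-step approximation on the fast sigmoid slide can be tried in software. The following is a minimal sketch, assuming 8-bit patterns held in Python integers; the function names and the small posit<8,0> decoder (limited to non-negative patterns) are illustrative helpers, not part of the presented hardware.

```python
import math


def fast_sigmoid_posit8(bits):
    """Approximate sigmoid on a posit<8,0> bit pattern:
    1. invert the sign bit (MSB), 2. logical right shift by 2."""
    flipped = (bits ^ 0x80) & 0xFF
    return flipped >> 2


def posit8_es0_to_float(bits):
    """Decode a non-negative posit<8,0> pattern (sign bit 0) to a float.
    Hypothetical helper, sufficient here because the shifted result
    always has its two top bits cleared."""
    if bits == 0:
        return 0.0
    body = format(bits & 0x7F, "07b")           # the 7 bits after the sign
    first = body[0]
    run = len(body) - len(body.lstrip(first))   # regime run length
    k = run - 1 if first == "1" else -run
    frac_bits = body[run + 1:]                  # skip the terminating bit
    frac = int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0
    return 2.0 ** k * (1.0 + frac)              # es = 0, so useed = 2


for name, pattern in [("0", 0b00000000), ("1", 0b01000000), ("-1", 0b11000000)]:
    approx = posit8_es0_to_float(fast_sigmoid_posit8(pattern))
    exact = 1.0 / (1.0 + math.exp(-float(name)))
    print(f"x = {name}: fast = {approx}, exact = {exact:.3f}")
```

For x = 0 the sketch reproduces the slide's example (0b00100000, i.e. 1/2); for other inputs the result is only an approximation of the true sigmoid.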
  • 11. 11 3° Large vector dot products • Most common computation in NNs is MAC [1] • Neuron (basic node) ↔ MAC unit / large vector dot product • The rounding error is also accumulated • After each product and accumulation • After each accumulation when FMA is used • The hardware yields an inexact result • The posit standard proposes an exact MAC unit: the QUIRE • Rounding is postponed: only 1 rounding, at the end • Well suited to very low precision (<8 bits) [1] A Dynamic Approach to Accelerate Deep Learning Training. Submitted to International Conference on Learning Representations. 2020. Under review.
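The effect of postponing rounding can be illustrated with a toy number format. The sketch below is not a posit or quire implementation: it assumes a hypothetical format that only represents multiples of 1/16, and uses Python's exact `Fraction` type as a stand-in for the exact accumulator.

```python
from fractions import Fraction

GRID = 16  # assumption: toy format representing only multiples of 1/16


def round_to_grid(value):
    """Round an exact value to the nearest representable multiple of 1/GRID."""
    return Fraction(round(value * GRID), GRID)


def dot_round_each_step(xs, ws):
    """Round after every product and after every accumulation."""
    acc = Fraction(0)
    for x, w in zip(xs, ws):
        product = round_to_grid(x * w)
        acc = round_to_grid(acc + product)
    return acc


def dot_exact_then_round(xs, ws):
    """Accumulate exactly (quire-like), then round once at the end."""
    acc = Fraction(0)
    for x, w in zip(xs, ws):
        acc += x * w
    return round_to_grid(acc)


xs = [Fraction(1, 4)] * 8
ws = [Fraction(1, 16)] * 8
print(dot_round_each_step(xs, ws))   # 0: every 1/64 product is rounded away
print(dot_exact_then_round(xs, ws))  # 1/8: the exact sum is representable
```

Each product (1/64) is below half a grid step, so the step-by-step version loses all of them, while the single final rounding returns the exact sum.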
  • 12. 12 Outline • Overview of posit number representation • Problems and motivations • Implementation: configurable posit-based AI IPs • Results • Conclusions
  • 13. 13 Our complete system • MareNostrum 4 • AC922 • 2xPOWER9 SMT4 Monza • 512GiB DDR4-2666 • 2xAlpha-Data 9V3 • XCVU3P-2 • 788k FFs • 394k LUTs • 2280 DSPs • 25.3Mb BRAM • 90.0Mb URAM • 8GiB DDR4-2400 • PCI Express® Gen3 x16 / Gen4 x8 or OpenCAPI
  • 18. 18 Verification methodology • 1 Testbench/module • PSLSE
  • 20. 20 Multi Layer Perceptron (MLP) Hardware • SDFG • Direct mapping • EMAC unit ↔ Neuron • 1 big pipeline • AXI-Stream • FC IP layer configurable
  • 22. 22 Neuron<N,ES> 2: NO FMA / NO QUIRE $\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
  • 23. 23 Neuron<N,ES> 2: NO FMA / NO QUIRE $\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
  • 25. 25 Outline • Overview of posit number representation • Problems and motivations • Implementation: configurable posit-based AI IPs • Results • Conclusions
  • 26. 26 Results: Resource utilization

    | Module | LUTs | FFs |
    |---|---|---|
    | Neuron<8,0,NOQUIRE> | 336 | 126 |
    | Decoder<8,0> | 27 | 0 |
    | Mult<8,0> | 82 | 32 |
    | Adder<8,0> | 133 | 57 |
    | Encoder+Round<8,0> | 38 | 0 |

    | Module | LUTs | FFs |
    |---|---|---|
    | Neuron<4,0,QUIRE> | 116 | 72 |
    | Decoder<4,0> | 3 | 0 |
    | Mult<4,0> | 6 | 0 |
    | Quire19<4,0> | 88 | 56 |
    | Encoder+Round<4,0> | 7 | 0 |

    • These 2 versions maintain > 90% accuracy • The framework allows evaluating all combinations of neurons • How many bits per datum can we save with an exact accumulator?
  • 27. 27 Saturate the PCI-e link: Data parallelism • Loopback @250MHz and 512b words • ≈ 16 GB/s in theory • ≈ 12.5 GB/s in real life • But an MLP consumes only one N-bit word • so instantiate many of them • Frames per second • let M be the number of NNs • $maxNN = \frac{512}{N}$ • $fps(n, M) \approx \frac{M}{maxNN} \times \frac{12.5\ \text{GB/s}}{784 \times \frac{n}{8}\ \text{B/frame}}$ (n = N, the word width in bits) • Why saturate the link? • Show that, for a given throughput, the lower the precision, the higher the FPS • Use the full power enabled by CAPI+SNAP
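The frames-per-second model above is simple enough to evaluate directly. The following sketch assumes the 512-bit word and 784 values per frame quoted on the slide; the function names are illustrative, and the link rate is left as a parameter because the deck quotes both 12.5 GB/s and a ≈ 12 GB/s baseline.

```python
WORD_BITS = 512          # width of one transferred word, from the slide
VALUES_PER_FRAME = 784   # values per frame, from the slide


def max_parallel_mlps(n_bits):
    """maxNN: how many n-bit inputs fit in one 512-bit word."""
    return WORD_BITS // n_bits


def bandwidth_occupancy(n_bits, num_mlps):
    """Fraction of the link used when num_mlps MLPs run in parallel."""
    return num_mlps / max_parallel_mlps(n_bits)


def frames_per_second(n_bits, num_mlps, link_bytes_per_s=12.5e9):
    """fps(n, M) ~ (M / maxNN) * link rate / bytes per frame."""
    bytes_per_frame = VALUES_PER_FRAME * n_bits / 8
    return bandwidth_occupancy(n_bits, num_mlps) * link_bytes_per_s / bytes_per_frame


for n_bits in (4, 8, 16, 32):
    occupancy = bandwidth_occupancy(n_bits, 16)
    fps = frames_per_second(n_bits, 16, link_bytes_per_s=12e9)
    print(f"{n_bits:2d} bits, 16 MLPs: occupancy {occupancy:.1%}, fps {fps:.2e}")
```

With 16 MLPs the frame rate is the same for every width because the occupancy grows with the word width, which is consistent with the constant FPS column reported for the 16-MLP system.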
  • 28. 28 Overall system performance: 16 MLPs

    | Configuration | Accuracy | Mem alloc size | Throughput | Bandwidth occupancy | FPS |
    |---|---|---|---|---|---|
    | posit<4,0,quire> | ~91% | ~23.5 MB | ~1.56 GB/s | 12.5% | 3.8 × 10^6 |
    | posit<8,0,noquire> | ~91% | ~47 MB | ~3.1 GB/s | 25% | 3.8 × 10^6 |
    | posit<8,0,quire> | ~91% | ~47 MB | ~3.1 GB/s | 25% | 3.8 × 10^6 |
    | posit<16,0,quire> | ~91% | ~94 MB | ~6.2 GB/s | 50% | 3.8 × 10^6 |
    | ieee<4,1> | ~10% | ~23.5 MB | ~1.56 GB/s | 12.5% | 3.8 × 10^6 |
    | ieee<8,4> | ~50% | ~47 MB | ~3.1 GB/s | 25% | 3.8 × 10^6 |
    | ieee<16,5> | ~91% | ~94 MB | ~6.2 GB/s | 50% | 3.8 × 10^6 |
    | ieee<32,8> | ~91% | ~189 MB | ~12.5 GB/s | 100% | 3.8 × 10^6 |

  • 29. 29 Ideal Hardware (12.5 GB/s) / fictive max MLPs

    | Configuration | Accuracy | Mem alloc size | # MLPs | FPS |
    |---|---|---|---|---|
    | posit<4,0,quire> | ~91% | ~23.5 MB | 128 | 30.6 × 10^6 |
    | posit<8,0,noquire> | ~91% | ~47 MB | 64 | 15.3 × 10^6 |
    | posit<8,0,quire> | ~91% | ~47 MB | 64 | 15.3 × 10^6 |
    | posit<16,0,quire> | ~91% | ~94 MB | 32 | 7.65 × 10^6 |
    | ieee<4,1> | ~10% | ~23.5 MB | 128 | 30.6 × 10^6 |
    | ieee<8,4> | ~50% | ~47 MB | 64 | 15.3 × 10^6 |
    | ieee<16,5> | ~91% | ~94 MB | 32 | 7.65 × 10^6 |
    | ieee<32,8> | ~91% | ~189 MB | 16 | 3.8 × 10^6 |
  • 30. 30 A bit of colors: Floorplanning 16 MLP<8,0,QUIRE>
  • 31. 31 Outline • Overview of posit number representation • Problems and motivations • Implementation: configurable posit-based AI IPs • Results • Conclusions
  • 32. 32 Conclusions • CAPI+SNAP allow very fast implementation of RTL logic in a working and powerful system • We built a framework to evaluate the impact of arithmetic • Only posits • Early results look promising • Need to go further on the evaluation (especially with IEEE floating point) • Same accuracy with 4x fewer bits • 4x less memory storage • 4x less power consumption • 4x less communication time • So: “Communication dominates less arithmetic” • Lack of data concerning the power consumption of the system
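  • 34. 34 Example of posit decoding • $x = \begin{cases} 0 & \text{when } 0\ldots0 \\ \pm\infty & \text{when } 10\ldots0 \\ (-1)^{s} \times useed^{k} \times 2^{e} \times \left(1 + \frac{f_{m-1} \ldots f_0}{2^{m}}\right) & \text{otherwise} \end{cases}$ • where $useed = 2^{2^{es}}$, m = fraction width, k = regime run-length • Chosen posit: posit<5,1>(0b00101) • $k = -1$ because there is 1 regime bit (0) followed by the opposite bit (1) • $x = (-1)^{0} \times (2^{2^{1}})^{-1} \times 2^{0} \times \left(1 + \frac{1}{2^{1}}\right)$ • $x = 1 \times \frac{1}{4} \times 1 \times \frac{3}{2}$ • $x = \frac{3}{8}$ Pictures from: Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’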
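The decoding rule above can be checked with a short software decoder. This is a minimal sketch, not the hardware decoder from the talk: the function name is illustrative, the 10...0 pattern is mapped to `float("inf")`, and negative patterns are handled by taking the two's complement first, which is an assumption based on the cited posit paper rather than something stated on the slide.

```python
def decode_posit(bits, n, es):
    """Decode an n-bit posit<n,es> pattern into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("inf")              # the 10...0 pattern
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask            # assumption: two's complement for negatives
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]  # bits after the sign
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run  # regime value from its run length
    rest = body[run + 1:]                # skip the bit that terminates the regime
    exp_bits = rest[:es]
    e = 0
    for b in exp_bits:
        e = (e << 1) | b
    e <<= es - len(exp_bits)             # truncated exponent bits count as zeros
    frac_bits = rest[es:]
    f = 0
    for b in frac_bits:
        f = (f << 1) | b
    frac = f / (1 << len(frac_bits)) if frac_bits else 0.0
    value = 2.0 ** (k * (1 << es) + e) * (1.0 + frac)  # useed^k * 2^e * (1 + f/2^m)
    return -value if sign else value


print(decode_posit(0b00101, 5, 1))       # the slide's example
```

Running it on the slide's example, posit<5,1>(0b00101), gives 0.375, i.e. 3/8.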
  • 35. 35 Accuracy / values distribution
  • 36. 36 Posits now • Rex computing • LLNL showed good performance for LULESH and Euler2D
  • 37. 37 Overall system performance • Baseline: ≈ 12 GB/s with a pipelined loopback @250MHz and 512b words • To saturate the PCIe link we use data parallelism, 1 MLP / picture • Can classify x pictures independently • For posit<8,0>: x = 512/8 = 64 MLPs • For posit<4,0>: x = 512/4 = 128 MLPs • Frames per second • $fps(n) \approx \frac{12\ \text{GB/s}}{784 \times \frac{n}{8}\ \text{B/frame}}$ • $fps(4) \approx 30.6 \times 10^6$ • $fps(8) \approx 15.3 \times 10^6$ • Time spent in PCI-e transfer divided by 2 • Allocation size on HOST DDR divided by 2 • Last results show that half-precision IEEE float is needed to maintain 90% accuracy • $fps(16) \approx 7.6 \times 10^6$ “Communication dominates less arithmetic”
  • 38. 38 1 MLP

    | Configuration | Accuracy | Mem alloc size | Throughput | Bandwidth occupancy | FPS |
    |---|---|---|---|---|---|
    | posit<4,0,quire> | ~91% | ~23.5 MB | ~0.096 GB/s | 0.8% | 2.4 × 10^5 |
    | posit<8,0,noquire> | ~91% | ~47 MB | ~0.19 GB/s | 1.6% | 2.4 × 10^5 |
    | posit<8,0,quire> | ~91% | ~47 MB | ~0.19 GB/s | 1.6% | 2.4 × 10^5 |
    | posit<16,0,quire> | ~91% | ~94 MB | ~0.39 GB/s | 3.2% | 2.4 × 10^5 |
    | ieee<4,1> | ~10% | ~23.5 MB | ~0.096 GB/s | 0.8% | 2.4 × 10^5 |
    | ieee<8,4> | ~50% | ~47 MB | ~0.19 GB/s | 1.6% | 2.4 × 10^5 |
    | ieee<16,5> | ~91% | ~94 MB | ~0.39 GB/s | 3.6% | 2.4 × 10^5 |
    | ieee<32,8> | ~91% | ~189 MB | ~0.8 GB/s | 6.4% | 2.4 × 10^5 |