04 accelerating dl inference with (open)capi and posit numbers

1
Accelerating DL inference with (Open)CAPI and posit numbers
Louis Ledoux
Marc Casas
Lyon – OpenPOWER Summit

2
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions

3
Ideas behind posit numbers (1)
• Like IEEE-754 floating point, posits are intended to represent real numbers
• The claim: a drop-in replacement for classical floating point
• More precision and dynamic range in many cases
• The main ideas:
• Fewer bits per datum by increasing the entropy carried by each bit
• Arithmetic driven by the application / use case
• Address design flaws of an old standard from the 1980s:
• Repeatability / portability
• Laws of algebra (associativity, distributivity)
• Wasted bit patterns (NaNs)
• Et cetera...
• A standardized Kulisch-like accumulator: the quire
• Fused dot product
[Figure from [1]]
[1] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’

4
posit representation
• Many configurations of posits
• Denoted posit<N,ES>, where N is the width of a posit in bits and ES is the exponent size
• Allows choosing the most suitable configuration for a specific use case
• aka tapered precision
• Golomb-Rice exponent encoding
• Variable-length internal fields
• sign (1b)
• regime
• exponent (if any)
• fraction (if any)
[Figure from [1]]
[1] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’

5
Ideas behind posit numbers (2)
• Fine-tuning (tapering) the arithmetic saves bits for:
• exponent (dynamic range)
• mantissa (precision)
• Example: a 32-bit posit could beat 64-bit IEEE-754 floating point in a scientific application, reducing:
• Amount of cache, memory, and storage
• Required throughput to and from processors
• Hardware resources in compute units (still being discussed) [1]
• Power consumption
[Figure from [2]]
[1] Yohann Uguen, Luc Forget, Florent de Dinechin. Evaluating the hardware cost of the posit number system. FPL 2019 - 29th International Conference on Field-Programmable Logic and Applications (FPL), Sep 2019, Barcelona, Spain. pp. 1-8. hal-02130912v4
[2] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’

7
Communication dominates arithmetic
• A plethora of microarchitectures share a common trait:
• They are not data/memory-centric: the computation is performed far away from the data itself
• As a consequence: wasted energy and lost performance
• 62.7% of the total system energy is spent on data movement [1]
[Figures from [2]]
[1] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks", Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
[2] M. Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb 2014

8
Problems seen as motivations
• Reducing communication?
• In/near-memory computing
• Quantization
• Transprecision
• Study the impact of arithmetic on the overall system
• Our focus is AI
• Data volumes are increasing, hence the interest in neural networks
• Posits are good candidates for neural networks
• Activation functions and the “golden zone” (1)
• Fast sigmoid (2)
• Large vector dot products (3)
• Little focus on the hardware cost of the compute units
“How does the denser representation of numbers offered by posits tackle the aforementioned problems?”

9
1° Activation functions and the “golden zone”
• IEEE-754
• Accuracy is constant over a large dynamic range (80 orders of magnitude)
• Induces low entropy per bit
• Posits
• Pyramid-shaped accuracy, resembling a normal (Gaussian) distribution
• Due to Golomb-Rice encoding
• Matches the distribution of NN weights
• Activation functions allow rescaling into the golden zone
• Improves the amount of information carried per bit
• Entropy/bit
• More information/datum
• Less communication (PCI-e / DDR)
• Less storage, data stays on chip
• Compression ratio
• Shannon entropy
[Figure from [1]]
[1] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’

10
2° Fast sigmoid
• IEEE-754 has HW tricks (0x5F3759DF [1])… but so do posits
• $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Popular activation function in neural networks
• Approximate sigmoid
• 1. invert the sign bit (MSB)
• 2. logical right shift by 2
• “Et voilà”: a few gates/LUTs and ~1 clock cycle
• Example
• a = 0 = Posit<8,0>(0b00000000)
• 1. tmp = 0b10000000
• 2. res = 0b00100000
• Posit<8,0>(0b00100000) = 1/2
[1] https://en.wikipedia.org/wiki/Fast_inverse_square_root
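The two-step bit trick above can be checked in software. The following is a minimal Python sketch, assuming the standard posit field layout (sign, regime, fraction) with ES = 0, as in the slide's Posit<8,0> example. The helper names `decode_posit_es0` and `fast_sigmoid_bits` are hypothetical and are not taken from the presented hardware.

```python
import math


def decode_posit_es0(bits: int, n: int = 8) -> float:
    """Decode an n-bit posit with ES = 0 into a Python float (illustrative helper)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return math.inf                      # the single "±∞" pattern 10...0
    negative = bool(bits >> (n - 1))
    if negative:
        bits = (-bits) & mask                # negative posits are two's complement
    body = format(bits, f"0{n}b")[1:]        # bit string without the sign bit
    first = body[0]
    run = len(body) - len(body.lstrip(first))    # regime run length
    k = run - 1 if first == "1" else -run
    frac_bits = body[run + 1:]               # skip the regime terminator bit
    frac = int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0
    value = 2.0 ** k * (1.0 + frac)          # useed = 2 when ES = 0
    return -value if negative else value


def fast_sigmoid_bits(bits: int, n: int = 8) -> int:
    """Approximate sigmoid on the raw pattern: flip the MSB, then shift right by 2."""
    mask = (1 << n) - 1
    return ((bits ^ (1 << (n - 1))) & mask) >> 2


if __name__ == "__main__":
    for pattern in (0x00, 0x40, 0xC0):       # posit<8,0> patterns for 0, 1 and -1
        x = decode_posit_es0(pattern)
        approx = decode_posit_es0(fast_sigmoid_bits(pattern))
        exact = 1.0 / (1.0 + math.exp(-x))
        print(f"x={x:+.2f} approx={approx:.4f} exact={exact:.4f}")
```

For the slide's example, the pattern 0b00000000 becomes 0b00100000, which decodes to 1/2. For x = 1 the approximation gives 0.75, against an exact value of about 0.731.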

11
3° Large vector dot products
• The most common computation in NNs is the MAC (multiply-accumulate) [1]
• Neuron (basic node) ↔ MAC unit / large vector dot product
• The rounding error is accumulated as well
• After each product and each accumulation
• After each accumulation only, when FMA is used
• The hardware yields an inexact result
• The posit standard proposes an exact MAC unit: the quire
• Rounding is postponed: a single rounding at the end
• Well suited to very low precision (<8 bits)
[1] A Dynamic Approach to Accelerate Deep Learning Training. Submitted to International Conference on Learning Representations. 2020. Under review.
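The effect of postponing the rounding can be illustrated with a small Python sketch. This is not the presented hardware: IEEE binary16 (via the standard `struct` module) is used only as a stand-in for a low-precision format, and an exact `Fraction` plays the role of the quire. The function names are hypothetical.

```python
import struct
from fractions import Fraction


def round_half(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_rounded_each_step(xs, ws, bias):
    """MAC with a rounding after every product and every accumulation."""
    acc = round_half(bias)
    for x, w in zip(xs, ws):
        prod = round_half(x * w)         # rounding after the product
        acc = round_half(acc + prod)     # rounding after the accumulation
    return acc


def dot_single_rounding(xs, ws, bias):
    """MAC with an exact accumulator (quire stand-in) and one final rounding."""
    acc = Fraction(bias)
    for x, w in zip(xs, ws):
        acc += Fraction(x) * Fraction(w)  # exact, no rounding
    return round_half(float(acc))         # the single rounding


# 4096 products of 2^-12 each, added to a bias of 1: the exact result is 2.
xs = [2.0 ** -6] * 4096
ws = [2.0 ** -6] * 4096
print(dot_rounded_each_step(xs, ws, 1.0))  # 1.0: every small term is rounded away
print(dot_single_rounding(xs, ws, 1.0))    # 2.0
```

Each product is a quarter of the spacing between binary16 values near 1, so the step-by-step version never moves away from the bias, while the exact accumulator keeps every contribution until the final rounding.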

13
Our complete system
• MareNostrum 4
• AC922
• 2xPOWER9 SMT4 Monza
• 512GiB DDR4-2666
• 2xAlpha-Data 9V3
• XCVU3P-2
• 788k FFs
• 394k LUTs
• 2280 DSPs
• 25.3Mb BRAM
• 90.0Mb URAM
• 8GiB DDR4-2400
• PCI Express® Gen3 x16 / Gen4 x8 or OpenCAPI

14
Accelerating the acceleration

15
Accelerating the acceleration: CAPI

16
Accelerating the acceleration: SNAP

17
Accelerating the acceleration: My helpers

18
Verification methodology
• One testbench per module
• PSLSE

19
Multi Layer Perceptron (MLP)

20
Multi Layer Perceptron (MLP) Hardware
• SDFG
• Direct mapping
• EMAC unit ↔ Neuron
• 1 big pipeline
• AXI-Stream
• Configurable FC layer IP

21
Neuron<N,ES> 1: QUIRE
$\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$

22
Neuron<N,ES> 2: NO FMA / NO QUIRE
[Figure: neuron datapath computing $\sigma\left(\sum_{i=0}^{N-1} \ldots\right)$ without FMA or quire]

26
Results: Resource utilization
Module                 LUTs   FFs
Neuron<8,0,NOQUIRE>     336   126
Decoder<8,0>             27     0
Mult<8,0>                82    32
Adder<8,0>              133    57
Encoder+Round<8,0>       38     0

Module                 LUTs   FFs
Neuron<4,0,QUIRE>       116    72
Decoder<4,0>              3     0
Mult<4,0>                 6     0
Quire19<4,0>             88    56
Encoder+Round<4,0>        7     0

• These 2 versions maintain > 90% accuracy
• The framework allows evaluating all combinations of neurons
• How many bits per datum can we save with an exact accumulator?

27
Saturate the PCI-e link: Data parallelism
• Loopback @250MHz with 512b words
• ≈ 16 GB/s in theory
• ≈ 12.5 GB/s in real life
• But an MLP consumes only an n-bit word
• Instantiate many of them
• Frames per second
• Let $M$ be the number of NNs and $n$ the width of a datum in bits
• $maxNN = \frac{512}{n}$
• $fps(n, M) \approx \frac{M}{maxNN} \times \frac{12.5\ \mathrm{GB/s}}{784 \times \frac{n}{8}\ \mathrm{B/frame}}$
• Why saturate the link?
• To show that, for a given throughput, the lower the precision, the higher the FPS
• To use the full power enabled by CAPI+SNAP
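The frame-rate model above is easy to evaluate numerically. The sketch below is a direct transcription of the two formulas, assuming 1 GB = 10^9 bytes; the function and constant names are hypothetical. The bandwidth is left as a parameter because the FPS figures quoted elsewhere in the deck (3.8 × 10^6 for 16 MLPs, 30.6 × 10^6 for 128 MLPs) appear to correspond to the ≈ 12 GB/s baseline rather than to 12.5 GB/s.

```python
LINK_WIDTH_BITS = 512     # width of one word on the link (from the slide)
VALUES_PER_FRAME = 784    # data values per frame (from the slide)


def max_nn(n_bits: int) -> int:
    """Number of n-bit MLPs that fit side by side in one 512-bit word."""
    return LINK_WIDTH_BITS // n_bits


def fps(n_bits: int, num_nn: int, bandwidth_bytes_per_s: float = 12.5e9) -> float:
    """Frames per second for num_nn MLPs of n_bits each (assumes 1 GB = 1e9 B)."""
    bytes_per_frame = VALUES_PER_FRAME * n_bits / 8
    link_fraction = num_nn / max_nn(n_bits)      # share of the link actually used
    return link_fraction * bandwidth_bytes_per_s / bytes_per_frame


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(f"n={n:2d} maxNN={max_nn(n):3d} "
              f"fps(16 MLPs)={fps(n, 16, 12e9):.3g} "
              f"fps(maxNN MLPs)={fps(n, max_nn(n), 12e9):.3g}")
```

With a fixed number of MLPs the frame rate does not depend on n, since both the used share of the link and the frame size scale with n. Only when the link is saturated does a narrower datum translate into more frames per second.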

28
Overall system performance: 16 MLPs
Configuration       Accuracy  Mem alloc size  Throughput   Bandwidth occupancy  FPS
posit<4,0,quire>    ~91%      ~23.5 MB        ~1.56 GB/s   12.5%                3.8 × 10^6
posit<8,0,noquire>  ~91%      ~47 MB          ~3.1 GB/s    25%                  3.8 × 10^6
posit<8,0,quire>    ~91%      ~47 MB          ~3.1 GB/s    25%                  3.8 × 10^6
posit<16,0,quire>   ~91%      ~94 MB          ~6.2 GB/s    50%                  3.8 × 10^6
ieee<4,1>           ~10%      ~23.5 MB        ~1.56 GB/s   12.5%                3.8 × 10^6
ieee<8,4>           ~50%      ~47 MB          ~3.1 GB/s    25%                  3.8 × 10^6
ieee<16,5>          ~91%      ~94 MB          ~6.2 GB/s    50%                  3.8 × 10^6
ieee<32,8>          ~91%      ~189 MB         ~12.5 GB/s   100%                 3.8 × 10^6

29
Ideal hardware (12.5 GB/s) / hypothetical maximum number of MLPs
Configuration       Accuracy  Mem alloc size  # MLPs  FPS
posit<4,0,quire>    ~91%      ~23.5 MB        128     30.6 × 10^6
posit<8,0,noquire>  ~91%      ~47 MB          64      15.3 × 10^6
posit<8,0,quire>    ~91%      ~47 MB          64      15.3 × 10^6
posit<16,0,quire>   ~91%      ~94 MB          32      7.65 × 10^6
ieee<4,1>           ~10%      ~23.5 MB        128     30.6 × 10^6
ieee<8,4>           ~50%      ~47 MB          64      15.3 × 10^6
ieee<16,5>          ~91%      ~94 MB          32      7.65 × 10^6
ieee<32,8>          ~91%      ~189 MB         16      3.8 × 10^6

30
A bit of color: floorplan of 16 MLP<8,0,QUIRE>

32
Conclusions
• CAPI+SNAP allows very fast implementation of RTL logic in a working and powerful system
• We built a framework to evaluate the impact of arithmetic
• Only posits
• Early results look promising
• The evaluation needs to go further (especially with IEEE floating point)
• Same accuracy with 4x fewer bits
• 4x less memory storage
• 4x less power consumption
• 4x less communication time
• So: “Communication dominates less arithmetic”
• Data on the power consumption of the system is still lacking

33
Thank you
louis.ledoux@bsc.es

34
Example of posit decoding
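• $x = \begin{cases} 0 & \text{when } 0\ldots0 \\ \pm\infty & \text{when } 10\ldots0 \\ (-1)^s \times useed^k \times 2^e \times \left(1 + \frac{f_{m-1} \ldots f_0}{2^m}\right) & \text{otherwise} \end{cases}$
• where $useed = 2^{2^{es}}$, $m$ = fraction width, $k$ = regime run length
• Chosen posit: posit<5,1>(0b00101)
• $k = -1$ because there is 1 regime bit (0) followed by the terminating regime bit (1)
• $x = (-1)^0 \times (2^{2^1})^{-1} \times 2^0 \times \left(1 + \frac{1}{2^1}\right)$
• $x = 1 \times \frac{1}{4} \times 1 \times \frac{3}{2}$
• $x = \frac{3}{8}$
Pictures from: Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’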
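The decoding rule above can be written as a short Python routine. This is a minimal sketch of the formula on this slide, not the decoder IP from the presentation; the name `decode_posit` is hypothetical, and the 10...0 pattern is mapped to `inf` to follow the slide's ±∞ convention.

```python
def decode_posit(bits: int, n: int, es: int) -> float:
    """Decode an n-bit posit<n,es> pattern into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("inf")                  # the "±∞" pattern 10...0
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask                # negative posits are two's complement
    body = bits & ((1 << (n - 1)) - 1)       # the n-1 bits after the sign
    i = n - 2                                # index of the first regime bit
    first = (body >> i) & 1
    run = 0
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1                             # count identical regime bits
        i -= 1
    k = run - 1 if first else -run
    i -= 1                                   # skip the regime terminator bit
    left = max(i + 1, 0)                     # bits remaining for exponent + fraction
    tail = body & ((1 << left) - 1)
    e_bits = min(es, left)                   # the exponent field may be truncated
    exp = (tail >> (left - e_bits)) << (es - e_bits) if e_bits else 0
    m = left - e_bits                        # fraction width
    frac = tail & ((1 << m) - 1)
    value = 2.0 ** (k * (1 << es) + exp) * (1 + frac / (1 << m))
    return -value if sign else value


if __name__ == "__main__":
    print(decode_posit(0b00101, 5, 1))       # 0.375, i.e. 3/8 as on the slide
```

Here `useed ** k` is folded into the exponent as `k * 2**es`, which is the same quantity since useed = 2^(2^es).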

35
Accuracy / values distribution

36
Posits now
• Rex computing
• LLNL showed good performance for LULESH and Euler2D

37
Overall system performance
• Baseline: ≈ 12 GB/s with a pipelined loopback @250MHz and 512b words
• To saturate the PCIe link we use data parallelism, 1 MLP per picture
• Can classify x pictures independently
• For posit<8,0>: x = 512/8 = 64 MLPs
• For posit<4,0>: x = 512/4 = 128 MLPs
• Frames per second
• $fps(n) \approx \frac{12\ \mathrm{GB/s}}{784 \times \frac{n}{8}\ \mathrm{B/frame}}$
• $fps(4) \approx 30.6 \times 10^6$
• $fps(8) \approx 15.3 \times 10^6$
• Time spent in PCI-e transfers is divided by 2
• Allocation size on host DDR is divided by 2
• Latest results show that half-precision IEEE float is needed to maintain 90% accuracy
• $fps(16) \approx 7.6 \times 10^6$
“Communication dominates less arithmetic”

38
1 MLP
Configuration       Accuracy  Mem alloc size  Throughput    Bandwidth occupancy  FPS
posit<4,0,quire>    ~91%      ~23.5 MB        ~0.096 GB/s   0.8%                 2.4 × 10^5
posit<8,0,noquire>  ~91%      ~47 MB          ~0.19 GB/s    1.6%                 2.4 × 10^5
posit<8,0,quire>    ~91%      ~47 MB          ~0.19 GB/s    1.6%                 2.4 × 10^5
posit<16,0,quire>   ~91%      ~94 MB          ~0.39 GB/s    3.2%                 2.4 × 10^5
ieee<4,1>           ~10%      ~23.5 MB        ~0.096 GB/s   0.8%                 2.4 × 10^5
ieee<8,4>           ~50%      ~47 MB          ~0.19 GB/s    1.6%                 2.4 × 10^5
ieee<16,5>          ~91%      ~94 MB          ~0.39 GB/s    3.2%                 2.4 × 10^5
ieee<32,8>          ~91%      ~189 MB         ~0.8 GB/s     6.4%                 2.4 × 10^5
