# 04 Accelerating DL inference with (Open)CAPI and posit numbers

This was presented by Louis Ledoux and Marc Casas at OpenPOWER Summit EU 2019. The original is available at:
https://static.sched.com/hosted_files/opeu19/1a/presentation_louis_ledoux_posit.pdf

1. Accelerating DL inference with (Open)CAPI and posit numbers. Louis Ledoux, Marc Casas. Lyon, OpenPOWER Summit
2. Outline
   • Overview of posit number representation
   • Problems and motivations
   • Implementation: configurable posit-based AI IPs
   • Results
   • Conclusions
3. Ideas behind posit numbers (1)
   • Like IEEE-754 floating point, posits are intended to represent real numbers
   • The claim: a drop-in replacement for classical floating point
   • More precision and dynamic range in many cases
   • The main ideas:
     • Fewer bits per datum, by increasing the entropy carried by each bit
     • Arithmetic driven by an application / use case
     • Tackle design flaws of an old standard from the 1980s:
       • Repeatability / portability
       • Laws of algebra (associativity, distributivity)
       • Wasted bit patterns (NaNs)
       • Et cetera
     • A standardized Kulisch-like accumulator: the quire
       • Fused dot product
   [1] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
4. Posit representation
   • Many configurations of posits
     • Denoted posit<N,ES>, where N is the total width of the posit and ES is the exponent size
     • Allows choosing the most suitable configuration for a specific use case
     • Aka tapered precision
   • Golomb-Rice exponent encoding
   • Variable-length internal fields
     • sign (1 bit)
     • regime
     • exponent (if any)
     • fraction (if any)
   [1] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
5. Ideas behind posit numbers (2)
   • Fine-tuning (tapering) the arithmetic saves bits for:
     • exponent (dynamic range)
     • mantissa (precision)
   • Example: a 32-bit posit could beat a 64-bit IEEE-754 floating-point number in a scientific application, reducing:
     • the amount of cache, memory, and storage
     • the throughput needed to and from processors
     • hardware resources in compute units (still being discussed) [1]
     • power consumption
   [1] Yohann Uguen, Luc Forget, Florent de Dinechin. Evaluating the hardware cost of the posit number system. FPL 2019 - 29th International Conference on Field-Programmable Logic and Applications (FPL), Sep 2019, Barcelona, Spain. pp. 1-8. hal-02130912v4
   [2] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
7. Communication dominates arithmetic
   • A plethora of microarchitectures (µArch) share a common trait:
     • They are not data/memory-centric: the computation is performed far away from the data itself
   • As a consequence: wasted energy and lost performance
   • 62.7% of the total system energy is spent on data movement [1]
   [1] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu. "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks". Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
   [2] M. Horowitz. "1.1 Computing’s energy problem (and what we can do about it)". In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb 2014
8. Problems seen as motivations
   • Reducing communication?
     • In/near-memory computing
     • Quantization
     • Transprecision
   • Study the impact of arithmetic on an overall system
   • Our focus is AI
     • The amount of data is increasing, hence the interest in neural networks
   • Posits are good candidates for neural networks
     • Activation functions and the “golden zone” (1)
     • Fast sigmoid (2)
     • Large vector dot products (3)
   • Not too much focus on the hardware cost of compute units
   “How does the denser representation of numbers offered by posits tackle the aforementioned problems?”
9. 1° Activation functions and the “golden zone”
   • IEEE-754
     • Accuracy is constant over a large range (80 orders of magnitude)
     • Induces low entropy per bit
   • Posits
     • Pyramid-shaped accuracy, resembling a normal/Gaussian distribution
       • Due to Golomb-Rice encoding
     • Matches the distribution of NN weights
     • Activation functions allow rescaling into the golden zone
   • Improves the amount of information carried per bit
     • Entropy per bit
     • More information per datum
       • Less communication (PCI-e / DDR)
       • Less storage, data stays on chip
     • Compression ratio
     • Shannon entropy
   [1] Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
10. 2° Fast sigmoid
    • IEEE-754 has hardware tricks (0x5F3759DF [1])… but so do posits
    • $\sigma(x) = \frac{1}{1 + e^{-x}}$
      • Popular activation function in neural networks
    • Approximate sigmoid
      • 1. Invert the sign bit (MSB)
      • 2. Logical right shift by 2
    • “Et voilà”: a few gates/LUTs and ~1 clock cycle
    • Example
      • a = 0 = Posit<8,0>(0b00000000)
      • 1. tmp = 0b10000000
      • 2. res = 0b00100000
      • Posit<8,0>(0b00100000) = 1/2
    [1] https://en.wikipedia.org/wiki/Fast_inverse_square_root
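The two-step trick on this slide can be tried in software. The following is a minimal Python sketch, assuming posit<8,0> (the bit trick relies on ES = 0). The helper names `decode_posit8_es0` and `fast_sigmoid_bits` are hypothetical and do not come from the slides; the decoder is only there to turn bit patterns back into readable values.

```python
import math

def decode_posit8_es0(bits: int) -> float:
    """Decode a posit<8,0> bit pattern into a float (illustrative helper)."""
    bits &= 0xFF
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return math.nan                      # the single exception pattern 10...0
    negative = bool(bits & 0x80)
    if negative:
        bits = (-bits) & 0xFF                # two's complement gives the magnitude pattern
    body = format(bits & 0x7F, "07b")        # the 7 bits that follow the sign
    lead = body[0]
    run = len(body) - len(body.lstrip(lead)) # regime run length
    k = run - 1 if lead == "1" else -run
    frac = body[run + 1:]                    # skip the regime terminator; ES = 0, so no exponent bits
    f = int(frac, 2) / (1 << len(frac)) if frac else 0.0
    value = 2.0 ** k * (1.0 + f)
    return -value if negative else value

def fast_sigmoid_bits(bits: int) -> int:
    """Approximate sigmoid on a posit<8,0> pattern: flip the sign bit, then shift right by 2."""
    return ((bits ^ 0x80) & 0xFF) >> 2

if __name__ == "__main__":
    for pattern in (0x00, 0x40, 0xC0):       # encodes 0, 1 and -1
        x = decode_posit8_es0(pattern)
        approx = decode_posit8_es0(fast_sigmoid_bits(pattern))
        exact = 1.0 / (1.0 + math.exp(-x))
        print(f"x={x:+.2f}  fast={approx:.4f}  exact={exact:.4f}")
```

For the slide's example, the pattern `0b00000000` maps to `0b00100000`, which decodes to 1/2, the exact value of the sigmoid at 0.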
11. 3° Large vector dot products
    • The most common computation in NNs is the multiply-accumulate (MAC) [1]
    • Neuron (basic node) → MAC unit / large vector dot product
    • The rounding error is also accumulated
      • After each product and each accumulation
      • After each accumulation when FMA is used
      • The hardware yields an inexact result
    • The posit standard proposes an exact MAC unit: the quire
      • Rounding is postponed: a single rounding at the end
      • Well suited to very low precision (<8 bits)
    [1] A Dynamic Approach to Accelerate Deep Learning Training. Submitted to International Conference on Learning Representations, 2020. Under review.
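The effect of postponing rounding can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware described in the talk: it enumerates the posit<8,0> values from their closed form, rounds by brute-force nearest value (ties and saturation are not handled the way a real posit unit would), and models the quire with Python's exact `Fraction` arithmetic. All function names are hypothetical.

```python
from fractions import Fraction

def posit8_es0_values():
    """All real values representable in posit<8,0>, built from 2^k * (1 + f / 2^m)."""
    values = {Fraction(0)}
    for k in range(-6, 7):
        # Fraction bits left after the sign and regime fields (never negative).
        m = max(5 - k, 0) if k >= 0 else 6 + k
        for f in range(1 << m):
            v = Fraction(2) ** k * (1 + Fraction(f, 1 << m))
            values.update((v, -v))
    return sorted(values)

VALUES = posit8_es0_values()

def round_to_posit8(x: Fraction) -> Fraction:
    """Nearest representable value (simplified rounding, assumption of this sketch)."""
    return min(VALUES, key=lambda v: abs(v - x))

def dot_rounded_each_step(xs, ws):
    """MAC without quire: round after each product and after each accumulation."""
    acc = Fraction(0)
    for x, w in zip(xs, ws):
        acc = round_to_posit8(acc + round_to_posit8(x * w))
    return acc

def dot_quire(xs, ws):
    """Quire-like MAC: accumulate exactly, round once at the end."""
    exact = sum((x * w for x, w in zip(xs, ws)), Fraction(0))
    return round_to_posit8(exact)

xs = [Fraction(4)] + [Fraction(1, 2)] * 4
ws = [Fraction(2)] + [Fraction(1)] * 4

if __name__ == "__main__":
    print("rounded at each step:", dot_rounded_each_step(xs, ws))
    print("single final rounding:", dot_quire(xs, ws))
```

In this example the exact dot product is 10. Near 8 the posit<8,0> values are spaced 2 apart, so each added 0.5 is rounded away when rounding happens at every step, while the exact accumulator keeps all four contributions.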
13. Our complete system
    • MareNostrum 4
    • AC922
      • 2× POWER9 SMT4 Monza
      • 512 GiB DDR4-2666
      • 2× Alpha-Data 9V3
        • XCVU3P-2
          • 788k FFs
          • 394k LUTs
          • 2280 DSPs
          • 25.3 Mb BRAM
          • 90.0 Mb URAM
        • 8 GiB DDR4-2400
        • PCI Express® Gen3 x16 / Gen4 x8 or OpenCAPI
14. Accelerating the acceleration
15. Accelerating the acceleration: CAPI
16. Accelerating the acceleration: SNAP
17. Accelerating the acceleration: My helpers
18. Verification methodology
    • 1 testbench per module
    • PSLSE
19. Multi Layer Perceptron (MLP)
20. Multi Layer Perceptron (MLP) Hardware
    • SDFG
    • Direct mapping
      • EMAC unit → Neuron
    • 1 big pipeline
    • AXI-Stream
    • Configurable FC layer IP
21. Neuron<N,ES> 1: QUIRE. $\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
22. Neuron<N,ES> 2: NO FMA / NO QUIRE. $\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
23. Neuron<N,ES> 2: NO FMA / NO QUIRE. $\sigma\left(\sum_{i=0}^{N-1} x_i w_i + b\right)$
24. Mult<N,ES>
26. Results: Resource utilization

    | Module | LUTs | FFs |
    |---|---|---|
    | Neuron<8,0,NOQUIRE> | 336 | 126 |
    | Decoder<8,0> | 27 | 0 |
    | Mult<8,0> | 82 | 32 |
    | Adder<8,0> | 133 | 57 |
    | Encoder+Round<8,0> | 38 | 0 |

    | Module | LUTs | FFs |
    |---|---|---|
    | Neuron<4,0,QUIRE> | 116 | 72 |
    | Decoder<4,0> | 3 | 0 |
    | Mult<4,0> | 6 | 0 |
    | Quire19<4,0> | 88 | 56 |
    | Encoder+Round<4,0> | 7 | 0 |

    • These 2 versions maintain > 90% accuracy
    • The framework allows evaluating all combinations of neurons
    • How many bits per datum can we save with an exact accumulator?
27. Saturate the PCI-e link: Data parallelism
    • Loopback @250 MHz and 512b words
      • ≈ 16 GB/s in theory
      • ≈ **12.5 GB/s** in real life
    • But an MLP consumes only an $n$-bit word
      • Instantiate many of them
    • Frames per second
      • Let $M$ be the number of NNs
      • $maxNN = \frac{512}{n}$
      • $fps(n, M) \approx \frac{M}{maxNN} \times \frac{12.5\ \text{GB/s}}{784 \times \frac{n}{8}\ \text{B/frame}}$
    • Why saturate the link?
      • Show that, for a given throughput, the lower the precision, the higher the FPS
      • Use the full power enabled by CAPI+SNAP
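The frames-per-second estimate on this slide is simple enough to check numerically. Below is a minimal Python sketch of the formula; the function and parameter names are hypothetical, and the link bandwidth is left as a parameter because the slides quote both ≈12.5 GB/s and, on a backup slide, ≈12 GB/s.

```python
def max_parallel_mlps(n_bits: int, word_bits: int = 512) -> int:
    """maxNN: how many n-bit lanes fit in one 512-bit word."""
    return word_bits // n_bits

def frames_per_second(n_bits: int, num_mlps: int,
                      link_bytes_per_s: float = 12.5e9,
                      values_per_frame: int = 784) -> float:
    """fps(n, M) ~= (M / maxNN) * bandwidth / (784 * n / 8 bytes per frame)."""
    bytes_per_frame = values_per_frame * n_bits / 8
    link_share = num_mlps / max_parallel_mlps(n_bits)
    return link_share * link_bytes_per_s / bytes_per_frame

if __name__ == "__main__":
    for n_bits in (4, 8, 16, 32):
        full = max_parallel_mlps(n_bits)
        fps = frames_per_second(n_bits, full, link_bytes_per_s=12e9)
        print(f"n={n_bits:2d}  maxNN={full:3d}  fps~{fps:.3g}")
```

With a 12 GB/s link and the link fully used, the sketch evaluates to about 30.6 × 10^6 frames per second for 4-bit data and about 15.3 × 10^6 for 8-bit data.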
28. Overall system performance: 16 MLPs

    | Configuration | Accuracy | Mem alloc size | Throughput | Bandwidth occupancy | FPS |
    |---|---|---|---|---|---|
    | posit<4,0,quire> | ~91% | ~23.5 MB | ~1.56 GB/s | 12.5% | 3.8 × 10^6 |
    | posit<8,0,noquire> | ~91% | ~47 MB | ~3.1 GB/s | 25% | 3.8 × 10^6 |
    | posit<8,0,quire> | ~91% | ~47 MB | ~3.1 GB/s | 25% | 3.8 × 10^6 |
    | posit<16,0,quire> | ~91% | ~94 MB | ~6.2 GB/s | 50% | 3.8 × 10^6 |
    | ieee<4,1> | ~10% | ~23.5 MB | ~1.56 GB/s | 12.5% | 3.8 × 10^6 |
    | ieee<8,4> | ~50% | ~47 MB | ~3.1 GB/s | 25% | 3.8 × 10^6 |
    | ieee<16,5> | ~91% | ~94 MB | ~6.2 GB/s | 50% | 3.8 × 10^6 |
    | ieee<32,8> | ~91% | ~189 MB | ~12.5 GB/s | 100% | 3.8 × 10^6 |
29. Ideal Hardware (12.5 GB/s) / fictive max MLPs

    | Configuration | Accuracy | Mem alloc size | # MLPs | FPS |
    |---|---|---|---|---|
    | posit<4,0,quire> | ~91% | ~23.5 MB | 128 | **30.6 × 10^6** |
    | posit<8,0,noquire> | ~91% | ~47 MB | 64 | 15.3 × 10^6 |
    | posit<8,0,quire> | ~91% | ~47 MB | 64 | 15.3 × 10^6 |
    | posit<16,0,quire> | ~91% | ~94 MB | 32 | 7.65 × 10^6 |
    | ieee<4,1> | ~10% | ~23.5 MB | 128 | 30.6 × 10^6 |
    | ieee<8,4> | ~50% | ~47 MB | 64 | 15.3 × 10^6 |
    | ieee<16,5> | ~91% | ~94 MB | 32 | **7.65 × 10^6** |
    | ieee<32,8> | ~91% | ~189 MB | 16 | 3.8 × 10^6 |
30. A bit of color: floorplanning of 16 MLP<8,0,QUIRE>
32. Conclusions
    • CAPI+SNAP allow very fast implementation of RTL logic in a working and powerful system
    • We built a framework to evaluate the impact of arithmetic
      • Only posits
    • Early results look promising
      • The evaluation needs to go further (especially with IEEE floating point)
    • Same accuracy with 4× fewer bits
      • 4× less memory storage
      • 4× less power consumption
      • 4× less communication time
    • So: “Communication dominates arithmetic less”
    • Lack of data concerning the power consumption of the system
33. Thank you. louis.ledoux@bsc.es
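34. Example of posit decoding
    • $x = \begin{cases} 0 & \text{when } 0\ldots0 \\ \pm\infty & \text{when } 10\ldots0 \\ (-1)^{s} \times useed^{k} \times 2^{e} \times \left(1 + \frac{f_{m-1}\ldots f_0}{2^{m}}\right) & \text{otherwise} \end{cases}$
    • where $useed = 2^{2^{es}}$, $m$ is the fraction width, and $k$ is determined by the regime run length
    • Chosen posit: posit<5,1>(0b00101)
    • $k = -1$ because a single regime bit 0 is followed by the terminating (opposite) regime bit 1
    • $x = (-1)^{0} \times \left(2^{2^{1}}\right)^{-1} \times 2^{0} \times \left(1 + \frac{1}{2^{1}}\right)$
    • $x = 1 \times \frac{1}{4} \times 1 \times \frac{3}{2}$
    • $x = \frac{3}{8}$
    Pictures from: Gustafson, John L., and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’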
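The decoding rule on this slide translates fairly directly into code. The following Python sketch is one possible reading of it for a generic posit<N,ES>, using exact fractions; the function name `decode_posit` and its signature are hypothetical, and the exception pattern 10...0 (±∞ on the slide) is returned as `None` here.

```python
from fractions import Fraction
from typing import Optional

def decode_posit(bits: int, n: int, es: int) -> Optional[Fraction]:
    """Decode an n-bit posit with es exponent bits into an exact Fraction."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None                              # pattern 10...0: the exception value
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask                    # negative posits are stored in two's complement
    body = bits & ((1 << (n - 1)) - 1)           # the n-1 bits after the sign
    first = (body >> (n - 2)) & 1
    run, i = 0, n - 2
    while i >= 0 and ((body >> i) & 1) == first: # count the run of identical regime bits
        run += 1
        i -= 1
    k = run - 1 if first else -run
    i -= 1                                       # skip the terminating regime bit, if any
    remaining = max(i + 1, 0)                    # bits left for exponent and fraction
    rest = body & ((1 << remaining) - 1)
    exp_bits = min(es, remaining)                # the exponent may be truncated
    e = (rest >> (remaining - exp_bits)) << (es - exp_bits)
    m = remaining - exp_bits                     # fraction width
    f = rest & ((1 << m) - 1)
    # useed^k * 2^e = 2^(k * 2^es + e)
    value = Fraction(2) ** (k * (1 << es) + e) * (1 + Fraction(f, 1 << m))
    return -value if sign else value

if __name__ == "__main__":
    print(decode_posit(0b00101, n=5, es=1))      # the slide's worked example
```

For the slide's example, posit<5,1>(0b00101), this yields k = -1, e = 0 and a one-bit fraction equal to 1, which gives 3/8 as in the hand calculation.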
35. Accuracy / values distribution
36. Posits now
    • Rex Computing
    • LLNL showed good performance for LULESH and Euler2D
37. Overall system performance
    • Baseline: ≈ 12 GB/s with a pipelined loopback @250 MHz and 512b words
    • To saturate the PCIe link we use data parallelism, 1 MLP per picture
      • Can classify x pictures independently
      • For posit<8,0>: x = 512/8 = 64 MLPs
      • For posit<4,0>: x = 512/4 = 128 MLPs
    • Frames per second
      • $fps(n) \approx \frac{12\ \text{GB/s}}{784 \times \frac{n}{8}\ \text{B/frame}}$
      • $fps(4) \approx \mathbf{30.6 \times 10^{6}}$
      • $fps(8) \approx \mathbf{15.3 \times 10^{6}}$
      • Time spent in PCI-e transfer divided by 2
      • Allocation size on host DDR divided by 2
    • Latest results show that half-precision IEEE float is needed to maintain 90% accuracy
      • $fps(16) \approx 7.6 \times 10^{6}$
    “Communication dominates arithmetic less”
38. 1 MLP

    | Configuration | Accuracy | Mem alloc size | Throughput | Bandwidth occupancy | FPS |
    |---|---|---|---|---|---|
    | posit<4,0,quire> | ~91% | ~23.5 MB | ~0.096 GB/s | 0.8% | 2.4 × 10^5 |
    | posit<8,0,noquire> | ~91% | ~47 MB | ~0.19 GB/s | 1.6% | 2.4 × 10^5 |
    | posit<8,0,quire> | ~91% | ~47 MB | ~0.19 GB/s | 1.6% | 2.4 × 10^5 |
    | posit<16,0,quire> | ~91% | ~94 MB | ~0.39 GB/s | 3.2% | 2.4 × 10^5 |
    | ieee<4,1> | ~10% | ~23.5 MB | ~0.096 GB/s | 0.8% | 2.4 × 10^5 |
    | ieee<8,4> | ~50% | ~47 MB | ~0.19 GB/s | 1.6% | 2.4 × 10^5 |
    | ieee<16,5> | ~91% | ~94 MB | ~0.39 GB/s | 3.6% | 2.4 × 10^5 |
    | ieee<32,8> | ~91% | ~189 MB | ~0.8 GB/s | 6.4% | 2.4 × 10^5 |