This talk was presented by Louis Ledoux and Marc Casas at the OpenPOWER Summit EU 2019. The original slides are available at:
https://static.sched.com/hosted_files/opeu19/1a/presentation_louis_ledoux_posit.pdf
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
Ideas behind posit numbers (1)
• Like IEEE-754 floating point, posits are intended to represent real numbers
• The claim: a drop-in replacement for classical floating point [1]
• More precision and dynamic range in many cases
• The main ideas:
• Fewer bits per datum, by increasing the entropy carried by each bit
• Arithmetic driven by the application / use case
• Tackling design flaws of a standard from the 1980s:
• Repeatability / portability
• Laws of algebra (associativity, distributivity)
• Wasted bit patterns (NaNs)
• Etc.
• A standardized Kulisch-like accumulator: the quire
• Enables fused dot products
[1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
Posit representation
• Many posit configurations exist
• Denoted posit<N,ES>, where N is the total width in bits and ES is the exponent field size
• This allows choosing the most suitable configuration for a specific use case
• a.k.a. tapered precision
• Golomb-Rice-style exponent encoding (the regime) [1]
• Variable-length internal fields, decoded in order (see the sketch below):
• sign (1 bit)
• regime
• exponent (if any)
• fraction (if any)
[1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
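To make the variable-length layout concrete, here is a minimal Python decoding sketch (our illustration, not code from the talk; the name decode_posit is ours). It walks the fields in order (sign, regime, exponent, fraction) and returns a float; real hardware stays in fixed point.

    def decode_posit(bits, n, es):
        """Decode an n-bit posit with es exponent bits into a float (sketch)."""
        mask = (1 << n) - 1
        bits &= mask
        if bits == 0:
            return 0.0
        if bits == 1 << (n - 1):
            return float("nan")          # NaR ("Not a Real"): one single pattern
        sign = 1
        if bits >> (n - 1):              # negative: take the two's complement
            sign = -1
            bits = (-bits) & mask
        # Regime: run of identical bits after the sign (Golomb-Rice style).
        pos = n - 2
        first = (bits >> pos) & 1
        run = 0
        while pos >= 0 and ((bits >> pos) & 1) == first:
            run += 1
            pos -= 1
        k = (run - 1) if first else -run
        pos -= 1                         # skip the regime terminator bit
        # Exponent: up to ES bits; bits cut off by the regime count as zero.
        exp = 0
        for _ in range(es):
            exp <<= 1
            if pos >= 0:
                exp |= (bits >> pos) & 1
                pos -= 1
        # Fraction: the remaining bits, with an implicit leading 1.
        frac, fbits = 0, 0
        while pos >= 0:
            frac = (frac << 1) | ((bits >> pos) & 1)
            fbits += 1
            pos -= 1
        scale = k * (1 << es) + exp      # value = (1 + frac) * 2^scale
        return sign * (1 + frac / (1 << fbits)) * 2.0 ** scale

For example, decode_posit(0b01000000, 8, 0) returns 1.0, and decode_posit(0b00100000, 8, 0) returns 0.5 (the value that appears in the fast-sigmoid example later).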
Ideas behind posit numbers (2)
• Fine-tuning (tapering) the arithmetic saves bits on:
• the exponent (dynamic range)
• the mantissa (precision)
• Example: a 32-bit posit could beat a 64-bit IEEE-754 float in a scientific application [2], reducing:
• the amount of cache, memory, and storage
• the required throughput to and from processors
• hardware resources in compute units (still being discussed [1])
• power consumption
[1] Yohann Uguen, Luc Forget, Florent de Dinechin. ‘Evaluating the Hardware Cost of the Posit Number System’. FPL 2019 - 29th International Conference on Field-Programmable Logic and Applications (FPL), Sep 2019, Barcelona, Spain, pp. 1-8. hal-02130912v4
[2] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
Communication dominates arithmetic
• A plethora of microarchitectures share a common trait:
• They are not data/memory-centric; computation is performed far away from the data itself
• As a consequence: energy waste and performance loss [2]
• 62.7% of total system energy is spent on data movement [1]
[1] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu. ‘Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks’. Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018
[2] M. Horowitz. ‘1.1 Computing’s energy problem (and what we can do about it)’. 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10-14, Feb 2014
Problems seen as motivations
• Reducing communications?
• In/near-memory computing
• Quantization
• Transprecision
• Studying the impact of arithmetic on an overall system
• Our focus is AI
• Data volumes keep increasing, hence the interest in neural networks
• Posits are good candidates for neural networks:
• Activation functions and the “golden zone” (1)
• Fast sigmoid (2)
• Large vector dot products (3)
• Little focus so far on the hardware cost of compute units
“How does the denser representation of numbers offered by posits
tackle the aforementioned problems?”
1° Activation functions and the “golden zone”
• IEEE-754:
• Relative accuracy is flat across a huge dynamic range (~80 orders of magnitude)
• This induces low entropy per bit
• Posits [1]:
• Pyramid-shaped accuracy, like a normal/Gaussian distribution
• A consequence of the Golomb-Rice regime encoding
• Matches the distribution of NN weights
• Activation functions allow rescaling values into the golden zone
• This improves the amount of information carried per bit (Shannon entropy per bit):
• More information per datum
• Less communication (PCIe / DDR)
• Less storage; data can stay on chip
• Better compression ratio
[1] Gustafson, John L, and Isaac Yonemoto. ‘Beating Floating Point at Its Own Game: Posit Arithmetic’
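A quick illustration of the pyramid shape (our code, reusing the decode_posit sketch from above): counting the representable posit<8,0> values per binade shows accuracy concentrated around ±1 and tapering toward the extremes.

    import math
    from collections import Counter

    vals = [decode_posit(p, 8, 0) for p in range(256)]
    positives = [v for v in vals if v > 0]   # drops 0 and NaR (NaN)
    binades = Counter(math.floor(math.log2(v)) for v in positives)
    for e in sorted(binades):
        # Patterns whose value lies in [2^e, 2^(e+1)):
        # counts rise to a peak around 1 and taper: 1, 2, ..., 32, 32, ..., 2, 1, 1
        print(f"2^{e:+d}: {binades[e]:2d} patterns")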
2° Fast sigmoid
• IEEE-754 has HW bit tricks (e.g., 0x5F3759DF [1]); so do posits
• σ(x) = 1 / (1 + e^(-x))
• A popular activation function in neural networks
• Approximate sigmoid, directly on the posit bit pattern:
• 1. Negate the sign bit (MSB)
• 2. Logical right shift by 2
• “Et voilà”: a few gates/LUTs and ~1 clock cycle
• Example:
• a = 0 = posit<8,0>(0b00000000)
• 1. tmp = 0b10000000
• 2. res = 0b00100000
• posit<8,0>(0b00100000) = 1/2
[1] https://en.wikipedia.org/wiki/Fast_inverse_square_root
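A sketch of this trick in Python (ours; the function name is an invention, and it reuses decode_posit from above). It operates directly on the 8-bit pattern of a posit<8,0>:

    def fast_sigmoid_posit8(p):
        """Approximate sigmoid: flip the sign bit, then logical shift right by 2."""
        return ((p ^ 0x80) & 0xFF) >> 2

    # sigma(0) = 0.5:  0b00000000 -> 0b10000000 -> 0b00100000
    assert decode_posit(fast_sigmoid_posit8(0b00000000), 8, 0) == 0.5

In hardware this is just an inverter and a shift, which is where the "few gates/LUTs and ~1 clock" claim comes from.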
3° Large vector dot products
• The most common computation in NNs is the MAC (multiply-accumulate) [1]
• A neuron (the basic node) is a MAC unit computing a large vector dot product
• The rounding error is accumulated as well:
• after each product and each accumulation
• after each accumulation only, when an FMA is used
• So the hardware yields an inexact result
• The posit standard proposes an exact MAC unit: the quire
• Rounding is postponed; a single rounding at the end (see the sketch below)
• Suits very low precision (< 8 bits) well
[1] ‘A Dynamic Approach to Accelerate Deep Learning Training’. Submitted to the International Conference on Learning Representations, 2020. Under review.
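A minimal sketch of the quire's principle (ours, not the authors' RTL): keep products and sums exact, then round once at the end. Exact accumulation is emulated here with Fraction, and round_to_bits is a toy stand-in for rounding to a low-precision format:

    import math
    from fractions import Fraction

    def round_to_bits(v, bits=4):
        """Toy rounding to `bits` significant bits (a stand-in for a tiny format)."""
        if v == 0:
            return 0.0
        e = math.floor(math.log2(abs(v)))
        ulp = 2.0 ** (e - bits + 1)
        return round(v / ulp) * ulp

    def dot_rounded(xs, ys):
        """Conventional MAC: round after every product and every accumulation."""
        acc = 0.0
        for x, y in zip(xs, ys):
            acc = round_to_bits(acc + round_to_bits(x * y))
        return acc

    def dot_quire_style(xs, ys):
        """Quire-style: exact products and sums, one single rounding at the end."""
        acc = Fraction(0)
        for x, y in zip(xs, ys):
            acc += Fraction(x) * Fraction(y)
        return round_to_bits(float(acc))

On long dot products the conventional version drifts with every step, while the quire-style version matches the exact result up to its single final rounding.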
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
Results: Resource utilization
Module                 LUTs   FFs
Neuron<8,0,NOQUIRE>     336   126
Decoder<8,0>             27     0
Mult<8,0>                82    32
Adder<8,0>              133    57
Encoder+Round<8,0>       38     0

Module                 LUTs   FFs
Neuron<4,0,QUIRE>       116    72
Decoder<4,0>              3     0
Mult<4,0>                 6     0
Quire19<4,0>             88    56
Encoder+Round<4,0>        7     0

• Both versions (posit<8,0> without quire, posit<4,0> with quire) maintain > 90% accuracy
• The framework allows evaluating all combinations of neurons
• How many bits per datum can we save with an exact accumulator?
Saturate the PCIe link: data parallelism
• Loopback @ 250 MHz with 512-bit words
• ≈ 16 GB/s in theory
• ≈ 12.5 GB/s in practice
• But one MLP consumes only an n-bit word, so we instantiate many of them
• Frames per second, with M the number of instantiated NNs and maxNN = 512 / n:
• fps(n, M) ≈ (M / maxNN) × 12.5 GB/s / (784 × n/8 bytes/frame)
• (one frame is a 784-value input, n bits per value)
• Why saturate the link?
• To show that, for a given link throughput, the lower the precision, the higher the FPS (see the helper below)
• To use the full bandwidth enabled by CAPI+SNAP
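The same formula as a small Python helper (our sketch; the parameter names are ours, with defaults taken from the slides):

    def fps(n_bits, m_nns=None, link_bytes_per_s=12.5e9,
            word_bits=512, frame_values=784):
        """Frames/s when the link is the bottleneck: the fraction of the
        512-bit word actually used (M / maxNN) times the link bandwidth,
        divided by the bytes of one 784-value frame at n bits per value."""
        max_nn = word_bits // n_bits
        m = max_nn if m_nns is None else m_nns
        bytes_per_frame = frame_values * n_bits / 8
        return (m / max_nn) * link_bytes_per_s / bytes_per_frame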
A bit of colors: Floorplanning 16 MLP<8,0,QUIRE>
Outline
• Overview of posit number representation
• Problems and motivations
• Implementation: configurable posit-based AI IPs
• Results
• Conclusions
Conclusions
• CAPI+SNAP allow very fast integration of RTL logic into a working, powerful system
• We built a framework to evaluate the impact of arithmetic
• Posits only, so far
• Early results look promising
• The evaluation needs to go further (especially against IEEE floating point)
• Same accuracy with 4x fewer bits means:
• 4x less memory storage
• 4x less power consumption
• 4x less communication time
• So: “communication dominates arithmetic less”
• Still missing: power-consumption data for the whole system
Posits now
• REX Computing
• LLNL showed good performance for LULESH and Euler2D
Overall system performance
• Baseline: ≈ 12 GB/s with a pipelined loopback @ 250 MHz and 512-bit words
• To saturate the PCIe link we use data parallelism: 1 MLP per picture
• x pictures can be classified independently:
• for posit<8,0>: x = 512/8 = 64 MLPs
• for posit<4,0>: x = 512/4 = 128 MLPs
• Frames per second: fps(n) ≈ 12 GB/s / (784 × n/8 bytes/frame)
• fps(4) ≈ 30.6 × 10^6
• fps(8) ≈ 15.3 × 10^6
• Going from 8 to 4 bits: time spent in PCIe transfers divided by 2
• Allocation size in host DDR also divided by 2
• The latest results show that half-precision IEEE float (16 bits) is needed to maintain 90% accuracy:
• fps(16) ≈ 7.6 × 10^6
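Plugging these cases into the fps helper sketched earlier (with the 12 GB/s baseline):

    for n in (4, 8, 16):
        print(f"fps({n}) ≈ {fps(n, link_bytes_per_s=12e9):.2e}")
    # fps(4) ≈ 3.06e+07, fps(8) ≈ 1.53e+07, fps(16) ≈ 7.65e+06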
“Communication dominates arithmetic less”