Survey of approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories. It is interesting to see how approximation techniques can help solve real-life problems with better efficiency and lower cost!
Questions? krahman@ucdavis.edu.
1. Approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories
Presented by: K M Sabidur Rahman
Date: Apr 28, 2014
2. Outline
• Approximate Computing
• Neural Acceleration for General-Purpose Approximate Programs
• Approximate Storage in Solid-State Memories
• Paraprox: Pattern-Based Approximation for Data Parallel Applications
3. Approximate Computing
• Applicable where some degree of variation or error is
acceptable
• Example: Video processing
• Loss of accuracy is permissible
• Better performance from doing less work
• Low power consumption
5. Approximate Computing
• Companies dealing with huge data sets are interested in more
efficient data processing, even with some loss of accuracy
6. Categorization of approximation
• Programmer-based: the programmer writes different
approximate versions of a program and a runtime system
decides which version to run.
• Hardware-based: hardware modifications such as imprecise
arithmetic units, register files, or accelerators. Cannot be
readily utilized without manufacturing new hardware.
• Software-based: Approximation is done on the software level.
Each of these solutions works only for a small set of
applications.
7. Neural Acceleration for General-Purpose Approximate Programs
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze and Doug Burger
8. Basic concept
A learning-based approach
Select and train a neural network to mimic a region of code
After the learning phase, the compiler replaces the original
code region with an invocation of the trained neural network
"NPU": a low-power accelerator tightly coupled to the
processor pipeline to accelerate small code regions.
9. Challenges for effective trainable accelerators
• A learning algorithm: to accurately and efficiently mimic
imperative code.
• A language and compilation framework: to transform regions
of imperative code to neural network evaluations.
• An architectural interface: to call a neural processing unit
(NPU) in place of the original code regions
10. Neural Acceleration
• Annotate an approximate program component
• Compile the program
• Train a neural network
• Execute on a fast Neural Processing Unit (NPU)
13. Code Observation
• Compiler observes the behavior of the candidate code region
by logging its inputs and outputs
• The logged input–output pairs constitute the training and
validation data for the next step
• Compiler uses the collected input–output data to configure
and train a neural network that mimics the candidate region
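The observe-train-replace workflow described above can be sketched in Python. As a stand-in for the neural network, a closed-form linear least-squares fit is trained on the logged input-output pairs; `candidate_region` is a hypothetical annotated function:

```python
import random

def candidate_region(x):
    # Hypothetical imperative code region annotated as approximable.
    return 2.0 * x + 1.0

# Code observation: log input-output pairs while running test inputs.
inputs = [random.uniform(0.0, 10.0) for _ in range(100)]
pairs = [(x, candidate_region(x)) for x in inputs]

# Training: fit a surrogate to the logged pairs. A closed-form linear
# least-squares fit stands in for the neural-network training step.
n = len(pairs)
sx = sum(x for x, _ in pairs)
sy = sum(y for _, y in pairs)
sxx = sum(x * x for x, _ in pairs)
sxy = sum(x * y for x, y in pairs)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

def surrogate(x):
    # Execution: this cheap evaluation replaces the original region.
    return slope * x + intercept
```

Because the toy region is exactly linear, the surrogate reproduces it almost perfectly; the real compiler trains a neural network precisely because most candidate regions are not this simple.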
14. Execution
• The transformed program begins execution on the main core
and configures the NPU.
• The NPU is invoked to perform a neural network evaluation in
lieu of executing the original code region.
• Invoking the NPU is faster and more energy-efficient than
executing the original code region.
19. Architecture Design for NPU Acceleration
The CPU–NPU interface consists of three queues:
• sending and retrieving the configuration
• sending the inputs and
• retrieving the neural network’s outputs.
20. Architecture Design for NPU Acceleration
The ISA is extended with four instructions to access the queues:
• enq.c %r: enqueues the value of register r into the config FIFO.
• deq.c %r: dequeues a configuration value from the config FIFO into register r.
• enq.d %r: enqueues the value of register r into the input FIFO.
• deq.d %r: dequeues the head of the output FIFO into register r.
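A software sketch of the three-queue interface, assuming these illustrative method names; the enqueue/dequeue pairs mirror the four ISA instructions, and the NPU evaluation is a placeholder that simply doubles its input:

```python
from collections import deque

class NPUInterface:
    """Illustrative model of the three-queue CPU-NPU interface."""

    def __init__(self):
        self.config_fifo = deque()  # written by enq.c, read by deq.c
        self.input_fifo = deque()   # written by enq.d
        self.output_fifo = deque()  # read by deq.d

    def enq_c(self, value):
        # enq.c %r: enqueue a configuration word.
        self.config_fifo.append(value)

    def deq_c(self):
        # deq.c %r: dequeue a configuration word.
        return self.config_fifo.popleft()

    def enq_d(self, value):
        # enq.d %r: enqueue an input; the "neural network" here is a
        # placeholder that doubles the input and queues the output.
        self.input_fifo.append(value)
        self.output_fifo.append(2 * self.input_fifo.popleft())

    def deq_d(self):
        # deq.d %r: dequeue the head of the output FIFO.
        return self.output_fifo.popleft()
```

In this toy model, `npu.enq_d(3)` followed by `npu.deq_d()` returns 6; in the real design the queues decouple the core from an asynchronous hardware NPU.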
27. Approximate Storage in Solid-State Memories
Adrian Sampson, Jacob Nelson, Karin Strauss and Luis Ceze
28. Basic concept
• Mechanisms to enable applications to store data
approximately
• Improved performance, lifetime, or density of solid-state
memories
29. Two techniques
• Reduced-precision writes in multi-level phase-change memory
cells
• Use of blocks with failed bits to store approximate data
• Reduced-precision writes in multi-level phase-change memory
cells can be 1.7x faster on average
• Failed blocks can improve array lifetime by 23% on average
with quality loss under 10%
30. INTERFACES FOR APPROXIMATE STORAGE
• Approximate storage augments memory modules with
software-visible precision modes.
• When an application needs strict data fidelity, it uses
traditional precise storage; the memory then guarantees a
low error rate when recovering the data.
• When the application can tolerate occasional errors in some
data, it uses the memory’s approximate mode, in which data
recovery errors may occur with non-negligible probability.
31. Phase change memory (PCM)
• Merits: non-volatile, almost as fast as DRAM, more scalable
and faster than flash
• Limitations: needs more time and energy to protect against
errors; cells wear out over time and can no longer be used for
precise data storage.
32. Approximate storage in PCM
• PCM works by storing an analog value (resistance) and
quantizing it to expose digital storage.
• A larger number of levels per cell requires more time and
energy to access.
• Approximation improves performance and efficiency
34. Multi-Level Cell Model
• The shaded areas are the target regions for writes to each
level
• Unshaded areas are guard bands.
• The curves show the probability of reading a given analog
value after writing one of the levels.
• Approximate MLCs decrease guard bands so the probability
distributions overlap.
• Goal is to increase density or performance at the cost of
occasional digital-domain storage errors.
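The guard-band trade-off can be simulated with a toy model (all parameters are illustrative): write noise is Gaussian around each level's target, and a larger noise-to-band ratio stands in for the narrower guard bands of an approximate MLC:

```python
import random

def mlc_write_read(level, num_levels=4, noise_sd=0.01):
    """Write one cell level and read it back under analog write noise.
    Levels target the centre of equal bands in [0, 1); reads quantize
    the analog value back to a level."""
    band = 1.0 / num_levels
    target = (level + 0.5) * band
    analog = target + random.gauss(0.0, noise_sd)
    analog = min(max(analog, 0.0), 1.0 - 1e-12)  # clamp to valid range
    return int(analog / band)

def error_rate(num_levels, noise_sd, trials=2000):
    # Fraction of write/read round trips that return the wrong level.
    errors = 0
    for _ in range(trials):
        level = random.randrange(num_levels)
        if mlc_write_read(level, num_levels, noise_sd) != level:
            errors += 1
    return errors / trials
```

With a small noise-to-band ratio (the precise configuration) errors are rare; widening the noise relative to the band, as narrowed guard bands effectively do, makes digital-domain errors non-negligible.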
35. Memory Interface
• MLC blocks can be made precise or approximate by adjusting
the target threshold of write operations.
• The memory array must know which threshold value to use
for each write operation.
• Memory interface extended to include precision flags
• Read operations are identical for approximate and precise
memory
36. USING FAILED MEMORY CELLS
• Use blocks with exhausted error-correction resources to store
approximate data
• A value stored in a particular failed block will consistently
exhibit bit errors in the same positions
37. Prioritized Bit Correction
• Example: the low-order mantissa bits of a floating-point number.
• Correct the failed bits that appear in high-order positions within
words and leave the lowest-order failed bits uncorrected.
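The effect of leaving low-order mantissa failures uncorrected can be illustrated on a 32-bit float. Zeroing the failed bits is a simplification (real failed cells read back arbitrary values), and the bit count is an assumption:

```python
import struct

def read_with_failed_low_bits(x, failed_bits=8):
    """Model a float stored in a block whose lowest-order mantissa bits
    have failed and were left uncorrected (zeroed here for simplicity);
    higher-order bits are assumed to be corrected."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= ~((1 << failed_bits) - 1)  # clear low-order mantissa bits
    (approx,) = struct.unpack("<f", struct.pack("<I", bits))
    return approx
```

Clearing 8 of the 23 mantissa bits of a float32 bounds the relative error at roughly 2^-15, which is why prioritizing correction of the high-order positions keeps quality loss small.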
38. Memory Interface
• Unlike with the approximate MLC technique, software has no
control over blocks’ precision state.
• To permit safe allocation of approximate and precise data, the
memory must inform software of the locations of approximate
(i.e., failed) blocks.
• As a block fails, the OS adds it to a pool of approximate
blocks.
• Memory allocators consult this set of approximate blocks
when laying out data in the memory.
• While approximate data can be stored in any block, precise
data must be allocated in memory without failures.
39. Benchmarks
• The main-memory applications: Java programs annotated
using EnerJ, an approximation-aware type system which
marks some data as approximate and leaves other data
precise.
• The persistent-storage benchmarks are static data sets that
can be stored 100% approximately
• Applications: fft, jmeint, lu, mc, raytr., smm, sor, zxing
43. Paraprox
• Pattern-specific approximation methods
• Identify different patterns commonly found in data
parallel workloads
• Use specialized approximation optimization for each
pattern
• Write software once and use it on a variety of
processors
• Provide knobs to control the output quality
45. Paraprox framework
• Paraprox detects the patterns
• Generates approximate kernels with different tuning
parameters
• The runtime profiles the kernels and tunes the parameters for
the best performance.
• If the user-defined target output quality (TOQ) is violated, the
runtime system will adjust by
• retuning the parameters and/or
• selecting a less aggressive approximate kernel for the next execution.
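A minimal sketch of this TOQ-driven selection loop; the kernel names and profiled qualities are hypothetical, and the real runtime measures quality by profiling actual executions rather than a lookup:

```python
def select_kernel(kernels, measure_quality, toq):
    """Return the most aggressive approximate kernel whose measured
    output quality meets the target output quality (TOQ). `kernels`
    is ordered most-aggressive first, with the precise kernel last."""
    for kernel in kernels:
        if measure_quality(kernel) >= toq:
            return kernel
    return kernels[-1]  # the precise kernel always satisfies the TOQ

# Hypothetical kernel variants and profiled output qualities.
variants = ["half-table", "full-table", "precise"]
profiled_quality = {"half-table": 0.72, "full-table": 0.93, "precise": 1.0}
```

For example, with a TOQ of 0.90 the loop skips the aggressive variant and selects the milder one; raising the TOQ forces a fallback toward the precise kernel, which mirrors the retune-or-fall-back behavior described above.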
48. Approximation Optimizations
• Map and scatter/gather patterns: approximate memoization
• Replaces a function call with a query into a lookup table which
returns a pre-computed result
• Pre-compute the output of the map or scatter/gather function
for a number of representative input sets offline.
• During runtime, the launched kernel’s threads use this lookup
table to find the output for all input values.
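A sketch of approximate memoization for a map-pattern kernel, using `math.sin` as a stand-in for an expensive pure function; the table size and input range are illustrative tuning knobs:

```python
import math

TABLE_SIZE = 64            # tuning knob: bigger table, better quality
LO, HI = 0.0, 2 * math.pi  # assumed input range of the kernel

def kernel_fn(x):
    # Stand-in for an expensive pure map/scatter-gather function.
    return math.sin(x)

# Offline: pre-compute outputs at the centre of each quantization bucket.
_step = (HI - LO) / TABLE_SIZE
_table = [kernel_fn(LO + (i + 0.5) * _step) for i in range(TABLE_SIZE)]

def kernel_approx(x):
    # Runtime: replace the function call with a table lookup on the
    # quantized input.
    i = min(int((x - LO) / _step), TABLE_SIZE - 1)
    return _table[i]
```

The maximum error is bounded by how much the function can change within one bucket, so the table size directly trades quality for table memory and pre-computation time.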
50. Approximate Memoization
• Identify candidate functions
• Find the table size
• Determine qi for each input
• Check for quality; if not satisfied, go back to step 2.
• Fill the Table
• Execution
51. Stencil and Partition
• 70% of each image's pixels differ by less than 10% from their
neighbors.
• Paraprox assumes that adjacent elements in the input array
are similar in value.
• Rather than access all neighbors within a tile, Paraprox
accesses only a subset of them and assumes the rest of the
neighbors have the same value
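As a sketch, a precise 3x3 mean stencil can be compared with a row-based approximation that reads only each tile's centre row; on smoothly varying data the two outputs stay close (the image contents here are synthetic):

```python
def stencil_exact(img):
    """Precise 3x3 mean filter over interior pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(img[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total / 9.0
    return out

def stencil_center_row(img):
    """Row-based approximation: read only the tile's centre row and
    assume the unread rows hold the same values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y][x - 1] + img[y][x] + img[y][x + 1]) / 3.0
    return out
```

The approximation performs one third of the memory accesses per tile; its error depends on how strongly the image varies vertically, which is exactly the neighbor-similarity assumption above.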
54. Approximation of tile
• Center-based approach
• Row-based approximation scheme
• Column-based approximation scheme
55. Reduction
• Paraprox aims to predict the final result by computing the
reduction of a subset of the input data
• The data is assumed to be distributed uniformly, so a subset
of the data can provide a good representation of the entire
array
• Some reductions need adjustment; e.g., a sum computed over
half the data must be scaled up accordingly
57. • For example, instead of finding the minimum of the original
array, Paraprox finds the minimum within one half of the array
and returns it as the approximate result.
• If the data in both subarrays have similar distributions, the
minimum of these subarrays will be close to each other and
approximation error will be negligible.
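A sketch of this sampled reduction for min; the 50% sample fraction is one possible setting of the quality knob:

```python
import random

def approx_min(data, sample_frac=0.5):
    """Reduce only a prefix covering `sample_frac` of the input and
    return that partial result as the approximate reduction."""
    k = max(1, int(len(data) * sample_frac))
    return min(data[:k])

# Synthetic uniformly distributed input for illustration.
random.seed(0)
data = [random.random() for _ in range(10000)]
```

For uniformly distributed data the sampled minimum lands close to the true minimum; for skewed data, or for reductions like sums that need rescaling, the runtime must adjust the result or fall back to a less aggressive kernel.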
58. Scan
• Paraprox assumes that differences between elements in the
input array are similar to those in other partitions of the
same input array.
• Parallel implementations of scan patterns break the input
array into sub-arrays and compute the scan result for each of
them.
60. Scan : Implementation
A data parallel implementation of the scan pattern has
three phases:
• Phase I scans each subarray.
• Phase II scans the sum of all subarrays.
• Phase III then adds the result of Phase II to each
corresponding subarray in the partial scan to generate the
final result.
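The three phases can be sketched as a blocked inclusive prefix sum (simulated serially here; the block size is illustrative):

```python
def blocked_scan(data, block=4):
    """Inclusive prefix sum computed in the three phases described
    above (each phase is data parallel; simulated serially here)."""
    blocks = [data[i:i + block] for i in range(0, len(data), block)]
    # Phase I: scan each subarray independently.
    partial = []
    for b in blocks:
        acc, scanned = 0, []
        for v in b:
            acc += v
            scanned.append(acc)
        partial.append(scanned)
    # Phase II: scan the per-block sums to get each block's offset.
    offsets, acc = [], 0
    for scanned in partial:
        offsets.append(acc)
        acc += scanned[-1]
    # Phase III: add each block's offset to its partial scan.
    return [v + off
            for scanned, off in zip(partial, offsets)
            for v in scanned]
```

Paraprox's assumption that element differences look similar across partitions lets it approximate Phase II from a subset of block sums instead of scanning all of them.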