Survey of approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories. It is interesting to see how approximation techniques can help solve real-life problems with better efficiency and lower cost!
Questions? krahman@ucdavis.edu.
1. Approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories
Presented by: K M Sabidur Rahman
Date: Apr 28, 2014
2. Outline
• Approximate Computing
• Neural Acceleration for General-Purpose Approximate Programs
• Approximate Storage in Solid-State Memories
• Paraprox: Pattern-Based Approximation for Data Parallel Applications
3. Approximate Computing
• Applicable where some degree of variation or error is
acceptable
• Example: Video processing
• Loss of accuracy is permissible
• Better performance from doing less work
• Low power consumption
5. Approximate Computing
• Companies dealing with huge data sets are interested in more
efficient data processing, even with some loss of accuracy
6. Categorization of approximation
• Programmer-based: the programmer writes different
approximate versions of a program and a runtime system
decides which version to run.
• Hardware-based: hardware modifications such as imprecise
arithmetic units, register files, or accelerators. Cannot be
readily utilized without manufacturing new hardware.
• Software-based: Approximation is done on the software level.
Each of these solutions works only for a small set of
applications.
7. Neural Acceleration for General-Purpose Approximate Programs
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze and Doug Burger
8. Basic concept
A learning-based approach
Select and train a neural network to mimic a region of code
After the learning phase, the compiler replaces the original
code region with an invocation of the trained neural network
"NPU": a low-power accelerator tightly coupled to the
processor pipeline to accelerate small code regions.
9. Challenges for effective trainable accelerators
• A learning algorithm: to accurately and efficiently mimic
imperative code.
• A language and compilation framework: to transform regions
of imperative code to neural network evaluations.
• An architectural interface: to call a neural processing unit
(NPU) in place of the original code regions
10. Neural Acceleration
• Annotate an approximate program component
• Compile the program
• Train a neural network
• Execute on a fast Neural Processing Unit (NPU)
13. Code Observation
• Compiler observes the behavior of the candidate code region
by logging its inputs and outputs
• The logged input–output pairs constitute the training and
validation data for the next step
• Compiler uses the collected input–output data to configure
and train a neural network that mimics the candidate region
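The observe-train-replace workflow described above can be sketched in Python. As a stand-in for the neural network, a closed-form linear least-squares fit is trained on the logged input-output pairs; `candidate_region` is a hypothetical annotated function:

```python
import random

def candidate_region(x):
    # Hypothetical imperative code region annotated as approximable.
    return 2.0 * x + 1.0

# Code observation: log input-output pairs while running test inputs.
inputs = [random.uniform(0.0, 10.0) for _ in range(100)]
pairs = [(x, candidate_region(x)) for x in inputs]

# Training: fit a surrogate to the logged pairs. A closed-form linear
# least-squares fit stands in for the neural-network training step.
n = len(pairs)
sx = sum(x for x, _ in pairs)
sy = sum(y for _, y in pairs)
sxx = sum(x * x for x, _ in pairs)
sxy = sum(x * y for x, y in pairs)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

def surrogate(x):
    # Execution: this cheap evaluation replaces the original region.
    return slope * x + intercept
```

Because the toy region is exactly linear, the surrogate reproduces it almost perfectly; the real compiler trains a neural network precisely because most candidate regions are not this simple.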
14. Execution
• The transformed program begins execution on the main core
and configures the NPU.
• The NPU is invoked to perform a neural network evaluation in
lieu of executing the original code region.
• Invoking the NPU is faster and more energy-efficient than
executing the original code region.
19. Architecture Design for NPU Acceleration
The CPU–NPU interface consists of three queues:
• sending and retrieving the configuration
• sending the inputs and
• retrieving the neural network’s outputs.
20. Architecture Design for NPU Acceleration
The ISA is extended with four instructions to access the queues:
• enq.c %r: enqueues the value of register r into the config FIFO.
• deq.c %r: dequeues a configuration value from the config FIFO into register r.
• enq.d %r: enqueues the value of register r into the input FIFO.
• deq.d %r: dequeues the head of the output FIFO into register r.
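A software sketch of the three-queue interface, assuming these illustrative method names; the enqueue/dequeue pairs mirror the four ISA instructions, and the NPU evaluation is a placeholder that simply doubles its input:

```python
from collections import deque

class NPUInterface:
    """Illustrative model of the three-queue CPU-NPU interface."""

    def __init__(self):
        self.config_fifo = deque()  # written by enq.c, read by deq.c
        self.input_fifo = deque()   # written by enq.d
        self.output_fifo = deque()  # read by deq.d

    def enq_c(self, value):
        # enq.c %r: enqueue a configuration word.
        self.config_fifo.append(value)

    def deq_c(self):
        # deq.c %r: dequeue a configuration word.
        return self.config_fifo.popleft()

    def enq_d(self, value):
        # enq.d %r: enqueue an input; the "neural network" here is a
        # placeholder that doubles the input and queues the output.
        self.input_fifo.append(value)
        self.output_fifo.append(2 * self.input_fifo.popleft())

    def deq_d(self):
        # deq.d %r: dequeue the head of the output FIFO.
        return self.output_fifo.popleft()
```

In this toy model, `npu.enq_d(3)` followed by `npu.deq_d()` returns 6; in the real design the queues decouple the core from an asynchronous hardware NPU.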
27. Approximate Storage in Solid-State Memories
Adrian Sampson, Jacob Nelson, Karin Strauss and Luis Ceze
28. Basic concept
• Mechanisms to enable applications to store data
approximately
• Improved performance, lifetime, or density of solid-state
memories
29. Two techniques
• Reduced-precision writes in multi-level phase-change memory
cells
• Use of blocks with failed bits to store approximate data
• Reduced-precision writes in multi-level phase-change memory
cells can be 1.7x faster on average
• Failed blocks can improve array lifetime by 23% on average
with quality loss under 10%
30. INTERFACES FOR APPROXIMATE STORAGE
• Approximate storage augments memory modules with
software-visible precision modes.
• When an application needs strict data fidelity, it uses
traditional precise storage; the memory then guarantees a
low error rate when recovering the data.
• When the application can tolerate occasional errors in some
data, it uses the memory’s approximate mode, in which data
recovery errors may occur with non-negligible probability.
31. Phase change memory (PCM)
• Merits: non-volatile, almost as fast as DRAM, more scalable
and faster than flash
• Limitations: needs more time and energy to protect against
errors; cells wear out over time and can no longer be used for
precise data storage.
32. Approximate storage in PCM
• PCM works by storing an analog value (resistance) and
quantizing it to expose digital storage.
• A larger number of levels per cell requires more time and
energy to access.
• Approximation improves performance and efficiency
34. Multi-Level Cell Model
• The shaded areas are the target regions for writes to each
level
• Unshaded areas are guard bands.
• The curves show the probability of reading a given analog
value after writing one of the levels.
• Approximate MLCs decrease guard bands so the probability
distributions overlap.
• Goal is to increase density or performance at the cost of
occasional digital-domain storage errors.
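The guard-band trade-off can be simulated with a toy model (all parameters are illustrative): write noise is Gaussian around each level's target, and a larger noise-to-band ratio stands in for the narrower guard bands of an approximate MLC:

```python
import random

def mlc_write_read(level, num_levels=4, noise_sd=0.01):
    """Write one cell level and read it back under analog write noise.
    Levels target the centre of equal bands in [0, 1); reads quantize
    the analog value back to a level."""
    band = 1.0 / num_levels
    target = (level + 0.5) * band
    analog = target + random.gauss(0.0, noise_sd)
    analog = min(max(analog, 0.0), 1.0 - 1e-12)  # clamp to valid range
    return int(analog / band)

def error_rate(num_levels, noise_sd, trials=2000):
    # Fraction of write/read round trips that return the wrong level.
    errors = 0
    for _ in range(trials):
        level = random.randrange(num_levels)
        if mlc_write_read(level, num_levels, noise_sd) != level:
            errors += 1
    return errors / trials
```

With a small noise-to-band ratio (the precise configuration) errors are rare; widening the noise relative to the band, as narrowed guard bands effectively do, makes digital-domain errors non-negligible.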
35. Memory Interface
• MLC blocks can be made precise or approximate by adjusting
the target threshold of write operations.
• The memory array must know which threshold value to use
for each write operation.
• Memory interface extended to include precision flags
• Read operations are identical for approximate and precise
memory
36. USING FAILED MEMORY CELLS
• Use blocks with exhausted error-correction resources to store
approximate data
• A value stored in a particular failed block will consistently
exhibit bit errors in the same positions
37. Prioritized Bit Correction
• Example: the low-order mantissa bits of a floating-point number.
• Correct the failed bits that appear in high-order positions within
words and leave the lowest-order failed bits uncorrected.
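The effect of leaving low-order mantissa failures uncorrected can be illustrated on a 32-bit float. Zeroing the failed bits is a simplification (real failed cells read back arbitrary values), and the bit count is an assumption:

```python
import struct

def read_with_failed_low_bits(x, failed_bits=8):
    """Model a float stored in a block whose lowest-order mantissa bits
    have failed and were left uncorrected (zeroed here for simplicity);
    higher-order bits are assumed to be corrected."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= ~((1 << failed_bits) - 1)  # clear low-order mantissa bits
    (approx,) = struct.unpack("<f", struct.pack("<I", bits))
    return approx
```

Clearing 8 of the 23 mantissa bits of a float32 bounds the relative error at roughly 2^-15, which is why prioritizing correction of the high-order positions keeps quality loss small.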
38. Memory Interface
• Unlike with the approximate MLC technique, software has no
control over blocks’ precision state.
• To permit safe allocation of approximate and precise data, the
memory must inform software of the locations of approximate
(i.e., failed) blocks.
• As a block fails, the OS adds it to a pool of approximate
blocks.
• Memory allocators consult this set of approximate blocks
when laying out data in the memory.
• While approximate data can be stored in any block, precise
data must be allocated in memory without failures.
39. Benchmarks
• The main-memory applications: Java programs annotated
using EnerJ, an approximation-aware type system which
marks some data as approximate and leaves other data
precise.
• The persistent-storage benchmarks are static data sets that
can be stored 100% approximately
• Applications: fft, jmeint, lu, mc, raytr., smm, sor, zxing
43. Paraprox
• Pattern-specific approximation methods
• Identify different patterns commonly found in data
parallel workloads
• Use specialized approximation optimization for each
pattern
• Write software once and use it on a variety of
processors
• Provide knobs to control the output quality
45. Paraprox framework
• Paraprox detects the patterns
• Generates approximate kernels with different tuning
parameters
• The runtime profiles the kernels and tunes the parameters for
the best performance.
• If the user-defined target output quality (TOQ) is violated, the
runtime system will adjust by
• retuning the parameters and/or
• selecting a less aggressive approximate kernel for the next execution.
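A minimal sketch of this TOQ-driven selection loop; the kernel names and profiled qualities are hypothetical, and the real runtime measures quality by profiling actual executions rather than a lookup:

```python
def select_kernel(kernels, measure_quality, toq):
    """Return the most aggressive approximate kernel whose measured
    output quality meets the target output quality (TOQ). `kernels`
    is ordered most-aggressive first, with the precise kernel last."""
    for kernel in kernels:
        if measure_quality(kernel) >= toq:
            return kernel
    return kernels[-1]  # the precise kernel always satisfies the TOQ

# Hypothetical kernel variants and profiled output qualities.
variants = ["half-table", "full-table", "precise"]
profiled_quality = {"half-table": 0.72, "full-table": 0.93, "precise": 1.0}
```

For example, with a TOQ of 0.90 the loop skips the aggressive variant and selects the milder one; raising the TOQ forces a fallback toward the precise kernel, which mirrors the retune-or-fall-back behavior described above.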
48. Approximation Optimizations
• Map and scatter/gather patterns: approximate memoization
• Replaces a function call with a query into a lookup table which
returns a pre-computed result
• Pre-compute the output of the map or scatter/gather function
for a number of representative input sets offline.
• During runtime, the launched kernel’s threads use this lookup
table to find the output for all input values.
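A sketch of approximate memoization for a map-pattern kernel, using `math.sin` as a stand-in for an expensive pure function; the table size and input range are illustrative tuning knobs:

```python
import math

TABLE_SIZE = 64            # tuning knob: bigger table, better quality
LO, HI = 0.0, 2 * math.pi  # assumed input range of the kernel

def kernel_fn(x):
    # Stand-in for an expensive pure map/scatter-gather function.
    return math.sin(x)

# Offline: pre-compute outputs at the centre of each quantization bucket.
_step = (HI - LO) / TABLE_SIZE
_table = [kernel_fn(LO + (i + 0.5) * _step) for i in range(TABLE_SIZE)]

def kernel_approx(x):
    # Runtime: replace the function call with a table lookup on the
    # quantized input.
    i = min(int((x - LO) / _step), TABLE_SIZE - 1)
    return _table[i]
```

The maximum error is bounded by how much the function can change within one bucket, so the table size directly trades quality for table memory and pre-computation time.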
50. Approximate Memoization
• Identify candidate functions
• Find the table size
• Determine qi for each input
• Check for quality; if not satisfied, go back to step 2.
• Fill the Table
• Execution
51. Stencil and Partition
• 70% of each image's pixels differ by less than 10% from their
neighbors.
• Paraprox assumes that adjacent elements in the input array
are similar in value.
• Rather than access all neighbors within a tile, Paraprox
accesses only a subset of them and assumes the rest of the
neighbors have the same value
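As a sketch, a precise 3x3 mean stencil can be compared with a row-based approximation that reads only each tile's centre row; on smoothly varying data the two outputs stay close (the image contents here are synthetic):

```python
def stencil_exact(img):
    """Precise 3x3 mean filter over interior pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(img[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total / 9.0
    return out

def stencil_center_row(img):
    """Row-based approximation: read only the tile's centre row and
    assume the unread rows hold the same values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y][x - 1] + img[y][x] + img[y][x + 1]) / 3.0
    return out
```

The approximation performs one third of the memory accesses per tile; its error depends on how strongly the image varies vertically, which is exactly the neighbor-similarity assumption above.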
54. Approximation of tile
• Center-based approach
• Row-based approximation scheme
• Column-based approximation scheme
55. Reduction
• Paraprox aims to predict the final result by computing the
reduction of a subset of the input data
• The data is assumed to be distributed uniformly, so a subset
of the data can provide a good representation of the entire
array
• Some reductions need adjustment; e.g., a sum computed over
half the data must be scaled up accordingly
57. • For example, instead of finding the minimum of the original
array, Paraprox finds the minimum within one half of the array
and returns it as the approximate result.
• If the data in both subarrays have similar distributions, the
minimum of these subarrays will be close to each other and
approximation error will be negligible.
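A sketch of this sampled reduction for min; the 50% sample fraction is one possible setting of the quality knob:

```python
import random

def approx_min(data, sample_frac=0.5):
    """Reduce only a prefix covering `sample_frac` of the input and
    return that partial result as the approximate reduction."""
    k = max(1, int(len(data) * sample_frac))
    return min(data[:k])

# Synthetic uniformly distributed input for illustration.
random.seed(0)
data = [random.random() for _ in range(10000)]
```

For uniformly distributed data the sampled minimum lands close to the true minimum; for skewed data, or for reductions like sums that need rescaling, the runtime must adjust the result or fall back to a less aggressive kernel.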
58. Scan
• Paraprox assumes that differences between elements in the
input array are similar to those in other partitions of the
same input array.
• Parallel implementations of scan patterns break the input
array into sub-arrays and compute the scan result for each of
them.
60. Scan : Implementation
A data parallel implementation of the scan pattern has
three phases:
• Phase I scans each subarray.
• Phase II scans the sum of all subarrays.
• Phase III then adds the result of Phase II to each
corresponding subarray in the partial scan to generate the
final result.
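The three phases can be sketched as a blocked inclusive prefix sum (simulated serially here; the block size is illustrative):

```python
def blocked_scan(data, block=4):
    """Inclusive prefix sum computed in the three phases described
    above (each phase is data parallel; simulated serially here)."""
    blocks = [data[i:i + block] for i in range(0, len(data), block)]
    # Phase I: scan each subarray independently.
    partial = []
    for b in blocks:
        acc, scanned = 0, []
        for v in b:
            acc += v
            scanned.append(acc)
        partial.append(scanned)
    # Phase II: scan the per-block sums to get each block's offset.
    offsets, acc = [], 0
    for scanned in partial:
        offsets.append(acc)
        acc += scanned[-1]
    # Phase III: add each block's offset to its partial scan.
    return [v + off
            for scanned, off in zip(partial, offsets)
            for v in scanned]
```

Paraprox's assumption that element differences look similar across partitions lets it approximate Phase II from a subset of block sums instead of scanning all of them.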