Approximation techniques used for
general purpose algorithms, data
parallel applications and solid-state
memories
1
Presented by: K M Sabidur Rahman
Date: Apr 28, 2014
Outline
• Approximate Computing
• Neural Acceleration for General-Purpose Approximate
Programs
• Approximate Storage in Solid-State Memories
• Paraprox: Pattern-Based Approximation for Data Parallel
Applications
2
Approximate Computing
• Applicable where some degree of variation or error is
acceptable
• Example: Video processing
• Loss of accuracy is permissible
• Better performance given less work
• Low power consumption
3
Domains
• Multimedia processing
• Machine learning
• Gaming
• Data mining/analysis
• Financial modeling
• Statistics
4
Approximate Computing
• Companies that handle very large data sets are interested in more
efficient data processing, even at the cost of some accuracy
5
Categorization of approximation
• Programmer-based: the programmer writes different
approximate versions of a program and a runtime system
decides which version to run.
• Hardware-based: hardware modifications such as imprecise
arithmetic units, register files, or accelerators. Cannot be
readily utilized without manufacturing new hardware.
• Software-based: Approximation is done on the software level.
Each of these solutions works only for a small set of
applications.
6
Neural Acceleration for General-
Purpose Approximate Programs
Hadi Esmaeilzadeh, Adrian Sampson, Luis
Ceze and Doug Burger
7
Basic concept
• A learning-based approach
• Select and train a neural network to mimic a region of code
• After the learning phase, the compiler replaces the original
code with an invocation of the trained neural network
• “NPU”: a low-power accelerator tightly coupled to the
processor pipeline to accelerate small code regions
8
Challenges for
effective trainable accelerators
• A learning algorithm: to accurately and efficiently mimic
imperative code.
• A language and compilation framework: to transform regions
of imperative code to neural network evaluations.
• An architectural interface: to call a neural processing unit
(NPU) in place of the original code regions
9
Neural Acceleration
• Annotate an approximate program component
• Compile the program
• Train a neural network
• Execute on a fast Neural Processing Unit (NPU)
10
From annotated code to accelerated
execution on an NPU-augmented core
11
Programming
• The programmer explicitly annotates functions
• This is common practice in the approximate computing literature
12
Code Observation
• Compiler observes the behavior of the candidate code region
by logging its inputs and outputs
• The logged input–output pairs constitute the training and
validation data for the next step
• Compiler uses the collected input–output data to configure
and train a neural network that mimics the candidate region
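
The observation step can be pictured with a small Python sketch. The decorator name `observe`, the `training_log` list and the `luminance` function are hypothetical stand-ins; the real compiler instruments the annotated region itself, and the 75/25 split is only an assumption for illustration.

```python
import functools

training_log = []  # hypothetical store of (inputs, output) pairs


def observe(func):
    """Record every call's inputs and output while running the original code."""
    @functools.wraps(func)
    def wrapper(*args):
        result = func(*args)
        training_log.append((args, result))
        return result
    return wrapper


@observe
def luminance(r, g, b):
    # Stand-in for an annotated, approximable code region.
    return 0.299 * r + 0.587 * g + 0.114 * b


for pixel in [(255, 0, 0), (0, 255, 0), (0, 0, 255), (10, 20, 30)]:
    luminance(*pixel)

# Hold out part of the log for validation (the split ratio is an assumption).
split = int(len(training_log) * 0.75)
training_set, validation_set = training_log[:split], training_log[split:]
```

The logged pairs are all a trainer needs: it never has to look inside the original function.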
13
Execution
• The transformed program begins execution on the main core
and configures the NPU.
• The NPU is invoked to perform a neural network evaluation instead of
executing the original code region.
• Invoking the NPU is faster and more energy-efficient than
executing the original code region.
14
Code Region Criteria
• Hot code
• Approximability
• Well-defined inputs and outputs
15
Original sobel code
16
Parrot transformed code
17
Architecture Design for NPU
Acceleration
18
Architecture Design for NPU
Acceleration
The CPU–NPU interface consists of three queues:
• sending and retrieving the configuration
• sending the inputs and
• retrieving the neural network’s outputs.
19
Architecture Design for NPU
Acceleration
The ISA is extended with four instructions to access the queues:
• enq.c %r: enqueues the value of register r into the config FIFO.
• deq.c %r: dequeues a configuration value from the config FIFO
into register r.
• enq.d %r: enqueues the value of register r into the input FIFO.
• deq.d %r: dequeues the head of the output FIFO into register r.
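
A toy software model may help show how the four instructions fit together. The class `NpuQueues` and its `evaluate` method are hypothetical: the single linear neuron stands in for the real NPU, whose network evaluation is not described at this level in the slides.

```python
from collections import deque


class NpuQueues:
    """Toy model of the three CPU-NPU FIFOs; not the real hardware."""

    def __init__(self):
        self.config = deque()   # configuration FIFO
        self.inputs = deque()   # input FIFO
        self.outputs = deque()  # output FIFO

    def enq_c(self, value):     # models enq.c %r
        self.config.append(value)

    def deq_c(self):            # models deq.c %r
        return self.config.popleft()

    def enq_d(self, value):     # models enq.d %r
        self.inputs.append(value)

    def deq_d(self):            # models deq.d %r
        return self.outputs.popleft()

    def evaluate(self, n_inputs):
        """Hypothetical stand-in for the NPU: one linear neuron."""
        weights = list(self.config)[:n_inputs]
        xs = [self.inputs.popleft() for _ in range(n_inputs)]
        self.outputs.append(sum(w * x for w, x in zip(weights, xs)))


npu = NpuQueues()
for w in (0.5, 0.25):      # configure once
    npu.enq_c(w)
for x in (4.0, 8.0):       # send the inputs of one invocation
    npu.enq_d(x)
npu.evaluate(n_inputs=2)
result = npu.deq_d()       # retrieve the output
```

Configuration is sent once, after which each invocation is only a few enqueues and one dequeue.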
20
Reconfigurable 8-PE NPU
21
A Single processing engine
22
Benchmarks and Experimental
Setup
• Benchmarks: FFT, inverse kinematics, triangle intersection,
JPEG, K-means, Sobel (one hot function annotated in each)
• Experimental Setup: MARSSx86
• Energy model: McPAT and CACTI
23
Results: 2.3x Speedup
24
Results: 3.0x Energy reduction
25
Limitations
• Applicability
• Programmer effort and
• Quality and error control
26
Approximate Storage in Solid-State
Memories
Adrian Sampson, Jacob Nelson, Karin
Strauss and Luis Ceze
27
Basic concept
• Mechanisms to enable applications to store data
approximately
• Improved performance, lifetime, or density of solid-state
memories
28
Two techniques
• Reduced-precision writes in multi-level phase-change memory
cells
• Use of blocks with failed bits to store approximate data
• Reduced-precision writes in multi-level phase-change memory
cells can be 1.7x faster on average
• Reusing failed blocks can improve array lifetime by 23% on average
with quality loss under 10%
29
INTERFACES FOR
APPROXIMATE STORAGE
• Approximate storage augments memory modules with
software-visible precision modes.
• When an application needs strict data fidelity, it uses
traditional precise storage; the memory then guarantees a
low error rate when recovering the data.
• When the application can tolerate occasional errors in some
data, it uses the memory’s approximate mode, in which data
recovery errors may occur with non-negligible probability
30
Phase change memory (PCM)
• Merits: non-volatile, almost as fast as DRAM, more scalable,
faster than flash
• Limitations: Need more time and energy to protect against
errors. Cells wear out over time and can no longer be used for
precise data storage.
31
Approximate storage in PCM
• PCM works by storing an analog value (resistance) and
quantizing it to expose digital storage.
• A larger number of levels per cell requires more time and
energy to access.
• Approximation improves performance and efficiency
32
Multi-Level Cell Model
33
Multi-Level Cell Model
• The shaded areas are the target regions for writes to each
level
• Unshaded areas are guard bands.
• The curves show the probability of reading a given analog
value after writing one of the levels.
• Approximate MLCs decrease guard bands so the probability
distributions overlap.
• Goal is to increase density or performance at the cost of
occasional digital-domain storage errors.
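
The trade-off can be illustrated with a small Monte Carlo sketch. Everything numeric here is an assumption made for illustration: four levels, Gaussian write and read noise, the sigma values, and the program-and-verify write loop. It is not the cell model used in the paper.

```python
import random


def write_cell(level, levels, half_width, rng, write_sigma=0.05):
    """Program-and-verify: retry until the analog value lands in the target region."""
    center = (level + 0.5) / levels
    attempts = 0
    while True:
        attempts += 1
        value = rng.gauss(center, write_sigma)
        if abs(value - center) <= half_width:
            return value, attempts


def read_cell(value, levels, rng, read_sigma=0.02):
    """Add read noise, then quantize the analog value to a digital level."""
    noisy = value + rng.gauss(0.0, read_sigma)
    return min(levels - 1, max(0, int(noisy * levels)))


def simulate(half_width, levels=4, trials=20000, seed=1):
    """Return (error rate, average write attempts) for one target-region width."""
    rng = random.Random(seed)
    errors = total_attempts = 0
    for _ in range(trials):
        level = rng.randrange(levels)
        value, attempts = write_cell(level, levels, half_width, rng)
        total_attempts += attempts
        errors += read_cell(value, levels, rng) != level
    return errors / trials, total_attempts / trials


precise_err, precise_cost = simulate(half_width=0.02)   # wide guard bands
approx_err, approx_cost = simulate(half_width=0.12)     # narrow guard bands
```

Under these assumptions, the wider target region needs fewer write attempts per cell but lets more reads land in a neighboring level.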
34
Memory Interface
• MLC blocks can be made precise or approximate by adjusting
the target threshold of write operations.
• The memory array must know which threshold value to use
for each write operation.
• Memory interface extended to include precision flags
• Read operations are identical for approximate and precise
memory
37
USING FAILED MEMORY
CELLS
• Use blocks with exhausted error-correction resources to store
approximate data
• Value stored in a particular failed block will consistently exhibit
bit errors in the same positions
38
Prioritized Bit Correction
• Example: the mantissa of a floating-point number.
• Correct the bits that appear in high-order positions within
words and leave the lowest-order failed bits uncorrected.
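
A short sketch shows why the position of a failed bit matters. It flips single bits of an IEEE 754 single-precision value; the helper name `flip_bit` and the choice of 1.5 as the test value are illustrative assumptions.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of a 32-bit float; bit 0 is the lowest mantissa bit."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


x = 1.5
low_error = abs(flip_bit(x, 0) - x)    # lowest-order mantissa bit
high_error = abs(flip_bit(x, 22) - x)  # highest-order mantissa bit
```

The high-order flip changes the value by 0.5, while the low-order flip changes it by 2^-23, so spending the correction resources on high-order positions removes most of the error.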
39
Memory Interface
• Unlike with the approximate MLC technique, software has no
control over blocks’ precision state.
• To permit safe allocation of approximate and precise data, the
memory must inform software of the locations of approximate
(i.e., failed) blocks.
• When a block fails, the OS adds it to a pool of approximate
blocks.
• Memory allocators consult this set of approximate blocks
when laying out data in the memory.
• While approximate data can be stored in any block, precise
data must be allocated in memory without failures.
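
The allocation rule can be sketched as a toy allocator. The class `BlockAllocator` and its method names are hypothetical, and the policy of preferring failed blocks for approximate data is an assumption; the slides only require that precise data never lands in a failed block.

```python
class BlockAllocator:
    """Toy allocator that tracks which blocks the memory reported as failed."""

    def __init__(self, num_blocks):
        self.free_precise = set(range(num_blocks))
        self.free_approx = set()   # pool of failed (approximate) blocks

    def report_failure(self, block):
        """Move a free block into the approximate pool once it has failed."""
        if block in self.free_precise:
            self.free_precise.remove(block)
            self.free_approx.add(block)

    def allocate(self, approximate):
        """Return a free block number; precise data never gets a failed block."""
        if approximate and self.free_approx:
            return self.free_approx.pop()      # prefer failed blocks
        if not self.free_precise:
            raise MemoryError("no suitable block available")
        block = min(self.free_precise)
        self.free_precise.remove(block)
        return block


alloc = BlockAllocator(4)
alloc.report_failure(2)
a = alloc.allocate(approximate=True)    # takes the failed block
p = alloc.allocate(approximate=False)   # takes a healthy block
a2 = alloc.allocate(approximate=True)   # no failed block left, falls back
```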
40
Benchmarks
• The main-memory applications: Java programs annotated
using EnerJ, an approximation-aware type system, which
marks some data as approximate and leaves other data
precise.
• The persistent-storage benchmarks are static data sets that
can be stored 100% approximately
• Applications: fft, jmeint, lu, mc, raytr., smm, sor, zxing
41
Results
42
Results
43
Paraprox: Pattern-Based
Approximation for Data Parallel
Applications
Mehrzad Samadi, Davoud Anoushe Jamshidi,
Janghaeng Lee and Scott Mahlke
44
Paraprox
• Pattern-specific approximation methods
• Identify different patterns commonly found in data
parallel workloads
• Use specialized approximation optimization for each
pattern
• Write software once and use it on a variety of
processors
• Provide knobs to control the output quality
45
Paraprox framework
46
Paraprox framework
• Paraprox detects the patterns
• Generates approximate kernels with different tuning
parameters
• The runtime profiles the kernels and tunes the parameters for
the best performance.
• If the user-defined target output quality (TOQ) is violated, the
runtime system will adjust by
• retuning the parameters and/or
• selecting a less aggressive approximate kernel for the next execution.
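
The tuning loop can be sketched as follows. This is a simplification under stated assumptions: the approximate kernel is a strided sum, quality is defined as 1 minus the relative error, and the exact result is available for comparison on every run, which a real runtime would only check occasionally. The names `approx_kernel` and `tune` are hypothetical.

```python
def approx_kernel(data, stride):
    """Approximate sum: add every `stride`-th element and scale the result."""
    return sum(data[::stride]) * stride


def tune(data, strides, toq):
    """Return the most aggressive stride whose quality meets the TOQ.

    `strides` is ordered from most to least aggressive. Quality is defined
    here as 1 - relative error against the exact result (an assumption).
    """
    exact = sum(data)
    for stride in strides:
        quality = 1.0 - abs(approx_kernel(data, stride) - exact) / abs(exact)
        if quality >= toq:
            return stride, quality
    return 1, 1.0  # fall back to the exact kernel


data = [i * i for i in range(64)]
stride, quality = tune(data, strides=[8, 4, 2], toq=0.95)
```

With a 95% target, strides 8 and 4 are rejected and stride 2 is selected; a tighter target falls back to the exact kernel.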
47
Pattern detection
• Map
• Scatter/Gather
• Reduction
• Scan
• Stencil and
• Partition.
48
Patterns
49
Approximation Optimizations
• Map and scatter/gather patterns: approximate memoization
• Replaces a function call with a query into a lookup table which
returns a pre-computed result
• Pre-compute the output of the map or scatter/gather function
for a number of representative input sets offline.
• During runtime, the launched kernel’s threads use this lookup
table to find the output for all input values.
50
Approximate Memoization
51
Approximate Memoization
1. Identify candidate functions
2. Find the table size
3. Determine q_i (the number of quantization bits) for each input
4. Check for quality; if not satisfied, go back to step 2
5. Fill the table
6. Execution
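
The table-sizing and lookup steps can be sketched for a single-input function. The function (`math.sin`), the input range, the 0.01 error target and the midpoint sampling are all assumptions for illustration; Paraprox itself generates GPU kernels and handles several inputs.

```python
import math


def build_table(func, lo, hi, bits):
    """Precompute func at 2**bits evenly spaced representative inputs."""
    size = 1 << bits
    step = (hi - lo) / size
    # Each entry stores the output at the midpoint of its input bucket.
    return [func(lo + (i + 0.5) * step) for i in range(size)], step


def lookup(table, step, lo, x):
    """Quantize x to a bucket index and return the precomputed output."""
    index = int((x - lo) / step)
    index = min(len(table) - 1, max(0, index))
    return table[index]


def max_error(func, table, step, lo, hi, samples=1000):
    """Largest absolute error of the table over evenly spaced test inputs."""
    xs = [lo + (hi - lo) * k / samples for k in range(samples + 1)]
    return max(abs(func(x) - lookup(table, step, lo, x)) for x in xs)


# Grow the table until the quality check passes.
bits = 2
while True:
    table, step = build_table(math.sin, 0.0, math.pi, bits)
    if max_error(math.sin, table, step, 0.0, math.pi) <= 0.01:
        break
    bits += 1
```

Each extra quantization bit doubles the table and roughly halves the worst-case error, which is the knob the quality check turns.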
52
Stencil and Partition
• 70% of each image’s pixels differ from their neighbors by less
than 10%.
• Paraprox assumes that adjacent elements in the input array
are similar in value.
• Rather than access all neighbors within a tile, Paraprox
accesses only a subset of them and assumes the rest of the
neighbors have the same value
53
54
55
Approximation of tile
• Center-based approach
• Row-based approximation scheme
• Column-based approximation scheme
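
A 3x3 mean filter gives a compact illustration of the center-based and row-based ideas. Which element or row is read is an assumption here (the center element and the center row), and the sample image is made up.

```python
def exact_mean(img, r, c):
    """3x3 mean filter: reads all nine elements of the tile."""
    return sum(img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)) / 9.0


def center_based(img, r, c):
    """Read only the center; assume the other eight elements equal it."""
    return float(img[r][c])   # mean of nine copies of the center value


def row_based(img, r, c):
    """Read only the center row; assume the other two rows repeat it."""
    return sum(img[r][c + dc] for dc in (-1, 0, 1)) / 3.0


image = [
    [10, 11, 12, 13],
    [10, 12, 12, 14],
    [11, 12, 13, 14],
    [11, 13, 13, 15],
]
exact = exact_mean(image, 1, 1)
approx_center = center_based(image, 1, 1)
approx_row = row_based(image, 1, 1)
```

The center-based version performs one memory access instead of nine, and the row-based version performs three; on this sample the row-based result is closer to the exact mean.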
56
Reduction
• Paraprox aims to predict the final result by computing the
reduction of a subset of the input data
• The data is assumed to be distributed uniformly, so a subset
of the data can provide a good representation of the entire
array
• May need adjustment
57
58
• For example, instead of finding the minimum of the original
array, Paraprox finds the minimum within one half of the array
and returns it as the approximate result.
• If the data in both subarrays have similar distributions, the
minima of the two subarrays will be close to each other and the
approximation error will be negligible.
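
Both cases can be sketched in a few lines. The function names and the sample data are hypothetical; the scaling in `approx_sum` is one plausible reading of the adjustment that a partial reduction may need.

```python
def approx_min(data):
    """Minimum of the first half only, returned as the approximate result."""
    return min(data[: len(data) // 2])


def approx_sum(data, stride=2):
    """Sum a subset and scale it, since a partial sum needs adjustment."""
    sample = data[::stride]
    return sum(sample) * len(data) / len(sample)


data = [4, 6, 5, 3, 6, 2, 5, 7]
m = approx_min(data)   # exact minimum is 2
s = approx_sum(data)   # exact sum is 38
```

A minimum needs no adjustment because it is already on the scale of the data, while a sum over half the elements has to be scaled by the sampling ratio.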
59
Scan
• Paraprox assumes that the differences between elements in one
partition of the input array are similar to those in other
partitions of the same array.
• Parallel implementations of scan patterns break the input
array into subarrays and compute the scan result for each of
them.
60
Scan
61
Scan: Implementation
A data parallel implementation of the scan pattern has
three phases:
• Phase I scans each subarray.
• Phase II scans the sums of all subarrays.
• Phase III then adds the result of Phase II to each
corresponding subarray in the partial scan to generate the
final result.
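
The three phases can be written out sequentially as a sketch. This is the exact (non-approximate) scan, using an inclusive prefix sum; the function name and the chunking by ceiling division are assumptions, and on a GPU each subarray in Phases I and III would be handled in parallel.

```python
from itertools import accumulate


def three_phase_scan(data, num_subarrays):
    """Inclusive prefix sum computed the way a data-parallel scan would."""
    if not data:
        return []
    size = -(-len(data) // num_subarrays)          # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # Phase I: scan each subarray independently.
    partial = [list(accumulate(chunk)) for chunk in chunks]

    # Phase II: scan the per-subarray totals.
    totals = list(accumulate(p[-1] for p in partial))

    # Phase III: add the total of all preceding subarrays to each subarray.
    result = list(partial[0])
    for k in range(1, len(partial)):
        result.extend(x + totals[k - 1] for x in partial[k])
    return result


data = [3, 1, 4, 1, 5, 9, 2, 6]
scanned = three_phase_scan(data, num_subarrays=4)
```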
62
Scan Approximation
63
Experimental Setup
• Clang 3.3
• GPU: NVIDIA GTX 560
• CPU: Intel Core i7
• Benchmarks: NVIDIA SDK, Rodinia
64
Results: Speedup
65
Results: Performance
comparison
68
Q&A
?
69
Innovations in Edge Computing and MECInnovations in Edge Computing and MEC
Innovations in Edge Computing and MEC
 
Dynamic workload migration over optical backbone network to minimize data cen...
Dynamic workload migration over optical backbone network to minimize data cen...Dynamic workload migration over optical backbone network to minimize data cen...
Dynamic workload migration over optical backbone network to minimize data cen...
 
Migration of groups of virtual machines in distributed data centers to reduce...
Migration of groups of virtual machines in distributed data centers to reduce...Migration of groups of virtual machines in distributed data centers to reduce...
Migration of groups of virtual machines in distributed data centers to reduce...
 
Big data and machine learning for network research problems
Big data and machine learning for network research problemsBig data and machine learning for network research problems
Big data and machine learning for network research problems
 
Cost savings from auto-scaling of network resources using machine learning
Cost savings from auto-scaling of network resources using machine learningCost savings from auto-scaling of network resources using machine learning
Cost savings from auto-scaling of network resources using machine learning
 
IoT Mobility Forensics
IoT Mobility ForensicsIoT Mobility Forensics
IoT Mobility Forensics
 
Network tomography to enhance the performance of software defined network mon...
Network tomography to enhance the performance of software defined network mon...Network tomography to enhance the performance of software defined network mon...
Network tomography to enhance the performance of software defined network mon...
 
Computer Security: Worms
Computer Security: WormsComputer Security: Worms
Computer Security: Worms
 

Approximation techniques used for general purpose algorithms

  • 1. Approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories 1 Presented by: K M Sabidur Rahman Date: Apr 28, 2014
  • 2. Outline • Approximate Computing • Neural Acceleration for General-Purpose Approximate Programs • Approximate Storage in Solid-State Memories • Paraprox: Pattern-Based Approximation for Data Parallel Applications 2
  • 3. Approximate Computing • Applicable where some degree of variation or error is acceptable • Example: Video processing • Loss of accuracy is permissible • Better performance given less work • Low power consumption 3
  • 4. Domains • Multimedia processing • Machine learning • Gaming • Data mining/analysis • Financial modeling • Statistics 4
  • 5. Approximate Computing • Companies dealing with huge data are interested in more efficient data processing, even with some loss of accuracy 5
  • 6. Categorization of approximation • Programmer-based: the programmer writes different approximate versions of a program and a runtime system decides which version to run. • Hardware-based: hardware modifications such as imprecise arithmetic units, register files, or accelerators. Cannot be readily utilized without manufacturing new hardware. • Software-based: Approximation is done on the software level. Each of these solutions works only for a small set of applications. 6
  • 7. Neural Acceleration for General-Purpose Approximate Programs Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze and Doug Burger 7
  • 8. Basic concept • A learning-based approach • Select and train a neural network to mimic a region of code • After the learning phase, the compiler replaces the original code with an invocation of the trained neural network • “NPU”: low-power accelerator tightly coupled to the processor pipeline to accelerate small code regions. 8
  • 9. Challenges for effective trainable accelerators • A learning algorithm: to accurately and efficiently mimic imperative code. • A language and compilation framework: to transform regions of imperative code to neural network evaluations. • An architectural interface: to call a neural processing unit (NPU) in place of the original code regions 9
  • 10. Neural Acceleration • Annotate an approximate program component • Compile the program • Train a neural network • Execute on a fast Neural Processing Unit (NPU) 10
  • 12. Programming • The programmer explicitly annotates functions • This is a common practice in literature 12
  • 13. Code Observation • Compiler observes the behavior of the candidate code region by logging its inputs and outputs • The logged input–output pairs constitute the training and validation data for the next step • Compiler uses the collected input–output data to configure and train a neural network that mimics the candidate region 13
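The observation and training steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `observe`, `TinyMLP`, and `mse` are hypothetical names, the network is a one-input toy with a single sigmoid hidden layer trained by plain stochastic gradient descent, and none of it is taken from the authors' compiler.

```python
import math
import random

def observe(fn, inputs):
    """Log (input, output) pairs of the candidate region (hypothetical helper)."""
    return [(x, fn(x)) for x in inputs]

class TinyMLP:
    """Toy network: 1 input -> `hidden` sigmoid units -> 1 linear output."""

    def __init__(self, hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
        self.b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [1.0 / (1.0 + math.exp(-(w * x + b)))
             for w, b in zip(self.w1, self.b1)]
        out = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return out, h

    def train(self, pairs, epochs=2000, lr=0.1):
        # Plain stochastic gradient descent on squared error.
        for _ in range(epochs):
            for x, y in pairs:
                out, h = self.forward(x)
                err = out - y
                for i, hi in enumerate(h):
                    grad_h = err * self.w2[i] * hi * (1.0 - hi)
                    self.w2[i] -= lr * err * hi
                    self.w1[i] -= lr * grad_h * x
                    self.b1[i] -= lr * grad_h
                self.b2 -= lr * err

def mse(model, pairs):
    """Mean squared error of the model over logged pairs."""
    return sum((model.forward(x)[0] - y) ** 2 for x, y in pairs) / len(pairs)

# Usage: log a candidate region (here x*x, chosen arbitrarily) and fit the network.
pairs = observe(lambda x: x * x, [i / 15 for i in range(16)])
model = TinyMLP()
model.train(pairs)
```

In practice the logged pairs would be split into training and validation sets, as the slide notes; the sketch skips that split for brevity.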
  • 14. Execution • The transformed program begins execution on the main core and configures the NPU. • The NPU is invoked to perform a neural network evaluation instead of executing the original code region. • Invoking the NPU is faster and more energy-efficient than executing the original code region. 14
  • 15. Code Region Criteria • Hot code • Approximability • Well-defined inputs and outputs 15
  • 18. Architecture Design for NPU Acceleration 18
  • 19. Architecture Design for NPU Acceleration The CPU–NPU interface consists of three queues: • sending and retrieving the configuration • sending the inputs and • retrieving the neural network’s outputs. 19
  • 20. Architecture Design for NPU Acceleration The ISA is extended with four instructions to access the queues: • enq.c %r: enqueues the value of the register r into the config FIFO. • deq.c %r: dequeues a configuration value from the config FIFO to the register r. • enq.d %r: enqueues the value of the register r into the input FIFO. • deq.d %r: dequeues the head of the output FIFO to the register r. 20
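A small software model can make the queue protocol concrete. The sketch below assumes a hypothetical `NPUQueues` class whose methods mirror the four instructions; the `run` method and the `network` callable stand in for the hardware evaluating the configured network and are not part of the ISA described on the slide.

```python
from collections import deque

class NPUQueues:
    """Toy model of the CPU-NPU FIFOs (names are illustrative assumptions)."""

    def __init__(self, network):
        self.network = network      # callable: list of inputs -> list of outputs
        self.config = deque()       # config FIFO
        self.inputs = deque()       # input FIFO
        self.outputs = deque()      # output FIFO

    def enq_c(self, value):         # models enq.c %r
        self.config.append(value)

    def deq_c(self):                # models deq.c %r
        return self.config.popleft()

    def enq_d(self, value):         # models enq.d %r
        self.inputs.append(value)

    def deq_d(self):                # models deq.d %r
        return self.outputs.popleft()

    def run(self, n_inputs):
        """Stand-in for the hardware: consume inputs, produce outputs."""
        args = [self.inputs.popleft() for _ in range(n_inputs)]
        self.outputs.extend(self.network(args))

def call_region(npu, args, n_outputs):
    """What a transformed call site might look like: enqueue, evaluate, dequeue."""
    for a in args:
        npu.enq_d(a)
    npu.run(len(args))
    return [npu.deq_d() for _ in range(n_outputs)]

# Usage with a placeholder "network" that just sums its inputs.
npu = NPUQueues(network=lambda xs: [sum(xs)])
result = call_region(npu, [1, 2, 3], n_outputs=1)
```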
  • 22. A Single processing engine 22
  • 23. Benchmarks and Experimental Setup • Benchmarks: FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel (annotated one hot function each) • Experimental Setup: MARSSx86 • Energy model: McPAT and CACTI 23
  • 25. Results: 3.0x Energy reduction 25
  • 26. Limitations • Applicability • Programmer effort and • Quality and error control 26
  • 27. Approximate Storage in Solid-State Memories Adrian Sampson, Jacob Nelson, Karin Strauss and Luis Ceze 27
  • 28. Basic concept • Mechanisms to enable applications to store data approximately • Improved performance, lifetime, or density of solid-state memories 28
  • 29. Two techniques • Reduced-precision writes in multi-level phase-change memory cells • Use of blocks with failed bits to store approximate data • Reduced-precision writes in multi-level phase-change memory cells can be 1.7x faster on average • Failed blocks can improve array lifetime by 23% on average with quality loss under 10% 29
  • 30. INTERFACES FOR APPROXIMATE STORAGE • Approximate storage augments memory modules with software-visible precision modes. • When an application needs strict data fidelity, it uses traditional precise storage; the memory then guarantees a low error rate when recovering the data. • When the application can tolerate occasional errors in some data, it uses the memory’s approximate mode, in which data recovery errors may occur with non-negligible probability 30
  • 31. Phase change memory (PCM) • Merits: Non-volatile, almost as fast as DRAM, More scalable, Faster than flash • Limitations: Need more time and energy to protect against errors. Cells wear out over time and can no longer be used for precise data storage. 31
  • 32. Approximate storage in PCM • PCM works by storing an analog value (resistance) and quantizing it to expose digital storage. • A larger number of levels per cell requires more time and energy to access. • Approximation improves performance and efficiency 32
  • 34. Multi-Level Cell Model • The shaded areas are the target regions for writes to each level • Unshaded areas are guard bands. • The curves show the probability of reading a given analog value after writing one of the levels. • Approximate MLCs decrease guard bands so the probability distributions overlap. • Goal is to increase density or performance at the cost of occasional digital-domain storage errors. 34
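The trade-off between guard bands and error rate can be explored with a small Monte Carlo sketch. Everything numeric here is an assumption made for illustration: four levels on a normalized [0, 1) range, Gaussian write and read noise, and a write loop that repeats until the value lands in the target region. It is not the cell model or the parameters used by the authors.

```python
import random

LEVELS = 4  # a 2-bit cell; each level owns a band of width 1/LEVELS

def write_cell(level, target_halfwidth, rng, write_sigma=0.05):
    """Repeat noisy write pulses until the value lands in the target region."""
    center = (level + 0.5) / LEVELS
    pulses = 0
    while True:
        pulses += 1
        value = rng.gauss(center, write_sigma)
        if abs(value - center) <= target_halfwidth:
            return value, pulses

def read_cell(value, rng, read_sigma=0.02):
    """Add read noise, clamp to the valid range, and quantize to a level."""
    noisy = min(max(value + rng.gauss(0.0, read_sigma), 0.0), 1.0 - 1e-9)
    return int(noisy * LEVELS)

def simulate(target_halfwidth, trials=20000, seed=1):
    """Return (error rate, average write pulses) for a given target region."""
    rng = random.Random(seed)
    errors = 0
    pulses = 0
    for _ in range(trials):
        level = rng.randrange(LEVELS)
        value, n = write_cell(level, target_halfwidth, rng)
        pulses += n
        errors += read_cell(value, rng) != level
    return errors / trials, pulses / trials

# Narrow target region (wide guard bands) vs. wide target region (thin guard bands).
precise = simulate(target_halfwidth=0.025)
approximate = simulate(target_halfwidth=0.12)
```

Under these assumptions the wider target region should need fewer write pulses per cell and show more read errors, which is the qualitative behavior the slide describes.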
  • 35. Memory Interface • MLC blocks can be made precise or approximate by adjusting the target threshold of write operations. • The memory array must know which threshold value to use for each write operation. • Memory interface extended to include precision flags • Read operations are identical for approximate and precise memory 37
  • 36. USING FAILED MEMORY CELLS • Use blocks with exhausted error-correction resources to store approximate data • Value stored in a particular failed block will consistently exhibit bit errors in the same positions 38
  • 37. Prioritized Bit Correction • Example of mantissa in floating point number. • Correct the bits that appear in high-order positions within words and leave the lowest-order failed bits uncorrected. 39
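A minimal sketch of the prioritization idea follows. It assumes a hypothetical `prioritized_correction` function, a fixed correction budget, and access to the original bits for the corrected positions (a stand-in for whatever replacement bits the error-correction hardware stores); these are modeling shortcuts, not the paper's mechanism.

```python
def prioritized_correction(stored, original, failed_bits, budget):
    """Repair up to `budget` failed bit positions, highest-order first.

    `original` supplies the correct bit for each repaired position; real
    hardware would hold those replacement bits in its correction resources.
    """
    to_fix = sorted(failed_bits, reverse=True)[:budget]
    value = stored
    for bit in to_fix:
        mask = 1 << bit
        value = (value & ~mask) | (original & mask)
    return value

# Usage: an 8-bit word whose bits 7, 3 and 0 have failed (read back flipped).
original = 0b10110110
stored = original ^ ((1 << 7) | (1 << 3) | (1 << 0))
repaired = prioritized_correction(stored, original, {7, 3, 0}, budget=2)
error = abs(repaired - original)  # only the lowest-order failed bit remains wrong
```

With a budget of two corrections, repairing bits 7 and 3 leaves only bit 0 wrong, so the numeric error stays small; spending the same budget on the low-order bits would leave bit 7 wrong instead.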
  • 38. Memory Interface • Unlike with the approximate MLC technique, software has no control over blocks’ precision state. • To permit safe allocation of approximate and precise data, the memory must inform software of the locations of approximate (i.e., failed) blocks. • As a block fails the OS adds the block to a pool of approximate blocks. • Memory allocators consult this set of approximate blocks when laying out data in the memory. • While approximate data can be stored in any block, precise data must be allocated in memory without failures. 40
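The allocation policy described here can be sketched as a pool-based allocator. `ApproxAwareAllocator` and its method names are hypothetical, and the sketch assumes a block is free at the moment it fails; handling failures of blocks that are already in use is left out.

```python
class ApproxAwareAllocator:
    """Toy block allocator that keeps failed blocks in an approximate pool."""

    def __init__(self, n_blocks):
        self.precise_free = set(range(n_blocks))  # failure-free blocks
        self.approx_free = set()                  # failed blocks, approximate only

    def block_failed(self, block):
        """Called when the memory reports a failed block (assumed to be free)."""
        self.precise_free.discard(block)
        self.approx_free.add(block)

    def alloc(self, approximate=False):
        """Approximate data prefers failed blocks; precise data never gets one."""
        if approximate and self.approx_free:
            return self.approx_free.pop()
        if not self.precise_free:
            raise MemoryError("no failure-free block available")
        return self.precise_free.pop()

# Usage: block 2 fails, so it is only handed out for approximate data.
allocator = ApproxAwareAllocator(n_blocks=4)
allocator.block_failed(2)
approx_block = allocator.alloc(approximate=True)
precise_block = allocator.alloc(approximate=False)
```

Preferring failed blocks for approximate data keeps failure-free blocks available for precise data, while still letting approximate data fall back to any block when the approximate pool is empty.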
  • 39. Benchmarks • The main-memory applications: Java programs annotated using EnerJ, an approximation-aware type system, which marks some data as approximate and leaves other data precise. • The persistent-storage benchmarks are static data sets that can be stored 100% approximately • Applications: fft, jmeint, lu, mc, raytr., smm, sor, zxing 41
  • 42. Paraprox: Pattern-Based Approximation for Data Parallel Applications Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee and Scott Mahlke 44
  • 43. Paraprox • Pattern-specific approximation methods • Identify different patterns commonly found in data parallel workloads • Use specialized approximation optimization for each pattern • Write software once and use it on a variety of processors • Provide knobs to control the output quality 45
  • 45. Paraprox framework • Paraprox detects the patterns • Generates approximate kernels with different tuning parameters • The runtime profiles the kernels and tunes the parameters for the best performance. • If the user-defined target output quality (TOQ) is violated, the runtime system will adjust by • retuning the parameters and/or • selecting a less aggressive approximate kernel for the next execution. 47
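The tuning loop can be illustrated with a simplified selection routine. This is a sketch under assumptions: `choose_kernel`, `make_sampled_sum`, and the relative-error `quality` metric are hypothetical, the candidate kernels are sampled sums with different strides, and the exact kernel is run alongside each candidate only to measure quality (a real runtime would check quality occasionally rather than on every call).

```python
def make_sampled_sum(stride):
    """Approximate kernel: sum every `stride`-th element and scale up."""
    def kernel(xs):
        return sum(xs[::stride]) * stride
    kernel.stride = stride
    return kernel

def exact_sum(xs):
    return sum(xs)

def quality(reference, approx):
    """Output quality as 1 minus relative error (an assumed metric)."""
    return 1.0 - abs(approx - reference) / abs(reference)

def choose_kernel(kernels, exact, inputs, toq):
    """Try kernels from most to least aggressive; fall back to the exact one."""
    reference = exact(inputs)
    for kernel in kernels:
        if quality(reference, kernel(inputs)) >= toq:
            return kernel
    return exact

# Usage: candidates ordered from most aggressive (largest stride) to least.
data = list(range(1, 101))
candidates = [make_sampled_sum(s) for s in (8, 4, 2)]
chosen = choose_kernel(candidates, exact_sum, data, toq=0.99)
```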
  • 46. Pattern detection • Map • Scatter/Gather • Reduction • Scan • Stencil and • Partition. 48
  • 48. Approximation Optimizations • Map and scatter/gather patterns: approximate memoization • Replaces a function call with a query into a lookup table which returns a pre-computed result • Pre-compute the output of the map or scatter/gather function for a number of representative input sets offline. • During runtime, the launched kernel’s threads use this lookup table to find the output for all input values. 50
  • 50. Approximate Memoization • Identify candidate functions • Find the table size • Determine qi for each input • Check for quality; if not satisfied, go back to step 2. • Fill the Table • Execution 52
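The steps above can be sketched for a function of one input. The sketch assumes hypothetical helpers (`build_table`, `lookup`, `tune_table`), uses the bucket midpoint as the representative input, takes the number of index bits as the per-input quantization level, and checks quality as a maximum absolute error over sample inputs; the actual quality metric and search order in Paraprox may differ.

```python
import math

def build_table(fn, lo, hi, bits):
    """Precompute fn at the midpoint of each of 2**bits input buckets."""
    size = 1 << bits
    step = (hi - lo) / size
    table = [fn(lo + (i + 0.5) * step) for i in range(size)]
    return table, step

def lookup(table, step, lo, x):
    """Quantize x to a bucket index and return the precomputed result."""
    i = int((x - lo) / step)
    i = min(max(i, 0), len(table) - 1)  # clamp inputs at the range edges
    return table[i]

def tune_table(fn, lo, hi, samples, max_error, max_bits=16):
    """Grow the table until the worst sampled error meets the quality target."""
    for bits in range(1, max_bits + 1):
        table, step = build_table(fn, lo, hi, bits)
        worst = max(abs(lookup(table, step, lo, x) - fn(x)) for x in samples)
        if worst <= max_error:
            return bits, table, step
    raise ValueError("quality target not reachable within max_bits")

# Usage: memoize sin on [0, pi] with a maximum absolute error of 0.01.
samples = [math.pi * k / 999 for k in range(1000)]
bits, table, step = tune_table(math.sin, 0.0, math.pi, samples, max_error=0.01)
approx = lookup(table, step, 0.0, 1.0)  # replaces a call to math.sin(1.0)
```

A smaller table fits better in fast memory but quantizes inputs more coarsely, which is why the table size is tuned against the quality target rather than fixed in advance.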
  • 51. Stencil and Partition • 70% of each image’s pixels have less than 10% difference from their neighbors. • Paraprox assumes that adjacent elements in the input array are similar in value. • Rather than access all neighbors within a tile, Paraprox accesses only a subset of them and assumes the rest of the neighbors have the same value 53
  • 54. Approximation of tile • Center based approach • Row based approximation schemes • Column based approximation schemes 56
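As an illustration of the row based scheme, the sketch below compares an exact 3x3 mean filter with a version that loads only the center row of the tile and assumes the rows above and below hold the same values. The function names and the test image are assumptions made for the example.

```python
def mean3x3_exact(img, r, c):
    """Exact stencil: average all 9 elements of the 3x3 tile around (r, c)."""
    total = sum(img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1))
    return total / 9.0

def mean3x3_row_based(img, r, c):
    """Row based approximation: load 3 elements instead of 9.

    The center row stands in for the rows above and below it.
    """
    return sum(img[r][c + dc] for dc in (-1, 0, 1)) / 3.0

# Usage on a smooth 5x5 image whose values change slowly between neighbors.
img = [[r * r + c for c in range(5)] for r in range(5)]
exact = mean3x3_exact(img, 2, 2)
approx = mean3x3_row_based(img, 2, 2)
```

The approximation is only as good as the similarity between neighboring rows, so it suits smooth inputs such as natural images.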
  • 55. Reduction • Paraprox aims to predict the final result by computing the reduction of a subset of the input data • The data is assumed to be distributed uniformly, so a subset of the data can provide a good representation of the entire array • May need adjustment 57
  • 57. • For example, instead of finding the minimum of the original array, Paraprox finds the minimum within one half of the array and returns it as the approximate result. • If the data in both subarrays have similar distributions, the minimum of these subarrays will be close to each other and approximation error will be negligible. 59
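Both reduction cases can be written down directly. In this sketch, `approx_min` follows the half-array example above, and `approx_sum` shows one plausible form of adjustment for a sum, scaling the sampled result by the fraction of data skipped; the function names and the scaling rule are assumptions for illustration.

```python
def approx_min(xs):
    """Minimum of the first half of the array, used as the approximate result."""
    return min(xs[: len(xs) // 2])

def approx_sum(xs, skip):
    """Sum every `skip`-th element, then scale to account for skipped data."""
    sample = xs[::skip]
    return sum(sample) * (len(xs) / len(sample))

# Usage: the minimum needs no adjustment, the sum does.
values = list(range(1, 101))
half_min = approx_min(values)
scaled_sum = approx_sum(values, skip=2)
```

The approximate minimum can never be smaller than the true minimum, so its error is one-sided; it is exact whenever the true minimum happens to fall in the half that was scanned.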
  • 58. Scan • Paraprox assumes that differences between elements in the input array are similar to those in other partitions of the same input array. • Parallel implementations of scan patterns break the input array into sub-arrays and compute the scan result for each of them. 60
  • 60. Scan : Implementation A data parallel implementation of the scan pattern has three phases: • Phase I scans each subarray. • Phase II scans the sum of all subarrays. • Phase III then adds the result of Phase II to each corresponding subarray in the partial scan to generate the final result. 62
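The three phases can be sketched sequentially for an inclusive prefix sum. This is the exact (non-approximate) scan, written with a hypothetical `three_phase_scan` function; on a GPU each subarray in Phases I and III would be handled by a separate group of threads.

```python
from itertools import accumulate

def three_phase_scan(xs, chunk):
    """Inclusive prefix sum computed in the three phases described above."""
    subarrays = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]

    # Phase I: scan each subarray independently.
    partial = [list(accumulate(s)) for s in subarrays]

    # Phase II: scan the subarray sums; subarray k is offset by the
    # total of all subarrays before it.
    sums = [p[-1] for p in partial]
    offsets = [0] + list(accumulate(sums))[:-1]

    # Phase III: add each offset to its corresponding partial scan.
    return [v + off for p, off in zip(partial, offsets) for v in p]

# Usage: ten elements split into subarrays of three.
result = three_phase_scan(list(range(1, 11)), chunk=3)
```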
  • 62. Experimental Setup • Clang 3.3 • GPU - NVIDIA GTX 560 • CPU- Intel Core I7 • Benchmarks - NVIDIA SDK, Rodinia 64