Halide
Domain Specific Language
Halide: A Language and Compiler for Optimizing
Parallelism, Locality, and Recomputation in
Image Processing Pipelines
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris,
Frédo Durand, Saman Amarasinghe.
PLDI '13 Proceedings of the 34th ACM SIGPLAN Conference on Programming
Language Design and Implementation
Recap Halide
● Separate descriptions of ‘algorithm’ per output pixel and execution ‘schedule’
● Halide is a programming language embedded in C++
● Compiler based on LLVM/Clang
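To make the algorithm/schedule split concrete, here is a minimal sketch in plain C++ (not Halide code). It evaluates the same two-stage blur under two hand-written execution orders. The `Image` type and both function names are hypothetical, and the kernel sums its taps without normalizing so the arithmetic stays exact.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: a row-major integer image.
struct Image {
    int w, h;
    std::vector<int> px;
    Image(int w_, int h_) : w(w_), h(h_), px(static_cast<std::size_t>(w_) * h_, 0) {}
    int& at(int x, int y) { return px[static_cast<std::size_t>(y) * w + x]; }
    int at(int x, int y) const { return px[static_cast<std::size_t>(y) * w + x]; }
};

// Algorithm (the same in both versions):
//   blurx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y)
//   out(x, y)   = blurx(x, y-1) + blurx(x, y) + blurx(x, y+1)
// Only interior pixels are computed; border pixels stay 0.

// Execution order A: compute all of blurx first, then all of out.
Image blur_breadth_first(const Image& in) {
    Image bx(in.w, in.h), out(in.w, in.h);
    for (int y = 0; y < in.h; ++y)
        for (int x = 1; x < in.w - 1; ++x)
            bx.at(x, y) = in.at(x - 1, y) + in.at(x, y) + in.at(x + 1, y);
    for (int y = 1; y < in.h - 1; ++y)
        for (int x = 1; x < in.w - 1; ++x)
            out.at(x, y) = bx.at(x, y - 1) + bx.at(x, y) + bx.at(x, y + 1);
    return out;
}

// Execution order B: recompute blurx inline wherever out needs it.
Image blur_inlined(const Image& in) {
    Image out(in.w, in.h);
    auto bx = [&in](int x, int y) {
        return in.at(x - 1, y) + in.at(x, y) + in.at(x + 1, y);
    };
    for (int y = 1; y < in.h - 1; ++y)
        for (int x = 1; x < in.w - 1; ++x)
            out.at(x, y) = bx(x, y - 1) + bx(x, y) + bx(x, y + 1);
    return out;
}
```

Both functions return identical pixels; they differ only in how much intermediate storage they use and how much work they repeat, which is the kind of choice a schedule controls.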
Parallelism
● Distribute work across threads and SIMD parallel vectors
Locality
● Compute in tiles
● Interleave tiles of blurx and blury
● Store blurx in local cache
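The tiling idea can be sketched in plain C++ as follows. This is an illustration under assumptions, not generated Halide output: `blur_tiled`, its parameters, and the per-tile scratch buffer are hypothetical, and the loops marked as parallel or vectorizable run serially here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Two-stage blur over a row-major w*h image, computed tile by tile.
// blurx for each tile lives in a small scratch buffer that is meant to
// stay in cache while the tile's output is produced.
std::vector<int> blur_tiled(const std::vector<int>& in, int w, int h, int tile) {
    assert(tile > 0);
    std::vector<int> out(in.size(), 0);
    for (int ty = 1; ty < h - 1; ty += tile) {      // tiles are independent: candidate for threads
        for (int tx = 1; tx < w - 1; tx += tile) {
            const int y_end = std::min(ty + tile, h - 1);
            const int x_end = std::min(tx + tile, w - 1);
            const int tw = x_end - tx, th = y_end - ty;
            // blurx for image rows ty-1 .. y_end (th + 2 rows) of this tile.
            std::vector<int> bx(static_cast<std::size_t>(th + 2) * tw);
            for (int y = 0; y < th + 2; ++y) {
                const int gy = ty - 1 + y;
                for (int x = 0; x < tw; ++x) {      // unit-stride inner loop: candidate for SIMD
                    const int gx = tx + x;
                    bx[y * tw + x] = in[gy * w + gx - 1] + in[gy * w + gx] + in[gy * w + gx + 1];
                }
            }
            // Consume the scratch rows immediately, interleaving blurx and blury per tile.
            for (int y = 0; y < th; ++y)
                for (int x = 0; x < tw; ++x)
                    out[(ty + y) * w + tx + x] =
                        bx[y * tw + x] + bx[(y + 1) * tw + x] + bx[(y + 2) * tw + x];
        }
    }
    return out;
}
```

Each tile recomputes two extra rows of blurx at its top and bottom edge; that redundant work is the price paid for keeping the intermediate data local.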
Compiling Scheduled Pipelines
1. Lowering and Loop synthesis
2. Bounds Inference
3. Sliding Window Optimization and Storage Folding
4. Flattening
5. Vectorization and Unrolling
6. Back-end Code Generation
Lowering
● The lowering process synthesizes a single, complete set of loop nests and
allocations, given a Halide pipeline and a fully-specified schedule.
● Lowering begins from the function defining the output (here, out).
○ Lowering then proceeds recursively up the pipeline, from callers to callees (here, from out to
blurx).
● Each loop is labeled as being serial, parallel, unrolled, or vectorized,
according to the schedule.
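As a toy picture of a labeled loop nest, the sketch below attaches one of the four labels to each loop and prints the nest as indented pseudocode. `LoopKind`, `Loop`, and `render` are hypothetical names for illustration only; they are not Halide's internal representation.

```cpp
#include <string>
#include <vector>

// The four loop labels named above.
enum class LoopKind { Serial, Parallel, Unrolled, Vectorized };

// One loop of the nest: its variable and the label chosen by the schedule.
struct Loop {
    std::string var;
    LoopKind kind;
};

const char* label(LoopKind k) {
    switch (k) {
        case LoopKind::Serial:     return "for";
        case LoopKind::Parallel:   return "parallel for";
        case LoopKind::Unrolled:   return "unrolled for";
        case LoopKind::Vectorized: return "vectorized for";
    }
    return "for";
}

// Print the nest from outermost to innermost, with the body at the deepest level.
std::string render(const std::vector<Loop>& nest, const std::string& body) {
    std::string text, indent;
    for (const Loop& l : nest) {
        text += indent + label(l.kind) + " " + l.var + ":\n";
        indent += "  ";
    }
    text += indent + body + "\n";
    return text;
}
```

For example, a nest with a parallel `y` loop around a vectorized `x` loop renders as two nested headers followed by the body.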
Bounds Inference
● Loop synthesis leaves loop bounds and allocation sizes as symbolic variables.
The next stage of lowering generates and injects appropriate definitions for
these variables. Like function lowering, bounds inference proceeds
recursively back from the output.
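For the two-stage blur, the backward walk can be sketched with simple interval arithmetic. This is a minimal illustration assuming fixed 3-tap stencils; `Interval`, `Box`, `widen`, and the two `*_required` helpers are hypothetical names, and Halide's actual analysis handles general index expressions.

```cpp
#include <cassert>

struct Interval { int min; int max; };   // inclusive bounds
struct Box { Interval x; Interval y; };

// Region of a producer needed when the consumer is evaluated over `needed`
// and reads the producer at offsets lo_offset .. hi_offset.
Interval widen(Interval needed, int lo_offset, int hi_offset) {
    assert(lo_offset <= hi_offset);
    return {needed.min + lo_offset, needed.max + hi_offset};
}

// out(x, y) reads blurx(x, y-1 .. y+1).
Box blurx_required(const Box& out) {
    return {widen(out.x, 0, 0), widen(out.y, -1, 1)};
}

// blurx(x, y) reads in(x-1 .. x+1, y).
Box in_required(const Box& blurx) {
    return {widen(blurx.x, -1, 1), widen(blurx.y, 0, 0)};
}
```

Starting from the requested output region, each step yields the loop bounds and allocation size of the stage before it: an 8x6 output needs an 8x8 region of blurx and a 10x8 region of the input.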
Sliding Window Optimization and Storage Folding
● If a realization of a function is stored at a higher loop level than its computation,
with an intervening serial loop, then iterations of that loop can reuse values
generated by previous iterations.
● This transformation lets us trade off parallelism (because the intervening
loop must be serial) for reuse (because we avoid recomputing values already
computed by previous iterations).
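A one-dimensional sketch of both ideas, under stated assumptions: `producer` is a stand-in stage, the consumer is a 3-tap sum, and the names `Result` and `consume_sliding` are hypothetical. Each serial iteration computes only the one new producer value it needs, and the three live values are kept in a buffer indexed modulo 3.

```cpp
#include <vector>

struct Result {
    std::vector<int> out;
    int producer_evals;   // how many times the producer stage ran
};

// Stand-in producer stage (an assumption for illustration).
int producer(const std::vector<int>& in, int i) { return 2 * in[i]; }

// out(i) = f(i-1) + f(i) + f(i+1), where f is the producer.
Result consume_sliding(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    Result r{std::vector<int>(in.size(), 0), 0};
    if (n < 3) return r;
    int window[3];   // storage folding: f(i) lives at window[i % 3]
    // Warm-up: the first iteration needs f(0) and f(1) to already exist.
    for (int i = 0; i < 2; ++i) {
        window[i % 3] = producer(in, i);
        ++r.producer_evals;
    }
    for (int i = 1; i < n - 1; ++i) {   // must stay serial: it reuses earlier iterations
        window[(i + 1) % 3] = producer(in, i + 1);   // the only new value this iteration
        ++r.producer_evals;
        r.out[i] = window[(i - 1) % 3] + window[i % 3] + window[(i + 1) % 3];
    }
    return r;
}
```

With n inputs the producer runs n times instead of 3(n - 2), and the intermediate storage shrinks from n values to 3.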
Flattening
● The compiler flattens multi-dimensional loads, stores and allocations into their
single-dimensional equivalent.
● By convention, we always set the stride of the innermost dimension to 1, to
ensure we can perform dense vector loads and stores in that dimension.
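The index arithmetic behind flattening can be sketched as below. `Buffer2D` is a hypothetical type, and the sketch assumes the buffer's minimum coordinate is 0 in both dimensions, which need not hold in general.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 2-D buffer stored in one flat array.
struct Buffer2D {
    int width, height;
    int stride_x, stride_y;   // elements to step per unit of x and y
    std::vector<float> data;

    Buffer2D(int w, int h)
        : width(w), height(h),
          stride_x(1),        // innermost dimension is dense
          stride_y(w),        // one full row per step in y
          data(static_cast<std::size_t>(w) * h) {}

    // Flattened address of (x, y).
    std::size_t flat_index(int x, int y) const {
        assert(0 <= x && x < width && 0 <= y && y < height);
        return static_cast<std::size_t>(x) * stride_x +
               static_cast<std::size_t>(y) * stride_y;
    }
};
```

Because `stride_x` is 1, neighbouring values of x map to neighbouring addresses, so a run of consecutive x values can be read or written with a single dense vector access.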
Vectorization and Unrolling
● Replace loops of constant size scheduled as vectorized or unrolled with
transformed versions of their loop bodies.
● Unrolling replaces a loop of size n with n sequential statements performing
each loop iteration in turn.
● Vectorization completely replaces a loop of size n with a single statement.
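The two rewrites can be compared on a loop of constant size 4. This is a hand-written sketch, not compiler output: `Vec4` and its helpers are hypothetical stand-ins for a hardware vector type, modelled with `std::array` so the example runs anywhere.

```cpp
#include <array>

// Original loop of constant size 4.
void add_loop(const int* a, const int* b, int* c) {
    for (int i = 0; i < 4; ++i) c[i] = a[i] + b[i];
}

// Unrolled: four sequential statements, one per iteration.
void add_unrolled(const int* a, const int* b, int* c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

// Stand-in for a 4-lane vector register and its operations.
using Vec4 = std::array<int, 4>;

Vec4 load4(const int* p) { return {p[0], p[1], p[2], p[3]}; }

Vec4 add4(const Vec4& x, const Vec4& y) {
    return {x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]};
}

void store4(int* p, const Vec4& v) {
    p[0] = v[0]; p[1] = v[1]; p[2] = v[2]; p[3] = v[3];
}

// Vectorized: the whole loop becomes a single statement on 4-wide values.
void add_vectorized(const int* a, const int* b, int* c) {
    store4(c, add4(load4(a), load4(b)));
}
```

All three functions produce the same four results; on real hardware the vectorized form would map `load4`, `add4`, and `store4` to single SIMD instructions.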
Back-end Code Generation
● We first run a standard constant-folding and dead-code elimination pass on
our IR, which also performs symbolic simplification of common patterns
produced by bounds inference.
● There is mostly a one-to-one mapping between our representation and
LLVM’s, but two specific patterns warrant mention.
○ First, parallel for loops are lowered to LLVM code that first builds a closure containing state
referred to in the body of a for loop.
○ Second, many vector patterns are difficult to express or generate poor code if passed directly
to LLVM. We use peephole optimization to reroute these to architecture-specific intrinsics.
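The closure pattern for parallel loops can be sketched in C++ as follows. Everything here is an assumption for illustration: `BlurRowClosure`, `blur_row_task`, and `do_par_for` are hypothetical names, and this `do_par_for` runs iterations serially where a real runtime would hand them to a thread pool.

```cpp
#include <vector>

// All state the loop body refers to, packed into one struct (the closure).
struct BlurRowClosure {
    const int* in;
    int* out;
    int width;
};

// The loop body, outlined into a function: one call per iteration of y.
void blur_row_task(int y, void* closure) {
    const BlurRowClosure* c = static_cast<const BlurRowClosure*>(closure);
    const int* row = c->in + y * c->width;
    for (int x = 1; x < c->width - 1; ++x)
        c->out[y * c->width + x] = row[x - 1] + row[x] + row[x + 1];
}

// Hypothetical runtime entry point. Iterations are independent, so a real
// implementation is free to run them on different threads.
void do_par_for(void (*task)(int, void*), int min, int extent, void* closure) {
    for (int i = min; i < min + extent; ++i) task(i, closure);
}

// What the lowered "parallel for y" amounts to: build the closure, then dispatch.
std::vector<int> blur_rows(const std::vector<int>& in, int width, int height) {
    std::vector<int> out(in.size(), 0);
    BlurRowClosure closure{in.data(), out.data(), width};
    do_par_for(blur_row_task, 0, height, &closure);
    return out;
}
```

Passing the state through a single pointer keeps the task signature fixed, so one runtime routine can dispatch any parallel loop.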
Autotuning
● The optimal schedule depends on machine architecture, image
dimensions and code generation in complex ways, and exhibits global
dependencies between choices due to loop fusion and caching behavior.
● Use a genetic algorithm to seek out a plausible approximate solution.
○ Fixed population size (128 individuals per generation)
● Schedule patterns
○ Compute and store under x, and vectorize by x
○ Fully parallelized and tiled
○ Parallelized over y and vectorized over x
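The search loop can be sketched as a small genetic algorithm. Only the population size of 128 comes from the slides; the `Schedule` encoding, the selection, crossover and mutation rules, and especially `synthetic_cost` are hypothetical. A real autotuner would compile each candidate schedule and time it instead of calling a formula.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical schedule encoding: log2 of tile sizes and vector width.
struct Schedule { int tile_x_log2; int tile_y_log2; int vec_log2; };

// Made-up cost standing in for "compile and measure run time".
double synthetic_cost(const Schedule& s) {
    auto sq = [](double d) { return d * d; };
    return sq(s.tile_x_log2 - 6) + sq(s.tile_y_log2 - 5) + sq(s.vec_log2 - 3);
}

Schedule random_schedule(std::mt19937& rng) {
    std::uniform_int_distribution<int> tile(0, 8), vec(0, 4);
    return {tile(rng), tile(rng), vec(rng)};
}

// Each gene is taken from one of the two parents.
Schedule crossover(const Schedule& a, const Schedule& b, std::mt19937& rng) {
    std::bernoulli_distribution coin(0.5);
    Schedule child = a;
    if (coin(rng)) child.tile_x_log2 = b.tile_x_log2;
    if (coin(rng)) child.tile_y_log2 = b.tile_y_log2;
    if (coin(rng)) child.vec_log2 = b.vec_log2;
    return child;
}

// Occasionally replace one gene with a fresh random value.
void mutate(Schedule& s, std::mt19937& rng) {
    std::bernoulli_distribution hit(0.2);
    if (!hit(rng)) return;
    const Schedule fresh = random_schedule(rng);
    switch (std::uniform_int_distribution<int>(0, 2)(rng)) {
        case 0: s.tile_x_log2 = fresh.tile_x_log2; break;
        case 1: s.tile_y_log2 = fresh.tile_y_log2; break;
        default: s.vec_log2 = fresh.vec_log2; break;
    }
}

struct TuneResult {
    Schedule best;
    std::vector<double> best_cost_per_generation;
};

TuneResult autotune(unsigned seed, int generations) {
    const int kPopulation = 128;   // fixed population size
    std::mt19937 rng(seed);
    std::vector<Schedule> pop(kPopulation);
    for (Schedule& s : pop) s = random_schedule(rng);
    auto by_cost = [](const Schedule& a, const Schedule& b) {
        return synthetic_cost(a) < synthetic_cost(b);
    };
    TuneResult r;
    std::uniform_int_distribution<int> parent(0, kPopulation / 2 - 1);
    for (int g = 0; g < generations; ++g) {
        std::sort(pop.begin(), pop.end(), by_cost);
        r.best_cost_per_generation.push_back(synthetic_cost(pop[0]));
        // Keep the better half unchanged; refill the rest from it.
        for (int i = kPopulation / 2; i < kPopulation; ++i) {
            const Schedule& a = pop[parent(rng)];
            const Schedule& b = pop[parent(rng)];
            pop[i] = crossover(a, b, rng);
            mutate(pop[i], rng);
        }
    }
    r.best = *std::min_element(pop.begin(), pop.end(), by_cost);
    return r;
}
```

Because the better half of each generation survives unchanged, the best cost never gets worse from one generation to the next; the result is still only an approximate optimum.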
Results
● Platform
○ CPU - quad core Xeon W3520
○ GPU - NVIDIA Tesla C2070
● Algorithm
Result
Autotuning Performance
Q&A
