Halide
Domain Specific Language
Decoupling Algorithms from Schedules for Easy
Optimization of Image Processing Pipelines
Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy,
Saman Amarasinghe, Frédo Durand
ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2012
Sobel Edge Detection
Gx:
-1  0  1
-2  0  2
-1  0  1
Gy:
 1  2  1
 0  0  0
-1 -2 -1
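A rough sketch of how the two kernels might be applied, in plain C++ rather than Halide: each kernel is multiplied element-wise with the 3x3 neighbourhood of a pixel and summed, and the two responses are combined into a gradient magnitude. The helper name `sobel_magnitude`, the row-major `std::vector<float>` image layout, and applying the kernels without flipping them are assumptions made for illustration, not something the slides specify.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sobel kernels as listed on the slide.
static const int GX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
static const int GY[3][3] = {{1, 2, 1}, {0, 0, 0}, {-1, -2, -1}};

// Gradient magnitude at an interior pixel (x, y) of a row-major grayscale
// image. Hypothetical helper: border pixels are left to the caller.
float sobel_magnitude(const std::vector<float>& img, int width, int height,
                      int x, int y) {
    assert(x >= 1 && x < width - 1 && y >= 1 && y < height - 1);
    float gx = 0.0f, gy = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            float p = img[(y + dy) * width + (x + dx)];
            gx += GX[dy + 1][dx + 1] * p;  // horizontal change
            gy += GY[dy + 1][dx + 1] * p;  // vertical change
        }
    }
    return std::sqrt(gx * gx + gy * gy);
}
```

On a flat region both responses cancel to zero; on a vertical edge only Gx responds.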
Making image processing faster with hardware
● Parallel Programming
● Memory Locality
=> Requires reorganizing the computation!
Message #1: Performance requires complex tradeoffs
Message #2: organization of computation
Existing languages make critical optimizations hard
Parallelism : vectorization / multithreading
Locality : fusion / tiling
C - parallelism + tiling + fusion are hard to write or automate
CUDA, OpenCL, shaders - data parallelism is easy, fusion is hard
libraries don’t help:
BLAS, IPP, MKL, OpenCV, MATLAB
optimized kernels compose into inefficient pipelines (no fusion)
Halide DSL : decouple algorithm from schedule
Algorithm: what is computed
Schedule: where and when it’s computed
Easy for programmers to build pipelines
simplifies algorithm code
improves modularity
Easy for programmers to specify & explore optimizations
fusion, tiling, parallelism, vectorization
can’t break the algorithm
Easy for the compiler to generate fast code
Halide is embedded in C++
● Build Halide functions and expressions using C++
● Evaluate Halide functions immediately
○ just-in-time compile to produce and run a Halide pipeline
● Or statically compile to an object file and header
○ One C++ program creates the Halide pipeline. When run, it produces an object file and header.
○ You link these into your actual program.
Halide Language
● Algorithm
○ Halide::Image
○ Halide::Var
○ Halide::Expr
○ Halide::Buffer
● Scheduling
○ next slide
Halide Language
Default and Reorder Schedule
[Figure: pixel traversal order, panels "Default" and "Reorder"]
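The difference between the two traversal orders can be sketched as ordinary C++ loop nests. This is not Halide code: it only records the order in which (x, y) points would be visited, assuming the default order iterates x innermost (row-major) and the reordered schedule swaps the loops so that y is innermost. The names `default_order` and `reordered` are made up for this sketch.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Visit = std::pair<int, int>;  // (x, y), in the order it is computed

// Default order: y is the outer loop, x the inner loop (row-major walk).
std::vector<Visit> default_order(int width, int height) {
    assert(width >= 0 && height >= 0);
    std::vector<Visit> order;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            order.push_back({x, y});
    return order;
}

// Reordered: x is the outer loop, y the inner loop (column-major walk).
std::vector<Visit> reordered(int width, int height) {
    assert(width >= 0 && height >= 0);
    std::vector<Visit> order;
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            order.push_back({x, y});
    return order;
}
```

Both functions visit exactly the same points; only the order, and therefore the memory access pattern, changes.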
Split Schedule
gradient.split(x, x_outer, x_inner, 4);
gradient.split(y, y_outer, y_inner, 4);
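In plain C++ terms, the two split calls above correspond roughly to the loop nest below, where each original loop becomes an outer loop over blocks of 4 and an inner loop within a block. This sketch assumes the extents are multiples of 4 and that `gradient(x, y) = x + y` (the slides do not define `gradient`); the function name `gradient_split` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Fills out[y * width + x] = x + y using the four-deep loop nest that
// splitting x and y by a factor of 4 would produce.
std::vector<int> gradient_split(int width, int height) {
    assert(width % 4 == 0 && height % 4 == 0);  // sketch-only restriction
    std::vector<int> out(width * height);
    for (int y_outer = 0; y_outer < height / 4; ++y_outer) {
        for (int y_inner = 0; y_inner < 4; ++y_inner) {
            for (int x_outer = 0; x_outer < width / 4; ++x_outer) {
                for (int x_inner = 0; x_inner < 4; ++x_inner) {
                    // Recover the original coordinates from the split ones.
                    int x = x_outer * 4 + x_inner;
                    int y = y_outer * 4 + y_inner;
                    out[y * width + x] = x + y;
                }
            }
        }
    }
    return out;
}
```

A split on its own does not change the visiting order; it creates the extra loop levels that vectorizing, tiling, and parallelizing then act on.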
Evaluating in vectors
gradient.split(x, x_outer, x_inner, 4);
gradient.vectorize(x_inner);
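One way to picture the vectorized schedule in plain C++: the inner loop of width 4 is replaced by a single operation on a 4-wide value. Here `Vec4` is a `std::array` standing in for a SIMD register, `gradient(x, y) = x + y` is again assumed, and the names `gradient_lanes` and `gradient_row_vectorized` are invented for the sketch. Real vector code would use hardware SIMD instructions instead of the small lane loops.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Vec4 = std::array<int, 4>;  // stand-in for a 4-wide SIMD register

// x + y for four adjacent x values, treated as one vector operation.
Vec4 gradient_lanes(int x_base, int y) {
    Vec4 v{};
    for (int lane = 0; lane < 4; ++lane)
        v[lane] = (x_base + lane) + y;
    return v;
}

// One output row: x_outer stays a loop, x_inner becomes the vector lanes.
std::vector<int> gradient_row_vectorized(int width, int y) {
    assert(width % 4 == 0);  // sketch-only restriction
    std::vector<int> row(width);
    for (int x_outer = 0; x_outer < width / 4; ++x_outer) {
        Vec4 v = gradient_lanes(x_outer * 4, y);
        for (int lane = 0; lane < 4; ++lane)  // models one vector store
            row[x_outer * 4 + lane] = v[lane];
    }
    return row;
}
```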
Parallel
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
gradient.fuse(x_outer, y_outer, tile_index);
gradient.parallel(tile_index);
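The tile, fuse, and parallel steps can be read as the following plain C++ structure: the image is cut into 4x4 tiles, the two tile loops are fused into a single `tile_index` loop, and each iteration of that loop is independent. The sketch runs the tiles sequentially to stay dependency-free; a parallel schedule would hand the iterations to worker threads. It assumes extents that are multiples of 4 and `gradient(x, y) = x + y`; `compute_tile` and `gradient_tiled` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Computes one 4x4 tile of gradient(x, y) = x + y.
void compute_tile(std::vector<int>& out, int width, int x_outer, int y_outer) {
    for (int y_inner = 0; y_inner < 4; ++y_inner) {
        for (int x_inner = 0; x_inner < 4; ++x_inner) {
            int x = x_outer * 4 + x_inner;
            int y = y_outer * 4 + y_inner;
            out[y * width + x] = x + y;
        }
    }
}

std::vector<int> gradient_tiled(int width, int height) {
    assert(width % 4 == 0 && height % 4 == 0);  // sketch-only restriction
    const int tiles_x = width / 4;
    const int tiles_y = height / 4;
    std::vector<int> out(width * height);
    // Fused loop over all tiles. Every iteration writes a disjoint region
    // of `out`, which is what makes running them in parallel safe.
    for (int tile_index = 0; tile_index < tiles_x * tiles_y; ++tile_index) {
        int x_outer = tile_index % tiles_x;
        int y_outer = tile_index / tiles_x;
        compute_tile(out, width, x_outer, y_outer);
    }
    return out;
}
```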
Parallelism
distribute across threads
SIMD parallel vectors
Locality
compute in tiles
interleave tiles of blurx, blury
store blurx in local cache
Input Image
Blur X
Blur Y
Halide Compiler
Basic Halide program (default schedule)
Image<float> input = load<float>("images/rgb.png");
Var x, y;
Func blur_x;
Func blur_y;
blur_x(x,y) = (input(x,y)+input(x+1,y)+input(x+2,y))/3.0;
blur_y(x,y) = (blur_x(x,y)+blur_x(x,y+1)+blur_x(x,y+2))/3.0;
Image<float> output = blur_y.realize(input.width()-2,input.height()-2);
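To make the algorithm/schedule split concrete, here is a plain C++ sketch (not Halide) of two ways the same two-stage blur could be executed. `blur_inline` recomputes the horizontal blur at every use, with no intermediate buffer; `blur_stored` computes and stores the whole horizontal blur first, then reads it back. The `Image` struct, the function names, and the single-channel row-major layout are assumptions for illustration only.

```cpp
#include <cassert>
#include <vector>

// Minimal single-channel image, row-major. Hypothetical helper type.
struct Image {
    int width, height;
    std::vector<float> data;
    float at(int x, int y) const { return data[y * width + x]; }
};

// Horizontal 3-tap blur as a pure function of (x, y).
float blur_x_at(const Image& in, int x, int y) {
    return (in.at(x, y) + in.at(x + 1, y) + in.at(x + 2, y)) / 3.0f;
}

// Schedule A: blur_x is recomputed wherever blur_y needs it.
Image blur_inline(const Image& in) {
    assert(in.width >= 3 && in.height >= 3);
    Image out{in.width - 2, in.height - 2,
              std::vector<float>((in.width - 2) * (in.height - 2))};
    for (int y = 0; y < out.height; ++y)
        for (int x = 0; x < out.width; ++x)
            out.data[y * out.width + x] =
                (blur_x_at(in, x, y) + blur_x_at(in, x, y + 1) +
                 blur_x_at(in, x, y + 2)) / 3.0f;
    return out;
}

// Schedule B: blur_x is computed and stored for the whole image first.
Image blur_stored(const Image& in) {
    assert(in.width >= 3 && in.height >= 3);
    Image bx{in.width - 2, in.height,
             std::vector<float>((in.width - 2) * in.height)};
    for (int y = 0; y < bx.height; ++y)
        for (int x = 0; x < bx.width; ++x)
            bx.data[y * bx.width + x] = blur_x_at(in, x, y);
    Image out{in.width - 2, in.height - 2,
              std::vector<float>((in.width - 2) * (in.height - 2))};
    for (int y = 0; y < out.height; ++y)
        for (int x = 0; x < out.width; ++x)
            out.data[y * out.width + x] =
                (bx.at(x, y) + bx.at(x, y + 1) + bx.at(x, y + 2)) / 3.0f;
    return out;
}
```

Both functions produce the same output; they differ only in how much work is repeated and how much intermediate storage is used, which is the kind of tradeoff a schedule controls.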
Metaprogramming
● Create C++ objects that describe a Halide program
● Essentially algebraic trees (Abstract Syntax Tree, AST)
● Once the representation is constructed, call .realize() to compile and execute
● This calls the C++ Halide compiler, creates binary, executes it
● Metaprogramming makes it easy to embed in an existing language and
codebase, avoids the need to parse
Syntax: Main types/keywords
Func: pure functions over an integer domain
Var: pure abstract variables for the domain of Funcs
Expr: algebraic expressions of Funcs and Vars, including standard operators and
functions (+, -, *, /, &, sqrt, sin, cos, ...)
Image: arrays used as inputs and outputs
Q & A
