High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
Published on http://biglearn.org
1. High-Performance Computing Needs Machine Learning... And Vice Versa (was "GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks") Nicolas Pinto, NIPS "Big Learning" | December 16th, 2011, The Rowland Institute at Harvard, Harvard University
2. Outline: 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
3. Outline: 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
4. Motivation...
5. The Problem: Visual Object Recognition
6. Why?
7. Why? it seems easy, right?
8. 44 years ago...
9. The Problem: Visual Object Recognition
10. The Problem: Visual Object Recognition
11. The Problem: Visual Object Recognition - fast
12. The Problem: Visual Object Recognition - fast, accurate
13. The Problem: Visual Object Recognition - fast, accurate, effortless
14. The Problem: Visual Object Recognition - fast, accurate, effortless, critical to survival
15. The Problem: Visual Object Recognition - fast, accurate, effortless, critical to survival, tolerant to variations!
16. hard?
17. hard? // the world is 3D but the retina is 2D
18. hard? // the world is 3D but the retina is 2D // the curse of dimensionality
19. hard? // the world is 3D but the retina is 2D // the curse of dimensionality // considerable image variation
20. ~50% of [the brain] is for vision!
21. you may have learned it...
22. Background
23. The Approach: Reverse and Forward Engineering the Brain
24. The Approach: Reverse and Forward Engineering the Brain. REVERSE: study the natural system | FORWARD: build an artificial system
25. The Approach: Reverse and Forward Engineering the Brain. REVERSE: study the natural system | FORWARD: build an artificial system
26. Reverse Engineering: The Ventral Visual Stream (images by DiCarlo JJ & Cox DD, animation by Li N)
27. Reverse Engineering: The Ventral Visual Stream (images by DiCarlo JJ & Cox DD, animation by Li N)
28. Reverse Engineering: The Ventral Visual Stream - the brain = 20 petaflops?!
29. The Approach: Reverse and Forward Engineering the Brain. REVERSE: study the natural system | FORWARD: build an artificial system
30. Forward Engineering: The Ventral Visual Stream - all about learning???
31. [diagram: stacked model layers (L1, L2, ...), each with tunable parameters: number of filters, kernel size, thresh/sat, norm strength, normalization neighborhood, learning rate, trace, "Temp. Adv.", "Auto-reset", ...]
32. How are things done normally?
33. How are things done normally? Usual Formula:
34. How are things done normally? Usual Formula: 1) One grad student
35. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
36. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
37. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
38. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
39. What do you call this? "This is graduate student descent" - David McAllester
40. What do you call this? "This is graduate student descent" - David McAllester
41. What's better than this? "Conjugate graduate student descent?" - Nicolas Poilvert
42. Doing things a little bit differently
43. Doing things a little bit differently 1) One grad student
44. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models ("One" crossed out)
45. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
46. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
47. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock.
48. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) Hundreds of Thousands One PhD? ("Hundreds of Thousands" crossed out)
49. "If you want to have good ideas you must have many ideas." "Most of them will be wrong, and what you have to learn is which ones to throw away." - Linus Pauling (double Nobel Prize winner)
51. [diagram: a large family of brain-inspired models - L1/L2/L3 + read-out, each layer with number of filters, kernel size, thresh/sat, norm strength, normalization neighborhood, learning rate, trace, "Temp. Adv.", "Auto-reset", ...; 52 parameters, more than 10^25 possible unique combinations - very inclusive!] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
52. The curse of speed
53. The curse of speed: thousands of big models
54. The curse of speed: thousands of big models, large amounts of unsupervised learning experience
55. The curse of speed... and the blessing of massively parallel computing. No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) + Science
56. The curse of speed... and the blessing of massively parallel computing. No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) + Science. Leverage non-scientific high-tech markets and their $billions of R&D... Gaming: Graphics Cards (GPUs), PlayStation 3 | Web 2.0: Cloud Computing (Amazon, Google)
57. Build your own!
58. The blessing of GPUs: computational power [chart: peak GFLOP/s over time, GPUs vs. CPUs]. DIY GPU pr0n (since 2006), Sony PlayStation 3s (since 2007)
59. speed (in billion floating point operations per second): Q9450 (Matlab/C) [2008]: 0.3 | Q9450 (C/SSE) [2008]: 9.0 | 7900GTX (OpenGL/Cg) [2006]: 68.2 | PS3/Cell (C/ASM) [2007]: 111.4 | 8800GTX (CUDA1.x) [2007]: 192.7 | GTX280 (CUDA2.x) [2008]: 339.3 | GTX480/Fermi (CUDA3.x) [2010]: 974.3. Pinto, Doukhan, DiCarlo, Cox PLoS 2009; Pinto, Cox GPU Comp. Gems 2011
60. speed (in billion floating point operations per second): Q9450 (Matlab/C) [2008]: 0.3 | Q9450 (C/SSE) [2008]: 9.0 | 7900GTX (OpenGL/Cg) [2006]: 68.2 | PS3/Cell (C/ASM) [2007]: 111.4 | 8800GTX (CUDA1.x) [2007]: 192.7 | GTX280 (CUDA2.x) [2008]: 339.3 | GTX480/Fermi (CUDA3.x) [2010]: 974.3. >1000X speedup is game changing... Pinto, Doukhan, DiCarlo, Cox PLoS 2009; Pinto, Cox GPU Comp. Gems 2011
61. High-throughput Screening: skimming off the best models [histogram: Count vs. Performance (%), N=2,500 models, chance level and "stupid baseline" marked] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
62. High-throughput Screening: skimming off the best models [histogram: Count vs. Performance (%), N=2,500 models, chance level and "stupid baseline" marked] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
63. High-throughput Screening: skimming off the best models [histogram: Count vs. Performance (%), N=2,500 models, chance level and "stupid baseline" marked] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
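A minimal sketch of the screening loop these slides describe: sample random model configurations from a large discrete space, score each one, and skim off the top performers. The parameter names and the scoring stand-in are illustrative placeholders, not the actual 52-parameter space or evaluation protocol of Pinto, Doukhan, DiCarlo, Cox (PLoS 2009).

import random

# Hypothetical parameter space; the real models expose ~52 parameters
# (filters, kernel sizes, normalization, learning rates, ...).
SPACE = {
    "n_filters": [16, 32, 64, 128, 256],
    "kernel_size": [3, 5, 7, 9],
    "norm_strength": [0.0, 0.1, 1.0, 10.0],
    "threshold": [0.0, 0.5, 1.0],
}

def sample_config(rng):
    # Draw one random point from the discrete parameter space.
    return {name: rng.choice(values) for name, values in SPACE.items()}

def evaluate_model(config, rng):
    # Stand-in for the expensive step: in reality, instantiate the model on
    # GPUs, extract features on a screening set, train a classifier, and
    # return % correct.
    return rng.random()

def screen(n_models=2500, top_k=5, seed=0):
    rng = random.Random(seed)
    scored = [(evaluate_model(c, rng), c)
              for c in (sample_config(rng) for _ in range(n_models))]
    # "Skim off" the best models for validation on other tasks.
    return sorted(scored, key=lambda sc: sc[0], reverse=True)[:top_k]

print(screen())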
64. High-throughput Screening: validate on other tasks. ~90% vs. "HMAX 2.1" (~80%) [bar chart: V1-like (baseline), state-of-the-art (from literature), high-throughput models 1-5] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
65. High-throughput Screening: validate on other tasks. ~90% vs. "HMAX 2.1" (~80%) [bar chart: V1-like (baseline), state-of-the-art (from literature), high-throughput models 1-5] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
66. High-throughput Screening: validate on other tasks. ~90% vs. "HMAX 2.1" (~80%) [bar chart: V1-like (baseline), state-of-the-art (from literature), high-throughput models 1-5] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
67. High-throughput Screening: validate on other tasks. ~90% vs. "HMAX 2.1" (~80%) [bar chart: V1-like (baseline), state-of-the-art (from literature), high-throughput models 1-5] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
68. High-throughput Screening: validate on faces [bar chart: V1-like (baseline); state-of-the-art from literature: HMAX 2.1, PHOG, GB, PHOW, SIFT, blend; high-throughput models 1-5] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
69. Human vs. Machine: 8-way object categorization. chance: 12.5% | baseline: 31.3 | best model: 64 | best human: 99.1
70. What does it all mean? what have we learned? briefly...
71. What does it all mean? what have we learned? Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); each layer: Filter (Φ1 Φ2 ... Φk) → Threshold & Saturate → Pool → Normalize. ➡ dimensionality: more filters is better
72. What does it all mean? what have we learned? Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); each layer: Filter (Φ1 Φ2 ... Φk) → Threshold & Saturate → Pool → Normalize. ➡ learning is difficult
73. What does it all mean? what have we learned? Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); each layer: Filter (Φ1 Φ2 ... Φk) → Threshold & Saturate → Pool → Normalize. ➡ non-linearities are important
74. What does it all mean? what have we learned? Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); each layer: Filter (Φ1 Φ2 ... Φk) → Threshold & Saturate → Pool → Normalize. ➡ normalization is very important; missed in previous modeling efforts, now confirmed by LeCun et al., Poggio et al., Ng et al.
75. What are these models not good for? [images: objects, low-level backgrounds, faces]
76. Outline: 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
77. one more thing
78. Real-world apps? testing the generality and scalability of the approach
79. Facebook: Really Real World Problem. Enormous scale: billions of photos, 3TB+ uploaded every day, dense collaborative face labels (collab. with Zak Stone & Todd Zickler @ Harvard)
80. Relevance to Social Networking (slide courtesy of David Cox)
81. Relevance to Social Networking
82. High-throughput Screening
83. High-Throughput Screening: Labeled Faces in the Wild (LFW) View 1. > 30,000 large-scale models (1 to 3 layers) screened in only 3 days [chart: top 5 HT L3s (3 layers), LFW view 1 performance]. No Unsupervised Learning! Pinto, Cox (FG 2011); Pinto, Stone, Zickler, Cox (CVPR 2011)
84. Generalization Performance on LFW View 2 (hold out) [bar chart: Face Verification Performance (% correct) - V1-like baseline: 79.4; published systems (Wolf et al. ACCV 2009, Kumar et al. ICCV 2009): 85.3-86.8; Ours (HT): 88.1; face.com also shown] Pinto, Cox (FG 2011)
85. "Facebook100": typical social network size? (collab. with Zak Stone & Todd Zickler @ Harvard) Pinto, Stone, Zickler, Cox (CVPR 2011)
86. Auto-tagging a network of 100 Facebook friends: > 86% accurate (w/ 90 training examples) (collab. with Zak Stone & Todd Zickler @ Harvard) Pinto, Stone, Zickler, Cox (CVPR 2011)
87. vs face.com: comparison with a heavily-specialized commercial system [chart: performance (% correct) vs. training example(s) / friend - L3 (hardware-accelerated brute-force random model) vs. face.com (best technology around) vs. V1-like (one layer)] Pinto, Stone, Zickler, Cox (CVPR 2011)
88. Conclusion?
89. Hardware Matters! Yann LeCun's Mac (picture courtesy of Koray Kavukcuoglu)
90. Outline: 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
91. Two conflicting requirements: The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run. Neural data only provides weak constraints ➡ Lots of parameters - hard to explore.
92. Two conflicting requirements: The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run [FAST]. Neural data only provides weak constraints ➡ Lots of parameters - hard to explore.
93. Two conflicting requirements: The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run [FAST]. Neural data only provides weak constraints ➡ Lots of parameters - hard to explore [FLEXIBLE].
94. Two conflicting requirements: The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run [FAST]. Neural data only provides weak constraints ➡ Lots of parameters - hard to explore [FLEXIBLE]. How to optimize?
95. What's the bottleneck?
96. 3D Filterbank Convolutions!
97. Our answer?
98. Meta-programming!
99. Meta-programming: What?
100. Meta-programming! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW): • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
101. Meta-programming! "Instrument" your solutions: • Block size • Work size • Loop unrolling • Pre-fetching • Spilling • etc. ... and let the computer generate and find the optimal code
102. How?
103. Always use the right tool!
104. Templating a CUDA kernel with Cheetah ($-variables, #for, and #set are expanded before nvcc ever runs):

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)
__global__ void convolve_beta_j${j}(float4 *input, float4 *output)
{
#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
  __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
  if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
    shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
    shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
    shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
    ...
105. Compilation? (with Python-based solutions)
106. PyCUDA/PyOpenCL (by Andreas Klöckner) - Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
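To make the workflow concrete, here is a minimal, hedged sketch of the same idea using PyCUDA and Python's built-in string.Template in place of Cheetah. The kernel, its tunable parameters (BLOCK_W, UNROLL), and the toy "scale" computation are illustrative, not the filterbank-convolution template above.

import numpy as np
import pycuda.autoinit                      # create a context on the default GPU
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
from string import Template

KERNEL = Template("""
__global__ void scale(float *x, int n)
{
    int i = (blockIdx.x * ${BLOCK_W} + threadIdx.x) * ${UNROLL};
    // body below was fully unrolled at code-generation time
${BODY}
}
""")

def render(block_w, unroll):
    # Generate the unrolled body in Python: "smooth syntactic ugliness".
    body = "\n".join(
        "    if (i + %d < n) x[i + %d] *= 2.0f;" % (k, k) for k in range(unroll))
    return KERNEL.substitute(BLOCK_W=block_w, UNROLL=unroll, BODY=body)

if __name__ == "__main__":
    block_w, unroll, n = 128, 4, 1024
    mod = SourceModule(render(block_w, unroll))   # nvcc runs here, at runtime
    scale = mod.get_function("scale")
    x = gpuarray.to_gpu(np.arange(n, dtype=np.float32))
    n_blocks = (n + block_w * unroll - 1) // (block_w * unroll)
    scale(x.gpudata, np.int32(n), block=(block_w, 1, 1), grid=(n_blocks, 1))
    print(x.get()[:8])                            # each element doubled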
107. Basic GPU Meta-programming System: "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD
108. conv_kernel_template.cu shown next to conv_kernel_4x4x4.cu, one specialization it generates; the template loops expand into straight-line code. Excerpt of the generated file:

#include <stdio.h>
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];
#define IMUL(a, b) __mul24(a, b)
extern "C" {
__global__ void convolve_beta_j0(float4 *input, float4 *output)
{
  __shared__ float shared_in[131][4+1];
  // -- input/output offsets
  const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;
  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();
  // -- compute dot products
  float v, w;
  float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0]; sum0 += v*w;
  w = constant[0][0][1]; sum1 += v*w;
  w = constant[0][0][2]; sum2 += v*w;
  w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0]; sum0 += v*w;
  ...

109. File sizes tell the story: conv_kernel_template.cu is 20 kB, while a single generated specialization such as conv_kernel_8x8x4.cu is 64 kB (conv_kernel_4x4x4.cu shown alongside).
110. Benefits?
111. Smooth syntactic ugliness
112. Smooth syntactic ugliness: manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
113. Smooth syntactic ugliness: manipulations that are not easily accessible in CUDA C code: • fine-controlled loop unrolling / jamming. Generated excerpt:

v = shared_in[threadIdx.x+0][0];
w = constant[0][0][0]; sum0 += v*w;
w = constant[0][0][1]; sum1 += v*w;
w = constant[0][0][2]; sum2 += v*w;
w = constant[0][0][3]; sum3 += v*w;
v = shared_in[threadIdx.x+1][0];
w = constant[0][1][0]; sum0 += v*w;
w = constant[0][1][1]; sum1 += v*w;
w = constant[0][1][2]; sum2 += v*w;
w = constant[0][1][3]; sum3 += v*w;
v = shared_in[threadIdx.x+2][0];
w = constant[0][2][0]; sum0 += v*w;
... (and so on across taps and filters)
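For illustration, a few lines of Python suffice to emit exactly this kind of unrolled-and-jammed sequence. The array names (shared_in, constant, sumN) mirror the generated kernel above, but the function itself is a hypothetical helper, not code from the chapter.

def unrolled_macs(filter_w, n_filters, d=0):
    # Emit a fully unrolled multiply-accumulate block: the outer loop is
    # unrolled over filter taps, the inner loop "jammed" across filters.
    lines = []
    for i in range(filter_w):
        lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
        for n in range(n_filters):
            lines.append("w = constant[%d][%d][%d]; sum%d += v*w;" % (d, i, n, n))
    return "\n".join(lines)

print(unrolled_macs(filter_w=4, n_filters=4))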
114. How about #pragma unroll? (why don't you trust the compiler?)
115. We are not alone... [slide: page from "Using GPUs for Signal Correlation" (Daniel A. Mitchell et al., The Murchison Widefield Array): "Don't trust compilers" - compare these "identical" code fragments (a += b*c; ... vs. a += b*c + d*c + e*f + g*h;), which compile to ~770 GFLOPS vs. ~20 GFLOPS]
116. Smooth syntactic ugliness: manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
117. Smooth syntactic ugliness: manipulations that were not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
118. Explore design decision space more freely
119. Basic GPU Meta-programming System: "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD
120. ... too many optimizations? [word cloud: bank conflicts, coalescing, caching, (mixed) precision, partition camping, loop unrolling, clamping, broadcasting, zero-copy, streams, ...]
121. can't decide? keep them all!
122. Exploring design decision space more freely. Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
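What "keep them all" can look like in practice: a hedged sketch of empirical auto-tuning built on the render() helper from the PyCUDA sketch above. It enumerates a small grid of code-generation parameters, times each compiled variant with CUDA events, and keeps the fastest. The grid values are illustrative; a real search would also cull variants that fail to compile or exceed hardware limits (registers, shared memory).

import itertools
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

def time_variant(block_w, unroll, n=1 << 20, reps=10):
    # Compile one specialized variant (render() from the earlier sketch)
    # and measure its mean runtime with CUDA events.
    mod = SourceModule(render(block_w, unroll))
    kernel = mod.get_function("scale")
    x = gpuarray.to_gpu(np.ones(n, dtype=np.float32))
    grid = ((n + block_w * unroll - 1) // (block_w * unroll), 1)
    start, stop = cuda.Event(), cuda.Event()
    start.record()
    for _ in range(reps):
        kernel(x.gpudata, np.int32(n), block=(block_w, 1, 1), grid=grid)
    stop.record()
    stop.synchronize()
    return start.time_till(stop) / reps           # milliseconds

candidates = list(itertools.product([64, 128, 256], [1, 2, 4, 8]))
best = min(candidates, key=lambda c: time_variant(*c))
print("fastest (block_w, unroll):", best)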
123. Same conv_kernel_beta_template.cu, two compiled versions, disassembled with decuda:

version A:
...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B:
...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

Version B is 2x faster... Why? (B reads its constant-memory operand directly inside each mad instead of issuing a separate mov.)
124. Results
125. speed (in billion floating point operations per second): Q9450 (Matlab/C) [2008]: 0.3 | Q9450 (C/SSE) [2008]: 9.0 | 7900GTX (OpenGL/Cg) [2006]: 68.2 | PS3/Cell (C/ASM) [2007]: 111.4 | 8800GTX (CUDA1.x) [2007]: 192.7 | GTX280 (CUDA2.x) [2008]: 339.3 | GTX480/Fermi (CUDA3.x) [2010]: 974.3. >1000X speedup is game changing... Pinto, Doukhan, DiCarlo, Cox PLoS 2009; Pinto, Cox GPU Comp. Gems 2011
126. Analysis ➡ Different hardware? [cropped rows from Table 33.1 also visible: 1024x1024x8 / 16x5x5x8: 726.412 ± 0.398 vs. 744.973 ± 0.571; 2048x2048x4 / 4x8x8x4: 474.681 ± 0.160 vs. 887.974 ± 1.017]

Table 33.2 - Performance of auto-tuned implementations on two hardware platforms, including performance tuned on one platform and run on the other:

  Run on:   Optimized for 9400M   Optimized for GTX480   Tuning speedup
  9400M     0.32 s                2.52 s                 675%
  GTX480    0.016 s               0.011 s                52%

Significant performance gains are observed for the auto-tuned meta-kernels as compared to a default kernel, which was hand-picked to allow correct execution for all input ranges without running up against hardware limitations.
127. Analysis ➡ Different input configurations?

Table 33.3 - Performance of auto-tuned implementations on two input configurations, including performance tuned for one configuration and run with the other:

  Run on:   Optimized for config1   Optimized for config2   Tuning speedup
  config1   11.1 ms                 15.7 ms                 41%
  config2   fails                   10.8 ms                 not comparable

In Table 33.3 we show the effect of tuning on one input configuration and running on another: again, significant speedups are obtained using kernels tailored to a specific input configuration.
128. Summary
129. Summary. Meta-programming:
130. Summary. Meta-programming: • can assist exploration and manual optimization
131. Summary. Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code
132. Summary. Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)
133. Summary. Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups!
134. Summary. Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups! ➡ facilitates "auto-tuning"!
135. Outline: 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
136. Intelligent and fast Auto-Tuning with Machine Learning (with James Bergstra and David Cox)
137. Intelligent and fast Auto-Tuning with Machine Learning
138. Auto-tuning: two approaches
139. Auto-tuning: two approaches • Analytical model-based optimization:
140. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast "inference"
141. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast "inference" - cons: hard to build, domain expertise required, auto-tuned code far from peak
142. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast "inference" - cons: hard to build, domain expertise required, auto-tuned code far from peak • Empirical optimization:
143. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast "inference" - cons: hard to build, domain expertise required, auto-tuned code far from peak • Empirical optimization: - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build
144. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast "inference" - cons: hard to build, domain expertise required, auto-tuned code far from peak • Empirical optimization: - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build - cons: very slow "inference" (for new inputs, etc.)
145. Empirical Auto-Tuning. The goal is to empirically optimize execution time given both • the environment: hardware (GPU, CPU, memory, mobo, etc.) and software (SDK, compiler suite, etc.) • the data (input dimensions, repetitions, etc.)
146. Empirical Auto-Tuning with Meta-programming: "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD
147. Intelligent and fast Auto-Tuning with Machine Learning
148. Auto-tuning: best of both approaches?
149. Auto-tuning: best of both approaches? • Empirically-learned model-based optimization:
150. Auto-tuning: best of both approaches? • Empirically-learned model-based optimization: - pros: auto-tuned code close to peak*, easier to build (?), fast "inference" (for new inputs, hardware, etc.)
151. Auto-tuning: best of both approaches? • Empirically-learned model-based optimization: - pros: auto-tuned code close to peak*, easier to build (?), fast "inference" (for new inputs, hardware, etc.) - cons: unexplored!
152. Auto-tuning: best of both approaches? • Empirically-learned model-based optimization: - pros: auto-tuned code close to peak*, easier to build (?), fast "inference" (for new inputs, hardware, etc.) - cons: unexplored! *could be dominant in specialized libraries (e.g. machine learning!)
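A hedged sketch of what such an empirically-learned model can look like, in the spirit of the Bergstra, Pinto, Cox work presented next: fit a boosted regression tree on (configuration, measured runtime) pairs, then query the cheap surrogate instead of the hardware. The feature encoding is a naive simplification, the grid is tiny, time_variant() is the benchmarking helper sketched earlier, and the paper hill-climbs the model rather than scoring a full grid.

import random
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

GRID = [(bw, u) for bw in (32, 64, 128, 256) for u in (1, 2, 4, 8)]

# 1) One-off empirical phase: benchmark a random sample of variants.
train = random.sample(GRID, 8)
X = np.array(train, dtype=float)
y = np.array([time_variant(bw, u) for bw, u in train])  # measured runtimes (ms)

model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

# 2) Fast "inference" phase: predict runtimes for every candidate and pick
#    the one the surrogate believes is fastest (no further benchmarking).
pred = model.predict(np.array(GRID, dtype=float))
best = GRID[int(np.argmin(pred))]
print("predicted-fastest (block_w, unroll):", best)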
153. Fast Machine Learning-based Runtime Auto-Tuning (ML-based)
154. [slide: first page of "Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees", James Bergstra, Nicolas Pinto, David Cox (submitted; author block anonymized). From the abstract: empirical auto-tuning is generic but slow; model-based auto-tuning is fast but relies on hand-built abstractions; machine-learning methods for non-linear regression can estimate timing models from data, capturing the best of both approaches. On the filterbank correlation kernel of Pinto and Cox [2012], 0.1 seconds of hill climbing on the regression model ("predictive auto-tuning") achieves an average of 95% of the speed-up brought by minutes of empirical auto-tuning.]
155. 3D Filterbank Convolutions!
156. Preview [scatter plot, NVIDIA GTX 580 (Fermi): GFLOP/s of predictive auto-tuning (ML-based, < 0.1 sec per problem) vs. GFLOP/s of empirical auto-tuning (the old way: minutes of training), with equality, 2x-faster, and 2x-slower guide lines; auto-tuned and reference means marked]
157. Preview [same plot, annotated: > 1 TERAFLOP/s!]
158. What else?
159. What else could we do for HPC?
160. What else could we do for HPC? • Minimize failures (exascale supercomputers)
161. What else could we do for HPC? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors
162. What else could we do for HPC? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions
163. What else could we do for HPC? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions • Help design better architectures?
164. What else could we do for HPC? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions • Help design better architectures? • $$$
165. What else could we do for HPC? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions • Help design better architectures? • $$$ • etc.
166. It would be a win-win-win situation! (The Office, Season 2, Episode 27: "Conflict Resolution")
167. Outline: 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
168. Acknowledgements: DiCarlo Lab @ MIT - Jim DiCarlo, David Cox
169. Acknowledgements
170. [closing slide: "CO ME" image fragment]
