
IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)

Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach
David Cox, Harvard | James DiCarlo, Nicolas Pinto, MIT

Abstract:

The study of biological vision and the creation of artificial vision systems are naturally intertwined – exploration of the neuronal substrates of visual processing provides clues and inspiration for artificial systems, and artificial systems, in turn, serve as important generators of new ideas and working hypotheses. However, while systems neuroscience has provided inspiration for some of the "broad-stroke" properties of the visual system, much is still unknown. Even for those qualitative properties that most biologically-inspired models share, experimental data currently provide little constraint on their key parameters. Consequently, it is difficult to truly evaluate a set of computational ideas, since the performance of a model depends strongly on its particular instantiation – the size of the pooling kernels, the number of units per layer, exponents in normalization operations, etc.

To pave a way forward, we have developed a high-throughput approach to more expansively explore the possible range of biologically-inspired models, including models of larger, more realistic scale, leveraging recent advances in commodity stream processing hardware - particularly, high-end NVIDIA GPUs. In analogy to high-throughput screening approaches in molecular biology and genetics, we generated and trained thousands of potential network architectures and parameter instantiations, and "screened" the visual representations produced by these models using an object recognition task. From these candidate models, the most promising were selected for further analysis. We have shown that this approach can yield significant, reproducible gains in performance in a basic object recognition tasks, and that it can offer insight into which computational ideas are most important for achieving this performance.

In this talk, I'll also highlight how the application of flexible programming tools, such as high-level scripting and template metaprogramming, can enable large performance gains, while managing complexity for the developer. As the scale of available computational power continues to expand, our approach holds great potential both for accelerating progress in artificial vision, and for generating new, experimentally-testable hypotheses for the study of biological vision.

More on this research:
http://pinto.scripts.mit.edu/Research/Research
http://pinto.scripts.mit.edu/Research/Monster16GPU
http://www.rowland.harvard.edu/rjf/cox/Projects/projects/computation.html

More on IAP09 CUDA@MIT 6.963 at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)

  1. 1. A High-Throughput Approach to Discovering Good Forms of Visual Representation. David Cox (The Rowland Institute at Harvard), Nicolas Pinto, Jim DiCarlo (MIT BCS)
  2. 2. Goals
  3. 3. Goals 1) “Building Brains” Concrete example of real-world experiments fundamentally enabled by stream processing hardware
  4. 4. Goals 1) “Building Brains” Concrete example of real-world experiments fundamentally enabled by stream processing hardware 2) Tricks of the trade Some high-level highlights for how we leverage CUDA to achieve our goals
  5. 5. Goals 1) “Building Brains” Concrete example of real-world experiments fundamentally enabled by stream processing hardware 2) Tricks of the trade Some high-level highlights for how we leverage CUDA to achieve our goals
  6. 6. The Problem Why is vision hard?
  7. 7. The Problem Why is vision hard? World is 3D, but retina is 2D
  8. 8. The Problem Why is vision hard? World is 3D, but retina is 2D The same object can cast an infinite number of different images onto the retina
  9. 9. The Problem Object Variation
  10. 10. The Problem Object Variation
  11. 11. The Problem Object Variation
  12. 12. The Problem Object Variation
  13. 13. Transformation Identity
  14. 14. Transformation Identity
  15. 15. Transformation Identity
  16. 16. Transformation Identity
  17. 17. Transformation Identity
  18. 18. Transformation Identity Bigger difference in pixel space
  19. 19. The Approach: Reverse Engineering the Brain REVERSE Study Natural System
  20. 20. The Approach: Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  21. 21. The Approach: Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  22. 22. Reverse Engineering the Brain
  23. 23. Monkeys
  24. 24. Monkeys
  25. 25. Rats
  26. 26. Rats
  27. 27. Visual System
  28. 28. Visual System
  29. 29. IT Neurons: Complex Stimuli Desimone et al.
  30. 30. IT Cortex can do object recognition Hung et al., 2005
  31. 31. Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  32. 32. Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  33. 33. Object Recognition Training Set
  34. 34. Object Recognition Training Set Representation
  35. 35. Object Recognition Training Set Representation Classifier
  36. 36. Object Recognition Training Set Representation Classifier
  37. 37. Object Recognition Training Set Representation Classifier Test Example Representation
  38. 38. Object Recognition Training Set Representation Classifier Test Example Representation Guess
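
The train/test flow on slides 33 to 38 can be made concrete with a minimal sketch (illustrative only, not the authors' code): a placeholder extract_representation function stands in for the model being evaluated, and a linear SVM plays the role of the classifier.

    import numpy as np
    from sklearn.svm import LinearSVC

    def extract_representation(images):
        # Placeholder for the model under test: maps images to feature vectors.
        # Here we just flatten pixels; a real model would apply its filtering,
        # normalization, and pooling stages.
        return images.reshape(len(images), -1)

    # Toy data standing in for labeled training and test images (32x32 grayscale).
    rng = np.random.RandomState(0)
    train_images = rng.rand(100, 32, 32)
    train_labels = rng.randint(0, 2, size=100)   # e.g. "Siri" vs. "not Siri"
    test_images = rng.rand(20, 32, 32)

    # Training set -> representation -> classifier
    clf = LinearSVC()
    clf.fit(extract_representation(train_images), train_labels)

    # Test example -> representation -> guess
    guesses = clf.predict(extract_representation(test_images))
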
  39. 39. Object Recognition
  40. 40. Object Recognition Training Set
  41. 41. Object Recognition Training Set
  42. 42. Object Recognition Training Set
  43. 43. Object Recognition Training Set
  44. 44. Object Recognition Training Set
  45. 45. Object Recognition Testing Set
  46. 46. Object Recognition Testing Set
  47. 47. Object Recognition Testing Set “Siri”
  48. 48. Object Recognition Testing Set “Siri”
  49. 49. Object Recognition Testing Set “Not Siri” “Siri”
  50. 50. How are things done normally?
  51. 51. How are things done normally? Usual Formula:
  52. 52. How are things done normally? Usual Formula: 1) One grad student
  53. 53. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
  54. 54. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
  55. 55. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  56. 56. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
  57. 57. Why is this not optimal?
  58. 58. Why is this not optimal? • Lots of parameters – can’t explore easily
  59. 59. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run
  60. 60. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run • Advice from my friend:
  61. 61. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run • Advice from my friend: “Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”
  62. 62. Doing things a little bit differently
  63. 63. Doing things a little bit differently 1) One grad student
  64. 64. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models
  65. 65. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets*
  66. 66. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets*
  67. 67. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets* 4) yay. we. rock.
  68. 68. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets* 4) yay. we. rock. 5) One Ph.D.?
  69. 69. High-Throughput Screening
  70. 70. High-Throughput Screening
  71. 71. Pipeline: Biology
  72. 72. Pipeline: Biology “Plate” a Diversity of Organisms
  73. 73. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow
  74. 74. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge
  75. 75. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge → Collect Surviving Colonies
  76. 76. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge → Collect Surviving Colonies → Study / Repeat
  77. 77. Pipeline: Biology-Inspired Vision
  78. 78. Pipeline: Biology-Inspired Vision Generate Random Models
  79. 79. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video)
  80. 80. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task
  81. 81. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models
  82. 82. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
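
A schematic sketch of this generate / learn / screen / skim-off loop in Python. The function bodies and parameter names are placeholders standing in for the stages named on these slides, so this only shows the shape of the pipeline, not the actual models:

    import random

    def sample_random_model(rng):
        # Draw one random instantiation of the model family; these particular
        # parameter names and ranges are illustrative, not the real search space.
        return {
            "l1_kernel_size": rng.choice([3, 5, 7, 9]),
            "l1_num_filters": rng.choice([16, 32, 64]),
            "norm_neighborhood": rng.choice([3, 5, 7]),
            "learning_rate": rng.choice([1e-3, 1e-2, 1e-1]),
        }

    def unsupervised_learning(model, video_frames):
        # Placeholder: adapt the model's filters from unlabeled video.
        return model

    def screen_score(model):
        # Placeholder: quick object-recognition accuracy on the screening task.
        return random.random()

    rng = random.Random(0)
    video_frames = None  # stands in for a large unlabeled video corpus

    # Generate and screen many candidate models, then skim off the best few.
    candidates = [unsupervised_learning(sample_random_model(rng), video_frames)
                  for _ in range(1000)]
    best_models = sorted(candidates, key=screen_score, reverse=True)[:5]
    # best_models would then go on to validation on other tasks.
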
  83. 83. Need to break all of the implicit rules
  84. 84. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models
  85. 85. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models 2. Use more realistic inputs (e.g. large quantities of video)
  86. 86. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models 2. Use more realistic inputs (e.g. large quantities of video) 3. Test models which begin to approach the scale of natural systems
  87. 87. Massive Computation
  88. 88. Massive Computation If you work for it, a multi-TeraFLOPS cluster is doable for a modest price
  89. 89. Long, winding road of stream processing
  90. 90. GPU pr0n
  91. 91. GPU pr0n
  92. 92. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈
  93. 93. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism:
  94. 94. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions
  95. 95. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions Fine-grained: independent “neurons,” operating on overlapping inputs
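
A rough illustration of the coarse, embarrassingly parallel level only (the fine-grained level lives inside each GPU kernel and is not shown): independent frames, regions, or candidate models can simply be farmed out across workers. This is a generic sketch, not the authors' infrastructure.

    from multiprocessing import Pool

    def process_frame(frame_id):
        # Coarse-grained parallelism: each frame (or model, or image region)
        # is independent work. Inside this call, a GPU kernel would exploit the
        # fine-grained parallelism across "neurons" / output pixels.
        return frame_id * frame_id   # stand-in for real per-frame computation

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            results = pool.map(process_frame, range(100))
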
  96. 96. A Match Made in Heaven Images In, Images Out ≈
  97. 97. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited
  98. 98. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory
  99. 99. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory Data: 2D / 3D locality
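
A back-of-the-envelope illustration of the arithmetic-intensity point, with made-up filter-bank sizes (these numbers are assumptions, not taken from the talk): once an input tile is staged in shared memory, each value fetched from global memory is reused by many filter taps.

    # Illustrative filter-bank convolution: 64 filters of 9x9 over a 4-channel input.
    n_filters, fh, fw, depth = 64, 9, 9, 4
    bytes_per_float = 4

    # FLOPs per output pixel: one multiply-add (2 FLOPs) per filter tap per filter.
    flops_per_output = 2 * fh * fw * depth * n_filters

    # With shared-memory tiling, each input value is read from global memory
    # roughly once, so unique global traffic per output pixel is about one input
    # pixel's worth of data plus the write of the n_filters outputs.
    bytes_per_output = depth * bytes_per_float + n_filters * bytes_per_float

    print("FLOPs per output pixel:", flops_per_output)
    print("Approx. bytes of global traffic per output pixel:", bytes_per_output)
    print("Arithmetic intensity (FLOPs/byte): %.1f" % (flops_per_output / bytes_per_output))
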
  100. 100. These things are REALLY fast (chart built up across slides 100–109): Performance (GFLOPS) and Development Time (hours). Matlab: 0.3 GFLOPS, 0.5 hours; C/SSE: 9.0 GFLOPS, 10.0 hours; PS3: 110.0 GFLOPS, 30.0 hours; GT200: 330.0 GFLOPS, 10.0 hours.
  110. 110. Pipeline Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  111. 111. Pipeline Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  112. 112. The model stack and its parameters: input → L1 → L2 → L3 → Read-out. Each layer exposes its own settings: kernel size, number of filters, thresh/sat, normalization strength, normalization neighborhood, learning rate, trace, “Temp. Adv.”, “Auto-reset”, ...
  113. 113. (Zoom on the same diagram: the L1 and L2 parameter groups.)
  114. 114. A Broad Parametric Model: Normalize: Ni = Inputi / norm(Input over neighborhood). Compute filter responses: Ri = Fi ⊗ N, with Ri < thresh: Ri = thresh and Ri > sat: Ri = sat. Determine a “winning filter”: Ri′ = (∑ Tk · Hk) · Ri, winner: max(Ri′). Update filter: Fwinning = Fwinning + learning rate · N
  115. 115. A Broad Parametric Model (equations as above) • Optimize “coverage”
  116. 116. A Broad Parametric Model (equations as above) • Optimize “coverage” (filters span the range of observed inputs)
  117. 117. A Broad Parametric Model (equations as above) • Optimize “coverage” (filters span the range of observed inputs) • Privilege movement of filters in certain directions using temporal information
  118. 118. A Broad Parametric Model (equations as above) • Optimize “coverage” (filters span the range of observed inputs) • Privilege movement of filters in certain directions using temporal information • Expand dimensionality greatly and then scale back as layers progress
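
A minimal numpy rendering of the per-patch flow in the equations above. It drops the temporal-trace term (Tk · Hk) and uses arbitrary shapes and constants, so treat it as an illustration of the four steps (normalize, filter and clip, pick a winner, update the winner), not as the authors' implementation.

    import numpy as np

    rng = np.random.RandomState(0)
    n_filters, patch_dim = 16, 49            # e.g. 7x7 input patches
    F = rng.randn(n_filters, patch_dim)      # filter bank
    thresh, sat, learning_rate = 0.0, 1.0, 0.05

    def layer_step(patch, F):
        # Normalize: Ni = Input_i / norm(Input over the neighborhood)
        N = patch / (np.linalg.norm(patch) + 1e-6)
        # Compute filter responses Ri = Fi (x) N, then clip at thresh and sat
        R = np.clip(F @ N, thresh, sat)
        # Determine a "winning filter" (temporal-trace weighting omitted here)
        winner = np.argmax(R)
        # Update the winning filter toward the current (normalized) input
        F[winner] += learning_rate * N
        return R, winner

    patch = rng.randn(patch_dim)
    responses, winner = layer_step(patch, F)
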
  119. 119. Dealing with Parametric Complexity
  120. 120. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges:
  121. 121. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges: • Complexity
  122. 122. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges: • Complexity • Kernels must perform under widely varying conditions
  123. 123. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges: • Complexity • Kernels must perform under widely varying conditions: the best kernel for a 3x3 conv may not be the same as the best kernel for a 17x17 one; more complex operations are even hairier
  124. 124. Meta-programming
  125. 125. Meta-programming Leave the grunt-programming to the computer
  126. 126. Meta-programming Leave the grunt-programming to the computer • Dynamically compile specialized versions of the same kernel for different conditions
  127. 127. Meta-programming Leave the grunt-programming to the computer • Dynamically compile specialized versions of the same kernel for different conditions • Smooth syntactic ugliness: unroll loops, index un-indexable registers
  128. 128. Meta-programming Leave the grunt-programming to the computer • Dynamically compile specialized versions of the same kernel for different conditions • Smooth syntactic ugliness: unroll loops, index un-indexable registers • Dynamic, empirical run-time tuning
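
A toy sketch of the "leave the grunt-programming to the computer" idea, using Python's built-in string.Template in place of the Cheetah templates shown on the next slides. The kernel itself is a trivial made-up example; the point is only that specialized CUDA source text is generated per parameter setting and can then be compiled at run time (e.g. with PyCUDA's SourceModule or nvcc).

    from string import Template

    # Toy kernel template: the width and an unrolled inner loop are baked in at
    # generation time, mimicking (in miniature) the Cheetah-template approach.
    KERNEL_TEMPLATE = Template(r"""
    __global__ void scale_w${width}(float *data)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
    ${unrolled_body}
    }
    """)

    def generate_kernel(width):
        # Unroll the inner loop at code-generation time instead of at run time.
        body = "\n".join(
            "        data[i * %d + %d] *= 2.0f;" % (width, k) for k in range(width)
        )
        return KERNEL_TEMPLATE.substitute(width=width, unrolled_body=body)

    # Generate a specialized kernel for each condition; each variant would then
    # be compiled at run time and benchmarked on the target GPU.
    for width in (3, 7, 17):
        print(generate_kernel(width))
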
  129. 129. The Cheetah-templated CUDA kernel source (listing truncated at the slide edge):

    texture<float4, 1, cudaReadModeElementType> tex_float4;
    __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];
    #define IMUL(a, b) __mul24(a, b)
    extern "C" {
    #for j in xrange($FILTER_H)
    __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
    {
    #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
      __shared__ float shared_in[$INPUT_BLOCK_W][4+1];
      // -- input/output offsets
      const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      float4 input_v4;
      // -- load input to shared memory
    #for i in xrange($LOAD_ITERATIONS)
    #if $i==($LOAD_ITERATIONS-1)
      if((threadIdx.x+$BLOCK_W*$i) < $INPUT_BLOCK_W)
    #end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;

  130. 130. (The same templated kernel listing, repeated.)
  131. 131. conv_kernel_template.cu next to conv_kernel_4x4x4.cu, one generated specialization: the template's $FILTER_D / $FILTER_W / $N_FILTERS placeholders and #for loops expand into concrete constants (constant[4][4][4], shared_in[131][4+1]), fully unrolled shared-memory loads, and an unrolled sequence of multiply-accumulate dot products (sum0 ... sum3).
  132. 132. conv_kernel_template.cu and the sizes of its generated specializations: conv_kernel_4x4x4.cu is about 20 kB of source, conv_kernel_8x8x4.cu about 64 kB.
  133. 133. conv_kernel_beta_template.cu (the same templated kernel source).
  134. 134. conv_kernel_beta_template.cu with a seemingly innocuous “flow” parameter change highlighted.
  135. 135. Version A of the resulting machine code: the inner loop becomes alternating mov.b32 / mad.rn.f32 instructions, loading each filter weight from constant memory (c0[...]) into a register before multiplying against shared memory (s[...]).
  136. 136. Version B, after the seemingly innocuous “flow” parameter change: the inner loop becomes back-to-back mad.rn.f32 instructions taking both operands directly from shared and constant memory.
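
The "dynamic, empirical run-time tuning" bullet amounts to benchmarking the generated variants and keeping whichever is fastest on the hardware at hand. A schematic sketch, not the authors' tuner: run_variant is a placeholder which in practice would render the template with these parameters, compile the kernel, and launch it on representative data.

    import itertools
    import time

    def run_variant(block_w, unroll):
        # Placeholder for: generate specialized source, compile, launch, synchronize.
        # Here we just do some CPU work so the timing loop has something to measure.
        total = 0
        for i in range(50000 * unroll // block_w):
            total += i
        return total

    def autotune(block_widths=(32, 64, 128), unroll_factors=(1, 2, 4)):
        best, best_time = None, float("inf")
        for block_w, unroll in itertools.product(block_widths, unroll_factors):
            t0 = time.perf_counter()
            run_variant(block_w, unroll)
            elapsed = time.perf_counter() - t0
            if elapsed < best_time:
                best, best_time = (block_w, unroll), elapsed
        return best

    print("fastest (block_w, unroll):", autotune())
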
  137. 137. Pipeline Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  138. 138. Pipeline Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  139. 139. Unsupervised Experience Learn Filter Kernels from Temporal Input Statistics
  140. 140. Unsupervised Experience Learn Filter Kernels from Temporal Input Statistics (natural video: “Law & Order”)
  141. 141. Screening A quick object rec. test to find promising models Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  142. 142. Screening A quick object rec. test to find promising models Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  143. 143. Screening A quick object rec. test to find promising models “cars” “planes”
  144. 144. Screening A quick object rec. test to find promising models Parametric Variation no variation
  145. 145. Screening A quick object rec. test to find promising models Parametric Variation no variation more variation
  146. 146. Screening A quick object rec. test to find promising models Parametric Variation no variation more variation lots of variation
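
The screening images vary the rendered objects parametrically. A tiny illustrative sketch of sampling such variation levels; the names and ranges here are invented for illustration, and the actual image rendering is not shown.

    import random

    # Maximum ranges loosely follow the variation axes listed later in the talk
    # (position, scale, rotation, view); the exact numbers are illustrative.
    MAX_RANGE = {
        "position_x_pct": 60.0,
        "position_y_pct": 60.0,
        "scale_pct": 60.0,
        "rotation_deg": 90.0,
        "view_deg": 90.0,
    }

    def sample_variation(level, rng):
        # level 0.0 = "no variation", 0.5 = "more variation", 1.0 = "lots of variation"
        return {name: rng.uniform(-hi, hi) * level for name, hi in MAX_RANGE.items()}

    rng = random.Random(0)
    for level in (0.0, 0.5, 1.0):
        print(level, sample_variation(level, rng))
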
  147. 147. Screening: histogram of screening-task performance (% correct, 50 to 100) across N=2500 randomly generated models.
  148. 148. Screening: the same N=2500 performance histogram.
  149. 149. Screening: the N=2500 histogram alongside a bar chart (% correct, 50 to 100) of the top 5 screened models compared against the V1S baseline.
  150. 150. Validation See how we do on other test sets Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  151. 151. Validation See how we do on other test sets Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  152. 152. Validation: bar charts (% correct, 50 to 100) of the top 5 models vs. V1S on “cars vs. planes (validate)” and “boats vs. animals”.
  153. 153. Validation: bar charts (% correct, 50 to 100) of the top 5 models vs. V1S on “synthetic face discrimination” and “webcam faces”.
  154. 154. Other “Petri Dishes” boats
  155. 155. Other “Petri Dishes” cars and planes boats
  156. 156. Other “Petri Dishes”: grid of results. Rows are unsupervised training “worlds” (cars and planes; boats; “Law & Order” video); columns show the N=2500 screening distribution and validation bar charts (cars vs. planes validate, boats vs. animals, synthetic face discrimination, webcam faces), each comparing the top 5 models against V1S.
  157. 157. Screening Basically Works: scatter plot of validation vs. screening performance for cars vs. planes, r = 0.84.
  158. 158. Scatter plots of validation performance against cars vs. planes screening performance for four tasks: cars vs. planes (validate) r = 0.84, boats vs. animals r = 0.60, and synthetic face discrimination and webcam faces (r = 0.23 and r = 0.49).
  159. 159. Screening Works on Average: normalized average validation performance vs. cars vs. planes screening performance, r = 0.69.
  160. 160. What does it all mean?
  161. 161. What does it all mean? Performance (% correct, 0.5 to 0.9) as a function of individual parameter values: Preprocessing / Normalization / Zero-mean (False, True) and Layer 1 / Linear Filtering / RF size (3, 5, 7, 9).
  162. 162. What does it all mean? Performance as a function of Layer 1 / Normalization / Threshold and Layer 3 / Learning / Neighborhood size (1, 3, 5, 7, 9).
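
Plots like these come from marginalizing screening performance over one parameter at a time: group the screened models by the value they used for a given parameter and compare average performance across the groups. A sketch with synthetic data; the parameter names and effect sizes are made up for illustration.

    import random
    from collections import defaultdict

    rng = random.Random(0)

    # Synthetic stand-in for screened models: parameter settings plus a score.
    models = []
    for _ in range(2500):
        rf_size = rng.choice([3, 5, 7, 9])
        zero_mean = rng.choice([False, True])
        score = 0.6 + 0.02 * rf_size + (0.05 if zero_mean else 0.0) + rng.gauss(0, 0.05)
        models.append({"l1_rf_size": rf_size, "zero_mean": zero_mean, "score": score})

    def marginal_performance(models, param):
        # Average performance for each value of one parameter, ignoring the others.
        groups = defaultdict(list)
        for m in models:
            groups[m[param]].append(m["score"])
        return {value: sum(s) / len(s) for value, s in sorted(groups.items())}

    print(marginal_performance(models, "l1_rf_size"))
    print(marginal_performance(models, "zero_mean"))
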
  163. 163. Summary
  164. 164. Summary • GPUs allow us to engage a qualitatively different pace of experimentation
  165. 165. Summary • GPUs allow us to engage a qualitatively different pace of experimentation • CUDA provides unprecedented performance / effort ratio
  166. 166. Summary • GPUs allow us to engage a qualitatively different pace of experimentation • CUDA provides unprecedented performance / effort ratio • New laboratory for studying visual representation
  167. 167. Shameless Recruitment http://www.rowland.harvard.edu/rjf/cox
  168. 168. Shameless Recruitment http://www.rowland.harvard.edu/rjf/cox
  169. 169. Shameless Recruitment Now with Extra NVIDIA Goodness! http://www.rowland.harvard.edu/rjf/cox
  170. 170. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt
  171. 171. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt DiCarlo Lab @ MIT • Jim DiCarlo • Nicolas Pinto
  172. 172. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt DiCarlo Lab @ MIT • Jim DiCarlo • Nicolas Pinto
  173. 173. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt DiCarlo Lab @ MIT • Jim DiCarlo • Nicolas Pinto The Rowland Institute at Harvard HARVARD UNIVERSITY
  174. 174. The Problem with Evaluation What set to use?
  175. 175. The Problem with Evaluation What set to use?
  176. 176. The Problem with Evaluation What set to use?
  177. 177. Caltech 101
  178. 178. Caltech 101 (example categories: accordion, airplanes, car_side, faces, faces_easy)
  179. 179. Caltech 101 from A. Torralba
  180. 180. The field does well on the ’101: bar chart of performance (% correct) for five state-of-the-art systems: [1] Wang et al. 2006, [2] Grauman and Darrell 2005, [3] Mutch and Lowe 2006, [4] Lazebnik et al. 2006, [5] Zhang et al. 2006.
  181. 181. The “start-simple” model
  182. 182. The “start-simple” model • Input divisive normalization
  183. 183. The “start-simple” model • Input divisive normalization • Threshold, saturation
  184. 184. The “start-simple” model • Input divisive normalization • Threshold, saturation • Output normalization
  185. 185. The “start-simple” model • Input divisive normalization • Threshold, saturation • Output normalization • Downsampling • Dimensionality Reduction • Classification
  186. 186. The “start-simple” model • Input divisive normalization • Threshold, saturation • Output normalization • Downsampling • Dimensionality Reduction • Classification • Really Big (~2.1 million dim)
  187. 187. V1-like model does shockingly well: the same Caltech 101 bar chart (performance, % correct) for the five state-of-the-art systems.
  188. 188. V1-like model does shockingly well: the V1-like model's bar added to the chart.
  189. 189. V1-like model does shockingly well: bars added for the V1-like and V1+ models.
  190. 190. But the model is stupid It’s just V1. It has no specific mechanisms for achieving invariance.
  191. 191. But the model is stupid It’s just V1. It has no specific mechanisms for achieving invariance.
  192. 192. But the model is stupid It’s just V1. It has no specific mechanisms for achieving invariance. The Fight is Rigged?
  193. 193. Refocusing on the real problem
  194. 194. Refocusing on the real problem Just two classes: should be easy, no?
  195. 195. Refocusing on the real problem Parametric Variation no variation
  196. 196. Refocusing on the real problem Parametric Variation no variation more variation
  197. 197. Refocusing on the real problem Parametric Variation no variation more variation lots of variation
  198. 198. Pulling performance back to chance: performance (%) falls toward 50% (chance) as intra-class variation increases. Variation axes: position (x-axis) 0% to 60%, position (y-axis) 0% to 120%, scale 0% to 60%, rotation 0° to 90°, view 0° to 90°.
  199. 199. Pulling performance back to chance (same chart).
  200. 200. Just one bad database?
  201. 201. Just one bad database? Caltech 256
  202. 202. Just one bad database? Caltech 256 Face Databases: ORL, Yale, AR, CVL
  203. 203. Just one bad database? Caltech 256 Face Databases: ORL, Yale, AR, CVL See us @ ECCV 08 Faces in the Wild Workshop
  204. 204. Face Databases: ORL, 4 and 8 training examples. Bar charts of performance (% correct) comparing the V1-like model against 1. pixel space, 2. Savvides et al. 2007, 3. and 4. Noushath et al. 2007.
  205. 205. Face Databases: AR, 5 and 8 training examples. Performance (% correct) of the V1-like model vs. 1. pixel space, 2. Liang et al. 2007, 3. Zhang et al. 2007.
  206. 206. Face Databases: YALE, 4 and 8 training examples (chance = 1/15 ≈ 6.67%). Performance (% correct) of the V1-like model vs. 1. pixel space, 2. to 5. Noushath et al. 2006, 6. Ben et al. 2006, 7. Wang et al. 2007.
  207. 207. Face Databases: CVL, 2 training examples (frontal only) and 3 training examples (all); chance = 1/114 ≈ 0.88%. Performance (% correct) of the V1-like model vs. 1. pixel space, 2. Goel et al. 2005, 3. Gokberk et al. 2002.
  208. 208. Face Databases: CAS-PEAL, 4 training examples; facing down, facing forward, facing up. Performance (% correct) of the V1-like model vs. 1. pixel space and 2. to 5. Cao et al. 2004.
  209. 209. Simulation vs.
  210. 210. Synthetic Variation: performance (% correct) declines with increasing variation, for scene background, white noise background, and phase-scrambled background conditions (chance = 50%). Variation axes: position (x-axis) 0% to 120%, position (y-axis) 0% to 60%, scale 0% to 60%, in-plane rotation 0° to 90°, depth-rotation 0° to 90°.
