IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)

Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach
David Cox, Harvard | James DiCarlo, Nicolas Pinto, MIT

Abstract:

The study of biological vision and the creation of artificial vision systems are naturally intertwined – exploration of the neuronal substrates of visual processing provides clues and inspiration for artificial systems, and artificial systems, in turn, serve as important generators of new ideas and working hypotheses. However, while systems neuroscience has provided inspiration for some of the "broad-stroke" properties of the visual system, much is still unknown. Even for those qualitative properties that most biologically-inspired models share, experimental data currently provide little constraint on their key parameters. Consequently, it is difficult to truly evaluate a set of computational ideas, since the performance of a model depends strongly on its particular instantiation – the size of the pooling kernels, the number of units per layer, exponents in normalization operations, etc.

To pave a way forward, we have developed a high-throughput approach to more expansively explore the possible range of biologically-inspired models, including models of larger, more realistic scale, leveraging recent advances in commodity stream processing hardware, particularly high-end NVIDIA GPUs. In analogy to high-throughput screening approaches in molecular biology and genetics, we generated and trained thousands of potential network architectures and parameter instantiations, and "screened" the visual representations produced by these models using an object recognition task. From these candidate models, the most promising were selected for further analysis. We have shown that this approach can yield significant, reproducible performance gains on a basic object recognition task, and that it can offer insight into which computational ideas are most important for achieving this performance.

In this talk, I'll also highlight how the application of flexible programming tools, such as high-level scripting and template metaprogramming, can enable large performance gains, while managing complexity for the developer. As the scale of available computational power continues to expand, our approach holds great potential both for accelerating progress in artificial vision, and for generating new, experimentally-testable hypotheses for the study of biological vision.

More on this research:
http://pinto.scripts.mit.edu/Research/Research
http://pinto.scripts.mit.edu/Research/Monster16GPU
http://www.rowland.harvard.edu/rjf/cox/Projects/projects/computation.html

More on IAP09 CUDA@MIT 6.963 at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

Transcript

  • 1. A High-Throughput Approach to Discovering Good Forms of Visual Representation David Cox The Rowland Institute at Harvard Nicolas Pinto Jim DiCarlo MIT BCS The Rowland Institute at Harvard HARVARD UNIVERSITY
  • 2. Goals
  • 3. Goals 1) “Building Brains” Concrete example of real-world experiments fundamentally enabled by stream processing hardware
  • 4. Goals 1) “Building Brains” Concrete example of real-world experiments fundamentally enabled by stream processing hardware 2) Tricks of the trade Some high-level highlights of how we leverage CUDA to achieve our goals
  • 5. Goals 1) “Building Brains” Concrete example of real-world experiments fundamentally enabled by stream processing hardware 2) Tricks of the trade Some high-level highlights of how we leverage CUDA to achieve our goals
  • 6. The Problem Why is vision hard?
  • 7. The Problem Why is vision hard? World is 3D, but retina is 2D
  • 8. The Problem Why is vision hard? World is 3D, but retina is 2D The same object can cast an infinite number of different images onto the retina
  • 9. The Problem Object Variation
  • 10. The Problem Object Variation
  • 11. The Problem Object Variation
  • 12. The Problem Object Variation
  • 13. Transformation > Identity
  • 14. Transformation > Identity
  • 15. Transformation > Identity
  • 16. Transformation > Identity
  • 17. Transformation > Identity
  • 18. Transformation > Identity Bigger difference in pixel space
  • 19. The Approach: Reverse Engineering the Brain REVERSE: Study Natural System
  • 20. The Approach: Reverse Engineering the Brain REVERSE: Study Natural System | FORWARD: Build Artificial System
  • 21. The Approach: Reverse Engineering the Brain REVERSE: Study Natural System | FORWARD: Build Artificial System
  • 22. Reverse Engineering the Brain
  • 23. Monkeys
  • 24. Monkeys
  • 25. Rats
  • 26. Rats
  • 27. Visual System
  • 28. Visual System
  • 29. IT Neurons: Complex Stimuli Desimone et al.
  • 30. IT Cortex can do object recognition Hung et al., 2005
  • 31. Reverse Engineering the Brain REVERSE: Study Natural System | FORWARD: Build Artificial System
  • 32. Reverse Engineering the Brain REVERSE: Study Natural System | FORWARD: Build Artificial System
  • 33. Object Recognition Training Set
  • 34. Object Recognition Training Set Representation
  • 35. Object Recognition Training Set Representation Classifier
  • 36. Object Recognition Training Set Representation Classifier
  • 37. Object Recognition Training Set Representation Classifier Test Example Representation
  • 38. Object Recognition Training Set Representation Classifier Test Example Representation Guess
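
As code, the train/test protocol built up on slides 33–38 is compact. A minimal Python sketch, assuming a candidate model object with a hypothetical represent() method and using scikit-learn's LinearSVC as an illustrative stand-in for the linear classifier (the talk does not specify this toolchain):

    import numpy as np
    from sklearn.svm import LinearSVC

    def screen(model, train_imgs, train_labels, test_imgs, test_labels):
        """Score one candidate visual representation on an object recognition task."""
        # Representation: push every image through the candidate model.
        X_train = np.array([model.represent(img) for img in train_imgs])
        X_test = np.array([model.represent(img) for img in test_imgs])
        # Classifier: fit a simple linear readout on the training set...
        clf = LinearSVC().fit(X_train, train_labels)
        # ...then guess labels for held-out test examples and report accuracy.
        return clf.score(X_test, test_labels)
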
  • 39. Object Recognition
  • 40. Object Recognition Training Set
  • 41. Object Recognition Training Set
  • 42. Object Recognition Training Set
  • 43. Object Recognition Training Set
  • 44. Object Recognition Training Set
  • 45. Object Recognition Testing Set
  • 46. Object Recognition Testing Set
  • 47. Object Recognition Testing Set “Siri”
  • 48. Object Recognition Testing Set “Siri”
  • 49. Object Recognition Testing Set “Not Siri” “Siri”
  • 50. How are things done normally?
  • 51. How are things done normally? Usual Formula:
  • 52. How are things done normally? Usual Formula: 1) One grad student
  • 53. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
  • 54. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
  • 55. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 56. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
  • 57. Why is this not optimal?
  • 58. Why is this not optimal? • Lots of parameters – can’t explore easily
  • 59. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run
  • 60. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run • Advice from my friend:
  • 61. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run • Advice from my friend: “Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”
  • 62. Doing things a little bit differently
  • 63. Doing things a little bit differently 1) One grad student
  • 64. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models
  • 65. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets*
  • 66. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets*
  • 67. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets* 4) yay. we. rock.
  • 68. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets* 4) yay. we. rock. 5) One Ph.D.?
  • 69. High-Throughput Screening
  • 70. High-Throughput Screening
  • 71. Pipeline: Biology
  • 72. Pipeline: Biology “Plate” a Diversity of Organisms
  • 73. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow
  • 74. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge
  • 75. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge → Collect Surviving Colonies
  • 76. Pipeline: Biology “Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge → Collect Surviving Colonies → Study / Repeat
  • 77. Pipeline: Biology-Inspired Vision
  • 78. Pipeline: Biology-Inspired Vision Generate Random Models
  • 79. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video)
  • 80. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task
  • 81. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models
  • 82. Pipeline: Biology-Inspired Vision Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
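
Rendered as code, the screening pipeline of slides 78–82 is a short driver loop. A minimal Python sketch, where sample_params, build_model, and unsupervised_train are hypothetical helpers standing in for the model generator, and screen() is a scorer like the one sketched after slide 38:

    import random

    def high_throughput_screen(n_candidates, video, screen_task, top_k=5):
        """Generate, train, and screen many random model instantiations."""
        scored = []
        for seed in range(n_candidates):
            rng = random.Random(seed)
            params = sample_params(rng)              # random point in parameter space (hypothetical helper)
            model = build_model(params)              # instantiate the architecture (hypothetical helper)
            unsupervised_train(model, video)         # learn filter kernels from video statistics
            accuracy = screen(model, *screen_task)   # quick object recognition test
            scored.append((accuracy, params))
        # Skim off the best models; these go on to validation on other tasks.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
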
  • 83. Need to break all of the implicit rules
  • 84. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models
  • 85. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models 2. Use more realistic inputs (e.g. large quantities of video)
  • 86. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models 2. Use more realistic inputs (e.g. large quantities of video) 3. Test models which begin to approach the scale of natural systems
  • 87. Massive Computation
  • 88. Massive Computation If you work for it, a multi-TeraFLOPS cluster is doable for a modest price
  • 89. Long, winding road of stream processing
  • 90. GPU pr0n
  • 91. GPU pr0n
  • 92. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈
  • 93. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism:
  • 94. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions
  • 95. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions Fine-grained: independent “neurons,” operating on overlapping inputs
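
The coarse-grained level of parallelism is easy to exploit even off the GPU, since candidate models (and video frames) are independent. A minimal sketch, where evaluate_model is a hypothetical top-level function that scores one parameter set:

    from multiprocessing import Pool

    def screen_in_parallel(param_sets, n_workers=8):
        """Embarrassingly parallel screening: candidates don't interact."""
        with Pool(n_workers) as pool:
            # Each worker generates, trains, and screens one model;
            # frames and image regions parallelize the same way at finer grain.
            return pool.map(evaluate_model, param_sets)
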
  • 96. A Match Made in Heaven Images In, Images Out ≈
  • 97. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited
  • 98. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory
  • 99. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory Data: 2D / 3D locality
  • 100.–109. These things are REALLY fast (table built up across slides 100–109):

        Platform   Performance (GFLOPS)   Development Time (hours)
        Matlab          0.3                      0.5
        C/SSE           9.0                     10.0
        PS3           110.0                     30.0
        GT200         330.0                     10.0
  • 110.–111. Pipeline: Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  • 112.–113. [Architecture diagram: input → L1 → L2 → L3 → read-out. Each layer exposes its own parameters: number of filters, kernel size, threshold/saturation, normalization strength, normalization neighborhood, and learning parameters (learning rate, trace, “Temp. Adv.”, “Auto-reset”, ...).]
  • 114.–118. A Broad Parametric Model (built up across slides 114–118):

        Normalize:                    Ni = Inputi / norm(Inputneighborhood)
        Compute Filter Responses:     Ri = Fi ⊗ N
                                      Ri < thresh: Ri = thresh; Ri > sat: Ri = sat
        Determine a “Winning Filter”: Ri’ = (∑ Tk * Hk) * Ri; winner: max(Ri’)
        Update Filter:                Fwinning = Fwinning + learning rate * N

    Annotations added as the slide builds: • Optimize “coverage” (filters span the range of observed inputs) • Privilege movement of filters in certain directions using temporal information • Expand dimensionality greatly and then scale back as layers progress
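
A minimal single-patch NumPy sketch of this learning rule; the full model convolves over whole images, and the temporal terms (Tk, Hk) are collapsed here into one precomputed weight vector:

    import numpy as np

    def learning_step(patch, filters, thresh, sat, learning_rate, temporal_weights):
        """One update of the broad parametric model (slides 114-118), simplified."""
        # Normalize: divide the input by the norm of its neighborhood
        # (here the whole patch stands in for the neighborhood).
        n = patch / (np.linalg.norm(patch) + 1e-6)
        # Compute filter responses Ri = Fi ⊗ N, clamped to [thresh, sat].
        r = np.array([np.sum(f * n) for f in filters])
        r = np.clip(r, thresh, sat)
        # Determine a "winning filter": reweight responses by the temporal
        # term (the ∑ Tk·Hk factor), then winner-take-all.
        winner = int(np.argmax(temporal_weights * r))
        # Update the winning filter toward the current input.
        filters[winner] += learning_rate * n
        return winner
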
  • 119. Dealing with Parametric Complexity
  • 120. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges:
  • 121. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges: • Complexity
  • 122. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges: • Complexity • Kernels must perform under widely varying conditions
  • 123. Dealing with Parametric Complexity Throwing a wide net comes with its own challenges: • Complexity • Kernels must perform under widely varying conditions: the best kernel for a 3x3 conv may not be the same as the best kernel for a 17x17 one; more complex operations are even hairier
  • 124. Meta-programming
  • 125. Meta-programming Leave the grunt-programming to the computer
  • 126. Meta-programming Leave the grunt-programming to the computer • Dynamically compile specialized versions of the same kernel for different conditions
  • 127. Meta-programming Leave the grunt-programming to the computer • Dynamically compile specialized versions of the same kernel for different conditions • Smooth syntactic ugliness: unroll loops, index un-indexable registers
  • 128. Meta-programming Leave the grunt-programming to the computer • Dynamically compile specialized versions of the same kernel for different conditions • Smooth syntactic ugliness: unroll loops, index un-indexable registers • Dynamic, empirical run-time tuning
  • 129.–130. conv_kernel_template.cu — the Cheetah-templated CUDA kernel (shown on two consecutive slides; listing truncated on the slide):

        texture<float4, 1, cudaReadModeElementType> tex_float4;
        __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

        #define IMUL(a, b) __mul24(a, b)
        extern "C" {

        #for j in xrange($FILTER_H)
        __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
        {
            #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
            __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

            // -- input/output offsets
            const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            float4 input_v4;

            // -- load input to shared memory
            #for i in xrange($LOAD_ITERATIONS)
            #if $i==($LOAD_ITERATIONS-1)
            if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
            #end if
            {
                input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
                shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
                shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
                shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
  • 131.–132. [Side by side: conv_kernel_template.cu (the template above) and the concrete sources it generates once the $-parameters are substituted and the dot-product loops fully unrolled — conv_kernel_4x4x4.cu weighs in at 20 kB of generated source, conv_kernel_8x8x4.cu at 64 kB.]
  • 133.–136. conv_kernel_beta_template.cu: the same template again, annotated to show that a seemingly innocuous “flow” parameter change makes the generated kernel compile to noticeably different machine code — version A interleaves mov.b32 loads from constant memory with its mad.rn.f32 instructions, while version B folds the shared- and constant-memory operands directly into the mad.rn.f32 instructions.
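
The whole metaprogramming story (slides 124–136) condenses to: render the template for each parameter setting, compile, time, and keep the fastest. A minimal sketch using Python's stdlib string.Template in place of the Cheetah templates above; benchmark is a placeholder for a real compile-and-time step (e.g. via PyCUDA and CUDA events):

    from string import Template
    import itertools

    # Illustrative stand-in for conv_kernel_template.cu; the real template
    # also parameterizes block width, load iterations, filter depth, etc.
    KERNEL_TEMPLATE = Template("""
    __global__ void convolve_${fw}x${fw}(const float4 *input, float4 *output)
    {
        /* ${fw}x${fw} filter taps emitted, fully unrolled, by the generator */
    }
    """)

    def generate_variants(filter_widths=(3, 5, 9, 17), block_widths=(64, 128, 256)):
        """Render one specialized CUDA source string per parameter combination."""
        for fw, bw in itertools.product(filter_widths, block_widths):
            # bw would feed the launch configuration and shared-memory layout.
            yield (fw, bw), KERNEL_TEMPLATE.substitute(fw=fw)

    def autotune(benchmark):
        """Dynamic, empirical run-time tuning: time every variant, keep the fastest."""
        timings = {key: benchmark(src, key) for key, src in generate_variants()}
        return min(timings, key=timings.get)
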
  • 137.–138. Pipeline: Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  • 139. Unsupervised Experience Learn Filter Kernels from Temporal Input Statistics
  • 140. Unsupervised Experience Learn Filter Kernels from Temporal Input Statistics law & order
  • 141.–142. Screening: a quick object rec. test to find promising models. Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  • 143. Screening A quick object rec. test to find promising models “cars” “planes”
  • 144. Screening A quick object rec. test to find promising models Parametric Variation no variation
  • 145. Screening A quick object rec. test to find promising models Parametric Variation no variation more variation
  • 146. Screening A quick object rec. test to find promising models Parametric Variation no variation more variation lots of variation
  • 147.–149. Screening [Histogram: distribution of screening-task performance for N=2500 models, x-axis 50–100% correct; slide 149 adds an inset bar chart comparing the V1S baseline with the five best models from the distribution.]
  • 150.–151. Validation: see how we do on other test sets. Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task → Skim off best models → Validate on other tasks
  • 152. Validation [Bar charts, y-axis 50–100% correct, V1S vs. models 1–5: cars vs. planes (validate); boats vs. animals]
  • 153. Validation [Bar charts, y-axis 50–100% correct, V1S vs. models 1–5: synthetic face discrimination; webcam faces]
  • 154. Other “Petri Dishes” boats
  • 155. Other “Petri Dishes” cars and planes boats
  • 156. Other “Petri Dishes” [Grid of results: one row per unsupervised-training “world” (cars and planes; boats; law & order). Each row shows the screen distribution (N=2500) plus validation bar charts — cars vs. planes (validate), boats vs. animals, synthetic face discrimination, webcam faces — comparing V1S with the top five models.]
  • 157. Screening Basically Works [Scatter: cars vs. planes (screen) vs. cars vs. planes (validate), r=0.84]
  • 158. Screening Basically Works [Scatters of screen performance (cars vs. planes) against each validation task: cars vs. planes (validate) r=0.84; boats vs. animals r=0.60; synthetic face discrimination r=0.23; webcam faces r=0.49]
  • 159. Screening Works on Average [Scatter: normalized average of the validation tasks vs. cars vs. planes (screen), r=0.69]
  • 160. What does it all mean?
  • 161. What does it all mean? [Performance (% correct) vs. parameter value for: Preprocessing / Normalization / Zero-mean (False vs. True); Layer 1 / Linear Filtering / RF size (3, 5, 7, 9)]
  • 162. What does it all mean? [Performance (% correct) vs. parameter value for: Layer 1 / Normalization / Threshold; Layer 3 / Learning / Neighborhood size (1, 3, 5, 7, 9)]
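
The analysis behind these plots is straightforward to sketch: group the screened models by the value of a single parameter and compare average performance. A minimal sketch, assuming screening results arrive as (params_dict, accuracy) pairs:

    from collections import defaultdict

    def parameter_effect(results, name):
        """Mean screening accuracy for each value taken by one model parameter."""
        by_value = defaultdict(list)
        for params, accuracy in results:
            by_value[params[name]].append(accuracy)
        # e.g. name="l1_rf_size" maps each receptive-field size to its mean accuracy.
        return {value: sum(accs) / len(accs) for value, accs in by_value.items()}
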
  • 163. Summary
  • 164. Summary • GPUs allow us to engage a qualitatively different pace of experimentation
  • 165. Summary • GPUs allow us to engage a qualitatively different pace of experimentation • CUDA provides unprecedented performance / effort ratio
  • 166. Summary • GPUs allow us to engage a qualitatively different pace of experimentation • CUDA provides unprecedented performance / effort ratio • New laboratory for studying visual representation
  • 167. Shameless Recruitment http://www.rowland.harvard.edu/rjf/cox
  • 168. Shameless Recruitment http://www.rowland.harvard.edu/rjf/cox
  • 169. Shameless Recruitment Now with Extra NVIDIA Goodness! http://www.rowland.harvard.edu/rjf/cox
  • 170. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt
  • 171. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt DiCarlo Lab @ MIT • Jim DiCarlo • Nicolas Pinto
  • 172. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt DiCarlo Lab @ MIT • Jim DiCarlo • Nicolas Pinto
  • 173. Acknowledgments Cox Lab @ The Rowland Institute at Harvard • Davide Zoccolan • Nadja Oertelt DiCarlo Lab @ MIT • Jim DiCarlo • Nicolas Pinto The Rowland Institute at Harvard HARVARD UNIVERSITY
  • 174. The Problem with Evaluation What set to use?
  • 175. The Problem with Evaluation What set to use?
  • 176. The Problem with Evaluation What set to use?
  • 177. Caltech 101
  • 178. faces_easy faces car_side airplanes accordion Caltech 101
  • 179. Caltech 101 from A. Torralba
  • 180. The field does well on the ‘101 [Bar chart: performance (% correct, 0–100) of five state-of-the-art systems: [1] Wang et al. 2006; [2] Grauman and Darrell 2005; [3] Mutch and Lowe 2006; [4] Lazebnik et al. 2006; [5] Zhang et al. 2006]
  • 181. The “start-simple” model
  • 182. The “start-simple” model • Input divisive normalization
  • 183. The “start-simple” model • Input divisive normalization • Threshold, saturation
  • 184. The “start-simple” model • Input divisive normalization • Threshold, saturation • Output normalization
  • 185. The “start-simple” model • Input divisive normalization • Threshold, saturation • Output normalization • Downsampling • Dimensionality Reduction • Classification
  • 186. The “start-simple” model • Input divisive normalization • Threshold, saturation • Output normalization • Downsampling • Dimensionality Reduction • Classification • Really Big (~2.1 million dim)
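
A minimal NumPy sketch of these stages; the actual model's filter bank, normalization neighborhoods, and full ~2.1-million-dimensional output are not reproduced here:

    import numpy as np
    from scipy.signal import fftconvolve

    def v1_like_features(img, filters, thresh=0.0, sat=1.0, step=4):
        """Simplified V1-like representation: normalize, filter, clamp, downsample."""
        # Input divisive normalization: subtract the mean, divide by contrast.
        img = (img - img.mean()) / (img.std() + 1e-6)
        maps = []
        for f in filters:                          # one map per (Gabor-like) filter
            r = fftconvolve(img, f, mode="valid")  # linear filtering
            r = np.clip(r, thresh, sat)            # threshold and saturation
            r = r / (np.linalg.norm(r) + 1e-6)     # output normalization
            maps.append(r[::step, ::step])         # downsampling
        # Concatenate into one very high-dimensional vector; dimensionality
        # reduction and classification happen downstream.
        return np.concatenate([m.ravel() for m in maps])
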
  • 187.–189. V1-like model does shockingly well [The Caltech 101 bar chart again (performance, % correct, systems 1–5), with bars added across the builds for the V1-like model and then V1+, alongside the five state-of-the-art systems]
  • 190. But the model is stupid It’s just V1. It has no specific mechanisms for achieving invariance.
  • 191. But the model is stupid It’s just V1. It has no specific mechanisms for achieving invariance.
  • 192. But the model is stupid It’s just V1. It has no specific mechanisms for achieving invariance. The Fight is Rigged?
  • 193. Refocusing on the real problem
  • 194. Refocusing on the real problem Just two classes: should be easy, no?
  • 195. Refocusing on the real problem Parametric Variation no variation
  • 196. Refocusing on the real problem Parametric Variation no variation more variation
  • 197. Refocusing on the real problem Parametric Variation no variation more variation lots of variation
  • 198.–199. Pulling performance back to chance [Line plot: performance (%) from 100 down to 50 (chance) as intra-class variation increases along each axis — position (x-axis) 0%–60%, position (y-axis) 0%–120%, scale 0%–60%, rotation 0°–90°, view 0°–90°]
  • 200. Just one bad database?
  • 201. Just one bad database? Caltech 256
  • 202. Just one bad database? Caltech 256 Face Databases: ORL, Yale, AR, CVL
  • 203. Just one bad database? Caltech 256 Face Databases: ORL, Yale, AR, CVL See us @ ECCV 08 Faces in the Wild Workshop
  • 204. Face Databases — ORL, 4 and 8 training examples [Bar chart: performance (% correct) of V1-like vs. 1. pixel space; 2. Savvides et al. 2007; 3. and 4. Noushath et al. 2007]
  • 205. Face Databases — AR, 5 and 8 training examples [Bar chart: V1-like vs. 1. pixel space; 2. Liang et al. 2007; 3. Zhang et al. 2007]
  • 206. Face Databases — YALE, 4 and 8 training examples; chance = 1/15 = 6.67% [Bar chart: V1-like vs. 1. pixel space; 2.–5. Noushath et al. 2006; 6. Ben et al. 2006; 7. Wang et al. 2007]
  • 207. Face Databases — CVL, 2 training examples (frontal only) and 3 training examples (all); chance = 1/114 = 0.88% [Bar chart: V1-like vs. 1. pixel space; 2. Goel et al. 2005; 3. Gokberk et al. 2002]
  • 208. Face Databases — CAS-PEAL, 4 training examples; panels: facing down, facing forward, facing up [Bar chart: V1 vs. 1. pixel space; 2.–5. Cao et al. 2004]
  • 209. Simulation vs.
  • 210. Synthetic Variation [Line plot: performance (% correct) vs. increasing variation for three background conditions (scene background, white noise background, phase scrambled background), with chance at 50%; variation axes: position (x-axis) 0%–120%, position (y-axis) 0%–60%, scale 0%–60%, in-plane rotation 0°–90°, depth-rotation 0°–90°]