P03 Neural Networks: CVPR 2012 Tutorial on Deep Learning Methods for Vision

Transcript

  • 1. NEURAL NETS FOR VISION - CVPR 2012 Tutorial on Deep Learning, Part III. Marc'Aurelio Ranzato - ranzato@google.com - www.cs.toronto.edu/~ranzato
  • 2. Building an Object Recognition System: “CAR” CLASSIFIER FEATURE EXTRACTOR. IDEA: Use data to optimize features for the given task. 2 Ranzato
  • 3. Building an Object Recognition System: “CAR” CLASSIFIER f(X; Θ). What we want: use a parameterized function such that a) features are computed efficiently, b) features can be trained efficiently. 3 Ranzato
  • 4. Building an Object Recognition System: “CAR” END-TO-END RECOGNITION SYSTEM. – Everything becomes adaptive. – No distinction between feature extractor and classifier. – Big non-linear system trained from raw pixels to labels. 4 Ranzato
  • 5. Building an Object Recognition System: “CAR” END-TO-END RECOGNITION SYSTEM. Q: How can we build such a highly non-linear system? A: By combining simple building blocks we can make more and more complex systems. 5 Ranzato
  • 6. Building A Complicated Function. Simple functions: sin(x), cos(x), log(x), exp(x), x^3. One example of a complicated function: log(cos(exp(sin^3(x)))). 6 Ranzato
  • 7. Building A Complicated Function. Simple functions: sin(x), cos(x), log(x), exp(x), x^3. One example of a complicated function: log(cos(exp(sin^3(x)))). – Function composition is at the core of deep learning methods. – Each “simple function” will have parameters subject to training. 7 Ranzato
  • 8. Implementing A Complicated Function: log(cos(exp(sin^3(x)))), built as a graph of the simple modules sin(x), x^3, exp(x), cos(x), log(x). 8 Ranzato
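To make the composition idea concrete, here is a minimal Python sketch (mine, not the tutorial's code), taking the slide's example to be log(cos(exp(sin^3(x)))); the helper name `compose` is hypothetical. A deep net composes parameterized modules in exactly the same way.

    # Build a complicated function by composing simple ones.
    import math

    def compose(*fs):
        """Return f1 o f2 o ... o fn (applied right to left)."""
        def composed(x):
            for f in reversed(fs):
                x = f(x)
            return x
        return composed

    complicated = compose(math.log, math.cos, math.exp,
                          lambda x: math.sin(x) ** 3)

    print(complicated(0.5))  # evaluate the composition at x = 0.5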
  • 9. Intuition Behind Deep Neural Nets “CAR” 9 Ranzato
  • 10. Intuition Behind Deep Neural Nets: “CAR”. NOTE: Each black box can have trainable parameters. Their composition makes a highly non-linear system. 10 Ranzato
  • 11. Intuition Behind Deep Neural Nets: intermediate representations/features, ending in “CAR”. NOTE: The system produces a hierarchy of features. 11 Ranzato
  • 12. Intuition Behind Deep Neural Nets “CAR” 12Lee et al. “Convolutional DBNs for scalable unsup. learning...” ICML 2009 Ranzato
  • 13. Intuition Behind Deep Neural Nets “CAR” 13Lee et al. ICML 2009 Ranzato
  • 14. Intuition Behind Deep Neural Nets “CAR” 14Lee et al. ICML 2009 Ranzato
  • 15. KEY IDEAS OF NEURAL NETS. IDEA # 1: Learn features from data. IDEA # 2: Use differentiable functions that produce features efficiently. IDEA # 3: End-to-end learning: no distinction between feature extractor and classifier. IDEA # 4: “Deep” architectures: cascade of simpler non-linear modules. 15 Ranzato
  • 16. KEY QUESTIONS - What is the input-output mapping? - How are parameters trained? - How computationally expensive is it? - How well does it work? 16 Ranzato
  • 17. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 17 Ranzato
  • 18. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 18 Ranzato
  • 19. Linear Classifier: SVM. Input: X ∈ R^D. Binary label: y. Parameters: W ∈ R^D. Output prediction: W^T X. Loss: L = 1/2 ||W||^2 + max[0, 1 − W^T X y] (hinge loss). 19 Ranzato
  • 20. Linear Classifier: Logistic Regression. Input: X ∈ R^D. Binary label: y. Parameters: W ∈ R^D. Output prediction: W^T X. Loss: L = 1/2 ||W||^2 + log(1 + exp(−W^T X y)) (log loss). 20 Ranzato
  • 21. Logistic Regression: Probabilistic Interpretation. Input: X ∈ R^D. Binary label: y. Parameters: W ∈ R^D. Output prediction: p(y = 1 | X) = 1 / (1 + e^(−W^T X)). Loss: L = −log p(y | X). Q: What is the gradient of L w.r.t. W? 21 Ranzato
  • 22. Logistic Regression: Probabilistic Interpretation. Input: X ∈ R^D. Binary label: y. Parameters: W ∈ R^D. Output prediction: p(y = 1 | X) = 1 / (1 + e^(−W^T X)). Loss: L = log(1 + exp(−W^T X y)). Q: What is the gradient of L w.r.t. W? 22 Ranzato
  • 23. Simple functions: sin(x), cos(x), log(x), exp(x), x^3. Complicated function: −log(1 / (1 + e^(−W^T X))). 23 Ranzato
  • 24. Logistic Regression: Computing Loss. Complicated function: −log(1 / (1 + e^(−W^T X))), computed as a graph: X, W → u = W^T X → p = 1 / (1 + e^(−u)) → L = −log p. 24 Ranzato
  • 25. Chain Rule: x → y, with gradients dL/dx and dL/dy. Given y(x) and dL/dy, what is dL/dx? 25 Ranzato
  • 26. Chain Rule: x → y. Given y(x) and dL/dy, dL/dx = dL/dy · dy/dx. 26 Ranzato
  • 27. Chain Rule: x → y. Given y(x) and dL/dy, dL/dx = dL/dy · dy/dx. All needed information is local! 27 Ranzato
  • 28. Logistic Regression: Computing Gradients. Graph: X, W → u = W^T X → p = 1 / (1 + e^(−u)) → L = −log p. Local derivative: dL/dp = −1/p. 28 Ranzato
  • 29. Logistic Regression: Computing Gradients. Local derivatives: dL/dp = −1/p, dp/du = p(1 − p). 29 Ranzato
  • 30. Logistic Regression: Computing Gradients. Local derivatives: dL/dp = −1/p, dp/du = p(1 − p), du/dW = X. Therefore dL/dW = dL/dp · dp/du · du/dW = (p − 1) X. 30 Ranzato
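A minimal numpy sketch (mine, not the tutorial's code) of the computation graph on slides 24-30: u = W^T X, p = 1/(1+exp(−u)), L = −log p for a positive example, with the gradient obtained by multiplying the local derivatives, which simplifies to (p − 1) X.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)          # one input example (D = 5)
    w = rng.normal(size=5) * 0.1    # parameters

    # forward pass
    u = w @ x
    p = 1.0 / (1.0 + np.exp(-u))    # predicted P(y = 1 | x)
    L = -np.log(p)                  # loss for a positive example (y = 1)

    # backward pass: one local derivative per box, chained together
    dL_dp = -1.0 / p
    dp_du = p * (1.0 - p)
    du_dw = x
    dL_dw = dL_dp * dp_du * du_dw   # equals (p - 1) * x

    # finite-difference check of one coordinate
    eps = 1e-6
    w2 = w.copy(); w2[0] += eps
    L2 = -np.log(1.0 / (1.0 + np.exp(-(w2 @ x))))
    print(dL_dw[0], (L2 - L) / eps)  # the two numbers should match closely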
  • 31. What Did We Learn?- Logistic Regression- How to compute gradients of complicated functions Logistic Regression 31 Ranzato
  • 32. Neural Network Logistic Logistic Regression Regression 32 Ranzato
  • 33. Neural Network – A neural net can be thought of as a stack of logistic regression classifiers. Each input is the output of the previous layer. NOTE: intermediate units can be thought of as linear classifiers trained with implicit target values. 33 Ranzato
  • 34. Key Computations: F-Prop / B-Prop. F-PROP: given the input X and parameters Θ, compute the output Z. 34 Ranzato
  • 35. Key Computations: F-Prop / B-Prop. B-PROP: given ∂L/∂Z, compute {∂L/∂X, ∂L/∂Θ} using the local derivatives ∂Z/∂X and ∂Z/∂Θ. 35 Ranzato
  • 36. Neural Net: TrainingA) Compute loss on small mini-batch Layer 1 Layer 2 Layer 3 36 Ranzato
  • 37. Neural Net: TrainingA) Compute loss on small mini-batch F-PROP Layer 1 Layer 2 Layer 3 37 Ranzato
  • 38. Neural Net: TrainingA) Compute loss on small mini-batch F-PROP Layer 1 Layer 2 Layer 3 38 Ranzato
  • 39. Neural Net: TrainingA) Compute loss on small mini-batch F-PROP Layer 1 Layer 2 Layer 3 39 Ranzato
  • 40. Neural Net: Training. A) Compute loss on small mini-batch. B) Compute gradient w.r.t. parameters. B-PROP: Layer 1, Layer 2, Layer 3. 40 Ranzato
  • 41. Neural Net: Training. A) Compute loss on small mini-batch. B) Compute gradient w.r.t. parameters. B-PROP: Layer 1, Layer 2, Layer 3. 41 Ranzato
  • 42. Neural Net: Training. A) Compute loss on small mini-batch. B) Compute gradient w.r.t. parameters. B-PROP: Layer 1, Layer 2, Layer 3. 42 Ranzato
  • 43. Neural Net: Training. A) Compute loss on small mini-batch. B) Compute gradient w.r.t. parameters. C) Use the gradient to update the parameters: θ ← θ − η dL/dθ. Layer 1, Layer 2, Layer 3. 43 Ranzato
  • 44. NEURAL NET: ARCHITECTURE. h_{j+1} = σ(W_{j+1}^T h_j + b_{j+1}), with W_j ∈ R^{M×N}, b_j ∈ R^N, h_j ∈ R^M, h_{j+1} ∈ R^N. 44 Ranzato
  • 45. NEURAL NET: ARCHITECTURE. h_{j+1} = σ(W_{j+1}^T h_j + b_{j+1}), with σ(x) = 1 / (1 + e^(−x)). 45 Ranzato
  • 46. NEURAL NET: ARCHITECTURE. h_{j+1} = σ(W_{j+1}^T h_j + b_{j+1}), with σ(x) = tanh(x). 46 Ranzato
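A minimal numpy sketch (an assumption, not the tutorial's code) of one sigmoid layer h_{j+1} = σ(W^T h_j + b) with the F-PROP / B-PROP interface of slides 34-35: b-prop receives ∂L/∂h_{j+1} and returns ∂L/∂h_j, caching ∂L/∂W and ∂L/∂b, using only locally available quantities. The class name `SigmoidLayer` is hypothetical.

    import numpy as np

    class SigmoidLayer:
        def __init__(self, n_in, n_out, rng):
            self.W = 0.01 * rng.normal(size=(n_in, n_out))  # W in R^{M x N}
            self.b = np.zeros(n_out)

        def fprop(self, h):
            self.h_in = h                                   # cache the input
            self.h_out = 1.0 / (1.0 + np.exp(-(self.W.T @ h + self.b)))
            return self.h_out

        def bprop(self, dL_dout):
            # derivative of the sigmoid, evaluated at the cached output
            du = dL_dout * self.h_out * (1.0 - self.h_out)
            self.dW = np.outer(self.h_in, du)               # dL/dW
            self.db = du                                    # dL/db
            return self.W @ du                              # dL/dh_in

    rng = np.random.default_rng(0)
    layer = SigmoidLayer(4, 3, rng)
    out = layer.fprop(rng.normal(size=4))
    grad_in = layer.bprop(np.ones(3))   # pretend dL/dout = 1 for illustration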
  • 47. Graphical Notations: f(X; W) is equivalent to the diagram X →(W)→ h. h is called a feature, hidden unit, neuron, or code unit. 47 Ranzato
  • 48. MOST COMMON ARCHITECTURE: X → ŷ → Error ← y. NOTE: Multi-layer neural nets with more than two layers are nowadays called deep nets!! 48 Ranzato
  • 49. MOST COMMON ARCHITECTURE: X → ŷ → Error ← y. NOTE: Multi-layer neural nets with more than two layers are nowadays called deep nets!! NOTE: The user must specify the number of layers, number of hidden units, type of layers and loss function. 49 Ranzato
  • 50. MOST COMMON LOSSES: X → ŷ → Error ← y. Square Euclidean distance (regression): y, ŷ ∈ R^N, L = 1/2 Σ_{i=1..N} (ŷ_i − y_i)^2. 50 Ranzato
  • 51. MOST COMMON LOSSES: X → ŷ → Error ← y. Cross entropy (classification): y, ŷ ∈ [0,1]^N with Σ_{i=1..N} y_i = 1 and Σ_{i=1..N} ŷ_i = 1, L = −Σ_{i=1..N} y_i log(ŷ_i). 51 Ranzato
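Minimal per-example numpy versions of the two losses above (mine, not the tutorial's code; the function names are hypothetical):

    import numpy as np

    def squared_loss(y_hat, y):
        # 1/2 * sum_i (y_hat_i - y_i)^2, used for regression
        return 0.5 * np.sum((y_hat - y) ** 2)

    def cross_entropy(y_hat, y):
        # -sum_i y_i log(y_hat_i); y and y_hat are distributions over classes
        return -np.sum(y * np.log(y_hat))

    y_hat = np.array([0.7, 0.2, 0.1])   # e.g. a softmax output
    y = np.array([1.0, 0.0, 0.0])       # one-hot target
    print(squared_loss(y_hat, y), cross_entropy(y_hat, y))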
  • 52. NEURAL NETS FACTS. 1: The user specifies the loss based on the task. 2: Any optimization algorithm can be chosen for training. 3: The cost of F-Prop and B-Prop is similar and proportional to the number of layers and their size. 52 Ranzato
  • 53. Toy Code: Neural Net Trainer (MATLAB-style pseudocode; h{0} denotes the input mini-batch, and cell indices would need shifting in real MATLAB since cell arrays are 1-based)

    % F-PROP
    for i = 1 : nr_layers - 1
      [h{i} jac{i}] = logistic(W{i} * h{i-1} + b{i});
    end
    h{nr_layers-1} = W{nr_layers-1} * h{nr_layers-2} + b{nr_layers-1};  % linear output layer
    prediction = softmax(h{nr_layers-1});

    % CROSS ENTROPY LOSS
    loss = - sum(sum(log(prediction) .* target));

    % B-PROP
    dh{nr_layers-1} = prediction - target;
    for i = nr_layers - 1 : -1 : 1
      Wgrad{i} = dh{i} * h{i-1}';              % gradient w.r.t. weights
      bgrad{i} = sum(dh{i}, 2);                % gradient w.r.t. biases
      dh{i-1} = (W{i}' * dh{i}) .* jac{i-1};   % back-propagate to the layer below
    end

    % UPDATE
    for i = 1 : nr_layers - 1
      W{i} = W{i} - (lr / batch_size) * Wgrad{i};
      b{i} = b{i} - (lr / batch_size) * bgrad{i};
    end
    53 Ranzato
  • 54. TOY EXAMPLE: SYNTHETIC DATA 1 input & 1 output 100 hidden units in each layer 54 Ranzato
  • 55. TOY EXAMPLE: SYNTHETIC DATA 1 input & 1 output 3 hidden layers 55 Ranzato
  • 56. TOY EXAMPLE: SYNTHETIC DATA. 1 input & 1 output, 3 hidden layers, 1000 hidden units each. Regression of a cosine. 56 Ranzato
  • 57. TOY EXAMPLE: SYNTHETIC DATA 1 input & 1 output 3 hidden layers, 1000 hiddens Regression of cosine 57 Ranzato
  • 58. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 58 Ranzato
  • 59. FULLY CONNECTED NEURAL NET Example: 1000x1000 image 1M hidden units 10^12 parameters!!! - Spatial correlation is local - Better to put resources elsewhere! 59 Ranzato
  • 60. LOCALLY CONNECTED NEURAL NET Example: 1000x1000 image 1M hidden units Filter size: 10x10 100M parameters 60 Ranzato
  • 61. LOCALLY CONNECTED NEURAL NET Example: 1000x1000 image 1M hidden units Filter size: 10x10 100M parameters 61 Ranzato
  • 62. LOCALLY CONNECTED NEURAL NET STATIONARITY? Statistics is similar at different locations Example: 1000x1000 image 1M hidden units Filter size: 10x10 100M parameters 62 Ranzato
  • 63. CONVOLUTIONAL NET Share the same parameters across different locations: Convolutions with learned kernels 63 Ranzato
  • 64. CONVOLUTIONAL NET Learn multiple filters. E.g.: 1000x1000 image 100 Filters Filter size: 10x10 10K parameters 64 Ranzato
  • 65. NEURAL NETS FOR VISION. A standard neural net applied to images: - scales quadratically with the size of the input - does not leverage stationarity. Solution: - connect each hidden unit to a small patch of the input - share the weights across hidden units. This is called a convolutional network. LeCun et al. “Gradient-based learning applied to document recognition” IEEE 1998. 65 Ranzato
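To make the weight sharing concrete, here is a naive numpy sketch (mine, not the tutorial's or an optimized implementation): the same small filter is applied at every location, so the parameter count depends on the filter size and the number of filters, not on the image size. The helper name `conv2d_valid` is hypothetical.

    import numpy as np

    def conv2d_valid(image, kernel):
        """Naive 'valid' 2D correlation with a single shared kernel."""
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
        return out

    rng = np.random.default_rng(0)
    image = rng.normal(size=(100, 100))       # a (small) input image
    filters = rng.normal(size=(8, 10, 10))    # 8 learned 10x10 kernels

    feature_maps = np.stack([conv2d_valid(image, k) for k in filters])
    print(feature_maps.shape)   # (8, 91, 91): one map per filter
    print(filters.size)         # 800 parameters, independent of image size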
  • 66. CONVOLUTIONAL NET Let us assume filter is an “eye” detector. Q.: how can we make the detection robust to the exact location of the eye? 66 Ranzato
  • 67. CONVOLUTIONAL NET By “pooling” (e.g., max or average) filter responses at different locations we gain robustness to the exact spatial location of features. 67 Ranzato
  • 68. CONV NETS: EXTENSIONS. Over the years, some new modules have proven to be very effective when plugged into conv-nets: - L2 Pooling (layer i → layer i+1): h_{i+1,x,y} = sqrt( Σ_{(j,k) ∈ N(x,y)} h_{i,j,k}^2 ), where N(x,y) is a spatial neighborhood of (x,y). - Local Contrast Normalization (layer i → layer i+1): h_{i+1,x,y} = (h_{i,x,y} − m_{i,N(x,y)}) / σ_{i,N(x,y)}. Jarrett et al. “What is the best multi-stage architecture for object recognition?” ICCV 2009. 68 Ranzato
  • 69. CONV NETS: L2 POOLING. Example: the input pattern (-1, 0, 0, 0, 0) passed through h_{i+1} = sqrt( Σ_{i=1..5} h_i^2 ) gives +1. Kavukcuoglu et al. “Learning invariant features ...” CVPR 2009. 69 Ranzato
  • 70. CONV NETS: L2 POOLING. Example: the input pattern (0, 0, 0, 0, +1) passed through the same pooling also gives +1: the response does not depend on where the activation falls inside the pooling neighborhood. Kavukcuoglu et al. “Learning invariant features ...” CVPR 2009. 70 Ranzato
  • 71. LOCAL CONTRAST NORMALIZATION: h_{i+1,x,y} = (h_{i,x,y} − m_{i,N(x,y)}) / σ_{i,N(x,y)}. [Figure: numeric example of LCN applied to a small patch.] 71 Ranzato
  • 72. LOCAL CONTRAST NORMALIZATION: h_{i+1,x,y} = (h_{i,x,y} − m_{i,N(x,y)}) / σ_{i,N(x,y)}. [Figure: the same example at higher contrast.] 72 Ranzato
  • 73. CONV NETS: EXTENSIONS. L2 pooling & local contrast normalization help learning more invariant representations! 73 Ranzato
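Minimal 1-D numpy sketches of the two modules defined on slide 68 (mine, not the tutorial's code), assuming a sliding neighborhood of size n; in the conv-net they operate over small 2-D spatial windows. The function names are hypothetical.

    import numpy as np

    def l2_pool(h, n=5):
        # h_{i+1,x} = sqrt( sum over the neighborhood of h_{i,j}^2 )
        return np.array([np.sqrt(np.sum(h[x:x + n] ** 2))
                         for x in range(len(h) - n + 1)])

    def local_contrast_norm(h, n=5, eps=1e-8):
        # h_{i+1,x} = (h_{i,x} - mean of neighborhood) / std of neighborhood
        out = np.empty(len(h) - n + 1)
        for x in range(len(out)):
            patch = h[x:x + n]
            out[x] = (h[x + n // 2] - patch.mean()) / (patch.std() + eps)
        return out

    print(l2_pool(np.array([-1., 0., 0., 0., 0.])))   # -> [1.]
    print(l2_pool(np.array([0., 0., 0., 0., 1.])))    # -> [1.]: same response,
                                                      # regardless of position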
  • 74. CONV NETS: TYPICAL ARCHITECTURE. One stage (zoom): Filtering → Pooling → LCN. Whole system: Input Image → 1st stage → 2nd stage → 3rd stage → Linear Layer → Class Labels. 74 Ranzato
  • 75. CONV NETS: TRAINING. Since convolutions and sub-sampling are differentiable, we can use standard back-propagation. Algorithm: given a small mini-batch - FPROP - BPROP - PARAMETER UPDATE. 75 Ranzato
  • 76. CONV NETS: EXAMPLES- Object category recognition Boureau et al. “Ask the locals: multi-way local pooling for image recognition” ICCV 2011- Segmentation Turaga et al. “Maximin learning of image segmentation” NIPS 2009- OCR Ciresan et al. “MCDNN for Image Classification” CVPR 2012- Pedestrian detection Kavukcuoglu et al. “Learning convolutional feature hierarchies for visual recognition” NIPS 2010- Robotics Sermanet et al. “Mapping and planning..with long range perception” IROS 2008 76 Ranzato
  • 77. LIMITATIONS & SOLUTIONS. - Requires lots of labeled data to train → + unsupervised learning. - Difficult optimization → + layer-wise training. - Scalability → + distributed training. 77 Ranzato
  • 78. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 78 Ranzato
  • 79. BACK TO LOGISTIC REGRESSION input prediction Error target 79 Ranzato
  • 80. Unsupervised Learning: input → prediction → Error ← ? 80 Ranzato
  • 81. Unsupervised Learning. Q: How should we train the input-output mapping (input → code) if we do not have target values? A: The code has to retain information from the input, but only if this is similar to training samples. By better representing only those inputs that are similar to training samples we hope to extract interesting structure (e.g., the structure of the manifold where data live). 81 Ranzato
  • 82. Unsupervised Learning. Q: How to constrain the model to represent training samples better than other data points? 82 Ranzato
  • 83. Unsupervised Learning. – Reconstruct the input from the code & make the code compact (auto-encoder with bottleneck). – Reconstruct the input from the code & make the code sparse (sparse auto-encoders); see work in LeCun, Ng, Fergus, Lee, Yu's labs. – Add noise to the input or code (denoising auto-encoders); see work in Y. Bengio's and Lee's labs. – Make sure that the model defines a distribution that normalizes to 1 (RBM); see work in Y. Bengio, Hinton, Lee, Salakhutdinov's labs. 83 Ranzato
  • 84. AUTO-ENCODERS NEURAL NETS: input → encoder → code → decoder → reconstruction → Error. – Input higher dimensional than code. - Error: ||reconstruction - input||^2. - Training: back-propagation. 84 Ranzato
  • 85. SPARSE AUTO-ENCODERS: input → encoder → code → decoder → reconstruction → Error, plus a sparsity penalty on the code. – Sparsity penalty: ||code||_1. - Error: ||reconstruction - input||^2. - Loss: sum of squared reconstruction error and sparsity. - Training: back-propagation. 85 Ranzato
  • 86. SPARSE AUTO-ENCODERS. – Input: X, code: h = W^T X. - Loss: L(X; W) = ||W h − X||^2 + λ Σ_j |h_j|. Le et al. “ICA with reconstruction cost..” NIPS 2011. 86 Ranzato
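A minimal numpy sketch (an assumption, not the tutorial's or Le et al.'s code) of the sparse auto-encoder objective on slide 86, with a tied linear encoder h = W^T X and decoder W h; the helper name `sparse_ae_loss` and the value of lambda are made up for illustration.

    import numpy as np

    def sparse_ae_loss(W, X, lam=0.1):
        h = W.T @ X                           # code
        rec = W @ h                           # reconstruction
        return np.sum((rec - X) ** 2) + lam * np.sum(np.abs(h)), h

    rng = np.random.default_rng(0)
    X = rng.normal(size=16)                   # one input patch (flattened)
    W = 0.1 * rng.normal(size=(16, 32))       # 32 overcomplete code units

    loss, code = sparse_ae_loss(W, X)
    print(loss, np.mean(np.abs(code) > 0.05)) # loss and fraction of "active" units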
  • 87. How To Use Unsupervised Learning. 1) Given unlabeled data, learn features. 87 Ranzato
  • 88. How To Use Unsupervised Learning. 1) Given unlabeled data, learn features. 2) Use the encoder to produce features and train another layer on top. 88 Ranzato
  • 89. How To Use Unsupervised Learning. 1) Given unlabeled data, learn features. 2) Use the encoder to produce features and train another layer on top. Layer-wise training of a feature hierarchy. 89 Ranzato
  • 90. How To Use Unsupervised Learning. 1) Given unlabeled data, learn features. 2) Use the encoder to produce features and train another layer on top. 3) Feed the features to a classifier & train just the classifier (input → prediction ← label). Reduced overfitting, since features are learned in an unsupervised way! 90 Ranzato
  • 91. How To Use Unsupervised Learning. 1) Given unlabeled data, learn features. 2) Use the encoder to produce features and train another layer on top. 3) Feed the features to a classifier & jointly train the whole system. Given enough data, this usually yields the best results: end-to-end learning! 91 Ranzato
  • 92. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 92 Ranzato
  • 93. Semi-Supervised Learning: a few labeled examples (truck, airplane, deer, frog, bird). 93 Ranzato
  • 94. Semi-Supervised Learning: a few labeled examples (truck, airplane, deer, frog, bird) and LOTS & LOTS OF UNLABELED DATA!!! 94 Ranzato
  • 95. Semi-Supervised Learning Loss = supervised_error + unsupervised_error prediction input label 95Weston et al. “Deep learning via semi-supervised embedding” ICML 2008 Ranzato
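A minimal sketch (an assumption, not the tutorial's or Weston et al.'s method) of the combined objective on slide 95: the same features are trained with a supervised term on the few labeled examples and an unsupervised term (here a reconstruction error, as a stand-in) on the many unlabeled ones. All names and the single sigmoid feature layer are hypothetical.

    import numpy as np

    def semi_supervised_loss(W_feat, W_cls, X_lab, y_lab, X_unlab, alpha=1.0):
        # shared feature extractor (one sigmoid layer, as a stand-in)
        def features(X):
            return 1.0 / (1.0 + np.exp(-(X @ W_feat)))

        # supervised term: cross entropy of a softmax classifier on labeled data
        logits = features(X_lab) @ W_cls
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        supervised = -np.mean(np.log(p[np.arange(len(y_lab)), y_lab]))

        # unsupervised term: reconstruct unlabeled inputs from the features
        unsupervised = np.mean((features(X_unlab) @ W_feat.T - X_unlab) ** 2)
        return supervised + alpha * unsupervised

    rng = np.random.default_rng(0)
    L = semi_supervised_loss(0.1 * rng.normal(size=(8, 16)),
                             0.1 * rng.normal(size=(16, 3)),
                             rng.normal(size=(4, 8)), np.array([0, 1, 2, 1]),
                             rng.normal(size=(64, 8)))
    print(L)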
  • 96. Multi-Task Learning. Face detection is hard because of lighting, pose, but also occluding goggles. Face detection could be made easier by face identification. The identification task may help the detection task. 96 Ranzato
  • 97. Multi-Task Learning - Easy to add many error terms to loss function. - Joint learning of related tasks yields better representations. Example of architecture: 97Collobert et al. “NLP (almost) from scratch” JMLR 2011 Ranzato
  • 98. Multi-Modal Learning. Audio and video streams are often complementary to each other. E.g., audio can provide important clues to improve visual recognition, and vice versa. 98 Ranzato
  • 99. Multi-Modal Learning. - Weak assumptions on the input distribution. - Fully adaptive to data. Example of architecture: modality #1, modality #2, modality #3. Ngiam et al. “Multi-modal deep learning” ICML 2011. 99 Ranzato
  • 100. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 100 Ranzato
  • 101. Boosting & Forests. Deep Nets: - a single highly non-linear system - a “deep” stack of simpler modules - all parameters are subject to learning. Boosting & Forests: - a sequence of “weak” (simple) classifiers that are linearly combined to produce a powerful classifier - subsequent classifiers do not exploit representations of earlier classifiers, it's a “shallow” linear mixture - typically features are not learned. 101 Ranzato
  • 102. Properties. [Table comparing Deep Nets and Boosting on: adaptive features, hierarchical features, end-to-end learning, leveraging unlabeled data, ease of parallelization, fast training, fast at test time.] 102 Ranzato
  • 103. Deep Neural Nets vs. Probabilistic Models. Deep Neural Nets: - mean-field approximations of intractable probabilistic models - usually more efficient - typically more unconstrained (the partition function has to be replaced by other constraints, e.g. sparsity). Hierarchical Probabilistic Models (DBN, DBM, etc.): - in the most interesting cases, they are intractable - they better deal with uncertainty - they can be easily combined. 103 Ranzato
  • 104. Example: Auto-Encoder. Neural net: code Z = σ(W_e^T X + b_e), reconstruction X̃ = W_d Z + b_d. Probabilistic model (Gaussian RBM): E[Z | X] = σ(W^T X + b_e), E[X | Z] = W Z + b_d. 104 Ranzato
  • 105. Properties. [Table comparing Deep Nets and Probabilistic Models on: adaptive features, hierarchical features, end-to-end learning, leveraging unlabeled data, modeling uncertainty, fast training, fast at test time.] 105 Ranzato
  • 106. Outline- Neural Networks for Supervised Training - Architecture - Loss function- Neural Networks for Vision: Convolutional & Tiled- Unsupervised Training of Neural Networks- Extensions: - semi-supervised / multi-task / multi-modal- Comparison to Other Methods - boosting & cascade methods - probabilistic models- Large-Scale Learning with Deep Neural Nets 106 Ranzato
  • 107. Tera-Scale Deep Learning @ Google. Observation #1: more features always improve performance unless data is scarce. Observation #2: deep learning methods have higher capacity and have the potential to model data better. Q #1: Given lots of data and lots of machines, can we scale up deep learning methods? Q #2: Will deep learning methods perform much better? 107 Ranzato
  • 108. The Challenge. A large-scale problem has: – lots of training samples (>10M) – lots of classes (>10K) and – lots of input dimensions (>10K). – The best optimizer in practice is on-line SGD, which is naturally sequential: hard to parallelize. – Layers cannot be trained independently and in parallel: hard to distribute. – The model can have lots of parameters that may clog the network: hard to distribute across machines. 108 Ranzato
  • 109. Our Solution 2nd layer 1st layer input 109Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 110. Our Solution: the model (input, 1st layer, 2nd layer) is split across a 1st, 2nd and 3rd machine. 110 Ranzato
  • 111. Our Solution: 1st machine, 2nd machine, 3rd machine. MODEL PARALLELISM. 111 Ranzato
  • 112. Distributed Deep Nets Deep Net MODEL PARALLELISM input #1 112Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 113. Distributed Deep Nets MODEL PARALLELISM + DATA PARALLELISM input #3 input #2 input #1 113Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 114. Asynchronous SGD: a PARAMETER SERVER and a 1st, 2nd and 3rd model replica. 114 Ranzato
  • 115. Asynchronous SGD: the 1st replica sends its gradient ∂L/∂θ_1 to the parameter server. 115 Ranzato
  • 116. Asynchronous SGD: the parameter server sends parameters θ_1 to the 1st replica. 116 Ranzato
  • 117. Asynchronous SGD: the parameter server updates its parameters. 117 Ranzato
  • 118. Asynchronous SGD: the 2nd replica sends its gradient ∂L/∂θ_2 to the parameter server. 118 Ranzato
  • 119. Asynchronous SGD: the parameter server sends parameters θ_2 to the 2nd replica. 119 Ranzato
  • 120. Asynchronous SGD: the parameter server updates its parameters. 120 Ranzato
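A toy sketch of the asynchronous SGD pattern on slides 114-120 (my own illustration, not Google's implementation): each replica repeatedly pulls the current parameters, computes a gradient on its own mini-batch, and pushes the gradient back; the server applies updates as they arrive, without synchronizing the replicas. The quadratic toy loss, learning rate and class names are made up for illustration.

    import threading
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.theta = np.zeros(dim)
            self.lr = lr
            self.lock = threading.Lock()

        def pull(self):
            with self.lock:
                return self.theta.copy()

        def push(self, grad):
            with self.lock:
                self.theta -= self.lr * grad

    def replica(server, data, n_steps=100):
        rng = np.random.default_rng()
        for _ in range(n_steps):
            theta = server.pull()
            batch = data[rng.integers(0, len(data), size=32)]
            grad = theta - batch.mean(axis=0)   # gradient of a toy quadratic loss
            server.push(grad)

    server = ParameterServer(dim=10)
    data = np.random.default_rng(0).normal(loc=3.0, size=(10000, 10))
    threads = [threading.Thread(target=replica, args=(server, data)) for _ in range(3)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(server.theta[:3])     # should approach the data mean (~3.0)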
  • 121. Unsupervised Learning With 1B Parameters DATA: 10M youtube (unlabeled) frames of size 200x200. 121Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 122. Unsupervised Learning With 1B Parameters Deep Net: – 3 stages – each stage consists of local filtering, L2 pooling, LCN - 18x18 filters - 8 filters at each location - L2 pooling and LCN over 5x5 neighborhoods – training jointly the three layers by: - reconstructing the input of each layer - sparsity on the code 122Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 123. Unsupervised Learning With 1B Parameters Deep Net: – 3 stages – each stage consists of local filtering, L2 pooling, LCN - 18x18 filters - 8 filters at each location - L2 pooling and LCN over 5x5 neighborhoods – training jointly the three layers by: - reconstructing the input of each layer - sparsity on the code 1B parameters!!! 123Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 124. Validating Unsupervised Learning. The network has seen lots of objects during training, but without any label. Q.: How can we validate unsupervised learning? Q.: Did the network form any high-level representation? E.g., does it have any neuron responding to faces? – Build a validation set with 50% faces, 50% random images. - Study properties of neurons. 124 Ranzato
  • 125. Validating Unsupervised Learning neuron responses 1st stage 2nd stage 3rd stage 125Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 126. Top Images For Best Face Neuron 126 Ranzato
  • 127. Best Input For Face Neuron 127Le et al. “Building high-level features using large-scale unsupervised learning” ICML 2012 Ranzato
  • 128. Unsupervised + Supervised (ImageNet): Input Image → 1st stage → 2nd stage → 3rd stage → ŷ → COST ← y. 128 Ranzato
  • 129. Object Recognition on ImageNet, IMAGENET v.2011 (16M images, 20K categories). METHOD / ACCURACY %: Weston & Bengio 2011: 9.3; Linear Classifier on deep features: 13.1; Deep Net (from random): 13.6; Deep Net (from unsup.): 15.8. 129 Ranzato
  • 130. Top Inputs After Supervision 130 Ranzato
  • 131. Top Inputs After Supervision 131 Ranzato
  • 132. Experiments: and many more... - automatic speech recognition - natural language processing - biomed applications - finance. Generic learning algorithm!! 132 Ranzato
  • 133. References. Tutorials & Background Material: – Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp. 1-127, 2009. – LeCun, Chopra, Hadsell, Ranzato, Huang: A Tutorial on Energy-Based Learning, in Bakir, G. and Hofman, T. and Schölkopf, B. and Smola, A. and Taskar, B. (Eds), Predicting Structured Data, MIT Press, 2006. Convolutional Nets: – LeCun, Bottou, Bengio and Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998. – Jarrett, Kavukcuoglu, Ranzato, LeCun: What is the Best Multi-Stage Architecture for Object Recognition?, Proc. International Conference on Computer Vision (ICCV 2009), IEEE, 2009. - Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu, LeCun: Learning Convolutional Feature Hierarchies for Visual Recognition, Advances in Neural Information Processing Systems (NIPS 2010), 23, 2010. 133 Ranzato
  • 134. Unsupervised Learning: – ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. Le, Karpenko, Ngiam, Ng. In NIPS*2011. – Rifai, Vincent, Muller, Glorot, Bengio: Contracting Auto-Encoders: Explicit invariance during feature extraction, in Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML 2011), 2011. - Vincent, Larochelle, Lajoie, Bengio, Manzagol: Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, Journal of Machine Learning Research, 11:3371-3408, 2010. - Gregor, Szlam, LeCun: Structured Sparse Coding via Lateral Inhibition, Advances in Neural Information Processing Systems (NIPS 2011), 24, 2011. - Kavukcuoglu, Ranzato, LeCun: "Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition", ArXiv 1010.3467, 2008. - Hinton, Krizhevsky, Wang: Transforming Auto-encoders, ICANN, 2011. Multi-modal Learning: – Multimodal deep learning, Ngiam, Khosla, Kim, Nam, Lee, Ng. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011. 134 Ranzato
  • 135. Locally Connected Nets: – Gregor, LeCun: “Emergence of complex-like cells in a temporal product network with local receptive fields”, ArXiv, 2009. – Ranzato, Mnih, Hinton: “Generating more realistic images using gated MRFs”, NIPS 2010. – Le, Ngiam, Chen, Chia, Koh, Ng: “Tiled convolutional neural networks”, NIPS 2010. Distributed Learning: – Le, Ranzato, Monga, Devin, Corrado, Chen, Dean, Ng: "Building High-Level Features Using Large Scale Unsupervised Learning", International Conference on Machine Learning (ICML 2012), Edinburgh, 2012. Papers on Scene Parsing: – Farabet, Couprie, Najman, LeCun: “Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers”, in Proc. of the International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, 2012. - Socher, Lin, Ng, Manning: “Parsing Natural Scenes and Natural Language with Recursive Neural Networks”, International Conference on Machine Learning (ICML 2011), 2011. 135 Ranzato
  • 136. Papers on Object Recognition: - Boureau, Le Roux, Bach, Ponce, LeCun: Ask the locals: multi-way local pooling for image recognition, Proc. International Conference on Computer Vision, 2011. - Sermanet, LeCun: Traffic Sign Recognition with Multi-Scale Convolutional Networks, Proceedings of the International Joint Conference on Neural Networks (IJCNN 2011). - Ciresan, Meier, Gambardella, Schmidhuber: Convolutional Neural Network Committees For Handwritten Character Classification, 11th International Conference on Document Analysis and Recognition (ICDAR 2011), Beijing, China. - Ciresan, Meier, Masci, Gambardella, Schmidhuber: Flexible, High Performance Convolutional Neural Networks for Image Classification, International Joint Conference on Artificial Intelligence (IJCAI 2011). Papers on Action Recognition: – Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis, Le, Zou, Yeung, Ng. In Computer Vision and Pattern Recognition (CVPR), 2011. Papers on Segmentation: – Turaga, Briggman, Helmstaedter, Denk, Seung: Maximin learning of image segmentation, NIPS, 2009. 136 Ranzato
  • 137. Papers on Vision for Robotics: – Hadsell, Sermanet, Scoffier, Erkan, Kavukcuoglu, Muller, LeCun: Learning Long-Range Vision for Autonomous Off-Road Driving, Journal of Field Robotics, 26(2):120-144, February 2009. Deep Convex Nets & Deconv-Nets: – Deng, Yu: “Deep Convex Network: A Scalable Architecture for Speech Pattern Classification”, Interspeech, 2011. - Zeiler, Taylor, Fergus: "Adaptive Deconvolutional Networks for Mid and High Level Feature Learning", ICCV, 2011. Papers on Biologically Inspired Vision: – Serre, Wolf, Bileschi, Riesenhuber, Poggio: Robust Object Recognition with Cortex-like Mechanisms, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411-426, 2007. - Pinto, Doukhan, DiCarlo, Cox: "A high-throughput screening approach to discovering good forms of biologically inspired visual representation", PLoS Computational Biology, 2009. 137 Ranzato
  • 138. Papers on Embedded ConvNets for Real-Time Vision Applications: – Farabet, Martini, Corda, Akselrod, Culurciello, LeCun: NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision, Workshop on Embedded Computer Vision, CVPR 2011. Papers on Image Denoising Using Neural Nets: – Burger, Schuler, Harmeling: Image Denoising: Can Plain Neural Networks Compete with BM3D?, Computer Vision and Pattern Recognition, CVPR 2012. 138 Ranzato
  • 139. Software & Links. Deep Learning website: http://deeplearning.net/ C++ code for ConvNets: http://eblearn.sourceforge.net/ Matlab code for the R-ICA unsupervised algorithm: http://ai.stanford.edu/~quocle/rica_release.zip Python-based learning library: http://deeplearning.net/software/theano/ Lush learning library which includes ConvNets: http://lush.sourceforge.net/ Torch7, a learning library that supports neural net training: http://www.torch.ch 139 Ranzato
  • 140. Software & LinksCode used to generate demo for this tutorial- http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/ 140 Ranzato
  • 141. AcknowledgementsQuoc Le, Andrew NgJeff Dean, Kai Chen, Greg Corrado, Matthieu Devin,Mark Mao, Rajat Monga, Paul Tucker, Samy BengioYann LeCun, Pierre Sermanet, Clement Farabet 141 Ranzato
  • 142. Visualizing Learned Features Q: can we interpret the learned features? 142Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007 Ranzato
  • 143. Visualizing Learned Features. Reconstruction: W h = W_1 h_1 + W_2 h_2 + ... ≈ 0.9 · (basis image) + 0.7 · (basis image) + 0.5 · (basis image) + 1.0 · (basis image) + ... The columns of W show what each code unit represents. Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007. 143 Ranzato
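A minimal sketch (an assumption, not the tutorial's code) of the visualization described above: each column of the decoder matrix W is reshaped back into an image patch and displayed, showing what the corresponding code unit represents. Here W is random for illustration; in practice it would come from a trained sparse auto-encoder or RBM.

    import numpy as np
    import matplotlib.pyplot as plt

    patch_side, n_units = 12, 64
    W = np.random.default_rng(0).normal(size=(patch_side * patch_side, n_units))

    fig, axes = plt.subplots(8, 8, figsize=(6, 6))
    for unit, ax in enumerate(axes.ravel()):
        # one column of W, reshaped into a patch_side x patch_side image
        ax.imshow(W[:, unit].reshape(patch_side, patch_side), cmap="gray")
        ax.axis("off")
    plt.show()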
  • 144. Visualizing Learned Features 1st layer features 144Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007 Ranzato
  • 145. Visualizing Learned Features Q: How about the second layer features? A: Similarly, each second layer code unit can be visualized by taking its bases and then projecting those bases in image space through the first layer decoder. 145Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007 Ranzato
  • 146. Visualizing Learned Features Q: How about the second layer features? A: Similarly, each second layer code unit can be visualized by taking its bases and then projecting those bases in image space through the first layer decoder. Missing edges have 0 weight. Light gray nodes have zero value. 146Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007 Ranzato
  • 147. Visualizing Learned Features Q: how are these images computed? 1st layer features2nd layer features 147Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007 Ranzato
  • 148. Example of Feature Learning1st layer features2nd layer features 148Ranzato et al. “Sparse feature learning for DBNs” NIPS 2007 Ranzato
