  1. 5 years from now, everyone will learn their features (you might as well start now)
     Yann LeCun
     Courant Institute of Mathematical Sciences and Center for Neural Science, New York University

  2. I Have a Terrible Confession to Make
     I'm interested in vision, but no more in vision than in audition or in other perceptual modalities. I'm interested in perception (and in control). I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities. Nature seems to have found one.
     Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos: heavy correlation between neighboring variables, and local patches of variables have structure and are representable by feature vectors.
     I like vision because it's challenging, it's useful, it's fun, we have data, and the image recognition community is not yet stuck in a deep local minimum like the speech recognition community.

  3. The Unity of Recognition Architectures

  4. Most Recognition Systems Are Built on the Same Architecture
     Filter Bank -> Non-Linearity -> Feature Pooling -> Normalization -> Classifier (often with two such stages before the classifier).
     First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders...
     Second stage: K-means, sparse coding, LCC...
     Pooling: average, L2, max, max with bias (elastic templates)...
     Convolutional Nets: same architecture, but everything is trained.

  5. Filter Bank + Non-Linearity + Pooling + Normalization
     This model of a feature extraction stage is biologically inspired... whether you like it or not (just ask David Lowe). Inspired by [Hubel and Wiesel 1962]. The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60s).

  6. How well does this work?
     Typical pipeline: oriented edges or SIFT -> winner-takes-all, K-means, or sparse coding -> histogram (sum) pooling -> pyramid histogram or elastic parts models -> SVM or another simple classifier.
     Some results on C101 (I know, I know...):
     SIFT -> K-means -> pyramid pooling -> SVM intersection kernel: >65% [Lazebnik et al. CVPR 2006]
     SIFT -> sparse coding on blocks -> pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
     SIFT -> local sparse coding on blocks -> pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
     (Small) supervised ConvNet with sparsity penalty: >71%, REAL TIME [rejected from CVPR, ICCV, etc.]

  7. Convolutional Networks (ConvNets) fit that model

  8. Why do two stages work better than one stage?
     Filter Bank -> Non-Lin -> Pool -> Norm -> Filter Bank -> Non-Lin -> Pool -> Norm -> Classifier.
     The second stage extracts mid-level features. Having multiple stages helps with the selectivity-invariance dilemma.

  9. Learning Hierarchical Representations
     Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation.
     I agree with David Lowe: we should learn the features. It worked for speech, handwriting, NLP...
     In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features. Mutation: tweak existing features in many different ways. Selection: publish the best ones at CVPR. Reproduction: combine several features from the last CVPR. Iterate. Problem: Moore's law works against you.

  10. Sometimes, Biology gives you good hints.
      Example: contrast normalization.

  11. Harsh Non-Linearity + Contrast Normalization + Sparsity
      C: convolutions (filter bank), followed by soft thresholding + absolute value (rectification).
      N: subtractive and divisive local contrast normalization.
      P: pooling / down-sampling layer: average or max?
      THIS IS ONE STAGE OF THE CONVNET. (A minimal sketch of one such stage follows below.)
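A minimal NumPy/SciPy sketch of one such C-N-P stage, for illustration only: the random 9x9 filters, the Gaussian neighborhood used for normalization, the floor on the divisive term, and the 2x2 average pooling are assumptions, not the parameters used on the slides.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import gaussian_filter

def shrink(x, theta=0.1):
    """Soft-thresholding (shrinkage) non-linearity."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def one_stage(image, filters, theta=0.1, sigma=2.0, pool=2):
    # C: filter bank (one 2D convolution per filter), then shrink + abs
    maps = np.stack([np.abs(shrink(convolve2d(image, w, mode="valid"), theta))
                     for w in filters])
    # N: subtractive then divisive local contrast normalization
    # (Gaussian-weighted mean and std over space, averaged across feature maps)
    mean = gaussian_filter(maps, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    maps = maps - mean
    std = np.sqrt(gaussian_filter(maps ** 2, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True))
    maps = maps / np.maximum(std, std.mean())
    # P: non-overlapping average pooling / down-sampling
    k, h, w = maps.shape
    maps = maps[:, :h - h % pool, :w - w % pool]
    return maps.reshape(k, h // pool, pool, w // pool, pool).mean(axis=(2, 4))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((32, 32))
    filters = rng.standard_normal((8, 9, 9)) * 0.1   # placeholder 9x9 filter bank
    print(one_stage(image, filters).shape)           # (8, 12, 12)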
  12. Soft Thresholding Non-Linearity

  13. Local Contrast Normalization
      Performed on the state of every layer, including the input.
      Subtractive local contrast normalization: subtracts from every value in a feature a Gaussian-weighted average of its neighbors (high-pass filter).
      Divisive local contrast normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
      Subtractive + divisive LCN performs a kind of approximate whitening. (A standalone sketch follows below.)
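A small standalone sketch of the normalization just described, assuming a stack of feature maps shaped (num_maps, height, width) and a Gaussian spatial neighborhood; the constant floor on the divisor is an assumption borrowed from common practice, not something stated on the slide.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(maps, sigma=2.0, eps=1e-6):
    """Subtractive then divisive local contrast normalization."""
    maps = np.asarray(maps, dtype=float)
    # Subtractive: remove a Gaussian-weighted average of each value's neighbors,
    # taken over space and across all feature maps (high-pass filtering).
    local_mean = gaussian_filter(maps, sigma=(0.0, sigma, sigma)).mean(axis=0, keepdims=True)
    centered = maps - local_mean
    # Divisive: divide by the local standard deviation of the neighbors,
    # again over space and across all feature maps.
    local_var = gaussian_filter(centered ** 2, sigma=(0.0, sigma, sigma)).mean(axis=0, keepdims=True)
    local_std = np.sqrt(local_var)
    return centered / np.maximum(local_std, local_std.mean() + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 16, 16))
    y = local_contrast_normalize(x)
    print(y.shape, float(y.mean()), float(y.std()))
```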
  14. C101 Performance (I know, I know)
      Small network: 64 features at stage-1, 256 features at stage-2.
      Tanh non-linearity, no rectification, no normalization: 29%
      Tanh non-linearity, rectification, normalization: 65%
      Shrink non-linearity, rectification, normalization, sparsity penalty: 71%

  15. Results on Caltech101 with sigmoid non-linearity (← like HMAX model)

  16. Feature Learning Works Really Well on everything but C101

  17. C101 is very unfavorable to learning-based systems
      Because it's so small. We are switching to ImageNet.
      Some results on NORB, comparing: no normalization, random filters, unsupervised filters, supervised filters, unsup+sup filters.

  18. Sparse Auto-Encoders
      Inference by gradient descent starting from the encoder output:
      E(Y^i, Z) = ||Y^i − W_d Z||² + ||Z − g_e(W_e, Y^i)||² + Σ_j |z_j|
      Z^i = argmin_z E(Y^i, z; W)
      (The decoder W_d Z reconstructs the input Y; the encoder g_e(W_e, Y) predicts the feature vector Z.)
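A toy NumPy sketch of the energy above and of inference starting from the encoder output. The linear-plus-shrinkage encoder, the explicit sparsity weight lam (the slide's formula shows a plain sum), the step size, and the ISTA-style proximal step for the L1 term are illustrative assumptions.

```python
import numpy as np

def shrink(x, theta):
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def encoder(We, y, theta=0.1):
    return shrink(We @ y, theta)          # g_e(W_e, Y): a simple assumed form

def energy(y, z, Wd, We, lam=0.5):
    return (np.sum((y - Wd @ z) ** 2)
            + np.sum((z - encoder(We, y)) ** 2)
            + lam * np.sum(np.abs(z)))

def infer(y, Wd, We, lam=0.5, lr=0.05, steps=100):
    z = encoder(We, y)                    # start from the encoder prediction
    for _ in range(steps):
        grad = -2 * Wd.T @ (y - Wd @ z) + 2 * (z - encoder(We, y))
        z = shrink(z - lr * grad, lr * lam)   # proximal step handles the L1 term
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 64, 128                         # input dim, code dim (overcomplete)
    Wd = rng.standard_normal((n, m)) / np.sqrt(m)
    We = rng.standard_normal((m, n)) / np.sqrt(n)
    y = rng.standard_normal(n)
    z0, z = encoder(We, y), infer(y, Wd, We)
    print("energy:", energy(y, z0, Wd, We), "->", energy(y, z, Wd, We))
```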
  19. Using PSD to Train a Hierarchy of Features
      Phase 1: train the first layer using PSD.

  20. Using PSD to Train a Hierarchy of Features
      Phase 1: train the first layer using PSD.
      Phase 2: use encoder + absolute value as feature extractor.

  21. Using PSD to Train a Hierarchy of Features
      Phases 1-2 as above.
      Phase 3: train the second layer using PSD.

  22. Using PSD to Train a Hierarchy of Features
      Phases 1-3 as above.
      Phase 4: use encoder + absolute value as 2nd feature extractor.

  23. Using PSD to Train a Hierarchy of Features
      Phases 1-4 as above.
      Phase 5: train a supervised classifier on top.
      Phase 6 (optional): train the entire system with supervised back-propagation.
      (A toy end-to-end sketch of this recipe follows below.)
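A toy end-to-end sketch of the layer-wise recipe (Phases 1-5; the optional Phase 6 back-propagation is omitted). The drastically reduced PSD trainer, the logistic-regression classifier, and all dimensions, learning rates, and iteration counts are assumptions chosen only to keep the example small and runnable; this is not the actual training procedure from the papers.

```python
import numpy as np

def shrink(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def train_psd_layer(X, code_dim, lam=0.3, lr=0.02, epochs=5, rng=None):
    """Very reduced PSD: alternate sparse inference and dictionary/encoder updates."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[1]
    Wd = rng.standard_normal((n, code_dim)) / np.sqrt(code_dim)
    We = rng.standard_normal((code_dim, n)) / np.sqrt(n)
    for _ in range(epochs):
        for y in X:
            z = shrink(We @ y, lam)                      # start at encoder output
            for _ in range(20):                          # ISTA-like inference
                g = -2 * Wd.T @ (y - Wd @ z) + 2 * (z - shrink(We @ y, lam))
                z = shrink(z - 0.02 * g, 0.02 * lam)
            Wd += lr * np.outer(y - Wd @ z, z)           # decoder gradient step
            Wd /= np.maximum(np.linalg.norm(Wd, axis=0), 1.0)  # keep columns bounded
            We += lr * np.outer(z - shrink(We @ y, lam), y)    # push encoder toward z
                                                                # (shrink derivative ignored)
    return Wd, We

def features(We, X, lam=0.3):
    """Phases 2 / 4: encoder followed by absolute value."""
    return np.abs(shrink(X @ We.T, lam))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))                       # stand-in for image patches
labels = (X[:, 0] > 0).astype(float)                     # toy binary labels

_, We1 = train_psd_layer(X, code_dim=96, rng=rng)        # Phase 1
F1 = features(We1, X)                                    # Phase 2
_, We2 = train_psd_layer(F1, code_dim=64, rng=rng)       # Phase 3
F2 = features(We2, F1)                                   # Phase 4

w = np.zeros(F2.shape[1])                                # Phase 5: logistic regression
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-F2 @ w))
    w -= 0.1 * F2.T @ (p - labels) / len(labels)
print("train accuracy:", float(((F2 @ w > 0) == (labels > 0.5)).mean()))
```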
  24. Learned Features on natural patches: V1-like receptive fields

  25. Using PSD Features for Object Recognition
      64 filters on 9x9 patches trained with PSD, with a Linear-Sigmoid-Diagonal encoder.

  26. Convolutional Sparse Coding
      [Kavukcuoglu et al. NIPS 2010]: convolutional PSD
      [Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
      [Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
      [Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
      [Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension

  27. Convolutional Training
      Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector. But when the filters are used convolutionally, neighboring feature vectors will be highly redundant. Patch-level training produces lots of filters that are shifted versions of each other.

  28. Convolutional Sparse Coding
      Replace the dot products with dictionary elements by convolutions.
      Input Y is a full image. Each code component Z_k is a feature map (an image). Each dictionary element is a convolution kernel.
      Regular sparse coding vs. convolutional sparse coding: Y ≈ Σ_k W_k * Z_k
      ("deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]; a sketch follows below.)
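A toy NumPy/SciPy sketch of convolutional sparse coding as described above: the code components Z_k are feature maps, the dictionary elements W_k are small convolution kernels, and Y is reconstructed as Σ_k W_k * Z_k. The ISTA-style inference loop, step size, and kernel/map sizes are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def shrink(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def reconstruct(Z, W):
    # Y_hat = sum_k W_k * Z_k  (each term is a full 2D convolution)
    return sum(convolve2d(z, w, mode="full") for z, w in zip(Z, W))

def energy(Y, Z, W, lam):
    return np.sum((Y - reconstruct(Z, W)) ** 2) + lam * sum(np.sum(np.abs(z)) for z in Z)

def infer(Y, W, lam=0.1, lr=0.01, steps=50):
    s = W.shape[-1]
    Z = np.zeros((len(W), Y.shape[0] - s + 1, Y.shape[1] - s + 1))
    for _ in range(steps):
        R = Y - reconstruct(Z, W)                       # reconstruction residual
        for k in range(len(W)):
            grad = -2 * correlate2d(R, W[k], mode="valid")
            Z[k] = shrink(Z[k] - lr * grad, lr * lam)   # proximal step for the L1 term
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 7, 7)) * 0.1            # 4 placeholder kernels
    Y = rng.standard_normal((28, 28))                   # "full image" input
    Z = infer(Y, W)
    print(energy(Y, np.zeros_like(Z), W, 0.1), "->", energy(Y, Z, W, 0.1))
```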
  29. Convolutional PSD: Encoder with a soft sh() Function
      Convolutional formulation: extend sparse coding from PATCH to IMAGE (patch-based learning vs. convolutional learning). A minimal encoder sketch follows below.
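A minimal sketch of the convolutional encoder idea: one small encoder kernel per feature map, applied convolutionally to the whole image and followed by the soft shrinkage sh(). Kernel sizes and values are assumptions; the actual encoder of the paper may include gains and biases not shown here.

```python
import numpy as np
from scipy.signal import correlate2d

def sh(x, theta):
    """Soft shrinkage non-linearity."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def conv_psd_encode(Y, We, theta=0.1):
    # One predicted feature map per encoder kernel: Z_k = sh(We_k (*) Y)
    return np.stack([sh(correlate2d(Y, w, mode="valid"), theta) for w in We])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((28, 28))
    We = rng.standard_normal((4, 7, 7)) * 0.1
    print(conv_psd_encode(Y, We).shape)   # (4, 22, 22)
```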
  30. CIFAR-10 Dataset
      Dataset of tiny images: 32x32 color images, 10 object categories, 50,000 training and 10,000 test samples. Example images.

  31. Comparative Results on CIFAR-10 Dataset
      * Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
      ** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.

  32. Road Sign Recognition Competition
      GTSRB Road Sign Recognition Competition (phase 1), 32x32 images.
      13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA. No. 6 is humans!

  33. Pedestrian Detection (INRIA Dataset)
      [Sermanet et al., rejected from ICCV 2011]

  34. Pedestrian Detection: Examples
      [Kavukcuoglu et al. NIPS 2010]

  35. Learning Invariant Features

  36. Why just pool over space? Why not over orientation?
      Using an idea from Hyvarinen: topographic square pooling (subspace ICA).
      1. Apply filters on a patch (with suitable non-linearity).
      2. Arrange the filter outputs on a 2D plane.
      3. Square the filter outputs.
      4. Minimize the square root of the sum of blocks of squared filter outputs.
      (A sketch of this pooling term follows below.)
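A small NumPy sketch of the topographic square-pooling term described in steps 1-4: filter outputs are arranged on a 2D grid, squared, summed over neighborhoods, and the square roots of those block sums are what gets minimized. The grid size, block size, and the wrap-around neighborhoods are illustrative assumptions.

```python
import numpy as np

def topographic_pool(filter_outputs, grid=(8, 8), block=3):
    """Per-block sqrt-of-sum-of-squares over a 2D arrangement of units."""
    u = np.asarray(filter_outputs, dtype=float).reshape(grid) ** 2   # steps 2 + 3
    pools = np.zeros(grid)
    r = block // 2
    for i in range(grid[0]):
        for j in range(grid[1]):
            rows = [(i + di) % grid[0] for di in range(-r, r + 1)]   # wrap-around block
            cols = [(j + dj) % grid[1] for dj in range(-r, r + 1)]
            pools[i, j] = np.sqrt(u[np.ix_(rows, cols)].sum())       # step 4
    return pools

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal(64)              # outputs of 64 filters on one patch
    pools = topographic_pool(z)
    sparsity_penalty = pools.sum()           # the quantity to be minimized
    print(pools.shape, float(sparsity_penalty))
```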
  37. Why just pool over space? Why not over orientation?
      The filters arrange themselves spontaneously so that similar filters enter the same pool. The pooling units can be seen as complex cells: they are invariant to local transformations of the input (for some it's translations, for others rotations or other transformations).

  38. Pinwheels?
      Does that look pinwheely to you?

  39. Sparsity through Lateral Inhibition

  40. Invariant Features: Lateral Inhibition
      Replace the L1 sparsity term by a lateral inhibition matrix (one possible form is sketched below).
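The slides do not spell out the exact penalty, so the sketch below is a hedged guess at one common formulation of lateral inhibition: a pairwise term Σ_ij S_ij |z_i| |z_j| replacing the plain L1 sum, where a zero entry in S means units i and j may be active together at no cost. The simple block structure of S is only an illustration (the following slide says the zeros actually follow a tree structure).

```python
import numpy as np

def lateral_inhibition_penalty(z, S):
    """Pairwise inhibition term: sum_ij S_ij |z_i| |z_j| (assumed form)."""
    a = np.abs(z)
    return float(a @ S @ a)

def block_inhibition_matrix(num_units, group_size):
    """S with zeros inside each group (free co-activation) and ones across groups."""
    S = np.ones((num_units, num_units))
    for start in range(0, num_units, group_size):
        S[start:start + group_size, start:start + group_size] = 0.0
    return S

if __name__ == "__main__":
    S = block_inhibition_matrix(8, group_size=4)
    z_within = np.array([1, 1, 0, 0, 0, 0, 0, 0.])   # two active units, same group
    z_across = np.array([1, 0, 0, 0, 1, 0, 0, 0.])   # two active units, different groups
    print(lateral_inhibition_penalty(z_within, S))   # 0.0  (no inhibition)
    print(lateral_inhibition_penalty(z_across, S))   # 2.0  (penalized)
```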
  41. Invariant Features: Lateral Inhibition
      Zeros in the S matrix have a tree structure.

  42. Invariant Features: Lateral Inhibition
      Non-zero values in S form a ring in a 2D topology. Input patches are high-pass filtered.

  43. Invariant Features: Lateral Inhibition
      Non-zero values in S form a ring in a 2D topology. Left: no high-pass filtering of the input. Right: patch-level mean removal.

  44. Invariant Features: Short-Range Lateral Excitation + L1

  45. Disentangling the Explanatory Factors of Images

  46. Separating
      I used to think that recognition was all about eliminating irrelevant information while keeping the useful one: building invariant representations, eliminating irrelevant variabilities.
      I now think that recognition is all about disentangling independent factors of variation: separating "what" and "where", separating content from instantiation parameters.
      Hinton's "capsules"; Karol Gregor's what-where auto-encoders.

  47. Invariant Features through Temporal Constancy
      An object is the cross-product of object type and instantiation parameters [Hinton 1981] (e.g., object type x object size: small, medium, large). [Karol Gregor et al.]

  48. Invariant Features through Temporal Constancy
      Diagram: per-frame codes C1 (decoded through W1) and a single shared code C2 (decoded through W2) jointly predict the input frames S_{t-2}, S_{t-1}, S_t; an encoder predicts the codes from the inputs.

  49. Invariant Features through Temporal Constancy
      C1 (where), C2 (what). (A toy sketch of this idea follows below.)
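A toy sketch of the temporal-constancy idea on these slides: over a short group of frames, the "what" code C2 is shared across the whole group while a separate "where" code C1 is inferred per frame. The linear decoder, plain gradient inference, and all dimensions are illustrative assumptions, not the exact architecture on the slide.

```python
import numpy as np

def infer_codes(frames, W1, W2, lr=0.01, steps=300):
    """Infer per-frame C1 (where) and one shared C2 (what) by gradient descent."""
    T, n = frames.shape
    k1, k2 = W1.shape[1], W2.shape[1]
    C1 = np.zeros((T, k1))
    C2 = np.zeros(k2)                          # shared across the frame group
    for _ in range(steps):
        recon = C1 @ W1.T + C2 @ W2.T          # frame_t ~ W1 C1_t + W2 C2
        R = frames - recon
        C1 += lr * R @ W1                      # per-frame "where" update
        C2 += lr * R.sum(axis=0) @ W2          # shared "what" code sees every frame
    return C1, C2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k1, k2 = 64, 8, 8
    W1 = rng.standard_normal((n, k1)) / np.sqrt(k1)
    W2 = rng.standard_normal((n, k2)) / np.sqrt(k2)
    frames = rng.standard_normal((3, n))       # stand-in for S_{t-2}, S_{t-1}, S_t
    C1, C2 = infer_codes(frames, W1, W2)
    err = np.sum((frames - (C1 @ W1.T + C2 @ W2.T)) ** 2)
    print(C1.shape, C2.shape, float(err))
```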
  50. Generating from the Network

  51. What is the right criterion to train hierarchical feature extraction architectures?

  52. Flattening the Data Manifold?
      The manifold of all images of <Category-X> is low-dimensional and highly curvy. Feature extractors should "flatten" the manifold.

  53. Flattening the Data Manifold?

  54. The Ultimate Recognition System
      Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation.
      Bottom-up and top-down information. Top-down: complex inference and disambiguation. Bottom-up: learns to quickly predict the result of the top-down inference.
      Integrated supervised and unsupervised learning: capture the dependencies between all observed variables.
      Compositionality: each stage has latent instantiation variables.
