Fcv learn le_cun
Transcript

  • 1. 5 years from now, everyone will learn their features (you might as well start now). Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.
  • 2. I Have a Terrible Confession to Make. I'm interested in vision, but no more in vision than in audition or other perceptual modalities; I'm interested in perception (and in control). I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities. Nature seems to have found one. Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos: heavy correlation between neighboring variables, and local patches of variables have structure and are representable by feature vectors. I like vision because it's challenging, it's useful, it's fun, we have data, and the image recognition community is not yet stuck in a deep local minimum like the speech recognition community.
  • 3. The Unity of Recognition Architectures.
  • 4. Most Recognition Systems Are Built on the Same Architecture. Single stage: Filter Bank → Non-Linearity → Feature Pooling → Normalization → Classifier. Two stages: [Filter Bank → Non-Lin → Pool → Norm] → [Filter Bank → Non-Lin → Pool → Norm] → Classifier. First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders... Second stage: K-means, sparse coding, LCC... Pooling: average, L2, max, max with bias (elastic templates)... Convolutional Nets: same architecture, but everything is trained.
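Each stage of this pipeline is a simple map over arrays. Below is a minimal NumPy/SciPy sketch of one stage (filter bank → non-linearity → spatial pooling) with random stand-in filters; `feature_stage`, the tanh/abs non-linearity, and the 2x2 average pooling are illustrative choices, not the specific systems listed on the slide.

```python
# Minimal sketch of one "filter bank -> non-linearity -> pooling" stage.
# Filters are random stand-ins; a real system would learn or hand-design them.
import numpy as np
from scipy.signal import convolve2d

def feature_stage(image, filters, pool=2):
    """Apply a filter bank, a rectifying non-linearity, and average pooling."""
    maps = []
    for f in filters:
        r = convolve2d(image, f, mode="valid")   # filter bank (convolution)
        r = np.abs(np.tanh(r))                   # non-linearity + rectification
        # average pooling over non-overlapping pool x pool windows
        h, w = (r.shape[0] // pool) * pool, (r.shape[1] // pool) * pool
        r = r[:h, :w].reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
        maps.append(r)
    return np.stack(maps)                        # (n_filters, H', W')

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
filters = rng.standard_normal((8, 5, 5)) * 0.1   # 8 random 5x5 filters
stage1 = feature_stage(image, filters)
print(stage1.shape)                              # (8, 14, 14)
```

A second stage would apply the same recipe to each resulting feature map (optionally after normalization, as in the two-stage diagram), with a classifier on top.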
  • 5. Filter Bank + Non-Linearity + Pooling + Normalization. Filter Bank → Non-Linearity → Spatial Pooling. This model of a feature extraction stage is biologically inspired, whether you like it or not (just ask David Lowe). Inspired by [Hubel and Wiesel 1962]. The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60s).
  • 6. How well does this work? Two such stages followed by a classifier: oriented edges → winner-take-all or sparse coding (SIFT) → histogram (sum) → pyramid histogram or elastic parts models → SVM, K-means, or another simple classifier. Some results on C101 (I know, I know...): SIFT → K-means → pyramid pooling → SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]. SIFT → sparse coding on blocks → pyramid pooling → SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]. SIFT → local sparse coding on blocks → pyramid pooling → SVM: >77% [Boureau et al. ICCV 2011]. (Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.], in real time.
  • 7. Convolutional Networks (ConvNets) fit that model.
  • 8. Why do two stages work better than one stage? [Filter Bank → Non-Lin → Pool → Norm] → [Filter Bank → Non-Lin → Pool → Norm] → Classifier. The second stage extracts mid-level features. Having multiple stages helps with the selectivity-invariance dilemma.
  • 9. Learning Hierarchical Representations. Trainable Feature Transform → Trainable Feature Transform → Trainable Classifier: a learned internal representation. I agree with David Lowe: we should learn the features. It worked for speech, handwriting, NLP... In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features: mutation (tweak existing features in many different ways), selection (publish the best ones at CVPR), reproduction (combine several features from the last CVPR), iterate. Problem: Moore's law works against you.
  • 10. Sometimes, Biology Gives You Good Hints. Example: contrast normalization.
  • 11. Harsh Non-Linearity + Contrast Normalization + Sparsity. One stage: C, convolutions (filter bank); then soft thresholding + absolute value (rectification); N, subtractive and divisive local contrast normalization; P, pooling / down-sampling layer (average or max?). THIS IS ONE STAGE OF THE CONVNET.
  • 12. Soft Thresholding Non-Linearity.
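As a concrete reference point, here is the standard soft-thresholding (shrinkage) function, sign(x)·max(|x|−θ, 0); the "shrink" non-linearity in these slides is a (possibly smoothed) variant of this, so treat the exact form as an assumption.

```python
import numpy as np

def soft_threshold(x, theta):
    """Soft thresholding / shrinkage: zero out small values, shrink the rest.
    This is the proximal operator of the L1 penalty theta * |x|."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

x = np.linspace(-2, 2, 9)
print(soft_threshold(x, 0.5))
```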
  • 13. Local Contrast Normalization. Performed on the state of every layer, including the input. Subtractive local contrast normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter). Divisive local contrast normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps. Subtractive + divisive LCN performs a kind of approximate whitening.
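A sketch of subtractive + divisive local contrast normalization over a stack of feature maps, following the description above; the Gaussian width, the floor on the divisor, and the function name `local_contrast_normalize` are assumptions rather than the exact recipe from the talk.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(maps, sigma=2.0, eps=1e-4):
    """maps: array of shape (n_maps, H, W).
    Subtractive step: remove a Gaussian-weighted local mean (over space,
    averaged over all maps).  Divisive step: divide by the local standard
    deviation computed the same way, floored to avoid blow-up."""
    local_mean = np.stack([gaussian_filter(m, sigma) for m in maps]).mean(axis=0)
    centered = maps - local_mean                      # subtractive LCN
    local_var = np.stack([gaussian_filter(c * c, sigma) for c in centered]).mean(axis=0)
    local_std = np.sqrt(local_var)
    divisor = np.maximum(local_std, max(local_std.mean(), eps))
    return centered / divisor                         # divisive LCN

rng = np.random.default_rng(0)
maps = rng.standard_normal((8, 14, 14))
print(local_contrast_normalize(maps).shape)           # (8, 14, 14)
```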
  • 14. C101 Performance (I know, I know). Small network: 64 features at stage-1, 256 features at stage-2. Tanh non-linearity, no rectification, no normalization: 29%. Tanh non-linearity, rectification, normalization: 65%. Shrink non-linearity, rectification, normalization, sparsity penalty: 71%.
  • 15. Results on Caltech101 with sigmoid non-linearity (← like the HMAX model).
  • 16. Feature Learning Works Really Well, on everything but C101.
  • 17. C101 is very unfavorable to learning-based systems, because it's so small. We are switching to ImageNet. Some results on NORB (no normalization), comparing random filters, unsupervised filters, supervised filters, and unsupervised + supervised filters.
  • 18. Sparse Auto-Encoders. Inference by gradient descent starting from the encoder output. Energy: E(Y^i, Z) = ||Y^i − W_d Z||^2 + ||Z − g_e(W_e, Y^i)||^2 + λ Σ_j |z_j|, with Z^i = argmin_z E(Y^i, z; W). The input Y feeds both a decoder (reconstruction W_d Z, error ||Y^i − Ŷ||^2) and an encoder (prediction g_e(W_e, Y^i), error ||Z − Ẑ||^2); the code Z, which gives the features, carries the L1 penalty Σ_j |z_j|.
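A minimal sketch of this energy and of inference by proximal gradient descent (ISTA-style) warm-started from the encoder, assuming a simple linear-tanh encoder g_e(W_e, Y) = tanh(W_e Y); the weights are random, and the step size, λ, and iteration count are illustrative assumptions, not values from the talk.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def psd_energy(y, z, Wd, We, lam):
    """E(Y, Z) = ||Y - Wd Z||^2 + ||Z - ge(We, Y)||^2 + lam * sum_j |z_j|,
    with a simple linear-tanh encoder ge(We, Y) = tanh(We @ Y)."""
    z_pred = np.tanh(We @ y)
    return (np.sum((y - Wd @ z) ** 2)
            + np.sum((z - z_pred) ** 2)
            + lam * np.sum(np.abs(z)))

def infer_code(y, Wd, We, lam=0.1, n_steps=50, step=None):
    """Minimize E over z by proximal gradient descent (ISTA),
    starting from the encoder prediction."""
    z = np.tanh(We @ y)                      # warm start from the encoder
    L = np.linalg.norm(Wd.T @ Wd, 2) + 1.0   # Lipschitz bound of the smooth part
    step = step or 1.0 / (2 * L)
    for _ in range(n_steps):
        grad = 2 * Wd.T @ (Wd @ z - y) + 2 * (z - np.tanh(We @ y))
        z = soft_threshold(z - step * grad, step * lam)
    return z

rng = np.random.default_rng(0)
y = rng.standard_normal(64)                  # e.g. a flattened 8x8 patch
Wd = rng.standard_normal((64, 128)) * 0.1    # decoder dictionary
We = rng.standard_normal((128, 64)) * 0.1    # encoder weights
z = infer_code(y, Wd, We)
print(psd_energy(y, z, Wd, We, lam=0.1), np.mean(z != 0))
```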
  • 19. Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD (decoder, encoder, and L1-penalized code as on the previous slide).
  • 20. Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD. Phase 2: use the encoder + absolute value as the feature extractor (features = |g_e(W_e, Y^i)|).
  • 21. Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD. Phase 2: use the encoder + absolute value as the feature extractor. Phase 3: train the second layer using PSD on those features.
  • 22. Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD. Phase 2: use the encoder + absolute value as the feature extractor. Phase 3: train the second layer using PSD. Phase 4: use the second encoder + absolute value as the 2nd-stage feature extractor.
  • 23. Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD. Phase 2: use the encoder + absolute value as the feature extractor. Phase 3: train the second layer using PSD. Phase 4: use the second encoder + absolute value as the 2nd-stage feature extractor. Phase 5: train a supervised classifier on top. Phase 6 (optional): train the entire system with supervised back-propagation.
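A rough, self-contained sketch of these greedy phases on random toy data. The PSD training step here is drastically simplified (a few alternating ISTA and gradient updates, not the procedure from the papers), the classifier is plain least squares, and phase 6 (end-to-end supervised back-propagation) is omitted; the names `train_psd_layer` and `features` and all hyper-parameters are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def train_psd_layer(Y, n_code, lam=0.1, n_epochs=20, lr=0.01, seed=0):
    """Very simplified PSD layer training: for each sample, infer a sparse
    code by a few ISTA steps, then nudge decoder Wd and encoder We toward it."""
    rng = np.random.default_rng(seed)
    d = Y.shape[1]
    Wd = rng.standard_normal((d, n_code)) * 0.1
    We = rng.standard_normal((n_code, d)) * 0.1
    for _ in range(n_epochs):
        for y in Y:
            z = np.tanh(We @ y)
            for _ in range(10):                      # inference (ISTA steps)
                grad = 2 * Wd.T @ (Wd @ z - y) + 2 * (z - np.tanh(We @ y))
                z = soft_threshold(z - 0.05 * grad, 0.05 * lam)
            Wd += lr * np.outer(y - Wd @ z, z)       # improve reconstruction
            Wd /= np.maximum(np.linalg.norm(Wd, axis=0, keepdims=True), 1.0)
            pre = We @ y                             # improve code prediction
            We += lr * np.outer((z - np.tanh(pre)) * (1 - np.tanh(pre) ** 2), y)
    return Wd, We

def features(Y, We):
    """Phase 2 / 4: encoder + absolute value as feature extractor."""
    return np.abs(np.tanh(Y @ We.T))

rng = np.random.default_rng(1)
Y = rng.standard_normal((200, 64))                   # toy "patches"
labels = (Y[:, 0] > 0).astype(float)                 # toy binary labels
_, We1 = train_psd_layer(Y, n_code=32)               # Phase 1
F1 = features(Y, We1)                                # Phase 2
_, We2 = train_psd_layer(F1, n_code=16)              # Phase 3
F2 = features(F1, We2)                               # Phase 4
# Phase 5: a linear classifier on top (least squares here for brevity)
w, *_ = np.linalg.lstsq(np.c_[F2, np.ones(len(F2))], labels, rcond=None)
pred = np.c_[F2, np.ones(len(F2))] @ w > 0.5
print("train accuracy:", (pred == labels.astype(bool)).mean())
```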
  • 24. Learned Features on natural patches: V1-like receptive fields.
  • 25. Using PSD Features for Object Recognition. 64 filters on 9x9 patches, trained with PSD with a Linear-Sigmoid-Diagonal encoder.
  • 26. Convolutional Sparse Coding. [Kavukcuoglu et al. NIPS 2010]: convolutional PSD. [Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network. [Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine. [Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine. [Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension.
  • 27. Convolutional Training. Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector. But when the filters are used convolutionally, neighboring feature vectors will be highly redundant: patch-level training produces lots of filters that are shifted versions of each other.
  • 28. Convolutional Sparse Coding. Replace the dot products with dictionary elements by convolutions. The input Y is a full image; each code component Z_k is a feature map (an image); each dictionary element W_k is a convolution kernel. Regular sparse coding reconstructs a patch as a linear combination of dictionary elements; convolutional sparse coding reconstructs the image as Y = Σ_k W_k * Z_k, where * is convolution ("deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]).
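A small sketch of the convolutional reconstruction Y ≈ Σ_k W_k * Z_k and the corresponding energy with an L1 penalty on the feature maps; kernels and codes are random, and the use of 'same'-mode convolution (to keep shapes simple) is an assumption.

```python
import numpy as np
from scipy.signal import convolve2d

def reconstruct(Z, W):
    """Y_hat = sum_k W_k * Z_k  (each Z_k is a feature map, each W_k a kernel)."""
    return sum(convolve2d(z, w, mode="same") for z, w in zip(Z, W))

def conv_sc_energy(Y, Z, W, lam=0.1):
    """||Y - sum_k W_k * Z_k||^2 + lam * sum_k sum_xy |Z_k(x, y)|"""
    return np.sum((Y - reconstruct(Z, W)) ** 2) + lam * np.abs(Z).sum()

rng = np.random.default_rng(0)
Y = rng.standard_normal((32, 32))         # a full image, not a patch
W = rng.standard_normal((8, 7, 7)) * 0.1  # 8 convolution kernels
Z = rng.standard_normal((8, 32, 32))      # 8 feature maps (the code)
print(conv_sc_energy(Y, Z, W))
```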
  • 29. Convolutional PSD: Encoder with a smooth shrinkage sh() function. Convolutional formulation: extend sparse coding from PATCH-based learning to CONVOLUTIONAL learning over the whole IMAGE.
  • 30. Cifar-10 Dataset. Dataset of tiny images: 32x32 color images, 10 object categories, 50000 training and 10000 testing images. (Example images shown.)
  • 31. Comparative Results on the Cifar-10 Dataset. * Krizhevsky, Learning multiple layers of features from tiny images, Master's thesis, Dept. of CS, U. of Toronto. ** Ranzato and Hinton, Modeling pixel means and covariances using a factorized third-order Boltzmann machine, CVPR 2010.
  • 32. Road Sign Recognition Competition. GTSRB Road Sign Recognition Competition (phase 1), 32x32 images. 13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA. Entry No. 6 is humans!
  • 33. Pedestrian Detection (INRIA Dataset). [Sermanet et al., rejected from ICCV 2011]
  • 34. Pedestrian Detection: Examples. [Kavukcuoglu et al. NIPS 2010]
  • 35. Learning Invariant Features.
  • 36. Why just pool over space? Why not over orientation? Using an idea from Hyvarinen: topographic square pooling (subspace ICA). 1. Apply filters on a patch (with a suitable non-linearity). 2. Arrange the filter outputs on a 2D plane. 3. Square the filter outputs. 4. Minimize the square root of the sum of blocks of squared filter outputs.
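The pooling penalty in steps 2-4 can be written down directly. Below is a hedged sketch in which filter outputs are arranged on an 8x8 grid and the penalty sums, over non-overlapping 2x2 blocks, the square root of the summed squares; the grid size, block size, and absence of overlap are assumptions for illustration (topographic variants often use overlapping neighborhoods).

```python
import numpy as np

def topographic_penalty(z, grid=(8, 8), block=2):
    """Arrange code z on a grid, square it, and sum sqrt(sum of squares)
    over non-overlapping block x block neighborhoods (steps 2-4 above)."""
    zmap = z.reshape(grid) ** 2                          # steps 2-3
    gh, gw = grid[0] // block, grid[1] // block
    blocks = zmap.reshape(gh, block, gw, block).sum(axis=(1, 3))
    return np.sqrt(blocks + 1e-8).sum()                  # step 4

rng = np.random.default_rng(0)
z = rng.standard_normal(64)          # 64 filter outputs on an 8x8 grid
print(topographic_penalty(z))
```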
  • 37. Why just pool over space? Why not over orientation? The filters arrange themselves spontaneously so that similar filters enter the same pool. The pooling units can be seen as complex cells: they are invariant to local transformations of the input. For some it's translations, for others rotations or other transformations.
  • 38. Pinwheels? Does that look pinwheely to you?
  • 39. Sparsity through Lateral Inhibition.
  • 40. Invariant Features: Lateral Inhibition. Replace the L1 sparsity term by a lateral inhibition matrix.
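In the lateral-inhibition formulation this refers to (Gregor, Szlam & LeCun), the L1 term is replaced by an interaction penalty that is quadratic in |z|, roughly of the form Σ_ij S_ij |z_i||z_j|, so that units with a positive entry in S discourage each other from being active together. The sketch below only evaluates such a penalty for an example S whose zero pattern forms a ring; the exact construction and scaling are assumptions made for illustration, not the formulation from the talk.

```python
import numpy as np

def lateral_inhibition_penalty(z, S):
    """Interaction penalty |z|^T S |z|: units i, j with S_ij > 0 inhibit each
    other, so they tend not to be active at the same time."""
    a = np.abs(z)
    return a @ S @ a

def ring_inhibition_matrix(n, radius=3):
    """Example S on a 1D ring topology: zero near the diagonal (no inhibition
    between topographic neighbors), positive for units further apart."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, n - dist)        # wrap-around (ring) distance
    return (dist > radius).astype(float)

rng = np.random.default_rng(0)
z = rng.standard_normal(32)
S = ring_inhibition_matrix(32)
print(lateral_inhibition_penalty(z, S))
```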
  • 41. Invariant Features: Lateral Inhibition. The zeros in the S matrix have a tree structure.
  • 42. Invariant Features: Lateral Inhibition. Non-zero values in S form a ring in a 2D topology. Input patches are high-pass filtered.
  • 43. Invariant Features: Lateral Inhibition. Non-zero values in S form a ring in a 2D topology. Left: no high-pass filtering of the input. Right: patch-level mean removal.
  • 44. Invariant Features: Short-Range Lateral Excitation + L1.
  • 45. Disentangling the Explanatory Factors of Images.
  • 46. Separating. I used to think that recognition was all about eliminating irrelevant information while keeping the useful information: building invariant representations, eliminating irrelevant variabilities. I now think that recognition is all about disentangling independent factors of variation: separating "what" and "where", separating content from instantiation parameters. Hinton's "capsules"; Karol Gregor's what-where auto-encoders.
  • 47. Invariant Features through Temporal Constancy. An object is the cross-product of object type and instantiation parameters [Hinton 1981]; e.g., object type versus object size (small, medium, large). [Karol Gregor et al.]
  • 48. Invariant Features through Temporal Constancy. Architecture diagram: inputs S^{t-2}, S^{t-1}, S^t pass through an encoder (weights W_1, W_2) to inferred codes C_1^{t-2}, C_1^{t-1}, C_1^t and C_2^t; a decoder (f∘W_1 and W_2) produces the predicted codes and the predicted input.
  • 49. Invariant Features through Temporal Constancy. C1 codes the "where", C2 codes the "what".
  • 50. Generating from the Network (generations shown alongside the input).
  • 51. What is the right criterion to train hierarchical feature extraction architectures?
  • 52. Flattening the Data Manifold? The manifold of all images of <Category-X> is low-dimensional and highly curvy. Feature extractors should "flatten" the manifold.
  • 53. Flattening the Data Manifold?
  • 54. The Ultimate Recognition System. Trainable Feature Transform → Trainable Feature Transform → Trainable Classifier, with a learned internal representation. Bottom-up and top-down information: top-down performs complex inference and disambiguation; bottom-up learns to quickly predict the result of the top-down inference. Integrated supervised and unsupervised learning: capture the dependencies between all observed variables. Compositionality: each stage has latent instantiation variables.