License: CC Attribution-ShareAlike License


- 1. CVPR12 Tutorial on Deep Learning: Sparse Coding. Kai Yu (yukai@baidu.com), Department of Multimedia, Baidu
- 2. Relentless research on visual recognition: Caltech 101, 80 Million Tiny Images, ImageNet, PASCAL VOC. 08/22/12
- 3. The pipeline of machine visual perception: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition). Most efforts in machine learning focus on the inference stage, yet feature extraction is most critical for accuracy, accounts for most of the computation at test time, is most time-consuming in the development cycle, and is often hand-crafted in practice.
- 4. Computer vision features: SIFT, Spin image, HoG, RIFT, GLOH. Slide credit: Andrew Ng
- 5. Learning features from data. Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition). Feature learning: instead of designing features, let's design feature learners.
- 6. Learning features from data via sparse coding. Sparse coding offers an effective building block to learn useful features.
- 7. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding; 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 8. “BoW representation + SPM” Paradigm - I. Bag-of-visual-words representation (BoW) based on VQ coding. Figure credit: Fei-Fei Li
- 9. “BoW representation + SPM” Paradigm - II. Spatial pyramid matching: pooling at different scales and locations. Figure credit: Svetlana Lazebnik
- 10. Image classification using “BoW + SPM”: dense SIFT → VQ coding → spatial pooling → classifier
- 11. The architecture of “coding + pooling”: coding → pooling → coding → pooling, e.g., convolutional neural nets, HMAX, BoW, …
- 12. “BoW+SPM” has two coding+pooling layers: local gradients → pooling (e.g. SIFT, HOG) → VQ coding → average pooling (obtain histogram) → SVM. The SIFT feature itself follows a coding+pooling operation.
- 13. Develop better coding methods: better coding → better pooling → better coding → better pooling → better classifier. Coding: a nonlinear mapping of data into another feature space. Better coding methods: sparse coding, RBMs, auto-encoders.
- 14. What is sparse coding? Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection). Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …]. Coding: for a data vector x, solve a LASSO problem to find the sparse coefficient vector a.
- 15. Sparse coding: training time. Input: images x1, x2, …, xm (each in R^d). Learn: a dictionary of bases φ1, φ2, …, φk (also in R^d). Alternating optimization: 1. Fix the dictionary φ1, …, φk and optimize the activations a (a standard LASSO problem); 2. Fix the activations a and optimize the dictionary φ1, …, φk (a convex QP problem).
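The alternating scheme above can be sketched in NumPy. This is a hypothetical minimal implementation, not the tutorial's actual code: ISTA stands in for the LASSO solver, and the dictionary step uses a ridge-regularized least-squares update rather than a full constrained QP.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 norm, used by ISTA for the LASSO step.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(D, x, lam, n_iter=200):
    # Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative shrinkage (ISTA).
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a + D.T @ (x - D @ a) / L, lam / L)
    return a

def learn_dictionary(X, k, lam=0.1, n_outer=10, seed=0):
    # X: (d, m) data columns. Alternate the coding step and a closed-form
    # dictionary update, renormalizing columns each round.
    rng = np.random.default_rng(seed)
    d, m = X.shape
    D = rng.standard_normal((d, k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = np.column_stack([lasso_ista(D, X[:, i], lam) for i in range(m)])
        D = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(k))
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D

X = np.random.default_rng(1).standard_normal((8, 50))
D = learn_dictionary(X, k=16)
print(D.shape)  # (8, 16)
```

Real systems use far more efficient solvers (e.g. the feature-sign algorithm mentioned later in the slides); this sketch only shows the alternating structure.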
- 16. Sparse coding: testing time. Input: a novel image patch x (in R^d) and the previously learned φi's. Output: the representation [ai,1, ai,2, …, ai,k] of image patch xi. Example: xi ≈ 0.8 · (basis) + 0.3 · (basis) + 0.5 · (basis), represented as ai = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]
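The test-time coding step can be illustrated as follows. This is an assumed sketch: orthogonal matching pursuit is used as a simple greedy stand-in for the LASSO solve, recovering a code with a few nonzero coefficients against fixed bases.

```python
import numpy as np

def omp(D, x, n_nonzero):
    # Greedy sparse coding: repeatedly pick the basis most correlated with
    # the residual, then refit coefficients on the selected support.
    residual, support = x.copy(), []
    a = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        sub = D[:, support]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ coef
    a[support] = coef
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((32, 64))
D /= np.linalg.norm(D, axis=0)          # unit-norm bases
true = np.zeros(64)
true[[5, 20, 41]] = [0.8, 0.3, 0.5]     # a planted 3-sparse code
x = D @ true
a = omp(D, x, 3)
print(np.flatnonzero(a))                # indices of the activated bases
```

With incoherent random bases, the recovered support typically matches the planted one, mirroring the slide's x ≈ 0.8·φ + 0.3·φ + 0.5·φ example.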
- 17. Sparse coding illustration. From natural images, the learned bases (φ1, …, φ64) are edge-like. Test example: x ≈ 0.8 · φ36 + 0.3 · φ42 + 0.5 · φ63, giving the feature representation [a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0]: compact & easily interpretable. Slide credit: Andrew Ng
- 18. Self-taught learning [Raina, Lee, Battle, Packer & Ng, ICML 07]. Training: motorcycles vs. not motorcycles, plus unlabeled images. Testing: what is this? Slide credit: Andrew Ng
- 19. Classification result on Caltech 101 (9K images, 101 classes): SIFT VQ + nonlinear SVM: 64%; pixel sparse coding + linear SVM: ~50%
- 20. Sparse coding on SIFT – the ScSPM algorithm [Yang, Yu, Gong & Huang, CVPR09]: local gradients → pooling (e.g. SIFT, HOG) → sparse coding → max pooling → linear classifier
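The max-pooling stage of this pipeline can be sketched as below. Shapes and the (1, 2, 4) pyramid levels are assumptions for illustration; given sparse codes for local descriptors and their normalized image positions, the code max-pools within each spatial cell and concatenates the results.

```python
import numpy as np

def max_pool_spm(codes, positions, levels=(1, 2, 4)):
    # codes: (n_descriptors, k) sparse codes; positions: (n, 2) in [0, 1)^2.
    # For each pyramid level l, split the image into l x l cells and take the
    # elementwise max of the codes falling in each cell.
    pooled = []
    for l in levels:
        cell = np.minimum((positions * l).astype(int), l - 1)
        for cy in range(l):
            for cx in range(l):
                mask = (cell[:, 0] == cy) & (cell[:, 1] == cx)
                pooled.append(codes[mask].max(axis=0) if mask.any()
                              else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)  # length k * (1 + 4 + 16) for levels (1,2,4)

rng = np.random.default_rng(0)
# Fake nonnegative ~5%-sparse codes for 200 descriptors over a 128-atom dictionary.
codes = np.abs(rng.standard_normal((200, 128))) * (rng.random((200, 128)) < 0.05)
pos = rng.random((200, 2))
feat = max_pool_spm(codes, pos)
print(feat.shape)  # (2688,) = 128 * 21 cells
```

The concatenated vector is what the linear classifier consumes; max pooling (rather than the average pooling of the BoW histogram) is a key ingredient of ScSPM.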
- 21. Sparse coding on SIFT – the ScSPM algorithm [Yang, Yu, Gong & Huang, CVPR09]. Caltech-101: SIFT VQ + nonlinear SVM: 64%; SIFT sparse coding + linear SVM (ScSPM): 73%
- 22. Summary: accuracies on Caltech 101. Dense SIFT → VQ coding → spatial pooling → nonlinear SVM: 64%. Pixel → sparse coding → spatial max pooling → linear SVM: 50%. Dense SIFT → sparse coding → spatial max pooling → linear SVM: 73%. Key messages: sparse coding is a better building block, and deep models are preferred.
- 23. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding; 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 24. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding (connections to RBMs, autoencoders, …; sparse activations vs. sparse models; sparsity vs. locality: local sparse coding methods); 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 25. Classical sparse coding: a is sparse; a is often of higher dimension than x; the activation a = f(x) is a nonlinear, implicit function of x; the reconstruction x' = g(a) is linear & explicit.
- 26. RBMs & autoencoders: also involve activation and reconstruction, but have an explicit f(x); they do not necessarily enforce sparsity on a, but imposing sparsity on a often improves results [e.g. sparse RBM, Lee et al. NIPS 08].
- 27. Sparse coding: a broader view. Any feature mapping from x to a, i.e. a = f(x), where a is sparse (and often of higher dimension than x), f(x) is nonlinear, and there is a reconstruction x' = g(a) such that x' ≈ x. Therefore sparse RBMs, sparse auto-encoders, and even VQ can be viewed as forms of sparse coding.
- 28. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding (connections to RBMs, autoencoders, …; sparse activations vs. sparse models; sparsity vs. locality: local sparse coding methods); 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 29. Sparse activations vs. sparse models. For a general function learning problem a = f(x): 1. Sparse model: f(x)'s parameters are sparse. Example: LASSO f(x) = <w, x> with w sparse; the goal is feature selection (all data select a common subset of features); a hot topic in machine learning. 2. Sparse activations: f(x)'s outputs are sparse. Example: sparse coding a = f(x) with a sparse; the goal is feature learning (different data points activate different feature subsets).
- 30. Example of sparse models: f(x) = <w, x>, where w = [0, 0.2, 0, 0.1, 0, 0]. Because the 2nd and 4th elements of w are non-zero, these are the two features selected for every x1, x2, …, xm: a globally-aligned sparse representation.
- 31. Example of sparse activations (sparse coding): different x's have different dimensions activated, for codes a1, a2, …, am of x1, x2, …, xm. A locally-shared sparse representation: similar x's tend to have similar non-zero dimensions.
- 32. Example of sparse activations (sparse coding): another example, preserving manifold structure. More informative in highlighting richer data structures, e.g. clusters and manifolds.
- 33. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding (connections to RBMs, autoencoders, …; sparse activations vs. sparse models; sparsity vs. locality: local sparse coding methods); 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 34. Sparsity vs. locality. Intuition: similar data should activate similar features. Local sparse coding: data in the same neighborhood tend to have shared activated features; data in different neighborhoods tend to have different features activated.
- 35. Sparse coding is not always local: example. Case 1, independent subspaces: each basis is a “direction”; sparsity means each datum is a linear combination of only a few bases. Case 2, data manifold (or clusters): each basis is an “anchor point”; sparsity means each datum is a linear combination of neighboring anchors; here sparsity is caused by locality.
- 36. Two approaches to local sparse coding. Approach 1: coding via local anchor points. Approach 2: coding via local subspaces.
- 37. Classical sparse coding is empirically local: when it works best for classification, the codes are often found to be local. It is preferred to let similar data have similar non-zero dimensions in their codes.
- 38. MNIST experiment: classification using SC. Try different values of lambda: 60K training images, 10K test images; k = 512; linear SVM on the sparse codes.
- 39. MNIST experiment, lambda = 0.0005: each basis is like a part or direction.
- 40. MNIST experiment, lambda = 0.005: again, each basis is like a part or direction.
- 41. MNIST experiment, lambda = 0.05: now each basis is more like a digit!
- 42. MNIST experiment, lambda = 0.5: like VQ now!
- 43. Geometric view of sparse coding. Errors: 4.54%, 3.75%, 2.64%. When sparse coding achieves its best classification accuracy, the learned bases are like digits: each basis has a clear local class association.
- 44. Distribution of coefficients (MNIST): neighboring bases tend to get nonzero coefficients.
- 45. Distribution of coefficients (SIFT, Caltech101): a similar observation here!
- 46. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding (connections to RBMs, autoencoders, …; sparse activations vs. sparse models; sparsity vs. locality: local sparse coding methods); 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 47. Why develop local sparse coding methods? Since locality is a preferred property in sparse coding, let's explicitly ensure it. The new algorithms can be well justified theoretically and have computational advantages over classical sparse coding.
- 48. Two approaches to local sparse coding. Approach 1, coding via local anchor points (local coordinate coding): “Learning locality-constrained linear coding for image classification”, Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, CVPR 2010; “Nonlinear learning using local coordinate coding”, Kai Yu, Tong Zhang, Yihong Gong, NIPS 2009. Approach 2, coding via local subspaces (super-vector coding): “Image Classification using Super-Vector Coding of Local Image Descriptors”, Xi Zhou, Kai Yu, Tong Zhang, Thomas Huang, ECCV 2010; “Large-scale Image Classification: Fast Feature Extraction and SVM Training”, Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, LiangLiang Cao, Thomas Huang, CVPR 2011.
- 49. A function approximation framework to understand coding. Assumption: image patches x follow a nonlinear manifold, and f(x) is smooth on the manifold. Coding: a nonlinear mapping x → a; typically a is high-dimensional & sparse. Nonlinear learning: f(x) = <w, a>.
- 50. Local sparse coding, Approach 1: local coordinate coding
- 51. Function interpolation based on LCC [Yu, Zhang & Gong, NIPS 09]: locally linear interpolation between data points and bases.
- 52. Local coordinate coding (LCC): connecting coding to nonlinear function learning. If f(x) is (alpha, beta)-Lipschitz smooth, the function approximation error is bounded by a coding error term plus a locality term. The key message: a good coding scheme should 1. have a small coding error, and 2. also be sufficiently local.
- 53. Local coordinate coding (LCC) [Yu, Zhang & Gong, NIPS 09; Wang, Yang, Yu, Lv, Huang, CVPR 10]. Dictionary learning: k-means (or hierarchical k-means). Coding for x, to obtain its sparse representation a: Step 1, ensure locality: find the K nearest bases; Step 2, ensure low coding error on those bases.
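The two coding steps above can be sketched as follows. This is an assumed minimal version, using Euclidean nearest neighbors for step 1 and the analytic sum-to-one constrained least-squares solve from the LLC paper (with a small ridge for numerical stability) for step 2.

```python
import numpy as np

def llc_code(B, x, K=5, ridge=1e-4):
    # B: (k, d) dictionary rows (e.g. k-means centers); x: (d,) descriptor.
    # Step 1 (locality): keep only the K nearest bases.
    idx = np.argsort(np.linalg.norm(B - x, axis=1))[:K]
    # Step 2 (low coding error): constrained least squares on those bases.
    Z = B[idx] - x                                      # shift bases to x
    C = Z @ Z.T + ridge * np.trace(Z @ Z.T) * np.eye(K) # local covariance
    w = np.linalg.solve(C, np.ones(K))
    w /= w.sum()                                        # enforce sum-to-one
    a = np.zeros(B.shape[0])
    a[idx] = w                                          # sparse code over all bases
    return a

rng = np.random.default_rng(0)
B = rng.standard_normal((512, 128))   # a hypothetical 512-atom dictionary
x = rng.standard_normal(128)
a = llc_code(B, x)
print(np.count_nonzero(a))            # at most K nonzeros; weights sum to 1
```

Because only a K x K system is solved per descriptor, this is far cheaper than a full LASSO over all 512 bases, which is the computational advantage the slides mention.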
- 54. Local sparse coding, Approach 2: super-vector coding
- 55. Piecewise local linear (first-order) function approximation via super-vector coding [Zhou, Yu, Zhang, and Huang, ECCV 10]: local tangents at cluster centers approximate the function over the data points.
- 56. Super-vector coding: justification. If f(x) is beta-Lipschitz smooth, the function approximation error is bounded by a quantization error term plus a local tangent term.
- 57. Super-vector coding (SVC) [Zhou, Yu, Zhang, and Huang, ECCV 10]. Dictionary learning: k-means (or hierarchical k-means). Coding for x, to obtain its sparse representation a: Step 1, find the nearest basis of x and obtain its VQ coding, e.g. [0, 0, 1, 0, …]; Step 2, form the super-vector coding, e.g. [0, 0, 1, 0, …, 0, 0, (x − m3), 0, …], i.e. a zero-order (VQ) part plus a local tangent part.
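A sketch of these two SVC steps (the scaling constant `s` on the zero-order part is a hypothetical parameter here): hard-assign x to its nearest center m_j, then place `s` and the residual (x − m_j) in that center's block of the output vector.

```python
import numpy as np

def svc_code(centers, x, s=1.0):
    # centers: (k, d) k-means centers; output has dimension k * (1 + d).
    k, d = centers.shape
    j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # step 1: VQ
    code = np.zeros(k * (1 + d))
    block = j * (1 + d)
    code[block] = s                                   # zero-order (VQ indicator)
    code[block + 1: block + 1 + d] = x - centers[j]   # first-order local tangent
    return code

rng = np.random.default_rng(0)
centers = rng.standard_normal((16, 8))  # hypothetical 16 centers in 8-D
x = rng.standard_normal(8)
code = svc_code(centers, x)
print(code.shape)  # (144,) = 16 * (1 + 8)
```

The code is extremely sparse: only one block of 1 + d entries is nonzero, yet a linear classifier on it realizes a piecewise linear function of x, matching the approximation view on the previous slides.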
- 58. Results on the ImageNet Challenge dataset (1.4 million images, 1000 classes): VQ + intersection kernel: 40%; LCC + linear SVM: 62%; SVC + linear SVM: 65%
- 59. Summary: local sparse coding. Approach 1: local coordinate coding. Approach 2: super-vector coding. Sparsity is achieved by explicitly ensuring locality; sound theoretical justifications; much simpler to implement and compute; strong empirical success.
- 60. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding; 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 61. Hierarchical sparse coding [Yu, Lin & Lafferty, CVPR 11; Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 11]: learning from unlabeled data via sparse coding → pooling → sparse coding → pooling.
- 62. A two-layer sparse coding formulation [Yu, Lin & Lafferty, CVPR 11]:
  min over W, α of L(W, α) + λ ‖W‖₁ + γ ‖α‖₁, subject to α ≥ 0,
  where W = (w₁ w₂ ⋯ wₙ) ∈ R^{p×n}, α ∈ R^q,
  L(W, α) = (1/n) Σ_{i=1}^n ( ½ ‖xᵢ − B wᵢ‖² + (λ/2) wᵢᵀ Ω(α) wᵢ ),
  Ω(α) ≡ ( Σ_{k=1}^q αₖ Φₖ )⁻¹.
- 63. MNIST results: classification [Yu, Lin & Lafferty, CVPR 11]. HSC vs. CNN: HSC provides even better performance than CNN; more remarkably, HSC learns its features in an unsupervised manner!
- 64. MNIST results: learned dictionary [Yu, Lin & Lafferty, CVPR 11]. A hidden unit in the second layer is connected to a unit group in the 1st layer: invariance to translation, rotation, and deformation.
- 65. Caltech101 results: classification [Yu, Lin & Lafferty, CVPR 11]. The learned descriptor performs slightly better than SIFT + SC.
- 66. Adaptive deconvolutional networks for mid and high level feature learning [Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011]: hierarchical convolutional sparse coding over L1-L4 feature maps, trained with respect to the image from all layers (L1-L4); pooling both spatially and among features; learns invariant mid-level features.
- 67. Outline: 1. Sparse coding for image classification; 2. Understanding sparse coding; 3. Hierarchical sparse coding; 4. Other topics, e.g. structured models, scale-up, discriminative training; 5. Summary
- 68. Other topics in sparse coding. Structured sparse coding, for example: group sparse coding [Bengio et al, NIPS 09]; learning hierarchical dictionaries [Jenatton, Mairal et al, 2010]. Scaling up sparse coding, for example: the feature-sign algorithm [Lee et al, NIPS 07]; feed-forward approximation [Gregor & LeCun, ICML 10]; online dictionary learning [Mairal et al, ICML 2009]. Discriminative training, for example: backprop algorithms [Bradley & Bagnell, NIPS 08; Yang et al, CVPR 10]; supervised dictionary training [Mairal et al, NIPS 08].
- 69. Summary of sparse coding. Sparse coding is an effective way to do (unsupervised) feature learning, and a building block for deep models. Sparse coding and its local variants (LCC, SVC) have pushed the boundary of accuracies on Caltech101, PASCAL VOC, ImageNet, … Challenge: discriminative training is not straightforward.
