Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Features from Video

  1. Deep Learning of Invariant Spatiotemporal Features from Video. Bo Chen, Jo-Anne Ting, Ben Marlin, Nando de Freitas, University of British Columbia.
  2. Introduction
     • Focus: unsupervised feature extraction from video for low- and high-level vision tasks
     • Desirable properties:
       • Invariant features (for selectivity, robustness to input transformations)
       • Distributed representations (for memory and generalization efficiency)
       • Generative properties for inference (de-noising, reconstruction, prediction, analogy-making)
       • Good discriminative performance
  3. Feature extraction from video
     • Spatiotemporal feature detectors & descriptors:
       • 3D HMAX, 3D Hessian detector, cuboid detector, space-time interest points
       • HOG/HOF, 3D SIFT descriptor
       • recursive sparse coding (Dean et al., 2009), factored RBMs + sparse coding (Taylor et al., 2010), "factorized" sparse coding (Cadieu & Olshausen, 2008), deconvolutional networks (Zeiler et al., 2010), recurrent temporal RBMs, gated RBMs, factored RBMs, mcRBMs, ...
  4. Overview
     • Background:
       • Restricted Boltzmann Machines (RBMs)
       • Convolutional RBMs
     • Proposed model:
       • Spatiotemporal Deep Belief Networks
     • Experiments:
       • Measuring invariance
       • Recognizing actions (KTH dataset)
       • De-noising
       • Filling-in from a sequence of "gazes"
  5. Background: RBMs
     • [Figure: an RBM over binary/continuous visible data v; example learned bases from natural-image patches and short acoustic signals]
     • Hinton & Salakhutdinov (2006), Lee et al. (2008)
  6. Background: Convolutional RBMs
     • [Figure: an image v is convolved with filter groups W^1, ..., W^g, ..., W^|W| to give hidden feature maps h^1, ..., h^g, ..., h^|W|, which are max-pooled over blocks Bα into pooling maps p^1, ..., p^g, ..., p^|W|]
     • Desjardins & Bengio (2008); Lee, Grosse, Ranganath & Ng (2009); Norouzi, Ranjbar & Mori (2009)
  7. Convolutional RBMs
     • Standard RBMs:
       P(h_j = 1 | v) = logistic(W_j^T v)
       P(v_i = 1 | h) = logistic(W_i^T h)
     • Convolutional RBMs:
       P(h^g, p^g | v) = ProbMaxPool(W^g ∗ v)
       P(v | h) = logistic(Σ_{g=1}^{|W|} W^g ∗ h^g)
     • (A small sketch of these conditionals follows below.)
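Since the slide only states the conditionals, here is a minimal NumPy sketch of how they could be evaluated for binary units. This is illustrative only; the function and variable names (`prob_max_pool`, `filters`, `pool_size`) are our own, not the paper's, and biases inside the pooling step are omitted as on the slide.

```python
import numpy as np
from scipy.signal import correlate2d

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def standard_rbm_conditionals(v, W, b_h, b_v, rng=np.random):
    """Standard RBM conditionals for binary units: P(h_j=1|v) and P(v_i=1|h)."""
    p_h = logistic(W.T @ v + b_h)                    # P(h = 1 | v)
    h = (rng.rand(*p_h.shape) < p_h).astype(float)   # sample hidden states
    p_v = logistic(W @ h + b_v)                      # P(v = 1 | h)
    return p_h, p_v

def prob_max_pool(activations, pool_size):
    """Probabilistic max-pooling (Lee et al., 2009): within each non-overlapping
    pool_size x pool_size block B_alpha, at most one hidden unit turns on."""
    H, Wd = activations.shape
    p_h = np.zeros_like(activations)
    p_pool = np.zeros((H // pool_size, Wd // pool_size))
    for i in range(0, H - H % pool_size, pool_size):
        for j in range(0, Wd - Wd % pool_size, pool_size):
            block = activations[i:i + pool_size, j:j + pool_size]
            m = max(block.max(), 0.0)                # for numerical stability
            e = np.exp(block - m)
            off = np.exp(-m)                         # unnormalized "all-off" state
            Z = off + e.sum()
            p_h[i:i + pool_size, j:j + pool_size] = e / Z            # P(h_k = 1 | v)
            p_pool[i // pool_size, j // pool_size] = 1.0 - off / Z   # P(p = 1 | v)
    return p_h, p_pool

def conv_rbm_hidden(v, filters, pool_size):
    """Convolutional RBM: P(h^g, p^g | v) = ProbMaxPool(W^g * v) for each filter group g."""
    return [prob_max_pool(correlate2d(v, Wg, mode="valid"), pool_size)
            for Wg in filters]
```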
  8. Proposed model: Spatiotemporal DBNs
  9. Proposed model: Spatiotemporal DBNs
     • [Figure: a spatial pooling layer applied frame-by-frame across time]
  10. Proposed model: Spatiotemporal DBNs
      • [Figure: a temporal pooling layer stacked on the spatial pooling layer, pooling across frames over time]
  11. Proposed model: Spatiotemporal DBNs
      • [Figure: same spatial/temporal pooling architecture, shown as an animation build of the previous slide]
  12. Training STDBNs
      • Greedy layer-wise pre-training (Hinton, Osindero & Teh, 2006)
      • Contrastive divergence for each layer (Carreira-Perpiñán & Hinton, 2005); see the CD-1 sketch below
      • Sparsity regularization (e.g. Olshausen & Field, 1996)
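A minimal sketch of one contrastive-divergence (CD-1) update with a simple sparsity penalty, assuming a fully connected binary RBM layer. This is illustrative only; the hyperparameters (`lr`, `sparsity_target`, `sparsity_cost`) are assumptions, not the paper's settings.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_h, b_v, lr=0.01, sparsity_target=0.01, sparsity_cost=0.1):
    """One CD-1 step on a mini-batch v0 of shape (batch, n_visible)."""
    # Positive phase
    p_h0 = logistic(v0 @ W + b_h)
    h0 = (np.random.rand(*p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step
    p_v1 = logistic(h0 @ W.T + b_v)
    p_h1 = logistic(p_v1 @ W + b_h)
    # Gradient estimates (positive minus negative statistics)
    dW = (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    db_h = (p_h0 - p_h1).mean(axis=0)
    db_v = (v0 - p_v1).mean(axis=0)
    # Sparsity regularization: push mean hidden activation toward the target
    db_h += sparsity_cost * (sparsity_target - p_h0.mean(axis=0))
    return W + lr * dW, b_h + lr * db_h, b_v + lr * db_v
```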
  13. STDBN as a discriminative feature extractor: Measuring invariance
      • Invariance score (Goodfellow, Le, Saxe & Ng, 2009): for each video and hidden unit i, select a threshold such that i fires G(i) = 1% of the time. Select the 40 stimuli that activate i the most (single frames for the spatial pooling layers, short sequences for the temporal pooling layers) and extend each stimulus forward and backward in time by 8 frames. The local firing rate L(i) is i's average firing rate over the 16 frames of stimuli, and the invariance score is L(i)/0.01. A layer's score is the mean score over all its max-pooled units (see the sketch below).
      • Since ST-DBN max-pools over time, its hidden representations should vary more slowly than those of static models, and its filters should be more selective than purely spatial ones.
      • [Figure 3: invariance scores for translation, zooming, 2D rotation and 3D rotation in natural videos, computed for layer 1 (S1) and layer 2 (S2) of a CDBN and layer 2 (T1) of ST-DBN. Higher is better.]
      • ST-DBN yields significantly more invariant representations than the CDBN (S2 vs. T1), with the greatest gain for 3D rotations, the most complicated transformation. While a 2-layer architecture already achieves greater invariance for zooming and 2D rotations, the improvement offered by ST-DBN is more pronounced. For translation, all architectures have built-in invariance, leading to similar scores.
      • Note: ST-DBN is trained on video sequences, whereas the CDBN is trained on static frames.
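A sketch of how the invariance score above could be computed, assuming `firings` holds a unit's thresholded firing indicators on its extended top stimuli; the array layout and function names are our own, not the paper's.

```python
import numpy as np

def invariance_score(firings, global_rate=0.01):
    """firings: binary array of shape (n_top_stimuli, n_frames), e.g. (40, 16),
    recording whether unit i fires on each frame of its extended top stimuli
    (each top stimulus extended 8 frames forward and backward in time).
    Returns L(i) / G(i), the local-to-global firing-rate ratio."""
    local_rate = firings.mean()      # L(i)
    return local_rate / global_rate

def layer_invariance_score(per_unit_firings, global_rate=0.01):
    """Mean invariance score over all max-pooled units in a layer."""
    return float(np.mean([invariance_score(f, global_rate)
                          for f in per_unit_firings]))
```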
  14. STDBN as a discriminative feature extractor: Recognizing actions
      • KTH dataset: 2391 videos (~1 GB)
      • Prior work: Schuldt, Laptev & Caputo (2004); Dollár, Rabaud, Cottrell & Belongie (2005); Laptev, Marszalek, Schmid & Rozenfeld (2008); Liu & Shah (2008); Dean, Corrado & Washington (2009); Wang & Li (2009); Taylor & Bregler (2010); ...
  15. Recognizing actions: Pipeline
      • Input video → dimensionality standardization (resolution reduction) → local and global contrast normalization → unsupervised feature extraction using STDBN → STDBN responses → input to SVM
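A rough sketch of the normalization stages named in the pipeline, under our own interpretation; the window size and constants are assumptions, not the paper's values.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalize(frame, size=9, eps=1e-2):
    """Subtract a local mean and divide by a local standard deviation."""
    mean = uniform_filter(frame, size=size)
    centered = frame - mean
    std = np.sqrt(uniform_filter(centered ** 2, size=size))
    return centered / np.maximum(std, eps)

def global_contrast_normalize(frame, eps=1e-8):
    """Zero-mean, unit-variance normalization over the whole frame."""
    centered = frame - frame.mean()
    return centered / max(centered.std(), eps)

def preprocess_video(video):
    """video: float array of shape (T, H, W); returns normalized frames
    to be fed to the ST-DBN for unsupervised feature extraction."""
    return np.stack([global_contrast_normalize(local_contrast_normalize(f))
                     for f in video])
```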
  16. Recognizing actions: Model architecture
      • 4 layers: 8x8 filters for spatial pooling layers, 6x1 filters for temporal pooling layers, pooling ratio of 3
      • Experiments: ~300k parameters, training time ~5 days
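A hypothetical configuration listing of the 4-layer architecture described above, with spatial and temporal pooling layers alternating; the field names are our own schematic, not the authors' code.

```python
# Assumed layout: layers alternate spatial (8x8 filters) and temporal (6x1 filters),
# each with a pooling ratio of 3, as stated on the slide.
st_dbn_layers = [
    {"type": "spatial",  "filter_shape": (8, 8), "pooling_ratio": 3},
    {"type": "temporal", "filter_shape": (6, 1), "pooling_ratio": 3},
    {"type": "spatial",  "filter_shape": (8, 8), "pooling_ratio": 3},
    {"type": "temporal", "filter_shape": (6, 1), "pooling_ratio": 3},
]
```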
  17. Recognizing actions: Classification accuracy
      • Under the LOO protocol, results for ST-DBNs with up to 4 layers are averaged over 4 trials and compared to three other methods, all of which use an SVM classifier. Additional (temporal and spatial) layers improve performance, reaching a competitive 91% accuracy at the 3rd layer; a 4th layer gives slightly worse accuracy, attributable to excessive temporal pooling on an already short video snippet (around 100 frames long).
      • The best reported LOO result, 94.2% by Liu & Shah [21], uses a bag-of-words approach on cuboid feature detectors and maximal mutual information to cluster the video-words; in contrast, ST-DBN uses no feature detectors and takes the entire video as input. Wang & Li [31] crop the region of the video containing the relevant activity, cluster difference frames, and use a weighted sequence distance that accounts for the temporal order of features.
      • Under the train/test protocol of [30], ST-DBN shows a similar pattern as the number of layers increases. Laptev et al. [32] use a 3D extension of the Harris operator, combined HoG/HoF (histograms of gradients and optic flow) descriptors, and an SVM with a χ² kernel. Taylor et al. [25] use dense sampling of space-time cubes, with a factored RBM and sparse coding producing codewords that are fed into an SVM. Schuldt et al. [30] use the space-time interest points of [17] and an SVM classifier.
      • Table 1: Average classification accuracy (%) on the KTH actions dataset.

        LOO protocol                        | Train/test protocol of [30]
        ------------------------------------|------------------------------------
        4-layer ST-DBN     90.3 ± 0.83      | 4-layer ST-DBN       85.2
        3-layer ST-DBN     91.13 ± 0.85     | 3-layer ST-DBN       86.6
        2-layer ST-DBN     89.73 ± 0.18     | 2-layer ST-DBN       84.6
        1-layer ST-DBN     85.97 ± 0.94     | 1-layer ST-DBN       81.4
        Liu & Shah [21]    94.2             | Laptev et al. [32]   91.8
        Wang & Li [31]     87.8             | Taylor et al. [25]   89.1
        Dollár et al. [18] 81.2             | Schuldt et al. [30]  71.7
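A hedged sketch of the classification stage: top-layer ST-DBN responses pooled into a fixed-length vector per video and fed to an SVM. This is our own illustration; the exact pooling of responses and the SVM settings used in the paper are not specified here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def video_feature_vector(stdbn_responses):
    """Collapse top-layer ST-DBN responses of shape (T, H, W, C) into one
    descriptor per video by average-pooling over time and space."""
    return stdbn_responses.mean(axis=(0, 1, 2))

def train_action_classifier(train_responses, train_labels):
    """train_responses: list of per-video response arrays; train_labels: action labels."""
    X = np.stack([video_feature_vector(r) for r in train_responses])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(X, train_labels)
    return clf
```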
  18. STDBN as a generative model: De-noising
      • [Figure 5, denoising results: (a) clean test frame; (b) test frame corrupted with noise, NMSE = 1; (c) spatially denoised reconstruction with a 1-layer ST-DBN, NMSE = 0.175; (d) spatiotemporally denoised reconstruction with a 2-layer ST-DBN, NMSE = 0.155.]
      • While invariance comes at a cost when inferring missing parts of frames, it is crucial for good discriminative performance; future research must address this fundamental trade-off.
      • Though apparently simple, these results are an important step toward attentional mechanisms for gaze planning: while gazing at the subject's head, the model is able to infer where the legs are, and this coarse-resolution gist may be used to guide the placement of high-resolution detectors.
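A small sketch of the normalized mean-squared error (NMSE) metric quoted above, under our own assumption that the reconstruction error is normalized by the error of the noisy input (so the uncorrected noisy video scores 1.0); the paper's exact normalization may differ.

```python
import numpy as np

def nmse(reconstruction, clean, noisy):
    """Reconstruction MSE normalized by the MSE of the noisy input."""
    err = np.mean((reconstruction - clean) ** 2)
    baseline = np.mean((noisy - clean) ** 2)
    return err / baseline
```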
  19. STDBN as a generative model: Filling-in from a sequence of "gazes"
      • [Figure: video frames with missing (unobserved) regions outside the gazes]
  20. STDBN as a generative model: Filling-in from a sequence of "gazes"
      • [Figure: the missing regions alongside the model's prediction]
  21. STDBN as a generative model: Filling-in from a sequence of "gazes"
      • [Figure: the missing regions, the model's prediction, and the fully observed ground truth]
  22. Discussion
      • Limitations:
        • Computation
        • Invariance vs. reconstruction trade-off
        • Alternating space-time vs. joint space-time convolution
      • Extensions:
        • Attentional mechanisms for gaze planning?
        • Leveraging feature detection to reduce training data size