Deep Learning of Invariant Spatiotemporal Features from Video



                          Bo Chen, Jo-Anne Ting, Ben Marlin, Nando de Freitas
                                     University of British Columbia




Thursday, September 9, 2010
Introduction
                    •         Focus: Unsupervised feature extraction from video for low and
                              high-level vision tasks

                    •         Desirable Properties:

                          •     Invariant features (for selectivity, robustness to input
                                transformations)

                          •     Distributed representations (for memory & generalization
                                efficiency)

                          •     Generative properties for inference (de-noising,
                                reconstruction, prediction, analogy-making)

                          •     Good discriminative performance



Feature extraction from video

                    •         Spatiotemporal feature detectors & descriptors:

                          •     3D HMAX, 3D Hessian detector, cuboid detector, space-time
                                interest points

                          •     HOG/HOF, 3D SIFT descriptor

                          •     recursive sparse coding (Dean et al., 2009), factored RBMs +
                                sparse coding (Taylor et al., 2010), “factorized” sparse coding
                                (Cadieu & Olshausen, 2008), deconvolutional network (Zeiler
                                et al., 2010), recurrent temporal RBMs, gated RBMs, factored
                                RBMs, mcRBMs, ...




Overview
                    •         Background:

                          •     Restricted Boltzmann Machines (RBMs)
                          •     Convolutional RBMs
                    •         Proposed model:
                          •     Spatio-temporal Deep Belief Networks
                    •         Experiments:

                               •   Measuring Invariance
                               •   Recognizing actions (KTH dataset)
                               •   De-noising
                               •   Filling-in from sequence of “gazes”


Background: RBMs
[Figure: RBM filters learned on MNIST under different inductive principles (CD, SML, PL, RM), from "Inductive Principles for Restricted Boltzmann Machine Learning", and example sparse-coding bases learned from 14x14 natural image patches and 25 ms acoustic signals.]

                                Binary/continuous visible data

                     Hinton & Salakhutdinov (2006), Lee et al. (2008)
Background: Convolutional RBMs

[Diagram: a convolutional RBM. The image v is convolved (∗) with filter banks W^1, ..., W^g, ..., W^|W| to produce hidden groups h^1, ..., h^g, ..., h^|W|; each hidden group is max-pooled over blocks B_α to give pooling layers p^1, ..., p^g, ..., p^|W|.]

                              Desjardins & Bengio (2008), Lee, Grosse, Ranganath & Ng (2009),
                                             Norouzi, Ranjbar & Mori (2009)
Convolutional RBMs

                     Standard RBMs:

             P(h_j = 1 | v) = logistic( (W^j)ᵀ v )
             P(v_i = 1 | h) = logistic( (W_i)ᵀ h )


                 Convolutional RBMs:

          P(h^g, p^g | v) = ProbMaxPool( W^g ∗ v )
          P(v | h) = logistic( Σ_{g=1..|W|} W^g ∗ h^g )




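A minimal numeric sketch of the ProbMaxPool step above, in pure NumPy (the filter response is a valid cross-correlation; a single filter group and 2×2 pooling blocks B_α are illustrative choices, not the paper's settings):

```python
import numpy as np

def conv_rbm_hidden(v, W, pool=2):
    """Hidden and pooling probabilities for one filter group via
    probabilistic max-pooling (Lee et al., 2009): within each pooling
    block at most one hidden unit is on, and the pooling unit is on
    iff some hidden unit in its block is on."""
    k = W.shape[0]
    m = v.shape[0] - k + 1
    # filter responses e = W^g * v (valid cross-correlation)
    e = np.array([[(v[i:i + k, j:j + k] * W).sum() for j in range(m)]
                  for i in range(m)])
    h = np.empty_like(e)
    p = np.empty((m // pool, m // pool))
    for a in range(0, m, pool):
        for b in range(0, m, pool):
            blk = np.exp(e[a:a + pool, b:b + pool])
            Z = 1.0 + blk.sum()                  # +1 for the all-off state
            h[a:a + pool, b:b + pool] = blk / Z  # P(h_i = 1 | v)
            p[a // pool, b // pool] = blk.sum() / Z  # P(p_alpha = 1 | v)
    return h, p
```

Note that P(p_α = 1 | v) equals the sum of the hidden probabilities in its block, so the pooling unit acts as a softened max over the filter responses.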
Proposed model:
                                Spatiotemporal DBNs




                                     time                                time
             Spatial Pooling Layer              Temporal Pooling Layer


Training STDBNs


                    •         Greedy layer-wise pre-training (Hinton, Osindero & Teh, 2006)

                    •         Contrastive divergence for each layer (Carreira-Perpiñán &
                              Hinton, 2005)
                    •         Sparsity regularization (e.g. Olshausen & Field, 1996)




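The per-layer training step can be sketched as CD-1 for a plain binary RBM (a minimal sketch: the learning rate is illustrative, the sparsity term is omitted, and the real model uses convolutional weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, bh, bv, lr=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM,
    applied in place to the weights W and biases bh, bv."""
    ph0 = logistic(v0 @ W + bh)                       # positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hiddens
    pv1 = logistic(h0 @ W.T + bv)                     # one-step reconstruction
    ph1 = logistic(pv1 @ W + bh)                      # negative phase
    n = len(v0)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    bh += lr * (ph0 - ph1).mean(axis=0)
    bv += lr * (v0 - pv1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()                   # reconstruction error
```

Greedy layer-wise pre-training then repeats this per layer: train one layer, freeze it, and feed its hidden activations to the next layer as data.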
STDBN as a discriminative feature extractor:
                              Measuring invariance

          •   Invariance score (Goodfellow, Le, Saxe & Ng, 2009): for each video and hidden
              unit i, select a threshold such that i fires G(i) = 1% of the time. Select the
              40 stimuli that activate i the most (single frames for the spatial pooling
              layers, short sequences for the temporal pooling layers), extend each stimulus
              forward and backward in time by 8 frames, and let the local firing rate L(i)
              be i's average firing rate over the 16 frames. The invariance score is
              L(i)/0.01; a layer's score is the mean over all its max-pooled units.

          •   An invariant model's hidden representations should vary more slowly than a
              static model's, while its filters should remain more selective than purely
              spatial ones.

[Figure 3: Invariance scores for common transformations in natural videos (translation,
zooming, 2D rotation, 3D rotation), computed for layer 1 (S1) and layer 2 (S2) of a CDBN
and layer 2 (T1) of ST-DBN. Higher is better.]

          •   ST-DBN yields significantly more invariant representations than the CDBN
              (S2 vs. T1), with the greatest gain for 3D rotations, the most complicated
              transformation. While a 2-layer architecture already achieves greater
              invariance for zooming and rotations, the improvement offered by ST-DBN is
              more pronounced there; for translation, all architectures have built-in
              invariance, leading to similar scores. S1 serves as a baseline, since it is
              the first layer of both the CDBN and ST-DBN.

          •   Caveat: ST-DBN is trained on video sequences, whereas the CDBN is trained on
              static frames.
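The score above can be sketched per unit as follows (a minimal sketch over a single unit's real-valued activation trace; stimulus selection and video-boundary handling are simplified relative to the protocol above):

```python
import numpy as np

def invariance_score(firing, G=0.01, n_stimuli=40, half_window=8):
    """Invariance score of one hidden unit (Goodfellow et al., 2009):
    local firing rate around the unit's preferred stimuli, divided by
    its global firing rate G."""
    T = len(firing)
    thresh = np.quantile(firing, 1.0 - G)   # unit "fires" G of the time
    top = np.argsort(firing)[-n_stimuli:]   # most-activating time steps
    rates = []
    for t in top:
        lo, hi = max(0, t - half_window), min(T, t + half_window)
        rates.append((firing[lo:hi] > thresh).mean())  # local rate L(i)
    return float(np.mean(rates)) / G        # invariance score L(i) / G(i)
```

A layer's score is then the mean of this quantity over its max-pooled units; a score well above 1 means the unit keeps firing as its preferred stimulus is extended in time.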
STDBN as a discriminative feature extractor:
                            Recognizing actions




                                            KTH Dataset: 2391 videos (~1GB)
                     Schuldt, Laptev & Caputo (2004), Dollár, Rabaud, Cottrell & Belongie
                     (2005), Laptev, Marszalek, Schmid & Rozenfeld (2008), Liu & Shah (2008),
                     Dean, Corrado & Washington (2009), Wang & Li (2009), Taylor & Bregler
                     (2010), ...


Recognizing actions: Pipeline




       Input video
          → local and global contrast normalization  → input to STDBN
          → unsupervised feature extraction using STDBN  → STDBN responses
          → dimensionality standardization (resolution reduction)  → input to SVM
Recognizing actions: Model architecture

                    •         4-layers: 8x8 filters for spatial pooling layers, 6x1 filters for
                              temporal pooling layers, pooling ratio of 3




                  Experiments: number of parameters ~300k, training time ~5 days
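The layer geometry implied by these settings can be sketched as follows (a shape-only illustration: `max` stands in for the learned conv-RBM filtering plus probabilistic max-pooling, a single filter group is shown, and the input size is made up):

```python
import numpy as np

def spatial_layer(video, k=8, pool=3):
    """8x8 spatial filtering + pooling (ratio 3) on every frame
    (max is a stand-in for the learned filters / ProbMaxPool)."""
    T, H, W = video.shape
    Hp, Wp = (H - k + 1) // pool, (W - k + 1) // pool
    out = np.zeros((T, Hp, Wp))
    for t in range(T):
        for i in range(Hp):
            for j in range(Wp):
                a, b = i * pool, j * pool
                out[t, i, j] = video[t, a:a + k, b:b + k].max()
    return out

def temporal_layer(video, k=6, pool=3):
    """6x1 temporal filtering + pooling (ratio 3) along time,
    applied at each spatial location."""
    T, H, W = video.shape
    Tp = (T - k + 1) // pool
    out = np.zeros((Tp, H, W))
    for i in range(Tp):
        out[i] = video[i * pool:i * pool + k].max(axis=0)
    return out

v = np.random.rand(36, 80, 80)   # (time, height, width) -- illustrative
s1 = spatial_layer(v)            # layer 1: shrinks each frame
t1 = temporal_layer(s1)          # layer 2: shrinks the time axis
s2 = spatial_layer(t1)           # layer 3: spatial again, and so on
```

Alternating spatial and temporal layers like this keeps each convolution low-dimensional, in contrast to a joint 3-D space-time convolution (see the Discussion slide).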

                    Recognizing actions: Classification accuracy

          •   LOO protocol: classification results for ST-DBN with up to 4 layers, averaged
              over 4 trials, compared against three other methods (all of which used an SVM
              classifier). Additional (temporal and spatial) layers yield an improvement in
              performance, reaching a competitive accuracy of 91% at the 3rd layer.
              Interestingly, a 4th layer gives slightly worse accuracy, attributable to
              excessive temporal pooling of an already short video snippet (around 100
              frames long).

          •   The best reported result to date, 94.2% by Liu & Shah [21], used a
              bag-of-words approach on cuboid feature detectors and maximal mutual
              information to cluster the video-words. In contrast to their work, we use no
              feature detectors and take the entire video as input. Wang & Li [31] cropped
              the region of the video containing the relevant activity, clustered
              difference frames, and used a weighted sequence distance that accounted for
              the temporal order of features.

          •   Under the train/test protocol of [30], ST-DBN shows a similar pattern as the
              number of layers increases. Laptev et al. [32] use a 3D extension of the
              Harris operator, a combination of HoG and HoF (histograms of gradients and
              optic flow) descriptors, and a kernel SVM. Taylor et al. [25] use dense
              sampling of space-time cubes, along with a factored RBM and sparse coding to
              produce codewords that are then fed into an SVM classifier. Schuldt et al.
              [30] use the space-time interest points of [17] and an SVM classifier.

[Figure: confusion matrices for Layer 1, Layer 2, and Layer 3.]

          Table 1: Average classification accuracy results for the KTH actions dataset.

              LOO protocol                           Train/test protocol of [30]
              Method                Accuracy (%)     Method                Accuracy (%)
              4-layer ST-DBN        90.3  ± 0.83     4-layer ST-DBN        85.2
              3-layer ST-DBN        91.13 ± 0.85     3-layer ST-DBN        86.6
              2-layer ST-DBN        89.73 ± 0.18     2-layer ST-DBN        84.6
              1-layer ST-DBN        85.97 ± 0.94     1-layer ST-DBN        81.4
              Liu & Shah [21]       94.2             Laptev et al. [32]    91.8
              Wang & Li [31]        87.8             Taylor et al. [25]    89.1
              Dollár et al. [18]    81.2             Schuldt et al. [30]   71.7
STDBN as a generative model:
                                           De-noising

[Figure 5: Denoising results. (a) Clean video (test frame); (b) test frame corrupted with
noise, NMSE = 1; (c) spatially denoised video (reconstruction using 1-layer ST-DBN),
NMSE = 0.175; (d) spatiotemporally denoised video (reconstruction with 2-layer ST-DBN),
NMSE = 0.155.]

          •   Invariance comes at a cost when inferring missing parts of frames, but it is
              crucial for good discriminative performance; future research must address
              this fundamental trade-off.

          •   The filling-in results (next slides), though apparently simple, are quite
              remarkable: they represent an important step toward the design of attentional
              mechanisms for gaze planning. While gazing at the subject's head, the model
              is able to infer where the legs are. This coarse-resolution gist may be used
              to guide the placement of high-resolution detectors.
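The NMSE figures can be read under one common convention (an assumption — the slides do not give the formula): normalize the reconstruction error by the corruption energy, so the noisy input itself scores exactly 1:

```python
import numpy as np

def nmse(clean, estimate, noisy):
    """Normalized MSE of a denoised estimate: 0 means perfect
    reconstruction, 1 means no better than the noisy input."""
    return ((estimate - clean) ** 2).sum() / ((noisy - clean) ** 2).sum()
```

With this convention, panel (b) scores 1 by construction, and the 1-layer and 2-layer reconstructions score 0.175 and 0.155 respectively.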
STDBN as a generative model:
                              Filling-in from sequence of “gazes”

              Missing




            Prediction



            Truth (Fully
            Observed)




Discussion
                    •         Limitations:

                          •      computation
                          •      invariance vs. reconstruction trade-off
                          •      alternating space-time vs. joint space-time convolution
                    •         Extensions:
                          •      attentional mechanisms for gaze planning?

                          •      leveraging feature detection to reduce training data size





Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant features
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
 

Recently uploaded

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 

Recently uploaded (20)

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 

Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Features from Video

  • 4. Overview
    • Background:
      • Restricted Boltzmann Machines (RBMs)
      • Convolutional RBMs
    • Proposed model:
      • Spatio-temporal Deep Belief Networks (ST-DBNs)
    • Experiments:
      • Measuring invariance
      • Recognizing actions (KTH dataset)
      • De-noising
      • Filling-in from a sequence of “gazes”
  • 5. Background: RBMs
    • Binary/continuous visible data
    • Hinton & Salakhutdinov (2006), Lee et al. (2008)
    [Figure: learned RBM weights and visible biases on MNIST; example sparse-coding bases learned from 14x14 natural-image patches and from 25 ms acoustic signals]
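The RBM conditionals behind this background slide can be sketched in a few lines of numpy (a minimal binary RBM, not the authors' code; the sizes and seed are arbitrary). Both conditionals factorize, which is what makes block Gibbs sampling cheap:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# A minimal binary RBM: visible units v, hidden units h, weights W.
# P(h_j = 1 | v) = logistic(W[:, j] . v + b[j]) and symmetrically
# P(v_i = 1 | h) = logistic(W[i, :] . h + c[i]).
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
b = np.zeros(n_hidden)   # hidden biases
c = np.zeros(n_visible)  # visible biases

v = rng.integers(0, 2, size=n_visible).astype(float)
p_h = logistic(v @ W + b)            # P(h | v), factorizes over hidden units
h = (rng.random(n_hidden) < p_h) * 1.0
p_v = logistic(h @ W.T + c)          # P(v | h), factorizes over visible units
```

Alternating between the two conditionals gives one step of block Gibbs sampling.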
  • 6. Background: Convolutional RBMs
    [Figure: convolutional RBM architecture — an image v is convolved with filters W^1 … W^|W| to give hidden feature maps h^1 … h^|W|, which are max-pooled into maps p^1 … p^|W|]
    • Desjardins & Bengio (2008), Lee, Grosse, Ranganath & Ng (2009), Norouzi, Ranjbar & Mori (2009)
  • 7. Convolutional RBMs
    • Standard RBMs:
        P(h_j = 1 | v) = logistic(W_j^T v)
        P(v_i = 1 | h) = logistic(W_i^T h)
    • Convolutional RBMs:
        P(h^g, p^g | v) = ProbMaxPool(W^g ∗ v)
        P(v | h) = logistic( Σ_{g=1}^{|W|} W^g ∗ h^g )
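The ProbMaxPool step can be sketched as follows (a simplified single-filter sketch in the spirit of Lee et al., 2009, not the authors' implementation): within each non-overlapping pooling block, at most one detector unit is on, with a shared "all off" state, and probabilities are a softmax over the block plus that off state.

```python
import numpy as np

def prob_max_pool(energies, ratio=2):
    """Probabilistic max-pooling over non-overlapping ratio x ratio blocks
    of filter responses (the bottom-up energies W^g * v). Within a block,
    at most one detector unit is active; the pooling unit is on iff some
    detector in its block is on."""
    H, W = energies.shape
    h = np.zeros((H, W))                      # P(h_k = 1 | v)
    p = np.zeros((H // ratio, W // ratio))    # P(pooling unit = 1 | v)
    for i in range(0, H, ratio):
        for j in range(0, W, ratio):
            block = energies[i:i + ratio, j:j + ratio]
            m = block.max()                   # shift for numerical stability
            e = np.exp(block - m)
            denom = np.exp(-m) + e.sum()      # exp(-m) is the all-off state
            h[i:i + ratio, j:j + ratio] = e / denom
            p[i // ratio, j // ratio] = e.sum() / denom
    return h, p

energies = np.random.default_rng(1).normal(size=(4, 4))
h, p = prob_max_pool(energies, ratio=2)
```

By construction, the detector probabilities in each block sum to the block's pooling probability, and everything stays strictly between 0 and 1.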
  • 8. Proposed model: Spatiotemporal DBNs
  • 9–11. Proposed model: Spatiotemporal DBNs
    [Figure, built up across three slides: a spatial pooling layer applied to each frame over time, followed by a temporal pooling layer applied across time]
  • 12. Training ST-DBNs
    • Greedy layer-wise pre-training (Hinton, Osindero & Teh, 2006)
    • Contrastive divergence for each layer (Carreira-Perpiñán & Hinton, 2005)
    • Sparsity regularization (e.g. Olshausen & Field, 1996)
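A single contrastive-divergence (CD-1) weight update for one binary RBM layer can be sketched like this (a hedged sketch with biases omitted for brevity; the learning rate and sizes are arbitrary, and the actual training additionally uses convolution and sparsity terms not shown here):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, lr=0.05, rng=None):
    """One CD-1 update: positive statistics from the data v0,
    negative statistics from a single Gibbs step v0 -> h0 -> v1 -> h1."""
    rng = rng or np.random.default_rng(0)
    p_h0 = logistic(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # sample hidden states
    p_v1 = logistic(h0 @ W.T)                    # reconstruction
    p_h1 = logistic(p_v1 @ W)
    return W + lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(6, 4))
v0 = rng.integers(0, 2, size=6).astype(float)
W_new = cd1_step(W, v0)
```

Greedy layer-wise pre-training then repeats this per layer, feeding each trained layer's pooled activations to the next.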
  • 13. ST-DBN as a discriminative feature extractor: measuring invariance
    • Invariance score (Goodfellow, Le, Saxe & Ng, 2009): for each video and hidden unit i, select a threshold such that i fires G(i) = 1% of the time. Select the 40 stimuli that activate i the most (single frames for the spatial pooling layers, short sequences for the temporal pooling layers) and extend each stimulus forward and backward in time by 8 frames. The local firing rate L(i) is then i’s average firing rate over the 16 frames of stimuli, and the invariance score is L(i)/0.01. A network layer’s score is the mean score over all its max-pooled units.
    • Because ST-DBN max-pools over time, its hidden representations should vary more slowly than static models’, and its filters should also be more selective than purely spatial ones.
    • Figure 3: invariance scores for common transformations in natural videos (translation, zooming, 2D rotation, 3D rotation), computed for layer 1 (S1) and layer 2 (S2) of a CDBN and layer 2 (T1) of ST-DBN; higher is better. S1 serves as a baseline since it is the first layer of both models.
    • ST-DBN yields significantly more invariant representations than the CDBN (S2 vs. T1), with the greatest gains for rotations, the most complicated transformations. A 2-layer architecture appears to achieve greater invariance to zooming, but the improvement offered by ST-DBN is more pronounced. For translation, all architectures have built-in invariance, leading to similar scores. Note that ST-DBN is trained on video sequences, whereas the CDBN is trained on images.
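The invariance score described on this slide can be sketched for a single unit's per-frame responses (a simplification of the measure: one long response sequence rather than separate videos, and straightforward window clipping at the boundaries, both assumptions):

```python
import numpy as np

def invariance_score(responses, firing_fraction=0.01, n_stimuli=40, half_window=8):
    """Score = L(i) / G(i): pick a threshold so the unit fires on
    `firing_fraction` of frames, take the top-activating frames, extend
    each by +/- half_window frames, and measure the local firing rate
    L(i) over those windows. A high score means the unit keeps firing
    as the stimulus transforms over time."""
    thresh = np.quantile(responses, 1.0 - firing_fraction)
    top = np.argsort(responses)[-n_stimuli:]
    windows = []
    for t in top:
        lo, hi = max(0, t - half_window), min(len(responses), t + half_window)
        windows.append(responses[lo:hi] > thresh)
    local_rate = np.mean(np.concatenate(windows))
    return local_rate / firing_fraction

# A unit whose firing clusters in time (slowly varying response) scores high:
responses = np.zeros(10_000)
responses[100:200] = 1.0
score = invariance_score(responses)
```

The maximum possible score is 1 / firing_fraction (here 100), attained when the unit fires on every frame of every extended window.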
  • 14. ST-DBN as a discriminative feature extractor: recognizing actions
    • KTH dataset: 2391 videos (~1 GB)
    • Schuldt, Laptev & Caputo (2004); Dollar, Rabaud, Cottrell & Belongie (2005); Laptev, Marszalek, Schmid & Rozenfeld (2008); Liu & Shah (2008); Dean, Corrado & Washington (2009); Wang & Li (2009); Taylor & Bregler (2010); ...
  • 15. Recognizing actions: pipeline
    • Input video → dimensionality standardization (resolution reduction) → local and global contrast normalization → unsupervised feature extraction using ST-DBN → ST-DBN responses as input to an SVM
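The normalization steps in this pipeline might look like the following sketch (assumptions: a simple box-filter local mean with edge padding; the slide does not specify the authors' exact normalization, so the patch size and scheme here are illustrative):

```python
import numpy as np

def global_standardize(video, eps=1e-8):
    """Global contrast normalization: zero mean, unit variance over the video."""
    return (video - video.mean()) / (video.std() + eps)

def local_contrast_normalize(frame, patch=9):
    """Local contrast normalization (one frame): subtract each pixel's local
    mean, computed with a patch x patch box filter via a 2-D integral image
    (edge padding at the borders)."""
    half = patch // 2
    H, W = frame.shape
    padded = np.pad(frame, half, mode="edge")
    c = padded.cumsum(axis=0).cumsum(axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # leading zeros so window sums index cleanly
    sums = (c[patch:patch + H, patch:patch + W] - c[:H, patch:patch + W]
            - c[patch:patch + H, :W] + c[:H, :W])
    return frame - sums / (patch * patch)

frame = np.full((20, 20), 3.0)
out = local_contrast_normalize(frame)   # a constant frame maps to all zeros
```

The integral-image trick keeps the local mean O(1) per pixel, which matters when normalizing every frame of a long video.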
  • 16. Recognizing actions: model architecture
    • 4 layers: 8x8 filters for spatial pooling layers, 6x1 filters for temporal pooling layers, pooling ratio of 3
    • Experiments: ~300k parameters, training time ~5 days
  • 17. Recognizing actions: classification accuracy
    • Under the LOO protocol (results averaged over 4 trials; all compared methods use an SVM classifier), additional temporal and spatial layers improve performance, reaching a competitive 91% at the 3rd layer. Interestingly, a 4th layer gives slightly worse accuracy, attributable to excessive temporal pooling of an already short video snippet (~100 frames). The best reported result to date, 94.2% by Liu & Shah [21], uses a bag-of-words approach on cuboid feature detectors with maximal mutual information to cluster the video-words; in contrast, ST-DBN uses no feature detectors and takes the entire video as input. Wang & Li [31] crop the region of the video containing the relevant activity, cluster difference frames, and use a weighted sequence distance that accounts for the temporal order of features.
    • Under the train/test protocol of [30], ST-DBN shows a similar pattern as the number of layers increases. Laptev et al. [32] use a 3D extension of the Harris operator, a combination of HoG and HoF (histograms of gradients and optic flow) descriptors, and a kernel SVM. Taylor et al. [25] use dense sampling of space-time cubes, along with a factored RBM and sparse coding to produce codewords that are then fed into an SVM. Schuldt et al. [30] use the space-time interest points of [17] and an SVM.

    Table 1: Average classification accuracy for the KTH actions dataset.

        LOO protocol                          Train/test protocol of [30]
        Method               Accuracy (%)     Method               Accuracy (%)
        4-layer ST-DBN       90.3 ± 0.83      4-layer ST-DBN       85.2
        3-layer ST-DBN       91.13 ± 0.85     3-layer ST-DBN       86.6
        2-layer ST-DBN       89.73 ± 0.18     2-layer ST-DBN       84.6
        1-layer ST-DBN       85.97 ± 0.94     1-layer ST-DBN       81.4
        Liu & Shah [21]      94.2             Laptev et al. [32]   91.8
        Wang & Li [31]       87.8             Taylor et al. [25]   89.1
        Dollár et al. [18]   81.2             Schuldt et al. [30]  71.7
  • 18. ST-DBN as a generative model: de-noising
    [Figure 5, denoising results: (a) test frame from the clean video; (b) test frame corrupted with noise, NMSE = 1; (c) spatially denoised reconstruction using 1-layer ST-DBN, NMSE = 0.175; (d) spatiotemporally denoised reconstruction with 2-layer ST-DBN, NMSE = 0.155.]
    • While invariance comes at a cost when inferring missing parts of frames, it is crucial for good discriminative performance; future research must address this fundamental trade-off. The results, though apparently simple, are quite remarkable: they represent an important step toward the design of attentional mechanisms for gaze planning. While gazing at the subject’s head, the model is able to infer where the legs are, and this coarse-resolution gist may be used to guide the placement of high-resolution detectors.
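The NMSE figures quoted on this slide can be reproduced with a metric like the following (an assumed convention, since the slide does not define it: error normalized by the corrupted input's error, so the noisy video itself scores exactly 1 and smaller is better):

```python
import numpy as np

def nmse(clean, estimate, noisy):
    """Normalized MSE of a reconstruction, relative to doing nothing:
    1.0 means no better than the noisy input; smaller is better."""
    return np.sum((clean - estimate) ** 2) / np.sum((clean - noisy) ** 2)

rng = np.random.default_rng(0)
clean = rng.random((8, 16, 16))               # a tiny "video": 8 frames of 16x16
noisy = clean + rng.normal(0, 0.3, clean.shape)
halfway = 0.5 * (clean + noisy)               # a stand-in "denoised" estimate
baseline = nmse(clean, noisy, noisy)          # 1 by construction
improved = nmse(clean, halfway, noisy)        # 0.25: residual error halved
```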
  • 19–21. ST-DBN as a generative model: filling-in from a sequence of “gazes”
    [Figure, built up across three slides: the video with missing regions, the model’s prediction, and the ground truth (fully observed)]
  • 22. Discussion
    • Limitations:
      • computation
      • invariance vs. reconstruction trade-off
      • alternating space-time vs. joint space-time convolution
    • Extensions:
      • attentional mechanisms for gaze planning?
      • leveraging feature detection to reduce training data size