P06: Motion and Video (CVPR 2012 Tutorial on Deep Learning Methods for Vision)


Transcript

  • 1. LEARNING REPRESENTATIONS OF SEQUENCES WITH APPLICATIONS TO MOTION CAPTURE AND VIDEO ANALYSIS. Graham Taylor, School of Engineering, University of Guelph. Papers and software available at: http://www.uoguelph.ca/~gwtaylor
  • 2-4. OVERVIEW: THIS TALK
    - Learning representations of temporal data: existing methods and the challenges they face; recent methods inspired by "deep learning"
    - Applications: in particular, modeling human pose and activity. Highly structured data (e.g. motion capture) and weakly structured data (e.g. video)
    [Figures: convolutional gated RBM architecture (input X, output Y, feature layer Z, pooling layer P); motion-capture components plotted over time]
  • 5. OUTLINE
    - Learning representations from sequences: existing methods, challenges
    - Composable, distributed-state models for sequences: Conditional Restricted Boltzmann Machines and their variants
    - Using learned representations to analyze video
    - A brief (and incomplete) survey of deep learning for activity recognition
  • 6-7. TIME SERIES DATA
    - Time is an integral part of many human behaviours (motion, reasoning)
    - In building statistical models, time is sometimes ignored, which is often problematic
    - Models that do incorporate dynamics fail to account for the fact that data is often high-dimensional, nonlinear, and contains long-range dependencies
    - Today we will discuss a number of models that have been developed to address these challenges
    [Figure: news-story intensity, 2000-2009. Graphic: David McCandless, informationisbeautiful.net]
  • 8. VECTOR AUTOREGRESSIVE MODELS
    v_t = b + \sum_{m=1}^{M} A_m v_{t-m} + e_t
    - Have dominated statistical time-series analysis for approximately 50 years
    - Can be fit easily by least-squares regression
    - Can fail even for simple nonlinearities present in the system, but many data sets can be modeled well by a linear system
    - Well understood; many extensions exist
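The slide notes that a VAR model can be fit by least squares. A minimal NumPy sketch of that idea (not from the slides; the function name, variable names and synthetic data are illustrative):

```python
import numpy as np

def fit_var(v, M):
    """Fit v_t = b + sum_m A_m v_{t-m} + e_t by least squares.

    v: (T, D) array of observations; M: model order.
    Returns bias b of shape (D,) and coefficients A of shape (M, D, D).
    """
    T, D = v.shape
    # Design matrix: each row is [1, v_{t-1}, ..., v_{t-M}]
    X = np.hstack([np.ones((T - M, 1))] +
                  [v[M - m:T - m] for m in range(1, M + 1)])
    Y = v[M:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares fit
    b = W[0]
    A = W[1:].reshape(M, D, D).transpose(0, 2, 1)
    return b, A

# Usage on synthetic data
rng = np.random.default_rng(0)
v = rng.standard_normal((500, 3)).cumsum(axis=0)
b, A = fit_var(v, M=2)
```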
  • 9. MARKOV ("N-GRAM") MODELS
    [Graphical model: fully observed chain v_{t-2} -> v_{t-1} -> v_t]
    - Fully observable
    - Sequential observations may have nonlinear dependence
    - Derived by assuming sequences have the Markov property: p(v_t | v_1^{t-1}) = p(v_t | v_{t-N}^{t-1})
    - This leads to the joint: p(v_1^T) = p(v_1^N) \prod_{t=N+1}^{T} p(v_t | v_{t-N}^{t-1})
    - Number of parameters is exponential in N!
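A minimal sketch of fitting such an order-N model by counting transitions, which makes the exponential growth in parameters explicit (illustrative; the function name and toy sequence are not from the slides):

```python
from collections import Counter, defaultdict

def fit_ngram(seq, N):
    """Estimate p(v_t | v_{t-N}, ..., v_{t-1}) by counting.

    With a vocabulary of size V the table has up to V**N contexts,
    each holding V probabilities, so parameters grow exponentially in N.
    """
    counts = defaultdict(Counter)
    for t in range(N, len(seq)):
        counts[tuple(seq[t - N:t])][seq[t]] += 1
    return {ctx: {v: n / sum(ctr.values()) for v, n in ctr.items()}
            for ctx, ctr in counts.items()}

# Usage: a binary sequence with an order-2 model (2**2 possible contexts)
seq = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
model = fit_ngram(seq, N=2)
```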
  • 10-12. HIDDEN MARKOV MODELS (HMM)
    [Graphical model: hidden chain h_{t-2} -> h_{t-1} -> h_t emitting v_{t-2}, v_{t-1}, v_t]
    - Introduces a hidden state that controls the dependence of the current observation on the past
    - Successful in speech, language modeling, biology
    - Defined by 3 sets of parameters: initial state parameters π; transition matrix A; emission distribution p(v_t | h_t)
    - Factored joint distribution: p({h_t}, {v_t}) = p(h_1) p(v_1 | h_1) \prod_{t=2}^{T} p(h_t | h_{t-1}) p(v_t | h_t)
  • 13. HMM INFERENCE AND LEARNING
    - Typically three tasks we want to perform with an HMM: likelihood estimation, inference, and learning
    - All are exact and tractable due to the simple structure of the model
    - Forward-backward algorithm for inference (belief propagation)
    - Baum-Welch algorithm for learning (EM)
    - Viterbi algorithm for state estimation (max-product)
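As a concrete reference for the forward pass of the forward-backward algorithm mentioned above, a minimal NumPy sketch (not part of the slides; array names are illustrative) that computes the data likelihood of a discrete-emission HMM:

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Scaled forward algorithm for a discrete-emission HMM.

    pi: (K,) initial state probabilities
    A:  (K, K) transition matrix, A[i, j] = p(h_t=j | h_{t-1}=i)
    B:  (K, V) emission matrix,   B[j, v] = p(v_t=v | h_t=j)
    obs: sequence of integer observations
    Returns log p(obs), using per-step normalization for stability.
    """
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for v in obs[1:]:
        c = alpha.sum()              # normalizer = p(current prefix step)
        log_lik += np.log(c)
        alpha = (alpha / c) @ A * B[:, v]
    log_lik += np.log(alpha.sum())
    return log_lik
```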
  • 14-22. LIMITATIONS OF HMMS
    - Many high-dimensional data sets contain rich componential structure
    - Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the entire history of the time series
    - To model K bits of information, they need 2^K hidden states
    - We seek models with distributed hidden state: capacity linear in the number of components
  • 23-25. LINEAR DYNAMICAL SYSTEMS
    - Graphical model is the same as the HMM but with real-valued state vectors
    - Characterized by linear-Gaussian dynamics and observations: p(h_t | h_{t-1}) = N(h_t; A h_{t-1}, Q), p(v_t | h_t) = N(v_t; C h_t, R)
    - Inference is performed using Kalman smoothing (belief propagation)
    - Learning can be done by EM
    - Dynamics and observations may also depend on an observed input (control)
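To make the linear-Gaussian generative process above concrete, a minimal sampling sketch (illustrative; not from the slides):

```python
import numpy as np

def sample_lds(A, Q, C, R, h0, T, rng=None):
    """Sample a trajectory from a linear dynamical system.

    h_t ~ N(A h_{t-1}, Q)   (latent linear-Gaussian dynamics)
    v_t ~ N(C h_t, R)       (linear-Gaussian observations)
    Returns latent states H (T, D_h) and observations V (T, D_v).
    """
    if rng is None:
        rng = np.random.default_rng()
    H, V = [], []
    h = h0
    for _ in range(T):
        h = rng.multivariate_normal(A @ h, Q)
        V.append(rng.multivariate_normal(C @ h, R))
        H.append(h)
    return np.array(H), np.array(V)
```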
  • 26. LATENT REPRESENTATIONS FOR REAL-WORLD DATA
    - Data for many real-world problems (e.g. motion capture, finance) is high-dimensional, containing complex non-linear relationships between components
    - Hidden Markov Models. Pro: complex, nonlinear emission model. Con: a single K-state multinomial represents the entire history
    - Linear Dynamical Systems. Pro: state can convey much more information. Con: emission model constrained to be linear
  • 27. LEARNING DISTRIBUTED REPRESENTATIONS
    - Simple networks are capable of discovering useful and interesting internal representations of static data
    - But the parallel nature of computation in connectionist models may be at odds with the serial nature of temporal events
    - Simple idea: a spatial representation of time. Needs a buffer, so it is not biologically plausible; cannot process inputs of differing length; cannot distinguish between absolute and relative position
    - This motivates an implicit representation of time in connectionist models, where time is represented by its effect on processing
  • 28-30. RECURRENT NEURAL NETWORKS
    [Graphical model: hidden units h_{t-1}, h_t, h_{t+1}, each receiving an input v and producing a prediction ŷ. Figure from Martens and Sutskever]
    - Neural network replicated in time
    - At each step, it receives an input vector, updates its internal representation via nonlinear activation functions, and makes a prediction:
      a_t = W^{hv} v_t + W^{hh} h_{t-1} + b^h,   h_{j,t} = e(a_{j,t}),   s_t = W^{yh} h_t + b^y,   ŷ_{k,t} = g(s_{k,t})
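A minimal NumPy sketch of the forward pass these equations describe (illustrative, not from the slides; tanh and softmax are assumed for the nonlinearities e and g, and the weight names follow the slide's superscript convention):

```python
import numpy as np

def rnn_forward(V, W_hv, W_hh, b_h, W_yh, b_y):
    """Run an RNN over a sequence V of shape (T, D_in).

    Returns hidden states (T, D_h) and predictions (T, D_out).
    """
    D_h = b_h.shape[0]
    h = np.zeros(D_h)
    H, Y = [], []
    for t in range(V.shape[0]):
        a = W_hv @ V[t] + W_hh @ h + b_h    # pre-activation
        h = np.tanh(a)                      # e(): elementwise nonlinearity
        s = W_yh @ h + b_y
        y = np.exp(s - s.max())
        y /= y.sum()                        # g(): softmax prediction
        H.append(h)
        Y.append(y)
    return np.array(H), np.array(Y)
```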
  • 31-36. TRAINING RECURRENT NEURAL NETWORKS
    - A possibly high-dimensional, distributed internal representation and nonlinear dynamics allow the model, in theory, to capture complex time series
    - Exact gradients can be computed via Backpropagation Through Time
    - It is an interesting and powerful model. What's the catch?
      - Training RNNs via gradient descent fails on simple problems
      - Attributed to "vanishing" or "exploding" gradients
      - Much work in the 1990s focused on identifying and addressing these issues; none of these methods were widely adopted
    - Best-known attempts to resolve the problem of RNN training: Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997); Echo-State Network (ESN) (Jaeger and Haas 2004)
    (Figure adapted from James Martens)
  • 37-39. FAILURE OF GRADIENT DESCENT
    Two hypotheses for why gradient descent fails for neural networks:
    - increased frequency and severity of bad local minima
    - pathological curvature, like the type seen in the Rosenbrock function: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
    (Figures from James Martens)
  • 40. SECOND ORDER METHODS
    - Model the objective function by the local quadratic approximation: f(θ + p) ≈ q_θ(p) ≡ f(θ) + ∇f(θ)^T p + (1/2) p^T B p, where p is the search direction and B is a matrix which quantifies curvature
    - In Newton's method, B is the Hessian matrix, H
    - By taking the curvature information into account, Newton's method "rescales" the gradient so it is a much more sensible direction to follow
    - Not feasible for high-dimensional problems!
    (Figure from James Martens)
  • 41-44. HESSIAN-FREE OPTIMIZATION
    Based on exploiting two simple ideas (and some additional "tricks"):
    - For an n-dimensional vector d, the Hessian-vector product Hd can easily be computed using finite differences at the cost of a single extra gradient evaluation. In practice, the R-operator (Pearlmutter 1994) is used instead of finite differences
    - There is a very effective algorithm for optimizing quadratic objectives which requires only Hessian-vector products: linear conjugate gradient (CG)
    - This method was shown to effectively train RNNs on the pathological long-term dependency problems they were previously unable to solve (Martens and Sutskever 2011)
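A minimal sketch of the finite-difference Hessian-vector product the slide refers to (illustrative; in practice the R-operator gives an exact product, and the quadratic objective here is only a toy check):

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, d, eps=1e-6):
    """Approximate H d with two gradient evaluations:

        H d ≈ (∇f(θ + ε d) − ∇f(θ)) / ε

    grad_fn: function returning the gradient of the objective at a point.
    """
    return (grad_fn(theta + eps * d) - grad_fn(theta)) / eps

# Toy check: quadratic f(θ) = 0.5 θᵀ B θ, whose Hessian is B
B = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda th: B @ th
theta = np.zeros(2)
d = np.array([1.0, -1.0])
print(hessian_vector_product(grad_fn, theta, d))  # ≈ B @ d
```

Linear CG then minimizes the local quadratic model using only such products, which is what makes the method "Hessian-free".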
  • 45-48. GENERATIVE MODELS WITH DISTRIBUTED STATE
    - Many sequences are high-dimensional and have complex structure: RNNs simply predict the expected value at the next time step and cannot capture the multi-modality of the time series
    - Generative models (like Restricted Boltzmann Machines) can express the negative log-likelihood of a given configuration of the output, and can capture complex distributions
    - By using binary latent (hidden) state, we gain the best of both worlds: the nonlinear dynamics and observation model of the HMM without the simple state, and the representationally powerful state of the LDS without the linear-Gaussian restriction on dynamics and observations
  • 49. DISTRIBUTED BINARY HIDDEN STATE
    - Using distributed binary representations for hidden state in directed models of time series makes inference difficult. But we can:
      - Use a Restricted Boltzmann Machine (RBM) for the interactions between hidden and visible variables. A factorial posterior makes inference and sampling easy. One typically uses binary logistic units for both visibles and hiddens
      - Treat the visible variables in the previous time slice as additional fixed inputs
    p(h_j = 1 | v) = σ(b_j + \sum_i v_i W_ij)
    p(v_i = 1 | h) = σ(b_i + \sum_j h_j W_ij)
    [Figure: hidden variables (factors) at time t and visible variables (observations) at time t]
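A minimal NumPy sketch of these factorial conditionals and one block-Gibbs step (illustrative; σ is the logistic function, and the variable names are not from the slides):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def rbm_gibbs_step(v, W, b_h, b_v, rng):
    """One block-Gibbs step in a binary RBM.

    p(h_j=1|v) = σ(b_h[j] + Σ_i v_i W[i, j])
    p(v_i=1|h) = σ(b_v[i] + Σ_j h_j W[i, j])
    """
    p_h = sigmoid(b_h + v @ W)            # factorial posterior over hiddens
    h = (rng.random(p_h.shape) < p_h) * 1.0
    p_v = sigmoid(b_v + h @ W.T)          # factorial conditional over visibles
    v_new = (rng.random(p_v.shape) < p_v) * 1.0
    return h, p_h, v_new, p_v
```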
  • 50-52. MODELING OBSERVATIONS WITH AN RBM
    - So the distributed binary latent (hidden) state of an RBM lets us: model complex, nonlinear dynamics; easily and exactly infer the latent binary state given the observations
    - But RBMs treat data as static (i.i.d.)
    [Figure: hidden variables (factors) at time t and visible variables (joint angles) at time t]
  • 53-58. CONDITIONAL RESTRICTED BOLTZMANN MACHINES (Taylor, Hinton and Roweis NIPS 2006, JMLR 2011)
    - Start with a Restricted Boltzmann Machine (RBM)
    - Add two types of directed connections: autoregressive connections model short-term, linear structure; the recent history can also influence dynamics through the hidden layer
    - Conditioning does not change inference or learning
    [Figure: hidden layer and visible layer, with directed connections from the recent history]
  • 59. CONTRASTIVE DIVERGENCE LEARNING
    - When updating visible and hidden units, we implement the directed connections by treating data from previous time steps as a dynamically changing bias
    - Inference and learning do not change
    [Figure: CD-1 update comparing <v_i h_j>_data (iteration 0) with <v_i h_j>_recon (iteration 1); the past visible units are held fixed]
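A minimal sketch of one CD-1 update for a CRBM with the history folded into dynamically changing biases, as described above (illustrative; the matrix names A and B for the autoregressive and history-to-hidden weights, binary visibles, and the mean-field reconstruction are assumptions, not the authors' exact implementation):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def crbm_cd1_update(v_t, v_hist, W, A, B, b_v, b_h, lr, rng):
    """One CD-1 step for a binary CRBM; parameters are updated in place.

    v_t: (D,) current frame, v_hist: (N*D,) concatenated past frames.
    W: (D, H) visible-hidden weights, A: (N*D, D) autoregressive weights,
    B: (N*D, H) history-to-hidden weights.
    """
    # History enters only through dynamically changing biases
    b_v_dyn = b_v + v_hist @ A
    b_h_dyn = b_h + v_hist @ B

    # Positive phase (data statistics)
    p_h0 = sigmoid(b_h_dyn + v_t @ W)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0
    # Negative phase: one Gibbs step (reconstruction statistics)
    p_v1 = sigmoid(b_v_dyn + h0 @ W.T)
    p_h1 = sigmoid(b_h_dyn + p_v1 @ W)

    # Approximate gradient: data minus reconstruction
    W += lr * (np.outer(v_t, p_h0) - np.outer(p_v1, p_h1))
    A += lr * np.outer(v_hist, v_t - p_v1)
    B += lr * np.outer(v_hist, p_h0 - p_h1)
    b_v += lr * (v_t - p_v1)
    b_h += lr * (p_h0 - p_h1)
```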
  • 60-64. STACKING: THE CONDITIONAL DEEP BELIEF NETWORK
    - Learn a CRBM
    - Now, treat the sequence of hidden units as "fully observed" data and train a second CRBM
    - The composition of CRBMs is a conditional deep belief net
    - It can be fine-tuned generatively or discriminatively
    [Figure: second-layer hiddens h^1_t stacked on first-layer hiddens h^0_{t-2}, h^0_{t-1}, h^0_t above the visible layer]
  • 65-66. MOTION SYNTHESIS WITH A 2-LAYER CDBN
    - Model is trained on ~8000 frames of 60fps data (49 dimensions)
    - 10 styles of walking: cat, chicken, dinosaur, drunk, gangly, graceful, normal, old-man, sexy and strong
    - 600 binary hidden units per layer
    - 1 hour training on a modern workstation
  • 67-68. MODELING CONTEXT
    - A single model was trained on 10 "styled" walks from CMU subject 137
    - The model can generate each style based on initialization
    - We cannot prevent nor control transitioning
    - How to blend styles?
    - Style or person labels can be provided as part of the input to the top layer
    [Figure: label units feeding the top-level CRBM of the conditional DBN]
  • 69. MULTIPLICATIVE INTERACTIONS
    - Let latent variables act like gates that dynamically change the connections between other variables
    - This amounts to letting variables multiply connections between other variables: three-way multiplicative interactions
    - Recently used in the context of learning the correspondence between images (Memisevic and Hinton 2007, 2010), but with a long history before that
    [Figure: three-way interaction among h_j, z_k and v_i]
  • 70. GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
    - Two views: Memisevic and Hinton (2007)
    [Figure: two equivalent views of the three-way model, with latent variables h_j gating the connections between the input (v_i) and output (z_k) units]
  • 71. INFERRING OPTICAL FLOW: IMAGE "ANALOGIES"
    - Toy images (Memisevic and Hinton 2006)
    - No structure in these images, only in how they change
    - Can infer optical flow from a pair of images and apply it to a random image
    [Figure 2 from the paper, columns left to right: input images; output images; inferred flowfields; random target images; inferred transformation applied to the target images. For the transformations (last column), gray values represent the probability that a pixel is 'on' according to the model, ranging from black for 0 to white for 1]
  • 72. BACK TO MOTION STYLE
    - Introduce a set of latent "context" variables whose value is known at training time
    - In our example, these represent "motion style" but could also represent height, weight, gender, etc.
    - The contextual variables gate every existing pairwise connection in our model
    [Figure: context units z_k gating the connections between v_i and h_j]
  • 73. LEARNING AND INFERENCE
    - Learning and inference remain almost the same as in the standard CRBM
    - We can think of the context or style variables as "blending in" a whole "sub-network"
    - This allows us to share parameters across styles but selectively adapt dynamics
  • 74-80. SUPERVISED MODELING OF STYLE (Taylor, Hinton and Roweis ICML 2009, JMLR 2011)
    [Figure, built up over several slides: an input layer (data at time t-1:t-N), an output layer (data at time t), a hidden layer j, and style features l that gate the connections]
  • 81. OVERPARAMETERIZATION
    - Note: the weight matrix W^{v,h} has been replaced by a tensor W^{v,h,z}! (Likewise for the other weights)
    - The number of parameters is O(N^3) per group of weights
    - More, if we want sparse, overcomplete hiddens
    - However, there is a simple yet powerful solution!
  • 82. FACTORING
    - Replace each full three-way tensor by a sum over factors: W^{vh}_{ijl} = \sum_f W^v_{if} W^h_{jf} W^z_{lf}
    [Figure adapted from Roland Memisevic: the hidden layer, style features and output layer are connected through a layer of factors f]
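A minimal NumPy sketch of this factorization, showing how three small matrices stand in for the full three-way tensor (illustrative; the layer sizes are borrowed from the synthesis slides and the variable names are assumptions):

```python
import numpy as np

Nv, Nh, Nz, F = 49, 600, 100, 200   # visibles, hiddens, style features, factors
rng = np.random.default_rng(0)

W_vf = rng.standard_normal((Nv, F)) * 0.01   # visible-to-factor
W_hf = rng.standard_normal((Nh, F)) * 0.01   # hidden-to-factor
W_zf = rng.standard_normal((Nz, F)) * 0.01   # style-to-factor

# Factored tensor: W[i, j, l] = sum_f W_vf[i, f] * W_hf[j, f] * W_zf[l, f]
W_full = np.einsum('if,jf,lf->ijl', W_vf, W_hf, W_zf)

# Parameter count: factored vs. unfactored
print((Nv + Nh + Nz) * F, "vs", Nv * Nh * Nz)
```

In practice the full tensor is never materialized; the factored form is used directly in the energy function, which is what keeps the parameter count manageable.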
  • 83-90. SUPERVISED MODELING OF STYLE (Taylor, Hinton and Roweis ICML 2009, JMLR 2011)
    [Figure, built up over several slides: the same input/output/hidden/style architecture as before, now with the three-way connections routed through factors]
  • 91. PARAMETER SHARING
    [Figure: how parameters are shared across styles]
  • 92-93. MOTION SYNTHESIS: FACTORED 3RD-ORDER CRBM
    - Same 10-styles dataset
    - 600 binary hidden units
    - 3×200 deterministic factors
    - 100 real-valued style features
    - 1 hour training on a modern workstation
    - Synthesis is real-time
  • 94. ACTIVITY RECOGNITION
    Four representative deep learning approaches:
    - 3D convolutional neural networks: Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu (2010)
    - Convolutional gated restricted Boltzmann machines: Graham Taylor, Rob Fergus, Yann LeCun, and Chris Bregler (2010)
    - Space-time deep belief networks: Bo Chen, Jo-Anne Ting, Ben Marlin, and Nando de Freitas (2010)
    - Stacked convolutional independent subspace analysis: Quoc Le, Will Zou, Serena Yeung, and Andrew Ng (2011)
    [Thumbnails of the architecture figures from each paper; the remaining text on this slide is excerpted from those papers]
  • 95. 3D CONVNETS FOR ACTIVITY RECOGNITION
  Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu (ICML 2010)
  • One approach: treat video frames as still images (LeCun et al. 2005)
  • Alternatively, perform 3D convolution so that discriminative features across space and time are captured
  Images from Ji et al. 2010, "3D Convolutional Neural Networks for Human Action Recognition":
  Figure 1. Comparison of 2D (a) and 3D (b) convolutions. In (b) the size of the convolution kernel in the temporal dimension is 3, and the sets of connections are color-coded so that the shared weights are in the same color. In 3D convolution, the same 3D kernel is applied to overlapping 3D cubes in the input video to extract motion features, thereby capturing motion information across contiguous frames.
  Figure 2. Extraction of multiple features from contiguous frames. Multiple 3D convolutions can be applied to contiguous frames to extract multiple features. As in Figure 1, the sets of connections are color-coded so that the shared weights are in the same color. Note that the 6 sets of connections do not share weights, resulting in two different feature maps on the right.
  Similar to the case of 2D convolution, multiple sets of lower-level feature maps can be obtained by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer (Figure 2). Based on the 3D convolution described above, a variety of CNN architectures can be devised; the following slides describe the 3D CNN architecture the authors developed for human action recognition on the TRECVID data set.
  18 May 2012 / 38 Learning Representations of Sequences / G Taylor
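  The contrast drawn above between frame-wise 2D filtering and true 3D convolution can be made concrete with a few lines of NumPy/SciPy. This is an illustrative sketch only, not the authors' implementation; the 7x7x3 kernel size matches the one quoted on the next slide, and the data are random.

    # Minimal sketch: 2D vs. 3D convolution on a video volume of shape
    # (frames, height, width), using SciPy's N-dimensional convolution.
    import numpy as np
    from scipy.signal import convolve

    video = np.random.randn(7, 60, 40)        # 7 frames of 60x40 pixels

    # 2D convolution: each frame is filtered independently; temporal structure
    # is ignored, so motion across frames is not captured.
    k2d = np.random.randn(7, 7)
    maps_2d = np.stack([convolve(frame, k2d, mode='valid') for frame in video])
    print(maps_2d.shape)                      # (7, 54, 34)

    # 3D convolution: the kernel also spans 3 contiguous frames, so each output
    # value depends on a local spatio-temporal cube and is motion-sensitive.
    k3d = np.random.randn(3, 7, 7)            # (temporal, height, width)
    maps_3d = convolve(video, k3d, mode='valid')
    print(maps_3d.shape)                      # (5, 54, 34)

  With the 3-frame temporal extent, every value in a 3D feature map depends on a local spatio-temporal cube, which is what lets the learned features discriminate across space and time rather than within single frames.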
  • 96. 3D CNN ARCHITECTURE
  Image from Ji et al. 2010, "3D Convolutional Neural Networks for Human Action Recognition".
  Pipeline: input 7@60x40 → hardwired → H1: 33@60x40 → 7x7x3 3D convolution → C2: 23*2@54x34 → 2x2 subsampling → S3: 23*2@27x17 → 7x6x3 3D convolution → C4: 13*6@21x12 → 3x3 subsampling → S5: 13*6@7x4 → 7x4 convolution / full connection → C6: 128@1x1 → action units.
  Slide annotations: the hardwired layer extracts 5 channels per frame (1) grayscale, 2) grad-x, 3) grad-y, 4) flow-x, 5) flow-y); 2 different 3D filters are applied to each of the 5 blocks independently; subsample; 3 different 3D filters are applied to each of the 5 channels in the 2 blocks; two fully-connected layers; action units.
  Figure 3. A 3D CNN architecture for human action recognition. This architecture consists of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 full connection layer. Detailed descriptions are given in the text.
  From the paper: We then apply 3D convolutions with a kernel size of 7 × 7 × 3 (7 × 7 in the spatial dimension and 3 in the temporal dimension) on each of the 5 channels separately. To increase the number of feature maps, two sets of different convolutions are applied at each location, resulting in 2 sets of feature maps in the C2 layer, each consisting of 23 feature maps. This layer contains 1,480 trainable parameters. In the subsequent subsampling layer S3, we apply 2 × 2 subsampling on each of the feature maps in the C2 layer, which leads to the same number of feature maps with reduced spatial resolution. The number of trainable parameters in this layer is 92. The next convolution layer C4 is obtained by … The 7 input frames have been converted into a 128D feature vector capturing the motion information in the input frames. The output layer consists of the same number of units as the number of actions, and each unit is fully connected to each of the 128 units in the C6 layer. In this design we essentially apply a linear classifier on the 128D feature vector for action classification. For an action recognition problem with 3 classes, the number of trainable parameters at the output layer is 384. The total number of trainable parameters in this 3D CNN model is 295,458, and all of them are initialized randomly and trained by an online error back-propagation algorithm as described in …
  18 May 2012 / 39 Learning Representations of Sequences / G Taylor
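  To make the quoted feature-map sizes easy to check, here is a small bookkeeping sketch (mine, not from the paper) that walks the spatial dimensions through the valid convolutions and subsamplings listed above.

    # Spatial-size arithmetic for the 60x40 input frames quoted on the slide.
    def conv_out(size, kernel):
        """Spatial size after a 'valid' convolution."""
        return size - kernel + 1

    def pool_out(size, pool):
        """Spatial size after non-overlapping subsampling."""
        return size // pool

    h, w = 60, 40                            # input frames
    h, w = conv_out(h, 7), conv_out(w, 7)    # C2: 54 x 34
    h, w = pool_out(h, 2), pool_out(w, 2)    # S3: 27 x 17
    h, w = conv_out(h, 7), conv_out(w, 6)    # C4: 21 x 12
    h, w = pool_out(h, 3), pool_out(w, 3)    # S5: 7 x 4
    print(h, w)                              # 7 4 -> a final 7x4 convolution yields 1x1 (C6)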
  • 97. 3D CONVNET: DISCUSSION
  • Good performance on TRECVID surveillance data (CellToEar, ObjectPut, Pointing)
  • Good performance on KTH actions (box, handwave, handclap, jog, run, walk)
  • Still a fair amount of engineering: person detection (TRECVID), foreground extraction (KTH), hard-coded first layer
  Image from Ji et al. 2010: Figure 4. Sample human detection and tracking results from camera numbers 1, 2, 3, and 5, respectively from left to right.
  From the paper: A one-against-all linear SVM is learned for each action class. Specifically, we extract dense SIFT descriptors (Lowe, 2004) from raw gray images or motion edge history images (MEHI) (Yang et al., 2009). Local features on raw gray images preserve the appearance information, while MEHI concerns the shape and motion patterns. These SIFT descriptors are calculated every 6 pixels from 7×7 and 16×16 local image patches in the same cubes as in the 3D CNN model. Then they are softly quantized using a 512-word codebook to build … The KTH actions are performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract foreground as in (Jhuang et al., 2007). To reduce the memory requirement, the resolutions of the input frames are reduced to 80 × 60 in our experiments as compared to 160 × 120 used in (Jhuang et al., 2007). We use a similar 3D CNN architecture as in Figure 3 with the sizes of kernels and the number of feature maps in each layer modified to consider the 80 × 60 × 9 inputs. In particular, the three convolutional layers …
  18 May 2012 / 40 Learning Representations of Sequences / G Taylor
  • 98. LEARNING FEATURES FOR VIDEO UNDERSTANDING
  • Most work on unsupervised feature extraction has concentrated on static images
  • We propose a model that extracts motion-sensitive features from pairs of images
  • Existing attempts (e.g. Memisevic & Hinton 2007, Cadieu & Olshausen 2009) ignore the pictorial structure of the input
  • Thus limited to modeling small image patches
  Figure: image pair → transformation feature maps
  18 May 2012 / 41 Learning Representations of Sequences / G Taylor
  • 99. GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
  Memisevic & Hinton (2007)
  Two views of the same model: latent variables h_j gate the interaction between input units v_i and output units z_k.
  18 May 2012 / 42 Learning Representations of Sequences / G Taylor
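  A minimal NumPy sketch of the three-way interaction at the heart of the gated RBM; the variable names and layer sizes are my own choices, and the full model also has visible biases and is trained with contrastive divergence rather than used feed-forward like this.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    n_in, n_out, n_hid = 64, 64, 32           # illustrative sizes
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((n_in, n_out, n_hid))   # W[i, k, j] links v_i, z_k, h_j
    b = np.zeros(n_hid)

    v = rng.standard_normal(n_in)             # input image (e.g. frame at time t)
    z = rng.standard_normal(n_out)            # output image (e.g. frame at time t+1)

    # Each latent h_j pools over all pixel pairs (v_i, z_k): it is driven by the
    # relationship between the two images, not by either image on its own.
    h_prob = sigmoid(np.einsum('i,k,ikj->j', v, z, W) + b)
    print(h_prob.shape)                       # (32,)

  Because each latent unit is driven by products of input and output pixels, it encodes the transformation relating the two images rather than the content of either one.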
  • 100. CONVOLUTIONAL GRBM
  Graham Taylor, Rob Fergus, Yann LeCun, and Chris Bregler (ECCV 2010)
  • Like the GRBM, captures third-order interactions
  • Shares weights at all locations in an image
  • As in a standard RBM, exact inference is efficient
  • Inference and reconstruction are performed through convolution operations
  Figure: input image X (N_x × N_x) and output image Y (N_y × N_y) are connected through N_w × N_w filters to feature maps z^k_{m,n} in the feature layer Z (N_z × N_z), which are pooled into maps P^k in the pooling layer (N_p × N_p).
  18 May 2012 / 43 Learning Representations of Sequences / G Taylor
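  The weight-sharing bullet can be pictured as applying one small three-way interaction of the kind sketched above at every image location. The toy loop below is my own illustration (the actual model evaluates this with convolution operations and adds the max-pooling layer shown in the figure); the point is that the parameter count no longer depends on image size.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    patch, n_maps = 5, 8                      # filter size and number of feature maps
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((patch * patch, patch * patch, n_maps))

    x_img = rng.standard_normal((16, 16))     # input image X
    y_img = rng.standard_normal((16, 16))     # output image Y

    out = 16 - patch + 1
    z = np.zeros((out, out, n_maps))          # feature layer Z
    for m in range(out):
        for n in range(out):
            xp = x_img[m:m + patch, n:n + patch].ravel()
            yp = y_img[m:m + patch, n:n + patch].ravel()
            z[m, n] = sigmoid(np.einsum('i,j,ijk->k', xp, yp, W))  # same W everywhere
    print(z.shape)                            # (12, 12, 8)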
  • 101. MORE COMPLEX EXAMPLE OF “ANALOGIES” (Taylor et al. ECCV 2010) Input Output 18 May 2012 / 44 Learning Representations of Sequences / G Taylor
  • 102. MORE COMPLEX EXAMPLE OF “ANALOGIES” (Taylor et al. ECCV 2010) Feature maps Input Output 18 May 2012 / 44 Learning Representations of Sequences / G Taylor
  • 103. MORE COMPLEX EXAMPLE OF “ANALOGIES” (Taylor et al. ECCV 2010) Feature maps Input Output 18 May 2012 / 44 Learning Representations of Sequences / G Taylor
  • 104. MORE COMPLEX EXAMPLE OF “ANALOGIES” (Taylor et al. ECCV 2010) Feature maps (a grid of “?” placeholders marks the quantity to be inferred) Input Output / Input Output 18 May 2012 / 44 Learning Representations of Sequences / G Taylor
  • 105. MORE COMPLEX EXAMPLE OF “ANALOGIES” (Taylor et al. ECCV 2010) Ground truth; Feature maps (a grid of “?” placeholders marks the quantity to be inferred); Input Output / Input Output; Novel input; Transformation (model) 18 May 2012 / 44 Learning Representations of Sequences / G Taylor
  • 106. HUMAN ACTIVITY: KTH ACTIONS DATASET
  • We learn 32 feature maps (z^k); 6 are shown here
  • KTH contains 25 subjects performing 6 actions under 4 conditions
  • Only preprocessing is local contrast normalization
  • Motion-sensitive features (1,3) • Edge features (4) • Segmentation operator (6)
  Figure: feature maps over time; hand clapping (above), walking (below).
  18 May 2012 / 45 Learning Representations of Sequences / G Taylor
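  Since local contrast normalization is the only preprocessing step mentioned, here is a minimal sketch of one common variant in NumPy/SciPy; the window size and epsilon are my own choices and not necessarily what was used for these experiments.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def local_contrast_normalize(frame, size=9, eps=1e-4):
        """Subtract a local mean and divide by a local standard deviation."""
        mean = uniform_filter(frame, size)
        centered = frame - mean
        var = uniform_filter(centered ** 2, size)
        return centered / np.maximum(np.sqrt(var), eps)

    frame = np.random.rand(120, 160)
    print(local_contrast_normalize(frame).shape)   # (120, 160)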
  • 107. ACTIVITY RECOGNITION: KTH
  Prior art (Acc. %): HOG3D+KM+SVM 85.3; HOG/HOF+KM+SVM 86.1; HOG+KM+SVM 79.0; HOF+KM+SVM 88.0
  Convolutional architectures (Acc. %): convGRBM+3D convnet+logistic reg. 88.9; convGRBM+3D convnet+MLP 90.0; 3D convnet+3D convnet+logistic reg. 79.4; 3D convnet+3D convnet+MLP 79.5
  • Compared to methods that do not use explicit interest point detection
  • State of the art: 92.1% (Laptev et al. 2008), 93.9% (Le et al. 2011)
  • Other reported result on 3D convnets uses a different evaluation scheme
  18 May 2012 / 46 Learning Representations of Sequences / G Taylor
  • 108. ACTIVITY RECOGNITION: HOLLYWOOD 2
  • 12 classes of human action extracted from 69 movies (20 hours)
  • Much more realistic and challenging than KTH (changing scenes, zoom, etc.)
  • Performance is evaluated by mean average precision over classes
  Prior art (Wang et al. survey 2009), average precision: HOG3D+KM+SVM 45.3; HOG/HOF+KM+SVM 47.4; HOG+KM+SVM 39.4; HOF+KM+SVM 45.5
  Our method: GRBM+SC+SVM 46.8
  18 May 2012 / 47 Learning Representations of Sequences / G Taylor
  • 109. SPACE-TIME DEEP BELIEF NETWORKS
  Bo Chen, Jo-Anne Ting, Ben Marlin, and Nando de Freitas (NIPS Deep Learning Workshop 2010)
  • Two previous approaches we saw used discriminative learning
  • We now look at a generative method, opening up more applications (e.g. in-painting, denoising)
  • Another key aspect of this work is demonstrated learned invariance
  • Basic module: the Convolutional Restricted Boltzmann Machine (Lee et al. 2009); see also Desjardins & Bengio (2008), Lee, Grosse, Ranganath & Ng (2009), Norouzi, Ranjbar & Mori (2009)
  Background figure (image from Chen et al. 2010): convolutional RBM, in which the image v is convolved with filters W^1 … W^|W| to give hidden groups h^1 … h^|W|, which are then max-pooled into p^1 … p^|W|.
  18 May 2012 / 48 Learning Representations of Sequences / G Taylor
  • 110. ST-DBN
  • Key idea: alternate layers of spatial and temporal Convolutional RBMs
  • Weight sharing across all CRBMs in a layer
  • Highly overcomplete: use sparsity on activations of max-pooling units
  Image from Chen et al. 2010: Figure 2. (a) Spatial pooling layer for an input video with n_Vt frames; each input frame is fed into a CRBM.
  From the paper: Training CRBMs using Monte Carlo methods requires sampling from both the conditional distribution of the hidden units given the visible units and the conditional distribution of the visible units given the hidden units. If we define the visible unit activations as A^v_c = d_c + Σ_{g=1}^{|W|} W^g_c ∗ h^g and the hidden unit activations for group g as A^{h,g} = b^g + W^g ∗ v (again using convolution), we …
  18 May 2012 / 49 Learning Representations of Sequences / G Taylor
  • 111. ST-DBN
  • Key idea: alternate layers of spatial and temporal Convolutional RBMs
  • Weight sharing across all CRBMs in a layer
  • Highly overcomplete: use sparsity on activations of max-pooling units
  Images from Chen et al. 2010: Figure 2. (a) Spatial pooling layer for an input video with n_Vt frames; each input frame is fed into a CRBM. (b) Temporal pooling layer; each pixel sequence is fed into a CRBM.
  From the paper: Training CRBMs using Monte Carlo methods requires sampling from both the conditional distribution of the hidden units given the visible units and the conditional distribution of the visible units given the hidden units. If we define the visible unit activations as A^v_c = d_c + Σ_{g=1}^{|W|} W^g_c ∗ h^g and the hidden unit activations for group g as A^{h,g} = b^g + W^g ∗ v (again using convolution), we …
  18 May 2012 / 49 Learning Representations of Sequences / G Taylor
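  To make the spatial/temporal alternation concrete, here is a toy data-flow sketch. The two stages below are stand-ins of my own (plain max pooling) for the convolutional RBMs that the ST-DBN actually trains at each stage; only the shapes and the order of operations are the point.

    import numpy as np

    def spatial_stage(video, pool=2):
        """Pool each frame over space: (T, H, W) -> (T, H//pool, W//pool)."""
        t, h, w = video.shape
        return video.reshape(t, h // pool, pool, w // pool, pool).max(axis=(2, 4))

    def temporal_stage(video, pool=2):
        """Pool each pixel's sequence over time: (T, H, W) -> (T//pool, H, W)."""
        t, h, w = video.shape
        return video.reshape(t // pool, pool, h, w).max(axis=1)

    video = np.random.rand(16, 64, 48)
    out = temporal_stage(spatial_stage(video))   # alternate spatial, then temporal
    print(out.shape)                             # (8, 32, 24)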
  • 112. MEASURING INVARIANCE
  • Measure invariance at each layer for various transformations of the input
  • Use the measure proposed by Goodfellow et al. (2009)
  Figure (images from Chen et al. 2010): firing rate of unit i vs. degree of transformation, with regimes labelled invariant, overly selective, and not selective; (a) invariance scores for translation, zooming, 2D rotation and 3D rotation, computed for Spatial Pooling Layer 1 (S1), Spatial Pooling Layer 2 (S2) and Temporal Pooling Layer 1 (T1); higher is better. Figure 3 also shows layer-2 ST-DBN filters on KTH, with time going from left to right for each row.
  Invariance measure: for a hidden unit i, the score balances its global firing rate G(i) with its local firing rate L(i); the measure is S(i) = L(i)/G(i).
  18 May 2012 / 50 Learning Representations of Sequences / G Taylor
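  A hedged sketch of that score: G(i) is the unit's firing rate over a large set of inputs and L(i) its firing rate over transformed versions of stimuli it responds to, so S(i) is large for units that are both selective and invariant. The thresholding and the way the two stimulus sets are built below are simplified stand-ins, not the exact protocol of Goodfellow et al.

    import numpy as np

    def invariance_score(global_acts, local_acts, threshold):
        """global_acts: activations over many random stimuli.
        local_acts: activations over transformed versions of preferred stimuli."""
        G = np.mean(global_acts > threshold)   # global firing rate
        L = np.mean(local_acts > threshold)    # local firing rate along transformations
        return L / max(G, 1e-8)

    rng = np.random.default_rng(0)
    g = rng.random(10000)                      # fires rarely on random inputs
    l = 0.5 + 0.5 * rng.random(200)            # fires often on transformed favourites
    print(invariance_score(g, l, threshold=0.9))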
  • 113. DENOISING AND RECONSTRUCTION
  • Operations not possible with a discriminative approach
  Images from Chen et al. 2010:
  Figure 4. De-noising results: (a) test frame; (b) test frame corrupted with noise; (c) reconstruction using 1-layer ST-DBN; (d) reconstruction with 2-layer ST-DBN.
  Figure 5. Top video shows an observed sequence of gazes/foci of attention (i.e., frames 2-5). Bottom video shows reconstructions within the gaze windows and predictions outside them.
  From the paper: The 2-layer ST-DBN (with an additional temporal pooling layer) gives slightly better background de-noising. The normalized MSEs of 1-layer and 2-layer reconstructions are 0.1751 and 0.155, respectively. For reference, the normalized MSE between the clean and noisy video has value 1. Note that the de-noising effects are more visible over time (compared to single-frame results shown below) and can be easily observed in video format.
  18 May 2012 / 51 Learning Representations of Sequences / G Taylor
  • 114. STACKED CONVOLUTIONAL INDEPENDENT SUBSPACE ANALYSIS (ISA)
  Quoc Le, Will Zou, Serena Yeung, and Andrew Ng (CVPR 2011)
  • Use of ISA (right) as a basic module
  • Learns features robust to local translation; selective to frequency, rotation and velocity
  • Key idea: scale up ISA by applying convolution and stacking
  Images from Le et al. 2011:
  Figure 1. The neural network architecture of an ISA network. The red bubbles are the pooling units whereas the green bubbles are the simple units. In this picture, the size of the subspace is 2: each red pooling unit looks at 2 simple units.
  From the paper: Layer 1 units are learned by solving
      minimize_W  Σ_{t=1}^{T} Σ_{i=1}^{m} p_i(x^t; W, V),   subject to  W W^T = I,   (1)
  where {x^t}_{t=1}^{T} are whitened input examples. Here, W ∈ R^{k×n} is the weights connecting the input data to the simple units, V ∈ R^{m×k} is the weights connecting the simple units to the pooling units (V is typically fixed); n, k, m are the input dimension, number of simple units and pooling units, respectively. The orthonormal constraint is to ensure the features are diverse.
  In Figure 2, we show three pairs of filters learned from natural images. As can be seen from this figure, the ISA algorithm is able to learn Gabor filters ("edge detectors") with many frequencies and orientations. Further, it is also able to group similar features in a group, thereby achieving invariances. (Figure 2. Typical filters learned by the ISA algorithm when trained on static images; three groups of bases produced by W are visualized, each group a subspace that is pooled together, organized in pools under the red units above.)
  One property of the learned ISA pooling units is that they are invariant and thus suitable for recognition tasks. To illustrate this, we train the ISA algorithm on natural static images and then test its invariance properties using the tuning curve test [10]. In detail, we find the optimal stimulus of a particular neuron p_i in the network by fitting a parametric Gabor function to the filter. We then vary its three degrees of freedom: translation (phase), rotation and frequency, and plot the activations of the neurons in the network with respect to the variation. Figure 3 shows results of the tuning curve test for a randomly selected neuron in the network. (Figure 3. Tuning curves of pooling units trained on static images under changes in translation, frequency and rotation; the curves show that pooling units in an ISA network are robust to translation and selective to frequency and rotation. In many experiments, this property makes ISA perform better than methods such as ICA and …)
  3.2. Stacked convolutional ISA: the standard ISA training algorithm becomes inefficient when input patches are large, because the symmetric orthogonalization used to enforce the orthonormal constraint must be applied at every step of projected gradient descent and its cost grows as a cubic function of the input dimension; training directly on high-dimensional data, especially video, is therefore slow. To scale up, the authors design a convolutional architecture that progressively makes use of PCA and ISA as sub-units for unsupervised learning: layer 1 is trained to convergence on small patches, the learned network is then convolved with larger regions of the input image, and PCA whitens and reduces the combined responses of the convolution step before they are passed to the next ISA layer, so that the next layer only works with low-dimensional inputs. Using this idea, training takes on the order of 1-2 hours.
  18 May 2012 / 52 Learning Representations of Sequences / G Taylor
  • 115. SCALING UP: CONVOLUTION AND STACKING
  Simple example: 1D data
  • The network is built by “copying” the learned network and “pasting” it to different parts of the input data
  • Outputs are then treated as the inputs to a new ISA network
  • PCA is used to reduce dimensionality
  Image from Le et al. 2011: Figure 4. Stacked Convolutional ISA network. The network is built by “copying” the learned network and “pasting” it to different places of the input data and then treating the outputs as inputs to a new ISA network. For clarity, the convolution step is shown here non-overlapping, but in the experiments the convolution is done with overlapping.
  From the paper: … a sequence of image patches is flattened into a vector, and this vector becomes the input features to the network above.
  18 May 2012 / 53 Learning Representations of Sequences / G Taylor
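  The copy-paste-then-reduce recipe can be sketched in a few lines. Everything below (the stand-in layer-1 feature function, the block size and stride, and the number of PCA components) is an illustrative choice of mine rather than the paper's configuration.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((50, 16 * 16))     # stand-in "learned" layer-1 weights

    def layer1(block):
        """Stand-in for the learned layer-1 ISA feature extractor."""
        return np.abs(W1 @ block)

    def convolve_and_stack(patch, sub=16, stride=8):
        """Apply layer1 to every overlapping sub x sub block and concatenate."""
        feats = [layer1(patch[r:r + sub, c:c + sub].ravel())
                 for r in range(0, patch.shape[0] - sub + 1, stride)
                 for c in range(0, patch.shape[1] - sub + 1, stride)]
        return np.concatenate(feats)

    X = np.stack([convolve_and_stack(rng.standard_normal((32, 32))) for _ in range(100)])

    # PCA (via SVD) reduces the stacked responses before the next ISA layer.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_reduced = Xc @ Vt[:60].T
    print(X.shape, X_reduced.shape)             # (100, 450) (100, 60)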
  • 116. LEARNING SPATIO-TEMPORAL FEATURES
  • Inputs to the network are blocks of video
  • Each block is vectorized and processed by ISA
  • Features from Layer 1 and Layer 2 are combined prior to classification
  From the paper: To learn high-level concepts, we can use the convolution and stacking techniques (see Section 3.2), which result in an architecture as shown in Figure 5. (Figure 5. Stacked convolutional ISA for video data. In this figure, convolution is done with overlapping; the ISA network in the second layer is trained on the combined activations of the first layer.) Finally, in our experiments, we combine features from both layers and use them as local features for classification (previously suggested in [22]). In the experiment section, we will show that this combination works better than using one set of features alone.
  18 May 2012 / 54 Learning Representations of Sequences / G Taylor
  • 117. VELOCITY AND ORIENTATION SELECTIVITY
  • Velocity tuning curves for five neurons in an ISA network trained on Hollywood2 data
  • Edge velocities (radius) and orientations (angle) to which filters give maximum response; outermost velocity: 4 pixels per frame
  Images from Le et al. 2011:
  Also shown: three ISA features learned from Hollywood2 (16 spatial size); each row consists of two sets of filters, each set a filter in 3D (i.e., a row in matrix W), with two sets grouped together to form an ISA feature.
  Figure 7. Velocity tuning curves of five neurons in an ISA network trained on Hollywood2. Most of the tuning curves are unimodal, which means that ISA temporal bases can be used as velocity detectors.
  Figure 8. A polar plot of edge velocities (radius) and orientations (angle) to which filters give maximum response. Each red dot in the figure represents a pair of (velocity, orientation) for a temporal filter learned from Hollywood2. The outermost circle has velocity of 4 pixels per frame.
  Figure 9. Visualization of five typical optimal stimuli of a higher layer learned from Hollywood2 data (visualized at size 24x24x18, built on first-layer filters).
  From the paper: In detail, we fit Gabor functions to all temporal bases to estimate the velocity of the bases. We then vary this velocity and plot the response of the features with respect to the changes. In Figure 7, we visualize this property by plotting the velocity tuning curves of five randomly-selected units in the first layer of the network. As can be seen from the figure, the neurons are highly sensitive to changes in the velocity of the stimuli. This suggests that the features can be used as velocity detectors, which are valuable for detecting actions in movies. For example, the “Running” category in Hollywood2 has fast motions whereas the “Eating” category has … Filters also respond more strongly to some edge directions than to other directions.
  18 May 2012 / 55 Learning Representations of Sequences / G Taylor
  • 118. SUMMARY 18 May 2012 / 56 Learning Representations of Sequences / G Taylor
  • 119. SUMMARY • Learning distributed representations of sequences 18 May 2012 / 56 Learning Representations of Sequences / G Taylor
  • 120. SUMMARY • Learning distributed representations of sequences • For high-dimensional, multi-modal data: CRBM, FCRBM 18 May 2012 / 56 Learning Representations of Sequences / G Taylor
  • 121. SUMMARY • Learning distributed representations of sequences • For high-dimensional, multi-modal data: CRBM, FCRBM • Activity recognition: 4 methods 18 May 2012 / 56 Learning Representations of Sequences / G Taylor
  • 122. ACKNOWLEDGEMENTS • Faculty at U Toronto: Geoff Hinton, Sam Roweis • Faculty at NYU: Chris Bregler, Rob Fergus, Yann LeCun • Students and researchers at U Toronto, NYU • Funding: CIFAR, DARPA, ONR, Google 18 May 2012 / 57 Learning Representations of Sequences / G Taylor
