
Learning with Videos (D4L4 2017 UPC Deep Learning for Computer Vision)


https://telecombcn-dl.github.io/2017-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.



  1. [course site] Learning with Videos (Day 4 Lecture 4). Xavier Giro-i-Nieto (xavier.giro@upc.edu), Associate Professor, Universitat Politecnica de Catalunya (Technical University of Catalonia). #DLUPC
  2. Acknowledgments: Xunyu Lin, Junting Pan
  3. Unsupervised Feature Learning. Greff, Klaus, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. "Tagger: Deep unsupervised perceptual grouping." NIPS 2016 [video] [code]
  4. Unsupervised Feature Learning. Slide credit: Yann LeCun
  5. Unsupervised Feature Learning. Slide credit: Yann LeCun
  6. Unsupervised Feature Learning. Why unsupervised learning? ● It is how intelligent beings naturally perceive the world. ● It can save enormous labeling effort compared to building a human-like intelligent agent in a fully supervised fashion. ● It may be the source of the next breakthroughs on the path to true AI.
  7. Unsupervised Feature Learning: Autoencoder (AE). F. Van Veen, "The Neural Network Zoo" (2016). Further details: D2L6, "Unsupervised"
  8. Unsupervised Feature Learning: Variational Autoencoder (VAE). The latent code is forced to follow a Gaussian distribution. F. Van Veen, "The Neural Network Zoo" (2016); Kevin Frans, "Variational Autoencoders explained" (2016); Doersch, Carl. "Tutorial on variational autoencoders." (2016). Further details: D3L4, "Generative"
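As a rough illustration of the idea on this slide (not code from the lecture), the sketch below shows the reparameterization trick and the loss term that pushes the latent code toward a Gaussian; all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch: the encoder outputs the mean and log-variance of
    a Gaussian latent, and the KL term pushes that latent toward N(0, I)."""
    def __init__(self, x_dim=784, z_dim=32):  # dimensions are illustrative
        super().__init__()
        self.enc = nn.Linear(x_dim, 256)
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # KL divergence between N(mu, sigma^2) and the standard Gaussian prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```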
  9. Unsupervised Feature Learning: Generative Adversarial Network (GAN). F. Van Veen, "The Neural Network Zoo" (2016). Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." NIPS 2014; Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016). Further details: D3L4, "Generative"
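A minimal sketch of the adversarial game described in Goodfellow et al., assuming simple fully connected players; it is not the tutorial's code, and all shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical generator G and discriminator D; sizes are illustrative.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

def gan_step(real):                      # real: (batch, 784)
    z = torch.randn(real.size(0), 64)
    fake = G(z)
    # Discriminator step: push real samples toward 1, fakes toward 0
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    loss_d.backward(); opt_d.step()
    # Generator step: try to fool D (non-saturating loss)
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    loss_g.backward(); opt_g.step()
```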
  10. First steps in video feature learning. Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
  11. Frame Reconstruction & Prediction: unsupervised representation (feature) learning. Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015 [Github]
  12. Frame Reconstruction & Prediction: unsupervised feature learning (no labels) for frame prediction. Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015 [Github]
  13. Frame Reconstruction & Prediction: unsupervised feature learning (no labels) for frame prediction (continued).
  14. Frame Reconstruction & Prediction. Unsupervised learned features (lots of data) are fine-tuned for activity recognition (little data). Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015 [Github]
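A hedged sketch of the encoder-decoder LSTM idea from Srivastava et al. (their actual code is linked on the slide). The composite model in the paper also reconstructs the input clip, which is omitted here; all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class FramePredictorLSTM(nn.Module):
    """Encoder-decoder LSTM sketch: an encoder LSTM reads a sequence of
    (flattened) frames; a decoder LSTM unrolls from the final encoder
    state to predict the next frames."""
    def __init__(self, frame_dim=1024, hidden=2048, horizon=10):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, frame_dim)
        self.horizon = horizon

    def forward(self, frames):                 # frames: (B, T, frame_dim)
        _, state = self.encoder(frames)        # summary of the input clip
        inp = frames[:, -1:, :]                # start from the last seen frame
        preds = []
        for _ in range(self.horizon):
            out, state = self.decoder(inp, state)
            inp = self.readout(out)            # predicted next frame
            preds.append(inp)
        return torch.cat(preds, dim=1)         # (B, horizon, frame_dim)

# Training needs no labels, only a reconstruction loss on future frames:
# loss = F.mse_loss(model(past_frames), future_frames)
```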
  15. Frame Prediction. Ranzato, Marc'Aurelio, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
  16. Frame Prediction: video frame prediction with a ConvNet. Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code]
  17. Frame Prediction. The blurry predictions from MSE are improved with a multi-scale architecture, adversarial learning and an image gradient difference loss function. Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code]
  18. Frame Prediction. The blurry predictions from MSE are improved with a multi-scale architecture, adversarial training and an image gradient difference loss (GDL). Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code]
  19. Frame Prediction. Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code]
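The image gradient difference loss mentioned on slides 17-19 can be sketched as below; this follows the formula in Mathieu et al., but the loss weights and the `adversarial_loss` helper named in the comment are hypothetical.

```python
import torch
import torch.nn.functional as F

def gradient_difference_loss(pred, target, alpha=1):
    """Image gradient difference loss (GDL): penalize the difference
    between the spatial gradients of the predicted and target frames,
    which sharpens edges that plain MSE tends to blur.
    pred, target: (B, C, H, W)."""
    dy_pred = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    dx_pred = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    dy_tgt = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    dx_tgt = (target[:, :, :, 1:] - target[:, :, :, :-1]).abs()
    return ((dy_pred - dy_tgt).abs() ** alpha).mean() + \
           ((dx_pred - dx_tgt).abs() ** alpha).mean()

# Combined objective as described on the slides (weights are illustrative):
# loss = F.mse_loss(pred, target) \
#        + lambda_gdl * gradient_difference_loss(pred, target) \
#        + lambda_adv * adversarial_loss(pred)   # hypothetical helper
```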
  20. Frame Prediction. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  21. Frame Prediction. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  22. Frame Prediction: given an input image, probabilistic generation of future frames (VAE). Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks." NIPS 2016 [video]
  23. Frame Prediction: encodes the image as feature maps and the motion as convolutional kernels. Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks." NIPS 2016 [video]
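The "motion as convolutional kernels" idea can be sketched as a grouped convolution in which each sample's predicted kernels are applied to its own image feature maps. This illustrates the cross-convolution concept only; it is not the authors' implementation, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_convolution(feature_maps, kernels):
    """Apply per-sample, per-channel predicted kernels to the encoded
    image. feature_maps: (B, C, H, W); kernels: (B, C, k, k), with k odd."""
    B, C, H, W = feature_maps.shape
    k = kernels.size(-1)
    # Fold the batch into the channel axis and use a grouped convolution,
    # so each (sample, channel) pair is convolved with its own kernel.
    x = feature_maps.reshape(1, B * C, H, W)
    w = kernels.reshape(B * C, 1, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=B * C)
    return out.reshape(B, C, H, W)
```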
  24. Temporal Weak Labels. Assumption: adjacent video frames contain semantically similar information. An autoencoder is trained with slowness and sparsity regularizations. Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning of spatiotemporally coherent metrics." ICCV 2015.
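One possible reading of that regularized objective, as a sketch (the exact terms in Goroshin et al. differ; the weights and names here are assumptions):

```python
import torch
import torch.nn.functional as F

def slowness_sparsity_loss(z_t, z_t1, x_t, x_hat_t, alpha=0.5, beta=0.1):
    """Regularized autoencoder objective: reconstruct the frame, while
    encouraging codes of adjacent frames to be similar (slowness) and
    each code to be sparse. z_t, z_t1: codes of frames t and t+1."""
    recon = F.mse_loss(x_hat_t, x_t)
    slowness = F.mse_loss(z_t1, z_t)     # adjacent codes should change slowly
    sparsity = z_t.abs().mean()          # L1 penalty on the code
    return recon + alpha * slowness + beta * sparsity
```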
  25. Temporal Weak Labels. Slow feature analysis ● temporal coherence assumption: features should change slowly over time in video. Steady feature analysis ● second-order changes should also be small: changes in the past should resemble changes in the future. Train on triplets of frames from video; the loss encourages nearby frames to have slow and steady features, and far-apart frames to have different features. Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal coherence in video." CVPR 2016 [video]
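A sketch of the slow-and-steady objective on frame triplets; the contrastive term and the weights are simplified relative to the paper, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def slow_steady_loss(z_a, z_b, z_c, z_neg, margin=1.0):
    """Triplet (a, b, c) in temporal order, plus a distant negative frame.
    Slow: nearby codes stay close. Steady: the change from a->b should
    resemble the change from b->c. A hinge keeps far frames apart.
    All inputs: (B, D) feature batches."""
    slow = (z_b - z_a).pow(2).sum(dim=1).mean()
    steady = ((z_c - z_b) - (z_b - z_a)).pow(2).sum(dim=1).mean()
    d_neg = (z_neg - z_a).pow(2).sum(dim=1)
    contrast = F.relu(margin - d_neg).mean()   # push distant frames apart
    return slow + steady + contrast
```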
  26. Temporal Weak Labels. The temporal order of frames is exploited as the supervisory signal for learning. (Slides by Xunyu Lin.) Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016 [code]
  27. Temporal Weak Labels. Temporal order is taken as the supervisory signal for learning: shuffled sequences are fed to a binary classifier that decides whether the frames are in order or not in order. (Slides by Xunyu Lin.) Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016 [code]
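A minimal sketch of that order-verification setup (the official code is linked on the slide); the backbone, dimensions and tuple construction are assumptions.

```python
import torch
import torch.nn as nn

class OrderVerificationNet(nn.Module):
    """Shuffle-and-learn sketch: embed each frame of a sampled tuple with
    a shared backbone, concatenate the features, and classify whether
    the tuple is in temporal order (1) or shuffled (0)."""
    def __init__(self, backbone, feat_dim=512, tuple_len=3):
        super().__init__()
        self.backbone = backbone                     # e.g. a CNN, shared weights
        self.classifier = nn.Linear(feat_dim * tuple_len, 2)

    def forward(self, frames):                       # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (B*T, feat_dim)
        feats = feats.reshape(B, -1)                 # concatenate along time
        return self.classifier(feats)                # in order vs. shuffled

# Positives: (f1, f2, f3) in order; negatives: e.g. (f1, f3, f2) shuffled.
# Train with nn.CrossEntropyLoss; the backbone learns temporal structure.
```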
  28. Temporal Weak Labels. Train a network to detect which of several video sequences contains frames in the wrong order. Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video representation learning with odd-one-out networks." CVPR 2017
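The odd-one-out task can be sketched in the same spirit, now as an N-way classification over candidate clips; the encoder and all dimensions are again illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class OddOneOutNet(nn.Module):
    """Odd-one-out sketch: each candidate clip is embedded with a shared
    encoder, and a classifier predicts which of the N clips has its
    frames in the wrong order."""
    def __init__(self, clip_encoder, feat_dim=512, n_clips=6):
        super().__init__()
        self.encoder = clip_encoder                  # shared across clips
        self.classifier = nn.Linear(feat_dim * n_clips, n_clips)

    def forward(self, clips):                        # clips: (B, N, T, C, H, W)
        B, N = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1))    # (B*N, feat_dim)
        return self.classifier(feats.reshape(B, -1)) # index of the odd clip
```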
  29. Spatio-Temporal Weak Labels. Pathak, Deepak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. "Learning features by watching objects move." CVPR 2017
  30. Spatio-Temporal Weak Labels. X. Lin, Campos, V., Giró-i-Nieto, X., Torres, J., and Canton-Ferrer, C., "Disentangling Motion, Foreground and Background Features in Videos", CVPR 2017 Workshop Brave New Motion Representations. [Architecture diagram: C3D features are disentangled into foreground, background and motion; weight-sharing decoders (Fg Dec, Bg Dec) reconstruct the foreground in the first and last frames and the background in the first frame, supervised by a uNLC mask, with gradients blocked on some paths.]
  31. Spatio-Temporal Weak Labels. Greff, Klaus, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. "Tagger: Deep unsupervised perceptual grouping." NIPS 2016 [video] [code]
  32. Questions?
