
Deep Learning from Videos (UPC 2018)

https://mcv-m6-video.github.io/deepvideo-2018/

Overview of deep learning solutions for video processing. Part of a series of slides covering topics like action recognition, action detection, object tracking, object detection, scene segmentation, language and learning from videos.

Prepared for the Master in Computer Vision Barcelona:
http://pagines.uab.cat/mcv/


Deep Learning from Videos (UPC 2018)

  1. 1. @DocXavi Module 6 Deep Learning from Videos 13th March 2018 Xavier Giró-i-Nieto [http://pagines.uab.cat/mcv/]
  2. 2. ● MSc course (2017) ● BSc course (2018) 2 Deep Learning online courses by UPC: ● 1st edition (2016) ● 2nd edition (2017) ● 3rd edition (2018) ● 1st edition (2017) ● 2nd edition (2018) Next edition Autumn 2018. Next edition Winter/Spring 2019. Summer School (late June 2018)
  3. 3. 3 Acknowledgements (Unsupervised) Kevin McGuinness, “Unsupervised Learning” Deep Learning for Computer Vision. [Slides 2016] [Slides 2017]
  4. 4. Acknowledgements (feature learning) 4 Víctor Campos Junting Pan Xunyu Lin
  5. 5. 5 Densely linked slides
  6. 6. 6 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning
  7. 7. Unsupervised Learning 7 Slide credit: Yann LeCun
  8. 8. 8
  9. 9. 9 Greff, Klaus, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. "Tagger: Deep unsupervised perceptual grouping." NIPS 2016 [video] [code]
  10. 10. Unsupervised Learning 10 Why Unsupervised Learning? ● It is the nature of how intelligent beings perceive the world. ● It can save a lot of effort in building a human-like intelligent agent compared to a fully supervised approach. ● Vast amounts of unlabelled data.
  11. 11. Max margin decision boundary 11 How data distribution P(x) influences decisions (1D) Slide credit: Kevin McGuinness (DLCV UPC 2017)
  12. 12. Semi-supervised decision boundary 12 How data distribution P(x) influences decisions (1D) Slide credit: Kevin McGuinness (DLCV UPC 2017)
  13. 13. 13 How data distribution P(x) influences decisions (2D) Slide credit: Kevin McGuinness (DLCV UPC 2017)
  14. 14. 14 How data distribution P(x) influences decisions (2D) Slide credit: Kevin McGuinness (DLCV UPC 2017)
  15. 15. x1 x2 Red / Blue classes are not linearly separable in this 2D space :( 15 How clustering is valuable for linear classifiers Slide credit: Kevin McGuinness (DLCV UPC 2017)
  16. 16. x1 x2 Cluster 1 Cluster 2 Cluster 3 Cluster 4 4D BoW representation (“histogram”) Red / Blue classes are now separable with a linear classifier in a 4D space :) 16 How clustering is valuable for linear classifiers Slide credit: Kevin McGuinness (DLCV UPC 2017)
  17. 17. Example from Kevin McGuinness' Jupyter notebook. 17 How clustering is valuable for linear classifiers
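For readers who want to try the idea on slides 15-17, here is a minimal sketch in Python (NumPy and scikit-learn assumed; the blob data, cluster count and class layout are made up): unlabelled points are clustered with k-means, each labelled sample is re-encoded as a 4D cluster-membership histogram, and a linear classifier is fit in that space.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy data: four blobs arranged so the two classes are NOT linearly separable in 2D.
centers = np.array([[0, 0], [0, 4], [4, 0], [4, 4]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
y = np.array([0, 1, 1, 0]).repeat(50)

# Unsupervised step: cluster the points (no labels used here).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# 4D "bag of clusters" histogram per point (one-hot of the assigned cluster).
H = np.eye(4)[kmeans.predict(X)]

# A linear classifier in the 4D space now separates the classes easily.
clf = LinearSVC().fit(H, y)
print("accuracy in 4D BoW space:", clf.score(H, y))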
  18. 18. 18 Assumptions for unsupervised learning Slide credit: Kevin McGuinness (DLCV UPC 2017)
  19. 19. 19 The manifold hypothesis Slide credit: Kevin McGuinness (DLCV UPC 2017). Linear manifold (wT x + b) vs. non-linear manifold
  20. 20. 20 The manifold hypothesis Slide credit: Kevin McGuinness (DLCV UPC 2017)
  21. 21. 21 Assumptions for unsupervised learning Slide credit: Kevin McGuinness (DLCV UPC 2017)
  22. 22. 22 Autoencoder (AE) “Deep Learning Tutorial”, Dept. Computer Science, Stanford Autoencoders: ● Predict at the output the same input data. ● Do not need labels.
  23. 23. 23 Autoencoder (AE) “Deep Learning Tutorial”, Dept. Computer Science, Stanford Dimensionality reduction: ● Use hidden layer as a feature extractor of any desired size. Application #1
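A minimal sketch of the autoencoder on slides 22-23, assuming PyTorch and toy dimensions: the network is trained to reproduce its own input, so the loss needs no labels, and the hidden code can later serve as a reduced-dimensionality feature.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.encoder(x)            # latent variables / features
        return self.decoder(h), h

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                # a batch of unlabelled samples
recon, h = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction error; no labels involved
loss.backward(); opt.step()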
  24. 24. 24 Autoencoder (AE) Encoder W1 Decoder W2 hdata reconstruction Loss (reconstruction error) Latent variables (representation/features) Figure: Kevin McGuinness (DLCV UPC 2017) Application #2 Pretraining: 1. Initialize a NN solving an autoencoding problem.
  25. 25. 25 Autoencoder (AE) Application #2 Pretraining: 1. Initialize a NN solving an autoencoding problem. 2. Train for final task with “few” labels. Figure: Kevin McGuinness (DLCV UPC 2017) Encoder W1 hdata Classifier WC Latent variables (representation/features) prediction y Loss (cross entropy)
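And a sketch of Application #2 (slides 24-25), again assuming PyTorch and illustrative sizes: the encoder pretrained by autoencoding is reused under a small classifier head and fine-tuned with cross-entropy on the "few" labelled examples of the final task.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
# encoder.load_state_dict(...)  # in practice, weights come from the autoencoder pretraining
classifier = nn.Sequential(encoder, nn.Linear(32, 10))   # 10 hypothetical target classes

opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)
x, y = torch.rand(16, 784), torch.randint(0, 10, (16,))  # the small labelled subset
loss = nn.functional.cross_entropy(classifier(x), y)     # cross-entropy on the labels
loss.backward(); opt.step()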
  26. 26. 26 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning
  27. 27. Unsupervised Feature Learning 27 Slide credit: Yann LeCun
  28. 28. Slide credit: Junting Pan
  29. 29. Frame Reconstruction & Prediction 29 Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." In ICML 2015. [Github] Unsupervised feature learning (no labels) for...
  30. 30. Frame Reconstruction & Prediction 30 Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." In ICML 2015. [Github] Unsupervised feature learning (no labels) for... ...frame prediction.
  31. 31. Frame Reconstruction & Prediction 31 Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." In ICML 2015. [Github] Unsupervised feature learning (no labels) for... ...frame prediction.
  32. 32. 32 Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." In ICML 2015. [Github] Unsupervised learned features (lots of data) are fine-tuned for activity recognition (little data). Frame Reconstruction & Prediction
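A toy sketch of future-frame prediction in the spirit of Srivastava et al. (slides 29-32), assuming PyTorch and flattened frames; the actual model is an LSTM encoder-decoder with composite reconstruction/prediction objectives, but the principle is the same: the loss is the error on future frames, so no labels are required.

import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    def __init__(self, frame_dim=64 * 64, hidden=256, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, frame_dim)

    def forward(self, past):                   # past: (B, T, frame_dim)
        _, state = self.encoder(past)          # summarise the observed clip
        inp = past[:, -1:, :]                  # start decoding from the last seen frame
        preds = []
        for _ in range(self.horizon):
            out, state = self.decoder(inp, state)
            frame = torch.sigmoid(self.readout(out))
            preds.append(frame)
            inp = frame                        # feed the prediction back in
        return torch.cat(preds, dim=1)

model = FramePredictor()
clip = torch.rand(8, 10, 64 * 64)              # 10 observed frames per clip
future = torch.rand(8, 5, 64 * 64)             # the 5 frames to predict
loss = nn.functional.mse_loss(model(clip), future)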
  33. 33. Frame Prediction 33 Ranzato, MarcAurelio, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
  34. 34. 34 Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code] Video frame prediction with a ConvNet. Frame Prediction
  35. 35. 35 Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code] The blurry predictions from MSE are improved with multi-scale architecture, adversarial learning and an image gradient difference loss function. Frame Prediction
  36. 36. 36 Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code] The blurry predictions from MSE (l1) are improved with multi-scale architecture, adversarial training and an image gradient difference loss (GDL) function. Frame Prediction
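A possible implementation of the image gradient difference loss (GDL) term, assuming PyTorch; in the paper it is combined with a multi-scale generator and an adversarial loss, both omitted here. Penalising differences between spatial gradients of the prediction and the target, on top of the per-pixel error, is what sharpens the otherwise blurry MSE output.

import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    # pred, target: (B, C, H, W)
    dx_pred = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dx_true = (target[..., :, 1:] - target[..., :, :-1]).abs()
    dy_pred = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    dy_true = (target[..., 1:, :] - target[..., :-1, :]).abs()
    return ((dx_pred - dx_true).abs() ** alpha).mean() + \
           ((dy_pred - dy_true).abs() ** alpha).mean()

pred, target = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
loss = torch.nn.functional.l1_loss(pred, target) + gradient_difference_loss(pred, target)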
  37. 37. 37 Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code] Frame Prediction
  38. 38. 38 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016. Frame Prediction
  39. 39. 39 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  40. 40. 40 Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks." NIPS 2016 [video]
  41. 41. 41 Frame Prediction Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks." NIPS 2016 [video] Given an input image, probabilistic generation of future frames with a Variational AutoEncoder (VAE).
  42. 42. 42 Frame Prediction Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks." NIPS 2016 [video] Encodes the image as feature maps, and the motion as cross-convolutional kernels.
  43. 43. 43 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning
  44. 44. First steps in video feature learning 44 Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
  45. 45. Temporal Weak Labels 45 Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning of spatiotemporally coherent metrics." ICCV 2015. Assumption: adjacent video frames contain semantically similar information. Autoencoder trained with slowness and sparsity regularization.
  46. 46. 46 Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal coherence in video." CVPR 2016. [video] Temporal Weak Labels
  47. 47. 47 (Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] Temporal order of frames is exploited as the supervisory signal for learning. Temporal Weak Labels
  48. 48. 48 (Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." ECCV 2016. [code] Take temporal order as the supervisory signals for learning Shuffled sequences Binary classification In order Not in order Temporal Weak Labels
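A simplified sketch of the Shuffle and Learn pretext task (slides 47-48), assuming PyTorch; the paper samples tuples with motion-based criteria and uses a Siamese AlexNet, whereas here a tiny backbone and random triplets stand in. The only supervisory signal is whether a frame triplet is in temporal order.

import random
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())   # per-frame features
head = nn.Linear(16 * 3, 1)                                        # "in order" logit

def make_tuple(video):                         # video: (T, 3, H, W)
    i, j, k = sorted(random.sample(range(video.shape[0]), 3))
    frames, label = [video[i], video[j], video[k]], 1.0            # positive: temporal order kept
    if random.random() < 0.5:
        frames, label = [video[j], video[i], video[k]], 0.0        # negative: order broken
    return torch.stack(frames), torch.tensor(label)

video = torch.rand(30, 3, 64, 64)
frames, label = make_tuple(video)
feats = torch.cat([backbone(f.unsqueeze(0)) for f in frames], dim=1)   # (1, 48)
loss = nn.functional.binary_cross_entropy_with_logits(head(feats), label.view(1, 1))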
  49. 49. 49 Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video representation learning with odd-one-out networks." CVPR 2017 Temporal Weak Labels Train a network to detect which of the video sequences contains frames in the wrong order.
  50. 50. 50 Spatio-Temporal Weak Labels X. Lin, Campos, V., Giró-i-Nieto, X., Torres, J., and Canton-Ferrer, C., “Disentangling Motion, Foreground and Background Features in Videos”, in CVPR 2017 Workshop Brave New Motion Representations. [Architecture diagram: a C3D encoder feeds foreground, background and motion branches; foreground and background decoders (kernels share weights, with blocked gradients) reconstruct the foreground in the first and last frames and the background in the first frame, supervised by uNLC masks.]
  51. 51. 51 Spatio-Temporal Weak Labels Pathak, Deepak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. "Learning features by watching objects move." CVPR 2017
  52. 52. 52 Greff, Klaus, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. "Tagger: Deep unsupervised perceptual grouping." NIPS 2016 [video] [code] Spatio-Temporal Weak Labels
  53. 53. 53 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning ○ Feature Learning ○ Cross-modal Retrieval ○ Cross-modal Translation
  54. 54. 54 Vision Audio Speech Cross-modal Learning
  55. 55. 55 Cross-modal Learning Vision Audio Speech Video Synchronization among modalities captured by video is exploited in a self-supervised manner.
  56. 56. 56 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning ○ Feature Learning ○ Cross-modal Retrieval ○ Cross-modal Translation
  57. 57. 57 Vision Audio Video Visual Feature Learning
  58. 58. 58 Visual Feature Learning Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Based on the assumption that ambient sound in video is related to the visual semantics.
  59. 59. 59 Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Use videos to train a CNN that predicts the audio statistics of a frame. Visual Feature Learning
  60. 60. 60 Task: Use the predicted audio stats to cluster images. Audio clusters built with K-means over the training set. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Average stats. Cluster assignments at test time (one row = one cluster) Visual Feature Learning
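A rough sketch of this setup (slides 58-60), assuming PyTorch and scikit-learn with made-up feature sizes: summary statistics of each video's ambient sound are clustered with k-means, and an image CNN is trained to predict, from a single frame, which audio cluster its video belongs to; no manual labels are used anywhere.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

audio_stats = torch.rand(500, 64).numpy()          # hypothetical per-video sound statistics
clusters = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(audio_stats)

cnn = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 30))

frames = torch.rand(500, 3, 64, 64)                 # one frame per video (toy stand-in)
targets = torch.as_tensor(clusters, dtype=torch.long)
loss = nn.functional.cross_entropy(cnn(frames[:32]), targets[:32])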
  61. 61. 61 Although the CNN was not trained with class labels, local units with semantic meaning emerge. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016 Visual Feature Learning
  62. 62. 62 Vision Audio Video Audio Feature Learning
  63. 63. 63 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
  64. 64. 64 Audio Feature Learning: SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Pretrained visual ConvNets supervise the training of a model for sound representation
  65. 65. 65 Videos for training are unlabeled. Relies on ConvNets trained on labeled images. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
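A minimal sketch of the teacher-student training behind SoundNet (slides 64-65), assuming PyTorch; the real system uses much deeper 1D convolutions and pretrained ImageNet/Places teachers, which tiny stand-in networks replace here. A frozen visual network produces a class distribution for the video frames, and the sound network is trained on the raw waveform to match that distribution with a KL-divergence loss, so supervision transfers from images to audio without video labels.

import torch
import torch.nn as nn

visual_teacher = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                               nn.Linear(32, 1000))       # stands in for a pretrained ConvNet
for p in visual_teacher.parameters():
    p.requires_grad_(False)                               # the teacher stays frozen

sound_student = nn.Sequential(nn.Conv1d(1, 32, 64, stride=8), nn.ReLU(),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                              nn.Linear(32, 1000))

frames = torch.rand(8, 3, 64, 64)                         # frames from unlabeled videos
waveforms = torch.rand(8, 1, 22050)                       # ~1 s of raw audio per video
teacher_probs = visual_teacher(frames).softmax(dim=1)
student_logp = sound_student(waveforms).log_softmax(dim=1)
loss = nn.functional.kl_div(student_logp, teacher_probs, reduction="batchmean")
loss.backward()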
  66. 66. 66 Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  67. 67. 67 Visualization of the 1D filters over raw audio in conv1. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  68. 68. 68 Visualization of the 1D filters over raw audio in conv1. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  69. 69. 69 Visualize samples that most activate a neuron in a late layer (conv7) Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  70. 70. 70 Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7): Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  71. 71. 71 Hearing sounds that most activate a neuron in the sound network (conv7) Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  72. 72. 72 Hearing sounds that most activate a neuron in the sound network (conv5) Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016 Audio Feature Learning: SoundNet
  73. 73. 73 Vision Audio Audio & Visual Feature Learning Video
  74. 74. 74 Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. Audio and visual features learned by assessing correspondence. Audio & Visual Feature Learning
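A small sketch of the audio-visual correspondence task (slide 74), assuming PyTorch and toy encoders: a vision subnetwork and an audio subnetwork each embed their input, and a binary head decides whether the frame and the audio clip come from the same video (positive) or from two different videos (negative); both encoders are trained from scratch this way.

import torch
import torch.nn as nn

vision_net = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))
audio_net = nn.Sequential(nn.Conv1d(1, 32, 64, stride=8), nn.ReLU(),
                          nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 128))
head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

frames = torch.rand(16, 3, 64, 64)
audio = torch.rand(16, 1, 22050)
labels = torch.randint(0, 2, (16, 1)).float()    # 1 = same video, 0 = mismatched pair

logits = head(torch.cat([vision_net(frames), audio_net(audio)], dim=1))
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)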
  75. 75. 75 Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017). Audio & Visual Feature Learning
  76. 76. 76 Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017). Audio & Visual Feature Learning
  77. 77. 77 Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
  78. 78. 78 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning ○ Feature Learning ○ Cross-modal Retrieval ○ Cross-modal Translation
  79. 79. 79 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  80. 80. 80 Cross-modal Retrieval Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Learn to synthesize sounds from videos of people hitting objects with a drumstick.
  81. 81. 81 Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Not end-to-end Cross-modal Retrieval
  82. 82. 82 The Greatest Hits Dataset Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Cross-modal Retrieval
  83. 83. 83 [Paper draft] Cross-modal Retrieval Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
  84. 84. 84 Best match Visual feature Audio feature Video sonorization Cross-modal Retrieval Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
  85. 85. 85 Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018). Visual feature Audio feature Best match Audio coloring Cross-modal Retrieval
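A sketch of how retrieval works once a joint embedding has been learned, assuming PyTorch and hypothetical feature sizes (the projections would normally be trained with a ranking or contrastive loss, omitted here): both modalities are mapped into a shared space, and "video sonorization" amounts to picking the audio candidate whose embedding is closest to the query's visual embedding; "audio coloring" is the symmetric direction.

import torch
import torch.nn as nn

visual_proj = nn.Linear(512, 128)              # hypothetical feature sizes
audio_proj = nn.Linear(128, 128)

def embed(x, proj):
    return nn.functional.normalize(proj(x), dim=1)   # unit-norm so dot product = cosine

video_feats = torch.rand(1, 512)               # features of the query video
audio_bank = torch.rand(1000, 128)             # features of candidate audio tracks

query = embed(video_feats, visual_proj)        # (1, 128)
candidates = embed(audio_bank, audio_proj)     # (1000, 128)
best_match = (candidates @ query.t()).squeeze(1).argmax()   # index of the best-matching audio
print("best matching audio:", int(best_match))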
  86. 86. 86 Hong, Sungeun, Woobin Im, and Hyun S. Yang. "Deep Learning for Content-Based, Cross-Modal Retrieval of Videos and Music." (submitted)
  87. 87. 87 Cross-modal Retrieval Hong, Sungeun, Woobin Im, and Hyun S. Yang. "Deep Learning for Content-Based, Cross-Modal Retrieval of Videos and Music." 2017 (submitted)
  88. 88. 88 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning ○ Feature Learning ○ Cross-modal Retrieval ○ Cross-modal Translation
  89. 89. 89 Cross-modal Translation Vision Speech Video
  90. 90. 90 Vision Speech Video Cross-modal Translation
  91. 91. 91 Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
  92. 92. 92 Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017 Speech Generation from Video CNN (VGG) Frame from a silent video Audio feature Post-hoc synthesis
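A rough sketch of the Vid2speech idea on slide 92, assuming PyTorch and made-up feature sizes (the actual system uses a VGG backbone and a dedicated synthesizer): a CNN regresses acoustic features from frames of a silent video, and speech is synthesized from those features post hoc.

import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                    nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(64, 8))                       # 8 hypothetical acoustic features

frames = torch.rand(4, 3, 128, 128)                         # mouth-region crops from silent video
audio_feats = torch.rand(4, 8)                              # target acoustic features
loss = nn.functional.mse_loss(cnn(frames), audio_feats)     # regression; synthesis happens post hoc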
  93. 93. 93 Speech Generation from Video Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
  94. 94. 94 Cross-modal Translation Vision Speech Video
  95. 95. 95 Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
  96. 96. 96 Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017. Speech to Video Synthesis (mouth)
  97. 97. 97 Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017
  98. 98. 98 Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from audio." SIGGRAPH 2017. Speech to Video Synthesis (mouth)
  99. 99. 99 Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017
  100. 100. 100 Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end learning of pose and emotion." SIGGRAPH 2017 Speech to Video Synthesis (pose & emotion)
  101. 101. 101 L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM Multimedia Thematic Workshops 2017. Audio & Visual Generation
  102. 102. 102 Speech2Signs (under work). [Pipeline diagram: spoken "Hello" → ASR → text "Hello" → NMT → SLPA → Video Generator.]
  103. 103. 103 Outline 1. Unsupervised Learning 2. Predictive Learning 3. Self-supervised Learning 4. Cross-modal Learning
  104. 104. 104 Questions ?
