Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)

1,426 views

Published on

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.

Published in: Data & Analytics
  • Be the first to comment

Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)

  1. 1. [course site] Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politecnica de Catalunya Technical University of Catalonia Video Analysis Day 2 Lecture 2 #DLUPC
  2. 2. Acknowledgments 2 Víctor Campos Alberto Montes
  3. 3. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 3
  4. 4. Recognition Demo: Clarifai MIT Technology Review : “A start-up’s Neural Network Can Understand Video” (3/2/2015) 4
  5. 5. Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 5 Recognition
  6. 6. 6 (Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project] Activity Recognition: Datasets
  7. 7. 7 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  8. 8. 8 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015
  9. 9. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014 Slides extracted from ReadCV seminar by Victor Campos 9 Recognition: DeepVideo
  10. 10. 10 Recognition: DeepVideo: Demo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  11. 11. 11 Recognition: DeepVideo: Architectures Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  12. 12. 12 Recognition: DeepVideo: Features Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  13. 13. 13 Recognition: DeepVideo: Multiscale Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  14. 14. 14 Recognition: DeepVideo: Results Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  15. 15. 15 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code Activity Recognition: Frames + LSTM
  16. 16. 16 Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015 Activity Recognition: Frames + Optical Flow + LSTM
  17. 17. 17 Recognition Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015
  18. 18. 18 Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014. Recognition: Two stream Two CNNs in paralel: ● One for RGB images ● One for Optical flow (hand-crafted features) Fusion after the softmax layer
  19. 19. 19Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code] Recognition: Two stream Two CNNs in paralel: ● One for RGB images ● One for Optical flow (hand-crafted features) Fusion at a convolutional layer
  20. 20. 20 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  21. 21. 21 Recognition: C3D Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  22. 22. 22 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Demo
  23. 23. 23 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Temporal dimension 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets Temporal depth 2D ConvNets
  24. 24. 24 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets Recognition: C3D: Temporal dimension
  25. 25. 25 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Architecture Feature vector
  26. 26. 26 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector 16-frame clip 16-frame clip 16-frame clip 16-frame clip ... Average 4096-dimvideodescriptor 4096-dimvideodescriptor L2 norm
  27. 27. BSc thesis http://activity-net.org/ 27 Temporal Activity Detection
  28. 28. BSc thesis Videos Activity Classification Longboarding 28 Temporal Activity Detection
  29. 29. BSc thesis Videos Activity Temporal Localization Longboarding 29 Temporal Activity Detection
  30. 30. 30 Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes) Recognition: Localization
  31. 31. 31 Recognition: Localization (Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
  32. 32. 32 Recognition: Localization (Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
  33. 33. BSc thesis Neural Network Activity 33 Temporal Activity Detection
  34. 34. BSc thesis Activity CNN RNN+ 34 Temporal Activity Detection
  35. 35. Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." CVPR 2015 3D Convolutions over sets of 16 frames... 35 Temporal Activity Detection
  36. 36. BSc thesis 36 Temporal Activity Detection
  37. 37. BSc thesis mAP = 0.5938 mAP = 0.5492 mAP = 0.5635 Deeper networks present overfitting 37 Temporal Activity Detection
  38. 38. BSc thesis 38 Temporal Activity Detection
  39. 39. BSc thesis 39 Temporal Activity Detection
  40. 40. BSc thesis 40 Temporal Activity Detection
  41. 41. BSc thesis Ground Truth: Playing water polo Prediction: 0.765 Playing water polo 0.202 Swimming 0.007 Springboard diving 41 Temporal Activity Detection
  42. 42. BSc thesis Ground Truth: Hopscotch Prediction: 0.848 Running a marathon 0.023 Triple jump 0.022 Javelin throw 42 Temporal Activity Detection
  43. 43. BSc thesis 43 Temporal Activity Detection A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st NIPS Workshop on Large Scale Computer Vision Systems 2016 (best poster award)
  44. 44. 44 Temporal Activity Detection A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st NIPS Workshop on Large Scale Computer Vision Systems 2016 (best poster award)
  45. 45. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 45
  46. 46. Optical Flow Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 46
  47. 47. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015 47
  48. 48. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 48 End to end supervised learning of optical flow.
  49. 49. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 49 Option A: Stack both input images together and feed them through a generic network.
  50. 50. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 50 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
  51. 51. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 51 Correlation layer: Convolution of data patches from the layers to combine. Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
  52. 52. Optical Flow: FlowNet (expanding) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 52 Upconvolutional layers: Unpooling features maps + convolution. Upconvolutioned feature maps are concatenated with the corresponding map from the contractive part.
  53. 53. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015 53 Since existing ground truth datasets are not sufficiently large to train a Convnet, a synthetic Flying Dataset is generated… and augmented (translation, rotation, scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color). Convnets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI. Data augmentation
  54. 54. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 54 Optical Flow: FlowNet
  55. 55. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 55
  56. 56. Object tracking: Deep but not CNN 56 Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013. Offline learning: Robust and generic features are learning by training a stacked denoising autoencoder on auxiliary images. Online learning: Encoder part of the autoencode + classification neural network
  57. 57. Object tracking: MDNet 57 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  58. 58. Object tracking: MDNet: Architecture 58 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.
  59. 59. Object tracking: MDNet: Online update 59 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) MDNet is updated online at test time with hard negative mining, that is, selecting negative samples with the highest positive score.
  60. 60. Object tracking: FCNT 60 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
  61. 61. Object tracking: FCNT 61 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." CVPR 2015 [code] Focus on conv4-3 and conv5-3 of VGG-16 network pre-trained for ImageNet image classification. conv4-3 conv5-3
  62. 62. Object tracking: FCNT: Specialization 62 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.
  63. 63. Object tracking: FCNT: Localization 63 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Although trained for image classification, feature maps in conv5-3 enable object localization… ...but is not discriminative enough to different objects of the same category.
  64. 64. Object tracking: Localization 64 Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015. Other works have shown how features maps in convolutional layers allow object localization.
  65. 65. Object tracking: FCNT: Localization 65 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation… conv4-3 conv5-3
  66. 66. Object tracking: FCNT: Architecture 66 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] SNet=Specific Network (online update) GNet=General Network (fixed)
  67. 67. Object tracking: FCNT: Results 67 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
  68. 68. Object tracking: ROLO 68 Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking." IEEE International Symposium on Circuits and Systems, 2017
  69. 69. Object tracking: ROLO 69 Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking." arXiv preprint arXiv:1607.05781 (2016)
  70. 70. Object tracking: CFNet 70 Valmadre, Jack, Luca Bertinetto, João F. Henriques, Andrea Vedaldi, and Philip HS Torr. "End-to-end representation learning for Correlation Filter based tracking." CVPR 2017
  71. 71. Questions?

×