
Approaches to Video Summarization Using Deep Learning (7th STAIR Lab Artificial Intelligence Seminar)

Speaker: Yuta Nakashima (Nara Institute of Science and Technology)


  1. 1. 2016/11/30
  2. 2. 2 Deep Semantic Feature [Figure labels: Sentence, Sentence Embedding, Video, Video Embedding, Web Images, Image Search, Embedding Space; example sentence: “A baby is playing a guitar.”]
  3. 3. • • 3
  4. 4. • • • ‣ ‣ 4
  5. 5. 6
  6. 6. 7 [1] [2] [1] https://www.ibm.com/blogs/think/2016/08/31/cognitive-movie-trailer/ [2] Uchihashi et al., “Video Manga: generating semantically meaningful video summaries,” ACM MM, 1999 From: https://www.youtube.com/watch?v=gJEzuYynaiw
  7. 7. 8
  8. 8. • • • vs • • Coverage/Representativeness vs Importance/Interestingness • 9
  9. 9. 10 BoVW
  10. 10. 11 Coverage Importance/ Preference
  11. 11. • • : [Babaguchi 2004] 12 [Babaguchi 2004] N. Babaguchi, Y. Kawai, T. Ogura, and T. Kitahashi, “Personalized abstraction of broadcasted American football video by highlight selection,” TMM 2004.
  12. 12. : [Gong 2014] • Fisher vector/SIFT desc. /1 • Coverage 13 [Gong 2014] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” NIPS 2014.
  13. 13. : [Gygli 2014] • Importance • 14 etc. Importance [Gygli 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. van Gool, “Creating summaries from user videos,” ECCV 2014.
  14. 14. 15
  15. 15. 17 …
  16. 16. 18 BoVW
  17. 17. 19 … “A man playing a guitar outside his house” “A flock of zebras grazing.”
  18. 18. Coverage 20
  19. 19. Importance 21
  20. 20. 23 … “A man playing a guitar outside his house” “A flock of zebras grazing.” ?
  21. 21. (e.g. [Li 2010]) 24 … man woman piano guitar zebra lion grass … … … {1, 0, … 1, 0, …, 0, 0, …, 0} {0, 0, … 0, 0, …, 1, 0, …, 1} [Li 2010] L.-J. Li, H. Su, E. P. Xing, F.-F. Li, “Object bank: A high level image representation for scene classification & semantic feature sparsification,” NIPS 2010.
  22. 22. • • (e.g., word2vec) + Recurrent Neural Net (RNN) • • Convolutional Neural Net (CNN) + Pooling • 3D-CNN • + RNN 25 Deep Neural Net
  23. 23. DNN 26 … “A man playing a guitar outside his house” “A flock of zebras grazing.” DNN DNN
  24. 24. ( ) • • • 28 {“A”, “man”, “playing”, “a”, “guitar”, “outside”, “his”, “house”, “.”}
  25. 25. • ILSVRC CNN • AlexNet, VGG-16, GoogLeNet, ResNet • Mean Pooling • FC CNN + Pooling (e.g. [Pan 2016]) 29 … … …… [Pan 2016] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” CVPR 2016.
  26. 26. 3D-CNN (e.g. [Tran 2015]) • • FC • 30 … … [Tran 2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” ICCV 2015.
  27. 27. • • RNN 31 RNN
  28. 28. LSTM • Self-loop (cell) 32
  29. 29. GRU • LSTM gate reset update 2 33
  30. 30. RNN 34 • • Stacked convolutional GRU [Ballas 2016] [Ballas 2016] N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into convolutional networks for learning video representations,” ICLR 2016.
  31. 31. RNN • • 35 Hierarchical RNN [Pan 2015] [Pan 2015] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” CVPR 2016.
  32. 32. • 37 “A man is playing a keyboard.” DNN DNN Loss
  33. 33. ( • 38 “A man playing a guitar outside his house” “A flock of zebras grazing.” ( ), ),
  34. 34. • 39 ( “A man playing a guitar outside his house” “A flock of zebras grazing.” ( ), ),
  35. 35. : • Play the keyboard vs Type the keyboard 40 keyboard Query: “A man is playing a keyboard.” keyboard keyboard
  36. 36. • Sentence: LSTM; video: CNN + mean pooling • Contrastive loss / • LSTM 41 “A man is playing a keyboard” semantic space A man is playing a keyboard CNN + mean pooling LSTM CNN
  37. 37. • • CNN RNN 42 Pooling } + Loss Web images Video “.” “A” “dog” “is” “eating” “watermelon” Pooling } Sentence Fully-connected Layers CNN for Videos CNN for Web Images RNN for Sentences RNN RNN RNN RNN RNN RNN
  38. 38. 43 “A child dances to the TV” “A man is playing a guitar” “A cat is hitting the keys on a piano” • MS Video Description Corpus (# Clips 1970, # Text 85K)
  39. 39. [Otani 2016] 44 ECCV-16 submission ID 631 11 Query GoogLeNet+VS GoogLeNet+ALL2 (1) A man is playing a keyboard. (2) Kids are playing in a pool. (3) A man is trimming fat from a roast. Query GoogLeNet+VI GoogLeNet+ALL2 (4) A boy is singing into a microphone. [Otani 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, N. Yokoya, “Learning joint representations of videos and sentences with web image search,” ECCVW 2016.
  40. 40. • ‣ ‣ ‣ Storytelling • ‣ ‣ ‣ 46
  41. 41. Take-home message • ‣ ‣ • ‣ ‣ ‣ • 47
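
Slides 36 and 37 appear to describe a joint embedding in which per-frame CNN features are mean-pooled into a single video vector, a sentence vector comes from an LSTM, and the two are trained with a contrastive loss. The following is only a minimal sketch of that idea, not the speaker's implementation. The loss uses the common margin-based form, which is an assumption because the transcript does not give the formula. The function names, the margin value, and the toy vectors are all hypothetical, and the sentence vectors are hand-written stand-ins for LSTM outputs.

```python
import math


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / n for d in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(video_vec, sentence_vec, is_match, margin=1.0):
    """Margin-based contrastive loss (assumed form, not taken from the slides).

    Matching pairs are penalized by their squared distance; non-matching
    pairs are penalized only while they are closer than `margin`.
    """
    d = euclidean(video_vec, sentence_vec)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy data: three frames with 2-D "CNN features" (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
video = mean_pool(frames)

# Stand-ins for LSTM sentence embeddings (hypothetical values).
matching_sentence = [0.5, 0.5]
other_sentence = [0.5, 1.0]

pos_loss = contrastive_loss(video, matching_sentence, is_match=True)
neg_loss = contrastive_loss(video, other_sentence, is_match=False)
print(video, pos_loss, neg_loss)
```

In this toy case the matching pair coincides, so its loss is zero, while the non-matching sentence sits inside the margin and therefore still incurs a penalty. In a real system both embeddings would be produced by trained networks and the loss would be minimized by gradient descent.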
