
The Frontier of Image Captioning and Action Recognition: Focusing on Datasets (17th STAIR Lab AI Seminar)

Speaker: Yuya Yoshikawa (Software Technology and Artificial Intelligence Research Laboratory (STAIR Lab), Chiba Institute of Technology)


  1. June 28, 2018. 17th STAIR Lab AI Seminar. v.20180802
  2. September 2015; October 2015; June 2018, AIP
  3. Outline: image captioning (20 min), action recognition (35 min)
  4. 3, 16
  6. Image captioning example: "man in black shirt is playing guitar."
  7. Neural Image Caption (NIC) [Vinyals+ 2015]: (1) encode the image with a CNN; (2) decode the caption with an LSTM. Implementations: Chainer, TensorFlow, PyTorch.
  8. Flickr8k [Hodosh+ 2013]: 8,092 images from Flickr, 5 captions per image.
  9. Flickr30k [Young+ 2014]: an extension of Flickr8k; 31,783 images, 5 captions per image.
  10. Flickr30k Entities [Plummer+ 2016] [Liu+ 2017]: extends Flickr30k with region-to-phrase correspondences.
  11. MS COCO [Chen+ 2015]: images from Flickr, 5 captions per image. Example captions for http://cocodataset.org/#explore?id=409091:
        • a lady blowing out candles on a cake
        • the woman is blowing out her birthday cake candles
        • a woman blowing candles on a frosted cake.
        • two people blowing out candles on a cake.
        • a girl is blowing out candles on a birthday cake.
  12. MS COCO caption collection on Amazon Mechanical Turk (AMT). The worker guidelines include: do not start a sentence with "There is"; write at least 8 words.
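The two guideline fragments that survive on slide 12 ("There is" and "8 words") can be expressed as a simple caption check. The following is a minimal sketch, not the procedure used by the MS COCO authors; the function name `check_caption` and the whitespace-based word count are assumptions made for illustration.

```python
def check_caption(caption, min_words=8):
    """Return a list of guideline violations for one caption (empty if none).

    Hypothetical helper: it only covers the two rules visible on the slide.
    """
    problems = []
    text = caption.strip()
    # Rule: a caption should not start with "There is".
    if text.lower().startswith("there is"):
        problems.append('starts with "There is"')
    # Rule: a caption should contain at least `min_words` words
    # (words are approximated by whitespace-separated tokens).
    if len(text.split()) < min_words:
        problems.append("fewer than %d words" % min_words)
    return problems


ok = check_caption("a girl is blowing out candles on a birthday cake.")
bad = check_caption("There is a dog.")
print(ok)
print(bad)
```

A real collection pipeline would need further checks (for example, the other guideline items that did not survive on the slide), so this should be read as an illustration of the idea only.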
  13. Visual Genome [Krishna+ 2017]: images drawn from MS COCO and YFCC100M. Example region descriptions: "Park bench is made of gray weathered wood"; "The man is almost bald".
  14. Visual Genome annotation components:
        • Object
        • Attribute
        • Relationship: relates 2 objects, e.g. jumping_over(man, fire hydrant)
        • Region graph: built from the objects, attributes, and relationships of one region
        • Scene graph: combines the region graphs of an image
  15. Scene Graph
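The Visual Genome components listed on slide 14 map naturally onto a small data structure. The sketch below is illustrative only: the class names `Relationship` and `Graph` and the function `merge_region_graphs` are hypothetical, and objects are identified by name alone, whereas the real dataset identifies them by annotated regions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A directed relation between two objects, e.g. jumping_over(man, fire hydrant)."""
    predicate: str
    subject: str
    obj: str


@dataclass
class Graph:
    """Objects, (object, attribute) pairs, and relationships of a region or a scene."""
    objects: set = field(default_factory=set)
    attributes: set = field(default_factory=set)
    relationships: set = field(default_factory=set)


def merge_region_graphs(region_graphs):
    """Build a scene graph as the union of the given region graphs."""
    scene = Graph()
    for g in region_graphs:
        scene.objects |= g.objects
        scene.attributes |= g.attributes
        scene.relationships |= g.relationships
    return scene


region_a = Graph(
    objects={"man", "fire hydrant"},
    relationships={Relationship("jumping_over", "man", "fire hydrant")},
)
region_b = Graph(objects={"man"}, attributes={("man", "almost bald")})
scene = merge_region_graphs([region_a, region_b])
print(sorted(scene.objects))
```

Using sets keeps the merge idempotent: an object or relationship that appears in several regions is stored once in the scene graph.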
  16. One paragraph per image [Krause+ 2017], built on Visual Genome images. Example paragraph: "Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside."
  17. STAIR Captions [Yoshikawa+ 2017]: Japanese captions for MS COCO images. http://captions.stair.center/explore/
  18. STAIR Captions annotation: about 2,100 crowdsourcing workers; five caption guidelines, the first requiring at least 15 characters.
  19. Caption generation comparison: captions from a model trained on MS COCO and translated with Google, versus captions from a model trained on STAIR Captions.
  20. STAIR Captions: http://captions.stair.center
  21. Other caption datasets:
        • Pascal Sentence [Rashtchian+ 2010] [Funaki+ 2015]: 1,000 images from PASCAL VOC2008, 5 captions per image
        • Abstract Scenes [Zitnick+ 2013]
        • YJ Captions [Miyazaki+ 2016]: built on MS COCO images
        • Multi30k [Elliott+ 2016]: Flickr30k, MS COCO
  22. Comparison of caption datasets:

        Dataset          Images    Captions per image
        Pascal Sentence    1,000    5
        MS COCO          123,287    5
        Flickr8k           8,092    5
        Flickr30k         31,783    5
        Visual Genome    108,077   50
        Krause et al.     19,551    1
        Multi30k         123,287    5
        STAIR Captions   123,287    5
  24. Action recognition tasks: Classification (Recognition); Temporal Localization; Spatial-Temporal Localization.
  25. Classification with C3D (3D CNN) [Tran+ 2015]: 3D convolution (Conv) and 3D pooling (Pool). Conv: 3x3x3 kernels with stride 1. Pool: 2x2x2. Input: 3 channels, 16 frames, 112x112 pixels.
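To make the C3D layer settings on slide 25 concrete, the sketch below computes how one Conv layer and one Pool layer change the size of a 16-frame, 112x112 clip. The padding of 1 for the convolution and the stride of 2 for the pooling are assumptions that the slide does not state, and the helper names are hypothetical.

```python
def conv3d_out(size, kernel=3, stride=1, padding=1):
    """Output (frames, height, width) of a 3D convolution.

    padding=1 is an assumption; the slide gives only the kernel and stride.
    """
    return tuple((s + 2 * padding - kernel) // stride + 1 for s in size)


def pool3d_out(size, kernel=2, stride=2):
    """Output (frames, height, width) of a 3D pooling layer.

    stride=2 is an assumption; the slide gives only the 2x2x2 window.
    """
    return tuple((s - kernel) // stride + 1 for s in size)


clip = (16, 112, 112)  # frames, height, width (the 3 input channels are not spatial)
after_conv = conv3d_out(clip)
after_pool = pool3d_out(after_conv)
print(after_conv, after_pool)
```

Under these assumptions the convolution keeps the clip size unchanged and each 2x2x2 pooling halves the temporal and spatial dimensions.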
  26. The MNIST of action recognition:
        • HMDB51 [Kuehne+ 2011]: 51 classes, 6,766 videos; sources: Prelinger archive, YouTube
        • UCF101 [Soomro+ 2012]: 101 classes, 13,320 videos; source: YouTube
  27. ActivityNet 200 [Heilbron+ 2015]: YouTube videos; slide figures: 200, 1.5, 2.3. Used in the CVPR2016 ActivityNet Challenge. ActivityNet Challenge 2017 (Untrimmed Video Classification): 8.8% (1st place).
  28. ActivityNet 200 construction (1/4), step (1): 200 classes selected from about 2000 activities in the American Time Use Survey (ATUS). Source: American Time Use Survey Activity Lexicon 2016.
  29. ActivityNet 200 construction (2/4), step (2): WordNet, YouTube.
  30. ActivityNet 200 construction (3/4), step (3): Amazon Mechanical Turk (AMT).
  31. ActivityNet 200 construction (4/4), step (4).
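  32. Charades [Sigurdsson+ 2016]: 157 classes; evaluation metric: mAP; slide figure: 6.7.
  33. Charades construction (1/3), step (1): from 5 objects and 5 actions, choose 2 objects and 2 actions.
  34. Charades construction (2/3), step (2): 30.
  35. Charades construction (3/3), step (3): 5; 5.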
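Slide 32 names mean average precision (mAP) as the Charades metric. The following is a generic sketch of mAP for multi-label classification, not the official Charades evaluation script; the function names and the list-of-lists input layout (rows are videos, columns are classes) are assumptions.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of precision at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive example contributes 0.0 in this sketch.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows, label_rows):
    """Mean of the per-class average precision over all classes."""
    n_classes = len(score_rows[0])
    aps = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_rows]
        class_labels = [row[c] for row in label_rows]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / len(aps)


# Three videos, two classes; a video may carry several labels at once.
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
labels = [[1, 0], [0, 1], [1, 0]]
print(mean_average_precision(scores, labels))
```

mAP suits Charades-style data because a single video can contain several actions, so a single-label accuracy would not capture all correct predictions.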
  36. Charades-Ego [Sigurdsson+ 2018]: paired first-person and third-person videos, collected in the same way as Charades; 157 classes; 4000; 60%.
  37. Kinetics-400 [Kay+ 2017]: 400 classes; slide figure: 30; 10-second clips from YouTube, 1 clip per YouTube video; evaluation metric: Top-1/Top-5. A 600-class version, Kinetics-600, also exists.
  38. Kinetics construction: 1. AMT; 2. YouTube, 10; 3. AMT.
  39. SOMETHING-SOMETHING (v1) [Goyal+ 2017]: 174 classes; slide figures: 10, 2~6; 88.5%. Labels use "Something" as a placeholder, e.g.:
        • Holding something
        • Dropping something into something
        • Something falling like a rock
  40. AVA [Gu+ 2017]: 80 classes (14, 49, 17); 430 videos; 15; Bounding box annotations.
  41. AVA construction (1/2):
        (1) YouTube: 15, 30; 15, 1, 3, 900
        (2) Bounding Box: Faster-RCNN person detector
        (3) Bounding Box
  42. AVA construction (2/2), step (4): 1.; 2.
  43. Moments in Time [Monfort+ 2018]: 339 classes; slide figures: 100, 3. Top-1: 0.39, Top-5: 0.67 (The Moments in Time Recognition Challenge 2018).
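Kinetics and Moments in Time are reported with Top-1 and Top-5 accuracy. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the toy scores are illustrative assumptions, not taken from the slides.

```python
def top_k_accuracy(score_rows, targets, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    correct = 0
    for row, target in zip(score_rows, targets):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if target in top_k:
            correct += 1
    return correct / len(targets)


# Three samples, three classes (toy values).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
targets = [1, 1, 0]
top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
print(top1, top2)
```

Top-k accuracy with k greater than 1 is forgiving of near-misses, which is why it is usually reported alongside Top-1 for datasets with hundreds of visually similar classes.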
  44. STAIR Actions (v1.0) [Yoshikawa+ 2018]: 100 classes; YouTube; slide figures: 100, 9, 5.
  45. STAIR Actions: PC; Wiktionary: 1000.
  46. STAIR Actions construction (1/4): 1. YouTube; 4; CC0. 2. 5; 5, 5. 3. 5. 4.
  47. STAIR Actions construction (2/4): 10, 5, 10; 5, 10, 5.
  48. STAIR Actions construction (3/4): 3, 3, 2.
  49. STAIR Actions construction (4/4): STAIR Lab.
  50. STAIR Actions versus Kinetics, using OpenPose: STAIR Actions 95.6%, Kinetics 55.5%.
  51. STAIR Actions baselines: 2DCNN+LSTM (LRCN), Two-stream CNN, 3DCNN. STAIR Actions: 76.5%; c.f. Kinetics: 61.0% (Two-stream CNN).
  52. Comparison of action recognition datasets (the slide header also lists Bounding Box):

        Dataset                     Classes   Videos   Source
        HMDB51                         51       6K     YouTube
        UCF101                        101      13K     YouTube
        ActivityNet 200               200      15K     YouTube
        Charades                      157      67K
        Charades-Ego                  157       8K
        Kinetics                      400     300K     YouTube
        SOMETHING-SOMETHING (v1)      174     100K
        AVA                            80      430     YouTube
        Moments in Time               339      >1M     YouTube
        STAIR Actions (v1.0)          100     >90K     YouTube
  53. Summary: STAIR Captions: 5 Japanese captions for each MS COCO image. STAIR Actions: 100 action classes.
  54. References (1)
        • [Vinyals+ 2015] Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
        • [Hodosh+ 2013] Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.
        • [Young+ 2014] Young, Peter, et al. "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions." Transactions of the Association for Computational Linguistics 2 (2014): 67-78.
        • [Plummer+ 2016] Plummer, Bryan A., et al. "Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015.
        • [Liu+ 2017] Liu, Chenxi, et al. "Attention Correctness in Neural Image Captioning." AAAI. 2017.
        • [Chen+ 2015] Chen, Xinlei, et al. "Microsoft COCO captions: Data collection and evaluation server." arXiv preprint arXiv:1504.00325 (2015).
  55. References (2)
        • [Krishna+ 2017] Krishna, Ranjay, et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123.1 (2017): 32-73.
        • [Krause+ 2017] Krause, Jonathan, et al. "A hierarchical approach for generating descriptive image paragraphs." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
        • [Yoshikawa+ 2017] Yoshikawa, Yuya, Yutaro Shigeto, and Akikazu Takeuchi. "STAIR Captions: Constructing a large-scale Japanese image caption dataset." arXiv preprint arXiv:1705.00823 (2017).
        • [Rashtchian+ 2010] Rashtchian, Cyrus, et al. "Collecting image annotations using Amazon's Mechanical Turk." Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010.
        • [Funaki+ 2015] Funaki, Ruka, and Hideki Nakayama. "Image-mediated learning for zero-shot cross-lingual document retrieval." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.
  56. References (3)
        • [Zitnick+ 2013] Zitnick, C. Lawrence, and Devi Parikh. "Bringing semantics into focus using visual abstraction." Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
        • [Miyazaki+ 2016] Miyazaki, Takashi, and Nobuyuki Shimizu. "Cross-lingual image caption generation." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2016.
        • [Elliott+ 2016] Elliott, Desmond, et al. "Multi30k: Multilingual English-German image descriptions." arXiv preprint arXiv:1605.00459 (2016).
        • [Tran+ 2015] Tran, Du, et al. "C3D: generic features for video analysis." CoRR, abs/1412.0767 2.7 (2014): 8.
        • [Kuehne+ 2011] Kuehne, Hilde, et al. "HMDB51: A large video database for human motion recognition." High Performance Computing in Science and Engineering '12. Springer, Berlin, Heidelberg, 2013. 571-582.
        • [Soomro+ 2012] Soomro, Khurram, Amir Roshan Zamir, and Mubarak Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild." arXiv preprint arXiv:1212.0402 (2012).
  57. References (4)
        • [Heilbron+ 2015] Caba Heilbron, Fabian, et al. "ActivityNet: A large-scale video benchmark for human activity understanding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
        • [Sigurdsson+ 2016] Sigurdsson, Gunnar A., et al. "Hollywood in homes: Crowdsourcing data collection for activity understanding." European Conference on Computer Vision. Springer, Cham, 2016.
        • [Sigurdsson+ 2018] Sigurdsson, Gunnar A., et al. "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos." arXiv preprint arXiv:1804.09626 (2018).
        • [Kay+ 2017] Kay, Will, et al. "The kinetics human action video dataset." arXiv preprint arXiv:1705.06950 (2017).
        • [Goyal+ 2017] Goyal, Raghav, et al. "The 'something something' video database for learning and evaluating visual common sense." Proc. ICCV. 2017.
        • [Gu+ 2017] Gu, Chunhui, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions." arXiv preprint arXiv:1705.08421 (2017).
  58. References (5)
        • [Monfort+ 2018] Monfort, Mathew, et al. "Moments in Time Dataset: one million videos for event understanding." arXiv preprint arXiv:1801.03150 (2018).
        • [Yoshikawa+ 2018] Yoshikawa, Yuya, Jiaqing Lin, and Akikazu Takeuchi. "STAIR Actions: A Video Dataset of Everyday Home Actions." arXiv preprint arXiv:1804.04326 (2018).
