Serena Yeung, PhD Student, Stanford, at MLconf Seattle 2017

Serena is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. She interned at Facebook AI Research in Summer 2016.

Before starting her Ph.D., she received a B.S. in Electrical Engineering in 2010, and an M.S. in Electrical Engineering in 2013, both from Stanford. She also worked as a software engineer at Rockmelt (acquired by Yahoo) from 2009-2011.

Abstract

Towards Scaling Video Understanding:
The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will first discuss some of the challenges of scale in labeling, modeling, and inference behind this gap. I will then present some of our recent work towards addressing these challenges, in particular using reinforcement learning-based formulations to tackle efficient inference in videos and to learn classifiers from noisy web search results. Finally, I will conclude with a discussion of promising future directions for scaling video understanding.


Serena Yeung, PhD Student, Stanford, at MLconf Seattle 2017

  1. Towards Scaling Video Understanding (Serena Yeung)
  2. YouTube, TV, GoPro, smart spaces
  3. State-of-the-art in video understanding
  4. State-of-the-art in video understanding. Classification (Abu-El-Haija et al. 2016): 4,800 categories, 15.2% Top5 error.
  5. State-of-the-art in video understanding. Classification (Abu-El-Haija et al. 2016): 4,800 categories, 15.2% Top5 error. Detection (Idrees et al. 2017, Sigurdsson et al. 2016): tens of categories, ~10-20 mAP at 0.5 overlap.
  6. State-of-the-art in video understanding. Classification (Abu-El-Haija et al. 2016): 4,800 categories, 15.2% Top5 error. Detection (Idrees et al. 2017, Sigurdsson et al. 2016): tens of categories, ~10-20 mAP at 0.5 overlap. Captioning (Yu et al. 2016): just getting started; short clips, niche domains.
  7. Comparing video with image understanding
  8. Comparing video with image understanding. Classification: videos, 4,800 categories, 15.2% Top5 error; images (Krizhevsky 2012, Xie 2016), 1,000 categories*, 3.1% Top5 error (*transfer learning widespread).
  9. Comparing video with image understanding. Classification: videos, 4,800 categories, 15.2% Top5 error; images (Krizhevsky 2012, Xie 2016), 1,000 categories*, 3.1% Top5 error. Detection: videos, tens of categories, ~10-20 mAP at 0.5 overlap; images (He 2017), hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation (*transfer learning widespread).
  10. Comparing video with image understanding. Classification: videos, 4,800 categories, 15.2% Top5 error; images (Krizhevsky 2012, Xie 2016), 1,000 categories*, 3.1% Top5 error. Detection: videos, tens of categories, ~10-20 mAP at 0.5 overlap; images (He 2017), hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation. Captioning: videos, just getting started (short clips, niche domains); images (Johnson 2016, Krause 2017), dense captioning, coherent paragraphs (*transfer learning widespread).
  11. Comparing video with image understanding. Classification: videos, 4,800 categories, 15.2% Top5 error; images (Krizhevsky 2012, Xie 2016), 1,000 categories*, 3.1% Top5 error. Detection: videos, tens of categories, ~10-20 mAP at 0.5 overlap; images (He 2017), hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation. Captioning: videos, just getting started (short clips, niche domains); images (Johnson 2016, Krause 2017), dense captioning, coherent paragraphs. Beyond: significant work on question-answering for images (Yang 2016), not yet for videos (*transfer learning widespread).
  12. The challenge of scale. Training labels: video annotation is labor-intensive. Models: temporal dimension adds complexity. Inference: video processing is computationally expensive.
  13. The challenge of scale. Training labels: video annotation is labor-intensive. Models: temporal dimension adds complexity. Inference: video processing is computationally expensive (Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016).
  14. Task: temporal action detection. Input: video from t = 0 to t = T; output: temporal extents of action instances (e.g., Running, Talking); see the temporal IoU sketch after the slide transcript. Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  15. Efficient video processing. Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  16. Our model for efficient action detection (slides 16-26 build up the diagram): a convolutional neural network extracts frame information from each glimpsed frame; a recurrent neural network carries time information across glimpses; at each step, from t = 0 to t = T, the model outputs the next frame to glimpse and, optionally, a detection instance [start, end]; see the model sketch after the slide transcript. Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  27. Our model for efficient action detection: differentiable outputs (detection class and bounds) are trained with standard backpropagation; non-differentiable outputs (where to look next, when to emit a prediction) are trained with reinforcement learning (the REINFORCE algorithm; see the training sketch after the slide transcript). The model achieves detection performance on par with dense sliding-window approaches while observing only 2% of frames. Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  28. Learned policy in action. Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  29. Learned policy in action. Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  30. The challenge of scale. Training labels: video annotation is labor-intensive. Models: temporal dimension adds complexity (Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017). Inference: video processing is computationally expensive (Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016).
  31. Dense action labeling. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  32. MultiTHUMOS: extends the THUMOS'14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos. THUMOS vs. MultiTHUMOS: annotations 6,365 vs. 38,690; classes 20 vs. 65; density (labels/frame) 0.3 vs. 1.5; classes per video 1.1 vs. 10.5; max actions per frame 2 vs. 9; max actions per video 3 vs. 25. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  33. Modeling dense, multilabel actions: need to reason about multiple potential actions simultaneously; high degree of temporal dependency. In standard recurrent models for action recognition, all state lives in the hidden-layer representation, and at each time step the model predicts the current frame's labels from the current frame and the previous hidden representation. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  34. MultiLSTM: an extension of the LSTM that expands the temporal receptive field of the input and output connections. Key idea: giving the model more freedom in both reading input and writing output reduces the burden placed on the hidden-layer representation. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  35. MultiLSTM (background): standard LSTM diagram (Donahue 2014) mapping input video frames to frame class predictions over time t. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  36. Standard LSTM: single input, single output per time step (input video frames to frame class predictions). Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  37. Standard LSTM: single input, single output; MultiLSTM: multiple inputs, multiple outputs (input video frames to frame class predictions over time t). Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  38. MultiLSTM: multiple inputs combined with soft attention, multiple outputs combined by weighted average, trained with a multilabel loss; see the MultiLSTM sketch after the slide transcript. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  39. MultiLSTM. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  40. Retrieving sequential actions. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  41. Retrieving co-occurring actions. Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
  42. The challenge of scale. Training labels: video annotation is labor-intensive (Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017). Models: temporal dimension adds complexity (Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017). Inference: video processing is computationally expensive (Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016).
  43. Labeling videos is expensive: it takes significantly longer to label a video than an image, and even longer if spatial or temporal bounds are desired. How can we practically learn about new concepts in video?
  44. Web queries are a source of noisy video labels.
  45. Image search is much cleaner!
  46. Can we effectively learn from noisy web queries? Our approach: learn how to select positive training examples from noisy queries in order to train classifiers for new classes. Use a reinforcement learning-based formulation to learn a data labeling policy that achieves strong performance on a small, manually labeled set of classes, then use this policy to automatically label noisy web data for new classes. Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
  47. Balancing diversity vs. semantic drift: diverse training examples improve the classifier, but too much diversity can lead to semantic drift. Our approach balances the two by training labeling policies against an annotated reward set which the policy must successfully classify. Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
  48. Overview of approach (slides 48-50 build up the diagram): candidate web queries from YouTube autocomplete (e.g., "Boomerang", "Boomerang on a beach", "Boomerang music video"); the agent labels new positives (e.g., "Boomerang on a beach") and adds them to the current positive set; the classifier is updated from the current positive set and a fixed negative set; the agent's state is updated; the training reward comes from evaluating the classifier on the reward set; see the labeling-loop sketch after the slide transcript. Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
  51. Sports1M: greedy classifier vs. ours (slides 51-52). Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
  53. Novel classes: greedy classifier vs. ours. Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
  54. The challenge of scale (recap). Training labels: video annotation is labor-intensive (Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017). Models: temporal dimension adds complexity (Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017). Inference: video processing is computationally expensive (Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016).
  55. Promising directions for scaling further (slides 55-56): learning to learn; unsupervised learning.
  57. Towards knowledge: from videos, through the training-label, model, and inference challenges above, to knowledge of the dynamic visual world.
  58. Collaborators: Olga Russakovsky, Mykhaylo Andriluka, Ning Jin, Vignesh Ramanathan, Liyue Shen, Greg Mori, Fei-Fei Li.
  59. Thank You
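
The detection numbers on slides 5-11 are reported as mAP at 0.5 overlap, and slide 14 defines the task as predicting the temporal extents of action instances. Below is a minimal sketch of the temporal intersection-over-union criterion that such an overlap threshold typically refers to; the function names and example values are illustrative, not taken from the talk.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two temporal intervals (start, end), e.g. in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A predicted instance counts as correct if it overlaps a ground-truth instance
    of the same class with IoU >= threshold (0.5 in the slides)."""
    return temporal_iou(pred, gt) >= threshold

# Example: a prediction covering [12.0, 20.0] s against ground truth [10.0, 18.0] s.
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))      # 0.6
print(is_true_positive((12.0, 20.0), (10.0, 18.0)))  # True
```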
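Slides 16-26 build up the frame-glimpse detection model: a convolutional network encodes each observed frame, a recurrent network carries temporal state, and at every step the model emits the next frame to glimpse plus, optionally, a detection instance [start, end]. The PyTorch-style skeleton below is a rough sketch of that structure under my own assumptions; the layer sizes, the GRU cell, and the normalized frame-location output are illustrative and not details from the paper.

```python
import torch
import torch.nn as nn

class GlimpseDetector(nn.Module):
    """Sketch of a recurrent frame-glimpse model: observe a frame, update state,
    decide where to look next and whether to emit a detection [start, end]."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.frame_encoder = nn.Sequential(            # stand-in for a pretrained CNN
            nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.next_loc = nn.Linear(hidden_dim, 1)       # where to glimpse next (fraction of video)
        self.emit = nn.Linear(hidden_dim, 1)           # whether to emit a detection now
        self.detection = nn.Linear(hidden_dim, 3)      # [start, end, confidence]

    def step(self, frame_feat, h):
        x = self.frame_encoder(frame_feat)
        h = self.rnn(x, h)
        next_location = torch.sigmoid(self.next_loc(h))    # normalized position in [0, 1]
        emit_prob = torch.sigmoid(self.emit(h))            # sampled, non-differentiable decision
        start_end_conf = torch.sigmoid(self.detection(h))  # normalized [start, end, confidence]
        return h, next_location, emit_prob, start_end_conf

# Toy usage: one glimpse step on a random frame feature.
model = GlimpseDetector()
h = torch.zeros(1, 256)
frame_feat = torch.randn(1, 512)
h, loc, emit_p, det = model.step(frame_feat, h)
print(loc.shape, emit_p.shape, det.shape)
```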
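Slide 27 notes that the non-differentiable decisions (where to look next, when to emit a prediction) are trained with the REINFORCE algorithm, while the differentiable detection outputs use ordinary backpropagation. The snippet below is a generic REINFORCE policy-gradient update, not the paper's exact objective; the reward values and baseline are placeholders chosen for illustration.

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=0.0):
    """Generic REINFORCE objective: maximize E[R] by minimizing -(R - b) * log pi(a).

    log_probs: log-probabilities of the sampled actions (one per decision)
    rewards:   returns credited to those decisions
    baseline:  variance-reduction constant (e.g. a running mean of past rewards)
    """
    advantages = rewards - baseline
    # Advantages are treated as constants: gradients flow only through log_probs.
    return -(advantages.detach() * log_probs).sum()

# Toy example: three sampled "emit / don't emit" decisions with probabilities p,
# credited with a placeholder reward of +1 for a correct episode.
p = torch.tensor([0.7, 0.4, 0.9], requires_grad=True)
actions = torch.tensor([1.0, 0.0, 1.0])                     # sampled binary decisions
log_probs = actions * torch.log(p) + (1 - actions) * torch.log(1 - p)
rewards = torch.tensor([1.0, 1.0, 1.0])
loss = reinforce_loss(log_probs, rewards, baseline=0.5)
loss.backward()
print(loss.item(), p.grad)
```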
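Slides 34-38 describe MultiLSTM: soft attention over a window of input frames, a weighted average over recent outputs, and a multilabel loss. The sketch below shows one way such a step could look; the window size, attention parameterization, and use of binary cross-entropy are my assumptions rather than details confirmed in the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLSTMStep(nn.Module):
    """One MultiLSTM-style step: attend over a window of input frame features,
    run an LSTM cell, and average predictions over a window of recent outputs."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=65, window=5):
        super().__init__()
        self.window = window
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # soft attention scores
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_window, state, recent_logits):
        # frame_window: (window, feat_dim) features of the last `window` frames
        h, c = state
        scores = self.attn(torch.cat(
            [frame_window, h.expand(self.window, -1)], dim=1))        # (window, 1)
        weights = F.softmax(scores, dim=0)
        attended = (weights * frame_window).sum(dim=0, keepdim=True)  # (1, feat_dim)
        h, c = self.cell(attended, (h, c))
        logits = self.classifier(h)                                   # (1, num_classes)
        # Multiple outputs: average the current prediction with recent ones.
        smoothed = torch.stack(recent_logits + [logits]).mean(dim=0)
        return smoothed, (h, c), logits

# Multilabel loss on a frame: several actions can be active at once.
step = MultiLSTMStep()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
frames = torch.randn(5, 512)
smoothed, state, logits = step(frames, state, recent_logits=[])
targets = torch.zeros(1, 65)
targets[0, [3, 17]] = 1.0                    # two actions present in this frame
loss = F.binary_cross_entropy_with_logits(smoothed, targets)
print(loss.item())
```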
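Slides 46-50 outline the "learning to learn from noisy web videos" loop: an agent picks among candidate web queries, the returned videos are added as new positives, the classifier is retrained against a fixed negative set, and the reward comes from evaluating the classifier on an annotated reward set. The sketch below captures that loop at a high level; every helper it calls (choose_query, search_videos, train_classifier, evaluate) is a hypothetical placeholder, not an actual API from the paper.

```python
# High-level sketch of the query-selection loop described on slides 46-50.
# All helper functions passed in are hypothetical placeholders.

def learn_from_noisy_queries(candidate_queries, negatives, reward_set,
                             choose_query, search_videos, train_classifier,
                             evaluate, num_steps=10):
    positives = []
    classifier = None
    prev_score = 0.0
    for _ in range(num_steps):
        # Agent (labeling policy) picks the next query, e.g. "Boomerang on a beach".
        query = choose_query(classifier, positives, candidate_queries)
        # Label the returned web videos as new positives for this concept.
        positives.extend(search_videos(query))
        # Update the classifier with the grown positive set and a fixed negative set.
        classifier = train_classifier(positives, negatives)
        # Training reward: improvement of the classifier on the annotated reward set.
        score = evaluate(classifier, reward_set)
        reward = score - prev_score
        prev_score = score
        # During training, `reward` would drive a policy-gradient update of choose_query.
    return classifier
```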
