
Step zhedong


Xitong Yang

Published in: Technology

  1. STEP: Spatio-Temporal Progressive Learning for Video Action Detection
     Xitong Yang¹,², Xiaodong Yang², Ming-Yu Liu², Fanyi Xiao²,³, Larry Davis¹, Jan Kautz²
     ¹University of Maryland, College Park; ²NVIDIA; ³University of California, Davis
  2. About Me (Xitong Yang, 杨希桐)
     ► Education
       ► 2016 – Present: Ph.D., University of Maryland, College Park; Prof. Larry Davis
       ► 2014 – 2016: M.S., University of Rochester; Prof. Jiebo Luo
       ► 2010 – 2014: B.E., Beijing Institute of Technology
     ► Internship
       ► 2018, 2019: NVIDIA; Xiaodong Yang, Ming-Yu Liu, Sifei Liu, Jan Kautz
       ► 2017: Honda Research Institute; Yi-Ting Chen, Teruhisa Misu
       ► 2016: PARC East; Sriganesh Madhvanath, Raja Bala
     ► Research Interest
       ► Computer vision, video understanding
  3. Spatio-temporal Action Detection
     [Figure: example video along the Time axis; action label: LongJump]
  4. Object Detection
     ► Two-stage methods
       ► Fast / Faster R-CNN
     ► One-stage methods
       ► SSD
     [Figures: Faster R-CNN (Ren et al., NeurIPS 2015); SSD (Liu et al., ECCV 2016)]
  5. Object Detection Pipeline
     ► Proposals / Anchors
     ► Classification: object recognition
     ► Regression: bounding box refinement
     ► Post-processing
     Source: https://www.saagie.com/fr/blog/object-detection-part1
  6. From Object Detection to Action Detection
     ► Use optical flow as additional input
     ► From frame-level prediction to clip-level prediction
     ► Process long sequences (use 3D CNNs)
     ► Replicate 2D proposals over time to obtain 3D proposals
     [Figures: Two-stream R-CNN (Peng et al., ECCV 2016); Kalogeiton et al., ICCV 2017; I3D + Faster R-CNN (Girdhar et al., 2018)]
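
The last bullet of "From Object Detection to Action Detection", replicating 2D proposals over time to obtain 3D proposals, can be sketched as below. This is a minimal illustration and not code from any of the cited detectors; the function name `replicate_proposals` and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def replicate_proposals(boxes_2d: Sequence[Box], num_frames: int) -> List[List[Box]]:
    """Turn each 2D box into a "3D proposal": the same box repeated on every frame.

    Hypothetical helper: the result is a list of tubes, and each tube is a
    list of `num_frames` identical boxes.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return [[tuple(box)] * num_frames for box in boxes_2d]


# Two 2D proposals replicated over a 4-frame clip.
tubes = replicate_proposals([(0, 0, 5, 5), (1, 1, 3, 3)], num_frames=4)
print(len(tubes), len(tubes[0]))
```

Because every frame of a tube shares one box, a tube built this way cannot follow an actor who moves within the clip.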
  10. Challenges
      ► Extended two-stage methods
        ✕ Effective temporal modeling
          ► Spatial displacement over time
        ✕ Efficient detection
          ► Thousands of proposals
          ► Processing long sequences
  11. Spatio-TEmporal Progressive Learning (STEP)
  12. What is STEP
      ► Goals of STEP
        ✓ Effective temporal modeling
          ► Adapt to spatial displacement
        ✓ Efficient detection
          ► Use a small number of proposals
  13. What is STEP
      ► STEP = progressive learning + spatial refinement + temporal extension
      [Figure: Step and Time axes; legend: Initial Proposal, Refined Tubelet, Extended Tubelet]
  16. Our Approach: STEP
      [Figure: video frames along the Time axis, centered at frame t]
  17. Our Approach: STEP (s=1: anchors)
  21. Our Approach: STEP (s=1: temporal extension)
  23. Our Approach: STEP (s=1: spatial refinement)
  25. Our Approach: STEP (s=2: temporal extension)
  27. Our Approach: STEP (s=2: spatial refinement)
  29. Our Approach: STEP (s=3: temporal extension)
  31. Our Approach: STEP (s=3: spatial refinement)
  32. Our Approach: STEP
      ► STEP
        ✓ Effective temporal modeling
          ► Adaptive temporal extension
        ✓ Efficient detection
          ► Use only 11 proposals on UCF101-24 and 34 on AVA
          ► Progressively increase the sequence length
        ✓ Generic learning framework for video understanding
          ► Instantiate with different backbones / refinement schedules
      [Figure: Step and Time axes; legend: Initial Proposal, Refined Tubelet, Extended Tubelet]
  33. Related Work: Iterative Methods in Vision
      ► Iterative pose estimation (Carreira et al., CVPR16)
      ► Object detection: Grid-CNN (Najibi et al., CVPR16)
      ► Recurrent image generation: DRAW (Gregor et al., ICML15)
      ► Object detection: Cascade R-CNN (Cai et al., CVPR18)
  34. Model Details
      ► Spatial refinement
        ► Two branches for classification & regression
      ► Action detection
        ► Classification: temporal information, context, interaction, …
        ► Regression: precise localization, bounding box of the actor, …
      [Diagram labels: Convolutional Features, Proposals, RoI Pool, Temporal Modeling, Global Branch, Local Branch, Classification, Regression]
  35. Model Details
      ► Spatial refinement
        ► Two branches for classification & regression
      ► Temporal extension
        ► Linear extrapolation / location anticipation
      [Diagram labels: Convolutional Features, Proposals, RoI Pool, Temporal Modeling, Global Branch, Local Branch, Classification, Regression]
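
One way to picture the "linear extrapolation" option for temporal extension is sketched below. The slide gives no formula, so this is an assumption-laden illustration rather than the authors' method: it assumes each box coordinate changes at a constant per-frame rate, estimated from the first and last box of the current tubelet, and the function name `extend_tubelet` is hypothetical.

```python
from typing import List, Sequence


def extend_tubelet(boxes: Sequence[Sequence[float]], m: int) -> List[List[float]]:
    """Extend a tubelet by m frames on each side via linear extrapolation.

    Assumption (not taken from the slides): each coordinate changes at a
    constant rate, estimated from the first and last box of the tubelet.
    """
    k = len(boxes)
    if k < 2:
        raise ValueError("need at least two boxes to estimate motion")
    # Average per-frame change of every coordinate across the tubelet.
    rate = [(last - first) / (k - 1) for first, last in zip(boxes[0], boxes[-1])]
    # Extrapolate backwards from the first box, earliest frame first.
    before = [[c - r * i for c, r in zip(boxes[0], rate)] for i in range(m, 0, -1)]
    # Extrapolate forwards from the last box.
    after = [[c + r * i for c, r in zip(boxes[-1], rate)] for i in range(1, m + 1)]
    return before + [list(b) for b in boxes] + after


# A 3-frame tubelet drifting right by 2 pixels per frame, extended by 1 frame per side.
extended = extend_tubelet([[0, 0, 10, 10], [2, 0, 12, 10], [4, 0, 14, 10]], m=1)
print(len(extended))
```

The "location anticipation" option named on the same slide is not covered by this sketch.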
  38. Model Details
      ► Progressive learning
        ► Joint training
      [Diagram along the Time axis; labels: Backbone, Proposals, RoI Pool (one per step), Classification, Regression, S1, S2, S3, P1, P2, P3, L0, L1, L2, L3, T1, T2]
  39. Model Details
      ► The problem of distribution shift over different steps
      ► Our training strategies
        ► Increasing IoU thresholds for 3 steps (0.2 → 0.35 → 0.5)
        ► Separate header networks for different steps
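
The increasing IoU thresholds (0.2, 0.35, 0.5 for steps 1 to 3) can be illustrated with a small labeling helper. This is only a sketch under assumptions: the slide states the threshold values but not how overlaps are computed or how samples are assigned, so the helper takes precomputed overlaps and the names `STEP_IOU_THRESHOLDS` and `label_proposals` are hypothetical.

```python
from typing import List, Sequence

# Thresholds quoted on the slide, one per step (steps are 1-indexed).
STEP_IOU_THRESHOLDS = (0.2, 0.35, 0.5)


def label_proposals(overlaps: Sequence[float], step: int) -> List[bool]:
    """Mark each proposal as positive if its overlap with ground truth
    reaches the threshold of the given step.

    `overlaps` holds precomputed IoU values; how they are computed for
    tubelets is left open here because the slide does not say.
    """
    if not 1 <= step <= len(STEP_IOU_THRESHOLDS):
        raise ValueError("step must be 1, 2, or 3")
    threshold = STEP_IOU_THRESHOLDS[step - 1]
    return [iou >= threshold for iou in overlaps]


overlaps = [0.1, 0.3, 0.6]
for step in (1, 2, 3):
    print(step, label_proposals(overlaps, step))
```

In this toy input, the proposal with overlap 0.3 counts as positive at step 1 but not at steps 2 and 3, which shows how the criterion tightens as proposals are refined.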
  40. Experiments
  41. Experiment Setup
      ► Dataset
        ► UCF101-24
          ► A subset of the UCF-101 dataset that consists of videos from 24 action classes and their corresponding bounding box annotations.
        ► AVA
          ► Complex actions (60 classes) and scenes sourced from movies. Annotations are provided at 1-second intervals.
      ► Evaluation
        ► Frame-mAP at IoU = 0.5
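
The "IoU = 0.5" criterion behind frame-mAP can be made concrete with a short sketch. It follows the common detection convention of greedily matching score-sorted detections to ground-truth boxes within one frame and one class; the exact protocol used in the experiments is not given on the slide, so treat the matching rule and the names `box_iou` and `match_detections` as assumptions.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # assumed layout: (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections: Sequence[Tuple[float, Box]],
                     gt_boxes: Sequence[Box],
                     iou_threshold: float = 0.5) -> List[bool]:
    """Return true-positive flags for detections, in descending score order.

    Each ground-truth box can be matched at most once, so a second
    detection of the same actor counts as a false positive.
    """
    matched = set()
    flags = []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags


gt = [(0, 0, 10, 10)]
dets = [(0.9, (0, 0, 10, 8)), (0.8, (0, 0, 10, 10)), (0.3, (20, 20, 30, 30))]
print(match_detections(dets, gt))
```

Average precision per class is then computed from these flags over all frames, and frame-mAP is the mean over classes.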
  42. Qualitative Results: Progressive Learning
      [Figures: UCF101-24, AVA]
  43. Qualitative Results: Progressive Learning
      [Figure: results across Steps]
  45. Ablation Study (Spatial Refinement, Temporal Extension, Number of Proposals)
      ► Improvement obtained by more steps
      ► Performance saturates after 3 steps
  47. Ablation Study (Spatial Refinement, Temporal Extension, Number of Proposals)
      ► Improvement obtained by more proposals
      ► More inference time
      ► Achieve SOTA using only 11 proposals
      [Plot: frame-mAP (%) and seconds per batch vs. number of initial proposals (11, 34, 83, 132); reference marker: ACT]
  50. Ablation Study (Spatial Refinement, Temporal Extension, Number of Proposals)
      ► Long-range temporal context benefits action classification
      ► Adaptive temporal extension is more effective (and more efficient)
      [Plot: frame-mAP vs. step (1, 2, 3), with K = 6 → 18 → 30.
       Legend, in slide order: w/o temporal extension (K = 6); w/o temporal extension (K = 30); w/ temporal extrapolation; w/ temporal anticipation.
       Plotted values for steps 1, 2, 3, in slide order: 51.5, 60.7, 62.6; 53.1, 61.8, 63.4; 51.5, 62.8, 65.5; 51.5, 62.5, 66.7]
  51. Comparison with SOTA
      ► UCF101-24
        ► VGG16 backbone
        ► Two-stream fusion
        ► K = 6 → 18 → 30
      ► AVA (v2.1)
        ► I3D backbone
        ► K = 12 → 12 → 36
      * RGB + Flow (updated result on arXiv: 20.2%)
  52. Qualitative Results: UCF101-24
  53. Qualitative Results: AVA
  54. Conclusion
      ► Spatio-TEmporal Progressive learning for action detection
      ► A novel framework for effective temporal modeling on long sequences
      ► A simple, fully end-to-end action detector (without external human detectors)
      ► Code: https://github.com/NVlabs/STEP
  55. Thanks! Q & A
