
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I

ICME2019 Tutorial: Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I


  1. Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition. Ting Yao, Principal Researcher, Vision and Multimedia Lab, JD AI Research. Tutorial @ ICME, July 8th, 2019.
  2. (example figure: image tags “horse”, “grass”, “person”; captions “a boy is cleaning the floor” and “not just beautiful”)
  3.–4. (figure-only slides, no extractable text)
  5. 2011–2016 timeline, hand-crafted features: Action Recognition by Dense Trajectories [Wang et al., CVPR 2011].
  6. 2011–2016 timeline, 2D convolutional networks: Large-scale Video Classification with Convolutional Neural Networks [Karpathy et al., CVPR 2014]; Two-Stream Convolutional Networks for Action Recognition in Videos [Simonyan et al., NIPS 2014].
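  A hedged sketch of two-stream late fusion: a spatial network sees an RGB frame, a temporal network sees stacked optical flow, and their class posteriors are averaged. The class count and tensors below are illustrative placeholders, not the cited models.

      import torch

      # Spatial stream (RGB frame) and temporal stream (stacked optical flow)
      # each produce class logits; late fusion averages the softmax posteriors.
      num_classes = 101                          # e.g. UCF101
      spatial_logits = torch.randn(1, num_classes)
      temporal_logits = torch.randn(1, num_classes)
      fused = (torch.softmax(spatial_logits, dim=1)
               + torch.softmax(temporal_logits, dim=1)) / 2
      print(fused.argmax(dim=1))                 # predicted action class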
  7. 2011–2016 timeline, 2D CNN + LSTM (LRCN): Long-term Recurrent Convolutional Networks for Visual Recognition and Description [Donahue et al., CVPR 2015].
  8. 2011–2016 timeline, 3D convolutional network (C3D): Learning Spatiotemporal Features with 3D Convolutional Networks [Tran et al., ICCV 2015].
  9. 2011–2016 timeline, temporal segment networks (TSN): Temporal Segment Networks: Towards Good Practices for Deep Action Recognition [Wang et al., ECCV 2016].
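  TSN's key practice is sparse sampling: divide the video into K segments, draw one snippet from each, and form a segmental consensus over the snippet predictions. A minimal sketch of the sampling step, with the frame count assumed:

      import random

      def sample_snippet_indices(num_frames: int, k: int = 3) -> list[int]:
          """Pick one random frame index from each of k equal segments."""
          bounds = [num_frames * i // k for i in range(k + 1)]
          return [random.randrange(bounds[i], bounds[i + 1]) for i in range(k)]

      print(sample_snippet_indices(300, k=3))    # e.g. [47, 152, 268]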
  10. Recognition pipeline: Input Video → Preprocessing (frame, optical flow) → Sample Strategy (sparse sample, dense sample) → Backbone Network (2D CNN, 3D CNN) → Feature Aggregation (pooling, quantization, attention, RNN) → Stream Fusion.
  11. Recognition pipeline, with the Backbone Network stage in focus.
  12. State of the art, image domain vs. video domain: VGG [Simonyan et al., ICLR 2015] ↔ C3D [Tran et al., ICCV 2015]; Inception [Szegedy et al., CVPR 2015] ↔ I3D [Carreira et al., CVPR 2017]; ResNet [He et al., CVPR 2016] ↔ P3D [Qiu et al., ICCV 2017].
  13. 2D vs. 3D convolution cost. 2D ResNet-152: time cost 9 × C² × H × W, model size 230 MB. 3D ResNet-152: time cost 27 × C² × T × H × W, model size 690 MB.
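  The two cost figures follow directly from the kernel sizes: a 3×3 kernel costs 9·C² multiply-accumulates per spatial position, a 3×3×3 kernel 27·C² per spatio-temporal position. A quick back-of-the-envelope check (shapes assumed for illustration):

      # Multiply-accumulate cost of one conv layer with C input and C output
      # channels, stride 1, "same" padding; numbers are illustrative only.
      def conv2d_macs(C, H, W, k=3):
          return (k * k) * C * C * H * W         # 9 * C^2 * H * W for k=3

      def conv3d_macs(C, T, H, W, k=3):
          return (k ** 3) * C * C * T * H * W    # 27 * C^2 * T * H * W for k=3

      C, T, H, W = 256, 16, 56, 56
      print(conv3d_macs(C, T, H, W) / conv2d_macs(C, H, W))  # 48.0 = 3 * T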
  14. (figure-only slide: spatial 2D convolutions)
  15. Bottleneck architectures: (a) residual unit: 1×1 conv → 3×3 conv → 1×1 conv with ReLUs; (b) P3D-A: 1×1×1 → 1×3×3 (spatial) → 3×1×1 (temporal) → 1×1×1 in series; (c) P3D-B: spatial 1×3×3 and temporal 3×1×1 in parallel; (d) P3D-C: spatial 1×3×3 followed by temporal 3×1×1, with the spatial output also added directly to the block output. A very deep 3D CNN, yet with lighter weights than C3D.
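  A minimal PyTorch sketch of the P3D-A variant, which factorizes the 3×3×3 kernel into a 1×3×3 spatial convolution followed in series by a 3×1×1 temporal convolution inside a residual bottleneck. BatchNorm and the paper's exact channel widths are omitted; this illustrates the structure, not the published implementation.

      import torch
      import torch.nn as nn

      class P3DBlockA(nn.Module):
          def __init__(self, channels: int, bottleneck: int):
              super().__init__()
              self.reduce = nn.Conv3d(channels, bottleneck, kernel_size=1)
              # 2D spatial filter applied frame by frame
              self.spatial = nn.Conv3d(bottleneck, bottleneck,
                                       kernel_size=(1, 3, 3), padding=(0, 1, 1))
              # 1D temporal filter applied per spatial location
              self.temporal = nn.Conv3d(bottleneck, bottleneck,
                                        kernel_size=(3, 1, 1), padding=(1, 0, 0))
              self.expand = nn.Conv3d(bottleneck, channels, kernel_size=1)
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x):                      # x: (N, C, T, H, W)
              out = self.relu(self.reduce(x))
              out = self.relu(self.spatial(out))     # P3D-A: spatial, then...
              out = self.relu(self.temporal(out))    # ...temporal, in series
              out = self.expand(out)
              return self.relu(out + x)              # residual connection

      x = torch.randn(1, 64, 16, 56, 56)
      print(P3DBlockA(64, 16)(x).shape)              # (1, 64, 16, 56, 56)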
  16. Accuracy ordering: R(2+1)D > MCx > rMCx > R3D > R2D.
  17. Accuracy ordering of 3D backbones: ResNeXt-101 > Wide ResNet-50 > ResNet-200 > ResNet-152 > ResNet-101 > ResNet-50 > DenseNet-201 > DenseNet-121.
  18. • Involve large-range (global) context in representation learning • Model the diffusion between local and global features (a minimal sketch of the idea follows).
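  A minimal sketch of the local-global diffusion idea: a global vector g and a local feature map x update each other in every block, so large-range context flows into the local path and pooled local evidence refreshes the global path. The layer choices and update rules here are assumptions for illustration, not the published LGD-3D block (Qiu et al., CVPR 2019).

      import torch
      import torch.nn as nn

      class LocalGlobalDiffusion(nn.Module):
          def __init__(self, channels: int):
              super().__init__()
              self.local_conv = nn.Conv3d(channels, channels, 3, padding=1)
              self.g2l = nn.Linear(channels, channels)  # global -> local
              self.l2g = nn.Linear(channels, channels)  # local -> global

          def forward(self, x, g):            # x: (N, C, T, H, W); g: (N, C)
              # local path also receives the broadcast global context
              x = torch.relu(self.local_conv(x)
                             + self.g2l(g)[:, :, None, None, None])
              # global path absorbs the pooled local response
              g = torch.relu(g + self.l2g(x.mean(dim=(2, 3, 4))))
              return x, g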
  19. Recognition pipeline, with the Feature Aggregation stage in focus.
  20. Aggregation framework (figure): frames (Nf) → 2D CNN / 3D CNN → convolutional activations → spatial pyramid pooling → FV-VAE → gradient vector → visual representation; a training epoch is driven by the loss function on the video label, an extraction epoch produces the representation from the local feature set.
  21. Aggregation framework (same figure as above), first option: • Global Average Pooling (sketch below).
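  The baseline aggregation is as simple as it sounds: average the per-frame (or per-clip) backbone features over time. Shapes assumed for illustration:

      import torch

      frame_features = torch.randn(32, 2048)      # Nf frames x feature dim
      video_feature = frame_features.mean(dim=0)  # one 2048-d video vector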
  22. (figure: per-frame CNN features weighted by attention cells (AttCell) and aggregated by LSTMs)
  23. (figure-only slide)
  24. Aggregation options: • Global Average Pooling • Attention: Visual Attention [Sharma et al., ICLR workshop 2015], Recurrent Attention [Du et al., TIP 2018], Unified Attention [Li et al., TMM 2018].
  25. Aggregation options (continued from slide 24), adding: • RNN: LRCN [Donahue et al., CVPR 2015], Hybrid Framework (LSTM) [Wu et al., ACM MM 2015]. A sketch of attention pooling follows.
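  A hedged sketch of attention-based aggregation, standing in for the attention cells in the figure rather than any one of the cited models: each frame feature is scored, the scores are softmax-normalized over time, and the video representation is the weighted sum.

      import torch
      import torch.nn as nn

      class TemporalAttentionPool(nn.Module):
          def __init__(self, dim: int):
              super().__init__()
              self.score = nn.Linear(dim, 1)       # one scalar per frame

          def forward(self, feats):                # feats: (T, D)
              weights = torch.softmax(self.score(feats), dim=0)  # (T, 1)
              return (weights * feats).sum(dim=0)                # (D,)

      feats = torch.randn(32, 2048)
      pooled = TemporalAttentionPool(2048)(feats)  # attention-weighted vector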
  26. (figure-only slide)
  27. • Global activations: fully connected layer or global pooling layer • Fisher Vector (FV): generative model is a GMM • Fisher Vector with Variational Auto-Encoder (FV-VAE): generative model is a VAE (figure contrasts FV encoding and FV-VAE encoding of convolutional activations, each with a normalization term).
  28. • Assumption of FV: data is generated from a Gaussian mixture model, which may not hold in practice. • VAE: the encoder $q_\phi(\mathbf{z}|\mathbf{x})$ learns new representations $\mathbf{z}$ for a given input $\mathbf{x}$; the decoder $p_\theta(\mathbf{x}|\mathbf{z})$ generates the FV of the new representations $\mathbf{z}$. Reconstruction loss: $\mathcal{L}_{rec} = -\log \mu_{\mathbf{x}_t} = -\log p_\theta(\mathbf{x}_t|\mathbf{z}_t)$. FV: $\mathcal{G}_\theta^X = F_\theta^{-1/2}\nabla_\theta \log u_\theta(X) = -F_\theta^{-1/2}\sum_{t=1}^{T_x}\nabla_\theta \mathcal{L}_{rec}(\mathbf{x}_t;\theta,\phi)$. Training: encoder → sampling → decoder, with reconstruction, regularization, and classification losses; extraction: encoder → identity → decoder, accumulating the gradient vector by back propagation.
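  A minimal sketch of the extraction epoch: encode each local feature, decode it, and use the accumulated gradient of the reconstruction loss with respect to the decoder parameters θ as the representation. Linear encoder/decoder layers, a Gaussian reconstruction loss, and the omission of the Fisher normalization F_θ^(-1/2) are all simplifying assumptions.

      import torch
      import torch.nn as nn

      D, Z = 512, 64                       # feature / latent sizes (assumed)
      encoder = nn.Linear(D, Z)            # stands in for q_phi(z|x)
      decoder = nn.Linear(Z, D)            # stands in for p_theta(x|z)

      x = torch.randn(100, D)              # T_x local features from one video
      z = encoder(x)                       # extraction uses identity sampling
      rec_loss = ((decoder(z) - x) ** 2).sum()   # -log p, Gaussian, up to const
      rec_loss.backward()

      # gradient vector accumulated over all x_t, flattened into one descriptor
      grad_vector = torch.cat([p.grad.flatten() for p in decoder.parameters()])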
  29. FV-VAE-based action recognition framework: a CNN serves as the convolutional feature extractor, and the spatial pyramid pooling output (region feature set) is encoded by FV-VAE into a gradient vector used as the video representation (figure example: “Ice Dancing”; a training epoch optimizes the loss function, an extraction epoch produces the representation).
  30. Recognition pipeline, with the Stream Fusion stage in focus.
  31. Different actions may span different granularities: a single frame (human + guitar → Playing Guitar), consecutive frames (Jumping Jack), a clip of multiple adjacent frames (Cliff Diving), or the whole video (Basketball Dunk).
  32. • Multi-granular spatio-temporal architecture for video action recognition • Hierarchical modeling (4 granularities) • Fusion based on the multi-granular score distribution.
  33.–34. (figure-only slides)
  35. Distribution-based classifier: the softmax scores from the four granularities (single frame, consecutive frames, clips, video; e.g. Surfing scores 0.4, 0.2, 0.7, 0.9) are sorted in descending order (0.9, 0.7, 0.4, 0.2) and fused with a weight vector w. Here w = [1, 0, …, 0] recovers max-pooling, w = [1, 1, …, 1] average pooling, and an optimized w yields an improved Surfing score of 0.8 (sketch below).
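  A small numeric sketch of that fusion rule, using the slide's Surfing scores; the "optimized" weights are made-up placeholders for whatever training would produce:

      import torch

      scores = torch.tensor([0.4, 0.2, 0.7, 0.9])   # four granularities
      sorted_scores, _ = torch.sort(scores, descending=True)

      w_max = torch.tensor([1.0, 0.0, 0.0, 0.0])    # max-pooling
      w_avg = torch.full((4,), 0.25)                # average pooling
      w_opt = torch.tensor([0.5, 0.3, 0.15, 0.05])  # illustrative learned w

      for w in (w_max, w_avg, w_opt):
          print((w * sorted_scores).sum())          # fused class score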
  36. (figure-only slide)
  37. (timeline slide: 2011–2019)
  38. Results on UCF101 and HMDB51:
      Method | UCF101 | HMDB51
      Improved dense trajectories (IDT) [Wang & Schmid, ICCV 2013] | 85.9% | 57.2%
      Higher-dimensional IDT [Peng et al., CVIU 2016] | 87.9% | 61.1%
      2D CNN Slow Fusion [Karpathy et al., CVPR 2014] | 65.4% | --
      Two-stream ConvNet [Simonyan et al., NIPS 2014] | 88.0% | 59.4%
      Factorized ST-ConvNet [Sun et al., ICCV 2015] | 88.1% | 59.1%
      Two-stream + LSTM [Yue-Hei Ng et al., CVPR 2015] | 88.6% | --
      Two-stream conv fusion [Feichtenhofer et al., CVPR 2016] | 92.5% | 67.3%
      Two-stream ST residual networks [Feichtenhofer et al., NIPS 2016] | 93.4% | 66.4%
      Temporal Segment Networks [Wang et al., ECCV 2016] | 94.0% | 68.5%
      C3D [Tran et al., ICCV 2015] | 82.3% | 56.8%
      P3D ResNet [Qiu et al., ICCV 2017] | 89.8% | 58.6%
      Two-stream P3D ResNet [Qiu et al., ICCV 2017] | 94.5% | 71.8%
      I3D [Carreira et al., CVPR 2017] | 93.4% | 66.4%
      I3D + Kinetics pre-training [Carreira et al., CVPR 2017] | 97.9% | 80.2%
      LGD-3D + Kinetics pre-training [Qiu et al., CVPR 2019] | 98.2% | 80.5%
  39.–42. (figure-only slides)
  43. (figure: clip-level recognition — a feature extractor scores clips of the same action independently: Pole Vault 0.61, 0.83, 0.51)
  44. (figure: a 3D CNN with a Gaussian kernel over time yields a single, more confident prediction: Pole Vault 0.96; sketch below)
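  A hedged sketch of the idea: weight clip-level features with a Gaussian kernel over time (center μ, width σ) before classifying, so the whole action span contributes to one confident prediction instead of several weaker per-clip ones. All values are illustrative.

      import torch

      T, D = 100, 512                       # clips per video, feature dim
      clip_feats = torch.randn(T, D)
      t = torch.arange(T, dtype=torch.float32)
      mu, sigma = 50.0, 12.0                # assumed kernel center / scale
      w = torch.exp(-0.5 * ((t - mu) / sigma) ** 2)
      w = w / w.sum()                       # normalize to a weighting over time
      span_feat = (w[:, None] * clip_feats).sum(dim=0)   # (D,) span feature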
  45.–51. (figure-only slides)
  52. Recognition pipeline (summary): Input Video → Preprocessing (frame, optical flow) → Sample Strategy (sparse sample, dense sample) → Backbone Network (2D CNN, 3D CNN) → Feature Aggregation (pooling, quantization, attention, RNN) → Stream Fusion.
  53.–55. (figure-only slides)
  56. Thanks! tingyao.ustc@gmail.com
