Human Action Recognition without Human
He Yun1,2, Soma Shirakabe1,2, Yutaka Satoh1,2, Hirokatsu Kataoka1
1Computer Vision Research Group, AIST, Japan
2Human-Centered Vision Lab., University of Tsukuba, Japan
Motion representation
•  Database: UCF101, HMDB51, ActivityNet
•  Approach: IDT, Two-Stream CNN
–  Databases and approaches are already well established in the field
Action Database
http://www.thumos.info/
The problem setting in action recognition
•  Video-level prediction
–  1 action-label prediction per input video
[Figure: Motion Descriptor, output label "Tennis Swing"]
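The video-level setting above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes per-frame class scores are already available and simply averages them to produce one label per video. The function name `predict_video_label` and the dict-per-frame input format are hypothetical.

```python
from collections import defaultdict


def predict_video_label(frame_scores):
    """Average per-frame class scores and return the single top label.

    frame_scores: list of dicts mapping label -> score, one dict per frame
    (a hypothetical input format chosen for this sketch).
    """
    if not frame_scores:
        raise ValueError("need at least one frame")
    totals = defaultdict(float)
    for scores in frame_scores:
        for label, score in scores.items():
            totals[label] += score
    n = len(frame_scores)
    means = {label: total / n for label, total in totals.items()}
    # One label per video: the class with the highest mean score.
    return max(means, key=means.get)
```

Averaging is only one possible aggregation; the slides do not state which one is used.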
Dense Trajectories (DT) [Wang+, CVPR11]
•  Trajectory-based representation
–  A large number of trajectories
–  Feature description (HOG, HOF, MBH)
–  A codeword vector is generated
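The last step, turning many local descriptors into one codeword vector, can be sketched as below. This assumes a plain bag-of-words encoding with hard assignment to the nearest codeword; the slides do not say which encoding is used, and the helper names `nearest_codeword` and `codeword_histogram` are hypothetical. The codebook is taken as given (in practice it would be learned, for example by clustering).

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def codeword_histogram(descriptors, codebook):
    """L1-normalised histogram of codeword assignments for one video."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    # An empty video yields an all-zero vector instead of dividing by zero.
    return [c / total for c in counts] if total else [0.0] * len(codebook)
```

Each descriptor type (HOG, HOF, MBH) would get its own histogram under this scheme, and the histograms would be concatenated before classification.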
Two-Stream CNN [Simonyan+, NIPS14]
•  Spatial and temporal convolution
–  Spatial-stream: From an RGB image
–  Temporal-stream: From stacked optical flows
–  Score fusion: Average or SVM
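The averaging variant of score fusion can be sketched as follows. This is an illustration under assumptions: both streams are taken to output per-class scores of the same length, and the `temporal_weight` parameter is a hypothetical knob added for this sketch (a value of 1.0 gives the plain average). The SVM-based fusion is not shown.

```python
def fuse_scores(spatial, temporal, temporal_weight=1.0):
    """Weighted average of the per-class scores from the two streams."""
    if len(spatial) != len(temporal):
        raise ValueError("both streams must score the same classes")
    total = 1.0 + temporal_weight
    return [(s + temporal_weight * t) / total for s, t in zip(spatial, temporal)]


def argmax(scores):
    """Index of the highest score, i.e. the predicted class."""
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, `argmax(fuse_scores(spatial_scores, temporal_scores))` gives the fused class index for one video.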
Is background enough to classify actions?
•  RGB input is too strong!
–  The two-stream CNN paper [Simonyan+, NIPS14] reported that the spatial stream alone recognizes actions better than expected
•  72.4% with spatial-stream (RGB) @UCF101
•  “Human Action Recognition without Human”
Without Human?
•  Can human action recognition be done using only the motion of the background?
[Figure: two Motion Descriptor panels, one labeled "Tennis Swing" and one labeled "Tennis Swing?"]
Detailed setting of w/ and w/o Human
•  With and without human setting
–  Without human setting: center-blind images (center region masked out) from UCF101
–  With human setting: the inverse of the without human setting
[Figure: I'(x, y) = I(x, y) * f(x, y), shown for the Without Human Setting and the With Human Setting; the mask f(x, y) is divided into 1/4, 1/2, 1/4 along each axis]
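The two settings can be sketched as a masking operation. This is a minimal sketch under stated assumptions, not the authors' code: the 1/4, 1/2, 1/4 labels are read as proportions along each axis, so the mask f blanks a central box half the height and half the width of the frame; the product is taken element-wise; and the image is a single-channel nested list. The names `center_mask` and `apply_setting` are hypothetical.

```python
def center_mask(height, width):
    """Binary mask f: 0 inside the central half-by-half box, 1 elsewhere."""
    top, bottom = height // 4, height - height // 4
    left, right = width // 4, width - width // 4
    return [
        [0 if (top <= y < bottom and left <= x < right) else 1 for x in range(width)]
        for y in range(height)
    ]


def apply_setting(image, without_human=True):
    """Return I'(x, y) = I(x, y) * f(x, y); the with-human case inverts f."""
    height, width = len(image), len(image[0])
    f = center_mask(height, width)
    if not without_human:
        # With human setting: keep only the central box.
        f = [[1 - v for v in row] for row in f]
    return [[image[y][x] * f[y][x] for x in range(width)] for y in range(height)]
```

Under this reading the two outputs are complementary: adding the without-human and with-human images pixel by pixel recovers the original frame.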
Framework
–  Baseline: Very deep two-stream CNN [Wang+, arXiv15]
–  Two different scenarios: without human and with human
Exploration experiment
•  @UCF101
–  Very deep two-stream CNN model pre-trained on UCF101
–  With/Without Human Setting
Visual results (Full Image)
Visual results (Without Human Setting)
Without Human
•  The concept of “Human Action Recognition without Human”
–  The accuracies of the two settings are very close
•  The with human setting is +9.49% better than the without human setting
–  Current motion representations rely heavily on the background
Future work
•  This finding is suggestive
–  We must accept this reality in order to build better motion representations
–  Pure motion representation is an urgent task!
•  More sophisticated approach
•  Human only motion

【ECCV 2016 BNMW】Human Action Recognition without Human
