Action Recognition
A general survey of previous works
Sobhan Naderi Parizi
September 2009
List of papers:
- Statistical Analysis of Dynamic Actions
- On Space-Time Interest Points
- Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
- What, where and who? Classifying events by scene and object recognition
- Recognizing Actions at a Distance
- Recognizing Human Actions: A Local SVM Approach
- Retrieving Actions in Movies
- Learning Realistic Human Actions from Movies
- Actions in Context
- Selection and Context for Action Recognition
Non-parametric Distance Measure for Action Recognition
Paper info:
- Title: Statistical Analysis of Dynamic Actions
- Authors: Lihi Zelnik-Manor, Michal Irani
- TPAMI 2006; a preliminary version, “Event-Based Video Analysis”, appeared in CVPR 2001
“Statistical Analysis of Dynamic Actions”
Overview:
- Introduces a non-parametric distance measure
- Video matching (no action model): given a reference video, similar sequences are found
- Dense features from multiple temporal scales (only corresponding scales are compared)
- The temporal extent of videos in each category should be the same (fast and slow dancing are treated as different actions)
- A new database is introduced:
  - Periodic activities (walk)
  - Non-periodic activities (punch, kick, duck, tennis)
  - Temporal textures (water)
- www.wisdom.weizmann.ac.il/~vision/EventDetection.html
“Statistical Analysis of Dynamic Actions”
Feature description:
- Space-time gradient at each pixel
- Threshold the gradient magnitudes
- Normalization (ignores appearance)
- Absolute value (invariant to dark/light transitions)
- Direction invariant
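The pipeline above maps naturally to a few lines of numpy. The sketch below is a minimal, assumed implementation of the described steps; the blur width, magnitude threshold, and epsilon are illustrative values, not taken from the paper:

```python
# Minimal sketch of the described space-time gradient features (assumed parameters).
import numpy as np
from scipy.ndimage import gaussian_filter

def space_time_features(clip, mag_thresh=10.0, blur_sigma=1.5):
    """clip: (T, H, W) grayscale video volume as floats."""
    # Blur each frame first (spatial-only smoothing), as noted on the comments slide.
    clip = gaussian_filter(clip, sigma=(0, blur_sigma, blur_sigma))
    # Space-time gradients at every pixel.
    gt, gy, gx = np.gradient(clip)
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    mask = mag > mag_thresh                       # keep only significant gradients
    eps = 1e-8
    # Normalize by magnitude (discards appearance/contrast) and take absolute
    # values (invariant to dark<->light transitions).
    feats = np.abs(np.stack([gx, gy, gt], axis=-1)) / (mag[..., None] + eps)
    return feats[mask]                            # one 3-vector per selected pixel

# Per temporal scale, the selected gradient vectors are then summarized as
# independent 1D histograms and compared only across corresponding scales.
```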
“Statistical Analysis of Dynamic Actions”
Comments:
- Actions are represented by 3L independent 1D distributions (L = number of temporal scales)
- The frames are blurred first
- Robust to changes of appearance, e.g. highly textured clothing
- Action recognition/localization: for a test video sequence S and a reference sequence of T frames, each consecutive sub-sequence of length T is compared to the reference
- In case of multiple reference videos, the Mahalanobis distance is used
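A schematic sketch of this sliding-window matching is shown below. It assumes each (sub-)sequence is summarized by a fixed-length descriptor and that the set of reference videos is modeled by the mean and covariance of their descriptors; those modeling details are assumptions made for illustration, not taken from the paper:

```python
# Schematic sliding-window matching against multiple reference videos.
import numpy as np

def describe(subseq):
    # Placeholder descriptor: a grey-level histogram; the paper instead uses
    # per-scale space-time gradient histograms.
    hist, _ = np.histogram(subseq, bins=64, range=(0.0, 255.0))
    return hist.astype(float) / (hist.sum() + 1e-10)

def sliding_distances(test_clip, ref_descs, T):
    """test_clip: (N, H, W) video; ref_descs: (R, D) descriptors of R references."""
    mu = ref_descs.mean(axis=0)
    # Ridge term keeps the covariance invertible when only a few references exist.
    cov = np.cov(ref_descs, rowvar=False) + 1e-6 * np.eye(ref_descs.shape[1])
    cov_inv = np.linalg.inv(cov)
    dists = []
    for start in range(test_clip.shape[0] - T + 1):
        d = describe(test_clip[start:start + T]) - mu
        dists.append(float(d @ cov_inv @ d))      # squared Mahalanobis distance
    return np.array(dists)                        # low values = likely matches
```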
Space-Time Interest Points (STIP)
Paper info:
- Title: On Space-Time Interest Points
- Authors: Ivan Laptev (INRIA / IRISA)
- IJCV 2005
“On Space-Time Interest Points”
- Extends the Harris detector to 3D (space-time)
- Local space-time points with non-constant motion: points with accelerated motion correspond to physical forces
- Independent space and time scales
- Automatic scale selection
“On Space-Time Interest Points”
Automatic scale selection procedure:
- Detect interest points
- Move in the direction of the optimal scale
- Repeat until a locally optimal scale is reached (iterative)
The procedure cannot be used in real time:
- Future frames are needed
- Estimation approaches exist to address this problem
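As a reference for how the space-time extension works, the sketch below computes a 3D Harris-style corner response from the space-time second-moment matrix. The scale values and the constant k are illustrative assumptions, and scale selection and non-maximum suppression are omitted:

```python
# Sketch of a space-time (3D) Harris response with separate spatial/temporal scales.
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris_response(clip, sigma=2.0, tau=1.5, k=0.005):
    """clip: (T, H, W) grayscale video; returns a response volume."""
    # Scale-space representation and its space-time derivatives.
    L = gaussian_filter(clip, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Second-moment matrix, averaged with an integration scale (assumed 2x the local scale).
    s_int = (2 * tau, 2 * sigma, 2 * sigma)
    M = {}
    for name, a, b in [("xx", Lx, Lx), ("yy", Ly, Ly), ("tt", Lt, Lt),
                       ("xy", Lx, Ly), ("xt", Lx, Lt), ("yt", Ly, Lt)]:
        M[name] = gaussian_filter(a * b, sigma=s_int)
    det = (M["xx"] * (M["yy"] * M["tt"] - M["yt"] ** 2)
           - M["xy"] * (M["xy"] * M["tt"] - M["yt"] * M["xt"])
           + M["xt"] * (M["xy"] * M["yt"] - M["yy"] * M["xt"]))
    trace = M["xx"] + M["yy"] + M["tt"]
    return det - k * trace ** 3   # local maxima = candidate space-time interest points
```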
Unsupervised Action Recognition
Paper info:
- Title: Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
- Authors: Juan Carlos Niebles (University of Illinois), Hongcheng Wang (University of Illinois), Li Fei-Fei (University of Illinois)
- BMVC 2006
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words”
- Generative graphical model (pLSA)
- The interest-point detector of Piotr Dollár et al. is used (Laptev’s STIP detector is too sparse)
- A dictionary of video words is created
- The method is unsupervised
- Simultaneous action recognition/localization
- Evaluations on:
  - KTH action database
  - Skating actions database (4 action classes)
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words”
Overview of the method (pLSA model):
- w: video word
- d: video sequence
- z: latent topic (action category)
Feature descriptor:
- Brightness gradient + PCA
- Brightness gradient was found equivalent to optical flow for capturing motion
- Multiple actions can be localized in the same video
Average classification accuracy:
- KTH action database: 81.5%
- Skating dataset: 80.67%
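For concreteness, here is a compact, generic pLSA (EM) sketch over the video-word counts, with d = video, w = video word, z = latent action topic. It is not the authors' code; the initialization, smoothing constants, and stopping rule are assumptions:

```python
# Generic pLSA via EM for the video-word model (illustrative implementation).
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """counts: (n_docs, n_words) video-word counts; returns P(z|d) and P(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (n_docs, n_words, n_topics).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        weighted = counts[:, :, None] * resp
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# A new video is assigned the topic with the highest P(z|d) after folding-in
# (re-running EM on its counts with P(w|z) kept fixed).
```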
Event Recognition in Sport Images
Paper info:
- Title: What, where and who? Classifying events by scene and object recognition
- Authors: Li-Jia Li (University of Illinois), Li Fei-Fei (Princeton University)
- ICCV 2007
“What, where and who? Classifying events by scene and object recognition”
Goal of the paper:
- Event classification in still images
- Scene labeling
- Object labeling
Approach:
- Generative graphical model
- Assumes that objects and scenes are independent given the event category
- Ignores spatial relationships between objects
“What, where and who? Classifying events by scene and object recognition”
Information channels:
- Scene context (holistic representation)
- Object appearance
- Geometric layout (sky at infinity / vertical structure / ground plane)
Feature extraction:
- 12x12 patches obtained by grid sampling (10x10 grid spacing)
- For each patch:
  - SIFT feature (used for both the scene and object models)
  - Layout label (used only for the object model)
“What, where and who? Classifying events by scene and object recognition”
The graphical model:
- E: event
- S: scene
- O: object
- X: scene feature
- A: appearance feature
- G: geometry layout
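The practical consequence of the independence assumption is that, given the event, the scene and object evidence multiply. The toy example below only illustrates that factorization; the log-likelihood values and class names are made up and stand in for the paper's scene and object sub-models:

```python
# Toy illustration of the conditional-independence assumption:
#     P(E | X, A, G)  proportional to  P(E) * P(X | E) * P(A, G | E)
import numpy as np

def classify_event(prior, scene_loglik, object_loglik):
    """All inputs are length-n_events arrays; returns the posterior over events."""
    log_post = np.log(prior) + scene_loglik + object_loglik
    log_post -= log_post.max()                 # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical example with 3 event classes (rowing, sailing, croquet):
prior = np.array([1 / 3, 1 / 3, 1 / 3])
scene_loglik = np.array([-2.0, -1.0, -5.0])    # water scene favors rowing/sailing
object_loglik = np.array([-1.0, -4.0, -6.0])   # detected oars favor rowing
print(classify_event(prior, scene_loglik, object_loglik))
```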
“What, where and who? Classifying events by scene and object recognition”
A new database is compiled:
- 8 sport event categories (downloaded from the web): bocce, croquet, polo, rowing, snowboarding, badminton, sailing, rock climbing
- Average classification accuracy over all 8 event classes = 74.3%
“What, where and who? Classifying events by scene and object recognition”
Sample results (figure slide)
Action Recognition in Medium-Resolution Regimes
Paper info:
- Title: Recognizing Actions at a Distance
- Authors: Alexei A. Efros (UC Berkeley), Alexander C. Berg (UC Berkeley), Greg Mori (UC Berkeley), Jitendra Malik (UC Berkeley)
- ICCV 2003
“Recognizing Actions at a Distance”
Overall review:
- Actions at medium resolution (about 30 pixels tall)
- Proposes a new motion descriptor
- KNN for classification
- A consistent tracking bounding box of the actor is required
- Action recognition is done only on the tracking bounding box
- Motion is described as relative movement of body parts
- No information about movement is given by the tracker
“Recognizing Actions at a Distance”
Motion feature:
- For each frame, a local temporal neighborhood is considered
- Optical flow is extracted (other alternatives: image pixel values, temporal gradients)
- Optical flow is noisy: half-wave rectification + blurring
- To preserve motion information, the flow vector is decomposed into its vertical/horizontal components
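The sketch below builds the four half-wave rectified, blurred motion channels from dense optical flow. OpenCV's Farneback flow is used here as a stand-in for the original flow computation, and the blur width is an assumption:

```python
# Half-wave rectified, blurred optical-flow channels (illustrative parameters).
import cv2
import numpy as np

def motion_channels(prev_gray, cur_gray, blur_sigma=3.0):
    """prev_gray, cur_gray: consecutive grayscale uint8 frames (tracker-stabilized)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectification: four non-negative channels Fx+, Fx-, Fy+, Fy-.
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    # Blurring makes the noisy flow usable as a motion descriptor.
    ksize = int(6 * blur_sigma) | 1
    return [cv2.GaussianBlur(c, (ksize, ksize), blur_sigma) for c in channels]
```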
“Recognizing Actions at a Distance”
Similarity measure:
- i, j: frame indices
- T: temporal extent
- I: spatial extent
- A: first video sequence; B: second video sequence
- Frame-to-frame similarity compares the motion channels of A around frame i with those of B around frame j over the space-time window (see the sketch below)
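A minimal version of that comparison, assuming the channels from the previous sketch and treating the similarity as a sum of channel-wise correlations over the window; normalization and boundary handling are omitted:

```python
# Frame-to-frame similarity over blurred motion channels (simplified).
import numpy as np

def frame_similarity(A, B, i, j, T=5):
    """A, B: arrays of shape (n_frames, n_channels, H, W) of blurred motion
    channels; returns the similarity between frame i of A and frame j of B."""
    half = T // 2
    score = 0.0
    for t in range(-half, half + 1):
        a, b = A[i + t], B[j + t]          # (n_channels, H, W), same spatial extent I
        score += float(np.sum(a * b))      # correlate over channels and pixels
    return score

# The full matrix S[i, j] of frame-to-frame similarities feeds a k-NN
# classifier over the labeled database sequences.
```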
“Recognizing Actions at a Distance”
New datasets:
- Ballet (stationary camera): 16 action classes, 2 men + 2 women, easy dataset (controlled environment)
- Tennis (real actions, stationary camera): 6 action classes (stand, swing, move-left, …), different days/locations/camera positions, 2 players (man + woman)
- Football (real actions, moving camera): 8 action classes (run-left 45˚, run-left, walk-left, …), zoom in/out
“Recognizing Actions at a Distance”
Average classification accuracy:
- Ballet: 87.44% (5-NN)
- Tennis: 64.33% (5-NN)
- Football: 65.38% (1-NN)
What can be done with this representation?
“Recognizing Actions at a Distance”
Applications:
- Do as I Do: replace actors in videos
- Do as I Say: develop real-world motions in computer games; 2D/3D skeleton transfer
- Figure correction: remove occlusion/clutter in movies
KTH Action Dataset
Paper info:
- Title: Recognizing Human Actions: A Local SVM Approach
- Authors: Christian Schuldt (KTH), Ivan Laptev (KTH)
- ICPR 2004
“Recognizing Human Actions: A Local SVM Approach”
New dataset (KTH action database):
- 2391 video sequences
- 6 action classes (walking, jogging, running, hand-clapping, boxing, hand-waving)
- 25 persons
- Static camera
- 4 scenarios:
  - Outdoors (s1)
  - Outdoors + scale variation (s2): the hardest scenario
  - Outdoors + clothing variation (s3)
  - Indoors (s4)
“Recognizing Human Actions: A Local SVM Approach”
Features:
- Sparse (STIP detector)
- Spatio-temporal jets of order 4
Different feature representations:
- Raw jet feature descriptors
- Exponential kernel on the histogram of jets
- Spatial HoG with temporal pyramid
Different classifiers:
- SVM
- NN
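As a reference for the histogram-based representation, the sketch below computes an exponential kernel on feature histograms. The chi-square distance and the mean-distance bandwidth heuristic are common choices assumed here, not necessarily the paper's exact kernel:

```python
# Exponential chi-square kernel on feature histograms (assumed form).
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def exp_chi2_kernel(H1, H2, gamma=None):
    """H1: (n1, d), H2: (n2, d) histograms; returns the (n1, n2) Gram matrix."""
    D = np.array([[chi2_distance(a, b) for b in H2] for a in H1])
    if gamma is None:
        gamma = 1.0 / (np.mean(D) + 1e-10)   # mean-distance bandwidth heuristic
    return np.exp(-gamma * D)

# The Gram matrix can be passed to an SVM with a precomputed kernel, e.g.
# sklearn.svm.SVC(kernel="precomputed").fit(exp_chi2_kernel(H_tr, H_tr), y_tr).
```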
“Recognizing Human Actions: A Local SVM Approach”
Experimental results:
- Local features (jets) + SVM performs best
- SVM outperforms NN
- HistLF (histogram of jets) is slightly better than HistSTG (histogram of spatio-temporal gradients)
- Average classification accuracy over all scenarios = 71.72%
Action Recognition in Real Scenarios
Paper info:
- Title: Retrieving Actions in Movies
- Authors: Ivan Laptev (INRIA / IRISA), Patrick Perez (INRIA / IRISA)
- ICCV 2007
“Retrieving Actions in Movies”
- A new action database from real movies
- Experiments only on the Drinking action vs. random clips / Smoking
Main contributions:
- Recognizing unrestricted real actions
- Key-frame priming
Configuration of experiments:
- Action recognition (on pre-segmented sequences), comparing different features
- Action detection (using key-frame priming)
“Retrieving Actions in Movies”
Real movie action database:
- 105 drinking actions
- 141 smoking actions
- Different scenes/people/views
- www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
Action representation:
- R = (P, ΔP)
- P = (X, Y, T): space-time coordinates
- ΔP = (ΔX, ΔY, ΔT): ΔX = 1.6 x width of the head bounding box, ΔY = 1.3 x height of the head bounding box
“Retrieving Actions in Movies”
Learning scheme:
- Discrete AdaBoost + FLD (Fisher Linear Discriminant)
- All action cuboids are normalized to 14x14x8 cells of 5x5x5 pixels (needed for boosting)
- Slightly temporally randomized sequences are added to training
- HoG (4 bins) / OF (5 bins) is used
Local features:
- Θ = (x, y, t, δx, δy, δt, β, ψ)
- β ∈ {plain, temp-2, spat-4}
- ψ ∈ {OF5, Grad4}
“Retrieving Actions in Movies”
- HoG captures shape, OF captures motion
- Informative motions occur at the start and end of the action
- Key-frame: when the hand reaches the head
- Boosted histograms on HoG for the key-frame
- There is no motion information around the key-frame, so integrating motion and the key-frame cue should help (see the sketch below)
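A hedged sketch of what such an integration can look like as key-frame priming: a cheap appearance-based key-frame detector proposes candidate frames, and the more expensive spatio-temporal classifier is evaluated only around them. The detector functions are placeholders and the threshold/window values are assumptions:

```python
# Key-frame priming sketch: appearance detector gates the motion classifier.
from typing import Callable, List, Tuple

def keyframe_primed_detection(n_frames: int,
                              keyframe_score: Callable[[int], float],
                              motion_score: Callable[[int], float],
                              kf_threshold: float = 0.5,
                              window: int = 4) -> List[Tuple[int, float]]:
    detections = []
    # Candidate key-frames from the fast appearance (HoG) detector.
    candidates = [f for f in range(n_frames) if keyframe_score(f) > kf_threshold]
    for f in candidates:
        # Evaluate the spatio-temporal (motion) classifier only near primed frames.
        best = max(motion_score(t)
                   for t in range(max(0, f - window),
                                  min(n_frames, f + window + 1)))
        detections.append((f, best))
    return detections
```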
“Retrieving Actions in Movies”
Experiments:
- OF / OF+HoG / STIP+NN / key-frame only
- OF / OF+HoG works best on the hard test (drinking vs. smoking)
- Extending OF5 to OFGrad9 does not help
Key-frame priming:
- The number of false positives decreases significantly (different information channels)
- Significant gain in overall accuracy: it is better to model motion and appearance separately
- Speed of the key-frame-primed version: 3 seconds per frame
“Retrieving Actions in Movies”
Possible extensions:
- Extend the experiments to more action classes
- Make it real-time
Automatic Video Annotation
Paper info:
- Title: Learning Realistic Human Actions from Movies
- Authors: Ivan Laptev (INRIA / IRISA), Marcin Marszalek (INRIA / LEAR), Cordelia Schmid (INRIA / LEAR), Benjamin Rozenfeld (Bar-Ilan University)
- CVPR 2008
“Learning Realistic Human Actions from Movies”
Overview:
- Automatic movie annotation: alignment of movie scripts + text classification
- Classification of real actions
- Provides a new dataset
- Beats state-of-the-art results on the KTH dataset
- Extends the spatial pyramid to a space-time pyramid
“Learning Realistic Human Actions from Movies”
Movie scripts:
- Publicly available textual descriptions of: scene descriptions, characters, transcribed dialogs, actions (descriptive)
Limitations:
- No exact timing alignment
- No guarantee of correspondence with the real actions
- Actions are expressed in free text (diverse descriptions)
- Actions may be missed due to lack of conversation
“Learning Realistic Human Actions from Movies”
Automatic annotation:
- Subtitles include exact time alignment
- Script timing is matched to the subtitles
- Textual action descriptions are labeled by a text classifier
New dataset:
- 8 action classes (AnswerPhone, GetOutCar, SitUp, …)
- Two training sets (automatically / manually annotated)
- 60% of the automatic training set is correctly annotated
- http://www.irisa.fr/vista/actions
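To make the alignment step concrete, here is an intentionally simplified sketch that transfers subtitle timestamps to script dialog lines by text similarity; the real system uses a proper alignment over the whole script plus a text classifier for the action descriptions, so treat this only as an illustration:

```python
# Simplified script-to-subtitle timing transfer (illustrative only).
from difflib import SequenceMatcher

def align_script_to_subtitles(script_dialogs, subtitles):
    """script_dialogs: list of dialog strings in script order.
    subtitles: list of (start_sec, end_sec, text). Returns per-dialog timing."""
    timings = []
    for dialog in script_dialogs:
        # Pick the subtitle whose text best matches this dialog line.
        best = max(subtitles,
                   key=lambda s: SequenceMatcher(None, dialog.lower(),
                                                 s[2].lower()).ratio())
        timings.append((best[0], best[1]))   # inherit the subtitle's timing
    return timings

# Scene/action descriptions that fall between two aligned dialogs are then
# assigned the corresponding video interval and labeled by a text classifier.
```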
“Learning Realistic Human Actions from Movies”
Action classification approach:
- BoF framework (k = 4000)
- Space-time pyramids:
  - 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2}
  - 4 temporal grids: {t1, t2, t3, ot2}
- STIP at multiple scales
- HoG and HoF descriptors
“Learning Realistic Human Actions from Movies”
Feature extraction:
- A volume of (2kσ x 2kσ x 2kτ) is taken around each STIP, where σ/τ is the spatial/temporal extent and k = 9
- The volume is divided into a grid of cells
- HoG and HoF are computed for each grid cell and concatenated
- The concatenated features are concatenated once more according to the pattern of the spatio-temporal pyramid
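The sketch below shows the per-point descriptor assembly described above: split the volume around an interest point into cells and concatenate per-cell histograms. The grid size (3x3x2) is an assumption, only a HoG-like part is sketched (HoF would histogram optical flow instead of gradient orientation), and the histogram details are illustrative:

```python
# Per-STIP descriptor from a gridded space-time volume (illustrative details).
import numpy as np

def stip_descriptor(volume, grid=(3, 3, 2), n_bins=8):
    """volume: (H, W, T) patch around one interest point."""
    vol = volume.astype(float)
    gy, gx, gt = np.gradient(vol)                 # spatial + temporal gradients
    ori = np.arctan2(gy, gx)                      # spatial orientation (HoG-like)
    mag = np.sqrt(gx**2 + gy**2)
    ny, nx, nt = grid
    H, W, T = vol.shape
    parts = []
    for iy in range(ny):
        for ix in range(nx):
            for it in range(nt):
                sl = (slice(iy * H // ny, (iy + 1) * H // ny),
                      slice(ix * W // nx, (ix + 1) * W // nx),
                      slice(it * T // nt, (it + 1) * T // nt))
                hist, _ = np.histogram(ori[sl], bins=n_bins,
                                       range=(-np.pi, np.pi), weights=mag[sl])
                parts.append(hist / (np.linalg.norm(hist) + 1e-10))
    return np.concatenate(parts)   # later quantized into the k = 4000 vocabulary
```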
“Learning Realistic Human Actions from Movies”
Different channels:
- Each spatio-temporal template is one channel
- Greedy search to find the best channel combination
- Kernel function based on the chi-square distance
Observations:
- HoG performs better than HoF
- No temporal subdivision is preferred (temporal grid = t1)
- Combination of channels improves classification in the real-movie scenario
- Mean AP on the KTH action database = 91.8%
- Mean AP on the real-movies database:
  - Trained on the manually annotated dataset: 39.5%
  - Trained on the automatically annotated dataset: 22.9%
  - Random classifier (chance): 12.5%
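A hedged sketch of the greedy channel search: each channel contributes a chi-square distance matrix, the selected channels are combined into one exponential kernel, and channels are added while cross-validated accuracy improves. The averaging of mean-normalized distances is an assumed combination rule, not necessarily the paper's:

```python
# Greedy forward selection over channels with a combined chi-square kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def combined_kernel(distance_mats):
    """distance_mats: list of (n, n) chi-square distance matrices, one per channel."""
    D = sum(d / (d.mean() + 1e-10) for d in distance_mats) / len(distance_mats)
    return np.exp(-D)

def greedy_channel_selection(distance_mats, labels, cv=5):
    selected, best_score = [], -np.inf
    remaining = list(range(len(distance_mats)))
    while remaining:
        scores = []
        for c in remaining:
            K = combined_kernel([distance_mats[i] for i in selected + [c]])
            acc = cross_val_score(SVC(kernel="precomputed"), K, labels, cv=cv).mean()
            scores.append((acc, c))
        acc, c = max(scores)
        if acc <= best_score:      # stop when no channel improves accuracy
            break
        best_score, selected = acc, selected + [c]
        remaining.remove(c)
    return selected, best_score
```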
“Learning Realistic Human Actions from Movies”
Future work:
- Increase robustness to annotation noise
- Improve script-to-video alignment
- Learn on a larger database of automatic annotations
- Experiment with more low-level features
- Move from BoF to detector-based methods
The results table also shows:
- The effect of temporal subdivision when combining channels (HMM-based methods should work)
- The pattern of the spatio-temporal pyramid changes so that context is best captured when the action is scene-dependent
Image Context in Action Recognition
Paper info:
- Title: Actions in Context
- Authors: Marcin Marszalek (INRIA / LEAR), Ivan Laptev (INRIA / IRISA), Cordelia Schmid (INRIA / LEAR)
- CVPR 2009
“Actions in Context”
Contributions:
- Automatic learning of scene classes from video
- Improves action recognition using image context, and vice versa
- Movie scripts are used for automatic training
- For both actions and scenes: BoF + SVM
New large database (Hollywood2):
- 12 action classes
- 10 scene classes
- 69 movies involved
- www.irisa.fr/vista/actions/hollywood2
“Actions in Context”
- For automatic annotation, scenes are identified only from text
Features:
- SIFT (scene modeling) on 2D Harris points
- HoG and HoF (motion) on 3D Harris points (STIP)
“Actions in Context”
Features:
- SIFT: extracted at 2D Harris detections; captures static appearance; used for modeling scene context; computed on a single frame every 2 seconds
- HoG/HoF: extracted at 3D Harris (STIP) detections; HoG captures dynamic appearance, HoF captures motion patterns
- One video-word dictionary per channel is created
- A histogram of video words is built for each channel
Classifier:
- SVM with the chi-square distance in an exponential (RBF) kernel
- Summed over multiple channels
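The per-channel bag-of-features step can be sketched as follows: cluster each channel's local descriptors into its own vocabulary, then build one normalized word histogram per video and channel. The clustering settings are illustrative:

```python
# One visual vocabulary per channel + per-video word histograms (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=4000, seed=0):
    """descriptors: (n_desc, dim) local descriptors sampled from training videos
    of one channel; k must not exceed n_desc."""
    return KMeans(n_clusters=k, n_init=4, random_state=seed).fit(descriptors)

def bow_histogram(vocab, video_descriptors):
    words = vocab.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-10)

# For each video: one histogram per channel (SIFT, HoG, HoF); the per-channel
# chi-square RBF kernels are then summed for the final SVM.
```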
“Actions in Context”
Evaluations:
- SIFT: better for context
- HoG/HoF: better for actions
- Context alone can also classify actions fairly well
- Combining the 3 channels works best
“Actions in Context”
Observations:
- Context is not always helpful
- Idea: the model should control the contribution of context for each action class individually
- Overall, the accuracy gain from using context is not significant
- Idea: other types of context may work better
Object Co-occurrence in Action Recognition
Paper info:
- Title: Selection and Context for Action Recognition
- Authors: Dong Han (University of Bonn), Liefeng Bo (TTI-Chicago), Cristian Sminchisescu (University of Bonn)
- ICCV 2009
“Selection and Context for Action Recognition”
Main contributions:
- Contextual scene descriptors based on:
  - Presence/absence of objects (bag-of-detectors)
  - Structural relations between objects and their parts
- Automatic learning of multiple features
- Multiple Kernel Gaussian Process Classifier (MKGPC)
Experimental results on:
- KTH action dataset
- Hollywood-1 and Hollywood-2 Human Action databases (INRIA)
“Selection and Context for Action Recognition”
Main message:
- Detecting a car with a person in its proximity increases the probability of a GetOutCar action
- Provides a framework to train a classifier from a combination of multiple features (not necessarily all relevant), e.g. HoG + HoF + histogram intersection, …
- Similar to MKL, but here the parameters (weights + hyper-parameters) are learned automatically; a Gaussian Process scheme is used for learning
“Selection and Context for Action Recognition”
Descriptors:
- Bag of detectors:
  - Deformable part models are used (Felzenszwalb et al.)
  - Once object bounding boxes are detected, 3 descriptors are built: ObjPres (4D), ObjCount (4D), ObjDist (21D: pairwise distances between the 7 parts of the person detector)
- HoG (4D) + HoF (5D) from the STIP detector (Laptev):
  - Spatial grids: 1x1, 2x1, 3x1, 4x1, 2x2, 3x3
  - Temporal grids: t1, t2, t3
- 3D gradient features
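Given detector outputs, the context descriptors above reduce to simple counting and geometry. The sketch below assumes a hypothetical set of four object detectors and a 7-part person model, matching only the dimensionalities listed on the slide:

```python
# Bag-of-detectors context descriptors from detection outputs (illustrative classes).
import numpy as np
from itertools import combinations

OBJECT_CLASSES = ["person", "car", "chair", "table"]     # assumed 4 detectors

def obj_presence_count(detections):
    """detections: list of dicts {"class": str, "box": (x1, y1, x2, y2)}."""
    counts = np.array([sum(d["class"] == c for d in detections)
                       for c in OBJECT_CLASSES], dtype=float)
    presence = (counts > 0).astype(float)                 # ObjPres (4D)
    return presence, counts                               # ObjCount (4D)

def obj_part_distances(person_parts):
    """person_parts: list of 7 part centers (x, y) from the person detector.
    Returns the 21 pairwise distances (ObjDist)."""
    return np.array([np.hypot(a[0] - b[0], a[1] - b[1])
                     for a, b in combinations(person_parts, 2)])
```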
“Selection and Context for Action Recognition”
Experimental results:
- KTH dataset: 94.1% mean AP vs. 91.8% reported by Laptev; superior to the state of the art on all classes except Running
- HOHA1 dataset: trained on the clean set only; the optimal subset of features is found greedily (addition/removal) based on test error; 47.5% mean AP vs. 38.4% reported by Laptev
- HOHA2 dataset: 43.12% mean AP vs. 35.1% reported by Marszalek

“Selection and Context for Action Recognition”
Best feature combination (figure slide)