Human centered 2D/3D multiview video analysis and description Nikos Nikolaidis, Ioannis Pitas* Greek SP Jam, AUTH, 17/5/2012 AIIA Lab, Aristotle University of Thessaloniki, Greece
Anthropocentric video content analysis tasks: face detection and tracking; body or body parts (head/hand/torso) detection; eye/mouth detection; 3D face model reconstruction; face clustering/recognition; facial expression analysis; activity/gesture recognition; semantic video content description/annotation.
Anthropocentric video content description Applications in video content search (e.g. in YouTube). Applications in film and games postproduction: semantic annotation of video; video indexing and retrieval; matting; 3D reconstruction. Anthropocentric (human centered) approach: humans are the most important video entity.
Examples Face detection and tracking (figure: 1st, 6th, 11th and 16th frames)
Face detection and tracking Problem statement: To detect the human faces that appear in each video frame and localize their Region-Of-Interest (ROI). To track the detected faces over the video frames.
Face detection and tracking Tracking associates each detected face in the current video frame with one in the next video frame. Therefore, we can describe the face (ROI) trajectory in a shot in (x,y) coordinates. Actor instance definition: face region of interest (ROI) plus other info. Actor appearance definition: face trajectory plus other info.
Face detection and tracking Overall system diagram
Face detection and tracking Face detection results
Face detection and tracking Feature based face tracking module (flowchart blocks: Face Detection, Select Features, Track Features, Occlusion decision (YES/NO), Occlusion Handling, Result)
Face detection and tracking Tracking failure may occur, e.g., when a face disappears. In such cases, face re-detection is employed. However, if any of the re-detected faces coincides with a face already being tracked, the detected face is kept and the tracked one is discarded from further processing. Periodic face re-detection (typically every 5 video frames) can be applied to account for new faces entering the camera's field of view. Forward and backward tracking can be used when the entire video is available.
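The re-detection rule above can be sketched in a few lines. This is only an illustration under stated assumptions: boxes are `(x, y, width, height)` tuples, "coincides" is interpreted as an intersection-over-union of at least `min_overlap`, and the names `iou`, `merge_redetections` and `needs_redetection` are hypothetical, not part of the presented system.

```python
REDETECT_EVERY = 5  # frames between periodic re-detections, as stated on the slide


def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def merge_redetections(tracked, detected, min_overlap=0.5):
    """Keep every re-detected face; drop tracked faces that coincide with one.

    The overlap test (IoU >= min_overlap) is an assumed definition of
    "coincides"; the slides do not specify the criterion.
    """
    surviving = [t for t in tracked
                 if all(iou(t, d) < min_overlap for d in detected)]
    return detected + surviving


def needs_redetection(frame_index, tracked):
    """Re-detect when nothing is being tracked or on the periodic schedule."""
    return not tracked or frame_index % REDETECT_EVERY == 0
```

Tracked faces that no detection overlaps simply continue to be tracked, while new detections start new trajectories.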
Face detection and tracking LSK tracker (template-based tracking)
Face detection and tracking LSK tracker results
Face detection and tracking Performance measures: detection rate; false alarm rate; localization precision in the range [0, 1].
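A minimal sketch of how such measures might be computed for one frame. The slides do not define them, so the following are assumptions: a detection counts as correct when its overlap (intersection over union) with an unmatched ground-truth box reaches `min_overlap`, the false alarm rate is the fraction of detections that match nothing, and localization precision is the mean overlap of the correct detections. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection over union of two (x, y, width, height) boxes, in [0, 1]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def evaluate(ground_truth, detections, min_overlap=0.5):
    """Return (detection rate, false alarm rate, localization precision)."""
    matched = set()      # indices of ground-truth boxes already explained
    overlaps = []        # overlap of each correct detection
    false_alarms = 0
    for det in detections:
        best, best_idx = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            o = overlap(det, gt)
            if o > best:
                best, best_idx = o, i
        if best_idx is not None and best >= min_overlap:
            matched.add(best_idx)
            overlaps.append(best)
        else:
            false_alarms += 1
    detection_rate = len(matched) / len(ground_truth) if ground_truth else 0.0
    false_alarm_rate = false_alarms / len(detections) if detections else 0.0
    localization = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return detection_rate, false_alarm_rate, localization
```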
Face detection and tracking Overall Tracking Accuracy
Multiview human detection Problem statement: Use information from multiple cameras to detect bodies or body parts, e.g. head. Applications: human detection/localization in postproduction; matting / segmentation initialization. (figure: views from Camera 4 and Camera 6)
Multiview human detection The retained voxels are projected to all views. For every view we reject ROIs that have small overlap with the regions resulting from the projection. (figure: views from Camera 2 and Camera 4)
Eye detection Problem statement: To detect eye regions in an image. Input: a facial ROI produced by face detection or tracking. Output: eye ROIs / eye center location. Applications: face analysis (e.g. expression recognition, face recognition); gaze tracking.
Eye detection Eye detection is based on edges and distance maps from the eye and the surrounding area. It is robust to variations in illumination conditions. The Canny edge detector is used. For each pixel, a vector pointing to the closest edge pixel is calculated. Its magnitude (length) and slope (angle) are used.
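The per-pixel vector map can be illustrated as follows. This sketch assumes the binary edge map has already been produced (for instance by a Canny detector) and uses a brute-force nearest-edge search for clarity rather than a fast distance transform; `edge_vector_map` is a hypothetical name. Angles are measured with image rows growing downwards.

```python
import math


def edge_vector_map(edges):
    """For each pixel, length and angle of the vector to the closest edge pixel.

    `edges` is a list of rows with truthy values at edge pixels.
    Returns two maps (lists of rows): vector lengths and angles in radians.
    """
    edge_pixels = [(r, c)
                   for r, row in enumerate(edges)
                   for c, value in enumerate(row) if value]
    if not edge_pixels:
        raise ValueError("edge map contains no edge pixels")
    lengths, angles = [], []
    for r, row in enumerate(edges):
        length_row, angle_row = [], []
        for c in range(len(row)):
            # Brute-force search for the nearest edge pixel.
            er, ec = min(edge_pixels,
                         key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
            dy, dx = er - r, ec - c
            length_row.append(math.hypot(dx, dy))
            angle_row.append(math.atan2(dy, dx))
        lengths.append(length_row)
        angles.append(angle_row)
    return lengths, angles
```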
3D face reconstruction from uncalibrated video Problem statement: Input: facial images or facial video frames, taken from different view angles, provided that the face neither changes expression nor speaks. Output: 3D face model (saved as a VRML file) and its calibration in relation to each camera; facial pose estimation. Applications: 3D face reconstruction, facial pose estimation, face recognition, face verification.
3D face reconstruction from uncalibrated video Method overview: manual selection of characteristic feature points on the input facial images; use of an uncalibrated 3-D reconstruction algorithm; incorporation of the CANDIDE generic face model; deformation of the generic face model based on the 3-D reconstructed feature points; re-projection of the face model grid onto the images and manual refinement.
3D face reconstruction from uncalibrated video Input: three images with a number of matched characteristic feature points.
3D face reconstruction from uncalibrated video The CANDIDE face model has 104 nodes and 184 triangles. Its nodes correspond to characteristic points of the human face, e.g. nose tip, outline of the eyes, outline of the mouth etc.
3D face reconstruction from uncalibrated video Example
3D face reconstruction from uncalibrated video (figure panels: selected features; CANDIDE grid reprojection; 3D face reconstruction)
3D face reconstruction from uncalibrated video Performance evaluation: reprojection error
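As an illustration of the reprojection error, the sketch below projects reconstructed 3D points with a 3x4 camera matrix and reports the root-mean-square pixel distance to the selected 2D feature points. The RMS form and the names `project` and `reprojection_error` are assumptions; the slides do not state which variant of the error was reported.

```python
import math


def project(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 camera matrix P (list of rows)."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(p * x for p, x in zip(row, xh)) for row in P)
    return u / w, v / w


def reprojection_error(P, points_3d, points_2d):
    """RMS pixel distance between projected 3D points and observed 2D points."""
    squared = 0.0
    for X, (u, v) in zip(points_3d, points_2d):
        pu, pv = project(P, X)
        squared += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(squared / len(points_3d))
```

With several views, the same function would be evaluated once per camera matrix and the results averaged or reported per view.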
Face clustering Problem statement: To cluster a set of facial ROIs. Input: a set of face image ROIs. Output: several face clusters, each containing faces of only one person. Applications: cluster the actor images, even if they belong in different shots; cluster varying views of the same actor.
Face clustering A four step clustering algorithm is used: face tracking; calculation of mutual information between the tracked facial ROIs to form a similarity matrix (graph); application of face trajectory heuristics, to improve clustering performance; similarity graph clustering.
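The similarity matrix step can be sketched as below. The assumptions are that every facial ROI has been resized to the same number of pixels and flattened to a list of grayscale values in `[0, levels)`, and that mutual information is estimated from a coarse joint histogram; the bin count and the function names are illustrative choices, not those of the presented system.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b, bins=8, levels=256):
    """Mutual information (in bits) between two equally sized grayscale images."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    n = len(img_a)
    qa = [v * bins // levels for v in img_a]  # quantize intensities into bins
    qb = [v * bins // levels for v in img_b]
    joint = Counter(zip(qa, qb))
    pa, pb = Counter(qa), Counter(qb)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (pa[a] * pb[b]))
    return mi


def similarity_matrix(rois, bins=8, levels=256):
    """Pairwise mutual information between all ROIs (a weighted similarity graph)."""
    return [[mutual_information(a, b, bins, levels) for b in rois] for a in rois]
```

The resulting matrix is symmetric and can be handed to any graph clustering routine; the trajectory heuristics mentioned on the slide are not modelled here.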
Face clustering Performance evaluation: 85.4% correct clustering rate, 6.6% clustering errors; an error of 7.9% is introduced by the face detector. With face detector errors manually removed: 93% correct clustering rate, 7% clustering errors. Cluster 2 results are improved.
Face recognition Problem statement: To identify a face identity. Input: a facial ROI. Output: the face id (figure examples: Hugh Grant, Sandra Bullock). Applications: finding the movie cast; retrieving movies of an actor; surveillance applications.
Face recognition Two general approaches: Subspace methods Elastic graph matching methods.
Face recognition Subspace methods The original high-dimensional image space is projected onto a low-dimensional one. Face recognition according to a simple distance measure in the low dimensional space. Subspace methods: Eigenfaces (PCA), Fisherfaces (LDA), ICA, NMF. Main limitation of subspace methods: they require perfect face alignment (registration).
Face recognition Elastic graph matching (EGM) methods Elastic graph matching is a simplified implementation of the Dynamic Link Architecture (DLA). DLA represents an object by a rectangular elastic grid. A Gabor wavelet bank response is measured at each grid node. Multiscale dilation-erosion at each grid node can be used, leading to Morphological EGM (MEGM).
Face recognition Output of normalized multi-scale dilation-erosion for nine scales.
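A rough sketch of multiscale dilation-erosion at a grid node, where positive scales apply dilation, negative scales apply erosion and scale 0 returns the image unchanged. Two simplifications are assumed here: the structuring element is a flat square of half-width equal to the absolute scale, and the normalization step mentioned on the slide is omitted. Function names are hypothetical.

```python
def dilation_erosion(image, scale):
    """Flat dilation (scale > 0), erosion (scale < 0) or identity (scale == 0).

    `image` is a list of rows of grayscale values; the structuring element is
    an assumed flat square of half-width abs(scale), clipped at the borders.
    """
    if scale == 0:
        return [row[:] for row in image]
    pick = max if scale > 0 else min
    k = abs(scale)
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - k), min(h, r + k + 1))
                      for cc in range(max(0, c - k), min(w, c + k + 1))]
            out_row.append(pick(window))
        out.append(out_row)
    return out


def node_feature_vector(image, row, col, max_scale=4):
    """Responses at one grid node for scales -max_scale..max_scale.

    max_scale=4 gives nine values, matching the nine scales on the slide.
    """
    return [dilation_erosion(image, s)[row][col]
            for s in range(-max_scale, max_scale + 1)]
```

Recomputing the full filtered image per node is wasteful; a practical version would filter once per scale and then sample all grid nodes.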
Face recognition Performance evaluation: Performance metric: Equal Error Rate (EER). M2VTS and XM2VTS facial image databases for training/testing. MEGM EER: 2%. Subspace techniques EER: 5-10%. Kernel subspace techniques EER: 2%. Elastic graph matching techniques have the best overall performance.
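For reference, the Equal Error Rate is the operating point where the false acceptance rate equals the false rejection rate. The sketch below approximates it from two lists of similarity scores by scanning candidate thresholds; it assumes higher scores mean "more likely the same person", and `equal_error_rate` is an illustrative name.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER from genuine and impostor similarity scores.

    Scans every observed score as a threshold and returns the mean of the
    false acceptance and false rejection rates where they are closest.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```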
Facial expression analysis Problem statement: To identify a facial expression in a facial image or 3D point cloud. Input: a face ROI or a point cloud. Output: the facial expression label (e.g. neutral, anger, disgust, sadness, happiness, surprise, fear). Applications: affective video content description.
Facial expression analysis Two approaches for expression analysis on images / videos. Feature based ones: they use texture or geometrical information as features for expression information extraction. Template based ones: they use 3D or 2D head and facial models as templates for facial expression recognition.
Facial expression analysis Feature based methods The features used can be: geometric positions of a set of fiducial points on a face; multi-scale and multi-orientation Gabor wavelet coefficients at the fiducial points; optical flow information; transform (e.g. DCT, ICA, NMF) coefficients; image pixels reduced in dimension by using PCA or LDA.
Facial expression analysis Fusion of geometrical and texture information
3D facial expression recognition Experiments on the BU-3DFE database: 100 subjects; 6 facial expressions x 4 different intensities + neutral per subject; 83 landmark points (used in our method). Recognition rate: 93.6%
Action recognition Problem statement: To identify the action (elementary activity) of a person. Input: a single-view or multi-view video or a sequence of 3D human body models (or point clouds). Output: an activity label (walk, run, jump, ...) for each frame or for the entire sequence. Applications: semantic video content description, indexing, retrieval. (figure: example actions run, walk, jump p., jump f., bend)
Action recognition Action recognition on video data Input feature vectors: binary masks resulting from coarse body segmentation on each frame. (figure: example masks for run, walk, jump p., jump f., bend)
Action recognition Perform fuzzy c-means (FCM) clustering on the feature vectors of all frames of training action videos
Action recognition Find the cluster centers, “dynemes”: key human body poses. Characterize each action video by its distance from the dynemes. Apply Linear Discriminant Analysis (LDA) to further decrease dimensionality.
Action recognition Characterize each test video by its distances from the dynemes. Perform LDA. Perform classification, e.g. by SVM. The method can operate on videos containing multiple actions performed sequentially. 96% correct action recognition on a database containing 10 actions (walking, running, jumping, waving, ...)
3D action recognition Action recognition on 3D data Extension of the video-based activity recognition algorithm. Input to the algorithm: binary voxel-based representation of frames.
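The dyneme steps can be sketched as follows: standard fuzzy c-means updates refine a set of initial centers, and a video is then described by the mean Euclidean distance of its frame vectors from each dyneme. Several things here are assumptions rather than the presented method: the initial centers are supplied by the caller, the fuzzifier is `m = 2`, the video descriptor is a plain mean of distances, and the subsequent LDA and SVM stages are left out.

```python
import math


def memberships(x, centers, m=2.0):
    """FCM membership of vector x in each cluster (values sum to 1)."""
    d = [math.dist(x, c) for c in centers]
    if 0.0 in d:  # x coincides with a center: assign it fully to that cluster
        hit = d.index(0.0)
        return [1.0 if k == hit else 0.0 for k in range(len(centers))]
    inv = [di ** (-2.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Refine initial centers with standard FCM updates; returns the dynemes."""
    dim = len(points[0])
    for _ in range(iterations):
        u = [memberships(x, centers, m) for x in points]
        updated = []
        for k, old in enumerate(centers):
            w = [row[k] ** m for row in u]
            total = sum(w)
            if total == 0.0:  # empty cluster: keep the previous center
                updated.append(list(old))
                continue
            updated.append([sum(wi * x[j] for wi, x in zip(w, points)) / total
                            for j in range(dim)])
        centers = updated
    return centers


def describe_video(frames, dynemes):
    """Mean distance of a video's frame vectors from each dyneme."""
    return [sum(math.dist(f, c) for f in frames) / len(frames) for c in dynemes]
```

In practice each frame vector would be a flattened binary body mask, and the descriptors of all training videos would feed the dimensionality reduction and classification stages.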
Visual speech detection Problem statement: To detect video frames where a person speaks using only visual information. Input: a mouth ROI produced by mouth detection. Output: yes/no visual speech indication. The mouth is detected and localized by a technique similar to eye detection.
Visual speech detection Applications: detection of important video frames (when someone speaks); lip reading applications; speech recognition applications; speech intent detection and speaker determination; human-computer interaction applications; video telephony and video conferencing systems; dialogue detection system for movies and TV programs.
Visual speech detection A statistical approach for visual speech detection, using mouth region luminance, is presented. The method employs face and mouth region detectors, applying signal detection algorithms to determine lip activity.
Visual speech detection Increase in the number of low intensity pixels in the mouth region when the mouth is open.
Visual speech detection The number of low grayscale intensity pixels of the mouth region in a video sequence. The rectangle encompasses the frames where a person is speaking.
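The low-intensity pixel count can be illustrated with a toy detector. It is a deliberately simple stand-in for the statistical signal detection step, which the slides do not detail: a frame is flagged when its count exceeds a fixed threshold, and the flags are smoothed by a majority vote over neighbouring frames. All thresholds and function names are assumptions.

```python
def low_intensity_count(mouth_roi, intensity_threshold=50):
    """Number of pixels in the mouth ROI darker than the intensity threshold."""
    return sum(1 for row in mouth_roi for p in row if p < intensity_threshold)


def detect_speech(mouth_rois, intensity_threshold=50, count_threshold=10,
                  half_window=1):
    """Per-frame speech flags from the dark-pixel counts of successive mouth ROIs."""
    counts = [low_intensity_count(r, intensity_threshold) for r in mouth_rois]
    raw = [c > count_threshold for c in counts]
    smoothed = []
    for i in range(len(raw)):
        # Majority vote over the frame and its neighbours (clipped at the ends).
        window = raw[max(0, i - half_window): i + half_window + 1]
        smoothed.append(sum(window) * 2 > len(window))
    return smoothed
```

The smoothing keeps a speaking interval intact when the mouth closes briefly between syllables and suppresses isolated single-frame detections.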
Anthropocentric video content description structure based on MPEG-7 Problem statement: To organize and store the extracted semantic information for video content description. Input: video analysis results. Output: MPEG-7 compatible XML file. Applications: video indexing and retrieval; video metadata analysis.
Anthropocentric video content description Anthropos-7 metadata description profile Humans and their status/actions are the most important metadata items. Anthropos-7 presents an anthropocentric perspective for movie annotation. New video content descriptors (on top of MPEG-7 entities) are introduced, in order to better store intermediate and high level video metadata related to humans. Human ids, actions and status descriptions are supported. Information like “Actor H. Grant appears in this shot and he is smiling” can be described in the proposed Anthropos-7 profile.
Anthropocentric video content description Main Anthropos-7 characteristics: Anthropos-7 descriptors gather semantic (intermediate level) video metadata (tags). Tags are selected to suit research areas like face detection, object tracking, motion detection, facial expression extraction etc. High level semantic video metadata (events, e.g. dialogues) are supported.
Anthropocentric video content description Ds & DSs names: Movie Class, Version Class, Scene Class, Shot Class, Take Class, Frame Class, Sound Class, Actor Class, Actor Appearance Class, Object Appearance Class, Actor Instance Class, Object Instance Class, High Semantic Class, Camera Class, Camera Use Class, Lens Class.
Anthropocentric video content description Shot Class attributes: ID, Shot S/N, Shot Description, Take Ref, Start Time, End Time, Duration, Camera Use, Key Frame, Number Of Frames, Color Information, Object Appearances.
Anthropocentric video content description Actor Class attributes: ID, Name, Sex, Nationality, Role Name.
Anthropocentric video content description Actor Appearance Class attributes: ID, Time In List, Time Out List, Color Information, Motion, Actor Instances.
Anthropocentric video content description Examples of Actor Instances in one Actor Appearance (figure: 1st, 6th, 11th and 16th frames)
Anthropocentric video content description Actor Instance Class attributes: ID, ROIDescriptionModel (ID; Geometry (ID, Feature Points, Bounding Box, Convex Hull, Grid); Color), Status, Position In Frame, Expressions, Activity.
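To make the idea of an actor instance record concrete, the sketch below serializes one instance to XML with Python's standard library. The element and attribute names (`ActorInstance`, `ActorName`, `Frame`, `BoundingBox`, `Expressions`, `Activity`) are hypothetical and only loosely mirror the attribute list above; they are not the actual Anthropos-7 or MPEG-7 schema.

```python
import xml.etree.ElementTree as ET


def actor_instance_xml(instance_id, actor_name, frame, bounding_box,
                       expression, activity):
    """Build an illustrative XML fragment for one actor instance.

    All tag and attribute names are assumptions made for this example.
    """
    inst = ET.Element("ActorInstance", {"id": instance_id})
    ET.SubElement(inst, "ActorName").text = actor_name
    ET.SubElement(inst, "Frame").text = str(frame)
    x, y, w, h = bounding_box
    ET.SubElement(inst, "BoundingBox",
                  {"x": str(x), "y": str(y), "width": str(w), "height": str(h)})
    ET.SubElement(inst, "Expressions").text = expression
    ET.SubElement(inst, "Activity").text = activity
    return ET.tostring(inst, encoding="unicode")


# Example in the spirit of "Actor H. Grant appears in this shot and he is smiling".
fragment = actor_instance_xml("ai_0001", "H. Grant", 16, (120, 40, 64, 64),
                              "smiling", "standing")
```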
Anthropos-7 Video Annotation software A set of video annotation software tools has been created for anthropocentric video content description. The tools follow the anthropocentric video content description structure based on MPEG-7. These software tools perform (semi-)automatic video content annotation. The video metadata results (XML file) can be edited manually with a software tool called Anthropos-7 editor.
Multi-view Anthropos-7 The goal is to link the frame-based information instantiated by the ActorInstanceType DS at each video channel (e.g. L+R channels in stereo videos). In other words, to link the instances of the ActorInstanceType DS belonging to different channels, based on disparity or color similarity.
Multi-view Anthropos-7 The proposed structure organizes the data in 3 levels. Level 1 refers to the Correspondences tag. Level 2 refers to the CorrespondingInstancesType DS. Level 3 refers to the CorrespondingInstanceType DS.
Conclusions Anthropocentric movie description is very important because in most cases the movies are based on human characters, their actions and status. We presented a number of anthropocentric video analysis and description techniques that have applications in film/games postproduction as well as in semantic video search. Similar descriptions can be devised for stereo and multiview video.
3DTVS project European project: 3DTV Content Search (3DTVS) Start: 1/11/2011 Duration: 3 years www.3dtvs-project.eu Stay tuned!
Thank you very much for your attention! Further info: firstname.lastname@example.org Tel: +30-2310-996304