Long-term Face Tracking in the Wild
using Deep Learning
Presented by:
Elaheh Rashedi
Advisor:
Xuewen Chen
Department of Computer Science
Wayne State University
KDD Workshop on Large-scale Deep Learning for Data Mining
August 2016
Outline
• Introduction
– Long Term Tracking Algorithm
– Face Tracking Algorithm
– Tracking Challenges
• Related Work
• Proposed Methodology
– Detection Verification Tracking (DVT)
• DL based Face Detection
• CNN based Face Verification
• Multi-patch based Face Tracking
– System Framework
– Demonstration
• Experiments and Results
• Future Work
Long-Term Tracking Algorithm
• Common Steps
– Select a video
– Place a bounding box around the target
– Distinguish the object from the background
– Track the object around the same region in the next frame
Ref [1]
Introduction Related Work Methodology Experiments and Results Future Work
Face Tracking Algorithms
• Specialized for tracking faces
• Common approaches:
– Using face detection
– Using Facial Landmark Localization
Tracking Challenges
• Tracking is challenging on noisy real-world videos
• Trackers are often not robust against
– Appearance changes
– Occlusion
– Fast motion
– Illumination changes
– Background clutter
Tracking Challenges (cont.)
• Sensitive to the initialization of the target
• Not able to handle all situations
• Long term tracking challenge:
– Not reliable in cases where the object leaves the view
Related Work
• TLD: Tracking-Learning-Detection
TLD Flow Chart, Ref [2]
Proposed Methodology
• Model
– Detection-Verification-Tracking
• Goal
– Long term face tracking
– Targets in unconstrained ("in the wild") videos
• Employ
– Deep learning based face detection
– CNN based face verification
– Multi-patch based tracking
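The sketch below shows one plausible way the three components could be chained into a single loop. The system's flow chart did not survive in this transcript, so the control flow is an assumption: track while the tracker reports success, otherwise fall back to detection plus verification. The callables `detect`, `verify`, and `track` are hypothetical placeholders, not the authors' code.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def dvt_loop(
    frames: Sequence[object],
    detect: Callable[[object], List[Box]],
    verify: Callable[[object, Box], bool],
    track: Callable[[object, Box], Optional[Box]],
) -> List[Optional[Box]]:
    """Return one target box (or None) per frame.

    All three callables are hypothetical placeholders:
      detect(frame)      -> candidate face boxes
      verify(frame, box) -> True if the face in box matches the target
      track(frame, box)  -> updated box, or None if the target was lost
    """
    results: List[Optional[Box]] = []
    current: Optional[Box] = None
    for frame in frames:
        if current is not None:
            # Tracking mode: follow the box found in the previous frame.
            current = track(frame, current)
        if current is None:
            # Detection mode: search the detected faces for the target.
            current = next(
                (box for box in detect(frame) if verify(frame, box)), None
            )
        results.append(current)
    return results
```

A `None` entry in the result means the target was not found in that frame, which is how a face that leaves the view would show up under this assumed design.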
DL based Face Detection
• Model
– Cascade architecture built on CNN
• CNN structure:
– 3 CNNs for faces vs. non-faces (binary classification)
– 3 CNNs for bounding box calibration (Multiclass classification)
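The cascade structure above can be sketched as a chain of (classifier, calibrator) pairs, where each stage discards non-face windows and adjusts the survivors before the next stage sees them. This is a minimal illustration, not the detector from Ref [3]: the names `run_cascade` and `toy_stage`, the `min_score` cut-off, and the toy scoring rule are all assumptions, and steps such as non-maximum suppression are left out.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
# One stage = (binary face classifier, bounding-box calibrator).
Stage = Tuple[Callable[[Box], float], Callable[[Box], Box]]


def run_cascade(
    candidates: Sequence[Box],
    stages: Sequence[Stage],
    min_score: float = 0.5,  # assumed cut-off, not taken from the slides
) -> List[Box]:
    """Pass candidate windows through each stage in order."""
    survivors = list(candidates)
    for classify, calibrate in stages:
        # Keep windows scored as faces, then adjust their position
        # so the next stage receives a better-aligned window.
        survivors = [calibrate(b) for b in survivors if classify(b) >= min_score]
    return survivors


def toy_stage(min_width: float) -> Stage:
    """Hypothetical stand-in for one detection CNN and its calibration CNN."""

    def classify(box: Box) -> float:
        return 1.0 if box[2] >= min_width else 0.0

    def calibrate(box: Box) -> Box:
        x, y, w, h = box
        return (x + 1.0, y, w, h)

    return classify, calibrate
```

With three toy stages, each later stage sees fewer candidates than the one before it, which is the property that keeps the more expensive networks affordable.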
DL based Face Detection (cont.)
Ref [3]
CNN based Face Verification
• Convolutional Neural Network: 37 layers
• Feature vector dimension: 4096
• Pre-trained network based on MatConvNet (VGG)
• Verification steps:
– Resize the target face to 224x224
– Create the query feature vector from the target face
– Extract features for each individual face
– Compute Cosine similarity
– Compare to a threshold
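The last two verification steps can be sketched as follows, assuming the feature vectors have already been extracted by the network. The function names are hypothetical, the default threshold of 0.75 is the similarity threshold reported on the Results slide, and treating a score equal to the threshold as a match is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no direction
    return dot / (norm_a * norm_b)


def is_same_face(
    query: Sequence[float],
    candidate: Sequence[float],
    threshold: float = 0.75,
) -> bool:
    """Accept the candidate if it is similar enough to the query face."""
    return cosine_similarity(query, candidate) >= threshold
```

Cosine similarity ignores vector magnitude, so two faces are compared by the direction of their feature vectors only.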
Multi-patch based Tracking
• Employs multiple patches around the target
• Categorizes patches as reliable or non-reliable
• Tracks reliable patches
• Ignores non-reliable patches
• Result is the average of the reliable patches
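The averaging step above can be sketched as below. How reliability is measured is not stated on the slide, so the per-patch `score` and the `min_reliability` cut-off are hypothetical; only the "average the reliable patches, ignore the rest" logic comes from the bullets.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position estimate from one patch


def estimate_target(
    patches: Sequence[Tuple[Point, float]],
    min_reliability: float = 0.5,  # assumed cut-off, not from the slides
) -> Optional[Point]:
    """Average the position estimates of the reliable patches.

    Each patch is ((x, y), score); the score is a hypothetical
    reliability value in [0, 1]. Returns None if no patch is reliable.
    """
    reliable = [pos for pos, score in patches if score >= min_reliability]
    if not reliable:
        return None
    n = len(reliable)
    mean_x = sum(x for x, _ in reliable) / n
    mean_y = sum(y for _, y in reliable) / n
    return (mean_x, mean_y)
```

A patch that has drifted far from the target is simply excluded, so it cannot pull the averaged estimate away from the face.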
System Framework
[Figure: verification-based face tracking flow chart]
Demonstration
Demonstration of the system for pausing the video and selecting the target face to be tracked
Demonstration
Results
• Implemented in MATLAB R2015b
• Library: MatConvNet
• Thresholds
– Similarity threshold: 0.75
– Skip time: 3 s
• Running time
– About twice the video duration
Experiments and Results
Character: Roy
Method                Precision   Recall
TLD                   0.7         0.37
Face-TLD              0.75        0.54
DVT (the proposed)    0.95        0.75
Table 1: Comparison of TLD, Face-TLD, and the proposed DVT method in terms of precision and recall on the sitcom IT-Crowd (first series, first episode).
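For reference, precision and recall follow their standard definitions in terms of true positives, false positives, and false negatives. The sketch below uses hypothetical counts; the slides do not say how a frame is judged correct (for example, by bounding-box overlap), so that criterion is left outside the function.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Standard definitions from frame-level counts.

    tp: frames where the reported face is the target
    fp: frames where a face is reported but it is not the target
    fn: frames where the target is visible but not reported
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_precision, example_recall = precision_recall(tp=75, fp=25, fn=75)
```

High precision with lower recall, as in the DVT row, means the tracker rarely reports the wrong face but still misses some frames where the target is visible.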
Future Work
• Using reliable frames to learn the model
– Useful for long term tracking
– Less sensitive to first initialization of the target
• Learning the similarity threshold
– SVM
References
1. Z. H. Khan, I. Y.-H. Gu, and A. G. Backhouse, "A robust particle filter-based method for tracking single visual object through complex scenes using dynamical object shape and appearance similarity," Journal of Signal Processing Systems, vol. 65, no. 1, pp. 63-79, 2011.
2. Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409-1422, 2012.
3. H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5325-5334, 2015.
4. Y. Li, J. Zhu, and S. C. H. Hoi, "Reliable patch trackers: Robust visual tracking by exploiting reliable patches," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 353-361, 2015.
5. http://www.vlfeat.org/matconvnet/
6. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
7. S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 539-546, 2005.
8. Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Advances in Neural Information Processing Systems, pp. 1988-1996, 2014.
9. Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1891-1898, 2014.
10. Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face recognition with very deep neural networks," arXiv preprint arXiv:1502.00873, 2015.
11. Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2892-2900, 2015.
12. F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
Thank you!
Contact us:
Elaheh.Rashedi@wayne.edu

Editor's Notes

  • #4 Motion-based recognition (human identification based on gait, automatic object detection, etc.); automated surveillance (monitoring a scene to detect suspicious activities or unlikely events); video indexing (automatic annotation and retrieval of videos in multimedia databases); human-computer interaction (gesture recognition, eye gaze tracking for data input to computers, etc.); traffic monitoring (real-time gathering of traffic statistics to direct traffic flow); vehicle navigation (video-based path planning and obstacle avoidance capabilities).
  • #10, #11 Faces in videos show large visual variations in pose, expression, and lighting, so a good algorithm is needed to differentiate faces from the background. We used a CNN-based cascade detection algorithm. It reduces the number of candidates at later stages by applying a CNN-based calibration stage after each detection stage in the cascade: the output of one stage is used to adjust the detection window position for input to the subsequent stage. Training used 5,800 background images to obtain negative samples and the AFLW dataset for positive samples (26,000), with a multinomial logistic regression objective function for optimization. There are two networks at each of the 12, 24, and 48 sizes. 12 -> convolution, max pooling, fully connected, label. 24 -> convolution, max pooling, fully connected, label. 48 -> convolution, max pooling, normalization, convolution, normalization, fully connected, labels.
  • #13 Far away from target; imbalance between foreground and background particles.