PR-185: RetinaFace: Single-stage Dense Face Localisation in the Wild

visionNoob
Deng, Jiankang, et al. "RetinaFace: Single-stage Dense Face Localisation in the Wild." arXiv preprint arXiv:1905.00641 (2019).
(Submitted on 2 May 2019 (v1), last revised 4 May 2019 (this version, v2))

Face Detection
state-of-the-art face detection
Definition : face localization
Broader definition : face localization + landmark detection + pixel-wise face parsing + 3D reconstruction

0. Face Recognition
Naïve Example : Face Verification
Each of the two input faces: Preprocessing -> Encoder -> ℝ^d -> L2norm -> Unit vector
Similarity between the two unit vectors: [0,1]
if (similarity > threshold):
    same!
else:
    not same!
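The verification step above can be sketched in a few lines of Python. This is a minimal sketch, assuming cosine similarity as the similarity measure; the function names and the default threshold of 0.5 are hypothetical and not taken from the slides. Plain cosine similarity lies in [-1, 1], so the [0,1] range on the slide would need a rescaling such as (s + 1) / 2, which is also an assumption.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the L2norm step on the slide)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(emb_a, emb_b):
    """Dot product of the two unit vectors; the result lies in [-1, 1]."""
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    return sum(x * y for x, y in zip(a, b))

def is_same_person(emb_a, emb_b, threshold=0.5):
    # threshold=0.5 is an illustrative value, not one given in the slides.
    # Higher similarity means "more alike", so the match test uses ">".
    return cosine_similarity(emb_a, emb_b) > threshold
```

In practice the threshold would be tuned on a validation set to hit a target false-accept rate.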

0. Face Recognition
Naïve Example : Face Verification
Preprocessing
Detecting
1. Facial location
2. Facial Landmarks
ROI region -> Face Registration (112px × 112px) -> Encoder -> ℝ^d
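Face registration from five landmarks is commonly done by fitting a similarity transform (scale, rotation, translation) that maps the detected landmarks onto a fixed template inside the 112px × 112px crop. The sketch below is one way to do that fit, using complex numbers for 2D points. The template coordinates are hypothetical values chosen for illustration, not the ones used by any particular library, and the function names are likewise assumptions.

```python
def fit_similarity(src_pts, dst_pts):
    """Least-squares complex a, b such that a * src + b approximates dst.

    abs(a) is the scale and the phase of a is the rotation angle.
    """
    src = [complex(x, y) for x, y in src_pts]
    dst = [complex(x, y) for x, y in dst_pts]
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    num = sum((s - src_mean).conjugate() * (d - dst_mean)
              for s, d in zip(src, dst))
    den = sum(abs(s - src_mean) ** 2 for s in src)
    a = num / den
    b = dst_mean - a * src_mean
    return a, b

def apply_similarity(a, b, pts):
    """Map (x, y) points through the fitted transform."""
    out = [a * complex(x, y) + b for x, y in pts]
    return [(z.real, z.imag) for z in out]

# Hypothetical 112 x 112 template: left eye, right eye, nose tip,
# left mouth corner, right mouth corner.
TEMPLATE_112 = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0),
                (42.0, 92.0), (70.0, 92.0)]

# Example: landmarks detected on a face twice as large, shifted by (100, 50).
detected = [(2 * x + 100, 2 * y + 50) for x, y in TEMPLATE_112]
a, b = fit_similarity(detected, TEMPLATE_112)
aligned = apply_similarity(a, b, detected)
```

The same transform would then be handed to an image library to resample the face crop itself; that step is omitted here.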

1. Introduction
1.2 RetinaFace

face localization (bbox) + face landmarks (key points) + dense localization mask

1. Introduction
1.3 Main Contributions
1. Based on a single-stage design, we propose a novel pixel-wise face localisation
method named RetinaFace, which employs a multi-task learning strategy to
simultaneously predict face score, face box, five facial landmarks, and 3D position and
correspondence of each facial pixel.
2. On the WIDER FACE hard subset, RetinaFace outperforms the AP of the state-of-the-art
two-stage method, reaching an AP of 91.4%.
3. On the IJB-C dataset, RetinaFace helps to improve ArcFace’s verification accuracy.
4. By employing lightweight backbone networks, RetinaFace can run in real time on a
single CPU core for a VGA-resolution image.
5. Extra annotations and code have been released to facilitate future research.

WIDER Face & Person Challenge 2019
Track 1: Face Detection
Track 2: Pedestrian Detection
Track 3: Cast Search by Portrait
Track 4: Person Search by Language
http://wider-challenge.org/2019.html

2. Related Work
2.1 Image Pyramid vs Feature Pyramid
2. Related Work
2.1. Image pyramid v.s. feature pyramid
2.2. Two-stage v.s. single-stage
2.3. Context Modelling
2.4. Multi-task Learning
Hao, Zekun, et al. "Scale-aware face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
Figure panels: Image Pyramid | Feature Pyramid

2. Related Work
2.2 Two-stage v.s. single-stage

2. Related Work
2.3 Context Modeling
Context Module
To enhance the model’s contextual reasoning power.

2. Related Work
2.3 Context Modeling
Deformable Convolutional Network
J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
X. Zhu, H. Hu, S. Lin, and J. Dai. Deformable convnets v2: More deformable, better results. arXiv:1811.11168, 2018.

2. Related Work
2.4 Multi-task Learning
He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE international conference on computer vision. 2017.
Mask R-CNN
Multi-task learning

3. RetinaFace
3.1. Multi-task Loss
3.1. Multi-task loss
3.2. Dense Regression Branch
Multi-task learning
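The slide carries the loss only as an image. The form below follows Eq. 1 of the RetinaFace paper and should be checked against it. For each anchor i, the loss combines a classification term with box, landmark and dense (pixel) regression terms that are active only for positive anchors:

```latex
\begin{equation}
L = L_{cls}(p_i, p_i^{*})
  + \lambda_1 \, p_i^{*} \, L_{box}(t_i, t_i^{*})
  + \lambda_2 \, p_i^{*} \, L_{pts}(l_i, l_i^{*})
  + \lambda_3 \, p_i^{*} \, L_{pixel}
\end{equation}
```

Here p_i is the predicted face probability and p_i^* is 1 for a positive anchor and 0 for a negative one, t_i and t_i^* are the predicted and ground-truth box coordinates, and l_i and l_i^* are the predicted and ground-truth five landmarks. The paper reports the balancing weights as λ1 = 0.25, λ2 = 0.1 and λ3 = 0.01, so the box and landmark terms dominate the weaker dense-regression signal.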

3. RetinaFace
3.1. Multi-task loss
Zhou, Yuxiang, et al. "Dense 3D Face Decoding over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

4. Experiments
4.1 Dataset
WIDER face (hard)
- 32,203 images, 393,703 face bboxes
(with a high degree of variability in scale, pose, expression, occlusion and illumination)

Example images: car accident, couple, concert
4.1. Dataset
4.2. Implementation details
4.3. Ablation Study
4.4. Face box Accuracy
4.5. Five Facial Landmark Accuracy
4.6. Dense Facial Landmark Accuracy
4.7. Face Recognition Accuracy
4.8. Inference Efficiency

4. Experiments
4.1 Dataset
Extra Annotation
- Facial landmarks (eye centres, nose tip and mouth corners)
- 84.6k faces on the training set and 18.5k faces on the validation set.

4. Experiments
4.2 Implementation details
1. Feature pyramid
2. Context module
3. Anchor setting
4. Data augmentation
5. Training detail
6. Testing detail
Conv -> DCN (convolutions replaced with deformable convolutions)
Output channels per feature-map location: # of anchors * (2 + 4 + 10 + 128 + 7 + 9)
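The per-anchor output size in the formula above can be spelled out as a small calculation. The task names attached to each number are an interpretation based on the RetinaFace paper (face score, box, five landmarks, and the mesh decoder's shape/texture, camera and illumination parameters); the slide itself lists only the numbers.

```python
# Per-anchor output sizes from the slide: 2 + 4 + 10 + 128 + 7 + 9.
# The keys are assumed labels, not text from the slide.
HEAD_SIZES = {
    "face_score": 2,             # face / not-face
    "face_box": 4,               # box regression targets
    "five_landmarks": 10,        # 5 points * (x, y)
    "shape_texture_params": 128,
    "camera_params": 7,
    "illumination_params": 9,
}

def head_channels(num_anchors):
    """Total output channels at one feature-map location."""
    return num_anchors * sum(HEAD_SIZES.values())

print(head_channels(1))  # 160
print(head_channels(3))  # 480
```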

4. Experiments
Anchor setting
- Scale step at 2^(1/3) and the aspect ratio at 1:1
- With the input image size at 640 × 640, the anchors can cover
scales from 16 × 16 to 406 × 406 on the feature pyramid levels.
In total, there are 102,300 anchors, and 75% of these anchors are
from P2.
- OHEM
- 1:3 (pos : neg)
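The anchor figures above can be checked with a short calculation. This sketch assumes feature pyramid levels P2 to P6 with strides 4, 8, 16, 32 and 64, and three anchor scales per level. These values are not on the slide, but they reproduce its totals.

```python
INPUT_SIZE = 640
# Assumed strides for pyramid levels P2..P6 (not stated on the slide).
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}
# Three scales per level (scale step 2^(1/3)) and one aspect ratio (1:1).
ANCHORS_PER_LOCATION = 3

def anchor_counts(input_size=INPUT_SIZE):
    """Number of anchors on each pyramid level for a square input."""
    counts = {}
    for level, stride in STRIDES.items():
        side = input_size // stride
        counts[level] = side * side * ANCHORS_PER_LOCATION
    return counts

def anchor_scales(base=16.0, num_levels=5, per_level=3):
    """Anchor side lengths, growing by a factor of 2^(1/3) per step."""
    return [base * 2 ** (k / 3) for k in range(num_levels * per_level)]

counts = anchor_counts()
total = sum(counts.values())
scales = anchor_scales()
print(total)                                # 102300
print(round(counts["P2"] / total, 3))       # 0.751
print(round(scales[0]), round(scales[-1]))  # 16 406
```

Because P2 has the finest stride, it holds 160 × 160 × 3 = 76,800 of the anchors, which is where the 75% figure comes from.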

4. Experiments
Data augmentation
- Random crop
- Horizontal flip
- Photometric colour distortion
Training Details
- SGD (momentum at 0.9, weight decay at 0.0005, batch size of 8 × 4)
- on four NVIDIA Tesla P40 (24GB) GPUs.
- The learning rate starts from 10^-3, rising to 10^-2 after 5 epochs,
then divided by 10 at 55 and 68 epochs.
- terminating at 80 epochs.
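The learning-rate schedule in the training details can be written as a function of the epoch. This is a sketch under one assumption: the slide only says the rate is "rising" during the first 5 epochs, and a linear warm-up is assumed here. The function name is hypothetical.

```python
def learning_rate(epoch):
    """Learning rate for a (possibly fractional) epoch index in [0, 80)."""
    base_lr, peak_lr, warmup_epochs = 1e-3, 1e-2, 5
    if epoch < warmup_epochs:
        # Linear warm-up is an assumption; the slide does not give the shape.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    lr = peak_lr
    # Divide by 10 at each milestone that has been reached.
    for milestone in (55, 68):
        if epoch >= milestone:
            lr /= 10
    return lr
```

With this schedule the rate is 10^-2 for epochs 5 to 54, 10^-3 for epochs 55 to 67, and 10^-4 from epoch 68 until training stops at epoch 80.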
Testing Details
- Flip as well as multi-scale testing (short edge of the image at [500, 800, 1100, 1400, 1700]).
- Box voting with an IoU threshold of 0.4 (plain NMS is also acceptable).
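Since the slide notes that plain NMS can stand in for box voting, here is a minimal greedy NMS sketch using the same IoU threshold of 0.4. It is not box voting itself, which additionally refines the coordinates of each kept box using the boxes it suppresses. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.4):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```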

4. Experiments – Ablation study
Overview
- Training data: WIDER Face Dataset (easy, medium, hard), with extra supervision
- RetinaFace outputs: face detection, five-landmark detection, 3D face reconstruction
- Lightweight backbone (MobileNet) -> real-time inference
- Face detection: SOTA (AP 91.4%)
- ArcFace (with RetinaFace) on the IJB-C Dataset -> better verification accuracy

4. Experiments
4.3. Ablation Study
Metrics: AP at IoU=0.5; mAP averaged over IoU=0.5:0.05:0.95
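The notation IoU=0.5:0.05:0.95 means the AP is computed at ten IoU thresholds from 0.5 to 0.95 in steps of 0.05 and then averaged. A small sketch of that averaging follows; the per-threshold AP values are supplied by the caller, and the function names are hypothetical.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Thresholds 0.5, 0.55, ..., 0.95, rounded to avoid float drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 2) for k in range(count)]

def mean_over_thresholds(ap_at_threshold):
    """Average per-threshold AP values (dict: threshold -> AP) into one mAP."""
    thresholds = iou_thresholds()
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)
```

AP at IoU=0.5 only asks for a loose overlap, whereas the averaged metric also rewards tightly localised boxes, so it is the stricter of the two numbers.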

4. Experiments
4.3. Ablation Study
He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE international conference on computer vision. 2017.
From Mask R-CNN

4. Experiments : Face Box Accuracy

4. Experiments
4.4. Face box Accuracy (WIDER face)

4. Experiments : Five Facial Landmarks Accuracy

4. Experiments
Plots: normalised mean errors (NME) | cumulative error distribution (CED)
https://pdfs.semanticscholar.org/b4d2/151e29fb12dbe5d164b430273de65103d39b.pdf
Values marked on the plot: 26.31%, 9.37%
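The two landmark metrics named above can be sketched as follows. This assumes that the NME is normalised by the face box size sqrt(W × H) and that a face counts as a failure when its NME exceeds 10%. Both are assumptions taken from the common convention, not from the slide, and the function names are hypothetical.

```python
import math

def nme(pred_pts, gt_pts, box_w, box_h):
    """Mean landmark error normalised by sqrt(box_w * box_h) (assumed)."""
    dists = [math.dist(p, g) for p, g in zip(pred_pts, gt_pts)]
    return sum(dists) / len(dists) / math.sqrt(box_w * box_h)

def failure_rate(nme_values, threshold=0.10):
    """Fraction of faces whose NME exceeds the threshold (0.10 is assumed)."""
    return sum(v > threshold for v in nme_values) / len(nme_values)

def ced_curve(nme_values, thresholds):
    """Cumulative error distribution: share of faces with NME <= each threshold."""
    n = len(nme_values)
    return [sum(v <= t for v in nme_values) / n for t in thresholds]
```

Plotting ced_curve against a range of thresholds gives the CED plot, and one minus its value at the failure threshold is the failure rate.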

4. Experiments : Dense Facial Landmark Accuracy

4. Experiments

4. Experiments : Face Recognition Accuracy

4. Experiments

4. Experiments : Inference Efficiency

4. Experiments
https://github.com/deepinsight/insightface/tree/master/RetinaFace

4. Experiments
Yoo, YoungJoon, Dongyoon Han, and Sangdoo Yun. "EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse." arXiv preprint arXiv:1906.06579 (2019).

5. Conclusion
- Dataset: WIDER Face Dataset
- RetinaFace tasks: face detection, five-landmark detection, 3D face reconstruction
- Face detection: SOTA (AP 91.4%)
- MobileNet backbone
- ArcFace (with RetinaFace) on the IJB-C Dataset
Code is available at https://github.com/deepinsight/insightface
(MXNet)

Lightweight Face Recognition Challenge
https://ibug.doc.ic.ac.uk/resources/lightweight-face-recognition-challenge-workshop/
