Learning a multi-center convolutional network for unconstrained face alignment

Zhiwen Shao, Hengliang Zhu, Yangyang Hao,
Min Wang, and Lizhuang Ma
Shanghai Jiao Tong University

Face Alignment
Detecting facial landmarks such as pupil centers, the nose tip, and mouth corners

Challenges
Unconstrained scenarios including severe occlusions and large face variations

Motivation
• Methods based on low-level handcrafted features have a limited capacity to represent highly complex faces
Deep convolutional network
• A nonlinear regression problem, which transforms appearance to shape

Previous Deep Learning Methods
Multiple networks based
Cascaded CNN [1], Zhou et al. [2], CFAN [3], and CDAN [4] employ cascaded deep networks to refine predicted shapes
• high model complexity
• time-consuming training processes
[1] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2013, pp. 3476–3483.
[2] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin, “Extensive facial landmark localization with coarse-to-fine convolutional network cascade,” in IEEE International Conference on Computer Vision Workshops. IEEE, 2013, pp. 386–391.
[3] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment,” in European Conference on Computer Vision. Springer, 2014, pp. 1–16.
[4] R. Weng, J. Lu, Y.-P. Tan, and J. Zhou, “Learning cascaded deep auto-encoder networks for face alignment,” IEEE Transactions on Multimedia, vol. 18, no. 10, pp. 2066–2078, 2016.

Previous Deep Learning Methods
Single network based
TCDCN [5] needs extra labels of facial attributes for samples, which limits the universality of this method
Aim: one single network without auxiliary information
[5] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Learning deep representation for face alignment with auxiliary attributes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 918–930, 2016.

Structural Correlations
• Unconstrained faces with partial occlusion and large pose
[Figure: two example faces, captioned "Chin is occluded" and "Right contour is invisible"]
Landmarks in the same local region have similar properties, including occlusion and visibility

Face Partition
Partition of facial landmarks for different labeling patterns
[Figure: partitions for the 29-landmark and 68-landmark patterns]
Seven clusters: left eye, right eye, nose, mouth, left contour, chin, and right contour

Network Architecture
• Shared layers
• Multiple center-specific shape prediction layers

Shared layers
• Eight convolutional layers and one fully-connected layer
• Each max-pooling layer follows a stack of two convolutional layers

Multiple center-specific shape prediction layers
• Each cluster of facial landmarks is treated as a separate center
• Each layer estimates x and y coordinates of all n facial landmarks
• Each layer focuses on the shape estimation of a specific face region

Loss Function
• Weighted inter-ocular distance normalized Euclidean loss
$$E = \sum_{j=1}^{n} w_j \left[ (f_{2j-1} - \hat{f}_{2j-1})^2 + (f_{2j} - \hat{f}_{2j})^2 \right] / (2 d^2)$$
$w_j$: weight of the j-th landmark
$f$: ground truth coordinates
$\hat{f}$: predicted coordinates
$d$: ground truth inter-ocular distance
The first center-specific layer: larger weights for landmarks around the left eye
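The weighted, inter-ocular distance normalized Euclidean loss can be sketched in a few lines of plain Python. This is only an illustration for a single face, not the authors' implementation: the function name `weighted_normalized_loss` and the flat coordinate layout (x1, y1, ..., xn, yn) are assumptions made here.

```python
from typing import Sequence


def weighted_normalized_loss(
    truth: Sequence[float],
    pred: Sequence[float],
    weights: Sequence[float],
    inter_ocular: float,
) -> float:
    """Weighted Euclidean loss normalized by the inter-ocular distance.

    truth and pred are flat sequences (x1, y1, ..., xn, yn); weights has
    one entry per landmark. The layout is an assumption of this sketch.
    """
    n = len(weights)
    if len(truth) != 2 * n or len(pred) != 2 * n:
        raise ValueError("expected two coordinates per landmark")
    total = 0.0
    for j in range(n):  # j is a 0-based landmark index
        dx = truth[2 * j] - pred[2 * j]          # x error of landmark j
        dy = truth[2 * j + 1] - pred[2 * j + 1]  # y error of landmark j
        total += weights[j] * (dx * dx + dy * dy)
    return total / (2.0 * inter_ocular ** 2)


# Two landmarks, inter-ocular distance 4 (made-up numbers).
truth = [0.0, 0.0, 4.0, 0.0]
pred = [1.0, 0.0, 4.0, 2.0]
uniform = weighted_normalized_loss(truth, pred, [1.0, 1.0], 4.0)
emphasized = weighted_normalized_loss(truth, pred, [2.0, 0.5], 4.0)
```

Raising the weight of one landmark makes its error count more in the loss, which is how a center-specific layer can emphasize its own face region.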

Multi-Center Learning
• Basic Model
• Reinforcement for Each Center
• Combined Model

Weight Computation
During the i-th fine-tuning step
• Multiple relationship
$$w_{P^{(i)c}} = \eta \, w_{P^{(i)m}}$$
$P^{(i)c}$: set of center-specific landmarks
$P^{(i)m}$: set of remaining minor landmarks
$\eta$: amplification factor
Different fine-tuning steps have different center-specific and minor facial landmarks
• Consistent with the basic model
$$w_{P^{(i)c}} \, |P^{(i)c}| + w_{P^{(i)m}} \, (n - |P^{(i)c}|) = n$$
$|\cdot|$: number of elements in a set

Weight Computation
During the i-th fine-tuning step
$$w_{P^{(i)c}} = \eta n / [(\eta - 1) |P^{(i)c}| + n]$$
$$w_{P^{(i)m}} = n / [(\eta - 1) |P^{(i)c}| + n]$$
• Other centers get relatively small weights rather than zero
• Utilize implicit structural correlations among different parts
• Landmarks from the same cluster have similar properties
• Share an identical weight
• Search the solution smoothly
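The closed-form weights follow from the two conditions on the previous slide: the center-specific weight is η times the minor weight, and all n weights sum to n. A minimal Python sketch of that computation is given below. The function name `fine_tuning_weights`, the 0-based landmark indices, and the example values (29 landmarks, a 4-landmark cluster, η = 10) are illustrative assumptions, not values taken from the slides.

```python
from typing import List, Set


def fine_tuning_weights(n: int, center: Set[int], eta: float) -> List[float]:
    """Per-landmark weights for one fine-tuning step.

    n: total number of landmarks
    center: 0-based indices of the center-specific landmarks
    eta: amplification factor
    """
    if not all(0 <= j < n for j in center):
        raise ValueError("center indices must lie in range(n)")
    denom = (eta - 1.0) * len(center) + n
    w_center = eta * n / denom  # shared by all center-specific landmarks
    w_minor = n / denom         # shared by all remaining (minor) landmarks
    return [w_center if j in center else w_minor for j in range(n)]


# Hypothetical step: 29 landmarks, the first four form the current center.
weights = fine_tuning_weights(29, {0, 1, 2, 3}, eta=10.0)
```

Because the weights still sum to n, the overall scale of the loss stays consistent with the basic model, where every landmark has weight 1.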

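Combined Model
High-level representation:
$$\mathbf{x} = (x_0, x_1, \ldots, x_D)^T \in \mathbb{R}^{(D+1) \times 1} \quad (D = 1024)$$
Weight matrix:
$$\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{2n}) \in \mathbb{R}^{(D+1) \times 2n}$$
$$\mathbf{w}_k = (w_{0k}, w_{1k}, \ldots, w_{Dk})^T, \quad k = 1, \ldots, 2n$$
$$\hat{f}_{2j-1} = \mathbf{w}_{2j-1}^T \mathbf{x}, \qquad \hat{f}_{2j} = \mathbf{w}_{2j}^T \mathbf{x}$$
$\mathbf{W}^i$: weight matrix of the i-th center-specific layer
$$\mathbf{w}^{combined}_{2j-1} = \mathbf{w}^i_{2j-1}, \qquad \mathbf{w}^{combined}_{2j} = \mathbf{w}^i_{2j}, \qquad i = 1, \ldots, m, \; j \in P^{(i)c}$$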
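The column-copying rule for the combined weight matrix can be sketched in plain Python. This is an illustration under stated assumptions rather than the released code: the helper names `combine_layers` and `predict` are hypothetical, matrices are nested lists with D+1 rows and 2n columns, landmark indices are 0-based, and the clusters are assumed to partition all n landmarks.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # (D+1) rows, 2n columns


def combine_layers(layers: Sequence[Matrix], clusters: Sequence[Sequence[int]]) -> Matrix:
    """Build the combined matrix: for each center i, copy the x and y
    columns of every landmark in its cluster from the i-th layer."""
    if len(layers) != len(clusters):
        raise ValueError("need one cluster per center-specific layer")
    rows, cols = len(layers[0]), len(layers[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    seen = set()
    for W_i, cluster in zip(layers, clusters):
        for j in cluster:
            if j in seen:
                raise ValueError(f"landmark {j} is in more than one cluster")
            seen.add(j)
            for col in (2 * j, 2 * j + 1):  # x and y columns of landmark j
                for r in range(rows):
                    combined[r][col] = W_i[r][col]
    if 2 * len(seen) != cols:
        raise ValueError("clusters must cover every landmark")
    return combined


def predict(W: Matrix, x: Sequence[float]) -> List[float]:
    """Shape prediction W^T x: one coordinate per column of W."""
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(len(W[0]))]


# Toy case: D + 1 = 2 features, n = 2 landmarks, m = 2 centers.
layer_a = [[1.0] * 4, [1.0] * 4]
layer_b = [[2.0] * 4, [2.0] * 4]
W_combined = combine_layers([layer_a, layer_b], [[0], [1]])
shape = predict(W_combined, [1.0, 0.5])
```

The combined matrix has the same size as a single prediction layer, so the sketch also shows why the combined model is no more complex than the basic one.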

Combined Model
Combined model: $\Theta^S \cup \mathbf{W}^{combined}$
• Complexity is the same as that of the basic model
• Improves localization performance by exploiting the advantage of each center-specific solution
Our multi-center learning algorithm takes full advantage of each stage and searches for the optimal solution smoothly

Datasets
COFW: occluded dataset in the wild
• 1345 training images
• 507 testing images
IBUG: large appearance variations
• 3148 training images
• 135 testing images

Evaluation Metric
• Inter-ocular distance normalized mean error
• Cumulative error distribution (CED) curves
• Failure rate
Failure: mean error larger than 10%
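The three metrics can be sketched as follows. This is a plain-Python illustration, not the evaluation script used for the reported numbers; the function names are assumptions, and the inter-ocular distance is passed in directly because the slides do not say which landmarks define it.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalized_mean_error(truth: Sequence[Point], pred: Sequence[Point],
                          inter_ocular: float) -> float:
    """Mean point-to-point error of one face, divided by the
    ground truth inter-ocular distance."""
    dists = [math.hypot(t[0] - p[0], t[1] - p[1]) for t, p in zip(truth, pred)]
    return sum(dists) / (len(dists) * inter_ocular)


def failure_rate(errors: Sequence[float], threshold: float = 0.10) -> float:
    """Fraction of images whose mean error is larger than the threshold."""
    return sum(e > threshold for e in errors) / len(errors)


def ced_curve(errors: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Cumulative error distribution: for each threshold, the fraction
    of images whose mean error does not exceed it."""
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]


# Made-up numbers for illustration only.
nme = normalized_mean_error([(0.0, 0.0), (10.0, 0.0)], [(3.0, 4.0), (10.0, 0.0)], 10.0)
per_image_errors = [0.05, 0.12, 0.08, 0.25]
fail = failure_rate(per_image_errors)
ced = ced_curve(per_image_errors, [0.05, 0.10, 0.30])
```

Errors here are fractions, so the 10% failure threshold is written as 0.10.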

Validation of Multi-Center Learning Algorithm
Mean Error (%) and Failure Rate (%)
Method     COFW Mean   COFW Failure   IBUG Mean   IBUG Failure
Basic      6.26        3.16           9.23        33.33
Combined   6.08        2.96           8.87        25.93
• Good performance of the basic model shows the effectiveness of our network
• The combined model improves accuracy and robustness by reinforcing the learning for each local face region

Validation of Multi-Center Learning Algorithm
Mean error for different clusters on COFW

Comparison with Other Methods
Mean Error (%)
Method      COFW    IBUG
ESR         11.2    17.00
SDM         11.14   15.40
RCPR        8.5     17.26
CFAN        -       16.78
LBF         -       11.98
cGPRT       -       11.03
CFSS        -       9.98
TCDCN       8.05    8.60
CFT         6.33    10.06
Wu et al.   5.93    -
MCNet       6.08    8.87

[Figures: one panel for COFW and one for IBUG]

Deep model     Speed (FPS)   CPU
Cascaded CNN   5             single core, i5-6200U 2.3GHz
CFAN*          43            i7-3770 3.4GHz
CDAN*          50            i5 3.2GHz
TCDCN          50            single core, i5-6200U 2.3GHz
CFT            31            single core, i5-6200U 2.3GHz
MCNet          67            single core, i5-6200U 2.3GHz
Time of face detection is excluded

Conclusions
• We propose a novel multi-center convolutional network, which exploits the representation power of each center
• We propose the reinforcement for each center to improve the shape estimation precision of each facial part
• Comprehensive experiments demonstrate that our method runs in real time and achieves performance competitive with other state-of-the-art techniques

Code
• Matlab: https://github.com/ZhiwenShao/MCNet
• C++: https://github.com/ZhiwenShao/MCNet-Extension
