The document discusses tracking faces using active appearance models (AAMs). It provides an overview of modeling face shape and texture, and fitting the combined model to images using an iterative process. Specifically, it characterizes faces by locating them and representing them with a statistical model that captures variations in shape and texture with a small number of parameters. This model can then be fitted to new images by minimizing the difference between the synthetic and real face.
The document presents Active Appearance Models, which use principal component analysis to create a statistical model that captures appearance variations in images. It discusses how PCA is used to model shape and texture independently, then combined into a single model. The model can generate synthetic images and interpret new images by iteratively adjusting parameters to minimize differences between the input and generated images. The presenter shows the model can successfully converge and interpret images if initial parameter estimates are reasonable.
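The iterative fitting loop described above can be sketched in miniature. The following is a toy, one-parameter version of the idea (hypothetical data; a real AAM uses high-dimensional shape and texture vectors and a regression matrix learned from training perturbations): the residual between the target and the synthesized appearance is mapped linearly to a parameter update.

```python
# Toy AAM-style search: a 1-parameter linear model g(c) = g_mean + c * mode
# is fitted to a target by repeatedly mapping the residual to a parameter update.

g_mean = [10.0, 20.0, 30.0, 40.0]   # hypothetical mean appearance vector
mode   = [1.0, -1.0, 2.0, 0.5]      # hypothetical mode of variation

def synthesize(c):
    return [m + c * v for m, v in zip(g_mean, mode)]

# "Training": learn a scalar regressor R with dc = R * (residual . mode),
# which for this linear model is exact: R = 1 / (mode . mode).
R = 1.0 / sum(v * v for v in mode)

def fit(target, c=0.0, iters=20):
    for _ in range(iters):
        residual = [t - s for t, s in zip(target, synthesize(c))]
        dc = R * sum(r * v for r, v in zip(residual, mode))
        c += dc
        if abs(dc) < 1e-9:
            break
    return c

target = synthesize(1.7)      # ground-truth parameter c = 1.7
print(round(fit(target), 4))  # converges to 1.7
```

Because the toy model is exactly linear, the search converges in one step; the real method needs several iterations and a reasonable starting estimate, as the summary notes.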
The document summarizes several papers on Active Shape Models and Active Appearance Models. It discusses how ASMs and AAMs are used to model object shape and appearance variations from a set of labeled training examples. Principal component analysis is applied to capture the main modes of shape and appearance variation within a training set. The models can then be used for tasks like image segmentation and pose estimation by searching for the optimal model parameters that match a new image.
This document provides an overview of using various MATLAB functions to analyze digital image data and pixel values, including histograms, contrast enhancement, and statistics. It discusses using imhist to display histograms, histeq and adapthisteq for contrast adjustment, impixel to view pixel values, improfile for intensity profiles along lines, and imcontour to create contour plots. Examples are given applying these functions to analyze and enhance grayscale and RGB images.
This document summarizes a presentation on digital image processing techniques. It discusses contrast-limited adaptive histogram equalization (CLAHE), histogram equalization, image intensity adjustment, adding noise, and median filtering. Methods like adapthisteq, histeq, imadjust, imnoise and medfilt2 are demonstrated to enhance contrast, adjust values, add noise and reduce salt and pepper noise through median filtering. Examples provided apply these techniques and compare results.
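As a rough illustration of what histeq does conceptually, here is plain histogram equalization written out in pure Python (toy 3x3 image; MATLAB's implementation differs in detail): each gray level is mapped through the scaled cumulative distribution of pixel values.

```python
# Plain histogram equalization on an 8-bit grayscale image (list of lists).

def equalize(img, levels=256):
    flat = [p for row in img for p in row]
    n = len(flat)
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # cumulative distribution function
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    # classic equalization mapping: stretch the CDF over the full range
    def remap(p):
        return round((cdf[p] - cdf_min) / max(n - cdf_min, 1) * (levels - 1))
    return [[remap(p) for p in row] for row in img]

img = [[52, 55, 61], [59, 79, 61], [76, 61, 54]]
out = equalize(img)
print(out)  # darkest pixel maps to 0, brightest to 255
```

CLAHE (adapthisteq) applies the same mapping per tile with a clip limit on the histogram, which is what keeps it from over-amplifying noise in near-uniform regions.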
Statistical models of shape and appearance – potaters
This document provides an overview of statistical models for shape and appearance, including statistical shape models, combined appearance models, and matching algorithms like active shape models and active appearance models. It discusses how to build statistical shape models from training images by applying procedures like generalised Procrustes analysis to align shapes and then using principal component analysis to model shape variation. It also explains how to build combined models that correlate shape and texture parameters and how active appearance models work by iteratively updating shape and texture parameters to match a new image.
Digital Image Processing (Lab 09 and 10) – Moe Moe Myint
The document discusses digital image processing using MATLAB. It covers topics like linear filtering, transforms, morphological operations and provides examples of using the dct2 and idct2 commands to compute the discrete cosine transform and inverse discrete cosine transform. It also demonstrates commands like imclose to morphologically close an image, imdilate to dilate an image, and imerode to erode an image providing syntax and examples for each. The document is presented by Dr. Moe Moe Myint from Technological University in Myanmar.
The document describes various image processing techniques in MATLAB, including image adjustment, morphological operations, dilation, erosion, opening, closing, and thresholding. Image adjustment maps pixel values to increase contrast. Morphological operations modify images based on shapes and neighborhoods. Dilation expands objects while erosion shrinks them. Opening removes small objects, while closing fills small holes. Thresholding converts an intensity image to binary.
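The morphological operations summarized above can be illustrated with a minimal pure-Python sketch using a 3x3 square structuring element (border pixels use only the in-bounds part of the neighborhood; real toolboxes offer configurable structuring elements and padding):

```python
# Minimal binary dilation and erosion with a 3x3 square structuring element.
# Dilation grows objects, erosion shrinks them; opening and closing are
# compositions of the two (erode-then-dilate and dilate-then-erode).

def _neighbors(img, r, c):
    rows, cols = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield img[rr][cc]

def dilate(img):
    return [[int(any(_neighbors(img, r, c)))
             for c in range(len(img[0]))] for r in range(len(img))]

def erode(img):
    return [[int(all(_neighbors(img, r, c)))
             for c in range(len(img[0]))] for r in range(len(img))]

img = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 0]]
print(dilate(img))  # the 2x2 block grows to fill the whole 4x4 grid
print(erode(img))   # the 2x2 block is eroded away entirely
```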
This document presents a new iterative method for view- and illumination-invariant image matching. The method iteratively estimates the relationship between the relative view and illumination of two images, transforms one image to match the other's view, and normalizes their illumination for accurate matching. This is done by extracting features to estimate the transformation matrix between the images, then estimating the illumination relationship and transforming one image's histogram to match the other. Performance is improved over traditional methods and is stable against large changes in view and illumination. The method can fail if the initial view and illumination estimates are poor. It also provides a new way to evaluate traditional feature detectors based on their valid ranges of angle and illumination change.
The document provides an overview of five fundamental machine learning algorithms: linear regression, logistic regression, decision tree learning, k-nearest neighbors, and neural networks. It describes the problem statement, solution, and key aspects of each algorithm. For linear regression, it discusses minimizing the squared-error loss to find the optimal regression line. Logistic regression maximizes the likelihood function to find the optimal classification model. Decision tree learning uses the ID3 algorithm to greedily construct a non-parametric model by maximizing information gain at each split.
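For the linear regression case, the squared-error minimizer has a closed form via the normal equations; a minimal sketch on toy data:

```python
# Ordinary least squares for simple linear regression, minimizing the
# squared-error loss via the closed-form normal equations.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = cov(x, y) / var(x); intercept follows from the means
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
          / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]           # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)     # 2.0 1.0
```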
The document describes the implementation of eigenfaces for face recognition using MATLAB. It discusses preparing a training set of face images, calculating the mean image and eigenvectors, projecting new images onto the eigenface space, and recognizing faces by finding the closest match between eigenface coefficients. Key steps include normalizing images, computing the covariance matrix, selecting the most significant eigenvectors, and minimizing Euclidean distances to recognize faces.
Content Based Image Retrieval Using Full Haar Sectorization – CSCJournals
Content-based image retrieval (CBIR) deals with retrieving relevant images from a large image database based on features extracted from the images. In this paper we use the idea of sectorizing Full Haar Wavelet-transformed images to extract features into 4, 8, 12 and 16 sectors. The paper proposes two planes to be sectored: the forward (even) plane and the backward (odd) plane. The similarity measure is also an essential part of CBIR, as it quantifies the closeness of the query image to the database images. We use two similarity measures, Euclidean distance (ED) and sum of absolute differences (AD). The overall retrieval performance of the algorithm is measured by the average precision-recall crossover point together with LIRS and LSRR. The paper compares the performance of the methods with respect to type of plane, number of sectors, type of similarity measure, and the values of LIRS and LSRR.
The document discusses point processing operations in image processing which perform transformations independently on each pixel without considering spatial information. Point processing includes operations like negative, log, power-law transformations, and gamma correction that define a new image as a function of the existing image applied to each pixel. While point processing loses all spatial information, it can be used for basic image enhancement tasks like contrast stretching, histogram equalization, and matching.
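As a small illustration of such point operations, here are the negative and power-law (gamma) transforms for an 8-bit image (toy values; the log transform follows the same per-pixel pattern):

```python
# Point operations: negative and power-law (gamma) transforms applied
# independently to each pixel of an 8-bit image.

def negative(img):
    return [[255 - p for p in row] for row in img]

def gamma_correct(img, gamma):
    # normalize to [0, 1], apply s = r ** gamma, rescale to [0, 255]
    return [[round(((p / 255.0) ** gamma) * 255) for p in row] for row in img]

img = [[0, 64], [128, 255]]
print(negative(img))            # [[255, 191], [127, 0]]
print(gamma_correct(img, 0.5))  # gamma < 1 brightens midtones
```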
This paper proposes a novel approach called R2P (Recomposition and Retargeting of Photographic Images) that can automatically alter the composition of an input source image to match a reference image, while also resizing the recomposed output image to fit the reference. The method first extracts foreground objects from the source and reference images. It then performs recomposition by solving a graph matching optimization to transfer the composition from the reference to the source. Finally, it jointly retargets the recomposed source image by optimizing a mesh warping technique to minimize distortion while fitting the size of the reference image. The approach requires no user input, pre-collected training data, or predetermined composition rules.
This document proposes a new algorithm called 3D continuous dynamic programming (3DCDP) for segmentation-free registration and optimal matching of 3D voxel patterns. 3DCDP allows for matching of full voxels inside 3D objects, not just surfaces. It matches a reference 3D image to parts of an input image in a segmentation-free manner. The algorithm extends 2D continuous dynamic programming to 3D by combining three 2DCDP planes. Experiments show it can accurately match a reference image embedded within an input image.
Eugen Zaharescu – Statement of Research Interest
- The document is a research statement from Dr. Eugen Zaharescu that outlines his interests in mathematical morphology, image analysis, and ontology generation.
- His research has included extending mathematical morphology theory to multivariate images and exploring morphological operators in logarithmic image processing.
- More recently, he has developed algorithms and tools for machine learning, computer vision, and image understanding by applying mathematical concepts from morphology.
This document discusses face recognition and the PCA algorithm for face recognition. It begins with an introduction to face recognition and its uses. It then explains the PCA algorithm for face recognition in 6 steps: 1) converting images to vectors, 2) normalizing the vectors, 3) calculating eigenvectors from the normalized vectors, 4) selecting important eigenvectors, 5) representing faces as combinations of eigenvectors, and 6) recognizing faces. It discusses the strengths and weaknesses of face recognition and lists several applications such as access control, law enforcement, and banking.
This document discusses face recognition using principal component analysis (PCA). It begins by defining face recognition and distinguishing it from face detection. It then outlines the steps of the PCA algorithm for face recognition, including representing faces as vectors, calculating the average face vector, normalizing faces, calculating the covariance matrix, selecting eigenvectors to reduce dimensionality, projecting faces into the reduced eigenface space, and representing faces as a linear combination of eigenfaces. The document focuses on explaining the PCA algorithm and its steps for performing eigenface-based face recognition.
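The pipeline above can be sketched end-to-end on toy data. The following pure-Python miniature (three 3-pixel "faces" with made-up values) centers the data, extracts the leading eigenvector of the covariance by power iteration, and projects each face onto it; a real eigenface system works on image-sized vectors and keeps several eigenvectors:

```python
# Eigenface pipeline in miniature: center the face vectors, find the leading
# eigenvector of the covariance by power iteration, then represent each face
# by its coefficient along that "eigenface".

def mean_vec(faces):
    n = len(faces)
    return [sum(f[i] for f in faces) / n for i in range(len(faces[0]))]

def power_iteration(faces, iters=200):
    # the covariance acts on v as sum_f (f . v) * f for centered faces f
    v = [1.0] * len(faces[0])
    for _ in range(iters):
        w = [0.0] * len(v)
        for f in faces:
            coef = sum(a * b for a, b in zip(f, v))
            for i, a in enumerate(f):
                w[i] += coef * a
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

faces = [[2.0, 0.0, 1.0], [4.0, 0.0, 3.0], [6.0, 0.0, 5.0]]
mean = mean_vec(faces)
centered = [[a - m for a, m in zip(f, mean)] for f in faces]
eigenface = power_iteration(centered)
coeffs = [sum(a * b for a, b in zip(f, eigenface)) for f in centered]
print([round(c, 3) for c in coeffs])  # each face as one eigenface coefficient
```

Recognition then reduces to comparing these coefficient vectors, e.g. by Euclidean distance, exactly as the summary describes.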
This document discusses variations of the interval linear assignment problem. It begins with an introduction to assignment problems and defines them as problems that assign resources to activities to minimize cost or maximize profit on a one-to-one basis. It then provides the mathematical model for standard assignment problems and discusses variations such as non-square matrices, maximization/minimization objectives, constrained assignments, and alternate optimal solutions. The document also gives examples of managerial applications and provides two numerical examples solving interval linear assignment problems using an interval Hungarian method.
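For intuition, a tiny assignment instance can be solved by brute force over all one-to-one assignments (made-up cost matrix; the Hungarian method reaches the same optimum in polynomial time):

```python
# The standard assignment problem: assign n resources to n activities
# one-to-one at minimum total cost, here by exhaustive search.
from itertools import permutations

def min_cost_assignment(cost):
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return best, sum(cost[i][best[i]] for i in range(n))

cost = [[9, 2, 7],
        [6, 4, 3],
        [5, 8, 1]]
assignment, total = min_cost_assignment(cost)
print(assignment, total)  # resource i goes to activity assignment[i]
```

Maximization variants negate the costs, and non-square matrices are padded with dummy rows or columns, matching the variations the document lists.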
This document compares two image inpainting algorithms: the Fast Marching Method (FMM) and exemplar-based image inpainting. FMM uses structural consistency to fill damaged regions, while exemplar-based uses both structural and textural consistency. FMM is faster but does not preserve edges as well as exemplar-based. Exemplar-based works for both small and large regions but is slower. Both algorithms were tested on photos for tasks like removing objects or adding effects. Exemplar-based was better for large regions and edge preservation, while FMM was better for speed and small regions.
SURVEY ON POLYGONAL APPROXIMATION TECHNIQUES FOR DIGITAL PLANAR CURVES – Zac Darcy
This document summarizes and compares three techniques for polygonal approximation of digital planar curves:
1) Masood's technique which iteratively deletes redundant points and uses a stabilization process to optimize point locations.
2) Carmona's technique which suppresses redundant points using a breakpoint suppression algorithm and threshold.
3) Tanvir's adaptive optimization algorithm which focuses on high curvature points and applies an optimization procedure.
The techniques are evaluated on standard shapes using measures like number of points, compression ratio, error, and weighted error. Masood's technique generally had lower error while Tanvir's often achieved the highest compression.
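A simplified version of the point-deletion idea can be sketched as follows (hypothetical curve and tolerance; this omits Masood's stabilization step): repeatedly delete the vertex whose removal adds the least perpendicular deviation, stopping when any further deletion would exceed the tolerance.

```python
# Iterative point deletion for polygonal approximation (simplified sketch).

def _deviation(p, a, b):
    # perpendicular distance of p from the line through a and b
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    den = ((bx - ax) ** 2 + (by - ay) ** 2) ** 0.5
    return num / den if den else ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5

def simplify(points, tol):
    pts = list(points)
    while len(pts) > 2:
        devs = [(_deviation(pts[i], pts[i - 1], pts[i + 1]), i)
                for i in range(1, len(pts) - 1)]
        d, i = min(devs)
        if d > tol:
            break
        del pts[i]
    return pts

curve = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.02), (4, 0), (5, 3)]
print(simplify(curve, tol=0.1))  # keeps only the corner points
```

The trade-off the survey measures falls directly out of `tol`: a larger tolerance gives higher compression (fewer points) at the cost of higher approximation error.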
Contrast enhancement using various statistical operations and neighborhood pr... – sipij
This document proposes a novel contrast enhancement algorithm using various statistical operations and neighborhood processing. It begins with an overview of histogram equalization and some of its limitations. It then discusses related work on other histogram equalization techniques, including classical histogram equalization, brightness preserving bi-histogram equalization, recursive mean separate histogram equalization, and background brightness preserving histogram equalization. The proposed method is then described, which applies statistical operations like mean and standard deviation within a neighborhood to locally enhance pixels. Pixels are replaced from an initially equalized image if their difference from the local mean exceeds a threshold. This aims to preserve local brightness features. Finally, metrics for evaluating image quality like PSNR, SSIM, and CNR are defined to analyze the results.
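The local replacement rule, as summarized, might look like the following sketch (toy images and threshold; the paper's actual rule, neighborhood size, and threshold may differ): where a pixel deviates strongly from its 3x3 neighborhood mean, the equalized value is used, and otherwise the original is kept.

```python
# Sketch of a local-statistics replacement rule for contrast enhancement.

def local_mean(img, r, c):
    rows, cols = len(img), len(img[0])
    vals = [img[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))]
    return sum(vals) / len(vals)

def enhance(original, equalized, threshold=40):
    rows, cols = len(original), len(original[0])
    # take the equalized pixel only where the original deviates strongly
    # from its local mean; elsewhere keep the original brightness
    return [[equalized[r][c]
             if abs(original[r][c] - local_mean(original, r, c)) > threshold
             else original[r][c]
             for c in range(cols)] for r in range(rows)]
```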
Determination of Optimal Product Mix for Profit Maximization using Linear Pro... – IJERA Editor
This paper demonstrates the use of linear programming methods to determine the optimal product mix for profit maximization. Several papers have already demonstrated the use of linear programming to find the optimal product mix in various organizations. This paper aims to show the generic approach to be taken to find the optimal product mix.
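A toy product-mix model in this spirit (made-up profits and resource limits) can be solved by brute-force enumeration over integer quantities; a real application would formulate it as an LP and use the simplex method:

```python
# Toy product mix: maximize profit subject to labor and material limits.

profit = {"A": 30, "B": 50}   # profit per unit (made-up data)
labor  = {"A": 1, "B": 2}     # labor hours per unit, 40 hours available
wood   = {"A": 3, "B": 1}     # board-feet per unit, 60 board-feet available

best = max(
    ((a, b) for a in range(41) for b in range(41)
     if labor["A"] * a + labor["B"] * b <= 40
     and wood["A"] * a + wood["B"] * b <= 60),
    key=lambda q: profit["A"] * q[0] + profit["B"] * q[1],
)
print(best, profit["A"] * best[0] + profit["B"] * best[1])
```

The optimum lands where both constraints bind, which is exactly the vertex a simplex solver would report.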
ABSTRACT: Rigorous work on static and dynamic appearance-based face classification systems is ongoing, but designing a proper system is proving to be a challenging task for researchers, since the human face is complex. Decades of work have focused on how to classify a face and how to increase the classification rate, but little attention has been paid to overcoming redundancy in image classification. This paper presents a novel idea that focuses on redundancy checking and its elimination. After drawing inferences from previous work, the paper sets out a novel idea for exact face classification and the elimination of redundancy.
The document discusses using machine learning algorithms like Support Vector Machines (SVM) for classification and Support Vector Regression (SVR) for regression on facial image data. Dimensionality reduction using Locality Preserving Projections is also discussed to reduce computational requirements. SVM classification of gender on a subset of 3000 images achieved over 99% accuracy. SVR is noted to handle outliers in facial data better than basic linear regression, since its objective favors flat functions by minimizing the slope. The goal is to classify gender and regress age from a set of facial images.
This poster presents a feature-level fusion approach for face and palmprint biometrics using improved K-medoids clustering and isometric graph representations. SIFT features are extracted from face and palmprint images and clustered using an improved K-medoids algorithm. Correspondences between feature points are established and represented as an isometric graph. Fused matching scores are obtained using KNN and correlation distances, exhibiting robust performance and increased accuracy over single biometrics.
International Journal of Computational Engineering Research (IJCER) – ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to scientific knowledge in engineering and technology.
This is a light-hearted, non-technical presentation about the current state of the art in Artificial Intelligence, particularly the fields of Neural Networks and Deep Learning. The talk was presented to the general public at the Mission Creek Festival in Iowa City, IA on March 3, 2015.
Active Appearance Models (AAMs) combine shape and texture models into a single statistical model. AAMs are trained on labeled images to learn the relationship between shape, texture, and model parameters. To interpret a new image, an optimization problem is solved to minimize the difference between the image and one synthesized by the AAM. The model parameters are updated iteratively based on a linear model trained to relate parameter adjustments to image differences. Constrained AAMs incorporate prior information, reducing the influence of the starting approximation on the search results.
Computation of Hydrodynamic Characteristics of Ships using CFD – Nabila Naz
1) The document summarizes research using computational fluid dynamics (CFD) to analyze hydrodynamic characteristics like wave patterns, resistance, and pressure around ship hulls.
2) CFD simulations were conducted using the SHIPFLOW software to model potential, boundary layer, and viscous flow around two ship hulls at varying speeds.
3) Results for wave elevation, resistance coefficients, and streamlines showed good agreement with experimental data, though some discrepancies remained, especially near bow and stern.
Hydrodynamic Modeling of the Physical Dispersion of Radium-Enriched Barite Ai... – Donald Carpenter
This document discusses the formation and transport of radium-enriched barite from oil and gas operations. Barite forms as a byproduct and can incorporate radium, becoming a naturally occurring radioactive material (NORM). Millions of barrels of NORM-impacted barite are generated each year. Surface gamma scans can detect radium contamination, but may be complicated over water or with obstructions. Hydrodynamic and particle tracking models can predict how barite will disperse based on its high density and other properties compared to other materials like quartz. Geomorphic assessments that consider landscape features can identify optimal locations to sample for barite accumulation and radium levels in a more efficient manner than random scanning.
The document provides an overview of five fundamental machine learning algorithms: linear regression, logistic regression, decision tree learning, k-nearest neighbors, and neural networks. It describes the problem statement, solution, and key aspects of each algorithm. For linear regression, it discusses minimizing the squared error loss to find the optimal regression line. Logistic regression maximizes the likelihood function to find the optimal classification model. Decision tree learning uses an ID3 algorithm to greedily construct a non-parametric model by optimizing the average log-likelihood.
The document describes the implementation of eigenfaces for face recognition using MATLAB. It discusses preparing a training set of face images, calculating the mean image and eigenvectors, projecting new images onto the eigenface space, and recognizing faces by finding the closest match between eigenface coefficients. Key steps include normalizing images, computing the covariance matrix, selecting the most significant eigenvectors, and minimizing Euclidean distances to recognize faces.
Content Based Image Retrieval Using Full Haar SectorizationCSCJournals
Content based image retrieval (CBIR) deals with retrieval of relevant images from the large image database. It works on the features of images extracted. In this paper we are using very innovative idea of sectorization of Full Haar Wavelet transformed images for extracting the features into 4, 8, 12 and 16 sectors. The paper proposes two planes to be sectored i.e. Forward plane (Even plane) and backward plane (Odd plane). Similarity measure is also very essential part of CBIR which lets one to find the closeness of the query image with the database images. We have used two similarity measures namely Euclidean distance (ED) and sum of absolute difference (AD). The overall performance of retrieval of the algorithm has been measured by average precision and recall cross over point and LIRS, LSRR. The paper compares the performance of the methods with respect to type of planes, number of sectors, types of similarity measures and values of LIRS and LSRR.
The document discusses point processing operations in image processing which perform transformations independently on each pixel without considering spatial information. Point processing includes operations like negative, log, power-law transformations, and gamma correction that define a new image as a function of the existing image applied to each pixel. While point processing loses all spatial information, it can be used for basic image enhancement tasks like contrast stretching, histogram equalization, and matching.
This paper proposes a novel approach called R2P (Recomposition and Retargeting of Photographic Images) that can automatically alter the composition of an input source image to match a reference image, while also resizing the recomposed output image to fit the reference. The method first extracts foreground objects from the source and reference images. It then performs recomposition by solving a graph matching optimization to transfer the composition from the reference to the source. Finally, it jointly retargets the recomposed source image by optimizing a mesh warping technique to minimize distortion while fitting the size of the reference image. The approach requires no user input, pre-collected training data, or predetermined composition rules.
This document proposes a new algorithm called 3D continuous dynamic programming (3DCDP) for segmentation-free registration and optimal matching of 3D voxel patterns. 3DCDP allows for matching of full voxels inside 3D objects, not just surfaces. It matches a reference 3D image to parts of an input image in a segmentation-free manner. The algorithm extends 2D continuous dynamic programming to 3D by combining three 2DCDP planes. Experiments show it can accurately match a reference image embedded within an input image.
Eugen Zaharescu-STATEMENT OF RESEARCH INTERESTEugen Zaharescu
- The document is a research statement from Dr. Eugen ZAHARESCU that outlines his interests in mathematical morphology, image analysis, and ontology generation.
- His research has included extending mathematical morphology theory to multivariate images and exploring morphological operators in logarithmic image processing.
- More recently, he has developed algorithms and tools for machine learning, computer vision, and image understanding by applying mathematical concepts from morphology.
This document discusses face recognition and the PCA algorithm for face recognition. It begins with an introduction to face recognition and its uses. It then explains the PCA algorithm for face recognition in 6 steps: 1) converting images to vectors, 2) normalizing the vectors, 3) calculating eigenvectors from the normalized vectors, 4) selecting important eigenvectors, 5) representing faces as combinations of eigenvectors, and 6) recognizing faces. It discusses the strengths and weaknesses of face recognition and lists several applications such as access control, law enforcement, and banking.
This document discusses face recognition using principal component analysis (PCA). It begins by defining face recognition and distinguishing it from face detection. It then outlines the steps of the PCA algorithm for face recognition, including representing faces as vectors, calculating the average face vector, normalizing faces, calculating the covariance matrix, selecting eigenvectors to reduce dimensionality, projecting faces into the reduced eigenface space, and representing faces as a linear combination of eigenfaces. The document focuses on explaining the PCA algorithm and its steps for performing eigenface-based face recognition.
This document discusses variations of the interval linear assignment problem. It begins with an introduction to assignment problems and defines them as problems that assign resources to activities to minimize cost or maximize profit on a one-to-one basis. It then provides the mathematical model for standard assignment problems and discusses variations such as non-square matrices, maximization/minimization objectives, constrained assignments, and alternate optimal solutions. The document also gives examples of managerial applications and provides two numerical examples solving interval linear assignment problems using an interval Hungarian method.
This document compares two image inpainting algorithms: the Fast Marching Method (FMM) and exemplar-based image inpainting. FMM uses structural consistency to fill damaged regions, while exemplar-based uses both structural and textural consistency. FMM is faster but does not preserve edges as well as exemplar-based. Exemplar-based works for both small and large regions but is slower. Both algorithms were tested on photos for tasks like removing objects or adding effects. Exemplar-based was better for large regions and edge preservation, while FMM was better for speed and small regions.
SURVEY ON POLYGONAL APPROXIMATION TECHNIQUES FOR DIGITAL PLANAR CURVESZac Darcy
This document summarizes and compares three techniques for polygonal approximation of digital planar curves:
1) Masood's technique which iteratively deletes redundant points and uses a stabilization process to optimize point locations.
2) Carmona's technique which suppresses redundant points using a breakpoint suppression algorithm and threshold.
3) Tanvir's adaptive optimization algorithm which focuses on high curvature points and applies an optimization procedure.
The techniques are evaluated on standard shapes using measures like number of points, compression ratio, error, and weighted error. Masood's technique generally had lower error while Tanvir's often achieved the highest compression.
Contrast enhancement using various statistical operations and neighborhood pr...sipij
This document proposes a novel contrast enhancement algorithm using various statistical operations and neighborhood processing. It begins with an overview of histogram equalization and some of its limitations. It then discusses related work on other histogram equalization techniques including classical histogram equalization, brightness preserving bi-histogram equalization, recursive mean separate histogram equalization, and background brightness preserving histogram equalization. The proposed method is then described, which applies statistical operations like mean and standard deviation within a neighborhood to locally enhance pixels. Pixels are replaced from an initially equalized image if their difference from the local mean exceeds a threshold. This aims to preserve local brightness features. Finally, metrics for evaluating image quality like PSNR, SSIM, and CNR are defined to analyze results
Determination of Optimal Product Mix for Profit Maximization using Linear Pro... – IJERA Editor
This paper demonstrates the use of linear programming methods to determine the optimal product mix for profit maximization. Several papers have been written to demonstrate the use of linear programming in finding the optimal product mix in various organizations. This paper aims to show the generic approach to be taken to find the optimal product mix.
ABSTRACT: Rigorous work on static and dynamic appearance-based classification systems for faces is ongoing, but designing a proper system is proving to be a challenging task for researchers since the human face is a complex object. Decades of work have focused on how to classify a face and how to increase the classification rate, but little attention has been paid to overcoming redundancy in image classification. This paper presents a novel idea which focuses on redundancy checking and its elimination. After drawing inferences from previous work, the paper sets out a novel idea for exact face classification and elimination of redundancy.
The document discusses using machine learning algorithms like Support Vector Machines (SVM) for classification and Support Vector Regression (SVR) for regression on facial image data. Dimensionality reduction using Locality Preserving Projections is also discussed to reduce computational requirements. SVM classification of gender on a subset of 3000 images achieved over 99% accuracy. SVR is noted to better handle outliers in facial data compared to basic linear regression due to minimizing slope. The goal is to classify gender and regress age from a set of facial images.
This poster presents a feature-level fusion approach for face and palmprint biometrics using improved K-medoids clustering and isometric graph representations. SIFT features are extracted from face and palmprint images and clustered using an improved K-medoids algorithm. Correspondences between feature points are established and represented as an isometric graph. Fused matching scores are obtained using KNN and correlation distances, exhibiting robust performance and increased accuracy over single biometrics.
International Journal of Computational Engineering Research (IJCER) – ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
This is a light-hearted, non-technical presentation about the current state of the art in Artificial Intelligence, particularly the field of Neural Networks and Deep Learning. This talk was presented for the general public at the Mission Creek Festival in Iowa City, IA on March 3, 2015.
Active Appearance Models (AAMs) combine shape and texture models into a single statistical model. AAMs are trained on labeled images to learn the relationship between shape, texture, and model parameters. To interpret a new image, an optimization problem is solved to minimize the difference between the image and one synthesized by the AAM. The model parameters are updated iteratively based on a linear model trained to relate parameter adjustments to image differences. Constrained AAMs incorporate prior information to improve the influence of the starting approximation on the search results.
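The iterative matching scheme summarized here (predict a parameter correction from the current image difference via a pre-trained linear model, and accept it only if the residual shrinks) can be sketched as follows. The regression matrix R, the residual callback, and the damping schedule are hypothetical illustrations of the general idea, not the actual AAM implementation:

```python
import numpy as np

def fit_aam(params, image_residual, R, n_iters=30, step=1.0):
    """Iteratively refine AAM parameters.

    params         -- initial parameter estimate (1-D array)
    image_residual -- callable mapping params -> (image sample - model texture)
    R              -- linear model mapping residuals to parameter corrections,
                      learned offline from known parameter displacements
    """
    best = np.asarray(params, dtype=float).copy()
    best_err = np.sum(image_residual(best) ** 2)
    for _ in range(n_iters):
        delta = R @ image_residual(best)          # predicted correction
        for s in (step, step / 2, step / 4):      # damped step sizes
            cand = best - s * delta
            err = np.sum(image_residual(cand) ** 2)
            if err < best_err:                    # keep only improving steps
                best, best_err = cand, err
                break
        else:
            break                                 # no improvement: converged
    return best
```

The damped retries mirror the summary's point that a reasonable starting estimate matters: each step is accepted only when it actually reduces the synthetic-vs-real difference.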
Computation of Hydrodynamic Characteristics of Ships using CFD – Nabila Naz
1) The document summarizes research using computational fluid dynamics (CFD) to analyze hydrodynamic characteristics like wave patterns, resistance, and pressure around ship hulls.
2) CFD simulations were conducted using the SHIPFLOW software to model potential, boundary layer, and viscous flow around two ship hulls at varying speeds.
3) Results for wave elevation, resistance coefficients, and streamlines showed good agreement with experimental data, though some discrepancies remained, especially near bow and stern.
Hydrodynamic Modeling of the Physical Dispersion of Radium-Enriched Barite Ai... – Donald Carpenter
This document discusses the formation and transport of radium-enriched barite from oil and gas operations. Barite forms as a byproduct and can incorporate radium, becoming a naturally occurring radioactive material (NORM). Millions of barrels of NORM-impacted barite are generated each year. Surface gamma scans can detect radium contamination, but may be complicated over water or with obstructions. Hydrodynamic and particle tracking models can predict how barite will disperse based on its high density and other properties compared to other materials like quartz. Geomorphic assessments that consider landscape features can identify optimal locations to sample for barite accumulation and radium levels in a more efficient manner than random scanning.
Integrated hydrodynamic and structural analysis webinar presentation tcm4 601490 – Dai Hung
This document summarizes a webinar on integrated hydrodynamic and structural analysis using the Sesam software. It discusses:
1) An overview of Sesam and its capabilities for hydrostatic, hydrodynamic, and structural analysis of floating structures.
2) A case study analyzing the loads and stresses on an FPSO at different loading conditions, comparing results from the quasi-static and full dynamic solvers. The dynamic solver produced lower pressures, loads, and stresses.
3) Techniques for ensuring accurate load transfer between the hydrodynamic and structural models, including pressure scaling near the waterline and checking the global load balance.
DSD-INT 2016 Hydrodynamic modeling and resource-device suitability analysis o... – Deltares
Presentation by Oliver Dan de Luna, University of the Philippines - Marine Science Institute, Philippines, at the Delft3D User Days during Delft Software Days 2016 on Tuesday, 1 November 2016, Delft.
This document reviews techniques for emotion recognition from facial expressions. It begins by outlining the general steps of emotion recognition systems as face detection, feature extraction, and classification. Popular techniques discussed include principal component analysis (PCA), local binary patterns (LBP), active appearance models, and Haar classifiers. PCA and LBP were found to provide higher recognition rates. The paper also reviews the Facial Action Coding System and compares the performance of different techniques based on recognition rate. In conclusion, PCA is identified as having the highest recognition rate and performance for emotion recognition.
This document provides an introduction and overview of a fluid mechanics course taught by Dr. Mohsin Siddique. It outlines the course details including goals, topics, textbook, and assessment methods. The course aims to provide an understanding of fluid statics and dynamics concepts. Key topics covered include fluid properties, fluid statics, fluid flow measurements, dimensional analysis, and fluid flow in pipes and open channels. Students will be evaluated through assignments, quizzes, a midterm exam, and a final exam. The course intends to develop skills relevant to various engineering fields involving fluid mechanics.
This document provides an overview of artificial neural networks (ANN). It discusses the origin of ANNs from biological neural networks. It describes different ANN architectures like multilayer perceptrons and different learning methods like backpropagation. It also outlines some challenging problems that ANNs can help with, such as pattern recognition, clustering, and optimization. The summary states that while the paper gives a good overview of ANNs, more development is needed to show ANNs are better than other methods for most problems.
MIXTURES OF TRAINED REGRESSION CURVES MODELS FOR HANDWRITTEN ARABIC CHARACTER... – gerogepatton
In this paper, we demonstrate how regression curves can be used to recognize 2D non-rigid handwritten shapes. Each shape is represented by a set of non-overlapping, uniformly distributed landmarks. The underlying models utilize 2nd-order polynomials to model shapes within a training set. To estimate the regression models, we need to extract the coefficients which describe the variations for a set of shape classes. Hence, a least squares method is used to estimate such models. We then proceed by training these coefficients using the Expectation-Maximization algorithm. Recognition is carried out by finding the least-error landmark displacement with respect to the model curves. Handwritten isolated Arabic characters are used to evaluate our approach.
Human’s facial parts extraction to recognize facial expression – ijitjournal
Real-time facial expression analysis is an important yet challenging task in human computer interaction. This paper proposes a real-time, person-independent facial expression recognition system using a geometrical feature-based approach. The face geometry is extracted using the modified active shape model. Each part of the face geometry is effectively represented by the Census Transformation (CT) based feature histogram. The facial expression is classified by the SVM classifier with an exponential chi-square weighted merging kernel. The proposed method was evaluated on the JAFFE database and in a real-world environment. The experimental results show that the approach yields a high recognition rate and is applicable in real-time facial expression analysis.
Face detection using the 3×3 block rank patterns of gradient magnitude images – sipij
Face detection locates faces prior to various face-related applications. The objective of face detection is to determine whether or not there are any faces in an image and, if any, to detect the location of each face. Face detection in real images is challenging due to the large variability of illumination and face appearances. This paper proposes a face detection algorithm using the 3×3 block rank patterns of gradient magnitude images and a geometrical face model. First, the illumination-corrected image of the face region is obtained using the brightness plane that is produced using the locally minimum brightness of each block. Next, the illumination-corrected image is histogram equalized, the face region is divided into nine (3×3) blocks, and two directional (horizontal and vertical) gradient magnitude images are computed, from which the 3×3 block rank patterns are obtained. For face detection, using the FERET and GT databases, three types of the 3×3 block rank patterns are a priori determined as templates based on the distribution of the sum of the gradient magnitudes of each block in the face candidate region, which is also composed of nine (3×3) blocks. The 3×3 block rank patterns roughly classify whether the detected face candidate region contains a face or not. Finally, facial features are detected and used to validate the face model. The face candidate is validated as a face if it matches the geometrical face model. The proposed algorithm is tested on the Caltech database images and real images. Experimental results with a number of test images show the effectiveness of the proposed algorithm.
Large sample property of the bayes factor in a spline semiparametric regressi... – Alexander Decker
This document summarizes a research paper about investigating the large sample property of the Bayes factor for testing the polynomial component of a spline semiparametric regression model against a fully spline alternative model. It considers a semiparametric regression model where the mean function has two parts - a parametric linear component and a nonparametric penalized spline component. By representing the model as a mixed model, it obtains the closed form of the Bayes factor and proves that the Bayes factor is consistent under certain conditions on the prior and design matrix. It establishes that the Bayes factor converges to infinity under the pure polynomial model and converges to zero almost surely under the spline semiparametric alternative model.
Face Recognition Using Gabor features And PCA – IOSR Journals
This document summarizes a research paper on face recognition using Gabor features and principal component analysis (PCA). It begins by providing background on face recognition and discusses challenges like lighting, pose, and orientation. It then describes preprocessing faces using Gabor wavelets to extract discriminative features and reduce variations. PCA is used to further reduce the dimensionality of features into principal components. These components are used for classification, with nearest neighbor classification tested on the Yale face database. Results show the proposed approach improves recognition rates compared to Euclidean distance measures.
This paper proposes a method to jointly match multiple 3D meshes by maximizing pairwise feature affinities and cycle consistency across models. It formulates the matching problem as a low-rank matrix recovery problem and uses nuclear norm relaxation for rank minimization. An alternating minimization algorithm is used to efficiently solve the optimization problem. Experimental results show the method provides an order of magnitude speed-up compared to state-of-the-art algorithms based on semi-definite programming, while achieving competitive performance. It also introduces a distortion term to the pairwise matching to help match reflexive sub-parts of models distinctly.
Joint3DShapeMatching - a fast approach to 3D model matching using MatchALS 3... – Mamoon Ismail Khalid
We extend the global optimization-based approach of jointly matching a set of images to jointly matching a set of 3D meshes. The estimated correspondences simultaneously maximize pairwise feature affinities and cycle consistency across multiple models. We show that the low-rank matrix recovery problem can be efficiently applied to the 3D meshes as well. The fast alternating minimization algorithm helps to handle real-world practical problems with thousands of features. Experimental results show that, unlike the state-of-the-art algorithms which rely on semi-definite programming, our algorithm provides an order of magnitude speed-up along with competitive performance. Along with the joint shape matching, we propose an approach to apply a distortion term in pairwise matching, which helps in successfully matching the reflexive sub-parts of two models distinctively. In the end, we demonstrate the applicability of the algorithm to match a set of 3D meshes of the SCAPE benchmark database.
Unimodal Multi-Feature Fusion and one-dimensional Hidden Markov Models for Lo... – IJECEIAES
The objective of low-resolution face recognition is to identify faces from small-size or poor-quality images with varying pose, illumination, expression, etc. In this work, we propose a robust low-resolution face recognition technique based on one-dimensional Hidden Markov Models. Features of each facial image are extracted in three steps: first, both Gabor filters and the Histogram of Oriented Gradients (HOG) descriptor are calculated. Second, the size of these features is reduced using the Linear Discriminant Analysis (LDA) method in order to remove redundant information. Finally, the reduced features are combined using the Canonical Correlation Analysis (CCA) method. Unlike existing techniques using HMMs, in which each state represents one facial region (eyes, nose, mouth, etc.), the proposed system employs 1D-HMMs without any prior knowledge about the localization of regions of interest in the facial image. Performance of the proposed method is measured using the AR database.
This document describes a framework for 2D pose estimation using active shape models and learned entropy field approximations. A dataset of manually annotated poses was created from NBA footage to train the models. Active shape models use principal component analysis to represent poses as a linear combination of modes of variation learned from the training data. To evaluate pose likelihood, image entropy is proposed as a texture similarity measure and regression is used to learn a function mapping poses to entropy fields, which can be compared to the image entropy. Current results are presented and future work to improve and speed up the approach is discussed.
This document describes a project that uses photometric stereo to reconstruct 3D surfaces from images taken under different lighting conditions on a computer screen. Photometric stereo uses variations in pixel intensities across images to estimate surface normals and reconstruct the 3D shape. The project creates a MATLAB program that performs photometric stereo in real-time by flashing different light patterns on a screen and capturing images with a webcam. By using singular value decomposition, the program can reconstruct surfaces without knowing the exact positions of the light sources, overcoming a limitation of traditional photometric stereo. The reconstruction contains noise but demonstrates photometric stereo in a less controlled environment. Further work could explore tradeoffs of the method and improve efficiency.
This document presents a novel approach for measuring shape similarity and using it for object recognition. The key steps are:
1) Solving the correspondence problem between two shapes by attaching a descriptor called "shape context" to sample points on each shape. Shape context captures the distribution of remaining points relative to the reference point.
2) Using the point correspondences to estimate an aligning transformation between the shapes. This provides a measure of shape similarity as the matching error between corresponding points plus the magnitude of the transformation.
3) Treating recognition as a nearest neighbor problem to find the most similar stored prototype shape. The approach is demonstrated on various datasets including handwritten digits, silhouettes, and 3D objects
The document summarizes two incremental smoothing and mapping algorithms, iSAM and iSAM2. iSAM uses matrix factorization to efficiently update the information matrix when new measurements are obtained. However, periodic batch steps are required to avoid "fill-in", reducing efficiency. iSAM2 uses a novel Bayes tree data structure to represent the factor graph, allowing incremental updates to be made by only modifying the relevant portions of the tree. This provides an exact, incremental solution without periodic batch steps, making it more efficient than iSAM. The document provides examples of applying iSAM2 to a range odometry SLAM problem and a structure from motion application.
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION – ijaia
Function approximation is a popular engineering problem used in system identification or equation optimization. Due to the complex search space it requires, AI techniques have been used extensively to spot the best curves that match the real behavior of the system. Genetic algorithms are known for their fast convergence and their ability to find an optimal structure of the solution. We propose using a genetic algorithm as a function approximator. Our attempt will focus on using the polynomial form of the approximation. After implementing the algorithm, we report our results and compare them with the real function output.
Based on correlation coefficient in image matching – IJRES Journal
With the development of image technology, its application in industrial manufacturing has become more and more widespread. Image matching is an important branch of image processing, and it is also very important in the process of industrial detection. The main purpose of this paper is to examine the traditional image correlation matching algorithm, based on the similarity matching principle, as applied to industrial detection. The paper discusses the similarity-based image matching algorithm as applied to the detection of connectors, and improves the algorithm to speed up detection. In some specific circumstances, where the correlation coefficient method is simpler and easier to use, the software development cycle will be short.
Presentation on Face Recognition Based on 3D Shape Estimation – RapidAcademy
Face Recognition Based on 3D Shape Estimation
Quantify faces by parameters specifying their shape and texture.
To recognize faces across a wide range of illumination conditions.
Face recognition needs to be achieved across variations in pose.
Face Pose Classification Method using Image Structural Similarity Index – idescitation
This document proposes a new method for classifying face pose using structural similarity index (SSIM). SSIM is used to measure similarity between a test facial image and images in a database with known poses. The test image is assigned the pose of the database image with the highest SSIM value. Experimental results on the Pointing'04 database show the method can accurately classify poses when many training images are available. Classification confidence decreases when fewer training images are used, as poses may not be directly represented. The method could be useful for applications like driver monitoring that require pose authentication.
Object class recognition by unsupervised scale invariant learning – Kunal Kishor Nirala
This document presents an approach for unsupervised scale-invariant object class recognition using a probabilistic model. The model represents objects as flexible constellations of parts with probabilistic representations for shape, appearance, scale, and occlusion. An entropy-based feature detector is used to select image regions and scales. A maximum likelihood algorithm estimates the parameters of the scale-invariant object model from unlabeled training images. The model demonstrates good recognition performance on datasets with geometric and non-geometric object classes.
LEARNING FAIR GRAPH REPRESENTATIONS VIA AUTOMATED DATA AUGMENTATIONS.pptx – ssuser2624f71
The document proposes Graphair, a method for fair graph representation learning via automated data augmentations. Graphair uses an adversarial process to automatically discover fairness-aware data augmentation strategies for different graph datasets. The augmentation process modifies graphs by perturbing edges and masking node features. Graphair aims to learn representations that achieve demographic parity and equal opportunity while preserving predictive accuracy. Experimental results on node classification tasks show Graphair achieves better fairness metrics than baselines while maintaining high accuracy.
Tracking Faces using Active Appearance Models
1. Tracking Faces using Active Appearance Models
Steven Mitchell, Ph.D.
Componica, LLC
Iowa City, IA
Thursday, June 20, 13
2. Copyright 2013 - Componica, LLC (http://www.componica.com/)
Overview
Modeling the shape of faces.
Modeling the texture of faces.
Fitting this model to images with faces.
AAMs Revisited
Demo
3. Characterizing Faces
Locating a face... then
Characterizing the face... for
Face Recognition
Facial Tracking
Augmented Reality
Detecting emotions
Etc.
VIOLA, JONES: ROBUST REAL-TIME OBJECT DETECTION,
IJCV 2001
4. Faces have Shape
MUCT Facial Database of 3755 faces with landmark data.
Landmarks, a simple way of encoding shape.
Landmarks must be fixed in number and location.
Explicit locations (corners), implicit (edges, symmetry).
These x/y coordinates can be represented as a vector: s = (x1, y1, ..., xn, yn)^T
S. MILBORROW AND J. MORKEL AND F. NICOLLS:
THE MUCT LANDMARKED FACE DATABASE, 2010
5. Removing Rotation, Scale, Translation
1. Select a face as the mean shape.
2. Align all faces to the mean using a least squares method.
3. Compute a new mean.
4. Go to step 2 until convergence.
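The alignment loop described here is a generalized Procrustes analysis. A minimal NumPy sketch, with function names of my own choosing (the talk does not give code):

```python
import numpy as np

def align_similarity(shape, target):
    """Least-squares similarity transform (scale/rotation/translation)
    aligning `shape` (n x 2) to `target` (n x 2)."""
    mu_s, mu_t = shape.mean(axis=0), target.mean(axis=0)
    s, t = shape - mu_s, target - mu_t          # remove translation
    u, sig, vt = np.linalg.svd(s.T @ t)          # orthogonal Procrustes
    r = u @ vt                                   # optimal rotation
    scale = sig.sum() / np.trace(s.T @ s)        # optimal scale
    return scale * (s @ r) + mu_t

def generalized_procrustes(shapes, iters=20, tol=1e-8):
    """Iteratively align all shapes to a running mean, as in the steps above."""
    mean = shapes[0].copy()                      # step 1: pick a face as the mean
    for _ in range(iters):
        aligned = [align_similarity(s, mean) for s in shapes]   # step 2
        new_mean = np.mean(aligned, axis=0)      # step 3
        if np.linalg.norm(new_mean - mean) < tol:  # step 4: repeat until converged
            break
        mean = new_mean
    return np.array(aligned), mean
```

After convergence, every shape differs from the mean only by residual (non-similarity) variation, which is what PCA models next.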
6. Principal Component Analysis
PCA is a dimension-reducing technique that approximates data using a lower dimension.
Figure 3.5: A simple visual depiction of Principal Component Analysis. (a) PCA is applied to a set of 2D vectors which form a cluster around a mean x̄, producing a set of principal components φ1 and φ2. (b) Feature vectors such as x can be approximated by eliminating smaller components such as φ2.
A benefit of PCA is its ability to create a more compact model of the shape by throwing away the smaller components. The covariance matrix C exhibits correlation along specific directions. Solving for its eigenvectors, φi, and eigenvalues, λi, derives these directions:
CΦ = ΛΦ (3.7)
where Φ denotes the concatenation of the individual eigenvectors φi, and Λ is a diagonal matrix of eigenvalues λi. By convention, we assume λi ≥ λi+1. With Λ and Φ computed, for any vector x there exists a vector b such that:
x = x̄ + Φb (3.8)
Conversely:
b = Φᵀ(x − x̄)
Since the smallest eigenvalues exhibit the least influence on the reconstruction, the corresponding components can be discarded.
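These equations translate directly into a small NumPy sketch (function names and the `var_kept` cutoff are my own, not from the talk):

```python
import numpy as np

def fit_pca(X, var_kept=0.98):
    """Build the linear model x ≈ x̄ + Φ b from rows of X (one vector per row)."""
    mean = X.mean(axis=0)
    C = np.cov(X - mean, rowvar=False)      # covariance matrix C
    lam, phi = np.linalg.eigh(C)            # eigenvalues come out ascending
    lam, phi = lam[::-1], phi[:, ::-1]      # reorder so that λ_i ≥ λ_{i+1}
    lam = np.clip(lam, 0.0, None)           # guard against tiny negative values
    # keep enough modes to explain the requested fraction of variance
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_kept)) + 1
    return mean, phi[:, :k], lam[:k]

def project(x, mean, phi):
    """b = Φᵀ(x − x̄): reduce a shape to a few parameters."""
    return phi.T @ (x - mean)

def reconstruct(b, mean, phi):
    """x ≈ x̄ + Φ b: regenerate a shape from its parameters."""
    return mean + phi @ b
```

For face shapes, `fit_pca` applied to the aligned landmark vectors yields the handful of numbers (fewer than 10, per the next slide) that characterize a face's shape.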
7. Principal Component Analysis
(DEMO...)
In general we can characterize the shape of a face fairly well using fewer than 10 numbers.
8. Faces have Texture
Faces have pixel values; we'll call them textures.
To compare a pixel between faces, warp each face to the average face.
Triangulate the landmark points using Delaunay Triangulation.
Map each triangle to the mean shape using Barycentric Coordinates.
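The triangle-to-triangle mapping can be sketched as follows. This is a hypothetical helper of my own (in practice the triangulation itself would come from something like `scipy.spatial.Delaunay` on the mean landmarks, and the warp would be applied per pixel):

```python
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of 2-D point p in triangle tri (3 x 2)."""
    a, b, c = tri
    T = np.column_stack([b - a, c - a])   # 2x2 matrix of edge vectors
    u, v = np.linalg.solve(T, p - a)      # solve p = a + u*(b-a) + v*(c-a)
    return np.array([1.0 - u - v, u, v])

def warp_point(p, tri_src, tri_dst):
    """Map p from the source triangle to the corresponding destination
    triangle: same barycentric coordinates, different vertices."""
    w = barycentric(p, tri_src)
    return w @ tri_dst
```

Because barycentric coordinates are invariant under affine maps, every pixel inside a source triangle lands at the corresponding spot in the mean-shape triangle, which is exactly what makes per-pixel comparison across faces possible.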
9. Removing Brightness, Contrast
1. Select a face as the mean texture.
2. Normalize lighting to the mean.
3. Compute a new mean.
4. Go to step 2 until convergence.
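This normalisation loop mirrors the shape alignment. A sketch, assuming each texture is a flattened vector of shape-normalised pixels and that the mean is kept zero-mean and unit-norm (function names are mine):

```python
import numpy as np

def normalize_texture(g, g_mean):
    """Remove brightness/contrast: offset and scale g to best match g_mean,
    which is assumed to be zero-mean and unit-norm."""
    beta = g.mean()                   # brightness offset
    alpha = (g - beta) @ g_mean       # contrast scale (projection onto the mean)
    return (g - beta) / alpha

def mean_texture(textures, iters=10):
    """Iterate normalise-then-average until the mean texture stabilises."""
    m = textures[0] - textures[0].mean()    # step 1: pick a starting mean
    m /= np.linalg.norm(m)
    for _ in range(iters):
        normed = [normalize_texture(g, m) for g in textures]  # step 2
        m = np.mean(normed, axis=0)                           # step 3
        m -= m.mean()                   # re-standardise the new mean
        m /= np.linalg.norm(m)
    return m
```

After this, two textures of the same face taken under different lighting map to (nearly) the same vector, so PCA on textures captures facial variation rather than lighting.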
10. Average Faces
Here are the average female, combined, and male faces from the MUCT dataset.
11. Principal Component Analysis
Treat the textures as vectors, g, and compute PCA.
Now we have a model for texture.
Combine shape, using warping, with texture and we have a model for appearance.
A combined appearance model with only 80 parameters is required to explain 98% of the observed variation. The model uses about 10,000 pixel values to make up the face patch.
Figures 5.2 and 5.3 show the effects of varying the first two shape and grey-level model parameters through ±3 standard deviations, as determined from the training set. The first parameter corresponds to the largest eigenvalue of the covariance matrix, which gives its variance across the training set. Figure 5.4 shows the effect of varying the first four appearance model parameters, showing changes in identity, pose and expression.
Figure 5.2: First two modes of shape variation (±3 sd)
Figure 5.3: First two modes of grey-level variation (±3 sd)
Figure 5.4: First four modes of appearance variation (±3 sd)
Approximating a New Example
Given a new image, labelled with a set of landmarks, we can generate an approximation with the model. We follow the steps in the previous section to obtain b, combining the shape and grey-level parameters which match the example. Since Pc is orthogonal, the combined appearance model parameters c are given by c = Pcᵀb.
T.F. COOTES AND C.J.TAYLOR:
STATISTICAL MODELS OF APPEARANCE FOR COMPUTER VISION, 2004
12. A Model of Faces
We create a model of a face that statistically captures the variations of shape and texture.
This model can both generate faces and reduce faces to a vector.
Typically 80-120 values are sufficient to reconstruct most faces.
This is a reconstruction of an unseen image.
The full reconstruction is given by applying the model equations, inverting the grey-level normalisation, applying the appropriate pose to the points, and projecting into the image. For example, Figure 5.5 shows a previously unseen image alongside the reconstruction of the face patch (overlaid on the original image).
Figure 5.5: Example of combined model representation (right) of a previously unseen image (left)
T.F. COOTES AND C.J.TAYLOR:
STATISTICAL MODELS OF APPEARANCE FOR COMPUTER VISION, 2004
13. Using this Model to Fit Faces
What do we have:
The input image.
A reasonable location of a face via Viola-Jones.
A model that can reconstruct any face from a small number of parameters.
Goal:
Adjust the model to best fit the synthetic face to the real face.
From that, we've implicitly characterized the face.
Minimize: place model in image, measure difference, update model, iterate.
T.F. COOTES AND C.J.TAYLOR:
STATISTICAL MODELS OF APPEARANCE FOR COMPUTER VISION, 2004
INSERT MAGIC HERE
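The place-measure-update loop on this slide can be sketched as follows. This is a toy of my own: `image_sample` and `model` stand in for the real warping/sampling machinery, and `R` is the update matrix the next slide explains how to learn:

```python
import numpy as np

def fit_aam(image_sample, model, c0, R, iters=30):
    """Iterative AAM search in the Cootes style: measure the texture
    residual, predict a parameter correction with the pre-learned matrix
    R, and step. `image_sample(c)` is assumed to sample the image under
    the model's shape for parameters c and return a normalised texture
    vector; `model.texture(c)` synthesises the model texture."""
    c = c0.copy()
    best_c, best_err = c.copy(), np.inf
    for _ in range(iters):
        g_image = image_sample(c)        # place model in image, sample texture
        g_model = model.texture(c)       # synthesise texture from the model
        r = g_image - g_model            # measure difference δg
        err = r @ r
        if err < best_err:               # keep the best fit seen so far
            best_c, best_err = c.copy(), err
        c = c - R @ r                    # update model: predicted step from δg
    return best_c, best_err
```

The sign convention of the update (subtracting `R @ r`) is a choice I've made for this toy; the real algorithm also damps the step when the error fails to decrease.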
14. Using this Model to Fit Faces
Knowing the texture difference δg, we'd like to know how to adjust the model parameters c and the pose t. In other words, compute δc and δt.
Approximate δc and δt by the linear predictions δc = Rc δg and δt = Rt δg.
To compute Rc and Rt:
In the original face data, compute a model face that matches each original face; δg should be close to zero.
Perturb the model by small random values δc and δt.
Generate pairs of (δc, δg) and (δt, δg), and use least squares to compute Rc and Rt.
Side note: we also compute four extra parameters for translation, scale and rotation.
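The perturb-and-regress training described here can be sketched in a few lines. This is a simplified version of my own: `sample(c)` is assumed to return the texture residual source (here just a texture vector), and one matrix R is learned rather than separate Rc and Rt:

```python
import numpy as np

def learn_update_matrix(sample, c_examples, sigma, n_perturb=200, seed=0):
    """Learn R by regression: perturb known-good parameters by small
    random amounts, record the texture change each perturbation causes,
    and solve δC ≈ R δG in the least-squares sense."""
    rng = np.random.default_rng(seed)
    dC, dG = [], []
    for c in c_examples:
        g0 = sample(c)                                  # residual ~0 at the truth
        for _ in range(n_perturb):
            dc = rng.normal(0.0, sigma, size=c.shape)   # small random displacement
            dg = sample(c + dc) - g0                    # resulting texture change
            dC.append(dc)
            dG.append(dg)
    dC, dG = np.array(dC).T, np.array(dG).T             # columns are examples
    # R solves dC = R dG in least squares: R = dC dG⁺
    return dC @ np.linalg.pinv(dG)
</imports>```

Cootes reports the linear relationship only holds for small perturbations (roughly 0.5 standard deviations per parameter), which is why `sigma` must be kept modest.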
15. Using this Model to Fit Faces
8.3 Learning To Correct Model Parameters
Ideally we seek to model a relationship that holds over as large a range of errors, δg, as possible. However, the real relationship is found to be linear only over a limited range of values. Experiments on the face model suggest that the optimum perturbation was around 0.5 standard deviations (over the training set) for each model parameter, about 10% in scale, the equivalent of 3 pixels translation, and about 10% in texture scaling.
8.3.1 Results For The Face Model
We applied the above algorithm to the face model described in section 5.3. We can visualise the effects of the perturbation as follows. If ai is the ith row of the matrix R, the predicted change in the ith parameter, δci, is given by
δci = ai · δg
and ai gives the weight attached to different areas of the sampled patch when estimating the displacement. Figure 8.1 shows the weights corresponding to changes in the pose parameters (sx, sy, tx, ty). Bright areas are positive weights, dark areas negative. As one would expect, the x and y displacement weights are similar to x and y derivative images. Similar results are obtained for weights corresponding to the appearance model parameters.
Figures 8.2 and 8.3 show the first and third modes and corresponding displacement weights. The areas which exhibit the largest variations for the mode are assigned the largest weights by the training process.
Figure 8.1: Weights corresponding to changes in the pose parameters, (sx, sy, tx, ty)
Figure 8.2: First mode and displacement weights
Figure 8.3: Third mode and displacement weights
8.3.2 Perturbing The Face Model
To examine the performance of the prediction, we systematically displaced the face model from the true position on a set of 10 test images, and used the model to predict the displacement given the sampled error vector. Figures 8.4 and 8.5 show the predicted translations against the actual translations. There is a good linear relationship within about 4 pixels of zero. Although this breaks down with larger displacements, as long as the prediction has the same sign as the actual error, and does not over-predict too far, an iterative updating scheme should converge. In this case up to 20 pixel displacements in x and about 10 in y should be correctable.
Figure 8.4: Predicted dx vs actual dx. Errorbars are 1 standard error
Figure 8.5: Predicted dy vs actual dy. Errorbars are 1 standard error
We can, however, extend this range by building a multi-resolution model of object appearance, generating Gaussian pyramids for each of our training images.
T.F. COOTES AND C.J.TAYLOR:
STATISTICAL MODELS OF APPEARANCE FOR COMPUTER VISION, 2004
16. Using this Model to Fit Faces
Figure 8.9: Reconstruction (left) and original (right) given original landmark points
Figure 8.10: Multi-resolution search from a displaced position (initial, 2, 8, 14, 20 iterations, converged)
As an example of applying the method to medical images, an Appearance Model was also built.
T.F. COOTES AND C.J.TAYLOR:
STATISTICAL MODELS OF APPEARANCE FOR COMPUTER VISION, 2004
17. Using this Model to Fit Faces
Experimental Results
Figure 8.14 shows the mean intensity error per pixel (for an image using 256 grey-levels) against the number of iterations, averaged over a set of searches at a single resolution. In each case the model was initially displaced by up to 15 pixels. The dotted line gives the mean reconstruction error using the hand-marked landmark points, suggesting a good result is obtained by the search.
Figure 8.15 shows the proportion of 100 multi-resolution searches which converged correctly starting from positions displaced from the true position by up to 50 pixels in x and y. The model displays good results with up to 20 pixels (10% of the face width) displacement.
Figure 8.14: Mean intensity error as search progresses. Dotted line is the mean error of the best fit to the landmarks.
Figure 8.15: Proportion of searches which converged from different initial displacements (dX, dY)
18. AAMs Revisited
Context:
Tim Cootes, Gareth Edwards, and Chris Taylor's original AAM model was created in the late 90s.
There was no good face initialization until 2001, with the cascaded Haar face detector.
It had limited use in face recognition and medical imaging.
Generally the algorithm seemed ad hoc.
Redux:
In 2004, Iain Matthews and Simon Baker reexamined AAMs, making significant improvements.
Iain Matthews went on to develop the facial motion tracking for the movie Avatar while working for Weta Digital in New Zealand, and now works for Disney Research.
19. AAMs Revisited
Replaced simple gradient descent with more efficient gradient descent using Gauss-Newton.
Analytically computes the gradients instead of fitting perturbations between model coefficients and image differences.
Projects out the texture model from AAM fitting, both simplifying and speeding up the algorithm.
Inverse compositional warp updates.
20. Better Gradient Descent
Cootes' gradient descent, expanding out all the terms. Notice the k fudge-factor term.
Matthews’ gradient descent via Gauss-Newton.
PRECOMPUTED
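The Gauss-Newton scheme referred to here can be illustrated generically. This sketch of mine shows the update δp = −(JᵀJ)⁻¹Jᵀr; in Matthews and Baker's inverse-compositional formulation the Jacobian is constant, so the bracketed factor really can be precomputed once:

```python
import numpy as np

def gauss_newton(residual, jacobian, p0, iters=20, tol=1e-10):
    """Generic Gauss-Newton: minimise ||r(p)||² by repeatedly solving the
    linearised least-squares problem JᵀJ δp = −Jᵀ r for the step δp."""
    p = p0.astype(float).copy()
    for _ in range(iters):
        r = residual(p)                                 # current residual vector
        J = jacobian(p)                                 # its Jacobian at p
        step = np.linalg.lstsq(J, -r, rcond=None)[0]    # Gauss-Newton step
        p = p + step
        if step @ step < tol:                           # stop once steps vanish
            break
    return p
```

In an AAM, `p` would be the shape/pose parameters and `r` the texture residual; the efficiency win comes from `J` not depending on `p`.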
21. Better Gradients
Matthews formulated a direct computation of the gradient images, assuming triangular warping.
23. Projecting out Texture
Project out the texture variations from the gradient images. This means the steepest descent images, R, have the texture variations subtracted out.
The model only needs to fit the shape, ignoring the texture component of the facial model.
Significantly reduces computation cost, as only the shape model is considered (10 parameters vs. 80 or more parameters).
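The projection itself is a one-liner. A sketch under my own naming, assuming `SD` holds one steepest-descent image per row and `A` is an orthonormal basis for the texture variation (columns):

```python
import numpy as np

def project_out(SD, A):
    """Replace each steepest-descent image (row of SD) with its component
    orthogonal to the texture subspace spanned by the columns of A, so
    texture variation no longer influences the shape update."""
    return SD - (SD @ A) @ A.T   # subtract the projection onto span(A)
```

Because the texture basis from PCA is orthonormal, `(SD @ A) @ A.T` is exactly the texture-explainable part of each row, and what remains drives only the shape parameters.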
24. Demo
http://www.youtube.com/watch?v=VblXShzV2VY
Way cooler live in realtime on my laptop.
25. Implementation Details
C++ and OpenCV for camera capture, face detection, and matrix operations.
Implementation suckage: 8/10. Took 4 solid months to implement, with many details left out of the talk.
Working on proprietary features to improve accuracy and convergence; potential topics for future presentations.
Porting to iOS & Android with the goal of close-to-realtime tracking.
Derive as many products, apps, and startups as I can from it to make money.
26. Conclusion
Active Appearance Models statistically model the shape and texture of objects.
An optimization scheme fits the model onto an image, implicitly tracking the object.
Originally formulated by Cootes et al., AAMs were significantly enhanced by Matthews et al. in 2004.
With the combination of practical face detection and the speed of the algorithm, AAMs are potentially a useful step in characterizing and tracking faces.