
2008 Eighth IEEE International Conference on Data Mining

Unsupervised Face Annotation by Mining the Web

Duy-Dinh Le
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku
Tokyo, JAPAN 101-8430
ledduy@nii.ac.jp

Shin’ichi Satoh
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku
Tokyo, JAPAN 101-8430
satoh@nii.ac.jp

Abstract

Searching for images of people is an essential task for image and video search engines. However, current search engines have limited capabilities for this task since they rely on text associated with images and video, and such text is likely to return many irrelevant results. We propose a method for retrieving relevant faces of one person by learning the visual consistency among results retrieved from text-correlation-based search engines. The method consists of two steps. In the first step, each candidate face obtained from a text-based search engine is ranked with a score that measures the distribution of visual similarities among the faces. Faces that are possibly very relevant or irrelevant are ranked at the top or bottom of the list, respectively. The second step improves this ranking by treating this problem as a classification problem in which input faces are classified as 'person-X' or 'non-person-X', and the faces are re-ranked according to their relevance score inferred from the classifier's probability output. To train this classifier, we use a bagging-based framework to combine results from multiple weak classifiers trained using different subsets. These training subsets are extracted and labeled automatically from the rank list produced from the classifier trained from the previous step. In this way, the accuracy of the ranked list increases after a number of iterations. Experimental results on various face sets retrieved from captions of news photos show that the retrieval performance improved after each iteration, with the final performance being higher than those of the existing algorithms.

1. Introduction

With the rapid growth of digital technology, large image and video databases have become more available than ever to users. This trend has shown the need for effective and efficient tools for indexing and retrieving based on visual content. A typical application is searching for a specific person by providing his or her name. Most current search engines use the text associated with images and video as significant clues for returning results. However, other un-queried faces and names may appear with the queried ones (Figure 1), and this significantly lowers the retrieval performance. One way to improve the retrieval performance is to take into account visual information present in the retrieved faces. This task is challenging for the following reasons:

• Large variations in facial appearance due to pose changes, illumination conditions, occlusions, and facial expressions make face recognition difficult even with state-of-the-art techniques [1, 21, 2] (see example in Figure 2).

• The fact that the retrieved face set consists of faces of several people with no labels makes supervised and unsupervised learning methods inapplicable.

We propose a method for solving the above problem. The main idea is to assume that there is visual consistency among the results returned from text-based search engines and that this visual consistency can be learned through an iterative process. This method consists of two stages. In the first stage, we explore the local density of faces to identify potential candidates for relevant faces¹ and irrelevant faces². This stage reflects the fact that the facial images of the queried person tend to form dense clusters, whereas irrelevant facial images are sparse since they look different from each other. For each face, we define a score to measure the density of its neighbor set. This score is used to form a ranked list, in which faces with high-density scores are considered relevant and are put at the top.

¹ Faces related to the queried person.
² Faces unrelated to the queried person.

The above ranking method is weak since dense clusters have no guarantee of containing relevant faces. Therefore, a second stage is necessary to improve this ranked list. We model this problem as a classification problem in which input faces are classified as person-X (the queried person)

1550-4786/08 $25.00 © 2008 IEEE
DOI 10.1109/ICDM.2008.47

or non-person-X (the un-queried person). The faces are ranked according to a relevancy score that is inferred from the classifier's probability output. Since annotation data is not available, the rank list from the previous step is used to assign labels for a subset of faces. This subset is then used to train a classifier using supervised methods such as support vector machines (SVM). The trained classifier is used to re-rank faces in the original input set. This step is repeated a number of times to get the final ranked list. Since automatically assigning labels from the ranked list is not reliable, the trained classifiers are weak. To obtain the final strong classifier, we use the idea of ensemble learning [6] in which weak classifiers trained on different subsets are combined to improve the stability and classification accuracy of single classifiers. The learned classifier can be further used for recognizing new facial images of the queried person.

Figure 1. A news photo and its caption. Extracted faces are shown on the top. These faces might be returned for the query of person-Bush.

Figure 2. Large variations in facial expressions, poses, illumination conditions and occlusions making face recognition difficult. Best viewed in color.

The second stage improves the ranked list and recognition performance for the following reasons:

• Supervised learning methods, such as SVM, provide a strong theoretical background for finding the optimal decision boundary even with noisy data. Furthermore, recent studies [20, 17] suggest that SVM classifiers provide probability outputs that are suitable for ranking.

• The bagging framework helps to reduce the effect of noise in the unsupervised labeling process.

Our contribution is two-fold:

• We propose a general framework to boost the face retrieval performance of text-based search engines by visual consistency learning. The framework seamlessly integrates data mining techniques such as supervised learning and unsupervised learning based on bagging. Our framework requires only a few parameters and works stably.

• We demonstrate its feasibility with a practical web mining application. A comprehensive evaluation on a large face dataset of many people was carried out and confirmed that our approach is promising.

2. Related Work

There are several approaches for re-ranking and learning models from web images. Their underlying assumption is that text-based search engines return a large fraction of relevant images. The challenge is how to model what is common in the relevant images. One approach is to model this problem in a probabilistic framework in which the returned images are used to learn the parameters of the model. For example, as described by Fergus et al. [12], objects retrieved using an image search engine are re-ranked by extending the constellation model. Another proposal, described in [15], uses a non-parametric graphical model and an interactive framework to simultaneously learn object class models and collect object class datasets. The main contribution of these approaches is probabilistic models that can be learned with a small number of training images. However, these models are complicated since they


require several hundred parameters for learning and are susceptible to over-fitting. Furthermore, to obtain robust models, a small amount of supervision is required to select seed images.

Another study [4, 3] proposed a clustering-based method for associating names and faces in news photos. To solve the problem of ambiguity between several names and one face, a modified k-means clustering process was used in which faces are assigned to the closest cluster (each cluster corresponding to one name) after a number of iterations. Although the result was impressive, it is not easy to apply it to our problem since it is based on a strong assumption that requires a perfect alignment when a news photo only has one face and its caption only has one name. Furthermore, a large number of irrelevant faces (more than 12%) have to be manually eliminated before clustering.

A graph-based approach proposed by Ozkan and Duygulu [16], in which a graph is formed with faces as nodes and face similarities as edge weights, is closely related to our problem. Assuming that the number of faces of the queried person is larger than that of others and that these faces tend to form the most similar subset among the set of retrieved faces, this problem is considered equal to the problem of finding the densest subgraph of a full graph, and can therefore be solved by taking an available solution [9]. Although experimental results showed the effectiveness of this method, it is still questionable whether the densest subgraph intuitively describes most of the relevant faces of the queried person and whether it is easy to extend to the ranking problem. Furthermore, choosing an optimal threshold to convert the initial graph into a binary one is difficult and rather ad hoc due to the curse of dimensionality.

An advantage of the methods [4, 3, 16] is that they are fully unsupervised. However, a disadvantage is that no model is learned for predicting new images of the same category. Furthermore, they perform hard categorization on input images, which is inapplicable to re-ranking. The balance of recall and precision was not addressed. Typically, these approaches tend to ignore the recall to obtain high precision. This leads to a reduction in the number of collected images.

Our approach combines a number of advances over the existing approaches. Specifically, we learn a model for each query from the returned images for purposes such as re-ranking and predicting new images. However, we use an unsupervised method to select training samples automatically, which is different from the methods proposed by Fergus et al. and Li et al. [12, 15]. This unsupervised method is different from the one by Ozkan and Duygulu [16] in the modeling of the distribution of relevant images. We use density-based estimation rather than the densest graph.

3 Proposed Framework

Given a set of images returned by any text-based search engine for a queried person (e.g. 'George Bush'), we perform a ranking process and learning of person X's model as follows:

• Step 1: Detect faces and eye positions, and then perform face normalizations.

• Step 2: Compute an eigenface space and project the input faces into this subspace.

• Step 3: Estimate the ranked list of these faces using Rank-By-Local-Density-Score.

• Step 4: Improve this ranked list using Rank-By-Bagging-ProbSVM.

Steps 1 and 2 are typical for any face processing system, and they are described in section 4.2. The algorithms used in Steps 3 and 4 are described in section 3.1 and section 3.2, respectively. Figure 3 illustrates the proposed framework.

3.1 Ranking by Local Density Score

Figure 4. An example of faces retrieved for person-Donald Rumsfeld. Irrelevant faces are marked with a star. Irrelevant faces might form several clusters, but the relevant faces form the largest cluster.

Among the faces retrieved by text-based search engines for a query of person-X, as shown in Figure 4, relevant faces usually look similar and form the largest cluster. One approach to re-ranking these faces is to cluster them based on visual similarity. However, obtaining ideal clustering results is impossible since these faces are high dimensional data and the clusters have different shapes, sizes, and densities. Instead, a graph-based approach was proposed by Ozkan and Duygulu [16] in which the nodes are faces and edge weights


Figure 3. The proposed framework for re-ranking faces returned by text-based search engines.

are the similarities between two faces. With the observation that the nodes (faces) of the queried person are similar to each other and different from other nodes in the graph, the densest component of the full graph (the set of highly connected nodes in the graph) will correspond to the faces of the queried person. The main drawback of this approach is that it needs a threshold to convert the initial weighted graph to a binary graph. Choosing this threshold in high dimensional spaces is difficult since different persons might have different optimal thresholds.

We use the idea of density-based clustering described by Ester et al. and Breunig et al. [11, 7] to solve this problem. Specifically, we define the local density score (LDS) of a point p (i.e. a face) as the average distance to its k-nearest neighbors:

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p,k)} \mathrm{distance}(p, q)}{k}$$

where R(p, k) is the set of k nearest neighbors of p, and distance(p, q) is the similarity between p and q.

Since faces are represented in high dimensional feature space, and face clusters might have different sizes, shapes, and densities, we do not directly use the Euclidean distance between two points in this feature space for distance(p, q). Instead, we use another similarity measure defined by the number of shared neighbors between two points. The efficiency of this similarity measure for density-based clustering methods was described in [10].

$$\mathrm{distance}(p, q) = \frac{|R(q, k) \cap R(p, k)|}{k}$$

Therefore

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p,k)} |R(q, k) \cap R(p, k)|}{k^2}$$

A high value of LDS(p, k) indicates a strong association between p and its neighbors. Therefore, we can use this local density score to rank faces. Faces with higher scores are considered to be potential candidates that are relevant to person-X, while faces with lower scores are considered as outliers and thus are potential candidates for non-person-X. Algorithm 1 describes these steps.

Algorithm 1: Rank-By-Local-Density-Score
Step 1: For each face p, compute LDS(p, k), where k is the number of neighbors of p and is the input of the ranking process.
Step 2: Rank these faces using LDS(p, k) (the higher the score, the more relevant).

3.2 Ranking by Bagging of SVM Classifiers

One limitation of the local density score based ranking is that it cannot handle faces of another person that are strongly associated in the k-neighbor set (for example, many duplicates). Therefore, another step is proposed for handling this case. As a result, we have a model that can be used for both re-ranking current faces and predicting new incoming faces.

The main idea is to use a probabilistic model to measure the relevancy of a face to person-X, P(person-X | face). Since the labels are not available for training, we use the input rank list found from the previous step to extract a subset of faces lying at the top and bottom of the ranked list to form the training set. After that, we use SVM with probabilistic output [17] implemented in LibSVM [8] to learn the person-X model. This model is applied to faces of the original set, and the output probabilistic scores are used to re-rank these faces. Since it is not guaranteed that faces lying at the two ends of the input rank list correctly correspond to the faces of person-X and faces of non-person-X, we adopt the idea of a bagging framework [6], in which randomly selecting subsets to train weak classifiers and then combining these classifiers helps reduce the risk of using noisy training sets.

The details of the Rank-By-Bagging-ProbSVM-InnerLoop method, which improves an input rank list by combining weak classifiers trained from subsets annotated by that rank list, are described in Algorithm 2.

Given an input ranked list, Rank-By-Bagging-ProbSVM-InnerLoop is used to improve this list. We repeat the process a number of times whereby the ranked list output from the previous step is used as the input ranked list of the next


step. In this way, the iterations significantly improve the final ranked list. The details are described in Algorithm 3.

Algorithm 2: Rank-By-Bagging-ProbSVM-InnerLoop
Step 1: Train a weak classifier, h_i.
Step 1.1: Select a set S_pos including p% of top ranked faces and then randomly select a subset S*_pos from S_pos. Label faces in S*_pos as positive samples.
Step 1.2: Select a set S_neg including p% of bottom ranked faces and then randomly select a subset S*_neg from S_neg. Label faces in S*_neg as negative samples.
Step 1.3: Use S*_pos and S*_neg to train a weak classifier, h_i, using LibSVM [8] with probability outputs.
Step 2: Compute ensemble classifier $H_i = \sum_{j=1}^{i} h_j$.
Step 3: Apply H_i to the original face set and form the rank list, Rank_i, using the output probabilistic scores.
Step 4: Repeat steps 1 to 3 until Dist2RankList(Rank_{i−1}, Rank_i) <= ε.
Step 5: Return $H_i = \sum_{j=1}^{i} h_j$.

Algorithm 3: Rank-By-Bagging-ProbSVM-OuterLoop
Step 1: Rank_cur = Rank-By-Bagging-ProbSVM-InnerLoop(Rank_prev).
Step 2: dist = Dist2RankList(Rank_prev, Rank_cur).
Step 3: Rank_final = Rank_cur.
Step 4: Rank_prev = Rank_cur.
Step 5: Repeat steps 1 to 4 until dist <= ε.
Step 6: Return Rank_final.

To determine the number of iterations of Rank-By-Bagging-ProbSVM-InnerLoop and Rank-By-Bagging-ProbSVM-OuterLoop, we use the Kendall tau distance [13], which is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are. The Kendall tau distance between two lists, τ1 and τ2, is defined as follows:

$$K(\tau_1, \tau_2) = \sum_{(i,j) \in P} K_{i,j}(\tau_1, \tau_2)$$

where P is the set of unordered pairs of distinct elements in τ1 and τ2. K_{i,j}(τ1, τ2) = 0 if i and j are in the same order in τ1 and τ2, and K_{i,j}(τ1, τ2) = 1 if i and j are in the opposite order in τ1 and τ2.

Since the maximum value of K(τ1, τ2) is N(N − 1)/2, where N is the number of members of the list, the normalized Kendall tau distance can be written as follows:

$$K_{norm}(\tau_1, \tau_2) = \frac{K(\tau_1, \tau_2)}{N(N - 1)/2}$$

Using this measure to decide when the loops stop means that if the ranked list does not change significantly after a number of iterations, it is reasonable to stop.

4 Experiments

4.1 Dataset

We used the dataset described by Berg et al. [4] for our experiments. This dataset consists of approximately half a million news photos and captions from Yahoo News collected over a period of roughly two years. This dataset is better than datasets collected from image search engines such as Google that usually limit the total number of returned images to 1,000. Furthermore, it has annotations that are valuable for evaluation of methods. Note that these annotations are used for evaluation purposes only. Our method is fully unsupervised, so it assumes the annotations are not available at running time.

Only frontal faces were considered since current frontal face detection systems [19] work in real time and have accuracies exceeding 95%. 44,773 faces were detected and normalized to the size of 86×86 pixels.

We selected fifteen government leaders, including George W. Bush (US), Vladimir Putin (Russia), Jiang Zemin (China), Tony Blair (UK), Junichiro Koizumi (Japan), Roh Moo-hyun (Korea), Abdullah Gul (Turkey), and other key individuals, such as John Paul II (the former Pope) and Hans Blix (UN), because their images frequently appear in the dataset [16]. Variations in each person's name were collected. For example, George W. Bush, President Bush, U.S. President, etc., all refer to the current U.S. president.

We performed a simple string search in captions to check whether a caption contained one of these names. The faces extracted from the corresponding image associated with this caption were returned. The faces retrieved from the different name queries were merged into one set and used as input for ranking.

Figure 5 shows the distribution of retrieved faces from this method and the corresponding number of relevant faces for these fifteen individuals. In total, 5,603 faces were retrieved, of which 3,374 faces were relevant. On average, the accuracy was 60.22%.

4.2 Face Processing

We used an eye detector to detect the positions of the eyes of the detected faces. The eye detector, built with the same approach as that of Viola and Jones [19], had an accuracy of more than 95%. If the eye positions were not detected, predefined eye locations were assigned. The eye positions were used to align faces to a predefined canonical pose.

To compensate for illumination effects, the subtraction of the best-fit brightness plane followed by histogram equalization was applied. This normalization process is shown in


Figure 6.

We then used principal component analysis [18] to reduce the number of dimensions of the feature vector for face representation. Eigenfaces were computed from the original face set returned using the text-based query method. The number of eigenfaces used to form the eigen space was selected so that 97% of the total energy was retained [5]. The number of dimensions of these feature spaces ranged from 80 to 500.

Figure 5. Distribution of retrieved faces and relevant faces of 16 individuals used in experiments. Due to space limitation, bars corresponding to George Bush (2,282 vs. 1,284) and Tony Blair (682 vs. 323) were cut off at the upper limit of the graph.

Figure 6. Face normalization. (top) faces with detected eyes, (bottom) faces after normalization process.

4.3 Evaluation Criteria

We evaluated the retrieval performance with measures that are commonly used in information retrieval, such as precision, recall, and average precision. Given a queried person, let N_ret be the total number of faces returned, N_rel the number of relevant faces among them, and N_hit the total number of relevant faces. Recall and precision are then calculated as follows:

$$\mathrm{Recall} = \frac{N_{rel}}{N_{hit}}, \qquad \mathrm{Precision} = \frac{N_{rel}}{N_{ret}}$$

Precision and recall are only used to evaluate the quality of an unordered set of retrieved faces. To evaluate ranked lists in which both recall and precision are taken into account, average precision is usually used. The average precision is computed by taking the average of the interpolated precision measured at the 11 recall levels of 0.0, 0.1, 0.2, ..., 1.0.

The interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r' ≥ r:

$$p_{interp}(r) = \max_{r' \ge r} p(r')$$

In addition, to evaluate the performance of multiple queries, we used mean average precision, which is the mean of the average precisions computed from the queries³.

³ http://trec.nist.gov/pubs/trec10/appendices/measures.pdf

4.4 Parameters

The parameters of our method include:

• p: the fraction of faces at the top and bottom of the ranked list that are used to form a positive set S_pos and negative set S_neg for training weak classifiers in Rank-By-Bagging-ProbSVM-InnerLoop. We empirically selected p = 20% (i.e., 40% of the samples in the rank list were used) since a larger p will increase the number of incorrect labels, and a smaller p will cause over-fitting. In addition, S*_pos consists of 0.7 × |S_pos| samples that are selected randomly with replacement from S_pos. This sampling strategy is adopted from the bagging framework [6]. The same setting was used for S*_neg.

• ε: the maximum Kendall tau distance K_norm(τ1, τ2) between two rank lists τ1 and τ2. This value is used to determine when the inner loop and the outer loop stop. We set ε = 0.05 to balance accuracy and processing time. Note that a smaller ε requires more iterations, making the system slower.

• kernel: the kernel type used for the SVM. The default is a linear kernel that is defined as k(x, y) = x · y. We have tested other kernel types such as RBF or polynomial, but the performance did not change much. Therefore, we used the linear kernel for simplicity.
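
As a rough illustration of the measures in Section 4.3 and the settings in Section 4.4, the sketch below computes the 11-point interpolated average precision of one ranked list, the normalized Kendall tau distance that serves as the stopping test, and a bagging-style draw of S*_pos and S*_neg. It is a minimal sketch and not the authors' implementation: the function names, the use of Python's `random` module, and the rounding of p·N and 0.7·|S_pos| to whole numbers are assumptions made here for illustration.

```python
import random
from itertools import combinations


def eleven_point_average_precision(ranked_relevance, total_relevant):
    """11-point interpolated average precision of one ranked list.

    ranked_relevance: booleans, True where the face at that rank is relevant.
    total_relevant:   N_hit, the number of relevant faces for the query.
    """
    points = []  # (hits so far, rank) after each position in the list
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        points.append((hits, rank))
    interpolated = []
    for i in range(11):  # recall levels 0.0, 0.1, ..., 1.0
        # Highest precision at any recall >= i/10; the recall test
        # hits/total >= i/10 is done in integers to avoid float issues.
        candidates = [h / r for h, r in points if h * 10 >= i * total_relevant]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / 11


def normalized_kendall_tau(list_a, list_b):
    """Fraction of item pairs that two rankings of the same items order differently."""
    n = len(list_a)
    if n < 2:
        return 0.0
    pos_a = {item: i for i, item in enumerate(list_a)}
    pos_b = {item: i for i, item in enumerate(list_b)}
    disagreements = sum(
        1
        for x, y in combinations(list_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return disagreements / (n * (n - 1) / 2)


def sample_training_subsets(ranked_faces, p=0.20, keep=0.7, rng=None):
    """Draw S*_pos and S*_neg from the top and bottom p fraction of a ranked list."""
    rng = rng or random.Random()
    n_end = max(1, round(p * len(ranked_faces)))   # |S_pos| = |S_neg| (assumed rounding)
    s_pos = ranked_faces[:n_end]
    s_neg = ranked_faces[-n_end:]
    n_draw = max(1, round(keep * n_end))           # 0.7 x |S_pos| samples
    # Sampling with replacement, as in bagging.
    s_pos_star = [rng.choice(s_pos) for _ in range(n_draw)]
    s_neg_star = [rng.choice(s_neg) for _ in range(n_draw)]
    return s_pos_star, s_neg_star
```

Under these assumptions, a loop in the spirit of Algorithms 2 and 3 would call `sample_training_subsets` once per weak classifier and stop as soon as `normalized_kendall_tau(previous_rank, current_rank)` is at most ε = 0.05. The SVM training step itself is left out of the sketch.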


4.5 Results

4.5.1 Performance Comparison with Existing Approaches

We compared our proposed method with other existing approaches.

• Text Based Baseline (TBL): Once faces corresponding with images whose captions contain the query name are returned, they are ranked in time order. This is a rather naive method in which no prior knowledge between names and faces is used.

• Distance-Based Outlier (DBO): We adopted the idea of distance-based outlier detection for ranking [14]. Given a threshold d_min, for each point p, we counted the number of points q so that dist(p, q) ≤ d_min, where dist(p, q) is the Euclidean distance between p and q in the feature space mentioned in section 4.2. This number was then used as the score to rank faces. We selected a range of d_min values for experiments: d_min = 10, 15, 20, ..., 90.

• Densest Sub-Graph based Method (DSG): We re-implemented the densest sub-graph based method [16] for ranking. Once the densest subgraph was found after an edge elimination process, we counted the number of surviving edges of each node (i.e., face) and used this number as the ranking score. To form the graph, the Euclidean distance dist(p, q) was used to assign the weight for the edge linked between node p and node q. DSG requires a threshold θ to convert the weighted graph to the binary graph before searching for the densest subgraph. We selected a range of θ values that are the same as the values used in DBO: θ = 10, 15, 20, ..., 90.

• Local Density Score (LDS): This is the first stage of our proposed method. It requires the input value k to compute the local density score. Since we do not know the number of returned faces from text-based search engines, we used another input value, fraction, defined as the fraction of neighbors, and estimated k by the formula k = fraction × N, where N is the number of returned faces. We used a range of fraction values for experiments: fraction = 5%, 10%, 15%, ..., 50%. For a large number of returned faces, we set k to the maximum value of 200: k = 200.

• Unsupervised Ensemble Learning Using Local Density Score (UEL-LDS): This combines ranking by local density scores with a classifier trained on the resulting ranked list to boost that list.

• Supervised Learning (SVM-SUP): We randomly selected a portion p of the data with annotations to train the classifier, and then used this classifier to re-rank the remaining faces. This process was repeated five times and the average performance was reported. We used a range of portion p values for experiments: p = 1%, 2%, 3%, ..., 5%.

Figure 7. Performance comparison of methods. Due to different settings, performances are superimposed for better evaluation.

Figure 7 shows a performance comparison of these methods. Our proposed methods (LDS and UEL-LDS) outperform other unsupervised methods such as TBL, DBO and DSG. Furthermore, the performance of the DBO and DSG methods is sensitive to the distance threshold, while the performance of our proposed method is less sensitive. This confirms that the similarity measure using shared nearest neighbors is reliable for estimation of the local density score. The performance of UEL-LDS is only slightly better than that of LDS since the training sets labeled automatically from the ranked list are noisy. However, UEL-LDS improves significantly even when the performance of LDS is poor. These performances are worse than that of SVM-SUP using a small number of labeled samples.

Figure 8 shows an example of the top 50 faces ranked using the TBL, DBO, DSG and LDS methods. The performance of DBO is poor since a low threshold is used. This ranks irrelevant faces that are near duplicates (rows 2 and 3 in Figure 8(b)) higher than relevant faces. The same explanation applies to DSG.

4.5.2 Performance of Ensemble Classifiers

In Figure 9, we show the performance of five single classifiers and that of five ensemble classifiers. The ensemble


classifier k is formed by combining single classifiers from 1 to k. It clearly indicates that the ensemble classifier is more stable than single weak classifiers.

4.5.3 New Face Annotation

We conducted another experiment to show the effectiveness of our approach, in which learned models are used to annotate new faces of other databases. We used each name in the list as a query to obtain the top 500 images from the Google Image Search Engine (GoogleSE). Next, these images were processed using the steps described in section 4.2: extracting faces, detecting eyes and doing normalization. We projected these faces to the PCA subspace trained for that name and used the learned model to re-rank faces.

There were 4,103 faces (including false positives, i.e. non-faces detected as faces) detected from 7,500 returned images. We manually labeled these faces and there were 2,342 relevant faces. On average, the accuracy of the GoogleSE is 57.08%.

In Table 1, we compare the performance of the methods. The performance of UEL-LDS was obtained by running the best system, which is shown as the peak of the UEL-LDS curve in Figure 7. The performances of SVM-SUP-05 and SVM-SUP-10 correspond to the supervised systems (cf. section 4.5.1) that used p = 5% and p = 10% of the data set, respectively. We evaluated the performance by calculating the precision at the top 20 returned faces, which is common for image search engines, and the recall and precision on all detected faces of the test set. UEL-LDS achieved comparable performance to the supervised methods and outperformed the baseline GoogleSE. The precision at the top 20 of SVM-SUP-05 is poorer than that of UEL-LDS due to the small number of training samples. Figure 10 shows the top 20 faces ranked using these two methods.

Method        Precision at top 20   Recall    Precision
GoogleSE      79.33                 100.00    57.08
UEL-LDS       89.00                 72.50     76.41
SVM-SUP-05    85.00                 73.14     76.46
SVM-SUP-10    90.67                 74.94     78.30

Table 1. Comparison of different methods on the new test set returned by Google Image Search Engine.

5 Discussion

Our approach works fairly well for well-known people, where the main assumption that text-based search engines return a large fraction of relevant images is satisfied. Figure 12 shows an example where this assumption is broken. Consequently, as shown in Figure 13, the model learned from this set performed poorly in recognizing new faces returned by GoogleSE. Our approach solely relies on the above assumption; therefore, it is not affected by the ranking of text-based search engines.

The iteration of bagging SVM classifiers does not guarantee a significant improvement in performance. The aim of our future work is to study how to improve the quality of the training sets used in this iteration.

6 Conclusion

We presented a method for ranking faces retrieved using text-based correlation methods in searches for a specific person. This method learns the visual consistency among faces in a two-stage process. In the first stage, a local density score is used to form a ranked list in which faces ranked at the top or bottom of the list are likely to be relevant or irrelevant faces, respectively. In the second stage, a bagging framework is used to combine weak classifiers trained on subsets labeled from the ranked list into a strong classifier. This strong classifier is then applied to the original set to re-rank faces on the basis of the output probabilistic scores. Experiments on various face sets showed the effectiveness of this method. Our approach is beneficial when there are several faces in a returned image, as shown in Figure 11.

References

[1] O. Arandjelovic and A. Zisserman. Automatic face recognition for film character retrieval in feature-length films. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 860–867, 2005.
[2] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski. Face recognition by independent component analysis. IEEE Transactions on Neural Networks, 13(6):1450–1464, Nov 2002.
[3] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who’s in the picture? In Advances in Neural Information Processing Systems, 2004.
[4] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. G. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 848–854, 2004.
[5] D. Bolme, R. Beveridge, M. Teixeira, and B. Draper. The CSU face identification evaluation system: Its purpose, features and structure. In International Conference on Vision Systems, pages 304–311, 2003.
[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), pages 93–104, 2000.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[9] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX ’00: Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer-Verlag, 2000.
[10] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy high dimensional data. In SIAM International Conference on Data Mining, pages 47–58, 2003.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 226–231, 1996.
[12] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proc. Intl. European Conference on Computer Vision, volume 1, pages 242–256, 2004.
[13] M. Kendall. Rank Correlation Methods. Charles Griffin Company Limited, 1948.
[14] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3-4):237–253, 2000.
[15] L.-J. Li, G. Wang, and L. Fei-Fei. Optimol: automatic online picture collection via incremental model learning. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1–8, 2007.
[16] D. Ozkan and P. Duygu. A graph based approach for naming faces in news photos. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1477–1482, 2006.
[17] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.
[18] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, 1991.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.
[20] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004.
[21] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.

(a) - TBL - 11 irrelevant faces
(b) - DBO - 17 irrelevant faces
(c) - DSG - 18 irrelevant faces
(d) - LDS - 4 irrelevant faces

Figure 8. Top 50 faces ranked by the methods TBL, DBO, DSG and LDS. Irrelevant faces are marked with a star.

Figure 9. Performance of the ensemble classifiers and single classifiers.

(a) - 5 irrelevant faces
(b) - no irrelevant faces

Figure 10. Top 20 faces ranked by Google Image Search Engine (a) and ranked using our learned model (b). Irrelevant faces are marked with a star.

Figure 11. Image returned by GoogleSE for query ‘Gerhard Schroeder’. GoogleSE was unable to accurately identify who the queried person was, while the learned model of our approach accurately identified him.

Figure 12. Example in which the portion of relevant faces is dominant, but it is difficult to group all these faces into one cluster due to large facial variations. In feature space, the largest cluster formed from relevant faces is not the largest cluster among those formed from all returned faces. Irrelevant faces are marked with a star.

Figure 13. Many irrelevant faces annotated using the model learned from the data set shown in Figure 12. Irrelevant faces are marked with a star.
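
As a worked illustration of the precision-at-top-20 measure used in the evaluation, the counts given for Figure 10 (5 irrelevant faces for GoogleSE, none for the learned model) can be plugged into a small Python sketch. The function name `precision_at_k` and the list-of-booleans representation are assumptions made for illustration only, and the positions of the irrelevant faces within the top 20 are arbitrary because the text reports only their number. The last lines check that the pooled ratio of the reported face counts (2,342 relevant out of 4,103 detected) agrees with the stated GoogleSE accuracy of 57.08%.

```python
def precision_at_k(relevance, k):
    """Fraction of relevant items among the first k entries of a ranked list.

    `relevance` is a list of booleans in rank order (True = relevant face).
    This helper is hypothetical; it is not taken from the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


# Figure 10(a): 5 of the top 20 GoogleSE faces are irrelevant.
# The ordering is arbitrary; precision at 20 depends only on the count.
google_top20 = [True] * 15 + [False] * 5

# Figure 10(b): no irrelevant faces among the top 20 of the learned model.
model_top20 = [True] * 20

p_google = precision_at_k(google_top20, 20)  # 15 relevant out of 20
p_model = precision_at_k(model_top20, 20)    # 20 relevant out of 20

# Pooled ratio from the reported counts: 2,342 relevant of 4,103 detected faces.
overall_percent = round(100 * 2342 / 4103, 2)

print(p_google, p_model, overall_percent)
```

Precision at a fixed cutoff ignores how many relevant faces exist in total, which is why the evaluation also reports recall and precision over all detected faces of the test set.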
