Le Satoh Unsupervised Face Annotation Icdm08
Transcript

  • 1. 2008 Eighth IEEE International Conference on Data Mining

Unsupervised Face Annotation by Mining the Web

Duy-Dinh Le (ledduy@nii.ac.jp) and Shin’ichi Satoh (satoh@nii.ac.jp)
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, JAPAN 101-8430

Abstract

Searching for images of people is an essential task for image and video search engines. However, current search engines have limited capabilities for this task since they rely on text associated with images and video, and such text is likely to return many irrelevant results. We propose a method for retrieving relevant faces of one person by learning the visual consistency among results retrieved from text-correlation-based search engines. The method consists of two steps. In the first step, each candidate face obtained from a text-based search engine is ranked with a score that measures the distribution of visual similarities among the faces. Faces that are possibly very relevant or irrelevant are ranked at the top or bottom of the list, respectively. The second step improves this ranking by treating the problem as a classification problem in which input faces are classified as ’person-X’ or ’non-person-X’, and the faces are re-ranked according to a relevance score inferred from the classifier’s probability output. To train this classifier, we use a bagging-based framework to combine results from multiple weak classifiers trained using different subsets. These training subsets are extracted and labeled automatically from the rank list produced by the classifier trained in the previous step. In this way, the accuracy of the ranked list increases after a number of iterations. Experimental results on various face sets retrieved from captions of news photos show that the retrieval performance improved after each iteration, with the final performance being higher than those of the existing algorithms.

1. Introduction

With the rapid growth of digital technology, large image and video databases have become more available than ever to users. This trend has shown the need for effective and efficient tools for indexing and retrieving based on visual content. A typical application is searching for a specific person by providing his or her name. Most current search engines use the text associated with images and video as significant clues for returning results. However, other un-queried faces and names may appear with the queried ones (Figure 1), and this significantly lowers the retrieval performance. One way to improve the retrieval performance is to take into account visual information present in the retrieved faces. This task is challenging for the following reasons:

• Large variations in facial appearance due to pose changes, illumination conditions, occlusions, and facial expressions make face recognition difficult even with state-of-the-art techniques [1, 21, 2] (see example in Figure 2).

• The fact that the retrieved face set consists of faces of several people with no labels makes supervised and unsupervised learning methods inapplicable.

We propose a method for solving the above problem. The main idea is to assume that there is visual consistency among the results returned from text-based search engines, and this visual consistency is then learned through an iterative process. This method consists of two stages. In the first stage, we explore the local density of faces to identify potential candidates for relevant faces (faces related to the queried person) and irrelevant faces (faces unrelated to the queried person). This stage reflects the fact that the facial images of the queried person tend to form dense clusters, whereas irrelevant facial images are sparse since they look different from each other. For each face, we define a score to measure the density of its neighbor set. This score is used to form a ranked list, in which faces with high-density scores are considered relevant and are put at the top.

The above ranking method is weak since dense clusters have no guarantee of containing relevant faces. Therefore, a second stage is necessary to improve this ranked list. We model this problem as a classification problem in which input faces are classified as person-X (the queried person) or non-person-X (the un-queried person).

1550-4786/08 $25.00 © 2008 IEEE. DOI 10.1109/ICDM.2008.47
  • 2. The faces are ranked according to a relevancy score that is inferred from the classifier’s probability output. Since annotation data is not available, the rank list from the previous step is used to assign labels for a subset of faces. This subset is then used to train a classifier using supervised methods such as support vector machines (SVM). The trained classifier is used to re-rank faces in the original input set. This step is repeated a number of times to get the final ranked list. Since automatically assigning labels from the ranked list is not reliable, the trained classifiers are weak. To obtain the final strong classifier, we use the idea of ensemble learning [6] in which weak classifiers trained on different subsets are combined to improve the stability and classification accuracy of single classifiers. The learned classifier can be further used for recognizing new facial images of the queried person.

Figure 1. A news photo and its caption. Extracted faces are shown on the top. These faces might be returned for the query of person-Bush.

Figure 2. Large variations in facial expressions, poses, illumination conditions and occlusions making face recognition difficult. Best viewed in color.

The second stage improves the ranked list and recognition performance for the following reasons:

• Supervised learning methods, such as SVM, provide a strong theoretical background for finding the optimal decision boundary even with noisy data. Furthermore, recent studies [20, 17] suggest that SVM classifiers provide probability outputs that are suitable for ranking.

• The bagging framework helps to mitigate the noise introduced by the unsupervised labeling process.

Our contribution is two-fold:

• We propose a general framework to boost the face retrieval performance of text-based search engines by visual consistency learning. The framework seamlessly integrates data mining techniques such as supervised learning and unsupervised learning based on bagging. Our framework requires only a few parameters and works stably.

• We demonstrate its feasibility with a practical web mining application. A comprehensive evaluation on a large face dataset of many people was carried out and confirmed that our approach is promising.

2. Related Work

There are several approaches for re-ranking and learning models from web images. Their underlying assumption is that text-based search engines return a large fraction of relevant images. The challenge is how to model what is common in the relevant images. One approach is to model this problem in a probabilistic framework in which the returned images are used to learn the parameters of the model. For example, as described by Fergus et al. [12], objects retrieved using an image search engine are re-ranked by extending the constellation model. Another proposal, described in [15], uses a non-parametric graphical model and an interactive framework to simultaneously learn object class models and collect object class datasets. The main contribution of these approaches is probabilistic models that can be learned with a small number of training images.
  • 3. However, these models are complicated since they require several hundred parameters for learning and are susceptible to over-fitting. Furthermore, to obtain robust models, a small amount of supervision is required to select seed images.

Another study [4, 3] proposed a clustering-based method for associating names and faces in news photos. To solve the problem of ambiguity between several names and one face, a modified k-means clustering process was used in which faces are assigned to the closest cluster (each cluster corresponding to one name) after a number of iterations. Although the result was impressive, it is not easy to apply it to our problem since it is based on a strong assumption that requires a perfect alignment when a news photo only has one face and its caption only has one name. Furthermore, a large number of irrelevant faces (more than 12%) have to be manually eliminated before clustering.

A graph-based approach proposed by Ozkan and Duygulu [16], in which a graph is formed from faces as nodes and the weights of edges linked between nodes are the similarities of faces, is closely related to our problem. Assuming that the number of faces of the queried person is larger than that of others and that these faces tend to form the most similar subset among the set of retrieved faces, this problem is considered equal to the problem of finding the densest subgraph of a full graph, and can therefore be solved by taking an available solution [9]. Although experimental results showed the effectiveness of this method, it is still questionable whether the densest subgraph intuitively describes most of the relevant faces of the queried person and whether it is easy to extend to the ranking problem. Furthermore, choosing an optimal threshold to convert the initial graph into a binary one is difficult and rather ad hoc due to the curse of dimensionality.

An advantage of the methods [4, 3, 16] is that they are fully unsupervised. However, a disadvantage is that no model is learned for predicting new images of the same category. Furthermore, they perform hard categorization on input images, which is inapplicable to re-ranking. The balance of recall and precision was not addressed. Typically, these approaches tend to ignore the recall to obtain high precision. This leads to a reduction in the number of collected images.

Our approach combines a number of advances over the existing approaches. Specifically, we learn a model for each query from the returned images for purposes such as re-ranking and predicting new images. However, we used an unsupervised method to select training samples automatically, which is different from the methods proposed by Fergus et al. and Li et al. [12, 15]. This unsupervised method is different from the one by Ozkan and Duygulu [16] in the modeling of the distribution of relevant images. We use density-based estimation rather than the densest graph.

3 Proposed Framework

Given a set of images returned by any text-based search engine for a queried person (e.g. ’George Bush’), we perform a ranking process and learning of person X’s model as follows:

• Step 1: Detect faces and eye positions, and then perform face normalizations.

• Step 2: Compute an eigenface space and project the input faces into this subspace.

• Step 3: Estimate the ranked list of these faces using Rank-By-Local-Density-Score.

• Step 4: Improve this ranked list using Rank-By-Bagging-ProbSVM.

Steps 1 and 2 are typical for any face processing system, and they are described in section 4.2. The algorithms used in Steps 3 and 4 are described in section 3.1 and section 3.2, respectively. Figure 3 illustrates the proposed framework.

3.1 Ranking by Local Density Score

Figure 4. An example of faces retrieved for person-Donald Rumsfeld. Irrelevant faces are marked with a star. Irrelevant faces might form several clusters, but the relevant faces form the largest cluster.

Among the faces retrieved by text-based search engines for a query of person-X, as shown in Figure 4, relevant faces usually look similar and form the largest cluster. One approach to re-ranking these faces is to cluster them based on visual similarity. However, obtaining ideal clustering results is impossible since these faces are high dimensional data and the clusters have different shapes, sizes, and densities. Instead, a graph-based approach was proposed by Ozkan and Duygulu [16] in which the nodes are faces and edge weights are the similarities between two faces.
  • 4. Figure 3. The proposed framework for re-ranking faces returned by text-based search engines.

With the observation that the nodes (faces) of the queried person are similar to each other and different from other nodes in the graph, the densest component of the full graph (the set of highly connected nodes in the graph) will correspond to the faces of the queried person. The main drawback of this approach is that it needs a threshold to convert the initial weighted graph to a binary graph. Choosing this threshold in high dimensional spaces is difficult since different persons might have different optimal thresholds.

We use the idea of density-based clustering described by Ester et al. and Breunig et al. [11, 7] to solve this problem. Specifically, we define the local density score (LDS) of a point p (i.e. a face) as the average distance to its k-nearest neighbors:

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p,k)} \mathrm{distance}(p, q)}{k}$$

where R(p, k) is the set of k-neighbors of p, and distance(p, q) is the similarity between p and q.

Since faces are represented in high dimensional feature space, and face clusters might have different sizes, shapes, and densities, we do not directly use the Euclidean distance between two points in this feature space for distance(p, q). Instead, we use another similarity measure defined by the number of shared neighbors between two points. The efficiency of this similarity measure for density-based clustering methods was described in [10].

$$\mathrm{distance}(p, q) = \frac{|R(q, k) \cap R(p, k)|}{k}$$

Therefore

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p,k)} |R(q, k) \cap R(p, k)|}{k^2}$$

A high value of LDS(p, k) indicates a strong association between p and its neighbors. Therefore, we can use this local density score to rank faces. Faces with higher scores are considered to be potential candidates that are relevant to person-X, while faces with lower scores are considered as outliers and thus are potential candidates for non-person-X. Algorithm 1 describes these steps.

Algorithm 1: Rank-By-Local-Density-Score
Step 1: For each face p, compute LDS(p, k), where k is the number of neighbors of p and is the input of the ranking process.
Step 2: Rank these faces using LDS(p, k) (the higher the score, the more relevant).

3.2 Ranking by Bagging of SVM Classifiers

One limitation of the local density score based ranking is that it cannot handle faces of another person strongly associated in the k-neighbor set (for example, many duplicates). Therefore, another step is proposed for handling this case. As a result, we have a model that can be used for both re-ranking current faces and predicting new incoming faces.

The main idea is to use a probabilistic model to measure the relevancy of a face to person-X, $P(\text{person-X} \mid \text{face})$. Since the labels are not available for training, we use the input rank list found from the previous step to extract a subset of faces lying at the top and bottom of the ranked list to form the training set. After that, we use SVM with probabilistic output [17] implemented in LibSVM [8] to learn the person-X model. This model is applied to faces of the original set, and the output probabilistic scores are used to re-rank these faces. Since it is not guaranteed that faces lying at the two ends of the input rank list correctly correspond to the faces of person-X and faces of non-person-X, we adopt the idea of a bagging framework [6], in which randomly selecting subsets to train weak classifiers and then combining these classifiers helps reduce the risk of using noisy training sets.

The details of the Rank-By-Bagging-ProbSVM-InnerLoop method, which improves an input rank list by combining weak classifiers trained from subsets annotated by that rank list, are described in Algorithm 2.

Given an input ranked list, Rank-By-Bagging-ProbSVM-InnerLoop is used to improve this list. We repeat the process a number of times whereby the ranked list output from the previous step is used as the input ranked list of the next step.
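The local density score of Section 3.1 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the k-nearest-neighbor sets R(p, k) are found by brute-force Euclidean search over plain coordinate tuples (the paper works in an eigenface subspace), a point is excluded from its own neighbor set, and the function names `k_nearest`, `local_density_scores`, and `rank_by_local_density_score` are hypothetical.

```python
import math


def k_nearest(points, k):
    """For each point index, return the set of indices of its k nearest
    neighbours under Euclidean distance (assumption: the point itself
    is excluded from its own neighbour set)."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def local_density_scores(points, k):
    """LDS(p, k) = sum over q in R(p, k) of |R(q, k) & R(p, k)| / k^2."""
    r = k_nearest(points, k)
    return [
        sum(len(r[q] & r[p]) for q in r[p]) / (k * k)
        for p in range(len(points))
    ]


def rank_by_local_density_score(points, k):
    """Return point indices ordered from highest to lowest LDS."""
    scores = local_density_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
```

The brute-force neighbor search is quadratic in the number of faces, which is acceptable for a few thousand faces but would need an index structure for larger sets.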
  • 5. In this way, the iterations significantly improve the final ranked list. The details are described in Algorithm 3.

Algorithm 2: Rank-By-Bagging-ProbSVM-InnerLoop
Step 1: Train a weak classifier, $h_i$.
Step 1.1: Select a set $S_{pos}$ including p% of top ranked faces and then randomly select a subset $S_{pos}^{*}$ from $S_{pos}$. Label faces in $S_{pos}^{*}$ as positive samples.
Step 1.2: Select a set $S_{neg}$ including p% of bottom ranked faces and then randomly select a subset $S_{neg}^{*}$ from $S_{neg}$. Label faces in $S_{neg}^{*}$ as negative samples.
Step 1.3: Use $S_{pos}^{*}$ and $S_{neg}^{*}$ to train a weak classifier, $h_i$, using LibSVM [8] with probability outputs.
Step 2: Compute ensemble classifier $H_i = \sum_{j=1}^{i} h_j$.
Step 3: Apply $H_i$ to the original face set and form the rank list, $Rank_i$, using the output probabilistic scores.
Step 4: Repeat steps 1 to 3 until $\mathrm{Dist2RankList}(Rank_{i-1}, Rank_i) \le \epsilon$.
Step 5: Return $H_i = \sum_{j=1}^{i} h_j$.

Algorithm 3: Rank-By-Bagging-ProbSVM-OuterLoop
Step 1: $Rank_{cur}$ = Rank-By-Bagging-ProbSVM-InnerLoop($Rank_{prev}$).
Step 2: dist = Dist2RankList($Rank_{prev}$, $Rank_{cur}$).
Step 3: $Rank_{final} = Rank_{cur}$.
Step 4: $Rank_{prev} = Rank_{cur}$.
Step 5: Repeat steps 1 to 4 until $dist \le \epsilon$.
Step 6: Return $Rank_{final}$.

To determine the number of iterations of Rank-By-Bagging-ProbSVM-InnerLoop and Rank-By-Bagging-ProbSVM-OuterLoop, we use the Kendall tau distance [13], which is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are. The Kendall tau distance between two lists, $\tau_1$ and $\tau_2$, is defined as follows:

$$K(\tau_1, \tau_2) = \sum_{(i,j) \in P} K_{i,j}(\tau_1, \tau_2)$$

where P is the set of unordered pairs of distinct elements in $\tau_1$ and $\tau_2$, $K_{i,j}(\tau_1, \tau_2) = 0$ if i and j are in the same order in $\tau_1$ and $\tau_2$, and $K_{i,j}(\tau_1, \tau_2) = 1$ if i and j are in the opposite order in $\tau_1$ and $\tau_2$.

Since the maximum value of $K(\tau_1, \tau_2)$ is $N(N-1)/2$, where N is the number of members of the list, the normalized Kendall tau distance can be written as follows:

$$K_{norm}(\tau_1, \tau_2) = \frac{K(\tau_1, \tau_2)}{N(N-1)/2}$$

Using this measure for checking when the loops stop means that if the ranked list does not change significantly after a number of iterations, it is reasonable to stop.

4 Experiments

4.1 Dataset

We used the dataset described by Berg et al. [4] for our experiments. This dataset consists of approximately half a million news photos and captions from Yahoo News collected over a period of roughly two years. This dataset is better than datasets collected from image search engines such as Google, which usually limit the total number of returned images to 1,000. Furthermore, it has annotations that are valuable for evaluation of methods. Note that these annotations are used for evaluation purposes only. Our method is fully unsupervised, so it assumes the annotations are not available at running time.

Only frontal faces were considered since current frontal face detection systems [19] work in real time and have accuracies exceeding 95%. 44,773 faces were detected and normalized to the size of 86×86 pixels.

We selected fifteen government leaders, including George W. Bush (US), Vladimir Putin (Russia), Jiang Zemin (China), Tony Blair (UK), Junichiro Koizumi (Japan), Roh Moo-hyun (Korea), Abdullah Gul (Turkey), and other key individuals, such as John Paul II (the former Pope) and Hans Blix (UN), because their images frequently appear in the dataset [16]. Variations in each person’s name were collected. For example, George W. Bush, President Bush, U.S. President, etc., all refer to the current U.S. president.

We performed a simple string search in captions to check whether a caption contained one of these names. The faces extracted from the corresponding image associated with this caption were returned. The faces retrieved from the different name queries were merged into one set and used as input for ranking.

Figure 5 shows the distribution of retrieved faces from this method and the corresponding number of relevant faces for these fifteen individuals. In total, 5,603 faces were retrieved, of which 3,374 faces were relevant. On average, the accuracy was 60.22%.

4.2 Face Processing

We used an eye detector to detect the positions of the eyes of the detected faces. The eye detector, built with the same approach as that of Viola and Jones [19], had an accuracy of more than 95%. If the eye positions were not detected, predefined eye locations were assigned. The eye positions were used to align faces to a predefined canonical pose.

To compensate for illumination effects, the subtraction of the best-fit brightness plane followed by histogram equalization was applied. This normalization process is shown in Figure 6.
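The illumination compensation described above (best-fit brightness plane subtraction followed by histogram equalization) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: images are assumed to be rectangular lists of rows of gray values, the plane is fitted by ordinary least squares, and the residual is linearly rescaled to 256 integer levels before equalization. The rescaling step and the function names `subtract_brightness_plane` and `equalize_histogram` are assumptions introduced here.

```python
def subtract_brightness_plane(img):
    """Fit I(x, y) ~ a*x + b*y + c by least squares and return the residual.

    With coordinates centred on a full rectangular grid the cross terms
    vanish, so the slopes a and b can be estimated independently.
    """
    h, w = len(img), len(img[0])
    mx, my = (w - 1) / 2.0, (h - 1) / 2.0
    mean = sum(sum(row) for row in img) / (h * w)
    sxx = h * sum((x - mx) ** 2 for x in range(w))
    syy = w * sum((y - my) ** 2 for y in range(h))
    sxi = sum((x - mx) * img[y][x] for y in range(h) for x in range(w))
    syi = sum((y - my) * img[y][x] for y in range(h) for x in range(w))
    a = sxi / sxx if sxx else 0.0
    b = syi / syy if syy else 0.0
    return [
        [img[y][x] - (a * (x - mx) + b * (y - my) + mean) for x in range(w)]
        for y in range(h)
    ]


def equalize_histogram(img, levels=256):
    """Rescale values to integer gray levels (an assumption of this sketch)
    and apply standard CDF-based histogram equalization."""
    flat = [v for row in img for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: nothing to equalize
        return [[0 for _ in row] for row in img]
    scale = (levels - 1) / (hi - lo)
    quantized = [[int(round((v - lo) * scale)) for v in row] for row in img]
    hist = [0] * levels
    for row in quantized:
        for v in row:
            hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    return [
        [round((cdf[v] - cdf_min) * (levels - 1) / (n - cdf_min)) for v in row]
        for row in quantized
    ]
```

Calling `equalize_histogram(subtract_brightness_plane(face))` applies the two steps in the order given in the text.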
  • 6. Figure 5. Distribution of retrieved faces and relevant faces of 16 individuals used in experiments. Due to space limitation, bars corresponding to George Bush (2,282 vs. 1,284) and Tony Blair (682 vs. 323) were cut off at the upper limit of the graph.

We then used principal component analysis [18] to reduce the number of dimensions of the feature vector for face representation. Eigenfaces were computed from the original face set returned using the text-based query method. The number of eigenfaces used to form the eigen space was selected so that 97% of the total energy was retained [5]. The number of dimensions of these feature spaces ranged from 80 to 500.

Figure 6. Face normalization. (top) faces with detected eyes, (bottom) faces after normalization process.

4.3 Evaluation Criteria

We evaluated the retrieval performance with measures that are commonly used in information retrieval, such as precision, recall, and average precision. Given a queried person and letting $N_{ret}$ be the total number of faces returned, $N_{rel}$ the number of relevant faces among them, and $N_{hit}$ the total number of relevant faces, recall and precision can be calculated as follows:

$$Recall = \frac{N_{rel}}{N_{hit}}, \qquad Precision = \frac{N_{rel}}{N_{ret}}$$

Precision and recall are only used to evaluate the quality of an unordered set of retrieved faces. To evaluate ranked lists in which both recall and precision are taken into account, average precision is usually used. The average precision is computed by taking the average of the interpolated precision measured at the 11 recall levels of 0.0, 0.1, 0.2, ..., 1.0.

The interpolated precision $p_{interp}$ at a certain recall level r is defined as the highest precision found for any recall level $r' \ge r$:

$$p_{interp}(r) = \max_{r' \ge r} p(r')$$

In addition, to evaluate the performance of multiple queries, we used mean average precision, which is the mean of average precisions computed from queries (see http://trec.nist.gov/pubs/trec10/appendices/measures.pdf).

4.4 Parameters

The parameters of our method include:

• p: the fraction of faces at the top and bottom of the ranked list that are used to form a positive set $S_{pos}$ and negative set $S_{neg}$ for training weak classifiers in Rank-By-Bagging-ProbSVM-InnerLoop. We empirically selected p = 20% (i.e. 40% of the samples of the rank list were used) since a larger p will increase the number of incorrect labels, and a smaller p will cause over-fitting. In addition, $S_{pos}^{*}$ consists of $0.7 \times |S_{pos}|$ samples that are selected randomly with replacement from $S_{pos}$. This sampling strategy is adopted from the bagging framework [6]. The same setting was used for $S_{neg}^{*}$.

• $\epsilon$: the maximum Kendall tau distance $K_{norm}(\tau_1, \tau_2)$ between two rank lists $\tau_1$ and $\tau_2$. This value is used to determine when the inner loop and the outer loop stop. We set $\epsilon = 0.05$ for balancing between accuracy and processing time. Note that a smaller $\epsilon$ requires more iterations, making the system slower.

• kernel: the kernel type used for the SVM. The default is a linear kernel that is defined as $k(x, y) = x^{\top} y$. We have tested other kernel types such as RBF or polynomial, but the performance did not change much. Therefore, we used the linear kernel for simplicity.
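The stopping test controlled by the parameter ε can be sketched in Python as below. This is a minimal illustration of the normalized Kendall tau distance and of the threshold check, assuming both rank lists contain the same distinct items; the names `normalized_kendall_tau` and `has_converged` are hypothetical, and the pairwise loop is quadratic in the list length.

```python
from itertools import combinations


def normalized_kendall_tau(rank_a, rank_b):
    """Fraction of item pairs that appear in opposite order in the two lists.

    Both lists must be permutations of the same distinct items.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    if pos_a.keys() != pos_b.keys() or len(pos_a) != len(rank_a):
        raise ValueError("rank lists must contain the same distinct items")
    n = len(rank_a)
    if n < 2:
        return 0.0
    disagreements = sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return disagreements / (n * (n - 1) / 2)


def has_converged(rank_prev, rank_cur, epsilon=0.05):
    """Stop iterating once the two rank lists differ by at most epsilon."""
    return normalized_kendall_tau(rank_prev, rank_cur) <= epsilon
```

For example, two lists of ten faces that differ by one adjacent swap disagree on 1 of 45 pairs, which is below the threshold of 0.05 used in the experiments.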
  • 7. 4.5 Results

4.5.1 Performance Comparison with Existing Approaches

We compared our proposed method with other existing approaches.

• Text Based Baseline (TBL): Once faces corresponding with images whose captions contain the query name are returned, they are ranked in time order. This is a rather naive method in which no prior knowledge between names and faces is used.

• Distance-Based Outlier (DBO): We adopted the idea of distance-based outlier detection for ranking [14]. Given a threshold $d_{min}$, for each point p, we counted the number of points q such that $dist(p, q) \le d_{min}$, where dist(p, q) is the Euclidean distance between p and q in the feature space mentioned in section 4.2. This number was then used as the score to rank faces. We selected a range of $d_{min}$ values for experiments: $d_{min}$ = 10, 15, 20, ..., 90.

• Densest Sub-Graph based Method (DSG): We re-implemented the densest sub-graph based method [16] for ranking. Once the densest subgraph was found after an edge elimination process, we counted the number of surviving edges of each node (i.e. face) and used this number as the ranking score. To form the graph, the Euclidean distance dist(p, q) was used to assign the weight for the edge linked between node p and node q. DSG requires a threshold θ to convert the weighted graph to the binary graph before searching for the densest subgraph. We selected a range of θ values that are the same as the values used in DBO: θ = 10, 15, 20, ..., 90.

• Local Density Score (LDS): This is the first stage of our proposed method. It requires the input value k to compute the local density score. Since we do not know the number of returned faces from text-based search engines, we used another input value fraction defined as the fraction of neighbors and estimated k by the formula k = fraction × N, where N is the number of returned faces. We used a range of fraction values for experiments: fraction = 5%, 10%, 15%, ..., 50%. For a large number of returned faces, we set k to the maximum value of 200: k = 200.

• Unsupervised Ensemble Learning Using Local Density Score (UEL-LDS): This combines ranking by local density scores with a second stage in which the ranked list is used for training a classifier to boost the rank list.

• Supervised Learning (SVM-SUP): We randomly selected a portion p of the data with annotations to train the classifier, and then used this classifier to re-rank the remaining faces. This process was repeated five times and the average performance was reported. We used a range of portion p values for experiments: p = 1%, 2%, 3%, ..., 5%.

Figure 7. Performance comparison of methods. Due to different settings, performances are superimposed for better evaluation.

Figure 7 shows a performance comparison of these methods. Our proposed methods (LDS and UEL-LDS) outperform other unsupervised methods such as TBL, DBO and DSG. Furthermore, the performance of the DBO and DSG methods is sensitive to the distance threshold, while the performance of our proposed method is less sensitive. This confirms that the similarity measure using shared nearest neighbors is reliable for estimation of the local density score. The performance of UEL-LDS is only slightly better than that of LDS since the training sets labeled automatically from the ranked list are noisy. However, UEL-LDS improves significantly even when the performance of LDS is poor. These performances are worse than that of SVM-SUP using a small number of labeled samples.

Figure 8 shows an example of the top 50 faces ranked using the TBL, DBO, DSG and LDS methods. The performance of DBO is poor since a low threshold is used. This ranks irrelevant faces that are near duplicates (rows 2 and 3 in Figure 8(b)) higher than relevant faces. The same explanation applies to DSG.

4.5.2 Performance of Ensemble Classifiers

In Figure 9, we show the performance of five single classifiers and that of five ensemble classifiers.
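The single and ensemble classifiers compared here are built from automatically labeled, bagged training subsets (Algorithm 2, with the settings of section 4.4). The Python sketch below illustrates only the subset sampling and the summation of weak classifier outputs; SVM training with LibSVM is deliberately left out, and the function names `sample_training_subsets` and `ensemble_scores` as well as the `rng` argument are assumptions made for illustration.

```python
import random


def sample_training_subsets(ranked_faces, p=0.20, ratio=0.7, rng=None):
    """Pick bagged positive and negative training samples from a ranked list.

    S_pos is the top fraction p of the list and S_neg the bottom fraction p;
    each bagged subset draws ratio * |S| samples with replacement.
    """
    rng = rng or random.Random()
    n_end = max(1, int(len(ranked_faces) * p))
    s_pos = ranked_faces[:n_end]
    s_neg = ranked_faces[-n_end:]
    m = max(1, int(round(ratio * n_end)))
    s_pos_star = [rng.choice(s_pos) for _ in range(m)]
    s_neg_star = [rng.choice(s_neg) for _ in range(m)]
    return s_pos_star, s_neg_star


def ensemble_scores(weak_outputs):
    """Combine weak classifiers by summing their per-face probability
    outputs, H_i = sum_j h_j. weak_outputs[j][f] is the score that weak
    classifier j assigns to face f."""
    return [sum(per_face) for per_face in zip(*weak_outputs)]
```

Each call to `sample_training_subsets` yields the labeled samples for one weak classifier; re-ranking then sorts the faces by the values returned from `ensemble_scores`.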
  • 8. The ensemble classifier k is formed by combining single classifiers from 1 to k. It clearly indicates that the ensemble classifier is more stable than single weak classifiers.

4.5.3 New Face Annotation

We conducted another experiment to show the effectiveness of our approach in which learned models are used to annotate new faces of other databases. We used each name in the list as a query to obtain the top 500 images from the Google Image Search Engine (GoogleSE). Next, these images were processed using the steps described in section 4.2: extracting faces, detecting eyes and doing normalization. We projected these faces to the PCA subspace trained for that name and used the learned model to re-rank faces.

There were 4,103 faces (including false positives - non-faces detected as faces) detected from 7,500 returned images. We manually labeled these faces and there were 2,342 relevant faces. On average, the accuracy of the GoogleSE is 57.08%.

In Table 1, we compare the performance of the methods. The performance of UEL-LDS was obtained by running the best system, which is shown as the peak of the UEL-LDS curve in Figure 7. The performances of SVM-SUP-05 and SVM-SUP-10 correspond to the supervised systems (cf. section 4.5.1) that used p = 5% and p = 10% of the data set respectively. We evaluated the performance by calculating the precision at the top 20 returned faces, which is common for image search engines, and the recall and precision on all detected faces of the test set. UEL-LDS achieved comparable performance to the supervised methods and outperformed the baseline GoogleSE. The precision at the top 20 of SVM-SUP-05 is poorer than that of UEL-LDS due to the small number of training samples. Figure 10 shows the top 20 faces ranked using these two methods.

Table 1. Comparison of different methods on the new test set returned by Google Image Search Engine.

Method        Precision at top 20    Recall    Precision
GoogleSE      79.33                  100.00    57.08
UEL-LDS       89.00                  72.50     76.41
SVM-SUP-05    85.00                  73.14     76.46
SVM-SUP-10    90.67                  74.94     78.30

5 Discussion

Our approach works fairly well for well known people, where the main assumption that text-based search engines return a large fraction of relevant images is satisfied. Figure 12 shows an example where this assumption is broken. Consequently, as shown in Figure 13, the model learned by this set performed poorly in recognizing new faces returned by GoogleSE. Our approach solely relies on the above assumption; therefore, it is not affected by the ranking of text-based search engines.

The iteration of bagging SVM classifiers does not guarantee a significant improvement in performance. The aim of our future work is to study how to improve the quality of the training sets used in this iteration.

6 Conclusion

We presented a method for ranking faces retrieved using text-based correlation methods in searches for a specific person. This method learns the visual consistency among faces in a two-stage process. In the first stage, a relative density score is used to form a ranked list in which faces ranked at the top or bottom of the list are likely to be relevant or irrelevant faces, respectively. In the second stage, a bagging framework is used to combine weak classifiers trained on subsets labeled from the ranked list into a strong classifier. This strong classifier is then applied to the original set to re-rank faces on the basis of the output probabilistic scores. Experiments on various face sets showed the effectiveness of this method. Our approach is beneficial when there are several faces in a returned image, as shown in Figure 11.

References

[1] O. Arandjelovic and A. Zisserman. Automatic face recognition for film character retrieval in feature-length films. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 860–867, 2005.
[2] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski. Face recognition by independent component analysis. IEEE Transactions on Neural Networks, 13(6):1450–1464, Nov 2002.
[3] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who’s in the picture? In Advances in Neural Information Processing Systems, 2004.
[4] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. G. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 848–854, 2004.
[5] D. Bolme, R. Beveridge, M. Teixeira, and B. Draper. The CSU face identification evaluation system: Its purpose, features and structure. In International Conference on Vision Systems, pages 304–311, 2003.
[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
  • 9. [7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), pages 93–104, 2000.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[9] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX ’00: Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer-Verlag, 2000.
[10] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy high dimensional data. In SIAM International Conference on Data Mining, pages 47–58, 2003.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 226–231, 1996.
[12] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for google images. In Proc. Intl. European Conference on Computer Vision, volume 1, pages 242–256, 2004.
[13] M. Kendall. Rank Correlation Methods. Charles Griffin Company Limited, 1948.
[14] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3-4):237–253, 2000.
[15] L.-J. Li, G. Wang, and L. Fei-Fei. Optimol: automatic online picture collection via incremental model learning. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1–8, 2007.
[16] D. Ozkan and P. Duygu. A graph based approach for naming faces in news photos. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1477–1482, 2006.
[17] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.
[18] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, 1991.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.
[20] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004.
[21] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.

Figure 8. Top 50 faces ranked by the methods TBL, DBO, DSG and LDS. Irrelevant faces are marked with a star.
(a) - TBL - 11 irrelevant faces
(b) - DBO - 17 irrelevant faces
(c) - DSG - 18 irrelevant faces
(d) - LDS - 4 irrelevant faces
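The panel labels of Figure 8 and the counts reported in section 4.5.3 can be turned into the evaluation measures used in the paper with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: the function names `precision_at_k` and `precision_recall` are hypothetical, and the ranked lists are synthetic placeholders that only reproduce the published counts (the order of faces inside the top 50 does not affect precision at 50).

```python
def precision_at_k(relevance, k):
    """Fraction of relevant items among the first k entries of a ranked list.

    `relevance` is a list of booleans ordered by rank (True = relevant face).
    """
    if k <= 0 or not relevance:
        raise ValueError("k must be positive and the ranked list non-empty")
    top = relevance[:k]
    return sum(top) / len(top)


def precision_recall(returned_relevance, total_relevant):
    """Precision and recall of a returned set, given the number of relevant
    items that exist in the whole test set."""
    hits = sum(returned_relevance)
    return hits / len(returned_relevance), hits / total_relevant


# Number of irrelevant faces among the top 50, as labeled in Figure 8.
irrelevant_in_top_50 = {"TBL": 11, "DBO": 17, "DSG": 18, "LDS": 4}

p_at_50 = {}
for method, n_bad in irrelevant_in_top_50.items():
    # Synthetic ranked list that matches the reported count of irrelevant faces.
    ranked = [True] * (50 - n_bad) + [False] * n_bad
    p_at_50[method] = precision_at_k(ranked, 50)

# GoogleSE baseline from section 4.5.3: every one of the 4,103 detected faces
# is returned, and 2,342 of them were manually labeled as relevant.
baseline = [True] * 2342 + [False] * (4103 - 2342)
google_precision, google_recall = precision_recall(baseline, total_relevant=2342)

for method, value in p_at_50.items():
    print(f"{method}: precision at top 50 = {value:.2%}")
print(f"GoogleSE: precision = {google_precision:.2%}, recall = {google_recall:.2%}")
```

Under these definitions the Figure 8 counts correspond to a precision at the top 50 of 78% (TBL), 66% (DBO), 64% (DSG) and 92% (LDS), and the GoogleSE baseline works out to 2,342 / 4,103 = 57.08% precision at 100% recall, matching the first row of Table 1.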
  • 10. Figure 9. Performance of the ensemble classifiers and single classifiers.

Figure 10. Top 20 faces ranked by Google Image Search Engine (a) and ranked using our learned model (b). Irrelevant faces are marked with a star.
(a) - 5 irrelevant faces
(b) - no irrelevant faces

Figure 11. Image returned by GoogleSE for query ’Gerhard Schroeder’. GoogleSE was unable to accurately identify who the queried person was, while the learned model of our approach accurately identified him.

Figure 12. Example in which the portion of relevant faces is dominant, but it is difficult to group all these faces into one cluster due to large facial variations. In feature space, the largest cluster formed from relevant faces is not the largest cluster among those formed from all returned faces. Irrelevant faces are marked with a star.

Figure 13. Many irrelevant faces annotated using the model learned from the data set shown in Figure 12. Irrelevant faces are marked with a star.
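Figure 9 compares single classifiers with ensemble classifiers, where ensemble classifier k combines single classifiers 1 to k and the final re-ranking uses the output probabilistic scores. The sketch below illustrates that idea under a stated assumption: the combination rule is taken to be a plain average of per-classifier probabilities, which this part of the paper does not specify. The function names `ensemble_scores` and `rerank`, the face identifiers, and the probability values are all hypothetical and chosen only for illustration.

```python
def ensemble_scores(per_classifier_probs, k):
    """Combine classifiers 1..k by averaging their probability outputs.

    `per_classifier_probs[j][i]` is the probability that classifier j assigns
    to face i being relevant. Averaging is an assumption made for this sketch.
    """
    if not 1 <= k <= len(per_classifier_probs):
        raise ValueError("k must be between 1 and the number of classifiers")
    first_k = per_classifier_probs[:k]
    n_faces = len(first_k[0])
    return [sum(probs[i] for probs in first_k) / k for i in range(n_faces)]


def rerank(face_ids, scores):
    """Return face ids sorted by descending score."""
    pairs = sorted(zip(face_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [face_id for face_id, _ in pairs]


# Hypothetical probability outputs of three weak classifiers on four faces.
faces = ["f0", "f1", "f2", "f3"]
probs = [
    [0.9, 0.2, 0.6, 0.4],  # classifier 1
    [0.7, 0.4, 0.3, 0.6],  # classifier 2
    [0.8, 0.1, 0.7, 0.3],  # classifier 3
]

# Rankings produced by ensemble classifier k for k = 1, 2, 3.
rankings = {k: rerank(faces, ensemble_scores(probs, k)) for k in (1, 2, 3)}
for k, order in rankings.items():
    print(f"ensemble of classifiers 1..{k}: {order}")
```

In this toy data the second classifier disagrees with the other two about faces f2 and f3, so the ranking changes when it is added and settles again once the third classifier is averaged in. This is only meant to convey why averaging several weak outputs can damp the fluctuations of any single one; it makes no claim about the actual curves in Figure 9.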