2008 Eighth IEEE International Conference on Data Mining




                            Unsupervised Face Annotation by Mining the Web

                                Duy-Dinh Le                                         Shin’ichi Satoh
                       National Institute of Informatics                    National Institute of Informatics
                       2-1-2 Hitotsubashi, Chiyoda-ku                       2-1-2 Hitotsubashi, Chiyoda-ku
                          Tokyo, JAPAN 101-8430                                Tokyo, JAPAN 101-8430
                              ledduy@nii.ac.jp                                      satoh@nii.ac.jp


1550-4786/08 $25.00 © 2008 IEEE
DOI 10.1109/ICDM.2008.47

Abstract

Searching for images of people is an essential task for image and video search engines. However, current search engines have limited capabilities for this task since they rely on text associated with images and video, and such text is likely to return many irrelevant results. We propose a method for retrieving relevant faces of one person by learning the visual consistency among results retrieved from text-correlation-based search engines. The method consists of two steps. In the first step, each candidate face obtained from a text-based search engine is ranked with a score that measures the distribution of visual similarities among the faces. Faces that are possibly very relevant or irrelevant are ranked at the top or bottom of the list, respectively. The second step improves this ranking by treating this problem as a classification problem in which input faces are classified as 'person-X' or 'non-person-X', and the faces are re-ranked according to their relevance score inferred from the classifier's probability output. To train this classifier, we use a bagging-based framework to combine results from multiple weak classifiers trained using different subsets. These training subsets are extracted and labeled automatically from the rank list produced by the classifier trained in the previous step. In this way, the accuracy of the ranked list increases after a number of iterations. Experimental results on various face sets retrieved from captions of news photos show that the retrieval performance improved after each iteration, with the final performance being higher than that of the existing algorithms.

1. Introduction

With the rapid growth of digital technology, large image and video databases have become more available than ever to users. This trend has shown the need for effective and efficient tools for indexing and retrieving based on visual content. A typical application is searching for a specific person by providing his or her name. Most current search engines use the text associated with images and video as significant clues for returning results. However, other un-queried faces and names may appear with the queried ones (Figure 1), and this significantly lowers the retrieval performance. One way to improve the retrieval performance is to take into account visual information present in the retrieved faces. This task is challenging for the following reasons:

  • Large variations in facial appearance due to pose changes, illumination conditions, occlusions, and facial expressions make face recognition difficult even with state-of-the-art techniques [1, 21, 2] (see example in Figure 2).
  • The fact that the retrieved face set consists of faces of several people with no labels makes supervised and unsupervised learning methods inapplicable.

We propose a method for solving the above problem. The main idea is to assume that there is visual consistency among the results returned from text-based search engines and this visual consistency is then learned through an interactive process. This method consists of two stages. In the first stage, we explore the local density of faces to identify potential candidates for relevant faces¹ and irrelevant faces². This stage reflects the fact that the facial images of the queried person tend to form dense clusters, whereas irrelevant facial images are sparse since they look different from each other. For each face, we define a score to measure the density of its neighbor set. This score is used to form a ranked list, in which faces with high-density scores are considered relevant and are put at the top.

¹ Faces related to the queried person.
² Faces unrelated to the queried person.

The above ranking method is weak since dense clusters have no guarantee of containing relevant faces. Therefore, a second stage is necessary to improve this ranked list. We model this problem as a classification problem in which input faces are classified as person-X (the queried person)
or non-person-X (the un-queried person). The faces are ranked according to a relevancy score that is inferred from the classifier's probability output. Since annotation data is not available, the rank list from the previous step is used to assign labels for a subset of faces. This subset is then used to train a classifier using supervised methods such as support vector machines (SVM). The trained classifier is used to re-rank faces in the original input set. This step is repeated a number of times to get the final ranked list. Since automatically assigning labels from the ranked list is not reliable, the trained classifiers are weak. To obtain the final strong classifier, we use the idea of ensemble learning [6] in which weak classifiers trained on different subsets are combined to improve the stability and classification accuracy of single classifiers. The learned classifier can be further used for recognizing new facial images of the queried person.

Figure 1. A news photo and its caption. Extracted faces are shown on the top. These faces might be returned for the query of person-Bush.

Figure 2. Large variations in facial expressions, poses, illumination conditions and occlusions make face recognition difficult. Best viewed in color.

The second stage improves the ranked list and recognition performance for the following reasons:

  • Supervised learning methods, such as SVM, provide a strong theoretical background for finding the optimal decision boundary even with noisy data. Furthermore, recent studies [20, 17] suggest that SVM classifiers provide probability outputs that are suitable for ranking.
  • The bagging framework helps to reduce the effect of noise in the unsupervised labeling process.

Our contribution is two-fold:

  • We propose a general framework to boost the face retrieval performance of text-based search engines by visual consistency learning. The framework seamlessly integrates data mining techniques such as supervised learning and unsupervised learning based on bagging. Our framework requires only a few parameters and works stably.
  • We demonstrate its feasibility with a practical web mining application. A comprehensive evaluation on a large face dataset of many people was carried out and confirmed that our approach is promising.

2. Related Work

There are several approaches for re-ranking and learning models from web images. Their underlying assumption is that text-based search engines return a large fraction of relevant images. The challenge is how to model what is common in the relevant images. One approach is to model this problem in a probabilistic framework in which the returned images are used to learn the parameters of the model. For example, as described by Fergus et al. [12], objects retrieved using an image search engine are re-ranked by extending the constellation model. Another proposal, described in [15], uses a non-parametric graphical model and an interactive framework to simultaneously learn object class models and collect object class datasets. The main contribution of these approaches is probabilistic models that can be learned with a small number of training images. However, these models are complicated since they
require several hundred parameters for learning and are susceptible to over-fitting. Furthermore, to obtain robust models, a small amount of supervision is required to select seed images.

Another study [4, 3] proposed a clustering-based method for associating names and faces in news photos. To solve the problem of ambiguity between several names and one face, a modified k-means clustering process was used in which faces are assigned to the closest cluster (each cluster corresponding to one name) after a number of iterations. Although the result was impressive, it is not easy to apply it to our problem since it is based on a strong assumption that requires a perfect alignment when a news photo only has one face and its caption only has one name. Furthermore, a large number of irrelevant faces (more than 12%) have to be manually eliminated before clustering.

The graph-based approach proposed by Ozkan and Duygulu [16], in which a graph is formed from faces as nodes and the weights of edges linked between nodes are the similarities of faces, is closely related to our problem. Assuming that the number of faces of the queried person is larger than that of others and that these faces tend to form the most similar subset among the set of retrieved faces, this problem is considered equal to the problem of finding the densest subgraph of a full graph, and can therefore be solved by taking an available solution [9]. Although experimental results showed the effectiveness of this method, it is still questionable whether the densest subgraph intuitively describes most of the relevant faces of the queried person and whether it is easy to extend to the ranking problem. Furthermore, choosing an optimal threshold to convert the initial graph into a binary one is difficult and rather ad hoc due to the curse of dimensionality.

An advantage of the methods [4, 3, 16] is that they are fully unsupervised. However, a disadvantage is that no model is learned for predicting new images of the same category. Furthermore, they perform hard categorization on input images, which is inapplicable to re-ranking. The balance of recall and precision was not addressed. Typically, these approaches tend to ignore the recall to obtain high precision. This leads to a reduction in the number of collected images.

Our approach combines a number of advances over the existing approaches. Specifically, we learn a model for each query from the returned images for purposes such as re-ranking and predicting new images. However, we used an unsupervised method to select training samples automatically, which is different from the methods proposed by Fergus et al. and Li et al. [12, 15]. This unsupervised method is different from the one by Ozkan and Duygulu [16] in the modeling of the distribution of relevant images. We use density-based estimation rather than the densest graph.

3. Proposed Framework

Given a set of images returned by any text-based search engine for a queried person (e.g. 'George Bush'), we perform a ranking process and learning of person X's model as follows:

  • Step 1: Detect faces and eye positions, and then perform face normalizations.
  • Step 2: Compute an eigenface space and project the input faces into this subspace.
  • Step 3: Estimate the ranked list of these faces using Rank-By-Local-Density-Score.
  • Step 4: Improve this ranked list using Rank-By-Bagging-ProbSVM.

Steps 1 and 2 are typical for any face processing system, and they are described in section 4.2. The algorithms used in Steps 3 and 4 are described in section 3.1 and section 3.2, respectively. Figure 3 illustrates the proposed framework.

3.1 Ranking by Local Density Score

Figure 4. An example of faces retrieved for person-Donald Rumsfeld. Irrelevant faces are marked with a star. Irrelevant faces might form several clusters, but the relevant faces form the largest cluster.

Among the faces retrieved by text-based search engines for a query of person-X, as shown in Figure 4, relevant faces usually look similar and form the largest cluster. One approach to re-ranking these faces is to cluster them based on visual similarity. However, obtaining ideal clustering results is impossible since these faces are high dimensional data and the clusters have different shapes, sizes, and densities. Instead, a graph-based approach was proposed by Ozkan and Duygulu [16] in which the nodes are faces and edge weights
are the similarities between two faces. With the observation that the nodes (faces) of the queried person are similar to each other and different from other nodes in the graph, the densest component of the full graph (the set of highly connected nodes in the graph) will correspond to the faces of the queried person. The main drawback of this approach is that it needs a threshold to convert the initial weighted graph to a binary graph. Choosing this threshold in high dimensional spaces is difficult since different persons might have different optimal thresholds.

Figure 3. The proposed framework for re-ranking faces returned by text-based search engines.

We use the idea of density-based clustering described by Ester et al. and Breunig et al. [11, 7] to solve this problem. Specifically, we define the local density score (LDS) of a point p (i.e. a face) as the average distance to its k nearest neighbors:

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p,k)} \mathrm{distance}(p, q)}{k}$$

where R(p, k) is the set of k nearest neighbors of p, and distance(p, q) is the similarity between p and q (so a larger value means that p and q are more closely associated).

Since faces are represented in high dimensional feature space, and face clusters might have different sizes, shapes, and densities, we do not directly use the Euclidean distance between two points in this feature space for distance(p, q). Instead, we use another similarity measure defined by the number of shared neighbors between two points. The efficiency of this similarity measure for density-based clustering methods was described in [10].

$$\mathrm{distance}(p, q) = \frac{|R(q, k) \cap R(p, k)|}{k}$$

Therefore

$$\mathrm{LDS}(p, k) = \frac{\sum_{q \in R(p,k)} |R(q, k) \cap R(p, k)|}{k^{2}}$$

A high value of LDS(p, k) indicates a strong association between p and its neighbors. Therefore, we can use this local density score to rank faces. Faces with higher scores are considered to be potential candidates that are relevant to person-X, while faces with lower scores are considered as outliers and thus are potential candidates for non-person-X. Algorithm 1 describes these steps.

Algorithm 1: Rank-By-Local-Density-Score
Step 1: For each face p, compute LDS(p, k), where k is the number of neighbors of p and is the input of the ranking process.
Step 2: Rank these faces using LDS(p, k) (the higher the score, the more relevant).

3.2 Ranking by Bagging of SVM Classifiers

One limitation of the ranking based on the local density score is that it cannot handle faces of another person that are strongly associated in the k-neighbor set (for example, many duplicates). Therefore, another step is proposed for handling this case. As a result, we have a model that can be used for both re-ranking current faces and predicting new incoming faces.

The main idea is to use a probabilistic model to measure the relevancy of a face to person-X, P(person-X | face). Since the labels are not available for training, we use the input rank list found from the previous step to extract a subset of faces lying at the top and bottom of the ranked list to form the training set. After that, we use SVM with probabilistic output [17] implemented in LibSVM [8] to learn the person-X model. This model is applied to faces of the original set, and the output probabilistic scores are used to re-rank these faces. Since it is not guaranteed that faces lying at the two ends of the input rank list correctly correspond to the faces of person-X and faces of non-person-X, we adopt the idea of a bagging framework [6], in which randomly selecting subsets to train weak classifiers and then combining these classifiers helps reduce the risk of using noisy training sets.

The Rank-By-Bagging-ProbSVM-InnerLoop method, which improves an input rank list by combining weak classifiers trained on subsets annotated by that rank list, is described in Algorithm 2.

Given an input ranked list, Rank-By-Bagging-ProbSVM-InnerLoop is used to improve this list. We repeat the process a number of times whereby the ranked list output from the previous step is used as the input ranked list of the next step.
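As an illustration of Section 3.1, the following Python sketch computes LDS(p, k) from the shared-neighbor formula and ranks the points by it, as in Algorithm 1. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, and the neighbor sets R(p, k) are assumed to come from a brute-force Euclidean search over the projected feature vectors, which the text does not specify.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def k_nearest(points, k):
    """Return, for every point index, the set R(p, k) of its k nearest neighbours.

    Assumption: neighbours are found by brute-force Euclidean search.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def local_density_scores(points, k):
    """LDS(p, k) = sum over q in R(p, k) of |R(q, k) & R(p, k)|, divided by k^2."""
    r = k_nearest(points, k)
    return [
        sum(len(r[q] & r[p]) for q in r[p]) / (k * k)
        for p in range(len(points))
    ]


def rank_by_local_density_score(points, k):
    """Algorithm 1: return point indices sorted by decreasing LDS, plus the scores."""
    scores = local_density_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

The brute-force neighbor search is quadratic in the number of faces, which is likely acceptable for a single query's face set but would need an index structure for much larger collections.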




Iterating in this way significantly improves the final ranked list. The details are described in Algorithm 3.

Algorithm 2: Rank-By-Bagging-ProbSVM-InnerLoop
Step 1: Train a weak classifier, $h_i$.
Step 1.1: Select a set $S_{pos}$ including p% of top ranked faces and then randomly select a subset $S^{*}_{pos}$ from $S_{pos}$. Label faces in $S^{*}_{pos}$ as positive samples.
Step 1.2: Select a set $S_{neg}$ including p% of bottom ranked faces and then randomly select a subset $S^{*}_{neg}$ from $S_{neg}$. Label faces in $S^{*}_{neg}$ as negative samples.
Step 1.3: Use $S^{*}_{pos}$ and $S^{*}_{neg}$ to train a weak classifier, $h_i$, using LibSVM [8] with probability outputs.
Step 2: Compute ensemble classifier $H_i = \sum_{j=1}^{i} h_j$.
Step 3: Apply $H_i$ to the original face set and form the rank list, $Rank_i$, using the output probabilistic scores.
Step 4: Repeat steps 1 to 3 until Dist2RankList($Rank_{i-1}$, $Rank_i$) <= $\epsilon$.
Step 5: Return $H_i = \sum_{j=1}^{i} h_j$.

Algorithm 3: Rank-By-Bagging-ProbSVM-OuterLoop
Step 1: $Rank_{cur}$ = Rank-By-Bagging-ProbSVM-InnerLoop($Rank_{prev}$).
Step 2: dist = Dist2RankList($Rank_{prev}$, $Rank_{cur}$).
Step 3: $Rank_{final}$ = $Rank_{cur}$.
Step 4: $Rank_{prev}$ = $Rank_{cur}$.
Step 5: Repeat steps 1 to 4 until dist <= $\epsilon$.
Step 6: Return $Rank_{final}$.

To determine the number of iterations of Rank-By-Bagging-ProbSVM-InnerLoop and Rank-By-Bagging-ProbSVM-OuterLoop, we use the Kendall tau distance [13], which is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are. The Kendall tau distance between two lists, $\tau_1$ and $\tau_2$, is defined as follows:

$$K(\tau_1, \tau_2) = \sum_{(i,j) \in P} K_{i,j}(\tau_1, \tau_2)$$

where P is the set of unordered pairs of distinct elements in $\tau_1$ and $\tau_2$. $K_{i,j}(\tau_1, \tau_2) = 0$ if i and j are in the same order in $\tau_1$ and $\tau_2$, and $K_{i,j}(\tau_1, \tau_2) = 1$ if i and j are in the opposite order in $\tau_1$ and $\tau_2$.

Since the maximum value of $K(\tau_1, \tau_2)$ is $N(N-1)/2$, where N is the number of members of the list, the normalized Kendall tau distance can be written as follows:

$$K_{norm}(\tau_1, \tau_2) = \frac{K(\tau_1, \tau_2)}{N(N-1)/2}$$

Using this measure as the stopping test for the loops means that if the ranked list does not change significantly after a number of iterations, it is reasonable to stop.

4. Experiments

4.1 Dataset

We used the dataset described by Berg et al. [4] for our experiments. This dataset consists of approximately half a million news photos and captions from Yahoo News collected over a period of roughly two years. This dataset is better than datasets collected from image search engines such as Google that usually limit the total number of returned images to 1,000. Furthermore, it has annotations that are valuable for evaluation of methods. Note that these annotations are used for evaluation purposes only. Our method is fully unsupervised, so it assumes the annotations are not available at running time.

Only frontal faces were considered since current frontal face detection systems [19] work in real time and have accuracies exceeding 95%. In total, 44,773 faces were detected and normalized to the size of 86×86 pixels.

We selected fifteen government leaders, including George W. Bush (US), Vladimir Putin (Russia), Jiang Zemin (China), Tony Blair (UK), Junichiro Koizumi (Japan), Roh Moo-hyun (Korea), Abdullah Gul (Turkey), and other key individuals, such as John Paul II (the Former Pope) and Hans Blix (UN), because their images frequently appear in the dataset [16]. Variations in each person's name were collected. For example, George W. Bush, President Bush, U.S. President, etc., all refer to the current U.S. president.

We performed a simple string search in captions to check whether a caption contained one of these names. The faces extracted from the corresponding image associated with this caption were returned. The faces retrieved from the different name queries were merged into one set and used as input for ranking.

Figure 5 shows the distribution of retrieved faces from this method and the corresponding number of relevant faces for these fifteen individuals. In total, 5,603 faces were retrieved, of which 3,374 faces were relevant. On average, the accuracy was 60.22%.

4.2 Face Processing

We used an eye detector to detect the positions of the eyes of the detected faces. The eye detector, built with the same approach as that of Viola and Jones [19], had an accuracy of more than 95%. If the eye positions were not detected, predefined eye locations were assigned. The eye positions were used to align faces to a predefined canonical pose.

To compensate for illumination effects, the subtraction of the best-fit brightness plane followed by histogram equalization was applied. This normalization process is shown in Figure 6.
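The subset selection and score combination of Algorithm 2 (Steps 1.1, 1.2, 2 and 3) can be sketched in Python without an SVM library. This is a hedged illustration only: the function names are hypothetical, training of the weak classifier (Step 1.3, done with LibSVM in the paper) is omitted, each weak classifier's output is assumed to be available as a mapping from face identifier to probability, and the defaults follow the values p = 20% and 0.7 × |S_pos| reported in Section 4.4.

```python
import random


def select_training_subsets(ranked_faces, p=0.20, sample_ratio=0.7, rng=None):
    """Steps 1.1 and 1.2: draw bootstrap subsets from both ends of a ranked list.

    ranked_faces is ordered from most to least relevant. The top and bottom
    fraction p form S_pos and S_neg; sample_ratio * |S| items are then drawn
    from each *with replacement*, as in the bagging framework.
    """
    rng = rng or random.Random()
    n = max(1, int(round(p * len(ranked_faces))))
    s_pos, s_neg = ranked_faces[:n], ranked_faces[-n:]
    m = max(1, int(round(sample_ratio * n)))
    pos_sample = [rng.choice(s_pos) for _ in range(m)]
    neg_sample = [rng.choice(s_neg) for _ in range(m)]
    return pos_sample, neg_sample


def ensemble_rank(weak_scores):
    """Steps 2 and 3: sum the weak classifiers' scores and re-rank the faces.

    weak_scores is a list with one dict per weak classifier h_j, mapping a
    face identifier to that classifier's probability output (assumed format).
    """
    total = {}
    for scores in weak_scores:
        for face, prob in scores.items():
            total[face] = total.get(face, 0.0) + prob
    return sorted(total, key=lambda face: total[face], reverse=True)
```

Summing rather than averaging the probabilities follows the definition of the ensemble in Algorithm 2; both give the same ordering when every weak classifier scores every face.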




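The normalized Kendall tau distance used as the stopping test in Algorithms 2 and 3 can be sketched as follows. The code is an illustration rather than the authors' implementation: the function names, the `improve` callback standing in for one pass of the inner loop, and the `max_rounds` safety limit are assumptions introduced here, and both lists are assumed to contain the same distinct items.

```python
from itertools import combinations


def kendall_tau_distance(rank_a, rank_b):
    """Count the pairs of items that appear in opposite order in the two lists."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    if len(pos_a) != len(rank_a) or pos_a.keys() != pos_b.keys():
        raise ValueError("both lists must contain the same distinct items")
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def normalized_kendall_tau(rank_a, rank_b):
    """Divide the distance by its maximum value N(N - 1)/2."""
    n = len(rank_a)
    if n < 2:
        return 0.0
    return kendall_tau_distance(rank_a, rank_b) / (n * (n - 1) / 2)


def iterate_until_stable(rank, improve, epsilon=0.05, max_rounds=100):
    """Loop skeleton in the spirit of Algorithm 3.

    `improve` is a hypothetical callable that maps a ranked list to a new one;
    iteration stops once two successive lists differ by at most epsilon.
    """
    for _ in range(max_rounds):
        new_rank = improve(rank)
        changed = normalized_kendall_tau(rank, new_rank)
        rank = new_rank
        if changed <= epsilon:
            break
    return rank
```

The pairwise count is quadratic in the list length; it is kept here because it mirrors the definition directly.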
lated as follows:
                                                                                                             Nrel
                                                                                                Recall =
                                                                                                             Nhit

                                                                                                                Nrel
                                                                                             P recision =
                                                                                                                Nret

                                                                          Precision and recall are only used to evaluate the quality
                                                                      of an unordered set of retrieved faces. To evaluate ranked
                                                                      lists in which both recall and precision are taken into ac-
                                                                      count, average precision is usually used. The average pre-
                                                                      cision is computed by taking the average of the interpolated
                                                                      precision measured at the 11 recall levels of 0.0, 0.1, 0.2, ...,
   Figure 5. Distribution of retrieved faces and
                                                                      1.0.
   relevant faces of 16 individuals used in ex-
                                                                          The interpolated precision pinterp at a certain recall level
   periments. Due to space limitation, bars cor-
                                                                      r is defined as the highest precision found for any recall
   responding to George Bush (2,282 vs. 1,284)
                                                                      level q ≥ r:
   and Tony Blair (682 vs. 323) were cut-off at
   the upper limit of the graph.
Figure 6.

   We then used principal component analysis [18] to reduce the number of
dimensions of the feature vector for face representation. Eigenfaces were
computed from the original face set returned by the text-based query method.
The number of eigenfaces used to form the eigenspace was selected so that 97%
of the total energy was retained [5]. The number of dimensions of these
feature spaces ranged from 80 to 500.

   Figure 6. Face normalization. (top) Faces with detected eyes; (bottom)
   faces after the normalization process.

4.3    Evaluation Criteria

   We evaluated the retrieval performance with measures that are commonly
used in information retrieval, such as precision, recall, and average
precision. Given a queried person and letting $N_{ret}$ be the total number
of faces returned, $N_{rel}$ the number of relevant faces among them, and
$N_{hit}$ the total number of relevant faces, recall and precision can be
calcu-

        $p_{interp}(r) = \max_{r' \geq r} p(r')$

   In addition, to evaluate the performance over multiple queries, we used
mean average precision, which is the mean of the average precisions computed
over the queries (footnote 3).

   Footnote 3: http://trec.nist.gov/pubs/trec10/appendices/measures.pdf

4.4    Parameters

   The parameters of our method include:

  • p: the fraction of faces at the top and bottom of the ranked list that
    are used to form a positive set $S_{pos}$ and a negative set $S_{neg}$
    for training weak classifiers in Rank-By-Bagging-ProbSVM-InnerLoop. We
    empirically selected p = 20% (i.e., 40% of the samples in the ranked list
    were used), since a larger p increases the number of incorrect labels and
    a smaller p causes over-fitting. In addition, $S^{*}_{pos}$ consists of
    $0.7 \times |S_{pos}|$ samples that are selected randomly with
    replacement from $S_{pos}$. This sampling strategy is adopted from the
    bagging framework [6]. The same setting was used for $S^{*}_{neg}$.

  • $\epsilon$: the maximum Kendall tau distance $K_{norm}(\tau_1, \tau_2)$
    between two rank lists $\tau_1$ and $\tau_2$. This value is used to
    determine when the inner loop and the outer loop stop. We set
    $\epsilon = 0.05$ to balance accuracy against processing time. Note that
    a smaller $\epsilon$ requires more iterations, which slows the system
    down.

  • kernel: the kernel type used for the SVM. The default is a linear kernel,
    defined as $k(x, y) = x \cdot y$. We also tested other kernel types, such
    as RBF and polynomial, but the performance did not change much.
    Therefore, we used the linear kernel for simplicity.
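The roles of the parameters p and the Kendall tau threshold in Section 4.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the rounding of set sizes, and the rule "stop when the normalized distance is at most the threshold" are assumptions made here for illustration. Rankings are plain lists with the best item first.

```python
import random
from itertools import combinations


def select_training_sets(ranked_ids, p=0.20, bag_ratio=0.7, rng=None):
    """Draw bagged pseudo-labeled sets from a ranked list (best first).

    The top p fraction plays the role of S_pos and the bottom p fraction
    the role of S_neg; each bagged set holds bag_ratio * |S| items drawn
    with replacement. Rounding of the sizes is an assumption of this sketch.
    """
    rng = rng or random.Random()
    n_side = max(1, int(round(p * len(ranked_ids))))
    s_pos = ranked_ids[:n_side]
    s_neg = ranked_ids[-n_side:]
    n_bag = max(1, int(round(bag_ratio * n_side)))
    s_pos_star = [rng.choice(s_pos) for _ in range(n_bag)]
    s_neg_star = [rng.choice(s_neg) for _ in range(n_bag)]
    return s_pos_star, s_neg_star


def normalized_kendall_tau(rank_a, rank_b):
    """Fraction of item pairs that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    items = list(pos_a)
    n = len(items)
    if n < 2:
        return 0.0
    discordant = sum(
        1
        for x, y in combinations(items, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / (n * (n - 1) / 2)


def has_converged(prev_rank, new_rank, eps=0.05):
    """Assumed stopping rule: the two rankings differ by at most eps."""
    return normalized_kendall_tau(prev_rank, new_rank) <= eps
```

With p = 20% and a list of 100 faces, each side contributes 20 candidates and each bagged set holds 14 samples. The pairwise distance is quadratic in the list length, which is acceptable for result lists of a few hundred faces.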




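The evaluation measures of Section 4.3 can be sketched in the same way. The code below assumes the standard non-interpolated definition of average precision (the mean of the precision values at the ranks of relevant items, divided by the total number of relevant items); the function names are illustrative and not taken from the paper.

```python
def precision_recall_points(relevance, total_relevant):
    """Return (recall, precision) after each rank of a ranked list.

    relevance: booleans in ranked order (best first).
    total_relevant: number of relevant items for the query (must be > 0).
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    points = []
    hits = 0
    for rank, is_rel in enumerate(relevance, start=1):
        hits += bool(is_rel)
        points.append((hits / total_relevant, hits / rank))
    return points


def interpolated_precision(points, r):
    """p_interp(r): highest precision at any recall level r' >= r."""
    return max((prec for rec, prec in points if rec >= r), default=0.0)


def average_precision(relevance, total_relevant):
    """Non-interpolated average precision of one ranked list."""
    hits = 0
    total = 0.0
    for rank, is_rel in enumerate(relevance, start=1):
        if is_rel:
            hits += 1
            total += hits / rank
    return total / total_relevant if total_relevant else 0.0


def mean_average_precision(runs):
    """runs: one (relevance, total_relevant) pair per query."""
    return sum(average_precision(rel, tot) for rel, tot in runs) / len(runs)
```

For the ranked list relevant, irrelevant, relevant, irrelevant with two relevant faces in total, the precision at the two relevant ranks is 1 and 2/3, so the average precision is 5/6.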
4.5    Results

4.5.1  Performance Comparison with Existing Approaches

   We compared our proposed method with other existing approaches.

  • Text Based Baseline (TBL): Once faces corresponding to images whose
    captions contain the query name are returned, they are ranked in time
    order. This is a rather naive method in which no prior knowledge of the
    association between names and faces is used.

  • Distance-Based Outlier (DBO): We adopted the idea of distance-based
    outlier detection for ranking [14]. Given a threshold $d_{min}$, for each
    point p, we counted the number of points q such that
    dist(p, q) ≤ $d_{min}$, where dist(p, q) is the Euclidean distance
    between p and q in the feature space mentioned in section 4.2. This
    number was then used as the score to rank faces. We selected a range of
    $d_{min}$ values for experiments: $d_{min}$ = 10, 15, 20, ..., 90.

  • Densest Sub-Graph based Method (DSG): We re-implemented the densest
    sub-graph based method [16] for ranking. Once the densest subgraph was
    found after an edge elimination process, we counted the number of
    surviving edges of each node (i.e., face) and used this number as the
    ranking score. To form the graph, the Euclidean distance dist(p, q) was
    used to assign the weight of the edge between node p and node q. DSG
    requires a threshold $\theta$ to convert the weighted graph to a binary
    graph before searching for the densest subgraph. We selected a range of
    $\theta$ values that are the same as the values used in DBO:
    $\theta$ = 10, 15, 20, ..., 90.

  • Local Density Score (LDS): This is the first stage of our proposed
    method. It requires the input value k to compute the local density score.
    Since we do not know the number of faces returned by text-based search
    engines in advance, we used another input value, fraction, defined as the
    fraction of neighbors, and estimated k by the formula
    k = fraction × N, where N is the number of returned faces. We used a
    range of fraction values for experiments:
    fraction = 5%, 10%, 15%, ..., 50%. For a large number of returned faces,
    we capped k at the maximum value of 200: k = 200.

  • Unsupervised Ensemble Learning Using Local Density Score (UEL-LDS): This
    combines ranking by local density scores with a second stage in which the
    ranked list is used to train a classifier that boosts the ranked list.

  • Supervised Learning (SVM-SUP): We randomly selected a portion p of the
    data with annotations to train the classifier, and then used this
    classifier to re-rank the remaining faces. This process was repeated five
    times and the average performance was reported. We used a range of
    portion p values for experiments: p = 1%, 2%, 3%, ..., 5%.

   Figure 7. Performance comparison of methods. Due to different settings,
   performances are superimposed for better evaluation.

   Figure 7 shows a performance comparison of these methods. Our proposed
methods (LDS and UEL-LDS) outperform other unsupervised methods such as TBL,
DBO and DSG. Furthermore, the performance of the DBO and DSG methods is
sensitive to the distance threshold, while the performance of our proposed
method is less sensitive. This confirms that the similarity measure using
shared nearest neighbors is reliable for estimating the local density score.
The performance of UEL-LDS is only slightly better than that of LDS since the
training sets labeled automatically from the ranked list are noisy. However,
UEL-LDS improves significantly even when the performance of LDS is poor.
These performances are worse than that of SVM-SUP using a small number of
labeled samples.

   Figure 8 shows an example of the top 50 faces ranked using the TBL, DBO,
DSG and LDS methods. The performance of DBO is poor since a low threshold is
used. This ranks irrelevant faces that are near duplicates (rows 2 and 3 in
Figure 8(b)) higher than relevant faces. The same explanation applies to DSG.

4.5.2  Performance of Ensemble Classifiers

   In Figure 9, we show the performance of five single classifiers and that
of five ensemble classifiers. The ensemble




classifier k is formed by combining single classifiers from 1 to k. Figure 9
clearly indicates that the ensemble classifier is more stable than the single
weak classifiers.

     Method        Precision at top 20 (%)   Recall (%)   Precision (%)
     GoogleSE              79.33               100.00         57.08
     UEL-LDS               89.00                72.50         76.41
     SVM-SUP-05            85.00                73.14         76.46
     SVM-SUP-10            90.67                74.94         78.30

   Table 1. Comparison of different methods on the new test set returned by
   the Google Image Search Engine.

4.5.3  New Face Annotation

   We conducted another experiment to show the effectiveness of our approach,
in which learned models are used to annotate new faces from other databases.
We used each name in the list as a query to obtain the top 500 images from
the Google Image Search Engine (GoogleSE). Next, these images were processed
using the steps described in section 4.2: extracting faces, detecting eyes
and normalizing. We projected these faces onto the PCA subspace trained for
that name and used the learned model to re-rank the faces.
   There were 4,103 faces (including false positives, i.e., non-faces
detected as faces) detected in the 7,500 returned images. We manually labeled
these faces and found 2,342 relevant faces. On average, the accuracy of
GoogleSE is 57.08%.
   In Table 1, we compare the performance of the methods. The performance of
UEL-LDS was obtained by running the best system, which is shown as the peak
of the UEL-LDS curve in Figure 7. The performances of SVM-SUP-05 and
SVM-SUP-10 correspond to the supervised systems (cf. section 4.5.1) that used
p = 5% and p = 10% of the data set, respectively. We evaluated the
performance by calculating the precision at the top 20 returned faces, which
is common for image search engines, and the recall and precision on all
detected faces of the test set. UEL-LDS achieved performance comparable to
the supervised methods and outperformed the baseline GoogleSE. The precision
at the top 20 of SVM-SUP-05 is poorer than that of UEL-LDS due to the small
number of training samples. Figure 10 shows the top 20 faces ranked using
these two methods.

5 Discussion

   Our approach works fairly well for well-known people, where the main
assumption that text-based search engines return a large fraction of relevant
images is satisfied. Figure 12 shows an example where this assumption is
broken. Consequently, as shown in Figure 13, the model learned from this set
performed poorly in recognizing new faces returned by GoogleSE. Our approach
relies solely on the above assumption; therefore, it is not affected by the
ranking of text-based search engines.
   The iteration of bagging SVM classifiers does not guarantee a significant
improvement in performance. The aim of our future work is to study how to
improve the quality of the training sets used in this iteration.

6 Conclusion

   We presented a method for ranking faces retrieved using text-based
correlation methods in searches for a specific person. This method learns the
visual consistency among faces in a two-stage process. In the first stage, a
relative density score is used to form a ranked list in which faces ranked at
the top or bottom of the list are likely to be relevant or irrelevant faces,
respectively. In the second stage, a bagging framework is used to combine
weak classifiers trained on subsets labeled from the ranked list into a
strong classifier. This strong classifier is then applied to the original set
to re-rank faces on the basis of the output probabilistic scores. Experiments
on various face sets showed the effectiveness of this method. Our approach is
beneficial when there are several faces in a returned image, as shown in
Figure 11.

References

 [1] O. Arandjelovic and A. Zisserman. Automatic face recognition for film
     character retrieval in feature-length films. In Proc. Intl. Conf. on
     Computer Vision and Pattern Recognition, volume 1, pages 860–867, 2005.
 [2] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski. Face recognition by
     independent component analysis. IEEE Transactions on Neural Networks,
     13(6):1450–1464, Nov 2002.
 [3] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who's in the
     picture? In Advances in Neural Information Processing Systems, 2004.
 [4] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh,
     E. G. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In
     Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2,
     pages 848–854, 2004.
 [5] D. Bolme, R. Beveridge, M. Teixeira, and B. Draper. The CSU face
     identification evaluation system: Its purpose, features and structure.
     In International Conference on Vision Systems, pages 304–311, 2003.
 [6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.




 [7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying
     density-based local outliers. In Proc. ACM SIGMOD Int. Conf. on
     Management of Data (SIGMOD), pages 93–104, 2000.
 [8] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector
     machines, 2001. Software available at
     http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 [9] M. Charikar. Greedy approximation algorithms for finding dense
     components in a graph. In APPROX '00: Proceedings of the Third
     International Workshop on Approximation Algorithms for Combinatorial
     Optimization, pages 84–95. Springer-Verlag, 2000.
[10] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different
     sizes, shapes, and densities in noisy high dimensional data. In SIAM
     International Conference on Data Mining, pages 47–58, 2003.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm
     for discovering clusters in large spatial databases with noise. In Proc.
     ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD),
     pages 226–231, 1996.
[12] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for
     google images. In Proc. Intl. European Conference on Computer Vision,
     volume 1, pages 242–256, 2004.
[13] M. Kendall. Rank Correlation Methods. Charles Griffin Company Limited,
     1948.
[14] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers:
     Algorithms and applications. VLDB Journal: Very Large Data Bases,
     8(3-4):237–253, 2000.
[15] L.-J. Li, G. Wang, and L. Fei-Fei. Optimol: automatic online picture
     collection via incremental model learning. In Proc. Intl. Conf. on
     Computer Vision and Pattern Recognition, volume 2, pages 1–8, 2007.
[16] D. Ozkan and P. Duygulu. A graph based approach for naming faces in news
     photos. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition,
     volume 2, pages 1477–1482, 2006.
[17] J. Platt. Probabilistic outputs for support vector machines and
     comparison to regularized likelihood methods. In Advances in Large
     Margin Classifiers, pages 61–74, 1999.
[18] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc.
     Intl. Conf. on Computer Vision and Pattern Recognition, 1991.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of
     simple features. In Proc. Intl. Conf. on Computer Vision and Pattern
     Recognition, volume 1, pages 511–518, 2001.
[20] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for
     multi-class classification by pairwise coupling. Journal of Machine
     Learning Research, 5:975–1005, 2004.
[21] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face
     recognition: A literature survey. ACM Computing Surveys, 35(4):399–458,
     2003.

   (a) TBL: 11 irrelevant faces
   (b) DBO: 17 irrelevant faces
   (c) DSG: 18 irrelevant faces
   (d) LDS: 4 irrelevant faces

   Figure 8. Top 50 faces ranked by the methods TBL, DBO, DSG and LDS.
   Irrelevant faces are marked with a star.




   Figure 9. Performance of the ensemble classifiers and single classifiers.

   (a) 5 irrelevant faces
   (b) no irrelevant faces

   Figure 10. Top 20 faces ranked by Google Image Search Engine (a) and
   ranked using our learned model (b). Irrelevant faces are marked with a
   star.

   Figure 11. Image returned by GoogleSE for query 'Gerhard Schroeder'.
   GoogleSE was unable to accurately identify who the queried person was,
   while the learned model of our approach accurately identified him.

   Figure 12. Example in which the portion of relevant faces is dominant, but
   it is difficult to group all these faces into one cluster due to large
   facial variations. In feature space, the largest cluster formed from
   relevant faces is not the largest cluster among those formed from all
   returned faces. Irrelevant faces are marked with a star.

   Figure 13. Many irrelevant faces annotated using the model learned from
   the data set shown in Figure 12. Irrelevant faces are marked with a star.
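As a supplement to the baseline descriptions in Section 4.5.1, the following Python sketch shows one way to compute the distance-based outlier (DBO) score and the neighborhood size k used by LDS. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, a point is not counted as its own neighbor, and fraction × N is truncated to an integer.

```python
import math


def dbo_scores(points, d_min):
    """Score each point by the number of other points within d_min.

    points: sequence of equal-length coordinate tuples (feature vectors).
    Distances are Euclidean, as in the DBO description.
    """
    scores = []
    for i, p in enumerate(points):
        count = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= d_min
        )
        scores.append(count)
    return scores


def rank_by_score(scores):
    """Indices sorted by descending score; ties keep their input order."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def estimate_k(fraction, n_faces, k_max=200):
    """k = fraction * N, capped at k_max for large result sets."""
    return min(k_max, max(1, int(fraction * n_faces)))
```

A point that lies far from all others receives a score of zero and is ranked last, which is the behavior the baseline relies on. The double loop is quadratic in the number of faces.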




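The figures reported in Section 4.5.3 can be checked with a small amount of arithmetic. The sketch below, with hypothetical function names, computes precision at the top k positions of a ranked list and reproduces the GoogleSE baseline precision from the counts given in the text (2,342 relevant faces among 4,103 detected faces).

```python
def precision_at_k(relevance, k=20):
    """Fraction of relevant items among the first k of a ranked list."""
    top = relevance[:k]
    return sum(bool(r) for r in top) / len(top) if top else 0.0


def overall_precision(n_relevant, n_detected):
    """Precision over all detected faces when every face is returned."""
    return n_relevant / n_detected


# Counts taken from Section 4.5.3.
baseline = overall_precision(2342, 4103)
print(f"GoogleSE precision: {100 * baseline:.2f}%")
```

Because the baseline returns every detected face, its recall is 100% and its precision equals the fraction of relevant faces in the test set.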

 
IRJET- Emotion Classification of Human Face Expressions using Transfer Le...
IRJET-  	  Emotion Classification of Human Face Expressions using Transfer Le...IRJET-  	  Emotion Classification of Human Face Expressions using Transfer Le...
IRJET- Emotion Classification of Human Face Expressions using Transfer Le...
 
Recognizing Celebrity Faces in Lot of Web Images
Recognizing Celebrity Faces in Lot of Web ImagesRecognizing Celebrity Faces in Lot of Web Images
Recognizing Celebrity Faces in Lot of Web Images
 
Iaetsd efficient retrieval of face image from
Iaetsd efficient retrieval of face image fromIaetsd efficient retrieval of face image from
Iaetsd efficient retrieval of face image from
 
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATIONMULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
MULTI-LEVEL FEATURE FUSION BASED TRANSFER LEARNING FOR PERSON RE-IDENTIFICATION
 
IRJET- Semantic Retrieval of Trademarks based on Text and Images Conceptu...
IRJET-  	  Semantic Retrieval of Trademarks based on Text and Images Conceptu...IRJET-  	  Semantic Retrieval of Trademarks based on Text and Images Conceptu...
IRJET- Semantic Retrieval of Trademarks based on Text and Images Conceptu...
 
Image based search engine
Image based search engineImage based search engine
Image based search engine
 
Comparison of Various Web Image Re - Ranking Techniques
Comparison of Various Web Image Re - Ranking TechniquesComparison of Various Web Image Re - Ranking Techniques
Comparison of Various Web Image Re - Ranking Techniques
 
Ijetr042148
Ijetr042148Ijetr042148
Ijetr042148
 
Face Recognition for Human Identification using BRISK Feature and Normal Dist...
Face Recognition for Human Identification using BRISK Feature and Normal Dist...Face Recognition for Human Identification using BRISK Feature and Normal Dist...
Face Recognition for Human Identification using BRISK Feature and Normal Dist...
 
Paper id 25201471
Paper id 25201471Paper id 25201471
Paper id 25201471
 

Le Satoh Unsupervised Face Annotation Icdm08

  • 1. 2008 Eighth IEEE International Conference on Data Mining Unsupervised Face Annotation by Mining the Web Duy-Dinh Le Shin’ichi Satoh National Institute of Informatics National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku 2-1-2 Hitotsubashi, Chiyoda-ku Tokyo, JAPAN 101-8430 Tokyo, JAPAN 101-8430 ledduy@nii.ac.jp satoh@nii.ac.jp Abstract by providing his or her name. Most current search engines use the text associated with images and video as significant Searching for images of people is an essential task for clues for returning results. However, other un-queried faces image and video search engines. However, current search and names may appear with the queried ones (Figure 1), and engines have limited capabilities for this task since they rely this significantly lowers the retrieval performance. One way on text associated with images and video, and such text to improve the retrieval performance is to take into account is likely to return many irrelevant results. We propose a visual information present in the retrieved faces. This task method for retrieving relevant faces of one person by learn- is challenging for the following reasons: ing the visual consistency among results retrieved from text- correlation-based search engines. The method consists of • Large variations in facial appearance due to pose two steps. In the first step, each candidate face obtained changes, illumination conditions, occlusions, and fa- from a text-based search engine is ranked with a score that cial expressions make face recognition difficult even measures the distribution of visual similarities among the with state-of-the-art techniques [1, 21, 2] (see example faces. Faces that are possibly very relevant or irrelevant are in Figure 2). ranked at the top or bottom of the list, respectively. 
The sec- • The fact that the retrieved face set consists of faces of ond step improves this ranking by treating this problem as a several people with no labels makes supervised and un- classification problem in which input faces are classified as supervised learning methods inapplicable. ’person-X’ or ’non-person-X’; and the faces are re-ranked according to their relevant score inferred from the classi- We propose a method for solving the above problem. fier’s probability output. To train this classifier, we use a The main idea is to assume that there is visual consistency bagging-based framework to combine results from multiple among the results returned from text-based search engines weak classifiers trained using different subsets. These train- and this visual consistency is then learned through an in- ing subsets are extracted and labeled automatically from teractive process. This method consists of two stages. In the rank list produced from the classifier trained from the the first stage, we explore the local density of faces to iden- previous step. In this way, the accuracy of the ranked list tify potential candidates for relevant faces1 and irrelevant increases after a number of iterations. Experimental results faces2 . This stage reflects the fact that the facial images of on various face sets retrieved from captions of news photos the queried person tend to form dense clusters, whereas ir- show that the retrieval performance improved after each it- relevant facial images are sparse since they look different eration, with the final performance being higher than those from each other. For each face, we define a score to mea- of the existing algorithms. sure the density of its neighbor set. This score is used to form a ranked list, in which faces with high-density scores are considered relevant and are put at the top. 1. Introduction The above ranking method is weak since dense clusters have no guarantee of containing relevant faces. 
Therefore, With the rapid growth of digital technology, large image a second stage is necessary to improve this ranked list. We and video databases have become more available than ever model this problem as a classification problem in which in- to users. This trend has shown the need for effective and ef- put faces are classified as person-X (the queried person) ficient tools for indexing and retrieving based on visual con- 1 faces related to the queried person. tent. A typical application is searching for a specific person 2 faces unrelated to the queried person. 1550-4786/08 $25.00 © 2008 IEEE 383 DOI 10.1109/ICDM.2008.47
  • 2. Figure 2. Large variations in facial expres- sions, poses, illumination conditions and oc- clusions making face recognition difficult. Best viewed in color. • The bagging framework helps to leverage noises in the unsupervised labeling process. Our contribution is two-fold: Figure 1. A news photo and its caption. Ex- • We propose a general framework to boost the face re- tracted faces are shown on the top. These trieval performance of text-based search engines by vi- faces might be returned for the query of sual consistency learning. The framework seamlessly person-Bush. integrates data mining techniques such as supervised learning and unsupervised learning based on bagging. Our framework requires only a few parameters and works stably. or non-person-X (the un-queried person). The faces are ranked according to a relevancy score that is inferred from • We demonstrate its feasibility with a practical web the classifier’s probability output. Since annotation data is mining application. A comprehensive evaluation on a not available, the rank list from the previous step is used to large face dataset of many people was carried out and assign labels for a subset of faces. This subset is then used confirmed that our approach is promising. to train a classifier using supervised methods such as sup- port vector machines (SVM). The trained classifier is used to re-rank faces in the original input set. This step is re- 2. Related Work peated a number of times to get the final ranked list. Since automatically assigning labels from the ranked list is not re- There are several approaches for re-ranking and learn- liable, the trained classifiers are weak. To obtain the final ing models from web images. Their underlying assump- strong classifier, we use the idea of ensemble learning [6] in tion is that text-based search engines return a large frac- which weak classifiers trained on different subsets are com- tion of relevant images. 
The challenge is how to model bined to improve the stability and classification accuracy of what is common in the relevant images. One approach single classifiers. The learned classifier can be further used is to model this problem in a probabilistic framework in for recognizing new facial images of the queried person. which the returned images are used to learn the parame- The second stage improves the ranked list and recogni- ters of the model. For examples, as described by Fergus et tion performance for the following reasons: al. [12], objects retrieved using an image search engine are re-ranked by extending the constellation model. Another • Supervised learning methods, such as SVM, provide proposal, described in [15], uses a non-parametric graphi- a strong theoretical background for finding the opti- cal model and an interactive framework to simultaneously mal decision boundary even with noisy data. Further- learn object class models and collect object class datasets. more, recent studies [20, 17] suggest that SVM clas- The main contribution of these approaches is probabilistic sifiers provide probability outputs that are suitable for models that can be learned with a small number of training ranking. images. However, these models are complicated since they 384
  • 3. require several hundred parameters for learning and are sus- 3 Proposed Framework ceptible to over-fitting. Furthermore, to obtain robust mod- els, a small amount of supervision is required to select seed Given a set of images returned by any text-based search images. engine for a queried person (e.g. ’George Bush’), we per- Another study [4, 3] proposed a clustering-based method form a ranking process and learning of person X’s model for associating names and faces in news photos. To solve as follows: the problem of ambiguity between several names and one • Step 1: Detect faces and eye positions, and then per- face, a modified k-means clustering process was used in form face normalizations. which faces are assigned to the closest cluster (each clus- ter corresponding to one name) after a number of iterations. • Step 2: Compute an eigenface space and project the Although the result was impressive, it is not easy to apply it input faces into this subspace. to our problem since it is based on a strong assumption that requires a perfect alignment when a news photo only has • Step 3: Estimate the ranked list of these faces using one face and its caption only has one name. Furthermore, Rank-By-Local-Density-Score. a large number of irrelevant faces (more than 12%) have to be manually eliminated before clustering. • Step 4: Improve this ranked list using Rank-By- Bagging-ProbSVM. A graph-based approach was proposed by Ozkan and Duygulu [16], in which a graph is formed from faces as Steps 1 and 2 are typical for any face processing system, nodes, and the weights of edges linked between nodes are and they are described in section 4.2. The algorithms used the similarity of faces, is closely related to our problem. in Steps 3 and 4 are described in section 3.1 and section 3.2, Assuming that the number of faces of the queried person is respectively. Figure 3 illustrates the proposed framework. 
larger than that of others and that these faces tend to form the most similar subset among the set of retrieved faces, 3.1 Ranking by Local Density Score this problem is considered equal to the problem of finding the densest subgraph of a full graph; and can therefore be solved by taking an available solution [9]. Although, exper- imental results showed the effectiveness of this method, it is still questionable whether the densest subgraph intuitively describes most of the relevant faces of the queried person and it is easy to extend for the ranking problem. Further- more, choosing an optimal threshold to convert the initial graph into a binary one is difficult and rather ad hoc due to the curse of dimensionality. An advantage of the methods [4, 3, 16] is they are fully unsupervised. However, a disadvantage is that no model is learned for predicting new images of the same category. Furthermore, they are used for performing hard categoriza- Figure 4. An example of faces retrieved for tion on input images that are in applicable for re-ranking. person-Donald Rumsfeld. Irrelevant faces The balance of recall and precision was not addressed. Typ- are marked with a star. Irrelevant faces might ically, these approaches tend to ignore the recall to obtain form several clusters, but the relevant faces high precision. This leads to the reduction in the number of form the largest cluster. collected images. Our approach combines a number of advances over the existing approaches. Specifically, we learn a model for each Among the faces retrieved by text-based search engines query from the returned images for purposes such as re- for a query of person-X, as shown in Figure 4, relevant ranking and predicting new images. However, we used an faces usually look similar and form the largest cluster. 
One unsupervised method to select training samples automati- approach of re-ranking these faces is to cluster based on vi- cally, which is different from the methods proposed by Fer- sual similarity. However, to obtain ideal clustering results is gus et al. and Li et al. [12, 15]. This unsupervised method impossible since these faces are high dimensional data and is different from the one by Ozkan and Duygulu [16] in the the clusters are in different shapes, sizes, and densities. In- modeling of the distribution of relevant images. We use stead, a graph-based approach was proposed by Ozkan and density-based estimation rather than the densest graph. Duygulu [16] in which the nodes are faces and edge weights 385
  • 4. Figure 3. The proposed framework for re-ranking faces returned by text-based search engines. are the similarities between two faces. With the observation Algorithm 1: Rank-By-Local-Density-Score that the nodes (faces) of the queried person are similar to Step 1: For each face p, compute LDS(p, k), each other and different from other nodes in the graph, the where k is the number of neighbors of p densest component of the full graph the set of highly con- and is the input of the ranking process. nected nodes in the graph will correspond to the face of the Step 2: Rank these faces using LDS(p, k) queried person. The main drawback of this approach is it (The higher the score the more relevant). needs a threshold to convert the initial weighted graph to a binary graph. Choosing this threshold in high dimensional spaces is difficult since different persons might have differ- 3.2 Ranking by Bagging of SVM Classi- ent optimal thresholds. fiers We use the idea of density-based clustering described by Ester et al. and Breunig et al. [11, 7] to solve this problem. One limitation of the local density score based ranking Specifically, we define the local density score (LDS) of a is it cannot handle faces of another person strongly associ- point p (i.e. a face) as the average distance to its k-nearest ated in the k-neighbor set (for example, many duplicates). neighbors. Therefore, another step is proposed for handling this case. distance(p, q) As a result, we have a model that can be used for both re- q∈R(p,k) LDS(p, k) = ranking current faces and predicting new incoming faces. k The main idea is to use a probabilistic model to measure where R(p, k) is the set of k - neighbors of p, and the relevancy of a face to person-X, P (person − X|f ace). distance(p, q) is the similarity between p and q. 
Since the labels are not available for training, we use the Since faces are represented in high dimensional feature input rank list found from the previous step to extract a sub- space, and face clusters might have different sizes, shapes, set of faces lying at the top and bottom of the ranked list to and densities, we do not directly use the Euclidean distance form the training set. After that, we use SVM with prob- between two points in this feature space for distance(p, q). abilistic output [17] implemented in LibSVM [8] to learn Instead, we use another similarity measure defined by the the person-X model. This model is applied to faces of the number of shared neighbors between two points. The effi- original set, and the output probabilistic scores are used to ciency of this similarity measure for density-based cluster- re-rank these faces. Since it is not guaranteed that faces ly- ing methods was described in [10]. ing at two ends of the input rank list correctly correspond to |R(q, k) ∩ R(p, k)| the faces of person-X and faces of non person-X, we adopt distance(p, q) = the idea of a bagging framework [6] in which randomly se- k lecting subsets to train weak classifiers, and then combining Therefore these classifiers help reduce the risk of using noisy training q∈R(p,k) |R(q, k) ∩ R(p, k)| sets. LDS(p, k) = k2 The details of the Rank-By-Bagging-ProbSVM- A high value of LDS(p, k) indicates a strong association InnerLoop method, improving an input rank list by between p and its neighbors. Therefore, we can use this combining weak classifiers trained from subsets annotated local density score to rank faces. Faces with higher scores by that rank list are described in Algorithm 2. are considered to be potential candidates that are relevant to Given an input ranked list, Rank-By-Bagging-ProbSVM- person-X, while faces with lower scores are considered as InnerLoop is used to improve this list. 
We repeat the process outliers and thus are potential candidates for non-person-X. a number of times whereby the ranked list output from the Algorithm 1 describes these steps. previous step is used as the input ranked list of the next 386
step. In this way, the iterations significantly improve the final ranked list. The details are described in Algorithm 3.

Algorithm 2: Rank-By-Bagging-ProbSVM-InnerLoop
  Step 1: Train a weak classifier, $h_i$.
    Step 1.1: Select a set $S_{pos}$ including p% of top ranked faces and then randomly select a subset $S^*_{pos}$ from $S_{pos}$. Label faces in $S^*_{pos}$ as positive samples.
    Step 1.2: Select a set $S_{neg}$ including p% of bottom ranked faces and then randomly select a subset $S^*_{neg}$ from $S_{neg}$. Label faces in $S^*_{neg}$ as negative samples.
    Step 1.3: Use $S^*_{pos}$ and $S^*_{neg}$ to train a weak classifier, $h_i$, using LibSVM [8] with probability outputs.
  Step 2: Compute ensemble classifier $H_i = \sum_{j=1}^{i} h_j$.
  Step 3: Apply $H_i$ to the original face set and form the rank list, $Rank_i$, using the output probabilistic scores.
  Step 4: Repeat steps 1 to 3 until $Dist2RankList(Rank_{i-1}, Rank_i) \le \epsilon$.
  Step 5: Return $H_i = \sum_{j=1}^{i} h_j$.

Algorithm 3: Rank-By-Bagging-ProbSVM-OuterLoop
  Step 1: $Rank_{cur}$ = Rank-By-Bagging-ProbSVM-InnerLoop($Rank_{prev}$).
  Step 2: $dist = Dist2RankList(Rank_{prev}, Rank_{cur})$.
  Step 3: $Rank_{final} = Rank_{cur}$.
  Step 4: $Rank_{prev} = Rank_{cur}$.
  Step 5: Repeat steps 1 to 4 until $dist \le \epsilon$.
  Step 6: Return $Rank_{final}$.

To determine the number of iterations of Rank-By-Bagging-ProbSVM-InnerLoop and Rank-By-Bagging-ProbSVM-OuterLoop, we use the Kendall tau distance [13], which is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are. The Kendall tau distance between two lists, $\tau_1$ and $\tau_2$, is defined as follows:

$$K(\tau_1, \tau_2) = \sum_{(i,j) \in P} K_{i,j}(\tau_1, \tau_2)$$

where $P$ is the set of unordered pairs of distinct elements in $\tau_1$ and $\tau_2$. $K_{i,j}(\tau_1, \tau_2) = 0$ if $i$ and $j$ are in the same order in $\tau_1$ and $\tau_2$, and $K_{i,j}(\tau_1, \tau_2) = 1$ if $i$ and $j$ are in the opposite order in $\tau_1$ and $\tau_2$.

Since the maximum value of $K(\tau_1, \tau_2)$ is $N(N-1)/2$, where $N$ is the number of members of the list, the normalized Kendall tau distance can be written as follows:

$$K_{norm}(\tau_1, \tau_2) = \frac{K(\tau_1, \tau_2)}{N(N-1)/2}$$

Using this measure as the stopping criterion for the loops means that once the ranked list no longer changes significantly between iterations, it is reasonable to stop.

4 Experiments

4.1 Dataset

We used the dataset described by Berg et al. [4] for our experiments. This dataset consists of approximately half a million news photos and captions from Yahoo News collected over a period of roughly two years. This dataset is better than datasets collected from image search engines such as Google, which usually limit the total number of returned images to 1,000. Furthermore, it has annotations that are valuable for evaluating methods. Note that these annotations are used for evaluation purposes only. Our method is fully unsupervised, so it assumes the annotations are not available at running time.

Only frontal faces were considered since current frontal face detection systems [19] work in real time and have accuracies exceeding 95%. In total, 44,773 faces were detected and normalized to the size of 86×86 pixels.

We selected fifteen government leaders, including George W. Bush (US), Vladimir Putin (Russia), Jiang Zemin (China), Tony Blair (UK), Junichiro Koizumi (Japan), Roh Moo-hyun (Korea), Abdullah Gul (Turkey), and other key individuals, such as John Paul II (the former Pope) and Hans Blix (UN), because their images frequently appear in the dataset [16]. Variations in each person's name were collected. For example, George W. Bush, President Bush, U.S. President, etc., all refer to the current U.S. president.

We performed a simple string search in captions to check whether a caption contained one of these names. The faces extracted from the image associated with a matching caption were returned. The faces retrieved from the different name queries were merged into one set and used as input for ranking.

Figure 5 shows the distribution of retrieved faces from this method and the corresponding number of relevant faces for these fifteen individuals. In total, 5,603 faces were retrieved, of which 3,374 were relevant. On average, the accuracy was 60.22%.

4.2 Face Processing

We used an eye detector to detect the positions of the eyes of the detected faces. The eye detector, built with the same approach as that of Viola and Jones [19], had an accuracy of more than 95%. If the eye positions were not detected, predefined eye locations were assigned. The eye positions were used to align faces to a predefined canonical pose.

To compensate for illumination effects, the subtraction of the best-fit brightness plane followed by histogram equalization was applied. This normalization process is shown in
Figure 6.

We then used principal component analysis [18] to reduce the number of dimensions of the feature vector for face representation. Eigenfaces were computed from the original face set returned using the text-based query method. The number of eigenfaces used to form the eigenspace was selected so that 97% of the total energy was retained [5]. The number of dimensions of these feature spaces ranged from 80 to 500.

Figure 5. Distribution of retrieved faces and relevant faces of 16 individuals used in experiments. Due to space limitations, bars corresponding to George Bush (2,282 vs. 1,284) and Tony Blair (682 vs. 323) were cut off at the upper limit of the graph.

Figure 6. Face normalization. (top) faces with detected eyes, (bottom) faces after normalization process.

4.3 Evaluation Criteria

We evaluated the retrieval performance with measures that are commonly used in information retrieval, such as precision, recall, and average precision. Given a queried person and letting $N_{ret}$ be the total number of faces returned, $N_{rel}$ the number of relevant faces among those returned, and $N_{hit}$ the total number of relevant faces, recall and precision can be calculated as follows:

$$Recall = \frac{N_{rel}}{N_{hit}}, \qquad Precision = \frac{N_{rel}}{N_{ret}}$$

Precision and recall are only used to evaluate the quality of an unordered set of retrieved faces. To evaluate ranked lists in which both recall and precision are taken into account, average precision is usually used. The average precision is computed by taking the average of the interpolated precision measured at the 11 recall levels of 0.0, 0.1, 0.2, ..., 1.0.

The interpolated precision $p_{interp}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \ge r$:

$$p_{interp}(r) = \max_{r' \ge r} p(r')$$

In addition, to evaluate the performance of multiple queries, we used mean average precision, which is the mean of the average precisions computed over the queries (see http://trec.nist.gov/pubs/trec10/appendices/measures.pdf).

4.4 Parameters

The parameters of our method include:

• $p$: the fraction of faces at the top and bottom of the ranked list that are used to form a positive set $S_{pos}$ and negative set $S_{neg}$ for training weak classifiers in Rank-By-Bagging-ProbSVM-InnerLoop. We empirically selected $p = 20\%$ (i.e., 40% of the samples in the rank list were used) since a larger $p$ will increase the number of incorrect labels, and a smaller $p$ will cause over-fitting. In addition, $S^*_{pos}$ consists of $0.7 \times |S_{pos}|$ samples that are selected randomly with replacement from $S_{pos}$. This sampling strategy is adopted from the bagging framework [6]. The same setting was used for $S^*_{neg}$.

• $\epsilon$: the maximum Kendall tau distance $K_{norm}(\tau_1, \tau_2)$ between two rank lists $\tau_1$ and $\tau_2$. This value is used to determine when the inner loop and the outer loop stop. We set $\epsilon = 0.05$ to balance accuracy and processing time. Note that a smaller $\epsilon$ requires more iterations, making the system slower.

• kernel: the kernel type used for the SVM. The default is a linear kernel, defined as $k(x, y) = x \cdot y$. We tested other kernel types such as RBF and polynomial, but the performance did not change much. Therefore, we used the linear kernel for simplicity.
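The stopping test based on the normalized Kendall tau distance and the sampling of the training subsets can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `normalized_kendall_tau` and `sample_training_sets`, the `ratio` and `rng` arguments, and the use of rounding to obtain integer set sizes are assumptions made for the example. Only the values $p = 20\%$, the factor 0.7, sampling with replacement, and $\epsilon = 0.05$ come from the text.

```python
import random
from itertools import combinations


def normalized_kendall_tau(rank_a, rank_b):
    """Fraction of item pairs that the two ranked lists order differently.

    Assumes both lists contain the same distinct items.
    """
    if len(rank_a) != len(rank_b) or set(rank_a) != set(rank_b):
        raise ValueError("both lists must rank the same items")
    n = len(rank_a)
    if n < 2:
        return 0.0
    pos_b = {item: idx for idx, item in enumerate(rank_b)}
    # combinations() yields pairs (x, y) with x ranked before y in rank_a,
    # so a disagreement is a pair where x comes after y in rank_b.
    disagreements = sum(
        1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y]
    )
    return disagreements / (n * (n - 1) / 2)


def sample_training_sets(ranked_faces, p=0.20, ratio=0.7, rng=None):
    """Draw bagging subsets from the top and bottom p of a ranked list.

    Hypothetical helper: set sizes are rounded to integers (an assumption).
    """
    rng = rng or random.Random()
    n_sel = max(1, round(p * len(ranked_faces)))
    s_pos = ranked_faces[:n_sel]    # top-ranked faces, treated as positive
    s_neg = ranked_faces[-n_sel:]   # bottom-ranked faces, treated as negative
    m = max(1, round(ratio * n_sel))
    # Sampling with replacement, as in the bagging framework.
    s_pos_star = [rng.choice(s_pos) for _ in range(m)]
    s_neg_star = [rng.choice(s_neg) for _ in range(m)]
    return s_pos_star, s_neg_star


def has_converged(rank_prev, rank_cur, epsilon=0.05):
    """Stopping test used by both loops: stop once the lists barely differ."""
    return normalized_kendall_tau(rank_prev, rank_cur) <= epsilon
```

The pairwise count is quadratic in the list length, which is acceptable for lists of a few thousand faces; a merge-sort based inversion count would be the usual choice for much longer lists.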
4.5 Results

4.5.1 Performance Comparison with Existing Approaches

We compared our proposed method with other existing approaches.

• Text Based Baseline (TBL): Once faces corresponding to images whose captions contain the query name are returned, they are ranked in time order. This is a rather naive method in which no prior knowledge between names and faces is used.

• Distance-Based Outlier (DBO): We adopted the idea of distance-based outlier detection for ranking [14]. Given a threshold $d_{min}$, for each point $p$, we counted the number of points $q$ such that $dist(p, q) \le d_{min}$, where $dist(p, q)$ is the Euclidean distance between $p$ and $q$ in the feature space mentioned in section 4.2. This number was then used as the score to rank faces. We selected a range of $d_{min}$ values for experiments: $d_{min}$ = 10, 15, 20, ..., 90.

• Densest Sub-Graph based Method (DSG): We re-implemented the densest sub-graph based method [16] for ranking. Once the densest subgraph was found after an edge elimination process, we counted the number of surviving edges of each node (i.e., face) and used this number as the ranking score. To form the graph, the Euclidean distance $dist(p, q)$ was used to assign the weight for the edge linking node $p$ and node $q$. DSG requires a threshold $\theta$ to convert the weighted graph to a binary graph before searching for the densest subgraph. We selected a range of $\theta$ values that are the same as the values used in DBO: $\theta$ = 10, 15, 20, ..., 90.

• Local Density Score (LDS): This is the first stage of our proposed method. It requires the input value $k$ to compute the local density score. Since we do not know the number of returned faces from text-based search engines, we used another input value $fraction$, defined as the fraction of neighbors, and estimated $k$ by the formula $k = fraction \times N$, where $N$ is the number of returned faces. We used a range of $fraction$ values for experiments: $fraction$ = 5%, 10%, 15%, ..., 50%. For a large number of returned faces, we capped $k$ at the maximum value of 200.

• Unsupervised Ensemble Learning Using Local Density Score (UEL-LDS): This combines the two stages: faces are first ranked by local density scores, and the ranked list is then used for training a classifier to boost the rank list.

• Supervised Learning (SVM-SUP): We randomly selected a portion $p$ of the data with annotations to train the classifier, and then used this classifier to re-rank the remaining faces. This process was repeated five times and the average performance was reported. We used a range of portion $p$ values for experiments: $p$ = 1%, 2%, 3%, ..., 5%.

Figure 7. Performance comparison of methods. Due to different settings, performances are superimposed for better evaluation.

Figure 7 shows a performance comparison of these methods. Our proposed methods (LDS and UEL-LDS) outperform other unsupervised methods such as TBL, DBO and DSG. Furthermore, the performance of the DBO and DSG methods is sensitive to the distance threshold, while the performance of our proposed method is less sensitive. This confirms that the similarity measure using shared nearest neighbors is reliable for estimating the local density score. The performance of UEL-LDS is only slightly better than that of LDS because the training sets labeled automatically from the ranked list are noisy. However, UEL-LDS improves significantly even when the performance of LDS is poor. These performances are worse than that of SVM-SUP using a small number of labeled samples.

Figure 8 shows an example of the top 50 faces ranked using the TBL, DBO, DSG and LDS methods. The performance of DBO is poor since a low threshold is used. This ranks irrelevant faces that are near duplicates (rows 2 and 3 in Figure 8(b)) higher than relevant faces. The same explanation applies to DSG.

4.5.2 Performance of Ensemble Classifiers

In Figure 9, we show the performance of five single classifiers and that of five ensemble classifiers. The ensemble
classifier k is formed by combining single classifiers from 1 to k. The figure clearly indicates that the ensemble classifier is more stable than single weak classifiers.

4.5.3 New Face Annotation

We conducted another experiment to show the effectiveness of our approach, in which learned models are used to annotate new faces from other databases. We used each name in the list as a query to obtain the top 500 images from the Google Image Search Engine (GoogleSE). Next, these images were processed using the steps described in section 4.2: extracting faces, detecting eyes and performing normalization. We projected these faces to the PCA subspace trained for that name and used the learned model to re-rank faces.

There were 4,103 faces (including false positives, i.e., non-faces detected as faces) detected from 7,500 returned images. We manually labeled these faces and found 2,342 relevant faces. On average, the accuracy of GoogleSE is 57.08%.

In Table 1, we compare the performance of the methods. The performance of UEL-LDS was obtained by running the best system, which is shown as the peak of the UEL-LDS curve in Figure 7. The performances of SVM-SUP-05 and SVM-SUP-10 correspond to the supervised systems (cf. section 4.5.1) that used p = 5% and p = 10% of the data set, respectively. We evaluated the performance by calculating the precision at the top 20 returned faces, which is common for image search engines, and the recall and precision on all detected faces of the test set. UEL-LDS achieved comparable performance to the supervised methods and outperformed the baseline GoogleSE. The precision at the top 20 of SVM-SUP-05 is poorer than that of UEL-LDS due to the small number of training samples. Figure 10 shows the top 20 faces ranked using these two methods.

  Method        Precision at top 20 (%)   Recall (%)   Precision (%)
  GoogleSE              79.33               100.00         57.08
  UEL-LDS               89.00                72.50         76.41
  SVM-SUP-05            85.00                73.14         76.46
  SVM-SUP-10            90.67                74.94         78.30

Table 1. Comparison of different methods on the new test set returned by Google Image Search Engine.

5 Discussion

Our approach works fairly well for well-known people, where the main assumption that text-based search engines return a large fraction of relevant images is satisfied. Figure 12 shows an example where this assumption is broken. Consequently, as shown in Figure 13, the model learned from this set performed poorly in recognizing new faces returned by GoogleSE. Our approach relies solely on the above assumption; therefore, it is not affected by the ranking of text-based search engines.

The iteration of bagging SVM classifiers does not guarantee a significant improvement in performance. The aim of our future work is to study how to improve the quality of the training sets used in this iteration.

6 Conclusion

We presented a method for ranking faces retrieved using text-based correlation methods in searches for a specific person. This method learns the visual consistency among faces in a two-stage process. In the first stage, a relative density score is used to form a ranked list in which faces ranked at the top or bottom of the list are likely to be relevant or irrelevant faces, respectively. In the second stage, a bagging framework is used to combine weak classifiers trained on subsets labeled from the ranked list into a strong classifier. This strong classifier is then applied to the original set to re-rank faces on the basis of the output probabilistic scores. Experiments on various face sets showed the effectiveness of this method. Our approach is beneficial when there are several faces in a returned image, as shown in Figure 11.

References

[1] O. Arandjelovic and A. Zisserman. Automatic face recognition for film character retrieval in feature-length films. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 860–867, 2005.
[2] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski. Face recognition by independent component analysis. IEEE Transactions on Neural Networks, 13(6):1450–1464, Nov 2002.
[3] T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who's in the picture? In Advances in Neural Information Processing Systems, 2004.
[4] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. G. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 848–854, 2004.
[5] D. Bolme, R. Beveridge, M. Teixeira, and B. Draper. The CSU face identification evaluation system: Its purpose, features and structure. In International Conference on Vision Systems, pages 304–311, 2003.
[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD), pages 93–104, 2000.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[9] M. Charikar. Greedy approximation algorithms for finding dense components in a graph. In APPROX '00: Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer-Verlag, 2000.
[10] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy high dimensional data. In SIAM International Conference on Data Mining, pages 47–58, 2003.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 226–231, 1996.
[12] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proc. Intl. European Conference on Computer Vision, volume 1, pages 242–256, 2004.
[13] M. Kendall. Rank Correlation Methods. Charles Griffin Company Limited, 1948.
[14] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3-4):237–253, 2000.
[15] L.-J. Li, G. Wang, and L. Fei-Fei. Optimol: automatic online picture collection via incremental model learning. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1–8, 2007.
[16] D. Ozkan and P. Duygulu. A graph based approach for naming faces in news photos. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1477–1482, 2006.
[17] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74, 1999.
[18] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, 1991.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Intl. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.
[20] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005, 2004.
[21] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.

Figure 8. Top 50 faces ranked by the methods TBL, DBO, DSG and LDS: (a) TBL, 11 irrelevant faces; (b) DBO, 17 irrelevant faces; (c) DSG, 18 irrelevant faces; (d) LDS, 4 irrelevant faces. Irrelevant faces are marked with a star.
Figure 9. Performance of the ensemble classifiers and single classifiers.

Figure 10. Top 20 faces ranked by Google Image Search Engine (a) and ranked using our learned model (b): (a) 5 irrelevant faces; (b) no irrelevant faces. Irrelevant faces are marked with a star.

Figure 11. Image returned by GoogleSE for query 'Gerhard Schroeder'. GoogleSE was unable to accurately identify who the queried person was, while the learned model of our approach accurately identified him.

Figure 12. Example in which the portion of relevant faces is dominant, but it is difficult to group all these faces into one cluster due to large facial variations. In feature space, the largest cluster formed from relevant faces is not the largest cluster among those formed from all returned faces. Irrelevant faces are marked with a star.

Figure 13. Many irrelevant faces annotated using the model learned from the data set shown in Figure 12. Irrelevant faces are marked with a star.