Project no. FP6-507752


                                      MUSCLE
                               Network of Excellence
        Multimedia Understanding through Semantics, Computation and LEarning




    Progress on Semi-Supervised Machine Learning
                     Techniques

                            Due date of deliverable: 31.09.2005
                            Actual submission date: 10.10.2005



    Start date of project: 1 March 2004                                  Duration: 48 Months



                                                                           Workpackage: 8
                                                                   Deliverable: D8.1c (4/5)

                                                                                  Editors:
                                                                      Pádraig Cunningham
                                                           Department of Computer Science
                                                                    Trinity College Dublin
                                                                          Dublin 2, Ireland

   Revision 1.0

      Project co-funded by the European Commission within the Sixth Framework Programme
                                           (2002-2006)
                                       Dissemination Level
    PU Public                                                                                  x
    PP Restricted to other programme participants (including the Commission Services)
    RE Restricted to a group specified by the consortium (including the Commission Services)
    CO Confidential, only for members of the consortium (including the Commission Services)


Keyword List:


© MUSCLE - Network of Excellence                                       www.muscle-noe.org
Contributing Authors
Nozha Boujemaa       INRIA- IMEDIA   nozha.boujemaa@inria.fr
Michel Crucianu      INRIA- IMEDIA   michel.crucianu@inria.fr
Pádraig Cunningham   TCD             Padraig.Cunningham@cs.tcd.ie
Nizar Grira          INRIA- IMEDIA   nizar.grira@inria.fr
Allan Hanbury        TU Wien         hanbury@prip.tuwien.ac.at
Shaul Markovitch     Technion        shaulm@cs.Technion.ac.il
Branislav Mičušík    TU Wien         micusik@prip.tuwien.ac.at




Contents

WP7: State of the Art                                         1
Contents                                                      3

1   Introduction                                              5

2   One-Class Classification Problems                         7
    2.1  One-class learning for image segmentation            8
         2.1.1  Segmentation                                 10
         2.1.2  Experiments                                  12
         2.1.3  Conclusion                                   14

3   Active Learning                                          15
    3.1  Active Learning                                     16
    3.2  The Source of the Examples                          17
         3.2.1  Example Construction                         17
         3.2.2  Selective Sampling                           17
    3.3  Example Evaluation and Selection                    18
         3.3.1  Uncertainty-based Sampling                   18
         3.3.2  Committee-based Sampling                     19
         3.3.3  Clustering                                   20
         3.3.4  Utility-Based Sampling                       20
    3.4  Conclusion                                          21

4   Semi-Supervised Clustering                               23
    4.1  Semi-supervised Clustering of Images                24
    4.2  Pairwise-Constrained Competitive Agglomeration      25
    4.3  Active Selection of Pairwise Constraints            26
    4.4  Experimental Evaluation                             28

5   General Conclusions                                      31

Bibliography                                                 33
Chapter 1

Introduction

A fundamental distinction in machine learning research is that between unsupervised and su-
pervised techniques. In supervised learning, the learning system is presented with labelled
examples and induces a model from this data that will correctly label other unlabelled data. In
unsupervised learning the data is unlabelled and all the learning system can do is organise or
cluster the data in some way. In processing multimedia data, some important real-world
problems have emerged that do not fall into these two categories but have some characteristics
of both. Some examples are:

   • One Class Classification Problems: abundant training data is available for one class of a
     binary classification problem but not the other (e.g. process monitoring, text classifica-
     tion).
   • Active Learning: training data is unlabelled but a ‘supervisor’ is available to label
     examples at the request of the learning system; the system must select examples for
     labelling that make best use of the supervisor’s effort (e.g. labelling images for image
     retrieval).
   • Semi-Supervised Clustering: It is often the case that background knowledge is available
     that can help in the clustering of data. Such extra knowledge can enable the discovery of
     small but significant bi-clusters in gene expression data. More generally, the use of back-
     ground knowledge avoids situations where the clustering algorithm is obviously ‘getting
     it wrong’.

     Since the Machine Learning (ML) research within the MUSCLE network is applications
driven, there is considerable interest among the network members in these research issues. This
report describes some initial research under each of these three headings, and we can expect a
good deal of future activity in this area that lies between unsupervised and supervised ML
techniques.
     In chapter 2 of this report we provide an overview of One-Class Classification problems
and describe an application in image segmentation where foreground/background separation
is treated in a ‘one-class’ framework. In chapter 3 an overview of Active Learning is presented;
this overview is particularly focused on sample selection, the core research issue in Active
Learning. Chapter 4 begins with an overview of Semi-Supervised Clustering and describes
some research from the network on the Semi-Supervised Clustering of images.

Chapter 2

One-Class Classification Problems

In traditional supervised classification problems, discriminating classifiers are trained using pos-
itive and negative examples. However, for a significant number of practical problems, counter-
examples are either rare, entirely unavailable or statistically unrepresentative. Such problems
include industrial process control, text classification and analysis of chemical spectra. One-
sided classifiers have emerged as a technique for situations where labelled data exists for only
one of the classes in a two-class problem. For instance, in industrial inspection tasks, abundant
data may only exist describing the process operating correctly. It is difficult to put together
training data describing the myriad of ways the system might operate incorrectly. A related
problem is where negative examples exist, but their distribution cannot be characterised. For
example, it is reasonable to provide characteristic examples of family pictures but impossible
to provide examples of pictures that are ‘typical’ of non-family pictures. In this project we will
implement a comprehensive toolkit for building one-sided classifiers.
     These one-sided classification problems arise in a variety of situations:

   • industrial inspection and process control

   • text classification

   • image labelling

    A basic technique for developing one-class classifiers is to use Principal Component Anal-
ysis (PCA) or Independent Component Analysis (ICA). Details on PCA and ICA can be found
in [65]. The basic principle is best explained with reference to Figure 2.1. With this approach,
PCA is performed on the labelled training data. Essentially this transforms the data into the new
space of principal components (PCs). These PCs will be ranked so that the first PC captures
most of the variation in the data and so forth. It is normal to drop the lesser PCs as most of the
information will be captured in the first two or three PCs. Figure 2.1 shows a cloud of exam-
ples described in a 3D space defined by the first 3 PCs. The different ellipsoids in the diagram
define regions bounded by 1σ, 2σ and 3σ in the PCs. A classifier could operate by classifying
any unlabeled example that falls within the 2σ ellipsoid as a positive example and everything
outside as a negative example.
    The problem with this is that PCA and ICA are unsupervised techniques for dimension
reduction. So PCs are being discarded because the data varies little in those dimensions; it is
nevertheless possible that those dimensions are discriminating. Such a classifier still provides
a baseline for comparison, as this PCA-based approach is commonly used in industry [71].

Figure 2.1: A graphic view of the PCA approach to one-sided classification
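To make the PCA envelope concrete, the scheme above can be sketched as follows. This is an
illustrative sketch only; the function names, the k-sigma test in the retained PC space and the
use of NumPy are our own assumptions, not part of the deliverable or of [71].

```python
import numpy as np

def fit_pca_envelope(X, n_components=3):
    """Fit PCA on positive-class data only; keep the top PCs and their spreads."""
    mean = X.mean(axis=0)
    Xc = X - mean
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]         # top principal components, column-wise
    sigma = (Xc @ components).std(axis=0)  # spread of the data along each kept PC
    return mean, components, sigma

def classify(x, mean, components, sigma, k=2.0):
    """Positive iff x falls inside the k-sigma ellipsoid in the retained PC space."""
    z = (x - mean) @ components
    return np.sum((z / sigma) ** 2) <= k ** 2
```

An unlabelled example outside the 2σ ellipsoid is rejected as negative; as noted above,
discriminative information in the discarded PCs is lost, so this serves only as a baseline.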


2.1     One-class learning for image segmentation
Image segmentation can be viewed as a partitioning of the image into regions having some
similar properties, e.g. colour, texture, shape, etc., or as a partitioning of the image into seman-
tically meaningful parts like people do. Moreover, measuring the goodness of segmentations in
general is an unsolved problem and obtaining absolute ground truth is almost impossible since
different people produce different manual segmentations of the same images [44].
    There are many papers dealing with automatic segmentation. We have to mention the well
known work of Shi & Malik [61] based on normalized cuts which segments the image into non-
overlapping regions. They introduced a modification of graph cuts, namely normalized graph
cuts, and provided an approximate closed-form solution. However, the boundaries of detected
regions often do not follow the true boundaries of the objects. The work [64] is a follow-up
to [61] where the segmentation is improved by doing it at various scales.
    One possibility to partially avoid the ill-posed problem of image segmentation is to use
additional constraints. Such constraints can be i) motion in the image caused either by camera
motion or by motion of objects in the scene [56, 75, 4], or ii) specifying the foreground object
properties [16, 52, 13].
    Here we concentrate on an easier task than fully automatic segmentation. We constrain the
segmentation by using a small user-provided template patch. We search for the segments of the
image coherent in terms of colour with the provided template. The texture of an input image


Figure 2.2: Supervised segmentation. (a) Original image; the top image shows the marked
place from which the template was cut. (b) The enlarged template patch. (c) Binary
segmentation with masked original image.

is taken into account to correctly detect boundaries of textured regions. The proposed method
makes it possible to learn semi-automatically what the template patch should look like in order
to be representative of an object, leading to a correct one-class segmentation.
     The proposed technique could be useful for segmentation and for detecting objects in
images with a characteristic, a priori known property defined through a template patch. Fig. 2.2
shows how a tiger can be detected in the image using a small template patch from the same
image. The same template patch can be used to detect the tiger in another image even though
lighting conditions are slightly different, as shown in Fig. 2.2.
     We follow the idea given in [16] of interactive segmentation where the user has to spec-
ify some pixels belonging to the foreground and to the background. Such labeled pixels give
a strong constraint for further segmentation based on the min-cut/max-flow algorithm given
in [17]. However, the method [16] was designed for greyscale images and thus most of the
information is thrown away. We improved the method in [47] to cope with colour and texture
images. However, both seeds for background and foreground objects were still needed. In this
work we avoid the need of background seeds and only seeds for the foreground object need to
be specified. This is therefore a one-class learning problem as we have a sample of what we
are looking for (foreground), but the characteristics of the rest of the image (background) are
unknown.
     In [79] the spatial coherence of the pixels together with standard local measurements (in-
tensity, colour) is handled. They propose an energy function that operates simultaneously in
feature space and in image space. Some forms of such an energy function are studied in [35]. In

our work we follow a similar strategy. However, we define the neighborhood relation through
brightness, colour and texture gradients introduced in [42, 45].
    In [52] an idea similar to ours is treated. In principle, the result is the same,
but the strategy to obtain it differs. The boundary of a textured foreground object is achieved
by minimization (through the evolution of the region contour) of energies inside and outside
the region. The Geodetic Active Region framework is used to propagate the region contour.
However, the texture information for the foreground has to be specified by the user. In [51] the
user interaction is omitted. At first the number of regions is estimated by fitting a mixture of
Gaussians on the intensity histogram and is then used to drive the region evolution. However,
such a technique cannot be used for textured images. One textured region can be composed of
many colours and therefore Gaussian components say nothing about the number of dominant
textures.
    Our main contribution lies in incorporating the information included in the template patch
into the graph representing the image, leading to a reasonable binary image segmentation. Our
method does not need seeds for both foreground and the background as in [16, 47]. Only
some representative template patch of the object being searched for is required, see Fig. 2.2.
Moreover, texture information is taken into account.
    This section is organised as follows. First, segmentation based on the graph cut algo-
rithm is outlined. Second, non-parametric computation of probabilities of points being fore-
ground/background through histograms and incorporating template patch information is de-
scribed. Finally, the results and summary conclude the section.


2.1.1    Segmentation
We use the seed segmentation technique [47] based on the interactive graph cut method [16].
The core of the segmentation method is an efficient algorithm [17] for finding the
min-cut/max-flow in a graph. We very briefly outline the construction of the graph
representing the image, which is then used for finding its min-cut, and describe in more detail
the stage of building per-pixel penalties for being foreground or background.


Graph representing the image
The general framework for building the graph is depicted in Fig. 2.3 (left). The graph is shown
for a 9 pixel image and an 8-point neighborhood N . For general images, the graph has as many
nodes as pixels plus two extra nodes labeled F, B. In addition, the pixel neighborhood is larger,
e.g. we use a window of size 21 × 21 pixels.
    For details on setting Wq,r , the reader is referred to [47, 48]. To fill this matrix the colour
and texture gradient introduced in [42, 45] producing a combined boundary probability image
is used. The reason is that our main emphasis is put on boundaries at the changes of different
textured regions and not local changes inside one texture. However, there usually are large
responses of edge detectors inside the texture. To detect boundaries in images correctly, the
colour changes and texturedness of the regions have to be taken into account as in [42, 45].
Figure 2.3: Left: graph representation for a 9-pixel image. Right: table defining the costs of
graph edges. λ is a constant weighting the importance of being foreground/background vs. the
mutual bond between pixels.

                        edge        cost        region
                        {q, r}      W_q,r       {q, r} ∈ N
                        {q, F}      λ R_F|q     ∀q
                        {q, B}      λ R_B|q     ∀q

    Each node in the graph is connected to the two extra nodes F, B. This allows the
incorporation of the information provided by the seed, and a penalty for each pixel being
foreground or background to be set. The penalty of a point q being foreground F or
background B is defined as

        R_F|q = − ln p(B|c_q),      R_B|q = − ln p(F|c_q),                   (2.1)
where c_q = (c_L, c_a, c_b) is a vector in R³ of CIELAB values at the pixel q. We use the
           CIELAB colour space as this results in better performance. This colour space has the advantage
           of being approximately perceptually uniform. Furthermore, Euclidean distances in this space
           are perceptually meaningful as they correspond to colour differences perceived by the human
           eye. Another reason for the good performance of this space could be that in calculating the
           colour probabilities below, we make the assumption that the three colour channels are statisti-
           cally independent. This assumption is better in the CIELAB space than in the RGB space, as
           the CIELAB space separates the brightness (on the L axis) and colour (on the a and b axes)
           information. The posterior probabilities are computed as

        p(B|c_q) = p(c_q|B) / ( p(c_q|B) + p(c_q|F) ),                       (2.2)

where the class-conditional probabilities are

        p(c_q|F) = f_L(c_L) · f_a(c_a) · f_b(c_b),   and   p(c_q|B) = b_L(c_L) · b_a(c_a) · b_b(c_b),

where f_{L,a,b}(i), resp. b_{L,a,b}(i), represents the foreground, resp. the background,
histogram of each colour channel separately at the i-th bin, smoothed by a Gaussian kernel.
    All histogram channels are smoothed using one-dimensional Gaussians, written as

        f̄_i = G Σ_{j=1..N} f_j exp( −(j−i)² / (2σ²) ),

where G is a normalization factor enforcing Σ_{i=1..N} f̄_i = 1. In our case, the number of
histogram bins is N = 64. We used σ = 1 since experiments showed that it is a reasonable
value. λ from the table in Fig. 2.3 was set to 1000.
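The smoothing and penalty computations of Eqs. (2.1)–(2.2) can be sketched as follows. This
is our own minimal NumPy sketch: the function names, the scaling of each channel to [0, 1]
and the epsilon guard are assumptions, not part of the original MATLAB/C implementation.

```python
import numpy as np

def smoothed_hist(values, n_bins=64, sigma=1.0):
    """Channel histogram smoothed by a 1-D Gaussian; G renormalises it to sum 1."""
    f, _ = np.histogram(values, bins=n_bins, range=(0.0, 1.0))
    i = np.arange(n_bins)
    kernel = np.exp(-((i[:, None] - i[None, :]) ** 2) / (2.0 * sigma ** 2))
    fbar = kernel @ f.astype(float)
    return fbar / fbar.sum()

def penalties(c, fg_hists, bg_hists, n_bins=64, eps=1e-12):
    """R_F|q and R_B|q (Eq. 2.1) for a pixel with channels c scaled to [0, 1]."""
    bins = np.minimum((np.asarray(c) * n_bins).astype(int), n_bins - 1)
    # Channels assumed statistically independent: likelihoods are products
    p_fg = np.prod([h[b] for h, b in zip(fg_hists, bins)])
    p_bg = np.prod([h[b] for h, b in zip(bg_hists, bins)])
    post_bg = p_bg / (p_fg + p_bg + eps)   # Eq. (2.2)
    post_fg = p_fg / (p_fg + p_bg + eps)
    return -np.log(post_bg + eps), -np.log(post_fg + eps)  # R_F|q, R_B|q
```

For a pixel whose colour matches the template, R_B|q is small, so the cheap edge to B is the
one the min-cut tends to sever and the pixel stays attached to the foreground node F.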
     The foreground histogram is computed from all pixels in the template patch. Computation
of the background histogram b_{L,a,b} is based on the following arguments.


Figure 2.4: Comparison of two histograms. The first one, shown in (a), is computed from all
image pixels, in contrast to the second one, shown in (b), computed from template patch points.
The template patch is cut out from the image at the position shown as a light square. The
histograms are only schematic and are shown for one colour channel only to illustrate the idea
described in the text.

    We suggest computing the background histogram from all image pixels, since we know a
priori neither the colours nor a template patch of the background. The basic idea behind this is
the assumption that the histogram computed from all points includes information on all the
colours in the image (background and foreground); see Fig. 2.4.
    Assume that the texture is composed of two significant colours, like the template patch in
Fig. 2.4(b) shows. Two significant colours create two peaks in the histograms. The same two
peaks appear in both histograms. But, since Σ_{i=1..N} b̄_i = 1, the probability p(c_q|B) gives
smaller values than p(c_q|F) for the colours present in the template. Thus, points more similar
to the template are bound in the graph more strongly to the foreground node than to the
background node, which makes the min-cut more likely to pass through the weaker of the two
edges connecting a pixel to the F and B nodes. The edges connecting neighbouring points also
play a role in the final cut, however, so it cannot be said that the min-cut always cuts the
weaker of the two terminal edges; hence only “with higher probability”.
approach is that the surface area in the image of the region matching the texture patch is “small”
compared to the area of the whole image. Experiments on the size of the region at which this
assumption breaks down remain to be done.
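End to end, the construction described above can be exercised on a toy image as below. This
is a self-contained illustration under stated simplifications: a uniform neighbour weight stands
in for the texture-gradient weights W_q,r of [42, 45], and a plain Edmonds–Karp max-flow
replaces the efficient Boykov–Kolmogorov algorithm of [17]; all names are our own.

```python
from collections import deque
import numpy as np

def min_cut_labels(r_f, r_b, w_n):
    """Label a tiny image foreground/background via max-flow on the graph of Fig. 2.3.

    Node 0 is the F terminal, node 1 the B terminal; pixel (y, x) is node 2 + y*W + x.
    r_f, r_b hold the terminal edge costs; w_n is a uniform 4-neighbour weight.
    """
    H, W = r_f.shape
    n = H * W + 2
    pid = lambda y, x: 2 + y * W + x
    cap = np.zeros((n, n))
    for y in range(H):
        for x in range(W):
            q = pid(y, x)
            cap[0, q] = r_f[y, x]              # {q, F} edge
            cap[q, 1] = r_b[y, x]              # {q, B} edge
            if y + 1 < H:
                cap[q, pid(y + 1, x)] = cap[pid(y + 1, x), q] = w_n
            if x + 1 < W:
                cap[q, pid(y, x + 1)] = cap[pid(y, x + 1), q] = w_n
    flow = np.zeros_like(cap)
    while True:                                 # Edmonds-Karp: BFS augmenting paths
        parent = {0: None}
        queue = deque([0])
        while queue and 1 not in parent:
            u = queue.popleft()
            for v in range(n):
                if v not in parent and cap[u, v] - flow[u, v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if 1 not in parent:
            break
        path, v = [], 1
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u, v] - flow[u, v] for u, v in path)
        for u, v in path:
            flow[u, v] += aug
            flow[v, u] -= aug
    reach, queue = {0}, deque([0])              # source side of the min-cut = foreground
    while queue:
        u = queue.popleft()
        for v in range(n):
            if v not in reach and cap[u, v] - flow[u, v] > 1e-12:
                reach.add(v)
                queue.append(v)
    return np.array([[pid(y, x) in reach for x in range(W)] for y in range(H)])
```

Pixels that remain reachable from F in the residual graph lie on the source side of the cut and
are labelled foreground, exactly the behaviour discussed for the terminal edges above.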


2.1.2    Experiments
The segmentation method was implemented in MATLAB. Some of the most time-consuming
operations (such as creating the graph edge weights) were implemented in C and interfaced
with MATLAB through mex-files. We took advantage of the sparse matrices offered directly
by MATLAB. We used the publicly available C++ implementation of the min-cut algorithm [17]
and some MATLAB code for colour and texture gradient computation [42].
    The most time-consuming part of the segmentation process is creating the weight matrix W.
It takes 50 seconds for a 250 × 375 image on a Pentium 4 @ 2.8 GHz. Implementing the
texture gradient in C would dramatically reduce the computation time. Once the graph is built,
finding the min-cut takes 2 – 10 seconds.
    In all our experiments we used images from the Berkeley database [32]. We marked a small
“representative” part of the image and used it as the template for segmentation.
See Fig. 2.5 for the results, which show that very good segmentation can be obtained even
though only the colour histogram of the template patch is taken into account.

Figure 2.5: Results (best viewed in colour). 1st column: enlarged image template patch. 2nd
column: input image with marked area used as the template. 3rd column: binary segmentation.
4th column: segmentation with masked original image.
    It is also possible to apply the template patch obtained from one image to segmenting
another one. In the case depicted in Fig. 2.2, a small patch encoding the tiger’s colours,
obtained from one image, is used to find the tiger in another image. It can be seen that most of
the tiger’s body was captured, but some pixels belonging to the background were segmented
as well. Such “non-tiger” regions could be pruned using some further procedure, which is not
discussed in this paper.
    This opens new possibilities for the use of the method, e.g. for image retrieval applications.
Since a representative image template is available, images from large databases coherent in
colour and texture can be found. This application remains to be investigated.

2.1.3    Conclusion
We suggest a method for supervised texture segmentation in images using one-class learning.
The method is based on finding the min-cut/max-flow in a graph representing the image to
be segmented. We described how to handle the information present in a small representative
template patch provided by the user. We proposed a new strategy to avoid the need for a back-
ground template or for a priori information of the background. The one-class learning problem
is difficult in this case as no information about the background is known a priori. Experiments
presented on some images from the Berkeley database show that the method gives reasonable
results.
    The method gives the possibility to detect objects in an image database coherent with a pro-
vided representative template patch. A question still to be considered is how the representative
template patch for each class of interest should look. For example, the template patch should
be invariant under illumination changes, perspective distortion, etc.




Chapter 3

Active Learning

In many potential applications of machine learning it is difficult to get an adequate set of labelled
examples to train a supervised machine learning system. This may be because:

   • the amount of available training data is vast (e.g. astrophysical data, image retrieval or
     text mining),

   • the labelling is expensive or time-consuming (bioinformatics or medical applications), or

   • the labels are ephemeral, as when assigning relevance in information retrieval.

    In these circumstances an Active Learning (AL) methodology is appropriate, i.e. the learner
is presented with a stream or a pool of unlabelled examples and may request a ‘supervisor’
to assign labels to a subset of these. The idea of AL has been around since the early ’90s
(e.g. [59]); however, it remains an interesting research area today, as there are many applications
of AL that need to start learning with very small labelled data sets. The two motivating
examples we will explore are the image indexing problem described in B.1.3 and the problem
of bootstrapping a message classification system. Both of these scenarios suffer from the ‘small
sample problem’ that has been considered in relevance feedback in information retrieval
systems.
    Relevance feedback as studied in Information Retrieval [57] is closely related to Active
Learning in the sense that both need to induce models from small samples. Relevance feedback
(RF) has also been used in content based image retrieval (Rui et al. 1998). However, the im-
portant difference is that RF is searching for a target document or image whereas the objective
with AL is to induce a classifier from a minimal amount of data. In addition RF is transductive
rather than inductive, in that the algorithm has access to the full population of examples. Also,
in RF, examples are selected with two objectives in mind: it is expected that they will prove
informative to the learner when labelled, and it is hoped that they will satisfy the user’s
information need. In AL, by contrast, examples are selected purely with the objective of
maximising information for the learner.
    This issue of example selection is at the heart of research on AL. The normal guiding
principles are to select the unlabelled examples on which the learner is least certain or,
alternatively, to select the examples that will prove most informative for the learner; these
often boil down to the same thing. A classic strategy is that described by Tong and Koller [69]
for Active Learning with SVMs. They present a formal analysis based on version spaces which
shows that a good policy for selecting samples for AL from a pool of data is to select examples
that are closest to the
separating hyperplane of the SVM. A better but more expensive policy is to select the example
that would produce the smallest margin in the next version of the SVM. Clearly these policies
are less reliable when starting with a very small number of labelled examples, as the version
space is very large and the location of the hyperplane is unstable. It is also desirable to be able
to select sets of examples together, so it would be useful to introduce some notion of diversity
to maximise the utility of these complete sets.
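The closest-to-hyperplane policy can be sketched as follows. The tiny subgradient-trained
linear SVM is our own stand-in for a full SVM implementation; all names and
hyperparameters are illustrative assumptions, not taken from [69].

```python
import numpy as np

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Tiny linear SVM trained by subgradient descent on the hinge loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:          # margin violated: hinge subgradient
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                              # only the regulariser acts
                w -= lr * lam * w
    return w, b

def select_queries(pool, w, b, k=2):
    """Active-learning query selection: the k pool points nearest the hyperplane."""
    margins = np.abs(pool @ w + b) / np.linalg.norm(w)
    return np.argsort(margins)[:k]
```

select_queries returns the pool indices with the smallest unsigned geometric margin; under
the Tong and Koller analysis, labelling these first shrinks the version space fastest.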


3.1     Active Learning
Learning is a process whereby an agent uses a set of experiences to improve its problem-solving
capabilities. In the passive learning model, the learner passively receives a stream of experiences
and processes them. In the active learning model, the learner has some control over its training
experiences. Markovitch and Scott [43] define the information filtering framework, which
specifies five types of selection mechanisms that a learning system can employ to increase the
utility of the learning process. Active learning can be viewed in this model as a selective
experience filter.
    Why should a learner be active? There are two major reasons for employing experience
selection mechanisms:

   1. If some experiences have negative utility – the quality of learning would have been bet-
      ter without them – then it is obviously desirable to filter them out. For example, noisy
      examples are usually harmful.

   2. Even when all the experiences are of positive utility, there is a need to be selective or
      to have control over the order in which they are processed. This is because the learning
      agent has bounded training resources which should be managed carefully.

     In passive learning systems, these problems are either ignored, leading to inefficient learn-
ing, or are solved by relying on a human teacher to select informative training experiences [76].
It may sometimes be advantageous for a system to select its own training experiences even when
an external teacher is available. Scott & Markovitch [58] argue that a learning system has an
advantage over an external teacher in selecting informative examples because, unlike a teacher,
it can directly access its own knowledge base. This is important because the utility of a training
experience depends upon the current state of the knowledge base.
     In this summary we concentrate on active learning for supervised concept induction. Most
work in this area assumes that there are high costs associated with labelling examples.
Consider, for example, training a classifier for a character recognition problem. In this case, the
character images are easily available, while their classification is a costly process requiring a
human operator.
     In the following subsections we differentiate between pool-based selective sampling and
example construction approaches, define a formal framework for selective sampling algorithms
and present various approaches for solving the problem.

3.2     The Source of the Examples
One characteristic in which active learning algorithms vary is the source of input examples.
Some algorithms construct input examples, while others select from a pool or a stream of unla-
beled examples.


3.2.1    Example Construction
Active learning by example construction was formalized and theoretically studied by Angluin [2].
Angluin developed a model which allows the learner to ask two types of queries: a membership
query, where the learner selects an instance x and is told its membership in the target concept,
f (x); and an equivalence query, where the learner presents a hypothesis h and is either told that
h ≡ f or is given a counterexample. Many polynomial-time algorithms based on Angluin's
model have been presented for learning target classes such as Horn sentences [3] and multi-
variate polynomials [18]. In each case, a domain-specific query construction algorithm was
presented.
     Among practical example-construction approaches, some algorithms optimize an objective
function such as information gain or generalization error and derive expressions for optimal
queries in certain domains [63, 66]. Other algorithms locate classification borders and regions
in the input space and construct queries near the borders [12, 30].
     A major drawback of query construction was shown by Baum and Lang [37]. When they
applied a query construction method to images of handwritten characters, many of the images
constructed by the algorithm did not contain any recognizable characters. This is a problem in
many real-world domains.


3.2.2    Selective Sampling
Selective sampling avoids the problem described above by assuming that a pool of unlabeled
examples is available. Most algorithms work iteratively, evaluating the unlabeled examples
and selecting one for labeling. Alternatively, the algorithm receives a stream of unlabeled
examples and decides for each example (according to some evaluation criterion) whether or not
to query for its label.
     The first approach can be formalized as follows. Let X be an instance space, a set of objects
described by a finite collection of attributes (features). Let f : X → {0, 1} be a teacher (also
called an expert) that labels instances as 0 or 1. A (supervised) learning algorithm takes a set
of labeled examples, {⟨x1, f(x1)⟩, . . ., ⟨xn, f(xn)⟩}, and returns a hypothesis h : X → {0, 1}. Let
X ⊆ X be an unlabeled training set. Let D = {⟨xi, f(xi)⟩ : xi ∈ X, i = 1, . . ., n} be the training
data, a set of labeled examples from X. A selective sampling algorithm SL, specified relative
to a learning algorithm L, receives X and D as input and returns an unlabeled element of X.
    The process of learning with selective sampling can be described as an iterative procedure
where at each iteration the selective sampling procedure is called to obtain an unlabeled example
and the teacher is called to label that example. The labeled example is added to the set of
currently available labeled examples and the updated set is given to the learning procedure,
which induces a new classifier. This sequence repeats until some stopping criterion is satisfied.

Active Learner(X, f()):

   1. D ← ∅.

   2. h ← L(∅).

   3. While stopping-criterion is not satisfied do:

         a)   x ← SL(X, D)        ;   Apply SL and get the next example.
         b)   ω ← f(x)            ;   Ask the teacher to label x.
         c)   D ← D ∪ {⟨x, ω⟩}    ;   Update the labeled examples set.
         d)   h ← L(D)            ;   Update the classifier.

   4. Return classifier h.
Figure 3.1: Active learning with selective sampling. Active Learner is defined by specifying a
stopping criterion, a learning algorithm L and a selective sampling algorithm SL. It then works
with the unlabeled data X and the teacher f() as input.

This criterion may be a resource bound, M, on the number of examples that the teacher is willing
to label, or a lower bound on the desired classification accuracy. Adopting the first stopping
criterion, the goal of the selective sampling algorithm is to produce a sequence of length M
which leads to the best classifier according to some given measure.
    The pseudo code for an active learning system that uses selective sampling is shown in
Figure 3.1.
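The loop of Figure 3.1 can be made concrete with a toy instantiation. The sketch below is entirely hypothetical: L is a one-dimensional threshold learner and SL queries the pool point nearest the current decision boundary, probing the extremes while one class is still unseen.

```python
def learn(D):
    """L(D): fit a 1-D threshold halfway between the largest labelled
    negative and the smallest labelled positive."""
    ones = [x for x, y in D if y == 1]
    zeros = [x for x, y in D if y == 0]
    if not ones or not zeros:
        return lambda x: 0
    t = (max(zeros) + min(ones)) / 2.0
    return lambda x: 1 if x >= t else 0

def select(X_pool, D, labelled):
    """S_L(X, D): pick the unlabelled point nearest the current boundary;
    bootstrap by probing the extremes while one class is still unseen."""
    unlabelled = [i for i in range(len(X_pool)) if i not in labelled]
    if not unlabelled:
        return None
    ones = [x for x, y in D if y == 1]
    zeros = [x for x, y in D if y == 0]
    if not zeros:
        return min(unlabelled, key=lambda i: X_pool[i])
    if not ones:
        return max(unlabelled, key=lambda i: X_pool[i])
    t = (max(zeros) + min(ones)) / 2.0
    return min(unlabelled, key=lambda i: (abs(X_pool[i] - t), i))

def active_learner(X_pool, f, budget):
    """Active Learner(X, f()) of Figure 3.1 with a label-budget criterion."""
    D, labelled = [], set()               # D <- empty set
    h = learn(D)                          # h <- L(empty set)
    for _ in range(budget):               # stopping criterion: budget M
        i = select(X_pool, D, labelled)   # a) x <- S_L(X, D)
        if i is None:
            break
        y = f(X_pool[i])                  # b) omega <- f(x): ask the teacher
        D.append((X_pool[i], y))          # c) D <- D U {<x, omega>}
        labelled.add(i)
        h = learn(D)                      # d) h <- L(D)
    return h                              # 4. return classifier h

X_pool = [i / 10 for i in range(1, 11)]          # pool: 0.1 .. 1.0
teacher = lambda x: 1 if x >= 0.55 else 0        # hidden target concept
h = active_learner(X_pool, teacher, budget=6)
```

With a budget of six labels the learner homes in on the true threshold by repeatedly halving the uncertain interval, illustrating why selective sampling can need far fewer labels than random sampling.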


3.3     Example Evaluation and Selection
Active learning algorithms that perform selective sampling differ from each other by the way
they select the next example for labeling. In this section we present some common approaches
to this problem.


3.3.1    Uncertainty-based Sampling
Most work on active learning selects unlabeled examples that the learner is most uncertain
about. Lewis and Gale [39] present a probabilistic classifier for text classification. They assume
that a large pool of unclassified text documents is available. The probabilistic classifier assigns
tag probabilities to text documents based on already labeled documents. At each iteration, sev-
eral unlabeled examples that were assigned probabilities closest to 0.5 are selected for labeling.
It is shown that the method significantly reduces the amount of labeled data needed to achieve
a certain level of accuracy, compared to random sampling. Later, Lewis and Catlett [38] used an
efficient probabilistic classifier to select examples for training another (C4.5) classifier. Again,
examples with probabilities closest to 0.5 were chosen.
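The selection rule of Lewis and Gale reduces to a few lines. The sketch below uses illustrative names and assumes the classifier's probability estimates for the pool are already available:

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Lewis & Gale-style selection: the k pool items whose predicted
    P(class = 1) is closest to 0.5, i.e. where the classifier is least certain."""
    gap = np.abs(np.asarray(probs) - 0.5)
    return list(np.argsort(gap, kind="stable")[:k])

# hypothetical classifier outputs for a pool of five documents
pool_probs = [0.95, 0.48, 0.10, 0.60, 0.52]
print(uncertainty_sample(pool_probs, 2))
```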
     Special uncertainty sampling methods were developed for specific classifiers. Park [53] uses
a sampling-at-the-boundary method to choose support vectors for an SVM classifier. Orthogonal
support vectors are chosen from the boundary hyperplane and queried one at a time. Tong and
Koller [70] also perform pool-based active learning of SVMs. Each new query is supposed to
split the current version space into two parts as equal as possible. The authors suggest three
methods of choosing the next example based on this principle. Hasenjager and Ritter [30] and
Lindenbaum et al. [41] present selective sampling algorithms for nearest neighbor classifiers.


3.3.2    Committee-based sampling
Query By Committee (QBC) is a general uncertainty-based method in which a committee of hy-
potheses consistent with the labeled data is specified, and an unlabeled example on which the
committee members most disagree is chosen. Seung, Opper and Sompolinsky [60] use the Gibbs
algorithm to choose random hypotheses from the version space for the committee, and choose
an example on which the disagreement between the committee members is greatest. The
method is demonstrated on the high-low game and on perceptron learning. Later, Freund et
al. [22] presented a more complete and general analysis of query by committee, and showed that
the method guarantees a rapid decrease in prediction error for the perceptron algorithm.
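A common way to quantify committee disagreement is the entropy of the members' votes. The following sketch (vote entropy is one of several disagreement measures used in the QBC literature, not necessarily the one of [60]) queries the pool example with the highest vote entropy:

```python
import math

def vote_entropy(votes):
    """Disagreement of a committee on one example: entropy of the label votes."""
    n = len(votes)
    ent = 0.0
    for label in set(votes):
        p = votes.count(label) / n
        ent -= p * math.log(p)
    return ent

def qbc_select(committee_votes):
    """Query the pool example on which committee disagreement is largest."""
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))

# 3 pool examples, committee of 4: unanimous, 3-1 split, 2-2 split
committee_votes = [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]
print(qbc_select(committee_votes))   # the 2-2 split maximizes disagreement
```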
     Cohn et al. [19] present the concept of uncertainty regions: the regions on which hypothe-
ses that are consistent with the labeled data might disagree. Although it is difficult to calculate
these regions exactly, one can use approximations (larger or smaller regions). The authors
show an application of the method to neural networks: two networks are trained on the labeled
example set, one being the most general network and the other the most specific network. When
an unlabeled example is examined, if the two networks disagree on it, the true label is
queried.
     Krogh and Vedelsby [36] show the connection between the generalization error of a com-
mittee of neural networks and its ambiguity: the larger the ambiguity, the smaller the error.
The authors build an ambiguous network ensemble by cross-validation and calculate optimal
weights for the ensemble members. The ensemble can be used for active learning: the
example chosen for labeling is the one that leads to the largest ambiguity. Raychaudhuri et
al. [55] use the same principle and analyze the results with an emphasis on minimizing data
gathering. Hasenjager and Ritter [29] compare the theoretical optimum (maximization of in-
formation gain) of the high-low game with the performance of the QBC approach, and show
that the QBC results rapidly approach the theoretical optimum.
     In some cases, especially in the presence of noise, it is impossible to find hypotheses con-
sistent with the labeled data. Abe and Mamitsuka [1] suggest combining QBC with boosting
or bagging. Liere and Tadepalli [40] use a small committee of 7 Winnow classifiers
for text categorization. Engelson and Dagan [5] overcome the problem of finding consistent
hypotheses by generating a set of classifiers according to the posterior distribution of the pa-
rameters, and show how to do this for binomial and multinomial parameters. This
method is only applicable when it is possible to estimate a posterior distribution over the model
space given the training data. Muslea et al. [49] present an active learning scheme for problems
with several redundant views, where each committee member is based on one of the problem views.
     Park et al. [54] combine active learning with the use of unlabeled examples for induction. They
train a committee on an initial training set. At each iteration, the algorithm predicts the label
of an unlabeled example. If the prediction disagreement among committee members is below a
threshold, the label is considered to be the true label and the committee members' weights are up-
dated according to their predictions. If disagreement is high, the algorithm queries the oracle
(expert) for the true label. Baram et al. [7] use a committee of known-to-be-good active learn-
ing algorithms. The committee is then used to select the next query. At each stage, the best
performing algorithm in the committee is traced, and the committee members' weights are updated
accordingly.
    Melville and Mooney [46] use artificially created examples to construct highly diverse com-
mittees. Artificial examples are created based on the feature distribution of the training set and
labeled with the least probable label. A new classifier is built using the new training set and, if
it does not increase the committee error, is added to the committee.



3.3.3    Clustering
Recently, several works have suggested pre-clustering the instance pool before selecting
an example for labeling. The motivation is that grouping similar examples may help to decide
which example will be most useful when labeled. Engelbrecht et al. [21] cluster the unlabeled data
and then perform selective sampling in each of the clusters. Nguyen et al. [50] pre-cluster the
unlabeled data and give priority to examples that are close to the classification boundary and
representative of dense clusters.
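As a minimal illustration of the pre-clustering idea (our sketch, not the algorithm of [21] or [50]): given a pool already partitioned into clusters, one can query the most representative item of each cluster, here taken to be the item nearest its cluster centroid.

```python
import numpy as np

def cluster_representatives(X, labels):
    """After pre-clustering the pool, pick one representative per cluster:
    the item closest to its cluster centroid."""
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = X[idx].mean(axis=0)
        # squared distance of each cluster member to the centroid
        reps.append(idx[np.argmin(((X[idx] - centroid) ** 2).sum(axis=1))])
    return reps

# two well-separated groups in 2-D with given cluster labels
X = np.array([[0.0, 0.0], [0.2, 0.0], [-0.1, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 0, 1, 1])
print(cluster_representatives(X, labels))
```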
    Soderland [62] incorporates active learning in a system that learns information extraction
rules. The system maintains a group of learned rules. In order to choose the next example for
labeling, all unlabeled examples are divided into three categories: examples covered by rules, near
misses, and examples not covered by any rule. Examples for labeling are sampled randomly from
these groups (the proportions are selected by the user).



3.3.4    Utility-Based Sampling
Fujii et al. [24] observed that considering only example uncertainty is not sufficient when
selecting the next example for labeling, and that the impact of labeling an example on the other
unlabeled examples should also be taken into account. Example utility is therefore measured by
the sum of the interpretation certainty differences over all unlabeled examples, averaged with
the interpretation probabilities of the examined example. The method was tested on the word
sense disambiguation domain and shown to perform better than other sampling methods.
     Lindenbaum et al. [41] present a very general utility-based sampling method. The sampling
process is viewed as a game where the learner's actions are the possible instances and the teacher's
reactions are the different labels. The sampling algorithm expands a “game” tree of depth
k, where k depends on the available resources. This tree contains all the possible sampling
sequences of length k. The utility of each leaf is computed by evaluating the classifier resulting
from performing induction on the sequence leading to that leaf. The arcs associated with the
teacher's reactions are assigned probabilities using a random field model. The example selected
is the one with the highest expected utility.

3.4     Conclusion
Active learning is a powerful tool. It has been shown to be efficient in reducing the number of
labeled examples needed for learning, thus reducing the cost of creating training sets.
It has been successfully applied to many problem domains and is potentially applicable to any
task involving learning. Active learning has been applied to numerous domains such as function
approximation [36], speech recognition [72], recommender systems [67], text classification and
categorization [39, 70, 40, 68, 80], part-of-speech tagging [5], image classification [49] and
word sense disambiguation [24, 54].




Chapter 4

Semi-Supervised Clustering

It is generally accepted that conventional approaches to clustering have the drawback
that they do not bring background knowledge to bear on the clustering process [74, 11, 77].
This is particularly an issue in the analysis of gene-array data, where clustering is a key data-
analysis technique [78, 6]. This issue is increasingly being recognised and there is significant
research activity that attempts to make progress in this area. Bolshakova et al. [14] have shown
how background information from Gene Ontologies can be used for cluster validity assessment.
However, if extra information is available, it should be used directly in the determination of the
clusters rather than after the fact in validation. The idea that background knowledge should be
brought to bear in clustering is a general principle and there is a variety of ways in which this
might be achieved in practice. Yip et al. [78] suggest there are three aspects to this:

   • what extra knowledge is being input to the clustering process,

         – cluster labels may be available for some objects,
         – more commonly, information will be available on whether objects must-link or
           cannot-link.

   • when the knowledge is input

         – before clustering to guide the clustering process,
         – after clustering to evaluate the clusters or guide the next round of clustering,
         – users may supply knowledge as the clustering proceeds.

   • how the knowledge affects the clustering process

         – through the formation of seed clusters,
         – constraining some objects to be put in the same cluster or different clusters,
         – modifying the assessment of similarity/difference.

   It is clear that a key determiner of the best approach to semi-supervised clustering is the
nature of the external information that is being brought to bear.
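The must-link/cannot-link form of knowledge is easy to operationalise. As a small illustrative sketch (the function name is ours), a candidate clustering can be scored by how many of the given constraints it violates:

```python
def violated_constraints(assignment, must_link, cannot_link):
    """Count how many pairwise constraints a clustering violates.
    `assignment` maps item id -> cluster id; constraints are id pairs."""
    broken = sum(1 for a, b in must_link if assignment[a] != assignment[b])
    broken += sum(1 for a, b in cannot_link if assignment[a] == assignment[b])
    return broken

assignment = {0: "A", 1: "A", 2: "B", 3: "B"}
must_link = [(0, 1), (1, 2)]      # (1, 2) is violated: different clusters
cannot_link = [(0, 3), (2, 3)]    # (2, 3) is violated: same cluster
print(violated_constraints(assignment, must_link, cannot_link))  # 2
```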

4.1     Semi-supervised Clustering of Images
To let users easily apprehend the contents of image databases and to make their access to the
images more effective, a relevant organisation of the contents must be achieved. The results of
a fully automatic categorization (unsupervised clustering or cluster analysis, see the surveys in
[34], [33]), relying exclusively on a measure of similarity between data items, are rarely satis-
factory. Some supervision information provided by the user is then needed to obtain results
that are closer to the user's expectations. Supervised classification and semi-supervised clustering
are both candidates for such a categorization approach. When the user is able and willing to
provide class labels for a significant sample of images in the database, supervised classification
is the method of choice. In practice, this will be the case for specific classification/recognition
tasks, but not so often for database categorization tasks.
     When the goal is general image database categorization, not only does the user not have
labels for the images, but he does not even know a priori what most of the classes are or how many
classes should be found. Instead, he expects the system to “discover” these classes for him.
To improve the results with respect to what an unsupervised algorithm would produce, the user
may agree to provide some supervision if this information is of a very simple nature and in a
rather limited amount. A semi-supervised clustering approach should then be employed. Semi-
supervised clustering takes into account both the similarity between data items and information
regarding either the membership of some data items in specific clusters or, more often, pairwise
constraints (must-link, cannot-link) between data items. These two sources of information are
combined ([26], [10]) either by modifying the search for appropriate clusters (search-based
methods) or by adapting the similarity measure (similarity-adapting methods). Search-based
methods consider that the similarities between data items provide relatively reliable information
regarding the target categorization, but the algorithm needs some help in order to find the most
relevant clusters. Similarity-adapting methods assume that the initial similarity measure has
to be significantly modified (at a local or a more global scale) by the supervision in order to
reflect correctly the target categorization. While similarity-adapting methods appear to apply
to a wider range of situations, they need either significantly more supervision (which can be an
unacceptable burden for the user) or specific strong assumptions regarding the target similarity
measure (which can be a strong limitation in their domain of application).
     We assume that users can easily evaluate whether two images should be in the same cat-
egory or rather in different categories, so they can easily define must-link or cannot-link con-
straints between pairs of images. Following previous work by Demiriz et al. [20], Wagstaff et
al. [73] or Basu et al. [8], in [28] we introduced Pairwise-Constrained Competitive Agglomera-
tion (PCCA), a fuzzy semi-supervised clustering algorithm that exploits the simple information
provided by pairwise constraints. PCCA is reviewed in section 4.2. In the original version of
PCCA [28] we did not make further assumptions regarding the data, so the pairs of items for
which the user is required to define constraints are randomly selected. But in many cases, such
assumptions regarding the data are available. We argue in [27] that quite general assumptions let
us perform a more adequate, active selection of the pairs of items and thus significantly reduce
the number of constraints required for achieving a desired level of performance. We present in
section 4.3 our criterion for the active selection of constraints. Experimental results obtained
with this criterion on a real-world image categorization problem are given in section 4.4.

4.2     Pairwise-Constrained Competitive Agglomeration
PCCA belongs to the family of search-based semi-supervised clustering methods. It is based
on the Competitive Agglomeration (CA) algorithm [23], a fuzzy partitional algorithm that does
not require the user to specify the number of clusters to be found. Let xi, i ∈ {1, . . ., N} be a set
of N vectors representing the data items to be clustered, V the matrix having as columns the
prototypes µk, k ∈ {1, . . ., C} of C clusters (C ≪ N), and U the matrix of membership degrees,
such that uik is the membership of xi in cluster k. Let d(xi, µk) be the distance between
the vector xi and the cluster prototype µk. For PCCA, the objective function to be minimized
must combine the feature-based similarity between data items and knowledge of the pairwise
constraints. Let M be the set of available must-link constraints, i.e. (xi, xj) ∈ M implies that xi
and xj should be assigned to the same cluster, and C the set of cannot-link constraints, i.e.
(xi, xj) ∈ C implies that xi and xj should be assigned to different clusters. With these notations,
the objective function that PCCA must minimize is [28]:
    J(V, U) = \sum_{k=1}^{C} \sum_{i=1}^{N} (u_{ik})^2 \, d^2(x_i, \mu_k)
            + \alpha \Big( \sum_{(x_i, x_j) \in M} \sum_{k=1}^{C} \sum_{l=1, l \neq k}^{C} u_{ik} u_{jl}
                         + \sum_{(x_i, x_j) \in C} \sum_{k=1}^{C} u_{ik} u_{jk} \Big)
            - \beta \sum_{k=1}^{C} \Big( \sum_{i=1}^{N} u_{ik} \Big)^2                    (4.1)

under the constraint \sum_{k=1}^{C} u_{ik} = 1, for i ∈ {1, . . ., N}. The prototypes of the clusters, for k ∈
{1, . . ., C}, are given by

    \mu_k = \frac{\sum_{i=1}^{N} (u_{ik})^2 x_i}{\sum_{i=1}^{N} (u_{ik})^2}                    (4.2)

and the fuzzy cardinalities of the clusters are N_k = \sum_{i=1}^{N} u_{ik}.
    The first term in (4.1) is the sum of the squared distances to the prototypes weighted by
the memberships and comes from the FCM objective function; it reinforces the compactness of
the clusters. The second term is the cost of violating the pairwise must-link and cannot-link
constraints. The penalty corresponding to the presence of two points in different clusters (for a
must-link constraint) or in the same cluster (for a cannot-link constraint) is weighted by their
membership values. The constraint term is weighted by α, a constant factor that specifies the
relative importance of the supervision. The last term in (4.1) is the sum of the squares of the
cardinalities of the clusters (coming from the CA objective function) and controls the competition
between clusters.
    When all these terms are combined and β is well chosen, the final partition will minimize
the sum of intra-cluster distances, while partitioning the data set into the smallest number of
clusters such that the specified constraints are respected as well as possible. Note that when the
memberships are crisp and the number of clusters is pre-defined, this cost function reduces to
the one used by the PCKmeans algorithm in [8].
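For illustration, the objective (4.1) can be transcribed directly into Python. This is only a sketch for evaluating J(V, U) on given memberships and prototypes, not the PCCA optimisation itself (the update rules are derived in [28]):

```python
import numpy as np

def pcca_objective(X, U, V, must_link, cannot_link, alpha, beta):
    """Evaluate the PCCA objective (4.1): U is N x C memberships,
    V is C x d prototypes, constraints are lists of item-index pairs."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # d^2(x_i, mu_k), N x C
    total = (U ** 2 * d2).sum()                               # FCM compactness term
    for i, j in must_link:    # cost of x_i, x_j ending up in different clusters
        total += alpha * (U[i].sum() * U[j].sum() - U[i] @ U[j])
    for i, j in cannot_link:  # cost of x_i, x_j ending up in the same cluster
        total += alpha * (U[i] @ U[j])
    total -= beta * (U.sum(axis=0) ** 2).sum()                # cluster competition term
    return total

# tiny check: two 1-D points, one cluster, full memberships, no constraints
X = np.array([[0.0], [2.0]])
V = np.array([[1.0]])
U = np.ones((2, 1))
print(pcca_objective(X, U, V, [], [], alpha=0.0, beta=0.5))
```

The must-link penalty uses the identity Σ_k Σ_{l≠k} u_ik u_jl = (Σ_k u_ik)(Σ_l u_jl) − Σ_k u_ik u_jk.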
    It can be shown (see [28]) that the equation for updating the memberships is

    u_{rs} = u_{rs}^{FCM} + u_{rs}^{Constraints} + u_{rs}^{Bias}                    (4.3)

    The first term in (4.3), u_{rs}^{FCM}, comes from FCM and focusses only on the distances between
data items and prototypes. The second term, u_{rs}^{Constraints}, takes the supervision into account:
memberships are reinforced or deprecated according to the pairwise constraints given by the
user. The third term, u_{rs}^{Bias}, leads to a reduction of the cardinality of spurious clusters, which are
discarded when their cardinality drops below a threshold.
    The β factor controls the competition between clusters and is defined at iteration t by:

    \beta(t) = \frac{\eta_0 \exp(-|t - t_0|/\tau)}{\sum_{j=1}^{C} \big( \sum_{i=1}^{N} u_{ij} \big)^2}
               \sum_{j=1}^{C} \sum_{i=1}^{N} u_{ij}^2 \, d^2(x_i, \mu_j)                    (4.4)


    The exponential component of β makes the last term in (4.1) dominate during the first iter-
ations of the algorithm, in order to reduce the number of clusters by removing spurious ones.
Later, the first three terms will dominate, to seek the best partition of the data. The resulting
PCCA algorithm is given in [28]. In the original version of PCCA, the pairs of items for which
the user is required to define constraints are randomly selected, prior to running the cluster-
ing process. As the distance d(xi, µj) between a data item xi and a cluster prototype µj, one can
use either the ordinary Euclidean distance, when the clusters are assumed to be spherical, or the
Mahalanobis distance, when they are assumed to be elliptical.


4.3     Active Selection of Pairwise Constraints
To make this semi-supervised clustering approach attractive for the user, it is important to min-
imize the number of constraints he has to provide for reaching some given level of quality. This
can be done by asking the user to define must-link or cannot-link constraints for the pairs of
data items that are expected to have the strongest corrective effect on the clustering algorithm
(i.e. that are maximally informative).
     But does the identity of the constraints one provides have a significant impact on perfor-
mance, or are all constraints relatively alike, so that only their number matters? In a
series of repeated experiments with PCCA using random constraints, we found a significant
variance in the quality of the final clustering results. The selection of constraints can therefore
have a strong impact, and we must find appropriate selection criteria. Such criteria may depend on
further assumptions regarding the data; for the criteria to be relatively general, the assumptions
they rely on should not be too restrictive.
     In previous work on this issue, Basu et al. [9] developed a scheme for selecting pairwise
constraints before running their semi-supervised clustering algorithm (PCKmeans). They de-
fined a farthest-first traversal scheme of the set of data items, with the goal of finding k items
that are far from each other to serve as support for the constraints. This issue was also explored
in unsupervised learning for cases where prototypes cannot be defined. In such cases, cluster-
ing can only rely on the evaluation of pairwise similarities between data items, implying a high
computational cost. In [31], the authors consider subsampling as a solution for reducing cost
and perform an active selection of new data items by minimizing the estimated risk of making
wrong predictions regarding the true clustering from the already seen data. This active selec-
tion method can also be seen as maximizing the expected value of the information provided by
the new data items. The authors find that their active unsupervised clustering algorithm spends
more samples to disambiguate clusters that are close to each other and fewer samples on
items belonging to well-separated clusters.

As for other search-based semi-supervised clustering methods, when using PCCA we con-
sider that the similarities between data items provide relatively reliable information regarding
the target categorization and the constraints only help in order to find the most relevant clusters.
There is then little uncertainty in identifying well-separated compact clusters. To be maximally
informative, supervision effort (i.e. constraints) should rather be spent on defining those clus-
ters that are neither compact nor well-separated from their neighbors. One can note that this
is consistent with the findings in [31] regarding unsupervised clustering. To find the data items
that provide the most informative pairwise constraints, we shall then focus on the least well-
defined clusters and, more specifically, on the frontier with their neighbors. To identify the least
well-defined cluster at some iteration t, we use the fuzzy hypervolume FHV = √|Ck|, with |Ck|
the determinant of the covariance matrix Ck of cluster k. The FHV was introduced by Gath and
Geva [25] as an evaluation of the compactness of a fuzzy cluster: the smaller the spatial volume
of the cluster and the more concentrated the data items are near its center, the lower the FHV of
the cluster. We consider the least well-defined cluster at iteration t to be the one with the largest
FHV at that iteration. Note that no extra computation is needed, because the FHV is already
used to compute the Mahalanobis distance.
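The FHV criterion is straightforward to compute. The sketch below is our illustration: it evaluates sqrt(det(C_k)) from a membership-weighted covariance matrix, and the exact weighting used in PCCA may differ.

```python
import numpy as np

def fuzzy_hypervolume(X, u):
    """FHV of one cluster: sqrt of the determinant of its membership-weighted
    (fuzzy) covariance matrix, in the spirit of Gath & Geva [25]."""
    u = np.asarray(u, dtype=float)
    c = (u[:, None] * X).sum(axis=0) / u.sum()       # fuzzy centroid
    diff = X - c
    cov = (u[:, None] * diff).T @ diff / u.sum()     # fuzzy covariance matrix
    return np.sqrt(np.linalg.det(cov))

# a compact cluster should have a smaller FHV than a spread-out one
rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(200, 2))
loose = rng.normal(0.0, 1.0, size=(200, 2))
u = np.ones(200)
print(fuzzy_hypervolume(tight, u) < fuzzy_hypervolume(loose, u))
```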
     Once the least well-defined cluster at iteration t is found, we need to identify the data items
near its boundary. Note that in the fuzzy setting, one can consider that a data item represented
by the vector xr is assigned to cluster s if urs is the highest among its membership degrees. The
data items at the boundary are those having the lowest membership values to this cluster among
all the items assigned to it. After obtaining a set of items that lie on the frontier of the cluster,
we find for each item the closest cluster, corresponding to its second highest membership value.
The user is then asked whether one of these items should be (or not) in the same cluster as the
closest item from the nearest cluster. It is easy to see that the time complexity of this method is
high. We suggest below an approximation of this method, with a much lower cost.
    After finding the least well-defined cluster with the FHV criterion, we consider a virtual
boundary that is only defined by a membership threshold and will usually be larger than the
true one (this is why we call it “extended” boundary). The items whose membership values
are closest to this threshold are considered to be on the boundary and constraints are generated
directly between these items. We expect these constraints to be roughly equivalent to, and not
much less informative than, the constraints that would have been obtained by the more
complex method described in the previous paragraphs.
    This approximate selection method has two parameters: the number of constraints at every
iteration and the membership threshold for defining the boundary. The first parameter concerns
the selection in general, not the approximate method specifically. In all the experiments pre-
sented next, we generate 3 constraints at every iteration of the clustering algorithm. For the
comparative evaluations, we plot the ratio of well-categorized items against the number of pair-
wise constraints. With this selection procedure, when the maximal number of constraints is
reached, clustering continues until convergence without generating new constraints. The sec-
ond parameter is specific to the approximation of the boundary and we use a fixed value of 0.3,
meaning that we consider a rather large approximation. These fixed values for the two param-
eters are necessarily suboptimal, but a significant increase in performance (or reduction in the
number of required constraints for a given performance) is nevertheless obtained.

4.4     Experimental Evaluation
We evaluated the effect of our criterion for the active selection of constraints on the PCCA
algorithm, comparing it to the CA algorithm [23] (unsupervised clustering) and to PCKmeans [8]
(semi-supervised clustering).
    The comparison shown here is performed on a ground truth database composed of images
of different phenotypes of Arabidopsis thaliana, corresponding to slightly different genotypes.
This scientific image database originates from studies of gene expression. There are 8 categories,
defined by visual criteria: textured plants, plants with long stems and round leaves, plants with
long stems and fine leaves, plants with dense, round leaves, plants with desiccated or yellow
leaves, plants with large green leaves, plants with reddish leaves, plants with partially white
leaves. There are a total of 187 plant images, but different categories contain very different
numbers of instances. The intra-class diversity is high for most classes. A sample of the images
is shown in [27]. The global image features we used are the Laplacian weighted histogram,
the probability weighted histogram, the Hough histogram, the Fourier histogram and a classical
color histogram obtained in HSV color space (described in [15]). Linear PCA was used for a
five-fold reduction of the dimension of the feature vectors. In all the experiments reported here
we employed the Mahalanobis distance rather than the classical Euclidean distance.
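The preprocessing and distance computation can be sketched as follows. This is a minimal illustration with hypothetical function names, not the exact pipeline used in the experiments:

```python
import numpy as np

def pca_reduce(X, factor=5):
    """Linear PCA keeping d/factor leading components
    (a five-fold dimensionality reduction for factor=5)."""
    Xc = X - X.mean(axis=0)
    # eigh returns eigenvalues in ascending order; reverse for leading components
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    k = max(1, X.shape[1] // factor)
    return Xc @ vecs[:, ::-1][:, :k]

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a cluster with the given mean and
    covariance; it reduces to the Euclidean distance when cov is the identity."""
    d = np.asarray(x) - np.asarray(mean)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

Using the Mahalanobis distance lets each cluster adapt to the local covariance of the (PCA-reduced) feature vectors, rather than treating all directions equally as the Euclidean distance does.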
    Figure 4.1 presents the dependence between the percentage of well-categorized data points
and the number of pairwise constraints considered. We provide as a reference the graphs for
the CA algorithm and for K-means, both ignoring the constraints (unsupervised learning). The
correct number of classes was directly provided to K-means and PCKmeans. CA, PCCA and
Active-PCCA were initialized with a significantly larger number of classes and found the ap-
propriate number by themselves.
    For the fuzzy algorithms (CA, PCCA and Active-PCCA) every data item is assigned to the
cluster to which its membership value is the highest. For every number of constraints, 500
experiments were performed with different random selections of the constraints in order to
produce the error bars for PCKmeans and for the original PCCA.
    These experimental results clearly show that the user can significantly improve the quality
of the categories obtained by providing a simple form of supervision, the pairwise constraints.
With a similar number of constraints, PCCA performs significantly better than PCKmeans by
making better use of the available constraints. The fuzzy clustering process takes the pairwise
constraints into account directly, through the signed constraint terms in the equation for
updating the memberships. The active selection of constraints (Active-PCCA) further reduces the
number of constraints required for reaching such an improvement. The number of constraints
becomes very low with respect to the number of items in the dataset.
By including information provided by the user, semi-supervised image database categorization
can produce results that come much closer to the user's expectations. But users will only accept
such an approach if the information they must provide is simple in nature and small in amount.
The experiments we performed show that the active selection of constraints
combined with PCCA allows the number of constraints required to remain sufficiently low for
this approach to be a valuable alternative in the categorization of image databases. The ap-
proximations we made allowed us to maintain a low computational complexity for the resulting
algorithm, making it suitable for real-world clustering applications.

[Plot: ratio of well-categorized points (0.6 to 1.0, y-axis) against the number of pairwise
constraints (0 to 80, x-axis), with curves for PCKmeans, CA, PCCA, Kmeans and Active-PCCA.]
Figure 4.1: Results obtained by the different clustering algorithms on the Arabidopsis database

    This work was supported by the MUSCLE EU Network of Excellence (http://www.muscle-noe.org/).
The authors also wish to thank NASC (European Arabidopsis Stock Centre, UK, http://arabidopsis.info/)
for allowing them to use the Arabidopsis images and Ian Small from INRA (Institut National
de la Recherche Agronomique, France) for providing this image database.




Chapter 5

General Conclusions

A special characteristic of the Machine Learning research in Muscle is that it is application-driven:
it is shaped by the scenarios in which ML techniques might be used to process multimedia
data. The role for ML in these situations often does not fit well with standard ML techniques,
which are divided into those that deal with labeled data and those that do not. In this report
we have identified three distinct variants of scenarios that do not fit the classic
supervised/unsupervised organisation:

One Sided Classifiers Sometimes it is in the nature of the problem that labeled examples are
     available for one class only. This situation is described in detail in Chapter 2 and an ap-
     proach for tackling image foreground-background separation as a one-class classification
     problem is described in Section 2.1.

Active Learning It is often the case that producing a corpus of labeled training examples is an
     expensive process. This is the situation for instance when supervised learning techniques
     are used for annotating images in a collection. Active Learning refers to a general set
     of techniques for building classifiers while minimising the number of labeled examples
     required. The core research question in Active Learning is the problem of selecting the
     most useful examples to present to the user for labeling. Promising approaches for this
     are presented in Chapter 3.

Semi-Supervised Clustering Confidence in a clustering system will be undermined if it returns
     clusters that disagree with background knowledge the user may have about the objects
     under consideration. Thus it is very desirable to be able to present this background
     knowledge as an input to the clustering process. A natural form for this background knowledge
     to take is as must-link and cannot-link constraints that indicate objects that should and
     should not be grouped together. PCCA as described in section 4.2 is a framework that
     allows background knowledge expressed as pairwise constraints to influence the cluster-
     ing process. This idea is further enhanced in section 4.3 where a process for identifying
     useful pairwise constraints is introduced. This active selection of constraints is in the
     spirit of Active Learning as described in Chapter 3.

   This research on ML techniques that fall somewhere between the traditional unsupervised
and supervised alternatives is a key area of work within Muscle. This is because many of

the techniques under consideration are specialised for processing multimedia data. For this
reason there will be further activity on these topics within the network, and we should expect
further innovation.




Bibliography
 [1] N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In
     Proceedings of the Fifteenth International Conference on Machine Learning, pages 1–9.
     Morgan Kaufmann Publishers Inc., 1998.

 [2] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, April 1988.

 [3] D. Angluin, M. Frazier, and L. Pitt. Learning conjunctions of Horn clauses. Machine
      Learning, 9(2-3):147–164, July 1992.

 [4] N. Apostoloff and A. Fitzgibbon. Bayesian estimation of layers from multiple images. In
     Proc. CVPR, volume 1, pages 407–414, 2004.

 [5] S. Argamon-Engelson and I. Dagan. Committee-based sample selection for probabilistic
     classifiers. Journal of Artificial Intelligence Research, 11:335–360, November 1999.

 [6] E. Bair and R. Tibshirani. Semi-supervised methods to predict patient survival from gene
     expression data. PLoS Biol, 2(4):e108, 2004.

 [7] Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. In
     Proceedings of the Twentieth International Conference on Machine Learning, pages 19–
     26. AAAI Press, 2003.

 [8] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In
     Proceedings of 19th International Conference on Machine Learning (ICML-2002), pages
     19–26, 2002.

 [9] Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Semi-supervised clustering by
     seeding. In 19th International Conference on Machine Learning (ICML’02), pages 19–26,
     2002.

[10] Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. Comparing and unifying search-
     based and similarity-based approaches to semi-supervised clustering. In Proceedings of
     the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine
     Learning and Data Mining, pages 42–49, Washington, DC, August 2004.

[11] Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. A probabilistic framework for
     semi-supervised clustering. In Won Kim, Ron Kohavi, Johannes Gehrke, and William
     DuMouchel, editors, KDD, pages 59–68. ACM, 2004.

[12] E.B. Baum. Neural net algorithms that learn in polynomial time from examples and
     queries. IEEE Transactions on Neural Networks, 2(1):5–19, January 1991.

[13] Andrew Blake, Carsten Rother, M. Brown, Patrick Perez, and Philip H. S. Torr. Interactive
     image segmentation using an adaptive GMMRF model. In Proc. ECCV, volume 1, pages
     428–441, 2004.

[14] Nadia Bolshakova, Francisco Azuaje, and Padraig Cunningham. A knowledge-driven
     approach to cluster validity assessment. Bioinformatics, 21(10):2546–2547, 2005.

[15] Nozha Boujemaa, Julien Fauqueur, Marin Ferecatu, François Fleuret, Valérie Gouet,
     Bertrand Le Saux, and Hichem Sahbi. Ikona: Interactive generic and specific image
     retrieval. In Proceedings of the International workshop on Multimedia Content-Based
     Indexing and Retrieval (MMCBIR’2001), 2001.

[16] Yuri Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary & region
     segmentation of objects in N-D images. In Proc. ICCV, pages 105–112, 2001.

[17] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-
     flow algorithms for energy minimization in vision. PAMI, 26(9):1124–1137, 2004.

[18] Nader H. Bshouty. On learning multivariate polynomials under the uniform distribution.
     Information Processing Letters, 61(6):303–309, March 1997.

[19] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine
     Learning, 15(2):201–221, May 1994.

[20] A. Demiriz, K. Bennett, and M. Embrechts. Semi-supervised clustering using genetic
     algorithms. In C. H. Dagli et al., editor, Intelligent Engineering Systems Through Artificial
     Neural Networks 9, pages 809–814. ASME Press, 1999.

[21] A.P. Engelbrecht and R. Brits. Supervised training using an unsupervised approach to
     active learning. Neural Processing Letters, 15(3):247–260, June 2002.

[22] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by
     committee algorithm. Machine Learning, 28(2-3):133–168, August-September 1997.

[23] H. Frigui and R. Krishnapuram. Clustering by competitive agglomeration. Pattern Recog-
     nition, 30(7):1109–1119, 1997.

[24] A. Fujii, T. Tokunaga, K. Inui, and H. Tanaka. Selective sampling for example-based word
     sense disambiguation. Computational Linguistics, 24(4):573–597, December 1998.

[25] Isak Gath and Amir B. Geva. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern
     Anal. Mach. Intell., 11(7):773–780, 1989.

[26] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Unsupervised and semi-supervised
     clustering: a brief survey. In A Review of Machine Learning Techniques for Processing
     Multimedia Content. Report of the MUSCLE European Network of Excellence, July 2004.

[27] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Active semi-supervised clustering
     for image database categorization. In Content-Based Multimedia Indexing (CBMI’05),
     June 2005.

[28] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Semi-supervised fuzzy clustering
     with pairwise-constrained competitive agglomeration. In IEEE International Conference
     on Fuzzy Systems (Fuzz’IEEE 2005), May 2005.

[29] M. Hasenjager and H. Ritter. Active learning of the generalized high-low game. In Pro-
     ceedings of the 1996 International Conference on Artificial Neural Networks, pages 501–
     506. Springer-Verlag, 1996.

[30] M. Hasenjager and H. Ritter. Active learning with local models. Neural Processing Let-
     ters, 7(2):107–117, April 1998.

[31] Thomas Hofmann and Joachim M. Buhmann. Active data clustering. In Advances in
     Neural Information Processing Systems (NIPS) 10, pages 528–534, 1997.

[32] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.

[33] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing
     Surveys, 31(3):264–323, 1999.

[34] Anil K. Jain and Richard C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc.,
     1988.

[35] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via
     graph cuts? PAMI, 26(2):147–159, 2004.

[36] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learn-
     ing. Advances in Neural Information Processing Systems, 7:231–238, 1995.

[37] K.J. Lang and E.B. Baum. Query learning can work poorly when a human oracle is used.
     In International Joint Conference on Neural Networks, September 1992.

[38] D.D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning.
     In W. W. Cohen and H. Hirsh, editors, Machine Learning: Proceedings of the Eleventh
     International Conference, pages 148–56. Morgan Kaufmann, July 1994.

[39] D.D. Lewis and W.A. Gale. A sequential algorithm for training text classifiers. In W.B.
     Croft and C.J. Van Rijsbergen, editors, Proceedings of the 17th Annual International
     ACM-SIGIR Conference on Research and Development in Information Retrieval, pages
     3–12. ACM/Springer, 1994.

[40] R. Liere and P. Tadepalli. Active learning with committees for text categorization. In
     Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 591–
     596. AAAI Press, July 1997.

[41] M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor
     classifiers. Machine Learning, 54(2):125–152, February 2004.
[42] Jitendra Malik, Serge Belongie, Thomas Leung, and Jianbo Shi. Contour and texture
     analysis for image segmentation. IJCV, 43(1):7–27, 2001.
[43] Shaul Markovitch and Paul D. Scott. Information filtering: Selection mechanisms in learn-
     ing systems. Machine Learning, 10(2):113–151, 1993.
[44] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human
     segmented natural images and its application to evaluating segmentation algorithms and
     measuring ecological statistics. In Proc. ICCV, pages 416–425, 2001.
[45] David R. Martin, Charless C. Fowlkes, and Jitendra Malik. Learning to detect natural
     image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530–549,
     2004.
[46] P. Melville and R. Mooney. Diverse ensembles for active learning. In Proceedings of the
     21st International Conference on Machine Learning, pages 584–591, July 2004.
[47] B. Mičušík and A. Hanbury. Steerable semi-automatic segmentation of textured images.
     In Proc. Scandinavian Conference on Image Analysis (SCIA), pages 35–44, 2005.
[48] B. Mičušík and A. Hanbury. Supervised texture detection in images. In Proc. Conf. on
     Computer Analysis of Images and Patterns (CAIP), 2005.
[49] I. Muslea, S. Minton, and C.A. Knoblock. Selective sampling with redundant views. In
     Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth
     Conference on Innovative Applications of Artificial Intelligence, pages 621–626. AAAI
     Press / The MIT Press, 2000.
[50] H.T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of
     the 21st International Conference on Machine Learning, July 2004.
[51] N. Paragios and R. Deriche. Coupled geodesic active regions for image segmentation: A
     level set approach. In Proc. ECCV, volume II, pages 224–240, 2000.
[52] N. Paragios and R. Deriche. Geodesic active regions and level set methods for supervised
     texture segmentation. IJCV, 46(3):223–247, 2002.
[53] J.M. Park. On-line learning by active sampling using orthogonal decision support vectors.
     In Neural Networks for Signal Processing - Proc. IEEE Workshop. IEEE, December 2000.
[54] S.B. Park, B.T. Zhang, and Y.T. Kim. Word sense disambiguation by learning decision
     trees from unlabeled data. Applied Intelligence, 19(1-2):27–38, July-October 2003.
[55] T. Raychaudhuri and L.G.C. Hamey. Minimisation of data collection by active learning.
     In IEEE International Conference on Neural Networks Proceedings, pages 1338–1341.
     IEEE, 1995.

[56] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In Proc. CVPR, volume 1,
     pages 18–25, 2000.

[57] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Jour-
     nal of the American Society for Information Science, 41(4):288–297, 1990.

[58] Paul D. Scott and Shaul Markovitch. Experience selection and problem choice in an
     exploratory learning system. Machine Learning, 12:49–67, 1993.

[59] P.D. Scott and S. Markovitch. Experience selection and problem choice in an exploratory
     learning system. Machine Learning, 12:49–67, 1993.

[60] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the
     Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM Press,
     1992.

[61] Jianbo Shi and Jitendra Malik.     Normalized cuts and image segmentation.        PAMI,
     22(8):888–905, 2000.

[62] S. Soderland. Learning information extraction rules for semi-structured and free text.
     Machine Learning, 34(1-3):233–272, February 1999.

[63] P. Sollich. Query construction, entropy, and generalization in neural network models.
     Physical Review E, 49(5):4637–4651, May 1994.

[64] X. Yu Stella. Segmentation using multiscale cues. In Proc. CVPR, volume 1, pages 247–
     254, 2004.

[65] J.V. Stone. Independent Component Analysis: A Tutorial Introduction. MIT Press, 2004.

[66] Kah-Kay Sung and P. Niyogi. Active learning the weights of an RBF network. In Neural
     Networks for Signal Processing, September 1995.

[67] I. R. Teixeira, Francisco de A. T. de Carvalho, G. Ramalho, and V. Corruble. Activecp:
     A method for speeding up user preferences acquisition in collaborative filtering systems.
     In G. Bittencourt and G. Ramalho, editors, 16th Brazilian Symposium on Artificial Intelli-
     gence, SBIA 2002, pages 237–247. Springer, November 2002.

[68] C.A. Thompson, M.E. Califf, and R.J. Mooney. Active learning for natural language
     parsing and information extraction. In I. Bratko and S. Dzeroski, editors, Proceedings
     of the Sixteenth International Conference on Machine Learning, pages 406–414. Morgan
     Kaufmann Publishers Inc., June 1999.

[69] S. Tong and D. Koller. Support vector machine active learning with applications to text
     classification. Journal of Machine Learning Research, 2:45–66, 2001.

[70] S. Tong and D. Koller. Support vector machine active learning with applications to text
     classification. The Journal of Machine Learning Research, 2:45–66, November 2001.

[71] F. Tsung. Statistical monitoring and diagnosis of automatic controlled processes using
     dynamic PCA. International Journal of Production Research, 38(3):625–627, 2000.

[72] D. Tur, G. Riccardi, and A. L. Gorin. Active learning for automatic speech recognition.
     In Proceedings of IEEE ICASSP 2002, May 2002.

[73] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In Proceedings of
     the 17th International Conference on Machine Learning, pages 1103–1110, 2000.

[74] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. Constrained k-means clus-
     tering with background knowledge. In Carla E. Brodley and Andrea Pohoreckyj Danyluk,
     editors, ICML, pages 577–584. Morgan Kaufmann, 2001.

[75] Y. Wexler, A. Fitzgibbon, and A. Zisserman. Bayesian estimation of layers from multiple
     images. In Proc. ECCV, volume 3, pages 487–501, 2002.

[76] P. H. Winston. Learning structural descriptions from examples. In P. H. Winston, editor,
     The Psychology of Computer Vision. McGraw-Hill, New York, 1975.

[77] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart J. Russell. Distance metric
     learning with application to clustering with side-information. In Suzanna Becker, Sebas-
     tian Thrun, and Klaus Obermayer, editors, NIPS, pages 505–512. MIT Press, 2002.

[78] Kevin Y. Yip, David W. Cheung, and Michael K. Ng. On discovery of extremely low-
     dimensional clusters using semi-supervised projected clustering. In ICDE, pages 329–340.
     IEEE Computer Society, 2005.

[79] Ramin Zabih and Vladimir Kolmogorov. Spatially coherent clustering using graph cuts.
     In Proc. CVPR, volume 2, pages 437–444, 2004.

[80] C. Zhang and T. Chen. Annotating retrieval database with active learning. In Proceedings.
     2003 International Conference on Image Processing, pages II: 595–598. IEEE, September
     2003.





 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign DetectionCraig Ferguson
 
A buffer overflow study attacks and defenses (2002)
A buffer overflow study   attacks and defenses (2002)A buffer overflow study   attacks and defenses (2002)
A buffer overflow study attacks and defenses (2002)Aiim Charinthip
 
Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...
Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...
Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...Jason Cheung
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...stainvai
 
Pattern classification via unsupervised learners
Pattern classification via unsupervised learnersPattern classification via unsupervised learners
Pattern classification via unsupervised learnersNick Palmer
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Webpfleidi
 

Similar to Progress on Semi-Supervised Machine Learning Techniques (20)

Location In Wsn
Location In WsnLocation In Wsn
Location In Wsn
 
Computer Security: A Machine Learning Approach
Computer Security: A Machine Learning ApproachComputer Security: A Machine Learning Approach
Computer Security: A Machine Learning Approach
 
Masters Thesis: A reuse repository with automated synonym support and cluster...
Masters Thesis: A reuse repository with automated synonym support and cluster...Masters Thesis: A reuse repository with automated synonym support and cluster...
Masters Thesis: A reuse repository with automated synonym support and cluster...
 
Thesis
ThesisThesis
Thesis
 
Inkscape industrial project report
Inkscape industrial project reportInkscape industrial project report
Inkscape industrial project report
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...
 
Home automation
Home automationHome automation
Home automation
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign Detection
 
Oop c++ tutorial
Oop c++ tutorialOop c++ tutorial
Oop c++ tutorial
 
A buffer overflow study attacks and defenses (2002)
A buffer overflow study   attacks and defenses (2002)A buffer overflow study   attacks and defenses (2002)
A buffer overflow study attacks and defenses (2002)
 
Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...
Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...
Trinity Impulse - Event Aggregation to Increase Stundents Awareness of Events...
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...
 
Pattern classification via unsupervised learners
Pattern classification via unsupervised learnersPattern classification via unsupervised learners
Pattern classification via unsupervised learners
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
 
test5
test5test5
test5
 
Sdd 2
Sdd 2Sdd 2
Sdd 2
 
test6
test6test6
test6
 
test4
test4test4
test4
 
test5
test5test5
test5
 
test6
test6test6
test6
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Progress on Semi-Supervised Machine Learning Techniques

  • 2. Contributing Authors

Nozha Boujemaa (INRIA-IMEDIA) nozha.boujemaa@inria.fr
Michel Crucianu (INRIA-IMEDIA) michel.crucianu@inria.fr
Pádraig Cunningham (TCD) Padraig.Cunningham@cs.tcd.ie
Nizar Grira (INRIA-IMEDIA) nizar.grira@inria.fr
Allan Hanbury (TU Wien) hanbury@prip.tuwien.ac.at
Shaul Markovitch (Technion) shaulm@cs.Technion.ac.il
Branislav Mičušík (TU Wien) micusik@prip.tuwien.ac.at
  • 3. Contents

WP7: State of the Art 1
Contents 3
1 Introduction 5
2 One-Class Classification Problems 7
  2.1 One-class learning for image segmentation 8
    2.1.1 Segmentation 10
    2.1.2 Experiments 12
    2.1.3 Conclusion 14
3 Active Learning 15
  3.1 Active Learning 16
  3.2 The Source of the Examples 17
    3.2.1 Example Construction 17
    3.2.2 Selective Sampling 17
  3.3 Example Evaluation and Selection 18
    3.3.1 Uncertainty-based Sampling 18
    3.3.2 Committee-based Sampling 19
    3.3.3 Clustering 20
    3.3.4 Utility-Based Sampling 20
  3.4 Conclusion 21
4 Semi-Supervised Clustering 23
  4.1 Semi-supervised Clustering of Images 24
  4.2 Pairwise-Constrained Competitive Agglomeration 25
  4.3 Active Selection of Pairwise Constraints 26
  4.4 Experimental Evaluation 28
5 General Conclusions 31
Bibliography 33
  • 5. Chapter 1 Introduction

A fundamental distinction in machine learning research is that between unsupervised and supervised techniques. In supervised learning, the learning system is presented with labelled examples and induces a model from this data that will correctly label other unlabelled data. In unsupervised learning the data is unlabelled, and all the learning system can do is organise or cluster the data in some way. In processing multimedia data, some important real-world problems have emerged that do not fall into these two categories but have some characteristics of both:
• One-Class Classification Problems: abundant training data is available for one class of a binary classification problem but not the other (e.g. process monitoring, text classification).
• Active Learning: training data is unlabelled, but a 'supervisor' is available to label examples at the request of the learning system; the system must select examples for labelling that make best use of the supervisor's effort (e.g. labelling images for image retrieval).
• Semi-Supervised Clustering: it is often the case that background knowledge is available that can help in the clustering of data. Such extra knowledge can enable the discovery of small but significant bi-clusters in gene expression data. More generally, the use of background knowledge avoids situations where the clustering algorithm is obviously 'getting it wrong'.
Since the Machine Learning (ML) research within the MUSCLE network is applications-driven, there is considerable interest among the network members in these research issues. This report describes some initial research under each of these three headings, and we can expect to see a good deal of activity in this area, which lies between unsupervised and supervised ML techniques, in the future.
In chapter 2 of this report we provide an overview of One-Class Classification problems and describe an application in image segmentation where foreground/background separation is treated in a 'one-class' framework. In chapter 3 an overview of Active Learning is presented; this overview is particularly focused on sample selection, the core research issue in Active Learning. Chapter 4 begins with an overview of Semi-Supervised Clustering and describes some research from the network on the Semi-Supervised Clustering of images.
  • 7. Chapter 2 One-Class Classification Problems

In traditional supervised classification problems, discriminating classifiers are trained using positive and negative examples. However, for a significant number of practical problems, counter-examples are either rare, entirely unavailable or statistically unrepresentative. Such problems include industrial process control, text classification and the analysis of chemical spectra. One-sided classifiers have emerged as a technique for situations where labelled data exists for only one of the classes in a two-class problem. For instance, in industrial inspection tasks, abundant data may only exist describing the process operating correctly. It is difficult to put together training data describing the myriad ways the system might operate incorrectly. A related problem is where negative examples exist, but their distribution cannot be characterised. For example, it is reasonable to provide characteristic examples of family pictures but impossible to provide examples of pictures that are 'typical' of non-family pictures. In this project we will implement a comprehensive toolkit for building one-sided classifiers. These one-sided classification problems arise in a variety of situations:
• industrial inspection and process control
• text classification
• image labelling
A basic technique for developing one-class classifiers is to use Principal Component Analysis (PCA) or Independent Component Analysis (ICA). Details on PCA and ICA can be found in [65]. The basic principle is best explained with reference to Figure 2.1. With this approach, PCA is performed on the labelled training data. Essentially this transforms the data into the new space of principal components (PCs). These PCs are ranked so that the first PC captures most of the variation in the data, and so forth. It is normal to drop the lesser PCs, as most of the information will be captured in the first two or three PCs.
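To make this concrete, the following is a minimal sketch (not the project toolkit) of a PCA-based one-sided classifier: PCA is fitted on positive-only training data, and an unlabelled example is accepted if it falls inside a 2σ ellipsoid in the space of the leading PCs. The data, the choice of three components and the threshold are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic labelled data for the single known class (e.g. "process OK"):
# 10 features with decreasing variance.
X = rng.normal(0.0, 1.0, size=(500, 10)) * np.linspace(3.0, 0.1, 10)

# PCA via SVD on the centred data; keep the first k principal components.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
k = 3
components = Vt[:k]                   # top-k PCs (rows)
sigma = S[:k] / np.sqrt(len(X) - 1)   # std-dev of the data along each PC

def is_positive(x, n_sigma=2.0):
    """Accept x as the known class if it lies inside the n_sigma
    ellipsoid in the space of the first k principal components."""
    z = (x - mu) @ components.T / sigma   # standardised PC coordinates
    return float(np.sum(z ** 2)) <= n_sigma ** 2

print(is_positive(mu))          # the training mean is inside the ellipsoid
print(is_positive(mu + 100.0))  # a far-away point is rejected
```

As the text notes, this baseline can discard discriminating low-variance directions, since the PCs are chosen without reference to any negative class.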
Figure 2.1 shows a cloud of examples described in a 3D space defined by the first 3 PCs. The different ellipsoids in the diagram define regions bounded by 1σ, 2σ and 3σ in the PCs. A classifier could operate by classifying any unlabelled example that falls within the 2σ ellipsoid as a positive example and everything outside as a negative example. The problem with this is that PCA and ICA are unsupervised techniques for dimension reduction. PCs are being discarded because the data varies little in those dimensions; it is
  • 8. Figure 2.1: A graphic view of the PCA approach to one-sided classification

nevertheless possible that those dimensions are discriminating. Such a classifier will provide a baseline for comparison, as this PCA-based approach is commonly used in industry [71].

2.1 One-class learning for image segmentation

Image segmentation can be viewed as a partitioning of the image into regions having some similar properties, e.g. colour, texture or shape, or as a partitioning of the image into semantically meaningful parts, as people do. Moreover, measuring the goodness of segmentations in general is an unsolved problem, and obtaining absolute ground truth is almost impossible, since different people produce different manual segmentations of the same images [44]. There are many papers dealing with automatic segmentation. We must mention the well-known work of Shi & Malik [61] based on normalized cuts, which segments the image into non-overlapping regions. They introduced a modification of graph cuts, namely normalized graph cuts, and provided an approximate closed-form solution. However, the boundaries of detected regions often do not follow the true boundaries of the objects. The work [64] is a follow-up to [61] in which the segmentation is improved by performing it at various scales. One possibility to partially avoid the ill-posed problem of image segmentation is to use additional constraints. Such constraints can be i) motion in the image caused either by camera motion or by motion of objects in the scene [56, 75, 4], or ii) specified foreground object properties [16, 52, 13]. Here we concentrate on an easier task than fully automatic segmentation. We constrain the segmentation by using a small user-provided template patch. We search for the segments of the image coherent in terms of colour with the provided template. The texture of an input image
  • 9. Figure 2.2: Supervised segmentation. (a) Original image; the top image shows the marked region from which the template was cut. (b) The enlarged template patch. (c) Binary segmentation with masked original image.

is taken into account to correctly detect the boundaries of textured regions. The proposed method offers the possibility to learn semi-automatically what the template patch should look like so as to be representative of an object, leading to a correct one-class segmentation. The proposed technique could be useful for segmentation and for detecting objects in images with a characteristic, a priori known property defined through a template patch. Fig. 2.2 shows how a tiger can be detected in the image using a small template patch from the same image. The same template patch can be used to detect the tiger in another image even though the lighting conditions are slightly different, as shown in Fig. 2.2. We follow the idea of interactive segmentation given in [16], where the user has to specify some pixels belonging to the foreground and some to the background. Such labelled pixels provide a strong constraint for further segmentation based on the min-cut/max-flow algorithm given in [17]. However, the method [16] was designed for greyscale images and thus most of the information is thrown away. We improved the method in [47] to cope with colour and texture images. However, seeds for both the background and the foreground objects were still needed. In this work we avoid the need for background seeds; only seeds for the foreground object need to be specified. This is therefore a one-class learning problem, as we have a sample of what we are looking for (foreground), but the characteristics of the rest of the image (background) are unknown. In [79] the spatial coherence of the pixels is handled together with standard local measurements (intensity, colour). They propose an energy function that operates simultaneously in feature space and in image space.
Some forms of such an energy function are studied in [35]. In
  • 10. our work we follow a similar strategy. However, we define the neighborhood relation through the brightness, colour and texture gradients introduced in [42, 45]. In [52] an idea similar to ours is treated. In principle, the result is the same, but the strategy used to obtain it differs. The boundary of a textured foreground object is obtained by minimization (through the evolution of the region contour) of energies inside and outside the region. The Geodesic Active Region framework is used to propagate the region contour. However, the texture information for the foreground has to be specified by the user. In [51] the user interaction is omitted. First, the number of regions is estimated by fitting a mixture of Gaussians to the intensity histogram; this estimate is then used to drive the region evolution. However, such a technique cannot be used for textured images. One textured region can be composed of many colours, and therefore the Gaussian components say nothing about the number of dominant textures. Our main contribution lies in incorporating the information included in the template patch into the graph representing the image, leading to a reasonable binary image segmentation. Our method does not need seeds for both the foreground and the background as in [16, 47]. Only a representative template patch of the object being searched for is required; see Fig. 2.2. Moreover, texture information is taken into account. This section is organised as follows. First, segmentation based on the graph cut algorithm is outlined. Second, we describe the non-parametric computation, through histograms, of the probability of a point being foreground/background, and how the template patch information is incorporated. Finally, results and a summary conclude the section.

2.1.1 Segmentation

We use the seed segmentation technique [47] based on the interactive graph cut method [16].
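The graph-cut formulation used here can be illustrated with a toy sketch: each pixel is linked to terminal nodes F and B with capacities given by the region penalties, neighbouring pixels are linked by boundary weights, and the minimum cut yields the binary labelling. This is not the report's C++ implementation; it is a tiny illustrative example (four "pixels", made-up probabilities and weights) using the networkx library.

```python
import math
import networkx as nx

# Toy 1-D "image" of four pixels; p_B[q] is an assumed probability that
# pixel q's colour belongs to the background (from colour histograms).
p_B = {"p0": 0.05, "p1": 0.05, "p2": 0.95, "p3": 0.95}
lam = 1.0  # weight of region terms vs. boundary terms

G = nx.Graph()
for q, pb in p_B.items():
    # t-links to the terminals, with the penalties of Eq. (2.1):
    G.add_edge(q, "F", capacity=lam * -math.log(pb))       # R_F|q = -ln p(B|c_q)
    G.add_edge(q, "B", capacity=lam * -math.log(1 - pb))   # R_B|q = -ln p(F|c_q)
# n-links between neighbouring pixels (illustrative constant weight W)
for a, b in [("p0", "p1"), ("p1", "p2"), ("p2", "p3")]:
    G.add_edge(a, b, capacity=0.5)

cut_value, (fg_side, bg_side) = nx.minimum_cut(G, "F", "B")
foreground = sorted(fg_side - {"F"})
print(foreground)   # pixels left connected to F are labelled foreground
```

Pixels whose colours look like the template are expensive to separate from F, so the cheapest cut runs through the weak n-link between the two regions, exactly the behaviour the text describes.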
The core of the segmentation method is an efficient algorithm [17] for finding the min-cut/max-flow in a graph. We very briefly outline the construction of the graph representing the image, which is then used for finding its min-cut. We describe in more detail the stage of building the pixel penalties for being foreground or background.

Graph representing the image

The general framework for building the graph is depicted in Fig. 2.3 (left). The graph is shown for a 9-pixel image and an 8-point neighborhood N. For general images, the graph has as many nodes as pixels, plus two extra nodes labelled F and B. In addition, the pixel neighborhood is larger; e.g. we use a window of size 21 × 21 pixels. For details on setting Wq,r, the reader is referred to [47, 48]. To fill this matrix, the combined boundary probability image produced by the colour and texture gradients introduced in [42, 45] is used. The reason is that our main emphasis is on boundaries at the transitions between differently textured regions, and not on local changes inside one texture. However, there are usually large responses of edge detectors inside a texture. To detect boundaries in images correctly, the colour changes and the texturedness of the regions have to be taken into account, as in [42, 45]. Each node in the graph is connected to the two extra nodes F and B. This allows the incorporation of the information provided by the seed, and a penalty for each pixel being foreground or
  • 11. Figure 2.3: Left: graph representation for a 9-pixel image. Right: table defining the costs of graph edges; λ is a constant weighting the importance of being foreground/background vs. the mutual bond between pixels.

edge    | cost      | region
{q, r}  | W_{q,r}   | {q, r} ∈ N
{q, F}  | λ R_{F|q} | ∀q
{q, B}  | λ R_{B|q} | ∀q

background to be set. The penalty of a point being foreground F or background B is defined as follows:

R_{F|q} = − ln p(B|c_q),   R_{B|q} = − ln p(F|c_q),   (2.1)

where c_q = (c^L, c^a, c^b) is a vector in R^3 of CIELAB values at the pixel q. We use the CIELAB colour space as this results in better performance. This colour space has the advantage of being approximately perceptually uniform. Furthermore, Euclidean distances in this space are perceptually meaningful, as they correspond to colour differences perceived by the human eye. Another reason for the good performance of this space could be that in calculating the colour probabilities below, we make the assumption that the three colour channels are statistically independent. This assumption holds better in the CIELAB space than in the RGB space, as the CIELAB space separates the brightness (on the L axis) and colour (on the a and b axes) information. The posterior probabilities are computed as

p(B|c_q) = p(c_q|B) / ( p(c_q|B) + p(c_q|F) ),   (2.2)

where the prior probabilities are p(c_q|F) = f^L(c^L) · f^a(c^a) · f^b(c^b) and p(c_q|B) = b^L(c^L) · b^a(c^a) · b^b(c^b), and where f^{L,a,b}(i), resp. b^{L,a,b}(i), represents the foreground, resp. background, histogram of each colour channel separately at the i-th bin, smoothed by a Gaussian kernel. All histogram channels are smoothed using one-dimensional Gaussians, written as

f̄_i = G ∑_{j=1}^{N} f_j e^{−(j−i)² / (2σ²)},

where G is a normalization factor enforcing ∑_{i=1}^{N} f̄_i = 1. In our case the number of histogram bins is N = 64. We used σ = 1, since experiments showed that this is a reasonable value. λ from the table in Fig.
2.3 was set to 1000. The foreground histogram is computed from all pixels in the template patch. The computation of the background histogram b^{L,a,b} is based on the following arguments.
  • 12. Figure 2.4: Comparison of two histograms. The first, shown in (a), is computed from all image pixels; the second, shown in (b), is computed from template patch points. The template patch is cut out from the image at the position shown as a light square. The histograms are only schematic and are shown for one colour channel only, to illustrate the idea described in the text.

We suggest computing the background histogram from all image pixels, since we know a priori neither the colours nor a template patch of the background. The basic idea behind this is the assumption that the histogram computed from all points includes information on all colours (background and foreground) in the image; see Fig. 2.4. Assume that the texture is composed of two significant colours, as in the template patch in Fig. 2.4(b). Two significant colours create two peaks in the histograms, and the same two peaks appear in both histograms. But since ∑_{i=1}^{N} b̄_i = 1, the probability p(c_q|B) gives smaller values than p(c_q|F) for the colours present in the template. Thus, points more similar to the template are bound in the graph more strongly to the foreground node than to the background node. This causes the min-cut to go through the weaker edge with higher probability. However, the edges connecting the points in the neighbourhood play a role in the final cut as well; hence the words "higher probability", because it cannot be said that the min-cut always cuts the weaker of the two edges connecting a pixel with the F and B nodes. An inherent assumption in this approach is that the area in the image of the region matching the texture patch is "small" compared to the area of the whole image. Experiments on the region size at which this assumption breaks down remain to be done.

2.1.2 Experiments

The segmentation method was implemented in MATLAB.
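Before turning to the experiments, the penalty computation of Eqs. (2.1) and (2.2) is easy to prototype. The sketch below is a single-channel Python version (the report's own implementation was MATLAB/C); the pixel data is synthetic, with the foreground histogram taken from a "template" concentrated around one colour bin and the background histogram taken from all pixels, as described above.

```python
import numpy as np

N = 64        # number of histogram bins, as in the report
SIGMA = 1.0   # Gaussian smoothing width

def smooth(hist, sigma=SIGMA):
    """Smooth a histogram with a 1-D Gaussian kernel and renormalise
    so that the bins sum to 1 (the factor G in the text)."""
    i = np.arange(len(hist))
    kernel = np.exp(-(i[None, :] - i[:, None]) ** 2 / (2 * sigma ** 2))
    smoothed = kernel @ hist
    return smoothed / smoothed.sum()

rng = np.random.default_rng(1)
# Synthetic single-channel pixels: the background histogram uses ALL
# pixels, the foreground histogram only the template patch.
all_pixels = rng.integers(0, N, size=5000)
template_pixels = rng.normal(20, 2, size=400).astype(int).clip(0, N - 1)

b = smooth(np.bincount(all_pixels, minlength=N).astype(float))
f = smooth(np.bincount(template_pixels, minlength=N).astype(float))

eps = 1e-12
# Eq. (2.2): posterior probability that a colour bin belongs to the background.
p_B_given_c = b / (b + f + eps)
# Eq. (2.1): penalties used as t-link weights in the graph.
R_F = -np.log(p_B_given_c + eps)        # weight of the {q, F} edge
R_B = -np.log(1 - p_B_given_c + eps)    # weight of the {q, B} edge

# Template colours (around bin 20) bind strongly to the foreground terminal.
print(R_F[20] > R_F[50])
```

Because the background histogram also contains the template's peaks but is normalised over the whole image, template colours end up with a large R_F, which is exactly the "weaker edge" argument made above.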
Some of the most time-consuming operations (such as creating the graph edge weights) were implemented in C and interfaced with MATLAB through mex-files. We took advantage of the sparse matrices offered directly by MATLAB. We used the C++ implementation of the min-cut algorithm [17] available online, and some MATLAB code for colour and texture gradient computation [42]. The most time-consuming part of the segmentation process is creating the weight matrix W: it takes 50 seconds for a 250 × 375 image on a Pentium 4 @ 2.8 GHz. Implementing the texture gradient in C would dramatically speed up the computation. Once the graph is built, finding the min-cut takes 2–10 seconds. In all our experiments we used images from the Berkeley database [32]. We marked a small "representative" part of the image and used it for segmentation of the image. See Fig. 2.5 for the results. From the results it can be seen that very good segmentation can be
  • 13. Figure 2.5: Results (we recommend viewing a colour version of this figure). 1st column: enlarged image template patch. 2nd column: input image with the marked area used as the template. 3rd column: binary segmentation. 4th column: segmentation with masked original image.
obtained even though only the colour histogram of the template patch is taken into account. It is also possible to apply a template patch obtained from one image to the segmentation of another. In the case depicted in Fig. 2.2, a small tiger patch encoding the tiger's colours, obtained from one image, is used to find the tiger in another image. Most of the tiger's body was captured, but some pixels belonging to the background were segmented as well. Such "non-tiger" regions could be pruned by some further procedure, which is not discussed in this paper. This opens new possibilities for the use of the method, e.g. in image retrieval applications: given a representative image template, images coherent in colour and texture can be found in large databases. This application remains to be investigated.

2.1.3 Conclusion

We suggest a method for supervised texture segmentation in images using one-class learning. The method is based on finding the min-cut/max-flow in a graph representing the image to be segmented. We described how to exploit the information present in a small representative template patch provided by the user, and proposed a new strategy that avoids the need for a background template or a priori information about the background. The one-class learning problem is difficult in this case, as no information about the background is known a priori. Experiments on images from the Berkeley database show that the method gives reasonable results. The method also makes it possible to detect, in an image database, objects coherent with a provided representative template patch. A question still to be considered is what the representative template patch for each class of interest should look like; for example, the template patch should be invariant under illumination changes, perspective distortion, etc.
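The histogram-based weighting described above can be sketched in a few lines (a NumPy sketch, not the authors' MATLAB implementation; the bin count and the toy image are invented for illustration, and only the per-pixel likelihoods p(c|F) and p(c|B) are computed, not the min-cut itself):

```python
import numpy as np

def colour_histogram(pixels, bins=16):
    """Normalised histogram over colour values in [0, 1]."""
    hist, _ = np.histogram(pixels, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float) + 1e-6        # smoothing: avoid zero probabilities
    return hist / hist.sum()

def terminal_weights(image, template, bins=16):
    """Per-pixel likelihoods: p(c|F) from the template patch and p(c|B)
    from *all* image pixels, as suggested in the text."""
    fg = colour_histogram(template.ravel(), bins)
    bg = colour_histogram(image.ravel(), bins)
    idx = np.clip((image * bins).astype(int), 0, bins - 1)
    return fg[idx], bg[idx]

# toy single-channel image: a small bright region on a dark background
image = np.full((8, 8), 0.1)
image[2:4, 2:4] = 0.9
template = np.full((2, 2), 0.9)             # patch cut out of the bright region
p_f, p_b = terminal_weights(image, template)
# pixels matching the template are tied more strongly to the foreground node
assert p_f[2, 2] > p_b[2, 2] and p_b[0, 0] > p_f[0, 0]
```

Because the background histogram is computed over all pixels, template colours receive a non-zero background probability as well, but a smaller one than their foreground probability, which is exactly the asymmetry the min-cut exploits.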
Chapter 3 Active Learning

In many potential applications of machine learning it is difficult to obtain an adequate set of labelled examples to train a supervised machine learning system. This may be because:

• a vast amount of training data is available (e.g. astrophysical data, image retrieval or text mining),
• labelling is expensive or time consuming (bioinformatics or medical applications), or
• the labels are ephemeral, as when assigning relevance in information retrieval.

In these circumstances an Active Learning (AL) methodology is appropriate: the learner is presented with a stream or a pool of unlabelled examples and may request a 'supervisor' to assign labels to a subset of these. The idea of AL has been around since the early 1990s (e.g. [59]); however, it remains an interesting research area today, as many applications of AL need to start learning with very small labelled data sets. The two motivating examples we will explore are the image indexing problem described in B.1.3 and the problem of bootstrapping a message classification system. Both of these scenarios suffer from the 'small sample problem' that has been considered in relevance feedback in information retrieval systems.

Relevance feedback as studied in Information Retrieval [57] is closely related to Active Learning in the sense that both need to induce models from small samples. Relevance feedback (RF) has also been used in content-based image retrieval (Rui et al. 1998). The important difference is that RF searches for a target document or image, whereas the objective of AL is to induce a classifier from a minimal amount of data. In addition, RF is transductive rather than inductive, in that the algorithm has access to the full population of examples. Also, in RF, examples are selected with two objectives in mind.
It is expected that they will prove informative to the learner when labelled, and it is hoped that they will satisfy the user's information need. In AL, by contrast, examples are selected purely with the objective of maximising information for the learner.

This issue of example selection is at the heart of research on AL. The normal guiding principles are to select the unlabelled examples on which the learner is least certain, or alternatively to select the examples that will prove most informative for the learner; these often boil down to the same thing. A classic strategy is that described by Tong and Koller [69] for Active Learning
with SVMs. They present a formal analysis based on version spaces showing that a good policy for selecting samples for AL from a pool of data is to select the examples that are closest to the separating hyperplane of the SVM. A better but more expensive policy is to select the example that would produce the smallest margin in the next version of the SVM. Clearly, these policies are less reliable when starting with a very small number of labelled examples, as the version space is then very large and the location of the hyperplane is unstable. It is also desirable to be able to select sets of examples together, so it would be useful to introduce some notion of diversity to maximise the utility of these complete sets.

3.1 Active Learning

Learning is a process whereby an agent uses a set of experiences to improve its problem solving capabilities. In the passive learning model, the learner passively receives a stream of experiences and processes them. In the active learning model, the learner has some control over its training experiences. Markovitch and Scott [43] define the information filtering framework, which specifies five types of selection mechanisms that a learning system can employ to increase the utility of the learning process. Active learning can be viewed in this model as a selective experience filter. Why should a learner be active? There are two major reasons for employing experience selection mechanisms:

1. If some experiences have negative utility – the quality of learning would have been better without them – then it is obviously desirable to filter them out. For example, noisy examples are usually harmful.

2. Even when all the experiences are of positive utility, there is a need to be selective, or to have control over the order in which they are processed, because the learning agent has bounded training resources which should be managed carefully.
In passive learning systems, these problems are either ignored, leading to inefficient learning, or solved by relying on a human teacher to select informative training experiences [76]. It may sometimes be advantageous for a system to select its own training experiences even when an external teacher is available. Scott and Markovitch [58] argue that a learning system has an advantage over an external teacher in selecting informative examples because, unlike a teacher, it can directly access its own knowledge base. This is important because the utility of a training experience depends upon the current state of the knowledge base.

In this summary we concentrate on active learning for supervised concept induction. Most work in this area assumes that there are high costs associated with tagging examples. Consider, for example, training a classifier for a character recognition problem. In this case, the character images are easily available, while their classification is a costly process requiring a human operator. In the following subsections we differentiate between pool-based selective sampling and example construction approaches, define a formal framework for selective sampling algorithms and present various approaches for solving the problem.
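As a concrete instance of selective sampling, the "simple margin" policy of Tong and Koller discussed earlier can be sketched as follows (a sketch using scikit-learn, which is an assumption; the toy pool and seed labels are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two labelled seed points, one per class, plus a pool of unlabelled points
X_seed = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_seed = np.array([0, 1])
X_pool = rng.uniform(-3, 3, size=(200, 2))

clf = SVC(kernel="linear").fit(X_seed, y_seed)
# "simple margin": query the pool point closest to the separating hyperplane,
# i.e. the one with the smallest absolute decision value
margins = np.abs(clf.decision_function(X_pool))
query_idx = int(np.argmin(margins))
print(X_pool[query_idx])    # a point close to the current decision boundary
```

The queried point is then labelled by the teacher, added to the training set, and the SVM is retrained; the more expensive policy mentioned in the text would instead retrain once per candidate to measure the resulting margin.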
3.2 The Source of the Examples

One characteristic in which active learning algorithms vary is the source of input examples. Some algorithms construct input examples, while others select from a pool or a stream of unlabeled examples.

3.2.1 Example Construction

Active learning by example construction was formalized and theoretically studied by Angluin [2], who developed a model that allows the learner to ask two types of queries: a membership query, where the learner selects an instance x and is told its membership in the target concept, f(x); and an equivalence query, where the learner presents a hypothesis h and is either told that h ≡ f or given a counterexample. Many polynomial-time algorithms based on Angluin's model have been presented for learning target classes such as Horn sentences [3] and multivariate polynomials [18]. In each case, a domain-specific query construction algorithm was presented. In practical experiments with example construction, some algorithms try to optimize an objective function such as information gain or generalization error and derive expressions for optimal queries in certain domains [63, 66]; other algorithms find classification borders and regions in the input space and construct queries near the borders [12, 30]. A major drawback of query construction was shown by Baum and Ladner [37]: when a query construction method was applied to the domain of images of handwritten characters, many of the images constructed by the algorithm did not contain any recognizable characters. This is a problem in many real-world domains.

3.2.2 Selective Sampling

Selective sampling avoids the problem described above by assuming that a pool of unlabeled examples is available. Most algorithms work iteratively, evaluating the unlabeled examples and selecting one for labeling.
Alternatively, it is assumed that the algorithm receives a stream of unlabeled examples, and for each example a decision is made (according to some evaluation criterion) whether to query for a label or not. The first approach can be formalized as follows. Let 𝒳 be an instance space, a set of objects described by a finite collection of attributes (features). Let f : 𝒳 → {0, 1} be a teacher (also called an expert) that labels instances as 0 or 1. A (supervised) learning algorithm takes a set of labeled examples, {⟨x_1, f(x_1)⟩, . . . , ⟨x_n, f(x_n)⟩}, and returns a hypothesis h : 𝒳 → {0, 1}. Let X ⊆ 𝒳 be an unlabeled training set. Let D = {⟨x_i, f(x_i)⟩ : x_i ∈ X, i = 1, . . . , n} be the training data – a set of labeled examples from X. A selective sampling algorithm S_L, specified relative to a learning algorithm L, receives X and D as input and returns an unlabeled element of X. The process of learning with selective sampling can be described as an iterative procedure: at each iteration the selective sampling procedure is called to obtain an unlabeled example, and the teacher is called to label that example. The labeled example is added to the set of currently available labeled examples, and the updated set is given to the learning procedure, which induces a new classifier. This sequence repeats until some stopping criterion is satisfied.
Active Learner(X, f()):
1. D ← ∅.
2. h ← L(∅).
3. While the stopping criterion is not satisfied do:
   a) x ← S_L(X, D)        ; apply S_L to get the next example
   b) ω ← f(x)             ; ask the teacher to label x
   c) D ← D ∪ {⟨x, ω⟩}     ; update the set of labeled examples
   d) h ← L(D)             ; update the classifier
4. Return classifier h.

Figure 3.1: Active learning with selective sampling. Active Learner is defined by specifying a stopping criterion, a learning algorithm L and a selective sampling algorithm S_L. It then works with unlabeled data X and teacher f() as input.

This criterion may be a resource bound, M, on the number of examples that the teacher is willing to label, or a lower bound on the desired classification accuracy. Adopting the first stopping criterion, the goal of the selective sampling algorithm is to produce a sequence of length M which leads to the best classifier according to some given measure. The pseudo-code for an active learning system that uses selective sampling is shown in Figure 3.1.

3.3 Example Evaluation and Selection

Active learning algorithms that perform selective sampling differ from each other in the way they select the next example for labeling. In this section we present some common approaches to this problem.

3.3.1 Uncertainty-based Sampling

Most work in active learning selects the untagged examples that the learner is most uncertain about. Lewis and Gale [39] present a probabilistic classifier for text classification. They assume that a large pool of unclassified text documents is available. The probabilistic classifier assigns tag probabilities to text documents based on already labeled documents. At each iteration, several unlabeled examples that were assigned probabilities closest to 0.5 are selected for labeling. It is shown that the method significantly reduces the amount of labeled data needed to achieve a certain level of accuracy, compared to random sampling.
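The loop of Figure 3.1 can be made runnable with a concrete learner and sampler (a sketch; logistic regression as L, the Lewis and Gale probability-closest-to-0.5 criterion as S_L, and the toy target concept are all assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learner(X, f, select, budget=10, seed_idx=(0, 1)):
    """The loop of Figure 3.1: the labelled set D grows by one queried
    example per iteration until the label budget M is exhausted."""
    labelled = list(seed_idx)
    labels = [f(X[i]) for i in labelled]
    clf = LogisticRegression().fit(X[labelled], labels)       # h <- L(D)
    while len(labelled) < budget:                             # stopping criterion
        pool = [i for i in range(len(X)) if i not in labelled]
        i = select(clf, X, pool)                              # x <- S_L(X, D)
        labelled.append(i)
        labels.append(f(X[i]))                                # omega <- f(x)
        clf = LogisticRegression().fit(X[labelled], labels)   # h <- L(D)
    return clf

def uncertainty(clf, X, pool):
    """Lewis & Gale style selection: the pool example whose predicted
    P(y = 1 | x) is closest to 0.5."""
    p = clf.predict_proba(X[pool])[:, 1]
    return pool[int(np.argmin(np.abs(p - 0.5)))]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
teacher = lambda x: int(x[0] > 0)      # hypothetical target concept f
pos = next(i for i, x in enumerate(X) if teacher(x) == 1)
neg = next(i for i, x in enumerate(X) if teacher(x) == 0)
h = active_learner(X, teacher, uncertainty, seed_idx=(pos, neg))
```

Any learner exposing a confidence score can be dropped in for L, and any of the selection strategies of the following subsections can replace `uncertainty` as S_L.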
Later, Lewis and Catlett [38] used an efficient probabilistic classifier to select examples for training another (C4.5) classifier. Again, examples with probabilities closest to 0.5 were chosen. Specialised uncertainty sampling methods have been developed for specific classifiers. Park [53] uses a sampling-at-the-boundary method to choose support vectors for an SVM classifier. Orthogonal
support vectors are chosen from the boundary hyperplane and queried one at a time. Tong and Koller [70] also perform pool-based active learning of SVMs: each new query is supposed to split the current version space into two parts as equal as possible, and the authors suggest three methods of choosing the next example based on this principle. Hasenjager and Ritter [30] and Lindenbaum et al. [41] present selective sampling algorithms for nearest neighbor classifiers.

3.3.2 Committee-based Sampling

Query By Committee (QBC) is a general uncertainty-based method in which a committee of hypotheses consistent with the labeled data is specified and an unlabeled example on which the committee members most disagree is chosen. Seung, Opper and Sompolinsky [60] use the Gibbs algorithm to choose random hypotheses from the version space for the committee, and choose the example on which the disagreement between the committee members is greatest. The method is demonstrated on the high-low game and on perceptron learning. Later, Freund et al. [22] presented a more complete and general analysis of query by committee, showing that the method guarantees a rapid decrease in prediction error for the perceptron algorithm.

Cohn et al. [19] present the concept of uncertainty regions – the regions on which hypotheses that are consistent with the labeled data might disagree. Although it is difficult to calculate these regions exactly, one can use approximations (larger or smaller regions). The authors show an application of the method to neural networks: two networks are trained on the labeled example set, one being the most general network and the other the most specific network. When examining an unlabeled example, if the two networks disagree on it, the true label is queried.

Krogh and Vedelsby [36] show the connection between the generalization error of a committee of neural networks and its ambiguity: the larger the ambiguity, the smaller the error.
The authors build an ambiguous network ensemble by cross-validation and calculate optimal weights for the ensemble members. The ensemble can be used for active learning purposes: the example chosen for labeling is the one that leads to the largest ambiguity. Raychandhuri et al. [55] use the same principle and analyze the results with an emphasis on minimizing data gathering. Hasenjager and Ritter [29] compare the theoretical optimum of the high-low game (maximization of information gain) with the performance of the QBC approach, and show that the QBC results rapidly approach the theoretical optimum.

In some cases, especially in the presence of noise, it is impossible to find hypotheses consistent with the labeled data. Abe and Mamitsuka [1] suggest using QBC in combination with boosting or bagging. Liere and Tadepalli [40] use a small committee of 7 Winnow classifiers for text categorization. Engelson and Dagan [5] overcome the problem of finding consistent hypotheses by generating a set of classifiers according to the posterior distribution of the parameters, and show how to do this for binomial and multinomial parameters. This method is only applicable when it is possible to estimate a posterior distribution over the model space given the training data. Muslea et al. [49] present an active learning scheme for problems with several redundant views, where each committee member is based on one of the problem views.

Park et al. [54] combine active learning with the use of unlabeled examples for induction. They train a committee on an initial training set. At each iteration, the algorithm predicts the label of an unlabeled example. If the prediction disagreement among committee members is below a
threshold, the label is considered to be the true label and the committee members' weights are updated according to their predictions. If the disagreement is high, the algorithm queries the oracle (expert) for the true label. Baram et al. [7] use a committee of known-to-be-good active learning algorithms; the committee is then used to select the next query. At each stage, the best-performing algorithm in the committee is traced, and the committee members' weights are updated accordingly.

Melville and Mooney [46] use artificially created examples to construct highly diverse committees. Artificial examples are created based on the feature distribution of the training set and labeled with the least probable label. A new classifier is built using the new training set and, if it does not increase the committee error, it is added to the committee.

3.3.3 Clustering

Recently, several works have suggested pre-clustering the instance pool before selecting an example for labeling. The motivation is that grouping similar examples may help to decide which example will be useful when labeled. Engelbrecht et al. [21] cluster the unlabeled data and then perform selective sampling in each of the clusters. Nguyen et al. [50] pre-cluster the unlabeled data and give priority to examples that are close to the classification boundary and are representative of dense clusters.

Soderland [62] incorporates active learning in a system that learns information extraction rules. The system maintains a group of learned rules. To choose the next example for labeling, all unlabeled examples are divided into three categories: examples covered by rules, near misses, and examples not covered by any rule. Examples for labeling are sampled randomly from these groups (the proportions are selected by the user).

3.3.4 Utility-Based Sampling

Fujii et al.
[24] have observed that considering only example uncertainty is not sufficient when selecting the next example for labeling, and that the impact of labeling an example on the other unlabeled examples should also be taken into account. Example utility is therefore measured by the sum of the interpretation certainty differences over all unlabeled examples, averaged with the interpretation probabilities of the examined example. The method was tested on the word sense disambiguation domain and was shown to perform better than other sampling methods.

Lindenbaum et al. [41] present a very general utility-based sampling method. The sampling process is viewed as a game in which the learner's actions are the possible instances and the teacher's reactions are the different labels. The sampling algorithm expands a "game" tree of depth k, where k depends on the resources available. This tree contains all the possible sampling sequences of length k. The utility of each leaf is computed by evaluating the classifier resulting from performing induction on the sequence leading to that leaf. The arcs associated with the teacher's reactions are assigned probabilities using a random field model. The example selected is the one with the highest expected utility.
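The committee-based selection of Section 3.3.2 can be sketched as follows (a sketch; vote entropy as the disagreement measure and a bagged tree committee, in the spirit of Abe and Mamitsuka, are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbc_disagreement(committee, X_pool):
    """Vote entropy of a committee over a pool of unlabelled examples:
    0 when all members agree, largest when the vote is evenly split."""
    votes = np.stack([m.predict(X_pool) for m in committee])   # (members, n)
    n_members = votes.shape[0]
    scores = []
    for col in votes.T:                      # one column per pool example
        _, counts = np.unique(col, return_counts=True)
        p = counts / n_members
        scores.append(-np.sum(p * np.log(p)))
    return np.array(scores)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labelled data

# bagging-style committee: each member is trained on a bootstrap resample
committee = []
for _ in range(7):
    idx = rng.integers(0, len(X), len(X))
    committee.append(DecisionTreeClassifier(max_depth=2).fit(X[idx], y[idx]))

X_pool = rng.normal(size=(100, 2))
scores = qbc_disagreement(committee, X_pool)
query_idx = int(np.argmax(scores))           # query the most disputed example
```

Other disagreement measures (e.g. KL divergence to the mean prediction) can be substituted without changing the overall loop.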
3.4 Conclusion

Active learning is a powerful tool. It has been shown to be effective in reducing the number of labeled examples needed for learning, thus reducing the cost of creating training sets. It has been successfully applied to many problem domains and is potentially applicable to any task involving learning. Active learning has been applied to numerous domains such as function approximation [36], speech recognition [72], recommender systems [67], text classification and categorization [39, 70, 40, 68, 80], part-of-speech tagging [5], image classification [49] and word sense disambiguation [24, 54].
Chapter 4 Semi-Supervised Clustering

It is generally accepted that conventional approaches to clustering have the drawback that they do not bring background knowledge to bear on the clustering process [74, 11, 77]. This is particularly an issue in the analysis of gene-array data, where clustering is a key data-analysis technique [78, 6]. The issue is increasingly being recognised, and there is significant research activity attempting to make progress in this area. Bolshakova et al. [14] have shown how background information from Gene Ontologies can be used for cluster validity assessment. However, if extra information is available, it should be used directly in the determination of the clusters rather than after the fact in validation.

The idea that background knowledge should be brought to bear in clustering is a general principle, and there is a variety of ways in which it might be realised in practice. Yip et al. [78] suggest there are three aspects to this:

• what extra knowledge is input to the clustering process:
  – cluster labels may be available for some objects,
  – more commonly, information will be available on whether objects must-link or cannot-link;
• when the knowledge is input:
  – before clustering, to guide the clustering process,
  – after clustering, to evaluate the clusters or guide the next round of clustering,
  – as the clustering proceeds, with users supplying knowledge along the way;
• how the knowledge affects the clustering process:
  – through the formation of seed clusters,
  – by constraining some objects to be put in the same cluster or in different clusters,
  – by modifying the assessment of similarity/difference.

It is clear that a key determiner of the best approach to semi-supervised clustering is the nature of the external information being brought to bear.
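The must-link / cannot-link notion above can be made concrete: a clustering violates a must-link pair placed in different clusters and a cannot-link pair placed in the same cluster (a toy sketch; the image names and constraint sets are invented for illustration):

```python
def constraint_violations(assign, must_link, cannot_link):
    """Count how many pairwise constraints a clustering violates.
    `assign` maps object -> cluster id; constraints are pairs of objects."""
    v = sum(1 for a, b in must_link if assign[a] != assign[b])
    v += sum(1 for a, b in cannot_link if assign[a] == assign[b])
    return v

assign = {"img1": 0, "img2": 0, "img3": 1, "img4": 1}
must = [("img1", "img2"), ("img2", "img3")]   # hypothetical user feedback
cannot = [("img1", "img4")]
print(constraint_violations(assign, must, cannot))   # -> 1 (img2/img3 split)
```

Semi-supervised algorithms such as the PCCA method of the next section turn this count into a soft penalty term rather than a hard requirement.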
4.1 Semi-supervised Clustering of Images

To let users easily apprehend the contents of image databases and to make their access to the images more effective, a relevant organisation of the contents must be achieved. The results of fully automatic categorization (unsupervised clustering or cluster analysis; see the surveys in [34], [33]), relying exclusively on a measure of similarity between data items, are rarely satisfactory. Some supervision information provided by the user is then needed to obtain results that are closer to the user's expectations. Supervised classification and semi-supervised clustering are both candidates for such a categorization approach.

When the user is able and willing to provide class labels for a significant sample of images in the database, supervised classification is the method of choice. In practice, this will be the case for specific classification/recognition tasks, but less often for database categorization tasks. When the goal is general image database categorization, not only does the user not have labels for the images, he does not even know a priori what most of the classes are or how many classes should be found. Instead, he expects the system to "discover" these classes for him. To improve results with respect to what an unsupervised algorithm would produce, the user may agree to provide some supervision, if this information is of a very simple nature and required in a rather limited amount. A semi-supervised clustering approach should then be employed. Semi-supervised clustering takes into account both the similarity between data items and information regarding either the membership of some data items to specific clusters or, more often, pairwise constraints (must-link, cannot-link) between data items. These two sources of information are combined ([26], [10]) either by modifying the search for appropriate clusters (search-based methods) or by adapting the similarity measure (similarity-adapting methods).
Search-based methods consider that the similarities between data items provide relatively reliable information regarding the target categorization, but that the algorithm needs some help in finding the most relevant clusters. Similarity-adapting methods assume that the initial similarity measure has to be significantly modified (at a local or a more global scale) by the supervision in order to correctly reflect the target categorization. While similarity-adapting methods appear to apply to a wider range of situations, they need either significantly more supervision (which can be an unacceptable burden for the user) or specific strong assumptions regarding the target similarity measure (which can be a strong limitation on their domain of application).

We assume that users can easily evaluate whether two images should be in the same category or in different categories, so they can easily define must-link or cannot-link constraints between pairs of images. Following previous work by Demiriz et al. [20], Wagstaff et al. [73] and Basu et al. [8], in [28] we introduced Pairwise-Constrained Competitive Agglomeration (PCCA), a fuzzy semi-supervised clustering algorithm that exploits the simple information provided by pairwise constraints. PCCA is reviewed in section 4.2. In the original version of PCCA [28] we made no further assumptions regarding the data, so the pairs of items for which the user is required to define constraints are randomly selected. But in many cases such assumptions are available. We argue in [27] that quite general assumptions let us perform a more adequate, active selection of the pairs of items and thus significantly reduce the number of constraints required to achieve a desired level of performance. We present in section 4.3 our criterion for the active selection of constraints. Experimental results obtained with this criterion on a real-world image categorization problem are given in section 4.4.
4.2 Pairwise-Constrained Competitive Agglomeration

PCCA belongs to the family of search-based semi-supervised clustering methods. It is based on the Competitive Agglomeration (CA) algorithm [23], a fuzzy partitional algorithm that does not require the user to specify the number of clusters to be found. Let x_i, i ∈ {1, .., N} be a set of N vectors representing the data items to be clustered, V the matrix having as columns the prototypes µ_k, k ∈ {1, ..,C} of C clusters (C ≪ N) and U the matrix of membership degrees, where u_ik is the membership of x_i to cluster k. Let d(x_i, µ_k) be the distance between the vector x_i and the cluster prototype µ_k.

For PCCA, the objective function to be minimized must combine the feature-based similarity between data items and knowledge of the pairwise constraints. Let M be the set of available must-link constraints, i.e. (x_i, x_j) ∈ M implies that x_i and x_j should be assigned to the same cluster, and C the set of cannot-link constraints, i.e. (x_i, x_j) ∈ C implies that x_i and x_j should be assigned to different clusters. With these notations, the objective function PCCA must minimize is [28]:

  J(V, U) = Σ_{k=1..C} Σ_{i=1..N} (u_ik)² d²(x_i, µ_k)
          + α [ Σ_{(x_i,x_j)∈M} Σ_{k=1..C} Σ_{l=1..C, l≠k} u_ik u_jl + Σ_{(x_i,x_j)∈C} Σ_{k=1..C} u_ik u_jk ]
          − β Σ_{k=1..C} ( Σ_{i=1..N} u_ik )²                                            (4.1)

under the constraint Σ_{k=1..C} u_ik = 1, for i ∈ {1, . . . , N}. The prototypes of the clusters, for k ∈ {1, ..,C}, are given by

  µ_k = ( Σ_{i=1..N} (u_ik)² x_i ) / ( Σ_{i=1..N} (u_ik)² )                               (4.2)

and the fuzzy cardinalities of the clusters are N_k = Σ_{i=1..N} u_ik.

The first term in (4.1) is the sum of the squared distances to the prototypes weighted by the memberships; it comes from the FCM objective function and reinforces the compactness of the clusters. The second term is the cost of violating the pairwise must-link and cannot-link constraints.
The penalty corresponding to the presence of two points in different clusters (for a must-link constraint) or in the same cluster (for a cannot-link constraint) is weighted by their membership values. The term taking the constraints into account is weighted by α, a constant factor that specifies the relative importance of the supervision. The last term in (4.1) is the sum of the squares of the cardinalities of the clusters (coming from the CA objective function) and controls the competition between clusters. When all these terms are combined and β is well chosen, the final partition will minimize the sum of intra-cluster distances, while partitioning the data set into the smallest number of clusters such that the specified constraints are respected as well as possible. Note that when the memberships are crisp and the number of clusters is pre-defined, this cost function reduces to the one used by the PCKmeans algorithm in [8].

It can be shown (see [28]) that the equation for updating the memberships is

  u_rs = u_rs^FCM + u_rs^Constraints + u_rs^Bias                                          (4.3)

The first term in (4.3), u_rs^FCM, comes from FCM and focuses only on the distances between data items and prototypes. The second term, u_rs^Constraints, takes the supervision into account:
memberships are reinforced or deprecated according to the pairwise constraints given by the user. The third term, u_rs^Bias, leads to a reduction of the cardinality of spurious clusters, which are discarded when their cardinality drops below a threshold. The β factor controls the competition between clusters and is defined at iteration t by:

  β(t) = [ η_0 exp(−|t − t_0|/τ) / Σ_{j=1..C} ( Σ_{i=1..N} u_ij )² ] · Σ_{j=1..C} Σ_{i=1..N} u_ij² d²(x_i, µ_j)      (4.4)

The exponential component of β makes the last term in (4.1) dominate during the first iterations of the algorithm, in order to reduce the number of clusters by removing spurious ones. Later, the first three terms dominate, to seek the best partition of the data. The resulting PCCA algorithm is given in [28]. In the original version of PCCA, the pairs of items for which the user is required to define constraints are randomly selected, prior to running the clustering process. As the distance d(x_i, µ_j) between a data item x_i and a cluster prototype µ_j one can use either the ordinary Euclidean distance, when the clusters are assumed to be spherical, or the Mahalanobis distance, when they are assumed to be elliptical.

4.3 Active Selection of Pairwise Constraints

To make this semi-supervised clustering approach attractive for the user, it is important to minimize the number of constraints he has to provide to reach a given level of quality. This can be done by asking the user to define must-link or cannot-link constraints for the pairs of data items that are expected to have the strongest corrective effect on the clustering algorithm (i.e. that are maximally informative). But does the identity of the constraints one provides have a significant impact on performance, or are all constraints relatively alike, so that only the number of constraints matters? In a series of repeated experiments with PCCA using random constraints, we found a significant variance in the quality of the final clustering results.
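For reference, the PCCA cost function (4.1) of Section 4.2 can be transcribed directly (a sketch that only evaluates the cost of a candidate solution; it does not perform the alternating optimisation of the actual PCCA algorithm, and the toy data, α and β values are invented):

```python
import numpy as np

def pcca_objective(U, X, V, must, cannot, alpha, beta):
    """Evaluate Eq. (4.1) for a membership matrix U (N x C),
    data X (N x d) and prototypes V (C x d), with Euclidean d()."""
    n_clusters = V.shape[0]
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # d^2(x_i, mu_k)
    cost = float(np.sum(U ** 2 * d2))                     # FCM compactness term
    pen = 0.0
    for a, b in must:      # must-link violated when a, b fall in clusters k != l
        pen += sum(U[a, k] * U[b, l]
                   for k in range(n_clusters)
                   for l in range(n_clusters) if l != k)
    for a, b in cannot:    # cannot-link violated when a, b share a cluster
        pen += float(np.dot(U[a], U[b]))
    cost += alpha * pen                                   # constraint term
    cost -= beta * float(np.sum(U.sum(axis=0) ** 2))      # CA competition term
    return cost

# toy check: violating a must-link constraint raises the cost
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
V = np.array([[0.0, 0.5], [5.0, 5.5]])
U_ok = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
U_bad = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
assert pcca_objective(U_ok, X, V, [(0, 1)], [], 10.0, 0.0) < \
       pcca_objective(U_bad, X, V, [(0, 1)], [], 10.0, 0.0)
```

The actual algorithm minimises this cost through the update equations (4.2)–(4.4) rather than by direct evaluation.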
So the selection of constraints can have a strong impact, and we must find appropriate selection criteria. Such criteria may depend on further assumptions regarding the data; for the criteria to be relatively general, the assumptions they rely on should not be too restrictive.

In previous work on this issue, Basu et al. [9] developed a scheme for selecting pairwise constraints before running their semi-supervised clustering algorithm (PCKmeans). They defined a farthest-first traversal scheme over the set of data items, with the goal of finding k items that are far from each other to serve as support for the constraints. This issue was also explored in unsupervised learning for cases where prototypes cannot be defined. In such cases, clustering can only rely on the evaluation of pairwise similarities between data items, implying a high computational cost. In [31], the authors consider subsampling as a solution for reducing this cost and perform an active selection of new data items by minimizing the estimated risk of making wrong predictions regarding the true clustering from the already seen data. This active selection method can also be seen as maximizing the expected value of the information provided by the new data items. The authors find that their active unsupervised clustering algorithm spends more samples to disambiguate clusters that are close to each other and fewer samples on items belonging to well-separated clusters.
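The farthest-first traversal used by Basu et al. [9] can be sketched as follows (a standard greedy formulation; the toy data is invented for illustration):

```python
import numpy as np

def farthest_first(X, k, start=0):
    """Greedily pick k items, each maximising its distance to the set
    already chosen, so the selected items are spread far apart."""
    chosen = [start]
    d = np.linalg.norm(X - X[start], axis=1)   # distance to the chosen set
    while len(chosen) < k:
        nxt = int(np.argmax(d))                # farthest remaining item
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

# two nearby points plus two distant ones: the traversal skips the duplicate
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
print(farthest_first(X, 3))    # -> [0, 2, 3]
```

Constraints are then queried between these well-spread support points, which tends to probe different clusters rather than a single dense region.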
As for other search-based semi-supervised clustering methods, when using PCCA we consider that the similarities between data items provide relatively reliable information regarding the target categorization, and that the constraints only help in finding the most relevant clusters. There is then little uncertainty in identifying well-separated, compact clusters. To be maximally informative, the supervision effort (i.e. the constraints) should rather be spent on defining those clusters that are neither compact nor well separated from their neighbors. One can note that this is consistent with the findings in [31] regarding unsupervised clustering. To find the data items that provide the most informative pairwise constraints, we shall then focus on the least well-defined clusters and, more specifically, on the frontier with their neighbors.

To identify the least well-defined cluster at some iteration t, we use the fuzzy hypervolume FHV_k = \sqrt{|C_k|}, with |C_k| the determinant of the covariance matrix C_k of cluster k. The FHV was introduced by Gath and Geva [25] as an evaluation of the compactness of a fuzzy cluster: the smaller the spatial volume of the cluster and the more concentrated the data items are near its center, the lower the FHV of the cluster. We consider the least well-defined cluster at iteration t to be the one with the largest FHV at that iteration. Note that this requires no extra computation, since the covariance matrices C_k are already available from the computation of the Mahalanobis distance.

Once the least well-defined cluster at iteration t is found, we need to identify the data items near its boundary. Note that in the fuzzy setting, one can consider that a data item represented by the vector x_r is assigned to cluster s if u_rs is the highest among its membership degrees. The data items at the boundary are those having the lowest membership values to this cluster among all the items assigned to it.
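The two selection steps just described, largest fuzzy hypervolume and then lowest-membership items, might be sketched as follows. This is an illustrative assumption, not the authors' implementation; `covs` is assumed to be a list of per-cluster covariance matrices and `u` an (N, C) membership matrix:

```python
import numpy as np

def least_defined_cluster(covs):
    """Index of the cluster with the largest fuzzy hypervolume
    FHV_k = sqrt(det(C_k))."""
    fhv = [np.sqrt(np.linalg.det(C)) for C in covs]
    return int(np.argmax(fhv))

def boundary_items(u, k, n_items=3):
    """Among items assigned to cluster k (i.e. whose highest membership
    is to k), return the n_items with the lowest membership to k."""
    assigned = np.where(np.argmax(u, axis=1) == k)[0]
    order = np.argsort(u[assigned, k])  # ascending membership to k
    return assigned[order[:n_items]].tolist()
```

Since the covariance matrices are computed anyway for the Mahalanobis distance, the FHV ranking adds essentially no cost, as noted in the text.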
After obtaining a set of items that lie on the frontier of the cluster, we find for each item the closest cluster, corresponding to its second-highest membership value. The user is then asked whether one of these items should be (or not) in the same cluster as the closest item from the nearest cluster. The time complexity of this method is high, so we suggest below an approximation with a much lower cost. After finding the least well-defined cluster with the FHV criterion, we consider a virtual boundary that is defined only by a membership threshold and will usually be larger than the true one (this is why we call it the "extended" boundary). The items whose membership values are closest to this threshold are considered to be on the boundary, and constraints are generated directly between these items. We expect these constraints to be roughly equivalent (and not too suboptimal when compared) to the constraints that would have been obtained with the more complex method described in the previous paragraph.

This approximate selection method has two parameters: the number of constraints generated at every iteration and the membership threshold defining the boundary. The first parameter concerns the selection in general, not the approximate method specifically. In all the experiments presented next, we generate 3 constraints at every iteration of the clustering algorithm. For the comparative evaluations, we plot the ratio of well-categorized items against the number of pairwise constraints. With this selection procedure, when the maximal number of constraints is reached, clustering continues until convergence without generating new constraints. The second parameter is specific to the approximation of the boundary, and we use a fixed value of 0.3, meaning that we consider a rather large approximation.
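The approximate "extended boundary" selection could look like the sketch below. The pairing of candidate items is an illustrative assumption (the report does not specify how boundary items are paired); the default parameter values follow the text (3 constraints per iteration, threshold 0.3):

```python
import numpy as np

def extended_boundary_constraints(u, k, threshold=0.3, n_constraints=3):
    """Approximate active selection: in the least well-defined cluster k,
    take the items whose membership to k is closest to the threshold
    (the "extended" boundary) and pair them up as candidate constraints.
    Each pair is then labeled must-link or cannot-link by the user."""
    closeness = np.abs(u[:, k] - threshold)
    candidates = np.argsort(closeness)[: 2 * n_constraints]
    # illustrative assumption: pair consecutive candidates
    return [(int(candidates[2 * i]), int(candidates[2 * i + 1]))
            for i in range(n_constraints)]
```

Compared with the exact method, this avoids computing, for every boundary item, the closest item of the nearest cluster, which is where the high cost lies.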
These fixed values for the two parameters are necessarily suboptimal, but a significant increase in performance (or reduction in the number of constraints required for a given performance) is nevertheless obtained.
4.4 Experimental Evaluation

We evaluated the effect that our criterion for the active selection of constraints has on the PCCA algorithm, and we compared it to the CA algorithm [23] (unsupervised clustering) and to PCKmeans [8] (semi-supervised clustering).

The comparison shown here is performed on a ground-truth database composed of images of different phenotypes of Arabidopsis thaliana, corresponding to slightly different genotypes. This scientific image database comes from studies of gene expression. There are 8 categories, defined by visual criteria: textured plants, plants with long stems and round leaves, plants with long stems and fine leaves, plants with dense, round leaves, plants with desiccated or yellow leaves, plants with large green leaves, plants with reddish leaves, and plants with partially white leaves. There are a total of 187 plant images, but different categories contain very different numbers of instances. The intra-class diversity is high for most classes. A sample of the images is shown in [27]. The global image features we used are the Laplacian weighted histogram, the probability weighted histogram, the Hough histogram, the Fourier histogram and a classical color histogram obtained in HSV color space (described in [15]). Linear PCA was used for a five-fold reduction of the dimension of the feature vectors. In all the experiments reported here we employed the Mahalanobis distance rather than the classical Euclidean distance.

Figure 4.1 presents the dependence between the percentage of well-categorized data points and the number of pairwise constraints considered. We provide as a reference the graphs for the CA algorithm and for K-means, both ignoring the constraints (unsupervised learning). The correct number of classes was directly provided to K-means and PCKmeans. CA, PCCA and Active-PCCA were initialized with a significantly larger number of classes and found the appropriate number by themselves.
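For reference, the squared Mahalanobis distance used in these experiments in place of the Euclidean one can be computed as below. This is a generic sketch, where `mu` and `C` denote a cluster's prototype and covariance matrix:

```python
import numpy as np

def mahalanobis2(x, mu, C):
    """Squared Mahalanobis distance d^2(x, mu) of item x to a cluster
    with prototype mu and covariance matrix C; reduces to the squared
    Euclidean distance when C is the identity."""
    diff = x - mu
    return float(diff @ np.linalg.inv(C) @ diff)
```

In practice one would cache the inverse (or a Cholesky factor) of each C_k per iteration rather than inverting it per item.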
For the fuzzy algorithms (CA, PCCA and Active-PCCA), every data item is assigned to the cluster for which its membership value is the highest. For every number of constraints, 500 experiments were performed with different random selections of the constraints, in order to produce the error bars for PCKmeans and for the original PCCA.

These experimental results clearly show that the user can significantly improve the quality of the categories obtained by providing a simple form of supervision, the pairwise constraints. With a similar number of constraints, PCCA performs significantly better than PCKmeans by making better use of the available constraints: the fuzzy clustering process directly takes the pairwise constraints into account, thanks to the signed constraint terms in the equation for updating the memberships. The active selection of constraints (Active-PCCA) further reduces the number of constraints required for reaching such an improvement. The number of constraints remains very low with respect to the number of items in the dataset.

By including information provided by the user, semi-supervised image database categorization can produce results that come much closer to the user's expectations. But the user will not accept such an approach unless the information he must provide is very simple in nature and small in amount. The experiments we performed show that the active selection of constraints combined with PCCA keeps the number of required constraints sufficiently low for this approach to be a valuable alternative for the categorization of image databases. The approximations we made allowed us to maintain a low computational complexity for the resulting algorithm, making it suitable for real-world clustering applications.
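The evaluation metric, the ratio of well-categorized items, could be computed along these lines. This is a sketch under the assumption that each cluster is mapped to its majority ground-truth class; the report does not specify the cluster-to-class mapping it used:

```python
import numpy as np

def well_categorized_ratio(cluster_ids, labels):
    """Fraction of items whose cluster's majority ground-truth class
    matches their own label (assumed mapping: cluster -> majority class)."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        correct += np.bincount(members).max()  # size of the majority class
    return correct / len(labels)
```

With this convention, a perfect clustering scores 1.0 regardless of how cluster indices are permuted with respect to class labels.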
[Figure 4.1: Results obtained by the different clustering algorithms (K-means, PCKmeans, CA, PCCA, Active-PCCA) on the Arabidopsis database: ratio of well-categorized points against the number of pairwise constraints.]

This work was supported by the EU Network of Excellence MUSCLE (http://www.muscle-noe.org/). The authors also wish to thank NASC (European Arabidopsis Stock Centre, UK, http://arabidopsis.info/) for allowing them to use the Arabidopsis images, and Ian Small from INRA (Institut National de la Recherche Agronomique, France) for providing this image database.
Chapter 5

General Conclusions

A special characteristic of the Machine Learning research in Muscle is that it is applications-driven: it is driven by the characteristics of scenarios in which ML techniques might be used in processing multimedia data. The role for ML in these situations often does not fit well with standard ML techniques, which are divided into those that deal with labeled data and those that do not. In this report we have identified three distinct variants of scenarios that do not fit the classic supervised/unsupervised organisation:

One-Sided Classifiers: Sometimes it is in the nature of the problem that labeled examples are available for one class only. This situation is described in detail in Chapter 2, and an approach for tackling image foreground-background separation as a one-class classification problem is described in Section 2.1.

Active Learning: It is often the case that producing a corpus of labeled training examples is an expensive process. This is the situation, for instance, when supervised learning techniques are used for annotating images in a collection. Active Learning refers to a general set of techniques for building classifiers while minimising the number of labeled examples required. The core research question in Active Learning is the problem of selecting the most useful examples to present to the user for labeling. Promising approaches for this are presented in Chapter 3.

Semi-Supervised Clustering: Confidence in a clustering system will be damaged if it returns clusters that disagree with background knowledge a user might have about the objects under consideration. Thus it is very desirable to be able to present this background knowledge as an input to the clustering process. A natural form for this background knowledge to take is must-link and cannot-link constraints that indicate objects that should and should not be grouped together.
PCCA, as described in Section 4.2, is a framework that allows background knowledge expressed as pairwise constraints to influence the clustering process. This idea is further enhanced in Section 4.3, where a process for identifying useful pairwise constraints is introduced. This active selection of constraints is in the spirit of Active Learning as described in Chapter 3.

This research on ML techniques that fall somewhere between the traditional unsupervised and supervised alternatives is a key area of work within Muscle. This is because many of
the techniques under consideration are specialised for processing multimedia data. For this reason there will be further activity in this area within the network, and we should expect further innovation here.
Bibliography

[1] N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 1–9. Morgan Kaufmann Publishers Inc., 1998.
[2] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, April 1988.
[3] D. Angluin, M. Frazier, and L. Pitt. Learning conjunctions of Horn clauses. Machine Learning, 9(2-3):147–164, July 1992.
[4] N. Apostoloff and A. Fitzgibbon. Bayesian estimation of layers from multiple images. In Proc. CVPR, volume 1, pages 407–414, 2004.
[5] S. Argamon-Engelson and I. Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335–360, November 1999.
[6] E. Bair and R. Tibshirani. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol, 2(4):e108, 2004.
[7] Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. In Proceedings of the Twentieth International Conference on Machine Learning, pages 19–26. AAAI Press, 2003.
[8] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), pages 19–26, 2002.
[9] Sugato Basu, Arindam Banerjee, and Raymond J. Mooney. Semi-supervised clustering by seeding. In 19th International Conference on Machine Learning (ICML'02), pages 19–26, 2002.
[10] Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering. In Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 42–49, Washington, DC, August 2003.
[11] Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. A probabilistic framework for semi-supervised clustering. In Won Kim, Ron Kohavi, Johannes Gehrke, and William DuMouchel, editors, KDD, pages 59–68. ACM, 2004.
[12] E.B. Baum. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, 2(1):5–19, January 1991.
[13] Andrew Blake, Carsten Rother, M. Brown, Patrick Perez, and Philip H. S. Torr. Interactive image segmentation using an adaptive GMMRF model. In Proc. ECCV, volume 1, pages 428–441, 2004.
[14] Nadia Bolshakova, Francisco Azuaje, and Padraig Cunningham. A knowledge-driven approach to cluster validity assessment. Bioinformatics, 21(10):2546–2547, 2005.
[15] Nozha Boujemaa, Julien Fauqueur, Marin Ferecatu, François Fleuret, Valérie Gouet, Bertrand Le Saux, and Hichem Sahbi. Ikona: Interactive generic and specific image retrieval. In Proceedings of the International Workshop on Multimedia Content-Based Indexing and Retrieval (MMCBIR'2001), 2001.
[16] Yuri Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proc. ICCV, pages 105–112, 2001.
[17] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI, 26(9):1124–1137, 2004.
[18] Nader H. Bshouty. On learning multivariate polynomials under the uniform distribution. Information Processing Letters, 61(6):303–309, March 1997.
[19] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, May 1994.
[20] A. Demiriz, K. Bennett, and M. Embrechts. Semi-supervised clustering using genetic algorithms. In C. H. Dagli et al., editor, Intelligent Engineering Systems Through Artificial Neural Networks 9, pages 809–814. ASME Press, 1999.
[21] A.P. Engelbrecht and R. Brits. Supervised training using an unsupervised approach to active learning. Neural Processing Letters, 15(3):247–260, June 2002.
[22] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, August-September 1997.
[23] H. Frigui and R. Krishnapuram. Clustering by competitive agglomeration. Pattern Recognition, 30(7):1109–1119, 1997.
[24] A. Fujii, T. Tokunaga, K. Inui, and H. Tanaka. Selective sampling for example-based word sense disambiguation. Computational Linguistics, 24(4):573–597, December 1998.
[25] Isak Gath and Amir B. Geva. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell., 11(7):773–780, 1989.
[26] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Unsupervised and semi-supervised clustering: a brief survey. In A Review of Machine Learning Techniques for Processing Multimedia Content. Report of the MUSCLE European Network of Excellence, July 2004.
[27] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Active semi-supervised clustering for image database categorization. In Content-Based Multimedia Indexing (CBMI'05), June 2005.
[28] Nizar Grira, Michel Crucianu, and Nozha Boujemaa. Semi-supervised fuzzy clustering with pairwise-constrained competitive agglomeration. In IEEE International Conference on Fuzzy Systems (Fuzz'IEEE 2005), May 2005.
[29] M. Hasenjager and H. Ritter. Active learning of the generalized high-low game. In Proceedings of the 1996 International Conference on Artificial Neural Networks, pages 501–506. Springer-Verlag, 1996.
[30] M. Hasenjager and H. Ritter. Active learning with local models. Neural Processing Letters, 7(2):107–117, April 1998.
[31] Thomas Hofmann and Joachim M. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems (NIPS) 10, pages 528–534, 1997.
[32] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.
[33] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[34] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[35] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? PAMI, 26(2):147–159, 2004.
[36] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, 7:231–238, 1995.
[37] K.J. Lang and E.B. Baum. Query learning can work poorly when a human oracle is used. In International Joint Conference on Neural Networks, September 1992.
[38] D.D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In W. W. Cohen and H. Hirsh, editors, Machine Learning: Proceedings of the Eleventh International Conference, pages 148–156. Morgan Kaufmann, July 1994.
[39] D.D. Lewis and W.A. Gale. A sequential algorithm for training text classifiers. In W.B. Croft and C.J.
Van Rijsbergen, editors, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. ACM/Springer, 1994.
[40] R. Liere and P. Tadepalli. Active learning with committees for text categorization. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 591–596. AAAI Press, July 1997.
[41] M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2):125–152, February 2004.
[42] Jitendra Malik, Serge Belongie, Thomas Leung, and Jianbo Shi. Contour and texture analysis for image segmentation. IJCV, 43(1):7–27, 2001.
[43] Shaul Markovitch and Paul D. Scott. Information filtering: Selection mechanisms in learning systems. Machine Learning, 10(2):113–151, 1993.
[44] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. ICCV, pages 416–425, 2001.
[45] David R. Martin, Charless C. Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530–549, 2004.
[46] P. Melville and R. Mooney. Diverse ensembles for active learning. In Proceedings of the 21st International Conference on Machine Learning, pages 584–591. IEEE, July 2003.
[47] B. Mičušík and A. Hanbury. Steerable semi-automatic segmentation of textured images. In Proc. Scandinavian Conference on Image Analysis (SCIA), pages 35–44, 2005.
[48] B. Mičušík and A. Hanbury. Supervised texture detection in images. In Proc. Conf. on Computer Analysis of Images and Patterns (CAIP), 2005.
[49] I. Muslea, S. Minton, and C.A. Knoblock. Selective sampling with redundant views. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 621–626. AAAI Press / The MIT Press, 2000.
[50] H.T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning. IEEE, July 2003.
[51] N. Paragios and R. Deriche. Coupled geodesic active regions for image segmentation: A level set approach. In Proc. ECCV, volume II, pages 224–240, 2000.
[52] N.
Paragios and R. Deriche. Geodesic active regions and level set methods for supervised texture segmentation. IJCV, 46(3):223–247, 2002.
[53] J.M. Park. On-line learning by active sampling using orthogonal decision support vectors. In Neural Networks for Signal Processing - Proc. IEEE Workshop. IEEE, December 2000.
[54] S.B. Park, B.T. Zhang, and Y.T. Kim. Word sense disambiguation by learning decision trees from unlabeled data. Applied Intelligence, 19(1-2):27–38, July-October 2003.
[55] T. Raychaudhuri and L.G.C. Hamey. Minimisation of data collection by active learning. In IEEE International Conference on Neural Networks Proceedings, pages 1338–1341. IEEE, 1995.
[56] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In Proc. CVPR, volume 1, pages 18–25, 2000.
[57] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288–297, 1990.
[58] Paul D. Scott and Shaul Markovitch. Experience selection and problem choice in an exploratory learning system. Machine Learning, 12:49–67, 1993.
[59] P.D. Scott and S. Markovitch. Experience selection and problem choice in an exploratory learning system. Machine Learning, 12:49–67, 1993.
[60] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM Press, 1992.
[61] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. PAMI, 22(8):888–905, 2000.
[62] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, February 1999.
[63] P. Sollich. Query construction, entropy, and generalization in neural network models. Physical Review E, 49(5):4637–4651, May 1994.
[64] Stella X. Yu. Segmentation using multiscale cues. In Proc. CVPR, volume 1, pages 247–254, 2004.
[65] J.V. Stone. Independent Component Analysis: A Tutorial Introduction. MIT Press, 2004.
[66] Kah-Kay Sung and P. Niyogi. Active learning the weights of an RBF network. In Neural Networks for Signal Processing, September 1995.
[67] I. R. Teixeira, Francisco de A. T. de Carvalho, G. Ramalho, and V. Corruble. ActiveCP: A method for speeding up user preferences acquisition in collaborative filtering systems. In G. Bittencourt and G. Ramalho, editors, 16th Brazilian Symposium on Artificial Intelligence, SBIA 2002, pages 237–247. Springer, November 2002.
[68] C.A. Thompson, M.E. Califf, and R.J. Mooney. Active learning for natural language parsing and information extraction. In I. Bratko and S.
Dzeroski, editors, Proceedings of the Sixteenth International Conference on Machine Learning, pages 406–414. Morgan Kaufmann Publishers Inc., June 1999.
[69] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, 2001.
[70] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2:45–66, November 2001.
[71] F. Tsung. Statistical monitoring and diagnosis of automatic controlled processes using dynamic PCA. International Journal of Production Research, 38(3):625–627, 2000.
[72] D. Tur, G. Riccardi, and A. L. Gorin. Active learning for automatic speech recognition. In Proceedings of IEEE ICASSP 2002. IEEE, May 2002.
[73] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In Proceedings of the 17th International Conference on Machine Learning, pages 1103–1110, 2000.
[74] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. Constrained k-means clustering with background knowledge. In Carla E. Brodley and Andrea Pohoreckyj Danyluk, editors, ICML, pages 577–584. Morgan Kaufmann, 2001.
[75] Y. Wexler, A. Fitzgibbon, and A. Zisserman. Bayesian estimation of layers from multiple images. In Proc. ECCV, volume 3, pages 487–501, 2002.
[76] P. H. Winston. Learning structural descriptions from examples. In P. H. Winston, editor, The Psychology of Computer Vision. McGraw-Hill, New York, 1975.
[77] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart J. Russell. Distance metric learning with application to clustering with side-information. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 505–512. MIT Press, 2002.
[78] Kevin Y. Yip, David W. Cheung, and Michael K. Ng. On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In ICDE, pages 329–340. IEEE Computer Society, 2005.
[79] Ramin Zabih and Vladimir Kolmogorov. Spatially coherent clustering using graph cuts. In Proc. CVPR, volume 2, pages 437–444, 2004.
[80] C. Zhang and T. Chen. Annotating retrieval database with active learning. In Proceedings of the 2003 International Conference on Image Processing, pages II: 595–598. IEEE, September 2003.