1. Learning from labelled and unlabeled data
Semi-Supervised Learning
Machine Learning – PDEEC 2008/2009
Filipe Tiago Alves de Magalhães
26-04-2010
2. Semi-Supervised Learning
Supervised Learning
- Discover patterns in the data that relate data attributes to a target (class) attribute.
- These patterns are then used to predict the values of the target attribute in future data instances.
Semi-Supervised Learning
- Labelled + unlabeled data; typically, plenty of unlabeled data is available.
- Tries to improve the predictive power by using both labelled and unlabeled data
(expected to be better than using either alone).
Unsupervised Learning
- The data have no target attribute (unlabeled).
- We want to explore the data to find some intrinsic structure in them.
3. Semi-Supervised Learning
Unlabeled data is easy to obtain
Labelled data can be difficult to obtain
- human annotation is boring
- may require experts
- may require special equipment
- very time-consuming
Examples:
- Web page classification (billions of pages)
- Email classification (SPAM or No-SPAM)
- Speech annotation (400 hours of annotation for each hour of conversation)
-…
4. Semi-Supervised Learning
Semi-supervised learning can be seen as an excellent way to improve the results we would get
using exclusively supervised or exclusively unsupervised methods in the same scenario.
Although we (or specialists) do not need to spend as much effort labelling data, considerable
care must go into the design of good models, feature extraction, and kernel definition.
5. Semi-Supervised Learning
Sometimes, it may not be so hard to label data…
www.espgame.org
The game tries to guess the user's gender based on his/her choices; afterwards, we tell it
whether it was right or wrong.
It takes advantage of the players' input to enrich the training of automatic learning
algorithms.
6. Semi-Supervised Self-Training of Object Detection Models
Chuck Rosenberg Martial Hebert Henry Schneiderman
Google, Inc. Carnegie Mellon University Carnegie Mellon University
7th IEEE Workshops on Application of Computer Vision (WACV/MOTION'05)
2005
7. Semi-Supervised Learning
Self-Training
L = {(Xi, Yi)}   set of labelled data
U = {(Xi, ?)}    set of unlabeled data
Algorithm
Repeat
• Train a classifier C with training data L
• Classify data in U with C
• Find the subset U' of U with the most confident scores
• L ← L + U'
• U ← U − U'
8. Semi-Supervised Self-Training of Object Detection Models
Object detection
Object detection based on its shape
- time-consuming
- exhaustive labelling (background, foreground, object, non-object)
Try to simplify the collection and preparation of training data
- combining data labelled in different ways
- labelling of each image region can take the form of a probability
distribution over labels (“weakly” labelled)
- e.g., it is more likely that the object is present in the centre of the image
- e.g., a certain image has a high likelihood of containing the object, but
its position is unknown.
9. Semi-Supervised Self-Training of Object Detection Models
Training Approaches
Generic detection algorithm for classification of a subwindow in an image as being part of
the “object” class or the “clutter/everything else” class
Let
X – image feature vectors
xi – data at a specific location in the image (i = {1, …, n} indexes image locations)
Y – class
f – foreground
b – background
θf – parameters of the foreground model
θb – parameters of the background model
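The slide's equations did not survive extraction. With the definitions above, a standard likelihood-ratio decision rule, consistent with the thresholding step described on a later slide, would be (a reconstruction, not necessarily the paper's exact notation):

$$\log \frac{P(X \mid Y = f;\, \theta_f)}{P(X \mid Y = b;\, \theta_b)} \;\gtrless\; \lambda$$

where λ is the detection threshold.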
11. Semi-Supervised Self-Training of Object Detection Models
Training Approaches
EM approach
There are many reasons why EM may not perform well in a particular semi-supervised
training context.
- EM solely finds a set of model parameters which maximize the likelihood of the data.
- Fully labeled data may not sufficiently constrain the solution, which means that there
may be solutions which maximize the data likelihood but do not optimize classification
performance.
13. Semi-Supervised Self-Training of Object Detection Models
Detector Overview (Experimental Setup)
1. Subwindow is processed for lighting correction
2. Two-level wavelet transform is applied
3. Features are computed by vector quantizing groups of wavelet coefficients
4. Subwindow is classified by thresholding a linear combination of the log-likelihood
ratios of the features
Cascade architecture → only image patches which are accepted by the first
detector are passed on to the next
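A toy sketch of the cascade idea (function names are illustrative): each stage sees only the patches that all previous stages accepted, so cheap early stages discard most of the image.

```python
from typing import Callable, Iterable

def cascade_accepts(patch, stages: Iterable[Callable[..., bool]]) -> bool:
    """Return True only if every detector stage accepts the patch;
    a rejection at any stage short-circuits the remaining stages."""
    return all(stage(patch) for stage in stages)
```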
14. Semi-Supervised Self-Training of Object Detection Models
Data (Experimental Setup)
- Positive example set: 231 images, yielding 480 training examples
- Independent test set: 44 images, yielding 102 test examples
- 15000 negative examples
- Training images are roughly 200-300 pixels high and 300-400 pixels wide
- Training examples: 24 x 16 pixels (rotated, scaled and cropped)
[Figures: landmark used on a typical training image; sample training images and the training
examples associated with them]
15. Semi-Supervised Self-Training of Object Detection Models
Training (Experimental Setup)
Training the model with fully labeled data consists of the following steps:
1. Given the training data landmark locations
• geometrically normalize the training example subimages;
• apply lighting normalization to the subimages;
• generate synthetic training examples (scaling, shifting and rotating)
2. Compute the wavelet transform of the subimages
3. Quantize each group of wavelet coefficients and build a naïve Bayes model with
respect to each group to discriminate between positive and negative examples
4. Adjust the naïve Bayes model using boosting, but maintaining a linear decision
function, effectively performing gradient descent on the margin
5. Compute a ROC curve for the detector using a cross-validation set
6. Choose a threshold for the linear function, based on the final performance
desired
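A small sketch of the synthetic-example generation in step 1 (the use of scipy and the perturbation ranges are assumptions; the slide does not specify them):

```python
import numpy as np
from scipy.ndimage import affine_transform

def synthesize(example: np.ndarray, rng: np.random.Generator, n: int = 10) -> list:
    """Generate n perturbed copies of a geometrically normalized training
    subimage by small random rotations, scalings and sub-pixel shifts."""
    copies = []
    for _ in range(n):
        angle = rng.uniform(-np.pi / 18, np.pi / 18)  # about +/- 10 degrees (assumed)
        scale = rng.uniform(0.95, 1.05)               # mild rescaling (assumed)
        c, s = np.cos(angle) / scale, np.sin(angle) / scale
        matrix = np.array([[c, -s], [s, c]])          # rotation + isotropic scale
        center = np.array(example.shape) / 2.0
        shift = rng.uniform(-1.0, 1.0, size=2)        # small translation (assumed)
        offset = center - matrix @ center + shift     # rotate about the image centre
        copies.append(affine_transform(example, matrix, offset=offset))
    return copies
```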
16. Semi-Supervised Self-Training of Object Detection Models
Selection Metrics (Experimental Setup)
Selection metric is crucial to the performance of the training
1. Confidence selection
• Computed at every iteration by applying the detector trained from the
current set of labelled data to the weakly labelled data set.
• Detection with highest confidence is selected and added to the training
set
2. MSE selection
• Calculated for each weakly labelled example by evaluating the
distance between the corresponding image window and all of the
other templates in the training data (including the original labelled
examples and the weakly labelled examples added in prior iterations)
17. Semi-Supervised Self-Training of Object Detection Models
Selection Metrics (Experimental Setup)
The candidate image and the labelled images are first normalized with a specific set of
processing steps before the MSE-based score metric is computed.
The score is based on the Mahalanobis distance.
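A hedged sketch of such a score (taking the minimum distance over templates and the covariance estimate are assumptions based on this description, not the paper's exact procedure):

```python
import numpy as np

def mse_selection_score(candidate: np.ndarray, templates: list) -> float:
    """Score a normalized candidate window by its smallest Mahalanobis
    distance to the training templates (lower = better match)."""
    T = np.stack([t.ravel() for t in templates])               # one row per template
    cov = np.cov(T, rowvar=False) + 1e-6 * np.eye(T.shape[1])  # regularized covariance
    inv = np.linalg.inv(cov)
    c = candidate.ravel()
    return min(float((c - t) @ inv @ (c - t)) for t in T)
```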
18. Semi-Supervised Self-Training of Object Detection Models
Selection Metrics (Experimental Setup)
[Diagram: the detector proposes candidate position and scale; the MSE selection metric
scores the candidates]
The detector must be accurate in localization but need not be accurate in detection, since
false detections will be discarded due to their large MSE distances to all of the training
examples.
This is crucial to ensure the performance of the training algorithm with
small initial training sets.
This is also part of the reason why the MSE metric outperforms the confidence metric, which
requires the detector to be accurate in both localization and detection.
19. Semi-Supervised Self-Training of Object Detection Models
Experiment Scenarios (Experiments and Analysis)
Each experiment was repeated using a different initial random subset, in order to account for
the variance observed in the detector performance and in the behaviour of the semi-supervised
training process.
Experiment = specific set of experimental conditions
Run = each repetition of that experiment
Mostly, 5 runs were performed for each experiment
Typically, 20 weakly labelled images were added to the training set at each iteration,
because of the substantial training time of the detector.
Ideally, only a single image would be added at each iteration.
20. Semi-Supervised Self-Training of Object Detection Models
Evaluation Metrics (Experiments and Analysis)
Each run was evaluated by using the area under the ROC curve (AUC).
Because different experimental conditions affect performance, the AUCs were normalized
relative to the full-data performance of that run.
if (performance level == 1.0)
{
the model being evaluated has the same performance
as it would if all of the labelled data was utilised
}
if (performance level < 1.0)
{
the model has a lower performance
than that achieved with the full data set
}
To compute the full data performance, each specific run is
trained with the full data set and its performance is recorded.
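In other words, the reported number is a normalized AUC:

$$\text{performance level} = \frac{\mathrm{AUC}_{\text{run}}}{\mathrm{AUC}_{\text{full data}}}$$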
21. Semi-Supervised Self-Training of Object Detection Models
Baseline training configurations (Experiments and Analysis)
A smooth regime was chosen in order to perform experiments under conditions where the
addition of weakly labelled data would make a difference.
22. Semi-Supervised Self-Training of Object Detection Models
Selection Metrics (Experiments and Analysis)
Does the choice of the selection metric make a substantial
difference in the performance of the semi-supervised training?
[Plots: results using the confidence metric and the MSE metric]
23. Semi-Supervised Self-Training of Object Detection Models
Selection Metrics (Experiments and Analysis)
Does the choice of the selection metric make a substantial
difference in the performance of the semi-supervised training?
[Plots: across iterations, performance decreases with the confidence metric and increases
with the MSE metric]
24. Semi-Supervised Self-Training of Object Detection Models
Relative size of fully Labelled Data(Experiments and Analysis)
How many weakly labelled examples do we need to add to the training set in
order to reach the best detector performance?
25. Semi-Supervised Self-Training of Object Detection Models
Conclusions/Discussion
1. The results showed that it was possible to achieve detection performance that was
close to the base performance obtained with the fully labelled data, even when a
small fraction of the training data was used in the initial training set.
2. The experiments showed that the self-training approach to semi-supervised training
can be applied to an existing detector that was originally designed for supervised
training.
3. The MSE selection metric consistently outperformed the confidence metric. More generally,
self-training with an independently defined selection metric outperformed both the
confidence metric and the batch EM approach. During the training process, the distribution
of the labelled data at any particular iteration may not match the actual underlying
distribution of the data.
26. Semi-Supervised Self-Training of Object Detection Models
Conclusions/Discussion
[Figure: true labels for the unlabeled data; the original unlabeled and labelled data;
(c),(d) the points labelled by the incremental self-training algorithm after 5 iterations
using the confidence metric and the Euclidean metric, respectively]
27. Semi-Supervised Self-Training of Object Detection Models
Future Work
Study the relation between the semi-supervised training approach evaluated here and
co-training approaches.
Develop more precise guidelines for selecting the initial training set.
The approach could be extended to training examples that are labelled in different
ways. For example, some images may be provided with scale information and nothing
else. Additional information may be provided such as the rough shape of the object,
or a prior distribution over its location in the image.
29. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Andrew B. Goldberg Xiaojin Zhu
Computer Sciences Department Computer Sciences Department
University of Wisconsin-Madison University of Wisconsin-Madison
TextGraphs: HLT/NAACL Workshop on Graph-based Algorithms for Natural
Language Processing
2006
30. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Sentiment Categorization
[Figure: example reviews whose star ratings are unknown]
31. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Sentiment Categorization
32. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
What we saw is rating inference
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment
categorization with respect to rating scales. In Proceedings of the ACL.
In this work…
• Graph-based Semi-supervised Learning
• Main assumption encoded in graph:
• Similar documents should have similar ratings
33. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
34. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
35. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
36. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
37. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
50% accuracy
38. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
39. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
100% accuracy
40. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Goal
41. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Approach
42. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Measuring Loss over the Graph
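The loss itself appeared only as an image on this slide. From the cited paper (Goldberg and Zhu, 2006), it has roughly the following form, where L and U are the labelled and unlabeled sets, w_ij are graph edge weights, and M, a, b trade off the terms (a reconstruction from the paper, not a verbatim copy of the slide):

$$\mathcal{L}(f) = \sum_{i \in L} M\,(f_i - y_i)^2 + \sum_{i \in U} \Big[ \sum_{j \in kNN_L(i)} a\,w_{ij}\,(f_i - y_j)^2 + \sum_{j \in k'NN_U(i)} b\,w_{ij}\,(f_i - f_j)^2 \Big]$$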
43. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
44. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
45. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
46. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
47. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
48. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
49. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Minimization is now non-trivial
50. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Finding a Closed-Form Solution
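Because the loss is quadratic in f, the closed form comes from setting the gradient to zero, i.e. solving a linear system (schematically, with C and b assembled from the graph weights and the given labels):

$$\mathcal{L}(f) = f^{\top} C f - 2\,b^{\top} f + \text{const} \;\;\Rightarrow\;\; \nabla_f \mathcal{L} = 0 \;\Leftrightarrow\; C f = b$$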
51. Seeing stars when there aren't many stars:
Graph-based semi-supervised learning for sentiment categorization
Finding a Closed-Form Solution
[Equation annotations: f is the vector of f values for all reviews, holding the given labels
yi for labelled reviews and the predicted labels for unlabeled reviews; the matrix C is
partitioned into labelled and unlabeled blocks]
52. Seeing stars when there aren't many stars:
Graph-based semi-supervised learning for sentiment categorization
Finding a Closed-Form Solution
[Equation annotations: graph Laplacian matrix; constant parameter]
53. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Graph Laplacian Matrix
Assume n labelled and unlabeled documents
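The matrix itself was an image on the slide; as a reminder, for a graph over the n documents with weight matrix W, the standard definition is

$$L = D - W, \qquad D_{ii} = \sum_{j=1}^{n} W_{ij},$$

where D is the diagonal degree matrix.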
54. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Finding a Closed-Form Solution
55. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Experiments
Predict 1-to-4-star ratings for reviews
• 4-author data (Pang and Lee, 2005)
• 1770, 902, 1307 and 1027 documents, respectively
• *
• Each document represented as a {0,1} word-presence
vector, normalized to sum to 1
• Positive-Sentence Percentage (PSP) similarity (Pang and Lee, 2005)
• Tuned parameters with cross-validation
* Joachims, T., Transductive Inference for Text Classification using Support Vector
Machines, in Proceedings of the Sixteenth International Conference on Machine Learning.
1999, Morgan Kaufmann Publishers Inc.
56. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Experiments
PSPi is defined as the percentage of positive sentences in review xi.
The similarity between reviews xi, xj is the cosine of the angle between the vectors
(PSPi, 1−PSPi) and (PSPj, 1−PSPj)
Positive sentences are identified using a binary classifier trained on a “snippet
data set” (10662 documents)
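This similarity transcribes directly into code (the PSP values themselves would come from the separately trained sentence-level classifier):

```python
import numpy as np

def psp_similarity(psp_i: float, psp_j: float) -> float:
    """Cosine of the angle between (PSP_i, 1-PSP_i) and (PSP_j, 1-PSP_j)."""
    u = np.array([psp_i, 1.0 - psp_i])
    v = np.array([psp_j, 1.0 - psp_j])
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```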
57. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Experiments
Low ratings tend to get low PSP scores
High ratings tend to get high PSP scores
The trend was qualitatively the same as in Pang and Lee (2005) (Naïve Bayes)
58. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Experiments
α = ak + bk', where k is the number of labelled neighbours and k' is the number of
unlabeled neighbours
c = k/|L|, where |L| is the size of the labelled set
Optimal values (through cross-validation):
c = 0.2
α = 1.5
59. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Results
Graph-based SSL outperforms other methods for small labelled set sizes.
60. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Alternative Similarity Measure
The cosine between word vectors containing all words,
each weighted by its mutual information
Mutual information values are scaled so that the maximum is 1, and the scaled values are
used as weights for the corresponding words in the word vectors.
Words in the movie review data that did not appear in the "snippet data set" were excluded.
Optimal Values (through cross-validation)
c = 0.1
α = 1.5
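A sketch of this alternative measure (how the word vectors are built from the scaled mutual-information weights is an assumption based on the slide's description):

```python
import numpy as np

def mi_weighted_cosine(words_i: set, words_j: set, mi: dict) -> float:
    """Cosine between word-presence vectors where each present word is
    weighted by its (max-scaled) mutual information value."""
    vocab = sorted(mi)                                  # shared word order
    u = np.array([mi[w] if w in words_i else 0.0 for w in vocab])
    v = np.array([mi[w] if w in words_j else 0.0 for w in vocab])
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```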
61. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Results
[Table: 20-trial average unlabeled-set accuracy for each author across different labelled
set sizes and methods. In each row, green marks the best result and any results that could
not be distinguished from it with a paired t-test at the 0.05 level]
62. Seeing stars when there aren’t many stars:
Graph-based semi-supervised learning for sentiment categorization
Conclusions and Future Work
Graph-based semi-supervised learning based on PSP similarity achieved better performance
than all other methods on all four author corpora.
However, for larger labelled sets its performance was not as good, perhaps because:
a) an SVM regressor trained on a large labelled set can achieve fairly high accuracy without
considering relationships between examples;
b) the PSP similarity is not accurate enough, and biases the overall performance when
labelled data is abundant.
Future work:
- Investigate better document representations and similarity measures.
- Extend the method to the inductive learning setting.
- Experiment with cross-reviewer and cross-domain analysis, such as using a model learned on
movie reviews to help classify product reviews.
64. Human Semi-Supervised Learning
Some evidence…
Face recognition is a very challenging computational task.
However, it is an easy task for humans.
Differences between two views of the same face are much larger than
those between two different faces viewed at the same angle. +
+ Sinha, P., et al., Face recognition by humans: 20 results all computer vision researchers
should know about. 2006, MIT.
Hint: Temporal association
65. Human Semi-Supervised Learning
Some evidence…
Observers were shown sequences of novel faces in which the identity of the face changed as
the head rotated.
[Image sequence = unlabeled data]
As a result, observers showed a tendency to treat the views as if they were of the same person.
This suggests that we are continuously associating views of objects to support later
recognition, and that we do so not only on the basis of physical similarity, but also on the
correlated appearance of the objects in time.
Wallis, G. and H. Bülthoff, Effects of temporal association on recognition memory, in
Proceedings of the National Academy of Sciences. 2001. p. 4800-4804.
66. Human Semi-Supervised Learning
Some evidence…
17-month-old infants listened to a word while seeing an object.
The researchers wanted to measure the infants' ability to associate the word with the object.
If the word was heard many times before (without seeing the object;
unlabeled data), association was stronger.
If the word was not heard before, association was weaker.
Graf, E., et al., Can Infants Map Meaning to Newly Segmented Words?: Statistical
Segmentation and Word Learning. Psychological Science, 2007. 18(3): p. 254-260.
Image taken from www.dalla.is
67. Human Semi-Supervised Learning
A better understanding of the human cognitive model can guide the development of better
machine learning algorithms, or make existing ones more accurate and robust…
68. References
• Rosenberg, C., M. Hebert, and H. Schneiderman, Semi-Supervised Self-Training of Object
Detection Models, in Proceedings of the Seventh IEEE Workshops on Application of
Computer Vision (WACV/MOTION'05) - Volume 1. 2005, IEEE Computer Society.
• Goldberg, A.B. and X. Zhu. Seeing stars when there aren't many stars: Graph-based semi-
supervised learning for sentiment categorization. in TextGraphs: HLT/NAACL Workshop on
Graph-based Algorithms for Natural Language Processing. 2006.
• Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment
categorization with respect to rating scales. In Proceedings of the ACL.
• Joachims, T., Transductive Inference for Text Classification using Support Vector Machines,
in Proceedings of the Sixteenth International Conference on Machine Learning. 1999,
Morgan Kaufmann Publishers Inc.
• Sinha, P., et al., Face recognition by humans: 20 results all computer vision researchers
should know about. 2006, MIT.
• Wallis, G. and H. Bülthoff, Effects of temporal association on recognition memory, in
Proceedings of the National Academy of Sciences. 2001. p. 4800-4804.
• Graf, E., et al., Can Infants Map Meaning to Newly Segmented Words?: Statistical
Segmentation and Word Learning. Psychological Science, 2007. 18(3): p. 254-260.