Machine Learning in
 Computer Vision

      Fei-Fei Li
What is (computer) vision?
• When we “see” something, what does it
  involve?
• Take a picture with a camera, it is just a
  bunch of colored dots (pixels)
• Want to make computers understand
  images
• Looks easy, but not really…

Image (or video)   Sensing device   Interpreting device    Interpretations


                                                          Corn/mature corn
                                                          in a cornfield/
                                                          plant/blue sky in
                                                          the background
                                                          Etc.
What is it related to?
[Diagram: Computer Vision at the center, surrounded by related fields:
Biology, Psychology, Neuroscience, Cognitive sciences, Information
Engineering, Computer Science, Robotics, Speech, Information retrieval,
Machine learning, Physics, Maths]
Quiz?
What about this?
A picture is worth a thousand words.
                          --- Confucius
                    or Printers’ Ink Ad (1921)
A picture is worth a thousand words.
                                       --- Confucius
                                 or Printers’ Ink Ad (1921)



[Image annotated with low-level descriptions: horizontal lines, textured,
blue on the top, vertical, white, shadow to the left, porous, oblique,
large green patches]
A picture is worth a thousand words.
                                       --- Confucius
                                 or Printers’ Ink Ad (1921)



[Image annotated with high-level descriptions: clear sky, autumn leaves,
building, courtyard, bicycles, trees, campus, day time, talking, people,
outdoor]
Today: machine learning methods
     for object recognition
outline
• Intro to object categorization
• Brief overview
  – Generative
  – Discriminative
• Generative models
• Discriminative models
How many object categories are there?




                                Biederman 1987
Challenges 1: viewpoint variation




Michelangelo 1475-1564
Challenges 2: illumination




                             slide credit: S. Ullman
Challenges 3: occlusion




         Magritte, 1957
Challenges 4: scale
Challenges 5: deformation




                            Xu, Beihong 1943
Challenges 6: background clutter




      Klimt, 1913
History: single object recognition




     • Lowe, et al. 1999, 2003
     • Mahamud and Herbert, 2000
     • Ferrari, Tuytelaars, and Van Gool, 2004
     • Rothganger, Lazebnik, and Ponce, 2004
     • Moreels and Perona, 2005
     •…
Challenges 7: intra-class variation
Object categorization:
           the statistical viewpoint
                  p(zebra | image)  vs.  p(no zebra | image)

• Bayes rule:

    p(zebra | image)       p(image | zebra)       p(zebra)
  --------------------- = --------------------- · -------------
   p(no zebra | image)    p(image | no zebra)    p(no zebra)

     posterior ratio        likelihood ratio      prior ratio
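As a toy illustration of this decision rule (all numbers invented for illustration, not taken from the slides), the posterior ratio can be computed directly from the likelihood ratio and the prior ratio:

```python
def posterior_ratio(lik_zebra, lik_no_zebra, prior_zebra, prior_no_zebra):
    """Posterior ratio = likelihood ratio * prior ratio (Bayes rule above)."""
    return (lik_zebra / lik_no_zebra) * (prior_zebra / prior_no_zebra)

# Invented numbers: the image looks 4x more like a zebra, but zebras are rare.
ratio = posterior_ratio(0.02, 0.005, 0.1, 0.9)
print(round(ratio, 3))                         # 0.444
print("zebra" if ratio > 1 else "no zebra")    # no zebra
```

Here the likelihood ratio favors "zebra", but the prior ratio pulls the decision the other way.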
Object categorization:
           the statistical viewpoint

    p(zebra | image)       p(image | zebra)       p(zebra)
  --------------------- = --------------------- · -------------
   p(no zebra | image)    p(image | no zebra)    p(no zebra)

     posterior ratio        likelihood ratio      prior ratio


• Discriminative methods model posterior

• Generative methods model likelihood and
  prior
Discriminative
• Direct modeling of the posterior ratio
  p(zebra | image) / p(no zebra | image)

      Decision               Zebra
      boundary
                             Non-zebra
Generative
• Model p(image | zebra) and   p (image | no zebra)




        p(image | zebra)          p(image | no zebra)

        [example images scored under each model:
         low vs. middle, high vs. middle-low]
Three main issues
• Representation
  – How to represent an object category


• Learning
  – How to form the classifier, given training data


• Recognition
  – How the classifier is to be used on novel data
Representation
– Generative /
  discriminative / hybrid
– Appearance only or
  location and appearance
– Invariances
   •   Viewpoint
   •   Illumination
   •   Occlusion
   •   Scale
   •   Deformation
   •   Clutter
   •   etc.
– Part-based or global
  w/ sub-window
– Use a set of features or
  each pixel in the image
Learning
– Unclear how to model categories, so we
  learn what distinguishes them rather than
  manually specify the difference -- hence
  current interest in machine learning
– Methods of training: generative vs.
  discriminative

  [Plots: class-conditional densities p(x|C1), p(x|C2) on the left;
   posterior probabilities p(C1|x), p(C2|x) on the right]

– What are you maximizing? Likelihood
  (Gen.) or performance on a train/validation
  set (Disc.)
– Level of supervision
    • Manual segmentation; bounding box; image
      labels; noisy labels ("Contains a motorbike")
– Batch/incremental (on category and image
  level; user feedback)
– Training images:
    • Issue of overfitting
    • Negative images for discriminative methods
– Priors
Recognition
– Scale / orientation range to search over
– Speed
Bag-of-words models
Object   Bag of ‘words’
Analogy to documents
Of all the sensory impressions proceeding to the brain, the visual
experiences are the dominant ones. Our perception of the world around us
is based essentially on the messages that reach the brain from our eyes.
For a long time it was thought that the retinal image was transmitted
point by point to visual centers in the brain; the cerebral cortex was a
movie screen, so to speak, upon which the image in the eye was projected.
Through the discoveries of Hubel and Wiesel we now know that behind the
origin of the visual perception in the brain there is a considerably more
complicated course of events. By following the visual impulses along
their path to the various cell layers of the optical cortex, Hubel and
Wiesel have been able to demonstrate that the message about the image
falling on the retina undergoes a step-wise analysis in a system of nerve
cells stored in columns. In this system each cell has its specific
function and is responsible for a specific detail in the pattern of the
retinal image.

   Extracted "words": sensory, visual, perception, retinal, cerebral
   cortex, eye, cell, optical, nerve, image, Hubel, Wiesel

China is forecasting a trade surplus of $90bn (£51bn) to $100bn this
year, a threefold increase on 2004's $32bn. The Commerce Ministry said
the surplus would be created by a predicted 30% jump in exports to
$750bn, compared with a 18% rise in imports to $660bn. The figures are
likely to further annoy the US, which has long argued that China's
exports are unfairly helped by a deliberately undervalued yuan. Beijing
agrees the surplus is too high, but says the yuan is only one factor.
Bank of China governor Zhou Xiaochuan said the country also needed to do
more to boost domestic demand so more goods stayed within the country.
China increased the value of the yuan against the dollar by 2.1% in July
and permitted it to trade within a narrow band, but the US wants the yuan
to be allowed to trade freely. However, Beijing has made it clear that it
will take its time and tread carefully before allowing the yuan to rise
further in value.

   Extracted "words": China, trade, surplus, commerce, exports, imports,
   US, yuan, bank, domestic, foreign, increase, value
learning                   recognition



                       codewords dictionary
   feature detection
   & representation


image representation




 category models                              category
(and/or) classifiers                          decision
Representation


                                    2.
                          codewords dictionary
 1.   feature detection
      & representation


 image representation


3.
1. Feature detection and representation



Detect patches
                         [Mikolajczyk and Schmid '02]
                         [Matas et al. '02]
                         [Sivic et al. '03]

Normalize patch

Compute SIFT descriptor [Lowe '99]




                                                Slide credit: Josef Sivic
1. Feature detection and representation

              …
2. Codewords dictionary formation

            …




                      Vector quantization


                             Slide credit: Josef Sivic
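Vector quantization here is typically k-means over patch descriptors: the cluster centers become the codewords. A minimal pure-Python sketch of Lloyd's algorithm (toy 2-D "descriptors" standing in for 128-D SIFT vectors; function names are my own):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: the final cluster centers are the codewords."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[quantize(p, centers)].append(p)
        # Move each center to the mean of its assigned descriptors.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*members)) if members
                   else centers[c] for c, members in enumerate(clusters)]
    return centers

def quantize(p, centers):
    """Index of the nearest codeword (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

# Toy 2-D "descriptors" in two obvious clusters.
descriptors = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
               (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
codebook = kmeans(descriptors, k=2)
```

At recognition time, `quantize` maps every new patch descriptor to its nearest codeword index.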
2. Codewords dictionary formation




                             Fei-Fei et al. 2005
3. Image representation
frequency




                           …..
               codewords
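The codeword-frequency histogram above is just a count over the image's quantized patches. A minimal sketch (the codeword ids are hypothetical):

```python
from collections import Counter

def bow_histogram(codeword_ids, vocab_size):
    """Count how often each codeword occurs among an image's patches."""
    counts = Counter(codeword_ids)
    return [counts[i] for i in range(vocab_size)]

# Hypothetical quantized patches of one image, vocabulary of 5 codewords.
print(bow_histogram([0, 2, 2, 4, 2, 0], vocab_size=5))  # [2, 0, 3, 0, 1]
```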
Representation


                                    2.
                          codewords dictionary
 1.   feature detection
      & representation


 image representation


3.
Learning and Recognition



                codewords dictionary




 category models                       category
(and/or) classifiers                   decision
2 case studies

1. Naïve Bayes classifier
  –   Csurka et al. 2004




2. Hierarchical Bayesian text models
   (pLSA and LDA)
  –   Background: Hofmann 2001, Blei et al. 2004
  –   Object categorization: Sivic et al. 2005, Sudderth et
      al. 2005
  –   Natural scene categorization: Fei-Fei et al. 2005
First, some notation


• wn: each patch in an image
  – wn = [0,0,…1,…,0,0]T
• w: a collection of all N patches in an image
  – w = [w1,w2,…,wN]
• dj: the jth image in an image collection
• c: category of the image
• z: theme or topic of the patch
Case #1: the Naïve Bayes model


                      c                 w
                                                N

                                                  N
  c* = argmax  p(c | w) ∝ p(c) p(w | c) = p(c) ∏  p(wn | c)
           c                                     n=1



Object class        Prior prob. of     Image likelihood
 decision         the object classes    given the class



                                                              Csurka et al. 2004
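A minimal log-space sketch of this classifier over bag-of-codewords images (Laplace smoothing is my own addition for stability; this is an illustration, not Csurka et al.'s implementation):

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocab_size, alpha=1.0):
    """Estimate log p(c) and log p(w | c) with Laplace smoothing (alpha)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        log_prior = math.log(len(docs) / total_docs)
        log_lik = [math.log((counts[w] + alpha) / (total + alpha * vocab_size))
                   for w in range(vocab_size)]
        model[c] = (log_prior, log_lik)
    return model

def classify(patches, model):
    """c* = argmax_c [ log p(c) + sum_n log p(w_n | c) ]."""
    return max(model, key=lambda c: model[c][0] + sum(model[c][1][w] for w in patches))

# Toy data: codewords 0-1 dominate "face" images, 2-3 dominate "background".
docs_by_class = {
    "face": [[0, 0, 1], [1, 0, 1]],
    "background": [[2, 3, 3], [3, 2, 2]],
}
model = train_nb(docs_by_class, vocab_size=4)
```

The Naïve Bayes assumption is visible in `classify`: patch likelihoods are simply summed in log space, i.e. patches are treated as independent given the class.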
Case #2: Hierarchical Bayesian
                  text models
Probabilistic Latent Semantic Analysis (pLSA)


     d         z           w
                               N
 D                                 Hofmann, 2001


Latent Dirichlet Allocation (LDA)


     c     π           z           w
                   N
 D                                     Blei et al., 2001
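The pLSA panel above factors p(w | d) = Σz p(w | z) p(z | d) and is fit by EM. A toy pure-Python sketch (an illustration, not Hofmann's implementation; the smoothing constant in `normalize` is my own addition):

```python
import random

def normalize(v, eps=1e-12):
    """Normalize to a probability vector; eps guards all-zero rows."""
    s = sum(v) + eps * len(v)
    return [(x + eps) / s for x in v]

def plsa(counts, n_topics, iters=50, seed=0):
    """Fit p(w|z) and p(z|d) by EM so that p(w|d) = sum_z p(w|z) p(z|d).
    counts[d][w] = occurrences of codeword w in document (image) d."""
    rng = random.Random(seed)
    D, W = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() for _ in range(W)]) for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(D)]
    for _ in range(iters):
        # E-step: responsibilities p(z | d, w) ∝ p(w|z) p(z|d)
        resp = [[normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)])
                 for w in range(W)] for d in range(D)]
        # M-step: re-estimate both conditional tables from weighted counts
        p_w_z = [normalize([sum(counts[d][w] * resp[d][w][z] for d in range(D))
                            for w in range(W)]) for z in range(n_topics)]
        p_z_d = [normalize([sum(counts[d][w] * resp[d][w][z] for w in range(W))
                            for z in range(n_topics)]) for d in range(D)]
    return p_w_z, p_z_d

# Two documents about codewords {0,1}, two about codewords {2,3}.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

After fitting, p(z | d) plays the role of the per-image theme mixture used for categorization.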
Case #2: Hierarchical Bayesian
                  text models
Probabilistic Latent Semantic Analysis (pLSA)


     d       z      w
                        N
 D




                            “face”

                                                Sivic et al. ICCV 2005
Case #2: Hierarchical Bayesian
                  text models



 “beach”


Latent Dirichlet Allocation (LDA)


     c     π         z     w
                 N
 D
                                    Fei-Fei et al. ICCV 2005
Another application
• Human action classification
Invariance issues
• Scale and rotation
  – Implicit
  – Detectors and descriptors




                                Kadir and Brady. 2003
Invariance issues
• Scale and rotation
• Occlusion
  – Implicit in the models
  – Codeword distribution: small variations
  – (In theory) Theme (z) distribution: different
    occlusion patterns
Invariance issues
• Scale and rotation
• Occlusion
• Translation
  – Encode (relative) location information




                                        Sudderth et al. 2005
Invariance issues
•   Scale and rotation
•   Occlusion
•   Translation
•   Viewpoint (in theory)
    – Codewords: detector
      and descriptor
    – Theme distributions:
      different view points



                   Fergus et al. 2005
Model properties
• Intuitive
   – Analogy to documents

   [The annotated vision-text document from the earlier
    bag-of-words analogy]
Model properties



• Intuitive
• (Could use)
  generative models
  – Convenient for weakly-
    or un-supervised
    training
  – Prior information
  – Hierarchical Bayesian
    framework                Sivic et al., 2005, Sudderth et al., 2005
Model properties



• Intuitive
• (Could use)
  generative models
• Learning and
  recognition relatively
  fast
   – Compare to other
     methods
Weakness of the model

• No rigorous geometric information
  of the object components
• It’s intuitive to most of us that
  objects are made of parts – no
  such information
• Not extensively tested yet for
  – View point invariance
  – Scale invariance
• Segmentation and localization
  unclear
part-based models


   Slides courtesy of Rob Fergus for “part-based models”
One-shot learning
Fei-Fei et al. ‘03, ‘04, ‘06
                               of object categories
P. Bruegel, 1562




                                One-shot learning
Fei-Fei et al. ‘03, ‘04, ‘06
                               of object categories
model representation




                                One-shot learning
Fei-Fei et al. ‘03, ‘04, ‘06
                               of object categories
X (location)

(x,y) coords. of region center


     A (appearance)

         normalize 11x11 patch

         projection onto PCA basis → c1, c2, …, c10
The Generative Model
     μX ΓX       μA ΓA

              h

       X           A
  I

  X (location): (x,y) coords. of region center
  A (appearance): normalized 11x11 patch projected onto a
  PCA basis → c1, c2, …, c10
The Generative Model



     μX ΓX       μA ΓA   parameters


                         hidden variable
             h

      X           A      observed variables
 I
The Generative Model


                                                        ML/MAP
        μX ΓX                   μA ΓA



                      h                                                     θ1

          X                       A
    I                                                                  θ2
                                          θn
                                               where θ = {µX, ΓX, µA, ΓA}


Weber et al. ’98 ’00, Fergus et al. ’03
The Generative Model


 μX ΓX                               ML/MAP
shape model




                                                         θ1
  μA ΓA
appearance model                                    θ2
                       θn
                            where θ = {µX, ΓX, µA, ΓA}
The Generative Model


                                      ML/MAP




 μX       ΓX       μA       ΓA
                                                θ1
               h

                                               θ2
 I    X                 A
                                 θn
The Generative Model
                                                           Bayesian
     m0X       a0X             m0A       a0A
     β0X       B0X             β0A       B0A


      μX       ΓX              μA        ΓA
                           P
                                                                                θ1
                         h

                                                                               θ2
     I     X                         A
                                                θn


                                Parameters to estimate: {mX, βX, aX, BX, mA, βA, aA, BA}
Fei-Fei et al. ‘03, ‘04, ‘06    i.e. parameters of Normal-Wishart distribution
The Generative Model


     m0X       a0X             m0A       a0A   priors
     β0X       B0X             β0A       B0A


      μX       ΓX              μA        ΓA    parameters
                           P

                         h

     I     X                         A




Fei-Fei et al. ‘03, ‘04, ‘06
The Generative Model


     m0X       a0X             m0A       a0A    priors
     β0X       B0X             β0A       B0A


      μX       ΓX              μA        ΓA
                           P

                         h

     I     X                         A         Prior distribution




Fei-Fei et al. ‘03, ‘04, ‘06
1. human vision

                             3. learning
                            & inferences


            One-shot learning
2. model   of object categories
representation
                            4. evaluation
                              & dataset
                            & application
learning & inferences


    No labeling                   No segmentation   No alignment




                                              One-shot learning
Fei-Fei et al. 2003, 2004, 2006
                                             of object categories
learning & inferences

                                  Bayesian




                                                  θ1


                                                 θ2
                           θn

                                              One-shot learning
Fei-Fei et al. 2003, 2004, 2006
                                             of object categories
Variational EM                  Random initialization



          M-Step                E-Step




           new estimate
            of p(θ|train)

                              prior knowledge of p(θ)
Attias, Jordan, Hinton etc.
evaluation & dataset




                                     One-shot learning
Fei-Fei et al. 2004, 2006a, 2006b
                                    of object categories
evaluation & dataset -- Caltech 101 Dataset




                                     One-shot learning
Fei-Fei et al. 2004, 2006a, 2006b
                                    of object categories
evaluation & dataset -- Caltech 101 Dataset




                                     One-shot learning
Fei-Fei et al. 2004, 2006a, 2006b
                                    of object categories
Part 3: discriminative methods
Discriminative methods
 Object detection and recognition is formulated as a classification problem.
 The image is partitioned into a set of overlapping windows
 … and a decision is taken at each window about whether it contains a target object or not.
[Figure: a scene image (“Where are the screens?”) decomposed into a bag
of image patches; in some feature space, a decision boundary separates
“Computer screen” from “Background”]
Discriminative vs. generative
• Generative model
  (The artist): models the class density over x = data
  [plot over x = data]

• Discriminative model
  (The lousy painter): models the posterior over x = data
  [plot over x = data]

• Classification function: outputs +1 or -1
  [plot over x = data]
Discriminative methods
Nearest neighbor                      Neural networks




          10^6 examples

Shakhnarovich, Viola, Darrell 2003    LeCun, Bottou, Bengio, Haffner 1998
Berg, Berg, Malik 2005                Rowley, Baluja, Kanade 1998
…                                     …


Support Vector Machines and Kernels   Conditional Random Fields




Guyon, Vapnik                         McCallum, Freitag, Pereira 2000
Heisele, Serre, Poggio, 2001          Kumar, Hebert 2003
…                                     …
Formulation
• Formulation: binary classification
                                      …
Features x =    x1      x2      x3 … xN                xN+1 xN+2 … xN+M
Labels    y=     -1     +1      -1        -1           ?        ?              ?

          Training data: each image patch is labeled          Test data
          as containing the object or background

• Classification function f(x),
                          where f belongs to some family of functions

• Minimize misclassification error
(Not that simple: we need some guarantee of generalization)
Overview of section
• Object detection with classifiers

• Boosting
  – Gentle boosting
  – Weak detectors
  – Object model
  – Object detection

• Multiclass object detection
Why boosting?
• A simple algorithm for learning robust classifiers
   – Freund & Schapire, 1995
   – Friedman, Hastie, Tibshirani, 1998

• Provides an efficient algorithm for sparse visual
  feature selection
   – Tieu & Viola, 2000
   – Viola & Jones, 2003

• Easy to implement; requires no external
  optimization tools.
Boosting
• Defines a classifier using an additive model:

     H(x) = Σt αt ht(x)

  H: strong classifier; ht: weak classifier;
  αt: weight; x: feature vector
Boosting
• Defines a classifier using an additive model:

     H(x) = Σt αt ht(x)

  H: strong classifier; ht: weak classifier;
  αt: weight; x: feature vector

• We need to define a family of weak classifiers
  from which to draw each ht(x)
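One concrete family of weak classifiers is threshold "stumps". A minimal discrete-AdaBoost sketch over 1-D toy data, showing the additive model and the weight update wt ← wt·exp{-αt yt ht(xt)} in the spirit of the slides (an illustration, not the Viola-Jones system):

```python
import math

def stump(threshold, sign):
    """Weak classifier on a scalar: sign * (+1 if x > threshold else -1)."""
    return lambda x: sign * (1 if x > threshold else -1)

def adaboost(xs, ys, weak, rounds=5):
    """Builds H(x) = sign(sum_t alpha_t h_t(x)); after each round the
    weights of misclassified points grow."""
    w = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        errs = [sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y) for h in weak]
        t = min(range(len(weak)), key=lambda i: errs[i])
        err = max(errs[t], 1e-10)          # clip to avoid log(0)
        if err >= 0.5:                     # no weak classifier beats chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, weak[t]))
        # Reweight: w_i <- w_i * exp(-alpha * y_i * h(x_i)), then renormalize.
        w = [wi * math.exp(-alpha * y * weak[t](x)) for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

# Toy 1-D data: negatives below 2.5, positives above.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
weak = [stump(th, s) for th in (0.5, 1.5, 2.5, 3.5, 4.5) for s in (1, -1)]
H = adaboost(xs, ys, weak)
```

After reweighting, the previously chosen weak classifier performs at chance on the new weighted problem, which is exactly the mechanism the toy example below walks through.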
Boosting
• It is a sequential procedure:

         xt=1                     Each data point has
                    xt
                                  a class label:
         xt=2
                                           +1 ( )
                                    yt =
                                           -1 ( )

                                  and a weight:
                                       wt =1
Toy example
Weak learners from the family of lines



                                                              Each data point has
                                                              a class label:
                                                                          +1 ( )
                                                                   yt =
                                                                          -1 ( )

                                                              and a weight:
                                                                   wt =1




                             Any single line h has p(error) = 0.5: it is at chance
Toy example

                                                     Each data point has
                                                     a class label:
                                                              +1 ( )
                                                       yt =
                                                              -1 ( )

                                                     and a weight:
                                                          wt =1




                 This one seems to be the best
This is a ‘weak classifier’: It performs slightly better than chance.
Toy example

                                                            Each data point has
                                                            a class label:
                                                                     +1 ( )
                                                              yt =
                                                                     -1 ( )

                                                            We update the weights:
                                                               wt    wt exp{-yt Ht}




We set a new problem for which the previous weak classifier performs at chance again
Toy example (weak classifiers f1, f2, f3, f4)

The strong (non-linear) classifier is built as the combination of
all the weak (linear) classifiers.
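The combination can be sketched as a weighted vote, H(x) = sign(Σ_t α_t h_t(x)); the stumps and weights below are illustrative, showing how linear stumps combine into a non-linear decision rule:

```python
import numpy as np

def strong_classifier(X, weak_classifiers, alphas):
    """H(x) = sign( sum_t alpha_t * h_t(x) ): a weighted vote of weak classifiers."""
    score = sum(a * h(X) for a, h in zip(alphas, weak_classifiers))
    return np.sign(score)

# Three linear stumps combine into a rule no single stump can express:
# +1 inside the interval (-1, 1), -1 outside.
h1 = lambda x: np.where(x > -1.0, 1.0, -1.0)
h2 = lambda x: np.where(x < 1.0, 1.0, -1.0)
h3 = lambda x: np.full_like(x, -1.0)   # constant "bias" stump
```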
From images to features:
            Weak detectors
We will now define a family of visual
 features that can be used as weak
 classifiers (“weak detectors”)




    Each feature takes an image as input and produces a binary response;
    such a thresholded feature is a weak detector.
Weak detectors
 Textures of textures
 Tieu and Viola, CVPR 2000




Every combination of three filters
generates a different feature


This gives thousands of features. Boosting selects a sparse subset, so computations
at test time are very efficient. Boosting also avoids overfitting to some extent.
Weak detectors
Haar filters and integral image
Viola and Jones, ICCV 2001




                             The average intensity in the
                             block is computed from four
                             lookups into the integral image,
                             independently of the block size.
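A sketch of the integral-image trick (the zero-padding convention used here is one common choice, an assumption rather than the paper's exact layout):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded with a zero row/column so
    block sums need no boundary special-casing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def block_sum(ii, top, left, height, width):
    """Sum over img[top:top+height, left:left+width] using four lookups,
    independent of the block size: D - B - C + A."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])
```

A Haar-like filter response is then just the difference of two or more adjacent block sums.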
Weak detectors
Other weak detectors:
• Carmichael, Hebert 2004
• Yuille, Snow, Nitzbert, 1998
• Amit, Geman 1998
• Papageorgiou, Poggio, 2000
• Heisele, Serre, Poggio, 2001
• Agarwal, Awan, Roth, 2004
• Schneiderman, Kanade 2004
• …
Weak detectors
Part-based: similar to part-based generative
 models. We create weak detectors by
 using parts that vote for the object-center
 location.




         Car model                                   Screen model

     These features are used for the detector on the course web site.
Weak detectors
First we collect a set of part templates from a set of training
objects.
Vidal-Naquet, Ullman (2003)




                                                           …
Weak detectors
We now define a family of “weak detectors” as:



                   response = part template * image   (cross-correlation, then thresholded)




                                              Better than chance
Weak detectors
We can do a better job using filtered images


               response = part template * (filtered image)




                                               Still a weak detector
                                               but better than before
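Such a detector can be sketched as a thresholded cross-correlation of a part template with the (possibly filtered) image. This is a naive loop implementation for clarity; real systems would use FFT-based or otherwise optimized correlation:

```python
import numpy as np

def correlate2d_valid(img, template):
    """Plain 2-D cross-correlation, 'valid' region only."""
    th, tw = template.shape
    out = np.empty((img.shape[0] - th + 1, img.shape[1] - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + th, c:c + tw] * template)
    return out

def weak_detector(img, template, threshold):
    """Binary response map: 1 wherever the part template correlates strongly."""
    return (correlate2d_valid(img, template) > threshold).astype(int)
```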
Training
First we evaluate all the N features on all the training images.




 Then, we sample the feature outputs at the object center and at random
 locations in the background:
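A sketch of assembling such a training set, assuming the N feature responses have already been evaluated densely for each image (the names and array shapes here are illustrative):

```python
import numpy as np

def sample_feature_vectors(feature_maps, centers, n_background, rng):
    """Assemble a boosting training set from dense feature responses.

    feature_maps: list of (N, H, W) arrays, one per training image -- the
    N feature outputs evaluated at every location (assumed precomputed).
    centers: list of (row, col) object-center locations, one per image.
    """
    X, y = [], []
    for fmap, (r, c) in zip(feature_maps, centers):
        _, H, W = fmap.shape
        X.append(fmap[:, r, c]); y.append(+1)            # object center
        for _ in range(n_background):
            rr, cc = rng.integers(0, H), rng.integers(0, W)
            if (rr, cc) != (r, c):                       # skip the center itself
                X.append(fmap[:, rr, cc]); y.append(-1)  # background
    return np.array(X), np.array(y)
```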
Representation and object model
Selected features for the screen detector

    (features 1, 2, 3, 4, …, 10, …, 100 shown)


                         Lousy painter
Representation and object model
Selected features for the car detector


    (features 1, 2, 3, 4, …, 10, …, 100 shown)
Overview of section
• Object detection with classifiers

• Boosting
  – Gentle boosting
  – Weak detectors
  – Object model
  – Object detection

• Multiclass object detection
Example: screen detection
    Feature
     output
Example: screen detection
    Feature         Thresholded
     output            output




                   Weak ‘detector’
              Produces many false alarms.
Example: screen detection
    Feature   Thresholded   Strong classifier
     output      output       at iteration 1
Example: screen detection
    Feature       Thresholded            Strong
     output          output             classifier




             Second weak ‘detector’
          Produces a different set of
          false alarms.
Example: screen detection
    Feature   Thresholded    Strong
     output      output     classifier




                                +

                                         Strong classifier
                                         at iteration 2
Example: screen detection
    Feature   Thresholded    Strong
     output      output     classifier




                                +




                                …
                                         Strong classifier
                                         at iteration 10
Example: screen detection
    Feature         Thresholded    Strong
     output            output     classifier




                                      +




                                      …
                                               Adding
                                               features

                                                      Final
                                                  classification


              Strong classifier
              at iteration 200
Applications
Document Analysis




 Digit recognition, AT&T labs
http://www.research.att.com/~yann/
Medical Imaging
Robotics
Toys and robots
Finger prints
Surveillance
Security
Searching the web

Machine Learning in Computer Vision

  • 1. Machine Learning in Computer Vision Fei-Fei Li
  • 2. What is (computer) vision? • When we “see” something, what does it involve? • Take a picture with a camera, it is just a bunch of colored dots (pixels) • Want to make computers understand images • Looks easy, but not really… Image (or video) Sensing device Interpreting device Interpretations Corn/mature corn jn a cornfield/ plant/blue sky in the background Etc.
  • 3. What is it related to? Biology Psychology Neuroscience Information Cognitive sciences Engineering Computer Robotics Science Computer Vision Speech Information retrieval Machine learning Physics Maths
  • 6. A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921)
  • 7. A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921) horizontal lines textured blue on the top vertical white shadow to the left porous oblique large green patches
  • 8. A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921) clear sky autumn leaves building court yard bicycles trees campus day time talking outdoor people
  • 9. Today: machine learning methods for object recognition
  • 10. outline • Intro to object categorization • Brief overview – Generative – Discriminative • Generative models • Discriminative models
  • 11. How many object categories are there? Biederman 1987
  • 12. Challenges 1: view point variation Michelangelo 1475-1564
  • 13. Challenges 2: illumination slide credit: S. Ullman
  • 14. Challenges 3: occlusion Magritte, 1957
  • 16. Challenges 5: deformation Xu, Beihong 1943
  • 17. Challenges 6: background clutter Klimt, 1913
  • 18. History: single object recognition
  • 19. History: single object recognition • Lowe, et al. 1999, 2003 • Mahamud and Herbert, 2000 • Ferrari, Tuytelaars, and Van Gool, 2004 • Rothganger, Lazebnik, and Ponce, 2004 • Moreels and Perona, 2005 •…
  • 21. Object categorization: the statistical viewpoint p ( zebra | image) vs. p (no zebra|image) • Bayes rule: p ( zebra | image) p (image | zebra ) p ( zebra) = ⋅ p (no zebra | image) p (image | no zebra) p (no zebra ) posterior ratio likelihood ratio prior ratio
  • 22. Object categorization: the statistical viewpoint p ( zebra | image) p (image | zebra ) p ( zebra) = ⋅ p (no zebra | image) p (image | no zebra) p (no zebra ) posterior ratio likelihood ratio prior ratio • Discriminative methods model posterior • Generative methods model likelihood and prior
  • 23. Discriminative p( zebra | image) • Direct modeling of p (no zebra | image) Decision Zebra boundary Non-zebra
  • 24. Generative • Model p(image | zebra) and p (image | no zebra) p(image | zebra) p (image | no zebra) Low Middle High Middle Low
  • 25. Three main issues • Representation – How to represent an object category • Learning – How to form the classifier, given training data • Recognition – How the classifier is to be used on novel data
  • 26. Representation – Generative / discriminative / hybrid
  • 27. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance
  • 28. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance – Invariances • View point • Illumination • Occlusion • Scale • Deformation • Clutter • etc.
  • 29. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance – invariances – Part-based or global w/sub-window
  • 30. Representation – Generative / discriminative / hybrid – Appearance only or location and appearance – invariances – Parts or global w/sub- window – Use set of features or each pixel in image
  • 31. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning
  • 32. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning) – Methods of training: generative vs. discriminative 5 p(C1|x) p(C2|x) p(x|C ) 2 1 4 posterior probabilities class densities 0.8 3 0.6 2 p(x|C1) 0.4 1 0.2 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 x x
  • 33. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning) – What are you maximizing? Likelihood (Gen.) or performances on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels Contains a motorbike
  • 34. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning) – What are you maximizing? Likelihood (Gen.) or performances on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels – Batch/incremental (on category and image level; user-feedback )
  • 35. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning) – What are you maximizing? Likelihood (Gen.) or performances on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels – Batch/incremental (on category and image level; user-feedback ) – Training images: • Issue of overfitting • Negative images for discriminative methods
  • 36. Learning – Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence current interest in machine learning) – What are you maximizing? Likelihood (Gen.) or performances on train/validation set (Disc.) – Level of supervision • Manual segmentation; bounding box; image labels; noisy labels – Batch/incremental (on category and image level; user-feedback ) – Training images: • Issue of overfitting • Negative images for discriminative methods – Priors
  • 37. Recognition – Scale / orientation range to search over – Speed
  • 39. Object Bag of ‘words’
  • 40. Analogy to documents Of all the sensory impressions proceeding to China is forecasting a trade surplus of $90bn the brain, the visual experiences are the (£51bn) to $100bn this year, a threefold dominant ones. Our perception of the world increase on 2004's $32bn. The Commerce around us is based essentially on the Ministry said the surplus would be created by messages that reach the brain from our eyes. a predicted 30% jump in exports to $750bn, For a long time it was thought that the retinal compared with a 18% rise in imports to image was transmitted pointbrain, to visual sensory, by point China, trade, $660bn. The figures are likely to further visual, perception, centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the surplus, commerce, annoy the US, which has long argued that China's exports are unfairly helped by a retinal, cerebral cortex, image in the eye was projected. Through the exports, imports, US, deliberately undervalued yuan. Beijing discoveries of Hubelcell, optical eye, and Wiesel we now yuan, bank, domestic, agrees the surplus is too high, but says the know that behind the origin of the visual yuan is only one factor. Bank of China nerve, image perception in the brain there is a considerably foreign, increase, governor Zhou Xiaochuan said the country Hubel, Wiesel more complicated course of events. By also needed to do more tovalue trade, boost domestic following the visual impulses along their path demand so more goods stayed within the to the various cell layers of the optical cortex, country. China increased the value of the Hubel and Wiesel have been able to yuan against the dollar by 2.1% in July and demonstrate that the message about the permitted it to trade within a narrow band, but image falling on the retina undergoes a step- the US wants the yuan to be allowed to trade wise analysis in a system of nerve cells freely. However, Beijing has made it clear that stored in columns. 
In this system each cell it will take its time and tread carefully before has its specific function and is responsible for allowing the yuan to rise further in value. a specific detail in the pattern of the retinal image.
  • 41.
  • 42. learning recognition codewords dictionary feature detection & representation image representation category models category (and/or) classifiers decision
  • 43. Representation 2. codewords dictionary 1. feature detection & representation image representation 3.
  • 44. 1.Feature detection and representation
  • 45. 1.Feature detection and representation Compute SIFT Normalize descriptor patch [Lowe’99] Detect patches [Mikojaczyk and Schmid ’02] [Matas et al. ’02] [Sivic et al. ’03] Slide credit: Josef Sivic
  • 46. 1.Feature detection and representation …
  • 47. 2. Codewords dictionary formation …
  • 48. 2. Codewords dictionary formation … Vector quantization Slide credit: Josef Sivic
  • 49. 2. Codewords dictionary formation Fei-Fei et al. 2005
  • 51. Representation 2. codewords dictionary 1. feature detection & representation image representation 3.
  • 52. Learning and Recognition codewords dictionary category models category (and/or) classifiers decision
  • 53. 2 case studies 1. Naïve Bayes classifier – Csurka et al. 2004 2. Hierarchical Bayesian text models (pLSA and LDA) – Background: Hoffman 2001, Blei et al. 2004 – Object categorization: Sivic et al. 2005, Sudderth et al. 2005 – Natural scene categorization: Fei-Fei et al. 2005
  • 54. First, some notations • wn: each patch in an image – wn = [0,0,…1,…,0,0]T • w: a collection of all N patches in an image – w = [w1,w2,…,wN] • dj: the jth image in an image collection • c: category of the image • z: theme or topic of the patch
  • 55. Case #1: the Naïve Bayes model c w N N c ∗ = arg max p (c | w) ∝ p (c) p ( w | c) = p(c)∏ p( wn | c) c n =1 Object class Prior prob. of Image likelihood decision the object classes given the class Csurka et al. 2004
  • 56. Case #2: Hierarchical Bayesian text models Probabilistic Latent Semantic Analysis (pLSA) d z w N D Hoffman, 2001 Latent Dirichlet Allocation (LDA) c π z w N D Blei et al., 2001
  • 57. Case #2: Hierarchical Bayesian text models Probabilistic Latent Semantic Analysis (pLSA) d z w N D “face” Sivic et al. ICCV 2005
  • 58. Case #2: Hierarchical Bayesian text models “beach” Latent Dirichlet Allocation (LDA) c π z w N D Fei-Fei et al. ICCV 2005
  • 59. Another application • Human action classification
  • 60. Invariance issues • Scale and rotation – Implicit – Detectors and descriptors Kadir and Brady. 2003
  • 61. Invariance issues • Scale and rotation • Occlusion – Implicit in the models – Codeword distribution: small variations – (In theory) Theme (z) distribution: different occlusion patterns
  • 62. Invariance issues • Scale and rotation • Occlusion • Translation – Encode (relative) location information Sudderth et al. 2005
  • 63. Invariance issues • Scale and rotation • Occlusion • Translation • View point (in theory) – Codewords: detector and descriptor – Theme distributions: different view points Fergus et al. 2005
  • 64. Model properties Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted pointbrain, to visual sensory, by point • Intuitive visual, perception, centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the retinal, cerebral cortex, image in the eye was projected. Through the – Analogy to documents discoveries of Hubelcell, optical eye, and Wiesel we now know that behind the origin of the visual nerve, image perception in the brain there is a considerably Hubel, Wiesel more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step- wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
  • 65. Model properties • Intuitive • (Could use) generative models – Convenient for weakly- or un-supervised training – Prior information – Hierarchical Bayesian framework Sivic et al., 2005, Sudderth et al., 2005
  • 66. Model properties • Intuitive • (Could use) generative models • Learning and recognition relatively fast – Compare to other methods
  • 67. Weakness of the model • No rigorous geometric information of the object components • It’s intuitive to most of us that objects are made of parts – no such information • Not extensively tested yet for – View point invariance – Scale invariance • Segmentation and localization unclear
• 68. Part-based models. Slides courtesy of Rob Fergus for “part-based models”
  • 69. One-shot learning Fei-Fei et al. ‘03, ‘04, ‘06 of object categories
  • 70. P. Bruegel, 1562 One-shot learning Fei-Fei et al. ‘03, ‘04, ‘06 of object categories
  • 71. model representation One-shot learning Fei-Fei et al. ‘03, ‘04, ‘06 of object categories
• 72. X (location): (x,y) coords. of region center. A (appearance): normalize the 11x11 patch, then project onto a PCA basis to get coefficients c1, c2, …, c10.
• 73. The Generative Model: parameters μX, ΓX (location) and μA, ΓA (appearance) generate hidden variable h and observations X, A in image I. X (location): (x,y) coords. of region center. A (appearance): normalized 11x11 patch projected onto a PCA basis (c1, c2, …, c10).
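A sketch of the appearance representation on these slides, with invented stand-ins (random patches, plain SVD-based PCA, a guessed zero-mean/unit-norm normalization) rather than the authors' actual training data or preprocessing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in training set: 500 11x11 patches, flattened to 121-d vectors.
patches = rng.normal(size=(500, 121))

def normalize(p):
    # Zero-mean, unit-norm patch; the exact normalization is an assumption.
    p = p - p.mean()
    return p / (np.linalg.norm(p) + 1e-8)

X = np.array([normalize(p) for p in patches])

# PCA basis via SVD: the top 10 principal directions of the patch set.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
basis = Vt[:10]                          # shape (10, 121)

# Appearance descriptor A = (c1, ..., c10) for a new region's patch.
new_patch = normalize(rng.normal(size=121))
A = basis @ new_patch
print(A.shape)
```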
• 74. The Generative Model: μX, ΓX, μA, ΓA are parameters; h is a hidden variable; X and A are observed variables; I is the image.
• 75. The Generative Model (ML/MAP): one parameter set per category, θ1, θ2, …, θn, where θ = {µX, ΓX, µA, ΓA}. Weber et al. ’98 ’00, Fergus et al. ’03
• 76. The Generative Model (ML/MAP): shape model (µX, ΓX) and appearance model (µA, ΓA), where θ = {µX, ΓX, µA, ΓA}
• 77. The Generative Model (ML/MAP): each θi generates h and the observations X, A in image I
• 78. The Generative Model (Bayesian): hyperpriors m0X, β0X, a0X, B0X and m0A, β0A, a0A, B0A sit above the parameters μX, ΓX, μA, ΓA, which generate h and the observations X, A in image I. Parameters to estimate: {mX, βX, aX, BX, mA, βA, aA, BA}, i.e. parameters of the Normal-Wishart distribution. Fei-Fei et al. ‘03, ‘04, ‘06
• 79. The Generative Model: priors (m0X, β0X, a0X, B0X; m0A, β0A, a0A, B0A) → parameters (μX, ΓX, μA, ΓA) → h, X, A, I. Fei-Fei et al. ‘03, ‘04, ‘06
• 80. The Generative Model: the prior distribution sits at the top of the hierarchy, above the parameters. Fei-Fei et al. ‘03, ‘04, ‘06
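For reference, the Normal-Wishart prior named on slide 78 has the standard conjugate form (writing Γ for a precision matrix); the exact hyperparameter conventions in the papers may differ:

```latex
p(\mu, \Gamma \mid m_0, \beta_0, a_0, B_0)
  = \mathcal{N}\!\big(\mu \mid m_0, (\beta_0 \Gamma)^{-1}\big)\,
    \mathcal{W}\!\big(\Gamma \mid a_0, B_0\big)
```

Conjugacy means the posterior keeps this same functional form with updated {m, β, a, B}, which is why those are the "parameters to estimate" on slide 78.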
• 81. One-shot learning of object categories: 1. human vision 2. model representation 3. learning & inferences 4. evaluation & dataset & application
  • 82. learning & inferences No labeling No segmentation No alignment One-shot learning Fei-Fei et al. 2003, 2004, 2006 of object categories
  • 83. learning & inferences Bayesian θ1 θ2 θn One-shot learning Fei-Fei et al. 2003, 2004, 2006 of object categories
• 84. Variational EM: random initialization → E-Step ↔ M-Step → new estimate of p(θ|train), combining the data with prior knowledge of p(θ). Attias, Jordan, Hinton etc.
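To make the E-step/M-step alternation concrete, here is plain maximum-likelihood EM for a 1-D two-component Gaussian mixture; the variational Bayesian version on the slide additionally carries a prior p(θ) and a distribution over θ, but the loop has the same shape. All data and settings are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians with means -2 and +3.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = rng.normal(size=2)          # random initialization, as in the diagram
sigma = np.ones(2)
pi = np.array([0.5, 0.5])

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities r[i, k] = p(component k | x_i).
    lik = np.stack([gauss(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r = pi * lik
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture parameters from the responsibilities.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(sorted(np.round(mu, 1)))
```

The recovered means should land close to the true -2 and +3.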
  • 85. evaluation & dataset One-shot learning Fei-Fei et al. 2004, 2006a, 2006b of object categories
  • 86. evaluation & dataset -- Caltech 101 Dataset One-shot learning Fei-Fei et al. 2004, 2006a, 2006b of object categories
  • 87. evaluation & dataset -- Caltech 101 Dataset One-shot learning Fei-Fei et al. 2004, 2006a, 2006b of object categories
• 89. Discriminative methods. Object detection and recognition is formulated as a classification problem: the image is partitioned into a set of overlapping windows, and a decision is taken at each window about whether it contains a target object or not (e.g., “Where are the screens?”). Each window is represented as a bag of image patches in some feature space, and a decision boundary separates the computer-screen class from the background.
• 90. Discriminative vs. generative (plots over x = data) • Generative model, p(x|y) (“the artist”) • Discriminative model, p(y|x) (“the lousy painter”) • Classification function, f(x) ∈ {-1, +1}
• 91. Discriminative methods. Nearest neighbor (10^6 examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; … Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; … Support Vector Machines and Kernels: Guyon, Vapnik; Heisele, Serre, Poggio, 2001; … Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …
• 92. Formulation: binary classification. Features: x = x1, x2, x3, …, xN (training data) and xN+1, xN+2, …, xN+M (test data). Labels: y = -1, +1, -1, -1 for training; ?, ?, ? for test. Training data: each image patch is labeled as containing the object or background. • Classification function: a function belonging to some family of functions • Minimize misclassification error (Not that simple: we need some guarantees that there will be generalization)
  • 93. Overview of section • Object detection with classifiers • Boosting – Gentle boosting – Weak detectors – Object model – Object detection • Multiclass object detection
• 94. Why boosting? • A simple algorithm for learning robust classifiers – Freund & Schapire, 1995 – Friedman, Hastie, Tibshirani, 1998 • Provides an efficient algorithm for sparse visual feature selection – Tieu & Viola, 2000 – Viola & Jones, 2003 • Easy to implement; does not require external optimization tools.
• 95. Boosting • Defines a classifier using an additive model: the strong classifier is a weighted sum of weak classifiers applied to the feature vector, H(x) = Σt αt ht(x)
• 96. Boosting • Defines a classifier using an additive model: H(x) = Σt αt ht(x) • We need to define a family of weak classifiers ht to choose from
• 97. Boosting • It is a sequential procedure. Each data point xt has a class label yt ∈ {+1, -1} and a weight wt = 1.
• 98. Toy example. Weak learners from the family of lines. Each data point has a class label yt ∈ {+1, -1} and a weight wt = 1. For a line h that splits the data arbitrarily, p(error) = 0.5: it is at chance.
• 99. Toy example. Each data point has a class label yt ∈ {+1, -1} and a weight wt = 1. This one seems to be the best. This is a ‘weak classifier’: it performs slightly better than chance.
• 100. Toy example. Each data point has a class label yt ∈ {+1, -1}. We update the weights: wt ← wt exp{-yt Ht}. We set a new problem for which the previous weak classifier performs at chance again.
• 101. Toy example. Each data point has a class label yt ∈ {+1, -1}. We update the weights: wt ← wt exp{-yt Ht}. We set a new problem for which the previous weak classifier performs at chance again.
• 102. Toy example. Each data point has a class label yt ∈ {+1, -1}. We update the weights: wt ← wt exp{-yt Ht}. We set a new problem for which the previous weak classifier performs at chance again.
• 103. Toy example. Each data point has a class label yt ∈ {+1, -1}. We update the weights: wt ← wt exp{-yt Ht}. We set a new problem for which the previous weak classifier performs at chance again.
• 104. Toy example. The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f1, f2, f3, f4.
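The toy example above can be reproduced in a few lines of AdaBoost with axis-aligned threshold ("line") weak learners; the dataset and round count are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs: class +1 around (1, 1), class -1 around (-1, -1).
X = np.vstack([rng.normal(1.0, 0.8, size=(100, 2)),
               rng.normal(-1.0, 0.8, size=(100, 2))])
y = np.concatenate([np.ones(100), -np.ones(100)])

w = np.ones(len(y)) / len(y)          # uniform initial weights
stumps = []                           # (feature, threshold, sign, alpha)

for _ in range(20):
    best = None
    # Weak learner: the best axis-aligned threshold under current weights.
    for f in range(2):
        for t in np.linspace(-2.5, 2.5, 21):
            for s in (1, -1):
                pred = s * np.where(X[:, f] > t, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    err, f, t, s = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    stumps.append((f, t, s, alpha))
    # Reweight: misclassified points become heavier (wt <- wt exp{-a yt ht}).
    pred = s * np.where(X[:, f] > t, 1, -1)
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()

# Strong (non-linear) classifier: weighted vote of the weak (linear) ones.
H = sum(a * s * np.where(X[:, f] > t, 1, -1) for f, t, s, a in stumps)
train_acc = (np.sign(H) == y).mean()
print(round(train_acc, 2))
```

Note the two ingredients from the slides: the exponential weight update after each round, and the final strong classifier as a weighted combination of weak ones.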
• 105. From images to features: weak detectors. We will now define a family of visual features that can be used as weak classifiers (“weak detectors”). Each one takes an image as input and outputs a binary response; such a feature is a weak detector.
• 106. Weak detectors. Textures of textures, Tieu and Viola, CVPR 2000. Every combination of three filters generates a different feature. This gives thousands of features. Boosting selects a sparse subset, so computations at test time are very efficient. Boosting also avoids overfitting to some extent.
  • 107. Weak detectors Haar filters and integral image Viola and Jones, ICCV 2001 The average intensity in the block is computed with four sums independently of the block size.
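The integral-image trick on this slide is easy to verify directly; a minimal sketch with a random image (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.int64)

# Integral image with a zero row/column pad: ii[r, c] = img[:r, :c].sum(),
# so the four-corner lookup below needs no boundary cases.
ii = np.zeros((65, 65), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def block_sum(r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] via four lookups, independent of block size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

print(block_sum(10, 10, 30, 30) == img[10:30, 10:30].sum())  # prints True
```

Dividing the four-lookup sum by the block area gives the average intensity, which is what makes Haar-like filters cheap at every scale.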
  • 108. Weak detectors Other weak detectors: • Carmichael, Hebert 2004 • Yuille, Snow, Nitzbert, 1998 • Amit, Geman 1998 • Papageorgiou, Poggio, 2000 • Heisele, Serre, Poggio, 2001 • Agarwal, Awan, Roth, 2004 • Schneiderman, Kanade 2004 • …
  • 109. Weak detectors Part based: similar to part-based generative models. We create weak detectors by using parts and voting for the object center location Car model Screen model These features are used for the detector on the course web site.
  • 110. Weak detectors First we collect a set of part templates from a set of training objects. Vidal-Naquet, Ullman (2003) …
• 111. Weak detectors. We now define a family of “weak detectors”: correlate the image with a part template and threshold the response. Better than chance.
• 112. Weak detectors. We can do a better job using filtered images: filter the image first, then correlate with the template. Still a weak detector, but better than before.
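A sketch of such a correlation-based weak detector, with an invented random template planted into a noise image; real weak detectors would use learned part templates (and, as the slide says, filtered images):

```python
import numpy as np

rng = np.random.default_rng(0)
template = rng.normal(size=(5, 5))        # hypothetical part template
img = 0.5 * rng.normal(size=(40, 40))
img[20:25, 12:17] += 2.0 * template       # plant the part at row 20, col 12

tz = template - template.mean()
tz /= np.linalg.norm(tz)

th, tw = template.shape
resp = np.zeros((img.shape[0] - th + 1, img.shape[1] - tw + 1))
for r in range(resp.shape[0]):
    for c in range(resp.shape[1]):
        pz = img[r:r + th, c:c + tw]
        pz = pz - pz.mean()
        # Normalized cross-correlation of template and window.
        resp[r, c] = (tz * pz).sum() / (np.linalg.norm(pz) + 1e-8)

# Thresholding the response map turns it into a binary weak detector.
detections = resp > 0.8
peak = np.unravel_index(resp.argmax(), resp.shape)
print(peak)
```

The response peaks where the part was planted; everywhere else the normalized correlation of random windows hovers near zero, so the threshold fires rarely but better than chance.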
  • 113. Training First we evaluate all the N features on all the training images. Then, we sample the feature outputs on the object center and at random locations in the background:
• 114. Representation and object model. Selected features for the screen detector (features 1, 2, 3, 4, …, 10, …, 100): the “lousy painter” at work.
• 115. Representation and object model. Selected features for the car detector (features 1, 2, 3, 4, …, 10, …, 100).
  • 116. Overview of section • Object detection with classifiers • Boosting – Gentle boosting – Weak detectors – Object model – Object detection • Multiclass object detection
  • 117. Example: screen detection Feature output
  • 118. Example: screen detection Feature Thresholded output output Weak ‘detector’ Produces many false alarms.
  • 119. Example: screen detection Feature Thresholded Strong classifier output output at iteration 1
  • 120. Example: screen detection Feature Thresholded Strong output output classifier Second weak ‘detector’ Produces a different set of false alarms.
  • 121. Example: screen detection Feature Thresholded Strong output output classifier + Strong classifier at iteration 2
  • 122. Example: screen detection Feature Thresholded Strong output output classifier + … Strong classifier at iteration 10
  • 123. Example: screen detection Feature Thresholded Strong output output classifier + … Adding features Final classification Strong classifier at iteration 200
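Putting the pieces together, the scanning-window loop from slide 89 scored by a strong classifier looks roughly like this; the `strong_classifier` here is a trivial stand-in (window mean on a synthetic image), not the boosted sum of weak-detector responses used in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(60, 60))
img[24:40, 30:46] += 1.5          # a bright 16x16 "object"

def strong_classifier(window):
    # Stand-in score: window mean. In the slides this would be the
    # boosted sum H(x) of weak-detector responses for the window.
    return window.mean()

win, stride = 16, 4
scores = []
for r in range(0, img.shape[0] - win + 1, stride):
    for c in range(0, img.shape[1] - win + 1, stride):
        scores.append((strong_classifier(img[r:r + win, c:c + win]), r, c))

# Any window whose score clears the decision threshold is a detection.
detections = [(r, c) for s, r, c in scores if s > 0.75]
best_score, best_r, best_c = max(scores)
print(best_r, best_c)
```

The top-scoring window lands on the planted object; in a real detector this scan runs at multiple scales, followed by non-maximum suppression over overlapping detections.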
  • 128. Document Analysis Digit recognition, AT&T labs http://www.research.att.com/~yann/