The document discusses machine learning techniques for clustering and segmentation. It introduces Dirichlet process mixtures and the Chinese restaurant process as nonparametric Bayesian models that allow for an infinite number of clusters. It describes how these models can be used for problems like image segmentation, object recognition, population clustering from genetic data, and evolutionary document clustering over time. Approximate inference methods like Markov chain Monte Carlo sampling are used to analyze these models.
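As a concrete illustration of the table-assignment prior underlying these nonparametric models, the Chinese restaurant process can be sampled in a few lines. This is a minimal sketch, not tied to any specific paper in this collection; the function name, seed, and concentration value are arbitrary:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n customers from a Chinese restaurant process.

    Customer i joins an existing table with probability proportional to its
    size, or opens a new table with probability proportional to alpha.
    """
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []  # assignments[i] = table index of customer i
    for i in range(n):
        # Unnormalized probabilities: existing table sizes, then alpha.
        weights = tables + [alpha]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(tables):
            tables.append(1)  # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp_partition(100, alpha=1.0)
```

Larger `alpha` yields more tables on average, which is how these models let the number of clusters grow with the data.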
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
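For intuition, the generative model behind probabilistic CCA couples two views through a shared latent variable (in the style of the standard PCCA formulation). A minimal simulation sketch follows; all dimensions, loading matrices, and noise scales are arbitrary choices for illustration, not values from the work described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: latent d, image-embedding p1, expression p2.
d, p1, p2, n = 2, 5, 4, 1000

# Arbitrary loading matrices for the two views.
W1 = rng.normal(size=(p1, d))
W2 = rng.normal(size=(p2, d))

# A shared latent variable z couples the two observed views:
# each view is a linear map of z plus independent Gaussian noise.
z = rng.normal(size=(n, d))
x1 = z @ W1.T + 0.1 * rng.normal(size=(n, p1))  # image-embedding view
x2 = z @ W2.T + 0.1 * rng.normal(size=(n, p2))  # gene-expression view
```

Fitting PCCA amounts to inverting this generative process: estimating the loadings and recovering the shared latent structure from paired observations.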
Since the advent of the horseshoe prior for regularization, global-local shrinkage methods have proved to be fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models, multivariate models, and deep neural networks. We also outline recent computational developments in horseshoe shrinkage for complex models, along with a list of available software implementations that allow one to venture beyond the comfort zone of canonical linear regression problems.
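As a reminder of the canonical construction, the horseshoe places half-Cauchy priors on a global scale and on per-coefficient local scales. A minimal prior-sampling sketch (the dimension, number of draws, and seed are arbitrary):

```python
import numpy as np

def sample_horseshoe_prior(p, n_draws, seed=0):
    """Draw coefficient vectors from the horseshoe prior.

    beta_j | lambda_j, tau ~ N(0, tau^2 * lambda_j^2)
    lambda_j ~ C+(0, 1)   (local shrinkage)
    tau      ~ C+(0, 1)   (global shrinkage)
    """
    rng = np.random.default_rng(seed)
    # A half-Cauchy draw is the absolute value of a standard Cauchy draw.
    tau = np.abs(rng.standard_cauchy(size=(n_draws, 1)))  # global scale
    lam = np.abs(rng.standard_cauchy(size=(n_draws, p)))  # local scales
    return rng.normal(size=(n_draws, p)) * tau * lam

draws = sample_horseshoe_prior(p=10, n_draws=5000)
```

The heavy Cauchy tails let individual coefficients escape shrinkage while the global scale pulls the bulk toward zero, which is the "global-local" behavior the talk builds on.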
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Fabian Pedregosa
As datasets continue to grow in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become increasingly essential to the field of machine learning. In this talk I will describe two of our recent contributions to this topic. First, we highlight an important technical issue present in a large fraction of recent convergence proofs for asynchronous parallel optimization algorithms and propose a new framework that resolves it [1]. Second, we propose a novel asynchronous variant of SAGA, a stochastic method that combines the low cost per iteration of SGD with the fast convergence rate of gradient descent [2].
[1] Leblond, R., Pedregosa, F., & Lacoste-Julien, S. (2018). Improved asynchronous parallel optimization analysis for stochastic incremental methods. arXiv:1801.03749, https://arxiv.org/pdf/1801.03749.pdf
[2] Pedregosa, F., Leblond, R., & Lacoste-Julien, S. (2017). Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization. In Advances in Neural Information Processing Systems, http://papers.nips.cc/paper/6611-breaking-the-nonsmooth-barrier-a-scalable-parallel-method-for-composite-optimization.pdf
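For readers unfamiliar with SAGA, a sequential (non-asynchronous) sketch on least squares conveys the core update: one fresh per-sample gradient is combined with a table of stored gradients and their running average. This is an illustrative simplification only, not the parallel algorithm of [2]; the function name, step size, and problem sizes are arbitrary:

```python
import numpy as np

def saga_least_squares(A, b, step, n_iters, seed=0):
    """Sequential SAGA on f(x) = (1/2n) * sum_i (a_i . x - b_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    grads = A * (A @ x - b)[:, None]        # stored per-sample gradients
    grad_avg = grads.mean(axis=0)
    for _ in range(n_iters):
        j = rng.integers(n)
        g_new = A[j] * (A[j] @ x - b[j])    # fresh gradient at sample j
        # Variance-reduced update: fresh gradient minus stale one plus average.
        x -= step * (g_new - grads[j] + grad_avg)
        grad_avg += (g_new - grads[j]) / n  # keep the average in sync
        grads[j] = g_new
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
x_true = rng.normal(size=5)
b = A @ x_true                              # noiseless synthetic problem
x_hat = saga_least_squares(A, b, step=0.01, n_iters=20000)
```

The asynchronous variant in [2] lets multiple cores run this update concurrently without locks; the stored-gradient table is what distinguishes SAGA from plain SGD.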
Learning to Discover Monte Carlo Algorithms on Spin Ice Manifolds
Kai-Wen Zhao
A global-update Monte Carlo sampler can be discovered naturally by a machine trained with policy gradient methods in a topologically constrained environment.
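The policy-gradient idea can be illustrated on a toy problem: a softmax policy over a handful of candidate "update moves" is trained with REINFORCE to prefer the highest-reward one. This is a heavily simplified stand-in for the spin-ice setting, which is far richer; all names, rewards, and hyperparameters below are arbitrary:

```python
import numpy as np

def reinforce_bandit(rewards, n_steps=3000, lr=0.1, seed=0):
    """REINFORCE on a toy bandit: learn which 'update move' pays off.

    A softmax policy over discrete actions is nudged toward actions with
    higher reward via the score-function (policy-gradient) estimator.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards))           # policy logits
    baseline = 0.0                           # running-average reward baseline
    for _ in range(n_steps):
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a = rng.choice(len(rewards), p=p)
        r = rewards[a] + 0.1 * rng.normal()  # noisy reward signal
        grad_logp = -p
        grad_logp[a] += 1.0                  # d log pi(a) / d theta
        theta += lr * (r - baseline) * grad_logp
        baseline += 0.05 * (r - baseline)
    return theta

theta = reinforce_bandit(rewards=[0.2, 1.0, 0.5])
best = int(np.argmax(theta))
```

In the actual work the "actions" are global cluster updates respecting the topological constraints of the spin-ice manifold, and the reward reflects sampler efficiency.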
Kernel methods and variable selection for exploratory analysis and multi-omic...
tuxette
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
Enhancing Partition Crossover with Articulation Points Analysis
jfrchicanog
This is the presentation of the paper "Enhancing Partition Crossover with Articulation Points Analysis" at the ECOM track of GECCO 2018 (Kyoto). The paper received a Best Paper Award.
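For background, articulation points (cut vertices) can be found in linear time with a single depth-first search. The sketch below is the standard DFS algorithm, not the specific enhancement proposed in the paper; the example graph is made up:

```python
def articulation_points(adj):
    """Find articulation points of an undirected graph via DFS low-links.

    adj: dict mapping vertex -> list of neighbours.
    Returns the set of vertices whose removal disconnects the graph.
    """
    disc, low, points = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])   # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # No path from v's subtree climbs above u: u is a cut vertex.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        if parent is None and children > 1:     # root with >1 DFS child
            points.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return points

# Two triangles joined at vertex 2: removing 2 disconnects the graph.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
cut = articulation_points(adj)
```

Partition crossover decomposes the union of two parent solutions' graphs into connected components; articulation points expose extra decomposition opportunities within those components.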
Towards new solutions for scientific computing: the case of Julia
Maurizio Tomasi
The year 2018 marks the consolidation of Julia (https://julialang.org/), a programming language designed for scientific computing, as its first stable version (1.0) was released in August 2018. Among its main features, expressiveness and high execution speed are the most prominent: the performance of Julia code is similar to that of statically compiled languages, yet Julia provides a nice interactive shell and fully supports Jupyter; moreover, it can transparently call external code written in C, Fortran, and even Python and R without the need for wrappers. The usage of Julia in the astronomical community is growing, and a GitHub organization named JuliaAstro coordinates the development of packages. In this ADASS talk we present the features and shortcomings of this language and discuss its application in astronomy and astrophysics.
C. Guyon, T. Bouwmans, E. Zahzah, “Foreground Detection via Robust Low Rank Matrix Factorization including Spatial Constraint with Iterative Reweighted Regression”, International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 2012.
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Reviews on Deep Generative Models in the early days / GANs & VAEs paper review
changedaeoh
We review early generative models such as GANs and VAEs. The coverage is as follows:
- VAE (2013)
- GAN (2014)
- Conditional GAN (2014)
- Conditional VAE (2015)
- Deep Convolutional GAN (2015)
- Information Maximizing GAN (2016)
Slides for the TAVE research seminar, July 6, 2021.
Presenter: Changdae Oh
Representation Learning & Generative Modeling with Variational Autoencoder(VA...
changedaeoh
We give an overview of representation learning and generative modeling, and connect it to the content of the Auto-Encoding Variational Bayes paper to build an understanding of the VAE.
Slides for the inter-university club TAVE research seminar, May 18, 2021.
Presenter: Changdae Oh
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Umberto Picchini
An important and well-studied class of stochastic models is given by stochastic differential equations (SDEs). In this talk, we consider Bayesian inference based on measurements from several individuals, providing inference at the "population level" via mixed-effects modelling. We consider the case where dynamics are expressed via SDEs or other stochastic (Markovian) models. Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that account for (i) the intrinsic random variability in the latent-state dynamics, (ii) the variability between individuals, and (iii) measurement error. This flexibility gives rise to methodological and computational difficulties.
Fully Bayesian inference for nonlinear SDEMEMs is complicated by the typical intractability of the observed-data likelihood, which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameters of interest. The algorithm is made computationally efficient through careful use of blocking strategies, particle filters (sequential Monte Carlo) and correlated pseudo-marginal approaches. The resulting methodology is flexible, general, and able to deal with a large class of nonlinear SDEMEMs [1]. In more recent work [2], we also explored ways to make inference even more scalable to an increasing number of individuals, while also dealing with state-space models driven by stochastic dynamic models other than SDEs, e.g. Markov jump processes and nonlinear solvers typically used in systems biology.
[1] S. Wiqvist, A. Golightly, AT McLean, U. Picchini (2020). Efficient inference for stochastic differential mixed-effects models using correlated particle pseudo-marginal algorithms, CSDA, https://doi.org/10.1016/j.csda.2020.107151
[2] S. Persson, N. Welkenhuysen, S. Shashkova, S. Wiqvist, P. Reith, G. W. Schmidt, U. Picchini, M. Cvijovic (2021). PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models, bioRxiv doi:10.1101/2021.07.01.450748.
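To illustrate the pseudo-marginal ingredient, a bootstrap particle filter returns an unbiased estimate of the intractable likelihood, which can then be plugged into an MCMC acceptance ratio. The sketch below uses a simple linear-Gaussian state-space model as a stand-in for an SDE discretization; all model parameters, sizes, and names are arbitrary choices for illustration, not from [1] or [2]:

```python
import numpy as np

def bootstrap_pf_loglik(y, phi, sigma_x, sigma_y, n_particles=500, seed=0):
    """Bootstrap particle filter estimate of log p(y | theta) for the model
    x_t = phi * x_{t-1} + sigma_x * eps_t,   y_t = x_t + sigma_y * eta_t.

    The running product of mean importance weights is an unbiased
    likelihood estimate, as used inside pseudo-marginal MCMC.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_x, size=n_particles)
    loglik = 0.0
    for yt in y:
        # Propagate particles through the transition density.
        x = phi * x + sigma_x * rng.normal(size=n_particles)
        # Weight by the observation density N(yt; x, sigma_y^2), in log space.
        logw = -0.5 * ((yt - x) / sigma_y) ** 2 \
               - np.log(sigma_y * np.sqrt(2 * np.pi))
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())
        # Multinomial resampling.
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        x = x[idx]
    return loglik

# Simulate a short synthetic series and estimate its log-likelihood.
rng = np.random.default_rng(1)
x, ys = 0.0, []
for _ in range(50):
    x = 0.9 * x + 0.5 * rng.normal()
    ys.append(x + 0.3 * rng.normal())
ll = bootstrap_pf_loglik(np.array(ys), phi=0.9, sigma_x=0.5, sigma_y=0.3)
```

The correlated pseudo-marginal idea mentioned above reuses correlated random numbers across successive likelihood estimates to reduce the variance of their ratio.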
Using AI Planning to Automate the Performance Analysis of Simulators
Roland Ewald
Analyzing simulation algorithm performance is cumbersome: execute some runs, observe a performance metric, and analyze the results. Often, the results motivate follow-up experiments, which in turn may lead to additional experiments, and so on. This time-consuming and error-prone process can be automated with planning approaches from artificial intelligence, making simulator performance analysis more convenient and rigorous. This paper introduces ALeSiA, a prototypical system for automatic simulator performance analysis. It is independent of any specific simulation system and realizes a hypothesis-driven approach to evaluate performance.
Keynote of HOP-Rec @ RecSys 2018
Presenter: Jheng-Hong Yang
These slides are complementary material for the short paper HOP-Rec at RecSys 2018. They explain the intuition and some of the abstract ideas behind the paper's descriptions and mathematical symbols by means of plots and figures.
Chap 8. Optimization for training deep models
Young-Geun Choi
Internal lab seminar slides, summarizing and excerpting Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press. They introduce the optimization methods commonly used to minimize the objective function when training deep neural network models.
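As a small taste of the chapter's content, SGD with classical momentum, one of the methods covered, can be sketched as follows. The test function and all hyperparameters are arbitrary choices for illustration:

```python
import numpy as np

def sgd_momentum(grad, theta0, lr=0.1, beta=0.9, n_steps=500):
    """SGD with classical momentum:
    v <- beta * v - lr * grad(theta);  theta <- theta + v
    """
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = beta * v - lr * grad(theta)
        theta = theta + v
    return theta

# Minimize an ill-conditioned quadratic f(x, y) = 0.5 * (10*x^2 + y^2);
# the gradient is (10x, y). Momentum damps oscillation along the steep axis.
grad = lambda th: np.array([10.0, 1.0]) * th
theta = sgd_momentum(grad, theta0=[1.0, 1.0])
```

The velocity term accumulates past gradients, which accelerates progress along shallow directions of the loss surface while averaging out oscillations along steep ones.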
43. From Bag-of-Words to Total Scene Understanding: Evolutions of Topic Models in Vision - Part 2 -
L. Fei-Fei
Computer Science Dept.
Stanford University
44. Machine learning in computer vision
• Aug 15+16, Lecture 14 + 17: Topic models
– Object recognition
– Scene classification, image annotation, large-scale image clustering
– Total scene understanding
• Aug 16: Part 2 – nonparametric topic models
8/8/2010, L. Fei-Fei, Dragon Star 2010, Stanford
45. Case studies in visual recognition
• Sudderth et al.:
– HDP for object recognition
• OPTIMOL:
– HDP for object recognition: incremental learning
• Li et al.:
– CRP for large-scale image clustering
46. Case studies in visual recognition (outline repeated)
49. From Images to Features
Affinely Adapted Harris Corners; Maximally Stable Extremal Regions; Linked Sequences of Canny Edges
• Some invariance to lighting & pose variations
• Dense, multiscale over-segmentation of image
50. A Discrete Feature Vocabulary
SIFT descriptors: normalized histograms of orientation energy (Lowe, IJCV 2004)
• Compute ~1,000-word dictionary via K-means
• Map each feature to nearest visual word
[Figure labels: appearance of feature i in image j; 2D position of feature i in image j]
51. Generative Model for Objects
• For each image: sample a reference position
• For each feature:
– Randomly choose one part
– Sample from that part's feature distribution
52. Objects as Mixture Models
• For a fixed reference position, our generative model is equivalent to a finite mixture model:
[Figure: Pr(part); Pr(appearance | part); Pr(position | part); feature appearance; feature position]
• How many parts should we choose?
– Too few reduces model accuracy
– Too many causes overfitting & poor generalization
53. Dirichlet Process Mixtures
• Dirichlet processes define a prior distribution on weights assigned to mixture components:
[Figure: stick-breaking of the unit interval [0, 1]; concentration parameter]
Stick-Breaking Construction: Sethuraman, 1994
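The stick-breaking construction referenced on this slide (Sethuraman, 1994) can be sketched directly as a truncated draw of mixture weights; the truncation level, concentration value, and seed below are arbitrary:

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, seed=0):
    """Truncated stick-breaking (GEM) draw of DP mixture weights.

    v_k ~ Beta(1, alpha);  pi_k = v_k * prod_{j<k} (1 - v_j)
    Smaller alpha concentrates mass on a few components.
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=n_atoms)
    # Length of stick remaining before each break.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

pi = stick_breaking_weights(alpha=1.0, n_atoms=100)
```

The weights decay stochastically, so only a data-dependent handful of components carry appreciable mass, which is how the model sidesteps choosing the number of parts in advance.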
54. Objects as Distributions
[Figure: feature appearance, Pr(appearance | part); feature position, Pr(position | part)]
• Parts are defined by parameters, which encode distributions on visual features
• Objects are defined by distributions on the infinitely many potential part parameters
55. Dirichlet Process Object Model
[Graphical model: base measures H, R; DP draw G; plates over J images and N features; observed variables w, v]
• Part-based object model sampled from DP prior
• For each of J images, sample a reference position
• For each of N features, sample part parameters
• Sample multinomial feature appearance w
• Sample Gaussian feature position v
56. Learning DPs: Gibbs Sampling
[Graphical model annotations: H, R, G; image scale (insensitive); part scale & visual sparsity (insensitive); plates J, N; variables w, v]
• Integrate parameters defining feature distributions
• Sample assignments clustering features to parts
Dirichlet processes have many desirable analytic properties, which lead to efficient Rao-Blackwellized learning algorithms
59. Learning Shared Parts
• Objects are often locally similar in appearance
• Discover parts shared across categories
– How many total parts should we share?
– How many parts should each category use?
67. Visualization of Part Densities
[Figure: hierarchical clustering of Pr(part | object); leaves include Horse Face, Llama Body, Wheelchair, Dog Face, Cow Face, Llama Face, Cat Face, Cougar Face, Leopard Face, Motorbike, Bicycle, Cannon, Rhino Body, Horse Body, Leopard Body, Elephant Body]
69. Detection Results
• Shared parts more accurate than unshared parts
• Modeling feature positions improves shared detection, but hurts unshared detection
[Figure: ROC curves, 6 training images per category]
73. Recognition Results
[Figures: ROC curves, 6 training images per category; detection vs. training set size (area under ROC)]
74. Case studies in visual recognition (outline repeated)
85. OPTIMOL pitfall: model over-specialization
[Figure: object model iterating between good images and "bad" images]
Li, Wang & Fei-Fei, CVPR 2007
88. OPTIMOL: the "cache set"
[Figure: dataset vs. cache set]
Li, Wang & Fei-Fei, CVPR 2007
103. Case studies in visual recognition (outline repeated)
104. From 1 image -> 100…0000 images
• 70 GB; 10^5-10^6 images
• 10^7 or 10^8 images