Supervised Learning of Semantic Classes for Image Annotation and Retrieval
G. Carneiro, A. Chan, P. Moreno, N. Vasconcelos
Presented by: Lukáš Tencer
ECSE 626, 2012
Outline
• Introduction
• Prior techniques
• Supervised OVA Labeling
• Unsupervised Labeling
• Methodology
• Supervised Multiclass Labeling
• Semantic Distribution Estimation
• Density Estimation
• Algorithm
• Learning, Annotation, Retrieval
• Results
• Quantitative
• Qualitative
• Conclusion
Introduction
• Task
• Assign labels to unknown images
• Retrieve relevant images given labels
• Supervised Learning
• Learning from labeled training data
• Training data consist of pairs
• Multiple instance learning
• Semantic Classes
• labels representing common concepts (sky, bear, snow…)
• Image Annotation and Retrieval
• Annotation: given an image, which labels are present in it?
• Retrieval: given a label, which are the top n matching images?
Training pairs: $\{(x_i, l_i)\},\ i = 1, \dots, n$
Introduction
 Datasets:
 Corel5K – 5000 images, 272 Classes
 Corel30K – 30000 images, 1120 Classes
 MIRFLICKR – 25000 images, 37 Classes
 (PSU) – not available anymore
 ImageCLEF - The CLEF (Cross Language
Evaluation Forum) Cross Language Image
Retrieval Track
 Medical Image Retrieval
 Photo Annotation
 Plant Identification
 Wikipedia Retrieval
 Patent Image Retrieval and Classification
Introduction
[Example images: "Bear" from Corel 5K, "New Zealand" from Corel 30K, "Urban" from MIRFLICKR]
Prior Techniques
 Supervised OVA (one-vs-all)
 Binary decision problem: concept present / absent
 Hidden variable $Y_i$
 Decision rule: annotate with concept $i$ iff
  $P_{X|Y_i}(x|1)\,P_{Y_i}(1) \ge P_{X|Y_i}(x|0)\,P_{Y_i}(0)$
 Unsupervised Learning
 Models the dependency between text labels and image features through a hidden variable L:
  $P_{X,W}(x,w) = \sum_{l=1}^{D} P_{X|L}(x|l)\,P_{W|L}(w|l)\,P_L(l)$
 Considers just positive examples (densities for $Y_i = 1$)
[Graphical model: hidden variable L generates both the words W (e.g. "bear": polar, grizzly, ...) and the image features X]
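The OVA decision rule can be sketched numerically. A minimal 1-D Python sketch: the class-conditional Gaussians and the prior `p_present` are made-up values for illustration, not from the paper (which models 192-d DCT features):

```python
import numpy as np

def gauss(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def ova_decide(x, p_present=0.1):
    """Declare concept i present iff
    P(x|Y_i=1) P(Y_i=1) >= P(x|Y_i=0) P(Y_i=0)."""
    lik1 = gauss(x, mu=2.0, sigma=1.0)   # P(x|Y_i=1), assumed model
    lik0 = gauss(x, mu=-2.0, sigma=1.0)  # P(x|Y_i=0), assumed model
    return lik1 * p_present >= lik0 * (1 - p_present)
```

Note that the class prior shifts the decision boundary: a rare concept (small `p_present`) needs stronger likelihood evidence before being declared present.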
Methodology
Supervised Multiclass Labeling (SML)
 Elements of the semantic vocabulary (W) are explicitly mapped to semantic classes (L)!
 Random variable W takes values in $\{1, \dots, T\}$, with $W = i$ if and only if $x$ is a sample from concept $i$
 Annotation and retrieval are then easy, via Bayes rule:
  $P_{W|X}(i|x) = \dfrac{P_{X|W}(x|i)\,P_W(i)}{P_X(x)}$
 Annotation: $i^*(x) = \arg\max_i P_{W|X}(i|x)$
 Retrieval: $j^*(i) = \arg\max_j P_{X|W}(x_j|i)$
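The annotation/retrieval rules above reduce to two argmax operations once the class-conditional likelihoods are available. A small sketch with a made-up likelihood table (3 labels × 3 images; values purely illustrative):

```python
import numpy as np

# rows: labels i, cols: images j; entry = P_{X|W}(x_j | i), assumed values
lik = np.array([[0.50, 0.01, 0.20],
                [0.10, 0.80, 0.05],
                [0.40, 0.19, 0.75]])
prior = np.array([0.2, 0.3, 0.5])  # P_W(i), assumed

def annotate(j):
    """i*(x_j) = argmax_i P_{W|X}(i|x_j); P_X(x) cancels in the argmax."""
    return int(np.argmax(lik[:, j] * prior))

def retrieve(i):
    """j*(i) = argmax_j P_{X|W}(x_j|i): best-matching image for label i."""
    return int(np.argmax(lik[i]))
```

Because $P_X(x)$ is the same for every label, annotation only needs likelihood times prior, never the full posterior normalization.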
Methodology
Estimation of Semantic Class
Distributions
 Given the training set $D_i$ of images for concept $i$, estimate $P_{X|W}(x|i)$
 Assumption: Gaussian distributions
 How to estimate?
 Direct estimation
 Model averaging
 Naive averaging: $P_{X|W}(x|i) = \dfrac{1}{D_i} \sum_{l=1}^{D_i} P_{X|L,W}(x|l,i)$
 Per-image GMM model: $P_{X|L,W}(x|l,i) = \sum_k \pi_{i,l}^k\, G(x, \mu_{i,l}^k, \Sigma_{i,l}^k)$
 Averaged: $P_{X|W}(x|i) = \dfrac{1}{D_i} \sum_{l=1}^{D_i} \sum_k \pi_{i,l}^k\, G(x, \mu_{i,l}^k, \Sigma_{i,l}^k)$
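Naive model averaging is just the mean of the per-image mixture densities. A 1-D sketch (per-image GMMs with illustrative parameters; the paper works with 192-d DCT vectors and full mixtures):

```python
import numpy as np

def gauss(x, mu, var):
    """1-D Gaussian density with variance parameterization."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_pdf(x, weights, mus, variances):
    """Density of one per-image GMM: sum_k pi_k G(x, mu_k, var_k)."""
    return sum(w * gauss(x, m, v) for w, m, v in zip(weights, mus, variances))

def averaged_density(x, image_gmms):
    """Naive averaging: P_{X|W}(x|i) = (1/D_i) sum_l P_{X|L,W}(x|l,i)."""
    return sum(gmm_pdf(x, *g) for g in image_gmms) / len(image_gmms)
```

The drawback motivating mixture hierarchies: the averaged model has (components per image) × (images per class) components, which is expensive to evaluate; the HGMM compresses it to 64 components.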
Methodology
Mixture hierarchies
 First step: fit a GMM to each image with regular soft EM
 E-step: compute the expected complete-data log-likelihood
  $Q(\Theta; \Theta^t) = E_{z|x,\Theta^t}\!\left[\log P(x, z; \Theta)\right]$
 M-step: $\Theta^{t+1} = \arg\max_{\Theta} Q(\Theta; \Theta^t)$, which guarantees $P(x \mid \Theta^{t+1}) \ge P(x \mid \Theta^t)$
 Resulting per-image mixture (8 components):
  $P_{X|W}(x|I) = \sum_{k=1}^{8} \pi_I^k\, G(x, \mu_I^k, \Sigma_I^k)$
[Flowchart: initialization (Euclidean, then Mahalanobis distance) → initial parameter estimate → Expectation ↔ Maximization loop, stopped after max. 200 iterations or when the change in likelihood is too small]
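The per-image step is plain soft EM. A 1-D sketch (the paper fits 8 components to 192-d DCT vectors and uses a distance-based initialization; here a simple quantile initialization stands in):

```python
import numpy as np

def em_gmm(x, K=2, iters=200, tol=1e-6):
    """Soft EM for a 1-D K-component GMM, stopping after `iters`
    iterations or when the log-likelihood change falls below `tol`."""
    pi = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # simple deterministic init
    var = np.full(K, np.var(x))
    prev = -np.inf
    for _ in range(iters):
        # E-step: responsibilities h_ik proportional to pi_k N(x_i; mu_k, var_k)
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        ll = np.logaddexp.reduce(logp, axis=1)
        h = np.exp(logp - ll[:, None])
        if ll.sum() - prev < tol:  # change in likelihood too small
            break
        prev = ll.sum()
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = h.sum(0)
        pi = nk / len(x)
        mu = (h * x[:, None]).sum(0) / nk
        var = (h * (x[:, None] - mu) ** 2).sum(0) / nk
    return pi, mu, var
```

The E-step is done in log space (`logaddexp`) so that small Gaussian densities do not underflow.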
Methodology
Mixture hierarchies for labels
 Second step: fit a hierarchical GMM (HGMM) to each label, taking the image-level mixtures as input
 E and M steps maximize the same $Q(\Theta; \Theta^t)$, now over mixture components instead of feature vectors (next slide)
 Resulting per-label mixture (64 components):
  $P_{X|W}(x|w) = \sum_{k=1}^{64} \pi_w^k\, G(x, \mu_w^k, \Sigma_w^k)$
[Flowchart: initialization (Bhattacharyya distance) → initial parameter estimate → Expectation ↔ Maximization loop, stopped after max. 200 iterations or when the change in likelihood is too small]
E and M step for HGMM
 Input: child components $\{\pi_j^k, \mu_j^k, \Sigma_j^k\},\ j = 1, \dots, D_i,\ k = 1, \dots, K$
 Output: parent components $\{\pi_c^m, \mu_c^m, \Sigma_c^m\},\ m = 1, \dots, M$
 E-step:
  $h_{jk}^m = \dfrac{\left[G(\mu_j^k, \mu_c^m, \Sigma_c^m)\, e^{-\frac{1}{2}\operatorname{trace}\{(\Sigma_c^m)^{-1}\Sigma_j^k\}}\right]^{\pi_j^k N} \pi_c^m}{\sum_l \left[G(\mu_j^k, \mu_c^l, \Sigma_c^l)\, e^{-\frac{1}{2}\operatorname{trace}\{(\Sigma_c^l)^{-1}\Sigma_j^k\}}\right]^{\pi_j^k N} \pi_c^l}$
 M-step:
  $(\pi_c^m)^{new} = \dfrac{\sum_{jk} h_{jk}^m}{D_i K}$
  $(\mu_c^m)^{new} = \sum_{jk} w_{jk}^m \mu_j^k$, where $w_{jk}^m = \dfrac{h_{jk}^m \pi_j^k}{\sum_{jk} h_{jk}^m \pi_j^k}$
  $(\Sigma_c^m)^{new} = \sum_{jk} w_{jk}^m \left[\Sigma_j^k + (\mu_j^k - \mu_c^m)(\mu_j^k - \mu_c^m)^T\right]$
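One hierarchical EM iteration can be sketched in numpy. This is a 1-D simplification (scalar variances instead of full covariances, so the trace term reduces to a ratio); `N` is the virtual sample count per child mixture, and all inputs in the test are illustrative:

```python
import numpy as np

def hem_step(pi_c, mu_c, var_c, pi_j, mu_j, var_j, N=100):
    """One E+M step of hierarchical EM (1-D sketch): fit an M-component
    parent mixture to a flat list of child Gaussian components, never
    touching the underlying feature vectors."""
    # E-step: responsibility h[m, jk] of parent component m for child jk:
    # h ∝ [ G(mu_jk; mu_m, var_m) * exp(-0.5 * var_jk / var_m) ]^(pi_jk N) * pi_m
    log_g = (-0.5 * (mu_j[None, :] - mu_c[:, None]) ** 2 / var_c[:, None]
             - 0.5 * np.log(2 * np.pi * var_c[:, None]))
    log_h = ((pi_j * N) * (log_g - 0.5 * var_j[None, :] / var_c[:, None])
             + np.log(pi_c)[:, None])
    h = np.exp(log_h - np.logaddexp.reduce(log_h, axis=0))
    # M-step: parent weights, means, variances from child statistics
    pi_new = h.sum(1) / h.sum()              # = sum_jk h / (D_i * K)
    w = h * pi_j                             # w[m, jk] ∝ h_jk^m * pi_jk
    w = w / w.sum(1, keepdims=True)
    mu_new = (w * mu_j).sum(1)
    var_new = (w * (var_j + (mu_j[None, :] - mu_new[:, None]) ** 2)).sum(1)
    return pi_new, mu_new, var_new
```

The key point is that the E-step compares Gaussian components with each other (means, variances, weights), not individual samples, which is what makes adding new images to a class cheap.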
Algorithm - learning
 Training
 For each label w and each training image I:
 Decompose the image (192 px × 128 px) into 8×8 regions with a sliding window moved every 2 pixels
 Compute the DCT of each window (8·8·3 = 192-d feature vector)
 Fit a mixture of 8 Gaussians to each image using EM
 Fit a mixture of 64 Gaussians to each label using hierarchical EM
 $P_{X|W}(x|I) = \sum_{k=1}^{8} \pi_I^k\, G(x, \mu_I^k, \Sigma_I^k)$
 $P_{X|W}(x|w) = \sum_{k=1}^{64} \pi_w^k\, G(x, \mu_w^k, \Sigma_w^k)$
Algorithm – annotation, retrieval
 Annotation
 Get the n (= 5) best labels for image I
 Extract features from the image ((192·128/2) vectors × 192 dimensions)
 Compute the log-likelihood of each label and choose the best n:
  $\log P_{X|W}(\mathbf{x}|i) = \sum_{x \in \mathbf{x}} \log P_{X|W}(x|i)$
 Retrieval
 For test images $I_T$ and label w: annotate $I_T$ and rank the images by decreasing $P_{X|W}(\mathbf{x}|i)$
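Ranking labels by summed log-likelihood can be sketched directly. A 1-D stand-in for the 64-component class mixtures; the GMM parameters in the test are illustrative:

```python
import numpy as np

def gmm_logpdf(x, pi, mu, var):
    """Log density of a 1-D GMM at points x, computed in log space."""
    logp = (-0.5 * (x[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
    return np.logaddexp.reduce(logp, axis=1)

def best_labels(feats, class_gmms, n=2):
    """Score each label i by sum_x log P_{X|W}(x|i) over the image's
    feature vectors and return the n best label indices."""
    scores = [gmm_logpdf(feats, *g).sum() for g in class_gmms]
    return list(np.argsort(scores)[::-1][:n])
```

Summing log-likelihoods over feature vectors corresponds to treating the window features of one image as independent samples from the class density.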
Results-quantitative
 Database: Corel 5k
 Precision: $\dfrac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}$
 Recall: $\dfrac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}$
 4000 training images, 1000 test images
 Per-label annotation metrics:
  $\text{recall} = \dfrac{w_C}{w_H}$, $\text{precision} = \dfrac{w_C}{w_{auto}}$
  where $w_C$ = correctly annotated images, $w_H$ = images annotated by humans, $w_{auto}$ = automatically annotated images
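The per-label metrics above amount to set intersections over annotated image sets. A tiny helper (the image IDs in the test are made up):

```python
def precision_recall(auto, human):
    """Per-label precision and recall over annotated image sets:
    precision = |correct| / |auto-annotated|,
    recall    = |correct| / |human-annotated|."""
    correct = len(set(auto) & set(human))
    return correct / len(auto), correct / len(human)
```

With a fixed number of labels per image (as in this method), precision is capped for labels that humans use more often than the fixed budget allows.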
Results-quantitative
Annotation (compared methods 1–6):

                           1     2     3     4     5     6
Words with recall > 0      140   121   110   125   90    131
Mean recall per word       0.27  0.25  0.25  0.26  0.23  0.27
Mean precision per word    0.25  0.24  0.23  0.23  0.20  0.23
Results-quantitative
Retrieval (compared methods 1–6):

                                     1     2     3     4     5     6
Mean precision, all words            0.23  0.21  0.20  0.21  0.19  0.24
Mean precision, words w/ recall > 0  0.45  0.40  0.40  0.41  0.37  0.41
Results-qualitative
plane jet f-14 sky
-----------------------
sky plane clouds
smoke snow
coast waves
water hills
-----------------------
water sky ocean
mountain clouds
polar bear bars
cage
-----------------------
bear snow texture
sunrise closeup
people cheese
market street
-----------------------
people wall sand
flower bird
Results-qualitative
[Example results for: "Blooms", "Mountain", "Pool", "Smoke", "Woman"]
Conclusions
 Pros
 Good segmentation as a byproduct of annotation
 Great for general concepts with many samples
 Only weakly annotated data is required (multiple instance learning)
 Allows hierarchical representation (adding images, speed)
 Cons
 Fixed number of labels per image
 Learning is time-consuming
 Parameter tuning is time-consuming
 Weakly represented classes can be associated with wrong concepts
Resources
 Carneiro, G., Chan, A.B., Moreno, P.J., Vasconcelos, N.: Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 394–410 (2007).
 Gudivada, V.N., Raghavan, V.V.: Content-based image retrieval systems. IEEE Computer 28, 18–22 (1995).
 Belongie, S., Carson, C., Greenspan, H., Malik, J.: Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In: Proc. Sixth International Conference on Computer Vision, pp. 675–682. IEEE (1998).
 Cappé, O., Moulines, E.: On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 593–613 (2009).
 Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys 40, 1–60 (2008).
lukas.tencer@gmail.com
http://tencer.hustej.net
@lukastencer
accuratelyrandom.blogspot.com
facebook.com/lukas.tencer
Google labeling game