Semantic Retrieval and Automatic Annotation:
Linear Transformations, Correlation and Semantic Spaces
Jonathon Hare & Paul Lewis
School of Electronics and Computer Science
University of Southampton
Introduction and Motivation
• Introduce a new, simple linear-transform-based
annotation/retrieval technique
• Compare against a number of similar existing
techniques for automatic annotation & semantic
retrieval that:
• Represent images by a fixed-length histogram (of
visual-term occurrences)
• Optionally use SVD for noise reduction
• Are deterministic (no randomness)
• Are (relatively) computationally efficient
• Reflect on real-world performance
Singular Value Decomposition
• SVD can be used to filter noise by producing a rank-k
estimate of the original data matrix
• The rank-k estimate is optimal in the least-squares
sense
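
A minimal sketch of this rank-k estimate in Python with numpy (the function name and the rank used in the comment are illustrative, not from the slides):

import numpy as np

def rank_k_estimate(X, k):
    """Best rank-k approximation of X in the least-squares (Frobenius) sense."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example: denoise a 500 x 4000 term-by-image matrix, keeping (say) 100 singular values
# F_hat = rank_k_estimate(F, 100)
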
Nomenclature
• F is a visual-term occurrence matrix
(columns represent images, rows visual-
terms)
• W is a keyword occurrence matrix
(columns represent images, rows keywords)
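
As an illustration of this nomenclature, a minimal sketch of how F and W could be assembled, assuming each image is described by a list of term indices (that input format is an assumption, not from the slides):

import numpy as np

def occurrence_matrix(term_lists, vocab_size):
    """One row per term, one column per image; entries count term occurrences."""
    M = np.zeros((vocab_size, len(term_lists)))
    for j, terms in enumerate(term_lists):
        for t in terms:
            M[t, j] += 1
    return M

# F = occurrence_matrix(visual_term_lists, 500)          # visual terms x images
# W = occurrence_matrix(keyword_lists, vocabulary_size)  # keywords x images
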
Technique: linear transform
Assume that visual-term occurrence vectors can be
related to keyword occurrence vectors by a simple
linear transformation, T.
FT = W
Given a training set with known F and W, T can be estimated
using the pseudo-inverse (computed via the SVD, which allows
noise reduction); the unknown W* for unannotated images can
then be calculated from their F* and T.
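
A minimal numpy sketch of this technique. Note that the nomenclature slide puts images in columns, so for FT = W to be dimensionally consistent this sketch assumes rows index images (i.e. it works with the transposes of F and W); the retained rank k would be tuned on a validation set:

import numpy as np

def estimate_transform(F, W, k):
    """Estimate T such that F @ T ~= W via a rank-k (truncated-SVD) pseudo-inverse.

    F: (n_images, n_visual_terms) training visual-term occurrences (images in rows)
    W: (n_images, n_keywords)     training keyword occurrences
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    F_pinv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T  # truncated pseudo-inverse
    return F_pinv @ W

def annotate(F_new, T):
    """Predicted keyword scores W* = F* T for unannotated images."""
    return F_new @ T
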
Technique: Semantic Spaces
• Based around the factorisation [F; W] = TD, where [F; W] is
the visual-term matrix stacked on top of the keyword matrix
• Calculated using truncated SVD
• Rows of T represent coordinates of the features
and words in a vector space
• Columns of D represent coordinates of images in
the same space
• Similar objects have similar locations in the space, so it
is possible to rank images by their distance to a given word
Hare, J. S., Lewis, P. H., Enser, P. G. B., and Sandom, C. J., “A Linear-Algebraic Technique with an
Application in Semantic Image Retrieval,” in CIVR 2006, Sundaram, H., Naphade, M., Smith, J. R., and
Rui, Y., eds., LNCS 4071, 31–40, Springer (2006).
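
A minimal sketch of the semantic-space construction using the nomenclature above (images in columns). How the singular values are split between T and D follows a common LSA convention and is an assumption here, as is the use of cosine similarity rather than a distance for the ranking:

import numpy as np

def build_semantic_space(F, W, k):
    """Truncated SVD of the stacked matrix [F; W] ~= T D.

    Rows of T locate the visual terms (first) and keywords (after them) in the
    k-dimensional space; columns of D locate the training images.
    """
    O = np.vstack([F, W])
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    T = U[:, :k]
    D = np.diag(s[:k]) @ Vt[:k, :]
    return T, D

def rank_images_for_word(T, D, word_index, n_visual_terms):
    """Rank the image columns of D by cosine similarity to a keyword's row of T."""
    w = T[n_visual_terms + word_index]
    sims = (w @ D) / (np.linalg.norm(w) * np.linalg.norm(D, axis=0) + 1e-12)
    return np.argsort(-sims)
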
Technique: Correlation
• Pan et al. defined four techniques for building
translation tables between visual terms and keywords
[i.e. the elements of the table/matrix represent
p(w_i, f_j)].
• The Corr method used the co-occurrence product WFᵀ to
build the table
• The Cos method used the cosine similarity between the
occurrence vectors of w_i and f_j
• The SVDCorr and SVDCos methods filtered the Corr and
Cos tables by reducing their rank with the SVD
Pan, J.-Y., Yang, H.-J., Duygulu, P., and Faloutsos, C., “Automatic image captioning,” IEEE International Conference
on Multimedia and Expo 2004 (ICME ’04), Vol. 3 (27-30 June 2004).
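
A minimal sketch of the four tables under the nomenclature above; the exact normalisation used by Pan et al. may differ, so treat this as an approximation rather than their definition:

import numpy as np

def translation_tables(F, W, k):
    """Keyword-by-visual-term translation tables (Corr, Cos, SVDCorr, SVDCos).

    F: (n_visual_terms, n_images), W: (n_keywords, n_images).
    """
    corr = W @ F.T                                           # co-occurrence counts
    cos = corr / (np.linalg.norm(W, axis=1, keepdims=True)   # cosine of w_i and f_j vectors
                  * np.linalg.norm(F, axis=1, keepdims=True).T + 1e-12)

    def rank_k(X):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    return {"Corr": corr, "Cos": cos, "SVDCorr": rank_k(corr), "SVDCos": rank_k(cos)}
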
Technique Summary
Technique          Variables                                    Notes
Transform          feature-weighting, dimensionality reduction  Words independent
Corr, Cos          feature-weighting                            Words independent
SVDCorr, SVDCos    feature-weighting, dimensionality reduction  Words independent
Semantic Space     feature-weighting, dimensionality reduction  Inter-word dependencies
Image Features
• Two types of visual-term feature considered:
• Segmented-blob based (using shape, colour, texture
descriptors) [500 terms]
• Quantised DCT-based [500 terms]
Experimental Protocol
• 5000-image Corel data-set
• 4000 training images
• 500 validation images (for optimising reduced rank)
• 500 test images
• Two weighting types: unweighted and IDF
• Evaluation performed as a hypothetical retrieval
experiment
• Unannotated test images retrieved using each word in
turn as a query
• Mean average precision (mAP) used for comparison
(evaluation loop sketched below)
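
A minimal sketch of the retrieval-style evaluation (argument names and the array layout are assumptions; the scores could come from any of the techniques above):

import numpy as np

def average_precision(ranked_relevance):
    """AP for one keyword query, given relevance of the test images in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hits * rel).sum() / rel.sum())

def mean_average_precision(scores, ground_truth):
    """mAP over all keyword queries.

    scores:       (n_keywords, n_test_images) predicted annotation strengths
    ground_truth: (n_keywords, n_test_images) true annotations (0/1)
    """
    aps = []
    for s, gt in zip(scores, ground_truth):
        order = np.argsort(-s)            # rank test images for this keyword
        aps.append(average_precision(gt[order]))
    return float(np.mean(aps))
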
Results
Real-world performance
• ~20% mAP might sound low, but in practice many
queries work quite well (reasonable initial precision,
though it drops off quickly)
• Choice of image features is very important
• It would be difficult to learn the concept of “sun”
from grey-level SIFT features!
• See the paper for some more reflection on real-world
performance...
Conclusions
• We have described a set of auto-annotation/semantic
retrieval algorithms
• Performance is less than the state-of-the-art, but this is
partially explained by the use of different image
features (see our MIR 2010 paper)
• However, the methods:
• Are computationally inexpensive (although cost grows
with the amount of training data)
• Are deterministic, and don’t rely on algorithms such
as EM which might get stuck in local minima/maxima
