Lecture 03 internet video search

What makes a cow a cow?
Google knows
How do you know? Because other people know.

We think we know
“because it has four legs”

But the fact of the matter:
not all cows show four legs
nor are they brown …
not all…

What is the object in the middle?

No segmentation …
Not even the pixel values of the object …

Where is evidence for an object?

Uijlings IJCV 2011

What is the visual extent of an object?

Uijlings IJCV 2012

Where: exhaustive search

Look everywhere for the object window
Imposes computational constraints on:
• The number of locations and windows
(hence a coarse grid / fixed aspect ratio)
• The evaluation cost per location
(hence weak features / classifiers)
Impressive, but slow.

Viola IJCV 2004 Dalal CVPR 2005
Felzenszwalb PAMI 2010 Vedaldi ICCV 2009
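To make the cost of exhaustive search concrete, here is a minimal sketch that counts sliding-window positions. The image size, window sizes and strides are illustrative assumptions, not values from the cited papers.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Count sliding-window positions over all (width, height) window sizes."""
    total = 0
    for win_w, win_h in win_sizes:
        if win_w > img_w or win_h > img_h:
            continue  # this window does not fit in the image
        cols = (img_w - win_w) // stride + 1
        rows = (img_h - win_h) // stride + 1
        total += cols * rows
    return total

# Assumed setup: 640x480 image, fixed 1:2 aspect ratio at three scales.
sizes = [(64, 128), (128, 256), (256, 512)]
coarse = count_windows(640, 480, sizes, stride=8)  # coarse 8-pixel grid
dense = count_windows(640, 480, sizes, stride=1)   # every pixel position
print(coarse, dense)
```

Even with one aspect ratio, the dense grid yields hundreds of thousands of windows, which is why each evaluation has to be cheap.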

Where: the need for a hierarchy

An image is intrinsically hierarchical.

Gu CVPR 2009

Selective search

Windows formed by hierarchical grouping.

Adjacent grouping on color/texture/shape cues.
Felzenszwalb 2004 Van de Sande ICCV 2011
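The grouping idea can be sketched as follows. This is a toy version under stated assumptions: regions carry a single scalar "color" value, similarity is just the color difference, and the initial regions and adjacency are given by hand. The published method combines several color, texture and shape cues on top of an initial over-segmentation.

```python
def merge_boxes(a, b):
    """Bounding box (x0, y0, x1, y1) enclosing both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def color_gap(regions, pair):
    """Toy dissimilarity: absolute difference of the mean colors."""
    a, b = sorted(pair)
    return abs(regions[a]["color"] - regions[b]["color"])

def hierarchical_grouping(regions, neighbours):
    """Greedily merge the most similar adjacent regions; return every box seen."""
    regions = {k: dict(v) for k, v in regions.items()}
    neighbours = set(neighbours)
    windows = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while neighbours:
        pair = min(neighbours, key=lambda p: (color_gap(regions, p), sorted(p)))
        a, b = sorted(pair)
        ra, rb = regions.pop(a), regions.pop(b)
        size = ra["size"] + rb["size"]
        regions[next_id] = {
            "box": merge_boxes(ra["box"], rb["box"]),
            "color": (ra["color"] * ra["size"] + rb["color"] * rb["size"]) / size,
            "size": size,
        }
        windows.append(regions[next_id]["box"])
        # Neighbours of either merged region become neighbours of the new one.
        updated = set()
        for p in neighbours:
            if p == pair:
                continue
            q = frozenset(next_id if i in pair else i for i in p)
            if len(q) == 2:
                updated.add(q)
        neighbours = updated
        next_id += 1
    return windows

# Hypothetical input: three regions in a row, two of them similar in color.
regions = {
    1: {"box": (0, 0, 10, 10), "color": 0.10, "size": 100},
    2: {"box": (10, 0, 20, 10), "color": 0.15, "size": 100},
    3: {"box": (20, 0, 30, 10), "color": 0.90, "size": 100},
}
neighbours = {frozenset({1, 2}), frozenset({2, 3})}
windows = hierarchical_grouping(regions, neighbours)
print(windows)
```

Every level of the hierarchy contributes a candidate window, so objects at all scales get a box without scanning every location.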

Selective search example


Average best overlap ~88%
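A minimal sketch of how an average best overlap score could be computed, assuming that overlap means intersection over union of boxes. The boxes below are made-up examples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not intersect
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_best_overlap(ground_truth, proposals):
    """Mean over ground-truth boxes of the best overlap with any proposal."""
    best = [max(iou(gt, p) for p in proposals) for gt in ground_truth]
    return sum(best) / len(best)

# Hypothetical annotations and proposed windows.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(0, 0, 10, 8), (20, 20, 30, 30), (50, 50, 60, 60)]
abo = average_best_overlap(ground_truth, proposals)
print(abo)
```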

… looks like this

High recall
cat

Pairs of concepts

Uijlings ICCV demo 2012

6. Conclusion

Selective search gives good localization.

Localization needed to understand pairs of concepts.

7. Data and metadata

http://bit.ly/visualsearchengines

How many concepts?

Li Fei Fei slide. Biederman, Psychological Rev. 1987

How many examples?

Once you have more than 100 to 1000 examples, success is there.

Amateur labeling

LabelMe 290,000 object annotations
Russell IJCV 2008

Tag relevance by social annotation

Consistency in tagging between users on similar images.

Xirong Li, TMM 2009
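The consistency idea can be sketched as neighbour voting: a tag is trusted for an image when its visual neighbours carry it more often than chance. The feature vectors, the value of k and the subtraction of a frequency prior are assumptions made for this illustration.

```python
import math

def tag_relevance(query_vec, query_tag, collection, k):
    """Votes for the tag among the k visually nearest images, minus the prior."""
    ranked = sorted(collection, key=lambda item: math.dist(query_vec, item["vec"]))
    votes = sum(query_tag in item["tags"] for item in ranked[:k])
    # Expected number of votes if the k neighbours were drawn at random.
    prior = k * sum(query_tag in item["tags"] for item in collection) / len(collection)
    return votes - prior

# Hypothetical collection with 2-D visual features and user tags.
collection = [
    {"vec": (0.0, 0.0), "tags": {"snow", "mountain"}},
    {"vec": (0.1, 0.0), "tags": {"snow"}},
    {"vec": (0.0, 0.1), "tags": {"snow", "ski"}},
    {"vec": (5.0, 5.0), "tags": {"beach"}},
    {"vec": (5.1, 5.0), "tags": {"beach", "snow"}},
    {"vec": (5.0, 5.2), "tags": {"rainbow"}},
]
snowy = tag_relevance((0.05, 0.05), "snow", collection, k=3)
beachy = tag_relevance((5.0, 5.1), "snow", collection, k=3)
print(snowy, beachy)
```

A positive score means the neighbours agree with the tag more than its overall frequency would predict; a negative score suggests the tag is probably noise for that image.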

Tag relevance by social annotation

Pretty good for snow, not so good for rainbow.

Social negative bootstrapping

Negative images are as important as positive images to learn.
Not just random negative images, but close ones.
• We want to learn positive examples from an expert, and obtain as many negative samples as we like for free from the web.
• We iteratively aim for the hardest negatives.

Xirong Li ACM MM 2009
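The iterative selection of hard negatives can be sketched as below. This is a toy illustration, not the published method: the nearest-centroid scorer, the 2-D features and the batch size are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(positives, negatives):
    """Toy nearest-centroid scorer; a higher score means 'looks more positive'."""
    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, neg_c) - math.dist(x, pos_c)

def negative_bootstrap(positives, negative_pool, rounds, per_round, seed=0):
    rng = random.Random(seed)
    pool = list(negative_pool)
    rng.shuffle(pool)
    # Start from a random batch of negatives.
    negatives, pool = pool[:per_round], pool[per_round:]
    score = train(positives, negatives)
    for _ in range(rounds):
        if not pool:
            break
        # Hardest negatives: those the current model scores most positive.
        pool.sort(key=score, reverse=True)
        negatives, pool = negatives + pool[:per_round], pool[per_round:]
        score = train(positives, negatives)
    return score, negatives

# Hypothetical data: expert positives, plus web negatives that are close or far.
positives = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
close_negs = [(1.5, 1.5), (1.4, 1.6)]
far_negs = [(9.0, 9.0), (8.0, 9.0), (9.0, 8.0), (8.5, 8.5), (9.5, 9.0), (8.0, 8.0)]
score, negatives = negative_bootstrap(positives, close_negs + far_negs,
                                      rounds=1, per_round=2)
print(negatives)
```

After one round the close negatives end up in the training set, whichever random batch the loop started from.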

Social negative bootstrapping

Xirong Li ICMR 2011

Acknowledgement
WordNet friends

Christiane Fellbaum
Dan Osherson (Princeton), Kai Li (Princeton), Alex Berg (Columbia)
Jia Deng (Princeton/Stanford), Hao Su (Stanford)

PASCAL VOC

The PASCAL Visual Object Classes (VOC).

500,000 images downloaded from Flickr.
Queries like “car”, “vehicle”, “street”, “downtown”.
10,000 objects, 25,000 labels.

Mark Everingham, Luc Van Gool, Chris Williams, John Winn,
Andrew Zisserman

7. Conclusion

Data is king.

The data are beginning to reflect human cognitive
capacity [at a basic level].

Harvesting social data requires advanced computer
vision control.

PASCAL 2010
Aeroplane Bicycle Bird Boat Bottle

Bus Car Cat Chair Cow

True Positives - Person
UOCTTI_LSVM_MDPM

NLPR_HOGLBP_MC_LCEGCHLC

NUS_HOGLBP_CTX_CLS_RESCORE_V2

False Positives - Person
UOCTTI_LSVM_MDPM



Non-birds & non-boats

Non-bird images:
Highest ranked

Non-boat images:
Highest ranked

Water texture and scene composition?

True Positives - Motorbike
MITUCLA_HIERARCHY



False Positives - Motorbike
MITUCLA_HIERARCHY



Object localization 2008-2010

[Figure: bar chart of max AP (%) per category for 2008, 2009 and 2010, y-axis from 0 to 60. Categories: tvmonitor, pottedplant, bottle, motorbike, diningtable, horse, sofa, train, person, sheep, aeroplane, bicycle, cow, cat, boat, bus, dog, bird, car, chair.]

Results on 2008 data improve for 2010 methods for all
categories, by over 100% for some categories.

Concept detection

Aircraft

Beach

Mountain

People marching

Police/Security

Flower

Measuring performance

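[Figure: a ranked result list (1 to 5) next to a Venn diagram of the set of relevant items and the set of retrieved items, which overlap in the set of relevant retrieved items.]
• Precision: relevant retrieved items / retrieved items
• Recall: relevant retrieved items / relevant items
Precision and recall tend to have an inverse relationship.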
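Precision and recall follow directly from the sets of relevant and retrieved items. A minimal sketch, with made-up item identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # the relevant retrieved items
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked result list and ground truth.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f", "g"}
top2 = precision_recall(ranked[:2], relevant)
top5 = precision_recall(ranked[:5], relevant)
print(top2, top5)
```

Going deeper into the ranked list raises recall here while precision drops, which is the inverse relationship in practice.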

UvA-MediaMill@TRECVID

• other systems

Snoek et al, TRECVID 04-10

Performance doubled in just 3 years

• 36 concept detectors
Even when using training data of different origin, great progress.
But the number of concepts is still limited.

Snoek & Smeulders, IEEE Computer 2010

8. Conclusion

Impressive results, improving quickly every year.

Very valuable competition.

Best non-classes start to make sense!

SURF based on integral images

Integral images were introduced by Viola & Jones in the context of face
detection: windows slide left to right and top to bottom over the
integral image.
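A minimal sketch of an integral image and the constant-time box sum it enables. The zero-padded layout and the function names are choices made for this illustration.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in columns x0..x1-1 and rows y0..y1-1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3), box_sum(ii, 1, 1, 3, 3))
```

The cost of a box sum does not depend on the size of the box, which is what makes box filters cheap at every scale.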


SURF principle

Approximate Gaussian second-order derivatives with box filters.
[Figure: the Gaussian derivative filters Lxx, Lyy and Lxy next to their box-filter approximations.]
LREC 2004, 26 May 2004, Lisbon

SURF speed

Computation time: 6 times faster than DoG (~100 ms).
Independent of filter scale.

Dense descriptor extraction

Pixel-wise Responses Final Descriptor

A factor 16 speed improvement,
and another factor 2 from the use of matrix libraries.

Projection: Random Forest

Binary decision trees
Moosmann et al. 2008
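Projection with binary decision trees can be sketched as follows: each descriptor is dropped down every tree, and the leaf it reaches acts as its visual word. This toy version uses untrained trees with random tests, which is an assumption for illustration only; the cited work learns the trees from data.

```python
import random

def make_tree(depth, dims, rng):
    """One random test (dimension, threshold) per internal node, in heap order."""
    return [(rng.randrange(dims), rng.random()) for _ in range(2 ** depth - 1)]

def leaf_index(tree, x):
    """Walk down from the root; return the leaf reached (0 .. 2**depth - 1)."""
    node = 0
    while node < len(tree):
        dim, thr = tree[node]
        node = 2 * node + 1 + (x[dim] > thr)  # left child if False, right if True
    return node - len(tree)

def bag_of_words(descriptors, forest):
    """Concatenate the per-tree leaf histograms into one image representation."""
    leaves = len(forest[0]) + 1
    hist = [0] * (leaves * len(forest))
    for x in descriptors:
        for t, tree in enumerate(forest):
            hist[t * leaves + leaf_index(tree, x)] += 1
    return hist

rng = random.Random(0)
forest = [make_tree(depth=3, dims=8, rng=rng) for _ in range(4)]
descriptors = [[rng.random() for _ in range(8)] for _ in range(10)]
hist = bag_of_words(descriptors, forest)
print(hist)
```

Each descriptor costs only a handful of comparisons per tree, instead of a distance computation to every codeword as in nearest-neighbour assignment.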

Real-time bag of words
Stage                  Setting                                 Time (ms)
Descriptor extraction  D-SURF 2x2                              15
Projection             <empty> pre-projection, Random Forest   10
Classification         SVM with RBF kernel                     13

MAP: 0.370
Total computation time is 38 milliseconds per image

26 frames per second on a normal PC for any 20 concepts.
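The frame rate follows from the total time per image. In the sketch below, assigning the three timings 15, 10 and 13 to extraction, projection and classification is an assumption based on their order on the slide.

```python
# Assumed mapping of the three per-stage timings (milliseconds per image).
stage_ms = {"descriptor extraction": 15, "projection": 10, "classification": 13}

total_ms = sum(stage_ms.values())
frames_per_second = 1000 / total_ms
print(total_ms, int(frames_per_second))
```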

9. Conclusion

SURF is scale and rotation invariant.
Fast due to the use of integral images.
Download: http://www.vision.ee.ethz.ch/~surf/
D-SURF extraction is 6x faster than Dense-SIFT.
Projection using a Random Forest is 50x faster than NN.

Internet Video Search: the beginning

[Figure: course overview linking measuring video features, concept detection, lexicon learning, telling stories and browsing video.]
