Lecture 03: Internet Video Search. Presentation Transcript
6: Location and context
What makes a cow a cow? Google knows. How do you know? Because other people know. We think we know "because it has four legs". But the fact of the matter: not all cows show four legs, nor are they all brown … not all…
What is the object in the middle? No segmentation … not even the pixel values of the object …
Where is evidence for an object? Uijlings IJCV 2011
What is the visual extent of an object? Uijlings IJCV 2012
Where: exhaustive search. Look everywhere for the object window. This imposes computational constraints on: the very many locations and windows (hence a coarse grid and fixed aspect ratio), and the evaluation cost per location (hence weak features/classifiers). Impressive, but it takes long. Viola IJCV 2004; Dalal CVPR 2005; Felzenszwalb PAMI 2010; Vedaldi ICCV 2009
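The cost structure of exhaustive search can be sketched as follows. This is a toy illustration, not any of the cited detectors: the window size, stride, and scoring function are hypothetical placeholders; the point is that the number of classifier evaluations is the number of grid positions, which is why exhaustive search needs a coarse grid and cheap per-window classifiers.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Enumerate all window positions (x, y, w, h) on a coarse grid."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

def exhaustive_search(img_w, img_h, score_fn, win_w=64, win_h=64, stride=16):
    """Score every window with the (stand-in) classifier; keep the best."""
    return max(sliding_windows(img_w, img_h, win_w, win_h, stride),
               key=score_fn)
```

On a 128x128 image with a 64x64 window and stride 16 this already evaluates 25 windows; adding scales and aspect ratios multiplies that count further.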
Where: the need for a hierarchy. An image is intrinsically hierarchical. Gu CVPR 2009
Selective search. Windows are formed by hierarchical grouping: adjacent regions are grouped on color, texture, and shape cues. Felzenszwalb 2004; Van de Sande ICCV 2011
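The hierarchical grouping idea can be sketched as a greedy merge loop: repeatedly fuse the most similar pair of adjacent regions, and emit every intermediate merge as a candidate window. This is a minimal sketch, not the published algorithm; regions are plain bounding boxes, and the similarity values (including the fixed placeholder for newly created pairs) stand in for the real color/texture/shape cues.

```python
def union_box(a, b):
    """Tight bounding box around boxes a and b, given as (x0, y0, x1, y1)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def selective_search(boxes, sim):
    """Greedily merge the most similar adjacent pair; collect all windows.

    boxes: dict id -> (x0, y0, x1, y1)
    sim:   dict frozenset({i, j}) -> similarity of adjacent regions i, j
    """
    proposals = list(boxes.values())        # initial regions are proposals too
    boxes, sim = dict(boxes), dict(sim)
    next_id = max(boxes) + 1
    while sim:
        pair = max(sim, key=sim.get)        # most similar adjacent pair
        i, j = tuple(pair)
        merged = union_box(boxes.pop(i), boxes.pop(j))
        boxes[next_id] = merged
        proposals.append(merged)            # every merge yields a window
        # re-wire neighbours of i and j to the new region
        neighbours = {k for p in list(sim) if p & pair for k in p} - pair
        sim = {p: s for p, s in sim.items() if not (p & pair)}
        for k in neighbours:
            if k in boxes and k != next_id:
                sim[frozenset((k, next_id))] = 0.5   # placeholder similarity
        next_id += 1
    return proposals
```

Because grouping continues until one region covers the image, the proposal set spans all scales, which is what gives the high recall reported on the next slides.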
Selective search example
An average best overlap of ~88% looks like this: high recall (example: cat).
Pairs of concepts Uijlings ICCV demo 2012
6. Conclusion. Selective search gives good localization. Localization is needed to understand pairs of concepts.
7: Data and metadata. http://bit.ly/visualsearchengines
How many concepts? Li Fei-Fei slide. Biederman, Psychological Rev. 1987
How many examples? Once you have 100-1000 examples, success follows.
Amateur labeling. LabelMe: 290,000 object annotations. Russell IJCV 2008
Tag relevance by social annotation. Consistency in tagging between users on similar images. Xirong Li, TMM 2009
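The neighbour-voting idea behind this slide can be sketched as: a tag on an image is more relevant if other users also assigned it to visually similar images, beyond what the tag's overall frequency in the collection would predict. This is a hedged sketch in the spirit of the cited work; the function name and the prior-subtraction form are assumptions, not the paper's exact estimator.

```python
def tag_relevance(tag, neighbour_tags, all_tags, k=None):
    """Votes for `tag` among the k visual neighbours of an image, minus the
    expected votes under the tag's overall prior frequency.

    neighbour_tags: list of tag sets of the k nearest visual neighbours
    all_tags:       list of tag sets over the whole collection
    """
    k = k if k is not None else len(neighbour_tags)
    votes = sum(tag in tags for tags in neighbour_tags[:k])
    prior = sum(tag in tags for tags in all_tags) / len(all_tags)
    return votes - k * prior
```

A frequently co-tagged concept like "snow" accumulates many neighbour votes above its prior; a subjectively tagged one like "rainbow" does not, matching the next slide.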
Tag relevance by social annotation: pretty good for "snow", not so good for "rainbow".
Social negative bootstrapping. Negative images are as important as positive images for learning. Not just random negative images, but close ones. We learn positive examples from an expert, and obtain as many negative samples as we like for free from the web. We iteratively aim for the hardest negatives. Xirong Li ACM MM 2009
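The iterative selection of hardest negatives can be sketched as follows. This is a toy illustration of the idea, not the cited system: the candidate pool and scoring function are placeholders, and the retraining step that would update the scorer between rounds is only marked by a comment.

```python
def bootstrap_negatives(candidates, score_fn, rounds=3, per_round=2):
    """Over several rounds, pick the negatives the current model scores
    highest (the 'hardest') from a free pool of web/social negatives."""
    pool = list(candidates)
    selected = []
    for _ in range(rounds):
        pool.sort(key=score_fn, reverse=True)    # hardest negatives first
        hardest, pool = pool[:per_round], pool[per_round:]
        selected.extend(hardest)
        # a real system would retrain the classifier here, changing score_fn
    return selected
```

The contrast with random sampling is the sort: each round deliberately takes the negatives the classifier currently confuses with positives.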
Social negative bootstrapping Xirong Li ICMR 2011
Knowledge ontology ImageNet
Acknowledgement: WordNet friends. Christiane Fellbaum, Dan Osherson, Kai Li (Princeton); Alex Berg (Columbia); Jia Deng (Princeton/Stanford); Hao Su (Stanford)
PASCAL VOC: the PASCAL Visual Object Classes. 500,000 images downloaded from Flickr with queries like "car", "vehicle", "street", "downtown". 10,000 objects, 25,000 labels. Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
7. Conclusion. Data is king. The data are beginning to reflect the human cognition capacity [at a basic level]. Harvesting social data requires advanced computer vision control.
PASCAL 2010Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow
True Positives - Person UOCTTI_LSVM_MDPM NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
False Positives - Person UOCTTI_LSVM_MDPM NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
Non-birds & non-boats. Highest-ranked non-bird images; highest-ranked non-boat images. Water texture and scene composition?
Object localization 2008-2010. [Chart: max AP (%) per category for 2008, 2009, and 2010 methods over the 20 PASCAL categories.] Results on 2008 data improve for 2010 methods for all categories, by over 100% for some categories.
TRECVID evaluation standard
Concept detection Aircraft Beach Mountain People marching Police/Security Flower
Measuring performance. [Venn diagram: set of relevant items, set of retrieved items.] Precision = relevant retrieved items / retrieved items. Recall = relevant retrieved items / relevant items. Precision and recall have an inverse relationship.
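The two set-overlap definitions on this slide translate directly into code; a minimal sketch, with the function name chosen here for illustration:

```python
def precision_recall(relevant, retrieved):
    """Precision = |relevant ∩ retrieved| / |retrieved|,
    recall    = |relevant ∩ retrieved| / |relevant|."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

The inverse relationship follows from the denominators: retrieving more items can only keep or raise recall, while each extra non-relevant item lowers precision.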
UvA-MediaMill @ TRECVID vs. other systems. Snoek et al., TRECVID 2004-2010
Performance doubled in just 3 years, with 36 concept detectors. Even when using training data of a different origin, great progress. But the number of concepts is still limited. Snoek & Smeulders, IEEE Computer 2010
8. Conclusion. Impressive results, improving quickly year over year. A very valuable competition. The best "non-classes" start to make sense!
SURF is based on integral images. Introduced by Viola & Jones in the context of face detection with sliding windows: the integral image is computed in one left-to-right / top-to-bottom pass.
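The integral image (summed-area table) that SURF's box filters rely on can be sketched as: one left-to-right / top-to-bottom pass builds the table, after which any rectangle sum costs four lookups regardless of the rectangle's size. A minimal sketch with images as nested lists:

```python
def integral_image(img):
    """ii[y][x] = sum of img over all rows <= y and columns <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]                       # running sum of this row
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of the image over the inclusive rectangle [x0..x1] x [y0..y1],
    using four corner lookups in the integral image."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total
```

Constant-time box sums are what make the next slide's box-filter approximations of Gaussian derivatives cheap at any scale.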
SURF principle. Approximate the Gaussian second-order derivatives Lxx, Lyy, Lxy with box filters. (LREC 2004, 26 May, Lisbon)
SURF speed. Computation time: 6 times faster than DoG (~100 msec). Independent of filter scale.
Dense descriptor extraction: pixel-wise responses to final descriptor. Factor 16 speed improvement; another factor 2 by the use of matrix libraries.
Projection: Random Forest. Binary decision trees. Moosmann et al. 2008
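Projecting a descriptor onto visual words with randomized binary trees can be sketched as follows, in the spirit of the cited work: each tree routes the descriptor by thresholding one feature per internal node, and the leaf it lands in is the visual word. The tree encoding and the hand-set splits here are illustrative assumptions, not trained parameters.

```python
def tree_leaf(tree, x):
    """Route descriptor x through a tree of nested
    (feature_index, threshold, left, right) tuples; an int node is a leaf
    index, i.e. the visual word."""
    while not isinstance(tree, int):
        feat, thresh, left, right = tree
        tree = left if x[feat] < thresh else right
    return tree

def forest_project(forest, x):
    """One visual-word vote per tree; concatenating the votes over all
    trees gives the bag-of-words projection of descriptor x."""
    return [tree_leaf(t, x) for t in forest]
```

Each lookup costs only tree-depth comparisons, which is why this is far faster than nearest-neighbour assignment against a full codebook (the 50x speedup claimed in the conclusion slide).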
Real-time bag of words. Descriptor extraction: D-SURF (2x2), 15 ms. Projection (pre-projection + actual projection): Random Forest, 10 ms. Classification: SVM with RBF kernel, 13 ms. MAP: 0.370. Total computation time is 38 milliseconds per image: 26 frames per second on a normal PC for any of 20 concepts.
9. Conclusion. SURF is scale and rotation invariant, and fast due to the use of integral images. Download: http://www.vision.ee.ethz.ch/~surf/. D-SURF extraction is 6x faster than Dense-SIFT. Projection using a Random Forest is 50x faster than NN.
Internet Video Search: the beginning. [Pipeline diagram: video features, concept detection, lexicon learning, measuring, telling stories, browsing video.]