Lecture 03 internet video search
 


Uploaded as Adobe PDF. Usage rights: CC Attribution-ShareAlike License.


    Presentation Transcript

    • 6: Location and context
    • What makes a cow a cow? Google knows, because other people know. How do you know? We think we know: "because it has four legs." But the fact of the matter: not all cows show four legs, nor are they all brown … not all…
    • What is the object in the middle? No segmentation … not even the pixel values of the object …
    • Where is evidence for an object? Uijlings IJCV 2011
    • Where is evidence for an object? Uijlings IJCV 2011
    • What is the visual extent of an object? Uijlings IJCV 2012
    • Where: exhaustive search. Look everywhere for the object window. This imposes computational constraints: very many locations and windows (coarse grid, fixed aspect ratio) and a limited evaluation cost per location (weak features/classifiers). Impressive, but takes long. Viola IJCV 2004, Dalal CVPR 2005, Felzenszwalb PAMI 2010, Vedaldi ICCV 2009
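A minimal sketch of the exhaustive sliding-window idea described above, assuming a toy scoring function in place of a real detector (all names here are illustrative, not from the lecture's code):

```python
# Hypothetical sketch of exhaustive search: enumerate every window position
# at a few scales on a coarse grid and score each with a (dummy) classifier.

def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window position on a coarse grid."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

def exhaustive_search(img_w, img_h, score_window, scales=(32, 64, 128), stride=8):
    """Score every window at every scale; return the best window and the count."""
    best, best_score = None, float("-inf")
    n_windows = 0
    for s in scales:
        for win in sliding_windows(img_w, img_h, s, s, stride):
            n_windows += 1
            sc = score_window(win)
            if sc > best_score:
                best, best_score = win, sc
    return best, n_windows

# Even a small 320x240 image yields over 2000 windows, which is why exhaustive
# search forces coarse grids and cheap per-window classifiers.
best, n = exhaustive_search(320, 240, lambda w: -abs(w[0] - 100) - abs(w[1] - 60))
print(best, n)
```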
    • Where: the need for a hierarchy. An image is intrinsically hierarchical. Gu CVPR 2009
    • Selective search. Windows formed by hierarchical grouping: adjacent regions grouped on color/texture/shape cues. Felzenszwalb 2004, Van de Sande ICCV 2011
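A hedged sketch of the hierarchical-grouping idea (not the actual selective search implementation): repeatedly merge the most similar adjacent regions and propose every intermediate region as a candidate window. Regions are simplified to 1-D intervals with one scalar "color" feature:

```python
# Toy hierarchical grouping: regions are (start, end, color) intervals.
# Every region produced along the way becomes a window proposal.

def hierarchical_grouping(regions):
    """regions: list of (start, end, color). Returns all proposed intervals."""
    regions = list(regions)
    proposals = [(s, e) for s, e, _ in regions]
    while len(regions) > 1:
        # find the most similar adjacent pair (smallest color difference)
        i = min(range(len(regions) - 1),
                key=lambda k: abs(regions[k][2] - regions[k + 1][2]))
        (s1, e1, c1), (s2, e2, c2) = regions[i], regions[i + 1]
        merged = (s1, e2, (c1 + c2) / 2)   # merge and average the feature
        regions[i:i + 2] = [merged]
        proposals.append((merged[0], merged[1]))
    return proposals

# four tiny regions: the two middle ones are most similar, so they merge first
props = hierarchical_grouping([(0, 2, 0.1), (2, 4, 0.5), (4, 6, 0.55), (6, 8, 0.9)])
print(props)
```

The grouping is bottom-up, so proposals cover objects at all scales, which is what gives the high recall mentioned on the next slides.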
    • Selective search example
    • Selective search example 11
    • High recall: average best overlap ~88% … (example: cat)
    • Pairs of concepts Uijlings ICCV demo 2012
    • 6. Conclusion. Selective search gives good localization. Localization is needed to understand pairs of concepts.
    • 7 Data and metadata http://bit.ly/visualsearchengines
    • How many concepts? Li Fei-Fei slide. Biederman, Psychological Rev. 1987
    • How many examples? Once you have 100–1000 examples, success follows.
    • Amateur labeling: LabelMe, 290,000 object annotations. Russell IJCV 2008
    • Tag relevance by social annotation: consistency in tagging between users on similar images. Xirong Li, TMM 2009
    • Tag relevance by social annotation: pretty good for snow, not so good for rainbow.
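A minimal sketch of the neighbor-voting intuition behind tag relevance: a tag is relevant to an image if its visual neighbors use that tag more often than chance. The neighbor lookup is stubbed out and the normalization is a simple "votes minus prior"; both are assumptions for illustration, not the paper's exact formula:

```python
# Toy tag-relevance scoring by neighbor voting over social annotations.

def tag_relevance(tag, neighbor_tags, all_tags, k):
    """neighbor_tags: tag sets of the k visual neighbors.
    all_tags: tag sets of the whole collection (used as the prior)."""
    votes = sum(1 for tags in neighbor_tags if tag in tags)
    prior = sum(1 for tags in all_tags if tag in tags) / len(all_tags)
    return votes - k * prior   # votes above chance level

collection = [{"snow", "ski"}, {"snow"}, {"beach"}, {"rainbow", "sky"},
              {"snow", "mountain"}, {"city"}, {"sky"}, {"beach", "sea"}]
neighbors = collection[:3]          # pretend these are the 3 visual neighbors
rel = tag_relevance("snow", neighbors, collection, k=3)
print(rel)
```

Visually consistent tags like "snow" collect many neighbor votes; weather tags like "rainbow" do not, matching the slide's observation.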
    • Social negative bootstrapping. Negative images are as important as positive images for learning. Not just random negative images, but close ones. We learn positive examples from an expert and obtain as many negative samples as we like for free from the web, iteratively aiming for the hardest negatives. Xirong Li ACM MM 2009
    • Social negative bootstrapping Xirong Li ICMR 2011
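An illustrative sketch of the bootstrapping loop described above: train on expert positives plus an initial negative pool, then iteratively add the hardest (highest-scoring) negatives from the web pool and retrain. The "classifier" is a trivial 1-D threshold model so the sketch stays runnable; it stands in for the real detector:

```python
# Toy hard-negative bootstrapping on 1-D scores.

def train(positives, negatives):
    """Toy classifier: threshold halfway between the class means."""
    mid = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: x - mid          # score > 0 means "positive"

def negative_bootstrap(positives, web_pool, rounds=2, per_round=2):
    negatives = web_pool[:per_round]          # initial (arbitrary) negatives
    pool = web_pool[per_round:]
    for _ in range(rounds):
        clf = train(positives, negatives)
        # pick the hardest negatives: those the current model scores highest
        pool.sort(key=clf, reverse=True)
        negatives += pool[:per_round]
        pool = pool[per_round:]
    return train(positives, negatives), negatives

positives = [8.0, 9.0, 10.0]
web_pool = [1.0, 2.0, 7.5, 0.5, 7.0, 3.0, 6.5, 1.5]
clf, negs = negative_bootstrap(positives, web_pool)
print(sorted(negs))
```

Note how the negatives closest to the positive class (7.5, 7.0, 6.5) are selected early, while easy negatives (0.5) are never needed.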
    • Knowledge ontology ImageNet
    • Acknowledgement: WordNet friends. Christiane Fellbaum (Princeton), Dan Osherson (Princeton), Kai Li (Princeton), Alex Berg (Columbia), Jia Deng (Princeton/Stanford), Hao Su (Stanford)
    • PASCAL VOC. The PASCAL Visual Object Classes (VOC): 500,000 images downloaded from Flickr with queries like "car", "vehicle", "street", "downtown"; 10,000 objects, 25,000 labels. Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
    • 7. Conclusion. Data is king. The data are beginning to reflect the human cognition capacity [at a basic level]. Harvesting social data requires advanced computer vision control.
    • 8 Performance
    • PASCAL 2010: Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow
    • True Positives - Person UOCTTI_LSVM_MDPM NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
    • False Positives - Person UOCTTI_LSVM_MDPM NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
    • Non-birds & non-boats. Highest-ranked non-bird images; highest-ranked non-boat images. Water texture and scene composition?
    • Non-chair
    • True Positives - Motorbike MITUCLA_HIERARCHY NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
    • False Positives - Motorbike MITUCLA_HIERARCHY NLPR_HOGLBP_MC_LCEGCHLC NUS_HOGLBP_CTX_CLS_RESCORE_V2
    • Object localization 2008-2010. [Chart: max AP (%) per category for 2008, 2009, and 2010 methods over the 20 PASCAL classes.] Results on 2008 data improve for 2010 methods for all categories, by over 100% for some categories.
    • TRECvid evaluation standard
    • Concept detection Aircraft Beach Mountain People marching Police/Security Flower
    • Measuring performance. [Diagram: set of relevant items vs. set of retrieved items.] Precision: fraction of retrieved items that are relevant. Recall: fraction of relevant items that are retrieved. Precision and recall have an inverse relationship.
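The precision and recall definitions from this slide, written out as a small sketch over sets:

```python
# Precision/recall on sets of item ids.

def precision_recall(retrieved, relevant):
    """Precision: |retrieved ∩ relevant| / |retrieved|.
    Recall:    |retrieved ∩ relevant| / |relevant|."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {3, 4, 5, 6, 7, 8}
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Retrieving more items tends to raise recall but lower precision, which is the inverse relationship the slide refers to.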
    • UvA-MediaMill@TRECVID vs. other systems. Snoek et al, TRECVID 04-10
    • Performance doubled in just 3 years, with 36 concept detectors. Even when using training data of different origin, great progress. But the number of concepts is still limited. Snoek & Smeulders, IEEE Computer 2010
    • 8. Conclusion. Impressive results, quickly improving each year. A very valuable competition. The best non-classes start to make sense!
    • 9 Speed
    • SURF based on integral images. Integral images were introduced by Viola & Jones in the context of face detection, scanning sliding windows left to right and top to bottom.
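A sketch of the integral-image trick that makes SURF's box filters fast: after one pass to build the summed-area table, any box filter response costs four lookups, independent of the filter's scale:

```python
# Integral image (summed-area table) and constant-time box sums.

def integral_image(img):
    """img: 2-D list. Returns ii with ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img over the inclusive box [x0..x1] x [y0..y1]: four lookups."""
    a = ii[y1][x1]
    b = ii[y0 - 1][x1] if y0 > 0 else 0
    c = ii[y1][x0 - 1] if x0 > 0 else 0
    d = ii[y0 - 1][x0 - 1] if y0 > 0 and x0 > 0 else 0
    return a - b - c + d

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2))   # 5 + 6 + 8 + 9
```

This scale independence is exactly why the next slides can claim that SURF's computation time does not depend on filter scale.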
    • SURF principle: approximate Gaussian second-order derivatives (Lxx, Lyy, Lxy) with box filters.
    • SURF speed. Computation time: 6 times faster than DoG (~100 ms), independent of filter scale.
    • Dense descriptor extraction: pixel-wise responses to final descriptor. Factor 16 speed improvement; another factor 2 by the use of matrix libraries.
    • Projection: Random Forest of binary decision trees. Moosmann et al. 2008
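A hedged sketch of codebook projection with a random forest in the Moosmann style: each descriptor is dropped through each binary tree, and the leaf it lands in acts as a visual word. The tree structures and thresholds below are hard-coded toys, not learned:

```python
# Toy random-forest projection: descriptors -> leaf indices -> BoW histogram.

def project(descriptor, tree):
    """Walk one binary tree; return the leaf index (the visual word)."""
    node = tree
    while isinstance(node, tuple):          # internal node: (dim, thresh, l, r)
        dim, thresh, left, right = node
        node = left if descriptor[dim] < thresh else right
    return node                             # leaf = integer word id

# two tiny trees on 2-D descriptors; leaves are word ids 0..3 per tree
forest = [
    (0, 0.5, (1, 0.5, 0, 1), (1, 0.5, 2, 3)),
    (1, 0.3, (0, 0.7, 0, 1), (0, 0.2, 2, 3)),
]

def bag_of_words(descriptors, forest, words_per_tree=4):
    """Histogram of leaf indices over all descriptors and all trees."""
    hist = [0] * (len(forest) * words_per_tree)
    for d in descriptors:
        for t, tree in enumerate(forest):
            hist[t * words_per_tree + project(d, tree)] += 1
    return hist

hist = bag_of_words([(0.1, 0.9), (0.8, 0.1), (0.6, 0.6)], forest)
print(hist)
```

Each projection is a handful of threshold comparisons rather than a nearest-neighbor search over the whole codebook, which is where the large speedup over NN quantization comes from.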
    • Real-time bag of words. Descriptor extraction (D-SURF, 2x2): 15 ms. Projection (Random Forest, pre-projection + actual projection): 10 ms. Classification (SVM, RBF kernel): 13 ms. MAP: 0.370. Total computation time is 38 milliseconds per image, i.e. 26 frames per second on a normal PC for any of 20 concepts.
    • 9. Conclusion. SURF is scale and rotation invariant, and fast due to the use of integral images. Download: http://www.vision.ee.ethz.ch/~surf/. D-SURF extraction is 6x faster than Dense-SIFT. Projection using a Random Forest is 50x faster than NN.
    • Internet Video Search: the beginning. [Pipeline diagram: video features, concept detection, lexicon learning, measuring, browsing video, telling stories.]