Lecture 03: Internet video search

1. 6: Location and context
2. What makes a cow a cow? Google knows. How do you know? Because other people know. We think we know "because it has four legs". But the fact of the matter: not all cows show four legs, nor are they brown … not all …
3. What is the object in the middle? No segmentation … not even the pixel values of the object …
4. Where is evidence for an object? Uijlings IJCV 2011
5. Where is evidence for an object? Uijlings IJCV 2011
6. What is the visual extent of an object? Uijlings IJCV 2012
7. Where: exhaustive search. Look everywhere for the object window. This imposes computational constraints: very many locations and windows (coarse grid, fixed aspect ratio) and an evaluation cost per location (weak features, weak classifiers). Impressive, but it takes long. Viola IJCV 2004, Dalal CVPR 2005, Felzenszwalb PAMI 2010, Vedaldi ICCV 2009
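To make the exhaustive-search idea in slide 7 concrete, here is a minimal sliding-window sketch in Python; the window sizes, the stride and the score_window classifier are hypothetical stand-ins, not the detectors cited on the slide.

```python
import numpy as np

def sliding_window_detect(image, score_window, window_sizes=((64, 64), (128, 128)), stride=16):
    """Exhaustively score every window position and size on a coarse grid.

    image: 2-D numpy array (grayscale); score_window: any function mapping an
    image crop to a confidence score (a stand-in for a real classifier).
    Returns a list of (score, y, x, h, w) tuples, best first.
    """
    H, W = image.shape[:2]
    detections = []
    for h, w in window_sizes:                      # every window size
        for y in range(0, H - h + 1, stride):      # every grid position
            for x in range(0, W - w + 1, stride):
                crop = image[y:y + h, x:x + w]
                detections.append((score_window(crop), y, x, h, w))
    return sorted(detections, reverse=True)

# Toy usage: a dummy scorer that prefers bright windows.
if __name__ == "__main__":
    img = np.random.rand(240, 320)
    print(sliding_window_detect(img, score_window=lambda c: float(c.mean()))[:3])
```

Even this toy version evaluates thousands of windows per image, which is exactly the computational burden that motivates the hierarchy on slide 8.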
8. Where: the need for a hierarchy. An image is intrinsically hierarchical. Gu CVPR 2009
9. Selective search. Windows are formed by hierarchical grouping: adjacent regions are grouped on color/texture/shape cues. Felzenszwalb 2004, Van de Sande ICCV 2011
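As a sketch of the hierarchical grouping behind selective search (slide 9), the function below starts from an initial over-segmentation and repeatedly merges the most similar pair of adjacent regions, emitting every merged box as a candidate window. The region representation and the single color-histogram similarity are simplifying assumptions, not the full color/texture/shape cues of Van de Sande ICCV 2011.

```python
import numpy as np

def hist_sim(h1, h2):
    # Histogram intersection: one simple colour cue for grouping.
    return float(np.minimum(h1, h2).sum())

def merge_boxes(b1, b2):
    # Union of two bounding boxes (x0, y0, x1, y1).
    return (min(b1[0], b2[0]), min(b1[1], b2[1]), max(b1[2], b2[2]), max(b1[3], b2[3]))

def selective_search(regions, neighbours):
    """Greedy hierarchical grouping over an initial over-segmentation.

    regions: dict id -> (box, colour histogram)
    neighbours: set of frozenset({a, b}) pairs of adjacent region ids
    Returns the list of all boxes created by merging: the object proposals.
    """
    regions, neighbours = dict(regions), set(neighbours)
    proposals = []
    next_id = max(regions) + 1
    while neighbours:
        # Merge the most similar adjacent pair of regions.
        a, b = max(neighbours, key=lambda p: hist_sim(*(regions[i][1] for i in p)))
        box = merge_boxes(regions[a][0], regions[b][0])
        hist = (regions[a][1] + regions[b][1]) / 2.0
        proposals.append(box)
        # Rewire adjacency: the new region inherits the neighbours of a and b.
        touching = {i for p in neighbours if p & {a, b} for i in p} - {a, b}
        neighbours = {p for p in neighbours if not (p & {a, b})}
        regions.pop(a)
        regions.pop(b)
        regions[next_id] = (box, hist)
        neighbours |= {frozenset({next_id, i}) for i in touching}
        next_id += 1
    return proposals

# Tiny demo with three adjacent regions and 3-bin colour histograms.
regs = {0: ((0, 0, 10, 10), np.array([1.0, 0.0, 0.0])),
        1: ((10, 0, 20, 10), np.array([0.9, 0.1, 0.0])),
        2: ((0, 10, 20, 20), np.array([0.0, 0.0, 1.0]))}
adj = {frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2})}
print(selective_search(regs, adj))
```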
10. Selective search example
11. Selective search example
12. Average best overlap ~88% … which looks like this: high recall (cat example).
13. Pairs of concepts. Uijlings ICCV demo 2012
14. 6: Conclusion. Selective search gives good localization. Localization is needed to understand pairs of concepts.
15. 7: Data and metadata. http://bit.ly/visualsearchengines
16. How many concepts? Li Fei Fei slide. Biederman, Psychological Rev. 1987
17. How many examples? Once you are over 100-1000 examples, success is there.
18. Amateur labeling: LabelMe, 290,000 object annotations. Russell IJCV 2008
19. Amateur labeling
20. Amateur labeling
21. Tag relevance by social annotation: consistency in tagging between users on similar images. Xirong Li, TMM 2009
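The tag-relevance idea of slide 21 can be sketched as neighbour voting: a tag on an image becomes more credible when the users of visually similar images used it too. The feature distance and the simple votes-minus-prior score below are illustrative assumptions, not the exact estimator of Xirong Li, TMM 2009.

```python
import numpy as np

def tag_relevance(query_feat, query_tags, collection, k=50):
    """Neighbour-voting tag relevance.

    query_feat: visual feature vector of the query image
    query_tags: tags the uploader assigned to it
    collection: list of (feature vector, set of tags) from other users' images
    Returns {tag: score}; a tag scores high when it also appears on many of the
    k visually nearest images, beyond what its overall popularity predicts.
    """
    feats = np.stack([f for f, _ in collection])
    dists = np.linalg.norm(feats - query_feat, axis=1)
    neighbour_tags = [collection[i][1] for i in np.argsort(dists)[:k]]
    scores = {}
    for tag in query_tags:
        votes = sum(tag in tags for tags in neighbour_tags)
        prior = sum(tag in tags for _, tags in collection) * k / len(collection)
        scores[tag] = votes - prior          # votes above chance level
    return scores
```

This is one way to read slide 22: a visually consistent tag like "snow" gathers clear neighbour votes, while "rainbow" does not.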
22. Tag relevance by social annotation: pretty good for "snow", not so good for "rainbow".
23. Social negative bootstrapping. Negative images are as important as positive images for learning, and not just random negative images but close ones. We want to learn positive examples from an expert, and obtain as many negative samples as we like for free from the web. We iteratively aim for the hardest negatives. Xirong Li ACM MM 2009
24. Social negative bootstrapping. Xirong Li ICMR 2011
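A minimal sketch of the iterative hard-negative idea in slides 23-24: train on the expert positives plus a small negative seed, score a large pool of freely harvested web images, and add the highest-scoring (hardest) negatives each round. The nearest-mean scorer and the sampling sizes are assumptions for illustration; the cited work uses a proper classifier.

```python
import numpy as np

def negative_bootstrap(positives, web_pool, rounds=5, per_round=100):
    """Iterative hard-negative mining against a pool of web images.

    positives: (n_pos, d) features labelled by an expert
    web_pool:  (n_web, d) features harvested for free (assumed negative)
    Returns a linear scoring vector w trained with progressively harder negatives.
    """
    rng = np.random.default_rng(0)
    negatives = web_pool[rng.choice(len(web_pool), per_round, replace=False)]
    w = np.zeros(positives.shape[1])
    for _ in range(rounds):
        # Nearest-mean "classifier" as a stand-in for the SVM used in practice.
        w = positives.mean(axis=0) - negatives.mean(axis=0)
        scores = web_pool @ w
        hardest = web_pool[np.argsort(scores)[-per_round:]]   # most positive-looking negatives
        negatives = np.vstack([negatives, hardest])
    return w
```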
25. Knowledge ontology: ImageNet
26. Acknowledgement. WordNet friends: Christiane Fellbaum (Princeton), Dan Osherson (Princeton), Kai Li (Princeton), Alex Berg (Columbia), Jia Deng (Princeton/Stanford), Hao Su (Stanford).
27. PASCAL VOC: the PASCAL Visual Object Classes. 500,000 images downloaded from Flickr with queries like "car", "vehicle", "street", "downtown"; 10,000 objects, 25,000 labels. Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman
28. 7: Conclusion. Data is king. The data are beginning to reflect human cognitive capacity [at a basic level]. Harvesting social data requires advanced computer vision control.
29. 8: Performance
30. PASCAL 2010 classes: Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow.
31. True positives, Person: UOCTTI_LSVM_MDPM, NLPR_HOGLBP_MC_LCEGCHLC, NUS_HOGLBP_CTX_CLS_RESCORE_V2
32. False positives, Person: UOCTTI_LSVM_MDPM, NLPR_HOGLBP_MC_LCEGCHLC, NUS_HOGLBP_CTX_CLS_RESCORE_V2
33. Non-birds & non-boats. Non-bird images: highest ranked. Non-boat images: highest ranked. Water texture and scene composition?
34. Non-chair
35. True positives, Motorbike: MITUCLA_HIERARCHY, NLPR_HOGLBP_MC_LCEGCHLC, NUS_HOGLBP_CTX_CLS_RESCORE_V2
36. False positives, Motorbike: MITUCLA_HIERARCHY, NLPR_HOGLBP_MC_LCEGCHLC, NUS_HOGLBP_CTX_CLS_RESCORE_V2
37. Object localization 2008-2010. [Bar chart: maximum AP (%) per category for the 2008, 2009 and 2010 methods over the 20 PASCAL classes.] Results on the 2008 data improve for 2010 methods for all categories, by over 100% for some categories.
38. TRECVID evaluation standard
39. Concept detection: Aircraft, Beach, Mountain, People marching, Police/Security, Flower
40. Measuring performance. Precision = relevant retrieved items / retrieved items; recall = relevant retrieved items / relevant items. Precision and recall have an inverse relationship.
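To make the definitions on slide 40 concrete, here is a small sketch that computes precision, recall and non-interpolated average precision (the TRECVID/PASCAL-style ranking measure) from a ranked result list; the toy ranking at the bottom is made up for illustration.

```python
def precision_recall_ap(ranked_relevance, total_relevant):
    """ranked_relevance: list of 0/1 flags for the ranked results, best first.
    total_relevant: number of relevant items in the whole collection."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        if rel:
            precisions.append(hits / rank)       # precision at each relevant hit
    precision = hits / len(ranked_relevance)     # precision of the returned set
    recall = hits / total_relevant               # fraction of relevant items found
    ap = sum(precisions) / total_relevant        # non-interpolated average precision
    return precision, recall, ap

# Toy example: 10 results returned, 3 of the 5 relevant items retrieved.
print(precision_recall_ap([1, 0, 1, 0, 0, 1, 0, 0, 0, 0], total_relevant=5))
```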
41. UvA-MediaMill@TRECVID versus other systems. Snoek et al., TRECVID 04-10
42. Performance doubled in just 3 years (36 concept detectors). Even when using training data of different origin, there is great progress, but the number of concepts is still limited. Snoek & Smeulders, IEEE Computer 2010
43. 8: Conclusion. Impressive results, improving quickly year over year. Very valuable competition. The best non-classes start to make sense!
44. 9: Speed
45. SURF is based on integral images, introduced by Viola & Jones in the context of face detection: sliding windows scanned left to right and top to bottom over integral images.
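A minimal sketch of the integral-image trick behind slide 45: after one cumulative-sum pass over the image, the sum inside any axis-aligned box costs four lookups, regardless of the box size, which is what makes box filters so cheap; numpy is used only for convenience.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x], padded with a zero row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in four lookups, independent of box size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(12, dtype=float).reshape(3, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```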
46. SURF principle: approximate the Gaussian second-order derivatives (Lxx, Lyy, Lxy) with box filters.
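As a hedged illustration of the box-filter approximation on slide 46, the sketch below builds a crude Lyy response from three stacked horizontal bands weighted +1, -2, +1. The real SURF filters use specific region shapes and are evaluated in constant time through the integral image shown above, so the proportions and the direct sums here are simplifications.

```python
import numpy as np

def dyy_response(img, y, x, size=9):
    """Rough Lyy at top-left corner (y, x): three full-width horizontal bands
    of height size/3, weighted +1, -2, +1 (a box approximation of the second
    y-derivative)."""
    h = size // 3
    top    = img[y         : y + h,     x : x + size].sum()
    middle = img[y + h     : y + 2 * h, x : x + size].sum()
    bottom = img[y + 2 * h : y + 3 * h, x : x + size].sum()
    return top - 2.0 * middle + bottom

# A vertical intensity ramp has zero second derivative in y, as expected.
ramp = np.tile(np.arange(20.0)[:, None], (1, 20))
print(dyy_response(ramp, 0, 0))   # ~0
```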
47. SURF speed. Computation time: 6 times faster than DoG (~100 msec) and independent of filter scale.
48. Dense descriptor extraction: from pixel-wise responses to the final descriptor. A factor-16 speed improvement, and another factor of 2 from the use of matrix libraries.
49. Projection: random forest of binary decision trees. Moosmann et al. 2008
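A minimal sketch of random-forest projection for the bag of words (slide 49): each descriptor follows a few threshold tests down a tree and the leaf index acts as its visual word, replacing a nearest-neighbour search over the whole codebook. The tiny hard-coded tree is purely illustrative; in Moosmann et al. 2008 the trees are learned from training descriptors.

```python
import numpy as np

# Each internal node tests one descriptor dimension against a threshold.
# A tree is a nested dict; leaves carry an index that serves as a visual word.
TREE = {"dim": 0, "thr": 0.5,
        "left":  {"dim": 3, "thr": 0.2, "left": {"leaf": 0}, "right": {"leaf": 1}},
        "right": {"dim": 7, "thr": 0.8, "left": {"leaf": 2}, "right": {"leaf": 3}}}

def project(descriptor, node=TREE):
    """Walk the tree with a handful of comparisons instead of comparing the
    descriptor against every codeword."""
    while "leaf" not in node:
        node = node["left"] if descriptor[node["dim"]] < node["thr"] else node["right"]
    return node["leaf"]

def bag_of_words(descriptors, n_leaves=4):
    # Histogram of leaf indices over all descriptors of an image.
    hist = np.zeros(n_leaves)
    for d in descriptors:
        hist[project(d)] += 1
    return hist / max(len(descriptors), 1)

print(bag_of_words(np.random.rand(100, 16)))
```

Because assigning a descriptor costs only a few comparisons instead of a distance to every codeword, this is where the roughly 50x speed-up over NN quoted on slide 51 comes from.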
50. Real-time bag of words: descriptor extraction (D-SURF), projection (random forest, 2x2) and classification (SVM with RBF kernel), with stage times of 15, 10 and 13 ms; MAP 0.370. Total computation time is 38 milliseconds per image, i.e. 26 frames per second on a normal PC for any of the 20 concepts.
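To show how the slide-50 stages fit together, here is a hedged timing harness; the three stage functions are placeholders standing in for D-SURF extraction, random-forest projection and the RBF-kernel SVM, and the dummy implementations exist only to make the sketch runnable.

```python
import time
import numpy as np

def run_pipeline(frame, extract, project, classify):
    """Run one frame through extraction -> projection -> classification and
    report per-stage time in milliseconds, mirroring the breakdown on slide 50."""
    timings = {}
    t0 = time.perf_counter(); descriptors = extract(frame)
    timings["extraction"] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter(); bow = project(descriptors)
    timings["projection"] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter(); scores = classify(bow)
    timings["classification"] = (time.perf_counter() - t0) * 1000
    return scores, timings

# Dummy stages: random "descriptors", a mean-pooling projection, a linear scorer.
frame = np.random.rand(240, 320)
scores, ms = run_pipeline(
    frame,
    extract=lambda f: np.random.rand(1000, 64),
    project=lambda d: d.mean(axis=0),
    classify=lambda b: b @ np.random.rand(64, 20),   # 20 concept scores
)
print({k: round(v, 2) for k, v in ms.items()})
```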
51. 9: Conclusion. SURF is scale and rotation invariant, and fast due to the use of integral images (download: http://www.vision.ee.ethz.ch/~surf/). DURF extraction is 6x faster than Dense-SIFT. Projection using a random forest is 50x faster than NN.
52. Internet video search: the beginning. [Diagram: from video via measuring features, concept detection, lexicon learning and video browsing to telling video stories.]
