Modern features-part-4-evaluation

  1. HARVEST Programme. Feature evaluation: old and new benchmarks, and new software. Andrea Vedaldi, University of Oxford.
  2. Benchmarking: why and how
     • Dozens of feature detectors and descriptors have been proposed.
     • Benchmarks
       - compare methods empirically
       - select the best method for a task
     • Public benchmarks
       - reproducible research
       - simplify your life!
     • Ingredients of a benchmark: theory, data, software.
  3. Outline
     • Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
     • Direct evaluation: image retrieval (data: Oxford 5K)
     • Software: VLBenchmarks
  5. Indirect feature evaluation
     • Intuition: test how well features persist and can be matched across image transformations.
     • Data
       - must be representative of the transformations (viewpoint, illumination, noise, etc.)
     • Performance measures
       - repeatability: persistence of features
       - matching score: matchability of features
     K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 1(65):43-72, 2005.
  6. [Figure slide: "frame" (illustration of local feature frames).]
  7. Affine Testbed: viewpoint, scale, rotation. [Figure: example image sequences.]
  8. Affine Testbed: lighting, compression, blur. [Figure: example image sequences.]
  9. Detector repeatability: intuition
     • two pairs of features correspond
     • another pair of features does not
     • repeatability = 2/3
  10. Region overlap: formal definition
      With $H$ the homography relating the two images, the overlap of regions $R_a$ and $R_b$ is the area of their intersection over the area of their union:
      $\mathrm{overlap}(a, b) = \frac{|R_a \cap H R_b|}{|R_a \cup H R_b|}$
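      A minimal sketch of how such an overlap could be computed by rasterising the two regions into binary masks of the same size. This is only an illustration of the formula, not the testbed's actual implementation (which works on ellipse parameters):

      % Overlap (intersection over union) of two regions given as binary masks.
      % maskA is the region R_a in the reference image; maskB is the region
      % H*R_b already warped into the reference image.
      function o = regionOverlap(maskA, maskB)
        inter = nnz(maskA & maskB);     % area of intersection (pixel count)
        uni   = nnz(maskA | maskB);     % area of union
        o = inter / max(uni, 1);        % guard against empty regions
      end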
  11. Region overlap: intuition
      • Examples of ellipses overlapping by different amounts; the overlap error is 1 − overlap.
      • Features are tested at 40% overlap error (= 60% overlap).
      [Figure from Mikolajczyk et al. (IJCV 2005), Fig. 12: ellipses projected onto the corresponding ellipse with the ground-truth transformation; the overlap error arises from differences in the size, orientation and position of the ellipses.]
  16. Normalised region overlap: intuition
      • Larger scale = better overlap.
      • Rescale so that the reference region has an area of 30² pixels.
  17. Normalised region overlap: formal definition
      1. Detect regions $R_a$, $R_b$ ($H$ = homography).
      2. Warp: $H R_b$.
      3. Normalise: $s R_a$, $s H R_b$, with the scale factor $s$ chosen so that the reference region $s R_a$ has an area of $30^2$ pixels.
      4. Intersection over union:
         $\mathrm{overlap}(a, b) = \frac{|s R_a \cap s H R_b|}{|s R_a \cup s H R_b|}$
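      Reading "normalise" as rescaling both regions by a common linear factor $s$ so that the reference region reaches an area of $30^2$ pixels, the factor would be determined by (a hedged reconstruction; the slide only states the target area):
      $$ s^2 \, |R_a| = 30^2 \quad\Longrightarrow\quad s = \frac{30}{\sqrt{|R_a|}} . $$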
  18. Detector repeatability: formal definition
      1. Find the features in the common (co-visible) area: $\{R_a : a \in A\}$ and $\{R_b : b \in B\}$, with $H$ the homography between the images.
      2. Threshold the overlap score:
         $s_{ab} = \begin{cases} \mathrm{overlap}(a,b) & \text{if } \mathrm{overlap}(a,b) \ge 1 - \epsilon_o \\ -\infty & \text{otherwise} \end{cases}$
      3. Find the geometric matches by bipartite matching (greedy approximation):
         $M^{*} = \arg\max_{M} \sum_{(a,b) \in M} s_{ab}$
      Then
         $\mathrm{repeatability}(A, B) = \frac{|M^{*}|}{\min\{|A|, |B|\}}$
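      A minimal sketch of the greedy bipartite approximation, assuming the pairwise overlaps have already been collected in a matrix S (rows = features of A, columns = features of B). This is an illustration of the idea, not VLBenchmarks' implementation:

      % Greedy one-to-one matching on an overlap matrix S (|A| x |B|).
      % Pairs below the overlap threshold are never matched.
      function rep = greedyRepeatability(S, overlapThreshold)
        S(S < overlapThreshold) = -inf;     % e.g. overlapThreshold = 0.6 (40% overlap error)
        [nA, nB] = size(S);
        numMatches = 0;
        for iter = 1:min(nA, nB)
          [best, idx] = max(S(:));          % best remaining pair
          if ~isfinite(best), break; end
          [a, b] = ind2sub(size(S), idx);
          numMatches = numMatches + 1;
          S(a, :) = -inf;                   % each feature can be matched at most once
          S(:, b) = -inf;
        end
        rep = numMatches / min(nA, nB);
      end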
  19. Descriptor matching score: intuition
      • In addition to being stable, features must be visually distinctive.
      • Descriptor matching score
        - similar to repeatability
        - but matches are constructed by comparing descriptors.
  20. Descriptor matching score: formal definition
      1. Find the features in the common area: $\{R_a : a \in A\}$, $\{R_b : b \in B\}$ ($H$ homography).
      2. Descriptor distances: given $\{d_a : a \in A\}$ and $\{d_b : b \in B\}$, let $d_{ab} = \|d_a - d_b\|_2$.
      3. Descriptor matches (bipartite): $M^{*}_d = \arg\min_{M} \sum_{(a,b) \in M} d_{ab}$.
      4. Geometric matches (bipartite, as before): $M^{*} = \arg\max_{M} \sum_{(a,b) \in M} s_{ab}$.
      Then
         $\text{match-score}(A, B) = \frac{|M^{*} \cap M^{*}_d|}{\min\{|A|, |B|\}}$
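      A sketch of steps 2-4, assuming descriptors are stored as columns of matrices Da (image A) and Db (image B) and that the geometric matches geomPairs (an n x 2 list of index pairs) were already obtained as in the previous sketch; greedy one-to-one matching again stands in for the bipartite optimisation:

      % Descriptor matching score (sketch).
      % Da: D x |A| descriptors; Db: D x |B| descriptors; geomPairs: n x 2 geometric matches.
      function ms = matchingScoreSketch(Da, Db, geomPairs)
        % pairwise Euclidean distances d_ab = ||d_a - d_b||_2
        D2 = bsxfun(@plus, sum(Da.^2, 1)', sum(Db.^2, 1)) - 2 * (Da' * Db);
        D = sqrt(max(D2, 0));
        % greedy one-to-one descriptor matches (approximate bipartite min-cost)
        descPairs = zeros(0, 2);
        for iter = 1:min(size(D))
          [~, idx] = min(D(:));
          [a, b] = ind2sub(size(D), idx);
          descPairs(end+1, :) = [a b];   %#ok<AGROW>
          D(a, :) = inf;  D(:, b) = inf;
        end
        % a feature counts if its descriptor match is also a geometric match
        common = intersect(descPairs, geomPairs, 'rows');
        ms = size(common, 1) / min(size(Da, 2), size(Db, 2));
      end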
  21. Example of a repeatability graph
      [Plot: repeatability (%) vs. viewpoint angle (20-60 degrees) on the graffiti (graf) sequence, comparing VL DoG, VL Hessian, VL HessianLaplace and VL HarrisLaplace (each at single and double resolution), VLFeat SIFT, CMP Hessian, VGG hes and VGG har.]
  22. Outline
      • Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
      • Direct evaluation: image retrieval (data: Oxford 5K)
      • Software: VLBenchmarks
  23. Indirect evaluation
      • Indirect evaluation
        - a "synthetic" performance measure in a "synthetic" setting
      • The good
        - independent of a specific application / implementation
        - allows evaluating single components, e.g.
          ▪ repeatability of a detector
          ▪ matching score of a descriptor
      • The bad
        - difficult to design well
        - unclear correlation with performance in applications
  24. Direct evaluation
      • Direct evaluation
        - performance of a real system that uses the feature, e.g. object instance retrieval, object category recognition, object detection, text recognition, semantic segmentation, ...
      • The good
        - tied to the "real" performance of the feature
      • The bad
        - tied to one application
        - worse, tied to one implementation
        - difficult to evaluate single aspects of a feature
      • In what follows we focus on object instance retrieval.
  25. Image retrieval: used here to evaluate features.
  26. Image retrieval pipeline: represent images as bags of features
      input image → detector → frames $\{f_1, \ldots, f_n\}$ → descriptor → descriptors $\{d_1, \ldots, d_n\}$
      • Example detectors: Harris (Laplace), Hessian (Laplace), DoG, MSER, Harris Affine, Hessian Affine, ...
      • Example descriptors: SIFT, LIOP, BRIEF, Jets, ...
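      As a concrete illustration, one way to compute such a bag of features with VLFeat's SIFT implementation (assuming VLFeat is installed and vl_setup has been run; the image file name is hypothetical):

      % Extract a bag of local features from one image using VLFeat SIFT.
      im = imread('someImage.jpg');                   % hypothetical file name
      if size(im, 3) == 3, im = rgb2gray(im); end
      im = im2single(im);                             % vl_sift expects a single-precision grayscale image
      [frames, descriptors] = vl_sift(im);            % frames: 4 x n (x, y, scale, orientation)
                                                      % descriptors: 128 x n, one SIFT descriptor per frame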
  27. Image retrieval pipeline. Step 1: find the neighbours of each query descriptor in the database, ordered by increasing descriptor distance.
      H. Jégou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.
  28. Image retrieval pipeline. Step 2: each query descriptor casts a vote for each database image.
      • For a query descriptor with neighbour distances $d_1 \le d_2 \le \ldots \le d_k$, the neighbour at rank $i$ contributes a vote of strength $\max\{d_k - d_i, 0\}$ to the database image it comes from.
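      A sketch of steps 1 and 2 for a single query descriptor, using brute-force nearest neighbours; dbDescriptors, dbImageId and q are illustrative names, not part of any library:

      % Vote for database images from one query descriptor q (column vector).
      % dbDescriptors: D x N matrix of database descriptors (columns);
      % dbImageId: 1 x N index of the database image each descriptor belongs to.
      k = 100;                                                % number of neighbours (illustrative)
      numImages = max(dbImageId);
      dist = sqrt(sum(bsxfun(@minus, dbDescriptors, q).^2, 1));
      [dist, order] = sort(dist, 'ascend');                   % step 1: neighbours by increasing distance
      votes = zeros(1, numImages);
      for i = 1:k
        img = dbImageId(order(i));
        votes(img) = votes(img) + max(dist(k) - dist(i), 0);  % step 2: vote strength max{d_k - d_i, 0}
      end
      % Repeating this for every query descriptor and summing gives each image's total score.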
  29. Image retrieval pipeline. Step 3: sort the database images by decreasing total votes and compute the Average Precision (AP) of the resulting ranking, i.e. the area under the precision-recall curve (35% in the illustrated example).
      [Figure: ranked results with correct (✔) and incorrect (✗) images and the corresponding precision-recall curve.]
  30. Image retrieval pipeline. Step 4: overall performance score.
      • Compute the AP of the retrieval results for every query (e.g. 35%, 100%, 75%, ...).
      • The overall score is the mean of the per-query APs: the mean Average Precision (mAP), 53% in the illustrated example.
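      For reference, a small sketch of how AP and mAP could be computed from ranked binary relevance labels (illustrative only; not necessarily the exact evaluation protocol of the benchmark):

      % Average precision of one ranked result list.
      % isRelevant: logical vector, true where the retrieved image is correct,
      % ordered by decreasing score. numRelevant: total number of correct images.
      function ap = averagePrecision(isRelevant, numRelevant)
        hits = cumsum(isRelevant(:));
        precisionAtK = hits ./ (1:numel(isRelevant))';
        ap = sum(precisionAtK(isRelevant(:))) / numRelevant;  % mean precision at each relevant rank
      end

      % The overall score is then simply the mean of the per-query APs:
      % mAP = mean(apPerQuery);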
  31. Oxford 5K data: a retrieval benchmark dataset
      [Figure: a query image and retrieved images, marked correct (✔) or incorrect (✗).]
      • ~5K images of Oxford
        - for each of 58 queries
          ▪ about XX matching images
          ▪ about XX confounder images
      • Larger datasets are possible, but slow for extensive evaluation.
      • The relative ranking of features seems to be representative.
  32. Outline
      • Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
      • Direct evaluation: image retrieval (data: Oxford 5K)
      • Software: VLBenchmarks
  33. VLBenchmarks: a new easy-to-use benchmarking suite
      http://www.vlfeat.org/benchmarks/index.html
      • A novel MATLAB framework for feature evaluation
        - repeatability and matching scores (VGG affine testbed)
        - image retrieval (Oxford 5K)
      • Goodies
        - simple-to-use MATLAB code
        - automatically downloads datasets and runs evaluations
        - backward compatible with published results
  34. VLBenchmarks: obtaining and installing the code
      • Installation
        - download the latest version
        - unpack the archive
        - launch MATLAB and type: >> install
      • Requirements
        - MATLAB R2008a (7.6)
        - a C compiler (e.g. Visual Studio, GCC, or Xcode)
        - do not forget to set up MATLAB to use your C compiler: mex -setup
  35. Example usage
      % choose a detector (Harris Affine, VGG version)
      detector = localFeatures.VggAffine();
      % choose a dataset (graffiti sequence)
      dataset = datasets.VggAffineDataset('category', 'graf');
      % choose a test (detector repeatability)
      benchmark = benchmarks.RepeatabilityBenchmark();
      % run the evaluation (image pair 1 vs 2 chosen for illustration)
      repeatability = benchmark.testFeatureExtractor(detector, ...
          dataset.getTransformation(2), ...
          dataset.getImagePath(1), dataset.getImagePath(2));
      % repeatability = 0.66
  36. Testing on a sequence of images
      • Loop over the images of the sequence and evaluate the detector against each of them (see the sketch below and the full example two slides ahead).
      • Use parfor on a cluster!
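      A minimal sketch of such a loop, reusing the detector, dataset and benchmark objects from the previous slide and the same API as the code slide that follows; the parfor only pays off if a parallel pool or cluster is available:

      % Evaluate the detector against every image of the sequence in parallel.
      repeatability = zeros(1, 5);
      parfor j = 1:5
        repeatability(j) = benchmark.testFeatureExtractor(detector, ...
            dataset.getTransformation(j), ...
            dataset.getImagePath(1), dataset.getImagePath(j));
      end
      plot(repeatability, 'linewidth', 4);
      xlabel('image number'); ylabel('repeatability'); grid on;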
  37. [Plot: repeatability vs. image number (1-5) for the graffiti sequence.]
  38. Comparing two features
      import datasets.*;
      import localFeatures.*;
      import benchmarks.*;

      detectors{1} = VlFeatSift();
      detectors{2} = VlFeatCovdet('EstimateAffineShape', true);
      dataset = VggAffineDataset('category', 'graf');
      benchmark = RepeatabilityBenchmark();

      for d = 1:2
        for j = 1:5
          repeatability(j,d) = ...
              benchmark.testFeatureExtractor(detectors{d}, ...
                                             dataset.getTransformation(j), ...
                                             dataset.getImagePath(1), ...
                                             dataset.getImagePath(j));
        end
      end

      clf; plot(repeatability, 'linewidth', 4);
      xlabel('image number'); ylabel('repeatability'); grid on;
  39. [Plot: repeatability vs. image number (1-5) for the two detectors compared on the previous slide, with the affine-adaptation variant highlighted.]
  40. Example
      • Compare the following features
        - SIFT, MSER, and features on a grid
        - on the Graffiti sequence
        - for repeatability and number of correspondences
      [Plots: repeatability and number of correspondences vs. viewpoint angle (20-60 degrees) for SIFT, MSER, and features on a grid.]
  41. Backward compatible
      • Previously published results can easily be reproduced
        - if interested, try the script reproduceIjcv05.m
      [Shown: the first page of K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, "A Comparison of Affine Region Detectors", IJCV, 2005 (DOI: 10.1007/s11263-005-3848-x).]
  42. Other useful tricks
      • Compare different parameter settings
        detectors{1} = VggAffine('Detector', 'haraff', 'Threshold', 500);
        detectors{2} = VggAffine('Detector', 'haraff', 'Threshold', 1000);
      • Visualising matches
        [~, ~, matches, reprojFrames] = benchmark.testFeatureExtractor(...);   % arguments as on the earlier slides
        ...
        benchmarks.helpers.plotFrameMatches(matches, reprojFrames);
      [Figures: SIFT matches and mean-variance-median descriptor matches against image 4 of the VggAffineDataset-graf dataset, showing matched and unmatched reference and test image frames.]
  43. Other benchmarks
      • Detector matching score
        benchmark = RepeatabilityBenchmark('mode', 'MatchingScore');
      • Image retrieval
        - example: Oxford 5K lite
        - mAP evaluation
        dataset = VggRetrievalDataset('Category', 'oxbuild', ...
                                      'BadImagesNum', 100);
        benchmark = RetrievalBenchmark();
        mAP = benchmark.testFeatureExtractor(detectors{d}, dataset);
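      Putting the retrieval benchmark together, a small sketch that compares the mAP of the two detectors from the "Comparing two features" slide on Oxford 5K lite (same constructors and testFeatureExtractor call as above):

      % Compare two feature extractors on the Oxford 5K lite retrieval benchmark.
      import datasets.*; import localFeatures.*; import benchmarks.*;
      detectors{1} = VlFeatSift();
      detectors{2} = VlFeatCovdet('EstimateAffineShape', true);
      dataset = VggRetrievalDataset('Category', 'oxbuild', 'BadImagesNum', 100);
      benchmark = RetrievalBenchmark();
      for d = 1:2
        mAP(d) = benchmark.testFeatureExtractor(detectors{d}, dataset);
      end
      fprintf('mAP: SIFT = %.3f, affine-adapted = %.3f\n', mAP(1), mAP(2));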
  44. Summary
      http://www.vlfeat.org/benchmarks/index.html
      • Benchmarks
        - indirect: repeatability and matching score
        - direct: image retrieval
      • VLBenchmarks
        - a simple-to-use MATLAB framework
        - convenient
      • The future
        - existing measures have many shortcomings
        - hopefully better benchmarks will be available soon
        - and they will be added to VLBenchmarks for your convenience
  45. Credits: Karel Lenc, Varun Gulshan, Krystian Mikolajczyk, Tinne Tuytelaars, Jiri Matas, Cordelia Schmid, Andrew Zisserman. HARVEST Programme.
  46. Thank you for coming!
      VLFeat: http://www.vlfeat.org/
      VLBenchmarks: http://www.vlfeat.org/benchmarks/
