Pc Seminar Jordi



  1. 1. Visual Object Recognition
      Perceptual Computing Seminar
      Sergio Escalera, Xavier Baró, Jordi Vitrià, Petia Radeva, Oriol Pujol
      BCN Perceptual Computing Lab
  2. 2. Index
      1. Introduction
      2. Recognition with Local Features: Basics
      3. Invariant representations: SIFT
      4. Recognition as a Classification Problem: FERNS
      5. Very large databases: Hashing
      Visual Object Recognition    Perceptual Computing Seminar    Page 2
  3. 3. Introduction
      The recognition of object categories in images is one of the most challenging problems in computer vision, especially when the number of categories is large. Humans are able to recognize thousands of object types, whereas most of the existing object recognition systems are trained to recognize only a few.
  4. 4. Introduction
      Invariance to viewpoint, illumination, “shape”, color, scale, texture, etc.
  5. 5. Introduction
      Why do we care about recognition? (theoretical question)
      Perception of function: we can perceive the 3D shape, texture, and material properties without knowing about objects. But the concept of category also encapsulates information about what we can do with those objects.
      Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: Year 2009, ICCV 2009 Kyoto, Short Course, September 24.
  6. 6. Introduction
      Why is it hard? Find the chair in this image.
      [Figure labels: “This is a chair”; “Output of correlation”]
      Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: Year 2009, ICCV 2009 Kyoto, Short Course, September 24.
  7. 7. Introduction
      Why is it hard? Find the chair in this image.
      Pretty much garbage; simple template matching is not going to make it.
      Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: Year 2009, ICCV 2009 Kyoto, Short Course, September 24.
  8. 8. Introduction
      Why do we care about recognition? (practical question)
  9. 9. Introduction
      Why do we care about recognition? (practical question)
  10. 10. Introduction
      Why do we care about recognition? (practical question)
      Query; results from 5k Flickr images (demo available for 100k set).
      James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, Andrew Zisserman: Object retrieval with large vocabularies and fast spatial matching. CVPR 2007.
  11. 11. Recognition with Local Features
      It is known that the visual system can use local, informative image «fragments» of a given object, rather than the whole object, to classify it into a familiar category. This approach has some advantages over holistic methods...
  12. 12. Recognition with Local Features
      [Figure labels: Holistic; Fragment‐based]
  13. 13. Recognition with Local Features
      Jay Hegde, Evgeniy Bart, and Daniel Kersten, "Fragment‐based learning of visual object categories", Current Biology, 2008.
  14. 14. Recognition with Local Features
      The most basic approach is called the “bag of words” approach (it was inspired by techniques used by the natural language processing community).
  15. 15. Recognition with Local Features
      Assumptions:
      • Independent features.
      • Histogram representation.
      Fragments vocabulary (generic/class‐based, etc.)
      Image = fragments histogram
      Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: Year 2009, ICCV 2009 Kyoto, Short Course, September 24.
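The "image = fragments histogram" idea can be sketched in a few lines of C. This is only an illustration under assumptions that are not on the slide: fragments are described by fixed-length vectors, the vocabulary is a given array of word centres, and each fragment votes for its nearest word under Euclidean distance. The names `sq_dist` and `bow_histogram` are hypothetical.

```c
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two dim-dimensional vectors. */
static double sq_dist(const double *a, const double *b, size_t dim) {
    double s = 0.0;
    for (size_t i = 0; i < dim; i++) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

/* Assign every fragment descriptor to its nearest vocabulary word and
 * accumulate a histogram normalized to sum to one.
 * desc:  n_desc  x dim, row-major (assumed layout)
 * vocab: n_words x dim, row-major (assumed layout)
 * hist:  n_words output entries */
void bow_histogram(const double *desc, size_t n_desc,
                   const double *vocab, size_t n_words,
                   size_t dim, double *hist) {
    for (size_t w = 0; w < n_words; w++) hist[w] = 0.0;
    for (size_t i = 0; i < n_desc; i++) {
        size_t best = 0;
        double best_d = DBL_MAX;
        for (size_t w = 0; w < n_words; w++) {
            double d = sq_dist(desc + i * dim, vocab + w * dim, dim);
            if (d < best_d) { best_d = d; best = w; }
        }
        hist[best] += 1.0;  /* features are treated as independent votes */
    }
    if (n_desc > 0)
        for (size_t w = 0; w < n_words; w++) hist[w] /= (double)n_desc;
}
```

Because the histogram discards where each fragment was found, two images with the same fragments in different arrangements get the same representation, which is exactly the independence assumption listed on the slide.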
  16. 16. Recognition with Local Features
      A more advanced approach involves several steps:
      • Stage 0: Find image locations where we can reliably find correspondences with other images.
      • Stage 1: Image content is transformed into local features (that are invariant to translation, rotation, and scale).
      • Stage 2: Verify if they belong to a consistent configuration.
      Slide credit: David Lowe
  17. 17. SIFT
      A wonderful example of these stages can be found in David Lowe’s (2004) “Distinctive image features from scale‐invariant keypoints” paper, which describes the development and refinement of his Scale Invariant Feature Transform (SIFT).
      Local features, e.g. SIFT
  18. 18. Recognition with Local Features
      Which local features?
      Slide credit: A. Efros
  19. 19. SIFT
      Stage 0: How can we find image locations where we can reliably find correspondences with other images?
      A “good” location has one stable sharp extremum.
      [Figure: three plots of f against x, one labelled “Good!” and two labelled “bad”]
  20. 20. SIFT
  21. 21. SIFT
      Stage 0: How can we find image locations where we can reliably find correspondences with other images?
      How to compute extrema at a given scale:
      1) We apply a Gaussian filter.
      2) We compute a difference‐of‐Gaussians.
      3) We look for 3D extrema in the resulting structure.
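For reference, the three steps can be written with the standard definitions from Lowe's (2004) paper. The symbols $L$, $G$, $D$, $k$ and $\sigma$ follow that paper and do not appear on the slide itself:

```latex
\begin{align}
G(x, y, \sigma) &= \frac{1}{2\pi\sigma^{2}}\, e^{-(x^{2} + y^{2})/(2\sigma^{2})} \\
L(x, y, \sigma) &= G(x, y, \sigma) * I(x, y) \\
D(x, y, \sigma) &= L(x, y, k\sigma) - L(x, y, \sigma)
\end{align}
```

Here $*$ is convolution with the image $I$. A sample of $D$ is kept as a candidate keypoint when it is larger or smaller than all 26 of its neighbours: 8 in its own scale and 9 in each of the two adjacent scales, which is what "3D extrema" refers to.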
  22. 22. SIFT
  23. 23. SIFT
      These features are invariant to location and scale.
  24. 24. SIFT
      Stage 1: Image content is transformed into local features (that are invariant to translation, rotation, and scale).
      In addition to dealing with scale changes, we need to deal with (at least) in‐plane image rotation. One way to deal with this problem is to design descriptors that are rotationally invariant, but such descriptors have poor discriminability, i.e. they map different looking patches to the same descriptor.
  25. 25. SIFT
      A better method is to estimate a dominant orientation at each detected keypoint:
      1. Calculate the histogram of local gradients in the window.
      2. Take the dominant gradient orientation as “up”.
      3. Rotate the local area for computing the descriptor.
  26. 26. SIFT
      Lowe:
      • computes a 36‐bin histogram of edge orientations weighted by both gradient magnitude and Gaussian distance to the center,
      • finds all peaks within 80% of the global maximum, and then
      • computes a more accurate orientation estimate using a 3‐bin parabolic fit.
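The peak selection and the 3‐bin parabolic fit can be sketched as follows. This is a minimal illustration, not Lowe's implementation: the histogram is assumed to be already filled and weighted, bin `i` is assumed to cover the angles from `i*w` to `(i+1)*w` degrees with `w = 360/n_bins`, and the function names are hypothetical.

```c
/* Refine a histogram peak with a parabola through the peak bin and its two
 * circular neighbours. Returns the peak angle in degrees in [0, 360).
 * Assumption: bin i covers [i*w, (i+1)*w) with w = 360 / n_bins. */
double refine_peak_deg(const double *hist, int n_bins, int peak) {
    double left = hist[(peak + n_bins - 1) % n_bins];
    double mid = hist[peak];
    double right = hist[(peak + 1) % n_bins];
    double denom = left - 2.0 * mid + right;
    /* Vertex of the parabola, as an offset in bins from the peak centre. */
    double offset = (denom != 0.0) ? 0.5 * (left - right) / denom : 0.0;
    double angle = (peak + 0.5 + offset) * (360.0 / n_bins);
    if (angle < 0.0) angle += 360.0;
    if (angle >= 360.0) angle -= 360.0;
    return angle;
}

/* Collect every local maximum whose height is at least `ratio` times the
 * global maximum (0.8 on the slide). Returns the number of angles written. */
int dominant_orientations(const double *hist, int n_bins, double ratio,
                          double *angles, int max_out) {
    double global_max = 0.0;
    int count = 0;
    for (int i = 0; i < n_bins; i++)
        if (hist[i] > global_max) global_max = hist[i];
    for (int i = 0; i < n_bins && count < max_out; i++) {
        double left = hist[(i + n_bins - 1) % n_bins];
        double right = hist[(i + 1) % n_bins];
        if (hist[i] > left && hist[i] > right && hist[i] >= ratio * global_max)
            angles[count++] = refine_peak_deg(hist, n_bins, i);
    }
    return count;
}
```

With 36 bins, a peak in bin 9 whose neighbours hold 4 and 6 against a peak value of 10 is shifted by 0.1 bin towards the larger neighbour, giving 96 degrees instead of the bin centre at 95 degrees.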
  27. 27. SIFT
  28. 28. SIFT
      [Figure labels: local patch around descriptor; gradient magnitude; gradient orientation; from Gaussian pyramid]
  29. 29. SIFT
  30. 30. SIFT
  31. 31. SIFT
      Even after compensating for translation, rotation, and scale changes, the local appearance of image patches will usually still vary from image to image.
      How can we make the descriptor that we match more invariant to such changes, while still preserving discriminability between different (non‐corresponding) patches?
  32. 32. SIFT
      SIFT features are formed by computing the gradient at each pixel in a 16x16 window around the detected keypoint, using the appropriate level of the Gaussian pyramid at which the keypoint was detected. The gradient magnitudes are downweighted by a Gaussian fall‐off function in order to reduce the influence of gradients far from the center, as these are more affected by small misregistrations.
  33. 33. SIFT
      In each 4x4 quadrant, a gradient orientation histogram is formed by (conceptually) adding the weighted gradient value to one of 8 orientation histogram bins.
  34. 34. SIFT
      The resulting 128 non‐negative values (16 quadrants × 8 orientation bins) form a raw version of the SIFT descriptor vector. To reduce the effects of contrast/gain (additive variations are already removed by the gradient), the 128‐D vector is normalized to unit length.
  35. 35. SIFT
      Once we have extracted features and their descriptors from two or more images, the next step is to establish some preliminary feature matches between these images.
  36. 36. SIFT
      Once we have extracted features and their descriptors from two or more images, the next step is to establish some preliminary feature matches between these images.
      SIFT uses a nearest neighbor classifier with a distance ratio matching criterion. We can define this nearest neighbor distance ratio as
      $$\mathrm{NNDR} = \frac{d_1}{d_2} = \frac{\lVert D_A - D_B \rVert}{\lVert D_A - D_C \rVert}$$
      where $d_1$ and $d_2$ are the nearest and second nearest neighbor distances, and $D_A$, $D_B$, $D_C$ are the target descriptor along with its closest two neighbors.
  37. 37. SIFT
  38. 38. SIFT
      Linear method: The simplest way to find all corresponding feature points is to compare all features against all other features in each pair of potentially matching images. Unfortunately, this is quadratic in the number of extracted features, which makes it impractical for some applications.
  39. 39. SIFT
      Nearest‐neighbor matching is the major computational bottleneck:
      • Linear search performs $dn^2$ operations for n feature points and d dimensions.
      • No exact NN methods are faster than linear search for d > 10.
      • Approximate methods can be much faster, but at the cost of missing some correct matches. Failure rate gets worse for large datasets.
  40. 40. SIFT
      A better approach is to devise an indexing structure such as a multi‐dimensional search tree or a hash table to rapidly search for features near a given feature. For extremely large databases (millions of images or more), even more efficient structures based on ideas from document retrieval (e.g., vocabulary trees) can be used.
  41. 41. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      The first step is to establish a set of putative correspondences.
  42. 42. SIFT
      How can we discard erroneous correspondences?
  43. 43. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      Once we have some hypothetical (putative) matches, we can use geometric alignment to verify which matches are inliers and which ones are outliers.
  44. 44. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      • Extract features
      • Compute putative matches
  45. 45. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      • Loop:
        – Hypothesize transformation T (using a small group of putative matches that are related by T)
  46. 46. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      • Loop:
        – Hypothesize transformation T (small group of putative matches that are related by T)
        – Verify transformation (search for other matches consistent with T)
  47. 47. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
  48. 48. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      2D transformation models:
      • Similarity (translation, scale, rotation)
      • Affine
      • Projective (homography)
  49. 49. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      Fitting an affine transformation (given the point correspondences): $(x_i, y_i) \rightarrow (x_i', y_i')$
      Slide credit: S. Lazebnik
  50. 50. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      Fitting an affine transformation (given the point correspondences):
      $$\begin{bmatrix} x_i' \\ y_i' \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} + \begin{bmatrix} t_1 \\ t_2 \end{bmatrix}$$
      $$\begin{bmatrix} & & \cdots & & & \\ x_i & y_i & 0 & 0 & 1 & 0 \\ 0 & 0 & x_i & y_i & 0 & 1 \\ & & \cdots & & & \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_1 \\ t_2 \end{bmatrix} = \begin{bmatrix} \cdots \\ x_i' \\ y_i' \\ \cdots \end{bmatrix}$$
      Slide credit: S. Lazebnik
  51. 51. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      Fitting an affine transformation (given the point correspondences):
      • Linear system with six unknowns.
      • Each match gives us two linearly independent equations: we need at least three matches to solve for the transformation parameters.
      • Can solve $Ax = b$ using the pseudo‐inverse: $x = (A^T A)^{-1} A^T b$
      Slide credit: S. Lazebnik
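The pseudo‐inverse solution can be sketched without a linear algebra library. Because the rows for $x_i'$ involve only $(m_1, m_2, t_1)$ and the rows for $y_i'$ involve only $(m_3, m_4, t_2)$, the normal equations $A^T A\,x = A^T b$ split into two 3×3 systems that share the same matrix. The code below exploits that and solves each system with Cramer's rule; the function names and the fixed singularity threshold are choices made for this illustration.

```c
#include <math.h>
#include <stddef.h>

static double det3(double m[3][3]) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* Solve N p = r by Cramer's rule. Returns 0 if N is (numerically) singular.
 * The absolute threshold is a crude choice for this sketch. */
static int solve3(double N[3][3], const double r[3], double p[3]) {
    double d = det3(N);
    if (fabs(d) < 1e-12) return 0;
    for (int c = 0; c < 3; c++) {
        double t[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                t[i][j] = (j == c) ? r[i] : N[i][j];
        p[c] = det3(t) / d;
    }
    return 1;
}

/* Least-squares affine fit from n >= 3 matches (x, y) -> (xp, yp).
 * out = {m1, m2, m3, m4, t1, t2}. Returns 1 on success, 0 on failure
 * (too few matches or collinear points). */
int fit_affine(const double *x, const double *y,
               const double *xp, const double *yp,
               size_t n, double out[6]) {
    double N[3][3] = {{0}}, rx[3] = {0}, ry[3] = {0};
    double px[3], py[3];
    if (n < 3) return 0;
    for (size_t i = 0; i < n; i++) {
        double row[3] = {x[i], y[i], 1.0};
        for (int a = 0; a < 3; a++) {
            for (int b = 0; b < 3; b++) N[a][b] += row[a] * row[b];
            rx[a] += row[a] * xp[i];  /* A^T b, rows for x' */
            ry[a] += row[a] * yp[i];  /* A^T b, rows for y' */
        }
    }
    if (!solve3(N, rx, px) || !solve3(N, ry, py)) return 0;
    out[0] = px[0]; out[1] = px[1];  /* m1, m2 */
    out[2] = py[0]; out[3] = py[1];  /* m3, m4 */
    out[4] = px[2]; out[5] = py[2];  /* t1, t2 */
    return 1;
}
```

With exactly three non-collinear matches the fit is exact; with more matches it minimizes the summed squared residual, which is why outliers have to be removed first.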
  53. 53. SIFT
      Stage 2: Verify if they belong to a consistent configuration.
      The process of selecting a small set of seed matches and then verifying a larger set is often called random sampling or RANSAC.
  54. 54. RANSAC
      RANSAC was originally formulated in Martin A. Fischler and Robert C. Bolles (June 1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography". Comm. of the ACM 24: 381–395.
  55. 55. RANSAC
      “We approached the fitting problem in the opposite way from most previous techniques. Instead of averaging all the measurements and then trying to throw out bad ones, we used the smallest number of measurements to compute a model’s unknown parameters and then evaluated the instantiated model by counting the number of consistent samples”
      From “RANSAC: An Historical Perspective”, Bob Bolles & Marty Fischler, 2006.
  56. 56. RANSAC
      It’s easy to understand and it’s effective:
      • It helps solve a common problem (i.e., filter out gross errors introduced by automatic techniques).
      • The number of trials to “guarantee” a high level of success (e.g., 99.99% probability) is surprisingly small.
      • The dramatic increase in computation speed made it possible to do a large number of trials (100s or 1000s).
      • The algorithm can stop as soon as a good match is computed (unlike Hough techniques that typically compute a large number of examples and then identify matches).
      From “RANSAC: An Historical Perspective”, Bob Bolles & Marty Fischler, 2006.
  57. 57. RANSAC
      The basic idea is to repeat the following process M times, each time starting from a small random subset of the data taken as hypothetical inliers:
      1. A model is fitted to the hypothetical inliers, i.e. all free parameters of the model are reconstructed from the data set.
      2. All other data are then tested against the fitted model and, if a point fits well to the estimated model, it is also considered a hypothetical inlier.
      3. The estimated model is reasonably good if sufficiently many points have been classified as hypothetical inliers.
      4. The model is reestimated from all hypothetical inliers, because it has only been estimated from the initial set of hypothetical inliers.
      5. Finally, the model is evaluated by estimating the error of the inliers relative to the model.
      This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are classified as inliers, or a refined model together with a corresponding error measure. In the latter case, we keep the refined model if its error is lower than that of the last saved model.
      From “RANSAC: An Historical Perspective”, Bob Bolles & Marty Fischler, 2006.
  58. 58. RANSAC
  59. 59. RANSAC
      Line fitting example. Task: estimate the best line.
  60. 60. RANSAC
      Line fitting example. Sample two points.
  61. 61. RANSAC
      Line fitting example. Fit a line.
  62. 62. RANSAC
      Line fitting example. Count the total number of points within a threshold of the line.
  63. 63. RANSAC
      Line fitting example. Repeat until we get a good result.
  64. 64. RANSAC
      Line fitting example. Repeat until we get a good result.
  65. 65. RANSAC
  66. 66. RANSAC example: translation
      Putative matches
      Slide credit: A. Efros
  67. 67. RANSAC example: translation
      Select one match, count inliers
      Slide credit: A. Efros
  68. 68. RANSAC example: translation
      Find “average” translation vector
      Slide credit: A. Efros
  69. 69. RANSAC
      [Figure labels: interest points (500/image); putative correspondences (268); outliers (117); inliers (151); final inliers (262)]
  70. 70. SIFT Applications
  71. 71. SIFT Applications
  72. 72. SIFT Applications
      HDRSoft
  73. 73. SIFT Applications
  74. 74. Matching and Classification
      SIFT allows reliable real‐time recognition but at a computational cost that severely limits the number of points that can be handled. A standard implementation requires 1 ms per feature point, which limits the number of feature points to 50 per frame if one requires frame‐rate performance.
  75. 75. Matching and Classification
      An alternative is to rely on statistical learning techniques to model the set of possible appearances of a patch. The major challenge is to use simple models to allow for real‐time, efficient recognition.
  76. 76. Matching and Classification
      Can we match keypoints using simpler features, without intensive preprocessing?
      We will assume that we have the possibility to train a classifier for each keypoint class.
  77. 77. Matching and Classification
      Simple binary features. The test compares the intensities of two pixels, $I(m_{i,1})$ and $I(m_{i,2})$, around the keypoint:
      $$f_i = \begin{cases} 1 & \text{if } I(m_{i,1}) < I(m_{i,2}) \\ 0 & \text{otherwise} \end{cases}$$
  78. 78. Matching and Classification
      Without intensive preprocessing: we can synthetically generate the set of the keypoint’s possible appearances under various perspective, lighting, noise, etc.
  79. 79. Matching and Classification
      FERN Formulation
      We model the class conditional probabilities of a large number of binary features, which are estimated in a training phase. At run time, these probabilities are used to select the best match for a given image patch.
  80. 80. Matching and Classification
      FERN Formulation
      $f_i$: binary feature. $N_f$: total number of features in the model. $C_k$: class representing all views of an image patch around a keypoint.
      Given $f_1, \dots, f_{N_f}$, select the class $\hat{k}$ such that
      $$\hat{k} = \arg\max_k P(C_k \mid f_1, f_2, \dots, f_{N_f}) = \arg\max_k P(f_1, f_2, \dots, f_{N_f} \mid C_k)$$
      (the second equality follows from Bayes' rule when the prior over classes is uniform).
      Mustafa Ozuysal, Michael Calonder, Vincent Lepetit, Pascal Fua, "Fast Keypoint Recognition Using Random Ferns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 99, 2009.
  81. 81. Matching and Classification
      FERN Formulation
      However, it is not practical to model the joint distribution of all features. We group features into small sets (ferns) and assume independence between these sets (Semi‐Naïve Bayesian Classifier):
      $$P(f_1, f_2, \dots, f_{N_f} \mid C_k) = \prod_{j=1}^{M} P(F_j \mid C_k)$$
      $F_j$: a fern is defined to be a set of S binary features $\{f_r, \dots, f_{r+S-1}\}$. M is the number of ferns, $N_f = S \times M$.
  82. 82. Matching and Classification
      FERN Formulation
      • $P(f_1, f_2, \dots, f_{N_f} \mid C_k)$: $2^{N_f}$ parameters!
      • $P(f_1, f_2, \dots, f_{N_f} \mid C_k) = \prod_{i=1}^{N_f} P(f_i \mid C_k)$: $N_f$ parameters, but too simple.
      • $P(f_1, f_2, \dots, f_{N_f} \mid C_k) = \prod_{j=1}^{M} P(F_j \mid C_k)$: $M \times 2^S$ parameters.
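A worked example makes the three parameter counts concrete. The values $M = 50$ and $S = 10$ (so $N_f = 500$) are chosen purely for illustration and are not taken from the slide; all counts are per class $C_k$:

```latex
\begin{align}
\text{full joint:} \quad & 2^{N_f} = 2^{500} \approx 3.3 \times 10^{150} \\
\text{fully independent features:} \quad & N_f = 500 \\
\text{ferns (semi-naive):} \quad & M \times 2^{S} = 50 \times 2^{10} = 51\,200
\end{align}
```

The fern model sits between the two extremes: it keeps the dependencies among the $S$ features inside each fern while remaining small enough to store and to estimate from training data.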
  83. 83. Matching and Classification
      FERN Implementation
      We generate a random set of binary features. A binary feature outputs a binary number (2 possibilities); a group of three binary features has 8 possible outputs.
      A fern with S nodes outputs a number between 0 and $2^S - 1$.
  84. 84. Matching and Classification
      FERN Implementation
      When we have multiple patches of the same class, we can model the output of a fern with a multinomial distribution.
      [Figure label: probability for each possibility]
  85. 85. Matching and Classification
      Slide Credit: V. Lepetit
  86. 86. Matching and Classification
      Slide Credit: V. Lepetit
  87. 87. Matching and Classification
      Slide Credit: V. Lepetit
  88. 88. Matching and Classification
      Slide Credit: V. Lepetit
  89. 89. Matching and Classification
      Slide Credit: V. Lepetit
  90. 90. Matching and Classification
      Normalize, so that the probabilities of all fern outputs 000, 001, …, 111 sum to one for each class:
      $$\sum_{(f_1, \dots, f_n) \in \{000,\, 001,\, \dots,\, 111\}} P(f_1, f_2, \dots, f_n \mid C = c_i) = 1$$
      Slide Credit: V. Lepetit
  91. 91. Matching and Classification
      FERN Implementation
      At the end of the training we have distributions over possible fern outputs for each class.
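Training one fern amounts to counting how often each output occurs for each class and normalizing. The sketch below is an illustration under assumptions: patches are stored as flat arrays of 8-bit intensities, a fern is described by `S` pairs of pixel indices, and an optional `prior` count is added to every cell so that unseen outputs do not get probability zero. The function names and data layout are hypothetical.

```c
#include <stddef.h>

/* Output of one fern on one patch: S pixel-pair comparisons packed into
 * an integer in [0, 2^S - 1]. pairs holds 2*S pixel indices. */
unsigned fern_output(const unsigned char *patch, const int *pairs, int S) {
    unsigned index = 0;
    for (int j = 0; j < S; j++) {
        index <<= 1;
        if (patch[pairs[2 * j]] < patch[pairs[2 * j + 1]]) index++;
    }
    return index;
}

/* Estimate P(F = k | C = c) for one fern from labelled training patches.
 * patches: n_patches x patch_size, row-major (assumed layout)
 * prob:    n_classes x 2^S output table, row-major
 * prior:   pseudo-count added to every cell (an assumed smoothing choice) */
void train_fern(const unsigned char *patches, size_t patch_size,
                const int *labels, size_t n_patches,
                const int *pairs, int S, size_t n_classes,
                double prior, double *prob) {
    size_t K = (size_t)1 << S;  /* number of possible fern outputs */
    for (size_t i = 0; i < n_classes * K; i++) prob[i] = prior;
    for (size_t p = 0; p < n_patches; p++) {
        unsigned out = fern_output(patches + p * patch_size, pairs, S);
        prob[(size_t)labels[p] * K + out] += 1.0;
    }
    for (size_t c = 0; c < n_classes; c++) {  /* normalize each class row */
        double total = 0.0;
        for (size_t k = 0; k < K; k++) total += prob[c * K + k];
        if (total > 0.0)
            for (size_t k = 0; k < K; k++) prob[c * K + k] /= total;
    }
}
```

In practice the training patches for a class would be the synthetically warped views of one keypoint, and the table would be stored as log-probabilities so that classification can add rows instead of multiplying them.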
  92. 92. Matching and Classification
      FERN Implementation
      To recognize a new patch, the fern outputs select one row of distributions per fern, and these rows are then combined assuming independence between ferns.
  93. 93. Matching and Classification
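  94. 94. Matching and Classification
      FERN Implementation …in 10 lines of code….

          /* Assumed meanings (not stated on the slide): H classes, M ferns, S tests per fern;
           * K points into the patch; D holds 2*S pixel offsets per fern; PF holds
           * log-probabilities with strides shift1 (per fern output) and shift2 (per fern). */
          void fern_classify(const unsigned char *K, const int *D, const float *PF,
                             int H, int M, int S, int shift1, int shift2, float *P) {
              for (int i = 0; i < H; i++) P[i] = 0.f;
              for (int k = 0; k < M; k++) {                   /* for each fern */
                  int index = 0;
                  const int *d = D + k * 2 * S;               /* offset pairs of fern k */
                  for (int j = 0; j < S; j++) {               /* S binary tests -> S-bit index */
                      index <<= 1;
                      if (*(K + d[0]) < *(K + d[1])) index++;
                      d += 2;
                  }
                  const float *p = PF + k * shift2 + index * shift1;  /* row selected by the output */
                  for (int i = 0; i < H; i++) P[i] += p[i];   /* accumulate per-class scores */
              }
          }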
  95. 95. Matching and Classification
  96. 96. Matching and Classification
  97. 97. Matching and Classification
  98. 98. Matching and Classification
  99. 99. Matching and Classification
      The FERN technique speeds up keypoint matching, but the training is slow and performed offline. Hence, it is not suited for applications that require real‐time online learning or incremental addition of arbitrary numbers of keypoints (e.g., SLAM).
  100. 100. Matching and Classification
      This limitation can be removed if we train a FERN classifier to recognize a number of keypoints extracted from a reference database, and all other keypoints are characterized in terms of their response to these classification ferns (signature).
  101. 101. Matching and Classification
      M. Calonder, V. Lepetit, and P. Fua, Keypoint Signatures for Fast Learning and Recognition. In Proceedings of European Conference on Computer Vision, 2008.
  102. 102. Matching and Classification
      It can be empirically shown that these signatures are stable under changes in viewing conditions.
      Signatures are sparse in nature if we apply a threshold function.
      Signatures do not need a training phase and scale well with the number of classes (nearest neighbor).
  103. 103. Matching and Classification However, matching signatures still involves many more elementary operations than absolutely necessary. Moreover, evaluating the signatures requires storing many distributions of the same size as themselves and, therefore, large amounts of memory.
  104. 104. Matching and Classification The full response vector r(p) for all J Ferns is taken to be: [equation image, annotated "Vectors storing the probability that p is one of the N reference points"] where Z is a normalizer chosen so that the elements of r(p) sum to one. In practice, when p truly corresponds to one of the reference keypoints, r(p) contains one element that is close to one, whereas all others are close to zero. Otherwise, it contains a few relatively large values that correspond to reference keypoints that are similar in appearance, and small values elsewhere.
  105. 105. Matching and Classification We can compute a sparse signature by applying a point-wise threshold function with a threshold value θ. The result is an N-dimensional vector with only a few non-zero elements that is mostly invariant to different imaging conditions and is therefore a useful descriptor for matching purposes.
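The thresholding step can be sketched in a few lines. This is a minimal illustration, assuming that entries at or above θ are kept unchanged and all others are set to zero; the response vector, the value of θ, and the function name `sparse_signature` are hypothetical and not taken from the slides.

```python
def sparse_signature(response, theta):
    """Point-wise threshold: keep entries >= theta, set the rest to zero.

    Assumption: surviving entries keep their value (they are not binarized).
    """
    return [value if value >= theta else 0.0 for value in response]


# Hypothetical response vector r(p) over N = 8 reference keypoints (sums to one).
r_p = [0.02, 0.55, 0.03, 0.01, 0.30, 0.04, 0.03, 0.02]
theta = 0.1  # assumed threshold, chosen only for illustration

signature = sparse_signature(r_p, theta)
nonzero = [i for i, value in enumerate(signature) if value > 0.0]

print(signature)  # only two entries survive the threshold
print(nonzero)    # [1, 4]
```

Only the indices of the reference keypoints that resemble the patch remain, which is what makes the signature sparse.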
  106. 106. Matching and Classification [Figure: the patch is passed through the J Ferns; the leaves hold vectors storing the probability that p is one of the N reference points.] Typical parameters: J=50; d=10; N=500
  107. 107. Matching and Classification Typical parameters: J=50; d=10; N=500. For each of the 2^d leaves in each of the J Ferns we need an N-dimensional vector of floats. The total memory requirement is M = b·J·2^d·N bytes, where b is the number of bytes used to store a float (4 for single precision). In practice, about 100MB!
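The arithmetic behind the memory estimate can be checked directly. The sketch below assumes 4-byte single-precision floats, which is the value that reproduces the figure of roughly 100MB with the typical parameters; with 8-byte doubles the total would be twice as large. The function name is hypothetical.

```python
def fern_memory_bytes(num_ferns, depth, num_refs, bytes_per_float):
    """M = b * J * 2^d * N: one N-dimensional float vector per leaf per Fern."""
    return bytes_per_float * num_ferns * (2 ** depth) * num_refs


# Typical parameters from the slide: J = 50, d = 10, N = 500.
# Assumption: b = 4 bytes per float (single precision).
total = fern_memory_bytes(num_ferns=50, depth=10, num_refs=500, bytes_per_float=4)

print(total)        # 102400000 bytes
print(total / 1e6)  # 102.4 (megabytes, decimal)
```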
  108. 108. Matching and Classification Compressive Sensing literature:
      • High-dimensional sparse vectors can be reconstructed from their linear projections into much lower-dimensional spaces.
      • The Johnson–Lindenstrauss lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
  109. 109. Matching and Classification Many kinds of matrices can be used for this purpose. Random Ortho-Projection (ROP) matrices are a good choice and can be easily constructed by applying a Gram–Schmidt orthonormalization process to a random matrix.
  110. 110. Matching and Classification In mathematics, the Gram–Schmidt process is a method for orthonormalizing a set of vectors in an inner product space, most commonly the Euclidean space R^n. The Gram–Schmidt process takes a finite, linearly independent set S = {v1, …, vk} for k ≤ n and generates an orthogonal set S' = {u1, …, uk} that spans the same k-dimensional subspace of R^n as S.
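As a rough sketch of how a Random Ortho-Projection matrix could be built with Gram–Schmidt, the code below orthonormalizes the rows of a Gaussian random matrix and uses the result to project a sparse vector into a lower-dimensional space. The matrix sizes, the Gaussian entries, the seed, and all function names are assumptions made for illustration; the numerically more stable "modified" variant of Gram–Schmidt is used.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coeff = dot(w, u)  # component of w along the unit vector u
            w = [wi - coeff * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("vectors are not linearly independent")
        basis.append([wi / norm for wi in w])
    return basis


def random_ortho_projection(m, n, seed=0):
    """m x n matrix with orthonormal rows, built from Gaussian noise (m <= n)."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    return gram_schmidt(rows)


def project(matrix, x):
    return [dot(row, x) for row in matrix]


# Hypothetical sizes: compress a 16-dimensional sparse vector to 4 dimensions.
rop = random_ortho_projection(m=4, n=16)
sparse = [0.0] * 16
sparse[3] = 0.7
sparse[11] = 0.3
compact = project(rop, sparse)

print(len(compact))  # 4
```

Because the projection is linear, projecting a sum of vectors gives the same result as summing their projections.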
  111. 111. Matching and Classification M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, Compact Signatures for High-speed Interest Point Description and Matching. In Proceedings of International Conference on Computer Vision, 2009.
  112. 112. Matching and Classification M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, Compact Signatures for High-speed Interest Point Description and Matching. In Proceedings of International Conference on Computer Vision, 2009.
  113. 113. Matching and Classification M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, Compact Signatures for High-speed Interest Point Description and Matching. In Proceedings of International Conference on Computer Vision, 2009.
  114. 114. Matching and Classification This approach reduces the memory required to store the models: for N=512, M=176, the requirements change from 93.75MB to 175B! The CPU time is 6.3 ms for an exhaustive NN matching of 256 points (256x256).
  115. 115. Internet-scale image databases
  116. 116. Min HASH How can we find similar images in very large datasets? Can we get clusters from these images?
  117. 117. Min HASH Let's suppose that we choose a LARGE bag-of-words representation of our images and that we use a binary histogram.
  118. 118. Min HASH Given two different images, we can compute their histogram intersection:
  119. 119. Min HASH …and their histogram union:
  120. 120. Min HASH Then we can define a set similarity measure in the following way: sim(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2|. That is, the number of visual words that both images have in common divided by the total number of visual words that are present in at least one of the two images.
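A small sketch of this set similarity on binary bag-of-words histograms follows. The two toy histograms and the function name `set_similarity` are hypothetical; a real vocabulary would be far larger.

```python
def set_similarity(hist_a, hist_b):
    """Intersection over union of two binary bag-of-words histograms."""
    intersection = sum(1 for a, b in zip(hist_a, hist_b) if a and b)
    union = sum(1 for a, b in zip(hist_a, hist_b) if a or b)
    return intersection / union if union else 0.0


# Hypothetical binary histograms over a vocabulary of 8 visual words.
image_1 = [1, 0, 1, 1, 0, 0, 1, 0]
image_2 = [1, 1, 0, 1, 0, 0, 1, 0]

# 3 words are shared and 5 words appear in at least one image.
print(set_similarity(image_1, image_2))  # 0.6
```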
  121. 121. Min HASH
  122. 122. Min HASH We can perform clustering or matching of an unordered set of images with this measure, but this can be used only with a limited amount of data! The method requires Σ_{i=1}^{w} d_i^2 similarity evaluations, where w is the size of the vocabulary and d_i is the number of regions assigned to the i-th visual word. A commonly used vocabulary size is w = 1,000,000.
  123. 123. Min HASH We can perform clustering or matching of an unordered set of images with this measure, but this can be used only with a limited amount of data! Observation: histograms for an image are highly sparse!
  124. 124. Min HASH The key idea of min-hash is to map ("hash") each row/histogram to a small amount of data Sig(A) (the signature) such that:
      • Sig(A) is small enough.
      • Rows A1 and A2 are highly similar if Sig(A1) is highly similar to Sig(A2).
  125. 125. Min HASH Useful convention: we will refer to columns as being of four types:
        A1:    1 0 1 0
        A2:    1 1 0 0
        Type:  a b c d
      We will also use "a" as the number of columns of type a. Notes:
      • Sim(A1, A2) = a/(a+b+c)
      • Most columns are of type d.
  126. 126. Min HASH
      • Imagine the columns permuted randomly in order.
      • Hash each row A to h(A), the number of the first column in which row A has a 1.
        A1:  1 0 0 1 0  --π-->  0 1 0 0 1   h(A1)=2
        A2:  1 0 0 0 0  --π-->  0 1 0 0 0   h(A2)=2
      The probability that h(A1) = h(A2) is a/(a+b+c) = Sim(A1, A2) (the hashes agree if the first column with a 1 is of type a, and disagree if it is of type b or c).
  127. 127. Min HASH If we repeat the experiment with a new permutation of columns a large number of times, say 512, we get a signature consisting of 512 column numbers for each row. The "similarity" of these lists (fraction of positions in which they agree) will be very close to the similarity of the rows (similar signatures mean similar rows!).
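The permutation experiment can be simulated to see the agreement rate approach the set similarity. This is a sketch under stated assumptions: the two binary rows, the vocabulary size of 20, the random seed, and the function names are all hypothetical, and 512 permutations are used as suggested in the text.

```python
import random


def min_hash(row, order):
    """1-based position, in the permuted order, of the first column holding a 1."""
    for position, column in enumerate(order, start=1):
        if row[column]:
            return position
    return None  # the row contains no 1 at all


def permutation_signature(row, permutations):
    return [min_hash(row, order) for order in permutations]


def agreement(sig_a, sig_b):
    """Fraction of signature positions in which the two lists agree."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


rng = random.Random(0)  # fixed seed so the run is repeatable
num_columns = 20

# Hypothetical rows: 4 shared columns, 8 columns in the union, so Sim = 0.5.
a1 = [1 if c in {0, 2, 5, 7, 11, 13} else 0 for c in range(num_columns)]
a2 = [1 if c in {0, 2, 5, 8, 11, 17} else 0 for c in range(num_columns)]

permutations = []
for _ in range(512):
    order = list(range(num_columns))
    rng.shuffle(order)
    permutations.append(order)

estimate = agreement(
    permutation_signature(a1, permutations),
    permutation_signature(a2, permutations),
)
print(estimate)  # an estimate of Sim(a1, a2) = 0.5
```

The estimate fluctuates around the true similarity, and the fluctuation shrinks as the number of permutations grows.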
  128. 128. Min HASH In fact, it is not necessary to permute the columns: we can hash each original column with 512 different hash functions and keep, for each row, the lowest hash value of a column in which that row has a 1, independently for each of the 512 hash functions. Then we look for the coincidences.
        row:  1 0 0 1 0    signature
        h1:   5 1 3 2 4    h1(row) = 2
        h2:   1 2 5 3 4    h2(row) = 1
        h3:   3 4 1 5 2    h3(row) = 3
        h4:   2 5 4 1 3    h4(row) = 1
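The example on this slide can be reproduced with a few lines. In this sketch each hash function is represented simply as a list of hash values, one per column, copied from the slide; the function name `min_hash_signature` is hypothetical.

```python
def min_hash_signature(row, hash_tables):
    """For each hash function, keep the lowest hash value among the
    columns in which the row has a 1."""
    return [min(h[c] for c, bit in enumerate(row) if bit) for h in hash_tables]


row = [1, 0, 0, 1, 0]
hash_tables = [
    [5, 1, 3, 2, 4],  # h1
    [1, 2, 5, 3, 4],  # h2
    [3, 4, 1, 5, 2],  # h3
    [2, 5, 4, 1, 3],  # h4
]

print(min_hash_signature(row, hash_tables))  # [2, 1, 3, 1]
```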
  129. 129. Min HASH
        Row 1:  1 0 1 1 0
        Row 2:  0 1 0 0 1
        Row 3:  1 1 0 1 0
        h1:  1 2 3 4 5    h1(row) = 1, 2, 1
        h2:  5 4 3 2 1    h2(row) = 2, 1, 2
        h3:  3 4 5 1 2    h3(row) = 1, 2, 1
      Similarities:
        Pair   Row-Row   Sig-Sig
        1-2    0/5       0/3
        1-3    2/4       3/3
        2-3    1/4       0/3
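The table above can be checked by computing both similarities for every pair of rows. The rows and hash values are taken from the slide; the function names are hypothetical, and the similarities are returned as unreduced (numerator, denominator) pairs so that they print the same way as on the slide.

```python
from itertools import combinations

rows = {
    1: [1, 0, 1, 1, 0],
    2: [0, 1, 0, 0, 1],
    3: [1, 1, 0, 1, 0],
}
hash_tables = [
    [1, 2, 3, 4, 5],  # h1
    [5, 4, 3, 2, 1],  # h2
    [3, 4, 5, 1, 2],  # h3
]


def signature(row):
    """Lowest hash value, per hash function, over the columns where the row has a 1."""
    return [min(h[c] for c, bit in enumerate(row) if bit) for h in hash_tables]


def row_similarity(a, b):
    """(size of intersection, size of union) of the two binary rows."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared, either


def signature_similarity(a, b):
    """(number of agreeing min-hashes, signature length)."""
    sig_a, sig_b = signature(a), signature(b)
    agree = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return agree, len(sig_a)


for i, j in combinations(sorted(rows), 2):
    rr = row_similarity(rows[i], rows[j])
    ss = signature_similarity(rows[i], rows[j])
    print(f"{i}-{j}: row-row {rr[0]}/{rr[1]}, sig-sig {ss[0]}/{ss[1]}")
```

With only three hash functions the signature similarity is a coarse estimate, which is why the text suggests using several hundred.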
  130. 130. Min Hash For efficient retrieval, the min hashes are grouped into n-tuples. In this example, we can form the following 2-tuples:
        h1(row) = 1, 2, 1
        h2(row) = 2, 1, 2
        h3(row) = 1, 2, 1
        h4(row) = 3, 2, 3
      The retrieval procedure then estimates the full similarity for only those image pairs that have at least h identical tuples out of k tuples.
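One way to sketch this retrieval step is to bucket rows by their tuples and count, for each pair, how many tuples coincide. The code below assumes that the 2-tuples are formed from consecutive min-hashes, (h1, h2) and (h3, h4), which the slide text does not state explicitly; the function names and the `min_matches` parameter (playing the role of h) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Per-row min-hash values (h1, h2, h3, h4), read column-wise from the slide.
signatures = {
    "row1": [1, 2, 1, 3],
    "row2": [2, 1, 2, 2],
    "row3": [1, 2, 1, 3],
}


def make_tuples(sig, n):
    """Group consecutive min-hashes into n-tuples (assumed grouping)."""
    return [tuple(sig[i:i + n]) for i in range(0, len(sig) - n + 1, n)]


def candidate_pairs(signatures, n, min_matches):
    """Pairs sharing at least min_matches identical tuples at the same tuple index."""
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for index, tup in enumerate(make_tuples(sig, n)):
            buckets[(index, tup)].append(name)
    counts = defaultdict(int)
    for names in buckets.values():
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_matches}


print(candidate_pairs(signatures, n=2, min_matches=1))
# {('row1', 'row3'): 2}
```

Only the pairs returned here would have their full similarity evaluated, so most dissimilar pairs are never compared.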
  131. 131. Min Hash From 100k images...
  132. 132. Min Hash From 100k images...
  133. 133. Min Hash From 100k images... Representatives of the largest clusters
  134. 134. Min Hash Automatic localization of different buildings