Image Matching by Seed-and-Spawn
Shahan Lilja, David Nistér
Abstract
A novel algorithm is presented that searches for matching local regions between two images. The scheme takes a set of seed
matches as input and propagates them through a mechanism referred to as spawning. An important feature of the algorithm is that
correspondences established at coarse image scales continuously initiate correspondences at finer scales, and vice versa.
The system is evaluated with a recognition benchmark on a database with ground truth, where it is shown to dramatically increase
the discriminative power of a state-of-the-art method based on SIFT-like features.
Key words: image matching, spawning, object recognition
1. Introduction
Establishing correspondence between images is arguably the
basis for object recognition and much of computer vision. A
common approach is based on image-to-image matching of lo-
cal features, where each feature characterizes a small patch of
an image. There are methods to extract and represent local fea-
tures that are highly invariant to differences in factors such as
scale [7] and viewpoint [9, 8]. Moreover, the fact that the infor-
mation being matched is localized (and analyzed statistically)
provides an inherent tolerance against occlusion, clutter and
noise.
The success of this approach has been demonstrated in a wide
range of settings, but it does have limitations. It fails when dif-
ferences in viewpoint and other factors become too large. As
demonstrated by [5, 15, 12, 3], one way to solve more chal-
lenging instances is to grow the domain of correspondence.
This paper presents a novel algorithm that completely cov-
ers two images with matching local regions, as shown in Fig-
ure 1. Focus is not on accuracy of matches, but rather on find-
ing a large body of evidence for correspondence. This tailors
the method for object recognition, and content-based image re-
trieval in particular. One goal is to be able to extend existing
approaches, increasing their discriminative power and allowing
more challenging image pairs to be matched.
The basic idea is to use pairs of matching local regions as
templates to spawn new matches. Starting with a small set of
initial correspondences (seeds), the goal is to generate many
matches at multiple scales, completely covering the overlap be-
tween the images. If, as in Figure 1, many matching regions are
found, this supports the hypothesis that the images come from
the same scene.
The full pipeline is evaluated using a recognition benchmark
on a large dataset with ground truth. Initial seed matches are
provided by a state-of-the-art approach based on matching of
local features similar to the popular SIFT features [7] (the par-
ticular implementation was recently used in [14] and comes
from the work of [1]). The proposed algorithm is shown to sig-
nificantly increase the discriminative power of this approach.
Most impressive is the ability to match additional image pairs
when there are considerable differences in viewpoint, scale and
lighting.
Figure 1: Example of matching an (easy) image pair by establishing correspon-
dence between local regions at multiple scales. Each patch in the left image has
a corresponding patch in the right image.
Preprint submitted to Elsevier September 1, 2011
2. Related Work
Growing the domain of correspondence turns out to be highly
useful for image matching under challenging conditions. The
approach has been successfully applied to wide baseline match-
ing [16], registration of challenging image pairs [15, 11] and
object detection [5, 12].
The mechanism used to spawn new matches is inspired
by [5]. They present an object detection scheme that, by propa-
gating an initial set of seed matches, densely covers images with
pairs of corresponding patches. This is done in so-called phases
of expansion (generation of new matches) and contraction (re-
moval of mismatches). The approach is based on matching
model views of an object to images where it might be present.
An added bonus is that the object is exactly located in the scene,
simply by tracing out the outer boundary of the matched re-
gions. One problem with this method is that it is computation-
ally inefficient (the time to detect an object is reported to be on
the order of minutes on a 1.4 GHz computer).
This paper proposes a more efficient matching scheme that
still achieves comparable results. This is done primarily by em-
phasizing the order in which matches are spawned and letting
removal of mismatches be implicit in the algorithm. The time
to cover two images with local regions at multiple scales is 1-2
seconds on a 3.4 GHz computer with 2 GB memory.
Similar to [3] and unlike [12], there is no explicit regulariza-
tion of the shape of matched regions or the form of local map-
pings. That is, the measure of match quality does not directly
depend on these factors. The idea is to have a very simple cost
function and, instead, put emphasis on the matching scheme.
As can be seen in Figure 1, all patches in the left image are
squares, as opposed to some other shape. In fact, the left im-
age is covered by a regular grid of cells, only some of which
are shown (the ones that could be matched with highest confi-
dence). Extracting information to be matched from a uniform
grid turns out to be a successful approach to object recogni-
tion [4, 13, 2], even compared to some much more sophisti-
cated feature-based methods. It has been suggested that the
main reason for this is the added benefit that comes from the
dense coverage of the images [4].
In this work the aim is to find a large number of matches,
even in areas with relatively low texture, and efficiency is less
of an issue than recognition performance. This motivated the
use of a multi-scale regular grid (described in section 3.1).
3. The Seed-and-Spawn Algorithm
At the core of the object recognition pipeline, and the main
contribution of this work, is an algorithm that searches for
matching pairs of local regions between two images. The basic
idea is to use established matches to spawn new matches at dif-
ferent image locations or different image scales, or both. This
can be seen as growing the domain of correspondence in three
dimensions, two spatial dimensions and a third scale dimension.
The seed-and-spawn algorithm is outlined in Algorithm 1. The input is a reference image I0 and a target image I1, along with a set of initial matches (seeds) to be spawned.

Algorithm 1: The seed-and-spawn algorithm
input : reference image I0 and target image I1, set of seed matches (initial pairs of corresponding local regions)
output: set of matches
 1  refine all seed matches
 2  initialize heap with seed matches
 3  repeat
 4      pop match from heap
 5      spawn match
 6      foreach spawned match do
 7          if spawned match satisfies refinement criteria then
 8              refine spawned match
 9              if spawned match satisfies acceptance criteria then
10                  replace current match in solution with spawned match
11                  push spawned match on heap
12  until heap is empty

Setting aside for the moment how seeds are obtained in the first place, they are
refined and thrown on top of a heap, where they are prioritized
according to some measure of correspondence quality. What
follows is a search for new matches by repeatedly popping the
best match from the heap and allowing it to be the parent of a
new generation of matches. In the following sections the algo-
rithm is explained in more detail.
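For concreteness, a minimal Python sketch of this control flow might look as follows. The callables refine, spawn, ok_to_refine and ok_to_accept, and the solution object with its replace() method, are placeholders for the steps detailed in sections 3.1-3.3; they are assumptions for illustration, not names from the actual implementation.

```python
# A minimal sketch of Algorithm 1.  The heap acts as a max-heap over match
# quality (heapq with negated scores plus a counter as tie-breaker).
import heapq
import itertools

def seed_and_spawn(seeds, solution, refine, spawn, ok_to_refine, ok_to_accept):
    tie = itertools.count()
    heap = []
    for m in map(refine, seeds):                       # line 1: refine all seed matches
        heapq.heappush(heap, (-m.score, next(tie), m)) # line 2: prioritize by quality
    while heap:                                        # lines 3 and 12
        _, _, parent = heapq.heappop(heap)             # line 4: best match first
        for child in spawn(parent):                    # lines 5 and 6
            if ok_to_refine(child):                    # line 7: refinement criteria
                child = refine(child)                  # line 8
                if ok_to_accept(child):                # line 9: acceptance criteria
                    solution.replace(child)            # line 10: overwrite the cell's match
                    heapq.heappush(heap, (-child.score, next(tie), child))  # line 11
    return solution
```

Whether line 10 overwrites the cell's current match unconditionally or only when the spawned match scores higher is not spelled out in the text; the sketch follows the wording of Algorithm 1.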
3.1. The Correspondence Framework
Underlying the algorithm is a correspondence framework
that (a) defines what a match is, and (b) contains the best
matches found at any given time. When the algorithm termi-
nates this framework contains all the gathered correspondence
information.
One goal of the framework is to support matching at mul-
tiple image scales. This is done by constructing multi-scale
representations in the form of image pyramids for both input
images. On each level of the reference image pyramid a regular
grid of (possibly overlapping) square cells is defined. A match
specifies how a grid cell, i.e. a local region extracted from some
scaled version of I0, is mapped to a region in I1. Concretely, let r_ijl be the square region (cell) at row i and column j in the grid at level l of the reference image pyramid. A match, then, is defined as a pair

m_ijl = (r_ijl, A)

where A is an affine transformation that maps points in r_ijl to a region in the target image I1. The actual image content that is matched to r_ijl is in general sampled from different levels of the pyramid of I1, depending on the shape of the corresponding region (i.e. depending on the mapping A).
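A small sketch of these containers may help fix the notation. The cell size, the grid step and the 2 × 3 matrix representation of A are illustrative choices, not values taken from the paper.

```python
# Sketch of the correspondence framework: grid cells on each reference pyramid
# level, and a match pairing one cell with an affine map into I1.
from dataclasses import dataclass
import numpy as np

@dataclass
class Cell:
    row: int      # i
    col: int      # j
    level: int    # l, pyramid level of the reference image
    x: int        # top-left corner, in level-l pixel coordinates
    y: int
    size: int     # side length of the square region r_ijl

@dataclass
class Match:
    cell: Cell             # r_ijl, the fixed region in the reference pyramid
    A: np.ndarray          # 2 x 3 affine map from level-l coordinates into I1
    score: float = -1.0    # ncc similarity of the current mapping

def build_grid(level_shape, level, cell_size=16, step=8):
    """Regular grid of (possibly overlapping) square cells on one pyramid level."""
    h, w = level_shape
    return [Cell(i, j, level, x, y, cell_size)
            for i, y in enumerate(range(0, h - cell_size + 1, step))
            for j, x in enumerate(range(0, w - cell_size + 1, step))]
```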
3.2. Refinement of Matches
The purpose of refining a match is to raise some similarity
measure between two local regions that are suspected to be in
correspondence. The basic idea is to hold one region fixed
while the other is varied by trying different affine mappings.
In order to be able to compare the regions in a consistent way,
they are first warped into canonical square patches of fixed size.
Figure 2 shows the basic steps of the refinement process.
Figure 2: The basic steps of the refinement process. Two local regions postu-
lated to be in correspondence are (1) warped into canonical patches, and then
(2) compared using a similarity measure. These two steps are repeated until
similarity is maximized.
The similarity measure used to score a pair of canonical
patches is normalized cross-correlation (ncc). Let patch0 and
patch1 be two n × n canonical patches from I0 and I1, respec-
tively. The definition of similarity can be expressed in a com-
pact and intuitive way by introducing patch vectors p0 and p1.
These are simply the vectors obtained by flattening out the two-
dimensional patches (filling the first n entries of pi with the top
row of patchi, the next n entries with the second row of the
patch, and so on). The similarity measure can then be expressed
as
sim(patch0, patch1) = (p0 − m0)/‖p0 − m0‖ · (p1 − m1)/‖p1 − m1‖,   (1)
where mi is a vector with n × n elements all equal to the mean
value of patchi. Equation 1 is exactly the normalized cross-
correlation between the two patches, but without nested sum-
mation expressions. Incidentally, this illustrates an intuitive in-
terpretation of ncc as the cosine of the angle between two unit
vectors.
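Written out with NumPy, Equation 1 amounts to a dot product of two mean-centred, unit-normalized patch vectors (the epsilon guard below is an illustrative addition):

```python
# Normalized cross-correlation between two canonical patches, as in Equation (1).
import numpy as np

def ncc(patch0, patch1, eps=1e-9):
    p0 = patch0.astype(np.float64).ravel()
    p1 = patch1.astype(np.float64).ravel()
    d0 = p0 - p0.mean()          # p0 - m0
    d1 = p1 - p1.mean()          # p1 - m1
    # cosine of the angle between the two mean-centred patch vectors
    return float(np.dot(d0 / (np.linalg.norm(d0) + eps),
                        d1 / (np.linalg.norm(d1) + eps)))
```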
The problem of refinement, then, is finding the six param-
eters of the affine mapping that maximizes the similarity be-
tween two canonical patches (or, equivalently, minimizes the
difference between them). The numerical optimization is done
using the Levenberg-Marquardt (LM) algorithm. An alternative would have been the Gauss-Newton method, but that requires a better initial estimate of the affine parameters; that is, the patches must be more similar to begin with. The advantage of using the LM algorithm is a more robust refinement procedure, where one can afford to start farther from the optimum. The price is a somewhat slower rate of convergence than with the Gauss-Newton algorithm.
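One possible way to set up this optimization is sketched below, using SciPy's least_squares with method='lm': minimizing the residual between the two mean-centred, unit-normalized patch vectors is equivalent to maximizing the ncc of Equation 1. The affine parameterization, the bilinear sampling and the assumption that patch0 is the reference cell already resampled to n × n are illustrative; this is not the authors' implementation.

```python
# Sketch of LM refinement of the six affine parameters against a fixed
# reference patch.
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import least_squares

def sample_canonical(I1, params, n):
    """Sample an n x n canonical patch from the grayscale image I1 (2D array)
    through the affine map (a11, a12, a21, a22, tx, ty)."""
    a11, a12, a21, a22, tx, ty = params
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="xy")  # canonical x, y
    x = a11 * u + a12 * v + tx            # target-image x coordinates
    y = a21 * u + a22 * v + ty            # target-image y coordinates
    return map_coordinates(I1, [y, x], order=1, mode="nearest")    # bilinear sampling

def refine_affine(patch0, I1, params0, n):
    """Return refined affine parameters; patch0 is the reference cell
    already resampled to n x n."""
    p0 = patch0.astype(np.float64).ravel()
    d0 = p0 - p0.mean()
    d0 /= np.linalg.norm(d0) + 1e-9

    def residual(params):
        p1 = sample_canonical(I1, params, n).ravel()
        d1 = p1 - p1.mean()
        d1 /= np.linalg.norm(d1) + 1e-9
        return d0 - d1                    # small residual <=> high ncc

    return least_squares(residual, params0, method="lm").x   # Levenberg-Marquardt
```

The Jacobian is obtained here by finite differences; whether the original implementation uses analytic derivatives is not stated in the text.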
3.3. Spawning of Matches
The mechanism for growing the correspondence domain be-
tween images is based on the concept of spawning matches.
Consider a match m = (r, A), where r is a fixed square patch
(grid cell) at some level of the reference image pyramid, and
A is an affine transformation of this region. The match m can
spawn a new match

m′ = (r′, A′)

where r′ is a neighboring region of r, and A′ is initialized to A. The new match (or, more precisely, its mapping A′) may or may not be refined, but that is not part of the spawning mecha-
nism. It remains to define what it means for two regions to be
neighbors. Figures 3(a) and 3(b) show two possible definitions,
based on the concepts of 4-connectivity and 8-connectivity. In
both cases, the highlighted region in the middle is the grid cell
from I0 associated with the match that is spawning, and no re-
gions from I1 are shown. The left-image regions of the spawns
are shaded.
(a) 4-connectivity (b) 8-connectivity
Figure 3: Two possible neighborhood concepts for regions in a regular grid.
The number of new matches generated during spawning depends on which def-
inition is used.
The approach taken here is more general in the sense that
matches can also be spawned along a third scale dimension.
This idea is illustrated in Figure 4, where every match can give
rise to at most nine spawns. The strength of this approach is
two-fold. First, matching becomes more robust because image
information is incorporated from multiple scales. Second, cor-
respondences that normally would not be detected at a certain
scale can now be established. This is because the refinement
process is more likely to succeed if the affine mapping is ini-
tialized close to the final optimum. Spawning in three dimen-
sions increases the chance of finding a good initialization from
somewhere.
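A sketch of one plausible spawning neighborhood is given below, reusing the Cell container from the section 3.1 sketch. Exactly which nine neighbors Figure 4 defines cannot be recovered from the text; the choice here, purely an assumption, is the four 4-connected cells at the same level, the one overlapping cell a level coarser and the four overlapping cells a level finer (4 + 1 + 4 = 9), with a dyadic pyramid. The affine map is assumed to be expressed in level-l pixel coordinates of I0, so only its linear part is rescaled when changing levels.

```python
def spawn_candidates(cell, A, grids, scale=2.0):
    """Return (cell, A) pairs to be wrapped into new matches; grids maps
    (level, row, col) to the grid cell at that position, if it exists."""
    out = []
    def emit(level, i, j, A_new):
        c = grids.get((level, i, j))
        if c is not None:
            out.append((c, A_new))
    # same level: the four 4-connected neighbors keep the parent's map unchanged
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        emit(cell.level, cell.row + di, cell.col + dj, A.copy())
    # one level coarser: one pixel there spans `scale` pixels at level l
    A_up = A.copy()
    A_up[:, :2] *= scale
    emit(cell.level + 1, cell.row // 2, cell.col // 2, A_up)
    # one level finer: the four cells covered by this one
    A_dn = A.copy()
    A_dn[:, :2] /= scale
    for di in (0, 1):
        for dj in (0, 1):
            emit(cell.level - 1, 2 * cell.row + di, 2 * cell.col + dj, A_dn.copy())
    return out
```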
The rationale for initializing spawns with the same affine
transformation as their parent is intuitive. If r and r′ are neigh-
boring regions in I0, corresponding to the same smooth surface
of some object, they should be mapped to I1 by similar transfor-
mations. Of course, two neighboring image regions may come
from different faces of an object. Or they might not belong to
the same object at all due to, for example, occlusion of other
objects. There are, however, several reasons why the spawning
scheme is still reasonable:
• Neighboring regions have similar mappings more often than not. This may be true even for images with extensive clutter, as when there are many small objects in the scene.

Figure 4: Illustration of the match neighborhood concept used for spawning. The grid cell of the spawning parent match is highlighted and the cells of the resulting spawns are shaded. Each newly generated match is initialized with the affine mapping of its parent.
• Refinement tends to raise the similarity of correct matches
more than that of incorrect matches, as pointed out by [5].
This is a simple but very powerful mechanism. In com-
bination with matches competing and continuously over-
writing one another, it makes it difficult for mismatches to
prevail.
• Avoidance of incorrect correspondences is implicit in the
matching scheme. The order in which matches are consid-
ered matters. The seed-and-spawn algorithm is designed
to take the safest decisions first, and to postpone unsafe
decisions until later. As a base of valid matches is accu-
mulated they will reinforce one another. For example, if
one is replaced by a mismatch it will have many neighbors
that can spawn back to it.
• Incorrect matches are explicitly detected and discarded.
Two heuristics are used, defined in the refinement criteria
and acceptance criteria in the seed-and-spawn algorithm
(lines 7 and 9 in Algorithm 1). First, if a spawn has a
correlation score below some threshold it is never refined.
This is also beneficial for efficiency reasons. Second, re-
fined spawns are not accepted if the mapped region in the
target image is a thin sliver, i.e. extreme local transforma-
tions are not allowed.
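A sketch of these two heuristics is given below. The pre-refinement ncc threshold and the formulation of the thin-sliver test (the ratio of the singular values of the 2 × 2 linear part of the affine map) are illustrative assumptions; the paper does not give the exact values or tests. For clarity the checks are written at the patch and map level; wrapping them to take a match object, as in the Algorithm 1 sketch, is straightforward.

```python
# Sketch of the refinement criteria (line 7) and acceptance criteria (line 9).
import numpy as np

def ok_to_refine(patch0, patch1, tau_prerefine=0.5):
    # do not spend refinement effort on spawns that already correlate poorly
    return ncc(patch0, patch1) >= tau_prerefine     # ncc() as in the section 3.2 sketch

def ok_to_accept(A, max_anisotropy=4.0):
    # reject extreme local transformations: a cell mapped to a thin sliver has a
    # large ratio between the singular values of the 2 x 2 linear part of A
    s = np.linalg.svd(np.asarray(A, dtype=float)[:, :2], compute_uv=False)
    return s[0] <= max_anisotropy * max(s[1], 1e-9)
```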
Other mechanisms for avoiding mismatches were experi-
mented with, including adding regularization terms to the cost
function and discarding refined spawns that differ too much
from their parents. These were ultimately abandoned because
they did not bring any significant improvement, and it was dif-
ficult to set their parameters for general input.
3.4. Final Remarks
It is interesting to note that the seed-and-spawn algorithm
can be seen as a generalization of optical flow, or specifically
an approach known as hierarchical block matching (HBM). The
basic idea of HBM is to first match at the highest (coarsest)
pyramid level, and then fine-tune this estimate at lower (finer)
levels. Factors that might disturb the matching process at high
resolutions, such as noise and clutter, can be avoided by al-
ways obtaining a sufficiently accurate estimate from correspon-
dences at lower resolutions. The seed-and-spawn scheme gen-
eralizes this idea by also transmitting correspondence infor-
mation up through the pyramid. Moreover, instead of simply
sweeping through the pyramid from top to bottom, the most re-
liable matches are selected for spawning at each iteration of the
algorithm.
Using a heap or priority queue to guide the search for new
matches is a common approach [12, 3, 6]. This can be seen as
an example of applying the so-called least-commitment strategy [16], which states that only the most reliable decisions
should be taken first, postponing risky decisions until they
hopefully become safer. By spawning the best matches first
there is a higher probability that new matches are initialized
correctly, and mismatches are less likely to get a foothold.
4. Extracting Seeds
A seed is simply an initial match, specifying two correspond-
ing local regions, one from each image. The only thing that
makes it different from other matches is that it is found before
them. In principle, seeds can be extracted from any initial cor-
respondence information that is available. For example, one
could manually specify corresponding control points, or do a
brute force search for the mapping at some selected locations.
The important thing is that the initial match data can be cast into
the correspondence framework presented earlier in section 3.1.
Of course, sometimes this might mean that information is lost.
In the current recognition pipeline, seed matches are obtained
from an existing matching scheme which is part of a non-public
computer vision code base. This is a three-step process that
involves:
1. Extraction of interest points independently in each image
2. Characterization of the local region around each interest
point with a descriptor, which is a vector of numbers
3. Matching of descriptor vectors
In short, interest points are detected (step 1) using a Lapla-
cian interest point operator, and these points are then described
(step 2) with a compact SIFT-like descriptor. This descriptor is
a vector that has 36 dimensions instead of the 128 dimensions
that appear in Lowe’s paper [7].
The matching of the descriptors (step 3) is based on a vari-
ation of the nearest neighbor approach with the ratio test used
by Lowe. The basic idea is to compute the ratio between the distances to the closest and second-closest descriptors, and consider a match to have been found if this ratio is sufficiently small. Since even a
single seed match can potentially lead to covering the images
with matching regions, the primary goal of the seeding phase
is to obtain matches at any cost. Therefore a relatively high
threshold of 0.7 is used in the ratio test. An unreliable match is
better than no match at all, because any valid match is a spark
that can cause an explosion of matches. For methods based on
matching of descriptors only, and that do not attempt to grow
the correspondence domain, too few matches indicate that the images do not overlap or that this cannot be reliably estab-
lished. Here, the seed-and-spawn algorithm takes off where the
traditional approach ends.
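A minimal sketch of this seeding step (step 3 only) is given below; the brute-force distance computation is illustrative and any nearest-neighbor index would serve.

```python
# Sketch of the ratio test: keep a nearest-neighbor match only if it is at most
# `ratio` times as far as the second-nearest neighbor.
import numpy as np

def ratio_test_matches(desc0, desc1, ratio=0.7):
    """desc0: (N0, 36) descriptors from I0, desc1: (N1, 36) descriptors from I1
    (N1 >= 2 assumed).  Returns index pairs (i, j) of matched interest points."""
    matches = []
    for i, d in enumerate(desc0):
        dists = np.linalg.norm(desc1 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]          # nearest and second-nearest
        if dists[j1] <= ratio * dists[j2]:
            matches.append((i, j1))             # seed: interest point i <-> j1
    return matches
```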
5. Experimental Results
One important goal of this work is to be able to extend ap-
proaches based on matching of compact descriptors, such as the
SIFT descriptor of Lowe [7] and derivatives thereof. Clearly
there is a value in having the option to obtain rich and abun-
dant correspondence information. For example, the accuracy of
matches can be improved, or objects can be located more pre-
cisely in the scene. In this work focus is on obtaining higher
match confidence, especially when correspondence cannot be
established without further analysis. This leads to greater dis-
criminative power and, as a result, improved recognition. To
this end, two kinds of experiments were performed.
Recognition performance is evaluated with an image retrieval
benchmark, described in section 5.2. Performance is measured
in terms of a single scalar value and compared with a method
based on matching of SIFT-like descriptors.
Discriminative power is assessed more directly by using the
system to classify image pairs as matches or non-matches. This
experiment is described in section 5.3.
5.1. Dataset
All experiments use a dataset that comes from the work
of [10] and has been widely used for evaluating object recog-
nition systems. The set consists of 10 200 images that are par-
titioned into 2 550 quads, that is, groups of four. All members
of a quad come from the same scene, recorded under varying
viewpoints, scales, lighting conditions and so on.
5.2. Retrieval Benchmark
Recognition performance is evaluated in a simple but strict
way: given one out of a large number of images, find all images
that come from the same scene, without returning any that come
from other scenes. This idea is illustrated in Figure 5.
In each retrieval experiment a subset of N images is selected
from the full dataset. All chosen images come in quads, so N is
a multiple of 4 and there are N/4 quads in the selected subset.
Each image takes the role of a query exactly once, at which
point it is matched pairwise against all other images (including
itself), to find the ones closest to it in some sense. A simple
measure is used for the strength of correspondence between two
images: the total number of matches on all scales that have
a similarity score above some threshold τsim. Ideally, a given
query should have strongest correspondence established with
the other three members of the quad it belongs to. It should
Figure 5: The idea behind the retrieval benchmark. An image (top) is matched
pairwise to every image in the dataset (middle), and finally the top four matches
are returned (bottom). The figure shows the best possible result, i.e. exactly the
images that come from the same scene as the query are retrieved.
rank all other images lower, because these do not come from
the same scene.
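In code, the correspondence-strength measure is a one-liner over the matches returned by the seed-and-spawn algorithm (the threshold value below is a placeholder):

```python
def correspondence_strength(matches, tau_sim=0.95):
    # number of matches, over all scales, whose ncc score exceeds tau_sim
    return sum(1 for m in matches if m.score > tau_sim)
```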
5.2.1. Performance Measure
The measure of system performance is a single scalar value
that is computed as follows. For each query image, the top n
images most similar to the query are determined, where n is
some integer between 1 and N. These are the images that the
system deems most likely to come from the same scene as the
query image. In general some of these will come from the same
quad as the query (success), and some will not (failure). The
performance measure, then, can be stated as:
The average number of images among the top n re-
trieved that come from the same quad as the query
image, where the average is taken over all queries.
This measure of performance is from here on referred to as
the average-number-in-top (ANT) measure. The ANT value
will be a non-negative number between 0 (none of the top n
matches are ever from the same quad as the query image) and 4
(all members of the quad are always among the top n, and the
system never makes a mistake in this sense). For the experi-
ments, a value of n = 4 is chosen so that the highest ANT value
(4) can only be obtained if the system ranks exactly all images
from the same scene highest. This is a very strict measure of
the discriminative power of a system.
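A sketch of the ANT computation is given below; sim[q, d] denotes the correspondence strength between query q and database image d (the query itself included, as in the benchmark), and quad_id assigns each image to its group of four. These names are illustrative.

```python
# Sketch of the average-number-in-top (ANT) measure.
import numpy as np

def ant(sim, quad_id, n=4):
    sim = np.asarray(sim, dtype=float)
    quad_id = np.asarray(quad_id)
    hits = []
    for q in range(sim.shape[0]):
        top = np.argsort(-sim[q])[:n]                    # the n most similar images
        hits.append(np.sum(quad_id[top] == quad_id[q]))  # how many are from the query's quad
    return float(np.mean(hits))                          # between 0 and 4 for n = 4
```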
5.2.2. Results
The results of the retrieval benchmark will be presented in
a single table. Experiments were performed on two subsets of
the full dataset, to evaluate performance on a small number of
N = 40 challenging instances (set 1), and a larger number of
N = 256 images (set 2). Figure 5 shows some of the images
from set 1.
For both sets, the ANT value was computed for the seed-
and-spawn algorithm and for the method based on matching
36-dimensional SIFT-like descriptors. The latter method hap-
pens to be used for extracting seeds used as input to the former
method. The results are presented in Table 1.
method            set 1    set 2
SIFT-based        2.175    3.406
seed-and-spawn    2.925    3.781
Table 1: Retrieval performance for the seed-and-spawn system and for a method
based on matching of SIFT-like features, as measured by the ANT values for
two datasets (described in the text).
Both methods perform worse on the smaller and more chal-
lenging dataset (set 1). All other things being equal, it should be
much easier to get a high score on a small set, because there are
fewer images to match falsely against. It turns out, however, that the
difficulty of the images in set 1 is a stronger factor. The large
difference in performance between the left column and the right
column in Table 1 is a testament to how challenging set 1 is.
For both sets, the improvement of the seed-and-spawn sys-
tem on the SIFT based approach is significant. This clearly
demonstrates the increased discriminative power obtained by
growing the domain of correspondence. The improvement is
largest for set 1, where the seed-and-spawn algorithm managed
to match many image pairs that the other method could not (for
instance, the duck in Figure 6). This was often done by locking
on to just a single seed match, which spawns waves of matches.
5.3. Binary Classification
10 200 image pairs were produced by matching each image
in the full dataset to a single other image. This was done with
an existing retrieval pipeline unrelated to the system being eval-
uated. The idea is to run the seed-and-spawn algorithm on all
10 200 pairs, classifying each pair as positive (match) or nega-
tive (no match). The motivation for this experiment is to inves-
tigate whether the algorithm can be used for post-verification, say,
for selected instances in a more efficient real-time matching
scheme.
If more than τcount correspondences can be found with an ncc
score above some threshold τsim, the system outputs ’match’;
otherwise it outputs ’no match’. The ability to take a correct
decision (i.e. classification performance) depends directly on
the value of τcount. For example, if τcount = 0 is chosen every
pair will be classified as positive, because at least 0 matches
can always be found. Conversely, if a sufficiently large value of
τcount is chosen all instances will be classified as negative.
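The decision rule can be sketched as follows (both thresholds are placeholders):

```python
def classify_pair(matches, tau_count, tau_sim=0.95):
    count = sum(1 for m in matches if m.score > tau_sim)
    return count > tau_count          # True = 'match', False = 'no match'
```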
5.3.1. Results
A nice way to test classification performance, and to learn a
good value for τcount, is to run the experiment for a range of val-
ues of τcount, from 0 to 2 000 in this case. For each value of τcount
we get a number of true positives (pairs correctly classified as
matching, or hits) and a number of false positives (pairs incor-
rectly classified as matching, or false alarms). After normalizing for the different numbers of positive and negative examples,
these results are presented in the form of an ROC curve. Three
curves were generated using three different values of τsim, and
one curve was generated with the SIFT-based method used for
seed extraction. The results are presented in Figure 7.
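A sketch of this sweep is given below; counts[k] is the number of correspondences above τsim for image pair k and labels[k] is the ground truth, both assumed inputs.

```python
# Sketch of the ROC sweep over tau_count.
import numpy as np

def roc_points(counts, labels, taus=range(0, 2001)):
    counts = np.asarray(counts)
    labels = np.asarray(labels, dtype=bool)
    pts = []
    for tau in taus:
        pred = counts > tau
        tpr = np.count_nonzero(pred & labels) / max(np.count_nonzero(labels), 1)    # hit rate
        fpr = np.count_nonzero(pred & ~labels) / max(np.count_nonzero(~labels), 1)  # false-alarm rate
        pts.append((fpr, tpr))
    return pts
```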
[Figure 7 plot: true positive rate (0.5 to 1) on the vertical axis versus false positive rate (0 to 0.2) on the horizontal axis; curves for ncc thresholds 0.90, 0.95 and 0.98, and for SIFT-like features (seeds only).]
Figure 7: ROC curves for the seed-and-spawn algorithm (the three top curves)
and for the seed extraction method (based on matching of SIFT-like features),
when used as binary classifiers. Each curve is generated by varying a threshold
τcount that discriminates between ’match’ and ’no match’. Specifically, τcount is
the minimum number of correspondences required to consider an image pair as
matching, and takes on values from 0 to 2 000.
The seed-and-spawn system increases the classification per-
formance of the method that provides the seeds, as can be seen
from the higher ROC curves. The pipeline used for seed extrac-
tion is based on establishing correspondence between compact
representations of local image regions. This is a popular ap-
proach to object recognition, and image matching in general,
and there is clearly great value in being able to augment such
methods.
Analyzing the match data more closely indicates that seed ex-
traction is a bottleneck to overall performance. When no seeds
can be found the seed-and-spawn system fails by default in the
current implementation, and this happens for a substantial num-
ber of the 10 200 image pairs. Future work should aim at ex-
tracting seeds from more than one source, since a true test of the
power of the seed-and-spawn scheme would require that there
is at least one seed available for each matching image pair.
An interesting observation is that classification performance in general is better when requiring higher match similarity, i.e. a higher ncc threshold between patches.

Figure 6: An example of an image pair that could not be reliably matched in the recognition benchmark using only the SIFT-based approach. The seed-and-spawn algorithm increases the discriminative power of this approach by finding a very large number of (not necessarily accurate) matches.

Referring to Figure 7,
the highest ncc threshold of 0.98 gives much better discrimina-
tion at low false alarm rates, and only slightly lower hit rate at
high false alarm rates. Quality of matches seems to be a more
important factor than quantity of matches. One possible expla-
nation for this is the relatively weak mechanisms for removing
mismatches. That is to say, there are probably a fair number of invalid correspondences with a relatively high match (ncc) score.
A more sophisticated mechanism for removal of mismatches
is a promising avenue for increasing recognition performance.
There is, of course, a trade-off between match quality and com-
putational efficiency. This work has aimed to strike a balance
between the two.
6. Summary and Conclusion
A new algorithm has been presented that efficiently searches
for dense image-to-image correspondence. The scheme was
shown to considerably increase the discriminative power of an
existing state-of-the-art implementation. The improvement is
most significant for a small set of challenging images with large
differences in lighting condition, viewpoint and scale.
The system was also evaluated by classifying 10 200 image
pairs as matching or non-matching. The improvement over the
method used for seed extraction is considerable. In order to
use the system for post-verification, however, higher hit rates
and lower false alarm rates are necessary. The results suggest
that this could be achieved by using multiple sources of seeds.
This should increase the probability of finding at least one valid
correspondence, which is a bottleneck in the current pipeline.
If nothing else, the take-home message of this work is the
demonstrated value in growing the domain of correspondence.
The strength of the proposed seed-and-spawn algorithm is that
rich and dense correspondence information can be found rela-
tively fast. This is hopefully a step towards challenging object
recognition in real-time.
References
[1] M. Brown, R. Szeliski, and S. Winder. Multi-image matching using multi-
scale oriented patches. In CVPR, volume 1, pages 510–517, 2005.
[2] P. Carbonetto, N. D. Freitas, and K. Barnard. A statistical model for
general contextual object recognition. In ECCV, pages 350–362, 2004.
[3] J. Čech, J. Matas, and M. Perďoch. Efficient sequential correspondence
selection by cosegmentation. In CVPR, 2008.
[4] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning
natural scene categories. In CVPR, volume 2, pages 524–531, 2005.
[5] V. Ferrari, T. Tuytelaars, and L. Van Gool. Simultaneous object recognition
and segmentation from single or multiple model views. International
Journal of Computer Vision, 67(2):159–188, 2006.
[6] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-
view stereo for community photo collections. In ICCV, 2007.
[7] D. G. Lowe. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91–110, 2004.
[8] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo
from maximally stable extremal regions. In Proceedings of the British
Machine Vision Conference, volume 1, pages 384–393, September 2002.
[9] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector.
In ECCV, pages 128–142, 2002.
[10] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree.
In CVPR, pages 2161–2168, 2006.
[11] C. V. Stewart, C.-L. Tsai, and B. Roysam. The dual-bootstrap itera-
tive closest point algorithm with application to retinal image registration.
IEEE Transactions on Medical Imaging, 22(11):1379–1394, 2003.
[12] A. Vedaldi and S. Soatto. Local features, all grown up. In CVPR, pages
1753–1760, 2006.
[13] J. Vogel and B. Schiele. On performance characterization and optimiza-
tion for image retrieval. In ECCV, pages 49–66, 2002.
[14] S. Winder and M. Brown. Learning local image descriptors. In CVPR,
2007.
[15] G. Yang, C. V. Stewart, M. Sofka, and C.-L. Tsai. Registration of chal-
lenging image pairs: Initialization, estimation, and decision. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 29(11):1973–1989,
2007.
[16] Z. Zhang and Y. Shan. A progressive scheme for stereo matching. In
SMILE, pages 68–85, 2000.

sns

  • 1.
    Image Matching bySeed-and-Spawn Shahan Lilja, David Nist´er Abstract A novel algorithm is presented that searches for matching local regions between two images. The scheme takes a set of seed matches as input and propagates them through a mechanism referred to as spawning. An important feature of the algorithm is that correspondences established at coarse image scales continuously initiate correspondences at finer scales, and vice versa. The system is evaluated with a recognition benchmark on a database with ground truth, where it is shown to dramatically increase the discriminative power of a state-of-the-art method based on SIFT-like features. Key words: image matching, spawning, object recognition 1. Introduction Establishing correspondence between images is arguably the basis for object recognition and much of computer vision. A common approach is based on image-to-image matching of lo- cal features, where each feature characterizes a small patch of an image. There are methods to extract and represent local fea- tures that are highly invariant to differences in factors such as scale [7] and viewpoint [9, 8]. Moreover, the fact that the infor- mation being matched is localized (and analyzed statistically) provides an inherent tolerance against occlusion, clutter and noise. The success of this approach has been demonstrated in a wide range of settings, but it does have limitations. It fails when dif- ferences in viewpoint and other factors become too large. As demonstrated by [5, 15, 12, 3], one way to solve more chal- lenging instances is to grow the domain of correspondence. This paper presents a novel algorithm that completely cov- ers two images with matching local regions, as shown in Fig- ure 1. Focus is not on accuracy of matches, but rather on find- ing a large body of evidence for correspondence. This tailors the method for object recognition, and content-based image re- trieval in particular. One goal is to be able to extend existing approaches, increasing their discriminative power and allowing more challenging image pairs to be matched. The basic idea is to use pairs of matching local regions as templates to spawn new matches. Starting with a small set of initial correspondences (seeds), the goal is to generate many matches at multiple scales, completely covering the overlap be- tween the images. If, as in Figure 1, many matching regions are found, this supports the hypothesis that the images come from the same scene. The full pipeline is evaluated using a recognition benchmark on a large dataset with ground truth. Initial seed matches are provided by a state-of-the-art approach based on matching of local features similar to the popular SIFT features [7] (the par- ticular implementation was recently used in [14] and comes from the work of [1]). The proposed algorithm is shown to sig- nificantly increase the discriminative power of this approach. Most impressive is the ability to match additional image pairs when there are considerable differences in viewpoint, scale and lighting. Figure 1: Example of matching an (easy) image pair by establishing correspon- dence between local regions at multiple scales. Each patch in the left image has a corresponding patch in the right image. Preprint submitted to Elsevier September 1, 2011
  • 2.
    2. Related Work Growingthe domain of correspondence turns out to be highly useful for image matching under challenging conditions. The approach has been successfully applied to wide baseline match- ing [16], registration of challenging image pairs [15, 11] and object detection [5, 12]. The mechanism used to spawn new matches is inspired by [5]. They present an object detection scheme that, by propa- gating an initial set of seed matches, densely covers images with pairs of corresponding patches. This is done in so called phases of expansion (generation of new matches) and contraction (re- moval of mismatches). The approach is based on matching model views of an object to images where it might be present. An added bonus is that the object is exactly located in the scene, simply by tracing out the outer boundary of the matched re- gions. One problem with this method is that it is computation- ally inefficient (the time to detect an object is reported to be on the order of minutes on a 1.4 GHz computer). This paper proposes a more efficient matching scheme that still achieves comparable results. This is done primarily by em- phasizing the order in which matches are spawned and letting removal of mismatches be implicit in the algorithm. The time to cover two images with local regions at multiple scales is 1-2 seconds on a 3.4 GHz computer with 2 GB memory. Similar to [3] and unlike [12], there is no explicit regulariza- tion of the shape of matched regions or the form of local map- pings. That is, the measure of match quality does not directly depend on these factors. The idea is to have a very simple cost function and, instead, put emphasis on the matching scheme. As can be seen in Figure 1, all patches in the left image are squares, as opposed to some other shape. In fact, the left im- age is covered by a regular grid of cells, only some of which are shown (the ones that could be matched with highest confi- dence). Extracting information to be matched from a uniform grid turns out to be a successful approach to object recogni- tion [4, 13, 2], even compared to some much more sophisti- cated feature-based methods. It has been suggested that the main reason for this is the added benefit that comes from the dense coverage of the images [4]. In this work the aim is to find a large number of matches, even in areas with relatively low texture, and efficiency is less of an issue than recognition performance. This motivated the use of a multi-scale regular grid (described in section 3.1). 3. The Seed-and-Spawn Algorithm At the core of the object recognition pipeline, and the main contribution of this work, is an algorithm that searches for matching pairs of local regions between two images. The basic idea is to use established matches to spawn new matches at dif- ferent image locations or different image scales, or both. This can be seen as growing the domain of correspondence in three dimensions, two spatial dimensions and a third scale dimension. The seed-and-spawn algorithm is outlined in Algorithm 1. The input is a reference image I0 and a target image I1, along with initial matches (seeds) to be spawned. 
Setting a side for Algorithm 1: The seed-and-spawn algorithm input : reference image I0 and target image I1, set of seed matches (initial pairs of corresponding local regions) output: set of matches refine all seed matches1 initialize heap with seed matches2 repeat3 pop match from heap4 spawn match5 foreach spawned match do6 if spawned match satisfies refinement criteria then7 refine spawned match8 if spawned match satisfies acceptance criteria9 then replace current match in solution with10 spawned match push spawned match on heap11 until heap is empty12 the moment how seeds are obtained in the first place, they are refined and thrown on top of a heap, where they are prioritized according to some measure of correspondence quality. What follows is a search for new matches by repeatedly popping the best match from the heap and allowing it to be the parent of a new generation of matches. In the following sections the algo- rithm is explained in more detail. 3.1. The Correspondence Framework Underlying the algorithm is a correspondence framework that (a) defines what a match is, and (b) contains the best matches found at any given time. When the algorithm termi- nates this framework contains all the gathered correspondence information. One goal of the framework is to support matching at mul- tiple image scales. This is done by constructing multi-scale representations in the form of image pyramids for both input images. On each level of the reference image pyramid a regular grid of (possibly overlapping) square cells is defined. A match specifies how a grid cell, i.e. a local region extracted from some scaled version of I0, is mapped to a region in I1. Concretely, let rijl be the square region (cell) at row i and column j in the grid at level l of the reference image pyramid. A match, then, is defined as a pair mijl = (rijl, A) where A is an affine transformation that maps points in rijl to a region in the target image I1. The actual image content that is matched to rijl is in general sampled from different levels of the pyramid of I1, depending on the shape of the corresponding region (i.e. depending on the mapping A). 3.2. Refinement of Matches The purpose of refining a match is to raise some similarity measure between two local regions that are suspected to be in 2
  • 3.
    correspondence. The basicidea is to hold one region fixed while the other is varied by trying different affine mappings. In order to be able to compare the regions in a consistent way, they are first warped into canonical square patches of fixed size. Figure 2 shows the basic steps of the refinement process. warp canonical patches sim patch0 patch1 Figure 2: The basic steps of the refinement process. Two local regions postu- lated to be in correspondence are (1) warped into canonical patches, and then (2) compared using a similarity measure. These two steps are repeated until similarity is maximized. The similarity measure used to score a pair of canonical patches is normalized cross-correlation (ncc). Let patch0 and patch1 be two n × n canonical patches from I0 and I1, respec- tively. The definition of similarity can be expressed in a com- pact and intuitive way by introducing patch vectors p0 and p1. These are simply the vectors obtained by flattening out the two- dimensional patches (filling the first n entries of pi with the top row of patchi, the next n entries with the second row of the patch, and so on). The similarity measure can then be expressed as sim(patch0, patch1) = (p0 − m0) (p0 − m0) · (p1 − m1) (p1 − m1) , (1) where mi is a vector with n × n elements all equal to the mean value of patchi. Equation 1 is exactly the normalized cross- correlation between the two patches, but without nested sum- mation expressions. Incidentally, this illustrates an intuitive in- terpretation of ncc as the cosine of the angle between two unit vectors. The problem of refinement, then, is finding the six param- eters of the affine mapping that maximizes the similarity be- tween two canonical patches (or, equivalently, minimizes the difference between them). The numerical optimization is done using the Levenberg-Marquardt (LM) algorithm. An alterna- tive would have been to use Gauss-Newtons method, but that requires better initialization of the initial affine parameters; that is, the patches must be more similar to begin with. The ad- vantage of using the LM algorithm is a more robust refinement procedure, where one can afford to start farther from the opti- mum. The price is a somewhat slower rate of convergence than with Gauss-Newtons algorithm. 3.3. Spawning of Matches The mechanism for growing the correspondence domain be- tween images is based on the concept of spawning matches. Consider a match m = (r, A), where r is a fixed square patch (grid cell) at some level of the reference image pyramid, and A is an affine transformation of this region. The match m can spawn a new match m = (r , A ) where r is a neighboring region of r, and A is initialized to A. The new match (or, more precisely, its mapping A ) may or may not be refined, but that is not part of the spawning mecha- nism. It remains to define what it means for two regions to be neighbors. Figure 3(a) and 3(b) shows two possible definitions, based on the concepts of 4-connectivity and 8-connectivity. In both cases, the highlighted region in the middle is the grid cell from I0 associated with the match that is spawning, and no re- gions from I1 are shown. The left-image regions of the spawns are shaded. (a) 4-connectivity (b) 8-connectivity Figure 3: Two possible neighborhood concepts for regions in a regular grid. The number of new matches generated during spawning depends on which def- inition is used. The approach taken here is more general in the sense that matches can also be spawned along a third scale dimension. 
This idea is illustrated in Figure 4, where every match can give rise to at most nine spawns. The strength of this approach is two-fold. First, matching becomes more robust because image information is incorporated from multiple scales. Second, cor- respondences that normally would not be detected at a certain scale can now be established. This is because the refinement process is more likely to succeed if the affine mapping is ini- tialized close to the final optimum. Spawning in three dimen- sions increases the chance of finding a good initialization from somewhere. The rationale for initializing spawns with the same affine transformation as their parent is intuitive. If r and r are neigh- boring regions in I0, corresponding to the same smooth surface of some object, they should be mapped to I1 by similar transfor- mations. Of course, two neighboring image regions may come from different faces of an object. Or they might not belong to the same object at all due to, for example, occlusion of other objects. There are, however, several reasons why the spawning scheme is still reasonable: • Neighboring regions have similar mappings more often than not. This may be true even for images with exten- 3
  • 4.
    Figure 4: Illustrationof the match neighborhood concept used for spawning. The grid cell of the spawning parent match is highlighted and the cells of the resulting spawns are shaded. Each newly generated match is initialized with the affine mapping of their parent. sive clutter, as when there are many small objects in the scene. • Refinement tends to raise the similarity of correct matches more than that of incorrect matches, as pointed out by [5]. This is a simple but very powerful mechanism. In com- bination with matches competing and continuously over- writing one another, it makes it difficult for mismatches to prevail. • Avoidance of incorrect correspondences is implicit in the matching scheme. The order in which matches are consid- ered matters. The seed-and-spawn algorithm is designed to take the safest decisions first, and to postpone unsafe decisions until later. As a base of valid matches is accu- mulated they will reinforce one another. For example, if one is replaced by a mismatch it will have many neighbors that can spawn back to it. • Incorrect matches are explicitly detected and discarded. Two heuristics are used, defined in the refinement criteria and acceptance criteria in the seed-and-spawn algorithm (lines 7 and 9 in Algorithm 1). First, if a spawn has a correlation score below some threshold it is never refined. This is also beneficial for efficiency reasons. Second, re- fined spawns are not accepted if the mapped region in the target image is a thin sliver, i.e. extreme local transforma- tions are not allowed. Other mechanisms for avoiding mismatches were experi- mented with, including adding regularization terms to the cost function and discarding refined spawns that differ too much from their parents. These were ultimately abandoned because they did not bring any significant improvement, and it was dif- ficult to set their parameters for general input. 3.4. Final Remarks It is interesting to note that the seed-and-spawn algorithm can be seen as a generalization of optical flow, or specifically an approach known as hierarchical block matching (HBM). The basic idea of HBM is to first match at the highest (coarsest) pyramid level, and then fine-tune this estimate at lower (finer) levels. Factors that might disturb the matching process at high resolutions, such as noise and clutter, can be avoided by al- ways obtaining a sufficiently accurate estimate from correspon- dences at lower resolutions. The seed-and-spawn scheme gen- eralizes this idea by also transmitting correspondence infor- mation up through the pyramid. Moreover, instead of simply sweeping through the pyramid from top to bottom, the most re- liable matches are selected for spawning at each iteration of the algorithm. Using a heap or priority queue to guide the search for new matches is a common approach [12, 3, 6]. This can be seen as an example of applying the so called least commitment strat- egy [16], which states that only the most reliable decisions should be taken first, postponing risky decisions until they hopefully become safer. By spawning the best matches first there is a higher probability that new matches are initialized correctly, and mismatches are less likely to get a foothold. 4. Extracting Seeds A seed is simply an initial match, specifying two correspond- ing local regions, one from each image. The only thing that makes it different from other matches is that it is found before them. In principle, seeds can be extracted from any initial cor- respondence information that is available. 
For example, one could manually specify corresponding control points, or do a brute force search for the mapping at some selected locations. The important thing is that the initial match data can be cast into the correspondence framework presented earlier in section 3.1. Of course, sometimes this might mean that information is lost. In the current recognition pipeline, seed matches are obtained from an existing matching scheme which is part of a non-public computer vision code base. This is a three-step process that involves: 1. Extraction of interest points independently in each image 2. Characterization of the local region around each interest point with a descriptor, which is a vector of numbers 3. Matching of descriptor vectors In short, interest points are detected (step 1) using a Lapla- cian interest point operator, and these points are then described (step 2) with a compact SIFT-like descriptor. This descriptor is a vector that has 36 dimensions instead of the 128 dimensions that appear in Lowe’s paper [7]. The matching of the descriptors (step 3) is based on a vari- ation of the nearest neighbor approach with the ratio test used by Lowe. The basic idea is to compute the ratio between the closest and second closest descriptors, and consider a match to have been found if this ratio is sufficiently small. Since even a single seed match can potentially lead to covering the images 4
  • 5.
    with matching regions,the primary goal of the seeding phase is to obtain matches at any cost. Therefore a relatively high threshold of 0.7 is used in the ratio test. An unreliable match is better than no match at all, because any valid match is a spark that can cause an explosion of matches. For methods based on matching of descriptors only, and that do not attempt to grow the correspondence domain, too few matches indicates that the images do not overlap or that this can not be reliably estab- lished. Here, the seed-and-spawn algorithm takes off where the traditional approach ends. 5. Experimental Results One important goal of this work is to be able to extend ap- proaches based on matching of compact descriptors, such as the SIFT descriptor of Lowe [7] and derivatives thereof. Clearly there is a value in having the option to obtain rich and abun- dant correspondence information. For example, the accuracy of matches can be improved, or objects can be located more pre- cisely in the scene. In this work focus is on obtaining higher match confidence, especially when correspondence cannot be established without further analysis. This leads to greater dis- criminative power and, as a result, improved recognition. To this end, two kinds of experiments were performed. Recognition performance is evaluated with an image retrieval benchmark, described in section 5.2. Performance is measured in terms of a single scalar value and compared with a method based on matching of SIFT-like descriptors. Discriminative power is assessed more directly by using the system to classify image pairs as matches or non-matches. This experiment is described in section 5.3. 5.1. Dataset All experiments use a dataset that comes from the work of [10] and has been widely used for evaluating object recog- nition systems. The set consists of 10 200 images that are par- titioned into 2 550 quads, that is, groups of four. All members of a quad come from the same scene, recorded under varying viewpoints, scales, lighting conditions and so on. 5.2. Retrieval Benchmark Recognition performance is evaluated in a simple but strict way: given one out of a large number of images, find all images that come from the same scene, without returning any that come from other scenes. This idea is illustrated in Figure 5. In each retrieval experiment a subset of N images is selected from the full dataset. All chosen images come in quads, so N is a multiple of 4 and there are N/4 quads in the selected subset. Each image takes the role as a query exactly once, at which point it is matched pairwise against all other images (including itself), to find the ones closest to it in some sense. A simple measure is used for the strength of correspondence between two images: the total number of matches on all scales that have a similarity score above some threshold τsim. Ideally, a given query should have strongest correspondence established with the other three members of the quad it belongs to. It should ... Figure 5: The idea behind the retrieval benchmark. An image (top) is matched pairwise to every image in the dataset (middle), and finally the top four matches are returned (bottom). The figure shows the best possible result, i.e. exactly the images that come from the same scene as the query are retrieved. rank all other images lower, because these do not come from the same scene. 5.2.1. Performance Measure The measure of system performance is a single scalar value that is computed as follows. 
For each query image, the top n images most similar to the query are determined, where n is some integer between 1 and N. These are the images that the system deems most likely to come from the same scene as the query image. In general some of these will come from the same quad as the query (success), and some will not (failure). The performance measure, then, can be stated as: The average number of images among the top n re- trieved that come from the same quad as the query image, where the average is taken over all queries. This measure of performance is from here on referred to as the average-number-in-top (ANT) measure. The ANT value will be a non-negative number between 0 (none of the top n matches are ever from the same quad as the query image) and 4 (all members of the quad are always among the top n, and the 5
  • 6.
    system never makesa mistake in this sense). For the experi- ments, a value of n = 4 is chosen so that the highest ANT value (4) can only be obtained if the system ranks exactly all images from the same scene highest. This is a very strict measure of the discriminative power of a system. 5.2.2. Results The results of the retrieval benchmark will be presented in a single table. Experiments were performed on two subsets of the full dataset, to evaluate performance on a small number of N = 40 challenging instances (set 1), and a larger number of N = 256 images (set 2). Figure 5 shows some of the images from set 1. For both sets, the ANT value was computed for the seed- and-spawn algorithm and for the method based on matching 36-dimensional SIFT-like descriptors. The latter method hap- pens to be used for extracting seeds used as input to the former method. The results are presented in Table 1. method set 1 set 2 SIFT-based 2.175 3.406 seed-and-spawn 2.925 3.781 Table 1: Retrieval performance for the seed-and-spawn system and for a method based on matching of SIFT-like features, as measured by the ANT values for two datasets (described in the text). Both methods perform worse on the smaller and more chal- lenging dataset (set 1). All other things being equal, it should be much easier to get a high score on a small set, because there are less images to match falsely to. It turns out, however, that the difficulty of the images in set 1 is a stronger factor. The large difference in performance between the left column and the right column in Table 1 is a testament to how challenging set 1 is. For both sets, the improvement of the seed-and-spawn sys- tem on the SIFT based approach is significant. This clearly demonstrates the increased discriminative power obtained by growing the domain of correspondence. The improvement is largest for set 1, where the seed-and-spawn algorithm managed to match many image pairs that the other method could not (for instance, the duck in Figure 6). This was often done by locking on to just a single seed match, which spawns waves of matches. 5.3. Binary Classification 10 200 image pairs were produced by matching each image in the full dataset to a single other image. This was done with an existing retrieval pipeline unrelated to the system being eval- uated. The idea is to run the seed-and-spawn algorithm on all 10 200 pairs, classifying each pair as positive (match) or nega- tive (no match). The motivation for this experiment is to inves- tigate if the algorithm can be used for post verification, say, for selected instances in a more efficient real-time matching scheme. If more than τcount correspondences can be found with an ncc score above some threshold τsim, the system outputs ’match’; otherwise it outputs ’no match’. The ability to take a correct decision (i.e. classification performance) depends directly on the value of τcount. For example, if τcount = 0 is chosen every pair will be classified as positive, because at least 0 matches can always be found. Conversely, if a sufficiently large value of τcount is chosen all instances will be classified as negative. 5.3.1. Results A nice way to test classification performance, and to learn a good value for τcount, is to run the experiment for a range of val- ues of τcount, from 0 to 2 000 in this case. For each value of τcount we get a number of true positives (pairs correctly classified as matching, or hits) and a number of false positives (pairs incor- rectly classified as matching, or false alarms). 
5.3.1. Results

A nice way to test classification performance, and to learn a good value for τcount, is to run the experiment for a range of values of τcount, from 0 to 2 000 in this case. For each value of τcount we get a number of true positives (pairs correctly classified as matching, or hits) and a number of false positives (pairs incorrectly classified as matching, or false alarms). After normalizing for the different numbers of positive and negative examples, these results are presented in the form of an ROC curve. Three curves were generated using three different values of τsim, and one curve was generated with the SIFT-based method used for seed extraction. The results are presented in Figure 7.

Figure 7: ROC curves (true positive rate versus false positive rate) for the seed-and-spawn algorithm (the three top curves, for ncc thresholds 0.90, 0.95 and 0.98) and for the seed extraction method (based on matching of SIFT-like features), when used as binary classifiers. Each curve is generated by varying a threshold τcount that discriminates between 'match' and 'no match'. Specifically, τcount is the minimum number of correspondences required to consider an image pair as matching, and takes on values from 0 to 2 000.
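The curves can be traced out with a sweep of the kind sketched below (again a hypothetical Python illustration, not the code used for the experiments). Here counts is assumed to hold, for each image pair, the number of correspondences with an ncc score above τsim, and is_match the corresponding ground-truth labels.

def roc_curve(counts, is_match, max_tau=2000):
    # One (false positive rate, true positive rate) point per value of tau_count.
    positives = sum(1 for m in is_match if m)
    negatives = len(is_match) - positives
    curve = []
    for tau_count in range(max_tau + 1):
        tp = sum(1 for c, m in zip(counts, is_match) if c >= tau_count and m)
        fp = sum(1 for c, m in zip(counts, is_match) if c >= tau_count and not m)
        curve.append((fp / negatives, tp / positives))  # normalize by class sizes
    return curve

Plotting the resulting points yields one ROC curve per value of τsim.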
The seed-and-spawn system increases the classification performance of the method that provides the seeds, as can be seen from the higher ROC curves. The pipeline used for seed extraction is based on establishing correspondence between compact representations of local image regions. This is a popular approach to object recognition, and to image matching in general, and there is clearly great value in being able to augment such methods.

Analyzing the match data more closely indicates that seed extraction is a bottleneck to overall performance. When no seeds can be found, the seed-and-spawn system fails by default in the current implementation, and this happens for a substantial number of the 10 200 image pairs. Future work should aim at extracting seeds from more than one source, since a true test of the power of the seed-and-spawn scheme requires that at least one seed is available for each matching image pair.

An interesting observation is that classification performance is in general better when requiring higher match similarity, i.e. a higher ncc threshold between patches. Referring to Figure 7, the highest ncc threshold of 0.98 gives much better discrimination at low false alarm rates, and only a slightly lower hit rate at high false alarm rates. Quality of matches seems to be a more important factor than quantity of matches. One possible explanation for this is the relatively weak mechanisms for removing mismatches; that is to say, there is probably a fair number of invalid correspondences with a relatively high match (ncc) score. A more sophisticated mechanism for removal of mismatches is a promising avenue for increasing recognition performance. There is, of course, a trade-off between match quality and computational efficiency, and this work has aimed to strike a balance between the two.

Figure 6: An example of an image pair that could not be reliably matched in the recognition benchmark using only the SIFT-based approach. The seed-and-spawn algorithm increases the discriminative power of this approach by finding a very large number of (not necessarily accurate) matches.

6. Summary and Conclusion

A new algorithm has been presented that efficiently searches for dense image-to-image correspondence. The scheme was shown to considerably increase the discriminative power of an existing state-of-the-art implementation. The improvement is most significant for a small set of challenging images with large differences in lighting conditions, viewpoint and scale.

The system was also evaluated by classifying 10 200 image pairs as matching or non-matching. The improvement over the method used for seed extraction is considerable. In order to use the system for post-verification, however, higher hit rates and lower false alarm rates are necessary. The results suggest that this could be achieved by using multiple sources of seeds, which should increase the probability of finding at least one valid correspondence, a bottleneck in the current pipeline.

If nothing else, the take-home message of this work is the demonstrated value of growing the domain of correspondence. The strength of the proposed seed-and-spawn algorithm is that rich and dense correspondence information can be found relatively fast. This is hopefully a step towards challenging object recognition in real time.

References

[1] M. Brown, R. Szeliski, and S. Winder. Multi-image matching using multi-scale oriented patches. In CVPR, volume 1, pages 510–517, 2005.
[2] P. Carbonetto, N. D. Freitas, and K. Barnard. A statistical model for general contextual object recognition. In ECCV, pages 350–362, 2004.
[3] J. Čech, J. Matas, and M. Perďoch. Efficient sequential correspondence selection by cosegmentation. In CVPR, 2008.
[4] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, volume 2, pages 524–531, 2005.
[5] V. Ferrari, T. Tuytelaars, and L. Gool. Simultaneous object recognition and segmentation from single or multiple model views. International Journal of Computer Vision, 67(2):159–188, 2006.
[6] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-view stereo for community photo collections. In ICCV, 2007.
[7] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[8] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, volume 1, pages 384–393, September 2002.
[9] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In ECCV, pages 128–142, 2002.
[10] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.
[11] C. V. Stewart, C.-L. Tsai, and B. Roysam. The dual-bootstrap iterative closest point algorithm with application to retinal image registration. IEEE Transactions on Medical Imaging, 22(11):1379–1394, 2003.
[12] A. Vedaldi and S. Soatto. Local features, all grown up. In CVPR, pages 1753–1760, 2006.
[13] J. Vogel and B. Schiele. On performance characterization and optimization for image retrieval. In ECCV, pages 49–66, 2002.
[14] S. Winder and M. Brown. Learning local image descriptors. In CVPR, 2007.
[15] G. Yang, C. V. Stewart, M. Sofka, and C.-L. Tsai. Registration of challenging image pairs: Initialization, estimation, and decision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1973–1989, 2007.
[16] Z. Zhang and Y. Shan. A progressive scheme for stereo matching. In SMILE, pages 68–85, 2000.