But you need to mail a letter! Which post office do you go to?
Finding free images of post offices is hard, so... we'll just reduce it to this: q
Naive implementation
Calculate the distance to all points, keep the smallest:

min = INFINITY
P = <points to be searched>
K = <dimensionality of points, e.g. 2>
q = <query point>
best = nil
for p in P do
  dimDistSum = 0
  for k in K do
    dimDistSum += (q[k] - p[k])**2
  dist = dimDistSum.sqrt
  if dist < min
    min = dist
    best = p
return best
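The same naive scan as a runnable Python sketch (function and variable names are my own):

```python
import math

def naive_nearest(points, q):
    """Linear scan: compute the distance from q to every point,
    keeping track of the smallest seen so far."""
    best, best_dist = None, math.inf
    for p in points:
        # Euclidean distance: sum the squared per-dimension differences, then sqrt
        dist = math.sqrt(sum((qk - pk) ** 2 for qk, pk in zip(q, p)))
        if dist < best_dist:
            best_dist, best = dist, p
    return best
```

This costs O(n * k) per query, which is exactly what the k-d tree tries to avoid.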
With a little preprocessing...
But that takes time! Can we do better? You bet!

k-d tree
Binary tree (each node has at most two children)
Each node represents a single point in the set to be searched
Each node looks like...
Domain: the vector describing the point (i.e. [p[0], p[1], …, p[k-1]])
Range: some identifying characteristic (e.g. a PK in a database)
Split: a chosen dimension, 0 ≤ split < k
Left: the left child (left.domain[split] < self.domain[split])
Right: the right child (right.domain[split] ≥ self.domain[split])
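That node structure might be sketched in Python like so (field names follow the slide; the `Any` payload type and the defaults are my own choices):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class KDNode:
    domain: tuple                      # the point itself: (p[0], ..., p[k-1])
    range: Any = None                  # identifying payload, e.g. a database PK
    split: int = 0                     # dimension to compare on, 0 <= split < k
    left: Optional["KDNode"] = None    # left.domain[split] <  self.domain[split]
    right: Optional["KDNode"] = None   # right.domain[split] >= self.domain[split]
```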
How do we search it?
Descend from the root, at each node following the child on the query point's side of the split, until you hit a leaf.
Then backtrack, all the way back to the root, checking at each node whether the other subtree could hold a closer point.
And you have your nearest neighbor, with O(log n) running time in the good case! I'm the answer!
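A self-contained sketch of that descend-then-backtrack search, with a minimal median-split build so it actually runs (all names here are mine):

```python
import math

class Node:
    def __init__(self, domain, split, left=None, right=None):
        self.domain, self.split = domain, split
        self.left, self.right = left, right

def build(points, depth=0):
    # median split on a cycling axis keeps the tree balanced
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, q, best=None, best_dist=math.inf):
    if node is None:
        return best, best_dist
    d = math.dist(q, node.domain)
    if d < best_dist:
        best, best_dist = node.domain, d
    axis = node.split
    # descend into the child on the query's side of the split first...
    near, far = ((node.left, node.right) if q[axis] < node.domain[axis]
                 else (node.right, node.left))
    best, best_dist = nearest(near, q, best, best_dist)
    # ...then backtrack: the far side only needs searching if the splitting
    # plane is closer than the best distance found so far
    if abs(q[axis] - node.domain[axis]) < best_dist:
        best, best_dist = nearest(far, q, best, best_dist)
    return best, best_dist
```

Call `build(points)` once, then `nearest(tree, q)` per query; the pruning test on the splitting plane is where the backtracking savings come from.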
But that was a pretty good case... We barely had to backtrack at all – the best case is O(log n). The worst case (lots of backtracking – examining almost every node) can get up to O(n). The amount of backtracking is directly proportional to k! If k is small (say 2, as in this example) and n is large, we see a huge improvement over linear search. As k becomes large, the benefits of this over a naive implementation virtually disappear!
The Curse of Dimensionality
Curse you, dimensionality! High-dimensional vector spaces are darned hard to search!
Why? Too many dimensions! Why are there so many dimensions!?!
What can we do about it? Get rid of the extra weight!
Enter Messrs. Johnson and Lindenstrauss
It turns out...
Your vectors have a high dimension k
Absolute distance and precise location versus relative distance between points
Relative distance can be largely preserved by a lower-dimensional space
Reduce k dimensions to k_proj dimensions, k_proj << k
It turns out...
Relative distance can be largely (but not completely) preserved by a lower-dimensional space
Every projection will have errors
How do you choose the one with the fewest? Trick question: let fate decide!
Multiple random projections
Choose the projections randomly
Multiple projections exchange cost in resources for cost in accuracy:
More projections = greater resource cost = greater accuracy
Fewer projections = lesser resource cost = lesser accuracy
Trivially parallelizable
Learn to be happy with "good enough"
Multiple random projections
Get the nearest neighbor from each projection, then run a naive nearest-neighbor search on the results thereof:

nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <q projected onto the same plane as p>
  nns << <nearest neighbor to pq within projection p>
nn = <naive nearest neighbor to q among nns>
return nn

Et voilà!
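Fleshed out as a runnable Python sketch: Gaussian random matrices supply the projections (a common JL-style choice), and for brevity the per-projection search here is a linear scan rather than a k-d tree. All names are my own:

```python
import math
import random

def naive_nearest(points, q):
    # linear scan, used inside each projection and for the final merge
    return min(points, key=lambda p: math.dist(p, q))

def random_projection(k, k_proj, rng):
    # k x k_proj matrix of Gaussian entries: multiplying by it maps
    # k-dimensional vectors down to k_proj dimensions
    return [[rng.gauss(0, 1) for _ in range(k_proj)] for _ in range(k)]

def project(v, matrix):
    k_proj = len(matrix[0])
    return tuple(sum(v[i] * matrix[i][j] for i in range(len(v)))
                 for j in range(k_proj))

def multi_projection_nearest(points, q, k_proj=2, n_proj=4, seed=0):
    rng = random.Random(seed)
    candidates = set()
    for _ in range(n_proj):
        m = random_projection(len(q), k_proj, rng)
        projected = {project(p, m): p for p in points}
        pq = project(q, m)
        # the nearest point in each projected space nominates a candidate
        candidates.add(projected[naive_nearest(list(projected), pq)])
    # final pass: naive nearest over the small candidate set, in full dimension
    return naive_nearest(list(candidates), q)
```

Note the answer is approximate: each projection can nominate the wrong point, which is why more projections buy more accuracy.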
Multiple random projections
Experiments yield > 98% accuracy when multiple nearest neighbors are selected from each projection and d is reduced from 256 to 15, with approximately 30% of the calculation (see credits). Additional experiments yielded similar results, as did my own. That's pretty darn-tootin' good!
Stuff to watch out for
Balancing is vitally important (assuming a uniform distribution of points): careful attention must be paid to the selection of nodes (choose the node with the median coordinate for the split axis)
Cycle through the axes at each level of the tree – the root should split on 0, level 1 on 1, level 2 on 2, etc.
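Both rules (median selection, cycling split axes) fit in a few lines; a sketch using dict nodes purely for illustration:

```python
def build_kdtree(points, depth=0):
    """Median-split construction: sort on the cycling axis and take the
    middle point, which keeps the tree balanced for distinct points."""
    if not points:
        return None
    axis = depth % len(points[0])  # root splits on 0, level 1 on 1, ...
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"domain": points[mid],
            "split": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}
```

Seven points, for example, yield a tree of depth 3 rather than the depth-7 chain an unlucky insertion order could produce.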
Stuff to watch out for
Building the trees still takes some time
Building the projections is effectively matrix multiplication, with time in roughly O(n^2.807) (Strassen's algorithm)
Building the (balanced) trees from the projections takes time in approximately O(n log n) per tree
Solution: build the trees ahead of time and store them for later querying (i.e. index those bad boys!)
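The "build ahead of time" step can be as simple as serializing the finished trees; a sketch using pickle (any serialization that round-trips the structure works):

```python
import pickle

def save_index(trees, path):
    # build once, offline; persist the finished trees to disk
    with open(path, "wb") as f:
        pickle.dump(trees, f)

def load_index(path):
    # queries later just load the prebuilt index instead of rebuilding
    with open(path, "rb") as f:
        return pickle.load(f)
```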
Thanks! Credits: Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo K-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN Research Fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E