A talk I presented at Nashville Hack Day 2012 on using multiple random low-dimensional projections of high-dimensional data to speed up approximate nearest-neighbor search
Size: 1.9 MB
Language: en
Added: Aug 04, 2012
Slides: 55 pages
Slide Content
The Post Office Problem
k-d trees, k-nn search, and the Johnson-Lindenstrauss lemma
Who am I?
Jeremy Holland
Senior lead developer at Centresource
Math and algorithms nerd
@awebneck, github.com/awebneck, freenode: awebneck (you get the idea)
If you like the talk...
I like scotch. Just putting it out there.
What is the Post Office Problem?
Don Knuth, professional CS badass.
TAOCP, vol. 3
Otherwise known as “Nearest Neighbor search”
Let's say you've...
Just moved to Denmark!
But you need to mail a letter!
Which post office do you go to?
Finding free images of post offices is hard, so...
We'll just reduce it to this:
q
Naive implementation
# Naive nearest neighbor: measure the distance from the query point q
# to every point in P and keep the closest one.
def naive_nearest(points, q)
  min  = Float::INFINITY
  best = nil
  points.each do |p|
    dim_dist_sum = 0
    q.each_index { |k| dim_dist_sum += (q[k] - p[k])**2 }
    dist = Math.sqrt(dim_dist_sum)
    if dist < min
      min  = dist
      best = p
    end
  end
  best
end
Calculate distance to all points, find smallest
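For example (the points here are made up for illustration, not from the slides):

  # Three post offices and a query point; returns the closest office.
  offices = [[20, 10], [10, 5], [25, 3]]
  naive_nearest(offices, [18, 8])   # => [20, 10]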
But that takes O(kn) time! - can we do better?
You bet! With a little preprocessing...
k-d tree
Binary tree (each node has at most two children)
Each node represents a single point in the set to be searched
Each node looks like...
Domain: the vector describing the point (i.e. [p[0], p[1], … p[k-1]])
Range: Some identifying characteristic (e.g. PK in a database)
Split: A chosen dimension from 0 ≤ split < k
Left: The left child (left.domain[split] < self.domain[split])
Right: The right child (right.domain[split] ≥ self.domain[split])
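Roughly, in Ruby (my own sketch of the node described above, not code from the slides; the class and field names are made up):

  # A minimal k-d tree node mirroring the fields above: domain (the point's
  # coordinate vector), range (an identifier such as a database PK), split
  # (the dimension this node partitions on), and left/right children.
  class Node
    attr_accessor :domain, :range, :split, :left, :right

    def initialize(domain, range, split)
      @domain = domain   # e.g. [20, 10]
      @range  = range    # e.g. a primary key
      @split  = split    # dimension index, 0 <= split < k
      @left   = nil      # left.domain[split]  <  domain[split]
      @right  = nil      # right.domain[split] >= domain[split]
    end
  end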
Let's build a k-d tree!
Point 1: [20,10]
Let's build a k-d tree!
Let's split on the x axis
Let's build a k-d tree!
Add a new point: [10,5]
Let's build a k-d tree!
The new point is the Left Child of the first point
Let's build a k-d tree!
Let's split him on the y axis
Let's build a k-d tree!
And add a 3rd point: [25,3]
Let's build a k-d tree!
The new point is the Right Child of the first point
Let's build a k-d tree!
So on and so forth...
Let's build a k-d tree!
Giving you a tree:
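For reference, a sketch of that insertion process in Ruby (my own illustration, reusing the Node class sketched earlier and cycling the split axis with depth):

  # Insert a point by walking down: go left if its coordinate on the current
  # node's split axis is smaller, right otherwise, and attach a new node at
  # the first empty child, splitting on the next axis (depth mod k).
  def insert(root, domain, range, depth = 0)
    k = domain.size
    return Node.new(domain, range, depth % k) if root.nil?
    if domain[root.split] < root.domain[root.split]
      root.left  = insert(root.left,  domain, range, depth + 1)
    else
      root.right = insert(root.right, domain, range, depth + 1)
    end
    root
  end

  # The example from the slides: [20,10] becomes the root (split on x),
  # [10,5] its left child (split on y), [25,3] its right child.
  tree = nil
  [[20, 10], [10, 5], [25, 3]].each_with_index do |pt, id|
    tree = insert(tree, pt, id)
  end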
How do we search it?
Step 1: Find the best bin (where the query point would otherwise be inserted)
q
root
How do we search it?
NOTE: There is no node for this bin – just the space a node would occupy if one existed!
q
root
How do we search it?
Step 2: Make the current leaf node the current “best guess”
Best guess
How do we search it?
… and set the “best guess radius” to be the distance between the query and that point
Best guess radius
How do we search it?
Step 3: Back up the tree 1 node
Current node
How do we search it?
If the distance between the query and the new node is less than the best guess radius...
How do we search it?
Then set the best guess radius to the new distance, and make the current node the best guess
How do we search it?
Step 4: If the hypersphere described by the best guess radius crosses the current split...
Oh nooooh!
How do we search it?
And the current node has a child on the other side...
Oh snap!
How do we search it?
… then make that node the current node, and repeat:
How do we search it?
Here, the distance is not less than the best guess radius...
How do we search it?
… and the hypersphere neither crosses the split ...
Whew, missed it!
How do we search it?
… nor does the current node have any children ...
Whew, missed it!
How do we search it?
So we can eliminate it and back up the tree again!
How do we search it?
We've already compared this node, so let's keep going back up the tree
How do we search it?
Again, the distance is not less than the best guess radius, and there is no crossing – back up again!
How do we search it?
...and again...
How do we search it?
All the way back to the root!
How do we search it?
And you have your nearest neighbor, with a good-case running time of O(log n)!
I'm the answer!
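Putting the whole walkthrough together, here is a rough Ruby sketch of the descend-then-backtrack search (my own illustration, not code from the slides; it reuses the Node class sketched earlier):

  # Recursive nearest-neighbor search: descend to the bin the query would
  # fall in, then unwind, checking each node on the way back up and crossing
  # a split only when the best-guess hypersphere could reach the other side.
  def nearest(node, q, best = nil, best_dist = Float::INFINITY)
    return [best, best_dist] if node.nil?

    # Steps 1-2: descend toward the query's bin first.
    near, far = if q[node.split] < node.domain[node.split]
                  [node.left, node.right]
                else
                  [node.right, node.left]
                end
    best, best_dist = nearest(near, q, best, best_dist)

    # Step 3: on the way back up, see if this node beats the best guess.
    d = Math.sqrt(node.domain.zip(q).sum { |a, b| (a - b)**2 })
    if d < best_dist
      best      = node
      best_dist = d
    end

    # Step 4: only cross the split if the best-guess radius reaches it.
    if (q[node.split] - node.domain[node.split]).abs < best_dist
      best, best_dist = nearest(far, q, best, best_dist)
    end

    [best, best_dist]
  end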
But that was a pretty good case...
We barely had to backtrack at all – best case is O(log n)
Worst case (lots of backtracking – examining almost every node) can get up to O(n)
Amount of backtracking is directly proportional to k!
If k is small (say 2, as in this example) and n is large, we see a huge improvement over linear search
As k becomes large, the benefits of this over a naive implementation virtually disappear!
The Curse of Dimensionality
Curse you, dimensionality!
High-dimensional vector spaces are darned hard to search!
Why? Too many dimensions! Why are there so many dimensions!?!
What can we do about it?
Get rid of the extra weight!
Enter Messrs. Johnson and Lindenstrauss
It turns out...
Your vectors have a high dimension
Absolute distance and precise location versus relative distance between points
Relative distance can be largely preserved by a lower dimensional space
Reduce k dimensions to k_proj dimensions, where k_proj << k
It turns out...
Relative distance can be largely but not completely preserved by a lower dimensional space
Every projection will have errors
How do you choose one with the fewest?
Trick question: Let fate decide!
Multiple random projection
Choose the projections randomly
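As a rough sketch (mine, not from the slides), a random projection can be as simple as multiplying by a matrix of Gaussian noise; the function names and the scaling choice here are my own assumptions:

  # Build a k_proj x k matrix of N(0,1) entries and project a k-dimensional
  # point down to k_proj dimensions. Scaling by 1/sqrt(k_proj) keeps
  # distances roughly on the same scale, in the Johnson-Lindenstrauss spirit.
  def random_projection_matrix(k, k_proj)
    Array.new(k_proj) do
      Array.new(k) do
        # Box-Muller transform for a standard normal sample
        Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
      end
    end
  end

  def project(matrix, point)
    scale = 1.0 / Math.sqrt(matrix.size)
    matrix.map { |row| scale * row.zip(point).sum { |m, x| m * x } }
  end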
Multiple projections
Exchange cost in resources for cost in accuracy
More projections = greater resource cost = greater accuracy
Fewer projections = lesser resource cost = lesser accuracy
Trivially parallelizable
Learn to be happy with “good enough”
Multiple random projections
nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <project q into the subspace of p>
  nns << <nearest neighbor to pq within that projection>
end
return <naive nearest neighbor to q among nns>
Get the nearest from each projection, then run a naive nearest on the results thereof.
Et voilà!
Multiple random projection
Experiments yield > 98% accuracy when multiple nearest neighbors are selected from each projection and d is reduced from 256 to 15, with approximately 30% of the calculation. (see credits)
Additional experiments yielded similar results, as did my own
That's pretty darn-tootin' good
Stuff to watch out for
Balancing is vitally important (assuming uniform distribution of points): careful attention must be paid to selection of nodes (choose the node with the median coordinate on the split axis)
Cycle through axes for each level of the tree – root should split on 0, level 1 on 1, level 2 on 2, etc. (see the sketch below)
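A sketch of a balanced build along those lines (my own illustration, reusing the Node class from earlier; I represent input points as { domain:, range: } hashes, which is an assumption). It sorts at each level for simplicity, which costs O(n log² n); a linear-time median selection would tighten that.

  # Balanced construction: at each depth, split on axis (depth mod k),
  # put the point with the median coordinate on that axis at the node,
  # and build the children from the two halves.
  def build_balanced(points, depth = 0)
    return nil if points.empty?
    axis   = depth % points.first[:domain].size
    sorted = points.sort_by { |p| p[:domain][axis] }
    mid    = sorted.size / 2
    node   = Node.new(sorted[mid][:domain], sorted[mid][:range], axis)
    node.left  = build_balanced(sorted[0...mid], depth + 1)
    node.right = build_balanced(sorted[mid + 1..-1], depth + 1)
    node
  end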
Stuff to watch out for
Building the trees still takes some time
Building the projections is effectively matrix multiplication, time in O(n^2.807) (Strassen's algorithm)
Building the (balanced) trees from the projections takes time in approximately O(n log n)
Solution: build the trees ahead of time and store them for later querying (i.e. index those bad boys!)
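For instance, a minimal sketch using Ruby's built-in Marshal (the file name is made up; any serialization or a proper index store would do):

  # Build once, serialize to disk, and reload at query time.
  File.binwrite('kdtree.idx', Marshal.dump(tree))
  tree = Marshal.load(File.binread('kdtree.idx'))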
Thanks!
Credits:
Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
k-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p
Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN
Research Fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E