The Post Office Problem

awebneck · Aug 04, 2012

About This Presentation

A talk I presented at Nashville Hack Day 2012 concerning using multiple random low-dimensional projections of high-dimensional data to optimize an approximate nearest neighbor search


Slide Content

The Post Office Problem
k-d trees, k-nn search, and the Johnson-Lindenstrauss lemma

Who am I?
Jeremy Holland
Senior lead developer at Centresource
Math and algorithms nerd
@awebneck, github.com/awebneck, freenode: awebneck (you get the idea)

If you like the talk...
I like scotch. Just putting it out there.

What is the Post Office Problem?
Don Knuth, professional CS badass.
TAOCP, vol. 3
Otherwise known as ”Nearest Neighbor search”
Let's say you've...

Just moved to Denmark!

But you need to mail a letter!
Which post office do you go to?

Finding free images of post
offices is hard, so...
We'll just reduce it to this: a set of points in the plane and a query point q.

Naive implementation
# Linear scan: measure the Euclidean distance from the query q
# to every point, keeping the closest seen so far.
def naive_nearest(points, q)
  best = nil
  min  = Float::INFINITY
  points.each do |p|
    dist = Math.sqrt(q.each_index.sum { |k| (q[k] - p[k])**2 })
    if dist < min
      min  = dist
      best = p
    end
  end
  best
end
Calculate distance to all points, find smallest
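For instance, with the three points from the tree-building example below (a quick hypothetical check, not from the slides):

points = [[20, 10], [10, 5], [25, 3]]
naive_nearest(points, [11, 4])  # => [10, 5]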

With a little preprocessing...
But that takes O(nk) time per query – can we do better?
You bet!
k-d tree
Binary tree (each node has at most two
children)
Each node represents a single point in the set
to be searched

Each node looks like...
Domain: the vector describing the point (i.e. [p[0], p[1], ..., p[k-1]])
Range: some identifying characteristic (e.g. the PK in a database)
Split: a chosen dimension, 0 ≤ split < k
Left: the left child (left.domain[split] < self.domain[split])
Right: the right child (right.domain[split] ≥ self.domain[split])
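A minimal sketch of this node shape in Ruby – the field names follow the slide, but the Struct itself is my illustration, not the talk's code:

# domain: point vector; range: external id; split: axis index;
# left/right: child nodes (or nil)
Node = Struct.new(:domain, :range, :split, :left, :right)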

Let's build a k-d tree!
Point 1: [20,10]

Let's build a k-d tree!
Let's split on the x axis

Let's build a k-d tree!
Add a new point: [10,5]

Let's build a k-d tree!
The new point is the Left Child of the first point

Let's build a k-d tree!
Let's split him on the y axis

Let's build a k-d tree!
And add a 3rd point: [25,3]

Let's build a k-d tree!
The new point is the Right Child of the first point

Let's build a k-d tree!
So on and so forth...

Let's build a k-d tree!
Giving you a tree:
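A sketch of the insertion just walked through, against the hypothetical Node struct above – descend by comparing on each node's split axis, and give a new node the next axis in the cycle:

def insert(node, point, range, depth = 0, k = 2)
  # Empty spot: create a leaf splitting on the axis for this depth
  return Node.new(point, range, depth % k, nil, nil) if node.nil?
  if point[node.split] < node.domain[node.split]
    node.left  = insert(node.left,  point, range, depth + 1, k)
  else
    node.right = insert(node.right, point, range, depth + 1, k)
  end
  node
end

# Rebuild the example: [20,10] as root, [10,5] left, [25,3] right
tree = nil
[[[20, 10], :p1], [[10, 5], :p2], [[25, 3], :p3]].each do |pt, id|
  tree = insert(tree, pt, id)
end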

How do we search it?
Step 1: Find the best bin (where the query point would otherwise be inserted)

How do we search it?
NOTE: There is no node for this bin – just the space where a node would be if it existed!

How do we search it?
Step 2: Make the current leaf node the current ”best guess”

How do we search it?
… and set the ”best guess radius” to be the distance between the query and that point

How do we search it?
Step 3: Back up the tree one node

How do we search it?
If the distance between the query and the new
node is less than the best guess radius...

How do we search it?
Then set the best guess radius to the new distance, and make the current node the best guess

How do we search it?
Step 4: If the hypersphere described by the best
guess radius crosses the current split...
Oh nooooh!

How do we search it?
And the current node has a child on the other
side...
Oh snap!

How do we search it?
… then make that node the current node, and
repeat:

How do we search it?
Here, the distance is not less than the best guess
radius...

How do we search it?
… and the hypersphere neither crosses the
split ...
Whew, missed it!

How do we search it?
… nor does the current node have any children ...
Whew, missed it!

How do we search it?
So we can eliminate it and back up the tree again!

How do we search it?
We've already compared this node, so let's keep
going back up the tree

How do we search it?
Again, the distance is bigger than the best guess radius, and there is no crossing – back up again!

How do we search it?
...and again...

How do we search it?
All the way back to the root!

How do we search it?
And you have your nearest neighbor, with O(log n) running time in the good case!
I'm the answer!
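Putting the walkthrough into code – a sketch of the whole search against the hypothetical Node struct from earlier: descend toward the query's bin first, then unwind, comparing each node and crossing a split only when the best-guess hypersphere reaches it:

def nearest(node, q, best = nil, best_dist = Float::INFINITY)
  return [best, best_dist] if node.nil?
  # Compare this node against the current best guess
  d = Math.sqrt(q.each_index.sum { |i| (q[i] - node.domain[i])**2 })
  best, best_dist = node, d if d < best_dist
  # Recurse first into the side where q's bin lives
  diff = q[node.split] - node.domain[node.split]
  near, far = diff < 0 ? [node.left, node.right] : [node.right, node.left]
  best, best_dist = nearest(near, q, best, best_dist)
  # Cross the split only if the best-guess radius reaches it
  best, best_dist = nearest(far, q, best, best_dist) if diff.abs < best_dist
  [best, best_dist]
end

node, dist = nearest(tree, [11, 4])
node.domain  # => [10, 5]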

But that was a pretty good case...
We barely had to backtrack at all – best case is O(log n)
Worst case (lots of backtracking – examining almost every node) can get up to O(n)
Amount of backtracking is directly proportional to k!
If k is small (say 2, as in this example) and n is large, we see a huge improvement over linear search
As k becomes large, the benefits of this over a naive implementation virtually disappear!

The Curse of Dimensionality
Curse you, dimensionality!
High-dimensional vector spaces are darned
hard to search!
Why? Too many dimensions! Why are there so
many dimensions!?!
What can we do about it?
Get rid of the extra weight!
Enter Messrs. Johnson and Lindenstrauss

It turns out...
Your vectors have a high dimension
What matters is usually not absolute distance and precise location, but the relative distance between points
Relative distance can be largely preserved by a lower-dimensional space
Reduce k dimensions to k_proj dimensions, with k_proj << k
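Concretely, the reduction is just multiplying each point by a random k_proj × k matrix. A minimal sketch with Gaussian entries – one common JL-style construction; the talk doesn't pin down the distribution:

# Returns a lambda mapping a k-dim point to k_proj dims via a fixed
# random Gaussian matrix (Box-Muller for the normal samples).
def make_projection(k, k_proj)
  gauss = -> { Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand) }
  m = Array.new(k_proj) { Array.new(k) { gauss.call } }
  ->(p) { m.map { |row| row.each_index.sum { |i| row[i] * p[i] } } }
end

project = make_projection(256, 15)  # e.g. 256 dims down to 15
low = project.call(high_dim_point)  # high_dim_point: assumed input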

Example: 2d to 1d
(diagram: a set of points in the plane with their pairwise distances labeled: 5.000, 11.180, 8.246, 7.280, 3.162, 10.296, 7.071, 14.000, 13.342, 17.464)

Example: 2d to 1d, 1st attempt
(diagram: the same points with a randomly chosen projection plane drawn through them)

Example: 2d to 1d, 1st attempt
(diagram: the finished 1-d projection of the points onto that plane)

Example: 2d to 1d, 2nd attempt
(diagram: the same points with a different randomly chosen projection plane)

Example: 2d to 1d, 2nd attempt
(diagram: the finished 1-d projection for the second plane)

Example: 2d to 1d, 3rd attempt
(diagram: the same points with a third randomly chosen projection plane)

Example: 2d to 1d, 3rd attempt
(diagram: the third projection plane and its finished 1-d projection)

It turns out...
Relative distance can be largely but not
completely preserved by a lower dimensional
space
Every projection will have errors
How do you choose one with the fewest?
Trick question: Let fate decide!

Multiple random projections
Choose the projections randomly
Use multiple projections
Trade cost in resources for gains in accuracy
More projections = greater resource cost = greater accuracy
Fewer projections = lesser resource cost = lesser accuracy
Trivially parallelizable
Learn to be happy with ”good enough”

Multiple random projections
# For each random projection: project the query into that low-dimensional
# space, look up its nearest neighbor in that projection's k-d tree, then
# run the naive exact search over the small candidate set.
# (proj.project and proj.tree are an assumed interface, not from the talk;
# each node's range is assumed to carry the original point.)
def multi_projection_nearest(projections, q)
  nns = projections.map do |proj|
    pq = proj.project(q)                # q in the same space as this projection
    nearest(proj.tree, pq).first.range  # nearest candidate (original point)
  end
  naive_nearest(nns, q)                 # exact pass over the candidates
end
Get the nearest from each projection, then run a naive nearest on the results thereof.
Et voilà!
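Since each projection is searched independently (”trivially parallelizable” above), the loop fans out naturally – a sketch with one thread per projection, on the same assumed proj interface:

# Each thread searches its own projection's k-d tree; :value joins.
threads = projections.map do |proj|
  Thread.new { nearest(proj.tree, proj.project(q)).first.range }
end
naive_nearest(threads.map(&:value), q)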

Multiple random projections
Experiments yield > 98% accuracy when multiple nearest neighbors are selected from each projection and the dimensionality is reduced from 256 to 15, with approximately 30% of the calculation (see credits)
Additional experiments yielded similar results, as did my own
That's pretty darn-tootin' good

Stuff to watch out for
Balancing is vitally important (assuming a uniform distribution of points): pay careful attention to node selection – choose the node with the median coordinate along the split axis
Cycle through the axes at each level of the tree – the root splits on dimension 0, level 1 on dimension 1, level 2 on dimension 2, and so on, wrapping around modulo k (see the sketch below)

Stuff to watch out for
Building the trees still takes some time
Building the projections is effectively matrix multiplication – time in O(n^2.807) with Strassen's algorithm
Building the (balanced) trees from the projections takes time in approximately O(n log n)
Solution: build the trees ahead of time and store them for later querying (i.e. index those bad boys!)
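For the ”index those bad boys” step, plain Marshal works since the nodes are simple data – a minimal sketch, not necessarily how the cited experiments stored theirs:

# Build once and serialize; load in the query path.
tree = build_balanced(projected_points)  # projected_points: assumed
File.binwrite('kdtree.idx', Marshal.dump(tree))
# ... later ...
tree = Marshal.load(File.binread('kdtree.idx'))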

Thanks!
Credits:
Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
k-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p
Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN
Research fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E