A talk I presented at Nashville Hack Day 2012 on using multiple random low-dimensional projections of high-dimensional data to speed up approximate nearest-neighbor search
Size: 1.9 MB
Language: en
Added: Aug 04, 2012
Slides: 55 pages
Slide Content
The Post Office Problem
k-d trees, k-nn search, and the Johnson-Lindenstrauss lemma
Who am I?
Jeremy Holland
Senior lead developer at Centresource
Math and algorithms nerd
@awebneck, github.com/awebneck, freenode: awebneck (you get the idea)
If you like the talk...
I like scotch. Just putting it out there.
What is the Post Office Problem?
Don Knuth, professional CS badass.
TAOCP, vol. 3
Otherwise known as “Nearest Neighbor search”
Let's say you've...
Just moved to Denmark!
But you need to mail a letter!
Which post office do you go to?
Finding free images of post offices is hard, so...
We'll just reduce it to this:
q
Naive implementation
# Naive nearest neighbor: measure the distance from the query point q
# to every point in P and keep the closest one.
def naive_nearest(points, q)
  min  = Float::INFINITY
  best = nil
  points.each do |p|
    dim_dist_sum = 0
    q.each_index { |k| dim_dist_sum += (q[k] - p[k])**2 }
    dist = Math.sqrt(dim_dist_sum)
    if dist < min
      min  = dist
      best = p
    end
  end
  best
end
Calculate distance to all points, find smallest
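For example (the points here are made up for illustration, not from the slides):

  # Three post offices and a query point; returns the closest office.
  offices = [[20, 10], [10, 5], [25, 3]]
  naive_nearest(offices, [18, 8])   # => [20, 10]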
But that takes O(kn) time! - can we do better?
You bet! With a little preprocessing...
k-d tree
Binary tree (each node has at most two children)
Each node represents a single point in the set to be searched
Each node looks like...
Domain: the vector describing the point (i.e. [p[0], p[1], … p[k-1]])
Range: Some identifying characteristic (e.g. PK in a database)
Split: A chosen dimension from 0 ≤ split < k
Left: The left child (left.domain[split] < self.domain[split])
Right: The right child (right.domain[split] ≥ self.domain[split])
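Roughly, in Ruby (my own sketch of the node described above, not code from the slides; the class and field names are made up):

  # A minimal k-d tree node mirroring the fields above: domain (the point's
  # coordinate vector), range (an identifier such as a database PK), split
  # (the dimension this node partitions on), and left/right children.
  class Node
    attr_accessor :domain, :range, :split, :left, :right

    def initialize(domain, range, split)
      @domain = domain   # e.g. [20, 10]
      @range  = range    # e.g. a primary key
      @split  = split    # dimension index, 0 <= split < k
      @left   = nil      # left.domain[split]  <  domain[split]
      @right  = nil      # right.domain[split] >= domain[split]
    end
  end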
Let's build a k-d tree!
Point 1: [20,10]
Let's build a k-d tree!
Let's split on the x axis
Let's build a k-d tree!
Add a new point: [10,5]
Let's build a k-d tree!
The new point is the Left Child of the first point
Let's build a k-d tree!
Let's split him on the y axis
Let's build a k-d tree!
And add a 3rd point: [25,3]
Let's build a k-d tree!
The new point is the Right Child of the first point
Let's build a k-d tree!
So on and so forth...
Let's build a k-d tree!
Giving you a tree:
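For reference, a sketch of that insertion process in Ruby (my own illustration, reusing the Node class sketched earlier and cycling the split axis with depth):

  # Insert a point by walking down: go left if its coordinate on the current
  # node's split axis is smaller, right otherwise, and attach a new node at
  # the first empty child, splitting on the next axis (depth mod k).
  def insert(root, domain, range, depth = 0)
    k = domain.size
    return Node.new(domain, range, depth % k) if root.nil?
    if domain[root.split] < root.domain[root.split]
      root.left  = insert(root.left,  domain, range, depth + 1)
    else
      root.right = insert(root.right, domain, range, depth + 1)
    end
    root
  end

  # The example from the slides: [20,10] becomes the root (split on x),
  # [10,5] its left child (split on y), [25,3] its right child.
  tree = nil
  [[20, 10], [10, 5], [25, 3]].each_with_index do |pt, id|
    tree = insert(tree, pt, id)
  end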
How do we search it?
Step 1: Find the best bin (where the query point would otherwise be inserted)
q
root
How do we search it?
NOTE: There is no node for this bin – just the space a node would occupy if one existed!
q
root
How do we search it?
Step 2: Make the current leaf node the current “best guess”
Best guess
How do we search it?
… and set the “best guess radius” to be the distance between the query and that point
Best guess radius
How do we search it?
Step 3: Back up the tree 1 node
Current node
How do we search it?
If the distance between the query and the new node is less than the best guess radius...
How do we search it?
Then set the best guess radius to the new distance, and make the current node the best guess
How do we search it?
Step 4: If the hypersphere described by the best guess radius crosses the current split...
Oh nooooh!
How do we search it?
And the current node has a child on the other side...
Oh snap!
How do we search it?
… then make that node the current node, and repeat:
How do we search it?
Here, the distance is not less than the best guess radius...
How do we search it?
… and the hypersphere neither crosses the split ...
Whew, missed it!
How do we search it?
… nor does the current node have any children ...
Whew, missed it!
How do we search it?
So we can eliminate it and back up the tree again!
How do we search it?
We've already compared this node, so let's keep going back up the tree
How do we search it?
Again, the distance is not less than the best guess radius, and there is no crossing – back up again!
How do we search it?
...and again...
How do we search it?
All the way back to the root!
How do we search it?
And you have your nearest neighbor, with a good-case running time of O(log n)!
I'm the answer!
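Putting the whole walkthrough together, here is a rough Ruby sketch of the descend-then-backtrack search (my own illustration, not code from the slides; it reuses the Node class sketched earlier):

  # Recursive nearest-neighbor search: descend to the bin the query would
  # fall in, then unwind, checking each node on the way back up and crossing
  # a split only when the best-guess hypersphere could reach the other side.
  def nearest(node, q, best = nil, best_dist = Float::INFINITY)
    return [best, best_dist] if node.nil?

    # Steps 1-2: descend toward the query's bin first.
    near, far = if q[node.split] < node.domain[node.split]
                  [node.left, node.right]
                else
                  [node.right, node.left]
                end
    best, best_dist = nearest(near, q, best, best_dist)

    # Step 3: on the way back up, see if this node beats the best guess.
    d = Math.sqrt(node.domain.zip(q).sum { |a, b| (a - b)**2 })
    if d < best_dist
      best      = node
      best_dist = d
    end

    # Step 4: only cross the split if the best-guess radius reaches it.
    if (q[node.split] - node.domain[node.split]).abs < best_dist
      best, best_dist = nearest(far, q, best, best_dist)
    end

    [best, best_dist]
  end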
But that was a pretty good case...
We barely had to backtrack at all – best case is O(log n)
Worst case (lots of backtracking – examining almost every node) can get up to O(n)
Amount of backtracking is directly proportional to k!
If k is small (say 2, as in this example) and n is large, we see a huge improvement over linear search
As k becomes large, the benefits of this over a naive implementation virtually disappear!
The Curse of Dimensionality
Curse you, dimensionality!
High-dimensional vector spaces are darned hard to search!
Why? Too many dimensions! Why are there so many dimensions!?!
What can we do about it?
Get rid of the extra weight!
Enter Messrs. Johnson and Lindenstrauss
It turns out...
Your vectors have a high dimension
Absolute distance and precise location versus relative distance between points
Relative distance can be largely preserved by a lower dimensional space
Reduce k dimensions to k_proj dimensions, where k_proj << k
It turns out...
Relative distance can be largely but not completely preserved by a lower dimensional space
Every projection will have errors
How do you choose one with the fewest?
Trick question: Let fate decide!
Multiple random projection
Choose the projections randomly
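As a rough sketch (mine, not from the slides), a random projection can be as simple as multiplying by a matrix of Gaussian noise; the function names and the scaling choice here are my own assumptions:

  # Build a k_proj x k matrix of N(0,1) entries and project a k-dimensional
  # point down to k_proj dimensions. Scaling by 1/sqrt(k_proj) keeps
  # distances roughly on the same scale, in the Johnson-Lindenstrauss spirit.
  def random_projection_matrix(k, k_proj)
    Array.new(k_proj) do
      Array.new(k) do
        # Box-Muller transform for a standard normal sample
        Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
      end
    end
  end

  def project(matrix, point)
    scale = 1.0 / Math.sqrt(matrix.size)
    matrix.map { |row| scale * row.zip(point).sum { |m, x| m * x } }
  end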
Multiple projections
Exchange cost in resources for cost in accuracy
More projections = greater resource cost = greater accuracy
Fewer projections = lesser resource cost = lesser accuracy
Trivially parallelizable
Learn to be happy with “good enough”
Multiple random projections
nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <project q into the subspace of p>
  nns << <nearest neighbor to pq within that projection>
end
return <naive nearest neighbor to q among nns>
Get the nearest from each projection, then run a naive nearest on the results thereof.
Et voilà!
Multiple random projection
Experiments yield > 98% accuracy when multiple nearest neighbors are selected from each projection and d is reduced from 256 to 15, with approximately 30% of the calculation. (see credits)
Additional experiments yielded similar results, as did my own
That's pretty darn-tootin' good
Stuff to watch out for
Balancing is vitally important (assuming uniform distribution of points): careful attention must be paid to selection of nodes (choose the node with the median coordinate on the split axis)
Cycle through axes for each level of the tree – root should split on 0, level 1 on 1, level 2 on 2, etc. (see the sketch below)
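A sketch of a balanced build along those lines (my own illustration, reusing the Node class from earlier; I represent input points as { domain:, range: } hashes, which is an assumption). It sorts at each level for simplicity, which costs O(n log² n); a linear-time median selection would tighten that.

  # Balanced construction: at each depth, split on axis (depth mod k),
  # put the point with the median coordinate on that axis at the node,
  # and build the children from the two halves.
  def build_balanced(points, depth = 0)
    return nil if points.empty?
    axis   = depth % points.first[:domain].size
    sorted = points.sort_by { |p| p[:domain][axis] }
    mid    = sorted.size / 2
    node   = Node.new(sorted[mid][:domain], sorted[mid][:range], axis)
    node.left  = build_balanced(sorted[0...mid], depth + 1)
    node.right = build_balanced(sorted[mid + 1..-1], depth + 1)
    node
  end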
Stuff to watch out for
Building the trees still takes some time
Building the projections is effectively matrix multiplication, time in O(n^2.807) (Strassen's algorithm)
Building the (balanced) trees from the projections takes time in approximately O(n log n)
Solution: build the trees ahead of time and store them for later querying (i.e. index those bad boys!)
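For instance, a minimal sketch using Ruby's built-in Marshal (the file name is made up; any serialization or a proper index store would do):

  # Build once, serialize to disk, and reload at query time.
  File.binwrite('kdtree.idx', Marshal.dump(tree))
  tree = Marshal.load(File.binread('kdtree.idx'))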
Thanks!
Credits:
Based in large part on research conducted by Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
k-d trees: J. L. Bentley, Stanford U.: http://bit.ly/Mpy05p
Dimensionality reduction: W. B. Johnson and J. Lindenstrauss: http://bit.ly/m9SGPN
Research Fuel: Ardbeg Uigeadail: http://bit.ly/fcag0E