Bloom filter

leefs 3,137 views 20 slides Jul 18, 2013

Slide 1 of 20

About This Presentation

No description available for this slideshow.

Size: 490.16 KB

Language: en

Added: Jul 18, 2013

Slides: 20 pages

Slide Content

Bloom Filter
YHD Search Sharing 2013-04-23

Outline
●Interview questions
●Bloom Filter
–Data structure
–Probability of false positives
–Set properties
●Application
–Cache sharing : squid
–Speed up data access : Hbase
–ID Mapping : zoie
●Materials

Interview questions
●Crawler
–Billions web pages
–How to keep track crawled urls
●Straggler Detection
–You are manning the security desk of a large building
–Everyone checks in or checks out with their id
–At the end of day, identify the few stragglers left in the
building

Data structure
●Data structure
–Init : a bit array of m bits, all set to 0
–Add an element
●K hash function to get K array positions
●Set the bits at all these positions to 1
●Query an element (test whether it's in the set)
–K hash function to get K array positions
–If any positionare 0, not in the set
–If all are 1, probabilistic in the set

Probability of false positives
●1 hash function
000000010000010000000000
p(A[i]=1∣hash[x
1,…,x
n])=1−(1−
1
m
)
n
p(hash[x
1]=i)=
1
m
p(A[i]=0∣hash[x
1])=1−
1
m
p(A[i]=0∣hash[x
1,…,x
n])=(1−
1
m
)
n
p(A[i]=1)=1−(1−
1
m
)
n
≃1−e
−n/m
lim
x→∞
(1−
1
x
)
−x
=e

Probability of false positives
●1 hash function
000000010000010000000000
p(A[H(y)]=1∣y∉S)=
(numberof1)
m
2/23
p(A[i]=1)=1−e
−n/m
Given
E(numberof1)=m⋅(1−e
−n/m
)
p(A[H(y)]=1∣y∉S)=(1−e
−n/m
)

Probability of false positives
●K hash function : repeat for k times
000000010000010000000000
p(A[H(y)]=1∣y∉S)=
(numberof1)
m
p(A[i]=1)=1−e
−n/m
Given
E(numberof1)=m⋅(1−e
−n/m
)
p(A[H(y)]=1∣y∉S)=(1−e
−n/m
)
p(A[H(y)]=1∣y∉S)=(
numberof1
m
)
k
p(A[i]=1)=1−(1−
1
m
)
kn
≃1−e
−kn/m
E(numberof1)=m⋅(1−e
−kn/m
)
p(A[H(y)]=1∣y∉S)=(1−e
−kn/m
)
k

Probability of false positives
●Minimal Probability of false positives
p(A[H(y)]=1∣y∉S)=(1−e
−kn/m
)
k
f=(1−e
−kn/m
)
k
f=e
k∗ln(1−e
−kn/m
)
Minimal f, then minimal g
g=k∗ln(1−e
−kn/m
)
p=e
−kn/m
Given
g=−
m
n
ln(p)∗ln(1−p)
Minimal(f)=(
1
2
)
k
p=
1
2
e
−kn/m
=
1
2
is the probability than any specific bit is still 0
half-full Bloom filter array

Probability of false positives
●examples k=ln2
m
n
m/n k k=1 k=2 k=3 k=4 k=5
2 1.39 0.393 0.400
3 2.08 0.283 0.237 0.253
4 2.77 0.221 0.155 0.147 0.160
5 3.46 0.181 0.109 0.092 0.0920.101
6 4.16 0.154 0.08040.0609 0.05610.0578
7 4.85 0.133 0.06180.0423 0.0359 0.0347
8 5.55 0.118 0.04890.0306 0.024 0.0217

Set properties
●Union (bitwise OR)
– same as the Bloom filter created from scratch using
the union of the two sets.
●Intersection (AND operations)
– the false positive probability in the resulting Bloom
filter is at most the false-positive probability in one of
the constituent Bloom filters, but may be larger than
the false positive probability in the Bloom filter
created from scratch using the intersection of the two
sets

Squid : Cache Digests

Squid : Cache Digests
●False positive:
–Proxy A thinks Proxy B has URL U cached. A
asks for cached U, B responds back with “no”,
A goes to actual website.

HBase :architecture

Hbase :HFile format
●(Not including Bloom Filter)

HBase : Query optimization
●Bloom Filter
–As meta store of HFile
–used to determine if a given key is in that store file
●Characteristics
–Know n total KV count (N), but actual count can
often be much lower
–HFile.insert (and hence, BloomFilter.add)
commands are done in lexicographically increasing
order
4000
10000
5000
9001

Application : Zoie
●Long[] uidArray
–Add element
–Query element
int h = (int) ((uid >>> 32) ^ uid) * MIXER;
long bits = _filter[h & _mask];
bits |= ((1L << (h >>> 26)));
bits |= ((1L << ((h >> 20) & 0x3F)));
_filter[h & _mask] = bits;
final int h = (int) ((uid >>> 32) ^ uid) * MIXER;
final int p = h & _mask;

// check the filter
final long bits = _filter[p];
if ((bits & (1L << (h >>> 26))) == 0
|| (bits & (1L << ((h >> 20) & 0x3F))) == 0)
return -1;

Materials
●http://www.cs.jhu.edu/~fabian/courses/CS600.6
24/slides/bloomslides.pdf
●Google tech talk, bloom filtering
●http://www.slideshare.net/jaxlondon2012/intro
-to-hbase-lars-george
●https://issues.apache.org/jira/secure/attachme
nt/12444007/Bloom_Filters_in_HBase.pdf

Bloom filter

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Bloom filter

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Pray For The Peace Of Jerusalem and You Will Prosper

Don_t_Waste_Your_Life_God.....powerpoint

VILLASUR_FACTORS_TO_CONSIDER_IN_PLATING_SALAD_10-13.pdf

Fertility awareness methods for women in the society

Chapter 5 Arithmetic Functions Computer Organisation and Architecture

syakira bhasa inggris (1) (1).pptx.......