Scaling Out With Hadoop And HBase

36 slides · Nov 17, 2009

About This Presentation

A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.


Slide Content

Scaling Out
Hadoop and NoSQL
Age Mooij

An Introduction to Dealing with Big Data

About me...
@agemooij

Big Data
...and me

My Current Project...
IP Address Registration for Europe, Middle East, Russia
IPv4: 2^32 (4.3×10^9) addresses
IPv6: 2^128 (3.4×10^38) addresses

Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...

Big Data
...and you

Twitter, Google, Amazon, Yahoo, LinkedIn, Facebook, eBay, Marktplaats, Hyves, Digg, Flickr, YouTube, Wikipedia, MySpace
[Logo collage with user counts: 300M, 264M, 50M, 45M, 32M users; 6.5M users / 5.5M ads]

Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt

[Diagram: scaling UP (one big machine) VS scaling OUT (many small machines)]

Scaling Out, Part I
Processing Data
a.k.a. Data Crunching

Map/Reduce
Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
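As an illustration of those four steps, here is a minimal word-count simulation in plain (modern) Java, with no Hadoop involved; the hard-coded chunks and the thread pool are invented stand-ins for real input splits and a real cluster:

    import java.util.*;
    import java.util.concurrent.*;

    public class MiniMapReduce {
        public static void main(String[] args) throws Exception {
            List<String> chunks = List.of("the cat", "the hat", "cat in the hat"); // 1. break into chunks
            ExecutorService pool = Executors.newFixedThreadPool(3);                // 2. "distribute"

            // 3. map phase: each chunk is processed in parallel into (word, count) pairs
            List<Future<Map<String, Integer>>> partials = new ArrayList<>();
            for (String chunk : chunks) {
                partials.add(pool.submit(() -> {
                    Map<String, Integer> counts = new HashMap<>();
                    for (String word : chunk.split(" ")) counts.merge(word, 1, Integer::sum);
                    return counts;
                }));
            }

            // 4. reduce phase: merge the partial results by key
            Map<String, Integer> totals = new HashMap<>();
            for (Future<Map<String, Integer>> f : partials)
                f.get().forEach((w, n) -> totals.merge(w, n, Integer::sum));

            pool.shutdown();
            System.out.println(totals); // {the=3, cat=2, hat=2, in=1}
        }
    }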

Hadoop: Reliable, Scalable, Distributed Computing
(written in Java)

Distributed File System (DFS)
Automatic file replication
Automatic checksumming / error correction
Foundation for all Hadoop projects
Based on Google’s File System (GFS)
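For a taste of the DFS in code, here is a hedged sketch against the Hadoop FileSystem API; the path is invented, and the replication and checksumming listed above happen transparently below this level:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster's filesystem settings from the config on the classpath
            FileSystem fs = FileSystem.get(new Configuration());

            // Write a file; the DFS replicates and checksums the blocks automatically
            Path path = new Path("/data/example.txt"); // hypothetical path
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello, distributed world");
            }

            // Read it back; corrupt block replicas are detected via checksums
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }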

Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
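To show what that simple Java API looks like, here is the classic word-count example, sketched against the org.apache.hadoop.mapreduce API of the 0.20 line; the job driver and configuration are omitted:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input line
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups pairs by word; sum the counts
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

The framework handles splitting the input, shuffling the (word, 1) pairs to reducers grouped by key, and writing the merged results back to the DFS.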

The New York Times:
4 TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240

Scaling Out, Part II
Storing & Retrieving Data
Reads and Writes

Relational Databases
are hard to scale out

Ways to Scale Out an RDBMS (1): Replication
Master-Slave: good for scaling reads; single point of failure; single point of bottleneck
Master-Master: limited scaling of writes; complicated

Ways to Scale Out an RDBMS (2): Partitioning
Vertical: by function / table; not truly relational anymore (application joins)
Horizontal: by key / id (sharding); limited scalability (relocating, resharding); see the sketch below
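To make the sharding idea concrete, here is a minimal, hypothetical hash-based shard router in Java (the class name, shard count, and connection strings are all invented); it also shows why resharding hurts:

    import java.util.List;

    public class ShardRouter {
        private static final int NUM_SHARDS = 4; // hypothetical fixed shard count

        // Pretend each entry is the connection string for one database shard
        private static final List<String> SHARDS = List.of(
                "jdbc:mysql://db0/app", "jdbc:mysql://db1/app",
                "jdbc:mysql://db2/app", "jdbc:mysql://db3/app");

        // Route a record to a shard by hashing its key
        static String shardFor(String key) {
            int bucket = Math.floorMod(key.hashCode(), NUM_SHARDS);
            return SHARDS.get(bucket);
        }

        public static void main(String[] args) {
            System.out.println(shardFor("user:42")); // always lands on the same shard...
            // ...until NUM_SHARDS changes, at which point most keys re-map
            // to different shards and the data has to be relocated (resharding)
        }
    }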

Why are RDBMSs so hard to scale out?

Brewer’s CAP Theorem
Consistency
Availability
Partition Tolerance
...pick any two

ACID vs BASE
ACID (Relational): Atomic, Consistent, Isolated, Durable
BASE (Non-Relational): Basic Availability, Soft State, Eventual Consistency

Non-Relational Databases
NoSQL, not NO-SQL
Not better, just different

Types of NoSQL
(Distributed) Key-Value: Redis, Voldemort, Scalaris (D)
Document Oriented: CouchDB, MongoDB, Riak (D)
Column Oriented: Cassandra (D), HBase (D)
Graph Oriented: Neo4J
(D) = Distributed (automatic out scaling)

RIPE NCC
Experiences so far...

Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...

Map/Reduce: ~200,000,000,000 raw records condensed into ~15,000,000,000 records

Our Data is 3D
IP Address → Row; Timestamp → Column Name (!); Record → Values (!)
(one IP address has many timestamps, each with zero or more records)
Best fit & performance: Column Oriented
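A sketch of that mapping against the 0.20-era HBase client API (the table name "registrations", the column family "data", and the values are invented for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegistrationStore {
        public static void main(String[] args) throws Exception {
            // Hypothetical table: one row per IP address, one column family "data"
            HTable table = new HTable(new HBaseConfiguration(), "registrations");

            // One 3D cell: row = IP address, column qualifier = timestamp, value = the record
            Put put = new Put(Bytes.toBytes("193.0.0.1"));
            put.add(Bytes.toBytes("data"),
                    Bytes.toBytes("2009-11-17T12:00:00Z"),
                    Bytes.toBytes("registration record payload"));
            table.put(put);
            table.flushCommits(); // ensure buffered writes are sent to the region servers
        }
    }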

Cassandra 0.4.1
Used by: Facebook, Digg, Twitter
Tunable: Availability vs Consistency
Very active community
No documentation

HBase 0.20.1
Used by: Yahoo, StumbleUpon, Meetup, Streamy, Adobe, Tumblr
Built on top of Hadoop DFS
Good documentation
Very active community

Initial Results
Tested on an EC2 cluster of 8 XLarge instances
Map/Reduce: 3.8 B records (23 GB) condensed to 33 M records (1 GB) in 5 hours
HBase import: 33 M records (1 GB) in 75 minutes (~44,000 inserts/second; record duplication: 6x; 15 GB on disk)
“Needle in a haystack” full on-disk table scan: 0.5 M records/second

In order to choose the right scaling tools, you need to:
Know what you want to query and how
Understand your data

Big Data
...Be Prepared!

Try some Scala in the basement!
val shameless = <SelfPromotion></SelfPromotion>