A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.
Size: 2.93 MB
Language: en
Added: Nov 17, 2009
Slides: 36
Slide Content
Scaling Out
An Introduction to Dealing with Big Data
Hadoop and NoSQL
Age Mooij
About me...
@agemooij
...and me
Big Data
IP Address Registration for Europe, Middle East, Russia
IPv4: 2^32 (4.3×10^9) addresses
IPv6: 2^128 (3.4×10^38) addresses
My Current Project...
Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt
UP vs Out (scaling up one big machine vs scaling out to many machines)
Scaling Out, Part I
Processing Data
a.k.a. Data Crunching
Map/Reduce: Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
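As an illustration of this pattern only (not Hadoop itself), here is a minimal Java sketch: the data is split into chunks, the chunks are processed in parallel, and the partial results are merged. The chunk contents and the record-counting "job" are invented for the example.

import java.util.Arrays;
import java.util.List;

// Toy illustration of the map/reduce pattern: split into chunks,
// process the chunks in parallel, merge (reduce) the partial results.
// The chunks and the record-counting "job" are invented for this example.
public class MapReducePattern {
    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("record-1", "record-2"),
                Arrays.asList("record-3"),
                Arrays.asList("record-4", "record-5", "record-6"));

        long total = chunks.parallelStream()           // distribute + process in parallel
                           .mapToLong(List::size)      // "map": process one chunk
                           .sum();                     // "reduce": merge partial results

        System.out.println("Total records: " + total); // prints 6
    }
}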
Reliable, Scalable, Distributed Computing
(written in Java)
Distributed File System (DFS)
Automatic file replication
Automatic checksumming / error correction
Foundation for all Hadoop projects
Based on Google’s File System (GFS)
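Application code talks to the DFS through Hadoop's FileSystem API, while replication and checksumming happen underneath. A minimal sketch, assuming a configured cluster; the path and record text are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write one small file into the distributed file system.
// Block replication and checksumming are handled by the cluster, not by this code.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured file system

        Path file = new Path("/data/history/sample.txt");       // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite
            out.writeUTF("one historical registration record"); // made-up content
        }

        // how many copies of each block the cluster keeps for this file
        System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
    }
}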
Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
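A minimal sketch of what that Java API looks like: a Mapper that emits a count of 1 per key and a Reducer that sums the counts. The tab-separated "key / payload" input format is an assumption made for the example, not the format from the talk's project.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal Mapper/Reducer sketch: count how many records exist per key.
// The tab-separated "key <TAB> payload" input line format is an assumption
// made for this example, not the format used by the project in the talk.
public class RecordCount {

    public static class RecordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = line.toString().split("\t", 2)[0]; // e.g. an IP address column
            context.write(new Text(key), ONE);              // emit (key, 1) per record
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable count : counts) {
                sum += count.get();                          // merge the partial counts
            }
            context.write(key, new LongWritable(sum));       // records seen for this key
        }
    }
}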
4 TB of raw TIFF image data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240
Scaling Out, Part II
Storing & Retrieving Data
Reads and Writes
Relational Databases are hard to scale out
Ways to Scale Out an RDBMS (1): Replication
Master-Slave
Master-Master
Good for scaling reads
Limited scaling of writes
Single point of failure
Single point of bottleneck
Complicated
Ways to Scale Out an RDBMS (2): Partitioning
Vertical: by function / table
Not truly Relational anymore (application joins)
Horizontal: by key / id (Sharding)
Limited scalability (relocating, resharding)
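A toy Java sketch of that horizontal (sharded) approach: the application hashes the key to pick one of a fixed set of shards. The JDBC URLs are placeholders; adding shards later means relocating and resharding data, which is the limitation noted above.

import java.util.List;

// Toy sketch of horizontal partitioning (sharding) by key: hash the key to pick
// one of a fixed set of shards. The JDBC URLs are placeholders; adding shards
// later means relocating/resharding data, which is the limitation noted above.
public class ShardRouter {
    private final List<String> shardUrls;

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Picks the shard that owns the given key.
    public String shardFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), shardUrls.size());
        return shardUrls.get(bucket);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of(
                "jdbc:postgresql://shard-0/registry",
                "jdbc:postgresql://shard-1/registry",
                "jdbc:postgresql://shard-2/registry"));
        System.out.println(router.shardFor("193.0.0.1")); // same key -> same shard, every time
    }
}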
Why are RDBMSs so hard to scale out?
Brewer's CAP Theorem:
Consistency
Availability
Partition Tolerance
...pick any two
ACID vs BASE
ACID: Atomic, Consistent, Isolated, Durable
BASE: Basic Availability, Soft State, Eventual Consistency
Non-Relational Databases
Relational vs Non-Relational
NoSQL does not mean NO-SQL
Not better, just different
Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
~ 200 000 000 000 records → Map/Reduce → ~ 15 000 000 000 records
Our Data is 3D: IP Address → Record → Timestamp (1 → 10..* → 0..*)
IP Address ↔ Row, Record ↔ Column Name (!), Timestamp ↔ Values (!)
Best fit & performance: Column Oriented
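As a rough in-memory sketch of that column-oriented "3D" shape only (row key → column name → timestamp → value); real column stores persist and distribute it very differently, and the keys and values below are invented:

import java.util.TreeMap;

// Rough in-memory sketch of the column-oriented "3D" shape only:
// row key -> column name -> timestamp -> value, sorted at every level.
// Real column stores persist and distribute this very differently;
// the keys and values below are invented examples.
public class ThreeDimensionalTable {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    public void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Returns the most recent value for a (row, column) pair, or null if absent.
    public String latest(String rowKey, String column) {
        TreeMap<Long, String> versions =
                rows.getOrDefault(rowKey, new TreeMap<>()).get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        ThreeDimensionalTable table = new ThreeDimensionalTable();
        table.put("193.0.0.0/21", "registration", 1230000000L, "assigned to example-lir");
        table.put("193.0.0.0/21", "registration", 1250000000L, "transferred");
        System.out.println(table.latest("193.0.0.0/21", "registration")); // prints "transferred"
    }
}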
Cassandra 0.4.1
Tunable: Availability vs Consistency
Very active community
No documentation
Used by: Digg, Facebook, Twitter
HBase 0.20.1
Built on top of Hadoop DFS
Good Documentation
Very active community
Used by: Tumblr, Yahoo, StumbleUpon, Meetup, Streamy, Adobe
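For a feel of the programming model, a sketch of one write and one read through HBase's classic HTable/Put/Get client API (exact signatures vary between HBase versions; the table name, column family, qualifier and values are invented for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of one write and one read through the classic HTable/Put/Get client API.
// Table name, column family, qualifier and values are invented for illustration.
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "registrations");

        Put put = new Put(Bytes.toBytes("193.0.0.0/21"));          // row key
        put.add(Bytes.toBytes("history"),                          // column family
                Bytes.toBytes("registration"),                     // column qualifier
                Bytes.toBytes("assigned to example-lir"));         // cell value
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("193.0.0.0/21")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("history"), Bytes.toBytes("registration"))));

        table.close();
    }
}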
Initial Results (tested on an EC2 cluster of 8 XLarge instances):
Map/Reduce aggregation: 3.8 B records (23 GB) → 33 M records (1 GB) in 5 hours
Import of 33 M records (1 GB): 75 minutes, ~44,000 inserts/second
Stored size: 15 GB (record duplication: 6x)
"Needle in a haystack" full on-disk table scan: 0.5 M records/second
In order to choose the right scaling tools, you need to:
Know what you want to query and how
Understand your data
Big Data: ...Be Prepared!
Try some Scala in the basement!
val shameless = <SelfPromotion>
</SelfPromotion>