Scaling Out With Hadoop And HBase

36 slides · Nov 17, 2009

About This Presentation

A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.


Slide Content

Scaling Out
Hadoop and NoSQL
Age Mooij

An Introduction to Dealing with Big Data

About me...
@agemooij

Big Data
...and me

My Current Project...
IP Address Registration for Europe, Middle East, Russia
IPv4: 2^32 (4.3×10^9) addresses
IPv6: 2^128 (3.4×10^38) addresses

Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...

Big Data
...and you

Twitter, Google, Amazon, Yahoo, LinkedIn, Facebook, eBay, Marktplaats, Hyves, Digg, Flickr, YouTube, Wikipedia, MySpace
[Logo collage with user counts: 300M, 264M, 50M, 45M, 32M users; 6.5M users / 5.5M ads]

Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt

[Diagram: scaling UP (one big machine) VS scaling OUT (many small machines)]

Scaling Out, Part I
Processing Data
a.k.a. Data Crunching

Map/Reduce
Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
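As an illustration of those four steps, here is a minimal word-count simulation in plain (modern) Java, with no Hadoop involved; the hard-coded chunks and the thread pool are invented stand-ins for real input splits and a real cluster:

    import java.util.*;
    import java.util.concurrent.*;

    public class MiniMapReduce {
        public static void main(String[] args) throws Exception {
            List<String> chunks = List.of("the cat", "the hat", "cat in the hat"); // 1. break into chunks
            ExecutorService pool = Executors.newFixedThreadPool(3);                // 2. "distribute"

            // 3. map phase: each chunk is processed in parallel into (word, count) pairs
            List<Future<Map<String, Integer>>> partials = new ArrayList<>();
            for (String chunk : chunks) {
                partials.add(pool.submit(() -> {
                    Map<String, Integer> counts = new HashMap<>();
                    for (String word : chunk.split(" ")) counts.merge(word, 1, Integer::sum);
                    return counts;
                }));
            }

            // 4. reduce phase: merge the partial results by key
            Map<String, Integer> totals = new HashMap<>();
            for (Future<Map<String, Integer>> f : partials)
                f.get().forEach((w, n) -> totals.merge(w, n, Integer::sum));

            pool.shutdown();
            System.out.println(totals); // {the=3, cat=2, hat=2, in=1}
        }
    }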

Hadoop: Reliable, Scalable, Distributed Computing
(written in Java)

Distributed File System (DFS)
Automatic file replication
Automatic checksumming / error correction
Foundation for all Hadoop projects
Based on Google’s File System (GFS)
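For a taste of the DFS in code, here is a hedged sketch against the Hadoop FileSystem API; the path is invented, and the replication and checksumming listed above happen transparently below this level:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster's filesystem settings from the config on the classpath
            FileSystem fs = FileSystem.get(new Configuration());

            // Write a file; the DFS replicates and checksums the blocks automatically
            Path path = new Path("/data/example.txt"); // hypothetical path
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello, distributed world");
            }

            // Read it back; corrupt block replicas are detected via checksums
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }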

Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
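To show what that simple Java API looks like, here is the classic word-count example, sketched against the org.apache.hadoop.mapreduce API of the 0.20 line; the job driver and configuration are omitted:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input line
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups pairs by word; sum the counts
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

The framework handles splitting the input, shuffling the (word, 1) pairs to reducers grouped by key, and writing the merged results back to the DFS.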

The New York Times:
4 TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240

Scaling Out, Part II
Storing & Retrieving Data
Reads and Writes

Relational Databases
are hard to scale out

Ways to Scale Out an RDBMS (1): Replication
Master-Slave: good for scaling reads; single point of failure; single point of bottleneck
Master-Master: limited scaling of writes; complicated

Ways to Scale Out an RDBMS (2): Partitioning
Vertical: by function / table; not truly relational anymore (application joins)
Horizontal: by key / id (sharding); limited scalability (relocating, resharding); see the sketch below
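To make the sharding idea concrete, here is a minimal, hypothetical hash-based shard router in Java (the class name, shard count, and connection strings are all invented); it also shows why resharding hurts:

    import java.util.List;

    public class ShardRouter {
        private static final int NUM_SHARDS = 4; // hypothetical fixed shard count

        // Pretend each entry is the connection string for one database shard
        private static final List<String> SHARDS = List.of(
                "jdbc:mysql://db0/app", "jdbc:mysql://db1/app",
                "jdbc:mysql://db2/app", "jdbc:mysql://db3/app");

        // Route a record to a shard by hashing its key
        static String shardFor(String key) {
            int bucket = Math.floorMod(key.hashCode(), NUM_SHARDS);
            return SHARDS.get(bucket);
        }

        public static void main(String[] args) {
            System.out.println(shardFor("user:42")); // always lands on the same shard...
            // ...until NUM_SHARDS changes, at which point most keys re-map
            // to different shards and the data has to be relocated (resharding)
        }
    }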

Why are RDBMSs so hard to scale out?

Brewer’s CAP Theorem
Consistency
Availability
Partition Tolerance
...pick any two

ACID vs BASE
ACID (Relational): Atomic, Consistent, Isolated, Durable
BASE (Non-Relational): Basic Availability, Soft State, Eventual Consistency

Non-Relational Databases
NoSQL, not NO-SQL
Not better, just different

Types of NoSQL
(Distributed) Key-Value: Redis, Voldemort, Scalaris (D)
Document Oriented: CouchDB, MongoDB, Riak (D)
Column Oriented: Cassandra (D), HBase (D)
Graph Oriented: Neo4J
(D) = Distributed (automatic out scaling)

RIPE NCC
Experiences so far...

Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...

Map/Reduce: ~200,000,000,000 raw records condensed into ~15,000,000,000 records

Our Data is 3D
IP Address → Row; Timestamp → Column Name (!); Record → Values (!)
(one IP address has many timestamps, each with zero or more records)
Best fit & performance: Column Oriented
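A sketch of that mapping against the 0.20-era HBase client API (the table name "registrations", the column family "data", and the values are invented for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegistrationStore {
        public static void main(String[] args) throws Exception {
            // Hypothetical table: one row per IP address, one column family "data"
            HTable table = new HTable(new HBaseConfiguration(), "registrations");

            // One 3D cell: row = IP address, column qualifier = timestamp, value = the record
            Put put = new Put(Bytes.toBytes("193.0.0.1"));
            put.add(Bytes.toBytes("data"),
                    Bytes.toBytes("2009-11-17T12:00:00Z"),
                    Bytes.toBytes("registration record payload"));
            table.put(put);
            table.flushCommits(); // ensure buffered writes are sent to the region servers
        }
    }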

Cassandra 0.4.1
Used by: Facebook, Digg, Twitter
Tunable: Availability vs Consistency
Very active community
No documentation

HBase 0.20.1
Used by: Yahoo, StumbleUpon, Meetup, Streamy, Adobe, Tumblr
Built on top of Hadoop DFS
Good documentation
Very active community

Initial Results
Tested on an EC2 cluster of 8 XLarge instances
Map/Reduce: 3.8 B records (23 GB) condensed to 33 M records (1 GB) in 5 hours
HBase import: 33 M records (1 GB) in 75 minutes (~44,000 inserts/second; record duplication: 6x; 15 GB on disk)
“Needle in a haystack” full on-disk table scan: 0.5 M records/second

In order to choose the right scaling tools, you need to:
Know what you want to query and how
Understand your data

Big Data
...Be Prepared!

Try some Scala in the basement!
val shameless = <SelfPromotion></SelfPromotion>