Cloud Computing: Hadoop

darugar · 21 slides · Nov 20, 2008

About This Presentation

Data Processing in the Cloud with Hadoop, from the Data Services World conference.


Slide Content

Data Processing in the Cloud
Parand Tony Darugar
http://parand.com/say/
[email protected]

2
What is Hadoop
Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.

3
Why?
A common infrastructure pattern extracted from building distributed systems:
-Scale
-Incremental growth
-Cost
-Flexibility

4
Built-in Resilience to Failure
When dealing with large numbers of commodity servers, failure is a fact of life.
Assume failure; build protections and recovery into your architecture:
-Data-level redundancy
-Job/task-level monitoring and automated restart and re-allocation

5
Current State of Hadoop Project
Top-level Apache Foundation project
In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
Large, active user base, mailing lists, user groups
Very active development, strong development team

6
Widely Adopted
A valuable and reusable skill set:
-Taught at major universities
-Easier to hire for
-Easier to train on
-Portable across projects and groups

7
Plethora of Related Projects
Pig
Hive
HBase
Cascading
Hadoop on EC2
JAQL, X-Trace, Happy, Mahout

8
What is Hadoop
The Linux of distributed processing.

9
How Does Hadoop Work?

10
Hadoop File System
A distributed file system for large data:
-Your data in triplicate
-Built-in redundancy, resiliency to large-scale failures
-Intelligent distribution, striping across racks
-Accommodates very large data sizes
-On commodity hardware
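The "data in triplicate" idea above can be sketched as a toy block-placement planner. This is an illustration, not HDFS internals: the 64 MB block size was the historical default, and the node names are made up.

```python
# Toy sketch of HDFS-style block placement: a file is split into fixed-size
# blocks, and each block is assigned to three distinct nodes.
BLOCK_SIZE = 64 * 1024 * 1024  # historical HDFS default block size
REPLICATION = 3                # "your data in triplicate"

def plan_placement(file_size, nodes):
    """Assign each block of a file to REPLICATION distinct nodes (round-robin)."""
    n_blocks = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    plan = []
    for b in range(n_blocks):
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        plan.append((b, replicas))
    return plan

# A 200 MB file on a four-node cluster splits into 4 blocks of 3 replicas each.
placement = plan_placement(200 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
for block, replicas in placement:
    print(block, replicas)
```

Real HDFS placement is rack-aware (replicas are spread across racks, as the slide notes); the round-robin here only captures the "distinct nodes" invariant.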

11
Programming Model: Map/Reduce
Very simple programming model:
-Map(anything) -> (key, value)
-Sort, partition on key
-Reduce(key, values) -> (key, value)
No parallel processing / message passing semantics
Programmable in Java or any other language (streaming)
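The Map and Reduce signatures above can be made concrete with the classic word-count example, written here in plain Python in the streaming spirit. This is a local sketch, not a Hadoop job: the shuffle is simulated with a sort, and the input lines are invented.

```python
# Word count in the Map/Reduce model: Map emits (word, 1) pairs,
# Reduce sums the values for each word.
from itertools import groupby

def mapper(lines):
    """Map(anything) -> (key, value): emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce(key, values) -> (key, value): sum counts per word.
    Assumes pairs arrive sorted by key, as after the shuffle phase."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(v for _, v in group)

# Local simulation of the map -> sort/partition -> reduce pipeline:
mapped = sorted(mapper(["the quick fox", "the lazy dog"]))
counts = dict(reducer(mapped))
print(counts)  # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Under Hadoop Streaming the same mapper and reducer logic would read lines from stdin and write tab-separated key/value pairs to stdout, with the framework handling the sort between the two phases.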

12
Processing Model
Create or allocate a cluster
Put data onto the file system:
-Data is split into blocks, stored in triplicate across your cluster
Run your job:
-Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
•Move computation to data, not data to computation

13
Processing Model
•Monitor workers, automatically restarting failed or slow tasks
-Gather output of Map, sort and partition on key
-Run Reduce tasks
•Monitor workers, automatically restarting failed or slow tasks
Results of your job are now available on the Hadoop file system
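The "sort and partition on key" step above can be modeled in a few lines. Hash partitioning is what Hadoop's default partitioner does, but this code is a simulation with invented data, not Hadoop itself.

```python
# Sketch of the shuffle: map output is routed to reducers by hash of the
# key, so every value for a given key lands at exactly one Reduce task.
def partition(pairs, n_reducers):
    """Route each (key, value) pair to a reducer by hash of the key."""
    buckets = [[] for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[hash(key) % n_reducers].append((key, value))
    # Each reducer sorts its own partition so equal keys are adjacent.
    return [sorted(b) for b in buckets]

pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
parts = partition(pairs, 2)
# Every occurrence of a given key lands in exactly one partition:
all_keys = [k for part in parts for k, _ in part]
assert sorted(all_keys) == sorted(k for k, _ in pairs)
```

Because the partition function depends only on the key, no coordination between map tasks is needed; this is what lets the framework, rather than the programmer, own the parallelism.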

14
Hadoop on the Grid
Managed Hadoop clusters
Shared resources
-Improved utilization
Standard data sets, storage
Shared, standardized operations management
Hosted internally or externally (e.g. on EC2)

15
Usage Patterns

16
ETL
Put a large data source (e.g. log files) onto the Hadoop File System
Perform aggregations, transformations, normalizations on the data
Load into an RDBMS / data mart
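The transformation step of this ETL pattern is typically a map-only job. As a sketch, here is a normalizer for a hypothetical `timestamp status path` log format (the format and field layout are invented for illustration):

```python
# Map-style ETL step: normalize raw log lines into tab-separated records
# ready to bulk-load into an RDBMS / data mart.
def normalize(line):
    """Turn 'TIMESTAMP STATUS PATH' into a PATH\tSTATUS\tTIMESTAMP record,
    lower-casing paths and dropping malformed lines."""
    fields = line.split()
    if len(fields) != 3:
        return None  # skip malformed input rather than failing the job
    timestamp, status, path = fields
    return "\t".join([path.lower(), status, timestamp])

raw = [
    "2008-11-20T10:00:01 200 /Index.html",
    "garbage",
    "2008-11-20T10:00:02 404 /missing",
]
records = [r for r in (normalize(l) for l in raw) if r is not None]
print(records)
```

Tolerating malformed lines instead of raising is deliberate: on web-scale log data some fraction of the input is always dirty, and a single bad record should not kill a task.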

17
Reporting and Analytics
Run canned and ad-hoc queries over large data
Run analytics and data mining operations on large data
Produce reports for end-user consumption or loading into a data mart

18
Data Processing Pipelines
Multi-step pipelines for data processing
Coordination, scheduling, data collection and publishing of feeds
SLA-carrying, regularly scheduled jobs

19
Machine Learning & Graph Algorithms
Traverse large graphs and data sets, building models and classifiers
Implement machine learning algorithms over massive data sets

20
General Back-End Processing
Implement significant portions of back-end, batch-oriented processing on the grid
General computation framework
Simplify back-end architecture

21
What Next?
Download Hadoop:
-http://hadoop.apache.org/
Try it on your laptop
Try Pig
-http://hadoop.apache.org/pig/
Deploy to multiple boxes
Try it on EC2