Data Processing in the Cloud with Hadoop, from the Data Services World conference.
Data Processing in the Cloud
Parand Tony Darugar
http://parand.com/say/ [email protected]
2
What is Hadoop?
Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
3
Why?
A common infrastructure pattern extracted from building distributed systems
Scale
Incremental growth
Cost
Flexibility
4
Built-in Resilience to Failure
When dealing with large numbers of commodity servers, failure is a fact of life
Assume failure; build protections and recovery into your architecture
-Data-level redundancy
-Job/task-level monitoring and automated restart and re-allocation
5
Current State of Hadoop Project
Top-level Apache Foundation project
In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
Large, active user base, mailing lists, user groups
Very active development, strong development team
6
Widely Adopted
A valuable and reusable skill set
-Taught at major universities
-Easier to hire for
-Easier to train on
-Portable across projects, groups
7
Plethora of Related Projects
Pig
Hive
HBase
Cascading
Hadoop on EC2
JAQL, X-Trace, Happy, Mahout
8
What is Hadoop?
The Linux of distributed processing.
How Does Hadoop Work?
10
Hadoop File System
A distributed file system for large data
-Your data in triplicate
-Built-in redundancy, resiliency to large-scale failures
-Intelligent distribution, striping across racks
-Accommodates very large data sizes
-On commodity hardware
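To make the file-system API concrete, here is a minimal sketch (not from the slides; the paths are hypothetical) that copies a local file onto HDFS using Hadoop's Java FileSystem API. With the default replication factor of 3, the file's blocks are stored in triplicate across the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster's file system settings (fs.defaultFS)
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths: copy a local log file into HDFS.
            // Blocks are split and replicated (3x by default) across the cluster.
            fs.copyFromLocalFile(new Path("/tmp/access.log"),
                                 new Path("/data/logs/access.log"));

            System.out.println("Replication: " +
                fs.getFileStatus(new Path("/data/logs/access.log")).getReplication());
            fs.close();
        }
    }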
11
Programming Model: Map/Reduce
Very simple programming model:
-Map(anything)->key, value
-Sort, partition on key
-Reduce(key, values)->key, value
No parallel-processing / message-passing semantics
Programmable in Java or any other language (streaming)
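To make the model concrete, here is a minimal word-count sketch (not from the slides) against Hadoop's Java MapReduce API: Map emits (word, 1) for every word it sees, the framework sorts and partitions on the key, and Reduce sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map(anything) -> (key, value): emit (word, 1) for every word in a line
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce(key, values) -> (key, value): sum the counts for each word
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                context.write(word, new IntWritable(sum));
            }
        }
    }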
12
Processing Model
Create or allocate a cluster
Put data onto the file system:
-Data is split into blocks, stored in triplicate across your cluster
Run your job:
-Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
•Move computation to data, not data to computation
13
Processing Model
•Monitor workers, automatically restarting failed or slow tasks
-Gather output of Map, sort and partition on key
-Run Reduce tasks
•Monitor workers, automatically restarting failed or slow tasks
Results of your job are now available on the Hadoop file system
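A minimal driver sketch for this processing model, assuming the word-count classes above and hypothetical HDFS paths: it wires the mapper and reducer into a Job, points the job at input and output paths on the file system, and blocks until the cluster finishes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class); // ship this jar to the nodes
            job.setMapperClass(WordCount.TokenMapper.class);
            job.setReducerClass(WordCount.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Hypothetical HDFS paths; results land under /data/wordcount-out
            FileInputFormat.addInputPath(job, new Path("/data/logs"));
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-out"));

            // The framework schedules tasks near the data and restarts failures
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

When waitForCompletion returns true, the results sit on the Hadoop file system under the output path, one part file per reduce task.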
14
Hadoop on the Grid
Managed Hadoop clusters
Shared resources
-Improved utilization
Standard data sets, storage
Shared, standardized operations management
Hosted internally or externally (e.g. on EC2)
Usage Patterns
16
ETL
Put large data sources (e.g. log files) onto the Hadoop File System
Perform aggregations, transformations, normalizations on the data
Load into RDBMS / data mart
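As one illustration of the ETL pattern (a sketch, not from the slides, assuming a hypothetical "timestamp ip url" log format), a map-only job can normalize raw log lines into tab-separated records ready for bulk loading into an RDBMS:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only ETL: parse "timestamp ip url" log lines, drop malformed rows,
    // lower-case the URL, and emit tab-separated records for a bulk load.
    public class LogNormalizerMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text out = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length < 3) return; // skip malformed lines
            out.set(fields[0] + "\t" + fields[1] + "\t" + fields[2].toLowerCase());
            context.write(NullWritable.get(), out);
        }
    }

Configuring the job with job.setNumReduceTasks(0) skips the sort and reduce phases entirely, so the normalized records are written straight back to the file system.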
17
Reporting and Analytics
Run canned and ad-hoc queries over large data
Run analytics and data mining operations on large data
Produce reports for end-user consumption or loading into data mart
18
Data Processing Pipelines
Multi-step pipelines for data processing
Coordination, scheduling, data collection and publishing of feeds
SLA-carrying, regularly scheduled jobs
19
Machine Learning & Graph Algorithms
Traverse large graphs and data sets, building models and classifiers
Implement machine learning algorithms over massive data sets
20
General Back-End Processing
Implement significant portions of back-end, batch-oriented processing on the grid
General computation framework
Simplify back-end architecture
21
What Next?
Download Hadoop:
-http://hadoop.apache.org/
Try it on your laptop
Try Pig
-http://hadoop.apache.org/pig/
Deploy to multiple boxes
Try it on EC2