Big Data and Hadoop
Veningston K.
Assistant Professor
Department of Computer Science and Engineering
National Institute of Technology Srinagar
[email protected], [email protected]
Contents
2
1. Explosion in Quantity of Data
2. Big Data Characteristics
3. Usage Example in Big Data
4. Importance of Big Data
5. Challenges in Big Data
Big Data vs. Hadoop vs. Map Reduce
Data Analytics Architecture
Hadoop Ecosystem
Hive
Pig
What is big data?
“A massive volume of data sets that is so large & complex that it becomes
difficult to process with traditional database management tools.”
3
Big Data Challenges
•Capturing
•Curation – Integration
•Storage
•Search
•Sharing
•Transfer
•Analysis
•Visualization
Explosion in Quantity of Data
4
[Figure: what happens on the Internet in 60 seconds]
Explosion in Quantity of Data
5
Volume, Veracity, Velocity, Variety, and Value
Banking/Marketing/IT: Volume, Velocity, and Value
Healthcare/Life Sciences: Veracity (sparse/inconsistency), Variety, and Value
6
5 V’s of Big Data
7
Analysis vs. Analytics
•The primary difference between analytics and analysis is one of scope: data analytics is the broader term, of which data analysis is one component.
•Data Analysis:
–The process of compiling, examining, and transforming data to support decision making.
•Data Analytics:
–Covers not only analysis but also data collection, organization, and storage, together with the tools and techniques used for all of these steps.
8
Hidden Treasures
(Advantages of Big Data)
•Insights into data can provide business advantage
–Ex: healthcare – analyzing disease patterns/trends to model future demand and to invest in strategic R&D
•Some key early indicators can mean fortunes to business
–Ex: financial industries analyze transaction data, real-time market feeds, and social media data to find odd customer behavior and minimize risk
•More precise analysis with more data
–Ex: retail – point-of-sale and supply chain management data. Use these new insights to target customers with highly specific offers or to run location-based promotions in real time.
Limitations of Existing Data Analytics Architecture –
Traditional Data Flow Architecture (Non-Hadoop Architecture)
9
[Diagram labels: data generated from web servers, network equipment, system logs, sensors, etc.; data cleaning; expensive servers; limitation on the size of the data; processing layer – dashboard; moving data to tapes]
Solution: A Combined Storage-Compute Layer
HADOOP
10
[Diagram labels: commodity computers to store and process the data; original high-fidelity raw data to process; yearly/quarterly analysis]
Hadoop Cluster
•300-node Hadoop cluster
–A Hadoop cluster having 300 servers/machines/nodes
•Assume each node provides 10 TB of storage:
–300 nodes = 10 TB * 300 = 3000 TB ==> 3 PB
–The cluster can store up to 3 PB of data to process
•Hadoop uses commodity computers to store and process the data
–No need for higher-end servers
•Scale-out technology (see the capacity sketch below)
–If an organization gets 1 PB of extra data, it just needs to add another 100 nodes to the cluster, making it a 400-node cluster, without changing any configuration in the existing Hadoop cluster.
–Any amount of data can be stored in a cost-effective manner to keep data alive forever (no need for a separate backup)
11
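A minimal sketch of the capacity and scale-out arithmetic from this slide; the node count and per-node storage are the slide's example values, not fixed properties of Hadoop.

// Toy arithmetic from the slide: 300 nodes x 10 TB/node = 3000 TB ~= 3 PB.
public class ClusterCapacity {
    public static void main(String[] args) {
        int nodes = 300;          // example cluster size from the slide
        int tbPerNode = 10;       // example per-node storage from the slide
        int totalTb = nodes * tbPerNode;
        System.out.println(totalTb + " TB ~= " + totalTb / 1000 + " PB");
        // Scale-out: adding 100 more nodes adds another 1000 TB (~1 PB)
        // without reconfiguring the existing cluster.
        System.out.println((nodes + 100) * tbPerNode + " TB after adding 100 nodes");
    }
}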
Hadoop is a DFS – Why DFS?
12
1 TB = 1024 * 1024 MB
((1024 * 1024) / (4 * 100)) / 60 ≈ 44 min – the time to read 1 TB with 4 parallel readers of 100 MB/s each
13
Hadoop is a DFS – Why DFS?
Spreading the blocks across more machines that read in parallel cuts this time proportionally, which is why Hadoop uses a distributed file system (see the sketch below).
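A minimal sketch of this read-time arithmetic; the 4-reader and 10-machine figures are illustrative assumptions, not values fixed by Hadoop.

// Toy read-time arithmetic: 1 TB at 100 MB/s per reader.
public class WhyDfs {
    public static void main(String[] args) {
        double totalMb = 1024 * 1024;        // 1 TB expressed in MB
        double mbPerSec = 100;               // per-reader throughput
        double oneReaderMin = totalMb / mbPerSec / 60;            // ~175 min
        double fourReadersMin = totalMb / (4 * mbPerSec) / 60;    // ~44 min
        double tenMachinesMin = fourReadersMin / 10;              // ~4.4 min (assumption: 10 such machines)
        System.out.printf("1 reader: %.0f min, 4 readers: %.0f min, 10 machines: %.1f min%n",
                oneReaderMin, fourReadersMin, tenMachinesMin);
    }
}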
What is Hadoop?
•Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model
•It is an open-source data management framework with scale-out storage and distributed processing
14
[Diagram labels: foundation layer; MapReduce; no need for higher-end servers]
Hadoop Key Characteristics
15
Commodity Computers
Adding Nodes
Schema-less while writing – it can absorb any kind of data from any source
Fault tolerant
Hadoop enables...
•Scalable
–New nodes can be added as needed
•Cost effective
–Hadoop brings massively parallel computing to commodity servers
–Sizeable decrease in the cost per terabyte of storage
•Flexible
–Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources
•Fault tolerant
–When you lose a node, the system redirects work to another location of the data and continues processing
16
Hadoop Core Components and Daemons
•Hadoop has a Master-Slave architecture
–Some daemons are Masters and some are Slaves
•Master daemons tell the Slave daemons what to do
–Slaves obey the Master's orders
–Whatever the NameNode tells, the DataNode does
•In the YARN component, whatever the ResourceManager tells, the NodeManager does
17
Simple cluster with Hadoop Daemons
18
RDBMS vs. Hadoop
19
Hadoop Ecosystem
20
Hadoop 2.x Core Components
21
Main Components of HDFS
22
NameNode Metadata
23
File Blocks
24
HDFS Architecture
25
Data is never passed through the NameNode.
Hadoop library
Hadoop library
Anatomy of a File Read
26
Client JVM
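The read-path diagram itself is not reproduced here; below is a minimal, hedged client-side sketch using the standard org.apache.hadoop.fs.FileSystem API (the URI and file path are placeholders). The client asks the NameNode for block locations and then streams the blocks directly from the DataNodes.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of an HDFS read from the client JVM: open() contacts the NameNode
// for metadata; the returned stream reads block data from the DataNodes.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // placeholder URI
        try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {        // placeholder path
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}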
Anatomy of a File Write
27
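Correspondingly, a minimal write sketch under the same placeholder URI/path assumptions: the client asks the NameNode where to write, then streams data into a pipeline of DataNodes that replicate each block.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of an HDFS write from the client JVM: create() asks the NameNode
// for target DataNodes; the stream pushes data through the replication pipeline.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);  // placeholder URI
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/out.txt"))) { // placeholder path
            out.writeBytes("hello hadoop\n");
        }
    }
}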
Replication and Rack Awareness
28
Hadoop 2.x Cluster Architecture
29
Hadoop 2.x Cluster Architecture
30
Hadoop 2.x Cluster Architecture
31
[Diagram labels: R data, Finance data, Marketing data]
Hadoop 2.x –High Availability
32
Data sync
Hadoop 2.x –High Availability
33
Hadoop 2.x –Resource Management
34
Hadoop 2.x –Resource Management
35
YARN –Moving beyond MapReduce
36
Hadoop Cluster Modes
37
Hadoop 2.x –Configuration files
38
Hadoop 2.x Configuration Files -Apache Hadoop
39
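The individual configuration files are not listed in this extraction; as a hedged illustration, the sketch below sets a few well-known properties programmatically via org.apache.hadoop.conf.Configuration. The values are placeholders; in a real cluster they live in core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

import org.apache.hadoop.conf.Configuration;

// Sketch only: placeholder values for properties that normally come from the
// Hadoop 2.x configuration files.
public class ConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // core-site.xml
        conf.set("dfs.replication", "3");                      // hdfs-site.xml
        conf.set("mapreduce.framework.name", "yarn");          // mapred-site.xml
        conf.set("yarn.resourcemanager.hostname", "rm-host");  // yarn-site.xml
        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
    }
}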
Data Loading Techniques & Data Analysis
40
Where MapReduce is Used?
41
The Traditional Way
42
MapReduce Way
43
NOTE:
•We need not worry about splitting the data. Hadoop splits the data based on the block size, and distributes and manages it internally.
•Hadoop manages the file and directory structure by itself through the NameNode (NN).
•Requests for reading and processing come through the NN.
•The Mapper and Reducer logic is simple.
Why MapReduce?
•The majority of data processing in the real world – 70–80% – is text-based: e-mail bodies, CSV files, XML, JSON, ...
•Command to run a MapReduce program: hadoop jar wc.jar /input_file_path /output_dir_path (see slide 65)
•Note: When a MapReduce job runs, the NN tells where the input_file is located and gives the specification of the DNs where the output has to be written.
44
MapReduce Basic Flow
45
Note:
•There is no need to store intermediate output on HDFS; it is stored in the local file system.
•If the final HDFS output is bigger than the block size, splitting and replication will be applied.
Sequence of MapReduce Execution
•MR dump – after the execution of all the Mappers, the Reducers run.
46
Example – Scenario (Web log analysis)
•Problem statement: count the number of clicks for a link between 4:00 am and 5:00 am
47
How do we decide the number of Map & Reduce tasks?
•# Mappers:
–Mappers run based on the number of blocks
–Map tasks run in parallel, each processing a different block
–There is no coordination between one Map task and another
•# Reducers (see the sketch below):
–Reducers are optional; not every problem statement requires aggregation
–By default, the number of Reducers is 1
–A Reduce task may run anywhere in the cluster; we have no control over this. Whichever machine is free will be allocated.
48
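A minimal sketch of setting the Reducer count from the driver, assuming the standard org.apache.hadoop.mapreduce.Job API (the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: the number of Map tasks follows from the input splits (blocks);
// only the number of Reduce tasks is chosen by the developer (default 1).
public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(2);   // 0 would make this a map-only job
        System.out.println("Reducers: " + job.getNumReduceTasks());
    }
}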
Getting the Data from the Input File
•InputFormat
–It is responsible for generating/giving input to the Mappers
•TextInputFormat generates Key-Value pairs for the Mappers
49
Key-Value (K/V) Pair
•In MapReduce, all 4 stages of data are represented by Key-Value pairs:
–Input to the Mapper & Output from the Mapper
–Input to the Reducer & Output from the Reducer
50
Stage of data       | Who is responsible                                                 | Data format
Input to Mapper     | InputFormat                                                        | Individual K/V pair
Output from Mapper  | Developer decides, based on the implementation logic              | Individual K/V pair
Input to Reducer    | Same as the Mapper's output (sorted and grouped by the framework) | Key, list of values
Output from Reducer | Developer decides, based on the aggregation logic                 | Individual K/V pair
Why Key-Value Pair?
•A Key-Value pair is the record entity that Hadoop MapReduce accepts for data processing.
•Example: for a line of a text file, TextInputFormat produces the pair (byte offset, line); a word-count Mapper then emits pairs such as (word, 1).
51
Scenario
•Word Count
52
•TextInputFormat generates Key-Value pairs
•TextInputFormat gives input to the Mapper
53
Byte offset (location starting from 0) = key
Entire line = value
1st key is 0; 1st value is the first line
2nd key is 26; 2nd value is the second line
NOTE: To do word count, we do not need the byte offset. We need only the value.
Mapper’s Input for wordcount
Map Logic for wordcount
•Map task
•Map output
54
NOTE: Aggregation logic will be left to Reducer
Role of the Hadoop Framework in MapReduce
•The Hadoop framework reads all of the Map output
•It sorts by key and prepares a list of values for each unique key (e.g., (is, [1, 1]), (this, [1, 1]))
55
MapReduce Paradigm
57
[Diagram: 3 blocks of data; Mapper 1, Mapper 2, and Mapper 3 run, one per block]
Anatomy of a MapReduce Program
58
[Diagram labels: data given to Map; business logic of Map; Mapper output (developer's choice); input data given to Reducer; business logic of Reduce; Reducer's output]
NOTE:
•Input to the Mapper is a single Key-Value pair per call; input to the Reducer is a key with its list of values
•The Reducer's output is stored in HDFS (intermediate Mapper output stays on the local file system)
Combiner
•An optional "mini-reducer" that aggregates each Mapper's output locally before the shuffle, reducing the data transferred to the Reducers
59
Demo of WordCount Program
•MapReduce Programming
–Driver class
–Map class
–Reduce class
•We will not use Java primitive datatypes; Hadoop's Writable wrapper types (LongWritable, Text, IntWritable, ...) are used instead
60
WordCount Program – Data Flow
61
Driver class
62
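The Driver code itself is not reproduced in this extraction; below is a minimal sketch of a typical WordCount driver using the org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer refer to the sketch after slide 63; the wiring shown is an assumption, not the instructor's exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a WordCount driver: wires the Mapper, Reducer, key/value types,
// and the input/output paths taken from args[0] and args[1].
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Optional: the reducer can also serve as a combiner for word count.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}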
Map class
63
The Map class takes input pairs (K1, V1) = (byte offset, line) and emits output pairs (K2, V2) = (word, 1).
Example input: (0, this is my first program in map reduce and this is my favorite)
Example output (Map's business logic): (this, 1), (is, 1), …, (this, 1)
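The class bodies are not included in this extraction; below is a minimal sketch of the Map and Reduce classes for WordCount, using Hadoop's Writable types rather than Java primitives (class and variable names are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (byte offset, line) -> (word, 1) for every word in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reducer: (word, [1, 1, ...]) -> (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, sum)
    }
}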
How to Run this program?
65
hadoop jar wc.jar /input_file_path /output_dir_path
(/input_file_path is args[0]; /output_dir_path is args[1])
Why MapReduce?
•Two advantages:
–Data locality optimization
•Taking the processing to the data
•Inter-rack network transfer happens only occasionally
–Processing data in parallel
66
[Diagram labels: data-local, rack-local, off-rack]
MapReduce Framework
67
Hadoop Configuration
68
Hadoop 2.x –MapReduce Components
69
Map Reduce Components
•Container: the resources required for the job
–The NodeManager creates a container for the job and allocates resources to it
–Containers are allocated by the NodeManager
•ApplicationMaster: monitors and takes care of every job
•ResourceManager: maintains the complete cluster
70
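As a hedged illustration of the ResourceManager's cluster-wide view (not part of the slides), the sketch below uses the standard org.apache.hadoop.yarn.client.api.YarnClient to list the applications the ResourceManager is managing; connection details are assumed to come from the cluster's yarn-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Sketch: ask the ResourceManager (via YarnClient) for the applications it
// is managing; each application has its own ApplicationMaster and containers.
public class YarnAppsSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());   // reads yarn-site.xml settings
        yarnClient.start();
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName());
        }
        yarnClient.stop();
    }
}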