1. Big Data - Introduction(what is bigdata).pdf

AmanCSE050 19 views 75 slides Jun 13, 2024

About This Presentation

Big Data Characteristics
Contents
Explosion in Quantity of Data
Importance of Big Data
Usage Example in Big Data
Challenges in Big Data
Hadoop Ecosystem


Slide Content

Big Data and Hadoop
Veningston K.
Assistant Professor
Department of Computer Science and Engineering
National Institute of Technology Srinagar
[email protected], [email protected]

Contents
2
Explosion in Quantity of Data
Big Data Characteristics
Usage Example in Big Data
Importance of Big Data
Challenges in Big Data
Big Data vs. Hadoop vs. MapReduce
Data Analytics Architecture
Hive
Pig
Hadoop Ecosystem

What is big data?
“A massive volume of data sets that is so large & complex that it becomes
difficult to process with traditional database management tools.”
3
Big Data Challenges
•Capturing
•Curation / Integration
•Storage
•Search
•Sharing
•Transfer
•Analysis
•Visualization

Explosion in Quantity of Data
4
Internet 60 Seconds Image

Explosion in Quantity of Data
5

5 V's of Big Data
6
Volume, Velocity, Variety,
Veracity, and Value
Banking/Marketing/IT:
Volume, Velocity, and Value
Healthcare/Life Sciences:
Veracity (sparse/inconsistent data), Variety, and Value

7
Analysis vs. Analytics
•The primary difference between analytics and analysis is a
matter of scale, as data analytics is a broader term of which
data analysis is a subcomponent.
•Data Analysis:
–Data analysis refers to the process of compiling, examining,
transforming and analyzing the data to support decision making.
•Data Analytics:
–Data analytics includes not only analysis, but also data collection,
organization, storage, and all the tools and techniques used to do so.

8
Hidden Treasures
(Advantages of Big Data)
•Insights into data can provide business advantage
–Ex: healthcare: analyzing disease patterns/trends to model future
demand and to make strategic R&D investments
•Some key early indicators can mean fortunes to business
–Ex: financial industries analyse transaction data, real-time
market feeds, and social media data to find odd customer behavior and
minimize risk
•More precise analysis with more data
–Ex: retail: point-of-sale and supply chain management data. Use
these new insights to deliver highly targeted offers to customers or do
location-based promotions in real time.

Limitations of Existing Data Analytics Architecture -
Traditional Data Flow Architecture
(Non-Hadoop Architecture)
9
[Diagram: data generated from web servers, network equipment, system logs,
sensors, etc. goes through data cleaning into expensive servers (with a
limitation on the size of the data), then to a processing layer/dashboard;
old data is moved to tapes.]

Solution: A Combined Storage + Compute Layer -
HADOOP
10
[Diagram: commodity computers store and process the data; the original
high-fidelity raw data is kept available to process, enabling
yearly/quarterly analysis.]

Hadoop Cluster
•300-node Hadoop cluster
–Hadoop cluster having 300 servers/machines/nodes
•Assume each node has 10 TB of storage,
–300 nodes = 10 TB * 300 = 3000 TB ==> 3 PB
–The cluster can store up to 3 PB of data to process
•Hadoop uses commodity computers to store and process the
data.
–No need for higher-end servers
•Scale-out technology
–If an organization gets 1 more PB of data, they just need to add
another 100 nodes to this cluster to make it a 400-node cluster without
changing any configuration in the existing Hadoop cluster.
–Any amount of data can be stored cost-effectively to keep
data alive forever (no need for a separate backup)
11
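The scale-out arithmetic above can be sketched in a few lines (10 TB per node is the slide's assumed figure, not a Hadoop constant):

```python
# Sketch of the scale-out arithmetic above: cluster capacity grows
# linearly with node count (10 TB/node is the slide's assumption).

NODE_TB = 10  # assumed storage per node

def cluster_capacity_pb(nodes: int) -> float:
    """Total cluster capacity in PB (decimal: 1 PB = 1000 TB)."""
    return nodes * NODE_TB / 1000

print(cluster_capacity_pb(300))  # 3.0 PB
print(cluster_capacity_pb(400))  # 4.0 PB after adding 100 nodes
```

Adding nodes is the only step; no reconfiguration of the existing cluster is implied.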

Hadoop is a DFS -
Why DFS?
12
1 TB = 1024*1024 MB
Reading 1 TB at 100 MB/s on one machine: ((1024*1024)/100)/60 ≈ 175 min;
split across 4 machines reading in parallel: ((1024*1024)/4/100)/60 ≈ 44 min
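The read-time arithmetic can be sketched as follows (100 MB/s is an assumed per-machine sequential read rate, not a measured figure):

```python
# Time to read 1 TB sequentially vs. in parallel across machines
# (illustrative; 100 MB/s is an assumed per-disk read rate).

TB_IN_MB = 1024 * 1024   # 1 TB expressed in MB
READ_RATE_MB_S = 100     # assumed sequential read throughput per machine

def read_time_minutes(total_mb: int, machines: int = 1) -> float:
    """Minutes to read total_mb when the data is split evenly across machines."""
    return (total_mb / machines) / READ_RATE_MB_S / 60

print(f"1 machine:  {read_time_minutes(TB_IN_MB):.1f} min")     # ~174.8 min
print(f"4 machines: {read_time_minutes(TB_IN_MB, 4):.1f} min")  # ~43.7 min
```

This is the motivation for a distributed file system: splitting the data across machines lets them read in parallel.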

13
Hadoop is a DFS -
Why DFS?

What is Hadoop?
•Apache Hadoop is a framework that allows for the
distributed processing of large data sets across
clusters of commodity computers using a simple
programming model
•It is open-source data management with scale-out
storage and distributed processing
14
Foundation layer
MapReduce
No need for higher-end servers

Hadoop Key Characteristics
15
Commodity computers
Adding nodes
Schema-less while writing - it
can absorb any kind of data
from any source
Fault-tolerant

Hadoop enables...
•Scalable
•New nodes can be added as needed
•Cost effective
•Hadoop brings massively parallel computing to commodity servers.
•Sizeable decrease in the cost per terabyte of storage
•Flexible
•Hadoop is schema-less, and can absorb any type of data, structured or not, from any number
of sources.
•Fault tolerant
•When you lose a node, the system redirects work to another location of the data and
continues processing
16

Hadoop core components and Daemons
•Hadoop is a Master-Slave architecture
–Some daemons are Masters and some are Slaves
•Master daemons tell the Slave daemons what
to do.
–Slaves obey the Master's orders
–Whatever the NameNode tells it, the DataNode does.
•In the YARN component, whatever the ResourceManager
tells it, the NodeManager does.
17

Simple cluster with Hadoop Daemons
18

RDBMS vs. Hadoop
19

Hadoop Ecosystem
20

Hadoop 2.x Core Components
21

Main Components of HDFS
22

NameNode Metadata
23

File Blocks
24

HDFS Architecture
25
Data will never be passed through the NameNode.
Hadoop library

Anatomy of a File Read
26
Client JVM

Anatomy of a File Write
27

Replication and Rack Awareness
28

Hadoop 2.x Cluster Architecture
29

Hadoop 2.x Cluster Architecture
30

Hadoop 2.x Cluster Architecture
31
R data Finance data Marketing data

Hadoop 2.x –High Availability
32
Data sync

Hadoop 2.x –High Availability
33

Hadoop 2.x –Resource Management
34

Hadoop 2.x –Resource Management
35

YARN –Moving beyond MapReduce
36

Hadoop Cluster Modes
37

Hadoop 2.x –Configuration files
38

Hadoop 2.x Configuration Files -Apache Hadoop
39

Data Loading Techniques & Data Analysis
40

Where MapReduce is Used?
41

The Traditional Way
42

MapReduce Way
43
NOTE:
•We need not worry about splitting the data. Hadoop splits data based on block
size, and distributes and manages it internally.
•Hadoop manages the file and directory structure by itself through the NameNode (NN).
•Requests for reading and processing come through the NN.
•Mapper and Reducer logic is simple

Why MapReduce?
•Majority of data processing in the real world - 70-80% - is
text-based: e-mail bodies, CSV files, XML, JSON, ...
•Command to run a MapReduce program:
•Note: When a MapReduce job runs, the NN will tell where the input file is located and will
give the specification of the DNs where the output has to be written.
44

MapReduce Basic Flow
45
Note:
•Intermediate output is not stored on HDFS; it is kept in the local file system.
•If the final HDFS output is bigger than the block size, splitting and replication will be applied.

Sequence of MapReduce Execution
•After all Mappers have finished execution, the Reducers run.
46

Example–Scenario (Web log analysis)
•Problem statement: count the number of clicks for a link between 4:00 am and 5:00
am
47

How do we decide number of Map & Reduce?
•# Mappers:
–Mappers run based on the number of input blocks
–Map tasks run in parallel, each processing a different block.
–There is no coordination between Map tasks.
•# Reducers:
–Optional: not every problem statement requires aggregation
–By default, the number of Reducers is 1.
–A Reduce task may run anywhere in the cluster. We do not have
control; whichever machine is free will be allocated.
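The "one mapper per block" rule above can be sketched as a small calculation (128 MB is the Hadoop 2.x default HDFS block size; the function name is illustrative):

```python
import math

# Sketch: the number of map tasks is driven by the number of input blocks.
# 128 MB is the default HDFS block size in Hadoop 2.x (64 MB in 1.x).

def num_mappers(file_size_mb: float, block_size_mb: int = 128) -> int:
    """One map task per input block; at least one mapper per file."""
    return max(1, math.ceil(file_size_mb / block_size_mb))

print(num_mappers(1000))  # a 1000 MB file -> 8 blocks -> 8 mappers
print(num_mappers(50))    # smaller than one block -> 1 mapper
```

The reducer count, by contrast, is a job configuration choice (default 1), not derived from the input size.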
48

Getting the Data from InputFile
•InputFormat
–It is responsible for generating/giving input to the Mappers.
•TextInputFormat generates Key-Value pairs for the Mappers.
49

Key-Value (K/V) Pair
•In MapReduce, all 4 stages of data are represented as Key-Value pairs.
–Input to the Mapper & Output from the Mapper
–Input to the Reducer & Output from the Reducer
50
Stage of data         Who is responsible                Data format
Input to Mapper       InputFormat                       Individual K/V pair
Output from Mapper    Developer decides, based on       K/V pair
                      the implementation logic
Input to Reducer      Same as the Mappers' output       Key, list of values
Output from Reducer   Developer decides, based on       K/V pair
                      the aggregation logic

Why Key-Value Pair?
•The Key-Value pair is the record entity that Hadoop MapReduce accepts for
data processing.
•Example:
51

Scenario
•Word Count
52

•TextInputFormat generates Key-Value pairs
•TextInputFormat gives input to the Mapper
53
Byte offset (location starting from 0)  -> key
Entire line                             -> value
1st key is 0; 1st value is the first line
2nd key is 26; 2nd value is the second line
NOTE: To do word count, we do not need the byte offset. We need only the Value.
Mapper’s Input for word count
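The (byte offset, line) pairs described above can be sketched in a few lines (illustrative only; real TextInputFormat does this per input split in Java):

```python
# Sketch: generate TextInputFormat-style (byte offset, line) pairs
# from raw text (illustrative; Hadoop does this per block/split in Java).

def text_input_format(text: str):
    """Yield (byte_offset, line) pairs; offsets counted from 0."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

data = "this is my first program\nin map reduce\n"
for key, value in text_input_format(data):
    print(key, value)
```

The key of each record is the starting byte position of the line, which is why the second key in the slide's example is a non-zero offset rather than a line number.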

Map Logic for wordcount
•Map task
•Map output
54
NOTE: Aggregation logic will be left to Reducer

Role of Hadoop Framework in MapReduce
•The Hadoop Framework reads every Map's output
•It sorts by key and prepares a list of values for each unique key
55

Reducer Logic for wordcount
•Reducer’s input
•Reducer’s Logic
•Reducer’s Output
56
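The Mapper, sort/shuffle, and Reducer steps described on the last few slides can be simulated end to end in plain Python (a sketch, not Hadoop's Java API; the sample lines are illustrative):

```python
from itertools import groupby

# Sketch: the word-count data flow above, simulated in plain Python.

def mapper(offset, line):
    """Map: emit (word, 1) for every word; the byte-offset key is ignored."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Reduce: aggregate the list of values for one key."""
    return word, sum(counts)

records = [(0, "this is my first program"), (26, "this is my favorite")]

# Map phase
mapped = [kv for off, line in records for kv in mapper(off, line)]

# Sort & shuffle: the framework sorts by key and groups values per key
mapped.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in g]) for k, g in groupby(mapped, key=lambda kv: kv[0])]

# Reduce phase
result = dict(reducer(k, vs) for k, vs in grouped)
print(result["this"], result["is"])  # 2 2
```

Note that only the map and reduce functions are user code; the sort/group step in the middle is done by the framework.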

MapReduce Paradigm
57
3 Blocks of data
Mapper 1 runs
Mapper 2 runs
Mapper 3 runs

Anatomy of a MapReduce Program
58
[Diagram: input data given to Map -> business logic of Map -> Mapper
output (developer's choice) -> input data given to Reduce -> business
logic of Reduce -> Reducer's output.]
NOTE:
•Input to both the Mapper and the Reducer will always be a single Key-Value pair at a time
•Reducer's output will be stored in HDFS.

Combiner
59
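The slide text for the combiner did not survive extraction; as a minimal sketch, a combiner runs reducer-like aggregation on each mapper's local output before the shuffle, reducing network traffic (word-count style pairs assumed):

```python
from collections import Counter

# Sketch: a combiner aggregates one mapper's local output before the
# shuffle, so fewer (word, count) pairs cross the network.

def combine(pairs):
    """[(word, count), ...] from one mapper -> [(word, local_total), ...]"""
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return sorted(counts.items())

print(combine([("this", 1), ("is", 1), ("this", 1)]))
# [('is', 1), ('this', 2)]
```

For word count the combiner can reuse the reducer logic, since summing partial sums gives the same result; this only works when the aggregation is associative and commutative.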

Demo of WordCount Program
•MapReduce Programming
–Driver class
–Map class
–Reduce class
•We will not use Java primitive datatypes (Hadoop uses its own
Writable wrapper types instead).
60

WordCount Program - Data Flow
61

Driver class
62

Map class
63
[Diagram: Map's business logic transforms (K1, V1) into (K2, V2) pairs;
e.g., input (0, "this is my first program in map reduce and this is my favorite")
is emitted as (this, 1), (is, 1), ..., (this, 1).]

Reduce class
64
[Diagram: the aggregation logic transforms (K2, V2) into (K3, V3);
e.g., input (this, [1, 1]) is emitted as (this, 2).]

How to Run this program?
65
hadoop jar wc.jar /input_file_path /output_dir_path
args[0] args[1]

Why MapReduce?
•Two advantages:
–Data locality optimization:
taking the processing to the data
•Inter-rack network transfer
happens only occasionally
–Processing data in parallel
66
Data local / Rack local / Off rack

MapReduce Framework
67

Hadoop Configuration
68

Hadoop 2.x –MapReduce Components
69

Map Reduce Components
•ContainerResources required for the job
–Node Manager creates a container for job and allocates resources for job.
–Containers are allocated by this Node Manager.
•AppMastermonitor and takes care of every job.
•Resource Manager maintains the complete cluster.
70

YARN MR Application Execution Flow
•MapReduce Job Execution
–Job submission
–Job initialization
–Tasks Assignment
–Memory Assignment
–Status Updates
–Failure recovery
71

Application Workflow
72

MapReduce Paradigm
73

Past vs. Future
74

Thank
You all
75