Big Data and Hadoop
Veningston K.
Assistant Professor
Department of Computer Science and Engineering
National Institute of Technology Srinagar
[email protected], [email protected]
Contents
2
1. Explosion in Quantity of Data
2. Big Data Characteristics
3. Usage Example in Big Data
4. Importance of Big Data
5. Challenges in Big Data
Big Data vs. Hadoop vs. Map Reduce
Data Analytics Architecture
Hadoop Ecosystem
Hive
Pig
What is big data?
“A massive volume of data sets that is so large & complex that it becomes
difficult to process with traditional database management tools.”
3
Big Data Challenges
•Capturing
•Curation – Integration
•Storage
•Search
•Sharing
•Transfer
•Analysis
•Visualization
Explosion in Quantity of Data
4
[Figure: what happens on the Internet in 60 seconds]
Explosion in Quantity of Data
5
Volume, Veracity, Velocity, Variety, and Value
Banking/Marketing/IT: Volume, Velocity, and Value
Healthcare/Life Sciences: Veracity (sparse/inconsistency), Variety, and Value
6
5 V’s of Big Data
7
Analysis vs. Analytics
•The primary difference between analytics and analysis is one of scope: data analytics is the broader term, of which data analysis is one component.
•Data Analysis:
–The process of compiling, examining, and transforming data to support decision making.
•Data Analytics:
–Covers not only analysis but also data collection, organization, and storage, together with the tools and techniques used for all of these steps.
8
Hidden Treasures
(Advantages of Big Data)
•Insights into data can provide business advantage
–Ex: healthcare – analyzing disease patterns/trends to model future demand and to invest in strategic R&D
•Some key early indicators can mean fortunes to business
–Ex: financial industries analyze transaction data, real-time market feeds, and social media data to find odd customer behavior and minimize risk
•More precise analysis with more data
–Ex: retail – point-of-sale and supply chain management data. Use these new insights to target customers with highly specific offers or to run location-based promotions in real time.
Limitations of Existing Data Analytics Architecture –
Traditional Data Flow Architecture (Non-Hadoop Architecture)
9
[Diagram labels: data generated from web servers, network equipment, system logs, sensors, etc.; data cleaning; expensive servers; limitation on the size of the data; processing layer – dashboard; moving data to tapes]
Solution: A Combined Storage-Compute Layer
HADOOP
10
[Diagram labels: commodity computers to store and process the data; original high-fidelity raw data to process; yearly/quarterly analysis]
Hadoop Cluster
•300-node Hadoop cluster
–A Hadoop cluster having 300 servers/machines/nodes
•Assume each node provides 10 TB of storage:
–300 nodes = 10 TB * 300 = 3000 TB ==> 3 PB
–The cluster can store up to 3 PB of data to process
•Hadoop uses commodity computers to store and process the data
–No need for higher-end servers
•Scale-out technology (see the capacity sketch below)
–If an organization gets 1 PB of extra data, it just needs to add another 100 nodes to the cluster, making it a 400-node cluster, without changing any configuration in the existing Hadoop cluster.
–Any amount of data can be stored in a cost-effective manner to keep data alive forever (no need for a separate backup)
11
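A minimal sketch of the capacity and scale-out arithmetic from this slide; the node count and per-node storage are the slide's example values, not fixed properties of Hadoop.

// Toy arithmetic from the slide: 300 nodes x 10 TB/node = 3000 TB ~= 3 PB.
public class ClusterCapacity {
    public static void main(String[] args) {
        int nodes = 300;          // example cluster size from the slide
        int tbPerNode = 10;       // example per-node storage from the slide
        int totalTb = nodes * tbPerNode;
        System.out.println(totalTb + " TB ~= " + totalTb / 1000 + " PB");
        // Scale-out: adding 100 more nodes adds another 1000 TB (~1 PB)
        // without reconfiguring the existing cluster.
        System.out.println((nodes + 100) * tbPerNode + " TB after adding 100 nodes");
    }
}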
Hadoop is a DFS – Why DFS?
12
1 TB = 1024 * 1024 MB
((1024 * 1024) / (4 * 100)) / 60 ≈ 44 min – the time to read 1 TB with 4 parallel readers of 100 MB/s each
13
Hadoop is a DFS – Why DFS?
Spreading the blocks across more machines that read in parallel cuts this time proportionally, which is why Hadoop uses a distributed file system (see the sketch below).
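A minimal sketch of this read-time arithmetic; the 4-reader and 10-machine figures are illustrative assumptions, not values fixed by Hadoop.

// Toy read-time arithmetic: 1 TB at 100 MB/s per reader.
public class WhyDfs {
    public static void main(String[] args) {
        double totalMb = 1024 * 1024;        // 1 TB expressed in MB
        double mbPerSec = 100;               // per-reader throughput
        double oneReaderMin = totalMb / mbPerSec / 60;            // ~175 min
        double fourReadersMin = totalMb / (4 * mbPerSec) / 60;    // ~44 min
        double tenMachinesMin = fourReadersMin / 10;              // ~4.4 min (assumption: 10 such machines)
        System.out.printf("1 reader: %.0f min, 4 readers: %.0f min, 10 machines: %.1f min%n",
                oneReaderMin, fourReadersMin, tenMachinesMin);
    }
}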
What is Hadoop?
•Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model
•It is an open-source data management framework with scale-out storage and distributed processing
14
[Diagram labels: foundation layer; MapReduce; no need for higher-end servers]
Hadoop Key Characteristics
15
Commodity Computers
Adding Nodes
Schema-less while writing – it can absorb any kind of data from any source
Fault tolerant
Hadoop enables...
•Scalable
–New nodes can be added as needed
•Cost effective
–Hadoop brings massively parallel computing to commodity servers
–Sizeable decrease in the cost per terabyte of storage
•Flexible
–Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources
•Fault tolerant
–When you lose a node, the system redirects work to another location of the data and continues processing
16
Hadoop Core Components and Daemons
•Hadoop has a Master-Slave architecture
–Some daemons are Masters and some are Slaves
•Master daemons tell the Slave daemons what to do
–Slaves obey the Master's orders
–Whatever the NameNode tells, the DataNode does
•In the YARN component, whatever the ResourceManager tells, the NodeManager does
17
Simple cluster with Hadoop Daemons
18
RDBMS vs. Hadoop
19
Hadoop Ecosystem
20
Hadoop 2.x Core Components
21
Main Components of HDFS
22
NameNode Metadata
23
File Blocks
24
HDFS Architecture
25
Data is never passed through the NameNode.
Hadoop library
Hadoop library
Anatomy of a File Read
26
Client JVM
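The read-path diagram itself is not reproduced here; below is a minimal, hedged client-side sketch using the standard org.apache.hadoop.fs.FileSystem API (the URI and file path are placeholders). The client asks the NameNode for block locations and then streams the blocks directly from the DataNodes.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of an HDFS read from the client JVM: open() contacts the NameNode
// for metadata; the returned stream reads block data from the DataNodes.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); // placeholder URI
        try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {        // placeholder path
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}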
Anatomy of a File Write
27
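Correspondingly, a minimal write sketch under the same placeholder URI/path assumptions: the client asks the NameNode where to write, then streams data into a pipeline of DataNodes that replicate each block.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of an HDFS write from the client JVM: create() asks the NameNode
// for target DataNodes; the stream pushes data through the replication pipeline.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);  // placeholder URI
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/out.txt"))) { // placeholder path
            out.writeBytes("hello hadoop\n");
        }
    }
}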
Replication and Rack Awareness
28
Hadoop 2.x Cluster Architecture
29
Hadoop 2.x Cluster Architecture
30
Hadoop 2.x Cluster Architecture
31
[Diagram labels: R data, Finance data, Marketing data]
Hadoop 2.x –High Availability
32
Data sync
Hadoop 2.x –High Availability
33
Hadoop 2.x –Resource Management
34
Hadoop 2.x –Resource Management
35
YARN –Moving beyond MapReduce
36
Hadoop Cluster Modes
37
Hadoop 2.x –Configuration files
38
Hadoop 2.x Configuration Files -Apache Hadoop
39
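The individual configuration files are not listed in this extraction; as a hedged illustration, the sketch below sets a few well-known properties programmatically via org.apache.hadoop.conf.Configuration. The values are placeholders; in a real cluster they live in core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

import org.apache.hadoop.conf.Configuration;

// Sketch only: placeholder values for properties that normally come from the
// Hadoop 2.x configuration files.
public class ConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // core-site.xml
        conf.set("dfs.replication", "3");                      // hdfs-site.xml
        conf.set("mapreduce.framework.name", "yarn");          // mapred-site.xml
        conf.set("yarn.resourcemanager.hostname", "rm-host");  // yarn-site.xml
        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
    }
}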
Data Loading Techniques & Data Analysis
40
Where MapReduce is Used?
41
The Traditional Way
42
MapReduce Way
43
NOTE:
•We need not worry about splitting the data. Hadoop splits the data based on the block size, and distributes and manages it internally.
•Hadoop manages the file and directory structure by itself through the NameNode (NN).
•Requests for reading and processing come through the NN.
•The Mapper and Reducer logic is simple.
Why MapReduce?
•The majority of data processing in the real world – 70–80% – is text-based: e-mail bodies, CSV files, XML, JSON, ...
•Command to run a MapReduce program: hadoop jar wc.jar /input_file_path /output_dir_path (see slide 65)
•Note: When a MapReduce job runs, the NN tells where the input_file is located and gives the specification of the DNs where the output has to be written.
44
MapReduce Basic Flow
45
Note:
•There is no need to store intermediate output on HDFS; it is stored in the local file system.
•If the final HDFS output is bigger than the block size, splitting and replication will be applied.
Sequence of MapReduce Execution
•MR dump – after the execution of all the Mappers, the Reducers run.
46
Example – Scenario (Web log analysis)
•Problem statement: count the number of clicks for a link between 4:00 am and 5:00 am
47
How do we decide the number of Map & Reduce tasks?
•# Mappers:
–Mappers run based on the number of blocks
–Map tasks run in parallel, each processing a different block
–There is no coordination between one Map task and another
•# Reducers (see the sketch below):
–Reducers are optional; not every problem statement requires aggregation
–By default, the number of Reducers is 1
–A Reduce task may run anywhere in the cluster; we have no control over this. Whichever machine is free will be allocated.
48
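A minimal sketch of setting the Reducer count from the driver, assuming the standard org.apache.hadoop.mapreduce.Job API (the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: the number of Map tasks follows from the input splits (blocks);
// only the number of Reduce tasks is chosen by the developer (default 1).
public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(2);   // 0 would make this a map-only job
        System.out.println("Reducers: " + job.getNumReduceTasks());
    }
}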
Getting the Data from the Input File
•InputFormat
–It is responsible for generating/giving input to the Mappers
•TextInputFormat generates Key-Value pairs for the Mappers
49
Key-Value (K/V) Pair
•In MapReduce, all 4 stages of data are represented by Key-Value pairs:
–Input to the Mapper & Output from the Mapper
–Input to the Reducer & Output from the Reducer
50
Stage of data       | Who is responsible                                                 | Data format
Input to Mapper     | InputFormat                                                        | Individual K/V pair
Output from Mapper  | Developer decides, based on the implementation logic              | Individual K/V pair
Input to Reducer    | Same as the Mapper's output (sorted and grouped by the framework) | Key, list of values
Output from Reducer | Developer decides, based on the aggregation logic                 | Individual K/V pair
Why Key-Value Pair?
•A Key-Value pair is the record entity that Hadoop MapReduce accepts for data processing.
•Example: for a line of a text file, TextInputFormat produces the pair (byte offset, line); a word-count Mapper then emits pairs such as (word, 1).
51
Scenario
•Word Count
52
•TextInputFormat generates Key-Value pairs
•TextInputFormat gives input to the Mapper
53
Byte offset (location starting from 0) = key
Entire line = value
1st key is 0; 1st value is the first line
2nd key is 26; 2nd value is the second line
NOTE: To do word count, we do not need the byte offset. We need only the value.
Mapper’s Input for wordcount
Map Logic for wordcount
•Map task
•Map output
54
NOTE: Aggregation logic will be left to Reducer
Role of the Hadoop Framework in MapReduce
•The Hadoop framework reads all of the Map output
•It sorts by key and prepares a list of values for each unique key (e.g., (is, [1, 1]), (this, [1, 1]))
55
MapReduce Paradigm
57
[Diagram: 3 blocks of data; Mapper 1, Mapper 2, and Mapper 3 run, one per block]
Anatomy of a MapReduce Program
58
[Diagram labels: data given to Map; business logic of Map; Mapper output (developer's choice); input data given to Reducer; business logic of Reduce; Reducer's output]
NOTE:
•Input to the Mapper is a single Key-Value pair per call; input to the Reducer is a key with its list of values
•The Reducer's output is stored in HDFS (intermediate Mapper output stays on the local file system)
Combiner
•An optional "mini-reducer" that aggregates each Mapper's output locally before the shuffle, reducing the data transferred to the Reducers
59
Demo of WordCount Program
•MapReduce Programming
–Driver class
–Map class
–Reduce class
•We will not use Java primitive datatypes; Hadoop's Writable wrapper types (LongWritable, Text, IntWritable, ...) are used instead
60
WordCount Program – Data Flow
61
Driver class
62
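The Driver code itself is not reproduced in this extraction; below is a minimal sketch of a typical WordCount driver using the org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer refer to the sketch after slide 63; the wiring shown is an assumption, not the instructor's exact code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a WordCount driver: wires the Mapper, Reducer, key/value types,
// and the input/output paths taken from args[0] and args[1].
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Optional: the reducer can also serve as a combiner for word count.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}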
Map class
63
The Map class takes input pairs (K1, V1) = (byte offset, line) and emits output pairs (K2, V2) = (word, 1).
Example input: (0, this is my first program in map reduce and this is my favorite)
Example output (Map's business logic): (this, 1), (is, 1), …, (this, 1)
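The class bodies are not included in this extraction; below is a minimal sketch of the Map and Reduce classes for WordCount, using Hadoop's Writable types rather than Java primitives (class and variable names are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (byte offset, line) -> (word, 1) for every word in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reducer: (word, [1, 1, ...]) -> (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, sum)
    }
}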
How to Run this program?
65
hadoop jar wc.jar /input_file_path /output_dir_path
(/input_file_path is args[0]; /output_dir_path is args[1])
Why MapReduce?
•Two advantages:
–Data locality optimization
•Taking the processing to the data
•Inter-rack network transfer happens only occasionally
–Processing data in parallel
66
[Diagram labels: data-local, rack-local, off-rack]
MapReduce Framework
67
Hadoop Configuration
68
Hadoop 2.x –MapReduce Components
69
Map Reduce Components
•Container: the resources required for the job
–The NodeManager creates a container for the job and allocates resources to it
–Containers are allocated by the NodeManager
•ApplicationMaster: monitors and takes care of every job
•ResourceManager: maintains the complete cluster
70
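As a hedged illustration of the ResourceManager's cluster-wide view (not part of the slides), the sketch below uses the standard org.apache.hadoop.yarn.client.api.YarnClient to list the applications the ResourceManager is managing; connection details are assumed to come from the cluster's yarn-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Sketch: ask the ResourceManager (via YarnClient) for the applications it
// is managing; each application has its own ApplicationMaster and containers.
public class YarnAppsSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());   // reads yarn-site.xml settings
        yarnClient.start();
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName());
        }
        yarnClient.stop();
    }
}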