Big Data Analytics - Module Introduction


About This Presentation

As the size and complexity of data increase, the proportion of unstructured data types also increases. An example of a traditional tool for structured data storage and querying is the RDBMS. The volume, velocity and variety (3Vs) of data need the use of a number of programs and tools for analyzing and pr...


Slide Content

Introduction to Hadoop

The Apache Software Foundation initiated the project to develop a storage and processing framework for Big Data. Doug Cutting and Michael J. Cafarella, its creators, named the framework Hadoop. It consists of two components: one stores data in blocks across the cluster, and the other runs computations at each individual node in parallel with the others.

Hadoop components are written in Java, with part of the native code in C; the command-line utilities are written as shell scripts. Hadoop is a computing environment in which input data is stored, processed and the results stored again. The complete system consists of a scalable, distributed set of clusters, with the infrastructure typically provided by a cloud. The Hadoop platform thus provides a low-cost Big Data platform that is open source and can use cloud services.

Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce. The system characteristics are a scalable, self-manageable, self-healing, distributed file system. Scalable: it can be scaled up by adding storage and processing units as per the requirements. Self-manageable: storage and processing resources are created, used, scheduled and reduced or increased by the system itself. Self-healing: in case of faults, they are taken care of by the system itself.

Hadoop core components

Hadoop Common: the common module contains the libraries and utilities required by the other Hadoop modules. It provides various components and interfaces for the distributed file system and general input/output, including serialization, Java RPC and file-based data structures.
HDFS: a Java-based distributed file system that can store all kinds of data on the disks of the cluster.
MapReduce v1: the software programming model in Hadoop using Mapper and Reducer; v1 processes large sets of data in parallel and in batches.
YARN (Yet Another Resource Negotiator): software for managing the computing resources (scheduling and handling resource requests).

Features of Hadoop: fault-efficient, scalable, flexible and modular design. The system provides high scalability: new nodes can be added to handle larger data. Hadoop is helpful in storing, managing, processing and analysing Big Data. The modular design makes the system flexible; one can add or replace components with ease, and modularity allows a component to be replaced by a different software tool.

Robust design of HDFS: execution of Big Data applications continues even when an individual server or cluster fails, because Hadoop provides for backup (each block is replicated at least three times).

Store and process Big Data: processes Big Data with the 3V characteristics. Distributed cluster computing model with data locality: processes Big Data at high speed because application tasks and sub-tasks are submitted to the DataNodes (servers) that hold the data. One can achieve more computing power by increasing the number of computing nodes. The processing is split across multiple DataNodes, which gives fast processing and aggregated results.

Hardware fault-tolerant: a fault does not affect data and application processing. If a node goes down, another node takes care of the residue, because multiple copies of all data blocks are replicated automatically.

Open-source framework: open-source access and cloud services enable large data stores. Hadoop uses a cluster of multiple inexpensive servers or the cloud. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux, but it has its own set of supported shell commands.

Hadoop ecosystem components

The four layers are as follows: the distributed storage layer; the resource-manager layer for scheduling and executing job or application sub-tasks; the processing-framework layer, consisting of the Mapper and Reducer for the MapReduce process flow; and the APIs at the application-support layer.

Hadoop streaming: HDFS with MapReduce and a YARN-based system enables parallel processing of large datasets. Spark provides in-memory processing of data, thus improving the processing speed. Flink is emerging as a powerful tool; it improves overall performance by providing a single run-time for streaming as well as batch processing.

Hadoop Pipes: Hadoop Pipes is the C++ interface to MapReduce. Pipes means data streaming into the system at the Mapper input and aggregated results flowing out at the output. Pipes do not use standard I/O when communicating with the Mapper and Reducer code.

Hadoop Distributed File System: HDFS is a core component of Hadoop. HDFS is designed to run on a cluster of computers and servers as cloud-based utility services. HDFS stores data in a distributed manner in order to compute fast, and it stores data in any format, regardless of schema.

HDFS data storage: the Hadoop data-store concept implies storing data across a number of clusters. Each cluster has a number of data stores, called racks. Each rack holds a number of DataNodes, and each DataNode holds a large number of data blocks. The racks are distributed across the cluster. The nodes have processing and storage capabilities, and they hold the data blocks needed to run the application tasks. Data blocks are replicated by default on at least three DataNodes, on the same or remote racks. The default data block size is 64 MB.

Hadoop HDFS features are as follows: create, append, delete, rename and attribute-modification functions. The content of an individual file cannot be modified or replaced, but it can be appended with new data at the end of the file. Files are written once but read many times during usage and processing. The average file size can be more than 500 MB.
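As a minimal sketch of the write-once, append and read-many behaviour described above, the following uses the Hadoop FileSystem Java API. The NameNode URI and the file path /user/stu_files_dir/stuData_id96 are illustrative assumptions, and append must be enabled on the cluster.

// Minimal sketch of HDFS file operations using the Hadoop FileSystem API.
// The NameNode URI and the file path are illustrative assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/stu_files_dir/stuData_id96");

        // Write once: create the file and store a record.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("stuData_id96: initial record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Content cannot be modified in place, but it can be appended
        // (append support must be enabled on the cluster).
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("stuData_id96: appended record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}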

Hadoop cluster example: consider a data store for university students. Each student's data, stuData, is in a file of size less than 64 MB; a data block stores the full file for a student, stuData_idN, where N = 1 to 500. Q.1: How will the files of each student be distributed in a Hadoop cluster, and how many student files can be stored in one cluster? Assume that each rack has two DataNodes, each with 64 GB of storage, and that the cluster consists of 120 racks and thus 240 DataNodes. Q.2: What is the total memory capacity of the cluster in TB and of the DataNodes in each rack? Q.3: Show the distributed blocks for the students with ID = 96 and 1025; assume the default replication in the DataNodes = 3. Q.4: What changes when a stuData file size is <= 128 MB?

Solution Q.1: the default data block size is 64 MB and each student's file is smaller than 64 MB, so one data block suffices for each student file. A data block resides in a DataNode. Assume for simplicity that each rack has two nodes, each of storage capacity 64 GB. Each node can thus store 64 GB / 64 MB = 1024 data blocks = 1024 student files, and each rack can store 2 × 64 GB / 64 MB = 2048 data blocks = 2048 student files. Each data block is replicated three times by default among the DataNodes. Therefore, the number of students whose data can be stored in the cluster = (number of racks × number of files per rack) / 3 = 120 × 2048 / 3 = 81,920. Hence a maximum of 81,920 stuData_idN files can be distributed per cluster, with N = 1 to 81,920.

Solution Q.2: total memory capacity of the cluster = 120 racks × 2 × 64 GB = 15,360 GB = 15 TB. Total memory capacity of each DataNode in a rack = 1024 × 64 MB = 64 GB.

Solution Q.3: each of the files stuData_id96 and stuData_id1025 fits in a single 64 MB block; with replication = 3, each block is stored on three different DataNodes of the cluster (the distribution is shown in the slide figure).

Solution Q.4: when a stuData file can be up to 128 MB, each file occupies two 64 MB data blocks instead of one, so each node (and the cluster as a whole) can hold the files of only half the number of students.
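The arithmetic of Solutions Q.1, Q.2 and Q.4 can be checked with a few lines of code; the sketch below simply restates the figures used above (64 MB blocks, 64 GB DataNodes, two DataNodes per rack, 120 racks, replication factor 3).

// Reproduces the capacity arithmetic of the cluster example above.
public class ClusterCapacity {
    public static void main(String[] args) {
        long blockMB = 64;             // default HDFS block size in the example
        long nodeGB = 64;              // storage per DataNode
        int nodesPerRack = 2;
        int racks = 120;
        int replication = 3;

        long blocksPerNode = (nodeGB * 1024) / blockMB;                     // 1024
        long blocksPerRack = nodesPerRack * blocksPerNode;                  // 2048
        long filesPerCluster = (long) racks * blocksPerRack / replication;  // 81920
        long clusterGB = (long) racks * nodesPerRack * nodeGB;              // 15360 GB

        System.out.println("Student files per DataNode:          " + blocksPerNode);
        System.out.println("Student files per cluster (Q.1):     " + filesPerCluster);
        System.out.println("Cluster capacity in TB (Q.2):        " + clusterGB / 1024);
        // Q.4: a 128 MB student file needs two 64 MB blocks, halving the count.
        System.out.println("Files per cluster for 128 MB (Q.4):  " + filesPerCluster / 2);
    }
}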

Hadoop Physical Organisation

HDFS uses NameNodes and DataNodes. A NameNode stores the file metadata, which gives information about the files of the user application. DataNodes store the actual file data in data blocks. A few nodes in a Hadoop cluster act as NameNodes; these nodes are termed MasterNodes, or simply masters.

MasterNodes: the masters have a different configuration, supporting high DRAM and processing power, and much less local storage. DataNodes: the majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers; these nodes are referred to as slave nodes or slaves. The slaves have lots of disk storage and a moderate amount of processing capability and DRAM. Slaves are responsible for storing the data and processing the computation tasks submitted by the clients.

Clients, as the users, run applications with the help of Hadoop ecosystem projects, e.g. Hive, Mahout and Pig. In small to medium-sized clusters, a single MasterNode provides HDFS, MapReduce and HBase using threads.

A few nodes in a Hadoop cluster act as the NameNode; nodes are termed MasterNodes or SlaveNodes. MasterNode: fundamentally plays the role of a coordinator. It receives the client connections, maintains the description of the global file-system namespace and the allocation of the file blocks, and monitors the state of the system in order to detect any failure.

The masters consist of three components: NameNode, SecondaryNameNode and JobTracker. NameNode: stores all the file-system related information, such as in which part of the cluster each file section is stored, the last access time for the files, and user permissions (which user has access to which file). SecondaryNameNode: an alternate for the NameNode; it keeps a copy of the NameNode metadata, so the stored metadata can be rebuilt easily in case of a NameNode failure. JobTracker: coordinates the parallel processing of data. ZooKeeper: functions as a centralized repository for distributed applications; it uses synchronization, serialization and coordination activities, and it enables a distributed system to function as a single unit.

HDFS commands:
-mkdir: $ hdfs dfs -mkdir /user/stu_files_dir creates the directory named stu_files_dir.
-put: $ hdfs dfs -put stuData_id96 /user/stu_files_dir copies the file of the student with id96 into stu_files_dir.
-ls: assume all files are to be listed: $ hdfs dfs -ls /user/stu_files_dir
-cp: assume stuData_id96 is to be copied from stu_files_dir to the new student directory newstu_filesDir: $ hdfs dfs -cp /user/stu_files_dir/stuData_id96 /user/newstu_filesDir

MapReduce framework and programming model: MapReduce is a programming model for distributed computing. A Mapper means software for doing the assigned task after organizing the data blocks imported using the keys. A key is specified in a command line of the Mapper; the command maps the key to the data which an application uses.

A Reducer means software for reducing the mapped data by using an aggregation, query or user-specified function. The Reducer provides a concise, cohesive response for the application. An aggregation function is a function that groups the values of multiple rows together to produce a single value of more significant meaning or measurement, e.g. count, sum, maximum, minimum, deviation and standard deviation. A querying function is a function that finds the desired values, e.g. a function that finds the best student of a class, the one who has shown the best performance in the examination.

Features of the MapReduce framework: provides automatic parallelization and distribution of computation over several processors; processes data stored on distributed clusters of DataNodes and racks; allows processing of large amounts of data in parallel; provides scalability for the use of a large number of servers; provides the batch-oriented MapReduce programming model of Hadoop version 1; provides additional processing modes, e.g. queries, graph databases, streaming data, messages and OLAP.

Shown below is a MapReduce example that counts the frequency of each word in a given input text. The input text is: "Big data comes in various formats. This data can be stored in multiple data servers."
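The slide shows this example as a figure; as a minimal sketch of the same computation, the classic WordCount program below uses the Hadoop MapReduce Java API. The input and output paths are command-line arguments, and the simple lower-casing and punctuation stripping in the Mapper are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // Normalize case and strip simple punctuation (illustrative).
                word.set(itr.nextToken().toLowerCase().replaceAll("[.,]", ""));
                context.write(word, ONE);
            }
        }
    }

    // Reducer: aggregates the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input text file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For the sample sentence above, the final Reducer output would contain counts such as data 3, in 2 and big 1.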

MapReduce programming model: a MapReduce program can be written in any language, including Java, C++ (Pipes) or Python. A MapReduce program does the mapping to compute over the data and convert the data into other data sets. After the Mapper computation finishes, the Reducer function collects the results of the Map and generates the final output. A MapReduce program can be applied to any type of data (structured or unstructured).

Hadoop YARN: YARN is a resource-management platform. It manages computer resources and is responsible for providing computational resources such as CPUs, memory and network I/O. It separates the resource-management and processing components and enables the running of multi-threaded applications.

Hadoop 2 Execution Model

The figure shows the YARN components: client, Resource Manager (RM), Node Manager (NM), Application Master (AM) and containers. List of actions: a MasterNode has two components, 1. the Job History Server and 2. the Resource Manager (RM). The client node submits the request of an application to the RM. The RM is the master; one RM exists per cluster, and it keeps the information of all the slave NMs.

An NM creates an Application Master instance (AMI). The AMI initializes itself and registers with the RM; multiple AMIs can be created in an AM. All active NMs send a controlling signal (heartbeat) periodically to the RM, signalling their presence. Each NM includes several containers for use by the sub-tasks of the application.
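As a hedged illustration of the client/RM/NM interaction (not taken from the slides), the short program below uses the YarnClient API to ask the ResourceManager for the NodeManagers that are currently heartbeating; it assumes a yarn-site.xml with the RM address is available on the classpath.

// Minimal sketch: query the ResourceManager for the active NodeManagers.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml
        yarnClient.start();

        // The single RM of the cluster tracks every NM that heartbeats to it.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
        }
        yarnClient.stop();
    }
}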

Hadoop Ecosystem Tools

ZooKeeper: Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. The common services provided by ZooKeeper are as follows:
Naming service − identifying the nodes in a cluster by name; it is similar to DNS, but for nodes.
Configuration management − latest and up-to-date configuration information of the system for a joining node.
Cluster management − joining/leaving of a node in a cluster and node status in real time.
Leader election − electing a node as leader for coordination purposes.
Locking and synchronization service − locking the data while modifying it; this mechanism helps in automatic failure recovery while connecting to other distributed applications such as Apache HBase.
Highly reliable data registry − availability of data even when one or a few nodes are down.
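As a minimal sketch of the naming and configuration-management services listed above, the following uses the ZooKeeper Java client; the ensemble address localhost:2181 and the znode path /cluster/config are assumptions for illustration.

// Minimal sketch: store and read a shared configuration value in ZooKeeper.
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();   // session established
                }
            }
        });
        connected.await();

        // Publish a configuration value that joining nodes can read.
        if (zk.exists("/cluster", false) == null) {
            zk.create("/cluster", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        if (zk.exists("/cluster/config", false) == null) {
            zk.create("/cluster/config", "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can read the same, up-to-date value.
        byte[] data = zk.getData("/cluster/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}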

Oozie: Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in a sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run parallel to each other.

The following three types of jobs are common in Oozie:
Oozie Workflow Jobs − represented as Directed Acyclic Graphs (DAGs) that specify a sequence of actions to be executed.
Oozie Coordinator Jobs − workflow jobs triggered by time and data availability.
Oozie Bundles − packages of multiple coordinator and workflow jobs.

Oozie provides the following: it integrates multiple jobs in a sequential manner; stores and supports Hadoop jobs for MapReduce, Hive, Pig and Sqoop; runs workflow jobs based on time and data triggers; and manages batch coordinators for the applications.
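As a hedged sketch (the details are assumptions, not from the slides), a workflow whose workflow.xml is already stored in HDFS can be submitted and monitored from Java through the OozieClient API; the Oozie server URL, HDFS paths and the property names other than OozieClient.APP_PATH are placeholders that depend on the actual workflow definition.

// Minimal sketch: submit and poll a workflow job through the Oozie client API.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode:9000/user/hadoop/wordcount-wf");  // dir with workflow.xml
        conf.setProperty("nameNode", "hdfs://namenode:9000");      // placeholder property
        conf.setProperty("jobTracker", "resourcemanager:8032");    // placeholder property

        String jobId = oozie.run(conf);                 // submit and start the workflow
        System.out.println("Workflow job " + jobId + " submitted");

        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);                    // poll until the DAG finishes
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}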

Sqoop: Sqoop is a tool used to transfer bulk data between Hadoop and external data stores, such as relational databases (MS SQL Server, MySQL).

Sqoop Import: the import tool imports individual tables from an RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files, or as binary data in Avro and Sequence files. Sqoop Export: the export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table; these are read and parsed into a set of records and delimited with a user-specified delimiter.

Flume: Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store. It is a highly reliable, distributed and configurable tool that is principally designed to transfer streaming data from various sources to HDFS.

Data generators (such as Facebook and Twitter) generate data which gets collected by the individual Flume agents running on them. Thereafter, a data collector (which is also an agent) collects the data from the agents; the data is aggregated and pushed into a centralized store such as HDFS or HBase.

Flume Agent: an agent is an independent process (JVM) in Flume. It receives the data (events) from clients or other agents and forwards it to its next destination (a sink or another agent).

As shown in the diagram, a Flume agent contains three main components, namely source, channel and sink. A source is the component of an agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events. A channel is a transient store which receives the events from the source and buffers them till they are consumed by sinks; it acts as a bridge between the sources and the sinks. Examples: JDBC channel, file-system channel, memory channel. A sink stores the data into centralized stores like HBase and HDFS; it consumes the data (events) from the channels and delivers it to the destination, which might be another agent or the central store. Example: HDFS sink.
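As a minimal sketch of a data generator handing events to an agent, the following uses the Flume client SDK to send events to an agent's Avro source; the host, port and event bodies are assumptions, and the agent's source, channel and sink must be configured separately in its properties file.

// Minimal sketch: send events to a Flume agent's Avro source with the client SDK.
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSender {
    public static void main(String[] args) throws Exception {
        // Connect to the agent's Avro source (assumed at flume-host:41414).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            for (int i = 0; i < 10; i++) {
                Event event = EventBuilder.withBody("log line " + i, StandardCharsets.UTF_8);
                client.append(event);   // the agent buffers the event in a channel,
                                        // and its sink writes it to HDFS/HBase
            }
        } finally {
            client.close();
        }
    }
}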

Ambari: Apache Ambari is an open-source administration tool deployed on top of Hadoop clusters. It is responsible for keeping track of the running applications and their status. Apache Ambari can be described as a web-based management tool that manages, monitors and provisions the health of Hadoop clusters.

Features of Ambari: simplification of installation, configuration and management; easy, efficient, repeatable and automated creation of clusters; management and monitoring of scalable clusters; detection of faulty nodes and links; extensibility and customizability.

HBase: HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many Big Data use cases. HBase is a scalable, distributed NoSQL database that provides real-time access for reading or writing data in HDFS.

Components of HBase: there are two HBase components, namely the HBase Master and the RegionServer. HBase Master: it is not part of the actual data storage, but it negotiates load balancing across all RegionServers. It maintains and monitors the Hadoop cluster, performs administration (an interface for creating, updating and deleting tables), and controls failover. The HMaster handles DDL operations.

RegionServer: it is the worker node which handles read, write, update and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNodes.

Storage mechanism in HBase: HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only the column families, which are the key-value pairs. A table can have multiple column families, and each column family can have any number of columns.

In short, in HBase: a table is a collection of rows; a row is a collection of column families; a column family is a collection of columns; and a column is a collection of key-value pairs.

Features of HBase: HBase is linearly scalable; it has automatic failure support; it provides consistent reads and writes; it integrates with Hadoop, both as a source and a destination; it has an easy Java API for the client; and it provides data replication across clusters.
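As a hedged sketch of the row/column-family model described above, the following uses the HBase Java client API. The table stuData with column family info is a hypothetical table that must already exist (for example, created beforehand in the HBase shell), and hbase-site.xml is assumed to be on the classpath.

// Minimal sketch: write and read one row of a hypothetical stuData table
// (column family "info") using the HBase Java client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStuDataDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("stuData"))) {

            // Row key = student id; columns live under the "info" column family.
            Put put = new Put(Bytes.toBytes("id96"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Student 96"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("marks"),
                    Bytes.toBytes("87"));
            table.put(put);

            // Real-time read of the same row from HDFS-backed storage.
            Result result = table.get(new Get(Bytes.toBytes("id96")));
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println("id96 name = " + name);
        }
    }
}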

Hive: Hive is an ETL and data-warehousing tool used to query and analyze large datasets stored within the Hadoop ecosystem. Hive has three main functions: data summarization, querying, and analysis of unstructured and semi-structured data in Hadoop.

Features of Hive: it stores the schema in a database and the processed data in HDFS; it is designed for OLAP; it provides an SQL-type language for querying, called HiveQL or HQL; and it is familiar, fast, scalable and extensible.

User Interface: Hive is a data-warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server). Metastore: Hive chooses the respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements for the traditional approach to MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it. Execution Engine: the conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce; it uses the flavour of MapReduce. HDFS or HBase: the Hadoop Distributed File System or HBase are the data-storage techniques used to store the data in the file system.
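As a minimal sketch of querying Hive from a client, the following submits a HiveQL aggregation through the HiveServer2 JDBC driver; the server address, credentials and the stu_data table with a dept column are illustrative assumptions.

// Minimal sketch: run a HiveQL aggregation through the HiveServer2 JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "hadoop", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL; the execution engine turns it into
            // MapReduce (or Tez/Spark) jobs over data stored in HDFS.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) AS cnt FROM stu_data GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString("dept") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}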

Pig: Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze larger sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data-manipulation operations in Hadoop using Pig.

Features of Pig:
Rich set of operators − it provides many operators to perform operations such as join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities − the tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
Extensibility − using the existing operators, users can develop their own functions to read, process and write data.
UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.

To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.

Parser: initially, the Pig scripts are handled by the parser. It checks the syntax of the script and does type checking and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as nodes and the data flows as edges.

Optimizer: the logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown. Compiler: the compiler compiles the optimized logical plan into a series of MapReduce jobs. Execution engine: finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
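As a hedged sketch, the same parse, optimize, compile and execute pipeline can be driven from Java by embedding Pig Latin statements in a PigServer session (local mode here); the input file student.txt and its field layout are assumptions, and the identical statements could be typed into the Grunt shell instead.

// Minimal sketch: embed Pig Latin in Java with PigServer (local mode).
// The input file student.txt (id, name, marks separated by commas) is assumed.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFilterDemo {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Parser -> optimizer -> compiler -> execution engine run behind
        // these registered statements.
        pig.registerQuery("students = LOAD 'student.txt' USING PigStorage(',') "
                + "AS (id:int, name:chararray, marks:int);");
        pig.registerQuery("toppers = FILTER students BY marks >= 80;");
        pig.registerQuery("ordered = ORDER toppers BY marks DESC;");

        // Writes the result to the local directory 'toppers_out'.
        pig.store("ordered", "toppers_out");
        pig.shutdown();
    }
}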

Mahout: Apache Mahout is an open-source project that is primarily used for creating scalable machine-learning algorithms. It implements popular machine-learning techniques such as recommendation, classification and clustering.

Features of Mahout: the algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment, and Mahout uses the Apache Hadoop library to scale effectively in the cloud. Mahout offers the coder a ready-to-use framework for doing data-mining tasks on large volumes of data, and it lets applications analyze large data sets effectively and quickly. It includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet and Mean-Shift; supports distributed Naive Bayes and Complementary Naive Bayes classification implementations; comes with distributed fitness-function capabilities for evolutionary programming; and includes matrix and vector libraries.
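As a hedged sketch of the recommendation use case, the following uses Mahout's Taste collaborative-filtering API; the input file ratings.csv (lines of userID,itemID,rating), the neighbourhood size and the user ID are assumptions, and newer Mahout releases favour the Samsara/Spark APIs over this classic API.

// Minimal sketch: user-based collaborative filtering with Mahout's Taste API.
// ratings.csv with lines "userID,itemID,rating" is an assumed input file.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderDemo {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend three items for user 1 based on similar users' ratings.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}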