MapReduce Working - Big Data Analytics 2025

How to Analyze and Process Unstructured Data
Unstructured data is data that has not been organized in a predefined manner. It is typically
textual, like open-ended survey responses and social media conversations, but it can also be
non-textual, like images, video, and audio. Unstructured information is growing quickly due to
the increased use of digital applications and services. Some estimates say that 80-90% of
company data is unstructured, and the volume continues to grow rapidly every year.

While structured data is important, unstructured data is even more valuable to businesses if
analyzed correctly. It can provide a wealth of insights that statistics and numbers just can’t
explain.

Structured Data Vs. Unstructured Data: What's The Difference?
Structured, unstructured and semi-structured data all fall under the umbrella of ‘big data’.
While all three types of data can offer incredible insights, it’s important to know which data
type to collect and when, and which one to analyze for the insights you’re hoping to gain.
Although it contains figures, statistics, and facts, unstructured data is usually text-heavy or
configured in a way that’s difficult to analyze. Social media posts, for example, might contain
opinions, topics that are being discussed, and feature recommendations. But this information
is difficult to process in bulk. First, specific bits of information must be extracted and
categorized, then analyzed to gain usable insights.
Structured data, on the other hand, is often numerical and easy to analyze. It’s organized in a
pre-defined structured format, such as Excel and Google Sheets, where data is added to
standardized columns and rows relating to pre-set parameters. The framework of structured
data models is designed for easy data entry, search, comparison, and extraction.
There is also semi-structured data, which is also text-heavy data but loosely organized into
categories or “meta tags.” This information can be easily broken into its individual groups, but
the data within these groups is itself unstructured.

Email is a good example of this: you can search your email by Inbox, Sent, and Drafts, but the
email text within each category has no pre-set structure.
Unstructured Data Types & Examples
Examples of unstructured data include legal documents, audio, chats, video, images, text on a
web page, and much more. Some of the most common types are:
Business Documents
Emails
Social Media
Customer Feedback
Webpages
Open Ended Survey Responses
Images, Audio, and Video

Business Documents
Written business reports, legal documents, and presentations are often printed on paper, in
PDFs, or even hand-written, and some may contain spreadsheets, images, or XML files.
Although text files may be organized in a common format, data isn’t structured in a way that
can be analyzed without advanced AI technology.
These documents contain huge amounts of unstructured data that often goes unexploited, as
it’s considered too time-consuming to analyze. Fortunately, by using text analysis techniques,
companies can now gather valuable information about customers and employees from these
documents and use it for competitive research.
Emails
We send dozens of emails daily, which translates into huge amounts of unstructured data.
Although emails are semi-structured by category (Inbox, Sent, Drafts, and so on), the data
within each email is unstructured.
Text analysis software can scan through thousands of emails in seconds to extract customer
information, organize by category and route to the proper department, track customer service
quality, and more.
You can even find out what kind of language works best for customer communication and
quickly surface major customer pain points. For example, you might discover particular topics
that are mentioned most frequently in a negative way.

Social Media
Social media data is similar to emails, in that some of it is organized. Hashtags, for example,
help users search for topics that they’re interested in. But the messages containing these
hashtags are unstructured.

Social media data grows by the second into a huge, nebulous, real-time archive of ideas,
opinions, and statistics. When social media users mention brands and products, those mentions
become useful data that can be mined for opinions.
Customer Feedback
Customer feedback can come in many forms: online reviews, surveys, phone calls, and
unsolicited social media posts. When it’s possible to gather and analyze all of this information
together, you can get a fully-balanced view into the thoughts of your customers.
Follow your customers’ major concerns daily, implement changes, and track the results with
tools like sentiment analysis and word clouds. You’ll save time and increase accuracy with text
analysis software – no more guesswork or semi-informed decision making.
By performing customer feedback analysis, you’ll have hard data on the voice of the customer
and an overview of your area of expertise.
Webpages
The vast internet creates unstructured data at breakneck speed. Webpages can include text,
images, audio, video, and all manner of content. And while the structure of web pages is written
in HTML code, the markup doesn’t actually describe the content of the pages. It can be useful
to mine, extract, and organize this data to find information about customers, competitors, and
overall public sentiment. Also, as web pages are constantly changing, machine learning
software allows you to track them continuously and compare them over time.
Open-Ended Survey Responses
While some surveys are designed to be easily analyzed with multiple-choice questions, there
are usually more insights to be gained from open-ended questionnaires. Because respondents
answer in their own words, the text or recordings produced need to be broken down into usable
data before they can be properly analyzed. Performing survey analysis on open-ended responses
offers more nuance and may even surface new ideas and recommendations from customers.
Once the unstructured responses have been gathered, they can be organized and analyzed with
business intelligence tools that classify, analyze, and visualize data.
Images, Audio, and Video
Although multimedia files may be tagged with titles or subjects and saved in databases as MP3,
JPG, PNG, GIF, etc., they are still unstructured because we don’t know what the image, audio,
or video represents.
Speech-to-text technology like Gong, however, can be used to convert audio files into text,
which can then be analyzed by natural language processing software. And image and video
analysis has made great advancements with facial and subject recognition software.

Importance of Unstructured Data

The majority of data created today is unstructured (documents, social media, emails), and it is
often an untapped resource. When managed in the right way, unstructured data can deliver
countless insights that help you make informed, data-driven decisions. Machine learning
technology allows you to manage and analyze unstructured data automatically, quickly, and
accurately. Through technological advancements like natural language processing (NLP),
machines can now read text much as a human would. That means you can eliminate repetitive
tasks like manually tagging and routing tickets or sifting through social media posts. Instead,
AI technology can automatically learn how to extract keywords, names, phone numbers, and
locations, understand opinions and intent, and recognize topics that are important to your
business. Once all your unstructured data has been organized, you’ll gain granular insights that
will help you make informed business decisions.
How to Analyze Unstructured Data:
1. Choose the End Goal
Make sure you define a clear set of measurable goals. What insights do you want to obtain from
your data? Do you want to understand how customers feel about a particular topic? Knowing
this will help you identify what type of unstructured data you need to collect.
2. Collect Relevant Data
Data is everywhere, but maybe you just want to focus on data from one channel, like social
media, online reviews, or surveys. Depending on your end goal, you can collect data in real
time, look at historical data, or request data (through surveys) at every step of the customer
journey.
3. Clean Data
To make unstructured data easier for machines to analyze, you’ll need to preprocess or clean
your data first. Preprocessing data involves reducing noise, eliminating irrelevant information
(for example, stop words), and slicing data into more manageable pieces of content (like
opinion units).
4. Implement Technology
You’ll need more than just unstructured data analysis tools to get the most out of your data.
Data storage and information retrieval architectures, such as NoSQL databases, are essential
for managing your data flow, while data visualization tools like Tableau and Google Data
Studio help summarize unstructured data.
Let your data speak for itself through compelling charts and graphs, making it easy to draw out
actionable insights that you can share with your team and higher-ups.


Hadoop Ecosystem and Its Components
1. Hadoop Ecosystem Components

The objective of this Apache Hadoop ecosystem tutorial is to give an overview of the different
components of the Hadoop ecosystem that make Hadoop so powerful. We will also cover
Hadoop ecosystem components such as HDFS and its components, MapReduce, YARN, Hive,
Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout,
Sqoop, Apache Flume, Ambari, ZooKeeper, and Apache Oozie, in order to dive deep into Big
Data Hadoop and acquire master-level knowledge of the Hadoop ecosystem.


2. Introduction to Hadoop Ecosystem
The main components of the Hadoop ecosystem are described below.
2.1. Hadoop Distributed File System
It is the most important component of the Hadoop ecosystem. HDFS is the primary storage
system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system
that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data.
HDFS is a distributed file system that runs on commodity hardware. HDFS ships with a default
configuration that is suitable for many installations, although large clusters usually require
additional configuration. Users interact with HDFS directly through shell-like commands.
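As a quick illustration of those shell-like commands, here is a minimal sketch of a few common
hdfs dfs operations; the file and directory paths are hypothetical examples.

# copy a local file into HDFS (paths are examples only)
hdfs dfs -put access.log /user/analyst/logs/

# list the directory and view the beginning of the file
hdfs dfs -ls /user/analyst/logs/
hdfs dfs -cat /user/analyst/logs/access.log | head

# report how the file's blocks are replicated across DataNodes
hdfs fsck /user/analyst/logs/access.log -files -blocks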
HDFS Components:
There are two major components of Hadoop HDFS- NameNode and DataNode.
i. NameNode

It is also known as the Master node. The NameNode does not store the actual data or dataset.
It stores metadata: the number of blocks, their locations, which rack and which DataNode each
block is stored on, and other details. The namespace it manages consists of files and directories.
Tasks of HDFS NameNode
• Manages the file system namespace.
• Regulates clients’ access to files.
• Executes file system operations such as naming, closing, and opening files and
directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual
data in HDFS. The DataNode performs read and write operations as requested by clients. Each
block replica on a DataNode consists of two files on the local file system: the first holds the
data itself and the second records the block’s metadata, which includes checksums. At startup,
each DataNode connects to its corresponding NameNode and performs a handshake, which
verifies the namespace ID and the software version of the DataNode. If a mismatch is found,
the DataNode shuts down automatically.
Tasks of HDFS DataNode
• DataNode performs operations like block replica creation, deletion, and replication
according to the instruction of NameNode.
• DataNode manages data storage of the system.


2.2. MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component which provides data
processing. MapReduce is a software framework for easily writing applications that process
the vast amount of structured and unstructured data stored in the Hadoop Distributed File
system.
MapReduce programs are parallel in nature and are therefore very useful for performing
large-scale data analysis using multiple machines in the cluster. This parallel processing
improves the speed and reliability of the cluster.

Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two
phases:
• Map phase
• Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer specifies two
functions: the map function and the reduce function.
The map function takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
The reduce function takes the output of the map as its input, combines those tuples based on
their keys, and produces an aggregated value for each key.
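To make the two phases concrete, the classic word-count job can be sketched in HiveQL, the
SQL-like language described in section 2.4, which Hive compiles into exactly this kind of map
and reduce pipeline. The table docs and its single column line are hypothetical and serve only
to show where the map, shuffle, and reduce steps happen.

-- Map: split every line into tokens, emitting one record per word
-- Shuffle: GROUP BY gathers all records that share the same key (the word)
-- Reduce: COUNT(*) combines each group into a single value per key
SELECT t.word, COUNT(*) AS occurrences
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) t AS word
GROUP BY t.word;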
Features of MapReduce
• Simplicity – MapReduce jobs are easy to run. Applications can be written in any
language, such as Java, C++, or Python.
• Scalability – MapReduce can process petabytes of data.
• Speed – Through parallel processing, problems that would otherwise take days to
solve can be solved in hours or minutes.
• Fault Tolerance – MapReduce takes care of failures. If one copy of the data is
unavailable, another machine holding a copy of the same key-value pairs can be
used to complete the same subtask.
2.3. YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that
provides resource management. YARN is also one of the most important components of the
Hadoop Ecosystem. YARN is called the operating system of Hadoop, as it is responsible for
managing and monitoring workloads. It allows multiple data processing engines, such as
real-time streaming and batch processing, to handle data stored on a single platform.

YARN has been positioned as the data operating system of Hadoop 2. The main features of
YARN are:
• Flexibility – Enables other purpose-built data processing models beyond
MapReduce (batch), such as interactive and streaming processing. Thanks to this
feature of YARN, other applications can run alongside MapReduce programs in
Hadoop 2.
• Efficiency – Because many applications run on the same cluster, the overall
efficiency of Hadoop increases without much effect on quality of service.
• Shared – Provides a stable, reliable, and secure foundation with shared operational
services across multiple workloads. Additional programming models, such as graph
processing and iterative modeling, are now possible for data processing.
2.4. Hive
The Hadoop ecosystem component Apache Hive is an open-source data warehouse system for
querying and analyzing large datasets stored in Hadoop files. Hive performs three main
functions: data summarization, querying, and analysis. Hive uses a language called HiveQL
(HQL), which is similar to SQL. Hive automatically translates these SQL-like queries into
MapReduce jobs that execute on Hadoop.

The main parts of Hive are:
• Metastore – Stores the metadata.
• Driver – Manages the lifecycle of a HiveQL statement.
• Query compiler – Compiles HiveQL into a Directed Acyclic Graph (DAG).
• Hive server – Provides a Thrift interface and a JDBC/ODBC server.
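A minimal HiveQL sketch of these parts working together is shown below; the table name,
column names, and HDFS location are hypothetical. The query compiler turns the SELECT into
a DAG of MapReduce jobs that run on the cluster.

-- Define a table over files already stored in HDFS (schema and path are examples)
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/analyst/page_views';

-- An SQL-like query that Hive translates into MapReduce jobs
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;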
2.5. Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets stored
in HDFS. As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is
conceptually similar to SQL. It loads the data, applies the required filters, and dumps the data
in the required format. Pig requires a Java runtime environment to execute programs.

Features of Apache Pig:
• Extensibility – Users can create their own functions to carry out special-purpose
processing.
• Optimization opportunities – Pig allows the system to optimize execution
automatically, so the user can pay attention to semantics instead of efficiency.
• Handles all kinds of data – Pig analyzes both structured and unstructured data.
2.6. HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store
structured data in tables that can have billions of rows and millions of columns. HBase is a
scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access
for reading and writing data in HDFS.

Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
• Maintains and monitors the HBase cluster.
• Performs administration (an interface for creating, updating, and deleting tables).
• Controls failover.
• The HMaster handles DDL operations.
ii. RegionServer
It is the worker node that handles read, write, update, and delete requests from clients. The
RegionServer process runs on every node in the Hadoop cluster, alongside the HDFS DataNode.

SQL (STRUCTURED QUERY LANGUAGE)
SQL (Structured Query Language) is a programming language used to manage relational
databases. It provides a standard syntax for defining, manipulating, and querying data stored
in a relational database management system (RDBMS). SQL is used to create, modify, and
delete databases, tables, and other database objects. It is also used to insert, update, and delete
data in tables, as well as to retrieve data from one or more tables.
SQL is a declarative language, which means that instead of specifying how to perform an
operation, such as searching for data, you specify what you want to accomplish, such as
retrieving specific data. SQL commands are executed by the RDBMS and the results are
returned to the user. SQL is widely used in applications that require the storage and
management of large amounts of data, such as business applications, e-commerce websites,
and social media platforms.
SQL commands are grouped into several categories, including Data Definition Language
(DDL), Data Manipulation Language (DML), Data Control Language (DCL), and Transaction
Control Language (TCL). DDL commands are used to create, alter, and drop database objects
such as tables, indexes, and views. DML commands are used to insert, update, and delete data
in tables. DCL commands are used to control access to data, while TCL commands are used to
manage transactions.
SQL is a standard language that is supported by most relational database systems, including
MySQL, Oracle, Microsoft SQL Server, PostgreSQL, and SQLite. While there are some
differences in the implementation of SQL by different database systems, the basic syntax and
commands are similar. SQL is an essential tool for managing and analyzing data in a relational
database system.

Types of Structured Query Language
Structured Query Language (SQL) is a standardized programming language that is used to
manage relational databases. SQL has several types, which include:

Data Definition Language (DDL): This type of SQL is used to define the database schema,
create tables, modify the structure of existing tables, and define relationships between tables.
Examples of DDL commands include CREATE, ALTER, and DROP.
Data Manipulation Language (DML): This type of SQL is used to manipulate data within
tables. Examples of DML commands include SELECT, INSERT, UPDATE, and DELETE.
Data Control Language (DCL): This type of SQL is used to control access to data within the
database. Examples of DCL commands include GRANT and REVOKE.
Transaction Control Language (TCL): This type of SQL is used to control transactions within
the database. Examples of TCL commands include COMMIT and ROLLBACK.
Data Query Language (DQL): This type of SQL is used to retrieve data from one or more
tables. The main DQL command is SELECT, often combined with JOIN clauses to pull data
from multiple tables.
Each type of SQL serves a specific purpose and can be used in different scenarios depending
on the needs of the database administrator or developer.
Some examples of the different types of SQL commands:
1. Data Definition Language (DDL) commands:
CREATE TABLE: creates a new table in the database.
ALTER TABLE: modifies the structure of an existing table.
DROP TABLE: deletes a table from the database.
2. Data Manipulation Language (DML) commands:
INSERT INTO: inserts new rows of data into a table.
UPDATE: modifies existing rows of data in a table.
DELETE FROM: deletes rows of data from a table.
3. Data Control Language (DCL) commands:
GRANT: grants permissions to a user or group to perform certain actions on a table or database.
REVOKE: revokes previously granted permissions from a user or group.

4. Transaction Control Language (TCL) commands:
COMMIT: saves changes made within a transaction.
ROLLBACK: undoes changes made within a transaction.
5. Data Query Language (DQL) commands:
SELECT: retrieves data from one or more tables based on specified criteria.
JOIN: a clause used within SELECT to combine data from two or more tables based on a
common column or columns.
These are just a few examples of the types of SQL commands that are used to manage and
manipulate data in a relational database system.
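A short, self-contained sketch of these categories in standard SQL follows; the table, column,
and user names are hypothetical, and DCL and TCL syntax can vary slightly between database
systems.

-- DDL: define the schema
CREATE TABLE departments (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(100)
);
CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    department_id INT REFERENCES departments(department_id),
    salary        DECIMAL(10, 2)
);

-- DML: insert and modify rows
INSERT INTO departments VALUES (10, 'Engineering');
INSERT INTO employees VALUES (1, 10, 52000.00);
UPDATE employees SET salary = 55000.00 WHERE employee_id = 1;

-- DCL: control access (exact syntax varies by database system)
GRANT SELECT ON employees TO reporting_user;

-- TCL: group changes into a transaction (some systems use START TRANSACTION)
BEGIN;
DELETE FROM employees WHERE salary < 0;
ROLLBACK;

-- DQL: retrieve data, combining the two tables with a JOIN
SELECT e.employee_id, d.department_name, e.salary
FROM employees e
JOIN departments d ON d.department_id = e.department_id;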
Window Functions
Window functions are a type of SQL function that allows you to perform calculations across a
set of rows in a table without changing the result set. Window functions were introduced in
SQL:2003 and are available in most modern relational database systems such as PostgreSQL,
Oracle, SQL Server, and MySQL.
The key feature of window functions is that they operate on a set of rows called a "window"
that is defined by a range of rows or partitioned by one or more columns. The result of the
window function is computed based on the values of the rows in the window, and the result is
included in the output for each row in the window.
Window functions are particularly useful for performing complex calculations on large data
sets that would be difficult or impossible to achieve using traditional SQL statements. Some
common use cases for window functions include:
1. Ranking and Sorting: Window functions can be used to rank or sort rows based on a specific
column or set of columns. For example, you can use the ROW_NUMBER function to assign a
unique rank to each row in a table.
2. Aggregation: Window functions can be used to perform calculations across multiple rows
within a partition. For example, you can use the SUM function to calculate a rolling sum of a
column within a window.

3. Statistical Analysis: Window functions can be used to perform statistical analysis on a data
set. For example, you can use the AVG function to calculate the rolling average of a column
within a window.
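The sketch below illustrates the ranking and aggregation use cases against the hypothetical
employees table defined earlier: ROW_NUMBER numbers employees within each department by
salary, and the windowed SUM produces a running total.

-- 1. Ranking: number employees within each department, highest salary first
SELECT
    employee_id,
    department_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;

-- 2. Aggregation: a running total of salaries within each department
SELECT
    employee_id,
    department_id,
    salary,
    SUM(salary) OVER (PARTITION BY department_id ORDER BY salary DESC) AS running_total
FROM employees;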
User-Defined Functions and Aggregates
User-defined functions (UDFs) and aggregates are two types of custom functions that can be
created in SQL to perform specific calculations or operations. UDFs are functions that are
created by the user and can be called within SQL statements. UDFs can be used to encapsulate
complex calculations or procedures, making them easier to use and understand. They can
accept input parameters and return a single value or a table of values. UDFs can be created
using SQL or a programming language such as Python or Java, depending on the database
system being used.
Aggregates, on the other hand, are functions that perform a calculation across a set of rows and
return a single value. Some common examples of aggregates are COUNT, SUM, AVG, MIN,
and MAX. Aggregates can be used to calculate various statistics on a data set, such as the total
number of rows, the average value of a column, or the maximum value of a column. Similar to
UDFs, user-defined aggregates (UDAs) can also be created in some database systems. UDAs
are custom aggregate functions that can perform calculations that are not supported by standard
aggregates. UDAs can be created using SQL or a programming language, and can be used to
calculate customized statistics or perform other operations on a data set.
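As a minimal sketch, here is a scalar UDF written in SQL using PostgreSQL syntax; the function
name and logic are illustrative, and other database systems use different CREATE FUNCTION
dialects.

-- A user-defined function that converts a monthly salary into an annual one
CREATE FUNCTION annual_salary(monthly_salary NUMERIC)
RETURNS NUMERIC
LANGUAGE SQL
AS $$
    SELECT monthly_salary * 12;
$$;

-- Calling the UDF inside an ordinary query
SELECT employee_id, annual_salary(salary) AS yearly_pay
FROM employees;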
Ordered Aggregates
Ordered aggregates, also known as window aggregates, are a type of aggregate function in SQL
that allow you to perform calculations on a specific subset of rows within a partition. In a
traditional aggregate function, such as SUM or COUNT, all of the rows in the partition are
considered when calculating the result. With ordered aggregates, you can define a specific
subset of rows within the partition based on an order specified by one or more columns. The
ordered subset of rows is commonly referred to as a "window". The syntax for using an ordered
aggregate function typically involves two parts:

The OVER clause: This clause defines the window or subset of rows that will be used in the
calculation. It specifies the order in which the rows should be considered and any additional
partitioning criteria.

The aggregate function: This is the function that will be applied to the ordered window of rows.
Examples include aggregate functions such as SUM and AVG, as well as ranking functions such
as RANK.
SELECT
    employee_id,
    department_id,
    salary,
    AVG(salary) OVER (PARTITION BY department_id ORDER BY salary DESC) AS avg_salary_by_dept
FROM employees;

In this example, we are calculating an average salary by department, but because the window is
ordered by salary in descending order, the default window frame runs from the first row of the
partition to the current row. Each employee therefore sees the average salary of themselves and
all colleagues in the same department with an equal or higher salary, rather than the average
over the whole department.