Introduction to Big Data Analytics.ppsx

JSujatha2 23 views 40 slides Aug 10, 2024

About This Presentation

BDA_Presentations_M1_P1


Slide Content

Introduction to
Big Data Analytics

Outline
•Types of Digital Data
•Introduction to Big Data
•Big Data Characteristics
•Challenges of Big Data
•Traditional vs. Big Data Business Approach

Introduction
First, we need to know: “What is data?”
Data represents raw facts and figures collected from various sources.
Data is analyzed to gain insights and make decisions.
Data Comes From Types of Data

Computer Data as Information
Data is the information processed or stored by a computer.
This information may be in the form of text documents, images, audio clips,
software programs, or other types of data.
Computer data may be processed by the computer's CPU and is stored
in files and folders on the computer's hard disk.

Types of Digital Data

Types of Digital Data
1. Structured
2. Semi-structured
3. Unstructured

Structured
Structured data is highly organized and easily searchable within databases.
This type of data is typically stored in tables, with rows and columns that clearly
define relationships between different data points.
Characteristics:
Fixed schema: The structure of the data is predetermined and does not
change.
Ease of search and analysis: Data can be easily queried using SQL (Structured
Query Language).
Storage: Often stored in relational databases (RDBMS) like MySQL, Oracle, and
SQL Server.

Structured - Example
Databases: Tables in a relational database, such as customer information in a
CRM system.
Spreadsheets: Excel files where data is organized in rows and columns.
Employee_Table
Employee_ID  Employee_Name  Gender  Department  Salary_In_lacs
1            XYX            MALE    FINANCE     850000
2            ABC            MALE    ADMIN       250000
3            PQR            FEMALE  SALES       350000
4            MNR            FEMALE  FINANCE     600000
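
To illustrate why a fixed schema makes structured data easy to query, here is a minimal sketch that loads the Employee_Table rows into an in-memory SQLite database and filters them with SQL. SQLite stands in for the RDBMS products named earlier, and the column types are assumptions, since the slide gives only names and values.

```python
import sqlite3

# In-memory database; table and column names mirror the Employee_Table slide.
# Column types are assumed, the slide does not state them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee_Table ("
    "Employee_ID INTEGER PRIMARY KEY, Employee_Name TEXT, "
    "Gender TEXT, Department TEXT, Salary_In_lacs INTEGER)"
)
rows = [
    (1, "XYX", "MALE", "FINANCE", 850000),
    (2, "ABC", "MALE", "ADMIN", 250000),
    (3, "PQR", "FEMALE", "SALES", 350000),
    (4, "MNR", "FEMALE", "FINANCE", 600000),
]
conn.executemany("INSERT INTO Employee_Table VALUES (?, ?, ?, ?, ?)", rows)

# Because every row has the same columns, we can filter and sort by name.
finance = conn.execute(
    "SELECT Employee_Name, Salary_In_lacs FROM Employee_Table "
    "WHERE Department = ? ORDER BY Salary_In_lacs DESC",
    ("FINANCE",),
).fetchall()
print(finance)
```

The query returns only the FINANCE employees, highest salary first, without any parsing code: the schema already tells the database where each value lives.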

Semi-structured
Definition:
Semi-structured data does not follow a rigid schema like structured data
but contains tags or markers to separate data elements.
This makes it somewhat organized but flexible.
Characteristics:
Flexible schema: The structure is not fixed and can vary.
Self-describing: Includes metadata that provides information about the data.
Storage: Often stored in NoSQL databases or formats like JSON and XML.
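
A short sketch of the "flexible schema" and "self-describing" points, using JSON. The two records are hypothetical: the second omits "age" and adds an "email" field, which a fixed-schema table would not accept without a schema change.

```python
import json

# Hypothetical records for illustration only.
raw = """
[
  {"name": "Prashant Rao", "sex": "Male", "age": 35},
  {"name": "Seema R.", "sex": "Female", "email": "seema@example.com"}
]
"""
records = json.loads(raw)

# Keys act as self-describing tags: each record declares its own fields.
fields_per_record = [sorted(r.keys()) for r in records]
print(fields_per_record)

# Missing fields are handled when reading, not enforced by a schema.
ages = [r.get("age") for r in records]
print(ages)
```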

Semi-structured
Examples:
JSON (JavaScript Object Notation): Used extensively in web applications for
data interchange.
XML (eXtensible Markup Language) : Used in web services and configuration
files.
Emails: While the body is unstructured, metadata (like sender, recipient, date) is
structured.
Web application data, which is unstructured, consists of log files, transaction
history files etc.
Online transaction processing systems are built to work with structured data
wherein data is stored in relations (tables).

Semi-structured - Example
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
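
The tags in these records are what make them machine-readable. As a rough sketch, Python's standard xml.etree.ElementTree module can read them back into fields. The <people> wrapper element is an assumption added here, because a well-formed XML document needs a single root element and the slide shows only the three records.

```python
import xml.etree.ElementTree as ET

# The three <rec> elements from the slide, wrapped in a hypothetical
# <people> root so that the text forms one well-formed document.
xml_text = """
<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>
""".strip()

root = ET.fromstring(xml_text)

# Each tag names the value it encloses, so fields are looked up by tag.
people = [
    {
        "name": rec.findtext("name"),
        "sex": rec.findtext("sex"),
        "age": int(rec.findtext("age")),
    }
    for rec in root.findall("rec")
]

oldest = max(people, key=lambda p: p["age"])
print(oldest["name"], oldest["age"])
```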

Unstructured
Definition:
Unstructured data lacks any specific format or structure, making it more
complex and diverse. This type of data is often generated by humans and
includes text, multimedia, and more.
Characteristics:
No predefined structure: Data does not follow a specific model or format.
Complexity: Requires advanced processing techniques like natural language
processing (NLP) and machine learning to analyze.
Storage: Stored in various formats like text files, multimedia repositories, and
data lakes.
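
As a very small stand-in for the text-processing techniques mentioned above, the sketch below counts word frequencies in free text. The feedback string and the stopword list are hypothetical, and real NLP pipelines do far more than this; the point is only that structure has to be extracted from unstructured data before it can be analyzed.

```python
import re
from collections import Counter

# Hypothetical customer feedback: free text has no columns or tags to query.
feedback = (
    "The delivery was late. Late delivery again! "
    "Support was helpful, but the delivery was late."
)

# Step 1: impose minimal structure by splitting the text into lowercase words.
tokens = re.findall(r"[a-z]+", feedback.lower())

# Step 2: drop common filler words (an illustrative, hand-picked list).
stopwords = {"the", "was", "but", "again"}
counts = Counter(t for t in tokens if t not in stopwords)

# The most frequent remaining words hint at what the feedback is about.
print(counts.most_common(2))
```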

Unstructured
Examples:
Text documents: Word documents, PDFs, and reports.
Multimedia files: Images, videos, and audio files.
Social media content: Tweets, Facebook posts, and comments.
Webpages: HTML pages without a specific structure.
Human Generated Data Machine Generated Data

Unstructured - Example
The output returned by 'Google Search'

Comparison and Use Cases
Structured Data: Best for applications requiring consistent, repeatable
transactions and queries, such as financial records, inventory management, and
CRM systems.
Semi-Structured Data: Ideal for scenarios where data flexibility is needed, such
as data integration, web services, and API responses.
Unstructured Data: Suitable for big data analytics, content management, and
understanding human-generated data, like social media analysis, multimedia
archiving, and customer feedback analysis.

Find the type of Digital Data
?

Find?
Customer Information in a CRM System
Email Messages
Social Media Posts (e.g., Tweets)
Website Logs
Medical Records in a Hospital Database
JSON Responses from Web APIs
PDF Documents
Photos and Videos
Spreadsheets (e.g., Excel Files)
XML Configuration Files
Voice Recordings
Geospatial Data from GPS

Answer
Digital Data                            Type
Customer Information in a CRM System    Structured Data
Email Messages                          Semi-Structured Data
Social Media Posts (e.g., Tweets)       Unstructured Data
Website Logs                            Semi-Structured Data
Medical Records in a Hospital Database  Structured Data
JSON Responses from Web APIs            Semi-Structured Data
PDF Documents                           Unstructured Data
Photos and Videos                       Unstructured Data
Spreadsheets (e.g., Excel Files)        Structured Data
XML Configuration Files                 Semi-Structured Data
Voice Recordings                        Unstructured Data
Geospatial Data from GPS                Structured Data

Concept, Importance of Data
Concept :
Data represents raw facts and figures collected from various sources, which
can be analyzed to gain insights and make decisions
Importance:
Data drives decision-making, informs strategies, supports research, and
enhances business operations.

Definition – Big Data
Big Data is a massive collection of data that
continues to grow dramatically over time.
It is a data set that is so huge and complicated
that no typical data management technologies
can effectively store or process it.
Big Data is like regular data, but much larger.
We normally work with data measured in megabytes
(Word documents, Excel files) or at most gigabytes
(movies, code), but data measured in petabytes,
i.e. 10^15 bytes, is called Big Data.
It is stated that almost 90% of today's data has
been generated in the past 3 years.
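
To give a feel for the scale, here is a small arithmetic sketch taking a petabyte as 10^15 bytes (decimal units). The 5 MB document and 2 GB movie sizes are assumptions chosen only for illustration.

```python
# Decimal (SI) storage units, with a petabyte taken as 10^15 bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

def convert(value, from_unit, to_unit):
    """Convert a size between decimal storage units."""
    return value * UNITS[from_unit] / UNITS[to_unit]

# Hypothetical file sizes: a 5 MB document and a 2 GB movie.
docs_per_pb = convert(1, "PB", "MB") / 5
movies_per_pb = convert(1, "PB", "GB") / 2

print(f"{docs_per_pb:,.0f} documents or {movies_per_pb:,.0f} movies per petabyte")
```

Under these assumptions, one petabyte holds about 200 million such documents or 500 thousand such movies.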

Sources of Big Data
•Weather station and satellite data, stored and processed
for forecasting
•Emails, blogs and e-news
•Posts, photos, videos, likes and comments on social media
•Traffic data and GPS signals
•Digital pictures and videos
•Software logs, cameras and microphones

Big Data Characteristics

Important 5 V’s
Volume
Variety
Velocity
Veracity
Value

Big Data Characteristics
Volume is the amount of data, which is growing at a high rate,
e.g. data volumes measured in petabytes.

Big Data Characteristics
Value refers to turning data into value. By turning accessed big data into
values, businesses may generate revenue.

Big Data Characteristics
Veracity refers to the uncertainty of available data. Veracity arises due to the
high volume of data that brings incompleteness and inconsistency.

Big Data Characteristics
Visualization is the process of displaying data in charts, graphs, maps, and
other visual forms.

Big Data Characteristics
Variety refers to the different data types i.e. various data formats like text,
audios, videos, etc.

Big Data Characteristics
Velocity is the rate at which data grows. Social media contributes a major role
in the velocity of growing data.

Big Data Characteristics
Virality describes how quickly information spreads across people-to-people
(P2P) networks.

Volume
As it follows from the name, big data is used to refer to
enormous amounts of information.
We are talking about not gigabytes but terabytes and
petabytes of data.
The IoT (Internet of Things) is creating exponential
growth in data.
The volume of data is projected to change significantly
in the coming years.
Hence, 'Volume' is one characteristic which needs to be
considered while dealing with Big Data.
Volume
[ Data at Rest ]
•Terabytes,
Petabytes
•Records/Arch
•Table/Files
•Distributed

Variety
Variety refers to heterogeneous sources and the nature
of data, both structured and unstructured. 
Data comes in different formats – from structured,
numeric data in traditional databases to unstructured
text documents, emails, videos, audios, stock ticker
data and financial transactions.
This variety of unstructured data poses certain issues
for storage, mining and analysing data.
Organizing the data in a meaningful way is no simple
task, especially when the data itself changes rapidly.
Beyond the massive volumes and increasing velocities of
data, another challenge of Big Data processing is
handling the enormous variety of these data.
Variety
[ Data in many
Forms ]
•Structured
•Unstructured
•Text
•Multimedia

Veracity
Veracity describes whether the data can be trusted.
Veracity refers to the uncertainty of available data.
Veracity arises due to the high volume of data that
brings incompleteness and inconsistency.
Hygiene of data in analytics is important because
otherwise, you cannot guarantee the accuracy of your
results.
Because data comes from so many different sources,
it’s difficult to link, match, cleanse and transform data
across systems.
Analysis is useless, however, if the data being analysed
are inaccurate or incomplete.
Veracity is all about making sure the data is accurate,
which requires processes to keep the bad data from
accumulating in your systems.
Veracity
[ Data in Doubt ]
•Trustworthiness
•Authenticity
•Accurate
•Availability
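
One way to picture "keeping the bad data from accumulating" is a cleansing step that runs before data is stored. The sketch below is only an illustration under assumed rules: the records, field names, and the three checks (normalize text, reject incomplete rows, reject duplicate ids) are all hypothetical.

```python
# Hypothetical records merged from two sources; field names are illustrative.
records = [
    {"id": 1, "name": "XYX", "department": "FINANCE", "salary": 850000},
    {"id": 2, "name": "ABC", "department": "admin ", "salary": 250000},
    {"id": 3, "name": "PQR", "department": "SALES", "salary": None},
    {"id": 2, "name": "ABC", "department": "ADMIN", "salary": 250000},
]

def clean(rows):
    """Normalize department names, then reject incomplete or duplicate rows."""
    seen_ids, good, rejected = set(), [], []
    for row in rows:
        # Inconsistency: the same department written in different ways.
        row = {**row, "department": row["department"].strip().upper()}
        # Incompleteness: any missing value; duplication: an id already kept.
        if any(v is None for v in row.values()) or row["id"] in seen_ids:
            rejected.append(row)
            continue
        seen_ids.add(row["id"])
        good.append(row)
    return good, rejected

good, rejected = clean(records)
print(len(good), "kept;", len(rejected), "rejected")
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to trace quality problems back to the source system.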

Velocity
Velocity is the speed at which data grows, is processed and
becomes accessible.
Data flows in from sources like business processes,
application logs, networks, social media sites,
sensors, mobile devices, etc.
The flow of data is massive and continuous.
Although most data are warehoused before analysis, there is
an increasing need for real-time processing of these
enormous volumes.
Real-time processing reduces storage requirements
while providing more responsive, accurate and
profitable responses.
Data should be processed quickly, in batches or in a
stream-like manner, because it keeps growing every year.
Velocity
[ Data in Motion ]
•Streaming
•Batch
•Real / Near Time
•Processes
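
The difference between batch and stream processing, and why streaming can reduce storage requirements, can be sketched with a simple average. The sensor readings are hypothetical; the batch version needs every reading in memory, while the streaming version keeps only a count and the current mean.

```python
def batch_mean(values):
    """Batch: collect all readings first, then compute once."""
    values = list(values)
    return sum(values) / len(values)

def running_mean(stream):
    """Streaming: update the mean per reading, storing only count and mean."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental mean update
        yield mean

readings = [10, 20, 30, 40]  # hypothetical sensor readings

print(batch_mean(readings))
print(list(running_mean(readings)))
```

The last value produced by the streaming version equals the batch result, but an up-to-date answer is available after every reading instead of only at the end.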

Value
It refers to turning data into value. By turning accessed
big data into values, businesses may generate revenue.
Value is the end game. After addressing volume,
velocity, variety, variability, veracity, and visualization –
which takes a lot of time, effort and resources – you
want to be sure your organization is getting value from
the data.
For example, data that can be used to analyze
consumer behavior is valuable for your company
because you can use the research results to make
individualized offers.
Value
[ Data into Money ]
•Statistical
•Events
•Correlations

Visualization
Big data visualization is the process of displaying data
in charts, graphs, maps, and other visual forms. 
It is used to help people easily understand and interpret
their data at a glance, and to clearly show trends and
patterns that arise from this data. 
Raw data comes in different formats, so creating data
visualizations is a process of gathering, managing, and
transforming data into the format that is most usable and
meaningful.
Big Data Visualization makes your data as accessible as
possible to everyone within your organization, whether
they have technical data skills or not. 
Visualization
[ Data Readable ]
•Readable
•Accessible
•Presentation
•Visual Forms
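
As a toy illustration of turning numbers into a visual form, the sketch below draws a text bar chart using only the standard library. The totals are the Salary_In_lacs values from the Employee_Table slide summed per department; the bar_chart helper is hypothetical, and a real project would more likely use a charting library.

```python
# Salary totals per department, summed from the Employee_Table slide
# (FINANCE = 850000 + 600000).
totals = {"FINANCE": 1450000, "ADMIN": 250000, "SALES": 350000}

def bar_chart(data, width=40):
    """Render one text bar per label, scaled so the largest value fills width."""
    peak = max(data.values())
    lines = []
    for label, value in sorted(data.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(value / peak * width)
        lines.append(f"{label:<8}{bar} {value}")
    return "\n".join(lines)

chart = bar_chart(totals)
print(chart)
```

Even in this crude form, the relative sizes are easier to compare at a glance than the raw numbers.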

Virality
Virality describes how quickly information spreads
across people-to-people (P2P) networks.
It measures how quickly data is spread and shared to
each unique node.
Time is a determining factor, along with the rate of spread.
Virality
[ Data Spread ]
•P2P
•Shared
•Rate of Spread

Important 5 V’s
Volume: The amount of data generated and stored.
Variety: The different types of data (structured, semi-structured,
unstructured).
Velocity: The speed at which data is generated and processed.
Veracity: The accuracy and reliability of data.
Value: The usefulness of data in deriving actionable insights.

How many V’s?
10 V’s

END