Unit 2 - Big Data - Structured and Unstructured.pptx

KumarasamyPK 25 views 23 slides Apr 30, 2024

About This Presentation

Big Data


Slide Content

Data

* Any data that can be processed by a digital computer and stored as sequences of 0's and 1's (binary language) is known as digital data.

* Whenever you send an email, read a social media post, or take pictures with your digital camera, you are working with digital data.

* In general, data can be any character, text, numbers, voice messages, SMS, WhatsApp messages, pictures, sound, or video.

Data

* The byte is the basic unit of information in computer storage and processing, and is composed of eight bits; a kilobyte is 1,000 bytes; one megabyte is 1,000 kilobytes. The larger units (GB, TB, PB, EB, ZB, YB) each grow by the same factor of 1,000.
* Digitizing is the process of converting information into digital form and is necessary for a computer to be able to process and store the information.
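
The unit ladder above can be sketched in a few lines of Python. This is only an illustration: the helper names `bytes_to_bits` and `human_readable` are made up for this example, and the factor of 1,000 per step is the decimal convention used on the slide.

```python
# Decimal storage units, each 1,000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bytes_to_bits(num_bytes: int) -> int:
    """One byte is composed of eight bits."""
    return num_bytes * 8


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(bytes_to_bits(1))
print(human_readable(1_500_000))
```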

Data

* It is an invaluable asset of any enterprise (big or small).
* Data is present internal to the enterprise and also exists outside the firewalls of the enterprise.
* Data may be homogeneous or heterogeneous.
* The need of the hour is to understand, manage, and process the data, and to take it for analysis to draw valuable insights.

Types of digital data

1. Structured data: data stored in the form of rows and columns (databases, Excel).

2. Unstructured data: no pre-defined schema (PPTs, images, videos, PDFs).

3. Semi-structured data: hybrid schema (JSON, HTML, XML, email, and so on).
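
As a rough illustration of the three types, the sketch below holds similar information in each form. All names and values are invented for the example, and treating a free-text note as unstructured is an assumption (the slide lists PPTs, images, videos, and PDFs).

```python
import json

# Structured: every row follows the same fixed columns.
columns = ("customer_id", "name", "city")
rows = [(1, "Asha", "Chennai"), (2, "Ravi", "Madurai")]

# Semi-structured: keys act as tags, but records need not share the same keys.
semi_structured = json.loads(
    '[{"name": "Asha", "city": "Chennai"}, {"name": "Ravi", "phones": ["555-0101"]}]'
)

# Unstructured: no schema at all; a program sees only a run of characters.
unstructured = "Asha called from Chennai about her order; Ravi will follow up."

print(all(len(row) == len(columns) for row in rows))
print([sorted(record) for record in semi_structured])
```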

Structured Data


Unstructured Data


Distribution of digital data (in %)
(by Gartner)

[Chart legend: Unstructured, Semi-structured, Structured]

Structured Data

* Data which is in an organized form (in rows and columns).
* Computer programs can use this data easily.
* Relationships exist between entities of data.
* Examples:
  - Data stored in databases
  - ERP
  - CRM
  - DW
  - Data cube

Structured Data
* Data that conforms to a pre-defined schema or structure is known as structured data.

* The data can be processed, stored, and retrieved in a fixed format. This data can be processed easily by programs.

* Conforms to a relational data model.

* Structured data is organized in semantic chunks/entities, with similar entities grouped together to form relations/tables.

Structured Data

* Descriptions for all entities in a group:
  - have the same defined format
  - have a predefined length
  - follow the same order.

Example: Departments
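
A minimal sketch of such a table, using Python's built-in `sqlite3` module. The column names and rows are hypothetical, since the original figure is not available. Note that SQLite records the declared `VARCHAR(30)` length but does not enforce it; most other relational databases do.

```python
import sqlite3


def build_departments() -> sqlite3.Connection:
    """Create an in-memory Departments table with a fixed, pre-defined schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE departments (
            dept_id   INTEGER PRIMARY KEY,
            dept_name VARCHAR(30) NOT NULL,
            location  VARCHAR(30) NOT NULL
        )
        """
    )
    # Every row has the same attributes, in the same order (illustrative values).
    rows = [
        (10, "Accounting", "Chennai"),
        (20, "Research", "Coimbatore"),
        (30, "Sales", "Madurai"),
    ]
    conn.executemany("INSERT INTO departments VALUES (?, ?, ?)", rows)
    return conn


conn = build_departments()
for row in conn.execute("SELECT dept_id, dept_name, location FROM departments ORDER BY dept_id"):
    print(row)
```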

Sources of Structured Data

What Is Structured Data?

* Conforms to a data model
* Data is stored in the form of rows and columns (e.g., relational database)
* Attributes in a group are the same
* Data resides in fixed fields within a record or file
* Definition, format, and meaning of data are explicitly known

Ease with Structured Data-Storage

* Data types, both defined and user-defined, help with the storage of structured data.
* Scalability is generally not an issue with an increase in data.
* Update and delete: updating, deleting, etc. is easy due to the structured form.

Ease with Structured Data-Retrieval

* A well-defined structure helps in easy retrieval of data.
* Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search.
* Mining data: structured data can be easily mined and knowledge can be extracted from it.
* BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken.
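
The point about indexing on attributes other than text can be sketched with `sqlite3`. The table, index names, and rows below are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "Sales", 52000),
        (2, "Ravi", "Research", 61000),
        (3, "Meena", "Sales", 48000),
    ],
)

# One index on a numeric attribute and one on a text attribute.
conn.execute("CREATE INDEX idx_emp_salary ON employees (salary)")
conn.execute("CREATE INDEX idx_emp_dept ON employees (dept)")

# Range search on the numeric attribute.
high_paid = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM employees WHERE salary > ? ORDER BY salary", (50000,)
    )
]

# Exact-match search on the text attribute.
sales = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM employees WHERE dept = ? ORDER BY emp_id", ("Sales",)
    )
]

print(high_paid, sales)
```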

Semi-structured Data

Semi-structured data does not conform to any data model, i.e., it is difficult to determine the meaning of the data, and the data cannot be stored in rows and columns as in a database. However, semi-structured data has tags and markers that help to group data and describe how the data is stored. These give some metadata, but it is not sufficient for management and automation of data.

Similar entities in the data are grouped and organized in a hierarchy. The attributes or the properties within a group may or may not be the same. For example, two addresses may or may not contain the same number of properties, as in:

Address 1

<house number><street name><area name><city>

Address 2

<house number><street name><city>
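
The two addresses can be written as JSON records to make the difference concrete. This is a sketch; the field values are invented, and JSON is just one convenient notation for tagged data.

```python
import json

addresses_json = """
[
  {"house_number": "12", "street_name": "Lake Road", "area_name": "Old Town", "city": "Chennai"},
  {"house_number": "7", "street_name": "Hill Street", "city": "Madurai"}
]
"""
addresses = json.loads(addresses_json)

# Every property that appears in at least one address.
all_keys = sorted(set().union(*(a.keys() for a in addresses)))

# Properties that every address has.
shared = sorted(set.intersection(*(set(a) for a in addresses)))

# Properties each address lacks compared with the full set.
missing = [sorted(set(all_keys) - set(a)) for a in addresses]

print(all_keys)
print(shared)
print(missing)
```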
For example, an e-mail follows a standard format:

To: <Name>

From: <Name>

Subject: <Text>

CC: <Name>

Body: <Text, Graphics, Images etc. >
The tags give us some metadata, but the body of the e-mail has no format, nor does it convey the meaning of the data it contains.
There is a very fine line between unstructured and semi-structured data.
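
Python's standard `email` package shows the same split: the header tags parse into named fields, while the body comes back as one undifferentiated string. The message below is a made-up example.

```python
from email import message_from_string

raw = """\
To: alice@example.com
From: bob@example.com
Subject: Quarterly report
CC: carol@example.com

Hi Alice, the numbers look good. Call me when you have a minute.
"""

msg = message_from_string(raw)

# The tagged part: each header becomes a named field.
headers = {name: value for name, value in msg.items()}

# The untagged part: the body is plain text with no further structure.
body = msg.get_payload()

print(headers)
print(body)
```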

What is Semi-structured Data?

* Does not conform to a data model but contains tags & elements (metadata)
* Cannot be stored in the form of rows and columns as in a database
* Similar entities are grouped
* Attributes in a group may not be the same
* The tags and elements describe how data is stored
* Not sufficient metadata

Where does Semi-structured Data Come from?

* Mark-up languages
* Integration of data from heterogeneous sources

* Describe the structure and content of data to some extent
* Assign meaning to data, hence allowing automatic search and indexing

* Contain data on the leaves of the graph. Also known as 'schema less'
* Used for data exchange among heterogeneous sources

XML

* Models the data using tags and elements
* Schemas are not tightly coupled to data

How to Store Semi-structured Data?

Challenges faced

* Storage cost: storing data with their schemas increases cost.
* Semi-structured data cannot be stored in an RDBMS as the data cannot be mapped into tables directly.
* Irregular and partial structure: some data elements may have extra information while others have none at all.
* In many cases the structure is implicit. Interpreting relationships and correlations is very difficult.
* Evolving schemas: schemas keep changing with requirements, making it difficult to capture them in a database.
* Distinction between schema and data: a vague distinction between schema and data exists at times, making it difficult to capture data.

How to Store Semi-structured Data?

* XML: XML allows tags and attributes to be defined to store data. Data can be stored in a hierarchical/nested structure.
* RDBMS: semi-structured data can be stored in a relational database by mapping the data to a relational schema, which is then mapped to a table.
* Databases which are specifically designed to store semi-structured data.
* Data can be stored and exchanged in the form of a graph, where entities are represented as objects which are the vertices in a graph.
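
One way the RDBMS option might look in practice: parse the XML, map each element to a fixed set of columns, and store missing properties as NULL. This is a sketch under assumed tag and column names, not a prescribed mapping.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML document; the second address has no <area_name>.
XML_DOC = """
<addresses>
  <address>
    <house_number>12</house_number>
    <street_name>Lake Road</street_name>
    <area_name>Old Town</area_name>
    <city>Chennai</city>
  </address>
  <address>
    <house_number>7</house_number>
    <street_name>Hill Street</street_name>
    <city>Madurai</city>
  </address>
</addresses>
"""

# The relational schema chosen for the mapping.
COLUMNS = ("house_number", "street_name", "area_name", "city")


def xml_to_rows(xml_text):
    """Map each <address> element to one tuple; absent tags become None."""
    root = ET.fromstring(xml_text.strip())
    return [
        tuple(address.findtext(col) for col in COLUMNS)
        for address in root.findall("address")
    ]


def load_addresses(xml_text):
    """Store the mapped rows in an in-memory table (None is stored as NULL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE addresses (house_number TEXT, street_name TEXT, area_name TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO addresses VALUES (?, ?, ?, ?)", xml_to_rows(xml_text))
    return conn


conn = load_addresses(XML_DOC)
for row in conn.execute("SELECT * FROM addresses"):
    print(row)
```

The NULL in the second row is the cost of forcing an irregular structure into fixed columns.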

How to Extract Information from Semi-structured Data?

Challenges faced

* Semi-structured data is usually stored in flat files, which are difficult to index and search.
* Data comes from varied sources, which makes it difficult to tag and search.
* Incomplete/irregular structure: extracting structure when there is none, and interpreting the relations existing in the structure which is present, is a difficult task.

How to Extract Information from Semi-structured Data?

Possible solutions

* Indexing data in a graph-based model enables quick search.
* Allows data to be stored in a graph-based data model, which is easier to index and search.
* Allows data to be arranged in a hierarchical or tree-like structure, which enables indexing and searching.
* Various mining tools are available which search data based on graphs, schemas, structure, etc.
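
A small sketch of indexing a tree-like structure: walk the nested data once and record, for every leaf value, the path that leads to it. The function name `build_value_index` and the sample document are assumptions for illustration.

```python
from collections import defaultdict


def build_value_index(node, path=(), index=None):
    """Walk a nested dict/list tree and map each leaf value to the paths where it occurs."""
    if index is None:
        index = defaultdict(list)
    if isinstance(node, dict):
        for key, child in node.items():
            build_value_index(child, path + (key,), index)
    elif isinstance(node, list):
        for position, child in enumerate(node):
            build_value_index(child, path + (str(position),), index)
    else:
        # Leaf: the data sits at the end of the path.
        index[node].append("/".join(path))
    return index


doc = {
    "customers": [
        {"name": "Asha", "address": {"city": "Chennai"}},
        {"name": "Ravi", "address": {"city": "Madurai", "area": "Old Town"}},
    ]
}

index = build_value_index(doc)
print(index["Madurai"])
```

Once the index is built, a lookup by value no longer needs to scan the whole tree.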

XML-A Solution for Semi-structured Data Management

Open-source markup language written in plain text.
It is hardware and software independent.

It allows data to be stored in a hierarchical/nested structure. It allows users to define tags to store the data.
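
User-defined tags and nesting can be tried out with Python's standard `xml.etree.ElementTree` module. The tag names (`student`, `courses`, `course`) and values are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Build a nested document from tags we define ourselves.
student = ET.Element("student", attrib={"id": "S001"})
ET.SubElement(student, "name").text = "Asha"
courses = ET.SubElement(student, "courses")
ET.SubElement(courses, "course").text = "Big Data"
ET.SubElement(courses, "course").text = "Databases"

# Serialize to plain text, then read it back.
xml_text = ET.tostring(student, encoding="unicode")
parsed = ET.fromstring(xml_text)
course_names = [course.text for course in parsed.find("courses")]

print(xml_text)
print(course_names)
```

Because the serialized form is plain text, any platform with an XML parser can read it back.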

How to Store Unstructured Data?

Challenges faced

* The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. occupy a huge amount of storage space.
* Scalability becomes an issue with an increase in unstructured data.
* Retrieving and recovering unstructured data are cumbersome.
* Ensuring security is difficult due to varied sources of data (e.g., e-mail, web pages).
* Updating, deleting, etc. are not easy due to the unstructured form.
* Indexing becomes difficult with an increase in data. Searching is difficult for non-text data.