Unit 2 - Big Data - Structured and Unstructured.pptx
KumarasamyPK
23 slides
Apr 30, 2024
About This Presentation
Big Data
Slide Content
Data
* Any data that can be processed by a digital computer and stored as sequences of 0s and 1s (binary language) is known as digital data.
* Whenever you send an email, read a social media post, or take pictures with your digital camera, you are working with digital data.
* In general, data can be any character, text, numbers, voice messages, SMS, WhatsApp messages, pictures, sound, or video.
Data
* A byte is the basic unit of information in computer storage and processing, and is composed of eight bits; a kilobyte is 1,000 bytes; a megabyte is 1,000 kilobytes. Larger units follow the same pattern (GB, TB, PB, EB, ZB, YB).
* Digitizing is the process of converting information into digital form and is necessary for a computer to be able to process and store the information.
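The two points above can be sketched in a few lines of Python. This is only an illustration: the helper names `to_bits` and `human_size` are made up for this example, UTF-8 is an assumed text encoding, and the decimal convention (1 kilobyte = 1,000 bytes) follows the slide.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bits(text: str) -> str:
    """Digitize text: encode it as UTF-8 bytes and show each byte as 8 bits."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))

def human_size(num_bytes: int) -> str:
    """Express a byte count in the largest decimal unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1,000,
        # or at the last unit in the list.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(to_bits("Hi"))          # 01001000 01101001
print(human_size(1_000_000))  # 1 MB
```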
Data
* It is an invaluable asset of any enterprise (big or small).
* Data is present internal to the enterprise and also exists outside the firewalls of the enterprise.
* Data may be homogeneous or heterogeneous.
* The need of the hour is to understand, manage, and process the data, and to take it for analysis to draw valuable insights.
Types of digital data
1. Structured data: data stored in the form of rows and columns (databases, Excel)
2. Unstructured data: no pre-defined schema (PPTs, images, videos, PDFs)
3. Semi-structured data: hybrid schema (JSON, HTML, XML, email, and so on)
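To make the three categories concrete, here is a small sketch that holds similar information in each form. All names and values are invented sample data, and CSV and JSON are used only as convenient stand-ins for "rows and columns" and "tagged data".

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns (hypothetical sample data).
structured = io.StringIO("id,name,city\n1,Asha,Chennai\n2,Ravi,Mumbai\n")
rows = list(csv.DictReader(structured))

# Semi-structured: fields are tagged, but records need not share the same keys.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha", "city": "Chennai"},'
    ' {"id": 2, "name": "Ravi", "phones": ["555-0100"]}]'
)

# Unstructured: free text with no schema; a program sees only a string.
unstructured = "Asha from Chennai called about her order; Ravi will follow up."

# Structured rows share one column layout; semi-structured records may not.
columns_per_row = {tuple(row.keys()) for row in rows}
keys_per_record = {tuple(sorted(rec.keys())) for rec in semi_structured}
print(len(columns_per_row), len(keys_per_record))
```

The structured rows all share a single column layout, while the two JSON records carry different sets of keys.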
[Figure: illustrations of structured data and unstructured data; image content not recoverable]
Distribution of digital data (in %), by Gartner
[Chart; legend: Unstructured, Semi-structured, Structured. Percentages not recoverable.]
Structured Data
* Data which is in an organized form (in rows and columns).
* Computer programs can use this data easily.
* Relationships exist between entities of data.
* Examples
  - Data stored in databases
  - ERP
  - CRM
  - DW
  - Data cube
Structured Data
* Data that conforms to a pre-defined schema or structure is known as structured data.
* The data can be processed, stored, and retrieved in a fixed format. This data can be processed easily by programs.
* Conforms to a relational data model.
* Structured data is organized in semantic chunks/entities, with similar entities grouped together to form relations/tables.
Structured Data
Descriptions for all entities in a group:
* Have the same defined format
* Have a predefined length
* Follow the same order
Example: Departments [figure not recoverable]
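Since the original Departments figure is not available, the following is only a sketch of what such a relation might look like, using Python's built-in `sqlite3` module. The column names (`dept_id`, `dept_name`, `location`) and the sample rows are assumptions for illustration, not taken from the slide. Note that SQLite enforces the column set and the `NOT NULL` constraints but does not enforce a predefined length.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: every row has the same columns, in the same order.
conn.execute(
    """
    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL,
        location  TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO departments (dept_id, dept_name, location) VALUES (?, ?, ?)",
    [(10, "Accounts", "Chennai"), (20, "Research", "Pune")],
)

# A row that omits a required attribute does not fit the schema and is rejected.
try:
    conn.execute("INSERT INTO departments (dept_id, dept_name) VALUES (30, 'Sales')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

names = [row[0] for row in conn.execute(
    "SELECT dept_name FROM departments ORDER BY dept_id")]
print(rejected, names)
```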
Sources of Structured Data
What Is Structured Data?
* Conforms to a data model
* Data is stored in the form of rows and columns (e.g., a relational database)
* Attributes in a group are the same
* Data resides in fixed fields within a record or file
* Definition, format, and meaning of data are explicitly known
Ease with Structured Data: Storage
* Data types, both predefined and user-defined, help with the storage of structured data
* Scalability is generally not an issue as data increases
* Update and delete: updating, deleting, etc. are easy due to the structured form
Ease with Structured Data: Retrieval
* A well-defined structure helps in easy retrieval of data
* Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search
* Mining data: structured data can be easily mined and knowledge can be extracted from it
* BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken
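As a rough sketch of indexing on attributes other than a text string, the example below creates indexes on a numeric column and a date column with `sqlite3`. The `orders` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", 250.0, "2024-04-01"),
        (2, "Ravi", 900.0, "2024-04-15"),
        (3, "Meena", 500.0, "2024-04-30"),
    ],
)

# Index a numeric attribute and a date attribute, not just a text string.
conn.execute("CREATE INDEX idx_orders_amount ON orders (amount)")
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

# A range query on the indexed numeric column.
large_orders = conn.execute(
    "SELECT order_id FROM orders WHERE amount >= ? ORDER BY amount", (500,)
).fetchall()

index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
print(large_orders, index_names)
```

Because every row has the same typed columns, the database can build and use such indexes without inspecting the content of each record.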
Semi-structured Data
Semi-structured data does not conform to any data model, so it cannot be stored in rows and columns as in a database, and its meaning is difficult to determine automatically. It does, however, have tags and markers that help to group data and describe how the data is stored. These give some metadata, but not enough for management and automation of the data.
Similar entities in the data are grouped and organized in a hierarchy. The attributes or properties within a group may or may not be the same. For example, two addresses may or may not contain the same number of properties:
Address 1
<house number><street name><area name><city>
Address 2
<house number><street name><city>
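The two addresses can be written as tagged records to show that entities in the same group need not share all attributes. The field values below are invented, and `missing_fields` is a helper defined only for this sketch.

```python
# Two entities of the same kind whose attribute sets differ (sample values).
address_1 = {
    "house_number": "12",
    "street_name": "Lake Road",
    "area_name": "Adyar",
    "city": "Chennai",
}
address_2 = {
    "house_number": "7",
    "street_name": "MG Road",
    "city": "Pune",
}

def missing_fields(record: dict, other: dict) -> set:
    """Return the attributes that `other` has but `record` lacks."""
    return set(other) - set(record)

shared = set(address_1) & set(address_2)
print(sorted(shared))
print(missing_fields(address_2, address_1))
```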
For example, an e-mail follows a standard format:
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images etc.>
The tags give us some metadata, but the body of the e-mail has no fixed format, nor does it convey the meaning of the data it contains.
There is a very fine line between unstructured and semi-structured data.
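The split between tagged headers and a free-form body can be seen by parsing a message with Python's standard `email` package. The message text and addresses below are made up for illustration.

```python
from email import message_from_string

# A made-up message: tagged header fields, a blank line, then a free-form body.
raw = (
    "To: reader@example.com\n"
    "From: author@example.com\n"
    "Subject: Quarterly numbers\n"
    "Cc: team@example.com\n"
    "\n"
    "Hi, sales were up this quarter. Chart attached.\n"
)

msg = message_from_string(raw)

# The headers are addressable by name, which is the "some metadata" part.
headers = {name: msg[name] for name in ("To", "From", "Subject", "Cc")}

# The body comes back as one undifferentiated string with no further structure.
body = msg.get_payload()
print(headers["Subject"])
print(repr(body))
```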
What Is Semi-structured Data?
* Does not conform to a data model but contains tags and elements (metadata)
* Cannot be stored in the form of rows and columns as in a database
* Similar entities are grouped
* Attributes in a group may not be the same
* The tags and elements describe how data is stored
* Not sufficient metadata
Where Does Semi-structured Data Come from?
* Mark-up languages
  - Describe the structure and content of data to some extent
  - Assign meaning to data, hence allowing automatic search and indexing
* Integration of data from heterogeneous sources
  - Contain data on the leaves of the graph. Also known as 'schema less'
  - Used for data exchange among heterogeneous sources
* XML
  - Models the data using tags and elements
  - Schemas are not tightly coupled to data
How to Store Semi-structured Data?
Challenges faced
* Storage cost: storing data with their schemas increases cost
* RDBMS: semi-structured data cannot be stored in an RDBMS, as data cannot be mapped into tables directly
* Irregular and partial structure: some data elements may have extra information while others have none at all
* In many cases the structure is implicit. Interpreting relationships and correlations is very difficult
* Evolving schemas: schemas keep changing with requirements, making it difficult to capture them in a database
* Distinction between schema and data: a vague distinction between schema and data exists at times, making it difficult to capture data
How to Store Semi-structured Data?
* XML allows tags and attributes to be defined to store data. Data can be stored in a hierarchical/nested structure
* RDBMS: semi-structured data can be stored in a relational database by mapping the data to a relational schema, which is then mapped to a table
* Databases that are specifically designed to store semi-structured data can be used
* Data can be stored and exchanged in the form of a graph, where entities are represented as objects which are the vertices in the graph
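The graph-based option in the last point can be sketched as follows. This is a simplified illustration, not a specific product or standard: the function `to_graph` and the sample record are hypothetical, objects become vertices, labeled edges link a parent to its children, and atomic values sit on the leaves.

```python
def to_graph(root):
    """Flatten nested dicts/lists into vertices and labeled edges."""
    vertices = {}  # vertex id -> leaf value, or None for an internal object
    edges = []     # (parent id, edge label, child id)

    def add(node):
        vid = len(vertices)
        if isinstance(node, dict):
            vertices[vid] = None
            for label, child in node.items():
                edges.append((vid, label, add(child)))
        elif isinstance(node, list):
            vertices[vid] = None
            for position, child in enumerate(node):
                edges.append((vid, str(position), add(child)))
        else:
            # Atomic data lives on the leaves (a literal None value would be
            # indistinguishable from an internal vertex in this simple sketch).
            vertices[vid] = node
        return vid

    add(root)
    return vertices, edges

person = {
    "name": "Asha",
    "address": {"city": "Chennai"},
    "phones": ["555-0100", "555-0101"],
}
vertices, edges = to_graph(person)
leaves = [value for value in vertices.values() if value is not None]
print(len(vertices), leaves)
```

No schema is declared anywhere; the structure is carried entirely by the edge labels, which is why such models are described as schema-less.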
How to Extract Information from Semi-structured Data?
Challenges faced
* Semi-structured data is usually stored in flat files, which are difficult to index and search
* Data comes from varied sources, which makes it difficult to tag and search
* Incomplete/irregular structure: extracting structure when there is none, and interpreting the relations existing in the structure which is present, is a difficult task
How to Extract Information from Semi-structured Data?
Possible solutions
* Indexing data in a graph-based model enables quick search
* Storing data in a graph-based data model makes it easier to index and search
* Arranging data in a hierarchical or tree-like structure enables indexing and searching
* Various mining tools are available which search data based on graphs, schemas, structure, etc.
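One way to picture indexing over a tree-like structure is a path index that maps each (path, value) pair to the records containing it. The sketch below is a minimal illustration under that assumption; `leaf_paths`, `build_index`, and the sample records are hypothetical and do not represent any particular tool.

```python
from collections import defaultdict

def leaf_paths(node, prefix=""):
    """Yield (path, value) for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, node

def build_index(records):
    """Map each (path, value) pair to the ids of the records that contain it."""
    index = defaultdict(set)
    for record_id, record in enumerate(records):
        for path, value in leaf_paths(record):
            index[(path, value)].add(record_id)
    return index

records = [
    {"name": "Asha", "address": {"city": "Chennai", "area": "Adyar"}},
    {"name": "Ravi", "address": {"city": "Pune"}},
    {"name": "Meena", "address": {"city": "Chennai"}},
]
index = build_index(records)

# Lookup by path and value, without scanning every record again.
print(index[("/address/city", "Chennai")])
```

Records with missing attributes (such as the absent `area` above) simply contribute fewer entries, so irregular structure does not break the index.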
XML: A Solution for Semi-structured Data Management
* An open-standard markup language written in plain text
* It is hardware and software independent
* It allows data to be stored in a hierarchical/nested structure
* It allows the user to define tags to store the data
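As a small illustration of user-defined tags and nested structure, the sketch below parses an XML fragment with Python's standard `xml.etree.ElementTree` module. The tag names and values are invented for this example.

```python
import xml.etree.ElementTree as ET

# User-defined tags in a nested structure; the second address has no area_name.
document = """
<addresses>
  <address id="1">
    <house_number>12</house_number>
    <street_name>Lake Road</street_name>
    <area_name>Adyar</area_name>
    <city>Chennai</city>
  </address>
  <address id="2">
    <house_number>7</house_number>
    <street_name>MG Road</street_name>
    <city>Pune</city>
  </address>
</addresses>
"""

root = ET.fromstring(document)

# findtext returns None when an element is absent from a given entry.
cities = [address.findtext("city") for address in root.findall("address")]
areas = [address.findtext("area_name") for address in root.findall("address")]
print(cities)
print(areas)
```

The same query works on both entries even though their attributes differ, which is the flexibility the hierarchical model provides.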
How to Store Unstructured Data?
Challenges faced
* The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. require a huge amount of storage space
* Scalability becomes an issue as unstructured data increases
* Retrieving and recovering unstructured data are cumbersome
* Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages)
* Updating, deleting, etc. are not easy due to the unstructured form
* Indexing becomes difficult as data increases. Searching is difficult for non-text data