Digital Types

766 views 80 slides Nov 14, 2021

About This Presentation

Describes types of digital data


Slide Content

CHAPTER 01: TYPES OF DIGITAL DATA

Data
•Any data that can be processed by a digital
computer and stored as sequences of 0's and
1's (binary language) is known as digital data.
•Whenever you send an email, read a social media
post, or take pictures with your digital camera,
you are working with digital data.
•In general, data can be any character, text,
numbers, voice messages, SMS, WhatsApp
messages, pictures, sound, or video.

Data
•The byte is the basic unit of information
in computer storage and processing, and is
composed of eight bits; a kilobyte is 1,000 bytes;
one megabyte is 1,000 kilobytes. (Larger units: GB, TB, PB, EB,
ZB, YB)
•Digitizing is the process of converting information
into digital form and is necessary for a computer to
be able to process and store the information.
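The unit arithmetic and the idea of digitizing can be sketched in a few lines of Python. This is only an illustration: the names `UNITS`, `to_bytes` and `to_binary` are made up for this example, and the decimal (factor of 1,000) convention from the slide is assumed.

```python
# Decimal storage units as on the slide: each step up is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value expressed in a decimal unit into a number of bytes."""
    return value * 1000 ** UNITS.index(unit)

def to_binary(text):
    """Digitize text: encode it as bytes and show each byte as eight bits."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))

print(to_bytes(1, "KB"))   # 1000
print(to_bytes(1, "MB"))   # 1000000
print(to_binary("Hi"))     # 01001000 01101001
```

Each character of the text ends up as one or more bytes, and each byte as a sequence of eight 0's and 1's.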

Data
•It is an invaluable asset of any enterprise (big or small).
•Data is present internal to the enterprise and also exists
outside the firewalls of the enterprise.
•Data may be homogeneous or heterogeneous.
•Need of the hour is to
–Understand, manage, process,
–and take the data for analysis
–to draw valuable insights.

Types of digital data
1.Structured data: data stored in the form of
rows and columns (databases, Excel)
2.Unstructured data: no pre-defined schema
(PPTs, images, videos, PDFs)
3.Semi-structured data: hybrid schema (JSON,
HTML, XML, email, and so on)

Distribution of digital data (in %)
(by Gartner)
•Unstructured: 80
•Semi-structured: 10
•Structured: 10

Structured Data
•Data which is in an organized form (In rows & columns).
•Computer programs can use this data easily.
•Relationships exist between entities of data.
•Example
–Data stored in databases
–ERP
–CRM
–DW
–Data Cube

Structured Data
•Data that conforms to a pre-defined schema or structure
is known as structured data.
•The data can be processed, stored, and retrieved in a
fixed format. This data can be processed easily by
programs.
•Conforms to a relational data model.
•Structured data is organized in semantic chunks/entities,
with similar entities grouped together to form
relations/tables.
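A minimal sketch of structured data, using Python's built-in `sqlite3` module. The `patient` table, its columns and the sample rows are hypothetical and chosen only for illustration; the point is that every row follows the same pre-defined schema, so a program can query it directly.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# The schema is defined up front: every row has the same columns and types.
conn.execute(
    "CREATE TABLE patient ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO patient (name, age) VALUES (?, ?)",
    [("Asha", 34), ("Ravi", 52)],
)

# Because the format is fixed, retrieval is a simple declarative query.
rows = conn.execute(
    "SELECT name, age FROM patient WHERE age > ? ORDER BY name", (40,)
).fetchall()
print(rows)  # [('Ravi', 52)]
```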

Structured Data
•Descriptions for all entities in a group:
–Have the same defined format
–Have a predefined length
–Follow the same order.

Example

Sources of Structured Data
•OLTP systems
•Excel
•Databases

Ease with structured data
•Security
•Insert/Update/Delete
•Scalability
•Transaction processing (ACID)
•Indexing/Searching
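Transaction processing can be illustrated with a small sketch, again using `sqlite3`. The `account` table and the `transfer` helper are assumptions made for this example. It shows atomicity, the "A" in ACID: either both updates of a transfer are applied, or neither is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one atomic transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the credit to dst was rolled back too.
        return False

ok = transfer(conn, "a", "b", 30)       # succeeds
failed = transfer(conn, "a", "b", 500)  # would overdraw "a", so it is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

The second transfer leaves both balances untouched, even though its first update had already run when the constraint failed.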

Database (RDBMS)
•Oracle Corp. – Oracle
•IBM – DB2, IBM Informix
•Microsoft – SQL Server
•EMC – Greenplum
•Teradata – Teradata
•Open source – MySQL, PostgreSQL
•SQLite
•Sequel Pro
•Amazon Aurora
•SAP SQL Anywhere, SAP IQ (Sybase)

Semi-structured Data
•Data which does not conform to a data model but has
some structure.
•Computer programs cannot use this data easily.
•Example
–emails
–XML
–HTML
–JSON, and so on.

Semi-structured data (SSD)
•It is referred to as a self-describing structure.
•It is a form of structured data that does not
conform with the formal structure of data models
associated with relational databases or other
forms of data tables.
•It uses metadata and tags to provide semantic
information.

Characteristics of semi-structured data (SSD)
•Does not conform to a data model.
•Cannot be stored in the form of rows and columns
as in a database.
•Tags and elements are used to describe data.
•Attributes in a group may not be the same.
•Similar entities are grouped.
•Size of the same attributes in a group may differ.
•Type of the same attributes in a group may differ.
•Evolving schema.
•Schema and data are tightly coupled.

Example (Names & Emails)
•One way is:
Name: Raju Patil
Email: [email protected], [email protected]
•Another way is:
First Name: Raju
Last Name: Patil
Email: [email protected]
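The two layouts above can be written as JSON records and handled by one small program. This is a sketch under assumptions: the email addresses are placeholders on `example.com` and `example.org` (the originals are not shown in the slide), and `normalize` is a hypothetical helper. It shows how attributes in a group may differ in name, number and type, and how a program has to inspect each record to cope with that.

```python
import json

# Two records describing the same person with different attributes.
# The addresses are placeholders, not the ones from the original slide.
raw = """
[
  {"Name": "Raju Patil", "Email": ["raju@example.com", "rp@example.org"]},
  {"FirstName": "Raju", "LastName": "Patil", "Email": "raju@example.com"}
]
"""

def normalize(record):
    """Map either layout onto one (full_name, list_of_emails) shape."""
    if "Name" in record:
        name = record["Name"]
    else:
        name = f'{record["FirstName"]} {record["LastName"]}'
    emails = record["Email"]
    if isinstance(emails, str):  # a single address instead of a list
        emails = [emails]
    return name, emails

records = [normalize(r) for r in json.loads(raw)]
for name, emails in records:
    print(name, emails)
```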

Sources of SSD
•Email
•XML
•TCP/IP
•Zipped files
•Mark-up languages
•Integration of data from heterogeneous sources.

Example: Email format
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images, etc.><Name>

ABC Healthcare Blood Test Report
Date: <>
Department: <>
Patient Name: <>
Attending Doctor: <>
Hemoglobin content: <>
Patient Age: <>
RBC count: <>
WBC count: <>
Platelet count: <>
Diagnosis: <notes>
Conclusion: <notes>

XML & JSON
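As a rough illustration of these two formats, the same hypothetical patient record is written once as XML and once as JSON and parsed with Python's standard library. The record and its field names are made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# The same record in both formats; tags and keys describe the data.
xml_text = "<patient><name>Asha</name><age>34</age></patient>"
json_text = '{"patient": {"name": "Asha", "age": 34}}'

# XML: walk the child elements of the root and read tag and text.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# JSON: parse into nested dictionaries.
from_json = json.loads(json_text)["patient"]

print(from_xml)   # {'name': 'Asha', 'age': '34'}
print(from_json)  # {'name': 'Asha', 'age': 34}
```

Element text in XML arrives as a string, while JSON carries the number as a number, so the age is `'34'` in one result and `34` in the other.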

Integration of data from heterogeneous sources
•User
•Mediator: Uniform access to multiple data sources
•Data sources: OODBMS, RDBMS, legacy system, structured file
Getting to know Unstructured data
•Over the past few days, Dr. Ben and Dr. Stanley
had been exchanging long emails about a
particular case of a gastro-intestinal problem.
•The emails describe the procedure practiced by Dr. Stanley,
a combination of drugs that has successfully
cured gastro-intestinal disorders in patients.
•Dr. Mark has a patient in the “Good Life”
emergency unit with a quite similar case of
gastro-intestinal disorder.

Unstructured Data
•Unstructured data refers to data that lacks any
specific form or structure.
•This makes it very difficult and time-consuming to
process and analyze unstructured data.
•Data which does not conform to any data model is unstructured data (USD).
•Computer programs cannot use this data directly.
•About 80-90% of an organization's data is in this format.
•An enormous amount of knowledge is hidden in this
data.
•Hence finding useful knowledge/insight from USD is very
crucial.

Unstructured Data
•Unstructured data is a generic label for describing data
that is not contained in a database or some other type
of data structure.
•Unstructured data can be textual or non-textual.
•Textual unstructured data is generated in media like
email messages, PowerPoint presentations, Word
documents, comments in social media, etc.
•Non-textual unstructured data is generated in media
like images, CCTV footage, audio files and video files.
•Anything in a non-database form is unstructured data.

Unstructured Data
•Two types:
1.Bitmap objects: image, video, or audio files
2.Textual objects: Word documents, emails, PPTs and so on.

Unstructured Data
•Example
–Memos, QR code (Quick Response), Blogs
–Chat rooms, Tweets, Comments, likes, tags
–PPTs, emojis, emoticons (emotion icons)
–Images, log files, social media posts
–Videos, sensor data (raw), weather data
–Doc files, geospatial data, surveillance data
–Body of email, GPS data, etc.
–WhatsApp messages, CCTV footage and so on.

Getting to know Unstructured data

Characteristics of Unstructured data
•This data cannot be stored in the form of rows
and columns as in a database and does not
conform to any data model.
•It is difficult to determine the meaning of the
data.
•It does not follow any rule or semantics, i.e., it is not
in any particular format or sequence.
•Not easily usable by a program.

Sources of Unstructured data
•Web pages
•Audio and Videos
•Images
•Body of an email
•Word document
•PPT and reports
•Chats and text messages
•Social media data
•White papers
•Surveys
•SMS
•Free form text
•Server Log files
•Product reviews

Web page is unstructured data
Web Page: Multimedia, Image, Database, Text, XML

Challenges
•Storage space: A lot of space is required to store USD.
•Scalability: As the data grows, scalability becomes an
issue and the cost of storing USD increases.
•Retrieve information: Difficult to retrieve required
information from USD.
•Security: Ensuring security is difficult due to varied
sources of data, e.g., emails, web pages, etc.
•Indexing & searching: Very difficult and error-prone
as the structure of the USD is not clear.

Challenges
•Interpretation: USD is not easily interpreted by
conventional search algorithms.
•Classification: Different naming conventions
followed across the organization make it difficult to
classify data.
•Deriving meaning: Computer programs cannot
automatically derive meaning or structure from USD.
•File formats: The increasing number of file formats
makes it difficult to interpret data.

Portion of Unstructured data
(Chart: share of USD versus SD)

Dealing with USD
Possible solutions:
1.Data mining
2.Text mining/Text analytics
3.NLP
4.Noisy text analytics
5.Manual tagging with metadata
6.Part of speech tagging
7.UIMA
8.Web scraping

Data Mining
•It is the computing process of discovering patterns
in large data sets involving methods at the
intersection of AI, machine learning &
DL, statistics, and database systems.
•Popular algorithms:
–Association rule mining (MBA, market basket analysis)
–Regression analysis (Y = mX + c)
–Collaborative filtering
Collaborative filtering
•Collaborative filtering uses similarities between users and
items simultaneously to provide recommendations.
•It is a method of making automatic predictions (filtering)
about the interests of a user by collecting preferences
or taste information from many users (collaborating).
•Collaborative filtering works on a fundamental
principle: you are likely to like what someone similar to
you likes.
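The principle "you are likely to like what someone similar to you likes" can be sketched as a tiny user-based recommender. The ratings, user names and helper functions are hypothetical, and the similarity measure (cosine over the items two users have both rated) is one simple choice among many, not the method of any particular company.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 5, "B": 5, "D": 4},
    "carol": {"A": 1, "C": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {i: scores[i] / weights[i] for i in scores if weights[i] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)

print(recommend("alice", ratings))  # ['D']
```

Alice has not rated item D; Bob, whose tastes are close to hers, rated it highly, so his rating dominates the prediction.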

Collaborative filtering
•Collaborative filtering (CF) is a technique commonly used
by recommender systems to build personalized
recommendations on the Web.
•Companies that employ CF models include Amazon,
Facebook, Twitter, LinkedIn, Spotify, Google News,
Netflix, iTunes.

Collaborative filtering

Text analytics or text mining
•It is the process of converting
unstructured text data into meaningful data for
analysis, to measure customer opinions, product
reviews, feedback and sentiment, in order to
support fact-based decision making.
•Uses many linguistic, statistical, and machine
learning techniques such as clustering, pattern
recognition, tagging, association analysis,
predictive analytics, etc.
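A very small sketch of turning unstructured review text into data that can be analyzed: word counts and a lexicon-based sentiment label. The word lists and reviews are made up for this example, and counting positive and negative words is a deliberately naive stand-in for the statistical and machine learning techniques named above.

```python
import re
from collections import Counter

# Tiny hand-made sentiment lexicons (assumptions for this sketch).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "slow"}

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great phone, love the camera",
    "Terrible battery and slow charging",
    "It arrived on Monday",
]
labels = [sentiment(r) for r in reviews]
counts = Counter(t for r in reviews for t in tokenize(r))
print(labels)  # ['positive', 'negative', 'neutral']
```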

Text analytics or text mining
•It helps organizations to find potentially valuable
business insights in corporate documents, customer
emails, call center logs, survey comments, social
network posts, medical records and other sources of
text-based data.
•Text mining capabilities are also being incorporated
into AI chatbots/virtual agents that companies deploy
to provide automated responses to customers as part
of their marketing, sales and customer service
operations.

Natural Language Processing (NLP)
•Natural language processing (NLP) is the ability of a
computer program to understand human language as
it is spoken. NLP is a component of artificial
intelligence (AI).
•It is a field of computer science, artificial
intelligence and computational linguistics concerned
with the interactions between computers and human
(natural) languages (HCI domain).
•NLP strives to build machines that understand and
respond to text or voice data.

Natural Language Processing (NLP)

Noisy text analytics
•It is the process of extracting structured or semi-structured
information from noisy unstructured text data
such as online chat, text messages, emails, message
boards, blogs, wikis, etc.
•Noisy unstructured data comprises one or more of
the following:
–Spelling mistakes
–Acronyms
–Non-standard words (HBD, K, GN, GM, VGM, etc.)
–Missing punctuation
–Missing letters and so on.
Manual tagging with metadata
•It is the process of tagging manually with adequate
metadata to provide the semantics to understand
unstructured data.
Road Accident

Part of Speech Tagging
•It is also called POS tagging, POST, or grammatical
tagging.
•It is the process of reading text and tagging each
word in the sentence as belonging to a particular
part of speech such as “noun”, “verb”, “adjective”,
“pronoun”, etc.
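To make the idea concrete, here is a toy tagger that looks each word up in a small hand-written dictionary. The lexicon and the `tag` function are assumptions for this sketch. Real POS taggers use the surrounding context and statistical or neural models, because many words can belong to more than one part of speech.

```python
# A tiny hand-made lexicon: word -> part of speech (illustrative only).
LEXICON = {
    "the": "determiner", "a": "determiner",
    "doctor": "noun", "patient": "noun", "report": "noun",
    "reads": "verb", "treats": "verb",
    "new": "adjective",
    "she": "pronoun", "he": "pronoun",
}

def tag(sentence):
    """Pair every word with its part of speech, or 'unknown' if not listed."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in sentence.split()]

tagged = tag("She reads the new report")
for word, pos in tagged:
    print(word, pos)
```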

Unstructured Information Management Architecture (UIMA)
•It is an open-source platform from IBM, which
integrates different kinds of analysis engines to
provide a complete solution for knowledge
discovery from USD.
•It bridges the gap between structured data and USD.

Uses of UIMA
•Used to convert unstructured data such as
repair logs and service notes
into relational tables.
•These tables can then be used
by automated tools to detect maintenance or
manufacturing problems.

Uses of UIMA
•Used in medical contexts to analyze clinical notes,
for example by the Clinical Text Analysis and Knowledge
Extraction System (Apache cTAKES).
•cTAKES is an open-source Natural Language
Processing (NLP) system that extracts clinical
information from electronic health/medical
record free text (users are free to type whatever
they want in any form).

UIMA block diagram
1.USD: Acquired from various sources
2.Analysis: Subjected to semantic analysis
3.Transformed into structured information
4.Delivery: Structured information access, query and presentation
5.Users

Web Scraping
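Web scraping means extracting data from web pages programmatically. A minimal sketch using only Python's standard `html.parser` module is shown here; the page content and the `LinkCollector` class are made up for this example, and the HTML is kept in a string so nothing is downloaded.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# A hypothetical page; a real scraper would first fetch this over HTTP.
page = (
    "<html><body><h1>Reports</h1>"
    '<a href="/q1.pdf">Q1 report</a> <a href="/q2.pdf">Q2 report</a>'
    "</body></html>"
)

collector = LinkCollector()
collector.feed(page)
links = collector.links
print(links)  # [('Q1 report', '/q1.pdf'), ('Q2 report', '/q2.pdf')]
```

The result is a small structured table (text, URL) pulled out of an unstructured page.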

Big Data
•Big data is a term that describes large, hard-to-manage
volumes of data, both structured
and unstructured, that none of the traditional data
management tools can store or process
efficiently.
•Experts now predict that 74 zettabytes of
data will be in existence by 2021.

Big Data
•Every day, we create 2.5 quintillion (10^18)
bytes of data; 90% of the data in the world
today has been created in the last two years
alone.
•This data comes from everywhere: sensors
used to gather climate information, posts to
social media sites, digital pictures and videos,
purchase transaction records, cell phone
GPS signals, WhatsApp, IoT and so on.

Characteristics of Data
•Composition: Deals with the structure of data, i.e.,
sources of data, the granularity (e.g., postal
address), the types, and the nature of data (static or real-time).
•Condition: Deals with the state of data, that is,
“Can one use the data as it is for analysis?” or “Does it
require cleansing for further enhancement and
enrichment?”.

Characteristics of Data
•Context: Deals with
–Where has this data been generated?
–Why was this data generated?
–How sensitive is this data?
–What are the events associated with this data?
–And so on.

Gartner
•Is a global research and advisory firm
providing insights, advice, and tools for
leaders in IT, Finance, HR, Customer Service
and Support.

Big data definition-Gartner
•Big data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective,
innovative forms of information
processing for enhanced insight and decision
making.
•Cost-effective and innovative forms of
information processing: Talks about embracing
new techniques and technologies to capture,
store, process, preserve, integrate and visualize
big data (3 Vs).

Definition of Big data by Gartner
•Enhanced insight and decision making: Talks
about deriving deeper, richer, and meaningful
insights and then using these insights to make
faster and better decisions to gain business value
and thus a competitive edge.

Big data formula
Data → Information → Actionable Intelligence → Better Decisions → Enhanced Business Value

Challenges with Big Data
•Capture
•Storage (Solution: Cloud computing)
•Curation (Management of data + Data retention)
•Search
•Analysis
•Transfer
•Visualization
•Privacy violations

3 Vs

3 V’s of Big data
•Data that is big in Volume, Velocity and
Variety is known as big data.

Sources of big data
•Archives: Archives of scanned documents,
customer correspondence records, patients'
health records, students' admission records,
students' assessment records and so on.
•Sensor data: Car sensors, smart electric meters,
office buildings, washing machines, other electronic
appliances and so on.
•Machine log data: Event logs, application logs,
audit logs, server logs, etc.

Sources of big data
•Public web: Wikipedia, weather, regulatory, census, etc.
•Data storage: File systems, SQL databases, NoSQL
databases (MongoDB, Cassandra) and so on.
•Media: Audio, video, image, etc.
•Docs: CSV, Word docs, PDF, PPT, XLS, etc.
•Business apps: ERP, CRM, HR, Google Docs, etc.
•Social media: Twitter, blogs, Facebook, LinkedIn,
YouTube, Instagram, etc.
•IoT

Other characteristics of big data
•Veracity and Validity: Refers to the accuracy
(quality) and correctness of the data.
•Volatility: Deals with how long the data is valid
and how long it should be stored (e.g., OTP, Aadhaar
No., PW).
•Variability: Data flows can be highly inconsistent
with periodic peaks. (In total, 7 V's of big data)

Why Big data
More data → More accurate analysis → More confidence in decision making → Greater operational efficiency, cost reduction, time
reduction, new product development, optimized
offerings, etc.

Three reasons for leveraging big data
1.Competitive advantage.
2.Decision making.
3.To create new business value out of data.

Typical data warehouse Environment

Typical Hadoop Environment
•It is different from the DW environment.
•Here data sources are web logs, images, audio,
videos, social media, doc files, PDFs, etc.

Hadoop Environment

Big data & DW coexistence
