Data
•Any data that can be processed by a digital
computer and stored as sequences of 0's and
1's (binary language) is known as digital data.
•Whenever you send an email, read a social media
post, or take pictures with your digital camera,
you are working with digital data.
•In general, data can be any character, text,
numbers, voice messages, SMS, WhatsApp
messages, pictures, sound, or video.
Data
•A byte is the basic unit of information
in computer storage and processing, and is
composed of eight bits; a kilobyte is 1,000 bytes;
one megabyte is 1,000 kilobytes. Larger units (GB, TB, PB, EB,
ZB, YB) each grow by a further factor of 1,000.
•Digitizing is the process of converting information
into digital form and is necessary for a computer to
be able to process and store the information.
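A minimal sketch of digitizing, assuming Python and UTF-8 as the text encoding (the slides name neither): it turns a short string into the bytes, and then the 0/1 sequence, that a computer would store. The helper name `to_bits` is hypothetical.

```python
def to_bits(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

message = "Hi"
encoded = message.encode("utf-8")  # digitized form: a sequence of bytes
print(len(encoded), "bytes =", len(encoded) * 8, "bits")
print(to_bits(message))
```

Each character of plain English text occupies one byte (eight bits) in this encoding; other characters may need more.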
Data
•It is an invaluable asset of any enterprise (big or small).
•Data is present internal to the enterprise and also exists
outside the firewalls of the enterprise.
•Data may be homogeneous or heterogeneous.
•Need of the hour is to
–Understand, manage, process,
–and take the data for analysis
–to draw valuable insights.
Types of digital data
1. Structured Data: data stored in the form of
rows and columns (databases, Excel)
2. Unstructured Data: No pre-defined schema
(PPTs, images, videos, PDFs)
3. Semi-structured Data: Hybrid schema (JSON,
HTML, XML, email, and so on)
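The three types can be illustrated with made-up customer data held in three forms. This is only a sketch: the records, field names, and variable names are invented for illustration, and CSV text stands in for a database table or Excel sheet.

```python
import csv
import io
import json

# Structured: fixed columns; every row has the same fields in the same order.
structured = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing keys, but records need not share fields.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha", "phones": ["111", "222"]},'
    ' {"id": 2, "name": "Ravi", "email": "ravi@example.com"}]'
)

# Unstructured: free text with no schema; a program must interpret it.
unstructured = "Asha from Pune called yesterday and asked about her order."

print(rows[0]["city"])                     # direct column lookup
print(sorted(semi_structured[1].keys()))   # fields differ per record
print(len(unstructured.split()), "words")  # only crude measures are easy
```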
Distribution of digital data (in %)
(by Gartner)
•Unstructured: 80
•Semi-structured: 10
•Structured: 10
Structured Data
•Data which is in an organized form (In rows & columns).
•Computer programs can use this data easily.
•Relationships exist between entities of data.
•Example
–Data stored in databases
–ERP
–CRM
–DW
–Data Cube
Structured Data
•Descriptions for all entities in a group
•Have the same defined format
•Have a predefined length
•Follow the same order.
Example
Sources of Structured Data
•OLTP systems
•Excel
•Databases
Ease with structured data
•Security
•Insert/Update/Delete
•Scalability
•Transaction processing (ACID)
•Indexing/Searching
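Some of these conveniences can be tried out with a small sketch, assuming Python's built-in `sqlite3` module as a stand-in for an RDBMS. The `customer` table and its rows are invented for illustration; security and scalability are not demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")  # indexing/searching

# Insert / update / delete
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Lina", "Pune")],
)
conn.execute("UPDATE customer SET city = ? WHERE id = ?", ("Mumbai", 2))
conn.execute("DELETE FROM customer WHERE id = ?", (3,))
conn.commit()

# Transaction processing: the failing batch is rolled back as a unit (atomicity).
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO customer (id, name, city) VALUES (4, 'Omar', 'Goa')")
        conn.execute("INSERT INTO customer (id, name, city) VALUES (1, 'Dup', 'Goa')")  # duplicate key
except sqlite3.IntegrityError:
    pass

remaining = conn.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()
print(remaining)
```

Because the second insert in the batch violates the primary key, the first insert of that batch is undone as well, so the table is left as it was before the batch started.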
Database (RDBMS)
•Oracle Corp. –Oracle
•IBM –DB2, IBM-Informix
•Microsoft –SQL Server
•EMC –Greenplum
•Teradata –Teradata
•Open source –MySQL, PostgreSQL
•SQLite
•Sequel Pro
•Amazon Aurora
•SAP SQL Anywhere, SAP IQ (Sybase)
Semi-structured Data
•Data which does not conform to a data model but has
some structure.
•Computer programs cannot use this data easily.
•Example
–emails
–XML
–HTML
–JSON, and so on.
Semi-structured data (SSD)
•It is referred to as a self-describing structure.
•It is a form of structured data that does not
conform with the formal structure of data models
associated with relational databases or other
forms of data tables.
•It uses metadata and tags to provide semantic
information.
Characteristics of semi-structured data
(SSD)
•Does not conform to a data model
•Cannot be stored in the form of rows and columns
as in a database.
•Tags and elements are used to describe the data.
•Attributes in a group may not be the same.
•Similar entities are grouped.
•Size of the same attributes in a group may differ.
•Type of the same attributes in a group may differ.
•Evolving schema
•Schema and data are tightly coupled.
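Several of these characteristics can be seen in a small XML fragment. The sketch below assumes Python's standard `xml.etree.ElementTree` module; the patient records and tag names are hypothetical. The tags describe the data, the two entities in the same group carry different attributes, and the `age` values differ in type.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: same group, but not the same set of fields.
doc = """
<patients>
  <patient>
    <name>A. Kumar</name>
    <age>42</age>
    <hemoglobin unit="g/dL">13.5</hemoglobin>
  </patient>
  <patient>
    <name>Meera S.</name>
    <age>about 30</age>
    <diagnosis>Follow-up needed</diagnosis>
  </patient>
</patients>
"""

root = ET.fromstring(doc)
patients = root.findall("patient")
for patient in patients:
    # Collect whatever fields this particular record happens to have.
    fields = {child.tag: child.text for child in patient}
    print(fields)

tag_sets = [{child.tag for child in p} for p in patients]
shared = sorted(tag_sets[0] & tag_sets[1])     # fields both records have
differing = sorted(tag_sets[0] ^ tag_sets[1])  # fields only one record has
print("shared:", shared)
print("differing:", differing)
```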
Sources of SSD
•Email
•XML
•TCP/IP
•Zipped files
•Mark-up languages
•Integration of data from heterogeneous sources.
Example: Email format
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images, etc.><Name>
ABC Healthcare Blood Test Report
Date: <>
Department: <>
Patient Name: <>
Attending Doctor: <>
Hemoglobin content: <>
Patient Age: <>
RBC count: <>
WBC count: <>
Platelet count: <>
Diagnosis: <notes>
Conclusion: <notes>
XML & JSON
Integration of data from heterogeneous sources
•User
•Mediator: Uniform access to multiple data sources
•Sources: OODBMS, RDBMS, legacy system, structured file
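The mediator idea can be sketched as a thin layer that exposes one query method over differently shaped sources. This is an illustrative assumption, not a standard API: the `Mediator` class, its methods, and the sample data are invented, with in-memory SQLite standing in for an RDBMS and CSV text standing in for a structured file.

```python
import csv
import io
import sqlite3

class Mediator:
    """Gives callers one lookup method over several differently shaped sources."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        # `fetch` is any callable that returns a list of dicts.
        self._sources[name] = fetch

    def query(self, predicate):
        """Return (source_name, record) pairs for every matching record."""
        return [
            (name, rec)
            for name, fetch in self._sources.items()
            for rec in fetch()
            if predicate(rec)
        ]

# Source 1: a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patient (name TEXT, city TEXT)")
db.execute("INSERT INTO patient VALUES ('Asha', 'Pune')")

def from_rdbms():
    cur = db.execute("SELECT name, city FROM patient")
    return [{"name": n, "city": c} for n, c in cur]

# Source 2: a structured file.
CSV_TEXT = "name,city\nRavi,Pune\nLina,Delhi\n"

def from_file():
    return list(csv.DictReader(io.StringIO(CSV_TEXT)))

mediator = Mediator()
mediator.register("rdbms", from_rdbms)
mediator.register("file", from_file)
in_pune = mediator.query(lambda rec: rec["city"] == "Pune")
print(in_pune)
```

The user writes one predicate and never needs to know which source a record came from or how that source is stored.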
Getting to know Unstructured data
•Over the past few days, Dr. Ben and Dr. Stanley
had been exchanging long emails about a
particular case of gastro-intestinal problem.
•The emails describe the procedure practiced by Dr. Stanley
and the combination of drugs that has successfully
cured gastro-intestinal disorders in patients.
•Dr. Mark has a patient in the “GoodLife”
emergency unit with a quite similar case of
gastro-intestinal disorder.
Unstructured Data
•Unstructured data refers to data that lacks any
specific form or structure.
•This makes it very difficult and time-consuming to
process and analyze unstructured data.
•Data which does not conform to any data model is unstructured data (USD).
•Computer programs cannot use this data directly.
•About 80-90% of an organization's data is in this format.
•An enormous amount of knowledge is hidden in this
data.
•Hence, finding useful knowledge/insight from USD is
crucial.
Unstructured Data
•Two types:
1. Bitmap objects: image, video, or audio files
2. Textual objects: Word documents, emails, PPTs, and so on.
Unstructured Data
•Example
–Memos, QR code (Quick Response), Blogs
–Chat rooms, Tweets, Comments, likes, tags
–PPTs, emojis, emoticons (emotion icons)
–Images, log files, social media posts
–Videos, sensor data (raw), weather data
–Doc files, geospatial data, surveillance data
–Body of email, GPS data, etc.
–WhatsApp messages, CCTV footage and so on.
Getting to know Unstructured data
Characteristics of Unstructured data
•This data cannot be stored in the form of rows
and columns as in a database and does not
conform to any data model.
•It is difficult to determine the meaning of the
data.
•It does not follow any rule or semantics, i.e., it is not
in any particular format or sequence.
•Not easily usable by a program.
Sources of Unstructured data
•Web pages
•Audio and Videos
•Images
•Body of an email
•Word document
•PPT and reports
•Chats and text messages
•Social media data
•White papers
•Surveys
•SMS
•Free form text
•Server Log files
•Product reviews
Web page is unstructured data
Web page contents: multimedia, image, database, text, XML
Text analytics or text mining
•It is the process of converting
unstructured text data into meaningful data for
analysis, to measure customer opinions, product
reviews, feedback, and sentiment, in order to
support fact-based decision making.
•Uses many linguistic, statistical, and machine
learning techniques such as clustering, pattern
recognition, tagging, association analysis,
predictive analytics, etc.
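A toy sketch of the idea, assuming Python and a tiny hand-made word list in place of the linguistic and machine learning techniques named above: it turns free-text reviews into term counts and a crude sentiment score. The reviews, the lexicon, and the function names are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical product reviews and a tiny hand-made sentiment lexicon.
reviews = [
    "Great phone, the battery life is great and the camera is good.",
    "Terrible battery. The screen is good but the phone is slow.",
]
POSITIVE = {"great", "good"}
NEGATIVE = {"terrible", "slow"}
STOPWORDS = {"the", "is", "and", "but", "a"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def sentiment(tokens):
    """Score = number of positive words minus number of negative words."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

term_counts = Counter()
for review in reviews:
    tokens = tokenize(review)
    term_counts.update(t for t in tokens if t not in STOPWORDS)
    print(sentiment(tokens), review)

print(term_counts.most_common(3))
```

The output is structured (a score per review and a table of term frequencies), which is what makes the original free text usable for analysis.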
Text analytics or text mining
•It helps organizations to find potentially valuable
business insights in corporate documents, customer
emails, call center logs, survey comments, social
network posts, medical records, and other sources of
text-based data.
•Text mining capabilities are also being incorporated
into AI chatbots/virtual agents that companies deploy
to provide automated responses to customers as part
of their marketing, sales, and customer service
operations.
UIMA block diagram
•USD: Acquired from various sources
•Analysis: Subjected to semantic analysis
•Transformed into structured information
•Delivery: Structured information access, query and presentation
•Users
Web Scraping
Big Data
•Big data is a term that describes large,
hard-to-manage volumes of data (both structured
and unstructured) that traditional data
management tools cannot store or process
efficiently.
•Experts now predict that 74 zettabytes of
data will be in existence by 2021.
Big Data
•Every day, we create 2.5 quintillion (2.5 × 10^18)
bytes of data; 90% of the data in the world
today has been created in the last two years
alone.
•This data comes from everywhere: sensors
used to gather climate information, posts to
social media sites, digital pictures and videos,
purchase transaction records, cell phone
GPS signals, WhatsApp, IoT, and so on.
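To put the daily figure in perspective, here is a small worked calculation, assuming Python and decimal units in which each step is a factor of 1,000. The function name `human_readable` is hypothetical, and the yearly total simply assumes the daily rate stays constant.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest decimal unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

per_day = 2.5e18            # 2.5 quintillion bytes per day
per_year = per_day * 365    # assumes a constant daily rate
print(human_readable(per_day))
print(human_readable(per_year))
```

At that rate a year of data stays just under one zettabyte, which gives a sense of scale for figures quoted in zettabytes.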
Characteristics of Data
•Composition: Deals with the structure of data, i.e.,
sources of data, the granularity (e.g., postal
address), the types, and the nature of data (static or
real-time).
•Condition: Deals with the state of data, that is,
“Can one use the data as it is for analysis?” or “Does it
require cleansing for further enhancement and
enrichment?”.
Characteristics of Data
•Context: Deals with
–Where has this data been generated?
–Why was this data generated?
–How sensitive is this data?
–What are the events associated with this data?
–And so on.
Big data definition-Gartner
•Big data is high-volume, high-velocity, and
high-variety information assets that demand
cost-effective, innovative forms of information
processing for enhanced insight and decision
making.
•Cost-effective and innovative forms of
information processing: Talks about embracing
new techniques and technologies to capture,
store, process, preserve, integrate, and visualize
big data (3 Vs).
Definition of Big data by Gartner
•Enhanced insight and decision making: Talks
about deriving deeper, richer, and meaningful
insights and then using these insights to make
faster and better decisions to gain business value
and thus a competitive edge.
Big data formula
Data → Information → Actionable Intelligence → Better Decisions → Enhanced Business Value
Challenges with Big Data
•Capture
•Storage (Solution: Cloud Computing)
•Curation (Management of data + Data retention)
•Search
•Analysis
•Transfer
•Visualization
•Privacy violations
3 Vs
3 V’s of Big data
•Data that is big in Volume, Velocity, and
Variety is known as big data.
Sources of big data
•Archives: Archives of scanned documents,
customer correspondence records, patients'
health records, students' admission records,
students' assessment records, and so on.
•Sensor data: Car sensors, smart electric meters,
office buildings, washing machines, other electronic
appliances, and so on.
•Machine log data: Event logs, application logs,
audit logs, server logs, etc.
Sources of big data
•Public web: Wikipedia, weather, regulatory, census, etc.
•Data storage: File systems, SQL databases, NoSQL
databases (MongoDB, Cassandra), and so on.
•Media: Audio, video, image, etc.
•Docs: CSV, Word docs, PDF, PPT, XLS, etc.
•Business apps: ERP, CRM, HR, Google Docs, etc.
•Social media: Twitter, blogs, Facebook, LinkedIn,
YouTube, Instagram, etc.
•IoT
Other characteristics of big data
•Veracity and Validity: Refer to the accuracy
(quality) and correctness of the data.
•Volatility: Deals with how long the data is valid
and how long it should be stored (e.g., OTP, Aadhaar
No., password).
•Variability: Data flows can be highly inconsistent,
with periodic peaks. (In total, 7 V's of big data.)
Why Big data
More data → More accurate analysis → More confidence in decision making →
Greater operational efficiency, cost reduction, time
reduction, new product development, optimized
offerings, etc.
Three reasons for leveraging big data
1. Competitive advantage.
2. Decision making.
3. To create new business value out of data.