Big data Analytics

5,548 views 73 slides Nov 14, 2021

About This Presentation

Types of Big data analytics


Slide Content

CHAPTER 02: Big Data Analytics

BDA
•Big data analytics is the process of examining large and
varied data sets (i.e., big data) to uncover
–hidden patterns,
–unknown correlations,
–market trends,
–customer preferences,
–meaningful, actionable information and insights to make
faster and better decisions (informed decisions).

BDA
•Data analytics helps to slice and dice the data to extract
insights that allow an organization to leverage this data
for a competitive advantage.
•BDA supports a 360-degree view of the customer
(for example, clickstream data, which is unstructured).

BDA
•Businesses can use advanced analytics techniques such
as
–text analytics,
–machine learning,
–predictive analytics,
–data mining, statistics,
–and natural language processing to gain new insights from
previously untapped data sources together with existing enterprise data.

BDA
•It is technology-enabled analytics to process and
analyze big data.
•BDA is about gaining a meaningful, deeper, and
richer insight into the business to steer it in the right
direction: understanding the customer's
demographics to cross-sell and up-sell to them,
better leveraging the services of vendors and
suppliers, etc.

BDA
•BDA is a tight handshake between three
communities: IT, business users, and data scientists.
•BDA is working with data sets whose volume and
variety exceed the current storage, processing
capabilities, and infrastructure of an enterprise.
•BDA is about moving code to data, because programs
for distributed processing are tiny (a few KBs) compared
to the data (TB, PB, EB, ZB, and YB).
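
The "moving code to data" idea above can be sketched in a few lines. This is a single-process illustration only: the `Node` class, `run_local`, and `count_errors` are hypothetical names, and the "nodes" are plain Python objects rather than a real cluster. Each node runs a small function against its own partition and returns only a small result.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node that holds one data partition."""
    name: str
    partition: List[str]

    def run_local(self, task: Callable[[List[str]], int]) -> int:
        # The (tiny) task travels to the node; the (large) partition never leaves it.
        return task(self.partition)


def count_errors(lines: List[str]) -> int:
    """The 'code' being shipped: very small compared to the data it scans."""
    return sum(1 for line in lines if "ERROR" in line)


nodes = [
    Node("node-1", ["INFO start", "ERROR disk full", "INFO done"]),
    Node("node-2", ["ERROR timeout", "ERROR retry failed"]),
    Node("node-3", ["INFO heartbeat"]),
]

# Only one integer per node crosses the "network"; the totals are combined centrally.
partial_counts = [node.run_local(count_errors) for node in nodes]
total_errors = sum(partial_counts)
print(partial_counts, total_errors)
```

In a real deployment the partitions would be terabytes on separate machines, which is why shipping the function is cheaper than shipping the data.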

Types of USD available for analysis
Figure: data sources feeding BDA (RFID, CRM, ERP, POS, websites, social media).

What big data analytics is not?

Why the sudden hype around BDA?
1.Data is growing at a 40% compound annual rate,
reaching nearly 74 ZB by 2021. The volume of
business data worldwide is expected to double
every 1.2 years.
–Examples:
–Wal-Mart: processes one million customer
transactions per hour.
–Twitter: 500 million tweets per day.
–Facebook: users post 2.7 billion “Likes” and
comments in a day.
Why the sudden hype around BDA?
2.Cost per gigabyte of storage has dropped hugely.
3.An overwhelming number of user-friendly
analytics tools are available in the market today.
4.Three digital accelerators: bandwidth, digital
storage, and processing power.

Classification of Analytics
1.First school of thought
2.Second school of thought

First school of thought
•Classified analytics into:
–Basic analytics
–Operationalized analytics
–Advanced analytics
–Monetized analytics.
•To monetize: convert into or express in the form of currency.

First school of thought
•Basic analytics:
–Deals with slicing and dicing of data to help with basic
business insights.
–Reporting on historical data, basic visualization, etc.
•Operationalized analytics:
–Focuses on the integration of analytics into business
units in order to take specific action on insights.
–Analytics are integrated into both production
technology systems and business processes.

First school of thought
•Advanced analytics:
–Forecasting for the future by predictive and
prescriptive modeling.
•Monetized analytics:
–Used to derive direct business revenue.
–The act of generating measurable economic benefits
from available data sources.

Six ways to indirectly monetize your
data
1.Reduce costs
2.Enhance your product or service
3.Enter new business sectors or tap into new
types of customers
4.Develop new products, services, or markets.
5.Drive sales and marketing
6.Improve productivity and efficiencies.

Second school of thought

Classification of analytics

Analytics 1.0, 2.0 and 3.0

Analytics 1.0 (1950-2009)
•Descriptive statistics.
•Report on events, occurrences, etc. of the past.
•Key questions:
–What happened?
–Why did it happen?
•Data:
–CRM, ERP, or 3rd-party applications
–Small and structured data sources, DW and Data mart
–Internally sourced.
–Relational databases.

Analytics 2.0 (2005-2012)
•Descriptive + predictive statistics
•Uses data from the past to make predictions for the
future.
•Key questions:
–What will happen?
–Why will it happen?
•Data:
–Big data (SD, USD and SSD)
–Massive parallel servers running Hadoop
–Externally sourced.
–DB applications, Hadoop clusters, Hadoop environment.

Analytics 3.0 (2012 to present)
•Descriptive + predictive + prescriptive statistics.
•Uses data from the past to make predictions for the future and at
the same time make recommendations to leverage the situations
to one’s advantage.
•Key questions:
–What will happen?
–When will it happen?
–Why will it happen?
–What action should be taken?
•Data:
–Big data (SD, USD and SSD) in real-time.
–Internally + Externally sourced.
–In-memory analytics, machine learning, agile analytical methods.
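
The progression from descriptive to predictive to prescriptive analytics can be illustrated with a toy calculation. This is a sketch under stated assumptions: the monthly sales figures, the `CAPACITY` threshold, and the recommended actions are all made up for illustration, and the "model" is a plain least-squares line.

```python
from statistics import mean

# Made-up monthly sales figures, purely for illustration.
sales = [100.0, 110.0, 120.0, 130.0]
months = list(range(len(sales)))  # 0, 1, 2, 3

# Descriptive: what happened?
average_sales = mean(sales)

# Predictive: what will happen? Fit a least-squares line y = a + b*x by hand.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast_next = intercept + slope * len(sales)

# Prescriptive: what action should be taken? A hypothetical rule on top of the forecast.
CAPACITY = 135.0  # assumed stock capacity, not taken from the slides
action = "increase stock" if forecast_next > CAPACITY else "keep stock level"

print(average_sales, forecast_next, action)
```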

Challenges that prevent business from
capitalizing on big data
1.Getting the business units to share information across
organizational silos.
2.Finding the right skills (business analysts and data scientists)
that can manage large amounts of SD, SSD, and USD and create
insight from it.
3.The need to address the storage and processing of the large
volume, velocity, and variety of big data.
4.Deciding whether to use SD or USD, internal or external data to
make business decisions.
5.Choosing the optimal way to report findings and analysis of big
data so that the presentation makes the most sense.
6.Determining what to do with insights created from big data.

Top challenges facing big data
1.The practical issues of storing all the data (scale).
2.Security: lack of authentication and authorization.
3.Data schema (no fixed and rigid schema) leads to a
dynamic schema.
4.Consistency / eventual consistency.
5.Availability (24x7) (failure transparency).
6.Partition tolerance (both H/W and S/W).
7.Validating big data (data quality): accuracy,
completeness, and timeliness.

Importance

Importance of BDA
•Cost reduction: cost-effective storage system for
huge data sets. [Hadoop and cloud-based
analytics]
•Faster, better decision making: provides ways to
analyze information quickly and make decisions.
•[Hadoop processing speed and in-memory
analytics, combined with the ability to analyze new
sources of data]

Importance of BDA
•New products and services: evaluation of
customer needs and satisfaction through BDA.
–Businesses can create new products to meet
customers' needs.
–With BDA, more companies are creating new
products, along with new revenue opportunities,
more effective marketing, and better customer
service.
–Example: automated cars, healthcare apps,
entertainment apps, bank apps, govt. apps.

Technologies needed to meet challenges
posed by big data
1.Cheap and abundant storage.
2.Faster processors to help with quicker processing of
big data.
3.Affordable open-source, distributed big data
platforms such as Hadoop.
4.Parallel processing, clustering, virtualization, large
grid environments, high connectivity, high
throughput, and low latency.
5.Cloud computing and other flexible resource
allocation arrangements.

Data Science
•Data science is an interdisciplinary field that uses
scientific methods, processes, algorithms, and
systems to extract knowledge and insights
from structured and unstructured data.
•It employs techniques and theories drawn from
many fields within the broad areas
of mathematics, statistics, and information technology,
including machine learning, probability
models, classification, cluster analysis, data
mining, databases, pattern recognition,
and visualization.

Data Science
•Data science is primarily used to make
predictions and decisions using
predictive and prescriptive analytics and machine
learning.
•Prescriptive analytics is all about providing
advice. It not only predicts but also suggests a range
of prescribed actions and associated outcomes.

Data science –development of data
product
•A "data product" is a technical asset that: (1)
utilizes data as input, and (2) processes that data
to return algorithmically generated results.
•The classic example of a data product is a
recommendation engine, which ingests user data
and makes personalized recommendations based
on that data.
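
A recommendation engine of the kind described above can be sketched with simple co-occurrence counting. This is a toy illustration, not how any production recommender works: the purchase histories, the `recommend` function, and the overlap-based scoring rule are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Set

# Made-up purchase histories; user and item names are hypothetical.
purchases: Dict[str, Set[str]] = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"laptop", "monitor"},
    "dave": {"mouse", "keyboard"},
}


def recommend(user: str, history: Dict[str, Set[str]], top_n: int = 2) -> List[str]:
    """Suggest items bought by users who share at least one item with `user`."""
    owned = history[user]
    scores: Counter = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(owned & items)  # similarity = number of shared items
        if overlap == 0:
            continue
        for item in items - owned:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("bob", purchases))
```

The function matches the two-part definition of a data product: it takes user data as input and returns an algorithmically generated result.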

Examples of data products:
•Amazon's recommendation engines suggest
items for you to buy. Netflix recommends movies
to you. Spotify recommends music to you.
•Gmail's spam filter is a data product: an algorithm
behind the scenes processes incoming mail and
determines if a message is spam or not.
•Computer vision used for self-driving cars is also a
data product: machine learning algorithms are
able to recognize traffic lights, other cars on the
road, pedestrians, etc.

The role of Data Science
•Self-driving cars collect live data from sensors,
including radars, cameras, and lasers, to create a map of their
surroundings. Based on this data, the car makes decisions such as
when to speed up, when to slow down, when to
overtake, and where to take a turn, making use of advanced
machine learning algorithms.
•Data science can be used in predictive analytics: data from
ships, aircraft, radars, and satellites can be collected and
analyzed to build models. These models will not only
forecast the weather but also help in predicting the
occurrence of natural calamities.

Domains of Data Science
Infographic

Difference between Data Analysis and
Data Science
•Data analysis includes descriptive analytics and
prediction to a certain extent.
•On the other hand, data science is more about
predictive analytics and machine learning.
•Data science is a more forward-looking approach, an
exploratory way with the focus on analyzing past or
current data and predicting future outcomes with
the aim of making informed decisions.

Use cases for Data Science
•Internet search
•Digital advertisements
•Recommender systems
•Image recognition
•Speech recognition
•Gaming
•Price comparison websites
•Airline route planning
•Fraud and risk detection
•Medical diagnosis, etc.

Data Science is multi-disciplinary

Data Science process
1.Collecting raw data from multiple disparate data
sources.
2.Processing the data.
3.Integrating the data and preparing clean data sets.
4.Engaging in exploratory data analysis using models (ML
models) and algorithms.
5.Preparing presentations using data visualization.
6.Communicating the findings to all stakeholders.
7.Making faster and better decisions.

Business Acumen (wisdom) skills of
Data Scientist
•Understanding of domain
•Business strategy
•Problem solving
•Communication
•Presentation
•Thirst for knowledge

Technology Expertise of Data Scientist
•Good database knowledge, such as RDBMS.
•Good NoSQL database knowledge, such as MongoDB,
Cassandra, HBase, etc.
•Languages such as Java, Python, R, C++, etc.
•Open-source tools such as Hadoop.
•Data warehousing.
•Data mining, pattern recognition, algorithms.
•Excellent understanding of machine learning techniques
and algorithms, such as K-means, regression, kNN, Naive
Bayes, SVM, PCA, and decision trees, as well as Tableau, Flare,
Google visualization APIs, text analytics, DL, NLP, AI, etc.

Mathematics Expertise of Data Scientist
•Probability
•Statistics
•Linear algebra
•Calculus

Data Scientist
•A data scientist is a professional responsible for
collecting, analyzing, and interpreting large
amounts of data to identify ways to help a
business improve operations and gain a
competitive edge over competitors.
•They're part mathematician, part
computer scientist, and part trend-spotter.

Responsibilities of Data Scientist
1.Prepare and integrate large and varied data sets
and develop relevant data sets for analysis.
2.Thoroughly clean and prune data to discard
irrelevant information.
3.Apply business/domain knowledge to provide
context.
4.Employ a blend of analytical techniques to
develop models and algorithms to understand
the data, interpret relationships, spot trends,
and unveil patterns.

Responsibilities of Data Scientist
5.Communicate or present findings or results in
the business context in a language that is
understood by the different business
stakeholders.
6.Invent new algorithms to solve problems and
build new tools to automate work. (IIT)

Responsibilities of Data Scientist
7.Employ sophisticated analytics programs,
machine learning, and statistical methods to
prepare data for use in predictive and
prescriptive modeling.
8.Explore and examine data from a variety of
angles to determine hidden weaknesses, trends,
and/or opportunities.

Terminologies/Technologies used in big
data environment
1.In-memory analytics
2.In-database processing
3.Symmetric multiprocessor system (SMP)
4.Massively parallel processing
5.Parallel and distributed systems
6.Shared nothing architecture

In-memory analytics
•In-memory analytics is an approach to querying data
that resides in a computer's random access
memory (RAM), as opposed to querying data that is
stored on physical disks.
•This results in reduced query response times, allowing
analytic applications to support faster business
decisions.
•In-memory analytics is achieved through adoption of
64-bit architectures, which can handle more memory and
larger files compared to 32-bit.
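
The 32-bit versus 64-bit point above comes down to address-space arithmetic, worked through below. The figures are theoretical pointer limits only; real processors and operating systems typically expose fewer usable address bits than 64.

```python
GIB = 2 ** 30  # bytes in one gibibyte

addressable_32 = 2 ** 32  # bytes addressable with 32-bit pointers
addressable_64 = 2 ** 64  # theoretical limit with 64-bit pointers

# 32-bit: at most 4 GiB of directly addressable memory.
print(addressable_32 // GIB)

# 64-bit: 2**34 GiB (16 EiB) in theory, far beyond any data set held in RAM today.
print(addressable_64 // GIB)
```

A 4 GiB ceiling is too small to hold most analytic data sets in RAM, which is why in-memory analytics became practical only with 64-bit systems.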

In-database processing
•In-database analytics is a technology that allows data
processing to be conducted within the database by
building analytic logic into the database itself.
•In-database processing, sometimes referred to as
in-database analytics, refers to the integration of
data analytics into data warehousing functionality.
•It eliminates the time and effort required to transform
data and move it back and forth between a database and
a separate analytics application.
•Example: credit card fraud detection, bank risk
management, etc.
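
The difference between in-database processing and a separate analytics application can be sketched with Python's built-in `sqlite3` module. The table, the sample transactions, and the "largest amount exceeds twice the card's average" rule are assumptions made for illustration; they are not a real fraud-detection method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (card_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("c1", 20.0), ("c1", 25.0), ("c1", 900.0), ("c2", 40.0), ("c2", 45.0)],
)

# Separate-application style: every row leaves the database first.
by_card = {}
for card_id, amount in conn.execute("SELECT card_id, amount FROM transactions"):
    by_card.setdefault(card_id, []).append(amount)
flagged_in_app = sorted(
    card for card, amounts in by_card.items()
    if max(amounts) > 2 * (sum(amounts) / len(amounts))
)

# In-database style: the same rule runs inside the engine; only results come back.
flagged_in_db = [
    row[0]
    for row in conn.execute(
        "SELECT card_id FROM transactions "
        "GROUP BY card_id HAVING MAX(amount) > 2 * AVG(amount) "
        "ORDER BY card_id"
    )
]

print(flagged_in_app, flagged_in_db)
conn.close()
```

Both approaches flag the same cards, but the second moves only the answer out of the database rather than every transaction row.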

Symmetric multiprocessor system (SMP)
•SMP (symmetric multiprocessing) is the processing of
programs by multiple processors that share a common
operating system and memory.
•In symmetric (or "tightly coupled") multiprocessing, the
processors share memory and the I/O bus or data path.
•A single copy of the operating system is in charge of all
the processors (homogeneous).

Massively parallel processing (MPP)
•MPP (massively parallel processing) is the
coordinated processing of a problem (program) by
multiple processors that work on different parts of the
program in parallel, with each processor having its own
operating system and dedicated memory.
•The MPP processors communicate using message
passing.
•Typically, the setup for MPP is more complicated,
requiring thought about how to partition a common
database among processors and how to assign work
among the processors. An MPP system is also known as
a "loosely coupled" or "shared nothing" system.

Massively parallel processing (MPP)
•An MPP system is considered better than a
symmetric multiprocessing (SMP) system for applications
that allow a number of databases to be searched in
parallel. These include decision support systems, data
warehouses, and big data applications.

Massively parallel processing

Parallel Systems

Distributed Systems

Shared nothing architecture

Shared Nothing architecture (SNA)
•A shared nothing architecture (SN) is a distributed
computing architecture in which each node is independent
and self-sufficient. More specifically, none of the nodes share
memory or disk storage.
•Shared-nothing is often called massively parallel processor
(MPP). Many research prototypes and commercial products
have adopted the shared-nothing architecture because it has
the best scalability.
•In the shared-nothing architecture, each node is made of a
processor, main memory, and disk and communicates with
other nodes through the interconnection network. Each node
is under the control of its own copy of the operating system.
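
The shared-nothing idea above can be sketched as a set of nodes that each own private storage and interact only through messages. This is a single-process toy: the `Node` and `Cluster` classes, the message format, and the CRC32-based hash partitioning are all assumptions chosen for illustration, not the design of any particular product.

```python
import zlib
from typing import Dict, List, Optional


class Node:
    """Hypothetical node: owns its storage outright; nothing is shared."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._store: Dict[str, str] = {}  # private "disk" of this node

    def handle(self, message: dict) -> Optional[str]:
        # Nodes interact only through messages, never through shared memory.
        if message["op"] == "put":
            self._store[message["key"]] = message["value"]
            return None
        return self._store.get(message["key"])

    def size(self) -> int:
        return len(self._store)


class Cluster:
    def __init__(self, names: List[str]) -> None:
        self.nodes = [Node(n) for n in names]

    def _owner(self, key: str) -> Node:
        # Deterministic hash partitioning: each key has exactly one owner.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key: str, value: str) -> None:
        self._owner(key).handle({"op": "put", "key": key, "value": value})

    def get(self, key: str) -> Optional[str]:
        return self._owner(key).handle({"op": "get", "key": key})


cluster = Cluster(["n0", "n1", "n2"])
for i in range(9):
    cluster.put(f"user:{i}", f"profile-{i}")

print(cluster.get("user:4"))
print([node.size() for node in cluster.nodes])
```

Because no node depends on another node's memory or disk, capacity can be added by adding nodes, which is the scalability argument made above. A real system would also need to rebalance keys when the node count changes.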

SNA advantages
1.Fault isolation
2.Scalability
3.Absence of single point of failure
4.Self-healing capabilities

CAP Theorem

CAP Theorem
•The CAP theorem, also named Brewer's theorem after
computer scientist Eric Brewer, states that it is impossible for
a distributed data store to simultaneously provide more than
two out of the following three guarantees:
1.Consistency: every read fetches the last write.
2.Availability: every request gets a response indicating
success or failure.
3.Partition tolerance: the system continues to work despite
message loss, partial failure, or N/W partition.
•Distributed data store: a computer network where
information is stored on more than one node, often in
a replicated fashion. It is also known as a distributed
database.
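
The trade-off stated by the theorem can be made concrete with a toy two-replica store. This is a deliberately simplified sketch: `TwoReplicaStore`, its `mode` flag, and the `Unavailable` exception are hypothetical, and real systems use quorums and many more replicas. During a simulated partition, the AP variant keeps accepting writes and lets replicas diverge, while the CP variant refuses writes so that replicas stay in agreement.

```python
class Unavailable(Exception):
    """Raised when the CP store refuses a request during a partition."""


class TwoReplicaStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (hypothetical API)."""

    def __init__(self, mode: str) -> None:
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False

    def write(self, replica: int, key: str, value: str) -> None:
        if self.partitioned:
            if self.mode == "CP":
                # Cannot reach the peer, so refuse rather than diverge.
                raise Unavailable("write rejected during partition")
            self.replicas[replica][key] = value  # AP: accept locally only
        else:
            for store in self.replicas:  # healthy network: replicate everywhere
                store[key] = value

    def read(self, replica: int, key: str):
        return self.replicas[replica].get(key)


ap = TwoReplicaStore("AP")
ap.write(0, "x", "old")
ap.partitioned = True
ap.write(0, "x", "new")
# Available, but the two replicas now disagree.
print(ap.read(0, "x"), ap.read(1, "x"))

cp = TwoReplicaStore("CP")
cp.write(0, "x", "old")
cp.partitioned = True
try:
    cp.write(0, "x", "new")
except Unavailable as exc:
    # Replicas still agree, but the write request was not served.
    print("CP store:", exc)
```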

CAP Theorem

Possible combinations of CAP for databases
1.Availability and Partition Tolerance (AP)
2.Consistency and Partition Tolerance (CP)
3.Consistency and Availability (CA)
Note: Google's BigTable, Amazon's Dynamo, and
Facebook's Cassandra each use one of these combinations.

CAP Theorem
Examples

BASE
•BASE stands for Basically Available, Soft state, Eventual
consistency, and is used to achieve high availability in
distributed computing.
•Basically available indicates that the system guarantees
availability, in terms of the CAP theorem.
•Soft state indicates that the state of the system may
change over time.
•Eventual consistency indicates that the system will
become consistent over time; that is, after a certain time all
nodes become consistent.
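
Eventual consistency can be illustrated with two replicas that exchange state after the fact. This is a minimal sketch, assuming a last-write-wins rule with integer logical timestamps; the `Replica` class and its `sync_from` method are hypothetical, and real systems use more careful conflict resolution.

```python
from typing import Dict, Tuple

Versioned = Tuple[int, str]  # (logical timestamp, value)


class Replica:
    """Toy replica using last-write-wins; not a real database API."""

    def __init__(self) -> None:
        self.data: Dict[str, Versioned] = {}

    def write(self, key: str, value: str, ts: int) -> None:
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key: str) -> str:
        return self.data[key][1]

    def sync_from(self, other: "Replica") -> None:
        # Anti-entropy step: adopt any newer version the peer holds.
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)


a, b = Replica(), Replica()
a.write("cart", "1 item", ts=1)
b.sync_from(a)
a.write("cart", "2 items", ts=2)  # b has not heard about this write yet

stale = b.read("cart")  # b still serves the older value
b.sync_from(a)
a.sync_from(b)
fresh = b.read("cart")  # after syncing, both replicas agree
print(stale, fresh, a.data == b.data)
```

Between the second write and the sync, a reader of replica `b` sees stale data; once the replicas exchange state, they converge.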

Few top analytics tools
1.MS Excel
2.SAS
3.IBM SPSS Modeler
4.Statistica
5.QlikView
6.Tableau
7.R analytics
8.Weka
9.Apache Spark
10.KNIME, RapidMiner
11.Splunk, and so on.