Types of USD available for analysis
[Figure: data sources feeding BDA: RFID, CRM, ERP, POS, websites, social media]
What big data analytics is not?
Why the sudden hype around BDA?
1.Data is growing at a 40% compound annual rate,
reaching nearly 74 ZB by 2021. The volume of
business data worldwide is expected to double
every 1.2 years.
–Examples:
–Wal-Mart: Processes one million customer
transactions per hour.
–Twitter: 500 million tweets per day.
–2.7 billion “Likes” and comments are posted by
Facebook users in a day.
2.Cost per gigabyte of storage has dropped sharply.
3.An overwhelming number of user-friendly
analytics tools are available in the market today.
4.Three digital accelerators: bandwidth, digital
storage, and processing power.
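The two growth figures in item 1 describe different quantities (all data versus business data), so they are not directly comparable. As a rough sketch of what each one implies under constant compound growth, the arithmetic below converts a growth rate into a doubling time and back. The function names are illustrative, not from the slides.

```python
import math

def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)

def annual_rate_for_doubling(years_to_double: float) -> float:
    """Compound annual growth rate implied by a given doubling time."""
    return 2 ** (1 / years_to_double) - 1

# A 40% compound annual rate doubles the volume roughly every 2.06 years.
print(round(doubling_time_years(0.40), 2))

# Doubling every 1.2 years corresponds to roughly 78% growth per year.
print(round(annual_rate_for_doubling(1.2), 2))
```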
Classification of Analytics
1.First school of thought
2.Second school of thought
First school of thought
•Classifies analytics into:
–Basic analytics
–Operationalized analytics
–Advanced analytics
–Monetized analytics
•To monetize means to convert into, or express in the form of, currency.
•Basic analytics:
–Deals with slicing and dicing of data to help with basic
business insights.
–Reporting on historical data, basic visualization, etc.
•Operationalized analytics:
–Focuses on integrating analytics into business
units in order to take specific action on insights.
–Analytics are integrated into both production
technology systems and business processes.
•Advanced analytics:
–Forecasting the future through predictive and
prescriptive modeling.
•Monetized analytics:
–Used to derive direct business revenue.
–The act of generating measurable economic benefits
from available data sources.
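The "slicing and dicing" named under basic analytics can be sketched on a handful of records. This is a minimal illustration in plain Python; the sales records, field names and helper functions are invented for the example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical sales records; fields and values are invented for illustration.
sales = [
    {"region": "North", "product": "Laptop", "quarter": "Q1", "revenue": 1200},
    {"region": "North", "product": "Phone",  "quarter": "Q1", "revenue": 800},
    {"region": "South", "product": "Laptop", "quarter": "Q1", "revenue": 950},
    {"region": "South", "product": "Phone",  "quarter": "Q2", "revenue": 700},
    {"region": "North", "product": "Laptop", "quarter": "Q2", "revenue": 1100},
]

def slice_by(records, field, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in records if r[field] == value]

def dice(records, **criteria):
    """Dice: restrict several dimensions at once to sets of allowed values."""
    return [r for r in records
            if all(r[f] in allowed for f, allowed in criteria.items())]

def total_by(records, field, measure="revenue"):
    """Roll up a measure along one dimension."""
    totals = defaultdict(int)
    for r in records:
        totals[r[field]] += r[measure]
    return dict(totals)

q1 = slice_by(sales, "quarter", "Q1")
north_laptops = dice(sales, region={"North"}, product={"Laptop"})

print(total_by(q1, "region"))              # Q1 revenue per region
print(total_by(north_laptops, "quarter"))  # North laptop revenue per quarter
```

In practice this kind of reporting is usually done in SQL or a BI tool; the point here is only what the two operations mean.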
Six ways to indirectly monetize your
data
1.Reduce costs
2.Enhance your product or service
3.Enter new business sectors or tap into new
types of customers
4.Develop new products, services, or markets.
5.Drive sales and marketing
6.Improve productivity and efficiencies.
Second school of thought
Classification of analytics
Analytics 1.0, 2.0 and 3.0
Analytics 1.0 (1950-2009)
•Descriptive statistics.
•Report on events, occurrences, etc. of the past.
•Key questions:
–What happened?
–Why did it happen?
•Data:
–CRM, ERP, or 3rd-party applications
–Small and structured data sources, data warehouses (DW) and data marts
–Internally sourced.
–Relational databases.
Analytics 2.0 (2005-2012)
•Descriptive + predictive statistics
•Uses data from the past to make predictions for the
future.
•Key questions:
–What will happen?
–Why will it happen?
•Data:
–Big data (SD, USD and SSD)
–Massively parallel servers running Hadoop
–Externally sourced.
–DB applications, Hadoop clusters, Hadoop environment.
Analytics 3.0 (2012 to present)
•Descriptive + predictive + prescriptive statistics.
•Uses data from the past to make predictions for the future and at
the same time make recommendations to leverage the situations
to one’s advantage.
•Key questions:
–What will happen?
–When will it happen?
–Why will it happen?
–What action should be taken?
•Data:
–Big data (SD, USD and SSD) in real-time.
–Internally + Externally sourced.
–In-memory analytics, machine learning, agile analytical methods.
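The progression from Analytics 1.0 to 3.0 (descriptive, then predictive, then prescriptive) can be sketched on one small series. The sales numbers, the straight-line trend model and the reorder rule below are all assumptions chosen for illustration; real predictive and prescriptive models are far richer.

```python
from statistics import mean

# Hypothetical monthly unit sales; numbers are invented for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]

def describe(history):
    """Descriptive (1.0): summarize what happened."""
    return {"mean": mean(history), "min": min(history), "max": max(history)}

def fit_trend(history):
    """Least-squares line y = intercept + slope * t over t = 0, 1, 2, ..."""
    t = list(range(len(history)))
    t_bar, y_bar = mean(t), mean(history)
    numerator = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, history))
    denominator = sum((ti - t_bar) ** 2 for ti in t)
    slope = numerator / denominator
    return y_bar - slope * t_bar, slope

def predict_next(history):
    """Predictive (2.0): extrapolate the fitted trend one period ahead."""
    intercept, slope = fit_trend(history)
    return intercept + slope * len(history)

def recommend(forecast, stock_on_hand):
    """Prescriptive (3.0): turn the forecast into an action (assumed rule)."""
    shortfall = forecast - stock_on_hand
    if shortfall > 0:
        return f"reorder {round(shortfall)} units"
    return "no reorder needed"

print(describe(monthly_sales))
forecast = predict_next(monthly_sales)
print(forecast)
print(recommend(forecast, stock_on_hand=120))
```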
Challenges that prevent business from
capitalizing on big data
1.Getting the business units to share information across
organizational silos.
2.Finding the right skills (business analysts and data scientists)
that can manage large amounts of SD, SSD, and USD and create
insight from it.
3.The need to address the storage and processing of the large
volume, velocity and variety of big data.
4.Deciding whether to use SD or USD, internal or external data to
make business decisions.
5.Choosing the optimal way to report the findings and analysis of big
data so that the presentation makes the most sense.
6.Determining what to do with insights created from big data.
Top challenges facing big data
1.The practical issues of storing all the data (scale).
2.Security: lack of authentication and authorization.
3.Data schema: no fixed, rigid schema, which leads to
dynamic schemas.
4.Consistency/eventual consistency.
5.Availability (24x7) (failure transparency).
6.Partition tolerance (both hardware and software).
7.Validating big data (data quality): accuracy,
completeness and timeliness.
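Items 3 and 7 interact: when records have no fixed schema, data-quality checks have to tolerate missing and extra fields. The sketch below scores completeness and timeliness for a few schema-less records. The records, the required-field list and the 30-day freshness threshold are assumptions made up for the example; accuracy is not checked because it needs a trusted reference to compare against.

```python
from datetime import datetime, timedelta

# Hypothetical schema-less records: each may carry a different set of fields.
records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com",
     "updated": datetime(2024, 1, 10)},
    {"id": 2, "name": "Ravi",
     "updated": datetime(2023, 6, 1)},                      # no email
    {"id": 3, "email": "lee@example.com", "phone": "555-0100",
     "updated": datetime(2024, 1, 9)},                      # no name, extra field
]

REQUIRED_FIELDS = {"id", "name", "email"}  # assumed business rule
MAX_AGE = timedelta(days=30)               # assumed freshness threshold

def completeness(record):
    """Fraction of required fields that are present and non-empty."""
    present = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return present / len(REQUIRED_FIELDS)

def is_timely(record, now):
    """True if the record was updated within MAX_AGE of `now`."""
    return now - record["updated"] <= MAX_AGE

now = datetime(2024, 1, 15)
report = {r["id"]: (round(completeness(r), 2), is_timely(r, now)) for r in records}
print(report)
```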
Technologies needed to meet challenges
posed by big data
1.Cheap and abundant storage.
2.Faster processors to help with quicker processing of
big data.
3.Affordable open-source, distributed big data
platforms such as Hadoop.
4.Parallel processing, clustering, virtualization, large
grid environments, high connectivity, high
throughput and low latency.
5.Cloud computing and other flexible resource
allocation arrangements.
Data Science
•Data Science is primarily used to make
predictions and decisions using
predictive and prescriptive analytics and machine
learning.
•Prescriptive analytics is all about providing
advice. It not only predicts but also suggests a range
of prescribed actions and associated outcomes.
Data science – development of data
product
•A "data product" is a technical asset that: (1)
utilizes data as input, and (2) processes that data
to return algorithmically generated results.
•The classic example of a data product is a
recommendation engine, which ingests user data
and makes personalized recommendations based
on that data.
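A toy recommendation engine shows both halves of the definition above: it takes user data as input and returns algorithmically generated results. This is a minimal sketch using basket overlap as the score. The users, items and scoring rule are invented for illustration and are not how any production recommender works.

```python
from collections import Counter

# Hypothetical purchase histories; users and items are invented.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob":   {"laptop", "mouse"},
    "carol": {"mouse", "keyboard", "monitor"},
    "dave":  {"laptop"},
}

def recommend(user, histories, top_n=2):
    """Suggest items bought by users who share at least one item with `user`.

    Each candidate item is scored by the summed overlap between the target
    user's basket and the baskets of the other users who bought it.
    """
    own = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(own & items)
        if overlap == 0:
            continue
        for item in items - own:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

print(recommend("dave", purchases))
print(recommend("bob", purchases))
```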
Examples of data products:
•Amazon's recommendation engines suggest
items for you to buy. Netflix recommends movies
to you. Spotify recommends music to you.
•Gmail's spam filter is a data product – an algorithm
behind the scenes processes incoming mail and
determines whether a message is spam or not.
•Computer vision used for self-driving cars is also
a data product – machine learning algorithms are
able to recognize traffic lights, other cars on the
road, pedestrians, etc.
The role of Data Science
•Self-driving cars collect live data from sensors,
including radars, cameras and lasers, to create a map of their
surroundings. Based on this data, the car decides
when to speed up, when to slow down, when to
overtake and where to take a turn, making use of advanced
machine learning algorithms.
•Data Science can be used in predictive analytics: data from
ships, aircraft, radars and satellites can be collected and
analyzed to build models. These models not only
forecast the weather but also help in predicting the
occurrence of natural calamities.
Domains of Data Science
Infographic
Difference between Data Analysis and
Data Science
•Data Analysis includes descriptive analytics and
prediction to a certain extent.
•Data Science, on the other hand, is more about
predictive analytics and machine learning.
•Data Science is a more forward-looking, exploratory
approach: it analyzes past and current data to predict
future outcomes, with the aim of making informed decisions.
Use cases for Data Science
•Internet search
•Digital Advertisements
•Recommender Systems
•Image Recognition
•Speech Recognition
•Gaming
•Price Comparison Websites
•Airline Route Planning
•Fraud and Risk Detection
•Medical diagnosis, etc.
Data Science is multi-disciplinary
Data Science process
1.Collecting raw data from multiple disparate data
sources.
2.Processing the data.
3.Integrating the data and preparing clean datasets.
4.Engaging in exploratory data analysis using models (ML
models) and algorithms.
5.Preparing presentations using data visualization.
6.Communicating the findings to all stakeholders.
7.Making faster and better decisions.
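The steps above can be sketched as a small pipeline. This is only an outline of the flow from raw sources to a finding someone can act on. The two sources, their formats and every function name are hypothetical, and the "exploration" step is reduced to a single summary statistic.

```python
from statistics import mean

# Step 1: collect raw data from two hypothetical sources with different formats.
crm_rows = [{"customer": "C1", "spend": "120.50"}, {"customer": "C2", "spend": ""}]
web_rows = [("C1", 14), ("C2", 3), ("C3", 8)]  # (customer, site visits)

def process(crm, web):
    """Step 2: parse raw values into typed fields; missing spend becomes None."""
    spend = {r["customer"]: float(r["spend"]) if r["spend"] else None for r in crm}
    visits = dict(web)
    return spend, visits

def integrate(spend, visits):
    """Step 3: join the sources and keep only complete rows."""
    return [
        {"customer": c, "spend": spend[c], "visits": visits[c]}
        for c in sorted(spend.keys() & visits.keys())
        if spend[c] is not None
    ]

def explore(rows):
    """Step 4: a very small exploratory summary."""
    return {"customers": len(rows), "avg_spend": mean(r["spend"] for r in rows)}

def present(summary):
    """Steps 5 and 6: format the finding for stakeholders."""
    return (f"{summary['customers']} customer(s) with complete data, "
            f"average spend {summary['avg_spend']:.2f}")

spend, visits = process(crm_rows, web_rows)
clean = integrate(spend, visits)
print(present(explore(clean)))
```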
Business Acumen (wisdom) skills of
Data Scientist
•Understanding of domain
•Business strategy
•Problem solving
•Communication
•Presentation
•Thirst for knowledge
Technology Expertise of Data Scientist
•Good database knowledge, such as RDBMS.
•Good NoSQL database knowledge, such as MongoDB,
Cassandra, HBase, etc.
•Languages such as Java, Python, R, C++, etc.
•Open-source tools such as Hadoop.
•Data warehousing
•Data mining, pattern recognition, algorithms
•Excellent understanding of machine learning techniques
and algorithms, such as K-means, regression, kNN, Naive
Bayes, SVM, PCA and decision trees, along with visualization
tools (Tableau, Flare, Google visualization APIs), text
analytics, DL, NLP, AI, etc.
Mathematics Expertise of Data Scientist
•Probability
•Statistics
•Linear Algebra
•Calculus
Data Scientist
•A data scientist is a professional responsible for
collecting, analyzing and interpreting large
amounts of data to identify ways to help a
business improve operations and gain a
competitive edge over competitors.
•They're part mathematician, part
computer scientist and part trend-spotter.
Responsibilities of Data Scientist
1.Prepare and integrate large and varied datasets
and develop relevant datasets for analysis.
2.Thoroughly clean and prune data to discard
irrelevant information.
3.Apply business/domain knowledge to provide
context.
4.Employ a blend of analytical techniques to
develop models and algorithms to understand
the data, interpret relationships, spot trends
and unveil patterns.
5.Communicate or present findings or results in
the business context in a language that is
understood by the different business
stakeholders.
6.Invent new algorithms to solve problems and
build new tools to automate work. (IIT)
7.Employ sophisticated analytics programs,
machine learning and statistical methods to
prepare data for use in predictive and
prescriptive modeling.
8.Explore and examine data from a variety of
angles to determine hidden weaknesses, trends
and/or opportunities.
Terminologies/Technologies used in big
data environment
1.In-memory analytics
2.In-database processing
3.Symmetric multiprocessor system (SMP)
4.Massively parallel processing
5.Parallel and distributed systems
6.Shared-nothing architecture
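Several of these terms fit together: in a shared-nothing, massively parallel design, each node holds its own partition of the data in memory and works on it independently, and only the small partial results are combined. The sketch below simulates that idea on one machine. Threads stand in for nodes, and the partitioning scheme and function names are assumptions for illustration rather than a description of any particular platform.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

def partition(records, num_nodes):
    """Assign each (key, value) record to exactly one node by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[zlib.crc32(key.encode()) % num_nodes].append((key, value))
    return nodes

def local_aggregate(node_records):
    """Work done by one node, using only its own in-memory partition."""
    totals = Counter()
    for key, value in node_records:
        totals[key] += value
    return totals

def parallel_totals(records, num_nodes=3):
    """Aggregate per key by running every partition independently."""
    nodes = partition(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, nodes))
    # A key never spans two nodes, so merging is a union of disjoint results.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged

events = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(parallel_totals(events))
```

Because the hash sends every record with the same key to the same node, no node ever needs another node's memory or disk, which is the defining property of a shared-nothing layout.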
Possible combinations of CAP for databases
1.Availability and Partition Tolerance (AP)
2.Consistency and Partition Tolerance (CP)
3.Consistency and Availability (CA)
Note: Google’s BigTable, Amazon’s Dynamo and
Facebook’s Cassandra each use one of these combinations.
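The difference between the CP and AP combinations shows up only when the network partitions. The toy two-replica store below is a deliberately simplified sketch: the class names, the `mode` flag and the write-through-replica-A rule are all hypothetical, and real systems rely on quorums, versioning and conflict resolution instead.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when A and B cannot reach each other

    def write(self, key, value):
        """Clients write through replica A."""
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable during partition")
            self.a.data[key] = value  # AP: accept locally; B is now stale
            return
        self.a.data[key] = value
        self.b.data[key] = value

    def read_from_b(self, key):
        """Read from replica B, which may lag behind A in AP mode."""
        return self.b.data.get(key)

    def heal(self):
        """End the partition and copy A's state to B (eventual consistency)."""
        self.partitioned = False
        self.b.data.update(self.a.data)

ap = TwoNodeStore("AP")
ap.write("x", 1)
ap.partitioned = True
ap.write("x", 2)             # still accepted
print(ap.read_from_b("x"))   # stale value from before the partition
ap.heal()
print(ap.read_from_b("x"))   # replicas agree again
```

In CP mode the same write during the partition raises an error instead, so readers never see diverging values but writers lose availability.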