CS8091_BDA_Unit_IV_Stream_Computing

316 views 32 slides May 24, 2021

About This Presentation

Stream Computing Platforms


Slide Content

CS8091 / Big Data Analytics
III Year / VI Semester

UNIT IV - STREAM MEMORY
Introduction to Streams Concepts – Stream Data Model and
Architecture – Stream Computing, Sampling Data in a Stream –
Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments – Counting Oneness in a Window – Decaying
Window – Real Time Analytics Platform (RTAP) Applications – Case
Studies – Real Time Sentiment Analysis, Stock Market Predictions.
Using Graph Analytics for Big Data: Graph Analytics.

Stream Computing
A high-performance computer system that analyzes
multiple data streams from many sources.
Stream computing is used to mean pulling in streams of
data, processing the data, and streaming it back out as a
single flow.
It uses software algorithms that analyze the data in real
time as it streams in, to increase speed and accuracy when dealing
with data handling and analysis.

Stream Computing
Stream computing delivers real-time analytic
processing on constantly changing data in
motion.
It allows you to capture and analyze all data, all
the time, just in time.

Stream Computing
Stream computing analyzes data before you store it.
Analyze data that is in motion (Velocity)
Process any type of data (Variety)
Streams is designed to scale to process any size of
data, from terabytes to zettabytes per day.

Stream Computing
Store less
Analyze more
Make better decisions, faster

Stream Computing
Data Stream processing platforms:
Many of these are open source solutions.
These platforms facilitate the construction of real-time
applications, in particular message-oriented or
event-driven applications which support ingress of messages
or events at a very high rate, transfer to subsequent
processing, and generation of alerts.

Stream Computing
Data Stream processing platforms:
These platforms are mostly focused on supporting
event-driven data flow through nodes in a distributed
system or within a cloud infrastructure platform.
The Hadoop ecosystem covers a family of projects
that fall under the umbrella of infrastructure for
distributed computing and large data processing.

Stream Computing
Data Stream processing platforms:
Hadoop includes a number of components, and below
is the list of components:
MapReduce, a distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
Hadoop Distributed File System (HDFS), a distributed file
system that runs on large clusters of commodity machines.
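
The MapReduce processing model named above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)

def word_count(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical input records
counts = word_count(["big data stream", "stream computing", "big stream"])
```

The whole input must be available before the reduce step can finish, which is why this model suits batch jobs rather than unbounded streams.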

Stream Computing
Data Stream processing platforms:
Hadoop includes a number of components, and below is
the list of components:
ZooKeeper, a distributed, highly available coordination service,
providing primitives such as distributed locks that can be used for
building distributed applications.
Pig, a data flow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive, a distributed data warehouse.

Stream Computing
Data Stream processing platforms:
Hadoop was developed to support processing large sets of
structured, unstructured, and semi-structured data,
but it was designed as a batch processing system.

Stream Computing
Data Stream processing platforms – SPARK:
Apache Spark is a more recent framework that combines an
engine for distributing programs across clusters of
machines with a model for writing programs on top of it.
It is aimed at addressing the needs of the data scientist
community, in particular in support of the Read-Evaluate-Print
Loop (REPL) approach for playing with data interactively.

Stream Computing
Data Stream processing platforms – SPARK:
Spark maintains MapReduce’s linear scalability and
fault tolerance, but extends it in three important ways:
First, rather than relying on a rigid map-then-reduce format,
its engine can execute a more general directed acyclic graph
(DAG) of operators. This means that in situations where
MapReduce must write out intermediate results to the
distributed file system, Spark can pass them directly to the
next step in the pipeline.
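
The idea of passing intermediate results directly to the next step can be sketched with chained Python generators. This is an analogy only, not Spark's API: the operator names (tokenize, keep_long, to_upper) and the input lines are assumptions, and the chain shown is the simplest possible DAG, a straight line of three operators.

```python
def tokenize(lines):
    """Operator 1: split each line into words."""
    for line in lines:
        yield from line.split()

def keep_long(words, min_len):
    """Operator 2: keep only words with at least min_len characters."""
    return (word for word in words if len(word) >= min_len)

def to_upper(words):
    """Operator 3: transform each surviving word."""
    return (word.upper() for word in words)

# Hypothetical input
lines = ["spark runs a graph of operators", "results flow between stages"]

# Each word flows through all three operators one at a time; no operator
# writes its full output anywhere before the next one starts.
result = list(to_upper(keep_long(tokenize(lines), 6)))
```

Only the final list is materialized, which mirrors the point made on the slide about skipping writes of intermediate results to the distributed file system.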

Stream Computing
Data Stream processing platforms – SPARK:
Spark maintains MapReduce’s linear scalability and fault
tolerance, but extends it in three important ways:
Second, it complements this capability with a rich set of
transformations that enable users to express computation more
naturally.
Third, Spark supports in-memory processing across a cluster of
machines, thus not relying on the use of storage for recording
intermediate data, as in MapReduce.

Stream Computing
Data Stream processing platforms – SPARK:
Spark supports integration with the variety of tools in the
Hadoop ecosystem.
It can read and write data in all of the data formats supported by
MapReduce.
It can read from and write to NoSQL databases like HBase and
Cassandra.
It is well suited for real-time processing and analysis, supporting
scalable, high throughput, and fault-tolerant processing of live data
streams.

Stream Computing
Data Stream processing platforms – SPARK:
Spark Streaming generates a discretized stream
(DStream) as a continuous stream of data.
Regarding the input stream, Spark Streaming receives live
input data streams through a receiver and divides data
into micro batches, which are then processed by the
Spark engine to generate the final stream of results in
batches.
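
The division of a live stream into micro batches can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Spark Streaming API: the function micro_batches, the (timestamp, value) event format, and the sample events are all hypothetical, and the events are assumed to arrive sorted by timestamp.

```python
from itertools import groupby

def micro_batches(events, interval):
    """Group (timestamp_seconds, value) events into fixed-interval batches.

    Events are assumed to be sorted by timestamp. Yields
    (batch_index, [values]) for every interval that contains events.
    """
    def batch_of(event):
        return int(event[0] // interval)

    for batch_index, group in groupby(events, key=batch_of):
        yield batch_index, [value for _, value in group]

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = list(micro_batches(events, interval=1.0))

# Each batch is then processed as a unit, here by counting its records.
batch_sizes = [len(values) for _, values in batches]
```

With a one-second interval, the five events fall into three batches, and each batch is handed to the processing step as a whole.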

Stream Computing
Data Stream processing platforms – SPARK:
Spark Streaming utilizes a small-interval (in seconds)
deterministic batch to separate the stream into processable
units.
The size of the interval dictates throughput and
latency, so the larger the interval, the higher the
throughput and the latency.
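
The trade-off between interval size, throughput, and latency can be made concrete with a toy cost model. Everything in it is an assumption made for illustration: each batch is charged a fixed overhead plus a per-record cost, and the numbers (1000 records per second, 50 ms overhead, 0.1 ms per record) are invented, not measurements of Spark.

```python
def batch_metrics(interval_s, arrival_rate, fixed_overhead_s, per_record_s):
    """Return (max sustainable throughput, worst-case latency) for one batch.

    Toy model: a batch collects records for interval_s seconds, then takes
    fixed_overhead_s plus per_record_s per record to process.
    """
    records = arrival_rate * interval_s
    processing_s = fixed_overhead_s + records * per_record_s
    throughput = records / processing_s      # records per second of engine time
    worst_latency = interval_s + processing_s  # first record waits the longest
    return throughput, worst_latency

# Assumed workload parameters
params = dict(arrival_rate=1000, fixed_overhead_s=0.05, per_record_s=0.0001)

small = batch_metrics(interval_s=0.5, **params)
large = batch_metrics(interval_s=2.0, **params)
```

Under these assumptions the larger interval spreads the fixed overhead over more records, so throughput rises, while every record also waits longer before results appear, so latency rises too.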

Stream Computing
Data Stream processing platforms – SPARK:
Since the Spark core framework exploits main
memory (as opposed to Storm, which uses
ZooKeeper), its mini-batch processing can appear as
fast as the “one at a time processing” adopted in
Storm, despite the fact that the RDD units are
larger than Storm tuples.

Stream Computing
Data Stream processing platforms – SPARK:
The benefit of the mini batch is to enhance the
throughput in the internal engine by reducing data
shipping overhead, such as lower overhead for the
ISO/OSI transport layer header, which will allow the
threads to concentrate on computation.
Spark was written in Scala, but it comes with libraries
and wrappers that allow the use of R or Python.

Stream Computing
Data Stream processing platforms – Storm:
Storm is a distributed real-time computation system
for processing large volumes of high-velocity data.
It makes it easy to reliably process unbounded streams
of data and has a relatively simple processing model
owing to the use of powerful abstractions.

Stream Computing
Data Stream processing platforms – Storm:
A spout is a source of streams in a computation.
Typically, a spout reads from a queuing broker, such as
RabbitMQ or Kafka, but a spout can also generate its own
stream or read from somewhere like the Twitter streaming
API.
Spout implementations already exist for most queuing
systems.

Stream Computing
Data Stream processing platforms – Storm:
A bolt processes any number of input streams and
produces any number of new output streams.
Bolts are event-driven components, and cannot be used to
read data. This is what spouts are designed for.
Most of the logic of a computation goes into bolts, such as
functions, filters, streaming joins, streaming aggregations,
talking to databases, and so on.

Stream Computing
Data Stream processing platforms – Storm:
A topology is a DAG of spouts and bolts, with each
edge in the DAG representing a bolt subscribing to the
output stream of some other spout or bolt.
A topology is an arbitrarily complex multistage stream
computation; topologies run indefinitely when deployed.
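
The relationship between spouts, bolts, and a topology can be sketched in a single Python process. This is not the Storm API: the class and function names (sentence_spout, SplitBolt, CountBolt, Topology, add_bolt, run) are hypothetical, and the spout is finite so that the sketch terminates, whereas a deployed topology runs indefinitely.

```python
from collections import Counter, defaultdict, deque

def sentence_spout():
    """Spout: the source of the stream (finite here so the sketch ends)."""
    yield ("storm processes streams",)
    yield ("streams of tuples",)

class SplitBolt:
    """Bolt: turns one sentence tuple into one tuple per word."""
    def process(self, tup):
        return [(word,) for word in tup[0].split()]

class CountBolt:
    """Bolt: keeps a running count per word and emits nothing further."""
    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        self.counts[tup[0]] += 1
        return []

class Topology:
    """A DAG in which each edge is a bolt subscribing to an upstream output."""
    def __init__(self):
        self.bolts = {}
        self.subscribers = defaultdict(list)

    def add_bolt(self, name, bolt, subscribes_to):
        self.bolts[name] = bolt
        self.subscribers[subscribes_to].append(name)

    def run(self, spout_name, spout):
        pending = deque((spout_name, tup) for tup in spout)
        while pending:
            source, tup = pending.popleft()
            for name in self.subscribers[source]:
                for emitted in self.bolts[name].process(tup):
                    pending.append((name, emitted))

counter = CountBolt()
topology = Topology()
topology.add_bolt("split", SplitBolt(), subscribes_to="sentences")
topology.add_bolt("count", counter, subscribes_to="split")
topology.run("sentences", sentence_spout())
```

After the run, counter.counts holds the word frequencies accumulated from every tuple that flowed from the spout through the split bolt into the count bolt.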

Stream Computing
Data Stream processing platforms – Storm:
Trident provides a set of high-level abstractions in Storm
that were developed to facilitate programming of real-time
applications on top of the Storm infrastructure.
It supports joins, aggregations, grouping, functions, and
filters. In addition to these, Trident adds primitives for
doing stateful incremental processing on top of any
database or persistence store.

Stream Computing
Data Stream processing platforms – KAFKA:
Kafka is an open source message broker project
developed by the Apache Software Foundation and
written in Scala.
The project aims to provide a unified,
high-throughput, low-latency platform for handling
real-time data feeds.

Stream Computing
Data Stream processing platforms – KAFKA:
A single Kafka broker can handle hundreds of
megabytes of reads and writes per second from
thousands of clients.
In order to support high availability and horizontal
scalability, data streams are partitioned and spread
over a cluster of machines.
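
How a stream is partitioned and spread over machines can be sketched with key hashing. This is an illustration only and does not use a Kafka client: the function names, broker names, and keys are hypothetical, and zlib.crc32 is used as a stand-in hash rather than the hash function Kafka itself applies.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a message key to a partition; the same key always lands together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(num_partitions, brokers):
    """Spread partitions over brokers round-robin."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}

# Hypothetical cluster and message keys
brokers = ["broker-0", "broker-1", "broker-2"]
assignment = assign_partitions(num_partitions=6, brokers=brokers)

keys = ["user-17", "user-42", "user-17", "sensor-9"]
placements = [(key, partition_for(key, 6)) for key in keys]
routed = [(key, assignment[partition]) for key, partition in placements]
```

Because the partition depends only on the key, all messages for one key stay in order on one partition, while different keys can be served by different machines in parallel.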

Stream Computing
Data Stream processing platforms – KAFKA:
Kafka depends on ZooKeeper from the Hadoop
ecosystem for coordination of processing nodes.
The main uses of Kafka are in situations when
applications need a very high throughput for message
processing, while meeting low latency, high
availability, and high scalability requirements.

Stream Computing
Data Stream processing platforms – Flume:
Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving
large amounts of log data.
It is robust and fault tolerant with tunable reliability
mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model
that allows for online analytic applications.

Stream Computing
Data Stream processing platforms – Flume:
While Flume and Kafka both can act as the event
backbone for real-time event processing, they have
different characteristics.
Flume is better suited in cases when one needs to
support data ingestion and simple event processing.

Stream Computing
Data Stream processing platforms – Amazon Kinesis:
Amazon Kinesis is a cloud-based service for real-time data
processing over large, distributed data streams.
Amazon Kinesis can continuously capture and store
terabytes of data per hour from hundreds of thousands of
sources such as website clickstreams, financial
transactions, social media feeds, IT logs, and
location-tracking events.

Stream Computing
Data Stream processing platforms – Amazon Kinesis:
Kinesis allows integration with Storm, as it provides a
Kinesis Storm Spout that fetches data from a Kinesis
stream and emits it as tuples.
The inclusion of this Kinesis component into a Storm
topology provides a reliable and scalable stream capture,
storage, and replay service.