CS8091_BDA_Unit_IV_Stream_Computing

CS8091 / Big Data Analytics
III Year / VI Semester

UNIT IV - STREAM MEMORY
Introduction to Streams Concepts – Stream Data Model and
Architecture – Stream Computing, Sampling Data in a Stream –
Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments – Counting Oneness in a Window – Decaying
Window – Real Time Analytics Platform (RTAP) Applications – Case
Studies – Real Time Sentiment Analysis, Stock Market Predictions.
Using Graph Analytics for Big Data: Graph Analytics.

Stream Computing
A high-performance computer system that analyzes
multiple data streams from many sources.
The term stream computing is used to mean pulling in streams of
data, processing the data, and streaming it back out as a
single flow.
It uses software algorithms that analyze the data in real
time as it streams in, to increase speed and accuracy when dealing
with data handling and analysis.

Stream Computing
Stream computing delivers real-time analytic
processing on constantly changing data in
motion.
It allows you to capture and analyze all data, all
the time, just in time.

Stream Computing
Stream computing analyzes data before you store it.
Analyze data that is in motion (Velocity).
Process any type of data (Variety).
Streams is designed to scale to process any size of
data, from terabytes to zettabytes per day.
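
The idea of analyzing data before storing it can be illustrated with a one-pass running mean. This is only a minimal sketch: the function name `running_mean` and the sample readings are assumptions made for illustration, not part of any streaming product.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each element, keeping only O(1) state."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier readings are never stored.
        mean += (value - mean) / count
        yield count, mean


if __name__ == "__main__":
    # A short list stands in for a live feed (hypothetical data).
    for n, avg in running_mean([10.0, 20.0, 30.0, 40.0]):
        print(n, avg)
```

Only the count and the current mean are kept in memory, so the analysis is available at every step while the raw readings themselves can be discarded.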

Stream Computing
Store less
Analyze more
Make better decisions, faster

Stream Computing
Data stream processing platforms:
Many of these are open source solutions.
These platforms facilitate the construction of real-time
applications, in particular message-oriented or event-driven
applications that support ingress of messages
or events at a very high rate, transfer to subsequent
processing, and generation of alerts.
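
The event-driven pattern described above (ingest events, pass them on, raise alerts) can be sketched in a few lines. The event fields `source` and `value`, the function name `alerts`, and the threshold rule are all hypothetical choices for this illustration.

```python
from typing import Dict, Iterable, Iterator


def alerts(events: Iterable[Dict], threshold: float) -> Iterator[str]:
    """Emit an alert message for every event whose value exceeds threshold."""
    for event in events:
        # Each event is inspected once, as it arrives; nothing is buffered.
        if event["value"] > threshold:
            yield f"ALERT {event['source']}: {event['value']}"


if __name__ == "__main__":
    incoming = [
        {"source": "s1", "value": 5},
        {"source": "s2", "value": 12},
    ]
    for message in alerts(incoming, threshold=10):
        print(message)
```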

Stream Computing
Data stream processing platforms:
These platforms are mostly focused on supporting
event-driven data flow through nodes in a distributed
system or within a cloud infrastructure platform.
The Hadoop ecosystem covers a family of projects
that fall under the umbrella of infrastructure for
distributed computing and large data processing.

Stream Computing
Data stream processing platforms:
Hadoop includes a number of components, listed
below:
MapReduce, a distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
Hadoop Distributed File System (HDFS), a distributed file
system that runs on large clusters of commodity machines.
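
The MapReduce processing model can be illustrated on a single machine with the classic word count. This is a sketch of the model only: the function names are assumptions for illustration, and nothing here uses Hadoop's actual API or runs on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big stream data"]))
```

In a real cluster the map and reduce calls would run on different machines, with the shuffle moving data between them.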

Stream Computing
Data stream processing platforms:
Hadoop includes a number of components, listed
below:
ZooKeeper, a distributed, highly available coordination service,
providing primitives such as distributed locks that can be used for
building distributed applications.
Pig, a data flow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive, a distributed data warehouse.

Stream Computing
Data stream processing platforms:
Hadoop was developed to support processing large sets of
structured, unstructured, and semi-structured data,
but it was designed as a batch processing system.

Stream Computing
Data stream processing platforms – Spark:
Apache Spark is a more recent framework that combines an
engine for distributing programs across clusters of
machines with a model for writing programs on top of it.
It is aimed at addressing the needs of the data scientist
community, in particular in support of the Read-Evaluate-Print
Loop (REPL) approach for playing with data interactively.

Stream Computing
Data stream processing platforms – Spark:
Spark maintains MapReduce’s linear scalability and
fault tolerance, but extends it in three important ways:
First, rather than relying on a rigid map-then-reduce format,
its engine can execute a more general directed acyclic graph
(DAG) of operators. This means that in situations where
MapReduce must write out intermediate results to the
distributed file system, Spark can pass them directly to the
next step in the pipeline.

Stream Computing
Data stream processing platforms – Spark:
Spark maintains MapReduce’s linear scalability and fault
tolerance, but extends it in three important ways:
Second, it complements this capability with a rich set of
transformations that enable users to express computation more
naturally.
Third, Spark supports in-memory processing across a cluster of
machines, thus not relying on the use of storage for recording
intermediate data, as in MapReduce.
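
The difference from a rigid map-then-reduce format can be sketched with operators that hand each intermediate result straight to the next step in memory. This is plain Python with generators, not Spark's API; `map_op`, `filter_op`, and `pipeline` are hypothetical names, and a linear chain is used as the simplest case of a DAG of operators.

```python
from typing import Callable, Iterable, Iterator, List

Operator = Callable[[Iterable[int]], Iterator[int]]


def map_op(fn: Callable[[int], int]) -> Operator:
    """Build an operator that applies fn to every element."""
    def run(data: Iterable[int]) -> Iterator[int]:
        return (fn(x) for x in data)
    return run


def filter_op(pred: Callable[[int], bool]) -> Operator:
    """Build an operator that keeps only elements satisfying pred."""
    def run(data: Iterable[int]) -> Iterator[int]:
        return (x for x in data if pred(x))
    return run


def pipeline(source: Iterable[int], operators: List[Operator]) -> Iterator[int]:
    """Chain operators; each one consumes the previous one's output lazily."""
    data: Iterable[int] = source
    for op in operators:
        data = op(data)  # no intermediate result is written to storage
    return iter(data)


if __name__ == "__main__":
    steps = [
        map_op(lambda x: x * x),
        filter_op(lambda x: x % 2 == 1),
        map_op(lambda x: x + 1),
    ]
    print(list(pipeline(range(1, 6), steps)))
```

Because the operators are lazy, each element flows through all three steps before the next element is read, so no intermediate collection is ever materialized.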

Stream Computing
Data stream processing platforms – Spark:
Spark supports integration with the variety of tools in the
Hadoop ecosystem.
It can read and write data in all of the data formats supported by
MapReduce.
It can read from and write to NoSQL databases like HBase and
Cassandra.
It is well suited for real-time processing and analysis, supporting
scalable, high-throughput, and fault-tolerant processing of live data
streams.

Stream Computing
Data stream processing platforms – Spark:
Spark Streaming generates a discretized stream
(DStream) as a continuous stream of data.
On the input side, Spark Streaming receives live
input data streams through a receiver and divides the data
into micro-batches, which are then processed by the
Spark engine to generate the final stream of results in
batches.

Stream Computing
Data stream processing platforms – Spark:
Spark Streaming uses a small-interval (in seconds)
deterministic batch to separate the stream into processable
units.
The size of the interval dictates throughput and
latency: the larger the interval, the higher both the
throughput and the latency.
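
The effect of the batch interval can be seen in a small sketch that cuts a timestamped stream into fixed-length intervals. This is plain Python, not Spark Streaming's API; the `micro_batches` function and the sample events are assumptions for illustration, and empty intervals are simply skipped.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive batches of `interval` seconds each."""
    batches: Dict[int, List[str]] = {}
    for timestamp, payload in events:
        # All events whose timestamps fall in the same interval share a batch.
        index = int(timestamp // interval)
        batches.setdefault(index, []).append(payload)
    return [batches[i] for i in sorted(batches)]


if __name__ == "__main__":
    events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
    print(micro_batches(events, interval=1.0))
    print(micro_batches(events, interval=2.0))
```

With a longer interval the same events land in fewer, larger batches, which lowers per-batch overhead, but an event arriving at the start of an interval waits longer before its batch is processed.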

Stream Computing
Data stream processing platforms – Spark:
Since the Spark core framework exploits main
memory (as opposed to Storm, which uses
ZooKeeper), its mini-batch processing can appear as
fast as the “one at a time processing” adopted in
Storm, despite the fact that the RDD units are
larger than Storm tuples.

Stream Computing
Data stream processing platforms – Spark:
The benefit of the mini-batch is that it enhances the
throughput of the internal engine by reducing data
shipping overhead, such as the overhead of the
ISO/OSI transport layer header, which allows the
threads to concentrate on computation.
Spark was written in Scala, but it comes with libraries
and wrappers that allow the use of R or Python.

Stream Computing
Data stream processing platforms – Storm:
Storm is a distributed real-time computation system
for processing large volumes of high-velocity data.
It makes it easy to reliably process unbounded streams
of data and has a relatively simple processing model
owing to the use of powerful abstractions.

Stream Computing
Data stream processing platforms – Storm:
A spout is a source of streams in a computation.
Typically, a spout reads from a queuing broker, such as
RabbitMQ or Kafka, but a spout can also generate its own
stream or read from somewhere like the Twitter streaming
API.
Spout implementations already exist for most queuing
systems.

Stream Computing
Data stream processing platforms – Storm:
A bolt processes any number of input streams and
produces any number of new output streams.
Bolts are event-driven components and cannot be used to
read data from a source; that is what spouts are designed for.
Most of the logic of a computation goes into bolts, such as
functions, filters, streaming joins, streaming aggregations,
talking to databases, and so on.

Stream Computing
Data stream processing platforms – Storm:
A topology is a DAG of spouts and bolts, with each
edge in the DAG representing a bolt subscribing to the
output stream of some other spout or bolt.
A topology is an arbitrarily complex multistage stream
computation; topologies run indefinitely when deployed.
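
The spout, bolt, and topology roles can be sketched as a three-stage word count. This is a single-process Python illustration, not Storm's API: the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, and the spout is finite so that the example terminates, whereas a deployed topology runs indefinitely.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: the source of the stream (a fixed list stands in for a queue)."""
    yield from ["storm processes streams", "streams of tuples"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: subscribes to the spout's output and emits one word per tuple."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Dict[str, int]:
    """Bolt: subscribes to split_bolt's output and aggregates word counts."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
    return dict(counts)


def run_topology() -> Dict[str, int]:
    """Topology: wire the edges spout -> split_bolt -> count_bolt."""
    return count_bolt(split_bolt(sentence_spout()))


if __name__ == "__main__":
    print(run_topology())
```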

Stream Computing
Data stream processing platforms – Storm:
Trident provides a set of high-level abstractions in Storm
that were developed to facilitate programming of real-time
applications on top of the Storm infrastructure.
It supports joins, aggregations, grouping, functions, and
filters. In addition to these, Trident adds primitives for
doing stateful incremental processing on top of any
database or persistence store.
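
Stateful incremental processing can be sketched as counts that are updated batch by batch against a store. This is not Trident's API: `DictStore` is a hypothetical in-memory stand-in for a database or persistence store, and `apply_batch` is an assumed helper name.

```python
from typing import Dict, Iterable


class DictStore:
    """Hypothetical stand-in for a database or other persistence store."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> int:
        # Keys that were never written count as zero.
        return self._data.get(key, 0)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


def apply_batch(store: DictStore, batch: Iterable[str]) -> None:
    """Fold one batch into the stored counts without recomputing old data."""
    for key in batch:
        store.put(key, store.get(key) + 1)


if __name__ == "__main__":
    store = DictStore()
    apply_batch(store, ["a", "b", "a"])
    apply_batch(store, ["b", "c"])
    print(store.get("a"), store.get("b"), store.get("c"))
```

Each batch only touches the keys it contains, so the state grows incrementally instead of being rebuilt from the full history.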

Stream Computing
Data stream processing platforms – Kafka:
Kafka is an open source message broker project
developed by the Apache Software Foundation and
written in Scala.
The project aims to provide a unified, high-throughput,
low-latency platform for handling real-time
data feeds.

Stream Computing
Data stream processing platforms – Kafka:
A single Kafka broker can handle hundreds of
megabytes of reads and writes per second from
thousands of clients.
In order to support high availability and horizontal
scalability, data streams are partitioned and spread
over a cluster of machines.
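
Partitioning a stream by key can be sketched as a hash of the key taken modulo the number of partitions. This is an illustration of the idea only: CRC32 is used here as a simple deterministic stand-in and is not Kafka's actual partitioner, and the function names are assumptions.

```python
import zlib
from typing import Dict, Iterable, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition; the same key always maps the same way."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def spread(keys: Iterable[str], num_partitions: int) -> Dict[int, List[str]]:
    """Distribute message keys over partitions, preserving arrival order."""
    partitions: Dict[int, List[str]] = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return partitions


if __name__ == "__main__":
    print(spread(["user-1", "user-2", "user-1", "user-3"], num_partitions=3))
```

Because all messages with the same key land in the same partition, their relative order is preserved there, while different partitions can be hosted on different machines.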

Stream Computing
Data stream processing platforms – Kafka:
Kafka depends on ZooKeeper from the Hadoop
ecosystem for coordination of processing nodes.
Kafka is mainly used in situations where
applications need very high throughput for message
processing while meeting low-latency, high-availability,
and high-scalability requirements.

Stream Computing
Data stream processing platforms – Flume:
Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving
large amounts of log data.
It is robust and fault tolerant, with tunable reliability
mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model
that allows for online analytic applications.

Stream Computing
Data stream processing platforms – Flume:
While Flume and Kafka can both act as the event
backbone for real-time event processing, they have
different characteristics.
Flume is better suited to cases where one needs to
support data ingestion and simple event processing.

Stream Computing
Data stream processing platforms – Amazon Kinesis:
Amazon Kinesis is a cloud-based service for real-time data
processing over large, distributed data streams.
Amazon Kinesis can continuously capture and store
terabytes of data per hour from hundreds of thousands of
sources such as website clickstreams, financial
transactions, social media feeds, IT logs, and
location-tracking events.

Stream Computing
Data stream processing platforms – Amazon Kinesis:
Kinesis allows integration with Storm, as it provides a
Kinesis Storm Spout that fetches data from a Kinesis
stream and emits it as tuples.
Including this Kinesis component in a Storm
topology provides a reliable and scalable stream capture,
storage, and replay service.
