Data Streaming For Big Data

svltr 1,896 views 32 slides May 07, 2020

About This Presentation

Data Streaming For Big Data by Seval Capraz


Slide Content

Data Streaming For Big Data
CMP652 Next Generation Database
Systems
Seval Çapraz

Content

1. What, Why, How of Streaming Big Data
2. Overview of Data Management Systems
–Vendors, Architectures, Ecosystem

3. The Most Popular Streaming Technologies
–Apache Storm, Apache Flink, Spark Streaming

Summary

Questions and Answers

References

1. What, Why, How of Streaming Big Data

What is streaming data?

Streaming data is an analytic computing platform that is focused on
speed.

By streaming, data can be continuously analyzed and transformed in
memory before it is stored on a disk.

It is a real time processing technique.
●All definitions are taken from reference [1]

Why Streaming Data?

Businesses are dealing with a lot of data that needs to be
processed and analyzed in real time.

Therefore, the physical environment that supports this level of
responsiveness is critical.

Streaming data environments typically require a clustered
hardware solution, and sometimes a massively parallel processing
approach will be required to handle the analysis.

Defining properties or dimensions of big data are volume, variety,
and velocity. Streaming technology can cover these three.
●All definitions are taken from reference [1]

How Does Stream Processing Work?

Stream processing is a computer programming paradigm closely related
to dataflow programming, event stream processing, and reactive
programming.

It is the real-time processing of data continuously, concurrently, and in a
record-by-record fashion.

Processing streams of data works by processing “time windows” of data in
memory across a cluster of servers.
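
The "time window" idea above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the API of any streaming engine: the function name `tumbling_windows`, the event layout (timestamp in seconds, key), and the sample data are all assumptions made for the example, and it assumes events arrive in timestamp order.

```python
from typing import Dict, Iterable, Iterator, Tuple


def tumbling_windows(
    events: Iterable[Tuple[float, str]], window_seconds: float
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_index, counts_per_key) for fixed, non-overlapping windows.

    Assumes events arrive in timestamp order, so a window can be emitted
    as soon as the first event of the next window shows up.
    """
    current_index = None
    counts: Dict[str, int] = {}
    for timestamp, key in events:
        index = int(timestamp // window_seconds)
        if current_index is not None and index != current_index:
            # The previous window is complete: emit it and start a new one.
            yield current_index, counts
            counts = {}
        current_index = index
        counts[key] = counts.get(key, 0) + 1
    if current_index is not None:
        # Flush the last, possibly partial, window when the stream ends.
        yield current_index, counts


# Hypothetical click events: (seconds since start, page name).
sample_events = [(0.5, "home"), (1.2, "cart"), (4.9, "home"), (5.0, "home"), (9.7, "cart")]
for window_index, window_counts in tumbling_windows(sample_events, window_seconds=5.0):
    print(window_index, window_counts)
```

Only the counts for the current window are held in memory. A real engine would additionally partition the keys across the servers of a cluster and handle late or out-of-order events.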

Data Processing

Stream Processing

When to use streaming?

Some key principles define when using streams is most appropriate:
•When it is necessary to determine a retail buying opportunity
at the point of engagement, either via social media or via
permission-based messaging
•Collecting information about the movement around a secure
site
•To be able to react to an event that needs an immediate
response, such as a service outage or a change in a patient’s
medical condition
•Real-time calculation of costs that are dependent on variables
such as usage and available resources
●All definitions are taken from reference [1]

Single-pass Analysis

One important factor about streaming data analysis is the fact
that it is a single-pass analysis.

In other words, the analyst cannot reanalyze the data after it is
streamed.
•This is common in applications where you are looking for the
absence of data.
•If several passes are required, the data will have to be put into
some sort of warehouse where additional analysis can be
performed.
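
A single-pass computation can be sketched as follows. The example uses Welford's online algorithm to keep a running mean and variance; this technique is swapped in for illustration and is not taken from reference [1]. The class name `RunningStats` and the sample readings are assumptions.

```python
class RunningStats:
    """Running mean and population variance, updated one record at a time.

    Each value is seen exactly once and then discarded, so memory use
    stays constant no matter how long the stream is.
    """

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # Sum of squared deviations from the current mean.

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; defined as 0.0 for an empty stream.
        return self._m2 / self.count if self.count else 0.0


# Hypothetical sensor readings arriving one by one.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

Because the raw readings are never stored, any question that was not anticipated before the stream started cannot be answered afterwards, which is why multi-pass analysis needs a warehouse.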
●All definitions are taken from reference [1]

Streaming data vs. Hadoop

Managing streaming data is similar to
managing data at rest with
Hadoop.

The primary difference is the issue of velocity.

In the Hadoop cluster, data is collected in
batch mode and then processed.
●All definitions are taken from reference [1]
Speed matters less in Hadoop than it does in data streaming.

2. Overview of Database Management
Systems

Evolution of Data Management Solutions

Relational Databases are not suited for Big Data
●All images are taken from reference [2]

Vendor Landscape
●All images are taken from reference [2]

An architecture of big data processing service
●All images are taken from reference [3]

Big Data Analytics Ecosystem

Recently, each architectural layer changed dramatically in terms of
the software stack when services such as Yahoo!, Twitter, and
LinkedIn released open source solutions for dealing with big data.
•The new architecture:
–Apache Kafka serves as a high-throughput distributed messaging
system in the data ingestion layer,
–Apache Storm as a distributed and fault-tolerant real-time
computation system in the data analytics layer,
–Apache Cassandra as a NoSQL database in the data storage layer.
●All definitions are taken from reference [3]

A simple instance of a large-scale data-stream-processing service
●All images are taken from reference [3]

3. The Most Popular Streaming Technologies

Most Popular Technologies

Piping and Messaging
–Apache Kafka, Apache Flume, FluentD and ZeroMQ

Stream Processing
–Apache Storm, Apache Spark, Apache Flink, Esper, Apache Samza

Machine Learning
–MLlib and Mahout

Persisting
–NoSQL DBs
–HDFS

Capability Analysis of Recent Open Source Stream-Processing Systems
●Table is taken from reference [3]

[12] M. Zaharia et al., “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters,”
Proc. 4th Usenix Conf. Hot Topics in Cloud Computing, 2012.

[13] L. Neumeyer et al., “S4: Distributed Stream Computing Platform,” Proc. IEEE Int’l Conf. on Data Mining Workshops, 2010,
pp. 170–177.

Some of Streaming Computation Engines

Three open-source streaming engines:
–Apache Storm
–Apache Flink
–Apache Spark Streaming
●All definitions and images are taken from reference [4]

Apache Storm
●All definitions and images are taken from reference [4]

Apache Storm is a free and open source distributed realtime
computation system.

Apache Storm has the TopologyBuilder API to create a directed graph
(topology) through which streams of data flow.
•“Spouts” are the entry point to the graph, and “bolts” perform the
processing.
•Data flows through the system as individual tuples.

Topologies are not required to be acyclic, although in practice they
usually are.
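
The spout and bolt roles can be imitated with plain Python generators. This sketch does not use Storm's TopologyBuilder API (which is a Java API); the function names `sentence_spout`, `split_bolt`, and `count_bolt` and the sample sentences are hypothetical, and everything runs in a single process.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: entry point of the graph, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: turn each sentence tuple into one lowercase tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keep a running count and emit (word, count) for every tuple."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Wire the graph: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout(["the cow", "the moon"])))
for emitted in topology:
    print(emitted)
```

Each stage handles one tuple at a time and passes its output downstream immediately, which mirrors how data flows through a topology as individual tuples. Distribution, acknowledgement, and fault tolerance are left out.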

●All definitions are taken from reference [6]
●All images are taken from reference [4]

Storm is fast: a benchmark clocked it at over a million tuples
processed per second per node.

A Storm topology consumes streams of data and processes those
streams in arbitrarily complex ways, repartitioning the streams
between each stage of the computation however needed.

Apache Flink
•Apache Flink is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data
streaming applications.[7]
•Apache Flink has the DataStream API to perform operations on
streams of data (map, filter, reduce, join, etc.).
•These operations are turned into a graph at job submission time by
Flink.
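
The idea that operations are first collected into a graph and only run later can be sketched in Python. This is not Flink's DataStream API: the class `LazyStream` and its methods are hypothetical, only map and filter are covered, and the "graph" is a simple linear chain.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class LazyStream:
    """Records operators when they are declared and runs them on execute()."""

    def __init__(self, operators: Tuple[Tuple[str, Callable], ...] = ()) -> None:
        self._operators = operators

    def map(self, fn: Callable[[Any], Any]) -> "LazyStream":
        # Nothing is computed here; the operator is only added to the plan.
        return LazyStream(self._operators + (("map", fn),))

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyStream":
        return LazyStream(self._operators + (("filter", predicate),))

    def plan(self) -> List[str]:
        """Return the operator names in the order they will run."""
        return [name for name, _ in self._operators]

    def execute(self, source: Iterable[Any]) -> Iterator[Any]:
        """Push records through the whole operator chain one at a time."""
        for record in source:
            keep = True
            for name, fn in self._operators:
                if name == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record


pipeline = LazyStream().map(lambda x: x * 2).filter(lambda x: x > 4)
print(pipeline.plan())
print(list(pipeline.execute([1, 2, 3, 4])))
```

Separating the declaration of the pipeline from its execution is what gives an engine the chance to inspect and distribute the whole graph before any data flows.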

It works similarly to Storm’s model.

Also supports a Storm-compatible API.
●All definitions and images are taken from reference [4]

●All definitions and images are taken from reference [4]

Flink is designed to run on large-scale clusters with many thousands
of nodes, and it also offers a standalone cluster mode.

Flink’s core is a distributed streaming dataflow engine, meaning that
data is processed one event at a time rather than as a series of
batches.

Apache Spark Streaming

Apache Spark is a fast and general engine for large-scale data
processing.

Apache Spark has the DStream API to perform operations on streams
of data (map, filter, reduce, join, etc.). It is based on Spark’s RDD
(Resilient Distributed Dataset) abstraction.

The API is similar to Flink’s. However, streaming is accomplished
through micro-batches.
•A Spark Streaming job consists of one small batch after another.
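
Micro-batching can be illustrated with a short Python sketch. It is not the DStream API: the helper `micro_batches` is hypothetical, and it cuts batches by record count for simplicity, whereas Spark Streaming cuts them by a time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Cut a stream into consecutive small batches of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # The stream is exhausted.
        yield batch


# Each small batch is processed as a unit, here by summing it.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
print(batch_sums)
```

The trade-off is latency: a record waits until its batch is complete before it is processed, in contrast to the event-at-a-time model.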
●All definitions and images are taken from reference [4]

●All definitions and images are taken from reference [4]

A Resilient Distributed Dataset (RDD) is the basic abstraction in
Spark.

Using RDDs, Spark hides data partitioning and provides a parallel
computation framework with APIs for four mainstream programming
languages (Scala, Java, Python, and R).

Comparison of Streaming Technologies
[Figure: 99th percentile latency versus throughput rate (events/sec) for Storm 0.10, Storm 0.11, Storm 0.11 (no ACK), Flink, and Spark]
•Benchmark is taken from reference [4].

Summary

Streaming data processing is beneficial in most scenarios where new,
dynamic data is generated on a continual basis. It applies to most of
the industry segments and big data use cases.[5]

Stream processing requires ingesting a sequence of data, and
incrementally updating metrics, reports, and summary statistics in
response to each arriving data record. It is better suited for real-time
monitoring and response functions.[5]

There are a few popular streaming data platforms, such as Apache
Storm, Apache Flink, and Apache Spark Streaming.

Each of the streaming platforms has its advantages and
disadvantages. Active communities for big data processing projects
continue to innovate and benefit from each other’s advancements.

Questions and Answers…
Q&A

References

[1] Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, "How to Use Data Streaming
For Big Data", Dummies.com, 2017.

[2] Sanjai Marimadaiah (CA Technologies), “Big Data, Big Opportunity: A Primer for
Understanding The Big Data Frontier”, CA World 2015.

[3] Rajiv Ranjan, “Streaming Big Data Processing in Datacenter Clouds”, IEEE Cloud
Computing, vol. 1, no. 1, pp. 73-83, 2014.

[4] Reza Farivar, Kyle Knusbaum, “Performance Comparison of Streaming Big Data
Platforms”, DataWorks Summit/Hadoop Summit, 2016.

[5] “What is Streaming Data?”, https://aws.amazon.com/streaming-data/

[6] “Why use Storm?”, http://storm.apache.org/

[7] “Introduction to Flink”, https://flink.apache.org/