Getting Started With Spark Structured Streaming With Dustin Vannoy | Current 2022
HostedbyConfluent
About This Presentation
Many data pipelines still default to processing data nightly or hourly, but information is created all the time and should be available much sooner. While the move to stream processing adds complexity, Spark Structured Streaming makes it achievable for teams of any size to switch to streaming.
This session shares techniques for data engineers who are new to building streaming pipelines with Spark Structured Streaming. It covers how to implement real-time stream processes with Apache Spark and Apache Kafka. We will discuss general concepts for Spark Structured Streaming along with introductory code examples. We will also look at important streaming concepts like triggers, windows, and state. To connect it all we will walk through a complete pipeline, including a demo using PySpark, Apache Kafka, and Delta Lake tables.
Size: 2.05 MB
Language: en
Added: Oct 21, 2022
Slides: 56 pages
Slide Content
Getting Started with
Spark Structured Streaming
Dustin Vannoy
dustinvannoy.com
Why Spark?
Batch and Streaming framework that has a large community and wide adoption.
Spark is a leader in data processing that scales across many machines. It is relatively easy to get started.
Yeah, but what is Spark?
Spark consists of a programming API and execution engine
[Diagram: a Master node coordinating multiple Worker nodes]
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read tab-separated song data and let Spark infer the schema
song_df = spark.read \
    .option('sep', '\t') \
    .option("inferSchema", "true") \
    .csv("/databricks-datasets/songs/data-001/part-0000*")

# Select and rename the artist name and tempo columns
tempo_df = song_df.select(
    col('_c4').alias('artist_name'),
    col('_c14').alias('tempo'),
)

# Average tempo per artist, highest first
avg_tempo_df = tempo_df \
    .groupBy('artist_name') \
    .avg('tempo') \
    .orderBy('avg(tempo)', ascending=False)

avg_tempo_df.show(truncate=False)
What is Spark?
●Fast, general-purpose engine for large-scale data processing
●Replaces MapReduce as Hadoop parallel programming API
●Many options:
○Yarn / Spark Cluster / Local
○Scala / Python / Java / R
○Spark Core / SQL / Streaming / MLlib / Graph
What is Spark Structured Streaming?
●Alternative to traditional Spark Streaming, which used DStreams
●If you are familiar with Spark, it is best to think of Structured Streaming as the Spark SQL API but for streaming
●Use import spark.sql.streaming
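A minimal sketch (not taken from the slides) of how the streaming API mirrors the batch Spark SQL API: readStream/writeStream instead of read/write. The broker address localhost:9092 and topic name events are placeholder assumptions, and the Kafka source additionally requires the spark-sql-kafka connector package; in PySpark the related streaming classes live under pyspark.sql.streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Streaming read mirrors the batch API: readStream instead of read
events_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Kafka values arrive as bytes; cast to string before using them
values_df = events_df.selectExpr("CAST(value AS STRING) AS value")

# writeStream instead of write; start() returns a StreamingQuery
query = values_df.writeStream \
    .format("console") \
    .start()

query.awaitTermination()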
What is Spark Structured Streaming?
Tathagata Das "TD" - Lead Developer on Spark Streaming
●"Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming"
●"The simplest way to perform streaming analytics is not having to reason about streaming at all"
●A table that is constantly appended with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
Lazy Evaluation with Query Optimization
●Transformations not run immediately
●Builds up a DAG
●Optimizes steps
●Runs when an Action is called
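As a quick illustration (not from the slides), the calls below only build up the query plan; nothing executes until an action such as count() is called:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000000)                                    # transformation: no job runs yet
evens_df = df.filter(col("id") % 2 == 0)                     # still lazy
doubled_df = evens_df.withColumn("doubled", col("id") * 2)   # still lazy

doubled_df.explain()   # inspect the optimized plan built so far
doubled_df.count()     # action: the DAG is optimized and executed now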
Why use Spark Structured Streaming?
●Run Spark as batch or streaming
●Open-source with many connectors
●Stateful streaming and joins
●Kafka is not required for every stream
Structured Streaming – Output Mode
How to output data when triggered (see the sketch below)
●Append – Just keep adding newest records
●Update – Only those updated since last trigger
●Complete mode – Output latest state of table (useful for aggregation results)
Example from docs (Structured Streaming Programming Guide)
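A hedged sketch of setting the output mode on a streaming query, using the built-in rate source so it runs locally; the source and column names here are illustrative assumptions, not from the demo.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# The "rate" source generates rows continuously, handy for local tests
rate_df = spark.readStream.format("rate") \
    .option("rowsPerSecond", 5) \
    .load()

# Aggregations typically use complete (or update) output mode
counts_df = rate_df.groupBy((col("value") % 10).alias("bucket")).count()

query = counts_df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()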
Structured Streaming – Triggers
Specify how often to check for new data (see the sketch after this list)
●Default: micro-batch mode, as soon as possible
●Fixed: micro-batch mode, wait specified time
●Only once: just one micro-batch then job stops
●Continuous: continuously read from source, experimental and only works for certain types of queries
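The trigger is set on writeStream. The sketch below assumes some existing streaming DataFrame named df and shows the options described above:
# Default trigger: next micro-batch starts as soon as the previous finishes
query = df.writeStream.format("console").start()

# Fixed interval micro-batches
query = df.writeStream.format("console") \
    .trigger(processingTime="30 seconds") \
    .start()

# One micro-batch, then the job stops
query = df.writeStream.format("console") \
    .trigger(once=True) \
    .start()

# Experimental continuous processing (only for certain queries)
query = df.writeStream.format("console") \
    .trigger(continuous="1 second") \
    .start()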
Structured Streaming – Windows
Calculate over time using windows -> Avg trips every 5 minutes
●Tumbling window: specific period of time, such as 1:00 to 1:30
●Sliding windows: fixed window, but can track multiple groups... 1:00 – 1:30, 1:05 – 1:35, 1:10 – 1:40, etc.
●Session windows: similar to an inactive session timeout… group together until a long enough gap in time between events
Comparing windows (Structured Streaming Programming Guide)
Window Example
from pyspark.sql.functions import col, window

df2.groupBy(
        col("VendorID"),
        window(col("pickup_dt"), "10 min")) \
    .avg("trip_distance")
Streaming Joins
Stream-static
●Simple join
●Does not track state
●Options: Inner, Left Outer, Left Semi
Stream-stream
●Complex join
●Data may arrive early or late
●Options: Inner, Left/Right Outer, Full Outer, Left Semi
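A hedged sketch of a stream-static inner join; the paths and the use of Delta format here are illustrative assumptions, and the join column reuses the VendorID field from the window example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static lookup table, read once (assumed path and format)
vendors_df = spark.read.format("delta").load("/data/vendors")

# Streaming source of trip records (assumed path)
trips_stream_df = spark.readStream.format("delta").load("/data/trips")

# Stream-static inner join; no streaming state needs to be tracked
enriched_df = trips_stream_df.join(vendors_df, on="VendorID", how="inner")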
Stateful Streaming
Aggregations, dropDuplicates, and Joins will require storing state
●Custom stateful operations possible
●HDFS or RocksDB State Store (Spark 3.2)
●If retaining many keys, HDFSBackedStateStore is likely to have JVM memory issues
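A short sketch of opting into the RocksDB state store provider (available since Spark 3.2), which keeps large state off the JVM heap; the configuration key is the standard Spark SQL setting and the rest of the pipeline is unchanged.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Use RocksDB instead of the default HDFS-backed (JVM heap) state store
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)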