Why Spark? A batch and streaming framework with a large community and wide adoption. Spark is a leader in data processing that scales across many machines, and it is relatively easy to get started.
Yeah, but what is Spark? Spark consists of a programming API and an execution engine: a Master coordinates work across Workers. [Diagram: one Master node with four Worker nodes]

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

song_df = (spark.read
    .option('sep', '\t')
    .option('inferSchema', 'true')
    .csv('/databricks-datasets/songs/data-001/part-0000*'))

tempo_df = song_df.select(
    col('_c4').alias('artist_name'),
    col('_c14').alias('tempo'),
)

avg_tempo_df = (tempo_df
    .groupBy('artist_name')
    .avg('tempo')
    .orderBy('avg(tempo)', ascending=False))

avg_tempo_df.show(truncate=False)
What is Spark?
Fast, general-purpose engine for large-scale data processing
Replaces MapReduce as Hadoop's parallel programming API
Many options: YARN / Spark Cluster / Local; Scala / Python / Java / R; Spark Core / SQL / Streaming / MLlib / GraphX
What is Spark Structured Streaming?
Alternative to traditional Spark Streaming, which used DStreams
If you are familiar with Spark, it is best to think of Structured Streaming as the Spark SQL API, but for streaming
Lives in the pyspark.sql.streaming module; entry points are spark.readStream and df.writeStream
What is Spark Structured Streaming?
Tathagata Das ("TD"), lead developer on Spark Streaming:
"Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming"
"The simplest way to perform streaming analytics is not having to reason about streaming at all"
A table that is constantly appended to with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
Lazy Evaluation with Query Optimization
Transformations are not run immediately; Spark builds up a DAG, optimizes the steps, and runs them only when an action is called
Why use Spark Structured Streaming?
Run Spark as batch or streaming
Open-source with many connectors
Stateful streaming and joins
Kafka is not required for every stream
Structured Streaming – Output Modes
How to output data when triggered
Append: only add the newest records since the last trigger
Update: output only rows updated since the last trigger
Complete: output the latest state of the entire result table (useful for aggregation results)
Example from docs (Structured Streaming Programming Guide)
Structured Streaming - Triggers
Specify how often to check for new data
Default: micro-batch mode, run each batch as soon as possible
Fixed interval: micro-batch mode, wait the specified time between batches
Only once: a single micro-batch, then the job stops
Continuous: continuously read from the source; experimental and only works for certain types of queries
Structured Streaming - Windows
Calculate over time using windows, e.g. average trips every 5 minutes
Tumbling window: a fixed, non-overlapping period of time, such as 1:00 to 1:30
Sliding window: fixed length, but overlapping, so one event can belong to multiple windows: 1:00 – 1:30, 1:05 – 1:35, 1:10 – 1:40, etc.
Session window: similar to an inactivity timeout; group events together until a long enough gap in time between events
Comparing windows (Structured Streaming Programming Guide)
Streaming Joins
Stream-static: simple join; does not require tracking state. Options: Inner, Left Outer, Left Semi
Stream-stream: complex join; data may arrive early or late. Options: Inner, Left/Right Outer, Full Outer, Left Semi
Stateful Streaming
Aggregations, dropDuplicates, and joins require storing state
Custom stateful operations are possible
HDFS-backed or RocksDB state store (RocksDB added in Spark 3.2)
If retaining many keys, HDFSBackedStateStore is likely to hit JVM memory issues
Example Structured Streaming Application
Databricks Notebook - Setup
Databricks Notebook – Confluent Cloud Config
Databricks Notebook – Spark Schema
Databricks Notebook – Read Kafka Record
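A sketch of the Kafka read step (configuration fragment only: it requires the spark-sql-kafka connector package and a live broker, and the bootstrap server, topic name, and offsets below are placeholders):

```python
# Sketch only: needs the spark-sql-kafka connector and a reachable broker.
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker-host>:9092")
            .option("subscribe", "songs")
            .option("startingOffsets", "earliest")
            .load())

# Kafka records arrive with binary key/value plus metadata columns
raw_df = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)",
                             "topic", "partition", "offset", "timestamp")
```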
Databricks Notebook – Convert JSON
Databricks Notebook – Read Lookup Table
Databricks Notebook – Join and write
Databricks Notebook – Check destination
Monitoring
SQL UI
SQL UI - Detail
Structured Streaming UI
Structured Streaming UI
Notebook Streaming UI - Timeseries
Notebook Streaming UI – Query Progress
Cloud Logging Services
Why Kafka? Streaming data directly from one system to another is often problematic. Kafka serves as a scalable broker, keeping up with producers and persisting data for all consumers.
The Log “It is an append-only, totally-ordered sequence of records ordered by time.” - Jay Kreps Reference: The Log - Jay Kreps