Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide

weimeilin1 129 views 21 slides Apr 25, 2024

About This Presentation

The slides that accompany the workshop.

- Demonstrate Transitioning to Real-Time Data Systems: This track shows how to transition from traditional batch processing to real-time data handling, emphasizing the critical importance of timely updates in scenarios like air traffic control where delays ca...


Slide Content

BATCH TO STREAM DATA PIPELINE
Christina Lin
Redpanda
Developer Advocate

Agenda
•Quick Intro – 10-15 mins
•HANDS-ON – 35-40 mins
•Q & A – 5-10 mins
© 2024 REDPANDA DATA

Christina Lin
Developer Advocate, Redpanda
a.k.a. The Redpanda Lady
SOA, WebSphere, DB2, Sybase, Oracle, MQ, J2EE, EJB, DevOps, Microservice, EIP, K8s, Agile, Integration, Data, Mesh, ActiveMQ
Living data stack
Resilience - handle failures and scale gracefully
Elasticity - infrastructure that can scale dynamically
Decentralization - data ownership, empowering individual teams
Performance - low latency and high throughput
Autonomy - self-service, define quality, and access
Nimble - efficient data movement
Distributed - distributed data processing for cloud native
Agility - quickly respond to changes in data

An ordinary day of a Data Engineer

Stateless Streaming Pipeline
•Transform: format change, masking, filtering, validating
•Dispatch, Wiretap
•Split, multiple destinations
•Control reroute
•Normalize/Denormalize
•Enrich
•Multiple ingestion
Stateful Streaming Pipeline
•Complex event processing
•Time-window based processing
•Enrich
•Multiple ingestion
Micro-batch Pipeline
•Transform for large output (Dataset)
•Partitioning: split workload
•Analytics
Batch Pipeline
•Analytics, large volume (legacy)
•Transform, large output (Dataset, legacy)
•Transport large unstructured data
Better scalability for pipelines
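
The difference between the stateless and stateful pipeline types listed above can be sketched in a few lines of plain Python. This is only an illustration, not the workshop's code: the event shape (timestamp, flight id, altitude), the masking rule, and the 60-second window are all assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, flight id, altitude in feet).
Event = Tuple[int, str, float]


def stateless_pipeline(events: Iterable[Event]) -> Iterator[Event]:
    """Stateless: each event is validated, filtered and masked on its own."""
    for ts, flight_id, altitude in events:
        if altitude < 0:  # validate/filter: drop impossible readings
            continue
        yield ts, flight_id[:2] + "***", altitude  # mask the flight id


def tumbling_window_counts(events: Iterable[Event], window_s: int = 60) -> Dict[int, int]:
    """Stateful: a running count per time window is kept across events."""
    counts: Dict[int, int] = defaultdict(int)
    for ts, _flight_id, _altitude in events:
        window_start = (ts // window_s) * window_s  # start of the tumbling window
        counts[window_start] += 1
    return dict(counts)


# Made-up sample events, one of them invalid (negative altitude).
events = [(0, "AA100", 1000.0), (30, "BA200", -5.0), (61, "DL300", 2000.0), (119, "UA400", 3000.0)]
cleaned = list(stateless_pipeline(events))
print(cleaned)
print(tumbling_window_counts(cleaned))
```

The stateless function needs nothing but the current event, so it can be scaled out freely; the windowed count has to remember earlier events, which is why stateful pipelines usually need a stream processor that manages state.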

The Workshop overview
[Diagram: CSV files flow through two paths. Batch pipeline (batch processing): every 10 mins. Stream pipeline: right away!]
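
The "every 10 mins" versus "right away" contrast in the overview can be put into numbers with a small sketch. The 10-minute interval comes from the slide; the arrival times and the rule that an event waits for the next scheduled run are assumptions for illustration only.

```python
BATCH_INTERVAL_S = 10 * 60  # "Every 10 mins" from the slide


def batch_visible_at(event_time_s: float, interval_s: int = BATCH_INTERVAL_S) -> float:
    """Time at which a job that runs every interval_s first picks up the event.

    Assumption: an event is picked up by the next scheduled run after it arrives.
    """
    runs_so_far = int(event_time_s // interval_s)
    return float((runs_so_far + 1) * interval_s)


def stream_visible_at(event_time_s: float, processing_delay_s: float = 0.0) -> float:
    """Time at which a streaming pipeline has processed the event."""
    return event_time_s + processing_delay_s


# Hypothetical arrival times, in seconds since the start of the day.
for arrival in (5, 299, 601):
    wait = batch_visible_at(arrival) - arrival
    print(f"arrived at {arrival}s: batch waits {wait:.0f}s, stream waits 0s")
```

Under these assumptions an event can sit for almost the whole interval before the batch job sees it, which is the delay the streaming path in the workshop is meant to remove.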

The Batch: Setup & Load data
https://bit.ly/odsc-redpanda
VM: cassandra:5432, postgres:5432, jupyterlab:8888
Notebook: postgresload.ipynb
Table: public.bos_air_traffic

The Batch: Setup
VM: cassandra:5432
Notebook: cassandraload.ipynb
Table: latest_flight_data

The Batch
VM
Notebook: spark.ipynb
Files: CSV

The Batch
VM
Notebook: Map.ipynb
Files: CSV

Let’s Stream
VM: redpanda-0:9092, console:8080

Let’s Stream
VM: kafka-connect:8083

Let’s Stream
VM
Config
Table: public.bos_air_traffic
Topic: boston.public.bos_air_traffic
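
The topic name on this slide looks like the table name with a prefix in front, that is `<prefix>.<schema>.<table>`. Some change-data-capture connectors name topics this way, but the slide does not say which connector or settings the workshop uses, so the sketch below only illustrates the naming pattern. The helper names `cdc_topic_name` and `parse_cdc_topic` are hypothetical.

```python
from typing import Tuple


def cdc_topic_name(prefix: str, schema: str, table: str) -> str:
    """Build a topic name following the assumed <prefix>.<schema>.<table> pattern."""
    return f"{prefix}.{schema}.{table}"


def parse_cdc_topic(topic: str) -> Tuple[str, str, str]:
    """Split a topic name back into (prefix, schema, table)."""
    parts = topic.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"not a <prefix>.<schema>.<table> topic: {topic!r}")
    return parts[0], parts[1], parts[2]


# Values taken from the slide: table public.bos_air_traffic, prefix "boston".
topic = cdc_topic_name("boston", "public", "bos_air_traffic")
print(topic)
print(parse_cdc_topic(topic))
```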

Let’s Stream
VM: jobmanager:8081
Topic: boston.public.bos_air_traffic
Flink DataStream (Java JAR)

Let’s Stream
VM
Files: CSV
Topic: sensor_csv

Let’s Stream
VM
Files: CSV
Topic: sensor_csv
rpk, redpanda-0:9644
Topic: filtered_sensor_csv
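
The step from `sensor_csv` to `filtered_sensor_csv` is a per-record filter. The slide does not show the record layout or the filter rule, so the following is only a sketch of the idea in plain Python: the three columns (`sensor_id`, `reading`, `status`) and the threshold of 50.0 are invented for illustration, and the lists stand in for the two topics.

```python
import csv
import io
from typing import Iterable, Iterator, Optional


def keep_record(line: str, min_reading: float = 50.0) -> Optional[str]:
    """Return the CSV line unchanged if it passes the filter, else None.

    Hypothetical layout: sensor_id,reading,status
    """
    row = next(csv.reader(io.StringIO(line)), None)
    if not row or len(row) != 3:
        return None  # malformed record
    _sensor_id, reading, _status = row
    try:
        value = float(reading)
    except ValueError:
        return None  # non-numeric reading (for example a header line)
    return line if value >= min_reading else None


def filter_records(records: Iterable[str]) -> Iterator[str]:
    """Apply keep_record to every record, as a stateless filter would."""
    for line in records:
        kept = keep_record(line)
        if kept is not None:
            yield kept


# Made-up records standing in for messages on the sensor_csv topic.
sensor_csv = ["sensor_id,reading,status", "s1,72.5,ok", "s2,12.0,ok", "bad line", "s3,abc,ok"]
filtered_sensor_csv = list(filter_records(sensor_csv))
print(filtered_sensor_csv)
```

Because each record is judged on its own, the filter keeps no state between records, which matches the stateless streaming pipeline category from earlier in the deck.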

Let’s Stream
VM
Files: CSV
Topic: sensor_csv
SQL
SQL Client

The Workshop overview
[Diagram: CSV files flow through two paths. Batch pipeline (batch processing): every 10 mins. Stream pipeline: right away!]

Stateless Streaming Pipeline
•Transform: format change, masking, filtering, validating
•Dispatch, Wiretap
•Split, multiple destinations
•Control reroute
•Normalize/Denormalize
•Enrich
•Multiple ingestion
Stateful Streaming Pipeline
•Complex event processing
•Time-window based processing
•Enrich
•Multiple ingestion
Better scalability for pipelines

Keep Learning
Streaming - Communication
Basics of K8s networking
Connectivity
Performance
Docs
Get a peek under the hood.
https://docs.redpanda.com/
Blogs
Keep up to date with Redpanda.
https://redpanda.com/blog
Slack
Engage with our community.
https://redpanda.com/slack
Code
Check out the source.
https://github.com/redpanda-data
Redpanda University
Free, self-paced online learning
https://university.redpanda.com