ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, PostgreSQL, Redpanda, Debezium, and Benthos to master building advanced real-time data pipelines.

weimeilin1, 20 slides, Apr 24, 2024

About This Presentation

The slides that accompany the workshop.

- Demonstrate Transitioning to Real-Time Data Systems: This track shows how to transition from traditional batch processing to real-time data handling, emphasizing the critical importance of timely updates in scenarios like air traffic control where delays ca...


Slide Content

BATCH TO STREAM DATA PIPELINE
Christina Lin
Redpanda
Developer Advocate

Agenda
•Quick Intro – 10-15 mins
•HANDS-ON – 35-40 mins
•Q & A – 5-10 mins
© 2024 REDPANDA DATA

Christina Lin
Developer Advocate, Redpanda
aka. The Redpanda Lady
SOA, WebSphere, DB2, Sybase, Oracle, MQ, J2EE, EJB, DevOps, Microservice, EIP, K8s, Agile, Integration, Data Mesh, ActiveMQ
Living data stack
Resilience - handle failures and scale gracefully
Elasticity - infrastructure that can scale dynamically
Decentralization - data ownership, empowering individual teams
Performance - low latency and high throughput
Autonomy - self service, define quality, and access
Nimble - efficient data movement
Distributed - distributed data processing for cloud native
Agility - quickly respond to change in data

An ordinary day of a Data Engineer

Stateless Streaming Pipeline
•Transform: format change, masking, filtering, validating
•Dispatch, Wiretap: split, multiple destinations
•Control: reroute
•Normalize / Denormalize
•Enrich: multiple ingestion
Stateful Streaming Pipeline
•Complex event processing
•Time-window based processing
•Enrich: multiple ingestion
Micro batch Pipeline
•Transform for large output (Dataset)
•Partitioning: split workload
•Analytics
Batch Pipeline
•Analytics large volume (legacy)
•Transform large output (Dataset, legacy)
•Transport large unstructured data
Better scalability for pipelines
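
The stateless transform steps named on this slide (validating, filtering, masking, format change) can be sketched in plain Python. This is only an illustration, not the workshop's code: the field names `id`, `status`, and `contact`, and the helpers `mask`, `transform`, and `pipeline`, are all hypothetical. The point it shows is that each record is handled on its own, with no memory of earlier records.

```python
from typing import Iterable, Iterator, Optional


def mask(value: str, visible: int = 2) -> str:
    """Keep the last `visible` characters and replace the rest with '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def transform(record: dict) -> Optional[dict]:
    """Validate, filter, and mask one record; return None to drop it."""
    # Validate: required fields must be present (hypothetical field names).
    if "id" not in record or "status" not in record:
        return None
    # Filter: drop records this pipeline is not interested in.
    if record["status"] == "cancelled":
        return None
    # Format change + masking: build a new dict, never mutate the input.
    return {
        "id": str(record["id"]),
        "status": record["status"].upper(),
        "contact": mask(record.get("contact", "")),
    }


def pipeline(records: Iterable[dict]) -> Iterator[dict]:
    """Apply `transform` record by record; no state is kept between records."""
    for record in records:
        out = transform(record)
        if out is not None:
            yield out
```

Because no state is shared between records, a step like this can usually be scaled out by running more copies of it in parallel.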

The Workshop overview
[Diagram: Batch pipeline / Batch Processing: CSV every 10 mins. Stream: CSV right away!]
https://bit.ly/odsc-redpanda
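
To make the "every 10 mins" versus "right away" contrast concrete, here is a small arithmetic sketch. It assumes an idealized batch job that runs on a fixed schedule starting at time 0 and takes no time to execute; the function names are hypothetical and not part of the workshop material.

```python
import math

BATCH_INTERVAL_SECONDS = 10 * 60  # "Every 10 mins" from the slide


def batch_visible_at(arrival: float, interval: int = BATCH_INTERVAL_SECONDS) -> float:
    """Time of the first scheduled batch run at or after `arrival` (seconds)."""
    return math.ceil(arrival / interval) * interval


def batch_delay(arrival: float, interval: int = BATCH_INTERVAL_SECONDS) -> float:
    """How long a record waits before the next batch run picks it up."""
    return batch_visible_at(arrival, interval) - arrival


def stream_delay(arrival: float) -> float:
    """Idealized streaming: the record is processed as soon as it arrives."""
    return 0.0
```

Under these assumptions, a record that arrives one second after a batch run waits 599 seconds for the next one, while the streaming path handles it immediately (real streaming pipelines still add some processing latency).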

VM
postgres:5432 cassandra:5432
postgresload.ipynb
The Batch
cassandra:5432

VM
postgres:5432 cassandra:5432
cassandraload.ipynb
The Batch

VM
postgres:5432 cassandra:5432
spark.ipynb
The Batch
CSV CSV CSV

VM
postgres:5432 cassandra:5432
Map.ipynb
Dashboard View
CSV CSV CSV

VM
postgres:5432 cassandra:5432 redpanda-0:9092
Let’s Stream

VM
postgres:5432 cassandra:5432
Let’s Stream

VM
postgres:5432 cassandra:5432
Config
Let’s Stream

VM
postgres:5432 cassandra:5432
Job manager
Flink DataStream
Java JAR
Let’s Stream

VM
postgres:5432 cassandra:5432
Job manager
CSV
Let’s Stream

VM
postgres:5432 cassandra:5432
Job manager
CSV
Let’s Stream
CSV CSV

VM
postgres:5432 cassandra:5432
SQL Client
SQL
The Batch
CSV CSV CSV

The Workshop overview
[Diagram: Batch pipeline / Batch Processing: CSV every 10 mins. Stream: CSV right away!]

Stateless Streaming Pipeline
•Transform: format change, masking, filtering, validating
•Dispatch, Wiretap: split, multiple destinations
•Control: reroute
•Normalize / Denormalize
•Enrich: multiple ingestion
Stateful Streaming Pipeline
•Complex event processing
•Time-window based processing
•Enrich: multiple ingestion
Better scalability for pipelines
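
As a rough illustration of the "time-window based processing" item, the sketch below counts events per key in fixed (tumbling) windows. It is an assumption-laden toy, not the workshop's Flink job: events are taken to be `(timestamp_seconds, key)` tuples, the window length defaults to 60 seconds, and the function name is hypothetical. What makes it stateful is the `counts` table, which must persist across events.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]],
    window_seconds: int = 60,
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    # State carried across events: one counter per window and key.
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)
```

A production stream processor would also have to decide when a window is complete and how to treat late events; this sketch ignores both.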

Keep Learning
Streaming - Communication
Basics of K8s networking
Connectivity
Performance
Docs
Get a peek under the hood.
https://docs.redpanda.com/
Blogs
Keep up to date with Redpanda.
https://redpanda.com/blog
Slack
Engage with our community.
https://redpanda.com/slack
Code
Check out the source.
https://github.com/redpanda-data
Redpanda University
Free, self-paced online learning
https://university.redpanda.com