HostedbyConfluent
34 slides
Oct 25, 2023
About This Presentation
"Initial snapshots are a core feature of Debezium: when setting up a new CDC connector, existing tables can be scanned in order to export their full state to consumers, before starting to capture changes from the transaction log. While this works great in general, a few questions came up again and again in the Debezium community over time:
* How to re-snapshot just a single table?
* How to pause and resume long-running snapshots?
* How to run snapshots in parallel to reading changes from the log?
All this, and more, becomes possible with the notion of incremental snapshots. In this session you'll learn how this innovative scheme of interleaving snapshot queries and log-based change events works under the hood and how it solves common tasks when running CDC pipelines. We'll also discuss advanced topics like parallelising snapshots and customising snapshot contents."
#DebeziumSnapshotting @gunnarmorling
Gunnar Morling
●Software engineer at Decodable
●Former project lead of Debezium
●kcctl, JfrUnit, ModiTect, MapStruct
●Spec Lead for Bean Validation 2.0
●Java Champion
Recap – Debezium
Log-Based Change Data Capture
Snapshotting
Why Is It Needed?
●Need to backfill data, but don’t have all TX logs
●Solution: scan data once before streaming
●Emit READ event for each record
Snapshotting
Classic Approach – General Idea
●Capture current position in transaction log
●Scan all relevant tables
●Start streaming
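The three steps above can be sketched as a toy, in-memory model. This is only an illustration under stated assumptions: `classic_snapshot`, the `table` dict and the `tx_log` list are hypothetical stand-ins for a real database and its transaction log, and the event shape (`op`, `key`, `after`) is simplified. The `"r"` op code mirrors the READ events mentioned earlier.

```python
from typing import Iterator


def classic_snapshot(table: dict, tx_log: list) -> Iterator[dict]:
    """Toy model of a classic initial snapshot (hypothetical helper)."""
    # 1. Capture the current position in the transaction log.
    start_offset = len(tx_log)
    # Freeze the table contents, standing in for a consistent read.
    frozen = dict(table)

    # 2. Scan all relevant rows and emit one READ ("r") event per record.
    for key in sorted(frozen):
        yield {"op": "r", "key": key, "after": frozen[key]}

    # 3. Start streaming log events from the captured position onwards.
    for event in tx_log[start_offset:]:
        yield event
```

Note that in this model no log event reaches the consumer until the scan has finished, which is the limitation the incremental approach addresses.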
Incremental Snapshotting
The Paper
●“DBLog: A Watermark Based Change-Data-Capture Framework”, by Andreas Andreakis and Ioannis Papapanagiotou
●Key idea: interleave snapshot events and events from TX log
https://arxiv.org/pdf/2010.12597v1.pdf
Incremental Snapshotting
General Idea
Incremental Snapshotting
Windowing via Watermarks
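A minimal sketch of how one watermark window might be processed, based on a simplified reading of the DBLog paper's idea and not on Debezium's actual implementation. The assumptions: a chunk of rows is read from the table between writing a low and a high watermark, both watermarks show up in the log stream, and any chunk row whose key also appears in a log event inside the window is discarded because the log event is more recent. The function name, the marker ops `low_watermark` / `high_watermark` and the event shape are all hypothetical.

```python
def process_window(chunk: dict, log_events: list) -> list:
    """Interleave one snapshot chunk with log events (toy model).

    chunk      -- key -> row, as read between the two watermarks
    log_events -- log-ordered events, including the watermark markers
    """
    buffer = dict(chunk)   # chunk rows still eligible to be emitted
    out = []
    in_window = False

    for event in log_events:
        if event["op"] == "low_watermark":
            in_window = True
        elif event["op"] == "high_watermark":
            in_window = False
            # Emit the surviving chunk rows as READ ("r") events.
            for key, row in buffer.items():
                out.append({"op": "r", "key": key, "after": row})
            buffer.clear()
        else:
            if in_window:
                # The log holds a newer state for this key: drop the
                # possibly stale row from the chunk buffer.
                buffer.pop(event["key"], None)
            # Log events are always passed through.
            out.append(event)
    return out
```

Because only chunk rows are ever dropped, log events keep their original order, and the log stream never has to wait for the whole table to be scanned.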
Incremental Snapshotting
Semantics
●No guarantee for snapshot (read) events for all records
●May receive update or delete without prior insert/read
●May receive read and update/delete
●What is guaranteed: complete data set after snapshot
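One way a consumer could cope with these semantics is to treat every non-delete event as an upsert and every delete as idempotent. The sketch below assumes a simplified event shape (`op`, `key`, `after`) and a plain dict as the target store; `apply_event` is a hypothetical helper, not part of any Debezium API.

```python
def apply_event(state: dict, event: dict) -> None:
    """Apply one change event to an in-memory key -> row store."""
    if event["op"] == "d":
        # Deleting a key that was never seen is a no-op.
        state.pop(event["key"], None)
    else:
        # "r" (read), "c" (create) and "u" (update) are all upserts,
        # so an update without a prior insert/read still works.
        state[event["key"]] = event["after"]
```

With this handling, the store converges to the complete data set once the snapshot has finished, regardless of which mix of read, update and delete events arrived for a given key.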
Incremental Snapshotting
Benefits
●Can update filter list ✅
●Long-running snapshots can be paused/resumed ✅
●Can stream changes before snapshot completed ✅
●Can re-snapshot selected tables ✅
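As an illustration of re-snapshotting selected tables, the sketch below builds the three values of a signal row (`id`, `type`, `data`) for an `execute-snapshot` signal of type `incremental`, which is how ad-hoc incremental snapshots are typically requested through Debezium's signaling table. The helper name and the table name `inventory.customers` are hypothetical, and the name of the signaling table itself depends on the connector's `signal.data.collection` setting; check the Debezium documentation for your version before relying on the exact payload.

```python
import json
import uuid


def execute_snapshot_signal(tables, signal_id=None):
    """Build (id, type, data) for an ad-hoc incremental snapshot signal.

    Hypothetical helper: it only constructs the values, inserting them
    into the signaling table is left to the caller.
    """
    data = {"data-collections": list(tables), "type": "incremental"}
    return (
        signal_id or str(uuid.uuid4()),  # arbitrary unique identifier
        "execute-snapshot",
        json.dumps(data),
    )


# Example: request a re-snapshot of a single (hypothetical) table.
row = execute_snapshot_signal(["inventory.customers"], signal_id="ad-hoc-1")
```

The resulting tuple can then be inserted into the signaling table with a parameterised `INSERT` through whatever database driver is in use.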
Resources
●Incremental Snapshots in Debezium
https://debezium.io/blog/2021/10/07/incremental-snapshots/
●Read-only Incremental Snapshots for MySQL
https://debezium.io/blog/2022/04/07/read-only-incremental-snapshots/
●Flink CDC
https://ververica.github.io/flink-cdc-connectors/
Upcoming
●Debezium & Kafka Connect – Ask the Experts
With Chris Cranford (Red Hat) and Chris Egerton (Aiven)
Sep 27, 2:30 PM
●Change Stream Processing with Debezium and Apache Flink
With Robert Metzger (Decodable)
Sep 27, 5:30 PM, Dremio Office
https://www.meetup.com/sf-big-analytics/events/294068331/