Architecting Applications With Multiple Open Source Big Data Technologies

Paul Brebner, 96 slides, Jun 18, 2024

About This Presentation

Keynote for Data Engineering track at Community over Code EU (Bratislava, Slovakia, June 4 2024) https://eu.communityovercode.org/sessions/2024/architecting-applications-with-multiple-open-source-big-data-technologies/ When I started as the Instaclustr Technology Evangelist 7 years ago, I already ha...


Slide Content

Architecting applications
with multiple open source
big data technologies
Paul Brebner
Open Source Technology Evangelist
June 2024
© 2024 NetApp, Inc. All rights reserved.

Data engineering?!
Do I know anything about Data Engineering?
Maybe: I contributed a chapter, “The Yin and Yang of
Big Data Scalability”, to 97 Things Every Data
Engineer Should Know, 2021
And “Architecting”? Computer Scientist with 40 years'
experience in distributed systems R&D, so hopefully!
This was supposed to be a custom talk—but due to
COVID I reengineered my FOSSASIA 2024 talk “30 of
My Favorite Open Source Technologies in 30 Minutes”

What do they have in common?
•Instaclustr provides some as
managed services
•They are complementary and can be
used together
•I’ve learned them all over the last 7
years and built realistic demo
applications
•Focus on integration, streaming,
persistence, pipelines, transformation,
meta-data, observability, visualization,
AI/ML, performance, scalability, different
architectural styles (e.g. CDC,
orchestration/choreography)
•all “architectural” features

What’s that?!
•An escaped “Pokémon”!
•When my kids were growing up Pokémon lived
inside a “Game Boy”
A strange toy I found at the shop

Format
<Name, overview, superpower(s), watch out for… + use cases and what’s new>
E.g. “Pokémon”
•Name: Charmander
•What: A fire lizard
•Superpower: Evolves to Charizard, a flying, fire-breathing lizard
•Watch out for: Water

1. Apache Cassandra®

•What?
•NoSQL horizontally scalable key-value database
•Superpowers
•Fast Writes (lots of typewriters)
•Wide column store
•Clustering Columns, good for hierarchical data modeling
(e.g. Geospatial)
•In-built multi-DC replication
•My Use Cases
Apache Cassandra®
Office Typing Pool, 1918
(Source: Wikipedia, Public Domain)

Anomaly detection: 19 billion checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more

Global low-latency fintech

•Watch out for
•CQL != SQL
•Different data model
•Design for reads
•De-normalization is normal
•Consistency < traditional SQL databases
•Reads are slower
•What’s new?
•Vector Search in 5.0

2. Apache Spark™

•What?
•Cluster batch/stream processing, analytics, and ML
•Superpowers
•In-memory → fast
•Good support for ML
•+ Cassandra (wide columns) as a feature store
•Good for heavy transformation operations at scale
•My Use Cases
Apache Spark™
Car Factory Assembly Line
(Source: Wikimedia Public Domain)

ML of Cassandra monitoring data
Apache Spark
Apache Cassandra
MLlib
DataFrames
Spark Streaming

•Watch out for
•Lots of RAM, else OOM (Out-of-Memory Errors)
•Spark Streaming is near real-time (micro-batch)
•What’s new?
•3.4 has Spark Connect for a decoupled client-server architecture
•Ocean for Apache Spark (Spot by NetApp)

3. Apache Zeppelin™

Apache Zeppelin™
Graf Zeppelin exploring the Arctic, 1931
(Source: Wikimedia Public Domain)
•What?
•Web-based notebook for data exploration
•Superpowers
•Interactive “notebook” style tool
•Supports Apache Spark

Apache Zeppelin™
The Galilean moons of Jupiter
(Source: Wikimedia CCL)
•Watch out for
•Sufficient Zeppelin resources
•We don’t support it anymore
•What’s new?
•Jupyter Notebook!
•Good Kafka and Cassandra integration

4. Apache Lucene™

•What?
•Fast full-featured search engine
•Superpowers
•Lucene plugin + Cassandra for enhanced Cassandra search
•Works as a Cassandra secondary index
•Supports vector search too
•Watch out for
•Performance
•We currently support it:
https://github.com/instaclustr/cassandra-lucene-index
•My Use Cases
Apache Lucene™
A Librarian using a card catalogue (1940)
(Source: Library of Congress Public Domain)

Geospatial anomaly detection
Apache Cassandra
Apache Lucene Plugin
Geospatial searches
(Source: Wikimedia, Public Domain)

5. Apache Kafka®

•What?
•Distributed publish-subscribe messaging system
•Superpowers
•Fast
•Highly distributed and horizontally scalable, available
and durable
•Buffering and message replay
•My Use Cases
Apache Kafka®
Postal Delivery Service
Railway Post Office:
Mail bags snatched by speeding train
(Source: Wikimedia CCL Domain)

Xmas tree lights simulation

“Kongo” IoT logistics simulation
Apache Kafka
Guava Event Bus
Real-time logistics
Tracking and checking

Anomaly detection: 19 billion checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more

•Watch out for
•Too many topics/partitions impacts throughput
•What’s new?
•KRaft (replacing ZooKeeper) for faster meta-data operations
•And maybe even faster data workloads
•Tiered storage (3.6)
•End-to-end client monitoring (3.7)

6. Apache Kafka® Streams

•What?
•Stream processing API and client for Kafka
•From/to Kafka cluster
•Superpowers
•Complex stateful stream processing operations (e.g. joins)
•Over time windows and multiple topics and state stores
•My Use Cases
Apache Kafka® Streams
Niagara Falls daredevil
(Source: Shutterstock)

Apache Kafka® Streams
IoT application: truck overload
(Source: Shutterstock)

•Watch out for
•Complex stream topologies
•Debugging is tricky
•Performance
•What’s new?
•Alternatives (e.g. Apache Flink, RisingWave, etc.)

7. Apache Kafka® Connect

•What?
•Kafka API for streaming from source to sink systems
•Via Kafka cluster
•Superpowers
•Heterogeneous integration
•Code-free: just connector configuration
•Independently scalable
•Connectors run on an independent Kafka Connect cluster
•My Use Cases
Apache Kafka® Connect
Telephone Switchboard Operators Connecting Calls
(Source: Wikimedia Public Domain)

Zero-code data pipelines
REST Tidal Data to PostgreSQL + Superset
REST Tidal Data to OpenSearch
OpenSearch sink
connector

•Watch out for
•Open source connector evaluation and selection
•Error handling
•Source/sink system scalability
•What’s new
•Debezium

8. Kafka MirrorMaker 2 (MM2)

•What?
•Replicates Kafka topics between clusters
•Superpowers
•Uses Kafka Connect (but reads/writes from/to Kafka clusters)
•Topic renaming, prevents loops
•Complex bi-directional topologies
•Many use cases for multiple Kafka clusters
•Cluster migration
•Geographical distribution
•Low latency, redundancy
•Fan-out architectures
•Edge computing, etc.
Kafka MirrorMaker 2
Head of Kafka: replicated tiers move
(Source: Paul Brebner)

•Watch out for
•Bi-directional flow requires 2 Kafka Connect clusters
•Duplicate events (from overlapping topic subscriptions)
•Use topic renaming and the default source cluster alias to
prevent cycles and infinite topic creation
•What’s new?
•For me, automated consumer offset sync across clusters
•In 2.7.0 (2020)!
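
The topic renaming and cycle prevention above can be sketched in a few lines of Python. This is an illustration only, not MirrorMaker 2 code: the function names are hypothetical, and it assumes replicated topics are named `<source alias>.<topic>` and that cluster aliases never appear as the first segment of an ordinary topic name.

```python
SEPARATOR = "."  # assumed separator between cluster alias and topic name

def remote_topic(source_alias: str, topic: str) -> str:
    """Name a replicated topic after the cluster it came from."""
    return f"{source_alias}{SEPARATOR}{topic}"

def upstream_aliases(topic: str, known_aliases: set) -> list:
    """Cluster aliases this topic has already been replicated through."""
    seen = []
    while True:
        prefix, sep, rest = topic.partition(SEPARATOR)
        if not sep or prefix not in known_aliases:
            return seen
        seen.append(prefix)
        topic = rest

def should_replicate(topic: str, target_alias: str, known_aliases: set) -> bool:
    """Skip topics that originated on the target cluster (that would be a cycle)."""
    return target_alias not in upstream_aliases(topic, known_aliases)

clusters = {"A", "B"}
# A -> B flow: "orders" on A becomes "A.orders" on B.
print(remote_topic("A", "orders"))
# B -> A flow: "A.orders" must not be copied back to A, but B's own "orders" can be.
print(should_replicate("A.orders", "A", clusters))
print(should_replicate("orders", "A", clusters))
```

Without the rename, "orders" would bounce between the two clusters forever; with it, each hop is visible in the topic name and can be filtered.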

9. Apache Camel™

•What?
•Apache Camel—Integration framework
•Apache Camel Kafka Connectors—open source Kafka
connectors
•Superpowers
•Large number of open source Kafka Connectors—172
(officially), 179 sources and sinks
•Auto-generated from Camel components
Apache Camel™
Camel train in Broome, WA
(Source: Adobe Stock)

•Watch out for
•Configuration!
•Need to read (1) Camel component, (2) Basic connector configuration,
and (3) connector specific documentation
•Some connectors are both sources and sinks (source or sink
depends on configuration)
•What’s new?
•Kamelets!
•Can appear in the configuration

10. Kafka Parallel Consumer

•What?
•Multi-threaded Kafka consumer
•Superpowers
•Multi-threaded (cf. the default single-threaded consumer)
•Higher concurrency with fewer consumers and partitions
•Use cases
•Low latency, High Throughput
•Slow consumers
•Replacement for my multiple pool consumer hack
Kafka Parallel Consumer
Jacquard Loom, Berlin
(Source: Paul Brebner)

•Watch out for
•Configure for
•Ordering mode
•Partition → Key → Unordered (increasing concurrency)
•Max threads
•What’s new?
•Choice of commit modes
•Consumer Asynchronous, Synchronous and Producer Transactions
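
To see why concurrency increases from Partition to Key to Unordered, here is a minimal Python sketch. It is not the Parallel Consumer API: `max_concurrency`, the record tuples and the thread limit are all hypothetical, and it only counts how many records from one polled batch could be in flight at once under each ordering mode.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (partition, key), illustrative data only

def max_concurrency(records: List[Record], mode: str, max_threads: int) -> int:
    """Upper bound on records processed at the same time for an ordering mode."""
    if mode == "partition":
        # Order kept per partition: one in-flight record per partition.
        parallel = len({partition for partition, _ in records})
    elif mode == "key":
        # Order kept per key: one in-flight record per distinct key.
        parallel = len(set(records))
    elif mode == "unordered":
        # No ordering guarantee: every record may run in parallel.
        parallel = len(records)
    else:
        raise ValueError(f"unknown ordering mode: {mode}")
    return min(parallel, max_threads)  # the thread pool is the final cap

batch = [(0, "truck-1"), (0, "truck-2"), (0, "truck-1"), (1, "truck-3"), (1, "truck-3")]
for mode in ("partition", "key", "unordered"):
    print(mode, max_concurrency(batch, mode, max_threads=16))
```

With two partitions and three keys, the bound grows from 2 to 3 to 5, all without adding partitions or consumers.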

11. Apache ZooKeeper™
12. Apache Curator™

•What?
•Distributed systems coordination and meta-data
management
•Superpowers
•High consistency, availability and performance (reads)
•Use cases
•Until recently, used in Kafka, Pulsar, etc.
Apache ZooKeeper™
Being a Zookeeper in Australia can be risky
(Source: Shutterstock)

Apache ZooKeeper™(and Curator™) meet the dining philosophers
(Source: Wikipedia CCL)

•Watch out for
•Low-level
•Apache Curator (high-level ZK client) is better with
•Leader Latch
•Shared Lock
•Shared Counter
•Scalability limitations
•Slow for writes, max cluster size is 7 servers
•What’s new?
•KRaft: a Kafka-based Raft implementation
•For meta-data management and leader election
•Faster meta-data operations, more partitions etc. Potentially faster
data workloads

13. Kubernetes

•What?
•Automation of containerized applications
•Superpowers
•Available on public clouds, e.g. AWS EKS
•Ephemeral Pods are the unit of concurrency
•Easy to scale applications (more or fewer Pods)
•My use cases
Kubernetes
Greek Triremes ruled the seas
Captained by Helmsmen (Kubernetes)
(Source: Wikipedia CCL)

Anomaly detection: 19 billion checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more

•Watch out for
•Pod and resource scaling
•Easy to create many Pods
•With insufficient or lots of resources
•Tuning the application can be tricky
•Optimize the number of Pods vs. Kafka consumers/partitions,
Cassandra database connections, etc.
•What’s new?
•Operators
•e.g. Strimzi for Kafka
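
One reason tuning Pods against Kafka partitions matters: within a consumer group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle. A small Python sketch of that arithmetic (the function name and the numbers are hypothetical):

```python
def consumer_utilisation(pods: int, consumers_per_pod: int, partitions: int) -> dict:
    """How many consumers in one group can actually be assigned a partition."""
    total = pods * consumers_per_pod
    active = min(total, partitions)  # at most one consumer per partition
    return {"total": total, "active": active, "idle": total - active}

# Scaling to 10 Pods with 2 consumers each against a 12-partition topic:
print(consumer_utilisation(pods=10, consumers_per_pod=2, partitions=12))
```

In this example 8 of the 20 consumers would do no work, so adding Pods past 6 only adds cost (and, for the anomaly detector, more Cassandra connections).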

14. Prometheus
15. Grafana

•What?
•Prometheus: Monitoring and Alerting
•Grafana: Graphing
•Superpowers
•Instrumentation or Agents (Exporters) to expose
application metrics
•Time series data with counter, gauge, histogram and
summary metrics
•My use cases
•Monitoring and scaling/optimization/debugging
•Anomaly Detector (Cassandra, Kafka, Kubernetes) application
•Kafka Connect data pipelines
•Instaclustr’s Monitoring API has a Prometheus version
Prometheus + Grafana
Counting on an abacus
(Source: Wikimedia Public Domain)
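
To make the metric types concrete, here is a minimal pure-Python sketch of a counter, a gauge and a histogram, plus how a histogram looks in the Prometheus text exposition format. It is not the official client library: the classes, the metric name and the bucket bounds are made up for illustration, and summaries are left out.

```python
import math

class Counter:
    """Only goes up, e.g. total anomaly checks performed."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount: float = 1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

class Gauge:
    """Can go up or down, e.g. current consumer lag."""
    def __init__(self):
        self.value = 0.0
    def set(self, value: float):
        self.value = value

class Histogram:
    """Counts observations into cumulative buckets, with a running sum and count."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0
    def observe(self, value: float):
        self.sum += value
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1  # cumulative: all buckets >= value

def render_histogram(name: str, hist: Histogram) -> str:
    """Render one histogram in the text format a Prometheus server scrapes."""
    lines = [f"# TYPE {name} histogram"]
    for bound, n in zip(hist.bounds, hist.bucket_counts):
        le = "+Inf" if bound == math.inf else str(bound)
        lines.append(f'{name}_bucket{{le="{le}"}} {n}')
    lines.append(f"{name}_sum {hist.sum}")
    lines.append(f"{name}_count {hist.count}")
    return "\n".join(lines)

latency = Histogram([0.1, 0.5])
for seconds in (0.05, 0.3, 2.0):
    latency.observe(seconds)
print(render_histogram("check_latency_seconds", latency))
```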

•Watch out for
•Need to run a Prometheus server
•Configuring Prometheus with Kubernetes is tricky
•Use the Prometheus Operator
•What’s new?
•Since I started using it, Grafana has moved to the AGPL license
•Modified code has to be open sourced, even when only offered as a network service

16. OpenTracing
17. OpenTelemetry
18. Jaeger (and others)

•What?
•OpenTracing: End-to-end distributed tracing
•Superpowers
•End-to-end distributed application visibility
•Traces have Spans
•Visualization of system topology and times
OpenTracing/OpenTelemetry
X-ray vision
(Source: Public Domain)
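
A trace is a tree of spans, and both the topology view and the timings fall out of that structure. The sketch below uses plain Python rather than the OpenTracing or OpenTelemetry APIs; the `Span` fields, service names and timings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: int
    end_ms: int

def service_edges(spans: List[Span]) -> Set[Tuple[str, str]]:
    """Derive caller -> callee edges (a service map) from parent/child spans."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if parent and parent.service != span.service:
            edges.add((parent.service, span.service))
    return edges

def trace_duration_ms(spans: List[Span]) -> int:
    """End-to-end time covered by all spans in the trace."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)

trace = [
    Span("1", None, "producer", 0, 50),
    Span("2", "1", "kafka", 5, 20),
    Span("3", "2", "consumer", 20, 45),
]
print(sorted(service_edges(trace)))
print(trace_duration_ms(trace))
```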

•Watch out for
•Originally used OpenTracing and Jaeger
•Manual instrumentation
•What’s new?
•OpenTelemetry is the new standard
•Tracing, metrics and logs
•Automatic instrumentation
•Lots of open-source visualization tools
•Jaeger, SigNoz, Uptrace, OpenSearch
•Used in new client monitoring KIP-714, Kafka 3.7.0

SigNoz service map for Toy+Boxes application

19. PostgreSQL®

•What?
•Powerful SQL Database
•Superpowers
•SQL + Object Database
•Extensible
•JSONB + GIN indexes (efficient storage and
search of JSON)
PostgreSQL®
Elephant vs. tree: elephants are powerful
(Source: Adobe Stock)

•Watch out for
•Scalability
•Vertical; limited horizontal
•Benefits from connection pooling
•What’s new
•pgvector (vector similarity search)
•Significant performance improvement
•on Azure NetApp Files
•FerretDB (MongoDB front-end)

20. Apache Superset™

•What?
•Powerful data visualization tool
•Superpowers
•Reads from SQL sources
•Lots of visualization and graph types including geospatial
•My use case
•Visualization of tidal data from Kafka Connect pipeline
•Easy integration with PostgreSQL + JSONB
Apache Superset™
All superheroes (B) are a superset of those who use weapons (A)
(Source: Shutterstock)

21. OpenSearch®
22. Dashboard

•What?
•Open source version of Elasticsearch
•Based on Lucene → powerful + scalable text searching
•Superpowers
•Ingestion, indexing and searching of JSON documents
•Integrated dashboard for visualization
•Computational linguistics support:
•Stemming, Lemmatization, Levenshtein Fuzzy Queries,
N-grams, Slop, Partial matching!
•My use cases
•Sink and visualization for Kafka Connect tidal data
processing pipeline
OpenSearch® + Dashboard
Library of Congress Card Division 1919 (City block long)
(Source: Library of Congress Public Domain)
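
Levenshtein fuzzy queries match terms within a small edit distance of the search term. As an illustration of the idea only (this is plain Python, not how OpenSearch implements it, and `fuzzy_match` with its `max_edits` default is a hypothetical helper):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete from a
                current[j - 1] + 1,                  # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def fuzzy_match(term: str, candidates: list, max_edits: int = 2) -> list:
    """Candidates within max_edits of the term, in their original order."""
    return [c for c in candidates if levenshtein(term, c) <= max_edits]

print(levenshtein("kitten", "sitting"))
print(fuzzy_match("tidal", ["tidal", "tidle", "ocean"]))
```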

•Watch out for
•Default mappings and ingestion may not work
•E.g. geospatial data needs custom mappings and ingest pipelines
•Reindexing
•Kafka Connect Sink → OpenSearch throughput
•Needed the Bulk API
•What’s new?
•Vector Search
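
The Bulk API improves throughput by sending many documents in one request as newline-delimited JSON: an action line followed by a document line, repeated. A minimal sketch of building such a request body in Python, assuming a hypothetical index called `tides` and made-up documents (sending the request is left out):

```python
import json

def bulk_index_body(index: str, docs: list) -> str:
    """Build a newline-delimited JSON body for a bulk index request."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document line
    return "\n".join(lines) + "\n"  # the body must end with a newline

readings = [
    {"station": "8724580", "level_m": 0.42},
    {"station": "8724580", "level_m": 0.47},
]
print(bulk_index_body("tides", readings), end="")
```

Batching like this replaces one HTTP round trip per document with one per batch, which is typically where a sink connector's throughput goes.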

23. Redis™

•What?
•Fast (in-memory) Data Structures server
•Superpowers
•Lots of data types
•Keys, Strings, Lists, Hashes, Sets, Sorted sets, bitmaps,
geospatial, streams, time series, HyperLogLogs
(approximate counting)
•Pub/Sub
•Client-side caching for ultra-low latency,
e.g. Redisson client
Redis™
Look! Up in the sky!
It’s an in-memory key-value store! It’s a database!
It’s Redis!
(Source: Shutterstock)

•Watch out for
•Pipeline tuning impacts throughput
•Often used as a cache to reduce load on
backend database
•i.e. efficiency, not improved latency
•As other factors may dominate
•What’s new?
•Redis functions
•Code executed on the server (Redis 7)
•License change (7.4 source-available)

24. Uber’s Cadence®

•What?
•Scalable code-as-workflows engine
•Superpowers
•Sequenced, stateful, long-running, scheduled steps
•Scalable and reliable using event-sourcing
•Workflows are fault-tolerant: history is replayed up to the point of
failure and then resumed
•My use cases
Uber’s Cadence®
Railway Signal “man” (Signalwoman!)
(Source: Wikimedia Public Domain)

Drone delivery application
Kafka Microservices
Integration of fast/slow systems

•Watch out for
•Uses Apache Cassandra and OpenSearch backends
•Code must be deterministic (replayed on failure)
•Use special functions for non-deterministic operations
•What’s new?
•Potential use cases
•Scalable push notifications (Uber)
•ML workflows
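
Why must workflow code be deterministic? Because recovery re-runs the code against the recorded history, and it has to take the same path. The Python sketch below illustrates the idea only; it is not the Cadence client API, and `WorkflowContext`, `side_effect` and the drone workflow are hypothetical names.

```python
import random

class WorkflowContext:
    """Records results of non-deterministic calls so a replay can reuse them."""
    def __init__(self, history=None):
        self.history = list(history or [])
        self._cursor = 0

    def side_effect(self, fn):
        if self._cursor < len(self.history):
            result = self.history[self._cursor]  # replay: reuse recorded result
        else:
            result = fn()                        # first run: execute and record
            self.history.append(result)
        self._cursor += 1
        return result

def delivery_workflow(ctx: WorkflowContext) -> str:
    # Random choices are wrapped, so a replay sees the same values.
    drone_id = ctx.side_effect(lambda: random.randint(1, 1000))
    eta_minutes = ctx.side_effect(lambda: random.randint(5, 30))
    return f"drone-{drone_id} arriving in {eta_minutes} min"

first_run = WorkflowContext()
original = delivery_workflow(first_run)

# After a failure, a new worker replays the same code against the saved history.
replay = WorkflowContext(history=first_run.history)
assert delivery_workflow(replay) == original
```

Calling `random.randint` directly inside the workflow would make the replay diverge from the history, which is the failure mode the special functions exist to prevent.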

25. Debezium

•What?
•Change Data Capture (CDC)
•Superpowers
•Captures slow database state changes
•Turns them into fast Kafka events
•Uses Kafka: Kafka Connect, and/or
DB-specific “Connectors”
•Can be used to replicate databases
(same type), or send events to different
sink systems
•My use cases
•Debezium Cassandra Connector
(doesn’t use Kafka Connect, writes to
Kafka directly)
•Debezium PostgreSQL Connector
(Kafka source connector)
Debezium
Animal speed transformation
(Source: Shutterstock)

•Watch out for
•The DB specific connectors need to be
configured/run in the DB
•Debezium change data format is complex
•Actual content depends on the source DB
•Schemas may be inline or just an ID
•May include schema changes
•Tricky to find Kafka Connect sink connectors
that work correctly
•Duplicates and ordering issues, latency and
scalability challenges
•Schema IDs require a Kafka Schema Registry
•What’s new?
•GA on Instaclustr’s Managed Cassandra
(Dec 2023)
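
To give a feel for the change data format and the duplicates problem, here is a minimal Python sketch of a sink applying change events to a replica. It assumes an envelope with `op`, `before` and `after` fields, optionally wrapped in a `payload` next to an inline schema, as relational connectors such as the PostgreSQL one typically emit; the Cassandra connector's events differ. The `apply_change` helper, the `id` key and the sample rows are hypothetical.

```python
def apply_change(replica: dict, event: dict, key_field: str = "id") -> None:
    """Apply one change event to an in-memory replica keyed by key_field."""
    payload = event.get("payload", event)  # schema may be inline or absent
    op = payload["op"]
    if op in ("c", "r", "u"):              # create, snapshot read, update
        row = payload["after"]
        replica[row[key_field]] = row      # upsert, so duplicates are harmless
    elif op == "d":                        # delete
        row = payload["before"]
        replica.pop(row[key_field], None)
    else:
        raise ValueError(f"unexpected op: {op}")

replica = {}
events = [
    {"payload": {"op": "c", "before": None, "after": {"id": 1, "status": "new"}}},
    {"payload": {"op": "c", "before": None, "after": {"id": 1, "status": "new"}}},  # duplicate
    {"payload": {"op": "u", "before": {"id": 1, "status": "new"},
                 "after": {"id": 1, "status": "shipped"}}},
    {"payload": {"op": "d", "before": {"id": 1, "status": "shipped"}, "after": None}},
]
for event in events[:3]:
    apply_change(replica, event)
print(replica)
apply_change(replica, events[3])
print(replica)
```

Writing the sink as an idempotent upsert handles duplicate deliveries; out-of-order events need extra care (for example comparing source positions), which this sketch does not attempt.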

26. Karapace

•What?
•Open source Kafka Schema Registry
•Superpowers
•Adds schemas to schemaless Kafka
•Supports multiple schema formats
•Avro, Protobuf, and JSON Schemas
•Kafka cluster is not directly involved
•Karapace enforces schema checks for
clients only
•Use cases
•Debezium
Karapace
Karapace in the driver's seat!
(Source: Shutterstock)

•Watch out for
•Auto vs. manual schema registration—
manual is safer in production
•Schema compatibility, compatibility
modes, and evolution: complex!
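
A taste of why compatibility is complex: under backward compatibility, consumers using the new schema must still be able to read data written with the old one. The Python sketch below is a heavily simplified check over Avro-style record schemas, not Karapace's implementation; it treats any type change as incompatible, whereas real rules allow some type promotions.

```python
def backward_compatibility_problems(old_schema: dict, new_schema: dict) -> list:
    """Reasons a reader using new_schema could not decode old_schema data."""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default.
            if "default" not in field:
                problems.append(f"new field '{name}' has no default")
        elif field["type"] != old_fields[name]["type"]:
            problems.append(f"field '{name}' changed type")
    return problems  # fields removed from the new schema are simply ignored

v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},
]}
v2_ok = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},
    {"name": "note", "type": "string", "default": ""},
]}
v2_bad = {"type": "record", "name": "Order", "fields": [
    {"name": "id", "type": "long"},
    {"name": "note", "type": "string"},
]}
print(backward_compatibility_problems(v1, v2_ok))
print(backward_compatibility_problems(v1, v2_bad))
```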

27. FerretDB

•What?
•Open source MongoDB proxy for PostgreSQL
•Superpowers
•Compatible with MongoDB drivers on the
front-end
•Pluggable backends including PostgreSQL
(using JSONB/GIN indexes)
•Query Pushdown for efficiency/performance
FerretDB
Fish/shark
(Source: Adobe Stock)

28. RisingWave

•What?
•Stream processing database—also as a service
•Superpowers
•Stateful stream processing
•Using Cloud Native Storage
•Potential replacement for Kafka Streams
•PostgreSQL compatible
•Works with Apache Superset
•My use cases
RisingWave
Wave processing
(Source: Adobe Stock)

Santa’s elves and toy + box packing
Streaming joins to match toys and boxes
(Source: Adobe Stock)
Service Map using OpenTelemetry + SigNoz
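
The toy and box matching can be sketched as a windowed streaming join. This is plain Python, not RisingWave SQL or the Kafka Streams DSL, and it assumes one-to-one matching (each toy gets exactly one box of the same type), which is a simplification of real join semantics; the event tuples and window size are made up.

```python
from collections import defaultdict, deque

def windowed_join(events, window):
    """Pair each toy with a box of the same key arriving within `window` time units.

    events: iterable of (timestamp, stream, key) with stream "toy" or "box".
    Returns a list of (key, toy_timestamp, box_timestamp).
    """
    pending = {"toy": defaultdict(deque), "box": defaultdict(deque)}
    matches = []
    for ts, stream, key in sorted(events):
        other = "box" if stream == "toy" else "toy"
        waiting = pending[other][key]
        while waiting and ts - waiting[0] > window:
            waiting.popleft()                 # too old to join: drop from state
        if waiting:
            other_ts = waiting.popleft()      # consume the oldest waiting partner
            toy_ts, box_ts = (ts, other_ts) if stream == "toy" else (other_ts, ts)
            matches.append((key, toy_ts, box_ts))
        else:
            pending[stream][key].append(ts)   # no partner yet: wait in state
    return matches

events = [
    (0, "toy", "small"), (3, "box", "small"),   # within the window: matched
    (4, "toy", "large"), (20, "box", "large"),  # 16 apart: too late to match
]
print(windowed_join(events, window=10))
```

The `pending` dictionaries play the role of the state stores: unmatched events wait there until a partner arrives or the window expires.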

•Watch out for
•SQL != Kafka Streams DSL
•Kafka keys not propagated
•Windowing has different semantics

29. TensorFlow

•What?
•Neural network ML library
•Superpowers
•Supports incremental ML
•From streaming Kafka data
•My use cases
TensorFlow
What does the future hold?
(Source: Adobe Stock)

ML over streaming Kafka data—with concept drift
Kafka Streams

•Watch out for
•ML over streaming spatiotemporal data with
concept drifts is tricky
•Time/space bias
•Wild model accuracy oscillation
•Concept shift can result in very low-accuracy
models initially
•Train/use Multiple Models
[Chart: Concept drift, incremental training (time vs. accuracy). Accuracy axis 0 to 1, time axis 0 to 120. Series: same model, reset model, guessing]

Integration example 1: Our customer-facing monitoring
Before:
Spark and API
requests
→ High load on Cassandra

Integration example 1: Our customer-facing monitoring
After:
Kafka + Kafka
Streams + Redis
Reduced
Cassandra Load
Recent metrics
served from
Redis, or
Cassandra on
cache miss
PostgreSQL
1: get meta-data
2: get data from Redis
3: or from Cassandra
20k Nodes
Thanks to my colleague
Kuangda He
for this information
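
The "Redis, or Cassandra on cache miss" read path can be sketched as follows. Plain dictionaries stand in for Redis and Cassandra, the class and key names are hypothetical, and populating the cache on a miss is an assumption of this sketch (in the architecture above, recent metrics reach Redis via Kafka Streams).

```python
class MetricsReader:
    """Serve metrics from a cache first, falling back to the backing store."""
    def __init__(self, cache: dict, store: dict):
        self.cache = cache        # stands in for Redis
        self.store = store        # stands in for Cassandra
        self.store_reads = 0      # how often the backend was actually hit

    def get(self, key: str):
        if key in self.cache:
            return self.cache[key]         # cache hit: no backend load
        self.store_reads += 1
        value = self.store.get(key)        # cache miss: read the backend
        if value is not None:
            self.cache[key] = value        # assumption: read-through population
        return value

cassandra = {"node-1:cpu": 0.63, "node-2:cpu": 0.41}
reader = MetricsReader(cache={}, store=cassandra)
for _ in range(3):
    reader.get("node-1:cpu")
print(reader.store_reads)
```

Three reads cost one backend query, which is the "reduced Cassandra load" effect; whether latency also improves depends on what else dominates the request.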

Integration example 2: Drone delivery demo
Customers
Order
Shops
Busy warnings
Uses Cassandra+OpenSearch
ML over streaming data
Demo/POC
Kafka Streams

Integration example 2: Drone delivery prod?
Kafka Streams
Customers
PostgreSQL
Drone Operations
Order tracking
Shops
Busy warnings
Uses Cassandra+OpenSearch
ML over streaming data
Drone/order locations cached in Redis
Read-through or write-behind
Kafka sink
connectors
Order

What next?
•Try us out!
•Free 30-day trial
•Developer-size clusters
•www.instaclustr.com
•All my blogs (100+):
•https://instaclustr.com/paul-brebner

Thank you