Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)

Paul Brebner · 74 views · 59 slides · Jun 18, 2024

About This Presentation

Closing talk for the Performance Engineering track at Community Over Code EU (Bratislava, Slovakia, June 5 2024) https://eu.communityovercode.org/sessions/2024/why-apache-kafka-clusters-are-like-galaxies-and-other-cosmic-kafka-quandaries-explored/ Instaclustr (now part of NetApp) manages 100s of Ap...


Slide Content

Why Apache Kafka Clusters Are Like Galaxies
And Other Cosmic Kafka Quandaries
Explored
Paul Brebner
Instaclustr Technology Evangelist
June 5, 2024
© 2024 NetApp, Inc. All rights reserved.

Performance Engineering Track on again at Community Over Code, 7-10 October, Denver 2024
© 2024 NetApp, Inc. All rights reserved.

Instaclustr Managed Platform
• Cloud Platform for Big Data Open Source Technologies
• Free 30 day trial
• Focus of this talk is on Apache Kafka®

Centenary of Franz Kafka's death – June 3, 2024
Head of Kafka, Prague (Paul Brebner)

Overview
1 Kafka Scalability
2 Kafka Clusters and Zipf's Law
3 Kafka Clusters and Storage
4 Top 10 Kafka Clusters and Performance
Thanks to Instaclustr colleagues for Kafka cluster data:
Kafka Clusters & Storage – Alastair Daivis & Kafka Team
Top 10 Clusters – Joseph Clay & Ramana Selvaratnam (Technical Operations Team)

A note on Kafka cluster metrics
Performance metrics get harder to collect at wider scope: Broker (easy) → Cluster → All Clusters (harder)
Size metrics are available at all levels
The focus of our metrics collection is per broker, not per cluster or all clusters
(Image: DALL·E 3)

Part 1: Kafka Scalability
(Source: Shutterstock)

What is Kafka?
Kafka is a distributed stream processing system: it allows distributed producers to send messages to distributed consumers via a Kafka cluster.

Cluster = Brokers + Partitions
Enabling Write & Read Concurrency

[Diagram: a Producer writes to a Topic with Partitions 1..n; Consumers in a Consumer Group share work within the group]

Partitions enable Consumers to share work within a consumer group
(c.f. Amish barn raising)

Multiple Groups Enable Message Broadcasting

Messages are duplicated (c.f. clones) across groups, as each consumer group receives a copy of each message.

[Diagram: a Producer writes to a Topic with Partitions 1..n; each of several Consumer Groups receives its own copy of each message]
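These group semantics are easy to sketch with a client. A minimal sketch, assuming the confluent-kafka Python client, a local broker, and illustrative topic/group names: consumers sharing a group.id split the topic's partitions between them, while a consumer with a different group.id independently receives every message.

```python
from confluent_kafka import Producer, Consumer

# Producer: writes messages to the topic; the key determines the partition.
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker
for i in range(10):
    producer.produce("events", key=f"sensor-{i % 3}", value=f"reading={i}")
producer.flush()

def make_consumer(group_id: str) -> Consumer:
    c = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,   # same id -> share work; different id -> broadcast copy
        "auto.offset.reset": "earliest",
    })
    c.subscribe(["events"])
    return c

# Two consumers in the SAME group split the topic's partitions between them.
worker_a = make_consumer("analytics")
worker_b = make_consumer("analytics")
# A consumer in a DIFFERENT group independently receives every message again.
auditor = make_consumer("audit")

for c in (worker_a, worker_b, auditor):
    msg = c.poll(5.0)   # may be None until the group rebalance completes
    if msg is not None and msg.error() is None:
        print(msg.key(), msg.value())
    c.close()
```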

Partitions – concurrency mechanism – more is better, until it's not
You need sufficient partitions to benefit from the cluster concurrency,
and not so many that the replication overhead impacts overall throughput

[Chart: Partitions (1 to 10,000, log scale) vs. Throughput (0 to 2.5 M TPS); series: ZK TPS (M), KRaft TPS (M), 2020 TPS (M); 2022 – Better, 2020 – Worse]

2022 results better due to improvements to Kafka and h/w
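Partition count is set when a topic is created (it can be increased later, but not decreased). A minimal sketch of picking a partition count, assuming the confluent-kafka Python AdminClient and illustrative names:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker

# Enough partitions for consumer concurrency, not so many that the
# replication overhead (replication_factor=3) hurts overall throughput.
futures = admin.create_topics(
    [NewTopic("events", num_partitions=12, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()   # raises if topic creation failed
    print(f"created {topic}")
```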

Kafka Scalability and Performance Summary
• Horizontal Scalability (Brokers/Nodes)
• Vertical Scalability (more/faster cores per Broker)
• Hardware (cores, CPU speed/type, RAM, disk, network, etc.)
• Partitions + Consumers
  • Optimise the number of Partitions
  • Consumer speed optimization (slow consumers are bad – high latency and too many partitions)
• Kafka cluster and client configurations (many and complex)
• Goals are typically
  • High Throughput
  • Low Latency (low 10s of ms)
(Slow consumers are a problem: Getty)

Part 2: Kafka Clusters and Zipf's Law – size distribution
Visual size comparison of the six largest Local Group galaxies, with details (Wikipedia)

Zipf's Law – a scaling/power law
• Distribution function: the most frequent observation is twice as common as the next, and so on (i.e. frequency ∝ 1/rank)
• Long-tailed distribution
  • 80/20 rule (20% of people own 80% of the $)
  • C.f. Pareto (discrete vs. continuous)
• Log-log rank vs. frequency/size gives an approximately straight line
• Common examples
  • Frequency of words
  • Wealth distribution
  • Animal species size
  • Earthquakes
  • City sizes
  • Computer systems (e.g. workload modelling, subsystem capacity)
  • Galaxy sizes
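A minimal sketch of the rank-size relationship, assuming numpy; s_max and n are illustrative values that match the cluster data shown later in the talk:

```python
import numpy as np

# Zipf/power law: the item at rank r has size s_max / r**a (a = 1 is classic Zipf).
s_max, a, n = 96, 1.0, 797          # illustrative: largest cluster and cluster count
ranks = np.arange(1, n + 1)
sizes = s_max / ranks ** a

# On log-log axes the law is a straight line: log(size) = log(s_max) - a*log(rank),
# so a linear fit recovers the exponent.
slope, intercept = np.polyfit(np.log(ranks), np.log(sizes), 1)
print(f"fitted exponent ≈ {-slope:.2f}")   # ≈ 1.00 for this synthetic data
```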

Apache Kafka + Galaxies? Size and Scale Predictions
• Question: How large are the largest structures in the universe?
• Answer: Bigger!
• Zipf's law predicted that
  • bigger galaxies would be detected in older parts of the universe
  • beyond the reach of the Hubble at the time
  • confirmed with the James Webb telescope observations
• But what's this got to do with Kafka?
Image from NASA's James Webb Space Telescope showing older and bigger galaxy clusters

Raw Kafka Cluster Size Data – Summary Statistics

Nodes/cluster: min 3, median 3, mode 3, average 4.52, stdev 7.02, max 96; count 797 clusters, sum 3,603 nodes

[Chart: summary statistics (log nodes/cluster), 1 to 10,000 log scale]

Raw Kafka Cluster Size Data
Histogram (size vs count) – skewed distribution

[Histogram: nodes/cluster (3 to 96) vs. cluster count (0 to 800); heavily skewed toward small clusters]

Kafka Clusters and Zipf's Law
What is the distribution? Definitely a long-tailed power law

[Chart: Cluster Size Distribution (largest to smallest); cluster rank (0 to 900) vs. size (0 to 120)]

Kafka Clusters – log size vs log rank
Approximately Zipfian

[Chart: log size (1 to 1,000) vs. log rank (1 to 100); approximately a straight line]

So What? Kafka and Zipf's Law (1)
Can expect larger clusters (animals, galaxies etc.)
African Elephant, 7 t
Maraapunisaurus, extinct dinosaur, 150 t

Predicted larger clusters
Extrapolation of size from Zipf's law + largest observed cluster

[Chart: Kafka Clusters – log size (0.1 to 1,000) vs. log rank (1 to 1,000); the fitted line is extended beyond the observed data to predict larger clusters at higher ranks]

So What? Kafka and Zipf's Law (2)
Estimate total nodes for more clusters
The animal transportation problem: how many animals can fit in a boat? (Public Domain)

If you know the size of the biggest thing you can predict the total size
Total weight of animals on the Ark (assuming the Elephant is the largest) tends to ~90 tonnes
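A minimal sketch of that claim, assuming numpy and scipy: with a power-law exponent slightly above 1 (an illustrative choice, not the talk's fit) the total over all ranks converges, so an elephant-sized largest animal caps the Ark's total weight near the slide's figure.

```python
import numpy as np
from scipy.special import zeta

# Power law s(r) = s_max / r**a: for a > 1 the total over all ranks converges
# to s_max * zeta(a), so the biggest item bounds the total.
s_max, a = 7.0, 1.08    # assumed: elephant (7 t) is the largest; exponent illustrative

print(f"limit: {s_max * zeta(a):.0f} t")   # ≈ 92 t, near the slide's ~90 t
ranks = np.arange(1, 100_001)
print(f"first 100k animals: {np.sum(s_max / ranks**a):.0f} t")  # well below the limit: convergence is slow
```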

Doubling number of Kafka clusters
Only increases total nodes by 25%

[Chart: cumulative total nodes (0 to 5,000) vs. number of clusters (0 to 1,800); 100% more clusters adds only 25% more nodes]
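The same arithmetic as a sketch, assuming numpy: with the classic exponent a = 1 the cumulative total grows roughly logarithmically (about 10% for a doubling), and a shallower assumed exponent of about 0.7 reproduces the ~25% figure above.

```python
import numpy as np

s_max, n = 96, 797                    # largest cluster and count from the talk's data
for a in (1.0, 0.7):                  # exponent of the assumed power law
    sizes = s_max / np.arange(1, 2 * n + 1) ** a
    growth = sizes.sum() / sizes[:n].sum() - 1
    print(f"a={a}: doubling clusters adds {growth:.0%} more nodes")  # ~10% and ~26%
```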

Part 3: Kafka cluster storage
Storage data for all Kafka clusters is available from a recent project
(Image: DALL·E 3)

Raw Data – total disk per cluster
Correlation coefficient between size and disk = 0.9
5.6 PB Total Disk across all Kafka clusters

[Scatter plot: nodes/cluster (0 to 120) vs. Disk (0 to 500 TB) per cluster]

What's going on?
• Disk space used is a function of average write rate × average message size × retention period × RF (Little's Law) – see the sketch below
• Our metrics are total disk available, not used
• Some clusters are DEV not PROD – not real workloads, and RF may be < 3
• Approximation – number of nodes as a proxy for cluster size; actual instance sizes impact capacity
• Kafka log retention policy and time impact how many messages are retained
• Kafka clusters are sized for peak load, not average load
• Some clusters may be older than others (disk can be increased)
• Write vs. Read workload imbalance
  • Some clusters may have a higher write workload rate (requiring more disk) vs.
  • higher read workload rates (requiring less disk)

[Scatter plot repeated: nodes/cluster (0 to 120) vs. Disk (0 to 500 TB) per cluster]
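A minimal sketch of that Little's-Law-style sizing relation from the first bullet; all inputs are illustrative assumptions, not measured values:

```python
# Disk used ≈ write rate × average message size × retention period × replication factor.
write_rate_msgs_s = 100_000          # average messages/s written (assumed)
avg_msg_bytes = 1_200                # average message size (assumed)
retention_s = 7 * 24 * 3600          # 7-day log retention (assumed)
replication_factor = 3               # typical production RF

disk_bytes = write_rate_msgs_s * avg_msg_bytes * retention_s * replication_factor
print(f"disk needed ≈ {disk_bytes / 1e12:.0f} TB")   # ≈ 218 TB
```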

Part 4: Performance Metrics for Top Ten Kafka Clusters
Top 10 tallest buildings (Wikipedia)

Most Dangerous Australian Critters?
Ranking can be tricky: most "dangerous" = most teeth? Most venomous?
But in reality more people are killed by horses, cows, dogs, and bees than kangaroos, sharks, snakes, crocodiles, emus, jellyfish, etc.!
(Paul Brebner) (Wikimedia)

What metrics are available for Kafka clusters?
• For all clusters
  • Size (number of nodes) and type
  • Disk (from the extra project)
• Performance metrics are collected for all clusters
  • But not easily available, as the focus is per-cluster operations
• Requested performance metrics for the Top Ten clusters. What did I get?
  • Static (per cluster): Nodes, Topics, Partitions
  • For 24 hours (per broker):
    • Resource Utilisation: CPU (avg, max)
    • Throughput: Bytes in (avg, max), Bytes out (avg, max), Messages in (avg, max) [have to scale by number of nodes to get cluster metrics]
    • Performance: Producer and consumer latency (avg, p99)

Warning! Speculative Results!
• Broker metrics need scaling to cluster metrics
• Variation in broker metric values
• 24 hour sampling loses accuracy
• 24 hour sample size is limited/biased
• Real workloads, not benchmarking
• Ten biggest clusters by node count only

Summary Statistics: Nodes, Topics, Partitions (Min, Avg, Max)

Nodes:      min 27,    avg 56.4,     max 96
Topics:     min 7,     avg 429.7,    max 1,755
Partitions: min 2,598, avg 92,145.3, max 508,800

[Chart: Nodes, Topics, Partitions (log scale, 1 to 1,000,000)]

Summary Statistics: CPU, GB/s in/out, Messages/s (in)

CPU (%):             min 2,     avg 24.5, max 67.5
Bytes in/out (GB/s): min 0.396, avg 3.14, max 14.4
Messages in (M/s):   min 0.12,  avg 1.42, max 8.4

[Chart: CPU, GB/s (in+out), Messages/s (in, M/s) (log scale, 0.1 to 100)]

Summary Statistics: Latency (ms)
Producers faster than Consumers
Note that some clusters use EBS, others use SSDs (faster!)

Producer latency (ms): min 0.075, avg 3.29,   max 90
Consumer latency (ms): min 6.5,   avg 106.66, max 700

[Chart: Latency (log scale, 0.01 to 1,000 ms)]

Consumer latency distribution
50% of clusters have sub-50ms average latency

[Chart: average consumer latency (ms) per cluster 1-10, sorted increasing; 0 to 350 ms]

Summary Statistics: Message size (Bytes)
150 to 3,000 Bytes

Message size (Bytes): min 150, avg 1,164, max 3,000

[Chart: Message size (avg, Bytes), 0 to 3,500]

Using average message size, compute messages out → total messages in+out
0.4 to 25 Million/s

[Chart: Msgs in+out (M/s): min, avg, max; 0 to 30]
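The derivation on this slide as a sketch: bytes-out divided by average message size approximates messages-out, which added to messages-in gives total message throughput. The measured inputs here are illustrative assumptions; the average message size is taken from the summary statistics above.

```python
bytes_out_gb_s = 10.0        # measured bytes out (GB/s), assumed
avg_msg_bytes = 1_164        # average message size from the summary stats
msgs_in_m_s = 1.4            # measured messages in (M/s), assumed

msgs_out_m_s = bytes_out_gb_s * 1e9 / avg_msg_bytes / 1e6
print(f"in+out ≈ {msgs_in_m_s + msgs_out_m_s:.1f} M msgs/s")   # ≈ 10.0 M/s here
```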

Fan out (ratio of consumer to producer messages)
1.4 to 28 – i.e. potentially up to 28 consumer groups

[Chart: Fan out: min, avg, max; 0 to 30]

Assuming a Zipf distribution, knowing metrics for the top 10 clusters we can estimate total values for ALL CLUSTERS:
27K topics (probably an underestimate), 5.8 M partitions; 321-564 Million messages/s

Grand Totals for All Kafka Clusters:
Topics: 27.45 k | Partitions: 5.89 M | Msgs in+out (avg): 321.3 M/s | Msgs in+out (max): 565.0 M/s

[Chart: Grand Totals for All Kafka Clusters (log scale, 1 to 1,000)]
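One way such an extrapolation can work, as a hedged sketch assuming numpy; the rank-1 throughput and the exponent are illustrative assumptions rather than the talk's actual fit, chosen so the estimate lands inside the range quoted above:

```python
import numpy as np

# If per-cluster throughput is roughly power-law distributed, the all-cluster
# total can be estimated from the biggest cluster's value alone.
rank1_msgs_m_s = 25.0                 # biggest cluster's msgs in+out (M/s), assumed
n, a = 797, 0.8                       # cluster count; exponent is an assumption

ranks = np.arange(1, n + 1)
total = rank1_msgs_m_s * np.sum(1 / ranks ** a)
print(f"all-clusters total ≈ {total:.0f} M msgs/s")   # ≈ 365 with these assumptions
```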

Static data – top 10 clusters (largest on right)
Nodes – 27 to 96 (1% of clusters, 564 nodes total, 16% of total nodes overall)

Nodes/cluster (ranked): 27, 36, 36, 48, 51, 60, 60, 72, 78, 96

Topics/Partitions/Nodes
Ranges, odd ones out
Biggest (10) cluster has the most partitions; cluster 6 has the "hottest" topics (max partitions/topic)

[Charts for clusters 1-10: Topics/Cluster (7 to 1,755), Partitions/Cluster (2,598 to 508,800), Partitions/Topic, and Partitions/Node; annotations mark the clusters with the most topics, most partitions, and hottest topics]

CPU
Cluster 4 has the highest max CPU = highest topics/partitions per cluster/node
Cluster 6 has the highest average CPU = highest partitions/topic ("hot" topics)
These are both "hot" clusters

[Chart: CPU (avg, max %) for clusters 1-10, 0% to 80%; cluster 4 "hottest", cluster 6 "hot"]

Any obvious correlations to cluster size?
Topics? Theory and our Technical Operations people say probably not, as topics are not correlated to throughput (or size)
Correlation = 0.4, and some known smaller clusters have way more topics (e.g. 10,000!)

[Scatter plot: nodes/cluster (0 to 120) vs. total topics in cluster (0 to 2,000)]

Size/Partition correlation?
Partitions are related to throughput and size in theory
Correlation = 0.63, and the largest cluster has the most (and above average) partitions/node

[Scatter plot: nodes/cluster (0 to 120) vs. total partitions (0 to 600,000)]

Size/Throughput?
Average – poor correlation

[Scatter plot: nodes/cluster (0 to 120) vs. Msgs in+out (avg/s, 0 to 25,000,000)]

Size/Throughput?
Max – poor correlation
But avg & peak throughput correlate with the "hot" cluster
Real workloads in a 24 hour sample period don't necessarily correlate with cluster capacities

[Scatter plot: nodes/cluster (0 to 120) vs. Msgs in+out (max/s, 0 to 30,000,000)]

Top 10 clusters have heterogeneous h/w
A mix of EC2 instance types/sizes (4/5) and storage – EBS (6) / SSD (4)
• AWS ARM Graviton2 R6g – high price-performance for memory-intensive workloads
  • R6g.4xlarge, 16 cores (EBS) (4 clusters)
  • R6g.2xlarge, 8 cores (EBS) (2 clusters)
• AWS ARM Graviton2 Im4gn – Nitro SSD for I/O-intensive workloads
  • Im4gn.4xlarge, 16 cores, SSD (2 clusters, including the "hot" cluster)
• AWS ARM Graviton2 M6g – for balanced workloads
  • M6gd.4xlarge, 16 cores, SSD (1 cluster)
• AWS x86 I3en – for data-intensive workloads
  • I3en.3xlarge, 12 cores, SSD (1 cluster)

Cores per Cluster
Good correlation (0.8) – definite increase in total cores for bigger clusters

[Scatter plot: nodes per cluster (0 to 120) vs. cores per cluster (0 to 1,800)]

Drill down: Biggest cluster vs. "hottest" cluster
Insights from our Tech Ops team – thanks!
• Biggest cluster (#10)
  • Over-provisioned: 96 nodes, 1,536 cores
  • EBS (slow)
  • Peak in messages/s = 1 M/s
  • Consumer latency 200-400 ms
  • Runs "cool" (18-45% CPU)
  • Most partitions (0.5088 million)
• Hottest cluster (#6)
  • 60 nodes, 960 cores
  • Runs "hot" (45-55% CPU)
  • But lowest consumer latency
  • Faster SSDs
  • Few topics, most partitions/topic ("hot" topics)

Also illustrates that metrics are per broker – and have wide variability
Average for the cluster = 290 ms, but actually a large variation across brokers

Capacity Planning
For a target throughput, how many cores and partitions are needed (in practice you need both)?
Can only predict a range from this data (avg = conservative; max = optimistic)

Msgs/s per core:      avg 6,288, max 25,584
Msgs/s per partition: avg 431,   max 2,156

[Chart: Msgs/s per core and per partition (avg, max), 0 to 30,000]
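A minimal capacity-planning sketch using the per-core and per-partition rates above: the average rate gives a conservative estimate, the max rate an optimistic one. The target throughput is illustrative.

```python
import math

msgs_per_core = {"avg": 6_288, "max": 25_584}       # rates from the table above
msgs_per_partition = {"avg": 431, "max": 2_156}

target = 20_000_000   # target messages/s (in+out), illustrative

for kind in ("avg", "max"):
    cores = math.ceil(target / msgs_per_core[kind])
    partitions = math.ceil(target / msgs_per_partition[kind])
    print(f"{kind}: ~{cores} cores, ~{partitions} partitions")
    # avg: ~3,181 cores, ~46,404 partitions (conservative)
    # max: ~782 cores, ~9,277 partitions (optimistic)
```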

Cores for target throughput (×2 max current cluster)
Range: Avg (conservative), Max (optimistic)

[Chart: TPS (0 to ~60 Million/s) vs. cores needed (0 to 9,000); series: Cores (avg), Cores (max)]

Partitions for target throughput (×2 max current cluster)
Range: Avg (conservative), Max (optimistic)
Note: this is probably skewed, because the large cluster with the most partitions has low throughput, and the "hot" cluster with the highest throughput has few partitions!

[Chart: TPS (0 to ~60 Million/s) vs. partitions needed (0 to 120,000); series: Partitions (avg), Partitions (max)]

Conclusions?
Kafka cluster size distribution is Zipfian
• Lots of small clusters
• Few big clusters
• Even bigger clusters are likely
• A wide distribution of sizes is observed
• Kafka is horizontally scalable
• Fits many different customer workloads
• Some customers have many smaller clusters
• Some clusters grow in size over time
(Image: DALL·E 3)

Conclusions?
Top 10 clusters are "diverse"
• Wide range of workloads, throughputs, hot vs. cold CPU, fan-outs, latency, message size and hardware
• Some interesting "odd ones out"
  • Biggest
  • Hottest
• Performance metrics were biased & coarse-grained
  • due to broker-level collection and the 24 hour sample & average & summary
  • and from real workloads, not benchmarks
• Hard to find correlations and make accurate predictions
  • Some broad correlations and range predictions possible
(Paul Brebner)
Adolf Hoffmeister & Franz Kafka (Wikimedia)

Conclusions?
Custom Cluster Optimization and Sizing
• Is normal for our managed Kafka clusters
• Usage/workload varies widely for customers
  • Including topics, partitions, throughput, message sizes, client settings (e.g. batching), fan-out, latency SLAs, etc.
• Many bigger clusters are dedicated to very specific customer workloads
• Higher throughput clusters are not representative of lower throughput clusters
• Hardware varies and is optimized/customized to take into account specific customer workloads, cost and SLA requirements
(Image: DALL·E 3)

Conclusions?
Performance Prediction
• Performance prediction from coarse-grained metrics feels like déjà vu
• 2007-2017 I developed an automated approach to performance modelling from distributed application traces
• This could work for Kafka:
  • Instrument Apache Kafka source code with OpenTelemetry to provide Kafka-specific resource (CPU, IO, network) + time spans
  • Run Kafka benchmarks on representative hardware
  • Transform OT traces into a performance model
  • Make more accurate predictions
(Image: DALL·E 3)
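A minimal sketch of the instrumentation idea, assuming the Python opentelemetry-sdk; the span name, topic, and attribute values are illustrative, and the real Kafka-internal instrumentation would live in the broker/client source code:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Set up a tracer that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("kafka.perf.model")

# Wrap an operation in a span that records resource attributes, so the
# resulting traces can later be transformed into a performance model.
with tracer.start_as_current_span("kafka.produce") as span:
    span.set_attribute("messaging.destination.name", "events")  # topic (illustrative)
    span.set_attribute("messaging.message.body.size", 1_164)    # bytes (illustrative)
    # ... the actual produce call would go here ...
```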

What next?
• Try us out!
  • Free 30 day trial
  • Developer-size clusters
• www.instaclustr.com
• All my blogs (100+): https://instaclustr.com/paul-brebner

Thank you
© 2024 NetApp, Inc. All rights reserved.