Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips

ScyllaDB 556 views 32 slides Jun 19, 2024

About This Presentation

ScyllaDB Monitoring provides a lot of useful information, but it is not always easy to find the root cause when something goes wrong, or even to estimate the remaining capacity from the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem ...


Slide Content

Getting The Most Out of ScyllaDB Monitoring: Rarely Mentioned Debugging and Tuning Tips
Andrei Manakov, Staff Software Engineer at ShareChat

Andrei Manakov
More than 13 years of experience in the industry
Designed and developed multiple high-load projects
Passionate about performance problems in distributed systems
Developing a TikTok-like app with more than 20M DAU

Presentation Agenda
Moj Overview
Reconnection Storms
Compaction Strategies
Cluster Capacity

Moj Overview

Moj, a short video app

Feed Feature Store

(Re)Connection Storms

What Happened?

Internal CQL Inserts on Some Shards

(Re)Connection Storm
sum(rate(scylla_transport_cql_connections)) by (instance)

Driver Hygiene (golang)

Driver Hygiene Rules (golang)

Still a (Re)Connection Storm...
https://github.com/scylladb/gocql/issues/124

Driver Hygiene Rules (golang)

Still Problems...
https://nginx.org/en/docs/http/ngx_http_grpc_module.html#grpc_next_upstream

Compaction Strategies

Compaction Strategies

Disk Reads/Writes
sum(rate(node_disk_reads_completed_total[5m]))
sum(rate(node_disk_writes_completed_total[5m]))

Compaction CPU Usage
avg(rate(scylla_scheduler_runtime_ms{group="compaction"}[2m])) by (group)/10
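
The division by 10 in this query reads as a unit conversion. Assuming, as the metric name suggests, that scylla_scheduler_runtime_ms counts milliseconds of CPU time, a fully busy shard accumulates 1000 ms per second, so dividing the per-second rate by 10 gives a percentage. A minimal Python sketch of that arithmetic follows; the helper name and the sample values are hypothetical and not taken from the talk.

```python
def runtime_rate_to_percent(start_ms: float, end_ms: float, window_s: float) -> float:
    """Turn two samples of a runtime-in-milliseconds counter into CPU percent."""
    # Per-second increase of the counter, which is what rate() approximates.
    rate_ms_per_s = (end_ms - start_ms) / window_s
    # 1000 ms of runtime per second means 100 %, hence the division by 10.
    return rate_ms_per_s / 10


# Hypothetical samples: the counter grew by 30 000 ms over a 120 s window.
print(runtime_rate_to_percent(1_000_000, 1_030_000, 120))
```

With these made-up numbers the compaction group used a quarter of one shard's CPU time over the window.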

How Does ScyllaDB Read?

SSTable Reads
sum(rate(scylla_sstables_single_partition_reads[5m]))

ScyllaDB Cache Misses
sum(rate(scylla_cache_reads_with_misses[5m]))

SSTable Read Efficiency
sum(rate(scylla_sstables_single_partition_reads[5m])) / sum(rate(scylla_cache_reads_with_misses[5m]))
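
One way to read this ratio, offered here as an interpretation rather than the speaker's wording, is the average number of SSTables touched per read that missed the cache. The Python sketch below shows the same division with a guard for an idle cluster; the function name and the input rates are hypothetical.

```python
def sstables_per_cache_miss(sstable_reads_per_s: float, cache_misses_per_s: float) -> float:
    """Average number of SSTable reads caused by one cache miss."""
    if cache_misses_per_s == 0:
        # No misses in the window: nothing had to be read from SSTables.
        return 0.0
    return sstable_reads_per_s / cache_misses_per_s


# Hypothetical rates: 6000 SSTable reads/s caused by 2000 cache misses/s.
print(sstables_per_cache_miss(6000, 2000))
```

Under this reading, a value close to 1 means each miss is served from roughly one SSTable, while a larger value means each miss fans out across several.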

Real Life Situation

Cluster Capacity

Why Is Load Not Really Useful?
https://github.com/scylladb/scylla-monitoring/issues/2003

CPU Capacity Analysis
max(sum(rate(scylla_scheduler_runtime_ms{group!="compaction"}[2m])) by (instance, shard))/10
avg(sum(rate(scylla_scheduler_runtime_ms{group!="compaction"}[2m])) by (instance, shard))/10
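
The two queries differ only in the outer aggregation: both sum non-compaction runtime per (instance, shard), then one takes the busiest shard and the other the average shard. The Python sketch below mirrors that aggregation on a handful of made-up samples. The function name, node names, group names, and percentages are illustrative assumptions, not data from the talk.

```python
from collections import defaultdict
from statistics import mean


def non_compaction_cpu(samples):
    """Return (max, avg) CPU percent per shard, ignoring the compaction group.

    Each sample is (instance, shard, group, cpu_percent).
    """
    per_shard = defaultdict(float)
    for instance, shard, group, percent in samples:
        if group != "compaction":  # same filter as group!="compaction"
            per_shard[(instance, shard)] += percent  # sum ... by (instance, shard)
    values = list(per_shard.values())
    return max(values), mean(values)


# Hypothetical per-group CPU percentages for three shards.
samples = [
    ("node1", 0, "statement", 60.0),
    ("node1", 0, "compaction", 20.0),
    ("node1", 0, "main", 10.0),
    ("node1", 1, "statement", 25.0),
    ("node1", 1, "main", 5.0),
    ("node2", 0, "statement", 15.0),
    ("node2", 0, "main", 5.0),
]
print(non_compaction_cpu(samples))
```

A large gap between the maximum and the average, as in this made-up data, points at uneven load across shards rather than at a cluster that is full overall.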

Cache
1-sum(rate(scylla_cache_reads_with_misses[5m])) / sum(rate(scylla_cache_reads[5m]))
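
This expression is a cache hit ratio: one minus the share of reads that had at least one miss. A minimal Python sketch of the same formula, with a hypothetical function name and made-up rates:

```python
def cache_hit_ratio(reads_per_s: float, reads_with_misses_per_s: float) -> float:
    """Fraction of cache reads served without any miss."""
    if reads_per_s == 0:
        # No reads in the window: treat the ratio as undefined, reported here as 0.
        return 0.0
    return 1 - reads_with_misses_per_s / reads_per_s


# Hypothetical rates: 1000 reads/s, of which 250 reads/s had a miss.
print(cache_hit_ratio(1000, 250))
```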

Disk Bytes Written/Read
sum(rate(node_disk_written_bytes_total[5m]))
sum(rate(node_disk_read_bytes_total[5m]))

Disk I/O and Bandwidth Capacity
Capacity in io_properties.yaml:
disks:
  - mountpoint: /var/lib/scylla/data
    read_iops: 2400000
    read_bandwidth: 5921532416
    write_iops: 1200000
    write_bandwidth: 4663037952
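
The limits in io_properties.yaml can be set against observed disk rates to estimate how much headroom is left. The Python sketch below does that division using the limits from this slide. The observed rates are made up, and the sketch assumes they are measured for the same node and mount point that the limits describe; a cluster-wide sum would first need to be broken down per node.

```python
# Limits copied from the io_properties.yaml shown on the slide.
IO_PROPERTIES = {
    "read_iops": 2_400_000,
    "read_bandwidth": 5_921_532_416,   # bytes per second
    "write_iops": 1_200_000,
    "write_bandwidth": 4_663_037_952,  # bytes per second
}


def utilization(observed, limits=IO_PROPERTIES):
    """Return observed / limit for every metric present in `observed`."""
    return {name: observed[name] / limits[name] for name in observed}


# Hypothetical observed rates for one node.
observed = {"read_bandwidth": 1_480_383_104, "write_iops": 300_000}
print(utilization(observed))
```

In this made-up case both metrics sit at a quarter of the configured limit.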

Conclusion

Stay in Touch
Andrei Manakov
@AndreyManakov
https://www.linkedin.com/in/andrei-manakov-69228a81/