Cache Me If You Can: How Grafana Labs Scaled Up Their Memcached 42x & Cut Costs Too
ScyllaDB
1,476 views
63 slides
Jun 25, 2024
About This Presentation
Our cloud database stores billions of files in object storage. With petabytes of data being queried every day, we started bumping into our cloud storage providers' rate-limits, resulting in decreased reliability & performance. We had large memcached clusters in place to absorb & deamplify reads to object storage - but these could hold at most a few hours' worth of data, and they churned constantly due to the sheer volume of data passing through them. We concluded that we needed much larger caches, ideally without inflating our cloud costs or adding operational complexity.
I'll show how we managed to increase our cache size by 42x and reduce our costs by using a little-known feature of memcached called "extstore". Extstore offloads to SSD those objects that can't fit in memory. In this talk I'll cover how we use it, how to monitor it, why we chose it, and other considerations. I'll also cover how we use ephemeral storage provided by public cloud vendors in the form of physically attached SSDs with incredibly high throughput, low latency, and best of all - low cost!
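To make the read-deamplification pattern concrete, here is a minimal read-through sketch (not Grafana Labs' actual implementation) using the gomemcache Go client; fetchFromObjectStore, the host name, and the chunk key are illustrative placeholders. On the server side, extstore is enabled with a startup option of the form -o ext_path=<file>:<size>, so values evicted from RAM spill to the locally attached SSD instead of being dropped.

    package main

    import (
        "errors"
        "fmt"

        "github.com/bradfitz/gomemcache/memcache"
    )

    // fetchFromObjectStore is a placeholder for the expensive, rate-limited
    // GET against GCS/S3 that the cache is there to absorb.
    func fetchFromObjectStore(key string) ([]byte, error) {
        return []byte("chunk-bytes-for-" + key), nil
    }

    // getChunk is a read-through lookup: serve from memcached when possible,
    // otherwise fall back to object storage and backfill the cache.
    func getChunk(mc *memcache.Client, key string) ([]byte, error) {
        if item, err := mc.Get(key); err == nil {
            return item.Value, nil // cache hit: no object storage request
        } else if !errors.Is(err, memcache.ErrCacheMiss) {
            // Treat cache errors like misses; the cache is an optimization, not a dependency.
            fmt.Println("cache error, falling back to object storage:", err)
        }

        value, err := fetchFromObjectStore(key)
        if err != nil {
            return nil, err
        }

        // Backfill so subsequent reads are served from RAM or, with extstore,
        // from the locally attached SSD. A failed Set is not fatal.
        _ = mc.Set(&memcache.Item{Key: key, Value: value, Expiration: 24 * 60 * 60})
        return value, nil
    }

    func main() {
        mc := memcache.New("memcached-chunks:11211") // illustrative host name
        chunk, err := getChunk(mc, "tenant-a/chunk-0001")
        if err != nil {
            panic(err)
        }
        fmt.Printf("got %d bytes\n", len(chunk))
    }

The important property is that a cache failure degrades to an object storage read rather than an error, which is what makes a volatile, SSD-backed cache acceptable in front of durable storage.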
This talk is also a story of how products evolve, and how we as a team are buying time in the short term to keep up our reliability while we evolve our storage design in the medium-long term.
Size: 4.96 MB
Language: en
Added: Jun 25, 2024
Slides: 63 pages
Slide Content
Cache Me If You Can: How Grafana Labs Scaled Up Their Memcached 42x and Cut Costs Too. Danny Kopping, Senior Software Engineer @ Grafana Labs
Cutting to the chase: 65% reduction in object storage requests using memcached + 50 TB of SSD; 2% overall TCO reduction; no degradation in performance.
But first… Loki internals
How does Loki work? A log entry has three parts: a timestamp with nanosecond precision (2023-06-16T06:09:05.123456789Z), labels/selectors as key-value pairs ({app="nginx", env="dev"}, which are indexed), and the log line content (192.168.1.1 "GET / HTTP/1.1" 200, which is unindexed). Chunks and the index are stored in object storage.
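A rough sketch of that shape in Go (illustrative field names, not Loki's internal types):

    package loki

    import "time"

    // Entry mirrors the anatomy of a log line as described on the slide.
    type Entry struct {
        Timestamp time.Time         // nanosecond precision, e.g. 2023-06-16T06:09:05.123456789Z
        Labels    map[string]string // key-value selectors, e.g. app="nginx", env="dev"; indexed
        Line      string            // raw content, e.g. 192.168.1.1 "GET / HTTP/1.1" 200; unindexed
    }

Entries are compressed into chunks, and a small index maps label sets to those chunks; both live in object storage.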
Object storage pricing (Google Cloud Storage / Amazon S3): writes cost about $5 per million ($0.005 per 1,000); reads cost about $0.40 per million ($0.0004 per 1,000). There is no charge for bandwidth within the same region.
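As an illustrative back-of-the-envelope calculation (the request volume here is made up, not a figure from the talk): at $0.40 per million reads, avoiding one billion reads per day saves 1,000,000,000 ÷ 1,000,000 × $0.40 = $400 per day, so absorbing reads in a cache pays off quickly at this scale, while same-region bandwidth to the cache nodes stays free.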
65% reduction in object storage requests using memcached + 50 TB of SSD; ~2% overall TCO reduction; no degradation in performance.
Cache effectiveness across the rollout, in TB/day (charts)
Query performance across the rollout, P50 and P99 (charts)
Trade-offs
Latency across the rollout, P50 (charts)
Durability vs volatility
Disk-related issues: disks fill up, sectors fail, performance degrades.
Solution: Kill the disk!
Observability: Prometheus with memcached-exporter and node-exporter; Alertmanager with disk-full, disk-latency, and memcached-latency alerts; Grafana & Loki (shocker!).
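As an illustrative example rather than the exact rules from the talk, the disk-full alert can be driven by node-exporter metrics, e.g. firing when node_filesystem_avail_bytes / node_filesystem_size_bytes for the extstore volume stays below roughly 10% for a sustained period, with Alertmanager handling routing; the memcached-exporter and client-side metrics supply the cache-latency signals.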
Implications for the future of cloud databases? Cost is becoming compelling; capacious disks run at near-DRAM speeds; network throughput is higher and keeps rising as networking is offloaded from the CPU.