Taming Discard Latency Spikes by Patryk Wróbel

ScyllaDB · Oct 17, 2024

About This Presentation

A crucial lesson on read/write latency, learned while fixing a real ScyllaDB issue! Discover how TRIM requests impact NVMe SSDs with XFS online discard enabled. Uncover the problems and explore potential solutions. #ScyllaDB #NVMe #XFS #DevOps #DatabasePerformance


Slide Content

A ScyllaDB Community
Taming Discard Latency Spikes
Patryk Wróbel
Software Engineer at ScyllaDB

Patryk Wróbel (He/him)

Software Engineer at ScyllaDB
■C++ and programming enthusiast
■Contributed to various projects, including a compute runtime for GPUs and a 5G packet scheduler.
■Maintainer of the happy-cat project

Historical overview (HDD vs SSD)
■HDDs store data on magnetic disks – SSDs store data in flash memory [1].
■For HDDs, access time depends on the location of the data on the drive – for SSDs, it does not matter [2].
■Important: low-level operations are considerably different for these two drive types [3].
■For the purpose of this lecture, the most interesting one is the overwrite operation.

Overwrite operation differences (HDD vs SSD)
■SSDs cannot overwrite existing data in place (unlike HDDs) [4]:
●a Program/Erase cycle is required.
■In the case of SSDs, memory is divided into blocks, which are further divided into pages:
●write operations use page granularity
●erase operations use block granularity
●the sizes vary between manufacturers: page (2 KB–16 KB), block (256 KB–4 MB) [5]
■Writing to empty pages is faster:
●to reuse a page, the SSD first has to ensure that all pages in its block are unused and then erase the block to which they belong
●an overwrite initiates a read-erase-modify-write cycle [3].

Deleting files from your machine
■In many file systems, when a file is removed, only the metadata related to that specific file is deleted from the drive – the data blocks that belong to the file are left in place [6].
■During removal we “forget” that data was stored at a given place – however, the drive does not know that the blocks are no longer used, because it is not aware of the file system structures [3].
■Later, when the file system writes to that “free” space, the drive treats it as an overwrite operation:
●That is not a problem for HDDs – they simply overwrite the data when needed.
●In the case of SSDs this is quite problematic – an overwrite initiates a read-erase-modify-write cycle, which degrades write performance.

TRIM to the rescue
■The TRIM command was introduced to allow operating systems to notify SSDs about the removal of files [3].
■When the drive knows which pages of data are no longer used, it can erase them internally during its garbage collection process.
■This way, the likelihood of encountering an overwrite situation when writing data to the drive is significantly lower.

How to issue TRIM requests?
■On Linux, the fstrim tool can be used explicitly to discard blocks that are not used by the mounted file system [7] (a minimal sketch of the ioctl it relies on is shown below):
●fstrim <MOUNT_POINT>
■Another possibility is to use the online discard option when mounting the file system [8]:
●TRIMs are issued by the operating system in real time, without any interaction with users.
●Passing -o discard to the mount command enables this mode.
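
To make the fstrim path more concrete, here is a minimal sketch (not from the talk) of what the tool essentially does: it opens the mount point and asks the file system to discard its unused blocks via the FITRIM ioctl. The mount point path and the 32 MiB minimum extent length are arbitrary example values.

// fitrim_example.cc – a minimal illustration of what `fstrim <MOUNT_POINT>` does:
// ask the mounted file system to discard its unused blocks via the FITRIM ioctl.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <linux/fs.h>   // FITRIM, struct fstrim_range
#include <sys/ioctl.h>
#include <unistd.h>

int main() {
    const char* mount_point = "/var/lib/scylla";   // hypothetical mount point

    int fd = open(mount_point, O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    fstrim_range range{};
    range.start = 0;
    range.len = UINT64_MAX;     // consider the whole file system
    range.minlen = 32u << 20;   // skip free extents smaller than 32 MiB (example value)

    if (ioctl(fd, FITRIM, &range) < 0) { perror("ioctl(FITRIM)"); close(fd); return 1; }

    // On success, range.len holds the number of bytes that were actually discarded.
    printf("Discarded %llu bytes\n", static_cast<unsigned long long>(range.len));
    close(fd);
    return 0;
}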

Which method to choose?
■It all depends on the requirements of your system and its workload.
■According to publicly available sources [8] [9]:
●If performance needs to be maintained, and a demanding workload prevents efficient usage of the fstrim tool, then online discard should be enabled.
●In other cases, using the fstrim tool is recommended.

Problem statement
■In ScyllaDB we have seen cases in the field where deleting numerous SSTables in parallel (e.g. thousands of SSTables comprising over a hundred thousand files) caused latency spikes [10].
■There was a suspicion that the deletion storm propagated to the disk subsystem as a discard storm when the online discard mount option was enabled.

Problem statement
Chart 1. Read latency spikes observed in the field.

Measurements

Idea: rate limiting of removed files per second
■One of the initial ideas was to check whether simple rate limiting of the number of removed files per second could reduce the latency spikes.
■To assess the impact of removing files at a defined rate per second, the io_tester [11] tool from the scylladb/seastar [12] project was used.
■The I/O tester allows generating user-defined I/O patterns to simulate the I/O behavior of a complex Seastar application.
■All tests were conducted on an AWS i3.2xlarge instance running Ubuntu 22.04.

I/O tester: online discard off
Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 148 147 188 205 239 360
4 160 163 195 214 238 311
8 168 152 230 239 302 530
16 160 165 176 181 300 468
32 137 132 168 185 232 328
64 155 159 203 222 231 715
Table 1. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 4000 IOPS.

I/O tester: online discard on (low extent count)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 148 137 244 260 268 294
4 166 170 192 305 1542 8463
8 163 157 236 398 2252 7890
16 141 126 208 363 3119 11362
32 157 141 242 727 4569 13362
64 196 154 240 712 12181 31892
Table 2. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 4000 IOPS.

I/O tester: online discard on (high extent count)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 148 137 244 260 268 294
4 212 202 279 291 6016 18484
8 158 141 190 211 6122 19785
16 172 146 205 495 7349 17365
32 178 139 234 887 8083 17605
64 381 152 315 6219 56063 64571
Table 3. Random read latency impacted by unlinking 32MB files (1MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 4000 IOPS.

Impact of TRIMs on read+write workload
■After the initial measurements related to simple rate limiting of unlinks per second, the diskplorer [13] tool was used.
■It is a small set of tools around FIO [14] (Flexible I/O Tester) that can be used to discover disk read latency under different read and write workloads.
■Diskplorer was extended to allow scheduling a trim job that used the same amount of bandwidth as the write job; it issued TRIMs for data that had been written 10 seconds earlier.

Without TRIM
Chart 2. Reference results without an additional trim job, obtained via diskplorer on i3.2xlarge.

With TRIM
Chart 3. Results with a constant 32MB trim block size and trim_bw=write_bw, obtained via diskplorer on i3.2xlarge.

No latency spikes?
■The charts look very similar, so it is hard to tell how the TRIMs affect latency.
■The new workload with the additional TRIM job shows more white cells, which means that in some cases it was not able to achieve the target write bandwidth or target read IOPS.
■Comparing the numbers showed that the workload with TRIM had higher latency; however, huge latency spikes were not visible.

Why were latency spikes not visible?

Trace the ongoing requests
■Why are latency spikes visible for the workloads executed via the I/O tester?
■How does XFS issue TRIM requests?

■To answer these questions, the blktrace [15] tool was used:
●it is a block layer I/O tracing mechanism
●it can be used to get detailed information about the I/O requests that were sent

TRIM requests storm
■Usage of blktrace revealed that TRIM requests were issued by XFS in bulk every 30 seconds.
●Traces for unlinking 32 MiB files (1 MiB extents) with RPS=4:

...
28.783049012 123 Q D 2098688 + 2048 [kworker/4:1H]
<numerous TRIMs queued between>
28.883829107 123 Q D 938000752 + 2048 [kworker/4:1H]
...
59.502569437 123 Q D 6530560 + 4096 [kworker/4:1H]
<numerous TRIMs queued between>
59.604438561 123 Q D 938010992 + 2048 [kworker/4:1H]
...

Configure the period of metadata sync
■The period of issuing TRIM requests is related to the fs.xfs.xfssyncd_centisecs [16] parameter:
●It is the interval at which the file system flushes metadata out to disk – by default, the period is set to 30s. The minimal configurable value is 1s.
●TRIMs are issued when the metadata flush completes [17].

■Note: If too many changes are made in a given period of time, then an additional “on-demand” flush can be triggered by XFS [18].
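
For reference, the parameter can be lowered to its 1-second minimum by writing 100 (centiseconds) to the corresponding /proc/sys entry, e.g. with sysctl fs.xfs.xfssyncd_centisecs=100. Below is a minimal C++ sketch of the same change (requires root); the value 100 is simply the 1-second minimum used in the measurements that follow.

// xfs_sync_period.cc – set the XFS metadata sync period to 1 second (100 centiseconds).
// Equivalent to `sysctl fs.xfs.xfssyncd_centisecs=100`; must be run as root.
#include <fstream>
#include <iostream>

int main() {
    // fs.xfs.xfssyncd_centisecs is exposed under /proc/sys; the value is in centiseconds.
    std::ofstream sysctl_entry("/proc/sys/fs/xfs/xfssyncd_centisecs");
    if (!sysctl_entry) {
        std::cerr << "cannot open the sysctl entry (are you root?)\n";
        return 1;
    }
    sysctl_entry << 100 << std::flush;   // 100 cs = 1 s, the minimal configurable value
    return sysctl_entry.good() ? 0 : 1;
}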

I/O tester: sync period 30s (higher read IOPS)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 355 345 516 612 1017 1605
4 334 266 441 542 23934 50168
8 992 464 810 12187 122310 166656
16 1380 280 541 44507 158441 186335
32 2014 346 542 64450 236702 281954
64 2139 277 435 84931 165573 183962
Table 4. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 200k IOPS.

I/O tester: sync period 1s (higher read IOPS)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 444 431 679 812 993 1502
4 293 270 450 569 2727 6193
8 437 382 787 1934 6872 11765
16 389 278 476 5047 8943 12384
32 601 278 463 12126 17455 23028
64 1441 280 10733 26602 32930 39354
Table 5. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 200k IOPS.

Are a shorter period and rate limiting enough?
■Setting xfssyncd_centisecs to 1 second increased the frequency of issuing TRIM requests.
■A shorter period of issuing TRIMs seemed to have a positive effect at low RPS values and a negative effect at high RPS values.
■The impact of unlinking a file depends on its size and the number of its extents – the more fragmented the file, the worse the effect.

■Conclusion: unlinking files while ensuring that the number of removals per second does not exceed some value does not seem to guarantee stable latency (even with an increased frequency of issuing TRIMs).

What else could be done?
■When we unlink() a file, all of its extents are TRIMmed at once.
■Large and fragmented files can easily exceed the “safety” limit.

■Maybe we could TRIM only a part of the file at a time, to stay within a certain bandwidth and rate per second? (a sketch of this idea follows below)
■Such a process could be repeated until the file is completely removed.
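
As a rough illustration of this idea (not the code used for the measurements that follow), here is a sketch that deallocates a file in fixed-size chunks with fallocate(FALLOC_FL_PUNCH_HOLE) at a limited rate before finally unlinking it. The path, the 32 MB chunk size, and the ~16 chunks/s pacing (≈512 MB/s, matching the experiments below) are example values.

// incremental_discard.cc – remove a file by punching holes chunk by chunk at a limited
// rate, so that discards reach the drive gradually instead of all at once on unlink().
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <linux/falloc.h>   // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <sys/stat.h>
#include <thread>
#include <unistd.h>

int main() {
    const char* path = "/var/lib/scylla/old-sstable-data.db";   // hypothetical file
    const off_t chunk = 32l * 1024 * 1024;                      // discard 32 MB per step
    const auto pause = std::chrono::milliseconds(62);           // ~16 chunks/s ≈ 512 MB/s

    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    for (off_t offset = 0; offset < st.st_size; offset += chunk) {
        const off_t len = std::min<off_t>(chunk, st.st_size - offset);
        // Deallocate one chunk; with -o discard, XFS issues TRIMs only for these extents
        // once its metadata is flushed, instead of for the whole file at once.
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) < 0) {
            perror("fallocate(PUNCH_HOLE)");
            close(fd);
            return 1;
        }
        std::this_thread::sleep_for(pause);   // crude rate limiting
    }

    close(fd);
    unlink(path);   // finally drop the (now empty) file and its metadata
    return 0;
}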

Experiments with fallocate(PUNCH_HOLE)
File size [MB]   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
— 148 137 244 260 268 294
32 158 139 238 283 1481 2718
128 169 161 215 254 1586 4967
1024 155 148 222 251 1356 3498
7168 150 144 189 241 1518 2927
Table 6. Random read latency impacted by discarding 512MB/s (16 RPS, 32MB blocks).
File sizes are visible on the left. 512B random read operations issued with frequency of 4000 IOPS.

Experiments with fallocate(PUNCH_HOLE)
Extent size [MB]   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
1 276 141 226 6697 11331 15569
8 145 134 181 245 1551 2601
32 140 135 170 241 1119 2329
512 160 158 192 258 1117 2687
Table 7. Random read latency impacted by discarding 512MB/s (16 RPS, 32MB blocks) from 1GB files.
Extent sizes visible on the left. 512B random read operations issued with frequency of 4000 IOPS.

Summary
■Unlinking numerous files when online discard is enabled can significantly impact the latency of read and write operations.
■XFS issues TRIM requests when metadata is flushed:
●by default this happens periodically (every 30s)
●the period can be configured via fs.xfs.xfssyncd_centisecs.
■The impact of removing a single file depends on its size and fragmentation.

Summary
■Simple rate limiting of unlinks does not seem to guarantee stable latency, although it may help to reduce the latency spikes.
■An alternative approach of issuing TRIM requests at a given pace could help.
■However:
●Limiting the rate of file removal, or issuing discards according to a bandwidth budget, increases the time needed to remove a given amount of data.
●In such a case, the storage space lingers for some time before it is reclaimed, but that is a small price to pay.

Thank you!
ScyllaDB.com

scylladb-users.slack.com