Taming Discard Latency Spikes by Patryk Wróbel

ScyllaDB · Oct 17, 2024

About This Presentation

A crucial lesson on read/write latency, learned while fixing a real ScyllaDB issue! Discover how TRIM requests impact NVMe SSDs with XFS online discard enabled. Uncover the problems and explore potential solutions. #ScyllaDB #NVMe #XFS #DevOps #DatabasePerformance


Slide Content

A ScyllaDB Community
Taming Discard Latency Spikes
Patryk Wróbel
Software Engineer at ScyllaDB

Patryk Wróbel (He/him)

Software Engineer at ScyllaDB
■C++ and programming enthusiast
■Contributed to various projects, including a compute runtime for GPUs and a 5G packet scheduler.
■Maintainer of the happy-cat project

Historical overview (HDD vs SSD)
■HDDs store data on magnetic disks – SSDs store data in flash memory [1].
■For HDDs, access time depends on the location of the data on the drive – for SSDs, it does not matter [2].
■Important: low-level operations are considerably different for these two drive types [3].
■For the purpose of this lecture, the most interesting one is the overwrite operation.

Overwrite operation differences (HDD vs SSD)
■SSDs cannot overwrite existing data in place (unlike HDDs) [4]:
●a Program/Erase cycle is required.
■In the case of SSDs, memory is divided into blocks, which are further divided into pages:
●write operations use page granularity
●erase operations use block granularity
●the sizes vary between manufacturers: page (2 KB–16 KB), block (256 KB–4 MB) [5]
■Writing to empty pages is faster:
●to reuse a page, the SSD first has to ensure that all pages in its block are unused and then erase the block to which they belong
●an overwrite initiates a read-erase-modify-write cycle [3].

Deleting files from your machine
■In many file systems, when a file is removed, only the metadata related to that specific file is deleted from the drive – the data blocks that belong to the file are left in place [6].
■During removal we “forget” that data was stored at a given place – however, the drive does not know that the blocks are no longer used, because it is not aware of the file system structures [3].
■Later, when the file system writes to that “free” space, the drive treats it as an overwrite operation:
●That is not a problem for HDDs – they simply overwrite the data when needed.
●In the case of SSDs this is quite problematic – an overwrite initiates a read-erase-modify-write cycle, which degrades write performance.

TRIM to the rescue
■The TRIM command was introduced to allow operating systems to notify SSDs about the removal of files [3].
■When the drive knows which pages of data are no longer used, it can erase them internally during its garbage collection process.
■This way, the likelihood of encountering an overwrite situation when writing data to the drive is significantly lower.

How to issue TRIM requests?
■On Linux, the fstrim tool can be used explicitly to discard blocks that are not used by the mounted file system [7] (a minimal sketch of the ioctl it relies on is shown below):
●fstrim <MOUNT_POINT>
■Another possibility is to use the online discard option when mounting the file system [8]:
●TRIMs are issued by the operating system in real time, without any interaction with users.
●Passing -o discard to the mount command enables this mode.
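
To make the fstrim path more concrete, here is a minimal sketch (not from the talk) of what the tool essentially does: it opens the mount point and asks the file system to discard its unused blocks via the FITRIM ioctl. The mount point path and the 32 MiB minimum extent length are arbitrary example values.

// fitrim_example.cc – a minimal illustration of what `fstrim <MOUNT_POINT>` does:
// ask the mounted file system to discard its unused blocks via the FITRIM ioctl.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <linux/fs.h>   // FITRIM, struct fstrim_range
#include <sys/ioctl.h>
#include <unistd.h>

int main() {
    const char* mount_point = "/var/lib/scylla";   // hypothetical mount point

    int fd = open(mount_point, O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    fstrim_range range{};
    range.start = 0;
    range.len = UINT64_MAX;     // consider the whole file system
    range.minlen = 32u << 20;   // skip free extents smaller than 32 MiB (example value)

    if (ioctl(fd, FITRIM, &range) < 0) { perror("ioctl(FITRIM)"); close(fd); return 1; }

    // On success, range.len holds the number of bytes that were actually discarded.
    printf("Discarded %llu bytes\n", static_cast<unsigned long long>(range.len));
    close(fd);
    return 0;
}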

Which method to choose?
■It all depends on the requirements of your system and its workload.
■According to publicly available sources [8] [9]:
●If performance needs to be maintained, and a demanding workload prevents efficient usage of the fstrim tool, then online discard should be enabled.
●In other cases, using the fstrim tool is recommended.

Problem statement
■In ScyllaDB we have seen cases in the field where deleting numerous SSTables in parallel (e.g. thousands of SSTables comprising over a hundred thousand files) caused latency spikes [10].
■There was a suspicion that the deletion storm propagated to the disk subsystem as a discard storm when the online discard mount option was enabled.

Problem statement
Chart 1. Read latency spikes observed in the field.

Measurements

Idea: rate limiting of removed files per second
■One of the initial ideas was to check whether simple rate limiting of the number of removed files per second could reduce the latency spikes.
■To assess the impact of removing files at a defined rate per second, the io_tester [11] tool from the scylladb/seastar [12] project was used.
■The I/O tester allows generating user-defined I/O patterns to simulate the I/O behavior of a complex Seastar application.
■All tests were conducted on an AWS i3.2xlarge instance running Ubuntu 22.04.

I/O tester: online discard off
Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 148 147 188 205 239 360
4 160 163 195 214 238 311
8 168 152 230 239 302 530
16 160 165 176 181 300 468
32 137 132 168 185 232 328
64 155 159 203 222 231 715
Table 1. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 4000 IOPS.

I/O tester: online discard on (low extent count)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 148 137 244 260 268 294
4 166 170 192 305 1542 8463
8 163 157 236 398 2252 7890
16 141 126 208 363 3119 11362
32 157 141 242 727 4569 13362
64 196 154 240 712 12181 31892
Table 2. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 4000 IOPS.

I/O tester: online discard on (high extent count)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 148 137 244 260 268 294
4 212 202 279 291 6016 18484
8 158 141 190 211 6122 19785
16 172 146 205 495 7349 17365
32 178 139 234 887 8083 17605
64 381 152 315 6219 56063 64571
Table 3. Random read latency impacted by unlinking 32MB files (1MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 4000 IOPS.

Impact of TRIMs on read+write workload
■After the initial measurements related to simple rate limiting of unlinks per second, the diskplorer [13] tool was used.
■It is a small set of tools around FIO [14] (Flexible I/O Tester) that can be used to discover disk read latency under different read and write workloads.
■Diskplorer was extended to allow scheduling a trim job that used the same amount of bandwidth as the write job; it issued TRIMs for data that had been written 10 seconds earlier.

Without TRIM
Chart 2. Reference results without an additional trim job, obtained via diskplorer on i3.2xlarge.

With TRIM
Chart 3. Results with a constant 32MB trim block size and trim_bw=write_bw, obtained via diskplorer on i3.2xlarge.

No latency spikes?
■The charts look very similar, so it is hard to tell how the TRIMs affect latency.
■The new workload with the additional TRIM job shows more white cells, which means that in some cases it was not able to achieve the target write bandwidth or target read IOPS.
■Comparing the numbers showed that the workload with TRIM had higher latency; however, huge latency spikes were not visible.

Why were latency spikes not visible?

Trace the ongoing requests
■Why are latency spikes visible for the workloads executed via the I/O tester?
■How does XFS issue TRIM requests?

■To answer these questions, the blktrace [15] tool was used:
●it is a block layer I/O tracing mechanism
●it can be used to get detailed information about the I/O requests that were sent

TRIM requests storm
■Usage of blktrace revealed that TRIM requests were issued by XFS in bulk every 30 seconds.
●Traces for unlinking 32 MiB files (1 MiB extents) with RPS=4:

...
28.783049012 123 Q D 2098688 + 2048 [kworker/4:1H]
<numerous TRIMs queued between>
28.883829107 123 Q D 938000752 + 2048 [kworker/4:1H]
...
59.502569437 123 Q D 6530560 + 4096 [kworker/4:1H]
<numerous TRIMs queued between>
59.604438561 123 Q D 938010992 + 2048 [kworker/4:1H]
...

Configure the period of metadata sync
■The period of issuing TRIM requests is related to the fs.xfs.xfssyncd_centisecs [16] parameter:
●It is the interval at which the file system flushes metadata out to disk – by default, the period is set to 30s. The minimal configurable value is 1s.
●TRIMs are issued when the metadata flush completes [17].

■Note: If too many changes are made in a given period of time, then an additional “on-demand” flush can be triggered by XFS [18].
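
For reference, the parameter can be lowered to its 1-second minimum by writing 100 (centiseconds) to the corresponding /proc/sys entry, e.g. with sysctl fs.xfs.xfssyncd_centisecs=100. Below is a minimal C++ sketch of the same change (requires root); the value 100 is simply the 1-second minimum used in the measurements that follow.

// xfs_sync_period.cc – set the XFS metadata sync period to 1 second (100 centiseconds).
// Equivalent to `sysctl fs.xfs.xfssyncd_centisecs=100`; must be run as root.
#include <fstream>
#include <iostream>

int main() {
    // fs.xfs.xfssyncd_centisecs is exposed under /proc/sys; the value is in centiseconds.
    std::ofstream sysctl_entry("/proc/sys/fs/xfs/xfssyncd_centisecs");
    if (!sysctl_entry) {
        std::cerr << "cannot open the sysctl entry (are you root?)\n";
        return 1;
    }
    sysctl_entry << 100 << std::flush;   // 100 cs = 1 s, the minimal configurable value
    return sysctl_entry.good() ? 0 : 1;
}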

I/O tester: sync period 30s (higher read IOPS)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 355 345 516 612 1017 1605
4 334 266 441 542 23934 50168
8 992 464 810 12187 122310 166656
16 1380 280 541 44507 158441 186335
32 2014 346 542 64450 236702 281954
64 2139 277 435 84931 165573 183962
Table 4. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 200k IOPS.

I/O tester: sync period 1s (higher read IOPS)

Total RPS   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
0 444 431 679 812 993 1502
4 293 270 450 569 2727 6193
8 437 382 787 1934 6872 11765
16 389 278 476 5047 8943 12384
32 601 278 463 12126 17455 23028
64 1441 280 10733 26602 32930 39354
Table 5. Random read latency impacted by unlinking 128MB files (32MB extents) with different frequencies.
Total unlink requests per second are visible on the left. 512B random read operations issued with
frequency of 200k IOPS.

Are a shorter period and rate limiting enough?
■Setting xfssyncd_centisecs to 1 second increased the frequency of issuing TRIM requests.
■A shorter period of issuing TRIMs seemed to have a positive effect at low RPS values and a negative effect at high RPS values.
■The impact of unlinking a file depends on its size and the number of its extents – the more fragmented the file, the worse the effect.

■Conclusion: unlinking files while ensuring that the number of removals per second does not exceed some value does not seem to guarantee stable latency (even with an increased frequency of issuing TRIMs).

What else could be done?
■When we unlink() a file, all of its extents are TRIMmed at once.
■Large and fragmented files can easily exceed the “safety” limit.

■Maybe we could TRIM only a part of the file at a time, to stay within a certain bandwidth and rate per second? (a sketch of this idea follows below)
■Such a process could be repeated until the file is completely removed.
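
As a rough illustration of this idea (not the code used for the measurements that follow), here is a sketch that deallocates a file in fixed-size chunks with fallocate(FALLOC_FL_PUNCH_HOLE) at a limited rate before finally unlinking it. The path, the 32 MB chunk size, and the ~16 chunks/s pacing (≈512 MB/s, matching the experiments below) are example values.

// incremental_discard.cc – remove a file by punching holes chunk by chunk at a limited
// rate, so that discards reach the drive gradually instead of all at once on unlink().
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <linux/falloc.h>   // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <sys/stat.h>
#include <thread>
#include <unistd.h>

int main() {
    const char* path = "/var/lib/scylla/old-sstable-data.db";   // hypothetical file
    const off_t chunk = 32l * 1024 * 1024;                      // discard 32 MB per step
    const auto pause = std::chrono::milliseconds(62);           // ~16 chunks/s ≈ 512 MB/s

    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    for (off_t offset = 0; offset < st.st_size; offset += chunk) {
        const off_t len = std::min<off_t>(chunk, st.st_size - offset);
        // Deallocate one chunk; with -o discard, XFS issues TRIMs only for these extents
        // once its metadata is flushed, instead of for the whole file at once.
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) < 0) {
            perror("fallocate(PUNCH_HOLE)");
            close(fd);
            return 1;
        }
        std::this_thread::sleep_for(pause);   // crude rate limiting
    }

    close(fd);
    unlink(path);   // finally drop the (now empty) file and its metadata
    return 0;
}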

Experiments with fallocate(PUNCH_HOLE)
File size [MB]   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
— 148 137 244 260 268 294
32 158 139 238 283 1481 2718
128 169 161 215 254 1586 4967
1024 155 148 222 251 1356 3498
7168 150 144 189 241 1518 2927
Table 6. Random read latency impacted by discarding 512MB/s (16 RPS, 32MB blocks).
File sizes are visible on the left. 512B random read operations issued with frequency of 4000 IOPS.

Experiments with fallocate(PUNCH_HOLE)
Extent size [MB]   Avg [us]   P0.5 [us]   P0.95 [us]   P0.99 [us]   P0.999 [us]   Max [us]
1 276 141 226 6697 11331 15569
8 145 134 181 245 1551 2601
32 140 135 170 241 1119 2329
512 160 158 192 258 1117 2687
Table 7. Random read latency impacted by discarding 512MB/s (16 RPS, 32MB blocks) from 1GB files.
Extent sizes visible on the left. 512B random read operations issued with frequency of 4000 IOPS.

Summary
■Unlinking numerous files when online discard is enabled can significantly impact the latency of read and write operations.
■XFS issues TRIM requests when metadata is flushed:
●by default this happens periodically (every 30s)
●the period can be configured via fs.xfs.xfssyncd_centisecs.
■The impact of removing a single file depends on its size and fragmentation.

Summary
■Simple rate limiting of unlinks does not seem to guarantee stable latency, although it may help to reduce the latency spikes.
■An alternative approach of issuing TRIM requests at a given pace could help.
■However:
●Limiting the rate of file removal, or issuing discards according to a bandwidth budget, increases the time needed to remove a given amount of data.
●In such a case, the storage space lingers for some time before it is reclaimed, but that is a small price to pay.

Thank you!
ScyllaDB.com

scylladb-users.slack.com