Read- and Write-Optimization in Modern Database Infrastructures by Dzejla Medjedovic

ScyllaDB · 28 slides · Mar 07, 2025

About This Presentation

Learn more about the data structures behind large databases. Explore B-trees, B^ε-trees, and LSM-trees: how they optimize reads and writes, their tradeoffs, and how each performs under different scenarios.


Slide Content

A ScyllaDB Community
Read- and Write-Optimization
in Modern Database
Infrastructures
Dzejla Medjedovic-Tahirovic
VP of Data

Dzejla Medjedovic-Tahirovic (she/her)
■Completed PhD in Computer
Science (Algorithms for Massive
Data) in 2014 at Stony Brook Univ.
■Co-wrote a book on massive data
algorithms and data structures w
Emin Tahirovic
■Currently serves as a VP of Data at
Social Explorer

Manning discount code (50% off): SCALE2025
(valid until June 30 for all books)

This talk is about data structures underlying these database technologies.

Why it's important to know this stuff
■Most engineers do not build their own database engines
■But many of us face scalability challenges in our everyday tasks — you don't have to work for one of the Big Five companies to have these problems
■Understanding how these technologies work can help you choose the right one for your use case

Presentation Agenda
■Read-Optimization: B-trees
■Can you have it all? B^ε-tree
■Write-Optimization: LSM-tree
Each one of these data structures deserves an hour-long talk. I am going to give an overview of the main ideas behind these data structures, leaving out many important details.

The external-memory (I/O) model
■Data sits on disk
■Data is transferred between disk and internal memory in large consecutive chunks (blocks/pages)
■Memory transfers (or I/Os) are the main performance bottleneck
■We use the number of I/Os to measure the performance of data structures

B-Trees
■The most ubiquitous data structure used in database design; it has been around since the 1970s
■B-trees are used in a variety of database management systems and their storage engines as the default index (MySQL, Postgres, MongoDB, and Dynamo are some examples). Table data in MySQL InnoDB is a B-tree
■It's a search tree where nodes are the size of a block (which can be from 64KB to 1MB). With 64-bit keys, a block can fit thousands of keys

B-Trees are shallow and wide
■A node has O(B) children, i.e., thousands of children
■The tree depth is O(log_B N)
■B-trees are rarely deeper than 6 or 7 levels
■All operations need to go to the bottom of the tree in the worst case
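To make the depth bound concrete, here is a small Python sketch (with illustrative numbers, not taken from the slides) that counts how many levels a B-ary fan-out needs to cover N keys:

```python
def btree_depth(n, b):
    """Number of levels needed so that a fan-out of b covers n keys."""
    depth = 1
    capacity = b  # keys reachable with one level
    while capacity < n:
        capacity *= b  # each extra level multiplies reach by b
        depth += 1
    return depth

# With B = 1000 keys per node, even a billion records need only 3 levels.
print(btree_depth(1_000_000_000, 1_000))  # -> 3
```

This is exactly why the slide says B-trees rarely exceed 6 or 7 levels: the fan-out B is in the thousands, so depth grows as log_B N.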

B-tree: Summary of costs

         | Lookup          | Insert          | Delete
B-tree   | O(log_B N) I/Os | O(log_B N) I/Os | O(log_B N) I/Os

■The cost of all operations is directly related to the depth of the B-tree
■This is expensive, but for searches it's the best we can do (there are some tricks that help a little)
■Great for systems where we frequently query fairly stable datasets

As it turns out…

It's possible to design a data structure that has better inserts/deletes than a B-tree, yet does not sacrifice the (asymptotic) performance of lookups!

B^ε-tree: A more laid-back approach to inserts/deletes
■Allow some space in the node for buffers that store insert/delete messages
■Inserts and deletes will be delayed (unlike in a B-tree)

When the buffer gets full, the fullest sub-buffer gets flushed.

All messages being flushed together pay the cost of 1 I/O to descend to the next level of the tree (in contrast to the B-tree).

How do lookups work? A lookup can reconstruct insert/delete history.
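The buffering idea can be sketched as a toy two-level structure (all names, the pivot, and the tiny capacity here are hypothetical, chosen for illustration): messages accumulate in per-child sub-buffers at the root, the fullest sub-buffer is flushed down in one batch, and a lookup replays any pending messages for its key.

```python
BUFFER_CAPACITY = 4
PIVOT = "m"  # keys < "m" go to leaf 0, the rest to leaf 1

leaves = [{}, {}]
buffer = {0: [], 1: []}  # per-child sub-buffers of (op, key, value) messages

def child_of(key):
    return 0 if key < PIVOT else 1

def flush_fullest():
    """Flush the fullest sub-buffer: the whole batch descends together."""
    child = max(buffer, key=lambda c: len(buffer[c]))
    for op, key, value in buffer[child]:
        if op == "insert":
            leaves[child][key] = value
        else:  # "delete"
            leaves[child].pop(key, None)
    buffer[child] = []

def apply(op, key, value=None):
    buffer[child_of(key)].append((op, key, value))
    if sum(len(b) for b in buffer.values()) > BUFFER_CAPACITY:
        flush_fullest()

def lookup(key):
    """Reconstruct the key's history: leaf value, then buffered messages."""
    result = leaves[child_of(key)].get(key)
    for op, k, value in buffer[child_of(key)]:  # newest message wins
        if k == key:
            result = value if op == "insert" else None
    return result
```

Note that a lookup returns the correct answer even for messages that are still sitting in the buffer, which is why delaying inserts/deletes does not break correctness.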

Runtime of operations on a B^ε-tree
To get the maximum performance benefits, it's important to balance the number of keys and the buffer size: B^ε keys and B - B^ε buffer space per node.
What should ε be? Take ε = ½:
■√B keys per node still guarantee a shallow tree: O(log_B N) depth
■With O(B) buffer size, at least √B elements descend down one level in 1 I/O, so the cost per level per element is O(1/√B)

B^½-tree: Summary of costs

         | Lookup          | Insert               | Delete
B-tree   | O(log_B N) I/Os | O(log_B N) I/Os      | O(log_B N) I/Os
B^½-tree | O(log_B N) I/Os | O((log_B N)/√B) I/Os | O((log_B N)/√B) I/Os

In practical terms, inserts/deletes are 10-1000x faster. When ε = ½, a lookup is twice as slow (only in the constant, not asymptotically).

The B^ε-tree has been implemented in some databases and file systems
■TokuDB's Fractal Tree technology is a B^ε-tree with a number of optimizations
■Prototype file system BetrFS
■The B^ε-tree is great for systems that need fast lookups and fairly fast inserts (this is important when inserts have an embedded lookup)

Onto write-optimization:
Log-structured Merge Trees (LSM-trees)
■But if we're after a truly write-optimized data structure, we need to look further: LSM-trees
■LSM-trees are used in Dynamo, HBase, ScyllaDB, Apache Cassandra, and LevelDB for blindingly fast writes

Start with a simple write-optimized data structure
■First, forget about lookups
■Fill the RAM, flush to disk
■The data structure is a set of sorted key-value pair tables (sorted string tables — SSTs)
■Inserts/deletes: O(1/B) I/Os
■Lookups: O(N/B) I/Os with a scan, or O((N/M) * log_2(M/B)) I/Os with a binary search per table
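A minimal sketch of the fill-and-flush scheme, with a tiny capacity M and plain Python lists standing in for on-disk SSTs (names and sizes here are illustrative assumptions):

```python
M = 4  # in-memory capacity (tiny, for illustration)
memtable = {}
ssts = []  # list of sorted (key, value) lists, newest last

def put(key, value):
    memtable[key] = value
    if len(memtable) >= M:
        # One sequential write flushes the whole memtable as a sorted table.
        ssts.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    """Check the memtable first, then SSTs from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for sst in reversed(ssts):
        for k, v in sst:  # could be a binary search: O(log_2(M/B)) I/Os
            if k == key:
                return v
    return None
```

Writes touch only the memtable and an occasional sequential flush, which is where the O(1/B) amortized insert cost comes from; lookups, in contrast, may have to probe every table.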

Improvement 1: Bloom filters
■Bloom filters route lookups by saying which SSTs a record might be in
■However, Bloom filters grow linearly with the size of the dataset, so this alone is not scalable
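A minimal Bloom filter sketch (not a production design; the bit-array size, probe count, and hashing scheme are arbitrary choices for illustration). A "no" answer is definite, so the lookup can skip that SST entirely; a "maybe" answer routes the lookup into the table.

```python
import hashlib

SIZE, PROBES = 1024, 3
bits = [False] * SIZE

def _positions(key):
    """Derive PROBES pseudo-independent bit positions for a key."""
    for i in range(PROBES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        yield int(digest, 16) % SIZE

def add(key):
    for pos in _positions(key):
        bits[pos] = True

def might_contain(key):
    # All probed bits set -> "maybe here"; any unset bit -> definitely absent.
    return all(bits[pos] for pos in _positions(key))
```

In an LSM-tree, one such filter sits in front of each SST, so most lookups pay I/Os only for the one table that actually holds the key.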

Improvement 2: Tables of exponentially increasing sizes (This is an LSM-tree!)
■Have a table of size M, then f*M, then f*f*M… exponentially increasing sizes (imagine f = 2, 3, 4)
■Tables get merged and compacted occasionally
■The original data structure used was a B-tree, but skip lists work too
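The exponentially growing levels can be illustrated with two helper functions (M and f are illustrative values, not from the slides): level i holds up to M * f^i entries, so log_f(N/M) levels cover N keys.

```python
M = 1_000_000  # memtable / level-0 capacity
f = 10         # growth factor between levels

def level_sizes(n_levels):
    """Capacity of each level: M, f*M, f*f*M, ..."""
    return [M * f**i for i in range(n_levels)]

def levels_needed(n):
    """How many exponentially growing levels cover n entries."""
    levels, capacity = 1, M
    while capacity < n:
        capacity *= f
        levels += 1
    return levels
```

With f = 10 and M = 10^6, a billion entries fit in just four levels, which is why the lookup and insert costs below depend on log_f(N/M).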

Leveling-merge policy: performance costs
■Insert: Each element needs to be written to each level f times, and there are log_f(N/M) levels, so O((f/B) * log_f(N/M)) I/Os
■Lookup: Constant cost per level — O(log_f(N/M)) I/Os
The insert cost can be brought down further →

Tiering-merge policy: helps with write amplification
■Each level is organized as f components
■We then write into a component only once
■This helps with inserts: O((1/B) * log_f(N/M)) I/Os
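Plugging illustrative numbers into the two formulas shows the f-fold gap in write cost between leveling and tiering (all parameter values here are assumptions for the sake of the example, not from the slides):

```python
import math

B = 1_000          # entries per block
M = 1_000_000      # memtable capacity
N = 1_000_000_000  # total entries
f = 10             # growth factor

levels = math.log(N / M, f)       # about 3 levels for these numbers

leveling_cost = (f / B) * levels  # each level is rewritten ~f times
tiering_cost = (1 / B) * levels   # each component is written only once

print(f"leveling: {leveling_cost:.4f} I/Os per insert")
print(f"tiering:  {tiering_cost:.4f} I/Os per insert")
```

Tiering's write cost is a factor of f lower, which is exactly the write-amplification saving the slide describes; the price paid elsewhere is that each lookup may now probe up to f components per level.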

LSM-tree: Summary of costs

         | Lookup                        | Insert                      | Delete
B-tree   | O(log_B N) I/Os               | O(log_B N) I/Os             | O(log_B N) I/Os
B^½-tree | O(log_B N) I/Os               | O((log_B N)/√B) I/Os        | O((log_B N)/√B) I/Os
LSM-tree | O(log_f(N/M)) I/Os, f = 2,3,4 | O((1/B) * log_f(N/M)) I/Os, | O((1/B) * log_f(N/M)) I/Os,
         | (Bloom filters can bring this | f = 2,3,4                   | f = 2,3,4
         | down to O(1))                 |                             |

Inserts proceed at sequential write speed.

Main ideas from the talk
■B-trees are hard to beat for lookups

■Delaying inserts and deletes helps write-optimization

■Bloom filters are great and are used everywhere

Stay in Touch
Dzejla Medjedovic-Tahirovic
[email protected]
linkedin.com/in/dzejla-medjedovic-tahirovic-b8709189/