Read- and Write-Optimization in Modern Database Infrastructures by Dzejla Medjedovic
ScyllaDB
Mar 07, 2025
About This Presentation
Learn more about the data structures behind large databases. Explore B-trees, B^eps-trees, and LSM-trees—how they optimize reads and writes, their tradeoffs, and how they work well under different scenarios.
Slides: 28 pages
Slide Content
A ScyllaDB Community
Read- and Write-Optimization
in Modern Database
Infrastructures
Dzejla Medjedovic-Tahirovic
VP of Data
Dzejla Medjedovic-Tahirovic (she/her)
■ Completed a PhD in Computer Science (Algorithms for Massive Data) in 2014 at Stony Brook University
■ Co-wrote a book on massive data algorithms and data structures with Emin Tahirovic
■ Currently serves as VP of Data at Social Explorer
Manning discount code (50% off): SCALE2025
(valid until June 30 for all books)
This talk is about data structures underlying these database technologies.
■ Most engineers do not build their own database engines
■ But many of us face scalability challenges in our everyday tasks — you don't have to work for one of the Big Five companies to have these problems
■ Understanding how these technologies work can help you choose the right one for your use case
Why it’s important to know this stuff
■ Read-Optimization: B-trees
■ Can you have it all? The B^ε-tree
■ Write-Optimization: LSM-trees
Each one of these data structures deserves an hour-long talk. I am going to give an overview of the main ideas behind these data structures, leaving out many important details.
Presentation Agenda
The external-memory (I/O) model
■ Data sits on disk
■ Data between disk and internal memory is transferred in large consecutive chunks (blocks/pages)
■ Memory transfers (or I/Os) are the main performance bottleneck
■ We use #I/Os to measure the performance of data structures
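The model is easy to play with numerically. A back-of-the-envelope sketch (the block size, record size, and dataset size below are made-up parameters, not from the talk):

```python
# Back-of-the-envelope I/O counts in the external-memory model.
# Assumed (made-up) parameters: 8 KB blocks, 100-byte records.
BLOCK_SIZE = 8192
RECORD_SIZE = 100
B = BLOCK_SIZE // RECORD_SIZE   # records per block
N = 100_000_000                 # records in the dataset

# Scanning all N records touches N/B blocks, not N blocks:
scan_ios = -(-N // B)           # ceiling division
# Reading N records in random order can cost one I/O each:
random_ios = N

print(f"records per block B = {B}")
print(f"sequential scan: ~{scan_ios:,} I/Os")
print(f"random access:   ~{random_ios:,} I/Os")
```

The two orders of magnitude between the scan and the random-access count is exactly why the model counts block transfers rather than CPU operations.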
B-Trees
■The most ubiquitous data structure used in database
design, has been around since 1970s
■B-trees are used in a variety of database management
systems and their storage engines as the default index
(MySQL, Postgres, MongoDB, Dynamo are some examples).
Table data in MySQL InnoDB is a B-tree
■It’s a search tree where nodes are the size of a block (can
be from 64KB to 1MB). With 64-bit keys, a block can fit
thousands of keys
B-Trees are shallow and wide
■ A node has O(B) children, i.e., thousands of children
■ The tree depth is ㏒_B N
■ B-trees are rarely going to be deeper than 6 or 7 levels or so
■ All operations need to go to the bottom of the tree in the worst case
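To see why the tree stays shallow, plug in some hypothetical numbers (the fanout B = 250 below is an assumption for illustration, not from the slides):

```python
import math

# Depth of a B-tree with fanout B over N keys is about log_B(N).
# Assumed (made-up) fanout: a node holding ~250 keys.
B = 250
sizes = (10**6, 10**9, 10**12)
depths = [math.ceil(math.log(N, B)) for N in sizes]
for N, d in zip(sizes, depths):
    print(f"N = {N:>15,}: depth ≈ {d}")
```

Even a trillion keys fit in roughly six levels, which matches the "rarely deeper than 6 or 7" claim above.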
B-tree: Summary of costs

         Lookup          Insert          Delete
B-tree   O(㏒_B N) I/Os  O(㏒_B N) I/Os  O(㏒_B N) I/Os

■ The cost of all operations is directly related to the depth of the B-tree
■ This is expensive, but for searches it's the best we can do (there are some tricks that help a little)
■ Great for systems where we frequently query pretty stable datasets
As it turns out…
It’s possible to design a data structure that has better
inserts/deletes than a B-tree, yet it does not sacrifice the
(asymptotic) performance of lookups!
B^ε-tree: A more laid-back approach to inserts/deletes
■ Allow some space in the node for buffers that store insert/delete messages
■ Inserts and deletes will be delayed (unlike in a B-tree)
When the buffer gets full, the fullest sub-buffer gets flushed.
All messages being flushed together pay the cost of 1 I/O to
descend to the next level of the tree (in contrast to the B-tree)
How do lookups work? A lookup can reconstruct insert/delete history.
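A minimal sketch of the buffering idea, assuming a single internal node with dict leaves (names like BufferedNode and BUFFER_CAP are invented; a real B^ε-tree recurses this over many levels and also splits nodes):

```python
# Minimal sketch of B^ε-style message buffering: one internal node whose
# buffer absorbs insert/delete messages and flushes the fullest child's
# sub-buffer when full. Real B^ε-trees repeat this at every level.
BUFFER_CAP = 4  # toy buffer capacity

class BufferedNode:
    def __init__(self, pivots):
        # pivots split the key space; children are plain dicts (the "leaves")
        self.pivots = pivots
        self.children = [dict() for _ in range(len(pivots) + 1)]
        self.buffer = []  # pending (op, key, value) messages

    def _child_idx(self, key):
        for i, p in enumerate(self.pivots):
            if key < p:
                return i
        return len(self.pivots)

    def _flush_fullest(self):
        # group pending messages by destination child, flush the fullest group
        groups = {}
        for msg in self.buffer:
            groups.setdefault(self._child_idx(msg[1]), []).append(msg)
        idx = max(groups, key=lambda i: len(groups[i]))
        for op, key, value in groups[idx]:   # one "I/O" moves them all down
            if op == "insert":
                self.children[idx][key] = value
            else:
                self.children[idx].pop(key, None)
        self.buffer = [m for m in self.buffer if self._child_idx(m[1]) != idx]

    def insert(self, key, value):
        self.buffer.append(("insert", key, value))
        if len(self.buffer) > BUFFER_CAP:
            self._flush_fullest()

    def delete(self, key):
        self.buffer.append(("delete", key, None))
        if len(self.buffer) > BUFFER_CAP:
            self._flush_fullest()

    def lookup(self, key):
        # replay pending messages newest-first to reconstruct history
        for op, k, v in reversed(self.buffer):
            if k == key:
                return v if op == "insert" else None
        return self.children[self._child_idx(key)].get(key)

node = BufferedNode(pivots=[50])
for k in (10, 60, 20, 70, 30, 80):
    node.insert(k, f"v{k}")
print(node.lookup(10))
node.delete(10)
print(node.lookup(10))   # None: the delete message shadows the older insert
```

Note how the lookup checks the buffer before the leaf: that is the "reconstruct insert/delete history" step, and it is why lookups keep their B-tree-like cost while writes get cheaper.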
Runtime of operations on a B^ε-tree

To get the maximum performance benefits, it's important to balance the number of keys and the buffer size: B^ε keys and B − B^ε buffer space per node.

What should ε be? With ε = ½:
■ √B keys per node still guarantee a shallow tree, O(㏒_B N) depth
■ With O(B) buffer space, at least √B elements descend down one level in 1 I/O, so the cost per level per element is 1/√B
B^½-tree: Summary of costs

           Lookup          Insert               Delete
B-tree     O(㏒_B N) I/Os  O(㏒_B N) I/Os       O(㏒_B N) I/Os
B^½-tree   O(㏒_B N) I/Os  O((㏒_B N)/√B) I/Os  O((㏒_B N)/√B) I/Os

In practical terms, inserts are 10-1000x faster. When ε = ½, lookups are twice as slow (in constant factors; the asymptotic cost is unchanged).
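The 10-1000x range follows from the √B factor in the insert cost. A quick sanity check with hypothetical block parameters (the block and element sizes below are made up):

```python
import math

# How much faster are B^(1/2)-tree inserts than B-tree inserts?
# The ratio of the two insert costs is about sqrt(B), where B is the
# number of elements per block. Block/element sizes below are assumptions.
for block_bytes, elem_bytes in ((64 * 1024, 16), (1024 * 1024, 16)):
    B = block_bytes // elem_bytes
    print(f"B = {B:>6}: inserts ~{int(math.sqrt(B))}x faster")
```

With the 64 KB to 1 MB node sizes mentioned earlier, √B lands squarely in the 10-1000x range.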
The B^ε-tree has been implemented in some databases and file systems
■ TokuDB has Fractal Tree technology, which is a B^ε-tree with a number of optimizations
■ Prototype file system BetrFS
■ The B^ε-tree is great for systems that need fast lookups and fairly fast inserts (this is important when inserts have an embedded lookup)
Onto write-optimization:
Log-structured Merge Trees (LSM-trees)
■ But if we're after a truly write-optimized data structure, we'll need to look further: LSM-trees
■ LSM-trees are used in Dynamo, HBase, ScyllaDB, Apache Cassandra, and LevelDB for blindingly fast writes
Start with a simple write-optimized data structure
■ First, forget about lookups
■ Fill the RAM, flush to disk
■ The data structure is a set of sorted key-value pair tables (sorted string tables — SSTs)
■ Inserts/deletes: O(1/B) I/Os
■ Lookups: O(N/B) or O((N/M) · ㏒_2(M/B)) I/Os
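A toy sketch of the "fill RAM, flush to disk" idea (class and constant names are invented for illustration; a real system writes each run to disk sequentially and binary-searches it):

```python
# Toy sketch of a write-optimized log: an in-memory dict (memtable)
# that is dumped as a new sorted run (SST) when it fills up.
MEMTABLE_CAP = 4  # made-up capacity; real memtables hold megabytes

class TinyLog:
    def __init__(self):
        self.memtable = {}
        self.ssts = []  # list of sorted (key, value) runs, newest first

    def put(self, key, value):
        self.memtable[key] = value       # writes are O(1/B) I/Os amortized
        if len(self.memtable) >= MEMTABLE_CAP:
            run = sorted(self.memtable.items())
            self.ssts.insert(0, run)     # one sequential write of the run
            self.memtable = {}

    def get(self, key):
        # without extra structure, a lookup may scan every run: O(N/B)
        if key in self.memtable:
            return self.memtable[key]
        for run in self.ssts:            # newest run wins on duplicates
            for k, v in run:             # binary search in a real system
                if k == key:
                    return v
        return None

log = TinyLog()
for i in range(10):
    log.put(i, i * i)
print(len(log.ssts), log.get(3))
```

Writes touch only RAM plus an occasional sequential flush, which is the whole point; the painful part is the lookup loop over every run, which the next two improvements attack.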
Improvement 1: Bloom filters
■ Bloom filters route lookups by saying which SSTs a record might be in
■ However, Bloom filters grow linearly with the size of the dataset, so on their own they are not scalable
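A minimal Bloom filter sketch (the sizes m and k below are arbitrary, and deriving positions from SHA-256 is one choice among many):

```python
import hashlib

# Minimal Bloom filter: answers "definitely not here" or "maybe here",
# so a lookup can skip SSTs that cannot contain the key.
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k   # m bits, k hash functions
        self.bits = 0

    def _positions(self, key):
        # derive k bit positions from k salted hashes of the key
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for key in ("apple", "banana"):
    bf.add(key)
print(bf.might_contain("apple"))    # True: no false negatives, ever
print(bf.might_contain("durian"))   # almost certainly False
```

The one-sided error is what makes this safe for routing: a "no" lets the lookup skip an SST entirely, while a rare false "maybe" only costs one wasted probe.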
Improvement 2: Tables of exponentially increasing sizes (This is an LSM-tree!)
■ Have a table of size M, then f·M, then f²·M… exponentially increasing sizes (imagine f = 2, 3, 4)
■ Tables get merged and compacted occasionally
■ In the original LSM-tree, each component was a B-tree, but skip lists work too
Leveling-merge policy: performance costs
■ Insert: Each element needs to be written to each level f times, and there are ㏒_f(N/M) levels, so O((f/B) · ㏒_f(N/M)) I/Os
■ Lookup: Constant cost per level — O(㏒_f(N/M)) I/Os
The insert cost can be further brought down →
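Plugging hypothetical numbers into these formulas (f, M, N, and B below are made-up values for illustration):

```python
import math

# Leveling-merge costs with assumed parameters:
# levels = log_f(N/M); insert = (f/B) * levels I/Os; lookup = levels I/Os.
f = 4                # growth factor between levels
M = 1_000_000        # entries that fit in RAM
N = 1_000_000_000    # entries in the dataset
B = 1000             # entries per block

levels = math.ceil(math.log(N / M, f))
insert_ios = f / B * levels
print(f"levels = {levels}")
print(f"amortized insert cost ≈ {insert_ios:.3f} I/Os")
print(f"lookup cost ≈ {levels} I/Os (before Bloom filters)")
```

A fraction of an I/O per insert versus one full I/O per level on lookups is the read/write trade the LSM-tree makes.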
Tiering-merge policy: helps with write amplification
■ Each level is organized as f components
■ We then write into a component only once
■ This helps with inserts: O((1/B) · ㏒_f(N/M)) I/Os
LSM-tree: Summary of costs

           Lookup                    Insert                    Delete
B-tree     O(㏒_B N) I/Os            O(㏒_B N) I/Os            O(㏒_B N) I/Os
B^½-tree   O(㏒_B N) I/Os            O((㏒_B N)/√B) I/Os       O((㏒_B N)/√B) I/Os
LSM-tree   O(㏒_f(N/M)) I/Os,        O((1/B)·㏒_f(N/M)) I/Os,  O((1/B)·㏒_f(N/M)) I/Os,
           where f = 2,3,4           where f = 2,3,4           where f = 2,3,4
           (Bloom filters can help
           this go down to O(1))
LSM-tree inserts and deletes proceed at sequential write speed.
Main ideas from the talk
■B-trees are hard to beat for lookups
■Delaying inserts and deletes helps write-optimization
■Bloom filters are great and are used everywhere
Stay in Touch
Dzejla Medjedovic-Tahirovic [email protected]
linkedin.com/in/dzejla-medjedovic-tahirovic-b8709189/