Read- and Write-Optimization in Modern Database Infrastructures by Dzejla Medjedovic

ScyllaDB · 28 slides · Mar 07, 2025

About This Presentation

Learn more about the data structures behind large databases. Explore B-trees, B^ε-trees, and LSM-trees: how they optimize reads and writes, their tradeoffs, and how each performs under different scenarios.


Slide Content

A ScyllaDB Community
Read- and Write-Optimization
in Modern Database
Infrastructures
Dzejla Medjedovic-Tahirovic
VP of Data

Dzejla Medjedovic-Tahirovic (she/her)
■Completed PhD in Computer
Science (Algorithms for Massive
Data) in 2014 at Stony Brook Univ.
■Co-wrote a book on massive data
algorithms and data structures w
Emin Tahirovic
■Currently serves as a VP of Data at
Social Explorer

Manning discount code (50% off): SCALE2025
(valid until June 30 for all books)

This talk is about data structures underlying these database technologies.

Why it's important to know this stuff
■Most engineers do not build their own database engines
■But many of us face scalability challenges in our everyday tasks — you don't have to work for one of the Big Five companies to have these problems
■Understanding how these technologies work can help you choose the right one for your use case

Presentation Agenda
■Read-Optimization: B-trees
■Can you have it all? B^ε-tree
■Write-Optimization: LSM-tree
Each one of these data structures deserves an hour-long talk. I am going to give an overview of the main ideas behind these data structures, leaving out many important details.

The external-memory (I/O) model
■Data sits on disk
■Data is transferred between disk and internal memory in large consecutive chunks (blocks/pages)
■Memory transfers (or I/Os) are the main performance bottleneck
■We use the number of I/Os to measure the performance of data structures

B-Trees
■The most ubiquitous data structure used in database design; it has been around since the 1970s
■B-trees are used in a variety of database management systems and their storage engines as the default index (MySQL, Postgres, MongoDB, and Dynamo are some examples). Table data in MySQL InnoDB is a B-tree
■It's a search tree where nodes are the size of a block (which can be from 64KB to 1MB). With 64-bit keys, a block can fit thousands of keys

B-Trees are shallow and wide
■A node has O(B) children, i.e., thousands of children
■The tree depth is O(log_B N)
■B-trees are rarely deeper than 6 or 7 levels
■All operations need to go to the bottom of the tree in the worst case
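To make the depth bound concrete, here is a small Python sketch (with illustrative numbers, not taken from the slides) that counts how many levels a B-ary fan-out needs to cover N keys:

```python
def btree_depth(n, b):
    """Number of levels needed so that a fan-out of b covers n keys."""
    depth = 1
    capacity = b  # keys reachable with one level
    while capacity < n:
        capacity *= b  # each extra level multiplies reach by b
        depth += 1
    return depth

# With B = 1000 keys per node, even a billion records need only 3 levels.
print(btree_depth(1_000_000_000, 1_000))  # -> 3
```

This is exactly why the slide says B-trees rarely exceed 6 or 7 levels: the fan-out B is in the thousands, so depth grows as log_B N.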

B-tree: Summary of costs

         | Lookup          | Insert          | Delete
B-tree   | O(log_B N) I/Os | O(log_B N) I/Os | O(log_B N) I/Os

■The cost of all operations is directly related to the depth of the B-tree
■This is expensive, but for searches it's the best we can do (there are some tricks that help a little)
■Great for systems where we frequently query fairly stable datasets

As it turns out…

It's possible to design a data structure that has better inserts/deletes than a B-tree, yet does not sacrifice the (asymptotic) performance of lookups!

B^ε-tree: A more laid-back approach to inserts/deletes
■Allow some space in the node for buffers that store insert/delete messages
■Inserts and deletes will be delayed (unlike in a B-tree)

When the buffer gets full, the fullest sub-buffer gets flushed.

All messages being flushed together pay the cost of 1 I/O to descend to the next level of the tree (in contrast to the B-tree).

How do lookups work? A lookup can reconstruct insert/delete history.
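The buffering idea can be sketched as a toy two-level structure (all names, the pivot, and the tiny capacity here are hypothetical, chosen for illustration): messages accumulate in per-child sub-buffers at the root, the fullest sub-buffer is flushed down in one batch, and a lookup replays any pending messages for its key.

```python
BUFFER_CAPACITY = 4
PIVOT = "m"  # keys < "m" go to leaf 0, the rest to leaf 1

leaves = [{}, {}]
buffer = {0: [], 1: []}  # per-child sub-buffers of (op, key, value) messages

def child_of(key):
    return 0 if key < PIVOT else 1

def flush_fullest():
    """Flush the fullest sub-buffer: the whole batch descends together."""
    child = max(buffer, key=lambda c: len(buffer[c]))
    for op, key, value in buffer[child]:
        if op == "insert":
            leaves[child][key] = value
        else:  # "delete"
            leaves[child].pop(key, None)
    buffer[child] = []

def apply(op, key, value=None):
    buffer[child_of(key)].append((op, key, value))
    if sum(len(b) for b in buffer.values()) > BUFFER_CAPACITY:
        flush_fullest()

def lookup(key):
    """Reconstruct the key's history: leaf value, then buffered messages."""
    result = leaves[child_of(key)].get(key)
    for op, k, value in buffer[child_of(key)]:  # newest message wins
        if k == key:
            result = value if op == "insert" else None
    return result
```

Note that a lookup returns the correct answer even for messages that are still sitting in the buffer, which is why delaying inserts/deletes does not break correctness.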

Runtime of operations on a B^ε-tree
To get the maximum performance benefits, it's important to balance the number of keys and the buffer size: B^ε keys and B - B^ε buffer space per node.
What should ε be? Take ε = ½:
■√B keys per node still guarantee a shallow tree: O(log_B N) depth
■With O(B) buffer size, at least √B elements descend down one level in 1 I/O, so the cost per level per element is O(1/√B)

B^½-tree: Summary of costs

         | Lookup          | Insert               | Delete
B-tree   | O(log_B N) I/Os | O(log_B N) I/Os      | O(log_B N) I/Os
B^½-tree | O(log_B N) I/Os | O((log_B N)/√B) I/Os | O((log_B N)/√B) I/Os

In practical terms, inserts/deletes are 10-1000x faster. When ε = ½, a lookup is twice as slow (only in the constant, not asymptotically).

The B^ε-tree has been implemented in some databases and file systems
■TokuDB's Fractal Tree technology is a B^ε-tree with a number of optimizations
■Prototype file system BetrFS
■The B^ε-tree is great for systems that need fast lookups and fairly fast inserts (this is important when inserts have an embedded lookup)

Onto write-optimization:
Log-structured Merge Trees (LSM-trees)
■But if we're after a truly write-optimized data structure, we need to look further: LSM-trees
■LSM-trees are used in Dynamo, HBase, ScyllaDB, Apache Cassandra, and LevelDB for blindingly fast writes

Start with a simple write-optimized data structure
■First, forget about lookups
■Fill the RAM, flush to disk
■The data structure is a set of sorted key-value pair tables (sorted string tables — SSTs)
■Inserts/deletes: O(1/B) I/Os
■Lookups: O(N/B) I/Os with a scan, or O((N/M) * log_2(M/B)) I/Os with a binary search per table
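A minimal sketch of the fill-and-flush scheme, with a tiny capacity M and plain Python lists standing in for on-disk SSTs (names and sizes here are illustrative assumptions):

```python
M = 4  # in-memory capacity (tiny, for illustration)
memtable = {}
ssts = []  # list of sorted (key, value) lists, newest last

def put(key, value):
    memtable[key] = value
    if len(memtable) >= M:
        # One sequential write flushes the whole memtable as a sorted table.
        ssts.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    """Check the memtable first, then SSTs from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for sst in reversed(ssts):
        for k, v in sst:  # could be a binary search: O(log_2(M/B)) I/Os
            if k == key:
                return v
    return None
```

Writes touch only the memtable and an occasional sequential flush, which is where the O(1/B) amortized insert cost comes from; lookups, in contrast, may have to probe every table.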

Improvement 1: Bloom filters
■Bloom filters route lookups by saying which SSTs a record might be in
■However, Bloom filters grow linearly with the size of the dataset, so this alone is not scalable
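A minimal Bloom filter sketch (not a production design; the bit-array size, probe count, and hashing scheme are arbitrary choices for illustration). A "no" answer is definite, so the lookup can skip that SST entirely; a "maybe" answer routes the lookup into the table.

```python
import hashlib

SIZE, PROBES = 1024, 3
bits = [False] * SIZE

def _positions(key):
    """Derive PROBES pseudo-independent bit positions for a key."""
    for i in range(PROBES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        yield int(digest, 16) % SIZE

def add(key):
    for pos in _positions(key):
        bits[pos] = True

def might_contain(key):
    # All probed bits set -> "maybe here"; any unset bit -> definitely absent.
    return all(bits[pos] for pos in _positions(key))
```

In an LSM-tree, one such filter sits in front of each SST, so most lookups pay I/Os only for the one table that actually holds the key.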

Improvement 2: Tables of exponentially increasing sizes (This is an LSM-tree!)
■Have a table of size M, then f*M, then f*f*M… exponentially increasing sizes (imagine f = 2, 3, 4)
■Tables get merged and compacted occasionally
■The original data structure used was a B-tree, but skip lists work too
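The exponentially growing levels can be illustrated with two helper functions (M and f are illustrative values, not from the slides): level i holds up to M * f^i entries, so log_f(N/M) levels cover N keys.

```python
M = 1_000_000  # memtable / level-0 capacity
f = 10         # growth factor between levels

def level_sizes(n_levels):
    """Capacity of each level: M, f*M, f*f*M, ..."""
    return [M * f**i for i in range(n_levels)]

def levels_needed(n):
    """How many exponentially growing levels cover n entries."""
    levels, capacity = 1, M
    while capacity < n:
        capacity *= f
        levels += 1
    return levels
```

With f = 10 and M = 10^6, a billion entries fit in just four levels, which is why the lookup and insert costs below depend on log_f(N/M).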

Leveling-merge policy: performance costs
■Insert: Each element needs to be written to each level f times, and there are log_f(N/M) levels, so O((f/B) * log_f(N/M)) I/Os
■Lookup: Constant cost per level — O(log_f(N/M)) I/Os
The insert cost can be brought down further →

Tiering-merge policy: helps with write amplification
■Each level is organized as f components
■We then write into a component only once
■This helps with inserts: O((1/B) * log_f(N/M)) I/Os
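Plugging illustrative numbers into the two formulas shows the f-fold gap in write cost between leveling and tiering (all parameter values here are assumptions for the sake of the example, not from the slides):

```python
import math

B = 1_000          # entries per block
M = 1_000_000      # memtable capacity
N = 1_000_000_000  # total entries
f = 10             # growth factor

levels = math.log(N / M, f)       # about 3 levels for these numbers

leveling_cost = (f / B) * levels  # each level is rewritten ~f times
tiering_cost = (1 / B) * levels   # each component is written only once

print(f"leveling: {leveling_cost:.4f} I/Os per insert")
print(f"tiering:  {tiering_cost:.4f} I/Os per insert")
```

Tiering's write cost is a factor of f lower, which is exactly the write-amplification saving the slide describes; the price paid elsewhere is that each lookup may now probe up to f components per level.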

LSM-tree: Summary of costs

         | Lookup                        | Insert                      | Delete
B-tree   | O(log_B N) I/Os               | O(log_B N) I/Os             | O(log_B N) I/Os
B^½-tree | O(log_B N) I/Os               | O((log_B N)/√B) I/Os        | O((log_B N)/√B) I/Os
LSM-tree | O(log_f(N/M)) I/Os, f = 2,3,4 | O((1/B) * log_f(N/M)) I/Os, | O((1/B) * log_f(N/M)) I/Os,
         | (Bloom filters can bring this | f = 2,3,4                   | f = 2,3,4
         | down to O(1))                 |                             |

Inserts proceed at sequential write speed.

Main ideas from the talk
■B-trees are hard to beat for lookups

■Delaying inserts and deletes helps write-optimization

■Bloom filters are great and are used everywhere

Stay in Touch
Dzejla Medjedovic-Tahirovic
[email protected]
linkedin.com/in/dzejla-medjedovic-tahirovic-b8709189/