Using ScyllaDB to Implement Lists in Medium’s Feature Store by Andréas Saudemont

ScyllaDB 124 views 31 slides Mar 05, 2025

About This Presentation

Discover how Medium is leveraging ScyllaDB to power a fast, scalable data layer for lists in Medium’s feature store.


Slide Content

A ScyllaDB Community
Using ScyllaDB to Implement
Lists in Medium’s Feature Store
Andréas Saudemont
Software Engineer

Andréas Saudemont (he/him)
■Principal software engineer at Medium
■Building scalable architectures for heavy loads

Presentation Agenda
■Medium’s feature store and list features
■ScyllaDB data model
■Implementing list operations
■Some metrics
■Recap

Medium’s Feature Store
and List Features

Medium’s Feature Store
Key component of Medium’s recommendation system
Used by machine learning models for ‘For you’ feed, Daily Digest, etc.
Database with a specialized API
Low-latency, high-throughput access patterns

Features
A property of an entity in the feature store
Defined by:
●entity type – e.g. user
●name – e.g. is_member
●data type of its values – e.g. boolean
●version (optional) – e.g. ‘2025/03/11’
A feature value is the value of a feature for a given entity ID
●e.g. true for the user.is_member feature for entity "user123"

Relational Features
Has multiple values for a given entity ID
Each value has:
●a relation ID – the ID of the related entity
●a timestamp
Sample: story.user_has_read
●Relates with the user entity type
●Values indicate whether and when a given user has read a given story

Limitations of Relational Features
Suboptimal data model
Data split across 2 tables
→ Too many DB queries
●Inefficient “ALLOW FILTERING” queries to fetch entity IDs
●Plus one query for each entity ID to fetch values
→ Hard to optimize using primary keys/indexes

List Features
Goal: better way to handle cross-entity relations in the feature store
List feature: defined by its entity type, name, and optional version (like other features)
List: value of a list feature for a given entity ID
List item: holds a value and a timestamp
Mandatory time-to-live (TTL)
Expected call rate of up to 1M ops/s

Sample: user.reading_history

ScyllaDB Data Model

The list_items Table
-- Stores all the list items for all the list features
-- managed by the feature store.
CREATE TABLE list_items (
    feature_key TEXT,
    entity_id TEXT,
    item_key TEXT,
    value BLOB,
    PRIMARY KEY ((feature_key, entity_id), item_key)
)
WITH CLUSTERING ORDER BY (item_key DESC)
AND DEFAULT_TIME_TO_LIVE = $defaultTTL;

Columns
feature_key TEXT
●Identifies the list feature
●Built by concatenating the entity type, name, and version
entity_id TEXT
●ID of the entity the list belongs to
item_key TEXT
●Identifies the item in the list
●Built by concatenating the timestamp and a hash of the value
value BLOB
●Opaque byte representation of the list item value
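The feature_key construction described above can be sketched as follows. This is a minimal illustration, not Medium's code: the separators follow the '$entityType#$featureName|$featureVersion' template shown in the Add List Items slide, while the function name build_feature_key and the handling of a missing version (an empty string after the '|') are assumptions made for this example.

```python
from typing import Optional


def build_feature_key(entity_type: str, name: str, version: Optional[str] = None) -> str:
    """Concatenate entity type, name, and version as '<type>#<name>|<version>'."""
    # Assumption: an unversioned feature keeps the '|' separator
    # followed by an empty version string.
    return f"{entity_type}#{name}|{version or ''}"
```

For example, build_feature_key("user", "reading_history", "2025/03/11") yields 'user#reading_history|2025/03/11'.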

Primary Key
Partition key: feature_key + entity_id
●feature_key = entity type + name + version
●All items of a given list are stored in the same partition
●Enables efficient operations on a given list
Clustering key: item_key
●item_key = timestamp + hash(value)
●Items in a list are sorted following the order of their timestamp
●Enables efficient retrieval of list items in reverse-chronological order
●Allows multiple items in a list with same timestamp but distinct values
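One way the item_key could be built so that TEXT ordering matches timestamp ordering is sketched below. The slides only state that the key concatenates the timestamp and a hash of the value. The zero-padded millisecond timestamp, the ':' separator, the truncated SHA-256 digest, and the name build_item_key are all assumptions for illustration.

```python
import hashlib


def build_item_key(timestamp_ms: int, value: bytes = b"") -> str:
    """Build a sortable item key from a timestamp and an optional value."""
    # Assumption: timestamps are non-negative and zero-padded to a fixed
    # width, so that lexicographic order equals chronological order.
    prefix = f"{timestamp_ms:020d}"
    if not value:
        # A key without a hash sorts before every key that shares its
        # timestamp, which makes it usable as a lower bound in range queries.
        return prefix
    # Assumption: a truncated SHA-256 hex digest distinguishes values that
    # share a timestamp.
    digest = hashlib.sha256(value).hexdigest()[:16]
    return f"{prefix}:{digest}"
```

With this layout, two items with the same timestamp but different values get distinct keys, and a key built from a timestamp alone can serve as the lower bound in a range query on item_key.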

The list_items_by_value LSI
CREATE INDEX list_items_by_value
ON list_items((feature_key, entity_id), value);
Used by the Remove List Items with Value operation
Local secondary index (LSI) ensures index data is stored on the same node as the base table rows
Faster than a scan as query is highly selective
Faster than a global index for our use cases

Implementing List Operations

Add List Items
BEGIN BATCH
    -- for each $item in $items:
    INSERT INTO list_items(feature_key, entity_id, item_key, value)
    VALUES (
        '$entityType#$featureName|$featureVersion',
        $entityID,
        ${buildItemKey(item)},
        ${item.Value}
    )
    USING TTL ${item.Timestamp + ttl - now};
APPLY BATCH;  -- (1)
(1) Logged batch (default) ensures that insertion of items is atomic
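The TTL expression item.Timestamp + ttl - now gives each item the time it has left, so that it expires ttl seconds after its own timestamp rather than after its insertion. A minimal sketch of that arithmetic follows. The function name and the choice to skip already-expired items are assumptions: CQL rejects negative TTLs and generally treats a TTL of 0 as "no expiration", so a writer would likely need some guard of this kind.

```python
from typing import Optional


def remaining_ttl_seconds(item_timestamp: int, ttl: int, now: int) -> Optional[int]:
    """Return the TTL to pass to USING TTL, or None if the item is already expired.

    All arguments are in seconds (item_timestamp and now as Unix timestamps).
    """
    remaining = item_timestamp + ttl - now
    # Assumption: items whose lifetime has already elapsed are not written
    # at all, instead of being sent with a zero or negative TTL.
    return remaining if remaining > 0 else None
```

For an item timestamped at 1000 with a ttl of 600, inserting at now = 1200 gives a TTL of 400 seconds, while inserting at now = 1600 or later gives None.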

Get List Items
SELECT value, item_key
FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID  -- (1)
AND item_key >= ${buildItemKey(minTimestamp)}        -- (2)
ORDER BY item_key DESC                               -- (3)
LIMIT $limit;
(1) Single partition for maximum efficiency
(2) Discard old items
(3) Efficient sorting via clustering key
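The semantics of this query can be mimicked in memory to make the three steps explicit. The sketch below models one partition as a dict from item_key to value. It is only an illustration of the filtering, ordering, and limiting logic, not how ScyllaDB executes the query, and the name get_list_items is hypothetical.

```python
from typing import Dict, List, Tuple


def get_list_items(
    partition: Dict[str, bytes], min_item_key: str, limit: int
) -> List[Tuple[bytes, str]]:
    """Return up to `limit` (value, item_key) pairs, newest first."""
    # Discard old items: keep only keys at or above the lower bound.
    recent = [key for key in partition if key >= min_item_key]
    # Reverse-chronological order, as given by the DESC clustering order.
    recent.sort(reverse=True)
    return [(partition[key], key) for key in recent[:limit]]
```

In the real table no sort is needed at read time, since rows within a partition are already stored in clustering-key order.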

Remove List Items with Value
-- (1)
SELECT item_key FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID
AND value = $value;

-- (2) for each batch of item_key values:
DELETE FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID
AND item_key IN ($itemKeyBatch);
(1) Fetch keys of items to delete via list_items_by_value LSI
(2) Run batches of DELETE statements
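Splitting the fetched item keys into batches for the DELETE statements could look like the sketch below. The slides do not give a batch size, so the default of 100 and the name chunk_item_keys are assumptions.

```python
from typing import Iterator, List, Sequence


def chunk_item_keys(item_keys: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of item_keys, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(item_keys), batch_size):
        # Each yielded list would be bound to $itemKeyBatch in one DELETE.
        yield list(item_keys[start:start + batch_size])
```

Bounding the size of each IN list keeps individual DELETE statements small, even when many items share the same value.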

Remove All List Items
DELETE FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID;  -- (1)
(1) Deletion is atomic as we’re deleting a whole partition

Some Metrics

Latencies

Recap

Recap
●Suboptimal data model leads to queries that scale badly
●List features designed as replacement for relational features
●Primary key shaped for efficient querying, sorting, and filtering
●Local secondary index for efficient querying outside the primary key
●ScyllaDB is fast

Stay in Touch
Andréas Saudemont
https://medium.com/@asaudemont
https://www.linkedin.com/in/andreassaudemont/

Features

Lists

Controlling Storage Usage with TTL

List Operations