Using ScyllaDB to Implement Lists in Medium’s Feature Store by Andréas Saudemont
ScyllaDB
31 slides
Mar 05, 2025
About This Presentation
Discover how Medium is leveraging ScyllaDB to power a fast, scalable data layer for lists in Medium’s feature store.
Slide Content
A ScyllaDB Community
Using ScyllaDB to Implement
Lists in Medium’s Feature Store
Andréas Saudemont
Software Engineer
Andréas Saudemont (he/him)
■Principal software engineer at Medium
■Building scalable architectures for heavy loads
Presentation Agenda
■Medium’s feature store and list features
■ScyllaDB data model
■Implementing list operations
■Some metrics
■Recap
Medium’s Feature Store
and List Features
Medium’s Feature Store
Key component of Medium’s recommendation system
Used by machine learning models for ‘For you’ feed, Daily Digest, etc.
Database with a specialized API
Low-latency, high-throughput access patterns
Features
A property of an entity in the feature store
Defined by:
●entity type – e.g. user
●name – e.g. is_member
●data type of its values – e.g. boolean
●version (optional) – e.g. ‘2025/03/11’
A feature value is the value of a feature for a given entity ID
●e.g. true for the user.is_member feature for entity "user123"
Relational Features
Has multiple values for a given entity ID
Each value has:
●a relation ID – the ID of the related entity
●a timestamp
Sample: story.user_has_read
●Relates with the user entity type
●Values indicate whether and when a given user has read a given story
Limitations of Relational Features
Suboptimal data model
Data split across 2 tables
→ Too many DB queries
●Inefficient “ALLOW FILTERING” queries to fetch entity IDs
●Plus one query for each entity ID to fetch values
→ Hard to optimize using primary keys/indexes
List Features
Goal: better way to handle cross-entity relations in the feature store
List feature: defined by its entity type, name, and optional version (like other features)
List: value of a list feature for a given entity ID
List item: holds a value and a timestamp
Mandatory time-to-live (TTL)
Expected call rate of up to 1M ops/s
Sample: user.reading_history
ScyllaDB Data Model
The list_items Table
-- Stores all the list items for all the list features
-- managed by the feature store.
CREATE TABLE list_items (
    feature_key TEXT,
    entity_id   TEXT,
    item_key    TEXT,
    value       BLOB,
    PRIMARY KEY ((feature_key, entity_id), item_key)
)
WITH CLUSTERING ORDER BY (item_key DESC)
AND DEFAULT_TIME_TO_LIVE = $defaultTTL;
Columns
feature_key TEXT
●Identifies the list feature
●Built by concatenating the entity type, name, and version
entity_id TEXT
●ID of the entity the list belongs to
item_key TEXT
●Identifies the item in the list
●Built by concatenating the timestamp and a hash of the value
value BLOB
●Opaque bytes representation of list item value
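As a minimal sketch of how feature_key might be assembled, the helper below concatenates the three parts. The name build_feature_key is hypothetical; the '#' and '|' separators mirror the literal used in the deck's INSERT statement, and dropping the '|' suffix for unversioned features is an assumption, since the deck does not say how a missing version is encoded.

```python
from typing import Optional


def build_feature_key(entity_type: str, name: str, version: Optional[str] = None) -> str:
    """Concatenate entity type, name, and optional version into one key.

    Hypothetical helper: the separators follow the deck's INSERT literal
    ('$entityType#$featureName|$featureVersion'); omitting the '|' part
    when there is no version is an assumption.
    """
    key = f"{entity_type}#{name}"
    if version:
        key += f"|{version}"
    return key


if __name__ == "__main__":
    print(build_feature_key("user", "reading_history", "2025/03/11"))
    print(build_feature_key("user", "reading_history"))
```

Keeping the key as a single TEXT column means every list feature shares one table, with the feature identity folded into the partition key.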
Primary Key
Partition key: feature_key + entity_id
●feature_key = entity type + name + version
●All items of a given list are stored in the same partition
●Enables efficient operations on a given list
Clustering key: item_key
●item_key = timestamp + hash(value)
●Items in a list are sorted following the order of their timestamp
●Enables efficient retrieval of list items in reverse-chronological order
●Allows multiple items in a list with same timestamp but distinct values
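Because item_key is a TEXT column, it sorts lexicographically, so the timestamp part has to be fixed-width for key order to match time order. The sketch below shows one way to get that property. The function name, the 13-digit zero-padded millisecond timestamp, the ':' separator, and the truncated SHA-256 digest are all assumptions; the deck only says the key is a timestamp concatenated with a hash of the value.

```python
import hashlib


def build_item_key(timestamp_ms: int, value: bytes) -> str:
    """Build a clustering key of the form <timestamp>:<hash(value)>.

    Assumed layout, not Medium's actual one:
    - the timestamp is zero-padded to 13 digits so that string order
      equals chronological order;
    - the hash suffix keeps items with the same timestamp but distinct
      values from overwriting each other.
    """
    digest = hashlib.sha256(value).hexdigest()[:16]
    return f"{timestamp_ms:013d}:{digest}"


if __name__ == "__main__":
    earlier = build_item_key(1741132800000, b"story-1")
    same_time = build_item_key(1741132800000, b"story-2")
    later = build_item_key(1741132800001, b"story-1")
    print(earlier < later, same_time < later, earlier != same_time)
```

With a layout like this, writing the same value with the same timestamp twice produces the same key, so the second write simply overwrites the first row.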
The list_items_by_value Local Secondary Index (LSI)
Used by the Remove List Items with Value operation
Local index ensures index data is stored on the same node as the base table data
Faster than a scan as the query is highly selective
Faster than a global index for our use cases
CREATE INDEX list_items_by_value
ON list_items((feature_key, entity_id), value);
Implementing List Operations
Add List Items
BEGIN BATCH
  -- for each $item in $items:
  INSERT INTO list_items (feature_key, entity_id, item_key, value)
  VALUES (
    '$entityType#$featureName|$featureVersion',
    $entityID,
    ${buildItemKey(item)},
    ${item.Value}
  )
  USING TTL ${item.Timestamp + ttl - now};
APPLY BATCH;  -- (1)
(1) A logged batch (the default) ensures that the insertion of items is atomic
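The TTL expression in the INSERT anchors expiry to the item's own timestamp rather than to the time of the write, so a backfilled item expires when it would have if written on time. A minimal sketch of that arithmetic follows, assuming all three quantities are in seconds (the unit CQL uses for TTL). The function name is hypothetical, and skipping items whose remaining TTL is not positive is an assumption: CQL treats a TTL of 0 as "no expiration" and rejects negative values, so such items cannot be passed through as-is.

```python
from typing import Optional


def remaining_ttl_seconds(item_timestamp: int, ttl: int, now: int) -> Optional[int]:
    """Return the value for USING TTL, or None if the item has already expired.

    Mirrors the deck's expression item.Timestamp + ttl - now, with all
    arguments assumed to be in seconds since the epoch (or seconds, for ttl).
    """
    remaining = item_timestamp + ttl - now
    return remaining if remaining > 0 else None


if __name__ == "__main__":
    day = 86_400
    now = 1_741_132_800
    # An item read 10 days ago with a 30-day TTL has 20 days left.
    print(remaining_ttl_seconds(now - 10 * day, 30 * day, now))
    # An item older than the TTL should not be inserted at all.
    print(remaining_ttl_seconds(now - 31 * day, 30 * day, now))
```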
Get List Items
SELECT value, item_key
FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID  -- (1)
  AND item_key >= ${buildItemKey(minTimestamp)}      -- (2)
ORDER BY item_key DESC                               -- (3)
LIMIT $limit;
(1) Single partition for maximum efficiency
(2) Discard old items
(3) Efficient sorting via clustering key
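The range condition works because a key built from a bare timestamp sorts before every full key that shares that timestamp. The sketch below models the query in memory to show the effect; it is not how ScyllaDB executes it. The function names are hypothetical, and the key layout (13-digit zero-padded millisecond timestamp, ':' separator, hash suffix) is an assumption rather than the deck's actual format.

```python
from typing import Iterable, List


def lower_bound_key(min_timestamp_ms: int) -> str:
    """Timestamp-only key: a prefix of, and so <=, any key at that timestamp."""
    return f"{min_timestamp_ms:013d}"


def get_list_items(item_keys: Iterable[str], min_timestamp_ms: int, limit: int) -> List[str]:
    """In-memory model of the SELECT: range filter, newest first, limited."""
    bound = lower_bound_key(min_timestamp_ms)
    kept = [key for key in item_keys if key >= bound]  # item_key >= bound
    return sorted(kept, reverse=True)[:limit]          # ORDER BY ... DESC LIMIT


if __name__ == "__main__":
    keys = [
        "0000000000999:ffff",  # older than the bound, discarded
        "0000000001000:aaaa",
        "0000000001000:bbbb",
        "0000000002000:cccc",
    ]
    print(get_list_items(keys, min_timestamp_ms=1000, limit=2))
```

In the real table the rows are already stored in descending item_key order within the partition, so the database can read the newest items first and stop at the limit instead of sorting.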
Remove List Items with Value
SELECT item_key FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID
  AND value = $value;  -- (1)
-- for each batch of item_key values:
DELETE FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID
  AND item_key IN ($itemKeyBatch);  -- (2)
(1) Fetch keys of items to delete via list_items_by_value LSI
(2) Run batches of DELETE statements
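The second step splits the fetched keys into groups, one DELETE per group. A minimal sketch of that chunking is below; the function name is hypothetical and the batch size is a free parameter, since the deck does not state what size is used.

```python
from typing import Iterator, List, Sequence


def chunk_item_keys(item_keys: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size keys.

    Each group would be bound to the IN (...) clause of one DELETE statement.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(item_keys), batch_size):
        yield list(item_keys[start:start + batch_size])


if __name__ == "__main__":
    for batch in chunk_item_keys(["k1", "k2", "k3", "k4", "k5"], batch_size=2):
        print(batch)
```

Bounding the size of each IN list keeps individual statements small when a value appears many times in one list.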
Remove All List Items
DELETE FROM list_items
WHERE feature_key = '...' AND entity_id = $entityID;  -- (1)
(1) Deletion is atomic as we’re deleting a whole partition
Some Metrics
Latencies
Recap
●Suboptimal data model leads to queries that scale badly
●List features designed as replacement for relational features
●Primary key shaped for efficient querying, sorting, and filtering
●Local secondary index for efficient querying outside the primary key
●ScyllaDB is fast
Stay in Touch
Andréas Saudemont [email protected]
https://medium.com/@asaudemont
https://www.linkedin.com/in/andreassaudemont/