Rethinking Durable Workflows and Queues: A Library-based Approach by Qian Li

ScyllaDB 2 views 28 slides Oct 15, 2025

About This Presentation

Durable workflow engines typically depend on external orchestration, adding overhead, write amplification, and complexity. This talk presents an alternative: a lightweight library-based engine that embeds into application code and checkpoints state directly to the database. We'll share our exper...


Slide Content

A ScyllaDB Community
Rethinking Durable Workflows
and Queues:
A Library-Based Approach
Qian Li
Co-Founder

Qian Li (she/her)

Co-Founder at DBOS, Inc
■PhD in Computer Science (yes, I graduated)
■Always measure one level deeper; your benchmark
is likely wrong
■Organizing South Bay Systems meetups:
https://southbaysystems.xyz/
■I love birdwatching

Reliability is Hard: Any Step Can Break!
Initiate Transfer Request
Wait Until Completion
Send Confirmation Email
Async Background Processing


Reliability is Hard: Any Step Can Break!
■Process crashes (e.g., out-of-memory, bugs)
■User takes too long to respond
■External API calls can be unreliable
●Rate limiting
●Transient failures
■Large scale, massive parallel tasks

Try try try again?
■Data corruption
■Duplication
■Wasted compute resources
■Sl.o..w…

How Durable Workflows Can Help
■Checkpoint your program's execution state so it can resume where it left off

A Common Design: External Orchestration
Workflow Service
Data Store
Workers
(Workflow
code)
App Servers
Workflow
Orchestration
Queues
Messaging
Clients
(HTTP, RPC,
Kafka, etc.)
Admin UI
Control Plane

Problem: Too Much Overhead
■For example, to execute a step, AWS Step Functions:
●Durably records its intent to execute the step
●Schedules it for execution
●Waits for a worker to submit it for execution to AWS Lambda
●Waits for the execution to complete
●Durably records its output

Problem: Too Much Overhead
■A single step incurs 180 ms of communication overhead, even when the step itself
takes only ~20 ms

Simplicity = Better Performance

DBOS: Workflow Engine as a Library
Postgres
App Servers
Durable Workflows Queues Messaging
Clients
(HTTP, RPC,
Kafka, etc.)
Admin UI
DBOS Conductor
(Optional, out-of-band)
DBOS Library

DBOS: Workflow Engine as a Library
■Write normal functions
●Python/TypeScript/Go/Java
■Annotate with decorators
●@DBOS.workflow()
●@DBOS.step()
■The library runs in-process
with the application, persisting
workflow/step state in
Postgres
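
To make the idea of per-step checkpointing concrete, here is a minimal toy sketch. It is not the DBOS implementation or API: the `step` decorator, `run_workflow` helper, and `step_outputs` table are hypothetical names, and an in-memory SQLite database stands in for Postgres so the example runs on its own. It also omits the workflow input/output records. Each step's output is written to the database once; when the workflow is re-run after a simulated crash, completed steps return their saved output instead of executing again.

```python
import functools
import json
import sqlite3

# SQLite stands in for Postgres here; the table layout is an assumption.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE step_outputs ("
    "workflow_id TEXT, step_no INTEGER, output TEXT, "
    "PRIMARY KEY (workflow_id, step_no))"
)

current = {"workflow_id": None, "step_no": 0}  # toy, single-threaded context
calls = []  # records which step bodies actually executed


def step(fn):
    """Hypothetical decorator: checkpoint the step's output after it runs."""
    @functools.wraps(fn)
    def wrapper(*args):
        current["step_no"] += 1
        key = (current["workflow_id"], current["step_no"])
        row = db.execute(
            "SELECT output FROM step_outputs WHERE workflow_id=? AND step_no=?",
            key,
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # already done: reuse the checkpoint
        result = fn(*args)
        db.execute("INSERT INTO step_outputs VALUES (?, ?, ?)",
                   (*key, json.dumps(result)))
        db.commit()  # one database write per step
        return result
    return wrapper


def run_workflow(workflow_id, fn, *args):
    """Start or resume a workflow under a fixed ID."""
    current["workflow_id"], current["step_no"] = workflow_id, 0
    return fn(*args)


fail_once = {"send_email": True}  # makes the first email attempt crash


@step
def reserve(amount):
    calls.append("reserve")
    return {"reserved": amount}


@step
def send_email(receipt):
    calls.append("send_email")
    if fail_once.pop("send_email", False):
        raise RuntimeError("simulated crash")
    return f"emailed receipt for {receipt['reserved']}"


def transfer(amount):
    return send_email(reserve(amount))


try:
    run_workflow("wf-1", transfer, 100)  # crashes inside send_email
except RuntimeError:
    pass
result = run_workflow("wf-1", transfer, 100)  # resumes after reserve
print(calls, result)
```

On the second run, `reserve` is not executed again; only `send_email` is retried.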

How It Works: Persist State in a Database
WF Input
Step 1
Step 2
Step 3
WF Output
Postgres
■2 writes per workflow
■1 write per step
■Each DB write is ~1 ms

Performance Gap Increases w/ Number of Steps
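
A back-of-the-envelope calculation shows why the gap widens. It uses only the figures quoted in the slides (180 ms of overhead per step for external orchestration; 2 database writes per workflow plus 1 per step at ~1 ms each for the library approach) and assumes, for illustration only, that steps run sequentially and that these costs stay constant per step. The function names are made up for this sketch.

```python
EXTERNAL_OVERHEAD_MS_PER_STEP = 180  # figure quoted for external orchestration
DB_WRITE_MS = 1                      # "each DB write is ~1 ms"
WRITES_PER_WORKFLOW = 2              # workflow input and output
WRITES_PER_STEP = 1                  # one checkpoint per step


def external_overhead_ms(steps: int) -> int:
    """Overhead if every step pays the quoted orchestration cost."""
    return steps * EXTERNAL_OVERHEAD_MS_PER_STEP


def library_overhead_ms(steps: int) -> int:
    """Overhead if the only extra cost is checkpoint writes."""
    return (WRITES_PER_WORKFLOW + steps * WRITES_PER_STEP) * DB_WRITE_MS


for n in (1, 10, 100):
    print(n, external_overhead_ms(n), library_overhead_ms(n))
# 1 180 3
# 10 1800 12
# 100 18000 102
```

Under these assumptions the difference is 179n - 2 ms for a workflow with n steps, so it grows linearly with the number of steps.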

Your Database Is All You Need
■Workflow graph tracing for observability
■Workflow control: fork, cancel, resume, … for recovery
■Queues for rate limiting, concurrency control
■Durable sleep for long waits across failures
■Durable timeouts to bound execution time
■Durable events for human input and external triggers
■Durable cron jobs
■Durable streams
■… many more database-backed features

Use Cases
■Transactional backends (payment, checkout, reservation services)
■CI/CD pipeline orchestration
■Data pipelines with many parallel tasks
■Agent main loops: orchestrating LLM calls and tool calls
■Durable tools for MCP servers and AI agents

The Good, the Bad, and the Ugly

Advantages
■Lower communication overhead, better performance
■Easy to integrate with existing programs
■Easy to manage and operate: no separate orchestrator/workers
■Battle-tested Postgres vendors and large-scale deployments
■SQL-backed management and introspection

Easy Integration: Use DBOS with AI Frameworks
https://www.dbos.dev/blog/durable-execution-crashproof-ai-agents

Built-in Observability via SQL

Challenges
■Deep language integration: more difficult to add new languages
■Performance is database-bound; it scales with your database
■Distributed failure recovery
●Our solution: DBOS Conductor, an out-of-band service

Problem: Lock Contention in Queues
■Problem: when concurrent workers pull tasks from the queue, lock
contention leads to poor performance
Worker A
Worker B
Workflow 1
Workflow 2
Workflow 3
Selects
Selects

Solution: Use "Skip Locked" Wisely
Worker A
Worker B
Workflow 1
Workflow 2
Workflow 3
Locks Row
Locks Row
■Each worker selects rows that are not already locked
■But be careful when enforcing a global LIMIT clause
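
The behavior described above can be sketched with a toy in-memory model. The SQL string shows the general shape of a Postgres dequeue query using FOR UPDATE SKIP LOCKED; the table and column names in it are hypothetical, and the `ToyQueue` class only imitates the row-skipping semantics rather than talking to a database. The second half illustrates one possible reading of the LIMIT caveat: a per-query LIMIT caps what each worker takes, not what all workers take together.

```python
from dataclasses import dataclass, field

# Shape of the query being modeled (table/column names are assumptions).
DEQUEUE_SQL = """
SELECT workflow_id FROM workflow_queue
WHERE status = 'ENQUEUED'
ORDER BY created_at
LIMIT %(batch)s
FOR UPDATE SKIP LOCKED
"""


@dataclass
class ToyQueue:
    rows: list                       # workflow IDs in queue order
    locked: set = field(default_factory=set)

    def select_skip_locked(self, batch: int) -> list:
        """Take up to `batch` rows, skipping rows another worker holds."""
        picked = [r for r in self.rows if r not in self.locked][:batch]
        self.locked.update(picked)   # rows stay locked until the worker commits
        return picked


# Two workers dequeue concurrently without waiting on each other's locks.
queue = ToyQueue(rows=["wf1", "wf2", "wf3"])
worker_a = queue.select_skip_locked(batch=1)
worker_b = queue.select_skip_locked(batch=1)
print(worker_a, worker_b)  # ['wf1'] ['wf2']

# Caveat: LIMIT applies per query. If the intent is "at most 2 workflows
# running globally", two workers each using LIMIT 2 still claim 3 rows.
GLOBAL_LIMIT = 2
busy = ToyQueue(rows=["wf1", "wf2", "wf3"])
taken = busy.select_skip_locked(GLOBAL_LIMIT) + busy.select_skip_locked(GLOBAL_LIMIT)
print(len(taken), "claimed vs global limit", GLOBAL_LIMIT)
```

In this model, a global cap would need an extra check beyond the LIMIT in each worker's query, for example counting workflows already in flight before dequeuing more.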

Problem: Observability Queries Are Tricky
■How to design a responsive admin dashboard?
■Sequential scans on the workflow table with > 10M rows can be very slow

Solution: Use Secondary Indexes Wisely
■Indexes are expensive to construct, so don't index every column
■We added secondary indexes on a small number of fields that are the most
selective in frequently run queries:
●created_at: most queries are time-based selections
●executor_id: often used to find workflows that ran on a given server
●status: find all errored/cancelled/pending workflows
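
The effect of indexing only the frequently filtered columns can be checked with a small experiment. This sketch uses SQLite from the Python standard library as a stand-in for Postgres, so the planner output differs from what Postgres would print; the `workflow_status` table layout and index names are assumptions, and only the three indexed column names come from the slide.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE workflow_status ("
    "workflow_id TEXT PRIMARY KEY, status TEXT, executor_id TEXT, "
    "created_at INTEGER, output TEXT)"
)

# Index only the columns that frequently run dashboard queries filter on.
for col in ("created_at", "executor_id", "status"):
    db.execute(f"CREATE INDEX idx_{col} ON workflow_status ({col})")


def plan(sql: str, params: tuple = ()) -> str:
    """Return the planner's description of how it would run the query."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)


# Filtering on an indexed column lets the planner search the index.
indexed = plan("SELECT * FROM workflow_status WHERE status = ?", ("ERROR",))

# Filtering on an unindexed column falls back to scanning the table.
unindexed = plan("SELECT * FROM workflow_status WHERE output = ?", ("x",))

print(indexed)
print(unindexed)
```

On Postgres, the equivalent check would be to run EXPLAIN on the dashboard queries and confirm they no longer use sequential scans.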

Conclusion
■A library-based workflow engine is more lightweight and easier to operate
■Challenging to do it right, but databases provide battle-tested solutions

Thank you! Let’s connect.
Qian Li
[email protected]
@qianl_cs
https://qianli.dev