Rethinking Durable Workflows and Queues: A Library-based Approach by Qian Li
ScyllaDB
28 slides
Oct 15, 2025
About This Presentation
Durable workflow engines typically depend on external orchestration, adding overhead, write amplification, and complexity. This talk presents an alternative: a lightweight library-based engine that embeds into application code and checkpoints state directly to the database. We'll share our experience building DBOS, a library-based engine designed for simplicity and efficiency. We'll discuss the architectural trade-offs, challenges in failure recovery, and key optimizations for scalability and maintainability.
Slide Content
A ScyllaDB Community
Rethinking Durable Workflows
and Queues:
A Library-Based Approach
Qian Li
Co-Founder
Qian Li (she/her)
Co-Founder at DBOS, Inc
■PhD in Computer Science (yes, I graduated)
■Always measure one level deeper; your benchmark
is likely wrong
■Organizing South Bay Systems meetups:
https://southbaysystems.xyz/
■I love birdwatching
Reliability is Hard: Any Step Can Break!
Initiate Transfer Request
Wait Until Completion
Send Confirmation Email
Async Background Processing
Reliability is Hard: Any Step Can Break!
■Process crashes (e.g., out-of-memory, bugs)
■User takes too long to respond
■External API calls can be unreliable
●Rate limiting
●Transient failures
■Large scale, massive parallel tasks
How Durable Workflows Can Help
■Checkpoint your program's execution state so it can resume where it left off
A Common Design: External Orchestration
Workflow Service
Data Store
Workers
(Workflow
code)
App Servers
Workflow
Orchestration
Queues
Messaging
Clients
(HTTP, RPC,
Kafka, etc.)
Admin UI
Control Plane
Problem: Too Much Overhead
■For example, to execute a step in AWS Step Functions:
●Durably records its intent to execute the step
●Schedules it for execution
●Waits for a worker to submit it for execution to AWS Lambda
●Waits for the execution to complete
●Durably records its output
■A single step has 180 ms communication overhead, even when the step itself
only takes ~20 ms
DBOS: Workflow Engine as a Library
■Write normal functions
●Python/TypeScript/Go/Java
■Annotate with decorators
●@DBOS.workflow()
●@DBOS.step()
■The library runs in-process
with the application, persisting
workflow/step state in
Postgres
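The idea of annotating ordinary functions so that their outputs are checkpointed can be illustrated with a toy sketch. This is not the DBOS library or its implementation: the `workflow` and `step` decorators, the `step_outputs` table, and the in-memory SQLite database (standing in for Postgres) are all assumptions made so the example runs on its own.

```python
import functools
import json
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE step_outputs ("
    " workflow_id TEXT, step_index INTEGER, output TEXT,"
    " PRIMARY KEY (workflow_id, step_index))"
)

# Toy, single-threaded execution context (hypothetical, for illustration only).
_current = {"workflow_id": None, "next_step": 0}
calls = []  # records which step bodies actually executed


def step(fn):
    """Checkpoint the step's output; on replay, return the stored output."""
    @functools.wraps(fn)
    def wrapper(*args):
        wf_id, idx = _current["workflow_id"], _current["next_step"]
        _current["next_step"] += 1
        row = conn.execute(
            "SELECT output FROM step_outputs"
            " WHERE workflow_id = ? AND step_index = ?",
            (wf_id, idx),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # already completed: skip re-execution
        result = fn(*args)
        conn.execute(
            "INSERT INTO step_outputs VALUES (?, ?, ?)",
            (wf_id, idx, json.dumps(result)),
        )
        conn.commit()  # one durable write per step
        return result
    return wrapper


def workflow(fn):
    """Bind a workflow ID so steps know where to read and write checkpoints."""
    @functools.wraps(fn)
    def wrapper(workflow_id, *args):
        _current["workflow_id"], _current["next_step"] = workflow_id, 0
        return fn(*args)
    return wrapper


@step
def initiate_transfer(amount):
    calls.append("initiate")
    return {"transfer_id": "t-1", "amount": amount}


@step
def send_confirmation(transfer):
    calls.append("confirm")
    return f"confirmed {transfer['transfer_id']}"


@workflow
def transfer_workflow(amount):
    transfer = initiate_transfer(amount)
    return send_confirmation(transfer)


first = transfer_workflow("wf-1", 100)
# Re-running with the same workflow ID simulates recovery after a crash:
# both steps are served from their checkpoints instead of running again.
second = transfer_workflow("wf-1", 100)
```

In this sketch the second call returns the same result without re-running either step body, which is the "resume where it left off" behavior the slides describe.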
How It Works: Persist State in a Database
(Diagram: WF Input, Step 1, Step 2, Step 3, WF Output; state persisted in Postgres)
●2 writes per workflow
●1 write per step
●Each DB write is ~1ms
Performance Gap Increases w/ Number of Steps
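A back-of-envelope model shows why the gap grows with step count. It uses only the figures quoted on the slides (180 ms of communication overhead per externally orchestrated step, ~20 ms of step work, 2 writes per workflow, 1 write per step, ~1 ms per write). It assumes steps run sequentially and that the 180 ms figure applies to every step; it is an illustration, not a benchmark.

```python
ORCHESTRATOR_OVERHEAD_MS = 180  # per-step overhead quoted for external orchestration
STEP_WORK_MS = 20               # time the step itself takes
DB_WRITE_MS = 1                 # approximate cost of one database write


def external_latency_ms(num_steps):
    """Sequential workflow where every step pays the orchestration overhead."""
    return num_steps * (ORCHESTRATOR_OVERHEAD_MS + STEP_WORK_MS)


def library_latency_ms(num_steps):
    """Two writes per workflow plus one write per step, as on the slide."""
    return 2 * DB_WRITE_MS + num_steps * (DB_WRITE_MS + STEP_WORK_MS)


for n in (1, 10, 100):
    ext, lib = external_latency_ms(n), library_latency_ms(n)
    print(f"{n:>3} steps: external ~{ext} ms, library ~{lib} ms, gap ~{ext - lib} ms")
```

Under these assumptions the gap is about 177 ms for one step and about 17.9 s for 100 steps, because the per-step overhead difference (179 ms) is multiplied by the number of steps.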
Your Database Is All You Need
■Workflow graph tracing for observability
■Workflow control: fork, cancel, resume, … for recovery
■Queues for rate limiting, concurrency control
■Durable sleep for long waits across failures
■Durable timeouts to bound execution time
■Durable events for human input and external triggers
■Durable cron jobs
■Durable streams
■… many more database-backed features
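As one illustration of a database-backed feature from the list above, a durable sleep can be sketched by persisting the wake-up time rather than the duration, so a restarted process sleeps only for the time that remains. This is a minimal sketch of the general technique, not DBOS's implementation: the `sleeps` table, the `durable_sleep` function, and the injectable `now`/`sleep` parameters are hypothetical, and in-memory SQLite stands in for Postgres.

```python
import sqlite3
import time


def durable_sleep(conn, workflow_id, seconds, now=time.time, sleep=time.sleep):
    """Sleep until a wake-up time that survives restarts; return time slept."""
    row = conn.execute(
        "SELECT wake_at FROM sleeps WHERE workflow_id = ?", (workflow_id,)
    ).fetchone()
    if row is None:
        # First execution: record the absolute wake-up time durably.
        wake_at = now() + seconds
        conn.execute("INSERT INTO sleeps VALUES (?, ?)", (workflow_id, wake_at))
        conn.commit()
    else:
        # Recovery: reuse the recorded wake-up time instead of starting over.
        wake_at = row[0]
    remaining = max(0.0, wake_at - now())
    sleep(remaining)
    return remaining


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sleeps (workflow_id TEXT PRIMARY KEY, wake_at REAL)")

# A fake clock and a no-op sleep keep the demonstration instantaneous.
clock = {"t": 1000.0}
fake_now = lambda: clock["t"]
no_sleep = lambda s: None

first = durable_sleep(conn, "wf-1", 60, now=fake_now, sleep=no_sleep)
clock["t"] += 45  # simulate a crash and a restart 45 seconds later
after_restart = durable_sleep(conn, "wf-1", 60, now=fake_now, sleep=no_sleep)
```

Here the first call would sleep the full 60 seconds, while the call after the simulated restart sleeps only the remaining 15.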
Use Cases
■Transactional backends (payment, checkout, reservation services)
■CI/CD pipeline orchestration
■Data pipelines with many parallel tasks
■Agent main loops: orchestrating LLM calls and tool calls
■Durable tools for MCP servers and AI agents
The Good, the Bad, and the Ugly
Advantages
■Lower communication overhead, better performance
■Easy to integrate with existing programs
■Easy to manage and operate: no separate orchestrator/workers
■Battle-tested Postgres vendors and large-scale deployments
■SQL-backed management and introspection
Easy Integration: Use DBOS with AI Frameworks
https://www.dbos.dev/blog/durable-execution-crashproof-ai-agents
Built-in Observability via SQL
Challenges
■Deep language integration: more difficult to add new languages
■Performance is database-bound: it scales with your database
■Distributed failure recovery
●Our solution: DBOS Conductor, an out-of-band service
Postgres
App Servers
Durable Workflows Queues Messaging
Clients
(HTTP, RPC,
Kafka, etc.)
Admin UI
DBOS Conductor
(Optional, out-of-band)
DBOS Library
Problem: Lock Contention in Queues
■Problem: when concurrent workers pull tasks from the queue, lock contention
leads to poor performance
Worker A
Worker B
Workflow 1
Workflow 2
Workflow 3
Selects
Selects
Solution: Use "Skip Locked" Wisely
Worker A
Worker B
Workflow 1
Workflow 2
Workflow 3
Locks Row
Locks Row
■Each worker selects rows that are not already locked
■But be careful when enforcing a global LIMIT clause
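The dequeue pattern can be sketched with Postgres's `FOR UPDATE SKIP LOCKED` issued through a DB-API connection such as psycopg. The `workflow_queue` table, its columns, the status values, and the `dequeue` function are hypothetical names chosen for illustration; they are not taken from the talk or from DBOS's schema.

```python
# Hypothetical table:
#   workflow_queue(workflow_id, queue_name, status, executor_id, created_at)

SELECT_SQL = """
SELECT workflow_id
  FROM workflow_queue
 WHERE queue_name = %s
   AND status = 'ENQUEUED'
 ORDER BY created_at
 LIMIT %s
   FOR UPDATE SKIP LOCKED
"""

CLAIM_SQL = """
UPDATE workflow_queue
   SET status = 'PENDING', executor_id = %s
 WHERE workflow_id = ANY(%s)
"""


def dequeue(conn, queue_name, executor_id, batch_size):
    """Claim up to batch_size tasks without waiting on rows other workers hold."""
    with conn.cursor() as cur:
        # Rows locked by another worker's open transaction are skipped,
        # so concurrent workers do not block each other on the same rows.
        cur.execute(SELECT_SQL, (queue_name, batch_size))
        claimed = [row[0] for row in cur.fetchall()]
        if claimed:
            cur.execute(CLAIM_SQL, (executor_id, claimed))
    conn.commit()  # committing releases the row locks
    return claimed
```

One plausible reading of the caution about a global LIMIT: because locked rows are skipped before the LIMIT is applied, the LIMIT caps a single worker's batch, not the total number of tasks claimed across all workers. Enforcing a queue-wide concurrency cap would therefore need an additional check, for example counting rows that are already claimed.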
Problem: Observability Queries Are Tricky
■How to design a responsive admin dashboard?
■Sequential scans on the workflow table with > 10M rows can be very slow
Solution: Use Secondary Indexes Wisely
■Expensive to construct, so don't index every column
■Added secondary indexes to a small number of fields that are the most
selective in frequently run queries:
●created_at: most queries are time-based selections
●executor_id: often used to find workflows that ran on a given server
●status: find all errored/cancelled/pending workflows
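The indexing choice above can be sketched as follows. The `workflow_status` table layout and index names are assumptions for illustration, and SQLite stands in for Postgres so the example runs without a server; the query planner output differs between the two systems, but the principle of indexing only the selective, frequently filtered columns is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical workflow table; only the three indexed columns come from the slide.
conn.execute(
    """
    CREATE TABLE workflow_status (
        workflow_id TEXT PRIMARY KEY,
        status      TEXT NOT NULL,
        executor_id TEXT,
        created_at  INTEGER NOT NULL,
        output      TEXT
    )
    """
)

# Index only the columns that dashboard queries filter on most often.
for column in ("created_at", "executor_id", "status"):
    conn.execute(
        f"CREATE INDEX idx_workflow_status_{column} ON workflow_status ({column})"
    )

# Ask the planner how it would find all errored workflows.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT workflow_id, output FROM workflow_status WHERE status = 'ERROR'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan_rows)  # last column is the detail text
print(plan_text)
```

The printed plan should name `idx_workflow_status_status`, indicating an index search rather than a sequential scan of the whole table.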
Conclusion
■A library-based workflow engine is more lightweight and easier to operate
■Challenging to do it right, but databases provide battle-tested solutions
Thank you! Let’s connect.
Qian Li
@qianl_cs
https://qianli.dev