Cost Effective, Low Latency Vector Search In Databases: A Case Study with Azure Cosmos DB by Magdalen Manohar
ScyllaDB
About This Presentation
We've integrated DiskANN, a state-of-the-art vector indexing algorithm, into Azure Cosmos DB NoSQL, a state-of-the-art cloud-native operational database. Learn how we overcame the systems and algorithmic challenges of this integration to achieve <20ms query latency at the 10 million scale, while supporting scale-out to billions of vectors via automatic partitioning.
Size: 2.36 MB
Language: en
Added: Oct 13, 2025
Slides: 24 pages
Slide Content
A ScyllaDB Community
Cost Effective, Low Latency Vector Search In Databases: A Case Study with Azure Cosmos DB
Magdalen Manohar
Senior Researcher at Microsoft
■ Researcher in vector databases, primary author of the ParlayANN library
■ Developer of the DiskANN library within Microsoft Azure
■ Interested in algorithms and performance for semantic search, particularly vector search
Semantic Search and Retrieval
■ Vector search, or approximate nearest neighbor search, is a crucial routine in semantic search, recommendations, ads, etc.
■ Embedding models capture semantic similarity and encode objects into vectors
■ Semantically similar objects are translated to spatially close vectors
[Image credit: "The Current Best of Universal Word Embeddings and Sentence Embeddings" by Thomas Wolf, HuggingFace, Medium]
Vector Indices
■ A vector index is a data structure that speeds up similarity-search computations
■ Common types of vector indices include graph indices and inverted file indices
■ Data compression (product quantization, scalar quantization, etc.) is also a crucial component of answering similarity queries
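As a concrete illustration of the compression idea, here is a minimal sketch of scalar quantization assuming a simple per-dimension min/max scheme with 8-bit codes; it shows the general technique, not necessarily the exact quantizer used in the product.

```python
import numpy as np

# Minimal sketch: map each float32 dimension to an unsigned 8-bit code using
# per-dimension min/max ranges learned from the data, shrinking a 768-dim
# float vector from 3072 bytes to 768 bytes.

def train_scalar_quantizer(vectors: np.ndarray):
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    return lo, hi

def quantize(vectors, lo, hi):
    scale = np.where(hi > lo, hi - lo, 1.0)
    codes = np.clip((vectors - lo) / scale, 0.0, 1.0)
    return np.round(codes * 255).astype(np.uint8)

def dequantize(codes, lo, hi):
    return lo + (codes.astype(np.float32) / 255.0) * (hi - lo)

if __name__ == "__main__":
    data = np.random.rand(1000, 768).astype(np.float32)
    lo, hi = train_scalar_quantizer(data)
    codes = quantize(data, lo, hi)        # 768 bytes per vector
    approx = dequantize(codes, lo, hi)    # usable for cheap distance estimates
    print(codes.shape, codes.dtype)
```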
A Vector Index and a Vector Database?
Features we want to support:
■ Automatic replication and crash recovery
■ Elastic scaling to >10s of billions of documents
■ Automatic partitioning across physical partitions and tiers of storage
■ Support for incremental changes
Should a database natively support fast, cost-effective vector search, or does this require external replication outside the database?
An Integrated Vector Database
DiskANN (vector index):
■ Efficient and accurate vector search from 10K to >50 billion point indices
■ Support for multiple tiers of memory and storage
■ Algorithmic innovations for dynamic updates and for queries filtered by a logical predicate

Azure Cosmos DB NoSQL:
■ Planet-scale collections with elastic scaling and automatic partitioning
■ Flexible cost and memory models
■ Database fundamentals: automatic crash recovery and replication, support for multi-tenancy
In our paper, Cost-Effective, Low Latency Vector Search with Azure Cosmos DB, we provide a point in favor of integrated vector databases by deeply integrating DiskANN with Azure Cosmos DB.
Our Results
We overcome the algorithmic and performance challenges of this integration to achieve:
■ <20 ms query latency at the 10 million scale, at a P99 cost of $17.50 per million queries
■ Scale-out to 1 billion vectors with <150 ms client latency
■ Robust accuracy under a stream of updates
■ Support for efficient filter-predicate search
■ Multi-tenant design where the number of partition keys or the vectors per partition can grow independently to large numbers
■ Auto-scaling cost structure
Outline
I. Introduction
II. Background on DiskANN and Azure Cosmos DB
III. Challenges of the Integration
IV. Experimental Results
DiskANN
Index: a directed graph with one vertex per data point/embedding
For a query q: greedily search the graph from a designated start point to converge to the answer
Supports:
■ Fast and accurate search, filtered search
■ Point insertions and deletions
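To make the greedy search concrete, here is a minimal Python sketch of beam search over such a graph; the function and parameter names are illustrative (not the DiskANN library API), with L as the candidate-list size and k the number of results.

```python
import heapq
import numpy as np

# Sketch of greedy (beam) search, assuming graph is {vertex_id: [out-neighbor ids]}
# and vectors is a NumPy matrix with one row per vertex.

def greedy_search(query, vectors, graph, start, L=100, k=10):
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = set()
    candidates = [(dist(start), start)]          # min-heap ordered by distance
    while True:
        unvisited = [(d, v) for d, v in candidates if v not in visited]
        if not unvisited:
            break
        _, v = min(unvisited)                    # closest unvisited candidate
        visited.add(v)
        in_list = {c for _, c in candidates}
        for u in graph[v]:                       # expand its out-neighbors
            if u not in in_list:
                heapq.heappush(candidates, (dist(u), u))
                in_list.add(u)
        candidates = heapq.nsmallest(L, candidates)   # keep the L best candidates
    return [v for _, v in heapq.nsmallest(k, candidates)]

# Example call on a tiny random graph, just to show the shape of the API.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vecs = rng.random((100, 8), dtype=np.float32)
    graph = {i: list(rng.choice(100, size=4, replace=False)) for i in range(100)}
    print(greedy_search(vecs[0], vecs, graph, start=0, L=20, k=5))
```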
Bw-Tree
The Bw-Tree is a B-tree variant used for indexing in Azure Cosmos DB. Its features include:
■ Latch-free design
■ Log-structured B-tree index
■ Updates persisted as deltas in batches flushed to disk, with periodic consolidation
[Diagram: delta records chained onto a Bw-Tree page before consolidation]
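As a rough illustration of the delta-update idea (illustrative only, not Cosmos DB code), the sketch below prepends delta records to a page and consolidates once the chain passes a threshold.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Delta:
    key: str
    value: Optional[str]                  # None marks a delete
    next: Optional["Delta"] = None

@dataclass
class Page:
    base: dict = field(default_factory=dict)
    chain: Optional[Delta] = None
    chain_len: int = 0

MAX_CHAIN = 15                            # cf. the "max chain length 15" setting later

def update(page: Page, key: str, value: Optional[str]) -> None:
    page.chain = Delta(key, value, page.chain)    # prepend a delta, no in-place edit
    page.chain_len += 1
    if page.chain_len > MAX_CHAIN:
        consolidate(page)

def lookup(page: Page, key: str) -> Optional[str]:
    d = page.chain
    while d is not None:                  # newest delta wins
        if d.key == key:
            return d.value
        d = d.next
    return page.base.get(key)

def consolidate(page: Page) -> None:
    merged = dict(page.base)
    deltas = []
    d = page.chain
    while d is not None:
        deltas.append(d)
        d = d.next
    for d in reversed(deltas):            # apply deltas oldest to newest
        if d.value is None:
            merged.pop(d.key, None)
        else:
            merged[d.key] = d.value
    page.base, page.chain, page.chain_len = merged, None, 0
```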
Outline
I. Introduction
II. Background on DiskANN and Azure Cosmos DB
III. Challenges of the Integration
IV. Experimental Results
DiskANN Index Design
■ The index consists of three terms: quantized vectors, adjacency lists, and full-precision vectors
■ Indexed by document id
■ Typical values for a ~10M index with 768-dimensional float embeddings: 30 GB full-precision, 2 GB quantized, 1.3 GB graph
Where each term lives:
■ Adjacency list term on the Bw-Tree (small, updated most frequently)
■ Quantized vector term on the Bw-Tree (small, updated infrequently)
■ Full vector term in the DocStore (largest, updated least frequently)
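A quick back-of-the-envelope check of those typical values, assuming 4-byte floats, 192-byte quantized codes per vector (the compression used later for the 768-dimensional Wikipedia Cohere dataset), and 32 graph neighbors stored as 4-byte ids:

```python
# Rough size estimates for a 10M-point, 768-dimensional index.
n, dims = 10_000_000, 768

full_precision_gb = n * dims * 4 / 1e9     # ~30.7 GB full-precision vectors
quantized_gb      = n * 192 / 1e9          # ~1.9 GB quantized vectors
graph_gb          = n * 32 * 4 / 1e9       # ~1.3 GB adjacency lists (R = 32)

print(f"full precision: {full_precision_gb:.1f} GB")
print(f"quantized:      {quantized_gb:.1f} GB")
print(f"graph:          {graph_gb:.1f} GB")
```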
Operations with Limited Memory
[Figure: top-10 results computed using compressed data, then re-ranked with full-precision data; the true top-4 nearest neighbors are shown for comparison]
■ Challenge 1: we can't store full vectors in cache, and we can't afford too many SSD round-trips
■ Queries are processed by reading only the index (graph) term and the quantized term, then re-ranked with the full-precision term
■ We determine experimentally that inserts and deletes can be done entirely with quantized vectors
Real-world numbers: 3500 quantized lookups and 50 full-precision lookups for one query
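A minimal sketch of this query path, with illustrative helper names (quantized_search and fetch_full_vector are assumed callables, not product APIs):

```python
import numpy as np

# Search cheaply in the compressed space, then re-rank a small candidate set with
# full-precision vectors fetched from the document store.

def query_with_rerank(q, quantized_search, fetch_full_vector, candidates=50, k=10):
    # 1. cheap search over quantized vectors (e.g. the greedy graph search sketched
    #    earlier, using quantized distances) -> ids of likely neighbors
    cand_ids = quantized_search(q, candidates)

    # 2. a small number of full-precision lookups (~50 per query, per the slide)
    full = {i: fetch_full_vector(i) for i in cand_ids}

    # 3. re-rank candidates by exact distance and keep the top k
    exact = sorted(cand_ids, key=lambda i: float(np.linalg.norm(full[i] - q)))
    return exact[:k]
```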
Updates to a Single-Writer Data Structure
■ Challenge 2: the Bw-Tree can only be updated by a single writer, but we need parallelism over inserts/deletes
■ Inserts and deletes require changes to existing graph vertices
■ We use the update routine from ParlayANN for parallel updates, and extend this routine to deletions
[Diagram: updates to the adjacency list term are computed with no write-locking and internal parallelism, then persisted to the Bw-Tree by a single thread]
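A minimal sketch of this pattern, with assumed helper names: graph updates for a batch are computed in parallel without write locks, and a single writer thread persists the resulting adjacency-list changes, matching the Bw-Tree's single-writer constraint.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import Thread

def compute_graph_update(vector_id):
    # Placeholder for the real work: search in the quantized space, choose
    # neighbors, and produce adjacency-list entries to write.
    return (vector_id, [vector_id - 1, vector_id + 1])

def single_writer(queue, persist):
    while True:
        item = queue.get()
        if item is None:                  # sentinel: no more updates
            break
        persist(*item)                    # only this thread touches the index term

def ingest_batch(vector_ids, persist):
    updates = Queue()
    writer = Thread(target=single_writer, args=(updates, persist))
    writer.start()
    with ThreadPoolExecutor() as pool:    # parallel, lock-free compute phase
        for update in pool.map(compute_graph_update, vector_ids):
            updates.put(update)
    updates.put(None)
    writer.join()

if __name__ == "__main__":
    store = {}
    ingest_batch(range(10), persist=lambda vid, nbrs: store.update({vid: nbrs}))
    print(store)
```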
Maintaining Stability over Updates
■ Challenge 3: how to maintain stable recall over updates in a Bw-Tree-backed index?
■ The previous deletion policy required waiting until a critical mass of deletions (~20% of the index) built up
■ We utilize new work on in-place deletions, with some changes to the algorithm for our specific case
[Figure: recall over time, comparing deletion policies over a sequence of inserts and deletes]
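For intuition only, here is a rough sketch of the general graph-repair idea behind deleting a point in place, in the spirit of the papers cited on the next slide; the algorithm used in this integration differs in its details.

```python
# When a vertex is removed, vertices that pointed to it are patched with candidate
# edges toward the deleted vertex's own neighbors, then pruned back to the degree
# bound R. The prune function stands in for a real diversity-aware pruning rule.

def delete_in_place(graph, deleted, R=32, prune=None):
    """graph: {vertex: set(out-neighbors)}; prune: trims a candidate set to <= R."""
    prune = prune or (lambda cands: set(list(cands)[:R]))
    deleted_neighbors = graph.pop(deleted, set())
    for v, nbrs in graph.items():
        if deleted in nbrs:
            cands = (nbrs - {deleted}) | (deleted_neighbors - {v})
            graph[v] = prune(cands)
    return graph
```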
A Look at Prior Work
Additional papers used to bring this work together:
■ FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search [SJKS'21]
■ ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms [MSBDGSS'24]
■ In-Place Updates of a Graph Index for Streaming Approximate Nearest Neighbor Search [XMBCWS'25]
Outline
I. Introduction
II. Background on DiskANN and Azure Cosmos DB
III. Challenges of the Integration
IV. Experimental Results
Experimental Setup
Unless otherwise indicated:
■ Partition size limit of 50 GB, Bw-Tree max chain length of 15
■ Graph built with parameters R=32, L=100, slack=1.3
■ Bw-Tree cache contains the adjacency list and quantized terms
■ Warmup phase for queries before measuring
■ All benchmark datasets are publicly available at https://github.com/harsha-simhadri/big-ann-benchmarks
Query Latency and RU
■ We compare RU (a proxy for compute cost, at $0.25 per 1 million RU) and latency across dataset sizes
■ Query latency and RU cost increase only ~2x even when the index size increases 100x
■ Very little change in performance when vectors are ~7x larger
■ Using P99 RU figures, $17.50 per 1 million queries (~$20 monthly storage cost for a 10M-size index)
Wikipedia Cohere dataset (768 float dimensions, compressed to 192 bytes/vector), ~95% recall
MSTuring Web dataset (100 float dimensions, compressed to 50 bytes/vector), ~90% recall
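A quick sanity check of the cost figure: at $0.25 per 1 million RU, $17.50 per 1 million queries corresponds to roughly 70 RU per query at P99 (an implied figure; the slides state only the dollar amounts).

```python
price_per_million_ru = 0.25          # $ per 1,000,000 RU
cost_per_million_queries = 17.50     # $ per 1,000,000 queries (P99 RU figures)

ru_per_query = cost_per_million_queries / price_per_million_ru
print(f"implied P99 RU per query: {ru_per_query:.0f}")   # -> 70
```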
Insert Latency
■ An insert requires a query in the quantized space, then additional work to integrate the new node into the graph
■ Most of the cost is reading quantized vectors
■ Vector read cost increases as the size of the index increases
MSTuring-1M ingestion time breakdown (ms), single-threaded
Billion Scale Partitioned Index
■ The billion-size index on the MSTuring-1B dataset spans 50 partitions
■ Server latency is per partition; client latency is end-to-end (proportional to the worst partition)
■ Recall of 83.76%, 89.61%, and 93.89% for L=50, 100, and 200 respectively
■ Worst-case latency still under 150 ms
MSTuring-1B dataset (100 floating-point dimensions, compressed to 50 bytes/vector)
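A minimal sketch of the scatter-gather pattern implied here (illustrative, not the service's code): the query fans out to each partition's index, and the client merges the per-partition top-k, so end-to-end latency tracks the slowest partition.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_partitioned(query, partitions, k=10):
    """partitions: list of callables, each returning [(distance, doc_id), ...]."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = pool.map(lambda search: search(query, k), partitions)
    # merge the per-partition top-k lists into a global top-k
    return heapq.nsmallest(k, (hit for hits in per_partition for hit in hits))
```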
Conclusion
By deeply integrating DiskANN with Azure Cosmos DB NoSQL and incorporating the latest research to address the accompanying technical challenges, we achieved:
■ A commercially available, cost-effective vector database that inherits the technical capabilities of the DiskANN vector index and the flexibility, support, and scale of Azure Cosmos DB
■ <20 ms query latency at the 10 million scale, and <150 ms query latency at the billion scale