Cost Effective, Low Latency Vector Search In Databases: A Case Study with Azure Cosmos DB by Magdalen Manohar

ScyllaDB · 24 slides · Oct 13, 2025

About This Presentation

We've integrated DiskANN, a state-of-the-art vector indexing algorithm, into Azure Cosmos DB NoSQL, a state-of-the-art cloud-native operational database. Learn how we overcame the systems and algorithmic challenges of this integration to achieve <20ms query latency at the 10 million scale, wh...


Slide Content

A ScyllaDB Community
Cost Effective, Low Latency
Vector Search In Databases: A
Case Study with Azure Cosmos DB
Magdalen Manohar
Senior Researcher

Magdalen Manohar
Senior Researcher at Microsoft
■Researcher in Vector Databases, primary author of
the ParlayANN library
■Developer of the DiskANN library within Microsoft
Azure
■Interested in algorithms and performance for
semantic search, particularly vector search

Semantic Search and Retrieval

■Vector search, or approximate
nearest neighbor search, is a crucial
routine in semantic search,
recommendations, ads, etc.
[image credit] The Current Best of Universal Word Embeddings and Sentence Embeddings | by Thomas Wolf | HuggingFace | Medium
■Embedding models capture semantic
similarity and encode objects into
vectors
■Semantically similar objects are
translated to spatially close vectors

Vector Indices
■A vector index is a data structure that speeds up
computations of similarity search
■Common types of vector indices include graph
indices and inverted file indices
■Data compression (product quantization, scalar quantization, etc.) is also a crucial component of answering similarity queries; a rough sketch of the idea follows below
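To make the compression idea concrete, here is a minimal scalar-quantization sketch in Python; the one-byte-per-dimension encoding and the helper names are illustrative assumptions, not the scheme used in this work.

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Compress float vectors to one uint8 per dimension (illustrative only)."""
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction used when estimating distances."""
    return codes.astype(np.float32) * scale + lo

# A 768-dimensional float32 vector (3072 bytes) becomes 768 bytes of codes.
vecs = np.random.rand(1000, 768).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)
approx = dequantize(codes, lo, scale)
```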

A Vector Index and a Vector Database?
Features we want to support:
■Automatic replication and crash recovery
■Elastic scaling to >10s of billions of documents
■Automatic partitioning across physical partitions and tiers of storage
■Support for incremental changes

Should a database natively support fast, cost-effective vector search, or does this require replicating data to an external system outside the database?

An Integrated Vector Database
DiskANN (vector index):
■Efficient and accurate vector search from 10K to >50 billion point indices
■Support for multiple tiers of memory and storage
■Algorithmic innovations for dynamic updates and queries filtered by logical predicate

Azure Cosmos DB NoSQL:
■Planet-scale collections with elastic scaling and automatic partitioning
■Flexible cost and memory models
■Database fundamentals: automatic crash recovery and replication, support for multi-tenancy
In our paper Cost-Effective, Low Latency Vector Search with Azure
Cosmos DB we provide a point in favor of integrated vector databases by
deeply integrating DiskANN with Azure Cosmos DB.

Our Results

We overcome the algorithmic and performance challenges of this integration to achieve:
■<20 ms query latency at the 10 million scale, at a P99 cost of $17.50/million queries
■Scale-out to 1 billion vectors with <150 ms client latency
■Robust accuracy under a stream of updates
■Support for efficient filter predicate search
■Multi-tenant design where the number of partition keys or the vectors per partition can grow independently to large numbers
■Auto-scaling cost structure

Outline
I. Introduction
II. Background on DiskANN and Azure Cosmos DB
III. Challenges of the Integration
IV. Experimental Results


DiskANN
Index: directed graph with one
vertex per data point/embedding

For query q: greedily search the graph from a designated start point to converge to the answer (sketched below)

Supports:
■Fast and accurate search,
filtered search
■Point insertions and
deletions
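To make the traversal concrete, here is a minimal greedy (beam) search sketch over an in-memory adjacency-list graph; the plain-dict graph, the `beam_width` parameter, and the brute-force distance computation are assumptions for illustration, not the DiskANN library's API.

```python
import numpy as np

def greedy_search(graph, vectors, query, start, k=10, beam_width=100):
    """Greedy beam search: repeatedly expand the closest unexpanded candidate,
    keeping at most `beam_width` candidates (illustrative sketch)."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    candidates = {start: dist(start)}   # current beam: node -> distance
    visited = set()                     # nodes whose neighbors were expanded

    while True:
        unexpanded = [n for n in candidates if n not in visited]
        if not unexpanded:
            break
        node = min(unexpanded, key=candidates.get)
        visited.add(node)
        for nbr in graph[node]:          # graph: {node_id: [neighbor ids]}
            if nbr not in candidates:
                candidates[nbr] = dist(nbr)
        # Trim the beam back down to the closest `beam_width` nodes.
        if len(candidates) > beam_width:
            keep = sorted(candidates, key=candidates.get)[:beam_width]
            candidates = {n: candidates[n] for n in keep}

    return sorted(candidates, key=candidates.get)[:k]
```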

Bw-Tree
The Bw-Tree is a B-tree variant used for
indexing in Azure Cosmos DB. Its
features include:
■Latch-free design
■Log-structured B-tree index
■Updates persisted as deltas in
batches flushed to disk, with periodic
consolidation

[Diagram: a Bw-tree page with a chain of delta records awaiting consolidation]
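A toy sketch of the delta-chain idea, assuming a simple key/value page model rather than the actual Cosmos DB implementation: updates are prepended as delta records and folded into a new base page once the chain exceeds a length limit (the experiments later in the deck cap the chain at 15).

```python
class BwTreePage:
    """Toy model of a Bw-tree page: a base key/value dict plus a chain of
    delta records, consolidated when the chain exceeds a length limit."""

    def __init__(self, base=None, max_chain=15):
        self.base = dict(base or {})
        self.deltas = []          # newest delta first
        self.max_chain = max_chain

    def upsert(self, key, value):
        # Prepending a delta avoids rewriting the base page in place.
        self.deltas.insert(0, ("upsert", key, value))
        if len(self.deltas) > self.max_chain:
            self.consolidate()

    def delete(self, key):
        self.deltas.insert(0, ("delete", key, None))
        if len(self.deltas) > self.max_chain:
            self.consolidate()

    def lookup(self, key):
        # Walk the delta chain newest-to-oldest before falling back to base.
        for op, k, v in self.deltas:
            if k == key:
                return None if op == "delete" else v
        return self.base.get(key)

    def consolidate(self):
        # Fold deltas (oldest first) into a fresh base page.
        for op, k, v in reversed(self.deltas):
            if op == "delete":
                self.base.pop(k, None)
            else:
                self.base[k] = v
        self.deltas = []
```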

Outline
I. Introduction
II. Background on DiskANN and Azure Cosmos DB
III. Challenges of the Integration
IV. Experimental Results

DiskANN Index Design
■Index consists of three terms:
quantized vectors, adjacency lists,
and full precision vectors
■Indexed by document id
■Typical values for a ~10M index
with 768-dimensional float
embeddings: 30 GB full-precision,
2 GB quantized, 1.3 GB graph
■Adjacency list term on Bw-tree (small, updated most frequently)
■Quantized vector term on Bw-tree (small, updated infrequently)
■Full vector term in DocStore (largest, updated least frequently)
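As a mental model of this layout, here is a minimal sketch of the three per-document terms keyed by document id; the class and field names are illustrative assumptions, not the actual storage schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToyVectorIndex:
    """Illustrative view of the three index terms, all keyed by document id."""
    # Small, frequently updated: bounded-degree adjacency lists on the Bw-tree.
    adjacency_term: Dict[str, List[str]] = field(default_factory=dict)
    # Small, rarely updated: compressed vectors (e.g. ~192 bytes each) on the Bw-tree.
    quantized_term: Dict[str, bytes] = field(default_factory=dict)
    # Largest, least frequently updated: full-precision vectors in the DocStore,
    # read only for re-ranking.
    full_vector_term: Dict[str, List[float]] = field(default_factory=dict)
```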

Operations with Limited Memory
■Challenge 1: can't store full vectors in cache, and can't afford too many SSD round trips
■Queries are processed reading only the adjacency list term and the quantized term, then re-ranked with the full-precision term
■We determine experimentally that inserts and deletes can be done entirely with quantized vectors
[Figure: top-10 candidates computed with compressed data are re-ranked with full-precision data; legend marks the true top-4 nearest neighbors]

Real-world numbers: 3500 quantized lookups, 50
full-precision lookups for one query
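A minimal sketch of the two-phase query described above: cheap distances over quantized vectors select a shortlist, and only that shortlist is fetched at full precision for re-ranking. The helper callables stand in for the real graph search and document-store reads and are assumptions for the example.

```python
import numpy as np

def query_with_rerank(query, quantized_dist, fetch_full_vector, candidate_ids,
                      rerank_size=50, k=10):
    """Score candidates with cheap quantized distances, then re-rank the best
    `rerank_size` of them with full-precision vectors (illustrative only)."""
    # Phase 1: approximate distances from compressed data (cache-resident).
    approx = sorted(candidate_ids, key=lambda i: quantized_dist(query, i))
    shortlist = approx[:rerank_size]

    # Phase 2: a handful of full-precision reads (e.g. ~50 per query) from
    # the document store, used only to reorder the shortlist.
    exact = {i: float(np.linalg.norm(fetch_full_vector(i) - query))
             for i in shortlist}
    return sorted(shortlist, key=exact.get)[:k]
```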

Updates to a Single-Writer Data Structure
■Challenge 2: the Bw-Tree can only be updated by a single writer, but we need parallelism over inserts/deletes
■Inserts and deletes require changes to existing graph vertices
■We use the update routine from ParlayANN for parallel updates, and extend this routine to deletions
■Updates to the adjacency list term are computed with no write-locking and internal parallelism, then persisted to the Bw-Tree with a single thread (see the sketch below)
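One generic way to reconcile parallel update computation with a single-writer store is a producer/consumer hand-off: workers compute new adjacency lists in parallel while exactly one thread applies them. The sketch below assumes that pattern and is not the Cosmos DB code; the callables are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def apply_updates_single_writer(updates, compute_new_adjacency, persist_to_store,
                                num_workers=8):
    """Compute graph updates in parallel, persist them with exactly one writer."""
    pending = queue.Queue()
    DONE = object()

    def writer():
        # The only thread that touches the (single-writer) store.
        while True:
            item = pending.get()
            if item is DONE:
                break
            node_id, new_neighbors = item
            persist_to_store(node_id, new_neighbors)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()

    # Workers do the expensive part (searches, pruning) without write locks.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for node_id, new_neighbors in pool.map(compute_new_adjacency, updates):
            pending.put((node_id, new_neighbors))

    pending.put(DONE)
    writer_thread.join()
```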

Maintaining Stability over Updates
■Challenge 3: how to maintain stable recall over updates in a Bw-Tree index?
■The previous deletion policy required waiting until a critical mass of deletions (~20% of the index) built up
■We utilize new work on in-place deletions, with some changes to the algorithm for our specific case
[Figure: recall over time, comparing deletion policies over a sequence of inserts and deletes]

A Look at Prior Work
Additional papers used to bring this work together:
■FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search [SJKS‘21]
■ParlayANN: Scalable and Deterministic Parallel Graph Based Approximate
Nearest Neighbor Search Algorithms [MSBDGSS‘24]
■In-Place Updates of a Graph Index for Streaming Approximate Nearest
Neighbor Search [XMBCWS‘25]

Outline
I. Introduction
II. Background on DiskANN and Azure Cosmos DB
III. Challenges of the Integration
IV. Experimental Results

Experimental Setup
Unless otherwise indicated:
■Partition size limit 50 GB, Bw-tree max chain length 15
■Graph built with parameters R=32 (maximum graph degree), L=100 (candidate list size), slack=1.3
■Bw-tree cache contains adjacency lists and quantized terms
■Warmup phase for queries before measuring
■All benchmark datasets are publicly available at
https://github.com/harsha-simhadri/big-ann-benchmarks

Query Latency and RU
■Compare RU (a proxy for compute cost, at $0.25/1 million RU) and latency for different dataset sizes
■Query latency and RU cost increase only ~2x even when index size increases 100x
■Very little change in performance when
vectors are ~7x larger
■Using P99 RU figures, $17.50/1 million
queries (~$20 monthly storage cost for a
10M size index)
Wikipedia Cohere dataset (768 float dimensions,
compressed to 192 bytes/vector), ~95% recall
MSTuring Web dataset (100 float dimensions,
compressed to 50 bytes/vector), ~90% recall
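A rough back-of-the-envelope check, using the quoted rate of $0.25 per million RU: $17.50 per million queries / $0.25 per million RU ≈ 70 RU per query at P99.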

Insert Latency
■Insert requires a query in the
quantized space, then additional work
to integrate the new node into the
graph
■Most of the cost is reading quantized
vectors
■Vector read cost increases as size of
index increases
MSTuring-1M ingestion time breakdown (ms), single
threaded

Billion Scale Partitioned Index
■The billion-scale index on the MSTuring-1B dataset spans 50 partitions
■Server latency = per partition, client
latency = end-to-end (proportional to
worst partition)
■Recall 83.76%, 89.61%, and 93.89% for
L=50, 100, 200 respectively
■Worst latency still under 150 ms
MSTuring-1B dataset (100 floating point
dimensions, compressed to 50 bytes/vector)
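The client-latency behavior follows from the fan-out pattern: the query is sent to every partition, per-partition top-k lists are merged, and end-to-end time tracks the slowest partition. Below is a generic scatter-gather sketch under that assumption; the `partitions[i].search` interface is hypothetical, not the service's API.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def fanout_query(partitions, query, k=10):
    """Send the query to every partition in parallel, then merge the
    per-partition top-k lists; latency tracks the slowest partition."""
    def search_one(p):
        # Each partition returns [(distance, doc_id), ...] for its local top-k.
        return p.search(query, k)

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = list(pool.map(search_one, partitions))

    # Global top-k across all partitions, smallest distances first.
    return heapq.nsmallest(k, (hit for hits in per_partition for hit in hits))
```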

Conclusion
By deeply integrating DiskANN with Azure Cosmos DB NoSQL and incorporating
the latest research to address the accompanying technical challenges, we
achieved:
■A commercially available, cost-effective vector database inheriting the
technical capabilities of the DiskANN vector index and the flexibility, support,
and scale of Azure Cosmos DB
■<20 ms query latency at the 10 million scale, <150 ms query latency at the
billion scale

Thank you! Let’s connect.
Magdalen Manohar
[email protected],
[email protected]
magdalendobson.github.io