Elevating PostgreSQL: Benchmarking Vector Search Performance

ScyllaDB 306 views 35 slides Oct 15, 2024

About This Presentation

PostgreSQL continues to evolve with vector search extensions like pgvector and pgvecto.rs. We'll explore recent benchmarks comparing vector search performance across various datasets and configurations, highlighting PostgreSQL's adaptability in modern use cases. #PostgreSQL #VectorSearch


Slide Content

A ScyllaDB Community
Elevating PostgreSQL:
Benchmarking Vector Search
Performance
Daniel Seybold
Co-Founder of benchANT

Daniel Seybold (he/him)

Co-Founder at benchANT
■PhD in benchmarking cloud and database systems
■All about distributed systems, databases and cloud
■Enjoys demystifying the black art of database benchmarking
■Loves every kind of racket sport

Database Hot Topics

Agenda
■The Vector Databases Landscape (Trending topics in the database world)
■Benchmarking Vector Databases
■PostgreSQL Vector Benchmark Results
■Takeaways

Hot Topics in the Database World

Hot Topics in the Database World
Google Trends for the search term “vector database” over the last five years

The Vector Database Landscape
native vector databases | general-purpose databases with vector support

The Elephant in the Room
https://www.timescale.com/blog/postgres-the-birdhorse-of-databases/

PostgreSQL for Vector
Search

What is a vector database (I/II)
■stores data as vectors
■stores embeddings (i.e. vectors) together with original data (e.g. text, images)
■each vector represents a data point using n dimensions
■embeddings are typically created outside of the database
●in general, the higher the number of dimensions, the better the quality
■provides similarity search capabilities via algorithms of the Approximate
Nearest Neighbour (ANN) class
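
To make "similarity search" concrete, here is a minimal sketch of the exact (brute-force) search that ANN algorithms approximate. It is not taken from the talk; the ids, the toy 3-dimensional embeddings, and the helper names are hypothetical, and cosine distance is just one of several common metrics.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, items, k):
    """Brute-force search: rank every stored item by its distance to the query."""
    ranked = sorted(items, key=lambda item: cosine_distance(query, item[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical toy data: (id, embedding) pairs with 3 dimensions.
items = [
    ("doc-a", [1.0, 0.0, 0.0]),
    ("doc-b", [0.9, 0.1, 0.0]),
    ("doc-c", [0.0, 1.0, 0.0]),
    ("doc-d", [0.0, 0.0, 1.0]),
]
nearest = exact_top_k([1.0, 0.05, 0.0], items, k=2)
```

The exact search compares the query against every stored vector, which is why it does not scale to millions of vectors and why ANN indexes trade a little accuracy for speed.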

What is a vector database (II/II)
■vector data needs to be indexed to enable efficient lookups
■various indexing algorithms are available
●Inverted File Index (IVF)
●Hierarchical Navigable Small World (HNSW) graphs
●and many more
■for more details on vector databases see
https://thedataquarry.com/posts/vector-db-1/
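
As a rough illustration of the IVF idea, the sketch below partitions vectors into inverted lists by nearest centroid and scans only the closest lists at query time. It is a simplified toy and not the implementation of any PostgreSQL extension: the centroids are assumed to be given (a real IVF index typically learns them with k-means), and the names `build_ivf`, `ivf_search` and `nprobe` are chosen for this example.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [v for c in probe_order[:nprobe] for v in lists[c]]
    candidates.sort(key=lambda v: l2(query, vectors[v]))
    return candidates[:k]

# Hypothetical 2-dimensional toy data with two obvious clusters.
vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [0.3, 0.1]]
centroids = [[0.0, 0.0], [5.0, 5.0]]  # assumed given, not learned here
lists = build_ivf(vectors, centroids)
result = ivf_search([0.2, 0.1], vectors, centroids, lists, k=2, nprobe=1)
```

Raising `nprobe` scans more lists, which tends to improve recall at the cost of latency; this is the kind of index-specific parameter that a benchmark has to fix or vary deliberately.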

PostgreSQL Vector Search Extensions

■pgvector (0.7.4)
■pgvecto.rs (0.3.0)
■pgvectorscale (0.3.0)
■lantern (0.3.2)
■and probably many more

PostgreSQL Vector Search Extensions
extension      index types                      quantization                     additional details
pgvector       IVFFLAT, HNSW                    binary                           most popular vector extension
pgvecto.rs     IVFFLAT, HNSW                    scalar, product                  supports up to 65535 dimensions
pgvectorscale  IVFFLAT, HNSW, StreamingDiskANN  Statistical Binary Quantization  extends pgvector
lantern        HNSW                             scalar, binary, product          index creation outside of the database instance
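
For pgvector specifically, the two index types map to `CREATE INDEX` statements roughly as sketched below. The table name `items`, the column name `embedding`, and the helper `pgvector_index_sql` are hypothetical, and the parameter values are illustrative rather than the settings used in the benchmarks of this talk. The other extensions use their own syntax, which is not covered here.

```python
def pgvector_index_sql(table, column, index_type, **params):
    """Build a pgvector CREATE INDEX statement for cosine distance.

    Illustration only: identifiers are interpolated directly, so this helper
    must not be used with untrusted input.
    """
    if index_type not in ("hnsw", "ivfflat"):
        raise ValueError(f"unsupported pgvector index type: {index_type}")
    options = ", ".join(f"{name} = {value}" for name, value in params.items())
    with_clause = f" WITH ({options})" if options else ""
    return (
        f"CREATE INDEX ON {table} USING {index_type} "
        f"({column} vector_cosine_ops){with_clause};"
    )

# Example parameter values only; suitable values depend on the data set.
hnsw_sql = pgvector_index_sql("items", "embedding", "hnsw", m=16, ef_construction=64)
ivfflat_sql = pgvector_index_sql("items", "embedding", "ivfflat", lists=100)
```

Running the generated statements requires a PostgreSQL instance with the pgvector extension installed and a table that has a `vector` column.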

PostgreSQL Vector Search Extensions
■Which index type fits best for my target data set?
■Which throughput and latency numbers can be expected?
■Which PostgreSQL extension provides the best performance?

Benchmarking Vector
Databases

Why Benchmark Vector Databases
■Comparing the performance of a native vector database with a general
purpose database with vector support

■Get a general understanding of the vector search performance for your target
data sets

■Exploring the performance impact of resource and database layer knobs for
vector search use cases

From a Real-World Application to a Benchmark
■Building a synthetic vector search benchmark is more straightforward than
for many OLTP/HTAP/OLAP applications

■Important data set parameters:
●# of vectors
●vector dimensions

■Important query parameters:
●filtering
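
The two data set parameters directly determine the raw size of the vector data, which is a useful sanity check before sizing a benchmark machine. The sketch below assumes 4-byte float values and ignores per-row and index overhead; the 1M x 768 example is an assumed data set shape chosen for illustration, not a figure taken from the slides.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw payload size of the vectors alone, without row or index overhead."""
    return num_vectors * dimensions * bytes_per_value

def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30

# Assumed example shape: one million vectors with 768 dimensions each.
size_bytes = raw_vector_bytes(1_000_000, 768)
size_gib = to_gib(size_bytes)
```

Under these assumptions the raw vectors occupy a bit under 3 GiB; the index built on top of them adds to that, by an amount that depends on the index type and its parameters.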

Vector Database Benchmark Suites
■ANN-Benchmark
■Big-ANN-Benchmark
■pgvectorbench
■Qdrant vector-db-benchmark
■VectorDBBench

For a continuously updated list of database benchmarks:
https://benchant.com/blog/benchmarking-suites

Vector Database Benchmark Metrics
■throughput
●ingestion
●search
■latency
■recall — search quality
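
These metrics can be computed from raw measurements as in the following sketch. The function names and sample values are made up for illustration, and the nearest-rank percentile is only one of several common percentile definitions; benchmark suites may define these metrics slightly differently.

```python
import math

def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k neighbours that the ANN search returned."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def throughput(num_operations, elapsed_seconds):
    """Operations per second, for either ingestion or search."""
    return num_operations / elapsed_seconds

# Made-up sample values.
recall = recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4)
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p99 = percentile(latencies_ms, 99)
qps = throughput(3000, 60.0)
```

Recall has to be reported alongside throughput and latency: an index can be made arbitrarily fast by returning worse results, so performance numbers without recall are not comparable.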

PostgreSQL Vector Search
Benchmarks

Benchmarking Objectives
■knobs and bolts to consider when running vector search benchmarks or
reading reports (in general and for PostgreSQL)

■performance impact of using different index types on different data sets

■baseline performance numbers for PostgreSQL with pgvector and pgvecto.rs

Out of Scope Objectives (for this Talk)
■comparative benchmarks against other vector databases
■in-depth benchmarking for each PostgreSQL vector extension
●analyzing index-specific parameters
●analyzing extension-specific parameters
■optimizing hardware and PostgreSQL configuration for vector search
●analyzing compute and storage resources
●analyzing PostgreSQL configuration options

Benchmark Methodology
■benchmark suite: VectorDBBench
●benchANT fork with extensions
■benchANT framework for benchmark execution
●automated database benchmarking in the cloud
●TODO: add ref to paper/blog
■benchmark setup: OVH, single node, config by DBTune
■data is available on GitHub

Benchmark Setup
■PostgreSQL Version 16
●pgvector 0.7.4
●pgvecto.rs 0.3.0
■IaaS: OVH VM with 16 vCores/64GB RAM
■VectorDBBench
●based on 0.12.0
●benchANT fork https://github.com/benchANT/vectordbbench
■CLI support
■index creation during optimize step

VectorDBBench Workflow
def load():
    # download vector data set
    # create database
    # create index (option I)
    # ingest into database (single threaded)
    ...

def optimize():
    # create index (option II)
    # apply DB specific tuning
    ...

def query():
    # execute search query with 1..30 threads
    ...
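
The query step can be pictured as a sweep over concurrency levels, sketched below. This is not how VectorDBBench implements it: `fake_search` is a stand-in that sleeps instead of contacting a database, and the thread counts and query count are arbitrary example values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_search(query_id):
    """Stand-in for a database round trip; a real run would issue a vector query."""
    time.sleep(0.001)
    return query_id

def run_concurrency_level(search, queries, threads):
    """Run all queries on a pool of `threads` workers and return queries per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(search, queries))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed

queries = list(range(200))
# Sweep a few example concurrency levels and record the throughput of each.
qps_by_threads = {t: run_concurrency_level(fake_search, queries, t) for t in (1, 5, 10)}
```

Reporting throughput per concurrency level, rather than a single number, shows where the database stops scaling with additional client threads.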

Benchmark Results: Index Types

Benchmark Results: Index Types

Benchmark Results: Index Types

Benchmark Results: Index Creation Timing

Benchmark Results: Index Creation Timing

pgvecto.rs Index Type and Quantization
Performance

pgvector vs. pgvecto.rs Performance 1M Cohere

Takeaways

Takeaways
■The vector database market is rapidly growing
■The PostgreSQL ecosystem provides different vector search extensions
■The vector search extensions differ in their supported indexes, index creation
approaches and additional features but also in their performance
■Benchmarking vector databases supports the selection of the right
technology for your use case
■Vector database benchmarking requires the consideration of new workload
parameters to create meaningful benchmark results

Thank you! Let’s connect.
Daniel Seybold
[email protected]
https://benchant.org/ — DataScaleFail Newsletter