Mastering Vector Search with MongoDB Atlas - Manosh Malai - Mydbops MyWebinar 39

MyDBOPS 37 views 31 slides Mar 04, 2025

About This Presentation

In this session, explore how to harness MongoDB's native vector search capabilities to enhance your database and search functionality. From the basics to advanced techniques, gain insights into building intelligent...


Slide Content

Manosh Malai
CTO, Mydbops LLP
Mydbops MyWebinar Edition 39

Implementing Vector Search
with MongoDB

About Me
Manosh Malai
❏Interested in Open Source technologies
❏Interested in MongoDB, DevOps & DevSecOps Practices
❏Tech Speaker/Blogger
❏MongoDB User Group Leader (Bangalore)

Consulting Services
Managed Services
❏Database Management and Consultancy Provider
❏Founded in 2016
❏Assisted 800+ happy customers
❏AWS Service Delivery Partner(RDS)
❏PCI & ISO Certified
About Us

❏Introduction to Vector Search
❏Storing Vector Embeddings in MongoDB
❏Creating and Managing Vector Indexes
❏Demo
❏Best Practices
Agenda

Introduction to
Vector Search

Definition
●Vector: A numerical representation of unstructured data such as text, images, and audio.
Key Characteristics
●Numerical Array: Stored as an array of floating-point numbers.
●Dimensions: Each value in the array represents a distinct dimension in a high-dimensional space.
Benefits
●Flexibility: Enables efficient search and computation on unstructured data.
●Semantic Understanding: Captures the underlying meaning and relationships within the data.

What is a Vector?

plot: 'Michael "Beau" Geste leaves England in disgrace and joins the infamous French Foreign Legion. He is reunited with his two brothers in North Africa, where they face greater danger from their...',


plot_embedding: [
0.00023330493, -0.028511643, 0.014653289, -0.03847482,
-0.016243158, 0.049179934, -0.0020221141, 0.0025272286,
-0.0033271313, -0.011612665, 0.007803605, 0.01923741,
0.020072091, 0.01717058, 0.017515052, -0.01136756,
0.030631468, -0.004259523, 0.021237994, 0.008618413,
-0.0020138335, 0.015156747, 0.013606625, -0.00086035073,
0.026100343, -0.019568631, 0.011983634, -0.015501219,
0.016680371, -0.014057088, 0.0033884074, -0.014918267,
-0.008452801, -0.009870434, -0.029545058, 0.005097516,
0.010373892, -0.010135412, 0.020628545, -0.006955675,
0.019648125, 0.011559669, 0.0065548955, 0.0037461277,
-0.022350902, 0.010850853, 0.017727034, -0.031240918,
-0.012672577, 0.008605164, 0.023543304, 0.010930346,
-0.0034579642, -0.005153824, 0.010506381, 0.00008958537,
-0.0071676574, 0.0039349245, 0.0075916224, -0.015673455,
0.016653873, -0.0125599615, -0.0061541162, 0.0019691184,
0.00022212617, -0.003135022, -0.017250074, -0.0037229422,
-0.03556006, -0.008081832, 0.039614227, 0.020496055,
0.010446762, -0.014666538, 0.042052023, -0.006008378,
-0.031717878, -0.005037896, -0.014057088, 0.017859524,
0.0005349245, -0.025848612, -0.025013933, 0.020111836,
0.011102582, -0.00565397, -0.010254652, 0.01251359,
-0.013414516, -0.033334244, -0.008605164, 0.016057672,
0.00064091576, 0.007671116, -0.018694205, 0.0020767658,
-0.027186753, 0.014322066, 0.008969508, -0.01414983,
... 1436 more items
]

Unstructured Data & Vector

How Are Vector Embeddings Created?

Process Overview
●Source Data: Begin with unstructured data such as text, images, or audio.
●Embedding Models: Utilize machine learning models to transform this data into vector
embeddings.
Example: OpenAI’s embedding models (e.g., text-embedding-ada-002 for text, CLIP for images) generate vector embeddings from unstructured data.
Steps Involved
●Data Input: Input the raw unstructured data into the embedding model.
●Model Processing: The model analyzes and encodes the data into a high-dimensional vector.
●Output Vector: The result is a numerical array representing the original data.

Storing Vector
Embeddings In MongoDB

{
_id: ObjectId('65fda3274c505f1555d87c37'),
plot: 'Michael "Beau" Geste leaves England in disgrace and joins the infamous French Foreign Legion. He is reunited with his two brothers in North Africa, where they face greater danger from their...',
genres: [ 'Action', 'Adventure', 'Drama' ],
runtime: 101,
cast: [ 'Ronald Colman', 'Neil Hamilton', 'Ralph Forbes', 'Alice Joyce' ],
num_mflix_comments: 0,
title: 'Beau Geste',
fullplot: 'Michael "Beau" Geste leaves England in disgrace and joins the infamous French Foreign Legion. He is reunited with his two brothers in North Africa, where they face greater danger from their own sadistic commander than from the rebellious Arabs.',
languages: [ 'English' ],
directors: [ 'Herbert Brenon' ],
writers: [
'Herbert Brenon (adaptation)',
'John Russell (adaptation)',
'Paul Schofield',
'Percival Christopher Wren (novel)'
],
awards: { wins: 1, nominations: 0, text: '1 win.' },
imdb: { rating: 6.9, votes: 222, id: 16634 },
countries: [ 'USA' ],
type: 'movie',
plot_embedding: [
0.00023330493, -0.028511643, 0.014653289, -0.03847482,
-0.016243158, 0.049179934, -0.0020221141, 0.0025272286,
-0.0033271313, -0.011612665, 0.007803605, 0.01923741,
0.020072091, 0.01717058, 0.017515052, -0.01136756,
0.030631468, -0.004259523, 0.021237994, 0.008618413,
-0.0020138335, 0.015156747, 0.013606625, -0.00086035073,
0.026100343, -0.019568631, 0.011983634, -0.015501219,
0.016680371, -0.014057088, 0.0033884074, -0.014918267,
-0.008452801, -0.009870434, -0.029545058, 0.005097516,
0.010373892, -0.010135412, 0.020628545, -0.006955675,
0.019648125, 0.011559669, 0.0065548955, 0.0037461277,
-0.022350902, 0.010850853, 0.017727034, -0.031240918,
-0.012672577, 0.008605164, 0.023543304, 0.010930346,
-0.0034579642, -0.005153824, 0.010506381, 0.00008958537,
-0.0071676574, 0.0039349245, 0.0075916224, -0.015673455,
0.016653873, -0.0125599615, -0.0061541162, 0.0019691184,
0.00022212617, -0.003135022, -0.017250074, -0.0037229422,
-0.03556006, -0.008081832, 0.039614227, 0.020496055,
0.010446762, -0.014666538, 0.042052023, -0.006008378,
-0.031717878, -0.005037896, -0.014057088, 0.017859524,
0.0005349245, -0.025848612, -0.025013933, 0.020111836,
0.011102582, -0.00565397, -0.010254652, 0.01251359,
-0.013414516, -0.033334244, -0.008605164, 0.016057672,
0.00064091576, 0.007671116, -0.018694205, 0.0020767658,
-0.027186753, 0.014322066, 0.008969508, -0.01414983,
... 1436 more items
]
}

Vector in Document

Dimensionality and Features
●Dimensions: Each dimension in the vector embedding corresponds to specific features or attributes
of the data.
●Features and Attributes: Represent various aspects and characteristics inherent in the data.
Semantic Representation
●Capturing Meaning: Vector embeddings encapsulate the semantic meaning, enabling the system to
understand relationships and similarities between different data points.
Example Interpretation
●Text Embedding: Dimensions might represent syntactic structures, semantic topics, sentiment, etc.
●Image Embedding: Dimensions could correspond to visual features like color patterns, shapes,
textures, etc.
What Do Vector Embeddings Represent?

MongoDB supports sparse and dense vectors. Both are numerical representations of data, but they are computed differently and serve different uses.


Types of Vector
Sparse Vector
●Use Cases: Keyword-based searches, simple text classification
●Calculation: BM25 (a TF-IDF-style weighting) scores each dimension
●Dimensionality: High
●Semantic Understanding: Poor; struggles with synonyms and context
●Computational Efficiency: Efficient for storage; limited for complex queries
Dense Vector
●Use Cases: Contextual search, recommendation systems, NLP tasks
●Calculation: Produced by Transformer models (BERT, GPT, ViT)
●Dimensionality: Lower
●Semantic Understanding: Excellent; understands semantics and nuances
●Computational Efficiency: Efficient for contextual operations; optimized libraries
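The synonym weakness of sparse vectors can be sketched in a few lines. This is a minimal illustration, assuming raw term counts as a stand-in for BM25 weights; the vocabulary and sentences are hypothetical.

```javascript
// Build a sparse term-count vector over a fixed vocabulary.
// Assumption: raw counts stand in for BM25/TF-IDF weights to keep the sketch short.
function sparseVector(text, vocabulary) {
  const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
  return vocabulary.map((term) => tokens.filter((t) => t === term).length);
}

function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Each vocabulary term owns exactly one dimension.
const vocabulary = ["car", "automobile", "fast", "red", "engine"];

const docVec = sparseVector("A fast red car", vocabulary); // [1, 0, 1, 1, 0]
const queryVec = sparseVector("automobile", vocabulary);   // [0, 1, 0, 0, 0]

// "car" and "automobile" sit in different dimensions, so the overlap is zero
// even though the words mean the same thing.
const overlap = dot(docVec, queryVec);
console.log(overlap);
```

A dense embedding model would typically place "car" and "automobile" close together, which is why the deck recommends dense vectors for contextual search.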

●MongoDB supports up to 4096 dimensions, which is adequate for most applications involving text, image, and audio embeddings.
Practical Example
○Text Embedding Example:
○Sentence: “Artificial Intelligence is transforming the world.”
○Embedding Vector:
■"embedding": [0.12, 0.98, 0.45, ..., 0.67] // Up to 4096 values
Sufficiency for Most Models
○Common Embedding Dimensions: Most popular embedding models (e.g., BERT with 768 dimensions, CLIP with 512 dimensions) produce vectors well below the 4096-dimension limit.
○Exceptional Use Cases: For models that exceed the limit, dimensionality reduction techniques (like PCA or UMAP) can be applied to fit within MongoDB’s constraints.
Why 4096 Dimensions?
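Before writing an embedding to a collection, it can help to check it against the limit quoted above. The helper below is a hypothetical client-side check, not part of any MongoDB driver; the name validateEmbedding and its messages are assumptions for illustration.

```javascript
// Limit quoted in this deck; check the current Atlas documentation before relying on it.
const MAX_DIMENSIONS = 4096;

// Hypothetical helper: returns a list of problems. An empty list means the
// embedding looks safe to store for an index built with `numDimensions`.
function validateEmbedding(embedding, numDimensions) {
  if (!Array.isArray(embedding)) return ["embedding must be an array"];
  const problems = [];
  if (numDimensions > MAX_DIMENSIONS) {
    problems.push(`numDimensions ${numDimensions} exceeds ${MAX_DIMENSIONS}`);
  }
  if (embedding.length !== numDimensions) {
    problems.push(`expected ${numDimensions} values, got ${embedding.length}`);
  }
  if (!embedding.every(Number.isFinite)) {
    problems.push("embedding contains a non-numeric or non-finite value");
  }
  return problems;
}

const ok = validateEmbedding(new Array(768).fill(0.1), 768);
const tooLarge = validateEmbedding(new Array(5000).fill(0), 5000);
const notANumber = validateEmbedding([0.1, NaN], 2);
console.log(ok, tooLarge, notANumber);
```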

Creating And Managing
Vector Indexes

Atlas Vector Search Indexes Algorithms: Brute Force O(N)
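Brute-force search is the exact baseline that the approximate algorithms improve on. A minimal sketch, assuming Euclidean distance and an in-memory array of vectors (the sample data is hypothetical):

```javascript
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Exact k-nearest-neighbor search: the query is compared with every one of the
// N stored vectors, so the scan costs O(N) distance computations.
// (Sorting the N distances afterwards adds O(N log N) in this simple version.)
function bruteForceSearch(vectors, query, k) {
  return vectors
    .map((vector, index) => ({ index, distance: euclidean(vector, query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

const vectors = [[0, 0], [1, 1], [5, 5], [0.9, 1.2]];
const nearest = bruteForceSearch(vectors, [1, 1], 2);
console.log(nearest.map((hit) => hit.index)); // indexes of the two closest vectors
```

The result is always correct, but the cost grows linearly with the collection size, which is what motivates graph-based indexes.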

Atlas Vector Search Indexes Algorithms: NSW (Navigable Small World)

Atlas Vector Search Indexes Algorithms: Skip Lists

NSW + Skip Lists = Hierarchical Navigable Small World (HNSW) Graph
Atlas Vector Search Indexes Algorithms: HNSW

A graph-based algorithm used for Approximate Nearest Neighbor (ANN) search.
How HNSW Works?
1.Graph Structure:
a.Multi-layer graph where:
i.Top Layers: Sparse connections for fast global navigation.
ii.Bottom Layer: Dense connections for fine-grained search.
2.Search Process:
a.Start at the top layer and navigate downward, refining results at each level.
3.Tunable Parameters:
a.m: Connections per node (higher = better accuracy, more memory).
b.efConstruction: Build-time accuracy (higher = better index quality).
c.efSearch: Query-time accuracy (higher = better precision, slower search).
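The top-down search process can be sketched as a greedy walk repeated on each layer. This is a simplified illustration, not the Atlas implementation: it tracks a single best node, whereas real HNSW keeps a candidate list whose size is governed by efSearch. The graph, layer layout, and function names are hypothetical.

```javascript
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

// Greedy walk on one layer: move to the neighbor closest to the query and
// stop when no neighbor is closer than the current node.
function greedySearch(vectors, neighbors, entry, query) {
  let current = entry;
  let best = squaredDistance(vectors[current], query);
  while (true) {
    let next = current;
    for (const candidate of neighbors[current]) {
      const d = squaredDistance(vectors[candidate], query);
      if (d < best) {
        best = d;
        next = candidate;
      }
    }
    if (next === current) return current;
    current = next;
  }
}

// Descend from the sparse top layer to the dense bottom layer, reusing the
// node found on each layer as the entry point for the next one.
function layeredSearch(vectors, layers, entry, query) {
  let current = entry;
  for (const neighbors of layers) {
    current = greedySearch(vectors, neighbors, current, query);
  }
  return current;
}

// Six points on a line; the top layer links only nodes 0, 3 and 5.
const vectors = [[0], [2], [4], [6], [8], [10]];
const layers = [
  { 0: [3], 3: [0, 5], 5: [3] },
  { 0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4] },
];
const found = layeredSearch(vectors, layers, 0, [8.3]);
console.log(found); // index of the node reached on the bottom layer
```

The top layer covers most of the distance in two hops, and the bottom layer only needs one more hop to refine the answer.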

Atlas Vector Search Indexes Algorithms: HNSW

●Must be a vector embeddings field
●Currently supports up to 4,096 dimensions
●Cannot index a vector field inside an array of documents
●Supported only on Atlas clusters running MongoDB v6.0.11, v7.0.2, or later
Vector Index Limitations

A vector search index allows you to store and quickly query vector embeddings, optionally combined with
filtered fields to narrow search scope before measuring similarity.
db.<collectionName>.createSearchIndex(
"<index-name>",
"vectorSearch", //index type
{
fields: [
{
"type": "vector",
"numDimensions": <number-of-dimensions>,
"path": "<field-to-index>",
"similarity": "euclidean | cosine | dotProduct"
},
{
"type": "filter",
"path": "<field-to-index>"
},
...
]
}
);

Vector Index

Key Parameters Explained:
●type:
○"vector": Index a field containing vector embeddings.
○"filter": Index a scalar field (e.g., string, number) to pre-filter search results.
●path: The field to index. For nested fields, use dot notation (e.g., "metadata.category").
●numDimensions: The size of each embedding vector. Must match the embedding model output (e.g., 768 or 1536).
●similarity:
○euclidean: Measures the straight-line distance between vectors.
○cosine: Focuses on the angle between vectors (direction rather than magnitude).
○dotProduct: Considers both angle and magnitude; often a good default choice when vectors are normalized to unit length.
●quantization (optional):
○none: Store as-is (float32, etc.).
○scalar: Compress each value to a 1-byte integer for space efficiency.
○binary: Compress even further to a single bit per dimension; use if maximum compression is needed.
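The three similarity options can be compared on a tiny example. The functions below are the textbook formulas written in plain JavaScript; they are a sketch for intuition and do not reproduce how Atlas normalizes its search scores.

```javascript
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

function cosine(a, b) {
  return dotProduct(a, b) / (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));
}

// Scale a vector to unit length.
function normalize(v) {
  const norm = Math.sqrt(dotProduct(v, v));
  return v.map((x) => x / norm);
}

const a = [3, 4];
const b = [6, 8]; // same direction as a, twice the magnitude

console.log(euclidean(a, b));  // non-zero: the points are far apart
console.log(cosine(a, b));     // maximal: the directions are identical
console.log(dotProduct(a, b)); // grows with magnitude as well as alignment
console.log(dotProduct(normalize(a), normalize(b))); // approximately the cosine value
```

Once both vectors are normalized, dot product and cosine agree up to floating-point error, which is why dotProduct is usually paired with unit-length embeddings.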

Vector Index

Filters reduce the search space before running similarity checks. This speeds up queries and
ensures more contextually relevant results.

How It Works:
1.Apply a filter to narrow down documents (e.g., status: "active").
2.Among the filtered set, find the top-k most similar embeddings.

Example
{
"fields": [
{
"type": "vector",
"path": "description_embedding",
"numDimensions": 768,
"similarity": "dotProduct"
},
{
"type": "filter",
"path": "category"
},
{
"type": "filter",
"path": "status"
}
]
}
Result:
A vector search can now consider only documents where category = "electronics" and status =
"active", returning relevant matches faster.

Using type: "filter" for Better Results
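The two-step behaviour described on this slide (filter first, then rank by similarity) can be simulated in memory. This sketch only illustrates the semantics; it is not how Atlas executes the query, and the sample documents and the filteredSearch helper are hypothetical.

```javascript
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Step 1: keep only documents matching every equality condition in `filter`.
// Step 2: rank the remaining documents by similarity and return the top k.
function filteredSearch(docs, queryVector, filter, k) {
  return docs
    .filter((doc) => Object.entries(filter).every(([field, value]) => doc[field] === value))
    .map((doc) => ({ _id: doc._id, score: dotProduct(doc.description_embedding, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const docs = [
  { _id: 1, category: "electronics", status: "active", description_embedding: [0.9, 0.1] },
  { _id: 2, category: "electronics", status: "inactive", description_embedding: [1.0, 0.0] },
  { _id: 3, category: "furniture", status: "active", description_embedding: [0.95, 0.05] },
  { _id: 4, category: "electronics", status: "active", description_embedding: [0.2, 0.8] },
];

const results = filteredSearch(docs, [1, 0], { category: "electronics", status: "active" }, 2);
console.log(results.map((r) => r._id));
```

Documents 2 and 3 are more similar to the query than document 4, but they never reach the ranking step because the filter excludes them first.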

db.<collectionName>.updateSearchIndex(
"<index-name>",
{
fields: [
{
"type": "vector",
"numDimensions": <number-of-dimensions>,
"path": "<field-to-index>",
"similarity": "euclidean | cosine | dotProduct"
},
{
"type": "filter",
"path": "<field-to-index>"
},
...
] } );

db.<collectionName>.dropSearchIndex( "<index-name>" );

Update and Delete Vector Index

Demo

Best Practices

Pre-Convert Embeddings: Store as BSON float32 or int8/int1 for efficiency.
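To see what the int8 and int1 encodings trade away, here is a minimal sketch of scalar and binary quantization. It assumes a simple per-vector min-max scheme chosen for illustration; the quantization Atlas applies internally, and the BSON encoding helpers in each driver, may work differently.

```javascript
// Scalar quantization: map each float to an integer in [-128, 127] using the
// vector's own minimum and maximum (an assumption made for this sketch).
function quantizeInt8(vector) {
  const min = Math.min(...vector);
  const max = Math.max(...vector);
  const scale = (max - min) / 255 || 1; // fall back to 1 for constant vectors
  const values = Int8Array.from(vector, (x) => Math.round((x - min) / scale) - 128);
  return { values, min, scale };
}

// Approximate reconstruction; the error per value is about half a step.
function dequantizeInt8({ values, min, scale }) {
  return Array.from(values, (q) => (q + 128) * scale + min);
}

// Binary quantization: keep only one bit per dimension (the sign).
function quantizeBinary(vector) {
  return vector.map((x) => (x > 0 ? 1 : 0));
}

const original = [-0.5, 0, 0.25, 0.5];
const quantized = quantizeInt8(original);
const restored = dequantizeInt8(quantized);
const bits = quantizeBinary(original);
console.log(quantized.values, restored, bits);
```

Each float32 value takes 4 bytes, so int8 storage is roughly 4 times smaller and 1-bit storage roughly 32 times smaller, at the cost of the rounding error visible in the restored values.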

●Use the Right Similarity Metric: Check your embedding model's documentation and use the metric it specifies.
○Cosine: For text embeddings and semantic similarity.
○Euclidean: For spatial distance measurements.
○Dot Product: For ranking or recommendation systems.
●Index Type Configuration
○Explicitly Set type: vector: Ensure vector fields are defined as type: vector in the search index.
○Include type: filter: Optimize non-vector fields (e.g., genres, tags) for metadata filtering during
queries.

Best Practices & Query Example

●Tune Query Parameters
○numCandidates:
■Higher values improve accuracy but slow down queries.
■Start with 50-100 and adjust as needed.
○limit:
■Specify the number of nearest neighbors to return (e.g., limit: 5).
●Use Filters Wisely: Index frequently used filter fields to improve both relevance and speed.
{
"$vectorSearch": {
"index": "<index-name>",
"path": "description_embedding",
"queryVector": [0.12, 0.34, 0.56, ...],
"numCandidates": 100,
"limit": 5,
"filter": {
"category": "electronics",
"status": "active"
}
}
}
Best Practices & Query Example
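One way to keep these tuning parameters consistent is to assemble the pipeline in a small helper. The sketch below assumes the documented $vectorSearch stage fields (index, path, queryVector, numCandidates, limit, filter) and the vectorSearchScore metadata value; the helper name and the index name "vector_index" are hypothetical.

```javascript
// Hypothetical helper that assembles an aggregation pipeline suitable for
// db.<collectionName>.aggregate(pipeline) in mongosh or a driver.
function buildVectorSearchPipeline({ index, path, queryVector, limit = 10, numCandidates = 100, filter }) {
  if (numCandidates < limit) {
    throw new RangeError("numCandidates must be greater than or equal to limit");
  }
  const stage = { index, path, queryVector, numCandidates, limit };
  if (filter) stage.filter = filter; // only fields indexed as type "filter" can be used here
  return [
    { $vectorSearch: stage },
    // Expose the similarity score alongside each returned document id.
    { $project: { _id: 1, score: { $meta: "vectorSearchScore" } } },
  ];
}

const pipeline = buildVectorSearchPipeline({
  index: "vector_index",          // hypothetical index name
  path: "description_embedding",
  queryVector: [0.12, 0.34, 0.56], // shortened; a real vector needs numDimensions values
  limit: 5,
  numCandidates: 100,
  filter: { category: "electronics", status: "active" },
});
console.log(JSON.stringify(pipeline, null, 2));
```

Raising numCandidates widens the pool of neighbors considered before the top `limit` results are returned, which is the accuracy-versus-latency trade-off described in the bullets above.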

●Monitor Resource Usage
○Memory: Higher numCandidates and dimensions consume more memory.
○Indexing Time: Larger datasets and high efConstruction values slow down index creation.
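A rough size estimate helps when planning memory. The sketch below counts only the raw vector payload, under the assumption of 4 bytes per float32 value, 1 byte per int8 value, and 1 bit per binary value; it ignores HNSW graph links, filter fields, and document overhead, so real usage will be higher.

```javascript
const BYTES_PER_VALUE = { float32: 4, int8: 1, binary: 1 / 8 };

// Raw vector payload only (assumption: no graph or document overhead included).
function estimateVectorBytes(numVectors, numDimensions, encoding = "float32") {
  return numVectors * numDimensions * BYTES_PER_VALUE[encoding];
}

const GIB = 1024 ** 3;
for (const encoding of ["float32", "int8", "binary"]) {
  const bytes = estimateVectorBytes(1_000_000, 1536, encoding);
  console.log(`${encoding}: ${(bytes / GIB).toFixed(2)} GiB`);
}
```

For one million 1536-dimension embeddings, the float32 payload alone is 6,144,000,000 bytes, which shows why both dimension count and quantization choice matter for sizing.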

Best Practices & Query Example

Thank you!