PgVector+: Enabling Richer Interaction with a Vector Database
About This Presentation
An advanced PostgreSQL extension built on top of pgvector that optimizes vector similarity search for diversified recommendation systems, significantly improving system performance and data-retrieval accuracy for machine learning and AI applications.
Today's Agenda
- Intro to vectors and similarity
- Intro to Postgres extensions
- Intro to pgvector
- Why not dedicated vector databases like Pinecone?
- Motivation
- Use Case 1 and its proposed solution
- Use Case 2 and its proposed solution
- Use Case 3 and its proposed solution
- Use Case 4 and its proposed solution
- Lessons learnt and future scope
- References and links to explore
Vector and Similarity Search
- Vectors: data points represented as multi-dimensional arrays of numbers, capturing various features or attributes.
- Similarity search: finding the data points (neighbors) most similar to a query based on a specific metric (e.g., cosine similarity, Euclidean distance).
- Revolutionizing various industries: recommender systems, fraud detection, content moderation, image and video search.
- Vector databases are a powerful tool, but what about incorporating long-term memory capabilities into neural networks themselves? This is where Memory-Augmented Neural Networks (MANNs) come in; vector databases can serve as a powerful external memory source for MANNs.
- Vector databases are also a natural fit for the external memory component of RAG (retrieval-augmented generation) models.
Postgres Extensions: Introduction
- PostgreSQL (PG) is a robust relational database with extensive features, including SQL-standards compliance, support for various extensions, and active community development.
- PG's flexible, open-source license allows for the development of custom solutions and integration with third-party vendors.
- PostgreSQL extensions are add-on modules that enhance the functionality of the database system. Extensions are installed using the CREATE EXTENSION command.
- Four basic file types are required for building an extension: a Makefile, a control file, an SQL file, and a source-code file.
Postgres Extensions: More Information
- Makefile: provides instructions for compiling C code, linking it into a shared library, and installing the necessary files into PostgreSQL. Relies on the PGXS infrastructure to simplify building and installing extensions.
- Control file: named [EXTENSION NAME].control; specifies metadata such as name, version, author, module path, and relocatability.
- SQL file: the mapping file that links PostgreSQL functions to the corresponding C functions.
- C code: the most common language for PostgreSQL extensions due to its performance and flexibility, with direct access to PostgreSQL's internal APIs. Must include the PG_MODULE_MAGIC macro so PostgreSQL can identify the extension and read its metadata.
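For illustration, a minimal sketch of what an entry in an extension's SQL mapping file might look like; the function name my_vector_distance and its signature are hypothetical, not taken from any real extension:

```sql
-- Hypothetical entry in an extension's SQL file: it exposes a C function
-- (my_vector_distance, compiled into the extension's shared library) as a
-- SQL-callable function. MODULE_PATHNAME is substituted with the library
-- path declared in the control file.
CREATE FUNCTION my_vector_distance(real[], real[]) RETURNS float8
AS 'MODULE_PATHNAME', 'my_vector_distance'
LANGUAGE C IMMUTABLE STRICT;
```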
Introduction to PgVector
- pgvector is an open-source extension for PostgreSQL, specifically engineered to streamline vector similarity search. It lets users store and query vector data, a fundamental component of many machine learning and AI applications.
- By harnessing PostgreSQL's robust querying capabilities, data-security features, and extensive ecosystem of tools and libraries, pgvector delivers a flexible and powerful solution for managing vector data.
- pgvector allows you to create UDFs written in languages such as C or PL/pgSQL. These functions can implement custom distance metrics or vector-manipulation logic that extend pgvector's built-in functionality.
- pgvector provides distance operators for vector columns (e.g., <-> for Euclidean distance, <#> for negative inner product, <=> for cosine distance); when used in queries, these operators compute the chosen distance metric between the query vector and the vectors in the column.
- pgvector leverages PostgreSQL's existing query planning and optimization to execute vector-based queries efficiently, analyzing the query, the available indexes, and the data distribution to determine the most efficient execution plan.
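A minimal usage sketch; the items table, its dimension (3), and the sample values are illustrative only:

```sql
-- Minimal pgvector usage sketch: enable the extension, store a few
-- 3-dimensional vectors, and fetch the nearest neighbors of a query vector
-- by Euclidean (L2) distance.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

INSERT INTO items (embedding) VALUES ('[1,1,1]'), ('[2,2,2]'), ('[1,1,2]');

SELECT id, embedding <-> '[1,1,1]' AS distance
FROM items
ORDER BY embedding <-> '[1,1,1]'   -- <-> is L2 distance
LIMIT 5;
```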
Internals of PgVector
- Introduces a dedicated data type for vectors and functions for vector similarity search.
- Supports two main index types, HNSW and IVFFlat, which prioritize speed over perfect accuracy. These indexes are tailored for efficient similarity search, letting you retrieve the vectors most similar to a given query vector under a specific distance metric (e.g., cosine similarity).
- Uses TOAST (The Oversized-Attribute Storage Technique): large values from flagged columns are extracted and stored in a separate, hidden TOAST table that PostgreSQL manages automatically, and the large columns in the original table are replaced with small pointers referencing the location of the actual data in the corresponding TOAST table.
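A sketch of creating pgvector's two index types on the items table from the earlier example; the parameter values are illustrative, not recommendations:

```sql
-- IVFFlat: clusters vectors into lists and probes a subset at query time.
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- HNSW: graph-based index; m and ef_construction control graph density
-- and build-time search effort.
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
```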
Why not a dedicated vector database like Pinecone, rather than PostgreSQL?
- The bottleneck in using a closed-source search solution like Pinecone is primarily the latency of network requests, not the search operation itself.
- Pinecone does provide scalability through the adjustment of resources such as vCPU, RAM, and disk (termed "pods").
- Data synchronization issues.
- Incomplete database solution: lacks row-level security, database backups, bulk operations, and complete ACID compliance.
- More info: "pgvector vs Pinecone" by Supabase and by Confident AI.
Motivation
- pgvector, while a valuable PostgreSQL extension for vector similarity search, lacks certain features found in other vector databases, such as scalar quantization, which have proven efficient for specific workloads.
- Even with these additional features, however, all vector databases have inherent limitations:
  - Diversification of the set of responses to a query
  - Enabling similarity and dissimilarity constraints
  - Enabling low-level composition of multiple queries
- By overcoming these challenges, vector databases can become more versatile and effective tools for a wide range of applications.
Use Case 1: Diversification of the Set of Responses to a Query
In a recommendation system, users often require a variety of suggestions rather than exact matches. For instance, if a user is watching videos of a white cat, the system should recommend videos featuring cats of different colors or species to ensure diversity in the suggestions. This diversification is crucial for enhancing user engagement and satisfaction.
Method 1: Naive Approach Outside the Database
- Expand the search beyond the required number of responses (top K) by a factor n (e.g., n = 2, 3, 4, ...), then pick responses from this larger set either randomly or systematically, such as selecting every nth element starting from the first.
- Advantage: straightforward and requires minimal effort to implement. By expanding the search area, there is a chance of surfacing new and diverse content that would not have been included in the top K results.
- Disadvantage: expanding the search area increases computational overhead.
Method 2: Naive Approach Inside the Database
- The database would typically conduct a similarity search across all vectors, sort them, and return the top K results. We can modify this process: instead of returning only the top K vectors, the same selection logic can be applied to the sorted list inside the database, returning responses randomly or every nth element directly from the query (see the sketch below).
- Advantage: eliminates the overhead of performing these operations externally.
- Drawback: this works well with a brute-force scan for similar vectors, but not when indexes are used, since the result set produced through an index might not contain K elements.
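A minimal SQL sketch of the every-nth-element idea, assuming the items table from earlier, with n = 3 and K = 10 as illustrative values (the mechanism built into pgvector itself would differ):

```sql
-- Sketch: fetch the top n*K nearest neighbors, then keep every nth row to
-- diversify the result set.
WITH ranked AS (
    SELECT id,
           embedding <-> '[1,1,1]' AS distance,
           row_number() OVER (ORDER BY embedding <-> '[1,1,1]') AS rn
    FROM items
    ORDER BY embedding <-> '[1,1,1]'
    LIMIT 30                      -- n * K = 3 * 10
)
SELECT id, distance
FROM ranked
WHERE rn % 3 = 1                  -- every nth element, starting from the first
ORDER BY rn
LIMIT 10;
```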
Method 3: With Indexing (IVFFlat)
- IVF uses clusters: by default the number of probes is 1, so the search examines only the cluster whose centroid is closest to the query vector.
- Since we want input from several clusters, we must increase the number of probes to, say, n. IVFFlat will then compare all the vectors within those clusters and generate a sorted list, from which we can pick every (list size / K)-th element.
- Although this raises the computational cost compared to examining a single cluster, it still shrinks the search space compared to operating without an index.
- Drawback: finding the optimal probes value can be challenging, balancing diversification and efficiency.
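In pgvector, the number of clusters examined is controlled by the ivfflat.probes setting; a small sketch, with the probes value chosen purely for illustration:

```sql
-- Sketch: widen the IVFFlat search to several clusters before applying the
-- every-nth-element selection from Method 2.
SET ivfflat.probes = 10;

SELECT id
FROM items
ORDER BY embedding <-> '[1,1,1]'
LIMIT 30;   -- then thin the result (e.g., keep every nth row) as before
```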
Method 4: Diversification with a Threshold
One of the primary challenges we encountered was determining how diverse our results actually are: simply increasing the number of probes did not ensure diversity. We explored the idea of a threshold that classifies results as both similar and diverse if they fall within a specified range. At present, pgvector offers three distance metrics: Euclidean distance, cosine distance, and inner product. To assess the feasibility of such a threshold, we need to examine these metrics.
Euclidean Distance
- Also known as L2 distance.
- The threshold could be a predefined maximum acceptable difference between the two values: for example, if the Euclidean distance between two points is less than or equal to the threshold, they could be considered similar.
- Disadvantages:
  - Determining an appropriate threshold can be subjective and context-dependent; what counts as "similar" in one scenario may not be appropriate in another.
  - Euclidean distance is sensitive to the scale of the data; normalization or standardization may be necessary to address this.
  - For datasets with many zero or near-zero values (sparse data), Euclidean distance may not effectively capture similarity.
Inner Product
- Also known as the scalar product or dot product; calculated by multiplying corresponding components of the vectors and summing the results.
- The range of possible values depends on the vectors being multiplied: positive values indicate the vectors point in the same direction, negative values that they point in opposite directions, and 0 that they are orthogonal.
- Disadvantages:
  - Positive dot-product values may exhibit varying ranges depending on the dataset, so identifying a single threshold value becomes complex.
  - The effectiveness of a threshold may vary with the variability of the data and the specific characteristics of the vectors being compared.
Cosine Similarity
- Calculates the cosine of the angle between two vectors; the cosine similarity ranges from -1 to 1:
  - 1 indicates the vectors are perfectly similar (pointing in the same direction).
  - -1 indicates they are perfectly dissimilar (pointing in opposite directions).
  - 0 indicates orthogonality (the vectors are at right angles to each other).
- We can adopt a threshold-based approach: leverage cosine similarity with three distinct modes, each with its own threshold range, to classify similarity levels among pairs of vectors (see the query sketch below). The modes are:
  - Strict mode (threshold 0.5 to 0.75): a moderately diverse result set, allowing some similarity but also room for diversity.
  - Moderate mode (threshold 0.75 to 0.9): a less diverse set than strict mode, implying a higher degree of similarity among the pairs.
  - Limited mode (threshold 0.9 to 1): very similar result sets, with cosine similarity close to 1.
- The approach provides a finer-grained classification of similarity levels by dividing cosine similarity scores into three distinct modes, and adjusting the threshold ranges in each mode offers customization and flexibility for specific use cases and preferences.
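A sketch of how one mode could be expressed in SQL, assuming the items table from earlier; pgvector's <=> operator returns cosine distance (1 minus cosine similarity), hence the conversion, and the 0.5 to 0.75 band corresponds to the strict mode above:

```sql
-- Sketch: return rows whose cosine similarity to the query vector lies in
-- the "strict mode" band (0.5 to 0.75). <=> is cosine *distance*, so
-- cosine similarity = 1 - (embedding <=> query).
SELECT id, 1 - (embedding <=> '[1,1,1]') AS cosine_similarity
FROM items
WHERE 1 - (embedding <=> '[1,1,1]') BETWEEN 0.5 AND 0.75
ORDER BY embedding <=> '[1,1,1]'
LIMIT 10;
```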
Using a Threshold Outside the Database vs. Inside
- pgvector already provides functionality for filtering on vector distance: per the documentation, a query such as SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5; retrieves all vectors whose distance from the query vector is smaller than 5.
- Adding a dedicated keyword inside the database to perform the same process is feasible but would not add any performance benefit.
- Use iterative search to add values to the result set in case we do not get the top K vectors.
- Conclusion: employing a library that harnesses SQL queries would deliver comparable accuracy without introducing a new keyword. Such a library is easy to understand and implement, leverages pgvector's optimized distance calculations, and offers flexibility, since threshold adjustments require only query modifications, minimizing the performance impact of frequent changes.
Use Case 2: Enabling Similarity and Dissimilarity Constraints
- Another limitation of current vector databases is the inability to handle dissimilarity constraints.
- A search query like "places without snow" may not be accurately interpreted by the vectors, as neural models often struggle with encoding negations and nuances, as noted in research such as [1].
- To address this, databases should incorporate features for finding dissimilarity, thereby improving the accuracy of search results.
Method 1: Dissimilarity Using a Threshold
- For dissimilar items, a cosine similarity of -1 indicates that a vector points in the direction opposite to the query vector. For instance, with 100 vectors in the database, if none of them has a cosine similarity of exactly -1, such a query would return empty results.
- A potential workaround is to employ modes similar to our earlier exploration: "strict" meaning cosine = -1, "moderate" covering values from -1 to -0.75, and "limited" ranging from -0.75 to -0.5.
- A significant challenge with this approach is determining appropriate threshold values and ensuring that they guarantee retrieval of the desired number of elements (k).
- Another issue is indexing: the existing indexes cannot be used for dissimilarity, even with iterative search.
- The threshold approach also poses a challenge in terms of computational cost.
Method 2: Dissimilarity Constraint in the Database
- Brute force: when searching for the top K similar elements, the typical process iterates through all data points, calculates the similarity score for each, sorts these scores, and returns the top K elements. For dissimilarity, we can simply select the last K elements after sorting (see the sketch below).
- Indexing: we can refine the existing IVFFlat approach with some adjustments. IVFFlat groups the stored vectors into distinct clusters, each characterized by a centroid representing the average of its data points. When a query vector arrives, its distance to these cluster centroids is computed. Instead of selecting the n closest clusters as in the original approach, we prioritize the n farthest clusters, i.e., those exhibiting the greatest distance from the query vector. A brute-force comparison is then applied to the elements within the chosen clusters, and with the dissimilarity constraint the elements are sorted in descending rather than ascending order of distance.
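The brute-force variant can be sketched directly in SQL by reversing the sort order; this runs as a sequential scan, since the stock IVFFlat index only accelerates ascending (nearest-neighbor) ordering, which is exactly the gap the proposed index below would fill:

```sql
-- Sketch: the K most *dissimilar* rows by L2 distance, obtained by sorting
-- distances in descending order. Not accelerated by the stock IVFFlat index.
SELECT id, embedding <-> '[1,1,1]' AS distance
FROM items
ORDER BY embedding <-> '[1,1,1]' DESC
LIMIT 10;
```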
Implementation Concept
- To implement the necessary changes in pgvector's code flow, we use the PostgreSQL indexing API, focusing on the IndexAmRoutine API. These changes align the code flow with the dissimilarity constraint, ensuring that indexing operations select clusters by maximum distance and sort the elements within those clusters in descending order.
- Update the sorting criteria: modify the ivfflatbeginscan method within the IndexAmRoutine API, replacing Float8LessOperator with Float8gt to sort in descending order.
- Adjust cluster selection: in the amgettuple method, which retrieves the next valid tuple, adjust the heap maintained to store elements; set min_diff to DBL_MAX to ensure maximum-distance comparison, and reverse the if condition to check whether distance > maxDistance.
- Conclusion: create a new index with a different name, IVF_Flat_Dissimilar, containing all the code of IVFFlat plus the above changes, then query the database with: CREATE INDEX ON items USING ivfflat_dissimilar (embedding vector_l2_ops) WITH (lists = 100);
Bonus Method
- pgvector stores the negative dot product as the distance in the result vector: to use the IVFFlat index effectively, pgvector converts the similarity measure (dot product) into a dissimilarity measure (distance) by negating the dot product. A larger negative value (a more negative dot product) therefore indicates greater dissimilarity (the vectors are further apart), which aligns well with the IVFFlat indexing approach.
- We could leverage this by storing the (un-negated) dot product as the distance in the result vector; in that case we only need to add one more distance-calculation method. This would likely be the most effective approach, but given the time constraint I found no formal proof of it.
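This behavior is visible at the SQL level through pgvector's <#> operator, which returns the negative inner product; a small sketch follows, with the caveat noted above that it is an open question how well a descending ordering interacts with the IVFFlat index:

```sql
-- <#> returns the *negative* inner product, so ascending order means most
-- similar by inner product. Flipping to DESC yields the least similar rows
-- by inner product.
SELECT id, (embedding <#> '[1,1,1]') * -1 AS inner_product
FROM items
ORDER BY embedding <#> '[1,1,1]' DESC
LIMIT 10;
```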
Use Case 3: Enabling Low-Level Composition of Multiple Queries
- In real-world scenarios, single-query searches may not always suffice. Consider searching for images similar to both image A and image B simultaneously; the current system lacks support for such multi-image search operations.
- The workaround is to conduct separate query searches and merge the results. Say we have three vectors: vector 1 representing "USA," vector 2 representing "Basketball," and vector 3 representing "team." We would retrieve the top K vectors similar to each of these individual vectors and then, based on certain criteria, select elements from each set. A significant issue with this method is that even if we find these vectors, relevance to the concept of "USA Basketball Team" as a whole is not guaranteed; it merely identifies vectors related to the individual words.
- The next approach would be to take the original data instead of the vectors, concatenate it, pass it to a deep learning model, and obtain a single vector embedding to use for similarity search. But even this requires access to the original data. Can we do something similar working only with the vectors?
Method 1: Using a Centroid for a Set of Vectors
- To understand how the centroid should be calculated, we can look at pgvector's IVFFlat implementation, specifically the IvfflatKmeans method.
- IvfflatKmeans takes three arguments: index (the PostgreSQL index being built for the k-means clustering), samples (a vector array containing the data points to be clustered), and centers (a vector array that will store the centroids to be found).
- The function first checks whether there are enough data points (samples) to create the desired number of centroids (centers). If not, it uses a simplified approach (QuickCenters) to select centroids. A summary of the steps involved in QuickCenters:
  - Sorting and avoiding duplicates: the function sorts the samples array based on their vector components, ensuring that centroids are chosen from diverse regions of the data space, and iterates through the sorted array, adding each unique data point to the centers array until the desired number of centroids (centers->maxlen) is reached.
  - The centroids are iteratively updated by calculating the mean of the data points assigned to each centroid; this continues until a convergence criterion is met (e.g., no significant changes in centroid positions).
  - Normalization (optional): if an optional normalization procedure is defined for the k-means process (IvfflatOptionalProcInfo), it is applied to each generated centroid vector. Normalization can be important for distance metrics like cosine similarity, where the magnitude of the vector affects the similarity score.
  - Filling with random data (if needed): if there are still empty slots in the centers array after processing all unique data points, the function fills them with randomly generated data points.
Method 1: Using a Centroid for a Set of Vectors (continued)
- Calculating the mean of vectors can be an effective method of combining them. This approach was previously used to generate vector embeddings of sentences, particularly before the widespread adoption of deep learning models, when methods like word2vec created embeddings of individual words. It has proven effective and can be leveraged to identify vectors similar to a given set of query vectors (see the sketch below).
- Drawbacks of the centroid method:
  - Outliers in the data can significantly influence the mean calculation, potentially skewing the combined vector away from the central tendency of the data.
  - Mean calculation assumes that all vectors contribute equally to the combined result, whereas in many cases certain vectors are more relevant or significant than others.
- Other potential variations:
  - Minimum-value selection: given, say, three vectors of three dimensions each, select the minimum value among the three vectors in each dimension. This yields a strict match, since the resultant vector contains the minimum values required in each dimension; a similarity search is then performed with this combined query vector.
  - Maximum-value selection: select the maximum value in each dimension to create the combined query vector. Both variations may seem more stringent than averaging; their effectiveness depends on the specific use case.
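A sketch of the centroid idea in SQL, assuming the items table from earlier and a pgvector version that provides the avg() aggregate for the vector type; the query vectors are illustrative:

```sql
-- Sketch: combine a set of query vectors into a single centroid, then
-- search with the centroid as the query vector.
WITH query_set(v) AS (
    VALUES ('[1,0,0]'::vector), ('[0,1,0]'::vector), ('[0,0,1]'::vector)
),
centroid AS (
    SELECT avg(v) AS c FROM query_set   -- element-wise mean of the queries
)
SELECT i.id
FROM items AS i, centroid
ORDER BY i.embedding <-> centroid.c
LIMIT 10;
```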
How to Validate the Choices?
- Further analysis is required to validate the earlier hypotheses and determine the most suitable approach for the given scenario.
- Library development: implementing these methods directly in the database would require substantial effort. A more efficient approach is to create a library to check the effectiveness of each method; once the library is developed, thorough exploration can determine the best approach for the given scenario.
Method 2: Custom Searching
- Individual similarity search: conduct a similarity search with each individual query vector against the database, tracking the distance of each database vector from each query vector.
- Combined vector-distance representation: represent the combined distance by computing the average of all the distance values; sorting is then based on this combined distance, which can also incorporate different weights associated with the query vectors.
- Since the distances are maintained inside the database and discarded once the query completes, the ideal place to make this change is inside the database system. Two implementation approaches can be explored here:
  - Inside pgvector, by modifying the code.
  - As a custom PL/pgSQL function named search_similar_vectors within your PostgreSQL database, working on top of pgvector.
What is PL/pgSQL?
- PL/pgSQL is a procedural-language extension for PostgreSQL that enables the creation of custom functions directly within the database. These functions are written using SQL commands, control structures, and procedural constructs, providing a powerful tool for developing complex database logic.
- PL/pgSQL functions integrate seamlessly with SQL, allowing developers to execute SQL queries within the function body.
- PL/pgSQL functions support parameterized inputs: developers can define input parameters that customize the behavior of the function, and these parameters can be of various data types, including scalar types, composite types, and arrays.
- PL/pgSQL functions execute within the security context of the PostgreSQL database, adhering to the access privileges and permissions defined for the database objects, which ensures data security and integrity while executing the function.
- Implementation with PL/pgSQL: write the PL/pgSQL code in a text editor, use the CREATE FUNCTION statement in PostgreSQL to define the function with its arguments, return type, and body, and then call search_similar_vectors from within your SQL queries, specifying the query vectors and index name.
Implementation of the PL/pgSQL Function
- Input:
  - query_vectors: a pgvector array containing the set of query vectors for which you want to find similar vectors.
  - index_name: the name of the PostgreSQL index that uses pgvector for similarity search.
- Process:
  - Fetches database vectors from the specified table using the provided index.
  - Loops through each query vector, calculates the distance between the current query vector and all database vectors using pgr_distance, and accumulates these distances for each database vector in a combined_distances array.
  - Calculates the average distance for each database vector by dividing the accumulated distance by the total number of query vectors.
  - Sorts the database vectors by their average distance (ascending), with smaller average distances indicating greater similarity to the query vectors.
- Output: the function returns the sorted database-vector array, providing the most similar vectors first based on their average distance to the query set.
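A simplified sketch of the same idea, not the deck's exact function: it drops the index_name parameter and the per-row distance arrays, hard-codes the items table from earlier, and ranks rows by their average L2 distance to the query set.

```sql
-- Simplified sketch of search_similar_vectors. Assumes
-- items(id bigint, embedding vector); weighting could be added by scaling
-- each distance term before averaging.
CREATE OR REPLACE FUNCTION search_similar_vectors(query_vectors vector[], k integer)
RETURNS TABLE(item_id bigint, avg_distance double precision) AS $$
BEGIN
    RETURN QUERY
    SELECT i.id,
           (SELECT avg(i.embedding <-> q)::double precision
              FROM unnest(query_vectors) AS q)
    FROM items AS i
    ORDER BY 2          -- ascending average distance = most similar first
    LIMIT k;
END;
$$ LANGUAGE plpgsql;

-- Usage sketch:
-- SELECT * FROM search_similar_vectors(ARRAY['[1,0,0]'::vector, '[0,1,0]'::vector], 10);
```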
Use Case 4: Scalar Quantization
- Quantization is the process of reducing the number of distinct values (or levels of precision) used to represent data: it maps a continuous range of values to a finite set of discrete values. It is commonly used in data compression, image and video coding, and various signal-processing applications.
- Quantization in the context of vectors refers to reducing the number of distinct values or levels used to represent each component of a vector. Vectors typically have high dimensions, such as the 512- and 768-dimensional embeddings commonly produced by models like CLIP and DINOv2, and each component is represented as a 32-bit floating-point number, so these vectors require a significant amount of memory.
- pgvector currently lacks support for quantization methods; there are open issues about implementing quantization in pgvector.
Where to Use Quantization?
- Should we quantize when we save the data?
  - The initial intuition was to adjust pgvector's existing vector struct to accommodate an integer array alongside the float array, but this appeared impractical due to the extensive changes required.
  - Another option was storing quantized data in separate tables or blob storage linked to the original vectors. This could be implemented with less effort, but it ultimately defeats the purpose of quantization, since the original data is still retained.
  - Overall, storing quantized vectors instead of the originals may not be feasible for many scenarios: quantization is a lossy technique, and in scenarios such as face recognition accuracy is crucial.
- Is it feasible to employ quantization in the index to mitigate storage costs?
  - IVFFlat stores the original vector data, offering high search accuracy but requiring more storage.
  - Scalar quantization (SQ) combined with IVFFlat reduces storage further by converting each dimension (value) of the original vector to a smaller data type (e.g., from float to byte), which significantly reduces the index size.
  - Why create a new index to support quantization rather than change the existing one? Because quantization-based search incurs an accuracy loss.
How to Do Scalar Quantization?
- A few vector databases already offer this functionality, such as Zilliz and Qdrant; Milvus also offers scalar quantization as an index, IVF_SQ8.
- Qdrant: their example shows that 99% of values fall within a [-2.0, 5.0] range, and they derive their equation accordingly. Both parameters, offset and alpha, must be calculated for a given set of vectors; this is achieved by substituting the minimum and maximum of the represented ranges for both f32 and i8. Substituting the maximum over the minimum in place of alpha yields two equations from which the offset can be calculated.
- Zilliz: take the maximum and minimum value of each dimension across the database and split the dimension into bins across the range. To create the bins, two variables are used: start = min value and step = (max value - min value) / 255, where 255 is the integer range. Subtract the minimum value from each component, divide by the step, and place the vectors into these bins (see the sketch below).
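A minimal PL/pgSQL sketch of the min/max "bins" formula; this hypothetical helper operates on a plain float array and, for brevity, computes min/max within a single vector, whereas the Zilliz approach uses per-dimension min/max across the whole dataset:

```sql
-- Hypothetical helper: scalar-quantize a float array to 8-bit codes (0-255)
-- using the start/step formula described above.
CREATE OR REPLACE FUNCTION scalar_quantize(v real[])
RETURNS smallint[] AS $$
DECLARE
    lo   real;
    hi   real;
    step real;
BEGIN
    SELECT min(x), max(x) INTO lo, hi FROM unnest(v) AS x;
    step := (hi - lo) / 255.0;                         -- 256 levels for 8 bits
    IF step = 0 THEN
        RETURN (SELECT array_agg(0::smallint) FROM unnest(v));  -- constant vector
    END IF;
    RETURN (SELECT array_agg(round((x - lo) / step)::smallint ORDER BY ord)
              FROM unnest(v) WITH ORDINALITY AS t(x, ord));
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Example: SELECT scalar_quantize(ARRAY[-2.0, 0.0, 5.0]::real[]);  -- {0,73,255}
```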
Which Approach to Choose?
- There is no published comparison of these approaches so far, so deciding which to pick is difficult.
- I came across a Google paper, "Accelerating Large-Scale Inference with Anisotropic Vector Quantization," which is also used by Google's ScaNN library; I believe Zilliz also took inspiration from this paper, although they do not mention it.
- Steps:
  - Specify the minimum and maximum values of the input signal.
  - Choose the number of quantization levels: 2^(number of bits), i.e., 2^8 = 256 for 8-bit quantization.
  - Techniques like uniform or Lloyd-Max quantization can be used to distribute the levels across the range:
    - In uniform quantization, the range of values is divided into equal intervals; the width of each interval is the range divided by the desired number of quantization levels.
    - The Lloyd-Max algorithm begins with an initial set of quantization levels, often evenly spaced across the range. Each iteration assigns every input value to the nearest quantization level, updates the levels by computing the centroid of the values assigned to each level, and checks for convergence by measuring the change in the levels between iterations.
  - Assign each input value to the closest quantization level, represented by an 8-bit binary code (0 to 255).
Implementation Concept
- Establish a distinct set of distance functions tailored to these quantized vectors.
- Register these functions as supplementary support functions for the access method.
- Allow users to specify WITH (quantizer = 'BQ') when creating an index on a vector column to indicate the use of quantization.
- Expand the strategy to let users store quantized vectors directly in the table, using the chosen data type.
- Ensure the same set of distance functions is used for both indexed quantized vectors and directly stored quantized vectors.
- Implement these changes in pgvector to enable efficient storage and indexing of quantized vectors, optimizing performance across scenarios.
Implementation Concept: ScalarQuantizeVector method
- Parameters: const Vector* in (the input vector to be quantized), const Vector* multipliers (multipliers used for quantization), bytea* quantized_storage (storage for the quantized values).
- Iterates over each component of the input vector, multiplies it by the corresponding multiplier, rounds the result to the nearest integer, and stores it in quantized_storage, handling cases where the result exceeds the range of int8_t. Returns a pointer to quantized_storage.
Dot-product distance:
- Parameters: const Vector* preprocessed_query (the preprocessed query vector), const TupleDesc tuple_desc (the tuple descriptor of tuples in the page), const Page page (pointer to the page containing vectors), Vector* result (pointer to store the computed distances).
- Iterates over each vector in the page, computes the dot product between the preprocessed query vector and each vector in the page, and stores the negative dot product as the distance in the result vector.
IVF Build Phase
- Quantizer initialization: determine the quantizer type using IvfGetQuantizer(index) and store it in buildstate->quantizer. Perform initialization steps specific to kIvfflat or kIvfsq8, depending on which type is detected, and throw an error if the type is neither.
- Memory allocation: allocate memory for buildstate->multipliers and buildstate->quantized_storage based on the quantizer type. For kIvfsq8, initialize buildstate->multipliers with zero values and size buildstate->quantized_storage according to the index dimensions.
- Memory deallocation: free buildstate->multipliers and buildstate->quantized_storage if they were previously allocated, to prevent memory leaks.
- Setting the page header: adjust the page header (((PageHeader) page)->pd_lower) to accommodate metadata or multipliers, depending on the quantizer type.
- Creating the IVF list: create an IVF list (list) based on the index and associated metadata, allocating memory for the list and setting its properties from the provided parameters.
IVF Insert Phase
- Index tuple formation: form an index tuple from a raw vector in its scalar 8-bit quantized form. Compute the quantized vector via scalar quantization using the multipliers retrieved from the index, allocate memory for the quantized-vector storage and set its size, form the index tuple from the computed quantized vector, and free the allocated memory after forming the tuple.
- Processing the index list: retrieve each item (list) from a page in the index, extract the metadata associated with the list (flattening it if it is stored externally), and calculate the distance metric using the flattened metadata and the specified function.
- Handling quantizers: determine the quantizer type used in the relation. If the quantizer is scalar 8-bit, form the index tuple from the quantized vector; if the quantizer is flat, form the index tuple directly from the raw vector; throw an error if the quantizer type is not supported.
Create a new index while retaining the existing IVFFLAT index, allowing users to choose between the IVFFLAT index and the new quantized version.
- IVFFLAT index usage: CREATE INDEX ON items USING ivf (lists=100);
- New quantized version usage: CREATE INDEX ON items USING ivf (lists=100, quantizer='SQ8');
Lessons Learnt and Future Scope
- pgvector is a great extension that lets you use Postgres as a vector database: it provides the features of a vector database while utilizing Postgres's strong infrastructure.
- Learnt about the internals of Postgres, the indexing internals used in pgvector, PL/pgSQL, and vector manipulation.
- A few open points:
  - To validate/test the findings, we need a proper dataset and models.
  - Testing the approaches of Use Case 3 using a library extension as well as PL/pgSQL.
  - Enabling different metrics for self-similarity of responses.
  - Binary quantization.
  - Diving into the HNSW index implementation and its usage in these use cases.
References and Further Links to Explore
- PostgreSQL extension building (PGXS): https://www.postgresql.org/docs/current/extend-pgxs.html
- PostgreSQL index access method functions: https://www.postgresql.org/docs/current/index-functions.html
- pgvector: https://github.com/pgvector/pgvector
- Quantization: https://www.elastic.co/search-labs/blog/articles/scalar-quantization-101 , https://zilliz.com/learn/scalar-quantization-and-product-quantization , https://qdrant.tech/documentation/guides/quantization/ , https://arxiv.org/pdf/1908.10396.pdf
- PCA and embeddings: "A Simple but Tough-to-Beat Baseline for Sentence Embeddings"; "All-but-the-Top: Simple and Effective Postprocessing for Word Representations"; "Effective Dimensionality Reduction for Word Embeddings"
- Distance metrics: https://www.youtube.com/watch?v=e9U0QAFbfLI , https://www.learndatasci.com/glossary/cosine-similarity/ , https://en.wikipedia.org/wiki/Euclidean_distance , https://tivadardanka.com/blog/how-the-dot-product-measures-similarity