RheoData_23ai_Vector-Datatype-Webinar-2024.pptx

BobbyCurtisMBA 263 views 55 slides May 10, 2024
Slide 1
Slide 1 of 55
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55

About This Presentation

Oracle Database 23ai: AI Vector Search: The What, The How, & The Possiblities


Slide Content

Oracle 23ai: AI Vector Search: The What, The How, & The Possibilities Will start about 3 minutes after the hour.

Webinar Keeping Use Q&A dialog to ask questions. Will answer as pauses occur or at end of webinar.

Speaker Bobby Curtis, MBA Atlanta, GA Author Speaker @ dbasolved | @ rheodatallc https:// dbasolved.com | https:// www.rheodata.com /blog [email protected] https:// www.slideshare.net / BobbyCurtisMBA

Cloud Managed Services Provider (CMSP) 24/7 On-Call Support Data and Cloud Migrations Near-Zero Downtime Data Engineering Analytics, Machine Learning, Artificial Intelligence (Oracle Digital Assistants) License Management On-Premises & Cloud Cost Management Data Modernization Data Visualization, Cyber Security, Data Warehouse, etc. Service Offerings: www.rheodata.com | www.rheodata.ai Co-Managed Fully-Managed Professional Services

Infrastructure Data Integration Data Engineering & App development Applications Database (Oracle /MySQL/PostgreSQL) (ADW, ATP, ExaCC , etc.) Query Performance Data Warehouse (Autonomous Database/MySQL Heatwave) Cloud IaaS/PaaS Automation (Terraform) GoldenGate (on-premises & oci ) Oracle Integration Cloud (OIC) Oracle Data Integrator (ODI) Machine Learning (ML) & Artificial Intelligence (AI) Data Catalog Low Code (Apex) Oracle Analytics Cloud Oracle Technology Stacks: Security Data Lakehouse HCM/PeopleSoft EBS Primavera Engineered Systems (Exadata/ODA)

Prerequisites Oracle Database 23c 23ai oracle-database-free-23c-1.0-1.el8.x86_64 (23.4 ((LA1)) oracle-database-ee-23c-1.0-1.el8.x86_64 (23.4(LA2)) OCI Oracle Database Service On-premises ( here ) - Free Python 3.11/3.12 python- oracledb oracledb-2.2.0 Langchain https:// python.langchain.com /docs/ get_started /introduction LLM API Key Cohere, ChatGPT, etc.

The What?

Vector Database are a new classification of specialized databases designed for AI workloads that allows you to query based on semantics, rather than keywords.

Vector Search enhances information retrieval by mapping queries to relevant data in your database based on semantics, instead of precise matches, using vectors to measure similarity.

45 68 16 50 42 Vector Vectors are a sequence of numbers, called dimensions, used to capture the important “features” of the data. Vectors in AI represent semantics of unstructured data such as images, documents, videos, etc.

45 68 16 50 42 Vector The mathematical distance between two vectors indicate how similar they are Vectors represent the semantic content of data, not the underlying words or pixels

Online Embedding Models (Cohere) English Multi-Lingual embed-english-v3.0 (1024) embed-multilingual-v3.0 (1024) embed-english-light-v3.0 (384) embed-multilingual-light-v3.0 (384) embed-english-v2.0 (4096) embed-multilingual-v2.0 (768) embed-english-light-v2.0 (1024)

Online Embedding Models (Cohere) English Multi-Lingual embed-english-v3.0 (1024) embed-multilingual-v3.0 (1024) embed-english-light-v3.0 (384) embed-multilingual-light-v3.0 (384) embed-english-v2.0 (4096) embed-multilingual-v2.0 (768) embed-english-light-v2.0 (1024) Database Embedding Models (Hugging Face/ OMLUtils )

Input Data Object Embedding Model 4 8 42 72 Vector Search Relevant Content Data Object Retrieval Vector ID Matches Vector Embedding

The How? -> The Math

The Math? Euclidean/Squared Euclidean Distance Cosine Similarity Reflects the distance between each of the vector coordinates being compared -> straight line distance between vectors Measures the cosine of the angle between two vectors.

The Math? DOT Product Similarity Manhattan Distance Similarity of two vectors are viewed by multiplying the size of each vector by the cosine of their angle. Calculated by summing the distance between the coordinates of two vectors.

The Math? Hamming Distance The distance between two vectors represent the number of coordinates where they differ.

The How? -> Oracle

Vector data types can be declared in the following forms: Vector Datatype Vector Arbitrary number of dimensions and format Vector (*,*) Vector = Vector(*,*) Vector ( number_of_dimensions , *) Vectors must all have the specified number of dimensions, or an error is thrown. No format modification. Vector(*, dimension_element_format ) Arbitrary number of dimensions, but format will be up-converted or down-converted to the specific dimension (INT8, FLOAT32, FLOAT64). Vector( number_of_dimensions , dimension_element_format ) Must have all the specified number of dimensions, or error thrown. Each will be up-converted or down-converted to the specified format (INT8, FLOAT32, FLOAT64). Vectors can be NULL but dimensions cannot: [1.1, NULL, 2.2] Same Thing

Different Types of Vectors Supported! drop table vector . rd_vector ; CREATE TABLE vector . rd_vector ( v1 VECTOR , v2 VECTOR ( 3 , FLOAT32 ), v3 VECTOR ( 2 , FLOAT64 ), v4 VECTOR ( 1 , INT8 ), v5 VECTOR ( 1 , * ), v6 VECTOR ( * , FLOAT32 ), v7 VECTOR ( * , * ) );

Vector Base Operations TO_VECTOR || VECTOR Constructors for vectors FROM_VECTOR || VECTOR_SERIALIZE Functions take a vector and return as string in VARCHAR2 or CLOB as output VECTOR_NORM Function returns the Euclidean norm of the vector in BINARY_DOUBLE (SQRT(SUM((xi- yi )2)) VECTOR_DIMENSION_COUNT Function returns the number of dimensions of a vector as an Oracle number VECTOR_DIMENSION_FORMAT Returns the storage format of the vector in VARCHAR2 format (INT8, FLOAT32, FLOAT64) VECTOR_DISTANCE * Function returns Squared Euclidean distance. Default for Oracle AI Vector Search There are a few base functions that are used with Vectors .

VECTOR_DISTANCE vector_distance (expr1, expr2, [, metric]) L1_distance (expr1, expr2) L2_distance (expr1, expr2) cosine_distance (expr1, expr2) inner_distance (expr1, expr2) Manhattan/Taxi Cab Euclidean_Squared DOT Cosine

How do I create a “ vector ”?

Declare vector datatypes during table creation. Example 1: CREATE TABLE text_vec ( id NUMBER, embeddings VECTOR ); Example 2: CREATE TABLE text_vec ( id NUMBER, embeddings VECTOR(384, INT8) );

Vector datatypes can be added to tables later as well. Example 1: CREATE TABLE text_vec ( id NUMBER, col1 varchar2(50), ); ALTER TABLE text_vec add ( vec vector );

sql > create table vec_tab ( id number, vec vector(2) ); sql > insert into vec_tab values (1, ’[1,2]’); sql > insert into vec_tab values (2, ’[2,3]’); import oracledb connection = oracledb.connection () cursor = connection.cursor () cursor.execute () cursor.setinputsizes (None, oracledb.DB_TYPE_VECTOR ) cursor.execute () connection.commit () Available Create Create & Populate

Let’s go with Python!

Python Approach – 1 of 4 import oracledb import cohere import os from langchain . document_loaders import TextLoader from langchain . text_splitter import CharacterTextSplitter

Python Approach – 2 of 4 with open ( "./Scripts/oracle/vector/ cities.txt " ) as f : cities = f . read () #print(cities) text_splitter = CharacterTextSplitter ( separator = " \n " , chunk_size = 35 , chunk_overlap = 5 , length_function = len , is_separator_regex = False ) texts = text_splitter . split_text ( cities )

Python Approach – 3 of 4 api_key = ”API Key" os . environ [ "COHERE_API_KEY" ] = api_key print ( os . environ [ "COHERE_API_KEY" ]) co = cohere . Client ( os . getenv ( "COHERE_API_KEY" )) Cohere offers two types of API Keys: Development (Limit 10 calls per minute) Production (unlimited calls per minute)

Python Approach – 3.5 of 4 def database_connection (): try : connection = oracledb . connect ( user = "vector" , password = "vector" , dsn = "100.130.230.100:1521/freepdb1" ) print ( 'connected’ ) return connection except : print ( 'Could not make a connection' )

Python Approach – 4 of 4 connection = database_connection () id_val = 1 for i in texts : response = co . embed ( texts = [ i ], model = 'embed-english-light-v3.0' , input_type = 'classification' ) vtext = i vec = response . embeddings [ ] #print( vtext ) #print( len ( vec )) #print( vec ) cursor = connection . cursor () cursor . setinputsizes ( None , oracledb .DB_TYPE_VECTOR ) cursor . execute ( "insert into vector.text_vec (id, txt_vec , text) values (:1, :2 , :3)" ,[ id_val , vec , vtext ]) connection . commit () #print("recorded inserted") id_val = id_val + 1 print ( str ( id_val - 1 ) + " inserted" ) cursor . execute ( "select * from vector.text_vec " ) for row in cursor : print ( row )

Create Vector Demo (Video)

Create Vectors select text from vector . text_vec order by vector_distance ( txt_vec ,( select txt_vec from vector . text_vec_lookup where id = :1 ), DOT ) fetch first :2 rows only ;

Similarly Searches (Demo)

Similarly Searches (Demo) Explain Plan No indexes Selected 43 rows in milliseconds – all text New webinar: Vector Indexes, coming soon.

How do I embed a LLM within the database?

ONNX Files ONNX = Open Neural Network Exchange A n open-source format designed for machine-learning models. Ensuring cross-platform compatibility and supports major languages and frameworks, facilitating easy and efficient model exchanges.

Python Approach – 1 of 5 $ export ORACLE_HOME_23ai=/opt/oracle/product/23ai/dbhome_1 $ cd $ORACLE_HOME_23ai/python/bin $ python -V $ export PATH=$ORACLE_HOME_23ai/python/bin:$PATH $ python -v Verify that the correct Python is installed. Python 3.12 (min)

Python Approach – 2 of 5 $ cd ~ $ mkdir onnx Make directory for all ONNX files

Python Approach – 3 of 5 $ cd ~/ onnx $ unzip ./ omlutils.zip -d Change directory to ONNX directory Unzip OMLUTILS.zip file

Python Approach – 4 of 5 $ cd ~/ onnx $ python -m pip install -r requirements.txt $ python -m pip install omlutils-0.13.0-cp312-cp312-linux_x86_64.whl Install all the necessary requirments Install OMLUTILS wheel file

Python Approach – 5 of 5 $ cd ~/ onnx $ python >>> from omlutils import EmbeddingModel , EmbeddingModelConfig >>> em = EmbeddingModel ( model_name ="sentence-transformers/all-MiniLM-L6-v2”) >>> em.export2file("all-MiniLM-L6-v2",output_dir=".”) >>> exit() $ Install requested LLM into database using Python Interactive Prompt View model in Oracle Data Dictonary

New Database Views *_DATA_MODELS SELECT * FROM *_DATA_MODELS;

Using the Embedded Model (TO_VECTOR & VECTOR_EMBEDDING) SELECT to_vector ( vector_embedding ( VECTOR . DOC_MODEL using url as data )) as embedding from ggai.base ;

Updating existing table with Vectors update ggai . base_url set urlvec = ( SELECT to_vector ( vector_embedding ( VECTOR.DOC_MODEL using url as data )) from ggai . base_url where id = 1 ) where client = upper ( 'test' );

The Possibilities?

The possibilities of Oracle Vector Search are going to drive growth in everything!

Retrieval Augmented Generation (RAG) A technique for enhancing the accuracy and reliability of generative AI models with facts from external sources

Input Data Object Embedding Model 4 8 42 72 Vector Search Relevant Content Data Object Retrieval Vector ID Matches Vector Embedding

Use Cases: Search (known use case) Natural-Language Processing/Search (semantic search) Recommendation Systems (retail, customer service) Biometric and Anomaly Detection (healthcare) Drug discovery and genomics (healthcare) Patient similarity Analysis (healthcare) Personalized Digital Assistants

Private data shared via Vector databases/data type need to be secure. Organizations should evaluate what data is vectorized (shared) before using with public LLMs. Consider building private LLMs to keep data secure.

Questions

Next webinars! June 13, 2024 @ 1 pm EST: Oracle GoldenGate 23ai: What is new and how to upgrade June 27, 2024 @ 1 pm EST (tentative): Oracle Database 23ai: Vector Indexes – What and how to use