06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases

bunkertor 519 views 50 slides Jun 20, 2024


About This Presentation

Tech Talk: Unstructured Data and Vector Databases

Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss unstructured data and the world of vector databases. We will see how they differ from traditional databases, in which cases you need one, and in which you probably don’t....

Size: 16.66 MB

Language: en

Added: Jun 20, 2024

Slides: 50 pages

Slide Content

June 20, 2024
Unstructured Data and Vector Databases

Tim Spann
Principal Developer Advocate, Zilliz
https://www.linkedin.com/in/timothyspann/
https://x.com/paasdev
https://github.com/tspannhw
https://github.com/milvus-io/milvus
Speaker

Agenda
01 Introduction
Unstructured data, vector databases, traditional databases, similarity search
02 Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
03 Introducing Milvus
What drives Milvus' emergence as the most widely adopted vector database

4 | © Copyright Zilliz
Introduction

Why Vector Databases?
- Unstructured data is 80% of data
- Vector databases are the only type of database that can work with unstructured data
- Examples of unstructured data include text, images, videos, audio, etc.

Traditional databases were built on exact search

…which misses context, semantic meaning, and user intent

[Figure: side-by-side "VS." comparisons labeled "Apple", "Rising dough", and "Change car tire" / "Rising Dough" / "Proofing Bread", marked ✔ and ❌]

Vector Databases
Where do Vectors Come From?

…and cannot process increasingly growing unstructured data
80% of newly generated data in 2025 will be unstructured data
20% Other
*Data Source: The Digitization of the World by IDC

The evolution of AI made the semantic search of unstructured data possible

Search by Probability
Statistical analyses of common datasets established the foundation for processing unstructured data, e.g. NLP and image classification

AI Model Breakthrough
The advancements in BERT, ViT, CBT, etc. have revolutionized semantic analysis across unstructured data

Vectorization
Word2Vec, CNNs, and Deep Speech pioneered unstructured data embeddings, mapping words, images, and videos into high-dimensional vectors

This new AI breakthrough requires new databases to fully unleash its potential

Support multiple use case types
Accommodate diverse data requirements, enhancing flexibility and effectiveness in varied operational contexts

Scale as needed
Enable robust handling of expanding data volumes and search demands

Highly performant
Ensure swift and accurate query responses, crucial for optimal user experience

https://milvus.io/milvus-demos/reverse-image-search
Show Me

03 How do Vector Databases Work?

Semantic Similarity
Image from Sutor et al
Woman = [0.3, 0.4]
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Man = [0.5, 0.2]

Queen - Woman + Man = King
  Queen = [0.3, 0.9]
- Woman = [0.3, 0.4]
        = [0.0, 0.5]
+ Man   = [0.5, 0.2]
= King  = [0.5, 0.7]
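
The analogy arithmetic on this slide can be checked with a few lines of Python. This is only a sketch using the slide's 2-D toy vectors; real embeddings have hundreds or thousands of dimensions, and such analogies usually hold only approximately. The helper names `sub` and `add` are illustrative.

```python
import math

# 2-D toy embeddings copied from the slide.
woman = [0.3, 0.4]
queen = [0.3, 0.9]
king = [0.5, 0.7]
man = [0.5, 0.2]

def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]

# Queen - Woman + Man
result = add(sub(queen, woman), man)

# Compare with a tolerance, since floating-point arithmetic is inexact.
matches_king = all(
    math.isclose(r, k, abs_tol=1e-9) for r, k in zip(result, king)
)
print(result, matches_king)
```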

Vector Similarity Measures: L2 (Euclidean)
Queen = [0.3, 0.9]
King = [0.5, 0.7]
$d(\text{Queen}, \text{King}) = \sqrt{(0.3 - 0.5)^2 + (0.9 - 0.7)^2}$
$= \sqrt{(-0.2)^2 + (0.2)^2}$
$= \sqrt{0.04 + 0.04}$
$= \sqrt{0.08} \approx 0.28$
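
As a quick check of the L2 calculation, here is a minimal Python sketch. The function name `l2_distance` is illustrative; the standard library's `math.dist` computes the same quantity.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

queen = [0.3, 0.9]
king = [0.5, 0.7]

distance = l2_distance(queen, king)
print(round(distance, 2))  # 0.28
```

A smaller L2 distance means the two vectors are closer, so a search under this metric ranks results in ascending order of distance.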

Vector Similarity Measures: Inner Product (IP)
Queen = [0.3, 0.9]
King = [0.5, 0.7]
Queen · King = (0.3*0.5) + (0.9*0.7)
= 0.15 + 0.63 = 0.78

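Vector Similarity Measures: Cosine
Queen = [0.3, 0.9]
King = [0.5, 0.7]
$\cos(\text{Queen}, \text{King}) = \frac{(0.3 \times 0.5) + (0.9 \times 0.7)}{\sqrt{0.3^2 + 0.9^2} \times \sqrt{0.5^2 + 0.7^2}}$
$= \frac{0.15 + 0.63}{\sqrt{0.9} \times \sqrt{0.74}}$
$= \frac{0.78}{\sqrt{0.666}}$
$\approx 0.96$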
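
The inner product and cosine calculations can be reproduced with a short Python sketch. The function names are illustrative. Evaluating 0.78 / √0.666 gives roughly 0.96 for these two vectors.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return inner_product(a, b) / (norm_a * norm_b)

queen = [0.3, 0.9]
king = [0.5, 0.7]

ip = inner_product(queen, king)
cos = cosine_similarity(queen, king)
print(round(ip, 2), round(cos, 2))  # 0.78 0.96
```

Unlike L2, larger values mean more similar for both measures. Cosine ignores vector length, so for vectors already normalized to unit length it equals the inner product.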

Why Not Use a SQL/NoSQL Database?
•Inefficiency in High-dimensional spaces
•Suboptimal Indexing
•Inadequate query support
•Lack of scalability
•Limited analytics capabilities
•Data conversion issues

TL;DR: Vector operations are too computationally intensive for traditional database infrastructures

Why Not Use a Vector Search Library?
•Have to manually implement filtering
•Not optimized to take advantage of the latest hardware
•Unable to handle large scale data
•Lack of lifecycle management
•Inefficient indexing capabilities
•No built-in safety mechanisms

TL;DR: Vector search libraries lack the infrastructure to help you scale, deploy, and manage your apps in production.

What is Milvus ideal for?
•Advanced filtering
•Hybrid search
•Durability and backups
•Replication/High Availability
•Sharding
•Aggregations
•Lifecycle management
•Multi-tenancy
•High query load
•High insertion/deletion
•Full precision/recall
•Accelerator support (GPU, FPGA)
•Billion-scale storage

Purpose-built to store, index and query vector embeddings from unstructured data at scale.

We’ve built technologies for various types of use cases

Compute Types
Designed for various compute powers, such as AVX512 and Neon for SIMD, quantization, cache-aware optimization, and GPU
Leverage the strengths of each hardware type, ensuring high-speed processing and cost-effective scalability for different application needs

Search Types
Support multiple types such as top-K ANN, Range ANN, sparse & dense, multi-vector, grouping, and metadata filtering
Enable query flexibility and accuracy, allowing developers to tailor their information retrieval needs

Multi-tenancy
Enable multi-tenancy through collection and partition management
Allow for efficient resource utilization and customizable data segregation, ensuring secure and isolated data handling for each tenant

Index Types
Offer a wide range of 15 index types, including popular ones like HNSW, PQ, Binary, Sparse, DiskANN and GPU index
Empower developers with tailored search optimizations, catering to performance, accuracy and cost needs
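
To make "top-K" search and "metadata filtering" concrete, here is a minimal brute-force sketch in plain Python. It is not the Milvus API and not an approximate (ANN) index: it scans every row exactly, which is only practical for tiny datasets. The row layout, the `lang` field, and the function name `top_k_search` are all hypothetical.

```python
import heapq
import math

def top_k_search(query, rows, k, predicate=None):
    """Return the k rows nearest to query (L2), optionally pre-filtered.

    rows: dicts with a "vector" key plus arbitrary metadata fields.
    predicate: optional function(row) -> bool acting as a metadata filter.
    """
    candidates = (r for r in rows if predicate is None or predicate(r))
    return heapq.nsmallest(
        k, candidates, key=lambda r: math.dist(query, r["vector"])
    )

# Hypothetical rows: toy vectors plus one metadata field.
rows = [
    {"id": 1, "vector": [0.3, 0.4], "lang": "en"},
    {"id": 2, "vector": [0.3, 0.9], "lang": "en"},
    {"id": 3, "vector": [0.5, 0.7], "lang": "de"},
    {"id": 4, "vector": [0.5, 0.2], "lang": "en"},
]
query = [0.5, 0.7]

unfiltered_ids = [r["id"] for r in top_k_search(query, rows, k=2)]
hit_ids = [
    r["id"]
    for r in top_k_search(query, rows, k=2, predicate=lambda r: r["lang"] == "en")
]
print(unfiltered_ids, hit_ids)  # [3, 2] [2, 1]
```

The filter removes row 3 before ranking, so the filtered result still returns two hits. A vector database does this kind of filtered ranking over an index instead of a full scan.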

Milvus’ fully distributed architecture is designed for scalability and performance
[Architecture diagram, components as labeled:
Access Layer: SDK, Load Balancer, Proxy
Coordinator Service: Root, Query, Data, Index
Worker Node: Query Node, Data Node, Index Node
Storage: Meta Storage (etcd), Message Storage (Log Broker), Object Storage (Minio / S3 / AzureBlob) holding Log Snapshot, Delta File, Index File
Flows: DDL/DCL, DML, NOTIFICATION, CONTROL SIGNAL, QUERY, DATA]

Milvus: From Dev to Prod
AI Powered Search made easy
Milvus is an Open-Source Vector Database to store, index, manage, and use the massive number of embedding vectors generated by deep neural networks and LLMs.
267+ contributors
27K+ stars
25M+ downloads
2K+ forks

…powers searches across various types of unstructured data

Retrieval Augmented Generation (RAG)
Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications.

Recommender System
Match user behavior or content features with other similar ones to make effective recommendations.

Text/Semantic Search
Search for semantically similar texts across vast amounts of natural language documents.

Image Similarity Search
Identify and search for visually similar images or objects from a vast collection of image libraries.

Video Similarity Search
Search for similar videos, scenes, or objects from extensive collections of video libraries.

Audio Similarity Search
Find similar audio in large datasets for tasks like genre classification or speech recognition.

Molecular Similarity Search
Search for similar substructures, superstructures, and other structures for a specific molecule.

Anomaly Detection
Detect data points, events, and observations that deviate significantly from the usual pattern.

Multimodal Similarity Search
Search over multiple types of data simultaneously, e.g. text and images.

Vector Database Resources
Give Milvus a Star!

Chat with me on Discord!
https://github.com/milvus-io/milvus

Unstructured Data Meetup

https://www.meetup.com/unstructured-data-meetup-new-york/

This meetup is for people working with unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.
This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.

https://medium.com/@tspann/unstructured-data-processing-with-a-raspberry-pi-ai-kit-c959dd7fff47
Raspberry Pi AI Kit Hailo
Edge AI

https://medium.com/@tspann/unstructured-street-data-in-new-york-8d3cde0a1e5b

https://medium.com/@tspann/not-every-field-is-just-text-numbers-or-vectors-976231e90e4d

https://medium.com/@tspann/shining-some-light-on-the-new-milvus-lite-5a0565eb5dd9

Extracting Value from Unstructured Data
Example
•A company has 100,000+ pages of proprietary documentation to enable their staff to service customers.
Problem
•Searching can be slow, inefficient, or lack context.
Solution
•Create an internal chatbot with ChatGPT and a vector database enriched with company documentation to provide direction and support to employees and customers.
https://osschat.io/chat
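
The retrieval half of such a chatbot can be sketched in a few lines of Python. Everything here is a stand-in: `embed` is a hypothetical bag-of-words substitute for a real embedding model, the in-memory `index` list substitutes for a vector database, and the sample documents are invented. The resulting prompt would be sent to an LLM, which is not shown.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented sample documentation; a real system would load company docs.
docs = [
    "reset the router by holding the power button for ten seconds",
    "invoices are emailed on the first day of each month",
    "replace the printer toner when the warning light blinks",
]
index = [(doc, embed(doc)) for doc in docs]  # stand-in for a vector database

def retrieve(question, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question):
    """Combine retrieved context with the question for the LLM."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "how do I reset the router"
prompt = build_prompt(question)
print(prompt)
```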

We provide deployment flexibility for different operational, security and compliance requirements

SELF MANAGED SOFTWARE
Milvus
Most widely-adopted open source vector database
Self hosted on any machine with community support
Local, Docker, K8s

FULLY MANAGED SERVICE
Zilliz Cloud
Milvus Re-engineered for the Cloud
Available on the leading public clouds

BRING YOUR OWN CLOUD
Zilliz BYOC
Enterprise-ready Milvus for Private VPCs
Deploy in your virtual private cloud

Coming Soon!

Well-connected in LLM infrastructure to enable RAG use cases
[Diagram labels: Framework; Hardware Infrastructure; Embedding Models; LLMs; Software Infrastructure; Vector Database]

Milvus Dependencies
https://zilliz.com/blog/Milvus-server-docker-installation-and-packaging-dependencies
Main Dependencies:
●FAISS (vector search)
●etcd (metadata store)
●Pulsar/Kafka (messaging)
●Tantivy (text search)
●RocksDB (storage)
●Object Storage (Minio/S3/GCS/Azure Blob Storage)
●Kubernetes (containerization)
●StorageClass & Persistent Volumes (Storage Management for etcd and Pulsar)
●Prometheus & Grafana (monitoring)
Docker Image Size: ~500MB
Release Frequency: ~1x per month, with frequent minor releases
SDKs Available: Python, Node, Go, C#, Java, Ruby
Python SDK Installation: pip install pymilvus
Version Compatibility: Ensure SDK and Milvus server versions match (major.minor)
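
The major.minor compatibility rule can be expressed as a small check. This is only a sketch: the helper names and the version strings below are made up for illustration, and how you obtain the actual SDK and server versions depends on your deployment.

```python
def major_minor(version):
    """Extract (major, minor) from a version string like "2.4.1" or "v2.4.1"."""
    parts = version.strip().lstrip("v").split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    return int(parts[0]), int(parts[1])

def versions_compatible(sdk_version, server_version):
    """True when SDK and server agree on major and minor version."""
    return major_minor(sdk_version) == major_minor(server_version)

# Hypothetical version strings, for illustration only.
print(versions_compatible("2.4.1", "v2.4.9"))  # True
print(versions_compatible("2.3.7", "2.4.0"))   # False
```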

…different types of data and schemas need to be thoroughly planned ahead of time

Why Not Vector Search Libraries?
•Search Quality - Hybrid Search? Filtering?
•Scalability - Billions of vectors?
•Multi tenancy - Isolating Multi-Tenant data
•Cost - Memory, disk, S3?
•Security - Data Safety and Privacy

TL;DR: Vector search libraries lack the infrastructure to help you scale, deploy, and manage your apps in production.


What is Milvus/Zilliz ideal for?
•Advanced filtering
•Hybrid search
•Multi-vector Search
•Durability and backups
•Replication/High Availability
•Sharding
•Aggregations
•Lifecycle management
•Multi-tenancy
•High query load
•High insertion/deletion
•Full precision/recall
•Accelerator support (GPU, FPGA)
•Billion-scale storage

Purpose-built to store, index and query vector embeddings from unstructured data at scale.

Takeaway:
Vector Databases are purpose-built to handle indexing, storing, and querying vector data.

Milvus & Zilliz are specifically designed for high performance and billion+ scale use cases.

Inverted File Index
Source: https://towardsdatascience.com/similarity-search-with-ivfpq-9c6348fd4db3
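
The slide only names the technique, so the following is a rough sketch of the general inverted file (IVF) idea, not of any particular library's implementation. Vectors are grouped under their nearest centroid, and a query scans only the `nprobe` closest groups instead of the whole dataset. The centroids here are fixed by hand as an assumption; real indexes typically learn them with k-means. The class and parameter names are illustrative.

```python
import math

def nearest(point, candidates, n=1):
    """Indices of the n candidates closest to point (L2 distance)."""
    order = sorted(
        range(len(candidates)), key=lambda i: math.dist(point, candidates[i])
    )
    return order[:n]

class IVFIndex:
    """Toy inverted file index: one list of vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = {i: [] for i in range(len(centroids))}

    def add(self, vec_id, vector):
        # Store the vector in the list of its nearest centroid.
        cell = nearest(vector, self.centroids)[0]
        self.lists[cell].append((vec_id, vector))

    def search(self, query, k, nprobe=1):
        # Scan only the nprobe cells whose centroids are closest to the query.
        cells = nearest(query, self.centroids, nprobe)
        candidates = [item for c in cells for item in self.lists[c]]
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]

# Hand-picked centroids (an assumption; normally learned from the data).
index = IVFIndex(centroids=[[0.0, 0.0], [1.0, 1.0]])
vectors = {
    "woman": [0.3, 0.4],
    "queen": [0.3, 0.9],
    "king": [0.5, 0.7],
    "man": [0.5, 0.2],
}
for name, vec in vectors.items():
    index.add(name, vec)

result = index.search([0.5, 0.7], k=2, nprobe=1)
wider = index.search([0.5, 0.7], k=3, nprobe=2)
print(result, wider)  # ['king', 'queen'] ['king', 'queen', 'woman']
```

Raising `nprobe` scans more cells, which trades speed for recall: with `nprobe=1` the search never sees "woman" or "man" because they sit in the other cell.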
