Multimodal Pipelines for AI Apps: Journey To Day 2

chloewilliams62 186 views 18 slides Oct 10, 2024

About This Presentation

After organizations identify and prototype high-value use cases for GenAI, the journey to production begins. Key operational concerns emerge on this journey to Day 2, including scalability, governance, automation, error handling, and more.

Data pipelines that support AI apps will need to provide fe...


Slide Content

datavolo.io Datavolo Overview
Datavolo
Multimodal Data Pipelines for AI
The Journey to Day 2
1

Observations from a16z about successful AI apps


●Just having an LLM API isn't enough to deploy AI solutions at
scale
●A multimodal future for AI requires data pipeline flexibility
●Retrieval augmented generation (RAG) brings needed context to
closed models
●Using one’s proprietary data provides more Enterprise value
●Evaluating the quality of retrieval is hard, especially in terms of
productivity or efficiency ROI
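
To make the RAG observation concrete, here is a minimal sketch of the prompt-assembly step: retrieved passages are placed ahead of the user's question so a closed model can answer from them. The function name `build_rag_prompt`, the `max_context_chars` budget, and the instruction wording are all assumptions for illustration; they do not come from the slides.

```python
def build_rag_prompt(question: str, passages: list[str], max_context_chars: int = 2000) -> str:
    """Assemble a prompt that places retrieved passages ahead of the question."""
    context_lines = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_context_chars:
            break  # stay within the (hypothetical) context budget
        context_lines.append(entry)
        used += len(entry)
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The passages would normally come from a retrieval step over proprietary data; here they are simply passed in as strings.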
2

What Makes Data Pipelines for AI Systems Unique?

●GenAI Systems are struggling to remain grounded: LLMs are built to hallucinate (i.e.,
to generate novel content), so it is critical to ground them with secure, trusted,
internal data.

●Data is predominantly unstructured and multimodal: While enterprises have some
level of maturity with handling structured data, 90% of enterprise data is
unstructured. This is what modern AI can unlock.

●Current ETL/ELT tools are not able to properly extract, prep, or deliver these
complex data types: AI output is only as good as the data inputs. The preparation
(parsing, chunking, embedding) of the data inputs in a multimodal world is critical.
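
As an illustration of the chunking step named above, the following is a minimal sketch of one common strategy: fixed-size word windows with overlap, so that text cut at a boundary still appears whole in a neighboring chunk. The `Chunk` type, the `chunk_text` function, and the default sizes are hypothetical and are not taken from Datavolo or NiFi.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str


def chunk_text(doc_id: str, text: str, max_words: int = 50, overlap: int = 10) -> list[Chunk]:
    """Split text into overlapping word windows (a simple fixed-size strategy)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(Chunk(doc_id, len(chunks), " ".join(window)))
        if start + max_words >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

Each resulting chunk would then be passed to an embedding model. Real pipelines often chunk on document structure (headings, pages, table boundaries) rather than raw word counts.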

3

Datavolo and AI Systems
Where does Datavolo fit in the modern AI stack?
4

[Diagram: the modern AI stack. Labeled components: Enterprise Data and Documents; Data Pipelines: Acquire, Chunk, Enrich, Transform, Load; Data Pipelines: Schedule | Validate | Return; Observability: Trace | Log | Govern; Embedding Models; Vector Store; Relational DB; Object Store; Functional Models; API / Plugins; Application Frameworks; Prompt Engineering; Agent APIs; LLM Apps]
5

Datavolo: Enabling the 10x Data Engineer
What's the solution?
6

Datavolo is powered by Apache NiFi

●Proven at Massive Scale: ~8K Enterprises run Apache NiFi today, with a density in
Financial Services, Telco, and Defense: the largest and most secure organizations in the
world.

●Endless Optionality: Freedom to integrate with any system as source or destination. Plug
and play best-of-breed components/processors at any point in a data pipeline.

●Observability and Security as Table Stakes: NiFi was built with security and governance
as foundational principles. Every step of every pipeline is observable and audited out of
the box.

Datavolo is brought to you by the creators and core engineers of Apache NiFi!

7

Datavolo is powered by Apache NiFi… and extended!

●Cloud-Native Refresh: Delivered with cloud-native foundational principles across
AWS, GCP, Snowflake Snowpark, private cloud, and more! Containerization
delivered natively, addressing the community's most-asked-about questions.

●Simplified Security: Simplified setup of secure NiFi clusters, including certificate
management and user/access management.

●Extended for Needs of Modern AI Systems: Datavolo is built to enable modern AI
System architectures. New integrations, new processors, new frameworks.



8

CDC for Unstructured Data (including metadata!)

●Secure, Continuous Ingest of Unstructured Data: Commonly this is data stored in
SharePoint, OneDrive, Google Drive, etc. Continuous ingest keeps AI Systems
up to date with the latest data (including file updates).

●CDC on ACLs!: Continuously keeping security access control in sync across
source systems and AI Systems. For example, ensuring that if someone loses
access in SharePoint, they also lose access to that data through an
LLM.

●Unlock Data for AI... but also for any analytics: Datavolo allows
unstructured data to be ingested and stored in formats accessible not only via
LLMs but also through traditionally accessed tables, document stores, etc.
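
The ACL synchronization idea above can be sketched as follows, assuming each indexed chunk carries the set of users allowed to read its source file and that the source system emits an event whenever that set changes. The class and method names (`AclAwareIndex`, `apply_acl_event`, `retrieve`) are hypothetical, and the keyword match is only a stand-in for vector search; none of this is Datavolo's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedChunk:
    source_path: str
    text: str
    allowed_users: set[str] = field(default_factory=set)


class AclAwareIndex:
    """Toy in-memory index; a real system would keep these fields in a vector store."""

    def __init__(self) -> None:
        self._chunks: list[IndexedChunk] = []

    def add(self, chunk: IndexedChunk) -> None:
        self._chunks.append(chunk)

    def apply_acl_event(self, source_path: str, allowed_users: set[str]) -> int:
        """Replace the ACL on every chunk derived from source_path; return how many changed."""
        updated = 0
        for chunk in self._chunks:
            if chunk.source_path == source_path:
                chunk.allowed_users = set(allowed_users)
                updated += 1
        return updated

    def retrieve(self, user: str, keyword: str) -> list[str]:
        """Return matching chunk texts that the user is still allowed to read."""
        return [
            c.text for c in self._chunks
            if user in c.allowed_users and keyword.lower() in c.text.lower()
        ]
```

The design point is that the permission check happens at retrieval time against ACLs that are kept current, so a revoked user never has the restricted text placed into an LLM prompt.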



9

Key Data Platform Partnerships
Databricks
10

Mission: Helping organizations make sense of unstructured data.
Founded: 2017
Raised: $113M
Employees: 140
Headquarters: Redwood City, CA
11

Deployment flexibility for different operational, security and compliance requirements

BRING YOUR OWN CLOUD
Zilliz BYOC: Enterprise-ready Milvus for Private VPCs
Deploy in your virtual private cloud

FULLY MANAGED SERVICE
Zilliz Cloud: Milvus Re-engineered for the Cloud
Available on the leading public clouds

Coming Soon!

SELF MANAGED SOFTWARE
Milvus: Most widely-adopted open source vector database
Self hosted on any machine with community support
Lite | Docker | K8s
12

Mix and match cluster type across your organization

Serverless
For serverless applications with variable or infrequent traffic
CU options: NA
Scale: Auto-scale
Uptime SLAs: NA
Availability Zone: Single AZ
Security & Compliance: Data Encryption, RBAC, SOC 2 Type II, GDPR Ready, HIPAA Ready
Support: Email support during business hours with response time SLAs; up to 2 technical contacts

Dedicated Standard
For POC and development environment with advanced configuration controls
CU options: 3 options to balance the performance, storage and cost needs
Scale: One-click scale, up to 32 CUs
Uptime SLAs: NA
Availability Zone: Single AZ
Security & Compliance: Data Encryption, RBAC, SOC 2 Type II, GDPR Ready, HIPAA Ready
Support: Email support during business hours with response time SLAs; up to 2 technical contacts

Dedicated Enterprise
For mission-critical use cases with advanced security needs
CU options: 3 options to balance the performance, storage and cost needs
Scale: One-click scale, up to 256 CUs
Uptime SLAs: 99.95%
Availability Zone: Multi AZ
Security & Compliance: Data Encryption, Advanced RBAC, SOC 2 Type II, GDPR Ready, HIPAA Ready, Private Networking
Support: Email support during business hours with higher response time SLAs; up to 4 technical contacts

Provision your vector databases within minutes
13

Industry-wide trust and recognition

Focus and vision
Creators of Milvus with 6+ years dedicated to building the most widely-adopted vector database
More than 5,000 enterprises have adopted Milvus for their vector search use cases
Advised on hundreds of real-world vector database deployments

Community Leaders
Monthly Unstructured Data meetups across the US and Discord community
24K (and counting) GitHub Stars, leading all purpose-built vector databases

Industry Recognition
Recognized as the fastest vector database by ANN Benchmark
Listed in Generative AI 50 by CBInsights
Listed in Trend-Setting Products by DBTA
Listed in Top 100 Next Gen Companies by World Future Awards
Listed companies driving innovation in Big Data 75

Milvus: The most widely-adopted vector database
Milvus is an Open-Source Vector Database to store, index, manage, and use the massive number of embedding vectors generated by deep neural networks and LLMs.
Contributors: 400
Stars: 29K
Docker pulls: 66M
Forks: 2.7K
15
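
To show what searching embedding vectors means at its simplest, here is a brute-force cosine-similarity sketch in plain Python. It is not the Milvus API: the function names and sample vectors are illustrative assumptions, and a vector database replaces this linear scan with approximate nearest-neighbor indexes so that search stays fast over very large collections.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best matches."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

In a RAG pipeline the keys would identify document chunks and the query vector would be the embedding of the user's question.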

Getting Started with Vector Databases

Milvus (Open Source, Self-Managed): github.com/milvus-io/milvus
29K - Star us on GitHub!

Zilliz Cloud (SaaS, Fully-Managed): zilliz.com/cloud
16

THANK YOU
