This talk focuses on how to collect data from a variety of sources, how to leverage that data for RAG and other GenAI use cases, and finally how to chart your course to production.
Added: Jun 13, 2024
Slides: 44 pages
Slide Content
Fueling AI
with Great Data
Staff Software
Engineer
AJ Steers
Jun 13, 2024
•Introduction to GenAI Data Sourcing
•Building Solid Foundations with ELTP
•Data from Anywhere with just enough code
•PyAirbyte and Milvus Lite: Better Together
•Getting from Prototype to Production
2
Our journey today…
Introductions
3
4
Intro to Myself (“AJ”)
“I build data products.”
5
INTRO TO AIRBYTE
Covers the long tail of connectors
Airbyte has 330+ high-quality connectors thanks to
its participative model.
Extensible and non-opinionated to
address your exact needs
You can build new connectors or edit any pre-built connector to fit
your specific needs. Airbyte also integrates with your data stack.
Fair usage-based pricing
Volume is well known across data warehouses. It is
predictable and scales well with database replication. Airbyte
can be up to 10x cheaper than alternatives.
6
INTRO TO
Support for unstructured
document sources
Vector store destinations
Run Anywhere
Largest data connectors catalog
•285+ Total Sources
•135+ No-Code Sources (YAML)
•250+ Python-Based Sources
•40+ Destinations
•10+ Vector Destinations
Largest data movement community
•6,000+ Daily active users
•150K+ Deployments
•2+ PB Synced per month
•1,000+ PRs merged
•800+ Code contributors
9
INTRO TO
Airbyte OSS
Open Source, Deploy Anywhere via K8s
Airbyte Enterprise
Self-Managed, with Airbyte Support
Airbyte Cloud
Data Movement as a Service
10
INTRO TO
Airbyte OSS
Open Source, Deploy Anywhere via K8s
Airbyte Enterprise
Self-Managed, with Airbyte Support
Airbyte Cloud
Data Movement as a Service
PyAirbyte
Run Anywhere
11
INTRO TO
GenAI
✓All the data for your AI app.
✓Interop with GenAI Python libraries.
✓Generate LLM documents.
Machine Learning
✓Get source data in minutes.
✓Integrate with Pandas.
✓Runs in a Notebook.
Data Warehousing and Analytics
✓Supports SQL natively.
✓Streamlines path to production.
Data Engineering
✓Scale to large data volumes.
✓Incremental sync built-in.
✓Built on ELT best practices.
Break down silos between data teams
12
INTRO TO MILVUS
●Powerful open source vector store
●Scalable and Elastic Architecture
●Diverse Index Support
●Versatile Search Capabilities
●Built-in Staleness Handling
●Hardware-Accelerated Compute
13
INTRO TO
Developer-Friendly - Get Started in Seconds
14
Data Pipeline
Design Principles
15
No code or low-code?
❔❔❔
16
No code or low-code?
> “Everything should be as simple as possible, but no simpler.”
- Albert Einstein
17
Simple beats complex
The goal is to design resilient, composable pipelines in which
each step is simple and obvious. Things should break in
predictable ways, with similarly obvious and easy
remediations.
18
Future-Proofing Your Data Pipeline
Extract, Transform, Load (ETL)
19
Future-Proofing Your Data Pipeline
Extract, Transform, Load (ETL)
Extract, Load, Transform (ELT)
20
Future-Proofing Your Data Pipeline
Extract, Transform, Load (ETL)
Extract, Load, Transform (ELT)
Extract, Load, Transform, Publish (ELTP)
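The ELTP ordering can be sketched in plain Python. This is only an illustration under stated assumptions, not Airbyte's implementation: the function names, the in-memory SQLite store, the `state` column, and the reading of "Publish" as handing transformed rows to a downstream consumer are all hypothetical.

```python
import sqlite3


def extract():
    """Extract: yield raw records exactly as the source provides them."""
    # Hypothetical sample records; the `state` field is invented for the demo.
    yield {"id": 123, "title": "Broken Widget on Product Page", "state": "open"}
    yield {"id": 124, "title": "Feature Request: New Interactive UI Model", "state": "closed"}


def load(records, conn):
    """Load: store the raw records untouched so later steps can be re-run."""
    conn.execute("CREATE TABLE raw_issues (id INTEGER, title TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO raw_issues VALUES (:id, :title, :state)", list(records)
    )


def transform(conn):
    """Transform: derive a clean table from the raw data, inside the store."""
    conn.execute(
        "CREATE TABLE open_issues AS "
        "SELECT id, title FROM raw_issues WHERE state = 'open'"
    )


def publish(conn):
    """Publish: hand the transformed rows to a downstream consumer."""
    rows = conn.execute("SELECT id, title FROM open_issues ORDER BY id")
    return [f"Issue {issue_id}: {title}" for issue_id, title in rows]


def run_pipeline():
    """Run the four stages in ELTP order against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(extract(), conn)
    transform(conn)
    return publish(conn)
```

Because the raw table is loaded before anything is transformed, the transform and publish steps can be changed and re-run without extracting again.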
21
A scalable and extensible framework
22
NOW... LET’S GET STARTED!
Introducing
PyAirbyte
✓Hundreds of Airbyte source connectors
✓The ability to create your own source connectors with no coding
✓A library you can pip install anywhere, including notebooks!
✓Your choice of production deployment paths:
Airbyte Cloud, OSS, or Self-Hosted Enterprise
23
DATA FROM ANYWHERE,
IN MINUTES, NOT DAYS
Data from Anywhere in 3 Steps
Step 1: Create a Source using get_source()
Data from Anywhere in 3 Steps
Step 2: Configure with set_config()
Data from Anywhere in 3 Steps
Step 3: Read the data using read()
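The code shown on these slides is not preserved in this transcript. A minimal sketch of the three steps with PyAirbyte might look like the following; the `source-faker` connector, its `count` setting, and the wrapper function name are assumptions chosen for illustration.

```python
def read_three_steps(count: int = 100):
    """Run the three steps against the demo `source-faker` connector."""
    # Imported inside the function so this module still loads where
    # PyAirbyte is not installed (pip install airbyte).
    import airbyte as ab

    # Step 1: create a source by connector name.
    source = ab.get_source("source-faker")

    # Step 2: configure it. `count` is an assumed source-faker setting.
    source.set_config({"count": count})
    source.check()  # verify the configuration before reading

    # Step 3: select streams and read the data into the local cache.
    source.select_all_streams()
    return source.read()
```

The returned read result gives access to the streams that were read, for example for conversion to a Pandas DataFrame.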
Data from Anywhere in 3 Steps
The “Speedrun” Version
1 Step
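As a rough sketch of the one-step version, the same read can be collapsed into a single chained call. The connector name, the `count` setting, and the function name are again assumptions for illustration; `streams="*"` selects every stream.

```python
def speedrun(count: int = 100):
    """Create, configure, and read a source in one chained expression."""
    import airbyte as ab  # pip install airbyte

    return ab.get_source(
        "source-faker",           # assumed demo connector
        config={"count": count},  # assumed source-faker setting
        streams="*",              # select all available streams
    ).read()
```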
Full Control in (Not a Lot of) Code
29
Choose Airbyte Cloud, Airbyte OSS, or
Airbyte Enterprise for:
✓Ease-of-Use
✓Friendly UI
✓Redundancy
✓Peace of Mind
29
Migrate to Airbyte Cloud for
Zero-Code Load to Vector Stores
DEMO
Data extraction and exploration…
30
31
DEMO Script
Simple data extraction demo
Get data from anywhere - show the list of sources for YAML,
Docker, and Python.
32
CONVERTING RECORDS TO DOCUMENTS
Record Format
Issue  Title                                      Description  Created     Updated
123    Broken Widget on Product Page              …            2024-01-04  2024-02-23
124    Feature Request: New Interactive UI Model  …            2024-01-15  2024-02-12
…      …                                          …            …           …
Document Format
# Broken Widget on Product Page
```
Issue: 123
Created: 2024-01-04
```
{description}
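The record-to-document mapping above can be sketched as a small pure-Python function. This is an illustration of the layout on the slide, not PyAirbyte's own converter; the function name and the dictionary keys are hypothetical.

```python
def record_to_document(record: dict) -> str:
    """Render one issue record in the document layout shown on the slide."""
    fence = "`" * 3  # the metadata block is wrapped in a code fence
    return "\n".join(
        [
            f"# {record['title']}",            # title becomes the heading
            fence,
            f"Issue: {record['issue']}",       # selected fields become metadata
            f"Created: {record['created']}",
            fence,
            record["description"],             # description becomes the body
        ]
    )
```

Putting the title and description into the text body while keeping identifiers and dates in a separate block gives an embedding model the prose it needs without losing the fields used for filtering.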
DEMO
Converting Records to Documents
33
34
DEMO Script
GitHub Records to RAG Demo:
-Show records from last read() operation.
-Export to Documents.
-Show LangChain code to ingest those
documents for chunking, embedding, etc.
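The LangChain code from the demo is not part of this transcript. As a stand-in, the chunking step can be sketched with a plain fixed-size splitter with overlap; this is a simplified substitute for a LangChain text splitter, and the function name and default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.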
DEMO
Building Python prototypes that scale
35
36
DEMO Script
-Return to the notebook
-Continue to load in RAG and run sample query.
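The load-and-query step might look roughly like this with Milvus Lite through `pymilvus`. The demo notebook itself is not in this transcript, so the collection name, the database file name, and especially `toy_embed` (a hash-based placeholder for a real embedding model) are assumptions for illustration only.

```python
import hashlib


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for a real embedding model (not semantic)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def index_and_query(chunks, question, db_path="milvus_demo.db", dim=8):
    """Store chunks in a local Milvus Lite file and search for a question."""
    from pymilvus import MilvusClient  # pip install pymilvus

    client = MilvusClient(db_path)  # a local file path selects Milvus Lite
    if client.has_collection("demo_docs"):
        client.drop_collection("demo_docs")  # start clean on re-runs
    client.create_collection(collection_name="demo_docs", dimension=dim)

    rows = [
        {"id": i, "vector": toy_embed(chunk, dim), "text": chunk}
        for i, chunk in enumerate(chunks)
    ]
    client.insert(collection_name="demo_docs", data=rows)

    # Return the nearest chunks together with their stored text.
    return client.search(
        collection_name="demo_docs",
        data=[toy_embed(question, dim)],
        limit=2,
        output_fields=["text"],
    )
```

In a real RAG pipeline the placeholder would be replaced by an embedding model, and the returned text would be passed to the LLM as context for the question.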
37
What happens now depends upon your choices earlier…
How to get to production?
38
What happens next depends upon your choices…
Deploy now!
And move on to new
adventures.
Go back to the
beginning.
Start again on the
“production” solution.
How to get to production?
39
Benefits of Building with ELTP and Airbyte
The Old Way:
Rebuild from Scratch, Cross your Fingers
○Recreate notebook-based transformations.
○Migrate to a new tool or a new language after the prototype.
○Pass the prototype to your IT team, which will (probably) rebuild it differently.
○Retest everything from scratch and fix the new bugs.
The Airbyte Way:
Promote what Works, Drop the Rest
✓Seamless migration from PyAirbyte to Airbyte Cloud or Self-Managed K8s.
✓Same schema regardless of how you deploy.
✓Hand off to your IT team without changing the pipelines or switching tools.
✓Your transformations and tests carry over after deployment.
DEMO
From Prototype to Production
40
41
Recap &
Wrap Up
To get to production faster:
Start with a tool and a set of design principles
that will see you all the way to the finish.
42
RECAPPING LESSONS LEARNT