Fueling AI with Great Data with Airbyte Webinar

chloewilliams62 131 views 44 slides Jun 13, 2024

About This Presentation

This talk focuses on how to collect data from a variety of sources, how to leverage that data for RAG and other GenAI use cases, and finally how to chart your course to production.


Slide Content

Fueling AI
with Great Data
Staff Software Engineer
AJ Steers
Jun 13, 2024

•Introduction to GenAI Data Sourcing
•Building Solid Foundations with ELTP
•Data from Anywhere with just enough code
•PyAirbyte and Milvus Lite: Better Together
•Getting from Prototype to Production
2
Our journey today…

Introductions


3

4
Intro to Myself (“AJ”)
“I build data products.”

5
INTRO TO AIRBYTE
Covers the long tail of connectors

Airbyte has 330+ high-quality connectors, thanks to
its participative model.

Extensible and non-opinionated to
address your exact needs

You can build new connectors or edit any pre-built connector to fit
your specific needs. Airbyte also integrates with your data stack.

Fair, usage-based pricing

Volume-based pricing is familiar from data warehouses. It is
predictable and scales well with database replication. Airbyte
can be up to 10x cheaper than alternatives.

6
INTRO TO AIRBYTE
Support for unstructured
document sources

Vector store destinations

Run Anywhere

Largest data connectors catalog
135+ No-Code Sources (YAML)
285+ Total Sources
250+ Python-Based Sources
10+ Vector Destinations
40+ Destinations

Largest data movement community
6,000+ Daily active users
150K+ Deployments
2+ PB Synced / month
1,000+ PRs merged
800+ Code contributors

9
INTRO TO AIRBYTE
Airbyte OSS
Open Source, Deploy Anywhere via K8s
Airbyte Enterprise
Self-Managed, with Airbyte Support
Airbyte Cloud
Data Movement as a Service

10
INTRO TO AIRBYTE
Airbyte OSS
Open Source, Deploy Anywhere via K8s
Airbyte Enterprise
Self-Managed, with Airbyte Support
Airbyte Cloud
Data Movement as a Service

PyAirbyte
Run Anywhere

11
INTRO TO
GenAI

✓All the data for your AI app.
✓Interop with GenAI Python libraries.
✓Generate LLM documents.

Machine Learning

✓Get source data in minutes.
✓Integrate with Pandas.
✓Runs in a Notebook.
Data Warehousing and Analytics

✓Supports SQL natively.
✓Streamlines path to production.


Data Engineering

✓Scale to large data volumes.
✓Incremental sync built-in.
✓Built on ELT best practices.
Break down silos between data teams

12
INTRO TO MILVUS
●Powerful open source vector store
●Scalable and Elastic Architecture
●Diverse Index Support
●Versatile Search Capabilities
●Built-in Staleness Handling
●Hardware-Accelerated Compute

13
INTRO TO
Developer-Friendly - Get Started in Seconds

14
Data Pipeline
Design Principles

15
No code or low-code?
❔❔❔

16
No code or low-code?
“Everything should be as simple
as possible, but no simpler.”
- Albert Einstein

17
Simple beats complex
The goal is to design resilient, composable pipelines, where each
step in the pipeline is simple and obvious. Things should
break in predictable ways, resulting in similarly obvious and
easy remediations.

18
Future-Proofing Your Data Pipeline
Extract → Transform → Load

19
Future-Proofing Your Data Pipeline
Extract → Transform → Load
Extract → Load → Transform

20
Future-Proofing Your Data Pipeline
Extract → Transform → Load
Extract → Load → Transform
Extract → Load → Transform → Publish

21
A scalable and extensible framework
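
The Extract, Load, Transform, Publish flow from the preceding slides can be sketched as four small, composable steps. This is a minimal illustration, not code from the talk: it uses only the Python standard library, an in-memory SQLite database stands in for the warehouse, the sample records are invented, and "publish" is assumed here to mean handing the transformed rows to a downstream consumer such as a vector store.

```python
import sqlite3
from typing import Dict, Iterable, List


def extract() -> Iterable[Dict]:
    """Extract: yield raw records from a (hypothetical) source, untouched."""
    yield {"id": 123, "title": "  Broken Widget on Product Page "}
    yield {"id": 124, "title": "Feature Request: New Interactive UI Model"}


def load(conn: sqlite3.Connection, records: Iterable[Dict]) -> None:
    """Load: land the raw records as-is, before any cleanup."""
    conn.execute("CREATE TABLE raw_issues (id INTEGER, title TEXT)")
    conn.executemany("INSERT INTO raw_issues VALUES (:id, :title)", list(records))


def transform(conn: sqlite3.Connection) -> None:
    """Transform: derive a clean table from the raw one, using SQL."""
    conn.execute(
        "CREATE TABLE issues AS SELECT id, TRIM(title) AS title FROM raw_issues"
    )


def publish(conn: sqlite3.Connection) -> List[Dict]:
    """Publish: hand the transformed rows to a downstream consumer (assumed meaning)."""
    rows = conn.execute("SELECT id, title FROM issues ORDER BY id").fetchall()
    return [{"id": issue_id, "title": title} for issue_id, title in rows]


def run_pipeline() -> List[Dict]:
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    return publish(conn)


published = run_pipeline()
print(published)
```

Because the raw table is kept, the transform and publish steps can be changed and re-run without extracting again, which is one plausible reading of "future-proofing" here.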

22
NOW... LET’S GET STARTED!

Introducing
PyAirbyte

✓Hundreds of Airbyte source connectors
✓The ability to create your own source connectors with no coding
✓A library you can pip install anywhere, including notebooks!
✓Your choice of production deployment paths:
Airbyte Cloud, OSS, or Self-Hosted Enterprise
23
DATA FROM ANYWHERE,
IN MINUTES, NOT DAYS

Data from Anywhere in 3 Steps
Step 1: Create a Source using get_source()

Data from Anywhere in 3 Steps
Step 2: Configure with set_config()

Data from Anywhere in 3 Steps
Step 3: Read the data using read()

Data from Anywhere in 3 Steps
The “Speedrun” Version
1 Step
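
The three steps above, and the one-step variant, can be sketched as follows. This is a minimal sketch rather than the code shown in the talk. It assumes `pip install airbyte`, uses the `source-faker` sample connector and a `count` setting as stand-ins for whatever source was demoed, and the function names `read_in_three_steps` and `read_speedrun` are made up for illustration. Passing `config=` and `streams=` directly to `get_source()` is assumed to be supported by the installed PyAirbyte version.

```python
# Illustrative settings: the connector name and config are assumptions,
# not taken from the slides.
SOURCE_NAME = "source-faker"
SOURCE_CONFIG = {"count": 1000}


def read_in_three_steps():
    """Create, configure, and read a source in three explicit steps."""
    import airbyte as ab  # imported lazily; requires `pip install airbyte`

    # Step 1: create a source using get_source()
    source = ab.get_source(SOURCE_NAME, install_if_missing=True)

    # Step 2: configure it with set_config()
    source.set_config(SOURCE_CONFIG)
    source.check()               # validate the configuration
    source.select_all_streams()  # choose which streams to sync

    # Step 3: read the data using read()
    return source.read()


def read_speedrun():
    """The one-step version: everything passed to get_source(), then read()."""
    import airbyte as ab  # imported lazily; requires `pip install airbyte`

    return ab.get_source(
        SOURCE_NAME,
        config=SOURCE_CONFIG,
        streams="*",  # assumed shorthand for selecting all streams
        install_if_missing=True,
    ).read()
```

Either function returns a read result. Its `streams` mapping can be iterated to inspect each stream's records.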

Full Control in (Not a Lot of) Code

29

Choose Airbyte Cloud, Airbyte OSS, or
Airbyte Enterprise for:

✓Ease-of-Use
✓Friendly UI
✓Redundancy
✓Peace of Mind

Migrate to Airbyte Cloud for
Zero-Code Load to Vector Stores

DEMO


Data extraction and exploration…
30

31
DEMO Script
Simple data extraction demo

Get data from anywhere - show list of sources for yaml,
docker, and python.

32
CONVERTING RECORDS TO DOCUMENTS

Record Format
Issue | Title | Description | Created | Updated
123 | Broken Widget on Product Page | … | 2024-01-04 | 2024-02-23
124 | Feature Request: New Interactive UI Model | … | 2024-01-15 | 2024-02-12
… | … | … | … | …

Document Format
# Broken Widget on Product Page

```
Issue: 123
Created: 2024-01-04
```

{description}
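
The record-to-document mapping on this slide can be sketched in plain Python. This is a hand-rolled illustration of the format shown, not PyAirbyte's own conversion code. The function name `record_to_document` and the dictionary keys are assumptions, and the description text is a placeholder because the slide elides it.

```python
FENCE = "`" * 3  # the three-backtick fence that wraps the metadata block


def record_to_document(record: dict) -> str:
    """Render one issue record as a Markdown-style document.

    Layout follows the slide: the title as a heading, selected metadata
    in a fenced block, then the description as the body.
    """
    lines = [
        f"# {record['title']}",
        "",
        FENCE,
        f"Issue: {record['issue']}",
        f"Created: {record['created']}",
        FENCE,
        "",
        record["description"],
    ]
    return "\n".join(lines)


# Sample record based on the slide; the description is a placeholder.
records = [
    {
        "issue": 123,
        "title": "Broken Widget on Product Page",
        "description": "(description text)",
        "created": "2024-01-04",
        "updated": "2024-02-23",
    },
]

documents = [record_to_document(r) for r in records]
print(documents[0])
```

Keeping the title and a few metadata fields inside the document text means that context travels with the content when the document is later chunked and embedded.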

DEMO


Converting Records to Documents
33

34
DEMO Script
GitHub Records to RAG Demo:
-Show records from last read() operation.
-Export to Documents.
-Show LangChain code to ingest those
documents for chunking, embedding, etc.
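
The chunking step mentioned above can be illustrated without any framework. The sketch below is a plain-Python stand-in, not the LangChain code shown in the demo: it splits text into fixed-size character windows with overlap. The function name `chunk_text` and the default sizes are arbitrary choices for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into fixed-size character chunks that overlap.

    Overlap keeps some shared context between neighboring chunks, so a
    sentence cut at a boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Production splitters usually break on paragraph or sentence boundaries rather than raw character offsets. The fixed-window version only shows the size and overlap trade-off.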

DEMO


Building Python prototypes that scale
35

36
DEMO Script

-Return to the notebook
-Continue to load in RAG and run sample query.
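
Loading documents into a local vector store and running a sample query might look like the sketch below. It is not the demo notebook's code. It assumes `pip install pymilvus` with Milvus Lite available, so that a local file path can be passed to `MilvusClient`. The collection name `issues`, the file name `milvus_demo.db`, and the helper names are hypothetical, and `toy_embed` is a deterministic placeholder with no semantic meaning that should be replaced with a real embedding model.

```python
import hashlib

DIM = 8  # vector size of the toy embedding below


def toy_embed(text: str) -> list:
    """Placeholder embedding: hash bytes scaled to [0, 1].

    NOT semantically meaningful; it only makes the example self-contained.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:DIM]]


def load_and_query(documents, question, db_path="milvus_demo.db"):
    """Insert documents into a local Milvus Lite file and search it."""
    from pymilvus import MilvusClient  # imported lazily; requires pymilvus

    client = MilvusClient(db_path)  # a local path selects Milvus Lite
    if client.has_collection("issues"):
        client.drop_collection("issues")  # start clean on re-runs
    client.create_collection(collection_name="issues", dimension=DIM)

    rows = [
        {"id": i, "vector": toy_embed(doc), "text": doc}
        for i, doc in enumerate(documents)
    ]
    client.insert(collection_name="issues", data=rows)

    # Return the two nearest documents to the question vector.
    return client.search(
        collection_name="issues",
        data=[toy_embed(question)],
        limit=2,
        output_fields=["text"],
    )
```

In a RAG flow, the `text` fields of the returned hits would be passed to the LLM as context alongside the user's question.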

37
What happens now depends upon your choices earlier…
How to get to production?

38
What happens next depends upon your choices…
Deploy now!

And move on to new
adventures.
Go back to the
beginning.

Start again on the
“production” solution.
How to get to production?

39
Benefits of Building with ELTP and Airbyte

The Old Way:
Rebuild from Scratch, Cross your Fingers
○Recreate Notebook-based transformations.
○Migrate to a new tool or a new language after the prototype.
○Pass the prototype to your IT team, which will (probably) rebuild it differently.
○Test everything over from scratch and fix the new bugs.

The Airbyte Way:
Promote what Works, Drop the Rest
✓Seamless migration from PyAirbyte to Airbyte Cloud or Self-Managed K8s.
✓Same schema regardless of how you deploy.
✓Hand off to your IT team without changing the pipelines or switching tools.
✓Your transformations and tests carry over after deployment.

DEMO


From Prototype to Production
40

41
Recap & Wrap Up

To get to production faster:

Start with a tool and set of design principles
that will see you all the way to the finish.
42
RECAPPING LESSONS LEARNT

43
Questions?

44



THANK YOU!


AJ Steers
Airbyte.com