Airbyte-ModernDataStack

knoldus 150 views 11 slides Jan 30, 2023
Slide 1
Slide 1 of 11
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11

About This Presentation

"Airbyte - Data acquisition for Modern Data Platforms
In this session, you will learn the basic tenets of a modern data platform. Subsequently, you will understand the role of Airbyte in building the data platform. We will deep dive into how Airbyte's plugin type architecture enables rapid ...


Slide Content

Presented By: Ram Indukuri
Airbyte - ELT in Modern
Data Stack

KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit a constructive
feedback for all sessions as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, feel free to move out of
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.

Our Agenda
01 The Holy Grail
02 Modern Data Stack
03 Transformations
04 Airbyte
05 Demo

Holy Grail of Modern Data Architecture - Data Mesh

Modern data stack
ELT Data Pipeline
Cloud Based Data Lake
Data Transformation Tool
Modern BI Tools
1
2
3
4

Modern data stack
1
2
3
4

# PYTHON
# connect to db using wrapper around psycopg2
db = DatabaseConnection(db='db', user='username', password='password')
# grab data from db and load into memory
df = db.run_query("SELECT * FROM cleaned_table;")
df = pd.DataFrame(df, columns=['user_id', 'series_id', 'timestamp'])
# calculate time since first visit
df = df.assign(time_since_first=df.groupby('user_id', sort=False).timestamp.apply(lambda x: x -
x.min()))
# calculate time since last visit
df = df.assign(time_since_last=df.sort_values(['timestamp'], ascending=True).groupby('user_id',
sort=False)['timestamp'].transform(pd.Series.diff))
# save df to compressed csv
df.to_csv('transform_time_test.gz', compression='gzip')
-- SQL equivalent
-- increase the working memory (be careful with this)
set work_mem='600MB';
-- create a dual index on the partition
CREATE INDEX IF NOT EXISTS user_time_index ON table(user_id, timestamp);
-- calculate time since last visit and time since first visit in one pass
SELECT *, AGE(timestamp, LAG(timestamp, 1, timestamp) OVER w) AS time_since_last, AGE(timestamp,
FIRST_VALUE(timestamp) OVER w) AS time_since_first FROM table WINDOW w AS (PARTITION BY user_id
ORDER BY timestamp);
SQL has taken over as dominant transformation tool

dbt

Airbyte - Plugin Architecture

Airbyte Internal Architecture