"Airbyte - Data acquisition for Modern Data Platforms
In this session, you will learn the basic tenets of a modern data platform. Subsequently, you will understand the role of Airbyte in building the data platform. We will deep dive into how Airbyte's plugin type architecture enables rapid ...
"Airbyte - Data acquisition for Modern Data Platforms
In this session, you will learn the basic tenets of a modern data platform. Subsequently, you will understand the role of Airbyte in building the data platform. We will deep dive into how Airbyte's plugin type architecture enables rapid data acquisition of data from variety of data sources. We will also explore extensibility of Airbyte, by going through its Connector development kit."
Size: 1.45 MB
Language: en
Added: Jan 30, 2023
Slides: 11 pages
Slide Content
Presented By: Ram Indukuri
Airbyte - ELT in Modern
Data Stack
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to the start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Our Agenda
01 The Holy Grail
02 Modern Data Stack
03 Transformations
04 Airbyte
05 Demo
Holy Grail of Modern Data Architecture - Data Mesh
Modern data stack
ELT Data Pipeline
Cloud Based Data Lake
Data Transformation Tool
Modern BI Tools
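The four components above can be sketched end to end in a few lines: in an ELT pipeline, raw records are landed in the warehouse untransformed, and transformation runs afterwards as SQL inside the warehouse. This is a minimal sketch only; sqlite3 stands in for the cloud data lake/warehouse, and every table and column name is illustrative.

```python
import sqlite3

# sqlite3 stands in for the cloud warehouse; all names here are illustrative
conn = sqlite3.connect(":memory:")

# -- Extract + Load: land raw records exactly as they arrive --
raw_events = [
    ("u1", "s1", "2023-01-01 10:00:00"),
    ("u1", "s1", "2023-01-01 10:05:00"),
    ("u2", "s2", "2023-01-01 11:00:00"),
]
conn.execute("CREATE TABLE raw_events (user_id TEXT, series_id TEXT, ts TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# -- Transform: SQL runs inside the warehouse, after loading --
cleaned = conn.execute("""
    SELECT user_id, COUNT(*) AS visits, MIN(ts) AS first_seen
    FROM raw_events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print(cleaned)  # [('u1', 2, '2023-01-01 10:00:00'), ('u2', 1, '2023-01-01 11:00:00')]
```

The point of the ordering is that the load step does no cleaning at all; the "cleaned" view exists only as a query inside the warehouse, which is what distinguishes ELT from ETL.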
# PYTHON
import pandas as pd

# connect to the db using a custom wrapper around psycopg2
db = DatabaseConnection(db='db', user='username', password='password')

# grab data from the db and load it into memory
rows = db.run_query("SELECT * FROM cleaned_table;")
df = pd.DataFrame(rows, columns=['user_id', 'series_id', 'timestamp'])

# calculate time since first visit, per user
df = df.assign(time_since_first=df.groupby('user_id', sort=False)['timestamp']
               .transform(lambda x: x - x.min()))

# calculate time since last visit, per user
df = df.assign(time_since_last=df.sort_values('timestamp')
               .groupby('user_id', sort=False)['timestamp']
               .transform(pd.Series.diff))

# save df to a compressed csv
df.to_csv('transform_time_test.gz', compression='gzip')
-- SQL equivalent (PostgreSQL)
-- increase the working memory for this session (be careful with this)
SET work_mem = '600MB';

-- create a composite index covering the partition and ordering columns
CREATE INDEX IF NOT EXISTS user_time_index ON cleaned_table (user_id, timestamp);

-- calculate time since last visit and time since first visit in one pass
SELECT *,
       AGE(timestamp, LAG(timestamp, 1, timestamp) OVER w) AS time_since_last,
       AGE(timestamp, FIRST_VALUE(timestamp) OVER w)       AS time_since_first
FROM cleaned_table
WINDOW w AS (PARTITION BY user_id ORDER BY timestamp);
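To illustrate how portable this window logic is across engines, the same time-since-first / time-since-last computation can be run in SQLite (which ships with Python, version 3.25+ for window functions). AGE() is PostgreSQL-specific, so this sketch substitutes julianday() arithmetic to express the intervals in seconds; the table and column names are made up for the example.

```python
import sqlite3

# SQLite also supports LAG/FIRST_VALUE window functions; names are illustrative
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, ts TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    ("u1", "2023-01-01 10:00:00"),
    ("u1", "2023-01-01 10:05:00"),
    ("u1", "2023-01-01 10:20:00"),
])

# AGE() is PostgreSQL-only, so compute the intervals in seconds via julianday()
rows = conn.execute("""
    SELECT user_id, ts,
           CAST(ROUND((julianday(ts) - julianday(LAG(ts, 1, ts) OVER w)) * 86400)
                AS INTEGER) AS secs_since_last,
           CAST(ROUND((julianday(ts) - julianday(FIRST_VALUE(ts) OVER w)) * 86400)
                AS INTEGER) AS secs_since_first
    FROM visits
    WINDOW w AS (PARTITION BY user_id ORDER BY ts)
    ORDER BY user_id, ts
""").fetchall()
print(rows)  # secs_since_last/secs_since_first: (0, 0), (300, 300), (900, 1200)
```

Both intervals still come out of a single pass over the partition, which is the property that makes the in-warehouse SQL version attractive compared to pulling everything into pandas.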
SQL has taken over as the dominant transformation tool in the modern data stack.