Orchestrating the Future: Navigating Today's Data Workflow Challenges with Airflow and Beyond | Budapest Data + ML Forum 2024

kaxil 53 views 44 slides Jun 11, 2024

About This Presentation

Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our ...


Slide Content

Orchestrating the Future
Navigating Today's Data Workflow Challenges with Airflow and Beyond
Budapest Data + ML Forum
June 2024

Kaxil Naik
Apache Airflow Committer & PMC Member
Senior Director of Engineering @ Astronomer


@kaxil

Agenda
●Orchestrator – The What & Why?
●What is Apache Airflow?
○Why is Airflow the Industry Standard for Data Professionals?
○Evolution of Airflow
●Today’s Data Workflow Challenges
○How Airflow addresses them – Real-world case studies
●The Future of Airflow

Orchestrator
The What & Why?

What is Orchestration? Who is an Orchestrator?

Why Orchestration?

Orchestration in Engineering!
Workflow Orchestrator
Automates and manages interconnected tasks across various systems to
streamline complex business processes. E.g., running a bash script every day to
update packages on a laptop.

Data Orchestrator
Automates and manages interconnected tasks that deal with data across various
systems to streamline complex business processes. E.g., ETL for a BI dashboard.

What is Apache Airflow?

A Workflow Orchestrator, most commonly used for Data Orchestration

Official Definition:
A platform to programmatically author, schedule and monitor workflows



Key Features of Airflow

Python Native

The language of data professionals
(Data Engineers & Scientists). DAGs
are defined in code, allowing more
flexibility & observability of code
changes when used with Git.

Pluggable Compute

GPUs, Kubernetes, EC2, VMs, etc.

Integrates with Toolkit

All data sources, all Python
libraries, TensorFlow, SageMaker,
MLflow, Spark, Ray, etc.

Common Interface

Between Data Engineering, Data
Science, ML Engineering and
Operations.

Data Agnostic

But data aware.

Cloud Native

But cloud neutral.

Monitoring & Alerting

Built-in features for logging,
monitoring and alerting to external
systems.

Extensible

Standardize custom operators and
templates for common DS tasks
across the organization.

Example DAG
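The slide's example can be sketched as a minimal DAG file. This is a hypothetical illustration using Airflow 2.x and the TaskFlow API, not the exact DAG shown on the slide; it only runs inside an environment with `apache-airflow` installed.

```python
# Hypothetical minimal ETL DAG, assuming Airflow 2.x with the TaskFlow API.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_etl():
    @task
    def extract():
        # Pull raw records from a source system (stubbed here).
        return [1, 2, 3]

    @task
    def transform(rows):
        # Apply a trivial transformation.
        return [r * 2 for r in rows]

    @task
    def load(rows):
        # Write results to a target system (stubbed here).
        print(f"loaded {len(rows)} rows")

    # XCom wiring: extract -> transform -> load.
    load(transform(extract()))


example_etl()
```

Because the DAG is plain Python, the extract/transform/load dependency chain is visible in version control and reviewable like any other code change.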

Why is Airflow the Industry
Standard for
Data Professionals?

25M
Monthly Downloads
The Community
2.9K
Contributors
35K
GitHub Stars
47K
Slack Community

Under …

Governed by

62 Committers
33 PMC (Project Management Committee) Members

Integrations


90+ Providers

Docker Image
docker pull apache/airflow

Helm Chart
helm repo add apache-airflow https://airflow.apache.org/
helm install my-airflow apache-airflow/airflow

Conference & Meetups
Attendees:
Online Edition (2020-2022): 10k
In-person (2023+): 500+
15 Local Groups across the globe, with 11k members

Managed Airflow Vendors

Airflow Survey and State of Apache Airflow report
Infographic:
https://airflow.apache.org/survey/

Report:
https://www.astronomer.io/state-of-airflow/

Use cases for Airflow

90% of Apache Airflow usage is dedicated to ingestion and ETL/ELT tasks
associated with analytics, followed by 68% for business operations.
Additionally, there’s a growing adoption for MLOps (28%) and infrastructure
management (13%), highlighting its versatility across various data workflow
tasks.

Ingestion and ETL/ELT related to analytics: 90%
Ingestion and ETL/ELT related to business operations: 68%
Training, serving, or generally managing MLOps: 28%
Spinning up and spinning down infrastructure: 13%
Other: 3%

Source: 2023 Apache Airflow Survey, n=797

The Evolution of Airflow

Timeline: Major Milestones
Oct 2014: Created at Airbnb
June 2015: Open Sourced
March 2016: Donated to the Apache Software Foundation (ASF) as an Incubating project
Dec 2018: Graduated as a top-level project
July 2020: First Airflow Summit
Dec 2020: Airflow 2.0 released
Mar-Apr 2025: (Planned) Airflow 3.0 release
Timeline: 2.x Minor Releases
2.1: 2021-05
2.2: 2021-11
2.3: 2022-05
2.4: 2022-09
2.5: 2022-11
2.6: 2023-04
2.7: 2023-08
2.8: 2023-12
2.9: 2024-04

Code Contributions & downloads continue to grow!
Downloads: from 500K / month to 25M / month

Today’s Data Workflow
Challenges

Today’s Data Workflow challenges

Increasing Data Volumes

Businesses generate more data than ever. Handling this data & its quality is
critical.

Need for near Real-time Processing

Data workflows are being used to drive critical business decisions in near
real-time, and hence require reliability & performance guarantees.

Complexity in Data Workflows

Modern workflows need to handle data from multiple sources, which requires
managing complex dependencies & dynamic schedules.

Intelligent Infrastructure

Infrastructure must be elastic & flexible to optimize for modern workloads.

Today’s Data Workflow challenges

Additional Interfaces

Net-new teams, from ML to AI, want to get the best out of Airflow without
learning a new framework.

Licensing & Security in OSS

OSS projects owned by a single company have changed licenses too often in
the recent past.

Platform Governance

Visibility, auditability, & lineage across a data platform are a
need-to-have.

Cost Reduction

Tight budgets have pushed teams to use resources efficiently to drive
operational costs down.

How does Airflow address
these challenges?

Case Study: Texas Rangers
Company: A professional baseball team in Major League Baseball (MLB), based in
Arlington, Texas. The Rangers won their first World Series championship in 2023.

Goal: Use data to gain an unfair advantage, Moneyball style! Data to be collected:
real-time game data streaming, comprehensive player health reporting, predictive
analytics of everything from pitch spin to hit trajectory, and more.

Challenge: Scalability issues due to the volume & unprecedented rate of data, and
an infrastructure bottleneck in their live game analytics pipeline. This impacted
the timely delivery of analytics to their team and affected their competitive edge.

Case Study: Texas Rangers
Solution: Use Airflow’s worker queues to create dedicated worker pools for
CPU-intensive tasks while other tasks used cheaper workers. Using Data-aware
Scheduling, they were able to start their DAGs when data was available, instead
of on a time-based schedule.

Result:

Improved Scalability

Using worker queues, DAG completion time reduced by 80% (from 20 mins to 3
mins).

Increased Efficiency

Optimizing compute resources allowed processing of 4 additional DAGs in
parallel, enabling immediate post-game analytics delivery for a competitive
edge.

Case Study: Bloomberg
Company: Bloomberg is a leading source for financial & economic data: equities,
bonds, indices, mortgages, currencies, etc. Founded in 1981, with subscribers in
170+ countries.

Goal: Deliver a diverse array of information, news & analytics to facilitate
decision-making.

Challenge: Maintaining custom pipelines for diverse datasets across different
domains is expensive & time-consuming. Their engineers lacked the domain
knowledge to aggregate data into client insights, & their domain experts lacked
the skills to maintain data pipelines in production.

Case Study: Bloomberg
Solution: Configuration-driven ETL platform leveraging Airflow & dynamic DAGs.
User-defined configs are translated into dynamic DAGs that determine tasks &
their dependencies, with success/failure actions.





Result: The Data Platform team now supports 1600+ DAGs, 700+ datasets,
200+ users, 11 different product teams, and 10k+ weekly file ingestions.






Source: https://airflowsummit.org/sessions/2023/airflow-at-bloomberg-leveraging-dynamic-dags-for-data-ingestion/
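The config-to-DAG pattern behind this platform can be illustrated without Airflow itself: a user-supplied config is translated into a task list and dependency edges (in a real deployment, each result would be materialized as an Airflow DAG object). The function name and config schema here are illustrative assumptions, not Bloomberg's actual schema.

```python
# Illustrative sketch of config-driven pipeline generation.
# The config shape ("transforms", "on_failure") is an assumption.
def build_pipeline(config):
    """Translate a user-defined config into tasks and dependency edges."""
    tasks = ["ingest"]
    edges = []
    prev = "ingest"
    # Each declared transform becomes a task chained after the previous one.
    for step in config.get("transforms", []):
        tasks.append(step)
        edges.append((prev, step))
        prev = step
    tasks.append("publish")
    edges.append((prev, "publish"))
    # Optional failure action, e.g. an alerting task.
    if config.get("on_failure"):
        tasks.append(config["on_failure"])
    return {"dag_id": config["name"], "tasks": tasks, "edges": edges}


pipeline = build_pipeline({
    "name": "bond_prices_daily",
    "transforms": ["validate", "normalize"],
    "on_failure": "notify_team",
})
# pipeline["edges"] == [("ingest", "validate"), ("validate", "normalize"),
#                      ("normalize", "publish")]
```

Because domain experts only edit configs, not pipeline code, the platform team maintains one generator instead of hundreds of hand-written DAGs.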

Case Study: FanDuel
Company: FanDuel Group is a sports betting company that lives on data, with
approximately 17 million customers.

Goal: Business growth led to higher daily data volumes, which fueled demand for
new sources and richer analytics.

Challenge: The 2022 NFL season was fast approaching, and FanDuel wanted a robust
data architecture in anticipation of the company’s busiest time in terms of
daily volume of data.

Case Study: FanDuel
Solution: They worked with the Astro professional services team to replace
Operators with more efficient Deferrable Operators, along with Astro’s
auto-scaling features.



Result:
The number of worker nodes running on average decreased by 35%, resulting in
immediate infrastructure cost savings, & average tasks per worker increased by
305%.

Other Interesting Case Studies
Grindr has saved $600,000 in Snowflake costs by monitoring their Snowflake
usage across the organization with Airflow.

Condé Nast has reduced costs by 54% by using deferrable operators.

Airline: a tool powered by Airflow, built by Astronomer’s Customer Reliability
Engineering (CRE) team, that monitors Airflow deployments and sends alerts
proactively when issues arise.

Other Interesting Case Studies
King uses ‘data reliability engineering as code’ tools such as SodaCore within
Airflow pipelines to detect, diagnose and inform about data issues, creating
coverage, improving quality & accuracy, and helping eliminate data downtime.

Laurel.ai: A pioneering AI company that automates time and billing for
professional services. It uses multiple domain-specific LLMs to create billing
timesheets from users’ footprints across their workflows & tools (Zoom, MS
Teams, etc.). Airflow orchestrates their entire GenAI lifecycle: data
extraction, model tuning & feedback loops.

Ask Astro: An end-to-end example of a Q&A LLM application used to answer
questions about Apache Airflow and Astronomer

The Future of
Apache Airflow

Airflow 3
Make Airflow the foundation for Data, ML, and Gen AI orchestration for
the next 5 years.

1. Enable secure remote task execution across network boundaries.

2. Integrate data awareness needed for governance and compliance.

3. Enable non-Python tasks, for integration with any language.

4. Enable versioning of DAGs and Datasets.

5. Single-command local install for learning and experimentation.

Thank You

A friendly reminder to RSVP to
Airflow Summit 2024:
●Celebrating 10 Years of Airflow
●Sept. 10th-12th
●The Westin St. Francis
●San Francisco, CA
@kaxil
Airflow Summit Discount Code:
15DISC_MEETUP