Orchestrating the Future: Navigating Today's Data Workflow Challenges with Airflow and Beyond | Budapest Data + ML Forum 2024
kaxil
Jun 11, 2024
About This Presentation
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Slide Content
Orchestrating the Future
Navigating Today's Data Workflow Challenges with Airflow and Beyond
Budapest Data + ML Forum
June 2024
Kaxil Naik
Apache Airflow Committer & PMC Member
Senior Director of Engineering @ Astronomer
@kaxil
●Orchestrator – The What & Why?
●What is Apache Airflow?
○Why is Airflow the Industry Standard for Data Professionals?
○Evolution of Airflow
●Today’s Data Workflow Challenges
○How Airflow addresses them – Real world case studies
●The Future of Airflow
Agenda
Orchestrator
The What & Why?
What is Orchestration? Who is an Orchestrator?
Why Orchestration?
Orchestration in Engineering!
Workflow Orchestrator
Automates and manages interconnected tasks across various systems to
streamline complex business processes, e.g., running a bash script every day
to update packages on a laptop.
Data Orchestrator
Automates and manages interconnected tasks that deal with data across various
systems to streamline complex business processes, e.g., an ETL pipeline
feeding a BI dashboard.
What is Apache Airflow?
A Workflow Orchestrator, most commonly used for Data Orchestration
Official Definition:
A platform to programmatically author, schedule and monitor workflows
What is Apache Airflow?
●Python Native: the language of data professionals (Data Engineers &
Scientists). DAGs are defined in code, allowing more flexibility &
observability of code changes when used with git.
●Pluggable Compute: GPUs, Kubernetes, EC2, VMs, etc.
●Integrates with Your Toolkit: all data sources, all Python libraries,
TensorFlow, SageMaker, MLflow, Spark, Ray, etc.
●Common Interface: between Data Engineering, Data Science, ML Engineering
and Operations.
●Data Agnostic: but data aware.
●Cloud Native: but cloud neutral.
●Monitoring & Alerting: built-in features for logging, monitoring and
alerting to external systems.
●Extensible: standardize custom operators and templates for common DS tasks
across the organization.
Key Features of Airflow
Example DAG
Why is Airflow the Industry
Standard for
Data Professionals?
The Community
25M Monthly Downloads
2.9K Contributors
35K GitHub Stars
47K Slack Community
Under …
Governed by
62 Committers
33 PMC (Project Management Committee) Members
Use cases for Airflow
90% of Apache Airflow usage is dedicated to ingestion and ETL/ELT tasks
associated with analytics, followed by 68% for business operations.
Additionally, there's growing adoption for MLOps (28%) and infrastructure
management (13%), highlighting its versatility across various data workflow
tasks.
●Ingestion and ETL/ELT related to analytics: 90%
●Ingestion and ETL/ELT related to business operations: 68%
●Training, serving, or generally managing MLOps: 28%
●Spinning up and spinning down infrastructure: 13%
●Other: 3%
Source: 2023 Apache Airflow Survey, n=797
The Evolution of Airflow
Timeline: Major Milestones
●Oct 2014: Created at Airbnb
●March 2016: Donated to the Apache Software Foundation (ASF) as an
incubating project
●Dec 2020: Airflow 2.0 released
Today's Data Workflow Challenges
●Increasing Data Volumes: businesses generate more data than ever.
Handling this data & its quality is critical.
●Need for Near Real-time Processing: data workflows are being used to drive
critical business decisions in near real-time, hence requiring reliability &
performance guarantees.
●Complexity in Data Workflows: modern workflows need to handle data from
multiple sources, requiring management of complex dependencies & dynamic
schedules.
●Intelligent Infrastructure: infrastructure must be elastic & flexible to
optimize for modern workloads.
Today's Data Workflow Challenges (continued)
●Additional Interfaces: net-new teams, from ML to AI, want to get the best
out of Airflow without learning a new framework.
●Licensing & Security in OSS: OSS projects owned by a single company have
changed licenses too often in the recent past.
●Platform Governance: visibility, auditability, & lineage across a data
platform is a need-to-have.
●Cost Reduction: tight budgets have pushed teams to utilize resources
efficiently to drive operational costs down.
How does Airflow address
these challenges?
Case Study: Texas Rangers
Company: A professional baseball team in Major League Baseball (MLB), based in
Arlington, Texas. The Rangers won their first World Series championship in 2023.
Goal: Use data to gain an unfair advantage, Moneyball style! Data to be
collected: real-time game data streaming, comprehensive player health
reporting, predictive analytics of everything from pitch spin to hit
trajectory, and more.
Challenge: Scalability issues due to the volume & unprecedented rate of data,
plus an infrastructure bottleneck in their live game analytics pipeline. This
impacted the timely delivery of analytics to the team and affected their
competitive edge.
Case Study: Texas Rangers
Solution: Use Airflow's worker queues to create dedicated worker pools for
CPU-intensive tasks while other tasks ran on cheaper workers. Using
Data-aware Scheduling, they were able to start their DAGs when data became
available instead of on a fixed time-based schedule.
Result:
Improved Scalability
Using worker queues, DAG
completion time reduced by
80% (from 20 mins to 3
mins)
Increased Efficiency
Optimizing compute
resources allowed processing
of 4 additional DAGs in
parallel, enabling immediate
post-game analytics delivery
for a competitive edge.
Case Study: Bloomberg
Company: Bloomberg is a leading source for financial & economic data: Equities,
bonds, Index, Mortgages, currencies, etc. Founded in 1981 with subscribers in
170+ countries.
Goal: Deliver a diverse array of information, news & analytics to facilitate
decision-making
Challenge: Maintaining custom pipelines for diverse datasets across different
domains is expensive & time-consuming. Their engineers lacked the domain
knowledge to aggregate data into client insights, & their domain experts
lacked the skills to maintain data pipelines in production.
Case Study: Bloomberg
Solution: Configuration-driven ETL platform leveraging Airflow & dynamic DAGs.
User-defined configs are translated into Dynamic DAGs determining tasks & their
dependencies with success/failure actions.
Result: The Data Platform team now supports 1600+ DAGs, 700+ datasets,
200+ users, 11 different product teams, and 10k+ weekly file ingestions.
Case Study: FanDuel
Company: FanDuel Group is a sports betting company that lives on data, with
approximately 17 million customers.
Goal: Business growth led to higher daily data volumes, which fueled demand for
new sources and richer analytics.
Challenge: 2022 NFL season was fast approaching and FanDuel wanted a robust
data architecture in anticipation of company’s busiest time in terms of daily
volume of data.
Case Study: FanDuel
Solution: They worked with Astro professional services team to replace Operators
with more efficient Deferrable Operators along with Astro’s auto-scaling
features.
Result:
The average number of worker nodes running decreased by 35%, resulting in
immediate infrastructure cost savings, and the average number of tasks per
worker increased by 305%.
Other Interesting Case Studies
Grindr has saved $600,000 in Snowflake costs by monitoring their Snowflake
usage across the organization with Airflow.
Condé Nast has reduced costs by 54% by using deferrable operators.
Airline: a tool powered by Airflow, built by Astronomer’s Customer Reliability
Engineering (CRE) team, that monitors Airflow deployments and sends alerts
proactively when issues arise.
Other Interesting Case Studies
King uses ‘data reliability engineering as code’ tools such as SodaCore within
Airflow pipelines to detect, diagnose and inform about data issues to create
coverage, improve quality & accuracy and help eliminate data downtime.
Laurel.ai: A pioneering AI company that automates time and billing for
professional services. Uses multiple domain-specific LLMs to create billing
timesheets from users' footprints across their workflows & tools (Zoom, MS
Teams etc). Airflow orchestrates their entire GenAI lifecycle: data extraction,
model tuning & feedback loops.
Ask Astro: An end-to-end example of a Q&A LLM application used to answer
questions about Apache Airflow and Astronomer
The Future of
Apache Airflow
Airflow 3
Make Airflow the foundation for Data, ML, and Gen AI orchestration for
the next 5 years.
1. Enable secure remote task execution across network boundaries.
2. Integrate data awareness needed for governance and compliance.
3. Enable non-Python tasks, for integration with any language.
4. Enable versioning of DAGs and Datasets.
5. Single-command local install for learning and experimentation.
Thank You
A friendly reminder to RSVP to
Airflow Summit 2024:
●Celebrating 10 Years of Airflow
●Sept. 10th-12th
●The Westin St. Francis
●San Francisco, CA
@kaxil
Airflow Summit Discount Code:
15DISC_MEETUP