From Machine Learning Scientist to Full Stack Data Scientist: Lessons learned for ML in production by Anaël Beaugnon

WiMLDS_Paris 29 views 22 slides Jun 14, 2024


About This Presentation

Anaël Beaugnon gave a presentation titled "From Machine Learning Scientist to Full Stack Data Scientist: Lessons learned for ML in production" at a joint meetup between WiMLDS Paris and MLOps Paris in June 2024.

Added: Jun 14, 2024

Slides: 22 pages

Slide Content

From Machine Learning Scientist
to Full Stack Data Scientist
Lessons learned for ML in production
Anaël Beaugnon, Full Stack Data Scientist @ Alan
WiMLDS & MLOps Paris meetup - 06/13/24

My Data Journey
●ANSSI: Machine Learning Scientist
●Roche: Data Scientist / OPS tech lead
●Alan: Full Stack Data Scientist
2

ANSSI - Research oriented Machine Learning
Not so far from real world Machine Learning
●Machine learning for cybersecurity
●Applied PhD (thèse CIFRE)
●Collaborative work with operational teams (threat intelligence, detection and response)
3

ANSSI - Research oriented Machine Learning
Not so far from real world Machine Learning
Ref: Machine Learning Engineering for Production (Coursera lecture by Andrew Ng)
Data-centric Machine Learning
Most PhDs in Machine Learning
4

ANSSI - Research oriented Machine Learning
Not so far from real world Machine Learning
➕ Open source project and contributions to scikit-learn.
➕ Collaborative work with operational teams.
➖ Big silos between operational and research teams.
➖ Could not touch production code.
➖ No cloud at all.
I wanted to be closer to production,
to be able to deploy ML code myself!
5

Data Scientist / OPS tech lead
Working in a Data Product supported by a Data Platform
Machine learning models deployed in 80+ countries

6

Data Product with Machine Learning
Product Manager
Data Scientists
●Model benchmarking
●Model deployment
●Model maintenance
Analytics Engineers
●Building views for Data Scientists
●KPI monitoring
Software Engineers
●Supporting Data Scientists on non-ML Python parts
MLOps / DevOps Engineers
●Deployment / Scalability / Robustness
7

Data platform supporting Data Products

Data Platform
-Database (e.g. Snowflake, Redshift)
-Orchestrator (e.g. Kubeflow, Airflow)
-CI-CD (e.g. GitHub, GitLab)
-Data Pipeline (onboarding raw data)

Data Product
-Product manager
-Data scientists
-Analytics engineers
-Software engineers
-MLOps / DevOps engineers
8

OPS tech lead: an amazing opportunity to grow
●Made the ML pipelines more robust in production
●Learned a lot about MLOps / DevOps
●Eased collaboration between Software Engineers and Data Scientists
●Collaborated with the data platform
Jump on great opportunities when they arise!

-Ref: "Les Règles du Jeu" by Clara Moley
-Ref: "Lean In: Women, Work, and the Will to Lead" by Sheryl Sandberg

9

2 lessons learned for Machine Learning in production
1.How to easily iterate on a model already deployed

2.Full Stack Data Scientists: key to efficiently delivering great data products
10

1.How to easily iterate on a model already deployed

a.Evaluation framework

b.Separate production and benchmarking code
11

Evaluation Framework

1.Offline testing (to compare many candidates)
●Classical evaluation on datasets with ground truth (e.g. AUC, f-score)
●Fully automated
●Easy to test several candidates

2.Real-world impact analysis (before deploying a candidate)
●Compare current and candidate solutions (e.g. number of recommendations)
●Some manual checks are necessary (e.g. recommendations added / removed)
●Only for the best candidate(s)
12
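The offline testing step can be sketched as a small, fully automated script that scores every candidate on the same labelled dataset and ranks them. This is only an illustration under assumptions: the `auc` helper, the `rank_candidates` function, the toy dataset, and the two candidate scoring functions are hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed by comparing every
    positive/negative pair (a tie counts as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rank_candidates(
    candidates: Dict[str, Callable[[float], float]],
    features: List[float],
    labels: List[int],
) -> List[Tuple[str, float]]:
    """Score every candidate on the same ground-truth dataset, best first."""
    results = [
        (name, auc(labels, [model(x) for x in features]))
        for name, model in candidates.items()
    ]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical dataset with ground truth (one feature per example).
features = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

# Hypothetical candidates: each maps a feature to a score.
candidates = {
    "identity": lambda x: x,
    "inverted": lambda x: 1.0 - x,
}

ranking = rank_candidates(candidates, features, labels)
for name, value in ranking:
    print(f"{name}: AUC={value:.2f}")
```

Because nothing here needs manual review, adding a candidate only means adding one entry to `candidates`; the manual checks are kept for the real-world impact analysis of the best one.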

Separating production and benchmarking code
Production package (only the dependencies needed for production)
-Abstract classes
-Derived classes used in prod

Benchmarking package (all the dependencies needed for the experiments)
-Derived classes for the candidates
-Scripts for offline testing and real-world impact analysis
13

Separating production and benchmarking code

Production package
-Abstract classes
-Derived classes used in prod
Benchmarking package
-Candidate classes
-Evaluation scripts
Advantages
●No risk of breaking production code with benchmarking experiments
●Similar CI for production and benchmarking packages (more permissive for benchmarking)
●Easy to compare production models with benchmarking candidates
●Easy to move a better candidate from benchmarking to production
14
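A minimal sketch of this split, shown in a single file for brevity. The module paths in the comments and all class and function names (`Recommender`, `MostRecentRecommender`, `MostFrequentRecommender`, `compare`) are hypothetical; the talk only states that the production package holds the abstract classes and the derived classes used in prod, while the benchmarking package holds the candidate classes and the evaluation scripts.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, List


# --- production package (hypothetical module: prod_pkg/models.py) ---
class Recommender(ABC):
    """Abstract class: the contract shared by prod models and candidates."""

    @abstractmethod
    def recommend(self, history: List[str]) -> List[str]:
        ...


class MostRecentRecommender(Recommender):
    """Derived class used in prod: the two most recent items, newest first."""

    def recommend(self, history: List[str]) -> List[str]:
        return history[-2:][::-1]


# --- benchmarking package (hypothetical module: bench_pkg/candidates.py) ---
# It imports Recommender from the production package, never the reverse,
# so experiments cannot change what is deployed.
class MostFrequentRecommender(Recommender):
    """Candidate class: the two most frequent items in the history."""

    def recommend(self, history: List[str]) -> List[str]:
        return [item for item, _ in Counter(history).most_common(2)]


# --- benchmarking package (hypothetical module: bench_pkg/evaluate.py) ---
def compare(models: Dict[str, Recommender], history: List[str]) -> Dict[str, List[str]]:
    """Run the prod model and the candidates through the same interface."""
    return {name: model.recommend(history) for name, model in models.items()}


outputs = compare(
    {"prod": MostRecentRecommender(), "candidate": MostFrequentRecommender()},
    ["a", "b", "a", "c"],
)
print(outputs)
```

Since the candidate already implements the production abstract class, promoting it would mostly mean moving the class into the production package along with any dependency it needs.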

1.How to easily iterate on a model already deployed
a.Evaluation framework
●Offline testing: fully automated, to test many candidates
●Real-world impact analysis: to assess the best candidate before deployment, since some manual checks are needed

b.Separate production and benchmarking code
Production package
-Abstract classes
-Derived classes used in prod
Benchmarking package
-Derived classes for the candidates
-Scripts for offline and online testing
15
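The real-world impact analysis recapped above can be sketched as a diff between what the current solution and the best candidate would output on the same inputs. This is an illustration under assumptions: recommendations are represented as sets of item ids keyed by user id, and the `ImpactReport` and `impact_analysis` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class ImpactReport:
    n_current: int                 # total recommendations from the current solution
    n_candidate: int               # total recommendations from the candidate
    added: Dict[str, Set[str]]     # per user: recommendations only the candidate makes
    removed: Dict[str, Set[str]]   # per user: recommendations the candidate drops


def impact_analysis(
    current: Dict[str, Set[str]],
    candidate: Dict[str, Set[str]],
) -> ImpactReport:
    """Compare the outputs of the deployed solution and a candidate."""
    added: Dict[str, Set[str]] = {}
    removed: Dict[str, Set[str]] = {}
    for user in sorted(set(current) | set(candidate)):
        cur = current.get(user, set())
        cand = candidate.get(user, set())
        if cand - cur:
            added[user] = cand - cur
        if cur - cand:
            removed[user] = cur - cand
    return ImpactReport(
        n_current=sum(len(items) for items in current.values()),
        n_candidate=sum(len(items) for items in candidate.values()),
        added=added,
        removed=removed,
    )


# Hypothetical outputs of the two solutions on the same users.
current = {"u1": {"a", "b"}, "u2": {"c"}}
candidate = {"u1": {"a", "d"}, "u2": {"c"}, "u3": {"e"}}

report = impact_analysis(current, candidate)
print(report.n_current, report.n_candidate)
print("added:", report.added)
print("removed:", report.removed)
```

The counts are computed automatically, while the `added` and `removed` entries are the part a person would review by hand before deploying.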

2.Full Stack Data Science Approach
●Data Scientists deploy and maintain their data pipelines.
●They own the whole data pipeline.

Advantages:
●They master their pipelines in production.
●They can easily deploy improvements.

Ref: "Designing Machine Learning Systems" by Chip Huyen
16
Raw data → Preprocessing → Machine Learning → Postprocessing → Output
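As a rough illustration of owning the whole pipeline, the stages from raw data to output can live in one object that the same team writes, deploys, and maintains. Everything in this sketch is hypothetical (the `Pipeline` class, the three stage functions, and the threshold); the scoring function is a stand-in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Raw data -> preprocessing -> machine learning -> postprocessing -> output."""
    preprocess: Callable[[List[str]], List[float]]
    predict: Callable[[List[float]], List[float]]
    postprocess: Callable[[List[float]], List[str]]

    def run(self, raw: List[str]) -> List[str]:
        return self.postprocess(self.predict(self.preprocess(raw)))


def parse_amounts(raw: List[str]) -> List[float]:
    """Preprocessing: drop empty records and convert the rest to floats."""
    return [float(record.strip()) for record in raw if record.strip()]


def score(values: List[float]) -> List[float]:
    """Stand-in for the model: scale to [0, 1], capped at 1."""
    return [min(value / 100.0, 1.0) for value in values]


def to_labels(scores: List[float]) -> List[str]:
    """Postprocessing: turn scores into decisions with a fixed threshold."""
    return ["flag" if s >= 0.5 else "ok" for s in scores]


pipeline = Pipeline(preprocess=parse_amounts, predict=score, postprocess=to_labels)
output = pipeline.run([" 20", "75 ", "", "150"])
print(output)
```

With every stage in one place, a change to preprocessing or postprocessing can be tested and shipped together with the model it belongs to.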

2.Full Stack Data Science Approach
What about Data Scientists asking ML engineers to deploy a pipeline from a Jupyter Notebook?

❌ When it fails in production, who should fix the problem: Data Scientists or ML engineers?

❌ Data Scientists do not know exactly, or master, what is deployed in production.

❌ Difficult to iterate to improve the pipeline.

✅ Can be great for very research-oriented Data Scientists doing deep work on model improvement.
17

2.Full Stack Data Science Approach
So Full Stack Data Scientists really do EVERYTHING ???

NO!! They get great support from the Data Platform
●Reusable modules for CI-CD
○Pre-commit checks
○Dockerization
○Unit tests
○Model deployment
●Shared libraries to use the Data Platform components
●Admin of the Data Platform components
18

2.Full Stack Data Science Approach
●Data Scientists deploy and maintain their data pipelines.
●They own the whole data pipeline.
●They get support from the Data Platform.

Advantages:
●They master their pipelines in production.
●They can easily deploy improvements.

Ref: "Designing Machine Learning Systems" by Chip Huyen
19

Full Stack Data Scientist at Roche ?
✅ I was deploying and maintaining the pipelines
✅ I was doing lots of software engineering, MLOps and DevOps.

❌ I did not own the WHOLE data pipeline.

20
Raw data → Preprocessing → Machine Learning → Postprocessing → Output
-SQL preprocessing by Analytics Engineers
-Python preprocessing by Data Scientists
-Data Scientists ownership

Full stack Data Scientist
Ref:
-Vision for a data team
-Machine Learning at Alan
21
Raw data → Preprocessing → Machine Learning → Postprocessing → Output
-Data Scientists ownership

Key takeaways
22
Full Stack Data Science Approach
●Data Scientists deploy and maintain their data pipelines.
●They own the whole data pipeline.
●They get support from the Data Platform.

How to easily iterate on a model already deployed
●Evaluation framework
●Separating production and benchmarking code
