From Machine Learning Scientist to Full Stack Data Scientist: Lessons learned for ML in production by Anaël Beaugnon
WiMLDS_Paris
Jun 14, 2024
About This Presentation
Anaël Beaugnon presented "From Machine Learning Scientist to Full Stack Data Scientist: Lessons learned for ML in production" at a joint meetup between WiMLDS Paris and MLOps Paris in June 2024.
Slide Content
From Machine Learning Scientist
to Full Stack Data Scientist
Lessons learned for ML in production
Anaël Beaugnon, Full Stack Data Scientist @ Alan
WiMLDS & MLOps Paris meetup - 06/13/24
My Data Journey
ANSSI: Machine Learning Scientist
Roche: Data Scientist / OPS tech lead
Alan: Full Stack Data Scientist
2
ANSSI - Research oriented Machine Learning
Not so far from real world Machine Learning
●Machine learning for cybersecurity
●Applied PhD (thèse CIFRE)
●Collaborative work with operational teams: threat intelligence, detection and response
3
ANSSI - Research oriented Machine Learning
Not so far from real world Machine Learning
Ref: Machine Learning Engineering for Production (Coursera lecture by Andrew Ng)
Data-centric Machine Learning
Most PhDs in Machine Learning
4
ANSSI - Research oriented Machine Learning
Not so far from real world Machine Learning
➕ Open source project and contributions to scikit-learn.
➕ Collaborative work with operational teams.
➖ Big silos between operational and research teams.
➖ Could not touch production code.
➖ No cloud at all.
I wanted to be closer to production,
to be able to deploy ML code myself!
5
Data Scientist / OPS tech lead
Working in a Data Product supported by a Data Platform
Machine learning models deployed in 80+ countries
6
Data Product with Machine Learning
Product Manager
Data Scientists
●Model benchmarking
●Model deployment
●Model maintenance
Analytics Engineers
●Building views for Data Scientists
●KPI monitoring
Software Engineers
●Supporting Data Scientists on the non-ML Python parts
MLOps / DevOps Engineers
●Deployment / Scalability / Robustness
7
Data platform supporting Data Products
Data Platform
●Database: e.g. Snowflake, Redshift
●Orchestrator: e.g. Kubeflow, Airflow
●CI-CD: e.g. GitHub, GitLab
●Data Pipeline: onboarding raw data
OPS tech lead: an amazing opportunity to grow
●Made the ML pipelines more robust in production
●Learned a lot about MLOps / DevOps
●Eased collaboration between Software Engineers and Data Scientists
●Collaborated with the data platform
Jump on great opportunities when they arise!
-Ref: "Les Règles du Jeu" by Clara Moley
-Ref: "Lean In: Women, Work, and the Will to Lead" by Sheryl Sandberg
9
2 lessons learned for Machine Learning in production
1.How to easily iterate on a model already deployed
2.Full Stack Data Scientists: key to efficiently delivering great data products
10
1.How to easily iterate on a model already deployed
a.Evaluation framework
b.Separate production and benchmarking code
11
Evaluation Framework
1.Offline testing (to compare many candidates)
●Classical evaluation on datasets with ground truth (e.g. AUC, f-score)
●Fully automated
●Easy to test several candidates
2.Real-world impact analysis (before deploying a candidate)
●Compare current and candidate solutions (e.g. #recommendations)
●Some manual checks are necessary (e.g. recommendation added / removed)
●Only for the best candidate(s)
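The offline testing step above can be sketched as a small, fully automated loop that scores every candidate on the same labeled dataset and ranks them. This is a minimal sketch, not the speaker's framework: the toy dataset, the threshold "models", and the names `f1_score` and `run_offline_benchmark` are assumptions made for illustration. The F-score is hand-rolled to keep the example dependency-free; in practice a library metric such as scikit-learn's `f1_score` would likely be used.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Toy data shape (an assumption): one numeric feature and a 0/1 ground-truth label.
Example = Tuple[float, int]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def run_offline_benchmark(
    candidates: Dict[str, Callable[[float], int]],
    dataset: List[Example],
) -> List[Tuple[str, float]]:
    """Score every candidate on the same dataset and rank them, best first."""
    features = [x for x, _ in dataset]
    labels = [y for _, y in dataset]
    scores = {
        name: f1_score(labels, [predict(x) for x in features])
        for name, predict in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidates: two decision thresholds standing in for real models.
dataset = [(0.1, 0), (0.4, 0), (0.35, 1), (0.8, 1), (0.9, 1), (0.2, 0)]
candidates = {
    "threshold_0.5": lambda x: int(x >= 0.5),
    "threshold_0.3": lambda x: int(x >= 0.3),
}

ranking = run_offline_benchmark(candidates, dataset)
for name, score in ranking:
    print(f"{name}: F1={score:.3f}")
```

Adding a candidate only means adding an entry to the `candidates` dictionary, which is what makes it cheap to test many of them.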
12
Separating production and benchmarking code
Production package
-Abstract classes
-Derived classes used in prod
Benchmarking package
-Derived classes for the candidates
-Scripts for offline testing and real-world impact analysis
13
Separating production and benchmarking code
Production package (only the dependencies needed for production)
-Abstract classes
-Derived classes used in prod
Benchmarking package (all the dependencies needed for the experiments)
-Candidate classes
-Evaluation scripts
Advantages
●No risk to break production code with benchmarking experiments
●Similar CI for production and benchmarking packages (more permissive for benchmarking)
●Easy to compare production models with benchmarking candidates
●Easy to move a better candidate from benchmarking to production
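The split between the two packages can be sketched with Python's `abc` module. This is a minimal single-file sketch under stated assumptions: the module paths in the comments, the `Recommender` interface, and both derived classes are hypothetical names chosen for illustration, and in a real setup each section would live in its own package with its own dependency list.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import List


# --- production package (hypothetical module: prod_pkg/recommender.py) ---
class Recommender(ABC):
    """Abstract interface shared by production models and candidates."""

    @abstractmethod
    def recommend(self, user_history: List[str]) -> List[str]:
        """Return recommended items for one user."""


class MostRecentRecommender(Recommender):
    """Derived class used in production: the two most recent items, newest first."""

    def recommend(self, user_history: List[str]) -> List[str]:
        return user_history[-2:][::-1]


# --- benchmarking package (hypothetical module: bench_pkg/candidates.py) ---
# It imports the abstract class from the production package, never the reverse,
# so experiments cannot break production code.
class MostFrequentRecommender(Recommender):
    """Candidate: the two most frequent items in the history."""

    def recommend(self, user_history: List[str]) -> List[str]:
        return [item for item, _ in Counter(user_history).most_common(2)]


# Both classes share one interface, so the same script can compare them.
history = ["a", "b", "a", "b", "a", "c"]
for model in (MostRecentRecommender(), MostFrequentRecommender()):
    print(type(model).__name__, model.recommend(history))
```

Because the candidate already implements the production interface, promoting it mostly amounts to moving the class into the production package along with any dependency it needs.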
14
1.How to easily iterate on a model already deployed
a.Evaluation framework
●Offline testing: fully automated to test many candidates
●Real-world impact analysis: to assess the best candidate before deployment, since some manual checks are needed
b.Separate production and benchmarking code
Production package
-Abstract classes
-Derived classes used in prod
Benchmarking package
-Derived classes for the candidates
-Scripts for offline and online testing
15
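A real-world impact analysis can be sketched as a diff between what the deployed model and the best candidate would output on the same recent inputs. This is a minimal sketch assuming recommendations are stored as a mapping from user to a list of items; the function names and the toy data are hypothetical. The script only surfaces the added and removed recommendations, and the judgement on whether they are acceptable stays manual.

```python
from typing import Dict, List


def impact_report(
    current: Dict[str, List[str]],
    candidate: Dict[str, List[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Per-user diff between the deployed model's output and the candidate's."""
    report = {}
    for user in sorted(set(current) | set(candidate)):
        cur = set(current.get(user, []))
        cand = set(candidate.get(user, []))
        report[user] = {
            "added": sorted(cand - cur),    # only the candidate recommends these
            "removed": sorted(cur - cand),  # only the current model recommends these
        }
    return report


def count_recommendations(recommendations: Dict[str, List[str]]) -> int:
    """Total number of recommendations across users (the #recommendations figure)."""
    return sum(len(items) for items in recommendations.values())


# Hypothetical outputs of both models on the same recent production inputs.
current = {"u1": ["a", "b"], "u2": ["c"]}
candidate = {"u1": ["a", "d"], "u2": ["c"], "u3": ["e"]}

report = impact_report(current, candidate)
changed_users = [u for u, diff in report.items() if diff["added"] or diff["removed"]]

print("current:", count_recommendations(current))
print("candidate:", count_recommendations(candidate))
for user in changed_users:
    print(user, report[user])  # these rows are the ones to review by hand
```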
2.Full Stack Data Science Approach
●Data Scientists deploy and maintain their data pipelines.
●They own the whole data pipeline.
Advantages:
●They master their pipelines in production.
●They can easily deploy improvements.
Ref: Designing Machine Learning Systems by Chip Huyen
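One way to picture "owning the whole data pipeline" is a single object in which preprocessing, the model, and postprocessing live together, so that the code run in benchmarks is the code deployed. This is only a sketch under that assumption: the `Pipeline` class and the three toy stages are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Raw data -> preprocessing -> model -> postprocessing -> output."""

    preprocess: Callable[[List[str]], List[float]]
    predict: Callable[[List[float]], List[float]]
    postprocess: Callable[[List[float]], List[str]]

    def run(self, raw: List[str]) -> List[str]:
        return self.postprocess(self.predict(self.preprocess(raw)))


# Hypothetical stages, all maintained by the same Data Scientist.
pipeline = Pipeline(
    preprocess=lambda raw: [float(value.strip()) for value in raw],
    predict=lambda features: [min(1.0, 2 * x) for x in features],  # stand-in model
    postprocess=lambda scores: ["alert" if s >= 0.5 else "ok" for s in scores],
)

output = pipeline.run([" 0.1 ", "0.4", "0.9"])
print(output)
```

Swapping one stage for an improved version is a local change, which is part of what makes deploying improvements easier for whoever owns all three.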
16
2.Full Stack Data Science Approach
[Diagram: Raw data → Preprocessing → Machine Learning → Postprocessing → Output]
What about Data Scientists asking ML engineers to deploy a pipeline from a Jupyter Notebook?
❌ When it fails in production, who should fix the problem: Data Scientists or ML engineers?
❌ Data Scientists do not know exactly, or master, what is deployed in production.
❌ It is difficult to iterate to improve the pipeline.
✅ Can be great for very research-oriented Data Scientists doing deep work on model improvement.
17
2.Full Stack Data Science Approach
So do Full Stack Data Scientists really do EVERYTHING?
NO! They get great support from the Data Platform:
●Reusable modules for CI-CD
○Pre-commit checks
○Dockerization
○Unit tests
○Model deployment
●Shared libraries to use the Data Platform components
●Admin of the Data Platform components
18
2.Full Stack Data Science Approach
●Data Scientists deploy and maintain their data pipelines.
●They own the whole data pipeline.
●They get support from the Data Platform.
Advantages:
●They master their pipelines in production.
●They can easily deploy improvements.
Ref: Designing Machine Learning Systems by Chip Huyen
19
Full Stack Data Scientist at Roche?
✅ I deployed and maintained the pipelines.
✅ I did a lot of software engineering, MLOps and DevOps.
❌ I did not own the WHOLE data pipeline.
20
[Diagram: Raw data → Preprocessing → Machine Learning → Postprocessing → Output]
SQL preprocessing by Analytics Engineers
Python preprocessing by Data Scientists
Data Scientists ownership
Full Stack Data Scientist
Ref:
-Vision for a data team
-Machine Learning at Alan
21
Key takeaways
[Diagram: Raw data → Preprocessing → Machine Learning → Postprocessing → Output]
Data Scientists ownership
Full Stack Data Science Approach
●Data Scientists deploy and maintain their data pipelines.
●They own the whole data pipeline.
●They get support from the Data Platform.
How to easily iterate on a model already deployed
●Evaluation framework
●Separating production and benchmarking code
22