Working with Machine Learning in Production


About This Presentation

A presentation on Machine Learning in Production.


Slide Content

Machine Learning in Production

Hello!
I am Frederick Apina.
I am here because I love to give presentations.
You can email me at: [email protected]


Deploying deep learning models in production is challenging: it goes far beyond training models with good performance.


Fun Fact
85% of AI projects fail.

Potential reasons include:
◎Technically infeasible or poorly scoped
◎Never make the leap to production
◎Unclear success criteria (metrics)
◎Poor team management



This talk aims to be an engineering guideline for building production-level machine learning systems that will be deployed in real-world applications.

1.
ML Project Lifecycle


Important Note:
It is important to understand the state of the art in your domain.
Why?
◎Helps you understand what is possible
◎Helps you know what to try next

Important factors to consider when defining and prioritizing ML projects:

High Impact
◎Complex parts of your pipeline
◎Where "cheap prediction" is valuable
◎Where automating a complicated manual process is valuable

Low Cost
◎Cost is driven by:
○Data availability
○Performance requirements: costs tend to scale super-linearly with the accuracy requirement
○Problem difficulty


2.
Data Management

2.1 Data Sources
◎Supervised deep learning requires a lot of labeled data
◎Labeling your own data is costly!
◎Here are some resources for data:
○Open-source data (good to start with, but not an advantage)
○Data augmentation (a MUST for computer vision, an option for NLP); see the sketch below
○Synthetic data (almost always worth starting with, esp. in NLP)
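As a minimal sketch of the data augmentation point, assuming a PyTorch/torchvision workflow (the slides do not prescribe a library, and the crop size and jitter values are illustrative):

# Minimal image-augmentation sketch with torchvision (assumed stack).
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(),        # flip half of the images
    transforms.ColorJitter(brightness=0.2,    # small photometric perturbations
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
# Applied on the fly during training, e.g. as the transform of an ImageFolder dataset.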

2.2 Data Labeling
◎Requires: a separate software stack (labeling platforms), temporary labor, and QC
◎Sources of labor for labeling:
○Crowdsourcing (Mechanical Turk): cheap and scalable, less reliable, needs QC
○Hiring your own annotators: less QC needed, expensive, slow to scale
○Data labeling service companies
◎Labeling platforms

2.3 Data Storage
◎Data storage options:
○Object store: store binary data (images, sound files, compressed texts)
○Database: store metadata (file paths, labels, user activity, etc.)
○Data lake: aggregate features that are not obtainable from the database (e.g. logs)
○Feature store: store, access, and share machine learning features
◎Suggestion: at training time, copy data into a local or networked filesystem (NFS)
A sketch of the object store + database split follows below.
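A minimal sketch of keeping binaries in an object store and metadata in a database, assuming an S3-compatible store via boto3 and a local SQLite file; the bucket name, table schema, and file paths are illustrative, not from the slides:

# Binary payloads go to the object store; lightweight metadata goes to a database.
import sqlite3
import boto3

s3 = boto3.client("s3")
db = sqlite3.connect("metadata.db")
db.execute("CREATE TABLE IF NOT EXISTS samples (s3_key TEXT PRIMARY KEY, label TEXT)")

def ingest(local_path, s3_key, label, bucket="my-training-data"):  # illustrative bucket
    s3.upload_file(local_path, bucket, s3_key)        # image/audio blob -> object store
    db.execute("INSERT OR REPLACE INTO samples VALUES (?, ?)", (s3_key, label))
    db.commit()                                       # path + label -> metadata database

ingest("cat_001.jpg", "images/cat_001.jpg", "cat")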

2.4 Data Versioning
◎It's a "MUST" for deployed ML models: deployed ML models are part code, part data, so no data versioning means no model versioning (see the hashing sketch below).
◎Data versioning platforms
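To make the "part code, part data" point concrete, here is a minimal pure-Python sketch (my illustration; dedicated platforms such as DVC do this more robustly): record a content hash of the training data next to each model artifact, so every model version points at an exact data version.

# Tie a model version to an exact data version via a content hash of the dataset.
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(data_dir):
    """Hash every file in the dataset directory, in a deterministic order."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def register_model(model_path, data_dir, registry="models.json"):
    """Record which data snapshot produced which model artifact."""
    entry = {"model": model_path, "data_version": dataset_fingerprint(data_dir)}
    records = json.loads(Path(registry).read_text()) if Path(registry).exists() else []
    records.append(entry)
    Path(registry).write_text(json.dumps(records, indent=2))

register_model("checkpoints/model_v3.pt", "data/train")   # illustrative paths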

2.5 Data Processing
◎Training data for production models may come from different sources.
◎There are dependencies between tasks; each needs to be kicked off after its dependencies have finished.
◎Makefiles are not scalable; "workflow managers" become pretty essential in this regard (see the sketch below).
◎Workflow orchestration
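As an illustration of a workflow manager handling task dependencies, here is a minimal sketch assuming Apache Airflow (one common orchestrator; the slide names workflow orchestration generically). The task names and callables are hypothetical placeholders.

# Minimal Airflow DAG sketch: each task runs only after its dependencies finish.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_logs(): ...        # placeholder task bodies
def pull_labels(): ...
def build_features(): ...
def train_model(): ...

with DAG("training_pipeline", start_date=datetime(2025, 1, 1),
         schedule_interval=None) as dag:
    logs = PythonOperator(task_id="pull_logs", python_callable=pull_logs)
    labels = PythonOperator(task_id="pull_labels", python_callable=pull_labels)
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    # Feature building waits for both data sources; training waits for features.
    [logs, labels] >> features >> train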

3.
Development, Training and Evaluation

3.1 Software Engineering
◎Winner language: Python
◎Editors:
○VS Code, PyCharm
○Notebooks: Jupyter Notebook, JupyterLab, nteract
○Streamlit: interactive data science tool with applets (see the sketch below)
◎Compute recommendations:
○For individuals or startups: use a GPU PC, buy shared servers, or use cloud instances
○For large companies: use cloud instances with proper provisioning and handling of failures
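To show what a Streamlit applet looks like in practice, here is a minimal sketch (my example; the slide only names the tool, and the chart content is illustrative). Run it with `streamlit run app.py`.

# A slider that re-runs the script and redraws the output on every interaction.
import numpy as np
import streamlit as st

st.title("Threshold explorer")
threshold = st.slider("Decision threshold", 0.0, 1.0, 0.5)

scores = np.random.default_rng(0).random(1000)   # stand-in for model scores
st.write(f"Fraction of samples above threshold: {(scores > threshold).mean():.2%}")
st.line_chart(np.sort(scores))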

3.2 Resource Management
◎Allocating free resources to programs
◎Resource management options:
○Old-school cluster job scheduler
○Docker + Kubernetes
○Kubeflow
○Polyaxon (paid features)

3.3 DL Frameworks

3.4 Experiment Management
◎Development, training, and evaluation strategy:
○Always start simple
○Experiment management tools (see the MLflow sketch below):
◉TensorBoard
◉Comet
◉Weights & Biases
◉MLflow Tracking
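As a minimal sketch of experiment tracking, assuming MLflow Tracking from the list above (the experiment name, parameters, and metric values are illustrative; by default runs land in a local ./mlruns folder):

# Log hyperparameters and metrics for one run so experiments stay comparable.
import mlflow

mlflow.set_experiment("baseline-classifier")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("batch_size", 64)

    for epoch in range(3):
        val_accuracy = 0.70 + 0.05 * epoch          # stand-in for real evaluation
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)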

3.5 Hyperparameter Tuning
◎Approaches (a random-search sketch follows below):
○Grid search
○Random search
○Bayesian optimization
◎Platforms:
○Ray Tune
○Katib
○Hyperas
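To make the random search approach concrete, here is a minimal pure-Python sketch; the objective function and search ranges are hypothetical, and platforms like Ray Tune wrap the same idea with scheduling and parallelism.

# Sample hyperparameters at random, keep the best trial.
import math
import random

def train_and_evaluate(learning_rate, hidden_units):
    """Pretend validation accuracy; replace with a real training + eval loop."""
    return 1.0 - abs(math.log10(learning_rate) + 3) * 0.1 - abs(hidden_units - 256) / 2048

best_score, best_config = -float("inf"), None
for _ in range(20):                                       # 20 random trials
    config = {
        "learning_rate": 10 ** random.uniform(-5, -1),    # log-uniform sampling
        "hidden_units": random.choice([64, 128, 256, 512]),
    }
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)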

3.6 Distributed Training
◎Data parallelism: use it when iteration time is too long (both TensorFlow and PyTorch support it); see the sketch below
◎Model parallelism: use it when the model does not fit on a single GPU
◎Solutions:
○Horovod
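As a minimal sketch of data parallelism, assuming PyTorch: nn.DataParallel is the simplest single-machine form, splitting each batch across the visible GPUs, while DistributedDataParallel or Horovod is what you would reach for at scale. The model and batch here are toy placeholders.

# DataParallel replicates the model on every GPU and splits each batch across them.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)          # scatter each batch over all GPUs
model = model.to(device)

batch = torch.randn(128, 512).to(device)    # one dummy batch of 128 examples
logits = model(batch)                       # each GPU processes a slice of the batch
print(logits.shape)                         # torch.Size([128, 10])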

4.
Testing and Deployment


4.2 Web Deployment
◎Consists of a prediction system and a serving system (a minimal prediction endpoint is sketched below)
◎Serving options:
○Deploy to VMs, scale by adding instances
○Deploy as containers, scale via orchestration
◎Model serving:
○Specialized web deployment for ML models
○Frameworks:
◉TensorFlow Serving, Clipper (Berkeley), Seldon
◎Decision making: CPU or GPU?
◎(Bonus) Deploying Jupyter notebooks: use Kubeflow Fairing
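As a minimal sketch of the prediction-system half, assuming Flask for the web layer (one possible choice; the slides list dedicated serving frameworks instead) and a hypothetical predict_fn standing in for the trained model:

# A prediction endpoint; the serving system (containers + orchestration, or
# TensorFlow Serving/Seldon) would sit in front of many replicas of this process.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_fn(features):
    """Placeholder: run the loaded model on one feature vector."""
    return {"label": "cat", "score": 0.93}

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)            # e.g. {"features": [...]}
    return jsonify(predict_fn(payload["features"]))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)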

4.3 Service Mesh and Traffic Routing
◎The transition from monolithic applications towards a distributed microservice architecture can be challenging.
◎A service mesh (consisting of a network of microservices) reduces the complexity of such deployments and eases the strain on development teams.
○Istio: a service mesh that eases creation of a network of deployed services, with load balancing, service-to-service authentication, and monitoring, with few or no changes in service code.

4.4 Monitoring
◎Purpose of monitoring:
○Alerts for downtime, errors, and distribution shifts (a distribution-shift check is sketched below)
○Catching service and data regressions
◎Kiali: an observability console for Istio with service mesh configuration capabilities. It answers: how are the microservices connected, and how are they performing?
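To make the distribution-shift alert concrete, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test on one feature (my illustration; the p-value threshold, the synthetic data, and the alert action are hypothetical, and production systems typically monitor many features plus prediction scores):

# Compare a feature's live distribution against a training-time reference window.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_sample = rng.normal(loc=0.0, scale=1.0, size=5000)   # reference window
live_sample = rng.normal(loc=0.4, scale=1.0, size=1000)       # recent production traffic

result = ks_2samp(training_sample, live_sample)
if result.pvalue < 0.01:                                      # illustrative threshold
    print(f"ALERT: possible distribution shift "
          f"(KS={result.statistic:.3f}, p={result.pvalue:.1e})")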

4.5 Deploying on Embedded and Mobile Devices
◎Main challenge: memory footprint and compute constraints
◎Solutions (see the sketch below):
○Quantization
○Reduced model size (MobileNets)
○Knowledge distillation
◎Embedded and mobile frameworks:
○TensorFlow Lite, PyTorch Mobile, Core ML, FRITZ, ML Kit
◎Model conversion:
○Open Neural Network Exchange (ONNX): an open-source format for deep learning models
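As a minimal sketch of two of the points above, assuming PyTorch (the model is a toy placeholder): dynamic quantization to shrink the memory footprint, and ONNX export for conversion to other runtimes.

# (1) Dynamic quantization of Linear layers, (2) ONNX export of the float model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# 1) Store Linear weights as int8 and run them with quantized kernels.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 2) Export the float model to ONNX (quantized-op export support varies by opset).
dummy_input = torch.randn(1, 512)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["features"], output_names=["logits"])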

4.6 All-in-One Solutions
◎TensorFlow Extended (TFX)
◎Michelangelo (Uber)
◎Google Cloud AI Platform
◎Amazon SageMaker
◎Neptune
◎FLOYD
◎Paperspace
◎Determined AI
◎Domino Data Lab

Are We Done?


Thanks!
Any questions?
You can find me at:
[email protected]