Hello!
I am Frederick Apina
I am here because I love to
give presentations.
You can email me at: [email protected]
“Deploying deep learning models in
production is challenging: it goes far
beyond training models with good
performance.
Fun Fact
85% of AI
projects fail.
Why?
Potential reasons include:
◎Technically infeasible or poorly scoped
◎Never make the leap to production
◎Unclear success criteria (metrics)
◎Poor team management
“
This talk aims to be an engineering
guideline for building
production-level machine learning
systems that will be deployed in
real-world applications.
1.
ML Project Lifecycle
Important Note:
You should understand the state of the art in your
domain:
Why?
◎Helps to understand what is possible
◎Helps to know what to try next
Important factors to consider when defining and
prioritizing ML projects:
High Impact
◎Complex parts of your pipeline
◎Where "cheap prediction" is
valuable
◎Where automating a complicated
manual process is valuable
Low Cost
◎Cost is driven by:
○Data availability
○Performance requirements:
costs tend to scale super-linearly
in the accuracy requirement
○Problem difficulty
2.
Data Management
2.1 Data Sources
◎Supervised deep learning requires a lot of labeled data
◎Labeling your own data is costly!
◎Here are some resources for data:
○Open source data (good to start with, but not an
advantage)
○Data augmentation (a MUST for computer vision, an
option for NLP)
○Synthetic data (almost always worth starting with, esp.
in NLP)
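As a minimal illustration of data augmentation for vision, here is a sketch of a random flip plus brightness jitter (the flip probability and jitter range are arbitrary choices, not from the slides; real pipelines would use a library such as torchvision or albumentations):

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random horizontal flip and brightness jitter to an HxWxC image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                  # horizontal flip
    factor = 1.0 + rng.uniform(-0.2, 0.2)      # brightness jitter in [0.8, 1.2]
    out = np.clip(out.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    return out

# Example: a tiny 2x2 RGB "image"
rng = np.random.default_rng(0)
img = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)
aug = augment(img, rng)
print(aug.shape)  # (2, 2, 3) — shape and dtype are preserved
```

Each call yields a slightly different training example from the same source image, which is what makes augmentation cheap extra data.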
2.2 Data Labeling
◎Requires: separate software stack (labeling platforms),
temporary labor, and QC
◎Sources of labor for labeling:
○Crowdsourcing (Mechanical Turk): cheap and scalable,
less reliable, needs QC
○Hiring own annotators: less QC needed, expensive,
slow to scale
○Data labeling service companies
◎Labeling platforms
2.3 Data Storage
◎Data storage options
○Object store: Store binary data (images, sound files,
compressed texts)
○Database: Store metadata (file paths, labels, user
activity, etc.)
○Data Lake: to aggregate features which are not
obtainable from database (e.g. logs)
○Feature Store: store, access, and share machine
learning features
◎Suggestion: At training time, copy data into a local or
networked filesystem (NFS)
2.4 Data Versioning
◎It's a "MUST" for deployed ML models:
Deployed ML models are part code, part data. No data
versioning means no model versioning.
◎Data versioning platforms
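To make the "no data versioning means no model versioning" point concrete, here is a hedged sketch (not how any particular platform works; tools like DVC do this far more robustly) of fingerprinting a dataset directory so each training run can record exactly which data it saw:

```python
import hashlib
import os
import tempfile

def dataset_version(root: str) -> str:
    """Fingerprint a dataset directory: hash every file's relative path
    and contents. Any change to the data yields a new version id, which
    can be recorded alongside the model's code revision."""
    h = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()[:12]

# Demo: relabeling a single file produces a new dataset version.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "labels.csv"), "w") as f:
        f.write("img1,cat\n")
    v1 = dataset_version(d)
    with open(os.path.join(d, "labels.csv"), "w") as f:
        f.write("img1,dog\n")
    v2 = dataset_version(d)

print(v1 != v2)  # True — changed labels produce a new version id
```

Logging this id next to the git commit gives a reproducible (code, data) pair for every deployed model.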
2.5 Data Processing
◎Training data for production models may come from
different sources.
◎There are dependencies between tasks, each needs to be
kicked off after its dependencies are finished.
◎Makefiles are not scalable. "Workflow managers" become
pretty essential in this regard.
◎Workflow orchestration
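The core of what a workflow manager provides is dependency-ordered execution of tasks. A minimal sketch using Python's standard-library `graphlib` (the task names are hypothetical; real orchestrators like Airflow add scheduling, retries, and distribution on top of this idea):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "download_logs": set(),
    "download_images": set(),
    "clean_logs": {"download_logs"},
    "build_features": {"clean_logs", "download_images"},
    "train": {"build_features"},
}

# static_order() yields tasks so that every task appears after its
# dependencies — each task is "kicked off after its dependencies finish".
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

A real workflow manager runs independent tasks (here, the two downloads) in parallel rather than sequentially.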
3.
Development, Training
and Evaluation
3.1 Software Engineering
◎Language: Python is the clear winner
◎Editors:
○VS Code, Pycharm
○Notebooks: Jupyter Notebook, JupyterLab, nteract
○Streamlit: Interactive data science tool with applets
◎Compute recommendations
○For individuals or startups: Use GPU PC or buy shared
servers or use cloud instances
○For large companies: Use cloud instances with proper
provisioning and handling of failures
3.6 Distributed Training
◎Data parallelism: use it when iteration time is too long
(both TensorFlow and PyTorch support it)
◎Model parallelism: use it when the model does not fit on a single GPU
◎Solutions
○Horovod
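The arithmetic behind synchronous data parallelism can be sketched in a few lines: each worker computes a gradient on its shard of the batch, and averaging those gradients (an all-reduce) reproduces the full-batch gradient. This toy example uses NumPy and a linear model (not any framework's actual API):

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the mean squared error on one worker's data shard."""
    n = len(y)
    return X.T @ (X @ w - y) / n

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
# Data parallelism: split the batch across two "workers", compute
# gradients locally, then average them before the shared weight update.
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
avg = sum(grads) / len(grads)

# With equal shard sizes, the averaged gradient equals the full-batch one.
full = local_gradient(w, X, y)
print(np.allclose(avg, full))  # True
```

Horovod and the built-in TensorFlow/PyTorch strategies implement exactly this averaging step efficiently across GPUs and hosts.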
4.
Testing and Deployment
4.2 Web Deployment
◎Consists of a Prediction System and a Serving System
◎Serving options:
○Deploy to VMs, scale by adding instances
○Deploy as containers, scale via orchestration
◎Model serving:
○Specialized web deployment for ML models
○Frameworks:
◉TensorFlow Serving, Clipper (Berkeley), Seldon
◎Decision making: CPU or GPU?
◎(Bonus) Deploying Jupyter Notebooks: Use Kubeflow
Fairing
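To make the prediction-system / serving-system split concrete, here is a hedged sketch using only the Python standard library (the model, weights, and route are stand-ins; production setups would use a framework like TensorFlow Serving behind a load balancer):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Prediction system: wraps the model. Here a stand-in linear scorer;
# in practice this would load trained weights from a model registry.
def predict(features):
    weights = [0.4, 0.6]  # hypothetical trained weights
    score = sum(w * x for w, x in zip(weights, features))
    return {"label": "positive" if score > 0.5 else "negative",
            "score": score}

# Serving system: handles HTTP, request parsing, and responses.
class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body)["features"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve: HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

Keeping `predict` separate from `Handler` is the point: the prediction system can be tested, versioned, and swapped independently of how it is served and scaled.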
4.3 Service Mesh and Traffic Routing
◎Transitioning from a monolithic application to a
distributed microservice architecture can be challenging.
◎A service mesh (a network of microservices) reduces the
complexity of such deployments and eases the strain on
development teams.
○Istio: a service mesh that eases creating a network of
deployed services with load balancing,
service-to-service authentication, and monitoring, with
few or no changes to service code.
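For illustration, an Istio `VirtualService` can split traffic between two model versions for a canary rollout; the service name `predictor` and subsets `v1`/`v2` here are hypothetical:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: predictor
spec:
  hosts:
    - predictor
  http:
    - route:
        - destination:
            host: predictor
            subset: v1      # current model
          weight: 90
        - destination:
            host: predictor
            subset: v2      # candidate model
          weight: 10
```

Shifting the weights gradually moves traffic to the new model without touching service code, which is exactly the "few or no code changes" promise above.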
4.4 Monitoring
◎Purpose of monitoring:
○Alerts for downtime, errors, and distribution shifts
○Catching service and data regressions
◎Kiali: an observability console for Istio with service mesh
configuration capabilities. It answers these questions: How
are the microservices connected? How are they
performing?
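A distribution-shift alert can be surprisingly simple. Here is a crude sketch (thresholds and data are illustrative; production monitors also track quantiles, null rates, and metrics like PSI or KL divergence):

```python
import math

def mean_shift_alert(reference, live, threshold=3.0):
    """Flag a distribution shift when the live feature mean drifts more
    than `threshold` standard errors from the reference mean."""
    n = len(reference)
    ref_mean = sum(reference) / n
    ref_var = sum((x - ref_mean) ** 2 for x in reference) / (n - 1)
    se = math.sqrt(ref_var / len(live))
    live_mean = sum(live) / len(live)
    z = (live_mean - ref_mean) / se
    return abs(z) > threshold

# Reference window from training data; live windows from production traffic.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1]
print(mean_shift_alert(reference, [10.0, 9.9, 10.3]))   # False — no drift
print(mean_shift_alert(reference, [15.0, 16.0, 14.5]))  # True — drifted
```

Wiring such checks into the alerting system is what turns silent data regressions into pageable incidents.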
4.5 Deploying on Embedded and Mobile Devices
◎Main challenge: memory footprint and compute constraints
◎Solutions:
○Quantization
○Reduced model size (MobileNets)
○Knowledge Distillation
◎Embedded and Mobile Frameworks:
○TensorFlow Lite, PyTorch Mobile, Core ML, FRITZ, ML Kit
◎Model Conversion:
○Open Neural Network Exchange (ONNX): open-source
format for deep learning models
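The idea behind quantization can be shown in a few lines. This is a toy affine int8 scheme in NumPy (not TensorFlow Lite's actual implementation; it assumes the weights are not all identical):

```python
import numpy as np

def quantize(weights: np.ndarray):
    """Affine int8 quantization: map float weights onto [-128, 127]."""
    scale = (weights.max() - weights.min()) / 255.0
    zero_point = round(-128 - weights.min() / scale)
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127)
    return q.astype(np.int8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale, zp = quantize(w)
w_hat = dequantize(q, scale, zp)
print(np.max(np.abs(w - w_hat)) < scale)  # True — error under one step
```

Storing `q` instead of `w` cuts the memory footprint 4x (int8 vs float32), which is exactly the trade-off that makes on-device inference feasible.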
4.6 All-in-One Solutions
◎TensorFlow Extended (TFX)
◎Michelangelo (Uber)
◎Google Cloud AI Platform
◎Amazon SageMaker
◎Neptune
◎FLOYD
◎Paperspace
◎Determined AI
◎Domino data lab