PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)

RebeccaBilbro 66 views 30 slides Jun 19, 2024

About This Presentation

To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science in the 2010s, Rebecca will share lessons learned the hard way...

Slide Content

Mistakes Were Made
Lessons Learned ~15 Years In
PyData London X | Dr. Rebecca Bilbro

What is a mistake?

Classification Errors
Overtraining
Dubious Data Sourcing

Is dunking on genAI in bad faith?

(The first ~10 years of data science)

Low ROI Is an existential threat

“The ROI in AI
(and How to Find It)”

“Those who cannot remember
the past are condemned to
repeat it.”

George Santayana

So let’s get vulnerable

Types of mistakes in data science
(chart labels: embarrassing, costly)
-not following PEP8
-using Jupyter notebooks in production
-using up non-renewable energy
-vendor lock
-everything else

Fig 1. How ancient data scientists conceived ML architectures.

What it actually takes to build an ML product:

●Model microservices
●Rules-based agents
●Multi-modal models
●Expert systems
●Agentic AI
●“Crews”
●Mixture of experts (MoE)
●TBD

More vulnerable

Synthetic Oversampling
Active Learning
Visual Hallucinations & The
Crowding Problem

Even more vulnerable

310 miles / 499 km, 4 hr 50 min (one way)

Inspection targeting

Setting: OSHA, the Obama years
Business goal: Make the most of a
small budget for health and safety
inspections and hopefully save more
lives.
Data science angle:
●Use ML to identify features of
companies where workers get
sick/hurt/killed.
●Find companies that match the
characteristics and haven’t
already been inspected in the
last 3 years.
●Fix government with data
science.

In the US, the Occupational Safety and Health
Administration can only do inspections under the following
circumstances:
-Fatalities and/or serious injuries
-Worker complaints
-Random selection
Machine learning is not a clear fit.

Setting: Department of Commerce, still the Obama years
Business goal: Unlock greater economic opportunity for American exporters
Data science angle:
-Use ML to identify features of companies that export their goods and services.
-Find US-based companies that match the characteristics but aren’t exporting yet.
-Fix government with data science.

Imports and Exports
-Used LASSO to identify most
informative features.
-Added a random forest for binary
classification.
-This was a dumb idea

Imports and Exports
Attempt #2
-Found a huge number of companies within
the same industry (NAICS 488510) that were
not yet exporting.
-Presented results to domain experts,
who all looked at me like I was born
yesterday. The industry was “Freight
Transportation Arrangement”.
-The feedback was gentle but clear –
you can’t export freight logistics,
dumdum. Had to go back to the
drawing board again.

Setting: Analytics for sales enablement
Business goal: Link profiles across social
media platforms and enhance with
demographics
Data science angle:
-Use graph-based entity resolution and
network analysis to infer linkages.
-Build classifiers to label incoming text
data with metadata for sales
enablement.
REDACTED
-The initial models were “thrown
over the wall” and it took months to
build the inference API
-Discovered some models had been
trained on ethically questionable
data.
-The data science team had to build
a production-grade ingestion
system.

Setting: Customer service organization
Business goal: Use AI to interpret customer
service requests to route them more efficiently
Data science angle:
-Train a neural network on historic CSR
data to do multiclass classification.
-Use human CSR routing decisions as the
“labels”
-Serialize the model and use an inferencing
API to route incoming queries.

The company said they wanted “AI”. We took what they
said at face value.
We picked a preliminary model that was way too
complex
-TensorFlow architecture with many layers and hyperparameters.
-Training took a long time.
-Tuning was difficult.
-We didn’t do a good job of capturing and versioning our experiments.
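
One way to avoid losing track of experiments is to append every run to a plain log file. The sketch below is only an illustration, not what the team used: the function names, the JSON Lines format, and the record fields are all assumptions.

```python
import hashlib
import json
import time
from pathlib import Path


def log_experiment(log_path, params, metrics):
    """Append one experiment record to a JSON Lines file and return it."""
    # Hash the sorted parameters so identical configurations share a run ID.
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    record = {
        "run_id": hashlib.sha256(encoded).hexdigest()[:12],
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged record with the highest value for `metric`."""
    with Path(log_path).open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return max(records, key=lambda r: r["metrics"][metric])
```

Calling log_experiment after each training run and committing the log next to the code keeps hyperparameters and scores together. A dedicated experiment tracker would do more, but even this much makes runs comparable.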

Attempt #2
-A simpler model
performed better but
still underwhelmed
leadership
-0.68 overall F1 score was
not exciting, even
though some per-class
scores were good
-Presented
visualizations that
made sense to us but
not to them

Attempt #3
-Finally got a model that was “good enough”.
-Team went straight from research to deployment.
-Deployed Jupyter notebooks as code
-No packaging
-No versioning
-No linting
-No tests
-No plan for model drift detection or retraining
-Incoming customer requests after March 2020 did not
match the training data.
-Everyone was suddenly working from home and
experiencing issues that did not match any known labels.
-“Production is noise”
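
A drift check does not have to be elaborate. The sketch below compares the mix of predicted categories in a reference window against a recent window using total variation distance. It is a hypothetical illustration rather than anything from the project, and the function names and the 0.2 threshold are assumptions that would need tuning.

```python
from collections import Counter


def distribution(labels):
    """Return the relative frequency of each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(reference, recent):
    """Total variation distance between two label distributions (0 to 1)."""
    p, q = distribution(reference), distribution(recent)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(reference, recent, threshold=0.2):
    """Flag drift when the distance exceeds an (assumed) threshold."""
    return total_variation(reference, recent) > threshold
```

Run on a schedule against recent predictions, a check like this would flag a sudden shift in request types. It would not by itself catch requests that fit no known label, so it complements a retraining plan rather than replacing one.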

In engineering, we call these
kinds of things “anti-patterns”

We need to talk more about the
anti-patterns of data science.

SOME
Anti-patterns from the
first 10 years
-Self-indulgent experimentation
-Lack of understanding and/or
interest in the downstream user
-Disregard for data provenance
-Estrangement from engineering
culture
-Over-reliance on batch-wise
training
-Ignorance of application
contexts

Practice using 1st person when
talking about mistakes.

“I made a mistake when…”

Actually, your mistakes make you.

“An expert is a person who has
made all the mistakes that can be
made in a very narrow field.”
- Niels Bohr
Dr. Rebecca Bilbro

Co-Founder/CTO @ Rotational Labs
Applied Text Analysis with Python (O’Reilly)
Scikit-Yellowbrick
