Garbage In, Garbage Out: Why poor data curation is killing your AI models (and how to fix it)

16th July 2024

Agenda
I. Introduction
II. Why quality data matters
III. Data cleaning and curation trends
IV. The steps to producing quality data
Webinar

Speaker

Alexandre Bonnet
ML Solutions Lead

*Encord operates on 'Salesforce quarters'; all reported figures align with our fiscal year
Manage and select the right
data for your AI application
USERS
Data Engineer, ML Engineer
Orchestrate human-in-the-loop
workflows for ground truth,
ranking (RLHF), and validation

USERS
Domain Expert, AI Trainer, Labeler
Find edge cases & debug model
errors, discover high-impact
data to boost performance

USERS
AI Researcher, ML Engineer
CONTINUOUS DATA SELECTION FOR TRAINING, FINE-TUNING & VALIDATION
IDENTIFY & FIX DATA & MODEL ERRORS
DATA SELECTED FOR HUMAN REVIEW VALIDATE QUALITY OF DATA
We are a data development platform that
helps AI teams automate data in a single workflow

The Future of AI is in the present

The Future of AI is in the present… Sort of

encord.com
Traditional AI          Modern AI
Build from scratch      Build on Foundations
Uni-modal               Multi-modal
Model Centric           Data Centric
Technical users         General users

Pillars of AI
Data: Foundational information for models, guiding AI
learning and determining task specifications and
accuracy.

Models: The "brains" of AI, using algorithms to process
information and make decisions.

Compute: The hardware and resources required to
train and run models, handling complex calculations
and data processing.

While models and compute
are evolving rapidly

Data is still in its infancy
and AI teams’ main
competitive advantage
Models in one config file
Compute in one config file

Foundation models
●YOLOv8-v10
●R-CNNs
●LLaVA
●Segment Anything
●GPT-4o
●Claude
●Etc.

From Big data to Small data.

Rest assured, you don’t have to
start from scratch.

Presenting: foundation models.

Why Data Quality
Matters

Bad Data = Bad Model

Great Data = Great Model
… Not (necessarily) Big Data

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Small data, but HIGH quality data…

https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/
“The use of natural language in training robots is still in its infancy, particularly in autonomous driving.”
“Incorporating language along with vision and action may have an enormous impact…”

Domain Specific Fine-tuning - Medical
There are three critical issues
identified that require your attention
2048x A100s 8x A100s

Domain Specific Application - Medical

Definitions

Data cleaning
Refers to the process of identifying and
correcting errors or inconsistencies in the
data.

This can include:

●Removing noisy and corrupt data
●Handling missing or duplicate data
●Standardizing and preprocessing the
data to ensure quality

Data curation
Refers to the organization and management
of the data to make it suitable for labeling or
model training.

This can include:

●Selecting data for labeling
●Pre-labeling and enriching data
●Ensuring that the data is error-free and
high quality
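
The data cleaning steps in the definitions above (handling missing and duplicate data) can be sketched in a few lines. This is a minimal illustration, not the method used in the webinar: the sample layout (a dict with hypothetical `id` and `data` keys) and the function name `clean_samples` are assumptions, and it only catches exact byte-level duplicates. Detecting corrupt files in a real pipeline would need a format-specific check.

```python
import hashlib


def clean_samples(samples):
    """Drop samples with missing/empty payloads and exact duplicates.

    `samples` is a list of dicts with hypothetical keys 'id' and 'data'
    (raw bytes or None). Returns (kept_samples, dropped_reasons).
    """
    first_seen = {}   # content hash -> id of the first sample with that content
    kept = []
    dropped = {}      # sample id -> reason it was removed
    for sample in samples:
        data = sample.get("data")
        if not data:
            # None or empty bytes: nothing usable to label or train on
            dropped[sample["id"]] = "missing"
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            dropped[sample["id"]] = f"duplicate of {first_seen[digest]}"
            continue
        first_seen[digest] = sample["id"]
        kept.append(sample)
    return kept, dropped
```

Recording the reason for each dropped sample keeps the cleaning step auditable, which helps when the same script is rerun on new data.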

Data quality &
curation trends

What we used to see in data pipelines
[Diagram: Data Collection, Data Verification, Serving Infrastructure, Annotation tool, Model Training tool, Model Evaluation tool, Data Warehouse, External Tool]
Stakeholders involved: Data science, Annotation
Data volume: 10-100 thousand frames per year

What we now see in data pipelines
[Diagram: Data Collection, Data Verification, Label Quality tool, Data Curation tool, Serving Infrastructure, Versioning tool, Monitoring tool, Annotation tool, Model Training tool, Model Evaluation tool, Data Warehouse, Data Visualisation tool, Workflow Orchestration tool, External Tool, Data Cleaning tool]
Stakeholders involved: Data science, DataOps, MLOps, Annotation, Leadership
Data volume: 10 million frames per year or more

Today’s focus
[Diagram: the same pipeline as in "What we now see in data pipelines"]

What we typically see
Cleaning & Curation is often cumbersome
●Manual inspection
●One-off Jupyter scripts
●Immature open-source tools
●Manual video scrolling
●One-off Python scripts

There are downsides without cleaning and curation

Increased Annotation / MLOps Cost
Operational expenses increase due to unnecessary labeling and MLOps work.

Poor Model Quality
Poor data quality and less time spent on improving ML models degrade model performance.

Delayed Time-To-Market
Data quality issues and overhead slow down product delivery.

Achieve better model performance with less data
How Automotus increased mAP 20% by reducing their dataset size by 35% with visual data curation
Read the case study >

Achieve better model performance with less data
[Chart: model accuracy for a model with noisy unclean data, a model with clean data, and a model with clean & curated data]
Customer improvement
●35% dataset reduction
●20% mAP increase
●33% labeling cost decrease

So… how can we look
at changing this?

Lessons learned: common data quality errors

Cleaning errors
●Duplicate samples
●Corrupt samples
●Noisy samples
●Data outliers

Curation errors
●Random data selection
●Limited task prioritization
●No pre-labeling data
●No label validation
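
One alternative to the "random data selection" error listed above is to pick samples that are spread out in feature space. The sketch below uses greedy farthest-point (k-center) selection, which is a swapped-in illustrative technique and not necessarily what the speaker's platform does. The function name `farthest_point_selection` is hypothetical, and the input is assumed to be one numeric feature vector per sample.

```python
import math


def farthest_point_selection(vectors, k):
    """Greedily pick up to k indices whose vectors are far apart.

    Starts from index 0, then repeatedly adds the sample that is
    farthest from everything selected so far.
    """
    if not vectors or k <= 0:
        return []
    selected = [0]
    # Distance from each sample to its nearest already-selected sample
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < min(k, len(vectors)):
        candidate = max(range(len(vectors)), key=lambda i: min_dist[i])
        if min_dist[candidate] == 0:
            # Everything left duplicates a selected sample; stop early
            break
        selected.append(candidate)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[candidate]))
    return selected
```

Compared with random sampling, this tends to cover rare regions of the data sooner, at the cost of also favouring outliers, so it is usually paired with an outlier check.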

Quality Metrics to help you understand your data
●Duplicates
●Border closeness
●Occlusion
●Object density
●Object area
●Polygon shape
●Brightness
●Contrast
●Image area
●RGB values
●Aspect ratio
●Blur

[Screenshots: Model performance by metric, Metric distribution, Metrics outliers, 2D metrics comparison]
40+ Metrics out of the box
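
A few of the image-level quality metrics named above (brightness, contrast, blur, aspect ratio, image area) can be approximated with simple statistics. The following is a rough sketch under stated assumptions, not the product's implementation: the image is assumed to be a grayscale 2D list of intensities in [0, 255], the function name `image_metrics` is hypothetical, and blur is approximated by the variance of a 4-neighbour Laplacian, where a low value suggests a blurry image.

```python
from statistics import mean, pstdev, pvariance


def image_metrics(gray):
    """Compute simple quality metrics for a grayscale image.

    `gray` is a 2D list of pixel intensities in [0, 255] with rows
    of equal length.
    """
    height, width = len(gray), len(gray[0])
    pixels = [p for row in gray for p in row]
    # 4-neighbour Laplacian response on interior pixels
    laplacian = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return {
        "brightness": mean(pixels) / 255,      # 0 = black, 1 = white
        "contrast": pstdev(pixels) / 255,      # spread of intensities
        "sharpness": pvariance(laplacian) if laplacian else 0.0,
        "aspect_ratio": width / height,
        "area": width * height,
    }
```

Sorting a dataset by any one of these values is a quick way to surface outliers, such as very dark or very blurry frames, before they reach labeling.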

6 tools out of the box
Embeddings to help you understand your data

Data Cleaning

Data Curation

Metadata Validation

Pre-trained

Fine-Tuned
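
As one illustration of how embeddings can support data cleaning, near-duplicate samples can be flagged by comparing embedding vectors with cosine similarity. This is a minimal sketch: the function names and the 0.98 threshold are assumptions chosen for illustration, and the embeddings are assumed to come from whatever pre-trained or fine-tuned model is already in use.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicate_pairs(embeddings, threshold=0.98):
    """Return id pairs whose embeddings are at least `threshold` similar.

    `embeddings` maps a sample id to its vector. The threshold is an
    assumed value and should be tuned per dataset.
    """
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2)
        if cosine_similarity(vec_a, vec_b) >= threshold
    ]
```

The pairwise comparison is quadratic in the number of samples, so for large datasets an approximate nearest-neighbour index would be the more practical choice.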

Summary
●The data curation process is key to building high-performance,
accurate, and domain-specific models.
●This curation becomes exponentially more difficult and time
consuming for complex and specific applications without a solid
pipeline in place.
●Building on foundations allows us to move with speed and efficiency.
●We have seen a shift in the scale and complexity of applications and
unstructured data lakes.

Thank you!
