Garbage In, Garbage Out: Why poor data curation is killing your AI models (and how to fix it)

chloewilliams62 230 views 43 slides Jul 18, 2024

About This Presentation

Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack eff...


Slide Content

16th July 2024

Agenda
I. Introduction
II. Why quality data matters
III. Data cleaning and curation trends
IV. The steps to producing quality data
Webinar

Speaker

Alexandre Bonnet
ML Solutions Lead

*Encord operates on 'Salesforce quarters'—all reported figures align with our fiscal year
We are a data development platform that helps AI teams automate data in a single workflow

Manage and select the right data for your AI application
USERS
Data Engineer, ML Engineer

Orchestrate human-in-the-loop workflows for ground truth, ranking (RLHF), and validation
USERS
Domain Expert, AI Trainer, Labeler

Find edge cases & debug model errors, discover high-impact data to boost performance
USERS
AI Researcher, ML Engineer

CONTINUOUS DATA SELECTION FOR TRAINING, FINE-TUNING & VALIDATION
IDENTIFY & FIX DATA & MODEL ERRORS
DATA SELECTED FOR HUMAN REVIEW
VALIDATE QUALITY OF DATA

The Future of AI is in the present

The Future of AI is in the present… Sort of

encord.com
Traditional AI         Modern AI
Build from scratch     Build on foundations
Uni-modal              Multi-modal
Model-centric          Data-centric
Technical users        General users

Pillars of AI
Data: Foundational information for models, guiding AI
learning and determining task specifications and
accuracy.


Models: The "brains" of AI, using algorithms to process
information and make decisions.


Compute: The hardware and resources required to
train and run models, handling complex calculations
and data processing.

While models and compute are evolving rapidly, data is still in its infancy and is AI teams’ main competitive advantage.
Models in one config file
Compute in one config file

Foundation models
●YOLOv8-v10
●R-CNNs
●LLaVA
●Segment Anything
●GPT-4o
●Claude
●Etc.

From Big data to Small data.

Rest assured, you don’t have to start from scratch.

Presenting: foundation models.

Why Data Quality Matters

Bad Data = Bad Model

Great Data = Great Model
… Not (necessarily) Big Data

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Small data, but HIGH quality data…

https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/
“The use of natural language in training robots is still in its infancy, particularly in autonomous driving.”
“Incorporating language along with vision and action may have an enormous impact…”

Domain Specific Fine-tuning - Medical
There are three critical issues identified that require your attention
2048x A100s 8x A100s

Domain Specific Application - Medical

Definitions

Data cleaning
Refers to the process of identifying and correcting errors or inconsistencies in the data.

This can include:

●Removing noisy and corrupt data
●Handling missing or duplicate data
●Standardizing and preprocessing the data to ensure quality

Data curation
Refers to the organization and management of the data to make it suitable for labeling or model training.

This can include:

●Selecting data for labeling
●Pre-labeling and enriching data
●Ensuring that the data is error-free and high quality
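
As a rough illustration of the data cleaning steps defined above, here is a minimal Python sketch. It is not Encord's implementation: the sample layout (dicts with 'id', 'data', and 'label' keys) and the function name clean_samples are assumptions made for this example. It drops empty or corrupt payloads, removes exact duplicates by content hash, and standardizes label text.

```python
import hashlib


def clean_samples(samples):
    """Drop corrupt/missing and exact-duplicate samples, and normalize labels.

    Each sample is a hypothetical dict with keys 'id', 'data' (bytes), 'label'.
    Returns (cleaned_samples, dropped) where dropped holds (id, reason) pairs.
    """
    seen_hashes = set()
    cleaned, dropped = [], []
    for sample in samples:
        data = sample.get("data")
        # Corrupt or missing payload: nothing usable to label or train on.
        if not isinstance(data, bytes) or len(data) == 0:
            dropped.append((sample.get("id"), "corrupt_or_missing"))
            continue
        # Exact duplicates share the same content hash.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            dropped.append((sample.get("id"), "duplicate"))
            continue
        seen_hashes.add(digest)
        # Standardize label text: trim whitespace and lowercase.
        label = sample.get("label")
        normalized = label.strip().lower() if isinstance(label, str) else None
        cleaned.append({"id": sample.get("id"), "data": data, "label": normalized})
    return cleaned, dropped


samples = [
    {"id": "a", "data": b"\x01\x02", "label": " Car "},
    {"id": "b", "data": b"\x01\x02", "label": "car"},   # same bytes as "a"
    {"id": "c", "data": b"", "label": "truck"},          # empty payload
    {"id": "d", "data": b"\x03", "label": "Truck"},
]
cleaned, dropped = clean_samples(samples)
print([s["id"] for s in cleaned], dropped)
```

Hashing only catches byte-identical duplicates. Near-duplicates, such as consecutive video frames, need a similarity measure instead.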

Data quality &
curation trends

What we used to see in data pipelines
[Pipeline diagram: Data Collection, Data Verification, Serving Infrastructure, Annotation tool, Model Training tool, Model Evaluation tool, Data Warehouse, External Tool]
Stakeholders involved: Data science, Annotation
Data volume: 10-100 thousand frames per year

What we now see in data pipelines
[Pipeline diagram: Data Collection, Data Verification, Label Quality tool, Data Curation tool, Data Cleaning tool, Serving Infrastructure, Versioning tool, Monitoring tool, Annotation tool, Model Training tool, Model Evaluation tool, Data Warehouse, Data Visualisation tool, Workflow Orchestration tool, External Tool]
Stakeholders involved: Data science, DataOps, MLOps, Annotation, Leadership
Data volume: 10 million frames per year or more

Today’s focus
[Same pipeline diagram as on the previous slide]

What we typically see
Cleaning & Curation is often cumbersome
●Manual inspection
●One-off Jupyter scripts
●Immature OS tools
●Manual video scrolling
●One-off Python scripts

There are downsides without cleaning and curation

Increased Annotation / MLOps Cost
Operational expenses increase due to unnecessary labeling and MLOps.

Poor Model Quality
Poor data quality and less time spent on improving ML models degrade model performance.

Delayed Time-To-Market
Data quality issues and overhead slow down product delivery.

Achieve better model performance with less data
How Automotus increased mAP 20% by reducing their dataset size by 35% with visual data curation

Achieve better model performance with less data
[Chart: model accuracy for a model with noisy, unclean data; a model with clean data; and a model with clean & curated data]
Customer improvement
●35% dataset reduction
●20% mAP increase
●33% labeling cost decrease

So… how can we look
at changing this?

Learnings: common data quality errors

Cleaning errors
●Duplicate samples
●Corrupt samples
●Noisy samples
●Data outliers

Curation errors
●Random data selection
●Limited task prioritization
●No pre-labeling data
●No label validation
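
One of the curation errors listed above, skipping label validation, is cheap to guard against with a few automated checks. The sketch below is only an illustration under assumed conventions: labels are hypothetical dicts holding a class name and a pixel-space bounding box (x_min, y_min, x_max, y_max), and validate_boxes is a name made up for this example.

```python
def validate_boxes(labels, image_width, image_height, known_classes):
    """Return (index, reason) pairs for labels that fail basic sanity checks.

    Each label is a hypothetical dict:
    {'cls': str, 'box': (x_min, y_min, x_max, y_max)} in pixel coordinates.
    Only the first failing check is reported per label.
    """
    problems = []
    for i, label in enumerate(labels):
        x_min, y_min, x_max, y_max = label["box"]
        if label["cls"] not in known_classes:
            problems.append((i, "unknown_class"))
        elif x_max <= x_min or y_max <= y_min:
            # Zero or negative width/height.
            problems.append((i, "empty_box"))
        elif x_min < 0 or y_min < 0 or x_max > image_width or y_max > image_height:
            problems.append((i, "out_of_bounds"))
    return problems


labels = [
    {"cls": "car", "box": (10, 10, 50, 40)},     # valid
    {"cls": "cat", "box": (0, 0, 5, 5)},         # class not in the ontology
    {"cls": "car", "box": (30, 30, 30, 60)},     # zero width
    {"cls": "truck", "box": (90, 80, 120, 95)},  # extends past the image
]
problems = validate_boxes(labels, 100, 100, {"car", "truck"})
print(problems)
```

Checks like these only catch structural mistakes. Whether a box actually surrounds the right object still needs human review or a model-assisted comparison.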

Quality Metrics to help you understand your data
40+ Metrics out of the box
●Duplicates
●Border closeness
●Occlusion
●Object density
●Object Area
●Polygon shape
●Brightness
●Contrast
●Image area
●RGB values
●Aspect ratio
●Blur

Model performance by metric, Metric distribution, Metrics Outliers, 2D metrics comparison
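
To make a few of these metrics concrete, here is a small Python sketch of how brightness, contrast, image area, and aspect ratio could be computed, and how metric outliers could be flagged. The definitions used here (mean intensity, standard deviation of intensity, a 1.5 x IQR rule) are common conventions assumed for illustration, not necessarily how the platform computes its metrics. The function names are hypothetical.

```python
import statistics


def image_metrics(pixels):
    """Compute simple metrics for a grayscale image given as rows of 0-255 ints."""
    height = len(pixels)
    width = len(pixels[0])
    flat = [value for row in pixels for value in row]
    return {
        "brightness": statistics.fmean(flat) / 255,  # mean intensity, scaled to [0, 1]
        "contrast": statistics.pstdev(flat) / 255,   # std-dev of intensity, scaled
        "area": width * height,                      # pixel count
        "aspect_ratio": width / height,
    }


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


striped = image_metrics([[0, 255, 0, 255]])  # 4x1 image, alternating black/white
dark = image_metrics([[0, 0], [0, 0]])       # 2x2 all-black image

# Brightness per image across a small dataset; the last image is nearly black.
brightness_values = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50, 0.02]
outlier_indices = iqr_outliers(brightness_values)
print(striped, dark, outlier_indices)
```

Flagged indices are candidates for inspection, not automatic deletions: an unusually dark image may be a corrupt frame or a legitimate night-time edge case.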

6 tools out of the box
Embeddings to help you understand your data
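
As one example of what embeddings make possible, the sketch below finds near-duplicate samples by comparing embedding vectors with cosine similarity. It assumes the embeddings have already been produced by some model and are non-zero vectors. The frame names, the 0.98 threshold, and the function names are illustrative assumptions, not values from the presentation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold=0.98):
    """Return pairs of sample ids whose embeddings are at least `threshold` similar.

    `embeddings` maps sample id -> vector. The comparison is brute force,
    O(n^2) in the number of samples, which is fine for a sketch but not at scale.
    """
    ids = list(embeddings)
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if cosine_similarity(embeddings[ids[i]], embeddings[ids[j]]) >= threshold:
                pairs.append((ids[i], ids[j]))
    return pairs


embeddings = {
    "frame_001": [1.0, 0.0, 0.0],
    "frame_002": [0.99, 0.05, 0.0],  # almost the same direction as frame_001
    "frame_003": [0.0, 1.0, 0.0],    # unrelated content
}
pairs = near_duplicate_pairs(embeddings)
print(pairs)
```

At millions of frames per year, the pairwise loop would typically be replaced by an approximate nearest-neighbour index; the thresholding idea stays the same.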

Data Cleaning

Data Curation

Metadata Validation

Pre-trained

Fine-Tuned

Summary
●The data curation process is key to building high-performance, accurate, and domain-specific models.
●Without a solid pipeline in place, curation becomes exponentially more difficult and time-consuming for complex and specific applications.
●Building on foundations allows us to move with speed and efficiency.
●We have seen a shift in the scale and complexity of applications and unstructured data lakes.

Thank you!