Garbage In, Garbage Out: Why poor data curation is killing your AI models (and how to fix it)
chloewilliams62
Jul 18, 2024
About This Presentation
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Slides: 43 pages
Slide Content
16th July 2024
Agenda
I. Introduction
II. Why quality data matters
III. Data cleaning and curation trends
IV. The steps to producing quality data
Webinar
*Encord operates on 'Salesforce quarters'—all reported figures align with our fiscal year
Manage and select the right
data for your AI application
USERS
Data Engineer, ML Engineer
Orchestrate human-in-the-loop
workflows for ground truth,
ranking (RLHF), and validation
USERS
Domain Expert, AI Trainer, Labeler
Find edge cases & debug model
errors, discover high-impact
data to boost performance
USERS
AI Researcher, ML Engineer
CONTINUOUS DATA SELECTION FOR TRAINING, FINE-TUNING & VALIDATION
IDENTIFY & FIX DATA & MODEL ERRORS
DATA SELECTED FOR HUMAN REVIEW VALIDATE QUALITY OF DATA
We are a data development platform that
helps AI teams automate data in a single workflow
The Future of AI is in the present
The Future of AI is in the present… Sort of
encord.com
Traditional AI vs. Modern AI
Traditional AI: Build from scratch, Uni-modal, Model-centric, Technical users
Modern AI: Build on foundations, Multi-modal, Data-centric, General users
Pillars of AI
Data: Foundational information for models, guiding AI
learning and determining task specifications and
accuracy.
Models: The "brains" of AI, using algorithms to process
information and make decisions.
Compute: The hardware and resources required to
train and run models, handling complex calculations
and data processing.
While models and compute
are evolving rapidly
Data is still in its infancy
and AI teams’ main
competitive advantage
Models in one config file
Compute in one config file
Foundation models
● YOLOv8-v10
● RCNNs
● LLaVA
● Segment Anything
● GPT-4o
● Claude
● Etc.
From Big data to Small data.
Rest assured, you don’t have to
start from scratch.
Presenting: foundation models.
Why Data Quality
Matters
Bad Data = Bad Model
Great Data = Great Model
… Not (necessarily) Big Data
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Small data, but HIGH quality data…
https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/
“The use of natural language in training robots is still in its infancy, particularly in autonomous driving.”
“Incorporating language along with vision and action may have an enormous impact…”
Domain Specific Fine-tuning - Medical
There are three critical issues
identified that require your attention
2048x A100s 8x A100s
Domain Specific Application - Medical
Definitions
Data cleaning
Refers to the process of identifying and correcting errors or inconsistencies in the data.
This can include:
● Removing noisy and corrupt data
● Handling missing or duplicate data
● Standardizing and preprocessing the data to ensure quality
Data curation
Refers to the organization and management of the data to make it suitable for labeling or model training.
This can include:
● Selecting data for labeling
● Pre-labeling and enriching data
● Ensuring that the data is error-free and high quality
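As a rough illustration of the cleaning steps named above (handling missing and duplicate data), here is a minimal Python sketch. The record layout (`id`, `image_bytes`, `label`) and the function names are assumptions made up for this example and do not come from the talk or from any Encord API. It only catches byte-identical duplicates; near-duplicates need a similarity measure instead.

```python
import hashlib


def is_missing(value):
    """Treat None and empty bytes/strings as missing (assumed convention)."""
    return value is None or (isinstance(value, (bytes, str)) and len(value) == 0)


def clean_records(records, required_fields=("image_bytes", "label")):
    """Drop records with missing required fields, then drop exact duplicates."""
    seen_hashes = set()
    cleaned = []
    for record in records:
        # Handle missing data: skip records lacking any required field.
        if any(is_missing(record.get(field)) for field in required_fields):
            continue
        # Handle duplicates: hash the raw bytes and keep the first occurrence.
        digest = hashlib.sha256(record["image_bytes"]).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        cleaned.append(record)
    return cleaned


# Hypothetical toy records, purely for illustration.
records = [
    {"id": 1, "image_bytes": b"\x00\x01", "label": "car"},
    {"id": 2, "image_bytes": b"\x00\x01", "label": "car"},    # exact duplicate of id 1
    {"id": 3, "image_bytes": b"", "label": "truck"},           # empty payload
    {"id": 4, "image_bytes": b"\x02\x03", "label": None},      # missing label
    {"id": 5, "image_bytes": b"\x04\x05", "label": "bus"},
]
kept_ids = [r["id"] for r in clean_records(records)]
print(kept_ids)
```

In this toy run only records 1 and 5 survive: record 2 repeats record 1's bytes, and records 3 and 4 are missing a required field.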
Data quality &
curation trends
What we used to see in data pipelines
Pipeline components: Data Collection, Data Verification, Serving Infrastructure, Annotation tool, Model Training tool, Model Evaluation tool, Data Warehouse, External Tool
Stakeholders involved: Data science, Annotation
Data volume: 10-100 thousand frames per year
What we now see in data pipelines
Pipeline components: Data Collection, Data Verification, Label Quality tool, Data Curation tool, Serving Infrastructure, Versioning tool, Monitoring tool, Annotation tool, Model Training tool, Model Evaluation tool, Data Warehouse, Data Visualisation tool, Workflow Orchestration tool, External Tool, Data Cleaning tool
Stakeholders involved: Data science, DataOps, MLOps, Annotation, Leadership
Data volume: 10 million frames per year or more
Today’s focus (shown on the same pipeline diagram as above)
Cleaning & Curation is often cumbersome
What we typically see:
● Manual inspection
● One-off Jupyter scripts
● Immature open-source (OS) tools
● Manual video scrolling
● One-off Python scripts
There are downsides without cleaning and curation
Increased Annotation / MLOps Cost: Operational expenses increase due to unnecessary labeling and MLOps.
Poor Model Quality: Poor data quality and less time spent on improving ML models degrade model outcomes.
Delayed Time-To-Market: Data quality issues and overhead slow down product delivery.
Case study: How Automotus increased mAP by 20% by reducing their dataset size by 35% with visual data curation
Achieve better model performance with less data
Chart (Model Accuracy): Model with noisy unclean data; Model with clean data; Model with clean & curated data
Customer improvement:
↘35% dataset reduction
↗20% mAP increase
↘33% labeling cost decrease
So… how can we look
at changing this?
Learnings: common data quality errors
Cleaning errors
● Duplicate samples
● Corrupt samples
● Noisy samples
● Data outliers
Curation errors
● Random data selection
● Limited task prioritization
● No pre-labeling of data
● No label validation
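One common alternative to random data selection is to prioritize the samples a model is least sure about. The sketch below ranks samples by the entropy of their predicted class probabilities. This is a generic uncertainty-sampling illustration, not the method presented in the talk; the sample ids, probabilities, and function names are hypothetical.

```python
import math


def prediction_entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def prioritize_for_labeling(predictions, budget):
    """Return the `budget` sample ids whose predictions are most uncertain."""
    ranked = sorted(
        predictions,
        key=lambda item: prediction_entropy(item[1]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return [sample_id for sample_id, _ in ranked[:budget]]


# Hypothetical model outputs: (sample id, class probabilities).
predictions = [
    ("frame_001", [0.98, 0.01, 0.01]),  # confident prediction, low priority
    ("frame_002", [0.40, 0.35, 0.25]),
    ("frame_003", [0.70, 0.20, 0.10]),
    ("frame_004", [0.34, 0.33, 0.33]),  # nearly uniform, highest priority
]
to_label = prioritize_for_labeling(predictions, budget=2)
print(to_label)
```

With a labeling budget of two, the near-uniform predictions (`frame_004`, then `frame_002`) are sent for labeling first, while the confident `frame_001` is left for later.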
Quality Metrics to help you understand your data
40+ Metrics out of the box, including: Duplicates, Border closeness, Occlusion, Object density, Object area, Polygon shape, Brightness, Contrast, Image area, RGB values, Aspect ratio, Blur
6 tools out of the box, including: Model performance by metric, Metric distribution, Metrics outliers, 2D metrics comparison
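To make a few of these metrics concrete, here is a minimal sketch that computes brightness, contrast, and aspect ratio for a grayscale image, then flags outliers in a metric with the interquartile-range rule. The definitions used (mean intensity for brightness, standard deviation for contrast, a 1.5 × IQR cut-off) are common conventions assumed for illustration; they are not necessarily how the platform in the talk defines its metrics.

```python
import statistics


def image_metrics(pixels):
    """Metrics for a grayscale image given as rows of 0-255 integers."""
    flat = [value for row in pixels for value in row]
    height, width = len(pixels), len(pixels[0])
    return {
        "brightness": statistics.fmean(flat) / 255,   # mean intensity scaled to [0, 1]
        "contrast": statistics.pstdev(flat) / 255,    # intensity spread, same scale
        "aspect_ratio": width / height,
    }


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]


# A tiny 2x2 checkerboard: half black, half white.
checkerboard = [[0, 255], [255, 0]]
metrics = image_metrics(checkerboard)

# Hypothetical per-image brightness values; the last image is almost black.
brightness_values = [0.48, 0.52, 0.50, 0.47, 0.51, 0.49, 0.53, 0.02]
flagged = iqr_outliers(brightness_values)
print(metrics, flagged)
```

Flagged indices are candidates for review rather than automatic deletion: an unusually dark frame may be a corrupt capture or a legitimate night-time edge case.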
Embeddings to help you understand your data
Data Cleaning
Data Curation
Metadata Validation
Pre-trained
Fine-Tuned
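One way embeddings can support data cleaning is by surfacing near-duplicate samples: items whose embedding vectors point in almost the same direction. The sketch below is a simplified illustration under assumed inputs. The three-dimensional vectors, the ids, and the 0.99 threshold are hypothetical, and real embeddings would come from a pre-trained or fine-tuned model.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold=0.99):
    """Return id pairs whose embeddings have cosine similarity >= threshold."""
    pairs = []
    # Brute-force comparison of every pair: fine for a sketch, O(n^2) at scale.
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical embeddings: img_a and img_b are almost identical, img_c differs.
embeddings = {
    "img_a": [0.90, 0.10, 0.00],
    "img_b": [0.89, 0.11, 0.01],
    "img_c": [0.00, 0.20, 0.95],
}
duplicates = near_duplicate_pairs(embeddings)
print(duplicates)
```

The pairwise loop is only practical for small sets; at the data volumes mentioned earlier in the deck, an approximate nearest-neighbour index would be the usual replacement.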
Summary
● The data curation process is key to building high-performance, accurate, and domain-specific models.
● Without a solid pipeline in place, curation becomes exponentially more difficult and time-consuming for complex and specialized applications.
● Building on foundations allows us to move with speed and efficiency.
● We have seen a shift in the scale and complexity of applications and unstructured data lakes.