[DSC DACH 25] Zrinka Puljiz - Importance of data at the time of LLMs.pdf
About This Presentation
In the era of Large Language Models, data has become more than just fuel — it defines the very boundaries of intelligence itself. In this distinguished keynote, Zrinka Puljiz, Tech Lead Manager at YouTube, explores how the quality, diversity, and ethics of data shape the power and limitations of today’s AI systems. She will dive into how data collection, labeling, and governance directly influence LLM behavior, creativity, and bias, and why organizations must rethink their data strategies to stay competitive and responsible. Attendees will gain a deeper understanding of how to build data ecosystems that not only drive innovation but also ensure trust, fairness, and long-term sustainability in the age of generative AI.
Slide Content
October 2025
Importance of data at the time of LLMs
Zrinka Puljiz
EM, YouTube Foundational Data
Today, you can use machine learning with zero data
Importance of Data
Data - Learn the Unknown
Machine learning is the term used to describe learning the unknown from data in an automated fashion.
ML is a statistical tool: for it to give us meaningful results, we need to feed it large amounts of data.
Data to ML Pipeline
[AI-generated image]
Raw data collection
Early data collection focused on manual observations.
Platforms like Google Crowdsource expedite "human input" data collection.
LLMs can augment the raw data when data collection is limited.
Data Processing
Machines understand numbers, so we need to convert the data to a numerical format.
Denoising removes errors and inconsistencies from the data.
Feature engineering means deciding what data matters and how we represent it (sparse vs. dense, feature crosses, etc.); see the sketch below.
Data processing can remove signal.
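To make the feature-engineering step concrete, here is a minimal Python sketch (illustrative only; the records, column names, and values are invented, and scikit-learn is assumed) showing a dense numeric feature, sparse one-hot encoding, and a simple feature cross:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Invented example records; in practice these come from raw data collection.
    records = [
        {"country": "DE", "device": "mobile", "watch_hours": 1.5},
        {"country": "US", "device": "desktop", "watch_hours": 3.0},
        {"country": "DE", "device": "desktop", "watch_hours": 0.5},
    ]

    # Dense feature: already numeric, used as-is.
    watch_hours = np.array([[r["watch_hours"]] for r in records])

    # Sparse features: one-hot encode the categorical columns.
    categorical = [[r["country"], r["device"]] for r in records]
    sparse_features = OneHotEncoder().fit_transform(categorical)

    # Feature cross: combine country x device into one feature so that a linear
    # model can learn interactions it could not represent from the raw columns.
    crossed = [[r["country"] + "_x_" + r["device"]] for r in records]
    cross_features = OneHotEncoder().fit_transform(crossed)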
ML learning
Use the processed data to train the model.
The trained model is able to start predicting new outcomes; see the sketch below.
Practical ML course offered by Google: https://developers.google.com/machine-learning/crash-course
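As a minimal illustration of this step, the following scikit-learn sketch trains a model on a synthetic stand-in for the processed data and then predicts on unseen examples; names and parameters are illustrative, not from the talk:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the processed, numeric dataset.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train the model on the processed data.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # The trained model can start predicting new outcomes.
    predictions = model.predict(X_test)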
Machine learning runs on large amounts of data, so how can we use it with zero data?
LLMs: The Next Frontier in Machine Learning
Large Language Models (LLMs) are a significant leap in machine learning. They are designed for broad language understanding and generation.
Prompt Tuning: Directing LLM Intelligence
Prompt tuning is a method for adapting pre-trained LLMs to new tasks. It involves crafting specific text prompts to guide the model's output. This approach changes the model's behavior without altering its core parameters.
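A minimal sketch of the idea in Python; call_llm() is a hypothetical stand-in for whichever LLM API you use, and the point is that only the input text changes, never the model's parameters:

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: wire this to your LLM provider's API.
        raise NotImplementedError

    def classify_sentiment(review: str) -> str:
        # The task is specified entirely in the prompt; no weights are updated.
        prompt = (
            "Classify the sentiment of the following review as positive, "
            "negative, or neutral. Answer with one word.\n\n"
            f"Review: {review}\nSentiment:"
        )
        return call_llm(prompt)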
Let’s see how LLMs change things
Biggest challenges:
Collecting and processing data at scale
Defining your ML model (what kind of model you want to train, what the objective function is, and how complex the model is going to be)
Productizing and maintaining the new model
Collecting and processing data at scale
Supervised ML without LLMs: millions of examples to train models that are able to extract complex relationships in the data; takes months to collect and process; requires feature engineering.
LLMs as a model: need no raw data at all, or only a smaller set of raw examples.
LLMs as data collection: use LLMs to create synthetic data for the most expensive use cases; see the sketch below.
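As a sketch of the "LLMs as data collection" idea, reusing the hypothetical call_llm() from the prompt-tuning sketch above (the prompt and label are invented):

    def synthesize_examples(label: str, n: int) -> list[str]:
        # Ask the LLM to generate synthetic training examples for a label that
        # would be expensive to collect and rate by hand.
        prompt = (
            f"Write {n} short, realistic user comments that a human rater "
            f"would label as '{label}'. Output one comment per line."
        )
        return call_llm(prompt).splitlines()

    # For example, synthetic examples of a rare, costly-to-collect class:
    synthetic_spam = synthesize_examples("spam", n=50)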
Defining your objective function and ML architecture
Supervised ML without LLMs: you need to know what problem you are trying to solve (classification, ranking, regression, etc.), then define the number of internal parameters and the size of the data set you need as a result.
With LLMs: LLMs are generic pre-trained models that can be used out of the box. Use prompt tuning to temporarily make them task-specific, or fine-tuning to make them permanently task-specific.
Productization/maintenance of the model
Supervised ML without LLMs: we can verify the performance of the model offline, generating AUC curves, etc., to tell us how the model is doing (see the sketch below). Training is computationally expensive; inference is not.*
With LLMs: LLM performance on a specific subtask can be tested against a benchmark. LLMs are very resource-heavy even at inference time.
* For large models and high traffic volumes, inference can also be expensive, but compared with LLM inference it is still much less resource-heavy.
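Continuing the earlier training sketch, offline verification of a traditional model can be as simple as scoring a held-out set and computing AUC (model, X_test, and y_test come from that sketch; the metric choice is illustrative):

    from sklearn.metrics import roc_auc_score

    # Score the held-out set and check model quality offline before shipping.
    held_out_scores = model.predict_proba(X_test)[:, 1]
    print("offline AUC:", roc_auc_score(y_test, held_out_scores))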
So, how should you approach your problem today as compared to 3+ years ago?
Try asking your friendly LLM for an answer directly.
Prompt tuning with examples (~10 to ~100)
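A sketch of what this looks like in practice, again using the hypothetical call_llm(); the labeled pairs are invented, and a real setting would use roughly 10 to 100 of them:

    # A handful of labeled examples prepended to the prompt so the model
    # imitates the pattern (few-shot prompting).
    examples = [
        ("Great video, learned a lot!", "positive"),
        ("The audio was unwatchable.", "negative"),
        # ... up to ~100 such pairs
    ]

    def few_shot_classify(text: str) -> str:
        shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
        return call_llm(f"{shots}\nText: {text}\nLabel:")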
Fine-tune your LLM
Alter the generalist model, making it a specialist with your examples (~1000s).
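A sketch of the shape of a fine-tuning job; finetune() is a hypothetical stand-in for your provider's or framework's fine-tuning entry point, and the data is invented:

    # Thousands of labeled examples, versus the handful used for prompting.
    train_set = [
        {"input": "Great tutorial, subscribed!", "label": "positive"},
        {"input": "This explained nothing.", "label": "negative"},
        # ... thousands of such examples
    ]

    def finetune(base_model: str, dataset: list[dict]) -> str:
        # Hypothetical: returns an identifier for the new, specialized model
        # whose weights have been permanently updated on the dataset.
        raise NotImplementedError

    specialist_model = finetune("your-base-llm", train_set)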
If the inference cost is too high, you might need to go back to traditional ML models.
Hurdles of using LLMs
Bias Amplification: LLMs can reflect and amplify biases present in their training data, leading to biased generated datasets.
Hallucinations and Inaccuracies: LLMs may generate factually incorrect or nonsensical information, requiring careful validation.
Computational Cost: LLMs are computationally intensive and expensive.
Overfitting to Synthetic Data: Relying solely on synthetic data can sometimes lead to models that perform poorly on real-world data.
LLMs changed the objective
Gemini CLI
Project Mariner
So how much data do you need today to do machine learning?