[DSC DACH 25] Zrinka Puljiz - Importance of data at the time of LLMs.pdf


About This Presentation

In the era of Large Language Models, data has become more than just fuel — it defines the very boundaries of intelligence itself. In this distinguished keynote, Zrinka Puljiz, Tech Lead Manager at YouTube, explores how the quality, diversity, and ethics of data shape the power and limitations of t...


Slide Content

October 2025
Importance of data at the time of LLMs
Zrinka Puljiz
EM, YouTube Foundational Data

Today, you can use machine learning with zero data

Importance of Data

Data - Learn the Unknown
Machine learning is the term used to describe learning the unknown from data in an automated fashion.
ML is a statistical tool: for it to give us meaningful results, we need to feed it large amounts of data.


[AI-generated image]

Data to ML Pipeline
[AI-generated image]

Raw data collection
Early data collection focused on manual observations.
Platforms like Google Crowdsource expedite "human input" data collection.
LLMs can augment the raw data when data collection is limited.
[AI-generated image]

Data Processing
Machines understand numbers, so we need to convert the data to a numerical format.
Denoising removes errors and inconsistencies from the data.
Feature engineering: deciding what data matters and how we represent it (sparse vs. dense, feature crosses, etc.); see the sketch below.
Data processing can remove signal.

[AI-generated image]
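To make the feature-engineering bullet concrete, here is a minimal Python sketch (not from the deck) of turning categorical columns into a sparse numerical representation and adding a feature cross; the column names and toy data are invented for illustration, and scikit-learn is assumed.

```python
# A minimal feature-engineering sketch: categorical text -> sparse numeric features.
# Column names and values are made up; only the technique mirrors the slide.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

raw = pd.DataFrame({
    "country": ["DE", "AT", "CH", "DE"],
    "device":  ["mobile", "desktop", "mobile", "tablet"],
    "watch_minutes": [12.0, 3.5, 47.0, 8.2],   # already numeric
})

# Feature cross: combine two categoricals into one feature so the model
# can learn interactions (e.g. country x device).
raw["country_x_device"] = raw["country"] + "_" + raw["device"]

# Sparse representation: one-hot encode the categorical columns.
encoder = OneHotEncoder(handle_unknown="ignore")  # sparse output by default
sparse_features = encoder.fit_transform(raw[["country", "device", "country_x_device"]])

print(sparse_features.shape)   # (4, number_of_one_hot_columns)
```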

ML learning
Use the processed data to train the model.
The trained model is able to start predicting new outcomes.
Practical ML course offered by Google: https://developers.google.com/machine-learning/crash-course
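As a concrete, hedged illustration of "use the processed data to train the model, then predict new outcomes", here is a toy scikit-learn sketch; the dataset and model choice are placeholders rather than anything from the talk.

```python
# Train on processed data, then predict outcomes for unseen examples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # learn from the processed data
print(model.predict(X_test[:5]))       # predict new outcomes
print(model.score(X_test, y_test))     # rough accuracy check
```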

Machine learning runs on large amounts of data, so how can we use it with zero data?

LLMs: The Next Frontier in Machine Learning
Large Language Models (LLMs) are a significant leap in machine learning.

They are designed for broad language understanding and generation.

[AI-generated image]

Prompt Tuning: Directing LLM Intelligence
Prompt tuning is a method for adapting pre-trained LLMs to new tasks.
It involves crafting specific text prompts to guide the model's output.
This approach changes the model's behavior without altering its core parameters.

[AI-generated image]
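A minimal sketch of the idea on this slide: steer a pre-trained LLM toward a task with a crafted (here, few-shot) prompt, without touching its parameters. The task, the examples, and the call_llm helper are hypothetical placeholders; substitute whichever LLM API you actually use.

```python
# Build a few-shot prompt that adapts a generic LLM to a sentiment task.
FEW_SHOT_EXAMPLES = [
    ("The video froze every few seconds.", "negative"),
    ("Crystal-clear explanation, subscribed!", "positive"),
]

def build_prompt(new_comment: str) -> str:
    lines = ["Classify the sentiment of each comment as positive or negative.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Comment: {text}\nSentiment: {label}\n")
    lines.append(f"Comment: {new_comment}\nSentiment:")
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with your LLM provider's API call.
    raise NotImplementedError

if __name__ == "__main__":
    print(build_prompt("The audio kept cutting out."))
```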

Let’s see how LLMs change things
Biggest challenges:
Collecting and processing data at scale
Defining your ML model (deciding what kind of model you want to train, what the objective function is, and how complex your model is going to be)
Productizing and maintaining the new model

Collecting and processing data at scale

Supervised ML without LLMs:
Millions of examples to train models that are able to extract complex relationships in the data
Takes months to collect and process
Feature engineering

LLMs as a model:
Do not need any raw data, or only a smaller set of raw data examples

LLMs as data collection:
Use LLMs to create synthetic data for the most expensive use cases (see the sketch below)
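A minimal sketch of the "LLMs as data collection" column: ask an LLM to generate synthetic labeled examples for a task where human labels are expensive. The task, labels, prompt wording, and the call_llm helper are all hypothetical, and the generated output should be validated before use (see the hurdles slide later in the deck).

```python
# Generate synthetic labeled examples with an LLM (illustrative only).
import json

LABELS = ["spam", "not_spam"]

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with your LLM provider's API call.
    raise NotImplementedError

def synthetic_examples(n_per_label: int = 50) -> list[dict]:
    examples = []
    for label in LABELS:
        prompt = (
            f"Write {n_per_label} short, realistic video comments that a human "
            f"reviewer would label as '{label}'. Return them as a JSON list of strings."
        )
        comments = json.loads(call_llm(prompt))   # validate before trusting the output
        examples.extend({"text": c, "label": label} for c in comments)
    return examples
```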

Defining your objective function and ML architecture

Supervised ML without LLMs:
Need to know what problem you are trying to solve (classification, ranking, regression, etc.), and define the number of internal parameters and the size of the data set that you need as a result.

With LLMs:
LLMs are generic pre-trained models that can be used out of the box
Use prompt tuning to temporarily make them task specific
Use fine tuning to make them permanently task specific (see the sketch below)
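A minimal sketch of what "make them permanently task specific" implies in practice: converting your labeled examples into prompt/response records for fine tuning. The JSONL field names here are an assumption; the exact schema depends on your fine-tuning provider.

```python
# Turn labeled examples into a fine-tuning dataset (illustrative schema).
import json

labeled_examples = [
    {"text": "The video froze every few seconds.", "label": "negative"},
    {"text": "Crystal-clear explanation, subscribed!", "label": "positive"},
]

with open("finetune_train.jsonl", "w", encoding="utf-8") as f:
    for ex in labeled_examples:
        record = {
            "prompt": f"Classify the sentiment of this comment: {ex['text']}",
            "response": ex["label"],
        }
        f.write(json.dumps(record) + "\n")
```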

Productization/maintenance of the model

Supervised ML without LLMs:
We can verify the performance of the model offline (generate AUC curves, etc.) to tell us how the model is doing; see the sketch below.
Training is computationally expensive, inference is not.*

With LLMs:
LLM performance on a specific subtask can be tested against a benchmark.
LLMs are very resource heavy even at the time of inference.

* For large models and a high amount of traffic, inference can also be expensive, but compared with LLM inference cost it is still much less resource heavy.
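A minimal sketch of the offline verification mentioned in the left column, computing AUC for a toy model with scikit-learn; the data and model are placeholders, not the deck's.

```python
# Offline evaluation: score a held-out set and report AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]      # predicted probability of the positive class
print("offline AUC:", roc_auc_score(y_test, scores))
```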

So, how should you approach your problem today as compared to 3+ years ago?

Try asking your friendly LLM for an answer directly

Prompt tuning with examples (~10 to ~100)

Fine tune your LLM

Alter the generalist model to make it a specialist with your examples (~1000s)

If the inference cost is too high, you might need to go back to traditional ML models

Hurdles of using LLMs
Bias Amplification: LLMs can reflect and amplify biases present in their training data, leading to biased generated datasets.
Hallucinations and Inaccuracies: LLMs may generate factually incorrect or nonsensical information, requiring careful validation.
Computational Cost: LLMs are computationally intensive and expensive.
Overfitting to Synthetic Data: Relying solely on synthetic data can sometimes lead to models that perform poorly on real-world data (see the sketch below).
[AI-generated image]
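A minimal sketch of one guard against the "overfitting to synthetic data" hurdle: train on synthetic examples, but always measure quality on a real, human-labeled hold-out set. The function and variable names are placeholders for your own data, not anything from the deck.

```python
# Train on synthetic data, evaluate on real human-labeled data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_on_real_data(synthetic_X, synthetic_y, real_holdout_X, real_holdout_y):
    model = LogisticRegression(max_iter=1000).fit(synthetic_X, synthetic_y)
    synthetic_acc = accuracy_score(synthetic_y, model.predict(synthetic_X))
    real_acc = accuracy_score(real_holdout_y, model.predict(real_holdout_X))
    # A large gap between synthetic-data and real-data accuracy is the warning sign.
    return synthetic_acc, real_acc
```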

LLMs changed the objective

Gemini CLI

Project Mariner

So how much data do you need today to do machine learning?

Thank you!