[DSC DACH 25] Zrinka Puljiz - Importance of data at the time of LLMs.pdf
About This Presentation
In the era of Large Language Models, data has become more than just fuel — it defines the very boundaries of intelligence itself. In this distinguished keynote, Zrinka Puljiz, Tech Lead Manager at YouTube, explores how the quality, diversity, and ethics of data shape the power and limitations of today’s AI systems. She will dive into how data collection, labeling, and governance directly influence LLM behavior, creativity, and bias, and why organizations must rethink their data strategies to stay competitive and responsible. Attendees will gain a deeper understanding of how to build data ecosystems that not only drive innovation but also ensure trust, fairness, and long-term sustainability in the age of generative AI.
Slide Content
October 2025
Importance of data at the time of LLMs
Zrinka Puljiz
EM, YouTube Foundational Data
Today, you can use machine learning with zero data
Importance of Data
Data - Learn the Unknown
Machine learning is the term used to describe learning the unknown from data in an automated fashion.
ML is a statistical tool: for it to give us meaningful results, we need to feed it large amounts of data.
Data to ML Pipeline
[AI-generated image]
Raw data collection
Early data collection focused on manual observations.
Platforms like Google Crowdsource expedite "human input" data collection.
LLMs can augment the raw data when data collection is limited.
Data Processing
Machines understand numbers, so we need to convert the data to a numerical format.
Denoising removes errors and inconsistencies from the data.
Feature engineering means deciding what data matters and how we represent it (sparse vs. dense, feature crosses, etc.); see the sketch below.
Data processing can remove signal.
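To make the feature-engineering step concrete, here is a minimal Python sketch (illustrative only; the records, column names, and values are invented, and scikit-learn is assumed) showing a dense numeric feature, sparse one-hot encoding, and a simple feature cross:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Invented example records; in practice these come from raw data collection.
    records = [
        {"country": "DE", "device": "mobile", "watch_hours": 1.5},
        {"country": "US", "device": "desktop", "watch_hours": 3.0},
        {"country": "DE", "device": "desktop", "watch_hours": 0.5},
    ]

    # Dense feature: already numeric, used as-is.
    watch_hours = np.array([[r["watch_hours"]] for r in records])

    # Sparse features: one-hot encode the categorical columns.
    categorical = [[r["country"], r["device"]] for r in records]
    sparse_features = OneHotEncoder().fit_transform(categorical)

    # Feature cross: combine country x device into one feature so that a linear
    # model can learn interactions it could not represent from the raw columns.
    crossed = [[r["country"] + "_x_" + r["device"]] for r in records]
    cross_features = OneHotEncoder().fit_transform(crossed)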
ML learning
Use the processed data to train the model.
The trained model is able to start predicting new outcomes; see the sketch below.
Practical ML course offered by Google: https://developers.google.com/machine-learning/crash-course
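As a minimal illustration of this step, the following scikit-learn sketch trains a model on a synthetic stand-in for the processed data and then predicts on unseen examples; names and parameters are illustrative, not from the talk:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the processed, numeric dataset.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train the model on the processed data.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # The trained model can start predicting new outcomes.
    predictions = model.predict(X_test)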
Machine learning runs on large amounts of data, so how can we use it with zero data?
LLMs: The Next Frontier in Machine Learning
Large Language Models (LLMs) are a significant leap in machine learning. They are designed for broad language understanding and generation.
Prompt Tuning: Directing LLM Intelligence
Prompt tuning is a method for adapting pre-trained LLMs to new tasks. It involves crafting specific text prompts to guide the model's output. This approach changes the model's behavior without altering its core parameters.
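A minimal sketch of the idea in Python; call_llm() is a hypothetical stand-in for whichever LLM API you use, and the point is that only the input text changes, never the model's parameters:

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: wire this to your LLM provider's API.
        raise NotImplementedError

    def classify_sentiment(review: str) -> str:
        # The task is specified entirely in the prompt; no weights are updated.
        prompt = (
            "Classify the sentiment of the following review as positive, "
            "negative, or neutral. Answer with one word.\n\n"
            f"Review: {review}\nSentiment:"
        )
        return call_llm(prompt)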
Let’s see how LLMs change things
Biggest challenges:
Collecting and processing data at scale
Defining your ML model (what kind of model you want to train, what the objective function is, and how complex the model is going to be)
Productizing and maintaining the new model
Collecting and processing data at scale
Supervised ML without LLMs: millions of examples to train models that are able to extract complex relationships in the data; takes months to collect and process; requires feature engineering.
LLMs as a model: need no raw data at all, or only a smaller set of raw examples.
LLMs as data collection: use LLMs to create synthetic data for the most expensive use cases; see the sketch below.
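As a sketch of the "LLMs as data collection" idea, reusing the hypothetical call_llm() from the prompt-tuning sketch above (the prompt and label are invented):

    def synthesize_examples(label: str, n: int) -> list[str]:
        # Ask the LLM to generate synthetic training examples for a label that
        # would be expensive to collect and rate by hand.
        prompt = (
            f"Write {n} short, realistic user comments that a human rater "
            f"would label as '{label}'. Output one comment per line."
        )
        return call_llm(prompt).splitlines()

    # For example, synthetic examples of a rare, costly-to-collect class:
    synthetic_spam = synthesize_examples("spam", n=50)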
Defining your objective function and ML architecture
Supervised ML without LLMs: you need to know what problem you are trying to solve (classification, ranking, regression, etc.), then define the number of internal parameters and the size of the data set you need as a result.
With LLMs: LLMs are generic pre-trained models that can be used out of the box. Use prompt tuning to temporarily make them task-specific, or fine-tuning to make them permanently task-specific.
Productization/maintenance of the model
Supervised ML without LLMs: we can verify the performance of the model offline, generating AUC curves, etc., to tell us how the model is doing (see the sketch below). Training is computationally expensive; inference is not.*
With LLMs: LLM performance on a specific subtask can be tested against a benchmark. LLMs are very resource-heavy even at inference time.
* For large models and high traffic volumes, inference can also be expensive, but compared with LLM inference it is still much less resource-heavy.
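Continuing the earlier training sketch, offline verification of a traditional model can be as simple as scoring a held-out set and computing AUC (model, X_test, and y_test come from that sketch; the metric choice is illustrative):

    from sklearn.metrics import roc_auc_score

    # Score the held-out set and check model quality offline before shipping.
    held_out_scores = model.predict_proba(X_test)[:, 1]
    print("offline AUC:", roc_auc_score(y_test, held_out_scores))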
So, how should you approach your problem today as compared to 3+ years ago?
Try asking your friendly LLM for an answer directly.
Prompt tuning with examples (~10 to ~100)
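A sketch of what this looks like in practice, again using the hypothetical call_llm(); the labeled pairs are invented, and a real setting would use roughly 10 to 100 of them:

    # A handful of labeled examples prepended to the prompt so the model
    # imitates the pattern (few-shot prompting).
    examples = [
        ("Great video, learned a lot!", "positive"),
        ("The audio was unwatchable.", "negative"),
        # ... up to ~100 such pairs
    ]

    def few_shot_classify(text: str) -> str:
        shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
        return call_llm(f"{shots}\nText: {text}\nLabel:")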
Fine-tune your LLM
Alter the generalist model, making it a specialist with your examples (~1000s).
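A sketch of the shape of a fine-tuning job; finetune() is a hypothetical stand-in for your provider's or framework's fine-tuning entry point, and the data is invented:

    # Thousands of labeled examples, versus the handful used for prompting.
    train_set = [
        {"input": "Great tutorial, subscribed!", "label": "positive"},
        {"input": "This explained nothing.", "label": "negative"},
        # ... thousands of such examples
    ]

    def finetune(base_model: str, dataset: list[dict]) -> str:
        # Hypothetical: returns an identifier for the new, specialized model
        # whose weights have been permanently updated on the dataset.
        raise NotImplementedError

    specialist_model = finetune("your-base-llm", train_set)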
If the inference cost is too high, you might need to go back to traditional ML models.
Hurdles of using LLMs
Bias Amplification: LLMs can reflect and amplify biases present in their training data, leading to biased generated datasets.
Hallucinations and Inaccuracies: LLMs may generate factually incorrect or nonsensical information, requiring careful validation.
Computational Cost: LLMs are computationally intensive and expensive.
Overfitting to Synthetic Data: Relying solely on synthetic data can sometimes lead to models that perform poorly on real-world data.
LLMs changed the objective
Gemini CLI
Project Mariner
So how much data do you need today to do machine learning?