Data_Prep_Techniques_Challenges_Methods.pdf

ShailjaThakur2 · 35 slides · Sep 03, 2024

About This Presentation

Building successful applications with Large Language Models (LLMs) hinges on one critical factor: the quality of the data used in their development. LLMs have proven capabilities across a wide range of applications, but their effectiveness is deeply intertwined with the quality and relevance of the da...


Slide Content

Building Successful LLM Applications: The Power of High-Quality Data
Hima Patel
Senior Technical Staff Member & Research Manager
IBM Research, India
Shailja Thakur
Research Scientist
IBM Research, India
IBM Research

Agenda for talk series
Session 1
• Introduction and motivation for data preparation for LLM applications
• Discuss state-of-the-art methods for code data preparation
Session 2
• Hands-on session 1: run end-to-end data prep pipelines with the open source toolkit data-prep-kit
Session 3
• Hands-on session 2: build your own data preparation module using data-prep-kit

Let us start with a story..

Example scenario
Granite models on Hugging Face
Goal: I want to make my CodeLLM awesome at VHDL
Question: How does my base model perform out of the box (OOB) for VHDL?
If performance is not satisfactory, then the model needs to be fine-tuned with VHDL data.
Solution Steps:
1. Acquire the right data
2. Prepare the data in the right format
3. Data preparation and quality analysis (various steps)
4. Tokenize the data
5. Fine-tune a model
6. Validate the model
7. Iterate (add more data, clean data, …)
[Pipeline diagram: Data acquisition → Data preparation and quality analysis → Tokenization → Model tuning → Validation → Meets my needs?]
Iterative nature of data prep makes it cumbersome, time consuming and tiring

Data preparation is both important and challenging
79% identify data preparation and generation* as the most common strategic task performed by AI teams.
30% view data volume and complexity* as one of the most challenging aspects of AI implementation.

Gartner, Explore Data-Centric AI Solutions to Streamline AI Development, 2023
Quality of data affects the quality of the model

Putting things in context of the LLM development lifecycle
Exhaustive data quality evaluation at the start of the lifecycle makes it easier to produce high-quality data

Data preparation challenges
• Data challenges are not known upfront; discovery is cumbersome and time consuming
• Every use case has its own unique needs, and manual verification is not possible if the size of the data is large
• Data prep steps can vary across modalities and use cases, making it a long-tail problem
• No easy availability of tools to solve data challenges at various scales, for various use cases and different data modalities
• Low-code/no-code automation is required for production scenarios

Gartner, Explore Data-Centric AI Solutions to Streamline AI Development, 2023
Image credits: base image generated via LLM (DALL·E, 2024) using ChatGPT

Introducing data-prep-kit
• Open source toolkit (Apache 2.0 license) that contains data prep recipes for code and language modalities.
• Contains data prep recipes well tested during data preparation for building the Granite Code Models, aimed at fine-tuning, RAG, and instruct-tuning use cases.
• Flexible computing that works from laptop to cluster scale.
• Growing set of modules, and the capability to bring your own module very easily.
• Run as an open source project to push research and development in this area, encouraging ideas from everywhere.
Open source community effort to accelerate data preparation for LLMs
https://github.com/IBM/data-prep-kit

Data preparation modules ready to use

State-of-the-Art Data Preparation Techniques

Let's Talk Datasets
Large Language Models are characterized by a large number of parameters.
This leaves them vulnerable to memorization.
A key factor in promoting generalization has been the introduction of large datasets.
=> Manual review and curation is expensive, and larger datasets suffer in quality.
This is a consequence of a lack of due diligence.
Reference: Carlini et al. Extracting Training Data from Large Language Models. USENIX Security 2021.
Discoverable memorization scales with model size, context size, and data repetition!
[Figure: discoverable memorization vs. context size and data repetition]

Duplication
Language modeling training datasets contain many duplicated sequences (Lee et al. 2021).
This promotes memorization because:
• Repeated samples are effectively over-weighted during training
• Validation scores are highest when duplicated data is memorized
• Language models can generate long passages that are repeated in the training data (3% of C4 -> 10M documents, McCoy et al. 2021)

Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
This encourages memorization and discourages generalization

Deduplicating Training Data Makes Language Models Better
>1% of tokens emitted unprompted by a model trained on non-deduplicated data are part of a memorized sequence; deduplication reduces this to ≈0.1%.
A single 61-word sequence was found to repeat 61,036 times in training and 61 times in validation in the C4 dataset.
Training models on deduplicated datasets improves training efficiency: deduplicated datasets are up to 19% smaller.
Deduplication does not hurt perplexity; in some cases it reduces perplexity by up to 10%.
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Takeaway: Data preparation is an essential step for building LLM applications

Exact Substring Match
Snippet 1
def add(a, b):
    return a + b
def multiply(a, b):
    return a * b

Snippet 2 (Duplicate)
def multiply(a, b):
    return a * b

• Consider a dataset D = {x1, x2, …, xN}, where each xi is a sequence of tokens.
• Suppose we have two code samples xi and xj. Each sample can be thought of as a series of tokens.
• If a segment of tokens da in xi exactly matches a segment db in xj, then we have an exact substring match.
• Substring length |da|, |db| >= 50 tokens is a hyperparameter.
• But ExactSubstr runs in quadratic time, O(N·b).
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Exact Substring Match – Suffix Array
S = "add(a, b): return a + b def multiply(a, b): return a * b"
• To improve efficiency, use a Suffix Array: concatenate the samples {x1, x2, …, xN} into a single sequence S and construct a suffix array A of S.
• A suffix array is the lexicographically sorted list of the suffixes of a sequence.
• A(S) can be constructed in linear time, O(|S|).
• Duplicates are then found via binary search over the list of indices, O(b log N), where b is the length of the pattern (the duplicate) and N is the length of the original code.
• A(S) can be used to find duplicate substrings (a minimal sketch follows below).
• Identifying suffixes therefore involves parallelizable tasks of searching through A.
Suffixes:
0: add(a, b): return a + b def multiply(a, b): return a * b
1: dd(a, b): return a + b def multiply(a, b): return a * b
2: d(a, b): return a + b def multiply(a, b): return a * b
3: (a, b): return a + b def multiply(a, b): return a * b
...
59: a*b
60: a*b
61: *b
62: b
Suffix Array A(S):
A(S) = [16, 37, 59, 60, 40, 61, 62, 28, 49, 51, ...] (suffix start indices, sorted lexicographically)
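A minimal sketch of this idea in Python (illustrative only: a simple comparison-based suffix sort, with the 50-token threshold replaced by a character threshold on a toy string):

def build_suffix_array(s):
    # Sort the start indices of all suffixes lexicographically.
    # O(|S| log |S|) comparisons here; linear-time constructions exist.
    return sorted(range(len(s)), key=lambda i: s[i:])

def common_prefix_len(s, i, j):
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n

def duplicate_substrings(s, min_len=50):
    # Adjacent entries in the suffix array share the longest prefixes,
    # so one linear scan over neighbours finds repeated spans.
    sa = build_suffix_array(s)
    dups = []
    for a, b in zip(sa, sa[1:]):
        k = common_prefix_len(s, a, b)
        if k >= min_len:
            dups.append(s[a:a + k])
    return dups

S = "def add(a, b): return a + b def multiply(a, b): return a * b def multiply(a, b): return a * b"
# Finds the repeated multiply snippet (overlapping spans may be reported more than once).
print(duplicate_substrings(S, min_len=20))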

The need for Approximate Matching
Consider the following examples: despite significant overlap, duplication is not detected by ExactSubstr.
[Figure: an example and a near-duplicate example shown side by side]
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence

NEARDUP Algorithm
• Approximate the Jaccard similarity coefficient and an edit similarity score between two documents {xi, xj}. Approximate duplicates exist for high Jaccard coefficients and high edit similarities.
• Algorithm: MinHash. J(A, B) = |A ∩ B| / |A ∪ B| ∈ [0, 1]. Each document is represented by a hash h, in this case over its set of n-grams. Only the k smallest n-gram hashes are used to estimate the Jaccard (a minimal sketch follows below).
• Here, h = tabulation hashing, n = 5, and k = b·r = 9000 (see the next slide).
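A minimal MinHash sketch in Python (an illustrative stand-in for the NEARDUP implementation: word-level n-grams, and seeded MD5 hashes in place of tabulation hashing):

import hashlib

def ngrams(text, n=5):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(grams, k=64):
    # One minimum per hash function; k seeded hashes stand in for k hash functions.
    return [min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
            for seed in range(k)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of agreeing minima estimates |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "def add ( a , b ) : return a + b"
doc2 = "def add ( x , y ) : return x + y"
s1 = minhash_signature(ngrams(doc1, n=3))
s2 = minhash_signature(ngrams(doc2, n=3))
print(round(estimated_jaccard(s1, s2), 2))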

NEARDUP Algorithm
Locality-Sensitive Hashing (with tabulation hashing) is a bucketized hashing scheme: hashes are computed for subsequences, and a final hash for the document is obtained by combining them with bitwise XOR.

• The set of hashes obtained per n-gram is the document signature. Each element is hashed using k other hashing functions.
• The minimum hashed element for each of the k functions is stored.
• The minima are partitioned into r buckets, with b hashes per bucket. If {xi, xj} share hashes in ≥ 1 bucket, the pair is considered a match. For documents with true Jaccard similarity s, the probability of a match is P(match) = 1 − (1 − s^b)^r (evaluated numerically below).
• Here, b = 20, r = 450, and k = b·r = 9000.
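To see how sharply this banding scheme separates near-duplicates from unrelated pairs, a quick numerical check with the b = 20, r = 450 settings above:

# A pair with true Jaccard similarity s agrees on a given bucket with
# probability s**b, so P(match) = 1 - (1 - s**b)**r.
b, r = 20, 450
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"Jaccard {s:.1f} -> match probability {1 - (1 - s ** b) ** r:.3f}")
# Pairs below ~0.7 similarity almost never collide; pairs above ~0.8 almost always do.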

NEARDUP Algorithm
• For document pairs {xi, xj} considered potential matches, the full Jaccard index is computed. If J(d_xi, d_xj) ≥ 0.8, the edit similarity (typically defined as 1 − EditDistance(xi, xj) / max(|xi|, |xj|)) is computed.
• Finally, a graph is created to cluster similar documents: if two documents are considered a match, an edge is constructed between the pair. The connected components form clusters.
• Deduplication is performed on these clusters, and a filtered dataset is obtained (a minimal sketch follows below).
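A minimal sketch of this final clustering step (plain union-find standing in for a graph library; the document ids and matched pairs are toy inputs):

def connected_component_dedup(num_docs, matched_pairs):
    # Union-find over document ids; every matched pair is an edge.
    parent = list(range(num_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in matched_pairs:
        union(a, b)

    # Keep the lowest-id document from each connected component.
    keep = {}
    for doc in range(num_docs):
        keep.setdefault(find(doc), doc)
    return sorted(keep.values())

# Docs 0 and 1 were flagged as near-duplicates; doc 2 stands alone.
print(connected_component_dedup(3, [(0, 1)]))   # -> [0, 2]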

NEARDUP Example
Tri-gram substrings (shingles):
doc_id | shingles
0 | {"Deduplication is so", "is so much", "so much fun"}
1 | {"so much fun", "fun and easy", "Deduplication is so", "is so much"}
2 | {"dog is a", "is a thing", "wish spider dog", "spider dog is", "I wish spider"}
MinHash – 5-way permutation:
doc_id | minhash
0 | [403996643, 840529008, 1008110251, 2888962350, 432993166]
1 | [403996643, 840529008, 1008110251, 1998729813, 432993166]
2 | [166417565, 213933364, 1129612544, 1419614622, 1370935710]
Reference: https://huggingface.co/blog/dedup

NEARDUP Example
band index | band value | doc_ids
0 | [403996643, 840529008] | 0, 1
1 | [1008110251, 2888962350] | 0
1 | [1008110251, 1998729813] | 1
0 | [166417565, 213933364] | 2
1 | [1129612544, 1419614622] | 2

doc_id | minhash | bands
0 | [403996643, 840529008, 1008110251, 2888962350, 432993166] | [0: [403996643, 840529008], 1: [1008110251, 2888962350]]
1 | [403996643, 840529008, 1008110251, 1998729813, 432993166] | [0: [403996643, 840529008], 1: [1008110251, 1998729813]]
2 | [166417565, 213933364, 1129612544, 1419614622, 1370935710] | [0: [166417565, 213933364], 1: [1129612544, 1419614622]]

Documents 0 and 1 share the band-0 value [403996643, 840529008], so they are flagged as candidate near-duplicates.

Data Decontamination
• To what extent do training or test sets from downstream NLP tasks appear in the pretraining corpus?
• Types:
  • Input and output contamination
  • Input contamination
• Why does it happen?
  • The dataset is built from text on the web
  • It is uploaded after creation (e.g., to a GitHub repo)
• At the surface level: substring match and edit-distance scoring
• At the semantic level: AST representation followed by n-gram overlap (n between 8 and 13) between training and test examples
• Final score: max(surface-level score, semantic-level score); a simplified overlap check is sketched below
Reference: LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
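A simplified sketch of such an n-gram overlap check (operating on whitespace tokens rather than the AST normalization described above; n = 8 is one value from the 8–13 range):

def ngram_set(text, n=8):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(training_doc, benchmark_example, n=8):
    # Flag the benchmark example if any of its n-grams appears verbatim
    # in the training document.
    return bool(ngram_set(training_doc, n) & ngram_set(benchmark_example, n))

train = "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"
test = "Complete the function: def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"
print(is_contaminated(train, test))   # -> True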

Data Decontamination
• Data contamination on two popular benchmarks: HumanEval and MBPP
• Observed in the pretraining corpora the Pile and the Stack
• "Top-1 Score" denotes the similarity score between the gold solution and the most similar program found in the training corpus.
• Impact: model performance is significantly inflated on the subset of the benchmarks where similar solutions are seen during training.
Reference: Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
Decontamination is essential to improve model generalization

Tokenization
How do we represent an input text in machines?
Tokenization: splitting a string into a sequence of tokens.
Word-level tokenization
  Pros: contains semantic and contextual information
  Cons: exploding vocabulary problem
Character-level tokenization
  Pros: smaller vocabulary
  Cons: no semantic morphology, much longer input sequences
Subword tokenization: the best of both worlds!
  Frequently used words are stored as entire tokens.
  Infrequently used words are split into subwords.
[Figure: tokenized input feeding into Transformer blocks]
Example (word-level): "Unicode characters like emojis may be split."
['Unicode', 'characters', 'like', 'emojis', 'may', 'be', 'split', '.']

Byte Pair Encoding (BPE)
Used in: GPT models, RoBERTa, CodeBERT, LLaMA, …
Steps:
1. Pre-tokenize
2. Create the base vocabulary (the set of unique characters/symbols in the corpus)
3. Merge the most frequently occurring symbol pair
4. Insert the merged symbol into the vocabulary
5. Repeat 3–4 until the vocabulary size limit is reached (~32K–64K)
Other tokenization methods:
• WordPiece (Schuster et al., ICASSP 2012)
• SentencePiece (Kudo et al., 2018): subword tokenization without pre-tokenization (spaces are treated as part of the text)

# Training set
words = ["hug", "pug", "pun", "bun", "hugs"]
Base vocabulary:
["b", "g", "h", "n", "p", "s", "u"]
INITIAL CORPUS:
[(["h", "u", "g"], 10), (["p", "u", "g"], 5), (["p", "u", "n"], 12), (["b", "u", "n"], 4), (["h", "u", "g", "s"], 5)]
NEW MERGE RULE: Combine "u" and "g"
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
NEW MERGE RULE: Combine "u" and "n"
("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
NEW MERGE RULE: Combine "h" and "ug"
("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
(A runnable sketch of these merge steps follows below.)
Reference: https://huggingface.co/docs/transformers/tokenizer_summary
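A minimal, runnable sketch of those merge steps on the toy corpus above (illustrative; not the Hugging Face tokenizers implementation):

from collections import Counter

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in corpus}   # start from single characters

def most_frequent_pair():
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(pair):
    a, b = pair
    for word, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

for _ in range(3):   # reproduces the merges "ug", "un", "hug" from the slide
    pair = most_frequent_pair()
    apply_merge(pair)
    print("merge", pair, "->", splits)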

Handling unknown words
What happens when we encounter a word at test time that we've never seen in our training data?
Out of Vocabulary (OOV)
mug -> [??, "ug"]
No index and no word embedding exist for the unseen piece.
Tokenized: ["<UNK>", "ug"]
Solution: replace low-frequency words in the training data with a special token <UNK>, and map unseen words at test time to <UNK> as well (a minimal sketch follows below).
What is the limitation of <UNK>?
We lose a lot of information about texts with many rare words / entities:
Original: The chapel is sometimes referred to as "Hen Gapel Lligwy" ("hen" being the Welsh word for "old" and "capel" meaning "chapel").
With <UNK>: The chapel is sometimes referred to as "Hen <UNK> <UNK>" ("hen" being the Welsh word for "old" and "<UNK>" meaning "chapel").

Programming Language Selection & Code Quality
Filters and quality inspections applied on the code corpus:
• Programming language selection
• Malware detection: detect trojans, viruses, malware and other malicious threats, and classify the file as benign/malignant
• Alphanumeric filter (a minimal sketch follows below)
  • Alphanumeric_Frac (0.1 – 0.2)
  • Catches samples with lots of symbols/special characters and files with only spaces/tabs.
  • The majority of samples in this range are smaller programs with developer headers.
[Example distribution: Alphanumeric_Frac (0.1 – 0.2)]
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
StarCoder 2 and The Stack v2: The Next Generation
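A minimal sketch of the alphanumeric-fraction signal (thresholding and the resulting action are left to the pipeline; the 0.1–0.2 band above is the suspicious range):

def alphanumeric_frac(text):
    # Fraction of characters that are letters or digits.
    return sum(ch.isalnum() for ch in text) / max(len(text), 1)

good = "def add(a, b):\n    return a + b\n"
noisy = "/*==========================*/\n\t   \t \n"
print(round(alphanumeric_frac(good), 2), round(alphanumeric_frac(noisy), 2))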

Code Quality
Filters and quality inspections applied on the code corpus:
• Autogenerated filter
  • is_generated function from go-enry (go-enry, 2024)
  • Flags files with {"auto-generated", "autogenerated", "automatically generated", "generated automatically", "this file is generated"} in the first 5 lines of the file.
• Encoded data filter – inline encoded data (a minimal sketch follows below)
  • Base64 strings: [a-zA-Z0-9+/\n=]{64,}
  • Hexadecimal sequences: (?:\b(?:0x|\\x)?[0-9a-fA-F]{2}(?:,|\b\s*)){8,}
  • Unicode strings: (?:\\u[0-9a-fA-F]{4}){8,}
• JSON/HTML/YAML filter
  • TEXT files matching {"readme", "notes", "todo", "description", "cmakelists", "requirement"}

Example (autogenerated header):
# This file is automatically generated by the build system.
# Do not modify this file directly. Instead, modify the source files.
# Changes made directly to this file will be lost.
def calculate_sum(a, b):
    return a + b
# Example usage
result = calculate_sum(5, 3)
print(f"The sum is {result}")

Example (inline encoded data):
# This variable contains a sequence of hexadecimal values, often used for representing binary data in a readable format.
hex_data = "0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x57, 0x6F, 0x72, 0x6C, 0x64"
# Converting hexadecimal to a string
decoded_hex = bytes.fromhex(hex_data.replace('0x', '').replace(', ', '')).decode('utf-8')
print(decoded_hex)

Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
StarCoder 2 and The Stack v2: The Next Generation
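A minimal sketch wiring the regexes above into a filter signal (the coverage-fraction aggregation is an illustrative choice, not the exact pipeline rule):

import re

ENCODED_PATTERNS = [
    re.compile(r"[a-zA-Z0-9+/\n=]{64,}"),                           # Base64 blocks
    re.compile(r"(?:\b(?:0x|\\x)?[0-9a-fA-F]{2}(?:,|\b\s*)){8,}"),  # hexadecimal sequences
    re.compile(r"(?:\\u[0-9a-fA-F]{4}){8,}"),                       # unicode escape strings
]

def encoded_data_fraction(code):
    # Fraction of characters covered by encoded-data matches
    # (overlapping matches may double-count in this simple sketch).
    covered = sum(m.end() - m.start() for p in ENCODED_PATTERNS for m in p.finditer(code))
    return covered / max(len(code), 1)

sample = 'payload = "' + "QUJDREVGRwo=" * 8 + '"\n'
print(round(encoded_data_fraction(sample), 2))   # most of this file is an inline Base64 blob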

Code Quality
Filters and quality inspections applied on the code corpus:
• Mean line length (a minimal sketch follows below)
  Average length of the lines in a code file
  • 0–200: samples are generally good
  • 200–300: an equal mix of good and bad samples found
• What is the recommended mean line length of code samples?
  • It varies across languages: Python: 150–200, C: <200, Java: <250
  • C and C++ libraries are not as extensive as those in Python or Java
  • Files may include a lot of boilerplate code for setting up projects (header files / source files)
[Example distribution: line_mean (280–300)]
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
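A minimal sketch of this signal (the 200-character cutoff is the illustrative, language-dependent threshold discussed above):

def mean_line_length(code):
    # Average length of non-empty lines.
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return sum(len(ln) for ln in lines) / max(len(lines), 1)

def keep_by_line_length(code, max_mean=200.0):
    return mean_line_length(code) <= max_mean

snippet = "def add(a, b):\n    return a + b\n"
print(mean_line_length(snippet), keep_by_line_length(snippet))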

Code Quality
Filters and quality inspections applied on the code corpus:
• Chars-to-token ratio (a minimal sketch follows below)
  Ratio of the number of characters in a given string (e.g., a line or block of code) to the number of tokens generated from that string
  • Impact of a low ratio: truncated, unreadable content
  • Impact of a high ratio: non-code content, large header comments
Filters:
• Identifying obfuscated or minified code
• Detecting comment overload
• Filtering out non-code data (docs, config)
• Non-English text
[Example distribution: chars-to-token ratio (0.5 – 1.0)]
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
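A minimal sketch of this ratio, using a simple regex split as a stand-in for the model's actual BPE tokenizer:

import re

def chars_to_token_ratio(text):
    # Word/punctuation split stands in for the real tokenizer.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return len(text) / max(len(tokens), 1)

readable = "def add(a, b):\n    return a + b\n"
minified = "a=lambda x,y:x+y;b=lambda x:x*x;print(a(1,2),b(3))"
# Dense, symbol-heavy code yields a lower chars-per-token ratio.
print(round(chars_to_token_ratio(readable), 2), round(chars_to_token_ratio(minified), 2))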

PII (Personally Identifiable Information)
• Personally Identifiable Information (PII) categories:
  • NAME, EMAIL_ADDRESS, ID, IP_ADDRESS, KEY, PASSWORD, USERNAME
• How to address them?
  • replace: replaces detected PII with a placeholder (a minimal sketch follows below)
  • redact: removes the detected PII from the text
  • Detect-secrets tool [1]
  • Regular expressions by Ben Allal et al. (2023) for detecting emails, IPv4 and IPv6 addresses

# Functional style for user registration containing PII fields
def create_user_info(name, email, user_id, ip_address):
    return {
        "name": name,
        "email_address": email,
        "id": user_id,
        "ip_address": ip_address
    }

def register_user():
    user_info = create_user_info(
        "John Doe",           # Name
        "[email protected]",  # Email address
        "123456789",          # User ID
        "192.168.1.1"         # IP address
    )
    print("User registered successfully:", user_info)

register_user()
Reference: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
StarCoder 2 and The Stack v2: The Next Generation
[1] https://github.com/Yelp/detect-secrets
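A minimal sketch of the replace strategy with simplified email/IPv4 patterns (illustrative only; the pipeline relies on the Ben Allal et al. (2023) expressions and the detect-secrets tool):

import re

PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def replace_pii(text):
    # Substitute each detected span with a typed placeholder.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(replace_pii('user = ("John Doe", "john.doe@example.com", "192.168.1.1")'))
# -> user = ("John Doe", "<EMAIL_ADDRESS>", "<IP_ADDRESS>")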

Semantic Ordering of Code Files
Research question: How should we show the files of a repo to an LLM?
• To enhance LLM learning, we propose a novel technique to pack the data, utilizing the natural structure of a repository and semantic file dependencies.
• Such data organization allows the model to learn across files, thereby enabling it to perform repository-level coding tasks.
Reference: Scaling Granite Code Models to 128K Context
Semantic data ordering of files accelerates model learning

Algorithm for semantic ordering
Input: GitHub repositories
Output: Ordered code files
• Build a dependency graph G between all files of a repository using import dependencies, such that each file is represented by a node and every edge captures a dependency.
• Remove cycles to form a directed acyclic graph.
• Perform a topological sort of the graph to get an ordered set of files (a minimal sketch follows below).
• README and other documentation files are placed at the top of the repo.
Reference: Scaling Granite Code Models to 128K Context
Semantic data ordering of files improves model performance and is useful for long-context training
Sample results* on one of the benchmarks that tests code generation capability across files in a repository.
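A minimal sketch of the ordering step (hypothetical file names; the toy import graph is assumed to be already acyclic):

from graphlib import TopologicalSorter   # Python 3.9+

# file -> files it imports
deps = {
    "utils.py": set(),
    "model.py": {"utils.py"},
    "train.py": {"model.py", "utils.py"},
}
docs = ["README.md"]

# static_order() emits dependencies before their dependents, so imported
# files appear before the files that import them; documentation goes first.
ordered = docs + list(TopologicalSorter(deps).static_order())
print(ordered)   # ['README.md', 'utils.py', 'model.py', 'train.py']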

Summary
• High-quality models start with high-quality data
• Data preparation is challenging and time consuming
• Different downstream applications necessitate unique data processing approaches
• Challenges amplify with the modality and scale of data
• Solution: one place for all your data prep needs, run as an open-source community project. Contributions are welcome!
Data-Prep-Kit
https://github.com/IBM/data-prep-kit
