Introduction to the Data Prep Kit for LLMs

chloewilliams62 170 views 11 slides Sep 19, 2024

Slide Content

Data Prep for LLM
- Santosh Borse, IBM Research

Data is the new oil (for LLM growth)
•Common Crawl
•Since 2007, 250 billion pages
•Petabytes of data
•Superset of many other available datasets
•Processed datasets: RefinedWeb, The Pile, C4, RedPajama, Wiki, etc.
•Domain- or source-specific datasets: BookCorpus, MathQA, StarCoder, etc.
•Hugging Face (HF) has more than 210K datasets
•Your own data
•Models have evolved from training on millions of tokens to trillions of tokens

Data Quality
•Variety
•Linguistic patterns
•Overfitting vs underfitting
•Bias
•Personal Information
•Bad data: hate, abuse, profanity
•Time and cost required for training
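
Some of the quality concerns above (duplicates, personal information, profanity) can be illustrated with a minimal Python sketch. This is not taken from the deck or from Data Prep Kit: the function names, the `BLOCKLIST` placeholder words, and the email-only notion of personal information are assumptions chosen for illustration. Production pipelines typically rely on fuzzy deduplication, curated lexicons, and trained classifiers.

```python
import hashlib
import re

# Hypothetical placeholder blocklist; a real pipeline would use a curated
# lexicon or a classifier for hate, abuse, and profanity.
BLOCKLIST = {"badword1", "badword2"}

# Simplified email pattern, standing in for broader personal-information detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a marker."""
    return EMAIL_RE.sub("<EMAIL>", text)


def has_blocked_word(text: str) -> bool:
    """Return True if any blocklisted word appears as a whole word."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return bool(words & BLOCKLIST)


def clean_corpus(docs):
    """Drop exact duplicates and blocklisted documents, then redact emails."""
    seen = set()
    kept = []
    for doc in docs:
        # Exact dedup on lower-cased, whitespace-normalized text.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if has_blocked_word(doc):
            continue
        kept.append(redact_emails(doc))
    return kept
```

Filtering before training also bears on the last bullet: every document removed here is one fewer document to tokenize and train on.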

Good vs Bad Data

Data Journey for IBM Granite Model

Data Prep Kit
•Recently open-sourced: https://github.com/IBM/data-prep-kit
•Tried and tested: used to prepare data for IBM’s Granite models
•Scales from a laptop to a full data center
•Many built-in transforms, plus bring your own transform
•Runs on pure Python, Spark, or Ray
•Scaling logic is abstracted away
•Supports checkpointing
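
The checkpointing idea can be sketched in plain Python. This is a generic illustration and not Data Prep Kit's actual mechanism or API: `run_with_checkpoint`, the JSON checkpoint file, and the per-file granularity are all assumptions. The point is only that a rerun after a failure skips work already recorded as done.

```python
import json
from pathlib import Path


def run_with_checkpoint(input_files, transform, checkpoint_path):
    """Apply `transform` to each input, skipping inputs already checkpointed.

    Hypothetical helper: the checkpoint is a JSON list of finished input names.
    """
    checkpoint = Path(checkpoint_path)
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()

    results = {}
    for name in input_files:
        if name in done:
            continue  # finished in an earlier run
        results[name] = transform(name)
        done.add(name)
        # Persist progress after every input so a crash loses at most one item.
        checkpoint.write_text(json.dumps(sorted(done)))
    return results
```

Calling the function twice with the same checkpoint path processes only the inputs that were added or left unfinished since the first call.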

Data Prep Kit

Bring your own transform
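
What a custom transform might look like can be sketched without the kit itself. The shape below (a transform as a function from a list of records to a list of records, chained by `run_pipeline`) is an assumption made for illustration; it is not the Data Prep Kit interface, and the names `drop_short_docs`, `add_char_count`, and `run_pipeline` are hypothetical. See the repository linked above for the real extension points.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Transform = Callable[[List[Record]], List[Record]]


def drop_short_docs(min_chars: int) -> Transform:
    """Build a transform that keeps records whose text has at least `min_chars` characters."""
    def transform(records: List[Record]) -> List[Record]:
        return [r for r in records if len(str(r["text"])) >= min_chars]
    return transform


def add_char_count(records: List[Record]) -> List[Record]:
    """Annotate each record with the length of its text, without mutating the input."""
    return [{**r, "num_chars": len(str(r["text"]))} for r in records]


def run_pipeline(records: List[Record], transforms: List[Transform]) -> List[Record]:
    """Apply each transform in order, feeding the output of one into the next."""
    for transform in transforms:
        records = transform(records)
    return records
```

Because every transform has the same signature, a user-written one slots into the pipeline the same way as a built-in one, for example `run_pipeline(records, [drop_short_docs(5), add_char_count])`.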

DPK Hands-on in Colab
https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/README.md
