Language-Interfaced Tabular Oversampling Via Progressive Imputation And Self-Authentication
Language-Interfaced Tabular Oversampling Via
Progressive Imputation And Self-Authentication
June Yong Yang*, Geondo Park*, Joowon Kim, Hyeongwon Jang,
Eunho Yang
ICLR 2024
Graduate School of AI, KAIST
Machine Learning & Intelligence Laboratory
Introduction
● Tabular data is ubiquitous across a myriad of industries, such as health care, marketing, and finance.
● Tabular data in the wild are often riddled with class imbalance.
● Given a training dataset $D$, the number of samples for each class is skewed:
$N_1 \ge N_2 \ge \cdots \ge N_C$
where $N_c$ is the number of samples belonging to class $c$.
Introduction
● Our research goal is to utilize a Tabular Language Model (TLM) to synthesize tabular samples belonging to the minority class, balancing the class distribution.
●Recent advances in deep generative models have bestowed the means to generate
high-quality synthetic tabular data.
● We propose Language-Interfaced Tabular Oversampling (LITO), an oversampling framework for tabular data that comprehensively leverages the power of language-interfaced tabular learning.
Language-Interfaced Tabular Generation
● Tabular data can be readily formatted into text, and thus can be processed by generative language models without external adapters or representation alignment.
● Given a tabular dataset, the $n$-th row of the table can be represented as follows:
$t_{n,m} = \text{``} h_m \text{ is } v_{n,m} \text{''}, \qquad t_n = t_{n,1}, t_{n,2}, \cdots, t_{n,M}$
where $v_{n,m}$ is the $(n,m)$-th value of the table and $h_m$ is the name of the $m$-th column.
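As a concrete illustration, here is a minimal Python sketch of this serialization; the column names and row values are hypothetical examples, not taken from the paper's datasets.

```python
# Minimal sketch of the "h_m is v_{n,m}" serialization described above.
def serialize_row(headers, values):
    """Render one table row as 'h_1 is v_1, h_2 is v_2, ...'."""
    return ", ".join(f"{h} is {v}" for h, v in zip(headers, values))

# Hypothetical columns and row values, for illustration only.
headers = ["age", "income", "label"]
row = [42, 55000, "default"]
print(serialize_row(headers, row))
# -> "age is 42, income is 55000, label is default"
```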
Language-Interfaced Tabular Oversampling
● Minor-Conditioned Sampling With Importance-Aware Imputation
● Simple class-conditioned generation:
$t_n \sim p_\theta(\,\cdot \mid t_{\text{label}}), \qquad t_{\text{label}} = [\,\text{``label''},\ \text{``is''},\ c\,]$
● Convert the sample to the targeted minority class by conditional imputation:
$t_n \sim p_\theta(\,\cdot \mid t_{\text{label}}, t_{n,1}, t_{n,2}, \cdots, t_{n,M-k})$
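A minimal sketch of the two conditioning modes above, reusing the serialization from the previous slide; the helper names, prompt format, and puncturing of the last $k$ columns are illustrative assumptions, and the TLM sampler itself is left abstract.

```python
# Sketch of the two sampling modes: plain class-conditioned generation vs.
# conditional imputation of a punctured sample under the minority label.
def class_conditioned_prompt(minority_class):
    # t_label = ["label", "is", c]: condition generation on the label clause only.
    return f"label is {minority_class}, "

def imputation_prompt(headers, values, k, minority_class):
    # Keep the first M-k feature clauses of an existing sample (k >= 1) and
    # let the model impute the punctured remainder under the minority label.
    keep = len(headers) - k
    kept = ", ".join(f"{h} is {v}" for h, v in zip(headers[:keep], values[:keep]))
    return f"label is {minority_class}, {kept}, "
```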
Language-Interfaced Tabular Oversampling
● Minor-Conditioned Sampling With Importance-Aware Imputation
● Considering the heterogeneity of columns, puncture and impute columns guided by a feature-importance criterion.
● Self-attention scores of the TLM (last-layer attention scores) are used to attribute the importance of column features.
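A sketch of how such attention-based column importance could be computed, assuming a HuggingFace-style causal LM that supports `output_attentions=True`; the token-to-column span mapping is an assumption here, not the paper's exact procedure.

```python
import torch

def column_importance(model, tokenizer, text, column_spans):
    """Attribute importance to columns via last-layer attention received.

    column_spans: dict mapping column index -> (start, end) token positions;
    building this mapping is tokenizer-specific and assumed given here.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # attentions[-1] has shape (batch, heads, query_len, key_len). Averaging
    # over heads and query positions gives how much attention each token
    # *receives* in the last layer.
    received = out.attentions[-1].mean(dim=1).mean(dim=1).squeeze(0)
    return {m: received[s:e].sum().item() for m, (s, e) in column_spans.items()}
```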
Language-Interfaced Tabular Oversampling
● Rejection Sampling via Self-Authentication
● To filter out ill-generated synthetic samples.
● The generative language model is capable of imputing the label of a given sample; a synthetic sample is accepted only if this self-predicted label matches the targeted minority class.
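A minimal sketch of rejection sampling via self-authentication; `tlm_impute_label`, which asks the TLM to fill in a masked label clause, is a hypothetical stand-in for the actual model call.

```python
# Sketch: keep a synthetic sample only if the TLM itself assigns it the
# targeted minority class when asked to impute the (masked) label.
def self_authenticate(tlm_impute_label, features_text, target_class):
    predicted = tlm_impute_label(f"{features_text}, label is")
    return predicted.strip() == str(target_class)

def filter_synthetic(samples, tlm_impute_label, target_class):
    # Rejection sampling: discard ill-generated samples that fail the check.
    return [s for s in samples
            if self_authenticate(tlm_impute_label, s, target_class)]
```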
Language-Interfaced Tabular Oversampling
● Adaptive Oversampling With Progressive Imputation
● The number of column imputations required for successful conversion may vary from one sample to another; samples that fail self-authentication are progressively re-imputed with more columns punctured (see the sketch below).
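A sketch of the adaptive loop this implies, reusing the hypothetical helpers from the previous sketches; the linear schedule over the number of punctured columns $k$ is an assumption.

```python
# Sketch: progressively puncture and re-impute more columns until the
# converted sample passes self-authentication (or give up).
def progressive_convert(sample, target_class, puncture_and_impute,
                        authenticate, num_columns):
    for k in range(1, num_columns + 1):
        # Puncture k columns (e.g., least important first, per the
        # importance criterion) and impute them under the minority label.
        candidate = puncture_and_impute(sample, target_class, k)
        if authenticate(candidate, target_class):
            return candidate  # converted with as few imputations as needed
    return None  # reject samples that never pass self-authentication
```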
Experiment: Binary Classification Tasks
●LITO consistently outperforms baselines on four binary classification tasks, excelling
in both extreme and mild imbalance scenarios.
Experiment: Multi-label Classification Tasks
● LITO brings better imbalance-handling performance than other baselines in most cases.
● In the extreme imbalance setting, LITO clearly outperforms all baselines by large margins.
Experiment: In-Context LITO
●A proof of-concept experiment to demonstrate the performance of in-context LITO
using OpenAIGPT-3.5-turbo API.
●Oversampling minority class samples through in-context learning is indeed effective.