251027_Thuy_Labseminar[Scaling Language-Image Pre-training via Masking].pptx
About This Presentation
Scaling Language-Image Pre-training via Masking
Size: 1.13 MB
Language: en
Added: Oct 27, 2025
Slides: 13
Slide Content
Scaling Language-Image Pre-training via Masking
Van Thuy Hoang, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: [email protected]
2025-10-27
Yanghao Li et al., CVPR 2023
Fast Language-Image Pre-training (FLIP)
- A simple method for efficient CLIP training via masking
- Randomly masks out image patches with a high masking ratio
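To make the masking step concrete, below is a minimal PyTorch-style sketch of random patch masking with a high masking ratio, in the spirit of FLIP's image masking. The function name random_mask_patches, the shapes, and the 50% ratio are illustrative assumptions, not code or settings taken from the paper.

```python
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly drop a fraction of patch tokens (FLIP-style image masking).

    patch_tokens: (batch, num_patches, dim) output of a ViT patch embedding.
    Returns only the kept tokens, shape (batch, num_kept, dim).
    """
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1.0 - mask_ratio))

    # Sample a random permutation per example and keep the first num_keep indices.
    noise = torch.rand(b, n, device=patch_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]          # (b, num_keep)

    # Gather the kept tokens; masked tokens are simply discarded.
    kept = torch.gather(
        patch_tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, d)
    )
    return kept

# Example: 50% masking halves the tokens the image encoder must process,
# so each step is roughly 2x cheaper in image-encoder compute.
tokens = torch.randn(8, 196, 768)            # e.g., ViT-B/16 on 224x224: 14x14 = 196 patches
kept = random_mask_patches(tokens, mask_ratio=0.5)
print(kept.shape)                            # torch.Size([8, 98, 768])
```

Masked patches are dropped rather than replaced with mask tokens, so the image encoder only ever processes the kept subset.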
FLIP Overview
Benefits of masking:
- The model sees more sample pairs in the same wall-clock training time
- More sample pairs are contrasted per step via larger batches under a similar memory constraint
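A back-of-the-envelope sketch of these two benefits, under the simplifying assumption that image-encoder compute and activation memory scale roughly linearly with the number of kept patches; the batch sizes below are hypothetical placeholders, not figures reported in the paper.

```python
# Effect of a 50% mask ratio (illustrative numbers only).
mask_ratio = 0.5
keep_frac = 1.0 - mask_ratio

# Image-text pairs seen per unit wall-clock time, relative to no masking:
relative_throughput = 1.0 / keep_frac          # ~2x more pairs per hour

# Batch size that fits in roughly the same image-encoder activation memory:
base_batch = 32_768                            # hypothetical unmasked batch size
masked_batch = int(base_batch / keep_frac)     # pairs contrasted per step with masking

print(relative_throughput, masked_batch)       # 2.0 65536
```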
Properties of FLIP – Image Masking
- Image masking yields higher or comparable accuracy while speeding up training
- Larger batches give big gains over smaller ones
FLIP Results – Zero-shot ImageNet accuracy
- For ViT-L/14, FLIP is better than both OpenCLIP and our reproduced CLIP pre-trained on the same data
FLIP Results – Zero-shot accuracy, linear probing, and fine-tuning on ImageNet
- FLIP outperforms its OpenCLIP and CLIP counterparts pre-trained on the same data
FLIP Results – Zero-shot retrieval
- FLIP performs better on zero-shot image/text retrieval
FLIP Results – Captioning and VQA
- FLIP performs better on image captioning and visual question answering
FLIP Results – Scaling behavior of FLIP
- The speed-up of FLIP facilitates scaling explorations
- Model and data scaling consistently outperform the baselines
- Data scaling is favored for zero-shot transfer
- Model scaling is favored for transfer learning
FLIP Results – Scaling behavior of FLIP
- Model and data scaling are highly complementary: scaling both gives +3.3%, more than the sum of the model-only (+1.2%) and data-only (+1.5%) gains
- Joint scaling, combined with schedule scaling, gives the best results in most cases
Conclusion
- FLIP outperforms its CLIP counterparts pre-trained on the same LAION data.
- Comparing several LAION-based models with the original WIT-based ones shows that the pre-training data creates big systematic gaps on several tasks.
- Data scaling is a favored scaling dimension, since it improves accuracy at no extra training or inference cost.
- This fast method encourages scaling beyond what is studied in this work.