[NS][Lab_Seminar_240703]GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning.pptx


About This Presentation

GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning


Slide Content

GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/07/03
Paper: Guangyue Xu et al., WACV 2024

Introduction: Compositional Zero-Shot Learning
- Different from traditional ZSL settings, where each class is represented by a single text label, CZSL considers the compositional information among concepts.
Challenges
- No training data is available for the unseen compositions.
- The model should learn compositional rules to compose the learned element concepts.
- There is a distribution shift from the training data to the test data caused by the zero-shot setting.

Problem Formulation
- Let A = {a_1, a_2, ..., a_n} be the attribute set and O = {o_1, o_2, ..., o_m} be the object set.
- The full compositional label space C is the Cartesian product C = A × O (see the sketch below).
- At training time, we are given a set of seen examples C_seen = {(x_1, c_1), ..., (x_k, c_k)}, where x_i is an image and c_i = (a_i, o_i) is its compositional label from the seen set.
- CZSL can be categorized into closed-world and open-world settings, depending on whether the test-time label space is restricted to feasible compositions.
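A minimal sketch of this label space with toy attribute/object sets (the names here are illustrative, not from the paper):

```python
# Toy CZSL label space: C = A x O, with only a subset seen at training time.
from itertools import product

attributes = ["red", "sliced", "wet"]        # A = {a_1, ..., a_n}
objects    = ["apple", "car", "dog"]         # O = {o_1, ..., o_m}

# Full compositional label space C = A x O (Cartesian product).
C = [(a, o) for a, o in product(attributes, objects)]

# Training observes only C_seen; test-time labels may come from C \ C_seen.
C_seen = {("red", "apple"), ("wet", "dog")}
C_unseen = [c for c in C if c not in C_seen]
print(len(C), len(C_unseen))                 # 9 compositions total, 7 unseen
```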

Method: GIPCOL
- The GIPCOL architecture consists of two learnable components: a soft-prompting module and a GNN.

Method: GIPCOL Architecture
- CLIP is pre-trained on 400 million image-text pairs and has already learned general knowledge for image recognition.
- GIPCOL freezes CLIP's textual and visual encoders and focuses on structuring the textual prompt to address compositional concept learning (a freezing sketch follows below).
- Two learnable components construct the soft prompt: learnable prefix vectors and a GNN module.
- The prefix vectors add learnable parameters that represent the compositional concepts and reprogram CLIP for compositional learning.
- The GNN module captures the compositional structure of objects and attributes for a better compositional concept representation in the constructed soft prompt.
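A minimal freezing sketch, assuming a `clip_model` module loaded from a pre-trained CLIP checkpoint (the variable is a stand-in, not the paper's code):

```python
import torch

def freeze_clip(clip_model: torch.nn.Module) -> None:
    for p in clip_model.parameters():
        p.requires_grad_(False)   # both textual and visual encoders stay frozen
    clip_model.eval()

# Note: gradients still flow *through* the frozen text encoder into the soft
# prompt, because requires_grad=False stops updates to CLIP's weights, not
# backpropagation past them.
```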

GIPCOL Architecture: Learnable Prefix Vectors
- k learnable prefix vectors are used in the soft prompt, with d = 768 (CLIP's embedding size).
- The vectors are prepended to the attr-obj embeddings and form part of the compositional representation (see the sketch below).
- They are fine-tuned by gradients flowing back through CLIP during training.
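A minimal sketch of the prefix construction, assuming placeholder attribute/object embeddings; the initialization scale is an assumption:

```python
import torch
import torch.nn as nn

K, D = 3, 768                                  # k prefix vectors, embedding size d

# Learnable prefix: k free vectors prepended to the attr-obj embeddings.
# The 0.02 init scale is an assumption, not from the slides.
prefix = nn.Parameter(torch.randn(K, D) * 0.02)

attr_emb = torch.randn(1, D)                   # placeholder attribute embedding
obj_emb  = torch.randn(1, D)                   # placeholder object embedding

# Soft prompt = [prefix_1 ... prefix_k, attr, obj]; only the prefix (and the
# GNN producing attr/obj embeddings) receives gradient updates.
soft_prompt = torch.cat([prefix, attr_emb, obj_emb], dim=0)
print(soft_prompt.shape)                       # torch.Size([5, 768])
```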

GIPCOL Architecture: GNN as Concept Encoder
- CZSL requires modeling the interactions between element concepts. Given the compositional concept red apple, the model needs to learn both the concept apple and how red changes apple's state, rather than treating red and apple as two independent concepts.
- The GNN enriches concept representations by fusing information from their compositional neighbors.
- The updated node representations from the GNN serve as class labels in the soft prompt.
- The whole soft prompt represents the compositional concept and is fed into CLIP's textual encoder for compositional learning.
Frozen CLIP Text Encoder
- After obtaining the updated compositional representations, the learnable prefix vectors are prepended.
- CLIP's text encoder extracts the normalized EOS vector as the compositional concept's representation for further multi-modal alignment (sketched below).
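A sketch of EOS extraction and normalization. The `text_encoder` here is a small bidirectional stand-in built from torch.nn layers, not CLIP's actual causal Transformer, so treat the module and shapes as assumptions:

```python
import torch
import torch.nn as nn

# Stand-in for CLIP's frozen text Transformer.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
for p in text_encoder.parameters():
    p.requires_grad_(False)
text_encoder.eval()

soft_prompt = torch.randn(1, 5, 768)             # [prefix x3, attr, obj]
hidden = text_encoder(soft_prompt)               # (1, 5, 768)
eos = hidden[:, -1, :]                           # final (EOS-position) state
concept = eos / eos.norm(dim=-1, keepdim=True)   # L2-normalize for cosine alignment
```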

GIPCOL Architecture: Frozen Visual Encoder and Alignment
Frozen visual encoder
- Rescale the image to 224×224 and use ViT-L/14 as the visual encoder.
- The ViT encodes the image, and the [CLASS] token is extracted as the image's representation.
- The extracted image vector is normalized for similarity computation.
Aligning Image and Compositional Concept
- Calculate the probability of image x belonging to class c_i from the similarity between the image vector and each concept vector (see the sketch below).
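A minimal sketch of the alignment step: cosine similarity between the normalized image vector and each normalized concept vector, softmaxed into class probabilities. The temperature value is an assumption (CLIP learns its own logit scale):

```python
import torch
import torch.nn.functional as F

img = F.normalize(torch.randn(1, 768), dim=-1)       # ViT [CLASS] vector, normalized
concepts = F.normalize(torch.randn(9, 768), dim=-1)  # one vector per composition c_i

temperature = 0.01                                    # assumed value
logits = img @ concepts.t() / temperature             # scaled cosine similarities
probs = logits.softmax(dim=-1)                        # p(c_i | x)
print(probs.argmax(dim=-1))                           # predicted composition index
```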

GIPCOL Architecture: GNN in Soft Prompting
- Models the element concepts and their compositions explicitly for soft-prompt construction.
Nodes
- Element concept nodes (attributes and objects).
- Compositional concept nodes, initialized with the average of their element-node embeddings: (attribute + object) / 2.
Compositional Graph Construction
- Given a pair c = (a, o), besides the self-connected edge, add 3 undirected edges (c-a), (c-o) and (a-o); the resulting adjacency matrix is symmetric (a construction sketch follows below).
GNN Module
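A minimal sketch of the graph construction described above, using toy attribute/object sets; the node indexing and random element embeddings are illustrative, while the averaging initialization and the edge pattern follow the slide:

```python
import torch

attributes, objects = ["red", "wet"], ["apple", "dog"]
pairs = [(a, o) for a in attributes for o in objects]

n_attr, n_obj = len(attributes), len(objects)
n = n_attr + n_obj + len(pairs)                  # element + composition nodes
A = torch.eye(n)                                 # self-connected edges

d = 768
elem = torch.randn(n_attr + n_obj, d)            # element-concept embeddings
comp = torch.zeros(len(pairs), d)

for k, (a, o) in enumerate(pairs):
    ai = attributes.index(a)                     # attribute node index
    oi = n_attr + objects.index(o)               # object node index
    ci = n_attr + n_obj + k                      # composition node index
    comp[k] = (elem[ai] + elem[oi]) / 2          # init: (attribute + object) / 2
    for u, v in [(ci, ai), (ci, oi), (ai, oi)]:  # undirected edges (c-a), (c-o), (a-o)
        A[u, v] = A[v, u] = 1                    # symmetric adjacency

X = torch.cat([elem, comp], dim=0)               # node feature matrix for the GNN
```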

GIPCOL Architecture: Training and Inference
Training
- A loss function is applied over the seen compositions.
Inference
- First construct soft prompts for all target concepts using the fine-tuned prefix vectors and the GNN.
- Use CLIP's frozen textual and visual encoders to obtain the image vector x and the target concept vector set.
- Use the cosine measure to select the most similar attr-obj pair as the compositional label (see the sketch below).
CLIP-Prompting Method Comparison
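A minimal inference sketch under the recipe above; the tensors are placeholders and the helper name `predict` is hypothetical. Training would reuse the same similarities as logits in a cross-entropy loss over seen compositions, a standard CLIP-style objective assumed here since the slide's loss equation is not reproduced:

```python
import torch
import torch.nn.functional as F

def predict(img_vec: torch.Tensor, concept_vecs: torch.Tensor) -> int:
    img = F.normalize(img_vec, dim=-1)           # frozen visual-encoder output
    cpt = F.normalize(concept_vecs, dim=-1)      # one vector per target composition
    sims = cpt @ img                             # cosine similarity to each pair
    return int(sims.argmax())                    # most similar attr-obj pair

img_vec = torch.randn(768)
concept_vecs = torch.randn(9, 768)
print(predict(img_vec, concept_vecs))
```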

Experiments: Datasets, Implementation Details and Evaluation Metrics
- MIT-States and C-GQA consist of images of objects and their attributes in the general domain.
- UT-Zappos contains images of shoes paired with their material attributes and is a more domain-specific dataset.
- For a fair comparison, the prefix length k is set to 3, the same length as CLIP's hard prompt.
- The soft-prompting dimension d is set to 768, consistent with CLIP's model setting.
- A two-layer GCN encodes the concepts, together with the corresponding learnable GNN parameters.
- A scalar value is added to the unseen-class scores to adjust the bias towards seen classes, as used in [20, 22] (sketched below).
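A small sketch of that seen/unseen bias calibration; the `calibrate` helper and the bias value 1.0 are hypothetical stand-ins for the validation-tuned scalar used in [20, 22]:

```python
import torch

def calibrate(logits: torch.Tensor, unseen_mask: torch.Tensor,
              bias: float = 1.0) -> torch.Tensor:
    # Add a scalar to unseen-composition logits to counter the seen-class bias.
    return logits + bias * unseen_mask.float()

logits = torch.randn(9)                                   # one score per composition
unseen_mask = torch.tensor([0, 1, 1, 0, 1, 1, 1, 1, 1], dtype=torch.bool)
pred = calibrate(logits, unseen_mask).argmax()
```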

Experiments: Results

Experiments: Ablation Study

Experiments: Higher-Order Compositional Learning

Conclusion
- Proposes GIPCOL, a new CLIP-based prompting framework, to address compositional zero-shot learning.
- The goal is to recognize compositional concepts of objects with their states and attributes as depicted in images.
- Objects and attributes have been observed during training in some compositions, but test-time compositions can be novel and unseen. GIPCOL therefore introduces a novel prompting strategy for soft-prompt construction by treating element concepts as part of a global GNN that encodes feasible compositional information, including objects, attributes, and their compositions.
- The soft-prompt representation is influenced not only by the pre-trained VLM but also by all the compositional representations in its neighborhood, as captured by the compositional graph.