Exploiting LMM-based knowledge for image classification tasks

About This Presentation

Presentation of our paper, "Exploiting LMM-based knowledge for image classification tasks", by M. Tzelepi and V. Mezaris. Presented at the 25th Int. Conf. on Engineering Applications of Neural Networks (EANN/EAAAI 2024), Corfu, Greece, June 2024. Preprint: http://arxiv.org/abs/2406.03071


Slide Content

Exploiting LMM-based knowledge for image classification tasks
Maria Tzelepi and Vasileios Mezaris, CERTH-ITI, Thermi, Thessaloniki, Greece
EANN/EAAAI 2024: 25th International Conference on Engineering Applications of Neural Networks, Corfu, Greece, 27-30 June 2024

Outline
- Introduction
- CLIP
- MiniGPT-4
- Related work
- Proposed method
- Experiments
- Conclusions

Introduction
- Large Language Models (LLMs) have demonstrated exceptional performance in several downstream Natural Language Processing (NLP) and computer vision tasks
- Vision-Language Models (VLMs), e.g., BLIP-2 and CLIP, allowed for connecting image-based vision models with LLMs
- Goal: exploit the knowledge encoded in Large Multimodal Models (LMMs) for downstream image classification tasks, focusing on action/event recognition
- To do so, we utilize CLIP, proposing to further incorporate knowledge encoded in the powerful MiniGPT-4 model

CLIP
- CLIP comprises an image encoder and a text encoder
- It is trained with (image, text) pairs to predict which of the possible (image, text) pairings actually occurred
- To do so, it learns a multimodal embedding space by jointly training the image and text encoders to maximize the cosine similarity of the embeddings of the correct image-text pairs, while minimizing the cosine similarity of the embeddings of the incorrect pairings
- CLIP provides outstanding zero-shot classification performance
- Another approach is to use the CLIP image encoder to extract the corresponding image representations and use them with a linear classifier
Figure: The CLIP model [1].
1. Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
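
As a minimal illustration of the zero-shot usage described above, the sketch below scores an image against text prompts in CLIP's shared embedding space. It assumes OpenAI's open-source `clip` package and the ViT-L/14 weights mentioned later in the slides; the class names, prompt template, and file name are hypothetical, and this is not the authors' code.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

class_names = ["basketball", "surfing", "archery"]  # hypothetical subset of action classes
prompts = clip.tokenize([f"a photo of a person doing {c}" for c in class_names]).to(device)

image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # placeholder image path

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(prompts)
    # cosine similarity in the shared multimodal embedding space
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_emb @ text_emb.T

pred = logits.softmax(dim=-1).argmax(dim=-1).item()
print(class_names[pred])
```

The linear-probe variant mentioned on the slide simply replaces the text-prompt comparison with a linear classifier trained on the `encode_image` outputs.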

MiniGPT-4
- GPT-4 is the first model to accept both text and image input, producing text output; however, the technical details behind GPT-4 remain undisclosed
- MiniGPT-4 aligns a frozen visual encoder with a frozen LLM, utilizing a projection layer. Specifically, MiniGPT-4 uses the Vicuna LLM, while for visual perception it uses a ViT-G/14 from EVA-CLIP and a Q-Former network
- MiniGPT-4 only requires training the linear projection layer to align the visual features with Vicuna
Figure: The MiniGPT-4 model [2].
2. Zhu, Deyao, et al. "MiniGPT-4: Enhancing vision-language understanding with advanced large language models." arXiv preprint arXiv:2304.10592 (2023).
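
As a rough sketch of the alignment idea only (not MiniGPT-4's actual code), the snippet below shows a single trainable linear projection mapping frozen visual features into the LLM's embedding space; the dimensions and token count are illustrative assumptions.

```python
import torch
import torch.nn as nn

VIS_DIM = 768    # assumed Q-Former output dimension
LLM_DIM = 5120   # assumed Vicuna-13B hidden size

proj = nn.Linear(VIS_DIM, LLM_DIM)  # the only trainable component in the alignment stage

# frozen visual features for a batch of images: (batch, num_query_tokens, VIS_DIM)
visual_tokens = torch.randn(2, 32, VIS_DIM)

# projected tokens are treated as soft visual prompts for the frozen LLM
llm_visual_prompts = proj(visual_tokens)   # (2, 32, LLM_DIM)
print(llm_visual_prompts.shape)
```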

Related work
- CLIP zero-shot [1]
- CLIP - Linear Probe [1]
- CLIP (image) [1]
- CLIP-A-self [3]
- Context Optimization (CoOp) [4]
- Conditional Context Optimization (CoCoOp) [5]
- Language Guided Bottlenecks (LaBo) [6]
1. Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
3. Maniparambil, Mayug, et al. "Enhancing CLIP with GPT-4: Harnessing visual descriptions as prompts." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
4. Zhou, Kaiyang, et al. "Learning to prompt for vision-language models." International Journal of Computer Vision 130.9 (2022): 2337-2348.
5. Zhou, Kaiyang, et al. "Conditional prompt learning for vision-language models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
6. Yang, Yue, et al. "Language in a bottle: Language model guided concept bottlenecks for interpretable image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Related work
- CLIP zero-shot [1], CLIP - Linear Probe [1], CLIP (image) [1], CLIP-A-self [3], Context Optimization (CoOp) [4], Conditional Context Optimization (CoCoOp) [5], Language Guided Bottlenecks (LaBo) [6]
- This is the first work that utilizes GPT-4 in a multimodal fashion (i.e., the MiniGPT-4 model, with both text and images) for downstream image classification tasks
- Relevant approaches utilize either the unimodal GPT-3 model for textual prompting or the multimodal GPT-4 model, again only for textual prompting
- We obtain sample-specific information from the LMM, whereas in the relevant approaches the LLM is prompted to extract information about the classes of the problem

Proposed method

Proposed method

Proposed method

Datasets & implementation details
Datasets
- UCF-101: action recognition dataset, containing 101 classes (13,320 extracted frames)
- ERA: event recognition dataset in unconstrained aerial videos, consisting of 25 classes (1,473 training images and 1,391 test images)
- BAR: a real-world action recognition dataset, consisting of 6 classes (1,941 training images and 654 test images)
Implementation details
- MiniGPT-4 with Vicuna-13B, run locally
- ViT-L/14 CLIP version
- For the classification task, a single linear layer is used, with output dimension equal to the number of classes of each dataset
- Models are trained for 500 epochs, and the learning rate is set to 0.001
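
A minimal sketch of the training setup described above, assuming precomputed (frozen) embeddings as input, PyTorch, and an Adam optimizer (the optimizer and batching scheme are not stated on the slide):

```python
import torch
import torch.nn as nn

def train_linear_classifier(features, labels, num_classes, epochs=500, lr=0.001):
    """features: (N, D) tensor of precomputed embeddings; labels: (N,) LongTensor."""
    clf = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)  # optimizer choice is an assumption
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(features), labels)
        loss.backward()
        opt.step()
    return clf

# e.g., for BAR with its 6 classes:
# clf = train_linear_classifier(train_feats, train_labels, num_classes=6)
```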

Experimental results (UCF-101, ERA, BAR)
- Combining LMM-based knowledge with the image embeddings provides considerably improved classification performance on all three datasets

Ablation study (UCF-101 dataset)
- CLIP image embeddings (the main comparison approach) provide very good performance
- Using only the text embeddings (of the MiniGPT-4 descriptions) achieves lower but very competitive performance
- Using both the image and text CLIP embeddings, concatenated, yields a significant improvement over using only the image embeddings (see the fusion sketch below)
- The mean of the image and text embeddings also considerably improves on the baseline performance
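
For reference, the two fusion variants compared in the ablation expressed as simple tensor operations; the 768-dimensional embedding size is an assumption based on the ViT-L/14 encoders:

```python
import torch

img_emb = torch.randn(8, 768)   # CLIP image embeddings for a batch of 8 samples
txt_emb = torch.randn(8, 768)   # CLIP text embeddings of the MiniGPT-4 descriptions

fused_concat = torch.cat([img_emb, txt_emb], dim=-1)   # (8, 1536): doubles the input dimension
fused_mean = (img_emb + txt_emb) / 2                   # (8, 768): keeps the dimension unchanged
```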

Comparisons with CLIP-based approaches (UCF-101 dataset)
- The most relevant and straightforward comparison is against the CLIP model utilizing only the image embeddings
- We also include relevant works that combine CLIP with LLMs/LMMs
- All the compared models involve training
- Even though we do not follow a zero-shot approach, we also report the CLIP zero-shot classification performance

Qualitative results

Qualitative results

Conclusions
- We dealt with image classification tasks by exploiting knowledge encoded in LMMs
- We used the MiniGPT-4 model to extract sample-specific semantic descriptions
- Subsequently, we used these descriptions to obtain text embeddings from CLIP's text encoder
- Then, we used these text embeddings along with the corresponding image embeddings obtained from CLIP's image encoder for the classification task (a sketch of the full pipeline follows)
- Incorporating LMM-based knowledge into the image embeddings achieves considerably improved classification performance
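
A minimal end-to-end sketch of the pipeline summarized above, assuming OpenAI's `clip` package and a hypothetical `describe_with_minigpt4` wrapper around a locally running MiniGPT-4 (Vicuna-13B) instance; this is an illustration of the described approach, not the authors' code.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def describe_with_minigpt4(image_path: str) -> str:
    """Hypothetical helper: prompt a local MiniGPT-4 instance for a
    sample-specific semantic description of the image."""
    raise NotImplementedError

def lmm_enhanced_embedding(image_path: str) -> torch.Tensor:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    description = describe_with_minigpt4(image_path)
    tokens = clip.tokenize([description], truncate=True).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)   # CLIP image embedding
        txt_emb = model.encode_text(tokens)   # CLIP text embedding of the LMM description
    # concatenated embedding fed to the single linear classification layer
    return torch.cat([img_emb, txt_emb], dim=-1).squeeze(0)
```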

Thank you for your attention! Questions?
Maria Tzelepi, [email protected]
This work was supported by the EU Horizon Europe program, under grant agreement 101070109 TransMIXR.