LECTURE SLIDES: TRANSFORMERS FOR VISION TASKS


About This Presentation

Vision


Slide Content

236781 Lecture 9: Deep Learning. Transformers for Vision Tasks. Dr. Chaim Baskin

Transformers for image modeling

Transformers for image modeling: Outline. Architectures; Self-supervised learning

Vision-and-language BERT. J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019. Image regions and features come from detector output; vision-language co-attention. Pretraining tasks: predict the object class of a masked-out region, and predict whether an image and sentence go together.

Detection Transformer (DETR) – Facebook AI. Hybrid of CNN and transformer, aimed at a standard recognition task. N. Carion et al., End-to-end object detection with transformers, ECCV 2020

Image transformer – Google. Image generation and super-resolution with 32x32 output; attention restricted to local neighborhoods. N. Parmar et al., Image transformer, ICML 2018

Sparse transformers – OpenAI. R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019

Previous approaches 1: self-attention on pixels, but local or factorized. Usually replaces the 3x3 conv in a ResNet. Examples: Non-local NN (Wang et al. 2017); SASANet (Stand-Alone Self-Attention in Vision Models); HaloNet (Scaling Local Self-Attention for Parameter Efficient...); LR-Net (Local Relation Networks for Image Recognition); SANet (Exploring Self-Attention for Image Recognition); and more. Results are usually underwhelming, nothing to write home about: they do not justify the increased complexity or the slowdown relative to convolutions. Many prior works attempted to introduce self-attention at the pixel level, but for a 224x224 image that is a sequence length of about 50k, which is too much (see the sketch after this slide). Image credits: Stand-Alone Self-Attention in Vision Models by Ramachandran et al.; Local Relation Networks for Image Recognition by Hu et al.
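The sequence-length point on this slide is easy to check with a back-of-the-envelope calculation; the Python snippet below (mine, not from the lecture) just works out the numbers for full pixel-level self-attention at 224x224.

```python
# Why full self-attention on raw pixels is impractical at 224x224:
H = W = 224
n = H * W                       # one token per pixel
print(n)                        # 50176 -- the "50k sequence length" on the slide
# The attention score matrix is n x n per head; in float32 that alone is
attn_bytes = n * n * 4
print(attn_bytes / 2**30)       # ~9.4 GiB for a single attention map
```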

Previous approaches 2: self-attention applied globally, after or inside a full-blown CNN, or even a detector/segmenter. Cons: the result is a highly complex, often multi-stage-trained architecture; it does not operate on pixels, i.e. the transformer cannot "learn to fix" the (often frozen!) CNN's mistakes. Examples: DETR (Carion, Massa et al. 2020); Visual Transformers (Wu et al. 2020); UNITER (Chen, Li, Yu et al. 2019); ViLBERT (Lu et al. 2019); VisualBERT (Li et al. 2019); etc. Image credits: UNITER: UNiversal Image-TExt Representation Learning by Chen et al.; Visual Transformers: Token-based Image Representation and Processing for Computer Vision by Wu et al.

Vision transformer (ViT) – Google. Split an image into patches and feed the linearly projected patches into a standard transformer encoder. With patches of 14x14 pixels, you need 16x16 = 256 patches to represent a 224x224 image. A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale, ICLR 2021
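As an illustration of the patch-embedding step described above, here is a minimal PyTorch sketch (my own, not taken from the lecture); the 14x14 patch size matches the slide's example, while the 768-dimensional embedding is an assumption.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Cut an image into fixed-size patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 16 x 16 = 256
        # A strided conv is equivalent to "split into patches + shared linear layer".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, dim, 16, 16)
        return x.flatten(2).transpose(1, 2)  # (B, 256, dim) -> token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 256, 768])
```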

Vision transformer (ViT) vs. BiT: Big Transfer (ResNet). ViT variants: Base/Large/Huge with a patch size of 14x14, 16x16, or 32x32. Pre-trained in a supervised fashion on an internal Google dataset (not public), then fine-tuned on ImageNet. A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale, ICLR 2021

Vision transformer (ViT). Transformers are claimed to be more computationally efficient than CNNs or hybrid architectures. A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale, ICLR 2021

Hierarchical transformer: Swin. Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021

Hierarchical transformer: Swin. Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021

Hierarchical transformer: Swin. Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
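To make the window mechanism behind Swin concrete, here is a rough PyTorch sketch of window partitioning and the cyclic shift applied between consecutive blocks; it is a simplification under my own assumptions about tensor layout and feature sizes, not code from the lecture, though the window size of 7 is the paper's default.

```python
import torch

def window_partition(x, ws=7):
    """Group a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # (num_windows*B, ws*ws, C)

x = torch.randn(2, 56, 56, 96)     # stage-1-sized feature map (assumed shape)
wins = window_partition(x)         # self-attention would run inside each window
print(wins.shape)                  # (2*64, 49, 96)

# Shifted windows: cyclically roll the feature map by half a window before
# partitioning, so the next block's windows straddle the previous boundaries.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
wins_shifted = window_partition(shifted)
```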

Swin results: COCO detection and segmentation

Beyond transformers? I. Tolstikhin et al., MLP-Mixer: An all-MLP Architecture for Vision, NeurIPS 2021
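For contrast with the attention-based blocks above, below is a minimal sketch of one MLP-Mixer block: one MLP mixes information across tokens (patches), another across channels. This is an illustrative re-implementation, not code from the lecture, and all dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(            # mixes across the token axis
            nn.Linear(num_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(          # mixes across the channel axis
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                          # x: (B, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

out = MixerBlock()(torch.randn(4, 196, 512))
print(out.shape)  # torch.Size([4, 196, 512])
```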

Hybrid of CNNs and transformers? T. Xiao et al., Early convolutions help transformers see better, NeurIPS 2021

Or completely back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022

Back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022

Back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022

Outline: Architectures; Self-supervised learning

DINO: Self-distillation with no labels. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021

DINO. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021

DINO. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021
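A minimal sketch of the DINO training signal as described by Caron et al.: a student network is trained to match the centered, sharpened output distribution of an EMA "teacher" on different augmented views, with no labels. The temperatures and momentum value below are placeholder assumptions, not the lecture's settings.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between teacher and student output distributions."""
    # Teacher targets: centered and sharpened, gradients stopped.
    t = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    s = F.log_softmax(student_out / t_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)
```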

Masked autoencoders. K. He et al., Masked autoencoders are scalable vision learners, CVPR 2022

Masked autoencoders. K. He et al., Masked autoencoders are scalable vision learners, CVPR 2022
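The key ingredient of masked autoencoders is random masking of a large fraction of patch tokens before encoding, so only the visible patches go through the encoder and the rest are reconstructed by a lightweight decoder. Below is a minimal sketch of just the masking step, with assumed shapes and the paper's default 75% mask ratio; it is illustrative, not the authors' code.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; return them plus a binary mask."""
    B, N, D = tokens.shape                       # (batch, patches, dim)
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # random score per patch
    ids_shuffle = noise.argsort(dim=1)           # lowest scores are kept
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                      # 1 = masked (to be reconstructed)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

visible, mask = random_masking(torch.randn(2, 196, 768))
print(visible.shape, mask.sum(dim=1))            # (2, 49, 768), 147 masked per image
```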

Masked autoencoders: Results. K. He et al., Masked autoencoders are scalable vision learners, CVPR 2022