Lecture Slides: Transformers for Vision Tasks
Slide Content
236781 Lecture 9: Deep Learning. Transformers for Vision Tasks. Dr. Chaim Baskin
Transformers for image modeling
Transformers for image modeling: Outline. Architectures; Self-supervised learning
Vision-and-language BERT. J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019. Image regions and features come from detector output; vision-language co-attention; pretraining tasks: predict the object class of a masked-out region, and predict whether an image and a sentence go together.
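A rough sketch of the co-attention idea, assuming PyTorch (illustrative, not ViLBERT's actual code): each stream uses the other stream's features as keys and values, so text tokens attend to image regions and image regions attend to text.

import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of vision-language co-attention: queries come from one modality,
    keys/values from the other. Names and sizes are illustrative."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):                 # (B, T, dim), (B, R, dim)
        text_out, _ = self.text_to_image(text, image, image)   # text queries, image keys/values
        image_out, _ = self.image_to_text(image, text, text)   # image queries, text keys/values
        return text_out, image_out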
Detection Transformer (DETR) – Facebook AI. A hybrid of CNN and transformer, aimed at a standard recognition task. N. Carion et al., End-to-End Object Detection with Transformers, ECCV 2020.
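To make the CNN + transformer structure concrete, here is a minimal sketch assuming PyTorch; the tiny conv "backbone", layer counts, and head names are illustrative, and positional encodings plus the Hungarian matching loss are omitted:

import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    """Sketch of DETR's structure: CNN features -> transformer encoder-decoder
    with learned object queries -> per-query class/box heads.
    Not the official implementation."""
    def __init__(self, num_classes=91, d_model=256, num_queries=100):
        super().__init__()
        # Tiny conv stack standing in for the ResNet-50 backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1),
        )
        self.transformer = nn.Transformer(d_model, nhead=8, num_encoder_layers=3,
                                          num_decoder_layers=3, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # normalized (cx, cy, w, h)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images)                 # (B, d_model, H', W')
        src = feats.flatten(2).transpose(1, 2)        # (B, H'*W', d_model) feature tokens
        tgt = self.query_embed.weight.unsqueeze(0).expand(images.shape[0], -1, -1)
        hs = self.transformer(src, tgt)               # (B, num_queries, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()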
Image Transformer – Google. Image generation and super-resolution with 32x32 output; attention restricted to local neighborhoods. N. Parmar et al., Image Transformer, ICML 2018.
Sparse Transformers – OpenAI. R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019.
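The idea is to replace dense attention with fixed sparse patterns. A sketch of one simple strided pattern in that spirit (not the paper's exact factorized heads), assuming PyTorch >= 2.0 for the usage line; the helper name is made up:

import torch
import torch.nn.functional as F

def strided_sparse_mask(n, stride):
    """Boolean mask (True = may attend): each position attends causally to a
    local window of `stride` recent positions plus every stride-th earlier
    position, roughly O(n*sqrt(n)) work when stride ~ sqrt(n) instead of O(n^2)."""
    i = torch.arange(n).unsqueeze(1)              # query positions
    j = torch.arange(n).unsqueeze(0)              # key positions
    causal = j <= i
    local = (i - j) < stride                      # recent window
    strided = (j % stride) == (stride - 1)        # periodic "summary" positions
    return causal & (local | strided)

# Usage with PyTorch's scaled_dot_product_attention (bool mask: True = attend)
q = k = v = torch.randn(1, 8, 1024, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=strided_sparse_mask(1024, 32))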
Previous approaches 1: On pixels, but locally or factorized. Many prior works attempted to introduce self-attention at the pixel level, usually replacing the 3x3 conv in a ResNet; for a 224x224 image that would mean a sequence length of about 50k, too much for global attention. Examples: Non-local NN (Wang et al. 2017), SASANet (Stand-Alone Self-Attention in Vision Models), HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones), LR-Net (Local Relation Networks for Image Recognition), SANet (Exploring Self-Attention for Image Recognition), ... Results are usually underwhelming, nothing to write home about: they do not justify the increased complexity or the slowdown over convolutions. Image credits: Stand-Alone Self-Attention in Vision Models by Ramachandran et al.; Local Relation Networks for Image Recognition by Hu et al.
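To make the "swap the 3x3 conv for attention over a small neighborhood" idea concrete, here is a single-head sketch assuming PyTorch; it follows the general SASA/LR-Net recipe but is illustrative (no relative position bias, no multi-head):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Pixel-level self-attention restricted to a k x k neighborhood,
    so the sequence length per query is k*k instead of H*W."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_kv = nn.Conv2d(channels, 2 * channels, 1)

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.to_q(x).view(B, C, 1, H * W)
        k, v = self.to_kv(x).chunk(2, dim=1)
        # Gather the k x k neighborhood around every pixel
        k = F.unfold(k, self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        v = F.unfold(v, self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        attn = (q * k).sum(1, keepdim=True) / C ** 0.5  # (B, 1, k*k, H*W)
        attn = attn.softmax(dim=2)
        out = (attn * v).sum(dim=2)                     # (B, C, H*W)
        return out.view(B, C, H, W)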
Previous approaches 2: Globally, after/inside a full-blown CNN, or even a detector/segmenter! Cons: the result is a highly complex, often multi-stage-trained architecture; and it does not work from pixels, i.e. the transformer can't "learn to fix" the (often frozen!) CNN's mistakes. Examples: DETR (Carion, Massa et al. 2020), Visual Transformers (Wu et al. 2020), UNITER (Chen, Li, Yu et al. 2019), ViLBERT (Lu et al. 2019), VisualBERT (Li et al. 2019), etc. Image credits: UNITER: UNiversal Image-TExt Representation Learning by Chen et al.; Visual Transformers: Token-based Image Representation and Processing for Computer Vision by Wu et al.
Vision Transformer (ViT) – Google. Split an image into patches and feed the linearly projected patches into a standard transformer encoder. With patches of 14x14 pixels, you need 16x16 = 256 patches to represent a 224x224 image. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
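The patchify-and-project front end is simple; a strided convolution is the standard way to implement "linearly project each non-overlapping patch". A sketch assuming PyTorch, with sizes matching the slide (14x14 patches on a 224x224 image); the class token, position embeddings, and encoder settings are illustrative:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT front end sketch: non-overlapping patches, linear projection,
    a [class] token, and learned position embeddings."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2            # 16 * 16 = 256
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)            # (B, 256, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed     # (B, 257, dim)

# The token sequence then goes through a plain transformer encoder:
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12)
features = encoder(PatchEmbed()(torch.randn(2, 3, 224, 224)))  # (2, 257, 768)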
Vision Transformer (ViT). BiT: Big Transfer (ResNet-based); ViT: Vision Transformer (Base/Large/Huge, patch size of 14x14, 16x16, or 32x32). Pretraining data includes an internal Google dataset (not public). Trained in a supervised fashion, fine-tuned on ImageNet. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
Vision Transformer (ViT). Transformers are claimed to be more computationally efficient than CNNs or hybrid architectures. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
Hierarchical transformer: Swin. Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021.
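The two mechanics that make Swin hierarchical and efficient are attention inside non-overlapping windows and shifting those windows between blocks. A sketch of both, assuming PyTorch; the window size of 7 and the feature-map size are illustrative:

import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C); self-attention is then
    computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Shifted windows: cyclically shift the map by half a window before partitioning,
# so the next block's windows straddle the previous block's boundaries and
# information can flow across windows.
x = torch.randn(1, 56, 56, 96)                            # stage-1 sized feature map
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))     # half of window_size = 7
windows = window_partition(shifted, window_size=7)        # (64, 49, 96)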
Swin results: COCO detection and segmentation.
Beyond transformers? I. Tolstikhin et al., MLP-Mixer: An all-MLP Architecture for Vision, NeurIPS 2021.
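The Mixer replaces attention with two plain MLPs per block: one that mixes information across patches (tokens) and one that mixes across channels. A sketch of one block, assuming PyTorch; all widths are illustrative:

import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing MLP across patches, then
    channel-mixing MLP per patch, each with a residual connection."""
    def __init__(self, num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                           # x: (B, num_tokens, dim)
        # Token mixing: transpose so the MLP acts along the patch dimension
        y = self.norm1(x).transpose(1, 2)           # (B, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Channel mixing: an ordinary per-token MLP
        return x + self.channel_mlp(self.norm2(x))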
Hybrid of CNNs and transformers? T. Xiao et al., Early Convolutions Help Transformers See Better, NeurIPS 2021.
Or completely back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022.
Back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022.
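The ConvNeXt block "modernizes" the ResNet block with a large depthwise convolution, LayerNorm, and an inverted-bottleneck MLP. A sketch assuming PyTorch; layer scale and stochastic depth from the paper are left out:

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> pointwise expand/contract MLP,
    with a residual connection around the whole block."""
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)      # 1x1 convs written as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                           # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                   # (B, H, W, C) for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)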
Outline: Architectures; Self-supervised learning.
DINO: Self-distillation with no labels. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021.
DINO. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021.
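In essence, a student and a momentum (EMA) teacher with the same architecture see different crops of the same image, and the student is trained to match the teacher's centered, sharpened output distribution, with no labels. A sketch of the two key pieces, assuming PyTorch; the temperatures and momentum are typical values from the paper, while the function names are illustrative:

import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the (centered, sharpened) teacher distribution
    and the student distribution; the teacher side carries no gradient."""
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters are an exponential moving average of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1 - momentum)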
Masked autoencoders. K. He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022.
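The core trick is to mask a large fraction (around 75%) of the patch tokens and run the encoder only on the visible ones; a lightweight decoder then reconstructs the masked pixels. A sketch of the random-masking step, assuming PyTorch; the function name and return signature are illustrative:

import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) fraction of patch tokens per image.
    `patches` has shape (B, N, D); only `visible` is fed to the encoder."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)      # per-patch random scores
    ids_keep = noise.argsort(dim=1)[:, :num_keep]        # random subset per image
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)       # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_keep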
Masked autoencoders: Results. K. He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022.