Lecture Slides: Transformers for Vision Tasks
Slide Content
236781 Lecture 9: Deep Learning. Transformers for Vision Tasks. Dr. Chaim Baskin
Transformers for image modeling
Transformers for image modeling: Outline. Architectures; Self-supervised learning
Vision-and-language BERT. J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019. Image regions and features come from detector output; vision-language co-attention; pretraining tasks: predict the object class of a masked-out region, and predict whether an image and a sentence go together.
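A rough sketch of the co-attention idea, assuming PyTorch (illustrative, not ViLBERT's actual code): each stream uses the other stream's features as keys and values, so text tokens attend to image regions and image regions attend to text.

import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of vision-language co-attention: queries come from one modality,
    keys/values from the other. Names and sizes are illustrative."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):                 # (B, T, dim), (B, R, dim)
        text_out, _ = self.text_to_image(text, image, image)   # text queries, image keys/values
        image_out, _ = self.image_to_text(image, text, text)   # image queries, text keys/values
        return text_out, image_out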
Detection Transformer (DETR) – Facebook AI. A hybrid of CNN and transformer, aimed at a standard recognition task. N. Carion et al., End-to-End Object Detection with Transformers, ECCV 2020.
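To make the CNN + transformer structure concrete, here is a minimal sketch assuming PyTorch; the tiny conv "backbone", layer counts, and head names are illustrative, and positional encodings plus the Hungarian matching loss are omitted:

import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    """Sketch of DETR's structure: CNN features -> transformer encoder-decoder
    with learned object queries -> per-query class/box heads.
    Not the official implementation."""
    def __init__(self, num_classes=91, d_model=256, num_queries=100):
        super().__init__()
        # Tiny conv stack standing in for the ResNet-50 backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1),
        )
        self.transformer = nn.Transformer(d_model, nhead=8, num_encoder_layers=3,
                                          num_decoder_layers=3, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # normalized (cx, cy, w, h)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images)                 # (B, d_model, H', W')
        src = feats.flatten(2).transpose(1, 2)        # (B, H'*W', d_model) feature tokens
        tgt = self.query_embed.weight.unsqueeze(0).expand(images.shape[0], -1, -1)
        hs = self.transformer(src, tgt)               # (B, num_queries, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()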
Image Transformer – Google. Image generation and super-resolution with 32x32 output; attention restricted to local neighborhoods. N. Parmar et al., Image Transformer, ICML 2018.
Sparse Transformers – OpenAI. R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019.
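The idea is to replace dense attention with fixed sparse patterns. A sketch of one simple strided pattern in that spirit (not the paper's exact factorized heads), assuming PyTorch >= 2.0 for the usage line; the helper name is made up:

import torch
import torch.nn.functional as F

def strided_sparse_mask(n, stride):
    """Boolean mask (True = may attend): each position attends causally to a
    local window of `stride` recent positions plus every stride-th earlier
    position, roughly O(n*sqrt(n)) work when stride ~ sqrt(n) instead of O(n^2)."""
    i = torch.arange(n).unsqueeze(1)              # query positions
    j = torch.arange(n).unsqueeze(0)              # key positions
    causal = j <= i
    local = (i - j) < stride                      # recent window
    strided = (j % stride) == (stride - 1)        # periodic "summary" positions
    return causal & (local | strided)

# Usage with PyTorch's scaled_dot_product_attention (bool mask: True = attend)
q = k = v = torch.randn(1, 8, 1024, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=strided_sparse_mask(1024, 32))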
Previous approaches 1: On pixels, but locally or factorized. Many prior works attempted to introduce self-attention at the pixel level, usually replacing the 3x3 conv in a ResNet; for a 224x224 image that would mean a sequence length of about 50k, too much for global attention. Examples: Non-local NN (Wang et al. 2017), SASANet (Stand-Alone Self-Attention in Vision Models), HaloNet (Scaling Local Self-Attention for Parameter Efficient Visual Backbones), LR-Net (Local Relation Networks for Image Recognition), SANet (Exploring Self-Attention for Image Recognition), ... Results are usually underwhelming, nothing to write home about: they do not justify the increased complexity or the slowdown over convolutions. Image credits: Stand-Alone Self-Attention in Vision Models by Ramachandran et al.; Local Relation Networks for Image Recognition by Hu et al.
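To make the "swap the 3x3 conv for attention over a small neighborhood" idea concrete, here is a single-head sketch assuming PyTorch; it follows the general SASA/LR-Net recipe but is illustrative (no relative position bias, no multi-head):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Pixel-level self-attention restricted to a k x k neighborhood,
    so the sequence length per query is k*k instead of H*W."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_kv = nn.Conv2d(channels, 2 * channels, 1)

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.to_q(x).view(B, C, 1, H * W)
        k, v = self.to_kv(x).chunk(2, dim=1)
        # Gather the k x k neighborhood around every pixel
        k = F.unfold(k, self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        v = F.unfold(v, self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        attn = (q * k).sum(1, keepdim=True) / C ** 0.5  # (B, 1, k*k, H*W)
        attn = attn.softmax(dim=2)
        out = (attn * v).sum(dim=2)                     # (B, C, H*W)
        return out.view(B, C, H, W)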
Previous approaches 2: Globally, after/inside a full-blown CNN, or even a detector/segmenter! Cons: the result is a highly complex, often multi-stage-trained architecture; and it does not work from pixels, i.e. the transformer can't "learn to fix" the (often frozen!) CNN's mistakes. Examples: DETR (Carion, Massa et al. 2020), Visual Transformers (Wu et al. 2020), UNITER (Chen, Li, Yu et al. 2019), ViLBERT (Lu et al. 2019), VisualBERT (Li et al. 2019), etc. Image credits: UNITER: UNiversal Image-TExt Representation Learning by Chen et al.; Visual Transformers: Token-based Image Representation and Processing for Computer Vision by Wu et al.
Vision Transformer (ViT) – Google. Split an image into patches and feed the linearly projected patches into a standard transformer encoder. With patches of 14x14 pixels, you need 16x16 = 256 patches to represent a 224x224 image. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
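The patchify-and-project front end is simple; a strided convolution is the standard way to implement "linearly project each non-overlapping patch". A sketch assuming PyTorch, with sizes matching the slide (14x14 patches on a 224x224 image); the class token, position embeddings, and encoder settings are illustrative:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT front end sketch: non-overlapping patches, linear projection,
    a [class] token, and learned position embeddings."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2            # 16 * 16 = 256
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)            # (B, 256, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed     # (B, 257, dim)

# The token sequence then goes through a plain transformer encoder:
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12)
features = encoder(PatchEmbed()(torch.randn(2, 3, 224, 224)))  # (2, 257, 768)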
Vision Transformer (ViT). BiT: Big Transfer (ResNet-based); ViT: Vision Transformer (Base/Large/Huge, patch size of 14x14, 16x16, or 32x32). Pretraining data includes an internal Google dataset (not public). Trained in a supervised fashion, fine-tuned on ImageNet. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
Vision Transformer (ViT). Transformers are claimed to be more computationally efficient than CNNs or hybrid architectures. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
Hierarchical transformer: Swin. Z. Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021.
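The two mechanics that make Swin hierarchical and efficient are attention inside non-overlapping windows and shifting those windows between blocks. A sketch of both, assuming PyTorch; the window size of 7 and the feature-map size are illustrative:

import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C); self-attention is then
    computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Shifted windows: cyclically shift the map by half a window before partitioning,
# so the next block's windows straddle the previous block's boundaries and
# information can flow across windows.
x = torch.randn(1, 56, 56, 96)                            # stage-1 sized feature map
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))     # half of window_size = 7
windows = window_partition(shifted, window_size=7)        # (64, 49, 96)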
Swin results: COCO detection and segmentation.
Beyond transformers? I. Tolstikhin et al., MLP-Mixer: An all-MLP Architecture for Vision, NeurIPS 2021.
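The Mixer replaces attention with two plain MLPs per block: one that mixes information across patches (tokens) and one that mixes across channels. A sketch of one block, assuming PyTorch; all widths are illustrative:

import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing MLP across patches, then
    channel-mixing MLP per patch, each with a residual connection."""
    def __init__(self, num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                           # x: (B, num_tokens, dim)
        # Token mixing: transpose so the MLP acts along the patch dimension
        y = self.norm1(x).transpose(1, 2)           # (B, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Channel mixing: an ordinary per-token MLP
        return x + self.channel_mlp(self.norm2(x))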
Hybrid of CNNs and transformers? T. Xiao et al., Early Convolutions Help Transformers See Better, NeurIPS 2021.
Or completely back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022.
Back to CNNs? Z. Liu et al., A ConvNet for the 2020s, CVPR 2022.
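The ConvNeXt block "modernizes" the ResNet block with a large depthwise convolution, LayerNorm, and an inverted-bottleneck MLP. A sketch assuming PyTorch; layer scale and stochastic depth from the paper are left out:

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> pointwise expand/contract MLP,
    with a residual connection around the whole block."""
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)      # 1x1 convs written as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                           # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                   # (B, H, W, C) for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)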
Outline: Architectures; Self-supervised learning.
DINO: Self-distillation with no labels. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021.
DINO. M. Caron et al., Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021.
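In essence, a student and a momentum (EMA) teacher with the same architecture see different crops of the same image, and the student is trained to match the teacher's centered, sharpened output distribution, with no labels. A sketch of the two key pieces, assuming PyTorch; the temperatures and momentum are typical values from the paper, while the function names are illustrative:

import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the (centered, sharpened) teacher distribution
    and the student distribution; the teacher side carries no gradient."""
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters are an exponential moving average of the student's."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps.detach(), alpha=1 - momentum)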
Masked autoencoders. K. He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022.
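The core trick is to mask a large fraction (around 75%) of the patch tokens and run the encoder only on the visible ones; a lightweight decoder then reconstructs the masked pixels. A sketch of the random-masking step, assuming PyTorch; the function name and return signature are illustrative:

import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) fraction of patch tokens per image.
    `patches` has shape (B, N, D); only `visible` is fed to the encoder."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)      # per-patch random scores
    ids_keep = noise.argsort(dim=1)[:, :num_keep]        # random subset per image
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)       # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_keep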
Masked autoencoders: Results. K. He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022.