Internal Study Session Material: TransNeXt: Robust Foveal Visual Perception for Vision Transformers

NABLAS · 24 slides · Jul 12, 2024

About This Presentation

This presentation introduces TransNeXt, a vision transformer architecture with an attention mechanism that combines global and local perception, a visual modeling approach that resembles human vision.


Slide Content

TransNeXt: Robust Foveal
Visual Perception for Vision
Transformers
CVPR 2024

Background
A quick review of vision transformer
The workflow of ViT (a minimal forward-pass sketch follows):
1. Image patching
2. Patch embedding
3. Position embedding
4. Transformer encoder
5. Task head
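The five steps above map directly onto code. Below is a minimal, hedged PyTorch sketch (the class name TinyViT and all hyperparameters are illustrative placeholders, not the paper's model):

```python
# Minimal sketch of the ViT workflow above (hypothetical names and sizes).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # 1-2. Image patching + patch embedding via a strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # 3. Learnable position embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # 4. Transformer encoder (self-attention token mixer + MLP channel mixer)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 5. Task head (classification)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        x = x + self.pos_embed
        x = self.encoder(x)
        return self.head(x.mean(dim=1))                      # average pool over tokens

logits = TinyViT()(torch.randn(2, 3, 224, 224))              # -> (2, 1000)
```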

Background
A quick review of vision transformer
Self-Attention Mechanism (token mixer)
• establish global connections between different tokens
• exchange information and fuse features among tokens
MLP (Multi-Layer Perceptron) (channel mixer)
• mix information across the different channels
• enhance the internal representation
(a block sketch combining the two mixers follows)
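To make the two roles concrete, here is a hedged sketch of one pre-norm encoder block; EncoderBlock and its sizes are illustrative placeholders:

```python
# Self-attention mixes information across tokens; the MLP mixes across channels.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mixer = nn.Sequential(                  # acts on each token independently
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):                                    # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.token_mixer(h, h, h)[0]                 # tokens exchange information
        return x + self.channel_mixer(self.norm2(x))         # channels are fused per token

y = EncoderBlock()(torch.randn(2, 196, 192))                 # -> (2, 196, 192)
```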

Background
Advantages
• long-range dependencies
• weaker inductive bias than CNNs
Disadvantages
• columnar structure with coarse patches
• single-scale and low-resolution output
• overlooks fine-grained local details
• challenging for dense prediction tasks (e.g., object detection, segmentation)
• high computational and memory costs

Related Work
CNNs (VGG, ResNet, etc.) introduce the pyramid structure, which Pyramid ViTs adopt.
Key characteristics of CNNs:
• increasing channels
• decreasing spatial dimensions
• hierarchical feature learning
Benefits:
• multi-scale feature extraction
• improved computational efficiency
• improved performance on dense prediction tasks

Related Work
Strategies for realizing the pyramid structure in vision transformers
Local attention:
limit the attention calculation to a fixed local window and achieve cross-window information exchange by shifting windows between layers
Limitation: cannot fully capture the global context
Pooling attention:
incorporate pooling attention mechanisms to reduce the number of tokens and emphasize the most informative features
Limitation: causes information loss of fine-grained details
Problem 1: unable to form sufficient information mixing
(a pooled-key/value attention sketch follows)
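For illustration only, a hedged sketch of the pooling-attention idea, where keys and values come from a pooled feature map while the queries stay at full resolution (the pool size and the use of nn.MultiheadAttention are assumptions, not a specific paper's implementation):

```python
# Pooling attention: cheaper, but fine-grained detail in the keys/values is lost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingAttention(nn.Module):
    def __init__(self, dim, pool_size=7, heads=4):
        super().__init__()
        self.pool_size = pool_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                        # x: (B, H*W, C)
        B, N, C = x.shape
        # Downsample the feature map to pool_size x pool_size for keys/values.
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = F.adaptive_avg_pool2d(kv, self.pool_size)  # (B, C, P, P)
        kv = kv.flatten(2).transpose(1, 2)              # (B, P*P, C)
        out, _ = self.attn(x, kv, kv)                   # queries stay at full resolution
        return out

y = PoolingAttention(64)(torch.randn(2, 56 * 56, 64), 56, 56)   # -> (2, 3136, 64)
```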

Related Work
Visualization of the effective receptive field
(Figure: the marked point represents the current query's position; the black area represents the region that the current query cannot perceive.)
Problem 2: unnatural visual perception

Problems and Motivation
Problems:
1. unable to form sufficient information mixing
2. unnatural visual perception
Biological vision: higher acuity for features around the visual focus (focus on the center) and lower acuity for distant features
Motivation:
1. explore a visual modeling approach that closely aligns with biological vision
2. achieve information perception closer to human foveal vision

Proposals
Proposal 1: Self-attention → Aggregated attention
a token mixer designed based on biomimetic foveal vision
Proposal 2: MLP → Convolutional GLU
a channel mixer with gated channel attention
Proposed model: TransNeXt

Pixel-focused Attention
Overall architecture of pixel-focused attention
Target:
1. possess fine-grained perception in the vicinity of each query
2. concurrently maintain a coarse-grained awareness of global information
Dual-path design: two attention paths
1. fine-grained attention (query-centered sliding window attention)
2. coarse-grained attention (pooling attention)

Pixel-focused Attention
Details of pixel-focused attention
1. Obtain K, V
• Fine-grained path: keys/values of the nearest tokens (the sliding window around each query)
• Coarse-grained path: keys/values of the pooled tokens (the spatially downsampled feature map)
2. Calculate similarity scores
• Fine-grained path: similarity between the query and the keys of its nearest tokens
• Coarse-grained path: similarity between the query and the keys of the pooled tokens

Pixel-focused Attention
Details of pixel-focused attention
3. Calculate attention weights: concatenate the scores of the two paths and normalize them jointly (softmax)
4. Obtain the attention weights of each path from the jointly normalized scores
5. Calculate the final output as the weighted sum of the values from both paths
(a sketch of the full dual-path computation follows)
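Putting steps 1-5 together, here is a hedged, single-head PyTorch sketch of the dual-path computation. The window size, pool size, single-head simplification, and the omission of relative position biases are simplifications of this summary, not the paper's exact implementation:

```python
# Dual-path (pixel-focused) attention sketch: sliding-window KV + pooled KV,
# concatenated and normalized with a single softmax per query.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelFocusedAttention(nn.Module):
    def __init__(self, dim, window=3, pool_size=7):
        super().__init__()
        self.dim, self.window, self.pool_size = dim, window, pool_size
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, H, W, C)
        B, H, W, C = x.shape
        q = self.q(x).reshape(B, H * W, 1, C)                # one query per pixel
        k, v = self.kv(x).chunk(2, dim=-1)                   # (B, H, W, C) each

        # Fine-grained path: keys/values from a sliding window around each query.
        def unfold(t):                                       # (B, H, W, C) -> (B, H*W, win*win, C)
            t = t.permute(0, 3, 1, 2)                        # (B, C, H, W)
            t = F.unfold(t, self.window, padding=self.window // 2)
            return t.reshape(B, C, self.window ** 2, H * W).permute(0, 3, 2, 1)
        k_fine, v_fine = unfold(k), unfold(v)

        # Coarse-grained path: keys/values from a pooled feature map, shared by all queries.
        pooled = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), self.pool_size)
        pooled = pooled.flatten(2).transpose(1, 2)           # (B, P*P, C)
        k_pool, v_pool = self.kv(pooled).chunk(2, dim=-1)
        k_pool = k_pool.unsqueeze(1).expand(B, H * W, -1, C)
        v_pool = v_pool.unsqueeze(1).expand(B, H * W, -1, C)

        # Concatenate both paths, apply a single softmax, and aggregate the values.
        keys = torch.cat([k_fine, k_pool], dim=2)            # (B, HW, win^2 + P^2, C)
        vals = torch.cat([v_fine, v_pool], dim=2)
        attn = (q * self.dim ** -0.5) @ keys.transpose(-2, -1)
        attn = attn.softmax(dim=-1)
        out = (attn @ vals).reshape(B, H, W, C)
        return self.proj(out)

# Example: a 14x14 feature map with 64 channels.
y = PixelFocusedAttention(64)(torch.randn(2, 14, 14, 64))    # -> (2, 14, 14, 64)
```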

Pixel-focused Attention
Summary of pixel-focused attention
• Pixel-focused attention employs a dual-path design with fine-grained attention and coarse-grained attention.
• This mechanism allows each query to perceive both local details and global context, effectively mimicking biological vision and enhancing the model's natural visual perception.

Aggregated (Pixel-focused) Attention
Overall architecture of aggregated attention
Differences from pixel-focused attention:
1. Query embedding
2. Learnable tokens

Aggregated (Pixel-focused) Attention
Query Embedding
Query in traditional QKV: generates attention weights; the calculated attention weights help the model focus on different parts of the input data.
Why use query embedding? – Task adaptation
Query embedding is a learnable query vector that provides additional learning capacity for each specific task. This allows the model to dynamically adjust attention weights based on task requirements (a hedged sketch follows).
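A minimal sketch of the idea, assuming the learnable query embedding is simply added to the projected queries (names and shapes are hypothetical; the paper's exact formula may differ):

```python
# Hypothetical sketch: a learnable query embedding QE added to the projected
# queries, giving the attention extra task-specific capacity.
import torch
import torch.nn as nn

dim, num_tokens = 64, 196
x = torch.randn(2, num_tokens, dim)

q_proj = nn.Linear(dim, dim)
query_embed = nn.Parameter(torch.zeros(1, 1, dim))   # learnable, shared across positions

q = q_proj(x) + query_embed                          # Q = X W_q + QE
```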

Aggregated (Pixel-focused) Attention
Learnable Tokens
What are learnable tokens?
Unlike fixed tokens that are directly derived from the input data, learnable tokens can be optimized to improve the model's performance on specific tasks.
Why use learnable tokens? – 1) Dynamic attention weights 2) Task-specific adaptation
• generate more flexible and dynamic attention weights → better feature aggregation
• adapt to different tasks by learning → focus on relevant features for each specific task
(a minimal sketch follows)
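One common way to realize learnable tokens, sketched under the assumption that they are appended to the keys and values so every query can attend to them; this is an illustration of the concept, not necessarily the paper's exact formulation:

```python
# Hypothetical sketch: learnable key/value tokens appended to the attention.
import torch
import torch.nn as nn

B, N, dim, n_learnable = 2, 196, 64, 16
x = torch.randn(B, N, dim)

qkv = nn.Linear(dim, 3 * dim)
learnable_kv = nn.Parameter(torch.randn(1, n_learnable, 2 * dim) * 0.02)  # optimized by training

q, k, v = qkv(x).chunk(3, dim=-1)
lk, lv = learnable_kv.expand(B, -1, -1).chunk(2, dim=-1)
k = torch.cat([k, lk], dim=1)                        # (B, N + n_learnable, dim)
v = torch.cat([v, lv], dim=1)

attn = (q @ k.transpose(-2, -1) * dim ** -0.5).softmax(dim=-1)
out = attn @ v                                       # (B, N, dim)
```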

Proposals (recap)
Proposal 1: Self-attention → Aggregated attention
a token mixer designed based on biomimetic foveal vision
Proposal 2: MLP → Convolutional GLU
a channel mixer with gated channel attention
Proposed model: TransNeXt

Channel Mixer
What is a channel mixer?
A channel mixer is a mechanism for processing and integrating information across the different channels of a feature map.
The need for channel mixing:
• Information integration
Each channel in a feature map focuses on different aspects of the input. To make better predictions, the model needs to combine and integrate information from all these channels.
• Feature enhancement
Some features might be more important than others. A channel mixer helps emphasize significant features while suppressing less relevant ones.
• Diverse representations
Mixing channels leads to richer and more diverse feature representations, which can improve the overall performance of the model.

Prevalent Channel-mixer Designs
MLP (multi-layer perceptron):
1. the simplest form of a channel mixer
2. combines information from different channels with fully connected layers
MLP + depthwise convolution:
1. extracts local features more effectively
2. provides additional spatial context, helping the model understand the spatial arrangement of features within an image
GLU (gated linear unit):
1. GLU consists of two linear projections: 1) a gating branch and 2) a value branch
2. The gating branch controls which elements of the value branch are passed through, acting as a filter that zeros out less important features.
3. This captures more complex interactions between features than simple linear layers.
(sketches of the MLP and GLU mixers follow)
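Hedged sketches of the MLP and GLU channel mixers described above (hidden sizes and activations are assumptions):

```python
import torch
import torch.nn as nn

class MLPMixer(nn.Module):
    """Plain MLP channel mixer: two fully connected layers across channels."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1, self.act, self.fc2 = nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
    def forward(self, x):                           # x: (B, N, C)
        return self.fc2(self.act(self.fc1(x)))

class GLUMixer(nn.Module):
    """GLU channel mixer: a gating branch filters a value branch element-wise."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)          # gating branch
        self.value = nn.Linear(dim, hidden)         # value branch
        self.out = nn.Linear(hidden, dim)
    def forward(self, x):
        return self.out(torch.sigmoid(self.gate(x)) * self.value(x))

x = torch.randn(2, 196, 64)
print(MLPMixer(64, 256)(x).shape, GLUMixer(64, 256)(x).shape)   # both (2, 196, 64)
```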

Convolutional GLU
GLU + depthwise convolution
Converts GLU into a gated channel attention mechanism based on nearest-neighbor features.
Benefits
1. enhances GLU by incorporating a 3x3 depthwise convolution before the gating function
2. captures local spatial relationships and positional information
3. makes the channel mixer more robust
(a sketch follows)
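A minimal sketch of the idea, assuming the 3x3 depthwise convolution is applied on the gating branch before its activation; hidden size and activation are placeholders:

```python
# Convolutional GLU sketch: the gate for each token depends on its 3x3 neighborhood.
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)                    # gating branch
        self.value = nn.Linear(dim, hidden)                   # value branch
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # 3x3 depthwise
        self.act = nn.GELU()
        self.out = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                               # x: (B, H*W, C)
        B, N, _ = x.shape
        g = self.gate(x).transpose(1, 2).reshape(B, -1, H, W)
        g = self.act(self.dwconv(g)).flatten(2).transpose(1, 2)  # neighborhood-aware gate
        return self.out(g * self.value(x))

y = ConvGLU(64, 256)(torch.randn(2, 14 * 14, 64), 14, 14)     # -> (2, 196, 64)
```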

TransNeXt
Overall architecture of TransNeXt
Four-stage hierarchical backbone (a configuration sketch follows)
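A hedged configuration sketch of a four-stage hierarchical backbone in this spirit; the depths, channel counts, and the stand-in attention block are hypothetical placeholders, not TransNeXt's actual variants:

```python
# Four-stage pyramid: spatial resolution shrinks and channel count grows per stage.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in stage block: generic attention (token mixer) + MLP (channel mixer)."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
    def forward(self, x):                                    # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class FourStageBackbone(nn.Module):
    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2, 4, 2)):
        super().__init__()
        self.downsamples = nn.ModuleList()
        self.stages = nn.ModuleList()
        in_ch = 3
        for dim, depth in zip(dims, depths):
            # Stride-4 patch embed first, then stride-2 downsampling between stages.
            stride = 4 if in_ch == 3 else 2
            self.downsamples.append(nn.Conv2d(in_ch, dim, kernel_size=stride, stride=stride))
            self.stages.append(nn.Sequential(*[Block(dim) for _ in range(depth)]))
            in_ch = dim

    def forward(self, x):                                    # x: (B, 3, H, W)
        feats = []
        for down, stage in zip(self.downsamples, self.stages):
            x = down(x)                                      # (B, C_i, H_i, W_i)
            B, C, H, W = x.shape
            x = stage(x.flatten(2).transpose(1, 2))          # run blocks on tokens
            x = x.transpose(1, 2).reshape(B, C, H, W)
            feats.append(x)                                  # multi-scale pyramid outputs
        return feats

outs = FourStageBackbone()(torch.randn(1, 3, 224, 224))
print([o.shape for o in outs])                               # strides 4, 8, 16, 32
```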

Experiments
Datasets
ImageNet-1K: a large-scale image classification dataset with 1000 categories
ImageNet-C: a 224x224-sized test set that includes various types of distortions (corruptions)
ImageNet-A: a test set of naturally challenging images that ResNet-50 fails to classify correctly
ImageNet-R: a test set containing renditions (art, cartoons, sketches, etc.) of ImageNet classes
ImageNet-Sketch: contains hand-drawn sketch images
ImageNet-V2: contains newly collected test images

Experiments
Datasets: ImageNet-1K (a large-scale image classification dataset with 1000 categories) and ImageNet-A (naturally challenging images that are hard to classify correctly)
(Figure: comparison of MaxViT, ConvNeXt, and TransNeXt)

Summary & Discussion
Summary:
1. Biomimetic design
2. Efficient attention mechanisms
3. State-of-the-art performance
Discussion – insights from TransNeXt:
1. Biologically inspired design
2. Integration of multiple attention mechanisms
3. Effectiveness of gating mechanisms
4. Combining convolution and self-attention