TransNeXt: Robust Foveal
Visual Perception for Vision
Transformers
CVPR 2024
Background
The workflow of ViT (sketched in code below)
1. Image patching
2. Patch embedding
3. Position embedding
4. Transformer encoder
5. Task head
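A minimal PyTorch sketch of the five steps above (the `TinyViT` name and all hyperparameters are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT sketch following the five steps above (illustrative only)."""
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # 1. Image patching + 2. Patch embedding (done jointly by a strided conv)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # 3. Position embedding (learnable, one vector per patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # 4. Transformer encoder (stack of standard encoder layers)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 5. Task head (here: image classification over mean-pooled tokens)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, N, dim) token sequence
        x = x + self.pos_embed                 # add position information
        x = self.encoder(x)                    # global token mixing
        return self.head(x.mean(dim=1))        # pool tokens -> class logits
```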
A quick review of vision transformer
Self-Attention Mechanism
token mixer
•establish global connections between
different tokens
•exchange information and fuse
features among tokens
MLP (Multi-Layer Perceptron)
channel mixer
•mix information across the different
channels
•enhance the internal representation (see the block sketch below)
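The two roles are easiest to see in a single encoder block; a minimal sketch (the `EncoderBlock` name and hyperparameters are illustrative assumptions):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer block = token mixer (self-attention) + channel mixer (MLP)."""
    def __init__(self, dim=192, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixer: every token attends to every other token (global mixing)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixer: applied to each token independently, mixes channels
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x):                                    # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # token mixing
        x = x + self.mlp(self.norm2(x))                      # channel mixing
        return x
```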
Background
A quick review of vision transformer
Advantages
•long-range dependencies
•weaker inductive bias than CNN
Disadvantages
•columnar structure with coarse patches
•single-scale and low-resolution output
•overlook fine-grained local details
•challenging for dense prediction tasks (e.g., object detection, semantic segmentation)
•high computational and memory costs
Background
Related Work
CNNs (VGG, ResNet, etc.) → introduce the pyramid structure → Pyramid ViT
Key characteristics of CNNs:
•increasing channels
•decreasing spatial dimensions
•hierarchical feature learning
Benefits:
•multi-scale feature extraction
•improve computational efficiency
•improve performance on dense
prediction tasks
Local attention
limit the attention computation to a fixed local window and achieve cross-window information exchange
by shifting windows between layers (see the sketch below)
Limitation: cannot fully capture the global context
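A rough sketch of the window-partition idea behind this kind of local attention (simplified in the spirit of shifted-window attention; the attention mask for shifted edge windows is omitted, and all names here are hypothetical):

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws*ws, C) non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_merge(win, ws, H, W, C):
    """Inverse of window_partition."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

def local_attention(x, attn, ws=7, shift=0):
    """Attention restricted to fixed local windows; shifting between layers
    exchanges information across windows (edge masking omitted for brevity)."""
    B, H, W, C = x.shape
    if shift:                                   # shift the feature map, not the windows
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    win = window_partition(x, ws)               # attention is computed per window
    win = attn(win, win, win, need_weights=False)[0]
    x = window_merge(win, ws, H, W, C)
    if shift:
        x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))
    return x

# usage: alternate shift=0 and shift=ws//2 in consecutive layers
attn = nn.MultiheadAttention(embed_dim=96, num_heads=3, batch_first=True)
x = torch.randn(2, 56, 56, 96)
y = local_attention(x, attn, ws=7, shift=0)
y = local_attention(y, attn, ws=7, shift=3)
```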
Strategies for realizing the pyramid structure in vision transformers
Related Work
Problem 1:
unable to form sufficient information mixing
Pooling attention
incorporate pooling attention mechanisms to reduce the number of tokens and emphasize the most
informative features (see the sketch below)
Limitation: causes loss of fine-grained detail
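A hedged sketch of the downsampled key/value idea behind pooling attention (a generic simplification, not any specific model's implementation; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Queries stay at full resolution; keys/values come from a pooled
    (coarse) feature map, reducing token count and computational cost."""
    def __init__(self, dim=96, heads=3, pool=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool)             # (H, W) -> (pool, pool)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)                    # (B, H*W, C) full-res queries
        kv = self.pool(x).flatten(2).transpose(1, 2)        # (B, pool*pool, C) coarse keys/values
        out = self.attn(q, kv, kv, need_weights=False)[0]   # fine-grained detail is lost here
        return out.transpose(1, 2).reshape(B, C, H, W)

y = PoolingAttention()(torch.randn(2, 96, 56, 56))          # -> (2, 96, 56, 56)
```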
[Figure: the marked point represents the current query's position.]
Problem 2:
unnatural visual perception
The black area represents the region that the
current query cannot perceive
Related Work
Visualization of the effective receptive field
Biological vision:
higher acuity for features around the visual focus and lower acuity for distant features
Motivation:
1.explore a visual modeling approach that closely aligns with biological vision
2.achieve information perception closer to human foveal vision
Problems and Motivation
Problems:
1.unable to form sufficient information mixing
2.unnatural visual perception
Proposal 1: Self-attention → Aggregated attention
a token mixer based on a biomimetic foveal vision design
Proposals
Proposal 2: MLP → Convolutional GLU
a channel mixer with gated channel attention
Proposed model:
TransNeXt
Overall architecture of pixel-focused attention
Dual-path design: two attention paths
1.fine-grained attention (query-centered sliding window attention)
2.coarse-grained attention (pooling attention)
Pixel-focused Attention
Target:
1.possess fine-grained perception in the vicinity of each query
2.concurrently maintain a coarse-grained awareness of global information
Pixel-focused Attention
Details of pixel-focused attention (sketched in code below)
3. Calculate the attention logits of the two paths and concatenate them
4. Apply a softmax over the concatenated logits to obtain the attention weights of each path
5. Calculate the final output as the weighted sum of the values from both paths
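A simplified, single-head sketch of the dual-path computation in steps 3–5 (positional bias, length scaling, multi-head handling, and other details of the paper are omitted; the sliding window is gathered with `F.unfold`; this is an illustrative reconstruction, not the official code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelFocusedAttention(nn.Module):
    """Dual-path attention sketch: each query attends to (a) keys/values in a
    k x k sliding window around its own position (fine-grained path) and
    (b) keys/values from a pooled feature map (coarse-grained path).
    Both similarity sets share one softmax, so the paths compete for weight."""
    def __init__(self, dim=64, window=3, pool=7):
        super().__init__()
        self.window, self.pool, self.scale = window, pool, dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.downsample = nn.AdaptiveAvgPool2d(pool)

    def forward(self, x):                                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2)) * self.scale         # (B, HW, C)

        # Fine-grained path: k x k neighborhood of every pixel via unfold
        kv = self.kv(x.flatten(2).transpose(1, 2))                    # (B, HW, 2C)
        kv = kv.transpose(1, 2).reshape(B, 2 * C, H, W)
        kv = F.unfold(kv, self.window, padding=self.window // 2)      # (B, 2C*k*k, HW)
        kv = kv.view(B, 2 * C, self.window ** 2, H * W).permute(0, 3, 2, 1)
        k_fine, v_fine = kv.split(C, dim=-1)                          # (B, HW, k*k, C)

        # Coarse-grained path: pooled keys/values shared by all queries
        kv_c = self.kv(self.downsample(x).flatten(2).transpose(1, 2)) # (B, p*p, 2C)
        k_coarse, v_coarse = kv_c.split(C, dim=-1)

        # Step 3: similarities of both paths, concatenated per query
        sim_fine = torch.einsum('bnc,bnkc->bnk', q, k_fine)           # (B, HW, k*k)
        sim_coarse = torch.einsum('bnc,bmc->bnm', q, k_coarse)        # (B, HW, p*p)
        # Step 4: one softmax over the concatenated logits
        attn = torch.cat([sim_fine, sim_coarse], dim=-1).softmax(-1)
        w_fine, w_coarse = attn.split([self.window ** 2, self.pool ** 2], dim=-1)

        # Step 5: weighted sum of the values from both paths
        out = torch.einsum('bnk,bnkc->bnc', w_fine, v_fine) \
            + torch.einsum('bnm,bmc->bnc', w_coarse, v_coarse)
        return out.transpose(1, 2).reshape(B, C, H, W)

y = PixelFocusedAttention()(torch.randn(1, 64, 28, 28))               # -> (1, 64, 28, 28)
```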
Pixel-focused Attention
Summary of pixel-focused attention
•Pixel-focused attention employs a dual-path design with fine-grained attention and coarse-grained attention.
•This mechanism allows each query to perceive both local details and global context, effectively mimicking biological vision
and enhancing the model's natural visual perception.
Aggregation (Pixel-focused) Attention
Query Embedding
Why use query embedding? –Task adaptation
Query embedding is a learnable query vector that provides additional learning capacity for each task. This allows the model to
dynamically adjust attention weights based on task requirements.
Formula:
Query in traditional QKV – generates attention weights
The calculated attention weights help the model focus on different parts of the input data.
additional learning capacity for a specific task (see the sketch below)
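The formula itself is not reproduced on this slide; as a hedged sketch of the idea described above (assuming the query embedding QE is a single learnable vector added to every query token, which should be checked against the paper):

```latex
% Hedged reconstruction: a learnable query embedding QE is added to every
% query token before the attention weights are computed.
\hat{Q}^{(i,j)} = Q^{(i,j)} + \mathrm{QE},
\qquad Q^{(i,j)} = X^{(i,j)} W_Q,
\qquad \mathrm{QE} \in \mathbb{R}^{d}
```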
Aggregation (Pixel-focused) Attention
Learnable Tokens
Formula:
What are learnable tokens?
Unlike fixed tokens that are directly derived from the input data, learnable tokens can be optimized to improve the model’s
performance on specific tasks.
Why use learnable tokens? – 1) Dynamic attention weights 2) Task-specific adaptation
generate more flexible and dynamic attention weights → better feature aggregation
adapt to different tasks by learning → focus on relevant features for each specific task (see the sketch below)
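A hedged sketch of how learnable key-value tokens could enter the attention computation (the placement shown, concatenating extra learnable tokens to the keys/values so every query can attend to them, is an assumption; the `LearnableKVTokens` name and `n_tok` are hypothetical):

```python
import torch
import torch.nn as nn

class LearnableKVTokens(nn.Module):
    """Sketch: n_tok learnable tokens are concatenated with the input-derived
    keys/values, giving every query extra task-adaptive tokens to attend to."""
    def __init__(self, dim=64, n_tok=8, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, n_tok, dim) * 0.02)  # learnable, input-independent
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q, kv):                       # q: (B, N, C), kv: (B, M, C)
        t = self.tokens.expand(kv.size(0), -1, -1)  # broadcast to the batch
        kv = torch.cat([kv, t], dim=1)              # (B, M + n_tok, C)
        return self.attn(q, kv, kv, need_weights=False)[0]

out = LearnableKVTokens()(torch.randn(2, 196, 64), torch.randn(2, 49, 64))
```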
Proposal 1: Self-attention → Aggregated attention
a token mixer based on a biomimetic foveal vision design
Proposals
Proposal 2: MLP → Convolutional GLU
a channel mixer with gated channel attention
Proposed model:
TransNeXt
✓
Channel Mixer
What is channel mixer?
Channel mixer is a mechanism for processing and integrating information across different channels of a
feature map
The need for channel mixing:
•Information integration
Each channel in a feature map focuses on different aspects of the input. To make better predictions,
the model needs to combine and integrate information from all these channels.
•Feature enhancement
Some features might be more important than others. A channel mixer helps in emphasizing significant
features while suppressing less relevant ones.
•Diverse representations
Mixing channels leads to richer and more diverse feature representations, which can improve the
overall performance of the model.
Prevalent Channel-mixer Design
MLP (multi-layer perceptron):
1.the simplest form of a channel mixer
2.combines information from different channels through fully connected layers
MLP + depthwise convolution:
1.extracts local features more effectively
2.provides additional spatial context – helps the model understand the spatial arrangement of features within an image
GLU (gated linear unit):
1.GLU consists of two linear projections: 1) a gating mechanism 2) a value branch
2.The gating mechanism controls which elements of the value branch are passed through, acting as a filter
to zero out less important features.
3.This captures more complex interactions between features than simple linear layers (see the sketch below).
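A minimal sketch of a GLU-style channel mixer as described above (a generic variant with a GELU-activated gate; the `GLUMixer` name and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class GLUMixer(nn.Module):
    """Gated linear unit channel mixer: value branch * activated gating branch."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.value = nn.Linear(dim, hidden)    # value branch
        self.gate = nn.Linear(dim, hidden)     # gating branch (acts as a per-channel filter)
        self.act = nn.GELU()
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):                      # x: (B, N, C), applied per token
        return self.out(self.value(x) * self.act(self.gate(x)))

y = GLUMixer()(torch.randn(2, 196, 64))        # -> (2, 196, 64)
```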
Convolutional GLU
GLU + depthwise convolution
Converts GLU into a gated channel attention mechanism based on nearest-neighbor features
Benefits
1.enhances GLU by incorporating a 3×3 depthwise convolution before the gating function
2.captures local spatial relationships and positional information
3.makes the channel mixer more robust (see the sketch below)
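A hedged sketch of the Convolutional GLU idea: the gating branch passes through a 3×3 depthwise convolution before its activation, so each token's gate depends on nearest-neighbor features (a simplification under these assumptions, not the official implementation):

```python
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Convolutional GLU sketch: the gate is computed from a 3x3 depthwise-
    convolved (nearest-neighbor) view of the features, then multiplies the value branch."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.value = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise 3x3
        self.act = nn.GELU()
        self.out = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                          # x: (B, H*W, C) token sequence
        B, N, _ = x.shape
        g = self.gate(x).transpose(1, 2).reshape(B, -1, H, W)
        g = self.dwconv(g).flatten(2).transpose(1, 2)    # local spatial context enters the gate
        return self.out(self.value(x) * self.act(g))     # gated channel attention

y = ConvGLU()(torch.randn(2, 14 * 14, 64), 14, 14)       # -> (2, 196, 64)
```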
Overall architecture of TransNeXt
TransNeXt
Four-stage hierarchical backbone (see the sketch below)
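A generic sketch of a four-stage hierarchical (pyramid) backbone of this kind; the channel counts, depths, stem, and placeholder block are assumptions for illustration, not TransNeXt's actual configuration:

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch, depth, block):
    """Downsample (stride-2 merging) then `depth` mixer blocks."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)]
    layers += [block(out_ch) for _ in range(depth)]
    return nn.Sequential(*layers)

class FourStageBackbone(nn.Module):
    """Pyramid backbone: spatial size shrinks to 1/4, 1/8, 1/16, 1/32 while
    channels grow, producing multi-scale features for dense prediction heads."""
    def __init__(self, block, dims=(64, 128, 256, 512), depths=(2, 2, 6, 2)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)       # 1/4 resolution
        self.stages = nn.ModuleList([
            nn.Sequential(*[block(dims[0]) for _ in range(depths[0])]),  # stage 1 at 1/4
            stage(dims[0], dims[1], depths[1], block),                   # 1/8
            stage(dims[1], dims[2], depths[2], block),                   # 1/16
            stage(dims[2], dims[3], depths[3], block),                   # 1/32
        ])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for s in self.stages:
            x = s(x)
            feats.append(x)          # one feature map per scale
        return feats

# usage with a trivial placeholder block (real blocks: aggregated attention + ConvGLU)
blk = lambda c: nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.GELU())
feats = FourStageBackbone(blk)(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])      # 1/4, 1/8, 1/16, 1/32 resolution maps
```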
Experiments
ImageNet-1K: a large-scale image classification dataset with 1000
categories
ImageNet-C: a 224×224-sized test set that includes various types of corruptions and distortions
ImageNet-A: a test set of natural adversarial examples that are challenging to classify correctly (images that a standard ResNet-50 misclassifies)
ImageNet-R: a test set containing artistic renditions (paintings, cartoons, sketches, etc.) of ImageNet classes
ImageNet-Sketch: contains hand-drawn sketch images of ImageNet classes
ImageNet-V2: a newly collected test set following the original ImageNet protocol
Experiments
Dataset
ImageNet-1K: a large-scale image classification dataset with 1000 categories
ImageNet-A: a test set of natural adversarial examples that are challenging to classify correctly
[Results figure: accuracy comparison of MaxViT, ConvNeXt, and TransNeXt]
Summary
Summary & Discussion
1.Biomimetic design
2.Efficient attention mechanisms
3.State-of-the-art performance
Discussion – Insights from TransNeXt
1.Biologically inspired design
2.Integration of multiple attention mechanisms
3.Effectiveness of gating mechanisms
4.Combining convolution and self-attention