250310_Thanh_LabSeminar[WiGNet: Windowed Vision Graph Neural Network].pptx

thanhdowork, Mar 10, 2025

About This Presentation

WiGNet: Windowed Vision Graph Neural Network


Slide Content

WiGNet: Windowed Vision Graph Neural Network. Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: osfa19730@catholic.ac.kr. 2025/03/10. Gabriele Spadaro et al., WACV 2025.

Introduction: Deep neural networks have significantly advanced computer vision. Models like CNNs and ViTs are well established in tasks such as image classification. Vision GNNs have emerged, applying graph convolutions instead of 2D convolutions or self-attention. Problem: vision GNNs face high computational complexity when dealing with large-scale datasets and high-resolution images; the computational complexity of ViG increases quadratically with the number of nodes extracted from the image, and hence with the image size.

Solution: WiGNet addresses the scalability issues of vision GNNs. Core idea: the image is partitioned into non-overlapping windows, and a separate graph is built for each window. WiGNet's complexity grows linearly with the number of windows, unlike the quadratic complexity of previous vision GNNs. By focusing on localized regions, WiGNet efficiently captures relevant features.
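The window partitioning at the heart of this idea can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the function name `window_partition` and the toy sizes are my own, and the window side length is assumed to divide the feature-map dimensions exactly.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win*win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    # -> (num_windows, win*win, C): each group of rows is one local graph's nodes
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

# toy example: an 8x8 map with 4-channel features, 4x4 windows -> 4 windows of 16 nodes
feat = np.arange(8 * 8 * 4, dtype=np.float32).reshape(8, 8, 4)
wins = window_partition(feat, 4)
print(wins.shape)  # (4, 16, 4)
```

Each row of `wins` holds the node features of one window, so a graph can then be built per window without ever comparing nodes across windows.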

Contributions: First to introduce windowed processing in vision GNNs. WiGNet's computational and memory complexity scales linearly with image size, broadening the applications of graph-based models in computer vision.

Related Work, CNNs and ViTs: CNNs, starting with AlexNet, have dominated computer vision, exploiting pixel locality to extract features. Vision Transformers (ViTs) utilize the self-attention mechanism. Swin Transformer employs a hierarchical architecture with multi-head self-attention in non-overlapping windows. WiGNet differs from Swin Transformer by operating on k-NN graphs and using a GNN function instead of self-attention.

Related Work, Graphs in Computer Vision: GNNs extend CNNs to graph-structured data. The Vision GNN (ViG) model divides images into smaller patches, treating each patch as a node in a graph; the k-NN algorithm establishes connections between nodes based on feature similarity. Techniques like MobileViG and GreedyViG aim to reduce ViG's complexity. WiGNet uniquely works on the original feature maps, partitioning them into fixed-size windows.

Model, WiGNet Overview: a four-stage pyramidal feature extractor. Components: Stem, a feature extractor with three convolutional layers. WiGNet block, composed of a Window-based Grapher module and a Feed-Forward Network (FFN); the Grapher module partitions the image into non-overlapping windows, builds a graph for each window, and applies local GNN updates to each window, while the FFN module further encourages feature diversity. Downsampling block, which reduces the feature dimension by merging node representations.

Model, Window-based Grapher Module: partitions the input tensor into non-overlapping windows. The Dynamic Graph Convolution component builds a graph and performs graph convolution for each window independently, using the k-NN algorithm. The Windows Reverse component reshapes the output back to the feature-map layout. A skip connection completes the Window-based Grapher block. Figure 2 (a): WiGNet architecture exemplified for the Tiny version, equipped with a linear classifier. (b): a graphical illustration of the Window-based Grapher module.
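The flow of the module — partition, independent per-window update, reverse, skip connection — can be sketched as follows. This is a hedged NumPy illustration; `grapher_block`, `window_reverse`, and the placeholder `update` function are hypothetical names, not the paper's code, and the per-window update is left as a pluggable argument.

```python
import numpy as np

def window_reverse(wins, win, H, W):
    """Inverse of window partitioning: (num_windows, win*win, C) -> (H, W, C)."""
    C = wins.shape[-1]
    x = wins.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def grapher_block(x, win, update):
    """Partition -> independent per-window update -> reverse -> skip connection."""
    H, W, C = x.shape
    wins = x.reshape(H // win, win, W // win, win, C) \
            .transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)
    wins = np.stack([update(w) for w in wins])  # each window is updated in isolation
    return window_reverse(wins, win, H, W) + x  # residual/skip connection

feat = np.random.rand(8, 8, 3).astype(np.float32)
out = grapher_block(feat, 4, update=lambda w: w - w.mean(0))  # toy node update
print(out.shape)  # (8, 8, 3)
```

Because `update` only ever sees one window's nodes, its cost depends on the window size rather than the full image size, which is what makes the overall block scale linearly with the number of windows.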

Model, Window-based Grapher Module. Table 1: detailed settings of the WiGNet series. D: feature dimension, E: hidden dimension ratio in FFN, k: number of neighbors in GCN, W: window size, H × W: input image size. 'Ti' denotes tiny, 'S' denotes small, and 'M' denotes medium.

Model, Dynamic Graph Convolution: for each window, the k-NN algorithm produces a graph, and Max-Relative graph convolution updates the node representations. Given an input feature map, the slide gives the overall transfer function of the Window-based Grapher module. Figure 4: illustrative example of the dynamic graph convolution of WiGNet.
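A minimal sketch of the per-window dynamic graph convolution, assuming the Max-Relative update from the original ViG paper — each node is concatenated with the element-wise maximum of its relative neighbor features, x_i' = concat(x_i, max_j(x_j − x_i)) over the k nearest neighbors. The name `knn_max_relative` is hypothetical, and the learned projection weights that would normally follow are omitted.

```python
import numpy as np

def knn_max_relative(X, k):
    """Dynamic graph conv inside one window: k-NN graph + Max-Relative update."""
    # pairwise squared Euclidean distances between the N nodes in this window
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest neighbors
    rel = X[nbrs] - X[:, None, :]              # x_j - x_i for each neighbor j
    return np.concatenate([X, rel.max(axis=1)], axis=-1)  # (N, 2C)

nodes = np.random.rand(16, 4).astype(np.float32)  # one 4x4 window, C = 4
out = knn_max_relative(nodes, k=3)
print(out.shape)  # (16, 8)
```

The graph is "dynamic" in that the neighbor set is recomputed from the current features at every layer, rather than fixed by spatial position.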

Model, FFN Module: encourages feature diversity using a multi-layer perceptron with two fully connected layers and a residual connection to the Grapher output Y.
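As described, the FFN is a two-layer MLP with a residual connection. A minimal sketch, with the caveat that the activation is an assumption (ReLU here for brevity; the paper's choice may differ) and the weight matrices are random placeholders rather than trained parameters:

```python
import numpy as np

def ffn(Y, W1, W2):
    """Two-layer MLP with residual connection: Z = act(Y @ W1) @ W2 + Y."""
    return np.maximum(Y @ W1, 0.0) @ W2 + Y  # ReLU assumed as the activation

# toy shapes: 16 nodes, feature dim 8, hidden dim 32 (hidden ratio E = 4)
Y = np.random.rand(16, 8).astype(np.float32)
W1 = np.random.rand(8, 32).astype(np.float32)
W2 = np.random.rand(32, 8).astype(np.float32)
print(ffn(Y, W1, W2).shape)  # (16, 8)
```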

Model, Shifted Windows: includes a shifting operator to introduce cross-window connections. A cycling operation re-partitions the feature map, and a masking mechanism allows connections only between adjacent nodes; the number of neighbors is adjusted linearly based on the masking mechanism. Figure 3: overview of the cycling operation used to obtain shifted windows. The top-left part of the feature map is copied to the bottom-right part, then the masking mechanism is used to avoid connections between non-adjacent nodes in the original feature map.
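The cycling operation is analogous to the cyclic shift used in Swin Transformer and can be sketched as a roll of the feature map by half a window, so that the new windows straddle the old window boundaries. A hedged illustration: the masking step that blocks edges between nodes that were not adjacent before the roll is only indicated in a comment.

```python
import numpy as np

def cyclic_shift(x, win):
    """Roll the feature map by half a window along both spatial axes.

    After this, regular window partitioning yields windows that cross the
    original boundaries; a mask (not shown) would then forbid graph edges
    between nodes that were non-adjacent in the original map.
    """
    return np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

feat = np.arange(8 * 8).reshape(8, 8)
shifted = cyclic_shift(feat, 4)
print(shifted[0, 0])  # the element originally at (2, 2), i.e. 18
```

Rolling instead of padding keeps the number of windows fixed, which is why only a mask (and an adjusted neighbor count) is needed rather than extra computation.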

Model, Complexity Considerations: WiGNet's k-NN complexity grows linearly with the number of patches, while ViG's k-NN complexity grows quadratically. Figure 5: computational complexity and GPU memory footprint of several vision GNN architectures and WiGNet, in terms of MACs and MB, on an NVIDIA GeForce RTX 3090 GPU.
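A back-of-envelope comparison of the two k-NN costs, under the assumption that building a k-NN graph over n nodes costs on the order of n² pairwise distances; the specific node counts below are illustrative, not taken from the paper.

```python
def pairwise_ops(n_nodes):
    """Global k-NN graph (ViG-style): quadratic in the node count."""
    return n_nodes ** 2

def windowed_ops(n_nodes, win_nodes):
    """Fixed-size windows (WiGNet-style): (n/m) windows of m nodes each,
    so (n/m) * m^2 = n * m distance computations -- linear in n."""
    return (n_nodes // win_nodes) * win_nodes ** 2

# e.g. 32x32, 64x64, 128x128 patch grids, with 16-node (4x4) windows
for n in (1024, 4096, 16384):
    print(n, pairwise_ops(n), windowed_ops(n, win_nodes=16))
```

Quadrupling the node count quadruples the windowed cost but multiplies the global cost by sixteen, which matches the linear-vs-quadratic scaling claimed above.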

Experiments, Datasets: the ImageNet-1K dataset for image classification, and the CelebA-HQ dataset to test adaptability in a downstream task with high-resolution images.

Experiments, Main Results. Table 2: results of WiGNet and other deep learning methods on ImageNet.

Experiments, Main Results. Figure 6: comparison of graph-based models on 512 × 512 resolution images; the size of the dots represents the MACs used.

Experiments, Ablation Analysis. Table 5: ImageNet results using different graph convolutional layers; comparison performed on the tiny model size without the shifting operator. Table 3: ablation study on the impact of the shifting operation and the adaptive k-NN strategy for WiGNet on ImageNet. Table 4: ablation study on the impact of the shifting operation for higher-resolution images from the CelebA-HQ dataset.

Conclusion: WiGNet is a new windowed vision GNN for image analysis tasks, constructing graphs in local windows. Its computational and memory complexity scales linearly with image size. WiGNet achieves a better trade-off between accuracy and complexity than other graph-based models, and is well suited to working with high-resolution images.