Swin_Transformer_Presentation_VITpp.pptx

bhaveshagrawal35 · 10 slides · Oct 08, 2024



Slide Content

Understanding Swin Transformer: Concepts, Implementation, and Mathematical Explanation
Presented By: Bhavesh Agrawal

Introduction to Vision Transformers
• What is a Transformer?
  - Originally introduced for NLP tasks.
  - Utilizes self-attention mechanisms.
• Why Vision Transformers (ViT)?
  - Suitable for image classification.
  - Effectively captures global dependencies in images.

What is Swin Transformer?
• Definition:
  - A hierarchical Vision Transformer that divides the image into non-overlapping patches and computes self-attention within local windows of those patches.
• Why 'Swin'?
  - 'Swin' stands for Shifted Window, referring to how the model calculates self-attention within local windows.
• Key Features:
  - Shifted window partitioning.
  - Hierarchical feature representation.
  - Efficient computation with complexity linear in image size.

How Does Swin Transformer Work?
• Step-by-Step Explanation:
  1. Patch Partitioning: The image is divided into small patches (e.g., 4x4 or 8x8 pixels).
  2. Linear Embedding: Each patch is embedded into a fixed-dimensional vector.
  3. Self-Attention: Attention is computed within non-overlapping local windows of patches.
  4. Shifted Window Operation: Windows are shifted in alternating layers to create cross-window connections.
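The partitioning and shifting steps above can be sketched in NumPy. This is a minimal illustration, not the actual Swin implementation: the `window_partition` helper and the toy 8x8 single-channel feature map are chosen for demonstration (real Swin operates on embedded patch features with many channels).

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows."""
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Reorder so each window's pixels are contiguous: (num_windows, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

# Toy 8x8 feature map with 1 channel.
x = np.arange(8 * 8).reshape(8, 8, 1)
windows = window_partition(x, window_size=4)
print(windows.shape)  # (4, 4, 4, 1): four 4x4 windows

# Shifted-window step: cyclically shift the map by half a window before
# partitioning, so the next attention round mixes information across the
# previous window borders.
shifted = np.roll(x, shift=(-2, -2), axis=(0, 1))
shifted_windows = window_partition(shifted, window_size=4)
```

Self-attention is then computed independently inside each window, which is what keeps the cost linear in the number of windows rather than quadratic in the full image.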

Mathematical Formulation
• Self-Attention:
  Attention(Q, K, V) = Softmax((QK^T) / sqrt(d_k)) V
  where Q (Query), K (Key), and V (Value) are matrices derived from the input, and d_k is the dimensionality of the key vectors.
• Shifting Mechanism:
  - Windows are shifted by a pre-defined number of pixels (half the window size in the original paper), enhancing cross-window information sharing.
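The self-attention formula translates almost line-for-line into NumPy. A minimal sketch (the `softmax` and `attention` helper names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(z):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (QK^T) / sqrt(d_k)
    return softmax(scores) @ V       # row-wise softmax, then weight the values

# Toy check: with near-orthogonal queries/keys, each query attends almost
# entirely to its matching key, so the output is close to V itself.
Q = np.array([[10., 0.], [0., 10.]])
K = np.array([[1., 0.], [0., 1.]])
V = np.array([[1., 2.], [3., 4.]])
print(attention(Q, K, V))
```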

Example 1: Implementing Swin Transformer
# Code Implementation (Python & PyTorch):
import torch
from transformers import SwinForImageClassification, SwinConfig

# Define configuration
config = SwinConfig(image_size=224, patch_size=4, num_labels=1000)

# Create a Swin model
model = SwinForImageClassification(config)

# Example input tensor (batch of 2, 3 channels, 224x224 size)
input_tensor = torch.rand(2, 3, 224, 224)

# Get model predictions
outputs = model(input_tensor)
print(outputs.logits)

Example 2: Mathematical Analysis with Simple Data
• Toy Example: Consider a simple 2x2 image patch; compute self-attention over it using the formula.
• Mathematical Computation:
  Given:
  Q = [[1, 2], [3, 4]]
  K = [[2, 1], [0, 1]]
  V = [[0, 2], [1, 3]]
  Compute:
  Attention(Q, K, V) = Softmax((QK^T) / sqrt(2)) V
  Show step-by-step calculations.
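The step-by-step calculation for this toy example can be worked out in NumPy, printing each intermediate so the arithmetic is easy to follow:

```python
import numpy as np

Q = np.array([[1., 2.], [3., 4.]])
K = np.array([[2., 1.], [0., 1.]])
V = np.array([[0., 2.], [1., 3.]])

# Step 1: raw scores QK^T
scores = Q @ K.T
print(scores)          # [[ 4.  2.] [10.  4.]]

# Step 2: scale by sqrt(d_k) with d_k = 2
scaled = scores / np.sqrt(2)

# Step 3: row-wise softmax (subtract the row max for stability)
e = np.exp(scaled - scaled.max(axis=1, keepdims=True))
weights = e / e.sum(axis=1, keepdims=True)
print(weights)         # each row sums to 1

# Step 4: weight the values
out = weights @ V
print(out)             # approx [[0.196, 2.196], [0.014, 2.014]]
```

Note how the second row's large score (10) makes its softmax weight nearly one-hot, so the second output row lands close to the first row of V weighted almost entirely by the first attention weight.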

Results and Observations
• Visualization: Graphically show how self-attention is computed within and across windows.
• Performance Benefits:
  - Efficient use of computational resources.
  - Better results on vision tasks with large-scale images.

Conclusion
• Summary:
  - Swin Transformer builds on ViT with hierarchical, local-window-based self-attention.
  - The shifting mechanism helps capture global information efficiently.
• Future Directions:
  - Explore Swin Transformer in object detection and semantic segmentation tasks.

Q&A Any Questions?