Transformer_BERT_ViT_Swin_Presentation.pptx

bhaveshagrawal35 · 11 slides · Oct 08, 2024

About This Presentation

Overview of Transformer, BERT, ViT, and Swin Transformer.


Slide Content

Overview of Transformer, BERT, ViT, and Swin Transformer

Transformer
1. Input Representation: convert input tokens to embeddings and add positional encodings.
2. Multi-Head Self-Attention (Encoder): calculate attention weights and generate weighted-sum representations using queries, keys, and values.
3. Feed-Forward Network (Encoder): apply a two-layer feed-forward network to each position independently.
4. Repeated Encoder Layers: pass the data through multiple encoder layers for deeper feature extraction.
5. Masked Multi-Head Self-Attention (Decoder): apply self-attention while masking future positions to prevent lookahead during sequence generation.
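
The encoder-side steps above can be condensed into a short PyTorch sketch. The sizes (d_model = 512, 8 heads, 6 layers) follow the original base configuration, but the class and helper names here are illustrative assumptions, not the slides' code.

# Minimal sketch of the encoder side of a Transformer (illustrative only).
import math
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Multi-head self-attention: queries, keys, and values all come from x.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise feed-forward network applied to each position independently.
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # weighted sum over all positions
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        return self.norm2(x + self.ff(x))     # residual around the feed-forward network

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings added to the token embeddings.
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Input representation: token embeddings + positional encodings,
# then a stack of encoder layers for deeper feature extraction.
tokens = torch.randint(0, 10000, (1, 16))                 # dummy token ids
x = nn.Embedding(10000, 512)(tokens) + positional_encoding(16, 512)
for layer in [EncoderLayer() for _ in range(6)]:
    x = layer(x)
print(x.shape)  # torch.Size([1, 16, 512])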

Transformer (continued)
6. Encoder-Decoder Attention: use encoder outputs to guide the decoder in focusing on relevant input features.
7. Feed-Forward Network (Decoder): apply another feed-forward network to the decoder outputs.
8. Output Token Generation: generate output tokens one by one using a linear layer and a softmax function to predict probabilities.
9. Training with a Loss Function: optimize the model using a loss function (e.g., cross-entropy loss) comparing predictions to the true labels.
10. Optimization and Backpropagation: update parameters using gradient-based optimization methods such as Adam to minimize the loss.
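
A matching sketch for the decoder-side and training steps: the causal mask, the encoder-decoder attention, the output projection, and one cross-entropy/Adam update. The tensors are dummy data and every name and size here is an assumption for illustration.

# Sketch of masked decoder attention, cross-attention, and a training step.
import torch
import torch.nn as nn

d_model, n_heads, vocab = 512, 8, 10000
self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
to_vocab   = nn.Linear(d_model, vocab)     # final linear layer before the softmax

enc_out = torch.randn(1, 16, d_model)      # encoder outputs (keys/values for cross-attention)
dec_in  = torch.randn(1, 12, d_model)      # embedded target tokens generated so far

# Masked self-attention: an upper-triangular boolean mask hides future positions.
causal_mask = torch.triu(torch.ones(12, 12, dtype=torch.bool), diagonal=1)
h, _ = self_attn(dec_in, dec_in, dec_in, attn_mask=causal_mask)

# Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
h, _ = cross_attn(h, enc_out, enc_out)

# Output token generation: logits over the vocabulary at every position
# (the softmax is folded into the cross-entropy loss below).
logits = to_vocab(h)

# Training: cross-entropy against the true next tokens, optimized with Adam.
targets = torch.randint(0, vocab, (1, 12))
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab), targets.view(-1))
opt = torch.optim.Adam(list(self_attn.parameters()) +
                       list(cross_attn.parameters()) +
                       list(to_vocab.parameters()), lr=1e-4)
loss.backward()                            # backpropagation
opt.step()                                 # parameter update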

BERT (Bidirectional Encoder Representations from Transformers). BERT is a language model built from Transformer encoder layers. Unlike earlier models, it reads text in both directions (left and right context) at the same time, which lets it capture meaning more accurately. BERT is pre-trained on a large amount of unlabeled text, learning deep bidirectional representations of how language works. Once pre-trained, it can be adapted to many tasks (such as question answering or text understanding) by adding a single extra output layer; no major changes to the model's core are needed. BERT is simple in concept but very effective in practice.
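
As a rough illustration of "adding a single extra layer", here is a minimal sketch of a classification head on top of a pre-trained BERT via the Hugging Face transformers library; the checkpoint name and the two-class linear head are assumptions for the example, not part of the slides.

# Sketch: pre-trained BERT + one task-specific linear layer.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)     # the single extra layer (assumed 2 classes)

inputs = tokenizer("BERT reads text in both directions.", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]        # [CLS] token representation
logits = classifier(cls_embedding)                     # task prediction
print(logits.shape)  # torch.Size([1, 2])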

Example: "The bank can ensure that the river bank is clean." In "The bank can ensure...", "bank" refers to a financial institution; in "...the river bank", it refers to the edge of a river. A traditional left-to-right model might misinterpret "bank" because it only considers the words to the left of it, missing important context from the right. With BERT's bidirectional approach, the first "bank" is read as a financial institution because BERT looks at the surrounding context, including the word "ensure"; for the second occurrence, BERT sees "river" nearby and understands that "bank" refers to the edge of a river.
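
To make the example concrete, the sketch below pulls the contextual embeddings of both occurrences of "bank" from a pre-trained BERT and compares them; the checkpoint and the cosine-similarity check are assumptions added for illustration, and the exact value will vary by model.

# Sketch: the two "bank" tokens get different contextual vectors.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "The bank can ensure that the river bank is clean."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]      # (seq_len, hidden_size)

# Locate both occurrences of the token "bank".
bank_id = tokenizer.convert_tokens_to_ids("bank")
positions = (inputs["input_ids"][0] == bank_id).nonzero(as_tuple=True)[0]

# The two vectors differ because each one attends to its full left and right context.
sim = torch.nn.functional.cosine_similarity(hidden[positions[0]],
                                            hidden[positions[1]], dim=0)
print(f"cosine similarity between the two 'bank' embeddings: {sim.item():.2f}")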

Vision Transformer (ViT). In vision tasks, attention mechanisms are usually combined with convolutional networks (CNNs) or used to replace parts of CNNs. ViT shows that CNNs are not necessary for good performance in vision tasks: a pure Transformer applied to sequences of image patches can work well for image classification. When ViT is pre-trained on large datasets and then fine-tuned on smaller benchmarks (e.g., ImageNet, CIFAR-100), it gives excellent results, performing better than state-of-the-art CNNs on several benchmarks while requiring less computational power to train.
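
A minimal sketch of the patch-embedding step that lets a pure Transformer consume images: the 16x16 patch size and 224x224 input follow common ViT configurations, while the variable names and the strided-convolution trick are assumptions for illustration.

# Sketch: turn an image into a sequence of patch tokens for a Transformer encoder.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)            # dummy RGB image
patch_size, d_model = 16, 768

# Split into 16x16 patches and linearly project each one; a convolution with
# kernel = stride = patch size does both in one step.
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(img)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens

# Prepend a learnable [CLS] token and add positional embeddings; the resulting
# sequence is fed to a standard Transformer encoder for classification.
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, d_model))
x = torch.cat([cls_token, tokens], dim=1) + pos_embed
print(x.shape)  # torch.Size([1, 197, 768])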

Swin Transformer. The Swin Transformer introduces several innovations to overcome the limitations of ViT:

Hierarchical structure with patch merging: Swin builds a hierarchical representation of the image by gradually merging patches as the network deepens. This allows the model to process features at different resolutions and reduces the computational load.

Shifted window approach: instead of global self-attention, Swin applies self-attention locally within non-overlapping windows of patches, which reduces the computational cost from quadratic to linear. In addition, a "shifted window" mechanism in alternating layers allows cross-window connections, improving global modeling (see the sketch after this list).

Linear computational complexity: by restricting self-attention to local windows, Swin reduces the complexity from O(n^2) in ViT to O(n), where n is the number of patches. This makes it scalable to high-resolution images.
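
The window partitioning and the shifted-window trick can be sketched in a few lines. The 7x7 window and the 56x56 stage-1 feature map follow typical Swin settings, but the helper function below is an illustrative assumption, not the reference implementation.

# Sketch: local-window partitioning and the cyclic shift used in alternating layers.
import torch

def window_partition(x, window_size=7):
    # x: (B, H, W, C) feature map -> (num_windows*B, window_size*window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

feat = torch.randn(1, 56, 56, 96)            # stage-1 feature map (assumed sizes)

# Regular layer: self-attention runs independently inside each 7x7 window,
# so the cost grows linearly with the number of patches.
windows = window_partition(feat)             # (64, 49, 96)

# Shifted layer: cyclically shift the map by half a window before partitioning,
# so tokens near window borders can attend across the previous layer's windows.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted)  # (64, 49, 96)
print(windows.shape, shifted_windows.shape)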