An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Anonymous (ICLR 2021 under review)
Choi Dongmin
Yonsei University Severance Hospital CCIDS
Abstract
• Transformer
- the standard architecture for NLP
• Convolutional Networks
- attention is applied while keeping their overall structure
• Transformer in Computer Vision
- a pure Transformer can perform very well on image classification tasks when applied directly to sequences of image patches
- achieves S.O.T.A. results with comparatively small computational cost when pre-trained on large datasets
Introduction
Self-attention based architecture: the Transformer (e.g. BERT)
The dominant approach: pre-training on a large text corpus and then fine-tuning on a smaller task-specific dataset
Vaswani et al. Attention Is All You Need. NIPS 2017
Introduction
Self-attention in computer vision, inspired by NLP: e.g. DETR and Axial-DeepLab
However, classic ResNet-like architectures are still S.O.T.A.
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
• Applying a Transformer Directly to Images
- with the fewest possible modifications
- provide the sequence of linear embeddings of the patches as input
- image patches = tokens (words) in NLP
• Small-Scale Training
- achieves accuracies below ResNets of comparable size
- Transformers lack some inductive biases inherent to CNNs (such as translation equivariance and locality)
• Large-Scale Training
- trumps (surpasses) inductive bias
- excellent results when pre-trained at sufficient scale and transferred
Introduction
Related Works
Vaswani et al. Attention Is All You Need. NIPS 2017
Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019
Radford et al. Improving language under- standing with unsupervised learning. Technical Report 2018
Transformer
- Standard model for NLP tasks
- Consists only of attention modules, without RNNs
- Encoder-decoder structure
- Requires large-scale datasets and high computational cost
- Pre-training and fine-tuning approaches: BERT & GPT
Method
ViT-Base: D = 768 (= 16×16×3, the flattened patch dimension)
ViT-Large: D = 1024
ViT-Huge: D = 1280
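As a quick worked check on these widths (assuming 224×224 RGB inputs with 16×16 patches, which this slide does not state explicitly):

```latex
% Flattened patch dimension (coincides with D only for ViT-Base)
P^2 \cdot C = 16 \cdot 16 \cdot 3 = 768
% Number of patch tokens per image
N = \frac{HW}{P^2} = \frac{224 \cdot 224}{16^2} = 196
% Hidden sizes of the three variants
D_{\text{Base}} = 768, \quad D_{\text{Large}} = 1024, \quad D_{\text{Huge}} = 1280
```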
Method
Before going into the formula, it helps to understand why this step is needed:
1. Training stability: reduces fluctuations in the gradients, allowing the model to use a higher learning rate and converge faster.
2. Internal covariate shift: reduces how much the distribution of activations changes across layers as the model updates its weights.
3. Placement: using "Pre-norm" (normalization before each sub-layer) helps stabilize training for deep Transformer models.
Method
Second Layer Normalization (Norm)
Re-normalizes the output after Multi-Head Attention:
• normalizes each vector in the intermediate sequence independently
• helps stabilize the input to the MLP
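A minimal PyTorch sketch of the pre-norm encoder block described above (an illustration, not the authors' code; the class name and defaults are made up):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm Transformer encoder block: LN -> MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # first Norm, applied before multi-head attention
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)   # second Norm, applied before the MLP
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, z):                # z: (batch, sequence, dim)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                      # z  = MLP(LN(z')) + z'
        return z
```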
Method
Image $x \in \mathbb{R}^{H \times W \times C}$ → a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $N = HW/P^2$
A trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ maps the patches to $x_p E \in \mathbb{R}^{N \times D}$
* because the Transformer uses a constant width (the model dimension $D$) through all of its layers
Learnable position embedding $E_{pos} \in \mathbb{R}^{(N+1) \times D}$
* to retain positional information
$z_0 = [x_{\text{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{pos}$, and $z_L$ denotes the output of the $L$-layer encoder
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
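A minimal PyTorch sketch of the embedding step above, assuming square inputs and non-overlapping patches; the linked lucidrains implementation differs in detail (it uses einops), and the class name here is illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into P x P patches, project to width D, prepend [class] token, add E_pos."""
    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2              # N = HW / P^2
        patch_dim = channels * patch_size ** 2                     # P^2 * C
        self.proj = nn.Linear(patch_dim, dim)                      # trainable linear projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [class] embedding
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)  # E_pos

    def forward(self, img):                                        # img: (B, C, H, W)
        B, C, H, W = img.shape
        P = self.patch_size
        x = img.unfold(2, P, P).unfold(3, P, P)                    # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P^2 * C)
        x = self.proj(x)                                           # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed         # z_0 = [x_class; x_p E] + E_pos
```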
Method
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py
$z \in \mathbb{R}^{N \times D}$ : input sequence
Attention weights $A_{ij}$ : similarity between $q_i$ and $k_j$
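A minimal single-head sketch of this attention computation, following the standard scaled dot-product formulation (the class name and head size are illustrative):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: A_ij = softmax(q_i · k_j / sqrt(D_h)), output = A v."""
    def __init__(self, dim=768, head_dim=64):
        super().__init__()
        self.scale = head_dim ** -0.5
        self.to_qkv = nn.Linear(dim, head_dim * 3, bias=False)   # maps z to [q, k, v]
        self.proj = nn.Linear(head_dim, dim)

    def forward(self, z):                                        # z: (B, N, D) input sequence
        q, k, v = self.to_qkv(z).chunk(3, dim=-1)                # each (B, N, D_h)
        A = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N) attention weights
        return self.proj(A @ v)                                  # weighted sum of values, back to D
```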
Method
Hybrid Architecture
Flattened intermediate feature maps of a ResNet are used as the input sequence, as in DETR
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
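A minimal sketch of the hybrid input path, assuming a torchvision ResNet-50 truncated after an intermediate stage; the stage choice and the linear projection are assumptions for illustration, not the paper's exact configuration:

```python
import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbedding(nn.Module):
    """Flatten intermediate ResNet feature maps into a token sequence for the Transformer."""
    def __init__(self, dim=768):
        super().__init__()
        backbone = resnet50(weights=None)
        # keep the stem through layer3 -> feature map of shape (B, 1024, H/16, W/16)
        self.stem = nn.Sequential(*list(backbone.children())[:-3])
        self.proj = nn.Linear(1024, dim)            # project channels to the Transformer width D

    def forward(self, img):                         # img: (B, 3, H, W)
        feat = self.stem(img)                       # (B, 1024, h, w)
        tokens = feat.flatten(2).transpose(1, 2)    # (B, h*w, 1024): each spatial position becomes a token
        return self.proj(tokens)                    # (B, h*w, D)
```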
Method
Fine-tuning and Higher Resolution
Remove the pre-trained prediction head and attach a zero-initialized D × K feedforward layer (K = the number of downstream classes)
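A minimal sketch of this head swap, assuming the model exposes its classifier as an `nn.Linear` attribute named `head` (the attribute name varies between implementations):

```python
import torch.nn as nn

def replace_head_for_finetuning(model, num_classes):
    """Replace the pre-trained prediction head with a zero-initialized D x K feedforward layer."""
    d = model.head.in_features              # D: Transformer width (assumes model.head is nn.Linear)
    new_head = nn.Linear(d, num_classes)    # K = number of downstream classes
    nn.init.zeros_(new_head.weight)         # zero-initialized, as described above
    nn.init.zeros_(new_head.bias)
    model.head = new_head
    return model
```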
Experiments
• Training & Fine-tuning
< Pre-training >
- Adam with β1 = 0.9, β2 = 0.999
- Batch size 4,096
- Weight decay 0.1 (a high weight decay is useful for transferring the resulting models)
- Linear learning rate warmup and decay
< Fine-tuning >
- SGD with momentum, batch size 512
• Metrics
- Few-shot accuracy (for fast on-the-fly evaluation)
- Fine-tuning accuracy
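A minimal sketch of these optimizer settings in plain PyTorch; the learning rates, step counts, and whether the 0.1 weight decay is an L2 penalty or decoupled (AdamW-style) are assumptions, since the slide does not specify them:

```python
import torch

def make_pretraining_optimizer(model, lr=1e-3, total_steps=100_000, warmup_steps=10_000):
    """Adam with beta1=0.9, beta2=0.999, weight decay 0.1, linear warmup then linear decay."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup from 0 to the base rate
            return step / max(1, warmup_steps)
        # linear decay back to 0 over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

def make_finetuning_optimizer(model, lr=3e-3, momentum=0.9):
    """SGD with momentum for fine-tuning (batch size 512 is set in the data loader, not here)."""
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
```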
Experiments
•Comparison to State of the Art
*BiT-L : Big Transfer, which performs supervised transfer learning with large ResNets
*Noisy Student : a large EfficientNet trained using semi-supervised learning
Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020
Xie et al. Self-training with Noisy Student improves ImageNet classification. CVPR 2020
Experiments
•Pre-training Data Requirements
(figure: performance improves as the pre-training dataset grows)
Experiments
• Scaling Study
Experiments
•Inspecting Vision Transformer
- The components (principal components of the learned embedding filters) resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch
- Attention distance is analogous to receptive field size in CNNs
Conclusion
•Application of Transformers to Image Recognition
- no image-specific inductive biases in the architecture
- interpret an image as a sequence of patches and process it with a standard Transformer encoder
- this simple yet scalable strategy works
- matches or exceeds the S.O.T.A. while being relatively cheap to pre-train
•Many Challenges Remain
-other computer vision tasks, such as detection and segmentation
-further scaling ViT
Q&A
• ViT for Segmentation
• Fine-tuning on a Grayscale Dataset