Transformers for Video Understanding (2020-Present)
• Transformer characteristics:
  • Scale with larger datasets
  • Can naturally handle any input that can be "tokenized"
  • Inherent attention mechanism for spatio-temporal information encoding
  • Handle long-range dependencies for better scene understanding
  • Can accommodate multiple modalities
3.1. Overview of Vision Transformers (ViT)
Vision Transformer (ViT) [18] adapts the transformer architecture of [68] to process 2D images with minimal changes. In particular, ViT extracts $N$ non-overlapping image patches, $x_i \in \mathbb{R}^{h \times w}$, performs a linear projection and then rasterises them into 1D tokens $z_i \in \mathbb{R}^{d}$. The sequence of tokens input to the following transformer encoder is

$\mathbf{z} = [z_{\mathrm{cls}}, E x_1, E x_2, \ldots, E x_N] + \mathbf{p}$,   (1)
where the projection by $E$ is equivalent to a 2D convolution. As shown in Fig. 1, an optional learned classification token $z_{\mathrm{cls}}$ is prepended to this sequence, and its representation at the final layer of the encoder serves as the final representation used by the classification layer [17]. In addition, a learned positional embedding, $\mathbf{p} \in \mathbb{R}^{N \times d}$, is added to the tokens to retain positional information, as the subsequent self-attention operations in the transformer are permutation invariant. The tokens are then passed through an encoder consisting of a sequence of $L$ transformer layers. Each layer $\ell$ comprises Multi-Headed Self-Attention [68], layer normalisation (LN) [2], and MLP blocks as follows:
$\mathbf{y}^{\ell} = \mathrm{MSA}(\mathrm{LN}(\mathbf{z}^{\ell})) + \mathbf{z}^{\ell}$,   (2)
$\mathbf{z}^{\ell+1} = \mathrm{MLP}(\mathrm{LN}(\mathbf{y}^{\ell})) + \mathbf{y}^{\ell}$.   (3)
The MLP consists of two linear projections separated by a GELU non-linearity [28], and the token dimensionality, $d$, remains fixed throughout all layers. Finally, a linear classifier is used to classify the encoded input based on $z_{\mathrm{cls}}^{L} \in \mathbb{R}^{d}$, if it was prepended to the input, or a global average pooling of all the tokens, $\mathbf{z}^{L}$, otherwise.
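To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch-style sketch of the ViT pipeline described above: the projection $E$ as a strided 2D convolution, a prepended classification token, a learned positional embedding, and a stack of pre-norm encoder layers. The class names and hyperparameter values (patch size 16, $d=768$, 12 layers) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One pre-norm transformer layer implementing Eqs. (2)-(3)."""
    def __init__(self, d, heads, mlp_ratio=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d))

    def forward(self, z):
        h = self.ln1(z)
        y = self.msa(h, h, h, need_weights=False)[0] + z   # Eq. (2)
        return self.mlp(self.ln2(y)) + y                   # Eq. (3)

class ViTSketch(nn.Module):
    """Patch embedding, cls token, positional embedding, L encoder layers, linear head."""
    def __init__(self, img=224, patch=16, d=768, depth=12, heads=12, classes=400):
        super().__init__()
        n = (img // patch) ** 2                                         # N patches per image
        self.embed = nn.Conv2d(3, d, kernel_size=patch, stride=patch)  # projection E as a 2D conv
        self.cls = nn.Parameter(torch.zeros(1, 1, d))                  # learned z_cls token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, d))              # learned p (N+1 with z_cls)
        self.layers = nn.ModuleList([EncoderLayer(d, heads) for _ in range(depth)])
        self.head = nn.Linear(d, classes)                              # linear classifier

    def forward(self, x):                                        # x: (B, 3, H, W)
        z = self.embed(x).flatten(2).transpose(1, 2)             # rasterise patches to (B, N, d)
        z = torch.cat([self.cls.expand(x.shape[0], -1, -1), z], dim=1) + self.pos  # Eq. (1)
        for layer in self.layers:
            z = layer(z)                                         # Eqs. (2)-(3), applied L times
        return self.head(z[:, 0])                                # classify from z^L_cls
```

For example, `ViTSketch()(torch.randn(2, 3, 224, 224))` returns class logits of shape (2, 400).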
As the transformer [68], which forms the basis of ViT [18], is a flexible architecture that can operate on any sequence of input tokens $\mathbf{z} \in \mathbb{R}^{N \times d}$, we describe strategies for tokenising videos next.
3.2. Embedding video clips
We consider two simple methods for mapping a video $V \in \mathbb{R}^{T \times H \times W \times C}$ to a sequence of tokens $\tilde{\mathbf{z}} \in \mathbb{R}^{n_t \times n_h \times n_w \times d}$. We then add the positional embedding and reshape into $\mathbb{R}^{N \times d}$ to obtain $\mathbf{z}$, the input to the transformer.
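Whichever tokenisation method is used, the flattening and positional-embedding step is the same. A minimal sketch, with illustrative shapes (8 temporal and 14×14 spatial positions, $d=768$) assumed for concreteness:

```python
import torch
import torch.nn as nn

B, n_t, n_h, n_w, d = 2, 8, 14, 14, 768
tilde_z = torch.randn(B, n_t, n_h, n_w, d)      # tokens from either embedding method below

N = n_t * n_h * n_w
pos = nn.Parameter(torch.zeros(1, N, d))        # learned positional embedding p
z = tilde_z.reshape(B, N, d) + pos              # reshape to R^{N x d}: the transformer input
```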
Uniform frame sampling. As illustrated in Fig. 2, a straightforward method of tokenising the input video is to uniformly sample $n_t$ frames from the input video clip, embed each 2D frame independently using the same method as ViT [18], and concatenate all these tokens together. Concretely, if $n_h \cdot n_w$ non-overlapping image patches are extracted from each frame, as in [18], then a total of $n_t \cdot n_h \cdot n_w$ tokens will be forwarded through the transformer encoder. Intuitively, this process may be seen as simply constructing a large 2D image to be tokenised following ViT. We note that this is the input embedding method employed by the concurrent work of [4].
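A minimal sketch of this tokenisation, assuming the frames have already been uniformly sampled and a ViT-style 16×16 patch projection is shared across frames (module and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class UniformFrameTokeniser(nn.Module):
    """Embed each sampled frame independently with a shared 2D patch projection."""
    def __init__(self, patch=16, d=768, channels=3):
        super().__init__()
        self.proj = nn.Conv2d(channels, d, kernel_size=patch, stride=patch)

    def forward(self, video):                        # video: (B, n_t, C, H, W), frames pre-sampled
        b, n_t, c, h, w = video.shape
        frames = video.reshape(b * n_t, c, h, w)     # treat every frame as an independent image
        tok = self.proj(frames)                      # (B*n_t, d, n_h, n_w)
        n_h, n_w = tok.shape[-2:]
        tok = tok.flatten(2).transpose(1, 2)         # (B*n_t, n_h*n_w, d) tokens per frame
        return tok.reshape(b, n_t * n_h * n_w, d)    # concatenated: n_t*n_h*n_w tokens per clip
```

With 8 frames at 224×224 and 16×16 patches, this yields 8·14·14 = 1568 tokens per clip.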
Figure 2: Uniform frame sampling: we simply sample $n_t$ frames, and embed each 2D frame independently following ViT [18].
Figure 3: Tubelet embedding: we extract and linearly embed non-overlapping tubelets that span the spatio-temporal input volume.
Tubelet embedding. An alternate method, as shown in Fig. 3, is to extract non-overlapping, spatio-temporal "tubes" from the input volume, and to linearly project this to $\mathbb{R}^{d}$. This method is an extension of ViT's embedding to 3D, and corresponds to a 3D convolution. For a tubelet of dimension $t \times h \times w$, $n_t = \lfloor T/t \rfloor$, $n_h = \lfloor H/h \rfloor$ and $n_w = \lfloor W/w \rfloor$ tokens are extracted from the temporal, height, and width dimensions respectively. Smaller tubelet dimensions thus result in more tokens, which increases the computation. Intuitively, this method fuses spatio-temporal information during tokenisation, in contrast to "Uniform frame sampling", where temporal information from different frames is fused by the transformer.
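Because tubelet embedding corresponds to a 3D convolution, it can be sketched with a single strided Conv3d; the tubelet size 2×16×16 below is an assumed example rather than a prescribed setting:

```python
import torch
import torch.nn as nn

class TubeletTokeniser(nn.Module):
    """Linearly embed non-overlapping t x h x w tubelets via a strided 3D convolution."""
    def __init__(self, t=2, h=16, w=16, d=768, channels=3):
        super().__init__()
        self.proj = nn.Conv3d(channels, d, kernel_size=(t, h, w), stride=(t, h, w))

    def forward(self, video):                   # video: (B, C, T, H, W)
        tok = self.proj(video)                  # (B, d, n_t, n_h, n_w), n_t = floor(T/t), etc.
        return tok.flatten(2).transpose(1, 2)   # (B, n_t*n_h*n_w, d) spatio-temporal tokens
```

For a 32×224×224 clip, 2×16×16 tubelets give 16·14·14 = 3136 tokens; halving $t$ to 1 doubles the token count, and the self-attention cost, which is quadratic in the number of tokens, grows roughly fourfold.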
3.3. Transformer Models for Video
As illustrated in Fig. 1, we propose multiple transformer-based architectures. We begin with a straightforward extension of ViT [18] that models pairwise interactions between all spatio-temporal tokens, and then develop more efficient variants which factorise the spatial and temporal dimensions of the input video at various levels of the transformer architecture.
Model 1: Spatio-temporal attention. This model simply forwards all spatio-temporal tokens extracted from the video, $\mathbf{z}^{0}$, through the transformer encoder. We note that this has also been explored concurrently by [4] in their "Joint Space-Time" model. In contrast to CNN architectures, where the receptive field grows linearly with the number of layers, each transformer layer models all pairwise interactions between the spatio-temporal tokens, and therefore captures long-range dependencies across the video from the first layer.
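A sketch of how Model 1 consumes these tokens: all $n_t \cdot n_h \cdot n_w$ spatio-temporal tokens are flattened into a single sequence and passed through an unmodified pre-norm encoder, so every layer attends jointly over space and time. The stack depth and token-grid sizes below are illustrative assumptions, and PyTorch's built-in `nn.TransformerEncoderLayer` is used in place of a hand-rolled layer:

```python
import torch
import torch.nn as nn

# Joint space-time attention: one attention operation over all n_t*n_h*n_w tokens.
B, n_t, n_h, n_w, d = 1, 8, 14, 14, 768
z0 = torch.randn(B, n_t * n_h * n_w, d)               # z^0: tokens from either embedding method

layer = nn.TransformerEncoderLayer(d, nhead=12, dim_feedforward=4 * d,
                                   activation="gelu", batch_first=True,
                                   norm_first=True)   # pre-norm, matching Eqs. (2)-(3)
encoder = nn.TransformerEncoder(layer, num_layers=2)  # shallow stack for illustration
out = encoder(z0)                                     # each layer attends over all 1568 tokens

# Self-attention is quadratic in sequence length, so cost scales with (n_t*n_h*n_w)^2.
print(out.shape, (n_t * n_h * n_w) ** 2)              # torch.Size([1, 1568, 768]) 2458624
```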
Image credit: Arnab et al. 2021