6. HMDB51
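■ Human Motion DataBase
• Sources: digitized movies, the Prelinger archive, YouTube and Google videos, etc.
• Existing datasets such as UCF-Sports and Olympic Sports draw only on YouTube, have ambiguous actions, and can be recognized from human pose alone
■ 51 categories, 6,766 videos
• At least 101 clips per category
• Roughly 1 to 5 seconds each, about 3.15 seconds on average
• Min 0.63 seconds, max 35.43 seconds
[Kuehne+, ICCV2011]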
Abstract
With nearly one billion online videos viewed everyday, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling and thus there is a need for the design and creation of new benchmarks. To address this issue we collected the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube. We use this database to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions such as camera motion, viewpoint, video quality and occlusion.
Figure 1. Sample frames from the proposed HMDB51 [1] (from
top left to lower right, actions are: hand-waving, drinking, sword
7. UCF101
■ University of Central Florida
• Source: YouTube
• Manually cleaned
• Successor to UCF-Sports/11/50
■ 101 categories, 13,320 videos
• Shortest 1.06 seconds, longest 71.04 seconds, average 7.21 seconds
[Soomro+, arXiv, 2012]
9. SSv2
■ something-something v2
• Models should understand noun and verb patterns rather than action labels
• 174 template sentences = 174 labels
• "Dropping [something] into [something]"
• "Stacking [number of] [something]"
• "something"
• A placeholder filled with the name of the object being acted on
• 221k videos
• train 167k, val 25k, test 27k
• 4.03 seconds on average (v1)
v1 [Goyal+, ICCV2017] v2 [Mahdisoltani+, arXiv2018]
Putting a white remote into a cardboard box
Pretending to put candy onto chair
Pushing a green chilli so that it falls off the table
Moving puncher closer to scissor
Figure 4: Example videos and corresponding descriptions. Object entries shown in italics.
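The relation between a template label and a concrete caption can be sketched in a few lines. This is only an illustration: the function name `fill_template` and the example objects are assumptions made here, not part of the dataset tooling.

```python
import re

# Matches one bracketed placeholder such as [something] or [number of].
SLOT = re.compile(r"\[[^\]]+\]")

def fill_template(template, objects):
    """Replace each bracketed placeholder, in order, with a concrete phrase."""
    slots = SLOT.findall(template)
    if len(slots) != len(objects):
        raise ValueError(f"template has {len(slots)} slots, got {len(objects)} objects")
    remaining = iter(objects)
    return SLOT.sub(lambda match: next(remaining), template)

# The class label stays the template; only the caption names the objects.
label = "Dropping [something] into [something]"
caption = fill_template(label, ["a coin", "a cup"])
print(caption)  # Dropping a coin into a cup
```

Keeping the label at the template level is what makes the task about the verb pattern: two captions with different objects still share one of the 174 classes.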
11. Action Recognition models
[Figure: timeline of action recognition models, 2012 to 2022]
Category labels: Non-Deep, CNN, Vision Transformer, 2D + 1D aggregation, (2+1)D, restricted 3D, Full 3D
Models: DT, IDT, Two Stream, TSN, C3D, I3D, P3D, S3D, R(2+1)D, 3D ResNet, R3D, Non-Local, TSM, SlowFast, X3D, ViViT, TimeSformer, STAM, Video Transformer Network, VidTr, X-ViT, TokenShift, VideoSwin
Milestones: ImageNet, ResNet, U-Net, GAN, Kinetics, Transformer, ViT
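12. Fusion: applying a 2D CNN to each frame
■ Ways to apply a 2D CNN to video
• single: use only one frame
• late fusion: apply a 2D CNN to each frame, then aggregate along the time axis at the end
• early fusion: feed multiple frames in at once
• TSN
• slow fusion: merge gradually partway through the network (lateral fusion)
■ The ideas and terminology have carried over
• late fusion is alive and well
• Apply per frame, then aggregate with a simple average, 1D CNN, Transformer, etc.
• Used as a baseline
[Karpathy+, CVPR2014]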
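A minimal sketch of late fusion with simple averaging follows. The per-frame model here is a hypothetical stand-in (`toy_frame_model`), not a real 2D CNN; any callable that maps one frame to a list of class scores would fit the same pattern.

```python
def late_fusion(frames, frame_model):
    """Apply the same per-frame model to every frame, then average the scores."""
    per_frame = [frame_model(frame) for frame in frames]
    num_classes = len(per_frame[0])
    return [
        sum(scores[c] for scores in per_frame) / len(per_frame)
        for c in range(num_classes)
    ]

def toy_frame_model(frame):
    # Hypothetical stand-in for a 2D CNN: two class scores from the mean pixel value.
    mean = sum(frame) / len(frame)
    return [mean, 1.0 - mean]

# Three "frames" of two pixels each.
clip = [[0.0, 0.2], [0.4, 0.6], [0.8, 1.0]]
scores = late_fusion(clip, toy_frame_model)
```

Swapping the average for a 1D CNN or a Transformer over the per-frame outputs changes only the aggregation step, which is why this design remains a common baseline.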
Figure 1: Explored approaches for fusing information over temporal dimension through the network. Red, green and blue boxes indicate convolutional, normalization and pooling layers respectively. In the Slow Fusion model, the depicted columns share parameters.
3.1. Time Information Fusion in CNNs
14. TSN
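■ Temporal Segment Networks
• A simple trick
• Split the clip into segments
• Run two-stream on each segment
• Merge at the end
• Long-range temporal modeling
• Optical flow handles short-range temporal modeling
[Wang+, ECCV2016]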
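The split-then-merge idea can be sketched as below. This is a simplified illustration, assuming one frame index is drawn uniformly from each equal-length segment and that the merge is a plain average; the function names are chosen here for the example.

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=random):
    """Split [0, num_frames) into equal segments and draw one frame index from each."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    bounds = [(i * num_frames) // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]

def segment_consensus(per_segment_scores):
    """Average class scores over segments (one simple choice of consensus)."""
    k = len(per_segment_scores)
    return [sum(column) / k for column in zip(*per_segment_scores)]

indices = sample_segment_indices(30, 3, random.Random(0))
fused = segment_consensus([[1.0, 0.0], [0.0, 1.0]])
```

Because one index comes from each segment, the sampled frames always span the whole video, which is where the long-range coverage comes from.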
TSNs: Towards Good Practices for Deep Action Recognition
15. (2+1)D CNN
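■ Reducing the computational cost of 3D conv
• Combine a spatial 2D conv with a temporal 1D conv
■ Representative examples of separable conv
• P3D [Qiu+, ICCV2017]
• S3D [Xie+, ECCV2018]
• R(2+1)D [Tran+, CVPR2018]
• R3D is also compared (i.e., 3D ResNet)
■ 2D/1D conv inside a module
• P3D, S3D
■ Splitting 2D and 3D by layer
• S3D, MCx/rMCx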
Figure 2. (2+1)D vs 3D convolution. The illustration is given for the simplified setting where the input consists of a spatiotemporal volume with a single feature channel. (a) Full 3D convolution is carried out using a filter of size t × d × d where t denotes the temporal extent and d is the spatial width and height. (b) A (2+1)D convolutional block splits the computation into a spatial 2D convolution followed by a temporal 1D convolution. We choose the numbers of 2D filters (M_i) so that the number of parameters in our (2+1)D block matches that of the full 3D convolutional block.
using 2D convolutions in the top layers. Since in this work we consider 3D ResNets (R3D) having 5 groups of convolutions (see Table 1), our first variant consists in replacing all 3D convolutions in group 5 with 2D convolutions. We denote this variant with MC5 (Mixed Convolutions). We design a second variant that uses 2D convolutions in group 4 and 5, and name this model MC4 (meaning from group 4 and deeper layers all convolutions are 2D). Following this pattern, we also create MC3 and MC2 variations. We omit to consider MC1 since it is equivalent to a 2D ResNet (f-
[Plots: error versus epoch (0 to 50) for R3D-18 and R(2+1)D-18 (left) and for R3D-34 and R(2+1)D-34 (right), train and val curves]
Figure 3. Training and testing errors for R(2+1)D and R3D. Results are reported for ResNets of 18 layers (left) and 34 layers (right). It can be observed that the training error (thin lines) is smaller for R(2+1)D compared to R3D, particularly for the network with larger depth (right). This suggests that the spatial-temporal decomposition implemented by R(2+1)D eases the optimization, especially as depth is increased.
the temporal convolutions. We choose $M_i = \left\lfloor \frac{t d^2 N_{i-1} N_i}{d^2 N_{i-1} + t N_i} \right\rfloor$ so that the number of parameters in the (2+1)D block is approximately equal to that implementing full 3D convolution. We note that this spatiotemporal decomposition can be applied to any 3D convolutional layer. An illustration of this decomposition is given in Figure 2 for the simplified setting where the input tensor $z_{i-1}$ contains a single channel (i.e., $N_{i-1} = 1$). If the 3D convolution has spatial or temporal striding (implementing downsampling), the striding is correspondingly decomposed into its spatial or temporal dimensions. This architecture is illustrated in Figure 1(e).
Compared to full 3D convolution, our (2+1)D decomposition offers two advantages. First, despite not changing
R(2+1)D [Tran+, CVPR2018]
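The parameter-matching rule for the number of intermediate channels M_i can be checked with plain arithmetic. The sketch below counts weights only (biases ignored, an assumption made here), with `n_in` and `n_out` standing for N_{i-1} and N_i.

```python
def mid_channels(n_in, n_out, t, d):
    """M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(n_in, n_out, t, d):
    """Weights in a full 3D conv: n_out filters of size n_in x t x d x d."""
    return n_in * t * d * d * n_out

def params_2plus1d(n_in, n_out, t, d):
    """Weights in a 1 x d x d conv (n_in -> M_i) followed by a t x 1 x 1 conv (M_i -> n_out)."""
    m = mid_channels(n_in, n_out, t, d)
    return n_in * d * d * m + m * t * n_out

# Example: 64 -> 64 channels with a 3 x 3 x 3 kernel.
print(mid_channels(64, 64, 3, 3))    # 144
print(params_3d(64, 64, 3, 3))       # 110592
print(params_2plus1d(64, 64, 3, 3))  # 110592
```

Because of the floor, the (2+1)D count never exceeds the 3D count; the two are equal when the division is exact, as in this example.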
[Diagram: (a) 2D Inc., (b) 3D Inc., (c) Sep-Inc. blocks, each with parallel Conv and Max-Pool branches between Previous Layer and Next Layer, joined by Concat]
Fig. 3. (a) 2D Inception block; (b) 3D Inception block; (c) 3D temporal separable Inception block used in S3D networks.
S3D [Xie+, ECCV2018]
[Diagram: layer stacks for (a) I3D (64-frame video in, 400-D prediction out), (b) I2D, (c) Bottom-heavy I3D and (d) Top-heavy I3D, with K = 0 to 10 indexing the layers in (c) and (d)]
Fig. 2. Network architecture details for (a) I3D, (b) I2D, (c) Bottom-Heavy and (d) Top-Heavy variants. K indexes the spatio-temporal convolutional layers. The “2D Inc.” and “3D Inc.” blocks refer to 2D and 3D inception blocks, defined in Figure 3.
[Diagram: five convolutional groups from clip to space-time pool and fc for (a) R2D, (b) MC, (c) rMC, (d) R3D, (e) R(2+1)D]
Figure 1. Residual network architectures for video classification considered in this work. (a) R2D are 2D ResNets; (b) MCx are ResNets with mixed convolutions (MC3 is presented in this figure); (c) rMCx use reversed mixed convolutions (rMC3 is shown here); (d) R3D are 3D ResNets; and (e) R(2+1)D are ResNets with (2+1)D convolutions. For interpretability, residual connections are omitted.
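The naming scheme of the variants can be made concrete by listing the convolution type per group, from the group nearest the input to the deepest. This is a sketch under one assumption: rMCx is read here as the exact mirror of MCx (2D before group x, 3D from group x onward). The function name `conv_layout` is hypothetical.

```python
def conv_layout(variant, num_groups=5):
    """Return the conv type of each group, ordered from input to output."""
    uniform = {"R2D": "2D", "R3D": "3D", "R(2+1)D": "(2+1)D"}
    if variant in uniform:
        return [uniform[variant]] * num_groups
    if variant.startswith("rMC"):
        # Assumed mirror of MCx: 2D in groups before x, 3D from group x onward.
        x = int(variant[3:])
        return ["2D" if g < x else "3D" for g in range(1, num_groups + 1)]
    if variant.startswith("MC"):
        # MCx: 3D in groups before x, 2D from group x and deeper.
        x = int(variant[2:])
        return ["3D" if g < x else "2D" for g in range(1, num_groups + 1)]
    raise ValueError(f"unknown variant: {variant}")

print(conv_layout("MC3"))   # ['3D', '3D', '2D', '2D', '2D']
print(conv_layout("rMC3"))  # ['2D', '2D', '3D', '3D', '3D']
```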
3. Convolutional residual blocks for video
In this section we discuss several spatiotemporal convolutional variants within the framework of residual learning. Let $x$ denote the input clip of size $3 \times L \times H \times W$, where $L$ is the number of frames in the clip, $H$ and $W$ are the frame height and width, and 3 refers to the RGB channels. Let $z_i$ be the tensor computed by the $i$-th convolutional block in the residual network. In this work we consider only “vanilla” residual blocks (i.e., without bottlenecks) [13], with each block consisting of two convolutional layers with a ReLU activation function after each layer. Then the output of the $i$-th residual block is given by
$$z_i = z_{i-1} + \mathcal{F}(z_{i-1}; \theta_i) \qquad (1)$$
where $\mathcal{F}(\cdot\,; \theta_i)$ implements the composition of two convolutions parameterized by weights $\theta_i$ and the application of the ReLU functions. In this work we consider networks
dimensions of the preceding tensor $z_{i-1}$. Each filter yields a single-channel output. Thus, the very first convolutional layer in R2D collapses the entire temporal information of the video in single-channel feature maps, which prevent any temporal reasoning to happen in subsequent layers. This type of CNN architecture is illustrated in Figure 1(a). Note that since the feature maps have no temporal meaning, we do not perform temporal striding for this network.
3.2. f-R2D: 2D convolutions over frames
Another 2D CNN approach involves processing independently the L frames via a series of 2D convolutional residual block. The same filters are applied to all L frames. In this case, no temporal modeling is performed in the convolutional layers and the global spatiotemporal pooling layer at the top simply fuses the information extracted independently from the L frames. We refer to this architecture variant as f-R2D (frame-based R2D).
P3D [Qiu+, ICCV2017]
[Diagram: (a) Residual Unit [7], (b) P3D-A, (c) P3D-B, (d) P3D-C; blocks (b) to (d) each place a 1x3x3 conv and a 3x1x1 conv between two 1x1x1 convs, with ReLU activations and a residual sum]
Figure 3. Bottleneck building blocks of Residual Unit and our Pseudo-3D.
shortcut connection from S to the final output, making the
output x as
Table 1. Comparisons of ResNet-50 and different Pseudo-3D
ResNet variants in terms of model size, speed, and accuracy on
Table 2. Comparisons in terms of pre-train data, clip length, Top-1 clip
Method Pre-train Data Clip L
Deep Video (Single Frame) [10] ImageNet1K
Deep Video (Slow Fusion) [10] ImageNet1K 1
Convolutional Pooling [37] ImageNet1K 1
C3D [31] – 1
C3D [31] I380K 1
ResNet-152 [7] ImageNet1K
P3D ResNet (ours) ImageNet1K 1
P3D-A P3D-B P3D-C P3D-A P3D-B P3D-C
...
17. X3D
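■ Searching for the best architecture
• Search in the spirit of NAS (Neural Architecture Search)
• Starts from the Fast pathway (ResNet) of SlowFast
• Varies several parameters
• Spatial and temporal resolution, number of channels, etc.
• Searches greedily
• Proposes several X3D models of different sizes
• Quite lightweight
[Feichtenhofer, CVPR2020]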
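A greedy, one-axis-at-a-time search can be sketched as follows. Everything here is a toy under stated assumptions: the axis names, the expansion factors and the scoring function `toy_score` are hypothetical, and a real search would train and validate each candidate model and also control its compute budget, which this sketch does not model.

```python
def greedy_expand(config, factors, evaluate, steps):
    """At each step, try expanding every axis once and keep the single best change."""
    history = []
    for _ in range(steps):
        candidates = []
        for axis, factor in factors.items():
            trial = dict(config)
            trial[axis] = trial[axis] * factor
            candidates.append((evaluate(trial), axis, trial))
        # max() keeps the first candidate among ties, so axis order breaks ties.
        best_score, best_axis, config = max(candidates, key=lambda c: c[0])
        history.append((best_axis, best_score))
    return config, history

def toy_score(cfg):
    # Hypothetical proxy with diminishing returns on every axis.
    return sum(1.0 - 1.0 / value for value in cfg.values())

base = {"frames": 1, "resolution": 1, "width": 1}
factors = {"frames": 2, "resolution": 2, "width": 2}
final, history = greedy_expand(base, factors, toy_score, steps=3)
```

With diminishing returns, the search tends to rotate through the axes instead of growing a single one, which matches the intuition of expanding a small base model step by step.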
[Plot: Kinetics top-1 accuracy (%) versus model capacity in GFLOPs (# of multiply-adds x 10^9), tracing the expansion steps (labeled τ, s, d, b, t, w) from X2D through X3D-XS, X3D-S, X3D-M, X3D-L and X3D-XL]
Figure 2. Progressive network expansion of X3D.
model     top-1   top-5   regime     regime FLOPs (G)   FLOPs (G)   Params (M)
X3D-XS    68.6    87.9    X-Small    ≤ 0.6              0.60        3.76
X3D-S     72.9    90.5    Small      ≤ 2                1.96        3.76
X3D-M     74.6    91.7    Medium     ≤ 5                4.73        3.76
X3D-L     76.8    92.5    Large      ≤ 20               18.37       6.08
X3D-XL    78.4    93.6    X-Large    ≤ 40               35.84       11.0
Table 2. Expanded instances on K400-val. 10-Center clip testing is
18. Per-frame Vision Transformer + late fusion
■ (2+1)D Transformer
• STAM [Sharir+, arXiv2021]
• Video Transformer Network [Neimark+, ICCVW2021]
■ VidTr [Zhang+, ICCV2021]
• (2+1)D attention inside the Transformer block
■ X-ViT [Bulat+, NeurIPS2021]
• Restricts attention to neighboring frames
(a) Full space-time attention: $O(T^2 S^2)$
(b) Spatial-only attention: $O(T S^2)$
(c) TimeSformer [3] and ViViT (Model 3) [1]: $O(T^2 S + T S^2)$
(d) Ours: $O(T S^2)$
X-ViT [Bulat+, NeurIPS2021]
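To get a feel for the four complexity classes, the leading-order terms can be evaluated for a concrete clip. The sketch below only plugs numbers into the big-O expressions, so constant factors (such as the size of a temporal window) are ignored; the values T = 8 frames and S = 196 tokens per frame are assumptions chosen for illustration.

```python
def attention_costs(T, S):
    """Leading-order query-key pair counts for T frames of S tokens each."""
    return {
        "full_space_time": T**2 * S**2,
        "spatial_only": T * S**2,
        "divided_space_time": T**2 * S + T * S**2,
        "restricted_temporal_window": T * S**2,
    }

costs = attention_costs(T=8, S=196)
for name, pairs in costs.items():
    print(f"{name:28s} {pairs:>10,d}")
```

For these values, full space-time attention needs about 2.46 million pairs, while the other three stay near 0.31 million, roughly an 8x reduction.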
Video Transformer Network
Daniel Neimark Omri Bar Maya Zohar Dotan Asselmann
Theator
{danieln, omri, maya, dotan}@theator.io
Abstract
This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a method that classifies actions by attending to the entire video sequence information. Our approach is generic and builds on top of any given 2D spatial network. In terms of wall runtime, it trains 16.1× faster and runs 5.1× faster during inference while maintaining competitive accuracy compared to other state-of-the-art methods. It enables whole video analysis, via a single end-to-end pass, while requiring 1.5× fewer GFLOPs. We report competitive results on Kinetics-400 and Moments in Time benchmarks and present an ablation study of VTN properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and start a fresh line of research in the video recognition domain. Code and models are available at: https://github.com/bomri/SlowFast/blob/master/projects/vtn/README.md.
1. Introduction
Attention matters. For almost a decade, ConvNets have ruled the computer vision field [22, 7]. Applying deep ConvNets produced state-of-the-art results in many visual recognition tasks, i.e., image classification [32, 19, 34], object detection [17, 16, 28], semantic segmentation [25], object instance segmentation [18], face recognition [33, 30] and video action recognition [9, 38, 3, 39, 23, 14, 13, 12]. But, recently this domination is starting to crack as transformer-based models are showing promising results in many of these tasks [10, 2, 35, 40, 42, 15].
Video recognition tasks also rely heavily on ConvNets. In order to handle the temporal dimension, the fundamental approach is to use 3D ConvNets [5, 3, 4]. In contrast to other studies that add the temporal dimension straight from the input clip level, we aim to move apart from 3D networks. We use state-of-the-art 2D architectures to learn the spatial feature representations and add the temporal information later in the data flow by using attention mechanisms on top of the resulting features. Our approach input only RGB video frames and without any bells and whistles (e.g., optical flow, streams lateral connections, multi-scale inference, multi-view inference, longer clips fine-tuning, etc.) achieves comparable results to other state-of-the-art models.
Video recognition is a perfect candidate for Transformers. Similar to language modeling, in which the input words or characters are represented as a sequence of tokens [37], videos are represented as a sequence of images (frames). However, this similarity is also a limitation when it comes to processing long sequences. Like long documents, long videos are hard to process. Even a 10 seconds video, such as those in the Kinetics-400 benchmark [21], are processed in recent studies as short, ~2 seconds, clips.
But how does this clip-based inference would work on much longer videos (i.e., movie films, sports events, or surgical procedures)? It seems counterintuitive that the infor-
Figure 1. Video Transformer Network architecture. Connecting three modules: A 2D spatial backbone (f(x)), used for feature extraction. Followed by a temporal attention-based encoder (Longformer in this work), that uses the feature vectors ($\phi_i$) combined with a position encoding. The [CLS] token is processed by a classification MLP head to get the final class prediction.
Video Transformer Network [Neimark+, ICCVW2021]
STAM [Sharir+, arXiv2021]
Figure 1: Spatio-temporal separable-attention video transformer (VidTr). The model takes pixels patches as input and
VidTr [Zhang+, ICCV2021]
Figure 3. Our proposed Transformer Network for video
Our goal is to provide a model that can utilize sparsely subsampled temporal data for accurate predictions. Such model need to be able to capture long-term dependencies as well. While 2D convolutions filters are tailor-made for the structure of images, utilizing local connections and providing desired properties for object recognition and detection [16], the same properties might negatively affect the processing of subsampled temporal data. While a series of 3D-convolutions can learn long-term interactions due to increased receptive field, they are biased towards local ones. In order to verify this, we conducted an experiment: we fed leading methods that are based on 3D convolutions with the same subsampled data as in our method. The results are presented in Table 5. The performance of both methods degraded significantly, the error of X3D increased by 23% and SlowFast error by 50%.
Transformers offer advantages over their convolutional counterparts regarding modeling long-term dependencies. While a multi-head self-attention layer with sufficient num-
19. ViViT
■ Model 1: 3D ViT
• Make the patches fed to ViT 3D
■ Model 2: late fusion, (2+1)D
• A 2D Transformer per frame
• A 1D Transformer along the time axis
[Diagram: frames 1 to T are embedded to tokens (positional + token embedding), each frame passes through a Spatial Transformer Encoder, the per-frame CLS outputs receive a temporal + token embedding and pass through a Temporal Transformer Encoder, and an MLP head outputs the class]
[Arnab+, ICCV2021]
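The effect of 3D patches on sequence length can be seen by counting tokens. The clip size (32 frames at 224 x 224) and patch sizes below are assumptions chosen for illustration, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(frames, height, width, patch_t, patch_h, patch_w):
    """Count non-overlapping patches of size patch_t x patch_h x patch_w in a clip."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("clip size must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# 2D patches applied frame by frame versus 3D patches spanning 2 frames.
per_frame_2d = num_tokens(32, 224, 224, 1, 16, 16)
patches_3d = num_tokens(32, 224, 224, 2, 16, 16)
print(per_frame_2d, patches_3d)  # 6272 3136
```

A 3D patch spanning two frames halves the token count, which matters because full self-attention cost grows with the square of the sequence length.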
20. TimeSformer
■ Restricting attention
• Divided attention
■ (2+1)D
• Space: patches within the same frame
• Time: patches at the same position in different frames
[Bertasius+, ICML2021]
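Which tokens each attention step looks at can be written out as index sets. In this sketch a token is identified by a hypothetical pair (patch index, frame index); the classification token and the actual attention computation are left out.

```python
def spatial_neighbors(patch, frame, num_patches, num_frames):
    """Spatial attention: every patch in the same frame."""
    return [(q, frame) for q in range(num_patches)]

def temporal_neighbors(patch, frame, num_patches, num_frames):
    """Temporal attention: the same patch position in every frame."""
    return [(patch, s) for s in range(num_frames)]

# One query token in a clip of 8 frames with 196 patches per frame.
space = spatial_neighbors(5, 2, num_patches=196, num_frames=8)
time = temporal_neighbors(5, 2, num_patches=196, num_frames=8)
print(len(space), len(time))  # 196 8
```

Each query therefore compares against 196 + 8 tokens across the two steps instead of all 196 x 8 = 1568 tokens of joint space-time attention.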
Is Space-Time Attention All You Need for Video Understanding?
[Diagram: attention blocks compared, built from Space Att., Joint Space-Time Att., and Time Att. followed by Space Att., each followed by an MLP]
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
Space Attention (S)
Joint Space-Time
Attention (ST)
Divided Space-Time
Attention (T+S)
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
Time Att.
Width Att.
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
MLP
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
Height Att.
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
<latexit sha1_base64="N4ztu7rA6GcWPA6nADZ8tMOcvqE=">AAAB7XicbVDLSgNBEOyNrxhfUY9eBoPgKeyKoMegF48RzAOSJcxOZpMxszPLTK8QQv7BiwdFvPo/3vwbJ8keNLGgoajqprsrSqWw6PvfXmFtfWNzq7hd2tnd2z8oHx41rc4M4w2mpTbtiFouheINFCh5OzWcJpHkrWh0O/NbT9xYodUDjlMeJnSgRCwYRSc1uzqVme2VK37Vn4OskiAnFchR75W/un3NsoQrZJJa2wn8FMMJNSiY5NNSN7M8pWxEB7zjqKIJt+Fkfu2UnDmlT2JtXCkkc/X3xIQm1o6TyHUmFId22ZuJ/3mdDOPrcCJUmiFXbLEoziRBTWavk74wnKEcO0KZEe5WwobUUIYuoJILIVh+eZU0L6qBXw3uLyu1mzyOIpzAKZxDAFdQgzuoQwMYPMIzvMKbp70X7937WLQWvHzmGP7A+/wBz/6PRQ==</latexit>
Axial Attention
(T+W+H)
Sparse Local Global
Attention (L+G)
MLP
Local Att.
Global Att.
MLP
z(`)
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
z(` 1)
<latexit sha1_base64="PqFOkq34IvzyoSmk0/rjIYa/Lb0=">AAAB+3icbVBNS8NAEN3Ur1q/Yj16WSxCPVgSEfRY9OKxgv2ANpbNdtIu3WzC7kasIX/FiwdFvPpHvPlv3LY5aOuDgcd7M8zM82POlHacb6uwsrq2vlHcLG1t7+zu2fvllooSSaFJIx7Jjk8UcCagqZnm0IklkNDn0PbH11O//QBSsUjc6UkMXkiGggWMEm2kvl1Oe36An7L7tNoDzk/dk6xvV5yaMwNeJm5OKihHo29/9QYRTUIQmnKiVNd1Yu2lRGpGOWSlXqIgJnRMhtA1VJAQlJfObs/wsVEGOIikKaHxTP09kZJQqUnom86Q6JFa9Kbif1430cGllzIRJxoEnS8KEo51hKdB4AGTQDWfGEKoZOZWTEdEEqpNXCUTgrv48jJpndVcp+benlfqV3kcRXSIjlAVuegC1dENaqAmougRPaNX9GZl1ov1bn3MWwtWPnOA/sD6/AHtlpOz</latexit>
<latexit sha1_base64="PqFOkq34IvzyoSmk0/rjIYa/Lb0=">AAAB+3icbVBNS8NAEN3Ur1q/Yj16WSxCPVgSEfRY9OKxgv2ANpbNdtIu3WzC7kasIX/FiwdFvPpHvPlv3LY5aOuDgcd7M8zM82POlHacb6uwsrq2vlHcLG1t7+zu2fvllooSSaFJIx7Jjk8UcCagqZnm0IklkNDn0PbH11O//QBSsUjc6UkMXkiGggWMEm2kvl1Oe36An7L7tNoDzk/dk6xvV5yaMwNeJm5OKihHo29/9QYRTUIQmnKiVNd1Yu2lRGpGOWSlXqIgJnRMhtA1VJAQlJfObs/wsVEGOIikKaHxTP09kZJQqUnom86Q6JFa9Kbif1430cGllzIRJxoEnS8KEo51hKdB4AGTQDWfGEKoZOZWTEdEEqpNXCUTgrv48jJpndVcp+benlfqV3kcRXSIjlAVuegC1dENaqAmougRPaNX9GZl1ov1bn3MWwtWPnOA/sD6/AHtlpOz</latexit>
<latexit sha1_base64="PqFOkq34IvzyoSmk0/rjIYa/Lb0=">AAAB+3icbVBNS8NAEN3Ur1q/Yj16WSxCPVgSEfRY9OKxgv2ANpbNdtIu3WzC7kasIX/FiwdFvPpHvPlv3LY5aOuDgcd7M8zM82POlHacb6uwsrq2vlHcLG1t7+zu2fvllooSSaFJIx7Jjk8UcCagqZnm0IklkNDn0PbH11O//QBSsUjc6UkMXkiGggWMEm2kvl1Oe36An7L7tNoDzk/dk6xvV5yaMwNeJm5OKihHo29/9QYRTUIQmnKiVNd1Yu2lRGpGOWSlXqIgJnRMhtA1VJAQlJfObs/wsVEGOIikKaHxTP09kZJQqUnom86Q6JFa9Kbif1430cGllzIRJxoEnS8KEo51hKdB4AGTQDWfGEKoZOZWTEdEEqpNXCUTgrv48jJpndVcp+benlfqV3kcRXSIjlAVuegC1dENaqAmougRPaNX9GZl1ov1bn3MWwtWPnOA/sD6/AHtlpOz</latexit>
<latexit sha1_base64="PqFOkq34IvzyoSmk0/rjIYa/Lb0=">AAAB+3icbVBNS8NAEN3Ur1q/Yj16WSxCPVgSEfRY9OKxgv2ANpbNdtIu3WzC7kasIX/FiwdFvPpHvPlv3LY5aOuDgcd7M8zM82POlHacb6uwsrq2vlHcLG1t7+zu2fvllooSSaFJIx7Jjk8UcCagqZnm0IklkNDn0PbH11O//QBSsUjc6UkMXkiGggWMEm2kvl1Oe36An7L7tNoDzk/dk6xvV5yaMwNeJm5OKihHo29/9QYRTUIQmnKiVNd1Yu2lRGpGOWSlXqIgJnRMhtA1VJAQlJfObs/wsVEGOIikKaHxTP09kZJQqUnom86Q6JFa9Kbif1430cGllzIRJxoEnS8KEo51hKdB4AGTQDWfGEKoZOZWTEdEEqpNXCUTgrv48jJpndVcp+benlfqV3kcRXSIjlAVuegC1dENaqAmougRPaNX9GZl1ov1bn3MWwtWPnOA/sD6/AHtlpOz</latexit>
z(`)
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
z(`)
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
z(`)
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
z(`)
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
<latexit sha1_base64="vr9M27hwkmJn9HyewcyBFvyrmCg=">AAAB+XicbVBNS8NAEN3Ur1q/oh69LBahXkoigh6LXjxWsB/QxLLZTtqlm03Y3RRqyD/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n2yqtrW9sbpW3Kzu7e/sH9uFRW8WppNCiMY9lNyAKOBPQ0kxz6CYSSBRw6ATj25nfmYBULBYPepqAH5GhYCGjRBupb9uZF4T4KX/Mah5wfp737apTd+bAq8QtSBUVaPbtL28Q0zQCoSknSvVcJ9F+RqRmlENe8VIFCaFjMoSeoYJEoPxsfnmOz4wywGEsTQmN5+rviYxESk2jwHRGRI/UsjcT//N6qQ6v/YyJJNUg6GJRmHKsYzyLAQ+YBKr51BBCJTO3YjoiklBtwqqYENzll1dJ+6LuOnX3/rLauCniKKMTdIpqyEVXqIHuUBO1EEUT9Ixe0ZuVWS/Wu/WxaC1Zxcwx+gPr8wcJUZNB</latexit>
z(` 1)
Figure 1. The video self-attention blocks that we investigate in this work. Each attention layer implements self-attention (Vaswani et al.,
2017b) on a specified spatiotemporal neighborhood of frame-level patches (see Figure 2 for a visualization of the neighborhoods). We use
residual connections to aggregate information from different attention layers within each block. A 1-hidden-layer MLP is applied at the
end of each block. The final model is constructed by repeatedly stacking these blocks on top of each other.
the N patches span the entire frame, i.e., $N = HW/P^2$.
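As a quick check of the patch count, the sketch below evaluates N = HW/P^2. The 224x224 frame and 16x16 patch sizes are illustrative assumptions, not values stated on this slide, and `num_patches` is a hypothetical helper name.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles covering a height x width frame."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Assumed example values: a 224x224 frame split into 16x16 patches.
print(num_patches(224, 224, 16))  # 196
```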
29. Feature shift for CNN and ViT
nTemporal Shift Module (TSM) [Lin+, ICCV2019]
• Shifts the features of 2D CNN layers along the time axis
nToken Shift Transformer (TokenShift) [Zhang+, ACMMM2021]
• Shifts only the class tokens
• Does not fully exploit spatio-temporal features
nProposed method: MSCA
• A new shift method based on the ViT structure
• Multi-head Self+Cross Attention (MSCA)
[Lin+, ICCV2019]
[Zhang+, ACMMM2021]
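The temporal feature shift used by TSM can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `temporal_shift`, the `fold_div` parameter, the (T, C, H, W) layout, and the zero-filling of vacated positions are all assumptions made for the example.

```python
import numpy as np


def temporal_shift(x: np.ndarray, fold_div: int = 8) -> np.ndarray:
    """Shift part of the channels of x (T, C, H, W) by one frame along time.

    The first C // fold_div channels take their values from the next frame,
    the following C // fold_div channels take theirs from the previous frame,
    and the remaining channels are left unchanged.
    Positions without a neighboring frame are zero-filled (assumption).
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # values from frame t+1
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # values from frame t-1
    out[:, 2 * fold:] = x[:, 2 * fold:]              # channels that are not shifted
    return out


# Toy input: 4 frames, 8 channels, 1x1 spatial size.
x = np.arange(4 * 8, dtype=np.float32).reshape(4, 8, 1, 1)
y = temporal_shift(x, fold_div=4)
print(y[:, :, 0, 0])
```

The shift itself has no parameters and no multiplications, so the temporal mixing comes almost for free on top of the 2D layers.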
30. ViT, TokenShift, and MSCA
[Figure: one encoder block (repeated L times, with residual additions "+") for each model]
• ViT: Embedded, Norm, MSA, Norm, MLP
• TokenShift: Embedded, shift, Norm, MSA, Norm, MLP, shift
• Ours: Embedded, Norm, MSCA, Norm, MLP
31. Multi-head Self+Cross Attention (MSCA)
nCross-attention with the frames at times t-1 and t+1 is partially inserted into the self-attention
[Figure: Q, K, V of the frames at times t-1, t, t+1; encoder block (repeated L times, with residual additions "+"): Embedded, Norm, MHA, Norm, MLP]
32. MSCA-KV
nShift K and V in the head direction
nHeads=4
[Figure: Q, K, V and attention for the frames at times t-1, t, t+1, with shifted K and V]
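A minimal NumPy sketch of the idea behind MSCA-KV: K and V of a few heads are replaced by those of neighboring frames, so those heads compute cross-attention with t+1 or t-1 while the other heads keep ordinary self-attention. Which heads go in which direction, the zero-filling at the clip boundaries, and the names `shift_heads` and `msca_kv` are assumptions for illustration, not details confirmed by the slides.

```python
import numpy as np


def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def shift_heads(x: np.ndarray, n_shift: int) -> np.ndarray:
    """x: (T, h, N, d). The first n_shift heads take frame t+1,
    the next n_shift heads take frame t-1 (assumed assignment)."""
    out = x.copy()
    out[:-1, :n_shift] = x[1:, :n_shift]
    out[-1, :n_shift] = 0.0                      # no frame after the last one
    out[1:, n_shift:2 * n_shift] = x[:-1, n_shift:2 * n_shift]
    out[0, n_shift:2 * n_shift] = 0.0            # no frame before the first one
    return out


def msca_kv(q: np.ndarray, k: np.ndarray, v: np.ndarray, n_shift: int = 1) -> np.ndarray:
    """Per-frame multi-head attention with K and V shifted for some heads.

    q, k, v: (T, h, N, d) = frames, heads, tokens, head dimension.
    """
    k, v = shift_heads(k, n_shift), shift_heads(v, n_shift)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


# Toy example: 3 frames, 4 heads, 5 tokens, head dimension 2.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((3, 4, 5, 2)) for _ in range(3))
out = msca_kv(q, k, v, n_shift=1)
print(out.shape)
```

With `n_shift=0` the function reduces to ordinary per-frame multi-head self-attention, which corresponds to the ViT baseline.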
33. Effect of the amount of head shifts
nMSCA-KV
nBest when only two heads are shifted
nToo much shift hurts performance
Table 1: Effect of the head shift amount on the performance of MSCA-KV on the Kinetics400 validation set. A shift amount of 0 corresponds to ViT. Since the number of heads is h = 12, a shift of 1/12 means shifting one head, and 1/6 means shifting two heads [...]
shift    heads  top-1  top-5
0 (ViT)  0      75.65  92.19
1/12     1      76.47  92.88
1/6      2      76.07  92.61
1/4      3      75.66  92.30
1/3      4      74.72  91.91
[Figure: MSCA-KV; Q, K, V and attention for the frames at times t-1, t, t+1]
34. Number of encoder blocks with MSCA
nReplace MSA modules with MSCA
• 4, 8, or 12 modules near the top of the network
• Replacing all 12 modules gives MSCA-KV
nUsing 8 or more MSCA modules improves performance
• The input video clip consists of 8 frames
• Shifting 8 or more times covers the temporal information from all 8 frames of the input video clip
Table 3: Effect of the number of MSA modules replaced with MSCA on performance. 12 corresponds to MSCA-KV and 0 to ViT.
# MSCA   top-1  top-5
0 (ViT)  75.65  92.19
4        75.67  92.19
8        76.40  92.77
12       76.47  92.88
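The coverage argument can be checked with a small reachability sketch. It assumes a simplified model in which every MSCA block lets frame t receive information from frames t-1 and t+1; the helper name `frames_covered` is hypothetical, and the model ignores that only some heads are shifted.

```python
import numpy as np


def frames_covered(num_frames: int, num_blocks: int) -> np.ndarray:
    """reach[i, j] is True if frame i can receive information from frame j
    after num_blocks blocks, each mixing a frame with t-1 and t+1."""
    step = np.eye(num_frames, dtype=int)
    idx = np.arange(num_frames - 1)
    step[idx, idx + 1] = 1   # frame t sees t+1
    step[idx + 1, idx] = 1   # frame t sees t-1
    reach = np.eye(num_frames, dtype=int)
    for _ in range(num_blocks):
        reach = ((reach @ step) > 0).astype(int)
    return reach.astype(bool)


# 8-frame clip with 4, 8, or 12 shifted blocks, as in the table.
for blocks in (4, 8, 12):
    print(blocks, bool(frames_covered(8, blocks).all()))
```

Under this assumption, 4 blocks do not let every frame of an 8-frame clip reach every other frame, while 8 and 12 blocks do.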
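36. Data augmentation
nFor image recognition
• Rotation, horizontal/vertical flipping, noise
• Mix-based methods
• CutOut [DeVries&Taylor, arXiv2017]
• Mixup [Zhang+, ICLR2018]
• CutMix [Yun+, ICCV2019]
• CG synthesis
• FlyingThings3D [Mayer+, CVPR2016]
• For optical flow estimation
• SURREAL [Varol+, CVPR2017]
• For human pose estimation
nFor video
• Horizontal flipping is sometimes not applied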
[Figure: examples of Cutout, Mixup, and CutMix]
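Cutout, Mixup, and CutMix can be sketched in a few lines of NumPy. This is a simplified illustration: the box and the mixing weight `lam` are passed in explicitly rather than sampled at random, and the function names and (H, W, C) layout are assumptions for the example.

```python
import numpy as np


def cutout(img: np.ndarray, box: tuple) -> np.ndarray:
    """Zero out the rectangle box=(top, left, height, width) of img (H, W, C)."""
    top, left, h, w = box
    out = img.copy()
    out[top:top + h, left:left + w] = 0
    return out


def mixup(img_a: np.ndarray, img_b: np.ndarray, lam: float) -> np.ndarray:
    """Blend two images pixel-wise; the labels are blended with the same lam."""
    return lam * img_a + (1.0 - lam) * img_b


def cutmix(img_a: np.ndarray, img_b: np.ndarray, box: tuple):
    """Paste the rectangle box of img_b into img_a.

    Returns the mixed image and lam, the fraction of pixels kept from img_a,
    which serves as the weight of img_a's label.
    """
    top, left, h, w = box
    out = img_a.copy()
    out[top:top + h, left:left + w] = img_b[top:top + h, left:left + w]
    lam = 1.0 - (h * w) / (img_a.shape[0] * img_a.shape[1])
    return out, lam


# Toy images: an all-zero image and an all-one image of size 4x4x3.
a = np.zeros((4, 4, 3))
b = np.ones((4, 4, 3))
mixed, lam = cutmix(a, b, (0, 0, 2, 2))
print(lam)  # 0.75
```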
we rendered all non-RGB data without antialiasing.
Given the intrinsic camera parameters (focal length,
principal point) and the render settings (image size, virtual
sensor size and format), we project the 3D motion vector
of each pixel into a 2D pixel motion vector coplanar to the
imaging plane: the optical flow. Depth is directly retrieved
from a pixel’s 3D position and converted to disparity using
the known configuration of the virtual stereo rig. We com-
[Figure labels: FlyingThings3D; albumentations [Buslaev+, Information, 2020]; imgaug [Jung+, 2020]; SURREAL]
37. VideoMix
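nAn application of CutMix [Yun+, ICCV2019] to video
• Three variants are proposed
• S-VideoMix
• Pastes the same rectangular region at every time step
• T-VideoMix
• Pastes the full frame over a certain time interval
• ST-VideoMix
• Pastes a spatio-temporal volume
[Yun+, arXiv2020]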
[Table: row labels cut off in the source]
top1  top5
75.2  91.7
77.0  93.1
75.6  92.2
[Figure: S-VideoMix, T-VideoMix, and ST-VideoMix mixing Video A and Video B; axes: time, height]
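The three variants S-VideoMix, T-VideoMix, and ST-VideoMix differ only in which part of the spatio-temporal volume is pasted, which the following NumPy sketch makes explicit. The function names, the (T, H, W, C) layout, and the explicit box arguments (instead of random sampling) are assumptions for illustration.

```python
import numpy as np


def video_mix(video_a, video_b, t0, t1, y0, y1, x0, x1):
    """Paste the volume [t0:t1, y0:y1, x0:x1] of video_b into video_a.

    Videos have shape (T, H, W, C). Returns the mixed video and lam,
    the fraction of video_a that is kept (weight of video_a's label).
    """
    out = video_a.copy()
    out[t0:t1, y0:y1, x0:x1] = video_b[t0:t1, y0:y1, x0:x1]
    T, H, W = video_a.shape[:3]
    lam = 1.0 - ((t1 - t0) * (y1 - y0) * (x1 - x0)) / (T * H * W)
    return out, lam


def s_videomix(a, b, y0, y1, x0, x1):
    """Same rectangle at every time step."""
    return video_mix(a, b, 0, a.shape[0], y0, y1, x0, x1)


def t_videomix(a, b, t0, t1):
    """Full frame over the time interval [t0, t1)."""
    return video_mix(a, b, t0, t1, 0, a.shape[1], 0, a.shape[2])


def st_videomix(a, b, t0, t1, y0, y1, x0, x1):
    """A spatio-temporal volume."""
    return video_mix(a, b, t0, t1, y0, y1, x0, x1)


# Toy videos: 8 frames of size 4x4x3.
va = np.zeros((8, 4, 4, 3))
vb = np.ones((8, 4, 4, 3))
_, lam_s = s_videomix(va, vb, 0, 2, 0, 2)
_, lam_t = t_videomix(va, vb, 2, 4)
_, lam_st = st_videomix(va, vb, 2, 4, 0, 2, 0, 2)
print(lam_s, lam_t, lam_st)
```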
39. Data augmentation for action recognition
nInspired by: Copy-Paste [Ghiasi+, CVPR2021]
• An augmentation method for instance segmentation
• Cuts out only the instances and pastes them onto another image
nProposed method: ObjectMix
• Cuts out the objects and pastes them onto another video
• Extracts objects from each frame
• Mixes videos while taking the objects into account
[Figure: source images and generated images (Copy-Paste); source videos and generated videos (ObjectMix)]
Excerpt from [Ghiasi+, CVPR2021] (cut off in the source):
Figure 2. We use a simple copy and paste method to create new images for training instance segm[...] jittering on two random training images and then randomly select a subset of instances from one [...]
The key idea behind the Copy-Paste augmentation is to paste objects from one image to another image. This can lead to a combinatorial number of new training data, with multiple possibilities for: (1) choices of the pair of source image from which instances are copied, and the target image on which they are pasted; (2) choices of object instances to copy from the source image; (3) choices of where to paste the copied instances on the target image. The large variety of options when utilizing this data augmentation method allows for lots of exploration on how to use the technique most effectively. Prior work [12, 15] adopts methods for deciding where to paste the additional objects by modeling the surrounding visual context. In contrast, we find that a simple strategy of randomly picking objects and pasting them at random locations on the target image provides a significant boost on top of baselines across multiple settings. Specifically, it gives solid improvements across a wide range of [...]
40. Proposed method: ObjectMix
nAlgorithm
1. Prepare two source videos
2. Extract object regions (masks) from each video (Detectron2 [Wu+, 2019])
3. Cut out the objects using the extracted masks
4. Paste the cut-out objects onto the other video
[Figure: Videos, Masks, Objects, Generated]
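Steps 3 and 4 of the algorithm can be sketched with NumPy, assuming the per-frame object masks from step 2 are already available as boolean arrays (the segmentation model itself is not called here). The names `cut_objects` and `paste_objects` and the (T, H, W, C) layout are hypothetical, and how the labels of the two videos are combined is not covered by this sketch.

```python
import numpy as np


def cut_objects(video: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only the object pixels of video (T, H, W, C); mask is (T, H, W) bool."""
    return video * mask[..., None]


def paste_objects(objects: np.ndarray, mask: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Overwrite target with the object pixels wherever mask is True."""
    return np.where(mask[..., None], objects, target)


# Toy data: two 2-frame videos of size 4x4x3 and a mask for the second video.
video_a = np.zeros((2, 4, 4, 3))
video_b = np.full((2, 4, 4, 3), 5.0)
mask_b = np.zeros((2, 4, 4), dtype=bool)
mask_b[:, 1:3, 1:3] = True            # assumed output of the object extractor

objects_b = cut_objects(video_b, mask_b)
generated = paste_objects(objects_b, mask_b, video_a)
print(generated[0, :, :, 0])
```

Because the mask is applied frame by frame, the pasted region follows the object over time instead of being a fixed rectangle.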
41. Experiment 1: ObjectMix
nHighest performance at about p=0.6
nToo large a value of p reduces performance.
np=0 is higher in the early stages,
but becomes equal or higher