!"#$%#&$'#!"(#)*+,'-./#0"#
1$&.$23'-$&#4/2/5'-$&#6$%#
7685-/&'#9-*/$#:/5$;&-'-$&
!"#$%&'()*+(,&-&.#(/&+(012*#.3(425*3.67.3.+().558(49(:.3*6
;<=>?@?A
仁田智也(名工大玉木研)
導入
n動画認識
• 認識に必要なフレームや計算コストは動画による
• 物体を映した動画
• ダンスなど激しい動きの動画
nAda3Dの提案
• 入力動画によって動的にモデルを決定
• 入力フレーム数
• 畳み込み処理
• 20%から50%の計算量の削減
• 様々な3Dモデルに適用可能
2D or not 2D? Adaptive 3D Convolution Selection for
Efficient Video Recognition
Hengduo Li¹, Zuxuan Wu²*, Abhinav Shrivastava¹, Larry S. Davis¹
¹University of Maryland  ²Fudan University
{hdli,abhinav,lsd}@cs.umd.edu  zxwu@fudan.edu.cn
Abstract
3D convolutional networks are prevalent for video recognition. While achieving excellent recognition performance on standard benchmarks, they operate on a sequence of frames with 3D convolutions and thus are computationally demanding. Exploiting large variations among different videos, we introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network. These policies are derived with a two-head lightweight selection network conditioned on each input video clip. Then, only frames and convolutions that are selected by the selection network are used in the 3D …
Figure 1: A conceptual overview of our approach. Ada3D learns to adaptively keep/discard 3D convolutional … (例: Barbequing, Breakdancing, Golf driving)
関連研究
n効率的な動画認識モデル
• TRN [Zhou+, ECCV2018]
• フレーム間の関係から推測
• TSM [Lin+, ICCV2019]
• 時間方向にチャネルをシフト
• X3D [Feichtenhofer, CVPR2020]
• NASによる探索
n動的な推論方法
• MARL [Wu+, ICCV2019]
• AdaFrame [Wu+, CVPR2019]
• 推論時に必要なフレームを動的に
決定しモデルに入力
[Wu+, CVPR2019]
手法
n選択ネットワーク
• 入力:動画クリップ 𝒱 ∈ ℝ^{T×H×W×C}(Tはフレーム数)
• 出力:m ∈ ℝ^T, n ∈ ℝ^K
• 𝐾:クラス識別器の層数
nクラス識別器 (3Dモデル)
• 入力: 𝒱′, v ∈ {0, 1}^K
• 𝒱′:𝒱からサンプリング
• u ∈ {0, 1}^Tでサンプリング
• mから得たベクトル
• v:2D畳み込みに変更する層
• nから得たベクトル
• 出力:クラス識別
(Figure 2の構成要素: Selection Network / Frame Policy / 3D Convolution Policy / Spatial Downsampling / 3D Video Model / 2D・3D)
Figure 2: An overview of our approach. Given an input clip, the selection network produces features for each frame in the clip, which are further aggregated uniformly to derive a frame usage policy and a convolution usage policy simultaneously. These policies activate a subset of frames and 3D convolutions in the 3D network for inference. Then, conditioned on the prediction, two rewards are computed to evaluate the frame and convolution policy, respectively. See texts for more details.
… (a.k.a. conditional computation) methods have been developed in the image domain, achieving reduced computation by dynamically selecting channels [2, 25], skipping layers [42, 11, 40, 37], performing early exiting with auxiliary structures [22, 18, 3, 46], adaptively switching input resolutions [27, 36, 46], etc. There are also a few recent studies exploring adaptive computation for videos. These approaches adaptively select salient clips for faster infer…
…dictions. To this end, we first revisit popular 3D networks used for temporal modeling in Sec. 3.1, and then elaborate different components of Ada3D in Sec. 3.2.
3.1. 3D Networks for Video Recognition
Operating on stacked RGB frames, 3D video models typically extend state-of-the-art 2D networks by replacing a number of 2D convolutions with 3D convolutions for tem…
識別器の畳み込み層は2Dと3Dを選択可能
(図中ラベル: 2Dフィルター / 3Dフィルター。実装イメージは下のスケッチも参照)
選択ネットワーク
nm, n = sigmoid(𝑓(𝒱; w))
• 𝑓:選択ネットワーク
• w:パラメータ
n方策
• 動画フレーム方策
• 𝜋_u(u | 𝒱) = ∏_{i=1}^{T} m_i^{u_i} (1 − m_i)^{1−u_i}
• 畳み込み方策
• 𝜋_v(v | 𝒱) = ∏_{k=1}^{K} n_k^{v_k} (1 − n_k)^{1−v_k}
n報酬関数
• R(x) = 1 − 𝒪(x)(正しく識別した場合)、R(x) = −γ(誤って識別した場合)
• x ∈ {u, v}
• 𝒪(x):クラス識別器の計算コスト
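上のベルヌーイ方策と報酬関数をそのままコードに落とすと、およそ次のようなスケッチになる(正規化コスト𝒪(x)を返す関数や γ=0.5 という値は仮定)。

```python
import torch
from torch.distributions import Bernoulli

def sample_policies(m: torch.Tensor, n: torch.Tensor):
    """m (B, T), n (B, K) から u, v をサンプリングし、log π_u(u|V), log π_v(v|V) も返す。"""
    dist_u, dist_v = Bernoulli(probs=m), Bernoulli(probs=n)
    u, v = dist_u.sample(), dist_v.sample()            # u ∈ {0,1}^T, v ∈ {0,1}^K
    logp_u = dist_u.log_prob(u).sum(dim=-1)            # 積 ∏ は log を取ると和になる
    logp_v = dist_v.log_prob(v).sum(dim=-1)
    return u, v, logp_u, logp_v

def reward(x: torch.Tensor, correct: torch.Tensor, cost_fn, gamma: float = 0.5):
    """R(x) = 1 - O(x)(正しく識別)、-γ(誤識別)。cost_fn は正規化済み計算コスト O(x) を
    返す関数、gamma=0.5 は例としての値(いずれも仮定)。"""
    cost = cost_fn(x)                                  # (B,)
    return torch.where(correct, 1.0 - cost, torch.full_like(cost, -gamma))
```

γを大きくすると誤識別のペナルティが強まり、計算量と認識率のトレードオフを調整できる、という関係に対応する。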
学習
n損失関数
• 選択ネットワーク:max_w ℒ = 𝔼_{u∼π_u, v∼π_v}[R(u) + R(v)]
• クラス識別器:min_θ − Σ_c y_c log(𝐅(𝒱; θ)_c)
• 𝐅:クラス識別器
• 𝜃:パラメータ
n学習方法
1. 識別器のみで学習(全ての層で3D畳み込み)
2. 識別器のパラメータを固定して選択ネットワークを学習
3. 選択ネットワークと識別器を同時に学習(下のスケッチ参照)
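3段階の学習手順のスケッチ(最適化器・コスト関数・識別器の引数名などは仮定。前のスケッチの sample_policies / reward を利用する)。

```python
import torch
import torch.nn.functional as F

def frame_cost(u):   # 仮定: 使用フレーム比率を正規化コスト O(u) の近似とする
    return u.mean(dim=-1)

def conv_cost(v):    # 仮定: 使用する3D畳み込みの比率を O(v) の近似とする
    return v.mean(dim=-1)

def step1_classifier(classifier, clip, label, optimizer):
    """段階1: 全層3D畳み込みのまま、クロスエントロピー(-Σ y_c log F(V;θ)_c)で識別器を学習。"""
    optimizer.zero_grad()
    loss = F.cross_entropy(classifier(clip), label)
    loss.backward()
    optimizer.step()

def step2_selection(selector, classifier, clip, label, optimizer, gamma=0.5):
    """段階2: 識別器を固定し、選択ネットワークを報酬最大化(REINFORCE)で学習。"""
    for p in classifier.parameters():
        p.requires_grad_(False)                       # 識別器のパラメータを固定
    m, n = selector(clip)
    u, v, logp_u, logp_v = sample_policies(m, n)      # 前のスケッチの関数を利用
    with torch.no_grad():
        logits = classifier(clip, frame_mask=u, conv_mask=v)   # 引数名は仮のもの
        correct = logits.argmax(dim=1).eq(label)
    r_u = reward(u, correct, frame_cost, gamma)
    r_v = reward(v, correct, conv_cost, gamma)
    loss = -(r_u * logp_u + r_v * logp_v).mean()      # max 𝔼[R(u)+R(v)] の符号反転
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 段階3: 識別器の固定を外し、step1(u, v を適用した順伝播)と step2 を交互に繰り返す想定。
```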
実験設定
nモデル
• クラス識別器
• R(2+1)D [Tran+, CVPR2018]ベース
• 選択ネットワーク
• MobileNetV2 [Sandler+, CVPR2018]
nデータセット
• ActivityNet [Heilbron+, CVPR2015]
• Fudan-Columbia Video Dataset (FCVID) [Jiang+, TPAMI2018]
• Mini-Kinetics-200 [Xie+, ECCV2018]
n学習
• 入力サイズ
• フレーム:元動画から8フレーム毎に8、16フレームをサンプリング
• 空間領域:256×256から224×224にランダムクロップ
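この前処理を素朴に書くと以下のようになる(256×256にリサイズ済みのフレーム列が手元にある、という仮定のスケッチ)。

```python
import torch

def sample_frames(video: torch.Tensor, num_frames: int = 8, stride: int = 8):
    """video: (T_total, C, H, W)。stride フレームおきに num_frames(8 または 16)枚を取り出す。"""
    idx = torch.arange(num_frames) * stride
    idx = idx.clamp(max=video.shape[0] - 1)   # 動画が短い場合は末尾フレームを繰り返す(仮定)
    return video[idx]

def random_crop(clip: torch.Tensor, size: int = 224):
    """clip: (T, C, 256, 256) を想定し、size×size をランダムに切り出す。"""
    H, W = clip.shape[-2:]
    top = torch.randint(0, H - size + 1, (1,)).item()
    left = torch.randint(0, W - size + 1, (1,)).item()
    return clip[..., top:top + size, left:left + size]

# 使用イメージ: clip = random_crop(sample_frames(video, num_frames=16))
```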
実験結果
nZ[["5:選択ネットワークなし
n>.#%'S:選択ネットワークの出力がランダム
n>.#%'S(PE:選択ネットワークの出力がランダムの状態で学習
n&56:提案手法 FCVID ACTIVITYNET MINI-KINETICS
mAP GFLOPs # 3D # Frame mAP GFLOPs # 3D # Frame Acc GFLOPs # 3D # Frame
8-frame per clip
Upper 82.1 58.6 5.0 8.0 82.6 58.6 5.0 8.0 79.0 58.6 5.0 8.0
Random 78.1 36.1 2.2 5.8 79.2 42.9 3.0 6.6 74.0 42.2 2.0 6.9
Random FT 80.7 36.1 2.2 5.8 81.1 42.9 3.0 6.6 77.4 41.5 1.9 6.8
Ours 81.9 35.6 2.2 5.7 82.6 42.2 3.1 6.6 78.9 42.4 1.9 6.9
16-frame per clip
Upper 84.4 117.3 5.0 16.0 84.4 117.3 5.0 16.0 79.6 117.3 5.0 16.0
Random 79.2 63.2 2.1 10.3 80.4 73.3 3.0 11.2 75.2 75.8 2.9 11.8
Random FT 82.0 65.3 2.1 10.6 82.8 71.3 3.0 11.1 78.2 78.0 2.9 12.0
Ours 84.3 66.6 2.1 10.7 84.0 70.1 3.0 11.1 79.2 73.8 2.9 11.8
Table 1: Recognition performance and computational cost of our method vs. baselines. Two input settings are experimented, i.e. 8-frame setting (Top) and 16-frame setting (Bottom). # 3D and # Frame denote the number of 3D convolutions and frames usage per input clip respectively, averaged over the entire test set. See texts for more details.
• 認識率を大きく下げることなく計算量の削減
• Randomと比べても認識率が高い
• FCVIDでは計算量を大きく削減
アブレーションスタディ
n選択ネットワークと計算量
• 報酬関数のハイパーパラメータを調整
• 計算量と認識率のトレードオフを調整
• 選択ネットワークはランダムよりも有効
• 重要なフレームを選択している
nバックボーンの変更
• クラス識別器:Slowonly [Feichtenhofer+, ICCV2019]
• データセット:FCVID
• 効率よくフレームを選択している
… the Random baseline by 3.5% to 5% mAP/accuracy on three datasets. Ada3D also outperforms Random FT by 1% to 2.5%. These results verify that Ada3D produces adaptive policies and allocates computational resources on a per-input basis to maintain recognition performance. It is worth noting that there are slight differences in computational savings on different datasets. This results from the fact that video categories in these datasets are different. For example, FCVID contains some classes of static objects and scenes like “bridge” and “temple”, and thus we observe more computational savings than ACTIVITYNET and MINI-KINETICS, which are more activity-focused; on MINI-KINETICS, where categories are motion-intensive, more computational resources are needed compared to FCVID and ACTIVITYNET.
Recognition with varying computational budgets. As discussed in Section 3.2, the choice of γ in Eqn. 4 adjusts …
… we use Ada3D as the backbone of AdaFrame to dynamically allocate computational resources conditioned on each input clip, as opposed to the original AdaFrame that uses the …
Figure 3: Recognition performance under different computational budgets controlled by γ.
Van → Ada Van → Ada Van → Ada
# Clip 3.0 → 2.9 5.0 → 4.2 10.0 → 7.4
mAP 77.8 → 78.1 80.3 → 80.5 81.9 → 82.0
Table 2: Extension to clip-level selection. Combining
Ada3D with AdaFrame [43] offers computational savings
for video-level aggregation. # Clip denotes number of clips
used per testing video; Van (Vanilla) and Ada denote our
method without and with AdaFrame, respectively.
… same amount of computation with a fixed backbone for all clips. Following [43], we train three variants of AdaFrame which operate on 3, 5, and 10 clips for different computational budgets. As demonstrated in Table 2, extending our approach with adaptive clip selection further decreases the computational cost while producing comparable performance with the Upper. For example, it reduces the number of clips sampled from each testing video from 10 to 7.4 and obtains an mAP of 82.0% that is on par with Upper (82.1%).
… networks. We use a more efficient 3D network architecture recently introduced in [9] termed as Slowonly and evaluate our approach. In particular, it only uses 3D convolutions in the 4-th and 5-th stage of a ResNet50, resulting in competitive recognition performance with less computational cost. As shown in Table 4, our method still obtains 20% to 40% savings in GFLOPs with similar recognition performance, indicating Ada3D is compatible with different 3D models. Our method by design is model-agnostic, for which we believe it could be complementary to recent work on designing efficient 3D models such as X3D [8] as well.
Table 4(Slowonlyバックボーン、FCVID):
Method mAP GFLOPs # 3D # Frame
Upper 82.6 54.5 2.0 8.0(8フレーム)
Ours 82.4 42.1 1.3 6.6(8フレーム)
Upper 83.5 109.1 2.0 16.0(16フレーム)
Ours 83.4 61.8 1.4 9.5(16フレーム)
アブレーションスタディ
n使用したフレーム
• フレーム間の動きが小さい動画は使用フレームが少ない
クラス例: Hurling, Book Binding, Barbequing, Dancing macarena, Capoeira, Somersaulting, Tai Chi, Playing Bass Guitar
3D usage / Frame usage の例: 2/3, 1/4, 2/4, 2/3, 4/8, 4/7, 5/8, 4/8
Figure 4: Qualitative results. Black mask indicates the frame is discarded. Left: Fewer 3D convolutions and frames are used for action classes and instances that are more “static”. Right: For motion-intensive instances, more computation …
まとめ
nAda3Dモデルの提案
• 入力動画によってモデルを動的に決定
• 使用フレーム
• 畳み込み処理
• 認識率を大きく下げることなく計算量の削減
• 20%から50%の削減
• 様々なモデルに適用可能
• 選択ネットワークの有効性
• ランダムとの比較
• 定性的な評価
補足資料
クラス識別器
n入力
• 𝒱からu ∈ {0, 1}^Tを用いてサンプリング
n畳み込み層
• v ∈ {0, 1}^Kで2D、3Dを決定
!"フィルター #"フィルター
.層
サンプリングした
フレーム
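補足として、uでフレームを絞り、vで各層を2D/3Dとして切り替える処理のイメージ(仮実装)。ここでは「2D化」を時間方向カーネルの中央スライスのみを使うことで近似しており、実際のAda3D(R(2+1)Dベース)の実装方法とは異なる可能性がある。また実際には不要フレームは計算から除外されるが、ここでは簡単のためゼロマスクで示す。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Switchable3DConv(nn.Module):
    """v_k に応じて、3D畳み込みとして使うか、時間カーネルの中央スライスのみ
    (実質2D畳み込み)として使うかを切り替える層のスケッチ(近似を含む仮実装)。"""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, use_3d: bool):
        # x: (B, C, T, H, W)
        if use_3d:
            return self.conv3d(x)
        w2d = self.conv3d.weight[:, :, 1:2]            # 時間方向カーネルの中央スライスのみ
        return F.conv3d(x, w2d, self.conv3d.bias, padding=(0, 1, 1))

def apply_frame_policy(clip: torch.Tensor, u: torch.Tensor):
    """clip: (B, C, T, H, W)、u: (B, T)。本来は u_i = 0 のフレームを計算から除外するが、
    ここでは簡単のためゼロマスクで表現する(仮定)。"""
    return clip * u[:, None, :, None, None]
```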
選択ネットワーク 勾配計算
n方策勾配定理と選択ネットワークの対応
• 目的関数:max_w ℒ = 𝔼_{u∼π_u, v∼π_v}[R(u) + R(v)]
• 勾配:∇_w ℒ = 𝔼[R(u) ∇_w log π_u(u|𝒱) + R(v) ∇_w log π_v(v|𝒱)]
• ρ(π):方策πでの報酬総和 ↔ 𝔼_{u∼π_u, v∼π_v}[R(u) + R(v)]
• θ:パラメータ ↔ w
• s:状態 ↔ 𝒱
• a:行動 ↔ u, v
• d^π(s):エピソード中に状態sを訪れた回数 ↔ エピソードがないので考えない
• Q^π(s, a):行動価値関数 ↔ R(u) + R(v)
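この勾配はいわゆるREINFORCE(スコア関数推定)であり、Bernoulliのlog_probを使えば次のように書ける(ベースラインなどの分散低減は省略した最小スケッチ)。

```python
import torch
from torch.distributions import Bernoulli

def reinforce_loss(m, n, u, v, r_u, r_v):
    """∇_w ℒ = 𝔼[R(u) ∇_w log π_u(u|𝒱) + R(v) ∇_w log π_v(v|𝒱)] に対応する損失。
    m, n は選択ネットワークの出力(w まで勾配が流れる)、u, v は識別に用いたサンプル、
    r_u, r_v はその報酬(定数として扱う)。"""
    logp_u = Bernoulli(probs=m).log_prob(u).sum(dim=-1)   # log π_u(u|𝒱)
    logp_v = Bernoulli(probs=n).log_prob(v).sum(dim=-1)   # log π_v(v|𝒱)
    return -(r_u.detach() * logp_u + r_v.detach() * logp_v).mean()  # 最大化 -> 符号反転

# m, n が requires_grad を持てば loss.backward() で選択ネットワークのパラメータ w に勾配が伝わる
```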
実験結果(Kinetics400)
nモデル
• 選択ネットワーク
• Mini-Kineticsで学習
• クラス識別器
• Mini-Kineticsで学習後、Kineticsでファインチューニング
nデータセット
• W*#"7*Q6Y@@
… our approach with adaptive clip selection further decreases the computational cost while producing comparable performance with the Upper. For example, it reduces the number of clips sampled from each testing video from 10 to 7.4 and obtains an mAP of 82.0% that is on par with Upper (82.1%). Additionally, we believe our method is also complementary to other clip selection methods leveraging multi-modal inputs such as audio [20, 12], as well as adaptive spatial resolution modulating methods [26, 36, 46].
Method Acc GFLOPs # 3D # Frame
Upper 73.1 58.6 5.0 8.0
Ours 72.8 43.7 2.3 6.9
Table 3: Transferring learned policies. We fine-tune …
学習方法
nTr:選択ネットワークのみ学習
nFT:選択ネットワークとクラス識別器の同時学習
n同時学習で識別器が選択ネットワークに適したモデルになる
nスクラッチから学習の際、目的関数を最初から複数にしない方が良い
FCVID ACTIVITYNET
Tr FT mAP GFLOPs mAP GFLOPs
Upper 78.1 58.6 76.4 58.6
(なし) 72.3 36.1 71.1 42.9
Tr✓ 75.9 34.7 74.3 38.1
FT✓ 76.5 34.9 75.1 37.6
Tr✓ FT✓ 77.8 35.6 76.1 42.2
Table 5: Ablation on the effectiveness of two training stages.
選択ネットワークの寄与
n3D:畳み込み方策
nFrame:動画フレーム方策
nいずれか一方で認識率の向上
n両方適用で認識率がさらに向上
… stage (i.e., directly training the selection network with the 3D model jointly) leads to a lower recognition performance (76.5 vs. 77.8). We posit the reason is that adding another objective (the classification loss) while training the selection network from random initialization further increases the instability of network learning under such a reinforcement learning setting; and thus the selection network converges to sub-optimal policies.
FCVID ACTIVITYNET
3D Frame mAP GFLOPs mAP GFLOPs
Upper 78.1 58.6 76.4 58.6
(なし) 75.5 35.6 74.3 42.3
✓(一方のみ) 76.8 35.3 75.3 41.1
✓(一方のみ) 76.3 35.5 74.8 43.5
3D✓ Frame✓ 77.8 35.6 76.1 42.2
Table 6: Ablation on the usefulness of 3D convolution usage and frame usage policies.
