動作認識の最前線：手法，タスク，データセット

動作認識の最前線
手法，タスク，データセット
玉木徹（名工大）
2022/11/18
精密工学会画像応用技術専門委員会 2022/11/18

動作認識とは
n人間の動作（action）を認識
する
• Action Recognition (AR)
n動画像の識別
• 「人間が動作している映像」に限
らない
• 人間に限定する手法もある
• 人物検出や姿勢推定を併用
n入力：動画像
• 時間方向の次元が増える
• 時間情報，動き情報のモデル化が
必要
モデル
カテゴリ
画像
モデル
カテゴリ
動画像
時間
画像認識
動作認識

M. S. Hutchinson, V. N. Gadepally: Video Action Understanding
メジャーなタスク
Video Action Understanding [Hutchinson&Gadepally, IEEE Access, 2021]

trimmed, untrimmed
n untrimmed video
• YouTubeなどから取得
• 元動画の長さはバラバラ（数分〜）
• タスク：アクション区間検出など
n trimmed video（クリップ）
• untrimmed videoからアクション部分を抽出
• 数秒〜10秒程度の動画
• タスク：アクション認識など
n 注意
• YouTube動画はそもそも投稿者が切り取っ
た動画
• 同じ動画でも
• 動画単位で識別する時：trimmed
• 動画の区間を検出する時：untrimmed
• と呼ぶ場合もある（UCF101-24がそれ）
FIGURE 2. An overview of the main action understanding problems. Video is depicted as a 3D volume
temporal dimension (left-to-right). Action recognition (upper left) shows how an action class label is a
Action prediction (upper right) shows how an action class label is assigned to a yet unobserved or onl
action proposal (middle left) shows how temporal regions of likely action are bounded by start and en
(middle right) shows how action class labels are assigned to temporal regions of likely action that are
untrimmed
trimmed
事前に
トリミング

関連タスク
nAction recognition
• zero-shot (ZSAR)
• low-res video, low-quality video
• compressed video
nCaptioning
• video captioning：trimmed video
captioning
• dense captioning：untrimmed video
temporal localization + captioning
nVideo QA
nVideo object segmentation
• VOS
nTracking
• Video Object Tracking (VOT)
• Multiple Object Tracking (MOT)
nVideo summarization
nMore tasks & datasets
• xiaobai1217 / Awesome-Video-
Datasets

Challenges, contests, competitions
nActivityNet challenges
• 2016, 2017, 2018, 2019, 2020, 2021
nLOVEU (LOng-form VidEo Understanding)
• 2021, 2022
nDeeperAction
• Localized and Detailed Understanding of Human Actions in Videos
• 2021, 2022

ActivityNet challenges
n 2016
• untrimmed video classification
• temporal activity localization
n 2017
• untrimmed video classification
• trimmed video classification (Kinetics)
• temporal action proposals
• temporal action localization
• event dense captioning
n 2018
• ActivityNet Task
• Guest tasks
• trimmed activity recognition (Kinetics)
• spatio-temporal action localization (AVA)
• trimmed event recognition (MiT)
n 2019
• Guest tasks
• EPIC-Kitchens
• Activity detection (ActEV-PC)
n 2020
• Guest tasks
• object localization (ActivityNet Entities)
• EPIC-Kitchens
• Activity detection (ActEV SDL)
• temporal action localization (HACS)
n 2021
• Action recognition
• Kinetics700
• TinyActions
• Temporal localization
• ActicityNet
• HACS
• SoccerNet
• Spatio-Temporal localization
• AVA-Kinetics & Active speakers
• ActEV SDL UF
• Event understanding
• Event dense captioning (ActivityNet)
• object localization (ActivityNet Entities)
• Video Semantic Role Labeling (VidSitu)
• Multi-view, cross-modal
• MMAct
• HOMAGE
n 2022
• Temporal localization (ActivityNet)
• Event dense captioning (ActivityNet)
• AVA-Kinetics & Active speakers
• ActEV SRL
• SoccerNet
• TinyActions
• HOMAGE

メジャーなデータセット
TABLE 2. Thirty historically influential, current state-of-the-art, and emerging benchmarks of video action datasets. Tabular information includes dataset
name, year of publication, citations on Google Scholar as of May 2021, number of action classes, number of action instances, actors: human (H) and/or
non-human (N), annotations: action class (C), temporal markers (T), spatiotemporal bounding boxes/masks (S), and theme/purpose.

Action recognition datasets
KTH
Weizmann
UCF11
UCF50
UCF101
UCF101-24
(THUMOS13)
THUMOS14
THUMOS15
2005 2010 2012 2015 2017 2019 2020
Kinetics
400
Kinetics
600
Kinetics
700
Kinetics
700
-2020
ActivityNet
v1.3
MiT
Multi
THUMOS
Multi-MiT
HVU
AVA
Actions
AVA-Kinetics
HACS
Segment
SSv1
HMDB51
JHMDB21
Charades
Action Genome
Home
Action
Genome
Visual Genome
Transformer
Ego-4D
ActivityNet
v1.2
SSv2
Charades-Ego
Jester HAA500
FineGym
Hollywood2
MPII Composites MPII Cooking 2 Diving48
2021 2022
2016 2018
2013 2014
2011
2006 2007 2008 2009
2004
Action label(s) per video/clip
Temporal annotation
Spatio-Temporal annotation
HACS
Clip
MPII Cooking
YouYube-8M
HowTo
100M
ImageNet AlexNet ResNet
U-Net
EPIC-
KITCHENS55
EPIC-
KITCHENS100
SURF
FAST
ORB GAN ViT
HOG
YouYube-8M
Segments
ver2016 ver2017 ver2018
Hollywood
Olympic
Sports
IXMAS
DALY
Sports-1M
UCF
Sports
machine generated labels
multi labels
untrimmed YouTube videos
worker/lab videos
YouCook2
YouCook
Hollywood
-Extended
Breakfast
50Salads
Coffee
Cigarettes
ActivityNet
Entities
Kinetics
100
Mini-Kinetics
200

KTH
nスウェーデン王立工科大学（Kungliga Tekniska Högskolan）
n6カテゴリ，2391動画，25fps，固定カメラ，平均4秒，モノクロ
• 25人，4シーン：屋外（スケール変化あり・なし，異なる服装），屋内
[Schuldt+, ICPR2004]
s1
Walking Jogging Running Boxing Hand waving Hand clapping
s2
s3
s4

Weizmann
nWeizmann Institute of Science
（ワイツマン科学研究所）
n固定カメラ
• 2005: 9カテゴリ，81動画
（180x144, 25fps）
• 2007: 10カテゴリ，90動画
（180x144, deinterlaced 50fps）
periodic actions in the same framework as well as to com-
pensate for different length of periods, we used a sliding
window in time to extract space-time cubes, each having 10
[Blank+, ICCV2005] [Gorelick+, TPAMI2007]

IXMAS
nInria XmasMotion Acquisition
Sequences
nIXMAS [Weinland+, CVIU2006]
• 11カテゴリ
• 10人（男5女5）が3回ずつ動作
• 計330動画
nIXMAS Actions with Occlusions
[Weinland+, ECCV2010]
• 11カテゴリ
• マルチビュー
• オクルージョン
• 1148動画
[Weinland+, CVIU2006]
Check watch Cross arms Scratch head
Sit down Get up Turn around
Walk Wave Punch
Kick Pick up
Fig. 6. 11 actions, performed by 10 actors.
[Weinland+, ECCV2010]
2 Daniel Weinland, Mustafa Özuysal, and Pascal Fua
Fig. 1. We evaluate our approach on several datasets. (Top) Weizmann, KTH, and
UCF datasets. (Middle) IXMAS dataset, which contains strong viewpoint changes.
(Bottom) Finally, to measure robustness to occlusions, we evaluate the models learned

Coffee and Cigarettes
n最初のin-the-wildデータセット
• KTH, Weizmannなどは制御下撮影
n映像
• 映画Coffee and Cigarettes (2003)
• 計36,000フレーム
nタスク
• 識別：2クラス
• drinking 105サンプル
• smoking 141サンプル
• spatio-temporal action localization
• 3Dのbboxアノテーション
• 区間長：30〜200フレーム，平均
70フレーム
[Laptev&Pérez, ICCV2007]
https://www.di.ens.fr/~laptev/download.html
https://www.di.ens.fr/~laptev/eventdetection/eventann01/
https://web.archive.org/web/20080219232702/http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

Hollywood
nHollywood Human Actions (HOHA)
• 動画のアノテーションは大変，そこで台本
と字幕が利用できる映画を利用する
• 台本はwebで入手可能
• 時刻のある字幕と時刻のない台本とを対応
付ける
• 識別器を学習して，台本の文章を8クラスの
アクションに変換することでラベル付け
• 学習データセット：12映画
• clean/manual：手動でラベル付け
• automatic：識別器でラベル付け，1000フ
レーム以下の動画
• テストデータセット：20映画
• 手動でラベル付け
[Laptev+, CVPR2008]
Ivan Laptev Marcin Marszałek Cordelia Schmid Benjamin Rozenfeld
INRIA Rennes, IRISA INRIA Grenoble, LEAR - LJK Bar-Ilan University
ivan.laptev@inria.fr marcin.marszalek@inria.fr cordelia.schmid@inria.fr grurgrur@gmail.com
Abstract
The aim of this paper is to address recognition of natural
human actions in diverse and realistic video settings. This
challenging but important subject has mostly been ignored
in the past due to several problems one of which is the lack
of realistic and annotated video datasets. Our first contri-
bution is to address this limitation and to investigate the
use of movie scripts for automatic annotation of human ac-
tions in videos. We evaluate alternative methods for action
retrieval from scripts and show benefits of a text-based clas-
sifier. Using the retrieved action samples for visual learn-
ing, we next turn to the problem of action classification in
video. We present a new method for video classification
that builds upon and extends several recent ideas including
local space-time features, space-time pyramids and multi-
channel non-linear SVMs. The method is shown to improve
state-of-the-art results on the standard KTH action dataset
by achieving 91.8% accuracy. Given the inherent problem
of noisy labels in automatic annotation, we particularly in-
vestigate and show high tolerance of our method to annota-
tion errors in the training set. We finally apply the method
to learning and classifying challenging action classes in
Figure 1. Realistic samples for three classes of human actions:
kissing; answering a phone; getting out of a car. All samples have
been automatically retrieved from script-aligned movies.
dividual variations of people in expression, posture, motion
and clothing; perspective effects and camera motions; illu-

Hollywood2
nHollywoodの拡張
• 作成方法は同じ
• カテゴリ：
• アクションが12クラスに
• シーン10クラスも追加
• Automatic train set：33映画
• 識別器でラベル付け
• 動画数：action 810, scene 570
• Actionとsceneで別サンプル
• Clean test set：36映画
• 手動でラベル付け
• 動画数：action 570, scene 582
[Marszalek+, CVPR2009]
Marcin Marszałek
INRIA Grenoble
marcin.marszalek@inria.fr
Ivan Laptev
INRIA Rennes
ivan.laptev@inria.fr
Cordelia Schmid
INRIA Grenoble
cordelia.schmid@inria.fr
Abstract
This paper exploits the context of natural dynamic scenes
for human action recognition in video. Human actions
are frequently constrained by the purpose and the physi-
cal properties of scenes and demonstrate high correlation
with particular scene classes. For example, eating often
happens in a kitchen while running is more common out-
doors. The contribution of this paper is three-fold: (a) we
automatically discover relevant scene classes and their cor-
relation with human actions, (b) we show how to learn se-
lected scene classes from video without manual supervision
and (c) we develop a joint framework for action and scene
recognition and demonstrate improved recognition of both
in natural video. We use movie scripts as a means of auto-
matic supervision for training. For selected action classes
we identify correlated scene classes in text and then re-
trieve video samples of actions and scenes for training using
script-to-video alignment. Our visual models for scenes and
actions are formulated within the bag-of-features frame-
work and are combined in a joint scene-action SVM-based
classifier. We report experimental results and validate the
(a) eating, kitchen (b) eating, cafe
(c) running, road (d) running, street
Figure 1. Video samples from our dataset with high co-occurrences
of actions and scenes and automatically assigned annotations.
automatically discover correlated scene classes and to use
this correlation to improve action recognition. Since some
actions are relatively scene-independent (e.g. “smile”), we
do not expect context to be equally important for all ac-
tions. Scene context, however, is correlated with many ac-

Hollywood-Extended
nアクションの順序
• 69映画
• HollyWood2を利用
• 16アクションを全フレームに付
与
• 異なるアクションが連続して
いる部分をクリップとして抽
出
• 全937クリップ
• 更に10フレーム毎の区間に分
割（1クリップ平均84区間）
[Bojanowski+, ECCV2014]

HMDB51
nHuman Motion DataBase
• ソース：Digitized movies, Prelinger
archive, YouTube and Google videos,
etc
• 既存のUCF-SportsやOlympicSportsは
ソースがYoutubeのみ，アクションが
曖昧，人の姿勢で分かってしまう
n 51カテゴリ，6766動画
• 各カテゴリ最低101
• 1〜5秒程度，平均3.15秒程度
• 3スプリット
• train 70% (3570), test 30% (1530)
• not using clips (1530)
• 高さ240pix, 30fps
• DivX5.0 (ffmpeg), avi
• 動き補正済み（stabilization）
n構築
• main actorが最低60pix
• 最短でも1秒
• 1クリップ1action
n配布
• rarアーカイブ
• unrar必要
• 動き補正
• あり・なし
[Jhuang+, ICCV2011]
HMDB: A Large Video Database for Human Motion Recognition
H. Kuehne
Karlsruhe Instit. of Tech.
Karlsruhe, Germany
kuehne@kit.edu
H. Jhuang E. Garrote T. Poggio
Massachusetts Institute of Technology
Cambridge, MA 02139
hueihan@mit.edu, tp@ai.mit.edu
T. Serre
Brown University
Providence, RI 02906
thomas serre@brown.edu
Abstract
With nearly one billion online videos viewed everyday,
an emerging new frontier in computer vision research is
recognition and search in video. While much effort has
been devoted to the collection and annotation of large scal-
able static image datasets containing thousands of image
categories, human action datasets lag far behind. Cur-
rent action recognition databases contain on the order of
ten different action categories collected under fairly con-
trolled conditions. State-of-the-art performance on these
datasets is now near ceiling and thus there is a need for the
design and creation of new benchmarks. To address this is-
sue we collected the largest action video database to-date
with 51 action categories, which in total contain around
7,000 manually annotated clips extracted from a variety of
sources ranging from digitized movies to YouTube. We use
this database to evaluate the performance of two represen-
tative computer vision systems for action recognition and
explore the robustness of these methods under various con-
ditions such as camera motion, viewpoint, video quality and
occlusion.
Figure 1. Sample frames from the proposed HMDB51 [1] (from

JHMDB21
nJoint-annotated Human Motion
DataBase
• HMDB51から21カテゴリ928 video
を抽出
• アクションの開始終了でトリミング
• 各クリップ15-48フレーム
• 各フレームでアノテーション
• スケール，姿勢，人物マスク，フ
ロー，カメラ視点
• 2Dパペットモデル[Zuffi&Black,
MPII-TR-2013][Zuffi+, ICCV2013]
を利用
nBboxアノテーション
• オリジナルにはない
• [Li+, ECCV2020]がGitHubリポジトリ
経由でGooge Driveで配布
• フレーム数が最大40にトリミング
[Jhuang+, ICCV2013]
low level
(e) baseline
(a) image (b) puppt flow
(f) given puppt flow
(c) puppet mask
(g) given puppet mask
(d) joint positions and relations
(h) given joint positions
u
v
mid level high level
Figure 1. Overview of our annotation and evaluation. (a-d) A video frame annotated by a puppet model [36]. (a) image frame, (b) puppet

UCF11/50/101/Sports
nUniversity of Central Florida
• UCF11(YouTube Action Dataset) [Liu+, CVPR2009]
• YouTubeの動画 = "in the Wild"
• 11クラス，1168動画
• 25グループ（背景で識別するのを防ぐために，同じグループには同じ環境や
撮影者を含まない）
• UCF50 [Reddy&Shah, MVAP2012]
• 50クラス，6676動画
• UCF11はUCF50のサブセット
• UCF101 [Soomro+, arXiv, 2012]
• UCF50はUCF101のサブセット
• UCF-Sports [Rodriguez+, CVPR2008][Soomro&Zamir, 2014]

UCF101
n University of Central Florida
• ソースはYouTube
• 手動でクリーニング
• 既存のHMDB51やUCF50の50カテゴリ
程度では少ない
n 101カテゴリ，13,320動画
• UCF50に51カテゴリ追加
• 追加カテゴリの動画には音声あり
• 25グループに分割
• 25-fold cross validation用
• 各グループで各カテゴリに4-7動画
• 3スプリット
• train 9537, test 3783
• 320x240pix, 25fps, DivX, avi
n 配布
• rarアーカイブ（unrar必要）
n 最短1.06秒，最長71.04秒，平均
7.21秒
• Kineticsなどaction recognition・
temporal localizationタスクではtrimmed
video clip扱い
• JHMDBなどst-localizationタスクでは
untrimemd video扱い
[Soomro+, arXiv, 2012]
Figure 3. Number of clips per action class. The distribution of clip durations is illustrated by the colors.
Dhol, Playing Flute, Playing Sitar, Rafting, Shaving Beard,
Shot put, Sky Diving, Soccer Penalty, Still Rings, Sumo
Wrestling, Surfing, Table Tennis Shot, Typing, Uneven Bars,
Wall Pushups, Writing On Board}. Fig. 2 shows a sample
frame for each action class of UCF101.
Clip Groups: The clips of one action class are divided
into 25 groups which contain 4-7 clips each. The clips in
one group share some common features, such as the back-
ground or actors.
The bar chart of Fig. 3 shows the number of clips in
each class. The colors on each bar illustrate the durations
of different clips included in that class. The chart shown in
Fig. 4 illustrates the average clip length (green) and total
duration of clips (blue) for each action class.
Actions 101
Clips 13320
Groups per Action 25
Clips per Group 4-7
Mean Clip Length 7.21 sec
Total Duration 1600 mins
Min Clip Length 1.06 sec
Max Clip Length 71.04 sec
Frame Rate 25 fps
Resolution 320⇥240
Audio Yes (51 actions)
Table 1. Summary of Characteristics of UCF101

Kinetics
n the DeepMind Kinetics human action video dataset
• Kinetics-400 [Kay+, arXiv2017] [Carreira&Zisserman, CVPR2017]
• Kinetics-600 [Carreira+, arXiv2018]
• Kinetics-700 [Carreira+, arXiv2019]
• Kinetics-700-2020 [Smaira+, arXiv2020]
• "the 2020 edition of the DeepMind Kinetics human action video dataset"
n ポリシー：1動画から1クリップ
• HMDBやUCFは1動画から複数クリップ
n 配布
• 公式にはYoutubeリンクのみ（＋時刻）
• youtube-dlでダウンロード
• ffmpegでトリミング
• ただし約5%/yearでリンク消失
• 動画の配布あり
• Activity-Net challenge用に配布
• CVD FoundationのgithubリポジトリにあるAmazon S3のリンクから各自でダウン
ロード
A Short Note on the Kinetics-70
Lucas Smaira
lsmaira@google.com
João Carreira
joaoluis@google.com
Amy Wu
amybwu@google.com
Dataset # classes Average Minimum
Kinetics-400 400 683 303
Kinetics-600 600 762 519
Kinetics-700 700 906 532
Kinetics-700-2020 700 926 705
Table 1: Statistics on the number of video clips per class for
different Kinetics datasets as of 14-10-2020.
Abstract
We describe the 2020 edition of the DeepMind Kinetics
4v1
[cs.CV]
21
Oct
2020
[Smaira+, arXiv2020]

Kinetics-400
nKinetics400, K400
• ソースはYoutube
• train 22k, val 18k, test 35k
• 動画は最長10秒
• サイズとfpsはバラバラ
• top1とtop5で評価
• 1動画に複数のactionが入っていることもあ
るがアノテーションは1つだけだから
• ImageNetも同じコンセプトか？
[Kay+, arXiv2017]
(a) headbanging
(c) shaking hands
(e) robot dancing

Kinetics-400
n 構築手順
• アクション名：既存のHMDB, UCF101, ActivityNetなどから再利用＋ワーカーが思いつ
いたもの
• （自動）候補ビデオ：アクション名でYouTubeのタイトルを検索
• （自動）アクションの抽出：アクション名でGoogle画像検索して出てきた画像で識別
器を構築し，候補ビデオをフレーム単位で識別，スコアの高い時刻の前後5秒で10秒ク
リップを作成（ただし動画の最初や最後だったら10秒にならない）
• 実験では短いビデオは予め複数回ループしてある
• （手動）AMTワーカーがクリップを（音なしで映像だけで）判断．
• 本物を混ぜてワーカーの質を評価しておく
• 3人以上のワーカーがOKならそのクリップを採用
• （手動）フィルタリング
• 重複ビデオの削除（Inception-V1のスコアでクラスタリング，各クラスタから1つ採
用）
• カテゴリを統廃合
• Two-Streamモデルのスコアを使ってソートして更にフィルタリング
[Kay+, arXiv2017]

Kinetics-600
nKinetics600, K600
• K400とほぼ同様
• test setが2つ
• standard test set
• ラベル公開
• 論文の数値はこれを使うこと
• held-out test set
• ラベル未公開
• Activity-Net challenge用
• K400 test setのラベルも同時公開
nK400からの変更
• カテゴリ名の選出
• Google知識グラフ
• YouTubeの検索補完
• 候補動画の検索
• 英語だけでなくポルトガル語でも
（ブラジルで使われているので言語
人口のtop2，実は著者もポルトガル
語ネイティブ）
• 検索には重み付きN-gramを利用
• 多言語でも可能
• タイトルとメタデータで検索
• カテゴリ
• K400の368カテゴリを再利用
• 残りの32は名称変更，分割，削除
• いくつか動画を移動
• K400 valからK600 testへ
• K400 trainからK600 valへ
[Carreira+, arXiv2018]
Version Train Valid. Test Held-out Test Total Train Total Classes
Kinetics-400 [6] 250–1000 50 100 0 246,245 306,245 400
Kinetics-600 450–1000 50 100 around 50 392,622 495,547 600

Kinetics-700
n Kinetics700, K700
• held-out test setは廃止
• 論文はstandard val setで評価
• test setはActivity-Net challenge用
n K600からの変更点
• カテゴリ
• K600の597カテゴリを再利用
• いくつかは分割
• fruitをapplesとblueberriesに
• 最近のEPIC-KitchenとAVAから採用流
用
• 創造的なカテゴリを追加
• スライムを作る，無重力
• 候補動画の検索
• クラス名の検索キーワードを分離，ク
ラス名以外でも検索
• 英語とポルトガル語に加えてフランス
語とスペイン語も（言語人口のtop4）
で検索
• 5分以上の動画は除外
• 最終判断
• K400とK600は自分たちで最終判断を
した
• K700では最終判断もワーカーまかせ
n 収集方法の補足
• クリップ中に動作が持続するアクション
には適している（ギターを引く，ジャグ
リングする）
• 動作が開始・終わりの時間的要素を持つ
ものは難しい（皿を落とす，車を降り
る）
n 最終ゴール
• 1000クラスのデータセット作成
[Carreira+, arXiv2019]
Version Train Valid. Test Held-out Test Total Train Total Classes
Kinetics-400 [7] 250–1000 50 100 0 246,245 306,245 400
Kinetics-600 [2] 450–1000 50 100 around 50 392,622 495,547 600
Kinetics-700 450–1000 50 100 0 545,317 650,317 700

Kinetics-700-2020
nKinetics700-2020
nK700からの変更点
• カテゴリは変更なし
• K700のうち動画数の少ない123カテゴ
リの動画数を最低700に増やす
• K700の他のカテゴリにも動画を継ぎ
足す
• 年5%の割合でYouTube動画が消え
ていくため
n実験
• 各カテゴリの動画数を増やすと性能は
向上する
[Smaira+, arXiv2020]
Dataset & split # clips # clips 14-10-2020 % retained
Kinetics-400 train 246,245 220,033 89%
Kinetics-400 val 20,000 18,059 90%
Kinetics-400 test 40,000 35,400 89%
Kinetics-600 train 392,622 371,910 95%
Kinetics-600 val 30,000 28,366 95%
Kinetics-600 test 60,000 56,703 95%
Kinetics-700 train 545,317 532,370 98%
Kinetics-700 val 35,000 34,056 97%
Kinetics-700 test 70,000 67,302 96%
Kinetics-700-2020 train 545,793 – –
Kinetics-700-2020 val 34,256 – –
Kinetics-700-2020 test 67,858 – –
Table 2: The number of original (left) and current (right) available video clips in the various Kinetics datasets.
ing milk’, ’tasting wine’, ’vacuuming car’. In order to filter
those clips from the final dataset, we cluster them and look
at individual clusters gifs removing duplicates. A final fil-
tering is also done to make sure clips belong to the correct
class.
Geographical diversity. We provide an analysis of the geo-
graphical distribution of the videos in the final dataset at the
granularity of continents. The location is assigned based on
where the video was uploaded from. The results are shown
in table 3 based on the fraction of videos containing that
information (around 90%).
Geographical diversity increased slightly over the years,
especially the percentage of videos from Latin America,
Figure 1: Performance of an I3D model with RGB in
on the Kinetics-700-2020 dataset using different nu
2020/10/14時点でまだ
入手できる割合
Dataset & split # clips # clips 14-10-2020 % retained
Kinetics-400 train 246,245 220,033 89%
Kinetics-400 val 20,000 18,059 90%
Kinetics-400 test 40,000 35,400 89%
Kinetics-600 train 392,622 371,910 95%
Kinetics-600 val 30,000 28,366 95%
Kinetics-600 test 60,000 56,703 95%
Kinetics-700 train 545,317 532,370 98%
Kinetics-700 val 35,000 34,056 97%
Kinetics-700 test 70,000 67,302 96%
Kinetics-700-2020 train 545,793 – –
Kinetics-700-2020 val 34,256 – –
Kinetics-700-2020 test 67,858 – –
Table 2: The number of original (left) and current (right) available video clips in the various Kinetics datasets.
ing milk’, ’tasting wine’, ’vacuuming car’. In order to filter
those clips from the final dataset, we cluster them and look
at individual clusters gifs removing duplicates. A final fil-
tering is also done to make sure clips belong to the correct
class.
Geographical diversity. We provide an analysis of the geo-
graphical distribution of the videos in the final dataset at the
granularity of continents. The location is assigned based on
where the video was uploaded from. The results are shown
in table 3 based on the fraction of videos containing that
information (around 90%).
# training examples
20
40
60
80
100 200 300 400 500 600 700
Validation-Top1 Test-Top1 Validation-Top5 Test-Top5

Kinetics-200 / Mini-Kinetics
nS3D [Xie+, ECCV2018]で提案
• 200カテゴリからなるサブセット
• train：各クラスからランダムに400サンプル
• val：25サンプル
• train/val：80k/5k
[Xie+, ECCV2018]

Kinetics-100
nVideoSSL [Jing+, WACV2021]で提案
• 各クラス最低700サンプル（train）を持つ100クラスからなるサブセット
[Jing+, WACV2021]

Mimetics [Weinzaepfel&Rogez, IJCV2021]
nコンテキスト無視のアクション動画
• 713動画
• 例（右図）
• 部屋の中でサーフィン
• パントマイム
• サッカー場でボーリング
• 50クラス（Kinetics400のサブセット）
• 評価専用，学習には使用しない
• 使い方：
• Kinetics400の学習セットで学習
• Mimeticsの50クラスで評価
[Weinzaepfel&Rogez, arXiv2019]

SSv2
nsomething-something
• v1 [Goyal+, ICCV2017]
• v2：SSv2, sth-sth-v2（v2が主流）
• 元は20BN (Twenty Billion Neurons Inc.)
• 現在はQualcomm
• 2021/7にQualcommが20BNを買収
n動画
• v1：108,499（平均4.03秒）
• v2：220,847
• ラベル数174（テンプレート文の数）
• v2はwebm形式（v1はフレームjpeg?）
[Goyal+, ICCV2017] [Mahdisoltani+, arXiv2018]
Putting a white remote into a cardboard box
Pretending to put candy onto chair
Pushing a green chilli so that it falls off the table
Moving puncher closer to scissor
Figure 4: Example videos and corresponding descriptions. Object entries shown in italics.
To obtain useful natural-language labels, but also be able
to train, and potentially bootstrap, networks to learn from
the data, we generate natural language descriptions auto-
matically by appropriately combining partly pre-defined,
and partly human-generated, parts of speech. Natural lan-
guage descriptions take the form of templates that crowd
and high-level concepts.
However, it is possible to increase the degree of com-
plexity as well as the sophistication of language over time
as the dataset grows. This approach can be viewed as “cur-
riculum learning” [2], where simple concepts are taught
first, and more complicated concepts are added progres-

SSv2
nコンセプト
• アクションラベルではなく，名詞・動詞のパターンを理解するべき
• テンプレート文を準備
• "Dropping [something] into [something]"
• "Stacking [number of] [something]"
• 「something」はアクション対象の物体名が入るプレースホルダー
• テンプレート文の種類が174（これがラベル数）
• 物体名は多い（v2 trainで23688）
n構築
• Charadesと同様：ワーカーが撮影
• 名詞と動詞を多数用意，でも組み合わせは膨大
• ワーカーにテンプレート文を提示（プレースホルダーあり）
• ワーカーはvideoを撮影し，映像提出時には物体名も提出
[Goyal+, ICCV2017] [Mahdisoltani+, arXiv2018]

Jester
nジェスチャ認識データセット
• 元は20BN，現在はQualcomm
（SSv2と同じ）
n動画
• 148,092 clips
• train/val/testは8:1:1
• AMTワーカー 1376人が参加
• 27ジェスチャカテゴリ
• 「No gesture」と「Doing Other
Things 」も含む
• 12fps，高さ100pix，幅可変
• 平均36フレーム（3秒）
[Materzynska+, ICCVW2019]
Figure 3. Examples from the Jester Dataset. Classes presented from the top; ’Zooming Out With Two Fingers’, ’Rolling Hand Backward’,
’Rolling Hand Forward’. Videos are different with respect to the person, background and lighting conditions.

MiT
nMoments in Time
• 英語で「その時，あの瞬間」
• 3秒クリップ
• train727k, val30k
• （論文ではtrain802k, val34k, test68k）
• 論文の副題が「one million videos」
• 305カテゴリ（論文では339）
• visualとaudibleの両方の動画
• 音だけで静止画のクリップもある
• 多様なソース
• Youtube, Flickr, Vine, Metacafe, Peeks,
Vimeo, VideoBlocks, Bing, Giphy, The
Weather Channel, and Getty-Images
[Monfort+, TPAMI2019]
This article has been accepted for publication in a future issue of this journal, but has not be
Transaction
Fig. 1: Sample Videos. Day-to-day events can hap
scales. Moments in Time dataset has a significant in

MiT
n構築
• カテゴリ名：VerbNetの頻出4500動
詞をクラスタリング
• 動画検索：多様なソースからクラス
名をメタデータで検索
• ランダムな3秒クリップを切り出す
• 3秒を検出するモデルを使うとバ
イアスがかかるため
• AMTワーカーが棄却
n配布
• フォームに記入して提出
• zipをダウンロード（275GB）
• ソースが多様なためリンク配布で
は無理？

Multi-MiT
nMiTの拡張
• マルチラベル
• 2ラベル以上が553k
• 3ラベル以上が275k
• 動画数1M
• train997k, val9.8k
• MiTに追加
• カテゴリ数292
• MiTからマージ・追加・削除
F
Abstract—Videos capture events that typically contain multiple sequen-
tial, and simultaneous, actions even in the span of only a few seconds.
However, most large-scale datasets built to train models for action
recognition in video only provide a single label per video. Consequently,
models can be incorrectly penalized for classifying actions that exist
in the videos but are not explicitly labeled and do not learn the full
spectrum of information present in each video in training. Towards this
goal, we present the Multi-Moments in Time dataset (M-MiT) which
includes over two million action labels for over one million three second
videos. This multi-label dataset introduces novel challenges on how to
train and analyze models for multi-action detection. Here, we present
baseline results for multi-action recognition using loss functions adapted
for long tail multi-label learning, provide improved methods for visualizing
and interpreting models trained for multi-label action detection and show
the strength of transferring models trained on M-MiT to smaller datasets.
1 INTRODUCTION
In this paper we present the Multi-Moments in Time dataset
(M-MiT). This is a multi-label extension to the Moments in
Time dataset (MiT) [27] which includes one million 3-second
videos each with a human annotated action label.
Videos by their nature are dynamic. In contrast to images,
events can evolve over time and a single action label for an
event may not fully capture the set of actions being depicted.
For example, a short video of a person raising their hand and
snapping their fingers before laughing has multiple actions
(a) Multi-Moments in Time 2 million action labels for 1
million 3 second videos
(b) Multi-Regions Localizing multiple visual regions
involved in recognizing simultaneous actions, like running
and bicycling
(c) Action Regions Spatial localization of actions in single
frames for network interpretation
(d) Action Concepts Interpretable action features learned
by a trained model (i.e. jogging)

HVU
nHolistic Video Understanding
• 3142カテゴリ，6種類
• 動作認識だけではない
• 最長10秒クリップ
• HVU
• train481k, val31k, test65k
• Mini-HVU
• train130k, 10k
• ただしラベルはmachine-generated
Supplementary Material: Large Scale Holistic Video Understanding
forest,musician,flutist,music,musical_instrument,brass_inst
rument,wind_instrument,flautist,recreation,musical_instrum
ent_accessory,plant,playing_flute,tree
string_instrument,musician,man,,guitarist,plucked_string_i
nstruments,music,tapping_guitar,bass,musical_instrument
_accessory,performance,string_instrument_accessory,elec
tric_guitar,sitting,monochrome_photography,musical_instru
ment,guitar_accessory,resonator
sport_venue,shoe,outdoor_shoe,joint,foot,ball,grass,knee,
human_leg,fun,football_player,ball_game,green,footwear,f
ootball,player,sports_equipment,juggling_soccer_ball,socc
er,plant,soccer_ball,sports,play
opening_bottle_not_wine_,joint,muscle,service,finger,distill
ed_beverage,fun,taste,standing,arm,t_shirt,glass,alcohol,d
rink,hand,bottle,photograph,cooking
[Diba+, ECCV2020]
Large Scale Holistic Video Understanding 5
Task Category Scene Object Action Event Attribute Concept Total
#Labels 248 1678 739 69 117 291 3142
#Annotations 672,622 3,418,198 1,473,216 245,868 581,449 1,108,552 7,499,905
#Videos 251,794 471,068 479,568 164,924 316,040 410,711 481,417

HVU
n構築
• ビデオは既存のデータセットを再利
用
• YouYube-8M, Kinetics-600, HACS
• ただしtest setはHVUのtrainからは
除外
• アノテーション
• クラウドAPIでラフなタグ付け
• 人間が正しくないタグを除去
• 残ったタグ約3500を元にWordNet
からラベル3142を採用
• 6種類に手動で分類
n配布
• YouTubeリンク
• 動画の配布なし
• 消失動画多数
[Diba+, ECCV2020]

HAA500
nHuman Atomic Action
• アトミックな（最小単位の）アク
ションが500カテゴリ
• MiTの「open」はいろいろopen
• MAAは「door open」
• クリップ長は可変
• Kineticsのように10秒固定だと不要
な部分が多い・ショットの切り替
えが発生
• 10000動画：train8k, val500, test1500，
平均2.12秒
• クラスバランスは調整
• trainは各クラス16動画
• アクションの人物のみ大写し
• 83%の動画で人物が一人しか写っ
ていない
[Chung+, ICCV2021]
Sports/Athletics
Soccer
Baseball
Run (Dribble) Throw In Shoot Save
Run Pitch Swing Catch Flyball
Figure 1. HAA500 is a fine-grained atomic action dataset, with fi
compared to the traditional composite action annotations (e.g., Soc
atomic action datasets, where we have distinctions (e.g., Soccer-Thro
when the action difference is visible. The figure above displays sam
video contains one or a few dominant human figures performing the
cess, they only contain 4 events with atomic action anno-
tations (Balance Beam, Floor Exercise, Uneven Bars, and
Vault-Women), and their clips were extracted from profes-
AVA
HACS
re 3. The video clips in AVA, HACS, and Kinetics 400 contain multiple human figures with different actions in the same frame.
ething-Something focuses on the target object and barely shows any human body parts. In contrast, all video clips in HAA500 are
fully curated where each video shows either a single person or the person-of-interest as the most dominant figure in a given frame.
Dataset Detectable Joints
Kinetics 400 [21] 41.0%
UCF101 [42] 37.8%
HMDB51 [25] 41.8%
FineGym [39] 44.7%
HAA500 69.7%
e 5. Detectable joints of video action datasets. We use Alpha-
e [10] to detect the largest person in the frame, and count the
ber of joints with a score higher than 0.5.
me (e.g., a basketball to detect Playing Basketball) rather
n recognizing the pertinent human gesture, thus causing
action recognition to have no better performance im-

Activity-Net
nActivity-Net challenge
• ActivityNet v1.2
• 100カテゴリ，train4.8k, val2.4k, test2.5k
• ActivityNet v1.3（主流）
• 200カテゴリ，train10k, val5k, test5k
• カテゴリは階層的に定義（4階層）
• CVPR2015論文では203カテゴリ
• ソースはYoutube
n動画
• untrimmed video：5〜10分，最長20分，半数が1280x720，多くが30fps
• アクションの区間をアノテーション
• 複数の異なるアクションも発生するのでマルチラベル問題
• trimmed video：各untrimmed videoからアクション部分を抽出（平均で1.41動
画）
• 1クリップ1アクションなのでシングルラベル問題
[Heilbron+, CVPR2015]

Activity-Net
n 構築手順
• カテゴリ名
• 米国労働省のAmerican Time Use
Survey (ATUS)を利用
• 2000カテゴリ以上，階層的（最上位
は18カテゴリ）
• ここから203カテゴリを手動で選択
（最上位7カテゴリ）
• 候補untrimmed videoをYouTubeから検索
• WordNetの類義語・上位語・下位語も
利用
• AMTワーカーが選択（正しいものを混ぜ
て，エキスパートワーカーを選別）
• アクション区間のアノテーション
• 複数のAMTエキスパートワーカーがア
ノテーション
• それらをクラスタリングして最終採用
• 区間をtrimmed videoに切り出す
n 配布
• YouTubeリンク（json）のみ
• downloadページのrequest formから
missing video listを提出
• 面倒なのか全動画zipのダウンロードリ
ンクが送られてくる．ただし
• Google Drive共有のはダウンロード数
制限のためダウンロードできない
• Baidu Drive（百度网盘）のはなんとか
日本でアカウント作成しても帯域制限
（50kb/p程度...）で無理
• 上級会員（app store経由で月400円
程度）になって50MB/sでダウンロー
ドするしかない
• みんな困っているgithub issue
[Heilbron+, CVPR2015]

KineticsとActivity-Net
nKinetics
• 1動画1クリップのポリシー
• trimmed video clipの識別に使われる
• temporal action localizationには使わ
れない
• 元のuntrimmed video中に複数の
actionがあっても1つのclipしか抽
出してないから
• trimmed clipが配布されているから元
のuntrimmed videoは不要
nActivity-Net
• untrimmed videoとtrimmed clip
• temporal action localizationに使われ
る
• trimmed clipの識別には使われない
• どこかで見かけてもメジャーでは
ない
• untrimmed videoの識別には使われて
ない
• この問題設定が少ない
• やるならwealkly TAL
• やるならマルチラベル問題

ActivityNet Captions
n各イベント区間にキャプション
• ActivityNetから20k動画，計100k文
• 時間区間アノテーションあり
• 1動画平均2.65文
• 1文平均13.48単語
nDense captioningというタスク
の開始
• video captioning：ビデオ単位で
caption生成
• dense captioning：TAL＋captioning
[Krishna+, ICCV2017]

ActivityNet Entities
nキャプションの各単語に
bbox付与
• ActivityNet Captionsを利用
• bbox 158k
[Zhou+, CVPR2019]

THUMOS
nthe THUMOS challenges
[Idrees+, CVIU2016]
• THUMOS2013
• spatio-temporal action localization
• THUMOS2014
• THUMOS2015
nTHUMOSの読み方
• ギリシャ語θυμoς：サーモス
• 意味：spirited contest（活発なコン
テスト）
http://www.thumos.info/

THUMOS13, aka UCF101-24
nICCV2013併設コンテスト
nUCF101の動画のみを利用
• シングルラベル（101クラス）
• 3 splitで交差検証
• 2 splitで学習，1 splitでテスト
nタスク
• 3207 untrimmed video
• UCF101のサブセット
• spatio-temporal localization
nアノテーション
• UCF101のうち24クラスにbboxを付
けたもの
• xml形式
nUCF101-24
• なぜかTHUMOS13とは呼ばれない
• 論文ではUCF101-24と書く
• オリジナルのxml形式は使いにくく，
また間違いがある
• [Singh+, ICCV2017]が修正した
githubリポジトリ
（gurkirt/corrected-UCF101-
Annots）がよく使われる？
• matとpickleがある
[Singh+, ICCV2017]

THUMOS14 & THUMOS15
nTHUMOS2013からの変更点
• spation-temporalからtemporalへ
• 区間検出の方がbbox検出よりも有用
• ただしアノテーションはbbox
（THUMOS2013と同じもの）
• 背景ビデオの導入
• 20クラスのマルチラベルタスク
• 1区間につき複数アクション
• 平均4.6秒（動画長の28%）
• untrimmed videoの導入
• 最長250秒？
• 学習はtrimmedのUCF101とuntrimmed
のval+背景
H. Idrees et al. / Computer Vision and Ima
ARTICLE IN
JID: YCVIU
[Idrees+, CVIU2016]

THUMOS14 & THUMOS15
n THUMOS14データ
• train：13320 trimmed (UCF101)，101
クラス
• val：1010 untrimmed，20クラス
• 区間アノテーションあり200
• background：2500 untrimmed
• test：1574 untrimmed，20クラス
n THUMOS15データ
• train：13320 trimmed (UCF101) ，101
クラス
• val：2140 untrimmed，20クラス
• background：2980 untrimmed
• test：5613 untrimmed，20クラス
• 区間アノテーションと背景かどうか
は未公開
n コンテスト
• train, val, backgroundを使ってよい
• 自分で手動アノテーションは不可
• action recognition：マルチラベル
• 101クラスのconfidence
• （ただしマルチラベルの動画は全体の
15%程度）
• TAL
• 各区間
• その区間の検出クラスのconfidence
value
n 配布
• zipファイル
• 展開パスワードはフォームから申請

THUMOS14 & THUMOS15
n構築
• 正例（アクションを含む）ビデオ：
• Youtube検索にFreeBaseのトピックを利用
• 各actionのFreeBaseトピックを定義
• キーワード検索も併用
• 編集ビデオは削除
• "-awesome" "-crazy"などで検索から排除
• 背景ビデオ：こちらのほうが収集が大変
• 単に異なるカテゴリの映像を使うのはだめ（見た目が似てしまう９
• 背景とは？
• アクションのコンテキストは共有している（似たシーン，物体，人）
• しかしそのアクションは実際には発生していない
• 例：ピアノが写っているが人はピアノを弾いていない
• 他のカテゴリのアクションが入っていてもダメ

THUMOS14 & THUMOS15
n構築：２
• 背景ビデオの集め方
• 背景になりそうなキーワード指定で検索
• X + "for sale"，X + "review"
• 物体名で検索
• カテゴリにない「物体とアクションの組み合わせ」で検索
• 修正とアノテーション
• ワーカーが101クラスのどれかを含んでいるかを判定
• 含んでいても，スローモーションや遮蔽，アニメ，長過ぎる（10分以上），
編集されたもの，スライドショー，一人称視点などは排除
• 共起する2次的なアクションがあればアノテーション
• 背景ビデオ：3人が34クラスずつ担当して，どれでもないことを確認

THUMOS14 & THUMOS15
n構築：３
• 問題
• アクション区間の境界はあいまいで主観的，ひとによってバラバラ
• 101カテゴリを2つに分類
• instantaneous：区間が明確（BasketballDunk, GolfSwing）
• この20クラスを利用する
• syclic：反復的・持続的（Biking, HairCut, PlayingGuitar）
• 指標を客観的にするための対策
• UCF101カテゴリと一致するように，区間を厳密にアノテーション
• 曖昧，不完全，一般的ではない場合には，ambiguousとマーク
• 20クラス＋ambigousクラス
• 10%という小さいIoUを設定
• 複数のIoUで評価

MultiTHUMOS
nTHUMOS14を拡張
• val200, trian213にラベル付け
• 全フレームをマルチラベル
• 元の20クラスに新たに45クラス加えて
合計65クラス
• THUMOS14のラベル数
• フレームあたり平均0.3最大2
• 動画あたり平均1.1最大3
• 区間長平均4.8秒
• MultiTHUMOSのラベル数
• フレームあたり平均1.5最大9
• 動画あたり平均10.5最大25
• 区間長平均3.3秒最短66ms（2フ
レーム）
[Serana+, IJCV2018]

AVA
nAtomic Visual Action
• AVA-Kinetics [Li+, arXiv2020]
• AVA Actions [Gu+, CVP2018]
• AVA Spoken Activity datasets
• AVA Active Speaker [Roth+,
arXiv2019]
• AVA Speech [Chaudhuri,
Interspeech2018]
n配布
• 公式サイトはcsvのみ
• CVDのGitHubリポジトリに
Amazon S3から動画をダウン
ロードするリンクあり
https://research.google.com/ava/explore.html

AVA Actions
nいわゆるAVAデータセット
• spatio-temporal action localizationタス
ク
• アトミックな（最小単位の）アクショ
ンが80カテゴリ
• 15分のuntrimmed videoが430本
• train 235, val 64, test 131
• アノテーションは1秒毎（1Hz）
• 30分以上の動画の15分目から30分目
の15分間（900フレーム）
• 1.5秒（3秒）のセグメントに対し
てアノテーション
• バージョン v1.0, v2.0, v2.1, v2.2
• 現在はv2.2を使う
Chunhui Gu⇤
Chen Sun⇤
David A. Ross⇤
Carl Vondrick⇤
Caroline Pantofaru⇤
Yeqing Li⇤
Sudheendra Vijayanarasimhan⇤
George Toderici⇤
Susanna Ricco⇤
Rahul Sukthankar⇤
Cordelia Schmid† ⇤
Jitendra Malik‡ ⇤
Abstract
This paper introduces a video dataset of spatio-
temporally localized Atomic Visual Actions (AVA). The AVA
dataset densely annotates 80 atomic visual actions in 430
15-minute video clips, where actions are localized in space
and time, resulting in 1.58M action labels with multiple
labels per person occurring frequently. The key charac-
teristics of our dataset are: (1) the definition of atomic
visual actions, rather than composite actions; (2) precise
spatio-temporal annotations with possibly multiple annota-
tions for each person; (3) exhaustive annotation of these
atomic actions over 15-minute video clips; (4) people tem-
porally linked across consecutive segments; and (5) using
movies to gather a varied set of action representations. This
departs from existing datasets for spatio-temporal action
recognition, which typically provide sparse annotations for
Figure 1. The bounding box and action annotations in sample
frames of the AVA dataset. Each bounding box is associated with
1 pose action (in orange), 0–3 interactions with objects (in red),
and 0–3 interactions with other people (in blue). Note that some

AVA Actions
n 構築
• アクションは一般的・アトミック・網羅的なのもを選択
• 過去のデータセットを参考
• pose 14，man-object interaction 49，man-man interaction 17
• YouTubeから有名俳優の"film"や"television"で検索
• 30分以上，1年以上経過，1000ビュー以上の動画を選択
• モノクロ，低解像度，アニメ，ゲームなどは除外
• 予告などの部分を除外するために開始15分後から15分間を利用
• Bboxアノテーション
• まずFaster R-CNN，次に手動で追加削除
• フレーム間リンクは自動マッチングして手動で修正
• Bboxあたりpose 1クラス（必須），man-objectやman-man interactionがあれば追加
（それぞれ3つまで）．合計最大7ラベル
• 注：poseはsoftmax-CE，man-objやman-manはBCEにすることが多い
• 動画1本あたり1秒毎の3秒セグメント（897個）に対してラベルあり
n 評価
• frame-mAP

AVA-Kinetics
nAVA Actionsと同様の手順で
Kinetics-700にbboxを付与
• 10秒クリップ中のある1フレーム
(key-frame) だけにbboxアノテーショ
ン
• train
• AVA80クラスで認識率が低い27カ
テゴリに対応するK700の115クラ
スを手動選出し，すべての動画を
アノテーション
• 残りのクラスからは一様にサンプ
ルしてアノテーション
• val・test：すべての動画をアノテー
ション
The AVA-Kinetics Localized Human Actions Video Dataset
Ang Li1
Meghana Thotakuri1
David A. Ross2
João Carreira1
Alexander Vostrikov1⇤
Andrew Zisserman1,4
1
DeepMind 2
Google Research 4
VGG, Oxford
{anglili,sreemeghana,dross,joaoluis,zisserman}@google.com
alexander.vostrikov@gmail.com
Abstract
This paper describes the AVA-Kinetics localized human
actions video dataset. The dataset is collected by annotat-
ing videos from the Kinetics-700 dataset using the AVA an-
notation protocol, and extending the original AVA dataset
with these new AVA annotated Kinetics clips. The dataset
contains over 230k clips annotated with the 80 AVA action
classes for each of the humans in key-frames. We describe
the annotation process and provide statistics about the new
dataset. We also include a baseline evaluation using the
Video Action Transformer Network on the AVA-Kinetics
dataset, demonstrating improved performance for action
classification on the AVA test set. The dataset can be down-
loaded from https://research.google.com/ava/.
1. The AVA-Kinetics Dataset
kinetics: chopping wood
ava: bend/bow (at the waist)
ava: touch (an object)
lift (a person)
stand
kinetics: high kick
carry/hold (an object);
stand;
talk to (e.g., self, a person,
a group);
watch (a person)
jump/leap;
touch (an object)
high jump
Figure 1. An example key-frame in the AVA-Kinetics dataset. The
0214v2
[cs.CV]
20
May
2020
[Li+, arXiv2020]
and a total of around 650k video clips. For each class, each
clip is from a different Internet video, lasts about 10s and
has a single label describing the dominant action occurring
in the video.
1.2. Data Annotation Process
The AVA-Kinetics dataset extends the Kinetics dataset
with AVA-style bounding boxes and atomic actions. A
single frame is annotated for each Kinetics video, using a
frame selection procedure described below. The AVA anno-
tation process is applied to a subset of the training data and
to all video clips in the validation and testing sets from the
Kinetics-700 dataset. The procedure to annotate bounding
boxes for each Kinetics video clip was as follows:
1. Person detection: Apply a pre-trained Faster RCNN
[8] person detector on each frame of the 10-second
long video clips.
2. Key-frame selection: Choose the frame with the
highest person detection confidence as the key-frame
of each video clip, at least 1s away from the
start/endpoint of the clip.
3. Missing box annotation: Human annotators verify and
annotate missing bounding boxes for the key-frame.
unique video clips in different splits for the AVA-Kinetics dataset.
AVA-Kinetics data is a combination of AVA and Kinetics. While
the numbers of annotated frames are roughly on par between AVA
and Kinetics, Kinetics brings many more unique videos to the
AVA-Kinetics dataset.
# unique frames # unique videos
AVA Kinetics AVA-Kinetics AVA Kinetics AVA-Kinetics
Train 210,634 141,457 352,091 235 141,240 141,475
Val 57,371 32,511 89,882 64 32,465 32,529
Test 117,441 65,016 182,457 131 64,771 64,902
Total 385,446 238,984 624,430 430 238,476 238,906
Clips from the remaining Kinetics classes are sampled uni-
formly (we have not yet annotated all of them). Different
from the training set, the Kinetics validation and test sets
are both fully annotated.
2. Data Statistics
We discuss in this section characteristics of the data dis-
tribution in AVA-Kinetics and compare it with the existing
AVA dataset. The statistics of the dataset are given in Ta-
ble 1 which shows the total number of unique frames and
that of unique videos in these datasets. Kinetics dataset con-
tains a large number of videos so it contributes a lot more
unique videos to the AVA-Kinetics dataset.

AVA Speech
nSpeech activityの認識
• 発話があるかないか，音楽やノ
イズがあるかどうか
• 動画：AVA v1.0（192動画）
• カテゴリは4つ
• No speech
• Clean speech
• Speech + Music
• Speech + Noise
• 各時刻にラベル付け(dense)
• 185動画，各15分
Figure 1: Rating interface with the set of labels and labeling
shortcuts shown on the right of the video player, and playback
shortcuts below the player.
190 movies, and should lead to a wider diversity of contexts. It
explicitly annotates when speech activity co-occurs with back-
Figure 2: An example of the labeled activity timeline from 3
[Chaudhuri+, Interspeech2018]

AVA Active Speaker
n発話者の特定
• 各フレームで検出された顔が発話してい
るかどうか
• 動画：AVA v1.0 （192動画）
• カテゴリは3つ
• Not Speaking
• Speaking and Audible
• Speaking but not Audible（話している
が聞き取れない）
• 各時刻でラベル付け(dense)
• 160動画
• Face trackerを使用
• 1秒以上，10秒以下のトラック
• 38.5kトラック，顔3.65M個
{josephroth, sourc, klejcho, radahika, agallagher, lkaver,
sharadh, astopczynski, cordelias, zxi, cpantofaru}@google.com
Abstract
aker detection is an important component in
algorithms for applications such as speaker
ideo re-targeting for meetings, speech en-
nd human-robot interaction. The absence
refully labeled audio-visual dataset for this
trained algorithm evaluations with respect
ity, environments, and accuracy. This has
sons and improvements difficult. In this pa-
nt the AVA Active Speaker detection dataset
eaker) that will be released publicly to facili-
development and enable comparisons. The
Figure 1: The annotation interface for AVA
Given its surrounding video and audio (w
[Roth+, ICASSP2020]

HACS
nHACS Clip
• 1.5M trimmed videos
• train 1.4M, val 50k, test 50k
• from 492k, 6k, 6k untrimmed
videos
• 2秒クリップ
nHACS Segment
• 139k action segments
untrimmed videos
nカテゴリ
• 200カテゴリ (clips/segment)
• Activit-Net 200と同じもの
[Zhao+, ICCV2019]
Hang Zhao†
, Antonio Torralba†
, Lorenzo Torresani‡
, Zhicheng Yan!
†
Massachusetts Institute of Technology, ‡
Dartmouth College, !
University of Illinois at Urbana-Champaign
Abstract
This paper presents a new large-scale dataset for recog-
nition and temporal localization of human actions collected
from Web videos. We refer to it as HACS (Human Action
Clips and Segments). We leverage both consensus and dis-
agreement among visual classifiers to automatically mine
candidate short clips from unlabeled videos, which are sub-
sequently validated by human annotators. The resulting
dataset is dubbed HACS Clips. Through a separate pro-
cess we also collect annotations defining action segment
boundaries. This resulting dataset is called HACS Seg-
ments. Overall, HACS Clips consists of 1.5M annotated
clips sampled from 504K untrimmed videos, and HACS Seg-
ments contains 139K action segments densely annotated
in 50K untrimmed videos spanning 200 action categories.
HACS Clips contains more labeled examples than any ex-
isting video benchmark. This renders our dataset both a
large-scale action recognition benchmark and an excellent
source for spatiotemporal feature learning. In our transfer
learning experiments on three target datasets, HACS Clips
outperforms Kinetics-600, Moments-In-Time and Sports1M
as a pretraining source. On HACS Segments, we evaluate
state-of-the-art methods of action proposal generation and
4.E+03
3.E+04
3.E+05
2.E+06
4.0E+3 1.6E+4 6.4E+4 2.6E+5 1.0E+6
number
of
videos
number of clips
UCF-101
HMDB-51
Kinetics-600 HACS Clips (Ours)
5.E+01
4.E+02
3.E+03
3.E+04
2.E+05
5.0E+3 2.0E+4 8.0E+4
number
of
videos
number of segments
THUMOS 14
AVA
ActivityNet
Charades HACS Segments (Ours)
400
200
100
50
200
100
50
Moments
Figure 1: Comparisons of manually labeled action recog-
nition datasets (Top) and action localization datasets
(Bottom), where ours are marked as red. The marker size
encodes the number of action classes in logarithmic scale.
It currently contains 9M images with image-level label and
1.7M images with 14.6M bounding boxes, and has greatly
pushed the advances of research work in those fields [1, 19].
In the video domain, we have witnessed an analogous
growth in the scale of action recognition datasets. While
video benchmarks created a few years ago consists of only
a few thousands examples (7K videos in HMDB51 [29],

HACS
n構築
• YouTubeからActivityNetの200カテゴリ名で検索
• 得られた890k動画をフィルタリング
• 重複除去，他データセット（Kinetics, ActivityNet, UCF101, HMDB51）のval
とtestも除去
• 504k動画（最長4分，平均2.6分）
• ここから2秒クリップ1.5Mを一様にサンプリング
• ただしpos（アクション）よりneg（非アクション）が圧倒的に多いので一様
サンプリングではダメ
• 複数の画像識別器でpos 0.6Mとneg 0.9Mに分類
• これをtrain/val/testにsplitしたのがHACS Clip
• そのうちのサブセット50k untrimmed videoにアクション区間を付与
• これがHACS Segment
• ActivityNetより1動画あたりのsegment数は多く，segment長は短い
[Zhao+, ICCV2019]

DALY
nDaily Action Localization in
YouTube
• 10カテゴリ，8133クリップ
• 高解像度：1290x790
• 長いuntrimmed video（平均3分45
秒）
• 平均アクション長8秒
• アクション＝道具が体に触れてい
る時間
• 区間アノテーションはワーカーを使
わず自分達で
• bboxは区間から一様にサンプルし
た5フレームのみ（最大1fps）
[Weinzaepfel+, arXiv2016]
supervision, performs on par with methods using full supervision, i.e., one bounding box annotation per frame. To further validate our
method, we introduce DALY (Daily Action Localization in YouTube), a dataset for realistic action localization in space and time. It
contains high quality temporal and spatial annotations for 3.6k instances of 10 actions in 31 hours of videos (3.3M frames). It is an
order of magnitude larger than existing datasets, with more diversity in appearance and long untrimmed videos.
Index Terms—Spatio-temporal action localization, weak supervision, human tubes, CNNs, dense trajectories.
F
1 INTRODUCTION
ACTION classification has been widely studied over the
past decade and state-of-the-art methods [1], [2], [3],
[4], [5] now achieve excellent performance. However, to
analyze video content in more detail, we need to localize
actions in space and time. Detecting actions in videos is
a challenging task which has received increasing attention
over the past few years. Recently, significant progress has
been achieved in supervised action localization, see for ex-
ample [6], [7], [8], [9], [10]. However these methods require
a large amount of annotation, i.e., bounding box annotations
in every frame. Such annotations are, for example, used
to train Convolutional Neural Networks (CNNs) [6], [7],
[9], [10] at the bounding box level. Several works have
suggested to generate action proposals before classifying
them [11], [12], however they generate hundreds of pro-
posals for a video, thus supervision is still required to label
them in order to train a classifier. Consequently, all these ap-
proaches require full supervision, where action localization
needs to be annotated in every frame. This makes scaling up
to a large dataset difficult. The goal of this paper is to move
away from full supervision, similar in spirit to recent work
on weakly-supervised object localization [13], [14].
Recently, Mettes et al. [15] have addressed action local-
ization with another annotation scheme, e.g. with pointly-
supervised proposals. A large number of candidate pro-
posals are obtained using APT [12], a method based on
grouping dense trajectories. They show that Multiple In-
stance Learning (MIL) applied directly on these proposals
performs poorly. They thus introduce point supervision
and incorporate an overlap measure between annotated
points and proposals into the mining process. This requires
• P. Weinzaepfel is with Xerox Research Centre Europe, Meylan, France.
E-mail: philippe.weinzaepfel@xrce.xerox.com
• X. Martin and C. Schmid are with Inria, LJK, Grenoble, France.
E-mail: firstname.lastname@inria.fr
Fig. 1. We consider sparse spatial supervision: the temporal extent
of the action as well as one box per instance are annotated in the
training videos (left). To train an action detector, we extract human tubes
and select positive and negative ones (right) according to the sparse
annotations.
annotating a point in every frame. In this paper we go a
step further and significantly reduce the number of frames
to annotate. To this end, we leverage the fact that actors
are humans and extract human tubes. Given these human
tubes, our approach uses only one spatial annotation per
action instance, see Figure 1. We show that such a sparse
annotation scheme is sufficient to train state-of-the-art action
detectors.
Our approach first extracts human tubes from videos.
Using human tubes for action recognition is not a novel
idea [16], [17], [18]. However, we show that extracting high
quality human tubes is possible by leveraging a recent state-
of-the-art object detection approach (Faster R-CNN [19]),
a large annotated dataset of humans in a variety of poses
arXiv:1605.05197v2
[cs.CV]
23
May
20
7
Fig. 5. Example frames from the DALY dataset with simultaneous actions.
Fig. 6. Example of spatial annotation from the DALY dataset. In addition to the bounding box around the actor (yellow), we also annotate the objects
(green) and the pose of the upper body (bounding box around the head in blue and joint annotation for shoulders, elbows and wrists).
7
Fig. 5. Example frames from the DALY dataset with simultaneous actions.

Action Genome
n動画のシーングラフ
• Charadesの動画にVisual Genome
[Krishna+, IJCV2017]的なグラフをアノ
テーション
• Visual Genome：画像中のすべての物
体間の関係
• Action Genome：アクション発生区間
中のみ，アクション対象の物体と人
間の関係のみ
• 234,253フレームの物体35クラス，関係
25クラス
• 各アクション区間から一様に5フレー
ム抽出してアノテーション
• アクションカテゴリは157（Charades
と同じ）
[Ji+, CVPR2020]
Action recognition in videos. Many research projects have
tackled the task of action recognition. A major line of work
has focused on developing powerful neural architectures to
extract useful representations from videos [10, 23, 31, 69,
72]. Pre-trained on large-scale databases for action clas-
sification [8, 9], these architectures serve as cornerstones
for downstream video tasks and action recognition on other
datasets. To assist more complicated action understanding,
another growing set of research explores structural informa-
tion in videos including temporal ordering [51, 88], object
localization [4, 25, 32, 53, 74, 76], and implicit interactions
between objects [4, 53]. In our work, we contrast against
these methods by explicitly using a structured decomposi-
tion of actions into objects and relationships.
Table 1 lists some of the most popular datasets used for
action recognition. One major trend of video datasets is
providing considerably large amount of video clips with
single action labels [8, 9, 87]. Although these databases
have driven the progress of video feature representation for

Home Action Genome
n Home Action Genome (HOMAGE)
• 家2軒の全部屋，参加者27名
• マルチモーダル
• カメラ：1人称，3人称マルチビュー，赤外
線
• センサ：照明，加速度，ジャイロ，人感，磁
気，気圧，湿度，室温
• 1752 untrimmed video
• train1388, tests 198/166
• そこから5700動画を抽出
• 75 activity，453 atomic action
• 1動画1 activity
• 各フレームにマルチラベルのatomic action
（2〜5秒）
• atomic action：開始，終了，カテゴリ
• train20k, tests2.1k/2.5k
• bbox 583k：一様に抽出した3-5フレーム
• 物体86クラス，関係29クラス
[Rai+, CVPR2021]
ization and relationships between objects. LEMMA [21] is
a recent multi-view and multi-agent human activity recogni-
tion dataset, providing bounding box annotations on third-
person views and compositional action labels annotated
with predefined action templates and verbs/nouns. How-
ever, they do not provide bounding boxes of objects the
subjects (human) interact with. Action Genome [10] is built
upon the videos from Charades [38], with the additional an-
notation of spatio-temporal scene graph labels. However, it
only provides videos from a single camera view. HOMAGE
aims to provide 1) multiple modalities to promote multi-
modal video representation learning, 2) high-level activity
labels and temporally localized atomic action labels, and 3)
scene graphs that provide spatial localization cues for both
the subject and the object and their relationship.
Multi-Modal Learning. Multiple modalities of videos are
rich sources of information for both supervised [25] and
self-supervised learning [26, 27, 39]. [40, 27] introduce a
contrastive learning framework to maximize the mutual in-
formation between modalities in a self-supervised manner.
The method achieves state-of-the-art results on unsuper-
vised learning benchmarks while being modality-agnostic
view1(Ego-view) view2 view3 view4 view5
eat dinner
pack
suitcase
blow-dry
hair
handwash
dishes
Figure 2: Multiple Views of Home Action Genome (HOMAGE)
Dataset. Each sequence has one ego-view video as well as at least
one or more synchronized third person views.
the instructions assigned. To make sure the behaviors are as
natural as possible, we did not specify detailed procedures
and time limits within the activities, and let the individual
participants perform the activity freely.
Data Collection. We recorded 27 participants in kitchens,
bathrooms, bedrooms, living rooms, and laundry rooms in
two different houses. We used 12 sensor types: cameras
(RGB), infrared (IR), microphone, RGB light, light, accel-
UCF101 [14] 13K 27 1
ActivityNet [13] 28K 648 1
Kinetics-700 [11] 650K 1.79K 1
AVA [42] 430 108 1
PKU-MMD [33] 1.08K 50 3
EPIC-Kitchens [15] - 55 1
MMAct [37] 36K - 6
Action Genome [10] 10K 82 1
Breakfast [43] - 77 1
LEMMA [21] 324 10.1 2
Ours 1.75K 25.4 12
Table 1: Comparison between related datasets and HOMAGE.
not including annotation data or derived data like optical flow, V
level activity label (often assigned one per video), TL: temporal
rich multi-modal action data, including dense annotations such as
Video
(ego-view)
Video
(3rd-view)
Atomic
Action
Activity
take sth from washing machine
holding detergent
holding a basket
unloading washing machine

Charades
n英語でジェスチャゲームを意味する
• 発音：シュレィヅ（米）シャラーヅ（英）
n動画
• 9848本の30秒ビデオ
• train 7985, test 1863
• （区間はtrain 49k, test 17k）
• 157カテゴリ
• 46物体，30アクション（動詞），15屋内
シーンの組み合わせ
• アクション区間（平均12.8秒，動画当たり
平均6.8アクション）
• 説明文（description）
• 映像中に異なるカテゴリの区間が存在
[Sigurdsson+, ECCV2016]
Hollywood in Homes: Crowdsou
The Charades Dataset
Fig. 1. Comparison of actions in the Charades dataset
book, Opening a refrigerator, Drinking from a cup. YouT
often atypical videos, while Charades contains typical ev

Charades
n構築：「Hollywood in Homes」アプローチ
• ハリウッドでの撮影プロセスAMTワーカーに依頼する
• 15種類の屋内シーンで起こりうる状況を文章（script）にする
• ランダムな物体とアクションが5つずつ提示される
• 2つずつ選択して1人か2人が家で行う行動を記述する
• その文章に従って30秒の動画を自分で撮影する
• 価格が安いとワーカーが集まらないので工夫あり
• 全scriptを解析して157のアクションクラスを選出
• アノテーションする（descriptionを書く）
• ビデオから物体を自動抽出
• その物体とそれに関連するアクション（5個程度）とscriptを見て，ワーカー
が判断
• さらに別のワーカーがアクション区間をアノテーション
[Sigurdsson+, ECCV2016]

Charades-Ego
n1人称と3人称の動画ペアの理解
• Charadesと同様の手順
n撮影
• AMTに依頼，参加者112人
• ワーカーはどこかにカメラを置いて
3人称で撮影後，同じ内容を1人称で
撮影（カメラを額に取り付け）
• 同一人物，同一環境が保証
• Charadesのスクリプトを借用
• 157カテゴリ（Charadesと同様）
• 4000ペア動画（364ペアで2人以上が
登場），平均31.2秒
[Sigurdsson+, CVPR2018]

EPIC-KITCHENS-55
n キッチンの一人称視点動画
• 55時間，参加者32人4都市
• キッチンに入ったら録画，出たら停
止
• 平均動画長1.7h，最長4.6h
• 参加者あたり平均13.6動画
• 行動の指定なし，記録は連続する3日
間，カメラはGoPro
• 参加者に音声で荒くアノテーションし
てもらう
• AMTでラベル付け
• 最低0.5秒，前後の区間と重複可
• 39.6k アクション区間，454.3k 物体
bbox
• アクション：動詞125クラス
• 物体：名詞331クラス
[Damen+, ECCV2018] [Damen+, TPAMI2021]
8 D. Damen et al
Fig. 6: Sample consecutive action segments with keyframe object annotations
We refer to the set of minimally-overlapping verb classes as CV , and similarly
CN for nouns. We attempted to automate the clustering of verbs and nouns
using combinations of WordNet [32], Word2Vec [31], and Lesk algorithm [4],
however, due to limited context there were too many meaningless clusters. We
thus elected to manually cluster the verbs and semi-automatically cluster the
nouns. We preprocessed the compound nouns e.g. ‘pizza cutter’ as a subset of
the second noun e.g. ‘cutter’. We then manually adjusted the clustering, merging
the variety of names used for the same object, e.g. ‘cup’ and ‘mug’, as well as
splitting some base nouns, e.g. ‘washing machine’ vs ‘coffee machine’.
In total, we have 125 CV classes and 331 CN classes. Table 3 shows a sample
of grouped verbs and nouns into classes. These classes are used in all three
APRIL 2020
Fig. 2: Head-mounted GoPro used in dataset recording
Use any
Use pres
Use verb
You may
the kiwi
Use prop
Use ‘and
If an act
Fig. 3:
particip
0.5% at
90fps1
.
The
kitchen

EPIC-KITCHENS-100
nEPIC-KITCHENS-55の拡張
• EK55の動画は使いまわしせず，新たに
100時間を作成（参加者45人）
• 参加者による音声ナレーション付け
• EP55では動画にアフレコ
• EP100では一時停止して録画
• 90.0k アクション区間，454.3k 物体bbox
• アクション：動詞97クラス
• 物体：名詞300クラス
• マスクの追加 by Mask R-CNN
• 38M物体，31Mの手
• Hand-object interactionのラベル
• [Shan+, CVPR2020]
[Damen+, IJCV2021]
38 International Journal of Computer Vision (2022) 130:33–55
Fig. 5 Top: Sample Mask R-CNN of large objects (col1: oven), hands
(labelled person), smaller objects (col2: knife, carrot, banana, col3:
clock, toaster, col4: bottle, bowl), incorrect labels of visually ambiguous
objects (col3: apple vs onion) and incorrect labels (col3: mouse, col4:
chair). Bottom: Sample hand-object detections from Shan et al. (2020).
L/R = Left/Right, P = interaction with portable object, O = object. Mul-
tiple object interactions are detected (col2: pan and lid, col4: tap and
kettle)

nマックス・プランク情報学研究所
• Max Planck Institute for Informatics (MPII)
• MPII Cooking Activities [Rohrbach+, CVPR2012]
• MPII Cooking Composites Activities [Rohrbach+, ECCV2012]
• MPII Cooking 2 [Rohrbach+, IJCV2016]
MPII Cooking datasets

MPII Cooking Activities
nMPII Cooking
• クラス間の違いが小さい（体の動きが小さい）
• 区間アノテーションがある初のデータセット
• 5609区間（以降の論文では3824？）
• 1秒以上何もなければ自動的に背景ラベルを付
与
• 65種類の料理アクティビティ
• 14種類の料理，参加者12人
• 44動画（計8時間，881kフレーム）
• 動画あたり3〜41分
• そのうち2.2kフレームには2D姿勢アノテーショ
ンあり（train 1071, test 1277）
• 1624x1224, 29.4fps
[Rohrbach+, CVPR2012]
A Database for Fine Grained Activity Detection of Cooking Activities
Marcus Rohrbach Sikandar Amin Mykhaylo Andriluka Bernt Schiele
Max Planck Institute for Informatics, Saarbrücken, Germany
Abstract
While activity recognition is a current focus of research
the challenging problem of fine-grained activity recognition
is largely overlooked. We thus propose a novel database of
65 cooking activities, continuously recorded in a realistic
setting. Activities are distinguished by fine-grained body
motions that have low inter-class variability and high intra-
class variability due to diverse subjects and ingredients. We
benchmark two approaches on our dataset, one based on
articulated pose tracks and the second using holistic video
features. While the holistic approach outperforms the pose-
based approach, our evaluation suggests that fine-grained
activities are more difficult to detect and the body model
can help in those cases. Providing high-resolution videos
as well as an intermediate pose representation we hope to
foster research in fine-grained activity recognition. Figure 1. Fine grained cooking activities. (a) Full scene of cut
slices, and crops of (b) take out from drawer, (c) cut dice, (d) take

MPII Cooking Composite Activities
n MPII Composites
• 上位のアクティビティ（料理）は基礎レ
ベルのアクティビティ（手順）の組み合
わせ
• 手順（スクリプト）を先に作成
• 理由：手順を示さないと，あまりにバ
ラバラすぎる
• ワーカーが料理手順を記述
• 53種のタスクを最大15手順で
• スクリプト数2124
• 手順に従って参加者が調理
• カテゴリ
• 料理41（composite）
• 手順218（attributes）
• 212動画，参加者22人
• 区間数8818
[Rohrbach+, ECCV2012]
2 Rohrbach, Regneri, Andriluka, Amin, Pinkal, and Schiele
prepare scrambled egg
version - K
1) get the pan from drawer
2) put some butter on the
pan then heat it on the stove
3) crack the egg in a bowl
4) put some salt and whisk
5) put the mixture on pan
6) stir for 3-4 minutes
version - 02
1) open the egg in a bowl
and stir, add salt and pepper
2) heat the pan on the stove
3) put some oil on the pan
4) when oil is hot then put
the mixture in the pan and
stir for some minutes
version - 01
1) take egg from the fridge
2) put pan on the stove
3) open egg over pan
4) fry for 3-4 minutes
separate egg
prepare onion
take-out egg open
pan onion fry
prepare scrambled eggs
egg open
pan fry
Script data collected using
Mechanical Turk
Fig. 1. Sharing or transferring attributes of composite activities using script data.

Int J Comput Vis (2016) 119:346–373 347
MPII Cooking 2
nMPII Compositesの拡張
• CookingとCompositesのカテゴリを整理統合
• 273動画（計27時間）
• train 201, val 17, test 42
• 区間数14k
• カメラ8台同期（ただし1台のみ使用）
• カテゴリ
• 料理59（composite）
• 手順222（attributes）
[Rohrbach+, IJCV2016]

YouCook
n動画単位での説明文付き
• video captioning用
nソース：YouTube動画
• 88動画，三人称視点
• train 49, test 39
• 6料理スタイル
• キャプション
• 動画あたり最低3文，平均8文
• 各文15単語以上，平均10単語
• 動画あたりの合計は平均67単語
[Das+, CVPR2013]

YouCook2
n手順動画を分析したい
• 区間アノテーション＋区間の内容の説
明文
• 複雑な手順はアクションラベルでは
表せないはず
• 2000動画（176時間），89レシピ
• train 1333, val 457, test 210
• ソース：YouTube動画
• 一人称視点は除外
• 平均動画長5.27分，最大10分
• 動画あたり3〜16区間（手順）
• 区間長1〜264秒，平均19.6秒
• 説明文は20単語以下
[Zou+, README]
[Zou+, AAAI2018]
Grill the tomatoes in
a pan and then put
them on a plate. Add oil to a pan and spread
it well so as to fry the bacon
Place a piece of lettuce as
the first layer, place the
tomatoes over it.
Sprinkle salt and
pepper to taste.
Add a bit of Worcestershire
sauce to mayonnaise and
spread it over the bread.
Place the bacon at
the top.
Place a piece of
bread at the top.
Cook bacon until crispy,
then drain on paper towel
00:21 00:54 01:06 01:56 02:41 03:08 03:16 03:25
00:51 01:03 01:54 02:40 03:00 03:15 03:25 03:28
Start time:
End time:

Breakfast [Kuehne+, CVPR2014]
n手順（Activity）の認識
• レシピ10種類（coffee, etc)
• レシピだけ渡して作り方は参加者にま
かせる(unscripted)
• 参加者52名，18キッチン
• 複数カメラ：場所により台数は異なる
（3〜5台）
• ステレオカメラあり
• 320x240, 15fps, 計77時間
• 各動画数分（1〜5分程度）
nアノテーションは2種類
• coarse：48種類．例「牛乳を注ぐ」
• fine：「牛乳を掴む，カップを回す，
カップを開ける」

50 Salads
n手順（Activity）の認識
• レシピはサラダのみ
• activityはhigh-level 3種類，low-level
17種類
• それぞれ3種類に分割
• pre, core（約60%）, post
• 参加者27名，50動画（各4〜8分）
• 参加者は提示された手順に従う（分量は
指定していないが1人分を作成）
• それぞれ2回ずつ撮影
n計測
• 天井固定のRGB-Dカメラ(Kinect),
640x480, 30Hz
• 各道具に加速度センサ(50Hz)
[Stein&McKenna, UbiComp13] [Stein&McKenna, CVIU2017]
University of Dundee
Dundee, United Kingdom
{sstein,stephen}@computing.dundee.ac.uk
ABSTRACT
This paper introduces a publicly available dataset of com-
plex activities that involve manipulative gestures. The dataset
captures people preparing mixed salads and contains more
than 4.5 hours of accelerometer and RGB-D video data, de-
tailed annotations, and an evaluation protocol for compari-
son of activity recognition algorithms. Providing baseline
results for one possible activity recognition task, this pa-
per further investigates modality fusion methods at different
stages of the recognition pipeline: (i) prior to feature extrac-
tion through accelerometer localization, (ii) at feature level
via feature concatenation, and (iii) at classification level by
combining classifier outputs. Empirical evaluation shows
that fusing information captured by these sensor types can
considerably improve recognition performance.
Author Keywords
Activity recognition, sensor fusion, accelerometers, com-
puter vision, multi-modal dataset
ACM Classification Keywords
I.5.5 Pattern Recognition: Applications; I.4.8 Scene Analy-
sis: Sensor Fusion; I.2.10 Vision and Scene Understanding:
Figure 1. Snapshot from the dataset. Data from an RGB-D camera and
from accelerometers attached to kitchen objects were recorded while
25 people prepared two mixed salads each. Activities were split into
preparation, core and post-phase, and these phases were annotated as
temporal intervals.

HowTo100M
n動画とテキストのペア
• テキストは自動生成字幕
n作成
• YouTubeのナレーション付きinstruction動画
1.22Mから136Mクリップ
• untrimmed video：平均6.5分，計15年
• ダウンロードすると360pでも >40TB
• 1動画から平均110クリップ生成
• 1クリップ平均4秒，4単語
• WikiHowから23kのvisual taskを先に設定，動画
を検索
• 抽象的なタスクは除外
• 自動字幕作成機能でナレーション取得
• 字幕の各行をクリップに対応
• 厳密ではないのでweakly paired
[Miech+, ICCV2019]

Olympic Sports
nスポーツに特化
• YouTube動画800本
• 16カテゴリ
• 各クラス50本
• train 40, test 10
n配布フォーマット
• seq video format (Piotr's Computer
Vision Matlab Toolbox)
[Niebles+, ECCV2010]

Sports-1M
n100万本のYouTube動画
• 487カテゴリ
• 各クラス1000〜3000本
• train70%, val 20%, test 10%
• 重複はたった0.18%程度
• 自動アノテーション
• YouTubeのメタデータを利用
• 5%に複数のラベル付与（マルチラベ
ル）
• untrimmed video
• URLのみ配布
[Karpathy+, CVPR2014]
Figure 4: Predictions on Sports-1M test data. Blue (first row) in

FineGym
n 時間もラベルも階層的な詳細認識
• ラベルの階層
• event：競技種目名
• set：elementの集合
• element：技の名前（double saltoな
ど）
• 時間
• action：eventに対応
• sub-action：elementに対応
n データセット
• event 10（男子競技6，女子競技4）
• element 530（gym530の場合）
• バージョン：v1.0, 1.1
• カテゴリ数：gym99, gym288, gym530
[Shao+, CVPR2020]
FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
Dian Shao Yue Zhao Bo Dai Dahua Lin
CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong
{sd017, zy317, bdai, dhlin}@ie.cuhk.edu.hk
… … … … …
… …
…
…
vault Uneven Bars
Actions
Sub-actions
Balance Beam Floor Exercise
Balance Beam
Beam-turns Leap-Jump-Hop BB-flight-handspring
3 turn in tuck stand Wolf jump--hip angle at 45, knees together Flic-flac with step-out
Tree
Reasoning
Tree
Reasoning
Tree
Reasoning
Sets
Elements
More
Fine-grained
Events

Diving48
n飛び込み競技の詳細認識
• 48カテゴリ：以下の組み合わせ
• takeoff, movements in flight
(somersaults / twists), entry
• 国際水泳連盟(FINA, Fédération
internationale de natation)のルー
ルに準拠
• webから動画を取得
• 1分クリップ18kに自動分割
• train16k, test 2k
• AMTでアノテーション
• 飛び込みの種類と難度
• 開始，終了フレーム
n動機
• representation biasの排除
• object, scene, person
• 「飛び込み」は国際基準のカテゴリ
がある，差異は小さい，見えにも似
ている
• 物体やシーンでは識別できないタ
スク
[Li+, ECCV2018]

Ego-4D
n 1人称視点動画（Ego4D-3K）
• 全3670時間，参加者931人，9カ国74都
市
• 日常のアクティビティ
• カメラのバッテリがなくなるまで撮
影し続ける（1撮影者で1〜10時間）
• 1クリップ平均8分
• マルチモーダル
• カメラ：RGB，ステレオ，マルチ
ビュー
• 画像：顔，視線方向， 3Dスキャン
• ナレーションテキスト，音声，IMU
n アノテーション250k時間
• ナレーション
• EPIC-Kitchenと類似手順，1分あたり
平均13.2文，計3.85M文
n タスク毎のアノテーション
• Episodic memory：110アクティビティ
• Hands and Objects：手と物体のbboxと関係ラ
ベル，時刻，状態
• Audio-Visual Diarization：顔bbox，人物ラベル，
発話の時刻と内容
• Social interaction：顔bbox，人物ID，フレーム
単位の発話人物特定，カメラを向いているか
どうか，カメラに向かって話しているか
[Grauman+, CVPR2022]
Gardening
Shopping
Pets
Sewing / Knitting
Baking
Playing games
Reading
Sports
Multi-perspective
1
3 4
2
L R
Stereo vision
Human locomotion
IMU
Social interaction
Doing laundry
3D
Video + 3D scans
Geographic diversity
Figure 1. Ego4D is a massive-scale egocentric video dataset of daily life activity spanning 74 locations worldwide. Here we see a snapshot of
the dataset (5% of the clips, randomly sampled) highlighting its diversity in geographic location, activities, and modalities. The data includes
social videos where participants consented to remain unblurred. See https://ego4d-data.org/fig1.html for interactive figure.
However, in both robotics and augmented reality, the input
is a long, fluid video stream from the first-person or “ego-
centric” point of view—where we see the world through
the eyes of an agent actively engaged with its environment.
Second, whereas Internet photos are intentionally captured
by a human photographer, images from an always-on wear-
able egocentric camera lack this active curation. Finally,
first-person perception requires a persistent 3D understand-
ing of the camera wearer’s physical surroundings, and must
interpret objects and actions in a human context—attentive
to human-object interactions and high-level social behaviors.
Motivated by these critical contrasts, we present the
Ego4D dataset and benchmark suite. Ego4D aims to cat-
alyze the next era of research in first-person visual percep-
video content that displays the full arc of a person’s complex
interactions with the environment, objects, and other people.
In addition to RGB video, portions of the data also provide
audio, 3D meshes, gaze, stereo, and/or synchronized multi-
camera views that allow seeing one event from multiple
perspectives. Our dataset draws inspiration from prior ego-
centric video data efforts [43,44,129,138,179,201,205,210],
but makes significant advances in terms of scale, diversity,
and realism.
Equally important to having the right data is to have the
right research problems. Our second contribution is a suite
of five benchmark tasks spanning the essential components
of egocentric perception—indexing past experiences, ana-
lyzing present interactions, and anticipating future activity.
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman1,2
, Andrew Westbury1
, Eugene Byrne⇤1
, Zachary Chavis⇤3
, Antonino Furnari⇤4
,
Rohit Girdhar⇤1
, Jackson Hamburger⇤1
, Hao Jiang⇤5
, Miao Liu⇤6
, Xingyu Liu⇤7
, Miguel Martin⇤1
,
Tushar Nagarajan⇤1,2
, Ilija Radosavovic⇤8
, Santhosh Kumar Ramakrishnan⇤1,2
, Fiona Ryan⇤6
,
Jayant Sharma⇤3
, Michael Wray⇤9
, Mengmeng Xu⇤10
, Eric Zhongcong Xu⇤11
, Chen Zhao⇤10
,
Siddhant Bansal17
, Dhruv Batra1
, Vincent Cartillier1,6
, Sean Crane7
, Tien Do3
, Morrie Doulaty1
,
Akshay Erapalli13
, Christoph Feichtenhofer1
, Adriano Fragomeni9
, Qichen Fu7
,
Abrham Gebreselasie12
, Cristina González14
, James Hillis5
, Xuhua Huang7
, Yifei Huang15
,
Wenqi Jia6
, Weslie Khoo16
, Jáchym Kolář13
, Satwik Kottur1
, Anurag Kumar5
, Federico Landini1
,
Chao Li5
, Yanghao Li1
, Zhenqiang Li15
, Karttikeya Mangalam1,8
, Raghava Modhugu17
,
Jonathan Munro9
, Tullie Murrell1
, Takumi Nishiyasu15
, Will Price9
, Paola Ruiz Puentes14
,
Merey Ramazanova10
, Leda Sari5
, Kiran Somasundaram5
, Audrey Southerland6
, Yusuke Sugano15
,
Ruijie Tao11
, Minh Vo5
, Yuchen Wang16
, Xindi Wu7
, Takuma Yagi15
, Ziwei Zhao16
, Yunyi Zhu11
,
Pablo Arbeláez†14
, David Crandall†16
, Dima Damen†9
, Giovanni Maria Farinella†4
,
Christian Fuegen†1
, Bernard Ghanem†10
, Vamsi Krishna Ithapu†5
, C. V. Jawahar†17
, Hanbyul Joo†1
,
Kris Kitani†7
, Haizhou Li†11
, Richard Newcombe†5
, Aude Oliva†18
, Hyun Soo Park†3
,
James M. Rehg†6
, Yoichi Sato†15
, Jianbo Shi†19
, Mike Zheng Shou†11
, Antonio Torralba†18
,
Lorenzo Torresani†1,20
, Mingfei Yan†5
, Jitendra Malik1,8
1
Meta AI, 2
University of Texas at Austin, 3
University of Minnesota, 4
University of Catania,
5
Meta Reality Labs, 6
Georgia Tech, 7
Carnegie Mellon University, 8
UC Berkeley, 9
University of Bristol,
10
King Abdullah University of Science and Technology, 11
National University of Singapore,
12
Carnegie Mellon University Africa, 13
Meta, 14
Universidad de los Andes, 15
University of Tokyo, 16
Indiana University,
17
International Institute of Information Technology, Hyderabad, 18
MIT, 19
University of Pennsylvania, 20
Dartmouth
Abstract
We introduce Ego4D, a massive-scale egocentric video
dataset and benchmark suite. It offers 3,670 hours of daily-
life activity video spanning hundreds of scenarios (house-
hold, outdoor, workplace, leisure, etc.) captured by 931
unique camera wearers from 74 worldwide locations and 9
different countries. The approach to collection is designed
to uphold rigorous privacy and ethics standards, with con-
senting participants and robust de-identification procedures
where relevant. Ego4D dramatically expands the volume of
diverse egocentric video footage publicly available to the
research community. Portions of the video are accompanied
by audio, 3D meshes of the environment, eye gaze, stereo,
and/or synchronized videos from multiple egocentric cam-
eras at the same event. Furthermore, we present a host of
new benchmark challenges centered around understanding
the first-person visual experience in the past (querying an
episodic memory), present (analyzing hand-object manipu-
lation, audio-visual conversation, and social interactions),
and future (forecasting activities). By publicly sharing this
massive annotated dataset and benchmark suite, we aim to
push the frontier of first-person perception. Project page:
https://ego4d-data.org/
1. Introduction
Today’s computer vision systems excel at naming objects
and activities in Internet photos or video clips. Their tremen-
dous progress over the last decade has been fueled by major
dataset and benchmark efforts, which provide the annota-
tions needed to train and evaluate algorithms on well-defined
tasks [49,60,61,92,108,143].
While this progress is exciting, current datasets and mod-
els represent only a limited definition of visual perception.
First, today’s influential Internet datasets capture brief, iso-
lated moments in time from a third-person “spectactor” view.
18995
Ego4Dプロジェクト紹介（SSII2022 東大佐藤先生）

YouTube-8M / Segments
nYouTube-8M
• 自動タグ付けでラベル付与，マルチラベル
• 2016: 8.2M動画, 4800クラス, 1.8 labels/video
nYouTube-8M Segmenrts
• 2019: 230Kセグメント（約46k video）, 1000クラス,
動画あたり5セグメント
• 各動画のどこかに5秒セグメントを5つ
• 開始時刻，終了時刻（=開始時刻+5秒），クラス
ラベルを人間がアノテーション
nKaggleコンペ開催
• 2017, 2018, 2019
[Abu-El-Haija+, arXiv2016]
YouTube-8M: A Large-Scale Video Classification
Benchmark
Sami Abu-El-Haija
haija@google.com
Nisarg Kothari
ndk@google.com
Joonseok Lee
joonseok@google.com
Paul Natsev
natsev@google.com
George Toderici
gtoderici@google.com
Balakrishnan Varadarajan
balakrishnanv@google.com
Sudheendra Vijayanarasimhan
svnaras@google.com
Google Research
ABSTRACT
Many recent advancements in Computer Vision are attributed to
large datasets. Open-source software packages for Machine Learn-
ing and inexpensive commodity hardware have reduced the bar-
rier of entry for exploring novel approaches at scale. It is possible
to train models over millions of examples within a few days. Al-
though large-scale datasets exist for image understanding, such as
ImageNet, there are no comparable size video classification datasets.
In this paper, we introduce YouTube-8M, the largest multi-label
video classification dataset, composed of ⇠8 million videos—500K
hours of video—annotated with a vocabulary of 4800 visual en-
tities. To get the videos and their (multiple) labels, we used a
YouTube video annotation system, which labels videos with the
main topics in them. While the labels are machine-generated, they
have high-precision and are derived from a variety of human-based
signals including metadata and query click signals, so they repre-
sent an excellent target for content-based annotation approaches.
We filtered the video labels (Knowledge Graph entities) using both
automated and manual curation strategies, including asking human
Figure 1: YouTube-8M is a large-scale benchmark for general
multi-label video classification. This screenshot of a dataset
explorer depicts a subset of videos in the dataset annotated
5v1
[cs.CV]
27
Sep
2016

YouTube-8M / Segments
n構築
• クラスはGoogle Knowldge Graphのエンティティ（元Freebase topic）
• 映像だけから分かるもの，多数のビデオがあるものを選択
• YouTubeの全動画で検索，再生回数1000以上，長さ120秒〜500秒のみ
• ランダムに10M動画を選択
• 200動画以下のクラスは除外，残ったのが8M動画
• ラベルは自動作成
• 特徴量
• 500k時間≧50年なので色々無理．特徴量だけ提供．
• 最初の6分間のフレームについて
• フレーム単位のInception特徴2048dをPCA圧縮した1024d特徴
• 動画単位に集約した1024d特徴
• 2017から：フレーム単位・動画単位の音声128d特徴
[Abu-El-Haija+, arXiv2016]

YFCC-100M
nYahoo Flickr Creative
Commons 100 Million Dataset
• 画像99.2M，動画0.8M
• マルチメディアデータセット
nアノテーション
• 人間：画像68M，動画0.4M
• 自動：画像3M, 動画7k
[Thomee+, Comm. ACM, 2016]

Action Recognition models
2015 2017 2019 2020 2021 2022
2016 2018
2013 2014
restricted 3D
Full 3D
DT
IDT
Two
Stream
TSN
C3D I3D
P3D
S3D
R(2+1)D
3D ResNet
Non-Local
TSM
SlowFast X3D
ViVit
TimeSformer
STAM
Video Transformer Network
VidTr
X-ViT
2D + 1D aggregation
(2+1)D
(2+1)D
CNN
Non-Deep
Vision
Transformer
2D + 1D aggregation
R3D
Transformer
Kinetics
ResNet
U-Net
GAN ViT
2012
ImageNet
TokenShift
VideoSwin

IDT
nDeep以前のSoTA
• Dense Trajectory (DT) [Wang+,
IJCV, 2013] を改良
• 密に検出・追跡したSIFT特
徴点周辺の特徴量（HOG,
HOF, etc）のBoF
• Improved DT (IDT)
• カメラ運動の補正
• 人物部分を除外
• BoFに加えてFisher vectorも
• power normalization利用
nDeep初期にも
• CNNにIDTを加えて性能ブース
トということも
[Wang&Schmid, ICCV2013] [Wang+, IJCV, 2013]
Int J Comput Vis (2013) 103:60–79 63
Fig. 2 Illustration of our approach to extract and characterize dense
trajectories. Left Feature points are densely sampled on a grid for each
spatial scale. Middle Tracking is carried out in the corresponding spatial
scale for L frames by median filtering in a dense optical flow field. Right
The trajectory shape is represented by relative point coordinates, and
the descriptors (HOG, HOF, MBH) are computed along the trajectory
in a N × N pixels neighborhood, which is divided into nσ × nσ × nτ
cells
3.1 Dense Sampling
We first densely sample feature points on a grid spaced by
W pixels. Sampling is carried out on each spatial scale sep-
arately, see Fig. 2 (left). This guarantees that feature points
equally cover all spatial positions and scales. Experimental
results showed that a sampling step size of W = 5 pixels is
dense enough to give good results over all datasets. There are
at most 8 spatial scales in total, depending on the resolution
of the video. The spatial scale increases by a factor of 1/
√
2.
Our goal is to track all these sampled points through the
video. However, in homogeneous image areas without any
Fig. 3 Visualization of densely sampled feature points after removing
Figure 2. Visualization of inlier matches of the robustly esti-

動作認識の最前線：手法，タスク，データセット

動作認識の最前線：手法，タスク，データセット

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 動作認識の最前線：手法，タスク，データセット

Similar to 動作認識の最前線：手法，タスク，データセット (20)

More from Toru Tamaki

More from Toru Tamaki (20)

Recently uploaded

Recently uploaded (8)

動作認識の最前線：手法，タスク，データセット