CVPR 2020: Trends, Observations, and Meta-Survey

若宮天雅,笠井誠斗,石川裕地

Group 22: Video Analysis and Understanding
Outline

1. Words Used in CVPR Papers

2. Trends in Video Recognition Tasks

3. Representation Learning for Video Recognition

4. New Datasets

5. New Tasks

6. Video Recognition Architectures

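Words Used in CVPR Papers

Words Used in CVPR Paper Titles

[Figure: words used in titles, 2020 vs. 2015]

Words Used in CVPR Paper Abstracts

[Figure: words used in abstracts, 2020 vs. 2015]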
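
The slides do not describe how the word statistics were produced. As a rough sketch only, counting words across paper titles could look like the following. The stopword list and the function name `count_title_words` are assumptions made for illustration, and the sample input is just a handful of CVPR 2020 titles that appear in this deck, not the full set of papers.

```python
import re
from collections import Counter

# Hypothetical stopword list; the slides do not say which words were filtered out.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "via", "with"}


def count_title_words(titles):
    """Count lowercase word occurrences across titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        # Keep hyphenated terms such as "two-stream" as a single token.
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


# A few titles taken from this deck, used only as sample input.
SAMPLE_TITLES = [
    "X3D: Expanding Architectures for Efficient Video Recognition",
    "Video Panoptic Segmentation",
    "Screencast Tutorial Video Understanding",
    "Evolving Losses for Unsupervised Video Representation Learning",
]

word_counts = count_title_words(SAMPLE_TITLES)
print(word_counts.most_common(5))
```

Running the same function on the titles of two different years would allow a side-by-side comparison like the one in the figures.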

Trends in Video Recognition Tasks

Trends in Video Recognition Tasks at CVPR

task / method                                     2015  2016  2017  2018  2019  2020
action recognition / video classification           21    32    31    35    34    38
3D CNN                                               0     3     8    12    23    22
two-stream                                           0     6    11    11    10     9
action localization (detection)                      2    10     6     6    11    10
action segmentation                                  3     0     3     5     1     7
action proposal                                      1     2     2     1     2     1
video captioning                                     0     3     9    10     4     6
spatio-temporal action localization (detection)      0     1     0     0     2     0
video retrieval                                      1     0     1     0     1     5

● The number of video recognition papers has not grown dramatically → is the barrier to entry high because of the compute required?

● Fine-grained tasks have been increasing in recent years → are video retrieval and action segmentation worth watching going forward?

● Spatio-temporal action localization is hardly being studied (0 papers in 2020)

→ is the task too difficult?

Note: the values are approximate.
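
To make the trends in the table easier to read, the counts can be summarized per row. The sketch below copies the approximate values from the table and reports, for each task or method, the 2015 count, the 2020 count, and the earliest year in which the count peaked. The names `COUNTS` and `summarize` are chosen here for illustration.

```python
YEARS = [2015, 2016, 2017, 2018, 2019, 2020]

# Approximate paper counts copied from the table above.
COUNTS = {
    "action recognition / video classification": [21, 32, 31, 35, 34, 38],
    "3D CNN": [0, 3, 8, 12, 23, 22],
    "two-stream": [0, 6, 11, 11, 10, 9],
    "action localization (detection)": [2, 10, 6, 6, 11, 10],
    "action segmentation": [3, 0, 3, 5, 1, 7],
    "action proposal": [1, 2, 2, 1, 2, 1],
    "video captioning": [0, 3, 9, 10, 4, 6],
    "spatio-temporal action localization (detection)": [0, 1, 0, 0, 2, 0],
    "video retrieval": [1, 0, 1, 0, 1, 5],
}


def summarize(row):
    """Return (2015 count, 2020 count, earliest year with the maximum count)."""
    peak_year = YEARS[row.index(max(row))]
    return row[0], row[-1], peak_year


summary = {name: summarize(row) for name, row in COUNTS.items()}
for name, (first, last, peak) in summary.items():
    print(f"{name}: {first} -> {last} (peak in {peak})")
```

Because the counts are small and approximate, year-to-year differences of one or two papers should not be over-interpreted.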

Representation Learning for Video Recognition

Representation Learning Is a Hot Topic

● The share of papers on representation learning has increased substantially

○ CVPR 2020 in particular has many such papers



Video Representation Learning

• State-of-the-art video representation learning at CVPR 2020

– Embedding learning
• [Teng Long+] “Searching for Actions on the Hyperbole”
– Multimodal representation learning
• [Antoine Miech+] “End-to-End Learning of Visual Representations From Uncurated
Instructional Videos”
• [Shizhe Chen+] “Fine-Grained Video-Text Retrieval With Hierarchical
Graph Reasoning”
– Unsupervised representation learning
• [Hyodong Lee+] “Large Scale Video Representation Learning via
Relational Graph Clustering”
• [AJ Piergiovanni+] “Evolving Losses for Unsupervised Video
Representation Learning”
New Datasets

Dataset Proposal Papers (1/11)

● Notation: dataset name (task or overview)
[authors] “paper title”
● COD10K (camouflaged object detection)
[Deng-Ping Fan+] “Camouflaged Object Detection”
● The Liver-Kidney-Stomach Dataset (whole slide images classification)
[Sam Maksoud+] “SOS: Selective Objective Switch for Rapid
Immunofluorescence Whole Slide Image Classification”
● 3DOH50K (human shape and pose estimation)
[Tianshu Zhang+] “Object-Occluded Human Shape and Pose Estimation from a
Single Color Image”
● VGPHRASECUT dataset (PhraseCut)
[Chenyun Wu+] “PhraseCut: Language-based Image Segmentation in the Wild”
● Internal Camera Image Dataset, Dual-Camera-Pose Dataset
[Zhenpei Yang+] “SurfelGAN: Synthesizing Realistic Sensor Data for
Autonomous Driving”

Dataset Proposal Papers (2/11)

● TAPOS (temporal action parsing)
[Dian Shao+] “Intra- and Inter-Action Understanding via Temporal Action Parsing”
● M-R dataset (reflection removal)
[Chenyang Lei+] “Polarized Reflection Removal with Perfect Alignment in the
Wild”
● SOBA (instance shadow detection)
[Tianyu Wang+] “Instance Shadow Detection”
● Waymo Open Dataset (including camera+LiDAR with bounding boxes)
[Pei Sun+] “Scalability in Perception for Autonomous Driving: Waymo Open Dataset”
● Google Landmarks Dataset v2 (instance-level recognition and retrieval)
[Tobias Weyand+] “Google Landmarks Dataset v2 - A Large-Scale Benchmark for
Instance-Level Recognition and Retrieval”
● Plagiarized Fashion (plagiarized clothes retrieval)
[Yining Lang+] “Which Is Plagiarism: Fashion Image Retrieval Based on Regional
Representation for Design Protection”

Dataset Proposal Papers (3/11)

● FineGym Dataset (Event-/Set-/Element-level Action Recognition, Temporal
Action Localization)
[Dian Shao+] “FineGym: A Hierarchical Video Dataset for Fine-Grained Action
Understanding”
● FSOD (few-shot object detection)
[Qi Fan+] “Few-Shot Object Detection With Attention-RPN and Multi-Relation
Detector”
● NYU-VP (vanishing point estimation)
[Florian Kluger+] “CONSAC: Robust Multi-Model Fitting by Conditional Sample
Consensus”
● GHS3D (3D human shape modeling)
[Hongyi Xu+] “GHUM & GHUML: Generative 3D Human Shape and Articulated
Pose Models”
● unnamed (shadow generation)
[Qingyuan Zheng+] “Learning to Shadow Hand-Drawn Sketches”

Dataset Proposal Papers (4/11)

● Homography Dataset (homography estimation)
[Hoang Le+] “Deep Homography Estimation for Dynamic Scenes”
● VQA introspect dataset (visual question answering)
[Ramprasaath R. Selvaraju+] “SQuINTing at VQA Models: Introspecting VQA Models With
Sub-Questions”
● Cops-Ref dataset (referring expression comprehension)
[Zhenfang Chen+] “Cops-Ref: A New Dataset and Task on Compositional Referring
Expression Comprehension”
● ActivityNet-SRL (video object grounding)
[Arka Sadhu+] “Video Object Grounding using Semantic Roles in Language
Description”
● Forking Paths (person trajectory prediction)
[Junwei Liang+] “The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction”
● MOPED (unseen object pose estimation)
[Keunhong Park+] “LatentFusion: End-to-End Differentiable Reconstruction and Rendering
for Unseen Object Pose Estimation”

Dataset Proposal Papers (5/11)

● unnamed (visual reaction)
[Kuo-Han Zeng+] “Visual Reaction: Learning to Play Catch With Your Drone”
● The Photoshop Tutorial Video Dataset (visual understanding of screencast
tutorials)
[Kunpeng Li+] “Screencast Tutorial Video Understanding”
● fly dataset (animal pose estimation)
[Siyuan Li+] “Deformation-Aware Unpaired Image Translation for Pose Estimation on
Laboratory Animals”
● BlendedMVS (multi-view stereo)
[Yao Yao+] “BlendedMVS: A Large-scale Dataset for Generalized Multi-view
Stereo Networks”
● VIOLIN (video-and-language inference)
[Jingzhou Liu+] “VIOLIN: A Large-Scale Dataset for Video-and-Language
Inference”
● VIPER, Cityscapes-VPS (video panoptic segmentation)
[Dahun Kim+] “Video Panoptic Segmentation”

Dataset Proposal Papers (6/11)

● COCO-MEBOW (body orientation estimation)
[Chenyan Wu+] “MEBOW: Monocular Estimation of Body Orientation In the
Wild”
● VizWiz-QualityIssues (image quality assessment)
[Tai-Yin Chiu+] “Assessing Image Quality Issues for Real-World Problems”
● DVSNOISE20 (event denoising)
[R. Wes Baldwin+] “Event Probability Mask (EPM) and Event Denoising
Convolutional Neural Network (EDnCNN) for Neuromorphic Cameras”
● IQVA (immersive visual attention dataset)
[Ming Jiang+] “Fantastic Answers and Where to Find Them: Immersive
Question-Directed Visual Attention”
● nuScenes dataset (autonomous driving)
[Holger Caesar+] “nuScenes: A Multimodal Dataset for Autonomous Driving”
● VidSTG (spatio-temporal video grounding)
[Zhu Zhang+] “Where Does It Exist: Spatio-Temporal Video Grounding for
Multi-Form Sentences”

Dataset Proposal Papers (7/11)

● Action Genome (decomposing actions into spatio-temporal scene graphs)
[Jingwei Ji+] “Action Genome: Actions as Compositions of Spatio-temporal
Scene Graphs”
● YCB-Affordance dataset (predicting human grasp affordances)
[Enric Corona+] “GanHand: Predicting Human Grasp Affordances in
Multi-Object Scenes”
● AnimalWeb (animal face alignment)
[Muhammad Haris Khan+] “AnimalWeb: A Large-Scale Hierarchical Dataset of
Annotated Animal Faces”
● SPARE3D (spatial reasoning)
[Wenyu Han+] “SPARE3D: A Dataset for SPAtial REasoning on Three-View Line
Drawings”
● 3D-ZeF (multi-object zebrafish tracking)
[Malte Pedersen+] “3D-ZeF: A 3D Zebrafish Tracking Benchmark Dataset”

Dataset Proposal Papers (8/11)

● OASIS (Single-view 3D)
[Weifeng Chen+] “OASIS: A Large-Scale Dataset for Single Image 3D in the
Wild”
● FaceScape (3D face dataset)
[Haotian Yang+] “FaceScape: a Large-scale High Quality 3D Face Dataset and
Detailed Riggable 3D Face Prediction”
● MSeg (multi-domain semantic segmentation)
[John Lambert+] “MSeg: A Composite Dataset for Multi-domain Semantic
Segmentation”
● MSLS (place recognition)
[Frederik Warburg+] “Mapillary Street-Level Sequences: A Dataset for Lifelong
Place Recognition”
● BDD100K (driving video dataset)
[Fisher Yu+] “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask
Learning”

Dataset Proposal Papers (9/11)

● FSS-1000 (few-shot segmentation)
[Xiang Li+] “FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation”
● HUMBI (body expressions)
[Zhixuan Yu+] “HUMBI: A Large Multiview Dataset of Human Body Expressions”
● PANDA (gigapixel-level human-centric video dataset)
[Xueyang Wang+] “PANDA: A Gigapixel-Level Human-Centric Video Dataset”
● COCAS (re-identification)
[Shijie Yu+] “COCAS: A Large-Scale Clothes Changing Person Dataset for
Re-Identification”
● IntrA (3D intracranial aneurysm)
[Xi Yang+] “IntrA: 3D Intracranial Aneurysm Dataset for Deep Learning”
● DeeperForensics-1.0 (Face Forgery Detection)
[Liming Jiang+] “DeeperForensics-1.0: A Large-Scale Dataset for Real-World
Face Forgery Detection”

Dataset Proposal Papers (10/11)

● Celeb-DF (DeepFake detection)
[Yuezun Li+] “Celeb-DF: A Large-Scale Challenging Dataset for DeepFake
Forensics”
● WHU Dataset (multi-view stereo)
[Jin Liu+] “A Novel Recurrent Encoder-Decoder Structure for Large-Scale
Multi-view Stereo Reconstruction from An Open Aerial Dataset”
● KeypointNet (3D keypoint dataset)
[Yang You+] “KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from
Numerous Human Annotations”
● RealFaceDB (human face)
[Alexandros Lattas+] “AvatarMe: Realistically Renderable 3D Facial
Reconstruction “in-the-wild””
● CarlaRS dataset, Fastec-RS dataset (rolling shutter effect correction)
[Peidong Liu+] “Deep Shutter Unrolling Network”

Dataset Proposal Papers (11/11)

● CUS dataset (cars in uncommon states)
[Zongdai Liu+] “3D Part Guided Image Editing for Fine-grained Object
Understanding”
● unnamed (exposure bracketing selection)
[Zhouxia Wang+] “Learning a Reinforced Agent for Flexible Exposure Bracketing
Selection”
● Ambiguous-HOI (hard ambiguous images of Human-Object Interaction)
[Yong-Lu Li+] “Detailed 2D-3D Joint Representation for Human-Object
Interaction”
● unnamed (non-rigid 3D reconstruction)
[Aljaz Bozic+] “DeepDeform: Learning Non-Rigid RGB-D Reconstruction With
Semi-Supervised Data”
● CHI3D, FlickrCI3D (3D contact prediction and reconstruction)
[Mihai Fieraru+] “Three-dimensional Reconstruction of Human Interactions”
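
Trend in the Number of Dataset Proposal Papers

● The values are approximate

- They were collected by scraping, so some papers may be missing or miscounted

● New datasets are proposed every year and their number keeps increasing, while the ratio stays at roughly 4%

● Many studies propose a dataset together with a new task, and tasks are becoming increasingly fine-grained (discussed later)

year       2015   2016   2017   2018   2019   2020
n_papers     26     29     30     44     50     53
ratio      4.3%   4.5%   3.8%   4.5%   3.8%   3.6%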
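
The slides do not state what the ratio is measured against. Assuming it is the number of dataset papers divided by the total number of papers considered for that year, the denominator can be recovered roughly as n_papers / ratio. The sketch below does this arithmetic; the function name `implied_total` is chosen here for illustration, and because the ratios are rounded to one decimal place the results are only rough estimates.

```python
YEARS = [2015, 2016, 2017, 2018, 2019, 2020]
N_PAPERS = [26, 29, 30, 44, 50, 53]
RATIO_PERCENT = [4.3, 4.5, 3.8, 4.5, 3.8, 3.6]


def implied_total(n_papers, ratio_percent):
    """Estimate the denominator behind a percentage: n_papers / (ratio / 100)."""
    return round(n_papers / (ratio_percent / 100))


# Assumption: ratio = n_papers / total papers of that year.
implied_totals = {
    year: implied_total(n, r)
    for year, n, r in zip(YEARS, N_PAPERS, RATIO_PERCENT)
}

# Growth of dataset papers and of the implied total between 2015 and 2020.
dataset_growth = N_PAPERS[-1] / N_PAPERS[0]
total_growth = implied_totals[2020] / implied_totals[2015]

for year, total in implied_totals.items():
    print(year, total)
print(f"dataset papers grew x{dataset_growth:.2f}, implied total grew x{total_growth:.2f}")
```

Under this assumption the implied total grows faster than the number of dataset papers, which is why the count rises while the ratio stays flat or drifts slightly down.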
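
Examples of Datasets

● Our impression is that datasets with a new problem setting or a distinctive feature are accepted more readily than datasets that are merely large

An ultra-high-resolution video dataset at 25K x 14K or more [1]

Detection of camouflaged objects [2]

[1] [Xueyang Wang+] “PANDA: A Gigapixel-Level Human-Centric Video Dataset”
[2] [Deng-Ping Fan+] “Camouflaged Object Detection”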

New Tasks

Papers Proposing New Tasks (1/4)

● Notation: task name
[authors] “paper title”
● video-and-language inference
[Jingzhou Liu+] “VIOLIN: A Large-Scale Dataset for Video-and-Language
Inference”
● camouflaged object detection
[Deng-Ping Fan+] “Camouflaged Object Detection”
● semantic instance completion
[Ji Hou+] “RevealNet: Seeing Behind Objects in RGB-D Scans”
● spatio-temporal video grounding
[Zhu Zhang+] “Where Does It Exist: Spatio-Temporal Video Grounding for
Multi-Form Sentences”
● zero-shot temporal activity detection (ZSTAD)
[Lingling Zhang+] “ZSTAD: Zero-Shot Temporal Activity Detection”

Papers Proposing New Tasks (2/4)

● context-aware group captioning
[Zhuowan Li+] “Context-Aware Group Captioning via Self-Attention and
Contrastive Features”
● 3D controllable image synthesis
[Yiyi Liao+] “Towards Unsupervised Learning of Generative Models for 3D
Controllable Image Synthesis”
● image synthesis conditioned on salient object layout
[Yandong Li+] “BachGAN: High-Resolution Image Synthesis from Salient Object
Layout”
● video panoptic segmentation
[Dahun Kim+] “Video Panoptic Segmentation”
● image manipulation from scene graphs
[Helisa Dhamo+] “Semantic Image Manipulation Using Scene Graphs”

Papers Proposing New Tasks (3/4)

● REVERIE
[Yuankai Qi+] “REVERIE: Remote Embodied Visual Referring Expression in Real
Indoor Environments”
● Action Genome
[Jingwei Ji+] “Action Genome: Actions as Compositions of Spatio-temporal
Scene Graphs”
● instance shadow detection
[Tianyu Wang+] “Instance Shadow Detection”
● composed query image retrieval
[Mehrdad Hosseinzadeh+] “Composed Query Image Retrieval Using Locally
Bounded Features”
● image quality assessment for image captioning and visual question answering
[Tai-Yin Chiu+] “Assessing Image Quality Issues for Real-World Problems”
● predicting human grasp affordances
[Enric Corona+] “GanHand: Predicting Human Grasp Affordances in
Multi-Object Scenes”

Papers Proposing New Tasks (4/4)

● task-specific end-to-end NAS
[Shoukang Hu+] “DSNAS: Direct Neural Architecture Search without Parameter
Retraining”
● few-shot and zero-shot animal face alignment
[Muhammad Haris Khan+] “AnimalWeb: A Large-Scale Hierarchical Dataset of
Annotated Animal Faces”
● PhraseCut
[Chenyun Wu+] “PhraseCut: Language-based Image Segmentation in the Wild”
● understanding of screencast tutorials
[Kunpeng Li+] “Screencast Tutorial Video Understanding”
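● Many of the tasks are fine-grained or use a reduced amount of data

● Some have fairly unusual problem settings, and some carry an image-domain problem setting over to video recognition

● (This list is also likely to be incomplete; please bear with us)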

Examples of New Tasks

A dataset for understanding how each software application is operated [1]

Decomposing actions into spatio-temporal scene graphs [2]

[1] [Kunpeng Li+] “Screencast Tutorial Video Understanding”

[2] [Jingwei Ji+] “Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs”


Video Recognition Architectures

Video Recognition Architectures

● Notation: architecture name
[authors] “paper title”
● X3D
[Christoph Feichtenhofer] “X3D: Expanding Architectures for Efficient Video
Recognition”
● Template Network
[Yizhou Zhou+] “Spatiotemporal Fusion in 3D CNNs: A Probabilistic View”
● FDC3D
[Hanting Chen+] “Frequency Domain Compact 3D Convolutional Neural
Networks”
● SmallBigNet
[Xianhang Li+] “SmallBigNet: Integrating Core and Contextual Views for Video
Classification”
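
Video Recognition Architectures

Rather than proposing simple architectures such as 2D CNNs, 3D CNNs, and (2+1)D CNNs,
work has moved on to reducing computational cost and to new ways of combining CNNs

Optimizing factors such as the input size [1]

A method that reduces computational cost by working in the frequency domain [2]

[1] [Christoph Feichtenhofer] “X3D: Expanding Architectures for Efficient Video Recognition”
[2] [Hanting Chen+] “Frequency Domain Compact 3D Convolutional Neural Networks”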
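
As background on why computational cost matters for these architectures, the sketch below compares the number of weights in a full 3D convolution with a (2+1)D factorization, that is, a spatial convolution followed by a temporal one. It is generic arithmetic and is not taken from any of the papers cited above; the channel sizes and the choice of intermediate width `c_mid` are assumptions made for illustration.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias omitted)."""
    return c_in * c_out * t * d * d


def conv2plus1d_params(c_in, c_mid, c_out, t, d):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


# Hypothetical layer: 64 input and output channels, 3 x 3 x 3 kernel.
full = conv3d_params(c_in=64, c_out=64, t=3, d=3)
# Assumption: the intermediate width equals the output width.
factored = conv2plus1d_params(c_in=64, c_mid=64, c_out=64, t=3, d=3)

print(f"3D conv weights:     {full}")
print(f"(2+1)D conv weights: {factored}")
print(f"ratio:               {factored / full:.2f}")
```

With these particular sizes the factorized layer has fewer than half the weights of the full 3D layer. A larger `c_mid` would narrow the gap, so the saving depends on how the intermediate width is chosen.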
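
Summary

● Papers on video recognition appear at a steady pace, but the field does not seem to be booming

(the cost of entry is high, for example in compute, and the tasks are difficult)

● The number of 3D CNN papers has been on the rise

● Fine-grained video recognition is starting to attract attention

● Representation learning is on an upward trend, but work focusing on video is only just beginning to appear

● Many datasets are proposed every year, with more than 50 new datasets this year

● A wide variety of tasks are proposed, some with quite unusual problem settings

● Because video recognition is computationally expensive, papers on reducing computational cost, rather than on improving recognition accuracy, may increase in the future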

