These slides were used by Umemoto of our company at an in-house technical study session. They explain the Transformer, an architecture that has attracted much attention in recent years.
"Arithmer Seminar" is held weekly; professionals from both within and outside our company give lectures on their respective areas of expertise.
These slides were made by a lecturer from outside our company and are shared here with their permission.
Arithmer Inc. is a mathematics company that originated in the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring new, advanced AI systems into solutions across a wide range of fields. Our job is to think about how to use AI well to make work more efficient and to produce results that are useful to people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems can provide solutions to tough, complex issues. At Arithmer, we believe it is our job to realize the potential of AI by improving work efficiency and producing more useful results for society.
Paper introduction: Selective Feature Compression for Efficient Activity Recognition Inference (Toru Tamaki)
Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe; Selective Feature Compression for Efficient Activity Recognition Inference, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13628-13637
https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Selective_Feature_Compression_for_Efficient_Activity_Recognition_Inference_ICCV_2021_paper.html
【ECCV 2016 BNMW】Human Action Recognition without Human (Hirokatsu Kataoka)
Project page:
http://www.hirokatsukataoka.net/research/withouthuman/withouthuman.html
The objective of this paper is to evaluate "human action recognition without human". Motion representation is frequently discussed in human action recognition. We have examined several sophisticated options, such as dense trajectories (DT) and the two-stream convolutional neural network (CNN). However, some features from the background could be too strong, as shown in some recent studies on human action recognition. Therefore, we considered whether a background sequence alone can classify human actions in current large-scale action datasets (e.g., UCF101). In this paper, we propose a novel concept for human action analysis that is named "human action recognition without human". An experiment clearly shows the effect of a background sequence for understanding an action label.
To the best of our knowledge, this is the first study of human action recognition without human. However, we should not have done that kind of thing. The motion representation from a background sequence is effective for classifying videos in a human action database. We demonstrated human action recognition in the with-human and without-human settings on the UCF101 dataset. The results show that the without-human setting (47.42%) was close to the with-human setting (56.91%). We must accept this reality to realize better motion representation.
【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction... (Hirokatsu Kataoka)
Project page
http://www.hirokatsukataoka.net/research/transitionalactionrecognition/transitionalactionrecognition.html
Herein, we address the transitional action class as a class between actions. Transitional actions should be useful for producing short-term action predictions while an action is transitive. However, transitional action recognition is difficult because actions and transitional actions partially overlap each other. To deal with this issue, we propose a subtle motion descriptor (SMD) that identifies the sensitive differences between actions and transitional actions. The two primary contributions of this paper are as follows: (i) defining transitional actions for short-term action predictions that permit earlier predictions than early action recognition, and (ii) utilizing a convolutional neural network (CNN) based SMD to present a clear distinction between actions and transitional actions. Using three different datasets, we show that our proposed approach produces better results than other state-of-the-art models. The experimental results clearly show the recognition performance effectiveness of our proposed model, as well as its ability to comprehend temporal motion in transitional actions.
【Paper introduction】Fashion Style in 128 Floats: Joint Ranking and Classification using Wea... (Hirokatsu Kataoka)
Introduction material on StyleNet, presented by Edgar Simo-Serra at CVPR2016.
"Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction," Edgar Simo-Serra and Hiroshi Ishikawa, in CVPR2016.
Paper information
http://hi.cs.waseda.ac.jp/~esimo/publications/SimoSerraCVPR2016.pdf
Project page
http://hi.cs.waseda.ac.jp/~esimo/ja/research/stylenet/
【CVPR2016_LAP】Dominant Codewords Selection with Topic Model for Action Recogn... (Hirokatsu Kataoka)
http://www.hirokatsukataoka.net/pdf/cvprw16_kataoka_ddt.pdf
In this paper, we propose a framework for recognizing human activities that uses only in-topic dominant codewords and a mixture of intertopic vectors. Latent Dirichlet allocation (LDA) is used to develop approximations of human motion primitives; these are mid-level representations, and they adaptively integrate dominant vectors when classifying human activities. In LDA topic modeling, action videos (documents) are represented by a bag-of-words (input from a dictionary), and these are based on improved dense trajectories ([18]). The output topics correspond to human motion primitives, such as finger moving or subtle leg motion. We eliminate the impurities, such as missed tracking or changing light conditions, in each motion primitive. The assembled vector of motion primitives is an improved representation of the action. We demonstrate our method on four different datasets.
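As a rough illustration of the pipeline described above (not the authors' code), the following sketch fits an LDA topic model to bag-of-words vectors built from trajectory codewords and keeps only the dominant codewords of each topic before re-projecting the videos; the vocabulary size, topic count, top-k threshold, and variable names are illustrative assumptions.

```python
# Hedged sketch: LDA over dense-trajectory bag-of-words vectors.
# Vocabulary size, topic count, and thresholds are illustrative assumptions.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
n_videos, vocab_size, n_topics = 50, 1000, 20

# Each row is a bag-of-words histogram of trajectory codewords for one video.
bow = rng.poisson(lam=1.0, size=(n_videos, vocab_size))

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
topic_mix = lda.fit_transform(bow)           # per-video topic mixture (videos x topics)

# "Dominant codewords" per topic: keep only the highest-weight words of each topic,
# zeroing the rest (a stand-in for eliminating impurities in each motion primitive).
top_k = 50
dominant = np.zeros_like(lda.components_)
for t, word_weights in enumerate(lda.components_):
    top_idx = np.argsort(word_weights)[-top_k:]
    dominant[t, top_idx] = word_weights[top_idx]

# Re-represent each video using only the in-topic dominant codewords.
video_repr = bow @ dominant.T                 # (n_videos, n_topics)
print(topic_mix.shape, video_repr.shape)
```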
【ISVC2015】Evaluation of Vision-based Human Activity Recognition in Dense Traj... (Hirokatsu Kataoka)
ISVC2015 paper
http://www.hirokatsukataoka.net/pdf/isvc15_kataoka_dt13feature.pdf
Activity recognition has been an active research topic in computer vision. Recently, the most successful approaches use dense trajectories, which extract a large number of trajectories and encode features along the trajectories into a codeword. In this paper, we evaluate various features in the framework of dense trajectories on several types of datasets. We implement 13 features in total, covering five different types of descriptor, namely motion-, shape-, texture-, trajectory-, and co-occurrence-based feature descriptors. The experimental results show the relationship between feature descriptors and performance rate on each dataset. Different scenes of traffic, surgery, daily living, and sports are used to analyze the feature characteristics. Moreover, we test how much the performance rate of concatenated vectors depends on the descriptor type, comparing the top-ranked descriptors from the experiments with all 13 feature descriptors on fine-grained datasets. Feature evaluation is beneficial not only for the activity recognition problem, but also for other domains in spatio-temporal recognition.
【ITSC2015】Fine-grained Walking Activity Recognition via Driving Recorder Dataset (Hirokatsu Kataoka)
ITSC2015
http://www.itsc2015.org/
The paper presents fine-grained walking activity recognition aimed at inferring pedestrian intention, which is an important topic for predicting and avoiding dangerous pedestrian activity. Fine-grained activity recognition distinguishes between activities with only subtle differences, such as walking in different directions. We believe a change in a pedestrian's activity is significant for grasping the pedestrian's intention. However, the task is challenging for several reasons, namely (i) the in-vehicle mounted camera is always moving, (ii) the pedestrian area is too small to capture motion and shape features, and (iii) a change of pedestrian activity (e.g., from walking straight to turning) yields only a small feature difference. To tackle these problems, we apply a vision-based approach to classify pedestrian activities. The dense trajectories (DT) method is employed for high-level recognition to capture detailed differences. Moreover, we additionally extract a detection-based region of interest (ROI) for higher performance in fine-grained activity recognition. We evaluated our proposed approach on a self-collected dataset and a near-miss driving recorder (DR) dataset, dividing activities into crossing, walking straight, turning, standing, and riding a bicycle. Our proposal achieved 93.7% on the self-collected NTSEL traffic dataset and 77.9% on the near-miss DR dataset.
Extended Co-occurrence HOG with Dense Trajectories for Fine-grained Activity ... (Hirokatsu Kataoka)
In this paper we propose a novel feature descriptor, Extended Co-occurrence HOG (ECoHOG), and integrate it with dense point trajectories, demonstrating its usefulness in fine-grained activity recognition. This feature is inspired by the original Co-occurrence HOG (CoHOG), which is based on histograms of occurrences of pairs of image gradients in the image. Instead of relying only on pure histograms, we introduce a sum of gradient magnitudes of co-occurring pairs of image gradients in the image. This gives importance to object boundaries and strengthens the difference between the moving foreground and the static background. We also couple ECoHOG with dense point trajectories extracted using optical flow from video sequences and demonstrate that they are extremely well suited for fine-grained activity recognition. Using our feature, we outperform state-of-the-art methods in this task and provide extensive quantitative evaluation.
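A minimal sketch of the core idea described above (magnitude-weighted co-occurrence of gradient-orientation pairs rather than pure counts), written only from this description; the bin count, offset set, and function name are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a CoHOG-style descriptor with ECoHOG's magnitude weighting.
# Bin count, offsets, and naming are illustrative assumptions.
import numpy as np

def ecohog(gray: np.ndarray, n_bins: int = 8,
           offsets=((0, 1), (1, 0), (1, 1), (1, -1))) -> np.ndarray:
    """For each pair of quantized gradient orientations at a given pixel offset,
    accumulate the sum of the two gradient magnitudes (instead of a count)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)                  # unsigned orientation
    bins = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)

    hist = np.zeros((len(offsets), n_bins, n_bins))
    h, w = gray.shape
    for k, (dy, dx) in enumerate(offsets):
        y0, y1 = max(0, -dy), min(h, h - dy)
        x0, x1 = max(0, -dx), min(w, w - dx)
        b1 = bins[y0:y1, x0:x1]
        b2 = bins[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        weight = mag[y0:y1, x0:x1] + mag[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        np.add.at(hist[k], (b1.ravel(), b2.ravel()), weight.ravel())
    return hist.ravel()

# Example: descriptor for a random 64x64 patch.
patch = np.random.default_rng(0).random((64, 64))
print(ecohog(patch).shape)   # (4 * 8 * 8,) = (256,)
```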
16. 1st Gen. / 2nd Gen. / 3rd Gen.
The progression of video recognition – Sparse, Dense, and Deep
1) Laptev, I. and Lindeberg, T. "Space-Time Interest Points," International Conference on Computer Vision (ICCV), pp.432–439, 2003.
2) Laptev, I., Marszalek, M., Schmid, C. and Rozenfeld, B. "Learning realistic human actions from movies," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1–8, 2008.
3) Klaser, A., Marszalek, M., and Schmid, C. "A Spatio-Temporal Descriptor Based on 3D-Gradients," British Machine Vision Conference (BMVC), 2008.
4) Wang, H., Klaser, A., Schmid, C. and Liu, C.-L. "Action recognition by dense trajectories," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3169–3176, 2011.
5) Wang, H. and Schmid, C. "Action Recognition with Improved Trajectories," International Conference on Computer Vision (ICCV), pp.3551–3558, 2013.
6) Simonyan, K. and Zisserman, A. "Two-Stream Convolutional Networks for Action Recognition in Videos," Neural Information Processing Systems (NIPS), 2014.
7) Wang, L., Qiao, Y. and Tang, X. "Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
8) Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. "Learning Spatiotemporal Features with 3D Convolutional Networks," ICCV 2015.
9) Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X. and Van Gool, L. "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition," in ECCV 2016.
10) He, Y., Shirakabe, S., Satoh, Y. and Kataoka, H. "Human Action Recognition without Human," in ECCV WS 2016.
Sparse Space-Time feature (1st Gen.) / Dense Space-Time feature (2nd Gen.) / Deeply-Learned Representation (3rd Gen.)
41. THUMOS@ICCV’13
• Improved DT won the THUMOS workshop challenge
– THUMOS: The First International Workshop on Action Recognition with a Large Number of Classes, in conjunction with ICCV '13
– Recognition accuracy was evaluated on UCF101 (101-class recognition), an extension of UCF50
– The INRIA research group achieved an 85.9% recognition rate using Improved Dense Trajectories
48. Two-stream CNN: basic facts
• Authors
– Karen Simonyan (at Oxford at the time of publication, now at DeepMind)
– NIPS2014
• Method
– Apply a CNN not only to RGB images but also to optical-flow images, which project temporal information into image form (people had struggled, and still struggle, with 3D convolution over XYT); a minimal fusion sketch follows the reference below
Simonyan, K. and Zisserman, A. "Two-Stream Convolutional Networks for Action Recognition in Videos," Neural Information Processing Systems (NIPS), 2014.
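To make the two-stream idea concrete, here is a minimal sketch (not the paper's code): one 2D CNN consumes an RGB frame, another consumes a stack of optical-flow fields, and their class scores are fused by simple averaging, one of the fusion strategies in the paper. The layer sizes, the flow-stack length of 10, and the class count are illustrative assumptions.

```python
# Hedged sketch of two-stream late fusion; shapes and layer sizes are assumptions.
import torch
import torch.nn as nn

def small_cnn(in_channels: int, n_classes: int) -> nn.Sequential:
    """A stand-in for the per-stream 2D CNN (the paper uses larger networks)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, n_classes),
    )

n_classes, flow_stack = 101, 10                          # e.g. UCF101; L = 10 flow frames
spatial_stream  = small_cnn(3, n_classes)                # RGB frame: 3 channels
temporal_stream = small_cnn(2 * flow_stack, n_classes)   # stacked (dx, dy) flow fields

rgb  = torch.randn(4, 3, 224, 224)                       # batch of RGB frames
flow = torch.randn(4, 2 * flow_stack, 224, 224)          # batch of stacked flow images

# Late fusion: average the per-stream class scores.
scores = (spatial_stream(rgb).softmax(-1) + temporal_stream(flow).softmax(-1)) / 2
print(scores.shape)                                      # torch.Size([4, 101])
```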
49. On 3D convolution
• In practice, (Two-Stream CNN) > (C3D: spatiotemporal 3D CNN)
– Not enough training data: 2D images have about 1M in ImageNet; do time-series images need an order of magnitude more training samples?
– XY and T have different characteristics: is a naive XYT kernel not enough? (a shape comparison is sketched after the reference below)
Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. "Learning Spatiotemporal Features with 3D Convolutional Networks," ICCV 2015.
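For reference, a small sketch contrasting a 2D kernel applied frame by frame (no temporal mixing) with a 3D XYT kernel of the kind used in C3D (mixes neighboring frames and has more parameters); the channel counts and kernel sizes are illustrative assumptions.

```python
# Hedged sketch: 2D (per-frame) vs 3D (XYT) convolution; sizes are assumptions.
import torch
import torch.nn as nn

B, C, T, H, W = 2, 3, 16, 112, 112             # batch, channels, frames, height, width
clip = torch.randn(B, C, T, H, W)

# 2D conv: fold time into the batch and convolve each frame independently over XY.
conv2d = nn.Conv2d(C, 64, kernel_size=3, padding=1)
frames = clip.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
out2d = conv2d(frames)                          # (B*T, 64, H, W): no temporal mixing

# 3D conv: one kernel spans XYT, so it mixes neighboring frames (C3D uses 3x3x3).
conv3d = nn.Conv3d(C, 64, kernel_size=(3, 3, 3), padding=1)
out3d = conv3d(clip)                            # (B, 64, T, H, W): temporal mixing

print(out2d.shape, out3d.shape)
print(sum(p.numel() for p in conv2d.parameters()),   # 3*64*3*3 + 64 = 1,792
      sum(p.numel() for p in conv3d.parameters()))   # 3*64*3*3*3 + 64 = 5,248
```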
58. TDD: basic facts
• Authors
– Limin Wang (at CUHK at the time of publication, now at ETH Zurich)
– CVPR2015
• Method
– Replaces the flow-based feature descriptors of IDT with convolutional feature maps
– Takes the best of both hand-crafted features and deep learning
Wang, L., Qiao, Y. and Tang, X. "Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
59. TDD framework
• Comparison of TDD and IDT
[Figure: IDT vs. TDD pipelines. Both sample points (the x marks) along a trajectory over t + L frames. IDT: feature extraction (HOG, HOF, MBH, Traj.) around each point → Fisher Vectors (FVs). TDD: feature extraction from convolutional maps (spa4, spa5, tem3, tem4) at each point → Fisher Vectors (FVs).]
60. In a little more detail
• The feature extraction differs
– IDT: extraction of hand-crafted features
• Local features are extracted around each sampling point (the x marks in the figure below)
– TDD: values are extracted from convolutional maps
• At each sampling point, values are read along the channel dimension (sketched after the figure below)
• "feature dimensionality" = "number of channels"
[Figure repeated from the previous slide: IDT extracts HOG, HOF, MBH, and Traj. descriptors and encodes them with Fisher Vectors (FVs); TDD extracts values from convolutional maps (spa4, spa5, tem3, tem4) and encodes them with FVs.]
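A minimal sketch of the difference described above (not the authors' code): an IDT-style descriptor comes from a local patch around each trajectory point, whereas a TDD-style descriptor is simply the feature-map vector along the channel axis at that location. The map size, channel count, and coordinate scaling below are illustrative assumptions.

```python
# Hedged sketch: reading a channel-wise descriptor from a conv map at trajectory points.
# Feature-map geometry and the coordinate scaling are illustrative assumptions.
import numpy as np

C, Hm, Wm = 512, 28, 28                      # conv map: channels x height x width
conv_map = np.random.default_rng(0).random((C, Hm, Wm))

frame_h, frame_w = 224, 224                  # original frame size
trajectory = [(120.0, 80.0), (122.5, 83.0), (125.0, 86.5)]   # (x, y) per frame

def tdd_descriptor(conv_map, x, y):
    """Map frame coordinates onto the conv map and return the C-dim channel vector."""
    mx = int(round(x / frame_w * (Wm - 1)))
    my = int(round(y / frame_h * (Hm - 1)))
    return conv_map[:, my, mx]               # "feature dimensionality" == "number of channels"

# One C-dimensional descriptor per trajectory point; these are later pooled
# along the trajectory and encoded with Fisher Vectors.
descs = np.stack([tdd_descriptor(conv_map, x, y) for x, y in trajectory])
print(descs.shape)                           # (3, 512)
```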
68. Isn't the background doing the work in recent datasets?
• Even in the Two-stream CNN, the RGB input does much of the work
– In UCF101, HMDB51, etc., the background region is large compared to the person region
– High classification accuracy is achieved using only spatial information from the RGB input
• The spatial stream of the Two-stream CNN alone achieves just over 70% accuracy on UCF101
• Proposal of "Human Action Recognition without Human"
• (action recognition that does not look at the person)
Y. He, S. Shirakabe, Y. Satoh, H. Kataoka, "Human Action Recognition without Human", in ECCV 2016 Workshop on Brave New Ideas for Motion Representations in Videos (BNMW). (Oral & Best Paper)
Y. He, S. Shirakabe, Y. Satoh, H. Kataoka, "人を見ない人物行動認識" (Human Action Recognition without Human, in Japanese), ViEW, 2016 (ViEW Young Researcher Encouragement Award)
70. w/ and w/o Human Setting
• With / Without human setting
– Without human setting: the central region of the frame is blacked out
– With human setting: the inverse of the without-human setting (a masking sketch follows the figure below)
[Figure: I'(x, y) = f(x, y) * I(x, y). The mask f splits the frame 1/4 : 1/2 : 1/4 in both x and y; the central region is zeroed for the "Without Human Setting", and the inverse mask (only the central region kept) is used for the "With Human Setting".]
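As a minimal sketch of this setting (not the authors' code), the mask below zeros the central half of the frame in both dimensions for the without-human input and inverts it for the with-human input; the frame size is an assumption, and the 1/4 : 1/2 : 1/4 split follows the figure above.

```python
# Hedged sketch of the with/without-human masking; frame size is an assumption.
import numpy as np

def center_mask(h: int, w: int) -> np.ndarray:
    """Binary mask that is 0 in the central 1/2 x 1/2 region and 1 elsewhere."""
    mask = np.ones((h, w), dtype=np.float64)
    mask[h // 4: h - h // 4, w // 4: w - w // 4] = 0.0
    return mask

frame = np.random.default_rng(0).random((224, 224, 3))   # stand-in RGB frame
f = center_mask(224, 224)[..., None]                      # broadcast over channels

without_human = frame * f          # central (person) region blacked out
with_human    = frame * (1.0 - f)  # inverse: only the central region kept

print(without_human.shape, with_human.shape)
```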
71. Experimental settings
– Baseline: Very deep two-stream CNN [Wang+, arXiv15]
– Two settings: without human and with human
77. The future of motion recognition
• What if we could capture refined motion representations?
– Refining unsupervised learning from video? [Vondrick+, CVPR16]
– Natural video generation [Vondrick+, NIPS16]
C. Vondrick et al. "Anticipating Visual Representations from Unlabeled Video", in CVPR, 2016.
C. Vondrick et al. "Generating Videos with Scene Dynamics", in NIPS, 2016.
80. Problem setting of the proposed method
• A transitional action (TA) is inserted between two actions
– The TA contains hints for prediction: recognition happens earlier in time than in early action recognition
– Recognizing the TA is itself a prediction of the next action: more stable than action prediction
[Figure: timeline t1–t12 with "Walk straight" (action), "Walk straight – Cross" (transitional action), and "Cross" (action). Proposal: short-term action prediction recognizes "cross" at time t5; previous works: early action recognition recognizes "cross" at time t9, a gap of Δt.]
81. Problem setting of the proposed method
• A transitional action (TA) is inserted between two actions
– The TA contains hints for prediction: recognition happens earlier in time than in early action recognition
– Recognizing the TA is itself a prediction of the next action: more stable than action prediction
Method | Setting
Action recognition | f(F^A_{1...t}) → A_t
Early action recognition | f(F^A_{1...t−L}) → A_t
Action prediction | f(F^A_{1...t}) → A_{t+L}
Transitional action recognition | f(F^TA_{1...t}) → A_{t+L}
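For readability, the four settings above can be written out more explicitly. The reading below is an interpretation based on this slide and the BMVC2016 abstract, assuming F denotes frame-level features over the given interval, A an action label, and L a fixed time offset.

```latex
% Interpretation of the four problem settings (notation assumed from the slide).
\begin{align*}
\text{Action recognition:}              && f\!\left(F^{A}_{1 \dots t}\right)   &\rightarrow A_{t}   && \text{label the action observed up to } t\\
\text{Early action recognition:}        && f\!\left(F^{A}_{1 \dots t-L}\right) &\rightarrow A_{t}   && \text{label it from a truncated observation}\\
\text{Action prediction:}               && f\!\left(F^{A}_{1 \dots t}\right)   &\rightarrow A_{t+L} && \text{predict the action } L \text{ steps ahead}\\
\text{Transitional action recognition:} && f\!\left(F^{TA}_{1 \dots t}\right)  &\rightarrow A_{t+L} && \text{recognize the TA, which implies the next action}
\end{align*}
```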