Paper introduction: TSM: Temporal Shift Module for Efficient Video Understanding (Toru Tamaki)
Ji Lin, Chuang Gan, Song Han; TSM: Temporal Shift Module for Efficient Video Understanding, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7083-7093
https://openaccess.thecvf.com/content_ICCV_2019/html/Lin_TSM_Temporal_Shift_Module_for_Efficient_Video_Understanding_ICCV_2019_paper.html
A walkthrough of You Only Look One-level Feature, disguised as miscellaneous chat about object detection (Yusuke Uchida)
Slides presented at the 7th All-Japan Computer Vision Study Group "CVPR2021 Reading Session" (Part 1).
https://kantocv.connpass.com/event/216701/
An explanation of You Only Look One-level Feature, together with broader discussion of the YOLO family and related object detection methods.
Slides from the 4th All-Japan Computer Vision Study Group "Papers on Human Recognition and Understanding", held on 2020/10/10.
The following two papers were covered:
Harmonious Attention Network for Person Re-identification (CVPR2018)
Weakly Supervised Person Re-Identification (CVPR2019)
【ECCV 2016 BNMW】Human Action Recognition without Human (Hirokatsu Kataoka)
Project page:
http://www.hirokatsukataoka.net/research/withouthuman/withouthuman.html
The objective of this paper is to evaluate "human action recognition without human". Motion representation is frequently discussed in human action recognition. We have examined several sophisticated options, such as dense trajectories (DT) and the two-stream convolutional neural network (CNN). However, some features from the background could be too strong, as shown in some recent studies on human action recognition. Therefore, we considered whether a background sequence alone can classify human actions in current large-scale action datasets (e.g., UCF101). In this paper, we propose a novel concept for human action analysis that is named "human action recognition without human". An experiment clearly shows the effect of a background sequence for understanding an action label.
To the best of our knowledge, this is the first study of human action recognition without human. However, we should not have done that kind of thing. The motion representation from a background sequence is effective for classifying videos in a human action database. We demonstrated human action recognition in with-human and without-human settings on the UCF101 dataset. The results show that the without-human setting (47.42%) was close to the with-human setting (56.91%). We must accept this reality to realize better motion representation.
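The "without human" setting can be imitated by masking out the person region before computing any feature. A toy sketch with NumPy, assuming per-frame person boxes are available from a detector; the color-histogram feature here is a deliberately crude stand-in for the motion representations the paper actually evaluates:

```python
import numpy as np

def mask_out_person(frame, person_box):
    """Zero out a (hypothetical) detected person region so that only the
    background contributes to the feature."""
    x1, y1, x2, y2 = person_box
    masked = frame.copy()
    masked[y1:y2, x1:x2] = 0
    return masked

def background_feature(frames, person_boxes):
    """Average intensity histogram over person-masked frames; a crude
    stand-in for the motion features used in the paper."""
    hists = []
    for frame, box in zip(frames, person_boxes):
        masked = mask_out_person(frame, box)
        hist, _ = np.histogram(masked, bins=16, range=(0, 256))
        hists.append(hist / hist.sum())
    return np.mean(hists, axis=0)
```

A real reproduction would replace the histogram with DT or two-stream features computed on the masked sequence.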
【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction... (Hirokatsu Kataoka)
Project page
http://www.hirokatsukataoka.net/research/transitionalactionrecognition/transitionalactionrecognition.html
Herein, we address the transitional action class, a class between actions. Transitional actions should be useful for producing short-term action predictions while an action is still in transition. However, transitional action recognition is difficult because actions and transitional actions partially overlap each other. To deal with this issue, we propose a subtle motion descriptor (SMD) that identifies the sensitive differences between actions and transitional actions. The two primary contributions of this paper are as follows: (i) defining transitional actions for short-term action predictions that permit earlier predictions than early action recognition, and (ii) utilizing a convolutional neural network (CNN) based SMD to present a clear distinction between actions and transitional actions. Using three different datasets, we show that our proposed approach produces better results than other state-of-the-art models. The experimental results clearly show the effectiveness of our proposed model, as well as its ability to comprehend temporal motion in transitional actions.
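Since actions and transitional actions overlap only subtly, a descriptor has to amplify small temporal differences. As a loose illustration of that idea (not the paper's actual SMD formulation), one can contrast pooled per-frame features just before and just after a candidate transition point:

```python
import numpy as np

def subtle_motion_descriptor(feats, t, w=5):
    """Toy SMD-like descriptor: contrast the average per-frame CNN
    feature in a short window before frame t with the window after it.
    `feats` is a (T, D) array of per-frame features; the function name,
    window pooling, and plain difference are illustrative only."""
    before = feats[max(0, t - w):t].mean(axis=0)
    after = feats[t:t + w].mean(axis=0)
    return after - before
```

For frames inside a stable action the descriptor stays near zero, while a transition produces a larger response, which is the kind of sensitivity the SMD is designed around.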
【Paper introduction】Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction (Hirokatsu Kataoka)
Slides introducing StyleNet, presented by Edgar Simo-Serra at CVPR2016.
"Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction," Edgar Simo-Serra and Hiroshi Ishikawa, in CVPR2016.
Paper:
http://hi.cs.waseda.ac.jp/~esimo/publications/SimoSerraCVPR2016.pdf
Project page:
http://hi.cs.waseda.ac.jp/~esimo/ja/research/stylenet/
【CVPR2016_LAP】Dominant Codewords Selection with Topic Model for Action Recogn... (Hirokatsu Kataoka)
http://www.hirokatsukataoka.net/pdf/cvprw16_kataoka_ddt.pdf
In this paper, we propose a framework for recognizing human activities that uses only in-topic dominant codewords and a mixture of intertopic vectors. Latent Dirichlet allocation (LDA) is used to develop approximations of human motion primitives; these are mid-level representations, and they adaptively integrate dominant vectors when classifying human activities. In LDA topic modeling, action videos (documents) are represented by a bag-of-words (input from a dictionary), and these are based on improved dense trajectories ([18]). The output topics correspond to human motion primitives, such as finger moving or subtle leg motion. We eliminate the impurities, such as missed tracking or changing light conditions, in each motion primitive. The assembled vector of motion primitives is an improved representation of the action. We demonstrate our method on four different datasets.
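The in-topic dominant-codeword idea can be illustrated with a small sketch: given an LDA topic-word matrix (rows are motion-primitive topics over BoW codewords), keep only the most probable codewords per topic and renormalize. This is a minimal sketch assuming a precomputed topic-word matrix; the function name and the hard top-k rule are illustrative, not the paper's exact selection criterion:

```python
import numpy as np

def dominant_codewords(topic_word, k=3):
    """Given an LDA-style topic-word matrix (n_topics, n_words), keep
    only the k most probable codewords per topic ("motion primitive")
    and zero out the rest, discarding impure codewords such as those
    caused by missed tracking or lighting changes."""
    pruned = np.zeros_like(topic_word)
    for t, row in enumerate(topic_word):
        top = np.argsort(row)[-k:]          # indices of dominant codewords
        pruned[t, top] = row[top]
    # renormalize each topic over its surviving codewords
    return pruned / pruned.sum(axis=1, keepdims=True)
```

In a full pipeline the topic-word matrix would come from fitting LDA on bag-of-words histograms of improved dense trajectory features.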
【ISVC2015】Evaluation of Vision-based Human Activity Recognition in Dense Traj... (Hirokatsu Kataoka)
ISVC2015 paper
http://www.hirokatsukataoka.net/pdf/isvc15_kataoka_dt13feature.pdf
Activity recognition has been an active research topic in computer vision. Recently, the most successful approaches use dense trajectories, which extract a large number of trajectories and encode features along the trajectories into codewords. In this paper, we evaluate various features in the dense trajectory framework on several types of datasets. We implement 13 features in total, covering five different types of descriptor: motion-, shape-, texture-, trajectory- and co-occurrence-based feature descriptors. The experimental results show the relationship between feature descriptors and performance rate on each dataset. Different scenes of traffic, surgery, daily living and sports are used to analyze the feature characteristics. Moreover, we test how the performance rate of concatenated vectors depends on the descriptor set, comparing the top-ranked descriptors from the experiments against all 13 feature descriptors on fine-grained datasets. Feature evaluation is beneficial not only for the activity recognition problem, but also for other domains of spatio-temporal recognition.
【ITSC2015】Fine-grained Walking Activity Recognition via Driving Recorder Dataset (Hirokatsu Kataoka)
ITSC2015
http://www.itsc2015.org/
This paper presents fine-grained walking activity recognition aimed at inferring pedestrian intention, an important topic for predicting and avoiding dangerous pedestrian activity. Fine-grained activity recognition distinguishes activities that differ only by subtle changes, such as walking in different directions. We believe a change of pedestrian activity is a significant cue for grasping pedestrian intention. However, the task is challenging for several reasons: (i) the in-vehicle mounted camera is always moving, (ii) the pedestrian region is too small to capture motion and shape features, and (iii) a change of pedestrian activity (e.g., walking straight into turning) yields only a small feature difference. To tackle these problems, we apply a vision-based approach to classify pedestrian activities. The dense trajectories (DT) method is employed for high-level recognition to capture detailed differences. Moreover, we additionally extract a detection-based region of interest (ROI) for higher performance in fine-grained activity recognition. We evaluated our proposed approach on a self-collected dataset and a near-miss driving recorder (DR) dataset, divided into several activities: crossing, walking straight, turning, standing and riding a bicycle. Our proposal achieved 93.7% on the self-collected NTSEL traffic dataset and 77.9% on the near-miss DR dataset.
Extended Co-occurrence HOG with Dense Trajectories for Fine-grained Activity ... (Hirokatsu Kataoka)
In this paper we propose a novel feature descriptor, Extended Co-occurrence HOG (ECoHOG), and integrate it with dense point trajectories, demonstrating its usefulness in fine-grained activity recognition. This feature is inspired by the original Co-occurrence HOG (CoHOG), which is based on histograms of occurrences of pairs of image gradients in the image. Instead of relying only on pure occurrence counts, we introduce a sum of gradient magnitudes of co-occurring pairs of image gradients in the image. This gives importance to object boundaries and strengthens the difference between the moving foreground and the static background. We also couple ECoHOG with dense point trajectories extracted using optical flow from video sequences and demonstrate that they are extremely well suited for fine-grained activity recognition. Using our feature we outperform state-of-the-art methods in this task and provide extensive quantitative evaluation.
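The core ECoHOG change, accumulating summed gradient magnitudes instead of raw pair counts, can be sketched over pre-quantized orientations. Assumptions in this sketch: orientations are already quantized to integer bins, a single pixel offset defines co-occurrence, and the cell/block structure of the full descriptor is omitted:

```python
import numpy as np

def ecohog(orientations, magnitudes, offset=(0, 1), bins=8):
    """Minimal ECoHOG sketch: for each pair of pixels related by
    `offset`, accumulate the SUM of the two gradient magnitudes into a
    (bins x bins) co-occurrence histogram indexed by the pair of
    quantized orientations. CoHOG would instead add 1 per pair."""
    dy, dx = offset
    h, w = orientations.shape
    hist = np.zeros((bins, bins))
    for y in range(h - dy):
        for x in range(w - dx):
            a = orientations[y, x]
            b = orientations[y + dy, x + dx]
            hist[a, b] += magnitudes[y, x] + magnitudes[y + dy, x + dx]
    return hist
```

Weighting by magnitude is what lets strong object boundaries dominate the histogram, as described in the abstract.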
10. The flow of video recognition – Sparse, Dense and Deep
1) Laptev, I. and Lindeberg, T. “Space-Time Interest Points,” International Conference on Computer Vision (ICCV), pp.432–439, 2003.
2) Laptev, I., Marszalek, M., Schmid, C. and Rozenfeld, B. “Learning realistic human actions from movies,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1–8, 2008.
3) Klaser, A., Marszalek, M. and Schmid, C. “A Spatio-Temporal Descriptor Based on 3D-Gradients,” British Machine Vision Conference (BMVC), 2008.
4) Wang, H., Klaser, A., Schmid, C. and Liu, C.-L. “Action recognition by dense trajectories,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3169–3176, 2011.
5) Wang, H. and Schmid, C. “Action Recognition with Improved Trajectories,” International Conference on Computer Vision (ICCV), pp.3551–3558, 2013.
6) Simonyan, K. and Zisserman, A. “Two-Stream Convolutional Networks for Action Recognition in Videos,” Neural Information Processing Systems (NIPS), 2014.
7) Wang, L., Qiao, Y. and Tang, X. “Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
8) Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M. “Learning Spatiotemporal Features with 3D Convolutional Networks,” International Conference on Computer Vision (ICCV), 2015.
9) Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X. and Van Gool, L. “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition,” European Conference on Computer Vision (ECCV), 2016.
10) Carreira, J. and Zisserman, A. “Quo Vadis, Action Recognition?,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
11) Hara, K., Kataoka, H. and Satoh, Y. “Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
(Timeline figure: Sparse Space-Time features → Dense Space-Time features → Deeply-Learned Representations)
34. Two-stream ConvNets: basic information
• Proposed by
– Karen Simonyan (at Oxford at the time of publication, now at DeepMind)
– NIPS 2014
• Method
– A CNN is applied not only to RGB frames but also to flow images, which project temporal information into image form
Simonyan, K. and Zisserman, A. “Two-Stream Convolutional Networks for Action Recognition in Videos,” Neural Information Processing Systems (NIPS), 2014.
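The two-stream design can be summarized by its fusion step: one CNN scores an RGB frame, another scores stacked flow images, and the per-class scores are combined. A minimal sketch of score-averaging fusion (the example scores are hypothetical; the paper also evaluates SVM-based fusion):

```python
import numpy as np

def late_fusion(rgb_scores, flow_scores, w_flow=0.5):
    """Two-stream late fusion: weighted average of the class scores from
    the spatial (RGB) stream and the temporal (optical-flow) stream."""
    return (1 - w_flow) * rgb_scores + w_flow * flow_scores

# hypothetical per-class softmax outputs of the two streams
rgb = np.array([0.7, 0.2, 0.1])
flow = np.array([0.3, 0.6, 0.1])
fused = late_fusion(rgb, flow)
pred = int(np.argmax(fused))
```

Each stream is trained separately, so fusion at the score level is cheap and keeps the two networks independent.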
40. Integrating IDT and Two-stream ConvNets: TDD
• TDD (Trajectory-pooled Deep-convolutional Descriptors)
– Trajectory extraction is the same as in IDT
– TDD: values are extracted from convolutional feature maps
(Figure: the IDT pipeline extracts features (HOG, HOF, MBH, Traj.) along trajectories and encodes them with Fisher Vectors (FVs); the TDD pipeline instead extracts features from convolutional maps (spa4, spa5, tem3, tem4) along the same trajectories and encodes them with Fisher Vectors (FVs))
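TDD's key step, pooling convolutional feature-map activations along an IDT trajectory, can be sketched as follows. This is a toy version assuming integer map coordinates; the paper additionally applies spatiotemporal and channel normalization to the maps before pooling:

```python
import numpy as np

def trajectory_pool(feature_maps, trajectory):
    """TDD-style trajectory pooling sketch: read a convolutional feature
    map at the (scaled) positions of a tracked point and sum the
    activations along the trajectory. `feature_maps` is (T, H, W, C);
    `trajectory` is a list of (t, y, x) map coordinates."""
    pooled = np.zeros(feature_maps.shape[-1])
    for t, y, x in trajectory:
        pooled += feature_maps[t, y, x]
    return pooled
```

The pooled C-dimensional vectors from many trajectories are then encoded with Fisher Vectors, mirroring the IDT pipeline.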
48. What Actions are Needed? (ICCV 2017)
What kinds of actions are needed for human action recognition?
– Recommendations for annotation and algorithm construction
– Concludes that multi-label annotation, more detailed descriptions, and object/human-joint information are important
Experiments that shape the strategy for building datasets and methods
49. What makes a video a video (CVPR 2018)
Is video recognition actually capturing motion?
– Selects/generates important frames from the video for recognition
– Concludes that the model is not learning motion, but is in fact selecting frames from the input that are easy to discriminate
Effective motion features have perhaps not actually been learned yet?
54. Problem setting of the proposed method
• A transitional action (TA) is inserted between two actions
– Hints for prediction are contained in the TA: recognition happens earlier in time than early action recognition
– Recognizing the TA is itself a prediction of the next action: more stable than action prediction
(Figure: timeline t1–t12 in which the action “Walk straight” transitions into “Cross” through the transitional action “Walk straight – Cross”; the proposal, short-term action prediction, recognizes “cross” at time t5, Δt earlier than previous work on early action recognition, which recognizes it at time t9)
55. Problem setting of the proposed method
• A transitional action (TA) is inserted between two actions
– Hints for prediction are contained in the TA: recognition happens earlier in time than early action recognition
– Recognizing the TA is itself a prediction of the next action: more stable than action prediction
Method: setting
– Action recognition: f(F^A_{1...t}) → A_t
– Early action recognition: f(F^A_{1...t−L}) → A_t
– Action prediction: f(F^A_{1...t}) → A_{t+L}
– Transitional action recognition: f(F^{TA}_{1...t}) → A_{t+L}
(F: observed frames; superscript A/TA: frames from an action / a transitional action; L: temporal offset)