This document summarizes recent developments in action recognition using deep learning techniques. It discusses early approaches using improved dense trajectories and two-stream convolutional neural networks. It then focuses on advances using 3D convolutional networks, enabled by large video datasets like Kinetics. State-of-the-art results are achieved using inflated 3D convolutional networks and temporal aggregation methods like temporal linear encoding. The document provides an overview of popular datasets and challenges and concludes with tips on training models at scale.
This document summarizes a paper titled "DeepI2P: Image-to-Point Cloud Registration via Deep Classification". The paper proposes a method for estimating the camera pose within a point cloud map using a deep learning model. The model first classifies whether points in the point cloud fall within the camera's frustum or image grid. It then performs pose optimization to estimate the camera pose by minimizing the projection error of inlier points onto the image. The method achieves more accurate camera pose estimation compared to existing techniques based on feature matching or depth estimation. It provides a new approach for camera localization using point cloud maps without requiring cross-modal feature learning.
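Presentation material for the 4th All-Japan Computer Vision Study Group, "Paper reading session on recognition and understanding of people," held on 2020/10/10.
The following two papers were covered:
Harmonious Attention Network for Person Re-identification. (CVPR2018)
Weakly Supervised Person Re-Identification (CVPR2019)
Presentation material for the Computer Vision Study Group @ Kanto, "ECCV Reading Session 2018," held on 2018/10/20.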
Yew, Z. J., & Lee, G. H. (2018). 3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration. European Conference on Computer Vision.
51. References
[Lowe1999] Lowe, D. G. (1999). Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision (pp. 1150–1157 vol. 2).
[Csurka2004] Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV (Vol. 1, p. 22).
[Lazebnik2006] Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition.
[Perronnin2007] Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition.
[Jegou2010] Jegou, H., Douze, M., Schmid, C., & Perez, P. (2010). Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition.
52. References
[Krizhevsky2012] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS).
[Simonyan2014] Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[Szegedy2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going Deeper with Convolutions. In IEEE Conference on Computer Vision and Pattern Recognition.
[He2016] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
69. References
[Viola2001] Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).
[Dalal2005] Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Felzenswalb2009] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
[Girshick2014] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.
70. References
[Girshick2015] Girshick, R. (2015). Fast R-CNN. In International Conference on Computer Vision (pp. 1440–1448).
[Ren2015] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NIPS).
[Redmon2015] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You Only Look Once: Unified, Real-Time Object Detection. In Conference on Computer Vision and Pattern Recognition.
[Liu2016] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision.
71. References
[Law2018] Law, H., & Deng, J. (2018). CornerNet: Detecting Objects as Paired Keypoints. In European Conference on Computer Vision.
[Zhou2019] Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as Points. ArXiv, arXiv:1904.
[Duan2019] Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Keypoint triplets for object detection. In IEEE International Conference on Computer Vision.
91. References
[Thoma2016] Thoma, M. (2016). A Survey of Semantic Segmentation. arXiv:1602.06541v2.
[He2004] He, X., Zemel, R. S., & Carreira-Perpiñán, M. Á. (2004). Multiscale conditional random fields for image labeling. In IEEE Conference on Computer Vision and Pattern Recognition.
[Shotton2009] Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2009). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1), 2–23.
[Krahenbuhl2011] Krähenbühl, P., & Koltun, V. (2011). Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In Advances in Neural Information Processing Systems (NIPS).
92. References
[Long2015] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition.
[Zheng2015] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., … Torr, P. H. S. (2015). Conditional Random Fields as Recurrent Neural Networks. In IEEE International Conference on Computer Vision.
[Noh2015] Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision.
[Ronneberger2015] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.
93. References
[Yu2016] Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. In International Conference on Learning Representations.
[Chen2017] Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. ArXiv, arXiv:1706.
[Zhao2017] Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid Scene Parsing Network. In IEEE Conference on Computer Vision and Pattern Recognition.
110. Building Rome in a Day [Agarwal2009]
Reconstructs an entire city from 150,000 Internet images in under one day on a 500-core cluster.
https://www.youtube.com/watch?v=sQegEro5Bfo
112. Building Rome on a Cloudless Day [Frahm2010]
Builds a dense 3D model from 3 million images in about one day on a single PC (+GPU).
Credit: [Frahm2010]
https://www.youtube.com/watch?v=PySBQ8Q_R8k
114. Visual SLAM
Uses the mechanisms of Structure from Motion to recognize the camera's motion and the 3D space simultaneously, with applications such as augmented reality (AR).
Simultaneous Localization And Mapping (SLAM)
Localization
Mapping
139. References
[Snavely2006] Snavely, N., Seitz, S. M., & Szeliski, R. (2006). Photo tourism: exploring photo collections in 3D. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
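[岡谷2010] Okatani, T. (2010). Chapter 1: Bundle adjustment. In Computer Vision Frontier Guide 3 (コンピュータビジョン最先端ガイド3) (pp. 1–32). Adcom Media (アドコムメディア).
[古川2012] Furukawa, Y. (2012). Chapter 2: Methods for 3D reconstruction from multiple images. In Computer Vision Frontier Guide 5 (コンピュータビジョン最先端ガイド5) (pp. 33–70). Adcom Media (アドコムメディア).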
[Agarwal2009] Agarwal, S., Snavely, N., Simon, I., Seitz, S. M., & Szeliski, R. (2009). Building Rome in a day. In International Conference on Computer Vision (pp. 72–79).
[Frahm2010] Frahm, J., Fite-Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C., … Pollefeys, M. (2010). Building Rome on a Cloudless Day. In European Conference on Computer Vision (pp. 368–381).
140. References
[Mur-Artal2015] Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. (2015). ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 31(5), 1147–1163.
[Rublee2011] Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision.
[Newcombe2011] Newcombe, R. A., Lovegrove, S. J., & Davison, A. J. (2011). DTAM: Dense Tracking and Mapping in Real-Time. In International Conference on Computer Vision.
[Engel2014] Engel, J., Schöps, T., & Cremers, D. (2014). LSD-SLAM: Large-Scale Direct monocular SLAM. In European Conference on Computer Vision.
[Godard2017] Godard, C., Mac Aodha, O., & Brostow, G. J. (2017). Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Conference on Computer Vision and Pattern Recognition.
141. References
[Tateno2017] Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition.
[Zhou2017] Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition.
[Bloesch2018] Bloesch, M., Czarnowski, J., Clark, R., Leutenegger, S., & Davison, A. J. (2018). CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition.
[Tang2019] Tang, C., & Tan, P. (2019). BA-Net: Dense Bundle Adjustment Network. In International Conference on Learning Representations.
[Gordon2019] Gordon, A., Li, H., Jonschkowski, R., & Angelova, A. (2019). Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In IEEE International Conference on Computer Vision.
148. References
[Agarwala2004] Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., … Cohen, M. (2004). Interactive digital photomontage. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (Vol. 23).
[Chen2009] Chen, T., Cheng, M.-M., Tan, P., Shamir, A., & Hu, S.-M. (2009). Sketch2Photo: internet image montage. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[Radford2016] Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations.
[Gatys2016] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image Style Transfer Using Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
[Isola2017] Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition.
149. References
[Blanz1999] Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (pp. 187–194).
[Hoiem2005] Hoiem, D., & Efros, A. A. (2005). Automatic photo pop-up. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[Tran2018] Tran, L., & Liu, X. (2018). Nonlinear 3D Face Morphable Model. In IEEE Conference on Computer Vision and Pattern Recognition.
[Kato2018] Kato, H., Ushiku, Y., & Harada, T. (2018). Neural 3D Mesh Renderer. In IEEE Conference on Computer Vision and Pattern Recognition.
[Saito2019] Saito, S., Huang, Z., Natsume, R., Morishima, S., Li, H., & Kanazawa, A. (2019). PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In IEEE International Conference on Computer Vision.