Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Eun Ji Lee)
1. The document summarizes a research paper on neural image caption generation using visual attention mechanisms. It introduces attention models that allow an image captioning model to focus on salient regions of the image dynamically.
2. It describes the image captioning model which uses an LSTM decoder conditioned on an encoded image representation and a context vector. The context vector is generated by taking a weighted sum of image features, with the weights determined by an attention model.
3. It discusses two types of attention mechanisms - "hard" or stochastic attention, which selects a single image location at each time step, and "soft" or deterministic attention, which blends all locations with learned weights. The model is trained end-to-end to maximize a variational lower bound on the log-likelihood (hard attention) or the log-likelihood itself (soft attention).
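The soft-attention step summarized above, a context vector formed as a weighted sum of image features with softmax-normalized weights, can be sketched in plain Python. The feature vectors, dimensions, and relevance scores below are illustrative placeholders, not the paper's learned parameters.

```python
import math

def softmax(scores):
    # Normalize unbounded attention scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    """Blend all image locations: context = sum_i alpha_i * a_i."""
    alphas = softmax(scores)
    dim = len(features[0])
    context = [0.0] * dim
    for alpha, feat in zip(alphas, features):
        for d in range(dim):
            context[d] += alpha * feat[d]
    return context, alphas

# Three toy 2-D feature vectors, one per image location.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.5, 0.5]          # unnormalized relevance scores
context, alphas = soft_attention(features, scores)
```

Because every location contributes with a differentiable weight, this variant trains with ordinary backpropagation; hard attention instead samples one location, which requires the variational treatment mentioned above.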
Paper introduction: Selective Feature Compression for Efficient Activity Recognition Inference (Toru Tamaki)
Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe; Selective Feature Compression for Efficient Activity Recognition Inference, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13628-13637
https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Selective_Feature_Compression_for_Efficient_Activity_Recognition_Inference_ICCV_2021_paper.html
Paper introduction: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (Toru Tamaki)
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
https://proceedings.neurips.cc/paper/2021/hash/64f1f27bf1b4ec22924fd0acb550c235-Abstract.html
https://arxiv.org/abs/2105.15203
Paper introduction: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Toru Tamaki)
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR2021.
https://openreview.net/forum?id=YicbFdNTTy
Paper introduction: TinyVIRAT: Low-resolution Video Action Recognition (Toru Tamaki)
Ugur Demir, Yogesh S Rawat, Mubarak Shah, TinyVIRAT: Low-resolution Video Action Recognition, ICPR2021, pp. 7387-7394
doi: 10.1109/ICPR48806.2021.9412541
https://www.computer.org/csdl/proceedings-article/icpr/2021/09412541/1tmi1dyHSEg
https://arxiv.org/abs/2007.07355
From Points to Multi-Object 3D Reconstruction (tomoaki0705)
This document proposes a method that reconstructs multiple 3D objects from a single RGB image. It introduces shape selection to improve 3D shape recognition and incorporates collision constraints; unlike prior work, it does not rely on a specific shape representation format. Keypoints are first detected in the image to locate object centers; a shape is then selected from a database and a bounding box is estimated. A loss function including a collision term is optimized during training to reconstruct objects while avoiding overlaps. The method shows improved shape fidelity over prior work and can reconstruct objects from both real and synthetic images.
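The paper's exact collision loss is not reproduced in this summary; as a rough, hedged sketch of the idea, a penalty on pairwise overlap of axis-aligned 3-D bounding boxes looks like this (box coordinates are made-up examples):

```python
def overlap_1d(a_min, a_max, b_min, b_max):
    # Length of the overlap of two intervals (0 if disjoint).
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))

def collision_loss(boxes):
    """Sum of pairwise overlap volumes of axis-aligned 3-D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax); a positive
    loss means at least two reconstructed objects interpenetrate,
    so minimizing it pushes objects apart during training.
    """
    loss = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            vol = 1.0
            for d in range(3):
                vol *= overlap_1d(a[d], a[d + 3], b[d], b[d + 3])
            loss += vol
    return loss

# Two unit cubes overlapping in a 0.5-wide slab along x.
boxes = [(0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1.5, 1, 1)]
```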
This document discusses OpenCV and contains 27 sections covering various topics:
1. It introduces OpenCV and mentions IplImage, CvMat, and cv::Mat image structures
2. It describes image and matrix properties like width, height, depth, and data types
3. It explains how to create, access, and release IplImage and CvMat structures
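As a hedged illustration of the IplImage layout the sections describe (interleaved pixel data in a flat row-major buffer, with a widthStep stride between rows), here is a small pure-Python mock; real code would use the OpenCV C/C++ API, and the dimensions below are arbitrary.

```python
class MockIplImage:
    """Toy stand-in for IplImage: interleaved 8-bit pixels in a flat
    buffer, with width_step bytes between the starts of adjacent rows
    (IplImage may pad rows, so width_step can exceed width * channels)."""

    def __init__(self, width, height, channels, width_step=None):
        self.width = width
        self.height = height
        self.channels = channels
        self.width_step = width_step or width * channels
        self.data = bytearray(self.width_step * height)

    def get_pixel(self, x, y):
        off = y * self.width_step + x * self.channels
        return tuple(self.data[off:off + self.channels])

    def set_pixel(self, x, y, value):
        off = y * self.width_step + x * self.channels
        self.data[off:off + self.channels] = bytes(value)

img = MockIplImage(width=3, height=2, channels=3)
img.set_pixel(1, 0, (10, 20, 30))
```

In Python the garbage collector reclaims the buffer; the release step the document mentions (cvReleaseImage) matters in C, where the allocation is manual.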
This document discusses half-precision floating point numbers and provides examples of using them on different platforms:
- Half-precision floats use 16 bits allowing floating point numbers to be stored in 2 bytes, compared to 32 bits for single-precision and 64 bits for double-precision.
- Examples are provided for ARM, ARM with SIMD instructions, x86 architectures using the F16C instruction set, and NVIDIA CUDA showing how to perform operations and conversions between half and single-precision floats.
- On ARM, specifying the FPU type is necessary to use hardware instructions for conversions between half and single-precision floats. SIMD and vector instructions further improve performance.
- The x86 F16C instruction set likewise provides dedicated hardware instructions for converting between half- and single-precision values.
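The 16-bit layout itself (1 sign bit, 5 exponent bits, 10 mantissa bits) can be demonstrated without any of the hardware paths above: since Python 3.6 the struct module understands the IEEE 754 half-precision format character 'e', so a software round-trip looks like this (the sample values are arbitrary):

```python
import struct

def float_to_half_bytes(x):
    # Pack a Python float into 2 bytes of IEEE 754 half precision.
    return struct.pack('<e', x)

def half_bytes_to_float(b):
    # Unpack 2 half-precision bytes back into a Python float.
    return struct.unpack('<e', b)[0]

b = float_to_half_bytes(1.5)      # 1.5 is exactly representable
x = half_bytes_to_float(b)

# Values needing more than 10 mantissa bits are rounded on the way in:
rounded = half_bytes_to_float(float_to_half_bytes(0.1))
```

The ARM and x86 instructions discussed above perform the same conversion in hardware; this sketch only shows the storage format and the precision loss.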
Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation
1. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation
Xinlei Chen(*), C. Lawrence Zitnick(**)
(*):Carnegie Mellon University
(**):Microsoft Research, Redmond
手島知昭 (@tomoaki_teshima)
5. Long short term memory
• Reuses past information without attenuating it
• As a result, the error signal neither diverges nor vanishes
http://www.slideshare.net/FujimotoKeisuke/learning-to-forget-continual-prediction-with-lstm
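The slide's point, that the gates let the cell carry past information forward without decay, which keeps gradients from vanishing, can be illustrated with a minimal hand-wired LSTM cell. The bias values below are chosen purely to saturate the gates for the demonstration; they are not learned weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, c_prev, f_bias, i_bias):
    """One LSTM cell update with scalar state and zeroed input weights,
    so the gates are driven only by their biases (for illustration)."""
    f = sigmoid(f_bias)            # forget gate
    i = sigmoid(i_bias)            # input gate
    g = math.tanh(0.0 * x)         # candidate value (weights zeroed)
    return f * c_prev + i * g      # new cell state

# With the forget gate saturated open and the input gate nearly
# closed, the cell state survives many steps almost unchanged.
c = 1.0
for _ in range(10):
    c = lstm_step(x=0.3, c_prev=c, f_bias=8.0, i_bias=-8.0)
```

In a plain RNN the same state would be multiplied by a weight matrix and squashed at every step, shrinking (or blowing up) geometrically; the additive cell update is what avoids that.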
14. Language Model
• 3,000-20,000 words
• Since the computational cost explodes, each word is assigned to a class
• Classes are formed by grouping words with similar occurrence frequencies
• Maximum Entropy language model
• preprocessing
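The class trick on this slide factors P(word | history) into P(class | history) · P(word | class, history), so each prediction scores only the classes plus one class's members rather than the full vocabulary. A hedged sketch of one way to group words into classes of roughly equal total frequency (the vocabulary and counts below are made up):

```python
def make_frequency_classes(word_counts, n_classes):
    """Greedily partition words into n_classes buckets of roughly
    equal total frequency, after sorting by descending count."""
    total = sum(word_counts.values())
    target = total / n_classes
    classes, current, mass = [], [], 0
    for word, count in sorted(word_counts.items(),
                              key=lambda kv: -kv[1]):
        current.append(word)
        mass += count
        if mass >= target and len(classes) < n_classes - 1:
            classes.append(current)
            current, mass = [], 0
    classes.append(current)
    return classes

counts = {'the': 50, 'a': 30, 'cat': 10, 'dog': 8, 'sat': 2}
classes = make_frequency_classes(counts, n_classes=2)
```

With a 20,000-word vocabulary split into ~140 classes of ~140 words each, the per-step normalization drops from 20,000 terms to roughly 280.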
W is the one-hot representation of a word.
S is the hidden layer that remembers the context.
V is the visual feature; these stay constant.
Connecting V directly to w is undesirable, because v is constant.
Also, connecting v to only half of the nodes in S gave better performance.
Trained on MS COCO, tested on PASCAL 1K.
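Reading those notes as an architecture, the recurrence is roughly s_t = σ(W·w_t + U·s_{t-1} + V·v), with the constant visual feature v wired to only half of the hidden units. A toy sketch, where the scalar "weights" and dimensions are placeholders rather than the trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(word_onehot, s_prev, v, w_scale=1.0, u_scale=0.5, v_scale=1.0):
    """One recurrent update with scalar weights for readability:
    every hidden unit sees the word and the previous state, but the
    constant visual feature v feeds only the first half of the units."""
    h = len(s_prev)
    s_new = []
    for k in range(h):
        pre = w_scale * sum(word_onehot) + u_scale * s_prev[k]
        if k < h // 2:             # visual feature: first half only
            pre += v_scale * v
        s_new.append(sigmoid(pre))
    return s_new

s = [0.0] * 4                      # 4 hidden units
v = 2.0                            # constant visual feature
s = step([0, 1, 0], s, v)
```

After one step the first two units (which see v) sit higher than the last two, showing how part of the hidden state specializes on the image while the rest tracks the word context.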
4.3. Sentence generation
Our first set of experiments evaluate our model's ability to generate novel sentence descriptions of images. We experiment on all the image-sentence datasets described previously and compare to the RNN baselines and other previous papers [33, 24]. Since PASCAL 1K has a limited amount of training data, we report results trained on MS COCO and tested on PASCAL 1K.
Human denotes sentences written by people.
Among the metrics, PPL (perplexity) expresses the number of bits needed to encode the generated sentence against the original one.
BLEU is computed by trying n-grams from 1 to 4, averaging the scores, and comparing against the reference whose length is closest.
For BLEU, we took the geometric mean of the scores from 1-gram to 4-gram, and used the ground truth length closest to the generated sentence to penalize brevity.
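The BLEU computation quoted above (geometric mean of 1- to 4-gram precisions with a brevity penalty) can be sketched as follows. The smoothing constant `eps` is an ad-hoc choice to keep the logarithm defined when an n-gram order has no matches; it is not part of the original metric.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4, eps=1e-9):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty that punishes candidates shorter than the reference."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_prec += math.log(max(overlap, eps) / total)
    geo = math.exp(log_prec / max_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * geo

hyp = 'a cat sat on the mat'.split()
ref = 'a cat sat on the mat'.split()
score = bleu(hyp, ref)
```

A perfect match scores 1.0; a candidate much shorter than the reference is pulled down both by the missing higher-order n-grams and by the exponential brevity penalty.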