SlideShare a Scribd company logo
Knowing When to Look:
Adaptive Attention via A Visual Sentinel for Image Captioning
참고자료
“Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning”,
Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher
2017. 1.
김홍배
한국항공우주연구원
Image Captioning
 1장의 스틸사진으로 부터 설명문(Caption)을 생성
 자동번역등에 사용되는 Recurrent Neural Networks (RNN)에 Deep
Convolutional Neural Networks에서 생성한 이미지의 특징벡터를
입력
 Neural Image Caption (NIC)
Example : Neural Image Caption (NIC)
4
기존 기술과의 차이
 일반적으로 생성된 각각의 단어와 연결된 이미지 영역의 공간지도를
만드는 Attention Mechanism을 사용
 따라서 시각신호의 중요성에 관련없이 Attention model이
매번 각 이미지에 적용됨
 이는 자원의 낭비뿐만 아니라 비시각적 단어(a, of, on, and top, etc)
를 사용한 훈련은 설명문 생성에서 성능저하를 유발
 본 연구에서는 Hidden layer 위에 추가적인 시각중지 벡터
(visual sentinel vector)를 갖는 LSTM의 확장형을 사용.
 시각신호로부터 필요시 언어모델로 전환이 가능한
Adaptive attention encoder-decoder framework
 이모델은 “white”, “bird”, “stop,”과 같은 시각적 단어에 대해서는
좀 더 이미지에 집중하고, “top”, “of”, “on.”의 경우에는 시각중지를 사용
5
기존 기술과의 차이
Atten : 새로 추가된 모듈
vi : spatial image features St : visual sentinel vector
6
Encoder-Decoder for Image Captioning
 사진(I)를 입력으로 주었을 때
 정답 “설명문“, y를 만들어 낼 가능성을 최대가 되도록
 학습데이터(I, y)를 이용하여
 넷의 변수(ϴ)들을 찾아내는 과정
설명문
ϴ∗ = argmin 𝐼,𝑦 log ‫(݌‬y|I;ϴ)
ϴ 사진, 변수
확률
손실함수
전체 학습데이터 셋에 대한 손실함수
손실함수를 최소화 시키는 변수, ϴ*를 구하는 작업
7
log 𝑝 𝑦 𝐼; ϴ =
𝑡=0
𝑁
log 𝑝 𝑦𝑡 𝐼, 𝑦0, 𝑦1,···, 𝑦𝑡−1; ϴ
학습 데이터 셋(I,y)로 부터 훈련을 통해 찾아내는 변수
log likelihood of the joint probability distribution
Encoder-Decoder for Image Captioning
Recurrent neural network (RNN)을 사용한 encoder-decoder
framework에서는 개개 conditional probability는 다음과 같이 정의
log 𝑝 𝑦𝑡 𝐼, 𝑦0, 𝑦1,···, 𝑦𝑡−1 = 𝑓 ℎ 𝑡, 𝑐𝑡
8
log 𝑝 𝑦𝑡 𝐼, 𝑦0, 𝑦1,···, 𝑦𝑡−1 = 𝑓 ℎ 𝑡, 𝑐𝑡
Encoder-Decoder for Image Captioning
Visual context vector @ time tHidden state @ time t
𝑦𝑡의 확률을 출력하는 비선형 함수(eg. Softmax)
ℎ 𝑡 : Long-Short Term Memory (LSTM)인 경우
ℎ 𝑡 = LSTM(𝑥 𝑡; ℎ 𝑡−1; 𝑚 𝑡−1)
𝑚 𝑡 ∶ memory cell vector at time t-1
9
Encoder-Decoder for Image Captioning
Spatial Attention Model
최종 목표인 Adaptive Attention Model에 들어가기 전
Spatial Attention Model에 대해 간단하게 알아봅시다.
1. Vanilla encoder-decoder frameworks
- 𝑐𝑡 는 encoder인 CNN에만 의존
- 입력 이미지, I가 CNN에 들어가서 마지막 FCL에서 global image feature 를 뽑아냄.
- 단어생성과정에서 𝑐𝑡는 decoder의 hidden state에 상관없이 일정함
2. Attention based encoder-decoder frameworks
- 𝑐𝑡 는 encoder & decoder모두에 의존
- 시간 t에서 hidden state의 상태에 따라 decoder가 이미지의 특정 영역으로 부터의
spatial image features를 이용하여 𝑐𝑡를 계산
context vector 𝑐𝑡 를 계산할 때, 다음과 같이 두개의 접근방법이 있음.
10
Encoder-Decoder for Image Captioning
𝑐𝑡 = g(V; ℎ 𝑡)
context vector 𝑐𝑡는 𝑒𝑛𝑐𝑜𝑑𝑒𝑟출력 image features 과 𝑑𝑒𝑐𝑜𝑑𝑒𝑟출력
(𝐻𝑖𝑑𝑑𝑒𝑛 𝑠𝑡𝑎𝑡𝑒)의 함수
attention function
𝑉 = 𝑣1 … 𝑣 𝑘 , 𝑣 𝑘 ∈ 𝑅 𝑑
, V ∈ 𝑅 𝑑𝑥𝑘
,
ℎ 𝑡∈ 𝑅 𝑑
𝑣 𝑘 는 the spatial image features
ℎ 𝑡 는 LSTM의 hidden state @ time t
𝑣 𝑘 구하는 방법은 뒤에 계산방법이 자세히 나옴
Spatial Attention Model
11
Encoder-Decoder for Image Captioning
Spatial Attention Model
우선 이미지상의 k개의 영역에 대한 attention분포, α 𝑡를 계산
𝑧𝑡=𝑤ℎ
𝑇
𝑡𝑎𝑛ℎ 𝑊𝑣 𝑉 + (𝑊𝑔ℎ 𝑡)ℾ 𝑇
α 𝑡=softmax (𝑧𝑡)
ℾ ∈ 𝑅 𝑘
is a vector with all elements set to 1
𝑊𝑣, 𝑊𝑔 ∈ 𝑅 𝑘𝑥𝑑
and 𝑊ℎ ∈ 𝑅 𝑘
context vector 𝑐𝑡는 다음과 같이 정의
𝑐𝑡 =
𝑖=1
𝑘
𝛼 𝑡𝑖 𝑣 𝑡𝑖
12
Adaptive Attention Model
 spatial attention 기반의 decoders가 image captioning에 효과적이긴
하나, 시각적 신호와 언어모델 중 어느 쪽을 선택하여야 하는지는 결정
하지 못함.
 Visual sentinel gate(시각중지 게이트), β𝑖라는 새로운 게이트를 도입
vi : spatial image features St : visual sentinel vector
Encoder-Decoder for Image Captioning
13
β 𝑡를 이용하여 새로운 Adaptive context vector, 𝑐𝑡를 정의
vi : spatial image features St : visual sentinel vector
𝑐𝑡= β 𝑡 𝑠𝑡+(1-β 𝑡)𝑐𝑡 β 𝑡 는 0 ~ 1 사이의 상수를 생성
β 𝑡≈1 : 시각중지 정보만 사용  언어모델만 사용
β 𝑡≈0 : 공간영상 정보만 사용  시각정보만 사용
Adaptive Attention Model
Encoder-Decoder for Image Captioning
14
앞에서 언급한 이미지 상의 attention 분포, 𝛼 𝑡를 다음과 같이 수정
α 𝑡=softmax ( 𝑧𝑡; 𝑤ℎ
𝑇
𝑡𝑎𝑛ℎ 𝑊𝑠 𝑠𝑡 + 𝑊𝑔ℎ 𝑡 )
α 𝑡 ∈ 𝑅 𝑘+1로 spatial image feature(𝑅 𝑘)와 visual sentinel vector
에 대한 attention 분포를 나타냄.
여기서는 𝜶 𝒕 𝒌 + 𝟏 , 즉 𝜶 𝒕의 마지막 값을 𝜷 𝒕값으로 정의 함.
시간 t에서 가능한 단어들의 어휘에 대한 선택 확률은 다음과 같이 계산
𝑝𝑡=softmax (𝑊𝑝 𝑐𝑡 + ℎ 𝑡 )
Adaptive Attention Model
Encoder-Decoder for Image Captioning
15
Image Encoding
CNN출력으로부터 image feature vector 구하기
• 이미지에 대한 Encoding은 ResNet의 마지막 convolutional layer
의 spatial feature 출력(2,048x7x7)을 사용
• 이미지상의 k개 영역의 CNN의 spatial feature와의 관계(?)
Encoder-Decoder for Image Captioning
𝐴 = 𝑎1 … 𝑎 𝑘 , 𝑎𝑖 ∈ 𝑅2,048
Global image feature, 𝑎 𝑔는 다음과 같이 평균값으로 정의
𝑎 𝑔
=
1
𝑘
𝑖=1
𝑘
𝑎𝑖
16
Image Encoding
CNN출력으로부터 image feature vector 구하기
Encoder-Decoder for Image Captioning
𝑣𝑖=ReLU (𝑊𝑎 𝑎𝑖)
𝑎𝑖 로 부터 d차원의 𝑣𝑖로 변환하기
𝑣𝑔=ReLU ( 𝑊𝑏 𝑎 𝑔
)
입력벡터, 𝒙 𝒕
𝑥 𝑡 = 𝑤𝑡;𝑣𝑔
𝑤𝑡는 “word embedding vector”, 𝑣𝑔는 "global image feature vector”
실험결과
실험결과

More Related Content

What's hot

Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)
Abhimanyu Dwivedi
 
Color models in Digitel image processing
Color models in Digitel image processingColor models in Digitel image processing
Color models in Digitel image processing
Aryan Shivhare
 
two dimensional random variable
two dimensional random variabletwo dimensional random variable
two dimensional random variable
BOmebratu
 
Methods of point estimation
Methods of point estimationMethods of point estimation
Methods of point estimation
Suruchi Somwanshi
 
Classification of EEG Signals for Brain-Computer Interface
Classification of EEG Signals for Brain-Computer InterfaceClassification of EEG Signals for Brain-Computer Interface
Classification of EEG Signals for Brain-Computer Interface
Azoft
 
Bayesian Belief Network and its Applications.pptx
Bayesian Belief Network and its Applications.pptxBayesian Belief Network and its Applications.pptx
Bayesian Belief Network and its Applications.pptx
SamyakJain710491
 
Fuzzy logic control of washing m achines
Fuzzy logic control of washing m achinesFuzzy logic control of washing m achines
Fuzzy logic control of washing m achines
pradnya patil
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimationguestfee8698
 
Point Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsPoint Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsUniversity of Salerno
 
Journal Club: VQ-VAE2
Journal Club: VQ-VAE2Journal Club: VQ-VAE2
Journal Club: VQ-VAE2
Takuya Koumura
 
Run-Length Encoding algorithm
Run-Length Encoding algorithmRun-Length Encoding algorithm
Run-Length Encoding algorithm
Hyeon Sik Song
 
App inventor 2
App  inventor  2App  inventor  2
App inventor 2
ssuser0013fc
 
Monte carlo simulation
Monte carlo simulationMonte carlo simulation
Monte carlo simulationMissAnam
 
Probability and Random Variables
Probability and Random VariablesProbability and Random Variables
Probability and Random Variables
Subhobrata Banerjee
 
Utp va_sl4_procesamiento digital de imagenes con matlab iii
 Utp va_sl4_procesamiento digital de imagenes con matlab iii Utp va_sl4_procesamiento digital de imagenes con matlab iii
Utp va_sl4_procesamiento digital de imagenes con matlab iiijcbenitezp
 
EEG Based Classification of Emotions with CNN and RNN
EEG Based Classification of Emotions with CNN and RNNEEG Based Classification of Emotions with CNN and RNN
EEG Based Classification of Emotions with CNN and RNN
ijtsrd
 
Statistical inference: Estimation
Statistical inference: EstimationStatistical inference: Estimation
Statistical inference: Estimation
Parag Shah
 
First order predicate logic(fopl)
First order predicate logic(fopl)First order predicate logic(fopl)
First order predicate logic(fopl)
surbhi jha
 
Multimedia fundamental concepts in video
Multimedia fundamental concepts in videoMultimedia fundamental concepts in video
Multimedia fundamental concepts in video
Mazin Alwaaly
 
Artificial Neural Networks
Artificial Neural NetworksArtificial Neural Networks
Artificial Neural Networks
Arslan Zulfiqar
 

What's hot (20)

Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)
 
Color models in Digitel image processing
Color models in Digitel image processingColor models in Digitel image processing
Color models in Digitel image processing
 
two dimensional random variable
two dimensional random variabletwo dimensional random variable
two dimensional random variable
 
Methods of point estimation
Methods of point estimationMethods of point estimation
Methods of point estimation
 
Classification of EEG Signals for Brain-Computer Interface
Classification of EEG Signals for Brain-Computer InterfaceClassification of EEG Signals for Brain-Computer Interface
Classification of EEG Signals for Brain-Computer Interface
 
Bayesian Belief Network and its Applications.pptx
Bayesian Belief Network and its Applications.pptxBayesian Belief Network and its Applications.pptx
Bayesian Belief Network and its Applications.pptx
 
Fuzzy logic control of washing m achines
Fuzzy logic control of washing m achinesFuzzy logic control of washing m achines
Fuzzy logic control of washing m achines
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimation
 
Point Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsPoint Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis tests
 
Journal Club: VQ-VAE2
Journal Club: VQ-VAE2Journal Club: VQ-VAE2
Journal Club: VQ-VAE2
 
Run-Length Encoding algorithm
Run-Length Encoding algorithmRun-Length Encoding algorithm
Run-Length Encoding algorithm
 
App inventor 2
App  inventor  2App  inventor  2
App inventor 2
 
Monte carlo simulation
Monte carlo simulationMonte carlo simulation
Monte carlo simulation
 
Probability and Random Variables
Probability and Random VariablesProbability and Random Variables
Probability and Random Variables
 
Utp va_sl4_procesamiento digital de imagenes con matlab iii
 Utp va_sl4_procesamiento digital de imagenes con matlab iii Utp va_sl4_procesamiento digital de imagenes con matlab iii
Utp va_sl4_procesamiento digital de imagenes con matlab iii
 
EEG Based Classification of Emotions with CNN and RNN
EEG Based Classification of Emotions with CNN and RNNEEG Based Classification of Emotions with CNN and RNN
EEG Based Classification of Emotions with CNN and RNN
 
Statistical inference: Estimation
Statistical inference: EstimationStatistical inference: Estimation
Statistical inference: Estimation
 
First order predicate logic(fopl)
First order predicate logic(fopl)First order predicate logic(fopl)
First order predicate logic(fopl)
 
Multimedia fundamental concepts in video
Multimedia fundamental concepts in videoMultimedia fundamental concepts in video
Multimedia fundamental concepts in video
 
Artificial Neural Networks
Artificial Neural NetworksArtificial Neural Networks
Artificial Neural Networks
 

Viewers also liked

MNIST for ML beginners
MNIST for ML beginnersMNIST for ML beginners
MNIST for ML beginners
홍배 김
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
홍배 김
 
Binarized CNN on FPGA
Binarized CNN on FPGABinarized CNN on FPGA
Binarized CNN on FPGA
홍배 김
 
Convolution 종류 설명
Convolution 종류 설명Convolution 종류 설명
Convolution 종류 설명
홍배 김
 
Meta-Learning with Memory Augmented Neural Networks
Meta-Learning with Memory Augmented Neural NetworksMeta-Learning with Memory Augmented Neural Networks
Meta-Learning with Memory Augmented Neural Networks
홍배 김
 
Normalization 방법
Normalization 방법 Normalization 방법
Normalization 방법
홍배 김
 
Learning to remember rare events
Learning to remember rare eventsLearning to remember rare events
Learning to remember rare events
홍배 김
 
Explanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expertExplanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expert
홍배 김
 
A neural image caption generator
A neural image caption generatorA neural image caption generator
A neural image caption generator
홍배 김
 
Learning by association
Learning by associationLearning by association
Learning by association
홍배 김
 
알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder
홍배 김
 
Visualizing data using t-SNE
Visualizing data using t-SNEVisualizing data using t-SNE
Visualizing data using t-SNE
홍배 김
 
Single Shot MultiBox Detector와 Recurrent Instance Segmentation
Single Shot MultiBox Detector와 Recurrent Instance SegmentationSingle Shot MultiBox Detector와 Recurrent Instance Segmentation
Single Shot MultiBox Detector와 Recurrent Instance Segmentation
홍배 김
 
Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)
홍배 김
 
머신러닝의 자연어 처리기술(I)
머신러닝의 자연어 처리기술(I)머신러닝의 자연어 처리기술(I)
머신러닝의 자연어 처리기술(I)
홍배 김
 
딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향
홍배 김
 
Q Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object LocalizationQ Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object Localization
홍배 김
 

Viewers also liked (17)

MNIST for ML beginners
MNIST for ML beginnersMNIST for ML beginners
MNIST for ML beginners
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
 
Binarized CNN on FPGA
Binarized CNN on FPGABinarized CNN on FPGA
Binarized CNN on FPGA
 
Convolution 종류 설명
Convolution 종류 설명Convolution 종류 설명
Convolution 종류 설명
 
Meta-Learning with Memory Augmented Neural Networks
Meta-Learning with Memory Augmented Neural NetworksMeta-Learning with Memory Augmented Neural Networks
Meta-Learning with Memory Augmented Neural Networks
 
Normalization 방법
Normalization 방법 Normalization 방법
Normalization 방법
 
Learning to remember rare events
Learning to remember rare eventsLearning to remember rare events
Learning to remember rare events
 
Explanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expertExplanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expert
 
A neural image caption generator
A neural image caption generatorA neural image caption generator
A neural image caption generator
 
Learning by association
Learning by associationLearning by association
Learning by association
 
알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder
 
Visualizing data using t-SNE
Visualizing data using t-SNEVisualizing data using t-SNE
Visualizing data using t-SNE
 
Single Shot MultiBox Detector와 Recurrent Instance Segmentation
Single Shot MultiBox Detector와 Recurrent Instance SegmentationSingle Shot MultiBox Detector와 Recurrent Instance Segmentation
Single Shot MultiBox Detector와 Recurrent Instance Segmentation
 
Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)Focal loss의 응용(Detection & Classification)
Focal loss의 응용(Detection & Classification)
 
머신러닝의 자연어 처리기술(I)
머신러닝의 자연어 처리기술(I)머신러닝의 자연어 처리기술(I)
머신러닝의 자연어 처리기술(I)
 
딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향
 
Q Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object LocalizationQ Learning과 CNN을 이용한 Object Localization
Q Learning과 CNN을 이용한 Object Localization
 

Similar to Knowing when to look : Adaptive Attention via A Visual Sentinel for Image Captioning

[Paper Review] Image captioning with semantic attention
[Paper Review] Image captioning with semantic attention[Paper Review] Image captioning with semantic attention
[Paper Review] Image captioning with semantic attention
Hyeongmin Lee
 
전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012
전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012
전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012devCAT Studio, NEXON
 
Mt
MtMt
I3D and Kinetics datasets (Action Recognition)
I3D and Kinetics datasets (Action Recognition)I3D and Kinetics datasets (Action Recognition)
I3D and Kinetics datasets (Action Recognition)
Susang Kim
 
A Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding AutoencoderA Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding Autoencoder
Lee Seungeun
 
FCN to DeepLab.v3+
FCN to DeepLab.v3+FCN to DeepLab.v3+
FCN to DeepLab.v3+
Whi Kwon
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection
창기 문
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection
창기 문
 
Modern gpu optimize blog
Modern gpu optimize blogModern gpu optimize blog
Modern gpu optimize blog
ozlael ozlael
 
Modern gpu optimize
Modern gpu optimizeModern gpu optimize
Modern gpu optimize
ozlael ozlael
 
Latest Frame interpolation Algorithms
Latest Frame interpolation AlgorithmsLatest Frame interpolation Algorithms
Latest Frame interpolation Algorithms
Hyeongmin Lee
 
210801 hierarchical long term video frame prediction without supervision
210801 hierarchical long term video frame prediction without supervision210801 hierarchical long term video frame prediction without supervision
210801 hierarchical long term video frame prediction without supervision
taeseon ryu
 
[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shading[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shadingMinGeun Park
 
이정근_project_로봇비전시스템.pdf
이정근_project_로봇비전시스템.pdf이정근_project_로봇비전시스템.pdf
이정근_project_로봇비전시스템.pdf
tangtang1026
 
Achieving human parity on visual question answering alicemind
Achieving human parity on visual question answering alicemindAchieving human parity on visual question answering alicemind
Achieving human parity on visual question answering alicemind
taeseon ryu
 
Animal iris recognition
Animal iris recognitionAnimal iris recognition
Animal iris recognition
진성 정
 
15.ai term project_final
15.ai term project_final15.ai term project_final
15.ai term project_final호상 장
 
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...
Gyubin Son
 
[Paper] shuffle net an extremely efficient convolutional neural network for ...
[Paper] shuffle net  an extremely efficient convolutional neural network for ...[Paper] shuffle net  an extremely efficient convolutional neural network for ...
[Paper] shuffle net an extremely efficient convolutional neural network for ...
Susang Kim
 
[0326 박민근] deferred shading
[0326 박민근] deferred shading[0326 박민근] deferred shading
[0326 박민근] deferred shadingMinGeun Park
 

Similar to Knowing when to look : Adaptive Attention via A Visual Sentinel for Image Captioning (20)

[Paper Review] Image captioning with semantic attention
[Paper Review] Image captioning with semantic attention[Paper Review] Image captioning with semantic attention
[Paper Review] Image captioning with semantic attention
 
전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012
전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012
전형규, 가성비 좋은 렌더링 테크닉 10선, NDC2012
 
Mt
MtMt
Mt
 
I3D and Kinetics datasets (Action Recognition)
I3D and Kinetics datasets (Action Recognition)I3D and Kinetics datasets (Action Recognition)
I3D and Kinetics datasets (Action Recognition)
 
A Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding AutoencoderA Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding Autoencoder
 
FCN to DeepLab.v3+
FCN to DeepLab.v3+FCN to DeepLab.v3+
FCN to DeepLab.v3+
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection
 
Modern gpu optimize blog
Modern gpu optimize blogModern gpu optimize blog
Modern gpu optimize blog
 
Modern gpu optimize
Modern gpu optimizeModern gpu optimize
Modern gpu optimize
 
Latest Frame interpolation Algorithms
Latest Frame interpolation AlgorithmsLatest Frame interpolation Algorithms
Latest Frame interpolation Algorithms
 
210801 hierarchical long term video frame prediction without supervision
210801 hierarchical long term video frame prediction without supervision210801 hierarchical long term video frame prediction without supervision
210801 hierarchical long term video frame prediction without supervision
 
[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shading[Ndc11 박민근] deferred shading
[Ndc11 박민근] deferred shading
 
이정근_project_로봇비전시스템.pdf
이정근_project_로봇비전시스템.pdf이정근_project_로봇비전시스템.pdf
이정근_project_로봇비전시스템.pdf
 
Achieving human parity on visual question answering alicemind
Achieving human parity on visual question answering alicemindAchieving human parity on visual question answering alicemind
Achieving human parity on visual question answering alicemind
 
Animal iris recognition
Animal iris recognitionAnimal iris recognition
Animal iris recognition
 
15.ai term project_final
15.ai term project_final15.ai term project_final
15.ai term project_final
 
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ...
 
[Paper] shuffle net an extremely efficient convolutional neural network for ...
[Paper] shuffle net  an extremely efficient convolutional neural network for ...[Paper] shuffle net  an extremely efficient convolutional neural network for ...
[Paper] shuffle net an extremely efficient convolutional neural network for ...
 
[0326 박민근] deferred shading
[0326 박민근] deferred shading[0326 박민근] deferred shading
[0326 박민근] deferred shading
 

More from 홍배 김

Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...
Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...
Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...
홍배 김
 
Gaussian processing
Gaussian processingGaussian processing
Gaussian processing
홍배 김
 
Lecture Summary : Camera Projection
Lecture Summary : Camera Projection Lecture Summary : Camera Projection
Lecture Summary : Camera Projection
홍배 김
 
Learning agile and dynamic motor skills for legged robots
Learning agile and dynamic motor skills for legged robotsLearning agile and dynamic motor skills for legged robots
Learning agile and dynamic motor skills for legged robots
홍배 김
 
Robotics of Quadruped Robot
Robotics of Quadruped RobotRobotics of Quadruped Robot
Robotics of Quadruped Robot
홍배 김
 
Basics of Robotics
Basics of RoboticsBasics of Robotics
Basics of Robotics
홍배 김
 
Recurrent Neural Net의 이론과 설명
Recurrent Neural Net의 이론과 설명Recurrent Neural Net의 이론과 설명
Recurrent Neural Net의 이론과 설명
홍배 김
 
Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용
홍배 김
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier
홍배 김
 
Optimal real-time landing using DNN
Optimal real-time landing using DNNOptimal real-time landing using DNN
Optimal real-time landing using DNN
홍배 김
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
홍배 김
 
Machine learning applications in aerospace domain
Machine learning applications in aerospace domainMachine learning applications in aerospace domain
Machine learning applications in aerospace domain
홍배 김
 
Anomaly Detection and Localization Using GAN and One-Class Classifier
Anomaly Detection and Localization  Using GAN and One-Class ClassifierAnomaly Detection and Localization  Using GAN and One-Class Classifier
Anomaly Detection and Localization Using GAN and One-Class Classifier
홍배 김
 
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
홍배 김
 
Brief intro : Invariance and Equivariance
Brief intro : Invariance and EquivarianceBrief intro : Invariance and Equivariance
Brief intro : Invariance and Equivariance
홍배 김
 
Anomaly Detection with GANs
Anomaly Detection with GANsAnomaly Detection with GANs
Anomaly Detection with GANs
홍배 김
 

More from 홍배 김 (16)

Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...
Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...
Automatic Gain Tuning based on Gaussian Process Global Optimization (= Bayesi...
 
Gaussian processing
Gaussian processingGaussian processing
Gaussian processing
 
Lecture Summary : Camera Projection
Lecture Summary : Camera Projection Lecture Summary : Camera Projection
Lecture Summary : Camera Projection
 
Learning agile and dynamic motor skills for legged robots
Learning agile and dynamic motor skills for legged robotsLearning agile and dynamic motor skills for legged robots
Learning agile and dynamic motor skills for legged robots
 
Robotics of Quadruped Robot
Robotics of Quadruped RobotRobotics of Quadruped Robot
Robotics of Quadruped Robot
 
Basics of Robotics
Basics of RoboticsBasics of Robotics
Basics of Robotics
 
Recurrent Neural Net의 이론과 설명
Recurrent Neural Net의 이론과 설명Recurrent Neural Net의 이론과 설명
Recurrent Neural Net의 이론과 설명
 
Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용Convolutional neural networks 이론과 응용
Convolutional neural networks 이론과 응용
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier
 
Optimal real-time landing using DNN
Optimal real-time landing using DNNOptimal real-time landing using DNN
Optimal real-time landing using DNN
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
Machine learning applications in aerospace domain
Machine learning applications in aerospace domainMachine learning applications in aerospace domain
Machine learning applications in aerospace domain
 
Anomaly Detection and Localization Using GAN and One-Class Classifier
Anomaly Detection and Localization  Using GAN and One-Class ClassifierAnomaly Detection and Localization  Using GAN and One-Class Classifier
Anomaly Detection and Localization Using GAN and One-Class Classifier
 
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
ARCHITECTURAL CONDITIONING FOR DISENTANGLEMENT OF OBJECT IDENTITY AND POSTURE...
 
Brief intro : Invariance and Equivariance
Brief intro : Invariance and EquivarianceBrief intro : Invariance and Equivariance
Brief intro : Invariance and Equivariance
 
Anomaly Detection with GANs
Anomaly Detection with GANsAnomaly Detection with GANs
Anomaly Detection with GANs
 

Knowing when to look : Adaptive Attention via A Visual Sentinel for Image Captioning

  • 1. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning 참고자료 “Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning”, Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher 2017. 1. 김홍배 한국항공우주연구원
  • 2. Image Captioning  1장의 스틸사진으로 부터 설명문(Caption)을 생성  자동번역등에 사용되는 Recurrent Neural Networks (RNN)에 Deep Convolutional Neural Networks에서 생성한 이미지의 특징벡터를 입력  Neural Image Caption (NIC)
  • 3. Example : Neural Image Caption (NIC)
  • 4. 4 기존 기술과의 차이  일반적으로 생성된 각각의 단어와 연결된 이미지 영역의 공간지도를 만드는 Attention Mechanism을 사용  따라서 시각신호의 중요성에 관련없이 Attention model이 매번 각 이미지에 적용됨  이는 자원의 낭비뿐만 아니라 비시각적 단어(a, of, on, and top, etc) 를 사용한 훈련은 설명문 생성에서 성능저하를 유발  본 연구에서는 Hidden layer 위에 추가적인 시각중지 벡터 (visual sentinel vector)를 갖는 LSTM의 확장형을 사용.  시각신호로부터 필요시 언어모델로 전환이 가능한 Adaptive attention encoder-decoder framework  이모델은 “white”, “bird”, “stop,”과 같은 시각적 단어에 대해서는 좀 더 이미지에 집중하고, “top”, “of”, “on.”의 경우에는 시각중지를 사용
  • 5. 5 기존 기술과의 차이 Atten : 새로 추가된 모듈 vi : spatial image features St : visual sentinel vector
  • 6. 6 Encoder-Decoder for Image Captioning  사진(I)를 입력으로 주었을 때  정답 “설명문“, y를 만들어 낼 가능성을 최대가 되도록  학습데이터(I, y)를 이용하여  넷의 변수(ϴ)들을 찾아내는 과정 설명문 ϴ∗ = argmin 𝐼,𝑦 log ‫(݌‬y|I;ϴ) ϴ 사진, 변수 확률 손실함수 전체 학습데이터 셋에 대한 손실함수 손실함수를 최소화 시키는 변수, ϴ*를 구하는 작업
  • 7. 7 log 𝑝 𝑦 𝐼; ϴ = 𝑡=0 𝑁 log 𝑝 𝑦𝑡 𝐼, 𝑦0, 𝑦1,···, 𝑦𝑡−1; ϴ 학습 데이터 셋(I,y)로 부터 훈련을 통해 찾아내는 변수 log likelihood of the joint probability distribution Encoder-Decoder for Image Captioning Recurrent neural network (RNN)을 사용한 encoder-decoder framework에서는 개개 conditional probability는 다음과 같이 정의 log 𝑝 𝑦𝑡 𝐼, 𝑦0, 𝑦1,···, 𝑦𝑡−1 = 𝑓 ℎ 𝑡, 𝑐𝑡
  • 8. 8 log 𝑝 𝑦𝑡 𝐼, 𝑦0, 𝑦1,···, 𝑦𝑡−1 = 𝑓 ℎ 𝑡, 𝑐𝑡 Encoder-Decoder for Image Captioning Visual context vector @ time tHidden state @ time t 𝑦𝑡의 확률을 출력하는 비선형 함수(eg. Softmax) ℎ 𝑡 : Long-Short Term Memory (LSTM)인 경우 ℎ 𝑡 = LSTM(𝑥 𝑡; ℎ 𝑡−1; 𝑚 𝑡−1) 𝑚 𝑡 ∶ memory cell vector at time t-1
  • 9. 9 Encoder-Decoder for Image Captioning Spatial Attention Model 최종 목표인 Adaptive Attention Model에 들어가기 전 Spatial Attention Model에 대해 간단하게 알아봅시다. 1. Vanilla encoder-decoder frameworks - 𝑐𝑡 는 encoder인 CNN에만 의존 - 입력 이미지, I가 CNN에 들어가서 마지막 FCL에서 global image feature 를 뽑아냄. - 단어생성과정에서 𝑐𝑡는 decoder의 hidden state에 상관없이 일정함 2. Attention based encoder-decoder frameworks - 𝑐𝑡 는 encoder & decoder모두에 의존 - 시간 t에서 hidden state의 상태에 따라 decoder가 이미지의 특정 영역으로 부터의 spatial image features를 이용하여 𝑐𝑡를 계산 context vector 𝑐𝑡 를 계산할 때, 다음과 같이 두개의 접근방법이 있음.
  • 10. 10 Encoder-Decoder for Image Captioning 𝑐𝑡 = g(V; ℎ 𝑡) context vector 𝑐𝑡는 𝑒𝑛𝑐𝑜𝑑𝑒𝑟출력 image features 과 𝑑𝑒𝑐𝑜𝑑𝑒𝑟출력 (𝐻𝑖𝑑𝑑𝑒𝑛 𝑠𝑡𝑎𝑡𝑒)의 함수 attention function 𝑉 = 𝑣1 … 𝑣 𝑘 , 𝑣 𝑘 ∈ 𝑅 𝑑 , V ∈ 𝑅 𝑑𝑥𝑘 , ℎ 𝑡∈ 𝑅 𝑑 𝑣 𝑘 는 the spatial image features ℎ 𝑡 는 LSTM의 hidden state @ time t 𝑣 𝑘 구하는 방법은 뒤에 계산방법이 자세히 나옴 Spatial Attention Model
  • 11. 11 Encoder-Decoder for Image Captioning Spatial Attention Model 우선 이미지상의 k개의 영역에 대한 attention분포, α 𝑡를 계산 𝑧𝑡=𝑤ℎ 𝑇 𝑡𝑎𝑛ℎ 𝑊𝑣 𝑉 + (𝑊𝑔ℎ 𝑡)ℾ 𝑇 α 𝑡=softmax (𝑧𝑡) ℾ ∈ 𝑅 𝑘 is a vector with all elements set to 1 𝑊𝑣, 𝑊𝑔 ∈ 𝑅 𝑘𝑥𝑑 and 𝑊ℎ ∈ 𝑅 𝑘 context vector 𝑐𝑡는 다음과 같이 정의 𝑐𝑡 = 𝑖=1 𝑘 𝛼 𝑡𝑖 𝑣 𝑡𝑖
  • 12. 12 Adaptive Attention Model  spatial attention 기반의 decoders가 image captioning에 효과적이긴 하나, 시각적 신호와 언어모델 중 어느 쪽을 선택하여야 하는지는 결정 하지 못함.  Visual sentinel gate(시각중지 게이트), β𝑖라는 새로운 게이트를 도입 vi : spatial image features St : visual sentinel vector Encoder-Decoder for Image Captioning
  • 13. 13 β 𝑡를 이용하여 새로운 Adaptive context vector, 𝑐𝑡를 정의 vi : spatial image features St : visual sentinel vector 𝑐𝑡= β 𝑡 𝑠𝑡+(1-β 𝑡)𝑐𝑡 β 𝑡 는 0 ~ 1 사이의 상수를 생성 β 𝑡≈1 : 시각중지 정보만 사용  언어모델만 사용 β 𝑡≈0 : 공간영상 정보만 사용  시각정보만 사용 Adaptive Attention Model Encoder-Decoder for Image Captioning
  • 14. 14 앞에서 언급한 이미지 상의 attention 분포, 𝛼 𝑡를 다음과 같이 수정 α 𝑡=softmax ( 𝑧𝑡; 𝑤ℎ 𝑇 𝑡𝑎𝑛ℎ 𝑊𝑠 𝑠𝑡 + 𝑊𝑔ℎ 𝑡 ) α 𝑡 ∈ 𝑅 𝑘+1로 spatial image feature(𝑅 𝑘)와 visual sentinel vector 에 대한 attention 분포를 나타냄. 여기서는 𝜶 𝒕 𝒌 + 𝟏 , 즉 𝜶 𝒕의 마지막 값을 𝜷 𝒕값으로 정의 함. 시간 t에서 가능한 단어들의 어휘에 대한 선택 확률은 다음과 같이 계산 𝑝𝑡=softmax (𝑊𝑝 𝑐𝑡 + ℎ 𝑡 ) Adaptive Attention Model Encoder-Decoder for Image Captioning
  • 15. 15 Image Encoding CNN출력으로부터 image feature vector 구하기 • 이미지에 대한 Encoding은 ResNet의 마지막 convolutional layer 의 spatial feature 출력(2,048x7x7)을 사용 • 이미지상의 k개 영역의 CNN의 spatial feature와의 관계(?) Encoder-Decoder for Image Captioning 𝐴 = 𝑎1 … 𝑎 𝑘 , 𝑎𝑖 ∈ 𝑅2,048 Global image feature, 𝑎 𝑔는 다음과 같이 평균값으로 정의 𝑎 𝑔 = 1 𝑘 𝑖=1 𝑘 𝑎𝑖
  • 16. 16 Image Encoding CNN출력으로부터 image feature vector 구하기 Encoder-Decoder for Image Captioning 𝑣𝑖=ReLU (𝑊𝑎 𝑎𝑖) 𝑎𝑖 로 부터 d차원의 𝑣𝑖로 변환하기 𝑣𝑔=ReLU ( 𝑊𝑏 𝑎 𝑔 ) 입력벡터, 𝒙 𝒕 𝑥 𝑡 = 𝑤𝑡;𝑣𝑔 𝑤𝑡는 “word embedding vector”, 𝑣𝑔는 "global image feature vector”