SlideShare a Scribd company logo
1 of 29
Download to read offline
Fluency-guided Cross-lingual
Image Captioning
Weiyu Lan1, Xirong Li1, Jianfeng Dong2
1 Renmin University of China
2 Zhejiang University
1
A person holding a book with a bird sitting on the book.
一个人拿着一本书,有一只小鸟站在上面
Una persona sostiene un libro con un pájaro que se sienta sobre el libro
책에 앉아 있는 새와 함께 책을 들고 있는 사람
Image Captioning
2
Cross lingual image captioning
• Goal: To generate relevant and fluent captions in a target language with
minimal human effort
3
Related Work
• Monolingual image captioning
• Deep learning: Encoder + Decoder (CNN+RNN)
• Cross-lingual image captioning
• Generate captions base on both image and captions in source language
(Elliott et al., 2015)
• Crowd sourcing to collect Japanese descriptions of the MSCOCO (Miyazaki
and Shimizu, 2016)
• Machine Translation (Li, ICMR2016)
4
Related Work
• Cross-lingual image captioning
• Machine Translation (Li, ICMR2016)
A person holding a book with a bird sitting on the book
Captioning
Model
一个人拿着一本书,
有一只小鸟站在上面
Machine
translation
Chinese
Captioning
Model
×
√
5
translated
Chinese
sentences
Machine-translated sentences are not fluent
A person holding a book with a bird sitting on the book.
拿着一本书和一只鸟坐在书上的人
Una persona que sostiene un libro con un pájaro sentado en el libro.
책 한 권을 쥐고 한 사람 한 마리가 책 위에 앉아 있다.
?
6
Our Approach
• Fluency-guided cross-lingualimage captioning
Not Fluent
Fluent
Estimated
Fluency
(A person holding a book with a bird sitting on the book)
拿着一本书和一只鸟坐在书上的人
(A small bird sitting on top of an open book)
一只小鸟坐在一本打开的书上
Captioning
Model
Image
Sentence
7
Sentence Fluency Estimation
• Binary classification
• Manual annotation
• 8k sentences: fluent/not
fluent
• Less than 30% sentences are
fluent
• LSTM based model
(A person holding a book
with a bird sitting on the book)
拿着一本书和一只鸟坐在
书上的人
LSTM LSTM LSTM LSTM
Word Embedding
拿着 一本 书 人
output
p(fluent)
Fluency Estimation Model
8
A four-way LSTM based classifier
拿着一本书和一只鸟坐在书上的人
A person holding a book with a bird sitting …
拿:v 着:uzhe 一:m 本:q 书:n 和:c 一:m 只:q
鸟:n 坐:v 在:p 书:n 上:f 的:ude 人:n
a:DT person:NN holding:VBG a:DT
book:NN with:IN a:DT bird:NN …
Sentence
POS
Fluency
Score
Fluency Estimation Model
Fluency Estimation Model
Fluency Estimation Model
Fluency Estimation Model
!
"
∑
9
Sentence Fluency Estimation Results
English sentences Chinese sentences Estimated fluency
scores
The two large elephants are
standing in the grass
两只大象正站在草地上 0.803
The young man in the blue shirt is
playing tennis
穿蓝色衬衫的年轻人正在打网球 0.624
A group of people riding skis in
their bathing suits
一群人在他们的沐浴骑滑雪服 0.117
A sports arena under a dome with
snow on it
一个体育馆下一个圆顶下的雪在它 0.060
10
Image Captioning Model
• CNN + RNN framework [Vinyals, CVPR2015]
• Training loss is the sum of the negative log likelihoods of the next correct word at
each step.
CNN LSTM LSTM LSTM LSTM LSTM
Visual
Embedding
Word Embedding
<start> 拿着 一本 人
output
p(w1) p(w2) p(w3) p(wn)
11
Fluency-Guided Training
12
Fluency-Guided Training
Strategy I: Fluency only
Captioning
Model
(A person holding a book with a bird sitting on the book)
拿着一本书和一只鸟坐在书上的人
(A small bird sitting on top of an open book)
一只小鸟坐在一本打开的书上
0.9
Fluent
0.2
Not Fluent
13
• Allow the sentences classified as not fluent to be used for training with a certain
chance
(A person holding a book with a bird sitting on the book)
拿着一本书和一只鸟坐在书上的人
Rejection
Sampling
(A small bird sitting on top of an open book)
一只小鸟坐在一本打开的书上
0.9
Fluent
0.2
Not Fluent
Captioning
Model
Fluency-Guided Training
Strategy II: Rejection sampling
14
u~ U (0, 0.5)
Fluency-Guided Training
Strategy III: Weighted loss
• Cost-sensitive learning
Captioning
Model(A person holding a book with a bird sitting on the book)
拿着一本书和一只鸟坐在书上的人
(A small bird sitting on top of an open book)
一只小鸟坐在一本打开的书上
𝑊𝑒𝑖𝑔ℎ𝑡𝑒𝑑	
   𝑏 𝑎𝑡𝑐ℎ	
   𝑙 𝑜𝑠𝑠 = −
1
m
log 𝑝 𝑠! + 0.2 ∗ log	
   𝑝(𝑠?) + ⋯
15
0.9
Fluent
0.2
Not Fluent
Experiments
• To verify if a cross-lingualcaptioning model trained by fluency-guided
learning can generate Chinese captions that are more fluent, meanwhile
maintaining the level of relevance when compared to learning from the
complete set of machine-translated sentences.
16
Datasets and Experiments
17
Developing test set
• Manually translating sentences in test set as ground truth
• Providing both English sentence and corresponding image
• To eliminate ambiguity and translate referring to the image
18
Two Bilingual (English-Chinese) Datasets
19
Flickr8k-cn Flickr30k-cn
Train Validation Test Train Validation Test
Images 6,000 1,000 1,000 29,783 1,000 1,000
Machine-translated Chinese sentences 30,000 5,000 - 148,915 5,000 -
Human-translated Chinese sentences - - 5,000 - - 5,000
Human-annotated Chinese sentences 30,000 5,000 5,000 - - -
• Extending Flickr8k and Flickr30k to bilingual version (English + Chinese)
Download: https://github.com/li-xirong/cross-lingual-cap
Experiments
• Baselines:
1. Late translation[Li, ICMR2016]
2. Late translation rerank
3. Without fluency
4. Manual Flickr8k-cn
• Proposed approaches:
1. Fluency-only
2. Rejection sampling
3. Weighted loss
A person holding a book with a bird sitting on the book
Captioning Model
Machine translation
拿着一本书和一只鸟坐在书上的人
A small bird sitting on top of an open book
一只小鸟坐在一本打开的书上
0.3
0.8
拿着一本书
和一只鸟坐
在书上的人
Chinese
Captioning
Model
20
Experiments
Approach
Flickr8k-cn Flickr30k-cn
BLEU4 ROUGE CIDEr BLEU4 ROUGE CIDEr
Late Translation 17.3 39.3 33.7 15.3 38.2 27.1
Late Translation Rerank 17.5 40.2 34.2 14.3 38.5 27.5
Without fluency 24.1 45.9 47.6 17.8 40.8 32.5
Fluency-only 20.7 41.1 35.2 14.5 35.9 25.1
Rejection sampling 23.9 45.3 46.6 18.2 40.5 32.9
Weighted loss 24.0 45.0 46.3 18.3 40.2 33.0
21
Automatic Evaluation Results
√
0 10 20 30 40 50
BLEU4
ROUGE
CIDEr
BLEU4
ROUGE
CIDEr
Flickr30k-cnFlickr8k-cn
Late Translation
Late Translation Rerank
Without fluency
• Late translation is not effective
22
Automatic Evaluation Results
0 10 20 30 40 50
BLEU4
ROUGE
CIDEr
BLEU4
ROUGE
CIDEr
Flickr30k-cnFlickr8k-cn
Without fluency
Fluency-only
Rejection sampling
Weighted loss
√
√
• Rejection sampling and Weighted loss are able to preserve relevant information
23
Experiments
• Human Evaluation
24
Human Evaluation
• Automatic evaluation is insufficient to guarantee the overall fluency
• Annotators rate the sentences using a Likert scale of 1 to 5 (higher is
better) in two aspects, namely relevance and fluency
• Sentences generated by distinct approaches are shown together
• Sentences randomly shuffled before presenting to the annotators
25
Human Evaluation Results
• Rejection sampling achieves the best balance between relevance and
fluency, without the need of manualwritten Chinese captions.
26
√
1 2 3 4 5
Relevance
Fluency
Relevance
Fluency
Flickr30k-cnFlickr8k-cn
Late Translation
Late Translation Rerank
Without fluency
Fluency-only
Rejection sampling
Weighted loss
Manual Flickr8k-cn
Human Evaluation Results
• Rejection sampling achieves the best balance between relevance and
fluency
27
√
1 2 3 4 5
Relevance
Fluency
Relevance
Fluency
Flickr30k-cnFlickr8k-cn
Without fluency
Fluency-only
Rejection sampling
Weighted loss
Human Evaluation Results
• Rejection sampling achieves the best balance between relevance and
fluency, without the need of manualwritten Chinese captions.
28
√
1 2 3 4 5
Relevance
Fluency
Relevance
Fluency
Flickr30k-cnFlickr8k-cn
Without fluency
Fluency-only
Rejection sampling
Weighted loss
Manual Flickr8k-cn
Conclusion
Fluency-guided framework
• Tackling cross-lingual image captioning with minimal manual
annotation effort
• Capable of generating relevant and fluent captions in target language
29
https://github.com/li-xirong/cross-lingual-cap
bluey@ruc.edu.cn

More Related Content

Similar to Fluency-Guided Cross-Lingual Image Captioning

A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
transfer.pptx
transfer.pptxtransfer.pptx
transfer.pptxHaibinSu2
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPInsoo Chung
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialAlyona Medelyan
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
ODSC London 2018
ODSC London 2018ODSC London 2018
ODSC London 2018Kfir Bar
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2Karthik Murugesan
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹台灣資料科學年會
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning
Frontiers of Vision and Language: Bridging Images and Texts by Deep LearningFrontiers of Vision and Language: Bridging Images and Texts by Deep Learning
Frontiers of Vision and Language: Bridging Images and Texts by Deep LearningYoshitaka Ushiku
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information RetrievalRoelof Pieters
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingYoung Seok Kim
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 

Similar to Fluency-Guided Cross-Lingual Image Captioning (20)

A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
transfer.pptx
transfer.pptxtransfer.pptx
transfer.pptx
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorial
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
ODSC London 2018
ODSC London 2018ODSC London 2018
ODSC London 2018
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Word2vec and Friends
Word2vec and FriendsWord2vec and Friends
Word2vec and Friends
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning
Frontiers of Vision and Language: Bridging Images and Texts by Deep LearningFrontiers of Vision and Language: Bridging Images and Texts by Deep Learning
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
co:op-READ-Convention Marburg - Enrique Vidal
co:op-READ-Convention Marburg - Enrique Vidalco:op-READ-Convention Marburg - Enrique Vidal
co:op-READ-Convention Marburg - Enrique Vidal
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 

Recently uploaded

Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideStefan Dietze
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Skynet Technologies
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 

Recently uploaded (20)

Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 

Fluency-Guided Cross-Lingual Image Captioning

  • 1. Fluency-guided Cross-lingual Image Captioning Weiyu Lan1, Xirong Li1, Jianfeng Dong2 1 Renmin University of China 2 Zhejiang University 1
  • 2. A person holding a book with a bird sitting on the book. 一个人拿着一本书,有一只小鸟站在上面 Una persona sostiene un libro con un pájaro que se sienta sobre el libro 책에 앉아 있는 새와 함께 책을 들고 있는 사람 Image Captioning 2
  • 3. Cross lingual image captioning • Goal: To generate relevant and fluent captions in a target language with minimal human effort 3
  • 4. Related Work • Monolingual image captioning • Deep learning: Encoder + Decoder (CNN+RNN) • Cross-lingual image captioning • Generate captions base on both image and captions in source language (Elliott et al., 2015) • Crowd sourcing to collect Japanese descriptions of the MSCOCO (Miyazaki and Shimizu, 2016) • Machine Translation (Li, ICMR2016) 4
  • 5. Related Work • Cross-lingual image captioning • Machine Translation (Li, ICMR2016) A person holding a book with a bird sitting on the book Captioning Model 一个人拿着一本书, 有一只小鸟站在上面 Machine translation Chinese Captioning Model × √ 5 translated Chinese sentences
  • 6. Machine-translated sentences are not fluent A person holding a book with a bird sitting on the book. 拿着一本书和一只鸟坐在书上的人 Una persona que sostiene un libro con un pájaro sentado en el libro. 책 한 권을 쥐고 한 사람 한 마리가 책 위에 앉아 있다. ? 6
  • 7. Our Approach • Fluency-guided cross-lingualimage captioning Not Fluent Fluent Estimated Fluency (A person holding a book with a bird sitting on the book) 拿着一本书和一只鸟坐在书上的人 (A small bird sitting on top of an open book) 一只小鸟坐在一本打开的书上 Captioning Model Image Sentence 7
  • 8. Sentence Fluency Estimation • Binary classification • Manual annotation • 8k sentences: fluent/not fluent • Less than 30% sentences are fluent • LSTM based model (A person holding a book with a bird sitting on the book) 拿着一本书和一只鸟坐在 书上的人 LSTM LSTM LSTM LSTM Word Embedding 拿着 一本 书 人 output p(fluent) Fluency Estimation Model 8
  • 9. A four-way LSTM based classifier 拿着一本书和一只鸟坐在书上的人 A person holding a book with a bird sitting … 拿:v 着:uzhe 一:m 本:q 书:n 和:c 一:m 只:q 鸟:n 坐:v 在:p 书:n 上:f 的:ude 人:n a:DT person:NN holding:VBG a:DT book:NN with:IN a:DT bird:NN … Sentence POS Fluency Score Fluency Estimation Model Fluency Estimation Model Fluency Estimation Model Fluency Estimation Model ! " ∑ 9
  • 10. Sentence Fluency Estimation Results English sentences Chinese sentences Estimated fluency scores The two large elephants are standing in the grass 两只大象正站在草地上 0.803 The young man in the blue shirt is playing tennis 穿蓝色衬衫的年轻人正在打网球 0.624 A group of people riding skis in their bathing suits 一群人在他们的沐浴骑滑雪服 0.117 A sports arena under a dome with snow on it 一个体育馆下一个圆顶下的雪在它 0.060 10
  • 11. Image Captioning Model • CNN + RNN framework [Vinyals, CVPR2015] • Training loss is the sum of the negative log likelihoods of the next correct word at each step. CNN LSTM LSTM LSTM LSTM LSTM Visual Embedding Word Embedding <start> 拿着 一本 人 output p(w1) p(w2) p(w3) p(wn) 11
  • 13. Fluency-Guided Training Strategy I: Fluency only Captioning Model (A person holding a book with a bird sitting on the book) 拿着一本书和一只鸟坐在书上的人 (A small bird sitting on top of an open book) 一只小鸟坐在一本打开的书上 0.9 Fluent 0.2 Not Fluent 13
  • 14. • Allow the sentences classified as not fluent to be used for training with a certain chance (A person holding a book with a bird sitting on the book) 拿着一本书和一只鸟坐在书上的人 Rejection Sampling (A small bird sitting on top of an open book) 一只小鸟坐在一本打开的书上 0.9 Fluent 0.2 Not Fluent Captioning Model Fluency-Guided Training Strategy II: Rejection sampling 14 u~ U (0, 0.5)
  • 15. Fluency-Guided Training Strategy III: Weighted loss • Cost-sensitive learning Captioning Model(A person holding a book with a bird sitting on the book) 拿着一本书和一只鸟坐在书上的人 (A small bird sitting on top of an open book) 一只小鸟坐在一本打开的书上 𝑊𝑒𝑖𝑔ℎ𝑡𝑒𝑑   𝑏 𝑎𝑡𝑐ℎ   𝑙 𝑜𝑠𝑠 = − 1 m log 𝑝 𝑠! + 0.2 ∗ log   𝑝(𝑠?) + ⋯ 15 0.9 Fluent 0.2 Not Fluent
  • 16. Experiments • To verify if a cross-lingualcaptioning model trained by fluency-guided learning can generate Chinese captions that are more fluent, meanwhile maintaining the level of relevance when compared to learning from the complete set of machine-translated sentences. 16
  • 18. Developing test set • Manually translating sentences in test set as ground truth • Providing both English sentence and corresponding image • To eliminate ambiguity and translate referring to the image 18
  • 19. Two Bilingual (English-Chinese) Datasets 19 Flickr8k-cn Flickr30k-cn Train Validation Test Train Validation Test Images 6,000 1,000 1,000 29,783 1,000 1,000 Machine-translated Chinese sentences 30,000 5,000 - 148,915 5,000 - Human-translated Chinese sentences - - 5,000 - - 5,000 Human-annotated Chinese sentences 30,000 5,000 5,000 - - - • Extending Flickr8k and Flickr30k to bilingual version (English + Chinese) Download: https://github.com/li-xirong/cross-lingual-cap
  • 20. Experiments • Baselines: 1. Late translation[Li, ICMR2016] 2. Late translation rerank 3. Without fluency 4. Manual Flickr8k-cn • Proposed approaches: 1. Fluency-only 2. Rejection sampling 3. Weighted loss A person holding a book with a bird sitting on the book Captioning Model Machine translation 拿着一本书和一只鸟坐在书上的人 A small bird sitting on top of an open book 一只小鸟坐在一本打开的书上 0.3 0.8 拿着一本书 和一只鸟坐 在书上的人 Chinese Captioning Model 20
  • 21. Experiments Approach Flickr8k-cn Flickr30k-cn BLEU4 ROUGE CIDEr BLEU4 ROUGE CIDEr Late Translation 17.3 39.3 33.7 15.3 38.2 27.1 Late Translation Rerank 17.5 40.2 34.2 14.3 38.5 27.5 Without fluency 24.1 45.9 47.6 17.8 40.8 32.5 Fluency-only 20.7 41.1 35.2 14.5 35.9 25.1 Rejection sampling 23.9 45.3 46.6 18.2 40.5 32.9 Weighted loss 24.0 45.0 46.3 18.3 40.2 33.0 21
  • 22. Automatic Evaluation Results √ 0 10 20 30 40 50 BLEU4 ROUGE CIDEr BLEU4 ROUGE CIDEr Flickr30k-cnFlickr8k-cn Late Translation Late Translation Rerank Without fluency • Late translation is not effective 22
  • 23. Automatic Evaluation Results 0 10 20 30 40 50 BLEU4 ROUGE CIDEr BLEU4 ROUGE CIDEr Flickr30k-cnFlickr8k-cn Without fluency Fluency-only Rejection sampling Weighted loss √ √ • Rejection sampling and Weighted loss are able to preserve relevant information 23
  • 25. Human Evaluation • Automatic evaluation is insufficient to guarantee the overall fluency • Annotators rate the sentences using a Likert scale of 1 to 5 (higher is better) in two aspects, namely relevance and fluency • Sentences generated by distinct approaches are shown together • Sentences randomly shuffled before presenting to the annotators 25
  • 26. Human Evaluation Results • Rejection sampling achieves the best balance between relevance and fluency, without the need of manualwritten Chinese captions. 26 √ 1 2 3 4 5 Relevance Fluency Relevance Fluency Flickr30k-cnFlickr8k-cn Late Translation Late Translation Rerank Without fluency Fluency-only Rejection sampling Weighted loss Manual Flickr8k-cn
  • 27. Human Evaluation Results • Rejection sampling achieves the best balance between relevance and fluency 27 √ 1 2 3 4 5 Relevance Fluency Relevance Fluency Flickr30k-cnFlickr8k-cn Without fluency Fluency-only Rejection sampling Weighted loss
  • 28. Human Evaluation Results • Rejection sampling achieves the best balance between relevance and fluency, without the need of manualwritten Chinese captions. 28 √ 1 2 3 4 5 Relevance Fluency Relevance Fluency Flickr30k-cnFlickr8k-cn Without fluency Fluency-only Rejection sampling Weighted loss Manual Flickr8k-cn
  • 29. Conclusion Fluency-guided framework • Tackling cross-lingual image captioning with minimal manual annotation effort • Capable of generating relevant and fluent captions in target language 29 https://github.com/li-xirong/cross-lingual-cap bluey@ruc.edu.cn