SlideShare a Scribd company logo
Multi-modal Summarization
Xiachong Feng
For General Summarization
1. Brief intro to Summariza0on(PPT):
h'p://xcfeng.net/res/presenta5on/%E6%96%87%E6%9C%AC%E6%91%9
8%E8%A6%81%E7%AE%80%E8%BF%B0.pdf
2. Brief intro to Summariza0on(Notes): h'p://xcfeng.net/res/notes/Brief-
intro-to-summariza5on.pdf
3. ACL19: h'p://xcfeng.net/res/presenta5on/ACL19%20Summariza5on.pdf
4. EMNLP19: h'p://xcfeng.net/res/notes/EMNLP19_Summariza5on.pdf
5. ACL20: h'p://xcfeng.net/res/presenta5on/acl2020-summariza5on.pdf
Task
Text Summarization
We're developing a remote control
which you probably already know.
Video
Audio
Text
Multi-modal Summarization
Classifica5on
How2
NIPS18,ACL19
Jiajun Zhang, Chengqing Zong
News
EMNLP17,IJCAI18
MSMO
EMNLP18,AAAI20
Meeting
ICME03, GIFT18, ACL19
• Classification based on task type
Classifica5on
How2
NIPS18,ACL19
Jiajun Zhang, Chengqing Zong
News
EMNLP17,IJCAI18
MSMO
EMNLP18,AAAI20
Meeting
ICME03, GIFT18, ACL19
Asynchronous
Synchronous
• Classification based on dataset type
Basic Ideas
• Attention
• Embedding
• Pretrain
Attention Strategies for Multi-Source
Sequence-to-Sequence Learning ACL17
Learning curves on validation data
Encoder1
Encoder2
src
src
Multi Encoder
0.8
0.2
hierarchical
flat
concate
Task : WMT16 MulGmodal TranslaGon
Words Can Shift: Dynamically Adjusting Word
Representations Using Nonverbal Behaviors AAAI19
• Core Idea : use nonverbal information to adjust word representation
hate
Recurrent Attended Variation Embedding
Network (RAVEN)
• Visual Features : facial expression analysis toolkit FACET(30Hz)
• Acoustic Features : COVAREP acoustic analysis framework(100Hz)
Multimodal Sentiment Analysis:
Multimodal Emotion Recognition:
Experiment
✔
×
Unicoder-VL: A Universal Encoder for Vision and
Language by Cross-modal Pre-training AAAI20
• Masked Language Modeling (MLM)
• Masked Object Classification (MOC)
• Visual-linguistic Matching (VLM)
Downstream Task : Image-text retrieval is the task of identifying an image
from candidates given a caption describing its content, or vice versa.
How2
How2: A Large-scale Dataset for Multimodal
Language Understanding NIPS18
How2: A Large-scale Dataset for Multimodal
Language Understanding NIPS18
Training 73993
Validation 2965
Testing 2156
Input avg 291 words
Summary avg 33 words
• 2,000 hours of short instruc=onal videos, spanning different domains
such as cooking, sports, indoor/outdoor ac=vi=es, music, etc.
• Each video is accompanied by a human-generated transcript and a 2 to
3 sentence summary
(1) Multimodal Abstractive Summarization
for Open-Domain Videos NIPS18
• Action Feature Extraction
• 2048 dimensions extracted every 16 non-
overlapping frames using a ResNext-101
3D Convolutional Neural Network, 400
different human actions
• Result
• Best : Text + Action with Hierarchical Attn
(2) Multimodal Abstractive Summarization
for How2 Videos ACL19
News
(1) Multi-modal Summarization for Asynchronous
Collection of Text, Image, Audio and Video EMNLP17
(2) Read, Watch, Listen, and Summarize: Multi-Modal
Summarization for Asynchronous Text, Image, Audio
and Video IEEE18
Summary
MulO-modal SummarizaOon for Asynchronous
CollecOon of Text, Image, Audio and Video EMNLP17
• Method: Extractive Summarization
Document
Sentences
Video
Transcripts
Sentence Set For Each Sentence
SCORE
formal
noisy
• LexRank
• Video
• Audio
• Picture align score
• Picture
• Video
Multi-modal Summarization for Asynchronous
Collection of Text, Image, Audio and Video EMNLP17
Modified LexRank
Video information
Hier audio score
transcripts are
more important
Audio information
Document
sentence is
better then
transcripts.
Multi-modal Summarization for Asynchronous
Collection of Text, Image, Audio and Video EMNLP17
8月13日,深圳1份巴西进口冻鸡
翅和西安1份厄瓜多尔进口冻虾
的外包装先后被披露新冠病毒核
酸检测结果呈阳性。
消费者可在购买时做好防护,在烹
饪时应注意关键环节卫生,且将食
材煮熟。但食品安全没有零风险,
如果个人特别在意,可选择不吃。
Multi-modal Summarization for Asynchronous
Collection of Text, Image, Audio and Video EMNLP17
• Dataset
• 25 Chinese topics
• 25 English topics
• Each topic 20 documents
• Human generate summaries
Multi-modal Sentence Summarization with
Modality Attention and Image Filtering IJCAI18
• Task
• Dataset
• Gigaword + Image Search (top 5) + human selec=on (top 1)
• train(62000) valid(2000) test(2000)
Multi-modal Sentence Summarization with
Modality Attention and Image Filtering IJCAI18
IMG Encoder Text Encoder
IMG Text
Decoder
Hierarchical
Attention
𝐶!
gate
G
element-wise gate
MSMO
MSMO: Multimodal Summarization with
Multimodal Output EMNLP18
• Task:MSMO • Model
• Dataset: From Daily Mail, annotators
select up to three images as ref imgs.
• Evaluation: MMAE
ROUGE+ Image precision + Image-Text Relevance
MSMO: Multimodal Summarization with
Multimodal Output EMNLP18
Decoder Step 1 Decoder Step 2
• Visual Coverage
Multimodal Summarization with Guidance
of Multimodal Reference AAAI20
• Problem
• Only trained by target of text modality , lead to the modality-bias problem.
Data Extension
• ROUGE-ranking.
• Order-ranking.
Meeting
Communication Load is Overwheming
https://www.youtube.com/watch?v=Pd5Y7pfbU4w&list=LLFh1maHZbOfktmNXYft0-rA&index=4&t=0s
Multimodal Summarization of Meeting
Recordings ICME2003
• Why summarize meetings?
• Frequently, meetings last hours and have many lengthy boring parts.
• Why use multi-modal information?
• Language analysis-based abstraction techniques may not be sufficient
to capture significant visual and audio events in a meeting, such as a
person entering the room to join the meeting or an emotional
discussion.
• Meeting sequences generally contain a very small amount of visual
activity. Also, because our meeting recorder captures omni-directional
video, there is no camera motion and therefore no scene breaks.
Multimodal Summarization of Meeting
Recordings ICME2003
Audio
• sound localizaRon output
• magnitude of the audio signal
Video
• luminance changes in a small
window
Text
• IF-IDF
Fusing Verbal and Nonverbal Information for
Extractive Meeting Summarization GIFT18
Keep Meeting Summaries on Topic: Abstractive
Multi-Modal Meeting Summarization ACL19
• Input:5-dimensional feature vector.
• head pose angle (roll, pitch and yaw)
• eye gaze direction vector (azimuth and
elevation)
• Trained on the VFOA annotation
• p0,...,p|P|, table, whiteboard, projection
screen and unknown
• For utterance 𝑢! , the VFOA vector
𝑓! ∈ ℝ|P|∗|F|
Conclusion
• Model is simple. [Encoder + Decoder + Hierarchical attention]
• Modality Interaction is less.
• Priori knowledge is important.
• Multimodal output (text, pic, video) can help a larger set of people.
• Data privacy is not considered in Meeting Summarization.
Thanks!

More Related Content

What's hot

The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
NU_I_TODALAB
 
Google's Pathways Language Model and Chain-of-Thought
Google's Pathways Language Model and Chain-of-ThoughtGoogle's Pathways Language Model and Chain-of-Thought
Google's Pathways Language Model and Chain-of-Thought
Vaclav1
 
機械学習の応用例にみる認知症診断と将来の発症予測
機械学習の応用例にみる認知症診断と将来の発症予測機械学習の応用例にみる認知症診断と将来の発症予測
機械学習の応用例にみる認知症診断と将来の発症予測
Momoko Hayamizu
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
Yuki Saito
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
NU_I_TODALAB
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換
NU_I_TODALAB
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Yamagishi Laboratory, National Institute of Informatics, Japan
 
Image feature extraction
Image feature extractionImage feature extraction
Image feature extraction
Rushin Shah
 
音声認識における言語モデル
音声認識における言語モデル音声認識における言語モデル
音声認識における言語モデル
KOTARO SETOYAMA
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
Roelof Pieters
 
複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査
Tomoki Hayashi
 
[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio
Deep Learning JP
 
iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...
iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...
iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...
Tecnick.com LTD
 
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...
KwanyoungKim7
 
音声認識の基礎
音声認識の基礎音声認識の基礎
音声認識の基礎
Akinori Ito
 
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningMulti-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning
Seolhokim
 
音声生成の基礎と音声学
音声生成の基礎と音声学音声生成の基礎と音声学
音声生成の基礎と音声学
Akinori Ito
 
[Paper] attention mechanism(luong)
[Paper] attention mechanism(luong)[Paper] attention mechanism(luong)
[Paper] attention mechanism(luong)
Susang Kim
 
DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相
Takuya Yoshioka
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
Christian Perone
 

What's hot (20)

The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Google's Pathways Language Model and Chain-of-Thought
Google's Pathways Language Model and Chain-of-ThoughtGoogle's Pathways Language Model and Chain-of-Thought
Google's Pathways Language Model and Chain-of-Thought
 
機械学習の応用例にみる認知症診断と将来の発症予測
機械学習の応用例にみる認知症診断と将来の発症予測機械学習の応用例にみる認知症診断と将来の発症予測
機械学習の応用例にみる認知症診断と将来の発症予測
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
楽曲中歌声加工における声質変換精度向上のための歌声・伴奏分離法
 
CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換CTCに基づく音響イベントからの擬音語表現への変換
CTCに基づく音響イベントからの擬音語表現への変換
 
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
Tutorial on end-to-end text-to-speech synthesis: Part 2 – Tactron and related...
 
Image feature extraction
Image feature extractionImage feature extraction
Image feature extraction
 
音声認識における言語モデル
音声認識における言語モデル音声認識における言語モデル
音声認識における言語モデル
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査複数話者WaveNetボコーダに関する調査
複数話者WaveNetボコーダに関する調査
 
[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio[DL輪読会]Wavenet a generative model for raw audio
[DL輪読会]Wavenet a generative model for raw audio
 
iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...
iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...
iNEDI - Accuracy Improvements and Artifacts Removal in Edge Based Image Inter...
 
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Dis...
 
音声認識の基礎
音声認識の基礎音声認識の基礎
音声認識の基礎
 
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningMulti-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning
 
音声生成の基礎と音声学
音声生成の基礎と音声学音声生成の基礎と音声学
音声生成の基礎と音声学
 
[Paper] attention mechanism(luong)
[Paper] attention mechanism(luong)[Paper] attention mechanism(luong)
[Paper] attention mechanism(luong)
 
DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相DNN音響モデルにおける特徴量抽出の諸相
DNN音響モデルにおける特徴量抽出の諸相
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 

Similar to Multi-Modal Summarization

What’s new in MPEG?
What’s new in MPEG?What’s new in MPEG?
What’s new in MPEG?
Alpen-Adria-Universität
 
Automated Podcasting System for Universities
Automated Podcasting System for UniversitiesAutomated Podcasting System for Universities
Automated Podcasting System for Universities
Educational Technology
 
Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01
netzwelt12345
 
Performance Analysis of Various Video Compression Techniques
Performance Analysis of Various Video Compression TechniquesPerformance Analysis of Various Video Compression Techniques
Performance Analysis of Various Video Compression Techniques
International Journal of Science and Research (IJSR)
 
SUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOSSUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOS
IRJET Journal
 
mpeg4copy-120428133000-phpapp01.ppt
mpeg4copy-120428133000-phpapp01.pptmpeg4copy-120428133000-phpapp01.ppt
mpeg4copy-120428133000-phpapp01.ppt
PawachMetharattanara
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & Introduction
Champ Yen
 
2015.12.06 - digi test evaluation B&G & FFW (1)
2015.12.06 - digi test evaluation B&G & FFW (1)2015.12.06 - digi test evaluation B&G & FFW (1)
2015.12.06 - digi test evaluation B&G & FFW (1)
Gráinne D'alton
 
AnnoTone (CHI 2015)
AnnoTone (CHI 2015)AnnoTone (CHI 2015)
AnnoTone (CHI 2015)
Ryohei Suzuki
 
Video Codecs and the Future by Vince Puglia
Video Codecs and the Future by Vince PugliaVideo Codecs and the Future by Vince Puglia
Video Codecs and the Future by Vince Puglia
Dialogic Inc.
 
Altitude San Francisco 2018: Video Workshop Docs
Altitude San Francisco 2018: Video Workshop DocsAltitude San Francisco 2018: Video Workshop Docs
Altitude San Francisco 2018: Video Workshop Docs
Fastly
 
ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...
ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...
ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...
Anand Bhojan
 
TI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBootTI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBoot
andrewmurraympc
 
A Study on FFmpeg Multimedia Framework
A Study on FFmpeg Multimedia FrameworkA Study on FFmpeg Multimedia Framework
A Study on FFmpeg Multimedia Framework
ijtsrd
 
SDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLSSDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLS
NECST Lab @ Politecnico di Milano
 
5.Arne_Nowak_Digital_Archiving_Pilots.pdf
5.Arne_Nowak_Digital_Archiving_Pilots.pdf5.Arne_Nowak_Digital_Archiving_Pilots.pdf
5.Arne_Nowak_Digital_Archiving_Pilots.pdf
ssuser7dc8771
 
Using AI to transcribe qualitative data: Personal reflections of an experienc...
Using AI to transcribe qualitative data: Personal reflections of an experienc...Using AI to transcribe qualitative data: Personal reflections of an experienc...
Using AI to transcribe qualitative data: Personal reflections of an experienc...
John Wren
 
Re-using Media on the Web tutorial: Media Fragment Creation and Annotation
Re-using Media on the Web tutorial: Media Fragment Creation and AnnotationRe-using Media on the Web tutorial: Media Fragment Creation and Annotation
Re-using Media on the Web tutorial: Media Fragment Creation and Annotation
MediaMixerCommunity
 
A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...
A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...
A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...
Mohamed Ossa
 
VIDEO TO TEXT SUMMARIZER USING AI.pdf
VIDEO TO TEXT SUMMARIZER USING AI.pdfVIDEO TO TEXT SUMMARIZER USING AI.pdf
VIDEO TO TEXT SUMMARIZER USING AI.pdf
FreeFire293813
 

Similar to Multi-Modal Summarization (20)

What’s new in MPEG?
What’s new in MPEG?What’s new in MPEG?
What’s new in MPEG?
 
Automated Podcasting System for Universities
Automated Podcasting System for UniversitiesAutomated Podcasting System for Universities
Automated Podcasting System for Universities
 
Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01
 
Performance Analysis of Various Video Compression Techniques
Performance Analysis of Various Video Compression TechniquesPerformance Analysis of Various Video Compression Techniques
Performance Analysis of Various Video Compression Techniques
 
SUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOSSUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOS
 
mpeg4copy-120428133000-phpapp01.ppt
mpeg4copy-120428133000-phpapp01.pptmpeg4copy-120428133000-phpapp01.ppt
mpeg4copy-120428133000-phpapp01.ppt
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & Introduction
 
2015.12.06 - digi test evaluation B&G & FFW (1)
2015.12.06 - digi test evaluation B&G & FFW (1)2015.12.06 - digi test evaluation B&G & FFW (1)
2015.12.06 - digi test evaluation B&G & FFW (1)
 
AnnoTone (CHI 2015)
AnnoTone (CHI 2015)AnnoTone (CHI 2015)
AnnoTone (CHI 2015)
 
Video Codecs and the Future by Vince Puglia
Video Codecs and the Future by Vince PugliaVideo Codecs and the Future by Vince Puglia
Video Codecs and the Future by Vince Puglia
 
Altitude San Francisco 2018: Video Workshop Docs
Altitude San Francisco 2018: Video Workshop DocsAltitude San Francisco 2018: Video Workshop Docs
Altitude San Francisco 2018: Video Workshop Docs
 
ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...
ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...
ShowNTell: An easy-to-use tool for answering students’ questions with voice-o...
 
TI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBootTI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBoot
 
A Study on FFmpeg Multimedia Framework
A Study on FFmpeg Multimedia FrameworkA Study on FFmpeg Multimedia Framework
A Study on FFmpeg Multimedia Framework
 
SDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLSSDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLS
 
5.Arne_Nowak_Digital_Archiving_Pilots.pdf
5.Arne_Nowak_Digital_Archiving_Pilots.pdf5.Arne_Nowak_Digital_Archiving_Pilots.pdf
5.Arne_Nowak_Digital_Archiving_Pilots.pdf
 
Using AI to transcribe qualitative data: Personal reflections of an experienc...
Using AI to transcribe qualitative data: Personal reflections of an experienc...Using AI to transcribe qualitative data: Personal reflections of an experienc...
Using AI to transcribe qualitative data: Personal reflections of an experienc...
 
Re-using Media on the Web tutorial: Media Fragment Creation and Annotation
Re-using Media on the Web tutorial: Media Fragment Creation and AnnotationRe-using Media on the Web tutorial: Media Fragment Creation and Annotation
Re-using Media on the Web tutorial: Media Fragment Creation and Annotation
 
A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...
A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...
A Cloud-based Automated Authoring System to support e-Learning in Higher Educ...
 
VIDEO TO TEXT SUMMARIZER USING AI.pdf
VIDEO TO TEXT SUMMARIZER USING AI.pdfVIDEO TO TEXT SUMMARIZER USING AI.pdf
VIDEO TO TEXT SUMMARIZER USING AI.pdf
 

Recently uploaded

BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
Hitesh Sikarwar
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 

Recently uploaded (20)

BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 

Multi-Modal Summarization

  • 2. For General Summarization 1. Brief intro to Summariza0on(PPT): h'p://xcfeng.net/res/presenta5on/%E6%96%87%E6%9C%AC%E6%91%9 8%E8%A6%81%E7%AE%80%E8%BF%B0.pdf 2. Brief intro to Summariza0on(Notes): h'p://xcfeng.net/res/notes/Brief- intro-to-summariza5on.pdf 3. ACL19: h'p://xcfeng.net/res/presenta5on/ACL19%20Summariza5on.pdf 4. EMNLP19: h'p://xcfeng.net/res/notes/EMNLP19_Summariza5on.pdf 5. ACL20: h'p://xcfeng.net/res/presenta5on/acl2020-summariza5on.pdf
  • 3. Task Text Summarization We're developing a remote control which you probably already know. Video Audio Text Multi-modal Summarization
  • 4. Classifica5on How2 NIPS18,ACL19 Jiajun Zhang, Chengqing Zong News EMNLP17,IJCAI18 MSMO EMNLP18,AAAI20 Meeting ICME03, GIFT18, ACL19 • Classification based on task type
  • 5. Classifica5on How2 NIPS18,ACL19 Jiajun Zhang, Chengqing Zong News EMNLP17,IJCAI18 MSMO EMNLP18,AAAI20 Meeting ICME03, GIFT18, ACL19 Asynchronous Synchronous • Classification based on dataset type
  • 6. Basic Ideas • Attention • Embedding • Pretrain
  • 7. Attention Strategies for Multi-Source Sequence-to-Sequence Learning ACL17 Learning curves on validation data Encoder1 Encoder2 src src Multi Encoder 0.8 0.2 hierarchical flat concate Task : WMT16 MulGmodal TranslaGon
  • 8. Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors AAAI19 • Core Idea : use nonverbal information to adjust word representation hate
  • 9. Recurrent Attended Variation Embedding Network (RAVEN) • Visual Features : facial expression analysis toolkit FACET(30Hz) • Acoustic Features : COVAREP acoustic analysis framework(100Hz) Multimodal Sentiment Analysis: Multimodal Emotion Recognition: Experiment ✔ ×
  • 10. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training AAAI20 • Masked Language Modeling (MLM) • Masked Object Classification (MOC) • Visual-linguistic Matching (VLM) Downstream Task : Image-text retrieval is the task of identifying an image from candidates given a caption describing its content, or vice versa.
  • 11. How2
  • 12. How2: A Large-scale Dataset for Multimodal Language Understanding NIPS18
  • 13. How2: A Large-scale Dataset for Multimodal Language Understanding NIPS18 Training 73993 Validation 2965 Testing 2156 Input avg 291 words Summary avg 33 words • 2,000 hours of short instruc=onal videos, spanning different domains such as cooking, sports, indoor/outdoor ac=vi=es, music, etc. • Each video is accompanied by a human-generated transcript and a 2 to 3 sentence summary
  • 14. (1) Multimodal Abstractive Summarization for Open-Domain Videos NIPS18 • Action Feature Extraction • 2048 dimensions extracted every 16 non- overlapping frames using a ResNext-101 3D Convolutional Neural Network, 400 different human actions • Result • Best : Text + Action with Hierarchical Attn (2) Multimodal Abstractive Summarization for How2 Videos ACL19
  • 15. News
  • 16. (1) Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video EMNLP17 (2) Read, Watch, Listen, and Summarize: Multi-Modal Summarization for Asynchronous Text, Image, Audio and Video IEEE18 Summary
  • 17. MulO-modal SummarizaOon for Asynchronous CollecOon of Text, Image, Audio and Video EMNLP17 • Method: Extractive Summarization Document Sentences Video Transcripts Sentence Set For Each Sentence SCORE formal noisy • LexRank • Video • Audio • Picture align score • Picture • Video
  • 18. Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video EMNLP17 Modified LexRank Video information Hier audio score transcripts are more important Audio information Document sentence is better then transcripts.
  • 19. Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video EMNLP17 8月13日,深圳1份巴西进口冻鸡 翅和西安1份厄瓜多尔进口冻虾 的外包装先后被披露新冠病毒核 酸检测结果呈阳性。 消费者可在购买时做好防护,在烹 饪时应注意关键环节卫生,且将食 材煮熟。但食品安全没有零风险, 如果个人特别在意,可选择不吃。
  • 20. Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video EMNLP17 • Dataset • 25 Chinese topics • 25 English topics • Each topic 20 documents • Human generate summaries
  • 21. Multi-modal Sentence Summarization with Modality Attention and Image Filtering IJCAI18 • Task • Dataset • Gigaword + Image Search (top 5) + human selec=on (top 1) • train(62000) valid(2000) test(2000)
  • 22. Multi-modal Sentence Summarization with Modality Attention and Image Filtering IJCAI18 IMG Encoder Text Encoder IMG Text Decoder Hierarchical Attention 𝐶! gate G element-wise gate
  • 23. MSMO
  • 24. MSMO: Multimodal Summarization with Multimodal Output EMNLP18 • Task:MSMO • Model • Dataset: From Daily Mail, annotators select up to three images as ref imgs. • Evaluation: MMAE ROUGE+ Image precision + Image-Text Relevance
  • 25. MSMO: Multimodal Summarization with Multimodal Output EMNLP18 Decoder Step 1 Decoder Step 2 • Visual Coverage
  • 26. Multimodal Summarization with Guidance of Multimodal Reference AAAI20 • Problem • Only trained by target of text modality , lead to the modality-bias problem. Data Extension • ROUGE-ranking. • Order-ranking.
  • 28. Communication Load is Overwheming https://www.youtube.com/watch?v=Pd5Y7pfbU4w&list=LLFh1maHZbOfktmNXYft0-rA&index=4&t=0s
  • 29. Multimodal Summarization of Meeting Recordings ICME2003 • Why summarize meetings? • Frequently, meetings last hours and have many lengthy boring parts. • Why use multi-modal information? • Language analysis-based abstraction techniques may not be sufficient to capture significant visual and audio events in a meeting, such as a person entering the room to join the meeting or an emotional discussion. • Meeting sequences generally contain a very small amount of visual activity. Also, because our meeting recorder captures omni-directional video, there is no camera motion and therefore no scene breaks.
  • 30. Multimodal Summarization of Meeting Recordings ICME2003 Audio • sound localizaRon output • magnitude of the audio signal Video • luminance changes in a small window Text • IF-IDF
  • 31.
  • 32. Fusing Verbal and Nonverbal Information for Extractive Meeting Summarization GIFT18
  • 33. Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization ACL19 • Input:5-dimensional feature vector. • head pose angle (roll, pitch and yaw) • eye gaze direction vector (azimuth and elevation) • Trained on the VFOA annotation • p0,...,p|P|, table, whiteboard, projection screen and unknown • For utterance 𝑢! , the VFOA vector 𝑓! ∈ ℝ|P|∗|F|
  • 34. Conclusion • Model is simple. [Encoder + Decoder + Hierarchical attention] • Modality Interaction is less. • Priori knowledge is important. • Multimodal output (text, pic, video) can help a larger set of people. • Data privacy is not considered in Meeting Summarization.