Multimodal Transformer for
Unaligned Multimodal Language Sequences
Tsai et al., ACL 2019
Deep Learning Paper Reading Group, NLP Team
Presenter: 주정헌 / Team members: 황소현, 백지윤
2021.05.16
Introduction | Multimodal?
text
vision
acoustic
Human
Introduction | Multimodal?
Demo
Introduction | Multimodal?
Data2text
Proposed Method | Crossmodal Attention
seq2seq
Multimodal Transformer
Proposed Method | Crossmodal Attention
Human Multimodal Language Analysis.
Proposed Method | Crossmodal Attention
Aligned sequence
Today the weather is gorgeous and I see a beautiful blue sky.
Proposed Method | Crossmodal Attention
Unaligned sequence
[Figure: the word tokens "Today / the / Weather / is / gorgeous / And" attend over unaligned frames of another modality; the grid of values shows the crossmodal attention weights.]
Proposed Method | Crossmodal Attention
Alignment visualization for the word “Spectacle”
Overall Architecture | Temporal Convolutions
X_L : text input
X_V : vision input
X_A : audio input
A 1D temporal convolution over each modality ensures that every element of the input sequence has sufficient awareness of its neighborhood elements, and projects all modalities to a common feature dimension.
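The per-modality temporal convolution can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the kernel size, model dimension, and the 300-d text input are assumptions for the example (matching the GloVe features mentioned later in the slides).

```python
import numpy as np

def temporal_conv(x, w, b):
    """1D 'same' convolution over time.
    x: (T, d_in) input sequence; w: (k, d_in, d_out) kernel; b: (d_out,) bias."""
    k, d_in, d_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))        # zero-pad in time for 'same' output length
    w2 = w.reshape(k * d_in, d_out)
    # Each output step mixes a k-step neighborhood, giving local temporal awareness.
    return np.stack([xp[t:t + k].reshape(-1) @ w2 for t in range(x.shape[0])]) + b

rng = np.random.default_rng(0)
d = 40                                          # assumed common model dimension
x_l = rng.standard_normal((50, 300))            # text: 50 steps of 300-d embeddings
w_l = rng.standard_normal((3, 300, d)) * 0.01   # kernel size 3 (assumed)
z_l = temporal_conv(x_l, w_l, np.zeros(d))      # (50, 40): projected to dimension d
```

Each modality would get its own kernel (with d_in = 300, 35, or the audio feature size), so all three streams end up in the same d-dimensional space.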
Overall Architecture | Positional Embedding
PE = (sinusoidal) positional embedding that augments the features with temporal-order information
Z = low-level position-aware features for the different modalities
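A minimal sketch of the sinusoidal positional embedding, assuming the standard Transformer formulation (even dimensions get sine, odd get cosine); adding it to the convolution outputs yields the position-aware features Z.

```python
import numpy as np

def positional_embedding(T, d):
    """Standard sinusoidal PE of shape (T, d); d is assumed even."""
    pos = np.arange(T)[:, None]                  # (T, 1) time indices
    i = np.arange(d // 2)[None, :]               # (1, d/2) dimension indices
    angles = pos / (10000 ** (2 * i / d))        # (T, d/2) frequencies
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

T, d = 50, 40                                    # assumed sequence length and model dim
conv_out = np.zeros((T, d))                      # stand-in for the conv output
z = conv_out + positional_embedding(T, d)        # low-level position-aware features
```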
Overall Architecture | Crossmodal Transformers
Target: text; sources: 1) audio -> text, 2) vision -> text
Target: vision; sources: 1) text -> vision, 2) audio -> vision
Target: audio; sources: 1) text -> audio, 2) vision -> audio
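The core of each direction above is crossmodal attention: queries come from the target modality, keys and values from the source, so the output has the target's length but carries source information. A single-head NumPy sketch (dimensions and sequence lengths are assumptions, not the paper's hyper-parameters):

```python
import numpy as np

def crossmodal_attention(x_tgt, x_src, Wq, Wk, Wv):
    """Scaled dot-product attention from source modality to target modality.
    x_tgt: (T_tgt, d), x_src: (T_src, d); returns (T_tgt, d)."""
    Q, K, V = x_tgt @ Wq, x_src @ Wk, x_src @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (T_tgt, T_src) alignment matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax over source steps
    return weights @ V                                 # source info mapped to target length

rng = np.random.default_rng(1)
d = 40                                                 # assumed common dimension
text = rng.standard_normal((50, d))                    # target: 50 text steps
audio = rng.standard_normal((120, d))                  # source: 120 unaligned audio frames
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = crossmodal_attention(text, audio, Wq, Wk, Wv)    # audio -> text: (50, 40)
```

Because the softmax runs over source time steps, no word-level alignment between modalities is required, which is exactly what makes the unaligned setting workable.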
Overall Architecture | Crossmodal Transformers
Experiments | Evaluation metrics
Experiments | Evaluation metrics & Ablation study
Experiments | Ablation study
Experiments | Features and Hyper-parameters
Language: pre-trained GloVe (840B.300d) word embeddings
Vision: Facet, 35 facial action units capturing per-frame basic emotions
Audio: COVAREP (MFCCs, voiced/unvoiced segmenting features, etc.)
Q & A
Thank you
