This document summarizes a presentation on multimodal language analysis with a multimodal transformer designed for unaligned language, vision, and audio sequences. The model processes the three modalities simultaneously through crossmodal attention, producing embeddings that capture the relationships between modalities without requiring strict alignment between the sequences. The presentation also covers experiments that evaluate the model across several metrics, along with ablation studies.
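
The crossmodal attention idea can be illustrated with a minimal sketch: one modality supplies the queries while another supplies the keys and values, so attention learns a soft correspondence between the two sequences and no word-level alignment is needed. The module choice (PyTorch's `nn.MultiheadAttention`), the embedding size, and the sequence lengths below are illustrative assumptions, not the presented model's exact configuration.

```python
import torch
import torch.nn as nn

# Assumed shapes: text and audio have different lengths because the
# sequences are unaligned (no word-level correspondence).
embed_dim, num_heads = 40, 8           # assumed shared embedding size / head count
seq_len_text, seq_len_audio = 50, 375  # hypothetical sequence lengths
batch = 2

# Each modality is assumed to have been projected into the shared embedding space.
text  = torch.randn(batch, seq_len_text,  embed_dim)   # target modality (queries)
audio = torch.randn(batch, seq_len_audio, embed_dim)   # source modality (keys/values)

cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Crossmodal attention: text attends to audio, so the text representation is
# reinforced with audio information via a learned soft alignment.
text_given_audio, attn_weights = cross_attn(query=text, key=audio, value=audio)

print(text_given_audio.shape)  # torch.Size([2, 50, 40])  -> same length as the text query
print(attn_weights.shape)      # torch.Size([2, 50, 375]) -> soft text-to-audio alignment
```

In the same spirit, the other modality pairs (e.g., vision attending to text, audio attending to vision) would each use their own crossmodal attention block, with the resulting sequences combined downstream for prediction.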