Multimodal Sequential Learning for Video QA

Multimodal Sequential Learning
Eun-Sol Kim
Department of Computer Science and Engineering
Seoul National University
Seoul 08826, Korea

Contents
 Automatic Schema Construction
    DeepSchema: Automatic Schema Acquisition from Wearable Sensor Data in Restaurant Situations
    Neurosymbolic Knowledge Graphs Learned from Multimodal Sequential Data
 Video Question Answering
    Multimodal Memory Network
    A reinforcement learning approach to multimodal sequential learning

Motivation
 To describe human knowledge in formal languages
    Sheds new light on SCRIPTs (Schank et al., 1997)
    Conceptual dependency theory
    (+) Abstracted knowledge, generalization
    (-) Must be designed in advance; inflexible and hard to apply to new types of knowledge
 To extract abstract representations from low-level sensory data
    Deep neural networks
    (+) Applicable to low-level data; hierarchical structures
    (-) Results are hard to interpret; not expressed in a formal language

Hierarchical Event Network
 A machine learning method that automatically constructs a hierarchical schema for restaurant situations from low-level sensory data
 Multimodal deep neural network architecture
 A three-layer hierarchy
    Inputs: low-level sensory data streams from wearable devices
    Action primitives, events, and probabilistic scripts
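
The slides do not say how a probabilistic script is represented. As a minimal sketch, assuming a script can be approximated by first-order transition probabilities between event labels, the top layer could be estimated from labeled event sequences as follows. The function name `estimate_script` and the two example sequences are hypothetical; only the event names come from the annotated situations in the dataset.

```python
from collections import Counter, defaultdict


def estimate_script(event_sequences):
    """Estimate P(next event | current event) from labeled event sequences.

    Assumption: a probabilistic script is approximated here by a
    first-order Markov chain over event labels. The slides do not
    confirm this representation.
    """
    counts = defaultdict(Counter)
    for seq in event_sequences:
        # Count each consecutive (current, next) pair of events.
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    # Normalize the counts of each current event into probabilities.
    return {
        cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for cur, c in counts.items()
    }


# Hypothetical event sequences, for illustration only.
days = [
    ["Greeting", "Having a seat", "Selecting menu", "Ordering menu"],
    ["Greeting", "Having a seat", "Ordering menu"],
]
script = estimate_script(days)
print(script["Having a seat"])
```

In this toy input, "Having a seat" is followed once by "Selecting menu" and once by "Ordering menu", so each transition receives probability 0.5.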

Data Acquisition
 Real-life dataset: DineAid
    Restaurant situation
    7 days of data, about 4,000 seconds each
    Annotated with 11 situations: Greeting, Having a seat, Selecting menu, Ordering menu, etc.
 Multiple wearable devices
    Glass-type: video data, audio data
    Watch-type: accelerometer, EDA (electrodermal activity), BVP (blood volume pulse)

Experimental Results (1/2): Event Prediction
 Task: classify the corresponding event using the event schema
 The hierarchical event network is trained on separate training data
 The corresponding events of the test data are then predicted

Experimental Results (2/2): Probabilistic SCRIPT
 Task: classify the corresponding event using the event schema
 The hierarchical event network is trained on separate training data
 The corresponding events of the test data are then predicted

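Content-Based Question Answering
 Question-answer data collected for 75 Pinkfong animations
    400 question-answer pairs collected per animation
    Collected via Amazon Mechanical Turk
 Machine learning algorithms are trained on the animations' subtitles, images, sound, and question-answer data
 Technology that can answer users' questions about an animation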

Framework
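[Figure: system framework. Server: data collection, preprocessing, question-answer data collection, and training with machine learning algorithms. Interface: Android, Web. Connections: HTTP, Bluetooth.]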

Multimodal Sequential Learning with RL
[Figure: the network unrolled over time steps 1, 2, 3, ..., T. At each step the Image, Text, and Sound inputs yield hidden states h_i, h_t, and h_s, eight actions a1 to a8 are shown, and a combined state h (h1, h2, h3, ..., hT) is produced. Labels: RNN weight W_GRU; combining policy π; combining weight W_c, with examples W_c^{ts}, W_c^{t}, and W_c^{its}; E_GRU; R.]

Multimodal Sequential Learning with RL
* The error function for the GRU part is trivial
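
The combining step can be illustrated with a small NumPy sketch. It rests on several assumptions not stated in the slides: that the eight actions a1 to a8 index the 2^3 subsets of {image, text, sound}; that a plain average stands in for the learned combining weight W_c; that the policy π is a linear-softmax policy; and that it is trained with the REINFORCE gradient. All function names and the reward value are hypothetical, and the GRU part is omitted.

```python
import itertools

import numpy as np

MODALITIES = ("image", "text", "sound")

# Assumption: the eight actions index all subsets of the three modalities.
ACTIONS = [
    subset
    for r in range(len(MODALITIES) + 1)
    for subset in itertools.combinations(MODALITIES, r)
]


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def combine(hidden, action):
    """Average the hidden states of the selected modalities.

    A plain average is a stand-in for the learned combining weight W_c.
    Returns zeros when the empty subset is selected.
    """
    dim = hidden[MODALITIES[0]].shape[0]
    if not action:
        return np.zeros(dim)
    return np.mean([hidden[m] for m in action], axis=0)


def policy_step(theta, hidden, rng):
    """Sample an action index from a linear-softmax policy."""
    x = np.concatenate([hidden[m] for m in MODALITIES])
    probs = softmax(theta @ x)
    a = int(rng.choice(len(ACTIONS), p=probs))
    return a, probs, x


def reinforce_grad(a, probs, x, reward):
    """Gradient of reward * log pi(a | x) with respect to theta."""
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    return reward * np.outer(one_hot - probs, x)


# One illustrative step with random hidden states and a hypothetical reward.
rng = np.random.default_rng(0)
dim = 4
hidden = {m: rng.normal(size=dim) for m in MODALITIES}
theta = np.zeros((len(ACTIONS), len(MODALITIES) * dim))
a, probs, x = policy_step(theta, hidden, rng)
h = combine(hidden, ACTIONS[a])
theta = theta + 0.1 * reinforce_grad(a, probs, x, reward=1.0)
```

With a positive reward, the update raises the probability of the sampled action for the same input, which is the behavior expected of a policy-gradient step.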
