Multimodal Sequential Learning
Eun-Sol Kim
Department of Computer Science and Engineering
Seoul National University
Seoul 08826, Korea
Contents
• Automatic Schema Construction
  • DeepSchema: Automatic Schema Acquisition from Wearable Sensor Data in Restaurant Situations
  • Neurosymbolic Knowledge Graphs Learned from Multimodal Sequential Data
• Video Question Answering
  • Multimodal Memory Network
  • A reinforcement learning approach to multimodal sequential learning
Automatic Schema Construction
Motivation
• To describe human knowledge in formal languages
  • Sheds new light on SCRIPTs (Schank et al., 1977)
  • Conceptual dependency theory
  • (+) Abstracted knowledge, generalization
  • (-) Must be designed in advance; not flexible; hard to apply to new types of knowledge
• To extract abstracted representations from low-level sensory data
  • Deep neural networks
  • (+) Can be applied to low-level data; hierarchical structures
  • (-) Hard to interpret the results; not formal languages
Hierarchical Event Network
• A machine learning method that automatically constructs a hierarchical schema for restaurant situations from low-level sensory data
• Multimodal deep neural network architecture
  • A three-layer hierarchy
  • Inputs: low-level sensory data streams from wearable devices
  • Action primitives, events, and probabilistic scripts (a minimal sketch of the hierarchy follows below)
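The slides contain no code, so the following is only a minimal sketch of how such a three-layer hierarchy could be wired up, assuming PyTorch and pre-extracted per-modality feature vectors. The layer sizes, fusion by concatenation, and argument names are illustrative assumptions, not the architecture from the original work.

```python
# Illustrative sketch only: modality encoders -> action primitives -> events.
# Dimensions and concatenation-based fusion are assumptions.
import torch
import torch.nn as nn

class HierarchicalEventNetwork(nn.Module):
    def __init__(self, video_dim, audio_dim, motion_dim, n_primitives, n_events):
        super().__init__()
        # Layer 1: modality-specific encoders over low-level sensory features.
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 64), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.motion_enc = nn.Sequential(nn.Linear(motion_dim, 64), nn.ReLU())
        # Layer 2: shared action-primitive layer over the fused modalities.
        self.primitive_layer = nn.Sequential(nn.Linear(3 * 64, n_primitives), nn.ReLU())
        # Layer 3: event layer; event sequences later feed the probabilistic script.
        self.event_layer = nn.Linear(n_primitives, n_events)

    def forward(self, video, audio, motion):
        fused = torch.cat([self.video_enc(video),
                           self.audio_enc(audio),
                           self.motion_enc(motion)], dim=-1)
        primitives = self.primitive_layer(fused)      # action-primitive activations
        event_logits = self.event_layer(primitives)   # per-window event scores
        return primitives, event_logits
```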
Hierarchical Event Network
Derivation - Learning
Restaurant Situations
Data Acquisition
• Real-life dataset: DineAid
  • Restaurant situations
  • Collected over 7 days, about 4,000 seconds per day
  • Annotated with situation labels: Greeting, Having a seat, Selecting menu, Ordering menu, etc.
• Multiple wearable devices (a windowing sketch for these streams follows below)
  • Glass-type: video data, audio data
  • Watch-type: accelerometer, EDA (electrodermal activity), BVP (blood volume pulse)
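A hedged sketch of how the wearable streams could be segmented into fixed-length, time-aligned windows before being fed to the network above. The common sampling rate, window length, and dictionary layout are assumptions; the slides do not specify the preprocessing.

```python
# Illustrative preprocessing sketch: cut synchronized sensor streams into
# fixed-length windows. Window length, rate, and layout are assumptions.
import numpy as np

def make_windows(streams, window_labels, window_sec=2.0, rate=30):
    """streams: dict modality -> array of shape (num_samples, feat_dim), all
    resampled to `rate` Hz and time-aligned; window_labels: one situation label
    per window (Greeting, Having a seat, ...)."""
    win = int(window_sec * rate)
    n_windows = min(len(x) for x in streams.values()) // win
    samples = []
    for w in range(min(n_windows, len(window_labels))):
        sample = {m: x[w * win:(w + 1) * win] for m, x in streams.items()}
        sample["label"] = window_labels[w]
        samples.append(sample)
    return samples
```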
Experimental Results (1/2)
- Event Prediction
• Classify the corresponding event using the event schema
• Learn the hierarchical event network on a separate training split
• The corresponding events of the test data are then predicted
Experimental Results (2/2)
- Probabilistic SCRIPT
• Classify the corresponding event using the event schema
• Learn the hierarchical event network on a separate training split
• The corresponding events of the test data are then predicted (one way to realize a probabilistic SCRIPT over these events is sketched below)
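The slides do not spell out how the probabilistic SCRIPT is constructed; one simple realization, shown here purely as an assumption, is to estimate event-to-event transition probabilities from the predicted event sequences.

```python
# Assumed realization of a probabilistic script: a row-stochastic matrix of
# P(next event | current event) estimated from predicted event sequences.
import numpy as np

def estimate_script(event_sequences, n_events):
    counts = np.zeros((n_events, n_events))
    for seq in event_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for events that never occur.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Example with event indices (e.g. 0 = Greeting, 1 = Having a seat, ...).
script = estimate_script([[0, 1, 2, 3], [0, 1, 1, 3]], n_events=4)
```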
WikiHow Dataset
System Architecture (1)
System Architecture (2)
Result
Video Question Answering
• Question-answering data collected for 75 Pinkfong animations
  • 400 question-answer pairs collected per animation
  • Collected via Amazon Mechanical Turk
• The animations' subtitles, images, sound, and question-answer data are used to train machine learning algorithms
• Technology for answering users' questions about the animations
Content-based question answering technology
Framework
[Diagram: a server performs data collection, preprocessing, question-answer data collection, and training with machine learning algorithms; Android and Web clients connect through an interface over HTTP and Bluetooth. A minimal sketch of the HTTP interface follows below.]
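A minimal sketch of the HTTP side of this framework, assuming a Flask server. The /qa route, JSON field names, and port are assumptions made for illustration; the slide only names the components (server, Android/Web interface, HTTP, Bluetooth).

```python
# Illustrative server sketch: an Android/Web client posts a question over HTTP
# and the server replies with an answer from the trained QA model. Flask, the
# /qa route, and the JSON fields are assumptions taken for this example.
from flask import Flask, request, jsonify

app = Flask(__name__)

def answer_question(animation_id, question):
    # Placeholder for the model trained on subtitles, images, sound, and QA pairs.
    return "not implemented"

@app.route("/qa", methods=["POST"])
def qa():
    payload = request.get_json()
    return jsonify({"answer": answer_question(payload["animation_id"],
                                              payload["question"])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```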
Multimodal Memory Network
Multimodal Sequential Learning with RL
[Figure: at each time step, image, text, and sound inputs yield per-modality hidden states (h_i, h_t, h_s); a combining policy π chooses one of the candidate actions a_1–a_8 (the diagram labels combinations such as W_c^{ts}, W_c^{t}, and W_c^{its}), the chosen combining weight W_c fuses the modalities into h, and a GRU with shared weights W_GRU carries the state across time steps 1…T. The GRU part is trained with an error E and the combining policy with a reward R. A minimal sketch follows below.]
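A minimal PyTorch sketch of the structure in the figure, under stated assumptions: the eight actions are read as the eight subsets of {image, text, sound} (matching the W_c^{ts}, W_c^{t}, W_c^{its} labels), each action has its own combining weight, and a GRU with shared weights carries the fused state across time. Feature sizes, masking-based selection, and the sampling scheme are illustrative, not the exact formulation from the slides.

```python
# Illustrative sketch of the figure: a combining policy pi picks one of eight
# candidate actions (read here as subsets of {image, text, sound}); the selected
# modalities are fused by that action's combining weight W_c and passed to a GRU
# with shared weights W_GRU. Dimensions and details are assumptions.
import itertools
import torch
import torch.nn as nn

class RLFusionGRU(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        subsets = list(itertools.product([0.0, 1.0], repeat=3))          # a_1..a_8
        self.register_buffer("masks", torch.tensor(subsets))             # (8, 3)
        self.policy = nn.Linear(hidden_dim, len(subsets))                # combining policy pi
        # One combining weight W_c per action (e.g. W_c^{ts}, W_c^{its}, ...).
        self.w_c = nn.Parameter(0.02 * torch.randn(len(subsets), 3 * feat_dim, feat_dim))
        self.gru = nn.GRUCell(feat_dim, hidden_dim)                      # shared weights W_GRU

    def step(self, img, txt, snd, h):
        logits = self.policy(h)                                          # pi(a_t | h_{t-1})
        action = torch.distributions.Categorical(logits=logits).sample() # choose a_t
        mask = self.masks[action].unsqueeze(-1)                          # (batch, 3, 1)
        modalities = torch.stack([img, txt, snd], dim=1) * mask          # keep chosen modalities
        fused = torch.bmm(modalities.flatten(1).unsqueeze(1),            # apply W_c of a_t
                          self.w_c[action]).squeeze(1)
        h = self.gru(fused, h)                                           # fused state h_t
        return h, action, logits
```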
Multimodal Sequential Learning with RL
* The error function for the GRU part is straightforward; a sketch of the two training signals (error E for the GRU, reward R for the combining policy) follows below.
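A hedged sketch of those two training signals, under the same assumptions as the previous block: the GRU and combining weights receive the supervised error E through ordinary backpropagation, while the combining policy gets a REINFORCE-style gradient weighted by the reward R. REINFORCE is named here only as one standard choice; the slides indicate just an error E and a reward R.

```python
# Illustrative training step: task loss E trains the GRU/combining weights by
# backprop, and a reward-weighted (REINFORCE-style) term trains the policy pi.
import torch
import torch.nn.functional as F

def train_step(optimizer, logits_seq, actions_seq, task_loss, reward):
    # Sum of log-probabilities of the combining actions chosen over the sequence.
    log_probs = torch.stack([
        F.log_softmax(logits, dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
        for logits, a in zip(logits_seq, actions_seq)
    ]).sum(dim=0)
    policy_loss = -(reward * log_probs).mean()     # reward-weighted policy gradient
    loss = task_loss + policy_loss                 # E for the GRU part + policy term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```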
Thank you!
