Text2Action: Generative Adversarial Synthesis from Language to Action

2017.11.17
Presenter : Hyemin Ahn

Introducing Myself
2017-11-16 CPSLAB (EECS) 2
Interested in human-robot interaction based on machine learning,
and in human nonverbal communication.

Today’s Seminar: Text2Action
Text2Action: Generative Adversarial Synthesis from Language to Action
• A neural network that, given a sentence (Language) describing a human action, generates the human action (Action) that the sentence describes.
Man is dancing to music

If the goal is to build such a network, what exactly needs to be done?
1. How should the input natural language be processed?
• What is a sentence?
• A sequence of characters / words.
• How should we encode the features that capture what the input sentence says about the action?
2. How should an action be generated from the processed natural language?
• What is an action?
• A sequence of poses in time.
• To generate the pose at each moment, how should the features encoded from the input sentence be passed along?
Background topics: Word2Vec, RNN, Sequence to Sequence

Backgrounds : Word2Vec
• Vector representations of words! (Word embeddings)
• Based on the assumption that words located close to each other in a text tend to have similar meanings (the Distributional Hypothesis), it learns how words are distributed in a vector space.
• More effective than representing each word as a one-hot vector!
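The Distributional Hypothesis above can be made concrete with a small sketch. It assumes the skip-gram form of Word2Vec (the slides do not say which variant is meant), where training examples are (center word, nearby word) pairs. The function name `skipgram_pairs` and the `window` parameter are illustrative, not from the slides.

```python
def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs for words within `window` positions.

    Assumption: skip-gram style pairs; these are the raw training examples
    from which word vectors would later be learned (learning is not shown).
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the girl is dancing to the music".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs[:4]:
    print(center, "->", context)
```

Words that often share the same neighbors end up with many similar pairs, which is what pushes their learned vectors close together.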


Backgrounds : Recurrent Neural Networks(RNN)
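• People remember and use the patterns of events that happen in sequence.
• Easy: reciting the Korean alphabet in order, '가 나 다 라 마 바 사…'
• But what about in reverse? '하 파 카 타 차 자 아…' ?
• The idea that gave birth to the RNN: 'Let's make use of the information contained in such a sequence!'
• Learn the patterns in a sequence, and use them to estimate what will happen next or to generate a new sequence!
• But HOW?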

Backgrounds : Recurrent Neural Networks(RNN)
[Figure: RNN cell with INPUT $x_t$, HIDDEN STATE $h_t$ fed back through a ONE STEP DELAY, and OUTPUT $y_t$; weights $U$ (input), $W$ (recurrent), $V$ (output)]
• An RNN is called "RECURRENT" because it performs the same operation repeatedly, each time it receives one element of the sequence as input.
• As a result, the output depends on what was computed in the previous steps.
• An RNN has a "memory" that stores what has been computed so far.
• The hidden state $h_t$, which acts as this "memory", stores information about the input sequence.
• If $f = \tanh$, the vanishing/exploding gradient problem can arise.
• To overcome this, LSTM/GRU is commonly used as $f$.
$h_t = f(U x_t + W h_{t-1} + b)$
$y_t = V h_t + c$
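As a minimal sketch of the recurrence $h_t = f(U x_t + W h_{t-1} + b)$, $y_t = V h_t + c$ with $f = \tanh$, the following NumPy code applies the same step to every element of a sequence. The dimensions, the random weights, and the zero initial state are assumptions made only for illustration.

```python
import numpy as np


def rnn_forward(xs, U, W, V, b, c):
    """Apply h_t = tanh(U x_t + W h_{t-1} + b), y_t = V h_t + c over xs."""
    h = np.zeros(W.shape[0])  # h_0 = 0 (assumed initial memory)
    ys = []
    for x in xs:  # the same operation is repeated for every element
        h = np.tanh(U @ x + W @ h + b)
        ys.append(V @ h + c)
    return np.stack(ys), h


rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 5  # illustrative sizes
U = rng.normal(scale=0.5, size=(n_hid, n_in))
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
V = rng.normal(scale=0.5, size=(n_out, n_hid))
b = np.zeros(n_hid)
c = np.zeros(n_out)

xs = rng.normal(size=(T, n_in))
ys, h_last = rnn_forward(xs, U, W, V, b, c)
print(ys.shape, h_last.shape)
```

The only thing carried from one step to the next is `h`, which is why it plays the role of the memory.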

Backgrounds : Long Short Term Memory (LSTM)
• Imagine a machine that guesses what tonight's dinner will be from the items in a shopping bag.
Hmm…
Carbonara?

[Figure sequence: an LSTM cell with input $x_t$, hidden state $h_t$, output $y_t$, and cell state $C_t$, built up step by step]
• Cell state $C_t$: the internal memory unit. Like a conveyor belt!
• Forget some memories!
• Given the previous $h_{t-1}$ and the new input $x_t$, the LSTM decides (1) which parts of the memory to erase, and (2) how to add new memory.
• Insert some memories!
Figures from http://colah.github.io/posts/2015-08-Understanding-LSTMs/


LSTM
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
$h_t = o_t * \tanh(C_t)$
It looks like this structure could be made simpler…!
GRU
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$
$\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h)$
$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$
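The LSTM and GRU update rules can be written as single-step functions. This is a hedged NumPy sketch: `[h, x]` is taken to mean concatenation, `*` to mean element-wise multiplication, and the parameter dictionary, sizes, and the bias name `b_h` for the GRU candidate state are naming choices made here, not something the slides specify.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step: forget, insert, then expose part of the cell state."""
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ hx + p["b_f"])       # what to forget
    i_t = sigmoid(p["W_i"] @ hx + p["b_i"])       # what to insert
    o_t = sigmoid(p["W_o"] @ hx + p["b_o"])       # what to output
    C_tilde = np.tanh(p["W_C"] @ hx + p["b_C"])   # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t


def gru_step(x_t, h_prev, p):
    """One GRU step: a single state h_t, mixed by the update gate z_t."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["W_z"] @ hx + p["b_z"])
    r_t = sigmoid(p["W_r"] @ hx + p["b_r"])
    rhx = np.concatenate([r_t * h_prev, x_t])     # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(p["W_h"] @ rhx + p["b_h"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde


def init_params(names, n_hid, n_in, rng):
    """Random weights and zero biases (illustrative initialization only)."""
    p = {}
    for name in names:
        p["W_" + name] = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
        p["b_" + name] = np.zeros(n_hid)
    return p


rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
x = rng.normal(size=n_in)
h0 = np.zeros(n_hid)

lstm_p = init_params(["f", "i", "o", "C"], n_hid, n_in, rng)
h_lstm, C_lstm = lstm_step(x, h0, np.zeros(n_hid), lstm_p)

gru_p = init_params(["z", "r", "h"], n_hid, n_in, rng)
h_gru = gru_step(x, h0, gru_p)
print(h_lstm.shape, C_lstm.shape, h_gru.shape)
```

The GRU needs three weight matrices and one state vector where the LSTM needs four matrices and two state vectors, which is the simplification the slide hints at.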

Backgrounds : Sequence to Sequence
[Figure: an LSTM/GRU Encoder with hidden states $h_e(1), \dots, h_e(5)$ passes its result to an LSTM/GRU Decoder with hidden states $h_d(1), \dots, h_d(T_e)$. Example: Western Food To Korean Food Transition]

• What is the simplest way to implement a Sequence to Sequence model?
Pass the Encoder's last hidden state $h_T$ to the very first cell of the Decoder!
• However, this approach has a drawback: the longer the sequence the Decoder has to generate, the less effective it becomes.
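The simplest Sequence to Sequence scheme described above can be sketched as follows. To keep the example short, a plain tanh RNN cell stands in for the LSTM/GRU cells, and the decoder feeds its own previous output back as input; both are assumptions of this sketch, as are all names and sizes.

```python
import numpy as np


def rnn_step(x, h, U, W, b):
    """Plain tanh cell, used here in place of an LSTM/GRU cell."""
    return np.tanh(U @ x + W @ h + b)


def encode(xs, enc):
    """Read the whole input sequence and return the last hidden state h_T."""
    h = np.zeros(enc["W"].shape[0])
    for x in xs:
        h = rnn_step(x, h, **enc)
    return h


def decode(h0, n_steps, dec, V, c):
    """Start from the encoder's h_T and generate n_steps outputs."""
    h = h0
    y = np.zeros(V.shape[0])  # assumed all-zero "start" input
    ys = []
    for _ in range(n_steps):
        h = rnn_step(y, h, **dec)
        y = V @ h + c
        ys.append(y)
    return np.stack(ys)


rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
enc = {"U": rng.normal(scale=0.5, size=(n_hid, n_in)),
       "W": rng.normal(scale=0.5, size=(n_hid, n_hid)),
       "b": np.zeros(n_hid)}
dec = {"U": rng.normal(scale=0.5, size=(n_hid, n_out)),
       "W": rng.normal(scale=0.5, size=(n_hid, n_hid)),
       "b": np.zeros(n_hid)}
V = rng.normal(scale=0.5, size=(n_out, n_hid))
c = np.zeros(n_out)

h_T = encode(rng.normal(size=(5, n_in)), enc)
ys = decode(h_T, 6, dec, V, c)
print(h_T.shape, ys.shape)
```

Everything the decoder knows about the input has to fit into the single vector `h_T`, which suggests why quality degrades as the output sequence grows longer.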

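[Figure: a Bidirectional GRU Encoder, an Attention layer producing the context vector $c_t$, and a GRU Decoder]
• For each GRU cell in the Decoder, pass along the Encoder's information in a different way!
Encoder state (forward and backward GRU states concatenated):
$h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j]$
Context vector:
$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$
Attention weights:
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$
$e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j + b_a)$
Decoder state:
$s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) * s_{i-1} + z_i * \tilde{s}_i$
$z_i = \sigma(W_z y_{i-1} + U_z s_{i-1} + b_z)$
$r_i = \sigma(W_r y_{i-1} + U_r s_{i-1} + b_r)$
$\tilde{s}_i = \tanh(y_{i-1} + U[r_i * s_{i-1}] + C c_i + b)$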
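The attention step can be sketched in NumPy: score every encoder state $h_j$ against the previous decoder state $s_{i-1}$, normalize the scores with a softmax to get $\alpha_{ij}$, and take the weighted sum as the context vector $c_i$. The function name, the sizes, and the random parameters are assumptions for illustration only.

```python
import numpy as np


def attention_context(s_prev, H, W_a, U_a, b_a, v_a):
    """Return (c_i, alpha_i) for one decoder step.

    s_prev: previous decoder state s_{i-1}, shape (d_s,)
    H:      encoder states h_1..h_Tx stacked as rows, shape (T_x, d_h)
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a), for all j at once
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T + b_a) @ v_a
    e = e - e.max()                       # for numerical stability only
    alpha = np.exp(e) / np.exp(e).sum()   # softmax over encoder positions
    c = alpha @ H                         # c_i = sum_j alpha_ij h_j
    return c, alpha


rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 5, 6, 4, 3  # illustrative sizes
H = rng.normal(size=(T_x, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
b_a = np.zeros(d_a)
v_a = rng.normal(size=d_a)

c, alpha = attention_context(s_prev, H, W_a, U_a, b_a, v_a)
print(alpha.round(3), c.shape)
```

Because `alpha` is recomputed from `s_prev` at every decoder step, each decoder cell receives a different mixture of the encoder states.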

Back to Text2Action: Possible Structure?

But the result from Seq2Seq alone is…
Input Sentence: The girl is dancing to the music.

But the result from Seq2Seq alone is…
Input Sentence: The man is talking to the audience.

How can we generate more realistic actions?
Let's take advantage of a Generative Adversarial Network (GAN)!
But HOW?

Generator and Discriminator
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, c)))]$
<Warning> Relying only on this value function can produce terrible results!
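To make the value function concrete, here is a small sketch that estimates $V(D, G)$ from discriminator outputs on a batch of real and generated samples. It does not implement the Text2Action networks; `d_real` and `d_fake` are hypothetical arrays of probabilities that a discriminator might output, and `eps` is added only to avoid `log(0)`.

```python
import numpy as np


def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x, c)] + E[log(1 - D(G(z, c)))].

    d_real: discriminator outputs on real samples, values in (0, 1)
    d_fake: discriminator outputs on generated samples, values in (0, 1)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes the value up toward 0.
v_confident = gan_value([0.99, 0.98], [0.01, 0.02])
print(v_confused, v_confident)
```

The discriminator tries to raise this value while the generator tries to lower it, which is the min-max game in the formula.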

Text2Action: Overall Structure

Text2Action: Used Training Data
• Pose data extracted from the MSR-VTT dataset, which consists of YouTube videos and their corresponding language descriptions.

Text2Action: Result
Input Sentence: The girl is dancing to the hip hop beat.

Text2Action: Result
Input Sentence: The girl is dancing


Text2Action: Result
Input Sentence: A chef is cooking a meal in the kitchen.

Text2Action: Result
Input Sentence: A man is throwing something to the front.
