Sequence Encoding
Multiple Vector Encoding
Sangkeun Jung (정상근)
Understandable even if you know nothing at all about deep learning
AI Problem Formulation
Classification Clustering
λŒ€λΆ€λΆ„μ˜ 문제λ₯Ό Classification ν˜Ήμ€ Clustering 문제둜 바라보고 μ‹œλ„ν•¨
Image Classification
from Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012.
Classification Formulation
A Big & Deep Network maps each input to one of Class 1, Class 2, Class 3, Class 4.
!!! Note that there is no β€œdon't know” class.
Data Transformation
= Abstraction at the input/output level
Ex) Image Classification
Image pixel data β†’ class distribution over Class 1, Class 2, Class 3, Class 4, Class ...
Data Transformation @ Image Classification
Graphical Notation for Data
x = (10, 2, 8, 2, 15, 3, 5, 1, 5)  =  [10 2 8; 2 15 3; 5 1 5]
The 9-dimensional data x is drawn in 3x3 form purely for convenience of presentation; both notations denote the same Data X.
V to 1
[10 2 8; 2 15 3; 5 1 5]  β†’  ?
How can the 3x3 data be summarized into a 1x1 value?
= (Data) Abstraction / Encoding / Summarization / Reduction ...
V to 1 – Simple Method
[10 2 8; 2 15 3; 5 1 5]  β†’  15   (center one)
[10 2 8; 2 15 3; 5 1 5]  β†’  5.6  (average)
[10 2 8; 2 15 3; 5 1 5]  β†’  5    (median)
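A minimal NumPy sketch of the three summaries above; the array values are the example grid, everything else is illustrative:

    import numpy as np

    x = np.array([[10, 2, 8],
                  [2, 15, 3],
                  [5, 1, 5]], dtype=float)

    center = x[1, 1]        # center element -> 15.0
    average = x.mean()      # mean of all 9 values -> about 5.67
    median = np.median(x)   # median -> 5.0
    print(center, average, median)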
V to 1 – Weighted Method
Weighted Sum:  [10 2 8; 2 15 3; 5 1 5] (Value)  βŠ—  [3 1 5; 2 6 3; 9 3 6] (Weight), element-wise multiplication then summation  β†’  28.1
Weighted Average:  [10 2 8; 2 15 3; 5 1 5] (Value)  βŠ—  [3/9 1/9 5/9; 2/9 6/9 3/9; 9/9 3/9 6/9] (Weight)  β†’  6.65
V to 1 – General Form
Data: [x1 x2 x3; x4 x5 x6; x7 x8 x9]   ~   Weights: [w1 w2 w3; w4 w5 w6; w7 w8 w9]
Weighted Sum (element-wise multiplication, then summation):
v = x_1*w_1 + x_2*w_2 + ... + x_9*w_9
V to 1 - Linear Algebra
Weighted Sum as a matrix product:
X = [x1 x2 x3 x4 x5 x6 x7 x8 x9]          ([1 x 9] matrix)
W = [w1; w2; w3; w4; w5; w6; w7; w8; w9]  ([9 x 1] matrix)
X W = Ξ£_{i=1}^{9} x_i * w_i               ([1 x 1] matrix)
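A sketch of the same weighted sum as a [1 x 9] Γ— [9 x 1] matrix product in NumPy; assuming the weights are the earlier weight grid scaled by 1/9, the product reproduces the 28.1 example value:

    import numpy as np

    x = np.array([[10, 2, 8, 2, 15, 3, 5, 1, 5]], dtype=float)        # [1 x 9]
    w = np.array([[3, 1, 5, 2, 6, 3, 9, 3, 6]], dtype=float).T / 9.0  # [9 x 1] (assumed scaling)

    v = x @ w          # [1 x 1] matrix: sum_i x_i * w_i
    print(v)           # about 28.1, same as the element-wise multiply-and-sum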
Convolution Neural Network (1)
http://arxiv.org/pdf/1406.3332v1.pdf
Mairal et al., Convolutional Kernel Networks, 2014
Convolution Neural Network (2)
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
filter: [1 0 1; 0 1 0; 1 0 1]
CNN – Examples
http://docs.gimp.org/ko/plug-in-convmatrix.html
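A sketch of applying a small 3x3 filter such as the one above as a sliding weighted sum over a toy 5x5 array (plain cross-correlation, the form most CNN libraries use; the toy image is an assumption):

    import numpy as np

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    filt = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 0, 1]], dtype=float)

    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum of each 3x3 window with the filter
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
    print(out)   # 3x3 feature map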
V to V’
Transform data made of V components into data made of V' components.
[10 2 8; 2 15 3; 5 1 5]  β†’  (?, ?)        V = 9,  V' = 2
V to V’ – generalized method
Data: [x1 x2 x3; x4 x5 x6; x7 x8 x9]
Weighted Sum with weight set 1 [w_{1,1} ... w_{1,9}]:   v_1 = x_1*w_{1,1} + x_2*w_{1,2} + ... + x_9*w_{1,9}
Weighted Sum with weight set 2 [w_{2,1} ... w_{2,9}]:   v_2 = x_1*w_{2,1} + x_2*w_{2,2} + ... + x_9*w_{2,9}
β†’  (v_1, v_2)
V to V’ – Linear Algebra
Weighted Sum as a matrix product:
X = [x1 x2 x3 x4 x5 x6 x7 x8 x9]                               ([1 x 9] matrix)
W = [w_{1,1} w_{2,1}; w_{1,2} w_{2,2}; ... ; w_{1,9} w_{2,9}]  ([9 x 2] matrix)
X W = ( Ξ£_{i=1}^{9} x_i * w_{1,i} ,  Ξ£_{i=1}^{9} x_i * w_{2,i} )   ([1 x 2] matrix)
Fully Connected Network
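A sketch of the [1 x 9] Γ— [9 x 2] projection, i.e. a fully connected layer without bias or nonlinearity; the weight values here are random placeholders:

    import numpy as np

    x = np.array([[10, 2, 8, 2, 15, 3, 5, 1, 5]], dtype=float)  # [1 x 9]
    W = np.random.randn(9, 2)                                   # [9 x 2] weight set

    v = x @ W          # [1 x 2]: (sum_i x_i*w_{1,i}, sum_i x_i*w_{2,i})
    print(v.shape)     # (1, 2)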
V to V’ – Projection Notation
[V-dimensional data]  β†’ (W) β†’  [V'-dimensional data]
Let W denote the Weight Set that transforms data of dimension V into dimension V'; W is a [V x V'] matrix.
Projection with Context (1)
ν•˜λ‚˜μ˜ 데이터λ₯Ό μš”μ•½ν•  λ•Œ 도움이 λ˜λŠ”
λ‹€λ₯Έ λΆ€κ°€ 정보가 μžˆλ‹€λ©΄ μ–΄λ–»κ²Œ λ°˜μ˜ν•  수 μžˆμ„κΉŒ?
DataContext
(뢀가정보)
Projection with Context (2)
Data: [x1 x2 x3; x4 x5 x6; x7 x8 x9]      Context (side information): [c1 c2 c3 c4]
Project the Data with weight W and the Context with weight Wc, each to (I, II), and add the two results:  (I, II) + (I, II) = (I, II).
Weights W and Wc are applied to the Data and the Context, respectively.
V to V’ with Context - Linear Algebra
X = [x1 x2 x3 x4 x5 x6 x7 x8 x9]   ([1 x 9] matrix),   W: [9 x 2] matrix with columns (w_{1,i}) and (w_{2,i})
X W = ( Ξ£_{i=1}^{9} x_i * w_{1,i} ,  Ξ£_{i=1}^{9} x_i * w_{2,i} )       ([1 x 2] matrix)
C = [c1 c2 c3 c4]   ([1 x 4] matrix),   Wc: [4 x 2] matrix with columns (w^c_{1,i}) and (w^c_{2,i})
C Wc = ( Ξ£_{i=1}^{4} c_i * w^c_{1,i} ,  Ξ£_{i=1}^{4} c_i * w^c_{2,i} )   ([1 x 2] matrix)
V to V’ with Context - Linear Algebra (simple)
[X C] = [x1 x2 x3 x4 x5 x6 x7 x8 x9 c1 c2 c3 c4]        ([1 x (9+4)] matrix)
Stacked weights [W; Wc]                                  ([(9+4) x 2] matrix)
[X C] [W; Wc] = ( Ξ£_{i=1}^{9} x_i * w_{1,i} + Ξ£_{i=1}^{4} c_i * w^c_{1,i} ,  Ξ£_{i=1}^{9} x_i * w_{2,i} + Ξ£_{i=1}^{4} c_i * w^c_{2,i} )   ([1 x 2] matrix)
Concatenate the input Data and Context, and process them with a single combined Weight Matrix.
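A sketch checking the equivalence stated above: projecting Data and Context separately and adding the results equals projecting their concatenation with one stacked weight matrix (all weights are random placeholders):

    import numpy as np

    x = np.random.randn(1, 9)    # data,    [1 x 9]
    c = np.random.randn(1, 4)    # context, [1 x 4]
    W = np.random.randn(9, 2)    # data weights,    [9 x 2]
    Wc = np.random.randn(4, 2)   # context weights, [4 x 2]

    separate = x @ W + c @ Wc                          # [1 x 2]
    combined = np.hstack([x, c]) @ np.vstack([W, Wc])  # [1 x 13] @ [13 x 2]
    print(np.allclose(separate, combined))             # True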
V β†’ V' β†’ 1
Instead of reducing V-dimensional data to 1 in a single step,
it can first be reduced to an intermediate V' and then reduced again to 1.
[10 2 8; 2 15 3; 5 1 5]  β†’(W)β†’  (?, ?)  β†’(W')β†’  ?        V = 9,  V' = 2
V β†’ V' β†’ 1 : Multi-Layer Perceptron  (likewise V β†’ V' β†’ V'' β†’ 1)
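A minimal sketch of V β†’ V' β†’ 1 (9 β†’ 2 β†’ 1); the tanh between the two projections is an assumed nonlinearity, the slide only shows the two weight sets:

    import numpy as np

    x = np.random.randn(1, 9)     # V = 9
    W1 = np.random.randn(9, 2)    # V  -> V' (2)
    W2 = np.random.randn(2, 1)    # V' -> 1

    h = np.tanh(x @ W1)           # intermediate summary, [1 x 2]
    y = h @ W2                    # final 1-dimensional summary, [1 x 1]
    print(y.shape)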
ν•˜λ‚˜μ˜ Data
μš”μ•½
μ—¬λŸ¬ 개 Data
μš”μ•½
Multiple Data Abstraction / Encoding / Summarization / Reduction ….
Multiple Item 을 μš”μ•½ν•˜λŠ” 방법은?
10 2 8
2 15 3
5 1 5
13 4 8
4 5 2
1 45 31
6 3 4
1 7 1
3 4 0
? ? ?
? ? ?
? ? ?
Data 1 Data 2 Data 3
?
?
?
V V
V’
1
Vs to V’
Data 1 = [10 2 8; 2 15 3; 5 1 5],  Data 2 = [13 4 8; 4 5 2; 1 45 31],  Data 3 = [6 3 4; 1 7 1; 3 4 0]
Element-wise Average  β†’  [9.6 ? 6.6; ? ? ?; ? ? ?],   e.g. (10+13+6)/3 = 9.6 and (8+8+4)/3 = 6.6
Vs to V’
Data 1 = [10 2 8; 2 15 3; 5 1 5],  Data 2 = [13 4 8; 4 5 2; 1 45 31],  Data 3 = [6 3 4; 1 7 1; 3 4 0]
Weights: w_1 = 0.2,  w_2 = 0.4,  w_3 = 0.4
0.2 Γ— Data 1 = [2 0.4 1.6; 0.4 3 0.6; 1 0.2 1.0]
0.4 Γ— Data 2 = [5.2 1.6 3.2; 1.6 2 0.8; 0.4 18 12.4]
0.4 Γ— Data 3 = [2.4 1.2 1.6; 0.4 2.8 0.4; 1.2 1.6 0]
Element-wise multiplication, then element-wise summation  β†’  [9.6 3.2 6.4; 2.4 7.8 1.8; 2.6 19.8 13.4]
Each Data item is multiplied by its weight and the results are added.
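A sketch of this weighted element-wise blend, using the weights from the example:

    import numpy as np

    d1 = np.array([[10, 2, 8], [2, 15, 3], [5, 1, 5]], dtype=float)
    d2 = np.array([[13, 4, 8], [4, 5, 2], [1, 45, 31]], dtype=float)
    d3 = np.array([[6, 3, 4], [1, 7, 1], [3, 4, 0]], dtype=float)
    w = [0.2, 0.4, 0.4]

    blended = w[0] * d1 + w[1] * d2 + w[2] * d3   # element-wise multiply, then sum
    print(blended)   # about [[9.6 3.2 6.4] [2.4 7.8 1.8] [2.6 19.8 13.4]]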
Vs β†’ V's β†’ ?
Data 1, Data 2, Data 3 are each projected with W to 1, 2, 3 (element-wise calculation).
After projecting each Data item individually, think about sequence-wise summarization methods on the reduced data.
N to 1 Problem
If the N-to-1 problem can be solved well, in what areas can it be applied?
- Language modeling:        λŒ€ν•œ λ―Όκ΅­ λ§Œμ„Έ
- Dialogue modeling:        μ•ŒλžŒ μ’€ μΌœμ€„λž˜  β†’  Set.Alarm
- Sentiment analysis:       이 μ˜ν™” μ§±!  β†’  Positive
- Document classification:  w w ... w  β†’  Topic = music ...
- Zero pronoun detection:   w w ... w  β†’  Pronoun Dropped ...
- ...
TEMPORAL SUMMARIZATION
Can 'temporal information' be reflected in the Data Transformation?
Context = side information
Vs β†’ V's β†’ ?
Data 1 β†’ 1 (via W);   Data 2 β†’ 2 (via W and Wc, with the previous step as Context)
The current input (Data 2) is combined with information from the past.
Vs β†’ V's β†’ V'
Data 1 β†’ 1;   Data 2 β†’ 2 (current input Data 2 + past information, via W and Wc);   Data 3 β†’ 3 (current input Data 3 + past information, via W and Wc)
β†’ Data that reflects all the information seen so far.
Recurrent Neural Network
Graphical Notation
Data t-1, Data t, Data t+1 are the Input Data; W is applied to the input and U carries the Hidden State across time steps (the RNN Layer).
Simplified Version: the whole layer is drawn as a single box.     Vs β†’ V's
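A minimal vanilla-RNN sketch consistent with the notation above (W on the input, U on the previous hidden state); tanh and the dimensions are assumptions:

    import numpy as np

    def rnn_forward(xs, W, U, h0):
        """xs: list of [1 x V] inputs; returns list of [1 x V'] hidden states."""
        h, hs = h0, []
        for x in xs:
            h = np.tanh(x @ W + h @ U)   # current input + past information
            hs.append(h)
        return hs

    V, Vp = 9, 4
    xs = [np.random.randn(1, V) for _ in range(3)]      # Data 1..3
    W, U = np.random.randn(V, Vp), np.random.randn(Vp, Vp)
    hs = rnn_forward(xs, W, U, np.zeros((1, Vp)))
    summary = hs[-1]   # last hidden state as the sequence summary (Vs -> 1)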
Forward / Backward RNN
Forward RNN reads Data 1 β†’ Data 2 β†’ Data 3;  Backward RNN reads Data 3 β†’ Data 2 β†’ Data 1.    (each: Vs β†’ V's)
Bidirectional RNN
Data 1, Data 2, Data 3 are processed by both a forward and a backward RNN, and the two hidden states (V' each) are concatenated at every position:   Vs β†’ (2*V')s
Stacking RNN
Data 1, Data 2, Data 3 pass through stacked RNN layers, the output sequence of one layer feeding the next:   Vs β†’ V's
Input / Output of an RNN
Data 1, Data 2, Data 3  β†’  Out 1, Out 2, Out 3
- Vs β†’ V's,  with Len(Vs) = Len(V's)
- An output item can be produced for every input item.
Data 1, Data 2, Data 3  β†’  Out   (Summarization)
- Vs β†’ 1
- The input data can be summarized while reflecting temporal information.
Limitations of the Basic RNN Model
Data 1, Data 2, Data 3  β†’  Out 1, Out 2, Out 3
Out 1 cannot reflect any information from Data 2 and Data 3,
and information from the distant Data 1 is hard to propagate forward.
Remedies:
1) Global Summarization
2) LSTM, GRU, ...
Sequence Encoding
The Hidden Variable of the last RNN node  β‰ˆ  the Sentence (Sequence)
Idea: the information accumulated in the RNN ends up being a vector form of the whole sequence.
Sequence Decoding
RNN Layer
Let’s go to Seoul-Station Starbucks
copy
Sequence Encoding-Decoding Approach
RNN Layer
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
RNN Layer
Let’s go to Seoul-station Starbucks
Encoding
Decoding
[Reference] Translation Pyramid
Bernard Vauquois' pyramid showing comparative
depths of intermediary representation, interlingual
machine translation at the peak, followed by transfer-
based, then direct translation.
[ http://en.wikipedia.org/wiki/Machine_translation]
RNN Encoder-Decoder for Machine Translation
Cho et al. (2014)
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
Limitation - RNN Encoder-Decoder Approach
As sentences get longer, the limited hidden variable cannot hold enough information, so long sentences are not translated well.
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
GLOBAL SUMMARIZATION
Can whole 'sequence-wise' information be reflected in the Data Transformation?
- Blending multiple items
- Handling items as one item
ATTENTION MECHANISM
http://artquestionsanswered.com/wp-content/uploads/2014/05/blending-Colored-Pencil-Techniques.jpg
How about blending the sequence of information?
Attention Without Context – In/Out View
x1, x2, x3 (Xs)  β†’  one vector:   Vs β†’ V
Note: the idea of Global Summarization is independent of RNNs, but because the Context is usually tied to RNN outputs, it is typically presented as an advanced RNN technique.
Summarize the Sequence Data into one vector by reflecting importance (Attention):
- Some of the Data items will be more important than others
- Reflect that importance
- Element-wise summation  =  Blending
Attention without Context
Items 1, 2, 3 (Xs) are each scored by a small network A, giving scores a1, a2, a3.
Softmax turns the scores into weights:
w_{a1} = e^{a1} / (e^{a1} + e^{a2} + e^{a3}),   w_{a2} = e^{a2} / (e^{a1} + e^{a2} + e^{a3}),   w_{a3} = e^{a3} / (e^{a1} + e^{a2} + e^{a3})
Element-wise multiplication of each item by its weight gives 1', 2', 3'; element-wise summation of these gives the V-dimensional summary.
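A sketch of this attention computation; the scoring network A is assumed here to be a single linear layer:

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    xs = np.random.randn(3, 5)        # three items x1..x3, each 5-dim
    A = np.random.randn(5, 1)         # tiny score network: x_i -> a_i

    a = (xs @ A).ravel()              # a1, a2, a3
    w = softmax(a)                    # w_ai = e^{a_i} / sum_j e^{a_j}
    blended = (w[:, None] * xs).sum(axis=0)   # element-wise multiply, then sum over items
    print(w, blended.shape)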
Attention Without Context – Simple View
x1 x2 x3
Attention
Xs
Summarization Vector
Attention Context – In/Out View
x1 x2 x3
Xs
Compute the Attention on the Sequence Data by reflecting the Context C,
then apply it back to the data to produce the summarized Data.
Vs β†’ V's
C
Attention with Context
Items 1, 2, 3 (Xs) are scored together with the Context C, giving scores a1, a2, a3.
Softmax:  w_{a1} = e^{a1} / (e^{a1} + e^{a2} + e^{a3}),  and likewise for w_{a2} and w_{a3}.
Element-wise multiplication of each item by its weight gives 1', 2', 3'; element-wise summation gives the summary.
Attention With Context – Simple View
x1 x2 x3
Attention
Xs
C
How to Incorporate the Context
How should C and x_i be combined into a score?
concat:   score = v^T tanh( W [x_i ; c] )
          concatenate x_i and c ([1 x (V+K)]), multiply by W ([(V+K) x M]), apply tanh, then multiply by v ([M x 1]):   V+K β†’ M β†’ 1
general:  score = c W x_i^T
          [1 x K] [K x V] [V x 1]  β†’  1
dot:      score = c x_i^T
          [1 x K] [V x 1]  β†’  1      !!! only possible when K = V
(Minh-Thang Luong et al., β€œEffective Approaches to Attention-based Neural Machine Translation”)
Attention Modeling
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
Let’s go to Starbucks near Seoul station
:: Korean β†’ English translation
Attention Modeling
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
Let’s go to Starbucks near Seoul station
Encoding
Attention Modeling
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
Let’s go to Starbucks near Seoul station
Decoding
β† Attention to the corresponding input; as each output word is decoded, the attention shifts to the relevant source words.
:: By selecting the encoding information used at each decoding step dynamically (attending to what is important), decoding can be done much better.
[Review] Attention With Context – Simple View
x1 x2 x3
Attention
Xs
C
Encoding + Attention Decoding With Context
x1 x2 x3
Att
S1
Att
S2
Att
S3
Encoding
h1 h2 h3
Decoding
C1 C2 C3
Encoding + Attention Decoding With Context (from prev. RNN output)
x1 x2 x3
Att
S1
Att
S2
Att
S3
Encoding
h1 h2 h3
Decoding
Ο†
C_t = H_{t-1}   (Previous Hidden state as Context)
Does the length of the Input Sequence
have to equal the length of the Output Sequence?
No!
Encoding + Attention Decoding With Context (from prev. RNN output)
x1 x2 x3
Att
S1
Att
S2
Att
S3
Encoding
h1 h2 h3
Decoding
Ο†
C_t = H_{t-1}   (Previous Hidden state as Context)
Att
S4
h4
Len(Output Sequence) = Number of Attention Modules   (here: 3 input items, 4 attention modules, 4 outputs)
Attention Mechanism Review
Attention Mechanism  =  Global, Selective, Dynamic Sequence Summarization (Blending)
Case Study: Simple Ideas β†’ Complex Model
1) We want to include as much temporal information as possible.
2) Use the blended input when decoding.
3) We want to use past information as the Attention Context.
Case Study: Attention-based Neural Translation Model - Bahdanau et al. (2014)
One-Hot
BiRNN
EMB EMB EMB EMB
x1 x2 x3 x4
F1 F2 F3 F4
B1 B2 B3 B4
N1 N2 N3 N4Concat
Att
S1
Att
S2
Att
S3
Ο†
A A
C
h1 h2 h3
N1 N2 N3 N4
A A
Softmax
D1 D2 D2
EMB’ EMB’ EMB’
D0
EMB’
Special Symbol
A : alignment weight
EMB, EMB’ : Embedding
Case Study: Attention-based Neural Translation Model - Bahdanau et al. (2014)
(The same diagram, annotated to show where Idea 1 and Ideas 2/3 are realized in the model.)
Attention Modeling
Bahdanau et al. (2014)
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
- Bidirectional RNN for Encoding
- Attention Modeling
Performance – Attention Modeling @ Machine Translation
:: μ„ λ³„μ μœΌλ‘œ κ°€μ€‘μΉ˜κ°€ 적용된 Encoding 이 μ μš©λ¨μœΌλ‘œμ„œ, κΈ΄ λ¬Έμž₯μ—μ„œλ„
λ²ˆμ—­ μ„±λŠ₯이 λ–¨μ–΄μ§€μ§€ μ•ŠλŠ”λ‹€.
Xu et al. (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention Modeling for Image2Text
Attention Modeling for Image2Text
Xu et al. (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Simply replacing the Text Sequence Encoding in the Encoder/Decoder with an Image Sequence Encoding already works.
CONVOLUTIONAL SEQUENCE ENCODING
ν•˜λ‚˜μ˜ 데이터인 κ²ƒμ²˜λŸΌ λ‹€λ£¨μž
x1 x2 x3
x1 x2 x3
concatenate
Word Embedding οƒ  Character Embedding
λŒ€ν•œμƒκ³΅νšŒμ˜μ†ŒWord embedding
Character embedding λŒ€ ν•œ 상 곡 회 의 μ†Œ
λŒ€ ν•œ 상 곡 회 의 μ†Œ
οƒΌ Question 1 : λ‹¨μˆœ Concatenate μ‹œν‚€λŠ” κ²ƒλ§Œ κ°€μ§€κ³  될까?
οƒΌ Question 2 : 길이에 상관없이 μΌκ΄€λœ Vector Size λ₯Ό μœ μ§€ν•˜κ³  μ‹Άλ‹€.
7
β‰ˆ
οƒΌ Concatenate μ‹œν‚¨ 데이터λ₯Ό 마치 Image 처럼 μƒκ°ν•˜μž.
οƒΌ 단, row ν•œ 쀄이 char embedding λ‹¨μœ„μž„μ„ κΈ°μ–΅
Convolutional Word Embedding with Char-Embedding (1)
Filters of range 1 (Filter 1-1, 1-2): 4 β†’ 1;   filters of range 3 (Filter 3-1, 3-2, 3-3): 3*4 β†’ 1;   filter of range 5 (Filter 5-1): 5*4 β†’ 1
Convolutional Word Embedding with Char-Embedding (2)
Applying Filter 3-1: stride it over time across λŒ€ ν•œ 상 곡 회 의 μ†Œ.
Convolutional Word Embedding with Char-Embedding (3)
Applying Filter 3-2: stride it over time across λŒ€ ν•œ 상 곡 회 의 μ†Œ.
Convolutional Word Embedding with Char-Embedding (4)
Applying Filter 3-3: stride it over time across λŒ€ ν•œ 상 곡 회 의 μ†Œ.
Convolutional Word Embedding with Char-Embedding (5)
λŒ€ ν•œ 상 곡 회 의 μ†Œ
Each range-3 filter produces Length βˆ’ Range + 1 values over time (7 βˆ’ 3 + 1 = 5);
take the largest value along the time axis for each filter (max-pooling-over-time).
With num_filter = 3 such filters, the 4x7 input is reduced to 3 values:   4x7 β†’ 3.
Convolutional Word Embedding with Char-Embedding (6)
λŒ€ ν•œ 상 곡 회 의 μ†Œ
Apply the remaining filters in the same way (max pooling for each) and concatenate the results
= convolutional word embedding.
Convolutional Word Embedding with Char-Embedding - Summary
λŒ€ ν•œ 상 곡 회 의 μ†Œ
Input size:  dim(char) * Length   (4 * 7 = 28)
Output size: Ξ£_{r ∈ Ranges} num_of_filters(r)   (2 + 3 + 1 = 6)
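A sketch of the character-level convolution with max-pooling-over-time described above, for one range-3 filter (char dimension 4, word length 7; the embeddings and filter values are random placeholders):

    import numpy as np

    dim_char, length, width = 4, 7, 3
    chars = np.random.randn(dim_char, length)     # char embeddings as a 4 x 7 "image"
    filt = np.random.randn(dim_char, width)       # one filter of range 3 (3*4 -> 1)

    feats = np.array([np.sum(chars[:, t:t+width] * filt)    # stride over time
                      for t in range(length - width + 1)])  # Length - Range + 1 = 5 values
    word_feature = feats.max()                    # max-pooling-over-time -> 1 value per filter

Repeating this for every filter (2 of range 1, 3 of range 3, 1 of range 5) and concatenating the maxima gives the 6-dimensional convolutional word embedding.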
Convolutional Sentence Embedding with Char-Embedding
μ„œ 울 μ—­ _ κ·Ό 처 _ 슀 타 λ²… 슀 _ κ°€ 쀘
Input size:  dim(char) * Length   (4 * 14 = 56)
Output size: Ξ£_{r ∈ Ranges} num_of_filters(r)   (2 + 3 + 1 = 6)
Sentence Embedding is done in exactly the same way as word Embedding.
Summary
Neural Network
Data Transformation
Recurrent Neural Network
Temporal Summarization
Attention Mechanism
Global Summarization
Convolutional Sequence Encoding
Q/A
Thank you.
Sangkeun Jung, Ph.D
Intelligence Architect
Senior Researcher, AI Tech. Lab. SKT Future R&D
Contact : hugmanskj@gmail.com, hugman@sk.com
Lecture Blog: www.hugman.re.kr
Lecture Video: https://goo.gl/7NL5hV
Lecture Slides: https://goo.gl/6NfR1V
Code Share: https://github.com/hugman
Facebook: https://goo.gl/1RML3C