Sequence Encoding
Multiple Vector Encoding
Sangkeun Jung (정상근)
Understandable even if you know nothing at all about deep learning
AI Problem Formulation
Classification Clustering
λŒ€λΆ€λΆ„μ˜ 문제λ₯Ό Classification ν˜Ήμ€ Clustering 문제둜 바라보고 μ‹œλ„ν•¨
Image Classification
from Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012.
Classification Formulation
A Big & Deep Network maps each input to one of Class 1, Class 2, Class 3, Class 4.
!!! Note that there is no β€œdon't know” class.
Data Transformation
= Abstraction at the input/output level
Ex) Image Classification
Image pixel data β†’ class distribution over Class 1, Class 2, Class 3, Class 4, Class ...
Data Transformation @ Image Classification
Graphical Notation for Data
x = (10, 2, 8, 2, 15, 3, 5, 1, 5)  =  [10 2 8; 2 15 3; 5 1 5]
The 9-dimensional data x is drawn in 3x3 form purely for convenience of presentation; both notations denote the same Data X.
V to 1
[10 2 8; 2 15 3; 5 1 5]  β†’  ?
How can the 3x3 data be summarized into a 1x1 value?
= (Data) Abstraction / Encoding / Summarization / Reduction ...
V to 1 – Simple Method
[10 2 8; 2 15 3; 5 1 5]  β†’  15   (center one)
[10 2 8; 2 15 3; 5 1 5]  β†’  5.6  (average)
[10 2 8; 2 15 3; 5 1 5]  β†’  5    (median)
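A minimal NumPy sketch of the three summaries above; the array values are the example grid, everything else is illustrative:

    import numpy as np

    x = np.array([[10, 2, 8],
                  [2, 15, 3],
                  [5, 1, 5]], dtype=float)

    center = x[1, 1]        # center element -> 15.0
    average = x.mean()      # mean of all 9 values -> about 5.67
    median = np.median(x)   # median -> 5.0
    print(center, average, median)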
V to 1 – Weighted Method
Weighted Sum:  [10 2 8; 2 15 3; 5 1 5] (Value)  βŠ—  [3 1 5; 2 6 3; 9 3 6] (Weight), element-wise multiplication then summation  β†’  28.1
Weighted Average:  [10 2 8; 2 15 3; 5 1 5] (Value)  βŠ—  [3/9 1/9 5/9; 2/9 6/9 3/9; 9/9 3/9 6/9] (Weight)  β†’  6.65
V to 1 – General Form
Data: [x1 x2 x3; x4 x5 x6; x7 x8 x9]   ~   Weights: [w1 w2 w3; w4 w5 w6; w7 w8 w9]
Weighted Sum (element-wise multiplication, then summation):
v = x_1*w_1 + x_2*w_2 + ... + x_9*w_9
V to 1 - Linear Algebra
Weighted Sum as a matrix product:
X = [x1 x2 x3 x4 x5 x6 x7 x8 x9]          ([1 x 9] matrix)
W = [w1; w2; w3; w4; w5; w6; w7; w8; w9]  ([9 x 1] matrix)
X W = Ξ£_{i=1}^{9} x_i * w_i               ([1 x 1] matrix)
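A sketch of the same weighted sum as a [1 x 9] Γ— [9 x 1] matrix product in NumPy; assuming the weights are the earlier weight grid scaled by 1/9, the product reproduces the 28.1 example value:

    import numpy as np

    x = np.array([[10, 2, 8, 2, 15, 3, 5, 1, 5]], dtype=float)        # [1 x 9]
    w = np.array([[3, 1, 5, 2, 6, 3, 9, 3, 6]], dtype=float).T / 9.0  # [9 x 1] (assumed scaling)

    v = x @ w          # [1 x 1] matrix: sum_i x_i * w_i
    print(v)           # about 28.1, same as the element-wise multiply-and-sum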
Convolution Neural Network (1)
http://arxiv.org/pdf/1406.3332v1.pdf
Mairal et al., Convolutional Kernel Networks, 2014
Convolution Neural Network (2)
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
filter: [1 0 1; 0 1 0; 1 0 1]
CNN – Examples
http://docs.gimp.org/ko/plug-in-convmatrix.html
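A sketch of applying a small 3x3 filter such as the one above as a sliding weighted sum over a toy 5x5 array (plain cross-correlation, the form most CNN libraries use; the toy image is an assumption):

    import numpy as np

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    filt = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 0, 1]], dtype=float)

    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum of each 3x3 window with the filter
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
    print(out)   # 3x3 feature map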
V to V’
Transform data made of V components into data made of V' components.
[10 2 8; 2 15 3; 5 1 5]  β†’  (?, ?)        V = 9,  V' = 2
V to V’ – generalized method
Data: [x1 x2 x3; x4 x5 x6; x7 x8 x9]
Weighted Sum with weight set 1 [w_{1,1} ... w_{1,9}]:   v_1 = x_1*w_{1,1} + x_2*w_{1,2} + ... + x_9*w_{1,9}
Weighted Sum with weight set 2 [w_{2,1} ... w_{2,9}]:   v_2 = x_1*w_{2,1} + x_2*w_{2,2} + ... + x_9*w_{2,9}
β†’  (v_1, v_2)
V to V’ – Linear Algebra
Weighted Sum as a matrix product:
X = [x1 x2 x3 x4 x5 x6 x7 x8 x9]                               ([1 x 9] matrix)
W = [w_{1,1} w_{2,1}; w_{1,2} w_{2,2}; ... ; w_{1,9} w_{2,9}]  ([9 x 2] matrix)
X W = ( Ξ£_{i=1}^{9} x_i * w_{1,i} ,  Ξ£_{i=1}^{9} x_i * w_{2,i} )   ([1 x 2] matrix)
Fully Connected Network
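A sketch of the [1 x 9] Γ— [9 x 2] projection, i.e. a fully connected layer without bias or nonlinearity; the weight values here are random placeholders:

    import numpy as np

    x = np.array([[10, 2, 8, 2, 15, 3, 5, 1, 5]], dtype=float)  # [1 x 9]
    W = np.random.randn(9, 2)                                   # [9 x 2] weight set

    v = x @ W          # [1 x 2]: (sum_i x_i*w_{1,i}, sum_i x_i*w_{2,i})
    print(v.shape)     # (1, 2)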
V to V’ – Projection Notation
[V-dimensional data]  β†’ (W) β†’  [V'-dimensional data]
Let W denote the Weight Set that transforms data of dimension V into dimension V'; W is a [V x V'] matrix.
Projection with Context (1)
ν•˜λ‚˜μ˜ 데이터λ₯Ό μš”μ•½ν•  λ•Œ 도움이 λ˜λŠ”
λ‹€λ₯Έ λΆ€κ°€ 정보가 μžˆλ‹€λ©΄ μ–΄λ–»κ²Œ λ°˜μ˜ν•  수 μžˆμ„κΉŒ?
DataContext
(뢀가정보)
Projection with Context (2)
Data: [x1 x2 x3; x4 x5 x6; x7 x8 x9]      Context (side information): [c1 c2 c3 c4]
Project the Data with weight W and the Context with weight Wc, each to (I, II), and add the two results:  (I, II) + (I, II) = (I, II).
Weights W and Wc are applied to the Data and the Context, respectively.
V to V’ with Context - Linear Algebra
X = [x1 x2 x3 x4 x5 x6 x7 x8 x9]   ([1 x 9] matrix),   W: [9 x 2] matrix with columns (w_{1,i}) and (w_{2,i})
X W = ( Ξ£_{i=1}^{9} x_i * w_{1,i} ,  Ξ£_{i=1}^{9} x_i * w_{2,i} )       ([1 x 2] matrix)
C = [c1 c2 c3 c4]   ([1 x 4] matrix),   Wc: [4 x 2] matrix with columns (w^c_{1,i}) and (w^c_{2,i})
C Wc = ( Ξ£_{i=1}^{4} c_i * w^c_{1,i} ,  Ξ£_{i=1}^{4} c_i * w^c_{2,i} )   ([1 x 2] matrix)
V to V’ with Context - Linear Algebra (simple)
[X C] = [x1 x2 x3 x4 x5 x6 x7 x8 x9 c1 c2 c3 c4]        ([1 x (9+4)] matrix)
Stacked weights [W; Wc]                                  ([(9+4) x 2] matrix)
[X C] [W; Wc] = ( Ξ£_{i=1}^{9} x_i * w_{1,i} + Ξ£_{i=1}^{4} c_i * w^c_{1,i} ,  Ξ£_{i=1}^{9} x_i * w_{2,i} + Ξ£_{i=1}^{4} c_i * w^c_{2,i} )   ([1 x 2] matrix)
Concatenate the input Data and Context, and process them with a single combined Weight Matrix.
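A sketch checking the equivalence stated above: projecting Data and Context separately and adding the results equals projecting their concatenation with one stacked weight matrix (all weights are random placeholders):

    import numpy as np

    x = np.random.randn(1, 9)    # data,    [1 x 9]
    c = np.random.randn(1, 4)    # context, [1 x 4]
    W = np.random.randn(9, 2)    # data weights,    [9 x 2]
    Wc = np.random.randn(4, 2)   # context weights, [4 x 2]

    separate = x @ W + c @ Wc                          # [1 x 2]
    combined = np.hstack([x, c]) @ np.vstack([W, Wc])  # [1 x 13] @ [13 x 2]
    print(np.allclose(separate, combined))             # True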
V β†’ V' β†’ 1
Instead of reducing V-dimensional data to 1 in a single step,
it can first be reduced to an intermediate V' and then reduced again to 1.
[10 2 8; 2 15 3; 5 1 5]  β†’(W)β†’  (?, ?)  β†’(W')β†’  ?        V = 9,  V' = 2
V β†’ V' β†’ 1 : Multi-Layer Perceptron  (likewise V β†’ V' β†’ V'' β†’ 1)
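A minimal sketch of V β†’ V' β†’ 1 (9 β†’ 2 β†’ 1); the tanh between the two projections is an assumed nonlinearity, the slide only shows the two weight sets:

    import numpy as np

    x = np.random.randn(1, 9)     # V = 9
    W1 = np.random.randn(9, 2)    # V  -> V' (2)
    W2 = np.random.randn(2, 1)    # V' -> 1

    h = np.tanh(x @ W1)           # intermediate summary, [1 x 2]
    y = h @ W2                    # final 1-dimensional summary, [1 x 1]
    print(y.shape)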
ν•˜λ‚˜μ˜ Data
μš”μ•½
μ—¬λŸ¬ 개 Data
μš”μ•½
Multiple Data Abstraction / Encoding / Summarization / Reduction ….
Multiple Item 을 μš”μ•½ν•˜λŠ” 방법은?
10 2 8
2 15 3
5 1 5
13 4 8
4 5 2
1 45 31
6 3 4
1 7 1
3 4 0
? ? ?
? ? ?
? ? ?
Data 1 Data 2 Data 3
?
?
?
V V
V’
1
Vs to V’
Data 1 = [10 2 8; 2 15 3; 5 1 5],  Data 2 = [13 4 8; 4 5 2; 1 45 31],  Data 3 = [6 3 4; 1 7 1; 3 4 0]
Element-wise Average  β†’  [9.6 ? 6.6; ? ? ?; ? ? ?],   e.g. (10+13+6)/3 = 9.6 and (8+8+4)/3 = 6.6
Vs to V’
Data 1 = [10 2 8; 2 15 3; 5 1 5],  Data 2 = [13 4 8; 4 5 2; 1 45 31],  Data 3 = [6 3 4; 1 7 1; 3 4 0]
Weights: w_1 = 0.2,  w_2 = 0.4,  w_3 = 0.4
0.2 Γ— Data 1 = [2 0.4 1.6; 0.4 3 0.6; 1 0.2 1.0]
0.4 Γ— Data 2 = [5.2 1.6 3.2; 1.6 2 0.8; 0.4 18 12.4]
0.4 Γ— Data 3 = [2.4 1.2 1.6; 0.4 2.8 0.4; 1.2 1.6 0]
Element-wise multiplication, then element-wise summation  β†’  [9.6 3.2 6.4; 2.4 7.8 1.8; 2.6 19.8 13.4]
Each Data item is multiplied by its weight and the results are added.
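A sketch of this weighted element-wise blend, using the weights from the example:

    import numpy as np

    d1 = np.array([[10, 2, 8], [2, 15, 3], [5, 1, 5]], dtype=float)
    d2 = np.array([[13, 4, 8], [4, 5, 2], [1, 45, 31]], dtype=float)
    d3 = np.array([[6, 3, 4], [1, 7, 1], [3, 4, 0]], dtype=float)
    w = [0.2, 0.4, 0.4]

    blended = w[0] * d1 + w[1] * d2 + w[2] * d3   # element-wise multiply, then sum
    print(blended)   # about [[9.6 3.2 6.4] [2.4 7.8 1.8] [2.6 19.8 13.4]]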
Vs β†’ V's β†’ ?
Data 1, Data 2, Data 3 are each projected with W to 1, 2, 3 (element-wise calculation).
After projecting each Data item individually, think about sequence-wise summarization methods on the reduced data.
N to 1 Problem
If the N-to-1 problem can be solved well, in what areas can it be applied?
- Language modeling:        λŒ€ν•œ λ―Όκ΅­ λ§Œμ„Έ
- Dialogue modeling:        μ•ŒλžŒ μ’€ μΌœμ€„λž˜  β†’  Set.Alarm
- Sentiment analysis:       이 μ˜ν™” μ§±!  β†’  Positive
- Document classification:  w w ... w  β†’  Topic = music ...
- Zero pronoun detection:   w w ... w  β†’  Pronoun Dropped ...
- ...
TEMPORAL SUMMARIZATION
Can 'temporal information' be reflected in the Data Transformation?
Context = side information
Vs β†’ V's β†’ ?
Data 1 β†’ 1 (via W);   Data 2 β†’ 2 (via W and Wc, with the previous step as Context)
The current input (Data 2) is combined with information from the past.
Vs β†’ V's β†’ V'
Data 1 β†’ 1;   Data 2 β†’ 2 (current input Data 2 + past information, via W and Wc);   Data 3 β†’ 3 (current input Data 3 + past information, via W and Wc)
β†’ Data that reflects all the information seen so far.
Recurrent Neural Network
Graphical Notation
Data t-1, Data t, Data t+1 are the Input Data; W is applied to the input and U carries the Hidden State across time steps (the RNN Layer).
Simplified Version: the whole layer is drawn as a single box.     Vs β†’ V's
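A minimal vanilla-RNN sketch consistent with the notation above (W on the input, U on the previous hidden state); tanh and the dimensions are assumptions:

    import numpy as np

    def rnn_forward(xs, W, U, h0):
        """xs: list of [1 x V] inputs; returns list of [1 x V'] hidden states."""
        h, hs = h0, []
        for x in xs:
            h = np.tanh(x @ W + h @ U)   # current input + past information
            hs.append(h)
        return hs

    V, Vp = 9, 4
    xs = [np.random.randn(1, V) for _ in range(3)]      # Data 1..3
    W, U = np.random.randn(V, Vp), np.random.randn(Vp, Vp)
    hs = rnn_forward(xs, W, U, np.zeros((1, Vp)))
    summary = hs[-1]   # last hidden state as the sequence summary (Vs -> 1)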
Forward / Backward RNN
Forward RNN reads Data 1 β†’ Data 2 β†’ Data 3;  Backward RNN reads Data 3 β†’ Data 2 β†’ Data 1.    (each: Vs β†’ V's)
Bidirectional RNN
Data 1, Data 2, Data 3 are processed by both a forward and a backward RNN, and the two hidden states (V' each) are concatenated at every position:   Vs β†’ (2*V')s
Stacking RNN
Data 1, Data 2, Data 3 pass through stacked RNN layers, the output sequence of one layer feeding the next:   Vs β†’ V's
Input / Output of an RNN
Data 1, Data 2, Data 3  β†’  Out 1, Out 2, Out 3
- Vs β†’ V's,  with Len(Vs) = Len(V's)
- An output item can be produced for every input item.
Data 1, Data 2, Data 3  β†’  Out   (Summarization)
- Vs β†’ 1
- The input data can be summarized while reflecting temporal information.
Limitations of the Basic RNN Model
Data 1, Data 2, Data 3  β†’  Out 1, Out 2, Out 3
Out 1 cannot reflect any information from Data 2 and Data 3,
and information from the distant Data 1 is hard to propagate forward.
Remedies:
1) Global Summarization
2) LSTM, GRU, ...
Sequence Encoding
The Hidden Variable of the last RNN node  β‰ˆ  the Sentence (Sequence)
Idea: the information accumulated in the RNN ends up being a vector form of the whole sequence.
Sequence Decoding
RNN Layer
Let’s go to Seoul-Station Starbucks
copy
Sequence Encoding-Decoding Approach
RNN Layer
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
RNN Layer
Let’s go to Seoul-station Starbucks
Encoding
Decoding
[Reference] Translation Pyramid
Bernard Vauquois' pyramid showing comparative
depths of intermediary representation, interlingual
machine translation at the peak, followed by transfer-
based, then direct translation.
[ http://en.wikipedia.org/wiki/Machine_translation]
RNN Encoder-Decoder for Machine Translation
Cho et al. (2014)
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
Limitation - RNN Encoder-Decoder Approach
As sentences get longer, the limited hidden variable cannot hold enough information, so long sentences are not translated well.
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
GLOBAL SUMMARIZATION
Can whole 'sequence-wise' information be reflected in the Data Transformation?
- Blending multiple items
- Handling items as one item
ATTENTION MECHANISM
http://artquestionsanswered.com/wp-content/uploads/2014/05/blending-Colored-Pencil-Techniques.jpg
How about blending the sequence of information?
Attention Without Context – In/Out View
x1, x2, x3 (Xs)  β†’  one vector:   Vs β†’ V
Note: the idea of Global Summarization is independent of RNNs, but because the Context is usually tied to RNN outputs, it is typically presented as an advanced RNN technique.
Summarize the Sequence Data into one vector by reflecting importance (Attention):
- Some of the Data items will be more important than others
- Reflect that importance
- Element-wise summation  =  Blending
Attention without Context
Items 1, 2, 3 (Xs) are each scored by a small network A, giving scores a1, a2, a3.
Softmax turns the scores into weights:
w_{a1} = e^{a1} / (e^{a1} + e^{a2} + e^{a3}),   w_{a2} = e^{a2} / (e^{a1} + e^{a2} + e^{a3}),   w_{a3} = e^{a3} / (e^{a1} + e^{a2} + e^{a3})
Element-wise multiplication of each item by its weight gives 1', 2', 3'; element-wise summation of these gives the V-dimensional summary.
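A sketch of this attention computation; the scoring network A is assumed here to be a single linear layer:

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    xs = np.random.randn(3, 5)        # three items x1..x3, each 5-dim
    A = np.random.randn(5, 1)         # tiny score network: x_i -> a_i

    a = (xs @ A).ravel()              # a1, a2, a3
    w = softmax(a)                    # w_ai = e^{a_i} / sum_j e^{a_j}
    blended = (w[:, None] * xs).sum(axis=0)   # element-wise multiply, then sum over items
    print(w, blended.shape)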
Attention Without Context – Simple View
x1 x2 x3
Attention
Xs
Summarization Vector
Attention Context – In/Out View
x1 x2 x3
Xs
Compute the Attention on the Sequence Data by reflecting the Context C,
then apply it back to the data to produce the summarized Data.
Vs β†’ V's
C
Attention with Context
Items 1, 2, 3 (Xs) are scored together with the Context C, giving scores a1, a2, a3.
Softmax:  w_{a1} = e^{a1} / (e^{a1} + e^{a2} + e^{a3}),  and likewise for w_{a2} and w_{a3}.
Element-wise multiplication of each item by its weight gives 1', 2', 3'; element-wise summation gives the summary.
Attention With Context – Simple View
x1 x2 x3
Attention
Xs
C
How to Incorporate the Context
How should C and x_i be combined into a score?
concat:   score = v^T tanh( W [x_i ; c] )
          concatenate x_i and c ([1 x (V+K)]), multiply by W ([(V+K) x M]), apply tanh, then multiply by v ([M x 1]):   V+K β†’ M β†’ 1
general:  score = c W x_i^T
          [1 x K] [K x V] [V x 1]  β†’  1
dot:      score = c x_i^T
          [1 x K] [V x 1]  β†’  1      !!! only possible when K = V
(Minh-Thang Luong et al., β€œEffective Approaches to Attention-based Neural Machine Translation”)
Attention Modeling
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
Let’s go to Starbucks near Seoul station
:: Korean β†’ English translation
Attention Modeling
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
Let’s go to Starbucks near Seoul station
Encoding
Attention Modeling
μ„œμšΈμ—­ 근처 μŠ€νƒ€λ²…μŠ€ 둜 κ°€μž
Let’s go to Starbucks near Seoul station
Decoding
β† Attention to the corresponding input; as each output word is decoded, the attention shifts to the relevant source words.
:: By selecting the encoding information used at each decoding step dynamically (attending to what is important), decoding can be done much better.
[Review] Attention With Context – Simple View
x1 x2 x3
Attention
Xs
C
Encoding + Attention Decoding With Context
x1 x2 x3
Att
S1
Att
S2
Att
S3
Encoding
h1 h2 h3
Decoding
C1 C2 C3
Encoding + Attention Decoding With Context (from prev. RNN output)
x1 x2 x3
Att
S1
Att
S2
Att
S3
Encoding
h1 h2 h3
Decoding
Ο†
C_t = H_{t-1}   (Previous Hidden state as Context)
Does the length of the Input Sequence
have to equal the length of the Output Sequence?
No!
Encoding + Attention Decoding With Context (from prev. RNN output)
x1 x2 x3
Att
S1
Att
S2
Att
S3
Encoding
h1 h2 h3
Decoding
Ο†
C_t = H_{t-1}   (Previous Hidden state as Context)
Att
S4
h4
Len(Output Sequence) = Number of Attention Modules   (here: 3 input items, 4 attention modules, 4 outputs)
Attention Mechanism Review
Attention Mechanism  =  Global, Selective, Dynamic Sequence Summarization (Blending)
Case Study: Simple Ideas β†’ Complex Model
1) We want to include as much temporal information as possible.
2) Use the blended input when decoding.
3) We want to use past information as the Attention Context.
Case Study: Attention-based Neural Translation Model - Bahdanau et al. (2014)
One-Hot
BiRNN
EMB EMB EMB EMB
x1 x2 x3 x4
F1 F2 F3 F4
B1 B2 B3 B4
N1 N2 N3 N4Concat
Att
S1
Att
S2
Att
S3
Ο†
A A
C
h1 h2 h3
N1 N2 N3 N4
A A
Softmax
D1 D2 D2
EMB’ EMB’ EMB’
D0
EMB’
Special Symbol
A : alignment weight
EMB, EMB’ : Embedding
Case Study: Attention-based Neural Translation Model - Bahdanau et al. (2014)
(The same diagram, annotated to show where Idea 1 and Ideas 2/3 are realized in the model.)
Attention Modeling
Bahdanau et al. (2014)
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
- Bidirectional RNN for Encoding
- Attention Modeling
Performance – Attention Modeling @ Machine Translation
:: μ„ λ³„μ μœΌλ‘œ κ°€μ€‘μΉ˜κ°€ 적용된 Encoding 이 μ μš©λ¨μœΌλ‘œμ„œ, κΈ΄ λ¬Έμž₯μ—μ„œλ„
λ²ˆμ—­ μ„±λŠ₯이 λ–¨μ–΄μ§€μ§€ μ•ŠλŠ”λ‹€.
Xu et al. (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention Modeling for Image2Text
Attention Modeling for Image2Text
Xu et al. (2015)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Simply replacing the Text Sequence Encoding in the Encoder/Decoder with an Image Sequence Encoding already works.
CONVOLUTIONAL SEQUENCE ENCODING
ν•˜λ‚˜μ˜ 데이터인 κ²ƒμ²˜λŸΌ λ‹€λ£¨μž
x1 x2 x3
x1 x2 x3
concatenate
Word Embedding οƒ  Character Embedding
λŒ€ν•œμƒκ³΅νšŒμ˜μ†ŒWord embedding
Character embedding λŒ€ ν•œ 상 곡 회 의 μ†Œ
λŒ€ ν•œ 상 곡 회 의 μ†Œ
οƒΌ Question 1 : λ‹¨μˆœ Concatenate μ‹œν‚€λŠ” κ²ƒλ§Œ κ°€μ§€κ³  될까?
οƒΌ Question 2 : 길이에 상관없이 μΌκ΄€λœ Vector Size λ₯Ό μœ μ§€ν•˜κ³  μ‹Άλ‹€.
7
β‰ˆ
οƒΌ Concatenate μ‹œν‚¨ 데이터λ₯Ό 마치 Image 처럼 μƒκ°ν•˜μž.
οƒΌ 단, row ν•œ 쀄이 char embedding λ‹¨μœ„μž„μ„ κΈ°μ–΅
Convolutional Word Embedding with Char-Embedding (1)
Filters of range 1 (Filter 1-1, 1-2): 4 β†’ 1;   filters of range 3 (Filter 3-1, 3-2, 3-3): 3*4 β†’ 1;   filter of range 5 (Filter 5-1): 5*4 β†’ 1
Convolutional Word Embedding with Char-Embedding (2)
Applying Filter 3-1: stride it over time across λŒ€ ν•œ 상 곡 회 의 μ†Œ.
Convolutional Word Embedding with Char-Embedding (3)
Applying Filter 3-2: stride it over time across λŒ€ ν•œ 상 곡 회 의 μ†Œ.
Convolutional Word Embedding with Char-Embedding (4)
Applying Filter 3-3: stride it over time across λŒ€ ν•œ 상 곡 회 의 μ†Œ.
Convolutional Word Embedding with Char-Embedding (5)
λŒ€ ν•œ 상 곡 회 의 μ†Œ
Each range-3 filter produces Length βˆ’ Range + 1 values over time (7 βˆ’ 3 + 1 = 5);
take the largest value along the time axis for each filter (max-pooling-over-time).
With num_filter = 3 such filters, the 4x7 input is reduced to 3 values:   4x7 β†’ 3.
Convolutional Word Embedding with Char-Embedding (6)
λŒ€ ν•œ 상 곡 회 의 μ†Œ
Apply the remaining filters in the same way (max pooling for each) and concatenate the results
= convolutional word embedding.
Convolutional Word Embedding with Char-Embedding - Summary
λŒ€ ν•œ 상 곡 회 의 μ†Œ
Input size:  dim(char) * Length   (4 * 7 = 28)
Output size: Ξ£_{r ∈ Ranges} num_of_filters(r)   (2 + 3 + 1 = 6)
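A sketch of the character-level convolution with max-pooling-over-time described above, for one range-3 filter (char dimension 4, word length 7; the embeddings and filter values are random placeholders):

    import numpy as np

    dim_char, length, width = 4, 7, 3
    chars = np.random.randn(dim_char, length)     # char embeddings as a 4 x 7 "image"
    filt = np.random.randn(dim_char, width)       # one filter of range 3 (3*4 -> 1)

    feats = np.array([np.sum(chars[:, t:t+width] * filt)    # stride over time
                      for t in range(length - width + 1)])  # Length - Range + 1 = 5 values
    word_feature = feats.max()                    # max-pooling-over-time -> 1 value per filter

Repeating this for every filter (2 of range 1, 3 of range 3, 1 of range 5) and concatenating the maxima gives the 6-dimensional convolutional word embedding.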
Convolutional Sentence Embedding with Char-Embedding
μ„œ 울 μ—­ _ κ·Ό 처 _ 슀 타 λ²… 슀 _ κ°€ 쀘
Input size:  dim(char) * Length   (4 * 14 = 56)
Output size: Ξ£_{r ∈ Ranges} num_of_filters(r)   (2 + 3 + 1 = 6)
Sentence Embedding is done in exactly the same way as word Embedding.
Summary
Neural Network
Data Transformation
Recurrent Neural Network
Temporal Summarization
Attention Mechanism
Global Summarization
Convolutional Sequence Encoding
Q/A
Thank you.
Sangkeun Jung, Ph.D
Intelligence Architect
Senior Researcher, AI Tech. Lab. SKT Future R&D
Contact : hugmanskj@gmail.com, hugman@sk.com
Lecture Blog: www.hugman.re.kr
Lecture Video: https://goo.gl/7NL5hV
Lecture Slides: https://goo.gl/6NfR1V
Code Share: https://github.com/hugman
Facebook: https://goo.gl/1RML3C