RNN Share @ Trend Micro
chuchuhao
Outline
1. Problem Definition
2. Walk through RNN Model
3. How to tune best result
4. (optional) View Problem in CNN
5. (optional) A Sequence-to-Sequence Problem (includes HMM, CTC, Attention)
Author
chuchuhao (Tony)
Linkedin:
https://www.linkedin.com/in/chun-hao-wang-19b94271/
#1 Problem Definition
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
2. Model: Model Structure, Cost Function, Optimization
3. Evaluation: Metrics, Explanation
1-1 Define Problem .. (more tasks)
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
Given a .txt file, possible tasks:
Sequence Learning:
- Turn into pseudo code
- Output behavior sequence (without running)
- Summarize code
- Learn when to segment
Generative Model:
- Code generation
- Code patching
Supervised:
- Generate hash code via similarity
- Action to take (run sandbox or ..)
Unsupervised:
- Detect unusual coding style
Reinforcement:
- According to PC status, decide the next action
1-1 Define Problem ..
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
Task: given a .txt file, decide whether it is plain Text or JavaScript.
1-2 Sourcing Data
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
Sources of js / .txt samples: Data Lake, Web, Benchmark Dataset, John’s Tool (Tim will share)
1-2 Sourcing Data … Sampling Bias
Scatter plot: samples from the Data Lake (Trend Micro), the Web, and a Benchmark Dataset, compared by # of API used vs. # of lines.
A valid JS script that appears in none of these banks: which one does it belong to?
1-3 Cleaning & Normalization
1.Representation
Define Problem
Sourcing Data
Cleaning
Normalization
.txt file
var evwfrEmu = new ActiveXObject("WScript.Shell");
//// 4fQei4pD7o69uOE8Trgp
YThMSpGKzmIcsQ =
evwfrEmu.ExpandEnvironmentStrings("%TEMP%")
+ "ssd" + Math.round(1e8 * Math.random());
var lmXPGbOCL0 = new
ActiveXObject("Msxml2.DOMDocument.6.0");
//// NN08kES93To
var MsslCA = new ActiveXObject("ADODB.Stream");
var dsgkpp = lmXPGbOCL0.createElement("tmp");
//// lZPifi9y
dsgkpp.dataType = "bin.base64";
//// wacLqE
//// bNloPOpvlZ8EdqKEn
MsslCA.Type = 1;
dsgkpp.text = "dmFyIGJUa3l
….
Source File
1-3 Cleaning & Normalization
1.Representation
Define Problem
Sourcing Data
Cleaning
Normalization
.txt file
.txt file
var evwfrE
[23, 2, 19, 0, 6, 23, 24, 7, 19, 32]
Cleaned
Normalized
d = np.zeros((10, 98))  # 10 timesteps, 98-symbol vocabulary
d[0,23] = 1
d[1,2] = 1
….
d[9,32] = 1
# data -> a sparse 2d one-hot matrix
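A minimal Python/NumPy sketch of this normalization step (the 98-symbol vocabulary size and the char-to-index mapping are assumptions; only the shape of the result matters here):

import numpy as np

def one_hot_encode(indices, vocab_size=98):
    # Turn a list of character indices into a (timesteps x vocab) one-hot matrix.
    d = np.zeros((len(indices), vocab_size))
    for t, idx in enumerate(indices):
        d[t, idx] = 1
    return d

x = one_hot_encode([23, 2, 19, 0, 6, 23, 24, 7, 19, 32])
print(x.shape)  # (10, 98)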
1-3 Cleaning & Normalization
1.Representation
Define Problem
Sourcing Data
Cleaning
Normalization
.txt file
F( < x1, x2, …., xn > ) = P(.js)
.txt file
Cleaning
Normalization
Raw Data
Cleaned Data
X y
js
Label ȳ
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
f(X; θ) = P(.js)
f(X; θ) ∈ {1, 0}
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
y = f(X; θ)
What the function looks like:
the Model Structure is a set of functions
(also called the Hypothesis set)
Inference uses one function from the set
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
The goodness of the chosen function: f(X; θ1) vs. f(X; θ2) — which one is better?
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
Find the best function
θ: the model parameters
minimize -1 * goodness(θ) to obtain θ_best
3.Evaluation
Metrics
Explanation
3. Evaluation
θ_best
Split the labeled .txt files (js / text) into a Training Set and a Testing Set.
Did the model learn the task?
3.Evaluation
Metrics
Explanation
3. Evaluation
f(.txt file; θ_best) = JS
label (js) vs. prediction
3.Evaluation
Metrics
Explanation
3. Evaluation
f(.txt file; θ_best) = JS,  label: js  -> True Positive
f(.txt file; θ_best) = JS,  label: txt -> False Positive (False Alarm)
f(.txt file; θ_best) = TXT, label: txt -> True Negative
f(.txt file; θ_best) = TXT, label: js  -> False Negative
3.Evaluation
Metrics
Explanation
3. Evaluation
Accuracy = (TP + TN) / (TP + TN + FP + FN)
3.Evaluation
Metrics
Explanation
3. Evaluation, Goodness (loss) vs. Metrics (accuracy)
          | Loss Function                           | Accuracy
Objective | Measure goodness of hypothesis function | Measure goodness of model on task
Property  | Continuous                              | Discrete
#2 Walk through RNN Model
Problem Definition:
Find a function f(X; θ) = P(.js)
Input : Output:
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
....
0 1 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0
95% probability
might be javascript
var evwfrE 98x10
2.1 Recall DNN
Problem Definition:
Find a function f(X; θ) = P(.js)
Input : DNN Input:
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
....
0 1 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0
var evwfrE : 98x10 one-hot matrix -> flattened to a (98*10)x1 = 980x1 vector
Each character column is stacked into one long vector; every entry is a scalar (0 or 1).
2.1 Recall DNN, Model Structure
‘v’
‘a’
…
‘s’
…
‘E’
1
2
j
...... 1
2
i
......
1
Input Layer Hidden Layer Output Layer
Input Layer ∈ R^(980x1)
Hidden Layer ∈ R^(4x1)
Output Layer ∈ R^(1x1)
Vertices here are `Scalars`
Edges here are `Operators`
2.1 Recall DNN, Model Structure
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
a_i = A(Z_i)
Z_i = W_i1 * x_1 + … + W_ij * x_j + ....
Input Layer  : X ∈ R^(980x1)
Hidden Layer : a ∈ R^(4x1)
Output Layer : y ∈ R^(1x1)
Weights      : W ∈ R^(4x980)
Activation Func : A : R -> R
2.1 Recall DNN, Model Structure
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
a = A( W0 * X + b0 )
y = W1 * a + b1
Input Layer  : X ∈ R^(980x1)
Hidden Layer : a ∈ R^(4x1)
Output Layer : y ∈ R^(1x1)
Weights : W0 ∈ R^(4x980), W1 ∈ R^(1x4)
Bias    : b0 ∈ R^(4x1),  b1 ∈ R^(1x1)
Activation Func : A : R -> R
2.1 Recall DNN, Model Structure = Inference
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
a = A( W0 * X + b0 )
y = W1 * a + b1
if y > 0.5 : Javascript
else : TEXT
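A minimal NumPy sketch of this inference step. The parameters are random placeholders (a real model would use trained W0, b0, W1, b1), and sigmoid is assumed as the activation A:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(4, 980)), np.zeros((4, 1))   # hidden layer
W1, b1 = rng.normal(size=(1, 4)),   np.zeros((1, 1))   # output layer
X = rng.integers(0, 2, size=(980, 1)).astype(float)    # flattened one-hot input

a = sigmoid(W0 @ X + b0)        # a = A(W0*X + b0)
y = (W1 @ a + b1).item()        # y = W1*a + b1
print("Javascript" if y > 0.5 else "TEXT")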
2.1 Recall DNN, Model Structure = Inference
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
y = f(X; θ) = P(.js)
W0 ∈ R^(4x980), W1 ∈ R^(1x4)
b0 ∈ R^(4x1),  b1 ∈ R^(1x1)
θ = { W0, b0, W1, b1 }
|θ| = total parameters = 4*980 + 4 + 4 + 1 = 3929
F(X) = P(.js): the true probability distribution
f(X; θ): the DNN model
2.1 Recall DNN, Why it works ?
Neural Network = Function approximator
> Given enough parameters, a NN can approximate any continuous function
`Universal approximation theorem`
x^(2)+ ((5y)/(4)- sqrt(abs(x)))^(2)=1
approximate
2.1 Recall DNN, Model Structure
Different NN structures: find a way to improve learnability
Deep NN, Convolutional NN, Recurrent NN, other "Minecraft" NN ..
2.1 Recall DNN, Model Structure
Different NN structures: find a way to improve learnability
Deep NN Deeper NN
2.1 Recall DNN, Deeper Model Structure
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
W_ij^L, Z_i^L
a_i^L = A(Z_i^L)
Z_i^L = W_i1^L * a_1^(L-1) + … + W_ij^L * a_j^(L-1) + ....
a_i^L = A( W_i1^L * a_1^(L-1) + … + W_ij^L * a_j^(L-1) + ... )
2.1 Recall DNN, Deeper Model Structure
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
W_ij^L, Z_i^L
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
θ = { W^1, b^1, W^2, b^2, … , W^L, b^L }
pick the best model = the best function = the best θ*
2.1 Recall DNN, Deeper Model Structure
...
...
‘v’
‘a’
…
...
Input
Layer Hidden Layer
Output
Layer
...
...
...
R 980x1
R 1x1
f_θ : R^980 -> R^1
Fully Connected Network
vectorize it !
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Output of a neuron: a_i^L   (Layer L, neuron i)
Output of one layer: a^L, a vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Weight: W_ij^L, from neuron j in Layer L-1 to neuron i in Layer L
W^L = [ W_11^L  W_12^L  …. ;  W_21^L  W_22^L  …. ;  …. ]  is an N_L x N_{L-1} matrix
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Bias: b_i^L, the bias for neuron i at Layer L
b^L = [ b_1^L ; b_2^L ; … ; b_i^L ; … ]  is an N_L x 1 vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Input of a neuron: z_i^L, the input to the activation function for neuron i at layer L
Input of one layer: z^L, a vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Input of a neuron: z_i^L, the input to the activation function for neuron i at layer L
z_i^L = W_i1^L a_1^(L-1) + W_i2^L a_2^(L-1) + ... + b_i^L = ∑_{j=1}^{N_{L-1}} W_ij^L a_j^(L-1) + b_i^L
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
z_1^L = W_11^L a_1^(L-1) + W_12^L a_2^(L-1) + ... + b_1^L
z_2^L = W_21^L a_1^(L-1) + W_22^L a_2^(L-1) + ... + b_2^L
…
z_i^L = W_i1^L a_1^(L-1) + W_i2^L a_2^(L-1) + ... + b_i^L
In matrix form:
[ z_1^L ; z_2^L ; … ; z_i^L ] = [ W_11^L W_12^L …. ; W_21^L W_22^L …. ; …. ] [ a_1^(L-1) ; a_2^(L-1) ; … ] + [ b_1^L ; b_2^L ; … ; b_i^L ]
      (N_L x 1)                        (N_L x N_{L-1})                              (N_{L-1} x 1)                     (N_L x 1)
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
z^L = W^L a^(L-1) + b^L
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
a_i^L = A(z_i^L)
Relation between layers and output: a^L = A(z^L), applied elementwise, giving a vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
z^L = W^L a^(L-1) + b^L ,  a^L = A(z^L)
a^L = A( W^L a^(L-1) + b^L )
Computational Graph: a^(L-1), W^L, b^L -> z^L -> a^L
Node: Tensor
Edge: Operator
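A minimal NumPy sketch of one layer's forward step, z^L = W^L a^(L-1) + b^L and a^L = A(z^L); the layer sizes (N_{L-1}=980, N_L=4) and the sigmoid activation are assumptions:

import numpy as np

def layer_forward(a_prev, W, b):
    z = W @ a_prev + b                   # z^L = W^L a^(L-1) + b^L
    return 1.0 / (1.0 + np.exp(-z))      # a^L = A(z^L), applied elementwise

rng = np.random.default_rng(0)
a_prev = rng.random((980, 1))            # a^(L-1)
W = rng.normal(size=(4, 980))            # W^L : N_L x N_{L-1}
b = np.zeros((4, 1))                     # b^L : N_L x 1
print(layer_forward(a_prev, W, b).shape) # (4, 1)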
2.1 Recall DNN, Deeper Model Structure
...
...
‘v’
‘a’
…
...
...
...
f_θ : R^980 -> R^1
Input -> Layer 1 -> Layer 2 -> … -> Layer L -> Output
(W^1, b^1)  (W^2, b^2)  …  (W^L, b^L)
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
2.1 Recall DNN, Deeper Model Structure
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
Computational Graph: X, W^1, b^1 -> z^1 -> a^1 -> z^2 -> … -> a^(L-1), W^L, b^L -> z^L
Node: Tensor
Edge: Operator
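A minimal sketch of the whole forward pass as a chain of such layers; the layer sizes [980, 4, 4, 1] and the sigmoid activation are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    a = X
    for W, b in params:                 # one (W^l, b^l) pair per layer
        a = sigmoid(W @ a + b)          # a^l = A(W^l a^(l-1) + b^l)
    return a

rng = np.random.default_rng(0)
sizes = [980, 4, 4, 1]
params = [(rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
X = rng.integers(0, 2, size=(980, 1)).astype(float)
print(forward(X, params).item())        # output score in (0, 1)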
2.1 Recall DNN, Deeper Model Structure
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
W_ij^L, Z_i^L
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
θ = { W^1, b^1, W^2, b^2, … , W^L, b^L }
pick the best model = the best function = the best θ*
2.1 Recall DNN, Deeper Model Structure
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
θ = { W^1, b^1, W^2, b^2, … , W^L, b^L }
pick the best model = the best function = the best θ*
2.Model
Model Structure (a set of functions)
Cost Function
Optimization
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
Objective Function: O(θ)
- How good the parameters are
2.Model
Model Structure
Cost Function
Optimization
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
Best Parameter: θ*
θ* = arg min_θ C(θ)
2.Model
Model Structure
Cost Function
Optimization
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
F(X) = P(.js): the true probability distribution;  f(X; θ): the DNN model
distance( F(X), f(X; θ) )
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
F(X) = P(.js): the true probability distribution;  f(X; θ): the DNN model
distance( F(X), f(X; θ) )
We don't actually know the real distribution function
2.1 Recall DNN, Cost Function
X
Real Probability Distribution
Dataset Sampling
From Real World
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …
The machine learning model predicts y_k for each X_k, compared against the label ȳ_k.
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
Wrong content!! Classification and regression are not going to learn a probability distribution.
2.1 Recall DNN, Cost Function
Real Probability Distribution ? Dataset sampled from the real world:
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …   e.g. X1 = .txt file (js) -> ȳ1 = 1,  X2 = .txt file (txt) -> ȳ2 = 0
The machine learning model outputs y_k ∈ [0, 1].
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
Wrong content!! Classification and regression are not going to learn a probability distribution.
2.1 Recall DNN, Cost Function, MSE as loss
Real Probability Distribution ? Dataset sampled from the real world:
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …   e.g. ȳ1 = 1 (js), ȳ2 = 0 (txt)
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k ) = (1/k) ∑_k ‖ y_k - ȳ_k ‖^2
Wrong content!! Classification and regression are not going to learn a probability distribution.
2.1 Recall DNN, Cost Function, Cross Entropy as loss
Real Probability Distribution ? Dataset sampled from the real world:
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …   e.g. ȳ1 = 1 (js), ȳ2 = 0 (txt)
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
C(θ) = -(1/k) ∑_k ( ȳ_k ln(y_k) + (1 - ȳ_k) ln(1 - y_k) )
Wrong content!! Classification and regression are not going to learn a probability distribution.
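A minimal sketch of the two losses, treating y as the model's predicted probability of js and ȳ as the 0/1 label (illustrative values only):

import numpy as np

def mse_loss(y, y_bar):
    return np.mean((y - y_bar) ** 2)

def cross_entropy_loss(y, y_bar, eps=1e-12):
    y = np.clip(y, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_bar * np.log(y) + (1 - y_bar) * np.log(1 - y))

y_pred  = np.array([0.95, 0.10, 0.60])
y_label = np.array([1.0,  0.0,  1.0])
print(mse_loss(y_pred, y_label), cross_entropy_loss(y_pred, y_label))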
2.1 Recall DNN, Cost Function, MSE as loss
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
To visualize how the cost functions differ, plot each loss against the margin z = y_k * ȳ_k, with ȳ_k ∈ {-1, 1}:
Φ( y_k * ȳ_k ) <- loss( y_k, ȳ_k )
Accuracy (0/1): L(z) = (sign(z) + 1) / 2
Logistic      : L(z) = log(exp(-z) + 1) / log(2)
MSE           : L(z) = (y - ȳ)^2 = (1 - z)^2
Hinge Loss is also shown in the plot.
2.Model
Model Structure
Cost Function
Optimization
2.1 Recall DNN, Optimization
Find the Best function
Have
- A function set
Know
- How good is the selected function
How to find the best one?
Enumerate ? Calculus !!
2.1 Recall DNN, Optimization
Find the best function
θ: the model parameters; loss C(θ)
For simplicity, suppose θ has only one variable:
1. Randomly start at θ0
2. Compute dC(θ0) / dθ
3. θ1 <- θ0 - η * dC(θ0) / dθ
4. Compute dC(θ1) / dθ
5. θ2 <- θ1 - η * dC(θ1) / dθ
..
η is the learning rate
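A minimal sketch of this one-variable gradient descent loop on a made-up cost C(θ) = (θ - 3)^2, whose derivative we can write by hand:

def C(theta):
    return (theta - 3.0) ** 2

def dC(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0          # 1. start somewhere (fixed here for reproducibility)
eta = 0.1            # learning rate
for _ in range(50):  # repeat: compute the derivative, step downhill
    theta = theta - eta * dC(theta)
print(theta, C(theta))  # theta approaches 3, C approaches 0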
2.1 Recall DNN, Optimization
set the learning rate carefully
η is learning rate
For simplicity, suppose θ has only one variable:
1. Randomly start at θ0
2. Compute dC(θ0) / dθ
3. θ1 <- θ0 - η * dC(θ0) / dθ
4. Compute dC(θ1) / dθ
5. θ2 <- θ1 - η * dC(θ1) / dθ
..
Figure copied from http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20(v4).pdf
2.1 Recall DNN, Optimization
Find the best function
Suppose θ has two variables, θ1 and θ2:
1. Randomly start at θ0 = [ θ1^0 ; θ2^0 ]
2. Compute the gradient of C(θ) at θ0:  ∇C(θ0) = [ ∂C(θ0)/∂θ1 ; ∂C(θ0)/∂θ2 ]
3. Update parameters: [ θ1^1 ; θ2^1 ] <- [ θ1^0 ; θ2^0 ] - η * ∇C(θ0)
….
Gradient vs. Movement on the contour plot of C(θ)
2.1 Recall DNN, Optimization
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
Wij
L
Zi
L
Optimize our model
1. Randomly start at θ0
2. Compute the gradient of C(θ) at θ0: ∇C(θ0)
3. Update parameters: θ1 <- θ0 - η * ∇C(θ0)
To calculate ∇C(θ0), we need
{ ∂C(θ0)/∂W_11^1, ∂C(θ0)/∂b_1^1,
  ∂C(θ0)/∂W_12^1, ∂C(θ0)/∂b_2^1,
  … ,
  ∂C(θ0)/∂W_ij^L, ∂C(θ0)/∂b_i^L }
2.1 Recall DNN, Diff on Computational Graph
Case 1: x -> y -> z, with y = g(x), z = h(y), so z = f(x) = h(g(x))
dz/dx = dz/dy * dy/dx
Case 2: s -> t, s -> u, (t, u) -> z, with t = g(s), u = h(s), z = k(t, u), so z = f(s)
dz/ds = ∂z/∂t * ∂t/∂s + ∂z/∂u * ∂u/∂s
**
exp()
2.1 Recall DNN, Diff on Computational Graph
Example: y = x * exp( x*x ) : u = x*x, v = exp(u), w = x*v
x v
x
u w
x
**
exp()
2.1 Recall DNN, Diff on Computational Graph
Example: y = x * exp( x*x ) : u = x*x, v = exp(u), w = x*v
x v
x
u w
x
x = 2, y?
2
2 2
4 e^4 2*e^4
y = 2*e^4
Forward Pass
2.1 Recall DNN, Diff on Computational Graph
Example: y = x * exp( x*x ) : u = x*x, v = exp(u), w = x*v
dy/dx at x = 2 ?
Forward pass: x = 2, u = 4, v = e^4, y = w = 2*e^4
Backward pass (local derivative on each edge):
∂u/∂x = 2x
∂v/∂u = exp(u) = exp(x^2)
∂w/∂v = x
∂w/∂x = v = exp(u) = exp(x^2)   (WARNING: this is the direct edge x -> w, a different use of x)
Combine the two paths into w:
dy/dx = ∂w/∂x + ∂w/∂v * ∂v/∂u * ∂u/∂x
      = exp(x^2) + x * exp(x^2) * 2x
      = 2*x^2*exp(x^2) + exp(x^2) |x=2
      = 9*exp(4)
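A quick numerical check of this worked example (dy/dx = exp(x^2)(1 + 2x^2), which at x = 2 gives 9*e^4 ≈ 491.5):

import numpy as np

def f(x):
    return x * np.exp(x * x)

def analytic_grad(x):
    return np.exp(x ** 2) * (1 + 2 * x ** 2)

x, h = 2.0, 1e-6
numeric_grad = (f(x + h) - f(x - h)) / (2 * h)   # central difference
print(analytic_grad(x), numeric_grad)            # both ~ 9 * e^4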
2.1 Recall DNN, Diff on Computational Graph
y = f(X; θ) = A( W^2 A( W^1 X + b^1 ) + b^2 )
θ = { W^1, b^1, W^2, b^2 }
Computational Graph (Node: Tensor, Edge: Operator): X, W^1, b^1 -> z^1 -> a^1 ; a^1, W^2, b^2 -> z^2 -> y -> C (with label ȳ)
Backpropagation, e.g. for ∂C(θ0)/∂W_11^1:
C(y, ȳ) = (ȳ - y)^2        ->  ∂C/∂y = -2(ȳ - y)
y = A(z^2)                 ->  ∂y/∂z^2 = A'(z^2)
z^2 = W^2 * a^1 + b^2      ->  ∂z^2/∂a^1 = W^2
a^1 = A(z^1)               ->  ∂a^1/∂z^1 = A'(z^1)
z^1 = W^1 * X + b^1        ->  ∂z^1/∂W^1 = X
2.1 Recall DNN, Diff on Computational Graph
y = f(X; θ) = A( W^2 A( W^1 X + b^1 ) + b^2 ),  θ = { W^1, b^1, W^2, b^2 }
Backpropagation seen as an error signal flowing backwards through the computational graph:
the error signal ∂C/∂y = -2(ȳ - y) starts at the output, and each layer acts as an amplifier:
Layer 2: multiply by ∂y/∂z^2 = A'(z^2) and ∂z^2/∂a^1 = W^2
Layer 1: multiply by ∂a^1/∂z^1 = A'(z^1) and ∂z^1/∂W^1 = X
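A minimal sketch of backpropagation for this 2-layer network, assuming sigmoid as A and squared error as C; all shapes and values are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((980, 1))
W1, b1 = rng.normal(size=(4, 980)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1))
y_bar = 1.0

# forward pass
z1 = W1 @ X + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)

# backward pass: the error signal, amplified layer by layer
dC_dy  = -2 * (y_bar - y)
dC_dz2 = dC_dy * y * (1 - y)          # multiply by A'(z2) (sigmoid derivative)
dC_dW2 = dC_dz2 @ a1.T
dC_da1 = W2.T @ dC_dz2                # multiply by W2
dC_dz1 = dC_da1 * a1 * (1 - a1)       # multiply by A'(z1)
dC_dW1 = dC_dz1 @ X.T
print(dC_dW1.shape, dC_dW2.shape)     # (4, 980) (1, 4)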
2.1 Recall DNN, Optimization (batch)
Training Data: { (X1, ȳ1), (X2, ȳ2), … , (Xr, ȳr), … , (XR, ȳR) }
- Gradient Descent:             θi <- θi-1 - η * ∇C(θi-1),  C(θi-1) = (1/R) ∑_r Cr(θi-1)          (use all samples for each update)
- Stochastic Gradient Descent:  θi <- θi-1 - η * ∇Cr(θi-1)                                        (use one sample for each update)
- Mini-Batch Gradient Descent:  θi <- θi-1 - η * ∇C(θi-1),  C(θi-1) = (1/B) ∑_{r ∈ b} Cr(θi-1)    (use one batch b of B samples for each update)
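A minimal sketch contrasting these update schemes on a toy linear model y = θ·x with squared error; the data and the batch size are made up:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random(100)
Y = 3.0 * X + rng.normal(scale=0.1, size=100)   # underlying theta is 3

def grad(theta, x, y):                          # d/dtheta of mean (theta*x - y)^2
    return np.mean(2 * (theta * x - y) * x)

theta, eta = 0.0, 0.5
for epoch in range(20):
    order = rng.permutation(len(X))             # shuffle, then one update per mini-batch
    for batch in np.array_split(order, 10):     # batch size B = 10
        theta -= eta * grad(theta, X[batch], Y[batch])
# full-batch GD would call grad(theta, X, Y); SGD would use one sample at a time
print(theta)  # close to 3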
2.2 Let's Go Back to RNN
- Why is memory important?
- What kinds of problems can an RNN handle?
- Simple RNN structure
- Why is it so hard to train?
- Classic variants (LSTM, GRU)
2.2 Why memory important
Example: add two numbers digit by digit, 199 + 333 = 532.
Feed the digit pairs least-significant first: X1 = (9, 3), X2 = (9, 3), X3 = (1, 3).
Input: 2 dimensions per timestep.  Output: 1 dimension per timestep (2, then 3, then 5).
The carry has to be remembered between timesteps.
2.2 Why memory important
A tiny hand-wired network with one memory cell c1 carries the addition along: at every timestep the two
input digits are summed together with the stored carry, the ones digit is emitted as the output, and the
carry (0 or 1) is written back into c1 for the next timestep:
X1 = (9, 3): 9 + 3 = 12      -> output 2, carry 1
X2 = (9, 3): 9 + 3 + 1 = 13  -> output 3, carry 1
X3 = (1, 3): 1 + 3 + 1 = 5   -> output 5, carry 0
Without the memory cell the network cannot pass the carry from one timestep to the next.
2.3 Simple RNN
X1
1
X
X2
2
X3
3
X4
4
v.s. DNN
3
2
1
many to many
one to one
2.3 Simple RNN
X1
1
X2
2
X3
3
4
4
many to many
2.3 Simple RNN
X1
X2
1 2
X1
1 2 3 4
many to many one to many
2.3 Simple RNN
X1
X2
1 2
many to many
2.3 Simple RNN
X1
1 2 3 4
one to many
2.3 Simple RNN, Model Structure
X1
X2
X3
X4
many to one
X10
X5
‘v’ ‘a’ ‘r’ ‘ ‘ ‘e’ …. ‘E’
js
2.3 Simple RNN, Model Structure
Inputs X1, X2, X3, …; input weights W_i with bias b_i; recurrent weights W_h with bias b_h; intermediate values z_i, z_h; hidden state a.
2.3 Simple RNN, Model Structure
For input X_t and previous hidden state h:
z_i = W_i * X_t
z_h = W_h * h
a = A( z_i + z_h + b ),  i.e. the cell computes f(z_i, z_h, b) = A(z_i + z_h + b)
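A minimal sketch of this simple RNN step, a = A(W_i*X_t + W_h*h + b), scanned over a character sequence; the sizes (input 98, hidden 16) and tanh as A are assumptions:

import numpy as np

def rnn_forward(xs, Wi, Wh, b):
    h = np.zeros((Wh.shape[0], 1))          # initial hidden state
    for x in xs:                            # one step per timestep
        h = np.tanh(Wi @ x + Wh @ h + b)    # a = A(W_i*X_t + W_h*h + b)
    return h                                # final hidden state (many to one)

rng = np.random.default_rng(0)
Wi = rng.normal(size=(16, 98)) * 0.1
Wh = rng.normal(size=(16, 16)) * 0.1
b  = np.zeros((16, 1))
xs = [rng.integers(0, 2, size=(98, 1)).astype(float) for _ in range(10)]
print(rnn_forward(xs, Wi, Wh, b).shape)     # (16, 1)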
2.3 Simple RNN, Model Structure
The same cell (W_i, W_h, b, activation A) is applied at every timestep X1 … X10, passing the hidden state a forward.
2.3 Simple RNN, Model Structure
For the js-classification task (many to one over X1 … Xi), only the final hidden state feeds the output and the cost C.
2.3 Simple RNN, Cost Function
y = f(X_t, h; θ) = A( W_i * X_t + W_h * h + b )   (applied at every timestep)
θ = { W_i, W_h, b }
pick the best model = the best function = the best θ*
2.Model
Model Structure (function set)
Cost Function
Optimization
C(θ) = -(1/k) ∑_k ( ȳ_k ln(y_k) + (1 - ȳ_k) ln(1 - y_k) )
2.3 Simple RNN, Optimization, Back Propagation?
y = f(X_t, h; θ) = A( W_i * X_t + W_h * h + b )   (applied at every timestep)
θ = { W_i, W_h, b }
pick the best model = the best function = the best θ*
C(θ) = -(1/k) ∑_k ( ȳ_k ln(y_k) + (1 - ȳ_k) ln(1 - y_k) )
Can we still use back-propagation? Yes, by unrolling the network through time.
2.3 Simple RNN, BPTT, Computational Graph
Unroll the recurrence: the same W_h (and W_i, b) appears at every timestep X1 … X10, and the cost C sits at the end.
∂C(θ0)/∂W_h = the sum of the contributions from all timestep nodes in the unrolled graph.
2.3 Simple RNN, BPTT, gradient problem
Unrolled over the 10 input characters 'v' 'a' 'r' ' ' 'e' …. 'E' (label: js), the error signal δ from the output passes through the same amplifier at every timestep.
Updating W needs:  Error Signal * (Amplifier)^10
Amplifier < 1 -> vanishing gradient problem;  Amplifier > 1 -> exploding gradient problem
2.3 Simple RNN, BPTT, gradient problem
Updating W needs:  Error Signal * (Amplifier)^10
Vanishing or exploding gradient: the gradient's direction may still be fine while its magnitude becomes extremely small or extremely large.
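A tiny numeric illustration of the point above: multiplying an error signal by the same amplifier 10 times either vanishes or explodes (the amplifier values are made up):

for amplifier in (0.5, 1.0, 2.0):
    signal = 1.0
    for _ in range(10):          # 10 timesteps, one multiplication per step
        signal *= amplifier
    print(amplifier, signal)     # 0.5 -> ~0.001 (vanishing), 2.0 -> 1024 (exploding)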
2.4 LSTM
--- Hidden Unit ---
[Input]  : state_{t-1}, Input_t, Cell_State_{t-1}
[Output] : state_t, Output_t, Cell_State_t
Input Gate : whether to let the [input] in
Forget Gate: whether to update the Cell_State
Output Gate: whether to emit the output
2.4 LSTM
Inputs at time t: X_t, h_{t-1}, c_{t-1}
Input Gate      : z_i = σ( W_i · [X_t, h_{t-1}] )        whether to let the input in
Candidate input : z   = σ( W_z · [X_t, h_{t-1}] )        the [Input] itself
Forget Gate     : z_f = σ( W_f · [X_t, h_{t-1}] )        whether to keep the old cell state
Cell State      : c_t = z_f ⊙ c_{t-1} + z_i ⊙ z          (elementwise multiply, elementwise addition)
Output Gate     : z_o = σ( W_o · [X_t, h_{t-1}] )
Output          : h_t = z_o ⊙ tanh(c_t)
2.Model
Model Structure
Cost Function
Optimization
2.4 LSTM
Cross Entropy
BPTT
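A minimal sketch of one LSTM step as described above (sigmoid gates; tanh is used for the candidate and cell output here, the common choice, even where the slide draws σ); sizes are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Wz, Wf, Wo):
    xh = np.concatenate([x_t, h_prev])            # [X_t, h_{t-1}]
    z_i = sigmoid(Wi @ xh)                        # input gate
    z   = np.tanh(Wz @ xh)                        # candidate input
    z_f = sigmoid(Wf @ xh)                        # forget gate
    c_t = z_f * c_prev + z_i * z                  # elementwise cell-state update
    z_o = sigmoid(Wo @ xh)                        # output gate
    h_t = z_o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_h = 98, 16
Wi, Wz, Wf, Wo = (rng.normal(size=(n_h, n_in + n_h)) * 0.1 for _ in range(4))
h, c = np.zeros(n_h), np.zeros(n_h)
x = rng.integers(0, 2, size=n_in).astype(float)
h, c = lstm_step(x, h, c, Wi, Wz, Wf, Wo)
print(h.shape, c.shape)   # (16,) (16,)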
2.5 GRU (Gated Recurrent Unit)
RNN  : input X_t, h_{t-1} -> output y_t, h_t                    (short-term state only)
LSTM : input X_t, h_{t-1}, c_{t-1} -> output y_t, h_t, c_t      (short-term h + long-term c)
GRU  : input X_t, h_{t-1} -> output y_t, h_t                    (long + short term combined?)
2.5 GRU
GRU: input X_t, h_{t-1} -> output y_t, h_t   (long + short term combined?)
Reset Gate      : z_r = σ( W_r · [X_t, h_{t-1}] )
Candidate State : ĥ = tanh( W · X_t + W_c · (z_r ⊙ h_{t-1}) )      (elementwise multiply)
Update Gate     : z_u = σ( W_u · [X_t, h_{t-1}] )
New State       : h_t = (1 - z_u) ⊙ h_{t-1} + z_u ⊙ ĥ
2.5 GRU
Parameters: W_r (reset), W_u (update), W and W_c (candidate); inputs X_t, h_{t-1}; output h_t.
2.Model
Model Structure
Cost Function
Optimization
Cross Entropy
BPTT
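A minimal sketch of one GRU step as described above, h_t = (1 - z_u)⊙h_{t-1} + z_u⊙ĥ; sizes are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wu, W, Wc):
    xh = np.concatenate([x_t, h_prev])              # [X_t, h_{t-1}]
    z_r = sigmoid(Wr @ xh)                          # reset gate
    h_hat = np.tanh(W @ x_t + Wc @ (z_r * h_prev))  # candidate state
    z_u = sigmoid(Wu @ xh)                          # update gate
    return (1 - z_u) * h_prev + z_u * h_hat         # new hidden state

rng = np.random.default_rng(0)
n_in, n_h = 98, 16
Wr = rng.normal(size=(n_h, n_in + n_h)) * 0.1
Wu = rng.normal(size=(n_h, n_in + n_h)) * 0.1
W  = rng.normal(size=(n_h, n_in)) * 0.1
Wc = rng.normal(size=(n_h, n_h)) * 0.1
x = rng.integers(0, 2, size=n_in).astype(float)
print(gru_step(x, np.zeros(n_h), Wr, Wu, W, Wc).shape)   # (16,)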
3 How to Tune Your Model
- Basic Metrics
- Model HyperParameters
- Visualization Inspiration
- Optimization Setup
- Network Structure
3.1 Basic Metrics
DataSet vs. Real World
Split: Training Data | Testing Data
Split again: Training Data | Validation Data | Real Testing
Do I get good results on the training set?
  No  -> code has a bug? cannot find a good function? bad model (no good function in the hypothesis set)?
  Yes -> Do I get good results on the validation set?
           No  -> overfitting?
           Yes -> move on to real testing
y = f(X_t, h; θ) = σ( W_i * X_t + W_h * h + b ),   θ = { W_i, W_h, b }
3.2 Model HyperParameter
Simple RNN
These are hyperparameters:
- number of epochs (training iterations)
- |h|: hidden layer size
- number of layers
- C(): cost function
- batch size (stochastic, mini-batch, ..)
- parameter initial values
- A: activation function (tanh, sigmoid, ReLU)
- η: learning rate (θ1 <- θ0 - η * ∇C(θ0))
- regularization (Dropout, Zoneout)
- forget gate bias
- gate initialization
- implicit zero padding
Grid search?
3.3 Visualization Inspiration
one layer, 10 hidden unit simple rnn
3.3 Visualization Inspiration
one layer, 10 hidden unit simple rnn
3.4 Optimization Setup
Adaptive Learning Rate
Gradient Clipping (see the sketch below)
Truncated BPTT
Batch Normalization
Longer Training Time
….
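A minimal sketch of the gradient clipping item above: rescale the gradient when its norm exceeds a chosen threshold (the threshold value here is arbitrary):

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # keep direction, shrink magnitude
    return grad

g = np.array([30.0, -40.0])               # an "exploding" gradient with norm 50
print(clip_gradient(g))                   # rescaled to norm 5: [ 3. -4.]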
3.5 Network Structure
Bi-Direction RNN …
GRU
LSTM
Stack RNN
Neural Turing Machine
….
4. View Problem in CNN
var evwfrE
[23, 2, 19, 0, 6, 23, 24, 7, 19, 32]
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
....
0 1 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0
var evwfrE 98x10
Figure: the characters x_1 … x_10 of the 98x10 one-hot input, processed by convolution filters (feature maps r) to produce the output y.
5 A Sequence to Sequence Problem
- Three different ways to solve the problem: HMM, CTC, Attention
Topics Not Cover
- Bi-directional RNN and other RNN variants
- Attention-based RNN
- structure learning vs. RNN
Reference
- Most of the content is from Prof. Lee @ NTU:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/
- https://danijar.com/tips-for-training-recurrent-neural-networks/
- https://medium.com/@erikhallstrm/hello-world-rnn-83cd7105b767
These slides were shared with the Trend Micro scan engine team as an entry-level introduction for
team members on 2017/9/4.
If you have any suggestion or correction, please mail chuchuhao831@gmail.com
Or just leave a comment on the Google Slides -> link
More Related Content

What's hot

Image Recognition with Neural Network
Image Recognition with Neural NetworkImage Recognition with Neural Network
Image Recognition with Neural Network
Sajib Sen
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithmsguestfee8698
 
Object calisthenics
Object calisthenicsObject calisthenics
Object calisthenics
PolSnchezManzano
 
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan ChowdhuryA simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
S. M. Risalat Hasan Chowdhury
 
Basic operations by novi reandy sasmita
Basic operations by novi reandy sasmitaBasic operations by novi reandy sasmita
Basic operations by novi reandy sasmita
beasiswa
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
Mike Anderson
 
211 738-1-pb
211 738-1-pb211 738-1-pb
211 738-1-pbishwari85
 
Procedural Content Generation with Clojure
Procedural Content Generation with ClojureProcedural Content Generation with Clojure
Procedural Content Generation with Clojure
Mike Anderson
 
Inheritance and-polymorphism
Inheritance and-polymorphismInheritance and-polymorphism
Inheritance and-polymorphism
Usama Malik
 
Chpater 6
Chpater 6Chpater 6
Chpater 6
EasyStudy3
 
JSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph PicklJSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph Pickl
Christoph Pickl
 
Swift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML PersonalizationSwift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML Personalization
Jacopo Mangiavacchi
 
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning BasicsDeep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Jason Tsai
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
Alex Miller
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
henrygarner
 
Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2
TOMMYLINK1
 
Image Processing
Image ProcessingImage Processing
Image Processingyuvhashree
 

What's hot (20)

Clojure class
Clojure classClojure class
Clojure class
 
Image Recognition with Neural Network
Image Recognition with Neural NetworkImage Recognition with Neural Network
Image Recognition with Neural Network
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
 
Object calisthenics
Object calisthenicsObject calisthenics
Object calisthenics
 
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan ChowdhuryA simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
 
Basic operations by novi reandy sasmita
Basic operations by novi reandy sasmitaBasic operations by novi reandy sasmita
Basic operations by novi reandy sasmita
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
 
211 738-1-pb
211 738-1-pb211 738-1-pb
211 738-1-pb
 
Procedural Content Generation with Clojure
Procedural Content Generation with ClojureProcedural Content Generation with Clojure
Procedural Content Generation with Clojure
 
Inheritance and-polymorphism
Inheritance and-polymorphismInheritance and-polymorphism
Inheritance and-polymorphism
 
DCT
DCTDCT
DCT
 
Chpater 6
Chpater 6Chpater 6
Chpater 6
 
JSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph PicklJSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph Pickl
 
Swift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML PersonalizationSwift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML Personalization
 
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning BasicsDeep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
 
Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2
 
Image Processing
Image ProcessingImage Processing
Image Processing
 
Lec3
Lec3Lec3
Lec3
 

Similar to RNN sharing at Trend Micro

Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
Dongheon Lee
 
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
Deep Learning and TensorFlow
Deep Learning and TensorFlowDeep Learning and TensorFlow
Deep Learning and TensorFlow
Oswald Campesato
 
6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx
Venkateswara Babu Ravipati
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
ssuserd64918
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
Oswald Campesato
 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functionsankita44
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
AI Frontiers
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
Ding Li
 
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
台灣資料科學年會
 
User defined functions
User defined functionsUser defined functions
User defined functionsshubham_jangid
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Universitat Politècnica de Catalunya
 
Intro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.jsIntro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.js
Oswald Campesato
 
Deep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlowDeep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlow
Oswald Campesato
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
Oswald Campesato
 
C++ and Deep Learning
C++ and Deep LearningC++ and Deep Learning
C++ and Deep Learning
Oswald Campesato
 
Deep learning with C++ - an introduction to tiny-dnn
Deep learning with C++  - an introduction to tiny-dnnDeep learning with C++  - an introduction to tiny-dnn
Deep learning with C++ - an introduction to tiny-dnn
Taiga Nomi
 
Introduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlowIntroduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlow
Oswald Campesato
 

Similar to RNN sharing at Trend Micro (20)

Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
 
Deep Learning and TensorFlow
Deep Learning and TensorFlowDeep Learning and TensorFlow
Deep Learning and TensorFlow
 
6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functions
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
 
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
 
User defined functions
User defined functionsUser defined functions
User defined functions
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Intro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.jsIntro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.js
 
Deep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlowDeep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlow
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
 
C++ and Deep Learning
C++ and Deep LearningC++ and Deep Learning
C++ and Deep Learning
 
Deep learning with C++ - an introduction to tiny-dnn
Deep learning with C++  - an introduction to tiny-dnnDeep learning with C++  - an introduction to tiny-dnn
Deep learning with C++ - an introduction to tiny-dnn
 
Functions12
Functions12Functions12
Functions12
 
Functions123
Functions123 Functions123
Functions123
 
Introduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlowIntroduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlow
 

Recently uploaded

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 

Recently uploaded (20)

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 

RNN sharing at Trend Micro

  • 1. RNN Share @ Trend Micro chuchuhao
  • 2. Outline 1. Problem Definition 2. Walk through RNN Model 3. How to tune best result 4. (optional) View Problem in CNN 5. (optional) A Sequence to Sequence Problem (Include HMM, CTC, Attention
  • 4. #1 Problem Definition 1.Representation 2.Model 3.Evaluation Define Problem Sourcing Data Cleaning Normalization Model Structure Cost Function Optimization Metrics Explanation
  • 5. 1-1 Define Problem .. (more Task) 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file Sequence Learning: - Turn to Pseudo Code - Output Behavior Seq (without run) - Summarize Code - Learn When to Segment Generative Model: - Code Generation - Code repatch Supervised: - Gen Hash Code Via Similarity - Action to take (run sandbox or ..) Unsupervised: - Detect Unusual Coding Style Reinforcement: - Accroding PC statuc decide next action
  • 6. .txt file .txt file Text 1-1 Define Problem .. 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file JavaScript
  • 7. 1-2 Sourcing Data 1.Representation Define Problem Sourcing Data Cleaning Normalization Tim will share XDD js js .txt file js Data Lake Web Benchmark Dataset John’s Tool->
  • 8. 1-2 Sourcing Data … Sampling Bias Data Lake (Trend Micro) Web Benchmark Dataset # of API used #oflines Valid JS script, but not in any bank どっち ?
  • 9. 1-3 Cleaning & Normalization 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file var evwfrEmu = new ActiveXObject("WScript.Shell"); //// 4fQei4pD7o69uOE8Trgp YThMSpGKzmIcsQ = evwfrEmu.ExpandEnvironmentStrings("%TEMP%") + "ssd" + Math.round(1e8 * Math.random()); var lmXPGbOCL0 = new ActiveXObject("Msxml2.DOMDocument.6.0"); //// NN08kES93To var MsslCA = new ActiveXObject("ADODB.Stream"); var dsgkpp = lmXPGbOCL0.createElement("tmp"); //// lZPifi9y dsgkpp.dataType = "bin.base64"; //// wacLqE //// bNloPOpvlZ8EdqKEn MsslCA.Type = 1; dsgkpp.text = "dmFyIGJUa3l …. Source File
  • 10. 1-3 Cleaning & Normalization 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file .txt file var evwfrE [23, 2, 19, 0, 6, 23, 24, 7, 19, 32] Cleaned Normalized d = np.zeros((10, 98)) d[0,23] = 1 d[0,2] = 1 …. d[0,32] = 1 # data -> a sparse 2d matrix
  • 11. 1-3 Cleaning & Normalization 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file ( ) = (.js)< , , …., > .txt file Cleaning Normalization Raw Data Cleaned Data X y js Label ȳ
  • 12. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) (X; ) = (.js) (X; ) = {1, 0}
  • 13. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) ( ) = (X; ) What Functions Looks Like Model Structure is a set of functions (or named Hypothesis set) Inference using one of funtion in set
  • 14. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) (X; 1) The goodness of chosen function ( ) 12 3 4 (X; 2) Which one better ?
  • 15. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) Find the Best function , model parameter -1*goodness() best
  • 16. 3.Evaluation Metrics Explanation 3. Evaluation best .txt file js .txt file js .txt file js .txt file js Training Set Testing Set Learn Task ?
  • 18. 3.Evaluation Metrics Explanation 3. Evaluation .txt file( ; best) = JS, True Positive js .txt file( ; best) = TXT, False Positive txt .txt file( ; best) = TXT, True Negative txt .txt file( ; best) = TXT, False Negative js False Alarm
  • 20. 3.Evaluation Metrics Explanation 3. Evaluation, Goodness(loss) v.s Metrics(accuracy) Loss Function Accuracy Objective Measure Goodness of hypothesis function Measure Goodness of Model on Task Property Continuous Discrete
  • 21. #2 Walk through RNN Model Problem Definition: Find a function (X; ) = (.js) Input : Output: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 .... 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 95% probability might be javascript var evwfrE 98x10
  • 22. 2.1 Recall DNN Problem Definition: Find a function (X; ) = (.js) Input : DNN Input: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 .... 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 var evwfrE 98x10 (98*10)x1 varevwfrE 0 0 0 1 0 0 0 0 0 0 0 0 0 .. 0 0 1 980 2 scalar scalar scalar ...
  • 23. 2.1 Recall DNN, Model Structure ‘v’ ‘a’ … ‘s’ … ‘E’ 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Input Layer ∈ R 980x1 Hidden Layer ∈ R 4x1 Output Layer ∈ R 1x1 Verticles here are `Scalar` Edges heare are `Operator`
  • 24. 2.1 Recall DNN, Model Structure 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi i = A(Zi ) Zi = Wi1 * +…+Wij * + ....1 j Input Layer : X ∈ R 980x1 Hidden Layer : a ∈ R 4x1 Output Layer : y ∈ R 1x1 Weighted : W ∈ R 4 x 980 Activation Func : A ∈ RxR
  • 25. 2.1 Recall DNN, Model Structure 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi a = A( X * W0 + b0 ) Input Layer : X ∈ R980x1 Hidden Layer : a ∈ R4x1 Output Layer : y ∈ R1x1 Weighted : W0 ∈R4x980 , W1 ∈R1x4 bias : b0 ∈R4x1 , b1 ∈R1x1 Activation Func : A ∈ RxR y = a * W1 + b1
  • 26. 2.1 Recall DNN, Model Structure = Inference 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi a = A( X * W0 + b0 ) y = a * W1 + b1 if y > 0.5 : Javascript else : TEXT
  • 27. 2.1 Recall DNN, Model Structure = Inference 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi ( ) = (X; ) = y = (.js) W0 ∈ R 4x980 , W1 ∈ R 1x4 b0 ∈ R 4x1 , b1 ∈ R 1x1 = { } | | = total parameter = 4*980 +4 +4 +1 = 3929
  • 28. F(X) = (.js): Probability Distribution (X; ): DNN Model 2.1 Recall DNN, Why it works ? Neural Network = Function approximator > Given enough variable, nn can approximate any continuous function `Universal approximation theorem` x^(2)+ ((5y)/(4)- sqrt(abs(x)))^(2)=1 approximate
  • 29. 2.1 Recall DNN, Model Structure Different NN structure: find a way to imporove learnability Deep NN Recurrent NNConvolution NN Other Mincraft NN .. 1 2 1 2
  • 30. 2.1 Recall DNN, Model Structure Different NN structure: find a way to imporove learnability Deep NN Deeper NN
  • 31. 2.1 Recall DNN, Deeper Model Structure 1 2 j ...... 1 2 i ...... Layer: L-1 Layer: L Wij L Zi L i = ai L = A(Zi L ) Zi L = Wi1 L * +…+Wij L * + .... 1 j ai L = A(Wi1 L * ai L-1 +…+Wij L *ai L-1 +...
  • 32. 2.1 Recall DNN, Deeper Model Structure 1 2 j ...... 1 2 i ...... Layer: L-1 Layer: L Wij L Zi L = (X; ) = A( WL … A( W2 A( X * W1 + b1 )+ b2 ) … + bL ) = { W1 , b1 , W2 , b2 , … , WL , bL } pick the best model = best function = best *
  • 33. 2.1 Recall DNN, Deeper Model Structure ... ... ‘v’ ‘a’ … ... Input Layer Hidden Layer Output Layer ... ... ... R 980x1 R 1x1 fӨ : R 980 ➡R 1 Fully Connected Network vectorize it !
  • 34. a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes Output of a neuron: ai L Output of one layer: aL : a vector a0 L-1 a1 L-1 aj L-1 a2 L ai L Layer L Neuron i
  • 35. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes Weight: Wij L a2 L ai L Layer L-1 to Layer L from neuron j in Layer L-1 to neuron i in Layer L W L = W11 L W12 L …. W21 L W22 L …. NL xNL-1 …. NL-1 NL W L
  • 36. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes Bias: bi L a2 L ai L bias for neuron i at Layer L b L = b1 L b2 L bi L NL ….…. 1 W L
  • 37. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L Input of a neuron: zi L Input of one layer: zL : a vector input of the activation function for neuron i at layer l
  • 38. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L Input of a neuron: zi L input of the activation function for neuron i at layer l zi L = Wi1 L a1 L-1 +Wi2 L a2 L-1 +...+ bi L = ∑ Wij L aj L-1 + bi Lj=1 NL-1
  • 39. …. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L z1 L = W11 L a1 L-1 +W12 L a2 L-1 +...+ b1 L z2 L = W21 L a1 L-1 +W22 L a2 L-1 +...+ b2 L … zi L = Wi1 L a1 L-1 +Wi2 L a2 L-1 +...+ bi L = W11 L W12 L …. W21 L W22 L …. …. z1 L z2 L zi L NL …. …. + a1 L-1 a2 L-1 ai L-1 …. …. b1 L b2 L bi L NL …. NL xNL-1 NL-1
  • 40. …. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L = W11 L W12 L …. W21 L W22 L …. …. z1 L z2 L zi L NL …. …. + a1 L-1 a2 L-1 ai L-1 …. …. b1 L b2 L bi L NL …. NL xNL-1 NL-1 z L = W L a L-1 + b L
  • 41. …. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L = a1 L a2 L ai L NL …. ai L =A(zi L ) …. A(z1 L ) A(z2 L ) A(zi L ) NL …. Relations between Layers and Output a L =A(z L ) : vector
  • 42. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L z L = W L a L-1 + b L , a L =A(z L ) a L =A(W L a L-1 + b L ) z L a La L-1 W L b L Computational Graph Node: Tensor Edge: Operator
  • 43. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L z L = W L a L-1 + b L , a L =A(z L ) a L =A(W L a L-1 + b L ) z L a La L-1 W L b L Computational Graph Node: Tensor Edge: Operator
  • 44. 2.1 Recall DNN, Deeper Model Structure ... ... ‘v’ ‘a’ … ... ... ... ... fӨ : R 980 ➡R 1 Input Layer 1 Layer 2 Layer L Output W 1 , b 1 W 2 , b 2 W L , b L = (X; ) = A( WL … A( W2 A( X * W1 + b1 )+ b2 ) … + bL )
  • 45. 2.1 Recall DNN, Deeper Model Structure = (X; ) = A( WL … A( W2 A( X*W1 + b1 )+ b2 ) … + bL ) z 1 a 1 W1 b1 z 2 a L-1 W2 b2 z L …. Computational Graph Node: Tensor Edge: Operator W L b L
  • 46. 2.1 Recall DNN, Deeper Model Structure 1 2 j ...... 1 2 i ...... Layer: L-1 Layer: L Wij L Zi L = (X; ) = A( WL … A( W2 A( X * W1 + b1 ) + b2 ) … + bL ) = { W1 , b1 , W2 , b2 , … , WL , bL } pick the best model = best function = best *
  • 47. 2.1 Recall DNN, Deeper Model Structure = (X; ) = A( WL … A( W2 A( X * W1 + b1 ) + b2 ) … + bL ) = { W1 , b1 , W2 , b2 , … , WL , bL } pick the best model = best function = best * 2.Model Model Structure Cost Function Optimization functions set
  • 48. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` Object Function: O( ) - How good the parameter is 2.Model Model Structure Cost Function Optimization
  • 49. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` Best Parameter: *    * = arg min C( ) 2.Model Model Structure Cost Function Optimization
  • 50. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` F(X) = (.js): Probability Distribution (X; ): DNN Model distance ( , )
  • 51. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` F(X) = (.js): Probability Distribution (X; ): DNN Model distance ( , ) Don’t actually know the real distribution function
  • 52. 2.1 Recall DNN, Cost Function X Real Probability Distribution Dataset Sampling From Real World (X1 , ȳ1 ), (X2 , ȳ2 ), (X3 , ȳ3 ) … X1 ȳ1 1 C( ) = ∑ loss ( k , ȳk ) = ∑ ( ) 1 k k Machine Learning Model 1 k k Wrong Content !! Classification and Regression is not going to learn a probability distribution
  • 53. 2.1 Recall DNN, Cost Function — dataset sampled from the real world: (X1, ȳ1), (X2, ȳ2), (X3, ȳ3), …, e.g. one .txt file labelled ȳ1 = 1 and another labelled ȳ2 = 0. The model is scored by C(θ) = (1/K) Σ_k loss(ŷ_k, ȳ_k). Wrong content!! Classification and regression are not going to learn a probability distribution.
  • 54. 2.1 Recall DNN, Cost Function, MSE as loss — C(θ) = (1/K) Σ_k loss(ŷ_k, ȳ_k) = (1/K) Σ_k ‖ŷ_k − ȳ_k‖. Same setup: dataset sampled from the real world, ȳ1 = 1, ȳ2 = 0. Wrong content!! Classification and regression are not going to learn a probability distribution.
  • 55. 2.1 Recall DNN, Cost Function, Cross Entropy as loss — C(θ) = (1/K) Σ_k loss(ŷ_k, ȳ_k) with the cross-entropy loss: C(θ) = −(1/K) Σ_k ( ȳ_k ln(ŷ_k) + (1 − ȳ_k) ln(1 − ŷ_k) ). Same setup: dataset sampled from the real world, ȳ1 = 1, ȳ2 = 0. Wrong content!! Classification and regression are not going to learn a probability distribution.
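  A small NumPy sketch of the two losses used on these slides, averaged over K samples; the epsilon clipping is an added assumption to avoid log(0), and the example numbers are made up.

    import numpy as np

    def mse_cost(y_hat, y_bar):
        # C(theta) = (1/K) * sum_k ||y_hat_k - y_bar_k||^2
        return np.mean((y_hat - y_bar) ** 2)

    def cross_entropy_cost(y_hat, y_bar, eps=1e-12):
        # C(theta) = -(1/K) * sum_k [ y_bar_k ln(y_hat_k) + (1 - y_bar_k) ln(1 - y_hat_k) ]
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y_bar * np.log(y_hat) + (1 - y_bar) * np.log(1 - y_hat))

    y_bar = np.array([1.0, 0.0, 1.0])    # labels: js, not js, js
    y_hat = np.array([0.9, 0.2, 0.6])    # model outputs
    print(mse_cost(y_hat, y_bar), cross_entropy_cost(y_hat, y_bar))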
  • 56. 2.1 Recall DNN, Cost Function, MSE as loss — to visualize how the different cost functions behave, rewrite loss(ŷ_k, ȳ_k) as Φ(z) with z = ŷ_k · ȳ_k and ȳ_k ∈ {−1, 1}. Accuracy: L(z) = (sign(z) + 1)/2. Logistic: L(z) = log(exp(−z) + 1)/log(2). MSE: L(z) = (ȳ − ŷ)² = (1 − z)². Hinge loss is plotted alongside.
  • 57. 2.1 Recall DNN, Optimization — find the best function. We have a function set, and we know how good a selected function is. How do we find the best one? Enumerate? Calculus!! (2. Model: Model Structure, Cost Function, Optimization)
  • 58. 2.1 Recall DNN, Optimization — find the best function; θ is the model parameter, plotted against the loss C(θ). For simplification, consider that θ has only one variable: 1. Randomly start at θ0. 2. Compute dC(θ0)/dθ. 3. θ1 ← θ0 − η · dC(θ0)/dθ. 4. Compute dC(θ1)/dθ. 5. θ2 ← θ1 − η · dC(θ1)/dθ. … η is the learning rate.
  • 59. 2.1 Recall DNN, Optimization — set the learning rate η carefully. Same procedure: randomly start at θ0, compute dC(θ0)/dθ, update θ1 ← θ0 − η · dC(θ0)/dθ, and repeat. Copy from http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20(v4).pdf
  • 60. 2.1 Recall DNN, Optimization — find the best function. Suppose θ has two variables θ1 and θ2: 1. Randomly start at θ0 = [θ1^0, θ2^0]. 2. Compute the gradient of C(θ) at θ0: ∇C(θ0) = [∂C(θ0)/∂θ1, ∂C(θ0)/∂θ2]. 3. Update the parameters: [θ1^1, θ2^1] ← [θ1^0, θ2^0] − η · ∇C(θ0), and repeat. Each movement goes against the gradient of C(θ).
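  A minimal sketch of the same update rule θ ← θ − η·∇C(θ) in NumPy; the two-variable quadratic cost below is a toy assumption just to show the loop, not the model's real cost.

    import numpy as np

    def cost(theta):
        # toy cost with its minimum at (3, -1)
        return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

    def grad(theta):
        return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

    eta = 0.1                        # learning rate
    theta = np.array([0.0, 0.0])     # random starting point theta^0
    for step in range(100):
        theta = theta - eta * grad(theta)   # theta^i <- theta^(i-1) - eta * grad C(theta^(i-1))
    print(theta, cost(theta))        # converges towards (3, -1)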
  • 61. 2.1 Recall DNN, Optimization — optimize our model the same way: 1. Randomly start at θ0. 2. Compute the gradient of C(θ) at θ0: ∇C(θ0). 3. Update the parameters: θ1 ← θ0 − η · ∇C(θ0). To calculate ∇C(θ0) we need every partial derivative: { ∂C(θ0)/∂W_11^1, ∂C(θ0)/∂b_1^1, ∂C(θ0)/∂W_12^1, ∂C(θ0)/∂b_2^1, …, ∂C(θ0)/∂W_ij^L, ∂C(θ0)/∂b_i^L }.
  • 62. 2.1 Recall DNN, Diff on Computational Graph — chain rule on a chain: if y = g(x) and z = h(y), then dz/dx = dy/dx · dz/dy. Chain rule with branches: if t = g(s), u = h(s) and z = k(t, u), then dz/ds = ∂t/∂s · ∂z/∂t + ∂u/∂s · ∂z/∂u.
  • 63. 2.1 Recall DNN, Diff on Computational Graph — example: y = x · exp(x·x). As a graph: u = x·x, v = exp(u), y = x·v (nodes x, u, v, y; operators * and exp()).
  • 64. 2.1 Recall DNN, Diff on Computational Graph — forward pass at x = 2: u = 4, v = e^4, y = 2·e^4.
  • 65.–68. 2.1 Recall DNN, Diff on Computational Graph — backward pass for dy/dx at x = 2. Local derivatives: ∂v/∂u = exp(u) = exp(x²), ∂y/∂v = x, ∂y/∂x along the direct edge = v = exp(x²), and ∂u/∂x = x along each of the two edges feeding u = x·x (WARNING: these are different copies of x, so together ∂u/∂x = 2x). Summing the paths: dy/dx = x · exp(x²) · 2x + exp(x²) = (2x² + 1) · exp(x²), which at x = 2 is 9 · exp(4).
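  The same forward and backward passes written out in NumPy and checked against a finite-difference estimate; both should agree with (2x² + 1)·exp(x²) = 9·exp(4) at x = 2.

    import numpy as np

    def forward(x):
        u = x * x            # u = x*x
        v = np.exp(u)        # v = exp(u)
        y = x * v            # y = x * v
        return u, v, y

    def backward(x):
        u, v, y = forward(x)
        dy_dv = x
        dv_du = v
        du_dx = 2.0 * x      # the two edges into u = x*x each contribute x
        return dy_dv * dv_du * du_dx + v   # path through v, plus the direct edge dy/dx = v

    x = 2.0
    analytic = backward(x)                                        # 9 * exp(4)
    numeric = (forward(x + 1e-6)[2] - forward(x - 1e-6)[2]) / 2e-6
    print(analytic, numeric)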
  • 69. 2.1 Recall DNN, Diff on Computational Graph — Backpropagation. ŷ = f(X; θ) = A(W2 A(X·W1 + b1) + b2), θ = {W1, b1, W2, b2}, drawn as a computational graph (nodes: tensors, edges: operators) ending in the cost C(ŷ, ȳ) = (ȳ − ŷ)². To get ∂C(θ0)/∂W_11^1, multiply the local derivatives back along the graph: ∂C/∂ŷ = −2(ȳ − ŷ), ∂ŷ/∂z2 = A'(z2), ∂z2/∂a1 = W2, ∂a1/∂z1 = A'(z1), ∂z1/∂W1 = X, where z2 = W2·a1 + b2, a1 = A(z1), z1 = W1·X + b1.
  • 70. 2.1 Recall DNN, Diff on Computational Graph — Backpropagation, same graph viewed as signal flow: the error signal ∂C/∂ŷ is propagated backwards through Layer 2 and Layer 1, and each layer acts as an amplifier (the local derivatives A'(z2), W2, A'(z1), X scale the error signal on its way back).
  • 71. 2.1 Recall DNN, Optimization (batch) — training data: {(X1, ȳ1), (X2, ȳ2), …, (Xr, ȳr), …, (XR, ȳR)}. Gradient Descent: θi ← θi−1 − η·∇C(θi−1) with C(θi−1) = (1/R) Σ_r Cr(θi−1) — use all samples for each update. Stochastic Gradient Descent: θi ← θi−1 − η·∇Cr(θi−1) — use one sample for each update. Mini-Batch Gradient Descent: θi ← θi−1 − η·∇C(θi−1) with C(θi−1) = (1/B) Σ_{r∈b} Cr(θi−1) — use one batch b of size B for each update.
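  A sketch of the mini-batch loop in NumPy; the logistic-regression per-sample gradient below is only a stand-in (assumption) for ∇Cr(θ) of whatever model is actually being trained.

    import numpy as np

    rng = np.random.default_rng(0)
    R, B, eta = 1000, 32, 0.01
    X = rng.normal(size=(R, 5))
    y = rng.integers(0, 2, size=R).astype(float)
    theta = np.zeros(5)

    def per_sample_grad(theta, x_r, y_r):
        # stand-in gradient of one sample's loss C_r(theta)
        p = 1.0 / (1.0 + np.exp(-x_r @ theta))
        return (p - y_r) * x_r

    for epoch in range(5):
        idx = rng.permutation(R)
        for start in range(0, R, B):            # mini-batch gradient descent
            batch = idx[start:start + B]
            g = np.mean([per_sample_grad(theta, X[r], y[r]) for r in batch], axis=0)
            theta = theta - eta * g             # theta <- theta - eta * (1/B) * sum_r grad C_r
    # B = R gives plain gradient descent, B = 1 gives stochastic gradient descent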
  • 72. 2.2 Let's Go Back to RNN - Why is memory important? - What kind of problem can be handled by an RNN? - Simple RNN structure - Why is it so hard to train? - Classic variants (LSTM, GRU)
  • 73. 2.2 Why memory important — toy example: add 199 + 333 = 532 digit by digit. At each of the three timesteps the input is the pair of digits (X1, X2) — (9, 3), then (9, 3), then (1, 3) — so the input is 2-dimensional and the output is the 1-dimensional result digit (2, then 3, then 5).
  • 74.–83. 2.2 Why memory important — build slides stepping through a hand-wired network with a single memory cell c1 that holds the carry: at each timestep the two input digits are summed with the stored carry, the ones digit is emitted as the output, and the new carry is written back into c1 for the next timestep (9 + 3 = 12 → output 2, carry 1; 9 + 3 + 1 = 13 → output 3, carry 1; 1 + 3 + 1 = 5 → output 5). Without that memory, the same input pair (9, 3) would have to produce two different outputs.
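  The same point in a few lines of Python: adding 199 + 333 digit by digit only works if something remembers the carry between timesteps, and the carry variable below plays the role of the memory cell c1 (the code is an illustration, not the network drawn on the slides).

    # add 199 + 333 digit by digit, least significant digit first
    a_digits = [9, 9, 1]        # 199
    b_digits = [3, 3, 3]        # 333
    carry = 0                   # the "memory cell" c1
    out = []
    for x1, x2 in zip(a_digits, b_digits):   # inputs X1, X2 at each timestep
        s = x1 + x2 + carry
        out.append(s % 10)      # emit the ones digit
        carry = s // 10         # remember the carry for the next timestep
    print(out[::-1])            # [5, 3, 2] -> 532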
  • 84. 2.3 Simple RNN — compared with a DNN: the DNN maps one input to one output (one to one), while the RNN consumes a sequence X1, X2, X3, X4, … and can emit an output at every timestep (many to many).
  • 86. 2.3 Simple RNN — input/output configurations: many to many (X1, X2, … → ŷ1, ŷ2, …) and one to many (X1 → ŷ1, ŷ2, ŷ3, ŷ4).
  • 87. 2.3 Simple RNN — many to many: one output per input timestep.
  • 88. 2.3 Simple RNN — one to many: a single input X1 drives a whole output sequence.
  • 89. 2.3 Simple RNN, Model Structure — many to one: the character sequence X1 … X10 ('v', 'a', 'r', ' ', 'e', …, 'E') is read step by step and a single label ('js') is produced at the end.
  • 90.–93. 2.3 Simple RNN, Model Structure — at each timestep the unit computes z_i = W_i · X_t from the input, z_h = W_h · h from the previous state, and the new state/output a = A(z_i + z_h + b), i.e. f(a, b, c) = A(a + b + c) applied to the three terms. Unrolling this over the timesteps X1, X2, X3, …, X10, the final output feeds the cost C (here the 'js' label).
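  A minimal NumPy sketch of this recurrence unrolled over a character sequence; the sizes, the random weights, and the final sigmoid readout Wy that turns the last state into a single "is this .js?" score are assumptions added for illustration (the slides only draw the recurrent part).

    import numpy as np

    vocab, hidden = 98, 16
    rng = np.random.default_rng(0)
    Wi = rng.normal(size=(hidden, vocab)) * 0.01
    Wh = rng.normal(size=(hidden, hidden)) * 0.01
    b = np.zeros(hidden)
    Wy = rng.normal(size=(1, hidden)) * 0.01    # assumed many-to-one readout

    def rnn_score(char_ids):
        h = np.zeros(hidden)                    # initial state
        for c in char_ids:
            x = np.zeros(vocab)
            x[c] = 1.0                          # one-hot input X_t
            h = np.tanh(Wi @ x + Wh @ h + b)    # a = A(W_i X_t + W_h h + b)
        return 1.0 / (1.0 + np.exp(-(Wy @ h)))  # single score from the last state

    print(rnn_score([23, 2, 19, 0, 6, 23, 24, 7, 19, 32]))   # 'var evwfrE'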
  • 94. 2.3 Simple RNN, Cost Function — ŷ = f(X_t, h; θ) = A(W_i · X_t + W_h · h + b), θ = {W_i, W_h, b}; the model structure is the function set, and the same function is applied at every timestep. Picking the best model = the best function = the best θ*. With cross-entropy cost: C(θ) = −(1/K) Σ_k ( ȳ_k ln(ŷ_k) + (1 − ȳ_k) ln(1 − ŷ_k) ).
  • 95. 2.3 Simple RNN, Optimization — Backpropagation? Same setup: ŷ = f(X_t, h; θ) = A(W_i · X_t + W_h · h + b), θ = {W_i, W_h, b}, cross-entropy cost C(θ); how do we compute the gradients through the recurrence?
  • 96.–98. 2.3 Simple RNN, BPTT on the computational graph — unroll the recurrence over every timestep X1 … X10 (backpropagation through time); ∂C(θ0)/∂W_h is the sum of the gradient contributions from all the unrolled nodes that use W_h.
  • 99. 2.3 Simple RNN, BPTT, gradient problem — over the sequence X1 … X10 ('v', 'a', 'r', ' ', 'e', …, 'E' → 'js') the error signal ઠ passes through the same amplifier at every timestep, so updating W needs Error Signal × (Amplifier)^10: the gradient vanishes if the amplifier is small and explodes if it is large.
  • 100. 2.3 Simple RNN, BPTT, gradient problem — updating W needs Error Signal × (Amplifier)^10, so depending on the direction and magnitude of the recurrent weights the gradient either vanishes or explodes (vanishing / exploding gradient problem).
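  A quick numeric illustration of the amplifier effect, with a scalar standing in for the recurrent factor that the error signal is multiplied by at every one of the 10 timesteps.

    error_signal = 1.0
    for amplifier in (0.5, 1.0, 1.5):
        print(amplifier, error_signal * amplifier ** 10)   # error signal * (amplifier)^10
    # 0.5 -> ~0.001 (vanishing), 1.0 -> 1.0, 1.5 -> ~57.7 (exploding)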
  • 101. 2.4 LSTM — hidden unit: [Input]: state_(t-1), Input_t, Cell_State_(t-1); [Output]: state_t, Output_t, Cell_State_t. Input gate: whether to let the [input] in. Forget gate: whether to update the Cell_State. Output gate: whether to emit the output.
  • 102.–105. 2.4 LSTM — the gates are all computed from the current input X_t and the previous state h_(t-1): input gate z_i = σ(W_i · [X_t, h_(t-1)]); candidate input z = σ(W · [X_t, h_(t-1)]), multiplied elementwise by the input gate; forget gate z_f = σ(W_f · [X_t, h_(t-1)]), multiplied elementwise with the old cell state; cell update (elementwise addition) c_t = z_f ⊙ c_(t-1) + z_i ⊙ z; output gate z_o = σ(W_o · [X_t, h_(t-1)]); output h_t = z_o ⊙ tanh(c_t).
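  A minimal NumPy sketch of one LSTM step following these gates; the concatenation [X_t, h_(t-1)], the weight shapes, and the tanh on the candidate input (the slide draws it with σ) are the usual conventions, assumed here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wo, Wz, bi, bf, bo, bz):
        v = np.concatenate([x_t, h_prev])   # [X_t, h_(t-1)]
        z_i = sigmoid(Wi @ v + bi)          # input gate
        z_f = sigmoid(Wf @ v + bf)          # forget gate
        z_o = sigmoid(Wo @ v + bo)          # output gate
        z = np.tanh(Wz @ v + bz)            # candidate input
        c_t = z_f * c_prev + z_i * z        # elementwise multiply, elementwise add
        h_t = z_o * np.tanh(c_t)            # h_t = z_o ⊙ tanh(c_t)
        return h_t, c_t

    n_in, n_h = 98, 16
    rng = np.random.default_rng(0)
    Wi, Wf, Wo, Wz = (rng.normal(size=(n_h, n_in + n_h)) * 0.01 for _ in range(4))
    bi = bf = bo = bz = np.zeros(n_h)
    x = np.zeros(n_in); x[23] = 1.0
    h, c = lstm_step(x, np.zeros(n_h), np.zeros(n_h), Wi, Wf, Wo, Wz, bi, bf, bo, bz)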
  • 107. 2.5 GRU (Gated Recurrent Unit) — comparison: a plain RNN carries only the short-term state h_t; an LSTM carries both a long-term cell state c_t and a short-term state h_t; a GRU merges long and short term into a single state h_t.
  • 108.–110. 2.5 GRU — long + short term in one state: reset gate z_r = σ(W_r · [X_t, h_(t-1)]), multiplied elementwise with the old state; candidate state ĥ_t = tanh(W·X_t + W_c·(z_r ⊙ h_(t-1))); update gate z_u = σ(W_u · [X_t, h_(t-1)]); new state h_t = (1 − z_u) ⊙ h_(t-1) + z_u ⊙ ĥ_t.
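  And the matching NumPy sketch for one GRU step, with the same shape and concatenation assumptions as the LSTM sketch above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, Wr, Wu, W, Wc, br, bu):
        v = np.concatenate([x_t, h_prev])
        z_r = sigmoid(Wr @ v + br)                        # reset gate
        z_u = sigmoid(Wu @ v + bu)                        # update gate
        h_cand = np.tanh(W @ x_t + Wc @ (z_r * h_prev))   # candidate state
        return (1.0 - z_u) * h_prev + z_u * h_cand        # h_t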
  • 112. 3 How to Tune Your Model - Basic Metrics - Model HyperParameters - Visualization Inspiration - Optimization Setup - Network Structure
  • 113. 3.1 Basic Metrics — split the dataset: Real World → Training Data + Testing Data, and Training Data → Training Data + Validation Data (real testing comes last). Do I get good results on the training set? If NO: the code has a bug? cannot find a good function? bad model (no good function in the hypothesis set)? If YES: do I get good results on the validation set? If NO: overfitting?
  • 114. 3.2 Model HyperParameters — for the simple RNN ŷ = f(X_t, h; θ) = σ(W_i·X_t + W_h·h + b), θ = {W_i, W_h, b} are the parameters; everything else is a hyperparameter: number of epochs (training iterations), |h| (hidden layer size), number of layers, C() (cost function), batch size (stochastic, mini-batch, …), parameter initial values, A (activation function: tanh, sigmoid, ReLU), η (learning rate in θ1 ← θ0 − η·∇C(θ0)), regularization (Dropout, Zoneout), forget-gate bias, gate initialization, implicit zero padding. Grid search? (sketch below)
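  The "grid search?" idea as a sketch: try every combination of a few hyperparameter values and keep the best validation score; train_and_validate is a made-up placeholder for whatever training routine is actually used (here it just returns a random score so the loop runs).

    import random
    from itertools import product

    def train_and_validate(hidden_size, learning_rate, num_layers):
        # placeholder: train with these hyperparameters, return validation accuracy
        return random.random()

    grid = {"hidden_size": [32, 64, 128], "learning_rate": [1e-2, 1e-3], "num_layers": [1, 2]}
    best_config, best_score = None, -1.0
    for h, lr, nl in product(grid["hidden_size"], grid["learning_rate"], grid["num_layers"]):
        score = train_and_validate(h, lr, nl)
        if score > best_score:
            best_config, best_score = (h, lr, nl), score
    print(best_config, best_score)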
  • 115. 3.3 Visualization Inspiration — one layer, 10 hidden units, simple RNN.
  • 116. 3.3 Visualization Inspiration — one layer, 10 hidden units, simple RNN.
  • 117. 3.4 Optimization Setup — adaptive learning rate, gradient clipping, truncated BPTT, batch normalization, longer training time, …
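  One of those fixes, gradient clipping, in a few NumPy lines: rescale the gradient whenever its norm exceeds a threshold before applying the update (the threshold of 5.0 is an illustrative choice).

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)   # rescale so the norm equals max_norm
        return grad

    g = np.array([30.0, -40.0])               # norm 50, far above the threshold
    print(clip_gradient(g))                   # [ 3. -4.]  (norm 5)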
  • 118. 3.5 Network Structure — bi-directional RNN, GRU, LSTM, stacked RNN, Neural Turing Machine, …
  • 119. 4. View Problem in CNN — 'var evwfrE' becomes the index sequence [23, 2, 19, 0, 6, 23, 24, 7, 19, 32], which becomes a 98×10 one-hot matrix (one column x_1 … x_10 per character); a CNN slides filters over that matrix to produce the output y.
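  A sketch of that CNN view in NumPy: build the 98×10 one-hot matrix and slide one filter of width 3 along the time axis; the filter values are random, just to show the shapes involved.

    import numpy as np

    ids = [23, 2, 19, 0, 6, 23, 24, 7, 19, 32]   # 'var evwfrE'
    vocab, T = 98, len(ids)
    X = np.zeros((vocab, T))
    X[ids, np.arange(T)] = 1.0                   # 98 x 10 one-hot matrix

    rng = np.random.default_rng(0)
    width = 3
    filt = rng.normal(size=(vocab, width)) * 0.1
    # one activation per window of 3 consecutive characters
    feature_map = np.array([np.sum(filt * X[:, t:t + width]) for t in range(T - width + 1)])
    print(feature_map.shape)                     # (8,)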
  • 120. 5 A Sequence to Sequence Problem - Three different ways to solve the problem: HMM, CTC, Attention
  • 121. Topics Not Covered - Bi-directional RNN and other RNN variants - Attention-based RNN - structure learning vs. RNN
  • 122. Reference - Most of the content is from Prof. Lee @ NTU: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ - https://danijar.com/tips-for-training-recurrent-neural-networks/ - https://medium.com/@erikhallstrm/hello-world-rnn-83cd7105b767 These slides were shared with the Trend Micro scan engine team as an entry-level introduction for team members on 2017/9/4. If you have any suggestion or correction, please mail chuchuhao831@gmail.com or just leave a comment on the Google Slides -> link