RNN Share @ Trend Micro
chuchuhao
Outline
1. Problem Definition
2. Walk through RNN Model
3. How to tune best result
4. (optional) View Problem in CNN
5. (optional) A Sequence-to-Sequence Problem (includes HMM, CTC, Attention)
Author
chuchuhao (Tony)
Linkedin:
https://www.linkedin.com/in/chun-hao-wang-19b94271/
#1 Problem Definition
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
2. Model: Model Structure, Cost Function, Optimization
3. Evaluation: Metrics, Explanation
1-1 Define Problem .. (more tasks)
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
Given a .txt file, possible tasks:
Sequence Learning:
- Turn into pseudo code
- Output behavior sequence (without running)
- Summarize code
- Learn when to segment
Generative Model:
- Code generation
- Code patching
Supervised:
- Generate hash code via similarity
- Action to take (run sandbox or ..)
Unsupervised:
- Detect unusual coding style
Reinforcement:
- According to PC status, decide the next action
1-1 Define Problem ..
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
Task: given a .txt file, decide whether it is plain Text or JavaScript.
1-2 Sourcing Data
1. Representation: Define Problem, Sourcing Data, Cleaning, Normalization
Sources of js / .txt samples: Data Lake, Web, Benchmark Dataset, John’s Tool (Tim will share)
1-2 Sourcing Data … Sampling Bias
Scatter plot: samples from the Data Lake (Trend Micro), the Web, and a Benchmark Dataset, compared by # of API used vs. # of lines.
A valid JS script that appears in none of these banks: which one does it belong to?
1-3 Cleaning & Normalization
1.Representation
Define Problem
Sourcing Data
Cleaning
Normalization
.txt file
var evwfrEmu = new ActiveXObject("WScript.Shell");
//// 4fQei4pD7o69uOE8Trgp
YThMSpGKzmIcsQ =
evwfrEmu.ExpandEnvironmentStrings("%TEMP%")
+ "ssd" + Math.round(1e8 * Math.random());
var lmXPGbOCL0 = new
ActiveXObject("Msxml2.DOMDocument.6.0");
//// NN08kES93To
var MsslCA = new ActiveXObject("ADODB.Stream");
var dsgkpp = lmXPGbOCL0.createElement("tmp");
//// lZPifi9y
dsgkpp.dataType = "bin.base64";
//// wacLqE
//// bNloPOpvlZ8EdqKEn
MsslCA.Type = 1;
dsgkpp.text = "dmFyIGJUa3l
….
Source File
1-3 Cleaning & Normalization
1.Representation
Define Problem
Sourcing Data
Cleaning
Normalization
.txt file
.txt file
var evwfrE
[23, 2, 19, 0, 6, 23, 24, 7, 19, 32]
Cleaned
Normalized
d = np.zeros((10, 98))  # 10 timesteps, 98-symbol vocabulary
d[0,23] = 1
d[1,2] = 1
….
d[9,32] = 1
# data -> a sparse 2d one-hot matrix
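A minimal Python/NumPy sketch of this normalization step (the 98-symbol vocabulary size and the char-to-index mapping are assumptions; only the shape of the result matters here):

import numpy as np

def one_hot_encode(indices, vocab_size=98):
    # Turn a list of character indices into a (timesteps x vocab) one-hot matrix.
    d = np.zeros((len(indices), vocab_size))
    for t, idx in enumerate(indices):
        d[t, idx] = 1
    return d

x = one_hot_encode([23, 2, 19, 0, 6, 23, 24, 7, 19, 32])
print(x.shape)  # (10, 98)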
1-3 Cleaning & Normalization
1.Representation
Define Problem
Sourcing Data
Cleaning
Normalization
.txt file
F( < x1, x2, …., xn > ) = P(.js)
.txt file
Cleaning
Normalization
Raw Data
Cleaned Data
X y
js
Label ȳ
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
f(X; θ) = P(.js)
f(X; θ) ∈ {1, 0}
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
y = f(X; θ)
What the function looks like:
the Model Structure is a set of functions
(also called the Hypothesis set)
Inference uses one function from the set
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
The goodness of the chosen function: f(X; θ1) vs. f(X; θ2) — which one is better?
2.Model
Model Structure
Cost Function
Optimization
2. Model (Overview)
Find the best function
θ: the model parameters
minimize -1 * goodness(θ) to obtain θ_best
3.Evaluation
Metrics
Explanation
3. Evaluation
θ_best
Split the labeled .txt files (js / text) into a Training Set and a Testing Set.
Did the model learn the task?
3.Evaluation
Metrics
Explanation
3. Evaluation
f(.txt file; θ_best) = JS
label (js) vs. prediction
3.Evaluation
Metrics
Explanation
3. Evaluation
f(.txt file; θ_best) = JS,  label: js  -> True Positive
f(.txt file; θ_best) = JS,  label: txt -> False Positive (False Alarm)
f(.txt file; θ_best) = TXT, label: txt -> True Negative
f(.txt file; θ_best) = TXT, label: js  -> False Negative
3.Evaluation
Metrics
Explanation
3. Evaluation
Accuracy = (TP + TN) / (TP + TN + FP + FN)
3.Evaluation
Metrics
Explanation
3. Evaluation, Goodness (loss) vs. Metrics (accuracy)
          | Loss Function                           | Accuracy
Objective | Measure goodness of hypothesis function | Measure goodness of model on task
Property  | Continuous                              | Discrete
#2 Walk through RNN Model
Problem Definition:
Find a function f(X; θ) = P(.js)
Input : Output:
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
....
0 1 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0
95% probability
might be javascript
var evwfrE 98x10
2.1 Recall DNN
Problem Definition:
Find a function f(X; θ) = P(.js)
Input : DNN Input:
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
....
0 1 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0
var evwfrE : 98x10 one-hot matrix -> flattened to a (98*10)x1 = 980x1 vector
Each character column is stacked into one long vector; every entry is a scalar (0 or 1).
2.1 Recall DNN, Model Structure
‘v’
‘a’
…
‘s’
…
‘E’
1
2
j
...... 1
2
i
......
1
Input Layer Hidden Layer Output Layer
Input Layer ∈ R^(980x1)
Hidden Layer ∈ R^(4x1)
Output Layer ∈ R^(1x1)
Vertices here are `Scalars`
Edges here are `Operators`
2.1 Recall DNN, Model Structure
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
a_i = A(Z_i)
Z_i = W_i1 * x_1 + … + W_ij * x_j + ....
Input Layer  : X ∈ R^(980x1)
Hidden Layer : a ∈ R^(4x1)
Output Layer : y ∈ R^(1x1)
Weights      : W ∈ R^(4x980)
Activation Func : A : R -> R
2.1 Recall DNN, Model Structure
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
a = A( W0 * X + b0 )
y = W1 * a + b1
Input Layer  : X ∈ R^(980x1)
Hidden Layer : a ∈ R^(4x1)
Output Layer : y ∈ R^(1x1)
Weights : W0 ∈ R^(4x980), W1 ∈ R^(1x4)
Bias    : b0 ∈ R^(4x1),  b1 ∈ R^(1x1)
Activation Func : A : R -> R
2.1 Recall DNN, Model Structure = Inference
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
a = A( W0 * X + b0 )
y = W1 * a + b1
if y > 0.5 : Javascript
else : TEXT
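A minimal NumPy sketch of this inference step. The parameters are random placeholders (a real model would use trained W0, b0, W1, b1), and sigmoid is assumed as the activation A:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(4, 980)), np.zeros((4, 1))   # hidden layer
W1, b1 = rng.normal(size=(1, 4)),   np.zeros((1, 1))   # output layer
X = rng.integers(0, 2, size=(980, 1)).astype(float)    # flattened one-hot input

a = sigmoid(W0 @ X + b0)        # a = A(W0*X + b0)
y = (W1 @ a + b1).item()        # y = W1*a + b1
print("Javascript" if y > 0.5 else "TEXT")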
2.1 Recall DNN, Model Structure = Inference
1
2
j
......
1
2
i
......
1
Input Layer Hidden Layer Output Layer
W_ij, Z_i
y = f(X; θ) = P(.js)
W0 ∈ R^(4x980), W1 ∈ R^(1x4)
b0 ∈ R^(4x1),  b1 ∈ R^(1x1)
θ = { W0, b0, W1, b1 }
|θ| = total parameters = 4*980 + 4 + 4 + 1 = 3929
F(X) = P(.js): the true probability distribution
f(X; θ): the DNN model
2.1 Recall DNN, Why it works ?
Neural Network = Function approximator
> Given enough parameters, a NN can approximate any continuous function
`Universal approximation theorem`
x^(2)+ ((5y)/(4)- sqrt(abs(x)))^(2)=1
approximate
2.1 Recall DNN, Model Structure
Different NN structures: find a way to improve learnability
Deep NN, Convolutional NN, Recurrent NN, other "Minecraft" NN ..
2.1 Recall DNN, Model Structure
Different NN structures: find a way to improve learnability
Deep NN Deeper NN
2.1 Recall DNN, Deeper Model Structure
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
W_ij^L, Z_i^L
a_i^L = A(Z_i^L)
Z_i^L = W_i1^L * a_1^(L-1) + … + W_ij^L * a_j^(L-1) + ....
a_i^L = A( W_i1^L * a_1^(L-1) + … + W_ij^L * a_j^(L-1) + ... )
2.1 Recall DNN, Deeper Model Structure
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
W_ij^L, Z_i^L
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
θ = { W^1, b^1, W^2, b^2, … , W^L, b^L }
pick the best model = the best function = the best θ*
2.1 Recall DNN, Deeper Model Structure
...
...
‘v’
‘a’
…
...
Input
Layer Hidden Layer
Output
Layer
...
...
...
R 980x1
R 1x1
f_θ : R^980 -> R^1
Fully Connected Network
vectorize it !
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Output of a neuron: a_i^L   (Layer L, neuron i)
Output of one layer: a^L, a vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Weight: W_ij^L, from neuron j in Layer L-1 to neuron i in Layer L
W^L = [ W_11^L  W_12^L  …. ;  W_21^L  W_22^L  …. ;  …. ]  is an N_L x N_{L-1} matrix
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Bias: b_i^L, the bias for neuron i at Layer L
b^L = [ b_1^L ; b_2^L ; … ; b_i^L ; … ]  is an N_L x 1 vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Input of a neuron: z_i^L, the input to the activation function for neuron i at layer L
Input of one layer: z^L, a vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
Input of a neuron: z_i^L, the input to the activation function for neuron i at layer L
z_i^L = W_i1^L a_1^(L-1) + W_i2^L a_2^(L-1) + ... + b_i^L = ∑_{j=1}^{N_{L-1}} W_ij^L a_j^(L-1) + b_i^L
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
z_1^L = W_11^L a_1^(L-1) + W_12^L a_2^(L-1) + ... + b_1^L
z_2^L = W_21^L a_1^(L-1) + W_22^L a_2^(L-1) + ... + b_2^L
…
z_i^L = W_i1^L a_1^(L-1) + W_i2^L a_2^(L-1) + ... + b_i^L
In matrix form:
[ z_1^L ; z_2^L ; … ; z_i^L ] = [ W_11^L W_12^L …. ; W_21^L W_22^L …. ; …. ] [ a_1^(L-1) ; a_2^(L-1) ; … ] + [ b_1^L ; b_2^L ; … ; b_i^L ]
      (N_L x 1)                        (N_L x N_{L-1})                              (N_{L-1} x 1)                     (N_L x 1)
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
z^L = W^L a^(L-1) + b^L
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
a_i^L = A(z_i^L)
Relation between layers and output: a^L = A(z^L), applied elementwise, giving a vector
2.1 Recall DNN, Deeper Model Structure (vector)
Layer L-1 (N_{L-1} nodes) -> Layer L (N_L nodes)
z^L = W^L a^(L-1) + b^L ,  a^L = A(z^L)
a^L = A( W^L a^(L-1) + b^L )
Computational Graph: a^(L-1), W^L, b^L -> z^L -> a^L
Node: Tensor
Edge: Operator
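A minimal NumPy sketch of one layer's forward step, z^L = W^L a^(L-1) + b^L and a^L = A(z^L); the layer sizes (N_{L-1}=980, N_L=4) and the sigmoid activation are assumptions:

import numpy as np

def layer_forward(a_prev, W, b):
    z = W @ a_prev + b                   # z^L = W^L a^(L-1) + b^L
    return 1.0 / (1.0 + np.exp(-z))      # a^L = A(z^L), applied elementwise

rng = np.random.default_rng(0)
a_prev = rng.random((980, 1))            # a^(L-1)
W = rng.normal(size=(4, 980))            # W^L : N_L x N_{L-1}
b = np.zeros((4, 1))                     # b^L : N_L x 1
print(layer_forward(a_prev, W, b).shape) # (4, 1)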
2.1 Recall DNN, Deeper Model Structure
...
...
‘v’
‘a’
…
...
...
...
f_θ : R^980 -> R^1
Input -> Layer 1 -> Layer 2 -> … -> Layer L -> Output
(W^1, b^1)  (W^2, b^2)  …  (W^L, b^L)
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
2.1 Recall DNN, Deeper Model Structure
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
Computational Graph: X, W^1, b^1 -> z^1 -> a^1 -> z^2 -> … -> a^(L-1), W^L, b^L -> z^L
Node: Tensor
Edge: Operator
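A minimal sketch of the whole forward pass as a chain of such layers; the layer sizes [980, 4, 4, 1] and the sigmoid activation are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    a = X
    for W, b in params:                 # one (W^l, b^l) pair per layer
        a = sigmoid(W @ a + b)          # a^l = A(W^l a^(l-1) + b^l)
    return a

rng = np.random.default_rng(0)
sizes = [980, 4, 4, 1]
params = [(rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
X = rng.integers(0, 2, size=(980, 1)).astype(float)
print(forward(X, params).item())        # output score in (0, 1)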
2.1 Recall DNN, Deeper Model Structure
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
W_ij^L, Z_i^L
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
θ = { W^1, b^1, W^2, b^2, … , W^L, b^L }
pick the best model = the best function = the best θ*
2.1 Recall DNN, Deeper Model Structure
y = f(X; θ) = A( W^L … A( W^2 A( W^1 X + b^1 ) + b^2 ) … + b^L )
θ = { W^1, b^1, W^2, b^2, … , W^L, b^L }
pick the best model = the best function = the best θ*
2.Model
Model Structure (a set of functions)
Cost Function
Optimization
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
Objective Function: O(θ)
- How good the parameters are
2.Model
Model Structure
Cost Function
Optimization
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
Best Parameter: θ*
θ* = arg min_θ C(θ)
2.Model
Model Structure
Cost Function
Optimization
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
F(X) = P(.js): the true probability distribution;  f(X; θ): the DNN model
distance( F(X), f(X; θ) )
2.1 Recall DNN, Cost Function
Cost Function: C(θ)
- How bad the parameters are
- also called the `loss / error function`
F(X) = P(.js): the true probability distribution;  f(X; θ): the DNN model
distance( F(X), f(X; θ) )
We don't actually know the real distribution function
2.1 Recall DNN, Cost Function
X
Real Probability Distribution
Dataset Sampling
From Real World
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …
The machine learning model predicts y_k for each X_k, compared against the label ȳ_k.
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
Wrong content!! Classification and regression are not going to learn a probability distribution.
2.1 Recall DNN, Cost Function
Real Probability Distribution ? Dataset sampled from the real world:
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …   e.g. X1 = .txt file (js) -> ȳ1 = 1,  X2 = .txt file (txt) -> ȳ2 = 0
The machine learning model outputs y_k ∈ [0, 1].
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
Wrong content!! Classification and regression are not going to learn a probability distribution.
2.1 Recall DNN, Cost Function, MSE as loss
Real Probability Distribution ? Dataset sampled from the real world:
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …   e.g. ȳ1 = 1 (js), ȳ2 = 0 (txt)
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k ) = (1/k) ∑_k ‖ y_k - ȳ_k ‖^2
Wrong content!! Classification and regression are not going to learn a probability distribution.
2.1 Recall DNN, Cost Function, Cross Entropy as loss
Real Probability Distribution ? Dataset sampled from the real world:
(X1, ȳ1), (X2, ȳ2), (X3, ȳ3) …   e.g. ȳ1 = 1 (js), ȳ2 = 0 (txt)
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
C(θ) = -(1/k) ∑_k ( ȳ_k ln(y_k) + (1 - ȳ_k) ln(1 - y_k) )
Wrong content!! Classification and regression are not going to learn a probability distribution.
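A minimal sketch of the two losses, treating y as the model's predicted probability of js and ȳ as the 0/1 label (illustrative values only):

import numpy as np

def mse_loss(y, y_bar):
    return np.mean((y - y_bar) ** 2)

def cross_entropy_loss(y, y_bar, eps=1e-12):
    y = np.clip(y, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_bar * np.log(y) + (1 - y_bar) * np.log(1 - y))

y_pred  = np.array([0.95, 0.10, 0.60])
y_label = np.array([1.0,  0.0,  1.0])
print(mse_loss(y_pred, y_label), cross_entropy_loss(y_pred, y_label))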
2.1 Recall DNN, Cost Function, MSE as loss
C(θ) = (1/k) ∑_k loss( y_k, ȳ_k )
To visualize how the cost functions differ, plot each loss against the margin z = y_k * ȳ_k, with ȳ_k ∈ {-1, 1}:
Φ( y_k * ȳ_k ) <- loss( y_k, ȳ_k )
Accuracy (0/1): L(z) = (sign(z) + 1) / 2
Logistic      : L(z) = log(exp(-z) + 1) / log(2)
MSE           : L(z) = (y - ȳ)^2 = (1 - z)^2
Hinge Loss is also shown in the plot.
2.Model
Model Structure
Cost Function
Optimization
2.1 Recall DNN, Optimization
Find the Best function
Have
- A function set
Know
- How good is the selected function
How to find the best one?
Enumerate ? Calculus !!
2.1 Recall DNN, Optimization
Find the best function
θ: the model parameters; loss C(θ)
For simplicity, suppose θ has only one variable:
1. Randomly start at θ0
2. Compute dC(θ0) / dθ
3. θ1 <- θ0 - η * dC(θ0) / dθ
4. Compute dC(θ1) / dθ
5. θ2 <- θ1 - η * dC(θ1) / dθ
..
η is the learning rate
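A minimal sketch of this one-variable gradient descent loop on a made-up cost C(θ) = (θ - 3)^2, whose derivative we can write by hand:

def C(theta):
    return (theta - 3.0) ** 2

def dC(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0          # 1. start somewhere (fixed here for reproducibility)
eta = 0.1            # learning rate
for _ in range(50):  # repeat: compute the derivative, step downhill
    theta = theta - eta * dC(theta)
print(theta, C(theta))  # theta approaches 3, C approaches 0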
2.1 Recall DNN, Optimization
set the learning rate carefully
η is learning rate
For simplicity, suppose θ has only one variable:
1. Randomly start at θ0
2. Compute dC(θ0) / dθ
3. θ1 <- θ0 - η * dC(θ0) / dθ
4. Compute dC(θ1) / dθ
5. θ2 <- θ1 - η * dC(θ1) / dθ
..
Figure copied from http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20(v4).pdf
2.1 Recall DNN, Optimization
Find the best function
Suppose θ has two variables, θ1 and θ2:
1. Randomly start at θ0 = [ θ1^0 ; θ2^0 ]
2. Compute the gradient of C(θ) at θ0:  ∇C(θ0) = [ ∂C(θ0)/∂θ1 ; ∂C(θ0)/∂θ2 ]
3. Update parameters: [ θ1^1 ; θ2^1 ] <- [ θ1^0 ; θ2^0 ] - η * ∇C(θ0)
….
Gradient vs. Movement on the contour plot of C(θ)
2.1 Recall DNN, Optimization
1
2
j
......
1
2
i
......
Layer: L-1 Layer: L
Wij
L
Zi
L
Optimize our model
1. Randomly start at θ0
2. Compute the gradient of C(θ) at θ0: ∇C(θ0)
3. Update parameters: θ1 <- θ0 - η * ∇C(θ0)
To calculate ∇C(θ0), we need
{ ∂C(θ0)/∂W_11^1, ∂C(θ0)/∂b_1^1,
  ∂C(θ0)/∂W_12^1, ∂C(θ0)/∂b_2^1,
  … ,
  ∂C(θ0)/∂W_ij^L, ∂C(θ0)/∂b_i^L }
2.1 Recall DNN, Diff on Computational Graph
Case 1: x -> y -> z, with y = g(x), z = h(y), so z = f(x) = h(g(x))
dz/dx = dz/dy * dy/dx
Case 2: s -> t, s -> u, (t, u) -> z, with t = g(s), u = h(s), z = k(t, u), so z = f(s)
dz/ds = ∂z/∂t * ∂t/∂s + ∂z/∂u * ∂u/∂s
**
exp()
2.1 Recall DNN, Diff on Computational Graph
Example: y = x * exp( x*x ) : u = x*x, v = exp(u), w = x*v
x v
x
u w
x
**
exp()
2.1 Recall DNN, Diff on Computational Graph
Example: y = x * exp( x*x ) : u = x*x, v = exp(u), w = x*v
x v
x
u w
x
x = 2, y?
2
2 2
4 e^4 2*e^4
y = 2*e^4
Forward Pass
2.1 Recall DNN, Diff on Computational Graph
Example: y = x * exp( x*x ) : u = x*x, v = exp(u), w = x*v
dy/dx at x = 2 ?
Forward pass: x = 2, u = 4, v = e^4, y = w = 2*e^4
Backward pass (local derivative on each edge):
∂u/∂x = 2x
∂v/∂u = exp(u) = exp(x^2)
∂w/∂v = x
∂w/∂x = v = exp(u) = exp(x^2)   (WARNING: this is the direct edge x -> w, a different use of x)
Combine the two paths into w:
dy/dx = ∂w/∂x + ∂w/∂v * ∂v/∂u * ∂u/∂x
      = exp(x^2) + x * exp(x^2) * 2x
      = 2*x^2*exp(x^2) + exp(x^2) |x=2
      = 9*exp(4)
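A quick numerical check of this worked example (dy/dx = exp(x^2)(1 + 2x^2), which at x = 2 gives 9*e^4 ≈ 491.5):

import numpy as np

def f(x):
    return x * np.exp(x * x)

def analytic_grad(x):
    return np.exp(x ** 2) * (1 + 2 * x ** 2)

x, h = 2.0, 1e-6
numeric_grad = (f(x + h) - f(x - h)) / (2 * h)   # central difference
print(analytic_grad(x), numeric_grad)            # both ~ 9 * e^4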
2.1 Recall DNN, Diff on Computational Graph
y = f(X; θ) = A( W^2 A( W^1 X + b^1 ) + b^2 )
θ = { W^1, b^1, W^2, b^2 }
Computational Graph (Node: Tensor, Edge: Operator): X, W^1, b^1 -> z^1 -> a^1 ; a^1, W^2, b^2 -> z^2 -> y -> C (with label ȳ)
Backpropagation, e.g. for ∂C(θ0)/∂W_11^1:
C(y, ȳ) = (ȳ - y)^2        ->  ∂C/∂y = -2(ȳ - y)
y = A(z^2)                 ->  ∂y/∂z^2 = A'(z^2)
z^2 = W^2 * a^1 + b^2      ->  ∂z^2/∂a^1 = W^2
a^1 = A(z^1)               ->  ∂a^1/∂z^1 = A'(z^1)
z^1 = W^1 * X + b^1        ->  ∂z^1/∂W^1 = X
2.1 Recall DNN, Diff on Computational Graph
y = f(X; θ) = A( W^2 A( W^1 X + b^1 ) + b^2 ),  θ = { W^1, b^1, W^2, b^2 }
Backpropagation seen as an error signal flowing backwards through the computational graph:
the error signal ∂C/∂y = -2(ȳ - y) starts at the output, and each layer acts as an amplifier:
Layer 2: multiply by ∂y/∂z^2 = A'(z^2) and ∂z^2/∂a^1 = W^2
Layer 1: multiply by ∂a^1/∂z^1 = A'(z^1) and ∂z^1/∂W^1 = X
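A minimal sketch of backpropagation for this 2-layer network, assuming sigmoid as A and squared error as C; all shapes and values are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((980, 1))
W1, b1 = rng.normal(size=(4, 980)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1))
y_bar = 1.0

# forward pass
z1 = W1 @ X + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)

# backward pass: the error signal, amplified layer by layer
dC_dy  = -2 * (y_bar - y)
dC_dz2 = dC_dy * y * (1 - y)          # multiply by A'(z2) (sigmoid derivative)
dC_dW2 = dC_dz2 @ a1.T
dC_da1 = W2.T @ dC_dz2                # multiply by W2
dC_dz1 = dC_da1 * a1 * (1 - a1)       # multiply by A'(z1)
dC_dW1 = dC_dz1 @ X.T
print(dC_dW1.shape, dC_dW2.shape)     # (4, 980) (1, 4)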
2.1 Recall DNN, Optimization (batch)
Training Data: { (X1, ȳ1), (X2, ȳ2), … , (Xr, ȳr), … , (XR, ȳR) }
- Gradient Descent:             θi <- θi-1 - η * ∇C(θi-1),  C(θi-1) = (1/R) ∑_r Cr(θi-1)          (use all samples for each update)
- Stochastic Gradient Descent:  θi <- θi-1 - η * ∇Cr(θi-1)                                        (use one sample for each update)
- Mini-Batch Gradient Descent:  θi <- θi-1 - η * ∇C(θi-1),  C(θi-1) = (1/B) ∑_{r ∈ b} Cr(θi-1)    (use one batch b of B samples for each update)
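A minimal sketch contrasting these update schemes on a toy linear model y = θ·x with squared error; the data and the batch size are made up:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random(100)
Y = 3.0 * X + rng.normal(scale=0.1, size=100)   # underlying theta is 3

def grad(theta, x, y):                          # d/dtheta of mean (theta*x - y)^2
    return np.mean(2 * (theta * x - y) * x)

theta, eta = 0.0, 0.5
for epoch in range(20):
    order = rng.permutation(len(X))             # shuffle, then one update per mini-batch
    for batch in np.array_split(order, 10):     # batch size B = 10
        theta -= eta * grad(theta, X[batch], Y[batch])
# full-batch GD would call grad(theta, X, Y); SGD would use one sample at a time
print(theta)  # close to 3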
2.2 Let's Go Back to RNN
- Why is memory important?
- What kinds of problems can an RNN handle?
- Simple RNN structure
- Why is it so hard to train?
- Classic variants (LSTM, GRU)
2.2 Why memory important
Example: add two numbers digit by digit, 199 + 333 = 532.
Feed the digit pairs least-significant first: X1 = (9, 3), X2 = (9, 3), X3 = (1, 3).
Input: 2 dimensions per timestep.  Output: 1 dimension per timestep (2, then 3, then 5).
The carry has to be remembered between timesteps.
2.2 Why memory important
A tiny hand-wired network with one memory cell c1 carries the addition along: at every timestep the two
input digits are summed together with the stored carry, the ones digit is emitted as the output, and the
carry (0 or 1) is written back into c1 for the next timestep:
X1 = (9, 3): 9 + 3 = 12      -> output 2, carry 1
X2 = (9, 3): 9 + 3 + 1 = 13  -> output 3, carry 1
X3 = (1, 3): 1 + 3 + 1 = 5   -> output 5, carry 0
Without the memory cell the network cannot pass the carry from one timestep to the next.
2.3 Simple RNN
X1
1
X
X2
2
X3
3
X4
4
v.s. DNN
3
2
1
many to many
one to one
2.3 Simple RNN
X1
1
X2
2
X3
3
4
4
many to many
2.3 Simple RNN
X1
X2
1 2
X1
1 2 3 4
many to many one to many
2.3 Simple RNN
X1
X2
1 2
many to many
2.3 Simple RNN
X1
1 2 3 4
one to many
2.3 Simple RNN, Model Structure
X1
X2
X3
X4
many to one
X10
X5
‘v’ ‘a’ ‘r’ ‘ ‘ ‘e’ …. ‘E’
js
2.3 Simple RNN, Model Structure
Inputs X1, X2, X3, …; input weights W_i with bias b_i; recurrent weights W_h with bias b_h; intermediate values z_i, z_h; hidden state a.
2.3 Simple RNN, Model Structure
For input X_t and previous hidden state h:
z_i = W_i * X_t
z_h = W_h * h
a = A( z_i + z_h + b ),  i.e. the cell computes f(z_i, z_h, b) = A(z_i + z_h + b)
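A minimal sketch of this simple RNN step, a = A(W_i*X_t + W_h*h + b), scanned over a character sequence; the sizes (input 98, hidden 16) and tanh as A are assumptions:

import numpy as np

def rnn_forward(xs, Wi, Wh, b):
    h = np.zeros((Wh.shape[0], 1))          # initial hidden state
    for x in xs:                            # one step per timestep
        h = np.tanh(Wi @ x + Wh @ h + b)    # a = A(W_i*X_t + W_h*h + b)
    return h                                # final hidden state (many to one)

rng = np.random.default_rng(0)
Wi = rng.normal(size=(16, 98)) * 0.1
Wh = rng.normal(size=(16, 16)) * 0.1
b  = np.zeros((16, 1))
xs = [rng.integers(0, 2, size=(98, 1)).astype(float) for _ in range(10)]
print(rnn_forward(xs, Wi, Wh, b).shape)     # (16, 1)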
2.3 Simple RNN, Model Structure
The same cell (W_i, W_h, b, activation A) is applied at every timestep X1 … X10, passing the hidden state a forward.
2.3 Simple RNN, Model Structure
For the js-classification task (many to one over X1 … Xi), only the final hidden state feeds the output and the cost C.
2.3 Simple RNN, Cost Function
y = f(X_t, h; θ) = A( W_i * X_t + W_h * h + b )   (applied at every timestep)
θ = { W_i, W_h, b }
pick the best model = the best function = the best θ*
2.Model
Model Structure (function set)
Cost Function
Optimization
C(θ) = -(1/k) ∑_k ( ȳ_k ln(y_k) + (1 - ȳ_k) ln(1 - y_k) )
2.3 Simple RNN, Optimization, Back Propagation?
y = f(X_t, h; θ) = A( W_i * X_t + W_h * h + b )   (applied at every timestep)
θ = { W_i, W_h, b }
pick the best model = the best function = the best θ*
C(θ) = -(1/k) ∑_k ( ȳ_k ln(y_k) + (1 - ȳ_k) ln(1 - y_k) )
Can we still use back-propagation? Yes, by unrolling the network through time.
2.3 Simple RNN, BPTT, Computational Graph
Unroll the recurrence: the same W_h (and W_i, b) appears at every timestep X1 … X10, and the cost C sits at the end.
∂C(θ0)/∂W_h = the sum of the contributions from all timestep nodes in the unrolled graph.
2.3 Simple RNN, BPTT, gradient problem
Unrolled over the 10 input characters 'v' 'a' 'r' ' ' 'e' …. 'E' (label: js), the error signal δ from the output passes through the same amplifier at every timestep.
Updating W needs:  Error Signal * (Amplifier)^10
Amplifier < 1 -> vanishing gradient problem;  Amplifier > 1 -> exploding gradient problem
2.3 Simple RNN, BPTT, gradient problem
Updating W needs:  Error Signal * (Amplifier)^10
Vanishing or exploding gradient: the gradient's direction may still be fine while its magnitude becomes extremely small or extremely large.
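A tiny numeric illustration of the point above: multiplying an error signal by the same amplifier 10 times either vanishes or explodes (the amplifier values are made up):

for amplifier in (0.5, 1.0, 2.0):
    signal = 1.0
    for _ in range(10):          # 10 timesteps, one multiplication per step
        signal *= amplifier
    print(amplifier, signal)     # 0.5 -> ~0.001 (vanishing), 2.0 -> 1024 (exploding)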
2.4 LSTM
--- Hidden Unit ---
[Input]  : state_{t-1}, Input_t, Cell_State_{t-1}
[Output] : state_t, Output_t, Cell_State_t
Input Gate : whether to let the [input] in
Forget Gate: whether to update the Cell_State
Output Gate: whether to emit the output
2.4 LSTM
Inputs at time t: X_t, h_{t-1}, c_{t-1}
Input Gate      : z_i = σ( W_i · [X_t, h_{t-1}] )        whether to let the input in
Candidate input : z   = σ( W_z · [X_t, h_{t-1}] )        the [Input] itself
Forget Gate     : z_f = σ( W_f · [X_t, h_{t-1}] )        whether to keep the old cell state
Cell State      : c_t = z_f ⊙ c_{t-1} + z_i ⊙ z          (elementwise multiply, elementwise addition)
Output Gate     : z_o = σ( W_o · [X_t, h_{t-1}] )
Output          : h_t = z_o ⊙ tanh(c_t)
2.Model
Model Structure
Cost Function
Optimization
2.4 LSTM
Cross Entropy
BPTT
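A minimal sketch of one LSTM step as described above (sigmoid gates; tanh is used for the candidate and cell output here, the common choice, even where the slide draws σ); sizes are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Wz, Wf, Wo):
    xh = np.concatenate([x_t, h_prev])            # [X_t, h_{t-1}]
    z_i = sigmoid(Wi @ xh)                        # input gate
    z   = np.tanh(Wz @ xh)                        # candidate input
    z_f = sigmoid(Wf @ xh)                        # forget gate
    c_t = z_f * c_prev + z_i * z                  # elementwise cell-state update
    z_o = sigmoid(Wo @ xh)                        # output gate
    h_t = z_o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_h = 98, 16
Wi, Wz, Wf, Wo = (rng.normal(size=(n_h, n_in + n_h)) * 0.1 for _ in range(4))
h, c = np.zeros(n_h), np.zeros(n_h)
x = rng.integers(0, 2, size=n_in).astype(float)
h, c = lstm_step(x, h, c, Wi, Wz, Wf, Wo)
print(h.shape, c.shape)   # (16,) (16,)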
2.5 GRU (Gated Recurrent Unit)
RNN  : input X_t, h_{t-1} -> output y_t, h_t                    (short-term state only)
LSTM : input X_t, h_{t-1}, c_{t-1} -> output y_t, h_t, c_t      (short-term h + long-term c)
GRU  : input X_t, h_{t-1} -> output y_t, h_t                    (long + short term combined?)
2.5 GRU
GRU: input X_t, h_{t-1} -> output y_t, h_t   (long + short term combined?)
Reset Gate      : z_r = σ( W_r · [X_t, h_{t-1}] )
Candidate State : ĥ = tanh( W · X_t + W_c · (z_r ⊙ h_{t-1}) )      (elementwise multiply)
Update Gate     : z_u = σ( W_u · [X_t, h_{t-1}] )
New State       : h_t = (1 - z_u) ⊙ h_{t-1} + z_u ⊙ ĥ
2.5 GRU
Parameters: W_r (reset), W_u (update), W and W_c (candidate); inputs X_t, h_{t-1}; output h_t.
2.Model
Model Structure
Cost Function
Optimization
Cross Entropy
BPTT
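A minimal sketch of one GRU step as described above, h_t = (1 - z_u)⊙h_{t-1} + z_u⊙ĥ; sizes are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wu, W, Wc):
    xh = np.concatenate([x_t, h_prev])              # [X_t, h_{t-1}]
    z_r = sigmoid(Wr @ xh)                          # reset gate
    h_hat = np.tanh(W @ x_t + Wc @ (z_r * h_prev))  # candidate state
    z_u = sigmoid(Wu @ xh)                          # update gate
    return (1 - z_u) * h_prev + z_u * h_hat         # new hidden state

rng = np.random.default_rng(0)
n_in, n_h = 98, 16
Wr = rng.normal(size=(n_h, n_in + n_h)) * 0.1
Wu = rng.normal(size=(n_h, n_in + n_h)) * 0.1
W  = rng.normal(size=(n_h, n_in)) * 0.1
Wc = rng.normal(size=(n_h, n_h)) * 0.1
x = rng.integers(0, 2, size=n_in).astype(float)
print(gru_step(x, np.zeros(n_h), Wr, Wu, W, Wc).shape)   # (16,)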
3 How to Tune Your Model
- Basic Metrics
- Model HyperParameters
- Visualization Inspiration
- Optimization Setup
- Network Structure
3.1 Basic Metrics
DataSet vs. Real World
Split: Training Data | Testing Data
Split again: Training Data | Validation Data | Real Testing
Do I get good results on the training set?
  No  -> code has a bug? cannot find a good function? bad model (no good function in the hypothesis set)?
  Yes -> Do I get good results on the validation set?
           No  -> overfitting?
           Yes -> move on to real testing
y = f(X_t, h; θ) = σ( W_i * X_t + W_h * h + b ),   θ = { W_i, W_h, b }
3.2 Model HyperParameter
Simple RNN
These are hyperparameters:
- number of epochs (training iterations)
- |h|: hidden layer size
- number of layers
- C(): cost function
- batch size (stochastic, mini-batch, ..)
- parameter initial values
- A: activation function (tanh, sigmoid, ReLU)
- η: learning rate (θ1 <- θ0 - η * ∇C(θ0))
- regularization (Dropout, Zoneout)
- forget gate bias
- gate initialization
- implicit zero padding
Grid search?
3.3 Visualization Inspiration
one layer, 10 hidden unit simple rnn
3.3 Visualization Inspiration
one layer, 10 hidden unit simple rnn
3.4 Optimization Setup
Adaptive Learning Rate
Gradient Clipping (see the sketch below)
Truncated BPTT
Batch Normalization
Longer Training Time
….
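A minimal sketch of the gradient clipping item above: rescale the gradient when its norm exceeds a chosen threshold (the threshold value here is arbitrary):

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # keep direction, shrink magnitude
    return grad

g = np.array([30.0, -40.0])               # an "exploding" gradient with norm 50
print(clip_gradient(g))                   # rescaled to norm 5: [ 3. -4.]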
3.5 Network Structure
Bi-Direction RNN …
GRU
LSTM
Stack RNN
Neural Turing Machine
….
4. View Problem in CNN
var evwfrE
[23, 2, 19, 0, 6, 23, 24, 7, 19, 32]
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
....
0 1 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0
var evwfrE 98x10
Figure: the characters x_1 … x_10 of the 98x10 one-hot input, processed by convolution filters (feature maps r) to produce the output y.
5 A Sequence to Sequence Problem
- Three different ways to solve the problem: HMM, CTC, Attention
Topics Not Cover
- Bi-directional RNN and other RNN variants
- Attention-based RNN
- structure learning vs. RNN
Reference
- Most of the content is from Prof. Lee @ NTU:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/
- https://danijar.com/tips-for-training-recurrent-neural-networks/
- https://medium.com/@erikhallstrm/hello-world-rnn-83cd7105b767
These slides were shared with the Trend Micro scan engine team as an entry-level introduction for
team members on 2017/9/4.
If you have any suggestion or correction, please mail chuchuhao831@gmail.com
Or just leave a comment on the Google Slides -> link
More Related Content

What's hot

Image Recognition with Neural Network
Image Recognition with Neural NetworkImage Recognition with Neural Network
Image Recognition with Neural Network
Sajib Sen
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithmsguestfee8698
 
Object calisthenics
Object calisthenicsObject calisthenics
Object calisthenics
PolSnchezManzano
 
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan ChowdhuryA simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
S. M. Risalat Hasan Chowdhury
 
Basic operations by novi reandy sasmita
Basic operations by novi reandy sasmitaBasic operations by novi reandy sasmita
Basic operations by novi reandy sasmita
beasiswa
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
Mike Anderson
 
211 738-1-pb
211 738-1-pb211 738-1-pb
211 738-1-pbishwari85
 
Procedural Content Generation with Clojure
Procedural Content Generation with ClojureProcedural Content Generation with Clojure
Procedural Content Generation with Clojure
Mike Anderson
 
Inheritance and-polymorphism
Inheritance and-polymorphismInheritance and-polymorphism
Inheritance and-polymorphism
Usama Malik
 
Chpater 6
Chpater 6Chpater 6
Chpater 6
EasyStudy3
 
JSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph PicklJSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph Pickl
Christoph Pickl
 
Swift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML PersonalizationSwift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML Personalization
Jacopo Mangiavacchi
 
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning BasicsDeep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Jason Tsai
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
Alex Miller
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
henrygarner
 
Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2
TOMMYLINK1
 
Image Processing
Image ProcessingImage Processing
Image Processingyuvhashree
 

What's hot (20)

Clojure class
Clojure classClojure class
Clojure class
 
Image Recognition with Neural Network
Image Recognition with Neural NetworkImage Recognition with Neural Network
Image Recognition with Neural Network
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
 
Object calisthenics
Object calisthenicsObject calisthenics
Object calisthenics
 
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan ChowdhuryA simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
A simple study on computer algorithms by S. M. Risalat Hasan Chowdhury
 
Basic operations by novi reandy sasmita
Basic operations by novi reandy sasmitaBasic operations by novi reandy sasmita
Basic operations by novi reandy sasmita
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
 
211 738-1-pb
211 738-1-pb211 738-1-pb
211 738-1-pb
 
Procedural Content Generation with Clojure
Procedural Content Generation with ClojureProcedural Content Generation with Clojure
Procedural Content Generation with Clojure
 
Inheritance and-polymorphism
Inheritance and-polymorphismInheritance and-polymorphism
Inheritance and-polymorphism
 
DCT
DCTDCT
DCT
 
Chpater 6
Chpater 6Chpater 6
Chpater 6
 
JSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph PicklJSUG - Effective Java Puzzlers by Christoph Pickl
JSUG - Effective Java Puzzlers by Christoph Pickl
 
Swift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML PersonalizationSwift for TensorFlow - CoreML Personalization
Swift for TensorFlow - CoreML Personalization
 
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning BasicsDeep Learning: Introduction & Chapter 5 Machine Learning Basics
Deep Learning: Introduction & Chapter 5 Machine Learning Basics
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
 
Clojure for Data Science
Clojure for Data ScienceClojure for Data Science
Clojure for Data Science
 
Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2
 
Image Processing
Image ProcessingImage Processing
Image Processing
 
Lec3
Lec3Lec3
Lec3
 

Similar to RNN sharing at Trend Micro

Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
Dongheon Lee
 
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
Deep Learning and TensorFlow
Deep Learning and TensorFlowDeep Learning and TensorFlow
Deep Learning and TensorFlow
Oswald Campesato
 
6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx
Venkateswara Babu Ravipati
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
ssuserd64918
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
Oswald Campesato
 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functionsankita44
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
AI Frontiers
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
Ding Li
 
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
台灣資料科學年會
 
User defined functions
User defined functionsUser defined functions
User defined functionsshubham_jangid
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Universitat Politècnica de Catalunya
 
Intro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.jsIntro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.js
Oswald Campesato
 
Deep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlowDeep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlow
Oswald Campesato
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
Oswald Campesato
 
C++ and Deep Learning
C++ and Deep LearningC++ and Deep Learning
C++ and Deep Learning
Oswald Campesato
 
Deep learning with C++ - an introduction to tiny-dnn
Deep learning with C++  - an introduction to tiny-dnnDeep learning with C++  - an introduction to tiny-dnn
Deep learning with C++ - an introduction to tiny-dnn
Taiga Nomi
 
Introduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlowIntroduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlow
Oswald Campesato
 

Similar to RNN sharing at Trend Micro (20)

Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
 
Deep Learning and TensorFlow
Deep Learning and TensorFlowDeep Learning and TensorFlow
Deep Learning and TensorFlow
 
6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx6-Python-Recursion PPT.pptx
6-Python-Recursion PPT.pptx
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
 
Part 3-functions
Part 3-functionsPart 3-functions
Part 3-functions
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
 
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
 
User defined functions
User defined functionsUser defined functions
User defined functions
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Intro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.jsIntro to Deep Learning, TensorFlow, and tensorflow.js
Intro to Deep Learning, TensorFlow, and tensorflow.js
 
Deep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlowDeep Learning: R with Keras and TensorFlow
Deep Learning: R with Keras and TensorFlow
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
 
C++ and Deep Learning
C++ and Deep LearningC++ and Deep Learning
C++ and Deep Learning
 
Deep learning with C++ - an introduction to tiny-dnn
Deep learning with C++  - an introduction to tiny-dnnDeep learning with C++  - an introduction to tiny-dnn
Deep learning with C++ - an introduction to tiny-dnn
 
Functions12
Functions12Functions12
Functions12
 
Functions123
Functions123 Functions123
Functions123
 
Introduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlowIntroduction to Deep Learning and TensorFlow
Introduction to Deep Learning and TensorFlow
 

Recently uploaded

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 

Recently uploaded (20)

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 

RNN sharing at Trend Micro

  • 1. RNN Share @ Trend Micro chuchuhao
  • 2. Outline 1. Problem Definition 2. Walk through RNN Model 3. How to tune best result 4. (optional) View Problem in CNN 5. (optional) A Sequence to Sequence Problem (Include HMM, CTC, Attention
  • 4. #1 Problem Definition 1.Representation 2.Model 3.Evaluation Define Problem Sourcing Data Cleaning Normalization Model Structure Cost Function Optimization Metrics Explanation
  • 5. 1-1 Define Problem .. (more Task) 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file Sequence Learning: - Turn to Pseudo Code - Output Behavior Seq (without run) - Summarize Code - Learn When to Segment Generative Model: - Code Generation - Code repatch Supervised: - Gen Hash Code Via Similarity - Action to take (run sandbox or ..) Unsupervised: - Detect Unusual Coding Style Reinforcement: - Accroding PC statuc decide next action
  • 6. .txt file .txt file Text 1-1 Define Problem .. 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file JavaScript
  • 7. 1-2 Sourcing Data 1.Representation Define Problem Sourcing Data Cleaning Normalization Tim will share XDD js js .txt file js Data Lake Web Benchmark Dataset John’s Tool->
  • 8. 1-2 Sourcing Data … Sampling Bias Data Lake (Trend Micro) Web Benchmark Dataset # of API used #oflines Valid JS script, but not in any bank どっち ?
  • 9. 1-3 Cleaning & Normalization 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file var evwfrEmu = new ActiveXObject("WScript.Shell"); //// 4fQei4pD7o69uOE8Trgp YThMSpGKzmIcsQ = evwfrEmu.ExpandEnvironmentStrings("%TEMP%") + "ssd" + Math.round(1e8 * Math.random()); var lmXPGbOCL0 = new ActiveXObject("Msxml2.DOMDocument.6.0"); //// NN08kES93To var MsslCA = new ActiveXObject("ADODB.Stream"); var dsgkpp = lmXPGbOCL0.createElement("tmp"); //// lZPifi9y dsgkpp.dataType = "bin.base64"; //// wacLqE //// bNloPOpvlZ8EdqKEn MsslCA.Type = 1; dsgkpp.text = "dmFyIGJUa3l …. Source File
  • 10. 1-3 Cleaning & Normalization 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file .txt file var evwfrE [23, 2, 19, 0, 6, 23, 24, 7, 19, 32] Cleaned Normalized d = np.zeros((10, 98)) d[0,23] = 1 d[0,2] = 1 …. d[0,32] = 1 # data -> a sparse 2d matrix
  • 11. 1-3 Cleaning & Normalization 1.Representation Define Problem Sourcing Data Cleaning Normalization .txt file ( ) = (.js)< , , …., > .txt file Cleaning Normalization Raw Data Cleaned Data X y js Label ȳ
  • 12. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) (X; ) = (.js) (X; ) = {1, 0}
  • 13. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) ( ) = (X; ) What Functions Looks Like Model Structure is a set of functions (or named Hypothesis set) Inference using one of funtion in set
  • 14. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) (X; 1) The goodness of chosen function ( ) 12 3 4 (X; 2) Which one better ?
  • 15. 2.Model Model Structure Cost Function Optimization 2. Model (Overview) Find the Best function , model parameter -1*goodness() best
  • 16. 3.Evaluation Metrics Explanation 3. Evaluation best .txt file js .txt file js .txt file js .txt file js Training Set Testing Set Learn Task ?
  • 18. 3.Evaluation Metrics Explanation 3. Evaluation .txt file( ; best) = JS, True Positive js .txt file( ; best) = TXT, False Positive txt .txt file( ; best) = TXT, True Negative txt .txt file( ; best) = TXT, False Negative js False Alarm
  • 20. 3.Evaluation Metrics Explanation 3. Evaluation, Goodness(loss) v.s Metrics(accuracy) Loss Function Accuracy Objective Measure Goodness of hypothesis function Measure Goodness of Model on Task Property Continuous Discrete
  • 21. #2 Walk through RNN Model Problem Definition: Find a function (X; ) = (.js) Input : Output: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 .... 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 95% probability might be javascript var evwfrE 98x10
  • 22. 2.1 Recall DNN Problem Definition: Find a function (X; ) = (.js) Input : DNN Input: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 .... 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 var evwfrE 98x10 (98*10)x1 varevwfrE 0 0 0 1 0 0 0 0 0 0 0 0 0 .. 0 0 1 980 2 scalar scalar scalar ...
  • 23. 2.1 Recall DNN, Model Structure ‘v’ ‘a’ … ‘s’ … ‘E’ 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Input Layer ∈ R 980x1 Hidden Layer ∈ R 4x1 Output Layer ∈ R 1x1 Verticles here are `Scalar` Edges heare are `Operator`
  • 24. 2.1 Recall DNN, Model Structure 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi i = A(Zi ) Zi = Wi1 * +…+Wij * + ....1 j Input Layer : X ∈ R 980x1 Hidden Layer : a ∈ R 4x1 Output Layer : y ∈ R 1x1 Weighted : W ∈ R 4 x 980 Activation Func : A ∈ RxR
  • 25. 2.1 Recall DNN, Model Structure 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi a = A( X * W0 + b0 ) Input Layer : X ∈ R980x1 Hidden Layer : a ∈ R4x1 Output Layer : y ∈ R1x1 Weighted : W0 ∈R4x980 , W1 ∈R1x4 bias : b0 ∈R4x1 , b1 ∈R1x1 Activation Func : A ∈ RxR y = a * W1 + b1
  • 26. 2.1 Recall DNN, Model Structure = Inference 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi a = A( X * W0 + b0 ) y = a * W1 + b1 if y > 0.5 : Javascript else : TEXT
  • 27. 2.1 Recall DNN, Model Structure = Inference 1 2 j ...... 1 2 i ...... 1 Input Layer Hidden Layer Output Layer Wij Zi ( ) = (X; ) = y = (.js) W0 ∈ R 4x980 , W1 ∈ R 1x4 b0 ∈ R 4x1 , b1 ∈ R 1x1 = { } | | = total parameter = 4*980 +4 +4 +1 = 3929
  • 28. F(X) = (.js): Probability Distribution (X; ): DNN Model 2.1 Recall DNN, Why it works ? Neural Network = Function approximator > Given enough variable, nn can approximate any continuous function `Universal approximation theorem` x^(2)+ ((5y)/(4)- sqrt(abs(x)))^(2)=1 approximate
  • 29. 2.1 Recall DNN, Model Structure Different NN structure: find a way to imporove learnability Deep NN Recurrent NNConvolution NN Other Mincraft NN .. 1 2 1 2
  • 30. 2.1 Recall DNN, Model Structure Different NN structure: find a way to imporove learnability Deep NN Deeper NN
  • 31. 2.1 Recall DNN, Deeper Model Structure 1 2 j ...... 1 2 i ...... Layer: L-1 Layer: L Wij L Zi L i = ai L = A(Zi L ) Zi L = Wi1 L * +…+Wij L * + .... 1 j ai L = A(Wi1 L * ai L-1 +…+Wij L *ai L-1 +...
  • 32. 2.1 Recall DNN, Deeper Model Structure 1 2 j ...... 1 2 i ...... Layer: L-1 Layer: L Wij L Zi L = (X; ) = A( WL … A( W2 A( X * W1 + b1 )+ b2 ) … + bL ) = { W1 , b1 , W2 , b2 , … , WL , bL } pick the best model = best function = best *
  • 33. 2.1 Recall DNN, Deeper Model Structure ... ... ‘v’ ‘a’ … ... Input Layer Hidden Layer Output Layer ... ... ... R 980x1 R 1x1 fӨ : R 980 ➡R 1 Fully Connected Network vectorize it !
  • 34. a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes Output of a neuron: ai L Output of one layer: aL : a vector a0 L-1 a1 L-1 aj L-1 a2 L ai L Layer L Neuron i
  • 35. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes Weight: Wij L a2 L ai L Layer L-1 to Layer L from neuron j in Layer L-1 to neuron i in Layer L W L = W11 L W12 L …. W21 L W22 L …. NL xNL-1 …. NL-1 NL W L
  • 36. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes Bias: bi L a2 L ai L bias for neuron i at Layer L b L = b1 L b2 L bi L NL ….…. 1 W L
  • 37. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L Input of a neuron: zi L Input of one layer: zL : a vector input of the activation function for neuron i at layer l
  • 38. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L Input of a neuron: zi L input of the activation function for neuron i at layer l zi L = Wi1 L a1 L-1 +Wi2 L a2 L-1 +...+ bi L = ∑ Wij L aj L-1 + bi Lj=1 NL-1
  • 39. …. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L z1 L = W11 L a1 L-1 +W12 L a2 L-1 +...+ b1 L z2 L = W21 L a1 L-1 +W22 L a2 L-1 +...+ b2 L … zi L = Wi1 L a1 L-1 +Wi2 L a2 L-1 +...+ bi L = W11 L W12 L …. W21 L W22 L …. …. z1 L z2 L zi L NL …. …. + a1 L-1 a2 L-1 ai L-1 …. …. b1 L b2 L bi L NL …. NL xNL-1 NL-1
  • 40. …. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L = W11 L W12 L …. W21 L W22 L …. …. z1 L z2 L zi L NL …. …. + a1 L-1 a2 L-1 ai L-1 …. …. b1 L b2 L bi L NL …. NL xNL-1 NL-1 z L = W L a L-1 + b L
  • 41. …. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L = a1 L a2 L ai L NL …. ai L =A(zi L ) …. A(z1 L ) A(z2 L ) A(zi L ) NL …. Relations between Layers and Output a L =A(z L ) : vector
  • 42. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L z L = W L a L-1 + b L , a L =A(z L ) a L =A(W L a L-1 + b L ) z L a La L-1 W L b L Computational Graph Node: Tensor Edge: Operator
  • 43. a0 L-1 a1 L-1 aj L-1 a1 L 2.1 Recall DNN, Deeper Model Structure (vector) 1 2 j ...... 1 2 i ...... Layer L-1 NL-1 Nodes Layer L NL Nodes a2 L ai L 1 W L Zi L z L = W L a L-1 + b L , a L =A(z L ) a L =A(W L a L-1 + b L ) z L a La L-1 W L b L Computational Graph Node: Tensor Edge: Operator
  • 44. 2.1 Recall DNN, Deeper Model Structure ... ... ‘v’ ‘a’ … ... ... ... ... fӨ : R 980 ➡R 1 Input Layer 1 Layer 2 Layer L Output W 1 , b 1 W 2 , b 2 W L , b L = (X; ) = A( WL … A( W2 A( X * W1 + b1 )+ b2 ) … + bL )
  • 45. 2.1 Recall DNN, Deeper Model Structure = (X; ) = A( WL … A( W2 A( X*W1 + b1 )+ b2 ) … + bL ) z 1 a 1 W1 b1 z 2 a L-1 W2 b2 z L …. Computational Graph Node: Tensor Edge: Operator W L b L
  • 46. 2.1 Recall DNN, Deeper Model Structure 1 2 j ...... 1 2 i ...... Layer: L-1 Layer: L Wij L Zi L = (X; ) = A( WL … A( W2 A( X * W1 + b1 ) + b2 ) … + bL ) = { W1 , b1 , W2 , b2 , … , WL , bL } pick the best model = best function = best *
  • 47. 2.1 Recall DNN, Deeper Model Structure = (X; ) = A( WL … A( W2 A( X * W1 + b1 ) + b2 ) … + bL ) = { W1 , b1 , W2 , b2 , … , WL , bL } pick the best model = best function = best * 2.Model Model Structure Cost Function Optimization functions set
  • 48. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` Object Function: O( ) - How good the parameter is 2.Model Model Structure Cost Function Optimization
  • 49. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` Best Parameter: *    * = arg min C( ) 2.Model Model Structure Cost Function Optimization
  • 50. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` F(X) = (.js): Probability Distribution (X; ): DNN Model distance ( , )
  • 51. 2.1 Recall DNN, Cost Function Cost Function: C( ) - How bad the parameter is - also called `loss/ error function` F(X) = (.js): Probability Distribution (X; ): DNN Model distance ( , ) Don’t actually know the real distribution function
  • 52. 2.1 Recall DNN, Cost Function X Real Probability Distribution Dataset Sampling From Real World (X1 , ȳ1 ), (X2 , ȳ2 ), (X3 , ȳ3 ) … X1 ȳ1 1 C( ) = ∑ loss ( k , ȳk ) = ∑ ( ) 1 k k Machine Learning Model 1 k k Wrong Content !! Classification and Regression is not going to learn a probability distribution
  • 53. 2.1 Recall DNN, Cost Function — dataset sampled from the real world: (X1, ȳ1), (X2, ȳ2), (X3, ȳ3), …, e.g. one .txt file labelled ȳ1 = 1 and another labelled ȳ2 = 0. The model is scored by C(θ) = (1/K) Σ_k loss(ŷ_k, ȳ_k). Wrong content!! Classification and regression are not going to learn a probability distribution.
  • 54. 2.1 Recall DNN, Cost Function, MSE as loss — C(θ) = (1/K) Σ_k loss(ŷ_k, ȳ_k) = (1/K) Σ_k ‖ŷ_k − ȳ_k‖. Same setup: dataset sampled from the real world, ȳ1 = 1, ȳ2 = 0. Wrong content!! Classification and regression are not going to learn a probability distribution.
  • 55. 2.1 Recall DNN, Cost Function, Cross Entropy as loss — C(θ) = (1/K) Σ_k loss(ŷ_k, ȳ_k) with the cross-entropy loss: C(θ) = −(1/K) Σ_k ( ȳ_k ln(ŷ_k) + (1 − ȳ_k) ln(1 − ŷ_k) ). Same setup: dataset sampled from the real world, ȳ1 = 1, ȳ2 = 0. Wrong content!! Classification and regression are not going to learn a probability distribution.
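  A small NumPy sketch of the two losses used on these slides, averaged over K samples; the epsilon clipping is an added assumption to avoid log(0), and the example numbers are made up.

    import numpy as np

    def mse_cost(y_hat, y_bar):
        # C(theta) = (1/K) * sum_k ||y_hat_k - y_bar_k||^2
        return np.mean((y_hat - y_bar) ** 2)

    def cross_entropy_cost(y_hat, y_bar, eps=1e-12):
        # C(theta) = -(1/K) * sum_k [ y_bar_k ln(y_hat_k) + (1 - y_bar_k) ln(1 - y_hat_k) ]
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y_bar * np.log(y_hat) + (1 - y_bar) * np.log(1 - y_hat))

    y_bar = np.array([1.0, 0.0, 1.0])    # labels: js, not js, js
    y_hat = np.array([0.9, 0.2, 0.6])    # model outputs
    print(mse_cost(y_hat, y_bar), cross_entropy_cost(y_hat, y_bar))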
  • 56. 2.1 Recall DNN, Cost Function, MSE as loss — to visualize how the different cost functions behave, rewrite loss(ŷ_k, ȳ_k) as Φ(z) with z = ŷ_k · ȳ_k and ȳ_k ∈ {−1, 1}. Accuracy: L(z) = (sign(z) + 1)/2. Logistic: L(z) = log(exp(−z) + 1)/log(2). MSE: L(z) = (ȳ − ŷ)² = (1 − z)². Hinge loss is plotted alongside.
  • 57. 2.1 Recall DNN, Optimization — find the best function. We have a function set, and we know how good a selected function is. How do we find the best one? Enumerate? Calculus!! (2. Model: Model Structure, Cost Function, Optimization)
  • 58. 2.1 Recall DNN, Optimization — find the best function; θ is the model parameter, plotted against the loss C(θ). For simplification, consider that θ has only one variable: 1. Randomly start at θ0. 2. Compute dC(θ0)/dθ. 3. θ1 ← θ0 − η · dC(θ0)/dθ. 4. Compute dC(θ1)/dθ. 5. θ2 ← θ1 − η · dC(θ1)/dθ. … η is the learning rate.
  • 59. 2.1 Recall DNN, Optimization — set the learning rate η carefully. Same procedure: randomly start at θ0, compute dC(θ0)/dθ, update θ1 ← θ0 − η · dC(θ0)/dθ, and repeat. Copy from http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20(v4).pdf
  • 60. 2.1 Recall DNN, Optimization — find the best function. Suppose θ has two variables θ1 and θ2: 1. Randomly start at θ0 = [θ1^0, θ2^0]. 2. Compute the gradient of C(θ) at θ0: ∇C(θ0) = [∂C(θ0)/∂θ1, ∂C(θ0)/∂θ2]. 3. Update the parameters: [θ1^1, θ2^1] ← [θ1^0, θ2^0] − η · ∇C(θ0), and repeat. Each movement goes against the gradient of C(θ).
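  A minimal sketch of the same update rule θ ← θ − η·∇C(θ) in NumPy; the two-variable quadratic cost below is a toy assumption just to show the loop, not the model's real cost.

    import numpy as np

    def cost(theta):
        # toy cost with its minimum at (3, -1)
        return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

    def grad(theta):
        return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

    eta = 0.1                        # learning rate
    theta = np.array([0.0, 0.0])     # random starting point theta^0
    for step in range(100):
        theta = theta - eta * grad(theta)   # theta^i <- theta^(i-1) - eta * grad C(theta^(i-1))
    print(theta, cost(theta))        # converges towards (3, -1)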
  • 61. 2.1 Recall DNN, Optimization — optimize our model the same way: 1. Randomly start at θ0. 2. Compute the gradient of C(θ) at θ0: ∇C(θ0). 3. Update the parameters: θ1 ← θ0 − η · ∇C(θ0). To calculate ∇C(θ0) we need every partial derivative: { ∂C(θ0)/∂W_11^1, ∂C(θ0)/∂b_1^1, ∂C(θ0)/∂W_12^1, ∂C(θ0)/∂b_2^1, …, ∂C(θ0)/∂W_ij^L, ∂C(θ0)/∂b_i^L }.
  • 62. 2.1 Recall DNN, Diff on Computational Graph — chain rule on a chain: if y = g(x) and z = h(y), then dz/dx = dy/dx · dz/dy. Chain rule with branches: if t = g(s), u = h(s) and z = k(t, u), then dz/ds = ∂t/∂s · ∂z/∂t + ∂u/∂s · ∂z/∂u.
  • 63. 2.1 Recall DNN, Diff on Computational Graph — example: y = x · exp(x·x). As a graph: u = x·x, v = exp(u), y = x·v (nodes x, u, v, y; operators * and exp()).
  • 64. 2.1 Recall DNN, Diff on Computational Graph — forward pass at x = 2: u = 4, v = e^4, y = 2·e^4.
  • 65.–68. 2.1 Recall DNN, Diff on Computational Graph — backward pass for dy/dx at x = 2. Local derivatives: ∂v/∂u = exp(u) = exp(x²), ∂y/∂v = x, ∂y/∂x along the direct edge = v = exp(x²), and ∂u/∂x = x along each of the two edges feeding u = x·x (WARNING: these are different copies of x, so together ∂u/∂x = 2x). Summing the paths: dy/dx = x · exp(x²) · 2x + exp(x²) = (2x² + 1) · exp(x²), which at x = 2 is 9 · exp(4).
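  The same forward and backward passes written out in NumPy and checked against a finite-difference estimate; both should agree with (2x² + 1)·exp(x²) = 9·exp(4) at x = 2.

    import numpy as np

    def forward(x):
        u = x * x            # u = x*x
        v = np.exp(u)        # v = exp(u)
        y = x * v            # y = x * v
        return u, v, y

    def backward(x):
        u, v, y = forward(x)
        dy_dv = x
        dv_du = v
        du_dx = 2.0 * x      # the two edges into u = x*x each contribute x
        return dy_dv * dv_du * du_dx + v   # path through v, plus the direct edge dy/dx = v

    x = 2.0
    analytic = backward(x)                                        # 9 * exp(4)
    numeric = (forward(x + 1e-6)[2] - forward(x - 1e-6)[2]) / 2e-6
    print(analytic, numeric)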
  • 69. 2.1 Recall DNN, Diff on Computational Graph — Backpropagation. ŷ = f(X; θ) = A(W2 A(X·W1 + b1) + b2), θ = {W1, b1, W2, b2}, drawn as a computational graph (nodes: tensors, edges: operators) ending in the cost C(ŷ, ȳ) = (ȳ − ŷ)². To get ∂C(θ0)/∂W_11^1, multiply the local derivatives back along the graph: ∂C/∂ŷ = −2(ȳ − ŷ), ∂ŷ/∂z2 = A'(z2), ∂z2/∂a1 = W2, ∂a1/∂z1 = A'(z1), ∂z1/∂W1 = X, where z2 = W2·a1 + b2, a1 = A(z1), z1 = W1·X + b1.
  • 70. 2.1 Recall DNN, Diff on Computational Graph — Backpropagation, same graph viewed as signal flow: the error signal ∂C/∂ŷ is propagated backwards through Layer 2 and Layer 1, and each layer acts as an amplifier (the local derivatives A'(z2), W2, A'(z1), X scale the error signal on its way back).
  • 71. 2.1 Recall DNN, Optimization (batch) — training data: {(X1, ȳ1), (X2, ȳ2), …, (Xr, ȳr), …, (XR, ȳR)}. Gradient Descent: θi ← θi−1 − η·∇C(θi−1) with C(θi−1) = (1/R) Σ_r Cr(θi−1) — use all samples for each update. Stochastic Gradient Descent: θi ← θi−1 − η·∇Cr(θi−1) — use one sample for each update. Mini-Batch Gradient Descent: θi ← θi−1 − η·∇C(θi−1) with C(θi−1) = (1/B) Σ_{r∈b} Cr(θi−1) — use one batch b of size B for each update.
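  A sketch of the mini-batch loop in NumPy; the logistic-regression per-sample gradient below is only a stand-in (assumption) for ∇Cr(θ) of whatever model is actually being trained.

    import numpy as np

    rng = np.random.default_rng(0)
    R, B, eta = 1000, 32, 0.01
    X = rng.normal(size=(R, 5))
    y = rng.integers(0, 2, size=R).astype(float)
    theta = np.zeros(5)

    def per_sample_grad(theta, x_r, y_r):
        # stand-in gradient of one sample's loss C_r(theta)
        p = 1.0 / (1.0 + np.exp(-x_r @ theta))
        return (p - y_r) * x_r

    for epoch in range(5):
        idx = rng.permutation(R)
        for start in range(0, R, B):            # mini-batch gradient descent
            batch = idx[start:start + B]
            g = np.mean([per_sample_grad(theta, X[r], y[r]) for r in batch], axis=0)
            theta = theta - eta * g             # theta <- theta - eta * (1/B) * sum_r grad C_r
    # B = R gives plain gradient descent, B = 1 gives stochastic gradient descent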
  • 72. 2.2 Let's Go Back to RNN - Why is memory important? - What kind of problem can be handled by an RNN? - Simple RNN structure - Why is it so hard to train? - Classic variants (LSTM, GRU)
  • 73. 2.2 Why memory important — toy example: add 199 + 333 = 532 digit by digit. At each of the three timesteps the input is the pair of digits (X1, X2) — (9, 3), then (9, 3), then (1, 3) — so the input is 2-dimensional and the output is the 1-dimensional result digit (2, then 3, then 5).
  • 74.–83. 2.2 Why memory important — build slides stepping through a hand-wired network with a single memory cell c1 that holds the carry: at each timestep the two input digits are summed with the stored carry, the ones digit is emitted as the output, and the new carry is written back into c1 for the next timestep (9 + 3 = 12 → output 2, carry 1; 9 + 3 + 1 = 13 → output 3, carry 1; 1 + 3 + 1 = 5 → output 5). Without that memory, the same input pair (9, 3) would have to produce two different outputs.
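  The same point in a few lines of Python: adding 199 + 333 digit by digit only works if something remembers the carry between timesteps, and the carry variable below plays the role of the memory cell c1 (the code is an illustration, not the network drawn on the slides).

    # add 199 + 333 digit by digit, least significant digit first
    a_digits = [9, 9, 1]        # 199
    b_digits = [3, 3, 3]        # 333
    carry = 0                   # the "memory cell" c1
    out = []
    for x1, x2 in zip(a_digits, b_digits):   # inputs X1, X2 at each timestep
        s = x1 + x2 + carry
        out.append(s % 10)      # emit the ones digit
        carry = s // 10         # remember the carry for the next timestep
    print(out[::-1])            # [5, 3, 2] -> 532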
  • 84. 2.3 Simple RNN — compared with a DNN: the DNN maps one input to one output (one to one), while the RNN consumes a sequence X1, X2, X3, X4, … and can emit an output at every timestep (many to many).
  • 86. 2.3 Simple RNN — input/output configurations: many to many (X1, X2, … → ŷ1, ŷ2, …) and one to many (X1 → ŷ1, ŷ2, ŷ3, ŷ4).
  • 87. 2.3 Simple RNN — many to many: one output per input timestep.
  • 88. 2.3 Simple RNN — one to many: a single input X1 drives a whole output sequence.
  • 89. 2.3 Simple RNN, Model Structure — many to one: the character sequence X1 … X10 ('v', 'a', 'r', ' ', 'e', …, 'E') is read step by step and a single label ('js') is produced at the end.
  • 90.–93. 2.3 Simple RNN, Model Structure — at each timestep the unit computes z_i = W_i · X_t from the input, z_h = W_h · h from the previous state, and the new state/output a = A(z_i + z_h + b), i.e. f(a, b, c) = A(a + b + c) applied to the three terms. Unrolling this over the timesteps X1, X2, X3, …, X10, the final output feeds the cost C (here the 'js' label).
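  A minimal NumPy sketch of this recurrence unrolled over a character sequence; the sizes, the random weights, and the final sigmoid readout Wy that turns the last state into a single "is this .js?" score are assumptions added for illustration (the slides only draw the recurrent part).

    import numpy as np

    vocab, hidden = 98, 16
    rng = np.random.default_rng(0)
    Wi = rng.normal(size=(hidden, vocab)) * 0.01
    Wh = rng.normal(size=(hidden, hidden)) * 0.01
    b = np.zeros(hidden)
    Wy = rng.normal(size=(1, hidden)) * 0.01    # assumed many-to-one readout

    def rnn_score(char_ids):
        h = np.zeros(hidden)                    # initial state
        for c in char_ids:
            x = np.zeros(vocab)
            x[c] = 1.0                          # one-hot input X_t
            h = np.tanh(Wi @ x + Wh @ h + b)    # a = A(W_i X_t + W_h h + b)
        return 1.0 / (1.0 + np.exp(-(Wy @ h)))  # single score from the last state

    print(rnn_score([23, 2, 19, 0, 6, 23, 24, 7, 19, 32]))   # 'var evwfrE'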
  • 94. 2.3 Simple RNN, Cost Function — ŷ = f(X_t, h; θ) = A(W_i · X_t + W_h · h + b), θ = {W_i, W_h, b}; the model structure is the function set, and the same function is applied at every timestep. Picking the best model = the best function = the best θ*. With cross-entropy cost: C(θ) = −(1/K) Σ_k ( ȳ_k ln(ŷ_k) + (1 − ȳ_k) ln(1 − ŷ_k) ).
  • 95. 2.3 Simple RNN, Optimization — Backpropagation? Same setup: ŷ = f(X_t, h; θ) = A(W_i · X_t + W_h · h + b), θ = {W_i, W_h, b}, cross-entropy cost C(θ); how do we compute the gradients through the recurrence?
  • 96.–98. 2.3 Simple RNN, BPTT on the computational graph — unroll the recurrence over every timestep X1 … X10 (backpropagation through time); ∂C(θ0)/∂W_h is the sum of the gradient contributions from all the unrolled nodes that use W_h.
  • 99. 2.3 Simple RNN, BPTT, gradient problem — over the sequence X1 … X10 ('v', 'a', 'r', ' ', 'e', …, 'E' → 'js') the error signal ઠ passes through the same amplifier at every timestep, so updating W needs Error Signal × (Amplifier)^10: the gradient vanishes if the amplifier is small and explodes if it is large.
  • 100. 2.3 Simple RNN, BPTT, gradient problem — updating W needs Error Signal × (Amplifier)^10, so depending on the direction and magnitude of the recurrent weights the gradient either vanishes or explodes (vanishing / exploding gradient problem).
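  A quick numeric illustration of the amplifier effect, with a scalar standing in for the recurrent factor that the error signal is multiplied by at every one of the 10 timesteps.

    error_signal = 1.0
    for amplifier in (0.5, 1.0, 1.5):
        print(amplifier, error_signal * amplifier ** 10)   # error signal * (amplifier)^10
    # 0.5 -> ~0.001 (vanishing), 1.0 -> 1.0, 1.5 -> ~57.7 (exploding)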
  • 101. 2.4 LSTM — hidden unit: [Input]: state_(t-1), Input_t, Cell_State_(t-1); [Output]: state_t, Output_t, Cell_State_t. Input gate: whether to let the [input] in. Forget gate: whether to update the Cell_State. Output gate: whether to emit the output.
  • 102.–105. 2.4 LSTM — the gates are all computed from the current input X_t and the previous state h_(t-1): input gate z_i = σ(W_i · [X_t, h_(t-1)]); candidate input z = σ(W · [X_t, h_(t-1)]), multiplied elementwise by the input gate; forget gate z_f = σ(W_f · [X_t, h_(t-1)]), multiplied elementwise with the old cell state; cell update (elementwise addition) c_t = z_f ⊙ c_(t-1) + z_i ⊙ z; output gate z_o = σ(W_o · [X_t, h_(t-1)]); output h_t = z_o ⊙ tanh(c_t).
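  A minimal NumPy sketch of one LSTM step following these gates; the concatenation [X_t, h_(t-1)], the weight shapes, and the tanh on the candidate input (the slide draws it with σ) are the usual conventions, assumed here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wo, Wz, bi, bf, bo, bz):
        v = np.concatenate([x_t, h_prev])   # [X_t, h_(t-1)]
        z_i = sigmoid(Wi @ v + bi)          # input gate
        z_f = sigmoid(Wf @ v + bf)          # forget gate
        z_o = sigmoid(Wo @ v + bo)          # output gate
        z = np.tanh(Wz @ v + bz)            # candidate input
        c_t = z_f * c_prev + z_i * z        # elementwise multiply, elementwise add
        h_t = z_o * np.tanh(c_t)            # h_t = z_o ⊙ tanh(c_t)
        return h_t, c_t

    n_in, n_h = 98, 16
    rng = np.random.default_rng(0)
    Wi, Wf, Wo, Wz = (rng.normal(size=(n_h, n_in + n_h)) * 0.01 for _ in range(4))
    bi = bf = bo = bz = np.zeros(n_h)
    x = np.zeros(n_in); x[23] = 1.0
    h, c = lstm_step(x, np.zeros(n_h), np.zeros(n_h), Wi, Wf, Wo, Wz, bi, bf, bo, bz)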
  • 107. 2.5 GRU (Gated Recurrent Unit) — comparison: a plain RNN carries only the short-term state h_t; an LSTM carries both a long-term cell state c_t and a short-term state h_t; a GRU merges long and short term into a single state h_t.
  • 108.–110. 2.5 GRU — long + short term in one state: reset gate z_r = σ(W_r · [X_t, h_(t-1)]), multiplied elementwise with the old state; candidate state ĥ_t = tanh(W·X_t + W_c·(z_r ⊙ h_(t-1))); update gate z_u = σ(W_u · [X_t, h_(t-1)]); new state h_t = (1 − z_u) ⊙ h_(t-1) + z_u ⊙ ĥ_t.
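  And the matching NumPy sketch for one GRU step, with the same shape and concatenation assumptions as the LSTM sketch above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, Wr, Wu, W, Wc, br, bu):
        v = np.concatenate([x_t, h_prev])
        z_r = sigmoid(Wr @ v + br)                        # reset gate
        z_u = sigmoid(Wu @ v + bu)                        # update gate
        h_cand = np.tanh(W @ x_t + Wc @ (z_r * h_prev))   # candidate state
        return (1.0 - z_u) * h_prev + z_u * h_cand        # h_t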
  • 112. 3 How to Tune Your Model - Basic Metrics - Model HyperParameters - Visualization Inspiration - Optimization Setup - Network Structure
  • 113. 3.1 Basic Metrics — split the dataset: Real World → Training Data + Testing Data, and Training Data → Training Data + Validation Data (real testing comes last). Do I get good results on the training set? If NO: the code has a bug? cannot find a good function? bad model (no good function in the hypothesis set)? If YES: do I get good results on the validation set? If NO: overfitting?
  • 114. 3.2 Model HyperParameters — for the simple RNN ŷ = f(X_t, h; θ) = σ(W_i·X_t + W_h·h + b), θ = {W_i, W_h, b} are the parameters; everything else is a hyperparameter: number of epochs (training iterations), |h| (hidden layer size), number of layers, C() (cost function), batch size (stochastic, mini-batch, …), parameter initial values, A (activation function: tanh, sigmoid, ReLU), η (learning rate in θ1 ← θ0 − η·∇C(θ0)), regularization (Dropout, Zoneout), forget-gate bias, gate initialization, implicit zero padding. Grid search? (sketch below)
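  The "grid search?" idea as a sketch: try every combination of a few hyperparameter values and keep the best validation score; train_and_validate is a made-up placeholder for whatever training routine is actually used (here it just returns a random score so the loop runs).

    import random
    from itertools import product

    def train_and_validate(hidden_size, learning_rate, num_layers):
        # placeholder: train with these hyperparameters, return validation accuracy
        return random.random()

    grid = {"hidden_size": [32, 64, 128], "learning_rate": [1e-2, 1e-3], "num_layers": [1, 2]}
    best_config, best_score = None, -1.0
    for h, lr, nl in product(grid["hidden_size"], grid["learning_rate"], grid["num_layers"]):
        score = train_and_validate(h, lr, nl)
        if score > best_score:
            best_config, best_score = (h, lr, nl), score
    print(best_config, best_score)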
  • 115. 3.3 Visualization Inspiration — one layer, 10 hidden units, simple RNN.
  • 116. 3.3 Visualization Inspiration — one layer, 10 hidden units, simple RNN.
  • 117. 3.4 Optimization Setup — adaptive learning rate, gradient clipping, truncated BPTT, batch normalization, longer training time, …
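  One of those fixes, gradient clipping, in a few NumPy lines: rescale the gradient whenever its norm exceeds a threshold before applying the update (the threshold of 5.0 is an illustrative choice).

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)   # rescale so the norm equals max_norm
        return grad

    g = np.array([30.0, -40.0])               # norm 50, far above the threshold
    print(clip_gradient(g))                   # [ 3. -4.]  (norm 5)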
  • 118. 3.5 Network Structure — bi-directional RNN, GRU, LSTM, stacked RNN, Neural Turing Machine, …
  • 119. 4. View Problem in CNN — 'var evwfrE' becomes the index sequence [23, 2, 19, 0, 6, 23, 24, 7, 19, 32], which becomes a 98×10 one-hot matrix (one column x_1 … x_10 per character); a CNN slides filters over that matrix to produce the output y.
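  A sketch of that CNN view in NumPy: build the 98×10 one-hot matrix and slide one filter of width 3 along the time axis; the filter values are random, just to show the shapes involved.

    import numpy as np

    ids = [23, 2, 19, 0, 6, 23, 24, 7, 19, 32]   # 'var evwfrE'
    vocab, T = 98, len(ids)
    X = np.zeros((vocab, T))
    X[ids, np.arange(T)] = 1.0                   # 98 x 10 one-hot matrix

    rng = np.random.default_rng(0)
    width = 3
    filt = rng.normal(size=(vocab, width)) * 0.1
    # one activation per window of 3 consecutive characters
    feature_map = np.array([np.sum(filt * X[:, t:t + width]) for t in range(T - width + 1)])
    print(feature_map.shape)                     # (8,)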
  • 120. 5 A Sequence to Sequence Problem - Three different ways to solve the problem: HMM, CTC, Attention
  • 121. Topics Not Covered - Bi-directional RNN and other RNN variants - Attention-based RNN - structure learning vs. RNN
  • 122. Reference - Most of the content is from Prof. Lee @ NTU: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ - https://danijar.com/tips-for-training-recurrent-neural-networks/ - https://medium.com/@erikhallstrm/hello-world-rnn-83cd7105b767 These slides were shared with the Trend Micro scan engine team as an entry-level introduction for team members on 2017/9/4. If you have any suggestion or correction, please mail chuchuhao831@gmail.com or just leave a comment on the Google Slides -> link