Introduction to Deep Neural Network

Copyright 2011 Trend Micro Inc. 1
Introduction to Deep Neural Network
Liwei Ren, Ph.D
San Jose, California, Nov, 2016

Copyright 2011 Trend Micro Inc.
Agenda
• What a DNN is
• How a DNN works
• Why a DNN works
• Those DNNs in action
• Where the challenges are
• Successful stories
• Security problems
• Summary
• Quiz
• What else
2

What is a DNN?
• DNN and AI in the secular world
3

What is a DNN?
4

What is a DNN?
5

What is a DNN?
• DNN in the technical world
6

What is a DNN?
7

What is a DNN?
8

What is a DNN?
• Categorizing the DNNs :
9

What is a DNN?
• Three technical elements
• Architecture: the graph, weights/biases, activation functions
• Activity Rule: weights/biases, activation functions
• Learning Rule: a typical one is backpropagation algorithm
• Three masters in this area:
10

What is a DNN?
• Given a practical problem , we have two approaches
to solve it.
11

What is a DNN?
• An example: image recognition
12

What is a DNN?
• An example: image recognition
13

What is a DNN?
• In the mathematical world
– A DNN is a mathematical function f: D  S, where D ⊆ Rn and S ⊆ Rm,
which is constructed by a directed graph based architecture.
– A DNN is also a composition of functions from a network of primitive
functions.
14

What is a DNN?
• We denote the a feed-forward DNN function by O= f(I) which
is determined by a few parameters G, Φ ,W,B
• Hyper-parameters:
– G is the directed graph which presents the structure
– Φ presents one or multiple activation functions for activating the nodes
• Parameters:
– W is the vector of weights relevant to the edges
– B is the vector of biases relevant to the nodes
15

What is a DNN?
• Activation at a node:
16

What is a DNN?
• Activation function:
17

What is a DNN?
• G=(V,E) is a graph and Φ is a set of activation functions.
• <G,Φ> constructs a family of functions F:
– F(G,Φ) = { f | f is a function constructed by <G, Φ ,W> where WϵRN }
• N= total number of weights at all nodes of output layer and hidden layers.
• Each f(I) can be denoted by f(I ,W).
18

What is a DNN?
• Mathematically, a DNN based supervised machine
learning technology can be described as follows :
– Given g ϵ { h | h:D  S where D ⊆ Rn and S ⊆ Rm} and δ>0 , find f ϵ
F(G,Φ) such that 𝑓 − 𝑔 < δ.
• Essentially, it is to identify a W ϵ RN such that 𝑓(∗, 𝑊) − 𝑔 < δ
• However, in practice, g is not explicitly expressed . It
usually appears in a sequence of samples:
– { <I(j),T(j)> | T(j) =g(I(j)), j=1, 2, …,M}
• where I(j) is an input vector and T(j) is its corresponding target vector.
19

How Does a DNN work ?
• The function g is not explicitly expressed, we are not able
to calculate g − f(∗, W)
• Instead, we evaluate the error function E(W)=
1
2𝑀
∑||T(j) -
f(I(j),W)||2
• We expect to determine W such that E(W) < δ
• How to identify W ϵ RN so that E(W) < δ ? Lets solve the
nonlinear optimization problem min{E(W)| W ϵ RN} , i.e.:
min{
1
2𝑀
∑|| T(j) - f(I(j),W) ||2 | W ϵ RN } (P1)
20

• (P1) is for batch mode training, however ,it is too
expensive.
• In order to reduce the computational cost, a sequential
mode is introduced.
• Picking <I,T> ϵ {<I(1),T(1) >, <I(2),T(2)> ,…, <I(M),T(M)>}
sequentially, let the output of the network as O= f(I,W) for any
W:
• Error function E(W)= ||T- f(I,W)||2 /2 = ∑(Tj-Oj)2 /2
• Each Oj can be considered as a function of W. We denote it as Oj(W).
• We have the optimization problem for training with
sequential mode:
– min{ ∑(Tj-Oj(W) )2 /2 | W ϵ RN} (P2)
21

• One may ask whether we get the same solution for
both batch mode and sequential mode ?
• BTW
– batch mode = offline mode
– sequential mode = online mode
• We focus on online mode in this talk
22

• How to solve the unconstrained nonlinear optimization
problem (P2)?
• The general approach of unconstrained nonlinear
optimization is to find local minima of E(W) by using
the iterative process of Gradient Descent.
•∂E = (∂E/∂W1, ∂E/∂W2, …, ∂E/∂WT)
• The iterations:
– ΔWj = - γ ∂E/∂Wj for j=1, …,T
– Updating W in each step by
• Wj
(k+1) = Wj
(k) - γ ∂E(W (k))/∂Wj for j=1, …,T (A1)
• until E(W (k+1)) < δ or E(W (k+1)) can not be reduced anymore
23

• The algorithm of Gradient Descent:
24

• From the perspective of mathematics, the process of
Gradient Descent is straightforward.
• However, from the perspective of scientific computing, it is
quite challenging to calculate the values of all ∂E/∂Wj for
j=1, …,N:
– The complexity of presenting each ∂E/∂Wj where j=1, …,N.
– There are (k+1)-layer function compositions for a DNN of k hidden
layers.
25

• For example, we have a very simple network as follows
with the activation function φ(v)=1/(1 + 𝑒−𝑣
).
• E(W) = [ T - f(I,W) ]2 /2= [T – φ(w1φ(w3I+ w2) + w0)]2 /2, we
have:
– ∂E/∂w0 = -[T – φ(w1φ(w3I + w2) + w0)] φ’(w1φ(w3I+w2) + w0)
– ∂E/∂w1 = -[T – φ(w1φ(w3I + w2) + w0)] φ’(w1φ(w3I+w2) + w0) φ(w3I+w2)
– ∂E/∂w2 = - w1 [T – φ(w1φ(w3I + w2) + w0)] φ’(w1φ(w3I+w2) + w0)
φ’(w3I+w2)
– ∂E/∂w3 = - I w1 [T – φ(w1φ(w3I + w2) + w0)] φ’(w1φ(w3I+w2) + w0)
φ’(w3I+w2)
26

• Lets imagine a network of N inputs, M outputs and K
hidden layers each of which has L nodes.
– It is a daunting task to express ∂E/∂wj explicitly. Last simple
example already shows this.
• The backpropagation (BP) algorithm was proposed as a
rescue:
– Main idea : the weights of (k-1)-th hidden layer can be expressed
by the k-th layer recursively.
– We can start with the output layer which is considered as (L+1)-
layer.
27

• BP algorithm has the following major steps:
1. Feed-forward computation
2. Back-propagation to the output layer
3. Back-propagation to the hidden layers
4. Weight updates
28

29

• A general DNN can be drawn as follows
30

• How to express the weights of (k-1)-th hidden layer by the
weights of k-th layer recursively?
31

• Let us experience the BP with our small network.
– E(W) = [ T - f(I,W) ]2 /2= [T – φ(w1φ(w3I+ w2) + w0)]2 /2.
• ∂E/∂w0 = - φ’(O) (T – O)
• ∂E/∂w1 = -φ’(O) (T – O) φ(O)
• ∂E/∂w2 = - φ’(O) (T – O) φ’(H) w1 * 1
• ∂E/∂w3 = - φ’(O) (T – O) φ’(H) w1 * I
– Let H0
(1)= 1, H1
(1) = H = φ(w3I+ w2), H1
(0) = I, we verify the follows:
• δ1
(2)= φ’(O) (T – O)
• w0
+ = w0 + γ δ1
(2) H0
(1) , w1
+ = w1 + γ δ1
(2) H1
(1)
• δ1
(1)= φ’(H1
(1)) δ1
(2) w1
• w2
+ = w2 + γ δ1
(1) H0
(0) , w3
+ = w3 + γ δ1
(1) H1
(0)
• where w0 = w0,1
(2) , w1 = w1,1
(2), w2 = w0,1
(1) , w2 = w1,1
(1)
32

Why Does a DNN Work?
• It is amazing ! However, why does it work?
• For a FNN, it is to ask whether the following approximation
problem has a solution:
– Given g ϵ { h | h:D  S where D ⊆ Rn and S ⊆ Rm} and δ>0 , find a W ϵ
RN such that 𝑓(∗, 𝑊) − 𝑔 < δ.
• Universal approximation theorem (S):
– Let φ(.) be a bounded and monotonically-increasing continuous function. Let
Im denote the m-dimensional unit hypercube [0,1]m . The space of continuous
functions on Im is denoted by C(Im) . Then, given any function f ϵ C(Im) and
ε>0 , there exists an integer N , real constants vi, bi ϵ R and real vectors wi ϵ
Rm, where i=1, …, N , such that
|F(x)-f(x)| < ε
for all x ϵ Im , where F(x) = vi φ(wi T x + bi)𝑵
𝒊=𝟏 is an approximation to the
function f which is independent of φ .
33
Im

• Its corresponding network with only one hidden layer
– NOTE : this is not even a general case for one hidden layer. It is a
special case. WHY?
– However, it is powerful and encouraging from the mathematical
perspective.
34
Im

The general networks have a general version of Universal
Approximation Theorem accordingly:
35
Im

• Universal approximation theorem (G):
– Let φ(.) be a bounded and monotonically-increasing continuous function. Let
S be a compact space in Rm . Let C(S ) = {g | g:S ⊂ Rm  Rn is continuous}.
Then, given any function f ϵ C(S) and ε>0 , there exists a FNN as shown
above which constructs the network function F such that
|| F(x)-f(x) || < ε
where F is an approximation to the function f which is independent of φ .
• It seems both shallow and deep neural networks can
construct an approximation to a given function.
– Which is better?
– Or which is more efficient in terms of using less nodes ?
36
Im
Rm

• Mathematical foundation of neural networks:
37
Im
Rm

Those DNNs in action
• DNN has three elements
• Architecture: the graph , weights/biases, activation functions
• Activity Rule: weights/biases, activation functions
• Learning Rule: a typical one is backpropagation algorithm
• The architecture basically determines the capability of a specific DNN
– Different architectures are suitable for different applications.
– The most general architecture of an ANN is a DAG ( directed acyclic graph).
38

Those DNNs in action
• There are a few well-known categories of DNNs.
39

What Are the Challenges?
• Given a specific problem, there are a few questions
before one starts the journey with DNNs:
– Do you understand the problem that you need to solve?
– Do you really want to solve this problem with DNN, why?
• Do you have an alternative yet effective solution?
– Do you know how to describe the problem in DNN mathematically ?
– Do you know how to implement a DNN , beyond a few APIs and
sizzling hype?
– How to collect sufficient data for training?
– How to solve the problem efficiently and cost-effectively?
40

• 1st Challenge:
– a full mesh network has the curse of dimensionality.
41

• Many tasks of FNN do not need a full mesh network.
• For example, if we can present the input vector as a grid, the nearest-
neighborhood models can be used when constructing an effective FNN
which can reduce connections
– Image recognition
– GO (圍棋) : a game that two players play on a 19x19 grid of lines.
42

• The 2nd challenge is how to describe a technical problem in terms of
DNN, i.e., mathematical modeling. There are generally two
approaches:
– Applying a well-learned DNN architecture to describe the problem. Deep
understanding of the specific network is usually required!
• Two general DNN architectures are well-known
– FNN: feedforward neural network. Its special architecture CNN (convolutional
neural network) is widely used in many applications such as image recognition,
GO, and etc.
– RNN: recurrent neural network. Its special architecture is LSTM (long short-
term memory) which has been applied successfully in speech recognition,
language translation, and etc.
• For example, if we want to try a FNN, how to describe the problem in
terms of <Input vector, Output vector> with fixed dimension ?
– Creating a novel DNN architecture from ground if none of the existing
models fits your problem. Deep understanding of DNN theory /
algorithms is required.
43

• Handwriting digit recognition:
– Modeling this problem is straightforward
44

• Image Recognition is also straightforward
45

• However, due to the curse of dimensionality, we can use a special FFN:
– Convolutional neural network (CNN)
46

• How to construct a DNN to describe language translation ?
– They use LSTM networks
• How to construct a DNN to describe the problem of malware
classification?
• How to construct a DNN to describe the network traffic for
security purpose?
47

• The 3rd challenge is how to collect sufficient training data.
To achieve required accuracy, sufficient training data is
necessary. WHY?
48

• The 4th challenge is how to identify various talents for
providing a DNN solution to solve specific problems.
– Who knows how to use existing DL APIs such as TensorFlow
– Who understands various DNN architectures in depth so that he/she
knows how to evaluate and identify a suitable DNN architecture to
solve the problem.
– Who understands the theory and algorithms of the DNN in depth so
that he/she can create and design a novel DNN from ground.
49

Successful Stories
• ImageNet : 1M+ images, 1000+ categories, CNN
50

Successful Stories
• Unsupervised learning neural networks… YouTube and the
Cat .
51

Successful Stories
• AlphaGo, a significant milestone in AI history
– More significant than DeepBlue
• Both Policy Network and Value Network are CNNs.
52

Successful Stories
• Google Machine Neural Translation… LSTM (Long
Short Term Memory) network
53

Successful Stories
• Microsoft Speech Recognition … LSTM and TDNN
(Time Delay Neural Networks )
54

Security Problems
• Not disclosed for the public version.
55

Summary
• What a DNN is
• How a DNN works
• Why a DNN works
• The categories of DNNs
• Some challenges
• Well-known stories
• Security problems
56

Quiz
• Why do we choose the activation function as a
nonlinear function?
• Why Deep? Why deep networks are better than
shallow networks?
• What is the difference between online and batch
mode training?
• Will online and batch mode training converge to the
same solution?
• Why do we need the backpropagation algorithm?
• Why do we apply convolutional neural networks to
image recognition? 57

Quiz
• If we solve a problem with a FNN,
– how many deep layers should we go?
– How many nodes are good for each layer?
– How to estimate and optimize the cost?
• Is it guaranteed that the backpropagation algorithm converge
to a solution?
• Why do we need sufficient data for training in order to achieve
certain accuracy?
• Can a DNN do some tasks more than extending human’s
capabilities or automating extensive manual tasks ?
– To prove a mathematical theorem ... or to introduce an interesting
concept… or to appreciate a poem… or to love…
58

Quiz
• AlphaGo is trained for 19x19 lattice. If we play GO
game on 20x20 board, can AlphaGo handle it?
• ImageNet is trained for 1000 categories. If we add the
1001-th category, what should we do?
• People do consider a special DNN as a black box.
Why?
• More questions from you …
59

What Else?
• What to share next from me? Why do you care?
– Various DNNs: principles, examples, analysis and
experiments…
•ImageNet, AlphaGO, GNMT and etc..
– My Ph.D work and its relevance to DNN
– Little History of AI and Artificial Neural Network
– Various Schools of the AI Discipline
– Strong AI vs. Weak AI
60

What Else?
– Questions when thinking about AI:
• Are we able to understand how we learn?
• Are we going the right directions mathematically and scientifically?
• Are there simple principles for cognition like what Newton and Einstein
established for understanding our universe?
• What are we lack between now and the coming of so called Strong AI?
61

What Else?
•Questions about who we are.
– Are we created? 
– Are we the AI of the creator?
•My little theory about the Universe
62

Introduction to Deep Neural Network

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Introduction to Deep Neural Network

Similar to Introduction to Deep Neural Network (20)

More from Liwei Ren任力偉

More from Liwei Ren任力偉 (20)

Recently uploaded

Recently uploaded (20)

Introduction to Deep Neural Network