A Study On Deep Learning
Abdelrahman Hosny
Graduate Student, Master’s
Computer Science
University of Connecticut
Email: abdelrahman@engr.uconn.edu
Anthony Parziale
Undergraduate Student, Junior
Computer Science
University of Connecticut
Email: anthony.parziale@uconn.edu
Abstract—With massive amounts of computational power, machines can now recognize objects and translate speech in real time. Thanks to Deep Learning, Artificial Intelligence is becoming genuinely capable. Deep Learning models attempt to mimic the activity of the layers of neurons in the neocortex; this layered activity is believed to be what enables a brain to "think." These models learn to recognize patterns in digital representations of data in much the same way humans do. In this survey report, we introduce the most important concepts of Deep Learning along with the state-of-the-art models that are now widely adopted in commercial products.
1. Introduction
Machine learning is the science of getting computers to act without being explicitly programmed. It is the main engine behind many modern software applications: from web searches to content filtering on social networks to recommendations on e-commerce websites and smartphone applications. Deep learning is a newer area of machine learning research, introduced with the objective of moving machine learning closer to one of its original goals: Artificial Intelligence. When exploring the field of deep learning, it is easy to be overwhelmed by the various models and, in the process, lose sight of the end objective [1]. Researchers aim to use deep learning models to make progress toward human-level AI. Many of them view deep learning as a direct extension of artificial neural networks, which are inspired by how the human brain works.
In this survey, our aim is to provide a brief explanation of neural networks together with a concise overview of the different deep learning architectures, their objectives, and how they relate. In the next section, we start with the building block of any deep learning architecture: artificial neural networks. After that, we explore deep learning models, which generally consist of either a "deep" neural network (more than 3 layers) or a stack of neural networks (where each layer in the deep architecture is in fact a neural network itself). For each model introduced, we shed light on its purpose and its architecture. Finally, we give practical tips on using each model and introduce some of the recent commercial applications that are empowered by deep learning models.
2. Artificial Neural Networks
Artificial neural networks are a family of models inspired by biological neural networks. The intuition behind them comes from observations such as the following: babies watch adults moving around and, after a few months, the accumulated knowledge begins to stimulate them to attempt mini push-ups of their own. This behavior encouraged neuroscientists to study the brain activity that allows learning without explicit teaching. By analogy, computer scientists modeled the brain with a mathematical model called the artificial neural network. The question now is: how does the brain work?
2.1. Background
The brain consists of neural cells, each of which looks like the one in Figure 1. In the body of the neuron sits the nucleus, which receives pulses of electricity from input wires (dendrites); based on these signals, the neuron does some computation and sends a message (electrical impulses) to other neurons through output wires (axons). The human brain has billions of these neurons connected together. Different neurons in the brain are responsible for different senses, such as sight, smell, and touch. It has been observed experimentally that a neuron in the brain can learn to do other jobs. For example, animal experiments show that if we disconnect the wires that connect the auditory cortex to the ears and connect it to the eyes, its neurons will learn to see, as in Figure 2a. In similar experiments, when the somatosensory cortex's connection to the hand is disconnected and rerouted to the eyes, it eventually learns to see, as in Figure 2b. Now, let us switch context and talk about mimicking this neural network in computers.

Figure 1: Human Brain Neuron

(a) Auditory cortex learns to see (b) Somatosensory cortex learns to see
Figure 2: Neurons learn to do different tasks when original wires are disconnected and reconnected to other senses
In a software environment, we create a similar model with three major components:
• A cell body containing the neuron, which is responsible for doing the computation.
• Input wires that carry signals into the neuron.
• Output wire(s) that transfer the output signal to other neurons.
Figure 3 shows a simple artificial neural network that has only one neuron (the orange circle). x1, x2, and x3 are the inputs to the neuron, and they carry numerical values. The function h is called the hypothesis function. It is computed by multiplying the input vector x by a weight vector w; the result is then passed through an activation function that produces the final scalar output.
Figure 3: Artificial neural network with one neuron
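To make this concrete, the following minimal Python/NumPy sketch computes the hypothesis h for a one-neuron network. The specific numbers, the sigmoid activation, and the bias term are illustrative choices, not values prescribed by the figure.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Three numerical inputs (x1, x2, x3), one weight per input, and a bias term.
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.8, 0.1, -0.4])   # weight vector (learned during training)
b = 0.2                          # bias

# Hypothesis: weighted sum of the inputs passed through the activation function.
h = sigmoid(np.dot(x, w) + b)
print(h)  # a single scalar output in (0, 1)
```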
Figure 4 shows a more advanced neural network. Each
vertical set of neurons is called a layer. Layer 1 contains the
neurons that represent inputs. Layer 2 is also called a hidden
layer. It does the core computation. Layer 3 is called the
output layer and does a computation on the data received
from layer 2 and then outputs one final result. Now, the
missing information in the one-neuron figure is:
1) What is the weight vector to be multiplied by the
input vector?
2) After multiplying the two vectors, what is the acti-
vation function that will output the final result?
Besides the number of layers and the number of neurons in
each layer, the answers to the above two questions are going
to define the neural network model. If one could solve, or
model, a specific mathematical problem by assigning values
to the weight vector and choosing an appropriate activation
function, the neural network model would satisfy its goal.
Figure 4: Artificial neural network with two layers
In practice, assigning weights and choosing an activation function is the most challenging part of designing a neural network. Therefore, computerized training procedures have been developed to let the software optimize the values of the weights. In the next two subsections, we discuss activation functions and the backpropagation algorithm, the fundamental technique used to train a neural network.
2.2. Activation Functions
As stated in the previous subsection, each layer is composed of a set of neurons. The purpose of each neuron is to perform a non-linear transformation on its input. Using the network in Figure 3 as an example, the input vector x is multiplied by the weight vector w. If N is the number of inputs feeding a neuron, vector x has a shape of [1, N] and vector w has a shape of [N, 1]. Multiplying these two vectors results in a scalar [1, 1] value.
x = [x_1, x_2, ..., x_n]   (1)

w = [w_1, w_2, ..., w_n]^T   (2)

x · w = Σ_i x_i w_i = x_1 w_1 + x_2 w_2 + ... + x_n w_n   (3a)

y = x_1 w_1 + x_2 w_2 + ... + x_n w_n + bias   (3b)
As equation 3b shows, y is a simple linear function of the inputs. Although interesting, this linearity offers no advantage over simple linear regression. If y were passed straight to the next layer's nodes, we would say the neuron has a linear activation function; in fact, a perceptron with a linear activation function is just that: linear regression. By passing y through a non-linear activation function, the network gains the ability to represent far richer functions. The following equations illustrate the most popular activation functions:
Identity
Figure 5: Identity: A(y) = y

Binary Step
Figure 6: Binary Step
A(y) = 0 for y < 0;  1 for y ≥ 0

From a biological standpoint, these activation functions determine whether the neuron propagates a signal forward to a receiving neuron or not.

Logistic
Figure 7: Logistic
A(y) = 1 / (1 + e^(-y))

TanH
Figure 8: TanH
A(y) = tanh(y) = 2 / (1 + e^(-2y)) - 1

Softsign
Figure 9: Softsign
A(y) = y / (1 + |y|)

Rectified Linear Unit (ReLU)
Figure 10: ReLU
A(y) = 0 for y < 0;  y for y ≥ 0
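For reference, these activation functions can be implemented directly. The following NumPy sketch is a minimal element-wise implementation of each; the function names are ours and do not refer to any particular library.

```python
import numpy as np

def identity(y):
    return y

def binary_step(y):
    # 0 for y < 0, 1 for y >= 0
    return np.where(y < 0, 0.0, 1.0)

def logistic(y):
    return 1.0 / (1.0 + np.exp(-y))

def tanh(y):
    # Equivalent to 2 / (1 + exp(-2y)) - 1
    return np.tanh(y)

def softsign(y):
    return y / (1.0 + np.abs(y))

def relu(y):
    # 0 for y < 0, y for y >= 0
    return np.maximum(0.0, y)

y = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (identity, binary_step, logistic, tanh, softsign, relu):
    print(f.__name__, f(y))
```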
2.3. Backpropagation Algorithm
A neural network is trained through a combination of two steps. The first step propagates the information forward through the activation functions; the previous section illustrated some of the most popular activation functions used for the nodes in a network. Once this first pass is completed, the model produces an output. The error of the network represents how close this output is to the expected value. The second step in the training process adjusts the weights of the network in an attempt to minimize this error. As one can imagine, in a network where every layer is fully connected to the next, the number of weights grows very quickly. An efficient way to minimize the training error is therefore crucial, and backpropagation provides it. Backpropagation can be viewed as a clever use of the chain rule [2].
Figure 11: Demonstration of the chain rule
Backpropagation propagates signals in the opposite direction. Starting at the output layer L, the error derivative is computed based on all the input connections coming from the previous layer L−1. Stemming from the simple fact that the error of the output layer is Output − Target, the error can then be "recursively" defined, enabling fast training of the network. In practice, the error is usually defined as shown in the equation below.

E_total = Σ (1/2) (target − output)^2
As figure 12 shows, the error derivative with respect to the unactivated input z of each layer is used to compute the error of the previous layer's output.

Figure 12: Back Propagation Algorithm

Because matrix operations perform many calculations in one step, neural networks can compute these error derivatives and update the weight matrices very quickly. Backpropagation was the key to finally being able to train, and therefore use, neural networks. Thanks to these matrix operations, backpropagation can be parallelized to further decrease training time, making deep neural networks possible. In fact, the emergence of the entire field of deep learning has been made possible by such advances in hardware. In short, without the advent of backpropagation, neural networks would be nearly impossible to train efficiently.
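To make the two-pass procedure concrete, here is a minimal NumPy sketch of forward and backward passes for a tiny two-layer network with sigmoid activations and the squared-error loss defined above. It is a toy illustration of the chain rule, with arbitrary sizes and learning rate, not the training loop of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 3 features, 1 target value each.
X = rng.normal(size=(4, 3))
T = rng.normal(size=(4, 1))

# Randomly initialized weights for a 3-5-1 network.
W1 = rng.normal(scale=0.5, size=(3, 5))
W2 = rng.normal(scale=0.5, size=(5, 1))
lr = 0.1

for step in range(100):
    # Forward pass: propagate the inputs through the activations.
    z1 = X @ W1            # unactivated input to the hidden layer
    a1 = sigmoid(z1)       # hidden activations
    z2 = a1 @ W2           # unactivated input to the output layer
    y = sigmoid(z2)        # network output

    # Error: E_total = sum( 1/2 * (target - output)^2 )
    E = 0.5 * np.sum((T - y) ** 2)

    # Backward pass: chain rule, starting at the output layer.
    d_z2 = (y - T) * y * (1 - y)            # dE/dz2
    d_W2 = a1.T @ d_z2                      # dE/dW2
    d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)    # dE/dz1, pushed back one layer
    d_W1 = X.T @ d_z1                       # dE/dW1

    # Gradient-descent update of the weight matrices.
    W2 -= lr * d_W2
    W1 -= lr * d_W1
```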
2.4. Constraints of Neural Networks
Although neural networks have proven very effective in many applications, research in cognitive neuroscience has revealed many important differences between brains and computers. Here, we list some of the major differences:
• First, brains are analogue while computers are digital. Brains transmit information at a rate that is essentially a continuous variable. Therefore, it is believed that building a model truly identical to the brain would require scientists either to build analogue computers (changing the whole computation model we know) or to creatively develop a scheme for mapping continuous brain signals onto existing binary computing capabilities.
• Second, brains retrieve information by content while computers retrieve it by address. For example, thinking of the word apple automatically stimulates you to think about other related fruits. In a computer, the word apple either is stored at an address with a specific value or it is not. However, similar paradigms can be implemented in computers, mostly by building massive indices of stored data (as Google does).
• Third, while basic artificial neural networks are not capable of storing memories, processing and memory are performed by the same components in the brain. Inspired by the memory of neurons, a deep learning model called Long Short Term Memory (LSTM) has been developed that addresses this limitation by introducing a technique for storing information for a longer time in artificial neurons (see section 3.3 below).
Although the idea of artificial neural networks dates back to the 1950s, their applications have been brought back to the table by the availability of large computational and storage resources. Computer scientists are continuously improving neural network models to address different shortcomings. The evolving architectures are now called deep learning models, which are the focus of the next section.
3. Deep Learning Models
When exploring different deep learning models, it is easy to tune into the "buzzwords" that are frequently repeated and lose sight of what the actual objective of the learning procedure is. It is easiest to divide the models into the following two categories:
1) Discriminant architectures: these models characterize patterns based on the posterior distributions of classes. This corresponds to techniques such as classification and regression. The paradigm of discriminant models is: given an input, produce an output. Discriminant models can be viewed as bottom-up networks; inputs are given and they propagate up through the network to produce outputs. This is the main difference from their generative counterparts, which have no outputs. These models can be viewed as supervised deep learning. Examples include Deep Neural Networks (neural networks with more than 2 layers), Convolutional Neural Networks (section 3.1), Recurrent Neural Networks (section 3.2), and Long Short Term Memory (section 3.3).
2) Generative architectures: these models are employed to discover high-order correlations in a given input. In these models, there are no classes or values to predict for the input data, as there are in classification and regression techniques. The goal is to extract meaningful relationships between features in an effort to learn high-order features. Generative models can learn a distribution from training data and then produce samples from it. The bottom layer of these networks generates a vector x, and the goal is to train the model to give high probability to the training data. These models are called generative because they start from the top layer and aim to generate the inputs by propagating downwards through the network. The main domain of these architectures is therefore unsupervised feature learning. Examples include Restricted Boltzmann Machines (section 3.4), Deep Boltzmann Machines (section 3.6), Deep Belief Networks (section 3.5), and Auto-encoders (section 3.7).
Figure 13: Common network architectures
In each of the following subsections, we illustrate a deep
learning model and its purpose. In general, discriminant
architectures are trained with backpropagation whereas gen-
erative architectures are trained with a modified free-energy
method. Training procedures tend to vary on the generative
side. Figure 13 illustrates some of the general schemes of
network architectures.
3.1. Convolutional Neural Network (CNN)
3.1.1. Purpose. A CNN is primarily used for processing two-dimensional data, which makes it a prime candidate for data such as images and videos. In the area of image processing, a CNN (also called a ConvNet) is able to extract high-order features from an image (such as horizontal edges, vertical edges, or color contrasts), which can lead to an impressive understanding of the content. Convolutional networks have proved very effective at learning representations of data.
3.1.2. Architecture. For simplicity, we start by describing the model on one-dimensional data and then show how it expresses its effectiveness on two-dimensional data. To classify a sample x1, x2, x3, ..., xn using a basic neural network, we connect all the inputs to a fully-connected layer, where each input sample connects to each neuron in the hidden layer, as in figure 14.
Figure 14: Feeding input samples into a fully connected
layer (denoted by F) in a basic neural network
The architecture of CNNs follows a more sophisticated approach that exploits symmetry in the features it is looking for in the data. We can create a group of neurons before the hidden layer that takes only a segment of the data as input, as in figure 15. This added layer is called a convolutional layer. The output of the convolutional layer is fed into the fully connected layer that we added previously. Convolutional layer output can also be fed into another convolutional layer, hence creating layers of convolutions. The idea of a convolutional layer is to learn the appropriate feature filters as opposed to hand-engineering them.

Figure 15: Adding a convolutional layer. Each A contains a group of neurons that are fully connected to a segment from the inputs.
To get a higher-level representation of the data, a pooling layer is added after the convolutional layers. A pooling layer not only yields more abstract representations of the data, but also reduces the number of parameters that will be fed to the fully connected layer. For example, a max-pooling layer takes the maximum of the features over small blocks of the previous convolutional layer. Output from a pooling layer can also be fed into the input of another convolutional layer, as in figure 16.
Figure 16: Adding a max-pooling layer. The output is fed
into another convolutional layer B.
The same concepts apply to two-dimensional inputs such as images or videos. We can think of figure 17, read from bottom to top, as zooming out from the very specific details of the data representation toward a more general representation. A convolutional layer has A groups of neurons, and each group feeds on only a part of the two-dimensional input (e.g., a 5x5 pixel frame). Taking face detection as an example, a first convolutional layer learns representations of edges. After a first pooling layer, a second convolutional layer learns more general representations of face parts such as the eyes or the nose. After a second pooling layer, a third convolutional layer learns the most general representation needed to detect a human face. The output is then passed to a fully connected layer to produce the final classifications.

(a) 2-D input (b) A full 2-D input to a convolutional network
Figure 17: A full convolutional neural network with two-dimensional input.
In summary, a CNN is divided into two stages. The first is the convolutional layer, where a filter is applied to the input; this filter is a function representing a certain transformation of the input data. The second stage is the pooling layer, which summarizes neighborhoods in the output of the convolutional layer (for example, by taking their maximum). These two alternating stages can be applied for as many layers as needed, each with a different filter. A final fully connected layer is responsible for the classifications. This design helps the model detect high-order similarities within the data irrespective of orientation or rotation.
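As a minimal sketch of the two alternating stages on one-dimensional data, the following NumPy code applies a convolution filter and then max-pools neighborhoods of the result. The signal, filter values, and block size are illustrative only; in a real CNN the filter weights would be learned.

```python
import numpy as np

def conv1d(x, kernel):
    # "Valid" 1-D convolution: slide the filter across the signal and take
    # a weighted sum at each position (this is the convolutional layer).
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, size):
    # Max-pooling: keep only the maximum over small non-overlapping blocks.
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

signal = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0, 1.0])
edge_filter = np.array([-1.0, 0.0, 1.0])   # illustrative "edge detector" weights

features = conv1d(signal, edge_filter)     # convolutional layer output
pooled = max_pool1d(features, size=2)      # pooling layer output

print(features)
print(pooled)   # a shorter, more abstract representation fed to later layers
```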
3.1.3. References. Refer to the following paper [3] and blog
post [4] for a detailed illustration and studies of CNNs.
3.2. Recurrent Neural Network (RNN)
3.2.1. Purpose. An RNN is primarily used for processing data that come in the form of a sequence, which makes it a prime candidate for speech recognition, language modeling, and translation. One limitation of ConvNets is that they accept a fixed-size vector as input and produce a fixed-size vector as output, performing this mapping using a fixed number of computational steps: the number of layers in the model and the number of units in each layer. The core difference in RNNs is that they operate over sequences of vectors in both the input and the output.
3.2.2. Architecture. Traditional neural networks (and ConvNets) are memoryless. If a traditional neural network were used to classify what the weather is like from a sequence of forecast readings, it is unclear how the model would do that: it operates on a fixed-size input and a fixed-size output, performing the computation using a pre-specified number of hidden layers and units. Recurrent neural networks address this issue by introducing memory into the network in the form of a loop, as in figure 18. One can think of an RNN as a stack of separate neural networks with some parameters of each network fed from the previous network; these parameters play the role of a memory.

Figure 18: Recurrent neural network basic component. Left: a chunk of neural network A receives some input x and outputs a value h. Right: an unrolled RNN.
Inside each repeating module of the recurrent neural network, the input x at time step t is concatenated with the output h from time step t−1, and together they are passed through an activation function to produce the output h at the current time step t. Figure 19 shows an unrolled illustration of this behavior, where the yellow box represents a single neural network layer with a tanh activation function (other activation functions can be used as well).
Figure 19: The repeating module in a RNN with tanh used
as the activation function in the neural network.
Although RNNs are simple in that they accept an input vector x and produce an output vector y, their effectiveness comes from the fact that the output vector's content is influenced not only by the current input x, but also by the entire history of inputs that have been fed to the network in the past. The RNN has some internal state that gets updated every time an input is fed into the network. In the simplest case, this state is represented as a single hidden vector h.
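A vanilla RNN step can be written in a few lines. The NumPy sketch below, with illustrative sizes and a tanh activation, shows how the hidden state h is updated at every time step and therefore carries the input history forward.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 8
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new state depends on the current input AND the previous state:
    # this recurrence is the network's memory.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

sequence = rng.normal(size=(5, input_size))  # 5 time steps of input vectors
h = np.zeros(hidden_size)                    # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h)                     # h is updated at every time step
print(h)  # final state, influenced by the entire input history
```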
What happens when there is a long-term dependency? For example, a word in an essay may be derived from a word in a previous paragraph. Unfortunately, as the gap grows, RNNs become unable to learn to connect dependencies in the sequence. Figure 20 shows the long-term dependency problem in RNNs. Long Short Term Memory (LSTM) models have therefore been proposed to overcome this problem; LSTMs are the subject of the next section.
3.2.3. References. Refer to the following paper [5] and blog
post [6] for a detailed illustration and studies of RNNs.
Figure 20: The output h at time t+1 depends on the input
x at times 0 and 1.
3.3. Long Short Term Memory (LSTM)
3.3.1. Purpose. Long Short Term Memory networks are an improvement to recurrent neural networks that addresses the problem of long-term dependencies. Real-world implementations mostly rely on LSTM models rather than basic RNNs.
3.3.2. Architecture. Like RNNs, LSTMs also have a chain-like structure (when unrolled). However, instead of a single neural network layer in the repeating module as in figure 19, LSTMs have four neural network layers interacting in a special harmony, as in figure 21.
Figure 21: Four interacting neural network layers inside the repeating module of an LSTM. Each line carries an entire vector from one node to another. The yellow boxes are neural networks with the indicated activation function. The pink circles represent point-wise operations like vector addition. Lines merging denote concatenation. Lines forking denote content being cloned to different locations.
The core idea behind LSTMs is the horizontal line passing through the top of the module. The line represents a cell state that carries information along from one cycle to the next. Addition and multiplication gates control whether information is stored in the cell state vector or not. Each of the four neural network layers is responsible for a specific function inside the cell. The operation of the cell occurs in three steps as follows:
• First: the first neural network layer from the left (also called the forget gate layer) decides what information is going to be thrown away from the state vector. The sigmoid layer looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state vector that passes through the top line. A 1 represents a "completely keep this" decision and a 0 represents a "completely remove this" decision. The output of the first layer is represented as f_t below:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
• Second: the next two layers decide what new information is going to be stored in the cell state vector. The sigmoid layer (also called the input gate layer) decides which values will be updated:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
and the tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state:
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
The new cell state vector C_t is then computed as follows:
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
• Third: the last layer in the cell computes the actual output h_t. The output value is influenced by the last sigmoid layer as well as the new cell state vector that was just computed:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)
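The three steps above translate almost line for line into code. The following NumPy sketch of a single LSTM cell step follows the equations literally; the layer sizes and random initialization are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
concat = hidden_size + input_size   # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate/layer, each acting on [h_{t-1}, x_t].
W_f, b_f = rng.normal(scale=0.1, size=(concat, hidden_size)), np.zeros(hidden_size)
W_i, b_i = rng.normal(scale=0.1, size=(concat, hidden_size)), np.zeros(hidden_size)
W_C, b_C = rng.normal(scale=0.1, size=(concat, hidden_size)), np.zeros(hidden_size)
W_o, b_o = rng.normal(scale=0.1, size=(concat, hidden_size)), np.zeros(hidden_size)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(z @ W_f + b_f)             # forget gate
    i_t = sigmoid(z @ W_i + b_i)             # input gate
    C_tilde = np.tanh(z @ W_C + b_C)         # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(z @ W_o + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new output
    return h_t, C_t

h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a short input sequence
    h, C = lstm_step(x_t, h, C)
```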
Although what is described so far is a standard LSTM, almost every paper involving LSTMs uses a slightly different architecture. A common variation is to let the functions f_t, i_t, and o_t above also look at the cell state vector, a technique known as peephole connections. Other variations exist depending on the training task. Yet all variations depend on the idea of a cell state vector that can carry information for a long time, allowing long-term dependencies to be taken into consideration for prediction.
3.3.3. References. Refer to the following papers [7], [8]
and generous blog post [9] for a detailed illustration and
studies of LSTMs.
3.4. Restricted Boltzmann Machine (RBM)
3.4.1. Purpose. The first generative architecture we will explore is the Restricted Boltzmann Machine, which is not to be confused with the Boltzmann Machine. Figure 22 illustrates the difference, and the next section explains the subtlety. An RBM is commonly used for unsupervised learning tasks such as dimensionality reduction, feature learning, and collaborative filtering.
3.4.2. Architecture. An RBM is composed of two layers, an input layer and a hidden layer, with undirected connections between them. The restriction placed on an RBM is that no two nodes in the same layer may have a connection; this is the differentiator between a Boltzmann Machine and a Restricted Boltzmann Machine. The former has existed for many years, but it was not until this slight modification created the latter that this theoretical model became usable: without the restriction on intra-layer connections, a Boltzmann Machine is essentially untrainable. We can therefore define an RBM formally as a two-layer neural network with many inter-layer, but no intra-layer, connections. Each connection bears a weight that is trained during the learning procedure. By adjusting these weights, an RBM can fit its parameters to represent the distribution of the training data. Once the hidden layer is trained, one can generate samples that fit the distribution of the training data. This technique has been used to compensate for a scarce amount of available data in certain fields.

Figure 22: The Boltzmann Machine includes intra-layer connections, whereas the RBM is limited to inter-layer connections only.
Figure 23: The architecture of a RBM. The shaded nodes
represent the visible input layer and the white nodes
represent the hidden layer.
3.4.3. Training. The training procedure for an RBM differs in a few ways from the methods used for discriminant models. While the final step in the procedure still entails performing stochastic gradient descent to decrease the error, the means by which the error is computed differs. In an RBM, a procedure called Contrastive Divergence is used. In simplest terms, each iteration can be broken down into three phases. First, the hidden layer is created from the input layer based on probabilities that minimize the free energy of the model; this produces a hidden layer with certain activations and is called the positive phase. The next phase is the negative phase: the input layer is reconstructed based on this hidden layer, and the reconstructed layer is then propagated back to the hidden layer to create a new set of activations. The third phase is the update phase, where the hidden layer from the positive phase, together with the reconstructed input and the second hidden layer from the negative phase, are used to determine the error and update the weights so as to minimize it.
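A minimal sketch of one contrastive-divergence (CD-1) iteration for a binary RBM is shown below in NumPy. The layer sizes, learning rate, and sampling choices are illustrative; real implementations also tune the hyper-parameters discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible (input) biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step for a single binary sample v0."""
    global W, b_v, b_h
    # Positive phase: activate the hidden layer from the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct the input, then re-activate the hidden layer.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Update phase: move the weights toward the data statistics and away
    # from the reconstruction statistics.
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)
    return np.mean((v0 - p_v1) ** 2)   # reconstruction error, for monitoring

v = (rng.random(n_visible) < 0.5).astype(float)   # one binary training sample
for _ in range(100):
    err = cd1_update(v)
```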
All in all, this learning procedure requires hands-on experience to master. There are many hyper-parameters, such as the learning rate, momentum, weight-cost, sparsity target, initial values of the weights, number of hidden units, and size of each batch [10]. For each specific application, a specific set of hyper-parameters must be chosen. This is the art of training RBMs: there is no right or wrong way to set them, and only through trial and error can one determine a good set.
3.4.4. References. Refer to the following papers [11], [12],
[13] for a detailed illustration and studies of RBMs.
3.5. Deep Belief Network (DBN)
3.5.1. Purpose. Deep Belief Networks are used for learning a representation of some input data. Their purpose is very similar to that of the RBM, and in practice researchers rarely use plain RBMs anymore. The DBN can be viewed as the logical next step in the timeline of the development of the RBM; it is the next iteration and improvement of this type of model and has been widely accepted as a replacement for the RBM. Some have argued that since RBMs already have the representational power to approximate any function, what is the use of DBNs? Further research has concluded that adding an additional layer can only yield a positive information gain over a shallower model. This implies that there is no harm in adding an additional layer and, from our understanding, it enables the model to detect higher-level abstractions in the data.
3.5.2. Architecture. A Deep Belief Network is a stack of feedforward RBMs. The output of layer k, which is the hidden layer of one RBM, is the input of the next layer's RBM. The motivation behind this architecture is the idea that an efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially. We believe that adding layers to a DBN, as opposed to adding nodes to the hidden layer of an RBM, allows the model to become more flexible, more expressive, and less dependent on the number of nodes in each hidden layer. This requires less manual feature engineering and allows the neural network to "work its magic." We believe this makes the DBN a preferable model over the RBM.
3.5.3. Training. The study of training DBNs has filled many research papers and cannot be explained to the extent required within the scope of this paper.

Figure 24: This network represents the architecture of a Deep Belief Network. Each pair of layers represents an RBM. As explained, each RBM's hidden layer is fed into the input layer of the next RBM.

More generally, though, training is performed in a greedy layer-wise fashion; all of the learning involved is localized. By performing a greedy layer-wise procedure, the network can train iteratively and the complexity becomes manageable. This layer-by-layer unsupervised learning algorithm consists of learning a stack of RBMs, one RBM at a time, and is illustrated in Figure 25. The first step consists of training the first layer as an RBM that models the input. The hidden layer of this first RBM is then used as the input layer for the second RBM; this is generally done by propagating either the mean activations or samples drawn from them. The process is repeated for as many layers as desired, each time propagating forward the hidden layer of the previously trained RBM. The parameters (weights) of the resulting deep architecture are then updated with respect to the log-likelihood. In supervised training scenarios, a target output can be substituted for the log-likelihood as the training criterion.
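The greedy layer-wise idea can be sketched in a few lines of Python. The helper train_rbm below is a hypothetical stand-in for any RBM training routine (such as the contrastive-divergence sketch in section 3.4), and propagating the mean activations is one of the choices mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10):
    # Hypothetical stand-in for an RBM training routine (e.g. contrastive
    # divergence); returns a trained weight matrix and hidden bias.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    b_h = np.zeros(n_hidden)
    # ... CD updates would go here ...
    return W, b_h

def train_dbn(data, layer_sizes):
    """Greedy layer-wise training: train one RBM at a time and feed its
    hidden-layer (mean) activations to the next RBM as input."""
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        layers.append((W, b_h))
        x = sigmoid(x @ W + b_h)   # propagate mean activations upward
    return layers

rng = np.random.default_rng(1)
training_data = (rng.random((100, 20)) < 0.5).astype(float)
dbn = train_dbn(training_data, layer_sizes=[16, 8, 4])
```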
3.5.4. References. Refer to the following paper [14] for a
detailed illustration and studies of DBNs.
3.6. Deep Boltzmann Machine (DBM)
3.6.1. Purpose. Deep Boltzmann Machines can be viewed as multi-layer RBMs. In contrast to the RBM, which is limited to one hidden layer, a Deep Boltzmann Machine can have many. This allows the weights to be visible to other layers and forms a more complex version of the RBM. DBMs have the potential to learn increasingly complex internal representations of the data, which is needed in fields such as speech recognition and object assimilation. In practice, however, a DBM is rarely used and is often substituted with the more promising, and more trainable, DBN. We include this explanation of the architecture purely as a reference so readers can differentiate the terms and understand the difference in architecture between Deep Boltzmann Machines and Deep Belief Networks.

Figure 25: This figure represents the layer-wise training procedure of a Deep Belief Network. Each RBM is trained, stacked, and its hidden layer is fed to the input layer of the next RBM.
3.6.2. Architecture. Although very similar to the architecture of a DBN, the architecture of a Deep Boltzmann Machine has one striking difference: instead of having directed connections between each stacked RBM, a DBM has undirected connections between every layer. This implies that weights are shared throughout the entire model, as opposed to the more layer-wise approach of a DBN. The difference is illustrated in Figure 26: a DBN is a stack of connected RBMs, while a Deep Boltzmann Machine is an RBM with multiple hidden layers. This implies a fundamental difference in the training procedure. We will not cover the training procedure for a DBM because it is out of the scope of this literature survey, but bear in mind that it entails factoring in the weights in more than one direction, because signals can propagate through the network in both directions. When comparing the two models, a DBN can be viewed as a stack of RBMs, whereas a DBM is a hybrid version of the RBM.
3.6.3. References. Refer to the following paper [15] for a
detailed illustration and studies of DBMs.
3.7. Auto-encoders
3.7.1. Purpose. Auto-encoders are neural networks that aim to learn a compressed representation, or encoding, of the input data. The model is considered generative because it is trained to recreate the input data from its hidden layer. Auto-encoders are great for dimensionality reduction and have attracted serious interest recently.
Figure 26: Although each is built from stacked RBMs, the direction of the connections between layers in Deep Belief Networks and Deep Boltzmann Machines differs.
3.7.2. Architecture. Auto-encoders have a unique architecture. They are designed with three layers: the first is the input layer, the third is the output layer, as shown in Figure 27, and the middle hidden layer between these two is called the feature layer. The input and output layers of an auto-encoder are intended to be the same after training; the middle layer serves as an encoding of the input data. This middle layer's dimensionality can be greater than or less than that of the input layer, depending on the application. When the feature layer has a lower dimensionality than the input layer, the model is excellent at performing dimensionality reduction. The real focus of these models is the feature layer that is created during training: since the input and output layers will be the same, they are of no interest beyond training purposes, while the middle layer represents an encoding of the data. Architectures such as stacked auto-encoders link these feature layers in a stacked fashion to create higher-level abstractions of the data as well. This methodology of stacking neural networks to create a high-level understanding of the data is key to deep learning. By allowing the network more representations, more correlations can be detected automatically; this automatic encoding is why the model is called an auto-encoder.
Figure 27: The architecture of an Auto-Encoder. The first and last layers are the same; the middle layer represents the features (encoding) learned during training.

3.7.3. Training. Training can be conceptualized as the network trying to "recreate" the data. The network receives its inputs and feeds them to the feature layer. The first part of the training process is called the encoding phase: the input data from the first layer is encoded into the feature layer through adjustable weights. Each node in the feature layer then propagates a signal forward and, with the assistance of adjustable weights and biases, maps this encoded representation back to its original un-encoded state; this is referred to as the decoding phase. To summarize, data is fed into the input layer, encoded in the feature layer, then decoded into the output layer. The error is determined by comparing the output value to the input value, as they should be exactly the same.
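A minimal auto-encoder with one feature layer can be trained with the same backpropagation machinery described earlier. The NumPy sketch below, with illustrative sizes, sigmoid activations, and squared reconstruction error, runs the encode/decode cycle just described.

```python
import numpy as np

rng = np.random.default_rng(0)

n_input, n_feature, lr = 8, 3, 0.5   # feature layer smaller than input: dimensionality reduction
W_enc = rng.normal(scale=0.1, size=(n_input, n_feature))
W_dec = rng.normal(scale=0.1, size=(n_feature, n_input))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = (rng.random((32, n_input)) < 0.5).astype(float)   # toy training data

for step in range(500):
    # Encoding phase: input -> feature layer.
    code = sigmoid(X @ W_enc)
    # Decoding phase: feature layer -> reconstruction of the input.
    recon = sigmoid(code @ W_dec)

    # Error: the reconstruction should match the input exactly.
    err = recon - X

    # Backpropagate the reconstruction error through both phases.
    d_recon = err * recon * (1 - recon)
    d_code = (d_recon @ W_dec.T) * code * (1 - code)
    W_dec -= lr * code.T @ d_recon / len(X)
    W_enc -= lr * X.T @ d_code / len(X)

print(np.mean((recon - X) ** 2))   # mean reconstruction error after training
```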
3.7.4. References. Refer to the following paper [16] for a
detailed illustration and studies of Auto-encoders.
4. Choosing a Model
With the growing number of variations of deep learning models, it is important to choose a model that is suitable for the task at hand. Many factors contribute to choosing a model that can effectively represent a solution:
• First, study the dataset at hand.
• Second, decide whether you want to perform classification, prediction, or learn a representation of the data.
• Third, choose a model and try different variations of it until you reach the desired objective.
In Table 1, we summarize the decision factors for the models surveyed in this study. These models cover a wide variety of domains, and other models fall into one of these general architectures.
5. Applications
At the time of this survey, researchers from different labs are applying deep learning models to a myriad of real-world applications and achieving state-of-the-art performance. In this section, we shed light on some of the trending projects sponsored by large tech companies.
5.1. Facebook’s DeepFace
Uploading a picture with your friends to Facebook automatically suggests tagging them by recognizing their faces. Closing the gap to human-level performance in face verification is the main research focus of Facebook's DeepFace. They derive a face representation from a nine-layer deep neural network that involves more than 120 million parameters, using several locally connected layers without weight sharing. The model was trained on four million facial images belonging to more than 4,000 identities, and the result is the most powerful face recognition module we see in the largest social network in the world.
5.2. Google’s DeepMind
Founded in London in 2010 and acquired by Google in early 2014, DeepMind builds algorithms that are capable of learning for themselves directly from raw experience or data, and that are general in that they can perform well across a wide variety of tasks straight out of the box. Their team consists of many renowned experts in their respective fields, including but not limited to deep neural networks, reinforcement learning, and systems neuroscience-inspired models. One recent remarkable achievement is AlphaGo, the first computer program to ever beat a professional Go player. It was a tremendous milestone to see a computer brain, powered by deep learning models, beat a human brain.
5.3. Apple’s Siri
Siri (Speech Interpretation and Recognition Interface) is Apple's intelligent personal assistant that comes pre-installed on their iPhone devices. Siri's primary technical areas focus on a conversational interface, personal context awareness, and service delegation. At the core of the conversational interface resides a strong speech recognition engine, powered by deep learning models, that learns a user's accent and adapts to it to respond with better results. The power of Siri comes not only from the speech recognition engine, but also from other machine learning models that can carry a full conversation between the user and the device, relying on a set of web services.
Model: CNN. Type: Discriminative. Purpose: Classification. Suitable for: Processing two-dimensional data. Example: Images/videos.
Model: RNN. Type: Discriminative. Purpose: Prediction. Suitable for: Processing sequence data. Example: Language models and speech.
Model: LSTM. Type: Discriminative. Purpose: Prediction. Suitable for: Processing long sequence data. Example: Language models and speech.
Model: RBM. Type: Generative. Purpose: Unsupervised feature learning. Suitable for: Learning distributions of data. Example: Generating samples from learned hidden representations.
Model: DBN. Type: Generative. Purpose: Unsupervised feature learning. Suitable for: Creating a probabilistic reconstruction of data. Example: Trained layers used as feature detectors.
Model: Auto-encoder. Type: Generative. Purpose: Dimensionality reduction. Suitable for: Creating a compact representation of data. Example: PCA-like tasks.
TABLE 1: Summary of the surveyed deep learning models
5.4. Microsoft’s Cortana
Analogous to Siri, Cortana is the clever personal as-
sistant developed by Microsoft that helps you find things
on your PC, manage your calendar, track packages, find
files, chat with you, and tell jokes. Cortana learns the user
behavior through a deep learning model in the sense that
the more you use Cortana, the more personalized your ex-
perience will be. Cortana depends heavily on understanding
a user’s query and takes actions based on this request. A set
of deep learning language models enable setting reminders,
making calls, sending emails and answering questions when
requested by the user. Cortana is a massive improvement
in the field of artificial intelligence in human-computer
interaction.
6. Current Research Directions
As demonstrated, deep learning is a vast field [3], [17]. Following the theoretical claim that, with enough hidden nodes, a model can be trained to represent any function or distribution, we are seeing a re-emergence of many classical machine learning techniques, especially with the continuing improvements in computational resources.
On the discriminant side, there is an aggressive push towards memory-based models. As shown, one successful model is the LSTM, but it is by no means the only one that has been proposed: Memory Networks, Neural Turing Machines, and Hierarchical Temporal Memory are all similar memory-based deep neural networks. The advantage of this direction is that the networks are able to retain state throughout their lifetimes. The goal of these networks is to make tasks such as sequence learning and reinforcement learning representable and trainable; these tasks require memory in order to exploit previously seen inputs and their correlations in future predictions. In our opinion, reinforcement learning will be the heavy focus of deep learning in the next few years. We are seeing a paradigm shift as data scientists realize that deep neural networks can be used as function approximators in reinforcement learning algorithms. We believe this has a lot of potential, and we will be pursuing research in this area in the future.
On the generative side of models, there has been re-emerging interest in the past few years. Hinton dropped a bomb and ignited the entire field of deep learning with his influential paper [14] on a generative architecture. The focus then shifted to the shinier side of unsupervised learning. With the explosion of unlabeled data pouring from the various sources of big data, the need to improve these unsupervised deep learning models has been growing. The focus of these models varies from serving as a pre-training step whose output is fed forward to a discriminant model, to more "all-in-one" hybrid solutions. There is past research in the area of discriminant RBMs and their variations, highlighted in Bengio's paper [18], that we believe will be useful in truly harnessing the representational power of these typically generative models.
7. Conclusion
This concludes our survey of the field of deep learning. To summarize in one statement, we believe deep learning can be viewed as the art of using deep neural network structures to represent any machine learning task. Although some of the theoretical strengths of neural networks have been claimed since the 1950s, the recent advancement of computer hardware has made these hypotheses verifiable. What we are now seeing is a complete redefinition of the tasks that have been staples of the field of machine learning and the broader domain of artificial intelligence.
We view these recent advancements as the beginning of the era of truly thinking computers. Whereas older machine learning techniques such as SVMs, clustering, PCA, etc. are each based on certain statistical characteristics of the data, neural networks can be viewed as a digital muscle that can be strengthened in a certain manner to represent any of those models. In our own opinion, old ML techniques can be viewed as discrete learning methods, whereas deep learning is more of a continuous learning method. A simple example is comparing a stack of linear regressions placed on top of each other with a deep neural network. Ultimately, a stack of linear regressions is still linear no matter what: the resulting equation may have a completely different slope and bias and may represent an arbitrary linear function, but its capabilities are limited. As demonstrated, neural networks are not bound by this linearity. The use of a nonlinear activation function boosts the representational power of these models so much that there are theoretical claims that deep learning architectures can be trained to represent any distribution or function [19]. This representational power stems from the differentiation of network structures into discriminant and generative architectures.
To state that the emergence of the field of deep learning has correlated with the rise in performance of computer hardware does not illustrate the dependence strongly enough. If one thing was clear from our research, it is that the techniques of deep learning are among the most computationally intensive problems that computers have been introduced to. It is no understatement that deep learning is a field that models its methods on the world's most powerful processor: the human brain. With a foundation strongly rooted in neuroscience, we have no doubt that the models developed by deep learning researchers will aid and push the sister field. There is an innate link between the research neuroscientists are performing on the brain to understand how the human mind works and the work deep learning experts are undertaking to emulate this process. We believe that only through further integration of the fields of deep learning and neuroscience, seen in models such as Hierarchical Temporal Memory, can true general intelligence be realized. As such computationally intensive software methods are created, hardware will continue to push the boundaries of what is considered possible.
References
[1] Y. Bengio, “Learning deep architectures for AI,” Foundations and
Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009, also
published as a book. Now Publishers, 2009.
[2] ——, “Practical recommendations for gradient-based train-
ing of deep architectures,” 06 2012. [Online]. Available:
http://arxiv.org/abs/1206.5533
[3] I. Arel, D. C. Rose, and T. P. Karnowski, “Deep machine learning -
a new frontier in artificial intelligence research [research frontier],”
IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13–18,
Nov 2010.
[4] C. Olah, “Conv nets: A modular perspective.”
[5] I. Sutskever, “Training recurrent neural networks,” Ph.D. dissertation, University of Toronto, Toronto, Ont., Canada, 2013, AAINS22066.
[6] A. Karpathy, “The unreasonable effectiveness of recurrent neural networks.”
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online].
Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[8] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” CoRR, vol. abs/1503.04069, 2015. [Online]. Available: http://arxiv.org/abs/1503.04069
[9] C. Olah, “Understanding LSTM networks.”
[10] G. E. Hinton, Neural Networks: Tricks of the Trade: Second Edition.
Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, ch. A Practical
Guide to Training Restricted Boltzmann Machines, pp. 599–619.
[11] ——, “Deterministic boltzmann learning performs steepest descent in
weight-space,” Neural Comput., vol. 1, no. 1, pp. 143–150, Mar. 1989.
[Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.1.143
[12] N. Le Roux and Y. Bengio, “Representational power of restricted
boltzmann machines and deep belief networks,” Neural Comput.,
vol. 20, no. 6, pp. 1631–1649, Jun. 2008. [Online]. Available:
http://dx.doi.org/10.1162/neco.2008.04-07-510
[13] R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” in
Proceedings of the International Conference on Artificial Intelligence
and Statistics, vol. 5, 2009, pp. 448–455.
[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast
learning algorithm for deep belief nets,” Neural Comput.,
vol. 18, no. 7, pp. 1527–1554, Jul. 2006. [Online]. Available:
http://dx.doi.org/10.1162/neco.2006.18.7.1527
[15] R. Salakhutdinov and G. Hinton, “An efficient learning procedure
for deep boltzmann machines,” Neural Comput., vol. 24, no. 8, pp.
1967–2006, Aug. 2012.
[16] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,
vol. 521, no. 7553, pp. 436–444, 05 2015. [Online]. Available:
http://dx.doi.org/10.1038/nature14539
[17] J. Schmidhuber, “Deep learning in neural networks: An overview,”
04 2014. [Online]. Available: http://arxiv.org/abs/1404.7828
[18] H. Larochelle and Y. Bengio, “Classification using discriminative
restricted Boltzmann machines,” in Proceedings of the Twenty-fifth
International Conference on Machine Learning (ICML’08), W. W.
Cohen, A. McCallum, and S. T. Roweis, Eds. ACM, 2008, pp.
536–543.
[19] Y. Bengio, A. Courville, and P. Vincent, “Representation learning:
A review and new perspectives,” 06 2012. [Online]. Available:
http://arxiv.org/abs/1206.5538
[20] W. W. Cohen, A. McCallum, and S. T. Roweis, Eds., Proceedings
of the Twenty-fifth International Conference on Machine Learning
(ICML’08). ACM, 2008.
[21] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled
sampling for sequence prediction with recurrent neural networks,”
06 2015. [Online]. Available: http://arxiv.org/abs/1506.03099
[22] R. Sun, “Introduction to sequence learning,” in Sequence
Learning - Paradigms, Algorithms, and Applications. London,
UK, UK: Springer-Verlag, 2001, pp. 1–10. [Online]. Available:
http://dl.acm.org/citation.cfm?id=647073.713884
[23] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Opti-
mization of Machine Learning Algorithms,” ArXiv e-prints, Jun. 2012.
[24] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller,
“The manifold tangent classifier,” in Advances in Neural Information
Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett,
F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011,
pp. 2294–2302. [Online]. Available: http://papers.nips.cc/paper/4409-
the-manifold-tangent-classifier.pdf
[25] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Improving neural networks by preventing co-
adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
[Online]. Available: http://arxiv.org/abs/1207.0580
[26] Y. Bengio and S. Bengio, “Modeling high-dimensional discrete data with multi-layer neural networks,” Advances in Neural Information Processing Systems 12, pp. 400–406, 2000.
[27] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, Nov 1998.


What's hot

Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...Simplilearn
 
Ppt on artifishail intelligence
Ppt on artifishail intelligencePpt on artifishail intelligence
Ppt on artifishail intelligencesnehal_gongle
 
Deep Learning: Application & Opportunity
Deep Learning: Application & OpportunityDeep Learning: Application & Opportunity
Deep Learning: Application & OpportunityiTrain
 
Neural networks of artificial intelligence
Neural networks of artificial  intelligenceNeural networks of artificial  intelligence
Neural networks of artificial intelligencealldesign
 
what is neural network....???
what is neural network....???what is neural network....???
what is neural network....???Adii Shah
 
introduction to deep Learning with full detail
introduction to deep Learning with full detailintroduction to deep Learning with full detail
introduction to deep Learning with full detailsonykhan3
 
Artificial neural networks seminar presentation using MSWord.
Artificial neural networks seminar presentation using MSWord.Artificial neural networks seminar presentation using MSWord.
Artificial neural networks seminar presentation using MSWord.Mohd Faiz
 
Artificial Neural Network (draft)
Artificial Neural Network (draft)Artificial Neural Network (draft)
Artificial Neural Network (draft)James Boulie
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications Ahmed_hashmi
 
Neural networks
Neural networksNeural networks
Neural networksBasil John
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Simplilearn
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learningleopauly
 
Artificial neural network for machine learning
Artificial neural network for machine learningArtificial neural network for machine learning
Artificial neural network for machine learninggrinu
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...Simplilearn
 

What's hot (20)

Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
Deep Learning With Python | Deep Learning And Neural Networks | Deep Learning...
 
Neural network
Neural networkNeural network
Neural network
 
Deep learning
Deep learning Deep learning
Deep learning
 
Ppt on artifishail intelligence
Ppt on artifishail intelligencePpt on artifishail intelligence
Ppt on artifishail intelligence
 
Project presentation
Project presentationProject presentation
Project presentation
 
Deep Learning: Application & Opportunity
Deep Learning: Application & OpportunityDeep Learning: Application & Opportunity
Deep Learning: Application & Opportunity
 
Neural networks of artificial intelligence
Neural networks of artificial  intelligenceNeural networks of artificial  intelligence
Neural networks of artificial intelligence
 
what is neural network....???
what is neural network....???what is neural network....???
what is neural network....???
 
introduction to deep Learning with full detail
introduction to deep Learning with full detailintroduction to deep Learning with full detail
introduction to deep Learning with full detail
 
Artificial neural networks seminar presentation using MSWord.
Artificial neural networks seminar presentation using MSWord.Artificial neural networks seminar presentation using MSWord.
Artificial neural networks seminar presentation using MSWord.
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Neural network
Neural networkNeural network
Neural network
 
Artificial Neural Network (draft)
Artificial Neural Network (draft)Artificial Neural Network (draft)
Artificial Neural Network (draft)
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
 
Neural networks
Neural networksNeural networks
Neural networks
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 
Artificial neural network for machine learning
Artificial neural network for machine learningArtificial neural network for machine learning
Artificial neural network for machine learning
 
Neural networks
Neural networksNeural networks
Neural networks
 
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori...
 

Viewers also liked

Weather forecasting
Weather forecastingWeather forecasting
Weather forecastingsumirehan000
 
Persistent RNNs: Stashing Recurrent Weights On-Chip
Persistent RNNs: Stashing Recurrent Weights On-ChipPersistent RNNs: Stashing Recurrent Weights On-Chip
Persistent RNNs: Stashing Recurrent Weights On-ChipBaidu USA Research
 
Deep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItDeep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItHolberton School
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesTuri, Inc.
 
Deep Learning in Natural Language Processing
Deep Learning in Natural Language ProcessingDeep Learning in Natural Language Processing
Deep Learning in Natural Language ProcessingDavid Dao
 
9/1 Top 5 Deep Learning
9/1 Top 5 Deep Learning9/1 Top 5 Deep Learning
9/1 Top 5 Deep LearningNVIDIA
 
The Forecast // Artificial Intelligence
The Forecast // Artificial IntelligenceThe Forecast // Artificial Intelligence
The Forecast // Artificial IntelligenceUsbek & Rica Trends
 
How Will Links Influence SEO in the Future
How Will Links Influence SEO in the FutureHow Will Links Influence SEO in the Future
How Will Links Influence SEO in the FutureRand Fishkin
 
Weather forecasting
Weather forecastingWeather forecasting
Weather forecastingESSBY
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendationsBalázs Hidasi
 
機械学習と深層学習の数理
機械学習と深層学習の数理機械学習と深層学習の数理
機械学習と深層学習の数理Ryo Nakamura
 
Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...
Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...
Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...Rand Fishkin
 
論文紹介 Pixel Recurrent Neural Networks
論文紹介 Pixel Recurrent Neural Networks論文紹介 Pixel Recurrent Neural Networks
論文紹介 Pixel Recurrent Neural NetworksSeiya Tokui
 
Kimye North West pregnancy and baby stuff
Kimye North West pregnancy and baby stuffKimye North West pregnancy and baby stuff
Kimye North West pregnancy and baby stuffRomy Filby
 
Digital Intelligence Systems interview questions and answers
Digital Intelligence Systems interview questions and answersDigital Intelligence Systems interview questions and answers
Digital Intelligence Systems interview questions and answersbarabar906
 
L2G Facebook para Pymes - Introducción
L2G Facebook para Pymes - IntroducciónL2G Facebook para Pymes - Introducción
L2G Facebook para Pymes - Introducciónicontreras79
 
Confirmation limits
Confirmation limitsConfirmation limits
Confirmation limitsdescross
 
property in Neemrana-Ashu Group,7503367689
property in Neemrana-Ashu Group,7503367689property in Neemrana-Ashu Group,7503367689
property in Neemrana-Ashu Group,7503367689sahilkharkara5
 

Viewers also liked (20)

Weather forecasting
Weather forecastingWeather forecasting
Weather forecasting
 
Persistent RNNs: Stashing Recurrent Weights On-Chip
Persistent RNNs: Stashing Recurrent Weights On-ChipPersistent RNNs: Stashing Recurrent Weights On-Chip
Persistent RNNs: Stashing Recurrent Weights On-Chip
 
Deep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItDeep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do It
 
Deep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep FeaturesDeep Learning Made Easy with Deep Features
Deep Learning Made Easy with Deep Features
 
Deep Learning in Natural Language Processing
Deep Learning in Natural Language ProcessingDeep Learning in Natural Language Processing
Deep Learning in Natural Language Processing
 
9/1 Top 5 Deep Learning
9/1 Top 5 Deep Learning9/1 Top 5 Deep Learning
9/1 Top 5 Deep Learning
 
Tutorial on Deep Learning
Tutorial on Deep LearningTutorial on Deep Learning
Tutorial on Deep Learning
 
The Forecast // Artificial Intelligence
The Forecast // Artificial IntelligenceThe Forecast // Artificial Intelligence
The Forecast // Artificial Intelligence
 
How Will Links Influence SEO in the Future
How Will Links Influence SEO in the FutureHow Will Links Influence SEO in the Future
How Will Links Influence SEO in the Future
 
Weather forecast ppt
Weather forecast pptWeather forecast ppt
Weather forecast ppt
 
Weather forecasting
Weather forecastingWeather forecasting
Weather forecasting
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendations
 
機械学習と深層学習の数理
機械学習と深層学習の数理機械学習と深層学習の数理
機械学習と深層学習の数理
 
Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...
Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...
Fight Back Against Back: How Search Engines & Social Networks' AI Impacts Mar...
 
論文紹介 Pixel Recurrent Neural Networks
論文紹介 Pixel Recurrent Neural Networks論文紹介 Pixel Recurrent Neural Networks
論文紹介 Pixel Recurrent Neural Networks
 
Kimye North West pregnancy and baby stuff
Kimye North West pregnancy and baby stuffKimye North West pregnancy and baby stuff
Kimye North West pregnancy and baby stuff
 
Digital Intelligence Systems interview questions and answers
Digital Intelligence Systems interview questions and answersDigital Intelligence Systems interview questions and answers
Digital Intelligence Systems interview questions and answers
 
L2G Facebook para Pymes - Introducción
L2G Facebook para Pymes - IntroducciónL2G Facebook para Pymes - Introducción
L2G Facebook para Pymes - Introducción
 
Confirmation limits
Confirmation limitsConfirmation limits
Confirmation limits
 
property in Neemrana-Ashu Group,7503367689
property in Neemrana-Ashu Group,7503367689property in Neemrana-Ashu Group,7503367689
property in Neemrana-Ashu Group,7503367689
 

Similar to A Study On Deep Learning

Fuzzy Logic Final Report
Fuzzy Logic Final ReportFuzzy Logic Final Report
Fuzzy Logic Final ReportShikhar Agarwal
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Artificial neural network paper
Artificial neural network paperArtificial neural network paper
Artificial neural network paperAkashRanjandas1
 
Neural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdfNeural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdfneelamsanjeevkumar
 
Artificial Neural Networks ppt.pptx for final sem cse
Artificial Neural Networks  ppt.pptx for final sem cseArtificial Neural Networks  ppt.pptx for final sem cse
Artificial Neural Networks ppt.pptx for final sem cseNaveenBhajantri1
 
Nature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic WebNature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic Webguestecf0af
 
BASIC CONCEPT OF DEEP LEARNING.pptx
BASIC CONCEPT OF DEEP LEARNING.pptxBASIC CONCEPT OF DEEP LEARNING.pptx
BASIC CONCEPT OF DEEP LEARNING.pptxRiteshPandey184067
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks ShwethaShreeS
 
Artificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementArtificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementIOSR Journals
 
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATAAPPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATAIJDKP
 
Neural Networks.pptx
Neural Networks.pptxNeural Networks.pptx
Neural Networks.pptxshahinbme
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Akash Goel
 
Neural Network
Neural NetworkNeural Network
Neural NetworkSayyed Z
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionIJCSIS Research Publications
 
machinelearningengineeringslideshare-160909192132 (1).pdf
machinelearningengineeringslideshare-160909192132 (1).pdfmachinelearningengineeringslideshare-160909192132 (1).pdf
machinelearningengineeringslideshare-160909192132 (1).pdfShivareddyGangam
 

Similar to A Study On Deep Learning (20)

deep learning
deep learningdeep learning
deep learning
 
Fuzzy Logic Final Report
Fuzzy Logic Final ReportFuzzy Logic Final Report
Fuzzy Logic Final Report
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Artificial neural network paper
Artificial neural network paperArtificial neural network paper
Artificial neural network paper
 
Neural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdfNeural networks are parallel computing devices.docx.pdf
Neural networks are parallel computing devices.docx.pdf
 
Artificial Neural Networks ppt.pptx for final sem cse
Artificial Neural Networks  ppt.pptx for final sem cseArtificial Neural Networks  ppt.pptx for final sem cse
Artificial Neural Networks ppt.pptx for final sem cse
 
Nature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic WebNature Inspired Reasoning Applied in Semantic Web
Nature Inspired Reasoning Applied in Semantic Web
 
B42010712
B42010712B42010712
B42010712
 
BASIC CONCEPT OF DEEP LEARNING.pptx
BASIC CONCEPT OF DEEP LEARNING.pptxBASIC CONCEPT OF DEEP LEARNING.pptx
BASIC CONCEPT OF DEEP LEARNING.pptx
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks
 
Artificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementArtificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In Management
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATAAPPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
APPLYING NEURAL NETWORKS FOR SUPERVISED LEARNING OF MEDICAL DATA
 
Neural Networks.pptx
Neural Networks.pptxNeural Networks.pptx
Neural Networks.pptx
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders
 
Artifical Neural Network
Artifical Neural NetworkArtifical Neural Network
Artifical Neural Network
 
Neural Network
Neural NetworkNeural Network
Neural Network
 
ANN - UNIT 1.pptx
ANN - UNIT 1.pptxANN - UNIT 1.pptx
ANN - UNIT 1.pptx
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware Detection
 
machinelearningengineeringslideshare-160909192132 (1).pdf
machinelearningengineeringslideshare-160909192132 (1).pdfmachinelearningengineeringslideshare-160909192132 (1).pdf
machinelearningengineeringslideshare-160909192132 (1).pdf
 

More from Abdelrahman Hosny

Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...Abdelrahman Hosny
 
iPhone Architecture - Review
iPhone Architecture - ReviewiPhone Architecture - Review
iPhone Architecture - ReviewAbdelrahman Hosny
 
Implementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy ServerImplementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy ServerAbdelrahman Hosny
 
Microsoft SharePoint 2010 Overview
Microsoft SharePoint 2010 OverviewMicrosoft SharePoint 2010 Overview
Microsoft SharePoint 2010 OverviewAbdelrahman Hosny
 
A Comparison of .NET Framework vs. Java Virtual Machine
A Comparison of .NET Framework vs. Java Virtual MachineA Comparison of .NET Framework vs. Java Virtual Machine
A Comparison of .NET Framework vs. Java Virtual MachineAbdelrahman Hosny
 
3.0 Introduction to .NET Framework
3.0 Introduction to .NET Framework3.0 Introduction to .NET Framework
3.0 Introduction to .NET FrameworkAbdelrahman Hosny
 
1.0 Introduction to Hardware Computer Architecture
1.0 Introduction to Hardware Computer Architecture1.0 Introduction to Hardware Computer Architecture
1.0 Introduction to Hardware Computer ArchitectureAbdelrahman Hosny
 
2.0 Introduction to Computer Science and Programming
2.0 Introduction to Computer Science and Programming2.0 Introduction to Computer Science and Programming
2.0 Introduction to Computer Science and ProgrammingAbdelrahman Hosny
 

More from Abdelrahman Hosny (17)

Teaching Philosophy
Teaching PhilosophyTeaching Philosophy
Teaching Philosophy
 
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
 
My Teaching Philosophy
My Teaching PhilosophyMy Teaching Philosophy
My Teaching Philosophy
 
iPhone Architecture - Review
iPhone Architecture - ReviewiPhone Architecture - Review
iPhone Architecture - Review
 
Implementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy ServerImplementing a Caching Scheme for Media Streaming in a Proxy Server
Implementing a Caching Scheme for Media Streaming in a Proxy Server
 
A Servant Leader
A Servant LeaderA Servant Leader
A Servant Leader
 
Microsoft SharePoint 2010 Overview
Microsoft SharePoint 2010 OverviewMicrosoft SharePoint 2010 Overview
Microsoft SharePoint 2010 Overview
 
A Comparison of .NET Framework vs. Java Virtual Machine
A Comparison of .NET Framework vs. Java Virtual MachineA Comparison of .NET Framework vs. Java Virtual Machine
A Comparison of .NET Framework vs. Java Virtual Machine
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Office365
Office365Office365
Office365
 
The Silent Presentation
The Silent PresentationThe Silent Presentation
The Silent Presentation
 
Team Building
Team BuildingTeam Building
Team Building
 
Introduction to Marketing
Introduction to MarketingIntroduction to Marketing
Introduction to Marketing
 
Interviewing
InterviewingInterviewing
Interviewing
 
3.0 Introduction to .NET Framework
3.0 Introduction to .NET Framework3.0 Introduction to .NET Framework
3.0 Introduction to .NET Framework
 
1.0 Introduction to Hardware Computer Architecture
1.0 Introduction to Hardware Computer Architecture1.0 Introduction to Hardware Computer Architecture
1.0 Introduction to Hardware Computer Architecture
 
2.0 Introduction to Computer Science and Programming
2.0 Introduction to Computer Science and Programming2.0 Introduction to Computer Science and Programming
2.0 Introduction to Computer Science and Programming
 

Recently uploaded

Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchersdarmandersingh4580
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...ThinkInnovation
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...mikehavy0
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444saurabvyas476
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxAniqa Zai
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshareraiaryan448
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格q6pzkpark
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...ThinkInnovation
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?RemarkSemacio
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeBoston Institute of Analytics
 
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...vershagrag
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 

Recently uploaded (20)

Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
👉 Tirunelveli Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Gir...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 

In the body of the neuron, there is the nucleus that receives pulses of electricity from input wires (dendrites) and, based on these signals, the neuron does some computation and sends a message (electrical impulses) to other neurons through output wires (axons). The human brain has billions of these neurons connected together. Different neurons in the brain are responsible for different senses, such as sight, smell and touch. It has been scientifically observed that any neuron in the brain network can learn to do other jobs. For example, experiments on animals show that if we disconnect the wires that connect an auditory neuron to the ears and connect it to the eyes, the neuron will learn to see, as in figure 2a. Similar experiments disconnect the somatosensory neuron's connection to the hand and connect it to the eyes; it eventually learns to see, as in figure 2b.

Figure 1: Human brain neuron

Figure 2: Neurons learn to do different tasks when the original wires are disconnected and reconnected to other senses. (a) Auditory cortex learns to see. (b) Somatosensory cortex learns to see.
Now, let us switch context to talk about mimicking this neural network in computers. In a software environment, we create a similar model that has three major components:

• A cell body that contains the neuron. The neuron is responsible for doing the computations.
• Input wires that carry signals into the neuron.
• Output wire(s) that transfer the output signal to other neurons.

Figure 3 is a simple artificial (computer) neural network that has only one neuron (the orange circle). x_1, x_2, and x_3 are the inputs to the neuron and they carry numerical values. The function h is called the hypothesis function. It computes its value by multiplying the input vector x by a weight vector w, and the result is then passed through an activation function that computes the final scalar output.

Figure 3: Artificial neural network with one neuron

Figure 4 shows a more advanced neural network. Each vertical set of neurons is called a layer. Layer 1 contains the neurons that represent the inputs. Layer 2 is also called a hidden layer; it does the core computation. Layer 3 is called the output layer; it does a computation on the data received from layer 2 and then outputs one final result. Now, the missing information in the one-neuron figure is:

1) What is the weight vector to be multiplied by the input vector?
2) After multiplying the two vectors, what is the activation function that will output the final result?

Besides the number of layers and the number of neurons in each layer, the answers to these two questions define the neural network model. If one could solve, or model, a specific mathematical problem by assigning values to the weight vector and choosing an appropriate activation function, the neural network model would satisfy its goal.

Figure 4: Artificial neural network with two layers

In practice, assigning weights and choosing an activation function is the most challenging part of designing a neural network. Therefore, computerized training procedures have been developed to let the software optimize the values of the weights. In the next two subsections, we discuss activation functions and the backpropagation algorithm, the fundamental technique used to train a neural network.

2.2. Activation Functions

As stated in the previous subsection, each layer is composed of a set of neurons. The purpose of each neuron is to perform a non-linear transformation on the input. Using the network in figure 3 as an example, the input vector x is multiplied by the weight vector w. If N is the number of inputs to the neuron, vector x has a shape of [1, N] and vector w has a shape of [N, 1]; multiplying these two vectors results in a scalar [1, 1] value.

x = [x_1, x_2, \ldots, x_n]   (1)

w = [w_1, w_2, \ldots, w_n]^T   (2)

x \times w = \sum_i x_i w_i = x_1 w_1 + x_2 w_2 + \ldots + x_n w_n   (3a)

y = x_1 w_1 + x_2 w_2 + \ldots + x_n w_n + bias   (3b)

As you can see from equation (3b), y is a simple linear function of the inputs. Although interesting, this linearity offers no advantage over simple linear regression. If y were passed directly to the next layer's nodes, we would say the neuron had a linear activation function; in fact, one can view a perceptron with a linear activation function as just that - linear regression. By passing y through non-linear activation functions, the network becomes able to represent far more general functions. The most popular activation functions are the following:

• Identity: A(y) = y (figure 5)
• Binary step: A(y) = 0 for y < 0, and 1 for y \geq 0 (figure 6). From a biological standpoint, this activation determines whether the neuron propagates a signal forward to a receiving neuron or not.
• Logistic: A(y) = \frac{1}{1 + e^{-y}} (figure 7)
• TanH: A(y) = \tanh(y) = \frac{2}{1 + e^{-2y}} - 1 (figure 8)
• Softsign: A(y) = \frac{y}{1 + |y|} (figure 9)
• Rectified Linear Unit (ReLU): A(y) = 0 for y < 0, and y for y \geq 0 (figure 10)
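To make the forward computation concrete, the following minimal NumPy sketch implements a single artificial neuron: the input vector is multiplied by the weight vector, a bias is added (equation 3b), and the result is passed through one of the activation functions listed above. The inputs, weights, and bias are arbitrary illustrative values, not taken from this survey.

```python
import numpy as np

# Activation functions from section 2.2
def identity(y):     return y
def binary_step(y):  return np.where(y < 0, 0.0, 1.0)
def logistic(y):     return 1.0 / (1.0 + np.exp(-y))
def tanh_act(y):     return np.tanh(y)
def softsign(y):     return y / (1.0 + np.abs(y))
def relu(y):         return np.maximum(0.0, y)

def neuron(x, w, bias, activation):
    """Single neuron: y = x . w + bias, followed by a non-linear activation."""
    y = np.dot(x, w) + bias          # equation (3b): scalar pre-activation
    return activation(y)

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])       # inputs x1, x2, x3
w = np.array([0.4, 0.1, -0.6])       # weights w1, w2, w3
bias = 0.2

for name, f in [("identity", identity), ("binary step", binary_step),
                ("logistic", logistic), ("tanh", tanh_act),
                ("softsign", softsign), ("relu", relu)]:
    print(name, neuron(x, w, bias, f))
```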
2.3. Backpropagation Algorithm

A neural network is trained with a combination of two steps. The first step involves propagating the information forward through the activation functions. The previous section illustrated some of the most popular activation functions used for the nodes in a network. Once this first pass is completed, the model produces an output, and the error of the network represents how close this output is to the expected value. The second step in the training process involves adjusting the weights of the network in an attempt to minimize this error. As one can imagine, in a network where every layer is fully connected to the next, the number of weights grows very quickly. Therefore, an efficient way of minimizing the training error is a crucial need, and backpropagation provides it. Backpropagation can be viewed as a clever use of the chain rule [2].

Figure 11: Demonstration of the chain rule

Backpropagation propagates signals in the opposite direction. Starting at the output layer L, the error derivative is computed based on all the input connections coming from the previous layer L-1. Stemming from the simple fact that the error of the output layer is the output minus the target, the error can then be "recursively" defined, enabling fast training of the network. In practice, the error is usually defined as:

E_{total} = \sum \frac{1}{2} (target - output)^2

As you can see in figure 12, the error derivative with respect to the unactivated input z of each layer is used to compute the error of the previous layer's output. Because matrix operations perform many calculations in one step, neural networks are able to compute these error derivatives and update the weight matrices very quickly. Backpropagation was the key to finally being able to train, and therefore utilize, neural networks. Thanks to these matrix operations, backpropagation can be parallelized to further decrease training time, making deep neural networks possible; in fact, the emergence of the entire field of Deep Learning has been enabled by such advances in hardware. Bottom line: without the advent of backpropagation, neural networks would be practically impossible to train efficiently.

Figure 12: Back Propagation Algorithm
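As a concrete illustration of the two passes described above, the following NumPy sketch trains a tiny fully connected network on a toy XOR dataset: the forward pass computes activations layer by layer, and the backward pass applies the chain rule to the squared-error term E = \sum \frac{1}{2}(target - output)^2 and updates the weights with gradient descent. The network size, learning rate, and dataset are illustrative choices, not values prescribed in this survey.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset (XOR), purely illustrative
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)        # targets

# One hidden layer with 4 units, one output unit
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Forward pass
    z1 = X @ W1 + b1
    h1 = sigmoid(z1)
    z2 = h1 @ W2 + b2
    out = sigmoid(z2)

    # Backward pass: chain rule, starting from the output layer
    d_out = (out - T) * out * (1 - out)                # dE/dz2
    d_h1  = (d_out @ W2.T) * h1 * (1 - h1)             # dE/dz1, propagated backwards

    # Gradient-descent weight updates
    W2 -= lr * h1.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h1
    b1 -= lr * d_h1.sum(axis=0)

print("predictions after training:", out.round(3).ravel())
```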
2.4. Constraints of Neural Networks

Although neural networks have proven to be very efficient in many applications, research in cognitive neuroscience has revealed many important differences between brains and computers. Here, we list some of the major differences:

• First, brains are analogue while computers are digital. Brains transmit information at a rate that is essentially a continuous variable. Therefore, it is believed that building a model that is truly identical to the brain requires scientists either to build analogue computers (changing the whole computation model we know) or to creatively develop a scheme for mapping continuous brain signals onto existing binary computing capabilities.
• Second, brains retrieve information by content while computers retrieve it by address. For example, thinking of the word apple automatically stimulates you to think of other related fruits. In a computer, the word apple either is stored at an address with a specific value or it is not. However, similar paradigms can be implemented in computers, mostly by building massive indices of stored data (as Google does).
• Third, while artificial neural networks are not capable of storing information in memory, processing and memory are performed by the same components in brains. Inspired by neuron memory, a deep learning model called Long Short Term Memory (LSTM) has been developed that addresses this inability by introducing a technique to store information for a longer time in artificial neurons (see section 3.3 below).

Although the idea of artificial neural networks dates back to the 1950s, their applications are now brought back to the table with the availability of large computational and storage power. Computer scientists are continuously improving the models of neural networks to address different insufficiencies. The evolving architectures are now called Deep Learning models, which are the focus of the next section.

3. Deep Learning Models

When exploring differing deep learning models, it is easy to tune into the "buzzwords" that are frequently repeated and lose sight of what the actual objective of the learning procedure is. It is easiest to divide the models into the following two categories:

1) Discriminant Architectures: these models characterize patterns based on posterior distributions of classes. This can be assimilated to techniques such as classification/regression. The paradigm of discriminant models is that for an input, they produce an output. Discriminant models can be viewed as bottom-up networks: inputs are given and they propagate up through the network to produce outputs. This is the main difference from their generative counterparts, which have no outputs. These models can be viewed as Supervised Deep Learning. Examples of these models include Deep Neural Networks (neural networks with more than 2 layers), Convolutional Neural Networks (section 3.1), Recurrent Neural Networks (section 3.2), and Long Short Term Memory (section 3.3).

2) Generative Architectures: these models are employed to discover high-order correlations in a given input. In these models, there are no classes or values to predict for the input data as seen in classification/regression techniques. The goal is to extract meaningful relationships between features in an effort to learn high-order features. Generative models can learn a distribution from training data and then produce samples from it. The bottom layer of these networks generates a vector x, and the goal is to train the model to give high probability to the training data. The reason these models are called Generative is that they start from the top layer and aim to generate the inputs by propagating downwards through the network. The main domain of these architectures is therefore Unsupervised Feature Learning. Examples of these models include Restricted Boltzmann Machines (section 3.4), Deep Boltzmann Machines (section 3.6), Deep Belief Networks (section 3.5), and Auto-encoders (section 3.7).

Figure 13: Common network architectures

In each of the following subsections, we illustrate a deep learning model and its purpose. In general, discriminant architectures are trained with backpropagation whereas generative architectures are trained with a modified free-energy method; training procedures tend to vary on the generative side. Figure 13 illustrates some of the general schemes of network architectures.

3.1. Convolutional Neural Network (CNN)

3.1.1. Purpose. A CNN is primarily used for processing two-dimensional data. Therefore, it is a prime candidate for data such as images and videos.
In the area of image processing, a CNN (also called a ConvNet) is able to extract high-order features from an image (such as horizontal edges, vertical edges, or color contrasts), which can lead to an impressive understanding of the content. Convolutional networks have proven to be very efficient for learning representations of data.

3.1.2. Architecture. For simplicity, we start by describing the model on one-dimensional data and then move forward to see how the model expresses its effectiveness on two-dimensional data. To classify a sample x_1, x_2, x_3, ..., x_n using a basic neural network, we connect all the inputs to a fully connected layer, where each input sample connects to each neuron in the hidden layer, as in figure 14.

Figure 14: Feeding input samples into a fully connected layer (denoted by F) in a basic neural network

The architecture of CNNs follows a more sophisticated approach that notices a symmetry in the features it is looking for in the data.
Therefore, we can create a group of neurons before the hidden layer that takes a segment of the data, as in figure 15. This added layer is called a Convolutional Layer. The output from the convolutional layer is fed into the fully connected layer we previously added. Convolutional layer output can also be fed into other convolutional layers, hence creating layers of convolutions. The idea of a convolutional layer is to learn the appropriate feature filters as opposed to hand-engineering them.

Figure 15: Adding a convolutional layer. Each A contains a group of neurons that are fully connected to a segment of the inputs.

To get a higher-level representation of the data, a Pooling Layer is added after the convolutional layers. A pooling layer not only produces more abstract representations of the data, but also reduces the number of parameters that will be fed to the fully connected layer. For example, a max-pooling layer takes the maximum of the features over small blocks of the previous convolutional layer. Output from a pooling layer can also be fed into the input of another convolutional layer, as in figure 16.

Figure 16: Adding a max-pooling layer. The output is fed into another convolutional layer B.

The same concepts apply to two-dimensional inputs such as images or videos. We can think of figure 17, from bottom to top, as zooming out from the very specific details of the data representation toward the more general representation. A convolutional layer has A groups of neurons; each group feeds on only a part of the two-dimensional input (e.g., a 5x5 pixel frame). As an example of face detection, a first convolutional layer learns representations of edges. After a first pooling layer, a second convolutional layer learns more general representations of face parts such as the eye or the nose. After a second pooling layer, a third convolutional layer learns the most general representation needed to detect a human face. The output is then passed to a fully connected layer to produce the final classifications.

Figure 17: A full convolutional neural network with two-dimensional input. (a) 2-D input. (b) A full 2-D input to a convolutional network.

In summary, a CNN is divided into two stages. The first is the convolution layer: at this layer, a filter is applied to each input, where the filter is a function representing a certain transformation of the input data. The second stage is the pooling layer: this process consists of summarizing neighborhoods in the output of the convolutional layer. These two alternating stages can be applied for as many layers as needed, each having a different filter, and a final fully connected layer is responsible for the classifications. This ensures that the model is able to detect high-order similarities within the data irrespective of orientation/rotation.

3.1.3. References. Refer to the following paper [3] and blog post [4] for a detailed illustration and studies of CNNs.
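The alternating convolution/pooling pattern described above, followed by a fully connected classifier, can be written down in a few lines. The sketch below uses the Keras API as one possible realization (this survey does not prescribe a framework); the layer sizes and the 32x32 RGB input shape are illustrative assumptions.

```python
from tensorflow.keras import layers, models

# Conv -> pool -> conv -> pool -> fully connected, as in figure 17.
model = models.Sequential([
    layers.Conv2D(8, (5, 5), activation='relu',
                  input_shape=(32, 32, 3)),           # assumed 32x32 RGB input
    layers.MaxPooling2D((2, 2)),                      # pooling: summarize 2x2 neighborhoods
    layers.Conv2D(16, (5, 5), activation='relu'),     # higher-level feature filters
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),           # final fully connected classifier
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```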
3.2. Recurrent Neural Network (RNN)

3.2.1. Purpose. An RNN is primarily used for processing data that comes in the form of a sequence. Therefore, it is a prime candidate for speech recognition, language modeling and translation. One limitation of ConvNets is that they accept a fixed-size vector as input and produce a fixed-size vector as output, performing this mapping using a fixed amount of computation determined by the number of layers in the model and the number of units in each layer. The core difference in RNNs is that they operate over sequences of vectors in the input as well as the output.

3.2.2. Architecture. Traditional neural networks (and ConvNets) are memoryless. If a traditional neural network were used to classify what the weather is like from forecast readings, it is unclear how the model would do that: it operates on a fixed-size input and a fixed-size output, performing the computation using a pre-specified number of hidden layers and units.
Recurrent neural networks address this issue by introducing memory into the network in the form of a loop, as in figure 18. You can think of an RNN as a stack of separate neural networks with some parameters of each network fed from the previous network; these parameters play the role of a memory.

Figure 18: Recurrent neural network basic component. Left: a chunk of neural network A receives some input x and outputs a value h. Right: an unrolled RNN.

Inside each repeating module of the recurrent neural network, the input x at time-step t is concatenated with the output h at time-step t-1, and together they are passed through an activation function to produce the output h at the current time-step t. Figure 19 shows an unrolled illustration of this behavior, where the yellow box represents a single neural network layer with a tanh activation function (other activation functions can be used as well).

Figure 19: The repeating module in an RNN with tanh used as the activation function in the neural network.

Although RNNs are simple in the way that they accept an input vector x and produce an output vector y, their effectiveness comes from the fact that the output vector's content is influenced not only by the input x, but also by the entire history of inputs that have been fed to the network in the past. The RNN has some internal state that gets updated every time an input is fed into the network. In the simplest case, this state is represented as a single hidden vector h.

What happens when there is a long-term dependency? For example, a word in an essay may derive its meaning from a word in the previous paragraph. Unfortunately, as the gap grows, RNNs become unable to learn to connect dependencies in the sequence. Figure 20 shows the long-term dependency problem in RNNs. Therefore, Long Short Term Memory (LSTM) models have been proposed to overcome this problem; LSTMs are the subject of the next section.

Figure 20: The output h at time t+1 depends on the input x at times 0 and 1.

3.2.3. References. Refer to the following paper [5] and blog post [6] for a detailed illustration and studies of RNNs.
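The repeating module just described (concatenate the previous output h with the current input x and pass the result through a tanh layer) can be transcribed directly into NumPy. The dimensions, random weights, and toy sequence below are illustrative assumptions; no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 3, 5
# One weight matrix applied to the concatenation [h_{t-1}, x_t], plus a bias.
W = rng.normal(0, 0.1, (hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """h_t = tanh(W . [h_{t-1}, x_t] + b) -- the repeating module of figure 19."""
    concat = np.concatenate([h_prev, x_t])
    return np.tanh(W @ concat + b)

# Unroll the network over a toy sequence of 4 input vectors.
sequence = [rng.normal(size=input_size) for _ in range(4)]
h = np.zeros(hidden_size)            # initial state
for x_t in sequence:
    h = rnn_step(h, x_t)             # the state carries the history forward
print(h)
```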
3.3. Long Short Term Memory (LSTM)

3.3.1. Purpose. Long Short Term Memory networks are an improvement to recurrent neural networks that solves the problem of long-term dependency. Real-world implementations mostly depend on LSTM models rather than basic RNNs.

3.3.2. Architecture. Like RNNs, LSTMs have a chain-like structure (when unrolled). However, instead of a single neural network layer in the repeating module as in figure 19, LSTMs have four neural network layers interacting in a special harmony, as in figure 21.

Figure 21: Four interacting neural network layers inside the repeating module of an LSTM. Each line carries an entire vector from one node to another. The yellow boxes are neural networks with the indicated activation function. The pink circles represent point-wise operations like vector addition. Lines merging denote concatenation; lines forking denote content being copied to different locations.

The core idea behind LSTMs is the horizontal line passing through the top of the module. The line represents a cell state that carries information along from one cycle to the next. Addition and multiplication gates control the information being stored (or not) in the cell state vector. Each of the four neural network layers is responsible for a specific piece of functionality carried out in the cell. The operation of the cell occurs in three steps, as follows:

• First: the first neural network layer from the left (also called the forget gate layer) decides what information is going to be thrown away from the cell state vector. This sigmoid layer looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state vector that passes through the top line. A 1 represents a "completely keep this" decision and a 0 represents a "completely remove this" decision. The output of this first layer is:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

• Second: the next two layers decide what new information we are going to store in the cell state vector. The sigmoid layer (also called the input gate layer) decides which values will be updated:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

and the tanh layer creates a vector of new candidate values that could be added to the state:

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

The new cell state vector C_t is then computed as:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t

• Third: the last layer in the cell computes the actual output h_t. The output value is influenced by the last sigmoid layer as well as the new cell state vector that was just computed:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

h_t = o_t * \tanh(C_t)

Although what is described so far is the standard LSTM, almost every paper involving LSTMs uses a slightly different architecture. A common variation is to let the gate functions f_t, i_t and o_t also look at the cell state vector, a technique known as peephole connections. Other variations exist depending on the training task. Yet all variations depend on the idea of a cell state vector that can carry information for a long time, allowing long-term dependencies to be taken into consideration for prediction.

3.3.3. References. Refer to the following papers [7], [8] and the generous blog post [9] for a detailed illustration and studies of LSTMs.
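A direct NumPy transcription of the cell equations above is given below. The weight shapes and the toy sequence are illustrative assumptions, and no training is performed; the sketch only runs the three steps of the cell forward in time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 3, 4
concat = hidden_size + input_size

# One weight matrix and bias per gate/layer, acting on [h_{t-1}, x_t].
Wf, bf = rng.normal(0, 0.1, (hidden_size, concat)), np.zeros(hidden_size)
Wi, bi = rng.normal(0, 0.1, (hidden_size, concat)), np.zeros(hidden_size)
Wc, bc = rng.normal(0, 0.1, (hidden_size, concat)), np.zeros(hidden_size)
Wo, bo = rng.normal(0, 0.1, (hidden_size, concat)), np.zeros(hidden_size)

def lstm_step(h_prev, c_prev, x_t):
    """One LSTM cell update, following the three steps of section 3.3.2."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)              # forget gate: what to drop from the cell state
    i_t = sigmoid(Wi @ z + bi)              # input gate: which values to update
    c_tilde = np.tanh(Wc @ z + bc)          # candidate values for the cell state
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state
    o_t = sigmoid(Wo @ z + bo)              # output gate
    h_t = o_t * np.tanh(c_t)                # new output
    return h_t, c_t

# Run the cell over a toy sequence; the cell state carries long-term information.
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in [rng.normal(size=input_size) for _ in range(5)]:
    h, c = lstm_step(h, c, x_t)
print(h)
```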
3.4. Restricted Boltzmann Machine (RBM)

3.4.1. Purpose. The first generative architecture we explore is the Restricted Boltzmann Machine. It is not to be confused with the Boltzmann Machine; figure 22 illustrates the difference, and the next subsection explains the subtlety. An RBM is commonly utilized in unsupervised learning tasks such as dimensionality reduction, feature learning, and collaborative filtering.

3.4.2. Architecture. An RBM is composed of two layers, an input layer and a hidden layer. These layers have undirected connections between them. The restriction placed on an RBM is that no two nodes in the same layer can have a connection. This is the differentiator between a Boltzmann Machine and a Restricted Boltzmann Machine: the former has existed for many years, but it was not until this slight modification created the latter that the theoretical model became usable. Without the restriction forbidding intra-layer connections, a general Boltzmann Machine is practically untrainable.

Figure 22: The Boltzmann Machine includes intra-layer connections, whereas the RBM is limited to having only inter-layer connections.

We can therefore define an RBM formally as a two-layer neural network with many inter-layer, but no intra-layer, connections. Each connection bears a weight that is trained during the learning procedure. By adjusting these weights, an RBM can fit its parameters (the hidden layer nodes) to represent the distribution of the training data. Once the hidden layer is trained, one can generate samples that fit the distribution of the training data. This technique has been used to compensate for a scarce amount of available data in certain fields.

Figure 23: The architecture of an RBM. The shaded nodes represent the visible input layer and the white nodes represent the hidden layer.
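Because of the no-intra-layer-connection restriction, the hidden units are conditionally independent given the visible units (and vice versa), so both conditionals factorize into simple sigmoids. The NumPy sketch below shows this inference step for a binary RBM; the layer sizes, random weights, and toy input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, (n_visible, n_hidden))   # inter-layer weights only
b_v = np.zeros(n_visible)                       # visible biases
b_h = np.zeros(n_hidden)                        # hidden biases

def sample_hidden(v):
    """p(h_j = 1 | v) = sigmoid(v . W + b_h); sample each hidden unit independently."""
    p_h = sigmoid(v @ W + b_h)
    return p_h, (rng.random(n_hidden) < p_h).astype(float)

def sample_visible(h):
    """p(v_i = 1 | h) = sigmoid(h . W^T + b_v); sample each visible unit independently."""
    p_v = sigmoid(h @ W.T + b_v)
    return p_v, (rng.random(n_visible) < p_v).astype(float)

v = np.array([1., 0., 1., 1., 0., 0.])          # a toy binary input
p_h, h = sample_hidden(v)
p_v, v_reconstructed = sample_visible(h)
print(p_h, p_v)
```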
3.4.3. Training. The training procedure for an RBM differs in a few ways from the methods used for discriminant models. While the final step in the procedure still entails performing stochastic gradient descent to decrease the error, the means by which the error is computed differs. In an RBM, a procedure called Contrastive Divergence is used. In simplest terms, each iteration can be broken down into three phases. First, the hidden layer is created from the input layer based on probabilities that minimize the free energy of the model; this creates a hidden layer with certain activations and is called the Positive Phase. The next phase is the Negative Phase: the input layer is reconstructed based on this hidden layer, and the newly constructed layer is then propagated back to the hidden layer to create a new set of activations. The third phase is the Update Phase, where the hidden layer from the Positive Phase, together with the reconstructed input and the second hidden layer from the Negative Phase, are used to determine the error and update the weights to minimize it.

All in all, this learning procedure requires hands-on experience to master. There are many hyper-parameters, such as the learning rate, momentum, weight-cost, sparsity target, initial values of the weights, number of hidden units, and size of each batch [10]. For each specific application, a specific set of hyper-parameters must be chosen. This is the art of training RBMs: there is no right or wrong way to set them, and only through trial and error can one determine the correct set.

3.4.4. References. Refer to the following papers [11], [12], [13] for a detailed illustration and studies of RBMs.
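One full Contrastive Divergence (CD-1) update, with the positive, negative, and update phases described above, might look like the following NumPy sketch. The learning rate, layer sizes, and binary toy batch are illustrative assumptions, and many of the practical refinements discussed in [10] (momentum, weight decay, careful mini-batching) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def cd1_update(v0):
    global W, b_v, b_h
    # Positive phase: hidden activations driven by the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct the input, then recompute hidden activations.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Update phase: difference between data-driven and reconstruction-driven statistics.
    W   += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return np.mean((v0 - p_v1) ** 2)          # reconstruction error as a rough monitor

batch = rng.integers(0, 2, size=(8, n_visible)).astype(float)   # toy binary batch
for epoch in range(100):
    err = cd1_update(batch)
print("reconstruction error:", err)
```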
3.5. Deep Belief Network (DBN)

3.5.1. Purpose. Deep Belief Networks are utilized for learning a representation of some input data. Their purpose is very similar to that of the RBM, and in practice researchers rarely use RBMs on their own anymore. The DBN can be viewed as the logical next step in the timeline of the development of the RBM: it is the next iteration and improvement of this type of model and has been widely accepted as a replacement for the RBM. Some have argued that since RBMs already have considerable representational power for function approximation, what is the use of DBNs? Further research has concluded that adding an additional layer yields a non-negative information gain over a shallower model. This implies that there is no harm in adding an additional layer and, from our understanding, it enables the model to detect higher-level abstractions in the data.

3.5.2. Architecture. A Deep Belief Network is a stack of feedforward RBMs. The output of layer k, which is the hidden layer of an RBM, is the input of the next layer's RBM. The motivation for this architecture is the idea that an efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially. We believe that adding layers to the DBN, as opposed to adding nodes to the hidden layer of an RBM, allows the model to become more flexible, more expressive, and less dependent on the number of nodes in each hidden layer. This requires less manual feature engineering and allows the neural network to "work its magic." We believe this makes the DBN a preferable model over the RBM.

Figure 24: The architecture of a Deep Belief Network. Each pair of layers represents an RBM; each RBM's hidden layer is fed into the input layer of the next RBM.

3.5.3. Training. The study of training DBNs has filled many research papers and cannot be covered to the extent required within the scope of this paper. More generally, training is performed in a greedy layer-wise fashion, and all of the learning involved is localized. By performing a greedy layer-wise procedure, the network can be trained iteratively and the complexity becomes manageable.
3.4.4. References. Refer to the following papers [11], [12], [13] for a detailed illustration and studies of RBMs.

3.5. Deep Belief Network (DBN)

3.5.1. Purpose. Deep Belief Networks are utilized for learning a representation of some input data. Their purpose is very similar to that of the RBM, and in practice researchers rarely use standalone RBMs anymore. The DBN can be viewed as the logical next step in the development of the RBM: it is the next iteration and improvement of this type of model and has been widely accepted as its replacement. Some have argued that since RBMs already have the representational power to approximate any function, what is the use of DBNs? Further research has concluded that the information gain from adding an additional layer over a shallower model must be positive. This implies that there is no harm in adding an additional layer, and from our understanding, the deeper model is able to detect higher-level abstractions in the data.

3.5.2. Architecture. A Deep Belief Network is a stack of feedforward RBMs: the output of layer k, which is the hidden layer of an RBM, is the input of the next layer's RBM. The motivation for this architecture is the idea that an efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially. We believe that adding layers to a DBN, as opposed to adding nodes to the hidden layer of a single RBM, makes the model more flexible, more expressive, and less dependent on the number of nodes in each hidden layer. This requires less manual feature engineering and allows the neural net to "work its magic." We believe this makes the DBN preferable to the RBM.

Figure 24: This network represents the architecture of a Deep Belief Net. Each pair of layers represents an RBM. As explained, each RBM's hidden layer is fed into the input layer of the next RBM.

3.5.3. Training. The study of training DBNs has filled many research papers and cannot be explained to the extent required within the scope of this paper. More generally, training is performed in a greedy layer-wise fashion: all of the learning involved is localized, so the network can be trained iteratively and the complexity stays manageable. This layer-by-layer unsupervised learning algorithm consists of learning a stack of RBMs, one RBM at a time, and is illustrated in Figure 25. The first step consists of training the first layer as an RBM that models the raw input. The hidden layer of this first RBM is then used as the input layer for the second RBM; this input is generally obtained either by taking the mean activations or by sampling. The process is repeated for as many layers as desired, each time propagating forward the hidden layer of the previously trained RBM. The parameters (weights) of the resulting deep architecture are then fine-tuned with respect to the log-likelihood. In supervised training scenarios, a target output can be substituted for the log-likelihood as the error term.

Figure 25: This figure represents the layer-wise training procedure of a Deep Belief Network. Each RBM is trained, stacked, and its hidden layer is fed to the input layer of the next RBM.
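The sketch below illustrates this greedy layer-wise procedure, reusing the illustrative `sigmoid` and `cd1_update` helpers from the RBM sketch above; the layer sizes, epoch counts, and the choice of feeding mean activations to the next layer are assumptions made for the example, not a faithful reproduction of [14].

```python
import numpy as np

def train_rbm(data, n_hidden, n_epochs=10, batch_size=100, rng=np.random):
    """Train a single RBM on `data` with CD-1 and return its parameters."""
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b_visible = np.zeros(n_visible)
    b_hidden = np.zeros(n_hidden)
    for _ in range(n_epochs):
        for start in range(0, data.shape[0], batch_size):
            batch = data[start:start + batch_size]
            cd1_update(batch, W, b_visible, b_hidden)  # defined in the RBM sketch
    return W, b_hidden

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM models the hidden layer below it."""
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b_hidden = train_rbm(layer_input, n_hidden)
        layers.append((W, b_hidden))
        # Feed the mean activations of this RBM's hidden layer to the next RBM.
        layer_input = sigmoid(layer_input @ W + b_hidden)
    return layers

# Example: a 784-500-250-30 stack pre-trained on flattened 28x28 images.
# dbn_layers = pretrain_dbn(train_images, layer_sizes=[500, 250, 30])
```

After this unsupervised stage, the stacked weights would typically initialize a deep network that is fine-tuned with respect to the log-likelihood or, in the supervised case, a target output.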
3.5.4. References. Refer to the following paper [14] for a detailed illustration and studies of DBNs.

3.6. Deep Boltzmann Machine (DBM)

3.6.1. Purpose. Deep Boltzmann Machines can be viewed as multi-layer RBMs. In contrast to the RBM, which is limited to one hidden layer, a Deep Boltzmann Machine can have many. This allows each layer's weights to interact with the layers above and below, forming a more complex version of the RBM. DBMs have the potential to learn increasingly complex internal representations of the data, which is needed in fields such as speech recognition and object recognition. In practice, however, a DBM is rarely used and is often substituted with the more promising, and more easily trainable, DBN. We include this explanation of the architecture purely as a reference, so readers can differentiate the terms and understand the difference in architecture between Deep Boltzmann Machines and Deep Belief Networks.

3.6.2. Architecture. Although very similar to the architecture of a DBN, the architecture of a Deep Boltzmann Machine has one striking difference: instead of directed connections between each stacked RBM, a DBM has undirected connections between each layer. This implies that weights are shared throughout the entire model, as opposed to the more layer-wise approach of a DBN. The difference is illustrated in Figure 26: a DBN is a stack of connected RBMs, whereas a Deep Boltzmann Machine is an RBM with multiple hidden layers. This implies a fundamental difference in the training procedure. We will not cover the training procedure for a DBM because it is out of the scope of this survey, but bear in mind that it must account for signals propagating in both directions of the network rather than from the inputs alone. When comparing the two models, a DBN can be viewed as a stack of RBMs, whereas a DBM is a hybrid version of the RBM.

Figure 26: Although each layer is a stacked RBM, the direction of the connections between layers in Deep Belief Networks and Deep Boltzmann Machines differs.

3.6.3. References. Refer to the following paper [15] for a detailed illustration and studies of DBMs.

3.7. Auto-encoders

3.7.1. Purpose. Auto-encoders are neural networks that aim to learn a compressed representation, or encoding, of the input data. The model is considered generative because it is trained to recreate the input data from its hidden layer. Auto-encoders are well suited to dimensionality reduction and have attracted serious interest recently.

3.7.2. Architecture. Auto-encoders have a unique architecture. They are designed to have three layers: the first is the input layer, the third is the output layer, and the hidden layer between them is called the feature layer. This is shown in Figure 27. The input and output layers of an auto-encoder are intended to be identical after training; the middle layer serves as an encoder of the input data. The dimensionality of this middle layer can be greater or smaller than that of the input layer, depending on the application. When the feature layer has a lower dimensionality than the input layer, the model is excellent at performing dimensionality reduction. The real focus of these models is the feature layer created during training: since the input and output layers will be the same, they are of no interest beyond training purposes, while the middle layer represents an encoding of the data. Architectures such as stacked auto-encoders link these feature layers in a stacked fashion to create higher-level abstractions of the data as well. This methodology of stacking neural networks to create a high-level understanding of the data is the key to deep learning: by allowing the network more representations, more correlations can be detected automatically.
This is why the model is called an auto-encoder.

3.7.3. Training. Training can be conceptualized as the network trying to "recreate" the data. The network receives its inputs and feeds them to the feature layer. The first part of the training process is called the encoding phase: the input data from the first layer is encoded into the feature layer through adjustable weights. Each node in the feature layer then propagates a signal forward and, with the assistance of adjustable weights and biases, maps this encoded representation back to its original un-encoded state. This is referred to as the decoding phase. To summarize, data is fed into the input layer, encoded in the feature layer, and then decoded into the output layer. The error is determined by comparing the output value to the input value, as they should be exactly the same.
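As an illustration of the encode/decode round trip described above, here is a minimal single-hidden-layer auto-encoder in NumPy trained with a squared-error reconstruction loss; the layer sizes, the untied encoder and decoder weights, and the plain gradient-descent loop are simplifying assumptions of ours, not a prescription from the surveyed papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyAutoencoder:
    """Input -> feature (encoding) layer -> output, trained to reconstruct the input."""

    def __init__(self, n_input, n_feature, rng=np.random):
        self.W_enc = 0.01 * rng.randn(n_input, n_feature)   # encoding weights
        self.W_dec = 0.01 * rng.randn(n_feature, n_input)   # decoding weights
        self.b_enc = np.zeros(n_feature)
        self.b_dec = np.zeros(n_input)

    def encode(self, x):
        return sigmoid(x @ self.W_enc + self.b_enc)

    def decode(self, h):
        return sigmoid(h @ self.W_dec + self.b_dec)

    def train_step(self, x, learning_rate=0.1):
        # Encoding phase followed by decoding phase.
        h = self.encode(x)
        x_hat = self.decode(h)

        # Backpropagate the reconstruction error: the output should match the input.
        delta_out = (x_hat - x) * x_hat * (1.0 - x_hat)
        delta_hidden = (delta_out @ self.W_dec.T) * h * (1.0 - h)

        n = x.shape[0]
        self.W_dec -= learning_rate * (h.T @ delta_out) / n
        self.b_dec -= learning_rate * delta_out.mean(axis=0)
        self.W_enc -= learning_rate * (x.T @ delta_hidden) / n
        self.b_enc -= learning_rate * delta_hidden.mean(axis=0)
        return np.mean((x_hat - x) ** 2)

# Usage: ae = TinyAutoencoder(n_input=784, n_feature=30)
#        for batch in batches: ae.train_step(batch)
#        codes = ae.encode(data)   # the learned low-dimensional representation
```

After training, only the output of `encode` is kept; it plays the role of the feature layer discussed above.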
Figure 27: The architecture of an auto-encoder. As shown, the first and last layers are the same; the middle layer represents the features (encoding) learned during training.

3.7.4. References. Refer to the following paper [16] for a detailed illustration and studies of Auto-encoders.

4. Choosing a Model

With the growing number of variations of deep learning models, it is important to choose a model that is suitable for the task at hand. Many factors contribute to choosing a model that can effectively represent a solution to that task.
• First, study the dataset at hand.
• Second, decide whether you want to perform classification, prediction, or learn a representation of the data.
• Third, choose a model and try out different variations of it until you reach the desired objective.
In Table 1, we summarize decision factors for the models surveyed in this study. These models cover a wide variety of domains, and other models generally fall under one of these architectures.

5. Applications

At the time of writing this survey of deep learning models, researchers from different labs are applying these approaches to a myriad of real-world applications and achieving state-of-the-art performance. In this section, we shed light on some of the trending projects sponsored by large tech companies.

5.1. Facebook's DeepFace

Uploading a picture with your friends to Facebook automatically suggests tagging your friends in the picture by recognizing their faces. Closing the gap to human-level performance in face verification is the main research focus of Facebook's DeepFace. The face representation is derived from a nine-layer deep neural network that involves more than 120 million parameters, using several locally connected layers without weight sharing. The model was trained on four million facial images belonging to more than 4,000 identities, and the result is the most powerful face recognition module deployed in the largest social network in the world.

5.2. Google's DeepMind

Founded in London in 2010 and acquired by Google in early 2014, DeepMind builds algorithms that are capable of learning for themselves directly from raw experience or data, and that are general in that they can perform well across a wide variety of tasks straight out of the box. Their team consists of many renowned experts in their respective fields, including but not limited to deep neural networks, reinforcement learning, and systems neuroscience-inspired models. One recent remarkable achievement is AlphaGo, the first computer program to ever beat a professional player of Go. It was a tremendous milestone to see a computer brain, powered by deep learning models, beat a human brain.

5.3. Apple's Siri

Siri (Speech Interpretation and Recognition Interface) is Apple's intelligent personal assistant that comes pre-installed on their iPhone devices. Siri's primary technical areas focus on a conversational interface, personal context awareness, and service delegation. At the core of the conversational interface resides a strong speech recognition engine, powered by deep learning models, that learns a user's accent and adapts to it to respond with better results. The power of Siri comes not only from the speech recognition engine, but also from other machine learning models that can carry a full conversation between the user and the device, relying on a set of web services.
Model        | Type           | Purpose                       | Suitable for                                    | Example
CNN          | Discriminative | Classification                | Processing two-dimensional data                 | Images/Videos
RNN          | Discriminative | Prediction                    | Processing sequence data                        | Language models and speech
LSTM         | Discriminative | Prediction                    | Processing long sequence data                   | Language models and speech
RBM          | Generative     | Unsupervised feature learning | Learning distributions of data                  | Generating samples from learned hidden representations
DBN          | Generative     | Unsupervised feature learning | Creating a probabilistic reconstruction of data | Trained layers used as feature detectors
Auto-encoder | Generative     | Dimensionality reduction      | Creating a compact representation of data       | PCA-like tasks

TABLE 1: Summary of the surveyed deep learning models

5.4. Microsoft's Cortana

Analogous to Siri, Cortana is the clever personal assistant developed by Microsoft that helps you find things on your PC, manage your calendar, track packages, find files, chat with you, and tell jokes. Cortana learns user behavior through deep learning models in the sense that the more you use Cortana, the more personalized your experience becomes. Cortana depends heavily on understanding a user's query and takes actions based on that request. A set of deep learning language models enables setting reminders, making calls, sending emails, and answering questions when requested by the user. Cortana is a significant advance in the application of artificial intelligence to human-computer interaction.

6. Current Research Directions

As demonstrated, deep learning is a vast field [3], [17]. Following the theoretical claim that, with enough hidden nodes, a model can be trained to represent any function or distribution, we are seeing a re-emergence of many classical machine learning techniques, especially with the increasing improvements in computational resources.

On the discriminative side of models, there is an aggressive push towards memory-based models. As shown, one successful model is the LSTM, but it is by no means the only one that has been proposed. Memory Networks, Neural Turing Machines, and Hierarchical Temporal Memory are all similar memory-based deep neural networks. The advantage of this direction is that the networks are able to retain state throughout their lifetimes. The goal of these networks is to make tasks such as sequence learning and reinforcement learning representable and trainable; these tasks require memory in order to utilize previously seen inputs and correlations in future models.

In our opinion, reinforcement learning will be the heavy focus of deep learning in the next few years. We are seeing a paradigm shift as data scientists realize that deep neural networks can be used as function approximators in reinforcement learning algorithms, as sketched below. We believe this has a lot of potential and will be pursuing research in this area in the future.
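As a concrete illustration of that shift, the sketch below replaces the classical Q-table with a small neural network that approximates Q(state, action); the two-layer architecture, the ReLU hidden layer, and every hyper-parameter value are illustrative assumptions on our part, not a description of any production system.

```python
import numpy as np

class TinyQNetwork:
    """A small two-layer network approximating Q(state, action) for discrete actions."""

    def __init__(self, n_state, n_action, n_hidden=32, rng=np.random):
        self.W1 = 0.1 * rng.randn(n_state, n_hidden)
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.randn(n_hidden, n_action)
        self.b2 = np.zeros(n_action)

    def forward(self, state):
        h = np.maximum(0.0, state @ self.W1 + self.b1)   # ReLU hidden layer
        return h, h @ self.W2 + self.b2                  # one Q-value per action

    def q_learning_step(self, state, action, reward, next_state, done,
                        gamma=0.99, learning_rate=0.01):
        """Semi-gradient Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
        h, q = self.forward(state)
        _, q_next = self.forward(next_state)
        target = reward if done else reward + gamma * np.max(q_next)

        # Gradient of 0.5 * (Q(s, a) - target)^2 with respect to the parameters.
        dq = np.zeros_like(q)
        dq[action] = q[action] - target
        dh = (dq @ self.W2.T) * (h > 0)

        self.W2 -= learning_rate * np.outer(h, dq)
        self.b2 -= learning_rate * dq
        self.W1 -= learning_rate * np.outer(state, dh)
        self.b1 -= learning_rate * dh
```

Full systems layer experience replay, target networks, and convolutional front-ends on top of this basic update, but the core idea of a network acting as a trainable function approximator is the same.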
On the generative side of models, there has been re-emerging interest in the past few years. Hinton dropped a bomb that ignited the entire field of deep learning with his influential paper on a generative architecture [14]. The focus then shifted to the shinier side of unsupervised learning. With the explosion of unlabeled data pouring from the various sources of big data, the need to improve these unsupervised deep learning models has been growing. The focus of these models varies from serving as a pre-training step whose output is fed forward to a discriminative model, to more "all-in-one" hybrid solutions. There is also some past research on discriminative RBMs and their variations, highlighted in Larochelle and Bengio's paper [18], that we believe will be useful in truly harnessing the representational power of these typically generative models.

7. Conclusion

This concludes our survey of the field of deep learning. To summarize in one statement, we believe deep learning can be viewed as the art of utilizing deep neural network structures to represent any machine learning task. Although some of the theoretical strengths of neural networks have been claimed since the 1950s, recent advances in computer hardware have made these hypotheses verifiable. What we are now seeing is a complete redefinition of the tasks that have been staples of the field of machine learning and of the broader domain of artificial intelligence. We view these recent advancements as the beginning of the era of truly thinking computers. Whereas older machine learning techniques such as SVMs, clustering, PCA, etc. are each based on certain statistical characteristics of the data, neural networks can be viewed as a digital muscle that can be strengthened in a certain manner to represent any of those models. In our own opinion, older ML techniques can be viewed as discrete learning methods, whereas deep learning is more of a continuous learning method. A simple example is comparing a stack of linear regressions layered on top of each other with a deep neural network. Ultimately, a stack of linear regressions is still linear no matter what: the resulting equation may have a completely different slope and bias, but it cannot represent an arbitrary function, so its capabilities are limited. As demonstrated, neural networks are not bound by this linearity. The use of a nonlinear activation function boosts the representational power of these models so much that there are theoretical claims that deep learning architectures can learn to represent any distribution or function [19].
This representational power stems from the differentiation of network structures into discriminative and generative architectures.

To state that the emergence of the field of deep learning has correlated with the rise in performance of computer hardware does not illustrate the dependence strongly enough. If one thing was clear from our research, it is that deep learning techniques are among the most computationally intensive problems that computers have been asked to solve. It is no understatement that deep learning is a field that models its methods on the world's most powerful processor: the human brain. With a foundation strongly rooted in neuroscience, we have no doubt that the models developed by deep learning researchers will aid and push forward their sister field. There is an innate link between the research neuroscientists are performing to understand how the human mind works and the work deep learning experts are undertaking to emulate this process. We believe that only through the further integration of deep learning and neuroscience, seen in models such as Hierarchical Temporal Memory, can true general intelligence be realized. As such computationally intensive software methods are created, hardware will continue to push the boundaries of what is considered possible.

References

[1] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. Also published as a book, Now Publishers, 2009.

[2] Y. Bengio, "Practical recommendations for gradient-based training of deep architectures," Jun. 2012. [Online]. Available: http://arxiv.org/abs/1206.5533

[3] I. Arel, D. C. Rose, and T. P. Karnowski, "Deep machine learning - a new frontier in artificial intelligence research [research frontier]," IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13–18, Nov. 2010.

[4] C. Olah, "Conv nets: A modular perspective."

[5] I. Sutskever, "Training recurrent neural networks," Ph.D. dissertation, University of Toronto, Toronto, Ont., Canada, 2013.

[6] A. Karpathy, "The unreasonable effectiveness of recurrent neural networks."

[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[8] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," CoRR, vol. abs/1503.04069, 2015. [Online]. Available: http://arxiv.org/abs/1503.04069

[9] C. Olah, "Understanding LSTM networks."

[10] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade, 2nd ed. Berlin, Heidelberg: Springer, 2012, pp. 599–619.

[11] G. E. Hinton, "Deterministic Boltzmann learning performs steepest descent in weight-space," Neural Comput., vol. 1, no. 1, pp. 143–150, Mar. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.1.143

[12] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Comput., vol. 20, no. 6, pp. 1631–1649, Jun. 2008. [Online]. Available: http://dx.doi.org/10.1162/neco.2008.04-07-510

[13] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 5, 2009, pp. 448–455.
[14] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, Jul. 2006. [Online]. Available: http://dx.doi.org/10.1162/neco.2006.18.7.1527

[15] R. Salakhutdinov and G. Hinton, "An efficient learning procedure for deep Boltzmann machines," Neural Comput., vol. 24, no. 8, pp. 1967–2006, Aug. 2012.

[16] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14539

[17] J. Schmidhuber, "Deep learning in neural networks: An overview," Apr. 2014. [Online]. Available: http://arxiv.org/abs/1404.7828

[18] H. Larochelle and Y. Bengio, "Classification using discriminative restricted Boltzmann machines," in Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), W. W. Cohen, A. McCallum, and S. T. Roweis, Eds. ACM, 2008, pp. 536–543.

[19] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Jun. 2012. [Online]. Available: http://arxiv.org/abs/1206.5538

[20] W. W. Cohen, A. McCallum, and S. T. Roweis, Eds., Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08). ACM, 2008.

[21] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.03099

[22] R. Sun, "Introduction to sequence learning," in Sequence Learning - Paradigms, Algorithms, and Applications. London, UK: Springer-Verlag, 2001, pp. 1–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=647073.713884

[23] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," ArXiv e-prints, Jun. 2012.

[24] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller, "The manifold tangent classifier," in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 2294–2302. [Online]. Available: http://papers.nips.cc/paper/4409-the-manifold-tangent-classifier.pdf

[25] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580

[26] Y. Bengio and S. Bengio, "Modeling high-dimensional discrete data with multi-layer neural networks," in Advances in Neural Information Processing Systems 12, 2000, pp. 400–406.

[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.