Brief discussion about Machine Learning, Artificial Intelligence and Deep Learning
Gradient Descent Algorithm
Feed Forward Neural Network
Back propagation Algorithm
Neural Network
Machine learning (ML) is the study of computer algorithms that improve
automatically through experience.
It is seen as a subset of artificial intelligence.
Machine learning algorithms build a model based on sample data, known as
"training data", in order to make predictions or decisions.
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike
the natural intelligence displayed by humans and animals.
AI research is often defined as the study of "intelligent agents": any device that perceives its environment and
takes actions that maximize its chance of successfully achieving its goals.
Colloquially, the term "artificial intelligence" is often used to describe machines
(or computers) that mimic "cognitive" functions that humans associate with
the human mind, such as "learning" and "problem solving".
Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation
learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief
networks, recurrent neural networks and convolutional neural networks have been
applied to fields including computer vision, machine vision, speech
recognition, natural language processing, audio recognition, social network
filtering, machine translation, bioinformatics, drug design, medical image analysis,
material inspection and board game programs, where they have produced results
comparable to and in some cases surpassing human expert performance.
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of
a differentiable function.
Gradient descent is an optimization technique used to improve deep learning and neural
network-based models by minimizing the cost function.
To find a local minimum of a function using gradient descent, we take steps proportional to
the negative of the gradient (or approximate gradient) of the function at the current point.
If we instead take steps proportional to the positive of the gradient, we approach a local
maximum of that function; the procedure is then known as gradient ascent.
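The stepping rule can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer: the function name `optimize`, the two example functions, the learning rate of 0.1 and the fixed step count are all assumptions made for the example.

```python
def optimize(df, x0, learning_rate=0.1, steps=100, ascent=False):
    """Repeatedly step against (descent) or along (ascent) the derivative df."""
    sign = 1.0 if ascent else -1.0
    x = x0
    for _ in range(steps):
        # Step size is proportional to the derivative at the current point.
        x = x + sign * learning_rate * df(x)
    return x

# f(x) = (x - 3)^2 has a minimum at x = 3; its derivative is 2(x - 3).
x_min = optimize(lambda x: 2.0 * (x - 3.0), x0=0.0)

# g(x) = -(x + 1)^2 has a maximum at x = -1; its derivative is -2(x + 1).
x_max = optimize(lambda x: -2.0 * (x + 1.0), x0=4.0, ascent=True)

print(x_min, x_max)
```

With these settings both runs end very close to the known optimum; a learning rate that is too large for the function would make the iterates diverge instead.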
Gradient descent is generally attributed to Cauchy, who first suggested it in 1847, but its
convergence properties for non-linear optimization problems were first studied by Haskell
Curry in 1944.
Local minima, global minima, local maxima, global maxima
An analogy for understanding gradient descent
The basic intuition behind gradient descent can be illustrated by a
hypothetical scenario.
A person is stuck in the mountains and is trying to get down (i.e.
trying to find the global minimum). There is heavy fog such that
visibility is extremely low.
Therefore, the path down the mountain is not visible, so they must
use local information to find the minimum.
They can use the method of gradient descent, which involves looking
at the steepness of the hill at their current position, then proceeding
in the direction with the steepest descent (i.e. downhill).
In this analogy, the person represents the algorithm, and the path taken down the mountain
represents the sequence of parameter settings that the algorithm will explore.
The steepness of the hill represents the slope of the error surface at that point. The instrument
used to measure steepness is differentiation.
The direction they choose to travel in aligns with the negative gradient of the error surface at that point.
The amount of time they travel before taking another measurement is the learning rate of the algorithm.
An analogy could be drawn in the form of
a steep mountain whose base touches the sea.
We assume a person’s goal is to reach
down to sea level. Ideally, the person
would have to take one step at a time to
reach the goal.
Each step has a gradient in the negative
direction (note: the steps can be of
different magnitude).
The person continues hiking down until they
reach the bottom or a threshold point,
where there is no room to go further down.
Illustration of gradient descent: consider a nonlinear system of equations (not reproduced here).
[Figure: the first 80 iterations of gradient descent applied to this example; arrows show the direction of descent. Due to a small and constant step size, the convergence is slow.]
Gradient descent can be used to solve a system of linear equations.
Gradient descent can also be used to solve a system of nonlinear equations.
Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones.
Gradient descent can be combined with a line search.
Methods based on Newton's method and inversion of the Hessian using conjugate
gradient techniques can be better alternatives.
Gradient descent can be viewed as applying Euler's method for solving ordinary differential
equations to a gradient flow.
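As a hedged sketch of the first point in this list, a linear system Ax = b can be solved by minimizing the squared residual F(x) = ||Ax - b||², whose gradient is 2 Aᵀ(Ax - b). The least-squares formulation, the helper names, the 2×2 example system and the step size are assumptions chosen for illustration; the step size has to be small enough for the particular matrix.

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def solve_linear_gd(A, b, learning_rate=0.05, steps=500):
    """Minimize ||Ax - b||^2 by gradient descent, starting from x = 0."""
    x = [0.0] * len(A[0])
    At = transpose(A)
    for _ in range(steps):
        residual = [r - bi for r, bi in zip(matvec(A, x), b)]
        grad = [2.0 * g for g in matvec(At, residual)]  # 2 A^T (Ax - b)
        x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    return x

# Example system: 2x + y = 3 and x + 3y = 5, with exact solution (0.8, 1.4).
solution = solve_linear_gd([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(solution)
```

A constant step size keeps the sketch short; combining the same loop with a line search would choose the step adaptively at each iteration.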
A feedforward neural network is an artificial neural network wherein connections between
the nodes do not form a cycle. As such, it is different from its descendant, the recurrent neural network.
The feedforward neural network was the first and simplest type of artificial neural network.
In this network, the information moves in only one direction—forward—from the input nodes,
through the hidden nodes (if any) and to the output nodes.
Deep feedforward networks, also often called feedforward neural networks, or multilayer
perceptrons (MLPs), are the quintessential deep learning models.
The goal of a feedforward network is to approximate some function f*.
These models are called feedforward because information flows through the function being
evaluated from x, through the intermediate computations used to define f, and finally to the
output y.
There are no feedback connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are
called recurrent neural networks.
The inspiration behind neural networks is the brain, so let's look at the biological aspect of neural networks.
Fig 1 shows two images. The left image shows how a multilayer neural network
identifies different objects by learning different characteristics of the object at each layer: for example,
edges are detected at the first hidden layer, and corners and contours at the second hidden layer.
Similarly, the brain has different regions for the same purpose: as we can see, the region
denoted by V1 identifies edges, corners, etc.
The simplest kind of neural network is a single-layer perceptron network, which consists of a
single layer of output nodes; the inputs are fed directly to the outputs via a series of weights.
The sum of the products of the weights and the inputs is calculated in each node, and
if the value is above some threshold (typically 0) the neuron fires and takes the activated value
(typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this kind
of activation function are also called artificial neurons or linear threshold units.
A perceptron can be created using any values for the activated and deactivated states as long
as the threshold value lies between the two.
Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It
calculates the errors between calculated output and sample output data, and uses this to create
an adjustment to the weights, thus implementing a form of gradient descent.
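The training procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the logical AND dataset, the learning rate, the epoch count and the function names are all hypothetical choices, and the update shown is the error-times-input form of the rule for a threshold unit with activated value 1 and deactivated value -1.

```python
def predict(weights, bias, inputs):
    """Linear threshold unit: fire (1) if the weighted sum exceeds 0, else -1."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else -1

def train_perceptron(samples, learning_rate=0.1, epochs=50):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Error between the sample output and the calculated output.
            error = target - predict(weights, bias, inputs)
            # Adjust each weight in proportion to the error and its input.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Logical AND is linearly separable, so a single-layer perceptron can learn it.
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
print(predictions)
```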
Single-layer perceptrons are only capable of learning linearly separable patterns.
In 1969 in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour
Papert showed that it was impossible for a single-layer perceptron network to learn an XOR
function (nonetheless, it was known that multi-layer perceptrons are capable of producing any
possible boolean function).
A single-layer neural network can compute a continuous output instead of a step function. A
common choice is the so-called logistic function. With this choice, the single-layer network is identical to
the logistic regression model, widely used in statistical modeling.
If the activation function of a single-layer neural network is modulo 1, then the network can solve the XOR
problem with exactly one neuron.
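One way to see the modulo-1 claim is the small sketch below. The weights of 0.5 and the convention that any non-zero output counts as "true" are assumptions picked for this example; the slides do not specify them.

```python
def modulo_neuron(inputs, weights=(0.5, 0.5)):
    """Single neuron whose activation function is x mod 1."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return total % 1.0

# The weighted sum is 0, 0.5, 0.5 or 1; taking it modulo 1 maps 1 back to 0,
# so the output is non-zero exactly when the two inputs differ.
xor_outputs = {pair: modulo_neuron(pair) > 0
               for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(xor_outputs)
```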
Multi-layer networks consist of multiple layers of computational units, usually interconnected
in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the
subsequent layer.
In many applications the units of these networks apply a sigmoid function as an activation function.
The universal approximation theorem for neural networks states that every continuous
function that maps intervals of real numbers to some output interval of real numbers can be
approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This
result holds for a wide range of activation functions, e.g. for the sigmoidal functions.
Multi-layer networks use a variety of learning techniques, the most popular being back-propagation.
More generally, any directed acyclic graph may be used for a feedforward network, with some
nodes (with no parents) designated as inputs, and some nodes (with no children) designated as
outputs. These can be viewed as multilayer networks where some edges skip layers, either
counting layers backwards from the outputs or forwards from the inputs.
Various activation functions can be used, and there can be relations between weights, as
in convolutional neural networks.
Examples of other feedforward networks include radial basis function networks, which use a
different activation function.
Sometimes multi-layer perceptron is used loosely to refer to any feedforward neural network,
while in other cases it is restricted to specific ones (e.g., with specific activation functions, or
with fully connected layers, or trained by the perceptron algorithm).
A two-layer neural network capable of calculating
XOR. The numbers within the neurons represent
each neuron's explicit threshold (which can be
factored out so that all neurons have the same
threshold, usually 1). The numbers that annotate
arrows represent the weight of the inputs. This net
assumes that if the threshold is not reached, zero
(not -1) is output. Note that the bottom layer of
inputs is not always considered a real neural
network layer
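A network of this kind can be sketched with explicit thresholds as shown below. The particular weights and thresholds are assumptions chosen so that the example works; they are not necessarily the values annotated in the figure. As in the caption, a neuron outputs zero when its threshold is not reached.

```python
def step(total, threshold):
    """Threshold unit that outputs 1 when the threshold is reached, else 0."""
    return 1 if total >= threshold else 0

def xor_network(x1, x2):
    h_or = step(x1 + x2, 1)    # hidden neuron: fires if at least one input is 1
    h_and = step(x1 + x2, 2)   # hidden neuron: fires only if both inputs are 1
    # Output neuron: weight +1 from h_or, weight -2 from h_and, threshold 1.
    return step(h_or - 2 * h_and, 1)

truth_table = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(truth_table)
```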
The backpropagation algorithm is probably the
most fundamental building block in a neural
network. It was first introduced in the 1960s and
popularized more than two decades later (1986) by
Rumelhart, Hinton and Williams in a paper
called “Learning representations by
back-propagating errors”.
The algorithm is used to effectively train a
neural network using the chain
rule. In simple terms, after each forward pass
through a network, backpropagation performs
a backward pass while adjusting the model’s
parameters (weights and biases).
The output values are compared with the correct answer to compute the value
of some predefined error-function.
By various techniques, the error is then fed back through the network.
Using this information, the algorithm adjusts the weights of each connection in
order to reduce the value of the error function by some small amount.
After repeating this process for a sufficiently large number of training cycles,
the network will usually converge to some state where the error of the
calculations is small
To adjust weights properly, we can apply a general method for non-
linear optimization that is called gradient descent.
For this, the network calculates the derivative of the error function with respect to the network
weights, and changes the weights such that the error decreases (thus going downhill on the
surface of the error function).
For this reason, back-propagation can only be applied on networks with differentiable
activation functions
Why do we need backpropagation?
The most prominent advantages of backpropagation are:
Backpropagation is fast, simple and easy to program.
It has no parameters to tune apart from the number of inputs.
It is a flexible method, as it does not require prior knowledge about the network.
It is a standard method that generally works well.
It does not need any special mention of the features of the function to be learned.
Define the neural network model:
The 4-layer neural network consists of 4 neurons for the input layer, 4 neurons for the hidden
layers (2 in each of the two hidden layers) and 1 neuron for the output layer.
The neurons, colored in purple, represent the input data.
These can be as simple as scalars or more complex like
vectors or multidimensional matrices.
The first set of activations (a) are equal to the input
values. NB: “activation” is the neuron’s value after applying
an activation function.
The final values at the hidden
neurons, colored in green, are
computed using z^l — weighted
inputs in layer l, and a^l—
activations in layer l.
For layers 2 and 3 the equations are:
z² = W¹ x + b¹,  a² = f(z²)
z³ = W² a² + b²,  a³ = f(z³)
W¹ and W² are the weights feeding layers 2 and 3, while b¹ and b² are the corresponding biases.
Activations a² and a³ are computed using an activation function f. Typically, this function f is
non-linear (e.g. sigmoid, ReLU, tanh) and allows the network to learn complex patterns in data.
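The computation of weighted inputs and activations can be sketched as below, assuming each layer computes z = W·a_prev + b followed by a = f(z) with a sigmoid f. The variable names `W1`, `b1` (parameters feeding layer 2) follow the W¹, x, b¹ notation used for z² in these slides; all numeric values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev, f=sigmoid):
    """Return (z, a) for one layer: z = W a_prev + b and a = f(z)."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + bias
         for row, bias in zip(W, b)]
    return z, [f(v) for v in z]

# Hypothetical input with 4 values and two hidden layers of 2 neurons each.
x = [1.0, 0.5, -0.5, 2.0]
W1 = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]  # shape (2, 4)
b1 = [0.0, 0.1]
W2 = [[0.5, -0.5], [0.25, 0.75]]                        # shape (2, 2)
b2 = [0.1, -0.1]

z2, a2 = layer_forward(W1, b1, x)    # first hidden layer
z3, a3 = layer_forward(W2, b2, a2)   # second hidden layer
print(z2, a2, z3, a3)
```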
All parameter values are combined into matrices, grouped by layer.
Let’s pick layer 2 and its parameters as an example. The same operations can be applied to any
layer in the network.
W¹ is a weight matrix of shape (n, m) where n is the number of output neurons (neurons in the
next layer) and m is the number of input neurons (neurons in the previous layer). For us, n =
2 and m = 4.
NB: The first number in any
weight’s subscript matches the
index of the neuron in the next
layer (in our case this is
the Hidden_1 layer) and the second
number matches the index of the
neuron in previous layer (in our
case this is the Input layer).
x is the input vector of
shape (m, 1) where m is the
number of input neurons. For
us, m = 4.
b¹ is a bias vector of shape (n , 1) where n is
the number of neurons in the current layer.
For us, n = 2.
Following the equation for z², we can use the above definitions of W¹, x and b¹ to derive
the “Equation for z²”:
(z_j)² = (w_j1)¹ x_1 + (w_j2)¹ x_2 + (w_j3)¹ x_3 + (w_j4)¹ x_4 + (b_j)¹,  for j = 1, 2.
Now carefully observe the neural network illustration from above.
The final part of a neural
network is the output layer
which produces the predicted
value. In our simple example, it
is presented as a single neuron,
colored in blue and evaluated
as follows:
Again, we are using the matrix representation to simplify the equation. One can use the above
techniques to understand the underlying logic.
Forward propagation and evaluation
The equations above form the network’s forward propagation. The slide gives a short overview:
The final step in a forward pass is to evaluate the predicted output s against an expected
output y.
The output y is part of the training dataset (x, y) where x is the input (as we saw in the previous section).
Evaluation between s and y happens through a cost function. This can be as simple
as MSE (mean squared error) or more complex like cross-entropy
We name this cost function C and denote it as follows:
C = cost(s, y)
where cost can be equal to MSE, cross-entropy or any other cost function.
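Two common choices for the cost can be sketched as follows. The function names, the list-based inputs and the clipping constant `eps` are assumptions for this example; the cross-entropy shown is the binary form, which expects predictions between 0 and 1.

```python
import math

def mse(s, y):
    """Mean squared error between predictions s and expected outputs y."""
    return sum((si - yi) ** 2 for si, yi in zip(s, y)) / len(s)

def binary_cross_entropy(s, y, eps=1e-12):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for si, yi in zip(s, y):
        si = min(max(si, eps), 1.0 - eps)
        total -= yi * math.log(si) + (1.0 - yi) * math.log(1.0 - si)
    return total / len(s)

C_mse = mse([0.8], [1.0])                   # (0.8 - 1.0)^2, about 0.04
C_bce = binary_cross_entropy([0.5], [1.0])  # -log(0.5), about 0.693
print(C_mse, C_bce)
```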
Based on C’s value, the model “knows” how much to adjust its parameters in order to get closer
to the expected output y. This happens using the backpropagation algorithm.
Backpropagation and computing gradients
According to the 1986 paper, backpropagation:
“repeatedly adjusts the weights of the connections in the network so as to minimize a measure
of the difference between the actual output vector of the net and the desired output vector.”
“the ability to create useful new features distinguishes back-propagation from earlier,
simpler methods…”
In other words, backpropagation aims to minimize the cost function by adjusting the network’s
weights and biases. The level of adjustment is determined by the gradients of the cost function
with respect to those parameters.
One question may arise: why compute gradients?
To answer this, we first need to revisit some calculus terminology:
The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C at x.
The derivative of a function C measures the sensitivity to change of the function value (output
value) with respect to a change in its argument x (input value). In other words, the derivative
tells us in which direction, and how quickly, C is changing.
The gradient shows how much the parameter x needs to change (in positive or negative
direction) to minimize C.
These gradients are computed using a technique called the chain rule.
A similar set of equations can be applied to (b_j)^l:
The common part in both equations is often
called “local gradient” and is expressed as follows:
The “local gradient” can easily be determined
using the chain rule.
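For a single sigmoid neuron the idea can be sketched as below. The setup is an assumption made for illustration: one weight w, one bias b, one incoming activation `a_prev`, and a squared-error cost C = (a - y)². The shared factor dC/dz plays the role of the “local gradient”, and a finite-difference estimate is included as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, a_prev, y):
    return (sigmoid(w * a_prev + b) - y) ** 2

def neuron_gradients(w, b, a_prev, y):
    """Return (dC/dw, dC/db) via the chain rule."""
    a = sigmoid(w * a_prev + b)
    dC_da = 2.0 * (a - y)           # derivative of the squared error
    da_dz = a * (1.0 - a)           # derivative of the sigmoid
    local_gradient = dC_da * da_dz  # dC/dz, common to both gradients
    return local_gradient * a_prev, local_gradient

w, b, a_prev, y = 0.7, -0.2, 1.5, 1.0
dC_dw, dC_db = neuron_gradients(w, b, a_prev, y)

# Central finite differences should agree closely with the chain rule.
h = 1e-6
numeric_dw = (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2 * h)
numeric_db = (cost(w, b + h, a_prev, y) - cost(w, b - h, a_prev, y)) / (2 * h)
print(dC_dw, numeric_dw, dC_db, numeric_db)
```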
The gradients allow us to optimize the model’s parameters:
while the termination condition is not met:  w := w − ε ∂C/∂w,  b := b − ε ∂C/∂b
Initial values of w and b are randomly chosen.
Epsilon (ε) is the learning rate. It determines how strongly the gradient influences each update.
w and b are matrix representations of the weights and biases. The derivative of C with respect to w or b can be
calculated using the partial derivatives of C with respect to the individual weights or biases.
The termination condition is met once the cost function is minimized.
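The update loop described by these points can be sketched as follows. To keep it short, the sketch assumes a single linear neuron s = w·x + b with an MSE cost, a made-up dataset generated from y = 2x + 1, and a numerical tolerance standing in for "the cost function is minimized"; the names `train_linear`, `epsilon` and `tolerance` are hypothetical.

```python
import random

def train_linear(xs, ys, epsilon=0.1, tolerance=1e-10, max_steps=10000, seed=0):
    rng = random.Random(seed)
    # Initial values of w and b are randomly chosen.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(max_steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / n
        if cost < tolerance:  # practical stand-in for the termination condition
            break
        # Partial derivatives of the MSE cost with respect to w and b.
        dC_dw = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        dC_db = 2.0 * sum(errors) / n
        # Epsilon scales how strongly each gradient changes its parameter.
        w -= epsilon * dC_dw
        b -= epsilon * dC_db
    return w, b

w_fit, b_fit = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w_fit, b_fit)
```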
The final part of this section is devoted to a simple
example in which we will calculate the gradient of C with
respect to a single weight (w_22)².
Let’s zoom in on the bottom part of the above neural network:
Weight (w_22)² connects (a_2)² and (z_2)³, so computing the gradient requires applying the
chain rule through (z_2)³ and (a_2)³:
∂C/∂(w_22)² = ∂C/∂(a_2)³ · ∂(a_2)³/∂(z_2)³ · ∂(z_2)³/∂(w_22)²
Calculating the final value of the derivative of C with respect to (a_2)³ requires knowledge of the function C.
Since C is dependent on (a_2)³, calculating the derivative should be fairly straightforward.
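The single-weight calculation can be checked numerically with the sketch below. Several things here are assumptions, not taken from the slides: (w_22)² is treated as the weight from (a_2)² into (z_2)³, the output is s = W³·a³ with no output bias, the cost is the squared error (s - y)², the activation is a sigmoid, and all parameter values are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, a_prev):
    return [sum(w * a for w, a in zip(row, a_prev)) + bias
            for row, bias in zip(W, b)]

def forward(params, x):
    W1, b1, W2, b2, W3 = params
    a2 = [sigmoid(z) for z in dense(W1, b1, x)]
    a3 = [sigmoid(z) for z in dense(W2, b2, a2)]
    s = sum(w * a for w, a in zip(W3, a3))  # single output neuron
    return a2, a3, s

def grad_w22(params, x, y):
    """Chain rule: dC/dw22 = dC/da3_2 * da3_2/dz3_2 * dz3_2/dw22."""
    W3 = params[4]
    a2, a3, s = forward(params, x)
    dC_da3 = 2.0 * (s - y) * W3[1]    # through the output s
    da3_dz3 = a3[1] * (1.0 - a3[1])   # sigmoid derivative at (z_2)^3
    dz3_dw = a2[1]                    # (z_2)^3 is linear in (w_22)^2
    return dC_da3 * da3_dz3 * dz3_dw

def cost_with_w22(params, x, y, value):
    W1, b1, W2, b2, W3 = params
    W2_mod = [row[:] for row in W2]
    W2_mod[1][1] = value              # replace only (w_22)^2
    _, _, s = forward((W1, b1, W2_mod, b2, W3), x)
    return (s - y) ** 2

params = ([[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]], [0.0, 0.1],
          [[0.5, -0.5], [0.25, 0.75]], [0.1, -0.1], [0.6, -0.4])
x, y = [1.0, 0.5, -0.5, 2.0], 1.0

analytic = grad_w22(params, x, y)
h = 1e-6
w22 = params[2][1][1]
numeric = (cost_with_w22(params, x, y, w22 + h)
           - cost_with_w22(params, x, y, w22 - h)) / (2 * h)
print(analytic, numeric)
```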
Knowing the nuts and bolts of this algorithm will fortify your neural networks knowledge and
make you feel comfortable to take on more complex models.
Summary of Back Propagation Algorithm
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to the
output layer.
4. Calculate the error in the outputs:
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased.
Keep repeating the process until the desired output is achieved.
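The steps above can be sketched as one small training loop. This is an illustrative sketch under assumptions: a network with one hidden layer of two sigmoid neurons, a cost of one half of the squared error, the logical OR dataset, a fixed number of epochs instead of a stopping test, and hypothetical names throughout.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, hidden=2, learning_rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Step 2: real-valued weights, randomly selected.
    W_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b_h = [0.0] * hidden
    W_o = [rng.uniform(-1, 1) for _ in range(hidden)]
    b_o = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W_h, b_h)]
        return h, sigmoid(sum(w * hi for w, hi in zip(W_o, h)) + b_o)

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, desired in samples:       # Step 1: inputs arrive
            h, out = forward(x)          # Step 3: layer-by-layer outputs
            error = out - desired        # Step 4: actual minus desired
            total += error ** 2
            # Step 5: propagate the error backwards and adjust the weights.
            delta_o = error * out * (1.0 - out)
            delta_h = [delta_o * W_o[j] * h[j] * (1.0 - h[j])
                       for j in range(hidden)]
            for j in range(hidden):
                W_o[j] -= learning_rate * delta_o * h[j]
                b_h[j] -= learning_rate * delta_h[j]
                for i in range(n_in):
                    W_h[j][i] -= learning_rate * delta_h[j] * x[i]
            b_o -= learning_rate * delta_o
        history.append(total / len(samples))
    return (lambda x: forward(x)[1]), history

or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, history = train(or_samples)
print(history[0], history[-1], [round(predict(x), 2) for x, _ in or_samples])
```

The `history` list records the mean squared error per epoch, which makes it easy to see whether repeating the process is actually reducing the error.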
Feed forward, back propagation, gradient descent

