Unit 2 ml.pptx

Unit II –Linear Models
Multi-layer Perceptron – Going Forwards – Going Backwards: Back
Propagation Error – Multi-layer Perceptron in Practice – Examples of
using the MLP – Overview – Deriving Back-Propagation – Radial Basis
Functions and Splines – Concepts – RBF Network – Curse of
Dimensionality – Interpolations and Basis Functions – Support Vector
Machines

A Perceptron is an Artificial Neuron
• It is the simplest possible Neural Network
• Neural Networks are the building blocks of Machine Learning.
• A Perceptron is a neural network unit that does certain
computations. It is a function that maps its input “x,” which is
multiplied by the learned weight coefficient, and generates an output
value ”f(x).

Basic Components of Perceptron
• Mr. Frank Rosenblatt invented the perceptron model as a binary classifier
which contains three main components. These are as follows:

Input Nodes or Input Layer:
• This is the primary component of Perceptron which accepts the initial data into
the system for further processing. Each input node contains a real numerical
value.
Wight and Bias:
• Weight parameter represents the strength of the connection between units. This
is another most important parameter of Perceptron components. Weight is
directly proportional to the strength of the associated input neuron in deciding
the output. Further, Bias can be considered as the line of intercept in a linear
equation.
Activation Function:
• These are the final and important components that help to determine whether
the neuron will fire or not. Activation Function can be considered primarily as a
step function.

Types of Activation functions:
• Sign function
• Step function, and
• Sigmoid function

How does Perceptron work?
• In Machine Learning, Perceptron is considered as a single-layer neural
network that consists of four main parameters named input values
(Input nodes), weights and Bias, net sum, and an activation function.
• The perceptron model begins with the multiplication of all input values
and their weights, then adds these values together to create the
weighted sum.
• Then this weighted sum is applied to the activation function 'f' to
obtain the desired output.
• This activation function is also known as the step function and is
represented by 'f'.

• This step function or Activation function plays a vital role in ensuring
that output is mapped between required values (0,1) or (-1,1). It is
important to note that the weight of input is indicative of the strength
of a node. Similarly, an input's bias value gives the ability to shift the
activation function curve up or down.
• Perceptron model works in two important steps as follows:
• Step-1
• In the first step first, multiply all input values with corresponding
weight values and then add them to determine the weighted sum.
Mathematically, we can calculate the weighted sum as follows:

∑wi*xi = x1*w1 + x2*w2 +…wn*xn
• Add a special term called bias 'b' to this weighted sum to improve the
model's performance.
∑wi*xi + b
Step-2
• In the second step, an activation function is applied with the above-
mentioned weighted sum, which gives us output either in binary form
or a continuous value as follows:
Y = f(∑wi*xi + b)

Multi-layer Perceptron
• The perceptron is very useful for classifying data sets that are linearly
separable.
• They encounter serious limitations with data sets that do not conform
to this pattern as discovered with the XOR problem.
• The XOR problem shows that for any classification of four points that
there exists a set that are not linearly separable.
• The MultiLayer Perceptron (MLPs) breaks this restriction and classifies
datasets which are not linearly separable.
• They do this by using a more robust and complex architecture to learn
regression and classification models for difficult datasets.

• A multilayer perceptron (MLP) is a feedforward artificial neural
network that generates a set of outputs from a set of inputs.
• An MLP is characterized by several layers of input nodes connected as
a directed graph between the input and output layers.
• MLP uses backpropogation for training the network. MLP is a deep
learning method.

• A multilayer perceptron is a neural network connecting multiple
layers in a directed graph, which means that the signal path through
the nodes only goes one way.
• Each node, apart from the input nodes, has a nonlinear activation
function.
• An MLP uses backpropagation as a supervised learning technique.
Since there are multiple layers of neurons, MLP is a deep learning
technique.

• MLP is widely used for solving problems that require supervised
learning as well as research into computational neuroscience and
parallel distributed processing.
• Applications include speech recognition, image recognition and
machine translation.
• Multi-layer perception is also known as MLP. It is fully connected
dense layers, which transform any input dimension to the desired
dimension.
• A multi-layer perception is a neural network that has multiple layers.
To create a neural network we combine neurons together so that the
outputs of some neurons are inputs of other neurons.

• A multi-layer perceptron has one input layer and for each input, there
is one neuron(or node), it has one output layer with a single node for
each output and it can have any number of hidden layers and each
hidden layer can have any number of nodes.
• In the multi-layer perceptron diagram above, we can see that there
are three inputs and thus three input nodes and the hidden layer has
three nodes.
• The output layer gives two outputs, therefore there are two output
nodes.

• The nodes in the input layer take input and forward it for further
process, in the diagram above the nodes in the input layer forwards
their output to each of the three nodes in the hidden layer, and in the
same way, the hidden layer processes the information and passes it to
the output layer.

• The neural network solution proposed by Rumelhart, Hinton, and
McClelland, the Multi-layer Perceptron (MLP), which is still one of the
most commonly used machine learning methods around.
• The MLP is one of the most common neural networks in use.
• It is often treated as a ‘black box’, in that people use it without
understanding how it works, which often results in fairly poor results.
• Getting to the stage where we understand how it works and what we
can do with it is going to take us into lots of different areas of
statistics, mathematics, and computer science

Going Forwards
• Training the MLP consists of two parts: working out what the outputs
are for the given inputs and the current weights, and then updating
the weights according to the error, which is a function of the
difference between the outputs and the targets.
• These are generally known as going forwards and backwards through
the network.
• Having made an MLP with two layers of nodes, there is no reason why
we can’t make one with 3, or 4, or 20 layers of nodes .
• This won’t even change our recall (forward) algorithm much, since we
just work forwards through the network computing the activations of
one layer of neurons and using those as the inputs for the next layer.

• we start at the left by filling in the values for the inputs. We then use
these inputs and the first level of weights to calculate the activations
of the hidden layer, and then we use those activations and the next
set of weights to calculate the activations of the output layer.
• Now that we’ve got the outputs of the network, we can compare
them to the targets and compute the error.

Biases:
Need to include a bias input to each neuron.
Give an extra input that is permanently set to -1, and adjusting the
weights to each neuron as part of the training.
Thus, each neuron in the network (whether it is a hidden layer or the
output) has 1 extra input, with fixed value.

GOING BACKWARDS: BACK-PROPAGATION OF
ERROR
• In back-propagation of error, the errors are sent backwards through the
network.
• It is a form of gradient descent.
• Gradient Descent is known as one of the most commonly used
optimization algorithms to train machine learning models by means of
minimizing errors between actual and expected results. Further, gradient
descent is also used to train Neural Networks.
• In the Perceptron, we changed the weights so that the neurons fired when
the targets said they should, and didn’t fire when the targets said they
shouldn’t.
• What we did was to choose an error function for each neuron k: Ek = yk −
tk, and tried to make it as small as possible. Since there was only one set of
weights in the network, this was sufficient to train the network

• With the addition of extra layers of weights, this is harder to arrange.
The problem is that when we try to adapt the weights of the Multi-
layer Perceptron, we have to work out which weights caused the
error.
• This could be the weights connecting the inputs to the hidden layer,
or the weights connecting the hidden layer to the output layer.
• The error function that we used for the Perceptron was

• where N is the number of output nodes.
• However, suppose that we make two errors. In the first, the target is
bigger than the output, while in the second the output is bigger than
the target.
• If these two errors are the same size, then if we add them up we
could get 0, which means that the error value suggests that no error
was made.

• To get around this we need to make all errors have the same sign.
• We can do this in a few different ways, but the one that will turn out
to be best is the sum-of-squares error function, which calculates the
difference between y and t for each node, squares them, and adds
them all together:

• Imagine a ball rolling around on a surface that looks like the line in
diagram Gravity will make the ball roll downhill (follow the downhill
gradient) until it ends up in the bottom of one of the hollows. These
are places where the error is small, so that is exactly what we want.
This is why the algorithm is called gradient descent.

Every node in the multi-layer perception uses a sigmoid activation
function. The sigmoid activation function takes real values as input and
converts them to numbers between 0 and 1 using the sigmoid formula.
As far as an algorithm goes, we’ve fed our inputs forward through the
network and worked out which nodes are firing.
Now, at the output, we’ve computed the errors as the sum-squared
difference between the outputs and the targets

• What we want to do next is to compute the gradient of these errors
and use them to decide how much to update each weight in the
network.
• We will do that first for the nodes connected to the output layer, and
after we have updated those, we will work backwards through the
network until we get back to the inputs again.
• There are just two problems
• for the output neurons, we don’t know the inputs.
• for the hidden neurons, we don’t know the targets

• For extra hidden layers, we know neither the inputs nor the targets,
but even this won’t matter for the algorithm we derive.
• Here use the chain rule of differentiation.
• The chain rule tells us that if we want to know how the error changes
as we vary the weights, we can think about how the error changes as
we vary the inputs to the weights, and multiply this by how those
input values change as we vary the weights

• This is useful because it lets us calculate all of the derivatives that we
want to we can write the activations of the output nodes in terms of
the activations of the hidden nodes and the output weights, and then
we can send the error calculations back through the network to the
hidden layer to decide what the target outputs were for those
neurons.

The Multi-layer Perceptron Algorithm
• We will assume that there are L input nodes, plus the bias, M hidden
nodes, also plus a bias, and N output nodes, so that there are (L+
1)×M weights between the input and the hidden layer and (M + 1)×N
between the hidden layer and the output.
• The sums that we write will start from 0 if they include the bias nodes
and 1 otherwise, and run up to L, M, or N, so that x0 = −1 is the bias
input, and a0 = −1 is the bias hidden node.
• The algorithm that is described could have any number of hidden
layers, in which case there might be several values for M, and extra
sets of weights between the hidden layers.

• We will also use i, j, k to index the nodes in each layer in the sums,
and the corresponding Greek letters (ι, ζ, κ) for fixed indices.
• Here is a quick summary of how the algorithm works, and then the
full MLP training algorithm using back-propagation of error is
described.
1. an input vector is put into the input nodes
2. the inputs are fed forward through the network
• the inputs and the first-layer weights (here labelled as v) are used to
decide whether the hidden nodes fire or not. The activation function
g(·) is the sigmoid function

• the outputs of these neurons and the second-layer weights (labelled
as w) are used to decide if the output neurons fire or not.
3. the error is computed as the sum-of-squares difference between the
network outputs and the targets
4. this error is fed backwards through the network in order to
• first update the second-layer weights
• and then afterwards, the first-layer weights

THE MULTI-LAYER PERCEPTRON IN PRACTICE
• To using the MLP to find solutions to four different types of problem:
regression, classification, time-series prediction, and data
compression.
1 .Amount of Training Data
For the MLP with one hidden layer there are (L + 1) × M + (M + 1) × N
weights, where L, M, N are the number of nodes in the input, hidden,
and output layers, respectively.
The extra +1s come from the bias nodes, which also have adjustable
weights. This is a potentially huge number of adjustable parameters
that we need to set during the training phase.

• Setting the values of these weights is the job of the back-propagation
algorithm, which is driven by the errors coming from the training
data.
• Clearly, the more training data there is, the better for learning,
although the time that the algorithm takes to learn increases.
• This is probably going to be a very large number of examples, so
neural network training is a fairly computationally expensive
operation, because we need to show the network all of these inputs
lots of times.

2. Number of Hidden Layers
• Two hidden layers is the most that you ever need for normal MLP
learning.
• In fact, this result can be strengthened: it is possible to show
mathematically that one hidden layer with lots of hidden nodes is
sufficient.
• This is known as the Universal Approximation Theorem;
• BelowFigure shows that two hidden layers are sufficient. In fact, they
aren’t necessary: one hidden layer will do, although it may require an
arbitrarily large number of hidden nodes. This is known as the
Universal Approximation Theorem,

3. When to Stop Learning
• The training of the MLP requires that the algorithm runs over the
entire dataset many times, with the weights changing as the network
makes errors in each iteration.
• The question is how to decide when to stop learning, stopping when
some predefined minimum error is reached might mean the
algorithm never terminates, or that it overfits.

• We train the network for some predetermined amount of time, and
then use the validation set to estimate how well the network is
generalising.
• We then carry on training for a few more iterations, and repeat the
whole process.
• At some stage the error on the validation set will start increasing
again, because the network has stopped learning about the function
that generated the data, and started to learn about the noise that is
in the data itself. At this stage we stop the training. This technique is
called early stopping.

EXAMPLES OF USING THE MLP
• We shall look at the four types of problems that are generally solved
using an MLP: regression, classification, time-series prediction, and
data compression/data denoising.
A Regression Problem :
• Take a set of samples generated by a simple mathematical function,
and try to learn the generating function (that describes how the data
was made) so that we can find the values of any inputs, not just the
ones we have training data for.

• The function that we will use is a very simple one, just a bit of a sine
wave.
• We’ll make the data in the following way (make sure that you have
NumPy imported as np first):

• The reason why we have to use the reshape() method is that NumPy
defaults to lists for arrays that are N ×1; compare the results of the
np.shape() calls below, and the effect of the transpose operator .T on
the array:

• You can plot this data
• We can now train an MLP on the data. There is one input value, x and
one output value t, so the neural network will have one input and one
output.
• Also, because we want the output to be the value of the function,
rather than 0 or 1, we will use linear neurons at the output. We don’t
know how many hidden neurons we will need yet, so we’ll have to
experiment to see what works

• Separate the data into training, testing, and validation sets. For this
example there are only 40 datapoints.
• We can split the data in the ratio 50:25:25 by using the odd-
numbered elements as training data, the even-numbered ones that
do not divide by 4 for testing, and the rest for validation:

• To start with, we will construct a network with three nodes in the
hidden layer, and run it for 101 iterations with a learning rate of 0.25,
just to see that it works:

• we can see that the network is learning, since the error is decreasing. We
now need to do two things:
• work out how many hidden nodes we need, and decide how long to train
the network for.
• In order to solve the first problem, we need to test out different networks
and see which get lower errors, but to do that properly we need to know
when to stop training.
• So we’ll solve the second problem first, which is to implement early
stopping. We train the network for a few iterations (let’s make it 10 for
now), then evaluate the validation set error by running the network
forward (i.e., the recall phase).
• Learning should stop when the validation set error starts to increase.

• We can now return to the problem of finding the right size of
network.
• There is one important thing to remember, which is that the weights
are initialised randomly, and so the fact that a particular size of
network gets a good solution once does not mean it is the right size, it
could have been a lucky starting point.
• So each network size is run 10 times, and the average is monitored.
The following table shows the results of doing this, reporting the sum-
of-squares validation error, for a few different sizes of network:

Based on these numbers, we would select a network with a small
number of hidden nodes, certainly between 2 and 10 (and the smaller
the better, in general), since their maximum error is much smaller than
a network with just 1 hidden node.
Note also that the error increases once too many hidden nodes are
used, since the network has too much variation for the problem.

Classification with the MLP
• Using the MLP for classification problems is not radically different
once the output encoding has been worked out.
• The inputs are easy: they are just the values of the feature
measurements (suitably normalised).
• There are a couple of choices for the outputs.
• The first is to use a single linear node for the output, y, and put some
thresholds on the activation value of that node. For example, for a
four-class problem, we could use:

• What about an example that is very close to a boundary, say y = 0.5?
• We arbitrarily guess that it belongs to class C3, but the neural
network doesn’t give us any information about how close it was to
the boundary in the output, so we don’t know that this was a difficult
example to classify.

• Once the network has been trained, performing the classification is
easy:
• Simply choose the element yk of the output vector that is the largest
element of y (in mathematical notation, pick the yk for which yk > yj
∀j /= k; ∀ means for all, so this statement says pick the yk that is
bigger than all other possible values yj ).
• This generates an unambiguous decision, since it is very unlikely that
two output neurons will have identical largest output values. This is
known as the hard-max activation function (since the neuron with the
highest activation is chosen to fire and the rest are ignored).

• The soft-max function, which has the effect of scaling the output of
each neuron according to how large it is in comparison to the others,
and making the total output sum to 1.
• So if there is one clear winner, it will have a value near 1, while if
there are several values that are close to each other, they will each
have a value of about 1 p , where p is the number of output neurons
that have similar values.

• There is one other thing that we need to be aware of when
performing classification, which is true for all classifiers.
• Suppose that we are doing two-class classification, and 90% of our
data belongs to class 1. (This can happen: for example in medical
data, most tests are negative in general.)
• In that case, the algorithm can learn to always return the negative
class, since it will be right 90% of the time, but still a completely
useless classifier!

• There is an alternative solution, known as novelty detection, which is
to train the data on the data in the negative class only, and to assume
that anything that looks different to that is a positive example.
• A Classification Example: The Iris Dataset
• As an example we are going to look at another example from the UCI
Machine Learning repository.
• This one is concerned with classifying examples of three types of iris
(flower) by the length and width of the sepals and petals and is called
iris.

• It was originally worked on by R.A. Fisher, a famous statistician and
biologist, who analysed it in the 1930s.
• We can’t currently load this into NumPy using loadtxt() because the
class (which is the last column) is text rather than a number, and the
txt in the function name doesn’t mean that it reads text, only
numbers in plaintext format.
• There are two alternatives.
• One is to edit the data in a text editor using search and replace, and
the other is to use some Python code, such as this function:

• You can then load it from the new file using loadtxt().
• In the dataset, the last column is the class ID, and the others are the
four measurements.

• We now need to convert the targets into 1-of-N encoding, from their
current encoding as class 1, 2, or 3.
• This is pretty easy if we make a new matrix that is initially all zeroes,
and simply set one of the entries to be 1:

• We now need to separate the data into training, testing, and
validation sets.
• There are 150 examples in the dataset, and they are split evenly
amongst the three classes, so the three classes are the same size and
we don’t need to worry about discarding any datapoints.
• We’ll split them into half training, and one quarter each testing and
validation.
• If you look at the file, you will notice that the first 50 are class 1, the
second 50 class 2, etc.

This tells us that the algorithm got nearly all of the test data correct,
misclassifying just two examples of class 2 and one of class 3.
• We’re now finally ready to set up and train the network.

4. Time-Series Prediction
• There is a common data analysis task known as time-series
prediction, where we have a set of data that show how something
varies over time, and we want to predict how the data will vary in the
future.
• It is useful in any field where there is data that appears over time,
which is to say almost any field. Most notable (if often unsuccessful)
uses have been in trying to predict stock markets and disease
patterns.

• The other problems with the data are practical.
• How many datapoints should we look at to make the prediction (i.e.,
how many inputs should there be to the neural network) and how far
apart in time should we space those inputs (i.e., should we use every
second datapoint, every 10th, or all of them)?
• We can write this as an equation, where we are predicting y using a
neural network that is written as a function f(·):

5 Data Compression: The Auto-Associative
Network
• we train the network to reproduce the inputs at the output layer
(called auto-associative learning; sometimes the network is known as
an autoencoder).
• The network is trained so that whatever you show it at the input is
reproduced at the output, which doesn’t seem very useful at first, but
suppose that we use a hidden layer that has fewer neurons than the
input layer (see Figure 4.17).
• This bottleneck hidden layer has to represent all of the information in
the input, so that it can be reproduced at the output. It therefore
performs some compression of the data, representing it using fewer
dimensions than were used in the input.

• This gives us some idea of what the hidden layers of the MLP are
doing: they are finding a different (often lower dimensional)
representation of the input data that extracts important components
of the data, and ignores the noise.
• This auto-associative network can be used to compress images and
other data.

• The 2D image is turned into a 1D vector of inputs by cutting the image
into strips and sticking the strips into a long line.
• The values of this vector are the intensity (colour) values of the
image, and these are the input values.
• The network learns to reproduce the same image at the output, and
the activations of the hidden nodes are recorded for each image.
• After training, we can throw away the input nodes and first set of
weights of the network

• If we insert some values in the hidden nodes (their activations for a
particular image; see Figure 4.19), then by feeding these activations
forward through the second set of weights, the correct image will be
reproduced on the output.
• So all we need to store are the set of second-layer weights and the
activations of the hidden nodes for each image, which is the
compressed version.

• Auto-associative networks can also be used to denoise images, since,
after training, the network will reproduce the trained image that best
matches the current (noisy) input.
• We don’t throw away the first set of weights this time, but if we feed
a noisy version of the image into the inputs, then the network will
produce the image that is closest to the noisy version at the outputs,
which will be the version it learnt on, which is uncorrupted by noise.

DERIVING BACK-PROPAGATION
• 1 The Network Output and the Error
• The output of the neural network (the end of the forward phase of
the algorithm) is a function of three things:
• the current input (x)
• the activation function g(·) of the nodes of the network
• the weights of the network (v for the first layer and w for the second)

The notation ∂ means the partial derivative, and is used because there
are lots of different functions that we can differentiate E with respect
to: all of the different weights.

• The gradient that we want to know is how the error function changes
with respect to the different weights:

Requirements of an Activation Function

• The important thing that we need to remember is that inputs to the
output layer neurons come from the activations of the hidden layer
neurons multiplied by the second layer weights:

5. The Output Activation Functions

Radial Basis Functions
• What is radial basis function in ML?
A radial basis function network is a type of supervised artificial neural
network that uses supervised machine learning (ML) to function as a
nonlinear classifier.
Nonlinear classifiers use sophisticated functions to go further in
analysis than simple linear classifiers that work on lower-dimensional
vectors.

What do you understand by radial basis function network?
• In the field of mathematical modeling, a radial basis function network
is an artificial neural network that uses radial basis functions as
activation functions.
• The output of the network is a linear combination of radial basis
functions of the inputs and neuron parameters.

Difference between multilayer perceptron
and radial basis function network
• MLP network presents general approach as a whole to handle
nonlinear relationship between the input parameter(s) and output
parameter(s) while RBF network evaluates the different subspaces of
input set as different relationships and produces local solutions.
• Hidden layer of RBF is different from MLP. It performs some
computations. Each hidden unit act as a point in input space and
activation/output for any instance depends on the distance between
that point(Hidden Unit) and instance(Also a point in space).

• MLP: uses dot products (between inputs and weights) and sigmoidal
activation functions and training is usually done through back
propagation for all layers. This type of neural network is used in deep
learning.
• RBF: Uses Euclidean distances (between inputs and weights, which
can be viewed as centers) and Gaussian activation functions. Thus,
RBF neurons have maximum activation when the center/weights are
equal to the inputs. RBFs make it easier to grow new neurons during
training.

• RBF and MLP belong to a class of neural networks called feed-forward
networks.
• Hidden layer of RBF is different from MLP. It performs some
computations. Each hidden unit act as a point in input space and
activation/output for any instance depends on the distance between
that point(Hidden Unit) and instance(Also a point in space).
• Thus, RBF learns two kinds of parameters:
1) Centers and width of RBFs
2) Weights to from linear combination of outputs obtained from
hidden layer

Radial Basis Function Network
• “Artificial Neural Networks” they are referring to the Multilayer
Perceptron (MLP).
• Each neuron in an MLP takes the weighted sum of its input values.
That is, each input value is multiplied by a coefficient, and the results
are all summed together.
• A single MLP neuron is a simple linear classifier, but complex non-
linear classifiers can be built by combining these neurons into a
network.

• The RBFN approach is more intuitive than the MLP.
• An RBFN performs classification by measuring the input’s similarity to
examples from the training set.
• Each RBFN neuron stores a “prototype”, which is just one of the
examples from the training set. When we want to classify a new
input, each neuron computes the Euclidean distance between the
input and its prototype.
• Roughly speaking, if the input more closely resembles the class A
prototypes than the class B prototypes, it is classified as class A.

• The above illustration shows the typical architecture of an RBF Network. It
consists of an input vector, a layer of RBF neurons, and an output layer with
one node per category or class of data.
The Input Vector
• The input vector is the n-dimensional vector that you are trying to classify.
The entire input vector is shown to each of the RBF neurons.
The RBF Neurons
• Each RBF neuron stores a “prototype” vector which is just one of the
vectors from the training set.
• Each RBF neuron compares the input vector to its prototype, and outputs a
value between 0 and 1 which is a measure of similarity.

• If the input is equal to the prototype, then the output of that RBF
neuron will be 1.
• As the distance between the input and prototype grows, the response
falls off exponentially towards 0.
• The shape of the RBF neuron’s response is a bell curve, as illustrated
in the network architecture diagram.
• The neuron’s response value is also called its “activation” value.
• The prototype vector is also often called the neuron’s “center”, since
it’s the value at the center of the bell curve.

The Output Nodes
• The output of the network consists of a set of nodes, one per
category that we are trying to classify.
• Each output node computes a sort of score for the associated
category.
• Typically, a classification decision is made by assigning the input to
the category with the highest score.
• The score is computed by taking a weighted sum of the activation
values from every RBF neuron.

• By weighted sum we mean that an output node associates a weight
value with each of the RBF neurons, and multiplies the neuron’s
activation by this weight before adding it to the total response.
• Because each output node is computing the score for a different
category, every output node has its own set of weights.
• The output node will typically give a positive weight to the RBF
neurons that belong to its category, and a negative weight to the
others.

RBF Neuron Activation Function
• Each RBF neuron computes a measure of the similarity between the
input and its prototype vector (taken from the training set).
• Input vectors which are more similar to the prototype return a result
closer to 1.
• There are different possible choices of similarity functions, but the
most popular is based on the Gaussian. Below is the equation for a
Gaussian with a one-dimensional input.

• Where x is the input, mu is the mean, and sigma is the standard
deviation. This produces the familiar bell curve shown below, which is
centered at the mean, mu (in the below plot the mean is 5 and sigma
is 1).

• In the Gaussian distribution, mu refers to the mean of the
distribution. Here, it is the prototype vector which is at the center of
the bell curve.
• The first change is that we’ve removed the outer coefficient, 1 /
(sigma * sqrt(2 * pi)). This term normally controls the height of the
Gaussian. Here, though, it is redundant with the weights applied by
the output nodes.

• During training, the output nodes will learn the correct coefficient or
“weight” to apply to the neuron’s response.
• The second change is that we’ve replaced the inner coefficient, 1 / (2
* sigma^2), with a single parameter ‘beta’. This beta coefficient
controls the width of the bell curve.
• Again, in this context, we don’t care about the value of sigma, we just
care that there’s some coefficient which is controlling the width of the
bell curve. So we simplify the equation by replacing the term with a
single variable.

RBF Neuron activation for different values of beta

• Evaluating the similarity between an input vector and a prototype is
the Euclidean distance between the two vectors.
• Also, each RBF neuron will produce its largest response when the
input is equal to the prototype vector. This allows to take it as a
measure of similarity, and sum the results from all of the RBF
neurons.

Example Dataset
• Before going into the details on training an RBFN, let’s look at a fully
trained example.
• In the below dataset, we have two dimensional data points which
belong to one of two classes, indicated by the blue x’s and red
circles. I’ve trained an RBF Network with 20 RBF neurons on this data
set. The prototypes selected are marked by black asterisks.
• We can also visualize the category 1 (red circle) score over the input
space. We could do this with a 3D mesh.

• The areas where the category 1 score is highest are colored dark red,
and the areas where the score is lowest are dark blue. The values
range from -0.2 to 1.38.
• I’ve included the positions of the prototypes again as black asterisks.
You can see how the hills in the output values are centered around
these prototypes.
• It’s also interesting to look at the weights used by output nodes to
remove some of the mystery.
• For the category 1 output node, all of the weights for the category 2
RBF neurons are negative:

And all of the weights for category 1 RBF neurons are positive:
-0.79934 -1.26054 -0.68206 -0.68042 -0.65370 -0.63270 -0.65949 -0.83266 -0.82232 -0.64140
0.78968 0.64239 0.61945 0.44939 0.83147 0.61682 0.49100 0.57227 0.68786 0.84207

Training The RBFN
• The training process for an RBFN consists of selecting three sets of
parameters: the prototypes (mu) and beta coefficient for each of the
RBF neurons, and the matrix of output weights between the RBF
neurons and the output nodes.
• There are many possible approaches to selecting the prototypes and
their variances.
Selecting the Prototypes
Two possible approaches are to create an RBF neuron for every training
example, or to just randomly select k prototypes from the training
data.

• In other words, you can always improve its accuracy by using more
RBF neurons.
• One of the approaches for making an intelligent selection of
prototypes is to perform k-Means clustering on your training set and
to use the cluster centers as the prototypes.
• Selecting Beta Values
If you use k-means clustering to select your prototypes, then one
simple method for specifying the beta coefficients is to set sigma equal
to the average distance between all points in the cluster and the cluster
center.

Output Weights
• The final set of parameters to train are the output weights. These can
be trained using gradient descent (also known as least mean squares).
• First, for every data point in your training set, compute the activation
values of the RBF neurons. These activation values become the
training inputs to gradient descent.
• The linear equation needs a bias term, so we always add a fixed value
of ‘1’ to the beginning of the vector of activation values.

• Gradient descent must be run separately for each output node (that
is, for each class in your data set).
• For the output labels, use the value ‘1’ for samples that belong to the
same category as the output node, and ‘0’ for all other samples.
• For example, if our data set has three classes, and we’re learning the
weights for output node 3, then all category 3 examples should be
labeled as ‘1’ and all category 1 and 2 examples should be labeled as
0.

What are Splines?
• To overcome the disadvantages of linear and polynomial regression
we introduced the regression splines.
• As we know in linear regression the dataset is considered as one, but
in splines regression, we have to split the dataset into many parts
which we call bin.
• And the points in which we divide the data are called knots and we
use different methods in different bins. These separate functions we
use in the different bins are called piecewise step functions.

• Splines are a way to fit a high-degree polynomial function by breaking
it up into smaller piecewise polynomial functions. For each
polynomial, we fit a separate model and connect them all together.

Why Splines?
• We already discussed that linear regression is a straight line hence we
made polynomial regression but it can make the model overfitting
issue.
• The need for a model that can be used with the good properties of
both linear and polynomial regression made the spline regression.
• While this sounds complicated, by breaking up each section into
smaller polynomials, we decrease the risk of overfitting.

How to break up a polynomial
• Because a spline breaks up a polynomial into smaller pieces, we need
to determine where to break up the polynomial.
• The point where this division occurs is called a knot.
• In the example above, each P_x represents a knot.
• The knots at the ends of the curves are known as boundary knots,
while the knots within the curve are known as internal knots.

Selecting number and location of knots
• While we can visually inspect where to place these knots, we need to
devise systematic methods to select knots.
Some strategies include:
• Placing knots in highly variable regions
• Specify the degrees of freedom and place knots uniformly throughout
the data
• Cross-validation

Types of Splines
• Cubic Splines
Cubic splines require that we connect these different polynomial
functions smoothly.
This means that the first and second derivatives of these functions
must be continuous.
The plot below shows a cubic spline and how the first derivative is a
continuous function.

Natural Splines
• Polynomial functions and other kinds of splines tend to have bad fits
near the ends of the functions.
• This variability can have huge consequences, particularly in
forecasting.
• Natural splines resolve this issue by forcing the function to be linear
after the boundary knots.

Smoothing Splines
• Finally, we can consider the regularized version of a spline: the
smoothing spline.
• The cost function is penalized if the variability of the coefficient is
high.
• Below is a plot that shows a situation where smoothing splines are
needed to get an adequate model fit.

Curse of Dimensionality
• The curse of dimensionality refers to the phenomena that occur
when classifying, organizing, and analyzing high dimensional data that
does not occur in low dimensional spaces, specifically the issue of
data sparsity and “closeness” of data.
• Curse of Dimensionality describes the explosive nature of increasing
data dimensions and its resulting exponential increase in
computational efforts required for its processing and/or analysis.

• This term was first introduced by Richard E. Bellman, to explain the
increase in volume of Euclidean space associated with adding extra
dimensions, in area of dynamic programming.
• Today, this phenomenon is observed in fields like machine learning,
data analysis, data mining to name a few.
• In machine learning, a feature of an object can be an attribute or a
characteristic that defines it.
• Each feature represents a dimension and group of dimensions creates
a data point.

• This represents a feature vector that defines the data point to be used
by a machine learning algorithm(s).
• When we say increase in dimensionality it implies an increase in the
number of features used to describe the data.
• For example, in the field of breast cancer research, age, number of
cancerous nodes can be used as features to define the prognosis of
the breast cancer patient.
• These features constitute the dimensions of a feature vector. But
other factors like past surgeries, patient history, type of tumor and
other such features help a doctor to better determine the prognosis.

• In this case by adding features, we are theoretically increasing the
dimensions of our data.
• As the dimensionality increases, the number of data points required
for good performance of any machine learning algorithm increases
exponentially.
• The reason is that, we would need more number of data points for
any given combination of features, for any machine learning model to
be valid.
• For example, let’s say that for a model to perform well, we need at
least 10 data points for each combination of feature values.

• If we assume that we have one binary feature, then for its 21 unique
values (0 and 1) we would need 2¹x 10 = 20 data points.
• For 2 binary features, we would have 2² unique values and need 2² x
10 = 40 data points.
• Thus, for k-number of binary features we would need 2ᵏ x 10 data
points.

• Hughes (1968) in his study concluded that with a fixed number of
training samples, the predictive power of any classifier first increases
as the number of dimensions increase, but after a certain value of
number of dimensions, the performance deteriorates.
• Thus, the phenomenon of curse of dimensionality is also known as
Hughes phenomenon.

Effect of Curse of Dimensionality on Distance
Functions:
• For any point A, lets assume distₘᵢₙ(A) is the minimum distance between
A and its nearest neighbor and distₘₐₓ(A) is the maximum distance
between A and the farthest neighbor.

• That is, for a d — dimensional space, given n-random points, the
distₘᵢₙ(A) ≈ distₘₐₓ(A) meaning, any given pair of points are
equidistant to each other.
• Therefore, any machine learning algorithms which are based on the
distance measure including KNN(k-Nearest Neighbor) tend to fail
when the number of dimensions in the data is very high.
• Thus, dimensionality can be considered as a “curse” in such
algorithms.

Solutions to Curse of Dimensionality:
• One of the ways to reduce the impact of high dimensions is to use a
different measure of distance in a space vector.
• One could explore the use of cosine similarity to replace Euclidean
distance. Cosine similarity can have a lesser impact on data with
higher dimensions.
• However, use of such method could also be specific to the required
solution of the problem.

Other methods:
• Other methods could involve the use of reduction in dimensions.
Some of the techniques that can be used are:
1. Forward-feature selection: This method involves picking the most
useful subset of features from all given features.
2. PCA/t-SNE: Though these methods help in reduction of number of
features, but it does not necessarily preserve the class labels and thus
can make the interpretation of results a tough task.

Interpolations and Basis Functions
• Interpolation is a method of creating new data points within the
range of known data points.
• n the graph below, the dots show original data and the curves show
functions plotting interpolated data points.

• Types of interpolation include:
• Linear
• Multivariate
• Nearest Neighbor
• Polynomial
• Spline

• Linear interpolation
Linear interpolation creates a continuous function out of discrete data. It’s a
foundational building block for the gradient descent algorithm, which is used
in the training of just about every machine learning technique.
Interpolation of a data set
Linear interpolation on a set of data points (x0, y0), (x1, y1), ..., (xn, yn) is
defined as the concatenation of linear interpolants between each pair of
data points. This results in a continuous curve.

Multivariate interpolation
• In numerical analysis, multivariate interpolation is interpolation on
functions of more than one variable; when the variates are spatial
coordinates, it is also known as spatial interpolation.
• The function to be interpolated is known at given points(xi,yi,zi,….)
and the interpolation problem consists of yielding values at arbitrary
points(x,y,z,…).

Nearest-neighbor interpolation
• Nearest-neighbor interpolation is a simple method of multivariate
interpolation in one or more dimensions.
• Interpolation is the problem of approximating the value of a function
for a non-given point in some space when given the value of that
function in points around (neighboring) that point.
• The nearest neighbor algorithm selects the value of the nearest point
and does not consider the values of neighboring points at all, yielding
a piecewise-constant interpolant. The algorithm is very simple to
implement and is commonly used in real-time 3D rendering to select
color values for a textured surface.

Polynomial interpolation
• Polynomial interpolation is a method of estimating values between
known data points.
• When graphical data contains a gap, but data is available on either
side of the gap or at a few specific points within the gap, an estimate
of values within the gap can be made by interpolation.

Spline interpolation
• The Spline tool uses an interpolation method that estimates values
using a mathematical function that minimizes overall surface
curvature, resulting in a smooth surface that passes exactly through
the input points.

Basis Functions
• Radial basis functions and several other machine learning algorithms
can be written in this form:
• where Φi(x) is some function of the input value x and the αi are the
parameters we can solve for in order to make the model fit the data.

• We will consider the input being scalar values x rather than vector
values x in what follows.
• The Φi(x) are known as basis functions and they are parameters of
the model that are chosen.
• The first thing we need to think about is where each Φi is defined.
Looking at the third graph
• we see that the first function should only be defined between 0 and
x1, the next between x1 and x2, and so on. These points xi are called
knotpoints and they are generally evenly spaced.

• The more knotpoints there are, the more complex the model can be,
in which case the model is more likely to overfit, and needs more
training data, just like the neural networks that we have seen.

• We simply use a constant function Φ(x) = 1. Now the model would
have value α1 to the left of x1, value α2 between x1 and x2, etc.
• So depending upon how we fit the spline model to the data, the
model will have different values, but it will certainly be constant in
each region.
• This is sufficient to make the straight line approximation.
• However, we might decide that a constant value is not enough, and
we use a function that varies linearly (a linear function that has value
Φ(x) = x within the region).

• As well as Φ1(x) = 1, Φ2(x) = x, we add some basis functions that
insist that the value is 0 at the boundary with x1: Φ3(x) = (x − x1)+,
and the next with the boundary at x2: Φ4(x) = (x − x2)+, etc., where
(x)+ = x if x > 0 and 0 otherwise.
• These functions are sufficient to insist that the knotpoint values are
enforced, since one is defined on each knotpoint.

• The uses of interpolation include: Help users to determine what data
might exist outside of their collected data. Similarly, for scientists,
engineers, photographers and mathematicians to fit the data for
analysing the trend and so on.
• Basis function is used to adjusted to fit in the data set.

Support Vector Machines
• Support Vector Machine or SVM is one of the most popular
Supervised Learning algorithms, which is used for Classification as
well as Regression problems.
• Primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that
we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyper plane.

• SVM chooses the extreme points/vectors that help in creating the
hyper plane.
• These extreme cases are called as support vectors, and hence
algorithm is termed as Support Vector Machine.
• Consider the below diagram in which there are two different
categories that are classified using a decision boundary or hyper
plane:

Example:
• Suppose we see a strange cat that also has some features of dogs, so if we
want a model that can accurately identify whether it is a cat or dog, so such
a model can be created by using the SVM algorithm.
• We will first train our model with lots of images of cats and dogs so that it
can learn about different features of cats and dogs, and then we test it with
this strange creature.
• So as support vector creates a decision boundary between these two data
(cat and dog) and choose extreme cases (support vectors), it will see the
extreme case of cat and dog.
• On the basis of the support vectors, it will classify it as a cat. Consider the
below diagram:

SVM algorithm can be used for Face detection, image classification,
text categorization, etc.
Types of SVM
• SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which
means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data, and
classifier is used called as Linear SVM classifier.

• Non-linear SVM:
Non-Linear SVM is used for non-linearly separated data, which means if
a dataset cannot be classified by using a straight line, then such data is
termed as non-linear data and classifier used is called as Non-linear
SVM classifier.
• Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find out
the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.

• The dimensions of the hyper plane depend on the features present in
the dataset, which means if there are 2 features (as shown in image),
then hyper plane will be a straight line. And if there are 3 features,
then hyper plane will be a 2-dimension plane.
• We always create a hyper plane that has a maximum margin, which
means the maximum distance between the data points.
• Support Vectors:
The data points or vectors that are the closest to the hyper plane and
which affect the position of the hyper plane are termed as Support
Vector. Since these vectors support the hyper plane, hence called a
Support vector.

How does SVM works?
• Linear SVM:
• The working of the SVM algorithm can be understood by using an
example.
• Suppose we have a dataset that has two tags (green and blue), and
the dataset has two features x1 and x2.
• We want a classifier that can classify the pair(x1, x2) of coordinates in
either green or blue. Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily
separate these two classes. But there can be multiple lines that can
separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision
boundary; this best boundary or region is called as a hyper plane.

• SVM algorithm finds the closest point of the lines from both the
classes.
• These points are called support vectors. The distance between the
vectors and the hyperplane is called as margin.
• And the goal of SVM is to maximize this margin. The hyperplane with
maximum margin is called the optimal hyperplane.

Non-Linear SVM:
• If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data, we cannot draw a single straight line.
Consider the below image:

• So to separate these data points, we need to add one more
dimension. For linear data, we have used two dimensions x and y, so
for non-linear data, we will add a third dimension z. It can be
calculated as:
• By adding the third dimension, the sample space will become as
below image:

• So now, SVM will divide the datasets into classes in the following way.
Consider the below image:

Hence we get a circumference of radius 1 in case
of non-linear data.
Since we are in 3-d Space, hence it is looking like a plane parallel to the
x-axis. If we convert it in 2d space with z=1, then it will become as:

What is Kernel?
• A kernel is a function used in SVM for helping to solve problems. They
provide shortcuts to avoid complex calculations.
• The amazing thing about kernel is that we can go to higher
dimensions and perform smooth calculations with the help of it.
• We can go up to an infinite number of dimensions using kernels.
• Sometimes, we cannot have a hyper plane for certain problems. This
problem arises when we go up to higher dimensions and try to form a
hyper plane.
• A kernel helps to form the hyper plane in the higher dimension
without raising the complexity.

Working of Kernel Functions
• Kernels are a way to solve non-linear problems with the help of linear
classifiers.
• This is known as the kernel trick method. The kernel functions are
used as parameters in the SVM codes.
• They help to determine the shape of the hyper plane and decision
boundary.
• We can set the value of the kernel parameter in the SVM code.
• The value can be any type of kernel from linear to polynomial. If the
value of the kernel is linear then the decision boundary would be
linear and two-dimensional.

• These kernel functions also help in giving decision boundaries for
higher dimensions.
• We do not need to do complex calculations. The kernel functions do
all the hard work. We just have to give the input and use the
appropriate kernel. Also, we can solve the overfitting problem in SVM
using the kernel functions.
• Overfitting happens when there are more feature sets than sample
sets in the data. We can solve the problem by either increasing the
data or by choosing the right kernel.

Types of Kernel Functions
1. Polynomial Kernel Function
• The polynomial kernel is a general representation of kernels with a
degree of more than one. It’s useful for image processing.
• There are two types of this:
• Homogenous Polynomial Kernel Function
• K(xi,xj) = (xi.xj)d, where ‘.’ is the dot product of both the numbers and
d is the degree of the polynomial.

• Inhomogeneous Polynomial Kernel Function
• K(xi,xj) = (xi.xj + c)d where c is a constant.
2. Gaussian RBF Kernel Function
• RBF is the radial basis function. This is used when there is no prior
knowledge about the data.
• It’s represented as:
• K(xi,xj) = exp(-γ||xi – xj||)2

3. Sigmoid Kernel Function
• This is mainly used in neural networks. We can show this as:
• K(xi,xj) = tanh(αxay + c)

Why do we need to use support vector
machines.
• SVMs are used in applications like handwriting recognition, intrusion
detection, face detection, email classification, gene classification, and
in web pages.
• This is one of the reasons we use SVMs in machine learning.
• It can handle both classification and regression on linear and non-
linear data.

• Why SVM is an example of a large margin classifier?
• SVM is a type of classifier which classifies positive and negative
examples, here blue and red data points
• As shown in the image, the largest margin is found in order to avoid
overfitting ie,.. the optimal hyperplane is at the maximum distance
from the positive and negative examples(Equal distant from the
boundary lines).
• To satisfy this constraint, and also to classify the data points
accurately, the margin is maximised, that is why this is called the large
margin classifier.

• What is a kernel in SVM? Why do we use kernels in SVM?
SVM algorithms use a set of mathematical functions that are defined
as the kernel.
The function of kernel is to take data as input and transform it into the
required form.
Different SVM algorithms use different types of kernel functions.
These functions can be different types.
For example linear, nonlinear, polynomial, radial basis function (RBF),
and sigmoid.

• Introduce Kernel functions for sequence data, graphs, text, images,
as well as vectors.
• The most used type of kernel function is RBF. Because it has localized
and finite response along the entire x-axis.
• The kernel functions return the inner product between two points in
a suitable feature space.
• Thus by defining a notion of similarity, with little computational cost
even in very high-dimensional spaces.

• What is generalization error in terms of the SVM?
Generalization error is generally the out-of-sample error which is the
measure of how accurately a model can predict values for previously
unseen data.

• Pros
• Effective on datasets with multiple features, like financial or medical
data.
• Effective in cases where number of features is greater than the
number of data points.
• Uses a subset of training points in the decision function called
support vectors which makes it memory efficient.
• Different kernel functions can be specified for the decision function.
You can use common kernels, but it's also possible to specify custom
kernels.

• Cons
• If the number of features is a lot bigger than the number of data
points, avoiding over-fitting when choosing kernel functions and
regularization term is crucial.
• SVMs don't directly provide probability estimates. Those are
calculated using an expensive five-fold cross-validation.
• Works best on small sample sets because of its high training time.

• Cross-validation
Cross-validation is a resampling method that uses different portions of
the data to test and train a model on different iterations. It is mainly
used in settings where the goal is prediction, and one wants to
estimate how accurately a predictive model will perform in practice.

Unit 2 ml.pptx

Recommended

Recommended

More Related Content

Similar to Unit 2 ml.pptx

Similar to Unit 2 ml.pptx (20)

More from PradeeshSAI

More from PradeeshSAI (6)

Recently uploaded

Recently uploaded (20)

Unit 2 ml.pptx