SlideShare a Scribd company logo
Shreyas GS
PREDICTIVE ANALYSIS USING
MACHINE LEARNING
Machine Learning
Page 1
INTRODUCTION
Machine Learning
"A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E."
Where
 E= The experience of playing many games of checkers.
 T= Task of playing checkers.
 P= The probability that the program will win the next game.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct output
should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into regression and classification problems. In
a regression problem, we are trying to predict results within a continuous output, meaning
that we are trying to map input variables to some continuous function. In a classification
problem, we are trying to predict results in a discrete output. In other words, we are trying
to map input variables into discrete categories.
Example:
Regression-Given data about the size of houses on the real estate market, try to predict their
price. Price as a function of size is a continuous output.
Classification-Bank have to decide whether or not they give loan to someone on this basis of
his credit history.
Machine Learning
Page 2
Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with little or no
idea what our results should look like. We can derive structure from data where we don't
necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the
variables in the data.
Example:
Clustering: Take a collection of 1000 essays written on the US Economy, and find a way to
automatically group these essays into a small number that are somehow similar or related by
different variables, such as word frequency, sentence length, page count, and so on.
1. Linear Regression With One Variable
Linear regression with one variable is also known as "univariate linear regression."
Univariate linear regression is used when you want to predict a single output value from a
single input value. We're doing supervised learning here, so that means we already have an
idea what the input/output cause and effect should be.
The Hypothesis Function
Our hypothesis function has the general form: hθ(x)= θ0 + θ1x.
We are trying to create a function called hθ that is able to reliably map our input data(x’s) to
our output data(y’s).
Example:
X(input) y(output)
0 4
1 7
2 7
3 8
Machine Learning
Page 3
Now we can make a random guess about our hθ function: θ0=2 and θ1=2. The hypothesis
function becomes hθ(x)= 2 + 2x.
So for input of 1 to our hypothesis, y will be 4. This is off by 3.
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes
an average (actually a fancier version of an average) of all the results of the hypothesis with
inputs from x's compared to the actual output y's.
J(θ0,θ1) = (1/2m)∑ (𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)2
To break it apart, it is ½ 𝑥̅ where 𝑥̅ is the mean of the squares of hθ(xi) - yi, or the difference
between the predicted value and the actual value.
This function is otherwise called the "Squared error function", or "Mean squared error". The
mean is halved (1/2m) as a convenience for the computation of the gradient descent, as the
derivative term of the square function will cancel out the ½ term.
Now we are able to concretely measure the accuracy of our predictor function against the
correct results we have so that we can predict new results we don't have.
Gradient Descent
So we have our hypothesis function and we have a way of measuring how accurate it is. Now
what we need is a way to automatically improve our hypothesis function. That's where
gradient descent comes in.
Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are
graphing the cost function for the combinations of parameters). We are not graphing x and y
itself, but the parameter range of our hypothesis function and the cost resulting from
selecting particular set of parameters.
We put θ0 on X axis and θ1 on Y axis, with the cost function on the vertical z axis. The points
on our graph will be the result of the cost function using our hypothesis with those specific
theta parameters.
We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum.
Machine Learning
Page 4
The way we do this is by taking the derivative (the line tangent to a function) of our cost
function. The slope of the tangent is the derivative at that point and it will give us a direction
to move towards. We make steps down that derivative by the parameter α, called the learning
rate.
The gradient descent equation is:
Repeat until convergence:
θj ∶= θj -α
𝜹
𝜹𝜽𝐣
J(θ0, θ1)
for j=0 and j=1.
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent
equation can be derived. We can substitute our actual cost function and our actual hypothesis
function and modify the equation.
Repeat until convergence: {
θ0 ∶= θ0-α
𝟏
𝒎
∑ (𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)
θ1 ∶= θ1-α
𝟏
𝒎
∑ ((𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)xi)
}
Where m is the size of the training set, θ0 a constant that will be changing simultaneously with
θ1 and xi and yi are values of the given training set.
We have separated out the two cases for θj and that for θ1 we are multiplying xi at the end
due to derivative.
The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more accurate.
Machine Learning
Page 5
Linear Regression With Multiple Variables
Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables.
xj
(i) = value of j in the ith training example.
x(i) = the column vector of all the feature inputs of the ith training example.
m = number of training examples.
n = |x(i)| (the number of features)
Now we define the multivariable form of the hypothesis function as follows, accommodating
these multiple features:
hθ(x)= θ0 + θ1x1 + θ2x2 + θ3x3 +….+ θnxn
In order to have little intuition about this function, we can think about θ0 as the basic price of
a house, θ1 as the price per square metre, θ2 as the price per floor, etc. x1 will be the number
of square meters in the floor, x2 the number of floors in the house, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be
concisely represented as:
hθ(x)= [ θ0 θ1 θ2 …. θn][
x0
⋮
xn
] = θTx
This is a vectorization of our hypothesis function for one training example.
Now we can collect all m training examples each with n features and record them in an n+1
by m matrix. In this matrix we let the value of the subscript (feature) also represent the row
number (except the initial row is the "zeroth" row), and the value of the superscript (the
training example) also represent the column number, as shown here:
X=[
x0(1) ⋯ x0(m)
⋮ ⋮ ⋮
xn(1) ⋯ xn(m)
]
Notice above that the first column is the first training example (like the vector above), the
second column is the second training example, and so forth.
Now we can define hθ(X) as a row vector that gives the value of hθ(x) at each of the m training
examples:
hθ(X) = θTX
Machine Learning
Page 6
If we store the training examples in X row wise, the hypothesis function can be calculated as:
hθ(X) = Xθ
Cost Function
For the parameter vector θ, the cost function is calculated as:
J(θ) = (1/2m)∑ (𝒉𝜽(x(i)𝒎
𝒊=𝟏 ) − 𝐲(i))2
Gradient Descent For Multiple Variables
The gradient descent equation itself is generally the same form; we just have to repeat it for
our 'n' features:
Repeat until convergence: {
θ0 ∶= θ0 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟎(i)
θ1 ∶= θ1 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟏(i)
θ2 ∶= θ0 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟐(i)
…..
}
In other words:
Repeat until convergence: {
θj ∶= θj - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝐣(i)
for j = 0….n
}
Machine Learning
Page 7
Feature Normalization
We can speed up gradient descent by having each of our input values in roughly the same
range. This is because θ will descend quickly on small ranges and slowly on large ranges, and
so will oscillate inefficiently down to the optimum when the variables are very uneven.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling
involves dividing the input values by the range (i.e. the maximum value minus the minimum
value) of the input variable, resulting in a new range of just 1. Mean normalization involves
subtracting the average value for an input variable from the values for that input variable,
resulting in a new average value for the input variable of just zero. To implement both of
these techniques, adjust your input values as shown in this formula:
xi ∶=
𝒙𝒊−µ𝒊
𝒔𝒊
µi is the average of all the values and si is the standard deviation.
Predictive Analysis Using Machine Learning-1
In this part of Machine Learning, we will implement linear regression with one variable to
predict the outcome using Octave.
Problem
Suppose you are the CEO of a restaurant franchise and are considering different cities for
opening a new outlet. The chain already has trucks in various cities and you have data for
profits and populations from the cities. You would like to use this data to help you select
which city to expand to next.
Dataset
The file ex1data1.txt contains the dataset for our linear regression problem. The first column
is the population of a city and the second column is the profit of a food truck in that city. A
negative value for profit indicates a loss.
Plotting The Data
Before starting on any task, it is often useful to understand the data by visualizing it. For this
dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot
(profit and population).
Machine Learning
Page 8
Code:
function plotData(X, y)
plot(X,y,'rx','MarkerSize',10);
ylabel('Profit in $10,000s');
xlabel('Population of City in 10,000s');
Gradient Descent
We will fit the linear regression parameters θ to our dataset using gradient descent.
Code:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
Machine Learning
Page 9
delta = (1/m)*sum(X.*repmat((X*theta - y), 1, size(X,2)));
theta = (theta' - (alpha * delta))';
J_history(iter) = computeCost(X, y, theta);
end
Computing Cost
We will implement a function to calculate J(θ) so you can check the convergence of gradient
descent implementation.
Code:
function J = computeCost(X, y, theta)
m = length(y);
J = (1/(2*m))*sum(power((X*theta - y),2));
end
Plotting Linear Fit
We plot X variable values on x axis and X*theta values on y axis which gives us a linear fit from
which we can predict y values for unseen examples.
Code:
plot(X(:,2), X*theta, '-')
Machine Learning
Page 10
Working
A good way to verify that gradient descent is working correctly is to look at the value of J(θ)
and check that it is decreasing with each step. The starter code of gradient descent calls
compute cost on every iteration and prints the cost.
Assuming we have implemented gradient descent and compute cost correctly, our value of
J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
Prediction
The final values of θ will be used to make predictions on profits in areas of 35,000 and 70,000.
Code:
predict1 = [1, 35000] *theta;
predict2 = [1, 70000] * theta;
Visualizing Cost Function
To understand the cost function J(θ) better, we will now plot the cost over a 2-dimensional
grid of θ0 and θ1 values. A code is set up to calculate J(θ) over a grid of values using the
compute cost function.
Code:
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);
J_vals = zeros(length(theta0_vals), length(theta1_vals));
for i = 1:length(theta0_vals)
for j = 1:length(theta1_vals)
t = [theta0_vals(i); theta1_vals(j)];
J_vals(i,j) = computeCost(X, y, t);
end
end
surf(theta0_vals, theta1_vals, J_vals)
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
Machine Learning
Page 11
Surface Graph
Contour Plot
Machine Learning
Page 12
The purpose of these graphs is to show you that how J(θ) varies with changes in θ0 and θ1.
The cost function J(θ) is bowl-shaped and has a global minimum.
This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves
closer to this point.
Program Execution
Machine Learning
Page 13
Predictive Analysis Using Machine Learning-2
In this part, we will implement linear regression with multiple variables to predict the prices
of houses.
Problem
Suppose you are selling your house and you want to know what a good market price would
be. One way to do this is to first collect information on recent houses sold and make a model
of housing prices.
Dataset
The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first
column is the size of the house (in square feet), the second column is the number of
bedrooms, and the third column is the price of the house.
Feature Normalization
By examining the dataset values, note that house sizes are about 1000 times the number of
bedrooms.
When features differ by orders of magnitude, first performing feature scaling can make
gradient descent converge much more quickly.
The standard deviation is a way of measuring how much variation there is in the range of
values of a particular feature.
Code:
function [X_norm, mu, sigma] = featureNormalize(X)
mu = mean(X);
sigma = std(X);
X_norm = (X - repmat(mu, [size(X,1), 1]))./repmat(sigma, [size(X,1), 1]);
end
Machine Learning
Page 14
Gradient Descent
Previously, we implemented gradient descent on a univariate regression problem. The only
difference now is that there is one more feature in the matrix X. The hypothesis function and
the batch gradient descent update rule remain unchanged.
Code:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
hy = X*theta-y;
hyr = repmat(hy, [1, size(X,2)]);
prodhyrx = hyr.*X;
sumVec = sum(prodhyrx)';
theta = theta - alpha.*sumVec./m;
J_history(iter) = computeCostMulti(X, y, theta);
end
end
Compute Cost
Code:
function J = computeCostMulti(X, y, theta)
m = length(y);
J = 0;
J = 1/(2*m) * (X * theta - y)' * (X * theta - y);
end
Machine Learning
Page 15
Convergence Graph
We plot a graph of cost J(θ) on Y axis against the number of iterations on the X axis. We
observe that the cost is reducing with each iteration towards the minimum. This minimum is
the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.
Working
A good way to verify that gradient descent is working correctly is to look at the value of J(θ)
and check that it is decreasing with each step. The starter code of gradient descent calls
compute cost on every iteration and prints the cost.
Assuming we have implemented gradient descent and compute cost correctly, our value of
J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
Machine Learning
Page 16
Prediction
We now predict the price of a house of an unseen example using the optimal values of theta
obtained from the gradient descent.
The price of a house with 1650 square meters having 3 bedrooms is:
Code:
X = [1650, 3];
X_norm = (X-mu)./sigma;
price = [1, X_norm]*theta;
Price: $293237.1614
Normal Equation
The closed-form solution to linear regression is:
θ = (XTX)-1XT
y
Using this formula does not require any feature scaling, and you will get an exact solution in
one calculation. There is no "loop until convergence" like in gradient descent.
Code:
function [theta] = normalEqn(X, y)
theta = zeros(size(X, 2), 1);
theta = pinv(X'*X)*X'*y;
end
Prediction
Code:
price = [1 1650 3]*theta;
Machine Learning
Page 17
Program Execution
Machine Learning
Page 18
2. Logistic Regression
Now we are switching from regression problems to classification problems. The name
‘Logistic Regression’ is an approach to classification problem.
Classification
Instead of our output vector y being a continuous range of values, it will only be 0 or 1.
Y ∈ {0,1}
Where 0 is usually taken as the "negative class" and 1 as the "positive class", but we are free
to assign any representation to it. We're only doing two classes for now, called a "Binary
Classification Problem."
One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all
less than 0.5 as a 0. This method doesn't work well because classification is not actually a
linear function. Therefore, we use a sigmoid function to classify.
Hypothesis Representation
Our hypothesis should satisfy:
0≤hθ(x)≤1
Our new form uses the sigmoid function, also called the “logistic function”:
hθ(x) = g(θTx)
Z= θTx
g(z) =
1
1+𝑒−𝑧
The function g(z), shown below, maps any real number to the (0, 1) interval, making it useful
for transforming an arbitrary-valued function into a function better suited for classification.
We start with our old hypothesis (linear regression), except that we want to restrict the range
to 0 and 1. This is accomplished by plugging θTx into the Logistic Function.
Machine Learning
Page 19
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis
function as follows:
hθ(x) ≥ 0.5 → y=1
hθ(x) < 0.5 → y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero,
its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z=0, e0=1, g(z)=1/2
z →∞, e−∞ →0, g(z)=1
Machine Learning
Page 20
z→ −∞, e∞ →∞, g(z)=0
So if our input to g is θTx, then that means:
hθ(x) = g(θTx) ≥ 0.5
when θTx ≥ 0
From these statements we can now say:
θTx ≥ 0 → y = 1
θTx < 0 → y = 0
The decision boundary is the line that separates the area where y=0 and where y=1. It is
created by our hypothesis function.
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic
Function will cause the output to be wavy, causing many local optima. In other words, it will
not be a convex function.
Instead, our cost function for logistic regression looks like:
J(θ) = (1/m)∑ 𝐂𝐨𝐬𝐭(𝒉𝜽(x(i)𝒎
𝒊=𝟏 ) − 𝐲(i))
Cost(hθ(x),y) = −log(hθ(x)) if y = 1
Cost(hθ(x),y) = −log(1−hθ(x)) if y = 0
Machine Learning
Page 21
The more our hypothesis is off from y, the larger the cost function output. If our hypothesis
is equal to y, then our cost is 0:
Cost(hθ(x),y) = 0 if hθ(x) = y
Cost(hθ(x),y)→∞ if y = 0and hθ(x)→1
Cost(hθ(x),y)→∞ if y = 1and hθ(x)→0
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also
outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs
1. If our hypothesis approaches 0, then the cost function will approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic
regression.
Simplified Cost Function And Gradient Descent
We can compress our cost function's two conditional cases into one case:
Cost(hθ(x),y) = −ylog(hθ(x)) − (1−y)log(1−hθ(x))
Notice that when y is equal to 1, then the second term ((1−y)log(1−hθ(x))) will be zero and will
not affect the result. If y is equal to 0, then the first term (−ylog(hθ(x))) will be zero and will
not affect the result.
We can fully write out our entire cost function as follows:
J(θ) = -
𝟏
𝒎
∑ .𝒎
𝒊=𝟏 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i)))
Gradient Descent:
Repeat {
θj :=θj −
𝛼
𝑚
∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))x(i)j
}
Machine Learning
Page 22
Regularization
Regularization is designed to address the problem of overfitting.
High bias or underfitting is when the form of our hypothesis maps poorly to the trend of the
data. It is usually caused by a function that is too simple or uses too few features. Ex. if we
take hθ(x)=θ0+θ1x1+θ2x2 then we are making an initial assumption that a linear model will fit
the training data well and will be able to generalize but that may not be the case.
At the other extreme, overfitting or high variance is caused by a hypothesis function that fits
the available data but does not generalize well to predict new data. It is usually caused by a
complicated function that creates a lot of unnecessary curves and angles unrelated to the
data.
This terminology is applied to both linear and logistic regression.
There are two main options to address the issue of overfitting:
1. Reduce the number of features
2. Regularization – Keep all features, reduce parameters θj.
Regularization works well when we have a lot of slightly useful features.
Cost Function
If we have overfitting from our hypothesis function, we can reduce the weight that some of
the terms in our function carry by increasing their cost.
Say we wanted to make the following function more quadratic:
θ0+θ1x+θ2x2+θ3x3+θ4x4
We'll want to eliminate the influence of θ3x3 and θ4x4. Without actually getting rid of these
features or changing the form of our hypothesis, we can instead modify our cost function:
minθ
1
2𝑚
∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))2 + 1000⋅θ2
3+1000⋅θ2
4
We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for
the cost function to get close to zero, we will have to reduce the values of θ3 and θ4 to near
zero. This will in turn greatly reduce the values of θ3x3 and θ4x4 in our hypothesis function.
We could also regularize all of our theta parameters in a single summation:
Machine Learning
Page 23
minθ
1
2𝑚
[ ∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))2 + λ∑ (𝑛
𝑗=1 θ2
j)]
The λ, or lambda, is the regularization parameter. It determines how much the costs of our
theta parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our
hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth
out the function too much and cause underfitting.
Regularized Linear Regression
We will modify our gradient descent function to separate out θ0 from the rest of the
parameters because we do not want to penalize θ0.
Repeat {
θ0 ∶= θ0 - α
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). x0(i)
θj ∶= θj – α[(
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). xj(i)) +
𝜆
𝑚
θj]
}
The term
𝜆
𝑚
θj performs our regularization.
Regularized Logistic Regression
We can regularize logistic regression in a similar way that we regularize linear regression. Let's
start with the cost function.
J(θ) = -
1
𝑚
∑ .𝑚
𝑖=1 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i))) +
𝜆
2𝑚
∑ (𝑛
𝑗=1 θ2
j)
Gradient Descent:
Repeat {
θ0 ∶= θ0 - α
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). x0(i)
θj ∶= θj – α[(
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). xj(i)) +
𝜆
𝑚
θj]
}
Machine Learning
Page 24
Predictive Analysis Using Machine Learning-3
In this part, we will build a logistic regression model to predict whether a student gets
admitted into a university.
Problem
Suppose that you are the administrator of a university department and you want to
determine each applicant's chance of admission based on their results on two exams.
Our task is to build a classification model that estimates an applicant's probability of
admission based the scores from those two exams.
Dataset
We have historical data from previous applicants that we use as a training set for logistic
regression. For each training example, we have the applicant's scores on two exams and the
admissions decision.
Visualizing The Data
We code the function plotData so that it displays a figure, where the axes are the two exam
scores, and the positive and negative examples are shown with different markers.
Code:
data = load('ex2data1.txt');
X = data(:, [1, 2]);
y = data(:, 3);
function plotData(X, y)
pos = find(y==1);
neg = find(y == 0);
plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);
end
plotData(X, y);
xlabel('Exam 1 score')
ylabel('Exam 2 score')
Machine Learning
Page 25
Cost Function and Gradient Descent
We will implement the cost function and gradient for logistic regression. Here we use these
functions only to calculate the initial cost and gradient using initial theta(zeros).
Code:
initial_theta = zeros(n + 1, 1);
function [J, grad] = costFunction(theta, X, y)
m = length(y);
J = 0;
grad = zeros(size(theta));
J = (1 / m) * sum( -y'*log(sigmoid(X*theta)) - (1-y)'*log( 1 - sigmoid(X*theta)) );
grad = (1 / m) * sum( X .* repmat((sigmoid(X*theta) - y), 1, size(X,2)) );
end
Machine Learning
Page 26
Sigmoid Function
We compute the sigmoid of each value of z (z can be a matrix, vector or scalar). This function
is called by other parts of the program during various stages.
Code:
function g = sigmoid(z)
g = zeros(size(z));
g = 1 ./ (1 + exp(-z));
end
Learning Parameters using fminunc
In the previous assignment, you found the optimal parameters of a linear regression model
by implementing gradient descent.
This time, instead of taking gradient descent steps, we will use an Octave built-in function
called fminunc. fminunc is an optimization solver that finds the minimum of an unconstrained
function.
Code:
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
We first define the options to be used with fminunc. Specifically, we set the GradObj option
to on, which tells fminunc that our function returns both the cost and the gradient. This allows
fminunc to use the gradient when minimizing the function. Furthermore, we set the MaxIter
option to 400, so that fminunc will run for at most 400 steps before it terminates.
To specify the actual function, we are minimizing, we use a "short-hand" for specifying
functions with the @(t) (costFunction(t, X, y)). This creates a function, with argument t, which
calls your costFunction. This allows us to wrap the costFunction for use with fminunc. fminunc
will converge on the right optimization parameters and return the final values of the cost and
θ.
Decision Boundary Plot
The final θ value obtained from fminunc will then be used to plot the decision boundary on
the training data.
Code:
function plotDecisionBoundary(theta, X, y)
plotData(X(:,2:3), y);
Machine Learning
Page 27
hold on
if size(X, 2) <= 3
plot_x = [min(X(:,2))-2, max(X(:,2))+2];
plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
axis([30, 100, 30, 100])
else
% Here is the grid range
u = linspace(-1, 1.5, 50);
v = linspace(-1, 1.5, 50);
z = zeros(length(u), length(v));
for i = 1:length(u)
for j = 1:length(v)
z(i,j) = mapFeature(u(i), v(j))*theta;
end
end
z = z';
contour(u, v, z, [0, 0], 'LineWidth', 2)
end
hold off
end
plotDecisionBoundary plots the data points X and y into a new figure with the decision
boundary defined by theta.
Machine Learning
Page 28
Prediction
After learning the parameters, we'll like to use it to predict the outcomes on unseen data. In
this part, we will use the logistic regression model to predict the probability that a student
with score 45 on exam 1 and score 85 on exam 2 will be admitted.
prob = sigmoid([1 45 85] * theta);
We compute the training accuracy using predict function. Predict computes the predictions
for X using a threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)
p = predict(theta, X);
m = size(X, 1);
p = zeros(m, 1);
p = sigmoid(X * theta)>=0.5 ;
end
fprintf('Train Accuracy: %fn', mean(double(p == y)) * 100);
Machine Learning
Page 29
Program Execution
Machine Learning
Page 30
3. Unsupervised Learning: Clustering
Unsupervised learning is contrasted from supervised learning because it uses an unlabelled
training set rather than a labelled one.
In other words, we don't have the vector y of expected results, we only have a dataset of
features where we can find structure.
Clustering is good for:
 Market segmentation
 Social network analysis
 Organizing computer clusters
 Astronomical data analysis
K-Means Algorithm
The K-Means Algorithm is the most popular and widely used algorithm for automatically
grouping data into coherent subsets.
The steps include:
1. Randomly initialize two points in the dataset called the cluster centroids.
2. Cluster assignment: assign all examples into one of two groups based on which cluster
centroid the example is closest to.
3. Move centroid: compute the averages for all the points inside each of the two cluster
centroid groups, then move the cluster centroid points to those averages.
4. Re-run (2) and (3) until we have found our clusters.
Our main variables are:
K (number of clusters)
Training set x(1), x(2),….,x(m)
Where x(i) ϵ IRn
Note that we will not use the x0=1 convention.
Machine Learning
Page 31
Algorithm:
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
for i = 1 to m:
c(i) := index (from 1 to K) of cluster centroid closest to x(i)
for k = 1 to K:
mu(k) := average (mean) of points assigned to cluster k
The first for-loop is the 'Cluster Assignment' step. We make a vector c where c(i) represents
the centroid assigned to example x(i).
We can write the operation of the Cluster Assignment step more mathematically as follows:
c(i) = argmink||x(i) - µk||
That is, each c(i) contains the index of the centroid that has minimal distance to x(i).
By convention, we square the right-hand-side, which makes the function we are trying to
minimize more sharply increasing.
The second for-loop is the 'Move Centroid' step where we move each centroid to the average
of its group.
The equation for this loop is as follows:
µk =
1
𝑛
[x(k1) + x(k2) + … + x(kn)] ϵ IRn
Where each of x(k1), x(k2),…, x(kn) are the training examples assigned to group μk.
If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialize that
centroid to a new point. You can also simply eliminate that cluster group.
After a number of iterations, the algorithm will converge, where new iterations do not affect
the clusters.
Optimization Objective
Recall some of the parameters we used in our algorithm:
c(i) = index of cluster (1,2,...,K) to which example x(i) is currently assigned
Machine Learning
Page 32
μk = cluster centroid k (μk ∈ IRn)
μc
(i) = cluster centroid of cluster to which example x(i) has been assigned
Using these variables, we can define our cost function:
J(c(i), … , c(m), μ1, … , μK)==
1
𝑚
∑ .𝑚
𝑖=1 ||x(i) - μc
(i)||2
Our optimization objective is to minimize all our parameters using the above cost function.
That is, we are finding all the values in sets c, representing all our clusters, and μ, representing
all our centroids, that will minimize the average of the distances of every training example to
its corresponding cluster centroid.
The above cost function is often called the distortion of the training examples.
In the cluster assignment step, our goal is to:
Minimize J(…) with c(1),…,c(m) (holding μ1,…,μK fixed)
In the move centroid step, our goal is to:
Minimize J(…) with μ1,…,μK
With k-means, it is not possible for the cost function to sometimes increase. It should always
descend.
Random Initialization
There's one particular recommended method for randomly initializing the cluster centroids.
1. Have K<m. That is, make sure the number of your clusters is less than the number of
your training examples.
2. Randomly pick K training examples.
3. Set μ1,…,μk equal to these K examples.
K-means can get stuck in local optima. To decrease the chance of this happening, we can run
the algorithm on many different random initializations.
for i = 1 to 100:
randomly initialize k-means
run k-means to get 'c' and 'µ'
compute the cost function (distortion) J(c,µ)
pick the clustering that gave us the lowest cost
Machine Learning
Page 33
Choosing the Number of Clusters
Choosing K can be quite arbitrary and ambiguous.
The elbow method: plot the cost J and the number of clusters K. The cost function should
reduce as we increase the number of clusters, and then flatten out. Choose K at the point
where the cost function starts to flatten out.
J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad
local optimum.
Another way to choose K is to observe how well k-means performs on a downstream purpose.
In other words, we choose K that proves to be most useful for some goal you're trying to
achieve from using these clusters.
Predictive Analysis Using Machine Learning-4
In this this part, we will implement the K-means algorithm and use it for image compression.
Problem
An image of resolution 128*128 pixels occupies 393,216 bits since each pixel is made up of 3
8-bit unsigned integer that represent RGB encoding, which specify the red, green, blue values
of each pixel. The image contains thousands of colours of varying RGB intensity. This causes
a lot of overhead on memory for storage.
The solution to this problem is to use less colours, in this case using K-means algorithm to
select the 16 colours that will be used to represent the compressed image.
Solution Intuition
In this problem, we only need to store the RGB values of the 16 selected colours, and for each
pixel in the image, we now need to only store the index of the colour at that location (where
only 4 bits are necessary to represent 16 possibilities).
The new representation requires some overhead storage in form of a dictionary of 16 colours,
each of which require 24 bits, but the image itself then only requires 4 bits per pixel location.
The final number of bits used is therefore 16*24 + 128*128*4 = 65,920 bits which corresponds
to compressing the original image by about a factor of 6.
Machine Learning
Page 34
K-means Implementation
The K-means algorithm is a method to automatically cluster similar data examples together.
The intuition behind K-means is an iterative procedure that starts by guessing the initial
centroids, and then refines this guess by repeatedly assigning examples to their closest
centroids and then re-computing the centroids based on the assignments.
Code:
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
idx = findClosestCentroids(X, centroids);
centroids = computeMeans(X, idx, K);
end
The first step in the loop is the cluster assignment step where each example is assigned to its
closest centroid. The second step is moving the centroid based on the means of all points
belonging to that particular cluster. The K-means algorithm will always converge to some final
set of means for the centroids. Note that the converged solution may not always be ideal and
depends on the initial setting of the centroids. Therefore, in practice the K-means algorithm
is usually run a few times with different random initializations.
Initializing centroids
A good strategy for initializing the centroids is to select random examples from the training
set.
Code:
function centroids = kMeansInitCentroids(X, K)
centroids = zeros(K, size(X, 2));
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :)
end
Finding Closest Centroids
In the "cluster assignment" phase of the K-means algorithm, the algorithm assigns every
training example x(i) to its closest centroid, given the current positions of centroids.
Code:
function idx = findClosestCentroids(X, centroids)
Machine Learning
Page 35
K = size(centroids, 1);
idx = zeros(size(X,1), 1);
m = size(X,1)
for i = 1:m
distance_array = zeros(1,K);
for j = 1:K
distance_array(1,j) = sqrt(sum(power((X(i,:)-centroids(j,:)),2)));
end
[d, d_idx] = min(distance_array);
idx(i,1) = d_idx;
end
Computing Centroid means
Given assignments of every point to a centroid, the second phase of the algorithm
recomputes, for each centroid, the mean of the points that were assigned to it.
Code:
function centroids = computeCentroids(X, idx, K)
[m n] = size(X);
centroids = zeros(K, n);
for i = 1:K
c_i = idx==i;
n_i = sum(c_i);
c_i_matrix = repmat(c_i,1,n);
X_c_i = X .* c_i_matrix;
centroids(i,:) = sum(X_c_i) ./ n_i;
end
K-means Intuition
Below are three different plots which visually show how clusters are assigned and centroids
are moved using K-means algorithm.
Machine Learning
Page 36
Machine Learning
Page 37
K-means on Pixels
We load the image to be compressed:
A = imread('bird small.png');
This creates a three-dimensional matrix A whose first two indices identify a pixel position and
whose last index represents red, green, or blue.
After finding the top K = 16 colors to represent the image, we can now assign each pixel
position to its closest centroid using the findClosestCentroids function. This allows you to
represent the original image using the centroid assignments of each pixel.
We can view the effects of the compression by reconstructing the image based only on the
centroid assignments. Specifically, we can replace each pixel location with the mean of the
centroid assigned to it.
idx = findClosestCentroids(X, centroids);
X_recovered = centroids(idx,:); //Recover Image
Reshape the recovered image into proper dimensions:
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
Machine Learning
Page 38
Below is the image reconstructed after compression:
Machine Learning
Page 39
Program Execution
Machine Learning
Page 40
Conclusion
Machine Learning has a wide area of applications such as Computer Vision, Natural Language
Processing, Bioinformatics, Stock Market Analysis, DNA sequence etc. Deep Learning is an
advancement in the field of Machine Learning. It’ll help us in discerning underlying patterns
and make better business decisions.
Bibliography
Andrew NG, Machine Learning, Stanford University.
Pang Ning Tan, Michael Steinbach, Vipin Kumar, Data Mining.
Susheela Devi, Pattern Recognition, IISC.
Author
My name is Shreyas GS. I’m very passionate about Data Analytics and Machine Learning. My
career goal is to be a Data Scientist. I’m also an avid reader of books and my other interests
include playing guitar, chess and football. I’m an eclectic learner with a keen interest in
astronomy.

More Related Content

What's hot

XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
Jaroslaw Szymczak
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
Derek Kane
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
Anandha L Ranganathan
 
Binary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine LearningBinary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine Learning
Paxcel Technologies
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
Kuppusamy P
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Simplilearn
 
Machine Learning and Data Mining: 04 Association Rule Mining
Machine Learning and Data Mining: 04 Association Rule MiningMachine Learning and Data Mining: 04 Association Rule Mining
Machine Learning and Data Mining: 04 Association Rule Mining
Pier Luca Lanzi
 
Nonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problemNonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problem
Michele Filannino
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
Arunabha Saha
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentation
AyanaRukasar
 
Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)
cairo university
 
Machine Learning-Linear regression
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regression
kishanthkumaar
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
Anandha L Ranganathan
 
Linear regression
Linear regressionLinear regression
Linear regression
SreerajVA
 
Logistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | DisadvantagesLogistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | Disadvantages
Rajat Sharma
 
Computer Vision with Deep Learning
Computer Vision with Deep LearningComputer Vision with Deep Learning
Computer Vision with Deep Learning
Capgemini
 
Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...
Edureka!
 
PCA Final.pptx
PCA Final.pptxPCA Final.pptx
PCA Final.pptx
HarisMasood20
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
Md. Enamul Haque Chowdhury
 
SVM
SVM SVM
SVM
Bangalore
 

What's hot (20)

XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
Binary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine LearningBinary Class and Multi Class Strategies for Machine Learning
Binary Class and Multi Class Strategies for Machine Learning
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
 
Machine Learning and Data Mining: 04 Association Rule Mining
Machine Learning and Data Mining: 04 Association Rule MiningMachine Learning and Data Mining: 04 Association Rule Mining
Machine Learning and Data Mining: 04 Association Rule Mining
 
Nonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problemNonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problem
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentation
 
Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)
 
Machine Learning-Linear regression
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regression
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Logistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | DisadvantagesLogistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | Disadvantages
 
Computer Vision with Deep Learning
Computer Vision with Deep LearningComputer Vision with Deep Learning
Computer Vision with Deep Learning
 
Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...Machine Learning Interview Questions and Answers | Machine Learning Interview...
Machine Learning Interview Questions and Answers | Machine Learning Interview...
 
PCA Final.pptx
PCA Final.pptxPCA Final.pptx
PCA Final.pptx
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
SVM
SVM SVM
SVM
 

Viewers also liked

Data science for advanced dummies
Data science for advanced dummiesData science for advanced dummies
Data science for advanced dummies
Saurav Chakravorty
 
Discriminant analysis basicrelationships
Discriminant analysis basicrelationshipsDiscriminant analysis basicrelationships
Discriminant analysis basicrelationshipsdivyakalsi89
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithmsguestfee8698
 
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014 Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Persontyle
 
Coursera machine learning week6
Coursera machine learning week6Coursera machine learning week6
Coursera machine learning week6
Kikuya Takumi
 
Personalizing a Stream of Content
Personalizing a Stream of ContentPersonalizing a Stream of Content
Personalizing a Stream of Content
Anthea Watson Strong
 
Red Blue Presentation
Red Blue PresentationRed Blue Presentation
Red Blue Presentation
Lincoln Jackson
 
Lets eat presentation_final_20160521
Lets eat presentation_final_20160521Lets eat presentation_final_20160521
Lets eat presentation_final_20160521
Lesley Chapman
 
CDTW Capstone Presentation
CDTW Capstone Presentation CDTW Capstone Presentation
CDTW Capstone Presentation
Todd Rutherford
 
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...
Brittne Kakulla, Ph.D.
 
Analysis of differential investor performance captstone presentation final
Analysis of differential investor  performance   captstone  presentation finalAnalysis of differential investor  performance   captstone  presentation final
Analysis of differential investor performance captstone presentation final
Howard Ho
 
Capital Bikeshare Presentation
Capital Bikeshare PresentationCapital Bikeshare Presentation
Capital Bikeshare Presentationdonahuerm
 
Probabilistic generative models for machine vision
Probabilistic generative models for machine visionProbabilistic generative models for machine vision
Probabilistic generative models for machine visionzukun
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning
Shiraz316
 
Machine learning
Machine learningMachine learning
Machine learning
Andrea Iacono
 
Georgetown Data Analytics - Team 1 Capstone Project
Georgetown Data Analytics - Team 1 Capstone ProjectGeorgetown Data Analytics - Team 1 Capstone Project
Georgetown Data Analytics - Team 1 Capstone Project
Mark Phillips
 
Georgetown Data Analytics Project (Team DC)
Georgetown Data Analytics Project (Team DC)Georgetown Data Analytics Project (Team DC)
Georgetown Data Analytics Project (Team DC)
Noah Turner
 
Hotel Performance FINAL
Hotel Performance FINALHotel Performance FINAL
Hotel Performance FINAL
team_hotelperformance
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
NYC Predictive Analytics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 

Viewers also liked (20)

Data science for advanced dummies
Data science for advanced dummiesData science for advanced dummies
Data science for advanced dummies
 
Discriminant analysis basicrelationships
Discriminant analysis basicrelationshipsDiscriminant analysis basicrelationships
Discriminant analysis basicrelationships
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
 
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014 Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
 
Coursera machine learning week6
Coursera machine learning week6Coursera machine learning week6
Coursera machine learning week6
 
Personalizing a Stream of Content
Personalizing a Stream of ContentPersonalizing a Stream of Content
Personalizing a Stream of Content
 
Red Blue Presentation
Red Blue PresentationRed Blue Presentation
Red Blue Presentation
 
Lets eat presentation_final_20160521
Lets eat presentation_final_20160521Lets eat presentation_final_20160521
Lets eat presentation_final_20160521
 
CDTW Capstone Presentation
CDTW Capstone Presentation CDTW Capstone Presentation
CDTW Capstone Presentation
 
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...
 
Analysis of differential investor performance captstone presentation final
Analysis of differential investor  performance   captstone  presentation finalAnalysis of differential investor  performance   captstone  presentation final
Analysis of differential investor performance captstone presentation final
 
Capital Bikeshare Presentation
Capital Bikeshare PresentationCapital Bikeshare Presentation
Capital Bikeshare Presentation
 
Probabilistic generative models for machine vision
Probabilistic generative models for machine visionProbabilistic generative models for machine vision
Probabilistic generative models for machine vision
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Georgetown Data Analytics - Team 1 Capstone Project
Georgetown Data Analytics - Team 1 Capstone ProjectGeorgetown Data Analytics - Team 1 Capstone Project
Georgetown Data Analytics - Team 1 Capstone Project
 
Georgetown Data Analytics Project (Team DC)
Georgetown Data Analytics Project (Team DC)Georgetown Data Analytics Project (Team DC)
Georgetown Data Analytics Project (Team DC)
 
Hotel Performance FINAL
Hotel Performance FINALHotel Performance FINAL
Hotel Performance FINAL
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to Machine learning

CS229 Machine Learning Lecture Notes
CS229 Machine Learning Lecture NotesCS229 Machine Learning Lecture Notes
CS229 Machine Learning Lecture Notes
Eric Conner
 
Machine learning (1)
Machine learning (1)Machine learning (1)
Machine learning (1)NYversity
 
X01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorieX01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorie
Marco Moldenhauer
 
2 linear regression with one variable
2 linear regression with one variable2 linear regression with one variable
2 linear regression with one variable
TanmayVijay1
 
2. Linear regression with one variable.pptx
2. Linear regression with one variable.pptx2. Linear regression with one variable.pptx
2. Linear regression with one variable.pptx
Emad Nabil
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_reportRavi Gupta
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling Distribution
Dexlab Analytics
 
working with python
working with pythonworking with python
working with python
bhavesh lande
 
lec18_ref.pdf
lec18_ref.pdflec18_ref.pdf
lec18_ref.pdf
vishal choudhary
 
Linear regression
Linear regression Linear regression
Linear regression
mohamed Naas
 
Function notation by sadiq
Function notation by sadiqFunction notation by sadiq
Function notation by sadiq
Sadiq Hussain
 
Functions.pptx
Functions.pptxFunctions.pptx
Functions.pptx
JokoAkhbarDelicano
 
Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...
Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...
Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...
IJEID :: International Journal of Excellence Innovation and Development
 
Regression_1.pdf
Regression_1.pdfRegression_1.pdf
Regression_1.pdf
Amir Saleh
 
function
functionfunction
function
som allul
 
Bootcamp of new world to taken seriously
Bootcamp of new world to taken seriouslyBootcamp of new world to taken seriously
Bootcamp of new world to taken seriously
khaled125087
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
Abhimanyu Dwivedi
 
Chapter on Functions and Graphs.ppt
Chapter on Functions and Graphs.pptChapter on Functions and Graphs.ppt
Chapter on Functions and Graphs.ppt
PhongLan30
 
functions
 functions  functions
functions Gaditek
 
Calculus Application Problem #3 Name _________________________.docx
Calculus Application Problem #3 Name _________________________.docxCalculus Application Problem #3 Name _________________________.docx
Calculus Application Problem #3 Name _________________________.docx
humphrieskalyn
 

Similar to Machine learning (20)

CS229 Machine Learning Lecture Notes
CS229 Machine Learning Lecture NotesCS229 Machine Learning Lecture Notes
CS229 Machine Learning Lecture Notes
 
Machine learning (1)
Machine learning (1)Machine learning (1)
Machine learning (1)
 
X01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorieX01 Supervised learning problem linear regression one feature theorie
X01 Supervised learning problem linear regression one feature theorie
 
2 linear regression with one variable
2 linear regression with one variable2 linear regression with one variable
2 linear regression with one variable
 
2. Linear regression with one variable.pptx
2. Linear regression with one variable.pptx2. Linear regression with one variable.pptx
2. Linear regression with one variable.pptx
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_report
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling Distribution
 
working with python
working with pythonworking with python
working with python
 
lec18_ref.pdf
lec18_ref.pdflec18_ref.pdf
lec18_ref.pdf
 
Linear regression
Linear regression Linear regression
Linear regression
 
Function notation by sadiq
Function notation by sadiqFunction notation by sadiq
Function notation by sadiq
 
Functions.pptx
Functions.pptxFunctions.pptx
Functions.pptx
 
Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...
Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...
Approximating the Bell-shaped Function based on Combining Hedge Algebras and ...
 
Regression_1.pdf
Regression_1.pdfRegression_1.pdf
Regression_1.pdf
 
function
functionfunction
function
 
Bootcamp of new world to taken seriously
Bootcamp of new world to taken seriouslyBootcamp of new world to taken seriously
Bootcamp of new world to taken seriously
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Chapter on Functions and Graphs.ppt
Chapter on Functions and Graphs.pptChapter on Functions and Graphs.ppt
Chapter on Functions and Graphs.ppt
 
functions
 functions  functions
functions
 
Calculus Application Problem #3 Name _________________________.docx
Calculus Application Problem #3 Name _________________________.docxCalculus Application Problem #3 Name _________________________.docx
Calculus Application Problem #3 Name _________________________.docx
 

More from Shreyas G S

President cerificate
President cerificatePresident cerificate
President cerificate
Shreyas G S
 
Club certificate
Club certificateClub certificate
Club certificate
Shreyas G S
 
Best President
Best PresidentBest President
Best President
Shreyas G S
 
Data Analysis and Statistical inference
Data Analysis and Statistical inferenceData Analysis and Statistical inference
Data Analysis and Statistical inference
Shreyas G S
 
RC E-City Annual Report-15-16.
RC E-City Annual Report-15-16.RC E-City Annual Report-15-16.
RC E-City Annual Report-15-16.
Shreyas G S
 
E-go issue 6 Beyond
E-go issue 6 Beyond E-go issue 6 Beyond
E-go issue 6 Beyond
Shreyas G S
 
Community and Social Service Projects
Community and Social Service ProjectsCommunity and Social Service Projects
Community and Social Service Projects
Shreyas G S
 
Report 13-14
Report 13-14Report 13-14
Report 13-14
Shreyas G S
 

More from Shreyas G S (8)

President cerificate
President cerificatePresident cerificate
President cerificate
 
Club certificate
Club certificateClub certificate
Club certificate
 
Best President
Best PresidentBest President
Best President
 
Data Analysis and Statistical inference
Data Analysis and Statistical inferenceData Analysis and Statistical inference
Data Analysis and Statistical inference
 
RC E-City Annual Report-15-16.
RC E-City Annual Report-15-16.RC E-City Annual Report-15-16.
RC E-City Annual Report-15-16.
 
E-go issue 6 Beyond
E-go issue 6 Beyond E-go issue 6 Beyond
E-go issue 6 Beyond
 
Community and Social Service Projects
Community and Social Service ProjectsCommunity and Social Service Projects
Community and Social Service Projects
 
Report 13-14
Report 13-14Report 13-14
Report 13-14
 

Recently uploaded

The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 

Recently uploaded (20)

The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 

Machine learning

  • 1. Shreyas GS PREDICTIVE ANALYSIS USING MACHINE LEARNING
  • 2. Machine Learning Page 1 INTRODUCTION Machine Learning "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Where  E= The experience of playing many games of checkers.  T= Task of playing checkers.  P= The probability that the program will win the next game. Supervised Learning In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Supervised learning problems are categorized into regression and classification problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories. Example: Regression-Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output. Classification-Bank have to decide whether or not they give loan to someone on this basis of his credit history.
  • 3. Machine Learning Page 2 Unsupervised Learning Unsupervised learning, on the other hand, allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables. We can derive this structure by clustering the data based on relationships among the variables in the data. Example: Clustering: Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number that are somehow similar or related by different variables, such as word frequency, sentence length, page count, and so on. 1. Linear Regression With One Variable Linear regression with one variable is also known as "univariate linear regression." Univariate linear regression is used when you want to predict a single output value from a single input value. We're doing supervised learning here, so that means we already have an idea what the input/output cause and effect should be. The Hypothesis Function Our hypothesis function has the general form: hθ(x)= θ0 + θ1x. We are trying to create a function called hθ that is able to reliably map our input data(x’s) to our output data(y’s). Example: X(input) y(output) 0 4 1 7 2 7 3 8
  • 4. Machine Learning Page 3 Now we can make a random guess about our hθ function: θ0=2 and θ1=2. The hypothesis function becomes hθ(x)= 2 + 2x. So for input of 1 to our hypothesis, y will be 4. This is off by 3. Cost Function We can measure the accuracy of our hypothesis function by using a cost function. This takes an average (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's compared to the actual output y's. J(θ0,θ1) = (1/2m)∑ (𝒉𝜽(xi 𝒎 𝒊=𝟏 ) − 𝐲i)2 To break it apart, it is ½ 𝑥̅ where 𝑥̅ is the mean of the squares of hθ(xi) - yi, or the difference between the predicted value and the actual value. This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved (1/2m) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the ½ term. Now we are able to concretely measure the accuracy of our predictor function against the correct results we have so that we can predict new results we don't have. Gradient Descent So we have our hypothesis function and we have a way of measuring how accurate it is. Now what we need is a way to automatically improve our hypothesis function. That's where gradient descent comes in. Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are graphing the cost function for the combinations of parameters). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting particular set of parameters. We put θ0 on X axis and θ1 on Y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.
  • 5. Machine Learning Page 4 The way we do this is by taking the derivative (the line tangent to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down that derivative by the parameter α, called the learning rate. The gradient descent equation is: Repeat until convergence: θj ∶= θj -α 𝜹 𝜹𝜽𝐣 J(θ0, θ1) for j=0 and j=1. Gradient Descent for Linear Regression When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation. Repeat until convergence: { θ0 ∶= θ0-α 𝟏 𝒎 ∑ (𝒉𝜽(xi 𝒎 𝒊=𝟏 ) − 𝐲i) θ1 ∶= θ1-α 𝟏 𝒎 ∑ ((𝒉𝜽(xi 𝒎 𝒊=𝟏 ) − 𝐲i)xi) } Where m is the size of the training set, θ0 a constant that will be changing simultaneously with θ1 and xi and yi are values of the given training set. We have separated out the two cases for θj and that for θ1 we are multiplying xi at the end due to derivative. The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
  • 6. Machine Learning Page 5 Linear Regression With Multiple Variables Linear regression with multiple variables is also known as "multivariate linear regression". We now introduce notation for equations where we can have any number of input variables. xj (i) = value of j in the ith training example. x(i) = the column vector of all the feature inputs of the ith training example. m = number of training examples. n = |x(i)| (the number of features) Now we define the multivariable form of the hypothesis function as follows, accommodating these multiple features: hθ(x)= θ0 + θ1x1 + θ2x2 + θ3x3 +….+ θnxn In order to have little intuition about this function, we can think about θ0 as the basic price of a house, θ1 as the price per square metre, θ2 as the price per floor, etc. x1 will be the number of square meters in the floor, x2 the number of floors in the house, etc. Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as: hθ(x)= [ θ0 θ1 θ2 …. θn][ x0 ⋮ xn ] = θTx This is a vectorization of our hypothesis function for one training example. Now we can collect all m training examples each with n features and record them in an n+1 by m matrix. In this matrix we let the value of the subscript (feature) also represent the row number (except the initial row is the "zeroth" row), and the value of the superscript (the training example) also represent the column number, as shown here: X=[ x0(1) ⋯ x0(m) ⋮ ⋮ ⋮ xn(1) ⋯ xn(m) ] Notice above that the first column is the first training example (like the vector above), the second column is the second training example, and so forth. Now we can define hθ(X) as a row vector that gives the value of hθ(x) at each of the m training examples: hθ(X) = θTX
  • 7. Machine Learning Page 6 If we store the training examples in X row wise, the hypothesis function can be calculated as: hθ(X) = Xθ Cost Function For the parameter vector θ, the cost function is calculated as: J(θ) = (1/2m)∑ (𝒉𝜽(x(i)𝒎 𝒊=𝟏 ) − 𝐲(i))2 Gradient Descent For Multiple Variables The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features: Repeat until convergence: { θ0 ∶= θ0 - α 𝟏 𝒎 ∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎 𝒊=𝟏 ). 𝐱𝟎(i) θ1 ∶= θ1 - α 𝟏 𝒎 ∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎 𝒊=𝟏 ). 𝐱𝟏(i) θ2 ∶= θ0 - α 𝟏 𝒎 ∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎 𝒊=𝟏 ). 𝐱𝟐(i) ….. } In other words: Repeat until convergence: { θj ∶= θj - α 𝟏 𝒎 ∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎 𝒊=𝟏 ). 𝐱𝐣(i) for j = 0….n }
  • 8. Machine Learning Page 7 Feature Normalization We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven. Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula: xi ∶= 𝒙𝒊−µ𝒊 𝒔𝒊 µi is the average of all the values and si is the standard deviation. Predictive Analysis Using Machine Learning-1 In this part of Machine Learning, we will implement linear regression with one variable to predict the outcome using Octave. Problem Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next. Dataset The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. Plotting The Data Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population).
  • 9. Machine Learning Page 8 Code: function plotData(X, y) plot(X,y,'rx','MarkerSize',10); ylabel('Profit in $10,000s'); xlabel('Population of City in 10,000s'); Gradient Descent We will fit the linear regression parameters θ to our dataset using gradient descent. Code: function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters) m = length(y); J_history = zeros(num_iters, 1); for iter = 1:num_iters
  • 10. Machine Learning Page 9 delta = (1/m)*sum(X.*repmat((X*theta - y), 1, size(X,2))); theta = (theta' - (alpha * delta))'; J_history(iter) = computeCost(X, y, theta); end Computing Cost We will implement a function to calculate J(θ) so you can check the convergence of gradient descent implementation. Code: function J = computeCost(X, y, theta) m = length(y); J = (1/(2*m))*sum(power((X*theta - y),2)); end Plotting Linear Fit We plot X variable values on x axis and X*theta values on y axis which gives us a linear fit from which we can predict y values for unseen examples. Code: plot(X(:,2), X*theta, '-')
  • 11. Machine Learning Page 10 Working A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. The starter code of gradient descent calls compute cost on every iteration and prints the cost. Assuming we have implemented gradient descent and compute cost correctly, our value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm. Prediction The final values of θ will be used to make predictions on profits in areas of 35,000 and 70,000. Code: predict1 = [1, 35000] *theta; predict2 = [1, 70000] * theta; Visualizing Cost Function To understand the cost function J(θ) better, we will now plot the cost over a 2-dimensional grid of θ0 and θ1 values. A code is set up to calculate J(θ) over a grid of values using the compute cost function. Code: theta0_vals = linspace(-10, 10, 100); theta1_vals = linspace(-1, 4, 100); J_vals = zeros(length(theta0_vals), length(theta1_vals)); for i = 1:length(theta0_vals) for j = 1:length(theta1_vals) t = [theta0_vals(i); theta1_vals(j)]; J_vals(i,j) = computeCost(X, y, t); end end surf(theta0_vals, theta1_vals, J_vals) contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
  • 12. Machine Learning Page 11 Surface Graph Contour Plot
  • 13. Machine Learning Page 12 The purpose of these graphs is to show you that how J(θ) varies with changes in θ0 and θ1. The cost function J(θ) is bowl-shaped and has a global minimum. This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point. Program Execution
  • 14. Machine Learning Page 13 Predictive Analysis Using Machine Learning-2 In this part, we will implement linear regression with multiple variables to predict the prices of houses. Problem Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. Dataset The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house. Feature Normalization By examining the dataset values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly. The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature. Code: function [X_norm, mu, sigma] = featureNormalize(X) mu = mean(X); sigma = std(X); X_norm = (X - repmat(mu, [size(X,1), 1]))./repmat(sigma, [size(X,1), 1]); end
  • 15. Machine Learning Page 14 Gradient Descent Previously, we implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged. Code: function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters) m = length(y); J_history = zeros(num_iters, 1); for iter = 1:num_iters hy = X*theta-y; hyr = repmat(hy, [1, size(X,2)]); prodhyrx = hyr.*X; sumVec = sum(prodhyrx)'; theta = theta - alpha.*sumVec./m; J_history(iter) = computeCostMulti(X, y, theta); end end Compute Cost Code: function J = computeCostMulti(X, y, theta) m = length(y); J = 0; J = 1/(2*m) * (X * theta - y)' * (X * theta - y); end
  • 16. Machine Learning Page 15 Convergence Graph We plot a graph of cost J(θ) on Y axis against the number of iterations on the X axis. We observe that the cost is reducing with each iteration towards the minimum. This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point. Working A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. The starter code of gradient descent calls compute cost on every iteration and prints the cost. Assuming we have implemented gradient descent and compute cost correctly, our value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
  • 17. Machine Learning Page 16 Prediction We now predict the price of a house of an unseen example using the optimal values of theta obtained from the gradient descent. The price of a house with 1650 square meters having 3 bedrooms is: Code: X = [1650, 3]; X_norm = (X-mu)./sigma; price = [1, X_norm]*theta; Price: $293237.1614 Normal Equation The closed-form solution to linear regression is: θ = (XTX)-1XT y Using this formula does not require any feature scaling, and you will get an exact solution in one calculation. There is no "loop until convergence" like in gradient descent. Code: function [theta] = normalEqn(X, y) theta = zeros(size(X, 2), 1); theta = pinv(X'*X)*X'*y; end Prediction Code: price = [1 1650 3]*theta;
  • 19. Machine Learning Page 18 2. Logistic Regression Now we are switching from regression problems to classification problems. The name ‘Logistic Regression’ is an approach to classification problem. Classification Instead of our output vector y being a continuous range of values, it will only be 0 or 1. Y ∈ {0,1} Where 0 is usually taken as the "negative class" and 1 as the "positive class", but we are free to assign any representation to it. We're only doing two classes for now, called a "Binary Classification Problem." One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. This method doesn't work well because classification is not actually a linear function. Therefore, we use a sigmoid function to classify. Hypothesis Representation Our hypothesis should satisfy: 0≤hθ(x)≤1 Our new form uses the sigmoid function, also called the “logistic function”: hθ(x) = g(θTx) Z= θTx g(z) = 1 1+𝑒−𝑧 The function g(z), shown below, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. We start with our old hypothesis (linear regression), except that we want to restrict the range to 0 and 1. This is accomplished by plugging θTx into the Logistic Function.
  • 20. Machine Learning Page 19 Decision Boundary In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: hθ(x) ≥ 0.5 → y=1 hθ(x) < 0.5 → y=0 The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5: g(z) ≥ 0.5 when z ≥ 0 Remember: z=0, e0=1, g(z)=1/2 z →∞, e−∞ →0, g(z)=1
  • 21. Machine Learning Page 20 z→ −∞, e∞ →∞, g(z)=0 So if our input to g is θTx, then that means: hθ(x) = g(θTx) ≥ 0.5 when θTx ≥ 0 From these statements we can now say: θTx ≥ 0 → y = 1 θTx < 0 → y = 0 The decision boundary is the line that separates the area where y=0 and where y=1. It is created by our hypothesis function. Cost Function We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function. Instead, our cost function for logistic regression looks like: J(θ) = (1/m)∑ 𝐂𝐨𝐬𝐭(𝒉𝜽(x(i)𝒎 𝒊=𝟏 ) − 𝐲(i)) Cost(hθ(x),y) = −log(hθ(x)) if y = 1 Cost(hθ(x),y) = −log(1−hθ(x)) if y = 0
  • 22. Machine Learning Page 21 The more our hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0: Cost(hθ(x),y) = 0 if hθ(x) = y Cost(hθ(x),y)→∞ if y = 0and hθ(x)→1 Cost(hθ(x),y)→∞ if y = 1and hθ(x)→0 If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity. If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity. Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression. Simplified Cost Function And Gradient Descent We can compress our cost function's two conditional cases into one case: Cost(hθ(x),y) = −ylog(hθ(x)) − (1−y)log(1−hθ(x)) Notice that when y is equal to 1, then the second term ((1−y)log(1−hθ(x))) will be zero and will not affect the result. If y is equal to 0, then the first term (−ylog(hθ(x))) will be zero and will not affect the result. We can fully write out our entire cost function as follows: J(θ) = - 𝟏 𝒎 ∑ .𝒎 𝒊=𝟏 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i))) Gradient Descent: Repeat { θj :=θj − 𝛼 𝑚 ∑ .𝑚 𝑖=1 (hθ(x(i))−y(i))x(i)j }
  • 23. Machine Learning Page 22 Regularization Regularization is designed to address the problem of overfitting. High bias or underfitting is when the form of our hypothesis maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. Ex. if we take hθ(x)=θ0+θ1x1+θ2x2 then we are making an initial assumption that a linear model will fit the training data well and will be able to generalize but that may not be the case. At the other extreme, overfitting or high variance is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data. This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting: 1. Reduce the number of features 2. Regularization – Keep all features, reduce parameters θj. Regularization works well when we have a lot of slightly useful features. Cost Function If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost. Say we wanted to make the following function more quadratic: θ0+θ1x+θ2x2+θ3x3+θ4x4 We'll want to eliminate the influence of θ3x3 and θ4x4. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function: minθ 1 2𝑚 ∑ .𝑚 𝑖=1 (hθ(x(i))−y(i))2 + 1000⋅θ2 3+1000⋅θ2 4 We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for the cost function to get close to zero, we will have to reduce the values of θ3 and θ4 to near zero. This will in turn greatly reduce the values of θ3x3 and θ4x4 in our hypothesis function. We could also regularize all of our theta parameters in a single summation:
  • 24. Machine Learning Page 23 minθ 1 2𝑚 [ ∑ .𝑚 𝑖=1 (hθ(x(i))−y(i))2 + λ∑ (𝑛 𝑗=1 θ2 j)] The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated. Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting. Regularized Linear Regression We will modify our gradient descent function to separate out θ0 from the rest of the parameters because we do not want to penalize θ0. Repeat { θ0 ∶= θ0 - α 1 𝑚 ∑ (hθ(x(i)) − y(i)𝑚 𝑖=1 ). x0(i) θj ∶= θj – α[( 1 𝑚 ∑ (hθ(x(i)) − y(i)𝑚 𝑖=1 ). xj(i)) + 𝜆 𝑚 θj] } The term 𝜆 𝑚 θj performs our regularization. Regularized Logistic Regression We can regularize logistic regression in a similar way that we regularize linear regression. Let's start with the cost function. J(θ) = - 1 𝑚 ∑ .𝑚 𝑖=1 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i))) + 𝜆 2𝑚 ∑ (𝑛 𝑗=1 θ2 j) Gradient Descent: Repeat { θ0 ∶= θ0 - α 1 𝑚 ∑ (hθ(x(i)) − y(i)𝑚 𝑖=1 ). x0(i) θj ∶= θj – α[( 1 𝑚 ∑ (hθ(x(i)) − y(i)𝑚 𝑖=1 ). xj(i)) + 𝜆 𝑚 θj] }
  • 25. Machine Learning Page 24 Predictive Analysis Using Machine Learning-3 In this part, we will build a logistic regression model to predict whether a student gets admitted into a university. Problem Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. Our task is to build a classification model that estimates an applicant's probability of admission based the scores from those two exams. Dataset We have historical data from previous applicants that we use as a training set for logistic regression. For each training example, we have the applicant's scores on two exams and the admissions decision. Visualizing The Data We code the function plotData so that it displays a figure, where the axes are the two exam scores, and the positive and negative examples are shown with different markers. Code: data = load('ex2data1.txt'); X = data(:, [1, 2]); y = data(:, 3); function plotData(X, y) pos = find(y==1); neg = find(y == 0); plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 2, 'MarkerSize', 7); plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7); end plotData(X, y); xlabel('Exam 1 score') ylabel('Exam 2 score')
  • 26. Machine Learning Page 25 Cost Function and Gradient Descent We will implement the cost function and gradient for logistic regression. Here we use these functions only to calculate the initial cost and gradient using initial theta(zeros). Code: initial_theta = zeros(n + 1, 1); function [J, grad] = costFunction(theta, X, y) m = length(y); J = 0; grad = zeros(size(theta)); J = (1 / m) * sum( -y'*log(sigmoid(X*theta)) - (1-y)'*log( 1 - sigmoid(X*theta)) ); grad = (1 / m) * sum( X .* repmat((sigmoid(X*theta) - y), 1, size(X,2)) ); end
  • 27. Machine Learning Page 26 Sigmoid Function We compute the sigmoid of each value of z (z can be a matrix, vector or scalar). This function is called by other parts of the program during various stages. Code: function g = sigmoid(z) g = zeros(size(z)); g = 1 ./ (1 + exp(-z)); end Learning Parameters using fminunc In the previous assignment, you found the optimal parameters of a linear regression model by implementing gradient descent. This time, instead of taking gradient descent steps, we will use an Octave built-in function called fminunc. fminunc is an optimization solver that finds the minimum of an unconstrained function. Code: options = optimset('GradObj', 'on', 'MaxIter', 400); [theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options); We first define the options to be used with fminunc. Specifically, we set the GradObj option to on, which tells fminunc that our function returns both the cost and the gradient. This allows fminunc to use the gradient when minimizing the function. Furthermore, we set the MaxIter option to 400, so that fminunc will run for at most 400 steps before it terminates. To specify the actual function, we are minimizing, we use a "short-hand" for specifying functions with the @(t) (costFunction(t, X, y)). This creates a function, with argument t, which calls your costFunction. This allows us to wrap the costFunction for use with fminunc. fminunc will converge on the right optimization parameters and return the final values of the cost and θ. Decision Boundary Plot The final θ value obtained from fminunc will then be used to plot the decision boundary on the training data. Code: function plotDecisionBoundary(theta, X, y) plotData(X(:,2:3), y);
  • 28. Machine Learning Page 27 hold on if size(X, 2) <= 3 plot_x = [min(X(:,2))-2, max(X(:,2))+2]; plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1)); plot(plot_x, plot_y) legend('Admitted', 'Not admitted', 'Decision Boundary') axis([30, 100, 30, 100]) else % Here is the grid range u = linspace(-1, 1.5, 50); v = linspace(-1, 1.5, 50); z = zeros(length(u), length(v)); for i = 1:length(u) for j = 1:length(v) z(i,j) = mapFeature(u(i), v(j))*theta; end end z = z'; contour(u, v, z, [0, 0], 'LineWidth', 2) end hold off end plotDecisionBoundary plots the data points X and y into a new figure with the decision boundary defined by theta.
  • 29. Machine Learning Page 28 Prediction After learning the parameters, we'll like to use it to predict the outcomes on unseen data. In this part, we will use the logistic regression model to predict the probability that a student with score 45 on exam 1 and score 85 on exam 2 will be admitted. prob = sigmoid([1 45 85] * theta); We compute the training accuracy using predict function. Predict computes the predictions for X using a threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1) p = predict(theta, X); m = size(X, 1); p = zeros(m, 1); p = sigmoid(X * theta)>=0.5 ; end fprintf('Train Accuracy: %fn', mean(double(p == y)) * 100);
  • 31. Machine Learning Page 30 3. Unsupervised Learning: Clustering Unsupervised learning is contrasted from supervised learning because it uses an unlabelled training set rather than a labelled one. In other words, we don't have the vector y of expected results, we only have a dataset of features where we can find structure. Clustering is good for:  Market segmentation  Social network analysis  Organizing computer clusters  Astronomical data analysis K-Means Algorithm The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets. The steps include: 1. Randomly initialize two points in the dataset called the cluster centroids. 2. Cluster assignment: assign all examples into one of two groups based on which cluster centroid the example is closest to. 3. Move centroid: compute the averages for all the points inside each of the two cluster centroid groups, then move the cluster centroid points to those averages. 4. Re-run (2) and (3) until we have found our clusters. Our main variables are: K (number of clusters) Training set x(1), x(2),….,x(m) Where x(i) ϵ IRn Note that we will not use the x0=1 convention.
  • 32. Machine Learning Page 31 Algorithm: Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K) Repeat: for i = 1 to m: c(i) := index (from 1 to K) of cluster centroid closest to x(i) for k = 1 to K: mu(k) := average (mean) of points assigned to cluster k The first for-loop is the 'Cluster Assignment' step. We make a vector c where c(i) represents the centroid assigned to example x(i). We can write the operation of the Cluster Assignment step more mathematically as follows: c(i) = argmink||x(i) - µk|| That is, each c(i) contains the index of the centroid that has minimal distance to x(i). By convention, we square the right-hand-side, which makes the function we are trying to minimize more sharply increasing. The second for-loop is the 'Move Centroid' step where we move each centroid to the average of its group. The equation for this loop is as follows: µk = 1 𝑛 [x(k1) + x(k2) + … + x(kn)] ϵ IRn Where each of x(k1), x(k2),…, x(kn) are the training examples assigned to group μk. If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialize that centroid to a new point. You can also simply eliminate that cluster group. After a number of iterations, the algorithm will converge, where new iterations do not affect the clusters. Optimization Objective Recall some of the parameters we used in our algorithm: c(i) = index of cluster (1,2,...,K) to which example x(i) is currently assigned
  • 33. Machine Learning Page 32 μk = cluster centroid k (μk ∈ IRn) μc (i) = cluster centroid of cluster to which example x(i) has been assigned Using these variables, we can define our cost function: J(c(i), … , c(m), μ1, … , μK)== 1 𝑚 ∑ .𝑚 𝑖=1 ||x(i) - μc (i)||2 Our optimization objective is to minimize all our parameters using the above cost function. That is, we are finding all the values in sets c, representing all our clusters, and μ, representing all our centroids, that will minimize the average of the distances of every training example to its corresponding cluster centroid. The above cost function is often called the distortion of the training examples. In the cluster assignment step, our goal is to: Minimize J(…) with c(1),…,c(m) (holding μ1,…,μK fixed) In the move centroid step, our goal is to: Minimize J(…) with μ1,…,μK With k-means, it is not possible for the cost function to sometimes increase. It should always descend. Random Initialization There's one particular recommended method for randomly initializing the cluster centroids. 1. Have K<m. That is, make sure the number of your clusters is less than the number of your training examples. 2. Randomly pick K training examples. 3. Set μ1,…,μk equal to these K examples. K-means can get stuck in local optima. To decrease the chance of this happening, we can run the algorithm on many different random initializations. for i = 1 to 100: randomly initialize k-means run k-means to get 'c' and 'µ' compute the cost function (distortion) J(c,µ) pick the clustering that gave us the lowest cost
  • 34. Machine Learning Page 33 Choosing the Number of Clusters Choosing K can be quite arbitrary and ambiguous. The elbow method: plot the cost J and the number of clusters K. The cost function should reduce as we increase the number of clusters, and then flatten out. Choose K at the point where the cost function starts to flatten out. J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad local optimum. Another way to choose K is to observe how well k-means performs on a downstream purpose. In other words, we choose K that proves to be most useful for some goal you're trying to achieve from using these clusters. Predictive Analysis Using Machine Learning-4 In this this part, we will implement the K-means algorithm and use it for image compression. Problem An image of resolution 128*128 pixels occupies 393,216 bits since each pixel is made up of 3 8-bit unsigned integer that represent RGB encoding, which specify the red, green, blue values of each pixel. The image contains thousands of colours of varying RGB intensity. This causes a lot of overhead on memory for storage. The solution to this problem is to use less colours, in this case using K-means algorithm to select the 16 colours that will be used to represent the compressed image. Solution Intuition In this problem, we only need to store the RGB values of the 16 selected colours, and for each pixel in the image, we now need to only store the index of the colour at that location (where only 4 bits are necessary to represent 16 possibilities). The new representation requires some overhead storage in form of a dictionary of 16 colours, each of which require 24 bits, but the image itself then only requires 4 bits per pixel location. The final number of bits used is therefore 16*24 + 128*128*4 = 65,920 bits which corresponds to compressing the original image by about a factor of 6.
  • 35. Machine Learning Page 34 K-means Implementation The K-means algorithm is a method to automatically cluster similar data examples together. The intuition behind K-means is an iterative procedure that starts by guessing the initial centroids, and then refines this guess by repeatedly assigning examples to their closest centroids and then re-computing the centroids based on the assignments. Code: centroids = kMeansInitCentroids(X, K); for iter = 1:iterations idx = findClosestCentroids(X, centroids); centroids = computeMeans(X, idx, K); end The first step in the loop is the cluster assignment step where each example is assigned to its closest centroid. The second step is moving the centroid based on the means of all points belonging to that particular cluster. The K-means algorithm will always converge to some final set of means for the centroids. Note that the converged solution may not always be ideal and depends on the initial setting of the centroids. Therefore, in practice the K-means algorithm is usually run a few times with different random initializations. Initializing centroids A good strategy for initializing the centroids is to select random examples from the training set. Code: function centroids = kMeansInitCentroids(X, K) centroids = zeros(K, size(X, 2)); randidx = randperm(size(X, 1)); centroids = X(randidx(1:K), :) end Finding Closest Centroids In the "cluster assignment" phase of the K-means algorithm, the algorithm assigns every training example x(i) to its closest centroid, given the current positions of centroids. Code: function idx = findClosestCentroids(X, centroids)
  • 36. Machine Learning Page 35 K = size(centroids, 1); idx = zeros(size(X,1), 1); m = size(X,1) for i = 1:m distance_array = zeros(1,K); for j = 1:K distance_array(1,j) = sqrt(sum(power((X(i,:)-centroids(j,:)),2))); end [d, d_idx] = min(distance_array); idx(i,1) = d_idx; end Computing Centroid means Given assignments of every point to a centroid, the second phase of the algorithm recomputes, for each centroid, the mean of the points that were assigned to it. Code: function centroids = computeCentroids(X, idx, K) [m n] = size(X); centroids = zeros(K, n); for i = 1:K c_i = idx==i; n_i = sum(c_i); c_i_matrix = repmat(c_i,1,n); X_c_i = X .* c_i_matrix; centroids(i,:) = sum(X_c_i) ./ n_i; end K-means Intuition Below are three different plots which visually show how clusters are assigned and centroids are moved using K-means algorithm.
  • 38. Machine Learning Page 37 K-means on Pixels We load the image to be compressed: A = imread('bird small.png'); This creates a three-dimensional matrix A whose first two indices identify a pixel position and whose last index represents red, green, or blue. After finding the top K = 16 colors to represent the image, we can now assign each pixel position to its closest centroid using the findClosestCentroids function. This allows you to represent the original image using the centroid assignments of each pixel. We can view the effects of the compression by reconstructing the image based only on the centroid assignments. Specifically, we can replace each pixel location with the mean of the centroid assigned to it. idx = findClosestCentroids(X, centroids); X_recovered = centroids(idx,:); //Recover Image Reshape the recovered image into proper dimensions: X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
  • 39. Machine Learning Page 38 Below is the image reconstructed after compression:
  • 41. Machine Learning Page 40 Conclusion Machine Learning has a wide area of applications such as Computer Vision, Natural Language Processing, Bioinformatics, Stock Market Analysis, DNA sequence etc. Deep Learning is an advancement in the field of Machine Learning. It’ll help us in discerning underlying patterns and make better business decisions. Bibliography Andrew NG, Machine Learning, Stanford University. Pang Ning Tan, Michael Steinbach, Vipin Kumar, Data Mining. Susheela Devi, Pattern Recognition, IISC. Author My name is Shreyas GS. I’m very passionate about Data Analytics and Machine Learning. My career goal is to be a Data Scientist. I’m also an avid reader of books and my other interests include playing guitar, chess and football. I’m an eclectic learner with a keen interest in astronomy.