Machine Learning
Supervised Learning: Logistic Regression
Linear regression
• Our goal is to estimate w from training data of ⟨x_i, y_i⟩ pairs.
• One way to find such a relationship is to minimize the least squares error:

  \arg\min_w \sum_i (y_i - w x_i)^2

Model: y = wx + \varepsilon
(Figure: scatter of ⟨X, Y⟩ points with the fitted line.)
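To make the least-squares objective concrete, here is a minimal NumPy sketch for the no-intercept model y = wx. The data is made up for illustration; the closed-form estimate follows from setting the derivative of the objective to zero.

import numpy as np

# Hypothetical data for illustration: y is roughly 2x plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Setting the derivative of sum_i (y_i - w*x_i)^2 to zero gives the
# closed-form estimate w = sum_i x_i*y_i / sum_i x_i^2.
w = np.sum(x * y) / np.sum(x * x)
print(w)  # about 2.0 for this data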
Applications of Linear Regression
Marks of a student based on the number of hours spent on preparation
• Simple linear regression
• Multiple linear regression
• Non-linear problem
Marks of a student based on the number of hours spent on preparation
Let's assume the marks of a student (M) depend on the number of hours (H) spent on preparation.
The following formula can represent the model:
Marks = function (No. of hours)
=> Marks = m*Hours + c
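As a quick sketch of fitting Marks = m*Hours + c, the snippet below uses np.polyfit on made-up (hours, marks) pairs; any real dataset would replace these placeholder values.

import numpy as np

# Hypothetical (hours, marks) pairs, made up for illustration.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([35, 45, 50, 62, 70, 78], dtype=float)

# np.polyfit with degree 1 fits marks = m*hours + c and returns (m, c).
m, c = np.polyfit(hours, marks, 1)
print("Marks = {:.2f}*Hours + {:.2f}".format(m, c))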
Marks of a student based on the number of hours spent on preparation
Let's plot the data to check whether it is a linear problem: plotting is the easiest way to tell.
Marks of a student based on the number of hours spent on preparation
(Figure slides: plots of the marks-vs-hours data.)
Marks of a student based on the number of hours spent on preparation
How do we determine the slope of the line, i.e., the value of m?
Marks of a student based on the number of hours spent on preparation
• The value of m (the slope of the line) can be determined using an objective function, which in general combines a loss function and a regularization term.
• For simple linear regression, the objective function is the sum of squared errors (the mean squared error, MSE).
• The best-fit line is obtained by minimizing this objective function, as the sketch below illustrates.
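To see what "minimizing the objective" means in practice, this sketch evaluates the MSE for a few candidate slopes on the same made-up data; for each slope the intercept is set to its optimal value c = mean(marks) - m*mean(hours).

import numpy as np

# Hypothetical data, made up for illustration.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([35, 45, 50, 62, 70, 78], dtype=float)

def mse(m, c):
    # Mean squared error of the line marks = m*hours + c.
    pred = m * hours + c
    return np.mean((marks - pred) ** 2)

# Scan candidate slopes; for each, use the optimal intercept.
for m in [4.0, 6.0, 8.0, 10.0, 12.0]:
    c = marks.mean() - m * hours.mean()
    print("m = {:5.1f}   MSE = {:8.2f}".format(m, mse(m, c)))
# The MSE is smallest near the best-fit slope (about 8.6 here).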
Predicting weight reduction in the form of the number of kgs reduced
• Let's assume weight reduction could depend on input features such as: age, height, the weight of the person, and the time spent on exercise.
Predicting weight reduction in the form of the number of kgs reduced
Weight Reduction = Function(Age, Height, Weight, Time On Exercise)
=> Weight Reduction = b1*Height + b2*Weight + b3*Age + b4*TimeOnExercise + b0
Predicting weight reduction in the form of the number of kgs reduced
As part of training the above model:
Goal: find the values of b1, b2, b3, b4, and b0 that minimize the objective function.
The objective function is the summation of squared errors, i.e., the sum of the squared differences between the actual and the predicted values over the different values of age, height, weight, and time on exercise.
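A minimal sketch of fitting this multiple regression by least squares. All feature values and targets below are invented placeholders; np.linalg.lstsq minimizes exactly the summed squared error described above.

import numpy as np

# Hypothetical rows: [age, height_cm, weight_kg, hours_of_exercise],
# with the observed weight reduction in kg. All values are made up.
X = np.array([
    [25, 170, 80, 5],
    [32, 165, 90, 3],
    [41, 180, 95, 4],
    [29, 175, 85, 6],
    [36, 160, 70, 2],
    [50, 168, 100, 1],
], dtype=float)
y = np.array([4.5, 2.0, 3.0, 5.5, 1.0, 0.5])

# Append a column of ones so the intercept b0 is learned as well.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Least squares finds [b1, b2, b3, b4, b0] minimizing the sum of
# squared differences between predicted and actual reduction.
coef, residuals, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
print(coef)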
Forecasting sales
Organizations often use linear regression models to
forecast future sales.
This can be helpful for things like budgeting and
planning.
Algorithms such as Amazon’s item-to-item
collaborative filtering are used to predict what
customers will buy in the future based on their past
purchase history
Cash forecasting
Many businesses use linear regression to forecast how
much cash they’ll have on hand in the future.
This is important for things like managing expenses
and ensuring that there is enough cash on hand to
cover unexpected costs.
Analyzing survey data
Linear regression can also be used to analyze survey
data.
This can help businesses understand things like
customer satisfaction and product preferences.
For example, a company might use linear regression
to figure out how likely people are to recommend their
product to others.
Stock predictions
A lot of businesses use linear regression models to
predict how stocks will perform in the future.
This is done by analyzing past data on stock prices
and trends to identify patterns.
Predicting consumer behavior
Businesses can use linear regression to predict things
like how much a customer is likely to spend.
Regression models can also be used to predict
consumer behavior. This can be helpful for things like
targeted marketing and product development.
For example, Walmart uses linear regression to predict
what products will be popular in different regions of
the country.
Code:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # means of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plot the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plot the regression line
    plt.plot(x, y_pred, color="g")
    # axis labels
    plt.xlabel('x')
    plt.ylabel('y')
    # show the plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimate the coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # For this data: b_0 is about 1.24 and b_1 about 1.17.
    # plot the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Logistic Regression…
Regression for classification
• In some cases we can use linear regression to determine an appropriate decision boundary.
• However, since the output is usually binary or discrete, there are more efficient regression methods.
Regression for classification
• Assume we would like to use linear regression to learn the parameters for p(y | X; θ).
• Problems?
(Figure: data labeled 1 and -1, with the optimal regression model drawn through them.)

w^T X \ge 0 \Rightarrow classify as 1
w^T X < 0 \Rightarrow classify as -1
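A one-line version of that decision rule; the weight vector and input below are arbitrary placeholders rather than learned values.

import numpy as np

w = np.array([0.8, -0.5])   # placeholder weights
x = np.array([1.0, 2.0])    # placeholder input

# Classify as 1 if w^T x >= 0, otherwise as -1.
label = 1 if w @ x >= 0 else -1
print(label)  # -1 for these placeholder values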
Logistic Regression
Logistic regression is a predictive modeling technique in which the target variable (output) takes discrete values for a given set of features or inputs (X).
For example, whether someone is COVID-19 positive (1) or negative (0).
It is a very powerful yet simple classification algorithm in machine learning, borrowed from statistics.
It is often claimed that around 60% of the world's classification problems can be solved using the logistic regression algorithm.
Logistic Regression
Logistic regression is one of the most common machine learning algorithms used for binary classification.
It predicts the probability of occurrence of a binary outcome.
Examples include fraud detection, spam detection, and cancer detection.
Sigmoid Function
The sigmoid function is a mathematical function that can take any real value and map it to a value between 0 and 1; its graph is shaped like the letter "S".
The sigmoid function is also called the logistic function:

  g(h) = \frac{1}{1 + e^{-h}}
Sigmoid Function
(Figure: the S-shaped sigmoid curve.)
Sigmoid Function
If the value of z (the sigmoid's input) goes to positive infinity, the predicted value of y becomes 1; if it goes to negative infinity, the predicted value of y becomes 0.
If the output of the sigmoid function is greater than 0.5, we classify the label as class 1 (the positive class); if it is less than 0.5, we classify it as class 0 (the negative class).
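A small sketch of the sigmoid and the 0.5 threshold rule described above:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Large positive z pushes the output toward 1, large negative z toward 0;
# the 0.5 threshold then assigns class 1 or class 0.
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0
    print("z = {:6.1f}   sigmoid = {:.4f}   class = {}".format(z, p, label))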
Difference between Linear & Logistic Regression
Linear regression is used when the dependent variable is continuous in nature, for example weight, height, or other numeric quantities.
Logistic regression, in contrast, is used when the dependent variable is binary or limited, for example: yes/no, true/false, 0/1.
In the 19th century, people used linear regression in biology to predict disease, but doing so is risky: for example, if a patient has cancer and the probability of malignancy is 0.4, linear regression (with a 0.5 threshold) would report the cancer as benign. That is where logistic regression comes in, which provides us with binary results.
Logistic regression vs. linear regression

p(y = 0 \mid X; \theta) = g(w^T X) = \frac{1}{1 + e^{w^T X}}

p(y = 1 \mid X; \theta) = 1 - g(w^T X) = \frac{e^{w^T X}}{1 + e^{w^T X}}

(Note the sign convention on these slides: g(w^T X) here gives the probability of the negative class, so it equals the sigmoid of -w^T X.)
Determining parameters for logistic regression problems
• So how do we learn the parameters?
• As in other regression problems, we look for the MLE of w.
• The likelihood of the data given the model is:

L(y \mid X; w) = \prod_i g(X_i; w)^{1 - y_i} \, \big(1 - g(X_i; w)\big)^{y_i}

where, as above,

p(y = 0 \mid X; \theta) = g(X; w) = \frac{1}{1 + e^{w^T X}}

p(y = 1 \mid X; \theta) = 1 - g(X; w) = \frac{e^{w^T X}}{1 + e^{w^T X}}
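In practice one maximizes the log of this likelihood, which turns the product into a sum. A sketch of that step, keeping the slides' convention that g(X_i; w) is the probability that y_i = 0:

\log L(y \mid X; w) = \sum_i \Big[ (1 - y_i) \log g(X_i; w) + y_i \log\big(1 - g(X_i; w)\big) \Big]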
Gradient Descent
Gradient descent is an optimization algorithm
used to minimize some function by iteratively
moving in the direction of steepest descent as
defined by the negative of the gradient.
In machine learning, we use gradient descent to
update the parameters of our model.
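As a minimal sketch of the update rule w ← w - η ∇f(w), here is gradient descent on the simple convex function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the starting point and step size are arbitrary choices.

def grad(w):
    # Gradient of f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w = 0.0     # arbitrary starting point
lr = 0.1    # learning rate (step size)
for _ in range(100):
    w = w - lr * grad(w)   # step in the direction of the negative gradient
print(w)  # approaches the minimizer w = 3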
Gradient Descent
Starting at the top of the mountain, we take our first
step downhill in the direction specified by the
negative gradient.
Next we recalculate the negative gradient (passing in
the coordinates of our new point) and take another
step in the direction it specifies.
We continue this process iteratively until we get to the bottom of our graph, or to a point where we can no longer move downhill: a local minimum.
Gradient Descent
(Figure: illustration of gradient descent steps.)
Learning Rate
The size of these steps is called the learning rate.
With a high learning rate we can cover more ground each
step, but we risk overshooting the lowest point since the slope
of the hill is constantly changing.
With a very low learning rate, we can confidently move in
the direction of the negative gradient since we are
recalculating it so frequently.
A low learning rate is more precise, but calculating the
gradient is time-consuming, so it will take us a very long time
to get to the bottom.
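The same toy function from before makes this trade-off visible. For this quadratic, any learning rate below 1.0 converges, while a rate above 1.0 overshoots so badly that the iterates diverge; the specific values are illustrative.

def grad(w):
    # Gradient of f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

for lr in [0.01, 0.5, 1.1]:
    w = 0.0
    for _ in range(50):
        w = w - lr * grad(w)
    print("lr = {:4.2f}   w after 50 steps = {:.3e}".format(lr, w))
# lr = 0.01 inches toward 3, lr = 0.5 lands on it, lr = 1.1 blows up.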
Gradient ascent
(Figure: curve z(w) with slope = ∂z/∂w and a step Δw.)
• Going in the direction of the slope will lead to a larger z.
• But not too much, otherwise we would go beyond the optimal w.
Gradient descent
(Figure: curve z(w) with slope = ∂z/∂w and steps Δz, Δw.)
• Going in the opposite direction to the slope will lead to a smaller z.
• But not too much, otherwise we would go beyond the optimal w.
Finding The Best Weights - Hill Descent
A ball on a complicated hilly terrain rolls down to a local valley; the resting point is called a local minimum.
Questions:
How do we get to the bottom of the deepest valley?
How do we do this when we don't have gravity?
Our E_in Has Only One Valley
(Figure: in-sample error E_in plotted against the weights w, showing a single valley.)
. . . because E_in(w) is a convex function of w.
Gradient Descent Methods
Batch Gradient Descent: use all examples in each iteration.
Mini-batch Gradient Descent: use some examples in each iteration.
Stochastic Gradient Descent: use 1 example in each iteration (the three variants are contrasted in the sketch below).
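The three variants differ only in how many examples feed each gradient step. The data below is synthetic, and the gradient uses the common convention p(y = 1 | x) = sigmoid(w^T x), which flips the g convention used in the earlier slides.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data with two features, made up for illustration.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, Xb, yb):
    # Gradient of the average negative log-likelihood on the batch (Xb, yb).
    return Xb.T @ (sigmoid(Xb @ w) - yb) / len(yb)

def train(batch_size, steps=500, lr=0.5):
    w = np.zeros(2)
    for _ in range(steps):
        # The only difference between the variants is the batch size.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        w -= lr * grad(w, X[idx], y[idx])
    return w

print("stochastic (1 example):  ", train(1))
print("mini-batch (16 examples):", train(16))
print("batch (all 100 examples):", train(100))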
Regularization
• As with other data estimation problems, we may not have enough samples to learn good models for logistic regression classification.
• One way to overcome this is to 'regularize' the model: impose additional constraints on the parameters we are fitting.
• For example, let's assume that w_j comes from a Gaussian distribution with mean 0 and variance \sigma^2 (where \sigma^2 is a user-defined parameter): w_j ~ N(0, \sigma^2).
• In that case we have a prior on the parameters, and so:

p(y = 1, \theta \mid X) \propto p(y = 1 \mid X; \theta) \, p(\theta)
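A hedged sketch of how that Gaussian prior shows up in the objective: taking the negative log of the posterior adds an L2 penalty ||w||^2 / (2σ^2) to the negative log-likelihood. The data and weights below are made-up placeholders, and p(y = 1 | x) = sigmoid(w^T x) is again the common convention rather than the slides' g.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_nll(w, X, y, sigma2):
    # Negative log-likelihood plus the penalty contributed by the
    # Gaussian prior w_j ~ N(0, sigma2): ||w||^2 / (2 * sigma2).
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + np.dot(w, w) / (2.0 * sigma2)

# Made-up placeholder data and weights.
X = np.array([[0.5, 1.0], [-1.0, 0.2], [1.5, -0.3]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, -0.2])

# A smaller sigma2 (a stronger prior) penalizes large weights more.
for sigma2 in [0.1, 1.0, 10.0]:
    print("sigma2 = {:5.1f}   objective = {:.4f}".format(
        sigma2, regularized_nll(w, X, y, sigma2)))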
Credits
Yaser Abu-Mostafa, Caltech, California
Barnabas Poczos and Ziv Bar-Joseph, School of Computer Science, Carnegie Mellon University
Vibhav Gogate, The University of Texas at Dallas
