2. Linear regression
• Our goal is to estimate w from training data of <x_i, y_i> pairs, assuming the model y = wx + ε
• One way to find such a relationship is to minimize the least squares error:
$$\arg\min_w \sum_i (y_i - w x_i)^2$$
[Figure: plot of Y against X for the model y = wx + ε]
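For this one-parameter model the minimizer has a closed form. A short derivation, assuming at least one x_i is nonzero (the names J and \hat{w} are introduced here only for illustration):

```latex
\begin{aligned}
J(w) &= \sum_i (y_i - w x_i)^2 \\
\frac{dJ}{dw} &= -2 \sum_i x_i \,(y_i - w x_i) = 0 \\
\Rightarrow\quad \hat{w} &= \frac{\sum_i x_i y_i}{\sum_i x_i^2}
\end{aligned}
```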
4. Marks of a student based on the number of
hours he/she put into the preparation
• Simple linear regression…..
• Multiple linear regression….
• Non-linear problem….
5. Marks of a student based on the number of
hours he/she put into the preparation
Let's assume that the marks of a student (M) depend on the number
of hours (H) he/she puts into preparation.
The following formula can represent the model:
Marks = function (No. of hours)
=> Marks = m*Hours + c
7. Marks of a student based on the number of
hours he/she put into the preparation
Let's plot the data to check whether it's a linear problem
– The easiest way to determine
10. Marks of a student based on the number of
hours he/she put into the preparation
How to determine the slope of the line
The value of m….
11. Marks of a student based on the number of
hours he/she put into the preparation
• The value of m (slope of the line) can be
determined using an objective function, which
in general is a combination of a loss function and a
regularization term.
• For simple linear regression, the objective
function is the Mean Squared Error (MSE).
• The best-fit line is obtained
by minimizing the objective function
(the mean squared error).
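As a minimal sketch of the idea above, the slope m and intercept c that minimize the MSE can be computed in closed form for a single input. The data values and the function names `fit_line` and `mse` are hypothetical, chosen only for illustration.

```python
def fit_line(hours, marks):
    """Return (m, c) minimizing the mean squared error of marks ~ m*hours + c."""
    n = len(hours)
    mean_h = sum(hours) / n
    mean_y = sum(marks) / n
    # Least-squares slope = covariance(hours, marks) / variance(hours)
    cov = sum((h - mean_h) * (y - mean_y) for h, y in zip(hours, marks))
    var = sum((h - mean_h) ** 2 for h in hours)
    m = cov / var
    # The best-fit line passes through the point of means
    c = mean_y - m * mean_h
    return m, c


def mse(hours, marks, m, c):
    """Mean squared error of the line m*hours + c on the data."""
    return sum((y - (m * h + c)) ** 2 for h, y in zip(hours, marks)) / len(hours)


# Hypothetical training data: hours of preparation and marks obtained
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
marks = [35.0, 45.0, 50.0, 65.0, 70.0]

m, c = fit_line(hours, marks)
print(f"Marks = {m:.2f} * Hours + {c:.2f}, MSE = {mse(hours, marks, m, c):.2f}")
```

Any other choice of m or c gives a larger MSE on the same data, which is what "best fit" means here.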
13. Predicting weight reduction in form of the
number of KGs reduced
• Let's assume
it could depend upon input features such as
age, height, the weight of the person, and the
time spent on exercise.
14. Predicting weight reduction in form of the
number of KGs reduced
Weight Reduction = Function(Age, Height,
Weight, Time On Exercise)
=> Weight Reduction = b1*Height + b2*Weight +
b3*Age + b4*Time On Exercise + b0
15. Predicting weight reduction in form of the
number of KGs reduced
As part of training the above model
Goal:
Find the values of b1, b2, b3, b4, and b0 that
minimize the objective function.
The objective function is the mean squared error,
that is, the mean of the squared differences between
the actual and the predicted values over the training
examples (different values of age, height, weight,
and time on exercise).
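The objective described above can be sketched as follows. The feature rows, the coefficient values, and the names `predict` and `mse_objective` are hypothetical. The targets are generated from the assumed coefficients so that the objective is exactly zero at those values.

```python
def predict(features, b, b0):
    """Linear model: b1*x1 + b2*x2 + ... + b0."""
    return sum(bj * xj for bj, xj in zip(b, features)) + b0


def mse_objective(X, y, b, b0):
    """Mean of squared differences between actual and predicted values."""
    errors = [yi - predict(xi, b, b0) for xi, yi in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)


# Hypothetical rows: (height in cm, weight in kg, age in years, exercise hours/week)
X = [
    [170.0, 80.0, 30.0, 5.0],
    [160.0, 70.0, 45.0, 3.0],
    [180.0, 95.0, 28.0, 7.0],
]
# Assumed "true" coefficients b1..b4 and b0, used only to generate targets
b_true, b0_true = [0.01, 0.02, -0.03, 0.5], 1.0
y = [predict(row, b_true, b0_true) for row in X]

print(mse_objective(X, y, b_true, b0_true))        # objective at the generating values
print(mse_objective(X, y, b_true, b0_true + 1.0))  # larger for a worse b0
```

Training then amounts to searching for the b values that make `mse_objective` as small as possible.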
16. Forecasting sales
Organizations often use linear regression models to
forecast future sales.
This can be helpful for things like budgeting and
planning.
Algorithms such as Amazon’s item-to-item
collaborative filtering are used to predict what
customers will buy in the future based on their past
purchase history
17. Cash forecasting
Many businesses use linear regression to forecast how
much cash they’ll have on hand in the future.
This is important for things like managing expenses
and ensuring that there is enough cash on hand to
cover unexpected costs.
18. Analyzing survey data
Linear regression can also be used to analyze survey
data.
This can help businesses understand things like
customer satisfaction and product preferences.
For example, a company might use linear regression
to figure out how likely people are to recommend its
product to others.
19. Stock predictions
A lot of businesses use linear regression models to
predict how stocks will perform in the future.
This is done by analyzing past data on stock prices
and trends to identify patterns.
20. Predicting consumer behavior
Businesses can use linear regression to predict things
like how much a customer is likely to spend.
Regression models can also be used to predict
consumer behavior. This can be helpful for things like
targeted marketing and product development.
For example, Walmart uses linear regression to predict
what products will be popular in different regions of
the country.
27. Regression for classification
• In some cases we can use linear regression for determining the
appropriate boundary.
• However, since the output is usually binary or discrete there are
more efficient regression methods
28. Regression for classification
• Assume we would like to use linear regression to learn the
parameters for p(y | X ; θ)
• Problems?
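[Figure: optimal regression model, with the class labels 1 and -1 on the vertical axis]
$$w^T X \ge 0 \Rightarrow \text{classify as } 1$$
$$w^T X < 0 \Rightarrow \text{classify as } -1$$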
29. Logistic Regression
Logistic regression is a predictive modelling
technique in which the target variable (output)
takes discrete values for a given set of features or
inputs (X).
For example, whether someone is COVID-19 positive
(1) or negative (0).
It is a powerful yet simple classification
algorithm that machine learning borrowed from
statistics.
Around 60% of the world’s classification problems
can be solved by using the logistic regression
algorithm.
31. Logistic Regression
Logistic regression is one of the most common
machine learning algorithms used for binary
classification.
It predicts the probability of occurrence of a binary
outcome.
Typical applications include fraud detection, spam
detection, and cancer detection.
32. Sigmoid Function
It is a mathematical function that takes any real
value and maps it to a value between 0 and 1. Its
graph is shaped like the letter “S”.
The sigmoid function is also called the logistic function:
$$g(h) = \frac{1}{1 + e^{-h}}$$
34. Sigmoid Function
As the input z of the sigmoid function goes to positive
infinity, the predicted value of y approaches 1, and as z
goes to negative infinity, the predicted value of y
approaches 0.
If the output of the sigmoid function is more
than 0.5, we classify the example as class 1 (the
positive class); if it is less than 0.5, we classify it
as class 0 (the negative class).
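A minimal sketch of the sigmoid function and the 0.5 threshold rule. The names `sigmoid` and `classify` are illustrative. The slide does not say what happens when the output is exactly 0.5; assigning that case to class 0 is an assumption made here.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so e^z is at most 1
    return ez / (1.0 + ez)


def classify(z, threshold=0.5):
    """Return 1 if sigmoid(z) exceeds the threshold, else 0 (ties go to 0: an assumption)."""
    return 1 if sigmoid(z) > threshold else 0


for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, sigmoid(z), classify(z))
```

Because the sigmoid is increasing and equals 0.5 at z = 0, thresholding the output at 0.5 is the same as checking the sign of z.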
35. Difference between Linear and Logistic Regression
Linear regression is used when the dependent variable is
continuous in nature, for example weight, height, numbers, etc.
In contrast,
logistic regression is used when the dependent variable is
binary or limited, for example: yes and no, true and false, 0 or 1,
etc.
In the 19th century, people used linear regression in biology to
predict disease, but this is very risky. For example, if a patient
has cancer and the predicted probability of malignancy is 0.4, then
linear regression will indicate that the cancer is benign (because
the probability is below 0.5). That is where logistic regression
comes in, which models the probability of a binary outcome.
36. Logistic regression vs. Linear regression
$$p(y = 0 \mid X; \theta) = g(w^T X) = \frac{1}{1 + e^{w^T X}}$$
$$p(y = 1 \mid X; \theta) = 1 - g(w^T X) = \frac{e^{w^T X}}{1 + e^{w^T X}}$$
37. Determining parameters for logistic regression problems
• So how do we learn the parameters?
• Similar to other regression problems, we look for the MLE for w
• The likelihood of the data given the model is:
$$L(y \mid X; w) = \prod_i \bigl(1 - g(X_i; w)\bigr)^{y_i} \, g(X_i; w)^{(1 - y_i)}$$
where
$$p(y = 0 \mid X; \theta) = g(X; w) = \frac{1}{1 + e^{w^T X}}$$
$$p(y = 1 \mid X; \theta) = 1 - g(X; w) = \frac{e^{w^T X}}{1 + e^{w^T X}}$$
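In practice the logarithm of the likelihood is maximized, since a sum is easier to handle than a product. Under the convention on this slide, p(y = 1 | X) = e^{w^T X} / (1 + e^{w^T X}), which equals 1 / (1 + e^{-w^T X}), so each example contributes y_i * w^T X_i - log(1 + e^{w^T X_i}). The sketch below evaluates this sum; the data and the names `log1p_exp` and `log_likelihood` are hypothetical.

```python
import math


def log1p_exp(z):
    """Numerically stable log(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(w, X, y):
    """Sum over examples of y_i * z_i - log(1 + e^{z_i}), with z_i = w^T X_i."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        total += yi * z - log1p_exp(z)
    return total


# Hypothetical data: first feature is a constant 1 (intercept), labels in {0, 1}
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]]
y = [1, 0, 1, 0]

ll_zero = log_likelihood([0.0, 0.0], X, y)  # every example gets probability 0.5
ll_good = log_likelihood([0.0, 1.0], X, y)  # weights that agree with the labels
print(ll_zero, ll_good)
```

The MLE is the w with the highest log-likelihood; weights that agree with the labels score higher than the all-zero weights.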
38. Gradient Descent
Gradient descent is an optimization algorithm
used to minimize some function by iteratively
moving in the direction of steepest descent as
defined by the negative of the gradient.
In machine learning, we use gradient descent to
update the parameters of our model.
39. Gradient Descent
Starting at the top of the mountain, we take our first
step downhill in the direction specified by the
negative gradient.
Next we recalculate the negative gradient (passing in
the coordinates of our new point) and take another
step in the direction it specifies.
We continue this process iteratively until we get to
the bottom of our graph, or to a point where we can
no longer move downhill–a local minimum.
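The loop described above can be sketched in a few lines. The example function f(w) = (w - 3)^2, the starting point, the step size, and the name `gradient_descent` are all assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    w = w0
    for _ in range(steps):
        # Recompute the gradient at the current point, then move downhill
        w = w - learning_rate * grad(w)
    return w


def grad_f(w):
    """Gradient of the example function f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_min = gradient_descent(grad_f, w0=10.0)
print(w_min)  # close to 3, the minimizer of f
```

For this convex example the only local minimum is the global one, so the loop ends up near w = 3 regardless of the starting point.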
41. Learning Rate
The size of these steps is called the learning rate.
With a high learning rate we can cover more ground each
step, but we risk overshooting the lowest point since the slope
of the hill is constantly changing.
With a very low learning rate, we can confidently move in
the direction of the negative gradient since we are
recalculating it so frequently.
A low learning rate is more precise, but calculating the
gradient is time-consuming, so it will take us a very long time
to get to the bottom.
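A small experiment illustrates the trade-off. It runs a fixed number of gradient descent steps on the assumed example f(w) = w^2 (gradient 2w, minimum at w = 0) with three hypothetical learning rates. The name `run` and the specific rates are illustrative.

```python
def run(learning_rate, steps=20, w0=1.0):
    """Gradient descent on f(w) = w^2; returns w after the given number of steps."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w  # gradient of w^2 is 2w
    return w


for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {run(lr):.6g}")
```

With the low rate, w has moved only part of the way to 0 after 20 steps. With the moderate rate it is essentially at 0. With the high rate every step overshoots the minimum by more than the previous one, so w moves away from 0 instead of toward it.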
42. Gradient ascent
[Figure: z plotted against w, showing the slope ∂z/∂w and a step Δw]
• Going in the direction of the slope will lead to a larger z
• But not too much, otherwise we would go beyond the
optimal w
43. Gradient descent
[Figure: z plotted against w, showing the slope ∂z/∂w, a step Δw, and the resulting change Δz]
• Going in the opposite direction to the slope will lead to
a smaller z
• But not too much, otherwise we would go beyond the
optimal w
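Gradient ascent and gradient descent differ only in the sign of the step. Written as update rules, with a step size η > 0 (a symbol assumed here for illustration; it plays the role of the learning rate):

```latex
\begin{aligned}
\text{Gradient ascent:}  \quad & w \leftarrow w + \eta \,\frac{\partial z}{\partial w} \\
\text{Gradient descent:} \quad & w \leftarrow w - \eta \,\frac{\partial z}{\partial w}
\end{aligned}
```

A small η keeps each step Δw short, which is what prevents going beyond the optimal w.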
44. Finding The Best Weights - Hill Descent
A ball on a complicated hilly terrain rolls down to a local valley
(this is called a local minimum)
Questions:
How to get to the bottom of the deepest valley?
How to do this when we don’t have gravity?
45. Our E_in Has Only One Valley
[Figure: in-sample error E_in plotted against the weights w]
. . . because E_in(w) is a convex function of w.
46. Gradient Descent Method
Batch Gradient Descent: use all examples in each iteration
Mini batch Gradient Descent: use some examples in each iteration
Stochastic Gradient Descent: use 1 example in each iteration
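The three variants can share one implementation that differs only in the batch size. The sketch below fits a line y = m*x + c by minimizing the squared error. The data, the hyperparameters, and the name `fit_with_batches` are hypothetical, and the data are noise-free so that all three variants reach the same answer.

```python
import random


def fit_with_batches(xs, ys, batch_size, learning_rate=0.05, epochs=2000, seed=0):
    """Fit y ~ m*x + c by gradient descent on squared error, one batch per update."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (y - (m*x + c))^2 over the current batch
            grad_m = sum(-2.0 * xs[i] * (ys[i] - (m * xs[i] + c)) for i in batch) / len(batch)
            grad_c = sum(-2.0 * (ys[i] - (m * xs[i] + c)) for i in batch) / len(batch)
            m -= learning_rate * grad_m
            c -= learning_rate * grad_c
    return m, c


# Hypothetical noise-free data generated from y = 2x + 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

print("batch:     ", fit_with_batches(xs, ys, batch_size=len(xs)))
print("mini-batch:", fit_with_batches(xs, ys, batch_size=2))
print("stochastic:", fit_with_batches(xs, ys, batch_size=1))
```

Smaller batches make each update cheaper but noisier. On real, noisy data the stochastic variant typically fluctuates around the minimum rather than settling exactly on it.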
47. Regularization
• Similar to other data estimation problems, we may not have enough
samples to learn good models for logistic regression classification
• One way to overcome this is to ‘regularize’ the model, that is, to impose
additional constraints on the parameters we are fitting.
• For example, let's assume that w_j comes from a Gaussian
distribution with mean 0 and variance σ² (where σ² is a user-defined
parameter): w_j ~ N(0, σ²)
• In that case we have a prior on the parameters and so:
$$p(y = 1, \theta \mid X) \propto p(y = 1 \mid X; \theta)\, p(\theta)$$
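Taking logarithms shows why this Gaussian prior acts as a penalty on large weights. A short derivation, assuming the w_j are independent under the prior and writing L(y | X; w) for the likelihood of the training data (the name \hat{w}_{MAP} is introduced here for illustration):

```latex
\begin{aligned}
\log p(w) &= \sum_j \log \mathcal{N}(w_j;\, 0, \sigma^2)
           = -\frac{1}{2\sigma^2} \sum_j w_j^2 + \text{const} \\
\hat{w}_{\mathrm{MAP}} &= \arg\max_w \Bigl[ \log L(y \mid X; w) - \frac{1}{2\sigma^2} \sum_j w_j^2 \Bigr]
\end{aligned}
```

A smaller σ² gives a stronger penalty and pulls the weights closer to 0.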
48. Credits
Yaser Abu-Mostafa, Caltech, California
Barnabas Poczos and Ziv Bar-Joseph, School of Computer Science,
Carnegie Mellon University
Vibhav Gogate, The University of Texas at Dallas