DA 5230 – Statistical & Machine Learning
Lecture 5 – Logistic Regression
Maninda Edirisooriya
manindaw@uom.lk
Classification
• When the Y variable of a Supervised Learning problem takes one of several
discrete classes (e.g.: Color, Age groups) the problem is known as a
Classification problem
• A Classification problem has to predict/select a certain Category (or a
Class) as the dependent variable
• When there are only 2 classes to be classified, it is known as a Binary
Classification problem
E.g.: Predicting a person’s gender (either as male or female) by testosterone
concentration in blood, height and bone density
Binary Classification
• Output classes of a binary classification can be represented by either
• Boolean values, True or False (or Positive or Negative)
• Numbers 1 or 0
• True or 1 is used for the Positive Class, which is generally the class we
want to analyze
• False or 0 is used for the Negative Class, i.e. the other class
• E.g.: For classifying a tumor as malignant (a cancer) or benign (not a
cancer) by the tumor size, being malignant can be taken as the
Positive class and the benign class as the Negative class
Binary Classification - Example
[Figure: tumor size (X) plotted against the class label (Y), with data points at Y = 0 (Benign) and Y = 1 (Malignant)]
Binary Classification – with Linear Regression
[Figure: a Linear Regression line fitted to the 0/1 labels, thresholded at 0.5 to separate Benign from Malignant predictions]
Binary Classification – Problem with LR
[Figure: the fitted Linear Regression line with the 0.5 threshold leaves some Malignant points misclassified]
Binary Classification – Requirement
[Figure: the required classifier, a variant of the Unit Step Function jumping from 0 (Benign) to 1 (Malignant), shown against the Linear Regression line and the 0.5 threshold]
Binary Classification – Requirement
[Figure: the Unit Step Function variant is not differentiable at the step, so it cannot be trained with Gradient Descent]
Binary Classification – Requirement
[Figure: a continuous, S-shaped regression classifier that transitions smoothly from 0 (Benign) to 1 (Malignant)]
Logistic/Sigmoid Function
• Sigmoid function: f(Z) = 1 / (1 + e^(-Z))
• Z = 0 ⇒ f(Z) = 0.5
• 0 < f(Z) < 1
• A Non-linear function
• This is a continuous alternative for the Unit Step Function
[Figure: the S-shaped sigmoid curve, f(Z) plotted against Z]
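As a small illustration of the formula above, here is a minimal NumPy sketch of the sigmoid function (the function name is my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid f(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5, matching f(0) = 0.5
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # values strictly between 0 and 1
```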
Logistic Regression
Like Linear Regression, say Z = β0 + β1*X1 + β2*X2 + ... + βn*Xn
Logistic Function: f(Z) = 1 / (1 + e^(-Z))
f(X) = 1 / (1 + e^(-(β0 + β1*X1 + β2*X2 + ... + βn*Xn)))
In vector form,
f(X) = 1 / (1 + e^(-βᵀX))
where the intercept term is written as β0*X0, taking X0 = 1
This is the function of Logistic Regression.
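A hedged sketch of the hypothesis f(X) = 1 / (1 + e^(-βᵀX)), assuming X0 = 1 is prepended to absorb the intercept β0 (variable names are illustrative):

```python
import numpy as np

def logistic_hypothesis(beta, X):
    """f(X) = sigmoid(beta^T X) for a batch of rows X, shape (n_samples, n_features)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend X0 = 1 for the intercept
    z = X @ beta                                # Z = beta0 + beta1*X1 + ... + betan*Xn
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 2.0])          # beta0 (intercept), beta1
X = np.array([[0.0], [0.5], [2.0]])   # a single feature X1
print(logistic_hypothesis(beta, X))   # probabilities in (0, 1)
```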
Logistic Regression - Prediction
Let's take the predicted class as:
Ŷ = 1 (or Positive) if f(X) ≥ 0.5
Ŷ = 0 (or Negative) if f(X) < 0.5
Positive ⇒ f(X) ≥ 0.5 ⇒ 1 / (1 + e^(-βᵀX)) ≥ 0.5 ⇒ βᵀX ≥ 0
Negative ⇒ f(X) < 0.5 ⇒ 1 / (1 + e^(-βᵀX)) < 0.5 ⇒ βᵀX < 0
Here, βᵀX = β0 + β1*X1 + β2*X2 + ... + βn*Xn
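A small sketch of this prediction rule, thresholding f(X) at 0.5 (equivalently, checking the sign of βᵀX); the helper name is an assumption, not from the slides:

```python
import numpy as np

def predict(beta, X, threshold=0.5):
    """Return 1 (Positive) when f(X) >= threshold, else 0 (Negative)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend X0 = 1 for the intercept
    probs = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (probs >= threshold).astype(int)

beta = np.array([-1.0, 2.0])
X = np.array([[0.0], [0.5], [2.0]])
print(predict(beta, X))   # [0 1 1] here; matches the sign of beta^T X
```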
Prediction Example
Take a classification problem with 2 independent variables where,
f(X) = 1 / (1 + e^(-(β0 + β1*X1 + β2*X2)))
Decision boundary: Z = β0 + β1*X1 + β2*X2 = 0
Z > 0 ⇒ Positive
Z < 0 ⇒ Negative
[Figure: the X1-X2 plane with the linear decision boundary separating the Positive (Z > 0) and Negative (Z < 0) regions]
Non-linear Classification
Taking polynomials of X values (as discussed in Polynomial Regression)
can classify non-linear data points with non-linear decision boundaries
(see the sketch after this example)
E.g.:
f(X) = 1 / (1 + e^(-(β0 + β1*X1^2 + β2*X2^2)))
Decision boundary: Z = β0 + β1*X1^2 + β2*X2^2 = 0
Z > 0 ⇒ Positive
Z < 0 ⇒ Negative
[Figure: the X1-X2 plane with an elliptical decision boundary separating the Positive (Z > 0) and Negative (Z < 0) regions]
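A minimal sketch of the non-linear case above: squaring the features before applying the same logistic form yields an elliptical decision boundary. The coefficients below are made up purely for illustration:

```python
import numpy as np

def predict_quadratic(beta, X1, X2):
    """Classify using Z = beta0 + beta1*X1^2 + beta2*X2^2 (sign of Z decides)."""
    z = beta[0] + beta[1] * X1**2 + beta[2] * X2**2
    return (z > 0).astype(int)

beta = np.array([-1.0, 1.0, 1.0])       # boundary: X1^2 + X2^2 = 1 (unit circle)
X1 = np.array([0.0, 2.0, 0.5])
X2 = np.array([0.0, 0.0, 0.5])
print(predict_quadratic(beta, X1, X2))  # [0 1 0]: only the point outside the circle is Positive
```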
Binary Logistic Regression – Cost Function
Cost for a single data point is known as the Loss
Take the Loss Function of Logistic Regression as L(f(X), Y)
L(f(X), Y) = −log(f(X))       if Y = 1
L(f(X), Y) = −log(1 − f(X))   if Y = 0
Combined into a single expression,
L(f(X), Y) = −Y*log(f(X)) − (1 − Y)*log(1 − f(X))
Cost function: J(β) = (1/n) Σ_{i=1}^{n} L(f(Xi), Yi)
J(β) = (1/n) Σ_{i=1}^{n} [−Yi*log(f(Xi)) − (1 − Yi)*log(1 − f(Xi))]
This Cost Function is Convex (has a Global Minimum)
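A hedged NumPy sketch of the cost J(β), assuming y is a 0/1 label vector; the small epsilon that guards the logarithm against log(0) is my addition, not part of the slides:

```python
import numpy as np

def binary_cross_entropy(beta, X, y, eps=1e-12):
    """J(beta) = (1/n) * sum of [-y*log(f) - (1-y)*log(1-f)] over the data set."""
    X = np.column_stack([np.ones(len(X)), X])    # prepend X0 = 1 for the intercept
    f = 1.0 / (1.0 + np.exp(-(X @ beta)))
    f = np.clip(f, eps, 1.0 - eps)               # avoid log(0)
    return np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f))

beta = np.array([-1.0, 2.0])
X = np.array([[0.0], [0.5], [2.0]])
y = np.array([0, 1, 1])
print(binary_cross_entropy(beta, X, y))   # average loss over the 3 points
```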
Multiclass Logistic Regression
• Up to now we have looked at Binary Classification problems where
there can be only two outcomes/categories/classes as the Y variable
• When there are more than 2 classes available (only one of them is
positive for any given data point) the problem becomes a Multiclass
Classification problem
• One way to handle Multiclass Classification is to use Binary Classifiers
in a scheme known as One-vs-All (OvA), also known as One-vs-Rest (OvR)
• It trains one binary classifier per class, each predicting the confidence
(probability) of that class against the rest, and the class with the highest
confidence is selected (see the sketch below)
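A minimal sketch of the One-vs-All idea, training one scikit-learn LogisticRegression per class and picking the most confident one; the toy data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 3 classes in 2 features (made up for illustration)
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [0.0, 3.0], [0.2, 3.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary classifier per class: class k vs. the rest
models = []
for k in np.unique(y):
    clf = LogisticRegression()
    clf.fit(X, (y == k).astype(int))
    models.append(clf)

# Predict: take the class whose classifier gives the highest confidence
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
print(scores.argmax(axis=1))   # should reproduce [0 0 1 1 2 2] on this toy set
```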
Multiclass Logistic Regression
• OvA can be used
• When you want to use different binary classifiers (e.g., SVMs or logistic
regression) for each class
• When available memory is limited or training needs to be highly parallelized
• There is another technique for Multiclass Logistic Regression, obtained by
simply generalizing the binary classification form of Logistic Regression
• This General form of Classifier is known as the Softmax Classifier
• There, the Softmax Function is used instead of the Sigmoid function
when there are multiple classes
Softmax Function
• The name Softmax is used, as it is a continuous function
approximation to the Maximum Function, where only one class
(maximum) is allowed to be considered as Positive
• Softmax function is used instead of the Maximum Function to make
the function differentiable
• Softmax Function: S(X)i = e^(Xi) / Σ_{j=1}^{n} e^(Xj)
where Xi is the i-th component of the input vector X and j runs over all n components
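A small sketch of the Softmax Function; subtracting the maximum before exponentiating is a standard numerical-stability trick that I have added, and it does not change the result:

```python
import numpy as np

def softmax(x):
    """S(x)_i = e^(x_i) / sum_j e^(x_j), computed in a numerically stable way."""
    shifted = x - np.max(x)          # subtracting a constant leaves softmax unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

x = np.array([2.0, 1.0, 0.1])
print(softmax(x))          # largest input gets the largest share
print(softmax(x).sum())    # 1.0 -- the outputs sum to 1
```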
Softmax Function
• Softmax function exponentially highlights the value in the dimension
where the value is maximum, while suppressing all other dimensions
• Output values of a vector from a Softmax function sum to 1
• E.g.:
[Figure: an example input vector passed through the Softmax Function to produce the output vector]
Softmax Regression
• Like Z = βᵀX is used for binary classification, Zk = βkᵀX is used for
Multiclass classification, where k is the index of the class
• Note that K β vectors exist as model parameters, one per class
• Like Y is used for binary classification where there is only a single
dependent variable, Multiclass classification has K dependent
variables, each denoted by Yk with its estimator Ŷk
Ŷk = e^(Zk) / Σ_{j=1}^{K} e^(Zj)
Softmax Regression
Loss function:
L(f(X), Y) = −log(Ŷk) = −log(e^(Zk) / Σ_{j=1}^{K} e^(Zj)) = −log(e^(βkᵀX) / Σ_{j=1}^{K} e^(βjᵀX))
where k is the true class of the data point
Cost function (Cross Entropy Loss):
J(β) = − Σ_{i=1}^{N} Σ_{k=1}^{K} I[Yi = k] * log(e^(βkᵀXi) / Σ_{j=1}^{K} e^(βjᵀXi))
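A minimal sketch of the Cross Entropy Cost above, assuming integer class labels Yi in {0, ..., K-1}; the indicator I[Yi = k] simply picks out the log-probability of each point's true class (names, and the epsilon guard, are my own):

```python
import numpy as np

def cross_entropy_cost(B, X, y, eps=1e-12):
    """J(beta) = -sum_i log P(class = y_i | x_i) under the softmax model."""
    Z = X @ B.T                                  # (N, K) scores, Z_ik = beta_k^T x_i
    Z -= Z.max(axis=1, keepdims=True)            # stable softmax
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    picked = P[np.arange(len(y)), y]             # probability of each point's true class
    return -np.sum(np.log(picked + eps))

B = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])   # (K=3, n_features=2), intercept omitted
X = np.array([[2.0, 1.0], [0.0, 2.0]])                 # N=2 data points
y = np.array([0, 2])                                   # true classes
print(cross_entropy_cost(B, X, y))
```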
One Hour Homework
• Officially we have one more hour of work after the end of the lectures
• Therefore, for this week's extra hour you have a homework
• Logistic Regression is the basic building block of Deep Neural Networks (DNNs).
Softmax classifiers are used as-is in DNNs as the final classification layer
• Go through the slides and get a clear understanding of Logistic and Softmax
Regression
• Refer to external sources to clarify any remaining ambiguities
• Good Luck!
Questions?
