DA 5230 – Statistical & Machine Learning
Lecture 5 – Logistic Regression
Maninda Edirisooriya
manindaw@uom.lk
Classification
• When the Y variable of a Supervised Learning problem takes one of several
discrete classes (e.g.: Color, Age groups) the problem is known as a
Classification problem
• A Classification problem has to predict/select a certain Category (or a
Class) as the dependent variable
• When there are only 2 classes to be classified, it is known as a Binary
Classification problem
E.g.: Predicting a person’s gender (either as male or female) by testosterone
concentration in blood, height and bone density
Binary Classification
• Output classes of a binary classification can be represented by either
• Boolean values, True or False (or Positive or Negative)
• Numbers 1 or 0
• True or 1 is used for the Positive Class, which is generally the class we
want to analyze
• False or 0 is used for the Negative Class, i.e. the other class
• E.g.: For classifying a tumor as malignant (a cancer) or benign (not a
cancer) by the tumor size, being malignant can be taken as the
Positive class and the benign class as the Negative class
Binary Classification - Example
[Figure: tumor size (X) plotted against the class label (Y), with data points at Y = 0 (Benign) and Y = 1 (Malignant)]
Binary Classification – with Linear Regression
[Figure: a Linear Regression line fitted to the 0/1 labels, thresholded at 0.5 to separate Benign from Malignant predictions]
Binary Classification – Problem with LR
[Figure: the fitted Linear Regression line with the 0.5 threshold leaves some Malignant points misclassified]
Binary Classification – Requirement
[Figure: the required classifier, a variant of the Unit Step Function jumping from 0 (Benign) to 1 (Malignant), shown against the Linear Regression line and the 0.5 threshold]
Binary Classification – Requirement
[Figure: the Unit Step Function variant is not differentiable at the step, so it cannot be trained with Gradient Descent]
Binary Classification – Requirement
[Figure: a continuous, S-shaped regression classifier that transitions smoothly from 0 (Benign) to 1 (Malignant)]
Logistic/Sigmoid Function
• Sigmoid function: f(Z) = 1 / (1 + e^(-Z))
• Z = 0 ⇒ f(Z) = 0.5
• 0 < f(Z) < 1
• A Non-linear function
• This is a continuous alternative for the Unit Step Function
[Figure: the S-shaped sigmoid curve, f(Z) plotted against Z]
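As a small illustration of the formula above, here is a minimal NumPy sketch of the sigmoid function (the function name is my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid f(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5, matching f(0) = 0.5
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # values strictly between 0 and 1
```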
Logistic Regression
Like Linear Regression, say Z = β0 + β1*X1 + β2*X2 + ... + βn*Xn
Logistic Function: f(Z) = 1 / (1 + e^(-Z))
f(X) = 1 / (1 + e^(-(β0 + β1*X1 + β2*X2 + ... + βn*Xn)))
In vector form,
f(X) = 1 / (1 + e^(-βᵀX))
where the intercept term is written as β0*X0, taking X0 = 1
This is the function of Logistic Regression.
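A hedged sketch of the hypothesis f(X) = 1 / (1 + e^(-βᵀX)), assuming X0 = 1 is prepended to absorb the intercept β0 (variable names are illustrative):

```python
import numpy as np

def logistic_hypothesis(beta, X):
    """f(X) = sigmoid(beta^T X) for a batch of rows X, shape (n_samples, n_features)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend X0 = 1 for the intercept
    z = X @ beta                                # Z = beta0 + beta1*X1 + ... + betan*Xn
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 2.0])          # beta0 (intercept), beta1
X = np.array([[0.0], [0.5], [2.0]])   # a single feature X1
print(logistic_hypothesis(beta, X))   # probabilities in (0, 1)
```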
Logistic Regression - Prediction
Let's take the predicted class as:
Ŷ = 1 (or Positive) if f(X) ≥ 0.5
Ŷ = 0 (or Negative) if f(X) < 0.5
Positive ⇒ f(X) ≥ 0.5 ⇒ 1 / (1 + e^(-βᵀX)) ≥ 0.5 ⇒ βᵀX ≥ 0
Negative ⇒ f(X) < 0.5 ⇒ 1 / (1 + e^(-βᵀX)) < 0.5 ⇒ βᵀX < 0
Here, βᵀX = β0 + β1*X1 + β2*X2 + ... + βn*Xn
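A small sketch of this prediction rule, thresholding f(X) at 0.5 (equivalently, checking the sign of βᵀX); the helper name is an assumption, not from the slides:

```python
import numpy as np

def predict(beta, X, threshold=0.5):
    """Return 1 (Positive) when f(X) >= threshold, else 0 (Negative)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend X0 = 1 for the intercept
    probs = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (probs >= threshold).astype(int)

beta = np.array([-1.0, 2.0])
X = np.array([[0.0], [0.5], [2.0]])
print(predict(beta, X))   # [0 1 1] here; matches the sign of beta^T X
```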
Prediction Example
Take a classification problem with 2 independent variables where,
f(X) = 1 / (1 + e^(-(β0 + β1*X1 + β2*X2)))
Decision boundary: Z = β0 + β1*X1 + β2*X2 = 0
Z > 0 ⇒ Positive
Z < 0 ⇒ Negative
[Figure: the X1-X2 plane with the linear decision boundary separating the Positive (Z > 0) and Negative (Z < 0) regions]
Non-linear Classification
Taking polynomials of X values (as discussed in Polynomial Regression)
can classify non-linear data points with non-linear decision boundaries
(see the sketch after this example)
E.g.:
f(X) = 1 / (1 + e^(-(β0 + β1*X1^2 + β2*X2^2)))
Decision boundary: Z = β0 + β1*X1^2 + β2*X2^2 = 0
Z > 0 ⇒ Positive
Z < 0 ⇒ Negative
[Figure: the X1-X2 plane with an elliptical decision boundary separating the Positive (Z > 0) and Negative (Z < 0) regions]
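A minimal sketch of the non-linear case above: squaring the features before applying the same logistic form yields an elliptical decision boundary. The coefficients below are made up purely for illustration:

```python
import numpy as np

def predict_quadratic(beta, X1, X2):
    """Classify using Z = beta0 + beta1*X1^2 + beta2*X2^2 (sign of Z decides)."""
    z = beta[0] + beta[1] * X1**2 + beta[2] * X2**2
    return (z > 0).astype(int)

beta = np.array([-1.0, 1.0, 1.0])       # boundary: X1^2 + X2^2 = 1 (unit circle)
X1 = np.array([0.0, 2.0, 0.5])
X2 = np.array([0.0, 0.0, 0.5])
print(predict_quadratic(beta, X1, X2))  # [0 1 0]: only the point outside the circle is Positive
```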
Binary Logistic Regression – Cost Function
Cost for a single data point is known as the Loss
Take the Loss Function of Logistic Regression as L(f(X), Y)
L(f(X), Y) = −log(f(X))       if Y = 1
L(f(X), Y) = −log(1 − f(X))   if Y = 0
Combined into a single expression,
L(f(X), Y) = −Y*log(f(X)) − (1 − Y)*log(1 − f(X))
Cost function: J(β) = (1/n) Σ_{i=1}^{n} L(f(Xi), Yi)
J(β) = (1/n) Σ_{i=1}^{n} [−Yi*log(f(Xi)) − (1 − Yi)*log(1 − f(Xi))]
This Cost Function is Convex (has a Global Minimum)
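A hedged NumPy sketch of the cost J(β), assuming y is a 0/1 label vector; the small epsilon that guards the logarithm against log(0) is my addition, not part of the slides:

```python
import numpy as np

def binary_cross_entropy(beta, X, y, eps=1e-12):
    """J(beta) = (1/n) * sum of [-y*log(f) - (1-y)*log(1-f)] over the data set."""
    X = np.column_stack([np.ones(len(X)), X])    # prepend X0 = 1 for the intercept
    f = 1.0 / (1.0 + np.exp(-(X @ beta)))
    f = np.clip(f, eps, 1.0 - eps)               # avoid log(0)
    return np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f))

beta = np.array([-1.0, 2.0])
X = np.array([[0.0], [0.5], [2.0]])
y = np.array([0, 1, 1])
print(binary_cross_entropy(beta, X, y))   # average loss over the 3 points
```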
Multiclass Logistic Regression
• Up to now we have looked at Binary Classification problems where
there can be only two outcomes/categories/classes as the Y variable
• When there are more than 2 classes available (only one of them is
positive for any given data point) the problem becomes a Multiclass
Classification problem
• One way to handle Multiclass Classification is to use Binary Classifiers
in a scheme known as One-vs-All (OvA), also known as One-vs-Rest (OvR)
• It trains one binary classifier per class, each predicting the confidence
(probability) of that class against the rest, and the class with the highest
confidence is selected (see the sketch below)
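A minimal sketch of the One-vs-All idea, training one scikit-learn LogisticRegression per class and picking the most confident one; the toy data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 3 classes in 2 features (made up for illustration)
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [0.0, 3.0], [0.2, 3.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary classifier per class: class k vs. the rest
models = []
for k in np.unique(y):
    clf = LogisticRegression()
    clf.fit(X, (y == k).astype(int))
    models.append(clf)

# Predict: take the class whose classifier gives the highest confidence
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
print(scores.argmax(axis=1))   # should reproduce [0 0 1 1 2 2] on this toy set
```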
Multiclass Logistic Regression
• OvA can be used
• When you want to use different binary classifiers (e.g., SVMs or logistic
regression) for each class
• When available memory is limited or training needs to be highly parallelized
• There is another technique for Multiclass Logistic Regression, obtained by
simply generalizing the binary classification form of Logistic Regression
• This General form of Classifier is known as the Softmax Classifier
• There, the Softmax Function is used instead of the Sigmoid function
when there are multiple classes
Softmax Function
• The name Softmax is used, as it is a continuous function
approximation to the Maximum Function, where only one class
(maximum) is allowed to be considered as Positive
• Softmax function is used instead of the Maximum Function to make
the function differentiable
• Softmax Function: S(X)i = e^(Xi) / Σ_{j=1}^{n} e^(Xj)
where Xi is the i-th component of the input vector X and j runs over all n components
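A small sketch of the Softmax Function; subtracting the maximum before exponentiating is a standard numerical-stability trick that I have added, and it does not change the result:

```python
import numpy as np

def softmax(x):
    """S(x)_i = e^(x_i) / sum_j e^(x_j), computed in a numerically stable way."""
    shifted = x - np.max(x)          # subtracting a constant leaves softmax unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

x = np.array([2.0, 1.0, 0.1])
print(softmax(x))          # largest input gets the largest share
print(softmax(x).sum())    # 1.0 -- the outputs sum to 1
```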
Softmax Function
• Softmax function exponentially highlights the value in the dimension
where the value is maximum, while suppressing all other dimensions
• Output values of a vector from a Softmax function sum to 1
• E.g.:
[Figure: an example input vector passed through the Softmax Function to produce the output vector]
Softmax Regression
• Like Z = βᵀX is used for binary classification, Zk = βkᵀX is used for
Multiclass classification, where k is the index of the class
• Note that K β vectors exist as model parameters, one per class
• Like Y is used for binary classification where there is only a single
dependent variable, Multiclass classification has K dependent
variables, each denoted by Yk with its estimator Ŷk
Ŷk = e^(Zk) / Σ_{j=1}^{K} e^(Zj)
Softmax Regression
Loss function:
L(f(X), Y) = −log(Ŷk) = −log(e^(Zk) / Σ_{j=1}^{K} e^(Zj)) = −log(e^(βkᵀX) / Σ_{j=1}^{K} e^(βjᵀX))
where k is the true class of the data point
Cost function (Cross Entropy Loss):
J(β) = − Σ_{i=1}^{N} Σ_{k=1}^{K} I[Yi = k] * log(e^(βkᵀXi) / Σ_{j=1}^{K} e^(βjᵀXi))
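A minimal sketch of the Cross Entropy Cost above, assuming integer class labels Yi in {0, ..., K-1}; the indicator I[Yi = k] simply picks out the log-probability of each point's true class (names, and the epsilon guard, are my own):

```python
import numpy as np

def cross_entropy_cost(B, X, y, eps=1e-12):
    """J(beta) = -sum_i log P(class = y_i | x_i) under the softmax model."""
    Z = X @ B.T                                  # (N, K) scores, Z_ik = beta_k^T x_i
    Z -= Z.max(axis=1, keepdims=True)            # stable softmax
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    picked = P[np.arange(len(y)), y]             # probability of each point's true class
    return -np.sum(np.log(picked + eps))

B = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])   # (K=3, n_features=2), intercept omitted
X = np.array([[2.0, 1.0], [0.0, 2.0]])                 # N=2 data points
y = np.array([0, 2])                                   # true classes
print(cross_entropy_cost(B, X, y))
```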
One Hour Homework
• Officially we have one more hour of work after the end of the lectures
• Therefore, for this week's extra hour you have a homework
• Logistic Regression is the basic building block of Deep Neural Networks (DNNs).
Softmax classifiers are used as-is in DNNs as the final classification layer
• Go through the slides and get a clear understanding of Logistic and Softmax
Regression
• Refer to external sources to clarify any remaining ambiguities
• Good Luck!
Questions?
