Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It is one of the simplest and most widely used techniques in statistical modeling and predictive analysis.
Decision Theory
• Target concept: t
• Discrete: f(x) ∈ {Yes, No}, i.e., t ∈ {0, 1}
• Continuous: f(x) ∈ [20, 100]
• Probability estimation: f(x) ∈ [0, 1]
Binary Classification
Decision Theory
The decision problem: given x, predict t according to a probabilistic model p(x, t). For a feature vector X = ⟨x₁, x₂, …, x_d⟩, we want the posterior P(C_k | X).
Determining p(x, C_k) is the (central!) inference problem.
Classification Problem: Stages
[Diagram: Training set → learning algorithm → model for p(x, C_k) (induction/inference step); test set → apply model to find the optimal t (deduction/decision step).]
Decision Region
[Diagram: Training set → learning algorithm → model for p(x, C_k) (induction/inference step).]
The model divides the input space into regions R_k, called decision regions, one for each class, such that all points in R_k are assigned to class C_k. A mistake occurs when an input vector belonging to class C1 is assigned to class C2.
Misclassification Rate
For two classes, the probability of a mistake is
p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
To minimize p(mistake), each x is assigned to whichever class has the smaller value of the integrand. Since p(x, C_k) = p(C_k | x) p(x), the minimum probability of making a mistake is obtained if each value of x is assigned to the class for which the posterior probability p(C_k | x) is largest.
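As a minimal sketch of this rule, assuming we already have posterior probabilities p(C_k | x) for a batch of inputs (the values below are made up for illustration), the optimal decision is a per-row argmax:

```python
import numpy as np

# Illustrative posteriors p(Ck|x) for three inputs and two classes
# (values are made up for this sketch).
posteriors = np.array([
    [0.9, 0.1],   # x1: p(C1|x1) = 0.9 -> assign C1
    [0.4, 0.6],   # x2: p(C2|x2) = 0.6 -> assign C2
    [0.2, 0.8],   # x3: p(C2|x3) = 0.8 -> assign C2
])

# Assigning each x to the class with the largest posterior minimizes
# the probability of a mistake.
predicted = np.argmax(posteriors, axis=1)
print(predicted)  # [0 1 1] (0 -> C1, 1 -> C2)
```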
Prediction – Multi-class Classification
One-vs-All (One-vs-Rest) Strategy
[Plots: three-class training data in (x₁, x₂), and the binary sub-problem for each class with its own decision boundary.]
Train one binary classifier per class:
h_θ^(i)(x) = P(y = i | x; θ), for i = 1, 2, 3
For an input x, predict: max_i h_θ^(i)(x)
Note: Scikit-Learn detects when you try to use a binary classification algorithm for a multi-class classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).
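A hedged sketch of OvA in code, using scikit-learn's OneVsRestClassifier explicitly, with the Iris dataset standing in for the three-class example above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes stand in for the example above

# One binary logistic classifier per class (class i vs. the rest);
# prediction picks the class whose classifier scores highest.
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_clf.fit(X, y)

print(len(ovr_clf.estimators_))  # 3 binary classifiers, one per class
print(ovr_clf.predict(X[:5]))
```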
Prediction – Multi-class Classification
One-vs-One Strategy
[Plots: three-class training data in (x₁, x₂), and one binary sub-problem per pair of classes with its own decision boundary.]
Train one binary classifier for every pair of classes:
h_θ^(i)(x) = P(y = i | x; θ), for i = 1, 2, 3
For an input x, predict: max_i h_θ^(i)(x)
With N classes, this requires N × (N – 1) / 2 classifiers.
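A corresponding sketch for OvO, again with Iris (N = 3) as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # N = 3 classes

# One binary classifier per pair of classes: N * (N - 1) / 2 in total;
# the predicted class is the one that wins the most pairwise contests.
ovo_clf = OneVsOneClassifier(SVC())
ovo_clf.fit(X, y)

print(len(ovo_clf.estimators_))  # 3 * 2 / 2 = 3 pairwise classifiers
print(ovo_clf.predict(X[:5]))
```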
Types of Classification

Sky    AirTemp  Humidity  Wind    Forecast  EnjoySport?
Sunny  Warm     Normal    Strong  Same      Yes
Sunny  Warm     High      Strong  Same      No
Rainy  Cold     High      Strong  Change    No
Sunny  Warm     Normal    Breeze  Same      Yes
Sunny  Hot      Normal    Breeze  Same      No
Rainy  Cold     High      Strong  Change    No
Sunny  Warm     High      Strong  Change    Yes
Rainy  Warm     Normal    Breeze  Same      Yes
Decision Theory: Interpretation and Model Building (Generative)
[Plot: class-conditional densities for the two classes over (x₁, x₂).]
Generative models first solve the inference problem via Bayes' rule:
P(Y | X₁, X₂, …, X_d) = P(X₁, X₂, …, X_d | Y) P(Y) / P(X₁, X₂, …, X_d)
They are known as generative models because, by sampling from them, it is possible to generate synthetic data points in the input space.
E.g., Gaussians, Naïve Bayes, mixtures of multinomials, mixtures of Gaussians, Bayesian networks.
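As a minimal generative-model sketch, a Naïve Bayes classifier fit on a few EnjoySport-style rows (the encoding and the tiny sample here are illustrative, not the slide's exact computation):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# A few rows in the spirit of the EnjoySport table above.
X_raw = [
    ["Sunny", "Warm", "Normal", "Strong", "Same"],
    ["Sunny", "Warm", "High",   "Strong", "Same"],
    ["Rainy", "Cold", "High",   "Strong", "Change"],
    ["Sunny", "Warm", "Normal", "Breeze", "Same"],
]
y = [1, 0, 0, 1]  # EnjoySport: Yes = 1, No = 0

# Encode categorical attributes as integer codes.
X = OrdinalEncoder().fit_transform(X_raw)

# Naive Bayes estimates P(Y) and P(Xj | Y), then applies Bayes' rule.
model = CategoricalNB().fit(X, y)
print(model.predict_proba(X))  # posteriors P(Y | X1..Xd)
```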
Types of Classification
Decision Theory: Interpretation and Model Building (Discriminative)
Discriminative models learn the decision boundary directly:
predict y = 1 when θ₀ + Σᵢ θᵢxᵢ ≥ 0, and y = 0 when θ₀ + Σᵢ θᵢxᵢ < 0.
[Plot: data in (x₁, x₂) with a linear decision boundary separating y = 1 from y = 0.]
E.g., logistic regression, SVMs, tree-based classifiers (e.g., decision trees), traditional neural networks, nearest neighbor.
Types of Classification
Decision Theory: Interpretation and Model Building
Tree-based classifiers yield explicit decision rules, e.g.:
IF OUTLOOK = Overcast THEN PLAY = Yes
ELSE IF OUTLOOK = Rain AND WIND = Strong THEN PLAY = No
Logistic Regression vs Least Squares Regression
[Plot: malignant? ((Yes) 1 / (No) 0) vs. tumor size, with a straight line fit to the binary labels.]
• Can we solve the problem using linear regression? E.g., fit a straight line and define a threshold at 0.5.
• Threshold the classifier output h(x) at 0.5:
  If h(x) ≥ 0.5, predict "y = 1"; if h(x) < 0.5, predict "y = 0".
A discriminant function f(x) directly maps inputs to class labels; in a two-class problem, f(·) is binary valued.
Decision Rules
• Classifier: f(x, w) = w₀ + wᵀx (a linear discriminant function)
• Decision rule: y = sign(w₀ + wᵀx)
• This specifies a linear classifier: its decision boundary is the hyperplane w₀ + wᵀx = 0.
A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k.
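A minimal numpy sketch of this rule, with illustrative weights (w₀ and w below are arbitrary, not fitted):

```python
import numpy as np

w0 = -3.0                  # illustrative bias, not fitted
w = np.array([1.0, 1.0])   # illustrative weights, not fitted

def predict(x):
    # y = sign(w0 + w^T x): +1 on one side of the hyperplane, -1 on the other
    return np.sign(w0 + w @ x)

print(predict(np.array([2.0, 2.0])))  #  1.0 (w0 + w.x =  1 > 0)
print(predict(np.array([1.0, 1.0])))  # -1.0 (w0 + w.x = -1 < 0)
```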
Learning model parameters
• Training set: {(x^(1), y^(1)), …, (x^(m), y^(m))} (m examples)
• How to choose the parameters (feature weights)?
Logistic Regression
• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)
  – e.g., θ₀ = −3, θ₁ = 1, θ₂ = 1
• Predict "y = 1" if −3 + x₁ + x₂ ≥ 0
• At the decision boundary, the output of logistic regression is 0.5.
[Plot: tumor size vs. age, with the decision boundary separating the two classes.]
Slide credit: Andrew Ng
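A quick sketch of this example in code (g is the sigmoid; θ is taken directly from the slide):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid

theta = np.array([-3.0, 1.0, 1.0])    # theta0, theta1, theta2 from the slide

def h(x1, x2):
    return g(theta @ np.array([1.0, x1, x2]))

print(h(3.0, 3.0))  # ~0.95 -> predict y = 1 (since -3 + 3 + 3 >= 0)
print(h(1.0, 1.0))  # ~0.27 -> predict y = 0
print(h(1.0, 2.0))  # 0.5: exactly on the boundary -3 + x1 + x2 = 0
```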
Logistic Regression
• Training set: {(x^(1), y^(1)), …, (x^(m), y^(m))} (m examples, n features)
• How to choose the parameters (feature weights)?
• Reusing the squared-error cost of linear regression with the sigmoid hypothesis makes J(θ) "non-convex"; the cross-entropy cost below is "convex", so gradient descent can reach its global minimum.
Logistic regression cost function (cross entropy)
If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
[Plot: cost vs. h_θ(x) for y = 1; the cost is 0 at h_θ(x) = 1 and grows without bound as h_θ(x) → 0.]
Logistic regression cost function
If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))
[Plot: cost vs. h_θ(x) for y = 0; Cost = 0 if y = 0 and h_θ(x) = 0, and the cost grows without bound as h_θ(x) → 1.]
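A short sketch showing how this cost behaves at the extremes (the h values below are illustrative predictions):

```python
import numpy as np

def cost(h, y):
    # Cross-entropy cost for a single example.
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost(0.99, 1))  # ~0.01: confident and correct -> near-zero cost
print(cost(0.01, 1))  # ~4.61: confident and wrong   -> large cost
print(cost(0.01, 0))  # ~0.01: confident and correct -> near-zero cost
```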
To fit the parameters θ: apply the gradient descent algorithm to the cost function J(θ).
To make a prediction given a new x: output h_θ(x) = 1 / (1 + e^(−θᵀx)).
Logistic Regression – Inference & Interpretation

CGPA  IQ   IQ   Job Offered
5.5   6.7  100  1
5     7    105  0
8     6    90   1
9     7    105  1
6     8    120  0
7.5   7.3  110  0

Assume: h(x) = σ(0.4 + 0.3·CGPA − 0.45·IQ)
Predict the job offer for a candidate with (CGPA, IQ) = (5, 6):
h(x) = σ(0.4 + 0.3·5 − 0.45·6) = σ(−0.8) ≈ 0.31
Predicted y = 0 / No

Note:
The exponential of a regression coefficient (e^(w_cgpa)) is the odds ratio associated with a one-unit increase in CGPA: the odds of being offered a job increase by a factor of e^0.3 ≈ 1.35 for every unit increase in CGPA [np.exp(model.params)].
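A sketch of the slide's arithmetic in numpy (the coefficients are the assumed ones above, not fitted from the table):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

cgpa, iq = 5.0, 6.0               # the candidate from the slide
z = 0.4 + 0.3 * cgpa - 0.45 * iq  # = -0.8
print(sigmoid(z))                 # ~0.31 -> predict y = 0 / No

# Odds-ratio interpretation: exp(coefficient) per one-unit increase,
# which is what np.exp(model.params) reports for a fitted statsmodels model.
print(np.exp(0.3))                # ~1.35 for CGPA
```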
Logistic regression (Classification)
• Model:
  h_θ(x) = P(Y = 1 | X₁, X₂, ⋯, X_n) = 1 / (1 + e^(−θᵀx))
• Cost function:
  J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x^(i)), y^(i)), where
  Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1; −log(1 − h_θ(x)) if y = 0
• Learning via gradient descent:
  Repeat { θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i)) xⱼ^(i) }
• Inference:
  ŷ = h_θ(x_test) = 1 / (1 + e^(−θᵀx_test))
Note:
• σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a logistic model predicts 1 if θᵀx is positive and 0 if it is negative.
• logit(p) = log(p / (1 − p)) is the inverse of the logistic function: if you compute the logit of the estimated probability p, you recover t. The logit is also called the log-odds.
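A compact numpy sketch of the whole recipe (model, cross-entropy gradient, batch gradient descent, inference) on synthetic data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: m examples, n features, labels drawn from a known model.
rng = np.random.default_rng(0)
m, n = 200, 2
X = np.c_[np.ones(m), rng.normal(size=(m, n))]  # prepend bias column x0 = 1
true_theta = np.array([-1.0, 2.0, -2.0])
y = (rng.uniform(size=m) < sigmoid(X @ true_theta)).astype(float)

# Batch gradient descent: theta_j := theta_j - alpha * (1/m) * sum_i (h - y) * x_j.
theta = np.zeros(n + 1)
alpha = 0.5
for _ in range(2000):
    theta -= alpha * (1.0 / m) * (X.T @ (sigmoid(X @ theta) - y))

print(theta)  # should roughly recover true_theta

# Inference on a new point x_test (bias term included).
x_test = np.array([1.0, 0.5, -0.5])
print(sigmoid(x_test @ theta))  # P(y = 1 | x_test)
```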
Overfitting vs Underfitting
Overfitting
• Fitting the data too well
  – Features are noisy / uncorrelated to the concept
Underfitting
• Learning too little of the true concept
  – Features don't capture the concept
  – Too much bias in the model
Regularization
Note: This topic was already covered in Module 3; a refresher and a few more points are added here.
Ways to Control Overfitting
• Regularization: add a penalty on the weights to the loss, e.g., with an L1 penalty:
  Loss(S) = Σᵢ₌₁ⁿ Loss(ŷᵢ, yᵢ) + α Σⱼ₌₁^#weights |θⱼ|
Note:
The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models) but its inverse, C: the higher the value of C, the less the model is regularized.
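A small sketch of the C behavior with scikit-learn (the dataset choice and values of C are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Smaller C = stronger regularization (C is the inverse of the strength).
strong = make_pipeline(StandardScaler(), LogisticRegression(C=0.01)).fit(X, y)
weak = make_pipeline(StandardScaler(), LogisticRegression(C=100.0)).fit(X, y)

# Stronger regularization shrinks the learned weights toward zero.
print(abs(strong[-1].coef_).sum())  # smaller total weight magnitude (typically)
print(abs(weak[-1].coef_).sum())    # larger
```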