This is a presentation made for our Intro to Machine Learning class. As a result, it focuses more on the use of logit regression as a classifier than on its statistical applications. Many of the slides are based on Stanford's Open Course in machine learning.
3. Regression Analysis + Classification
How can we predict a nominal class using regression analysis?
Consider a binary class:
Each instance x is a vector of feature values
Our output values or class labels are restricted to 0 or 1, i.e. f(x) ∈ {0, 1}
We need an h(x) where 0 < h(x) < 1
We need a function which exhibits this behavior
4. Logistic Functions
Sigmoid Function: σ(x) = 1 / (1 + e^(-x))
Asymptotes at y = 1 and y = 0
Easy to specify a threshold (σ(0) = .5)
Results can be read as P(y = 1)
As a result: hθ(x) = σ(θᵀx), where θ is a vector of weights
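A minimal sketch of that hypothesis in Python (NumPy assumed; the names sigmoid and h are mine, not from the slides):

    import numpy as np

    def sigmoid(z):
        # Maps any real number into the open interval (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        # Hypothesis h_theta(x) = sigmoid(theta^T x), read as P(y = 1 | x)
        return sigmoid(np.dot(theta, x))

    sigmoid(0.0)   # 0.5, the natural classification threshold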
5. Cost Function
Need to find an hθ(x), a logistic function that represents our data
Need to find θ to fit our data
Cost for a single instance: -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0
Over all m training instances: J(θ) = -(1/m) Σ [y log(hθ(x)) + (1 - y) log(1 - hθ(x))]
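That cost could be computed roughly as follows (a sketch assuming NumPy, a design matrix X, and 0/1 labels y; the name cost is mine):

    import numpy as np

    def cost(theta, X, y):
        # Cross-entropy cost J(theta); X is the (m, n) design matrix,
        # y the (m,) array of 0/1 labels
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        # -log(h) punishes confident mistakes when y = 1,
        # -log(1 - h) does the same when y = 0
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))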
6. Gradient Descent
In order to find the minimum, we can use the partial derivative of J(θ):
do {
    θj := θj - α ∂J(θ)/∂θj (simultaneously for every j)
} until θ converges
Where α is the learning rate (almost always between 0 and 1; .1 to .3 is usually a good range)
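A runnable sketch of that loop (NumPy assumed; gradient_descent is my name, and the gradient uses the standard closed form (1/m) Xᵀ(hθ(x) - y)):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, alpha=0.1, n_iters=10000, tol=1e-7):
        # Batch gradient descent on J(theta); X should carry a leading
        # column of ones so theta[0] acts as the intercept
        m = len(y)
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            grad = X.T @ (sigmoid(X @ theta) - y) / m   # dJ/dtheta_j for all j
            theta -= alpha * grad                       # simultaneous update
            if np.max(np.abs(grad)) < tol:              # crude stopping criterion
                break
        return theta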
7. Maximum Likelihood Estimation
Equivalently, we can find the θ that maximizes the log-likelihood of the training data, ℓ(θ) = Σ [y log(hθ(x)) + (1 - y) log(1 - hθ(x))], by gradient ascent:
do {
    θ := θ + α Σ (y - hθ(x)) x (summed over the training instances)
} until θ converges
Can also be calculated using Iteratively Reweighted Least Squares
Multinomial data uses Softmax Regression
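A sketch of the Iteratively Reweighted Least Squares (Newton's method) update in Python, assuming NumPy and a design matrix X with a leading column of ones; irls is my name for it:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def irls(X, y, n_iters=25, tol=1e-8):
        # Newton / IRLS update: theta += (X^T W X)^(-1) X^T (y - p),
        # where W = diag(p * (1 - p)) is the reweighting term
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            p = sigmoid(X @ theta)               # current P(y = 1) per instance
            W = p * (1.0 - p)                    # diagonal of the weight matrix
            H = X.T @ (W[:, None] * X)           # Hessian of the negative log-likelihood
            step = np.linalg.solve(H, X.T @ (y - p))
            theta += step
            if np.max(np.abs(step)) < tol:       # converged
                break
        return theta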
9. Interpreting hθ
I want to create a model that gives me the probability that I will pass a test, given how many hours I have studied
Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass     0    0    0    0    0    0    1    0    1    0    1    0    1    0    1    1    1    1    1    1
Using this generated model, calculate my probability of passing given I have studied 3 hours:
P(pass | study time = 3) ≈ .61
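One way to reproduce that number (a sketch using scikit-learn, which the slide does not name; C is set very large to approximately disable regularization so the fit matches the plain maximum-likelihood model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                      2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
    passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

    model = LogisticRegression(C=1e6).fit(hours.reshape(-1, 1), passed)
    print(model.predict_proba([[3.0]])[0, 1])   # ~0.61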
11. vs Decision Tree
Assumptions:
DT: decision boundaries are parallel to the axes
LR: one smooth decision boundary
Decision trees can be used when there are multiple decision boundaries
12. Feature Weights vs Naive Bayes
NB: each weight is set independently, depending on the class
LR: weights are set together, such that the decision function tends to be high for positive classes and low for negative classes
Because the weights are fit jointly, correlated features are not double-counted by logistic regression
13. vs Support Vector Machine
Both attempt to find a hyperplane separating the training samples
SVM: finds the solution with the maximum margin
LR: finds any solution that separates the instances
SVM is a hard classifier while LR is probabilistic
14. Advantages
Works well with diagonal decision boundaries
Does not give undue weight to correlated features
Probabilistic outcomes
Disadvantages
Requires a large sample size for stable results
16. For more info...
Helpful links to go into more depth with Logistic Regression:
Stanford Open Course (logit regression section)
Logit Regression Tutorial (exercises in MATLAB)
Logit Regression Tutorial (no code)
How to use Logit Regression in Python
How to use Logit Regression in R
How to use Logit Regression in Java using Weka
Editor's Notes
hθ(x) = σ(θᵀx)
We're trying to find the most likely θ given our training instances. This can be nondeterministic; we also need stopping criteria.
We can actually add terms such as x1^2 or x1*x2^4 to make the decision boundary non-linear (see the sketch below).
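For instance, with scikit-learn (my choice of tool, not mentioned in the slides):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0]])     # one instance with features x1, x2
    poly = PolynomialFeatures(degree=2)
    poly.fit_transform(X)          # [[1, x1, x2, x1^2, x1*x2, x2^2]]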
The first entry in our instance vector is always 1, so that the corresponding weight acts as the intercept.
When we say theta transpose x, we mean that we transpose θ and then multiply it by x using matrix multiplication, which gives the dot product θᵀx.