2. A picture's worth a thousand words.
• http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#example-classification-plot-classifier-comparison-py
3. Summary of Machine Learning Algorithm Features
Each algorithm is rated on, in order: problem type; results interpretable by you?; easy to explain the algorithm to others?; average predictive accuracy; training speed; prediction speed; amount of parameter tuning needed (excluding feature selection); performs well with a small number of observations?; handles lots of irrelevant features well (separates signal from noise)?; automatically learns feature interactions?; gives calibrated probabilities of class membership?; parametric?; features might need scaling?
• KNN: Either; Yes; Yes; Lower; Fast; Depends on n; Minimal; No; No; No; Yes; No; Yes
• Linear regression: Regression; Yes; Yes; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; N/A; Yes; No (unless regularized)
• Logistic regression: Classification; Somewhat; Somewhat; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; Yes; Yes; No (unless regularized)
• Naive Bayes: Classification; Somewhat; Somewhat; Lower; Fast (excluding feature extraction); Fast; Some for feature extraction; Yes; Yes; No; No; Yes; No
• Decision trees: Either; Somewhat; Somewhat; Lower; Fast; Fast; Some; No; No; Yes; Possibly; No; No
• Random Forests: Either; A little; No; Higher; Slow; Moderate; Some; No; Yes (unless noise ratio is very high); Yes; Possibly; No; No
• AdaBoost: Either; A little; No; Higher; Slow; Fast; Some; No; Yes; Yes; Possibly; No; No
• Neural networks: Either; No; No; Higher; Slow; Fast; Lots; No; Yes; Yes; Possibly; No; Yes
Definitions:
• Parametric: assumes an underlying distribution.
• Non-parametric: no underlying distributional assumptions.
• Calibrated probabilities: a probability between 0 and 1 is computed, rather than simply determining the class.
• Tuning parameters: variables that you can manipulate to get better fits.
4. Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck, quacks like a duck, then it's probably a duck
(Diagram: compute the distance from the test record to the training records, then choose k of the "nearest" records.)
5. Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
(Diagram: an unknown record to be classified among labeled training records.)
6. Definition of Nearest Neighbor
(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor around a test record x.)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
7. Nearest Neighbor Classification
• Compute distance between two points:
  – Euclidean distance: d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
• Determine the class from the nearest neighbor list
  – take the majority vote of class labels among the k-nearest neighbors
  – Weigh the vote according to distance
    • weight factor, w = 1 / d^2
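A minimal scikit-learn sketch of the ideas above: Euclidean distance, k = 3 neighbors, and distance-weighted voting. The toy records and query point are made-up illustrations, not from the slides; note that sklearn's weights="distance" uses 1/d rather than the 1/d^2 factor shown above (a custom callable could supply the latter).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training records with binary class labels (illustrative data only).
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                    [6.0, 9.0], [1.2, 0.5], [7.0, 9.5]])
y_train = np.array([0, 0, 1, 1, 0, 1])

# k = 3 neighbors, Euclidean distance, votes weighted by inverse distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="distance")
knn.fit(X_train, y_train)

print(knn.predict([[1.1, 1.0]]))        # weighted vote of the 3 nearest records
print(knn.predict_proba([[1.1, 1.0]]))  # per-class vote shares
```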
8. Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points
from other classes
9. Nearest Neighbor Classification…
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
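A hedged sketch of how this is usually handled in practice: standardize the attributes before computing distances, here with scikit-learn's StandardScaler in a pipeline. The height/weight/income feature set is the slide's example; X_train and y_train are assumed to exist.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Without scaling, income (up to ~$1M) would dominate the Euclidean distance;
# standardizing puts height, weight, and income on comparable scales.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# scaled_knn.fit(X_train, y_train)  # X_train columns: height (m), weight (lb), income ($)
```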
10. What is a Decision Tree?
• An inductive learning task
  – Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
  – These smaller Boolean tests are less complex than a one-stage classifier
• Let's look at a sample decision tree…
11. Predicting Commute Time
(Decision tree diagram: the root splits on "Leave At" (8 AM, 9 AM, 10 AM); deeper nodes test "Stall?" and "Accident?" with No/Yes branches, and the leaves predict a Short, Medium, or Long commute.)
If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
12. Inductive Learning
• In this decision tree, we made a series of
Boolean decisions and followed the
corresponding branch
– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
• By answering each of these yes/no questions,
we then came to a conclusion on how long our
commute might take
13. Decision Tree Algorithms
• The basic idea behind any decision tree
algorithm is as follows:
– Choose the best attribute(s) to split the remaining
instances and make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
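To make the recursive splitting concrete, here is a small scikit-learn sketch; the Iris dataset and the Gini criterion are illustrative choices, not mandated by the slides.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# At each node the learner picks the attribute/threshold that best splits the
# remaining instances (Gini impurity by default) and recurses on each child.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Print the learned series of Boolean tests.
print(export_text(tree, feature_names=iris.feature_names))
```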
14. How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.
Three candidate test conditions (figure):
• Own Car? (Yes: C0: 6, C1: 4; No: C0: 4, C1: 6)
• Car Type? (Family: C0: 1, C1: 3; Sports: C0: 8, C1: 0; Luxury: C0: 1, C1: 7)
• Student ID? (each of the IDs c1 to c20 isolates a single record: c1 to c10 hold C0: 1, C1: 0; c11 to c20 hold C0: 0, C1: 1)
Which test condition is the best?
15. How to determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity
16. Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
17. Measure of Impurity: GINI
• Gini Index for a given node t:
  GINI(t) = 1 - \sum_j [p(j \mid t)]^2
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting information
• Examples:
  – C1: 0, C2: 6 → Gini = 0.000
  – C1: 2, C2: 4 → Gini = 0.444
  – C1: 3, C2: 3 → Gini = 0.500
  – C1: 1, C2: 5 → Gini = 0.278
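A tiny helper that reproduces the slide's Gini numbers from raw class counts; a sketch for illustration only.

```python
def gini(counts):
    """Gini index of a node from its per-class record counts: 1 - sum_j p_j**2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Reproduces the node examples on the slide.
print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5
print(round(gini([1, 5]), 3))  # 0.278
```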
19. Alternative Splitting Criteria based on INFO
• Entropy at a given node t:
  Entropy(t) = -\sum_j p(j \mid t) \log p(j \mid t)
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Measures homogeneity of a node.
    • Maximum (\log n_c) when records are equally distributed among all classes, implying least information
    • Minimum (0.0) when all records belong to one class, implying most information
  – Entropy-based computations are similar to the GINI index computations
21. Splitting Based on INFO...
• Information Gain:
  GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)
  where parent node p is split into k partitions and n_i is the number of records in partition i.
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
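A short sketch of entropy and information gain, applied to the "Own Car?" candidate split from slide 14; the helper names are my own, not from the slides.

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts: -sum_j p_j log2 p_j."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Entropy(parent) minus the size-weighted entropy of the child partitions."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# "Own Car?" split from slide 14: parent 10/10, children 6/4 (Yes) and 4/6 (No).
print(round(information_gain([10, 10], [[6, 4], [4, 6]]), 3))  # ≈ 0.029
```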
22. Stopping Criteria for Tree Induction
• Stop expanding a node when all the
records belong to the same class
• Stop expanding a node when all the
records have similar attribute values
• Early termination (to be discussed later)
23. Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
24. Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
25. Notes on Overfitting
• Overfitting results in decision trees that are
more complex than necessary
• Training error no longer provides a good
estimate of how well the tree will perform
on previously unseen records
• Need new ways for estimating errors
26. How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if class distribution of instances is independent of the available features (e.g., using the χ² test)
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
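In scikit-learn, pre-pruning corresponds to the early-stopping hyperparameters of DecisionTreeClassifier; the threshold values below are illustrative assumptions, and X_train/y_train are assumed to exist.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing the tree early instead of building it out fully.
pruned_tree = DecisionTreeClassifier(
    max_depth=4,                 # cap the depth of the tree
    min_samples_split=20,        # stop if a node holds fewer than 20 instances
    min_impurity_decrease=0.01,  # stop if the best split barely reduces impurity
)
# pruned_tree.fit(X_train, y_train)
```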
27. Bayes Classifiers
Intuitively, Naïve Bayes computes the probability of a previously unseen instance belonging to each class, then simply picks the most probable class.
http://blog.yhat.com/posts/naive-bayes-in-python.html
28. Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which
says
p(cj | d) = p(d | cj) p(cj) / p(d)
• p(cj | d) = probability of instance d being in class cj.
  This is what we are trying to compute.
• p(d | cj) = probability of generating instance d given class cj.
  We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj.
  This is just how frequent the class cj is in our database.
• p(d) = probability of instance d occurring.
  This can actually be ignored, since it is the same for all classes.
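A toy numeric sketch of this computation; the class names, priors, and likelihoods are invented for illustration.

```python
# Hypothetical numbers for a two-class problem: class priors p(c) and the
# likelihood p(d | c) of observing instance d under each class.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": 0.030, "ham": 0.004}

# Unnormalized posteriors p(d | c) * p(c); p(d) is identical for every class,
# so the most probable class can be chosen without ever computing it.
scores = {c: likelihoods[c] * priors[c] for c in priors}
predicted = max(scores, key=scores.get)

# Dividing by p(d) = sum of the scores recovers proper posterior probabilities.
p_d = sum(scores.values())
posteriors = {c: s / p_d for c, s in scores.items()}
print(predicted, posteriors)
```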
29. Different Naïve Bayes Models
• Multivariate Bernoulli Naive Bayes: the Bernoulli model is useful if your feature vectors are binary (i.e., 0s and 1s). One application would be text classification with a bag-of-words model where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document".
• Multinomial Naive Bayes: the multinomial naive Bayes model is typically used for discrete counts. E.g., in a text classification problem, we can take the idea of Bernoulli trials one step further: instead of "word occurs in the document" we use "how often the word occurs in the document"; you can think of it as "the number of times outcome x_i is observed over the n trials".
• Gaussian Naive Bayes: here, we assume that the features follow a normal distribution. Instead of discrete counts, we have continuous features (e.g., the popular Iris dataset, where the features are sepal width, petal width, sepal length, and petal length).
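A brief sklearn sketch of the Gaussian case; Iris is the slide's own example, and the printed accuracy is on the training data, for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# GaussianNB: continuous features assumed normal within each class (e.g., Iris).
X, y = load_iris(return_X_y=True)
print(GaussianNB().fit(X, y).score(X, y))

# For the other two models described above, sklearn provides
# sklearn.naive_bayes.MultinomialNB (count features, e.g. word counts) and
# sklearn.naive_bayes.BernoulliNB (binary occurs / does-not-occur features).
```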
30. Check out these websites for more!
• http://www.datasciencecentral.com/profiles/blogs/naive-bayes-for-dummies-a-simple-explanation
• http://blog.yhat.com/posts/naive-bayes-in-python.html
• In sklearn:
• http://scikit-learn.org/stable/modules/naive_bayes.html
31. Logistic Regression vs. Naïve Bayes
• Logistic Regression Idea:
• Naïve Bayes allows computing P(Y|X) by
learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
32. The Logistic Function
• We want a model that predicts probabilities between 0 and 1 and that is S-shaped.
• There are lots of S-shaped curves. We use the logistic model:
  P = \exp(\beta_0 + \beta_1 X) / [1 + \exp(\beta_0 + \beta_1 X)], or equivalently \log_e[P / (1 - P)] = \beta_0 + \beta_1 X
• The function on the left of the second form, \log_e[P / (1 - P)], is called the logit (the log odds).
(Figure: the S-shaped logistic curve P(y \mid x) = e^{\beta_0 + \beta_1 x} / (1 + e^{\beta_0 + \beta_1 x}), with the predicted probability on the vertical axis ranging from 0.0 to 1.0.)
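A small numpy sketch of the logistic curve and its inverse, the logit; the intercept and slope values are arbitrary assumptions.

```python
import numpy as np

def logistic(x, b0=-1.0, b1=2.0):
    """P = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)); b0 and b1 are illustrative."""
    return np.exp(b0 + b1 * x) / (1.0 + np.exp(b0 + b1 * x))

def logit(p):
    """Log odds: log(P / (1 - P))."""
    return np.log(p / (1.0 - p))

x = np.linspace(-4.0, 4.0, 9)
p = logistic(x)
print(p.min(), p.max())                       # stays strictly between 0 and 1
print(np.allclose(logit(p), -1.0 + 2.0 * x))  # the logit recovers b0 + b1*x
```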
33. Logistic Regression Function
• Logistic regression models the logit of the outcome instead of the outcome itself; i.e., instead of winning or losing, we build a model for the log odds of winning or losing.
• Natural logarithm of the odds of the outcome:
  \ln\left(\frac{P}{1 - P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i
  where P is the probability of the outcome and 1 - P is the probability of not having the outcome.
• Equivalently, P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}.
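A hedged scikit-learn sketch of fitting a logistic regression and reading its coefficients as effects on the log odds; the breast-cancer dataset and the standardization step are my own illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Each beta is the change in the log odds per (standardized) unit of X;
# exp(beta) is the multiplicative effect on the odds ratio.
betas = model.named_steps["logisticregression"].coef_[0]
print(np.exp(betas)[:5])
print(model.predict_proba(X[:3]))   # predicted probabilities between 0 and 1
```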
34. ROC Curves
• Originated from signal detection theory
– Binary signal corrupted by Gaussian noise
– What is the optimal threshold (i.e. operating
point)?
• Dependence on 3 factors
– Signal Strength
– Noise Variance
– Personal tolerance in Hit / False Alarm Rate
35. ROC Curves
• Receiver operator characteristic
• Summarize & present performance of any
binary classification model
• Measures a model's ability to distinguish between false & true positives
36. Use Multiple Contingency Tables
• Sample contingency tables from a range of thresholds/probabilities.
• TRUE POSITIVE RATE (also called SENSITIVITY) = True Positives / [(True Positives) + (False Negatives)]
• FALSE POSITIVE RATE (also called 1 - SPECIFICITY) = False Positives / [(False Positives) + (True Negatives)]
• Plot Sensitivity vs. (1 - Specificity) for the sampled thresholds and you are done
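A sketch of building the ROC curve by sweeping thresholds over predicted probabilities with scikit-learn; the dataset and classifier are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

# Sweep the threshold: each point is (false positive rate, true positive rate).
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(roc_auc_score(y_te, scores))       # area under the ROC curve
# Plotting fpr vs. tpr gives sensitivity vs. (1 - specificity).
```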
42. Pros/Cons of Various Classification Algorithms
Logistic regression: no distribution requirement; performs well with categorical variables that have few categories; computes the logistic distribution; easy to interpret; provides confidence intervals; suffers from multicollinearity.
Decision Trees: no distribution requirement; heuristic; good for variables with few categories; do not suffer from multicollinearity (by choosing one of the correlated features); interpretable.
Naïve Bayes: generally no requirements; good for variables with few categories; computes the product of independent distributions; suffers from multicollinearity.
SVM: no distribution requirement; computes hinge loss; flexible selection of kernels for nonlinear correlation; does not suffer from multicollinearity; hard to interpret.
Bagging, boosting, ensemble methods (RF, AdaBoost, etc.): generally outperform the single algorithms listed above.
Source: Quora
43. Prediction Error and the Bias-Variance Tradeoff
• A good measure of the quality of an estimator \hat{f}(x) is the mean squared error. Let f_0(x) be the true value of f(x) at the point x. Then
  Mse[\hat{f}(x)] = E[(\hat{f}(x) - f_0(x))^2]
• This can be written as variance + bias^2:
  Mse[\hat{f}(x)] = Var[\hat{f}(x)] + [E\hat{f}(x) - f_0(x)]^2
• Typically, when bias is low, variance is high and vice versa. Choosing estimators often involves a tradeoff between bias and variance.
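A simulation sketch of this decomposition; the true function, noise level, and polynomial estimators are invented for illustration. Estimators of increasing complexity are fit on many simulated training sets, and the mean squared error at a single point is checked against variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def f0(x):                              # the true function f0(x)
    return np.sin(2 * np.pi * x)

x0 = 0.3                                # the point x at which we evaluate f_hat

for degree in (1, 3, 7):                # low- to high-complexity estimators
    preds = []
    for _ in range(500):                # many independent training sets
        x = rng.uniform(0, 1, 30)
        y = f0(x) + rng.normal(0, 0.3, 30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f0(x0)) ** 2
    var = preds.var()
    mse = np.mean((preds - f0(x0)) ** 2)
    print(degree, round(mse, 4), round(var + bias2, 4))  # MSE = variance + bias^2
```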
The logistic function is a non-linear function of the independent variables. It is bound between 0 and 1, which is what we want. The range of values between 0 and 1 predict the probability of an event occurring/not occurring.
We convert this non-linear function into a linear relationship using the LOG of the ODDS ratio.
It's easy to prove mathematically that the P(y|x) function can be converted to log(p / (1 - p)) = α + βX. This is the linear transformation.
The logit is a linear function of the Xs. However, having log(P/(1-P)) on the Y axis is not very helpful. We have to compute the actual probability. To do that we have to use the exponential function.
It's really important to understand what we are measuring. When we run a logistic regression, the betas measure the impact of X on the LOG of the odds ratio. Taking the exponential of a beta gives the impact of X on the ODDS RATIO.
ROC curve: the Y axis is the true positive rate, (predicted true positives) / (total actual positives), i.e. the hit rate. The X axis is the false alarm rate: the percentage of the actual negatives that the classifier gets wrong.
It’s important to understand this concept well, and I’m hopeful you all are understanding it now. The Mean square error can be decomposed into two parts – the variance and the squared bias. The important point here is that there is a tradeoff between variance and bias. Our least squares model provides us the BEST LINEAR UNBIASED estimator. If we are willing to give up on the accuracy of the coefficient estimate i.e. accept some bias, then we can lower the variance and the net effect of that could be a lower mean square error.
Let's see how that's done.
I think this simple diagram provides an excellent understanding of the tradeoffs, and why we should consider models with non-zero bias. Notice that the blue line representing bias decreases as the model gets more complex (and we keep all the Xs, as we do in our multiple linear regression model). But when we have many Xs and we want the BLUE estimator, we are accepting a very high variance for no bias. But the best model FOR PREDICTION PURPOSES lies somewhere to the left, where we have some bias and low variance.
So now we are going to look at models which allow some bias in the betas.