Machine Learning: Classification Algorithms Review
A picture’s worth a thousand words..
• http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#example-classification-plot-classifier-comparison-py
SUMMARY OF MACHINE LEARNING ALGORITHM FEATURES

For each algorithm, the slide rates: problem type; results interpretable by you?; easy to explain the algorithm to others?; average predictive accuracy; training speed; prediction speed; amount of parameter tuning needed (excluding feature selection); performs well with a small number of observations?; handles lots of irrelevant features well (separates signal from noise)?; automatically learns feature interactions?; gives calibrated probabilities of class membership?; parametric?; features might need scaling?

• KNN: Either; Yes; Yes; Lower; Fast; Depends on n; Minimal; No; No; No; Yes; No; Yes
• Linear regression: Regression; Yes; Yes; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; N/A; Yes; No (unless regularized)
• Logistic regression: Classification; Somewhat; Somewhat; Lower; Fast; Fast; None (excluding regularization); Yes; No; No; Yes; Yes; No (unless regularized)
• Naive Bayes: Classification; Somewhat; Somewhat; Lower; Fast (excluding feature extraction); Fast; Some for feature extraction; Yes; Yes; No; No; Yes; No
• Decision trees: Either; Somewhat; Somewhat; Lower; Fast; Fast; Some; No; No; Yes; Possibly; No; No
• Random Forests: Either; A little; No; Higher; Slow; Moderate; Some; No; Yes (unless noise ratio is very high); Yes; Possibly; No; No
• AdaBoost: Either; A little; No; Higher; Slow; Fast; Some; No; Yes; Yes; Possibly; No; No
• Neural networks: Either; No; No; Higher; Slow; Fast; Lots; No; Yes; Yes; Possibly; No; Yes

Definitions:
• Parametric: assumes an underlying distribution; non-parametric: no underlying distributional assumptions.
• Calibrated probabilities: a probability between 0 and 1 is computed, rather than simply determining the class.
• Tuning parameters: variables that you can manipulate to get better fits.
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
(Diagram: compute the distance from the test record to the training records, then choose the k “nearest” records.)
Nearest-Neighbor Classifiers
• Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
• To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
(Diagram: an unknown record to be classified among the labeled training records.)
Definition of Nearest Neighbor
(Figure: the neighborhoods of a record x for (a) 1-nearest neighbor, (b) 2-nearest neighbor, and (c) 3-nearest neighbor.)
K-nearest neighbors of a record x are data points
that have the k smallest distance to x
Nearest Neighbor Classification
• Compute distance between two points:
– Euclidean distance
• Determine the class from nearest neighbor
list
– take the majority vote of class labels among
the k-nearest neighbors
– Weigh the vote according to distance
• weight factor, w = 1/d^2
Euclidean distance: d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
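A minimal sketch of nearest-neighbor classification in scikit-learn (the dataset, k, and settings are illustrative, not from the slides; note that scikit-learn’s built-in `weights="distance"` uses 1/d rather than the 1/d^2 factor above):

```python
# Illustrative k-NN sketch: store the training records, then classify a test
# record by a distance-weighted majority vote of its k nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="distance")
knn.fit(X_train, y_train)              # "training" is essentially storing the records
print(knn.score(X_test, y_test))       # fraction of test records classified correctly
```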
Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points
from other classes
Nearest Neighbor Classification…
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
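A brief sketch (assuming scikit-learn; the pipeline is illustrative) of standardizing features before a distance-based classifier so that no single attribute dominates the distance:

```python
# Illustrative sketch: scale each attribute to zero mean / unit variance before k-NN,
# so that income (up to $1M) does not swamp height (1.5m to 1.8m) in the distance.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# scaled_knn.fit(X_train, y_train)     # assumes X_train/y_train prepared as above
# scaled_knn.predict(X_test)
```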
What is a Decision Tree?
• An inductive learning task
– Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
– These smaller Boolean tests are less complex than a one-stage classifier
• Let’s look at a sample decision tree…
Predicting Commute Time
(Decision tree: the root splits on “Leave At”. Leaving at 8 AM gives Long; leaving at 9 AM leads to “Accident?” (No: Medium, Yes: Long); leaving at 10 AM leads to “Stall?” (No: Short, Yes: Long).)
If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Inductive Learning
• In this decision tree, we made a series of
Boolean decisions and followed the
corresponding branch
– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
• By answering each of these yes/no questions,
we then came to a conclusion on how long our
commute might take
Decision Tree Algorithms
• The basic idea behind any decision tree
algorithm is as follows:
– Choose the best attribute(s) to split the remaining
instances and make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
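A minimal scikit-learn sketch of this recursive procedure (dataset and settings are illustrative):

```python
# Illustrative sketch: scikit-learn's DecisionTreeClassifier repeatedly chooses the
# best attribute split and recurses until a stopping condition is reached.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree))               # prints one Boolean test per internal node
```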
How to determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
(Figure: three candidate test conditions. “Own Car?” (Yes/No) yields child nodes with class counts C0: 6, C1: 4 and C0: 4, C1: 6. “Car Type?” (Family/Sports/Luxury) yields C0: 1, C1: 3; C0: 8, C1: 0; and C0: 1, C1: 7. “Student ID?” (c1 … c20) yields one pure, single-record node per ID.)
Which test condition is the best?
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution
are preferred
• Need a measure of node impurity:
(Example: a node with C0: 5, C1: 5 is non-homogeneous, with a high degree of impurity; a node with C0: 9, C1: 1 is homogeneous, with a low degree of impurity.)
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
Measure of Impurity: GINI
• Gini Index for a given node t :
(NOTE: p( j | t) is the relative frequency of class j at node t).
– Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among
all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying
most interesting information
GINI(t) = 1 - \sum_j [p(j \mid t)]^2

C1: 0, C2: 6  →  Gini = 0.000
C1: 1, C2: 5  →  Gini = 0.278
C1: 2, C2: 4  →  Gini = 0.444
C1: 3, C2: 3  →  Gini = 0.500
Examples for computing GINI
GINI(t) = 1 - \sum_j [p(j \mid t)]^2

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
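A few lines of Python (written for this review, not from the slides) that reproduce the Gini numbers above from class counts:

```python
# Gini index of a node from its class counts: GINI(t) = 1 - sum_j p(j|t)^2
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0
print(gini([1, 5]))   # ~0.278
print(gini([2, 4]))   # ~0.444
print(gini([3, 3]))   # 0.5
```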
Alternative Splitting Criteria based on
INFO
• Entropy at a given node t:
(NOTE: p( j | t) is the relative frequency of class j at node t).
– Measures homogeneity of a node.
• Maximum (log n_c) when records are equally distributed
among all classes implying least information
• Minimum (0.0) when all records belong to one class,
implying most information
– Entropy based computations are similar to the
GINI index computations
Entropy(t) = -\sum_j p(j \mid t) \log p(j \mid t)
Examples for computing Entropy
Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Splitting Based on INFO...
• Information Gain:
Parent Node, p is split into k partitions;
ni is number of records in partition i
– Measures Reduction in Entropy achieved because of the split.
Choose the split that achieves most reduction (maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: Tends to prefer splits that result in a large number of partitions, each being small but pure.
GAIN_split = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)
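A short sketch (written for this review) that reproduces the entropy values above and computes GAIN_split for a candidate split; the “Own Car?” counts from the earlier slide are used purely as an illustration:

```python
# Entropy of a node and information gain of a split:
# GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

print(entropy([1, 5]))                               # ~0.65
print(entropy([2, 4]))                               # ~0.92
print(information_gain([10, 10], [[6, 4], [4, 6]]))  # gain of the "Own Car?" split
```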
Stopping Criteria for Tree Induction
• Stop expanding a node when all the
records belong to the same class
• Stop expanding a node when all the
records have similar attribute values
• Early termination (to be discussed later)
Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Notes on Overfitting
• Overfitting results in decision trees that are
more complex than necessary
• Training error no longer provides a good
estimate of how well the tree will perform
on previously unseen records
• Need new ways for estimating errors
How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
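A sketch of how such pre-pruning rules map onto scikit-learn’s decision-tree hyperparameters (the threshold values below are illustrative, not prescribed by the slides):

```python
# Illustrative pre-pruning: stop growing the tree early via stopping-rule parameters.
from sklearn.tree import DecisionTreeClassifier

pruned = DecisionTreeClassifier(
    max_depth=4,                 # do not grow beyond this depth
    min_samples_split=20,        # stop if a node holds fewer instances than this
    min_impurity_decrease=0.01,  # stop if a split does not improve impurity enough
)
# pruned.fit(X_train, y_train)   # assumes training data is available
```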
Bayes Classifiers
Intuitively, Naïve Bayes computes the probability that a previously unseen instance belongs to each class, then simply picks the most probable class.
http://blog.yhat.com/posts/naive-bayes-in-python.html
Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which
says
p(cj | d) = p(d | cj) p(cj) / p(d)
• p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.
• p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj. This is just how frequent class cj is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
Different Naïve Bayes Models
• Multi-variate Bernoulli Naive Bayes: the binomial model is useful if your feature vectors are binary (i.e., 0s and 1s). One application would be text classification with a bag-of-words model where the 1s and 0s are "word occurs in the document" and "word does not occur in the document".
• Multinomial Naive Bayes: the multinomial naive Bayes model is typically used for discrete counts. E.g., in a text classification problem, we can take the idea of Bernoulli trials one step further: instead of "word occurs in the document" we use "how often the word occurs in the document"; you can think of it as "the number of times outcome x_i is observed over the n trials".
• Gaussian Naive Bayes Here, we assume that the features follow a
normal distribution. Instead of discrete counts, we have continuous
features (e.g., the popular Iris dataset where the features are sepal
width, petal width, sepal length, petal length).
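A minimal sketch of the three variants in scikit-learn (the Iris data matches the Gaussian case; the Bernoulli and multinomial models are only noted in comments):

```python
# Illustrative Naive Bayes sketch: pick the variant that matches the feature type.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

X, y = load_iris(return_X_y=True)        # continuous features -> Gaussian NB
gnb = GaussianNB().fit(X, y)
print(gnb.predict_proba(X[:2]))          # per-class probabilities for two instances

# BernoulliNB()   -> binary word-occurrence features
# MultinomialNB() -> word-count features
```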
Check out these websites for more!
• http://www.datasciencecentral.com/profiles/blogs/naive-bayes-for-dummies-a-simple-explanation
• http://blog.yhat.com/posts/naive-bayes-in-python.html
• In sklearn: http://scikit-learn.org/stable/modules/naive_bayes.html
Logistic Regression vs. Naïve
Bayes
• Logistic Regression Idea:
• Naïve Bayes allows computing P(Y|X) by
learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
The Logistic Function
• We want a model that predicts probabilities between 0 and 1, that is, S-shaped.
• There are lots of S-shaped curves. We use the logistic model:
• Probability = exp(β0 + β1X) / [1 + exp(β0 + β1X)], or equivalently loge[P/(1-P)] = β0 + β1X
• The expression on the left of the second form, loge[P/(1-P)], is the logit (the log of the odds).

P(y \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}

(Figure: the S-shaped logistic curve, with P(y|x) ranging from 0.0 to 1.0 on the vertical axis.)
Logistic Regression Function
• Logistic regression models the logit of the outcome instead of the outcome itself; i.e., instead of winning or losing, we build a model for the log odds of winning or losing
• Natural logarithm of the odds of the outcome
• ln(Probability of the outcome (p) / Probability of not having the outcome (1-p))

\ln\frac{P}{1-P} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_i x_i

P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
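A brief sketch (dataset and settings are illustrative) of fitting a logistic regression and reading its coefficients on the log-odds scale:

```python
# Illustrative logistic regression: the model is linear in the log odds,
# so exp(beta) gives each feature's multiplicative effect on the odds ratio.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)

print(clf.predict_proba(X[:3]))     # P(y|x): the S-shaped logistic output
print(np.exp(clf.coef_[0][:5]))     # exp(beta) for the first five features
```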
ROC Curves
• Originated from signal detection theory
– Binary signal corrupted by Gaussian noise
– What is the optimal threshold (i.e. operating
point)?
• Dependence on 3 factors
– Signal Strength
– Noise Variance
– Personal tolerance in Hit / False Alarm Rate
ROC Curves
• Receiver operating characteristic
• Summarizes and presents the performance of any binary classification model
• Measures a model’s ability to distinguish between false and true positives
Use Multiple Contingency Tables
• Sample contingency tables from a range of thresholds/probabilities.
• TRUE POSITIVE RATE (also called SENSITIVITY) = True Positives / (True Positives + False Negatives)
• FALSE POSITIVE RATE (also called 1 - SPECIFICITY) = False Positives / (False Positives + True Negatives)
• Plot Sensitivity vs. (1 - Specificity) for each sampled threshold and you are done
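A small sketch (model and data are illustrative) of building an ROC curve by sweeping the classification threshold, which is exactly the sensitivity vs. (1 - specificity) plot described above:

```python
# Illustrative ROC sketch: each threshold yields one (FPR, TPR) point,
# i.e., one contingency table; plotting them traces the ROC curve.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)
print(roc_auc_score(y_te, probs))   # area under the ROC curve
```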
Pros/Cons of Various Classification
Algorithms
Logistic regression: no distribution requirement, performs well with categorical variables that have few categories, computes the logistic distribution, easy to interpret, provides confidence intervals, suffers from multicollinearity

Decision Trees: no distribution requirement, heuristic, good for variables with few categories, do not suffer from multicollinearity (by choosing one of the correlated variables), interpretable

Naïve Bayes: generally no requirements, good for variables with few categories, computes the product of independent distributions, suffers from multicollinearity

SVM: no distribution requirement, computes hinge loss, flexible selection of kernels for nonlinear relationships, does not suffer from multicollinearity, hard to interpret

Bagging, boosting, and other ensemble methods (RF, AdaBoost, etc.): generally outperform any single algorithm listed above

Source: Quora
Prediction Error and the Bias-variance
tradeoff
• A good measure of the quality of an estimator \hat{f}(x) is the mean squared error. Let f_0(x) be the true value of f(x) at the point x. Then
• This can be written as
variance + bias^2.
• Typically, when bias is low, variance is high and vice-versa.
Choosing estimators often involves a tradeoff between bias and
variance.
Mse[\hat{f}(x)] = E[\hat{f}(x) - f_0(x)]^2
Mse[\hat{f}(x)] = Var[\hat{f}(x)] + [E\hat{f}(x) - f_0(x)]^2
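A small simulation sketch (entirely illustrative: the true function, noise level, and model are made up) that estimates bias^2 and variance of an estimator at one point by refitting on many simulated training sets:

```python
# Illustrative bias-variance decomposition: MSE[f_hat(x0)] = bias^2 + variance.
import numpy as np

rng = np.random.default_rng(0)
f0 = lambda x: np.sin(2 * np.pi * x)     # assumed "true" function f_0
x0, degree, n_sims = 0.3, 3, 500

preds = []
for _ in range(n_sims):
    x = rng.uniform(0, 1, 30)
    y = f0(x) + rng.normal(0, 0.3, 30)   # one noisy training sample
    preds.append(np.polyval(np.polyfit(x, y, degree), x0))

preds = np.array(preds)
bias_sq = (preds.mean() - f0(x0)) ** 2
variance = preds.var()
print(bias_sq, variance, bias_sq + variance)   # the sum approximates MSE at x0
```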
Note the tradeoff between Bias and
Variance!
Editor's Notes
  1. The logistic function is a non-linear function of the independent variables. It is bound between 0 and 1, which is what we want. The range of values between 0 and 1 predicts the probability of an event occurring or not occurring. We convert this non-linear function into a linear relationship using the LOG of the ODDS ratio. It’s easy to prove mathematically that the P(y|x) function can be converted to a function log(p/(1-p)) = alpha plus beta X. This is the linear transformation.
  2. The logit is a linear function of the xs. However, having log of (P/(1-P)) on the Y axis is not very helpful. We have to compute the actual probability. To do that we have to use the exponential function. It’s really important to understand what we are measuring. When we run a logistic regression, your betas are measuring the impact of X on the LOG of the odds ratio. Taking the exponential value of a beta, we get the impact of X on the ODDS RATIO.
  3. ROC curve: the Y axis is the true positive rate, predicted TP / total positives (the hit rate). The X axis is the false alarm rate: what percent of the actual negatives does the classifier get wrong?
  4. It’s important to understand this concept well, and I’m hopeful you all are understanding it now. The Mean square error can be decomposed into two parts – the variance and the squared bias. The important point here is that there is a tradeoff between variance and bias. Our least squares model provides us the BEST LINEAR UNBIASED estimator. If we are willing to give up on the accuracy of the coefficient estimate i.e. accept some bias, then we can lower the variance and the net effect of that could be a lower mean square error. Lets see how that’s done.
  5. I think this simple diagram provides an excellent understanding of the tradeoffs, and why we should consider models with non-zero bias. Notice that the blue line representing bias decreases as the model gets more complex (and we keep all the Xs, as we do in our multiple linear regression model). But when we have many Xs and we want the BLUE estimator, we are accepting a very high variance for no bias. But the best model FOR PREDICTION PURPOSES lies somewhere to the left, where we have some bias and low variance. So now we are going to look at models which allow some bias in the betas.