Chapter 07
Classification
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Classification Models
• Comparison of Classification Models
• Summary
Example of Classification
This is a whale
This is a bear
The General Problem
Class of object 1
Class of object 2
Object 1
Object 2
Concept
The Formal Problem
• Object space
  • $O = \{object_1, object_2, \dots\}$
  • Often infinite
• Representations of the objects in a feature space
  • $\mathcal{F} = \{\phi(o) : o \in O\}$
• Set of classes
  • $C = \{class_1, \dots, class_n\}$
• A target concept that maps objects to classes
  • $h^*: O \to C$
• Classification
  • Finding an approximation of the target concept
How do you get $h^*$?
The "Whale" Hypothesis
• How do we know that this is a whale?
  • Blue background
  • Has a fin
  • Oval body
  • Black top, white bottom
• Hypothesis: Objects with fins, an oval general shape that are black on top and white on the bottom in front of a blue background are whales.
The Hypothesis
• A hypothesis maps features to classes
  • $h: \mathcal{F} \to C$, i.e., $h(\phi(o)) \in C$
• Approximation of the target concept $h^*$
  • $h^*(o) \approx h(\phi(o))$
• Hypothesis = Classifier = Classification Model
What if I am not sure about the class?
Classification using Scores
• A numeric score for each class $c \in C$
• Often a probability distribution
  • $h': \phi(o) \mapsto [0,1]^{|C|}$
  • $\lVert h'(\phi(o)) \rVert_1 = 1$
• Example
  • Three classes: "whale", "bear", "other"
  • $h'(\phi(\text{whale picture})) = (0.7, 0.1, 0.2)$, i.e., the scores for "whale", "bear", and "other", respectively
• Standard approach:
  • The classification is the class with the highest score
Thresholds for Scores
• Different thresholds are also possible
• Example (spam scores): many "No Spam" emails are incorrectly detected as spam if the highest score is used; a threshold of 0.2 would miss some "Spam" but better identify "No Spam"
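A minimal sketch of scores and custom thresholds, assuming scikit-learn; the data and model here are illustrative stand-ins for the spam example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary data as a stand-in for the spam example
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# h': one score per class and instance, each row sums to 1
scores = model.predict_proba(X_test)

# standard approach: the class with the highest score
y_pred_default = scores.argmax(axis=1)

# alternative: predict the positive class already at a score of 0.2
y_pred_custom = (scores[:, 1] >= 0.2).astype(int)
```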
Quality of Hypothesis
• Goal: Approximation of the target concept
  • $h^*(o) \approx h(\phi(o))$
→ Use test data
• Same structure as the training data
• Apply the hypothesis
• The feature columns below form $\phi(o)$, the class column is $h^*(o)$, and the prediction column is $h(\phi(o))$:

| hasFin | shape | colorTop | colorBottom | background | class | prediction |
|---|---|---|---|---|---|---|
| true | oval | black | black | blue | whale | whale |
| false | rectangle | brown | brown | green | bear | whale |
| … | … | … | … | … | … | … |

How do you evaluate $h^*(o) \approx h(\phi(o))$?
The Confusion Matrix
• Table of the actual values versus the predictions

| Predicted \ Actual | whale | bear | other |
|---|---|---|---|
| whale | 29 | 1 | 3 |
| bear | 2 | 22 | 13 |
| other | 4 | 11 | 51 |

• Example: two whales were incorrectly predicted as bears
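A minimal sketch with scikit-learn and illustrative labels; note that scikit-learn puts the actual classes in the rows and the predictions in the columns, i.e., transposed relative to the table above:

```python
from sklearn.metrics import confusion_matrix

y_actual = ["whale", "whale", "bear", "bear", "other"]
y_pred = ["whale", "bear", "bear", "bear", "other"]

# rows and columns follow the given label order
cm = confusion_matrix(y_actual, y_pred, labels=["whale", "bear", "other"])
print(cm)
```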
Binary Classification
• Many problems are binary
  • Will I get my money back?
  • Is this credit card fraud?
  • Will my paper be accepted?
  • …
• All of them can be formulated as either being in a class or not
→ Labels true and false
The Binary Confusion Matrix
• False positives are also called Type I errors
• False negatives are also called Type II errors

| Predicted \ Actual | true | false |
|---|---|---|
| true | True Positives (TP) | False Positives (FP) |
| false | False Negatives (FN) | True Negatives (TN) |
Binary Performance Metrics (1)
• Rates per actual class
• True positive rate, recall, sensitivity
  • Percentage of the actual "true" instances that are predicted correctly
  • $TPR = \frac{TP}{TP + FN}$
• True negative rate, specificity
  • Percentage of the actual "false" instances that are predicted correctly
  • $TNR = \frac{TN}{TN + FP}$
• False negative rate
  • Percentage of the actual "true" instances that are predicted incorrectly
  • $FNR = \frac{FN}{FN + TP}$
• False positive rate
  • Percentage of the actual "false" instances that are predicted incorrectly
  • $FPR = \frac{FP}{FP + TN}$
Binary Performance Metrics (2)
• Rates per predicted class
• Positive predictive value, precision
  • Percentage of the predicted "true" instances that are predicted correctly
  • $PPV = \frac{TP}{TP + FP}$
• Negative predictive value
  • Percentage of the predicted "false" instances that are predicted correctly
  • $NPV = \frac{TN}{TN + FN}$
• False discovery rate
  • Percentage of the predicted "true" instances that are predicted incorrectly
  • $FDR = \frac{FP}{TP + FP}$
• False omission rate
  • Percentage of the predicted "false" instances that are predicted incorrectly
  • $FOR = \frac{FN}{FN + TN}$
Binary Performance Metrics (3)
• Metrics that take "everything" into account
• Accuracy
  • Percentage of all data that is predicted correctly
  • $accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
• F1 measure
  • Harmonic mean of precision and recall
  • $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
• Matthews correlation coefficient (MCC)
  • Correlation between the predictions and the actual values, related to the chi-squared statistic
  • $MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
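All of these metrics are available in scikit-learn; a small sketch with illustrative labels:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_actual = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print("recall (TPR):", recall_score(y_actual, y_pred))
print("precision (PPV):", precision_score(y_actual, y_pred))
print("accuracy:", accuracy_score(y_actual, y_pred))
print("F1:", f1_score(y_actual, y_pred))
print("MCC:", matthews_corrcoef(y_actual, y_pred))
```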
Receiver Operating Characteristic (ROC)
• Plot of the true positive rate (TPR) versus the false positive rate (FPR)
• Different TPR/FPR values are possible due to different thresholds for the scores
Area Under the Curve (AUC)
• Large Area = Good Performance
• Accounts for tradeoffs between TPR and FPR
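A sketch of computing the ROC curve and the AUC from the scores, assuming scikit-learn and illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)

# each threshold on the positive-class score yields one (FPR, TPR) point
fpr, tpr, thresholds = roc_curve(y_test, scores[:, 1])
print("AUC:", roc_auc_score(y_test, scores[:, 1]))
```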
Micro and Macro Averaging
• The metrics are not directly applicable to more than two classes
  • Accuracy is the exception
• Micro averaging
  • Expand the formulas to use the per-class positive and negative counts, summed over all classes
• Macro averaging
  • Treat one class as true and combine all others as false
  • Compute the metric for each such combination
  • Take the average
• Example for the true positive rate:
  • $TPR_{micro} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} TP_c + \sum_{c \in C} FN_c}$
  • $TPR_{macro} = \frac{1}{|C|} \sum_{c \in C} \frac{TP_c}{TP_c + FN_c}$
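Both averaging strategies are built into scikit-learn's metric functions; a minimal sketch with illustrative labels:

```python
from sklearn.metrics import recall_score

y_actual = ["whale", "whale", "bear", "bear", "other", "other"]
y_pred = ["whale", "bear", "bear", "bear", "other", "whale"]

# micro: pool the TP/FN counts of all classes, then compute the rate
print("TPR micro:", recall_score(y_actual, y_pred, average="micro"))
# macro: compute the rate per class, then average the per-class rates
print("TPR macro:", recall_score(y_actual, y_pred, average="macro"))
```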
Outline
• Overview
• Classification Models
• Comparison of Classification Models
• Summary
Overview of Classifiers
• The following classifiers are introduced
• 𝑘-nearest Neighbor
• Decision Trees
• Random Forests
• Logistic Regression
• Naive Bayes
• Support Vector Machines
• Neural Networks
$k$-nearest Neighbor
• Basic Idea
  • Instances with similar feature values should have the same class
  • The class can therefore be determined by looking at similar instances
→ Assign each instance the mode of the classes of its $k$ nearest instances
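A minimal sketch with scikit-learn, using the Iris data for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k = 5: the five nearest instances vote on the class
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))
```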
Impact of $k$
• Small values of $k$ yield complex, noise-sensitive decision boundaries; larger values of $k$ smooth the decision surface
Decision Trees
• Basic Idea
• Make decisions based on logical rules about features
• Organize rules as a tree
Basic Decision Tree Algorithm
• Recursive algorithm
• Stop if
  • The data is "pure", i.e., mostly from one class
  • The amount of data is too small, i.e., only few instances are left in the partition
• Otherwise
  • Determine the "most informative feature" $X$
  • Partition the training data using $X$
  • Recursively create a subtree for each partition
• Details vary depending on the specific algorithm
  • For example, CART, ID3, C4.5
  • The general concept is always the same
The "Most Informative Feature"
• Information theory based approach: interpret each dimension as a random variable
• Entropy of the class label
  • $H(C) = -\sum_{c \in C} p(c) \log p(c)$
  • Can be used as a measure for purity
• Conditional entropy of the class label given feature $X$
  • $H(C|X) = -\sum_{x \in X} p(x) \sum_{c \in C} p(c|x) \log p(c|x)$
• Mutual information
  • $I(C, X) = H(C) - H(C|X)$
→ The feature with the highest mutual information is the most informative
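As a sketch, these quantities can be computed directly with NumPy (discrete features assumed; the function names are mine):

```python
import numpy as np

def entropy(labels):
    # H(C) = -sum_c p(c) log p(c)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(feature, labels):
    # I(C, X) = H(C) - H(C|X)
    feature, labels = np.asarray(feature), np.asarray(labels)
    conditional = 0.0
    for x in np.unique(feature):
        mask = feature == x
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

labels = ["whale", "whale", "bear", "bear"]
has_fin = [True, True, False, False]
print(mutual_information(has_fin, labels))  # 1.0 bit: perfectly informative
```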
Decision Surface of Decision Trees
• All decisions are axis-aligned
• Deep trees with very fine-grained partitions tend to overfit the training data
Random Forest
• Basic Idea
  • Ensemble of randomized decision trees
  • Each tree is trained on a randomized subset of the data using randomized attributes
  • The classification is the majority vote of the random trees
Bagging as Ensemble Learner
• Bagging is short for bootstrap aggregating
  • Randomly draw subsamples of the training data
  • Build a model for each subsample → ensemble of models
  • Vote to determine the class
  • Votes can be weighted, e.g., by the quality of the ensemble members
• Random forests combine bagging with
  • Short decision trees, i.e., low depth
  • Allowing only a random subset of the features for each decision
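A minimal sketch with scikit-learn; the hyperparameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample; max_features restricts
# the features considered at each split, max_depth keeps the trees short
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                max_depth=3, random_state=0).fit(X, y)
print(forest.predict(X[:3]))
```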
Decision Surface of Random Forests
Logistic Regression
• Basic Idea:
  • Regression model of the probability that an object belongs to a class
  • Combines the $logit$ function with linear regression
• Linear regression
  • $y$ as a linear combination of $x_1, \dots, x_n$
  • $y = b_0 + b_1 x_1 + \dots + b_n x_n$
• The $logit$ function
  • $logit(P(y = c)) = \ln \frac{P(y = c)}{1 - P(y = c)}$
• Logistic regression
  • $logit(P(y = c)) = b_0 + b_1 x_1 + \dots + b_n x_n$
Odds Ratios
• Probabilities vs. odds
  • Probability: $P(\text{pass\_exam}) = 0.75$
  • Odds of passing the exam (definition of odds): $odds(\text{pass\_exam}) = \frac{0.75}{1 - 0.75} = 3$
  • The odds of passing the exam are 3 to 1
• If we invert the natural logarithm, we get
  • $\frac{P(y = c)}{1 - P(y = c)} = \exp(b_0 + b_1 x_1 + \dots + b_n x_n) = \prod_{j=0}^{n} \exp(b_j x_j)$ with $x_0 = 1$
• It follows that $\exp(b_j)$ is the odds ratio of feature $j$
  • The odds ratio is the multiplicative change in the odds if we increase $x_j$ by one
  • An odds ratio greater than one means increased odds
  • An odds ratio less than one means decreased odds
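A sketch of reading odds ratios off a fitted model, assuming scikit-learn and illustrative synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# exp(b_j) is the odds ratio of feature j: the multiplicative change in
# the odds of the positive class when x_j increases by one
print(np.exp(model.coef_[0]))
```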
Decision Surface of Logistic Regression
• Decision boundaries are linear
Naive Bayes
• Basic idea:
  • Assume all features are independent
  • Score the classes using the conditional probability
• Bayes' law
  • $P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}$
• Conditional probability of a class:
  • $P(c|x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n|c) P(c)}{P(x_1, \dots, x_n)}$
From Bayes' Law to Naive Bayes
• Probability following Bayes' law
  • $P(c|x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n|c) P(c)}{P(x_1, \dots, x_n)}$
• "Naive" assumption: $x_1, \dots, x_n$ are conditionally independent given $c$
  • $P(c|x_1, \dots, x_n) = \frac{P(x_1|c) \cdots P(x_n|c) P(c)}{P(x_1, \dots, x_n)} = \frac{\prod_{j=1}^{n} P(x_j|c) \, P(c)}{P(x_1, \dots, x_n)}$
• $P(x_1, \dots, x_n)$ is independent of $c$ and always the same
  • $score(c|x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j|c) \, P(c)$
• Assign the class with the highest score
Multinomial and Gaussian Naive Bayes
• The variants differ in how $P(x_j|c)$ is estimated
• Multinomial
  • $P(x_j|c)$ is the empirical probability of observing a feature value
  • "Counts" the observations of $x_j$ in the data
• Gaussian
  • Assumes that the features follow a Gaussian/normal distribution
  • Estimates the conditional probability $P(x_j|c)$ using the Gaussian density function
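A minimal sketch of the Gaussian variant with scikit-learn, using the Iris data for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# fits one Gaussian per feature and class to estimate P(x_j | c)
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:2]))  # scores, normalized to probabilities
```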
Decision Surface of Naive Bayes
• Multinomial has linear decision boundaries
• Gaussian has piecewise quadratic decision boundaries
Support Vector Machines (SVM)
• Basic Idea:
  • Calculate a linear decision boundary such that it is "far away" from the data
  • The margin around the decision boundary is maximized
  • Support vectors are the instances with minimal distance to the decision boundary
Non-linear SVMs through Kernels
• Expand the features using kernels to separate non-linear data, e.g., with a quadratic kernel
• Transformation into a high-dimensional kernel space
  • Can be infinite-dimensional (e.g., Gaussian kernel, RBF kernel)!
• Calculate a linear separation in the kernel space
• Use the kernel trick to avoid the actual expansion
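A minimal sketch with scikit-learn, using an RBF kernel on data that is not linearly separable:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# circles within circles: not linearly separable in the original space
X, y = make_circles(noise=0.1, factor=0.5, random_state=0)

# the RBF kernel separates the data linearly in an implicit,
# infinite-dimensional kernel space via the kernel trick
svm = SVC(kernel="rbf").fit(X, y)
print(svm.score(X, y))
```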
Decision Surface of SVMs
• Shape of decision surface depends on kernel
Neural Networks
• Basic Idea:
  • Network of neurons organized in layers, with communication between the neurons
  • The input layer feeds the data into the network
  • The hidden layers (e.g., two of them) "correlate" the data
  • The output layer gives the computation results
Multilayer Perceptron (MLP)
• Each feature gets an input neuron
• Multiple fully connected hidden layers
• A single output neuron with the classification
• Each neuron first computes the weighted sum of its inputs
• Then it applies an activation function, e.g., $sigmoid$ or $\tanh$
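A minimal sketch with scikit-learn; the layer sizes and activation are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(noise=0.2, random_state=0)

# two fully connected hidden layers with 10 neurons each, tanh activation
mlp = MLPClassifier(hidden_layer_sizes=(10, 10), activation="tanh",
                    max_iter=2000, random_state=0).fit(X, y)
print(mlp.score(X, y))
```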
Decision Surface of MLP
• Shape of decision boundary depends on
• Activation function
• Number of hidden layers
• Number of neurons in the hidden layers
Outline
• Overview
• Classification Models
• Comparison of Classification Models
• Summary
General Approach
• Different approaches behind the covered classifiers
  • $k$-nearest Neighbor → instance based
  • Decision Trees → rule based + information theory
  • Random Forests → randomized ensemble
  • Logistic Regression → regression
  • Naive Bayes → conditional probability
  • Support Vector Machines → margin maximization + kernels
  • Neural Networks → (very complex) regression
Comparison of Decision Surfaces
• Iris data
• Results may vary with hyperparameter tuning
Comparison of Decision Surfaces
• Data that is not linearly separable
• Results may vary with hyperparameter tuning
Comparison of Decision Surfaces
• Circles within circles
• Results may vary with hyperparameter tuning
Comparison of Execution Times
• Times taken using the GWDG Jupyter Hub and the scikit-learn implementations of the algorithms
• Data randomly generated with sklearn.datasets.make_moons (July 2018)
Strengths and Weaknesses
• Rating per classifier (+ = strength, o = neutral, - = weakness)

| | Explanatory value | Concise representation | Scoring | Categorical features | Missing features | Correlated features |
|---|---|---|---|---|---|---|
| $k$-nearest Neighbor | o | - | - | - | + | - |
| Decision Tree | + | + | + | + | o | + |
| Random Forest | - | o | + | + | o | + |
| Logistic Regression | + | + | + | o | - | o |
| Naive Bayes | o | o | + | + | - | - |
| SVM | - | o | - | o | - | - |
| Neural Network | - | o | + | o | - | + |
Outline
• Overview
• Classification Models
• Comparison of Classification Models
• Summary
Summary
• Classification is the task of assigning labels to objects
• Many evaluation criteria
• Confusion matrix commonly used
• Lots of classification algorithms
• Rule based, instance based, ensembles, regressions, …
• Different algorithms may be best in different situations