How to gain a foothold in the
world of classification
Torsten Schön
dotplot GmbH
Overview
• What is classification?
• Workflow
• Preprocessing
• Basic classifiers
• Evaluation

27.02.2014

How to gain a foothold in the world of classification

What is classification?
• Prediction model
• Supervised learning
• A set of historical data is available with known
class values
• Task: Predict to which class/category a new
unseen item belongs

What is classification?
• Terminology:
• Dataset: The complete collection of measured data
• Attributes/Features: Parameters measured for
each instance (usually columns)
• Instance: A single item for which parameters
are measured (usually rows)

What is classification?
Example:
• A set of blood parameters is measured from
50 cancer patients and from 50 control
persons
• 2-class problem: Cancer vs. Healthy
• To test if a new patient has cancer, the same
blood parameters are measured and
classification is used to predict the class

General Workflow
• Training data (class values are known) → classification model
• Test data (unknown class) → classification model → predicted class values
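The workflow above can be sketched in a few lines of Python. This is a minimal illustration only, not the method used in the talk: it assumes a nearest-centroid classifier as a stand-in for "classification model", and the function names (fit_centroids, predict) and the toy measurements are made up for the example.

```python
from math import dist


def fit_centroids(X, y):
    """Training step: compute one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, X):
    """Prediction step: assign each instance to the class with the closest centroid."""
    return [min(model, key=lambda label: dist(model[label], row)) for row in X]


# Hypothetical training data: two measured parameters per person, class values known.
X_train = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
y_train = ["healthy", "healthy", "cancer", "cancer"]

model = fit_centroids(X_train, y_train)

# Test data: same parameters measured, class unknown.
X_test = [[0.9, 1.1], [3.1, 3.0]]
predictions = predict(model, X_test)
print(predictions)
```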

Detailed Workflow
• Training data → preprocessing → model selection → classification model
• Preprocessing:
– Feature selection
– Feature engineering
– Impute missing values
– …
• Model selection:
– Cross-validation
– Accuracy
– ROC
– …
• Test data → preprocessing → classification model → predicted class values

Preprocessing
Feature Selection
• Select discriminant features only
• Save execution time
• Remove noise effects
• Two kinds of methods:
– Ranking
– Subset evaluation

Preprocessing
Ranking (Filters)
• Features are ranked by a score
– Correlation
– Information gain
–…

• Number of selected features must be given
manually
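A ranking filter of this kind can be sketched as follows. The example assumes absolute Pearson correlation with a numeric 0/1 class label as the score; the helper names (pearson, rank_features) and the data are hypothetical. Note that n_selected has to be supplied by the caller, as stated above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(X, y, n_selected):
    """Score every feature column and return the indices of the top n_selected."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    # Highest score first; ties are broken by the lower column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:n_selected]]


# Made-up data: 4 instances, 3 features, binary class label.
X = [[1.0, 5.0, 0.3], [2.0, 3.0, 0.9], [3.0, 4.0, 0.1], [4.0, 6.0, 0.7]]
y = [0, 0, 1, 1]

selected = rank_features(X, y, n_selected=2)
print(selected)
```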

Preprocessing
Subset Evaluation (Filter)
• A search algorithm is used to find the best
features
• The number of selected features is determined by
the algorithm
Subset Evaluation (Wrapper)
• A model is learned and evaluated on each candidate
subset to find the best features
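One possible wrapper is greedy forward selection, sketched below. The choices here are assumptions made for illustration: the search strategy (greedy, stop when the score no longer improves), the model (1-nearest-neighbour) and the score (leave-one-out accuracy) are not prescribed by the slides, and the function names and data are hypothetical.

```python
from math import dist


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour model using only `features`."""
    correct = 0
    for i, row in enumerate(X):
        query = [row[j] for j in features]
        best_label, best_distance = None, float("inf")
        for k, other in enumerate(X):
            if k == i:
                continue  # the held-out instance must not vote for itself
            d = dist(query, [other[j] for j in features])
            if d < best_distance:
                best_label, best_distance = y[k], d
        correct += best_label == y[i]
    return correct / len(X)


def forward_selection(X, y):
    """Add one feature at a time as long as the evaluated score improves."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        candidates = [(loo_accuracy(X, y, selected + [j]), j) for j in remaining]
        score, feature = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best_score:
            break  # no candidate improves the model, so the search stops
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Made-up data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 7.0], [1.2, 2.0], [0.9, 9.0], [3.0, 8.0], [3.1, 1.0], [2.9, 3.0]]
y = [0, 0, 0, 1, 1, 1]

selected, best_score = forward_selection(X, y)
print(selected, best_score)
```

Here the search decides by itself how many features to keep, in contrast to a ranking filter.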

Preprocessing
Feature Engineering
• Transform or compute features to better
match requirements
• Text analysis: A plain text field cannot be used
for classification directly
• Extract keywords as nominal features, count the
number of words, letters, …
• Start and end time → duration
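The transformations named above can be sketched like this. The record layout (keys "start", "end", "text"), the keyword list and the function name engineer_features are assumptions made for the example.

```python
from datetime import datetime


def engineer_features(record, keywords):
    """Turn raw fields into features a classifier can work with."""
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    words = record["text"].lower().split()
    features = {
        # Start and end time are replaced by a single duration feature.
        "duration_minutes": (end - start).total_seconds() / 60,
        # Simple counts derived from the plain text field.
        "word_count": len(words),
        "letter_count": sum(ch.isalpha() for ch in record["text"]),
    }
    # One nominal (here: boolean) feature per keyword.
    for keyword in keywords:
        features[f"has_{keyword}"] = keyword in words
    return features


# Hypothetical raw record.
record = {
    "start": "2014-02-27T09:00:00",
    "end": "2014-02-27T09:45:00",
    "text": "Patient reports severe pain",
}

features = engineer_features(record, keywords=["pain", "fever"])
print(features)
```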

Preprocessing
Estimate Missing Values
• Some algorithms require complete datasets
• Missing values need to be imputed
• Simplest: Mean (numeric features) and mode (nominal features)
• More advanced techniques can lead to better
results
(imputation is a research field of its own)
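Mean and mode imputation can be sketched as below. The sketch assumes that missing values are encoded as None and that each column contains at least one observed value; the function name impute and the sample columns are hypothetical.

```python
from collections import Counter


def impute(column):
    """Fill None entries with the mean (numeric column) or the mode (nominal column)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = sum(present) / len(present)  # mean of the observed values
    else:
        fill = Counter(present).most_common(1)[0][0]  # most frequent observed value
    return [fill if v is None else v for v in column]


# Made-up columns with one missing value each.
hemoglobin = [13.5, None, 14.5, 12.5]
blood_group = ["A", "0", None, "A"]

hemoglobin_filled = impute(hemoglobin)
blood_group_filled = impute(blood_group)
print(hemoglobin_filled, blood_group_filled)
```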

Preprocessing
Add Noise
• Generalization of the
algorithm is most
important!
• Adding artificial noise to
the training data can
help the model
generalize better
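One common way to add artificial noise is to append jittered copies of the training instances. The sketch below assumes zero-mean Gaussian noise on numeric features; the noise level sigma, the number of copies and the function name add_gaussian_noise are illustrative choices, not values from the talk.

```python
import random


def add_gaussian_noise(X, sigma, copies=1, seed=0):
    """Return `copies` noisy versions of every row in X (the originals are not modified)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    noisy = []
    for _ in range(copies):
        for row in X:
            noisy.append([value + rng.gauss(0.0, sigma) for value in row])
    return noisy


# Made-up training data.
X_train = [[1.0, 2.0], [3.0, 4.0]]
y_train = [0, 1]

# Augmented training set: originals plus two noisy copies of each instance.
X_aug = X_train + add_gaussian_noise(X_train, sigma=0.1, copies=2)
y_aug = y_train + y_train * 2  # labels repeat in the same order as the noisy rows

print(len(X_aug), len(y_aug))
```

Only the training data is perturbed; the data used for evaluation stays unchanged.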

Classification Algorithms
• There are many different classification models
• Important:
– Generalization
– Robustness to noise
– Speed
– Performance
–…

• “No free lunch” Theorem

Classification Algorithms
k-Nearest Neighbors
• Selects the k closest
instances from the
training set and predicts
the most common class
among them
• Similarity measure
needed
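A minimal k-nearest-neighbours sketch, assuming Euclidean distance as the similarity measure and a majority vote among the k neighbours. The function name knn_predict and the data are made up for the example.

```python
from collections import Counter
from math import dist


def knn_predict(X_train, y_train, query, k=3):
    """Predict the class of `query` from its k closest training instances."""
    # Sort training instances by Euclidean distance to the query and keep k.
    neighbours = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], query))[:k]
    # The most common class among the neighbours is the prediction.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up training set with two well-separated classes.
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]]
y_train = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]

first = knn_predict(X_train, y_train, [1.1, 1.0], k=3)
second = knn_predict(X_train, y_train, [3.0, 3.0], k=3)
print(first, second)
```

There is no training step in this sketch: the "model" is the stored training set itself, so prediction cost grows with the number of training instances.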

Classification Algorithms
Support Vector Machine (SVM)
• Learns a decision boundary,
defined by support vectors,
which separates the training
instances
• Can be extended to
– Higher dimensions
– Non-linear boundaries
– Multiple classes

Classification Algorithms
Random Forest
• Learns a “forest” of decision trees with randomly
varied structures
• The majority vote of the individual trees is the final
result
• Works well in many areas, as it is robust
to noise and to overfitting
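The voting step can be sketched in isolation. This is not a random forest implementation: the per-tree predictions below are made up, and training the individual trees is left out entirely. Only the aggregation of the votes is shown.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by most trees (ties are not handled specially)."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one new instance.
votes = ["cancer", "healthy", "cancer", "cancer", "healthy"]

final = majority_vote(votes)
print(final)
```

A single tree that was misled by noise is outvoted as long as most trees are right, which is one intuition for the robustness mentioned above.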

Evaluation
• Evaluate different models and preprocessing
steps by comparing model performance
• Use only the training set for evaluation
• Often used: Cross-Validation
– Split the training data into k parts of equal size
– Use each part once as the test set and the remaining k-1
parts as the training set
– Average the results
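The three steps of cross-validation can be sketched as follows. The sketch assumes accuracy as the performance measure and contiguous folds (in practice the data is usually shuffled first); the majority-class baseline and all function names are hypothetical stand-ins for whatever model is being evaluated.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of (nearly) equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(X, y, k, fit, predict):
    """Use each fold once as the test set and average the accuracies."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        model = fit(X_tr, y_tr)
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


# Stand-in model: always predicts the most frequent training class.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Made-up data: 9 instances, one third belongs to class 1.
X = [[float(i)] for i in range(9)]
y = [0, 0, 1, 0, 0, 1, 0, 0, 1]

score = cross_validate(X, y, k=3, fit=fit_majority, predict=predict_majority)
print(round(score, 3))
```

Any pair of fit and predict functions with the same signatures can be passed in, which makes it possible to compare different models or preprocessing steps on the same folds.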
