This talk serves as a basic introduction to classification. I will explain some common classification algorithms and the fundamentals of the field. Before training a model, it is crucial to explore and understand the underlying data. Based on these findings, proper feature engineering and feature selection must be performed in order to get appropriate results. After choosing a model and classifying data instances, we will look at different methods of evaluating the results, using techniques such as cross-validation.
3. What is classification?
• Prediction model
• Supervised learning
• A set of historical data is available with known class values
• Task: Predict to which class/category a new, unseen item belongs
27.02.2014
How to gain a foothold in the world of classification
3
4. What is classification?
• Terminology:
• Dataset: the complete set of measured data
• Attributes/Features: parameters measured for each instance (usually columns)
• Instance: a single item for which parameters are measured (usually rows)
5. What is classification?
Example:
• A set of blood parameters is measured from 50 cancer patients and from 50 control persons
• 2-class problem: Cancer vs. Healthy
• To test whether a new patient has cancer, the same blood parameters are measured and classification is used to predict the class
6. General Workflow
• Training Data (class values are known) → Classification Model
• Test Data (unknown class) → Classification Model → Predicted class values
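The workflow above can be sketched in a few lines. This is only an illustration of the train-then-predict pattern, not a useful classifier: the "model" simply remembers the most frequent class in the training data. The function names and the toy data are made up for this sketch.

```python
from collections import Counter


def train_majority_classifier(training_labels):
    """Learn a trivial model: the most frequent class in the training data."""
    most_common_class, _ = Counter(training_labels).most_common(1)[0]
    return most_common_class


def predict(model, test_instances):
    """Predict a class for every test instance (here: always the same class)."""
    return [model for _ in test_instances]


# Training data: class values are known (toy labels, invented for illustration).
training_labels = ["healthy", "cancer", "healthy", "healthy"]
model = train_majority_classifier(training_labels)

# Test data: class values are unknown, the model predicts them.
predicted = predict(model, [[5.1, 0.3], [6.0, 0.9]])
```

Real models use the feature values as well; the later slides on classification algorithms cover such models.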
7. Detailed Workflow
• Training Data → Preprocessing → Model selection → Classification Model
– Feature selection
– Feature engineering
– Impute missing values
– …
• Test Data → Preprocessing → Classification Model → Predicted class values
• Evaluation: Cross-Validation, Accuracy, ROC, …
8. Preprocessing
Feature Selection
• Select discriminant features only
• Save execution time
• Remove noise effects
• Two kinds of methods:
– Ranking
– Subset evaluation
9. Preprocessing
Ranking (Filters)
• Features are ranked by a score
– Correlation
– Information gain
–…
• The number of features to select must be specified manually
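As a rough sketch of a ranking filter, the following scores each feature by the absolute Pearson correlation with a numeric class label and keeps the top `num_selected` features. The function names and the toy data are assumptions made for illustration; information gain or another score could be plugged in the same way.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def rank_features(instances, labels, num_selected):
    """Return the indices of the num_selected highest-scoring features."""
    scores = []
    for j in range(len(instances[0])):
        column = [row[j] for row in instances]
        scores.append((abs(pearson_correlation(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores[:num_selected]]


# Toy data: rows are instances, columns are features, labels are 0/1.
instances = [
    [1.0, 7.0, 0.2],
    [2.0, 3.0, 0.1],
    [8.0, 6.0, 0.3],
    [9.0, 4.0, 0.2],
]
labels = [0, 0, 1, 1]
selected = rank_features(instances, labels, num_selected=1)
```

Note that `num_selected` has to be chosen by the user, which is the limitation the last bullet points out.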
10. Preprocessing
Subset Evaluation (Filter)
• A search algorithm is used to find the best features
• The number of selected features is determined by the algorithm
Subset Evaluation (Wrapper)
• A model is trained and evaluated on each candidate subset to find the best features
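A wrapper can be sketched as a greedy forward search that scores each candidate subset with a model. The choices below are assumptions for illustration only: a 1-nearest-neighbor model, leave-one-out accuracy as the score, and made-up toy data. Other search strategies and models are equally possible.

```python
def squared_distance(a, b, features):
    """Squared Euclidean distance restricted to the given feature indices."""
    return sum((a[j] - b[j]) ** 2 for j in features)


def leave_one_out_accuracy(instances, labels, features):
    """Score a feature subset with a 1-nearest-neighbor model."""
    correct = 0
    for i, query in enumerate(instances):
        nearest = min(
            (k for k in range(len(instances)) if k != i),
            key=lambda k: squared_distance(query, instances[k], features),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(instances)


def forward_selection(instances, labels):
    """Greedily add the feature that improves the score most; stop when none does."""
    remaining = set(range(len(instances[0])))
    selected, best_score = [], 0.0
    while remaining:
        candidate_scores = {
            j: leave_one_out_accuracy(instances, labels, selected + [j])
            for j in sorted(remaining)
        }
        best_feature = max(candidate_scores, key=lambda j: (candidate_scores[j], -j))
        if candidate_scores[best_feature] <= best_score:
            break  # no remaining feature improves the subset
        selected.append(best_feature)
        best_score = candidate_scores[best_feature]
        remaining.remove(best_feature)
    return selected, best_score


instances = [
    [1.0, 7.0, 0.2],
    [2.0, 3.0, 0.1],
    [8.0, 6.0, 0.3],
    [9.0, 4.0, 0.2],
]
labels = [0, 0, 1, 1]
selected_features, score = forward_selection(instances, labels)
```

Here the search itself decides how many features to keep, in contrast to a ranking filter.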
11. Preprocessing
Feature Engineering
• Transform or compute features to better match the requirements
• Text analysis: A plain text field cannot be used for classification directly
• Extract key words as nominal features, count the number of words, letters, …
• Compute a duration from start and end times
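The ideas above can be sketched as follows. The field names, the keyword list, the time format, and the example record are all invented for this illustration.

```python
from datetime import datetime


def engineer_features(text, start, end, keywords):
    """Turn a free-text field and two timestamps into usable features."""
    words = text.lower().split()
    # Key words become nominal (here: boolean) features.
    features = {f"has_{kw}": kw in words for kw in keywords}
    features["num_words"] = len(words)
    features["num_letters"] = sum(ch.isalpha() for ch in text)
    # Start and end time are replaced by a single duration feature.
    time_format = "%Y-%m-%d %H:%M"  # assumed timestamp format
    duration = datetime.strptime(end, time_format) - datetime.strptime(start, time_format)
    features["duration_minutes"] = duration.total_seconds() / 60
    return features


example = engineer_features(
    text="Patient reports mild fever",
    start="2014-02-27 09:00",
    end="2014-02-27 09:45",
    keywords=["fever", "cough"],
)
```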
12. Preprocessing
Estimate Missing Values
• Some algorithms require complete datasets
• Missing values need to be imputed
• Simplest: mean (numeric features) and mode (nominal features)
• More advanced techniques can lead to better results (imputation is a scientific field of its own)
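A minimal sketch of the simplest strategy, assuming missing values are encoded as `None` and every column has at least one observed value. The column names and values are invented for illustration.

```python
from collections import Counter


def impute_column(values):
    """Replace None with the mean (numeric column) or the mode (nominal column)."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = sum(observed) / len(observed)  # mean
    else:
        fill = Counter(observed).most_common(1)[0][0]  # mode
    return [fill if v is None else v for v in values]


hemoglobin = impute_column([13.0, None, 15.0, 14.0])
blood_type = impute_column(["A", "0", None, "A"])
```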
13. Preprocessing
Add Noise
• Generalization of the algorithm is most important!
• Adding artificial noise to the training data can help the model generalize better
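One possible way to add artificial noise is to perturb every numeric feature with zero-mean Gaussian noise. This is a sketch under that assumption; the slide does not say which kind of noise is meant, and `sigma` and `seed` are hypothetical parameters that would need tuning.

```python
import random


def add_gaussian_noise(instances, sigma, seed=0):
    """Return a noisy copy of the training instances; the input is not modified."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in instances]


training = [[1.0, 2.0], [3.0, 4.0]]
noisy = add_gaussian_noise(training, sigma=0.1)
```

Only the training data should be perturbed; the data used for evaluation stays untouched.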
14. Classification Algorithms
• There are many different classification models
• Important:
– Generalization
– Robustness to noise
– Speed
– Performance
–…
• “No free lunch” theorem: no single algorithm is best for every problem
15. Classification Algorithms
k-Nearest Neighbors
• Selects the k closest instances from the training set and predicts the majority class among them
• A similarity (or distance) measure is needed
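A minimal sketch of k-nearest neighbors, assuming numeric features and Euclidean distance as the similarity measure. The function names and the toy data are invented for illustration.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """The similarity measure: a smaller distance means more similar."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train_instances, train_labels, query, k=3):
    """Predict the majority class among the k training instances closest to query."""
    neighbors = sorted(
        (euclidean_distance(instance, query), label)
        for instance, label in zip(train_instances, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


train_instances = [
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
    [8.0, 8.0], [9.0, 7.5], [8.5, 9.0],
]
train_labels = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]

prediction_a = knn_predict(train_instances, train_labels, [2.0, 2.0])
prediction_b = knn_predict(train_instances, train_labels, [8.0, 8.5])
```

Because the distance depends on the scale of each feature, features are usually normalized before k-NN is applied.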
16. Classification Algorithms
Support Vector Machine (SVM)
• Learns a decision boundary, defined by support vectors, which separates the training instances
• Can be
– Higher dimensions
– Non-linear
– multiple
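To give a feeling for how an SVM is fitted, here is a sketch of the linear two-class case only, trained by sub-gradient descent on the regularized hinge loss. This is a swapped-in training method chosen for brevity; library SVM implementations typically use dedicated solvers and also support kernels for the non-linear case. All names, hyperparameters, and data are assumptions.

```python
def train_linear_svm(instances, labels, learning_rate=0.01, reg=0.01, epochs=1000):
    """Fit a linear SVM. Labels must be -1 or +1. Returns (weights, bias)."""
    weights = [0.0] * len(instances[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(instances, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # The regularization term always shrinks the weights a little.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:
                # Instance is misclassified or inside the margin: move the boundary.
                weights = [w + learning_rate * y * v for w, v in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def svm_predict(weights, bias, x):
    """Classify by the side of the separating hyperplane the instance falls on."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


instances = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]
labels = [-1, -1, -1, 1, 1, 1]
weights, bias = train_linear_svm(instances, labels)
train_predictions = [svm_predict(weights, bias, x) for x in instances]
```

The instances that end up with a margin close to 1 are the ones that determine the boundary; they play the role of the support vectors.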
17. Classification Algorithms
Random Forest
• Learns a “forest” of decision trees with randomly varied structures
• The majority vote of the individual trees is the final result
• Works well in many areas as it is very robust to noise and to overfitting
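The voting idea can be sketched with a heavily simplified forest. Assumptions made here for brevity: exactly two classes, each "tree" is a single-split stump on one randomly chosen feature, and bootstrap samples are redrawn until they contain both classes. A real random forest grows full decision trees and picks random feature subsets at every split.

```python
import random
from collections import Counter


def train_stump(instances, labels, rng):
    """Fit a one-split tree on a bootstrap sample using one random feature."""
    n = len(instances)
    while True:
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        if len({labels[i] for i in sample}) == 2:  # simplification: need both classes
            break
    feature = rng.randrange(len(instances[0]))
    class_means = {}
    for cls in set(labels):
        values = [instances[i][feature] for i in sample if labels[i] == cls]
        class_means[cls] = sum(values) / len(values)
    low_class, high_class = sorted(class_means, key=class_means.get)
    # Split halfway between the two class means on the chosen feature.
    threshold = (class_means[low_class] + class_means[high_class]) / 2
    return feature, threshold, low_class, high_class


def train_forest(instances, labels, num_trees=25, seed=0):
    rng = random.Random(seed)
    return [train_stump(instances, labels, rng) for _ in range(num_trees)]


def forest_predict(forest, query):
    """Every tree votes; the majority class is the final result."""
    votes = Counter(
        high if query[feature] > threshold else low
        for feature, threshold, low, high in forest
    )
    return votes.most_common(1)[0][0]


instances = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]
forest = train_forest(instances, labels)
prediction_a = forest_predict(forest, [1.2, 1.8])
prediction_b = forest_predict(forest, [8.7, 8.1])
```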
18. Evaluation
• Evaluate different models and preprocessing steps by comparing model performance
• Use only the training set for evaluation
• Often used: Cross-Validation
– Split the training data into k parts of equal size
– Use each part once as the test set and the remaining k-1 parts as the training set
– Average the results
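The three cross-validation steps above can be sketched as follows. The fold assignment by striding, the accuracy score, and the 1-nearest-neighbor model passed in as `fit_predict` are assumptions chosen to keep the example short; in practice the data is usually shuffled before splitting.

```python
def k_fold_indices(num_instances, k):
    """Split the indices 0..num_instances-1 into k parts of (nearly) equal size."""
    return [list(range(fold, num_instances, k)) for fold in range(k)]


def cross_validate(instances, labels, k, fit_predict):
    """Use each part once as the test set and average the accuracies."""
    accuracies = []
    for test_idx in k_fold_indices(len(instances), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(instances)) if i not in held_out]
        predictions = fit_predict(
            [instances[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [instances[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def one_nearest_neighbor(train_x, train_y, test_x):
    """Example model: predict the label of the closest training instance."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [
        train_y[min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))]
        for x in test_x
    ]


instances = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]
mean_accuracy = cross_validate(instances, labels, k=3, fit_predict=one_nearest_neighbor)
```

Any model with the same `fit_predict` signature can be plugged in, which makes it easy to compare models and preprocessing steps on the same folds.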