14 Classification
Prof. Dr. Ziawasch Abedjan, Felix Neutatz
June 24th, 2019
Repetition: Prediction
How old is this dugong?
[Figure: dugong age scale in years, from 0 to 73]
Dugong by Geoff Spiby is licensed under CC BY-SA 3.0
Classification
https://www.inferentialthinking.com/chapters/17/Classification.html
Is this Classification?
● Is this bank transfer fraudulent?
● Is this patient healthy?
● Will you vote for me, for X, or for Y?
● Are these two people a good match for each other?
● Is this an apple?
Def.: Given a number of labeled examples, identify the class to which a given
observation belongs.
What do we need for apple classification?
Each row is one observation. Attributes: Mass, Width, Height. Class: Is an Apple.

Training data
Mass  Width  Height  Is an Apple
192   8.4    7.3     1
342   9      9.4     0
186   7.2    9.2     0
152   7.6    7.3     1

Test data
Mass  Width  Height  Is an Apple
80    5.9    4.3     1
194   7.2    10.3    0
https://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/
Nearest Neighbors
https://www.inferentialthinking.com/chapters/17/1/Nearest_Neighbors.html
Classification based on Height and Width
A Nearest Neighbor Classifier
● Find the point in the training set that is nearest to the new point.
● If that nearest point is an apple, classify the new point as an apple; otherwise, classify it as not an apple.
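The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: it uses the Width and Height values of the apple table, assumes that the first four rows are the training data and the last two are the test data, and the function name `nearest_neighbor_class` is hypothetical.

```python
import math

# Training observations from the apple table: ((width, height), is_an_apple)
training = [
    ((8.4, 7.3), 1),
    ((9.0, 9.4), 0),
    ((7.2, 9.2), 0),
    ((7.6, 7.3), 1),
]

def nearest_neighbor_class(training, new_point):
    """Return the class of the training point that is closest to new_point."""
    nearest_row = min(training, key=lambda row: math.dist(row[0], new_point))
    return nearest_row[1]

# The two remaining observations of the table, used as new points
print(nearest_neighbor_class(training, (5.9, 4.3)))   # 1
print(nearest_neighbor_class(training, (7.2, 10.3)))  # 0
```

Both predictions agree with the Is an Apple column of the two new observations.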
Decision Boundary
Chronic Kidney Disease Classification
Because the two attributes have very different value ranges, a difference
in white blood cell count affects the distance far more than a difference
in glucose.
⇒ Convert both attributes to standard units.
Kidney Cross Section by Anmats is licensed under CC BY 3.0
Glucose  White Blood Cell Count  Class
117      6700                    1
70       12100                   1
114      7200                    0
131      6800                    0
Standard Units
[Figure panels: Absolute Distance, Standardized Distance]
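A minimal sketch of the conversion, using the four patients of the kidney disease table. It assumes that standard units are computed as (value - mean) / SD with the population standard deviation; the helper name `standard_units` is hypothetical.

```python
import math
import statistics

glucose = [117, 70, 114, 131]
white_blood_cells = [6700, 12100, 7200, 6800]

def standard_units(values):
    """Convert values to standard units: (value - mean) / population SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(value - mean) / sd for value in values]

glucose_su = standard_units(glucose)
wbc_su = standard_units(white_blood_cells)

# Distance between the first and the last patient, before and after conversion
raw = math.dist((glucose[0], white_blood_cells[0]),
                (glucose[3], white_blood_cells[3]))
standardized = math.dist((glucose_su[0], wbc_su[0]),
                         (glucose_su[3], wbc_su[3]))
print(round(raw, 2), round(standardized, 2))
```

In the original units, the distance is almost entirely the white blood cell difference of 100, although the two patients differ by 14 in glucose. After the conversion, both attributes are measured on a comparable scale and the glucose difference is no longer drowned out.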
What if there is no clear decision boundary?
Does this patient have chronic kidney disease?
K-Nearest Neighbors
[Figure legend: ill, healthy]
● Find the k points in the training set that are nearest to the new point.
● If the majority of these k nearest points are healthy, classify the new point as healthy; otherwise, classify it as ill.
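The following sketch shows why looking at more than one neighbor helps when the classes overlap. The training points are made up for illustration (one "ill" point sits inside a group of "healthy" points), and the function name `classify` is an assumption, not part of the lecture material.

```python
import math
from collections import Counter

def classify(training, new_point, k):
    """Return the majority class among the k training points nearest to new_point."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], new_point))
    k_nearest_classes = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_classes).most_common(1)[0][0]

# Hypothetical training points: ((attribute 1, attribute 2), class)
training = [
    ((0.0, 0.0), "healthy"),
    ((0.2, 0.1), "healthy"),
    ((0.1, 0.3), "healthy"),
    ((0.15, 0.15), "ill"),
    ((2.0, 2.0), "ill"),
    ((2.1, 1.8), "ill"),
]

new_point = (0.16, 0.17)
print(classify(training, new_point, k=1))  # ill
print(classify(training, new_point, k=3))  # healthy
```

With k=1 the single outlier decides the class; with k=3 it is outvoted by its two healthy neighbors. For an even k, a tie between the classes is possible, and this sketch then returns whichever tied class appears first among the nearest points.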
Which K?
● 1-Nearest Neighbors: overfitting
● 4-Nearest Neighbors: good fit
● 30-Nearest Neighbors: underfitting
Implementing the Classifier
https://www.inferentialthinking.com/chapters/17/4/Implementing_the_Classifier.html
Is Alice ill?
● Step 1: Find the distance between Alice and each point in the training
sample.
● Step 2: Sort the data table in increasing order of the distances.
● Step 3: Take the top k=4 rows of the sorted table.
● Step 4: Choose the majority class of these 4 rows.
Is Alice ill?
● Step 0.1: Load the training data.
● Step 0.2: Select and standardize the attributes that we use for classification.
Euclidean distance
Distance by Jim.belk is in the public domain
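For two points with two attributes each, such as glucose and white blood cell count in standard units, the Euclidean distance can be written as follows. The symbols $(x_0, y_0)$ and $(x_1, y_1)$ are notation chosen here for the two points.

```latex
D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}
```

With more than two attributes, one squared difference per attribute is added under the square root.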
Applying a function to each row in a table
● We can already apply a function to each element in a column:
TableName.apply(FunctionName, 'ColumnName')
● Now, we want to apply a function to the entire row:
TableName.apply(FunctionName)
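The difference between the two calls can be sketched in plain Python. The small `apply` helper below is a hypothetical stand-in written for this illustration, not the table library's implementation; the rows are the first two patients of the kidney disease table, and `double` and `attribute_sum` are made-up example functions.

```python
def apply(rows, function, column=None):
    """Hypothetical stand-in for a table's apply method.

    With a column name, call function on each value of that column;
    without one, call function on each entire row.
    """
    if column is not None:
        return [function(row[column]) for row in rows]
    return [function(row) for row in rows]

rows = [
    {"Glucose": 117, "White Blood Cell Count": 6700},
    {"Glucose": 70, "White Blood Cell Count": 12100},
]

def double(value):
    # Receives a single value of one column
    return 2 * value

def attribute_sum(row):
    # Receives the entire row and can read several columns at once
    return row["Glucose"] + row["White Blood Cell Count"]

print(apply(rows, double, "Glucose"))  # [234, 140]
print(apply(rows, attribute_sum))      # [6817, 12170]
```

A row-wise function is what the distance computation needs, because the distance to a new point depends on all attributes of a row together.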
Is Alice ill?
● Step 1: Find the distance between Alice and each point in the training
sample.
Glucose   White Blood Cell Count   Class   Distance from Alice
-0.2215   -0.569768                1       0.88943
-0.9475   1.16268                  1       2.16332
3.8412    -1.27558                 1       4.84907
0.3963    0.809777                 1       2.28585
0.6435    0.232293                 1       2.0542
-0.5614   -0.505603                1       0.660906
Is Alice ill?
● Step 2: Sort the data table in increasing order of the distances.
Glucose    White Blood Cell Count   Class   Distance from Alice
-0.94759   -0.98684                 0       0.0540298
-0.82401   -0.98684                 0       0.176477
-0.87035   -0.794345                0       0.243107
-0.71588   -0.85851                 0       0.317401
-0.70043   -0.85851                 0       0.331301
Is Alice ill?
● Step 3: Take the top k=4 rows of the sorted table.
Glucose    White Blood Cell Count   Class   Distance from Alice
-0.94759   -0.98684                 0       0.0540298
-0.82401   -0.98684                 0       0.176477
-0.87035   -0.794345                0       0.243107
-0.71588   -0.85851                 0       0.317401
Is Alice ill?
● Step 4: Choose the majority class of these 4 rows.
All in one:
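A minimal sketch that chains steps 1 to 4 in plain Python. The rows are copied from the tables shown for steps 1 and 2 and are only an excerpt of the training data. Alice's attribute values are not stated in the text; the point (-1, -1) in standard units is an assumption, chosen because it is consistent with the listed distances. The function name `classify_step_by_step` is hypothetical.

```python
import math
from collections import Counter

# (glucose, white blood cell count, class), attributes in standard units
training = [
    (-0.2215, -0.569768, 1),
    (-0.9475, 1.16268, 1),
    (3.8412, -1.27558, 1),
    (0.3963, 0.809777, 1),
    (0.6435, 0.232293, 1),
    (-0.5614, -0.505603, 1),
    (-0.94759, -0.98684, 0),
    (-0.82401, -0.98684, 0),
    (-0.87035, -0.794345, 0),
    (-0.71588, -0.85851, 0),
    (-0.70043, -0.85851, 0),
]

# Assumption: not given in the text, inferred from the listed distances
alice = (-1.0, -1.0)

def classify_step_by_step(training, new_point, k):
    # Step 1: distance between the new point and each training point
    with_distances = [
        (math.dist((glucose, wbc), new_point), label)
        for glucose, wbc, label in training
    ]
    # Step 2: sort in increasing order of the distances
    with_distances.sort(key=lambda pair: pair[0])
    # Step 3: take the top k rows
    top_k = with_distances[:k]
    # Step 4: choose the majority class of these k rows
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

print(classify_step_by_step(training, alice, k=4))  # 0
```

The four nearest rows all belong to class 0, so the sketch assigns Alice to class 0.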
Training and Testing
https://www.inferentialthinking.com/chapters/17/2/Training_and_Testing.html
Training and Testing
● How good is the classifier?
● How well does the classifier predict data that it has not seen before?
Training by Luca_Episcopo is licensed under the Pixabay License
Game by pixabay.com is licensed under CC0 1.0
Generating test data (hold-out set)
● We can gather more data,
● or we randomly split the given data into two parts: training and testing.
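A random split can be sketched as follows. The function name `split_rows`, the `test_fraction` and `seed` parameters, and the placeholder observations are assumptions made for this illustration.

```python
import random

def split_rows(rows, test_fraction, seed=None):
    """Randomly split rows into (training, test) parts without overlap."""
    shuffled = list(rows)  # copy, so that the input stays untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # placeholder observations
training, test = split_rows(rows, test_fraction=0.3, seed=0)
print(len(training), len(test))  # 7 3
```

Shuffling before cutting matters: if the data is sorted, for example by class, a plain cut would put different kinds of observations into the two parts.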
Never test on training data!
Is Felix a good soccer player?
Felix scores 10 goals in the training session with his friends!
So, is Felix a good player?
Well, in the actual game, Felix is very nervous and scores an own goal.
Never train on test data!
If we train a 1-Nearest Neighbor classifier on the following data, would it make any
mistakes on the same data?
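The question can be checked with a small experiment. The training data below is hypothetical and deliberately noisy; the sketch assumes that no two training points have identical attribute values, so that every training point is its own nearest neighbor at distance 0.

```python
import math

def nearest_neighbor_class(training, new_point):
    """Return the class of the training point that is closest to new_point."""
    return min(training, key=lambda row: math.dist(row[0], new_point))[1]

# Hypothetical training data with mixed classes in both regions
training = [
    ((0.0, 0.0), 0), ((0.1, 0.0), 1), ((0.2, 0.1), 0),
    ((1.0, 1.0), 1), ((1.1, 0.9), 0), ((0.9, 1.2), 1),
]

correct = sum(
    nearest_neighbor_class(training, point) == label
    for point, label in training
)
print(correct / len(training))  # 1.0
```

The classifier makes no mistakes on the data it was trained on, however noisy that data is, so this number says nothing about how well it predicts unseen data.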
The Accuracy of the Classifier
https://www.inferentialthinking.com/chapters/17/5/Accuracy_of_the_Classifier.html
Naming Convention for Prediction Evaluation
Was the prediction correct? True / False
Which class did we predict? Positive / Negative
Example: Prediction = Apple, Ground Truth = Not Apple
⇒ False Positive
Accuracy
                       Ground Truth: Positive   Ground Truth: Negative
Prediction: Positive   True Positive            False Positive
Prediction: Negative   False Negative           True Negative
Accuracy
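A minimal sketch of computing accuracy from the four outcome counts, assuming the usual definition: the number of correct predictions (true positives plus true negatives) divided by the number of all predictions. The predictions and ground-truth values are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(predictions, ground_truth, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for predicted, actual in zip(predictions, ground_truth):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical predictions and ground truth (1 = positive, 0 = negative)
predictions = [1, 1, 0, 0, 1, 0, 0, 1]
ground_truth = [1, 0, 0, 0, 1, 1, 0, 1]

counts = confusion_counts(predictions, ground_truth)
print(counts)             # (3, 1, 1, 3)
print(accuracy(*counts))  # 0.75
```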
Which K?
● 1-Nearest Neighbors: test accuracy 0.70
● 4-Nearest Neighbors: test accuracy 0.91
● 30-Nearest Neighbors: test accuracy 0.89
Summary
● Classification: Given a number of labeled examples, identify the class to which a
given observation belongs.
● We can use the nearest neighbors of an observation to classify it.
● To evaluate a classification model, we split the data into a training set and a test set.
● To measure success, we can use metrics such as accuracy.
