Threshold setting for reduction of false positives
1. Threshold setting for a high prediction rate with low false positives
Improving the performance of supervised classification
2. Why do we need a low false positive rate?
Let us take the example of a cancer prediction problem. If our model
predicts that one of our patients is going to have cancer when they
actually are not, we inflict needless trauma on the patient and their
family. In other words, while we want to accurately identify the
people who are going to have cancer, we do not want to falsely
predict cancer for someone who is not going to have it.
Hence, when we build a classification model, we need to ensure that it is
tested correctly and that its false positive rate is as low as possible
without compromising the classification accuracy of the model.
3. Testing the Classification Model
Testing requires two parameters to be observed:
• Sensitivity = (number of true positives predicted) / (total number of positives)
• Specificity = (number of true negatives predicted) / (total number of negatives)
Sensitivity can be intuitively thought of as the predictive (classifying) accuracy of the model on the positive class
(e.g. how correctly we predict the patients who have cancer).
Specificity can be intuitively thought of as the predictive (classifying) accuracy of the model on the negative
class (e.g. how correctly we predict the patients who do not have cancer).
4. For example
There is a sample of 2000 patients, 20 of whom have ovarian cancer.
The classification model built by a healthcare company predicts that 22
patients have ovarian cancer, 15 of whom actually do.
What are the sensitivity and specificity?
Sensitivity = 15/20 = 0.75
The model makes 22 − 15 = 7 false positive predictions, so of the 1980 patients
without cancer, 1980 − 7 = 1973 are correctly classified as negative:
Specificity = 1973/1980 ≈ 0.996
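The arithmetic above can be sketched in a few lines of Python; the counts are taken directly from the worked example:

```python
# Counts from the worked example: 2000 patients, 20 with ovarian cancer;
# the model flags 22 as positive, 15 of which truly have cancer.
total = 2000
actual_pos = 20
predicted_pos = 22
true_pos = 15

false_pos = predicted_pos - true_pos    # 7 healthy patients wrongly flagged
actual_neg = total - actual_pos         # 1980 patients without cancer
true_neg = actual_neg - false_pos       # 1973 correctly classified negatives

sensitivity = true_pos / actual_pos     # accuracy on the positive class
specificity = true_neg / actual_neg     # accuracy on the negative class

print(f"Sensitivity = {sensitivity:.2f}")   # Sensitivity = 0.75
print(f"Specificity = {specificity:.4f}")   # Specificity = 0.9965
```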
5. ROC Curve Analysis
• ROC curve: a plot of sensitivity vs. false positive rate
• Each point corresponds to a different threshold that separates negative samples
from positive samples
• The objective is to find a point (threshold) where the prediction rate is high
(high sensitivity) and the false positive rate is low
Source: The Use of Decision Threshold Adjustment in Classification of Cancer Prediction, http://www.ams.sunysb.edu/~hahn/psfile/papthres.pdf
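The threshold sweep behind an ROC curve can be sketched in plain Python. The scores and labels below are made up for illustration, and picking the point that maximizes sensitivity minus false positive rate (Youden's J statistic) is one common way to formalize "high sensitivity, low false positive rate"; it is not the only criterion:

```python
# Hypothetical model scores for 10 patients and their true labels
# (1 = cancer, 0 = healthy). Both lists are illustrative, not real data.
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0,    0,    0]

def roc_points(scores, labels):
    """Sensitivity and false positive rate at each candidate threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        # Predict "cancer" for every patient whose score reaches the threshold.
        tpr = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1) / pos
        fpr = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0) / neg
        points.append((t, tpr, fpr))
    return points

# Choose the threshold maximizing Youden's J = sensitivity - FPR.
best = max(roc_points(scores, labels), key=lambda p: p[1] - p[2])
print(f"threshold={best[0]:.2f}, sensitivity={best[1]:.2f}, FPR={best[2]:.2f}")
# threshold=0.65, sensitivity=1.00, FPR=0.17
```

Raising the threshold trades sensitivity for a lower false positive rate; each `(tpr, fpr)` pair returned by `roc_points` is one point on the ROC curve.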
7. Cases
• Breast Cancer Prediction – 0.98
• Fraud detection – 0.92
Source: http://www.gcxanalytics.com/papers/GCX%20Fraud%20Detection%20Performance%20Evaluation-GCX.pdf