DATA MINING MODEL
UNDERSTANDING ROC & AUROC & MODEL PREDICTIONS
- SIDDHARTH NEGI
2020H1120906U
What is a Receiver Operating Characteristic curve (ROC curve)?
• It is a plot of sensitivity, or True Positive Rate (TPR), vs. False Positive Rate (FPR).
• TPR is on the y-axis, from 0% to 100%; FPR is on the x-axis, from 0% to 100%.
• ROC applies to tests that produce results on a numerical scale, rather than a binary positive-vs.-negative outcome (see the sketch below).
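A minimal sketch of the idea, using a small hypothetical set of scores and labels: sweeping a decision threshold over the numerical scores yields one (FPR, TPR) point per threshold, and tracing those points out is what produces the ROC curve.

    import numpy as np

    # Hypothetical scores (higher = more likely positive) and true labels
    y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

    for thr in [0.2, 0.5, 0.75]:
        y_pred = (scores >= thr).astype(int)   # positive if score >= threshold
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        print(f"threshold={thr:.2f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")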
SIMPLE CASE STUDY
Let us look at four prediction results from 100 positive & 100 negative instances:
• The result of method A clearly shows the best predictive power among A, B, and C.
• The result of B lies on the random-guess line (the diagonal line), and it can be seen in the table that the accuracy of B is 50%.
• Although the original method C has negative predictive power, simply reversing its decisions yields a new method C′ with positive predictive power: when C is mirrored across the center point (0.5, 0.5), the resulting point C′ lies above the random-guess line (see the sketch below).
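A minimal sketch of the "reverse the decisions" trick, using a hypothetical classifier C that is worse than random: flipping its predicted labels swaps TP with FN and FP with TN, so its (FPR, TPR) point moves to (1 - FPR, 1 - TPR), i.e. it is mirrored across (0.5, 0.5).

    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=200)

    # Hypothetical classifier C that tends to predict the wrong class
    y_pred_c = np.where(rng.random(200) < 0.8, 1 - y_true, y_true)

    def fpr_tpr(y_true, y_pred):
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        return fp / (fp + tn), tp / (tp + fn)

    print("C :", fpr_tpr(y_true, y_pred_c))       # below the diagonal (FPR > TPR)
    print("C':", fpr_tpr(y_true, 1 - y_pred_c))   # mirrored across (0.5, 0.5)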
ROC CURVE USES IN PREDICTING MODELS AND PREDICTIONS
• To draw an ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed.
• TPR defines how many correct positive results occur among all positive samples available during the test.
• FPR defines how many incorrect positive results occur among all negative samples available during the test (see the sketch below).
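A minimal sketch of the two rates, assuming hypothetical hard predictions from some classifier; both come straight out of the contingency (confusion) table.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical true labels and hard predictions
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

    # scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    tpr = tp / (tp + fn)   # correct positives among all actual positives
    fpr = fp / (fp + tn)   # incorrect positives among all actual negatives
    print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")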
The closer a result from a contingency table is to the upper-left corner, the better it predicts.
The distance from the random-guess line in either direction is the best indicator of how much predictive power a method has.
If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random-guess line.
ROC CURVE USES IN PREDICTING MODELS AND PREDICTIONS
Area under the ROC curve (AUC)
• The higher the AUC, the more accurate the test.
• AUC = 1.0 means the test is 100% accurate (i.e. the curve traces the full square up to the top-left corner): a perfect classifier.
• AUC = 0.5 (50%) means the ROC curve is a straight diagonal line, which represents the ideal "bad" test, one which is only ever accurate by pure chance: a random classifier, e.g. flipping a coin.
• When comparing two tests, the more accurate test is the one whose ROC curve lies further toward the top-left corner of the graph, with the higher AUC.
• AUC < 0.5 means the ROC curve lies below the diagonal line: a classifier worse than random (see the sketch below).
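A minimal sketch of these boundary cases with roc_auc_score, using hypothetical scores: a score that separates the classes perfectly gives AUC = 1.0, a purely random score lands near 0.5, and an inverted score falls below 0.5.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(42)
    y_true = rng.integers(0, 2, size=1000)

    perfect_scores = y_true.astype(float)    # scores identical to the labels
    random_scores = rng.random(1000)         # scores unrelated to the labels
    inverted_scores = 1.0 - perfect_scores   # perfectly wrong scores

    print("Perfect :", roc_auc_score(y_true, perfect_scores))           # 1.0
    print("Random  :", round(roc_auc_score(y_true, random_scores), 3))  # ~0.5
    print("Inverted:", roc_auc_score(y_true, inverted_scores))          # 0.0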
ROC CURVE USING PYTHON
Prerequisites
1. Anaconda Navigator
2. Jupyter Notebook (running on localhost)
ROC CURVE USING PYTHON
Step-1 Generating a synthetic dataset
• Using the make_classification function.
• We create 2000 samples with 2 classes and 10 features.
Step-2 Adding noise features
• This makes the dataset more realistic; otherwise the models make near-perfect predictions and the results look too good to be true.
• It also makes the task harder for the models (a sketch of both steps follows).
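A sketch of what Steps 1 and 2 might look like; the 2000 samples, 2 classes, and 10 features come from the slide, while the random seeds and the number of noise features are assumptions.

    import numpy as np
    from sklearn.datasets import make_classification

    # Step 1: synthetic dataset with 2000 samples, 2 classes, 10 features
    X, y = make_classification(n_samples=2000, n_classes=2, n_features=10, random_state=1)

    # Step 2: append pure-noise features so the problem is not too easy
    noise = np.random.default_rng(1).normal(size=(X.shape[0], 20))  # 20 noise columns, an arbitrary choice
    X = np.hstack([X, noise])
    print(X.shape)  # (2000, 30)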
ROC CURVE USING PYTHON
Step-3 Splitting training and test data
• Here we split the training and test data from the input X & y matrices.
• 20% goes to the test set (i.e. test_size=0.2), leaving an 80% training set; this 80% training set will be fed to both models as input.
Defining 2 classification models (detailed in Steps 4 & 5)
• We define 2 models:
  1. Random Forest Classifier
  2. Gaussian Naive Bayes
• We will compare these 2 models and their performance (a sketch of the split and both model definitions follows).
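A sketch of the split and the two model definitions, continuing from the Step-1/2 sketch above (X, y) and assuming scikit-learn defaults for anything not stated on the slide.

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB

    # Step 3: 80/20 train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # Steps 4-5: the two classifiers (defined here, fitted on the next slide)
    rf = RandomForestClassifier(random_state=1)
    nb = GaussianNB()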
ROC CURVE USING PYTHON
Step-4 Defining the Random Forest Classifier
• Assign RandomForestClassifier to the rf variable.
• rf.fit creates (trains) the model.
• Input arguments: X_train & y_train, i.e. the 80% training set.
Step-5 Defining the Naive Bayes Classifier (same pattern; see the sketch below)
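A sketch of both fits, continuing from the previous block (rf, nb, X_train, y_train); the slide names only the rf.fit call, so the rest follows the same pattern.

    # Step 4: train the Random Forest on the 80% training set
    rf.fit(X_train, y_train)

    # Step 5: train the Gaussian Naive Bayes model the same way
    nb.fit(X_train, y_train)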
ROC CURVE USING PYTHON
Step-6 Predicting Probability Matrices
• The variable r_probs is created to contain all 0s: the worst-case (no-skill) reference.
• The variable rf_probs will contain the predicted probabilities from the Random Forest Classifier.
• The variable nb_probs will contain the predicted probabilities from the Naive Bayes Classifier.
• The predict_proba function is used to get the probabilities from the predictions.
• We keep only the positive-class probability in both variables (see the sketch below).
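A sketch of Step 6, continuing from the fitted models above; r_probs is the constant no-skill baseline, and [:, 1] keeps only the positive-class column of predict_proba's output.

    # No-skill baseline: probability 0 for every test sample
    r_probs = [0 for _ in range(len(y_test))]

    # Predicted probabilities from both models; predict_proba returns one
    # column per class, and [:, 1] is the probability of the positive class
    rf_probs = rf.predict_proba(X_test)[:, 1]
    nb_probs = nb.predict_proba(X_test)[:, 1]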
ROC CURVE USING PYTHON
Step-7 Computing AUROC (Area Under the ROC Curve)
• Import the roc_curve and roc_auc_score functions from the sklearn.metrics library.
• Variables:
  r_auc:  AUROC of the random (no-skill) prediction
  rf_auc: AUROC of the Random Forest prediction
  nb_auc: AUROC of the Naive Bayes prediction
• The random prediction scores 50% (AUROC = 0.5), which we defined as all 0s in the variable r_probs.
• The higher the AUROC, the better the classifier and the better the prediction (see the sketch below).
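A sketch of Step 7, continuing from the probability variables above.

    from sklearn.metrics import roc_curve, roc_auc_score

    r_auc = roc_auc_score(y_test, r_probs)    # ~0.5: the no-skill baseline
    rf_auc = roc_auc_score(y_test, rf_probs)  # Random Forest
    nb_auc = roc_auc_score(y_test, nb_probs)  # Gaussian Naive Bayes

    print(f"Random prediction: AUROC = {r_auc:.3f}")
    print(f"Random Forest    : AUROC = {rf_auc:.3f}")
    print(f"Naive Bayes      : AUROC = {nb_auc:.3f}")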
PLOTTING ROC CURVE USING PYTHON
Step-8 Plotting the ROC curve (a sketch follows)
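A sketch of Step 8, continuing from the AUROC values above; roc_curve returns the FPR/TPR points that matplotlib then draws, and the line styles are assumptions.

    import matplotlib.pyplot as plt

    # FPR/TPR points for each model
    r_fpr, r_tpr, _ = roc_curve(y_test, r_probs)
    rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
    nb_fpr, nb_tpr, _ = roc_curve(y_test, nb_probs)

    plt.plot(r_fpr, r_tpr, linestyle='--', label=f'Random prediction (AUROC = {r_auc:.3f})')
    plt.plot(rf_fpr, rf_tpr, marker='.', label=f'Random Forest (AUROC = {rf_auc:.3f})')
    plt.plot(nb_fpr, nb_tpr, marker='.', label=f'Naive Bayes (AUROC = {nb_auc:.3f})')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.show()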
ROC CURVE PLOTTING
Step-9 Outcome of the ROC Curve
Conclusions:
• Random prediction is the diagonal line, with AUROC = 0.500: just like flipping a coin.
• Random Forest is the orange curve, AUROC = 0.941: better than the random-prediction model.
• Naive Bayes is the green curve, AUROC = 0.993 (maximum is 1.00): the best model and prediction among all 3 models.
ROC CURVE PLOTTING
Dataset from a CSV file
• Same data that was used in Assignment-1.
• The data was appended until it reached 1000 samples (a sketch of loading it follows).
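A sketch of loading the CSV; the Assignment-1 file name and column layout are not shown in the deck, so the names below are placeholders.

    import pandas as pd

    # Placeholder file and column names -- substitute the actual Assignment-1 CSV
    df = pd.read_csv('assignment1_data.csv')
    X = df.drop(columns=['target']).values
    y = df['target'].values
    print(df.shape)  # expected: 1000 rows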
ROC CURVE PLOTTING
Train-Test Split
Feature Scaling
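A sketch of these two steps for the CSV data, assuming the same 80/20 split as before and a StandardScaler for the feature scaling (the deck does not name the scaler).

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Feature scaling: fit on the training set only, then transform both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)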
ROC CURVE PLOTTING
SVM Classifier
Logistic Classifier
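A sketch of the two classifiers; probability=True is assumed for the SVM so that predict_proba is available for the ROC computation, and all other hyperparameters are left at scikit-learn defaults.

    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression

    # SVM classifier with probability estimates enabled for ROC/AUROC
    svm = SVC(probability=True, random_state=0)
    svm.fit(X_train, y_train)

    # Logistic Regression classifier
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train, y_train)

    # Positive-class probabilities for both models
    svm_probs = svm.predict_proba(X_test)[:, 1]
    lr_probs = logreg.predict_proba(X_test)[:, 1]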
ROC CURVE PLOTTING
Plotting ROC
ROC CURVE PLOTTING
Calculating AUROC and comparison
• AUROC of SVM: 0.837
• AUROC of Logistic Regression: 0.853
• Logistic Regression performs slightly better than SVM on this data (see the sketch below).
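A sketch of the comparison, continuing from the SVM/Logistic block above; the exact values depend on the data and random seeds.

    from sklearn.metrics import roc_auc_score

    svm_auc = roc_auc_score(y_test, svm_probs)
    lr_auc = roc_auc_score(y_test, lr_probs)

    print(f"AUROC of SVM                : {svm_auc:.3f}")  # 0.837 reported on the slide
    print(f"AUROC of Logistic Regression: {lr_auc:.3f}")   # 0.853 reported on the slide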
The full Python code is posted on my GitHub page:
https://github.com/enggsidds/SidData/blob/main/DataModelling_ROC