DATA MINING MODEL
UNDERSTANDING ROC & AUROC & MODEL PREDICTIONS
- SIDDHARTH NEGI
2020H1120906U
What is a Receiver Operating Characteristic curve (ROC curve)?
• It is a plot of sensitivity, or True Positive Rate (TPR), vs. False Positive Rate (FPR).
• TPR is on the y-axis, from 0% to 100%; FPR is on the x-axis, from 0% to 100%.
• ROC applies to tests that produce results on a numerical scale, rather than a binary positive-vs.-negative outcome (see the sketch below).
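A minimal sketch of the idea, using a small hypothetical set of scores and labels: sweeping a decision threshold over the numerical scores yields one (FPR, TPR) point per threshold, and tracing those points out is what produces the ROC curve.

    import numpy as np

    # Hypothetical scores (higher = more likely positive) and true labels
    y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

    for thr in [0.2, 0.5, 0.75]:
        y_pred = (scores >= thr).astype(int)   # positive if score >= threshold
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        print(f"threshold={thr:.2f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")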
SIMPLE CASE STUDY
Let us look at four prediction results from 100 positive & 100 negative instances:
• The result of method A clearly shows the best predictive power among A, B, and C.
• The result of B lies on the random-guess line (the diagonal line), and it can be seen in the table that the accuracy of B is 50%.
• Although the original method C has negative predictive power, simply reversing its decisions yields a new method C′ with positive predictive power: when C is mirrored across the center point (0.5, 0.5), the resulting point C′ lies above the random-guess line (see the sketch below).
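A minimal sketch of the "reverse the decisions" trick, using a hypothetical classifier C that is worse than random: flipping its predicted labels swaps TP with FN and FP with TN, so its (FPR, TPR) point moves to (1 - FPR, 1 - TPR), i.e. it is mirrored across (0.5, 0.5).

    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=200)

    # Hypothetical classifier C that tends to predict the wrong class
    y_pred_c = np.where(rng.random(200) < 0.8, 1 - y_true, y_true)

    def fpr_tpr(y_true, y_pred):
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        return fp / (fp + tn), tp / (tp + fn)

    print("C :", fpr_tpr(y_true, y_pred_c))       # below the diagonal (FPR > TPR)
    print("C':", fpr_tpr(y_true, 1 - y_pred_c))   # mirrored across (0.5, 0.5)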
ROC CURVE USES IN PREDICTING MODELS AND PREDICTIONS
• To draw an ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed.
• TPR defines how many correct positive results occur among all positive samples available during the test.
• FPR defines how many incorrect positive results occur among all negative samples available during the test (see the sketch below).
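A minimal sketch of the two rates, assuming hypothetical hard predictions from some classifier; both come straight out of the contingency (confusion) table.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical true labels and hard predictions
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

    # scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    tpr = tp / (tp + fn)   # correct positives among all actual positives
    fpr = fp / (fp + tn)   # incorrect positives among all actual negatives
    print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")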
The closer a result from a contingency table is to the upper-left corner, the better it predicts.
The distance from the random-guess line in either direction is the best indicator of how much predictive power a method has.
If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random-guess line.
ROC CURVE USES IN PREDICTING MODELS AND PREDICTIONS
Area under the ROC curve (AUC)
• The higher the AUC, the more accurate the test.
• AUC = 1.0 means the test is 100% accurate (i.e. the curve traces the full square up to the top-left corner): a perfect classifier.
• AUC = 0.5 (50%) means the ROC curve is a straight diagonal line, which represents the ideal "bad" test, one which is only ever accurate by pure chance: a random classifier, e.g. flipping a coin.
• When comparing two tests, the more accurate test is the one whose ROC curve lies further toward the top-left corner of the graph, with the higher AUC.
• AUC < 0.5 means the ROC curve lies below the diagonal line: a classifier worse than random (see the sketch below).
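A minimal sketch of these boundary cases with roc_auc_score, using hypothetical scores: a score that separates the classes perfectly gives AUC = 1.0, a purely random score lands near 0.5, and an inverted score falls below 0.5.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(42)
    y_true = rng.integers(0, 2, size=1000)

    perfect_scores = y_true.astype(float)    # scores identical to the labels
    random_scores = rng.random(1000)         # scores unrelated to the labels
    inverted_scores = 1.0 - perfect_scores   # perfectly wrong scores

    print("Perfect :", roc_auc_score(y_true, perfect_scores))           # 1.0
    print("Random  :", round(roc_auc_score(y_true, random_scores), 3))  # ~0.5
    print("Inverted:", roc_auc_score(y_true, inverted_scores))          # 0.0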
ROC CURVE USING PYTHON
Prerequisites
1. Anaconda Navigator
2. Jupyter Notebook (running on localhost)
ROC CURVE USING PYTHON
Step-1 Generating a synthetic dataset
• Using the make_classification function.
• We create 2000 samples with 2 classes and 10 features.
Step-2 Adding noise features
• This makes the dataset more realistic; otherwise the models make near-perfect predictions and the results look too good to be true.
• It also makes the task harder for the models (a sketch of both steps follows).
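A sketch of what Steps 1 and 2 might look like; the 2000 samples, 2 classes, and 10 features come from the slide, while the random seeds and the number of noise features are assumptions.

    import numpy as np
    from sklearn.datasets import make_classification

    # Step 1: synthetic dataset with 2000 samples, 2 classes, 10 features
    X, y = make_classification(n_samples=2000, n_classes=2, n_features=10, random_state=1)

    # Step 2: append pure-noise features so the problem is not too easy
    noise = np.random.default_rng(1).normal(size=(X.shape[0], 20))  # 20 noise columns, an arbitrary choice
    X = np.hstack([X, noise])
    print(X.shape)  # (2000, 30)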
ROC CURVE USING PYTHON
Step-3 Splitting training and test data
• Here we split the training and test data from the input X & y matrices.
• 20% goes to the test set (i.e. test_size=0.2), leaving an 80% training set; this 80% training set will be fed to both models as input.
Defining 2 classification models (detailed in Steps 4 & 5)
• We define 2 models:
  1. Random Forest Classifier
  2. Gaussian Naive Bayes
• We will compare these 2 models and their performance (a sketch of the split and both model definitions follows).
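A sketch of the split and the two model definitions, continuing from the Step-1/2 sketch above (X, y) and assuming scikit-learn defaults for anything not stated on the slide.

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB

    # Step 3: 80/20 train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # Steps 4-5: the two classifiers (defined here, fitted on the next slide)
    rf = RandomForestClassifier(random_state=1)
    nb = GaussianNB()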
ROC CURVE USING PYTHON
Step-4 Defining the Random Forest Classifier
• Assign RandomForestClassifier to the rf variable.
• rf.fit creates (trains) the model.
• Input arguments: X_train & y_train, i.e. the 80% training set.
Step-5 Defining the Naive Bayes Classifier (same pattern; see the sketch below)
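A sketch of both fits, continuing from the previous block (rf, nb, X_train, y_train); the slide names only the rf.fit call, so the rest follows the same pattern.

    # Step 4: train the Random Forest on the 80% training set
    rf.fit(X_train, y_train)

    # Step 5: train the Gaussian Naive Bayes model the same way
    nb.fit(X_train, y_train)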
ROC CURVE USING PYTHON
Step-6 Predicting Probability Matrices
• The variable r_probs is created to contain all 0s: the worst-case (no-skill) reference.
• The variable rf_probs will contain the predicted probabilities from the Random Forest Classifier.
• The variable nb_probs will contain the predicted probabilities from the Naive Bayes Classifier.
• The predict_proba function is used to get the probabilities from the predictions.
• We keep only the positive-class probability in both variables (see the sketch below).
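A sketch of Step 6, continuing from the fitted models above; r_probs is the constant no-skill baseline, and [:, 1] keeps only the positive-class column of predict_proba's output.

    # No-skill baseline: probability 0 for every test sample
    r_probs = [0 for _ in range(len(y_test))]

    # Predicted probabilities from both models; predict_proba returns one
    # column per class, and [:, 1] is the probability of the positive class
    rf_probs = rf.predict_proba(X_test)[:, 1]
    nb_probs = nb.predict_proba(X_test)[:, 1]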
ROC CURVE USING PYTHON
Step-7 Computing AUROC (Area Under the ROC Curve)
• Import the roc_curve and roc_auc_score functions from the sklearn.metrics library.
• Variables:
  r_auc:  AUROC of the random (no-skill) prediction
  rf_auc: AUROC of the Random Forest prediction
  nb_auc: AUROC of the Naive Bayes prediction
• The random prediction scores 50% (AUROC = 0.5), which we defined as all 0s in the variable r_probs.
• The higher the AUROC, the better the classifier and the better the prediction (see the sketch below).
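A sketch of Step 7, continuing from the probability variables above.

    from sklearn.metrics import roc_curve, roc_auc_score

    r_auc = roc_auc_score(y_test, r_probs)    # ~0.5: the no-skill baseline
    rf_auc = roc_auc_score(y_test, rf_probs)  # Random Forest
    nb_auc = roc_auc_score(y_test, nb_probs)  # Gaussian Naive Bayes

    print(f"Random prediction: AUROC = {r_auc:.3f}")
    print(f"Random Forest    : AUROC = {rf_auc:.3f}")
    print(f"Naive Bayes      : AUROC = {nb_auc:.3f}")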
PLOTTING ROC CURVE USING PYTHON
Step-8 Plotting the ROC curve (a sketch follows)
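A sketch of Step 8, continuing from the AUROC values above; roc_curve returns the FPR/TPR points that matplotlib then draws, and the line styles are assumptions.

    import matplotlib.pyplot as plt

    # FPR/TPR points for each model
    r_fpr, r_tpr, _ = roc_curve(y_test, r_probs)
    rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
    nb_fpr, nb_tpr, _ = roc_curve(y_test, nb_probs)

    plt.plot(r_fpr, r_tpr, linestyle='--', label=f'Random prediction (AUROC = {r_auc:.3f})')
    plt.plot(rf_fpr, rf_tpr, marker='.', label=f'Random Forest (AUROC = {rf_auc:.3f})')
    plt.plot(nb_fpr, nb_tpr, marker='.', label=f'Naive Bayes (AUROC = {nb_auc:.3f})')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.show()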
ROC CURVE PLOTTING
Step-9 Outcome of the ROC Curve
Conclusions:
• Random prediction is the diagonal line, with AUROC = 0.500: just like flipping a coin.
• Random Forest is the orange curve, AUROC = 0.941: better than the random-prediction model.
• Naive Bayes is the green curve, AUROC = 0.993 (maximum is 1.00): the best model and prediction among all 3 models.
ROC CURVE PLOTTING
Dataset from a CSV file
• Same data that was used in Assignment-1.
• The data was appended until it reached 1000 samples (a sketch of loading it follows).
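A sketch of loading the CSV; the Assignment-1 file name and column layout are not shown in the deck, so the names below are placeholders.

    import pandas as pd

    # Placeholder file and column names -- substitute the actual Assignment-1 CSV
    df = pd.read_csv('assignment1_data.csv')
    X = df.drop(columns=['target']).values
    y = df['target'].values
    print(df.shape)  # expected: 1000 rows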
ROC CURVE PLOTTING
Train-Test Split
Feature Scaling
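A sketch of these two steps for the CSV data, assuming the same 80/20 split as before and a StandardScaler for the feature scaling (the deck does not name the scaler).

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Feature scaling: fit on the training set only, then transform both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)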
ROC CURVE PLOTTING
SVM Classifier
Logistic Classifier
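A sketch of the two classifiers; probability=True is assumed for the SVM so that predict_proba is available for the ROC computation, and all other hyperparameters are left at scikit-learn defaults.

    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression

    # SVM classifier with probability estimates enabled for ROC/AUROC
    svm = SVC(probability=True, random_state=0)
    svm.fit(X_train, y_train)

    # Logistic Regression classifier
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train, y_train)

    # Positive-class probabilities for both models
    svm_probs = svm.predict_proba(X_test)[:, 1]
    lr_probs = logreg.predict_proba(X_test)[:, 1]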
ROC CURVE PLOTTING
Plotting ROC
ROC CURVE PLOTTING
Calculating AUROC and comparison
• AUROC of SVM: 0.837
• AUROC of Logistic Regression: 0.853
• Logistic Regression performs slightly better than SVM on this data (see the sketch below).
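A sketch of the comparison, continuing from the SVM/Logistic block above; the exact values depend on the data and random seeds.

    from sklearn.metrics import roc_auc_score

    svm_auc = roc_auc_score(y_test, svm_probs)
    lr_auc = roc_auc_score(y_test, lr_probs)

    print(f"AUROC of SVM                : {svm_auc:.3f}")  # 0.837 reported on the slide
    print(f"AUROC of Logistic Regression: {lr_auc:.3f}")   # 0.853 reported on the slide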
The full Python code is posted on my GitHub page:
https://github.com/enggsidds/SidData/blob/main/DataModelling_ROC