Data Mining - Classification of Breast Cancer Dataset Using Decision Tree Induction - Sunil Nair, Health Informatics, Dalhousie University
1. Classification of Breast Cancer Dataset Using Decision Tree Induction
Sunil Nair, Abel Gebreyesus
Masters of Health Informatics, Dalhousie University
HINF6210 Project Presentation – November 25, 2008
2. Agenda
- Objective
- Dataset
- Approach
- Classification Methods
- Decision Tree
- Problems
- Future Direction
3. Introduction
- Breast cancer prognosis
  - Breast cancer incidence is high
  - Diagnostic methods have improved, enabling early diagnosis and treatment
  - Recurrence, however, remains high, so good prognosis is important
4. Objective
- Significance of the project
  - Previous work has been done using this dataset
  - Most of that work indicated room for improvement in classifier accuracy
5. Breast Cancer Dataset
Wisconsin Breast Cancer Database (1991), University of Wisconsin Hospitals, Dr. William H. Wolberg
- Number of instances: 699
- Number of attributes: 10, plus a Class attribute
- Class distribution:
  - Benign (2): 458 (65.5%)
  - Malignant (4): 241 (34.5%)
- Missing values: 16
6. Attributes
- Describe cellular characteristics
- Variables are continuous and ordinal, with 10 levels

  #    Attribute                     Domain
  1    Sample code number            id number
  2    Clump Thickness               1-10
  3    Uniformity of Cell Size       1-10
  4    Uniformity of Cell Shape      1-10
  5    Marginal Adhesion             1-10
  6    Single Epithelial Cell Size   1-10
  7    Bare Nuclei                   1-10
  8    Bland Chromatin               1-10
  9    Normal Nucleoli               1-10
  10   Mitoses                       1-10
  11   Class                         Benign (2), Malignant (4)
7. Attributes / Class Distribution
- The dataset is unbalanced
8. Our Approach
- Data pre-processing
- Comparison between classification techniques
- Decision tree induction
  - Attribute selection
  - J48
  - Evaluation
9. Data Pre-processing
- Filter out the ID column
- Handle missing values
  - WEKA
10. Data Pre-processing
- WEKA offers two options to manage missing data:
  - ReplaceMissingValues (weka.filters.unsupervised.attribute.ReplaceMissingValues)
    - Missing nominal and numeric attribute values are replaced with the modes and means of the data
  - Remove (delete) the tuples with missing values
    - All 16 missing values occur in the Bare Nuclei attribute
    - Outliers
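The mean/mode substitution performed by WEKA's ReplaceMissingValues filter can be sketched in a few lines. This is an illustration of the idea, not WEKA's actual implementation; the `rows`/`numeric_cols` interface and the `"?"` missing-value marker are assumptions for the example.

```python
from statistics import mean, mode

def replace_missing(rows, numeric_cols, missing="?"):
    """Column-wise imputation in the spirit of WEKA's ReplaceMissingValues:
    means for numeric attributes, modes for nominal ones."""
    columns = list(zip(*rows))
    imputed = []
    for i, col in enumerate(columns):
        present = [v for v in col if v != missing]
        if i in numeric_cols:
            fill = mean(float(v) for v in present)  # mean of observed values
        else:
            fill = mode(present)                    # most frequent observed value
        imputed.append([fill if v == missing else v for v in col])
    return [list(row) for row in zip(*imputed)]
```

Deleting tuples instead (the second option) simply drops every row containing the missing-value marker.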
11. Comparison Chart – Handling Missing Values
- Total correctly classified instances (test split): 223; accuracy rate: 95.78%
- How many predictions occur by chance? The Kappa statistic measures the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance; it is reported below as the expected accuracy rate.

  PERFORMANCE EVALUATION
  DATASET            # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
  Complete           14        8%    94%              87%
  Missing Removed    11        5%    96%              90%
  Missing Replaced   14        7%    95%              89%

  CONFUSION MATRIX
  Class    B     M    Total
  B        160   7    167
  M        3     63   66
  Total    163   70   233
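The Kappa statistic described above can be computed directly from the confusion matrix. A minimal sketch, using the matrix from this slide (actual B: 160 correct / 7 wrong; actual M: 3 wrong / 63 correct):

```python
def kappa(cm):
    """Cohen's kappa from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    # Chance agreement: product of matching row and column marginals.
    expected = sum(sum(row) * sum(col)
                   for row, col in zip(cm, zip(*cm))) / (n * n)
    return (observed - expected) / (1 - expected)

cm = [[160, 7], [3, 63]]  # confusion matrix from the slide
```

For this matrix the observed accuracy is 223/233 and kappa comes out near 0.90, in line with the expected-accuracy column reported on the slides.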
12. Data Pre-processing
- Missing values replaced – mean/mode
- Missing values removed – mean/mode
13. Agenda
- Objective
- Dataset
- Approach
  - Data Pre-Processing
- Classification Methods
- Decision Tree
- Problems
- Future Direction
14. Classification Methods Comparison

  PERFORMANCE EVALUATION (test set)
  CLASSIFIER          # Total Inst.   MAE   Act. Acc. Rate   Exp. Acc. Rate
  Naïve Bayes         233             4%    96%              90%
  Neural Network      233             10%   91%              79%
  DT-J48              233             4%    97%              92%
  Support Vector M.   233             3%    97%              94%
15. Classification Using a Decision Tree
- Decision tree – WEKA J48 (C4.5)
  - Divide-and-conquer algorithm
  - The tree can be converted into classification rules
  - J48 can handle numeric attributes
- Attribute selection – information gain
16. Attributes Selected – Highest Information Gain
weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

  Rank   Attribute                     Information Gain
  1      Uniformity of Cell Size       0.675
  2      Uniformity of Cell Shape      0.66
  3      Bare Nuclei                   0.564
  4      Bland Chromatin               0.543
  5      Single Epithelial Cell Size   0.505
  6      Normal Nucleoli               0.466
  7      Clump Thickness               0.459
  8      Marginal Adhesion             0.443
  9      Mitoses                       0.198

  PERFORMANCE EVALUATION
  DATASET               # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
  Missing Removed       11        5%    96%              90%
  Missing Replaced      14        7%    95%              89%
  Attributes Selected   11        4%    97%              92%
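Information gain, the quantity used above to rank the attributes, is the reduction in class entropy obtained by splitting on an attribute. A self-contained sketch of the computation (toy data; WEKA's InfoGainAttributeEval computes the same quantity over the real dataset):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the weighted entropy of the class
    within each attribute-value subset."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```

An attribute whose values perfectly separate the classes has gain equal to the class entropy; an attribute independent of the class has gain 0.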
17. The Decision Tree – IG/Attribute Selection Visualization
18. Decision Tree – Problems
- Concerns
  - Missing values
  - Pruning – prepruning or postpruning
  - Estimating error rates
- Unbalanced dataset
  - Bias in prediction
  - Overfitting – in the test set
  - Underfitting
19. Confusion Matrix – Performance Evaluation
- The overall accuracy rate is the number of correct classifications divided by the total number of classifications:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Error rate = 1 - Accuracy
- Accuracy is not a reliable measure if the dataset is unbalanced, i.e., the classes are unequally represented

                    Predicted B (2)   Predicted M (4)
  Act. Class B (2)  TP                FN
  Act. Class M (4)  FP                TN
20. Unbalanced Dataset Problem
- Solution: stratified sampling method
  - Partition the dataset based on class
  - Random sampling process
  - Create training and test sets with equal class representation
  - Testing-set data are independent of the training set
    - Standard verification technique
    - Best error estimate
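The steps above can be sketched as a simple stratified split: partition by class, then sample each class separately so both sets preserve the class proportions. This is an illustration, not WEKA's sampling filter; `label_of`, the 30% test fraction, and the fixed seed are assumptions for the example.

```python
import random

def stratified_split(rows, label_of, test_fraction=0.3, seed=1):
    """Partition rows by class, then sample each class separately so the
    training and test sets preserve the class proportions."""
    rng = random.Random(seed)            # fixed seed for repeatability
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)             # random sampling within each class
        cut = int(len(members) * test_fraction)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test
```

Because each instance lands in exactly one of the two sets, the test data stay independent of the training data while both keep the benign/malignant ratio of the full dataset.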
21. Stratified Sampling Method
22. Performance Evaluation

  PERFORMANCE EVALUATION (test set)
  Dataset        # Instances   # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
  Training set   476           13        2%    99%              97%
  Testing set    412           13        3%    96%              92%
23. Tree Visualization
24. Unbalanced Dataset Problem
- Solution: cost matrix
  - Cost-sensitive classification
  - Costs are not known
    - A complete financial analysis is needed, i.e., the cost of:
      - Using the ML tool
      - Gathering training data
      - Using the model
      - Determining the attributes for a test
    - Cross-validation once all costs are known
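Once a cost matrix is available, cost-sensitive evaluation weighs each confusion-matrix cell by its cost instead of counting all errors equally. A minimal sketch; the example costs are purely illustrative (the slide notes the real costs are not yet known), with a missed malignancy assumed five times as costly as a false alarm.

```python
def total_cost(confusion, cost):
    """Sum of each confusion-matrix cell weighted by its cost
    (rows = actual class, columns = predicted class, order [B, M])."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(confusion))
               for j in range(len(confusion[i])))

# Hypothetical cost matrix: correct predictions cost 0, a false
# positive (B predicted M) costs 1, a false negative (M predicted B)
# costs 5.
cost = [[0, 1],
        [5, 0]]
```

A cost-sensitive classifier would then be tuned to minimize this total rather than the plain error rate.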
25. Future Direction
- The overall accuracy of the classifier needs to be increased
- Cluster-based stratified sampling
  - Partition the original dataset using the k-means algorithm
- Multiple-classifier models
  - Bagging and boosting techniques
- ROC (Receiver Operating Characteristic)
  - Plot the TP rate (y-axis) against the FP rate (x-axis)
  - Advantage: independent of class distribution and error costs
26. ROC Curve – Visualization
- Curves shown for the Benign class and the Malignant class
- Area under the curve (AUC)
  - The larger the area, the better the model
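The ROC curve and its AUC can be computed from classifier scores by sweeping a decision threshold, as described on the previous slide. A minimal sketch with toy scores; a real evaluation would use the class probabilities produced by J48 for each instance.

```python
def roc_points(scores, labels, positive="M"):
    """(FP rate, TP rate) pairs obtained by sweeping a decision
    threshold over the classifier's scores for the positive class."""
    pos = sum(1 for l in labels if l == positive)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == positive)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l != positive)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every malignant case above every benign one reaches AUC = 1.0; random scoring hovers around 0.5, independent of the class distribution.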
27. Questions / Comments
Thank You!