HINF6210 Project: Classification of Breast Cancer Dataset

Breast cancer treatment remains an unresolved challenge for medical practitioners. The key to better treatment is early diagnosis, yet even after early diagnosis and treatment the chance of recurrence is high. An accurate early prognosis therefore helps patients receive better treatment. Data mining, as a knowledge discovery field, can contribute to better prognosis through a higher prediction accuracy rate. In this report, working with the WEKA software and the Wisconsin Breast Cancer Database collected by Dr. William H. Wolberg of the University of Wisconsin Hospitals, we show how to build a decision tree with a better accuracy rate and discuss how a decision tree data mining technique provides a better prediction tool.

  1. Classification of Breast Cancer Dataset Using Decision Tree Induction
     Abel Medhanie Gebreyesus, Sunil Nair
     HINF6210 Project Presentation – November 25, 2008
  2. Agenda
     • Objective
     • Dataset
     • Approach
     • Classification Methods
     • Decision Tree Problems
     • Future direction
  3. Introduction – Breast Cancer Prognosis
     • Breast cancer incidence is high
     • Diagnostic methods have improved, allowing early diagnosis and treatment
     • Even so, recurrence is high
     • Good prognosis is therefore important
  4. Objective
     • Significance of the project
     • Previous work done using this dataset
     • Most previous work indicated room for improvement in classifier accuracy
  5. Breast Cancer Dataset
     • Wisconsin Breast Cancer Database (1991), University of Wisconsin Hospitals, Dr. William H. Wolberg
     • Number of instances: 699
     • Number of attributes: 10, plus the class attribute
     • Class distribution: Benign (2): 458 (65.5%); Malignant (4): 241 (34.5%)
     • Missing values: 16
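     As a point of reference, a minimal sketch of loading this dataset into WEKA and reproducing the counts above; the file name and the ARFF format with a nominal class attribute are assumptions.

```java
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataset {
    public static void main(String[] args) throws Exception {
        // Read an ARFF version of the UCI breast-cancer-wisconsin data (file name assumed)
        Instances data = DataSource.read("breast-cancer-wisconsin.arff");
        // The class label (Benign/Malignant) is the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances:  " + data.numInstances());   // expected 699
        System.out.println("Attributes: " + data.numAttributes());  // 10 + class

        // Per-class counts, e.g. Benign 458 (65.5%) vs. Malignant 241 (34.5%)
        AttributeStats stats = data.attributeStats(data.classIndex());
        int[] counts = stats.nominalCounts;
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("%s: %d%n", data.classAttribute().value(i), counts[i]);
        }
    }
}
```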
  6. Attributes
     • Indicate cellular characteristics
     • Variables are continuous, ordinal with 10 levels

     #    Attribute                     Domain
     1    Sample code number            id number
     2    Clump Thickness               1–10
     3    Uniformity of Cell Size       1–10
     4    Uniformity of Cell Shape      1–10
     5    Marginal Adhesion             1–10
     6    Single Epithelial Cell Size   1–10
     7    Bare Nuclei                   1–10
     8    Bland Chromatin               1–10
     9    Normal Nucleoli               1–10
     10   Mitoses                       1–10
     11   Class                         Benign (2), Malignant (4)
  7. Attribute / Class Distribution
     • The dataset is unbalanced
  8. Our Approach
     • Data pre-processing
     • Comparison between classification techniques
     • Decision tree induction
     • Attribute selection
     • J48
     • Evaluation
  9. Data Pre-processing (WEKA)
     • Filter out the ID column
     • Handle missing values
  10. Data Pre-processing – Two Options to Manage Missing Data (WEKA)
     • Replace: the ReplaceMissingValues filter (weka.filters.unsupervised.attribute.ReplaceMissingValues) replaces missing nominal and numeric values with the attribute's mode or mean
     • Remove: delete the tuples that contain missing values
     • All 16 missing values occur in the Bare Nuclei attribute
     • Outliers
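     A minimal sketch of both options in WEKA's Java API, assuming `data` holds the instances loaded earlier with the class index already set; the replace path uses the ReplaceMissingValues filter named on the slide.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class Preprocess {
    /** Drops the ID column and returns a copy with missing values replaced. */
    static Instances run(Instances data) throws Exception {
        // 1. Filter out the Sample code number (ID) column, attribute index 1
        Remove dropId = new Remove();
        dropId.setAttributeIndices("1");
        dropId.setInputFormat(data);
        Instances noId = Filter.useFilter(data, dropId);

        // 2a. Replace the 16 missing Bare Nuclei values with the attribute mean/mode
        ReplaceMissingValues fill = new ReplaceMissingValues();
        fill.setInputFormat(noId);
        Instances replaced = Filter.useFilter(noId, fill);

        // 2b. Alternative: delete the tuples that contain any missing value
        Instances removed = new Instances(noId);
        for (int i = 0; i < removed.numAttributes(); i++) {
            removed.deleteWithMissing(i);
        }
        System.out.println("After removal: " + removed.numInstances() + " instances");

        return replaced;
    }
}
```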
  11. Comparison Chart – Handling Missing Values

     Confusion matrix (test split; correctly classified instances = 223, accuracy rate 95.78%):
                     Predicted B   Predicted M   Total
     Actual B            160             7        167
     Actual M              3            63         66
     Total                163            70        233

     Performance evaluation:
     Dataset            # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
     Complete           14        8%    94%              87%
     Missing Removed    11        5%    96%              90%
     Missing Replaced   14        7%    95%              89%

     How many predictions are correct by chance? The Expected Accuracy Rate (Kappa statistic) measures the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance.
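     The expected-accuracy figure can be checked directly from the matrix counts shown above; a small sketch of that arithmetic (observed agreement vs. chance agreement), which lands near the 0.90 kappa reported for the missing-removed dataset.

```java
// Checking accuracy and kappa from the confusion matrix on this slide:
// B row = 160, 7; M row = 3, 63.
public class KappaCheck {
    public static void main(String[] args) {
        double bb = 160, bm = 7, mb = 3, mm = 63;
        double total = bb + bm + mb + mm;                       // 233

        double observed = (bb + mm) / total;                    // observed accuracy ~0.957
        // Chance agreement: products of row and column marginals, summed over classes
        double expected = ((bb + bm) * (bb + mb) + (mb + mm) * (bm + mm))
                          / (total * total);                    // ~0.587
        double kappa = (observed - expected) / (1 - expected);  // ~0.90

        System.out.printf("Accuracy = %.3f, Kappa = %.3f%n", observed, kappa);
    }
}
```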
  12. Data Pre-processing (mean/mode replacement)
     • Missing values replaced
     • Missing values removed
  13. Agenda
     • Objective
     • Dataset
     • Approach
     • Data Pre-Processing
     • Classification Methods
     • Decision Tree Problems
     • Future direction
  14. Classification Methods – Comparison

     Performance evaluation (test set):
     Classifier               Total Inst.   MAE    Act. Acc. Rate   Exp. Acc. Rate
     Naïve Bayes              233            4%    96%              90%
     Neural Network           233           10%    91%              79%
     Support Vector Machine   233            3%    97%              94%
     DT-J48                   233            4%    97%              92%
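     One way to run such a comparison through WEKA's Java API; the classifier classes used here (NaiveBayes, MultilayerPerceptron, SMO, J48) are WEKA's standard implementations of the four methods in the table, and 10-fold cross-validation is used as a stand-in since the slide does not restate the exact split.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // File name is an assumption; the ID column is assumed already removed
        Instances data = weka.core.converters.ConverterUtils.DataSource
                .read("breast-cancer-wisconsin.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new NaiveBayes(), new MultilayerPerceptron(),
                                new SMO(), new J48() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-22s MAE=%.2f  Acc=%.1f%%  Kappa=%.2f%n",
                    model.getClass().getSimpleName(),
                    eval.meanAbsoluteError(), eval.pctCorrect(), eval.kappa());
        }
    }
}
```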
  15. Classification Using a Decision Tree
     • Decision tree: WEKA J48 (C4.5)
     • Divide-and-conquer algorithm
     • Convert the tree to classification rules
     • J48 can handle numeric attributes, so no discretization is needed
     • Attribute selection: information gain
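     A minimal sketch of the J48 step, assuming `train` is the pre-processed training data; printing the model gives the textual tree from which the classification rules are read.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BuildTree {
    static J48 buildJ48(Instances train) throws Exception {
        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);  // default C4.5 pruning confidence
        tree.setMinNumObj(2);             // minimum instances per leaf
        tree.buildClassifier(train);
        System.out.println(tree);         // textual tree; one root-to-leaf path per rule
        return tree;
    }
}
```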
  16. Attributes Selected – Highest Information Gain
     Filter: weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

     Rank   Attribute                     Information Gain
     1      Uniformity of Cell Size       0.675
     2      Uniformity of Cell Shape      0.66
     3      Bare Nuclei                   0.564
     4      Bland Chromatin               0.543
     5      Single Epithelial Cell Size   0.505
     6      Normal Nucleoli               0.466
     7      Clump Thickness               0.459
     8      Marginal Adhesion             0.443
     9      Mitoses                       0.198

     Performance evaluation:
     Dataset               # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
     Attributes Selected   11        4%    97%              92%
     Missing Removed       11        5%    96%              90%
     Missing Replaced      14        7%    95%              89%
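     The filter string above, written out as Java; Ranker orders the attributes by information gain as in the table, and the commented-out setNumToSelect call is one way to keep only the top-ranked attributes.

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class RankByInfoGain {
    static Instances rank(Instances data) throws Exception {
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new InfoGainAttributeEval());   // -E on the slide
        Ranker ranker = new Ranker();                        // -S on the slide
        // ranker.setNumToSelect(8);  // e.g. drop the lowest-ranked attribute (Mitoses)
        select.setSearch(ranker);
        select.setInputFormat(data);
        return Filter.useFilter(data, select);  // attributes reordered by information gain
    }
}
```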
  17. The Decision Tree – IG / Attribute Selection Visualization
  18. Decision Tree – Problems and Concerns
     • Missing values
     • Pruning – pre-prune or post-prune
     • Estimating error rates
     • Unbalanced dataset – bias in prediction
     • Overfitting – in the test set
     • Underfitting
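     For the pruning concern, J48 exposes both flavours through its options; a sketch with illustrative parameter values, not the ones used in the project.

```java
import weka.classifiers.trees.J48;

public class PruningOptions {
    static J48 postPruned() {
        J48 tree = new J48();
        tree.setUnpruned(false);           // keep C4.5 post-pruning on
        tree.setConfidenceFactor(0.1f);    // smaller value => heavier pruning
        return tree;
    }

    static J48 reducedErrorPruned() {
        J48 tree = new J48();
        tree.setReducedErrorPruning(true); // prune against a held-out fold instead
        tree.setNumFolds(3);               // folds used for reduced-error pruning
        return tree;
    }
}
```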
  19. Confusion Matrix – Performance Evaluation

                        Predicted B (2)   Predicted M (4)
     Actual B (2)       TP                FN
     Actual M (4)       FP                TN

     • Overall accuracy rate = number of correct classifications / total number of classifications = (TP + TN) / (TP + TN + FP + FN)
     • Error rate = 1 – accuracy
     • Not a correct measure for an unbalanced dataset, where the classes are unequally represented
  20. Unbalanced Dataset Problem – Solution: Stratified Sampling Method
     • Partition the dataset by class
     • Random sampling process
     • Create training and test sets with the same class proportions
     • Test set data is independent of the training set
     • A standard verification technique giving the best error estimate
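     One way to realize such a stratified split in WEKA is the StratifiedRemoveFolds filter, which preserves class proportions in each fold; the 50/50 split below is an assumption for illustration, and the class index is assumed to be set on `data`.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class StratifiedSplit {
    static Instances[] split(Instances data) throws Exception {
        // Training set: everything except fold 1 of 2 (class proportions preserved)
        StratifiedRemoveFolds trainFilter = new StratifiedRemoveFolds();
        trainFilter.setNumFolds(2);
        trainFilter.setFold(1);
        trainFilter.setInvertSelection(true);
        trainFilter.setInputFormat(data);
        Instances train = Filter.useFilter(data, trainFilter);

        // Test set: fold 1 itself, independent of the training instances
        StratifiedRemoveFolds testFilter = new StratifiedRemoveFolds();
        testFilter.setNumFolds(2);
        testFilter.setFold(1);
        testFilter.setInvertSelection(false);
        testFilter.setInputFormat(data);
        Instances test = Filter.useFilter(data, testFilter);

        return new Instances[] { train, test };
    }
}
```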
  21. Stratified Sampling Method
  22. Performance Evaluation

     Dataset        # Instances   # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
     Training set   476           13        2%    99%              97%
     Testing set    412           13        3%    96%              92%
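     A sketch of how figures like these are produced: build J48 on the training set, then evaluate it on the independent test set and read accuracy, MAE and kappa off the Evaluation object.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class HoldOutEvaluation {
    static void evaluate(Instances train, Instances test) throws Exception {
        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);      // score against the independent test set

        System.out.println(eval.toSummaryString());   // accuracy, MAE, kappa, ...
        System.out.println(eval.toMatrixString());    // confusion matrix
        System.out.printf("MAE=%.2f  Acc=%.1f%%  Kappa=%.2f%n",
                eval.meanAbsoluteError(), eval.pctCorrect(), eval.kappa());
    }
}
```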
  23. Tree Visualization
  24. Unbalanced Dataset Problem – Solution: Cost Matrix
     • Cost-sensitive classification
     • The costs are not yet known; a complete financial analysis is needed, covering the cost of:
       – using the ML tool
       – gathering training data
       – using the model
       – determining the attributes for the test
     • Cross-validation once all costs are known
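     A hedged sketch of cost-sensitive classification with WEKA's CostSensitiveClassifier wrapped around J48; the cost values are placeholders, since the slide notes the real costs are not yet known, and the row/column meaning of the off-diagonal cells follows WEKA's CostMatrix convention.

```java
import weka.classifiers.CostMatrix;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CostSensitive {
    static CostSensitiveClassifier build(Instances train) throws Exception {
        CostMatrix costs = new CostMatrix(2);   // 2x2, initialized to zero cost
        // Off-diagonal cells penalize the two misclassification types unequally;
        // the values below are assumptions pending the financial analysis above.
        costs.setCell(0, 1, 1.0);
        costs.setCell(1, 0, 5.0);

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);
        csc.setMinimizeExpectedCost(false);     // reweight training data rather than predictions
        csc.buildClassifier(train);
        return csc;
    }
}
```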
  25. Future Direction
     • The overall accuracy of the classifier needs to be increased
     • Cluster-based stratified sampling: partition the original dataset using the k-means algorithm
     • Multiple-classifier models: bagging and boosting techniques
     • ROC (Receiver Operating Characteristic): plot the TP rate (y-axis) against the FP rate (x-axis)
       – Advantage: does not depend on class distribution or error costs
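     For the multiple-classifier direction, WEKA's meta package provides bagging and boosting wrappers that can be dropped in around J48; a sketch with an assumed 10 iterations each.

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;

public class EnsembleModels {
    static Bagging baggedJ48() {
        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());
        bagger.setNumIterations(10);   // 10 bootstrap replicates
        return bagger;
    }

    static AdaBoostM1 boostedJ48() {
        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());
        booster.setNumIterations(10);  // 10 boosting rounds
        return booster;
    }
}
```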
  26. ROC Curve – Visualization
     • Area under the curve (AUC)
     • The larger the area, the better the model
     • [ROC curves shown for the Benign class and the Malignant class]
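     The per-class AUC values behind curves like these can be read from the same Evaluation object used earlier; a minimal sketch, assuming class index 0 is Benign and 1 is Malignant as in the dataset description.

```java
import weka.classifiers.Evaluation;

public class RocArea {
    static void printAuc(Evaluation eval) {
        System.out.printf("AUC (Benign)    = %.3f%n", eval.areaUnderROC(0));
        System.out.printf("AUC (Malignant) = %.3f%n", eval.areaUnderROC(1));
    }
}
```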
  27. Questions / Comments – Thank You!
