Hinf6210 Project Classification Of Breast Cancer Dataset

Breast cancer treatment remains an unresolved challenge for medical practitioners. The key to better treatment is early diagnosis; however, even after early diagnosis and treatment, the chance of recurrence is high. Early prognosis therefore helps patients receive better treatment. Data mining, as a knowledge discovery field, can contribute to better prognosis through more accurate prediction. In this report, using the WEKA software, we show how to build a decision tree with an improved accuracy rate. Working with the Wisconsin Breast Cancer Database, collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals, we discuss how the decision tree data mining technique provides a better prediction tool.

Presentation Transcript

  • Classification of the Breast Cancer Dataset Using Decision Tree Induction. Abel Medhanie Gebreyesus and Sunil Nair. HINF6210 Project Presentation, November 25, 2008.
  • Agenda: Objective, Dataset, Approach, Classification Methods, Decision Tree, Problems, Future Direction.
  • Introduction: breast cancer prognosis. Breast cancer incidence is high, and improved diagnostic methods allow early diagnosis and treatment. But recurrence is also high, so good prognosis is important.
  • Objective: significance of the project, and previous work done using this dataset. Most previous work indicated room for improvement in classifier accuracy.
  • Breast Cancer Dataset: Wisconsin Breast Cancer Database (1991), University of Wisconsin Hospitals, Dr. William H. Wolberg.
    Number of instances: 699. Number of attributes: 10 plus the class attribute.
    Class distribution: Benign (2): 458 (65.5%); Malignant (4): 241 (34.5%).
    Missing values: 16.
  • Attributes: indicate cellular characteristics. Variables are continuous/ordinal with 10 levels.
    1. Sample code number (id number)
    2. Clump Thickness (1-10)
    3. Uniformity of Cell Size (1-10)
    4. Uniformity of Cell Shape (1-10)
    5. Marginal Adhesion (1-10)
    6. Single Epithelial Cell Size (1-10)
    7. Bare Nuclei (1-10)
    8. Bland Chromatin (1-10)
    9. Normal Nucleoli (1-10)
    10. Mitoses (1-10)
    11. Class: Benign (2), Malignant (4)
  • Attributes / class distribution: the dataset is unbalanced.
  • Our Approach: data pre-processing; comparison between classification techniques; decision tree induction (attribute selection, J48); evaluation.
  • Data Pre-processing: filter out the ID column; handle missing values (in WEKA).
  • Data pre-processing: two options to manage missing data in WEKA.
    1. ReplaceMissingValues (weka.filters.unsupervised.attribute.ReplaceMissingValues): missing numeric and nominal attributes are replaced with the mean and mode, respectively.
    2. Remove (delete) the tuples with missing values.
    All 16 missing values are in the Bare Nuclei attribute. Outliers should also be checked.
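The two missing-value strategies above can be sketched outside WEKA. This is a minimal illustration using scikit-learn's SimpleImputer on a toy stand-in for the Bare Nuclei column; the values are illustrative, not the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-in for the "Bare Nuclei" column (16 values are missing in
# the real dataset; here two NaNs stand in for them).
bare_nuclei = np.array([[1.0], [10.0], [np.nan], [4.0], [np.nan], [1.0]])

# Option 1: replace missing values with the column mean (WEKA's
# ReplaceMissingValues uses the mean for numeric attributes and the
# mode for nominal ones).
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(bare_nuclei)

# Option 2: delete the tuples that contain missing values.
removed = bare_nuclei[~np.isnan(bare_nuclei).any(axis=1)]

print(filled.ravel())    # the NaNs become the mean of [1, 10, 4, 1] = 4.0
print(removed.shape[0])  # 4 tuples survive deletion
```

Option 1 keeps all 699 instances at the price of invented values; option 2 keeps only observed data at the price of 16 lost tuples, which is the trade-off the comparison chart on the next slide evaluates.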
  • Comparison chart: handling missing values.
    Confusion matrix (test split; total correctly classified instances = 223; accuracy rate 95.78%):
        Actual \ Predicted | B   | M  | Total
        B                  | 160 | 7  | 167
        M                  | 3   | 63 | 66
        Total              | 163 | 70 | 233
    Performance evaluation:
        Dataset          | # Rules | MAE | Act. Acc. Rate | Exp. Acc. Rate
        Complete         | 14      | 8%  | 94%            | 87%
        Missing Removed  | 11      | 5%  | 96%            | 90%
        Missing Replaced | 14      | 7%  | 95%            | 89%
    How many predictions occur by chance? The Kappa statistic measures the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance (the expected accuracy rate).
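The Kappa statistic described on this slide can be checked directly from the confusion matrix. This small plain-Python sketch (not the WEKA implementation) computes observed agreement, chance agreement, and Kappa:

```python
def kappa(matrix):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Expected (chance) agreement from the row and column marginals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Rows = actual class (B, M), columns = predicted class (B, M),
# taken from the confusion matrix on this slide (223 of 233 correct).
cm = [[160, 7],
      [3, 63]]
print(round(kappa(cm), 2))  # about 0.90, in line with the reported figure
```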
  • Data Pre-processing: comparison of "Missing Value Replaced (mean/mode)" vs. "Missing Value Removed".
  • Agenda: Objective, Dataset, Approach, Data Pre-Processing, Classification Methods, Decision Tree, Problems, Future Direction.
  • Classification methods comparison. Performance evaluation (test set, 233 instances each):
        Classifier             | MAE | Act. Acc. Rate | Exp. Acc. Rate
        Naïve Bayes            | 4%  | 96%            | 90%
        Neural Network         | 10% | 91%            | 79%
        Support Vector Machine | 3%  | 97%            | 94%
        DT-J48                 | 4%  | 97%            | 92%
  • Classification using a Decision Tree: WEKA J48 (C4.5), a divide-and-conquer algorithm. The tree is converted to classification rules. J48 can handle numeric attributes, so no discretization is needed. Attribute selection uses information gain.
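As a sketch of the same idea outside WEKA: scikit-learn's decision tree with the entropy criterion chooses splits by information gain, much like C4.5, and the fitted tree can be dumped as nested rules like J48's output. The dataset here is a related breast-cancer set bundled with scikit-learn, standing in for the Wisconsin file used in the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A related breast-cancer dataset shipped with scikit-learn; it is NOT
# the 1991 Wisconsin file used in this project, just an illustration.
X, y = load_breast_cancer(return_X_y=True)

# criterion="entropy" selects splits by information gain, as C4.5 does.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Like J48, the fitted tree can be printed as nested classification rules.
rules = export_text(tree, max_depth=2)
print(rules)
```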
  • Attributes selected by highest information gain (weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker):
        Rank | Attribute                   | Information Gain
        1    | Uniformity of Cell Size     | 0.675
        2    | Uniformity of Cell Shape    | 0.66
        3    | Bare Nuclei                 | 0.564
        4    | Bland Chromatin             | 0.543
        5    | Single Epithelial Cell Size | 0.505
        6    | Normal Nucleoli             | 0.466
        7    | Clump Thickness             | 0.459
        8    | Marginal Adhesion           | 0.443
        9    | Mitoses                     | 0.198
    Performance evaluation:
        Dataset             | # Rules | MAE | Act. Acc. Rate | Exp. Acc. Rate
        Attributes Selected | 11      | 4%  | 97%            | 92%
        Missing Removed     | 11      | 5%  | 96%            | 90%
        Missing Replaced    | 14      | 7%  | 95%            | 89%
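The information-gain scores that InfoGainAttributeEval ranks by are just the class entropy minus the entropy remaining after splitting on the attribute. A minimal plain-Python sketch on toy data (not the project's attributes):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(values, labels):
    """Entropy of the class minus entropy remaining after the split."""
    total = len(labels)
    remainder = sum(
        values.count(v) / total
        * entropy([l for x, l in zip(values, labels) if x == v])
        for v in set(values)
    )
    return entropy(labels) - remainder

# Toy data: a perfectly predictive attribute earns the full class entropy.
labels = ["B", "B", "M", "M"]
print(info_gain([1, 1, 10, 10], labels))  # 1.0: fully informative
print(info_gain([1, 10, 1, 10], labels))  # 0.0: uninformative
```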
  • The decision tree after IG attribute selection: visualization.
  • Decision Tree problems and concerns: missing values; pruning (pre-pruning or post-pruning); estimating error rates; unbalanced dataset and the resulting bias in prediction; overfitting (showing up on the test set) and underfitting.
  • Confusion Matrix and performance evaluation. The overall accuracy rate is the number of correct classifications divided by the total number of classifications: Accuracy = (TP + TN) / (TP + TN + FP + FN); Error Rate = 1 - Accuracy.
        Actual \ Predicted | B (2) | M (4)
        B (2)              | TP    | FN
        M (4)              | FP    | TN
    Accuracy is not a correct measure for an unbalanced dataset, i.e., when the classes are unequally represented.
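Applying the slide's formulas to the confusion matrix reported earlier (TP=160, TN=63, FP=3, FN=7):

```python
# Accuracy and error rate as defined on this slide.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=160, tn=63, fp=3, fn=7)
print(round(acc, 4))      # 0.9571
print(round(1 - acc, 4))  # error rate: 0.0429
```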
  • Unbalanced dataset problem. Solution: the stratified sampling method. Partition the dataset by class, then randomly sample to create training and test sets with balanced class representation in each; the test set data is kept independent of the training set. A standard verification technique that gives the best error estimate.
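The stratified split above can be sketched with scikit-learn, which partitions by class under the hood when `stratify` is set. The labels below only mimic the dataset's 458/241 class counts; they are not the real instances.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels mimicking the dataset's class distribution:
# 458 benign (2) and 241 malignant (4).
y = np.array([2] * 458 + [4] * 241)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# stratify=y makes both partitions keep the 65.5% / 34.5% class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)
print(np.mean(y_tr == 2))  # close to 0.655
print(np.mean(y_te == 2))  # close to 0.655
```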
  • Stratified Sampling Method.
  • Performance evaluation after stratified sampling:
        Dataset      | # Instances | # Rules | MAE | Act. Acc. Rate | Exp. Acc. Rate
        Training set | 476         | 13      | 2%  | 99%            | 97%
        Testing set  | 412         | 13      | 3%  | 96%            | 92%
  • Tree Visualization.
  • Unbalanced dataset problem. Solution: a cost matrix (cost-sensitive classification). The costs are not known; a complete financial analysis is needed, i.e., the cost of using the ML tool, gathering training data, using the model, and determining the attributes for each test. Cross-validation can be applied once all costs are known.
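Once costs are known, evaluation weights each confusion-matrix cell by its misclassification cost instead of counting raw errors. The cost values below are illustrative only; as the slide notes, the real ones must come from a financial analysis.

```python
import numpy as np

# Confusion matrix from the earlier evaluation:
# rows = actual (B, M), columns = predicted (B, M).
confusion = np.array([[160, 7],
                      [3, 63]])

# Illustrative cost matrix (NOT from the slides): correct predictions
# cost 0, a false alarm costs 1, a missed malignancy costs 10.
cost = np.array([[0, 1],
                 [10, 0]])

total_cost = int((confusion * cost).sum())
print(total_cost)  # 37 = 7 false alarms x 1 + 3 missed malignancies x 10
```

Under this weighting, two classifiers with identical accuracy can have very different total costs, which is the point of cost-sensitive classification.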
  • Future direction: the overall accuracy of the classifier needs to be increased.
    Cluster-based stratified sampling: partitioning the original dataset using the k-means algorithm.
    Multiple-classifier models: bagging and boosting techniques.
    ROC (Receiver Operating Characteristic) analysis: plotting the TP rate (y-axis) against the FP rate (x-axis). Advantage: insensitive to class distribution and error costs.
  • ROC curve visualization: the larger the area under the curve (AUC), the better the model. Curves shown for the Benign and the Malignant class.
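The ROC/AUC idea above can be sketched with scikit-learn on toy scores (the labels and probabilities below are illustrative, not the project's results):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels and malignancy scores: 0 = benign, 1 = malignant.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.7, 0.8, 0.9])

# The ROC curve plots the TP rate against the FP rate as the decision
# threshold sweeps over the scores.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC summarizes the curve: the larger the area, the better the model
# ranks malignant cases above benign ones.
print(roc_auc_score(y_true, scores))  # 0.84375 on this toy data
```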
  • Questions / Comments. Thank you!