A Novel Approach for Breast Cancer
Detection using
Data Mining Techniques
Presented by:
• Ahmed Abd Elhafeez
15/7/2014 AAST-Comp eng
AGENDA
 Scientific and Medical Background
1. What is cancer?
2. Breast cancer
3. History and Background
4. Pattern recognition system decomposition
5. About data mining
6. Data mining tools
7. Classification Techniques
AGENDA (Cont.)
 Paper contents
1. Introduction
2. Related Work
3. Classification Techniques
4. Experiments and Results
5. Conclusion
6. References
What Is Cancer?
 Cancer is a term used for diseases in which abnormal
cells divide without control and are able to invade
other tissues. Cancer cells can spread to other parts of
the body through the blood and lymph systems.
 Cancer is not just one disease but many diseases.
There are more than 100 different types of cancer.
 Most cancers are named for the organ or type of cell in
which they start
 There are two general types of cancer tumours namely:
• benign
• malignant
[Figure: types of cancer, including skin, breast, colon, lung, pancreatic, liver, bladder, prostate, kidney, thyroid, leukemia, endometrial, rectal, non-Hodgkin lymphoma, cervical, and oral cancer]
Breast Cancer
• The second leading cause of death among
women is breast cancer, as it comes
directly after lung cancer.
• Breast cancer is considered the most
common invasive cancer in women, with
more than one million cases and nearly
600,000 deaths occurring worldwide
annually.
• Breast cancer tops the list of cancers in
Egypt, with 42 cases per 100,000 of the
population. However, 80% of the cases of
breast cancer in Egypt are of the benign
kind.
History and Background
Medical Prognosis is the estimation of :
• Cure
• Complication
• disease recurrence
• Survival
for a patient or group of patients after treatment.
Breast Cancer Classification
• Round, well-defined, larger groups are more likely benign.
• A tight cluster of tiny, irregularly shaped groups may indicate cancer (malignant).
• Suspicious pixel groups show up as white spots on a mammogram.
Breast cancer’s Features
• MRI: cancer can have a unique appearance. Features that turned out
to indicate cancer are used for diagnosis / prognosis of each cell nucleus.
[Figure: magnetic resonance image → feature extraction → features F1, F2, F3, ..., Fn]
Diagnosis or prognosis
Breast cancer: benign or malignant
Computer-Aided Diagnosis
• Mammography allows for efficient diagnosis of
breast cancers at an earlier stage
• Radiologists misdiagnose 10-30% of the malignant
cases
• Of the cases sent for surgical biopsy, only 10-20%
are actually malignant
Computational Intelligence
Data + Knowledge
Artificial Intelligence
• Expert systems
• Fuzzy logic
• Pattern recognition
• Machine learning
• Probabilistic methods
• Multivariate statistics
• Visualization
• Evolutionary algorithms
• Neural networks
What do these methods do?
• Provide non-parametric models of data.
• Classify new data into pre-defined
categories, supporting diagnosis and
prognosis.
• Discover new categories.
• Help to understand the data by creating fuzzy
or crisp logical rules.
• Help to visualize multi-dimensional
relationships among data samples.
[Figure: workflow from the dataset through data mining tool selection, data preprocessing, feature selection, and the classification algorithm (SMO, IBK, BF Tree) to results and evaluations]
Pattern recognition system decomposition
[Figure: performance evaluation cycle with the stages dataset, data mining tool selection, data preprocessing, feature selection, classification, and results]
Data Mining
• Data mining is a set of techniques used
in various domains to give meaning to
the available data
• Objective: Fit data to a model
–Descriptive
–Predictive
Predictive & descriptive data mining
• Predictive:
Is the process of automatically creating a classification
model from a set of examples, called the training set,
which belongs to a set of classes.
Once a model is created, it can be used to automatically
predict the class of other unclassified examples
• Descriptive :
Is to describe the general or special features of a set of
data in a concise manner
Data Mining Models and Tasks
Data mining Tools
Many advanced tools for data mining are
available either as open-source or commercial
software.
Weka
• Waikato Environment for Knowledge Analysis
• Weka is a collection of machine learning algorithms for
data mining tasks. The algorithms can either be applied
directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing,
classification, regression, clustering, association rules,
and visualization. It is also well-suited for developing
new machine learning schemes.
• Found only on the islands of New Zealand, the Weka is
a flightless bird with an inquisitive nature.
Data Preprocessing
• Data in the real world is :
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data.
Data quality measures:
Accuracy, Completeness, Consistency, Timeliness, Believability,
Value added and Accessibility
Preprocessing techniques
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
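As a rough illustration of three of the techniques listed above (cleaning, transformation, discretization), the following sketch works on a single numeric column. The function names and the sample values are hypothetical and chosen only for illustration; mean imputation and equal-width binning are just one possible choice for each step.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins):
    """Data discretization: equal-width bins for values already in [0, 1]."""
    return [min(int(v * n_bins), n_bins - 1) for v in values]


# Hypothetical attribute column with one missing value.
raw = [1, 10, None, 3, 6]
cleaned = fill_missing(raw)          # missing entry replaced by the mean (5)
scaled = min_max_normalize(cleaned)  # smallest value maps to 0, largest to 1
bins = discretize(scaled, 3)         # three equal-width intervals
```

Each step takes the output of the previous one, which mirrors the order in which the techniques are usually applied.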
Feature selection
Finding a feature subset that has the most
discriminative information from the original
feature space.
The objectives of feature selection are:
• Improving the prediction performance of the
predictors
• Providing faster and more cost-effective
predictors
• Providing a better understanding of the underlying
process that generated the data
Feature Selection
• Transforming a dataset by removing some
of its columns
A1 A2 A3 A4 C → A2 A4 C
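A minimal sketch of the column removal shown above, assuming the dataset is held as a list of rows with a separate header. The helper name `select_columns` and the sample rows are hypothetical.

```python
def select_columns(rows, header, keep):
    """Return the rows restricted to the columns named in `keep`."""
    indices = [header.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]


header = ["A1", "A2", "A3", "A4", "C"]
# Two made-up instances; C is the class column.
rows = [
    [1, 2, 3, 4, "benign"],
    [5, 6, 7, 8, "malignant"],
]

# Keep only A2, A4 and the class column, as in the A1..A4 example.
reduced = select_columns(rows, header, ["A2", "A4", "C"])
```

Which columns to keep is decided separately, for example by ranking attributes with a scoring measure.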
Supervised Learning
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the model built on a training set with
known categories
Category "A"
Category "B"
Classification (Recognition)
(Supervised Classification)
Classification
• Every day, all the time, we classify
things.
• E.g., crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: Safe to walk or not!!!
Classification vs. Prediction
 Classification:
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
 Prediction:
 models continuous-valued functions, i.e., predicts
unknown or missing values
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set, otherwise over-fitting
will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Classification Process (1): Model
Construction
Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no
Classification Algorithms → Classifier (Model):
IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
Classification Process (2): Use the
Model in Prediction
Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data:
(Jeff, Professor, 4)
Tenured?
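The two-step process can be sketched by encoding the learned rule (rank = professor OR years > 6) as a function, estimating its accuracy on the testing data, and then applying it to the unseen tuple. The function name `predict_tenured` is hypothetical; the data are the testing tuples from the slide.

```python
def predict_tenured(rank, years):
    """The classifier (model) from step 1, written as a rule."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 2a: compare the known labels with the model's output.
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in test_set
)
accuracy = correct / len(test_set)

# Step 2b: classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
```

Under this rule Merlisa is the only misclassified test tuple (7 years but not tenured), so the accuracy on the testing data is 3 out of 4, and Jeff is predicted as tenured.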
Classification
• Classification is a data mining (machine learning) technique used to
predict group membership for data instances.
• Classification analysis is the organization of data into
given classes.
• These approaches normally use a training set where
all objects are already associated with known class
labels.
• The classification algorithm learns from the training
set and builds a model.
• Many classification models are used to classify new
objects.
Classification
• predicts categorical class labels (discrete or
nominal)
• constructs a model based on the training set
and the values (class labels) in a classifying
attribute and uses it in classifying unseen
data
Quality of a classifier
• Quality can be assessed with respect to computing time
(lower is better).
• The quality of a model can be described by a confusion
matrix.
• The confusion matrix shows the predictive ability of the
method on new entries.
• Each row of the matrix represents the instances in a
predicted class, while each column represents the
instances in an actual class.
• Thus the diagonal elements represent correctly
classified instances,
• and the off-diagonal elements represent misclassified
instances.
Classification Techniques
 Building accurate and efficient classifiers for
large databases is one of the essential tasks of
data mining and machine learning research
 The ultimate reason for doing classification is to
increase understanding of the domain or to
improve predictions compared to unclassified
data.
Classification Techniques
Classification techniques: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK
Classification Model
Support vector machine
Classifier
V. Vapnik
Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine which has
been extensively used as a tool for data
classification, function approximation, etc.
 Due to its generalization ability, it has found a
great deal of success in many applications.
 Unlike traditional methods, which minimize the
empirical training error, a noteworthy feature of SVM
is that it minimizes an upper bound of the
generalization error by maximizing the margin
between the separating hyper-plane and the data set
Tennis example
[Figure: scatter plot of humidity against temperature with two classes: play tennis, do not play tennis]
Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane,
but not the optimal one
• Support Vector Machine (SVM) finds an
optimal solution.
– Maximizes the distance between the
hyperplane and the “difficult points” close to
decision boundary
– One intuition: if there are no points near the
decision surface, then there are no very
uncertain classification decisions
This line represents the decision boundary: ax + by − c = 0
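To make the "which hyperplane" question concrete, the sketch below measures the margin of a candidate line ax + by − c = 0 as the smallest distance from the line to any training point. The points and the two candidate lines are made up for illustration, and this only compares given lines; it does not solve the SVM optimization problem.

```python
import math


def signed_distance(a, b, c, x, y):
    """Signed distance from point (x, y) to the line ax + by - c = 0."""
    return (a * x + b * y - c) / math.hypot(a, b)


def margin(a, b, c, points):
    """Distance from the line to the closest point."""
    return min(abs(signed_distance(a, b, c, x, y)) for x, y in points)


# Hypothetical 2-D data: the first two points belong to one class,
# the last two to the other.
points = [(0, 0), (1, 0), (3, 4), (4, 4)]

# Both lines separate the two classes, but with different margins.
margin_vertical = margin(1, 0, 2, points)  # line x = 2
margin_diagonal = margin(1, 1, 4, points)  # line x + y = 4
```

Here the diagonal line leaves more room to the closest points than the vertical one, which is the property an SVM looks for when it picks the maximum-margin hyperplane.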
Selection of a Good Hyper-Plane
Objective: Select a `good' hyper-plane using
only the data!
Intuition:
(Vapnik 1965) - assuming linear separability
(i) Separate the data
(ii) Place hyper-plane `far' from data
SVM – Support Vector Machines
[Figure: support vectors for a small-margin and a large-margin separating hyperplane]
Support Vector Machine (SVM)
• SVMs maximize the margin around
the separating hyperplane.
• The decision function is fully
specified by a subset of training
samples, the support vectors.
• Solving SVMs is a quadratic
programming problem
• Seen by many as the most
successful current text
classification method
[Figure: support vectors; the hyperplane that maximizes the margin compared with one that gives a narrower margin]
Non-Separable Case
The Lagrangian trick
SVM
 SVM
 Relatively new concept
 Nice Generalization properties
 Hard to learn – learned in batch mode using
quadratic programming techniques
 Using kernels can learn very complex functions
Classification Model
K-Nearest Neighbor
Classifier
K-Nearest Neighbor Classifier
Learning by analogy:
Tell me who your friends are and I’ll tell
you who you are
A new example is assigned to the most
common class among the (K) examples
that are most similar to it.
K-Nearest Neighbor Algorithm
 To determine the class of a new example E:
 Calculate the distance between E and all examples in
the training set
 Select K-nearest examples to E in the training set
 Assign E to the most common class among its K-
nearest neighbors
[Figure: a new example surrounded by neighbors labeled Response or No response; it is assigned the class Response]
Distance Between Neighbors
 Each example is represented with a set of numerical
attributes
 "Closeness" is defined in terms of the Euclidean distance
between two examples.
 The Euclidean distance between X = (x1, x2, x3, ..., xn) and
Y = (y1, y2, y3, ..., yn) is defined as:
$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
John: Age=35, Income=95K, No. of credit cards=3
Rachel: Age=41, Income=215K, No. of credit cards=2
 Distance(John, Rachel) = sqrt[(35-41)^2 + (95K-215K)^2 + (3-2)^2]
Instance Based Learning
 No model is built: Store all training examples
 Any processing is delayed until a new instance must be
classified.
[Figure: stored training examples labeled Response or No response; the new instance is assigned the class Response]
Example : 3-Nearest Neighbors
Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?
Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.17
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.24
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.75
David     37   50          2          Yes
The three nearest neighbors of David are Rachel (Yes), John (No) and Nellie (Yes), so David is classified as Yes.
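The 3-nearest-neighbor example can be reproduced with a short sketch. It uses the customer table from the slides with income in thousands; the function names are hypothetical, and ties in distance or in the vote are not handled specially.

```python
import math
from collections import Counter

# (name, age, income in K, number of credit cards, response)
training = [
    ("John", 35, 35, 3, "No"),
    ("Rachel", 22, 50, 2, "Yes"),
    ("Hannah", 63, 200, 1, "No"),
    ("Tom", 59, 170, 1, "No"),
    ("Nellie", 25, 40, 4, "Yes"),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(training, query, k):
    """Return the k training rows closest to the query attributes."""
    return sorted(training, key=lambda row: euclidean(row[1:4], query))[:k]


def knn_predict(training, query, k):
    """Assign the most common class among the k nearest neighbors."""
    votes = Counter(row[4] for row in nearest_neighbors(training, query, k))
    return votes.most_common(1)[0][0]


david = (37, 50, 2)
neighbors = [row[0] for row in nearest_neighbors(training, david, 3)]
prediction = knn_predict(training, david, 3)
```

Because the attributes are used on their raw scales, income differences dominate the distance; rescaling the attributes first is a common remedy, although the slides do not apply it here.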
Strengths and Weaknesses
Strengths:
 Simple to implement and use
 Comprehensible – easy to explain prediction
 Robust to noisy data by averaging k-nearest neighbors.
Weaknesses:
 Need a lot of space to store all examples.
 Takes more time to classify a new example than with a
model (need to calculate and compare distance from
new example to all other examples).
Decision Tree
– Decision tree induction is a simple but powerful
learning paradigm. In this method a set of training
examples is broken down into smaller and smaller
subsets while at the same time an associated decision
tree gets incrementally developed. At the end of the
learning process, a decision tree covering the training
set is returned.
– The decision tree can be thought of as a set of sentences
written in propositional logic.
Example
Jenny Lind is a writer of romance novels. A movie
company and a TV network both want exclusive
rights to one of her more popular works. If she signs
with the network, she will receive a single lump sum,
but if she signs with the movie company, the amount
she will receive depends on the market response to
her movie. What should she do?
Payouts and Probabilities
• Movie company Payouts
– Small box office - $200,000
– Medium box office - $1,000,000
– Large box office - $3,000,000
• TV Network Payout
– Flat rate - $900,000
• Probabilities
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1
Jenny Lind - Payoff Table
                         States of Nature
Decisions                Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior Probabilities      0.3               0.6                0.1
Using Expected Return Criteria
EVmovie=0.3(200,000)+0.6(1,000,000)+0.1(3,000,000)
= $960,000 = EVUII or EVBest
EVtv =0.3(900,000)+0.6(900,000)+0.1(900,000)
= $900,000
Therefore, using this criterion, Jenny should select the movie
contract.
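The expected-return calculation can be checked with a few lines of code. The payoffs and probabilities are the ones given in the payoff table; the dictionary layout and names are illustrative only.

```python
# Prior probabilities of the states of nature.
probabilities = {"small": 0.3, "medium": 0.6, "large": 0.1}

# Payoff of each decision under each state of nature (in dollars).
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv": {"small": 900_000, "medium": 900_000, "large": 900_000},
}


def expected_value(payoff):
    """Probability-weighted sum of the payoffs over all states."""
    return sum(probabilities[state] * payoff[state] for state in probabilities)


expected = {decision: expected_value(p) for decision, p in payoffs.items()}
best = max(expected, key=expected.get)
```

The movie contract has the higher expected value (about $960,000 against $900,000), although it also carries a 30% chance of paying only $200,000.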
Decision Trees
• Three types of “nodes”
– Decision nodes - represented by squares (□)
– Chance nodes - represented by circles (Ο)
– Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best
decisions at decision nodes, and finding expected values
of all possible states of nature at chance nodes
• Create the tree from left to right
• Solve the tree from right to left
Example Decision Tree
[Figure: a decision node leading to a chance node with three branches: Event 1, Event 2, Event 3]
Jenny Lind Decision Tree
Sign with Movie Co.
  Small Box Office: $200,000
  Medium Box Office: $1,000,000
  Large Box Office: $3,000,000
Sign with TV Network
  Small Box Office: $900,000
  Medium Box Office: $900,000
  Large Box Office: $900,000
Jenny Lind Decision Tree
Decision node: ER = ?
Sign with Movie Co. (chance node, ER = ?)
  Small Box Office (0.3): $200,000
  Medium Box Office (0.6): $1,000,000
  Large Box Office (0.1): $3,000,000
Sign with TV Network (chance node, ER = ?)
  Small Box Office (0.3): $900,000
  Medium Box Office (0.6): $900,000
  Large Box Office (0.1): $900,000
Jenny Lind Decision Tree - Solved
Decision node: ER = 960,000
Sign with Movie Co. (chance node, ER = 960,000)
  Small Box Office (0.3): $200,000
  Medium Box Office (0.6): $1,000,000
  Large Box Office (0.1): $3,000,000
Sign with TV Network (chance node, ER = 900,000)
  Small Box Office (0.3): $900,000
  Medium Box Office (0.6): $900,000
  Large Box Office (0.1): $900,000
Evaluation Metrics
                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn
Cross-validation
• Correctly Classified Instances 143 95.3%
• Incorrectly Classified Instances 7 4.67 %
• Default 10-fold cross validation i.e.
– Split data into 10 equal sized pieces
– Train on 9 pieces and test on remainder
– Do for all possibilities and average
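The splitting scheme described above can be sketched as follows. This is a plain k-fold split over instance indices with hypothetical function names; it does no shuffling or stratification, which real tools typically add, and the `evaluate` callback stands in for training and testing a classifier.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(n, k, evaluate):
    """Average the score returned by `evaluate(train, test)` over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


# 150 instances, as in the example counts above (143 + 7).
folds = list(k_fold_indices(150, 10))

# Placeholder evaluation that only reports the test fraction of each fold.
test_fraction = cross_validate(150, 10, lambda train, test: len(test) / 150)
```

Every instance is used for testing exactly once and for training in the other nine rounds.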
A Novel Approach for Breast Cancer
Detection using Data Mining Techniques
Abstract
 The aim of this paper is to investigate the
performance of different classification techniques.
 The goal is to develop accurate prediction models for
breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka
software and the results are reported.
 Sequential Minimal Optimization (SMO) has
higher prediction accuracy than the IBK and BF Tree
methods.
Introduction
 Breast cancer is on the rise across developing nations
 due to the increase in life expectancy and lifestyle
changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the
chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back
Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race : white or black
 Dense breast tissue: women with denser breast tissue have a
higher risk
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods
Risk factors
 Breast radiation early in life
 Treatment with DES : the drug DES (diethylstilbestrol)
during pregnancy
 Not having children or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese
BACKGROUND
 Bittern et al. used an artificial neural network to
predict the survivability for breast cancer
patients. They tested their approach on a limited
data set, but their results show good
agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF
Network and Simple Logistic to predict the survivability
for breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the
C5 algorithm with bagging to predict breast cancer
survivability.
BACKGROUND
 Bellaachi et al. used naive bayes, decision
tree and back-propagation neural network to
predict the survivability in breast cancer
patients. Although they reached good results
(about 90% accuracy), their results were not
significant due to the fact that they divided the
data set to two groups; one for the patients
who survived more than 5 years and the other
for those patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48
Decision Tree to predict the survivability of
heart disease patients.
BACKGROUND
 Vikas Chaurasia et al. used CART (Classification
and Regression Tree), ID3 (Iterative Dichotomiser
3) and decision table (DT) to predict the
survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to
identify abnormal high-frequency
electrocardiographs using the decision tree algorithm
C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based
ensemble method combined with a backward-elimination
feature selection strategy to
find structure-activity relationships in the area
of chemometrics related to the pharmaceutical
industry.
BACKGROUND
 Dr. S. Vijayarani et al. analysed the
performance of different classification function
techniques in data mining for predicting
heart disease from the heart disease dataset.
The classification function algorithms were used
and tested in this work. The performance
factors used for analyzing the efficiency of the
algorithms are clustering accuracy and error
rate. The results show that the logistic
classification function is more efficient than the
multilayer perceptron and sequential minimal
optimization.
BACKGROUND
 Kaewchinporn C. presented a new
classification algorithm, TBWC, a combination of
decision tree with bagging and clustering.
This algorithm was tested on two
medical datasets, cardiotocography1 and
cardiotocography2, and on other datasets not
related to the medical domain.
 B. S. Harish et al. presented various text
representation schemes and compared
different classifiers used to classify text
documents into predefined classes. The
existing methods are compared and
contrasted based on various parameters.
BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Source: the UC Irvine Machine Learning Repository
 Data from University of Wisconsin Hospital, Madison,
collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign), and 9 integer-valued attributes
 The breast-cancer-wisconsin data set has 699 instances
 We removed the 16 instances with missing values
from the dataset to construct a new dataset with 683
instances
 Class distribution of the original 699 instances: benign 458 (65.5%),
malignant 241 (34.5%)
 Note: 2 malignant and 14 benign instances were excluded, so these
percentages do not apply to the reduced dataset; the correct
distribution is benign 444 (65%) and malignant 239 (35%)
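Dropping the instances with missing values could look roughly like the sketch below. It assumes the comma-separated layout of the repository file (sample id, nine attributes, class) with "?" marking a missing value; the three sample rows are made up for illustration and are not taken from the dataset. The last lines recompute the corrected class shares from the counts given above.

```python
def load_complete_rows(lines):
    """Parse comma-separated rows and skip any row containing '?'."""
    rows = []
    for line in lines:
        fields = line.strip().split(",")
        if "?" in fields:
            continue  # instance with a missing attribute value
        rows.append([int(field) for field in fields])
    return rows


# Hypothetical rows: id, nine attributes in 1..10, class (2 or 4).
sample = [
    "1000001,5,1,1,1,2,1,3,1,1,2",
    "1000002,8,10,10,8,7,10,9,7,1,4",
    "1000003,4,1,1,3,2,?,3,1,1,2",
]
rows = load_complete_rows(sample)

# Class shares after removing 14 benign and 2 malignant instances.
benign, malignant = 458 - 14, 241 - 2
benign_share = round(100 * benign / (benign + malignant))
malignant_share = round(100 * malignant / (benign + malignant))
```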
Attribute                    Domain
Sample Code Number           Id Number
Clump Thickness              1 - 10
Uniformity Of Cell Size      1 - 10
Uniformity Of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant
EVALUATION METHODS
 We used Weka (Waikato Environment for
Knowledge Analysis), version 3.6.9.
 WEKA is a collection of machine learning algorithms
for data mining tasks.
 The algorithms can either be applied directly to a
dataset or called from your own Java code.
 WEKA contains tools for data preprocessing,
classification, regression, clustering, association
rules, visualization and feature selection.
 It is also well suited for developing new machine
learning schemes.
 WEKA is open source software issued under the
GNU General Public License
EXPERIMENTAL RESULTS
importance of the input variables
Domain                       1     2     3     4     5     6     7     8     9     10    Sum
Clump Thickness              139   50    104   79    128   33    23    44    14    69    683
Uniformity of Cell Size      373   45    52    38    30    25    19    28    6     67    683
Uniformity of Cell Shape     346   58    53    43    32    29    30    27    7     58    683
Marginal Adhesion            393   58    58    33    23    21    13    25    4     55    683
Single Epithelial Cell Size  44    376   71    48    39    40    11    21    2     31    683
Bare Nuclei                  402   30    28    19    30    4     8     21    9     132   683
Bland Chromatin              150   160   161   39    34    9     71    28    11    20    683
Normal Nucleoli              432   36    42    18    19    22    16    23    15    60    683
Mitoses                      563   35    33    12    6     3     9     8     0     14    683
Sum                          2843  850   605   333   346   192   207   233   77    516
EXPERIMENTAL RESULTS
Evaluation Criteria               BF Tree  IBK    SMO
Time to Build Model (sec)         0.97     0.02   0.33
Correctly Classified Instances    652      655    657
Incorrectly Classified Instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19
EXPERIMENTAL RESULTS
 The sensitivity or the true positive rate (TPR) is
defined by TP / (TP + FN)
 the specificity or the true negative rate (TNR) is
defined by TN / (TN + FP)
 the accuracy is defined by (TP + TN) / (TP + FP + TN
+ FN).
 True positive (TP) = number of positive samples
correctly predicted.
 False negative (FN) = number of positive samples
wrongly predicted.
 False positive (FP) = number of negative samples
wrongly predicted as positive.
 True negative (TN) = number of negative samples
correctly predicted
EXPERIMENTAL RESULTS
Classifier  TP Rate  FP Rate  Precision  Recall  Class
BF Tree     0.971    0.075    0.96       0.971   Benign
            0.925    0.029    0.944      0.925   Malignant
IBK         0.98     0.079    0.958      0.98    Benign
            0.921    0.02     0.961      0.921   Malignant
SMO         0.971    0.054    0.971      0.971   Benign
            0.946    0.029    0.946      0.946   Malignant
EXPERIMENTAL RESULTS
Confusion matrices (rows: actual class; columns: predicted class)
Classifier  Predicted Benign  Predicted Malignant  Actual Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant
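The definitions of sensitivity, specificity and accuracy can be applied directly to the SMO confusion matrix above. The sketch assumes benign is treated as the positive class, which is a labeling choice made here for illustration; the function name is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy from four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# SMO confusion matrix with benign as the positive class:
# 431 benign predicted benign, 13 benign predicted malignant,
# 13 malignant predicted benign, 226 malignant predicted malignant.
smo = binary_metrics(tp=431, fn=13, fp=13, tn=226)

sensitivity = round(smo["sensitivity"], 3)
specificity = round(smo["specificity"], 3)
accuracy_percent = round(100 * smo["accuracy"], 2)
```

The rounded values agree with the reported SMO figures: a TP rate of 0.971 for benign, 0.946 for malignant, and an accuracy of 96.19%.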
Importance of the input variables
Variable                     Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance
Clump Thickness              378.08158    0.464      0.152       126.232526    8
Uniformity of Cell Size      539.79308    0.702      0.3         180.265026    1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323    2
Marginal Adhesion            390.0595     0.464      0.21        130.2445      7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
Bland Chromatin              453.20971    0.555      0.201       151.321903    4
Normal Nucleoli              416.63061    0.487      0.237       139.118203    6
Mitoses                      191.9682     0.212      0.212       64.122733     9
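One of the ranking measures in the table, information gain, can be sketched for a discrete attribute as the class entropy minus the weighted entropy after splitting on the attribute. The toy data below are made up to show the two extremes and are not taken from the dataset; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [l for v, l in zip(values, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy labels: two benign (B) and two malignant (M) instances.
labels = ["B", "B", "M", "M"]
perfect = [1, 1, 10, 10]   # attribute value determines the class
useless = [1, 10, 1, 10]   # attribute value carries no class information

gain_perfect = information_gain(perfect, labels)
gain_useless = information_gain(useless, labels)
```

An attribute that fully determines the class recovers the whole class entropy (1 bit here), while an uninformative attribute has a gain of zero.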
CONCLUSION.
 The accuracy of classification techniques is evaluated
based on the selected classifier algorithm.
 We used three popular data mining methods:
Sequential Minimal Optimization (SMO), IBK and BF
Tree.
 SMO achieved the highest accuracy
compared with the other classifiers.
 The most important attribute for breast cancer survival
is Uniformity of Cell Size.
Future work
 using updated version of weka
 Using another data mining tool
 Using alternative algorithms and techniques
Notes on paper
 Spelling mistakes
 No point of contact (e - mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions
comparison
 "Breast Cancer Diagnosis on Three Different
Datasets Using Multi-Classifiers"
 International Journal of Computer and Information
Technology (2277 – 0764), Volume 01, Issue 01,
September 2012
 That paper introduced a more advanced idea and made a
fusion between classifiers
References
101AAST-Comp eng
[1] U.S. Cancer Statistics Working Group. United States Cancer
Statistics: 1999–2008 Incidence and Mortality Web-based Report.
Atlanta (GA): Department of Health and Human Services, Centers for
Disease Control
[2] Lyon IAfRoC: World Cancer Report. International Agency for Research on
Cancer Press 2003:188-193.
[3] Elattar, Inas. “Breast Cancer: Magnitude of the Problem”,Egyptian Society
of Surgical Oncology Conference, Taba,Sinai, in Egypt (30 March – 1
April 2005).
[2] S. Aruna, Dr S.P. Rajagopalan and L.V. Nandakishore (2011).
Knowledge based analysis of various statistical tools in detecting
breast cancer.
[3] Angeline Christobel. Y, Dr. Sivaprakasam (2011). An Empirical
Comparison of Data Mining Classification Methods. International
Journal of Computer Information Systems,Vol. 3, No. 2, 2011.
[4] D.Lavanya, Dr.K.Usha Rani,..,” Analysis of feature selection with
classification: Breast cancer datasets”,Indian Journal of Computer
Science and Engineering (IJCSE),October 2011.
5/7/2014
AAST-Comp eng 102
[5] E.Osuna, R.Freund, and F. Girosi, “Training support vector
machines:Application to face detection”. Proceedings of computer vision and
pattern recognition, Puerto Rico pp. 130–136.1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009).Approach of
Neural Network to Diagnose Breast Cancer on three different Data Set. 2009
International Conference on Advances in Recent Technologies in
Communication and Computing.
[7] D. Lavanya, “Ensemble Decision Tree Classifier for Breast Cancer Data,”
International Journal of Information Technology Convergence and Services,
vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B.Ster, and A.Dobnikar, “Neural networks in medical diagnosis:
Comparison with other methods.” Proceedings of the international
conference on engineering applications of neural networks pp. 427–
430. 1996.
5/7/2014
[9] T. Joachims, Transductive inference for text classification using support
vector machines. Proceedings of international conference machine learning.
Slovenia. 1999.
[10] J. Abonyi and F. Szeifert, “Supervised fuzzy clustering for the
identification of fuzzy classifiers.” Pattern Recognition Letters, vol. 24(14),
2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey,
Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and
prognosis from fine needle aspirates, Western Surgical Association meeting in
Palm Desert, California, November 14, 1994.
[13] Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for
breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on
Electronic Imaging 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), Feature Selection and Classification
using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305–313.
[15] J. Han and M. Kamber, “Data Mining Concepts and Techniques”, Morgan
Kaufmann Publishers, 2000.
[16] Bishop, C.M.: “Neural Networks for Pattern Recognition”. Oxford
University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag,
New York, 1995.
[18] Ross Quinlan, (1993) C4.5: Programs for Machine Learning, Morgan
Kaufmann Publishers, San Mateo, CA.

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

  • 1. A Novel Approach for Breast Cancer Detection using Data Mining Techniques. Presented by: Ahmed Abd Elhafeez, AAST-Comp eng, 5/7/2014
  • 2. AGENDA: Scientific and Medical Background
      1. What is cancer?
      2. Breast cancer
      3. History and Background
      4. Pattern recognition system decomposition
      5. About data mining
      6. Data mining tools
      7. Classification Techniques
  • 3. AGENDA (Cont.): Paper contents
      1. Introduction
      2. Related Work
      3. Classification Techniques
      4. Experiments and Results
      5. Conclusion
      6. References
  • 4. What Is Cancer?
      – Cancer is a term used for diseases in which abnormal cells divide without control and are able to invade other tissues. Cancer cells can spread to other parts of the body through the blood and lymph systems.
      – Cancer is not just one disease but many diseases. There are more than 100 different types of cancer.
      – Most cancers are named for the organ or type of cell in which they start.
      – There are two general types of tumours, namely benign and malignant.
  • 5. Types of cancer (figure labels): skin, breast, colon, lung, pancreatic, liver, bladder, prostate, kidney, thyroid, leukemia, endometrial, rectal, non-Hodgkin lymphoma, cervical, oral
  • 6. Breast Cancer
      – Breast cancer is the second leading cause of cancer death among women, coming directly after lung cancer.
      – Breast cancer is considered the most common invasive cancer in women, with more than one million cases and nearly 600,000 deaths occurring worldwide annually.
      – Breast cancer tops the list of cancers in Egypt, with 42 cases per 100,000 of the population. However, 80% of the cases of breast cancer in Egypt are of the benign kind.
  • 7. History and Background: medical prognosis is the estimation of cure, complication, disease recurrence, and survival for a patient or group of patients after treatment.
  • 8. Breast Cancer Classification
      – Suspicious pixel groups show up as white spots on a mammogram.
      – Round, well-defined, larger groups are more likely benign.
      – A tight cluster of tiny, irregularly shaped groups may indicate cancer (malignant).
  • 9. Breast Cancer Features
      – MRI: cancer can have a unique appearance; features that turned out to indicate cancer are used for diagnosis / prognosis of each cell nucleus.
      – Diagram labels: Magnetic Resonance Image; Feature Extraction; F1, F2, F3, ..., Fn
  • 10. Diagnosis or prognosis of breast cancer: benign or malignant
  • 11. Computer-Aided Diagnosis
      – Mammography allows for efficient diagnosis of breast cancers at an earlier stage.
      – Radiologists misdiagnose 10-30% of the malignant cases.
      – Of the cases sent for surgical biopsy, only 10-20% are actually malignant.
  • 12. Computational Intelligence (diagram labels): Data + Knowledge; Artificial Intelligence; Expert systems; Fuzzy logic; Pattern Recognition; Machine learning; Probabilistic methods; Multivariate statistics; Visualization; Evolutionary algorithms; Neural networks
  • 13. What do these methods do?
      – Provide non-parametric models of data.
      – Allow classifying new data into pre-defined categories, supporting diagnosis and prognosis.
      – Allow discovering new categories.
      – Allow understanding the data by creating fuzzy or crisp logical rules.
      – Help to visualize multi-dimensional relationships among data samples.
  • 14. Pattern recognition system decomposition (diagram labels): dataset; selecting data mining tool; data preprocessing; feature selection; classification algorithm (SMO, IBK, BF Tree); results and evaluations
  • 18. Data Mining
      – Data mining is a set of techniques used in various domains to give meaning to the available data.
      – Objective: fit data to a model, either descriptive or predictive.
  • 19. Predictive & descriptive data mining
      – Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other unclassified examples.
      – Descriptive: describes the general or special features of a set of data in a concise manner.
  • 20. Data Mining Models and Tasks
  • 21. Data Mining Tools: many advanced tools for data mining are available, either as open-source or commercial software.
  • 22. Weka
      – Weka stands for Waikato Environment for Knowledge Analysis.
      – Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
      – Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
      – Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.
  • 24. Data Preprocessing
      – Data in the real world is:
          – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
          – noisy: containing errors or outliers
          – inconsistent: containing discrepancies in codes or names
      – Quality decisions must be based on quality data. Measures of data quality: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.
  • 25. Preprocessing techniques
      – Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
      – Data integration: integration of multiple databases, data cubes, or files.
      – Data transformation: normalization and aggregation.
      – Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results.
      – Data discretization: part of data reduction, but of particular importance, especially for numerical data.
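Three of the preprocessing techniques listed on slide 25 can be illustrated with a minimal Python sketch. This is not the Weka implementation used in the presentation: the function names, the mean-imputation strategy, the min-max scaling, the equal-width binning, and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical attribute column with one missing value.
raw = [2.0, None, 4.0, 10.0, 6.0]
cleaned = fill_missing(raw)
normalized = min_max_normalize(cleaned)
binned = equal_width_bins(cleaned, 2)
```

Mean imputation is only one possible cleaning choice; a median or a class-conditional mean may suit skewed medical attributes better.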
  • 27. Feature selection: finding a feature subset that has the most discriminative information from the original feature space. The objectives of feature selection are:
      – Improving the prediction performance of the predictors
      – Providing faster and more cost-effective predictors
      – Providing a better understanding of the underlying process that generated the data
  • 28. Feature Selection: transforming a dataset by removing some of its columns, e.g. (A1, A2, A3, A4, C) → (A2, A4, C)
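The column removal on slide 28 can be sketched as a simple projection. The attribute names A1 to A4 and the class column C come from the slide; the function name `select_features` and the record values are hypothetical, and the sketch does not show how the subset {A2, A4} would be chosen.

```python
def select_features(rows, keep):
    """Project each record onto the attributes listed in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]

# Hypothetical records; only the column names follow the slide.
dataset = [
    {"A1": 5, "A2": 1, "A3": 7, "A4": 2, "C": "benign"},
    {"A1": 3, "A2": 9, "A3": 7, "A4": 8, "C": "malignant"},
]

# Keep attributes A2 and A4 plus the class label C, dropping A1 and A3.
reduced = select_features(dataset, ["A2", "A4", "C"])
```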
  • 30. Supervised Learning • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation • New data is classified using the model built from the training set, whose categories are known [Figure: samples separated into Category "A" and Category "B"; classification (recognition), i.e. supervised classification]
  • 31. Classification • Every day, all the time, we classify things. • E.g. crossing the street: – Is there a car coming? – At what speed? – How far is it to the other side? – Classification: safe to walk or not!
  • 32. Classification vs. Prediction • Classification: – predicts categorical class labels (discrete or nominal) – constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data • Prediction: – models continuous-valued functions, i.e., predicts unknown or missing values
  • 33. Classification: A Two-Step Process • Model construction: describing a set of predetermined classes – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute – The set of tuples used for model construction is the training set – The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: classifying future or unknown objects – Estimate the accuracy of the model – The known label of each test sample is compared with the class predicted by the model – The accuracy rate is the percentage of test set samples that are correctly classified by the model – The test set must be independent of the training set; otherwise the estimate is over-optimistic and over-fitting goes undetected – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
  • 34. Classification Process (1): Model Construction • Training data:
      NAME   RANK            YEARS  TENURED
      Mike   Assistant Prof  3      no
      Mary   Assistant Prof  7      yes
      Bill   Professor       2      yes
      Jim    Associate Prof  7      yes
      Dave   Assistant Prof  6      no
      Anne   Associate Prof  3      no
    • A classification algorithm turns the training data into a classifier (model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
  • 35. Classification Process (2): Use the Model in Prediction • Testing data (fed to the classifier):
      NAME     RANK            YEARS  TENURED
      Tom      Assistant Prof  2      no
      Merlisa  Associate Prof  7      no
      George   Professor       5      yes
      Joseph   Assistant Prof  7      yes
    • Unseen data: (Jeff, Professor, 4). Tenured?
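The two-step process can be traced with the rule from slide 34 and the testing data from slide 35. The sketch below is an illustration only: the function name `predict_tenured` is an assumption, and the rule is hard-coded rather than learned.

```python
def predict_tenured(rank, years):
    """Rule from slide 34: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from slide 35: (name, rank, years, known label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 2a: estimate accuracy by comparing predictions with the known labels.
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)

# Step 2b: classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)

print(accuracy, jeff)
```

On this test set the rule misclassifies only Merlisa (7 years but not tenured), so three of the four test samples are classified correctly, and Jeff is predicted to be tenured.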
  • 36. Classification • Classification is a data mining (machine learning) technique used to predict group membership for data instances. • Classification analysis is the organization of data into given classes. • These approaches normally use a training set in which all objects are already associated with known class labels. • The classification algorithm learns from the training set and builds a model. • The model is then used to classify new objects.
  • 37. Classification • predicts categorical class labels (discrete or nominal) • constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify unseen data
  • 38. Quality of a classifier • One quality measure is computing time: the lower, the better. • The quality of a model can also be described by its confusion matrix. • The confusion matrix shows how well the method predicts the class of new entries. • Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. • Thus the diagonal elements represent correctly classified instances • and the off-diagonal elements represent misclassified instances.
  • 39. Classification Techniques • Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. • The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.
  • 41. Classification Model • Support vector machine classifier (V. Vapnik)
  • 42. Support Vector Machine (SVM) • SVM is a state-of-the-art learning machine that has been used extensively as a tool for data classification, function approximation, etc. • Owing to its generalization ability, it has found a great deal of success in many applications. • Unlike traditional methods, which minimize the empirical training error, SVM minimizes an upper bound on the generalization error by maximizing the margin between the separating hyperplane and the data set.
  • 44. Tennis example [Figure: scatter plot with axes Humidity and Temperature; one marker denotes "play tennis", the other "do not play tennis"]
  • 45. Linear classifiers: Which Hyperplane? • The decision boundary is the line ax + by − c = 0 • Lots of possible solutions for a, b, c. • Some methods find a separating hyperplane, but not the optimal one • Support Vector Machine (SVM) finds an optimal solution. – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
  • 46. Selection of a Good Hyper-Plane • Objective: select a 'good' hyper-plane using only the data! • Intuition (Vapnik 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane 'far' from the data
  • 47. SVM – Support Vector Machines [Figure: two separating hyperplanes with their support vectors, one with a small margin and one with a large margin]
  • 48. Support Vector Machine (SVM) • SVMs maximize the margin around the separating hyperplane. • The decision function is fully specified by a subset of training samples, the support vectors. • Solving SVMs is a quadratic programming problem • Seen by many as the most successful current text classification method [Figure: the support vectors define the maximum-margin hyperplane; an alternative hyperplane has a narrower margin]
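To make the margin concrete, the sketch below measures the margin of two candidate separating lines of the form ax + by − c = 0 on a made-up 2-D data set. It does not solve the SVM quadratic program; it only shows the quantity that the SVM maximizes. The points, the two candidate lines and the function name `margin` are assumptions for illustration.

```python
import math


def margin(a, b, c, points):
    """Smallest distance from any point to the line ax + by - c = 0."""
    norm = math.hypot(a, b)
    return min(abs(a * x + b * y - c) / norm for x, y in points)


# Made-up, linearly separable data: two points per class.
points = [(1, 1), (2, 1), (4, 4), (5, 4)]

# Both lines separate the classes, but not equally well.
diagonal = margin(1, 1, 5, points)    # line x + y - 5 = 0
vertical = margin(1, 0, 2.5, points)  # line x - 2.5 = 0

print(diagonal, vertical)
```

Both lines classify every point correctly, yet the diagonal line keeps the closest point about 1.41 units away while the vertical line leaves only 0.5, so a margin-maximizing method prefers the former. The points that attain the minimum distance play the role of support vectors.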
  • 49. Non-Separable Case • The Lagrangian trick
  • 50. SVM • Relatively new concept • Nice generalization properties • Hard to learn: learned in batch mode using quadratic programming techniques • Using kernels, can learn very complex functions
  • 52. K-Nearest Neighbor Classifier • Learning by analogy: tell me who your friends are and I'll tell you who you are • A new example is assigned to the most common class among the K examples that are most similar to it.
  • 53. K-Nearest Neighbor Algorithm • To determine the class of a new example E: – Calculate the distance between E and all examples in the training set – Select the K nearest examples to E in the training set – Assign E to the most common class among its K nearest neighbors [Figure: a new example surrounded by neighbors labelled "Response" and "No response"; assigned class: Response]
  • 54. Distance Between Neighbors • Each example is represented by a set of numerical attributes • "Closeness" is defined in terms of the Euclidean distance between two examples. • The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as: $D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ • John: Age = 35, Income = 95K, No. of credit cards = 3 • Rachel: Age = 41, Income = 215K, No. of credit cards = 2 • Distance(John, Rachel) = sqrt[(35 − 41)^2 + (95K − 215K)^2 + (3 − 2)^2]
  • 55. Instance Based Learning • No model is built: store all training examples • Any processing is delayed until a new instance must be classified. [Figure: a new example surrounded by neighbors labelled "Response" and "No response"; assigned class: Response]
  • 56. Example: 3-Nearest Neighbors
      Customer  Age  Income  No. credit cards  Response
      John      35   35K     3                 No
      Rachel    22   50K     2                 Yes
      Hannah    63   200K    1                 No
      Tom       59   170K    1                 No
      Nellie    25   40K     4                 Yes
      David     37   50K     2                 ?
  • 57. Example: 3-Nearest Neighbors (distances from David)
      Customer  Age  Income (K)  No. cards  Response  Distance from David
      John      35   35          3          No        sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] = 15.17
      Rachel    22   50          2          Yes       sqrt[(22−37)^2 + (50−50)^2 + (2−2)^2] = 15
      Hannah    63   200         1          No        sqrt[(63−37)^2 + (200−50)^2 + (1−2)^2] = 152.24
      Tom       59   170         1          No        sqrt[(59−37)^2 + (170−50)^2 + (1−2)^2] = 122.00
      Nellie    25   40          4          Yes       sqrt[(25−37)^2 + (40−50)^2 + (4−2)^2] = 15.75
      David     37   50          2          Yes
    • The three nearest neighbors are Rachel (Yes), John (No) and Nellie (Yes), so David is assigned the class Yes.
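The 3-nearest-neighbor example can be reproduced with a short Python sketch. The data come from slides 56 and 57 (income in thousands); the function names `euclidean` and `knn_predict` are assumptions, and no attribute scaling is applied, exactly as on the slide.

```python
import math
from collections import Counter

# (name, (age, income in K, number of credit cards), response)
training = [
    ("John", (35, 35, 3), "No"),
    ("Rachel", (22, 50, 2), "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom", (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4), "Yes"),
]
david = (37, 50, 2)


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(query, examples, k=3):
    """Return the majority class among the k nearest examples, plus their names."""
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[1]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0], [name for name, _, _ in nearest]


prediction, neighbours = knn_predict(david, training)
print(prediction, neighbours)
```

Because income is on a much larger scale than the other two attributes, it dominates the distance; normalizing the attributes first is a common refinement that the slides do not cover.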
  • 58. Strengths and Weaknesses • Strengths: – Simple to implement and use – Comprehensible: predictions are easy to explain – Robust to noisy data, because the vote is taken over the k nearest neighbors • Weaknesses: – Needs a lot of space to store all examples – Takes more time to classify a new example than a model-based method (the distance from the new example to every stored example must be calculated and compared)
  • 60. Decision tree induction – Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned. – The decision tree can be thought of as a set of sentences written in propositional logic.
  • 61. Example • Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
  • 62. Payouts and Probabilities • Movie company payouts – Small box office: $200,000 – Medium box office: $1,000,000 – Large box office: $3,000,000 • TV network payout – Flat rate: $900,000 • Probabilities – P(Small Box Office) = 0.3 – P(Medium Box Office) = 0.6 – P(Large Box Office) = 0.1
  • 63. Jenny Lind - Payoff Table
      Decision (rows) / State of nature (columns)
                                Small Box Office  Medium Box Office  Large Box Office
      Sign with Movie Company   $200,000          $1,000,000         $3,000,000
      Sign with TV Network      $900,000          $900,000           $900,000
      Prior Probabilities       0.3               0.6                0.1
  • 64. Using Expected Return Criteria • EVmovie = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EVUII or EVBest • EVtv = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000 • Therefore, using this criterion, Jenny should select the movie contract.
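The expected-return calculation is easy to check in code. The sketch below uses the payoffs and probabilities from slides 62 and 63; the dictionary layout and the name `expected_value` are assumptions for illustration.

```python
probabilities = {"small": 0.3, "medium": 0.6, "large": 0.1}

payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv": {"small": 900_000, "medium": 900_000, "large": 900_000},
}


def expected_value(payoff):
    """Probability-weighted average payoff over the states of nature."""
    return sum(probabilities[state] * payoff[state] for state in probabilities)


expected = {decision: expected_value(p) for decision, p in payoffs.items()}
best = max(expected, key=expected.get)

print({d: round(v) for d, v in expected.items()}, best)
```

The values are rounded before printing because the probabilities are binary floating-point numbers; the decision itself (sign with the movie company) does not depend on the rounding.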
  • 65. Decision Trees • Three types of "nodes" – Decision nodes: represented by squares (□) – Chance nodes: represented by circles (○) – Terminal nodes: represented by triangles (optional) • Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes • Create the tree from left to right • Solve the tree from right to left
  • 66. Example Decision Tree [Figure: a decision node leading to a chance node with three branches: Event 1, Event 2, Event 3]
  • 67. Jenny Lind Decision Tree • Sign with Movie Co.: Small Box Office → $200,000; Medium Box Office → $1,000,000; Large Box Office → $3,000,000 • Sign with TV Network: Small Box Office → $900,000; Medium Box Office → $900,000; Large Box Office → $900,000
  • 68. Jenny Lind Decision Tree • Sign with Movie Co. (ER ?): Small Box Office (.3) → $200,000; Medium Box Office (.6) → $1,000,000; Large Box Office (.1) → $3,000,000 • Sign with TV Network (ER ?): Small Box Office (.3) → $900,000; Medium Box Office (.6) → $900,000; Large Box Office (.1) → $900,000 • Decision node: ER ?
  • 69. Jenny Lind Decision Tree - Solved • Sign with Movie Co. (ER 960,000): Small Box Office (.3) → $200,000; Medium Box Office (.6) → $1,000,000; Large Box Office (.1) → $3,000,000 • Sign with TV Network (ER 900,000): Small Box Office (.3) → $900,000; Medium Box Office (.6) → $900,000; Large Box Office (.1) → $900,000 • Decision node: ER 960,000 (sign with the movie company)
  • 71. Evaluation Metrics
                          Predicted as healthy  Predicted as unhealthy
      Actual healthy      tp                    fn
      Actual not healthy  fp                    tn
  • 72. Cross-validation • Correctly Classified Instances: 143 (95.33%) • Incorrectly Classified Instances: 7 (4.67%) • Default is 10-fold cross-validation, i.e. – Split the data into 10 equal-sized pieces – Train on 9 pieces and test on the remaining one – Repeat for all 10 choices of test piece and average the results
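The 10-fold procedure can be sketched as an index split. This is a minimal illustration, not Weka's implementation: the function name `k_fold_indices` is an assumption, the folds are taken in order without shuffling or stratification, and 150 is the instance count implied by the slide (143 + 7).

```python
def k_fold_indices(n_samples, k=10):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(150, k=10)
for train, test in folds:
    # A real run would fit a classifier on `train`, score it on `test`,
    # and average the ten scores afterwards.
    assert not set(train) & set(test)

print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every instance is used for testing exactly once and for training nine times, which is why the averaged score is a less optimistic estimate than accuracy measured on the training data.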
  • 73. A Novel Approach for Breast Cancer Detection using Data Mining Techniques
  • 74. Abstract • The paper investigates the performance of different classification techniques. • The aim is to develop accurate prediction models for breast cancer using data mining techniques. • Three classification techniques are compared in the Weka software, and the comparison results are reported. • Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.
  • 75. Introduction • Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children. • Benign tumors: – Are usually not harmful – Rarely invade the tissues around them – Don't spread to other parts of the body – Can be removed and usually don't grow back • Malignant tumors: – May be a threat to life – Can invade nearby organs and tissues (such as the chest wall) – Can spread to other parts of the body – Often can be removed but sometimes grow back
  • 76. Risk factors • Gender • Age • Genetic risk factors • Family history • Personal history of breast cancer • Race: white or black • Dense breast tissue: women with denser breast tissue have a higher risk • Certain benign (not cancer) breast problems • Lobular carcinoma in situ • Menstrual periods
  • 77. Risk factors • Breast radiation early in life • Treatment with DES: the drug DES (diethylstilbestrol) during pregnancy • Not having children or having them later in life • Certain kinds of birth control • Using hormone therapy after menopause • Not breastfeeding • Alcohol • Being overweight or obese
  • 78. BACKGROUND • Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival. • Vikas Chaurasia et al. used Representive Tree, RBF Network and Simple Logistic to predict survivability for breast cancer patients. • Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.
  • 79. BACKGROUND • Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for the patients who died before 5 years. • Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict survivability for heart disease patients.
  • 80. BACKGROUND • Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict survivability for heart disease patients. • Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5. • Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in chemometrics for the pharmaceutical industry.
  • 81. BACKGROUND • S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used to analyse the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
  • 82. BACKGROUND • C. Kaewchinporn presented a new classification algorithm, TBWC, which combines a decision tree with bagging and clustering. The algorithm was tested on two medical datasets, cardiotocography 1 and cardiotocography 2, and on other datasets not related to the medical domain. • B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted on various parameters.
  • 84. BREAST-CANCER-WISCONSIN DATA SET SUMMARY • Source: the UC Irvine machine learning repository • Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg • 2 classes (malignant and benign) and 9 integer-valued attributes • The breast-cancer-Wisconsin data set has 699 instances • We removed the 16 instances with missing values to construct a new dataset with 683 instances • Class distribution as stated in the paper: Benign: 458 (65.5%), Malignant: 241 (34.5%) • Note: these percentages describe the original 699 instances. After excluding 2 malignant and 14 benign instances, the correct distribution is benign 444 (65%) and malignant 239 (35%)
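The corrected class distribution follows directly from the counts on the slide. The short check below is only arithmetic on those counts; the variable names are assumptions.

```python
total, removed = 699, 16
benign, malignant = 458, 241
benign_removed, malignant_removed = 14, 2

remaining = total - removed
benign_left = benign - benign_removed
malignant_left = malignant - malignant_removed

# Percentages before and after removing the instances with missing values.
before = (round(100 * benign / total, 1), round(100 * malignant / total, 1))
after = (round(100 * benign_left / remaining, 1), round(100 * malignant_left / remaining, 1))

print(remaining, benign_left, malignant_left, before, after)
```

The shift is small (65.5% to 65.0% benign), so the mistake in the paper does not change the class balance materially, but the reported figures do not describe the 683 instances actually used.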
  • 85. Attributes of the data set
      Attribute                    Domain
      Sample Code Number           Id number
      Clump Thickness              1 - 10
      Uniformity of Cell Size      1 - 10
      Uniformity of Cell Shape     1 - 10
      Marginal Adhesion            1 - 10
      Single Epithelial Cell Size  1 - 10
      Bare Nuclei                  1 - 10
      Bland Chromatin              1 - 10
      Normal Nucleoli              1 - 10
      Mitoses                      1 - 10
      Class                        2 for benign, 4 for malignant
  • 86. EVALUATION METHODS • We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9. • WEKA is a collection of machine learning algorithms for data mining tasks. • The algorithms can either be applied directly to a dataset or called from your own Java code. • WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection. • It is also well suited for developing new machine learning schemes. • WEKA is open source software issued under the GNU General Public License.
  • 89. Importance of the input variables: number of instances taking each attribute value
      Domain                       1     2    3    4    5    6    7    8    9   10   Sum
      Clump Thickness              139   50   104  79   128  33   23   44   14  69   683
      Uniformity of Cell Size      373   45   52   38   30   25   19   28   6   67   683
      Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7   58   683
      Marginal Adhesion            393   58   58   33   23   21   13   25   4   55   683
      Single Epithelial Cell Size  44    376  71   48   39   40   11   21   2   31   683
      Bare Nuclei                  402   30   28   19   30   4    8    21   9   132  683
      Bland Chromatin              150   160  161  39   34   9    71   28   11  20   683
      Normal Nucleoli              432   36   42   18   19   22   16   23   15  60   683
      Mitoses                      563   35   33   12   6    3    9    8    0   14   683
      Sum                          2843  850  605  333  346  192  207  233  77  516
  • 90. EXPERIMENTAL RESULTS
      Evaluation criteria               BF Tree  IBK     SMO
      Time to build model (in sec)      0.97     0.02    0.33
      Correctly classified instances    652      655     657
      Incorrectly classified instances  31       28      26
      Accuracy (%)                      95.46    95.90   96.19
  • 91. EXPERIMENTAL RESULTS • The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN) • The specificity, or true negative rate (TNR), is defined by TN / (TN + FP) • The accuracy is defined by (TP + TN) / (TP + FP + TN + FN) • True positive (TP) = number of positive samples correctly predicted as positive • False negative (FN) = number of positive samples wrongly predicted as negative • False positive (FP) = number of negative samples wrongly predicted as positive • True negative (TN) = number of negative samples correctly predicted as negative
  • 92. EXPERIMENTAL RESULTS
      Classifier  Class      TP rate  FP rate  Precision  Recall
      BF Tree     Benign     0.971    0.075    0.96       0.971
      BF Tree     Malignant  0.925    0.029    0.944      0.925
      IBK         Benign     0.98     0.079    0.958      0.98
      IBK         Malignant  0.921    0.02     0.961      0.921
      SMO         Benign     0.971    0.054    0.971      0.971
      SMO         Malignant  0.946    0.029    0.946      0.946
  • 93. EXPERIMENTAL RESULTS: confusion matrices
      Classifier  Actual class  Predicted benign  Predicted malignant
      BF Tree     Benign        431               13
      BF Tree     Malignant     18                221
      IBK         Benign        435               9
      IBK         Malignant     19                220
      SMO         Benign        431               13
      SMO         Malignant     13                226
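The accuracy figures on slide 90 can be recomputed from these confusion matrices with the definitions on slide 91. In the sketch below, malignant is treated as the positive class; that choice, the dictionary layout and the function name `metrics` are assumptions for illustration.

```python
# Counts from slide 93, with malignant taken as the positive class (an assumption).
confusion = {
    "BF Tree": {"tp": 221, "fn": 18, "fp": 13, "tn": 431},
    "IBK": {"tp": 220, "fn": 19, "fp": 9, "tn": 435},
    "SMO": {"tp": 226, "fn": 13, "fp": 13, "tn": 431},
}


def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy as defined on slide 91."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


results = {name: metrics(**counts) for name, counts in confusion.items()}
for name, m in results.items():
    print(name, {key: round(value, 3) for key, value in m.items()})
```

With this choice of positive class, the sensitivity equals the malignant TP rate on slide 92 and the specificity equals the benign TP rate, which gives a quick consistency check between the two result tables.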
  • 94. Importance of the input variables
      Variable                     Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance
      Clump Thickness              378.08158    0.464      0.152       126.232526    8
      Uniformity of Cell Size      539.79308    0.702      0.3         180.265026    1
      Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323    2
      Marginal Adhesion            390.0595     0.464      0.21        130.2445      7
      Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
      Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
      Bland Chromatin              453.20971    0.555      0.201       151.321903    4
      Normal Nucleoli              416.63061    0.487      0.237       139.118203    6
      Mitoses                      191.9682     0.212      0.212       64.122733     9
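The "Average Rank" column appears to be the plain mean of the three scores, and the importance order follows from sorting on it. The sketch below reproduces that under this assumption, using the values from the table; the names `scores`, `averages` and `ranking` are illustrative.

```python
# (chi-squared, info gain, gain ratio) per variable, copied from slide 94.
scores = {
    "Clump Thickness": (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size": (539.79308, 0.702, 0.3),
    "Uniformity of Cell Shape": (523.07097, 0.677, 0.272),
    "Marginal Adhesion": (390.0595, 0.464, 0.21),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei": (489.00953, 0.603, 0.303),
    "Bland Chromatin": (453.20971, 0.555, 0.201),
    "Normal Nucleoli": (416.63061, 0.487, 0.237),
    "Mitoses": (191.9682, 0.212, 0.212),
}

# Plain mean of the three scores, then sort from most to least important.
averages = {name: sum(values) / len(values) for name, values in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    print(position, name, round(averages[name], 6))
```

Because the chi-squared statistic is hundreds of times larger than the other two scores, the average is effectively a chi-squared ranking; averaging the per-measure rank positions instead would give each measure equal weight.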
  • 96. CONCLUSION • The accuracy of the classification techniques was evaluated for each selected classifier algorithm. • We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree. • SMO performed best among the three classifiers. • The most important attribute for breast cancer survival is Uniformity of Cell Size.
  • 97. Future work • Using an updated version of Weka • Using another data mining tool • Using alternative algorithms and techniques
  • 98. Notes on paper • Spelling mistakes • No point of contact (e-mail) • Wrong percentage calculation • Copying from old papers • Charts not clear • No contributions
  • 99. Comparison • "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012 • That paper introduced a more advanced idea: a fusion between classifiers

Editor's Notes

  1. But note that Naïve Bayes also finds an optimal solution … just under a different definition of optimality.