Data Mining Techniques in
The Diagnosis of
Coronary Artery Disease (CAD)
Steve Iduye
Xiaoqing Zhuang
HINF 6210 Data Mining
Contents
❖Coronary Heart Disease in a Nutshell
❖Description of the Datasets
❖Case 1
❖Case 2
❖Case 3
❖Discussion
❖Conclusion
Heart Disease in a Nutshell
● Coronary Artery Disease (CAD) happens when the arteries that supply blood to the heart
muscle become hardened and narrowed.
● As a result, the heart muscle cannot get the blood or oxygen it needs, and this can
lead to chest pain (angina) or a heart attack.
● Current research has established that heart disease is not a single condition, but
refers to any condition in which the heart and blood vessels are injured and do not
function properly, resulting in serious and potentially fatal health problems
(Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
Heart Disease in a Nutshell
● The causes of heart disease are not fully understood, but age, gender, family history,
and ethnic background are considered major risk factors across different investigations
(Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
● Other factors such as eating habits, fatty foods, lack of exercise, high cholesterol,
hypertension, pollution, lifestyle factors, obesity, high blood pressure, stress,
diabetes, and lack of awareness have also been claimed to increase the chance of
developing heart disease (Chilnick, 2008; HEALTHS, 2010).
● Heart research has further found that the majority of disease occurrence is observed
in people between the ages of 50 and 60 (Chilnick, 2008; HEALTHS, 2010).
Case 1
● This case study investigates the risk factors that contribute to Coronary Artery
Disease in males and females
● (Article published by Jesmin Nahar, Tasadduq Imam, Kevin S. Tickle, and Yi-Ping
Phoebe Chen)
● UCI Cleveland Dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/)
● Predictive Apriori (association rules) was used to identify those risk factors
Apriori Algorithm (Case 1)
The learning process looks for the following:
– Support and confidence greater than or equal to the minimum thresholds
– All possible association rules that meet these requirements
– Confidence and support are used in this study because of their accuracy in ranking
Apriori rules (Agrawal et al., 1993; Mutter, Hall, & Frank, 2005; Taihua & Fan, 2010)
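The support/confidence filtering described above can be sketched in a few lines of Python. The records, attribute encodings, and thresholds below are toy values for illustration, not the UCI data or the study's settings:

```python
from itertools import combinations

# Toy records: each is a set of attribute=value items (hypothetical encodings).
records = [
    {"sex=female", "angina=false", "vessels=0", "class=healthy"},
    {"sex=female", "angina=false", "vessels=0", "class=healthy"},
    {"sex=male",   "angina=true",  "vessels=2", "class=sick"},
    {"sex=female", "angina=false", "vessels=0", "class=healthy"},
    {"sex=male",   "angina=true",  "vessels=1", "class=sick"},
]

def support(itemset, records):
    """Fraction of records containing every item in the itemset."""
    return sum(itemset <= r for r in records) / len(records)

def rules(records, min_support=0.4, min_confidence=0.9):
    """List rules {LHS} => class whose support and confidence meet the thresholds."""
    items = sorted(set.union(*records) - {"class=healthy", "class=sick"})
    out = []
    for n in (1, 2):                      # LHS itemsets of size 1 and 2
        for lhs in combinations(items, n):
            for cls in ("class=healthy", "class=sick"):
                sup = support(set(lhs) | {cls}, records)
                lhs_sup = support(set(lhs), records)
                if lhs_sup and sup >= min_support and sup / lhs_sup >= min_confidence:
                    out.append((lhs, cls, sup, sup / lhs_sup))
    return out

for lhs, cls, sup, conf in rules(records):
    print(lhs, "=>", cls, f"support={sup:.2f} confidence={conf:.2f}")
```

On this toy data, `sex=female`, `angina=false`, and `vessels=0` each imply the healthy class with confidence 1.0, mirroring the shape of the mined rules reported later in the deck.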
Attributes of Interest in the Dataset
● These attributes are a combination of symptoms, characteristics of heart disease,
diagnostic techniques, and probable causes.
● Let X represent all the attributes
● Let Y represent the class vector (CAD = unhealthy, No_CAD = healthy)
Attributes of Interest in the Dataset
Prior Setting
● Rules with confidence levels above 90%, accuracy levels above 99%, and confirmation
levels above 79% were selected for Predictive Apriori.
● As there can be many such rules, only the rules containing the 'sick' or 'healthy'
class on the right-hand side (RHS) were considered.
● If no such rules were available, rules containing the 'sick' or 'healthy' class on the
left-hand side (LHS) were reported.
Apriori Rules
Summary: Case 1
● Four of the five rules attributed to the 'healthy' class indicate that, in this
particular dataset, females have a greater chance of being free from coronary heart disease.
● The results also show that exercise-induced angina (chest pain) being false was a good
indicator of a person being healthy, irrespective of gender (exercise induced
angina = false appeared on the LHS of all the high-confidence rules).
● The number of coloured vessels being zero and thal (heart status) being normal
were also shown to be good indicators of health.
Case 1 Summary
● Rules mined for the 'sick' class, on the other hand, showed that chest pain type
being asymptomatic and thal indicating a reversible defect were probable indicators of a
person being sick (both high-confidence rules have these two factors on the LHS).
Building Classification Rules
Objectives
● Building classification rules from the previous association rule (AR) attributes
● Training data are analyzed by a classification algorithm
● The learned attributes, or classifier, become the rules
● Training data are used to estimate the accuracy of the rules
● The rules can be applied to the classification of new data tuples (Han, Kamber, &
Pei, 2012)
Step 1: Training Data
Healthy Class
| SEX | EXERCISE_INDUCED_ANGINA | NO_VESSEL_COLORED | THAL (HEART STATUS) | FASTING_BLOOD_SUGAR | CLASS |
| Female | Failed | 0 | Normal | – | Healthy (no_CAD) |
| Female | Failed | 0 | – | False | Healthy (no_CAD) |
| Female | Failed | 0 | – | – | Healthy (no_CAD) |
| Female | Failed | – | Normal | False | Healthy (no_CAD) |
| M or F | Failed | 0 | Normal | – | Healthy (no_CAD) |
(– marks information not given)
Step 1: Training Data
Un-Healthy Class
| CHEST_PAIN_TYPE | SLOPE | EXERCISE_INDUCED_ANGINA | THAL (HEART STATUS) | CLASS |
| asymptomatic | flat | – | reversible defect | Unhealthy (CAD) |
| asymptomatic | – | true | reversible defect | Unhealthy (CAD) |
(– marks information not given)
Step 2: Create Classification Rules
● The learned attributes, or classifier, become the rules
● If {sex = female ∧ exercise_induced_angina = fal ∧ number_of_vessels_colored = 0 ∧
thal = norm} => no CAD
● If {sex = female ∧ fasting_blood_sugar = fal ∧ exercise_induced_angina = fal ∧
number_of_vessels_colored = 0} => no CAD
Classification Rules (continued)
● If {sex = female ∧ fasting_blood_sugar = fal ∧ exercise_induced_angina = fal ∧
thal = norm} => no CAD
● If {resting_blood_pres in (115.2, 136.4] ∧ exercise_induced_angina = fal ∧
number_of_vessels_colored = 0 ∧ thal = norm} => no CAD
● If {sex = female ∧ exercise_induced_angina = fal ∧ number_of_vessels_colored = 0} =>
no CAD
Classification Rules (continued)
● If {chest_pain_type = asympt ∧ slope = flat ∧ thal = rev} => CAD is present
● If {chest_pain_type = asympt ∧ exercise_induced_angina = TRUE ∧ thal = rev} =>
CAD is present
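The mined IF-THEN rules can be packaged as a simple rule-based classifier. A minimal sketch; attribute names and value encodings here are illustrative stand-ins for the abbreviated slide notation:

```python
def classify(p):
    """Apply the mined IF-THEN rules to a patient record (a dict)."""
    # 'no CAD' rules: female, no exercise-induced angina, zero vessels coloured
    if (p.get("sex") == "female"
            and p.get("exercise_induced_angina") == "false"
            and p.get("number_of_vessels_colored") == 0):
        return "no CAD"
    # 'CAD' rules: asymptomatic chest pain, reversible thal defect,
    # plus either a flat slope or exercise-induced angina
    if (p.get("chest_pain_type") == "asymptomatic"
            and p.get("thal") == "reversible_defect"
            and (p.get("slope") == "flat"
                 or p.get("exercise_induced_angina") == "true")):
        return "CAD"
    return "unknown"  # no rule fires; fall back to another classifier

print(classify({"sex": "female", "exercise_induced_angina": "false",
                "number_of_vessels_colored": 0}))          # no CAD
print(classify({"chest_pain_type": "asymptomatic", "slope": "flat",
                "thal": "reversible_defect"}))             # CAD
```

Records matched by no rule fall through to "unknown", which is where a trained classifier (Step 3's decision tree) would take over.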
Step 3: Estimate the Accuracy of the
Rules Using a Decision Tree
● Find each attribute's information gain:
info(D) = −(5/7)·log2(5/7) − (2/7)·log2(2/7) ≈ 0.8631 bits (A)
info_sex(D) = (5/7)·(−(4/5)·log2(4/5) − (1/5)·log2(1/5)) ≈ 0.5157 (B)
info_exercise_induced_angina(D) = (6/7)·(−(5/6)·log2(5/6) − (1/6)·log2(1/6)) ≈ 0.5572 (C)
info_heart_status(D) = (5/7)·(−(3/5)·log2(3/5) − (2/5)·log2(2/5)) ≈ 0.6935 (D)
Gain(sex) = A − B ≈ 0.3475 bits; Gain(exercise_induced_angina) = A − C ≈ 0.3060 bits;
Gain(heart status) = A − D ≈ 0.1696 bits, so sex gives the largest information gain.
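These quantities can be checked numerically. A short sketch, assuming the 7 training rows (5 healthy, 2 CAD) and the partition fractions used in the formulas, with any partition not listed taken as pure:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def split_info(partitions, n):
    """Weighted entropy after splitting n records; partitions not listed
    are assumed pure (entropy 0) and contribute nothing."""
    return sum(sum(p) / n * entropy(p) for p in partitions)

info_D = entropy([5, 2])                         # 5 healthy, 2 CAD rows

gain_sex    = info_D - split_info([[4, 1]], 7)   # 5-row partition: 4 vs 1
gain_angina = info_D - split_info([[5, 1]], 7)   # 6-row partition: 5 vs 1
gain_thal   = info_D - split_info([[3, 2]], 7)   # 5-row partition: 3 vs 2

print(round(info_D, 4), round(gain_sex, 4),
      round(gain_angina, 4), round(gain_thal, 4))
```

Sex yields the largest gain, so it would sit at the root of the decision tree; note that information gain is always non-negative, so any negative "gain" signals an arithmetic slip.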
Case 2: Diagnosing Coronary Artery Disease via Data Mining Algorithms by Considering
Laboratory and Echocardiography Features
Case 3: A Data Mining Approach for Diagnosis of Coronary Artery Disease

Dataset
– Case 2: Z-Alizadeh Sani dataset: 303 patients (54 features each)
– Case 3: Z-Alizadeh Sani dataset: 303 patients (54 features each)
Objective
– Case 2: Use non-invasive, less costly methods (various data mining algorithms) to
predict the stenosis of each artery separately
– Case 3: Use affordable feature measurements and apply the proposed approach to
identify the probability of the CAD state
Features
– Case 2: Demographic, laboratory, and echo features
– Case 3: Four groups: demographic, symptom and examination, ECG, and laboratory and
echo features; two possible categories: CAD or Normal (IF the patient's diameter
narrowing is >= 50% THEN CAD, ELSE Normal)
Methods
– Case 2: Classification algorithms: C4.5 and bagging; information gain, Gini index,
ten-fold cross-validation, confusion matrix, performance measures; RapidMiner software
– Case 3: Classification algorithms: SMO, Naïve Bayes, bagging, and neural network;
feature selection and feature creation, information gain, Gini index, association rule
mining, performance measures, confusion matrix; RapidMiner version 5.2.003
Results
– Case 2: Reports its highest accuracy (79.54%) for diagnosing LAD stenosis
– Case 3: Achieves 94.08% accuracy, higher than the known approaches in the available
literature
Case 2 (METHODS)
● C4.5 classification algorithm
• Based on decision trees, with augmented performance
• Can manage continuous values by breaking them down into sub-intervals
• Uses pruning methods to improve accuracy
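C4.5's handling of a continuous attribute, trying candidate thresholds and keeping the split with maximal information gain, can be sketched as follows. The blood-pressure values and labels are invented for illustration:

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """C4.5-style treatment of a continuous attribute: test midpoints between
    consecutive sorted values, keep the split with maximal information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = (base - len(left) / len(pairs) * entropy(left)
                     - len(right) / len(pairs) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Hypothetical resting blood pressure values with class labels
bp = [110, 118, 125, 132, 150, 160, 170]
labels = ["healthy"] * 4 + ["sick"] * 3
gain, threshold = best_threshold(bp, labels)
print(threshold, round(gain, 4))   # 141.0 0.9852
```

The chosen threshold (141.0 here) separates the two classes cleanly, so the gain equals the full entropy of the label set.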
Case 2 (METHODS)
● Bagging algorithm
• Classifies each sample based on the output of a set of diverse base classifiers
• Base classifiers can be selected from C4.5, Naïve Bayes, ID3, and other data mining
algorithms
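A minimal bagging sketch: base learners are trained on bootstrap resamples and combined by majority vote. The one-dimensional data and the decision-stump base learner are illustrative stand-ins for the C4.5-style base classifiers the paper uses:

```python
import random

def stump(data):
    """Fit a 1-D decision stump: the threshold/orientation with fewest errors."""
    best = None
    for x, _ in data:
        left = [lab for v, lab in data if v <= x]
        right = [lab for v, lab in data if v > x]
        for lo, hi in (("healthy", "sick"), ("sick", "healthy")):
            err = left.count(hi) + right.count(lo)
            if best is None or err < best[0]:
                best = (err, x, lo, hi)
    _, t, lo, hi = best
    return lambda v: lo if v <= t else hi

def bagging(data, n_models=15, seed=0):
    """Train stumps on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = [stump([rng.choice(data) for _ in data]) for _ in range(n_models)]
    def predict(v):
        votes = [m(v) for m in models]
        return max(set(votes), key=votes.count)
    return predict

# Toy (blood pressure, class) training data
data = [(110, "healthy"), (120, "healthy"), (130, "healthy"),
        (150, "sick"), (165, "sick"), (170, "sick")]
model = bagging(data)
print(model(110), model(170))
```

Resampling makes the stumps diverse; the vote smooths out any individual stump trained on an unlucky bootstrap sample.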
Case 3 (METHODS)
● Sequential Minimal Optimization (SMO): an algorithm for efficiently solving the
optimization problem that arises during the training of Support Vector Machines
(SVMs)
● Naïve Bayes classifier: a simple probabilistic classifier based on applying Bayes'
theorem with strong independence assumptions
● Bagging algorithm
● Neural network algorithm: an Artificial Neural Network (ANN) is an interconnected
group of artificial neurons that uses a mathematical or computational model for
information processing based on a connectionist approach; it can model complex
relationships between inputs and outputs or find patterns in data.
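Of these, the Naïve Bayes classifier is compact enough to hand-roll. A sketch over categorical features with add-one (Laplace) smoothing; the chest-pain/slope toy records are illustrative, not drawn from the Z-Alizadeh Sani dataset:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Categorical Naive Bayes with Laplace (add-one) smoothing."""
    priors = Counter(labels)
    counts = defaultdict(Counter)          # (feature index, class) -> value counts
    vocab = defaultdict(set)               # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
            vocab[i].add(v)
    def predict(row):
        def score(y):
            p = priors[y] / len(labels)    # class prior
            for i, v in enumerate(row):    # times each P(value | class), smoothed
                c = counts[(i, y)]
                p *= (c[v] + 1) / (sum(c.values()) + len(vocab[i]))
            return p
        return max(priors, key=score)
    return predict

# Toy records: (chest pain type, ST slope) -> class
rows = [("asympt", "flat"), ("asympt", "up"), ("typical", "up"),
        ("typical", "flat"), ("asympt", "flat")]
labels = ["CAD", "CAD", "normal", "normal", "CAD"]
predict = train_nb(rows, labels)
print(predict(("asympt", "flat")), predict(("typical", "up")))   # CAD normal
```

The "strong independence assumption" is visible in the score: each feature contributes an independent multiplicative factor.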
Case 3 (METHODS)
● Feature selection
• Uses the coefficients of the normal vector of a linear SVM as feature weights
• The attribute values still have to be numerical
• The 34 features with weight > 0.6 were selected, and the algorithms were applied to
them
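The weight-based selection step can be sketched with any linear classifier. Here a tiny perceptron stands in for the linear SVM (an assumption made for brevity), and features are ranked by the absolute value of their learned weights:

```python
def perceptron(X, y, epochs=20):
    """Tiny linear classifier; a stand-in for the linear SVM the paper uses."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for row, label in zip(X, y):       # label in {-1, +1}
            if label * sum(wi * xi for wi, xi in zip(w, row)) <= 0:
                w = [wi + label * xi for wi, xi in zip(w, row)]
    return w

# Toy standardized data: feature 0 is informative, feature 1 is noise.
X = [(1.0, 0.3), (0.9, -0.2), (0.8, 0.1),
     (-1.0, 0.2), (-0.9, -0.1), (-0.7, 0.0)]
y = [1, 1, 1, -1, -1, -1]

w = perceptron(X, y)
ranked = sorted(range(len(w)), key=lambda i: -abs(w[i]))
selected = [i for i in ranked if abs(w[i]) > 0.6]   # weight cutoff, as in the paper
print(ranked, selected)
```

On this toy data only the informative feature clears the 0.6 cutoff, which is the same shape of outcome as the paper's 34-of-54 selection.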
Case 3 (METHODS)
● Feature creation
• Three new features, the LAD (Left Anterior Descending) recognizer, the LCX (Left
Circumflex) recognizer, and the RCA (Right Coronary Artery) recognizer, are used to
recognize whether the LAD, LCX, or RCA is blocked. The higher the value, the higher
the risk.
• The available features of the dataset are first discretized into binary variables:
a value of 1 for a feature indicates a higher probability of the record being in the
CAD class, while a value of 0 indicates otherwise.
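A hypothetical sketch of such a recognizer feature: raw values are binarized into risk indicators and combined into a weighted score. The feature names, cutoffs, and weights are invented for illustration and are not the paper's values:

```python
def discretize(record, cutoffs):
    """Binarize features: 1 if the value falls on the 'risk' side of its cutoff."""
    return {name: int(record[name] >= cut) for name, cut in cutoffs.items()}

def lad_recognizer(bits, weights):
    """Score = weighted sum of risk bits; higher means higher assumed LAD risk."""
    return sum(weights[k] * v for k, v in bits.items())

cutoffs = {"age": 55, "cholesterol": 240, "bp": 140}   # illustrative cutoffs
weights = {"age": 0.5, "cholesterol": 0.3, "bp": 0.2}  # illustrative weights

patient = {"age": 63, "cholesterol": 250, "bp": 130}
bits = discretize(patient, cutoffs)        # {'age': 1, 'cholesterol': 1, 'bp': 0}
print(round(lad_recognizer(bits, weights), 2))   # 0.8
```

The created score can then be appended to each record as a new column before the classifiers are retrained.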
Case 3 (METHODS)
● Association rule mining (mentioned in Case 1)
• Support
• Confidence
Case 2 and Case 3
● Information gain
• Measures the reduction in the entropy of the data records due to a single split over
a given attribute
• The entropy before and after the split is computed as
Entropy(D) = −Σc P(c)·log2 P(c),
where c is the class value (CAD or Normal) and P(c) is the probability of a record
being in class c
• If a feature separates the two classes completely, it has the highest information
gain and is the best feature for classification
Case 2 and Case 3
● Gini index
• A measure of how often a randomly chosen element from a set would be incorrectly
labeled if it were labeled randomly according to the distribution of labels in the
subset: Gini(D) = 1 − Σc P(c)²
• The probability of correctly labeling an item is equal to the probability of
choosing that item's label
• Higher values of the Gini index for a feature indicate its prevalence in causing the
disease
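The Gini impurity of a class-count distribution follows directly from that definition; a one-function sketch:

```python
def gini(counts):
    """Gini impurity of a class-count distribution: 1 - sum_c P(c)^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# 5 healthy vs 2 CAD; an even split; a pure node
print(gini([5, 2]), gini([3, 3]), gini([7, 0]))   # ~0.408, 0.5, 0.0
```

A pure node scores 0 and a 50/50 split scores 0.5, the maximum for two classes, so splits are chosen to drive the weighted impurity of the children down.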
Case 2 and Case 3
● Performance measures: accuracy, sensitivity, and specificity are the most important
performance measures in the medical field
● Confusion matrix: a table that allows visualization of the performance of an algorithm
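Given a 2x2 confusion matrix, the three measures follow directly; the patient counts below are hypothetical:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and accuracy from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn), # overall hit rate
    }

# Hypothetical counts for a CAD classifier evaluated on 100 patients
m = metrics(tp=40, fn=10, fp=5, tn=45)
print(m)   # sensitivity 0.8, specificity 0.9, accuracy 0.85
```

Reporting all three matters clinically: a classifier that labels everyone "CAD" has perfect sensitivity but zero specificity, which accuracy alone can hide.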
Discussion (Improving the Accuracy of CAD Diagnosis by Using Data Mining Techniques)
1. Understand CAD → identify CAD risk features
2. Build a dataset with effective features: feature selection, feature creation,
information gain, Gini index
3. Apply data mining methods: C4.5, bagging, SMO, Naïve Bayes, neural networks,
association rule mining
4. Measure performance: confusion matrix, sensitivity, specificity, accuracy
5. Results (in RapidMiner): extracted rules with their confidence and support
Conclusion
– Using feature selection methods can increase the accuracy of CAD diagnosis (though it
may sometimes decrease the accuracy of LAD and RCA stenosis diagnosis)
– To enrich our dataset, we may need to create new features that have a vital influence
on the accuracy of CAD diagnosis
– Rules extracted by association rule mining methods may not be 100% correct; more test
data are needed to validate the rules
– The results of the standard angiographic method are still needed as the basis of
comparison to assess the prediction capability of the classification algorithms

Editor's Notes

  • #8 These attributes are a combination of symptoms, characteristics of heart disease, diagnostic techniques, and probable causes.
  • #9 Let X represent all the attributes here. Let Y represent the class vector (CAD = unhealthy, No_CAD = healthy).
  • #16 Dots represent not-given information. Steps in classification rules: training data are used to build the model or rules; the training data are analyzed by a classification algorithm and contain attributes and a class.
  • #17 Dots represent not-given information. Learning: training data are analyzed by a classification algorithm.
  • #22 no CAD {female, exercise_induced_angina, no_vessel_colored=0, heart_status=Normal, fasting_blood_sugar= >120dl}; no CAD {male, exercise_induced_angina, no_vessel_colored=0, heart_status=Normal}; CAD {male or female, heart_status=reversible defect, chest_pain_type=asymptomatic, ECG=slope}; CAD {male or female, heart_status=reversible defect, chest_pain_type=asymptomatic, exercise_induced_angina=TRUE}
  • #35 The table has the features selected by the feature selection method.