Niakšu, Olegas; Kurasova, Olga; Gedminaitė, Jurgita. “Data mining for the prediction of BRCA1 gene mutation” (VU MII)

Presentation in the section “Data mining and its applications” of the XVI conference of computer scientists,
“Kompiuterininkų dienos – 2013”, Šiauliai, 2013-09-20


  1. Data mining approach to predict BRCA1 gene mutation
     Olegas Niakšu (1), Jurgita Gedminaitė (2), Olga Kurasova (1)
     (1) Vilnius University, Institute of Mathematics and Informatics
     (2) Lithuanian University of Health Sciences, Oncology Institute
  2. Breast Cancer
     • Cancer is the second leading cause of death in developed countries and one of the leading causes worldwide.
     • Each year more than 10 million people are diagnosed with an oncological disease, and about 6 million die of it.
     • Breast cancer is the most common cancer in women worldwide.
     • Survival is a major concern and depends strongly on early diagnosis and an optimal treatment plan.
  3. BRCA genes (1)
     • BRCA stands for breast cancer (BC) susceptibility gene. In normal cells, BRCA genes help ensure the stability of the cell’s genetic material (DNA) and help prevent uncontrolled cell growth.
     • Mutation of these genes has been linked to the development of hereditary breast and ovarian cancer.
     • Patients with a pathological mutation of a BRCA gene have a 65% lifetime probability of breast cancer.
  4. BRCA genes (2)
     • The research addresses mutations of the tumour suppressor gene BRCA1.
     • We methodically apply a set of classical and emerging statistical and data mining tools in order to answer questions formulated by clinicians:
       ◦ what are the prognostic factors of BRCA1 mutation,
       ◦ what are the factors of BC tumour reoccurrence,
       ◦ whether and how BRCA1 mutations influence the course of the disease.
  5. Medical data mining
     • The healthcare domain is known for its ontological complexity, its variety of medical data standards, and its variable data quality.
     • Typically, the available medical datasets are fragmented and distributed, which makes data cleaning and integration a challenging task.
     • Other important issues in the use of personal healthcare data stem from legal, ethical, and social aspects.
  6. Research data
     • The original medical research was carried out at the Oncology Institute of the Lithuanian University of Health Sciences from 2010 to 2013.
     • The study group consisted of 83 women diagnosed with stage I-II breast cancer with the following tumour morphology: T1 N0, T2 N0, T3 N0, T1 N1, T2 N1.
     • The research duration was determined by the number of patients and by a disease-progress monitoring period of at least 2 years.
  7. The full list of attributes of the initial dataset (1)

       #    Attribute              Attribute type*
       1    Age                    Continuous
       2    Histology type         Nominal (5)
       3    cT                     Nominal (5)
       3    pT                     Nominal (6)
       4    Multifocality          Nominal (2)
       5    cN                     Nominal (3)
       6    pN                     Nominal (2)
       7    G                      Nominal (3)
       8    L                      Nominal (2)
       9    V                      Nominal (2)
       10   ER                     Nominal (4)
       11   PR                     Nominal (4)
       12   HER2                   Nominal (2)
       13   BRCA mutation          Nominal (6)
       14   Bilateral BC           Nominal (2)
       15   Tumor size             Continuous
       16   CHEK2 mutation         Nominal (4)
       17   Affected l_m number    Continuous
  8. The full list of attributes of the initial dataset (2)

       #    Attribute                       Attribute type*
       18   Triple neg. BC                  Nominal (2)
       19   Family history type             Nominal (3)
       20   Prostate cancer fam. hist.      Nominal (2)
       21   Pancreatic cancer fam. hist.    Nominal (2)
       22   Colorectal cancer fam. hist.    Nominal (2)
       23   Surgery type                    Nominal (4)
       24   Chemotherapy type               Nominal (3)
       25   Herceptin                       Nominal (2)
       26   Cht. complications              Nominal (4)
       27   Reoccurrence                    Nominal (2)
       28   Metastases                      Nominal (2)
       29   Time to diseased                Continuous
       30   Is Diseased                     Nominal (2)
       31   Monitoring period               Continuous
       32   Time to reoccurrence            Continuous
       33   Adjuv. ST                       Nominal (2)
       34   Adjuv. HT                       Nominal (5)
  9. The distribution of prediction class attributes

       Attribute            Positive attribute value            Negative attribute value
                            Number of    Percentage of the      Number of    Percentage of the
                            patients     whole group            patients     whole group
       BRCA1 mutation       12           14%                    71           86%
       BC reoccurrence      22           27%                    61           73%
       Diseased patients    2            2%                     81           98%
  10. The distribution of prediction class attributes (figure)
  11. Data mining approach
     • Step 1: preprocessing. The continuous attributes Age, Tumor size, and Time to reoccurrence were discretized into D_age group, D_tumor size, and D_time to reoccurrence, respectively. The dataset was checked for outliers and missing values.
     • Step 2: dimensionality reduction. Feature subset selection algorithms were used.
     • Step 3: classification. A comparative analysis of the classification techniques was performed in WEKA, Orange, and Tibco Spotfire Mining.
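
The discretization in step 1 can be illustrated with a minimal Python sketch. The slides do not give the bins that were actually used, so the cut points, the labels, and the helper name `discretize` below are hypothetical and serve only to show the idea of mapping a continuous attribute such as Age to a nominal one such as D_age group.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls into.

    cut_points must be sorted ascending; an interval includes its lower bound.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]


# Hypothetical bins: the study's real cut points are not stated in the slides.
AGE_CUTS = [40, 50, 60]
AGE_LABELS = ["<40", "40-49", "50-59", ">=60"]

ages = [34, 40, 57, 71]  # made-up example values, not patient data
d_age_group = [discretize(age, AGE_CUTS, AGE_LABELS) for age in ages]
print(d_age_group)
```

The same helper could be reused for Tumor size and Time to reoccurrence by supplying other cut points and labels.
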
  12. Classification methods
     • Classification trees: J48, Random Forest, Random tree, tree ensemble
     • Classification rules: ZeroR, OneR, and FURIA
     • Artificial neural networks: Multi-layer Perceptron, SOM
     • Regression: logistic regression
     • Bayes: Naïve Bayes
     • Meta: AdaBoost, Bagging
  13. FURIA results
     • The Fuzzy Unordered Rule Induction Algorithm (FURIA) showed an overall performance improvement after the uncovered rules handling parameter was changed to “vote for the most frequent class”.
     • FURIA algorithm optimization results:

       Algorithm          Accuracy   Sensitivity   Specificity   ROC AUC
       FURIA initial      0.916      0.667         0.958         0.80
       FURIA optimized    0.940      0.667         0.986         0.81
  14. BRCA1 prediction

       Algorithm                  Accuracy   Sensitivity   Specificity   ROC AUC
       J48 (C4.5)                 0.880      0.667         0.915         0.825
       Random Forest              0.855      0.167         0.972         0.774
       Random tree                0.819      0.333         0.901         0.696
       ZeroR                      0.854      0.000         1.000         0.428
       OneR                       0.807      0.000         0.944         0.472
       FURIA                      0.940      0.667         0.986         0.810
       Multilayer perceptron      0.819      0.667         0.845         0.805
       Multilayer perceptronCS    0.916      0.667         0.958         0.865
       Logistic regression        0.795      0.500         0.845         0.738
       AdaBoostM1                 0.892      0.667         0.930         0.790
       Bagging with J48           0.880      0.417         0.958         0.853
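
The accuracy, sensitivity, and specificity values reported for the classifiers are ratios of confusion-matrix counts. The sketch below shows how they relate. The slides do not report the confusion matrices themselves, so the counts used here are hypothetical: they are chosen only to be consistent with the 12 mutation carriers and 71 non-carriers of the study group, and the function name `binary_metrics` is an illustration, not part of any tool named in the slides.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # share of all patients classified correctly
        "sensitivity": tp / (tp + fn),    # share of true positives that were found
        "specificity": tn / (tn + fp),    # share of true negatives that were found
    }


# Hypothetical counts for a tree-like classifier: 8 of 12 carriers found,
# 65 of 71 non-carriers correctly rejected.
tree_like = binary_metrics(tp=8, fn=4, tn=65, fp=6)

# A majority-class baseline (the idea behind ZeroR) labels every patient
# negative, so it has zero sensitivity but still a high accuracy.
majority = binary_metrics(tp=0, fn=12, tn=71, fp=0)

for name, metrics in (("tree-like", tree_like), ("majority", majority)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

With only 14% positive cases, the majority-class baseline already reaches about 0.855 accuracy, which is why sensitivity and ROC AUC are more informative than accuracy alone on this dataset.
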
  15. BC reoccurrence

       Algorithm                  Accuracy   Sensitivity   Specificity   ROC AUC
       J48 (C4.5)                 0.734      0.000         1.000         0.457
       Random Forest              0.710      0.091         0.934         0.516
       Random tree                0.639      0.227         0.787         0.484
       ZeroR                      0.735      0.000         1.000         0.457
       OneR                       0.675      0.000         0.918         0.459
       FURIA                      0.747      0.091         0.984         0.633
       Multilayer perceptron      0.687      0.455         0.770         0.576
       Multilayer perceptronCS    0.687      0.455         0.770         0.596
       Logistic regression        0.639      0.136         0.820         0.508
       AdaBoostM1                 0.663      0.591         0.689         0.675
       Bagging with J48           0.651      0.000         0.885         0.319
  16. Association rules discovery
     • The Apriori, PredictiveApriori, and HotSpot algorithms were used.
     • Generic and class-specific rules were searched with a minimum support in the range [0.01; 0.2] and a confidence greater than 0.75.
     • The association rule search found between 46 thousand and 78 thousand rules.
     • However, no clinically interesting (novel) patterns were found, although the patterns reconfirmed already known features of BC.
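
The support and confidence thresholds mentioned above can be made concrete with a small Python sketch. It is not an implementation of Apriori, PredictiveApriori, or HotSpot; it only computes the two measures for one candidate rule. The records and the item names (`A=yes`, `B=yes`, `class=pos`, `class=neg`) are invented for illustration and are not taken from the study data.

```python
def count(records, itemset):
    """Number of records that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for record in records if itemset <= record)


def support(records, itemset):
    """Fraction of records that contain the itemset."""
    return count(records, itemset) / len(records)


def confidence(records, antecedent, consequent):
    """Fraction of records matching the antecedent that also match the consequent."""
    n_antecedent = count(records, antecedent)
    if n_antecedent == 0:
        return 0.0
    return count(records, set(antecedent) | set(consequent)) / n_antecedent


# Invented toy records, one frozenset of attribute=value items per patient.
records = [
    frozenset({"A=yes", "B=yes", "class=pos"}),
    frozenset({"A=yes", "B=yes", "class=pos"}),
    frozenset({"A=yes", "B=yes", "class=pos"}),
    frozenset({"A=yes", "B=yes", "class=pos"}),
    frozenset({"A=yes", "B=yes", "class=neg"}),
    frozenset({"A=yes", "class=neg"}),
]

MIN_SUPPORT = 0.2      # upper end of the minimum-support range in the slides
MIN_CONFIDENCE = 0.75  # rules must exceed this confidence

rule_support = support(records, {"A=yes", "B=yes", "class=pos"})
rule_confidence = confidence(records, {"A=yes", "B=yes"}, {"class=pos"})
rule_kept = rule_support >= MIN_SUPPORT and rule_confidence > MIN_CONFIDENCE
print(round(rule_support, 3), rule_confidence, rule_kept)
```

Lowering the minimum support toward 0.01 admits many more candidate itemsets, which is one plausible reason a search over a small dataset can still return tens of thousands of rules.
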
  17. Balancing data set results
     • The data set was modified by incrementally equalizing the proportions of the dependent binary (class) attribute values until a 50%/50% distribution was reached.
     • Balancing the data set significantly improved the results of most of the classification algorithms.
     • Meta algorithm Bagging:
       ◦ accuracy 0.90, sensitivity 0.95,
       ◦ specificity 0.85, ROC area value 0.96
     • J48 tree algorithm:
       ◦ accuracy 0.88, sensitivity 0.93,
       ◦ specificity 0.83, ROC area value 0.85
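
The slides do not say how the class proportions were equalized. One common option is random oversampling of the minority class, and the sketch below assumes that technique purely for illustration; the function name `oversample_to_balance` and the placeholder rows are hypothetical. Only the class sizes (12 carriers, 71 non-carriers) come from the slides.

```python
import random
from collections import Counter


def oversample_to_balance(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Placeholder rows: (class label, patient index). Only the class sizes are real.
rows = [("carrier", i) for i in range(12)] + [("non-carrier", i) for i in range(71)]

balanced = oversample_to_balance(rows, label_of=lambda row: row[0])
counts = Counter(row[0] for row in balanced)
print(counts)
```

If duplicated rows end up in both the training and the test folds, the measured performance can be optimistic, so resampling is usually applied to the training part only.
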
  18. BRCA mutation prediction models performance (figure)
  19. BRCA mutation prediction models performance (figure)
  20. BC reoccurrence prediction models performance (figure)
  21. BC reoccurrence prediction models performance (figure)
  22. Conclusions (1)
     • By analyzing breast cancer patient data, we have realized the importance of a systematic approach in the knowledge discovery process.
     • The study showed that forming an optimal dataset is highly important for classification accuracy.
  23. Conclusions (2)
     • Artificial neural networks showed the best performance for BRCA1 gene mutation carrier prediction, but because of their lack of expressivity, the clinicians preferred decision tree and decision rule methods.
     • Our overall conclusion is that, after additional validation on a larger dataset, the created prediction models can be used as clinical decision support systems.
  24. Thank you for your attention
     Olegas Niakšu (Olegas.Niaksu@mii.vu.lt)
     Jurgita Gedminaitė (Jurgita.Gedminaite@lmu.lt)
     Olga Kurasova (Olga.Kurasova@mii.vu.lt)
