Niakšu, Olegas; Kurasova, Olga; Gedminaitė, Jurgita. "Data mining for BRCA1 gene mutation prediction" (VU MII)

Presentation in the section "Data Mining and Its Applications" of the 16th computer scientists' conference "Kompiuterininkų dienos – 2013", Šiauliai, 2013-09-20

Transcript

  • 1. Data mining approach to predict BRCA1 gene mutation
      Olegas Niakšu (1), Jurgita Gedminaitė (2), Olga Kurasova (1)
      (1) Vilnius University, Institute of Mathematics and Informatics
      (2) Lithuanian University of Health Sciences, Oncology Institute
  • 2. Breast Cancer
      ◦ Cancer is the second leading cause of death in developed countries and one of the leading causes worldwide.
      ◦ More than 10 million people are diagnosed with an oncological disease, and about 6 million people die of it each year.
      ◦ Breast cancer is the most common cancer in women worldwide.
      ◦ Survival is a major concern and depends strongly on early diagnosis and an optimal treatment plan.
  • 3. BRCA genes (1)
      ◦ BRCA stands for breast cancer (BC) susceptibility gene. In normal cells, BRCA genes help ensure the stability of the cell’s genetic material (DNA) and help prevent uncontrolled cell growth.
      ◦ Mutations of these genes have been linked to the development of hereditary breast and ovarian cancer.
      ◦ Patients with a pathological mutation of a BRCA gene have a 65% lifetime probability of developing breast cancer.
  • 4. BRCA genes (2)
      ◦ The research addresses mutations of the cancer suppression gene BRCA1.
      ◦ We methodically apply a set of classical and emerging statistical and data mining tools to answer questions formulated by clinicians:
          ▪ what are the prognostic factors of BRCA1 mutation,
          ▪ what are the factors of BC tumour reoccurrence,
          ▪ whether and how BRCA1 mutations influence the course of the disease.
  • 5. Medical data mining
      ◦ The healthcare domain is known for its ontological complexity, its variety of medical data standards, and its variable data quality.
      ◦ Typically, the available medical datasets are fragmented and distributed, so data cleaning and integration are challenging tasks.
      ◦ Other important issues related to the use of personal healthcare data stem from legal, ethical, and social aspects.
  • 6. Research data
      ◦ The original medical research was carried out at the Oncology Institute of the Lithuanian University of Health Sciences from 2010 to 2013.
      ◦ The study group consisted of 83 women diagnosed with stage I-II breast cancer with the following tumour morphology: T1 N0, T2 N0, T3 N0, T1 N1, T2 N1.
      ◦ The research duration was determined by the number of patients and by a disease progress monitoring period of at least 2 years.
  • 7. The full list of attributes of initial dataset (1)

      | #  | Attribute           | Attribute type* |
      |----|---------------------|-----------------|
      | 1  | Age                 | Continuous      |
      | 2  | Histology type      | Nominal (5)     |
      | 3  | cT                  | Nominal (5)     |
      | 3  | pT                  | Nominal (6)     |
      | 4  | Multifocality       | Nominal (2)     |
      | 5  | cN                  | Nominal (3)     |
      | 6  | pN                  | Nominal (2)     |
      | 7  | G                   | Nominal (3)     |
      | 8  | L                   | Nominal (2)     |
      | 9  | V                   | Nominal (2)     |
      | 10 | ER                  | Nominal (4)     |
      | 11 | PR                  | Nominal (4)     |
      | 12 | HER2                | Nominal (2)     |
      | 13 | BRCA mutation       | Nominal (6)     |
      | 14 | Bilateral BC        | Nominal (2)     |
      | 15 | Tumor size          | Continuous      |
      | 16 | CHEK2 mutation      | Nominal (4)     |
      | 17 | Affected l_m number | Continuous      |
  • 8. The full list of attributes of initial dataset (2)

      | #  | Attribute                    | Attribute type* |
      |----|------------------------------|-----------------|
      | 18 | Triple neg. BC               | Nominal (2)     |
      | 19 | Family history type          | Nominal (3)     |
      | 20 | Prostate cancer fam. hist.   | Nominal (2)     |
      | 21 | Pancreatic cancer fam. hist. | Nominal (2)     |
      | 22 | Colorectal cancer fam. hist. | Nominal (2)     |
      | 23 | Surgery type                 | Nominal (4)     |
      | 24 | Chemotherapy type            | Nominal (3)     |
      | 25 | Herceptin                    | Nominal (2)     |
      | 26 | Cht. complications           | Nominal (4)     |
      | 27 | Reoccurrence                 | Nominal (2)     |
      | 28 | Metastases                   | Nominal (2)     |
      | 29 | Time to diseased             | Continuous      |
      | 30 | Is Diseased                  | Nominal (2)     |
      | 31 | Monitoring period            | Continuous      |
      | 32 | Time to reoccurrence         | Continuous      |
      | 33 | Adjuv. ST                    | Nominal (2)     |
      | 34 | Adjuv. HT                    | Nominal (5)     |
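  • 9. The distribution of prediction class attributes

      | Attribute         | Positive value: patients | Positive value: % of whole group | Negative value: patients | Negative value: % of whole group |
      |-------------------|--------------------------|----------------------------------|--------------------------|----------------------------------|
      | BRCA1 mutation    | 12                       | 14%                              | 71                       | 86%                              |
      | BC reoccurrence   | 22                       | 27%                              | 61                       | 73%                              |
      | Deceased patients | 2                        | 2%                               | 81                       | 98%                              |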
  • 10. The distribution of prediction class attributes
  • 11. Data mining approach
      ◦ Step 1: preprocessing. The continuous attributes Age, Tumor size, and Time to reoccurrence were discretized into D_age group, D_tumor size, and D_time to reoccurrence, respectively. The dataset was checked for outliers and missing values.
      ◦ Step 2: dimensionality reduction. Feature subset selection algorithms were used.
      ◦ Step 3: classification. A comparative analysis of the classification techniques was performed in WEKA, Orange, and Tibco Spotfire Mining.
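The discretization in step 1 can be illustrated with a minimal Python sketch. The slides do not state which cut points or tools were used for this step, so the thresholds, the labels, and the helper name `discretize` below are assumptions made purely for illustration.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls into."""
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    # bisect_right returns how many cut points are <= value,
    # which is also the index of the matching interval label.
    return labels[bisect_right(cut_points, value)]


# Hypothetical cut points and labels; the real D_age group bins are not given.
AGE_CUTS = [40, 50, 60]
AGE_LABELS = ["<40", "40-49", "50-59", "60+"]

ages = [35, 40, 49, 50, 72]
d_age_group = [discretize(a, AGE_CUTS, AGE_LABELS) for a in ages]
print(d_age_group)
```

The same helper would apply to Tumor size and Time to reoccurrence with their own, equally hypothetical, cut points.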
  • 12. Classification methods
      ◦ Classification trees: J48, Random Forest, Random tree, tree ensemble
      ◦ Classification rules: ZeroR, OneR, and FURIA
      ◦ Artificial neural networks: Multi-layer Perceptron, SOM
      ◦ Regression: logistic regression
      ◦ Bayes: Naïve Bayes
      ◦ Meta: Ada Boost, Bagging
  • 13. FURIA results
      ◦ The Fuzzy Unordered Rule Induction Algorithm (FURIA) showed an overall performance improvement after the uncovered-rules handling parameter was changed to “vote for the most frequent class”.
      ◦ FURIA algorithm optimization results:

      | Algorithm       | Accuracy | Sensitivity | Specificity | ROC AUC |
      |-----------------|----------|-------------|-------------|---------|
      | Furia initial   | 0.916    | 0.667       | 0.958       | 0.80    |
      | Furia optimized | 0.940    | 0.667       | 0.986       | 0.81    |
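The idea behind the "vote for the most frequent class" setting can be sketched in a few lines of Python. This is not FURIA itself: crisp, ordered rules stand in for its fuzzy, unordered ones, and the rule, the attribute name `age`, and the function `classify` are hypothetical. The sketch only shows what happens to an instance that no rule covers.

```python
from collections import Counter


def classify(instance, rules, training_labels):
    """Return the class of the first covering rule, else the most frequent class."""
    for condition, predicted_class in rules:
        if condition(instance):
            return predicted_class
    # No rule covers the instance: fall back to the majority class
    # observed in the training labels.
    return Counter(training_labels).most_common(1)[0][0]


# Hypothetical single rule and a label distribution of 71 negative / 12 positive.
rules = [(lambda x: x["age"] < 40, "positive")]
training_labels = ["negative"] * 71 + ["positive"] * 12

print(classify({"age": 35}, rules, training_labels))  # covered by the rule
print(classify({"age": 60}, rules, training_labels))  # uncovered, majority vote
```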
  • 14. BRCA1 prediction

      | Algorithm               | Accuracy | Sensitivity | Specificity | ROC AUC |
      |-------------------------|----------|-------------|-------------|---------|
      | J48 (C4.5)              | 0.880    | 0.667       | 0.915       | 0.825   |
      | Random Forest           | 0.855    | 0.167       | 0.972       | 0.774   |
      | Random tree             | 0.819    | 0.333       | 0.901       | 0.696   |
      | ZeroR                   | 0.854    | 0.000       | 1.000       | 0.428   |
      | OneR                    | 0.807    | 0.000       | 0.944       | 0.472   |
      | Furia                   | 0.940    | 0.667       | 0.986       | 0.810   |
      | Multilayer perceptron   | 0.819    | 0.667       | 0.845       | 0.805   |
      | Multilayer perceptronCS | 0.916    | 0.667       | 0.958       | 0.865   |
      | Logistic regression     | 0.795    | 0.500       | 0.845       | 0.738   |
      | AdaBoostM1              | 0.892    | 0.667       | 0.930       | 0.790   |
      | Bagging with J48        | 0.880    | 0.417       | 0.958       | 0.853   |
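The three count-based measures reported for each algorithm follow from a confusion matrix. As a minimal sketch, the Python snippet below computes them for the ZeroR baseline, assuming the class distribution from slide 9 (12 mutation carriers, 71 non-carriers) and assuming ZeroR labels every patient with the majority class. The function name `binary_metrics` is illustrative and not taken from the tools used in the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# ZeroR predicts "no mutation" for everyone: all 12 carriers are missed,
# all 71 non-carriers are classified correctly.
zero_r = binary_metrics(tp=0, fn=12, tn=71, fp=0)
for name, value in zero_r.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the accuracy is 71/83, about 0.855, with sensitivity 0 and specificity 1, which is in line with the ZeroR figures reported on the slide. It also shows why accuracy alone is a weak measure on such an imbalanced class.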
  • 15. BC reoccurrence

      | Algorithm               | Accuracy | Sensitivity | Specificity | ROC AUC |
      |-------------------------|----------|-------------|-------------|---------|
      | J48 (C4.5)              | 0.734    | 0.000       | 1.000       | 0.457   |
      | Random Forest           | 0.710    | 0.091       | 0.934       | 0.516   |
      | Random tree             | 0.639    | 0.227       | 0.787       | 0.484   |
      | ZeroR                   | 0.735    | 0.000       | 1.000       | 0.457   |
      | OneR                    | 0.675    | 0.000       | 0.918       | 0.459   |
      | Furia                   | 0.747    | 0.091       | 0.984       | 0.633   |
      | Multilayer perceptron   | 0.687    | 0.455       | 0.770       | 0.576   |
      | Multilayer perceptronCS | 0.687    | 0.455       | 0.770       | 0.596   |
      | Logistic regression     | 0.639    | 0.136       | 0.820       | 0.508   |
      | AdaBoostM1              | 0.663    | 0.591       | 0.689       | 0.675   |
      | Bagging with J48        | 0.651    | 0.000       | 0.885       | 0.319   |
  • 16. Association rules discovery
      ◦ The Apriori, PredictiveApriori, and HotSpot algorithms were used.
      ◦ Generic and class-specific rules were mined with a minimum support in the range [0.01; 0.2] and a confidence greater than 0.75.
      ◦ The association rule search found between 46 thousand and 78 thousand rules.
      ◦ However, no clinically interesting (novel) patterns were found, although the patterns reconfirmed already known features of BC.
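Support and confidence, the two thresholds mentioned above, can be made concrete with a small Python sketch. The records and item names are toy placeholders rather than study data, and the helper names `support` and `confidence` are illustrative. Apriori itself additionally prunes the candidate itemsets, which this sketch does not do.

```python
def support(records, itemset):
    """Fraction of records that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    s_antecedent = support(records, antecedent)
    if s_antecedent == 0:
        return 0.0
    return support(records, set(antecedent) | set(consequent)) / s_antecedent


# Toy records; each record is a set of abstract items.
records = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b"}]

MIN_SUPPORT, MIN_CONFIDENCE = 0.2, 0.75
for antecedent, consequent in [({"a", "b"}, {"c"}), ({"c"}, {"a"})]:
    s = support(records, antecedent | consequent)
    c = confidence(records, antecedent, consequent)
    kept = s >= MIN_SUPPORT and c > MIN_CONFIDENCE
    print(sorted(antecedent), "->", sorted(consequent), round(s, 2), round(c, 2), kept)
```

With these toy records the rule {a, b} -> {c} has enough support but a confidence of about 0.67, so it is rejected, while {c} -> {a} passes both thresholds.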
  • 17. Balancing data set results
      ◦ The data set was modified by incrementally equalizing the proportions of the binary dependent (class) attribute values until a 50%/50% distribution was reached.
      ◦ Balancing the data set significantly improved the results of most of the classification algorithms.
      ◦ Meta algorithm Bagging:
          ▪ accuracy: 0.90, sensitivity: 0.95,
          ▪ specificity: 0.85, ROC area value: 0.96
      ◦ J48 tree algorithm:
          ▪ accuracy: 0.88, sensitivity: 0.93,
          ▪ specificity: 0.83, ROC area value: 0.85
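The slides do not say how the class proportions were equalized. One common option is random oversampling of the minority class, sketched below in Python as an assumption rather than as the authors' procedure. The function name `oversample_minority`, the record layout, and the field `brca1` are hypothetical, and the sketch jumps straight to a 50%/50% split instead of approaching it incrementally.

```python
import random


def oversample_minority(rows, is_positive, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal."""
    rng = random.Random(seed)
    positives = [r for r in rows if is_positive(r)]
    negatives = [r for r in rows if not is_positive(r)]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("cannot balance a data set with an empty class")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


# Synthetic records mirroring only the 12 / 71 class split from slide 9.
rows = [{"id": i, "brca1": i < 12} for i in range(83)]
balanced = oversample_minority(rows, lambda r: r["brca1"])
n_pos = sum(r["brca1"] for r in balanced)
print(n_pos, len(balanced) - n_pos)
```

When resampling of this kind is combined with cross-validation, it is usually applied to the training folds only, so that duplicated records do not end up in both the training and the test data.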
  • 18. BRCA mutation prediction models performance
  • 19. BRCA mutation prediction models performance
  • 20. BC reoccurrence prediction models performance
  • 21. BC reoccurrence prediction models performance
  • 22. Conclusions (1)
      ◦ Analyzing breast cancer patient data showed us the importance of a systematic approach to the knowledge discovery process.
      ◦ The study showed that forming an optimal dataset is highly important for classification accuracy.
  • 23. Conclusions (2)
      ◦ Artificial neural networks showed the best performance in predicting BRCA1 gene mutation carriers, but because of their lack of expressivity, the clinicians preferred decision tree and decision rule methods.
      ◦ Our overall conclusion is that, after additional validation on a larger dataset, the created prediction models can be used as clinical decision support systems.
  • 24. Thank you for your attention
      Olegas Niakšu (Olegas.Niaksu@mii.vu.lt)
      Jurgita Gedminaitė (Jurgita.Gedminaite@lmu.lt)
      Olga Kurasova (Olga.Kurasova@mii.vu.lt)