

  1. 1. Data Mining in Genomics: the dawn of personalized medicine. Gregory Piatetsky-Shapiro, KDnuggets. Connecticut College, October 15, 2003
  2. 2. Overview <ul><li>Data Mining and Knowledge Discovery </li></ul><ul><li>Genomics and Microarrays </li></ul><ul><li>Microarray Data Mining </li></ul>
  3. 3. Trends leading to Data Flood <ul><li>More data is generated: </li></ul><ul><ul><li>Bank, telecom, other business transactions ... </li></ul></ul><ul><ul><li>Scientific Data: astronomy, biology, etc </li></ul></ul><ul><ul><li>Web, text, and e-commerce </li></ul></ul><ul><li>More data is captured: </li></ul><ul><ul><li>Storage technology faster and cheaper </li></ul></ul><ul><ul><li>DBMS capable of handling bigger DB </li></ul></ul>
  4. 4. Knowledge Discovery Process [Diagram: Raw Data → Integration → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge → Understanding]
  5. 5. Major Data Mining Tasks <ul><li>Classification: predicting an item class </li></ul><ul><li>Clustering: finding clusters in data </li></ul><ul><li>Associations: e.g. A & B & C occur frequently </li></ul><ul><li>Visualization: to facilitate human discovery </li></ul><ul><li>Summarization: describing a group </li></ul><ul><li>Estimation: predicting a continuous value </li></ul><ul><li>Deviation Detection: finding changes </li></ul><ul><li>Link Analysis: finding relationships </li></ul>
  6. 6. Major Application Areas for Data Mining Solutions <ul><li>Advertising </li></ul><ul><li>Bioinformatics </li></ul><ul><li>Customer Relationship Management (CRM) </li></ul><ul><li>Database Marketing </li></ul><ul><li>Fraud Detection </li></ul><ul><li>eCommerce </li></ul><ul><li>Health Care </li></ul><ul><li>Investment/Securities </li></ul><ul><li>Manufacturing, Process Control </li></ul><ul><li>Sports and Entertainment </li></ul><ul><li>Telecommunications </li></ul><ul><li>Web </li></ul>
  7. 7. Genome, DNA & Gene Expression <ul><li>An organism’s genome is the “program” for making the organism, encoded in DNA </li></ul><ul><ul><li>Human DNA has about 30,000-35,000 genes </li></ul></ul><ul><ul><li>A gene is a segment of DNA that specifies how to make a protein </li></ul></ul><ul><li>Cells are different because of differential gene expression </li></ul><ul><ul><li>About 40% of human genes are expressed at any one time </li></ul></ul><ul><ul><li>Microarray devices measure gene expression </li></ul></ul>
  8. 8. Molecular Biology Overview [Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Protein, Gene expression] Graphics courtesy of the National Human Genome Research Institute
  9. 9. Affymetrix Microarrays [Figure labels: 1.28 cm, 50 µm] ~10^7 oligonucleotides; half Perfectly Match mRNA (PM), half have one Mismatch (MM). Gene expression is computed from PM and MM.
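The slide says only that gene expression is "computed from PM and MM" and does not give the formula. As a rough illustration, the sketch below uses a plain average of the PM minus MM differences over a gene's probe pairs. That formula is an assumption made for this example, the function name `average_difference` is hypothetical, and the intensities in the test are made up.

```python
from statistics import mean


def average_difference(pm, mm):
    """Average of (PM - MM) over the probe pairs of one gene.

    pm, mm: equal-length sequences of perfect-match and mismatch
    intensities. This simple average is an assumed formula, not
    necessarily the one the scanner software applies.
    """
    if len(pm) != len(mm) or len(pm) == 0:
        raise ValueError("pm and mm must be non-empty and the same length")
    return mean(p - m for p, m in zip(pm, mm))
```

Under this formula the result is negative whenever the mismatch probes are brighter on average than the perfect-match probes.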
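  10. 10. Affymetrix Microarray Raw Image [Figure: scanner, enlarged section of raw image] Raw data (Gene: Value): <ul><li>D26528_at: 193 </li></ul><ul><li>D26561_cds1_at: -70 </li></ul><ul><li>D26561_cds2_at: 144 </li></ul><ul><li>D26561_cds3_at: 33 </li></ul><ul><li>D26579_at: 318 </li></ul><ul><li>D26598_at: 1764 </li></ul><ul><li>D26599_at: 1537 </li></ul><ul><li>D26600_at: 1204 </li></ul><ul><li>D28114_at: 707 </li></ul>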
  11. 11. Microarray Potential Applications <ul><li>New and better molecular diagnostics </li></ul><ul><li>New molecular targets for therapy </li></ul><ul><ul><li>few new drugs, large pipeline, … </li></ul></ul><ul><li>Outcome depends on genetic signature </li></ul><ul><ul><li>best treatment? </li></ul></ul><ul><li>Fundamental Biological Discovery </li></ul><ul><ul><li>finding and refining biological pathways </li></ul></ul><ul><li>Personalized medicine ?! </li></ul>
  12. 12. Microarray Data Mining Challenges <ul><li>Avoiding false positives, due to </li></ul><ul><ul><li>too few records (samples), usually < 100 </li></ul></ul><ul><ul><li>too many columns (genes), usually > 1,000 </li></ul></ul><ul><li>Model needs to be robust in presence of noise </li></ul><ul><li>For reliability need large gene sets; for diagnostics or drug targets, need small gene sets </li></ul><ul><li>Estimate class probability </li></ul><ul><li>Model needs to be explainable to biologists </li></ul>
  13. 13. False Positives in Astronomy cartoon used with permission
  14. 14. CATs: Clementine Application Templates <ul><li>CATs - examples of complete data mining processes </li></ul><ul><li>Microarray CAT </li></ul>[Diagram panels: Preparation, 2-Class, Multi-Class, Clustering]
  15. 15. Key Ideas <ul><li>Capture the complete process </li></ul><ul><li>X-validation loop with feature selection inside </li></ul><ul><li>Randomization to select significant genes </li></ul><ul><li>Internal iterative feature selection loop </li></ul><ul><li>For each class, separate selection of optimal gene sets </li></ul><ul><li>Neural nets – robust in presence of noise </li></ul><ul><li>Bagging of neural nets </li></ul>
  16. 16. Microarray Classification [Diagram labels: Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation]
  17. 17. Classification: External X-val [Diagram labels: Gene Data, Train, Final Test, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation, Final Model, Final Results]
  18. 18. Measuring false positives with randomization [Figure: one gene's values (178, 105, 4174, 7133) shown with the true Class labels (1, 1, 2, 2) and with randomized labels, Rand Class (2, 1, 1, 2)] Randomize 500 times. Bottom 1% T-value = -2.08. Select potentially interesting genes at 1%.
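The randomization idea on this slide can be sketched as follows: shuffle the class labels many times, recompute every gene's T-value under each shuffle, and read the cutoffs off the tails of the resulting null distribution. This is a minimal sketch, not the Clementine stream itself. The Welch form of the T-value, the pooling of T-values across genes, and the names `t_value` and `null_thresholds` are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_value(values, labels, cls=1):
    """Welch t-statistic for one gene: samples in `cls` versus all others.

    Each group needs at least two samples, otherwise `variance` raises.
    """
    a = [v for v, c in zip(values, labels) if c == cls]
    b = [v for v, c in zip(values, labels) if c != cls]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def null_thresholds(genes, labels, n_perm=500, pct=0.01, seed=0):
    """Lower and upper T-value cutoffs from label randomization.

    genes: list of rows, one row of expression values per gene.
    The labels are shuffled n_perm times; the T-values of all genes
    under all shuffles are pooled, and the bottom and top `pct`
    fractions of that pool give the two cutoffs.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling keeps the class sizes unchanged
        null.extend(t_value(row, shuffled) for row in genes)
    null.sort()
    k = int(pct * len(null))
    return null[k], null[len(null) - 1 - k]
```

A gene whose T-value on the real labels falls outside the two cutoffs is then a candidate "potentially interesting" gene at that level.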
  19. 19. Gene Reduction improves Classification <ul><li>Most learning algorithms look for non-linear combinations of features, so they can easily find many spurious combinations given a small number of records and a large number of genes </li></ul><ul><li>Classification accuracy improves if we first reduce the number of genes by a linear method, e.g. T-values of mean difference </li></ul><ul><li>Heuristic: select an equal number of genes from each class </li></ul><ul><li>Then apply a favorite machine learning algorithm </li></ul>
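The "equal number of genes from each class" heuristic might look like the sketch below, which ranks genes by a one-vs-rest T-value for each class and keeps the top k per class. The Welch form of the T-value and the names `t_value` and `top_genes_per_class` are assumptions made for illustration, and the data in the test is made up.

```python
from math import sqrt
from statistics import mean, variance


def t_value(values, labels, cls):
    """Welch t-statistic for one gene: samples in `cls` versus all others."""
    a = [v for v, c in zip(values, labels) if c == cls]
    b = [v for v, c in zip(values, labels) if c != cls]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_genes_per_class(genes, labels, k):
    """Return {class: indices of the k genes with the highest T-value}.

    genes: list of rows, one row of expression values per gene.
    A high one-vs-rest T-value means the gene is expressed more
    strongly in that class than in the remaining samples.
    """
    selected = {}
    for cls in sorted(set(labels)):
        scored = [(t_value(row, labels, cls), i) for i, row in enumerate(genes)]
        scored.sort(reverse=True)
        selected[cls] = [i for _, i in scored[:k]]
    return selected
```

The union of the per-class lists is the reduced gene set that would be passed on to the learning algorithm.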
  20. 20. Iterative Wrapper approach to selecting the best gene set <ul><li>Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with cross-validation. </li></ul><ul><li>Heuristic 1: evaluate errors from each class; for each class, select the number of genes that minimizes the error for that class </li></ul><ul><li>For randomized algorithms, average 10+ cross-validation runs! </li></ul><ul><li>Select gene set with lowest average error </li></ul>
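A minimal sketch of the wrapper loop, under stated assumptions: a nearest-centroid classifier stands in for the neural nets, the same gene count k is used for every class (the slide's Heuristic 1 tunes it per class), and the folds are not stratified. All function names are hypothetical. The point it illustrates is that gene selection is repeated inside every fold, using the training part only.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_value(values, labels, cls):
    """Welch t-statistic: samples in `cls` versus all others."""
    a = [v for v, c in zip(values, labels) if c == cls]
    b = [v for v, c in zip(values, labels) if c != cls]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_genes(X, y, k):
    """Union of the k top-T-value genes of each class. X is samples x genes."""
    chosen = set()
    for cls in sorted(set(y)):
        ranked = sorted(range(len(X[0])),
                        key=lambda g: t_value([row[g] for row in X], y, cls),
                        reverse=True)
        chosen.update(ranked[:k])
    return sorted(chosen)


def centroid_predict(X_train, y_train, genes, sample):
    """Stand-in classifier: nearest class centroid on the selected genes."""
    best, best_dist = None, float("inf")
    for cls in sorted(set(y_train)):
        rows = [r for r, c in zip(X_train, y_train) if c == cls]
        dist = sum((sample[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best, best_dist = cls, dist
    return best


def cv_error(X, y, k, n_folds=5, seed=0):
    """Cross-validated error rate with gene selection inside each fold."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        train = [i for i in idx if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(X_tr, y_tr, k)  # training samples only
        errors += sum(centroid_predict(X_tr, y_tr, genes, X[i]) != y[i]
                      for i in test)
    return errors / len(y)


def best_gene_count(X, y, candidates=(1, 2, 3, 5, 10), n_runs=10):
    """Average several cross-validation runs per k; return the best k."""
    avg = {k: mean(cv_error(X, y, k, seed=s) for s in range(n_runs))
           for k in candidates}
    return min(avg, key=avg.get), avg
```

Selecting genes on the full data set before cross-validating would leak information from the held-out samples into the error estimate, which is why `select_genes` is called inside the fold loop.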
  21. 21. Clementine stream for subset selection by x-validation
  22. 22. Microarrays: ALL/AML Example <ul><li>Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999 </li></ul><ul><ul><li>72 examples (38 train, 34 test), about 7,000 genes </li></ul></ul><ul><ul><li>well-studied (CAMDA-2000), good test example </li></ul></ul>[Images: ALL, AML] Visually similar, but genetically very different
  23. 23. Gene subset selection: one X-validation Single Cross-Validation run
  24. 24. Gene subset selection: multiple cross-validation runs. For ALL/AML data, 10 genes per class had the lowest error (<1%). [Plot: the point in the center is the average error from 10 cross-validation runs; bars indicate 1 standard deviation above and below]
  25. 25. ALL/AML: Results on the test data <ul><li>Genes selected and model trained on Train set ONLY! </li></ul><ul><li>Best Net with 10 top genes per class (20 overall) was applied to the test data (34 samples): </li></ul><ul><ul><li>33 correct predictions (97% accuracy), </li></ul></ul><ul><ul><li>1 error on sample 66 </li></ul></ul><ul><ul><ul><li>Actual Class AML, Net prediction: ALL </li></ul></ul></ul><ul><ul><ul><li>other methods consistently misclassify sample 66 -- misclassified by a pathologist? </li></ul></ul></ul>
  26. 26. Pediatric Brain Tumour Data <ul><li>92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital </li></ul><ul><li>Outer cross-validation with gene selection inside the loop </li></ul><ul><li>Ranking by absolute T-test value (selects top positive and negative genes) </li></ul><ul><li>Select best genes by adjusted error for each class </li></ul><ul><li>Bagging of 100 neural nets </li></ul>
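Bagging itself can be sketched independently of the base learner: train each model on a bootstrap resample of the training set, then let the models vote. In the sketch below a nearest-centroid model stands in for the neural nets, and the vote fraction is used as a confidence. Both are assumptions made for illustration, as are the function names and the toy data in the test.

```python
import random
from collections import Counter
from statistics import mean


def train_centroids(X, y):
    """Stand-in base learner: one mean vector per class."""
    model = {}
    for cls in sorted(set(y)):
        rows = [r for r, c in zip(X, y) if c == cls]
        model[cls] = [mean(col) for col in zip(*rows)]
    return model


def predict(model, sample):
    """Class whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda cls: sum((s - m) ** 2
                                          for s, m in zip(sample, model[cls])))


def bag(X, y, n_models=100, seed=0):
    """Train n_models base learners, each on a bootstrap resample."""
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_centroids([X[i] for i in idx],
                                      [y[i] for i in idx]))
    return models


def bagged_predict(models, sample):
    """Majority vote and the fraction of models that agree with it."""
    votes = Counter(predict(m, sample) for m in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)
```

A sample that the bag assigns with high agreement to a class other than its label is the kind of case worth re-checking for a labeling error.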
  27. 27. Selecting Best Gene Set <ul><li>Minimizing Combined Error for all classes is not optimal </li></ul>Average, high and low error rate for all classes
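  28. 28. Error rates for each class [Plot: Error rate vs. Genes per Class]
  29. 29. Evaluating One Network. Error rate by class, averaged over 100 networks: <ul><li>MED: 2.1% </li></ul><ul><li>MGL: 17% </li></ul><ul><li>JPA: 19% </li></ul><ul><li>RHB: 24% </li></ul><ul><li>EPD: 9% </li></ul><ul><li>*ALL*: 8.3% </li></ul>
  30. 30. Bagging 100 Networks <ul><li>Note: suspected error on one sample (labeled as MED but consistently classified as RHB) </li></ul>Results by class (Individual Error Rate; Bag Error rate; Bag Avg Conf): <ul><li>MED: 2.1%; 2% (0)*; 98% </li></ul><ul><li>MGL: 17%; 10%; 83% </li></ul><ul><li>JPA: 19%; 0; 81% </li></ul><ul><li>RHB: 24%; 11%; 76% </li></ul><ul><li>EPD: 9%; 0; 91% </li></ul><ul><li>*ALL*: 8.3%; 3% (2)*; 92% </li></ul>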
  31. 31. AF1q: New Marker for Medulloblastoma? <ul><li>AF1Q ALL1-fused gene from chromosome 1q </li></ul><ul><li>transmembrane protein </li></ul><ul><li>Related to leukemia (3 PUBMED entries) but not to Medulloblastoma </li></ul>
  32. 32. Future directions for Microarray Analysis <ul><li>Algorithms optimized for small samples </li></ul><ul><li>Integration with other data </li></ul><ul><ul><li>biological networks </li></ul></ul><ul><ul><li>medical text </li></ul></ul><ul><ul><li>protein data </li></ul></ul><ul><li>Cost-sensitive classification algorithms </li></ul><ul><ul><li>error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc. </li></ul></ul>
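Cost-sensitive classification can be illustrated with a small decision rule: given estimated class probabilities and a cost for each (predicted, actual) pair, predict the class with the lowest expected cost. The function name `min_cost_class` is hypothetical, and the probabilities and costs in the usage note and test are made-up numbers.

```python
def min_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs: {actual_class: estimated probability}
    cost:  {predicted_class: {actual_class: cost of that outcome}}
    Returns (best prediction, {prediction: expected cost}).
    """
    expected = {
        pred: sum(probs[actual] * cost[pred][actual] for actual in probs)
        for pred in cost
    }
    return min(expected, key=expected.get), expected
```

For example, with a 30% estimated chance of a treatable cancer, a cost of 10 for missing it and a cost of 1 for a false alarm, the rule predicts "cancer" even though "healthy" is the more probable class.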
  33. 33. Acknowledgements <ul><li>Eric Bremer, Children’s Hospital (Chicago) & Northwestern U. </li></ul><ul><li>Greg Cooper, U. Pittsburgh </li></ul><ul><li>Tom Khabaza, SPSS </li></ul><ul><li>Sridhar Ramaswamy, MIT/Whitehead Institute </li></ul><ul><li>Pablo Tamayo, MIT/Whitehead Institute </li></ul>
  34. 34. Thank you <ul><li>Further resources on Data Mining: </li></ul><ul><li>Microarrays: </li></ul><ul><li> </li></ul><ul><li>Contact: </li></ul><ul><li>Gregory Piatetsky-Shapiro: </li></ul>