Introduction to Data Mining / Bioinformatics

This presentation, prepared by Gerry Lushington, is a friendly introduction to the basics of data mining, as applied to biological problems. The intended audience is students and scientific researchers from a non-computational background.

  1. Introduction to Bioinformatics: Mining Your Data. Gerry Lushington, Lushington in Silico, modeling / informatics consultant.
  2. What is Data Mining? The use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties. It is applicable across many disciplines: molecular bioinformatics, medical informatics, health informatics, and biodiversity informatics.
  3. Example Applications: Find relationships between convenient observables and important outcomes. Observables: (a) relative gene expression data; (b) relative protein abundance data; (c) relative lipid and metabolite profiles; (d) glycosylation variants; (e) SNPs and alleles; (f) cellular traits; (g) organism traits; (h) behavioral traits; (i) case history. Outcomes: (1) disease susceptibility; (2) drug efficacy; (3) toxin susceptibility; (4) immunity; (5) genetic disorders; (6) microbial virulence; (7) species adaptive success; (8) species complementarity.
  4. Goals for this lecture: Focus on data mining: how to approach your data and use it to understand biology. Get an overview of available techniques and an understanding of model validation. Try to think about data you've seen: what techniques might be useful? Don't worry about grasping everything: the K-INBRE Bioinformatics Core is here to help!
  5. Basic Data Mining: Find relationships between (a) easy-to-measure properties and (b) important (but harder to measure) outcomes or attributes. Use these relationships to understand the conceptual basis for the outcomes in (b), and to predict outcomes in new cases where the outcome has not yet been measured.
  6. Basic Data Mining: simple measurables.
  7. Basic Data Mining: general observation. [Figure: samples labeled Unhappy vs. Happy.]
  8. Basic Data Mining: relationship #1. Rule: blue = happy, red = unhappy. Accuracy = 12/20 = 60%.
  9. Basic Data Mining: relationship #2. Rule: blue or big red = happy, little red = unhappy. Accuracy = 17/20 = 85%.
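
To make the scoring concrete, here is a minimal Python sketch of applying such a hand-built rule and computing its accuracy. The colors, sizes, and outcomes below are invented for illustration, not the slide's actual 20 samples:

    import numpy as np

    # Invented toy samples: each has a color and a size, with a known
    # outcome (1 = happy, 0 = unhappy).
    colors = np.array(["blue", "red", "red", "blue", "red", "red", "blue", "red"])
    sizes = np.array(["big", "big", "little", "little", "big", "little", "big", "big"])
    happy = np.array([1, 1, 0, 1, 1, 0, 1, 0])

    # Rule #2 from the slide: blue or big red -> happy; little red -> unhappy.
    predicted = ((colors == "blue") | ((colors == "red") & (sizes == "big"))).astype(int)

    accuracy = (predicted == happy).mean()
    print(f"accuracy = {accuracy:.0%}")
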
  10. Data Mining: procedure. (1) Data Acquisition; (2) Data Preprocessing; (3) Feature Selection; (4) Classification; (5) Validation; (6) Prediction & Iteration.
  11. Data Mining procedure, step 1 (Data Acquisition): [Figure: an instrument trace; do we record peak heights, peak positions, or both?] Key issues include: (a) format conversion from the instrument; (b) any necessary mathematical manipulations (e.g., Density = M/V).
  12. Data Mining procedure, step 2 (Data Preprocessing). Key issues include: (a) normalization to account for experimental bias; (b) statistical detection of flagrant outliers.
  13. Data Preprocessing, continued: use controls to scale the data. [Figure: four batches of readings, each with a control C followed by samples 1, 2, 3.]
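
A minimal sketch of control-based scaling, assuming the layout shown on the slide (each batch carries its own control in the first position); the numbers are invented:

    import numpy as np

    # Hypothetical raw readings: four batches, each holding a control
    # (column 0) followed by three samples, as in the slide's "C 1 2 3" layout.
    batches = np.array([
        [2.0, 4.1, 3.0, 5.2],
        [1.0, 2.2, 1.4, 2.7],
        [4.0, 8.3, 5.8, 10.9],
        [2.5, 5.0, 3.6, 6.4],
    ])

    # Divide each batch by its own control so that readings become comparable
    # across batches despite batch-to-batch differences in overall intensity.
    normalized = batches / batches[:, [0]]
    print(normalized.round(2))  # each control column is now exactly 1.0
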
  14. Data Preprocessing, continued: judging normalization and outliers is subjective and requires experience and/or domain knowledge, though a statistical screen can still flag candidates for review, as sketched below.
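
One possible screen for flagrant outliers, using a median-based modified z-score; the slides do not prescribe a particular test, so this is an illustrative choice on invented data:

    import numpy as np

    # Hypothetical measurements of one feature across samples; the last value
    # is the kind of flagrant outlier the slide suggests screening for.
    values = np.array([5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 12.7])

    # Median-based modified z-score (robust to the outlier itself inflating
    # the spread); scores above ~3.5 flag candidates for human review.
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    modified_z = 0.6745 * (values - median) / mad
    print(values[np.abs(modified_z) > 3.5])  # -> [12.7]
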
  15. Data Mining procedure, step 3 (Feature Selection): which of the many measurable properties relate to the outcome of interest? Criteria: (a) intrinsic information content; (b) redundancy relative to other properties; (c) correlation with the target attribute; (d) iterative model training.
  16. Feature Selection, illustrated: [Figure: bar profiles for features 1-4 across samples; two uninformative features are crossed out.]
  17. Feature Selection, illustrated: [Figure: the same bar profiles with one redundant feature crossed out.]
  18. Feature Selection, illustrated: [Figure: feature profiles compared against the target attribute's profile; one uncorrelated feature is crossed out.]
  19. Feature Selection (d), iterative model training: train preliminary models based on random sets of properties; evaluate the models according to correlative or predictive performance; experiment with promising sets, adding or deleting descriptors to gauge the impact on performance. A sketch of the redundancy and correlation criteria follows below.
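
A small numpy sketch of criteria (b) and (c): the feature matrix is synthetic, built so that one feature is informative and another is a near-duplicate of it:

    import numpy as np

    # Hypothetical feature matrix (40 samples x 6 features) and binary outcome;
    # feature 2 carries real signal and feature 3 duplicates it.
    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, size=40)
    X = rng.normal(size=(40, 6))
    X[:, 2] += 1.5 * y                              # informative feature
    X[:, 3] = X[:, 2] + 0.05 * rng.normal(size=40)  # redundant copy

    # Criterion (c): correlation of each feature with the target attribute.
    corr_with_y = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(6)])
    print("correlation with outcome:", corr_with_y.round(2))

    # Criterion (b): redundancy; features 2 and 3 are nearly collinear,
    # so keeping both adds little information.
    print("corr(feature 2, feature 3):", round(np.corrcoef(X[:, 2], X[:, 3])[0, 1], 2))
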
  20. Data Mining procedure, step 4 (Classification): predict which sample will have which outcome. Approaches: (a) correlative methods; (b) distance-based clustering; (c) boundary detection; (d) rule learning; (e) weighted probability.
  21. Classification (a), correlative methods: [Figure: scatter of outcome y against measurable x with a fitted trend line.]
  22. Classification (a), continued: [Figure: the fitted y(x) trend line with a ±n band around a cutoff in y; samples predicted above the cutoff are called YES, the rest NO.]
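
A minimal sketch of the correlative approach: fit a line to (x, y) pairs and call new samples YES or NO by thresholding the predicted y. The data and the cutoff of 12.0 are invented for illustration:

    import numpy as np

    # Hypothetical single measurable x with a continuous response y.
    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=30)
    y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)

    # Fit the correlative model y = m*x + b by least squares.
    m, b = np.polyfit(x, y, deg=1)

    # Classify a new sample by thresholding the predicted y, as in the
    # slide's YES/NO cut; the cutoff value 12.0 here is arbitrary.
    x_new = 6.3
    print("YES" if m * x_new + b > 12.0 else "NO")
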
  23. Classification (b), distance-based clustering: [Figure: samples plotted in the (x1, x2) feature plane.]
  24. Classification (b), continued: [Figure: four clusters y1-y4 in the (x1, x2) plane.] y1 = resistant to types I & II diabetes; y2 = susceptible only to type II; y3 = susceptible only to type I; y4 = susceptible to types I & II.
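
A sketch of distance-based clustering using k-means from scikit-learn (an illustrative tool choice; the slides name no software). The four synthetic clusters stand in for the y1-y4 groups:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical (x1, x2) measurements for a cohort; on the slide the four
    # clusters map to the y1-y4 diabetes susceptibility groups.
    rng = np.random.default_rng(3)
    centers = np.array([[1, 1], [1, 5], [5, 1], [5, 5]])
    X = np.vstack([c + rng.normal(scale=0.4, size=(25, 2)) for c in centers])

    # Distance-based clustering: group samples by proximity in feature space.
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    print(np.bincount(km.labels_))  # roughly 25 samples per cluster
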
  25. Classification (c), boundary detection: [Figure: a learned boundary in the (x1, x2) plane separating samples resistant to type I from samples susceptible to type I.]
  26. Classification (d), rule learning: [Figure: thresholds a and b on x2 and c on x1 partition the (x1, x2) plane.] If x1 < c and x2 > a, then resistant; else if x1 > c and x2 > b, then resistant; else susceptible (E = 9).
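
Rule learning of this if/else kind can be illustrated with a shallow decision tree (again an illustrative tool choice). The synthetic labels below are generated from axis-aligned thresholds like the slide's a, b, c:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical (x1, x2) data labeled 1 = resistant, 0 = susceptible.
    rng = np.random.default_rng(4)
    X = rng.uniform(0, 10, size=(200, 2))
    y = (((X[:, 0] < 4) & (X[:, 1] > 3)) |
         ((X[:, 0] >= 4) & (X[:, 1] > 7))).astype(int)

    # Rule learning: a shallow decision tree recovers if/else threshold rules.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["x1", "x2"]))
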
  27. Classification (e), weighted probability: [Figure: neither x1 alone (cutoff a) nor x2 alone (cutoff b) cleanly separates resistant from susceptible samples, but the weighted combination F·x1 - G·x2 does at cutoff c.] If F·x1 - G·x2 < c, then resistant; else susceptible.
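
A weighted rule of the form F·x1 - G·x2 < c can be fit with logistic regression, one standard way to learn such a linear combination (the slides do not specify a method); the data are synthetic:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data where neither x1 nor x2 alone separates the classes,
    # but the weighted combination 2.0*x1 - 1.5*x2 does.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 2))
    y = (2.0 * X[:, 0] - 1.5 * X[:, 1] < 0.3).astype(int)  # 1 = resistant

    # Logistic regression fits a weight for each feature plus an intercept,
    # recovering the slide's linear rule (up to sign and scale).
    clf = LogisticRegression().fit(X, y)
    print("weights:", clf.coef_.round(2), "intercept:", clf.intercept_.round(2))
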
  28. Data Mining procedure, step 5 (Validation): define criteria and tests to prove model validity. (a) Accuracy; (b) sensitivity vs. specificity; (c) receiver operating characteristic (ROC) plot; (d) cross-validation.
  29. Validation (a), accuracy: [Figure: classifier boundary in the (x1, x2) plane separating resistant (negative) from susceptible (positive) samples.] Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142/154.
  30. Validation (b), sensitivity vs. specificity: Sensitivity = TP / (TP + FN) = 67/72. False positive rate FPR = FP / (TN + FP) = 6/81. Note: Specificity = 1 - FPR.
  31. Validation (b), continued: varying the model's stringency trades sensitivity against specificity. With a less stringent boundary: Sensitivity = TP / (TP + FN) = 69/72; FPR = FP / (TN + FP) = 19/81.
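
Putting these formulas together in plain Python; the confusion-matrix counts are inferred from the quoted fractions on slide 30 rather than stated outright:

    # Counts inferred from sensitivity = 67/72 and FPR = 6/81:
    TP, FN, FP, TN = 67, 5, 6, 75

    accuracy = (TP + TN) / (TP + TN + FP + FN)
    sensitivity = TP / (TP + FN)
    fpr = FP / (TN + FP)
    specificity = 1 - fpr

    print(f"accuracy={accuracy:.3f}  sensitivity={sensitivity:.3f}  "
          f"FPR={fpr:.3f}  specificity={specificity:.3f}")
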
  32. Validation (c), ROC plot: [Figure: sensitivity (Sens) plotted against false positive rate (FPR) as model stringency varies.]
  33. Validation (c), continued: the area under the ROC curve is an excellent measure of model performance: 1.0 indicates a perfect model, 0.5 indicates random guessing.
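
A sketch of tracing the ROC curve and computing its area with scikit-learn (illustrative tooling; the labels and scores below are synthetic):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # Hypothetical true labels and classifier scores; in practice the scores
    # come from the model whose stringency is varied on the preceding slides.
    rng = np.random.default_rng(6)
    y_true = rng.integers(0, 2, size=100)
    scores = y_true + rng.normal(scale=0.8, size=100)  # informative but noisy

    # Sweep the decision threshold to trace the ROC curve, then integrate it.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    print("AUC =", round(auc(fpr, tpr), 2))  # 1.0 = perfect, 0.5 = random
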
  34. Validation (d), cross-validation: predictions are imperfect due to imperfect algorithms and imperfect data, so performance must be estimated on data the model has not seen.
  35. Cross-Validation: carefully monitor features that are useful across different independent data subsets. This can be accomplished with N-fold cross-validation: split the data into N parts (Trials 1-5 on the slide), train on all but one part, test on the held-out part, and take model performance as the mean predictive performance over the 5 trials. The best feature selection and classification algorithms will yield the best consistent performance across independent trials, and the best features will be consistently important across trials.
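
A minimal 5-fold cross-validation sketch with scikit-learn; the dataset and the classifier are stand-ins for whatever model is actually being validated:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical dataset; the classifier choice is illustrative only.
    rng = np.random.default_rng(7)
    X = rng.normal(size=(100, 5))
    y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

    # 5-fold cross-validation: train on four fifths of the data, test on the
    # held-out fifth, rotate through all five splits, and average the scores.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print("per-fold accuracy:", scores.round(2), "mean:", round(scores.mean(), 2))
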
  36. Data Mining procedure, step 6 (Prediction & Iteration): an analysis is only useful if it is used, and it only improves if it is tested. (a) Good validation requires successful new predictions; (b) imperfect predictions can lead to method refinement and greater understanding.
  37. Questions? Lushington in Silico. Geraldlushington3117 at aol.com. Geraldlushington.org
