Introduction to Data Mining / Bioinformatics

This presentation, prepared by Gerry Lushington, is a friendly introduction to the basics of data mining, as applied to biological problems. The intended audience is students and scientific researchers from a non-computational background.

Published in: Technology

Transcript

  • 1. Introduction to Bioinformatics: Mining Your Data. Gerry Lushington, Lushington in Silico, modeling / informatics consultant.
  • 2. What is Data Mining? Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties. Applicable across many disciplines: molecular bioinformatics, medical informatics, health informatics, biodiversity informatics.
  • 3. Example Applications: find relationships between convenient observables and important outcomes.
      Convenient observables: a) relative gene expression data; b) relative protein abundance data; c) relative lipid & metabolite profiles; d) glycosylation variants; e) SNPs, alleles; f) cellular traits; g) organism traits; h) behavioral traits; i) case history.
      Important outcomes: 1. disease susceptibility; 2. drug efficacy; 3. toxin susceptibility; 4. immunity; 5. genetic disorders; 6. microbial virulence; 7. species adaptive success; 8. species complementarity.
  • 4. Goals for this lecture:
      Focus on data mining: how to approach your data and use it to understand biology.
      Overview of available techniques.
      Understanding model validation.
      Try to think about data you’ve seen: what techniques might be useful?
      Don’t worry about grasping everything: K-INBRE Bioinformatics Core is here to help!
  • 5. Basic Data Mining: find relationships between a) easy-to-measure properties and b) important (but harder to measure) outcomes or attributes. Use the relationships to understand the conceptual basis for outcomes in b). Use the relationships to predict outcomes in new cases where the outcome has not yet been measured.
  • 6. Basic Data Mining: simple measurables
  • 7. Basic Data Mining: general observation. [Figure: samples labelled Unhappy or Happy]
  • 8. Basic Data Mining: relationship (#1). Blue = happy; red = unhappy. Accuracy = 12/20 = 60%. [Figure: samples labelled Unhappy or Happy]
  • 9. Basic Data Mining: relationship (#2). Blue + BIG red = happy; little red = unhappy. Accuracy = 17/20 = 85%. [Figure: samples labelled Unhappy or Happy]
  • 10. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
  • 11. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Key issues include: a) format conversion from instrument; b) any necessary mathematical manipulations (e.g., Density = M/V).
      [Figure annotations: Peak heights? Peak positions?]
  • 12. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Key issues include: a) normalization to account for experimental bias; b) statistical detection of flagrant outliers.
  • 13. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Key issues include: a) normalization to account for experimental bias; b) statistical detection of flagrant outliers.
      Annotation: use controls to scale data.
      [Figure: groups of measurements, each with a control (C) and samples 1 to 3]
  • 14. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Key issues include: a) normalization to account for experimental bias; b) statistical detection of flagrant outliers.
      Annotation: subjective (requires experience and/or domain knowledge).
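The two preprocessing issues on slides 12 to 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how controls are used to scale data or which outlier test is meant, so dividing by the mean control reading and flagging by a median/MAD "modified z-score" (with the commonly used 3.5 cutoff) are choices made here, and all readings are made up.

```python
from statistics import mean, median

def normalize_to_control(samples, controls):
    """Scale sample readings by the mean control reading from the same batch.

    Assumption: a simple ratio to the control; the slides do not specify the scaling.
    """
    control_mean = mean(controls)
    if control_mean == 0:
        raise ValueError("control mean is zero; cannot scale")
    return [value / control_mean for value in samples]

def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be flagged this way
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical batches: the second instrument run reads twice as high,
# but scaling by each batch's control removes that bias.
batch_a = normalize_to_control([10.0, 12.0, 11.0], controls=[10.0])
batch_b = normalize_to_control([20.0, 24.0, 22.0], controls=[20.0])

# Hypothetical normalized readings with one flagrant outlier at index 5.
suspects = flag_outliers([1.0, 1.1, 0.9, 1.05, 0.95, 5.0])
```

As slide 14 notes, whether a flagged value should actually be discarded is a subjective call that needs domain knowledge; the code only nominates candidates.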
  • 15. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Which out of many measurable properties relate to the outcome of interest? a) Intrinsic information content; b) redundancy relative to other properties; c) correlation with target attribute; d) iterative model training.
  • 16. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Which out of many measurable properties relate to the outcome of interest? a) Intrinsic information content; b) redundancy relative to other properties; c) correlation with target attribute; d) iterative model training.
      [Figure: four panels, each with ticks 1 to 4; two panels marked "x"]
  • 17. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Which out of many measurable properties relate to the outcome of interest? a) Intrinsic information content; b) redundancy relative to other properties; c) correlation with target attribute; d) iterative model training.
      [Figure: four panels, each with ticks 1 to 4; one panel marked "x"]
  • 18. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Which out of many measurable properties relate to the outcome of interest? a) Intrinsic information content; b) redundancy relative to other properties; c) correlation with target attribute; d) iterative model training.
      [Figure: four panels, each with ticks 1 to 4; one panel marked "x"; a fifth panel with ticks 1 to 4]
  • 19. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Which out of many measurable properties relate to the outcome of interest? a) Intrinsic information content; b) redundancy relative to other properties; c) correlation with target attribute; d) iterative model training.
      Annotation: train preliminary models based on random sets of properties; evaluate models according to correlative or predictive performance; experiment with promising sets, adding or deleting descriptors to gauge impact on performance.
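Criteria a) to c) from the feature selection slides can be illustrated with a simple correlation filter. The sketch below is an assumption-laden example, not the presenter's method: it uses Pearson correlation for both relevance and redundancy, the cutoffs `min_relevance` and `max_redundancy` are arbitrary, and the feature names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant series (no information content)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_features(features, target, min_relevance=0.5, max_redundancy=0.9):
    """Keep features correlated with the target, skipping near-duplicates of kept ones."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    selected = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_relevance:
            break  # remaining features are even less relevant
        if all(abs(pearson(features[name], features[kept])) <= max_redundancy
               for kept in selected):
            selected.append(name)
    return selected

# Hypothetical measurements for six samples; target 1 = outcome present.
features = {
    "gene_a": [1.0, 1.5, 2.0, 5.0, 5.5, 6.0],
    "gene_a_dup": [3.0, 2.0, 5.0, 9.0, 13.0, 10.0],  # largely redundant with gene_a
    "lipid_c": [1.0, 3.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],        # barely related to the target
    "flat": [1.0] * 6,                               # constant: no information content
}
target = [0, 0, 0, 1, 1, 1]
chosen = select_features(features, target)
```

Criterion d), iterative model training, would go further by retraining a classifier on candidate feature sets and comparing performance, as the slide 19 annotation describes.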
  • 20. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
  • 21. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: plot of y against x]
  • 22. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: plot of y against x, with a y scale running from -n to +n labelled NO and YES]
  • 23. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: plot of x2 against x1]
  • 24. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: plot of x2 against x1 with four groups, y1 to y4]
      y1 = resistant to types I & II diabetes; y2 = susceptible only to type II; y3 = susceptible only to type I; y4 = susceptible to types I & II.
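One way to read the four groups on slide 24 is as a distance-based classifier: summarise each labelled group by its centre in the (x1, x2) plane and assign a new sample to the nearest centre. The sketch below is a nearest-centroid example under that assumption; the slides do not name a specific algorithm, and the coordinates are hypothetical.

```python
from math import dist

def centroids(points, labels):
    """Mean (x1, x2) position of each labelled group."""
    sums = {}
    for (x1, x2), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x1, sy + x2, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def classify(point, centers):
    """Assign the label of the closest group centre (Euclidean distance)."""
    return min(centers, key=lambda label: dist(point, centers[label]))

# Hypothetical training samples, two per group y1..y4.
points = [(1.0, 4.0), (1.2, 4.2), (4.0, 4.0), (4.2, 3.8),
          (1.0, 1.0), (0.8, 1.2), (4.0, 1.0), (4.2, 1.2)]
labels = ["y1", "y1", "y2", "y2", "y3", "y3", "y4", "y4"]

centers = centroids(points, labels)
prediction = classify((3.8, 1.5), centers)  # lands closest to the y4 group
```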
  • 25. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: plot of x2 against x1 with regions labelled "Resistant to type I" and "Susceptible to type I"]
  • 26. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: plot of x2 against x1 with regions "Resistant to type I" and "Susceptible to type I"; thresholds a and b marked on x2, threshold c marked on x1]
      If x1 < c and x2 > a then resistant; else if x1 > c and x2 > b then resistant; else susceptible. E = 9.
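The rule on slide 26 translates directly into code. In this sketch the threshold values for `a`, `b` and `c` are hypothetical (the slide only shows them as positions on the axes), and the sample points are made up to exercise each branch.

```python
def classify_by_rules(x1, x2, a, b, c):
    """Rule set from slide 26; a and b are x2 thresholds, c is the x1 threshold."""
    if x1 < c and x2 > a:
        return "resistant"
    elif x1 > c and x2 > b:
        return "resistant"
    # Everything else, including x1 exactly equal to c, falls through.
    return "susceptible"

# Hypothetical thresholds, chosen only to illustrate the three branches.
a, b, c = 2.0, 4.0, 3.0
left_high = classify_by_rules(1.0, 3.0, a, b, c)   # first rule fires
right_high = classify_by_rules(4.0, 5.0, a, b, c)  # second rule fires
right_mid = classify_by_rules(4.0, 3.0, a, b, c)   # neither rule fires
```

Rule learning amounts to searching for the thresholds (and the rule structure) that misclassify the fewest training samples.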
  • 27. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
      [Figure: three one-dimensional views of the Resistant and Susc. groups: along x1 with threshold a, along x2 with threshold b, and along Fx1 - Gx2 with threshold c]
      If Fx1 - Gx2 < c then resistant; else susceptible.
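Slide 27 replaces the axis-aligned rules with a single linear boundary on a combination of the two properties. A minimal sketch follows; the weights `F` and `G`, the cutoff `c`, and the labelled samples are all hypothetical, and no fitting procedure is implied by the slides.

```python
def classify_linear(x1, x2, F, G, c):
    """Linear boundary from slide 27: resistant if F*x1 - G*x2 < c."""
    return "resistant" if F * x1 - G * x2 < c else "susceptible"

def count_errors(samples, F, G, c):
    """Number of labelled samples (x1, x2, label) the boundary misclassifies."""
    return sum(1 for x1, x2, label in samples
               if classify_linear(x1, x2, F, G, c) != label)

# Hypothetical labelled samples; the last one sits on the wrong side of the boundary.
samples = [
    (1.0, 3.0, "resistant"),
    (2.0, 4.0, "resistant"),
    (4.0, 1.0, "susceptible"),
    (5.0, 2.0, "susceptible"),
    (3.0, 3.5, "susceptible"),
]
errors = count_errors(samples, F=1.0, G=1.0, c=0.5)
```

Boundary detection methods differ mainly in how they choose F, G and c; counting misclassified training samples, as here, is the simplest score to compare candidate boundaries.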
  • 28. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
  • 29. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
      [Figure: plot of x2 against x1 with regions Resistant (Neg.) and Susc. (Pos.)]
      Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
  • 30. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
      [Figure: plot of x2 against x1 with regions Resistant (Neg.) and Susc. (Pos.)]
      Sensitivity = TP / (TP + FN) = 67 / 72
      FPR = FP / (TN + FP) = 6 / 81
      Note: Specificity = 1 - FPR
  • 31. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
      [Figure: plot of x2 against x1 with regions Resistant (Neg.) and Susc. (Pos.); annotation "Varying model stringency" with a scale from less to more]
      Sensitivity = TP / (TP + FN) = 69 / 72
      FPR = FP / (TN + FP) = 19 / 81
      Note: Specificity = 1 - FPR
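The formulas on slides 29 to 31 can be checked with a small helper. The TP and FP counts below come from the slides; FN and TN are inferred here from the quoted denominators (72 positives, 81 negatives), which is an assumption. Note that these counts total 153 samples, whereas slide 29 quotes 142 / 154; the transcript does not explain the difference.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Validation metrics as defined on slides 29 and 30."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "fpr": fp / (tn + fp),
        "specificity": tn / (tn + fp),  # equals 1 - FPR
    }

# Slide 30: sensitivity 67/72, FPR 6/81 (FN = 72 - 67, TN = 81 - 6, both inferred).
slide30 = confusion_metrics(tp=67, fn=5, fp=6, tn=75)

# Slide 31: sensitivity 69/72, FPR 19/81 after changing the model stringency.
slide31 = confusion_metrics(tp=69, fn=3, fp=19, tn=62)

for name, metrics in (("slide 30", slide30), ("slide 31", slide31)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Comparing the two settings shows the trade-off the slides are building toward: the second catches two more true positives but at the cost of thirteen more false positives.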
  • 32. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
      [Figure: ROC plot of Sens (sensitivity) against FPR]
  • 33. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
      [Figure: ROC plot of Sens (sensitivity) against FPR]
      Area under the curve is an excellent measure of model performance: 1.0 = perfect model; 0.5 = random.
  • 34. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
      Predictions are imperfect due to: imperfect algorithms; imperfect data.
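An ROC curve is traced by sweeping the model's decision threshold and recording (FPR, sensitivity) at each setting; the area under it can then be estimated with the trapezoid rule. The sketch below assumes the model outputs a numeric score where higher means "more likely positive"; the scores and labels are hypothetical, and both classes must be present.

```python
def roc_points(scores, labels):
    """(FPR, sensitivity) pairs as the threshold is lowered; labels are 1 (pos) or 0 (neg)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Trapezoid-rule area under a curve given as (x, y) points in increasing x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical scores: one negative sample outranks one positive sample.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))  # 8/9, between random (0.5) and perfect (1.0)
```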
  • 35. Cross-Validation:
      Carefully monitor features that are useful across different independent data subsets.
      This can be accomplished with N-fold cross-validation. [Figure: Trials 1 to 5, each divided into Test and Train portions] Model performance = mean predictive performance over 5 trials.
      The best feature selection and classification algorithms will yield the best consistent performance across independent trials.
      The best features will be consistently important across trials.
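The 5-trial scheme on slide 35 can be sketched as follows. The splitting and averaging follow the slide; everything else is an assumption made for illustration: the `fit` and `score` callables, the toy midpoint-threshold model, and the data are hypothetical, and in practice samples are usually shuffled before being split into folds.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(xs, ys, fit, score, k=5):
    """Mean performance over k trials, plus the per-trial results."""
    results = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        results.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(results) / len(results), results

# Toy model (hypothetical): a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    neg = [x for x, y in zip(xs, ys) if y == 0]
    pos = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical, well-separated data with the two classes interleaved.
xs = [1.0, 6.0, 2.0, 7.0, 3.0, 8.0, 2.5, 6.5, 1.5, 7.5]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
mean_accuracy, per_trial = cross_validate(xs, ys, fit_threshold, accuracy, k=5)
```

Looking at `per_trial` rather than only the mean is what lets you judge the "consistent performance across independent trials" that the slide asks for.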
  • 36. Data Mining: procedure. Steps: 1. Data acquisition; 2. Data preprocessing; 3. Feature selection; 4. Classification; 5. Validation; 6. Prediction & iteration.
      Analysis is only useful if it is used, and it only improves if it is tested: a) good validation requires successful new predictions; b) imperfect predictions can lead to method refinement and greater understanding.
  • 37. Questions? Lushington in Silico. Geraldlushington3117 at aol.com. Geraldlushington.org