Data Mining in Bioinformatics

  1. 1. Data Mining in Bioinformatics. Peter Bajcsy, PhD, Automated Learning Group, National Center for Supercomputing Applications, University of Illinois (pbajcsy@ncsa.uiuc.edu). January 31, 2002.
  2. 2. Outline
     • Introduction
     • Overview of Microarray Problem
     • Image Analysis
     • Data Mining
     • Validation
     • Summary
  3. 3. Introduction: Recommended Literature
     1. Bioinformatics: The Machine Learning Approach, by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
     2. Data Mining: Concepts and Techniques, by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
     3. Pattern Classification, by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001
  4. 4. Introduction: Microarray Problem in Bioinformatics Domain
     • Problems in the bioinformatics domain
       - Data production at the levels of molecules, cells, organs, organisms, and populations
       - Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …
       - Prediction of molecular function and structure
       - Computational biology: synthesis (simulations) and analysis (machine learning)
  5. 5. Microarray Problem: Major Objective
     • Major objective: discover a comprehensive theory of life’s organization at the molecular level
       - The major actors of molecular biology: the nucleic acids, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA)
       - The central dogma of molecular biology: DNA is transcribed into RNA, which is translated into protein
       - Proteins are very complicated molecules built from 20 different amino acids
  6. 6. Input and Output of Microarray Data Analysis
     • Input: laser image scans (data) and the underlying experiment hypotheses or experiment designs (prior knowledge)
     • Output:
       - Conclusions about the input hypotheses, or knowledge about the statistical behavior of measurements
       - The theory of biological systems learnt automatically from data (machine learning perspective)
         - Model fitting, inference process
  7. 7. Overview of Microarray Problem
     [Diagram labels: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Mining; Data Analysis; Validation; Data Warehouse; Artificial Intelligence (AI); Knowledge discovery in databases (KDD)]
  8. 8. Artificial Intelligence (AI) Community
     • Design cycle of predictive modeling: Collect Data, Choose Features, Choose Model, Train Classifier, Evaluate Classifier
     • Issues:
       - Prior knowledge (e.g., invariance)
       - Model deviation from true model
       - Sampling distributions
       - Computational complexity
       - Model complexity (overfitting)
  9. 9. Knowledge Discovery in Databases (KDD) Community
     Database: GeneFilter Comparison Report
     GeneFilter 1 Name: O2#1 8-20-99adjfinal
     GeneFilter 1 Name: N2#1finaladj
     Intensities are reported as RAW (GF1, GF2) and NORMALIZED (GF1, GF2). The header of the column after R is missing in the source and is shown as (?).

     ORF NAME  GENE NAME  CHRM  F  G  R  (?)  RAW GF1  RAW GF2  NORM GF1   NORM GF2   DIFFERENCE  RATIO
     YAL001C   TFC3       1     1  A  1  2    12.03    7.38     403.83     209.79     194.04      1.92
     YBL080C   PET112     2     1  A  1  3    53.21    35.62    1,786.11   1,013.13   772.98      1.76
     YBR154C   RPB5       2     1  A  1  4    79.26    78.51    2,660.73   2,232.86   427.87      1.19
     YCL044C              3     1  A  1  5    53.22    44.66    1,786.53   1,270.12   516.41      1.41
     YDL020C   SON1       4     1  A  1  6    23.80    20.34    799.06     578.42     220.64      1.38
     YDL211C              4     1  A  1  7    17.31    35.34    581.00     1,005.18   -424.18     -1.73
     YDR155C   CPH1       4     1  A  1  8    349.78   401.84   11,741.98  11,428.10  313.88      1.03
     YDR346C              4     1  A  1  9    64.97    65.88    2,180.87   1,873.67   307.21      1.16
     YAL010C   MDM10      1     1  A  2  2    13.73    9.61     461.03     273.36     187.67      1.69
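The DIFFERENCE and RATIO columns of the report above appear to follow from the normalized intensities: the difference is GF1 minus GF2, and the ratio looks like a signed fold change (larger value over smaller, negated when GF2 exceeds GF1). That convention is inferred from the printed rows and is not stated on the slide; `compare_spots` is a hypothetical helper name.

```python
def compare_spots(gf1, gf2):
    """Return (difference, signed ratio) for two normalized intensities."""
    difference = gf1 - gf2
    # Assumed convention: larger over smaller, negative when GF2 > GF1.
    ratio = gf1 / gf2 if gf1 >= gf2 else -(gf2 / gf1)
    return round(difference, 2), round(ratio, 2)

# Normalized GF1/GF2 values copied from two rows of the report.
rows = {"YAL001C": (403.83, 209.79), "YDL211C": (581.00, 1005.18)}
report = {orf: compare_spots(gf1, gf2) for orf, (gf1, gf2) in rows.items()}
```

Under this reading, a ratio near 1 or -1 means the two filters agree, and the sign shows which filter is brighter.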
  10. 10. Data Mining and Image Analysis Steps
      • Image Analysis
        - Normalization
        - Grid Alignment
        - Feature construction (selection and extraction)
      • Data Mining
        - Statistics
        - Machine learning
        - Pattern recognition
        - Database techniques
        - Optimization techniques
        - Visualization
        - Prior knowledge
      • Validation
        - Issues
        - Cross validation techniques
      [Figure: excerpt of the GeneFilter Comparison Report]
  11. 11. IMAGE ANALYSIS
  12. 12. Image Analysis: Normalization
      [Figure: Cattle and Soy Controls. Spot labels: Beta Actin, PKG, HPRT, Beta 2 microglobulin, Rubisco, AB binding protein, Major latex protein homologue (MSG). Annotations: Red Band, Green Band, dynamic range of red band, dynamic range of green band.]
      Caption: Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng) and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).
      Solution: Reference points with reference values
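One way to read "reference points with reference values": control spots with known expected intensities give a per-channel scale factor that brings the red and green dynamic ranges onto a common scale. The sketch below assumes a purely multiplicative correction and uses the median ratio. Both choices, the function names, and all numbers are illustrative assumptions, not the method behind the slide.

```python
from statistics import median

def channel_scale(measured_controls, reference_values):
    """Scale factor that maps measured control spots onto their reference values."""
    ratios = [ref / meas for meas, ref in zip(measured_controls, reference_values)]
    return median(ratios)  # the median tolerates one badly measured control

def normalize(intensities, scale):
    """Apply the channel scale factor to every spot intensity."""
    return [value * scale for value in intensities]

reference = [100.0, 200.0, 400.0]        # hypothetical expected control intensities
green_controls = [200.0, 400.0, 800.0]   # hypothetical measured green (Cy3) controls
red_controls = [50.0, 100.0, 210.0]      # hypothetical measured red (Cy5) controls

green_scale = channel_scale(green_controls, reference)
red_scale = channel_scale(red_controls, reference)
green_spots = normalize([300.0, 640.0], green_scale)
red_spots = normalize([75.0, 160.0], red_scale)
```

After scaling, spots that were equally expressed in both samples should have comparable red and green values, which is what makes later ratios meaningful.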
  13. 13. Image Analysis: Grid Alignment
      Solution: Manual, semi-automatic and fully automatic alignment based on fiducials and/or global grid fitting.
  14. 14. Image Analysis: Feature Selection
      • Features: mean, median, standard deviation, ratios
      • Area: sensitive to background noise
  15. 15. Image Analysis: Feature Extraction
      • Area is determined by image thresholding and used during feature extraction
      [Figure labels: 1102; Dist: 2004; Box: 902; Plane: 2632]
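The thresholding step can be sketched as follows: pixels above a threshold form the spot area, and the features named on the previous slide (mean, median, standard deviation) are computed over that area. The patch values, the fixed threshold, and the name `spot_features` are assumptions made for illustration.

```python
from statistics import mean, median, pstdev

def spot_features(patch, threshold):
    """Threshold a 2-D intensity patch and summarize the spot (foreground) pixels."""
    pixels = [p for row in patch for p in row]
    foreground = [p for p in pixels if p > threshold]
    background = [p for p in pixels if p <= threshold]
    # Assumes both sets are non-empty; mean() raises StatisticsError otherwise.
    return {
        "area": len(foreground),
        "mean": mean(foreground),
        "median": median(foreground),
        "std": pstdev(foreground),
        "background_mean": mean(background),
    }

# Hypothetical 4x4 patch with a bright 2x2 spot in the middle.
patch = [
    [10, 12, 11, 10],
    [11, 90, 95, 12],
    [10, 92, 99, 11],
    [12, 10, 11, 10],
]
features = spot_features(patch, threshold=50)
```

Because the area depends directly on the threshold, a noisy background can push pixels across it, which is why area is listed as sensitive to background noise.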
  16. 16. DATA MINING
  17. 17. Why Data Mining? Sequence Example
      • Biology: Language and Goals
        - A gene can be defined as a region of DNA.
        - A genome is one haploid set of chromosomes with the genes they contain.
        - Compare gene sequences competently across species, accounting for inherently noisy biological sequences caused by random variability amplified by evolution.
        - Assumption: if a gene has high similarity to another gene, then they perform the same function.
      • Analysis: Language and Goals
        - A feature is an extractable attribute or measurement (e.g., gene expression, location).
        - Pattern recognition tries to characterize patterns in data (e.g., similar gene expressions, equidistant gene locations).
        - Data mining is about uncovering patterns, anomalies and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x).
  18. 18. Data Mining Techniques
      Data mining techniques draw from:
      • Statistics
      • Machine learning
      • Database techniques
      • Pattern recognition
      • Optimization techniques
      • Visualization
  19. 19. Statistics
      • Descriptive statistics: describe data
      • Inductive statistics: make forecasts and inferences
      • Example question: are two sample sets identically distributed?
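The question of whether two sample sets are identically distributed can be approached with the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. This choice of test is an assumption; the slide does not name one. The sketch computes only the statistic. Turning it into a decision needs a critical value or p-value, which a library routine such as `scipy.stats.ks_2samp` provides.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    largest_gap = 0.0
    # The supremum of |F_a - F_b| is reached at one of the observed values.
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

same = ks_statistic([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])      # identical samples
disjoint = ks_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # no overlap at all
```

A value near 0 is consistent with identical distributions, and a value near 1 indicates samples that barely overlap.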
  20. 20. Machine Learning
      • Unsupervised: “natural groupings”
      • Supervised: examples
      • Reinforced
  21. 21. Pattern Recognition
      • Statistical Models
      • Locally Weighted Learning
      • k-nearest neighbors, support vectors
      • Linear Correlation and Regression
      • Decision Trees
      • Neural Networks
        - NN representation and gradient based optimization
        - NN representation and genetic algorithm based optimization
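Of the techniques listed, k-nearest neighbors is the simplest to show in a few lines: a query point takes the majority label of its k closest training points. The two-feature expression profiles and class labels below are made up for illustration, and `knn_predict` is a hypothetical name.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (feature vector, class label) training pairs.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.0, 4.8))
```

The method has no training phase. Its behavior is governed by the distance measure and by k, which is typically chosen by validation.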
  22. 22. Database Techniques
      • Database Design and Modeling (tables, procedures, functions, constraints)
      • Database Interface to Data Mining System
      • Efficient Import and Export of Data
      • Database Data Visualization
      • Database Clustering for Access Efficiency
      • Database Performance Tuning (memory usage, query encoding)
      • Database Parallel Processing (multiple servers and CPUs)
      • Distributed Information Repositories (data warehouse)
  23. 23. Optimization Techniques
      • Highly nonlinear search space (global versus local maxima)
      • Gradient based optimization
      • Genetic algorithm based optimization
      • Optimization with sampling
      • Large search space
      • Example: A genome with N genes can encode 2^N states (each gene active or inactive; regulation is not considered). Human genome ~ 2^30,000 patterns; nematode genome ~ 2^20,000 patterns.
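To get a feel for the size of 2^N, the number of decimal digits of the state count can be computed from a logarithm without building the number itself. The helper name below is hypothetical.

```python
from math import floor, log10

def decimal_digits_of_power_of_two(n):
    """Number of decimal digits in 2**n, computed as floor(n * log10(2)) + 1."""
    return floor(n * log10(2)) + 1

human_digits = decimal_digits_of_power_of_two(30_000)
nematode_digits = decimal_digits_of_power_of_two(20_000)
```

Both counts run to thousands of decimal digits, so exhaustive enumeration is out of the question. This is the motivation for the gradient based, genetic, and sampling based searches listed above.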
  24. 24. Visualization
      • Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
      • Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
      [Figures: pie chart, parallel coordinates, temporal evolution]
  25. 25. Prior Knowledge from Experiment Design
      Complexity levels of microarray experiments:
      1. Compare a single gene in a control situation versus a treatment situation
         • Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
         • Methods: t-test, Bayesian approach
      2. Find multiple genes that share common functionalities
         • Example: Which related genes are dependent?
         • Methods: clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines)
      3. Infer the underlying gene and protein networks that are responsible for the observed patterns and functional pathways
         • Example: What is the gene regulation at system level?
         • Directions: mining regulatory regions, modeling regulatory networks on a global scale
      Goal of future experiment designs: understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, immune system and neuronal networks.
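For the first complexity level, a t-test compares the mean expression of one gene across control and treatment replicates. The sketch below computes Welch's t statistic, the unequal-variance variant, which is an assumption since the slide only says "t-test". The replicate values are invented, and the p-value lookup is left out.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(control, treatment):
    """Welch's t statistic: difference in means over its standard error."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return (mean(treatment) - mean(control)) / standard_error

# Hypothetical expression levels of one gene, four replicates per condition.
control = [10.0, 11.0, 9.0, 10.0]
treatment = [14.0, 15.0, 13.0, 14.0]
t_value = welch_t(control, treatment)
```

A large absolute value suggests the gene is up- or down-regulated under treatment. With thousands of genes tested at once, the significance threshold also needs a multiple-testing correction.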
  26. 26. Types of Expected Data Mining and Analysis Results
      Hypothetical examples:
      • Binary answers using tests of hypotheses
        - Drug treatment is successful with a confidence level x.
      • Statistical behavior (probability distribution functions)
        - A class of genes with functionality X follows a Poisson distribution.
      • Expected events
        - As the amount of treatment increases, the gene expression level decreases.
      • Relationships
        - The expression level of gene A is correlated with the expression level of gene B under varying treatment conditions (genes A and B are part of the same pathway).
      • Decision trees
        - Classification of a new gene sequence by a “domain expert”.
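The "relationships" result type can be illustrated with the Pearson correlation coefficient between two expression profiles measured over the same treatment conditions. The profiles are invented and the coefficient is one possible measure. A high correlation is compatible with a shared pathway but does not establish it.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical expression levels of two genes across four treatment doses.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 7.8]
r_ab = pearson(gene_a, gene_b)
```

Values close to +1 or -1 indicate a strong linear relationship, and values near 0 indicate none.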
  27. 27. VALIDATION
  28. 28. Why Validation?
      • Validation type:
        - Within the existing data
        - With newly collected data
      • Errors and uncertainties:
        - Systematic or random errors
        - Unknown variables: number of classes
        - Noise level: statistical confidence due to noise
        - Model validity: error measure, model over-fit or under-fit
        - Number of data points: measurement replicas
      • Other issues
        - Experimental support of general theories
        - Exhaustive sampling is not feasible
  29. 29. Cross Validation: Example
      • One-tier cross validation
        - Train on different data than the test data
      • Two-tier cross validation
        - The score from one-tier cross validation is used by the bias optimizer to select the best learning algorithm parameters (# of control points). The more you optimize, the more you over-fit. The second tier measures the level of over-fit and gives an unbiased measure of accuracy.
        - Useful for comparing learning algorithms whose control parameters are optimized
        - The number of folds is not optimized
      • Computational complexity:
        - #folds of top tier × #folds of bottom tier × #control points × CPU of algorithm
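A minimal sketch of two-tier (nested) cross validation, under stated assumptions: the learner is a toy one-dimensional k-nearest-neighbor vote, the control points are candidate values of k, and folds are interleaved index splits. None of these choices come from the slide, and all function names are hypothetical. The `fits` counter reproduces the fold and control point factors of the complexity formula, counting evaluations in the bottom tier.

```python
def make_folds(n_items, n_folds):
    """Interleaved index folds: fold i holds items i, i + n_folds, ..."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]

def split(data, held_out):
    """Return (training items, held-out items) for one fold."""
    held = set(held_out)
    train = [d for i, d in enumerate(data) if i not in held]
    test = [data[i] for i in held_out]
    return train, test

def knn_accuracy(train, test, k):
    """Accuracy of a 1-D k-nearest-neighbor vote (toy learner, labels 0/1)."""
    correct = 0
    for x, label in test:
        nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
        ones = sum(lbl for _, lbl in nearest)
        predicted = 1 if 2 * ones > len(nearest) else 0
        correct += predicted == label
    return correct / len(test)

def one_tier_score(data, k, n_folds):
    """Mean held-out accuracy for one control point (one-tier cross validation)."""
    folds = make_folds(len(data), n_folds)
    scores = [knn_accuracy(*split(data, fold), k) for fold in folds]
    return sum(scores) / len(scores)

def two_tier_cv(data, control_points, top_folds, bottom_folds):
    """Return (accuracy estimate from the top tier, bottom-tier evaluations run)."""
    top_scores, fits = [], 0
    for fold in make_folds(len(data), top_folds):
        train, test = split(data, fold)
        # Bottom tier: choose k using the top-tier training data only.
        best_k = max(control_points,
                     key=lambda k: one_tier_score(train, k, bottom_folds))
        fits += bottom_folds * len(control_points)
        # Top tier: score the chosen k on data the optimizer never saw.
        top_scores.append(knn_accuracy(train, test, best_k))
    return sum(top_scores) / len(top_scores), fits

# Hypothetical data: label is 1 when x >= 6.
data = [(float(x), int(x >= 6)) for x in range(12)]
accuracy, fits = two_tier_cv(data, control_points=[1, 3],
                             top_folds=3, bottom_folds=2)
```

With 3 top folds, 2 bottom folds and 2 control points, the bottom tier runs 3 × 2 × 2 = 12 evaluations. Multiplying by the cost of one evaluation gives the slide's complexity estimate.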
  30. 30. Summary
      • Microarray problem
        - Computational biology
        - Major objective of microarray technology
        - Input and output of data analysis
      • Data mining and image analysis steps
        - Image normalization, grid alignment, feature construction
        - Data mining techniques
        - Prior knowledge
        - Expected results of data mining
      • Validation
        - Issues
        - Cross validation techniques
