müßt


  1. 1. Data Mining in Bioinformatics
  2. 2. Outline <ul><li>Introduction </li></ul><ul><ul><li>Interdisciplinary Problem Statement </li></ul></ul><ul><ul><li>Microarray Problem Overview </li></ul></ul><ul><li>Microarray Data Processing </li></ul><ul><ul><li>Image Analysis and Data Mining </li></ul></ul><ul><ul><li>Prior Knowledge </li></ul></ul><ul><ul><li>Data Mining Methods </li></ul></ul><ul><ul><li>Database and Optimization Techniques </li></ul></ul><ul><ul><li>Visualization </li></ul></ul><ul><li>Validation </li></ul><ul><li>Artificial Immune Systems </li></ul><ul><li>Summary </li></ul>
  3. 3. Introduction: Recommended Literature <ul><li>1. Bioinformatics – The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001 </li></ul><ul><li>2. Data Mining – Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001 </li></ul><ul><li>3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001 </li></ul>
  4. 4. Bioinformatics, Computational Biology, Data Mining <ul><li>Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems. </li></ul><ul><li>Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g. </li></ul><ul><ul><li>Genomes (viruses, bacteria, fungi, plants, insects,…) </li></ul></ul><ul><ul><li>Proteins and Proteomes </li></ul></ul><ul><ul><li>Biological Sequences </li></ul></ul><ul><ul><li>Molecular Function and Structure </li></ul></ul><ul><li>Data Mining is searching for knowledge in data </li></ul><ul><ul><li>Knowledge mining from databases </li></ul></ul><ul><ul><li>Knowledge extraction </li></ul></ul><ul><ul><li>Data/pattern analysis </li></ul></ul><ul><ul><li>Data dredging </li></ul></ul><ul><ul><li>Knowledge Discovery in Databases (KDD) </li></ul></ul>
  5. 5. Basic Terms in Biology <ul><li>Example: </li></ul><ul><li>The human body contains ~100 trillion cells </li></ul><ul><li>Inside each cell is a nucleus </li></ul><ul><li>Inside the nucleus are two complete sets of the human genome (except in egg and sperm cells, which carry one set, and red blood cells, which have no nucleus) </li></ul><ul><li>Each set of the genome includes an estimated 30,000-80,000 genes on the same 23 chromosomes </li></ul><ul><li>Gene – A functional hereditary unit that occupies a fixed location on a chromosome, has a specific influence on phenotype, and is capable of mutation. </li></ul><ul><li>Chromosome – A DNA-containing linear body of the cell nucleus responsible for determination and transmission of hereditary characteristics </li></ul>
  6. 6. Basic Terms in Data Mining <ul><li>Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data. </li></ul><ul><li>Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations. </li></ul><ul><li>A pattern is a conservative statement about a probability distribution. </li></ul><ul><ul><li>Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution </li></ul></ul>
  7. 7. Introduction: Problems in Bioinformatics Domain <ul><li>Problems in Bioinformatics Domain </li></ul><ul><ul><li>Data production at the levels of molecules, cells, organs, organisms, populations </li></ul></ul><ul><ul><li>Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, … </li></ul></ul><ul><ul><li>Prediction of Molecular Function and Structure </li></ul></ul><ul><ul><li>Computational biology: synthesis (simulations) and analysis (machine learning) </li></ul></ul>
  8. 8. <ul><li>MICROARRAY PROBLEM </li></ul>
  9. 9. Microarray Problem: Major Objective <ul><li>Major Objective: Discover a comprehensive theory of life’s organization at the molecular level </li></ul><ul><ul><li>The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA) </li></ul></ul><ul><ul><li>The central dogma of molecular biology </li></ul></ul>Proteins are very complicated molecules built from 20 different amino acids.
  10. 10. Input and Output of Microarray Data Analysis <ul><li>Input: Laser image scans (data) and underlying experiment hypotheses or experiment designs (prior knowledge) </li></ul><ul><li>Output: </li></ul><ul><ul><li>Conclusions about the input hypotheses or knowledge about statistical behavior of measurements </li></ul></ul><ul><ul><li>The theory of biological systems learnt automatically from data (machine learning perspective) </li></ul></ul><ul><ul><ul><li>Model fitting, Inference process </li></ul></ul></ul>
  11. 11. Overview of Microarray Problem [Figure: diagram relating Biology Application Domain, Experiment Design and Hypothesis, Microarray Experiment, Image Analysis, Data Analysis, Data Mining, Artificial Intelligence (AI), Knowledge discovery in databases (KDD), Statistics, Data Warehouse, and Validation]
  12. 12. Statistics Community <ul><li>Random Variables </li></ul><ul><li>Statistical Measures </li></ul><ul><li>Probability and Probability Distribution </li></ul><ul><li>Confidence Interval Estimations </li></ul><ul><li>Test of Hypotheses </li></ul><ul><li>Goodness of Fit </li></ul><ul><li>Regression and Correlation Analysis </li></ul>
  13. 13. Artificial Intelligence (AI) Community <ul><li>Issues: </li></ul><ul><ul><li>Prior knowledge (e.g., invariance) </li></ul></ul><ul><ul><li>Model deviation from true model </li></ul></ul><ul><ul><li>Sampling distributions </li></ul></ul><ul><ul><li>Computational complexity </li></ul></ul><ul><ul><li>Model complexity (overfitting) </li></ul></ul>Figure: Design Cycle of Predictive Modeling (Collect Data, Choose Features, Choose Model, Train Classifier, Evaluate Classifier)
  14. 14. Knowledge Discovery in Databases (KDD) Community Database
  15. 15. Microarray Data Mining and Image Analysis Steps <ul><li>Image Analysis </li></ul><ul><ul><li>Normalization </li></ul></ul><ul><ul><li>Grid Alignment </li></ul></ul><ul><ul><li>Spot Quality Assurance Control </li></ul></ul><ul><ul><li>Feature construction (selection and extraction) </li></ul></ul><ul><li>Data Mining </li></ul><ul><ul><li>Prior knowledge </li></ul></ul><ul><ul><li>Statistics </li></ul></ul><ul><ul><li>Machine learning </li></ul></ul><ul><ul><li>Pattern recognition </li></ul></ul><ul><ul><li>Database techniques </li></ul></ul><ul><ul><li>Optimization techniques </li></ul></ul><ul><ul><li>Visualization </li></ul></ul><ul><li>Validation </li></ul><ul><ul><li>Issues </li></ul></ul><ul><ul><li>Cross validation techniques </li></ul></ul>
  16. 16. <ul><li>MICROARRAY IMAGE ANALYSIS </li></ul>
  17. 17. Microarray Image Analysis
  18. 18. <ul><li>DATA MINING OF MICROARRAY DATA </li></ul>
  19. 19. Why Data Mining? Sequence Example <ul><li>Biology: Language and Goals </li></ul><ul><li>A gene can be defined as a region of DNA. </li></ul><ul><li>A genome is one haploid set of chromosomes with the genes they contain. </li></ul><ul><li>Perform competent comparison of gene sequences across species and account for inherently noisy biological sequences due to random variability amplified by evolution </li></ul><ul><li>Assumption: if a gene has high similarity to another gene, then they perform the same function </li></ul><ul><li>Analysis: Language and Goals </li></ul><ul><li>A feature is an extractable attribute or measurement (e.g., gene expression, location) </li></ul><ul><li>Pattern recognition tries to characterize data patterns (e.g., similar gene expressions, equidistant gene locations). </li></ul><ul><li>Data mining is about uncovering patterns, anomalies and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x) </li></ul>
  20. 20. Types of Expected Data Mining and Analysis Results <ul><li>Hypothetical Examples: </li></ul><ul><li>Binary answers using tests of hypotheses </li></ul><ul><ul><li>Drug treatment is successful with a confidence level x. </li></ul></ul><ul><li>Statistical behavior (probability distribution functions) </li></ul><ul><ul><li>A class of genes with functionality X follows a Poisson distribution. </li></ul></ul><ul><li>Expected events </li></ul><ul><ul><li>As the amount of treatment increases, the gene expression level decreases. </li></ul></ul><ul><li>Relationships </li></ul><ul><ul><li>Expression level of gene A is correlated with expression level of gene B under varying treatment conditions (gene A and B are part of the same pathway). </li></ul></ul><ul><li>Decision trees </li></ul><ul><ul><li>Classification of a new gene sequence by a “domain expert”. </li></ul></ul>
  21. 21. <ul><li>PRIOR KNOWLEDGE </li></ul>
  22. 22. Prior Knowledge: Experiment Design <ul><li>Microarray sources of systematic and random errors </li></ul><ul><li>Feature selection and variability </li></ul><ul><li>Expectations and Hypotheses </li></ul><ul><li>Data cleaning and transformations </li></ul><ul><li>Data mining method selection </li></ul><ul><li>Interpretation </li></ul>Figure: Prior Knowledge in relation to the steps Collect Data, Choose Features, Data Cleaning and Transformations, Choose Model and Data Mining Method
  23. 23. Prior Knowledge from Experiment Design <ul><li>Complexity Levels of Microarray Experiments: </li></ul><ul><li>Compare a single gene in a control situation versus a treatment situation </li></ul><ul><ul><li>Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application) </li></ul></ul><ul><ul><li>Methods: t-test, Bayesian approach </li></ul></ul><ul><li>Find multiple genes that share common functionalities </li></ul><ul><ul><li>Example: Which related genes depend on one another? </li></ul></ul><ul><ul><li>Methods: Clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines) </li></ul></ul><ul><li>Infer the underlying gene and protein networks that are responsible for the patterns and functional pathways observed </li></ul><ul><ul><li>Example: What is the gene regulation at system level? </li></ul></ul><ul><ul><li>Directions: mining regulatory regions, modeling regulatory networks on a global scale </li></ul></ul><ul><li>Goal of Future Experiment Designs: Understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, immune system and neuronal networks. </li></ul>
  24. 24. Data Mining Techniques Visualization
  25. 25. <ul><li>STATISTICS </li></ul>
  26. 26. Statistics <ul><li>Descriptive Statistics: describe data </li></ul><ul><li>Inductive Statistics: make forecasts and inferences </li></ul><ul><ul><li>Example question: Are two sample sets identically distributed? </li></ul></ul>
  27. 27. Statistical t-test <ul><li>m – sample mean </li></ul><ul><li>s – variance </li></ul><ul><li>Gene Expression Level in Control and Treatment situations </li></ul><ul><li>Is the behavior of a single gene different in the Control situation than in the Treatment situation? </li></ul>The normalized distance t between the control and treatment sample means follows a Student distribution with f degrees of freedom. If t > thresh, then the control and treatment data populations are considered to be different.
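The formula for the normalized distance did not survive in this transcript. A minimal sketch, assuming the common unequal-variance form t = (m_c - m_t) / sqrt(s_c/n_c + s_t/n_t), with m the sample mean and s the sample variance as defined on the slide; the function names and the toy expression values are hypothetical:

```python
import math


def sample_mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    # Unbiased sample variance (divides by n - 1).
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def t_statistic(control, treatment):
    # Assumed form: difference of means normalized by the pooled standard error.
    m_c, m_t = sample_mean(control), sample_mean(treatment)
    s_c, s_t = sample_variance(control), sample_variance(treatment)
    return (m_c - m_t) / math.sqrt(s_c / len(control) + s_t / len(treatment))


# Hypothetical expression levels of one gene in the two situations.
control = [1.0, 2.0, 3.0]
treatment = [2.0, 3.0, 4.0]
t = t_statistic(control, treatment)
print(round(t, 4))  # -1.2247
```

The threshold and the degrees of freedom f would come from a Student distribution table or a statistics library; they are left out here on purpose.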
  28. 28. <ul><li>MACHINE LEARNING </li></ul><ul><li>AND </li></ul><ul><li>PATTERN RECOGNITION </li></ul>
  29. 29. Machine Learning [Figure: Machine Learning branches into Supervised, Unsupervised and Reinforcement learning; annotations: “Natural groupings”, Examples]
  30. 30. Pattern Recognition [Figure: taxonomy of pattern recognition methods with the labels Linear Correlation and Regression; Neural Networks; Statistical Models; Decision Trees; Locally Weighted Learning; NN representation and gradient based optimization; NN representation and genetic algorithm based optimization; k-nearest neighbors, support vectors]
  31. 31. Unsupervised Learning and Clustering <ul><li>A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. </li></ul><ul><li>Examples of data objects: </li></ul><ul><ul><li>gene expression levels, sets of co-regulated genes (pathways), protein structures </li></ul></ul><ul><li>Categories of Clustering Methods </li></ul><ul><ul><li>Partitioning Methods </li></ul></ul><ul><ul><li>Hierarchical Methods </li></ul></ul><ul><ul><li>Density-Based Methods </li></ul></ul>“ Natural groupings”
  32. 32. Unsupervised Clustering: Partitioning Methods <ul><li>K-means Algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. </li></ul><ul><li>Input: number of desired clusters k </li></ul><ul><li>Output: one of k labels assigned to each of the n objects </li></ul><ul><li>Steps: </li></ul><ul><li>1. Select k initial cluster centers </li></ul><ul><li>2. Compute the dissimilarity as a distance between an object and each cluster center </li></ul><ul><li>3. Assign a label to the object based on the minimum distance </li></ul><ul><li>4. Repeat for all objects </li></ul><ul><li>5. Re-compute each cluster center as the mean of all objects assigned to that cluster </li></ul><ul><li>6. Repeat from Step 2 until objects do not change their labels. </li></ul>Example: Centroid-Based Technique
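The k-means steps above can be sketched in a few lines. This is an illustration only: using the first k objects as initial centers, Euclidean distance, and the toy 2-D points are all assumptions not stated on the slide.

```python
import math


def kmeans(points, k, max_iter=100):
    # Assumption: the first k objects serve as the initial cluster centers.
    centers = [points[i] for i in range(k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Label every object with the index of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no object changed its label: stop
            break
        labels = new_labels
        # Re-compute each center as the mean of the objects assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 0, 1, 1, 0, 1]
```

The result depends on the initial centers, which is why practical implementations usually restart from several random initializations.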
  33. 33. Unsupervised Clustering: Partitioning Methods <ul><li>K-medoids Algorithm partitions a set of n objects into k clusters so that it minimizes the sum of the dissimilarities of all the objects to their nearest medoid. </li></ul><ul><li>Input: number of desired clusters k </li></ul><ul><li>Output: one of k labels assigned to each of the n objects </li></ul><ul><li>Steps: </li></ul><ul><li>1. Select k objects as the initial medoids </li></ul><ul><li>2. Compute the dissimilarity as a distance between an object and each cluster medoid </li></ul><ul><li>3. Assign a label to the object based on the minimum distance </li></ul><ul><li>4. Repeat for all objects </li></ul><ul><li>5. Randomly select a non-medoid object and swap it with the current medoid if the swap would decrease the intra-cluster square error </li></ul><ul><li>6. Repeat from Step 2 until objects do not change their labels. </li></ul>Example: Representative-Based Technique
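A minimal k-medoids sketch follows. It deviates from the slide in two labeled ways: instead of picking a random non-medoid object, it tries every possible swap so that the run is reproducible, and it uses the sum of distances to the nearest medoid as the cost (the definition in the first bullet) rather than a squared error. The 1-D data are hypothetical.

```python
import math


def total_cost(points, medoids):
    # Sum of the distances of all objects to their nearest medoid.
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def kmedoids(points, k):
    medoids = list(range(k))  # assumption: first k objects are the initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Tentatively swap medoid i with a non-medoid object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # keep the swap only if the cost decreases
                    medoids, best = trial, cost
                    improved = True
    labels = [
        min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels, best


points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
medoids, labels, cost = kmedoids(points, k=2)
print(sorted(medoids), labels, cost)  # [1, 4] [1, 1, 1, 0, 0, 0] 4.0
```

Because medoids are actual data objects, the method is less sensitive to outliers than a mean-based center.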
  34. 34. Unsupervised Clustering: Hierarchical Clustering <ul><li>Hierarchical Clustering partitions a set of n objects into a tree of clusters </li></ul><ul><li>Types of Hierarchical Clustering </li></ul><ul><ul><li>Agglomerative hierarchical clustering </li></ul></ul><ul><ul><ul><li>Bottom-up strategy of building clusters </li></ul></ul></ul><ul><ul><li>Divisive hierarchical clustering </li></ul></ul><ul><ul><ul><li>Top-down strategy of building clusters </li></ul></ul></ul>
  35. 35. Unsupervised Agglomerative Hierarchical Clustering <ul><li>Agglomerative Hierarchical Clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy. </li></ul><ul><li>Steps: </li></ul><ul><li>1. Assign a unique label to each data object and form n clusters </li></ul><ul><li>2. Find the nearest clusters and merge them </li></ul><ul><li>3. Repeat Step 2 until the number of remaining clusters equals the desired number of clusters. </li></ul><ul><li>Types of Agglomerative Hierarchical Clustering </li></ul><ul><ul><li>The nearest neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree) </li></ul></ul><ul><ul><li>The farthest neighbor algorithms (maximum or complete-linkage algorithm) </li></ul></ul>
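As an illustration of the bottom-up strategy, here is a small sketch of the nearest neighbor (single-linkage) variant. The brute-force pair search and the 1-D sample data are assumptions chosen for brevity, not an efficient implementation.

```python
import math


def single_linkage(points, n_clusters):
    # Start with one cluster per object (clusters hold object indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two nearest clusters
        del clusters[b]
    return clusters


points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters = single_linkage(points, n_clusters=2)
print(clusters)  # [[0, 1, 2, 3], [4]]
```

Replacing `min` with `max` in the cluster distance would give the farthest neighbor (complete-linkage) variant.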
  36. 36. Unsupervised Clustering: Density-Based Clustering <ul><li>Density-Based Spatial Clustering of Applications with Noise (DBSCAN) aggregates objects into clusters if the objects are density connected. </li></ul><ul><li>Density connected objects: </li></ul><ul><ul><li>Simplified explanation: P and Q are density connected if there is an object O such that both P and Q are density reachable from O. </li></ul></ul><ul><ul><li>Aggregate P and Q if they are density connected with respect to the R-radius neighborhood and Minimum Object criteria </li></ul></ul>
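A compact sketch of density-based clustering in the spirit of the description above. The parameter names `radius` and `min_objects` stand in for the R-radius neighborhood and Minimum Object criteria; counting an object as its own neighbor and the sample data are assumptions.

```python
import math

NOISE = -1


def dbscan(points, radius, min_objects):
    labels = [None] * len(points)

    def neighbors(i):
        # All objects within the radius of object i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= radius]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_objects:
            labels[i] = NOISE  # not a core object; may become a border object later
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_objects:  # j is a core object: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (50.0,)]
labels = dbscan(points, radius=1.5, min_objects=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike the partitioning methods, the number of clusters is not an input here, and isolated objects are reported as noise.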
  37. 37. Supervised Learning or Classification <ul><li>Classification is a two-step process consisting of learning classification rules followed by assignment of classification label. </li></ul>
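  38. 38. Supervised Learning: Decision Tree <ul><li>Decision tree algorithm constructs a tree structure in a top-down recursive divide-and-conquer manner </li></ul><ul><li>Example: Car Insurance Risk Assessment. Attributes are tested at the nodes and the answers (yes/no) label the branches. </li></ul><ul><ul><li>Age < 25? yes: Risk High; no: test Sports car? </li></ul></ul><ul><ul><li>Sports car? yes: Risk High; no: Risk Low </li></ul></ul><ul><li>Training data (Age, Car Type, Risk): (20, family, High), (32, truck, Low), (68, family, Low), (43, sports, High), (17, sports, High), (23, family, High) </li></ul><ul><li>Figure: Visualization of Decision Boundaries </li></ul>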
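The car insurance tree can be written directly as nested tests. The yes/no branch orientation is inferred from the six training rows listed on the slide (it is the only orientation consistent with them), and the function name is hypothetical.

```python
def assess_risk(age, car_type):
    # Root node: Age < 25?
    if age < 25:
        return "High"
    # Second node: Sports car?
    if car_type == "sports":
        return "High"
    return "Low"


# Training rows from the slide: (Age, Car Type, Risk).
training_rows = [
    (20, "family", "High"),
    (32, "truck", "Low"),
    (68, "family", "Low"),
    (43, "sports", "High"),
    (17, "sports", "High"),
    (23, "family", "High"),
]

consistent = all(assess_risk(age, car) == risk for age, car, risk in training_rows)
print(consistent)  # True
```

A learning algorithm would choose the tested attributes and thresholds automatically, for example by maximizing information gain at each split; here they are simply hard-coded.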
  39. 39. Supervised Learning: Bayesian Classification <ul><li>Bayesian Classification is based on Bayes theorem and it can predict class membership probabilities. </li></ul><ul><li>Bayes Theorem (X: data sample, H: hypothesis of data label) </li></ul><ul><ul><li>P(H|X) posterior probability </li></ul></ul><ul><ul><li>P(H) prior probability </li></ul></ul><ul><li>Classification: choose the maximum a posteriori (MAP) hypothesis </li></ul>
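The theorem itself is not reproduced in the transcript. In the slide's notation (X the data sample, H the hypothesis), its standard statement and the resulting maximum a posteriori rule are:

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)},
\qquad
H_{\mathrm{MAP}} = \arg\max_{H} P(H \mid X) = \arg\max_{H} P(X \mid H)\, P(H)
```

The evidence P(X) is the same for every hypothesis, so it can be dropped when only the best label is needed.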
  40. 40. Statistical Models: Linear Discriminant <ul><li>Linear Discriminant Functions form boundaries between data classes. </li></ul><ul><li>Finding Linear Discriminant Functions is achieved by minimizing a criterion error function. </li></ul>Linear discriminant function Quadratic discriminant function Finding w coefficients: -Gradient Descent Procedures -Newton’s algorithm
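The discriminant formulas were images and are missing from the transcript. A sketch of the usual forms, assuming the notation of Duda, Hart and Stork (weight vector w, bias w_0, criterion function J, learning rate \eta, Hessian H); the slide's own symbols may differ:

```latex
% Linear and quadratic discriminant functions
g(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0
\qquad
g(\mathbf{x}) = w_0 + \sum_{i} w_i x_i + \sum_{i}\sum_{j} w_{ij}\, x_i x_j

% Finding w by minimizing a criterion function J(w)
\text{Gradient descent: } \mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \nabla J(\mathbf{w}_k)
\qquad
\text{Newton: } \mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}^{-1} \nabla J(\mathbf{w}_k)
```

In the two-class case a sample is assigned to one class when g(x) > 0 and to the other when g(x) < 0.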
  41. 41. Artificial Neural Networks <ul><li>Artificial Neural Network (ANN) is a computational analogue of neurons. </li></ul><ul><li>An artificial neural network is a set of connected input/output units where each connection has a weight associated with it. </li></ul><ul><li>Phase I: learning – adjust the weights such that the network accurately predicts the class labels of the input samples </li></ul><ul><li>Phase II: classification – assign labels by passing an unknown sample through the network </li></ul>Network topology or “Structure”
  42. 42. Artificial Neural Networks (cont.) <ul><li>Steps: </li></ul><ul><ul><li>Initialize weights with random values from [-1,1] </li></ul></ul><ul><ul><li>Propagate the inputs forward </li></ul></ul><ul><ul><li>Backpropagate the error </li></ul></ul><ul><ul><li>Terminate learning (training) if (a) delta w < thresh or (b) percentage of misclassified samples < thresh or (c) max number of iterations has been exceeded </li></ul></ul><ul><li>Pros of ANN: good performance with noisy data, rule extraction </li></ul><ul><li>Cons of ANN: long training, poor interpretability, trial-and-error network design </li></ul>Figure: Interpretation of a unit or node j
  43. 43. Support Vector Machines (SVM) <ul><li>SVM algorithm finds a separating hyperplane with the largest margin and uses it for classification of new samples </li></ul>
  44. 44. <ul><li>DATABASE TECHNIQUES </li></ul><ul><li>AND </li></ul><ul><li>OPTIMIZATION TECHNIQUES </li></ul>
  45. 45. Data Types and Databases <ul><li>Relational Databases </li></ul><ul><li>Data Warehouses </li></ul><ul><li>Transactional Databases </li></ul><ul><li>Advanced Database Systems </li></ul><ul><ul><li>Object-Relational </li></ul></ul><ul><ul><li>Spatial and Temporal </li></ul></ul><ul><ul><li>Time-Series </li></ul></ul><ul><ul><li>Multimedia </li></ul></ul><ul><ul><li>Text </li></ul></ul><ul><ul><li>Heterogeneous, Legacy, and Distributed </li></ul></ul><ul><ul><li>WWW </li></ul></ul>Structure - 3D Anatomy Function – 1D Signal Metadata – Annotation
  46. 46. Database Techniques <ul><li>Database Design and Modeling (tables, procedures, functions, constraints) </li></ul><ul><li>Database Interface to Data Mining System </li></ul><ul><li>Efficient Import and Export of Data </li></ul><ul><li>Database Data Visualization </li></ul><ul><li>Database Clustering for Access Efficiency </li></ul><ul><li>Database Performance Tuning (memory usage, query encoding) </li></ul><ul><li>Database Parallel Processing (multiple servers and CPUs) </li></ul><ul><li>Distributed Information Repositories (data warehouse) </li></ul>
  47. 47. Search and Optimization Techniques: Search Types <ul><li>Types of search methods: </li></ul><ul><ul><li>Calculus-based </li></ul></ul><ul><ul><ul><li>Indirect (solve a nonlinear set of equations) </li></ul></ul></ul><ul><ul><ul><li>Direct (follow local gradient - hill climbing) </li></ul></ul></ul><ul><ul><li>Enumerative (search objective function values at every point – dynamic programming) </li></ul></ul><ul><ul><li>Random (search with random sampling) </li></ul></ul><ul><li>Randomized search methods: guide the search with random processes – simulated annealing, genetic programming </li></ul>
  48. 48. Search and Optimization Techniques: Challenges <ul><li>Search and optimization challenges: </li></ul><ul><ul><li>Global versus local maxima </li></ul></ul><ul><ul><li>Existence of derivatives (calculus-based) </li></ul></ul><ul><ul><li>High dimensionality </li></ul></ul><ul><ul><li>Highly nonlinear search space (global versus local maxima) </li></ul></ul><ul><ul><li>Large search space </li></ul></ul><ul><li>Example: A genome with N genes can encode 2^N states (each gene is either active or inactive; graded regulation is not considered). Human genome ~ 2^30,000 patterns; Nematode genome ~ 2^20,000 patterns. </li></ul>
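To give a feel for the size of this search space, the number of decimal digits of 2^N can be computed from N·log10(2). The helper name is hypothetical; the gene counts are the ones quoted on the slide.

```python
import math


def decimal_digits_of_pattern_count(n_genes):
    # Number of decimal digits of 2**n_genes is floor(n_genes * log10(2)) + 1.
    return math.floor(n_genes * math.log10(2)) + 1


human_digits = decimal_digits_of_pattern_count(30_000)
nematode_digits = decimal_digits_of_pattern_count(20_000)
print(human_digits, nematode_digits)  # 9031 6021
```

A number with thousands of digits rules out enumerative search, which is the motivation for the guided random search methods on the following slides.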
  49. 49. Genetic Algorithm <ul><li>Genetic Algorithm (GA) based optimization is a computational analogue of Darwin’s evolution theory (survival of the fittest). </li></ul><ul><li>Description of GA based optimization: </li></ul><ul><ul><li>Uses a coding of the parameter set (not the parameters themselves) </li></ul></ul><ul><ul><li>Searches from a population of points (not a single point) </li></ul></ul><ul><ul><li>Uses an objective function (not derivatives or other auxiliary knowledge) </li></ul></ul><ul><ul><li>Employs probabilistic transition rules (not deterministic rules) </li></ul></ul><ul><ul><li>Is composed of three operators </li></ul></ul><ul><ul><ul><li>Reproduction (or selection) </li></ul></ul></ul><ul><ul><ul><li>Crossover </li></ul></ul></ul><ul><ul><ul><li>Mutation </li></ul></ul></ul><ul><li>Reference: D. Goldberg: Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley Publishing Co., 1989. </li></ul>
  50. 50. Genetic Algorithm: Additional Operators <ul><li>Additional operators </li></ul><ul><ul><li>Niching for optimization of multimodal and multiobjective functions </li></ul></ul><ul><ul><ul><li>Fitness sharing: the number of individuals residing near any peak will be proportional to the height of that peak (reduce individual fitness according to their similarity) </li></ul></ul></ul><ul><ul><ul><li>Crowding: spread individuals among the most prominent peaks and do not allocate individuals proportionally to fitness (maintain diversity) </li></ul></ul></ul><ul><ul><li>Speciation for optimization of multimodal functions </li></ul></ul><ul><ul><ul><li>Mating restriction scheme (restrict mating or crossover according to the similarity among individuals) </li></ul></ul></ul>
  51. 51. Genetic Algorithm: Example <ul><li>Steps: </li></ul><ul><li>Randomly generate an initial population of size n=2; e.g., strings 0110 & 1100 </li></ul><ul><li>Reproduction is a process of copying strings according to their objective function values – “a roulette wheel” </li></ul><ul><li>Crossover proceeds in two steps: (1) random mating of strings and (2) selecting a random position at which the mated strings exchange their substrings; e.g., crossing 0110 & 1100 after the first position yields 1110 & 0100 </li></ul><ul><li>Mutation is the occasional random alteration of the value of a string position to protect against premature loss of information; e.g., flipping the first bit of 1110 yields 0110 & 0100 </li></ul>Figure: Objective Function. An (on, off, on, off) input sequence is converted to a string (1010).
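The three operators from the example can be sketched on bit strings as follows. The objective function is not given in the transcript, so the `fitness` below (number of "on" bits) is purely a hypothetical stand-in; crossover point and mutation position are passed in explicitly to reproduce the slide's strings.

```python
import random


def fitness(bits):
    # Hypothetical objective function: count of "on" (1) positions.
    return bits.count("1")


def roulette_select(population, rng):
    # Reproduction: copy strings with probability proportional to fitness.
    weights = [fitness(s) for s in population]
    return rng.choices(population, weights=weights, k=len(population))


def crossover(a, b, point):
    # One-point crossover: swap the tails of the two strings after `point`.
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, position):
    # Mutation: flip the value at a single string position.
    flipped = "1" if bits[position] == "0" else "0"
    return bits[:position] + flipped + bits[position + 1:]


rng = random.Random(0)  # seeded only to make the sketch repeatable
population = ["0110", "1100"]
mating_pool = roulette_select(population, rng)

children = crossover("0110", "1100", point=1)
mutated = mutate("1110", position=0)
print(children, mutated)  # ('0100', '1110') 0110
```

In a full GA the crossover point and mutation position would be drawn at random, and the three operators would be applied repeatedly over many generations.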
  52. 52. <ul><li>VISUALIZATION </li></ul>
  53. 53. Visualization <ul><li>Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates </li></ul><ul><li>Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution </li></ul>Figures: Pie chart, Parallel coordinates, Temporal evolution
  54. 54. Novel Visualization of Features Feature Selection and Visualization Feature Selection Mean Feature Image
  55. 55. Novel Visualization of Clustering Results Isodata (K-means) Clustering Class Labeling and Visualization Mean Feature Image Label Image
  56. 56. <ul><li>VALIDATION </li></ul>
  57. 57. Why Validation? <ul><li>Validation type: </li></ul><ul><ul><li>Within the existing data </li></ul></ul><ul><ul><li>With newly collected data </li></ul></ul><ul><li>Errors and uncertainties: </li></ul><ul><ul><li>Systematic or random errors </li></ul></ul><ul><ul><li>Unknown variables - number of classes </li></ul></ul><ul><ul><li>Noise level - statistical confidence due to noise </li></ul></ul><ul><ul><li>Model validity – error measure, model over-fit or under-fit </li></ul></ul><ul><ul><li>Number of data points - measurement replicas </li></ul></ul><ul><li>Other issues </li></ul><ul><ul><li>Experimental support of general theories </li></ul></ul><ul><ul><li>Exhaustive sampling is not feasible </li></ul></ul>
  58. 58. Error Detection: Example of Spot Screening Mask Image – No Screening Mask Image – Location and Size Screening Mask Image – SNR Screening
  59. 59. Cross Validation: Example <ul><li>One-tier cross validation </li></ul><ul><ul><li>Train on different data than the test data </li></ul></ul><ul><li>Two-tier cross validation </li></ul><ul><ul><li>The score from one-tier cross validation is used by the bias optimizer to select the best learning algorithm parameters (# of control points). The more you optimize, the more you over-fit. The second tier measures the level of over-fit (unbiased measure of accuracy). </li></ul></ul><ul><ul><li>Useful for comparing learning algorithms with control parameters that are optimized. </li></ul></ul><ul><ul><li>Number of folds is not optimized. </li></ul></ul><ul><li>Computational complexity: </li></ul><ul><ul><li>#folds of top tier × #folds of bottom tier × #control points × CPU time of algorithm </li></ul></ul>
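A minimal sketch of two-tier (nested) cross validation that also counts model fits, to make the complexity product concrete. The fold splitter, the toy `fit`/`score` functions, the data, and the candidate control parameters are all hypothetical.

```python
def k_fold_indices(n_items, n_folds):
    """Yield (train, test) index lists for contiguous folds."""
    indices = list(range(n_items))
    fold_size = n_items // n_folds
    for f in range(n_folds):
        start = f * fold_size
        stop = n_items if f == n_folds - 1 else start + fold_size
        yield indices[:start] + indices[stop:], indices[start:stop]


def two_tier_cv(data, candidates, fit, score, top_folds=3, bottom_folds=2):
    outer_scores, n_fits = [], 0
    for outer_train, outer_test in k_fold_indices(len(data), top_folds):
        train_data = [data[i] for i in outer_train]
        best_param, best_inner = None, float("-inf")
        # Bottom tier: pick the control parameter using training data only.
        for param in candidates:
            inner = []
            for tr, te in k_fold_indices(len(train_data), bottom_folds):
                model = fit([train_data[i] for i in tr], param)
                n_fits += 1  # counts bottom-tier fits only
                inner.append(score(model, [train_data[i] for i in te]))
            mean_inner = sum(inner) / len(inner)
            if mean_inner > best_inner:
                best_param, best_inner = param, mean_inner
        # Top tier: score the tuned model on data never used for tuning.
        model = fit(train_data, best_param)
        outer_scores.append(score(model, [data[i] for i in outer_test]))
    return sum(outer_scores) / len(outer_scores), n_fits


def fit(train, shrink):
    # Hypothetical learner: predicts a shrunken mean of the training values.
    return shrink * sum(train) / len(train)


def score(model, test):
    # Higher is better: negative mean squared error.
    return -sum((y - model) ** 2 for y in test) / len(test)


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
candidates = [0.5, 0.9, 1.0]
accuracy, n_fits = two_tier_cv(data, candidates, fit, score)
print(n_fits)  # 18 = 3 top folds x 2 bottom folds x 3 control points
```

The top-tier score is the one to report when comparing algorithms, because the held-out fold played no part in choosing the control parameter.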
  60. 60. <ul><li>ARTIFICIAL IMMUNE SYSTEMS </li></ul>
  61. 61. Artificial Immune Systems <ul><li>Artificial Immune Systems (AIS) are adaptive systems, inspired by theoretical immunology and observed immune functions, principles and models, which are applied to problem solving. </li></ul><ul><li>Other types of AIS are hybrids of ANN, GA and fuzzy systems combined with theoretical immunology models </li></ul><ul><li>Applications of AIS: </li></ul><ul><ul><li>Pattern recognition (surveillance of infectious diseases) </li></ul></ul><ul><ul><li>Fault and anomaly detection (image inspection and segmentation) </li></ul></ul><ul><ul><li>Data analysis (reinforcement, unsupervised learning) </li></ul></ul><ul><ul><li>Agent-based systems </li></ul></ul><ul><ul><li>Scheduling (adaptive scheduling) </li></ul></ul><ul><ul><li>Autonomous navigation and control (walking robots) </li></ul></ul><ul><ul><li>Search and optimization methods (constrained, time-dependent optimization) </li></ul></ul><ul><ul><li>Security of information systems (virus detection, network intrusion) </li></ul></ul>
  62. 62. Basic Terms Used in Artificial Immune Systems <ul><li>The immune system is understood as a complex set of cells and molecules that protect our bodies, which are under constant attack by antigens (foreign or self-antigens), against infection </li></ul><ul><li>The immune system consists of a two-tier line of defense: the adaptive (lymphocytes: B-cells & T-cells) and the innate (granulocytes & macrophages) immune systems. Both systems depend upon the activity of white blood cells (leukocytes). </li></ul>The organs that make up the immune system (lymphoid organs) are the thymus & bone marrow (primary) and the tonsils, adenoids, spleen, appendix, lymph nodes, lymphatic vessels and Peyer’s patches (secondary).
  63. 63. Mechanisms Adapted in Artificial Immune Systems <ul><li>Pattern recognition: lymphocytes (B-cells & T-cells) carry surface receptors capable of recognizing antigens </li></ul><ul><ul><li>Example: recognition via complementary regions </li></ul></ul><ul><li>The clonal selection principle: only cells capable of recognizing an antigen stimulus will proliferate and differentiate into effector cells </li></ul><ul><li>Immune learning and memory: reinforcement learning strategy </li></ul><ul><li>Self/Nonself discrimination: distinguish between molecules of the body’s own cells (self) and foreign molecules (nonself) – positive and negative selection, clonal expansion and ignorance </li></ul>
  64. 64. Why Artificial Immune System? <ul><li>Pattern recognition: cells and molecules of the immune system have several ways of recognizing patterns </li></ul><ul><li>Uniqueness: each individual possesses its own immune system </li></ul><ul><li>Self identity: “elements” that are not native to the body can be recognized and eliminated by the immune system </li></ul><ul><li>Diversity: there exist varying types of elements that together protect the body </li></ul><ul><li>Disposability: no single native element is essential for the functioning of the immune system </li></ul><ul><li>Autonomy: there is no central element controlling the immune system </li></ul><ul><li>Multi-layered: multiple layers of different mechanisms provide overall security </li></ul><ul><li>No secure layer: any cell of the organism can be attacked by the immune system (IS) </li></ul><ul><li>Anomaly detection: the IS can recognize and react to pathogens that the body has never encountered before </li></ul><ul><li>Dynamically changing coverage: the IS maintains a circulating repertoire of lymphocytes constantly being changed through cell death, production and reproduction </li></ul>
  65. 65. Why Artificial Immune System? (cont.) <ul><li>Distributivity: the immune elements are distributed all over the body </li></ul><ul><li>Noise tolerance: an absolute recognition of pathogens is not required (tolerance to molecular noise) </li></ul><ul><li>Resilience: the immune system (IS) is capable of functioning despite disturbances </li></ul><ul><li>Fault tolerance: the complementary roles of several immune components allow the re-allocation of tasks to other elements </li></ul><ul><li>Robustness: diversity & number of immune elements </li></ul><ul><li>Immune learning and memory: the molecules of the IS can adapt themselves, structurally and in number, to the antigenic challenges </li></ul><ul><li>Predator-prey pattern of response: #pathogens goes up => #immune cells goes up </li></ul><ul><li>Self-organization: clonal selection and affinity maturation are responsible for selecting the most adapted cells to be maintained as long living memory cells </li></ul><ul><li>Integration with other systems: the IS communicates with parts of the body </li></ul>
  66. 66. General Framework for Artificial Immune Systems <ul><li>General Framework for AIS: </li></ul><ul><ul><li>A representation for the components of the system </li></ul></ul><ul><ul><li>A set of mechanisms to evaluate the interaction of individuals with the environment and each other (input stimuli, 1 to N fitness functions or other means) – Affinity measures </li></ul></ul><ul><ul><li>Procedures of adaptation that govern the dynamics of the system (e.g., behavior over time) – Algorithms </li></ul></ul>Reference: L. N. de Castro and J. Timmis, “Artificial Immune Systems: A New Computational Intelligence Approach,” Springer, 2002.
  67. 67. Components of Artificial Immune Systems <ul><li>Representation: </li></ul><ul><ul><li>Generalized shape of any molecule in shape space is described by an attribute string (set of coordinates) of length L. </li></ul></ul><ul><ul><li>Shape-space describes interactions between molecules of the immune system and antigens. </li></ul></ul><ul><ul><li>Immune system is represented as a pattern (molecular) recognition system that is designed to identify shapes. </li></ul></ul><ul><li>Affinity Measures: </li></ul><ul><ul><li>Euclidean, Manhattan and Hamming </li></ul></ul><ul><ul><li>Real-valued, integer, symbolic or alphabet sub-string spaces </li></ul></ul>
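The three distance measures named above can be written over attribute strings of length L as follows. How a distance is turned into an affinity (for example, high affinity for complementary rather than similar shapes) is a modeling choice that the slide does not specify, so only the raw distances are shown; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    # Real-valued shape space: straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # Real-valued or integer shape space: sum of coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    # Symbolic shape space: number of positions at which the strings differ.
    return sum(x != y for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
d_hamming = hamming("0110", "1100")
print(d_euclid, d_manhattan, d_hamming)  # 5.0 7 2
```

All three assume that both attribute strings have the same length L.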
  68. 68. Components of Artificial Immune Systems <ul><li>Immune system algorithms: </li></ul><ul><ul><li>Bone marrow model: generate repertoire of cells and molecules (generate random attribute strings) </li></ul></ul><ul><ul><li>Thymus model: generate repertoire of cells and molecules capable of performing self/non-self discrimination (Positive selection: initialize strings, evaluate affinity and keep strings with affinity < threshold; Negative selection: eliminate strings > threshold) </li></ul></ul><ul><ul><li>Clonal selection algorithms: model the interaction between the IS and the external environment or antigens (similar to GA without crossover, with reproduction and mutation rates that depend on affinity) </li></ul></ul><ul><ul><li>Immune network models: simulate immune networks (differential equations describing the dynamics) </li></ul></ul>
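  69. 69. Examples of Artificial Immune Systems <ul><li>ANN </li></ul><ul><ul><li>Representation: Artificial neurons & interconnection of neurons (summing junction, connection strength, activation) </li></ul></ul><ul><ul><li>Affinity Measures: Backpropagation measures </li></ul></ul><ul><ul><li>Algorithms: Learning algorithms </li></ul></ul><ul><li>GA </li></ul><ul><ul><li>Representation: Genetic representation (gene = a single bit or a block of bits, chromosome = bitstring) </li></ul></ul><ul><ul><li>Affinity Measures: Fitness function </li></ul></ul><ul><ul><li>Algorithms: Procedures for reproduction, genetic variation and selection </li></ul></ul><ul><li>AIS </li></ul><ul><ul><li>Representation: IS representation of molecules (strings of coordinates), and their interactions (shape-space) </li></ul></ul><ul><ul><li>Affinity Measures: Affinity defined on a shape-space </li></ul></ul><ul><ul><li>Algorithms: Procedures for generation, cloning, selection and IS network dynamics </li></ul></ul>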
  70. 70. <ul><li>SUMMARY </li></ul>
  71. 71. Summary: Interdisciplinary Science <ul><li>CS and ECE have been used to gain a better understanding of biological processes through modeling and simulation </li></ul><ul><li>CS and ECE have been enriched with the introduction of biological ideas, e.g., ANN, GA, cellular automata, artificial life, artificial immune systems (AIS) </li></ul><ul><li>New fields: bio-informatics, bio-medical engineering </li></ul><ul><li>Bilateral interactions between CS, ECE and Biology: </li></ul><ul><ul><li>Biologically motivated computing (ANN, GA, artificial immune systems) </li></ul></ul><ul><ul><li>Computationally motivated biology (cellular automata) </li></ul></ul><ul><ul><li>Computing with biological mechanisms (silicon-based computing => quantum and DNA computing) </li></ul></ul>
  72. 72. Summary: Bioinformatics <ul><li>Bioinformatics and Microarray problem </li></ul><ul><ul><li>Interdisciplinary Challenges: Terminology </li></ul></ul><ul><ul><li>Understanding Biology and Computer Science </li></ul></ul><ul><li>Data mining and image analysis steps </li></ul><ul><ul><li>Image Analysis </li></ul></ul><ul><ul><li>Experiment Design as Prior Knowledge </li></ul></ul><ul><ul><li>Expected Results of Data Mining </li></ul></ul><ul><ul><li>Which Data Mining Technique to Use? </li></ul></ul><ul><ul><li>Data Mining Challenges: Complexity, Data Size, Search Space </li></ul></ul><ul><li>Validation </li></ul><ul><ul><li>Confidence in Obtained Results? </li></ul></ul><ul><ul><li>Error Screening </li></ul></ul><ul><ul><li>Cross validation techniques </li></ul></ul><ul><li>Artificial Systems </li></ul><ul><ul><li>Biologically motivated computing </li></ul></ul>
  73. 73. Backup
