Machine Learning

Machine Learning Methods: an overview Master in Bioinformatica – April 9th, 2010 Paolo Marcatili University of Rome “Sapienza” Dept. of Biochemical Sciences “Rossi Fanelli” [email_address] ,[object Object],[object Object],[object Object],[object Object]

Agenda ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Overview

Why Overview ,[object Object],[object Object]

[object Object],[object Object],[object Object],[object Object],Why Overview

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Why Overview

How ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Overview

How ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Probability and statistics are fundamental They provide a solid framework for creating models and acquire knowledge Overview

Datasets ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Overview

Methods Machine Learning can Overview

Methods Machine Learning can Predict unknown function values Overview

Methods Machine Learning can Predict unknown function values Infer classes and assign samples Overview

Methods Machine Learning can not Overview

Methods Machine Learning can not Provide knowledge Overview

Methods Machine Learning can not Provide knowledge Learn Overview

Methods Information is In the data? In the model? Overview

Methods ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Overview

Love all, trust a few, do wrong to none. Overview 4 patients, 4 controls

Love all, trust a few, do wrong to none. Overview 2 more

Love all, trust a few, do wrong to none. Overview 10 more

Assessment Prediction of unknown data! Problems : Few data, robustness. Overview

Assessment ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Overview

Assessment 50% Training set : used to tune the model parameters 25% Test set : used to verify that the machine has “learnt” 25% Validation set : final assessment of the results Unfeasible with few data Overview

Assessment Leave-one-out: for each sample A i Training set : all the samples - { A i } Test set : { A i } Repeat Computationally intensive, good estimate of the mean error high variance Overview

Assessment K-fold cross validation: Divide your data in K subsets S 1..k Training set : all the samples - S i Test set : S i Repeat good compromise Overview

Assessment Sensitivity: TP/ [ TP + FN ] Given the disease is present, the likelihood of testing positive. Specificity: TN / [ TN + FP ] Given the disease is not present, the likelihood of testing negative. Positive Predictive Value : TP / [ TP + FP ] Given a test is positive, the likelihood disease is present Overview

Assessment Sensitivity: TP/ [ TP + FN ] Given the disease is present, the likelihood of testing positive. Specificity: TN / [ TN + FP ] Given the disease is not present, the likelihood of testing negative. Positive Predictive Value : TP / [ TP + FP ] Given a test is positive, the likelihood disease is present receiver operating characteristic (ROC) is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied. Overview

Assessment Overview Sensitivity: TP/ [ TP + FN ] Given the disease is present, the likelihood of testing positive. Specificity: TN / [ TN + FP ] Given the disease is not present, the likelihood of testing negative. Predictive Value Positive: TP / [ TP + FP ] Given a test is positive, the likelihood disease is present receiver operating characteristic (ROC) is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied. Area under ROC (AROC) is often used as a parameter to compare different classifiers

Agenda Supervised Learning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Supervised Learning Supervised Learning Basic Idea: use data+classification of known samples find “fingerprints” of classes in the data

Supervised Learning Supervised Learning Basic Idea: use data+classification of known samples find “fingerprints” of classes in the data Example: use microarray data, different condition classes: genes related/unrelated to different cancer types

Support Vector Machines Supervised Learning Basic idea: Plot your data in an N-dimensional space Find the best hyperplane that separates the different classes Further samples can be classified using the region of the space they belong to

Support Vector Machines Supervised Learning length weight Fail Pass

Support Vector Machines Supervised Learning Fail Pass length weight Fail Pass margin

Support Vector Machines Supervised Learning Optimal Hyperplane (OHP) simple kind of SVM (called an LSVM ) maximum margin Fail Pass length weight Fail Pass margin Support vectors

Support Vector Machines Supervised Learning What if data are not linearly separable? Original Data

Support Vector Machines Supervised Learning What if data are not linearly separable? Allow mismatches soft margins (add a weight matrix) Original Data Original Data

Support Vector Machines Supervised Learning Hyperplane What if data are not linearly separable? weight 2 length 2 weight * length Original Data

Support Vector Machines Supervised Learning Only Inner product is needed to calculate Dual problem and decision function Hypersurface Kernelization Hyperplane What if data are not linearly separable? The Kernel trick! weight 2 length 2 weight * length length Original Data

SVM example Supervised Learning Knowledge-based analysis of microarray gene expression data by using support vector machines Michael P. S. Brown*, William Noble Grundy †‡ , David Lin*, Nello Cristianini § , Charles Walsh Sugnet ¶ , Terrence S. Furey*, Manuel Ares, Jr. ¶ , and David Haussler* We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data. To judge overall performance, we define the cost of using the method M as C ( M ) 5 fp ( M ) 1 2 z fn ( M ), where fp ( M ) is the number of false positives for method M , and fn ( M ) is the number of false negatives for method M . The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives.

Hidden Markov Models Supervised Learning ,[object Object],[object Object],[object Object],[object Object]

Hidden Markov Models Supervised Learning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Hidden Markov Models Supervised Learning

Decision trees Supervised Learning Mimics the behavior of an expert

Mimics the behavior of an expert Decision trees Supervised Learning

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Majority rules! Decision trees Supervised Learning

Random Forests Supervised Learning Split the data in several subsets, construct a DT for each set Each DT expresses a vote, the majority wins Much more accurate and robust (bootstrap)

Random Forests Supervised Learning Split the data in several subsets, construct a DT for each set Each DT expresses a vote, the majority wins Much more accurate and robust (bootstrap) Prediction of protein–protein interactions using random decision forest framework Xue-Wen Chen * and Mei Liu Motivation: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain–domain interactions. Results: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein–protein interactions with higher sensitivity (79.78%) and specificity (64.38%) compared with that of the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.

Bayesian Networks Supervised Learning The probabilistic approach is extremely powerful but requires a huge amount of information/data for a complete representation Not all correlations or cause-effect relationships between variables are significative

Bayesian Networks Supervised Learning The probabilistic approach is extremely powerful but requires a huge amount of information/data for a complete representation Not all correlations or cause-effect relationships between variables are significative Consider only meaningful links!

Bayesian Networks Supervised Learning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Bayesian Networks Supervised Learning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Bayes Theorem again!

Bayesian Networks Supervised Learning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Neural Networks Supervised Learning Neural Networks interpolate functions They have nothing to do with brains

Neural Networks Supervised Learning Parameter settings:

Neural Networks Supervised Learning Parameter settings: avoid overfitting

Neural Networks Supervised Learning Parameter settings: avoid overfitting Learning --> validation --> usage No underlying model, but it often works

Neural Networks Supervised Learning Protein Disorder Prediction: Implications for Structural Proteomics Rune Linding, 1,4, * Lars Juhl Jensen, 1,2,4 Francesca Diella, 3 Peer Bork, 1,2 Toby J. Gibson, 1 and Robert B. Russell 1 Abstract A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of “hot loops,” i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.

Agenda Unsupervised Learning ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Unsupervised Learning Unsupervised Learning If we have no idea of actual data classification, we can try to guess

Clustering Unsupervised Learning Put together similar objects to define classes

Clustering Unsupervised Learning K-means Hierarchical top-down Hierarchical down-up Fuzzy Put together similar objects to define classes How?

Clustering Unsupervised Learning Euclidean Correlation Spearman Rank Manhattan Put together similar objects to define classes Which metric? How?

Clustering Unsupervised Learning Put together similar objects to define classes Which metric? Which “shape”? Compact Concave Outliers Inner radius cluster separation How?

Hierarchical Clustering Unsupervised Learning ,[object Object],[object Object],[object Object]

Hierarchical Clustering Unsupervised Learning ,[object Object],[object Object],[object Object],

K-means Unsupervised Learning ,[object Object],[object Object],[object Object],[object Object]

PCA Unsupervised Learning Multidimensional data (hard to visualize) Data variability is not equally distributed

PCA Unsupervised Learning Multidimensional data (hard to visualize) Data variability is not equally distributed Correlation between variables Change coordinate system, remove correlation retain only most variable coordinates How: (generalized eigenvectors, SVD) Pro: noise (and information) reduction

Agenda Caveats ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Data independence Training set, Test set and Validation set must be clearly separated E.g. neural network to infer gene function from sequence training set: annotated gene sequences, deposit date before Jan 2007 test set: annotated gene sequences, deposit date after Jan 2007 But annotation of new sequences is often inferred from old sequences! Caveats

Biases Data should be unbiased, i.e. it should be a good sample of our “space” E.g. neural network to find disordered regions training set: solved structures, residues in SEQRES but not in ATOM But solved structures are typically small, globular, cytoplasmatic proteins Caveats

Take-home message ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Caveats

References ,[object Object],[object Object],[object Object],[object Object]

Bayes Theorem Supplementary a) AIDS is affecting 0,01% of population. b) The AIDS test, when performed on patients, is correct 99,9% of times. b) The AIDS test, when performed on uninfected people, is correct 99,99% of times. If a person has a positive test, how likely is it for him to be infected? A B A  B E

Bayes Theorem Supplementary a) AIDS is affecting 0,01% of population. b) The AIDS test, when performed on patients, is correct 99,9% of times. b) The AIDS test, when performed on uninfected people, is correct 99,99% of times. If a person has a positive test, how likely is it for him to be infected? P(A|T) =P(T|A)*P(A) / (P(T|A)*P(A) + P(T|¬A)*P(¬A)) P(A|T) = 49.97% A B A  B E

Machine Learning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Machine Learning

Similar to Machine Learning (20)

More from Paolo Marcatili

More from Paolo Marcatili (9)

Machine Learning