Hugh Shanahan Association Talk
 


Talk I gave at this year's CCC Summer School at Zhejiang University, Hangzhou



Usage Rights

© All Rights Reserved


    Hugh Shanahan Association Talk: Presentation Transcript

    • Associative methods in Systems Biology
      Hugh Shanahan
      Department of Computer Science, Royal Holloway, University of London
      September 22, 2009
    • Outline
      1 Outline
      2 Gene Ontologies
      • Over-representation
      • Semantic similarity
      3 Associative Measures
      • Hypotheses
      • Linear Correlation
      • Partial Correlation
      • Non-linear measures
      4 Validation
      • DREAM
    • Gene Ontologies
      Before finding interactions, we need to be able
      • to systematically annotate all genes,
      • to determine which functional groupings are over-represented,
      • to measure objectively the “functional similarity” of two genes.
      Gene Ontology (GO) is a means to do this.
    • Ontologies
      Abstract method for expressing structured data.
      Annotation of any gene can be expressed in terms of increasingly specific descriptions, e.g.
      • This gene is involved in transport.
      • This gene is involved in vesicle-mediated transport.
      • This gene is involved in vesicle fusion.
      Genes may not have a specific annotation, so the definition can stop higher up in this hierarchy.
    • More complexity required
      Annotation is not a simple chain.
      • A single gene can have a very specific annotation, which comes from two (or more) more general descriptions.
      • There are different types of annotation as well: location, biochemistry, part of organism expressed in, and so on.
      An Ontology is a Directed Acyclic Graph (DAG), not a Tree (this means a lot to Graph Theorists).
      • Each node in the DAG is an annotation term.
      • Each “child” node can have more than one “parent” node.
    • GOs visualised
      [Figure from Rhee et al., Nature Reviews Genetics, (2008). Panel c shows the nodes Biological process (root), Transport, Membrane organization and biogenesis, Vesicle-mediated transport, Membrane fusion and Vesicle fusion, linked by is_a and part_of edges.]
      Figure 1 | Simple trees versus directed acyclic graphs. Boxes represent nodes and arrows represent edges. a | An example of a simple tree, in which each child has only one parent and the edges are directed, that is, there is a source (parent) and a destination (child) for each edge. b | A directed acyclic graph (DAG), in which each child can have one or more parents. The node with multiple parents is coloured red and the additional edge is coloured grey. c | An example of a node, vesicle fusion, in the biological process ontology with multiple parentage. The dashed edges indicate that there are other nodes not shown between the nodes and the root node (biological process). A root is a node with no incoming edges, and at least one leaf (also called a sink). A leaf node is a node with no outgoing edges, that is, a terminal node with no children (vesicle fusion). Similar to a simple tree, a DAG has directed edges and does not have cycles, that is, no path starts and ends at the same node, and will always have at least one root node. The depth of a node is the length of the longest path from the root to that node, whereas the height is the length of the longest path from that node to a leaf [41]. is_a and part_of are types of relationships that link the terms in the GO ontology. More information about the relationships between GO terms is found online (An Introduction to the Gene Ontology).
    • GOs visualised
      [Figure from http://amigo.geneontology.org/]
    • Different types of Annotation
      Typically, there are three distinct ontologies (overwhelmingly used):
      • Cellular Component
      • Biological Process
      • Molecular Function
      Many other ontologies have been constructed, e.g. Plant Organ for Arabidopsis.
    • Caveat
      The annotation of most genes (90%) has been carried out computationally. The annotations usually work pretty well, though they tend not to be as accurate as those obtained by direct assay.
      All annotated genes have an evidence code (IED) associated with them in order to demonstrate how much we can rely on the annotation.
      Increasingly, co-expression is being used as a means to annotate genes, so one should be careful not to use this information in constructing annotations!
    • Over-representation
      One of the most useful tools to hand when one analyses micro-array data is to ask what functional groupings occur more often than one expects.
      Notation
      • N: number of genes in the genome.
      • n: number of genes which have been found to be differentially expressed.
      • m: number of genes in the genome with a specific annotation.
      • k: number of genes which are differentially expressed with the same annotation.
    • Probabilities
      One can derive the probability P_k that k genes would be found by chance amongst n genes using the hypergeometric probability distribution and the above parameters.
      For the record,
      \[ P_k = \frac{{}^{m}C_{k}\;{}^{N-m}C_{n-k}}{{}^{N}C_{n}} , \tag{1} \]
      \[ {}^{N}C_{n} = \frac{N!}{(N-n)!\,n!} . \tag{2} \]
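The hypergeometric probability above can be evaluated directly with integer binomial coefficients. The following is a minimal sketch, not the speaker's own code: the function names and the toy numbers are assumptions made for illustration, and the upper-tail sum (the probability of k or more annotated genes) is a commonly used p-value that the slide itself does not spell out.

```python
from math import comb


def hypergeom_pmf(N: int, n: int, m: int, k: int) -> float:
    """P_k from equation (1): chance of exactly k annotated genes among n drawn."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def over_representation_pvalue(N: int, n: int, m: int, k: int) -> float:
    """Upper tail (an assumption, see lead-in): chance of k or more annotated genes."""
    upper = min(n, m)
    return sum(hypergeom_pmf(N, n, m, j) for j in range(k, upper + 1))


# Toy numbers (hypothetical): 10 genes in the genome, 3 differentially expressed,
# 4 carrying the annotation, 2 of the differentially expressed ones carrying it.
p_exact = hypergeom_pmf(10, 3, 4, 2)
p_tail = over_representation_pvalue(10, 3, 4, 2)
print(p_exact, p_tail)
```

For these toy numbers the exact term is 6 x 6 / 120 = 0.3, and the tail adds the k = 3 term, 4 / 120.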
    • Difficulties
      • There are thousands of possible GO terms and one should adjust the probabilities to deal with multiple hypotheses.
      • Applying Bonferroni correction using all GO terms gives a p-value of 10^{-7} equivalent to 1% significance.
      • Because of the structure of the GO terms these probabilities are highly correlated with each other.
      • In all these cases, focussing on as short a list of possible biological processes as possible will minimise the above difficulties.
    • What genes match
      In benchmarking methods to infer interactions between gene products, we expect genes that interact to have similar GO terms, though perhaps not entirely the same.
      Semantic Similarity is a means to measure how similar the annotations of two genes are (0 being no similarity, 1 meaning total similarity).
      GO provides us with a means to do this in a quantitative fashion.
    • Simple implementation
      Determine the ratio of the number of nodes two genes share to the total number of nodes they have between them:
      \[ \mathrm{GOsim}_{UI} = \frac{\#\{N(G_1) \cap N(G_2)\}}{\#\{N(G_1) \cup N(G_2)\}} \tag{3} \]
      N(G_1) being the set of nodes associated with G_1's annotation.
      More elaborate methods are available.
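The ratio in equation (3) can be sketched on a toy ontology. The example below is an illustration under stated assumptions: the `PARENTS` table is a hand-written fragment using the term names from the vesicle fusion figure, intermediate terms are left out, and is_a and part_of edges are treated alike. The function names are hypothetical.

```python
# Toy fragment of the biological-process ontology (hypothetical, simplified).
# Each term maps to its parent terms; edge types are not distinguished.
PARENTS = {
    "biological process": [],
    "transport": ["biological process"],
    "membrane organization and biogenesis": ["biological process"],
    "vesicle-mediated transport": ["transport"],
    "membrane fusion": ["membrane organization and biogenesis"],
    "vesicle fusion": ["vesicle-mediated transport", "membrane fusion"],
}


def induced_nodes(terms):
    """N(G): the annotated terms plus every ancestor up to the root."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def go_sim_ui(terms_1, terms_2):
    """Equation (3): shared nodes divided by all nodes of the two genes."""
    n1, n2 = induced_nodes(terms_1), induced_nodes(terms_2)
    return len(n1 & n2) / len(n1 | n2)


# One gene annotated to the specific term, one annotated higher up.
sim = go_sim_ui(["vesicle fusion"], ["vesicle-mediated transport"])
print(sim)
```

Here the first gene induces six nodes and the second three, all of them shared, so the similarity is 3/6 = 0.5.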
    • Motivation
      • Yesterday, we encountered clustering.
      • Hypothesis 1 (weak): coexpression implies involvement in the same process.
      • Expand to many different experiments.
      • Hypothesis 2 (strong): the greater the level of association, the greater the chance of interaction.
      • Hypothesis 2 is often referred to as “guilt by association”.
      • Association may tell us about interactions between gene products. It does not tell us anything about the regulation mechanism.
    • [Figure from http://www.arabidopsis.leeds.ac.uk/act/index.php]
      • 266841_at AT2G26150: heat shock transcription factor family protein, contains Pfam profile PF00447 (HSF-type DNA-binding domain)
      • 260978_at AT1G53540: 17.6 kDa class I small heat shock protein
    • What do we mean by association?
      Knowing something about the expression level of one gene (over many different experiments) means we know something about the expression level of the other.
      Replotting the above: [Figure]
    • Linear Correlation coexpression
      • Simplest form of association.
      • Assume that there is a linear relationship between genes.
      • Formally:
        \[ y_1 = a_{12} + c_{12} y_2 + \eta_{12} , \tag{4} \]
        where y_1, y_2 are (log) expression levels, \eta_{12} is a noise term, and a_{12}, c_{12} are parameters to be determined.
      • But we're not interested in that!
      • We are interested in asking how good a model this is for this pair of genes.
    • Covariance
      • Can estimate how good the linear model is by computing
        \[ E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big) , \]
        where \bar{y}_1 = E(y_1) and \bar{y}_2 = E(y_2) are the means of y_1 and y_2.
      • E means the expectation value of the above (think of it for now as taking the average over all the points in the previous figure).
      • Can prove to oneself (exercise) that the magnitude of the covariance is largest when y_1 can be perfectly expressed as a linear function of y_2.
      • The covariance is zero when there is no relationship at all between y_1 and y_2.
    • [Figure: two scatter plots of y2 against y1, labelled “Maximum covariance” and “Zero covariance”.]
    • Correlation
      We want to compare every possible pair of genes, so using the covariance is not very practical since the maximum covariance will vary from one pair of genes to another. However,
      \[ \rho_{12} = \frac{E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big)}{\sqrt{E\big((y_1 - \bar{y}_1)^2\big)\,E\big((y_2 - \bar{y}_2)^2\big)}} , \tag{5} \]
      is bounded: −1 ≤ \rho_{12} ≤ 1.
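A small numerical sketch may help show why the covariance is awkward to compare across gene pairs while the correlation is not. It uses the usual Pearson form, with a square root in the denominator so that the result lies between −1 and 1. The data values and function names are invented for illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(y1, y2):
    """E((y1 - mean(y1)) (y2 - mean(y2))), averaging over all points."""
    m1, m2 = mean(y1), mean(y2)
    return mean([(a - m1) * (b - m2) for a, b in zip(y1, y2)])


def correlation(y1, y2):
    """Covariance scaled by the two standard deviations (Pearson correlation)."""
    return covariance(y1, y2) / sqrt(covariance(y1, y1) * covariance(y2, y2))


# Hypothetical (log) expression levels; both partners are exact linear functions of y1.
y1 = [0.6, 0.9, 1.1, 1.4, 1.8]
y2_up = [2.0 * v + 1.0 for v in y1]
y2_down = [-3.0 * v + 5.0 for v in y1]

cov_up = covariance(y1, y2_up)
cov_down = covariance(y1, y2_down)
rho_up = correlation(y1, y2_up)
rho_down = correlation(y1, y2_down)
print(cov_up, cov_down, rho_up, rho_down)
```

Both pairs are perfectly linear, yet their covariances differ in size because the slopes differ; the correlations come out at the two bounds, up to floating-point rounding.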
    • How well does it work?
      • There are a number of examples of improved functional annotation.
      • An unannotated gene which is highly correlated with a gene in a known response is likely to be in the same response.
    • Difficulty: genes correlate with many other genes, not just a few.
      Why?
      Suggestion: correlations do not distinguish between potential direct interactions and indirect interactions between gene products.
    • Example
      [Figure: interaction network with nodes A, B, C, D, E and F, plus a label “Other interactions”.]
      • B directly interacts with three other genes, but could be highly correlated with others.
      • C and D would be highly correlated with each other even though they are not directly interacting.
    • Artificial example: create randomised data to represent expression of B.
      Generate two other sets of data (C and D) that are by construction highly correlated to the original data set, but are not set out to have a relationship with each other.
      [Figure: three scatter plots. C and D against B: ρ = 0.98 in each case. D against C: ρ = 0.96 (!!!)]
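The construction described in this artificial example can be imitated in a few lines. This is only a sketch under assumptions: the Gaussian noise model, the noise level, the sample size and the seed are all chosen here for illustration and are not taken from the talk.

```python
import random
from math import sqrt


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
n_points = 2000          # hypothetical sample size
noise_sd = 0.2           # hypothetical noise level

B = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
# C and D each copy B and add their own independent noise;
# nothing links C to D except their common dependence on B.
C = [b + rng.gauss(0.0, noise_sd) for b in B]
D = [b + rng.gauss(0.0, noise_sd) for b in B]

rho_BC = correlation(B, C)
rho_BD = correlation(B, D)
rho_CD = correlation(C, D)
print(rho_BC, rho_BD, rho_CD)
```

With this noise model the large-sample values are 1/sqrt(1.04), about 0.98, for B with C or D, and 1/1.04, about 0.96, for C with D, so C and D look strongly associated although neither was built from the other.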
    • Extending correlations: partial correlations
      • Correlations only take pairs of genes into consideration.
      • Partial correlations extend the initial pairwise regression model introduced in equation (4):
        \[ y_1 = a_1 + b_{12} y_2 + b_{13} y_3 + \cdots + b_{1p} y_p + \eta_1 . \tag{6} \]
      • Again, we are not interested in solving this explicitly.
      • We want to understand the correlation that each one of the genes y_2 ... y_p has with y_1 once we have removed the effect of all the other genes.
      • We will use the notation PC_{ij} to refer to this partial correlation.
    • Derivation
      Computed easily once you have all the correlations between all the genes.
      \[ R = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \cdots \\ \rho_{12} & 1 & \rho_{23} & \cdots \\ \rho_{13} & \rho_{23} & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} , \tag{7} \]
      Covariance matrix C is defined similarly.
      \rho_{ij} is the correlation between gene i and gene j.
      \[ PC_{ij} = -\frac{R^{-1}_{ij}}{\sqrt{R^{-1}_{ii}\,R^{-1}_{jj}}} \tag{8} \]
    • Questions
      • \rho_{ij} = \rho_{ji}: why?
      • Diagonals are 1: why?
      • Exercise: compute PC using the covariance matrix.
    • Artificial example
      \[ R_{BCD} = \begin{pmatrix} 1.0 & 0.96 & 0.98 \\ 0.96 & 1.0 & 0.98 \\ 0.98 & 0.98 & 1.0 \end{pmatrix} , \tag{9} \]
      \[ PC_{BCD} = \begin{pmatrix} -1.0 & -0.01 & 0.70 \\ -0.01 & -1.0 & 0.70 \\ 0.70 & 0.70 & -1.0 \end{pmatrix} . \tag{10} \]
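The partial correlation matrix in this example can be recomputed from the correlation matrix by inverting R and rescaling, using the standard form with a square root in the denominator. The sketch below uses a hand-written Gauss-Jordan inverse so that it needs no external library; the function names are assumptions, and for real data a numerical linear algebra package would be the more sensible choice.

```python
from math import sqrt


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(R):
    """PC_ij = -Rinv_ij / sqrt(Rinv_ii * Rinv_jj), where Rinv is the inverse of R."""
    inv = invert(R)
    n = len(R)
    return [[-inv[i][j] / sqrt(inv[i][i] * inv[j][j]) for j in range(n)]
            for i in range(n)]


# Correlation matrix as given in equation (9).
R = [[1.00, 0.96, 0.98],
     [0.96, 1.00, 0.98],
     [0.98, 0.98, 1.00]]
PC = partial_correlations(R)
for row in PC:
    print([round(v, 2) for v in row])
```

The pair whose pairwise correlation is 0.96 ends up with a partial correlation close to zero, while the two pairs with 0.98 keep a partial correlation of about 0.70, in line with equation (10).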
    • Disadvantages of using Partial correlations
      • High partial correlations no longer tend to go to 1 (or −1).
      • Dependent on ranking.
      • How large should/can p (the number of genes examined) be?
      • Taking inverses of matrices should make us jumpy, especially when there is limited data.
      • Problem also dates to computing correlations.
    • “Large p, small n”
      Notation
      • p: number of variables (in this case, expression of a gene)
      • n: number of measurements (total number of affy slides)
      R has of the order p^2 (p(p − 1)/2 to be exact) potentially interesting correlations.
      • Could be dealing with of the order 10,000^2 variables.
      • Have at best a few thousand measurements per gene: n ∼ 1000.
      • If p ≪ n, then the definition of equation (5) gives a robust estimate of all those correlations, but that is not where we are!
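To make the scale concrete, here is the count of distinct pairwise correlations, assuming p = 10,000 genes as an illustrative round number:

```latex
\[
  \frac{p(p-1)}{2} = \frac{10\,000 \times 9\,999}{2} = 49\,995\,000 \approx 5 \times 10^{7}
\]
```

That is tens of millions of correlations to be estimated from roughly n ∼ 1000 measurements per gene, so under this assumption p/n is about 10.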
    • Artificial example
      [Figure: four panels of ordered eigenvalues, for p/n = 0.1, p/n = 0.5, p/n = 2 and p/n = 10. Vertical axis: eigenvalue.]
      Figure 1: Ordered eigenvalues of the sample covariance matrix S (thin black line) and that of an alternative estimator S* (fat green line, for definition see Tab. 1), calculated from simulated data with underlying p-variate normal distribution, for p = 100 and various ratios ...
      Schäfer and Strimmer, Statistical Applications in Genetics and Molecular Biology, 4, 1, (2005)
    • Explanation
      • Spectrum of eigenvalues.
      • Any eigenvalue equal to zero means the matrix is non-invertible.
      • Dashed lines: actual eigenvalues.
      • Thin black lines: estimated eigenvalues using equation (5).
      • Green line: improved estimator.
      • In general, if n < p then the correlation/covariance matrix is non-invertible.
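The last point can be checked on a tiny case. The sketch below builds the sample covariance matrix of p = 3 genes from n = 2 measurements and then from n = 4, using exact fractions so that a zero determinant is exactly zero. The expression values and function names are invented for illustration.

```python
from fractions import Fraction


def sample_covariance(rows):
    """Sample covariance; rows are measurements, columns are genes."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Hypothetical expression values: n = 2 measurements of p = 3 genes.
few = [[Fraction(1), Fraction(4), Fraction(2)],
       [Fraction(3), Fraction(1), Fraction(5)]]
# The same three genes with two further measurements, n = 4.
more = few + [[Fraction(2), Fraction(2), Fraction(7)],
              [Fraction(5), Fraction(3), Fraction(1)]]

det_few = det3(sample_covariance(few))
det_more = det3(sample_covariance(more))
print(det_few, det_more)
```

With only two measurements every centred data point lies along a single direction, so the covariance matrix has rank one and its determinant vanishes; with four measurements the determinant is positive and the matrix can be inverted.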
    • Strategies
      • Reduce p: cluster data initially, then perform analysis on each cluster (Toh and Horimoto, 2002).
      • Compute lower order partial correlations: compute first order partial correlations (Magwene and Kim, 2004).
      • Employ improved estimator of correlations (Schäfer and Strimmer, 2005).
      • These options are not necessarily exclusive.
    • Shrinkage estimate - wordy explanation
      What is computed in equation (5) is an estimate of the correlation based on the available data, not the actual correlation we would obtain if we knew the underlying multi-variate distribution. They would coincide if we had much greater statistics. That said, we can use other estimators of correlation. Statisticians have pointed out that many other possible estimators can be used which work better in the regime we are in (large p, small n).
      Shrinkage estimates attempt to combine different naive estimates to get an improved estimate. The principle has been around for some time (Stein, 1956), though its use has increased significantly in the last ten years.
    • Shrinkage estimates - the details
      Notation:
      • C is the actual covariance matrix.
      • \hat{C} is an estimate of the covariance matrix.
      • In computing \hat{C} we could either attempt to compute it using the standard definition (“full”) or assume (for example) that all the off-diagonal entries are zero (“reduced”).
      • \hat{C}_F for the “full” estimate of the covariance matrix.
      • \hat{C}_R for the “reduced” estimate of the covariance matrix.
    • Mean Square Error
      Defining
      \begin{align}
        \mathrm{MSE}(\hat{C}) &= E\big((\hat{C} - C)^2\big) , \tag{11} \\
                              &= \mathrm{Var}(\hat{C}) + \mathrm{Bias}^2(\hat{C}) . \tag{12} \\
        \mathrm{Var}(\hat{C}) &= E\big((\hat{C} - E(\hat{C}))^2\big) , \tag{13} \\
        \mathrm{Bias}(\hat{C}) &= E(\hat{C}) - C . \tag{14}
      \end{align}
      (Expectation operator is over the data that we have.)
      • \mathrm{Bias}(\hat{C}_F) is small but \mathrm{Var}(\hat{C}_F) will be large.
      • \mathrm{Var}(\hat{C}_R) will be small but \mathrm{Bias}(\hat{C}_R) will be large.
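The step from the mean square error to variance plus squared bias is the standard decomposition, obtained by adding and subtracting the expected value of the estimator. As a short worked derivation, read entry by entry for a single element of the matrix:

```latex
\begin{align*}
  \mathrm{MSE}(\hat{C})
    &= E\big((\hat{C} - C)^2\big) \\
    &= E\big((\hat{C} - E(\hat{C}) + E(\hat{C}) - C)^2\big) \\
    &= E\big((\hat{C} - E(\hat{C}))^2\big)
       + 2\,\big(E(\hat{C}) - C\big)\,E\big(\hat{C} - E(\hat{C})\big)
       + \big(E(\hat{C}) - C\big)^2 \\
    &= \mathrm{Var}(\hat{C}) + \mathrm{Bias}^2(\hat{C}) ,
\end{align*}
```

The cross term drops out because E(\hat{C} - E(\hat{C})) = 0, and E(\hat{C}) - C is a constant that can be taken outside the expectation.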
    • The problem
      Depending on the assumptions we use to estimate the correlation/covariance matrix
      • we can either compute a very poor estimate of the parameters in a very accurate model,
      • or compute a good estimate of the parameters for a very inaccurate model (!)
      But maybe we can reconcile the two...
    • Combining the two
      One can combine these two estimates:
      \begin{align}
      C^* &= \lambda \hat{C}_R + (1 - \lambda) \hat{C}_F, \tag{15} \\
      0 &\le \lambda \le 1, \tag{16}
      \end{align}
      choosing a $\lambda$ such that $\mathrm{MSE}(C^*)$ is minimised.
        • Computing $\lambda$ is normally very expensive.
        • Ledoit and Wolf (2003) came up with a short analytical way of computing $\lambda$; Schäfer and Strimmer modified this for genomic data.
        • We have an R package to do this.
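A minimal sketch of the combination in equation (15), written in Python rather than R. It assumes the reduced estimate is the diagonal of the sample covariance, and it estimates $\lambda$ with the closed-form expression that, as we understand it, Schäfer and Strimmer give for that case (the sum of the estimated variances of the off-diagonal sample covariances divided by the sum of their squares). Treat that formula and the function name `shrinkage_covariance` as assumptions, and refer to the original paper or the R package for the definitive version.

```python
import numpy as np


def shrinkage_covariance(X):
    """Shrink the sample covariance of X (n samples x p variables) towards its diagonal.

    Returns (C_star, lam). The expression used for lam is an assumption (see text).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)

    # w[k, i, j] = product of the centred values of variables i and j in sample k
    w = Xc[:, :, None] * Xc[:, None, :]
    w_bar = w.mean(axis=0)
    C_full = n / (n - 1) * w_bar             # unbiased sample covariance ("full")
    C_reduced = np.diag(np.diag(C_full))     # off-diagonal entries set to zero ("reduced")

    # Estimated variance of each entry of C_full
    var_C = n / (n - 1) ** 3 * ((w - w_bar) ** 2).sum(axis=0)

    off = ~np.eye(p, dtype=bool)
    lam = var_C[off].sum() / (C_full[off] ** 2).sum()
    lam = float(np.clip(lam, 0.0, 1.0))      # enforce equation (16)

    C_star = lam * C_reduced + (1.0 - lam) * C_full   # equation (15)
    return C_star, lam


rng = np.random.default_rng(1)
X = rng.normal(size=(8, 20))   # synthetic data: n = 8 samples, p = 20 variables
C_star, lam = shrinkage_covariance(X)
print("lambda =", round(lam, 3))
print("smallest eigenvalue of C*:", np.linalg.eigvalsh(C_star).min())
```

With fewer samples than variables the full estimate is singular, whereas the combined estimate is invertible whenever $\lambda > 0$, which is what makes partial correlations computable in the large p, small n regime.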
    • Final Note
      The Schäfer and Strimmer estimate uses the “zero-covariance” low-dimensional model, but this isn’t necessarily the best choice.
      Notably, while shrinkage estimates make much of incorporating information, they don’t explicitly include biological information.
    • Using E. coli time series data (8 time slices), Schäfer and Strimmer examined 102 genes using correlations and partial correlations.
      [Figure, from Schäfer and Strimmer, “Large-Scale Covariance Matrix Estimation”: gene networks inferred from the E. coli data. Panels: (a) Shrinkage GGM network - Partial Correlations (Graphical Gaussian Network); (c) Relevance network - Correlations (Relevance Network), edges with abs(r) > 0.8.]
    • Results of comparison
        • Recover centrality of sucA gene.
        • lacA, lacZ and lacY genes have the largest absolute partial correlations.
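The difference between the two kinds of network can be illustrated on synthetic data. This sketch is not the authors' analysis: it simulates a hypothetical chain x → y → z, builds a relevance network by thresholding correlations at abs(r) > 0.8 (the threshold quoted in the figure), and computes partial correlations from the inverse of the correlation matrix. The partial-correlation threshold of 0.2 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = x + 0.3 * rng.normal(size=n)   # y depends directly on x
z = y + 0.3 * rng.normal(size=n)   # z depends directly on y only
data = np.column_stack([x, y, z])

cor = np.corrcoef(data, rowvar=False)

# Partial correlations from the inverse of the correlation matrix
prec = np.linalg.inv(cor)
d = np.sqrt(np.diag(prec))
pcor = -prec / np.outer(d, d)
np.fill_diagonal(pcor, 1.0)

relevance_edges = np.abs(cor) > 0.8    # relevance network
ggm_edges = np.abs(pcor) > 0.2         # graphical Gaussian network (illustrative threshold)

print("correlation x-z:        ", round(cor[0, 2], 3))
print("partial correlation x-z:", round(pcor[0, 2], 3))
```

In this toy chain the relevance network links x and z because they are strongly correlated, while the partial correlation between them, which conditions on y, is close to zero, so the indirect edge drops out.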
    • So far, we have concerned ourselves with linear relationships.
        • However, such an approximation may not be valid.
        • Naively, one expects a more non-linear relationship between gene products.
        • For example, Transcription Factor - target interactions are typically modelled using Michaelis-Menten kinetics.
        • Furthermore, expression levels are derived after a number of transformations.
    • One approach: Spearman correlation
        • Basic idea: use ranks rather than raw data.
        • Then apply nearly the same definition as for linear (Pearson) correlation.
        • Must be careful about ties, i.e. raw data having precisely the same value (unlikely for floating point data).
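As a sketch of the idea (not code from the talk), Spearman correlation can be computed by replacing each value with its rank, giving tied values the average of the ranks they occupy, and then applying the Pearson formula to the ranks. The saturating curve below is a hypothetical Michaelis-Menten-like response chosen purely for illustration.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they occupy."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Pearson correlation applied to the ranks of x and y."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]


x = np.linspace(0.1, 10.0, 20)
y = x / (1.0 + x)                      # monotonic but strongly non-linear in x
pearson = np.corrcoef(x, y)[0, 1]
print("Pearson: ", round(pearson, 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because the relationship is monotonic, the ranks of x and y agree exactly and the Spearman value is 1 (up to rounding), while the Pearson value is pulled below 1 by the curvature.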
    • Comparison
      [Figure: Comparison of different measures.]
      Many other methods for non-linear measures are possible, the best known being Mutual Information.
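Mutual Information can pick up dependence that is not even monotonic. The following is a rough sketch using a simple histogram (plug-in) estimate; the number of bins and the test relationship y = x² are arbitrary choices for illustration, and more careful estimators exist.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of the mutual information of x and y, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())


rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2                                # dependent on x, but not monotonically
z = rng.uniform(-1.0, 1.0, size=2000)     # independent of x

pearson_xy = np.corrcoef(x, y)[0, 1]
mi_xy = mutual_information(x, y)
mi_xz = mutual_information(x, z)
print("Pearson(x, y):", round(pearson_xy, 3))
print("MI(x, y):", round(mi_xy, 3), " MI(x, z):", round(mi_xz, 3))
```

Here the linear correlation between x and y is close to zero even though y is a deterministic function of x, whereas the Mutual Information estimate is clearly larger for the dependent pair than for the independent one.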
    • Modelling of gene interactions
      We have only just touched upon methods for inferring interactions between gene products using transcriptomic data. Some of the others include the use of
        • Mutual Information/Spearman Correlations - addresses non-linearities.
        • Kinetic models - attempt to infer interactions.
        • Boolean Networks - model interactions as circuitry.
        • Petri Nets - Prof. Ming Chen.
        • Bayesian Networks - Dr. Chris Needham.
        • Machine Learning methods - Unsupervised/semi-supervised/supervised learning.
        • Integration of other data sources.
        • ...
    • Explosion of methods
      [Figure: The number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” On average, this number has been doubling every two years since about 1995.]
      Stolovitzky et al., Ann. N.Y. Acad. Sci. (2009)
    • Validation - DREAM
      While there are a colossal number of methods out there, the validation of them is very much in its infancy. DREAM (Dialogue for Reverse Engineering Assessments and Methods) is an attempt to deal with this question.
      Features: Unseen experimental data of (for example)
        • Transcription Factor binding sites,
        • artificial data (in silico),
        • genome-wide interactions,
      is gathered and groups are invited to reproduce the interactions. Different groups' results are then compared against the data to determine how well they did. New challenges are presented on an annual basis.
    • Some Results from DREAM 2 (2007)
      Challenge 1:
        • Identify targets of transcription factor BCL6.
        • 53 genuine targets of BCL6 inferred from unpublished ChIP-chip data.
        • 147 decoys added.
        • Task: identify the genuine targets from decoys by picking genes with similar expression patterns to BCL6.
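The task can be sketched as a ranking problem. The example below uses entirely synthetic expression data (it does not use any DREAM data): 53 hypothetical targets follow a made-up transcription factor profile plus noise, 147 decoys are independent noise, and candidates are ranked by absolute Pearson correlation with the factor. The sample count, the noise level and the precision-at-k score are assumptions for illustration, not the DREAM scoring scheme.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_targets, n_decoys = 50, 53, 147

tf = rng.normal(size=n_samples)                                  # synthetic factor profile
targets = tf + 0.3 * rng.normal(size=(n_targets, n_samples))     # follow the factor, plus noise
decoys = rng.normal(size=(n_decoys, n_samples))                  # unrelated to the factor
genes = np.vstack([targets, decoys])
is_target = np.arange(len(genes)) < n_targets                    # ground truth labels

# Pearson correlation of every candidate gene with the factor profile
genes_c = genes - genes.mean(axis=1, keepdims=True)
tf_c = tf - tf.mean()
r = genes_c @ tf_c / (np.linalg.norm(genes_c, axis=1) * np.linalg.norm(tf_c))

ranking = np.argsort(-np.abs(r))          # most strongly correlated candidates first
top_k = ranking[:n_targets]
precision_at_k = is_target[top_k].mean()  # fraction of the top 53 that are true targets
print("precision among the top", n_targets, "candidates:", precision_at_k)
```

The toy data are constructed to be easily separable; real expression data are far noisier, so this only shows the mechanics of ranking candidates and scoring them against a gold standard.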
    • Results
        • Best approaches were selective in the data sets employed. (Data sets where BCL6 was highly expressed or not expressed at all were used.)
        • Semi-supervised learning was employed - using known targets of BCL6 to train the best method.
        • Used correlations.
        • 1st-order partial correlations did badly.
        • Basic correlations came close to the most sophisticated approaches.
    • More results from DREAM
      Challenge 2: E. coli network.
        • RegulonDB provides good evidence for the targets of assorted Transcription Factors.
        • Task: identify targets.
      Results
        • Best method used Mutual Information and ideas behind 1st-order partial correlation.
        • Correlations and Partial Correlations were not too far behind.
        • Level of identification of targets was low - perhaps 5%.
    • Conclusions
        • GO terms allow us to handle large amounts of annotation in a structured fashion.
        • Associative measures are a first attempt at using the huge amounts of expression data that are out there.
        • Very simple ideas such as correlation work surprisingly well (or rather, more complicated methods of association don’t give orders of magnitude better performance).
        • A long way to go nonetheless.
        • The type of expression data we select; a clear understanding of what microarray/RNA-seq/... technology; may be even more important.