Hugh Shanahan Association Talk

Talk I gave at this year's CCC Summer School at Zhejiang University, Hangzhou

  1. 1. Associative methods in Systems Biology
     Hugh Shanahan
     Department of Computer Science, Royal Holloway, University of London
     September 22, 2009
  2. 2. Outline
     1 Outline
     2 Gene Ontologies: Over-representation; Semantic similarity
     3 Associative Measures: Hypotheses; Linear Correlation; Partial Correlation; Non-linear measures
     4 Validation: DREAM
  3. 3. Gene Ontologies
     Before finding interactions, we need to be able to:
       • systematically annotate all genes
       • determine which functional groupings are over-represented
       • measure objectively the “functional similarity” of two genes.
     The Gene Ontology (GO) is a means to do this.
  4. 4. Ontologies
     An ontology is an abstract method for expressing structured data.
     The annotation of any gene can be expressed as a series of increasingly specific descriptions, e.g.
       • This gene is involved in transport.
       • This gene is involved in vesicle-mediated transport.
       • This gene is involved in vesicle fusion.
     Genes may not have an accurate annotation, so the definition can stop higher up in this hierarchy.
  5. 5. More complexity required
     Annotation is not a simple chain.
     A single gene can have a very specific annotation that descends from two (or more) more general descriptions.
     There are different types of annotation as well: location, biochemistry, part of the organism in which the gene is expressed, and so on.
     An Ontology is a Directed Acyclic Graph (DAG), not a Tree (this means a lot to Graph Theorists).
     Each node in the DAG is an annotation term.
     Each “child” node can have more than one “parent” node.
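A minimal sketch of a term with more than one parent, using the vesicle fusion example from the talk. The dictionary layout, the name `PARENTS` and the function `ancestors` are illustrative assumptions rather than part of any GO library, and intermediate terms between these nodes and the root are omitted.

```python
# Hypothetical toy ontology: each term maps to the list of its parent terms.
PARENTS = {
    "biological process": [],
    "transport": ["biological process"],
    "membrane organization and biogenesis": ["biological process"],
    "vesicle-mediated transport": ["transport"],
    "membrane fusion": ["membrane organization and biogenesis"],
    # A DAG (unlike a tree) lets one child have two parents.
    "vesicle fusion": ["vesicle-mediated transport", "membrane fusion"],
}


def ancestors(term, parents=PARENTS):
    """Return the term plus every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

Calling `ancestors("vesicle fusion")` collects terms from both branches, which is exactly what a simple chain or tree could not represent.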
  6. 6. GO’s visualised
     [Figure from Rhee et al., Nature Reviews Genetics (2008). Panel c shows the nodes Biological process (root), Transport, Membrane organization and biogenesis, Vesicle-mediated transport, Membrane fusion and Vesicle fusion, linked by is_a and part_of edges.]
     Figure 1 | Simple trees versus directed acyclic graphs. Boxes represent nodes and arrows represent edges. a | An example of a simple tree, in which each child has only one parent and the edges are directed, that is, there is a source (parent) and a destination (child) for each edge. b | A directed acyclic graph (DAG), in which each child can have one or more parents. The node with multiple parents is coloured red and the additional edge is coloured grey. c | An example of a node, vesicle fusion, in the biological process ontology with multiple parentage. The dashed edges indicate that there are other nodes not shown between the nodes and the root node (biological process). A root is a node with no incoming edges, and at least one leaf (also called a sink). A leaf node is a node with no outgoing edges, that is, a terminal node with no children (vesicle fusion). Similar to a simple tree, a DAG has directed edges and does not have cycles, that is, no path starts and ends at the same node, and will always have at least one root node. The depth of a node is the length of the longest path from the root to that node, whereas the height is the length of the longest path from that node to a leaf [41]. is_a and part_of are types of relationships that link the terms in the GO ontology. More information about the relationships between GO terms are found online (An Introduction to the Gene Ontology).
  7. 7. GO’s visualised
     [Figure; source: http://amigo.geneontology.org/]
  8. 8. Different types of Annotation
     Typically, there are three distinct ontologies (overwhelmingly used):
       • Cellular Component
       • Biological Process
       • Molecular Function
     Many other ontologies have been constructed, e.g. Plant Organ for Arabidopsis.
  9. 9. Caveat
     The annotation of most genes (90%) has been carried out computationally. The annotations usually work pretty well, though they tend to be less accurate than those obtained by direct assay.
     All annotated genes have an evidence code (IED) associated with them to indicate how much we can rely on the annotation.
     Increasingly, co-expression is being used as a means to annotate genes, so one should be careful not to use this information in constructing annotations!
  11. 11. Over-representation
      One of the most useful tools to hand when one analyses micro-array data is to ask what functional groupings occur more often than one expects.
      Notation:
        • N: number of genes in the genome.
        • n: number of genes which have been found to be differentially expressed.
        • m: number of genes in the genome with a specific annotation.
        • k: number of genes which are differentially expressed with the same annotation.
  12. 12. Probabilities
      One can derive the probability $P_k$ that k genes would be found by chance amongst n genes using the hypergeometric probability distribution and the above parameters.
      For the record:
      \[ P_k = \frac{{}^{m}C_{k}\;{}^{N-m}C_{n-k}}{{}^{N}C_{n}}, \tag{1} \]
      \[ {}^{N}C_{n} = \frac{N!}{(N-n)!\,n!}. \tag{2} \]
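The hypergeometric probability can be evaluated directly with exact integer binomial coefficients. This is a sketch only: the function names are illustrative, and the upper-tail sum in `enrichment_p_value` is a commonly used extension for testing over-representation, not something stated on the slide.

```python
from math import comb


def hypergeom_pmf(k, N, n, m):
    """P_k: probability that exactly k of the n selected genes carry an
    annotation held by m of the N genes in the genome."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def enrichment_p_value(k, N, n, m):
    """Probability of seeing k or more annotated genes by chance.

    Assumption: over-representation is judged on the upper tail, i.e. the
    sum of P_i for i = k .. min(n, m).
    """
    upper = min(n, m)
    return sum(hypergeom_pmf(i, N, n, m) for i in range(k, upper + 1))
```

For a toy genome with N = 10, n = 3, m = 4 and k = 2, `hypergeom_pmf(2, 10, 3, 4)` evaluates 6 × 6 / 120.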
  13. 13. Difficulties
        • There are thousands of possible GO terms, and one should adjust the probabilities to deal with multiple hypotheses.
        • Applying a Bonferroni correction using all GO terms gives a p-value of $10^{-7}$ as the equivalent of 1% significance.
        • Because of the structure of the GO terms, these probabilities are highly correlated with each other.
        • In all these cases, focussing on as short a list of possible biological processes as possible will minimise the above difficulties.
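The Bonferroni adjustment is a one-line calculation: divide the desired significance level by the number of hypotheses tested. In the sketch below the function name is illustrative, and the count of 100,000 tests is an assumption chosen only because it reproduces the 10^-7 figure quoted in the talk; the talk does not state the count it used.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


# Assumed number of tests (hypothetical), picked to match the slide's 1e-7 figure.
threshold = bonferroni_threshold(0.01, 100_000)
```

Shortening the list of candidate terms raises the cutoff in direct proportion, which is why a focused list of biological processes helps.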
  15. 15. What genes match
      In benchmarking methods to infer interactions between gene products, we expect genes that interact to have similar GO terms, though perhaps not entirely the same.
      Semantic Similarity is a means to measure how similar the annotations of two genes are (0 being no similarity, 1 meaning total similarity).
      GO provides us with a means to do this in a quantitative fashion.
  16. 16. Simple implementation
      Determine the ratio of the number of nodes two genes share to the total number of nodes they have between them:
      \[ \mathrm{GOsim}_{UI} = \frac{\#\{N(G_1) \cap N(G_2)\}}{\#\{N(G_1) \cup N(G_2)\}} \tag{3} \]
      $N(G_1)$ being the set of nodes associated with $G_1$’s annotation.
      More elaborate methods are available.
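The ratio in equation (3) is the size of the intersection of the two node sets divided by the size of their union. A minimal sketch follows; the function name, the example gene annotations and the choice to return 0 for two unannotated genes are all assumptions made for illustration.

```python
def gosim_ui(nodes_g1, nodes_g2):
    """Ratio of shared GO nodes to all GO nodes annotated to either gene."""
    n1, n2 = set(nodes_g1), set(nodes_g2)
    union = n1 | n2
    if not union:
        return 0.0  # assumption: two genes with no annotation score zero
    return len(n1 & n2) / len(union)


# Hypothetical annotations, each listed with all of its ancestor terms.
gene_1 = {
    "biological process", "transport", "vesicle-mediated transport",
    "membrane organization and biogenesis", "membrane fusion", "vesicle fusion",
}
gene_2 = {"biological process", "transport", "vesicle-mediated transport"}

score = gosim_ui(gene_1, gene_2)  # 3 shared nodes out of 6 in total
```

Including the ancestor terms in each set is what lets two genes with different but related annotations receive a score above zero.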
  18. 18. Motivation
        • Yesterday, we encountered clustering.
        • Hypothesis 1 (weak): coexpression implies involvement in the same process.
        • Expand to many different experiments.
        • Hypothesis 2 (strong): the greater the level of association, the greater the chance of interaction.
        • Hypothesis 2 is often referred to as “guilt by association”.
        • Association may tell us about interactions between gene products. It does not tell us anything about the regulation mechanism.
  19. 19. [Figure; source: http://www.arabidopsis.leeds.ac.uk/act/index.php]
      266841_at, AT2G26150: heat shock transcription factor family protein; contains Pfam profile PF00447 (HSF-type DNA-binding domain)
      260978_at, AT1G53540: 17.6 kDa class I small heat shock protein
  20. 20. What do we mean by association?
      Knowing something about the expression level of one gene (over many different experiments) means we know something about the expression level of the other.
      Replotting the above: [figure]
  22. 22. Linear Correlation (coexpression)
      Simplest form of association.
      Assume that there is a linear relationship between genes. Formally:
      \[ y_1 = a_{12} + c_{12}\,y_2 + \eta_{12}, \tag{4} \]
        • $y_1$, $y_2$ are (log) expression levels.
        • $\eta_{12}$ is a noise term.
        • $a_{12}$, $c_{12}$ are parameters to be determined.
      But we’re not interested in that! We are interested in asking how good a model this is for this pair of genes.
  23. 23. Covariance
      We can estimate how good the linear model is by computing the covariance
      \[ E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big), \]
      where $\bar{y}_1, \bar{y}_2 = E(y_1), E(y_2)$ are the means of $y_1$ and $y_2$.
        • E means the expectation value of the above (think of it for now as taking the average over all the points in the previous figure).
        • One can prove to oneself (exercise) that the magnitude of the covariance is largest when $y_1$ can be perfectly expressed as a linear function of $y_2$.
        • The covariance is zero when there is no relationship at all between $y_1$ and $y_2$.
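One way to approach the exercise, sketched under the assumption that the noise term in the linear model is exactly zero. The shorthand Cov and Var is introduced here for brevity and does not appear on the slide.

```latex
\[
y_1 = a_{12} + c_{12}\,y_2
\;\Longrightarrow\;
y_1 - \bar{y}_1 = c_{12}\,(y_2 - \bar{y}_2)
\;\Longrightarrow\;
\mathrm{Cov}(y_1, y_2) = c_{12}\,\mathrm{Var}(y_2),
\qquad
\mathrm{Var}(y_1) = c_{12}^{2}\,\mathrm{Var}(y_2).
\]
\[
\text{Hence } \lvert \mathrm{Cov}(y_1, y_2) \rvert
= \sqrt{\mathrm{Var}(y_1)\,\mathrm{Var}(y_2)},
\quad\text{which attains the Cauchy-Schwarz bound }
\lvert \mathrm{Cov}(y_1, y_2) \rvert \le \sqrt{\mathrm{Var}(y_1)\,\mathrm{Var}(y_2)}.
\]
```

The bound holds for any pair of variables, so a perfectly linear pair is the case where the covariance is as large in magnitude as the two variances allow.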
  24. 24. [Figure: two scatter plots of y2 against y1, labelled “Maximum covariance” and “Zero covariance”.]
  25. 25. Correlation
      We want to compare every possible pair of genes, so using the covariance is not very practical, since the maximum covariance will vary from one pair of genes to another. However,
      \[ \rho_{12} = \frac{E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big)}{\sqrt{E\big((y_1 - \bar{y}_1)^2\big)\,E\big((y_2 - \bar{y}_2)^2\big)}}, \tag{5} \]
      is bounded: $-1 \le \rho_{12} \le 1$.
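The covariance and the correlation coefficient can be computed from two lists of expression values by replacing each expectation with a plain average. This is a sketch with illustrative function names and made-up data; it assumes neither input is constant, since a zero variance would make the denominator zero.

```python
from math import sqrt


def mean(values):
    """Average of a list, standing in for the expectation E."""
    return sum(values) / len(values)


def covariance(y1, y2):
    """E((y1 - mean(y1)) * (y2 - mean(y2))) over paired measurements."""
    m1, m2 = mean(y1), mean(y2)
    return mean([(a - m1) * (b - m2) for a, b in zip(y1, y2)])


def correlation(y1, y2):
    """Covariance divided by the square root of the product of the variances."""
    return covariance(y1, y2) / sqrt(covariance(y1, y1) * covariance(y2, y2))


# Made-up (log) expression values over four experiments.
y2 = [1.0, 2.0, 3.0, 4.0]
y1_linear = [2.0 * v + 1.0 for v in y2]   # exact linear function of y2
y1_unrelated = [1.0, -1.0, -1.0, 1.0]     # chosen so the covariance vanishes
```

Because the variances normalise the result, doubling the scale of either gene's expression values leaves the correlation unchanged, which is what makes it comparable across pairs.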
  26. 26. How well does it work?
      There are a number of examples of improved functional annotation.
      An unannotated gene that is highly correlated with a gene in a known response is likely to be involved in the same response.
  28. 28. Difficulty: genes correlate with many other genes, not just a few.
      Why?
      Suggestion: correlations do not distinguish between potential direct interactions and indirect interactions between gene products.
  29. 29. Example
      [Diagram: interaction network with nodes A, B, C, D, E and F, with the label “Other interactions”.]
      B directly interacts with three other genes, but could be highly correlated with others.
      C and D would be highly correlated with each other even though they are not directly interacting.
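A small simulation can illustrate how two genes with no direct link still end up correlated. The set-up is an assumption, not the network drawn on the slide: here B is taken to drive both C and D, each with its own independent noise, and the noise level, sample size and seed are arbitrary.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
b = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Assumption: C and D each follow B plus independent noise; there is no C-D link.
c = [v + rng.gauss(0.0, 0.3) for v in b]
d = [v + rng.gauss(0.0, 0.3) for v in b]

rho_cd = pearson(c, d)
```

Under these assumptions the population correlation between C and D is 1 / 1.09, roughly 0.92, even though neither variable appears in the other's definition.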
