Network Biology: from lists to underpinnings of molecular behaviour


BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]

Published in: Health & Medicine, Technology
  • GeneMANIA uses query-specific weights for multifaceted function queries. Let’s say you have a co-expression network that was generated from microarray data. You know there is a cluster of cell cycle genes, a cluster of DNA repair genes, and a few unknown genes between or within those clusters. This tells you a little about your genes of interest. But you want to add in a genetic interaction network, which is considerably more complex, and a protein interaction network, which is even more complex. How do you know which network contains the most relevant information about your query genes? The GeneMANIA algorithm weights the networks based on how connected your query genes are: a network is weighted more heavily if your query genes are more connected within it. GeneMANIA produces a composite network showing the weights of the genetic interaction, protein interaction, and co-expression networks used to generate it.

    1. 1. Network Biology:from lists to underpinnings of molecular behaviour<br />Michel Dumontier, Ph.D.<br />Associate Professor of Bioinformatics<br />Carleton University<br />1<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    3. 3. Provenance<br />This talk was prepared in part with input from the “Interpreting Gene Lists” workshop put forward by the Canadian Bioinformatics Workshops.<br />
    4. 4. So you did some mass spectrometry?<br />Protein Identification<br />4<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    5. 5. Database search vs de novo<br />Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..<br />[Figure: a spectrum annotated with residue labels, interpreted by database search and by de novo sequencing; identified peptide AVGELTK]<br />
    7. 7. My experiment worked and I have dozens, hundreds, or thousands of hits…. now what?<br />Protein <br />Identification<br />?<br />7<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    8. 8. Use the list to explore Biology<br />Determine significant shared attributes<br />Explore putative mechanisms of actions<br />Test hypotheses<br />Protein <br />Identification<br />Network <br />Biology<br />Eureka!<br />Hypothesis on the <br />molecular basis<br />of disease/process <br />8<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    9. 9. Detoxification<br />Oxidative Metabolism<br /># in list having attribute<br />Enriched in smokers =<br />UP-regulated in smokers<br /># in list sharing <br />these attributes<br />9<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    10. 10. Outline<br />Explore identified proteins<br />Attribute enrichment<br />Networks <br />Pathways<br />Lab<br />10<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    11. 11. A hypothesis underlies the list of identified proteins<br />An initial question was posed, an experiment performed and a list of candidates obtained.<br />The question is, what are the roles of these entities in the biological process being investigated. <br />Normal vs pathological<br />Response to stimulus<br />Interactions and complexes<br />11<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    12. 12. Biological Answers<br />Computational systems biology<br />Information retrieval and summary<br />Interaction network analysis<br />Pathway analysis<br />Function prediction<br />12<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    13. 13. Molecular Attributes<br />An attribute provides information about the entity in question (e.g. shape, function, process)<br />Sequence and structure provide information about:<br />Motifs, domains, interaction/binding sites, post-translational modifications, conformational changes, molecular complexes, mutations, conservation/evolution<br />Functions, localization, biological / pathological processes<br />
    14. 14. Gene Ontology<br />Captures terminology related to three aspects<br /> biological processes<br />molecular functions <br />cellular components<br />Relationships between terms are largely defined with “is a” and “part of” relations<br />Cell division<br />Isomerase activity<br />14<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    15. 15. GO Structure<br />[Figure: GO graph fragment with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relations]<br />Species independent. Some lower-level terms are specific to a group, but higher-level terms are not.<br />
    16. 16. Gene Ontology<br />30,393 terms, 99.2% with definitions<br /><ul><li>18,939 biological processes
    17. 17. 2,735 cellular components
    18. 18. 8,719 molecular functions</li></ul>GO Slim is an official reduced set of GO terms<br /><ul><li>Generic, plant, yeast
    19. 19. Good for making pie charts</li></ul>16<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    20. 20. Annotation<br />Manual annotation<br />Created by scientific curators<br />High quality<br />Small number (time-consuming to create)<br />Electronic annotation<br />Annotation derived without human validation<br />Computational predictions (accuracy varies)<br />Lower ‘quality’ than manual codes<br />Key point: be aware of annotation origin <br />17<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    21. 21. Evidence Type(provenance of facts)<br /><ul><li>ISS: Inferred from Sequence/Structural Similarity
    22. 22. IDA: Inferred from Direct Assay
    23. 23. IPI: Inferred from Physical Interaction
    24. 24. IMP: Inferred from Mutant Phenotype
    25. 25. IGI: Inferred from Genetic Interaction
    26. 26. IEP: Inferred from Expression Pattern
    27. 27. TAS: Traceable Author Statement
    28. 28. NAS: Non-traceable Author Statement
    29. 29. IC: Inferred by Curator
    30. 30. ND: No Data available
    31. 31. IEA: Inferred from electronic annotation</li></ul>18<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    32. 32. Variable Coverage<br />Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.<br />19<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    33. 33. GO Software Tools<br />GO resources are freely available to anyone without restriction<br />Includes the ontologies, gene associations and tools developed by GO<br />Other groups have used GO to create tools for many purposes<br /><br />20<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    34. 34. Accessing GO: QuickGO<br /><br />21<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    35. 35. Explore Ontologies<br /><br />22<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    36. 36. Databases of Molecular Annotation<br />NCBI <br />Genbank / RefSeq<br />Entrez Gene<br />EBI <br />UniProt<br />Ensembl BioMart (eukaryotes)<br />Model Organism Databases<br />Berkeley Drosophila Genome Project (BDGP)<br />dictyBase (Dictyostelium discoideum) <br />FlyBase (Drosophila melanogaster) <br />GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei) <br />UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases <br />Gramene (grains, including rice, Oryza) <br />Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus) <br />Rat Genome Database (RGD) (Rattus norvegicus)<br />Reactome<br />Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae) <br />The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana) <br />The Institute for Genomic Research (TIGR): databases on several bacterial species <br />WormBase (Caenorhabditis elegans) <br />Zebrafish Information Network (ZFIN) (Danio rerio)<br />
    38. 38. Identifiers<br />Identifiers (IDs) are ideally unique, stable names or numbers that help track database records<br />E.g. Social Insurance Number, Entrez Gene ID 41232<br />Gene and protein information stored in many databases<br /> Genes have many IDs<br />Records for: Gene, DNA, RNA, Protein<br />Important to recognize the correct record type<br />E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins.<br />25<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    39. 39. NCBI Database Links<br />NCBI:<br />U.S. National Center for Biotechnology Information<br />Part of National Library of Medicine (NLM)<br /><br />26<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    40. 40. Common Identifiers<br />Species-specific<br />HUGO HGNC BRCA2<br />MGI MGI:109337<br />RGD 2219 <br />ZFIN ZDB-GENE-060510-3 <br />FlyBase CG9097 <br />WormBase WBGene00002299 or ZK1067.1<br />SGD S000002187 or YDL029W<br />Annotations<br />InterPro IPR015252<br />OMIM 600185<br />Pfam PF09104<br />Gene Ontology GO:0000724<br />SNPs rs28897757<br />Experimental Platform<br />Affymetrix 208368_3p_s_at<br />Agilent A_23_P99452<br />CodeLink GE60169<br />Illumina GI_4502450-S<br />Gene<br />Ensembl ENSG00000139618<br />Entrez Gene 675<br />Unigene Hs.34012<br />RNA transcript<br />GenBank BC026160.1<br />RefSeq NM_000059<br />Ensembl ENST00000380152<br />Protein<br />Ensembl ENSP00000369497<br />RefSeq NP_000050.2<br />UniProt BRCA2_HUMAN or A1YBP1_HUMAN<br />IPI IPI00412408.1<br />EMBL AF309413 <br />PDB 1MIU<br />Red = Recommended<br />27<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    41. 41. Identifier Mapping<br />So many IDs!<br />Mapping (conversion) is a headache<br />Four main uses<br />Disambiguate similarly named entities<br />Used to reference related information<br />Biological and informational provenance<br />E.g. Genes to proteins, Entrez Gene to Affy<br />Unification during dataset merging<br />Equivalent entities<br />28<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    42. 42. ID Mapping Services<br />Synergizer<br /><br />Ensembl BioMart<br /><br />UniProt<br /><br />29<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    43. 43. Outline<br />Explore identified proteins<br />Attribute enrichment<br />Networks <br />Pathways<br />30<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    44. 44. Attribute Enrichment (AE)<br />Given:<br />list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42<br />attributes: e.g. function, process, localization, interactions<br /> AE Question: Are any of the attributes surprisingly enriched in the list?<br />Details:<br />How to assess “surprisingly” (statistics)<br />How to correct for repeating the tests<br />31<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    45. 45. What is a P-value?<br />The P-value is the probability, assuming the “null hypothesis” is true, of observing a test statistic at least as extreme as the one actually observed. It is not the probability that the null hypothesis is true.<br />It is calculated by computing a statistic from the data and finding the probability of observing that statistic, or one more extreme, in a sample of the same size distributed according to the null hypothesis.<br />Intuitively: rejecting the null hypothesis whenever the P-value falls below a threshold α gives a false positive (aka “Type I error”) with probability at most α when the null hypothesis holds.<br />
    46. 46. How likely are the observed differences between the two distributions due to chance?<br />[Figure: value distributions of the two groups]<br />
    47. 47. AE using the T-test<br />Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?<br />Answer: Two-tailed T-test<br />Black: N1 = 500<br />Mean: m1 = 1.1<br />Std: s1 = 0.9<br />Red: N2 = 4500<br />Mean: m2 = 4.9<br />Std: s2 = 1.0<br />T-statistic (unequal-variance form): $t = \frac{m_1 - m_2}{\sqrt{s_1^2/N_1 + s_2^2/N_2}} = -88.5$<br />
    48. 48. AE using the T-test<br />Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?<br />T-statistic = -88.5<br />P-value = shaded area * 2<br />[Figure: probability density of the T-distribution against the T-statistic, with the tail beyond -88.5 shaded]<br />
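The T-statistic quoted on these slides can be reproduced from the summary values alone. The sketch below assumes the unequal-variance (Welch) form of the statistic, which is the form that yields -88.5 for these numbers; the function name `welch_t` is hypothetical, and the degrees of freedom use the usual Welch-Satterthwaite approximation.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch T-statistic and approximate degrees of freedom from summary statistics."""
    v1 = s1 ** 2 / n1  # squared standard error of the first mean
    v2 = s2 ** 2 / n2  # squared standard error of the second mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Black: N1=500, mean 1.1, std 0.9; Red: N2=4500, mean 4.9, std 1.0
t_stat, dof = welch_t(1.1, 0.9, 500, 4.9, 1.0, 4500)
print(round(t_stat, 1))
```

With a statistic this far into the tail, the two-tailed P-value is effectively zero, so the sketch stops at the statistic itself.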
    49. 49. T-test limitations<br />Assumes both distributions are approximately Gaussian (i.e. normal)<br />This assumption is often true for:<br />Log ratios from microarrays<br />This assumption is rarely true for:<br />Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits<br />Examples of non-Gaussian score distributions:<br />Values that are positive with increasing density near zero, e.g. sequence counts<br />Bimodal (“two-bumped”) distributions<br />Distributions with outliers, or “heavy-tailed” distributions<br />Tests only for a significant difference between the means of two distributions; it does not test for other differences between them.<br />
    50. 50. Kolmogorov-Smirnov (K-S) test<br />Question: Are the red and black distributions significantly different?<br />Calculate the cumulative distributions of red and black<br />Formal question: Is the length of the largest difference between the “empirical distribution functions” statistically significant?<br />[Figure: probability density and cumulative probability of the red and black scores; the largest gap between the two cumulative curves has length 0.4]<br />
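The "length of largest difference" in the K-S slide is the gap between the two empirical distribution functions. A minimal sketch of that statistic follows; the function name and the toy samples are made up for illustration, and the significance step (turning the gap into a P-value) is left out.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest vertical gap between the empirical distribution functions of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Toy samples (not data from the slides)
black = [1, 2, 3, 4]
red = [3, 4, 5, 6]
print(ks_statistic(black, red))  # 0.5
```

Unlike the T-test, this statistic makes no assumption about the shape of either distribution.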
    51. 51. What is the probability of finding 4 or more proteins with feature X in a random sample of 5 proteins?<br />List: RRP6, MRD1, RRP7, RRP43, RRP42<br />Background population: 5000 proteins, 500 of which have feature X<br />
    52. 52. Fisher’s exact test<br />List: RRP6, MRD1, RRP7, RRP43, RRP42<br />Background population: 5000 proteins, 500 of which have feature X<br />Answer: P-value = 4.6 × 10^-4<br />The P-value for Fisher’s exact test<br />is “the probability that a random draw of the same size as the list from the background population would produce the observed number (or more) of attributes in the list”,<br />depends on the size of the list, the # with the feature (in the list and in the background), and the background population.<br />[Figure: null distribution with the P-value shown as the shaded tail]<br />
    53. 53. Important details<br />To test for under-enrichment of “black”, test for over-enrichment of “red”.<br />Need to choose the “background population” appropriately, e.g., if only a portion of the total complement is queried (or has annotation), use only that population as background.<br />To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type.<br />The hypergeometric test is equivalent to a one-tailed Fisher’s exact test.<br />
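Because the hypergeometric test is equivalent to a one-tailed Fisher’s exact test, the enrichment example (4 or more proteins with feature X in a list of 5, from a background of 5000 proteins of which 500 have X) can be checked with the hypergeometric tail sum. This is a minimal sketch; the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(k, list_size, annotated, population):
    """P(at least k annotated items in a random draw of list_size items)."""
    total = comb(population, list_size)
    upper = min(list_size, annotated)
    hits = sum(
        comb(annotated, i) * comb(population - annotated, list_size - i)
        for i in range(k, upper + 1)
    )
    return hits / total

p = enrichment_p_value(k=4, list_size=5, annotated=500, population=5000)
print(f"{p:.1e}")  # 4.6e-04
```

Changing `population` or `annotated` shows how strongly the result depends on the choice of background.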
    54. 54. How to win the P-value lottery, part 1<br />Random draws<br />Expect a random draw with observed enrichment once every 1 / P-value draws<br />… 7,834 draws later …<br />Background population:<br />500 X<br />5000 Y<br />41<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    55. 55. How to win the P-value lottery, part 2<br />Keep the list the same, evaluate different annotations<br />Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42<br />[Figure: the same list tested against different annotations]<br />
    56. 56. Correcting for multiple tests<br />The Bonferroni correction controls the probability that at least one of the tests gives a false positive, aka the Family-Wise Error Rate (FWER)<br />If M = # of annotations tested: Corrected P-value = M x original P-value<br />The Benjamini-Hochberg (B-H) procedure controls the expected proportion of positive tests (i.e. rejections of the null hypothesis) that are false positives, aka the False Discovery Rate (FDR)<br />FDR is the expected proportion of the observed enrichments that are due to random chance.<br />B-H is less stringent than Bonferroni<br />
    57. 57. Reducing multiple test correction stringency<br />The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the smaller each individual P-value must be to remain significant<br />Can control the stringency by reducing the number of tests:<br />e.g. use GO slim or restrict testing to the appropriate GO annotations.<br />
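Both corrections can be written as adjusted P-values. The sketch below is illustrative only: the function names and the four raw P-values are made up, and the B-H version uses the common step-up formulation (raw P-value times M divided by its rank, made monotone from the largest P-value down).

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (step-up FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, keeping the result monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.01, 0.03, 0.04]  # toy P-values for four annotations
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

For any input, each B-H adjusted value is no larger than the Bonferroni one, which is the sense in which B-H is less stringent.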
    58. 58. AE tools<br />Web-based tools<br />Funspec:<br />easy tool for yeast, not maintained, uses GO annotations and some other annotations (e.g. protein complexes)<br />YeastFeatures:<br />Similar to Funspec, different datasets and presentation<br />GoMiner:<br />Uses GO annotations, covers many organisms, needs a background set of genes<br />Cytoscape-based tools<br />BINGO:<br />Uses GO annotations, displays enrichment results graphically and visually organizes related categories<br />
    59. 59. Funspec: Simple ORA for yeast<br />Choose sources of annotation<br />Bonferroni correct? YES!<br />Paste list here<br />Caveats:<br /><ul><li> yeast only,
    60. 60. last updated 2002</li></ul>46<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    63. 63. GoMiner, part 1<br />1. Click “web interface”<br />2. Upload background<br />3. Upload list<br />4. Choose organism<br />5. Choose evidence code (All or Level 1)<br />49<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    64. 64. GoMiner, part 2<br />6. Restrict # of tests via category size<br />7. Restrict # of tests via GO hierarchy<br />8. Results emailed to this address, in a few minutes<br />50<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    65. 65. DAVID, part 1<br />Paste list here<br />DAVID automatically detects organism<br />Choose ID type<br />List type: list or background?<br />51<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    66. 66. DAVID, part 2<br />52<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    67. 67. BINGO, an ORA cytoscape plugin<br />Links represent parent-child relationships in GO ontology<br />Colours represent significance of enrichment<br />Nodes represent GO categories<br />53<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    69. 69. Outline<br />Explore identified proteins<br />Attribute enrichment<br />Networks <br /><ul><li>Physical networks
    70. 70. Genetic networks
    71. 71. Functional networks</li></ul>Pathways<br />55<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    72. 72. Why Network and Pathway Analysis?<br />Intuitive to Biologists<br /><ul><li>Provide a biological context for results
    73. 73. More efficient than searching databases gene-by-gene
    74. 74. Intuitive display for sharing data </li></ul>Computation on Pathway Content<br /><ul><li>Visualize multiple data types on a pathway or network
    75. 75. Find active pathways
    76. 76. Identify potential regulators</li></ul>56<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    77. 77. Network<br />In biology, a network is a graph composed of nodes that correspond to entities (genes, proteins, small molecules) and edges that correspond to physical/agentive or associative relations between entities.<br />[Figure: example graph labelling a vertex (node), an edge, a directed edge (arc), a cycle and weighted edges (weights -5, 10, 7)]<br />
    78. 78. Integration in a Network Context<br />58<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    79. 79. Integration in a Network Context<br />Expression data mapped<br />to node colours<br />59<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    80. 80. Mapping Biology to a Network<br />A simple mapping: Protein-protein interactions<br />one protein/node, one interaction/edge<br />Edges can represent other relationships<br />Physical e.g. protein-protein interaction<br />Regulatory e.g. kinase activates target<br />Genetic e.g. epistasis<br />Similarity e.g. protein sequence similarity<br />Critical: understand the mapping for network analysis<br />60<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
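The mapping from biology to a network can be made concrete with a small data structure in which every edge records what kind of relationship it represents. This is only a sketch: the class, method names and the nodes A, B and C are hypothetical placeholders, not real interactions.

```python
from collections import defaultdict

class Network:
    """Nodes are entities; each edge carries a relation type and a weight."""

    def __init__(self):
        self.adjacency = defaultdict(dict)

    def add_edge(self, source, target, kind, weight=1.0, directed=False):
        self.adjacency[source][target] = (kind, weight)
        if directed:
            self.adjacency[target]  # register the target node without a back edge
        else:
            self.adjacency[target][source] = (kind, weight)

    def neighbours(self, node, kind=None):
        """Nodes reachable from `node`, optionally restricted to one edge type."""
        return sorted(
            other for other, (edge_kind, _) in self.adjacency[node].items()
            if kind is None or edge_kind == kind
        )

net = Network()
net.add_edge("A", "B", kind="physical")                   # protein-protein interaction
net.add_edge("A", "C", kind="regulatory", directed=True)  # e.g. kinase activates target
net.add_edge("B", "C", kind="genetic", weight=-5)         # e.g. epistasis score
print(net.neighbours("A"))
print(net.neighbours("A", kind="physical"))
```

Keeping the edge type explicit is one way to respect the slide's point that an analysis is only meaningful if the mapping is understood.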
    81. 81. Protein Sequence Similarity Network<br /><br />61<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    82. 82. Literature Network<br />Computationally extract gene relationships from text, usually PubMed abstracts<br />Useful if network is not in a database<br />Literature search tool<br />BUT not perfect<br />Problems recognizing gene names<br />Natural language processing is difficult<br />Agilent Literature Search Cytoscape plugin<br />iHOP<br />
    83. 83. Agilent Literature Search<br />63<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    84. 84. Cytoscape Network produced by Literature Search.<br />Abstract from the scientific literature<br />Sentences for an edge<br />64<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    85. 85. Enrichment Map<br />Overlap<br />A<br />B<br />65<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    86. 86. Nodes represent <br />gene-sets<br />66<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    87. 87. [Figure: enrichment map; cluster labels: Muscle Contraction, Olfactory Receptor, Ubiquitin Processes, Ubiquitin-dependent Proteolysis, Ectodermal Dev. & Keratinocyte Diff., DNA Repair, Mitotic Cell Cycle, Ubiquitin Ligase, DNA Processes, Cytoskeleton, DNA Replication, Intermediate Filament Cytoskeleton, Microtubule Cytoskeleton, Ras GTPase, mRNA Transport, Chromosome, RNA Processes, Serine Endopeptidase, Chromatin Remodeling, RNA Splicing, Fatty Acid Metabolism, Ion Channel, Transcription, Calcium, rRNA Processing, Mitochondrial Oxidative Metabolism, Ribonucleotide Metabolism, Potassium Sodium, Translation]<br />
    88. 88. Physical Networks<br />Between two molecular objects (A and B)<br />DNA, RNA, gene, protein, complex, small molecule, photon<br />Requires a site of interaction / binding<br />Biologically relevant:<br />Present/expressed at the same time<br />Share a cellular location<br />Leads to some biologically relevant outcome<br />
    89. 89. Molecular Interactions<br />RAS interacting with RALGDS<br />(PDB: 1LFD)<br />Synthetic protein interacting with ATP and Zinc<br />(PDB: 2P0X)<br />69<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    90. 90. Experimental Interaction Discovery<br />Mass Spectrometry<br />Genetics<br />Two-Hybrid<br />Direct, Physical<br />Indirect, Physical<br />Indirect, Genetic<br />Microarray<br />X-Ray<br />NMR<br />
    91. 91. Experimental Considerations<br />How do you know if the interaction really exists?<br />Each method has its advantages and disadvantages.<br />Be aware of systematic errors.<br />Be aware of contaminants.<br />Each method observes interactions from a slightly different experimental condition.<br />Support from many different sources is certainly better (necessary) than just one.<br />
    92. 92. Some affinity purification caveats<br />First and most importantly, this is only a representation of the observation.<br />You can only tell what proteins are in the eluate; you can’t tell how they are connected to one another.<br />If there is only one other protein present (B), then it’s likely that A and B are directly interacting.<br />But what if I told you that two other proteins (B and C) were present along with A?<br />[Figure: A with B; A with B and C]<br />
    93. 93. Complexes with unknown topology<br />[Figure: three alternative arrangements of proteins A, B and C]<br />Which of these models is correct?<br />The complex described by this experimental result is said to have an Unknown Topology.<br />
    94. 94. Complexes with unknown stoichiometry<br />[Figure: a complex drawn with multiple copies of A and B]<br />Here’s another possibility.<br />The complex described by this experimental result is also said to have Unknown Stoichiometry.<br />
    95. 95. Interaction Models<br />[Figure: the actual topology compared with its spoke and matrix representations]<br />Simple model, useful for data navigation<br />More accurate<br />Theoretical max. number of interactions<br />
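The two interaction models can be sketched as ways of expanding one affinity purification result into binary interactions. The sketch assumes the usual definitions (spoke: the bait paired with each co-purified protein; matrix: every pair among the bait and co-purified proteins); the function names and the proteins A, B, C, D are placeholders.

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Bait paired with each co-purified protein."""
    return {frozenset((bait, prey)) for prey in preys}

def matrix_model(bait, preys):
    """Every pair among the bait and the co-purified proteins."""
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}

spoke = spoke_model("A", ["B", "C", "D"])
matrix = matrix_model("A", ["B", "C", "D"])
print(len(spoke), len(matrix))  # 3 6
```

The matrix model always contains the spoke model, and for n co-purified proteins it proposes n(n+1)/2 interactions, which is the theoretical maximum mentioned on the slide.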
    96. 96. High-throughput Mass Spectrometric Protein Complex Identification (HMS-PCI)<br />Mike Tyers, SLRI<br />Ste12<br />Ho et al. Nature. 2002 Jan 10;415(6868):180-3<br />
    98. 98. k-core analysis<br />A k-core is a part of a graph in which every node is connected to at least k other nodes of that part (k=0,1,2,3...)<br />The highest k-core is the central, most densely connected region of a graph<br />Regions of dense connectivity may represent molecular complexes<br />Therefore, high k-cores may be molecular complexes<br />
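A k-core can be found by repeatedly removing nodes that have fewer than k neighbours left. The sketch below assumes an undirected graph given as a symmetric adjacency dictionary; the function name and the toy graph are made up for illustration.

```python
def k_core(adjacency, k):
    """Return the set of nodes in the k-core of an undirected graph."""
    remaining = {node: set(neighbours) for node, neighbours in adjacency.items()}
    changed = True
    while changed:
        changed = False
        low_degree = [n for n, neighbours in remaining.items() if len(neighbours) < k]
        for node in low_degree:
            # Drop the node and remove it from its surviving neighbours.
            for other in remaining.pop(node):
                if other in remaining:
                    remaining[other].discard(node)
            changed = True
    return set(remaining)

# Toy graph: a triangle A-B-C with one extra node D attached to A
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(sorted(k_core(toy, 2)))  # ['A', 'B', 'C']
```

Raising k until the result is empty gives the highest k-core, the most densely connected region.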
    99. 99. Interaction can define function<br />[Figure: k-core panels; labels in slide order: Pre MS, Ho, 6-core, 6-core, Gavin, Union, 6-core, 9-core]<br />MCODE plugin for Cytoscape<br />
    101. 101. Interaction Databases<br />Experiment (E)<br />Structure detail (S)<br />Predicted<br />Physical (P)<br />Functional (F)<br />Curated (C)<br />Homology modeling (H)<br />*IMEx consortium<br />81<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    102. 102. Network Classification of Disease<br />Traditional: Gene association<br />Limitations: Too many genes reduces statistical power<br />New: Active cell map based approaches combining network and molecular profiles<br />Chuang HY, Lee E, Liu YT, Lee D, Ideker T<br />Network-based classification of breast cancer metastasis<br />Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16<br />Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S<br />Network-based analysis of affected biological processes in type 2 diabetes models<br />PLoS Genet. 2007 Jun;3(6):e96<br />Efroni S, Schaefer CF, Buetow KH<br />Identification of key processes underlying cancer phenotypes using biologic pathway analysis<br />PLoS ONE. 2007 May 9;2(5):e425<br />82<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    103. 103. Network-Based Breast Cancer Classification<br />57k intx from Y2H, orthology, co-citation, HPRD, BIND, Reactome<br />2 breast cancer cohorts, different expression platforms<br />Chuang HY, Lee E, Liu YT, Lee D, Ideker T<br />Network-based classification of breast cancer metastasis<br />Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16<br />83<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    104. 104. Similar network markers across 2 data sets (better than original overlap)<br />Increased classification accuracy<br />Better coverage of known cancer risk genes (*)<br />84<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    105. 105. PIPE<br />Predicts yeast PPI from sequence<br />Uses interaction databases to find similar interacting proteins<br />Estimates the site of interaction<br />75% accuracy (61% sensitivity, 89% specificity)<br />Finds new interactions among complexes<br />85<br />BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]<br />
    108. 108. PIPE2<br />First all-to-all sequence-based computational screen of PPIs in yeast<br />29,589 high confidence interactions of ~ 2 × 10^7 possible pairs<br />16,000x faster than PIPE<br />99.95% specificity<br />
    109. 109. Synthetic Genetic Interactions<br />Synthetic genetic interactions (lethal, slow growth)<br />Mate two mutants without phenotypes to get a daughter cell with a phenotype<br />Synthetic lethal (SL), slow growth<br />Robotic mating using the yeast deletion library<br />Genetic interactions provide functional data on protein interactions or redundant genes<br />About 23% of known SLs (1295 - YPD+MIPS) were known protein interactions in yeast<br />Tong et al. Science. 2001 Dec 14;294(5550):2364-8<br />
    110. 110. Synthetic Genetic Interactions in Yeast (Tong, Boone)<br />Figure legend: Cell Polarity, Cell Wall Maintenance, Cell Structure, Mitosis, Chromosome Structure, DNA Synthesis, DNA Repair, Unknown, Others<br />
    111. 111. Validation: Protein Localization<br />A – A3: Y2H<br />B: physical methods<br />C: genetic<br />E: immunological<br />True positives:<br /><ul><li>Localized in the same cellular compartment</li><li>Have common cellular role</li></ul>Sprinzak, Sattath, Margalit, J Mol Biol, 2003<br />
    113. 113. Comparisons<br />All methods except Y2H and synthetic lethality are biased toward abundant proteins.<br />PPI methods are biased toward certain cellular localizations.<br />Evolutionarily conserved proteins have much better coverage in Y2H than proteins restricted to a single organism.<br />von Mering et al., Nature, 2002<br />
    114. 114. Functional Associations<br />Molecular Interactions<br />Regulatory Interactions<br />Genetic Interactions<br />Similarity relationships<br />Co-expression<br />Protein sequence<br />Domain architecture<br />Phylogenetic profiles<br />Gene neighborhood<br />Gene fusion<br />…<br />
    115. 115.<br />von Mering et al., Nucleic Acids Res., 2005<br />
    118. 118. Gene Function Prediction using a Multiple Association Network Integration Algorithm<br />Query-specific weights for multifaceted function queries<br />Composite network = w1 x network 1 + w2 x network 2 + w3 x network 3<br />Networks: Co-complexed (Durrett 2006), Genetic (Tong et al. 2001), Co-expression<br />Example genes: cell cycle (CDC27, CDC23, APC11), DNA repair (RAD54, XRS2, MRE11), unknown (UNK1, UNK2)<br />Pavlidis et al, 2002, Lanckriet et al, 2004<br />Mostafavi et al, 2008<br />
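The idea of query-specific weights can be illustrated with a toy Python sketch. It weights each network by the fraction of query-gene pairs that are directly linked in it, then sums the weighted edge scores into a composite network. This connectivity heuristic is a deliberately simplified stand-in, not the published GeneMANIA weighting procedure, and the edges below are invented toy data rather than real interactions; only the gene names come from the slide.

```python
from itertools import combinations

def edge(a, b):
    """Undirected edge key."""
    return frozenset((a, b))

def query_connectivity(network, query):
    """Fraction of query-gene pairs joined by an edge in this network."""
    pairs = list(combinations(sorted(query), 2))
    linked = sum(1 for a, b in pairs if edge(a, b) in network)
    return linked / len(pairs) if pairs else 0.0

def combine(networks, query):
    """Weight each network by query connectivity, then sum edge scores."""
    raw = {name: query_connectivity(net, query) for name, net in networks.items()}
    total = sum(raw.values())
    # Fall back to equal weights if the query is unconnected everywhere.
    weights = {name: (r / total if total else 1 / len(networks))
               for name, r in raw.items()}
    composite = {}
    for name, net in networks.items():
        for e, score in net.items():
            composite[e] = composite.get(e, 0.0) + weights[name] * score
    return weights, composite

# Toy networks (hypothetical edges, all with score 1.0).
networks = {
    "co-complexed": {edge("CDC27", "CDC23"): 1.0, edge("CDC23", "APC11"): 1.0,
                     edge("APC11", "UNK1"): 1.0},
    "genetic": {edge("RAD54", "XRS2"): 1.0, edge("CDC27", "RAD54"): 1.0},
    "co-expression": {edge("CDC27", "CDC23"): 1.0, edge("CDC27", "APC11"): 1.0,
                      edge("CDC23", "APC11"): 1.0, edge("MRE11", "UNK2"): 1.0},
}
query = {"CDC27", "CDC23", "APC11"}  # a cell-cycle query

weights, composite = combine(networks, query)
for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

For a cell-cycle query the toy genetic network receives zero weight because none of the query genes are linked in it; a DNA-repair query would shift the weights toward networks where RAD54, XRS2 and MRE11 are connected.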
    119. 119. GeneMANIA Cytoscape Plugin<br />
    120. 120. Outline<br />Explore identified proteins<br />Attribute enrichment<br />Networks<br />Pathways<br />Lab<br />
    121. 121. Pathway<br />In biology, a pathway is a network consisting of inputs (physical entities), outputs (physical entities, biological outcomes), and the molecular machinery and chemical transformations required or expected to realize the end-directed activity.<br />
    122. 122. Using Pathway Information<br />Sources: expert knowledge, experimental data, databases, literature<br />Pathway information feeds pathway analysis<br />Goal: find active processes underlying a phenotype<br />
    123. 123. >290 Pathway Databases!<br /><ul><li>Varied formats, representation, coverage</li><li>Pathway data extremely difficult to combine and use</li></ul>Vuk Pavlovic<br />Sylva Donaldson<br />
    125. 125. Aim: Convenient Access to Pathway Information<br />Facilitate creation and communication of pathway data<br />Aggregate pathway data in the public domain<br />Provide easy access for pathway analysis<br />
    126. 126. Access From Cytoscape<br />
    127. 127. cardiomyopathy: downregulated genes<br />Fatty Acid Degradation?<br />Other pathways / processes?<br />
    128. 128. Fatty Acid Degradation Pathway<br />
    129. 129. Cardiomyopathy Data on Fatty Acid Degradation Pathway<br />
    130. 130. Visualizing Time Course Data on Pathways: Multiple Comparison View<br />
    131. 131. Outline<br />Explore identified proteins<br />Attribute enrichment<br />Networks<br />Pathways<br />Lab<br />
    132. 132. Network Analysis<br />Cytoscape<br />Visualize molecular interaction networks and integrate interactions with gene expression profiles and other state data. Data filters & custom plug-in architecture.<br />Biolayout Express 3D<br />Large networks<br />Gene expression<br />
    133. 133. Network Analysis using Cytoscape<br />Sources: expert knowledge, experimental data, databases, literature<br />Network information feeds network analysis<br />Goal: find biological processes underlying a phenotype<br />
    134. 134.<br />Network visualization and analysis<br />Pathway comparison<br />Literature mining<br />Gene Ontology analysis<br />Active modules<br />Complex detection<br />Network motif search<br />UCSD, ISB, Agilent, MSKCC, Pasteur, UCSF, Unilever, UToronto, U Texas<br />
    135. 135. Manipulate Networks<br />Filter/Query<br />Interaction Database Search<br />Automatic Layout<br />
    136. 136. Overview<br />Zoom<br />Focus<br />PKC Cell Wall Integrity<br />
    137. 137. Active Community<br />Help<br />8 tutorials, >10 case studies<br />Mailing lists for discussion<br />Documentation, data sets<br />Annual Conference: Houston Nov 6-9, 2009<br />10,000s of users, 2500 downloads/month<br />>40 Plugins Extend Functionality<br />Build your own, requires programming<br />Cline MS et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2(10):2366-82<br />
    138. 138. LAB<br />Objective<br />Create a map of the functional enrichments from the 14 input proteins<br />Methods<br />Use HGNC to obtain the gene symbols from the names<br />Submit the gene symbols to a tool that already has datasets loaded.<br />Get Attributes and do analysis on network<br />
    140. 140. Get their gene symbols/identifiers<br />HGNC<br />Provide a table of mappings<br />What challenges did you face when trying to identify the symbols from textual descriptions?<br />
    141. 141. Identify functional enrichments<br />Discuss and provide a plot for the enrichment of Gene Ontology categories<br />
    142. 142. Build an attribute enrichment network<br />Which new proteins are functionally linked?<br />What datasets were used in the network construction?<br />
    143. 143. Attribute Enrichment with a custom data set<br />Use BioMart to<br />convert HGNC identifiers to Ensembl Identifiers<br />Obtain the Gene Ontology categories for the target proteins and the background proteins.<br />Use FUNC to do the enrichment analysis<br />
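To make the target-versus-background comparison concrete, here is a minimal Python sketch of a one-sided hypergeometric test, one standard choice for this kind of enrichment analysis. It is not a description of how FUNC is implemented, and all counts in the example are hypothetical.

```python
from math import comb

def enrichment_p(n_background, n_annotated, n_targets, n_hits):
    """P(X >= n_hits) for a hypergeometric draw.

    n_background: genes in the background set
    n_annotated:  background genes carrying the GO category
    n_targets:    genes in the target list (drawn without replacement)
    n_hits:       target genes carrying the GO category
    """
    upper = min(n_annotated, n_targets)
    total = comb(n_background, n_targets)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_targets - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 14 target proteins, 3 of which carry a category
# that annotates 200 of 20,000 background genes.
p = enrichment_p(n_background=20_000, n_annotated=200, n_targets=14, n_hits=3)
print(f"p = {p:.2e}")
```

In practice one p-value is computed per Gene Ontology category, so the results need a multiple-testing correction before any category is called enriched.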
    148. 148. Collect the Gene Ontology attributes for the list, then for all the human genes<br />
    149. 149. Next steps are harder…<br />To use FUNC, you need to convert the BioMart output to the file format above. This is pretty easy to do in Excel for the protein list, but Excel can’t handle the results for all the human proteins. You need to write a small script… take BIOC3008 and become competent in simple data manipulation.<br />
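A small script of the kind mentioned above might look like the following Python sketch. It streams a tab-separated BioMart export row by row, so file size is not a problem. The input is assumed to have the gene identifier in the first column and the GO accession in the second. The exact FUNC input layout is not reproduced in this transcript, so the three-column output used here (gene, GO accession, 1 for target or 0 for background) is an assumption to be adapted to the format FUNC actually expects. The identifiers in the demo are placeholders.

```python
import csv
import io

def biomart_to_table(rows, target_ids, out):
    """Stream (gene ID, GO accession) rows into gene<TAB>GO<TAB>flag lines.

    flag is 1 for genes in target_ids and 0 for background genes.
    This three-column layout is an ASSUMPTION, not the confirmed FUNC format.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    seen = set()
    for row in rows:
        if len(row) < 2:
            continue
        gene_id, go_id = row[0].strip(), row[1].strip()
        # Skips the header row and genes without a GO annotation.
        if not gene_id or not go_id.startswith("GO:"):
            continue
        if (gene_id, go_id) in seen:   # drop duplicate annotations
            continue
        seen.add((gene_id, go_id))
        writer.writerow([gene_id, go_id, 1 if gene_id in target_ids else 0])

# Demo on an in-memory export with placeholder identifiers.
export = io.StringIO(
    "Ensembl Gene ID\tGO Term Accession\n"
    "ENSG_EXAMPLE_1\tGO:0000001\n"
    "ENSG_EXAMPLE_1\tGO:0000002\n"
    "ENSG_EXAMPLE_2\tGO:0000001\n"
    "ENSG_EXAMPLE_3\t\n"
)
targets = {"ENSG_EXAMPLE_1"}
out = io.StringIO()
biomart_to_table(csv.reader(export, delimiter="\t"), targets, out)
result = out.getvalue()
print(result, end="")
```

For a real export, the same function can be given `csv.reader(open(path, newline=""), delimiter="\t")` as input and an open output file as `out`.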