Network Biology: from lists to underpinnings of molecular behaviour
BIOL5502B|CHEM5900 Methods in Proteomics [17/05/2010:Dumontier]

  • GeneMANIA uses query-specific weights for multifaceted function queries. Let’s say you have a co-expression network that was generated from microarray data. You know there is a cluster of cell cycle genes, and a cluster of DNA repair genes, and a few unknown genes between or within those clusters. This tells you a little bit about your genes of interest. But you want to add in a genetic interaction network, which is considerably more complex, and a protein interaction network, which is even more complex. How do you know which network contains the most relevant information about your query genes? The GeneMANIA algorithm weights the networks based on how connected your query genes are: a network is weighted more heavily if your query genes are more connected within that network. GeneMANIA produces a composite network and reports the weights of the genetic interaction, protein interaction, and co-expression networks used to generate it.

Network Biology: from lists to underpinnings of molecular behaviour Presentation Transcript

  • 1. Network Biology: from lists to underpinnings of molecular behaviour
    Michel Dumontier, Ph.D.
    Associate Professor of Bioinformatics
    Carleton University
  • 3. Provenance
    This talk was prepared in part with input from the “Interpreting Gene Lists” workshop put forward by the Canadian Bioinformatics Workshops (bioinformatics.ca)
    http://bioinformatics.ca/workshops/2009/course-content
  • 4. So you did some mass spectrometry?
    Protein Identification
  • 5. database search vs de novo
    [Figure: Database Search vs de novo identification of a peptide from spectrum residue labels (W, R, V, A, L, T, G, E, P, L, K, C, W, D, T); identified peptide: AVGELTK]
    Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
  • 7. My experiment worked and I have dozens, hundreds, or thousands of hits…. now what?
    [Figure: Protein Identification, followed by "?"]
  • 8. Use the list to explore Biology
    Determine significant shared attributes
    Explore putative mechanisms of action
    Test hypotheses
    [Figure: Protein Identification, then Network Biology, then Eureka!: a hypothesis on the molecular basis of a disease/process]
  • 9. [Figure: attributes such as Detoxification and Oxidative Metabolism, annotated with the # in list having each attribute and the # in list sharing these attributes. Enriched in smokers = UP-regulated in smokers]
  • 10. Outline
    Explore identified proteins
    Attribute enrichment
    Networks
    Pathways
    Lab
  • 11. A hypothesis underlies the list of identified proteins
    An initial question was posed, an experiment performed and a list of candidates obtained.
    The question is: what are the roles of these entities in the biological process being investigated?
    Normal vs pathological
    Response to stimulus
    Interactions and complexes
  • 12. Biological Answers
    Computational systems biology
    Information retrieval and summary
    Interaction network analysis
    Pathway analysis
    Function prediction
  • 13. Molecular Attributes
    An attribute provides information about the entity in question (e.g. shape, function, process)
    Sequence and structure provide information about:
    Motifs, domains, interaction/binding sites, post-translational modifications, conformational changes, molecular complexes, mutations, conservation/evolution
    Functions, localization, biological / pathological processes
  • 14. Gene Ontology
    Captures terminology related to three aspects
    biological processes
    molecular functions
    cellular components
    Relationships between terms are largely defined with “is a” and “part of” relations
    Cell division
    Isomerase activity
  • 15. GO Structure
    [Figure: example hierarchy with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relations]
    Species independent. Some lower-level terms are specific to a group, but higher level terms are not
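
The is-a / part-of structure can be sketched as a small directed graph in which every term points to its parents. This is only an illustration: the five terms come from the slide, but the exact edges between them are assumptions, and `ancestors` is a hypothetical helper rather than part of any GO tool.

```python
from collections import deque

# (child, relation, parent). Terms are from the slide; the edges are
# illustrative assumptions, not a copy of the real ontology.
EDGES = [
    ("membrane", "part_of", "cell"),
    ("chloroplast", "part_of", "cell"),
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
]


def ancestors(term, edges=EDGES):
    """Return every term reachable from `term` via is_a or part_of edges."""
    parents = {}
    for child, _relation, parent in edges:
        parents.setdefault(child, set()).add(parent)
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane")))
```

A gene annotated to a lower-level term is implicitly annotated to all of that term's ancestors, which is why enrichment tools usually propagate annotations up the hierarchy before counting.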
  • 16. Gene Ontology
    30,393 terms, 99.2% with definitions
    • 18,939 biological processes
    • 2,735 cellular components
    • 8,719 molecular functions
    GO Slim is an official reduced set of GO terms
    • Generic, plant, yeast
    • Good for making pie charts
  • 20. Annotation
    Manual annotation
    Created by scientific curators
    High quality
    Small number (time-consuming to create)
    Electronic annotation
    Annotation derived without human validation
    Computational predictions (accuracy varies)
    Lower ‘quality’ than manual codes
    Key point: be aware of annotation origin
  • 21. Evidence Type (provenance of facts)
    • ISS: Inferred from Sequence/Structural Similarity
    • IDA: Inferred from Direct Assay
    • IPI: Inferred from Physical Interaction
    • IMP: Inferred from Mutant Phenotype
    • IGI: Inferred from Genetic Interaction
    • IEP: Inferred from Expression Pattern
    • TAS: Traceable Author Statement
    • NAS: Non-traceable Author Statement
    • IC: Inferred by Curator
    • ND: No Data available
    • IEA: Inferred from electronic annotation
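
Being aware of annotation origin often means filtering annotations by evidence code before an analysis. A minimal sketch, assuming annotations are held as (gene, GO term, evidence code) tuples; the gene names and the tuple layout are hypothetical, and only GO:0000724 is taken from the slides.

```python
# Evidence codes from the list above that rest on experiments.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotation records: (gene, GO term, evidence code).
ANNOTATIONS = [
    ("GENE_A", "GO:0000724", "IDA"),
    ("GENE_B", "GO:0000724", "IEA"),
    ("GENE_C", "GO:0000724", "IMP"),
    ("GENE_D", "GO:0000724", "ISS"),
]


def filter_annotations(annotations, allowed_codes):
    """Keep only annotations whose evidence code is in `allowed_codes`."""
    return [a for a in annotations if a[2] in allowed_codes]


experimental_only = filter_annotations(ANNOTATIONS, EXPERIMENTAL)
print([gene for gene, _term, _code in experimental_only])
```

Dropping IEA and similarity-based codes trades coverage for reliability, so it is worth running an enrichment analysis both ways and comparing.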
  • 32. Variable Coverage
    Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
  • 33. GO Software Tools
    GO resources are freely available to anyone without restriction
    Includes the ontologies, gene associations and tools developed by GO
    Other groups have used GO to create tools for many purposes
    http://www.geneontology.org/GO.tools
  • 34. Accessing GO: QuickGO
    http://www.ebi.ac.uk/ego/
  • 35. Explore Ontologies
    http://www.ebi.ac.uk/ontology-lookup
  • 36. Databases of Molecular Annotation
    NCBI
    Genbank / RefSeq
    Entrez Gene
    EBI
    UniProt
    Ensembl BioMart (eukaryotes)
    Model Organism Databases
    Berkeley Drosophila Genome Project (BDGP)
    dictyBase (Dictyostelium discoideum)
    FlyBase (Drosophila melanogaster)
    GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei)
    UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases
    Gramene (grains, including rice, Oryza)
    Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus)
    Rat Genome Database (RGD) (Rattus norvegicus)
    Reactome
    Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae)
    The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana)
    The Institute for Genomic Research (TIGR): databases on several bacterial species
    WormBase (Caenorhabditis elegans)
    Zebrafish Information Network (ZFIN) (Danio rerio)
  • 38. Identifiers
    Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
    E.g. Social Insurance Number, Entrez Gene ID 41232
    Gene and protein information stored in many databases
     Genes have many IDs
    Records for: Gene, DNA, RNA, Protein
    Important to recognize the correct record type
    E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins.
  • 39. NCBI Database Links
    NCBI:
    U.S. National Center for Biotechnology Information
    Part of National Library of Medicine (NLM)
    http://www.ncbi.nlm.nih.gov/Database/datamodel/data_nodes.swf
  • 40. Common Identifiers
    Species-specific
    HUGO HGNC BRCA2
    MGI MGI:109337
    RGD 2219
    ZFIN ZDB-GENE-060510-3
    FlyBase CG9097
    WormBase WBGene00002299 or ZK1067.1
    SGD S000002187 or YDL029W
    Annotations
    InterPro IPR015252
    OMIM 600185
    Pfam PF09104
    Gene Ontology GO:0000724
    SNPs rs28897757
    Experimental Platform
    Affymetrix 208368_3p_s_at
    Agilent A_23_P99452
    CodeLink GE60169
    Illumina GI_4502450-S
    Gene
    Ensembl ENSG00000139618
    Entrez Gene 675
    Unigene Hs.34012
    RNA transcript
    GenBank BC026160.1
    RefSeq NM_000059
    Ensembl ENST00000380152
    Protein
    Ensembl ENSP00000369497
    RefSeq NP_000050.2
    UniProt BRCA2_HUMAN or A1YBP1_HUMAN
    IPI IPI00412408.1
    EMBL AF309413
    PDB 1MIU
    Red = Recommended
  • 41. Identifier Mapping
    So many IDs!
    Mapping (conversion) is a headache
    Four main uses
    Disambiguate similarly named entities
    Used to reference related information
    Biological and informational provenance
    E.g. Genes to proteins, Entrez Gene to Affy
    Unification during dataset merging
    Equivalent entities
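
Identifier mapping amounts to a lookup from one namespace to another, with some bookkeeping for identifiers that fail to map. The sketch below uses the BRCA2 identifiers from the "Common Identifiers" slide; the record layout, the helper names and the unknown ID "99999999" are assumptions made for illustration, not the interface of Synergizer, BioMart or UniProt.

```python
# One record per gene; values are the BRCA2 examples from the slides.
# The record layout itself is an assumption.
RECORDS = [
    {
        "hgnc": "BRCA2",
        "entrez_gene": "675",
        "ensembl_gene": "ENSG00000139618",
        "uniprot": "BRCA2_HUMAN",
    },
]


def build_index(records, source, target):
    """Map each `source` identifier to the set of `target` identifiers."""
    index = {}
    for record in records:
        if source in record and target in record:
            index.setdefault(record[source], set()).add(record[target])
    return index


def translate(ids, index):
    """Split `ids` into mapped identifiers and those with no mapping."""
    mapped = {i: index[i] for i in ids if i in index}
    unmapped = [i for i in ids if i not in index]
    return mapped, unmapped


index = build_index(RECORDS, "entrez_gene", "ensembl_gene")
mapped, unmapped = translate(["675", "99999999"], index)
print(mapped, unmapped)
```

Mapping to a set rather than a single value keeps one-to-many relationships visible, and reporting the unmapped identifiers makes silent data loss during dataset merging easier to spot.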
  • 42. ID Mapping Services
    Synergizer
    http://llama.med.harvard.edu/synergizer/translate/
    Ensembl BioMart
    http://www.ensembl.org
    UniProt
    http://www.uniprot.org/
  • 43. Outline
    Explore identified proteins
    Attribute enrichment
    Networks
    Pathways
  • 44. Attribute Enrichment (AE)
    Given:
    list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42
    attributes: e.g. function, process, localization, interactions
    AE Question: Are any of the attributes surprisingly enriched in the list?
    Details:
    How to assess “surprisingly” (statistics)
    How to correct for repeating the tests
  • 45. What is a P-value?
    The P-value is the probability of observing the statistics calculated from the data, or ones more extreme, given a sample of the same size distributed according to the “null hypothesis”.
    It is not the probability that the null hypothesis is true.
    Intuitively: if the null hypothesis is rejected whenever the P-value falls below a threshold, that threshold is the probability of a false positive result (aka “Type I error”) when the null hypothesis holds.
  • 46. How likely are the observed differences between the two distributions due to chance?
    [Figure: two value distributions plotted on the same axes (x: value, y: value distribution)]
  • 47. AE using the T-test
    Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?
    Answer: Two-tailed T-test
    Black: N1 = 500, Mean: m1 = 1.1, Std: s1 = 0.9
    Red: N2 = 4500, Mean: m2 = 4.9, Std: s2 = 1.0
    T-statistic = $\frac{m_1 - m_2}{\sqrt{s_1^2/N_1 + s_2^2/N_2}} = -88.5$
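
The T-statistic can be reproduced from the summary statistics on this slide. The sketch below assumes the unequal-variance (Welch) form of the statistic, which is the form that gives -88.5 for these numbers; `welch_t` is a hypothetical helper name.

```python
import math


def welch_t(m1, s1, n1, m2, s2, n2):
    """T-statistic for two samples without assuming equal variances."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / standard_error


# Black: N1 = 500, mean 1.1, std 0.9. Red: N2 = 4500, mean 4.9, std 1.0.
t = welch_t(1.1, 0.9, 500, 4.9, 1.0, 4500)
print(round(t, 1))  # -88.5
```

With a statistic this far from zero, the two-tailed P-value is vanishingly small.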
  • 48. AE using the T-test
    P-value = shaded area * 2
    [Figure: T-distribution (probability density vs T-statistic, centred on 0) with the tail beyond T-statistic = -88.5 shaded]
    Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?
  • 49. T-test limitations
    Assumes distributions are both approximately Gaussian (i.e. normal)
    Score distribution assumption is often true for:
    Log ratios from microarrays
    Score distribution assumption is rarely true for:
    Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits
    Examples of non-Gaussian score distributions [Figures: probability density vs score]:
    Values are positive and have increasing density near zero, e.g. sequence counts
    Bimodal “two-bumped” distributions
    Distributions with outliers, or “heavy-tailed” distributions
    Tests for significance of a difference in the means of two distributions but does not test for other differences between distributions.
  • 50. Kolmogorov-Smirnov (K-S) test
    Question: Are the red and black distributions significantly different?
    Calculate cumulative distributions of red and black
    Formal question: Is the length of largest difference between the “empirical distribution functions” statistically significant?
    [Figure: probability density vs score for the red and black distributions, and their cumulative distributions (cumulative probability vs score); the largest difference has length = 0.4]
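
The "length of largest difference" is the K-S statistic. A minimal sketch of how it can be computed from two samples is given here; `ks_statistic` is a hypothetical helper, the sample values are made up, and the code only measures the difference without assessing its significance.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between two empirical distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest = 0.0
    # The gap can only change at observed values, so checking them suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Because it compares whole distributions rather than means, the statistic also responds to differences in spread or shape, which is what makes it useful when the Gaussian assumption of the T-test fails.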
  • 51. What is the probability of finding 4 or more proteins with feature X in a random sample of 5 proteins?
    list
    RRP6
    MRD1
    RRP7
    RRP43
    RRP42
    Background population:
    500 X proteins,
    5000 proteins
  • 52. Fisher’s exact test
    [Figure: null distribution with the P-value region marked]
    Answer = 4.6 x 10^-4
    list
    RRP6
    MRD1
    RRP7
    RRP43
    RRP42
    P-value for Fisher’s exact test
    is “the probability that a random draw of the same size as the list from the background population would produce the observed number (or more) of attributes in the list.”,
    depends on size of the list, # with features (in list, background), and the background population.
    Background population:
    500 X proteins,
    5000 proteins
  • 53. Important details
    To test for under-enrichment of “black”, test for over-enrichment of “red”.
    Need to choose the “background population” appropriately, e.g., if only a portion of the total complement is queried (or has annotation), use only that population as background.
    To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type.
    The hypergeometric test is equivalent to a one-tailed Fisher’s exact test.
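
The one-tailed P-value for the worked example (4 or more proteins with feature X in a list of 5, drawn from 5000 proteins of which 500 have X) can be computed directly from the hypergeometric distribution. This is a sketch using only the standard library; `hypergeom_tail` is a hypothetical helper name.

```python
from math import comb


def hypergeom_tail(k, list_size, with_feature, population):
    """P(at least k of `list_size` random draws carry the feature)."""
    total = comb(population, list_size)
    upper = min(list_size, with_feature)
    favourable = sum(
        comb(with_feature, i) * comb(population - with_feature, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


p = hypergeom_tail(4, 5, 500, 5000)
print(f"{p:.1e}")  # 4.6e-04
```

Changing `population` or `with_feature` shows how strongly the result depends on the choice of background.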
  • 54. How to win the P-value lottery, part 1
    Random draws
    Expect a random draw with observed enrichment once every 1 / P-value draws
    … 7,834 draws later …
    Background population:
    500 X
    5000 Y
  • 55. How to win the P-value lottery, part 2: Keep the list the same, evaluate different annotations
    [Figure: the observed draw (RRP6, MRD1, RRP7, RRP43, RRP42) evaluated against different annotations]
  • 56. Correcting for multiple tests
    The Bonferroni correction controls the probability that at least one positive test is due to random chance, aka the Family-Wise Error Rate (FWER)
    If M = # of annotations tested: Corrected P-value = M x original P-value
    The Benjamini-Hochberg (B-H) correction controls the expected proportion of positive tests (i.e. rejections of the null hypothesis) that are false positives, aka the False Discovery Rate (FDR)
    FDR is the expected proportion of the observed enrichments that are due to random chance.
    B-H is less stringent than Bonferroni
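
Both corrections can be sketched in a few lines. The P-values below are made up for illustration, and the function names are hypothetical; the Benjamini-Hochberg version returns adjusted P-values (each P-value times M divided by its rank, made monotone from the largest rank down).

```python
def bonferroni(p_values):
    """Corrected P-value = M x original P-value, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """B-H adjusted P-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger P-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical enrichment P-values
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

At a threshold of 0.05, only two of these four tests survive Bonferroni while all four survive B-H, which illustrates the difference in stringency.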
  • 57. Reducing multiple test correction stringency
    The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the stricter the threshold becomes and the more sensitive each test needs to be
    Can control the stringency by reducing the number of tests:
    e.g. use GO slim or restrict testing to the appropriate GO annotations.
  • 58. AE tools
    Web-based tools
    Funspec:
    easy tool for yeast, not maintained, uses GO annotations and some other annotations (e.g. protein complexes)
    YeastFeatures
    Similar to Funspec, different datasets and presentation
    GoMiner:
    Uses GO annotations, covers many organisms, needs a background set of genes
    Cytoscape-based tools
    BINGO:
    Does GO annotations and displays enrichment results graphically and visually organizes related categories
  • 59. Funspec: Simple ORA for yeast
    http://funspec.med.utoronto.ca/
    Choose sources of annotation
    Bonferroni correct? YES!
    Paste list here
    Caveats:
    • yeast only,
    • last updated 2002
  • 61. http://software.dumontierlab.com/yeastfeatures
  • 63. GoMiner, part 1
    http://discover.nci.nih.gov/gominer
    1. Click “web interface”
    2. Upload background
    3. Upload list
    4. Choose organism
    5. Choose evidence code (All or Level 1)
  • 64. GoMiner, part 2
    6. Restrict # of tests via category size
    7. Restrict # of tests via GO hierarchy
    8. Results emailed to this address, in a few minutes
  • 65. DAVID, part 1 http://david.abcc.ncifcrf.gov/
    Paste list here
    DAVID automatically detects organism
    Choose ID type
    List type: list or background?
  • 66. DAVID, part 2
    http://david.abcc.ncifcrf.gov/
  • 67. BINGO, an ORA Cytoscape plugin
    http://www.psb.ugent.be/cbd/papers/BiNGO/index.htm
    Links represent parent-child relationships in GO ontology
    Colours represent significance of enrichment
    Nodes represent GO categories
  • 69. Outline
    Explore identified proteins
    Attribute enrichment
    Networks
    • Physical networks
    • Genetic networks
    • Functional networks
    Pathways
  • 72. Why Network and Pathway Analysis?
    Intuitive to Biologists
    • Provide a biological context for results
    • More efficient than searching databases gene-by-gene
    • Intuitive display for sharing data
    Computation on Pathway Content
    • Visualize multiple data types on a pathway or network
    • Find active pathways
    • Identify potential regulators
  • 77. Network
    In biology, a network is a graph composed of nodes that correspond to entities (genes, proteins, small molecules) and edges that correspond to physical/agentive or associative relations between entities.
    [Figure: example graph labelling a vertex (node), an edge, a directed edge (arc), a cycle, and weighted edges (weights -5, 7, 10)]
  • 78. Integration in a Network Context
  • 79. Integration in a Network Context
    Expression data mapped to node colours
  • 80. Mapping Biology to a Network
    A simple mapping: Protein-protein interactions
    one protein/node, one interaction/edge
    Edges can represent other relationships
    Physical e.g. protein-protein interaction
    Regulatory e.g. kinase activates target
    Genetic e.g. epistasis
    Similarity e.g. protein sequence similarity
    Critical: understand the mapping for network analysis
  • 81. Protein Sequence Similarity Network
    http://apropos.icmb.utexas.edu/lgl/
  • 82. Literature Network
    Computationally extract gene relationships from text, usually PubMed abstracts
    Useful if network is not in a database
    Literature search tool
    BUT not perfect
    Problems recognizing gene names
    Natural language processing is difficult
    Agilent Literature Search Cytoscape plugin
    iHOP (www.ihop-net.org/UniPub/iHOP/)
  • 83. Agilent Literature Search
  • 84. Cytoscape Network produced by Literature Search.
    Abstract from the scientific literature
    Sentences for an edge
  • 85. Enrichment Map
    [Figure: gene-sets A and B and their overlap]
  • 86. Nodes represent gene-sets
  • 87. Muscle Contraction
    Olfactory Receptor
    Ubiquitin Processes
    Ubiquitin-dependent Proteolysis
    Ectodermal Dev. &
    Keratinocyte Diff.
    DNA Repair
    Mitotic Cell Cycle
    Ubiquitin Ligase
    DNA Processes
    Cytoskeleton
    DNA Replication
    Intermediate Filament Cytoskeleton
    Microtubule Cytoskeleton
    Ras GTPase
    mRNA Transport
    Chromosome
    RNA Processes
    Serine Endopeptidase
    Chromatin Remodeling
    RNA Splicing
    Fatty Acid Metabolism
    Ion Channel
    Transcription
    Calcium
    rRNA Processing
    Mitochondrial Oxidative Metabolism
    Ribonucleotide Metabolism
    Potassium Sodium
    Translation
  • 88. Physical Networks
    [Figure: A interacting with B]
    Between two molecular objects
    DNA, RNA, gene, protein, complex, small molecule, photon
    Requires a site of interaction / binding
    Biologically relevant:
    Present/expressed at the same time
    Share a cellular location
    Leads to some biologically relevant outcome
  • 89. Molecular Interactions
    RAS interacting with RALGDS
    (PDB: 1LFD)
    Synthetic protein interacting with ATP and Zinc
    (PDB: 2P0X)
  • 90. Experimental Interaction Discovery
    [Figure: experimental methods (Mass Spectrometry, Genetics, Two-Hybrid, Microarray, X-Ray, NMR) arranged by the kind of evidence they provide: Direct, Physical; Indirect, Physical; Indirect, Genetic]
  • 91. Experimental Considerations
    How do you know if the interaction really exists?
    Each method has its advantages and disadvantages.
    Be aware of systematic errors
    Be aware of contaminants.
    Each method observes interactions from a slightly different experimental condition.
    Support from many different sources is certainly better (necessary) than just one.
  • 92. Some affinity purification caveats
    First and most importantly, this is only a representation of the observation.
    You can only tell what proteins are in the eluate; you can’t tell how they are connected to one another.
    If there is only one other protein present (B), then it’s likely that A and B are directly interacting.
    But, what if I told you that two other proteins (B and C) were present along with A….
    [Figure: A with B; A with B and C]
  • 93. Complexes with unknown topology
    [Figure: three alternative arrangements of A, B and C]
    Which of these models is correct?
    The complex described by this experimental result is said to have an Unknown Topology.
  • 94. Complexes with unknown stoichiometry
    [Figure: a complex drawn with several copies of A and of B]
    Here’s another possibility.
    The complex described by this experimental result is also said to have Unknown Stoichiometry.
  • 95. Interaction Models
    [Figure: the actual topology of a complex compared with its Spoke and Matrix representations]
    Spoke: simple model, useful for data navigation; more accurate
    Matrix: theoretical max. number of interactions
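
The two models can be sketched as ways of turning one purification (a bait plus the proteins found with it) into pairwise edges. The function names are hypothetical and the protein labels are placeholders.

```python
from itertools import combinations


def spoke_edges(bait, preys):
    """Spoke model: the bait is linked to every protein found with it."""
    return [(bait, prey) for prey in preys]


def matrix_edges(bait, preys):
    """Matrix model: every pair of proteins in the purification is linked."""
    return list(combinations([bait, *preys], 2))


print(spoke_edges("A", ["B", "C"]))   # [('A', 'B'), ('A', 'C')]
print(matrix_edges("A", ["B", "C"]))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

For a purification with n proteins in total, the spoke model yields n - 1 edges and the matrix model n(n - 1)/2, so the matrix model grows much faster with complex size.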
  • 96. High-throughput Mass Spectrometric Protein Complex Identification (HMS-PCI)
    Mike Tyers, SLRI
    Ste12
    Ho et al. Nature. 2002 Jan 10;415(6868):180-3
  • 98. k-core analysis
    A k-core is a part of a graph where every node is connected to at least k other nodes within that part (k=0,1,2,3...)
    The highest k-core is the central, most densely connected region of a graph
    Regions of dense connectivity may represent molecular complexes
    Therefore, high k-cores may be molecular complexes
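
A k-core can be found by repeatedly removing nodes that have fewer than k remaining neighbours. The sketch below is a plain implementation of that peeling idea on a made-up edge list; it is not the MCODE algorithm, and `k_core` is a hypothetical helper name.

```python
def k_core(edges, k):
    """Return the nodes of the k-core of an undirected graph."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    changed = True
    while changed:
        changed = False
        for node in list(neighbours):
            if len(neighbours[node]) < k:
                # Drop the node and detach it from its remaining neighbours.
                for other in neighbours.pop(node):
                    neighbours[other].discard(node)
                changed = True
    return set(neighbours)


# Hypothetical network: a fully connected group A-D with a tail D-E-F.
EXAMPLE_EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "E"), ("E", "F"),
]
print(sorted(k_core(EXAMPLE_EDGES, 3)))  # ['A', 'B', 'C', 'D']
```

Here the tail is peeled away and the densely connected group remains, which is the behaviour that motivates treating high k-cores as candidate complexes.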
  • 99. Interaction can define function
    [Figure: highest k-cores of the Pre MS, Ho, Gavin and Union interaction networks (panel labels: 6-core, 6-core, 6-core, 9-core)]
    MCODE plugin for Cytoscape
  • 100. http://pathguide.org
  • 101. Interaction Databases
    Experiment (E)
    Structure detail (S)
    Predicted
    Physical (P)
    Functional (F)
    Curated (C)
    Homology modeling (H)
    *IMEx consortium
  • 102. Network Classification of Disease
    Traditional: Gene association
    Limitations: Too many genes reduces statistical power
    New: Active cell map based approaches combining network and molecular profiles
    Chuang HY, Lee E, Liu YT, Lee D, Ideker T
    Network-based classification of breast cancer metastasis
    Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16
    Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S
    Network-based analysis of affected biological processes in type 2 diabetes models
    PLoS Genet. 2007 Jun;3(6):e96
    Efroni S, Schaefer CF, Buetow KH
    Identification of key processes underlying cancer phenotypes using biologic pathway analysis
    PLoS ONE. 2007 May 9;2(5):e425
  • 103. Network-Based Breast Cancer Classification
    57k interactions from Y2H, orthology, co-citation, HPRD, BIND, Reactome
    2 breast cancer cohorts, different expression platforms
    Chuang HY, Lee E, Liu YT, Lee D, Ideker T
    Network-based classification of breast cancer metastasis
    Mol Syst Biol. 2007;3:140. Epub 2007 Oct 16
  • 104. Similar network markers across 2 data sets (better than original overlap)
    Increased classification accuracy
    Better coverage of known cancer risk genes (*)
  • 105. PIPE
    Predicts yeast PPI from sequence
    Uses interaction databases to find similar interacting proteins
    Estimates the site of interaction
    75% accuracy (61% sensitivity, 89% specificity)
    Finds new interactions among complexes
  • 108. PIPE2
    First all-to-all sequence-based computational screen of PPIs in yeast
    29,589 high confidence interactions of ~ 2 x 10^7 possible pairs
    16,000x faster than PIPE
    99.95% specificity
  • 109. Synthetic Genetic Interactions
    Synthetic genetic interactions (lethal, slow growth)
    Mate two mutants without phenotypes to get a daughter cell with a phenotype
    Synthetic lethal (SL), slow growth
    robotic mating using the yeast deletion library
    Genetic interactions provide functional data on protein interactions or redundant genes
    About 23% of known SLs (1295 - YPD+MIPS) were known protein interactions in yeast
    Tong et al. Science. 2001 Dec 14;294(5550):2364-8
  • 110. Synthetic Genetic Interactions in Yeast
    Tong, Boone
    [Figure: genetic interaction network with genes grouped by function: Cell Polarity, Cell Wall Maintenance, Cell Structure, Mitosis, Chromosome Structure, DNA Synthesis, DNA Repair, Unknown, Others]
  • 111. Validation: Protein Localization
    A – A3: Y2H
    B: physical methods
    C: genetic
    E: immunological
    True positives:
    • Localized in the same cellular compartment
    • Have common cellular role
    Sprinzak, Sattath, Margalit, J Mol Biol, 2003
  • 113. Comparisons
    All methods except Y2H and synthetic lethality are biased toward abundant proteins.
    PPI data sets are biased toward certain cellular localizations.
    Evolutionarily conserved proteins have much better coverage in Y2H than proteins restricted to a particular organism.
    von Mering et al., Nature, 2002
  • 114. Functional Associations
    Molecular Interactions
    Regulatory Interactions
    Genetic Interactions
    Similarity relationships
    Co-expression
    Protein sequence
    Domain architecture
    Phylogenetic profiles
    Gene neighborhood
    Gene fusion
  • 115. http://string.embl.de/
    von Mering et al., Nucleic Acids Res., 2005
  • 116. 95
  • 117. 96
  • 118. Gene Function Prediction using a Multiple Association Network Integration Algorithm
    Query-specific weights for multifaceted function queries
    [Figure: a weighted sum (weights w1, w2, w3) of co-complexed (Durrett 2006), genetic (Tong et al. 2001) and co-expression networks gives a composite network. Example genes: CDC27, CDC23, APC11, UNK1 (cell cycle); RAD54, XRS2, MRE11, UNK2 (DNA repair).]
    Pavlidis et al., 2002; Lanckriet et al., 2004; Mostafavi et al., 2008
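The idea of query-specific weights can be illustrated with a toy calculation. This is a simplified sketch, not the GeneMANIA algorithm: the published method fits the weights by regression, whereas here each network is simply weighted in proportion to how strongly the query genes are connected within it. The gene names come from the slide; the edges, the function names and the proportional rule are assumptions made for illustration.

```python
from collections import defaultdict

# A network maps an unordered gene pair to an edge strength
Network = dict[frozenset[str], float]


def query_connectivity(network: Network, query: set[str]) -> float:
    """Total edge strength between pairs of query genes in one network."""
    return sum(w for pair, w in network.items() if pair <= query)


def network_weights(networks: dict[str, Network], query: set[str]) -> dict[str, float]:
    """Weight each network by its share of the query-gene connectivity."""
    scores = {name: query_connectivity(net, query) for name, net in networks.items()}
    total = sum(scores.values())
    if total == 0:  # query genes unconnected everywhere: fall back to equal weights
        return {name: 1.0 / len(networks) for name in networks}
    return {name: score / total for name, score in scores.items()}


def composite(networks: dict[str, Network], weights: dict[str, float]) -> Network:
    """Weighted sum of the input networks, edge by edge."""
    combined: defaultdict[frozenset[str], float] = defaultdict(float)
    for name, net in networks.items():
        for pair, w in net.items():
            combined[pair] += weights[name] * w
    return dict(combined)


def edge(a: str, b: str) -> frozenset[str]:
    return frozenset((a, b))


# Toy edges, invented for this example
networks = {
    "co-complexed": {edge("CDC27", "CDC23"): 1.0, edge("CDC23", "APC11"): 1.0,
                     edge("CDC27", "APC11"): 1.0},
    "genetic": {edge("RAD54", "XRS2"): 1.0, edge("XRS2", "MRE11"): 1.0},
    "co-expression": {edge("CDC27", "CDC23"): 1.0, edge("APC11", "UNK1"): 1.0,
                      edge("MRE11", "UNK2"): 1.0},
}
query = {"CDC27", "CDC23", "APC11"}

weights = network_weights(networks, query)
combined = composite(networks, weights)
print(weights)
print(combined[edge("APC11", "UNK1")])
```

With a cell-cycle query the co-complexed network dominates and the genetic network, which only links DNA repair genes here, drops out; a DNA repair query would reverse that.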
  • 119. GeneMANIA Cytoscape Plugin
  • 120. Outline
    Explore identified proteins
    Attribute enrichment
    Networks
    Pathways
    Lab
  • 121. pathway
    In biology, a pathway is a network which consists of inputs (physical entities), outputs (physical entities, biological outcomes), and the molecular machinery and chemical transformations required/expected to realize the end-directed activity.
  • 122. Using Pathway Information
    Expert knowledge
    Experimental Data
    Find active processes underlying a phenotype
    Databases
    Literature
    Pathway Information
    Pathway Analysis
  • 123. >290 Pathway Databases!
    http://pathguide.org
    • Varied formats, representation, coverage
    • Pathway data extremely difficult to combine and use
    Vuk Pavlovic
    Sylva Donaldson
  • 125. Aim: Convenient Access to Pathway Information
    http://www.pathwaycommons.org
    Facilitate creation and communication of pathway data
    Aggregate pathway data in the public domain
    Provide easy access for pathway analysis
  • 126. Access From Cytoscape
  • 127. cardiomyopathy: downregulated genes
    Fatty Acid Degradation?
    Other pathways / processes?
    GenMAPP.org
  • 128. Fatty Acid Degradation Pathway
  • 129. Cardiomyopathy Data on Fatty Acid Degradation Pathway
  • 130. Visualizing Time Course Data on Pathways: Multiple Comparison View
  • 131. Outline
    Explore identified proteins
    Attribute enrichment
    Networks
    Pathways
    Lab
  • 132. Network Analysis
    Cytoscape
    Visualize molecular interaction networks and integrate interactions with gene expression profiles and other state data. Data filters & custom plug-in architecture.
    http://www.cytoscape.org
    Biolayout Express 3D
    Large networks
    Gene expression
    www.sanger.ac.uk/Teams/Team101/biolayout/b3d.html
  • 133. Network Analysis using Cytoscape
    Expert knowledge
    Experimental Data
    Find biological processes underlying a phenotype
    Databases
    Literature
    Network Information
    Network Analysis
  • 134. http://cytoscape.org
    Network visualization and analysis
    Pathway comparison
    Literature mining
    Gene Ontology analysis
    Active modules
    Complex detection
    Network motif search
    UCSD, ISB, Agilent, MSKCC, Pasteur, UCSF, Unilever, UToronto, U Texas
  • 135. Manipulate Networks
    Filter/Query
    Interaction Database Search
    Automatic Layout
  • 136. Overview
    Zoom
    Focus
    PKC Cell Wall Integrity
  • 137. Active Community
    http://www.cytoscape.org
    Help
    8 tutorials, >10 case studies
    Mailing lists for discussion
    Documentation, data sets
    Annual Conference: Houston Nov 6-9, 2009
    10,000s of users, 2,500 downloads/month
    >40 Plugins Extend Functionality
    Build your own, requires programming
    Cline MS et al. Integration of biological networks and gene expression data using Cytoscape Nat Protoc. 2007;2(10):2366-82
  • 138. LAB
    Objective
    Create a map of the functional enrichments from the 14 input proteins
    Methods
    Use HGNC to obtain the gene symbols from the names
    Submit the gene symbols to a tool that already has datasets loaded.
    Get Attributes and do analysis on network
  • 139. 14 Proteins
    ISOFORM of APOPTOSIS-INDUCING FACTOR 1, MITOCHONDRIAL
    QUINONE OXIDOREDUCTASE.; 26 KDA PROTEIN.;22 KDA PROTEIN.; 32 KDA PROTEIN.
    14-3-3 PROTEIN EPSILON.
    ELONGATION FACTOR 1-GAMMA.; 50 KDA PROTEIN.
    AFG3-LIKE PROTEIN 2.
    3-KETOACYL-COA THIOLASE, MITOCHONDRIAL
    IMPORTIN BETA-1 SUBUNIT.
    FH1/FH2 DOMAIN-CONTAINING PROTEIN
    ANNEXIN VI ISOFORM 2.; ANNEXIN A6.
    2,4-DIENOYL-COA REDUCTASE, MITOCHONDRIAL
    HYDROXYACYL GLUTATHIONE HYDROLASE ISOFORM 1.; HYDROXYACYLGLUTATHIONE HYDROLASE.
    ISOFORM 1 OF ELECTRON TRANSFER FLAVOPROTEIN SUBUNIT BETA.; ISOFORM 2 OF ELECTRON TRANSFER FLAVOPROTEIN SUBUNIT BETA
    ISOFORM 1 OF LONG-CHAIN-FATTY-ACID--COA LIGASE 1
    PHOSPHOLIPASE C DELTA 4.
  • 140. Get their gene symbol/identifiers
    HGNC - http://www.genenames.org
    Provide a table of mappings
    What challenges did you face when trying to identify the symbols from textual descriptions?
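One of the challenges is that the textual descriptions are noisy: several names are packed into one line, isoform prefixes and suffixes get in the way, and entries such as "26 KDA PROTEIN" carry no usable name. A minimal sketch of a clean-up step to run before searching HGNC; the function name and the specific rules are assumptions chosen to fit the 14 descriptions in this lab, not a general solution:

```python
import re


def candidate_names(description: str) -> list[str]:
    """Split a protein description into cleaned names worth searching for."""
    names: list[str] = []
    for part in description.split(";"):
        name = part.strip().rstrip(".").strip()
        # Drop a leading "ISOFORM OF" / "ISOFORM 1 OF"
        name = re.sub(r"^ISOFORM(\s+\d+)?\s+OF\s+", "", name, flags=re.IGNORECASE)
        # Drop a trailing "ISOFORM 2"
        name = re.sub(r"\s+ISOFORM\s+\d+$", "", name, flags=re.IGNORECASE)
        # Skip empty fragments and uninformative "NN KDA PROTEIN" entries
        if not name or re.fullmatch(r"\d+\s+KDA\s+PROTEIN", name, flags=re.IGNORECASE):
            continue
        if name not in names:
            names.append(name)
    return names


print(candidate_names("QUINONE OXIDOREDUCTASE.; 26 KDA PROTEIN.;22 KDA PROTEIN.; 32 KDA PROTEIN."))
print(candidate_names("ANNEXIN VI ISOFORM 2.; ANNEXIN A6."))
```

The cleaned names still have to be matched to approved symbols by searching HGNC; the script only reduces the manual work.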
  • 141. Identify functional enrichments
    Discuss and provide a plot for the enrichment of Gene Ontology categories
  • 142. Build an attribute enrichment network
    Which new proteins are functionally linked?
    What datasets were used in the network construction?
  • 143. Attribute Enrichment with a custom data set
    Use BioMart to
    convert HGNC identifiers to Ensembl Identifiers
    Obtain the Gene Ontology categories for the target proteins and the background proteins.
    Use FUNC to do the enrichment analysis
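The core of an enrichment analysis is a comparison between the target list and the background. A minimal sketch using a one-sided hypergeometric test, which is one common choice for this kind of analysis; it illustrates the statistic only and does not reproduce FUNC. The counts in the example are hypothetical, apart from the list size of 14.

```python
from math import comb


def enrichment_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes with the GO category,
    n: genes in the target list, k: target genes with the GO category.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 4 of the 14 target proteins carry a category
# that annotates 100 of 20,000 background genes
p = enrichment_pvalue(k=4, n=14, K=100, N=20_000)
print(f"p = {p:.3g}")
```

Testing many GO categories at once calls for a multiple-testing correction on top of the per-category p-values.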
  • 144. 122
  • 145. 123
  • 146. 124
  • 147. 125
  • 148. Collect the Gene Ontology attributes for the list, then for all the human genes
  • 149. Next steps are harder…
    http://func.eva.mpg.de/
    To use FUNC, you need to convert the BioMart output to the file format above. This is easy to do in Excel for the protein list, but Excel can’t handle the results for all the human proteins. You need to write a small script: take BIOC3008 and become competent in simple data manipulation.
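The small script can stream the BioMart export row by row, so the size of the all-human-genes file does not matter. A sketch under stated assumptions: the input is taken to be a tab-separated export with a header row and two columns (gene identifier, GO term accession), and the output is taken to be three tab-separated columns (gene, GO term, 1 for target genes and 0 for background). Both layouts should be checked against the actual export and the FUNC documentation; the identifiers below are placeholders.

```python
import csv
import io
from typing import Iterable, TextIO


def biomart_to_func(rows: Iterable[list[str]], target_ids: set[str], out: TextIO) -> int:
    """Write one 'gene<TAB>GO<TAB>flag' line per annotated row; return lines written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for row in rows:
        if len(row) < 2:
            continue
        gene_id, go_id = row[0].strip(), row[1].strip()
        if not gene_id or not go_id:   # genes without a GO annotation are skipped
            continue
        writer.writerow([gene_id, go_id, 1 if gene_id in target_ids else 0])
        written += 1
    return written


# Placeholder data standing in for the BioMart export
export = io.StringIO(
    "Gene ID\tGO Term Accession\n"
    "GENE_A\tGO:0006631\n"
    "GENE_B\t\n"
    "GENE_C\tGO:0005739\n"
)
reader = csv.reader(export, delimiter="\t")
next(reader)                           # skip the header row

result = io.StringIO()
count = biomart_to_func(reader, target_ids={"GENE_A"}, out=result)
print(result.getvalue(), end="")
```

For real files, replace the two `io.StringIO` objects with `open(...)` handles for the export and the output file.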