Designing a community resource - Sandra Orchard

EMBL-ABR Best Practice Workshop Series - The Data Life-cycle

Published in: Science

  1. 1. Designing a community resource – the Complex Portal as an example Sandra Orchard
  2. 2. Hands-on exercise: design a manually curated data resource that enables the description of species-agnostic protein complexes and acts as a reference resource in the same way that UniProt does for proteins. Use as examples: 1. Human haemoglobin 2. Arabidopsis light-harvesting complex
  3. 3. Designing a new resource - what else is out there? • Before starting to design a resource, assess what else is out there – re-inventing the wheel causes community fragmentation and confusion, and wastes limited funds • Is it needed – what gap in the market is it designed to fill? • Investigate possibilities for collaboration, rather than competition • If another resource exists, does it meet your/consumer demands – can you contribute to it and improve it?
  4. 4. Designing a new resource • How will researchers use it, what information do they want? Conduct extensive user requirement studies before starting the design process. • How will users search it? This will impact on data entry/annotation. • Data visualisation – again, what do users want? Usability studies are critical • Long term plans – will it survive the first grant renewal?
  5. 5. Complex Portal - what else was out there? • Information on protein complexes was scattered between multiple resources, with no unifying resource • MIPS catalogued yeast complexes in 2000 • CORUM – human complexes, project terminated in 2009 • Decision – use as starting point or start again?
  6. 6. Information content and presentation • User consultation – design what they need, not what you want to give them • Don’t get too attached to your first paper prototype – be prepared to sacrifice your concept to community need • Develop a beta site, then observe researchers using it. • Keep testing, react to new demands, novel use cases
  7. 7. Use of community standards • Use of community standards enables: • Data merger across multiple resources – contribute to a greater community effort • Data re-use and longevity • Immediate access to existing tool suites
  8. 8. Use of Community standards – Complex Portal • Established standard formats for molecular interactions (PSI-MI XML/MITAB) • PSI-MI XML 2.5 was designed for experimental data, so curated complex data were not a perfect fit – worked with the PSI-MI workgroup to produce a new version • MITAB was designed for binary pairs, not complexes – ComplexTAB will be presented to the MI workgroup for adoption
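The point that a binary-pair format is a poor fit for complexes can be illustrated with a minimal sketch. It uses human haemoglobin from the hands-on exercise; the `Participant` record and `to_binary_pairs` helper are hypothetical names invented for this illustration and do not reproduce the MITAB or ComplexTAB formats.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Participant:
    accession: str      # UniProtKB accession of the component
    stoichiometry: int  # number of copies of this component in the complex


# Human haemoglobin as an alpha2-beta2 tetramer (illustrative record only).
haemoglobin = [
    Participant("P69905", 2),  # haemoglobin subunit alpha
    Participant("P68871", 2),  # haemoglobin subunit beta
]


def to_binary_pairs(participants):
    """Flatten a complex into unordered pairs of distinct components."""
    accessions = sorted(p.accession for p in participants)
    return list(combinations(accessions, 2))


pairs = to_binary_pairs(haemoglobin)
total_subunits = sum(p.stoichiometry for p in haemoglobin)

print(pairs)           # a single alpha/beta pair
print(total_subunits)  # the complex-level record still knows there are 4 subunits
```

The pair list keeps which components touch but drops how many copies of each are present, which is the kind of information a complex-oriented format needs to carry.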
  9. 9. Use of Community standards – Complex Portal • Used existing identifiers for components (UniProtKB, ChEBI, RNAcentral) • Enables import of additional information using resource APIs; for example, the website can be searched using gene synonyms • Organism non-specific: enables us to describe complexes in a range of species, including non-model organisms
  10. 10. Use of Community standards enables use of existing tools • Community standards have encouraged tool development by users, software often open-source and freely available – often can be incorporated directly into websites with little/no additional development • Complex Portal viewer originally written to visualise cross-linking data
  11. 11. Use of Community standards enables use of existing tools • Look for initiatives which make open-source tools, apps/plug-ins, visualizers and widgets freely available e.g. BioJS, BioPerl, Cytoscape……
  12. 12. Free text vs Ontologies • Free text. Pros: versatile, fully descriptive, flexible. Cons: can be difficult to interpret, long-winded, error-prone, difficult to search • Controlled vocabularies (CVs). Pros: structured, consistent, concise. Cons: may not deal well with ‘odd’ cases, lack of information • Consider using both!
  13. 13. Use of controlled vocabularies • Again, re-use rather than re-invent • Use of CVs enables searches across resources, but also can make intelligent searches within resources easy to implement For example can search for • all transcription factors • all complexes involved in respiration • all mitochondrial complexes
  14. 14. Use of controlled vocabularies In the Complex Portal you can search for 1. All enzymes - GO:0003824 (catalytic activity) 2. All transferases - GO:0016740 (transferase activity) 3. All protein kinases - GO:0004672 (protein kinase activity) 4. All cyclin-dependent protein kinases - GO:0097472 (cyclin-dependent protein kinase activity) Similarly, you can use the ChEBI ontology, e.g. search on porphyrin
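The four searches on slide 14 work because each term is a more specific child of the one before it, so a query on a broad term also returns anything annotated with a narrower one. A minimal sketch of that idea follows. It assumes a simplified parent map in which the intermediate GO terms between these four are collapsed, and the complex identifiers are hypothetical; it is not how the Complex Portal implements its search.

```python
# Simplified is_a links between the four GO terms named on the slide.
# The real ontology has intermediate terms between them; they are collapsed here.
PARENT = {
    "GO:0097472": "GO:0004672",  # cyclin-dependent protein kinase activity -> protein kinase activity
    "GO:0004672": "GO:0016740",  # protein kinase activity -> transferase activity
    "GO:0016740": "GO:0003824",  # transferase activity -> catalytic activity
}

# Hypothetical complexes, each annotated with its most specific known term.
ANNOTATIONS = {
    "complex_A": "GO:0097472",
    "complex_B": "GO:0004672",
    "complex_C": "GO:0016740",
    "complex_D": "GO:0005215",  # transporter activity, outside this branch
}


def term_and_ancestors(term):
    """Return the term followed by every ancestor reachable through PARENT."""
    lineage = [term]
    while term in PARENT:
        term = PARENT[term]
        lineage.append(term)
    return lineage


def search(query):
    """Find complexes annotated with the query term or any of its descendants."""
    return sorted(
        complex_id
        for complex_id, term in ANNOTATIONS.items()
        if query in term_and_ancestors(term)
    )


print(search("GO:0003824"))  # all enzymes in the toy data
print(search("GO:0004672"))  # protein kinases only
```

A free-text field could not support this: the curator annotates once with the most specific term, and the hierarchy supplies the broader searches.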
  15. 15. Linking to external resources • Extensive cross- referencing is time consuming but enables subsequent pulling in of data from other resources
  16. 16. Make this the ‘go to’ resource for your community • Must fit community need, be easy to search and deliver the results the user wants Outreach – publications, conferences, talks…. Collaborate on a high impact analysis paper, with your resource playing a key role. Protocols, tutorials, videos, hands-on training courses. Use social media
  17. 17. Using Social Media
  18. 18. InterPro and Annotation transfer to Non-Model Organism Proteomes
  19. 19. What is InterPro • InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. • It combines protein signatures from a number of member databases into a single searchable resource. • This has resulted in an integrated database and diagnostic tool (InterProScan).
  20. 20. Protein signatures Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment • Patterns • Profiles • Profile HMMs Use these models (signatures) to infer relationships with the characterised sequences from which the alignment was constructed Approach used by a variety of databases: Pfam, TIGRFAMs, PANTHER, Prosite, etc
  22. 22. Introduction to InterPro: how are protein signatures made? Protein family/domain → multiple sequence alignment → build model → search → significant matches → refine → protein signature. Example aligned sequences: ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK, SVPRSPVCGSDGVTYGTECDLK, HPPPGPVCGTDGLTYDNRCELR. Example match E-values: 1e-49, 3e-42, 5e-39, 6e-10.
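The simplest of the three signature types, a pattern, can be derived directly from the four aligned sequences on slide 22. The sketch below keeps every fully conserved column and turns every variable column into a wildcard. The function name `derive_pattern` is made up for this illustration, and the approach is far cruder than the profiles and profile HMMs that the member databases build, which score matches probabilistically and report E-values.

```python
import re

# The four aligned sequences shown on slide 22 (all the same length, no gaps).
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]


def derive_pattern(aligned):
    """Keep columns where every sequence agrees; use '.' for variable columns."""
    columns = zip(*aligned)
    raw = "".join(col[0] if len(set(col)) == 1 else "." for col in columns)
    # Leading and trailing wildcards add nothing to a search pattern.
    return raw.strip(".")


pattern = derive_pattern(ALIGNMENT)
signature = re.compile(pattern)


def matches(sequence):
    """Report whether the sequence contains the conserved pattern anywhere."""
    return signature.search(sequence) is not None


print(pattern)
print([matches(seq) for seq in ALIGNMENT])
```

Searching a sequence database with such a pattern and feeding significant new matches back into the alignment is the "search, then refine" loop from the slide.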
  23. 23. Structural domains • Functional annotation of families/domains • Protein features (sites) • Hidden Markov Models • Fingerprints • Profiles • Patterns • HAMAP
  24. 24. Member databases:

      | Database | Basis | Institution | Built from | Focus | URL |
      |---|---|---|---|---|---|
      | Pfam | HMM | EBI | Sequence alignment | Family & Domain based on conserved sequence | http://pfam.xfam.org/ |
      | Gene3D | HMM | UCL | Structure alignment | Structural Domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/ |
      | Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/ |
      | SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/ |
      | TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/ |
      | Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/ |
      | PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml |
      | PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php |
      | PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/ |
      | HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/ |
  25. 25. The aim of InterPro
  26. 26. InterPro: multiple sequence analysis • Outputs TSV, XML, GFF3, HTML & SVG formats
  27. 27. InterPro as a tool for Automatic annotation
  28. 28. Why automatic annotation is needed • Data growth in UniProtKB is fast • Manual curation is time-consuming • Experimental data are unavailable for many sequences/organisms • Organisms’ genomes are sequenced but often no biochemical characterization is conducted

      | Release | Section of database | No. of entries | Growth |
      |---|---|---|---|
      | 2015_10 | reviewed (Swiss-Prot) | ~0.5 million | slow |
      | 2015_10 | unreviewed (TrEMBL) | >50 million | rapid |
  29. 29. The Concepts in GO 1. Molecular Function: an elemental activity, task or job, e.g. protein kinase activity, insulin receptor activity 2. Biological Process: a commonly recognised series of events, e.g. cell division 3. Cellular Component: where a gene product is located, e.g. mitochondrion, mitochondrial matrix, mitochondrial inner membrane
  30. 30. The relationship between InterPro and GO (InterPro2GO) • Curators manually add relevant GO terms to InterPro entries • InterPro entry specificity determines the GO terms assigned GO:0007186 G-protein coupled receptor signaling GO:0016021 integral to membrane GO:0007601 visual perception GO:0007186 G-protein coupled receptor signaling GO:0016021 integral to membrane
  31. 31. InterPro2GO
  32. 32. Using InterPro for annotation • InterPro is the world’s major source of GO terms: ~90 million GO terms for ~30 million distinct UniProtKB sequences • It also underlies the system adding annotation to UniProtKB/TrEMBL • It provides matches to ~40 million proteins (approx. 80% of UniProtKB) • Annotation consistency: using InterPro and GO for annotation allows direct comparison of proteins in UniProtKB
  33. 33. Automatic Annotation in UniProtKB

      | System | Rule creation | Trigger | Annotations | Scope |
      |---|---|---|---|---|
      | SAAS | automatic | taxonomy, InterPro | protein names, EC numbers, comments, KW, GO terms | all taxa |
      | UniRule | manual | taxonomy, InterPro*, proteome property, sequence length | protein names, EC numbers, gene names, comments, features**, KW, GO terms | all taxa |

      Notes: * = flexibility to create custom signatures, which are submitted to InterPro as required; ** = predictors for signal, transmembrane and coiled-coil features, alignment for positional ones.
  34. 34. Components of a rule: conditions Conditions restrict the application of rules to those unreviewed UniProtKB entries that fulfil them. Types of conditions: • InterPro signatures (functional classification of proteins using predictive models) • taxonomy • sequence features, e.g. length • proteome features, e.g. outer membrane: yes (bacterial sequences)
  35. 35. Components of a rule: annotations If an unreviewed UniProtKB entry fulfils conditions of a rule, annotations in a rule are propagated to this entry. Types of annotations: • protein names, including enzyme classification (EC) numbers • functional annotation, e.g. catalytic activities • gene ontology terms • keywords • sequence features, e.g. active sites, transmembrane domains
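Slides 34 and 35 together describe a condition/annotation rule: if an unreviewed entry fulfils all conditions, the rule's annotations are propagated to it. A minimal sketch of that logic follows. The `Entry` and `Rule` classes, their field names, and every identifier such as `RULE_EXAMPLE` and `IPR_EXAMPLE` are hypothetical and chosen for illustration; this is not the UniRule or SAAS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    reviewed: bool
    signatures: set      # InterPro signatures matched by the sequence
    taxon_lineage: set   # taxa the source organism belongs to
    length: int          # sequence length in residues
    annotations: dict = field(default_factory=dict)


@dataclass
class Rule:
    rule_id: str
    required_signatures: set
    required_taxon: str
    min_length: int
    annotations: dict    # annotation type -> value to propagate

    def applies_to(self, entry):
        """All conditions must hold, and only unreviewed entries qualify."""
        return (
            not entry.reviewed
            and self.required_signatures <= entry.signatures
            and self.required_taxon in entry.taxon_lineage
            and entry.length >= self.min_length
        )


def apply_rule(rule, entry):
    """Propagate annotations, tagging each with the rule it came from."""
    if not rule.applies_to(entry):
        return False
    for key, value in rule.annotations.items():
        # Existing annotations are kept; the rule only fills gaps.
        entry.annotations.setdefault(key, {"value": value, "source": rule.rule_id})
    return True


rule = Rule(
    rule_id="RULE_EXAMPLE",
    required_signatures={"IPR_EXAMPLE"},
    required_taxon="Bacteria",
    min_length=100,
    annotations={"keyword": "Kinase", "go_term": "GO:0004672"},
)
unreviewed = Entry("ENTRY_1", False, {"IPR_EXAMPLE", "IPR_OTHER"}, {"Bacteria"}, 350)
reviewed = Entry("ENTRY_2", True, {"IPR_EXAMPLE"}, {"Bacteria"}, 350)

print(apply_rule(rule, unreviewed), unreviewed.annotations)
print(apply_rule(rule, reviewed), reviewed.annotations)
```

Recording the rule identifier alongside each propagated value is what lets evidence tags state where an annotation came from.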
  36. 36. How to access automatic annotation data?
  37. 37. How to access automatic annotation data?
  38. 38. Example of a UniRule
  39. 39. UR000172789 applied evidence tags clearly state where annotation comes from
  40. 40. Example of a UniRule highlight a rule’s logic
  41. 41. Example of a UniRule highlight a rule’s logic
  42. 42. Attributing evidence It needs to be made clear to the user when information is 1. experimentally based 2. predicted 3. transferred from a related species. Use of evidence codes gives this information: Evidence Code Ontology http://www.ebi.ac.uk/ols/ontologies/eco
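One way to make the three evidence categories on slide 42 visible to users is to store an evidence code with every annotation and translate it at display or filter time. The sketch below assumes a mapping from three Evidence Code Ontology identifiers to the slide's categories; the mapping and the sample annotations are illustrative assumptions and should be checked against the ontology before use.

```python
# Assumed mapping from ECO identifiers to the slide's three categories.
CATEGORY = {
    "ECO:0000269": "experimental",  # experimental evidence used in manual assertion
    "ECO:0000256": "predicted",     # match to sequence model, automatic assertion
    "ECO:0000250": "transferred",   # sequence similarity, manual assertion
}

# Hypothetical annotations on one entry, each carrying its evidence code.
ANNOTATIONS = [
    {"term": "GO:0004672", "evidence": "ECO:0000269"},
    {"term": "GO:0016740", "evidence": "ECO:0000256"},
    {"term": "GO:0003824", "evidence": "ECO:0000250"},
]


def label(annotation):
    """Render an annotation with a human-readable evidence category."""
    category = CATEGORY.get(annotation["evidence"], "unknown")
    return f"{annotation['term']} ({category})"


def terms_with(category):
    """Select the terms backed by one category of evidence."""
    return [a["term"] for a in ANNOTATIONS if CATEGORY.get(a["evidence"]) == category]


for annotation in ANNOTATIONS:
    print(label(annotation))
print(terms_with("experimental"))
```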
  43. 43. Thank you! www.ebi.ac.uk Twitter: @emblebi Facebook: EMBLEBI
