
The annotation of plant proteins in UniProtKB

Event: Plant and Animal Genomes conference 2012
Speaker: Michel Schneider

The UniProt Knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated protein sequences enriched with functional information added by expert human curators, and UniProtKB/TrEMBL, which contains unreviewed records enhanced with information provided by automated rule-based annotation systems. The majority of UniProtKB records are based on automatic translation of coding sequences (CDS) provided by submitters at the time of initial deposition to the nucleotide sequence databases. To provide the complete proteome of Arabidopsis thaliana, a complementary curation pipeline for the import of protein sequences from TAIR has been developed. As the complete genome reannotation in the TAIR10 release contains most of the sequences already in UniProtKB, these existing sequences have to be reconciled with those imported. Around 7% of them have a different gene model and have to be checked manually. Based on these comparisons, we improved over 200 of our predicted proteins. In exchange, we provide TAIR with the gene model corrections that we introduce on the basis of our trans-species family annotation. This approach allows the identification of data that can be seamlessly transferred from one site to the other and the development of common annotations. With the significant increase in the number of complete genomes sequenced (the sequencing of 1001 Arabidopsis cultivars is currently under way!), organizing these data in a convenient way is critical. UniProt has selected a set of “reference proteomes”, including A. thaliana cv. Columbia, which provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB.
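
The reconciliation of imported and existing sequences described above can be sketched in Python. This is a minimal illustration, not the pipeline UniProt actually runs: the function name `reconcile`, the data layout (a gene locus identifier mapped to a single protein sequence), and the toy loci and sequences are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def reconcile(
    existing: Dict[str, str], imported: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split imported loci into identical, discrepant, and new.

    Both arguments map a gene locus identifier to a protein sequence.
    (Hypothetical layout, one sequence per locus, chosen for illustration.)
    """
    identical, discrepant, new = [], [], []
    for locus, sequence in sorted(imported.items()):
        if locus not in existing:
            new.append(locus)          # no existing record: import as a new entry
        elif existing[locus] == sequence:
            identical.append(locus)    # 100 % identical: can be merged
        else:
            discrepant.append(locus)   # different gene model: check manually
    return identical, discrepant, new


# Toy data; the locus names and sequences are invented.
existing = {"AT1G00001": "MKT", "AT1G00002": "MAAL"}
imported = {"AT1G00001": "MKT", "AT1G00002": "MAVL", "AT1G00003": "MSS"}

identical, discrepant, new = reconcile(existing, imported)
print(identical, discrepant, new)
```

Only the loci in the `discrepant` list would need a curator's attention; in the abstract's terms, that is the roughly 7% of sequences with a different gene model.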




  1. The annotation of plant proteins in UniProtKB. Michel Schneider, Plant protein annotation program, Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva, Switzerland. Michel.Schneider@isb-sib.ch
  2. (1) The UniProt consortium and its products; (2) Content of an entry in UniProtKB and manual curation; (3) Complete proteomes and reference proteomes; (4) Synchronization between UniProtKB and TAIR; (5) Some statistics. “Pioneers at the Heart of Science” 1998 – 2008. PAG XX, San Diego, January 15, 2012
  3. The UniProt consortium
  4. The missions of the UniProt consortium: provide the scientific community with a resource of protein sequence and functional annotation which has to be comprehensive, high quality, and freely accessible
  5. Four components to fulfill specific demands. UniProtKB, the protein knowledgebase: UniProtKB/Swiss-Prot, reviewed, manual curation (533’657 entries), and UniProtKB/TrEMBL, unreviewed, automated annotation (19 million entries). UniRef, sequence clusters: UniRef100, UniRef90, UniRef50. UniMES: metagenomic and environmental sample sequences. UniParc, the sequence archive: contains current and obsolete sequences (29.6 million sequences)
  6. UniProtKB, the expertly curated component of UniProt: the high-quality curated protein knowledge database where data becomes structured knowledge
  7. UniProtKB, the expertly curated component of UniProt. Shigeo Fukuda
  8. Protein sequence (One gene - One species). © 2009 SIB
  9. Protein sequence (One gene - One species); protein and gene names; taxonomic information
  10. Protein sequence (One gene - One species); protein and gene names; taxonomic information; sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
  11. Protein sequence (One gene - One species); protein and gene names; taxonomic information; general annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype…; sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
  12. Protein sequence (One gene - One species); protein and gene names; taxonomic information; general annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype…; sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…; references
  13. Protein sequence (One gene - One species); protein and gene names; taxonomic information; general annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype…; sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…; references; keywords, Gene Ontology
  14. Protein sequence (One gene - One species); protein and gene names; taxonomic information; general annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype…; sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…; references; keywords, Gene Ontology; cross-references (~130 databases)
  15. Origin of the sequences in UniProtKB: International Nucleotide Sequence Database Collection (INSDC); Ensembl or EnsemblGenomes; RefSeq; direct submissions (protein sequences); literature; Protein Data Bank
  16. The process of manual sequence curation: (1) select entry/gene (priorities); (2) identify entries from the same gene and homologs using BLAST against UniProtKB; (3) merge entries from the same gene and same species into a single record; (4) select a canonical sequence
  17. Critical analysis and report of sequence discrepancies: QPCT_ARATH (Q84WV9), glutaminyl-peptide cyclotransferase (At4g25720)
  18. Critical analysis and report of sequence discrepancies: QPCT_ARATH (Q84WV9), glutaminyl-peptide cyclotransferase (At4g25720)
  20. Literature-based curation: identify relevant papers through searching literature databases; read full text of papers and extract and summarize relevant information
  21. Literature-based curation
  22. Literature-based curation
  23. Literature-based curation
  24. Controlled vocabularies: keywords provide a summary of the entry content; we annotate using the Gene Ontology (GO)
  25. UniProtKB, complete proteome sequence sets: genome completely sequenced; proteins mapped to the genome. 2’902 complete proteomes: fully manually reviewed (e.g. S. cerevisiae), partially manually reviewed (e.g. A. thaliana), unreviewed (e.g. Chlorella variabilis)
  26. UniProtKB, reference proteome sequence sets: a reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research. 509 reference proteomes
  27. UniProtKB, complete proteome sequence sets
  28. Arabidopsis thaliana. The building of the complete proteome sequence set: based on the re-annotation of the complete genome by TAIR: 27’416 protein-coding genes
  29. UniProtKB – TAIR synchronization: cDNAs, ESTs, genomic sequences; nucleic acid databases; UniProtKB/TrEMBL, unreviewed (40’574 entries); UniProtKB/Swiss-Prot, reviewed (10’340 entries). Release 2011_03, Mar 08, 2011
  30. UniProtKB – TAIR synchronization: cDNAs, ESTs, genomic sequences; nucleic acid databases; UniProtKB/TrEMBL, unreviewed (40’574 entries); UniProtKB/Swiss-Prot, reviewed (10’340 entries); genome re-annotation: 35’386 gene products; temporary TrEMBL set: 33’341 entries
  31. UniProtKB – TAIR synchronization: cDNAs, ESTs, genomic sequences; nucleic acid databases; UniProtKB/TrEMBL, unreviewed (40’574 entries); UniProtKB/Swiss-Prot, reviewed (10’340 entries); genome re-annotation: 35’386 gene products; temporary TrEMBL set: 33’341 entries; 11’508 sequences. Compare translations from the same gene, merge if 100% identical, report sequence discrepancies, align with orthologs and paralogs
  32. UniProtKB – TAIR synchronization: cDNAs, ESTs, genomic sequences; nucleic acid databases; UniProtKB/TrEMBL, unreviewed; UniProtKB/Swiss-Prot, reviewed; genome re-annotation; temporary TrEMBL set. Compare translations from the same gene, merge if 100% identical, report sequence discrepancies, align with orthologs and paralogs. Feedback to TAIR: 90 gene models. Correct gene models or add new isoforms: 283 corrections
  33. UniProtKB – TAIR synchronization: cDNAs, ESTs, genomic sequences; nucleic acid databases; UniProtKB/TrEMBL, unreviewed; UniProtKB/Swiss-Prot, reviewed; genome re-annotation; temporary TrEMBL set; cleaned set of new TrEMBL entries (21’656 entries)
  34. UniProtKB – TAIR synchronization: cDNAs, ESTs, genomic sequences; nucleic acid databases; genome re-annotation; temporary TrEMBL set; UniProtKB/TrEMBL, unreviewed (44’628 entries); UniProtKB/Swiss-Prot, reviewed (10’875 entries). Release 2011_12, Dec 14, 2011. Cleaned set of new TrEMBL entries (21’656 entries) + UniProtKB/Swiss-Prot, reviewed (10’865 entries). Arabidopsis thaliana, cv. Columbia, complete proteome: 32’521 entries
  35. 1001 Arabidopsis genomes: Deposited to INSDC? Fully annotated? With CDS? Should we still merge all the identical sequences together? If they are not merged but kept separate, how to get relevant BLAST results?
  36. Some UniProtKB/Swiss-Prot statistics concerning plant entries (UniProt release 2011_12, Dec 14, 2011): 31,959 entries of Viridiplantae, from 1,924 species; 10’875 entries from Arabidopsis thaliana (with 1,219 isoforms); 2,823 entries from Oryza sativa subsp. japonica; 11,897 plant entries with an EC number; 966 different complete EC numbers; 5,744 putative transporters or proteins involved in transport
  37. Summary. UniProtKB/Swiss-Prot, the manually curated knowledgebase: protein sequence database covering all kingdoms of life (533’657 sequence entries; 12’664 species); manually annotated; non-redundant: all products of one gene in one species in a single entry; highly cross-referenced (links to ~130 databases). Plant protein annotation: complete proteome for Arabidopsis thaliana; synchronization with TAIR
  38. We need your feedback and your collaboration! help@uniprot.org
  39. Acknowledgements. SIB: Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey. EBI: Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg. PIR: Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang. www.uniprot.org
  40. UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.
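
The entry counts on the final UniProtKB – TAIR synchronization slide (slide 34) can be checked with a few lines of Python. The variable names are assumptions made for illustration; the numbers are those shown on the slide for release 2011_12. Reading the complete proteome as the sum of the cleaned TrEMBL set and the 10’865 reviewed entries is an inference from the "+" on the slide, not something the slide states in words.

```python
# Counts as shown on slide 34 (UniProt release 2011_12).
cleaned_trembl_entries = 21_656   # cleaned set of new TrEMBL entries
reviewed_entries = 10_865         # UniProtKB/Swiss-Prot (reviewed) entries
complete_proteome = 32_521        # A. thaliana cv. Columbia complete proteome

# The two sets add up to the complete proteome figure.
total = cleaned_trembl_entries + reviewed_entries
print(total == complete_proteome)   # True

# Share of the complete proteome that is manually reviewed.
reviewed_share = reviewed_entries / complete_proteome
print(f"{reviewed_share:.1%}")      # 33.4%
```

On this reading, about one third of the Arabidopsis complete proteome was manually reviewed at the time of the talk, which is consistent with the slide that lists A. thaliana as a partially manually reviewed proteome.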
