Event: Plant and Animal Genomes conference 2012
Speaker: Michel Schneider
The UniProt Knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated protein sequences enriched with functional information added by expert human curators, and UniProtKB/TrEMBL, which contains unreviewed records enhanced with information from automated rule-based annotation systems. The majority of UniProtKB records are based on automatic translation of coding sequences (CDS) provided by submitters at the time of initial deposition to the nucleotide sequence databases. To provide the complete proteome of Arabidopsis thaliana, a complementary curation pipeline for the import of protein sequences from TAIR has been developed. Because most of the sequences in the complete genome reannotation of the TAIR10 release are already in UniProtKB, the existing sequences have to be reconciled with the imported ones. Around 7% of them have a different gene model and have to be checked manually. Based on these comparisons, we improved over 200 of our predicted proteins. In exchange, we provide TAIR with the gene model corrections that we introduce on the basis of our trans-species family annotation. This approach allows the identification of data that can be seamlessly transferred from one site to the other, and the development of common annotations. With the significant increase in the number of complete genomes being sequenced (sequencing of 1001 Arabidopsis cultivars is currently under way!), organizing these data in a convenient way is critical. UniProt has selected a set of “reference proteomes”, including A. thaliana cv. Columbia, which provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity found within UniProtKB.
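
The reconciliation step described above could be sketched roughly as follows. This is only an illustration, not the actual UniProt or TAIR pipeline: the function name `reconcile`, the use of locus identifiers as keys, the exact-match comparison, and the toy sequences are all assumptions made for the example.

```python
def reconcile(existing, imported):
    """Classify imported protein sequences against existing ones.

    Both arguments are hypothetical dicts mapping a locus identifier
    to a protein sequence string. Exact string equality stands in for
    whatever comparison a real pipeline would use.
    """
    report = {"identical": [], "review": [], "new": []}
    for locus, sequence in sorted(imported.items()):
        if locus not in existing:
            # No existing record for this locus: candidate for import.
            report["new"].append(locus)
        elif existing[locus] == sequence:
            # Same sequence on both sides: nothing to reconcile.
            report["identical"].append(locus)
        else:
            # Sequences differ, suggesting a different gene model:
            # flag the locus for manual checking by a curator.
            report["review"].append(locus)
    return report


# Toy data: identifiers follow the AGI locus pattern, but the
# sequences are invented and do not correspond to real proteins.
existing = {"AT1G01010": "MEDQVGFGF", "AT1G01020": "MAASEHRCV"}
imported = {
    "AT1G01010": "MEDQVGFGF",
    "AT1G01020": "MAASEHRCVGCG",
    "AT1G01030": "MDLSLAPTT",
}

report = reconcile(existing, imported)
print(report)
```

In this sketch only the loci in the "review" list would reach a curator, which mirrors the small fraction of sequences with a differing gene model that the abstract says have to be checked manually.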