SlideShare a Scribd company logo
Automatic annotation in UniProtKB using
UniRule, and Complete Proteomes
Wei Mun Chan
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete Proteomes in UniProtKB
30 January 2015
2
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete proteomes in UniProtKB
30 January 2015
3
30 January 2015
4
UniProt Consortium
• Formed in 2002
• Previously known as “Swiss-Prot” since 1986
• UniProt group at the EBI is led by Claire Odonovan and Maria
Jesus Martin, part of the PANDA proteins group led by Rolf
Apweiler
• UniProt group at PIR, Georgetown University is led by Cathy Wu
• UniProt group at SIB (Geneva/Lausanne) is led by Ioannis
Xenarios and Lydie Bougeleret (heirs to Amos Bairoch, left 2009)
• UniProtKB is UniProt KnowledgeBase, and includes TrEMBL and
Swiss-Prot entries
www.uniprot.org
30 January 2015
5
UniProt databases
30 January 2015
6
ENA/GenBank/DDBJ, Ensembl, VEGA, RefSeq, other sequence resources
UniSave
- Providing entry
version history
UniSave
- Providing entry
version history
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete Proteomes in UniProtKB
30 January 2015
7
30 January 2015
8
UniProtKB annotation
30 January 2015
9
UniProtKB annotation
30 January 2015
10
UniProtKB annotation
30 January 2015
11
UniProtKB annotation
30 January 2015
12
Propagation of annotation in UniProtKB
Annotation Propagated
RecName Yes
AltName Yes
Function Yes
Catalytic activity Yes
Pathway Yes
Subunit Yes
Subcellular location Yes
Disease No
Disruption phenotype No
Polymorphism No
Alternative products No
Generalannotation_______
Featureannotation_______
Annotation Propagated
KW Yes
GO Yes
Regions of interest Yes
Active site Yes
Ligand-binding Yes
Processing Yes
PTMs Yes
Ambiguities No
Conflicts No
Natural variants No
Isoforms No
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete Proteomes in UniProtKB
30 January 2015
13
Data increase in UniProtKB
30 January 2015
14
30 January 2015
15
Benefits of Automatic Annotation
• Added value for TrEMBL in the face of rapid data growth
• many species/proteins without published experimental data
• Support for manual curation
• making manual curation of TrEMBL entries for which there is
published data easier
• Correction of misleading annotation in data received from
sequencing centres
• Highlighting of patterns
• knowledge that can be/needs to be propagated across the
databases
• inconsistent annotation e.g. of a protein family
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete Proteomes in UniProtKB
30 January 2015
16
30 January 2015
17
Automatic Annotation in UniProtKB/TrEMBL
We have implemented Automatic Annotation systems based
on annotation rules
•Rules are linked to specific signatures - InterPro
•Annotation rules have
• annotations
• conditions
•Rules are tested and validated against UniProtKB/Swiss-
Prot
•Rules and annotations are updated each UniProt release
Automatic Annotation Systems in UniProtKB
System
Rule
creation
Trigger Annotations Scope
SAAS automatic
taxonomy
InterPro
comments, KW all taxa
UniRule
(Rulebase/HAMAP/
PIRNR/PIRSR)
manual
taxonomy
InterPro*
proteome
property
sequence
length
protein names,
comments,
features, KW,
GO terms
all taxa
30 January 2015
18
* Flexibility to create custom signatures and submitted to InterPro as required
Principle of an Annotation Rule Creation
30 January 2015
19
annotated Swiss-Prot entries rule TrEMBL entries
extract common
annotation
propagate
taxonomic nodes
Interpro entries and member signatures
proteome properties
sequence length
TrEMBL entries remain in TrEMBL, but offer more (predicted) annotation
30 January 2015
20
SAAS – Statistically Automatic Annotation System
• Automatically generated annotation rule system to
supplement the labour intensive UniRule system
• Employs a C4.5 decision-tree algorithm to find the most
concise rule
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete Proteomes in UniProtKB
30 January 2015
21
30 January 2015
22
UniRule Automatic Annotation System
• Manually created/curated rules of varying complexity:
annotation varies from simple Keyword attribution to
complete annotation
• Sources for rule creation
• automatically generated SAAS rules as input
• literature based curation of characterised families – as a
potential source for creating new signatures for a specific
functional group
• also …
UniRule - conditions used to created a rule
Conditions (can be positive or negative)
•Taxonomy
•InterPro entries and member signatures
•Subcellular location e.g. organelles
•Proteome properties e.g. photosynthetic
•Sequence length
30 January 2015
23
UniRule – UniProtKB annotations defined in a rule
Annotations
•Description lines
• Protein names
• EC numbers
•Gene names
•General annotation (comments)
•UniProtKB Keywords
•GO terms
30 January 2015
24
UniRule – output with evidence attribution
30 January 2015
25
UniRule – output with evidence attribution
30 January 2015
26
UniRule – predictions
30 January 201527
UniRule – prediction rules
30 January 201528
Talk outline
• Introduction to UniProt
• UniProtKB annotation and propagation
• Data increase and the need for Automatic Annotation
• Automatic annotation systems in UniProtKB
• UniRule Automatic Annotation System
• Complete Proteomes in UniProtKB
30 January 2015
29
How does UniProt define a Complete
proteome?
• A complete proteome consists of the set of proteins
thought to be expressed by an organism whose genome
has been completely sequenced.
30 January 2015
30
Status of complete proteomes in UniProt
• Longstanding project, 2902 proteomes that are spread over the
entire taxonomic range
• Archaea
• Bacteria
• Eukaryota
• Viruses
• Capture of “Complete proteome” data is a mixture of automatic and
manual procedures
• Aim is to provide a set of UniProtKB entries that define the proteome
30 January 2015
31
Human complete proteome
• First draft of the complete human proteome available in
UniProtKB/Swiss-Prot in September 2008
• The first mammalian proteome to be annotated
• Representing approximately 20,000 putative protein-coding
genes each represented by one canonical sequence
30 January 2015
32
Other complete proteomes
Human not the only organism to have its proteome annotated
•Sus scrofa (Pig) – 19,576 entries
•Gallus gallus (Chicken) – 21,622 entries
•Mus musculus (Mouse) – 46,656 entries
•Arabidopsis thaliana (Mouse-ear cress) - 32,521 entries
30 January 2015
33
Challenges of proteome data
• How to define a complete genome, what is complete? Does it have a
complete set of gene model annotations?
• Track any changes in the genome annotations and the impact on
UniProt
• Gather all proteomes available, develop import pipelines to improve
species coverage, current sources include:
• INSDC species
• Ensembl species
• UniProtKB also define a subset of the Complete proteomes as being
'Reference proteomes'.
• Complete proteome of a representative, well-studied model organism or
an organism of interest for biomedical research.
30 January 2015
34
30 January 2015
35
Obtaining
Proteomes
30 January 2015
36
Obtaining
Proteomes
Closing remarks
• Manual annotation cannot keep pace with current or future
rates of growth of UniProtKB so there is a need for automatic
annotation
• UniProtKB currently uses two automatic annotation systems
referred to as SAAS and UniRule
• Automatic annotation of TrEMBL is refreshed and validated
using UniProtKB/Swiss-Prot as a reference, each UniProtKB
release
30 January 2015
37
Closing remarks
• UniRule – manually annotated rules
• annotation varies from simple keywords to full annotation
• starting from SAAS rules, InterPro signatures, literature-based
curation of protein families
• possibility to create custom signatures for InterPro
• Evidence attribution - users to determine the composition
of the rule behind predicted annotation
30 January 2015
38
Closing remarks
• Requirements for completed proteomes
• Completely sequenced genome
• Good gene prediction models
• Good quality transcriptome/proteome data
• Proteins are mapped to genome
30/01/15
39
Acknowledgements
• UniProt group at the EBI is led by Claire Odonovan and Maria
Jesus Martin, part of the PANDA proteins group led by Rolf
Apweiler
• UniProt group at PIR, Georgetown University is led by Cathy
Wu
• UniProt group at SIB (Geneva/Lausanne) is led by Ioannis
Xenarios and Lydie Bougeleret (heirs to Amos Bairoch, left
2009)
• Thanks also to all curators, developers and support staff at
all three sites
30 January 2015
40
Funding
• National Institutes of Health (NIH)
• European Commission (EC)'s SLING
• Swiss Federal Government through the Federal Office of
Education and Science
• GEN2PHEN
• MICROME
• National Science Foundation (NSF)
30 January 2015
41

More Related Content

Viewers also liked

RSA Monthly Online Fraud Report -- October 2013
RSA Monthly Online Fraud Report -- October 2013RSA Monthly Online Fraud Report -- October 2013
RSA Monthly Online Fraud Report -- October 2013
EMC
 
Math
MathMath
Monday latin america
Monday latin americaMonday latin america
Monday latin america
Travis Klein
 
Software Defined Data Center: The Intersection of Networking and Storage
Software Defined Data Center: The Intersection of Networking and StorageSoftware Defined Data Center: The Intersection of Networking and Storage
Software Defined Data Center: The Intersection of Networking and Storage
EMC
 
Chromatography lect 2
Chromatography lect 2Chromatography lect 2
Chromatography lect 2
FLI
 
Elasticity
ElasticityElasticity
Elasticity
Travis Klein
 
Stomp presentation v1.5.1
Stomp presentation v1.5.1Stomp presentation v1.5.1
Stomp presentation v1.5.1
Patrick Cannon
 
FIRM: Capability-based Inline Mediation of Flash Behaviors
FIRM: Capability-based Inline Mediation of Flash BehaviorsFIRM: Capability-based Inline Mediation of Flash Behaviors
FIRM: Capability-based Inline Mediation of Flash Behaviors
EMC
 
Battle of the Scales: Examining Respondent Scale Usage across 10 countries
Battle of the Scales: Examining Respondent Scale Usage across 10 countriesBattle of the Scales: Examining Respondent Scale Usage across 10 countries
Battle of the Scales: Examining Respondent Scale Usage across 10 countries
Research Now
 
Price discriminating monopolist
Price discriminating monopolistPrice discriminating monopolist
Price discriminating monopolist
Travis Klein
 
Block renaissanceart
Block renaissanceartBlock renaissanceart
Block renaissanceart
Travis Klein
 
Ppt toy3
Ppt toy3Ppt toy3
Wrk
WrkWrk
Magne website
Magne websiteMagne website
Magne website
Mamta Binani
 

Viewers also liked (14)

RSA Monthly Online Fraud Report -- October 2013
RSA Monthly Online Fraud Report -- October 2013RSA Monthly Online Fraud Report -- October 2013
RSA Monthly Online Fraud Report -- October 2013
 
Math
MathMath
Math
 
Monday latin america
Monday latin americaMonday latin america
Monday latin america
 
Software Defined Data Center: The Intersection of Networking and Storage
Software Defined Data Center: The Intersection of Networking and StorageSoftware Defined Data Center: The Intersection of Networking and Storage
Software Defined Data Center: The Intersection of Networking and Storage
 
Chromatography lect 2
Chromatography lect 2Chromatography lect 2
Chromatography lect 2
 
Elasticity
ElasticityElasticity
Elasticity
 
Stomp presentation v1.5.1
Stomp presentation v1.5.1Stomp presentation v1.5.1
Stomp presentation v1.5.1
 
FIRM: Capability-based Inline Mediation of Flash Behaviors
FIRM: Capability-based Inline Mediation of Flash BehaviorsFIRM: Capability-based Inline Mediation of Flash Behaviors
FIRM: Capability-based Inline Mediation of Flash Behaviors
 
Battle of the Scales: Examining Respondent Scale Usage across 10 countries
Battle of the Scales: Examining Respondent Scale Usage across 10 countriesBattle of the Scales: Examining Respondent Scale Usage across 10 countries
Battle of the Scales: Examining Respondent Scale Usage across 10 countries
 
Price discriminating monopolist
Price discriminating monopolistPrice discriminating monopolist
Price discriminating monopolist
 
Block renaissanceart
Block renaissanceartBlock renaissanceart
Block renaissanceart
 
Ppt toy3
Ppt toy3Ppt toy3
Ppt toy3
 
Wrk
WrkWrk
Wrk
 
Magne website
Magne websiteMagne website
Magne website
 

Similar to Automatic Annotation in UniProtKB

uniprotpresentation-150510081717-lva1-app6891.pdf
uniprotpresentation-150510081717-lva1-app6891.pdfuniprotpresentation-150510081717-lva1-app6891.pdf
uniprotpresentation-150510081717-lva1-app6891.pdf
MuhammadShahzaib456470
 
Uni prot presentation
Uni prot presentationUni prot presentation
Uni prot presentation
Rida Khalid
 
TheUniProtKBpptx__2022_03_30_13_07_41.pptx
TheUniProtKBpptx__2022_03_30_13_07_41.pptxTheUniProtKBpptx__2022_03_30_13_07_41.pptx
TheUniProtKBpptx__2022_03_30_13_07_41.pptx
PRIYANKAZALA9
 
Designing a community resource - Sandra Orchard
Designing a community resource - Sandra OrchardDesigning a community resource - Sandra Orchard
Designing a community resource - Sandra Orchard
EMBL-ABR
 
Protein database
Protein  databaseProtein  database
Protein database
KAUSHAL SAHU
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
Enabling Semantically Aware Software Applications
Enabling Semantically Aware Software Applications Enabling Semantically Aware Software Applications
Enabling Semantically Aware Software Applications
Trish Whetzel
 
Using EMBL-EBI resources to explore stem cell data
Using EMBL-EBI resources to explore stem cell dataUsing EMBL-EBI resources to explore stem cell data
Using EMBL-EBI resources to explore stem cell data
Rafael C. Jimenez
 
Proteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASyProteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASy
Christ College, Rajkot
 
UniprotKB
UniprotKBUniprotKB
UniprotKB
Syed Lokman
 
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
sagrika chugh
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
SATHIYA NARAYANAN
 
Bioinformatics
BioinformaticsBioinformatics
Swiss prot
Swiss protSwiss prot
Swiss prot
Shikha Thakur
 
Protein Databases
Protein DatabasesProtein Databases
Ontology-based Tools to Enhance the Curation Workflow
Ontology-based Tools to Enhance the Curation WorkflowOntology-based Tools to Enhance the Curation Workflow
Ontology-based Tools to Enhance the Curation Workflow
Trish Whetzel
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra tool
BITS
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
Juan Antonio Vizcaino
 
Automation and continuous flow analyzer
Automation and continuous flow analyzerAutomation and continuous flow analyzer
Automation and continuous flow analyzer
SurendraMarasini1
 
Mass Spectrometry Informatics formats in progress
Mass Spectrometry Informatics formats in progressMass Spectrometry Informatics formats in progress
Mass Spectrometry Informatics formats in progress
Juan Antonio Vizcaino
 

Similar to Automatic Annotation in UniProtKB (20)

uniprotpresentation-150510081717-lva1-app6891.pdf
uniprotpresentation-150510081717-lva1-app6891.pdfuniprotpresentation-150510081717-lva1-app6891.pdf
uniprotpresentation-150510081717-lva1-app6891.pdf
 
Uni prot presentation
Uni prot presentationUni prot presentation
Uni prot presentation
 
TheUniProtKBpptx__2022_03_30_13_07_41.pptx
TheUniProtKBpptx__2022_03_30_13_07_41.pptxTheUniProtKBpptx__2022_03_30_13_07_41.pptx
TheUniProtKBpptx__2022_03_30_13_07_41.pptx
 
Designing a community resource - Sandra Orchard
Designing a community resource - Sandra OrchardDesigning a community resource - Sandra Orchard
Designing a community resource - Sandra Orchard
 
Protein database
Protein  databaseProtein  database
Protein database
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
Enabling Semantically Aware Software Applications
Enabling Semantically Aware Software Applications Enabling Semantically Aware Software Applications
Enabling Semantically Aware Software Applications
 
Using EMBL-EBI resources to explore stem cell data
Using EMBL-EBI resources to explore stem cell dataUsing EMBL-EBI resources to explore stem cell data
Using EMBL-EBI resources to explore stem cell data
 
Proteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASyProteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASy
 
UniprotKB
UniprotKBUniprotKB
UniprotKB
 
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Swiss prot
Swiss protSwiss prot
Swiss prot
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Ontology-based Tools to Enhance the Curation Workflow
Ontology-based Tools to Enhance the Curation WorkflowOntology-based Tools to Enhance the Curation Workflow
Ontology-based Tools to Enhance the Curation Workflow
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra tool
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
Automation and continuous flow analyzer
Automation and continuous flow analyzerAutomation and continuous flow analyzer
Automation and continuous flow analyzer
 
Mass Spectrometry Informatics formats in progress
Mass Spectrometry Informatics formats in progressMass Spectrometry Informatics formats in progress
Mass Spectrometry Informatics formats in progress
 

More from EBI

The annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKBThe annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKB
EBI
 
UniProt-GOA
UniProt-GOAUniProt-GOA
UniProt-GOA
EBI
 
InterPro and InterProScan 5.0
InterPro and InterProScan 5.0InterPro and InterProScan 5.0
InterPro and InterProScan 5.0
EBI
 
The European Nucleotide Archive
The European Nucleotide ArchiveThe European Nucleotide Archive
The European Nucleotide Archive
EBI
 
Genome resources at EMBL-EBI: Ensembl and Ensembl Genomes
Genome resources at EMBL-EBI: Ensembl and Ensembl GenomesGenome resources at EMBL-EBI: Ensembl and Ensembl Genomes
Genome resources at EMBL-EBI: Ensembl and Ensembl Genomes
EBI
 
The Vertebrate Genome Annotation Database
The Vertebrate Genome Annotation DatabaseThe Vertebrate Genome Annotation Database
The Vertebrate Genome Annotation Database
EBI
 
Train online
Train onlineTrain online
Train online
EBI
 

More from EBI (7)

The annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKBThe annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKB
 
UniProt-GOA
UniProt-GOAUniProt-GOA
UniProt-GOA
 
InterPro and InterProScan 5.0
InterPro and InterProScan 5.0InterPro and InterProScan 5.0
InterPro and InterProScan 5.0
 
The European Nucleotide Archive
The European Nucleotide ArchiveThe European Nucleotide Archive
The European Nucleotide Archive
 
Genome resources at EMBL-EBI: Ensembl and Ensembl Genomes
Genome resources at EMBL-EBI: Ensembl and Ensembl GenomesGenome resources at EMBL-EBI: Ensembl and Ensembl Genomes
Genome resources at EMBL-EBI: Ensembl and Ensembl Genomes
 
The Vertebrate Genome Annotation Database
The Vertebrate Genome Annotation DatabaseThe Vertebrate Genome Annotation Database
The Vertebrate Genome Annotation Database
 
Train online
Train onlineTrain online
Train online
 

Recently uploaded

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 

Recently uploaded (20)

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 

Automatic Annotation in UniProtKB

  • 1. Automatic annotation in UniProtKB using UniRule, and Complete Proteomes Wei Mun Chan
  • 2. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete Proteomes in UniProtKB 30 January 2015 2
  • 3. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete proteomes in UniProtKB 30 January 2015 3
  • 4. 30 January 2015 4 UniProt Consortium • Formed in 2002 • Previously known as “Swiss-Prot” since 1986 • UniProt group at the EBI is led by Claire Odonovan and Maria Jesus Martin, part of the PANDA proteins group led by Rolf Apweiler • UniProt group at PIR, Georgetown University is led by Cathy Wu • UniProt group at SIB (Geneva/Lausanne) is led by Ioannis Xenarios and Lydie Bougeleret (heirs to Amos Bairoch, left 2009) • UniProtKB is UniProt KnowledgeBase, and includes TrEMBL and Swiss-Prot entries
  • 6. UniProt databases 30 January 2015 6 ENA/GenBank/DDBJ, Ensembl, VEGA, RefSeq, other sequence resources UniSave - Providing entry version history UniSave - Providing entry version history
  • 7. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete Proteomes in UniProtKB 30 January 2015 7
  • 12. 30 January 2015 12 Propagation of annotation in UniProtKB Annotation Propagated RecName Yes AltName Yes Function Yes Catalytic activity Yes Pathway Yes Subunit Yes Subcellular location Yes Disease No Disruption phenotype No Polymorphism No Alternative products No Generalannotation_______ Featureannotation_______ Annotation Propagated KW Yes GO Yes Regions of interest Yes Active site Yes Ligand-binding Yes Processing Yes PTMs Yes Ambiguities No Conflicts No Natural variants No Isoforms No
  • 13. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete Proteomes in UniProtKB 30 January 2015 13
  • 14. Data increase in UniProtKB 30 January 2015 14
  • 15. 30 January 2015 15 Benefits of Automatic Annotation • Added value for TrEMBL in the face of rapid data growth • many species/proteins without published experimental data • Support for manual curation • making manual curation of TrEMBL entries for which there is published data easier • Correction of misleading annotation in data received from sequencing centres • Highlighting of patterns • knowledge that can be/needs to be propagated across the databases • inconsistent annotation e.g. of a protein family
  • 16. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete Proteomes in UniProtKB 30 January 2015 16
  • 17. 30 January 2015 17 Automatic Annotation in UniProtKB/TrEMBL We have implemented Automatic Annotation systems based on annotation rules •Rules are linked to specific signatures - InterPro •Annotation rules have • annotations • conditions •Rules are tested and validated against UniProtKB/Swiss- Prot •Rules and annotations are updated each UniProt release
  • 18. Automatic Annotation Systems in UniProtKB System Rule creation Trigger Annotations Scope SAAS automatic taxonomy InterPro comments, KW all taxa UniRule (Rulebase/HAMAP/ PIRNR/PIRSR) manual taxonomy InterPro* proteome property sequence length protein names, comments, features, KW, GO terms all taxa 30 January 2015 18 * Flexibility to create custom signatures and submitted to InterPro as required
  • 19. Principle of an Annotation Rule Creation 30 January 2015 19 annotated Swiss-Prot entries rule TrEMBL entries extract common annotation propagate taxonomic nodes Interpro entries and member signatures proteome properties sequence length TrEMBL entries remain in TrEMBL, but offer more (predicted) annotation
  • 20. 30 January 2015 20 SAAS – Statistically Automatic Annotation System • Automatically generated annotation rule system to supplement the labour intensive UniRule system • Employs a C4.5 decision-tree algorithm to find the most concise rule
  • 21. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete Proteomes in UniProtKB 30 January 2015 21
  • 22. 30 January 2015 22 UniRule Automatic Annotation System • Manually created/curated rules of varying complexity: annotation varies from simple Keyword attribution to complete annotation • Sources for rule creation • automatically generated SAAS rules as input • literature based curation of characterised families – as a potential source for creating new signatures for a specific functional group • also …
  • 23. UniRule - conditions used to created a rule Conditions (can be positive or negative) •Taxonomy •InterPro entries and member signatures •Subcellular location e.g. organelles •Proteome properties e.g. photosynthetic •Sequence length 30 January 2015 23
  • 24. UniRule – UniProtKB annotations defined in a rule Annotations •Description lines • Protein names • EC numbers •Gene names •General annotation (comments) •UniProtKB Keywords •GO terms 30 January 2015 24
  • 25. UniRule – output with evidence attribution 30 January 2015 25
  • 26. UniRule – output with evidence attribution 30 January 2015 26
  • 27. UniRule – predictions 30 January 201527
  • 28. UniRule – prediction rules 30 January 201528
  • 29. Talk outline • Introduction to UniProt • UniProtKB annotation and propagation • Data increase and the need for Automatic Annotation • Automatic annotation systems in UniProtKB • UniRule Automatic Annotation System • Complete Proteomes in UniProtKB 30 January 2015 29
  • 30. How does UniProt define a Complete proteome? • A complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. 30 January 2015 30
  • 31. Status of complete proteomes in UniProt • Longstanding project, 2902 proteomes that are spread over the entire taxonomic range • Archaea • Bacteria • Eukaryota • Viruses • Capture of “Complete proteome” data is a mixture of automatic and manual procedures • Aim is to provide a set of UniProtKB entries that define the proteome 30 January 2015 31
  • 32. Human complete proteome • First draft of the complete human proteome available in UniProtKB/Swiss-Prot in September 2008 • The first mammalian proteome to be annotated • Representing approximately 20,000 putative protein-coding genes each represented by one canonical sequence 30 January 2015 32
  • 33. Other complete proteomes Human not the only organism to have its proteome annotated •Sus scrofa (Pig) – 19,576 entries •Gallus gallus (Chicken) – 21,622 entries •Mus musculus (Mouse) – 46,656 entries •Arabidopsis thaliana (Mouse-ear cress) - 32,521 entries 30 January 2015 33
  • 34. Challenges of proteome data • How to define a complete genome, what is complete? Does it have a complete set of gene model annotations? • Track any changes in the genome annotations and the impact on UniProt • Gather all proteomes available, develop import pipelines to improve species coverage, current sources include: • INSDC species • Ensembl species • UniProtKB also define a subset of the Complete proteomes as being 'Reference proteomes'. • Complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research. 30 January 2015 34
  • 37. Closing remarks • Manual annotation cannot keep pace with current or future rates of growth of UniProtKB so there is a need for automatic annotation • UniProtKB currently uses two automatic annotation systems referred to as SAAS and UniRule • Automatic annotation of TrEMBL is refreshed and validated using UniProtKB/Swiss-Prot as a reference, each UniProtKB release 30 January 2015 37
  • 38. Closing remarks • UniRule – manually annotated rules • annotation varies from simple keywords to full annotation • starting from SAAS rules, InterPro signatures, literature-based curation of protein families • possibility to create custom signatures for InterPro • Evidence attribution - users to determine the composition of the rule behind predicted annotation 30 January 2015 38
  • 39. Closing remarks • Requirements for completed proteomes • Completely sequenced genome • Good gene prediction models • Good quality transcriptome/proteome data • Proteins are mapped to genome 30/01/15 39
  • 40. Acknowledgements • UniProt group at the EBI is led by Claire Odonovan and Maria Jesus Martin, part of the PANDA proteins group led by Rolf Apweiler • UniProt group at PIR, Georgetown University is led by Cathy Wu • UniProt group at SIB (Geneva/Lausanne) is led by Ioannis Xenarios and Lydie Bougeleret (heirs to Amos Bairoch, left 2009) • Thanks also to all curators, developers and support staff at all three sites 30 January 2015 40
  • 41. Funding • National Institutes of Health (NIH) • European Commission (EC)'s SLING • Swiss Federal Government through the Federal Office of Education and Science • GEN2PHEN • MICROME • National Science Foundation (NSF) 30 January 2015 41

Editor's Notes

  1. Good morning. My name is Wei Mun Chan and I am a UniProt curator working at the European Bioniformatics Institute, in Hinxton, UK. In the following session I will talk about Automatic Annotation as it is used in UniProtKB, focusing on UniRAule, and I will subsequently talk about Complete Proteomes in UniProtKB.
  2. Firstly, I will talk about the background to UniProt. I’ll then go on to describe UniProtKB annotation and propagation of this annotation. Moving on, I will talk about the growth of data in UniProtKB and the need for Automatic Annotation to cope with this data. I’ll then describe the Automatic Annotation Systems we use in UniProtKB. And to end, I will describe UniProtKB Complete Proteomes.
  3. The UniProt Consortium was formed in 2002 – together to provide the Universal Protein Resource UniProt.
  4. This is a screenshot of the UniProt Resource, which is available at the URL www.uniprot.org. This is the entry point to the various resources offered by UniProt.
  5. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. UniProt provides several databases including: UniSave – which contains the entry version history for UniProtKB entries. UniRef – sequence cluster database. UniMES – stores metagenomic and environmental sequences. UniParc – Sequence archive The centre piece of UniProt’s activities is the UniProt KnowledgeBase UniProtKB, which contains the Manually reviewed UniProtKB/Swiss-Prot entries and Unreviewed TrEMBL entries. Unreviewed UniProtKB/TrEMBL entries are manually reviewed by experts and upon integration/become reviewed UniProtKB/Swiss-Prot entries. Automatic Annotation procedures then feed back into the UniProtKB/TrEMBL database.
  6. Now to give you a brief overview of the information annotated in a UniProtKB entry. Take the human protein Ky-nu-renin-ase as an example. This is a screenshot taken of the top section of the protein entry as it is represented in the UniProtKB database. As you can see the protein has been manually annotated, Reviewed entry and therefore resides in the UniProtKB/Swiss-Prot database. Within this entry, we provide a Recommended name, Alternative name, Gene Symbol, Organism and Taxonomy.
  7. In addition, within the entry we also provide further information that can be attributed to the protein in General annotation comments section. Here we collate valuable protein characterisation information.
  8. Annotation in UniProtKB also includes the annotation of Ontologies. And this includes the annotation of UniProtKB-specific Keywords and also the complimentary Gene Ontology terms.
  9. We also provide sequence feature annotation – where we describe for example how the molecule is processed, regions, sites within the sequence. Also amino acid modifications and natural sequence variations. This slide captures the graphical representation of sequence features in UniProtKB.
  10. As part of our manual annotation efforts we also propagate annotation from proteins where the annotation has been demonstrated experimentally. We restrict the propagation between closely related species, or if appropriate also to distantly related species where the proteins are well-conserved. This table lists some of the categories of annotations that we do or do not propagate between UniProtKB entries.
  11. This graph illustrates the disparity between the growth in the number of: Manually reviewed UniProtKB/Swiss-Prot entries represented by the red line. Unreviewed automatically generated UniProtKB/TrEMBL entries represented by the green line. Thus, our manual curation efforts is being outpaced by the growth in the unreviewed data. There is therefore a need to enrich these unreviewed UniProt/TrEMBL entries we annotate. How. Through the use of Automatic Annotation.
  12. What are the benefits of Automatic Annotation? Firstly, to enrich unreviewed entries in TrEMBL in the face of rapid data growth. There are many species/proteins without published experimental data. Secondly, to support our manual curation efforts. Making manual curation of unreviewed TrEMBL entries for which there is published data easier. Thirdly, to correct misleading annotation in data received from sequencing centres. Fourthly, to help highlight patterns. Recognise knowledge that CAN or NEEDs to be propagated across the databases. Also to help spot, inconsistent annotation e.g. of a protein family.
  13. How is Automatic Annotation in UniProtKB/TrEMBL accomplished? We have implemented Automatic Annotation systems based on Annotation rules. Rules are linked to specific signatures – InterPro. Rules are associated with – annotations and conditions. Rules are tested and validated against UniProtKB/Swiss-Prot. Rules and annotations are updated each UniProt release.
  14. UniProtKB currently employs two automatic annotation systems – referred to as SAAS and UniRule. Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot Transfer of common annotation to related family members in UniProtKB/TrEMBL The manually curated UniRules annotation system – incorporates the HAMAP, RuleBase, PIRNR/SR systems.
  15. This slide shows a simplified overview of how Annotation rules work in UniProtKB. Rules are built by a combination of specific conditions (grey box) and associated with common annotation from UniProtKB/Swiss-Prot entries. The rules identify/extract common annotation present in UniProtKB entries (represented by the highlighted red grids) and then propagate these annotations to unreviewed TrEMBL entries. Thus, the annotationally enriched TrEMBL entries remain in TrEMBL.
  16. The first of the Automatic Annotation Systems used in UniProtKB is: Statistically Automatic Annotation System abbreviated SAAS. Supplements the labour intensive manual curation of rules in the UniRule system. C4.5 algorithm uses entropy gain to find most concise rule.
  17. Manually created/curated rules of varying complexity. Simple UniProtKB Keyword to complete annotation. Sources for rule creation – roughly categorised into four categories: 1) Automatically generated SAAS rules as input. 2) Literature-based curation of characterised families – as a potential source for creating new signatures for specific functional group And also:
  18. 3. Conditions.
  19. And 4. Annotations.
  20. This is a screenshot to show how links to UniRule rules is represented in a UniProtKB entry. We can see that the rule of interest is UniRule/RuleBase RU003777. Information that has been propagated to this UniProtKB/TrEMBL entry by this rule is followed by the RuleBase evidence tag.
  21. The same evidence tag also appears associated with General Annotation and Ontologies. Clicking one of these tags/links allows us to view the predicted annotations and the rule itself.
  22. On a left we see the predicted annotations. And on the right we see the conditions used to build the rule (conditions can be positive or negative).
  23. On a left we see the predicted annotations. And on the right we see the conditions used to build the rule (conditions can be positive or negative).
  24. So what are some of the challenges UniProt faces when dealing with Complete Proteome data? How do we define a complete genome, and decide what is complete? For example - does it have a complete set of gene model annotations? There is a need to track changes in genome annotations and how these changes impact on UniProt. Gather all proteomes available,