Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
The UniProt knowledgebase
www.uniprot.org
a hub of integrated protein data
http://education.expasy.org/cours/Prague2011/
Science cover, February 2011
protein sequence functional information
data knowledge
UniProt consortium
EBI : European Bioinformatics Institute (UK)
SIB : Swiss Institute of Bioinformatics (CH)
PIR : Protein information resource (US)
www.uniprot.org
UniProt databases
UniProtKB: protein sequence knowledgebase with 2 sections,
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast,
download) (~15 million entries)
UniParc: protein sequence archive (the ENA equivalent at the protein
level). Each entry contains a protein sequence with cross-links
to the other databases where the sequence is found
(active or not). Not annotated (query, Blast, download) (~25 million entries)
UniRef: 3 sets of clusters of protein sequences at 100, 90 and
50% identity; useful to speed up sequence similarity
searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million
entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic
projects (mostly Global Ocean Sampling (GOS)) (download)
(8 million entries, included in UniParc)
UniProt databases
The central piece
UniProtKB
an encyclopedia on proteins
composed of 2 sections
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
unreviewed and reviewed
automatically annotated and manually annotated
released every 4 weeks
UniProtKB
Origin of protein sequences
UniProtKB protein sequences are mainly derived from
- INSDC (translated submitted coding sequences - CDS)
- Ensembl (gene prediction) and RefSeq sequences
- Sequences of PDB structures
- Direct submission or sequences scanned from literature
Notes: - UniProt does not perform any gene prediction
- Most non-germline immunoglobulins and T-cell receptors, most patent sequences,
highly over-represented data (e.g. viral antigens) and pseudogene sequences are
excluded from UniProtKB, but stored in UniParc
- Data from the PIR database have been integrated into UniProtKB since 2003.
15 %
85 %
EMBL -> TrEMBL -> Swiss-Prot
EMBL to TrEMBL: automated extraction of protein sequence
(translated CDS), gene name and references; automated annotation
TrEMBL to Swiss-Prot: manual annotation of the sequence and
associated biological information
UniProtKB/TrEMBL
unreviewed
Automatic annotation
released every 4 weeks
One protein sequence
One species
Automated annotation
Keywords
and
Gene Ontology
Automated annotation
Function, Subcellular location,
Catalytic activity,
Sequence similarities…
Automated annotation
transmembrane domains,
signal peptide…
Cross-references
to over 125 databases
References
Protein and gene names
Taxonomic information
UniProtKB/TrEMBL
www.uniprot.org
UniProtKB/TrEMBL
Automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information
provided by the submitter of the original nucleotide entry (CDS) or by the
gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged
automatically.
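The automatic merge described above can be pictured with a minimal sketch. The record layout (source identifier, organism, sequence) and the identifiers are made up for illustration; this is not the actual UniProt pipeline.

```python
from collections import defaultdict


def merge_identical(records):
    """Group records that share the same organism and exactly the same sequence.

    `records` is an iterable of (source_id, organism, sequence) tuples;
    this layout is hypothetical and chosen only for illustration.
    """
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        # Identical strings necessarily have the same length, so the
        # (organism, sequence) pair is a sufficient merge key.
        merged[(organism, sequence)].append(source_id)
    return dict(merged)


records = [
    ("CDS_A", "Escherichia coli", "MSKEKFERTK"),
    ("CDS_B", "Escherichia coli", "MSKEKFERTK"),  # same organism, same sequence
    ("CDS_C", "Homo sapiens", "MSKEKFERTK"),      # same sequence, other organism
    ("CDS_D", "Escherichia coli", "MSKEKFERT"),   # one residue shorter
]
entries = merge_identical(records)
for (organism, sequence), sources in entries.items():
    print(organism, sequence, sources)
```

Only CDS_A and CDS_B end up in the same entry; a different organism or a single missing residue keeps the records apart.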
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (automatically generated annotation rules (e.g.
SAAS) and/or manually generated annotation rules (e.g. UniRule))
Example of fully automatic annotation: SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on C4.5 decision tree algorithm.
• One annotation, one rule.
• High stringency – requires 99% or greater estimated precision to
generate annotation (tested on UniProtKB/Swiss-Prot)
• Rules are produced, updated and validated at each release.
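The 99% stringency criterion can be sketched as a simple filter. This is not the C4.5 rule generation itself, only the acceptance step, assuming each candidate rule has been tested against reviewed entries and that we know how many of its predictions were correct. Rule names and counts are hypothetical.

```python
def estimated_precision(true_positives, false_positives):
    """Fraction of a rule's predictions that agree with the reviewed data."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def keep_rules(candidates, threshold=0.99):
    """Return the names of the rules that reach the precision threshold.

    `candidates` holds (rule_name, true_positives, false_positives) tuples.
    """
    return [
        name
        for name, tp, fp in candidates
        if estimated_precision(tp, fp) >= threshold
    ]


candidates = [
    ("rule_1", 995, 5),   # precision 0.995, kept
    ("rule_2", 980, 20),  # precision 0.98, rejected
    ("rule_3", 999, 1),   # precision 0.999, kept
]
accepted = keep_rules(candidates)
print(accepted)
```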
UniProtKB/TrEMBL
UniProtKB/Swiss-Prot
reviewed
manually annotated
released every 4 weeks
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
One protein sequence
One gene
One species
Manual annotation
Keywords
and
Gene Ontology
Manual annotation
Function, Subcellular location,
Catalytic activity, Disease,
Tissue specificity, Pathway…
Manual annotation
Post-translational modifications,
variants, transmembrane domains,
signal peptide…
Cross-references
to over 125 databases
References
Protein and gene names
Taxonomic information
Alternative products:
protein sequences produced by
alternative splicing,
alternative promoter usage,
alternative initiation…
UniProtKB/Swiss-Prot
www.uniprot.org
UniProtKB/Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate
sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract
literature information, ortholog data propagation, …)
UniProtKB/Swiss-Prot
1- Protein sequence curation
The displayed protein sequence:
…canonical, representative, consensus…
+
alternative sequences (described within the entry)
1 entry <-> 1 gene (1 species)
UniProtKB/Swiss-Prot
a gene-centric view of the protein space
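One way to picture the "1 entry, 1 gene, 1 species" model is a small data structure holding one displayed (canonical) sequence plus the alternative sequences described within the same entry. The class, field names, accession and sequences below are illustrative assumptions, not a UniProt data format.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Gene-centric entry: one gene in one species (hypothetical layout)."""
    accession: str
    gene: str
    species: str
    canonical: str                                    # the displayed sequence
    alternatives: dict = field(default_factory=dict)  # isoform id -> sequence

    def all_sequences(self):
        """Canonical sequence first, then every alternative sequence."""
        return [self.canonical, *self.alternatives.values()]


# Made-up accession, gene name and sequences.
entry = Entry("X00001", "geneA", "Homo sapiens", "MSKEKFERTK")
entry.alternatives["X00001-2"] = "MSKEKTK"
print(len(entry.all_sequences()))
```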
What is the current status?
• At least 20% of Swiss-Prot entries required a minimal
amount of curation effort so as to obtain the “correct”
sequence.
• Typical problems
– unsolved conflicts
– uncorrected initiation sites
– frameshifts
– wrong gene prediction
– other ‘problems’
UCSC genome browser
examples of CDS annotation submitted to INSDC…
UniProtKB/Swiss-Prot
2- Biological data curation
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/Pubmed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the
source of each annotation
Extract literature information
and protein sequence analysis
maximum usage of controlled vocabulary
Protein and gene names
…enable researchers to
obtain a summary of
what is known about a
protein…
General annotation
(Comments)
www.uniprot.org
Human protein manual annotation:
some statistics (June 2011)
Sequence annotation
(Features)
…enable researchers to
obtain a summary of
what is known about a
protein…
www.uniprot.org
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted
data and makes a clear distinction between the two

Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential
Find all the proteins localized in
the cytoplasm (experimentally
proven) which are phosphorylated
on a serine (experimentally proven)
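The logic behind such a query can be sketched on in-memory records. The sketch assumes that "experimentally proven" means the annotation carries none of the non-experimental qualifiers (Probable, By similarity, Potential); the record layout and protein identifiers are hypothetical.

```python
def proven(annotations, value):
    """True if `value` is annotated without any non-experimental qualifier."""
    return any(
        a["value"] == value and a.get("qualifier") is None
        for a in annotations
    )


def cytoplasmic_phosphoserine(proteins):
    """Identifiers of proteins proven cytoplasmic AND proven phosphorylated on Ser."""
    return [
        p["id"]
        for p in proteins
        if proven(p["locations"], "Cytoplasm")
        and proven(p["modifications"], "Phosphoserine")
    ]


proteins = [
    {"id": "PROT_A",
     "locations": [{"value": "Cytoplasm"}],
     "modifications": [{"value": "Phosphoserine"}]},
    {"id": "PROT_B",  # location only inferred by similarity
     "locations": [{"value": "Cytoplasm", "qualifier": "By similarity"}],
     "modifications": [{"value": "Phosphoserine"}]},
    {"id": "PROT_C",  # modification only predicted
     "locations": [{"value": "Cytoplasm"}],
     "modifications": [{"value": "Phosphoserine", "qualifier": "Potential"}]},
]
hits = cytoplasmic_phosphoserine(proteins)
print(hits)
```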
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the type of evidence
for the existence of a given protein.
• Different qualifiers:
– 1. Evidence at protein level (~18%)
(MS, western blot (tissue specificity), immuno (subcellular
location),…)
– 2. Evidence at transcript level (~19%)
– 3. Inferred from homology (~58%)
– 4. Predicted (~5%)
– 5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
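Because the five levels are ordered from strongest to weakest evidence, they can be modelled as an ordered enumeration. The sketch below keeps only entries supported at protein or transcript level; the class, function and entry names are illustrative assumptions.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The five levels listed above; a lower value means stronger evidence."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def with_direct_evidence(entries):
    """Keep entries whose existence is supported at protein or transcript level."""
    return [
        name
        for name, level in entries
        if level <= ProteinExistence.TRANSCRIPT_LEVEL
    ]


# Hypothetical entries paired with their 'Protein existence' level.
entries = [
    ("ENTRY_A", ProteinExistence.PROTEIN_LEVEL),
    ("ENTRY_B", ProteinExistence.HOMOLOGY),
    ("ENTRY_C", ProteinExistence.TRANSCRIPT_LEVEL),
]
kept = with_direct_evidence(entries)
print(kept)
```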
UniProtKB
Additional information
can be found in the cross-references
(to more than 140 databases)
2D gel
2DBase-Ecoli
ANU-2DPAGE
Aarhus/Ghent-2DPAGE (no server)
COMPLUYEAST-2DPAGE
Cornea-2DPAGE
DOSAC-COBS-2DPAGE
ECO2DBASE (no server)
OGP
PHCI-2DPAGE
PMMA-2DPAGE
Rat-heart-2DPAGE
REPRODUCTION-2DPAGE
Siena-2DPAGE
SWISS-2DPAGE
UCD-2DPAGE
World-2DPAGE
Family and domain
Gene3D
HAMAP
InterPro
PANTHER
Pfam
PIRSF
PRINTS
ProDom
PROSITE
SMART
SUPFAM
TIGRFAMs
Organism-specific
AGD
ArachnoServer
CGD
ConoServer
CTD
CYGD
dictyBase
EchoBASE
EcoGene
euHCVdb
EuPathDB
FlyBase
GeneCards
GeneDB_Spombe
GeneFarm
GenoList
Gramene
H-InvDB
HGNC
HPA
LegioList
Leproma
MaizeGDB
MGI
MIM
neXtProt
Orphanet
PharmGKB
PseudoCAP
RGD
SGD
TAIR
TubercuList
WormBase
Xenbase
ZFIN
Protein family/group
Allergome
CAZy
MEROPS
PeroxiBase
PptaseDB
REBASE
TCDB
Genome annotation
Ensembl
EnsemblBacteria
EnsemblFungi
EnsemblMetazoa
EnsemblPlants
EnsemblProtists
GeneID
GenomeReviews
KEGG
NMPDR
TIGR
UCSC
VectorBase
Enzyme and pathway
BioCyc
BRENDA
Pathway_Interaction_DB
Reactome
Other
BindingDB
DrugBank
NextBio
PMAP-CutDB
Sequence
EMBL
IPI
PIR
RefSeq
UniGene
3D structure
DisProt
HSSP
PDB
PDBsum
ProteinModelPortal
SMR
PTM
GlycoSuiteDB
PhosphoSite
PhosSite
UniProtKB/Swiss-Prot:
129 explicit links
and 14 implicit links!
Proteomic
PeptideAtlas
PRIDE
ProMEX
PPI
DIP
IntAct
MINT
STRING
Phylogenomic dbs
eggNOG
GeneTree
HOGENOM
HOVERGEN
InParanoid
OMA
OrthoDB
PhylomeDB
ProtClustDB
Polymorphism
dbSNP
Gene expression
ArrayExpress
Bgee
CleanEx
Genevestigator
GermOnline
Ontologies
GO
The UniProt web site
www.uniprot.org
• Powerful search engine, Google-like and easy to use, but also
supports very directed field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries
are bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve, ID mapping
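Since the URL of a result page reflects the query, a query can be turned into a URL by a script. The sketch below only builds the URL and does not contact any server. The endpoint and the parameter names `query` and `format` are assumptions to be checked against the current UniProt documentation; the query string is the complete proteome example used elsewhere in these slides, and its field syntax may have changed since.

```python
from urllib.parse import urlencode

# Assumed endpoint: verify against the current UniProt documentation.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="fasta", base_url=BASE_URL):
    """Encode a text query and an output format into a bookmarkable URL."""
    return f"{base_url}?{urlencode({'query': query, 'format': fmt})}"


url = build_search_url('organism:93062 AND keyword:"complete proteome"')
print(url)
```

Percent-encoding the query is what keeps spaces, colons and quotes intact when the URL is bookmarked or passed to a download tool.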
Search
A very powerful text search tool with
autocompletion and refinement options
for finding UniProt entries and
documentation by biological information
Find all human proteins
located in the nucleus
The search interface
guides users with helpful
suggestions and hints
Advanced Search
A very powerful search tool
To be used when you know in which
entry section the information is stored
Find all the proteins localized in the
cytoplasm (experimentally proven)
which are phosphorylated on a
serine (experimentally proven)
Result pages: highly customizable
Result pages: downloadable
The URL can be bookmarked
and manually modified.
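Editing a bookmarked URL by hand amounts to changing one of its query parameters. A minimal sketch of the same operation in code follows; the host and parameter names are placeholders, not a documented UniProt URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_params(url, **changes):
    """Return `url` with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the last value of each.
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    params.update(changes)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL standing in for a bookmarked result page.
bookmarked = "https://www.example.org/search?query=nucleus&format=html"
modified = with_params(bookmarked, format="tab")
print(modified)
```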
Blast
A tool with the standard options
for searching sequences
against the different UniProt databases and
data sets
Blast: customize the result display
Blast: local alignment
sequence annotation highlighting option
Align
A ClustalW multiple alignment tool
with
sequence annotation highlighting option
Align
sequence annotation highlighting option
Retrieve
A UniProt-specific tool for retrieving a list of
entries given as identifiers in several standard formats.
You can then query your ‘personal database’ with the
UniProt search tool.
Query your own dataset
ID Mapping
Maps the identifiers of a given protein between
different databases
These identifiers all point to a TP53 (p53) protein sequence:
P04637, NP_000537, NP_001119584.1, NP_001119585.1,
NP_001119584.1, NP_001119584.1, NP_001119584.1,
NP_001119584.1, ENSG00000141510, CCDS11118,
UPI000002ED67, IPI00025087, etc.
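Before mapping, it helps to know which database an identifier comes from. The sketch below guesses the source database from the shape of the identifier. The regular expressions are assumptions based on the commonly seen formats of these identifiers, not an official specification, and the function is not part of the UniProt ID mapping tool.

```python
import re

# Assumed identifier shapes, checked in order; first full match wins.
PATTERNS = [
    ("UniProtKB accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq protein", re.compile(r"NP_\d+(?:\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+(?:\.\d+)?")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}(?:\.\d+)?")),
]


def classify(identifier):
    """Return the name of the first pattern matching the whole identifier."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return name
    return "unknown"


for identifier in ["P04637", "NP_000537", "ENSG00000141510",
                   "CCDS11118", "UPI000002ED67", "IPI00025087"]:
    print(identifier, "->", classify(identifier))
```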
Download
Download UniProt
http://www.uniprot.org/downloads
Canonical and isoform sequences (fasta format)
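A downloaded FASTA file can be read with a few lines of code. The sketch below assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description`, with isoforms carrying a numeric suffix on the accession. The accession, entry name and sequences in the example are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession_of(header):
    """Extract the accession from a 'db|ACCESSION|ENTRY_NAME ...' header."""
    fields = header.split("|")
    return fields[1] if len(fields) >= 3 else None


example = """>sp|X00001|TEST_HUMAN Example protein
MSKEKFERTK
PHVNVGTIGH
>sp|X00001-2|TEST_HUMAN Isoform 2 of Example protein
MSKEKTK
"""
records = parse_fasta(example)
for header, sequence in records:
    print(accession_of(header), len(sequence))
```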
A few words on the UniProt
‘complete proteome’
sequence sets…
2’747 complete proteomes
- Genome completely sequenced
- Proteins mapped to the genome
- Entries tagged with the KW ‘Complete proteome’
- UniProtKB/Swiss-Prot isoform sequences are available
in FASTA format only
Fully manually reviewed (e.g. S. cerevisiae)
Partially manually reviewed (e.g. Homo sapiens)
Unreviewed (e.g. Acinetobacter baumannii (strain 1656-2))
UniProtKB - complete proteomes
Can be downloaded:
- From our complete proteome page
www.uniprot.org/taxonomy/complete-proteomes
- From the ‘ftp download’ page
- By querying UniProtKB + download
Query: organism:93062 AND keyword:"complete proteome"
UniProtKB - complete proteomes
Additional information: www.uniprot.org/faq/15
Query UniProtKB + download
Human proteome ~ 20’200 genes
Query for ‘homo sapiens’ (August 2011)
• UniProtKB: 110’056 entries + alt sequences (~ 15’435) = 125’491
• UniProtKB/Swiss-Prot: 20’244 entries + alt sequences (~ 15’435) = 35’679
• UniProtKB/TrEMBL: 89’834 entries
• RefSeq: 32’898 sequences
• Ensembl: 90’720 sequences
Query for ‘homo sapiens’ + Complete proteome (KW-181)
• UniProtKB: 56’392 + alt sequences (15’435) = 71’827
• UniProtKB/Swiss-Prot: 20’238 + alt sequences (15’435) = 35’673
• UniProtKB/TrEMBL: 36’154
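The totals above are the entry counts plus the alternative sequences described in Swiss-Prot entries. A short sketch reproducing the arithmetic from the figures on this slide (variable and function names are illustrative):

```python
ALT_SEQUENCES = 15_435  # alternative sequences described in Swiss-Prot entries

# Entry counts copied from the slide (August 2011).
all_human = {"UniProtKB": 110_056, "UniProtKB/Swiss-Prot": 20_244}
complete_proteome = {
    "UniProtKB": 56_392,
    "UniProtKB/Swiss-Prot": 20_238,
    "UniProtKB/TrEMBL": 36_154,
}


def with_alternatives(entry_count):
    """Number of sequences once the alternative sequences are added."""
    return entry_count + ALT_SEQUENCES


totals_all = {name: with_alternatives(n) for name, n in all_human.items()}
totals_complete = {
    name: with_alternatives(complete_proteome[name])
    for name in ("UniProtKB", "UniProtKB/Swiss-Prot")
}
# In the complete proteome set, the two sections add up to the UniProtKB count.
sections_sum = (complete_proteome["UniProtKB/Swiss-Prot"]
                + complete_proteome["UniProtKB/TrEMBL"])
print(totals_all, totals_complete, sections_sum)
```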
92% of human entries are linked with at least one RefSeq entry…
Summary
Do not hesitate to contact us !
help@uniprot.org
The UniProt Consortium
SIB
Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-
Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel
Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael
Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann,
Sebastien Gehant, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-
Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller,
Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat,
Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole
Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson,
Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue,
Anne-Lise Veuthey
EBI
Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo
Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer,
Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius
Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient,
Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra,
Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
PIR
Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen,
Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale,
Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka
Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang
www.uniprot.org
UniProt is mainly supported by the National
Institutes of Health (NIH) grant 1 U41 HG006104-
01. Additional support for the EBI's involvement in
UniProt comes from the NIH grant 2P41 HG02273-07.
Swiss-Prot activities at the SIB are supported by the
Swiss Federal Government through the Federal
Office of Education and Science and the European
Commission contracts SLING (226073), Gen2Phen
(200754) and MICROME (222886). PIR activities are
also supported by the NIH grants 5R01GM080646-04,
3R01GM080646-04S2, 1G08LM010720-01, and
3P20RR016472-09S2, and NSF grant DBI-0850319.
www.isb-sib.ch
Thank you for your attention
http://education.expasy.org/cours/Prague2011/

The uni prot knowledgebase

  • 1. Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics The UniProt knowledgebase www.uniprot.org a hub of integrated protein data http://education.expasy.org/cours/Prague2011/
  • 3. protein sequence functional information data knowledge
  • 4. UniProt consortium EBI : European Bioinformatics Institute (UK) SIB : Swiss Institute of Bioinformatics (CH) PIR : Protein information resource (US)
  • 7. UniProtKB: protein sequence knowledgebase, 2 sections UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~15 mo entries) UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross- links to other databases where you find the sequence (active or not). Not annotated (query, Blast, download) (~25 mo entries) UniRef: 3 clusters of protein sequences with 100, 90 and 50 % identity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100 10 mo entries; UniRef90 7 mo entries; UniRef50 3.3 mo entries) UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 mo entries, included in UniParc)
  • 9. UniProtKB an encyclopedia on proteins composed of 2 sections UniProtKB/TrEMBL and UniProtKB/Swiss-Prot unreviewed and reviewed automatically annotated and manually annotated released every 4 weeks
  • 10. UniProtKB Origin of protein sequences UniProtKB protein sequences are mainly derived from - INSDC (translated submitted coding sequences - CDS) - Ensembl (gene prediction ) and RefSeq sequences - Sequences of PDB structures - Direct submission or sequences scanned from literature Notes: - UniProt is not doing any gene prediction - Most non-germline immunoglobulins, T-cell receptors , most patent sequences, highly over-represented data (e.g. viral antigens), pseudogenes sequences are excluded from UniProtKB, - but stored in UniParc - Data from the PIR database have been integrated in UniProtKB since 2003. 15 % 85 %
  • 11. Swiss-Prot TrEMBL EMBL Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation Manual annotation of the sequence and associated biological information
  • 13. One protein sequence One species Automated annotation Keywords and Gene Ontology Automated annotation Function, Subcellular location, Catalytic activity, Sequence similarities… Automated annotation transmembrane domains, signal peptide… Cross-references to over 125 databases References Protein and gene names Taxonomic information UniProtKB/TrEMBL www.uniprot.org
  • 14. UniProtKB/TrEMBL Automatic annotation Protein sequence - The quality of the protein sequences is dependent on the information provided by the submitter of the original nucleotide entry (CDS) or of the gene prediction pipeline (i.e. Ensembl). - 100% identical sequences (same length, same organism are merged automatically). Biological information Sources of annotation - Provided by the submitter (EMBL, PDB, TAIR…) - From automated annotation (automated generated annotation rules (i.e. SAAS) and/or manually generated annotation rules (i.e. UniRule))
  • 15.
  • 16.
  • 17. Example of fully automatic annotation: SAAS • Rules are derived from the UniProtKB/Swiss-Prot manual annotation. • Fully automated rule generation based on C4.5 decision tree algorithm. • One annotation, one rule. • High stringency – require 99% or greater estimated precision to generate annotation (test on UniProtKB/Swiss-Prot) • Rules are produced, updated and validated at each release. UniProtKB/TrEMBL
  • 19. MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG One protein sequence One gene One species Manual annotation Keywords and Gene Ontology Manual annotation Function, Subcellular location, Catalytic activity, Disease, Tissue specificty, Pathway… Manual annotation Post-translational modifications, variants, transmembrane domains, signal peptide… Cross-references to over 125 databases References Protein and gene names Taxonomic information Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation… UniProtKB/Swiss-Prot www.uniprot.org
  • 20. UniProtKB/Swiss-Prot Manual annotation 1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…) 2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)
  • 22. The displayed protein sequence: …canonical, representative, consensus… + alternative sequences (described within the entry) 1 entry <-> 1 gene (1 species) UniProtKB/Swiss-Prot a gene-centric view of the protein space
  • 23. What is the current status? • At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence. • Typical problems – unsolved conflicts – uncorrected initiation sites – frameshifts – wrong gene prediction – other ‘problems’
  • 24. UCSC genome browser examples of CDS annotation submitted to INSDC…
  • 26. UniProtKB/Swiss-Prot gathers data form multiple sources: - publications (literature/Pubmed) - prediction programs (Prosite, TMHMM, …) - contacts with experts - other databases - nomenclature committees An evidence attribution system allows to easily trace the source of each annotation Extract literature information and protein sequence analysis maximum usage of controlled vocabulary
  • 28. …enable researchers to obtain a summary of what is known about a protein… General annotation (Comments) www.uniprot.org
  • 29. Human protein manual annotation: some statistics (June 2011)
  • 30. Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org
  • 31. Non-experimental qualifiers UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between both Type of evidence Qualifier Strong experimental evidence None or Ref.X Light experimental evidence Probable Inferred by similarity with homologous protein By similarity Inferred by prediction Potential
  • 32. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
  • 33. • The ‘Protein existence’ tag indicates what is the evidence for the existence of a given protein; • Different qualifiers: – 1. Evidence at protein level (~18%) – (MS, western blot (tissue specificity), immuno (subcellular location),…) – 2. Evidence at transcript level (~19%) – 3. Inferred from homology (~58 %) – 4. Predicted (~5%) – 5. Uncertain (mainly in TrEMBL) ‘Protein existence’ tag http://www.uniprot.org/docs/pe_criteria
  • 34.
  • 35. UniProtKB Additional information can be found in the cross-references (to more than 140 databases)
  • 36. 2D gel 2DBase-Ecoli ANU-2DPAGE Aarhus/Ghent-2DPAGE (no server) COMPLUYEAST-2DPAGE Cornea-2DPAGE DOSAC-COBS-2DPAGE ECO2DBASE (no server) OGP PHCI-2DPAGE PMMA-2DPAGE Rat-heart-2DPAGE REPRODUCTION-2DPAGE Siena-2DPAGE SWISS-2DPAGE UCD-2DPAGE World-2DPAGE Family and domain Gene3D HAMAP InterPro PANTHER Pfam PIRSF PRINTS ProDom PROSITE SMART SUPFAM TIGRFAMs Organism-specific AGD ArachnoServer CGD ConoServer CTD CYGD dictyBase EchoBASE EcoGene euHCVdb EuPathDB FlyBase GeneCards GeneDB_Spombe GeneFarm GenoList Gramene H-InvDB HGNC HPA LegioList Leproma MaizeGDB MGI MIM neXtProt Orphanet PharmGKB PseudoCAP RGD SGD TAIR TubercuList WormBase Xenbase ZFIN Protein family/group Allergome CAZy MEROPS PeroxiBase PptaseDB REBASE TCDB Genome annotation Ensembl EnsemblBacteria EnsemblFungi EnsemblMetazoa EnsemblPlants EnsemblProtists GeneID GenomeReviews KEGG NMPDR TIGR UCSC VectorBase Enzyme and pathway BioCyc BRENDA Pathway_Interaction_DB Reactome Other BindingDB DrugBank NextBio PMAP-CutDB Sequence EMBL IPI PIR RefSeq UniGene 3D structure DisProt HSSP PDB PDBsum ProteinModelPortal SMR PTM GlycoSuiteDB PhosphoSite PhosSite UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links! Proteomic PeptideAtlas PRIDE ProMEX PPI DIP IntAct MINT STRING Phylogenomic dbs eggNOG GeneTree HOGENOM HOVERGEN InParanoid OMA OrthoDB PhylomeDB ProtClustDB Polymorphism dbSNP Gene expression ArrayExpress Bgee CleanEx Genevestigator GermOnline Ontologies GO
  • 37. The UniProt web site www.uniprot.org • Powerful search engine, google-like and easy-to-use, but also supports very directed field searches • Scoring mechanism presenting relevant matches first • Entry views, search result views and downloads are customizable • The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access • Search, Blast, Align, Retrieve, ID mapping
  • 38. Search A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information
  • 39. Find all human proteins located in the nucleus
  • 40. The search interface guides users with helpful suggestions and hints
  • 41.
  • 42. Advanced Search A very powerful search tool To be used when you know in which entry section the information is stored
  • 43. Find all proteins localized in the cytoplasm (experimentally proven) that are phosphorylated on a serine (experimentally proven)
  • 44. Result pages: highly customizable
  • 46.
  • 47. The URL can be bookmarked and manually modified.
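Because the query is carried in the URL, a bookmarked result page can also be modified by a script rather than by hand. The sketch below only rewrites a URL string and performs no network access. The endpoint path and the `format` parameter are assumptions about the site layout at the time of the talk and may not match the live service; the helper `with_params` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def with_params(url, **changes):
    """Return url with the given query-string parameters added or replaced."""
    parts = urlsplit(url)
    # parse_qs decodes the query string; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params.update(changes)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Assumed bookmark: all entries for taxon 9606 (Homo sapiens).
bookmarked = "https://www.uniprot.org/uniprot/?query=organism%3A9606"

# Ask for the same result set in FASTA format (parameter name is an assumption).
as_fasta = with_params(bookmarked, format="fasta")
```

The same helper can swap the query itself, for example `with_params(bookmarked, query="organism:93062")`.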
  • 48. Blast A tool, with the standard options, for searching sequences against different UniProt databases and data sets
  • 49. Blast: customize the result display
  • 50. Blast: local alignment with sequence annotation highlighting option
  • 51. Align A ClustalW multiple alignment tool with sequence annotation highlighting option
  • 53. Retrieve A UniProt-specific tool for retrieving a list of entries using several standard identifier formats. You can then query your ‘personal database’ with the UniProt search tool.
  • 54. Query your own dataset
  • 55. ID Mapping Maps identifiers between different databases for a given protein
  • 56. These identifiers all point to a TP53 (p53) protein sequence! ● P04637, NP_000537, NP_001119584.1, NP_001119585.1, ● NP_001119584.1, NP_001119584.1, NP_001119584.1, ● NP_001119584.1, ENSG00000141510, CCDS11118, ● UPI000002ED67, IPI00025087, etc.
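Before mapping a mixed list of identifiers, it helps to know which database each one comes from. The sketch below guesses the source from the identifier's shape. The UniProtKB accession pattern follows the format documented by UniProt; the other patterns are simplified assumptions chosen to fit the identifiers on the slide, and `classify` is a hypothetical helper, not the ID Mapping tool itself.

```python
import re

# (database label, pattern) pairs; all but the first are simplified assumptions.
PATTERNS = [
    ("UniProtKB accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq protein", re.compile(r"NP_\d+(?:\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}")),
]


def classify(identifier):
    """Return the label of the first pattern that matches the whole identifier."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


slide_ids = ["P04637", "NP_000537", "NP_001119584.1",
             "ENSG00000141510", "CCDS11118", "UPI000002ED67", "IPI00025087"]
by_source = {identifier: classify(identifier) for identifier in slide_ids}
```

Shape-based guessing says nothing about whether an identifier actually exists; only a lookup in the source database can confirm that.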
  • 57.
  • 60. Canonical and isoform sequences (fasta format)
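In UniProtKB FASTA files the header has the form `>db|accession|entry-name description`, and isoform records carry a numeric suffix on the accession (for example `P04637-2`). The minimal sketch below separates canonical from isoform records. The function names are hypothetical and the residues in the example are toy placeholders, not the real p53 sequences.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_info(header):
    """Split 'db|accession|name ...' and report the isoform number, if any."""
    db, accession, _rest = header.split("|", 2)
    base, dash, isoform = accession.partition("-")
    return {"db": db, "accession": base, "isoform": int(isoform) if dash else None}


# Toy records: real header layout, placeholder residues.
EXAMPLE = """\
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens
MKTAYIAKQR
QISFVKSHFS
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens
MKTAYIAK
"""

records = list(parse_fasta(EXAMPLE))
infos = [accession_info(header) for header, _sequence in records]
```

Here `infos[0]["isoform"]` is `None` for the canonical sequence, while `infos[1]["isoform"]` is `2`.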
  • 61. A few words on the UniProt ‘complete proteome’ sequence sets…
  • 62. UniProtKB - complete proteomes
      2’747 complete proteomes
      • Genome completely sequenced
      • Proteins mapped to the genome
      • Entries tagged with the KW ‘Complete proteome’
      • UniProtKB/Swiss-Prot isoform sequences are available in FASTA format only
      Fully manually reviewed (e.g. S. cerevisiae)
      Partially manually reviewed (e.g. Homo sapiens)
      Unreviewed (e.g. Acinetobacter baumannii (strain 1656-2))
  • 63. UniProtKB - complete proteomes
      Can be downloaded:
      • From our complete proteome page www.uniprot.org/taxonomy/complete-proteomes
      • From the ‘ftp download’ page
      • By querying UniProtKB + download. Query: organism:93062 AND keyword:"complete proteome"
      Additional information: www.uniprot.org/faq/15
  • 64. Query UniProtKB + download
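The complete-proteome query quoted on the slide (`organism:93062 AND keyword:"complete proteome"`) can be turned into a download link by URL-encoding it. The sketch below only builds the link and does not contact any server. `BASE_URL` and the `format` parameter are assumptions about the site layout at the time of the talk, so check the current UniProt documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint; the live service may use a different path.
BASE_URL = "https://www.uniprot.org/uniprot/"


def download_url(query, fmt="fasta"):
    """Build a download link for a UniProtKB query (parameter names are assumptions)."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt})


# Query text taken verbatim from the slide.
proteome_query = 'organism:93062 AND keyword:"complete proteome"'
url = download_url(proteome_query)
```

`urlencode` takes care of the colon, the spaces and the double quotes, which are easy to get wrong when editing the URL by hand.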
  • 65.
  • 66. Human proteome ~ 20’200 genes
      Query for ‘homo sapiens’ (August 2011)
      • UniProtKB: 110’056 entries + alt sequences (~ 15’435) = 125’491
      • UniProtKB/Swiss-Prot: 20’244 entries + alt sequences (~ 15’435) = 35’679
      • UniProtKB/TrEMBL: 89’834 entries
      • RefSeq: 32’898 sequences
      • Ensembl: 90’720 sequences
      Query for ‘homo sapiens’ + Complete proteome (KW-181)
      • UniProtKB: 56’392 + alt sequences (15’435) = 71’827
      • UniProtKB/Swiss-Prot: 20’238 + alt sequences (15’435) = 35’673
      • UniProtKB/TrEMBL: 36’154
      92% of human entries are linked with at least one RefSeq entry…
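The totals on this slide are entry counts plus the alternative (isoform) sequences. The short sketch below redoes that arithmetic using only the numbers quoted on the slide; the variable and function names are hypothetical.

```python
# Alternative (isoform) sequences, as quoted on the slide.
ALT_SEQUENCES = 15_435

# Entry counts for the plain 'homo sapiens' query (August 2011).
all_human = {"UniProtKB": 110_056, "UniProtKB/Swiss-Prot": 20_244}

# Entry counts for 'homo sapiens' restricted to the 'Complete proteome' keyword.
complete_proteome = {
    "UniProtKB": 56_392,
    "UniProtKB/Swiss-Prot": 20_238,
    "UniProtKB/TrEMBL": 36_154,
}


def with_isoforms(entries):
    """Number of sequences once alternative sequences are added to the entries."""
    return entries + ALT_SEQUENCES


totals_all = {name: with_isoforms(count) for name, count in all_human.items()}
totals_cp = {name: with_isoforms(complete_proteome[name])
             for name in ("UniProtKB", "UniProtKB/Swiss-Prot")}

# The two sections together give the UniProtKB complete-proteome entry count.
sections_sum = (complete_proteome["UniProtKB/Swiss-Prot"]
                + complete_proteome["UniProtKB/TrEMBL"])
```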
  • 68.
  • 69. Do not hesitate to contact us ! help@uniprot.org
  • 70. The UniProt Consortium SIB Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue, Anne-Lise Veuthey EBI Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg PIR Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang www.uniprot.org
  • 71. UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.
  • 73. Thank you for your attention http://education.expasy.org/cours/Prague2011/

Editor's Notes

  1. This Science cover clearly shows the well-known discrepancy between the amount of data and the amount of knowledge available. This is a first challenge… but there is a second one: how to link the two together?
  2. The mission of UniProt is… to link the protein sequences (data) together with the biological knowledge (functional information)
  3. The UniProt databases and web site are maintained by the UniProt consortium, which is composed of:
  4. Screen shot of the web page
  5. UniProt provides 4 databases, the central one being UniProtKB.
  7. Computer prediction: if there is no other evidence from this protein or a similar protein, the keyword is not added.
  8. dbSNP is NOT in DR lines!!! => not included in the release notes statistics. Note: Replaces BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList
  9. 3 groups working together. Encyclopedia of protein function in biology and life science. Considered by the life science community as the GOLD standard in annotation practices. Over 600’000 users per month originating from 149 countries. Is it uniprot or swiss-prot? Used by life science scientists (biologists, MDs), but also by chemists, engineers in nanotechnologies and bioinformaticians. Used by the pharma and biotechnology industry.
  11. Take home message
  12. a bit of this, a bit of that…