Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
The UniProt knowledgebase
www.uniprot.org
a hub of integrated protein data
http://education.expasy.org/cours/Prague2011/
Science cover, February 2011
protein sequence functional information
data knowledge
UniProt consortium
EBI : European Bioinformatics Institute (UK)
SIB : Swiss Institute of Bioinformatics (CH)
PIR : Protein information resource (US)
UniProt databases
UniProtKB: protein sequence knowledgebase with 2 sections,
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast,
download) (~15 million entries)
UniParc: protein sequence archive (the equivalent of ENA at the
protein level). Each entry contains a protein sequence with
cross-links to the other databases where the sequence is found
(active or not). Not annotated (query, Blast, download) (~25 million entries)
UniRef: 3 sets of protein sequence clusters at 100, 90 and
50% identity; useful to speed up sequence similarity
searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries;
UniRef90: 7 million entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic
projects (mostly Global Ocean Sampling (GOS)) (download)
(8 million entries, included in UniParc)
The central piece
UniProtKB
an encyclopedia on proteins
composed of 2 sections
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
unreviewed and reviewed
automatically annotated and manually annotated
released every 4 weeks
UniProtKB
Origin of protein sequences
UniProtKB protein sequences are mainly derived from
- INSDC (translated submitted coding sequences - CDS)
- Ensembl (gene prediction) and RefSeq sequences
- Sequences of PDB structures
- Direct submission or sequences scanned from literature
Notes: - UniProt does not do any gene prediction
- Most non-germline immunoglobulins and T-cell receptors, most patent sequences,
highly over-represented data (e.g. viral antigens) and pseudogene sequences are
excluded from UniProtKB, but stored in UniParc
- Data from the PIR database have been integrated into UniProtKB since 2003.
15 %
85 %
Swiss-Prot
TrEMBL
EMBL
Automated extraction
of protein sequence
(translated CDS), gene
name and references.
Automated annotation
Manual annotation of
the sequence and
associated biological
information
UniProtKB/TrEMBL
unreviewed
Automatic annotation
released every 4 weeks
One protein sequence
One species
Automated annotation
Keywords
and
Gene Ontology
Automated annotation
Function, Subcellular location,
Catalytic activity,
Sequence similarities…
Automated annotation
transmembrane domains,
signal peptide…
Cross-references
to over 125 databases
References
Protein and gene names
Taxonomic information
UniProtKB/TrEMBL
UniProtKB/TrEMBL
Automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information
provided by the submitter of the original nucleotide entry (CDS) or by the
gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged
automatically.
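The merge rule above can be pictured with a minimal sketch. It assumes that "identical" means the same organism and exactly the same residue string; the record layout and the `cds_*` identifiers are hypothetical and are not taken from the UniProt pipeline.

```python
from collections import defaultdict


def merge_identical(records):
    """Group source records that share the same organism and sequence.

    Each record is a (source_id, organism, sequence) tuple (hypothetical layout).
    Returns a dict mapping (organism, sequence) to the list of merged source ids.
    """
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        merged[(organism, sequence)].append(source_id)
    return dict(merged)


# Illustrative records only; the sequences are short placeholders.
records = [
    ("cds_A", "Homo sapiens", "MSKEKFERTK"),
    ("cds_B", "Homo sapiens", "MSKEKFERTK"),  # identical to cds_A: merged
    ("cds_C", "Mus musculus", "MSKEKFERTK"),  # same sequence, other species: kept apart
    ("cds_D", "Homo sapiens", "MSKEKFERT"),   # different length: kept apart
]
entries = merge_identical(records)
for (organism, sequence), sources in entries.items():
    print(organism, sequence, sources)
```

Keying on the pair (organism, sequence) is what keeps identical sequences from different species in separate entries.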
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (automatically generated annotation rules (e.g.
SAAS) and/or manually generated annotation rules (e.g. UniRule))
Example of fully automatic annotation: SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on the C4.5 decision tree algorithm.
• One annotation, one rule.
• High stringency: a rule must reach an estimated precision of 99% or
greater to generate annotation (tested on UniProtKB/Swiss-Prot)
• Rules are produced, updated and validated at each release.
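The 99% precision threshold can be illustrated with a small sketch. It does not implement C4.5 or the SAAS pipeline; it only shows how a candidate rule might be scored against reviewed entries and kept or discarded. The entry layout, the signature names `SIG_X` and `SIG_Y`, and the function names are all hypothetical.

```python
def precision(rule, annotation, reviewed_entries):
    """Fraction of the entries matched by the rule that carry the annotation."""
    matched = [entry for entry in reviewed_entries if rule(entry)]
    if not matched:
        return 0.0
    correct = sum(1 for entry in matched if annotation in entry["annotations"])
    return correct / len(matched)


def accept_rule(rule, annotation, reviewed_entries, threshold=0.99):
    """Keep a rule only if its estimated precision reaches the threshold."""
    return precision(rule, annotation, reviewed_entries) >= threshold


# Toy reviewed set: SIG_X predicts "Kinase" in 99 of 100 entries,
# SIG_Y in only 19 of 20 entries.
reviewed = (
    [{"signatures": {"SIG_X"}, "annotations": {"Kinase"}} for _ in range(99)]
    + [{"signatures": {"SIG_X"}, "annotations": set()}]
    + [{"signatures": {"SIG_Y"}, "annotations": {"Kinase"}} for _ in range(19)]
    + [{"signatures": {"SIG_Y"}, "annotations": set()}]
)


def has_sig_x(entry):
    return "SIG_X" in entry["signatures"]


def has_sig_y(entry):
    return "SIG_Y" in entry["signatures"]


print(accept_rule(has_sig_x, "Kinase", reviewed))  # precision 99/100
print(accept_rule(has_sig_y, "Kinase", reviewed))  # precision 19/20
```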
UniProtKB/TrEMBL
UniProtKB/Swiss-Prot
reviewed
manually annotated
released every 4 weeks
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
One protein sequence
One gene
One species
Manual annotation
Keywords
and
Gene Ontology
Manual annotation
Function, Subcellular location,
Catalytic activity, Disease,
Tissue specificity, Pathway…
Manual annotation
Post-translational modifications,
variants, transmembrane domains,
signal peptide…
Cross-references
to over 125 databases
References
Protein and gene names
Taxonomic information
Alternative products:
protein sequences produced by
alternative splicing,
alternative promoter usage,
alternative initiation…
UniProtKB/Swiss-Prot
UniProtKB/Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate
sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract
literature information, ortholog data propagation, …)
UniProtKB/Swiss-Prot
1- Protein sequence curation
The displayed protein sequence:
…canonical, representative, consensus…
+
alternative sequences (described within the entry)
1 entry <-> 1 gene (1 species)
UniProtKB/Swiss-Prot
a gene-centric view of the protein space
What is the current status?
• At least 20% of Swiss-Prot entries required a minimal
amount of curation effort so as to obtain the “correct”
sequence.
• Typical problems
– unsolved conflicts
– uncorrected initiation sites
– frameshifts
– wrong gene prediction
– other ‘problems’
UCSC genome browser
examples of CDS annotation submitted to INSDC…
UniProtKB/Swiss-Prot
2- Biological data curation
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/Pubmed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the
source of each annotation
Extract literature information
and protein sequence analysis
maximum usage of controlled vocabulary
Protein and gene names
…enable researchers to
obtain a summary of
what is known about a
protein…
General annotation
(Comments)
Human protein manual annotation:
some statistics (June 2011)
Sequence annotation
(Features)
…enable researchers to
obtain a summary of
what is known about a
protein…
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted
data and makes a clear distinction between the two
Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential
Find all the proteins localized in
the cytoplasm (experimentally
proven) which are phosphorylated
on a serine (experimentally proven)
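A query of this kind amounts to filtering annotations by their qualifier. The sketch below assumes a hypothetical in-memory entry layout in which each annotation is a (value, qualifier) pair; it also assumes that only annotations with no qualifier or with a reference qualifier count as experimentally proven, so "Probable" is treated as non-experimental here.

```python
# Qualifiers from the evidence table that mark non-experimental data.
# Treating "Probable" as non-experimental is an assumption of this sketch.
NON_EXPERIMENTAL = {"Probable", "By similarity", "Potential"}


def is_experimental(qualifier):
    """True when the qualifier is absent (None) or a reference such as 'Ref.1'."""
    return qualifier not in NON_EXPERIMENTAL


def cytoplasmic_phosphoserine(entry):
    """Both annotations must be present and experimentally supported."""
    located = any(
        location == "Cytoplasm" and is_experimental(qualifier)
        for location, qualifier in entry["locations"]
    )
    modified = any(
        modification == "Phosphoserine" and is_experimental(qualifier)
        for modification, qualifier in entry["modifications"]
    )
    return located and modified


# Hypothetical entries; the ids are placeholders, not real accessions.
entries = [
    {"id": "ENTRY_A", "locations": [("Cytoplasm", None)],
     "modifications": [("Phosphoserine", "Ref.1")]},
    {"id": "ENTRY_B", "locations": [("Cytoplasm", "By similarity")],
     "modifications": [("Phosphoserine", None)]},
    {"id": "ENTRY_C", "locations": [("Nucleus", None)],
     "modifications": [("Phosphoserine", "Potential")]},
]
hits = [entry["id"] for entry in entries if cytoplasmic_phosphoserine(entry)]
print(hits)
```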
• The ‘Protein existence’ tag indicates the evidence
for the existence of a given protein.
• Different qualifiers:
– 1. Evidence at protein level (~18%)
  (MS, western blot (tissue specificity), immuno (subcellular
  location),…)
– 2. Evidence at transcript level (~19%)
– 3. Inferred from homology (~58%)
– 4. Predicted (~5%)
– 5. Uncertain (mainly in TrEMBL)
‘Protein existence’ tag
http://www.uniprot.org/docs/pe_criteria
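Because the five levels are ordered from strongest to weakest evidence, they map naturally onto an integer enumeration. The sketch below is only an illustration; the class name, the helper and the `ACC_*` identifiers are hypothetical.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The five 'Protein existence' levels; a lower value means stronger evidence."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def with_evidence_at_least(entries, weakest):
    """Keep accessions whose evidence is at least as strong as `weakest`."""
    return [accession for accession, level in entries if level <= weakest]


# Placeholder accessions, not real entries.
entries = [
    ("ACC_1", ProteinExistence.PROTEIN_LEVEL),
    ("ACC_2", ProteinExistence.HOMOLOGY),
    ("ACC_3", ProteinExistence.TRANSCRIPT_LEVEL),
    ("ACC_4", ProteinExistence.UNCERTAIN),
]
supported = with_evidence_at_least(entries, ProteinExistence.TRANSCRIPT_LEVEL)
print(supported)
```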
UniProtKB
Additional information
can be found in the cross-references
(to more than 140 databases)
2D gel
2DBase-Ecoli
ANU-2DPAGE
Aarhus/Ghent-2DPAGE (no server)
COMPLUYEAST-2DPAGE
Cornea-2DPAGE
DOSAC-COBS-2DPAGE
ECO2DBASE (no server)
OGP
PHCI-2DPAGE
PMMA-2DPAGE
Rat-heart-2DPAGE
REPRODUCTION-2DPAGE
Siena-2DPAGE
SWISS-2DPAGE
UCD-2DPAGE
World-2DPAGE
Family and domain
Gene3D
HAMAP
InterPro
PANTHER
Pfam
PIRSF
PRINTS
ProDom
PROSITE
SMART
SUPFAM
TIGRFAMs
Organism-specific
AGD
ArachnoServer
CGD
ConoServer
CTD
CYGD
dictyBase
EchoBASE
EcoGene
euHCVdb
EuPathDB
FlyBase
GeneCards
GeneDB_Spombe
GeneFarm
GenoList
Gramene
H-InvDB
HGNC
HPA
LegioList
Leproma
MaizeGDB
MGI
MIM
neXtProt
Orphanet
PharmGKB
PseudoCAP
RGD
SGD
TAIR
TubercuList
WormBase
Xenbase
ZFIN
Protein family/group
Allergome
CAZy
MEROPS
PeroxiBase
PptaseDB
REBASE
TCDB
Genome annotation
Ensembl
EnsemblBacteria
EnsemblFungi
EnsemblMetazoa
EnsemblPlants
EnsemblProtists
GeneID
GenomeReviews
KEGG
NMPDR
TIGR
UCSC
VectorBase
Enzyme and pathway
BioCyc
BRENDA
Pathway_Interaction_DB
Reactome
Other
BindingDB
DrugBank
NextBio
PMAP-CutDB
Sequence
EMBL
IPI
PIR
RefSeq
UniGene
3D structure
DisProt
HSSP
PDB
PDBsum
ProteinModelPortal
SMR
PTM
GlycoSuiteDB
PhosphoSite
PhosSite
UniProtKB/Swiss-Prot:
129 explicit links
and 14 implicit links!
Proteomic
PeptideAtlas
PRIDE
ProMEX
PPI
DIP
IntAct
MINT
STRING
Phylogenomic dbs
eggNOG
GeneTree
HOGENOM
HOVERGEN
InParanoid
OMA
OrthoDB
PhylomeDB
ProtClustDB
Polymorphism
dbSNP
Gene expression
ArrayExpress
Bgee
CleanEx
Genevestigator
GermOnline
Ontologies
GO
The UniProt web site
www.uniprot.org
• Powerful search engine, Google-like and easy to use, but also
supports very directed field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries
are bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve, ID mapping
Search
A very powerful text search tool with
autocompletion and refinement options
for finding UniProt entries and
documentation by biological information
Find all human proteins
located in the nucleus
The search interface
guides users with helpful
suggestions and hints
Advanced Search
A very powerful search tool
To be used when you know in which
entry section the information is stored
Find all the proteins localized in the
cytoplasm (experimentally proven)
which are phosphorylated on a
serine (experimentally proven)
Result pages: highly customizable
Result pages: downloadable
The URL can be bookmarked
and manually modified.
Blast
A tool with the standard
options for searching sequences
in different UniProt databases and
data sets
Blast: customize the result display
Blast: local alignment
sequence annotation highlighting option
Align
A ClustalW multiple alignment tool
with
sequence annotation highlighting option
Retrieve
A UniProt-specific tool for retrieving a list of
entries in several standard identifier formats.
You can then query your ‘personal database’ with the
UniProt search tool.
Query your own dataset
ID Mapping
Makes it possible to map identifiers between
different databases for a given protein
These identifiers all point to a TP53 (p53) protein sequence!
● P04637, NP_000537, NP_001119584.1, NP_001119585.1,
● NP_001119584.1, NP_001119584.1, NP_001119584.1,
● NP_001119584.1, ENSG00000141510, CCDS11118,
● UPI000002ED67, IPI00025087, etc.
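Before mapping, it can help to recognise which database an identifier comes from. The sketch below classifies the identifiers listed above with regular expressions. The patterns are simplified assumptions written for this example (only the UniProtKB accession pattern follows the commonly documented form), and the labels are illustrative.

```python
import re

# Simplified identifier patterns; assumptions for illustration, not official definitions.
PATTERNS = [
    ("UniProtKB accession",
     re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq protein", re.compile(r"NP_\d+(?:\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}")),
]


def classify(identifier):
    """Return the label of the first pattern that matches the whole identifier."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


identifiers = [
    "P04637", "NP_000537", "NP_001119584.1", "ENSG00000141510",
    "CCDS11118", "UPI000002ED67", "IPI00025087",
]
classified = {identifier: classify(identifier) for identifier in identifiers}
for identifier, label in classified.items():
    print(f"{identifier:18} {label}")
```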
Download
Download UniProt
http://www.uniprot.org/downloads
Canonical and isoform sequences (FASTA format)
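A downloaded FASTA file can be read with a few lines of code. The sketch below is a minimal parser that separates canonical from isoform records. It assumes headers of the form `db|accession|entry name` and isoform accessions that carry a `-N` suffix; the headers and the accession `X00001` are hypothetical, and the residues are placeholders.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession_of(header):
    """Second '|'-separated field of the header (assumed layout)."""
    return header.split("|")[1]


# Hypothetical two-record file: one canonical sequence and one isoform.
example = """\
>sp|X00001|EXMPL_HUMAN Example protein
MSKEKFERTKPHVNV
GTIGHVDHGK
>sp|X00001-2|EXMPL_HUMAN Isoform 2 of Example protein
MSKEKFERTKGK
"""
records = parse_fasta(example)
canonical = [h for h, _ in records if "-" not in accession_of(h)]
isoforms = [h for h, _ in records if "-" in accession_of(h)]
print(len(canonical), "canonical,", len(isoforms), "isoform")
```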
A few words on the UniProt
‘complete proteome’
sequence sets…
2’747 complete proteomes
• Genome completely sequenced
• Proteins mapped to the genome
• Entries tagged with the KW ‘Complete proteome’
• UniProtKB/Swiss-Prot isoform sequences are available
in FASTA format only
Fully manually reviewed (e.g. S. cerevisiae)
Partially manually reviewed (e.g. Homo sapiens)
Unreviewed (e.g. Acinetobacter baumannii (strain 1656-2))
UniProtKB - complete proteomes
Can be downloaded:
• From our complete proteome page
www.uniprot.org/taxonomy/complete-proteomes
• From the ‘FTP download’ page
• By querying UniProtKB + download
Query: organism:93062 AND keyword:"complete proteome"
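Since a result page URL reflects the query, the same query can be turned into a download link programmatically. The sketch below only builds and inspects the URL; it does not contact the server. The path `/uniprot/` and the parameter names `query` and `format` are assumptions made for this example and may not match the endpoint the site currently offers.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL and parameter names; check the current UniProt help pages before use.
BASE_URL = "https://www.uniprot.org/uniprot/"


def build_query_url(query, fmt="fasta"):
    """Percent-encode the query so that spaces, colons and quotes survive in a URL."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt})


query = 'organism:93062 AND keyword:"complete proteome"'
url = build_query_url(query)
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Encoding through `urlencode` rather than by hand avoids broken links when a query contains quotes or spaces.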
UniProtKB - complete proteomes
Additional information: www.uniprot.org/faq/15
Query UniProtKB + download
Human proteome ~ 20’200 genes
Query for ‘homo sapiens’ (August 2011)
• UniProtKB: 110’056 entries + alt sequences (~ 15’435) = 125’491
• UniProtKB/Swiss-Prot: 20’244 entries + alt sequences (~ 15’435) = 35’679
• UniProtKB/TrEMBL: 89’834 entries
• RefSeq: 32’898 sequences
• Ensembl: 90’720 sequences
Query for ‘homo sapiens’ + Complete proteome (KW-181)
• UniProtKB: 56’392 + alt sequences (15’435) = 71’827
• UniProtKB/Swiss-Prot: 20’238 + alt sequences (15’435) = 35’673
• UniProtKB/TrEMBL: 36’154
92% of human entries are linked with at least one RefSeq entry…
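The totals quoted for the human queries are the entry counts plus the alternative (isoform) sequences. The short check below reproduces that arithmetic from the figures on the slide; the variable names are illustrative.

```python
ALT_SEQUENCES = 15_435  # alternative sequences described in Swiss-Prot entries

# (label, entries, total including alternative sequences), as quoted above
checks = [
    ("UniProtKB, 'homo sapiens'", 110_056, 125_491),
    ("UniProtKB/Swiss-Prot, 'homo sapiens'", 20_244, 35_679),
    ("UniProtKB, complete proteome", 56_392, 71_827),
    ("UniProtKB/Swiss-Prot, complete proteome", 20_238, 35_673),
]


def all_consistent():
    """True when every quoted total equals entries + alternative sequences."""
    return all(entries + ALT_SEQUENCES == total for _, entries, total in checks)


for label, entries, total in checks:
    print(f"{label}: {entries} + {ALT_SEQUENCES} = {entries + ALT_SEQUENCES}")

# For the complete proteome, the two sections add up to the UniProtKB count.
complete_proteome_sum = 20_238 + 36_154
print(complete_proteome_sum)
```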
Summary
Do not hesitate to contact us!
help@uniprot.org
The UniProt Consortium
SIB
Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-
Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel
Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael
Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann,
Sebastien Gehant, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-
Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller,
Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat,
Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole
Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson,
Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue,
Anne-Lise Veuthey
EBI
Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo
Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer,
Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius
Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient,
Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra,
Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
PIR
Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen,
Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale,
Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka
Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang
UniProt is mainly supported by the National
Institutes of Health (NIH) grant 1 U41 HG006104-
01. Additional support for the EBI's involvement in
UniProt comes from the NIH grant 2P41 HG02273-07.
Swiss-Prot activities at the SIB are supported by the
Swiss Federal Government through the Federal
Office of Education and Science and the European
Commission contracts SLING (226073), Gen2Phen
(200754) and MICROME (222886). PIR activities are
also supported by the NIH grants 5R01GM080646-04,
3R01GM080646-04S2, 1G08LM010720-01, and
3P20RR016472-09S2, and NSF grant DBI-0850319.
www.isb-sib.ch
Thank you for your attention
http://education.expasy.org/cours/Prague2011/

The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 

The UniProt knowledgebase

  • 1. Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics The UniProt knowledgebase www.uniprot.org a hub of integrated protein data http://education.expasy.org/cours/Prague2011/
  • 3. protein sequence functional information data knowledge
  • 4. UniProt consortium EBI : European Bioinformatics Institute (UK) SIB : Swiss Institute of Bioinformatics (CH) PIR : Protein information resource (US)
  • 7. UniProtKB: protein sequence knowledgebase, 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~15 million entries). UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated (query, Blast, download) (~25 million entries). UniRef: 3 clusters of protein sequences with 100, 90 and 50% identity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries). UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)
  • 9. UniProtKB an encyclopedia on proteins composed of 2 sections UniProtKB/TrEMBL and UniProtKB/Swiss-Prot unreviewed and reviewed automatically annotated and manually annotated released every 4 weeks
  • 10. UniProtKB: origin of protein sequences. UniProtKB protein sequences are mainly derived from:
      - INSDC (translated submitted coding sequences, CDS)
      - Ensembl (gene prediction) and RefSeq sequences
      - Sequences of PDB structures
      - Direct submission or sequences scanned from literature
    Notes:
      - UniProt is not doing any gene prediction
      - Most non-germline immunoglobulins, T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB, but stored in UniParc
      - Data from the PIR database have been integrated in UniProtKB since 2003.
  • 11. Swiss-Prot TrEMBL EMBL Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation Manual annotation of the sequence and associated biological information
  • 13. One protein sequence One species Automated annotation Keywords and Gene Ontology Automated annotation Function, Subcellular location, Catalytic activity, Sequence similarities… Automated annotation transmembrane domains, signal peptide… Cross-references to over 125 databases References Protein and gene names Taxonomic information UniProtKB/TrEMBL www.uniprot.org
  • 14. UniProtKB/TrEMBL: automatic annotation. Protein sequence: - The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl). - 100% identical sequences (same length, same organism) are merged automatically. Biological information, sources of annotation: - Provided by the submitter (EMBL, PDB, TAIR…) - From automated annotation (automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule))
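
The automatic merge mentioned on this slide (100% identical sequences from the same organism collapse into one entry) can be pictured with a minimal sketch. The record layout, the source identifiers and the function name `merge_identical` are hypothetical and chosen for illustration only; this is not the UniProt pipeline.

```python
from collections import defaultdict

def merge_identical(records):
    """Group records that share the same organism and the exact same sequence.

    `records` is an iterable of (source_id, organism, sequence) tuples
    (a hypothetical layout). Returns a dict mapping (organism, sequence)
    to the list of source ids that were merged under that key.
    """
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        # Identical strings have identical length, so the key covers both.
        merged[(organism, sequence.upper())].append(source_id)
    return dict(merged)

# Made-up example records.
records = [
    ("CDS_A", "Homo sapiens", "MSKEKFERTK"),
    ("CDS_B", "Homo sapiens", "MSKEKFERTK"),   # same organism, same sequence
    ("CDS_C", "Mus musculus", "MSKEKFERTK"),   # same sequence, other organism
    ("CDS_D", "Homo sapiens", "MSKEKFERTKP"),  # one residue longer
]
entries = merge_identical(records)
```

Only `CDS_A` and `CDS_B` end up in the same entry; a different organism or a single extra residue keeps the records apart.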
  • 17. Example of fully automatic annotation: SAAS • Rules are derived from the UniProtKB/Swiss-Prot manual annotation. • Fully automated rule generation based on C4.5 decision tree algorithm. • One annotation, one rule. • High stringency – require 99% or greater estimated precision to generate annotation (test on UniProtKB/Swiss-Prot) • Rules are produced, updated and validated at each release. UniProtKB/TrEMBL
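
The stringency criterion on this slide (a rule is kept only if its estimated precision is 99% or greater) can be illustrated with a short sketch. It assumes precision is computed as true positives over all predictions when a rule is tested against reviewed entries; the rule names and counts are invented, and the C4.5 rule generation itself is not shown.

```python
def estimated_precision(true_positives, false_positives):
    """Fraction of a rule's predictions that agree with the manual annotation."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0

def keep_high_precision(rules, threshold=0.99):
    """Return the names of rules whose estimated precision meets the threshold."""
    return [
        name
        for name, (tp, fp) in rules.items()
        if estimated_precision(tp, fp) >= threshold
    ]

# Hypothetical (true positive, false positive) counts per candidate rule.
candidate_rules = {
    "rule_kinase_keyword": (995, 5),       # precision 0.995, kept
    "rule_membrane_location": (980, 20),   # precision 0.980, rejected
    "rule_untested": (0, 0),               # no predictions, rejected
}
kept = keep_high_precision(candidate_rules)
```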
  • 19. MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG One protein sequence One gene One species Manual annotation Keywords and Gene Ontology Manual annotation Function, Subcellular location, Catalytic activity, Disease, Tissue specificity, Pathway… Manual annotation Post-translational modifications, variants, transmembrane domains, signal peptide… Cross-references to over 125 databases References Protein and gene names Taxonomic information Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation… UniProtKB/Swiss-Prot www.uniprot.org
  • 20. UniProtKB/Swiss-Prot Manual annotation 1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…) 2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)
  • 22. The displayed protein sequence: …canonical, representative, consensus… + alternative sequences (described within the entry) 1 entry <-> 1 gene (1 species) UniProtKB/Swiss-Prot a gene-centric view of the protein space
  • 23. What is the current status? • At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence. • Typical problems – unsolved conflicts – uncorrected initiation sites – frameshifts – wrong gene prediction – other ‘problems’
  • 24. UCSC genome browser examples of CDS annotation submitted to INSDC…
  • 26. UniProtKB/Swiss-Prot gathers data from multiple sources: - publications (literature/PubMed) - prediction programs (Prosite, TMHMM, …) - contacts with experts - other databases - nomenclature committees. An evidence attribution system makes it easy to trace the source of each annotation. Literature information extraction and protein sequence analysis make maximum use of controlled vocabulary.
  • 28. …enable researchers to obtain a summary of what is known about a protein… General annotation (Comments) www.uniprot.org
  • 29. Human protein manual annotation: some statistics (June 2011)
  • 30. Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org
  • 31. Non-experimental qualifiers. UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two. Type of evidence and corresponding qualifier:
      - Strong experimental evidence: none or Ref.X
      - Light experimental evidence: Probable
      - Inferred by similarity with homologous protein: By similarity
      - Inferred by prediction: Potential
  • 32. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
  • 33. ‘Protein existence’ tag. The ‘Protein existence’ tag indicates the evidence for the existence of a given protein. Different qualifiers:
      - 1. Evidence at protein level (~18%): MS, western blot (tissue specificity), immuno (subcellular location),…
      - 2. Evidence at transcript level (~19%)
      - 3. Inferred from homology (~58%)
      - 4. Predicted (~5%)
      - 5. Uncertain (mainly in TrEMBL)
    http://www.uniprot.org/docs/pe_criteria
  • 35. UniProtKB Additional information can be found in the cross-references (to more than 140 databases)
  • 36. 2D gel 2DBase-Ecoli ANU-2DPAGE Aarhus/Ghent-2DPAGE (no server) COMPLUYEAST-2DPAGE Cornea-2DPAGE DOSAC-COBS-2DPAGE ECO2DBASE (no server) OGP PHCI-2DPAGE PMMA-2DPAGE Rat-heart-2DPAGE REPRODUCTION-2DPAGE Siena-2DPAGE SWISS-2DPAGE UCD-2DPAGE World-2DPAGE Family and domain Gene3D HAMAP InterPro PANTHER Pfam PIRSF PRINTS ProDom PROSITE SMART SUPFAM TIGRFAMs Organism-specific AGD ArachnoServer CGD ConoServer CTD CYGD dictyBase EchoBASE EcoGene euHCVdb EuPathDB FlyBase GeneCards GeneDB_Spombe GeneFarm GenoList Gramene H-InvDB HGNC HPA LegioList Leproma MaizeGDB MGI MIM neXtProt Orphanet PharmGKB PseudoCAP RGD SGD TAIR TubercuList WormBase Xenbase ZFIN Protein family/group Allergome CAZy MEROPS PeroxiBase PptaseDB REBASE TCDB Genome annotation Ensembl EnsemblBacteria EnsemblFungi EnsemblMetazoa EnsemblPlants EnsemblProtists GeneID GenomeReviews KEGG NMPDR TIGR UCSC VectorBase Enzyme and pathway BioCyc BRENDA Pathway_Interaction_DB Reactome Other BindingDB DrugBank NextBio PMAP-CutDB Sequence EMBL IPI PIR RefSeq UniGene 3D structure DisProt HSSP PDB PDBsum ProteinModelPortal SMR PTM GlycoSuiteDB PhosphoSite PhosSite UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links! Proteomic PeptideAtlas PRIDE ProMEX PPI DIP IntAct MINT STRING Phylogenomic dbs eggNOG GeneTree HOGENOM HOVERGEN InParanoid OMA OrthoDB PhylomeDB ProtClustDB Polymorphism dbSNP Gene expression ArrayExpress Bgee CleanEx Genevestigator GermOnline Ontologies GO
  • 37. The UniProt web site www.uniprot.org • Powerful search engine, Google-like and easy to use, but also supports very directed field searches • Scoring mechanism presenting relevant matches first • Entry views, search result views and downloads are customizable • The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access • Search, Blast, Align, Retrieve, ID mapping
  • 38. Search: a very powerful text search tool with autocompletion and refinement options that lets you look for UniProt entries and documentation by biological information
  • 39. Find all human proteins located in the nucleus
  • 40. The search interface guides users with helpful suggestions and hints
  • 42. Advanced Search: a very powerful search tool, to be used when you know in which entry section the information is stored
  • 43. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
  • 44. Result pages: highly customizable
  • 47. The URL can be bookmarked and manually modified.
  • 48. Blast: a tool with the standard options for searching sequences against different UniProt databases and data sets
  • 49. Blast: customize the result display
  • 50. Blast: local alignment sequence annotation highlighting option
  • 51. Align A ClustalW multiple alignment tool with sequence annotation highlighting option
  • 53. Retrieve: a UniProt-specific tool for retrieving a list of entries from identifiers in several standard formats. You can then query your ‘personal database’ with the UniProt search tool.
  • 54. Query your own dataset
  • 55. ID Mapping: maps the identifiers of a given protein between different databases
  • 56. These identifiers are all pointing to a TP53 (p53) protein sequence ! ● P04637, NP_000537, NP_001119584.1, NP_001119585.1, ● NP_001119584.1, NP_001119584.1, NP_001119584.1, ● NP_001119584.1, ENSG00000141510, CCDS11118, ● UPI000002ED67, IPI00025087, etc.
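
To make the variety of identifiers on this slide concrete, the sketch below guesses the source database of an identifier from its shape. The UniProtKB accession pattern follows the accession format documented by UniProt; the other patterns are simplified assumptions made for this example, and the function `classify` is hypothetical. It does not perform any mapping, which is what the ID Mapping tool is for.

```python
import re

# Simplified identifier patterns (assumptions, except the UniProtKB accession regex).
PATTERNS = {
    "UniProtKB accession": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "RefSeq protein": re.compile(r"NP_\d+(?:\.\d+)?"),
    "Ensembl gene": re.compile(r"ENSG\d{11}"),
    "CCDS": re.compile(r"CCDS\d+"),
    "UniParc": re.compile(r"UPI[0-9A-F]{10}"),
    "IPI": re.compile(r"IPI\d{8}"),
}

def classify(identifier):
    """Return the name of the first pattern that matches the whole identifier."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(identifier):
            return name
    return "unknown"

# Identifiers taken from the slide.
slide_ids = ["P04637", "NP_000537", "NP_001119584.1",
             "ENSG00000141510", "CCDS11118", "UPI000002ED67", "IPI00025087"]
kinds = {identifier: classify(identifier) for identifier in slide_ids}
```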
  • 60. Canonical and isoform sequences (fasta format)
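
As an illustration of working with such a FASTA download, the sketch below reads FASTA text and separates canonical from isoform sequences. It assumes the `db|accession|name` header layout used in UniProtKB FASTA files and the `-N` accession suffix for isoforms; the sequences in the example are truncated placeholders, not full-length proteins, and the helper names are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def accession_of(header):
    """Return the accession field of a 'db|accession|name ...' header."""
    return header.split("|")[1]

def is_isoform(accession):
    """Isoform accessions are assumed to carry a '-N' suffix, e.g. P04637-2."""
    return "-" in accession

# Truncated placeholder sequences, for illustration only.
example = """\
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens
MEEPQSDPSV
EPPLSQETFS
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens
MEEPQSDPSV
"""
parsed = [(accession_of(h), seq) for h, seq in parse_fasta(example)]
isoforms = [acc for acc, _ in parsed if is_isoform(acc)]
```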
  • 61. A few words on the UniProt ‘complete proteome’ sequence sets…
  • 62. UniProtKB - complete proteomes: 2’747 complete proteomes. Genome completely sequenced; proteins mapped to the genome; entries tagged with the KW ‘Complete proteome’; UniProtKB/Swiss-Prot isoform sequences are available in FASTA format only. Fully manually reviewed (e.g. S. cerevisiae); partially manually reviewed (e.g. Homo sapiens); unreviewed (e.g. Acinetobacter baumannii (strain 1656-2))
  • 63. UniProtKB - complete proteomes can be downloaded: from our complete proteome page www.uniprot.org/taxonomy/complete-proteomes; from the ‘ftp download’ page; by querying UniProtKB + download. Query: organism:93062 AND keyword:"complete proteome". Additional information: www.uniprot.org/faq/15
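
Because the URL of a result page reflects the query, the query shown on this slide can be turned into a bookmarkable URL with a few lines of code. This is a minimal sketch: the base path and the parameter names `query` and `format` are assumptions made for illustration and should be checked against the current UniProt help pages before use. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint and parameter names; verify against the UniProt documentation.
BASE_URL = "https://www.uniprot.org/uniprot/"

def build_query_url(query, fmt="fasta"):
    """Return a URL that encodes the query text and the requested format."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt})

slide_query = 'organism:93062 AND keyword:"complete proteome"'
url = build_query_url(slide_query)

# Decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Percent-encoding the colons, spaces and quotes keeps the query intact when the URL is bookmarked or edited by hand.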
  • 64. Query UniProtKB + download
  • 66. Human proteome ~ 20’200 genes Query for ‘homo sapiens’ (August 2011) • UniProtKB: 110,056 entries + alt sequences (~ 15’435) = 125’491 • UniProtKB/Swiss-Prot: 20’244 entries + alt sequences (~ 15’435) = 35’679 • UniProtKB/TrEMBL: 89,834 entries • RefSeq: 32’898 sequences • Ensembl: 90’720 sequences Query for ‘homo sapiens’ + Complete proteome (KW-181) • UniProtKB: 56’392 + alt sequences (15’435) = 71’827 • UniProtKB/Swiss-Prot: 20’238 + alt sequences (15’435) = 35’673 • UniProtKB/TrEMBL: 36’154 92% of human entries are linked with at least one RefSeq entry…
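
The totals on this slide are each the entry count plus the roughly 15’435 alternative (isoform) sequences. The short check below reproduces that arithmetic from the figures on the slide; the dictionary labels are only descriptive names chosen here.

```python
ALT_SEQUENCES = 15_435  # alternative sequences quoted on the slide

# Entry counts from the slide (August 2011).
entry_counts = {
    "UniProtKB, all human": 110_056,
    "UniProtKB/Swiss-Prot, all human": 20_244,
    "UniProtKB, complete proteome": 56_392,
    "UniProtKB/Swiss-Prot, complete proteome": 20_238,
}

# Each total is the entry count plus the alternative sequences.
totals = {label: count + ALT_SEQUENCES for label, count in entry_counts.items()}

# Within the complete proteome set, Swiss-Prot plus TrEMBL gives the UniProtKB count.
complete_proteome_sum = 20_238 + 36_154
```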
  • 69. Do not hesitate to contact us ! help@uniprot.org
  • 70. The UniProt Consortium. SIB: Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue, Anne-Lise Veuthey. EBI: Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg. PIR: Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang. www.uniprot.org
  • 71. UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.
  • 73. Thank you for your attention http://education.expasy.org/cours/Prague2011/

Editor's Notes

  1. This Science cover clearly shows the well-known discrepancy between the amount of data and the amount of knowledge available. This is a first challenge, but there is a second one: how to link the two together?
  2. The mission of UniProt is to link the protein sequences (data) with the biological knowledge (functional information).
  3. The UniProt databases and web site are maintained by the UniProt consortium, which is composed of:
  4. Screen shot of the web page
  5. UniProt provides 4 databases, the central one being UniProtKB.
  6. UniProt provides 4 databases, the central one being UniProtKB.
  7. Computer prediction: if there is no other evidence from this protein or a similar protein, the keyword is not added.
  8. dbSNP is NOT in DR lines, so it is not included in the release notes statistics. Note: Replaces BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList
  9. 3 groups working together. Encyclopedia of protein function in biology and life science. Considered by the life science community as the GOLD standard in annotation practices. Over 600’000 users per month originating from 149 countries. Is it UniProt or Swiss-Prot? Used by life scientists (biologists, MDs), but also by chemists, nanotechnology engineers and bioinformaticians; used by the pharma and biotechnology industry.
  10. 3 groups working together. Encyclopedia of protein function in biology and life science. Considered by the life science community as the GOLD standard in annotation practices. Over 600’000 users per month originating from 149 countries. Is it UniProt or Swiss-Prot? Used by life scientists (biologists, MDs), but also by chemists, nanotechnology engineers and bioinformaticians; used by the pharma and biotechnology industry.
  11. Take home message
  12. a bit of this, a bit of that…