Biological Databases
• Primary databases
• Secondary databases
• Specialized databases
Biological Databases
Primary Databases: contain original biological data
(raw nucleic acid sequence data produced and submitted
by researchers worldwide)
-GenBank (https://www.ncbi.nlm.nih.gov/genbank/)
-EMBL (European Molecular Biology Laboratory)
(https://www.ebi.ac.uk)
-DDBJ (DNA Data Bank of Japan)
(https://www.ddbj.nig.ac.jp/index-e.html)
Secondary Databases: contain computationally processed
sequence information derived from the primary databases
-SWISS-PROT: provides detailed sequence annotation that
includes structure, function, and protein family assignment
-TrEMBL: a database of translated nucleic acid sequences stored
in the EMBL database
-UniProt: (SWISS-PROT + TrEMBL + PIR), which has larger
coverage than any one of the three databases
-Pfam and Blocks: aligned protein sequence information, motifs,
patterns
-DALI: protein secondary structure database that is vital for protein
structure classification and threading analysis
Major Biological Databases Available Via the
World Wide Web
-SWISS-PROT: Curated protein sequence database (www.ebi.ac.uk/swissprot/access.html)
-AceDB: Genome database for Caenorhabditis elegans (www.acedb.org)
-DDBJ: Primary nucleotide sequence database in Japan (www.ddbj.nig.ac.jp)
-EMBL: Primary nucleotide sequence database in Europe (www.ebi.ac.uk/embl/index.html)
-Entrez: NCBI portal for a variety of biological databases (www.ncbi.nlm.nih.gov/gquery/gquery.fcgi)
-ExPASy: Proteomics database (http://us.expasy.org/)
-FlyBase: A database of the Drosophila genome (http://flybase.bio.indiana.edu/)
-FSSP: Protein secondary structures (www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
-GenBank: Primary nucleotide sequence database in NCBI (www.ncbi.nlm.nih.gov/Genbank)
-HIV Databases: HIV sequence data and related immunologic information (www.hiv.lanl.gov/content/index)
-Microarray gene expression database: DNA microarray data and analysis tools (www.ebi.ac.uk/microarray)
-OMIM: Genetic information of human diseases (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)
-SRS: General sequence retrieval system (http://srs6.ebi.ac.uk)
-PubMed: Biomedical literature information (www.ncbi.nlm.nih.gov/PubMed)
-TAIR: Arabidopsis information database (www.arabidopsis.org)
Entrez
NCBI advanced search builder
Main file formats used in Bioinformatics
•GenBank/GenPept
•ASN.1
•EMBL, Swiss Prot
•FASTA
•GCG
•PHYLIP
•PIR
Format conversion: EMBOSS Seqret
(https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/)
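Of the formats listed above, FASTA is the simplest: a header line starting with ">" followed by one or more lines of sequence. The sketch below reads and writes that layout in plain Python. The function names and the example records are hypothetical and only illustrate the format; this is not how EMBOSS Seqret is implemented.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Write records back as FASTA, wrapping sequence lines at `width`."""
    out = []
    for header, seq in records.items():
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Hypothetical example records, not taken from any database.
example = """>seq1 hypothetical example record
MKTAYIAKQR
QISFVKSHFS
>seq2 another hypothetical record
GATTACA
"""

records = parse_fasta(example.splitlines())
print(records["seq2 another hypothetical record"])
print(to_fasta(records, width=10))
```

Converting to another format would replace only the writer function; the in-memory header and sequence stay the same.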
Algorithms for pairwise alignments
Web resources
• LALIGN - pairwise sequence alignment
• Global alignment: Needle (EMBOSS): https://www.ebi.ac.uk/Tools/psa/
• Local alignment: Water (EMBOSS): https://www.ebi.ac.uk/Tools/psa/
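Needle implements the Needleman-Wunsch global algorithm and Water the Smith-Waterman local algorithm. The sketch below computes only the optimal score for each, using dynamic programming. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for readability; the EMBOSS tools use substitution matrices and affine gap penalties instead.

```python
def align_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                gap: int = -2, local: bool = False) -> int:
    """Optimal alignment score: global (Needleman-Wunsch) by default,
    local (Smith-Waterman) when local=True. Linear gap penalty (assumption)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(align_score("GATTACA", "GATTACA"))                  # global, identical sequences
print(align_score("TTTACGTTT", "GGGACGGGG", local=True))  # local, shared core "ACG"
```

The only differences between the two modes are the initialisation of the first row and column and the floor at zero, which is why the same function can serve both.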
BLAST (Basic Local Alignment Search Tool)
(www.ncbi.nlm.nih.gov/BLAST/)
• The BLAST program was developed by Stephen Altschul and colleagues at
NCBI in 1990 and has since become one of the most popular programs for
sequence analysis
• BLAST uses heuristics to align a query sequence with all sequences in a
database
• The objective is to find high-scoring ungapped segments among related
sequences
• The existence of such segments above a given threshold indicates pairwise
similarity beyond random chance, which helps to discriminate related
sequences from unrelated sequences in a database
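The heuristic described on this slide can be illustrated with a toy seed-and-extend search: index short words in the subject, look up the query's words, extend each exact hit without gaps, and keep segments that score above a threshold. This is a simplified sketch, not the NCBI implementation. The word size, scores, drop-off value, and threshold are assumptions, and real BLAST adds substitution matrices, neighbourhood words, and E-value statistics.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every word of length w in seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend_ungapped(query, subject, qpos, spos, w,
                    match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit in both directions without gaps.
    Stops once the score falls more than x_drop below the best seen."""
    best = score = w * match
    best_right, k = w, w
    while qpos + k < len(query) and spos + k < len(subject):
        score += match if query[qpos + k] == subject[spos + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break
    score, best_left, k = best, 0, 1
    while qpos - k >= 0 and spos - k >= 0:
        score += match if query[qpos - k] == subject[spos - k] else mismatch
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break
        k += 1
    return best, qpos - best_left, qpos + best_right


def find_hsps(query, subject, w=3, threshold=6):
    """Return (score, query_start, query_end, subject_start) tuples
    for ungapped segments scoring at least `threshold`."""
    index = index_words(subject, w)
    hits = set()
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            score, q_start, q_end = extend_ungapped(query, subject, q, s, w)
            if score >= threshold:
                hits.add((score, q_start, q_end, s - (q - q_start)))
    return sorted(hits, reverse=True)


# Hypothetical toy sequences.
hits = find_hsps("ACGTACGT", "TTACGTACGTTT")
print(hits)
```

Because only positions that share an exact word are ever extended, most of the database is never aligned at all, which is where the speed advantage over full dynamic programming comes from.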
NCBI search for spike surface glycoprotein
BLAST Search against Protein Data Bank
Low Complexity Regions
PSI-BLAST Contd.
• An iterative search in which sequences found in one round of
searching are used to build a score model for the next round of
searching
• An important tool for predicting both biochemical activity and
function
• Identifies weak homologies (distant relatives of a protein) that are
not found by FASTA or BLAST.
Information:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html
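The "score model" built between PSI-BLAST rounds is a position-specific profile. The sketch below builds a simple log-odds profile from ungapped, equal-length hits and uses it to score new candidates. It makes several assumptions that PSI-BLAST itself does not: a uniform background frequency, a fixed pseudocount, no sequence weighting, and no gaps. The example sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds profile (one dict per column) from aligned sequences.
    Assumes equal lengths, no gaps, and a uniform background (simplification)."""
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical hits from a first search round.
round_one_hits = ["DRHNSNI", "DRHSDNL", "DRHNDNI"]
pssm = build_pssm(round_one_hits)

# Candidates for the next round are ranked by the profile, not by the query.
for candidate in ["DRHNDNL", "WWWWWWW"]:
    print(candidate, round(score_segment(pssm, candidate), 2))
```

A candidate that matches the conserved columns scores well even if it differs from every individual hit, which is how the iteration reaches more distant relatives.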
BLAST QuickStart
Example-Driven Web-Based BLAST Tutorial
https://www.ncbi.nlm.nih.gov/books/NBK1734/
NCBI BLAST tutorial – YouTube
https://www.youtube.com/watch?v=HXEpBnUbAMo
NCBI PSI-BLAST Tutorial - YouTube
https://www.youtube.com/watch?v=T3kHEieyylk
MEME suite 5.1.1: http://meme-suite.org/tools/meme
PROSITE: https://prosite.expasy.org
ScanProsite: https://prosite.expasy.org/scanprosite/
Patterns can be generated from multiple sequences using PRATT
https://web.expasy.org/pratt/
PRATT - ExPASy
PHI-BLAST
PROSITE pattern for the kinase active site, starting from the conserved DRH and
making use of the highly conserved DFG region:
D-R-H-[NS]-[DS]-N-[IL]-x-[IV]-x-[DEK]-[DGST]-G-[NQR]-L-F-H-I-D-F-G
The above query sequence and the PROSITE pattern are used as inputs for the
PHI-BLAST search (see next slide)
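To see what the pattern above matches, it can be translated into a regular expression: "x" means any residue, square brackets list allowed residues, curly braces list excluded residues, and a number in parentheses gives a repeat count. The converter below is a minimal sketch covering those elements plus the "<" and ">" anchors. The function name and the test sequence are hypothetical, and this is not the ScanProsite or PHI-BLAST implementation.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        element = element.strip()
        repeat = ""
        if "(" in element:                      # e.g. x(2) or x(2,4)
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        prefix = suffix = ""
        if element.startswith("<"):             # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):               # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # excluded residues
        else:
            core = element                      # single residue or [ABC]
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)


KINASE = "D-R-H-[NS]-[DS]-N-[IL]-x-[IV]-x-[DEK]-[DGST]-G-[NQR]-L-F-H-I-D-F-G"
kinase_regex = prosite_to_regex(KINASE)
print(kinase_regex)

# Hypothetical sequence constructed to contain one match.
sequence = "MKDRHNDNIAIADDGNLFHIDFGAA"
match = re.search(kinase_regex, sequence)
print(match.start(), match.group())
```

PHI-BLAST uses the pattern as a required anchor and then runs a profile search around it, so a regular-expression scan like this covers only the first half of what it does.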
PHI-BLAST

Hands on training_biological_databases.ppt
