3. What we expect from a database
• Sequence, functional, structural information,
related bibliography
• Well Structured and Indexed
• Well cross-referenced (with other databases)
• Periodically updated
• Tools for analysis and visualization
8. Standard contents of a sequence
database
• Sequences
• Accession number
• References
• Taxonomic data
• Annotation/curation
• Keywords
• Cross-references
• Documentation
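The fields listed above can be pictured as a single record type. The following is a minimal sketch, not any real database's schema: the class name `SequenceRecord`, its field names, and all sample values are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequenceRecord:
    """One entry in a hypothetical sequence database (illustrative only)."""
    accession: str                                            # stable identifier for the entry
    sequence: str                                             # nucleotide or protein residues
    taxonomy: List[str] = field(default_factory=list)         # lineage, broad to specific
    references: List[str] = field(default_factory=list)       # literature citations
    annotation: str = ""                                      # curated description
    keywords: List[str] = field(default_factory=list)
    cross_references: Dict[str, str] = field(default_factory=dict)  # other database -> ID there


# Placeholder values, not a real database entry.
record = SequenceRecord(
    accession="EXAMPLE0001",
    sequence="ATGGCCTAA",
    taxonomy=["Eukaryota", "Metazoa"],
    keywords=["demo"],
    cross_references={"OtherDB": "ID123"},
)
print(record.accession, len(record.sequence))
```

Documentation is left out of the record because it normally describes the database as a whole, not a single entry.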
9. NCBI
• Very comprehensive biological database
• GenBank: the nucleotide sequence database
• Provides 42 different resources
• Provides a simple and easy to use web
interface
http://www.ncbi.nlm.nih.gov/
10. • Sequence submission: done using BankIt or
Sequin
• Search Engine for data retrieval: Entrez
• Retrieves information across all the resources
under NCBI
Examples: PubMed, Taxonomy, SNP, PubChem,
etc.
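Entrez can also be queried from a script through NCBI's E-utilities web service. The sketch below only builds an ESearch URL and makes no network request. The helper name `build_esearch_url` and the search term are illustrative assumptions; `db`, `term`, and `retmax` are standard ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the ESearch endpoint of NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Return an ESearch URL for one Entrez database; nothing is fetched here."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "?" + urlencode(params)


# Same query shape works for other Entrez databases by changing `db`.
url = build_esearch_url("pubmed", "BRCA1")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return an XML list of matching record identifiers.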
11. Tools for analysis
• BLAST
• Primer-BLAST
• BLink
• ORF Finder
• Genome Workbench
16. Pfam
• Proteins contain conserved regions
• Based on the conserved regions, proteins are
classified into families
• Provides links to external databases such as PDB,
SCOP, and CATH
17. Pfam: Features
• Sequence search
• View Pfam family
• View a clan
• View a sequence
• View a structure
• Keyword search
18. Gene Indices
• Project aimed at indexing genes and their
variants in the various genome sequences.
• Creating a catalogue of genes in a wide range
of organisms
• Reduce redundancy
21. • PDB – Protein Data Bank
• CATH
• SCOP – Structural Classification of Proteins
22. wwPDB
• Contains information about experimentally
determined structures of proteins, nucleic
acids, and complex assemblies
• RCSB-PDB, PDBe, PDBj, BMRB – repositories of
protein structure data
• Files in PDB, mmCIF, PDBML/XML formats
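Of the three formats, the legacy PDB format is a fixed-width text format: each field of an ATOM record sits in set columns. The sketch below reads a few of those fields by slicing. The names `Atom` and `parse_atom_line` are hypothetical, and the sample line is assembled by hand with made-up values, not taken from a real entry.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM record (slices are 0-based, spec columns are 1-based)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Illustrative record built piece by piece so the column widths are explicit.
SAMPLE_LINE = (
    "ATOM  "     # record name, columns 1-6
    "    1"      # atom serial number
    " "
    " N  "       # atom name
    " "
    "MET"        # residue name
    " "
    "A"          # chain identifier
    "   1"       # residue sequence number
    "    "
    "  27.340"   # x coordinate
    "  24.430"   # y coordinate
    "   2.614"   # z coordinate
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom)
```

For real work an established parser is the safer choice, since actual files contain many other record types and edge cases that this sketch ignores.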
23. • Advanced search – provides comprehensive
information about a protein.
• Covers sequence, domain, sequence similarity,
and literature information in addition to the
details of the structure.
• Cross-referenced to SCOP and CATH
24. CATH
• Classification of proteins based on domain
structures
• Each protein chopped into individual domains
and assigned into homologous superfamilies.
• Hierarchical domain classification of PDB
entries.
25. CATH hierarchy
• Class – derived from secondary structure content; assigned
automatically
• Architecture – describes gross orientation of secondary
structures, independent of connectivity
• Topology – clusters structures according to their
topological connections and numbers of secondary
structures
• Homologous superfamily – this level groups
together protein domains which are thought to
share a common ancestor and can therefore be
described as homologous
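The four levels above are commonly written as a dotted numeric code, one number per level. The sketch below splits such a code into its levels. The function name `split_cath_code` is hypothetical, the code `3.40.50.720` is used only to illustrate the format, and the class-name table is an assumption about the usual CATH numbering.

```python
from typing import Dict

# Assumed names for the main CATH classes (first number of the code).
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


def split_cath_code(code: str) -> Dict[str, str]:
    """Split a C.A.T.H code into the identifier for each level of the hierarchy."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    c, a, t, _h = parts
    return {
        "class": c,
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": code,
    }


levels = split_cath_code("3.40.50.720")
print(levels)
print(CLASS_NAMES[int(levels["class"])])
```

Each level's identifier includes its parents, so two domains share a topology only when the first three numbers match.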
26. SCOP
• Description of structural and evolutionary
relationships between all the proteins with
known structures
• Uses the PDB entries
• Search using keywords or PDB identifiers
The databases exchange data every day.
Each database has its own sequence submission and retrieval tools.
They follow a standardized annotation scheme.
The Collaboration created a Feature Table Definition that outlines legal features and syntax.
Currently, NCBI receives and processes about 20,000 direct-submission sequences per month, in addition to the approximately 200,000 bulk submissions that are processed automatically.
Collaboration with EMBL and DDBJ.
The database continues to grow at an exponential rate,
doubling in size every 10 months.
It holds sequences from 250,000 distinct organisms.
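To get a feel for what doubling every 10 months means, the growth factor after t months is 2^(t/10). The short sketch below evaluates that expression. The function name `growth_factor` is illustrative, and the calculation assumes the doubling time stays constant, which the notes do not guarantee.

```python
def growth_factor(months: float, doubling_months: float = 10.0) -> float:
    """Factor by which the database grows over `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


print(growth_factor(10))             # one doubling period
print(growth_factor(30))             # three doubling periods
print(round(growth_factor(12), 2))   # growth over one year
```

Under this assumption the database more than doubles each year and grows eightfold in two and a half years.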
All tools can be downloaded and run standalone on a local workstation.
The ultimate goal of this project is to represent a non-redundant view of all human genes and data on their expression patterns, cellular roles, functions, and evolutionary relationships. The database will also include links to genomic sequences, mapping data, 3D structures, and literature references.