This document discusses biological databases. It begins by defining biological databases as large, organized bodies of persistent biological data that can be updated, queried and retrieved. It then provides examples of popular databases like GenBank, SwissProt and PIR. The document discusses the importance of databases and different types of biological databases, categorized by the content or nature of the data. Specifically, it describes primary and secondary nucleotide and protein sequence databases like GenBank, EMBL, DDBJ, SwissProt and PIR.
2. Biological Databases
A biological database is a large, organized body of persistent data, usually
associated with computerized software designed to update, query, and retrieve
components of the data stored within the system.
The chief objective of the development of a database is to organize data in a set of
structured records to enable easy retrieval of information.
Example. A few popular databases are GenBank from NCBI (National Center for
Biotechnology Information), SwissProt from the Swiss Institute
of Bioinformatics and PIR from the Protein Information Resource.
3. Importance of Databases
1. Databases act as a storehouse of information.
2. Databases are used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.
3. They facilitate the discovery of new biological insights from raw data.
4. Importance of Databases
4. Secondary databases have become the molecular biologist’s reference
library over the past decade or so, providing a wealth of information on just
about any gene or gene product that has been investigated by the research
community.
5. They allow many users to access the same data entries at the same
time.
6. They allow data to be indexed.
7. They help to remove redundancy in the data.
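The points on retrieval, indexing and redundancy can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, column names and the DEMO accessions are assumptions made up for this example; real biological databases use far richer records.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence_record ("
    " accession TEXT PRIMARY KEY,"  # unique key: one record per accession
    " organism TEXT NOT NULL,"
    " sequence TEXT NOT NULL)"
)
# An index lets searches by organism avoid scanning every record.
conn.execute("CREATE INDEX idx_organism ON sequence_record (organism)")

records = [
    ("DEMO0001", "Homo sapiens", "ATGGCC"),
    ("DEMO0002", "Mus musculus", "ATGGCA"),
    ("DEMO0001", "Homo sapiens", "ATGGCC"),  # redundant copy of the first record
]
# INSERT OR IGNORE skips rows whose accession is already stored.
conn.executemany("INSERT OR IGNORE INTO sequence_record VALUES (?, ?, ?)", records)


def find_by_organism(organism):
    """Retrieve (accession, sequence) pairs using one search criterion."""
    rows = conn.execute(
        "SELECT accession, sequence FROM sequence_record"
        " WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return rows.fetchall()


total = conn.execute("SELECT COUNT(*) FROM sequence_record").fetchone()[0]
print(total)                             # 2: the duplicate was not stored
print(find_by_organism("Homo sapiens"))  # [('DEMO0001', 'ATGGCC')]
```

The primary key is what removes the redundant record here; the index only speeds up retrieval by organism.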
5. Types of Biological Databases
1. Based on content of biological data
2. Based on the nature of data.
6. 1. Based on content of biological data
1. Primary databases
2. Secondary databases
7. 1. Primary databases
Primary databases are also called archival databases.
They are populated with experimentally derived data such as nucleotide
sequence, protein sequence or macromolecular structure.
Experimental results are submitted directly into the database by researchers, and
the data are essentially archival in nature.
Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.
8. 1. Primary databases
Examples
GenBank and DDBJ (nucleotide sequence)
Protein Data Bank (PDB; coordinates of three-dimensional macromolecular
structures)
9. 2. Secondary databases
Secondary databases comprise data derived from the results of analysing primary
data.
Secondary databases often draw upon information from numerous sources,
including other databases (primary and secondary), controlled vocabularies and
the scientific literature.
They are highly curated, often using a complex combination of computational
algorithms and manual analysis and interpretation to derive new knowledge from
the public record of science.
10. 2. Secondary databases
Examples
InterPro (protein families, motifs and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Ensembl (variation, function, regulation and more layered onto whole
genome sequences)
11. 2.Based on the nature of data
1. Structural database
2. Sequence database
i. Protein sequence databases
ii. Nucleic Acid sequence databases
12. 1.Structural databases
Structural databases contain three-dimensional structural information on
macromolecules, derived mainly from the analysis of diffraction and NMR data.
Ex. PDB, CATH and SCOP
13. PDB(Protein Data Bank)
www.rcsb.org/pdb/
The PDB was established in 1971 at Brookhaven National Laboratory on Long Island,
New York State, US.
In 1999, the management was moved to the Research Collaboratory for
Structural Bioinformatics (RCSB), a joint organisation whose members include
Rutgers University and the San Diego Supercomputer Center.
The PDB entries contain the atomic coordinates, and some structural parameters
connected with the atoms or computed from the structures (e.g. secondary structure).
14. PDB(Protein Data Bank)
The PDB entries contain some annotations, but these are not as comprehensive
as those in Swiss-Prot.
There are no legal restrictions on the use of the data in PDB.
The Protein Data Bank is an archive of experimentally determined three
dimensional structures (3D) of biological macromolecules, serving a global
community of researchers, educators, and students.
15. PDB(Protein Data Bank)
The archives contain atomic coordinates, bibliographic citations, primary
and secondary structure information as well as crystallographic structure
factors and NMR(Nuclear Magnetic Resonance) experimental data.
PDB is the main primary database for 3D structures of biological
macromolecules determined by X-Ray Crystallography and NMR.
16. PDB(Protein Data Bank)
Structural biologists usually deposit their structures in the PDB on
publication and some scientific journals require this before accepting a
paper.
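It also accepts the experimental data used to determine the structures (X-ray
crystallography and NMR). Homology models were accepted in the past, but since
2006 depositions have been restricted to experimentally determined structures.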
17. 2. Sequence databases
A sequence database is a type of biological database that is composed of a
large collection of nucleic acid sequences, protein sequences or other polymer
sequences stored on a computer. These include
I. Nucleotide databases
II. Protein databases
18. NCBI (National Center for Biotechnology Information)
www.ncbi.nlm.nih.gov
NCBI is a publicly available resource on the web. NCBI was established in November
1988 at the National Library of Medicine (NLM) in the United States.
The NLM was chosen because it had experience in creating and
maintaining biomedical databases and, as part of the National Institutes of
Health (NIH), it could establish a research program in computational
molecular biology.
19. NCBI (National Center for Biotechnology Information)
The mission of NCBI is to develop new information technologies to aid the understanding of
the fundamental molecular and genetic processes that control health and disease.
More specifically, NCBI has been charged with:
creating automated systems for storing and analysing knowledge about molecular biology,
biochemistry and genetics;
facilitating the use of such databases and software by the research and medical community;
coordinating efforts to gather biotechnology information both nationally and internationally;
performing research into advanced methods of computer-based information processing
for analysing the structure and function of biologically important molecules.
20. NCBI maintains several databases. They are as
follows
Literature databases
Entrez databases
Nucleotide databases
Genome specific resources
Tools for data mining
21. NCBI maintains several databases. They are as
follows
Tools for Sequence Analysis
Tools for 3D structure display and Similarity Searching
Maps
Resource Statistics
Collaborative Cancer Research
FTP (File Transfer Protocol)
22. 1.Nucleotide databases
The nucleotide database is a collection of sequences from several sources including
GenBank, RefSeq, etc.
I. PRIMARY DATABASES OF NUCLEOTIDE SEQUENCES:
These are the chief databases that store raw nucleic acid sequences and make them available to
the public and researchers. They are referred to as primary nucleotide sequence databases
since they are the repository of all the nucleic acid sequences.
Ex. GenBank, DDBJ, EMBL
23. 1. EMBL (European Molecular Biology Laboratory)
www.ebi.ac.uk
EMBL is the nucleotide sequence database from the EBI (European Bioinformatics
Institute).
The EBI manages databases of biological data including nucleic acid sequences,
protein sequences and macromolecular structures.
The EBI is a centre for research and services in bioinformatics and a pioneer of
novel bioinformatics research.
24. 1. EMBL (European Molecular Biology Laboratory)
The mission of the EBI is to ensure that the growing body of information from
molecular biology and genome research is placed in the public domain and is
freely accessible.
The database is produced in collaboration with DDBJ and GenBank.
Information can be retrieved from EMBL using the SRS (Sequence Retrieval
System); this links the principal DNA and protein sequence databases with
motif, structure, mapping and other specialist databases.
25. 1. EMBL (European Molecular Biology Laboratory)
SRS is one of the most powerful data browsing and retrieval tools available. SRS
provides rapid, user-friendly access to the large volumes of diverse and
heterogeneous life science data stored in more than 400 internal and public domain
databases.
It can be used to browse the various biological sequence and literature databases.
The EBI also provides access to many other tools for browsing and retrieving
biological sequence and literature data.
26. 2.DDBJ (DNA Data Bank of Japan)
www.ddbj.nig.ac.jp
DDBJ began in 1986 as a collaboration with EMBL and GenBank. The database
is produced, maintained and distributed at the National Institute of Genetics.
Sequences may be submitted to it from all corners of the world by means of a web
based data submission tool.
The Web is also used to provide standard search tools such as FASTA and BLAST.
27. 2.DDBJ (DNA Data Bank of Japan)
DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA
sequences from researchers and to issue internationally recognised accession numbers to
data submitters.
DDBJ is one of the three international DNA databases, alongside the EMBL database
(the responsibility of EBI) and GenBank (the responsibility of NCBI).
Consequently, DDBJ collaborates with the other two data banks by exchanging
data and information over the Internet, and by holding two meetings, the International DNA
Data Banks Advisory Meeting (IAM) and the International DNA Data Banks Collaborative
Meeting (ICM).
28. 3. GenBank
GenBank, the DNA database from NCBI, incorporates sequences from publicly
available sources.
Information can be retrieved from GenBank using the Entrez integrated
retrieval system; this combines data from the principal DNA and protein sequence
databases with information from genome maps and protein structures.
Additional information on sequences can be accessed via the MEDLINE facility,
which provides abstracts from the original published articles.
29. 3. GenBank
GenBank may be searched with the user query sequence by means of
NCBI’s web interface to the BLAST suite of programs.
A GenBank release includes the sequence files, indices created on various database
fields and information derived from the database (e.g. GenPept, a database of
translated coding sequences in FASTA format). The most commonly used is the
sequence entry file, which contains the sequence itself and descriptive
information relating to it.
30. 3. GenBank
A GenBank entry consists of keywords, relevant associated sub-keywords
and an optional Feature Table; its end is indicated by a // terminator.
After the Feature Table, the entry continues with a BASE COUNT record, which details the
frequency of occurrence of the different base types in the sequence.
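The entry layout described above (keywords, a sequence section, a // terminator) can be sketched with a small parser. This is a minimal illustration, not NCBI's own tooling: the sample entry is made up and heavily shortened, continuation lines and the Feature Table are ignored, and the base frequencies are computed from the sequence rather than read from a BASE COUNT record.

```python
from collections import Counter

# Made-up, heavily shortened entry for illustration only (not a real GenBank record).
SAMPLE_ENTRY = """\
LOCUS       DEMO0001                  12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical demonstration sequence.
ACCESSION   DEMO0001
ORIGIN
        1 atggccattg ta
//
"""


def parse_entry(text):
    """Return (fields, sequence) for one flat-file entry ending in '//'."""
    fields = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # terminator: end of the entry
            break
        if in_origin:
            # Sequence lines hold a position number followed by blocks of bases.
            sequence_parts.extend(part for part in line.split() if part.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # A keyword starts in the first column; indented lines are skipped here.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields, "".join(sequence_parts)


fields, sequence = parse_entry(SAMPLE_ENTRY)
base_count = Counter(sequence)  # the information a BASE COUNT record summarises
print(fields["ACCESSION"], sequence, dict(base_count))
```

For real records, an established parser such as the one in Biopython is a safer choice than hand-written code like this.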
31. 2.Secondary databases of nucleotide
sequences
Many of the secondary databases are simply sub-collections of sequences culled from one
or other of the primary databases such as GenBank or EMBL.
1. Omniome database
2. FlyBase database
3. ACeDB
32. 2.Secondary databases of nucleotide
sequences
1. Omniome database:
The Omniome database is a comprehensive microbial resource maintained by TIGR (The Institute for
Genomic Research).
It has not only the sequence and annotation of each of the completed genomes,
but also associated information about the organisms (such as taxon and Gram
stain pattern), the structure and composition of their DNA molecules and many
other attributes of the protein sequences predicted from the DNA sequences.
33. 2.Secondary databases of nucleotide
sequences
2. FlyBase database:
FlyBase holds genetic and genomic data on the fruit fly D. melanogaster, whose
entire genome was sequenced by a consortium to
a high degree of completeness and quality.
3. ACeDB:
It is a repository of not only the sequence but also the genetic map and
phenotypic information for the nematode worm C. elegans.
34. II. PROTEIN DATABASES:
A protein database is one or more datasets describing proteins’ amino acid
sequences, conformation, structure and features such as active sites.
1. Primary databases of proteins:
The primary databases hold protein sequences that were either determined
experimentally or inferred from the conceptual translation of nucleotide sequences.
35. 1.PIR (Protein Information Resource)
www.pir.georgetown.edu
The Protein Sequence Database was developed at the National Biomedical
Research Foundation (NBRF) in the US.
It is maintained in collaboration with the Martinsried Institute for Protein Sequences
(MIPS) and the Japan International Protein Information Database (JIPID).
PIR was developed by Margaret Dayhoff as a collection of sequences for
investigating evolutionary relationships among proteins.
36. 1.PIR (Protein Information Resource)
The PIR database is split into four distinct sections – PIR1 to PIR4 which
differ in terms of the quality of data, and level of annotation provided.
PIR 1 – contains fully classified and annotated entries
PIR 2 – includes preliminary entries which have not been thoroughly
reviewed and may contain redundancy
PIR 3 – contains unverified entries, which have not been reviewed
37. 1.PIR (Protein Information Resource)
PIR 4 entries fall into 4 categories :
1. Conceptual translations of artefactual sequences.
2. Conceptual translations of sequences that are not transcribed or translated.
3. Protein sequences or conceptual translations that are genetically engineered.
4. Sequences that are not genetically encoded and not produced on ribosomes.
One can search for entries or do sequence similarity searches at the PIR site. The database
can be downloaded as a set of files.
38. 2. SWISS PROT
www.expasy.ch/sprot/
Swiss-Prot is a protein sequence database established in 1986. It was
produced collaboratively by the Department of Medical Biochemistry at the
University of Geneva and the EMBL; after 1994, the collaboration moved to
EMBL’s UK outstation, the EBI.
In 1998, the collaboration moved to the Swiss Institute of
Bioinformatics (SIB). Hence, the database is now maintained collaboratively
by SIB and EBI/EMBL.
39. 2. SWISS PROT
Swiss-Prot strives to provide a high level of annotation (such as a description
of the function of a protein, its domain structure, post-translational
modifications and variants), a minimal level of redundancy and a high level of
integration with other databases.
In 1996, a computer-annotated supplement to Swiss-Prot was created,
termed TrEMBL.
40. 2. SWISS PROT
In SWISS PROT , as in many sequence databases, two classes of data can be
distinguished :
1. Core data : Core data consists of :
1. Sequence data
2. Citation information(bibliographic references)
3. Taxonomic data(description of the biological source of the protein)
41. 2. SWISS PROT
2. Annotation :
1. Function of protein
2. Post translational modifications
3. Domains and sites
4. Secondary structure
42. 2. SWISS PROT
2. Annotation :
5. Quaternary structure
6. Similarities to other proteins
7. Diseases associated with any number of deficiencies in the protein
8. Sequence conflicts, variants
43. 2. SWISS PROT
Sequence Entry File
Each line is flagged with a two letter code, which helps to present the
information in a structured way.
Entries begin with the identification(ID) line and end with a // terminator.
ID codes can sometimes change, so an additional identifier, an accession
number (AC), is also provided, which ought to remain static between
database releases.
44. 2. SWISS PROT
Sequence Entry File
Next, the DT lines provide information about the date of entry of the sequence
into the database and details of when it was last modified.
The following lines give the gene name(GN), the Organism Species(OS),
and the Organism Classification(OC) within the biological kingdoms.
45. 2. SWISS PROT
Sequence Entry File
CC: comment lines describe the function of the protein, post-translational
modifications, similarity and tissue specificity.
Database cross-reference (DR) lines follow the comment field. These provide links
to other biomolecular databases.
The DR lines are followed by keyword (KW) lines and then a number of FT lines.
46. 2. SWISS PROT
Sequence Entry File
The FT (Feature Table) lines highlight regions of interest in the
sequence, including secondary structure, ligand binding sites and post-translational
modifications.
The final section of the database entry contains the sequence (SQ) itself. The entry
ends with a // terminator.
SWISS PROT has become the most widely used protein sequence database in the world.
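Because every line of a sequence entry file starts with a two-letter code, an entry can be split into its sections with very little code. The sketch below assumes the content of each line starts at the sixth column and that the sequence lines follow the SQ line; the sample entry, including its ID and accession number, is made up for illustration.

```python
from collections import defaultdict

# Made-up, shortened entry for illustration only (not a real Swiss-Prot record).
SAMPLE_ENTRY = """\
ID   DEMO_HUMAN     Reviewed;     10 AA.
AC   X00001;
DT   01-JAN-2000, integrated into the database.
GN   Name=DEMO;
OS   Homo sapiens (Human).
CC   -!- FUNCTION: Hypothetical demonstration protein.
KW   Hypothetical.
FT   CHAIN         1     10       Demo protein.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter code."""
    lines_by_code = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # terminator: end of the entry
            break
        if in_sequence:
            # Sequence lines are indented and may contain spaces between blocks.
            sequence_parts.append(line.replace(" ", ""))
            continue
        code, content = line[:2], line[5:]
        lines_by_code[code].append(content)
        if code == "SQ":
            in_sequence = True
    return dict(lines_by_code), "".join(sequence_parts)


entry, sequence = parse_entry(SAMPLE_ENTRY)
accession = entry["AC"][0].rstrip(";")
print(accession, entry["OS"][0], sequence)
```

Keeping a list per code matters because codes such as CC, DR and FT usually occur on many lines within one entry.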
47. 3. PubMed
PubMed is a free resource supporting the search and retrieval of biomedical and
life sciences literature with the aim of improving health–both globally and
personally.
1. The PubMed database contains more than 33 million citations and abstracts of
biomedical literature.
2. It does not include full text journal articles; however, links to the full text are
often present when available from other sources, such as the publisher's website
or PubMed Central (PMC).
48. 3. PubMed
3. It has been available to the public online since 1996.
4. PubMed was developed and is maintained by the National Center for
Biotechnology Information (NCBI), at the U.S. National Library of Medicine
(NLM), located at the National Institutes of Health (NIH).
5. Citations in PubMed primarily stem from the biomedicine and health fields, and
related disciplines such as life sciences, behavioural sciences, chemical sciences, and
bioengineering.
49. 3. PubMed
PubMed facilitates searching across several NLM literature resources:
1. MEDLINE 2. PubMed Central (PMC) 3. Bookshelf
1. MEDLINE
MEDLINE is the largest component of PubMed and consists primarily of
citations from journals selected for MEDLINE; its articles are indexed with MeSH
(Medical Subject Headings) and curated with funding, genetic, chemical and
other metadata.
50. 3. PubMed
2. PubMed Central (PMC)
Citations for PubMed Central (PMC) articles make up the second largest
component of PubMed.
PMC is a full text archive that includes articles from journals reviewed and
selected by NLM for archiving (current and historical), as well as individual
articles collected for archiving in compliance with funder policies.
51. 3. PubMed
3. Bookshelf
The final component of PubMed is citations for books and some individual
chapters available on Bookshelf.
Bookshelf is a full text archive of books, reports, databases, and other
documents related to biomedical, health, and life sciences.
52. 2. Secondary databases of proteins
The secondary databases are so termed because they contain the results of analysis of the sequences held in
primary databases.
1. PROSITE:
Some databases collect together patterns found in protein sequences rather than the complete
sequences.
PROSITE is one such pattern database.
Protein motifs and patterns are encoded as regular expressions.
Each entry in PROSITE has two parts: the pattern and the related
descriptive text.
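Since PROSITE patterns are regular expressions written in their own notation, they can be translated into the syntax of a general-purpose regular expression engine. The sketch below is a simplified translation that covers the common elements only (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition, < and > for the sequence ends); the function name and the test sequence are made up for this example. The pattern used is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                         # patterns end with a period
    regex = regex.replace("-", "")                      # '-' separates pattern elements
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")   # (2,4) = repeat 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")   # anchors to the sequence ends
    return regex


glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}."))
sequence = "MKNGSAANPTQ"  # made-up sequence for illustration
hits = [match.start() for match in glyco.finditer(sequence)]
print(glyco.pattern, hits)  # N[^P][ST][^P] [2]
```

Only the site starting at index 2 (NGSA) is reported; the second N is followed by P, which the pattern excludes. Note that finditer returns non-overlapping matches, so overlapping sites would need a lookahead-based search.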
53. 2. Secondary databases of proteins
2. PRINTS:
In the PRINTS database, protein sequence patterns are stored as “fingerprints”, sets of motifs that
together characterise a protein family. Each entry contains:
1. A first section with cross-links to other databases that have more information about the
characterised family.
2. A second section with a table showing how many of the motifs that make up the fingerprint occur
in how many of the sequences of that family.
3. A last section containing the actual fingerprints, stored as multiple aligned sets of
sequences; the alignments are made without gaps.
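Because the motifs are aligned without gaps, every column of a motif lines up with exactly one sequence position, which makes it easy to turn a motif into a table of residue counts per column. The sketch below shows one simple way to use such a table to scan a sequence; it is an illustration of the idea, not the scoring scheme PRINTS itself uses, and the motif and query sequence are made up.

```python
from collections import Counter

# Made-up ungapped motif alignment: every row has the same length.
MOTIF = ["GDSGG", "GDSGA", "GESGG", "GDTGG"]


def column_frequencies(aligned):
    """Count how often each residue occurs in each alignment column."""
    width = len(aligned[0])
    assert all(len(row) == width for row in aligned), "alignment must be ungapped"
    return [Counter(row[i] for row in aligned) for i in range(width)]


def best_window(sequence, freqs):
    """Slide the motif along the sequence; a window's score is the sum of column counts."""
    width = len(freqs)
    scores = [
        (sum(freqs[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)


freqs = column_frequencies(MOTIF)
score, start = best_window("AAGDSGGPLV", freqs)
print(score, start)  # 17 2: the window GDSGG starting at index 2
```

Raw counts are used here for simplicity; a real scan would normally convert them to weighted or log-odds scores.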
54. 2. Secondary databases of proteins
3. Pfam:
Pfam contains profiles built using hidden Markov models (HMMs).
An HMM models the pattern as a series of match, substitute,
insert or delete states, with scores assigned for the alignment to go from one state
to another.
55. 2. Secondary databases of proteins
4. TrEMBL:
TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated
supplement to Swiss-Prot.
It contains translations of all the coding sequences (CDS) in EMBL.
TrEMBL was designed to address the need for a well-structured, Swiss-Prot-like
resource that would allow very rapid access to sequence data from the genome
projects.