GenBank, EMBL, and DDBJ are primary nucleotide sequence databases that collaborate to store publicly available DNA sequences. NCBI's GenBank is one of the largest primary sequence databases, containing sequences for more than 240,000 named organisms, submitted by individual laboratories and large-scale sequencing projects. PubMed and Entrez are NCBI services: PubMed searches the biomedical literature, and Entrez integrates related data from multiple molecular and literature databases. SRS is a sequence retrieval system developed by EBI that indexes more than 250 molecular biology databanks and allows complex queries across data sources.
The National Center for Biotechnology Information
2. What is a database?
• A database is a convenient system for storing, searching, and
retrieving any type of data.
• A database makes it easy to handle and share large amounts of data
and supports large-scale analysis through easy access and updating.
3. What is a biological database?
• Biological databases are libraries of life sciences information
collected from scientific experiments, published literature,
high-throughput experimental technology, and computational analysis.
• They contain information from genomics, proteomics, and microarray
gene expression studies.
• Information contained in biological databases includes gene
function, structure, localization (both cellular and
chromosomal), and biological sequences and structures.
4. Database Architecture
An information system has three layers:
• Query system: Google, Entrez, SRS (input: your search keywords)
• Storage system: Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
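The three layers above can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for the storage systems named on the slide, the table layout and the keyword_search helper are hypothetical, and the DEMO records are made up.

```python
import sqlite3

# Storage system: an in-memory SQLite database (a stand-in for Oracle/MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (accession TEXT PRIMARY KEY, kind TEXT, title TEXT)")

# Data: one title taken from the EMBL example in this deck, two made-up records.
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("X64011", "flat file", "L.ivanovii sod gene for superoxide dismutase"),
        ("DEMO_PDB", "PDB file", "hypothetical superoxide dismutase structure"),
        ("DEMO_BOOK", "book", "hypothetical textbook on bioinformatics"),
    ],
)


def keyword_search(connection, keywords):
    """Query system: return accessions whose title contains every keyword."""
    if not keywords:
        return []
    clauses = " AND ".join("title LIKE ?" for _ in keywords)
    params = [f"%{word}%" for word in keywords]
    sql = f"SELECT accession FROM records WHERE {clauses} ORDER BY accession"
    return [row[0] for row in connection.execute(sql, params)]


gene_hits = keyword_search(conn, ["superoxide", "gene"])
all_sod_hits = keyword_search(conn, ["superoxide"])
```

The user only ever supplies keywords; where and how the records are stored stays hidden behind the query function.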
5. A Sequence Retrieving and Manipulation Network
• Databases
DNA: NCBI-GenBank, DDBJ, EBI-EMBL
Protein: PIR, SWISS-PROT, ExPASy, PDB
• Software: GCG, SeqWEB, Vector NTI, GenoMAX
• Retrieval systems: Entrez, SRS
• Information: sequence, PDB, image
• Formats: GenBank, GCG, FASTA, Staden
• Image and sequence converter
7. Primary databases
These are the primary sources of data, used to store nucleic acid sequences, protein sequences, and
structural information on biological macromolecules.
Some primary databases:
• NCBI (The National Center for Biotechnology Information)
• GenBank
• DDBJ (DNA Data Bank of Japan)
• SWISS-PROT (Swiss-Prot)
• PIR (Protein Information Resource)
• PDB (Protein Data Bank)
The sequence collections in these databases result from the efforts of basic research in
academic, industrial, and sequencing laboratories.
8. GenBank/EMBL/DDBJ: International Nucleotide Sequence Database
IAM: International Advisory Meeting
ICM: International Collaborative Meeting
EMBL: European Molecular Biology Laboratory
EBI: European Bioinformatics Institute
DDBJ: DNA Data Bank of Japan
CIB: Center for Information Biology and DNA Data Bank of Japan
NIG: National Institute of Genetics
NCBI: National Center for Biotechnology Information
NLM: National Library of Medicine
9. Secondary Databases
• A secondary database contains additional information derived from the analysis
of data available in primary sources.
• The data in secondary databases are analysed in a variety of ways and are presented as different
information in different formats.
• Some secondary databases:
• TrEMBL
• Pfam
• PROSITE
• Profiles
• SCOP
• CATH
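10. Primary vs. Secondary Sequence Databases
[Figure: sequence fragments (e.g. TATAGCCG) flow from sequencing centers and labs into the primary database, and from there through algorithms, curators, and genome assembly into derived databases.]
• Primary databases: sequences submitted by sequencing centers and labs; updated ONLY by submitters.
• Secondary databases: UniGene (algorithms), RefSeq (curators), genome assembly; updated continually by NCBI.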
11. Flat File Storage Data Formats
• When GenBank, EMBL and DDBJ formed a collaboration (1986), sequence
databases had moved to a defined flat file format with a shared feature
table format and annotation standards.
• The flat file formats from the sequence databases are still used to access
and display sequence and annotation. They are also convenient for storage
of local copies.
16. The National Center for Biotechnology Information
Created in 1988 as part of the National Library of Medicine at NIH (Bethesda, MD)
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
17. NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
• PubMed: free MEDLINE (3 million searches per day)
• PubMed Central: full-text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest-volume sequence search service
(100–200 K searches per day)
• VAST: structure similarity searches
• Software and databases
18. GenBank (Genetic Sequence Databank)
• GenBank® is the genetic sequence database at the National
Center for Biotechnology Information (NCBI).
• It was established in 1982 and is now maintained by the
National Center for Biotechnology Information (NCBI).
• DNA sequences can be submitted to GenBank using several
different methods.
• It contains publicly available nucleotide sequences for more than
240 000 named organisms, obtained primarily through
submissions from individual laboratories and batch submissions
from large-scale sequencing projects.
19. • It has a flat file structure: each entry is an ASCII text file,
readable and downloadable by both humans and
computers.
• There are two main ways of making batch sequence
submissions to GenBank: NCBI's Barcode
Submission Tool (BarSTool) and Sequin.
22. EMBL
• The European Molecular Biology Laboratory (EMBL) is a molecular biology research
institution supported by 22 member states, four prospect and two associate member
states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public
research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and
outstations in Hinxton (the European Bioinformatics Institute (EBI), in England),
Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and
molecular medicine as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained
at the European Bioinformatics Institute (EBI).
23. • It incorporates and distributes nucleotide sequences from
public sources.
• The database is a part of an international collaboration with DDBJ
(Japan) and GenBank (USA).
• Data are exchanged between the collaborating databases on a
daily basis.
• The web-based tool, Webin, is the preferred system for individual
submission of nucleotide sequences, including Third Party
Annotation (TPA) and alignment data.
24. • Automatic submission procedures are used for data
from large-scale genome sequencing.
• The latest data collection can be accessed via FTP, email and
WWW interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links
the main nucleotide and protein databases as well as many other
specialist molecular biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA
and BLAST) are available that allow external users to compare
their own sequences against the data in the EMBL Nucleotide
Sequence Database and other databases.
• All available resources can be accessed via the EBI home page at
32. EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
33. EMBL format (continued)
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
     cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta       360
     ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca       420
     atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg       480
     gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt       540
34. ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
- (blanks) sequence data.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single
entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
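Because every line starts with a two-letter code, an entry can be read with very little code. The sketch below is a minimal illustration, not a full EMBL parser: the function name parse_embl_entry is hypothetical, the excerpt is a shortened copy of the LISOD entry (only the first 60 bases), and a real project would more likely use an established library.

```python
from collections import defaultdict


def parse_embl_entry(text):
    """Group the lines of one EMBL-style entry by their two-letter line code.

    Returns (fields, sequence): fields maps a line code to the list of its
    line contents, sequence is the concatenated sequence data.
    """
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):      # termination line ends the entry
            break
        if in_sequence:                # blank-prefixed sequence data lines
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, rest = line[:2], line[2:].strip()
        if code == "XX":               # spacer line, carries no data
            continue
        fields[code].append(rest)
        if code == "SQ":               # everything after SQ is sequence data
            in_sequence = True
    return dict(fields), "".join(sequence_parts)


# Shortened excerpt of the LISOD entry; the sequence is truncated to 60 bases.
EXAMPLE = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
//
"""

fields, sequence = parse_embl_entry(EXAMPLE)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
```

Line types that occur several times in one entry (for example OC or FT) end up as several items in the same list, which matches the note above that some line types occur many times.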
35. PubMed
• PubMed is a free search engine that primarily accesses
the MEDLINE database of references and abstracts on
life sciences and biomedical topics.
• The PubMed system was offered free to the public in
1997.
• The United States National Library of Medicine (NLM)
at the National Institutes of Health maintains the
database as part of the Entrez system of information retrieval.
• PMID is the unique identifier number used in PubMed.
36. • A PMID is assigned to each article record when it enters the
PubMed system.
• The PMID# is always found at the end of a PubMed
citation.
• PubMed Central (PMC) is a free digital system that
archives publicly accessible full-text scholarly articles that
have been published within the biomedical and life
sciences journal literature.
• A "PubMed Mobile" option, providing access to a mobile
44. Entrez
• WWW-based data retrieval system.
• Developed by NCBI (National Center for Biotechnology
Information).
• Integrates information held in different databases.
45. Databases covered by Entrez
• Nucleic acid sequences: GenBank, RefSeq, PDB
• Protein sequences: SWISS-PROT, PIR
• 3D structures: MMDB
• Genomes: many sources
• PopSet: from GenBank
• OMIM: OMIM
• Taxonomy: NCBI taxonomy database
• Books: Bookshelf
• ProbeSet: GEO (Gene Expression Omnibus)
• Literature: PubMed
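Entrez can also be queried from a program through NCBI's E-utilities web interface, which the slides do not cover. The sketch below only builds a search URL and does not contact NCBI; the helper name build_esearch_url and the search term are illustrative assumptions, while db, term, and retmax are standard esearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (the programmatic interface to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an esearch URL for one Entrez database.

    db     -- Entrez database name, e.g. "pubmed", "nucleotide", "protein"
    term   -- the search keywords
    retmax -- maximum number of identifiers to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Same keywords, two different Entrez databases.
literature_url = build_esearch_url("pubmed", "superoxide dismutase Listeria", retmax=5)
sequence_url = build_esearch_url("nucleotide", "superoxide dismutase Listeria", retmax=5)
```

Changing only the db argument sends the same keywords to a different database, which is the integration idea behind Entrez.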
54. SRS
• SRS is the Sequence Retrieval System
• Data retrieval tool developed by EBI
• Integrates 80 molecular biology databases
• Open-source software (can be installed locally)
• SRS has an associated scripting language called Icarus
• Central resource for molecular biology data
• More than 250 databanks have been indexed, and more than 35 SRS
servers are available worldwide over the WWW
55. • Information retrieval
• Easy way to retrieve information from sequence and sequence-related
databases
• Possibility to search for multiple words/other criteria
• Linkage between different databases
• E.g. find all primary structures with known three-dimensional structures
• Different types of database in SRS
• Sequence & structure
• DNA, protein, three-dimensional structures
• Sequence-related
• Gene-related
• Genome, mapping, mutations, transcription factors
• SNP
• Bibliographic
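Linking between databases amounts to following cross-references from one entry to another. The sketch below illustrates the "sequences with a known three-dimensional structure" query on made-up data; all identifiers and the linked_entries helper are hypothetical, and this is not how SRS itself is implemented.

```python
# Hypothetical sequence database: each entry lists (database, identifier) cross-references.
SEQUENCES = {
    "SEQ_A": {"xrefs": [("PDB", "STRUCT_1"), ("EMBL", "NUC_1")]},
    "SEQ_B": {"xrefs": [("EMBL", "NUC_2")]},
    "SEQ_C": {"xrefs": [("PDB", "STRUCT_2"), ("PDB", "STRUCT_3")]},
    "SEQ_D": {"xrefs": [("PDB", "STRUCT_9")]},  # points to a structure we do not hold
}

# Hypothetical structure database, keyed by structure identifier.
STRUCTURES = {
    "STRUCT_1": {"method": "X-ray"},
    "STRUCT_2": {"method": "NMR"},
    "STRUCT_3": {"method": "X-ray"},
}


def linked_entries(sequences, structures, target_db="PDB"):
    """Return {sequence id: [structure ids]} for sequences linked to a known structure."""
    hits = {}
    for seq_id, entry in sequences.items():
        linked = [
            ref_id
            for db, ref_id in entry["xrefs"]
            if db == target_db and ref_id in structures
        ]
        if linked:
            hits[seq_id] = linked
    return hits


with_structure = linked_entries(SEQUENCES, STRUCTURES)
```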
56. • SRS main toolbar tabs:
• Top Page: displays databases in different database groups
• Query: displays either the standard or extended query form
• Results or “the query manager”: maintains a history of all the
results obtained during a session
• Projects or “the project manager”: maintains a history of all
queries and views used during a session
• Views: allows a user to define a user specific view for one or
more databases
• Databanks: contains a list and some facts about the databases
available in the system
57. • Search terms in SRS
• SRS indexed fields can be searched using any of the following:
• Single word search
• Multiple word phrases
• Numbers and dates
• Regular expressions
• Wildcards
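The search styles listed above can be imitated on a small in-memory index. This is a sketch in plain Python, not SRS query syntax: the search and words helpers are hypothetical, only the first record comes from the EMBL example in this deck, and the DEMO records are made up.

```python
import re
from fnmatch import fnmatchcase

# Toy index: one record from the EMBL example, two made-up DEMO records.
RECORDS = [
    {"id": "X64011", "description": "L.ivanovii sod gene for superoxide dismutase",
     "organism": "Listeria ivanovii", "year": 1992},
    {"id": "DEMO0001", "description": "hypothetical catalase gene",
     "organism": "Listeria monocytogenes", "year": 1995},
    {"id": "DEMO0002", "description": "hypothetical superoxide reductase",
     "organism": "Escherichia coli", "year": 2001},
]


def words(text):
    """Split a field value into lower-case word tokens."""
    return re.findall(r"\w+", text.lower())


def search(field, predicate):
    """Return the ids of all records whose field value satisfies predicate."""
    return [record["id"] for record in RECORDS if predicate(record[field])]


# Single word: the word must appear as a whole token.
single = search("description", lambda v: "sod" in words(v))
# Multiple word phrase: the words must appear next to each other.
phrase = search("description", lambda v: "superoxide dismutase" in v.lower())
# Numbers and dates: a simple range test on a numeric field.
dated = search("year", lambda v: 1990 <= v <= 1996)
# Regular expression applied to the whole field value.
regex = search("organism", lambda v: re.search(r"^Listeria\s+\w+$", v) is not None)
# Wildcard: shell-style pattern matched against each word token.
wildcard = search("description",
                  lambda v: any(fnmatchcase(w, "super*") for w in words(v)))
```

The single word search finds only X64011 because "sod" is matched as a whole token, whereas the wildcard "super*" also finds the second record mentioning superoxide.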
61. LocusLink
• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National
Center for Biotechnology Information (NCBI) online resource.
• It is principally intended for use by graduate students and
professional researchers in the biomedical sciences.
• It is designed to bring together related information on genetic loci
and gene products from several sources.
• LocusLink provides a central point of access for basic biomedical
information and molecular data for genes, transcripts, and proteins
from model organisms, currently including human, rat, mouse,
fruit fly, and zebrafish.
• LocusLink is no longer available at NCBI; it was superseded by Entrez Gene.