SlideShare a Scribd company logo
BIOLOGICAL DATABASES
Dr. K. RameshKumar
Assistant Professor
PG & Research Dept. of Zoology
Vivekananda College
Tiruvedakam- West
MADURAI - 625234
WHAT IS DATABASE?
 A database is any collection of related data.
 A Computerized archive used to store and organize
data in such a way that information can be retrieved
easily.
 A database is a collection of interrelated data store
together without harmful and unnecessary
redundancy (duplicate data) to serve multiple
applications
 Retrieving is called firing a query
 Structured collection of information.
 Consists of basic units called records or entries.
 Each record consists of fields, which hold pre-
defined data related to the record.
 For example, a protein database would have
protein entries as records and protein properties as
fields (e.g., name of protein, length, amino-acid
sequence)
THE ‘PERFECT’ DATABASE
 Comprehensive, but easy to search.
 Annotated, but not “too annotated”.
 A simple, easy to understand structure.
 Cross-referenced.
 Minimum redundancy.
 Easy retrieval of data.
BIOLOGICAL DATABASES
 Data Heterogeneity – diversity in the data types
 High Volume of data – highly heterogeneous-
voluminous to support comprehensive investigation in
various fields
 Uncertainty – Biological phenomena are observed and
assumed to be true
 Data Curation – inconsistencies are due to a lack of
knowledge in the desired field
 Large scale data integration – data collected from lab
worldwide after years of research
 Data sharing – Scientific community examination and
inspection
 Dynamic and subject to continual change – Everyday
in various lab
BIOLOGICAL DATABASES
 Need for storing and communicating large
datasets has grown
 Make biological data available to scientists.
 To make biological data available in computer-
readable form.
DIFFERENT CLASSIFICATIONS OF
DATABASES
 Type of data
 nucleotide sequences
 protein sequences
 proteins sequence patterns or motifs
 macromolecular 3D structure
 gene expression data
 metabolic pathways
DIFFERENT CLASSIFICATIONS OF DATABASES….
 Primary or derived databases
 Primary databases: experimental results directly into
database
 Secondary databases: results of analysis of primary
databases
 Aggregate of many databases
 Links to other data items
 Combination of data
 Consolidation of data
DIFFERENT CLASSIFICATIONS OF DATABASES….
 Availability
 Publicly available, no restrictions
 Available, but with copyright
 Accessible, but not downloadable
 Academic, but not freely available
 Proprietary, commercial; possibly free for academics
THE CENTRAL DOGMA & BIOLOGICAL DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequence
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tag
(ESTs)
NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda,MD
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New Homepage
Common footer
New pages!
18
NCBI AND ENTREZ
 One of the largest and most comprehensive
databases belonging to the NIH – national
institute of health (USA)
 Entrez is the search engine of NCBI
 Search for :
genes, proteins, genomes, structures, diseases,
publications and more.
 http://www.ncbi.nlm.nih.gov/
PRIMARY DATABASES
 This databases contains the raw nucleic acid
sequence data which are produced and submitted
by researchers worldwide.
• Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)
• Protein
PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D
NUCLEOTIDE SEQUENCE DATABASES
 EMBL, GenBank, and DDBJ are the three primary
nucleotide sequence databases
 EMBL www.ebi.ac.uk/embl/
 GenBank www.ncbi.nlm.nih.gov/Genbank/
 DDBJ www.ddbj.nig.ac.jp
 They together constitute the International
Nucleotide Sequence database callaboration.
GENBANK
 An annotated collection of all publicly available
nucleotide and proteins
 Set up in 1979 at the LANL (Los Alamos).
 Maintained since 1992 NCBI (Bethesda).
 http://www.ncbi.nlm.nih.gov
GENBANK FILE FORMAT
GENBANK FILE FORMAT
EMBL NUCLEOTIDE SEQUENCE
DATABASE
 An annotated collection of all publicly available
nucleotide and protein sequences
 Created in 1980 at the European Molecular
Biology Laboratory in Heidelberg.
 Maintained since 1994 by EBI- Cambridge.
 http://www.ebi.ac.uk/embl.html
DDBJ–DNA DATA BANK OF JAPAN
 An annotated collection of all publicly
available nucleotide and protein sequences
 Started, 1984 at the National Institute of
Genetics (NIG) in Mishima.
 Still maintained in this institute a team led
by Takashi Gojobori.
 http://www.ddbj.nig.ac.jp
NCBI DATABASES AND SERVICES
 GenBank primary sequence database
 Free public access to biomedical literature
 PubMed free Medline (3 million searches per day)
 PubMed Central full text online access
 Entrez integrated molecular and literature databases
TYPES OF MOLECULAR DATABASES
 Primary Databases
 Original submissions by experimentalists
 Content controlled by the submitter
 Examples: GenBank, Trace, SRA, SNP, GEO
 Derivative Databases
 Derived from primary data
 Content controlled by third party (NCBI)
 Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene,
Homologene, Structure, Conserved Domain
PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
GenBank
Sequencing
Centers
TATAGCCG TATAGCCG
TATAGCCG TATAGCCG
Labs
Algorithms
UniGene
Curators
RefSeq
Genome
Assembly
TATAGCCG
AGCTCCGATA
CCGATGACAA
Updated
continually
by NCBI
Updated ONLY
by submitters
SEQUENCE DATABASES AT NCBI
 Primary
 GenBank: NCBI’s primary sequence database
 Trace Archive: reads from capillary sequencers
 Sequence Read Archive: next generation data
 Derivative
 GenPept (GenBank translations)
 Outside Protein (UniProt—Swiss-Prot, PDB)
 NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
 Nucleotide only sequence database
 Archival in nature
 Historical
 Reflective of submitter point of view (subjective)
 Redundant
 Data
 Direct submissions (traditional records)
 Batch submissions
 FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
 Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
GI number
NCBI internal use
well annotated
the sequence is the data
DERIVATIVE SEQUENCE DATABASES
FEATURES Location/Qualifiers
source 1..2484
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="3"
/map="3p22-p23"
gene 1..2484
/gene="MLH1"
CDS 22..2292
/gene="MLH1"
/note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
Number P14242), S. cerevisiae MLH1 (GenBank Accession
Number U07187), E. coli MUTL (Swiss-Prot Accession Number
P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
Number P14161) and Streptococcus pneumoniae (Swiss-Prot
Accession Number P14160)"
/codon_start=1
/product="DNA mismatch repair protein homolog"
/protein_id="AAC50285.1"
/db_xref="GI:463989"
/translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GENPEPT: GENBANK CDS TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
REFSEQ: DERIVATIVE SEQUENCE DATABASE
 Curated transcripts and proteins
 Model transcripts and proteins
 Assembled Genomic Regions
 Chromosome records
 Human genome
 microbial
 organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
SELECTED REFSEQ ACCESSION NUMBERS
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 Microbial replicons, organelle
AC_123455 Alternate assemblies
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
GENBANK TO REFSEQ
REFSEQS: ANNOTATION REAGENTS
Genomic DNA
(NC, NT, NW)
Model mRNA (XM)
(XR)
Curated mRNA (NM)
(NR)
Model protein (XP)
Curated Protein (NP)
Scanning....
= ?
GenBank
Sequences
RefSeq
REFSEQ BENEFITS
 Non-redundancy
 Updates to reflect current sequence data and biology
 Data validation
 Format consistency
 Distinct accession series
 Stewardship by NCBI staff and collaborators
OTHER DERIVATIVE DATABASES
 Expressed Sequences
 dbSNP
 Structure
 Gene
 and more…
ENTREZ
FINDING RELEVANT INFORMATION
IN NCBI DATABASES
ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3-D
Structure
3 -D
Structure
Word weight
VAST
BLAST
BLAST
Phylogeny
Hard Link
Neighbors
Related Sequences
Neighbors
Related Sequences
BLink
Domains
Neighbors
Related Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databases
TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
THE PROBLEM
 Rapidly growing databases with complex and changing
relationships
 Rapidly changing interfaces to match the above
Result
 Many people don’t know:
 Where to begin
 Where to click on a Web page
 Why it might be useful to click there
GLOBAL NCBI (ENTREZ) SEARCH
colon cancer
GLOBAL ENTREZ SEARCH RESULTS
ENTREZ TIP: START SEARCHES IN GENE
Other Entrez DBs
HomoloGene
Entrez
Protein
Gene
UniGene
BLink
Homologene:
Gene Neighbors
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]
MLH1 GENE RECORD
MLH1:LINKS TO SEQUENCE
GENEVIEW: HUMAN MLH1 VARIATIONS
ATPase domain
‘TAKE HOME MESSAGE’ ADVANTAGES OF
DATA INTEGRATION
 More relevant inter-related information in one place
 Makes it easier to find additional relevant
information related to your initial query
 Potentially find information indirectly linked, but
relevant to your subject of interest
 uncover non-obvious genetic features that explain
phenotype or disease
 Easier to build a ‘story’ based on multiple pieces of
biological evidence

More Related Content

What's hot

Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological database
KAUSHAL SAHU
 
UniProt
UniProtUniProt
UniProt
AmnaA7
 
Gen bank
Gen bankGen bank
Primary and secondary database
Primary and secondary databasePrimary and secondary database
Primary and secondary database
KAUSHAL SAHU
 
Phylogenetic data analysis
Phylogenetic data analysisPhylogenetic data analysis
Phylogenetic data analysis
Md. Dilshad karim
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
Pranavathiyani G
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
naveed ul mushtaq
 
Database in bioinformatics
Database in bioinformaticsDatabase in bioinformatics
Database in bioinformatics
VinaKhan1
 
Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
The Oxford College Engineering
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBI
geetikaJethra
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
KAUSHAL SAHU
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
ZoufishanY
 
EMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology LaboratoryEMBL- European Molecular Biology Laboratory
(Expasy)
(Expasy)(Expasy)
(Expasy)
Mazhar Khan
 
European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)
Hafiz Muhammad Zeeshan Raza
 
Introduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbjIntroduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbj
KAUSHAL SAHU
 
Protein Database
Protein DatabaseProtein Database
Clustal
ClustalClustal
Clustal
Benittabenny
 
Bioinformatics
BioinformaticsBioinformatics

What's hot (20)

Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological database
 
UniProt
UniProtUniProt
UniProt
 
Gen bank
Gen bankGen bank
Gen bank
 
Primary and secondary database
Primary and secondary databasePrimary and secondary database
Primary and secondary database
 
Phylogenetic data analysis
Phylogenetic data analysisPhylogenetic data analysis
Phylogenetic data analysis
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
Database in bioinformatics
Database in bioinformaticsDatabase in bioinformatics
Database in bioinformatics
 
Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBI
 
blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
 
EMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology LaboratoryEMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology Laboratory
 
(Expasy)
(Expasy)(Expasy)
(Expasy)
 
European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)
 
Introduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbjIntroduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbj
 
Protein Database
Protein DatabaseProtein Database
Protein Database
 
Clustal
ClustalClustal
Clustal
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 

Similar to Introduction to Biological databases

Biological database
Biological databaseBiological database
Biological database
Iqbal college Peringammala TVM
 
Introduction to Bioinformatics and DatabasesDay1.ppt
Introduction to Bioinformatics and DatabasesDay1.pptIntroduction to Bioinformatics and DatabasesDay1.ppt
Introduction to Bioinformatics and DatabasesDay1.ppt
khadijarafiq2012
 
Biological databases
Biological databasesBiological databases
Biological databases
Sarfaraz Nasri
 
Databases_L2.pptx
Databases_L2.pptxDatabases_L2.pptx
Databases_L2.pptx
kigaruantony
 
Biological databases
Biological databasesBiological databases
Biological databases
SHRADHEYA GUPTA
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
sworna kumari chithiraivelu
 
Presentation on Biological database By Elufer Akram @ University Of Science ...
Presentation on Biological database  By Elufer Akram @ University Of Science ...Presentation on Biological database  By Elufer Akram @ University Of Science ...
Presentation on Biological database By Elufer Akram @ University Of Science ...
Elufer Akram
 
DATABASES...............................pptx
DATABASES...............................pptxDATABASES...............................pptx
DATABASES...............................pptx
Cherry
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
Anshika Bansal
 
Bioinformatics final
Bioinformatics finalBioinformatics final
Bioinformatics final
Rainu Rajeev
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...
SBituila
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...
BibiQuinah
 
Major biological nucleotide databases
Major biological nucleotide databasesMajor biological nucleotide databases
Major biological nucleotide databases
Vidya Kalaivani Rajkumar
 
Bioinformatics introduction
Bioinformatics introductionBioinformatics introduction
Bioinformatics introduction
DrGopaSarma
 
Introduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASEIntroduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASE
PrashantSharma807
 
Biological databases.pptx
Biological databases.pptxBiological databases.pptx
Biological databases.pptx
PagudalaSangeetha
 
Data base in detail
Data base in detailData base in detail
Data base in detail
Vartika Mishra
 
Protein databases
Protein databasesProtein databases
Protein databases
bansalaman80
 
Primary Databases.pptx
Primary Databases.pptxPrimary Databases.pptx
Primary Databases.pptx
Swarup Malakar
 
Data retrieval tools
Data retrieval toolsData retrieval tools
Data retrieval tools
Vidya Kalaivani Rajkumar
 

Similar to Introduction to Biological databases (20)

Biological database
Biological databaseBiological database
Biological database
 
Introduction to Bioinformatics and DatabasesDay1.ppt
Introduction to Bioinformatics and DatabasesDay1.pptIntroduction to Bioinformatics and DatabasesDay1.ppt
Introduction to Bioinformatics and DatabasesDay1.ppt
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Databases_L2.pptx
Databases_L2.pptxDatabases_L2.pptx
Databases_L2.pptx
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
 
Presentation on Biological database By Elufer Akram @ University Of Science ...
Presentation on Biological database  By Elufer Akram @ University Of Science ...Presentation on Biological database  By Elufer Akram @ University Of Science ...
Presentation on Biological database By Elufer Akram @ University Of Science ...
 
DATABASES...............................pptx
DATABASES...............................pptxDATABASES...............................pptx
DATABASES...............................pptx
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
Bioinformatics final
Bioinformatics finalBioinformatics final
Bioinformatics final
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...Sequence and Structural Databases of DNA and Protein, and its significance in...
Sequence and Structural Databases of DNA and Protein, and its significance in...
 
Major biological nucleotide databases
Major biological nucleotide databasesMajor biological nucleotide databases
Major biological nucleotide databases
 
Bioinformatics introduction
Bioinformatics introductionBioinformatics introduction
Bioinformatics introduction
 
Introduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASEIntroduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASE
 
Biological databases.pptx
Biological databases.pptxBiological databases.pptx
Biological databases.pptx
 
Data base in detail
Data base in detailData base in detail
Data base in detail
 
Protein databases
Protein databasesProtein databases
Protein databases
 
Primary Databases.pptx
Primary Databases.pptxPrimary Databases.pptx
Primary Databases.pptx
 
Data retrieval tools
Data retrieval toolsData retrieval tools
Data retrieval tools
 

More from Dr.K.RameshKumar, Assistant Professor,Vivekananda College,Tiruvedakam West, Madurai

Bioinformatics - Internet
Bioinformatics - InternetBioinformatics - Internet
Poisonous and non poisonous snakes
Poisonous and non poisonous snakes Poisonous and non poisonous snakes
Environmental biology - animal relationship
Environmental biology - animal relationshipEnvironmental biology - animal relationship
Environmental biology water pollution & air pollution
Environmental biology   water pollution & air pollutionEnvironmental biology   water pollution & air pollution
Ecosystem - structure and dynamics
Ecosystem - structure and dynamicsEcosystem - structure and dynamics
Life table - construction and applications
Life table - construction and applicationsLife table - construction and applications
Vital Statistics - Uses and Importance's
Vital Statistics - Uses and Importance'sVital Statistics - Uses and Importance's
Vermitechnology
VermitechnologyVermitechnology
Dairy breeds
Dairy breedsDairy breeds
Ornamental fish culture
Ornamental fish cultureOrnamental fish culture
Snake
SnakeSnake

More from Dr.K.RameshKumar, Assistant Professor,Vivekananda College,Tiruvedakam West, Madurai (11)

Bioinformatics - Internet
Bioinformatics - InternetBioinformatics - Internet
Bioinformatics - Internet
 
Poisonous and non poisonous snakes
Poisonous and non poisonous snakes Poisonous and non poisonous snakes
Poisonous and non poisonous snakes
 
Environmental biology - animal relationship
Environmental biology - animal relationshipEnvironmental biology - animal relationship
Environmental biology - animal relationship
 
Environmental biology water pollution & air pollution
Environmental biology   water pollution & air pollutionEnvironmental biology   water pollution & air pollution
Environmental biology water pollution & air pollution
 
Ecosystem - structure and dynamics
Ecosystem - structure and dynamicsEcosystem - structure and dynamics
Ecosystem - structure and dynamics
 
Life table - construction and applications
Life table - construction and applicationsLife table - construction and applications
Life table - construction and applications
 
Vital Statistics - Uses and Importance's
Vital Statistics - Uses and Importance'sVital Statistics - Uses and Importance's
Vital Statistics - Uses and Importance's
 
Vermitechnology
VermitechnologyVermitechnology
Vermitechnology
 
Dairy breeds
Dairy breedsDairy breeds
Dairy breeds
 
Ornamental fish culture
Ornamental fish cultureOrnamental fish culture
Ornamental fish culture
 
Snake
SnakeSnake
Snake
 

Recently uploaded

weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 

Recently uploaded (20)

weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 

Introduction to Biological databases

  • 1.
  • 2. BIOLOGICAL DATABASES Dr. K. RameshKumar Assistant Professor PG & Research Dept. of Zoology Vivekananda College Tiruvedakam- West MADURAI - 625234
  • 3. WHAT IS DATABASE?  A database is any collection of related data.  A Computerized archive used to store and organize data in such a way that information can be retrieved easily.  A database is a collection of interrelated data store together without harmful and unnecessary redundancy (duplicate data) to serve multiple applications  Retrieving is called firing a query
  • 4.  Structured collection of information.  Consists of basic units called records or entries.  Each record consists of fields, which hold pre- defined data related to the record.  For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence)
  • 5. THE ‘PERFECT’ DATABASE  Comprehensive, but easy to search.  Annotated, but not “too annotated”.  A simple, easy to understand structure.  Cross-referenced.  Minimum redundancy.  Easy retrieval of data.
  • 6.
  • 7. BIOLOGICAL DATABASES  Data Heterogeneity – diversity in the data types  High Volume of data – highly heterogeneous- voluminous to support comprehensive investigation in various fields  Uncertainty – Biological phenomena are observed and assumed to be true  Data Curation – inconsistencies are due to a lack of knowledge in the desired field  Large scale data integration – data collected from lab worldwide after years of research  Data sharing – Scientific community examination and inspection  Dynamic and subject to continual change – Everyday in various lab
  • 8. BIOLOGICAL DATABASES  Need for storing and communicating large datasets has grown  Make biological data available to scientists.  To make biological data available in computer- readable form.
  • 9.
  • 10.
  • 11. DIFFERENT CLASSIFICATIONS OF DATABASES  Type of data  nucleotide sequences  protein sequences  proteins sequence patterns or motifs  macromolecular 3D structure  gene expression data  metabolic pathways
  • 12. DIFFERENT CLASSIFICATIONS OF DATABASES….  Primary or derived databases  Primary databases: experimental results directly into database  Secondary databases: results of analysis of primary databases  Aggregate of many databases  Links to other data items  Combination of data  Consolidation of data
  • 13. DIFFERENT CLASSIFICATIONS OF DATABASES….  Availability  Publicly available, no restrictions  Available, but with copyright  Accessible, but not downloadable  Academic, but not freely available  Proprietary, commercial; possibly free for academics
  • 14. THE CENTRAL DOGMA & BIOLOGICAL DATA Protein structures -Experiments -Models (homologues) Literature information Original DNA Sequences (Genomes) Protein Sequences -Inferred -Direct sequencing Expressed DNA sequence ( = mRNA Sequences = cDNA sequences) Expressed Sequence Tag (ESTs)
  • 15.
  • 16. NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION Created in 1988 as a part of the National Library of Medicine at NIH – Establish public databases – Research in computational biology – Develop software tools for sequence analysis – Disseminate biomedical information Bethesda,MD
  • 17. WEB ACCESS: WWW.NCBI.NLM.NIH.GOV New Homepage Common footer New pages!
  • 18. 18 NCBI AND ENTREZ  One of the largest and most comprehensive databases belonging to the NIH – national institute of health (USA)  Entrez is the search engine of NCBI  Search for : genes, proteins, genomes, structures, diseases, publications and more.  http://www.ncbi.nlm.nih.gov/
  • 19. PRIMARY DATABASES  This databases contains the raw nucleic acid sequence data which are produced and submitted by researchers worldwide. • Nucleic acid EMBL GenBank DDBJ (DNA Data Bank of Japan) • Protein PIR MIPS SWISS-PROT TrEMBL NRL-3D
  • 20. NUCLEOTIDE SEQUENCE DATABASES  EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases  EMBL www.ebi.ac.uk/embl/  GenBank www.ncbi.nlm.nih.gov/Genbank/  DDBJ www.ddbj.nig.ac.jp  They together constitute the International Nucleotide Sequence database callaboration.
  • 21. GENBANK  An annotated collection of all publicly available nucleotide and proteins  Set up in 1979 at the LANL (Los Alamos).  Maintained since 1992 NCBI (Bethesda).  http://www.ncbi.nlm.nih.gov
  • 24. EMBL NUCLEOTIDE SEQUENCE DATABASE  An annotated collection of all publicly available nucleotide and protein sequences  Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.  Maintained since 1994 by EBI- Cambridge.  http://www.ebi.ac.uk/embl.html
  • 25. DDBJ–DNA DATA BANK OF JAPAN  An annotated collection of all publicly available nucleotide and protein sequences  Started, 1984 at the National Institute of Genetics (NIG) in Mishima.  Still maintained in this institute a team led by Takashi Gojobori.  http://www.ddbj.nig.ac.jp
  • 26. NCBI DATABASES AND SERVICES  GenBank primary sequence database  Free public access to biomedical literature  PubMed free Medline (3 million searches per day)  PubMed Central full text online access  Entrez integrated molecular and literature databases
  • 27. TYPES OF MOLECULAR DATABASES  Primary Databases  Original submissions by experimentalists  Content controlled by the submitter  Examples: GenBank, Trace, SRA, SNP, GEO  Derivative Databases  Derived from primary data  Content controlled by third party (NCBI)  Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain
  • 28. PRIMARY VS. DERIVATIVE SEQUENCE DATABASES GenBank Sequencing Centers TATAGCCG TATAGCCG TATAGCCG TATAGCCG Labs Algorithms UniGene Curators RefSeq Genome Assembly TATAGCCG AGCTCCGATA CCGATGACAA Updated continually by NCBI Updated ONLY by submitters
  • 29. SEQUENCE DATABASES AT NCBI  Primary  GenBank: NCBI’s primary sequence database  Trace Archive: reads from capillary sequencers  Sequence Read Archive: next generation data  Derivative  GenPept (GenBank translations)  Outside Protein (UniProt—Swiss-Prot, PDB)  NCBI Reference Sequences (RefSeq)
  • 30. GENBANK - PRIMARY SEQUENCE DB  Nucleotide only sequence database  Archival in nature  Historical  Reflective of submitter point of view (subjective)  Redundant  Data  Direct submissions (traditional records)  Batch submissions  FTP accounts (genome data)
  • 31. GENBANK - PRIMARY SEQUENCE DB (2)  Three collaborating databases 1. GenBank 2. DNA Database of Japan (DDBJ) 3. European Molecular Biology Laboratory (EMBL) Database
  • 32. TRADITIONAL GENBANK RECORD ACCESSION U07418 VERSION U07418.1 GI:466461 Accession •Stable •Reportable •Universal Version Tracks changes in sequence GI number NCBI internal use well annotated the sequence is the data
  • 34. FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS GENPEPT: GENBANK CDS TRANSLATIONS >gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV... EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
  • 35. REFSEQ: DERIVATIVE SEQUENCE DATABASE  Curated transcripts and proteins  Model transcripts and proteins  Assembled Genomic Regions  Chromosome records  Human genome  microbial  organelle ftp://ftp.ncbi.nih.gov/refseq/release/
  • 36. SELECTED REFSEQ ACCESSION NUMBERS mRNAs and Proteins NM_123456 Curated mRNA NP_123456 Curated Protein NR_123456 Curated non-coding RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted non-coding RNA Gene Records NG_123456 Reference Genomic Sequence Chromosome NC_123455 Microbial replicons, organelle AC_123455 Alternate assemblies Assemblies NT_123456 Contig NW_123456 WGS Supercontig
  • 38. REFSEQS: ANNOTATION REAGENTS Genomic DNA (NC, NT, NW) Model mRNA (XM) (XR) Curated mRNA (NM) (NR) Model protein (XP) Curated Protein (NP) Scanning.... = ? GenBank Sequences RefSeq
  • 39. REFSEQ BENEFITS  Non-redundancy  Updates to reflect current sequence data and biology  Data validation  Format consistency  Distinct accession series  Stewardship by NCBI staff and collaborators
  • 40. OTHER DERIVATIVE DATABASES  Expressed Sequences  dbSNP  Structure  Gene  and more…
  • 42. ENTREZ: A DISCOVERY SYSTEM Gene Taxonomy PubMed abstracts Nucleotide sequences Protein sequences 3-D Structure 3 -D Structure Word weight VAST BLAST BLAST Phylogeny Hard Link Neighbors Related Sequences Neighbors Related Sequences BLink Domains Neighbors Related Structures Pre-computed and pre-compiled data. •A potential “gold mine” of undiscovered relationships. •Used less than expected.
  • 43. GLOBAL QUERY: ALL NCBI DATABASES The Entrez system: 38 (and counting) integrated databases
  • 44. TRADITIONAL METHOD: THE LINKS MENU DNA Sequence Nucleotide – Protein Link Related Proteins Protein – Structure Link 3-D Structure
  • 45. THE PROBLEM  Rapidly growing databases with complex and changing relationships  Rapidly changing interfaces to match the above Result  Many people don’t know:  Where to begin  Where to click on a Web page  Why it might be useful to click there
  • 46. GLOBAL NCBI (ENTREZ) SEARCH colon cancer
  • 48. ENTREZ TIP: START SEARCHES IN GENE Other Entrez DBs HomoloGene Entrez Protein Gene UniGene BLink Homologene: Gene Neighbors
  • 49. PRECISE RESULTS MLH1[Gene Name] AND Human[Organism]
  • 52. GENEVIEW: HUMAN MLH1 VARIATIONS ATPase domain
  • 53. ‘TAKE HOME MESSAGE’ ADVANTAGES OF DATA INTEGRATION  More relevant inter-related information in one place  Makes it easier to find additional relevant information related to your initial query  Potentially find information indirectly linked, but relevant to your subject of interest  uncover non-obvious genetic features that explain phenotype or disease  Easier to build a ‘story’ based on multiple pieces of biological evidence