INTRODUCTION TO
BIOLOGICAL DATABASES
NOTE: Most slides are derived from NCBI’s field guide
WHAT YOU NEED TO LEARN:
What is a database and what are the features of
an ideal database?
What are the relationships/differences between
primary and derived sequence databases?
Why is data integration useful?
WHAT ARE DATABASES?
Structured collection of data/information.
Consists of basic units called records or entries.
Each record consists of fields, which hold
pre-defined data related to the record.
For example, a protein database would have
protein entries as records and protein properties
as fields (e.g., name of protein, length, amino-acid
sequence)
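The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the example names and the example sequences are assumptions made for the sketch, not entries from any real database.

```python
from dataclasses import dataclass, fields


@dataclass
class ProteinRecord:
    """One record (entry) in a toy protein database (hypothetical layout)."""
    name: str       # field: name of the protein
    sequence: str   # field: amino-acid sequence in one-letter codes

    @property
    def length(self) -> int:
        # Derived property: number of residues in the sequence
        return len(self.sequence)


# A database is a structured collection of such records.
database = [
    ProteinRecord(name="example protein A", sequence="MSFVAGVIRR"),
    ProteinRecord(name="example protein B", sequence="MKT"),
]

# Every record exposes the same pre-defined fields.
field_names = [f.name for f in fields(ProteinRecord)]

# Because the structure is uniform, searching is straightforward.
longest = max(database, key=lambda record: record.length)
print(field_names, longest.name, longest.length)
```

Storing the length as a derived property rather than a separate field is one way to keep redundancy low inside a record.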
THE 'PERFECT' DATABASE
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
Bioinformatics sequence databases
# Can be broadly divided into 2 classes:
primary databases
secondary databases
# Primary databases contain original biological data such as:
DNA sequence, or protein structure information from experiments such as
crystallography. Examples: GenBank, TrEMBL
# Secondary databases attempt to add value to the primary databases and
make them more useful for certain specialist applications,
for example PROSITE, the database of common structural or functional motifs
found in proteins.
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New Homepage
Common footer
New pages!
THE CENTRAL DOGMA & BIOLOGICAL DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequence
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tag
(ESTs)
NCBI DATABASES AND SERVICES
GenBank primary sequence database
Free public access to biomedical literature
PubMed free Medline (3 million searches per day)
PubMed Central full text online access
Entrez integrated molecular and literature databases
TYPES OF MOLECULAR DATABASES
Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, Trace, SRA, SNP, GEO
Derivative /Secondary Databases
Derived from primary data
Content controlled by third party (NCBI)
Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets,
UniGene, HomoloGene, Structure, Conserved Domain
PRIMARY VS. SECONDARY SEQUENCE
DATABASES
[Diagram]
GenBank (primary): sequences submitted by sequencing centers and labs;
updated ONLY by submitters
UniGene (built by algorithms), RefSeq (built by curators), Genome Assembly (secondary):
derived from GenBank data; updated continually by NCBI
SEQUENCE DATABASES AT NCBI
Primary
GenBank: NCBI's primary sequence database
Trace Archive: reads from capillary sequencers
Sequence Read Archive: next generation data
Derivative
GenPept (GenBank translations)
Outside Protein (UniProt—Swiss-Prot, PDB)
NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
Nucleotide only sequence database
Archival in nature
Historical
Reflective of submitter point of view (subjective)
Redundant
Data
Direct submissions (traditional records)
Batch submissions
FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
GI number
NCBI internal use
well annotated
the sequence is the data
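As a small illustration of the accession/version distinction, the sketch below splits the VERSION string from the record above into its stable accession and its sequence version. The helper name `parse_accession_version` and the `SeqId` type are assumptions made for this example, not part of any NCBI toolkit.

```python
from typing import NamedTuple


class SeqId(NamedTuple):
    accession: str  # stable identifier, e.g. "U07418"
    version: int    # incremented when the sequence itself changes


def parse_accession_version(text: str) -> SeqId:
    """Split 'ACCESSION.VERSION' into its two parts (hypothetical helper)."""
    accession, sep, version = text.strip().partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {text!r}")
    return SeqId(accession, int(version))


seq_id = parse_accession_version("U07418.1")
print(seq_id.accession, seq_id.version)

# Two versions of one record share the accession but differ in version.
same_record = parse_accession_version("U07418.2").accession == seq_id.accession
```

Citing the full `accession.version` form, rather than the accession alone, makes it unambiguous which sequence an analysis used.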
DERIVATIVE SEQUENCE DATABASES
FEATURES Location/Qualifiers
source 1..2484
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="3"
/map="3p22-p23"
gene 1..2484
/gene="MLH1"
CDS 22..2292
/gene="MLH1"
/note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
Number P14242), S. cerevisiae MLH1 (GenBank Accession
Number U07187), E. coli MUTL (Swiss-Prot Accession Number
P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
Number P14161) and Streptococcus pneumoniae (Swiss-Prot
Accession Number P14160)"
/codon_start=1
/product="DNA mismatch repair protein homolog"
/protein_id="AAC50285.1"
/db_xref="GI:463989"
/translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
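A quick sanity check on the feature table above: a simple `start..end` location is 1-based and inclusive, so its length is `end - start + 1`. The sketch below (function names are assumptions for this example, and it handles only plain `start..end` spans, not joins or complements) computes the length of the CDS at 22..2292.

```python
from typing import Tuple


def parse_simple_location(location: str) -> Tuple[int, int]:
    """Parse a plain 'start..end' feature location (no join/complement)."""
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def span_length(location: str) -> int:
    # Coordinates are 1-based and inclusive on both ends.
    start, end = parse_simple_location(location)
    return end - start + 1


cds_nt = span_length("22..2292")   # nucleotides covered by the CDS
codons = cds_nt // 3               # whole codons in that span
in_frame = cds_nt % 3 == 0         # a complete CDS should be a multiple of 3

print(cds_nt, codons, in_frame)
```

If the CDS span includes a stop codon, the translated protein would be one residue shorter than the codon count.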
GENPEPT: GENBANK CDS TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
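The header line of the GenPept entry above packs several identifiers into one pipe-delimited string. A minimal sketch of pulling them apart follows. It assumes the older `gi|...|gb|...|` style shown on this slide, and the function name is made up for the example.

```python
def parse_legacy_defline(defline: str) -> dict:
    """Split a '>gi|N|gb|ACC.V| description' header into its parts.

    Handles only the pipe-delimited tag/value style shown on the slide.
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    ids, _, description = defline[1:].partition(" ")
    parts = ids.split("|")
    parsed = {}
    # Identifiers come as tag/value pairs: gi, 463989, gb, AAC50285.1, ...
    for i in range(0, len(parts) - 1, 2):
        parsed[parts[i]] = parts[i + 1]
    parsed["description"] = description
    return parsed


header = ">gi|463989|gb|AAC50285.1| DNA mismatch repair prote..."
parsed = parse_legacy_defline(header)
print(parsed["gi"], parsed["gb"], parsed["description"])
```

The `gb` value matches the `/protein_id` qualifier of the CDS feature, and the `gi` value matches its `/db_xref`, which is how the translation is tied back to the nucleotide record.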
REFSEQ: DERIVATIVE SEQUENCE DATABASE
Curated transcripts and proteins
Model transcripts and proteins
Assembled Genomic Regions
Chromosome records
Human genome
microbial
organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
SELECTED REFSEQ ACCESSION
NUMBERS
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 Microbial replicons, organelle
AC_123455 Alternate assemblies
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
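Because RefSeq accessions carry their record type in the prefix, the table above can be turned into a simple lookup. The sketch below copies the labels from this slide; the function name and the handling of unknown prefixes are assumptions for illustration.

```python
from typing import Optional

# Prefix -> record type, as listed on the slide above.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, else None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ["NM_123456", "XP_123456", "U07418"]:
    print(acc, "->", describe_refseq(acc))
```

The underscore is what sets these apart from GenBank accessions such as U07418, which is the "distinct accession series" benefit in practice.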
REFSEQ BENEFITS
Non-redundancy
Updates to reflect current sequence data and biology
Data validation
Format consistency
Distinct accession series
Stewardship by NCBI staff and collaborators
OTHER DERIVATIVE DATABASES
Expressed Sequences
dbSNP
Structure
Gene
and more…
ENTREZ
FINDING RELEVANT
INFORMATION IN NCBI
DATABASES
ENTREZ: A DISCOVERY SYSTEM
[Diagram: databases joined by links]
Databases: Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure
Links: Hard Link, Word weight, Phylogeny, BLink, Domains,
BLAST (Neighbors: Related Sequences), VAST (Neighbors: Related Structures)
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databases
TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
THE PROBLEM
Rapidly growing databases with complex and changing
relationships
Rapidly changing interfaces to match the above
Result
Many people don't know:
Where to begin
Where to click on a Web page
Why it might be useful to click there
GLOBAL NCBI (ENTREZ) SEARCH
colon cancer
GLOBAL ENTREZ SEARCH RESULTS
ENTREZ TIP: START SEARCHES IN
GENE
Other Entrez DBs
HomoloGene
Entrez
Protein
Gene
UniGene
BLink
HomoloGene:
Gene Neighbors
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]
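A fielded query like the one above can also be assembled programmatically. The sketch below only builds the query string and a URL for NCBI's E-utilities `esearch` endpoint. It does not send a request, and the helper names and the choice of `db=gene` are assumptions for this example.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field: str) -> str:
    """Restrict a search term to one Entrez field, e.g. MLH1[Gene Name]."""
    return f"{term}[{field}]"


def build_esearch_url(db: str, term: str) -> str:
    # urlencode percent-escapes the brackets and encodes spaces as '+'.
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


query = " AND ".join([
    fielded("MLH1", "Gene Name"),
    fielded("Human", "Organism"),
])
url = build_esearch_url("gene", query)
print(query)
print(url)
```

Restricting each term to a field is what makes the result precise: an unqualified `MLH1 human` search would also match records that merely mention either word.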
ADVANTAGES OF DATA INTEGRATION
More relevant inter-related information in one place
Makes it easier to find additional relevant information related to your
initial query
Potentially find information indirectly linked, but relevant to your
subject of interest
uncover non-obvious genetic features that explain phenotype or
disease
Easier to build a 'story' based on multiple pieces of biological
evidence
Remember: when reporting on a bioinformatics analysis, it is very important to
state which release of the sequence databases was used.
• Because of the enormous size of the databases, they are now broken up into
sections (divisions) to ease management.
• Most of these divisions are organised on a taxonomic basis (prokaryotes,
plants, fungi, rodents, mammals, etc.).
• These divisions are useful in that they make it easier to search only in the
relevant part of the database.
• The user manual for each database clearly states its structure.
Protein Databases
a. GenPept
• GenBank Gene Products Data Bank (GenPept) is a protein database produced by the
National Center for Biotechnology Information (NCBI).
• The entries in this database are derived from translations of all open reading frames in
GenBank, DDBJ and EMBL.
• Entries contain the same annotations present in the nucleotide records.
• The entries lack additional annotation, and the database does not include proteins
derived from amino-acid sequencing.
• A protein may also be represented by multiple records, i.e., redundancy.
b. RefSeq
# The aim of the Reference Sequence (RefSeq) database is to provide a comprehensive,
integrated, non-redundant sequence set at the genomic, transcript (including splice
variants), and protein levels for major organisms.
# RefSeq records represent the current best view of genomes and their transcript and/or
protein products.
# However, the majority of the entries in RefSeq are automatically generated with minimal
manual intervention.
# But as a non-redundant database it offers a significant advantage for database searches.
# The RefSeq collection is substantially based on the sequence records from GenBank, EMBL
and DDBJ, but it differs in that each RefSeq record includes attribution to the original
sequence data and is not a piece of primary research data in itself.
c. TrEMBL
# Translated EMBL (TrEMBL) is the European counterpart of the American GenPept and RefSeq.
# The TrEMBL database, maintained by EBI, contains the translations of all coding sequences
(CDS) present in EMBL/DDBJ/GenBank that are not yet integrated into SWISS-PROT.
# TrEMBL is a computer-annotated protein database that serves as a kind of halfway house to
SWISS-PROT.
# As a supplementary database to SWISS-PROT, TrEMBL serves to accommodate the
influx of protein sequences and make these sequences available as fast as possible without
compromising the quality standards of SWISS-PROT.
# Each TrEMBL entry is assigned a SWISS-PROT-type accession number that will stay with
it when the sequence is finally manually checked and accepted into SWISS-PROT.
# To simplify curation, TrEMBL follows the SWISS-PROT format and conventions as closely as
possible.
# But we should bear in mind that, because TrEMBL entries are generated
automatically, the quality of these entries is not guaranteed.
Universal curated databases
a. PIR-PSD (Protein Information Resource - Protein Sequence Database)
• The PIR-International Protein Sequence Database (PIR-PSD) was created through the
collaboration of the Protein Information Resource (PIR) with the Munich Information
Centre for Protein Sequences (MIPS) and the Japan International Protein Information
Database (JIPID).
• The primary sources of PIR-PSD are sequences from GenBank/EMBL/DDBJ translations,
published literature and direct submission to PIR-International.
• PIR-PSD maintains a set of integrated protein sequence databases as shown below:
b. SWISS-PROT Database
# SWISS-PROT, the leading universal curated protein sequence database, was established in
1986 and is maintained collaboratively by the Department of Medical Biochemistry of the
University of Geneva (Switzerland) and EBI.
# The database contains high-quality annotated data, and the annotation for each entry
includes the description of:
  # function(s) of the protein,
  # post-translational modification(s),
  # domains and sites,
  # secondary and quaternary structure,
  # similarities to other proteins,
  # disease(s) associated with defects in the protein,
  # tissues in which the protein is found,
  # pathways in which the protein is involved,
  # sequence conflicts and variants.
# SWISS-PROT tries to maintain minimal redundancy:
all reports for a given protein are merged into a single entry.
# The feature table (FT) indicates any conflicts between the various sequencing reports
for the corresponding entry.
# The entries in SWISS-PROT are produced from translation of sequences in EMBL, extracted
from the literature or submitted directly by researchers.
# To build the annotation, SWISS-PROT curators review not only the publications referenced by
the author, but also related articles, to periodically update the annotations of the families or groups
of proteins.
# The added annotation is stored mainly in the description (DE) and gene (GN) lines, the
comment (CC) lines, the feature table (FT) lines and the keyword (KW) lines.
# SWISS-PROT offers added value by providing links to over 30 different databases, including
databases of nucleic acid and protein sequences, protein families etc.
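The two-letter line codes mentioned above (DE, GN, CC, FT, KW) make a flat-file entry easy to split by annotation type. The sketch below groups the lines of an entry by their code. The entry text is entirely made up for illustration and is not a real SWISS-PROT record, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical mini entry: each line starts with a two-letter code,
# followed by spaces and then the content. Not a real database record.
EXAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN
DE   Hypothetical example protein.
GN   EXMPL1
CC   -!- FUNCTION: Illustrative comment only.
FT   CONFLICT        10     10       A -> T (in Ref. 2).
KW   Example keyword.
//
"""


def group_by_line_code(entry_text: str) -> dict:
    """Collect the content of each line under its two-letter line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code, content = line[:2], line[5:].strip()
        grouped[code].append(content)
    return dict(grouped)


annotations = group_by_line_code(EXAMPLE_ENTRY)
for code in ("DE", "GN", "CC", "FT", "KW"):
    print(code, annotations[code])
```

Grouping by code keeps the curated annotation (CC, FT, KW) separate from identification lines, which mirrors how the added value is organised in the entry.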
c. The UniProt Knowledgebase (UniProt)
# Since December 2003, the SWISS-PROT, PIR-PSD and TrEMBL protein databases have united
their activities to form the Universal Protein Knowledgebase (UniProt) consortium.
# Building upon these solid foundations, UniProt aims to provide biologists with a central,
comprehensive and high-quality protein database with efficient and clear access mechanisms.
# UniProt comprises three database layers:
1. UniParc
2. UniProtKB
3. UniRef
1. UniParc
# UniProt Archive (UniParc) is the most comprehensive non-redundant protein sequence
repository available.
# UniParc is designed to capture all publicly available protein sequence data from the
databases DDBJ, EMBL, GenBank, SWISS-PROT, TrEMBL, PIR-PSD, Ensembl, IPI (International
Protein Index), PDB, RefSeq, FlyBase, WormBase and the European, United States and Japanese
Patent Offices.
# As a result, performing a sequence search against UniParc is equivalent to performing
the same search against all databases cross-referenced by UniParc.
# To avoid redundancy, UniParc assigns each unique sequence a unique UniParc identifier.
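The non-redundancy rule described above (one identifier per unique sequence, with cross-references back to every source) can be sketched as follows. The identifier format `TOY000001`, the source IDs and the sequences are all made up for this example, and the sketch makes no claim about how UniParc is actually implemented.

```python
def build_archive(entries):
    """Merge (source_db, source_id, sequence) tuples by identical sequence.

    Returns a dict mapping each unique sequence to one archive identifier
    and the list of source records that carry that sequence.
    """
    archive = {}
    for source_db, source_id, sequence in entries:
        key = sequence.upper()  # treat case differences as the same sequence
        if key not in archive:
            # Hypothetical identifier scheme, for illustration only.
            archive[key] = {"id": f"TOY{len(archive) + 1:06d}", "xrefs": []}
        archive[key]["xrefs"].append((source_db, source_id))
    return archive


entries = [
    ("DatabaseA", "A0001", "MSFVAGVIRR"),
    ("DatabaseB", "B0042", "MSFVAGVIRR"),  # same sequence, different source
    ("DatabaseA", "A0002", "MKTLLV"),
]
archive = build_archive(entries)
for sequence, record in archive.items():
    print(record["id"], sequence, record["xrefs"])
```

Searching the two unique sequences here covers all three source records, which is the sense in which one search against the archive stands in for searches against each cross-referenced database.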
Genome databases
# A second major source of primary data is the various genome projects,
a large number of which are underway.
# A representative sample of these projects is shown in the table below.
# Much of the information from these projects can be found in the EMBL nucleotide
sequence database.
ASSIGNMENT I
Practical on Databases

Databases_L2.pptx

  • 1. INTRODUCTION TO BIOLOGICAL DATABASES NOTE: Most slides are derived from NCBI’s field guide
  • 2. WHAT YOU NEED TO LEARN: What is a database and what are the features of an ideal database? What are the relationships/differences between primary and derived sequence databases? Why is data integration useful?
  • 3. WHAT ARE DATABASES? Structured collection of data/information. Consists of basic units called records or entries. Each record consists of fields, which hold pre- defined data related to the record. For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence)
  • 4. THE „PERFECT‟ DATABASE Comprehensive, but easy to search. Annotated, but not “too annotated”. A simple, easy to understand structure. Cross-referenced. Minimum redundancy. Easy retrieval of data.
  • 5. Bioinformatics sequence databases # Can be broadly be divided into 2 classes: primary databases secondary databases # Primary databases contain original biological data such as: DNA sequence, or protein structure information from experiments such as crystallography. Examples: GenBank, TreMBL #Secondary databases attempt to add value to the primary databases and make them more useful for certain specialist applications, for example PROSITE, the database of common structural or functional motifs found in proteins.
  • 6. THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION Created in 1988 as a part of the National Library of Medicine at NIH – Establish public databases – Research in computational biology – Develop software tools for sequence analysis – Disseminate biomedical information Bethesda,MD
  • 7. WEB ACCESS: WWW.NCBI.NLM.NIH.GOV New Homepage Common footer New pages!
  • 8. THE CENTRAL DOGMA & BIOLOGICAL DATA Protein structures -Experiments -Models (homologues) Literature information Original DNA Sequences (Genomes) Protein Sequences -Inferred -Direct sequencing Expressed DNA sequence ( = mRNA Sequences = cDNA sequences) Expressed Sequence Tag (ESTs)
  • 9. NCBI DATABASES AND SERVICES GenBank primary sequence database Free public access to biomedical literature PubMed free Medline (3 million searches per day) PubMed Central full text online access Entrez integrated molecular and literature databases
  • 10. TYPES OF MOLECULAR DATABASES Primary Databases Original submissions by experimentalists Content controlled by the submitter Examples: GenBank, Trace, SRA, SNP, GEO Derivative /Secondary Databases Derived from primary data Content controlled by third party (NCBI) Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain
  • 11. PRIMARY VS. SECONDARY SEQUENCE DATABASES GenBank Sequencing Centers TATAGCCG TATAGCCG TATAGCCG TATAGCCG Labs Algorithms UniGene Curators RefSeq Genome Assembly TATAGCCG AGCTCCGATA CCGATGACAA Updated continually by NCBI Updated ONLY by submitters
  • 12. SEQUENCE DATABASES AT NCBI Primary GenBank: NCBI‟s primary sequence database Trace Archive: reads from capillary sequencers Sequence Read Archive: next generation data Derivative GenPept (GenBank translations) Outside Protein (UniProt—Swiss-Prot, PDB) NCBI Reference Sequences (RefSeq)
  • 13. GENBANK - PRIMARY SEQUENCE DB Nucleotide only sequence database Archival in nature Historical Reflective of submitter point of view (subjective) Redundant Data Direct submissions (traditional records) Batch submissions FTP accounts (genome data)
  • 14. GENBANK - PRIMARY SEQUENCE DB (2) Three collaborating databases 1. GenBank 2. DNA Database of Japan (DDBJ) 3. European Molecular Biology Laboratory (EMBL) Database
  • 15. TRADITIONAL GENBANK RECORD ACCESSION U07418 VERSION U07418.1 GI:466461 Accession •Stable •Reportable •Universal Version Tracks changes in sequence GI number NCBI internal use well annotated the sequence is the data
  • 17. FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS GENPEPT: GENBANK CDS TRANSLATIONS >gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV... EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
  • 18. REFSEQ: DERIVATIVE SEQUENCE DATABASE Curated transcripts and proteins Model transcripts and proteins Assembled Genomic Regions Chromosome records Human genome microbial organelle ftp://ftp.ncbi.nih.gov/refseq/release/
  • 19. SELECTED REFSEQ ACCESSION NUMBERS mRNAs and Proteins: NM_123456 (curated mRNA), NP_123456 (curated protein), NR_123456 (curated non-coding RNA), XM_123456 (predicted mRNA), XP_123456 (predicted protein), XR_123456 (predicted non-coding RNA). Gene Records: NG_123456 (reference genomic sequence). Chromosome: NC_123455 (microbial replicons, organelle), AC_123455 (alternate assemblies). Assemblies: NT_123456 (contig), NW_123456 (WGS supercontig).
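Because the RefSeq accession series are distinct, the two-letter prefix alone indicates what kind of record an accession refers to. The lookup below is a sketch that simply restates the slide's table; the dictionary and function names are hypothetical, and only the prefixes listed on the slide are covered.

```python
# Prefix meanings copied from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome (microbial replicons, organelle)",
    "AC": "Alternate assembly",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)
```

A traditional GenBank accession such as `U07418` has no underscore and returns `None`, which is one quick way to tell primary records from RefSeq records in a mixed list.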
  • 20. REFSEQ BENEFITS Non-redundancy Updates to reflect current sequence data and biology Data validation Format consistency Distinct accession series Stewardship by NCBI staff and collaborators
  • 21. OTHER DERIVATIVE DATABASES Expressed Sequences dbSNP Structure Gene and more…
  • 23. ENTREZ: A DISCOVERY SYSTEM [Diagram: the Entrez databases (Gene, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure) connected by hard links and by neighbor relationships (related sequences, related structures) computed with word weight, BLAST, VAST, phylogeny, BLink and domains.] Pre-computed and pre-compiled data. A potential “gold mine” of undiscovered relationships. Used less than expected.
  • 24. GLOBAL QUERY: ALL NCBI DATABASES The Entrez system: 38 (and counting) integrated databases
  • 25. TRADITIONAL METHOD: THE LINKS MENU DNA Sequence Nucleotide – Protein Link Related Proteins Protein – Structure Link 3-D Structure
  • 26. THE PROBLEM Rapidly growing databases with complex and changing relationships. Rapidly changing interfaces to match the above. Result: many people don't know where to begin, where to click on a Web page, or why it might be useful to click there.
  • 27. GLOBAL NCBI (ENTREZ) SEARCH colon cancer
  • 29. ENTREZ TIP: START SEARCHES IN GENE Other Entrez DBs HomoloGene Entrez Protein Gene UniGene BLink Homologene: Gene Neighbors
  • 30. PRECISE RESULTS MLH1[Gene Name] AND Human[Organism]
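The precise query on this slide follows a simple pattern: each term is tagged with a field name in square brackets, and terms are joined with a Boolean operator. A minimal sketch of assembling such a query string is shown below; the helper names are hypothetical and the code only builds the text, it does not contact NCBI.

```python
def field_term(value, field):
    """Tag a search value with an Entrez field name, e.g. MLH1[Gene Name]."""
    return f"{value}[{field}]"


def and_query(*terms):
    """Join fielded terms so that every one of them must match."""
    return " AND ".join(terms)


# Rebuilds the query shown on the slide.
query = and_query(
    field_term("MLH1", "Gene Name"),
    field_term("Human", "Organism"),
)
```

Restricting each term to a field is what makes the result precise: an untagged search for the same words would also match records that merely mention them anywhere in the text.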
  • 31. ADVANTAGES OF DATA INTEGRATION More relevant, inter-related information in one place. Makes it easier to find additional relevant information related to your initial query. Potentially find information that is indirectly linked but relevant to your subject of interest, and uncover non-obvious genetic features that explain a phenotype or disease. Easier to build a 'story' based on multiple pieces of biological evidence.
  • 32. Remember: When reporting on a bioinformatics analysis, it is very important to state which release of each sequence database was used. • Because of the enormous size of the databases, they are now broken up into sections to ease management. • Most of these divisions are organised on a taxonomic basis (prokaryotes, plants, fungi, rodents, mammals etc.). • These divisions are useful in that they make it easier to search only the relevant part of the database. • The user manual for each database clearly states its structure.
  • 33. Protein Databases a. GenPept • GenBank Gene Products Data Bank (GenPept) is a protein database produced by the National Center for Biotechnology Information (NCBI). • The entries in this database are derived from translations of all open reading frames in GenBank, DDBJ and EMBL. • It contains the same annotations present in the nucleotide records. • The entries in this database lack additional annotation, and the database does not contain proteins derived from amino acid sequencing. • A protein can also be expected to be represented by multiple records, i.e., redundancy.
  • 34. b. RefSeq # The aim of the Reference Sequence (RefSeq) database is to provide a comprehensive, integrated, non-redundant sequence set at the genomic, transcript (including splice variants), and protein levels for major organisms. # RefSeq records represent the current best view of genomes and their transcript and/or protein products. # However, the majority of RefSeq entries are automatically generated with minimal manual intervention. # But as a non-redundant database it offers a significant advantage for database searches. # The RefSeq collection is substantially based on the sequence records from GenBank, EMBL and DDBJ, but it differs in that each RefSeq record includes attribution to the original sequence data and is not itself a piece of primary data.
  • 35. c. TrEMBL # Translated EMBL (TrEMBL) is the European counterpart of the American GenPept and RefSeq. # The TrEMBL database, maintained by EBI, contains the translations of all coding sequences (CDS) present in EMBL/DDBJ/GenBank that are not yet integrated into SWISS-PROT. # TrEMBL is a computer-annotated protein database that serves as a kind of halfway house to SWISS-PROT. # As a supplementary database to SWISS-PROT, TrEMBL serves to accommodate the influx of protein sequences and make these sequences available as fast as possible without compromising the quality standards of SWISS-PROT.
  • 36. # Each TrEMBL entry is assigned a SWISS-PROT-type accession number that stays with it when the sequence is finally manually checked and accepted into SWISS-PROT. # To simplify curation, TrEMBL follows the SWISS-PROT format and conventions as closely as possible. # But we should bear in mind that, because TrEMBL entries are generated automatically, the quality of these entries is not guaranteed.
  • 37. Universal curated databases a. PIR-PSD (Protein Information Resource - Protein Sequence Database) • The PIR-International Protein Sequence Database (PIR-PSD) was created by the collaboration of the Protein Information Resource (PIR) with the Munich Information Centre for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). • The primary sources of PIR-PSD are sequences from GenBank/EMBL/DDBJ translations, published literature and direct submissions to PIR-International. • PIR-PSD maintains a set of integrated protein sequence databases as shown below:
  • 38.
  • 39. b. SWISS-PROT Database # SWISS-PROT, the leading universal curated protein sequence database, was established in 1986 and is maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva (Switzerland) and EBI. # The database contains high-quality annotated data, and the annotation for each entry includes the description of: # function(s) of the protein, # post-translational modification(s), # domains and sites, # secondary and quaternary structure, # similarities to other proteins, # disease(s) associated with protein defects, # tissues in which the protein is found, # pathways in which the protein is involved, # sequence conflicts and variants.
  • 40. # As a non-redundant database, SWISSPROT tries to maintain minimal redundancy: all reports for a given protein are merged into a single entry. # The feature table (FT) indicates any conflicts between the various sequencing reports for the corresponding entry. # The entries in SWISSPROT are produced from translation of sequences in EMBL, extracted from the literature or submitted directly by researchers. # To build the annotation, SWISSPROT curators review not only the publications referenced by the author, but also related articles, in order to periodically update the annotations of the families or groups of proteins. # The added annotation is stored mainly in the description (DE) and gene (GN) lines, the comment (CC) lines, the feature table (FT) lines and the keyword (KW) lines. # SWISSPROT offers added value by providing links to over 30 different databases, including databases of nucleic acid and protein sequences, protein families etc.
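Since the annotation lives in lines identified by two-letter codes, it can be pulled out of an entry with a few lines of code. The sketch below assumes the flat-file layout in which each line starts with a two-character code followed by three spaces; the function name is hypothetical, and the `SAMPLE` entry is invented for illustration and is not a real SWISS-PROT record.

```python
from collections import defaultdict

# Line codes named on the slide as holding the added annotation.
ANNOTATION_CODES = ("DE", "GN", "CC", "FT", "KW")


def collect_annotation(entry_text):
    """Group the annotation lines of one flat-file entry by their line code."""
    by_code = defaultdict(list)
    for line in entry_text.splitlines():
        # Assumption: columns 1-2 hold the code and the data starts at column 6.
        code, content = line[:2], line[5:].rstrip()
        if code in ANNOTATION_CODES:
            by_code[code].append(content)
    return dict(by_code)


# Invented entry used only to exercise the function.
SAMPLE = """\
ID   EXAMPLE_HUMAN
DE   Hypothetical example protein.
GN   EXMPL1.
CC   -!- FUNCTION: Placeholder text for illustration only.
KW   Hypothetical.
SQ   SEQUENCE
"""

annotation = collect_annotation(SAMPLE)
```

Lines with other codes, such as the identifier and sequence lines in the sample, are skipped, so the result contains only the curator-added annotation.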
  • 41. c. The UniProt knowledgebase (UNIPROT) # Since December 2003, the SWISSPROT, PIR-PSD and TrEMBL protein databases have united their activities to form the Universal Protein Knowledgebase (UniProt) consortium. # UniProt builds upon these solid foundations and aims to provide biologists with a central, comprehensive and high-quality protein database with efficient and clear access mechanisms. # UniProt comprises three database layers: 1. UniParc 2. UniProtKB 3. UniRef
  • 42. 1. UniParc # UniProt Archive (UniParc) is the most comprehensive non-redundant protein sequence repository available. # UniParc is designed to capture all publicly available protein sequence data from the databases DDBJ, EMBL, GenBank, SWISSPROT, TrEMBL, PIR-PSD, Ensembl, IPI (International Protein Index), PDB, RefSeq, FlyBase, WormBase and the European, United States and Japanese Patent Offices. # As a result, performing a sequence search against UniParc is equivalent to performing the same search against all databases cross-referenced by UniParc. # To avoid redundancy, UniParc assigns each unique sequence a unique UniParc identifier.
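The idea of storing each distinct sequence once and cross-referencing every source record that carries it can be illustrated with a small sketch. This is not UniParc's actual implementation: the function name, the `ARC000001`-style identifiers and the `OtherDB` records are all hypothetical, and sequences are compared as plain upper-cased strings.

```python
def build_archive(records):
    """Assign one identifier per distinct sequence and collect its sources.

    records: iterable of (source_db, source_id, sequence) tuples.
    Returns (ids, xrefs): sequence -> identifier, identifier -> source list.
    """
    ids = {}
    xrefs = {}
    for source_db, source_id, sequence in records:
        key = sequence.upper()  # treat case differences as the same sequence
        if key not in ids:
            # Hypothetical identifier scheme: a running number per new sequence.
            ids[key] = f"ARC{len(ids) + 1:06d}"
            xrefs[ids[key]] = []
        xrefs[ids[key]].append((source_db, source_id))
    return ids, xrefs


# Invented input: the same sequence reported by two sources, plus another one.
records = [
    ("GenPept", "AAC50285.1", "MSFVAGVIRR"),
    ("OtherDB", "X1", "msfvagvirr"),
    ("OtherDB", "X2", "MKT"),
]
ids, xrefs = build_archive(records)
```

Three input records collapse to two archive entries, and the first entry keeps cross-references to both of its sources, which is why a single search against such an archive covers all contributing databases.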
  • 43. Genome databases # A second major source of primary data is the various genome projects, a large number of which are underway. # A representative sample of these projects is shown in the table below. # Much of the information from these projects can be found in the EMBL nucleotide sequence database.
  • 44.