FBW Seminar on Molecular Biology Databases

FBW
07-10-2014
Wim Van Criekinge

There IS class on 4 November and NO class on 18 November

Outline
• Molecular Biology
• Flat files “sequence” databases
– DNA
– Protein
– Structure
• Relational Databases
– What ?
– Why ?
• Biological Relational Databases
– Howto ?

Flat Files
What is a “flat file” ?
• A “flat file” is data stored in an
ordinary plain-text file
on the hard disk
• Example: RefSeq
– See CD-ROM
– FILE: hs.GBFF
• hs: Homo sapiens
• GBFF: GenBank Flat File format
• (associated with TextPad; use a monospaced
font, e.g. Courier)

Sequence entries
gene 10317..12529 /gene="ZK822.4"
CDS join(10317..10375,10714..10821,10874..10912,10960..11013,
11061..11114,11169..11222,11346..11739,11859..11912,
11962..12195,12242..12529)
/gene="ZK822.4" /codon_start=1
/protein_id="CAA98068.1"
/db_xref="PID:g3881817"
/db_xref="GI:3881817"
/db_xref="SPTREMBL:Q23615"
/translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS
FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT
IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY
TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN
VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ
GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG
NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF
KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"

Nucleotide Databases
EMBL Nucleotide Sequence Database (European Molecular Biology
Laboratory) http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
GenBank at NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
DDBJ (DNA Database of Japan) http://www.ddbj.nig.ac.jp/
DDBJ, the Center for operating DDBJ, National Institute of Genetics (NIG), Japan, established in
April 1995.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
Genetic Sequence Data Bank - August 15 2003
NCBI-GenBank Flat File Release 137.0
Distribution Release Notes
33 865 022 251 bases, from 27 213 748 reported sequences

GenBank Format
LOCUS LISOD 756 bp DNA BCT 30-JUN-1993
DEFINITION L.ivanovii sod gene for superoxide dismutase.
ACCESSION X64011.1 GI:37619753
NID g44010
KEYWORDS sod gene; superoxide dismutase.
SOURCE Listeria ivanovii.
ORGANISM Listeria ivanovii
Eubacteria; Firmicutes; Low G+C gram-positive bacteria;
Bacillaceae; Listeria.
REFERENCE 1 (bases 1 to 756)
AUTHORS Haas,A. and Goebel,W.
TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii
by functional complementation in Escherichia coli and
characterization of the gene product
JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992)
MEDLINE 92140371
REFERENCE 2 (bases 1 to 756)
AUTHORS Kreft,J.
TITLE Direct Submission
JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
Universitaet Wuerzburg, Biozentrum Am Hubland, 8700
Wuerzburg, FRG

FEATURES Location/Qualifiers
source 1..756
/organism="Listeria ivanovii"
/strain="ATCC 19119"
/db_xref="taxon:1638"
RBS 95..100
/gene="sod"
gene 95..746
/gene="sod"
CDS 109..717
/gene="sod"
/EC_number="1.15.1.1"
/codon_start=1
/product="superoxide dismutase"
/db_xref="PID:g44011"
/db_xref="SWISS-PROT:P28763"
/transl_table=11
/translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL
NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN
GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS
TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF
DAAK"
terminator 723..746
/gene="sod"

Example of location descriptors
Location Description
476 Points to a single base in the presented sequence
340..565 Points to a continuous range of bases bounded by and
including the starting and ending bases
<345..500 The exact lower boundary point of a feature is unknown.
(102.110) Indicates that the exact location is unknown but that it
is one of the bases between bases 102 and 110.
(23.45)..600 Specifies that the starting point is one of the bases
between bases 23 and 45, inclusive, and the end base 600
123^124 Points to a site between bases 123 and 124
145^177 Points to a site anywhere between bases 145 and 177
J00193:hladr Points to a feature whose location is described in
another entry: the feature labeled 'hladr' in the
entry (in this database) with primary accession 'J00193'
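
A minimal Python sketch of how the simpler location descriptors in the table above might be parsed. The function name parse_location and the returned labels are illustrative assumptions, not part of any GenBank tool; uncertain forms such as (102.110) and remote references such as J00193:hladr are deliberately not handled.

```python
import re


def parse_location(loc):
    """Return (start, end, kind) for a few simple feature-location forms.

    Positions are 1-based and inclusive, as in the flat file.
    """
    loc = loc.strip()
    # Single base, e.g. "476"
    if re.fullmatch(r"\d+", loc):
        pos = int(loc)
        return (pos, pos, "base")
    # Site between two bases, e.g. "123^124"
    m = re.fullmatch(r"(\d+)\^(\d+)", loc)
    if m:
        return (int(m.group(1)), int(m.group(2)), "site")
    # Range, optionally with an unknown boundary, e.g. "340..565" or "<345..500"
    m = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", loc)
    if m:
        kind = "partial range" if (m.group(1) or m.group(3)) else "range"
        return (int(m.group(2)), int(m.group(4)), kind)
    raise ValueError(f"unsupported location: {loc!r}")


def range_length(loc):
    """Number of bases covered by a range such as "109..717"."""
    start, end, kind = parse_location(loc)
    if kind not in ("range", "partial range"):
        raise ValueError("length is only defined here for ranges")
    return end - start + 1
```

For the sod CDS shown earlier, the range 109..717 covers 717 - 109 + 1 = 609 bases, i.e. 203 codons including the stop codon.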

BASE COUNT 247 a 136 c 151 g 222 t
ORIGIN
1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca
241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt
301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta
361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca
421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg
481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt
541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat
601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca
661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta
721 tcgaaaggct cacttaggtg ggtcttttta tttcta
//

EMBL format
ID LISOD standard; DNA; PRO; 756 BP. IDentification
XX
AC X64011; S78972; Accession (Axxxxx, Afxxxxxx), GUID
XX
NI g44010 Nucleotide Identifier --> x.x
XX
DT 28-APR-1992 (Rel. 31, Created) DaTe
DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE L.ivanovii sod gene for superoxide dismutase DEscription
XX
KW sod gene; superoxide dismutase. KeyWord
XX
OS Listeria ivanovii Organism Species
OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC Listeria. Organism Classification
XX
RN [1]
RA Haas A., Goebel W.; Reference
RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT functional complementation in Escherichia coli and
RT characterization of the gene product.";
RL Mol. Gen. Genet. 231:313-322(1992).
XX
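
Because every EMBL line starts with a two-letter line code, such an entry can be split into fields with very little code. The sketch below is only an illustration under that assumption; the function name parse_embl_lines is hypothetical, and a real parser would also have to handle the feature table and the sequence block.

```python
def parse_embl_lines(text):
    """Group the lines of an EMBL-style entry by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        # The line code is everything before the first space.
        code, _, value = line.partition(" ")
        value = value.strip()
        # "XX" lines are empty spacers; skip them and any line without content.
        if code == "XX" or not value:
            continue
        fields.setdefault(code, []).append(value)
    return fields


entry = """ID LISOD standard; DNA; PRO; 756 BP.
XX
AC X64011; S78972;
XX
OS Listeria ivanovii
OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC Listeria.
"""
fields = parse_embl_lines(entry)
```

Multi-line fields such as OC end up as a list with one item per line, so the order in the file is preserved.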

GenBank, EMBL & DDBJ: Comments
• Collaboration GenBank/EMBL/DDBJ
– Effort: identical within 24 hours
• Redundant information
• Historical graveyard
– BankIt (responsibility of the submitter)
– Version conflicts
• Idiosyncratic (peculiar to the
individual)
– Heterogeneous annotation
– No consistent quality check
• Vectors, sequence errors, etc.

Other Genbank Formats
• ASN1
– Computer friendly, human unfriendly
• FASTA
– Brief, loses information
– Easy to use
– Compatible with multiple sequences
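
To show why FASTA is “easy to use” and works for multiple sequences, here is a small Python sketch that writes and reads the format. The function names to_fasta and parse_fasta are illustrative assumptions; only the header line (starting with “>”) and the sequence are kept, which is exactly the information loss mentioned above.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Read one or more FASTA records into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Two records concatenated in one file, as FASTA allows.
fasta_text = to_fasta("seq_a", "ACGT" * 20) + to_fasta("seq_b", "TTT")
sequences = parse_fasta(fasta_text)
```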

Web Query tools & Programming Query tools
• NCBI website example:
– http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
• EBI UniProtKB website example:
– http://www.ebi.ac.uk/uniprot/index.html
– http://www.ebi.uniprot.org/search/SearchTools.shtml

batch download (ftp server)
• Data available via a website can usually
also be downloaded as a complete
batch from an FTP
server.
• Examples:
–ftp://ftp.ncbi.nih.gov/
–ftp://ftp.ebi.ac.uk/pub/

Sequence file format tips
• When saving a sequence for use in an email
message or pasting into a web page, use an
unannotated text format such as FASTA
• When retrieving from a database or
exchanging between programs, use an
annotated text format such as Genbank
• When using sequence again with the same
program, use that program’s annotated binary
format (or annotated text if binary not
available)
– Asn-1 (NCBI)
– Gbff (sanger)
– XML

Expressed Sequence Tags
• Sequence that codes for protein is < 5% of the
genome.
• Coding sequence can be obtained from mRNA by
reverse transcription.
• Tags for that sequence can be obtained by end-sequencing.
• Incyte and HGS gambled on this being the useful
part:
– Search for homologies to known proteins, motifs.
– Search for changed levels of expression and tissue specificity
(“virtual/electronic northern” used in GeneCards)
• ESTs have driven the huge expansion of GenBank:
– Unigene now contains some sequence from most genes.
– > 4,000,000 human EST sequences
– http://www.ncbi.nlm.nih.gov/dbEST/

dbEST release 100303 Summary by Organism - October 3, 2003
Number of public entries: 18,762,324
Homo sapiens (human) 5,426,001
Mus musculus + domesticus (mouse) 3,881,878
Rattus sp. (rat) 538,073
Triticum aestivum (wheat) 500,898
Ciona intestinalis 492,488
Gallus gallus (chicken) 451,565
Zea mays (maize) 383,416
Danio rerio (zebrafish) 362,362
Hordeum vulgare + subsp. vulgare (barley) 348,233
Xenopus laevis (African clawed frog) 344,695
Glycine max (soybean) 341,573
Bos taurus (cattle) 322,074
Drosophila melanogaster (fruit fly) 261,404

Traces <-> strings
• Traces contain much more information
– TraceDB: http://www.ncbi.nlm.nih.gov/Traces/
Example

Traces <-> strings
• Phred
– base calling, vector trimming, end-of-read
trimming
• Phrap
– Phrap uses Phred’s base-calling scores to
determine the consensus sequence. Phrap
examines all individual sequences at a given
position and uses the highest-scoring sequence
(if it exists) to extend the consensus sequence
• Consed
– graphical interface extension that controls both
Phred and Phrap

What is Phred?
• Phred is a program that reads the base trace, makes
base calls, and assigns a quality value (qv) to each base in the
sequence.
• It then writes the base calls and qv to output files that are
used for Phrap assembly. The qv is used in
consensus sequence construction.
• For example, ATGCATGC string1
ATTCATGC string2
AT-CATGC superstring
• Here there is a mismatch between ‘G’ and ‘T’; the qv
decides what replaces the dash in the superstring: the base with the higher
qv is chosen.
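
The example above can be sketched in Python as a toy rule: at each aligned position, keep the base with the higher qv. This is only an illustration of the idea on the slide, not Phrap's actual algorithm; the function name consensus and the quality values used below are assumptions made up for the example.

```python
def consensus(read1, qv1, read2, qv2):
    """Pick, position by position, the base with the higher quality value.

    Both reads are assumed to be already aligned and of equal length.
    """
    if not (len(read1) == len(qv1) == len(read2) == len(qv2)):
        raise ValueError("reads and quality lists must have equal length")
    result = []
    for base1, q1, base2, q2 in zip(read1, qv1, read2, qv2):
        if base1 == base2:
            result.append(base1)
        else:
            # Mismatch: trust the base call with the higher qv.
            result.append(base1 if q1 >= q2 else base2)
    return "".join(result)


# Hypothetical quality values: the 'T' in string2 is a low-quality call.
string1, quals1 = "ATGCATGC", [40, 40, 40, 40, 40, 40, 40, 40]
string2, quals2 = "ATTCATGC", [40, 40, 10, 40, 40, 40, 40, 40]
superstring = consensus(string1, quals1, string2, quals2)
```

With these made-up values the ‘G’ wins at the third position; swapping the two quality lists would select the ‘T’ instead.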

How does Phred calculate qv?
• From the base trace Phred knows the number of peaks
and the actual peak locations.
• Phred predicts peak locations.
• Phred reads the actual peak locations from the base
trace.
• Phred matches the actual locations with the
predicted locations using dynamic
programming.
• The qv is related to the base-call error probability
(ep) by the formula qv = -10*log_10(ep)
• Example: ep = 1:10000 gives qv = 40
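
The formula is easy to check numerically. A short Python sketch follows; the two function names are chosen here for illustration only.

```python
import math


def error_prob_to_qv(ep):
    """Quality value from base-call error probability: qv = -10 * log10(ep)."""
    return -10 * math.log10(ep)


def qv_to_error_prob(qv):
    """Inverse relation: ep = 10 ** (-qv / 10)."""
    return 10 ** (-qv / 10)


# An error probability of 1 in 10,000 corresponds to qv 40 (the slide's example).
qv_example = error_prob_to_qv(1 / 10000)
# Conversely, qv 20 corresponds to 1 error in 100 base calls.
ep_example = qv_to_error_prob(20)
```

Each step of 10 in qv therefore means a tenfold lower error probability.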

Why Phred?
• Output sequence might contain
errors.
• Vector contamination might occur.
• Dye-terminator reaction might
fail.
• Fragment migration might be abnormal in
gel electrophoresis.
• Weak or variable signal strength
of peak corresponding to a base.

End of Sequence Cropping
• The ends of sequencing reads commonly
contain poor data. This is due to the difficulty of
resolving larger fragments of ~1 kb (it is easier to
resolve 21 bp from 20 bp than it is to resolve
1001 bp from 1000 bp).
• Phred assigns a non-value of ‘x’ to this data by
comparing peak separation and peak intensity to
internal standards. If the standard threshold score
is not reached, the data will not be used.

Traces <-> strings
• Handle traces
– Abi-view EMBOSS
– Bioedit
– Acembly, …
• EXAMPLE

NCBI reference sequences
RefSeq database is a non-redundant set of
reference standards that includes
chromosomes, complete genomic molecules,
intermediate assembled genomic contigs,
curated genomic regions, mRNAs, RNAs, and
proteins.

RefSeq nomenclature
NC_#### complete genomic
NG_#### incomplete genomic
NM_#### mRNA
NR_#### noncoding transcripts
NP_#### proteins
NT_#### intermediate genomic contigs

RefSeq nomenclature - models
XM_#### mRNA
XR_#### RNA
XP_#### protein
Automated Homo sapiens models provided by
the Genome Annotation process; sequence
corresponds to the genomic contig.

Open reading frame
• Definition:
– A stretch of triplet codons with an initiator
codon at one end and a stop codon at the other,
as identifiable from the nucleotide sequence.
• Example
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
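
The definition above translates almost directly into code. The Python sketch below scans the three forward reading frames for an ATG followed in frame by a stop codon. It is a simplified illustration: the function name find_orfs is an assumption, the reverse strand is ignored, and only the standard stop codons are used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end) pairs of ORFs on the forward strand.

    Coordinates are 1-based and inclusive; the end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                # First initiator codon opens an ORF in this frame.
                start = i
            elif start is not None and codon in STOP_CODONS:
                # In-frame stop codon closes the ORF.
                orfs.append((start + 1, i + 3))
                start = None
    return orfs


# Toy sequence: ATG AAA TAG embedded at positions 3..11.
toy_orfs = find_orfs("CCATGAAATAGCC")
```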

Protein sequence database
SWISS-PROT & TREMBL
SwissProt - http://expasy.hcuge.ch/sprot/
• SWISS-PROT is an annotated protein sequence database
• The sequences are translated from the EMBL Nucleotide Sequence Database
• Sequence entries are composed of different line types.
For standardization purposes the format of SWISS-PROT follows as
closely as possible that of the EMBL Nucleotide Sequence Database.
• Continuously updated (daily).

Different Features of SWISS-PROT
• Format follows as closely as
possible that of EMBL’s
• Curated protein sequence database
• Three differences:
1. Strives to provide a high level of
annotations
2. Minimal level of redundancy
3. High level of integration with
other databases

Three Distinct Criteria
1. Annotation
Core data: the sequence data, the citation
information (bibliographical
references) and the taxonomic data
(description of the biological source of
the protein).
Annotation: items such as protein
functions, post-translational
modifications, domains and
sites, secondary structure, quaternary
structure, similarities to other
proteins, diseases associated with
deficiencies in the protein, sequence
conflicts, variants, etc.

2. Minimal Redundancy
Many sequence databases contain, for a
given protein sequence, separate
entries which correspond to
different literature reports. SWISS-PROT
tries as much as possible to
merge all these data so as to
minimize the redundancy. If
conflicts exist between various
sequencing reports, they are
indicated in the feature table of the
corresponding entry.

3. Integration With Other Databases
• SWISS-PROT and TrEMBL - Protein
sequences
• PROSITE - Protein families and domains
• SWISS-2DPAGE - Two-dimensional
polyacrylamide gel electrophoresis
• SWISS-3DIMAGE - 3D images of proteins
and other biological macromolecules
• SWISS-MODEL Repository - Automatically
generated protein models
• CD40Lbase - CD40 ligand defects
• ENZYME - Enzyme nomenclature
• SeqAnalRef - Sequence analysis bibliographic
references

TREMBL - http://expasy.hcuge.ch/sprot/
• Translated EMBL sequences not (yet) in
Swissprot.
• Updated faster than SWISS-PROT.
TREMBL - two parts
1. SP-TREMBL
• Will eventually be incorporated into Swissprot
• Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN,
PRO, ROD, UNC, VRL and VRT.
2. REM-TREMBL (remaining)
• Will NOT be incorporated into Swissprot
• Divided into: immunoglobulins and T-cell receptors, synthetic
sequences, patent application sequences, small fragments, CDS
not coding for real proteins

SWISS-PROT/TrEMBL
• TrEMBL is a computer-annotated
supplement of SWISS-PROT that contains
all the translations of EMBL nucleotide
sequence entries not yet integrated in
SWISS-PROT
• SWISS-PROT Release 39.15 of 19-Mar-2001:
94,152 entries
• TrEMBL Release 16.2 of 23-Mar-2001:
436,924 entries

Example of a SwissProt entry
ID TNFA_HUMAN STANDARD; PRT; 233 AA. IDentification
AC P01375; ACcession
DT 21-JUL-1986 (REL. 01, CREATED) DaTe
DT 21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT 15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN TNFA. Gene name
OS HOMO SAPIENS (HUMAN). Organism Species
OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC EUTHERIA; PRIMATES. Organism Classification
RN [1] Reference
RP SEQUENCE FROM N.A.
RX MEDLINE; 87217060.
RA NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN [2]
RP SEQUENCE FROM N.A.
RX MEDLINE; 85086244.
RA PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL NATURE 312:724-729(1984).
...

CC -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC UNDER CERTAIN CONDITIONS. Comments
CC -!- SUBUNIT: HOMOTRIMER.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC AN EXTRACELLULAR SOLUBLE FORM.
CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC PROTEOLYTIC PROCESSING.
CC -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC HEALTH AND MALNUTRITION.
CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR EMBL; X02910; G37210; -. Database Cross-references
DR EMBL; M16441; G339741; -.
DR EMBL; X01394; G37220; -.
DR EMBL; M10988; G339738; -.
DR EMBL; M26331; G339764; -.
DR EMBL; Z15026; G37212; -.
DR PIR; B23784; QWHUN.
DR PIR; A44189; A44189.
DR PDB; 1TNF; 15-JAN-91.
DR PDB; 2TUN; 31-JAN-94.

KW CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW MYRISTYLATION; 3D-STRUCTURE. KeyWord
FT PROPEP 1 76 Feature Table
FT CHAIN 77 233 TUMOR NECROSIS FACTOR.
FT TRANSMEM 36 56 SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT LIPID 19 19 MYRISTATE.
FT LIPID 20 20 MYRISTATE.
FT DISULFID 145 177
FT MUTAGEN 105 105 L->S: LOW ACTIVITY.
FT MUTAGEN 108 108 R->W: BIOLOGICALLY INACTIVE.
FT MUTAGEN 112 112 L->F: BIOLOGICALLY INACTIVE.
FT MUTAGEN 162 162 S->F: BIOLOGICALLY INACTIVE.
FT MUTAGEN 167 167 V->A,D: BIOLOGICALLY INACTIVE.
FT MUTAGEN 222 222 E->K: BIOLOGICALLY INACTIVE.
FT CONFLICT 63 63 F -> S (IN REF. 5).
FT STRAND 89 93
FT TURN 99 100
FT TURN 109 110
FT STRAND 112 113
FT TURN 115 116
FT STRAND 118 119
FT STRAND 124 125

FT STRAND 130 143
FT STRAND 152 159
FT STRAND 166 170
FT STRAND 173 174
FT TURN 183 184
FT STRAND 189 202
FT TURN 204 205
FT STRAND 207 212
FT HELIX 215 217
FT STRAND 218 218
FT STRAND 227 232
SQ SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32;
MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//

Protein searching
Three levels of protein searching
1. Swissprot (annotated entries): little noise
2. Swissprot + TREMBL (all probable entries): more noisy
3. Translated EMBL, via tblast or tfasta (all possible entries): most noisy

New initiatives
• IPI: International Protein Index
– http://www.ebi.ac.uk/IPI/IPIhelp.html
• UNIPROT: Universal Protein
Knowledgebase
– http://www.pir.uniprot.org/
• HPRD: Human Protein Reference
Database
– http://www.hprd.org/

UniProt Consortium
• European Bioinformatics Institute (EBI)
• Swiss Institute of Bioinformatics (SIB)
• Protein Information Resource (PIR)
Uniprot Databases
•UniProt Knowledgebase (UniProtKB)
•UniProt Reference Clusters (UniRef)
•UniProt Archive (UniParc)
UniprotKB
•Swiss-Prot (annotated protein sequence db,
gold standard)
•TrEMBL (translated EMBL + automated
electronic annotations)
UniProt

Understanding molecular
structure is critical to the
understanding of biology
because structure
determines function

From Structure to Function
• the drug morphine has chemical groups that are functionally equivalent to the natural
endorphins found in the human body

• the receptor molecules
located at the synapse
(between two neurons)
bind morphine much the
same way as endorphins
• therefore, morphine is
able to attenuate the pain
response

Structure databases
Protein Data Bank (PDB)
Protein Data Bank - http://www.rcsb.org/pdb
Diffraction 7373 structures determined by X-ray diffraction
NMR 388 structures determined by NMR spectroscopy
Theoretical Model 201 structures proposed by modeling

• The PDB archives three-dimensional structures of
proteins and of some nucleic acids
• The PDB is operated by the RCSB (Research Collaboratory for
Structural Bioinformatics), funded by NSF, DOE, and
two units of NIH: NIGMS (National Institute of General
Medical Sciences) and NLM (National Library of Medicine).
• Established at BNL (Brookhaven National Laboratory) in
1971 as an archive for biological
macromolecular crystal structures
• In the 1980s, the number of deposited structures
began to increase dramatically.
• In October 1998, the management of the PDB
became the responsibility of the RCSB.
• Website: http://www.rcsb.org

PDB Holdings List: 27-Mar-2001
Rows: experimental technique; columns: molecule type
Exp. Tech.                    Proteins, Peptides,  Protein/Nucleic  Nucleic  Carbohydrates  total
                              and Viruses          Acid Complexes   Acids
X-ray Diffraction and other   11045                526              552      14             12137
NMR                           1832                 71               366      4              2273
Theoretical Modeling          281                  19               21       0              321
total                         13158                616              939      18             14731
5032 Structure Factor Files
968 NMR Restraint Files

Other structure databases
BioMagResBank http://www.bmrb.wisc.edu/
A Repository for Data from NMR Spectroscopy on Proteins, Peptides, and Nucleic
Acids
Biological Macromolecule Crystallization Database (BMCD)
http://h178133.carb.nist.gov:4400/bmcd/bmcd.html
Contains crystal data and the crystallization conditions, which have been compiled
from literature
Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/
Assembles and distributes structural information about nucleic acids
Structural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/
Structure similarity search. Hierarchic organization.
MOOSE http://db2.sdsc.edu/moose/
Macromolecular Structure Query
Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/
Small molecules.

Protein Splicing?
• Protein splicing is defined as the excision of
an intervening protein sequence (the
INTEIN) from a protein precursor and the
concomitant ligation of the flanking protein
fragments (the EXTEINS) to form a mature
extein protein and the free intein
• http://www.neb.com/inteins/intein_intro.html

Biological databases
• NAR Database Issue
– Every year: NAR DB Issue
– The 2006 update includes 858 databases
– Citation top 5 are:
• Pfam
• Gene Ontology
• UniProt
• SMART
• KEGG
– Primary Nucleotide DB’s and PDB are
not cited anymore

Why biological databases ?
• Explosive growth in biological data
• Data (sequences, 3D structures, 2D
gel analysis, MS analysis….) are no
longer published in a conventional
manner, but directly submitted to
databases
• Essential tools for biological research,
as classical publications used to be !

Problems with Flat files …
• Wasted storage space
• Wasted processing time
• Data control problems
• Problems caused by changes to data
structures
• Access to data difficult
• Data out of date
• Constraints are system based
• Limited querying eg. all single exon
GPCRs (<1000 bp)

Relational
• The relational model is not only very mature; a
strong body of knowledge has also developed on how to make a
relational back-end fast and reliable, and how to
exploit different technologies such as massive SMP,
optical jukeboxes, clustering, etc. Object
databases are nowhere near this, and I do not
expect them to get there in the short or medium term.
• Relational databases have a very well-known and
proven underlying mathematical theory, a simple one
(set theory), that makes possible
– automatic cost-based query optimization,
– schema generation from high-level models and
– many other features that are now vital for mission-critical
information systems development and operations.

• What is a relational database ?
– Sets of tables and links (the data)
– A language to query the database (Structured
Query Language)
– A program to manage the data (RDBMS)
• Flat files are not relational
– Data type (attribute) is part of the data
– Record order matters
– Multiline records
– Massive duplication
• E.g. Organism: Homo sapiens, Eukaryota, …
– Some records are hierarchical
• Xrefs
– Records contain multiple “sub-records”
– Implicit “Key”

The Benefits of Databases
• Redundancy can be reduced
• Inconsistency can be avoided
• Conflicting requirements can be
balanced
• Standards can be enforced
• Data can be shared
• Data independence
• Integrity can be maintained
• Security restrictions can be applied

Disadvantages
• size
• complexity
• cost
• Additional hardware costs
• Higher impact of failure
• Recovery more difficult

Relational Terminology
CUSTOMER Table (Relation)
ID NAME PHONE EMP_ID
201 Unisports 55-2066101 12
202 Simms Atheletics 81-20101 14
203 Delhi Sports 91-10351 14
204 Womansport 1-206-104-0103 11
Row (Tuple)
Column (Attribute)

Relational Database Terminology
• Each row of data in a table is uniquely identified by a primary key (PK)
• Information in multiple tables can be logically related by foreign keys (FK)
Table Name: EMP
ID LAST_NAME FIRST_NAME
10 Havel Marta
11 Magee Colin
12 Giljum Henry
14 Nguyen Mai
(ID is the primary key)
Table Name: CUSTOMER
ID NAME PHONE EMP_ID
201 Unisports 55-2066101 12
202 Simms Atheletics 81-20101 14
203 Delhi Sports 91-10351 14
204 Womansport 1-206-104-0103 11
(ID is the primary key; EMP_ID is a foreign key referring to EMP.ID)
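
The two tables can be tried out directly with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database rather than any of the products discussed in the course; the lower-case table and column names mirror the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE emp (
    id INTEGER PRIMARY KEY,
    last_name TEXT,
    first_name TEXT
);
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT,
    phone TEXT,
    emp_id INTEGER REFERENCES emp(id)
);
INSERT INTO emp VALUES
    (10, 'Havel', 'Marta'), (11, 'Magee', 'Colin'),
    (12, 'Giljum', 'Henry'), (14, 'Nguyen', 'Mai');
INSERT INTO customer VALUES
    (201, 'Unisports', '55-2066101', 12),
    (202, 'Simms Atheletics', '81-20101', 14),
    (203, 'Delhi Sports', '91-10351', 14),
    (204, 'Womansport', '1-206-104-0103', 11);
""")

# Follow the foreign key from each customer to the responsible employee.
customer_emp = conn.execute("""
    SELECT customer.name, emp.last_name
    FROM customer
    INNER JOIN emp ON customer.emp_id = emp.id
    ORDER BY customer.id
""").fetchall()
```

Inserting a customer whose emp_id does not exist in emp is rejected with an IntegrityError, which is the integrity guarantee a flat file cannot give.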

• RDBMS products
– Free
• MySQL: very fast, widely used, easy to
jump into, but limited, non-standard SQL
• PostgreSQL: full SQL, limited OO,
steeper learning curve than MySQL
– Commercial
• MS Access: great query builder, GUI
interfaces
• MS SQL Server: full SQL, NT only
• Oracle: everything, including the kitchen
sink
• IBM DB2, Sybase

A simple datamodel (tables and relations)
Prot_id name seq Species_id
1 GTM1_HUMAN MGTDHG… 1
2 GTM1_RAT MGHJADSW.. 2
3 GTM2_HUMAN MVSDBSVD.. 1

Species_id name Full Lineage
1 human Homo sapiens …
2 rat Rattus rattus
Relational Database Fundamentals
• Basic SQL
– SELECT
– FROM
– WHERE
– JOIN – NATURAL, INNER, OUTER
• Other SQL functions
– COUNT()
– MAX(), MIN(), AVG()
– DISTINCT
– ORDER BY
– GROUP BY
– LIMIT
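
As an illustration of these keywords, the sketch below loads the protein and species tables of the simple data model into an in-memory SQLite database via Python's sqlite3 module and combines JOIN, COUNT(), MAX(), GROUP BY, ORDER BY and LIMIT in one query. The sequence strings are only the truncated fragments visible on the slide, used as placeholders, so the lengths are not biologically meaningful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein (
    prot_id INTEGER PRIMARY KEY,
    name TEXT,
    seq TEXT,
    species_id INTEGER REFERENCES species(species_id)
);
INSERT INTO species VALUES (1, 'human'), (2, 'rat');
-- seq values are truncated placeholders taken from the slide
INSERT INTO protein VALUES
    (1, 'GTM1_HUMAN', 'MGTDHG', 1),
    (2, 'GTM1_RAT', 'MGHJADSW', 2),
    (3, 'GTM2_HUMAN', 'MVSDBSVD', 1);
""")

# Number of proteins and longest stored sequence per species.
per_species = conn.execute("""
    SELECT species.name, COUNT(*) AS n_proteins, MAX(length(protein.seq)) AS longest
    FROM protein
    INNER JOIN species ON protein.species_id = species.species_id
    GROUP BY species.name
    ORDER BY n_proteins DESC
    LIMIT 5
""").fetchall()

# DISTINCT removes duplicate values from the result.
distinct_species = conn.execute(
    "SELECT DISTINCT species_id FROM protein ORDER BY species_id"
).fetchall()
```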

• Query: a request to retrieve data from
a database is called
a query
• eg. MyGPCRdb
– Bioentry
– Taxid (include full lineage)
– Linking table (bioentry_tax)

MyGPCR:
Give me all GPCRs that are shorter than 1000 bp
select * from bioentry;
select count(*) from bioentry;
select * from bioentry inner join biosequence on
bioentry.bioentry_id=biosequence.bioentry_id ;
select * from bioentry inner join biosequence on
bioentry.bioentry_id=biosequence.bioentry_id
where length(biosequence_str)<1000;
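
The last query can be tried on toy data with Python's sqlite3 module. The sketch assumes a heavily reduced version of the two tables (only the columns used in the queries above) and two made-up entries; it is not the real MyGPCRdb schema or content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE biosequence (
    bioentry_id INTEGER REFERENCES bioentry(bioentry_id),
    biosequence_str TEXT
);
""")

# Hypothetical entries: one 300 bp and one 1500 bp sequence.
toy_entries = [(1, "GPCR_SHORT", "ATG" * 100), (2, "GPCR_LONG", "ATG" * 500)]
for bioentry_id, name, sequence in toy_entries:
    conn.execute("INSERT INTO bioentry VALUES (?, ?)", (bioentry_id, name))
    conn.execute("INSERT INTO biosequence VALUES (?, ?)", (bioentry_id, sequence))

# Same join and filter as on the slide, selecting two columns for readability.
short_entries = conn.execute("""
    SELECT bioentry.name, length(biosequence_str)
    FROM bioentry
    INNER JOIN biosequence ON bioentry.bioentry_id = biosequence.bioentry_id
    WHERE length(biosequence_str) < 1000
""").fetchall()

total_entries = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

Only the 300 bp entry passes the length filter, which is the kind of question that is awkward to answer with flat files alone.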

Example 3-tier model in biological database
Example of different interface to the same back-end database (MySQL)
http://www.bioinformatics.be

Overview
• DataBases
– FF
• *.txt
• Indexed version
– Relational (RDBMS)
• Access, MySQL,
PostGRES, Oracle
– OO (OODBMS)
• AceDB, ObjectStore
– Hierarchical
• XML
– Frame based system
• Eg. DAML+OIL
– Hybrid systems

Object
• The Object paradigm is already proven for application design and
development, but it may simply not be an adequate paradigm for
the data store.
• Object databases are modelled by graphs. Graph theory plays a
great role in computer science, but it is also a great source of
intractable problems, the NP-complete class: problems for which
no computationally efficient solution is known, so there is no known way to
escape from exponential complexity. This is not a current
technological limit. It is a limit inherent to the problem domain.
• Hybrid Object-Relational databases will probably be the long term
solution for the industry. They put a thin object layer above the
relational structure, thus providing a syntax and semantics closer to
the object oriented design and programming tools. They simply
make it easier to build the data layer classes

Conclusions
• A database is a central component of any
contemporary information system
• The operations on the database and the maintenance
of database consistency are handled by a DBMS
• There exist stand-alone query languages and
embedded languages, but both deal with definition
(DDL) and manipulation (DML) aspects
• The structural properties, constraints and operations
permitted within a DBMS are defined by a data
model: hierarchical, network, relational
• Recovery and concurrency control are essential
• Linking of heterogeneous data sources is a central theme
in modern bioinformatics

• How do you know which database
exists ?
• NAR list
• Web links on Nexus
– Searchable
– Maintainable

• Tools available in public domain for
simultaneous access
– entrez
– srs
• Batch queries for offload in local
databases for subsequent analysis
(see further)

• What if you want to search the
complete human genome (golden path
coordinates) instead of separate NCBI
entries ?
• ENSEMBL

BioMart
• Joint project between EBI and CSHL,
http://www.biomart.org/
• Aim is to develop a generic, query-oriented data
management system capable of integrating
distributed data sources
• 3 step system:
– Start by selecting a dataset to query
– Filter this dataset by applying the appropriate filters
– Generate the output by selecting the attributes and output
format
• Available public biomart websites:
http://www.biomart.org/biomart/martview

BioMart - Single access point - Generic interface

BioMart - ‘Out of the box’ website

BioMart – 3 step system
Dataset
Attribute
Filter

BioMart - 3 step system
Dataset: all Ensembl genes
Filter: located on chromosome 1, expressed in
lung, associated with human
homologues
Attribute: name, chromosome position,
description

BioMart - EnsMart
• The first in line was EnsMart, a powerful data
mining toolset for retrieving customized data sets
from annotated genomes. EnsMart integrates data
from Ensembl and various worldwide data sources.
• EnsMart provides ....
– Gene and protein annotation
– Disease information
– Cross-species analyses
– SNPs affecting proteins
– Allele frequency data
– Retrieval by external identifiers
– Retrieval by Gene Ontology
– Customized sequence datasets
– Microarray annotation tools

Other BioMart implementations
• Other data resources also implemented
a BioMart interface:
– Wormbase
– Gramene
– HapMap
– DictyBase
– euGenes

BioBar
• A toolbar for browsing biological data
and databases
http://biobar.mozdev.org/
• The following databases are included
http://biobar.mozdev.org/Databases.html
• a toolbar for Mozilla-based browsers
including Firefox and Netscape 7+

Weblems
Weblems Online (example posting)
W2.1. Which isolate of Tabac was used in record accession
Z71230, and human sample in the genbank entry with
accession AJ311677 ?
W2.2: Find all structures of GFP in the Protein Data Bank and
draw a histogram of their dates of deposition ?
W2.3: What is the chromosomal location of the human gene for
insulin ?
W2.4: How many different human NHRs (nuclear hormone
receptors) exist ? How many of these are single exon genes
? Are there any drugs working on this class of receptors ?
W2.5: The gene for Berardinelli-Seip syndrome was initially
localized between two markers on chromosome band 11q13-
D11S4191 and D11S987.
a. How many base pairs are there in the interval between
these two markers ?
b. How many known genes are there ?
c. List the gene ontology terms for that region ?
