FBW 
07-10-2014 
Wim Van Criekinge
Class on 4 November; NO class on 18 November
Outline 
• Molecular Biology 
• Flat files “sequence” databases 
– DNA 
– Protein 
– Structure 
• Relational Databases 
– What ? 
– Why ? 
• Biological Relational Databases 
– Howto ?
Flat Files 
What is a “flat file” ? 
• A “flat file” is data stored in a plain,
ordinary text file on the hard disk
• Example RefSEQ 
– See CD-ROM 
– FILE: hs.GBFF 
• Hs: Homo Sapiens 
• GBFF: Genbank File Format 
• (open it in a text editor such as TextPad; use a
monospaced font, e.g. Courier)
Sequence entries 
gene 10317..12529 /gene="ZK822.4" 
CDS join(10317..10375,10714..10821,10874..10912,10960..11013, 
11061..11114,11169..11222,11346..11739,11859..11912, 
11962..12195,12242..12529) 
/gene="ZK822.4" /codon_start=1 
/protein_id="CAA98068.1" 
/db_xref="PID:g3881817" 
/db_xref="GI:3881817" 
/db_xref="SPTREMBL:Q23615" 
/translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS 
FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT 
IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY 
TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN 
VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ 
GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG 
NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF 
KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"
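The `join(...)` location in the CDS feature above lists the exons that make up the coding sequence. The following is a minimal Python sketch, assuming only plain `start..end` segments (no `complement`, `<`/`>` markers or remote references); the function names `parse_join` and `spliced_length` are illustrative and not part of any GenBank tool.

```python
import re

def parse_join(location):
    """Return (start, end) pairs from a simple join(...) location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def spliced_length(location):
    """Total number of bases in all segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in parse_join(location))

# Location copied from the ZK822.4 CDS feature shown above.
cds = ("join(10317..10375,10714..10821,10874..10912,10960..11013,"
       "11061..11114,11169..11222,11346..11739,11859..11912,"
       "11962..12195,12242..12529)")

segments = parse_join(cds)
print(len(segments), "exons,", spliced_length(cds), "bp")
```

For this feature the ten segments add up to 1338 bp, a multiple of three, as expected for a complete CDS with /codon_start=1.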
Nucleotide Databases 
EMBL Nucleotide Sequence Database (European Molecular Biology 
Laboratory) http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html 
GenBank at NCBI (National Center for Biotechnology Information) 
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html 
DDBJ (DNA Database of Japan) http://www.ddbj.nig.ac.jp/ 
DDBJ, the Center for operating DDBJ, National Institute of Genetics (NIG), Japan, established in
April 1995.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html 
Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) 
Genetic Sequence Data Bank - August 15 2003 
NCBI-GenBank Flat File Release 137.0 
Distribution Release Notes 
33 865 022 251 bases, from 27 213 748 reported sequences
GenBank Format 
LOCUS LISOD 756 bp DNA BCT 30-JUN-1993 
DEFINITION L.ivanovii sod gene for superoxide dismutase. 
ACCESSION X64011.1 GI:37619753 
NID g44010 
KEYWORDS sod gene; superoxide dismutase. 
SOURCE Listeria ivanovii. 
ORGANISM Listeria ivanovii 
Eubacteria; Firmicutes; Low G+C gram-positive bacteria; 
Bacillaceae; Listeria. 
REFERENCE 1 (bases 1 to 756) 
AUTHORS Haas,A. and Goebel,W. 
TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii 
by functional complementation in Escherichia coli and 
characterization of the gene product 
JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992) 
MEDLINE 92140371 
REFERENCE 2 (bases 1 to 756) 
AUTHORS Kreft,J. 
TITLE Direct Submission 
JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie, 
Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 
Wuerzburg, FRG
FEATURES Location/Qualifiers 
source 1..756 
/organism="Listeria ivanovii" 
/strain="ATCC 19119" 
/db_xref="taxon:1638" 
RBS 95..100 
/gene="sod" 
gene 95..746 
/gene="sod" 
CDS 109..717 
/gene="sod" 
/EC_number="1.15.1.1" 
/codon_start=1 
/product="superoxide dismutase" 
/db_xref="PID:g44011" 
/db_xref="SWISS-PROT:P28763" 
/transl_table=11 
/translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL 
NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN 
GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS 
TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF 
DAAK" 
terminator 723..746 
/gene="sod"
Example of location descriptors
Location        Description
476             Points to a single base in the presented sequence
340..565        Points to a continuous range of bases bounded by and including the starting and ending bases
<345..500       The exact lower boundary point of a feature is unknown
(102.110)       Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110
(23.45)..600    Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end point is base 600
123^124         Points to a site between bases 123 and 124
145^177         Points to a site anywhere between bases 145 and 177
J00193:hladr    Points to a feature whose location is described in another entry: the feature labeled 'hladr' in the entry (in this database) with primary accession 'J00193'
BASE COUNT 247 a 136 c 151 g 222 t 
ORIGIN 
1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat 
61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa 
121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg 
181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca 
241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt 
301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta 
361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca 
421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg 
481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt 
541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat 
601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca 
661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta 
721 tcgaaaggct cacttaggtg ggtcttttta tttcta 
//
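The ORIGIN block stores the sequence in numbered lines of 10-base groups, so a program has to strip the position counters and spaces before using it. A minimal Python sketch, assuming this exact layout; `origin_to_sequence` is an illustrative name, and only the first ORIGIN line of the record is used here to keep the example short.

```python
from collections import Counter

def origin_to_sequence(lines):
    """Join GenBank ORIGIN lines into one string, dropping counters and spaces."""
    blocks = []
    for line in lines:
        parts = line.split()
        if not parts or parts[0] == "//":
            continue
        # The first field is the position counter; the rest are 10-base blocks.
        blocks.extend(parts[1:])
    return "".join(blocks)

origin = [
    "1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat",
    "//",
]
sequence = origin_to_sequence(origin)
counts = Counter(sequence)
print(len(sequence), dict(counts))
```

Run on the full record, the per-base counts could be compared with the BASE COUNT line (247 a + 136 c + 151 g + 222 t = 756, the length given on the LOCUS line).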
EMBL format 
ID LISOD standard; DNA; PRO; 756 BP. IDentification 
XX 
AC X64011; S78972; Accession (Axxxxx, Afxxxxxx), GUID 
XX 
NI g44010 Nucleotide Identifier --> x.x 
XX 
DT 28-APR-1992 (Rel. 31, Created) DaTe 
DT 30-JUN-1993 (Rel. 36, Last updated, Version 6) 
XX 
DE L.ivanovii sod gene for superoxide dismutase DEscription 
XX
KW sod gene; superoxide dismutase. KeyWord 
XX 
OS Listeria ivanovii Organism Species 
OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae; 
OC Listeria. Organism Classification 
XX 
RN [1] 
RA Haas A., Goebel W.; Reference 
RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by 
RT functional complementation in Escherichia coli and 
RT characterization of the gene product."; 
RL Mol. Gen. Genet. 231:313-322(1992). 
XX
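Because every EMBL line starts with a two-letter line code, a record can be split into fields with very little code. A minimal Python sketch, assuming the slide annotations (such as "IDentification") have been removed from the lines; `parse_embl_lines` is an illustrative helper, not an existing library function.

```python
from collections import defaultdict

def parse_embl_lines(lines):
    """Group EMBL lines by their two-letter line code, skipping XX spacers."""
    fields = defaultdict(list)
    for line in lines:
        code, _, value = line.partition(" ")
        code = code.strip()
        if code in ("XX", "//", ""):
            continue
        fields[code].append(value.strip())
    return fields

record = [
    "ID LISOD standard; DNA; PRO; 756 BP.",
    "XX",
    "AC X64011; S78972;",
    "XX",
    "OS Listeria ivanovii",
    "OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;",
    "OC Listeria.",
]
fields = parse_embl_lines(record)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
print(accessions)
print(" ".join(fields["OC"]))
```

Repeated codes (here the two OC lines) are kept in order, so multi-line fields can be joined afterwards.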
GenBank,EMBL & DDBJ: Comments 
• Collaboration Genbank/EMBL/DDBJ 
– Effort: Identical within 24 hours 
• Redundant information 
• Historical graveyard 
– BANKIT (responsibility of the submitter)
– Version conflicts
• IDIOSYNCRATIC (peculiar to the
individual)
– Heterogeneous annotation
– No consistent quality check
• Vectors, sequence errors etc.
Other Genbank Formats 
• ASN1 
– Computer friendly, human unfriendly 
• FASTA 
– Brief, loses information 
– Easy to use 
– Compatible with multiple sequences
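FASTA is simple enough to read with a few lines of code: a header line starting with `>` followed by one or more sequence lines, repeated for each sequence. A minimal Python sketch; the two records are made-up examples and `read_fasta` is an illustrative name.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 hypothetical example
ATGCATGC
ATTCATGC
>seq2 another hypothetical example
GGCC
"""
records = read_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

Everything except the header text and the residues is lost, which is the "brief, loses information" trade-off noted above.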
Web Query tools & Programming Query tools 
• NCBI website example: 
– http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
• EBI UniProtKB website example: 
– http://www.ebi.ac.uk/uniprot/index.html 
– http://www.ebi.uniprot.org/search/SearchTools.shtml
batch download (ftp server) 
• Data available via a website is usually
also available from an FTP server,
where the complete set can be
downloaded as a batch.
• Examples: 
–ftp://ftp.ncbi.nih.gov/ 
–ftp://ftp.ebi.ac.uk/pub/
Sequence file format tips 
• When saving a sequence for use in an email 
message or pasting into a web page, use an 
unannotated text format such as FASTA 
• When retrieving from a database or 
exchanging between programs, use an 
annotated text format such as Genbank 
• When using sequence again with the same 
program, use that program’s annotated binary 
format (or annotated text if binary not 
available) 
– Asn-1 (NCBI) 
– Gbff (sanger) 
– XML
Expressed Sequence Tags 
• Sequence that codes for protein is < 5% of the 
genome. 
• Coding sequence can be obtained from mRNA by 
reverse transcription. 
• Tags for that sequence can be obtained by end-sequencing. 
• Incyte and HGS gambled on this being the useful 
part: 
– Search for homologies to known proteins, motifs. 
– Search for changed levels of expression and tissue specificity 
(“virtual/electronic northern” used in GeneCards) 
• ESTs have driven the huge expansion of GenBank: 
– Unigene now contains some sequence from most genes. 
– > 4,000,000 human est sequences 
– http://www.ncbi.nlm.nih.gov/dbEST/
dbEST release 100303 Summary by Organism - October 3, 2003 
Number of public entries: 18,762,324 
Homo sapiens (human) 5,426,001 
Mus musculus + domesticus (mouse) 3,881,878 
Rattus sp. (rat) 538,073 
Triticum aestivum (wheat) 500,898 
Ciona intestinalis 492,488 
Gallus gallus (chicken) 451,565 
Zea mays (maize) 383,416 
Danio rerio (zebrafish) 362,362 
Hordeum vulgare + subsp. vulgare (barley) 348,233 
Xenopus laevis (African clawed frog) 344,695 
Glycine max (soybean) 341,573 
Bos taurus (cattle) 322,074 
Drosophila melanogaster (fruit fly) 261,404
Traces <-> strings 
• Traces contain much more information 
– TraceDB: http://www.ncbi.nlm.nih.gov/Traces/ 
Example
Traces <-> strings 
• Phred
– base calling, vector trimming, end of sequence
read trimming
• Phrap
– Phrap uses Phred’s base calling scores to
determine the consensus sequences. Phrap
examines all individual sequences at a given
position, and uses the highest scoring sequence
(if it exists) to extend the consensus sequence
• Consed
– graphical interface extension that controls both
Phred and Phrap
What is Phred? 
• Phred is a program that reads the base trace, makes
base calls, and assigns a quality value (qv) to each base in the
sequence.
• It then writes the base calls and qv to output files that are
used for Phrap assembly. The qv are useful for
consensus sequence construction.
• For example:
ATGCATGC string1
ATTCATGC string2
AT-CATGC superstring
• Here there is a mismatch between ‘G’ and ‘T’; the qv
determine what fills the dash in the superstring: the base with the higher
qv replaces the dash.
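The string1/string2 example can be sketched in Python as a column-wise choice between two aligned reads. This is a deliberately simplified illustration of "the higher qv wins", not Phrap's actual assembly algorithm; the quality values are hypothetical.

```python
def consensus(read1, qual1, read2, qual2):
    """Column-wise consensus of two aligned, equal-length reads.

    Where the reads disagree, the base with the higher quality value wins
    (ties go to the first read).
    """
    if not (len(read1) == len(qual1) == len(read2) == len(qual2)):
        raise ValueError("reads and quality lists must have the same length")
    out = []
    for b1, q1, b2, q2 in zip(read1, qual1, read2, qual2):
        if b1 == b2 or q1 >= q2:
            out.append(b1)
        else:
            out.append(b2)
    return "".join(out)

string1 = "ATGCATGC"
string2 = "ATTCATGC"
# Hypothetical quality values; the reads differ only at the third base.
qv1 = [40, 40, 12, 40, 40, 40, 40, 40]
qv2 = [40, 40, 35, 40, 40, 40, 40, 40]
superstring = consensus(string1, qv1, string2, qv2)
print(superstring)
```

With these values the 'T' from string2 (qv 35) replaces the dash, because the 'G' in string1 has only qv 12.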
How does Phred calculate qv?
• From the base trace, Phred knows the number of peaks
and the actual peak locations.
• Phred predicts peak locations.
• Phred reads the actual peak locations from the base
trace.
• Phred matches the actual locations with the
predicted locations using dynamic
programming.
• The qv is related to the base call error probability
(ep) by the formula qv = -10 * log_10(ep)
• Example: an error probability of 1 in 10,000 gives qv = 40
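The relation between qv and error probability is easy to check numerically. A small Python sketch of the formula qv = -10 * log_10(ep) and its inverse; the function names are illustrative.

```python
import math

def error_prob_to_qv(ep):
    """Quality value for a base-call error probability ep (0 < ep <= 1)."""
    return -10 * math.log10(ep)

def qv_to_error_prob(qv):
    """Inverse relation: error probability for a given quality value."""
    return 10 ** (-qv / 10)

for ep in (1 / 10, 1 / 100, 1 / 1000, 1 / 10000):
    print(f"ep = {ep:g}  ->  qv = {error_prob_to_qv(ep):.0f}")
```

Every additional 10 qv units correspond to a tenfold lower error probability, so qv 20 means 1 error in 100 and qv 40 means 1 error in 10,000.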
Why Phred? 
• Output sequence might contain 
errors. 
• Vector contamination might occur. 
• Dye-terminator reaction might not 
occur. 
• Segment migration abnormal in 
gel electrophoresis. 
• Weak or variable signal strength 
of peak corresponding to a base.
Vector Trimming
End of Sequence Cropping 
• The ends of sequencing reads commonly
contain poor data. This is due to the difficulty of
resolving larger fragments of ~1 kb (it is easier to
resolve 21 bp from 20 bp than it is to resolve
1001 bp from 1000 bp).
• Phred assigns a non-value of ‘x’ to this data by
comparing peak separation and peak intensity to
internal standards. If the standard threshold score
is not reached, the data will not be used.
Traces <-> strings 
• Handle traces 
– Abi-view EMBOSS 
– Bioedit 
– Acembly, … 
• EXAMPLE
NCBI reference sequences 
The RefSeq database is a non-redundant set of
reference standards that includes 
chromosomes, complete genomic molecules, 
intermediate assembled genomic contigs, 
curated genomic regions, mRNAs, RNAs, and 
proteins.
RefSeq nomenclature 
NC_#### complete genomic 
NG_#### incomplete genomic 
NM_#### mRNA
NR_#### noncoding transcripts 
NP_#### proteins 
NT_#### intermediate genomic contigs
RefSeq nomenclature - models 
XM_#### mRNA 
XR_#### RNA 
XP_#### protein 
Automated Homo sapiens models provided by 
the Genome Annotation process; sequence 
corresponds to the genomic contig.
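The accession prefix alone tells which kind of RefSeq record an identifier refers to. A minimal Python lookup built from the two nomenclature tables above; the accession numbers in the example are hypothetical placeholders and `classify_refseq` is an illustrative name.

```python
# Prefix meanings as listed in the nomenclature tables above.
REFSEQ_PREFIXES = {
    "NC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "noncoding transcript",
    "NP": "protein",
    "NT": "intermediate genomic contig",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}

def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())

# Hypothetical accession numbers, used only to exercise the lookup.
for acc in ("NM_123456", "XP_123456", "X64011"):
    print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as X64011 has no underscore and is therefore reported as not being a RefSeq record.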
Open reading frame 
• Definition: 
– A stretch of triplet codons with an initiator 
codon at one end and a stop codon at the other,
as identifiable by nucleotide sequences. 
• Example 
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
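The definition above translates directly into a scan for an initiator codon followed by the first in-frame stop codon. A minimal Python sketch, assuming the standard stop codons, ATG as the only initiator and the forward strand only; `find_orfs` and the test sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end) 0-based, end-exclusive ORF coordinates.

    An ORF here starts at ATG and runs through the first in-frame stop
    codon. Only the three forward reading frames are scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Number of codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Made-up sequence with one short ORF: ATG AAA TTT TAG
dna = "CCATGAAATTTTAGGG"
orfs = find_orfs(dna)
for start, end in orfs:
    print(start, end, dna[start:end])
```

A real ORF finder would also scan the reverse complement and allow alternative genetic codes (the sod record above, for example, uses /transl_table=11).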
Protein sequence database 
SWISS-PROT & TREMBL 
SwissProt - http://expasy.hcuge.ch/sprot/
• SWISS-PROT is an annotated protein sequence database
• The sequences are translated from the EMBL Nucleotide Sequence Database
• Sequence entries are composed of different lines.
For standardization purposes the format of SWISS-PROT follows as
closely as possible that of the EMBL Nucleotide Sequence Database.
• Continuously updated (daily).
Different Features of SWISS-PROT 
• Format follows as closely as 
possible that of EMBL’s 
• Curated protein sequence database 
• Three differences: 
1. Strives to provide a high level of 
annotations 
2. Minimal level of redundancy 
3. High level of integration with 
other databases
Three Distinct Criteria
1. Annotation
Core data: the sequence data, the citation
information (bibliographical
references) and the taxonomic data
(description of the biological source of
the protein).
Annotation: protein functions, post-translational
modifications, domains and
sites, secondary structure, quaternary
structure, similarities to other
proteins, diseases associated with
deficiencies in the protein, sequence
conflicts, variants, etc.
2. Minimal Redundancy 
Many sequence databases contain, for a
given protein sequence, separate
entries which correspond to
different literature reports. SWISS-PROT
tries as much as possible to
merge all these data so as to
minimize the redundancy. If
conflicts exist between various
sequencing reports, they are
indicated in the feature table of the
corresponding entry.
3. Integration With Other Databases 
• SWISS-PROT and TrEMBL - Protein 
sequences 
• PROSITE - Protein families and domains 
• SWISS-2DPAGE - Two-dimensional 
polyacrylamide gel electrophoresis 
• SWISS-3DIMAGE - 3D images of proteins 
and other biological macromolecules 
• SWISS-MODEL Repository - Automatically 
generated protein models 
• CD40Lbase - CD40 ligand defects 
• ENZYME - Enzyme nomenclature 
• SeqAnalRef - Sequence analysis bibliographic 
references
TREMBL - http://expasy.hcuge.ch/sprot/
• Translated EMBL sequences not (yet) in
Swissprot.
• Updated faster than SWISS-PROT.
TREMBL - two parts
1. SP-TREMBL
• Will eventually be incorporated into Swissprot
• Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN,
PRO, ROD, UNC, VRL and VRT.
2. REM-TREMBL (remaining)
• Will NOT be incorporated into Swissprot
• Divided into: immunoglobulins and T-cell receptors, synthetic
sequences, patent application sequences, small fragments, CDS
not coding for real proteins
SWISS-PROT/TrEMBL 
• TrEMBL is a computer-annotated 
supplement of SWISS-PROT that contains 
all the translations of EMBL nucleotide 
sequence entries not yet integrated in 
SWISS-PROT 
• SWISS-PROT Release 39.15 of 19-Mar-2001:
94,152 entries
• TrEMBL Release 16.2 of 23-Mar-2001:
436,924 entries
Example of a SwissProt entry 
ID TNFA_HUMAN STANDARD; PRT; 233 AA. IDentification 
AC P01375; ACcession 
DT 21-JUL-1986 (REL. 01, CREATED) DaTe 
DT 21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE) 
DT 15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE) 
DE TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN). 
GN TNFA. Gene name 
OS HOMO SAPIENS (HUMAN). Organism Species 
OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA; 
OC EUTHERIA; PRIMATES. Organism Classification 
RN [1] Reference 
RP SEQUENCE FROM N.A. 
RX MEDLINE; 87217060. 
RA NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A., 
RA AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N., 
RA FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A., 
RA CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.; 
RL COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986). 
RN [2] 
RP SEQUENCE FROM N.A. 
RX MEDLINE; 85086244. 
RA PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R., 
RA PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.; 
RL NATURE 312:724-729(1984). 
...
CC -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN 
CC CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED 
CC IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING 
CC FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT 
CC CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION 
CC UNDER CERTAIN CONDITIONS. Comments 
CC -!- SUBUNIT: HOMOTRIMER. 
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS 
CC AN EXTRACELLULAR SOLUBLE FORM. 
CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY 
CC PROTEOLYTIC PROCESSING. 
CC -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING 
CC CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL 
CC HEALTH AND MALNUTRITION. 
CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY. 
DR EMBL; X02910; G37210; -. Database Cross-references 
DR EMBL; M16441; G339741; -. 
DR EMBL; X01394; G37220; -. 
DR EMBL; M10988; G339738; -. 
DR EMBL; M26331; G339764; -. 
DR EMBL; Z15026; G37212; -. 
DR PIR; B23784; QWHUN. 
DR PIR; A44189; A44189. 
DR PDB; 1TNF; 15-JAN-91. 
DR PDB; 2TUN; 31-JAN-94.
KW CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR; 
KW MYRISTYLATION; 3D-STRUCTURE. KeyWord 
FT PROPEP 1 76 Feature Table 
FT CHAIN 77 233 TUMOR NECROSIS FACTOR. 
FT TRANSMEM 36 56 SIGNAL-ANCHOR (TYPE-II PROTEIN). 
FT LIPID 19 19 MYRISTATE. 
FT LIPID 20 20 MYRISTATE. 
FT DISULFID 145 177 
FT MUTAGEN 105 105 L->S: LOW ACTIVITY. 
FT MUTAGEN 108 108 R->W: BIOLOGICALLY INACTIVE. 
FT MUTAGEN 112 112 L->F: BIOLOGICALLY INACTIVE. 
FT MUTAGEN 162 162 S->F: BIOLOGICALLY INACTIVE. 
FT MUTAGEN 167 167 V->A,D: BIOLOGICALLY INACTIVE. 
FT MUTAGEN 222 222 E->K: BIOLOGICALLY INACTIVE. 
FT CONFLICT 63 63 F -> S (IN REF. 5). 
FT STRAND 89 93 
FT TURN 99 100 
FT TURN 109 110 
FT STRAND 112 113 
FT TURN 115 116 
FT STRAND 118 119 
FT STRAND 124 125
FT STRAND 130 143 
FT STRAND 152 159 
FT STRAND 166 170 
FT STRAND 173 174 
FT TURN 183 184 
FT STRAND 189 202 
FT TURN 204 205 
FT STRAND 207 212 
FT HELIX 215 217 
FT STRAND 218 218 
FT STRAND 227 232 
SQ SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32; 
MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR 
EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR 
DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE 
TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL 
//
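The SQ line declares the sequence length, which gives a cheap consistency check when parsing an entry. A minimal Python sketch that reads only the SQ block of the TNFA_HUMAN entry above, assuming the layout shown (an SQ header line, residue lines in blocks of ten, `//` terminator); `parse_sq_block` is an illustrative name.

```python
import re

def parse_sq_block(lines):
    """Return (declared_length, sequence) from the SQ section of an entry."""
    declared = None
    residues = []
    in_sequence = False
    for line in lines:
        if line.startswith("SQ"):
            match = re.search(r"(\d+)\s+AA", line)
            declared = int(match.group(1)) if match else None
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            residues.append(line.replace(" ", ""))
    return declared, "".join(residues)

entry = [
    "SQ SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32;",
    "MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR",
    "EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR",
    "DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE",
    "TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL",
    "//",
]
declared, seq = parse_sq_block(entry)
print(declared, len(seq), declared == len(seq))
```

The same length also matches the end coordinate of the CHAIN feature (77 to 233) in the feature table.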
Protein searching 
3-levels of Protein Searching 
1. Swissprot: little noise (annotated entries)
2. Swissprot + TREMBL: more noisy (all probable entries)
3. Translated EMBL (tblast or tfasta): most noisy (all possible entries)
New initiatives
• IPI: International Protein Index
– http://www.ebi.ac.uk/IPI/IPIhelp.html
• UNIPROT: Universal Protein 
Knowledgebase 
– http://www.pir.uniprot.org/ 
• HPRD: Human Protein Reference 
Database 
– http://www.hprd.org/
UniProt Consortium 
• European Bioinformatics Institute (EBI) 
• Swiss Institute of Bioinformatics (SIB) 
• Protein Information Resource (PIR) 
Uniprot Databases 
•UniProt Knowledgebase (UniProtKB) 
•UniProt Reference Clusters (UniRef) 
•UniProt Archive (UniParc) 
UniprotKB 
•Swiss-Prot (annotated protein sequence db,
gold standard)
•TrEMBL (translated EMBL + automated
electronic annotations)
UniProt
Understanding molecular
structure is critical to the
understanding of biology
because structure
determines function
From Structure to Function 
• the drug morphine has chemical groups that are functionally equivalent to the natural 
endorphins found in the human body
• the receptor molecules 
located at the synapse 
(between two neurons) 
bind morphine much the 
same way as endorphins 
• therefore, morphine is 
able to attenuate the pain 
response 
Structure databases 
Protein Data Bank (PDB) 
Protein Data Bank - http://www.rcsb.org/pdb 
Diffraction 7373 structures determined by X-ray diffraction 
NMR 388 structures determined by NMR spectroscopy 
Theoretical Model 201 structures proposed by modeling
• The PDB holds three-dimensional structures of
proteins and of some nucleic acids
• PDB is operated by RCSB (Research Collaboratory for
Structural Bioinformatics), funded by NSF, DOE, and
two units of NIH: NIGMS (National Institute of General
Medical Sciences) and NLM (National Library of Medicine).
• Established at BNL (Brookhaven National Laboratory) in
1971, as an archive for biological
macromolecular crystal structures
• In 1980s, the number of deposited structures 
began to increase dramatically. 
• October 1998, the management of the PDB 
became the responsibility of RCSB. 
• Website http://www.rcsb.org
PDB Holdings List: 27-Mar-2001
Molecule type (columns) by experimental technique (rows)
Exp. Tech.                    Proteins, Peptides,   Protein/Nucleic   Nucleic   Carbohydrates   Total
                              and Viruses           Acid Complexes    Acids
X-ray Diffraction and other   11045                 526               552       14              12137
NMR                           1832                  71                366       4               2273
Theoretical Modeling          281                   19                21        0               321
Total                         13158                 616               939       18              14731
5032 Structure Factor Files
968 NMR Restraint Files
PDB Content Growth
PDB Growth in New Folds
Other structure databases 
BioMagResBank http://www.bmrb.wisc.edu/ 
A Repository for Data from NMR Spectroscopy on Proteins, Peptides, and Nucleic 
Acids 
Biological Macromolecule Crystallization Database (BMCD) 
http://h178133.carb.nist.gov:4400/bmcd/bmcd.html 
Contains crystal data and the crystallization conditions, which have been compiled 
from literature 
Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/ 
Assembles and distributes structural information about nucleic acids 
Structural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/ 
Structure similarity search. Hierarchic organization. 
MOOSE http://db2.sdsc.edu/moose/ 
Macromolecular Structure Query 
Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/ 
Small molecules.
Protein Splicing? 
• Protein splicing is defined as the excision of 
an intervening protein sequence (the 
INTEIN) from a protein precursor and the 
concomitant ligation of the flanking protein 
fragments (the EXTEINS) to form a mature 
extein protein and the free intein 
• http://www.neb.com/inteins/intein_intro.html
Biological databases 
• NAR Database Issue 
– Every year: NAR DB Issue 
– The 2006 update includes 858 databases 
– Citation top 5 are: 
• Pfam 
• Gene Ontology 
• UniProt 
• SMART 
• KEGG 
– Primary Nucleotide DB’s and PDB are 
not cited anymore
Why biological databases ? 
• Explosive growth in biological data 
• Data (sequences, 3D structures, 2D 
gel analysis, MS analysis….) are no 
longer published in a conventional 
manner, but directly submitted to 
databases 
• Essential tools for biological research, 
as classical publications used to be !
Problems with Flat files … 
• Wasted storage space 
• Wasted processing time 
• Data control problems 
• Problems caused by changes to data 
structures 
• Access to data difficult 
• Data out of date 
• Constraints are system based 
• Limited querying eg. all single exon 
GPCRs (<1000 bp)
Relational 
• The relational model is not only very mature; it
has also accumulated extensive knowledge on how to make a
relational back-end fast and reliable, and how to
exploit different technologies such as massive SMP,
optical jukeboxes, clustering, etc. Object
databases are nowhere near this, and I do not
expect them to get there in the short or medium term.
• Relational Databases have a very well-known and 
proven underlying mathematical theory, a simple one 
(the set theory) that makes possible 
– automatic cost-based query optimization, 
– schema generation from high-level models and 
– many other features that are now vital for mission-critical 
Information Systems development and operations.
• What is a relational database ? 
– Sets of tables and links (the data) 
– A language to query the database (Structured
Query Language)
– A program to manage the data (RDBMS)
• Flat files are not relational
– Data type (attribute) is part of the data
– Record order matters
– Multiline records
– Massive duplication
• E.g. Organism: Homo sapiens, Eukaryota, …
– Some records are hierarchical
• Xrefs
– Records contain multiple “sub-records”
– Implicit “Key”
The Benefits of Databases 
• Redundancy can be reduced 
• Inconsistency can be avoided 
• Conflicting requirements can be 
balanced 
• Standards can be enforced 
• Data can be shared 
• Data independence 
• Integrity can be maintained 
• Security restrictions can be applied
Disadvantages 
• size 
• complexity 
• cost 
• Additional hardware costs 
• Higher impact of failure 
• Recovery more difficult
Relational Terminology 
CUSTOMER Table (Relation) 
ID NAME PHONE EMP_ID 
201 Unisports 55-2066101 12 
202 Simms Atheletics 81-20101 14 
203 Delhi Sports 91-10351 14 
204 Womansport 1-206-104-0103 11 
Row (Tuple) 
Column (Attribute)
Relational Database Terminology
• Each row of data in a table is uniquely identified by a primary key (PK)
• Information in multiple tables can be logically related by foreign keys (FK)
Table Name: CUSTOMER
ID (Primary Key)   NAME               PHONE            EMP_ID (Foreign Key)
201                Unisports          55-2066101       12
202                Simms Atheletics   81-20101         14
203                Delhi Sports       91-10351         14
204                Womansport         1-206-104-0103   11
Table Name: EMP
ID (Primary Key)   LAST_NAME   FIRST_NAME
10                 Havel       Marta
11                 Magee       Colin
12                 Giljum      Henry
14                 Nguyen      Mai
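The CUSTOMER and EMP tables can be recreated in a few lines to see primary and foreign keys at work. A minimal Python sketch using the standard-library `sqlite3` module with an in-memory database; SQLite is chosen here only for convenience and is not one of the products discussed in the slides. The rejected row (customer 205, employee 99) is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE emp (
    id INTEGER PRIMARY KEY,
    last_name TEXT,
    first_name TEXT
);
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT,
    phone TEXT,
    emp_id INTEGER REFERENCES emp(id)
);
INSERT INTO emp VALUES (10, 'Havel', 'Marta'), (11, 'Magee', 'Colin'),
                       (12, 'Giljum', 'Henry'), (14, 'Nguyen', 'Mai');
INSERT INTO customer VALUES
    (201, 'Unisports', '55-2066101', 12),
    (202, 'Simms Atheletics', '81-20101', 14),
    (203, 'Delhi Sports', '91-10351', 14),
    (204, 'Womansport', '1-206-104-0103', 11);
""")

# Follow the foreign key from each customer to its employee.
rows = conn.execute("""
    SELECT customer.name, emp.last_name
    FROM customer
    INNER JOIN emp ON customer.emp_id = emp.id
    ORDER BY customer.id
""").fetchall()
for row in rows:
    print(row)

# A customer pointing at a non-existent employee violates the foreign key.
try:
    conn.execute("INSERT INTO customer VALUES (205, 'Hypothetical Co', '0', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("foreign key violation rejected:", fk_rejected)
```

The join works because EMP_ID in CUSTOMER holds values of the primary key ID in EMP; the failed insert shows how the database itself guards that link.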
• RDBMS products
– Free
• MySQL: very fast, widely used, easy to
jump into, but limited, non-standard SQL
• PostgreSQL: full SQL, limited OO,
steeper learning curve than MySQL
– Commercial 
• MS Access – Great query builder, GUI 
interfaces 
• MS SQL Server – full SQL, NT only 
• Oracle, everything, including the kitchen 
sink 
• IBM DB2, Sybase
A simple datamodel (tables and relations)
Prot_id   name         seq          Species_id
1         GTM1_HUMAN   MGTDHG…      1
2         GTM1_RAT     MGHJADSW..   2
3         GTM2_HUMAN   MVSDBSVD..   1
Species_id   name    Full Lineage
1            human   Homo Sapiens …
2            rat     Rattus rattus
Relational Database Fundamentals 
• Basic SQL 
– SELECT 
– FROM 
– WHERE 
– JOIN – NATURAL, INNER, OUTER 
• Other SQL functions 
– COUNT() 
– MAX(), MIN(), AVG()
– DISTINCT 
– ORDER BY 
– GROUP BY 
– LIMIT
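Several of the keywords listed above can be tried on the simple protein/species data model. A minimal Python sketch using the standard-library `sqlite3` module; the table and column names follow the datamodel slide, and the `seq` values are only the truncated fragments shown there, not real full-length sequences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein (
    prot_id INTEGER PRIMARY KEY,
    name TEXT,
    seq TEXT,
    species_id INTEGER REFERENCES species(species_id)
);
INSERT INTO species VALUES (1, 'human'), (2, 'rat');
-- seq holds the truncated fragments from the slide, not full sequences
INSERT INTO protein VALUES
    (1, 'GTM1_HUMAN', 'MGTDHG', 1),
    (2, 'GTM1_RAT', 'MGHJADSW', 2),
    (3, 'GTM2_HUMAN', 'MVSDBSVD', 1);
""")

# JOIN + GROUP BY + COUNT/MAX + ORDER BY + LIMIT in one query
per_species = conn.execute("""
    SELECT species.name, COUNT(*) AS n, MAX(LENGTH(protein.seq)) AS longest
    FROM protein
    INNER JOIN species ON protein.species_id = species.species_id
    GROUP BY species.name
    ORDER BY n DESC
    LIMIT 10
""").fetchall()
print(per_species)

avg_len = conn.execute("SELECT AVG(LENGTH(seq)) FROM protein").fetchone()[0]
n_species = conn.execute(
    "SELECT COUNT(DISTINCT species_id) FROM protein"
).fetchone()[0]
print(avg_len, n_species)
```

Note that the average function is spelled AVG() in SQL.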
BioSQL
• Query: a request to retrieve data from
a database is called
a query
• eg. MyGPCRdb 
– Bioentry 
– Taxid (include full lineage) 
– Linking table (bioentry_tax)
MyGPCR;
Give me all GPCRs that are shorter than 1000 bp
select * from bioentry; 
select count(*) from bioentry; 
select * from bioentry inner join biosequence on 
bioentry.bioentry_id=biosequence.bioentry_id ; 
select * from bioentry inner join biosequence on 
bioentry.bioentry_id=biosequence.bioentry_id 
where length(biosequence_str)<1000;
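The last query can be tried without a MySQL server by rebuilding a cut-down version of the two tables in SQLite. A minimal Python sketch; the two-column tables are a simplifying assumption (a real BioSQL schema has many more columns), the `name` column is hypothetical, and the two entries are invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE biosequence (
    bioentry_id INTEGER REFERENCES bioentry(bioentry_id),
    biosequence_str TEXT
);
""")

# Invented entries: one sequence of 300 bp and one of 1500 bp.
entries = [(1, "short_gpcr", "ATG" * 100), (2, "long_gpcr", "ATG" * 500)]
for bioentry_id, name, seq in entries:
    conn.execute("INSERT INTO bioentry VALUES (?, ?)", (bioentry_id, name))
    conn.execute("INSERT INTO biosequence VALUES (?, ?)", (bioentry_id, seq))

total = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]

# Same join and length filter as the query on the slide.
short = conn.execute("""
    SELECT bioentry.name, LENGTH(biosequence.biosequence_str)
    FROM bioentry
    INNER JOIN biosequence ON bioentry.bioentry_id = biosequence.bioentry_id
    WHERE LENGTH(biosequence.biosequence_str) < 1000
""").fetchall()
print(total, short)
```

Only the 300 bp entry passes the filter, which is exactly the kind of question ("all GPCRs shorter than 1000 bp") that is awkward to answer from flat files.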
Example 3-tier model in biological database 
Example of different interface to the same back-end database (MySQL) 
http://www.bioinformatics.be
Overview 
• DataBases 
– FF 
• *.txt 
• Indexed version 
– Relational (RDBMS) 
• Access, MySQL, 
PostGRES, Oracle 
– OO (OODBMS) 
• AceDB, ObjectStore 
– Hierarchical 
• XML 
– Frame based system 
• Eg. DAML+OIL 
– Hybrid systems 
Object 
• The Object paradigm is already proven for application design and 
development, but it may simply not be an adequate paradigm for 
the data store. 
• Object databases are modelled by graphs. Graph theory plays a
great role in computer science, but it is also a rich source of
intractable problems, such as the NP-complete class: problems for which
no computationally efficient solution is known, as no known method
escapes exponential complexity. This is not a current
technological limit. It is a limit inherent to the problem domain.
• Hybrid Object-Relational databases will probably be the long term 
solution for the industry. They put a thin object layer above the 
relational structure, thus providing a syntax and semantics closer to 
the object oriented design and programming tools. They simply 
make it easier to build the data layer classes
Conclusions 
• A database is a central component of any 
contemporary information system 
• The operations on the database and the maintenance
of database consistency are handled by a DBMS
• There exist stand alone query languages or 
embedded languages but both deal with definition 
(DDL) and manipulation (DML) aspects 
• The structural properties, constraints and operations 
permitted within a DBMS are defined by a data 
model - hierarchical, network, relational 
• Recovery and concurrency control are essential 
• Linking of heterogeneous data sources is a central theme
in modern bioinformatics
• How do you know which database 
exists ? 
• NAR list 
• Web links on Nexus
– Searchable 
– Maintainable
• Tools available in public domain for 
simultaneous access 
– entrez 
– srs 
• Batch queries for offload in local 
databases for subsequent analysis 
(see further)
• What if you want to search the 
complete human genome (golden path 
coordinates) instead of separate NCBI 
entries ? 
• ENSEMBL
BioMart 
• Joint project between EBI and CSHL,
http://www.biomart.org/ 
• Aim is to develop a generic, query-oriented data 
management system capable of integrating 
distributed data sources 
• 3 step system: 
– Start by selecting a dataset to query 
– Filter this dataset by applying the appropriate filters 
– Generate the output by selecting the attributes and output 
format 
• Available public biomart websites: 
http://www.biomart.org/biomart/martview
BioMart - Single access point - Generic interface
BioMart - ‘Out of the box’ website
BioMart – 3 step system 
Dataset 
Attribute 
Filter
BioMart - 3 step system
Dataset: all Ensembl genes
Filter: located on chromosome 1, expressed in
lung, associated with human
homologues
Attribute: name, chromosome position,
description
BioMart - EnsMart 
• The first in line was EnsMart, a powerful data 
mining toolset for retrieving customized data sets 
from annotated genomes. EnsMart integrates data 
from Ensembl and various worldwide data sources. 
• EnsMart provides .... 
– Gene and protein annotation 
– Disease information 
– Cross-species analyses 
– SNPs affecting proteins 
– Allele frequency data 
– Retrieval by external identifiers 
– Retrieval by Gene Ontology 
– Customized sequence datasets 
– Microarray annotation tools
Other BioMart implementations 
• Other data resources also implemented 
a BioMart interface: 
– Wormbase 
– Gramene 
– HapMap 
– DictyBase 
– euGenes
Single interface
BioBar 
• A toolbar for browsing biological data 
and databases 
http://biobar.mozdev.org/ 
• The following databases are included 
http://biobar.mozdev.org/Databases.html
• a toolbar for Mozilla-based browsers 
including Firefox and Netscape 7+
Weblems 
Weblems Online (example posting) 
W2.1. Which isolate of Tabac was used in record accession 
Z71230, and human sample in the genbank entry with 
accession AJ311677 ? 
W2.2: Find all structures of GFP in the Protein Data Bank and 
draw a histogram of their dates of deposition ? 
W2.3: What is the chromosomal location of the human gene for 
insulin ? 
W2.4: How many different human NHRs (nuclear hormone
receptors) exist ? How many of these are single exon genes
? Are there any drugs working on this class of receptors ?
W2.5: The gene for Berardinelli-Seip syndrome was initially 
localized between two markers on chromosome band 11q13- 
D11S4191 and D11S987. 
a. How many base pairs are there in the interval between 
these two markers ? 
b. How many known genes are there ? 
c. List the gene ontology terms for that region ?

FBWSeminar on Molecular Biology Databases

  • 2. FBW 07-10-2014 Wim Van Criekinge
  • 3. Class on 4 November and NO class on 18 November
  • 4. Outline • Molecular Biology • Flat files “sequence” databases – DNA – Protein – Structure • Relational Databases – What ? – Why ? • Biological Relational Databases – Howto ?
  • 9. Flat Files What is a “flat file” ? • Flat file is a term used to refer to when data is stored in a plain ordinary file on the hard disk • Example RefSEQ – See CD-ROM – FILE: hs.GBFF • Hs: Homo Sapiens • GBFF: Genbank File Format • (associated with textpad, use monospaced font eg. Courier)
  • 10. Sequence entries gene 10317..12529 /gene="ZK822.4" CDS join(10317..10375,10714..10821,10874..10912,10960..11013, 11061..11114,11169..11222,11346..11739,11859..11912, 11962..12195,12242..12529) /gene="ZK822.4" /codon_start=1 /protein_id="CAA98068.1" /db_xref="PID:g3881817" /db_xref="GI:3881817" /db_xref="SPTREMBL:Q23615" /translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"
  • 11. Nucleotide Databases EMBL Nucleotide Sequence Database (European Molecular Biology Laboratory) http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html GenBank at NCBI (National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html DDBJ (DNA Database of Japan) http://www.ddbj.nig.ac.jp/ DDBJ,the Center for operating DDBJ, National Institute of Genetics (NIG),Japan,established in April 1995. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) Genetic Sequence Data Bank - August 15 2003 NCBI-GenBank Flat File Release 137.0 Distribution Release Notes 33 865 022 251 bases, from 27 213 748 reported sequences
  • 12. GenBank Format LOCUS LISOD 756 bp DNA BCT 30-JUN-1993 DEFINITION L.ivanovii sod gene for superoxide dismutase. ACCESSION X64011.1 GI:37619753 NID g44010 KEYWORDS sod gene; superoxide dismutase. SOURCE Listeria ivanovii. ORGANISM Listeria ivanovii Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae; Listeria. REFERENCE 1 (bases 1 to 756) AUTHORS Haas,A. and Goebel,W. TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii by functional complementation in Escherichia coli and characterization of the gene product JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992) MEDLINE 92140371 REFERENCE 2 (bases 1 to 756) AUTHORS Kreft,J. TITLE Direct Submission JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG
  • 13. FEATURES Location/Qualifiers source 1..756 /organism="Listeria ivanovii" /strain="ATCC 19119" /db_xref="taxon:1638" RBS 95..100 /gene="sod" gene 95..746 /gene="sod" CDS 109..717 /gene="sod" /EC_number="1.15.1.1" /codon_start=1 /product="superoxide dismutase" /db_xref="PID:g44011" /db_xref="SWISS-PROT:P28763" /transl_table=11 /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF DAAK" terminator 723..746 /gene="sod"
  • 14. Example of location descriptors Location Description 476 Points to a single base in the presented sequence 340..565 Points to a continuous range of bases bounded by and including the starting and ending bases <345..500 The exact lower boundary point of a feature is unknown. (102.110) Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110. (23.45)..600 Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end base 600 123^124 Points to a site between bases 123 and 124 145^177 Points to a site anywhere between bases 145 and 177 J00193:hladr Points to a feature whose location is described in another entry: the feature labeled 'hladr' in the entry (in this database) with primary accession 'J00193'
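The simplest location descriptors above (a single base such as 476, a range such as 340..565) and the join(...) form seen in the CDS of the sequence entry slide can be read with a few lines of code. The following is a minimal sketch only: the function names are made up for illustration, and the fuzzy forms (<345..500, (102.110), 123^124, references to other entries) are deliberately not handled.

```python
import re

def parse_span(text):
    """Parse 'N' or 'N..M' into an inclusive (start, end) tuple of base positions."""
    match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", text.strip())
    if match is None:
        # Fuzzy or remote locations are outside the scope of this sketch.
        raise ValueError(f"unsupported location: {text!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return start, end

def parse_location(location):
    """Return a list of (start, end) spans for 'N', 'N..M' or 'join(a..b,c..d,...)'."""
    location = location.strip()
    if location.startswith("join(") and location.endswith(")"):
        parts = location[len("join("):-1].split(",")
    else:
        parts = [location]
    return [parse_span(part) for part in parts]

def total_length(location):
    """Number of bases covered by the location (both end points are included)."""
    return sum(end - start + 1 for start, end in parse_location(location))
```

For the sod CDS at 109..717, total_length gives 609 bases, i.e. 203 codons including the stop codon.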
  • 15. BASE COUNT 247 a 136 c 151 g 222 t ORIGIN 1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat 61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa 121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg 181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca 241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt 301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta 361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca 421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg 481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt 541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat 601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca 661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta 721 tcgaaaggct cacttaggtg ggtcttttta tttcta //
  • 16. EMBL format ID LISOD standard; DNA; PRO; 756 BP. IDentification XX AC X64011; S78972; Accession (Axxxxx, Afxxxxxx), GUID XX NI g44010 Nucleotide Identifier --> x.x XX DT 28-APR-1992 (Rel. 31, Created) DaTe DT 30-JUN-1993 (Rel. 36, Last updated, Version 6) XX DE L.ivanovii sod gene for superoxide dismutase DEscription XX. KW sod gene; superoxide dismutase. KeyWord XX OS Listeria ivanovii Organism Species OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae; OC Listeria. Organism Classification XX RN [1] RA Haas A., Goebel W.; Reference RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by RT functional complementation in Escherichia coli and RT characterization of the gene product."; RL Mol. Gen. Genet. 231:313-322(1992). XX
  • 17. GenBank,EMBL & DDBJ: Comments • Collaboration Genbank/EMBL/DDBJ – Effort: Identical within 24 hours • Redundant information • Historical graveyard – BANKIT (responsibility of the submitter) – Version conflicts • IDIOSYNCRATIC (peculiar to the individual) – Heterogeneous annotation – No consistent quality check • Vectors, sequence errors etc
  • 18. Other Genbank Formats • ASN1 – Computer friendly, human unfriendly • FASTA – Brief, loses information – Easy to use – Compatible with multiple sequences
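To make the FASTA points concrete (brief, no annotation, several sequences in one file), here is a small reader and writer. It is a sketch under simple assumptions: each record starts with a ">" header line, everything until the next ">" is sequence, and the 60-character line width is just a common convention, not a requirement of the format.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def write_fasta(records, width=60):
    """Format (header, sequence) tuples as FASTA text, wrapping sequence lines."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"
```

Only the header text and the residues survive a round trip; features such as CDS locations or cross-references from a GenBank entry have no place in this format, which is the information loss the slide refers to.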
  • 19. Web Query tools & Programming Query tools • NCBI website example: – http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html • EBI UniProtKB website example: – http://www.ebi.ac.uk/uniprot/index.html – http://www.ebi.uniprot.org/search/SearchTools.shtml
  • 20. batch download (ftp server) • Data available via website is most of the time also available via an ftp server to download a complete batch. • Examples: –ftp://ftp.ncbi.nih.gov/ –ftp://ftp.ebi.ac.uk/pub/
  • 21. Sequence file format tips • When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA • When retrieving from a database or exchanging between programs, use an annotated text format such as Genbank • When using sequence again with the same program, use that program’s annotated binary format (or annotated text if binary not available) – Asn-1 (NCBI) – Gbff (sanger) – XML
  • 22. Expressed Sequence Tags • Sequence that codes for protein is < 5% of the genome. • Coding sequence can be obtained from mRNA by reverse transcription. • Tags for that sequence can be obtained by end-sequencing. • Incyte and HGS gambled on this being the useful part: – Search for homologies to known proteins, motifs. – Search for changed levels of expression and tissue specificity (“virtual/electronic northern” used in GeneCards) • ESTs have driven the huge expansion of GenBank: – Unigene now contains some sequence from most genes. – > 4,000,000 human est sequences – http://www.ncbi.nlm.nih.gov/dbEST/
  • 23. dbEST release 100303 Summary by Organism - October 3, 2003 Number of public entries: 18,762,324 Homo sapiens (human) 5,426,001 Mus musculus + domesticus (mouse) 3,881,878 Rattus sp. (rat) 538,073 Triticum aestivum (wheat) 500,898 Ciona intestinalis 492,488 Gallus gallus (chicken) 451,565 Zea mays (maize) 383,416 Danio rerio (zebrafish) 362,362 Hordeum vulgare + subsp. vulgare (barley) 348,233 Xenopus laevis (African clawed frog) 344,695 Glycine max (soybean) 341,573 Bos taurus (cattle) 322,074 Drosophila melanogaster (fruit fly) 261,404
  • 24. Traces <-> strings • Traces contain much more information – TraceDB: http://www.ncbi.nlm.nih.gov/Traces/ Example
  • 25. Traces <-> strings • Phred – base calling, vector trimming, end of sequence read trimming • Phrap – Phrap uses Phred’s base calling scores to determine the consensus sequences. Phrap examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence • Consed – graphical interface extension that controls both Phred and Phrap
  • 26. What is Phred? • Phred is a program that observes the base trace, makes base calls, and assigns quality values (qv) to the bases in the sequence. • It then writes the base calls and qv to output files that will be used for Phrap assembly. The qv will be useful for consensus sequence construction. • For example, ATGCATGC string1 ATTCATGC string2 AT-CATGC superstring • Here we have a mismatch between ‘G’ and ‘T’, marked by the dash in the superstring; the qv determines which base fills it. The base with the higher qv replaces the dash.
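The rule in the example above (at a mismatch, keep the base with the higher qv) can be sketched as follows. This only illustrates the slide's rule for two already aligned reads of equal length; it is not Phrap's actual assembly algorithm, and the quality values in the usage note are invented.

```python
def consensus(read_a, qual_a, read_b, qual_b):
    """Per-position consensus of two aligned reads, using quality values to break ties."""
    if not (len(read_a) == len(qual_a) == len(read_b) == len(qual_b)):
        raise ValueError("reads and quality lists must all have the same length")
    result = []
    for base_a, q_a, base_b, q_b in zip(read_a, qual_a, read_b, qual_b):
        if base_a == base_b:
            result.append(base_a)
        else:
            # Mismatch: keep the base call with the higher quality value.
            result.append(base_a if q_a >= q_b else base_b)
    return "".join(result)
```

With string1 = ATGCATGC at qv 40 everywhere and string2 = ATTCATGC with qv 15 at the third base, the consensus keeps the G from string1.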
  • 27. How does Phred calculate qv? • From the base trace Phred knows the number of peaks and the actual peak locations. • Phred predicts peak locations. • Phred reads the actual peak locations from the base trace. • Phred matches the actual locations with the predicted locations by using dynamic programming. • The qv is related to the base call error probability (ep) by the formula qv = -10*log_10(ep) • Example: an error probability of 1 in 10,000 gives qv = 40
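The relation between qv and error probability is easy to check numerically. The two helper functions below are illustrative names, not part of Phred itself; they simply apply qv = -10*log_10(ep) and its inverse.

```python
import math

def error_prob_to_qv(ep):
    """Quality value for a base call error probability ep (0 < ep <= 1)."""
    if not 0 < ep <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(ep)

def qv_to_error_prob(qv):
    """Inverse relation: error probability for a given quality value."""
    return 10 ** (-qv / 10)
```

An error probability of 1 in 10,000 gives a qv of 40, and a qv of 20 corresponds to 1 error in 100 base calls.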
  • 28. Why Phred? • Output sequence might contain errors. • Vector contamination might occur. • Dye-terminator reaction might not occur. • Segment migration abnormal in gel electrophoresis. • Weak or variable signal strength of peak corresponding to a base.
  • 30. End of Sequence Cropping • It is common for the ends of sequencing reads to have poor data. This is due to the difficulty of resolving larger fragments of ~1 kb (it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp). • Phred assigns a non-value of ‘x’ to this data by comparing peak separation and peak intensity to internal standards. If the standard threshold score is not reached, the data will not be used.
  • 31. Traces <-> strings • Handle traces – Abi-view EMBOSS – Bioedit – Acembly, … • EXAMPLE
  • 32. NCBI reference sequences RefSeq database is a non-redundant set of reference standards that includes chromosomes, complete genomic molecules, intermediate assembled genomic contigs, curated genomic regions, mRNAs, RNAs, and proteins.
  • 33. RefSeq nomenclature NC_#### complete genomic NG_#### incomplete genomic NM_#### mRNA NR_#### noncoding transcripts NP_#### proteins NT_#### intermediate genomic contigs
  • 34. RefSeq nomenclature - models XM_#### mRNA XR_#### RNA XP_#### protein Automated Homo sapiens models provided by the Genome Annotation process; sequence corresponds to the genomic contig.
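The prefixes on the two RefSeq nomenclature slides lend themselves to a simple lookup table. The sketch below covers only the prefixes listed on those slides; the function name and the accession number in the usage note are made up for illustration.

```python
# Prefix -> molecule type, as listed on the RefSeq nomenclature slides.
REFSEQ_PREFIXES = {
    "NC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "noncoding transcript",
    "NP": "protein",
    "NT": "intermediate genomic contig",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}

def describe_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None if the prefix is unknown."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        # No underscore: not a RefSeq-style accession (e.g. a GenBank accession).
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())
```

For a placeholder accession such as NM_000000 this returns "mRNA", while a GenBank accession such as X64011 has no underscore prefix and returns None.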
  • 38. Open reading frame • Definition: – A stretch of triplet codons with an initiator codon at one end and a stop codon at the other, as identifiable from the nucleotide sequence. • Example – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
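The definition translates fairly directly into a scan for ATG ... stop in each reading frame. The sketch below is a simplified illustration: it assumes the standard start codon ATG and stop codons TAA, TAG and TGA, looks at the forward strand only, and the toy sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for each ATG...stop stretch, forward strand only.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiator codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Invented toy sequence with one short ORF in the third frame.
orfs = find_orfs("CCATGAAATAGGG")
print(orfs)
```

A real ORF finder would also scan the reverse complement and apply a much larger minimum length.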
  • 39. Protein sequence database SWISS-PROT & TREMBL SwissProt - http://expasy.hcuge.ch/sprot/ • SWISS-PROT is an annotated protein sequence database • The sequences are translated from the EMBL Nucleotide Sequence Database • Sequence entries are composed of different lines. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. • Continuously updated (daily).
  • 40. Different Features of SWISS-PROT • Format follows as closely as possible that of EMBL • Curated protein sequence database • Three distinguishing features: 1. A high level of annotation 2. A minimal level of redundancy 3. A high level of integration with other databases
  • 41. Three Distinct Criteria • 1. Annotation • Core data: the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein) • Annotation: protein functions, post-translational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, variants, etc.
  • 42. 2. Minimal Redundancy • Many sequence databases contain, for a given protein sequence, separate entries that correspond to different literature reports. • SWISS-PROT tries as much as possible to merge all these data so as to minimize redundancy. • If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
  • 43. 3. Integration With Other Databases • SWISS-PROT and TrEMBL - Protein sequences • PROSITE - Protein families and domains • SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis • SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules • SWISS-MODEL Repository - Automatically generated protein models • CD40Lbase - CD40 ligand defects • ENZYME - Enzyme nomenclature • SeqAnalRef - Sequence analysis bibliographic references
  • 44. TREMBL - http://expasy.hcuge.ch/sprot/ • Translated EMBL sequences not (yet) in Swissprot. • Updated faster than SWISS-PROT. • TREMBL has two parts: 1. SP-TREMBL – Will eventually be incorporated into Swissprot – Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT. 2. REM-TREMBL (remaining) – Will NOT be incorporated into Swissprot – Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins
  • 45. SWISS-PROT/TrEMBL • TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT • SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152 entries • TrEMBL Release 16.2 of 23-Mar-2001: 436,924 entries
  • 46. Example of a SwissProt entry (the slide's line-code labels are shown after "<-")
    ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.     <- IDentification
    AC   P01375;                                          <- ACcession
    DT   21-JUL-1986 (REL. 01, CREATED)                   <- DaTe
    DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
    DT   15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
    DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
    GN   TNFA.                                            <- Gene name
    OS   HOMO SAPIENS (HUMAN).                            <- Organism Species
    OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
    OC   EUTHERIA; PRIMATES.                              <- Organism Classification
    RN   [1]                                              <- Reference
    RP   SEQUENCE FROM N.A.
    RX   MEDLINE; 87217060.
    RA   NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
    RA   AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
    RA   FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
    RA   CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
    RL   COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
    RN   [2]
    RP   SEQUENCE FROM N.A.
    RX   MEDLINE; 85086244.
    RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
    RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
    RL   NATURE 312:724-729(1984).
    ...
  • 47. (entry continued; the slide's line-code labels are shown after "<-")
    CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN      <- Comments
    CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
    CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
    CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
    CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
    CC       UNDER CERTAIN CONDITIONS.
    CC   -!- SUBUNIT: HOMOTRIMER.
    CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
    CC       AN EXTRACELLULAR SOLUBLE FORM.
    CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
    CC       PROTEOLYTIC PROCESSING.
    CC   -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
    CC       CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
    CC       HEALTH AND MALNUTRITION.
    CC   -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
    DR   EMBL; X02910; G37210; -.                         <- Database Cross-references
    DR   EMBL; M16441; G339741; -.
    DR   EMBL; X01394; G37220; -.
    DR   EMBL; M10988; G339738; -.
    DR   EMBL; M26331; G339764; -.
    DR   EMBL; Z15026; G37212; -.
    DR   PIR; B23784; QWHUN.
    DR   PIR; A44189; A44189.
    DR   PDB; 1TNF; 15-JAN-91.
    DR   PDB; 2TUN; 31-JAN-94.
  • 48. (entry continued; the slide's line-code labels are shown after "<-")
    KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
    KW   MYRISTYLATION; 3D-STRUCTURE.                     <- KeyWord
    FT   PROPEP        1     76                           <- Feature Table
    FT   CHAIN        77    233       TUMOR NECROSIS FACTOR.
    FT   TRANSMEM     36     56       SIGNAL-ANCHOR (TYPE-II PROTEIN).
    FT   LIPID        19     19       MYRISTATE.
    FT   LIPID        20     20       MYRISTATE.
    FT   DISULFID    145    177
    FT   MUTAGEN     105    105       L->S: LOW ACTIVITY.
    FT   MUTAGEN     108    108       R->W: BIOLOGICALLY INACTIVE.
    FT   MUTAGEN     112    112       L->F: BIOLOGICALLY INACTIVE.
    FT   MUTAGEN     162    162       S->F: BIOLOGICALLY INACTIVE.
    FT   MUTAGEN     167    167       V->A,D: BIOLOGICALLY INACTIVE.
    FT   MUTAGEN     222    222       E->K: BIOLOGICALLY INACTIVE.
    FT   CONFLICT     63     63       F -> S (IN REF. 5).
    FT   STRAND       89     93
    FT   TURN         99    100
    FT   TURN        109    110
    FT   STRAND      112    113
    FT   TURN        115    116
    FT   STRAND      118    119
    FT   STRAND      124    125
  • 49. (entry continued)
    FT   STRAND      130    143
    FT   STRAND      152    159
    FT   STRAND      166    170
    FT   STRAND      173    174
    FT   TURN        183    184
    FT   STRAND      189    202
    FT   TURN        204    205
    FT   STRAND      207    212
    FT   HELIX       215    217
    FT   STRAND      218    218
    FT   STRAND      227    232
    SQ   SEQUENCE   233 AA;  25644 MW;  666D7069 CRC32;
         MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
         EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
         DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
         TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
    //
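Because every line of a SwissProt entry starts with a two-letter line code, the flat file can be split into fields with very little code. The sketch below is a minimal illustration, not a full parser: it assumes the classic layout (code in columns 1-2, data from column 6, sequence lines starting with blanks, entry terminated by `//`), and the sample is a shortened copy of the TNFA_HUMAN entry with the sequence deliberately truncated to its first 30 residues.

```python
from collections import defaultdict

# Shortened sample entry; the sequence is truncated for brevity.
SAMPLE = """\
ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
OS   HOMO SAPIENS (HUMAN).
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
SQ   SEQUENCE   233 AA;  25644 MW;  666D7069 CRC32;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC
//
"""


def parse_entry(text):
    """Group the lines of one entry by line code; join the sequence lines."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            break  # end-of-entry marker
        code, rest = line[:2], line[5:].strip()
        if code == "  ":
            sequence.append(rest.replace(" ", ""))  # sequence data line
        else:
            fields[code].append(rest)
    result = dict(fields)
    result["sequence"] = "".join(sequence)
    return result


entry = parse_entry(SAMPLE)
print(entry["AC"], entry["OS"], len(entry["DR"]))
```

Note that the field name is itself part of each record here, which is one of the reasons such files are awkward to query directly.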
  • 50. Protein searching • 3 levels of protein searching: 1. Swissprot: little noise, annotated entries 2. Swissprot + TREMBL: more noisy, all probable entries 3. Translated EMBL (tblast or tfasta): most noisy, all possible entries
  • 51. New initiatives • IPI: International Protein Index – http://www.ebi.ac.uk/IPI/IPIhelp.html • UNIPROT: Universal Protein Knowledgebase – http://www.pir.uniprot.org/ • HPRD: Human Protein Reference Database – http://www.hprd.org/
  • 52. UniProt • UniProt Consortium – European Bioinformatics Institute (EBI) – Swiss Institute of Bioinformatics (SIB) – Protein Information Resource (PIR) • UniProt databases – UniProt Knowledgebase (UniProtKB) – UniProt Reference Clusters (UniRef) – UniProt Archive (UniParc) • UniProtKB – Swiss-Prot (annotated protein sequence db, gold standard) – TrEMBL (translated EMBL + automated electronic annotations)
  • 53. Understanding molecular structure is critical to the understanding of biology because structure determines function
  • 56. From Structure to Function • The drug morphine has chemical groups that are functionally equivalent to the natural endorphins found in the human body • The receptor molecules located at the synapse (between two neurons) bind morphine in much the same way as endorphins • Therefore, morphine is able to attenuate the pain response
  • 57. Structure databases: Protein Data Bank (PDB) • Protein Data Bank - http://www.rcsb.org/pdb • Diffraction: 7373 structures determined by X-ray diffraction • NMR: 388 structures determined by NMR spectroscopy • Theoretical Model: 201 structures proposed by modeling
  • 58. • The PDB is an archive of three-dimensional structures of proteins and some nucleic acids • The PDB is operated by the RCSB (Research Collaboratory for Structural Bioinformatics), funded by NSF, DOE, and two units of NIH: NIGMS (National Institute of General Medical Sciences) and NLM (National Library of Medicine). • Established at BNL (Brookhaven National Laboratory) in 1971 as an archive for biological macromolecular crystal structures • In the 1980s, the number of deposited structures began to increase dramatically. • In October 1998, management of the PDB became the responsibility of the RCSB. • Website: http://www.rcsb.org
  • 59. PDB Holdings List: 27-Mar-2001 (rows: experimental technique; columns: molecule type)
    Exp. Tech.                   Proteins, Peptides,  Protein/Nucleic  Nucleic  Carbohydrates  Total
                                 and Viruses          Acid Complexes   Acids
    X-ray Diffraction and other  11045                526              552      14             12137
    NMR                          1832                 71               366      4              2273
    Theoretical Modeling         281                  19               21       0              321
    Total                        13158                616              939      18             14731
    5032 Structure Factor Files • 968 NMR Restraint Files
  • 61. PDB Growth in New Folds
  • 62. Other structure databases • BioMagResBank http://www.bmrb.wisc.edu/ – A repository for data from NMR spectroscopy on proteins, peptides, and nucleic acids • Biological Macromolecule Crystallization Database (BMCD) http://h178133.carb.nist.gov:4400/bmcd/bmcd.html – Contains crystal data and the crystallization conditions, compiled from the literature • Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/ – Assembles and distributes structural information about nucleic acids • Structural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/ – Structure similarity search. Hierarchical organization. • MOOSE http://db2.sdsc.edu/moose/ – Macromolecular Structure Query • Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/ – Small molecules.
  • 64. Protein Splicing? • Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein protein and the free intein • http://www.neb.com/inteins/intein_intro.html
  • 65. Biological databases • NAR Database Issue – Every year: NAR DB Issue – The 2006 update includes 858 databases – Citation top 5 are: • Pfam • Gene Ontology • UniProt • SMART • KEGG – Primary Nucleotide DB’s and PDB are not cited anymore
  • 66. Outline • Molecular Biology • Flat files “sequence” databases – DNA – Protein – Structure • Relational Databases – What ? – Why ? • Biological Relational Databases – Howto ?
  • 67. Why biological databases ? • Explosive growth in biological data • Data (sequences, 3D structures, 2D gel analysis, MS analysis….) are no longer published in a conventional manner, but directly submitted to databases • Essential tools for biological research, as classical publications used to be !
  • 68. Problems with Flat files … • Wasted storage space • Wasted processing time • Data control problems • Problems caused by changes to data structures • Access to data difficult • Data out of date • Constraints are system based • Limited querying eg. all single exon GPCRs (<1000 bp)
  • 69. Relational • The relational model is not only very mature; there is also extensive know-how on making a relational back-end fast and reliable and on exploiting technologies such as massive SMP, optical jukeboxes and clustering. Object databases are nowhere near this, and I do not expect them to get there in the short or medium term. • Relational databases have a well-known and proven underlying mathematical theory, a simple one (set theory), that makes possible – automatic cost-based query optimization, – schema generation from high-level models and – many other features that are now vital for mission-critical information systems development and operations.
  • 70. • What is a relational database ? – Sets of tables and links (the data) – A language to query the database (Structured Query Language) – A program to manage the data (RDBMS) • Flat files are not relational – Data type (attribute) is part of the data – Record order matters – Multiline records – Massive duplication • e.g. Organism: Homo sapiens, Eukaryota, … – Some records are hierarchical • Xrefs – Records contain multiple “sub-records” – Implicit “Key”
  • 71. The Benefits of Databases • Redundancy can be reduced • Inconsistency can be avoided • Conflicting requirements can be balanced • Standards can be enforced • Data can be shared • Data independence • Integrity can be maintained • Security restrictions can be applied
  • 72. Disadvantages • size • complexity • cost • Additional hardware costs • Higher impact of failure • Recovery more difficult
  • 73. Relational Terminology • CUSTOMER Table (Relation) • Each row is a tuple; each column is an attribute
    ID   NAME              PHONE           EMP_ID
    201  Unisports         55-2066101      12
    202  Simms Atheletics  81-20101        14
    203  Delhi Sports      91-10351        14
    204  Womansport        1-206-104-0103  11
  • 74. Relational Database Terminology • Each row of data in a table is uniquely identified by a primary key (PK) • Information in multiple tables can be logically related by foreign keys (FK)
    Table Name: EMP (ID is the primary key)
    ID  LAST_NAME  FIRST_NAME
    10  Havel      Marta
    11  Magee      Colin
    12  Giljum     Henry
    14  Nguyen     Mai
    Table Name: CUSTOMER (ID is the primary key; EMP_ID is a foreign key to EMP.ID)
    ID   NAME              PHONE           EMP_ID
    201  Unisports         55-2066101      12
    202  Simms Atheletics  81-20101        14
    203  Delhi Sports      91-10351        14
    204  Womansport        1-206-104-0103  11
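The primary key / foreign key relation between CUSTOMER and EMP can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: SQLite stands in for whichever RDBMS is used in the course, and the column types are guesses, since the slide shows only names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE emp (
    id         INTEGER PRIMARY KEY,
    last_name  TEXT,
    first_name TEXT
);
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    phone  TEXT,
    emp_id INTEGER REFERENCES emp(id)   -- foreign key to EMP
);
INSERT INTO emp VALUES
    (10, 'Havel', 'Marta'), (11, 'Magee', 'Colin'),
    (12, 'Giljum', 'Henry'), (14, 'Nguyen', 'Mai');
INSERT INTO customer VALUES
    (201, 'Unisports', '55-2066101', 12),
    (202, 'Simms Atheletics', '81-20101', 14),
    (203, 'Delhi Sports', '91-10351', 14),
    (204, 'Womansport', '1-206-104-0103', 11);
""")

# Follow the foreign key: which employee is linked to each customer?
rows = conn.execute("""
    SELECT customer.name, emp.last_name
    FROM customer
    INNER JOIN emp ON customer.emp_id = emp.id
    ORDER BY customer.id
""").fetchall()
for customer_name, employee in rows:
    print(customer_name, "->", employee)
```

The employee names are stored once in EMP and referenced by ID, which is how the relational layout avoids the duplication seen in flat files.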
  • 75. • RDBMS products – Free • MySQL: very fast, widely used, easy to jump into, but limited, non-standard SQL • PostgreSQL: full SQL, limited OO, steeper learning curve than MySQL – Commercial • MS Access: great query builder, GUI interfaces • MS SQL Server: full SQL, NT only • Oracle: everything, including the kitchen sink • IBM DB2, Sybase
  • 76. A simple datamodel (tables and relations)
    Protein table (Species_id refers to the species table)
    Prot_id  name        seq         Species_id
    1        GTM1_HUMAN  MGTDHG…     1
    2        GTM1_RAT    MGHJADSW..  2
    3        GTM2_HUMAN  MVSDBSVD..  1
    Species table
    Species_id  name   Full Lineage
    1           human  Homo Sapiens …
    2           rat    Rattus rattus
  • 77. Relational Database Fundamentals • Basic SQL – SELECT – FROM – WHERE – JOIN (NATURAL, INNER, OUTER) • Other SQL functions – COUNT() – MAX(), MIN(), AVG() – DISTINCT – ORDER BY – GROUP BY – LIMIT
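Several of these keywords can be seen working together on the small protein/species data model from the previous slide. The sketch below uses Python's built-in `sqlite3` module as a stand-in RDBMS; the table and column names are adapted from the slide, and the sequences are only the truncated prefixes shown there, so the lengths are not biologically meaningful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein (
    prot_id    INTEGER PRIMARY KEY,
    name       TEXT,
    seq        TEXT,                              -- truncated placeholder
    species_id INTEGER REFERENCES species(species_id)
);
INSERT INTO species VALUES (1, 'human'), (2, 'rat');
INSERT INTO protein VALUES
    (1, 'GTM1_HUMAN', 'MGTDHG',   1),
    (2, 'GTM1_RAT',   'MGHJADSW', 2),
    (3, 'GTM2_HUMAN', 'MVSDBSVD', 1);
""")

# COUNT + GROUP BY + ORDER BY + LIMIT: proteins per species, largest first.
per_species = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM protein AS p
    INNER JOIN species AS s ON p.species_id = s.species_id
    GROUP BY s.name
    ORDER BY COUNT(*) DESC
    LIMIT 10
""").fetchall()

# MAX with a scalar function, and DISTINCT on the foreign-key column.
longest = conn.execute("SELECT MAX(LENGTH(seq)) FROM protein").fetchone()[0]
species_ids = conn.execute(
    "SELECT DISTINCT species_id FROM protein ORDER BY species_id"
).fetchall()
print(per_species, longest, species_ids)
```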
  • 82. • Query: a request to retrieve data from a database is called a query • e.g. MyGPCRdb – Bioentry – Taxid (include full lineage) – Linking table (bioentry_tax)
  • 83. MyGPCR: give me all GPCRs shorter than 1000 bp
    select * from bioentry;
    select count(*) from bioentry;
    select * from bioentry inner join biosequence on bioentry.bioentry_id=biosequence.bioentry_id;
    select * from bioentry inner join biosequence on bioentry.bioentry_id=biosequence.bioentry_id where length(biosequence_str)<1000;
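The last query on this slide can be run against a toy version of the two tables. Everything beyond the names used in the slide's SQL is an assumption here: the real MyGPCRdb schema has more columns, the entry names are invented, and the sequences are synthetic strings whose only purpose is to fall on either side of the 1000 bp limit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Minimal stand-in for the two tables named in the slide's SQL.
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE biosequence (
    bioentry_id     INTEGER REFERENCES bioentry(bioentry_id),
    biosequence_str TEXT
);
""")
conn.executemany("INSERT INTO bioentry VALUES (?, ?)",
                 [(1, "toy_short_entry"), (2, "toy_long_entry")])
conn.executemany("INSERT INTO biosequence VALUES (?, ?)",
                 [(1, "ATG" * 100),     # 300 bp, below the limit
                  (2, "ATG" * 500)])    # 1500 bp, above the limit

# Same join and length filter as the final query on the slide.
short_entries = conn.execute("""
    SELECT bioentry.name, LENGTH(biosequence_str)
    FROM bioentry
    INNER JOIN biosequence
        ON bioentry.bioentry_id = biosequence.bioentry_id
    WHERE LENGTH(biosequence_str) < 1000
""").fetchall()
print(short_entries)
```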
  • 84. Example 3-tier model in biological database • Example of different interfaces to the same back-end database (MySQL) http://www.bioinformatics.be
  • 85. Overview • Databases – Flat files (FF) • *.txt • Indexed version – Relational (RDBMS) • Access, MySQL, PostgreSQL, Oracle – OO (OODBMS) • AceDB, ObjectStore – Hierarchical • XML – Frame-based systems • e.g. DAML+OIL – Hybrid systems
  • 86. Object • The object paradigm is already proven for application design and development, but it may simply not be an adequate paradigm for the data store. • Object databases are modelled by graphs. Graph theory plays a great role in computer science, but it is also a rich source of intractable problems, such as the NP-complete class: problems for which no computationally efficient solution is known. This is not a limit of current technology; it is inherent to the problem domain. • Hybrid object-relational databases will probably be the long-term solution for the industry. They put a thin object layer above the relational structure, thus providing a syntax and semantics closer to object-oriented design and programming tools. They simply make it easier to build the data-layer classes.
  • 87. Conclusions • A database is a central component of any contemporary information system • Operations on the database and the maintenance of database consistency are handled by a DBMS • There are stand-alone query languages and embedded languages, but both deal with definition (DDL) and manipulation (DML) aspects • The structural properties, constraints and operations permitted within a DBMS are defined by a data model: hierarchical, network, relational • Recovery and concurrency control are essential • Linking heterogeneous data sources is a central theme in modern bioinformatics
  • 89. • How do you know which databases exist ? • NAR list • Weblinks on Nexus – Searchable – Maintainable
  • 91. • Tools available in the public domain for simultaneous access – Entrez – SRS • Batch queries to offload data into local databases for subsequent analysis (see further)
  • 94. • What if you want to search the complete human genome (golden path coordinates) instead of separate NCBI entries ? • ENSEMBL
  • 95. BioMart • Joint project between EBI and CSHL, http://www.biomart.org/ • Aim is to develop a generic, query-oriented data management system capable of integrating distributed data sources • 3 step system: – Start by selecting a dataset to query – Filter this dataset by applying the appropriate filters – Generate the output by selecting the attributes and output format • Available public biomart websites: http://www.biomart.org/biomart/martview
  • 96. BioMart - Single access point - Generic interface
  • 97. BioMart - ‘Out of the box’ website
  • 98. BioMart – 3 step system Dataset Attribute Filter
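  • 99. BioMart - 3 step system • Example query: name, chromosome position, description for all Ensembl genes located on chromosome 1, expressed in lung, associated with human homologues • Dataset: all Ensembl genes • Filter: located on chromosome 1, expressed in lung, associated with human homologues • Attribute: name, chromosome position, description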
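The dataset / filter / attribute pattern can be mimicked on a plain list of records. The sketch below is not the BioMart API and does not contact any server: the function `run_query`, the field names and the gene records are all hypothetical, made up only to show the three steps.

```python
def run_query(dataset, filters, attributes):
    """Step 2: keep rows matching every filter. Step 3: project the attributes."""
    rows = [row for row in dataset
            if all(row.get(key) == value for key, value in filters.items())]
    return [{name: row[name] for name in attributes} for row in rows]


# Step 1: choose a dataset (hypothetical toy records, not real Ensembl data).
genes = [
    {"name": "geneA", "chromosome": "1", "tissue": "lung",
     "description": "toy gene A"},
    {"name": "geneB", "chromosome": "2", "tissue": "lung",
     "description": "toy gene B"},
    {"name": "geneC", "chromosome": "1", "tissue": "liver",
     "description": "toy gene C"},
]

result = run_query(
    genes,
    filters={"chromosome": "1", "tissue": "lung"},
    attributes=["name", "description"],
)
print(result)
```

Keeping the filter and the attribute selection as separate arguments mirrors the way the web interface separates "which records" from "which columns".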
  • 100. BioMart - EnsMart • The first in line was EnsMart, a powerful data mining toolset for retrieving customized data sets from annotated genomes. EnsMart integrates data from Ensembl and various worldwide data sources. • EnsMart provides .... – Gene and protein annotation – Disease information – Cross-species analyses – SNPs affecting proteins – Allele frequency data – Retrieval by external identifiers – Retrieval by Gene Ontology – Customized sequence datasets – Microarray annotation tools
  • 101. Other BioMart implementations • Other data resources also implemented a BioMart interface: – Wormbase – Gramene – HapMap – DictyBase – euGenes
  • 106. BioBar • A toolbar for browsing biological data and databases http://biobar.mozdev.org/ • The following databases are included: http://biobar.mozdev.org/Databases.html • A toolbar for Mozilla-based browsers, including Firefox and Netscape 7+
  • 107. Weblems • Weblems Online (example posting) • W2.1: Which isolate of Tabac was used in the record with accession Z71230, and which human sample in the GenBank entry with accession AJ311677 ? • W2.2: Find all structures of GFP in the Protein Data Bank and draw a histogram of their dates of deposition. • W2.3: What is the chromosomal location of the human gene for insulin ? • W2.4: How many different human NHRs (nuclear hormone receptors) exist ? How many of these are single exon genes ? Are there any drugs working on this class of receptors ? • W2.5: The gene for Berardinelli-Seip syndrome was initially localized between two markers on chromosome band 11q13: D11S4191 and D11S987. a. How many base pairs are there in the interval between these two markers ? b. How many known genes are there ? c. List the gene ontology terms for that region.