Bioinformatica t2-databases
  1. 1. FBW 01-10-2012 Wim Van Criekinge
  2. 2. Course contents: Bioinformatics (NO CLASS)
  3. 3. Outline
     • Molecular Biology
     • Flat files “sequence” databases
       – DNA
       – Protein
       – Structure
     • Relational Databases
       – What?
       – Why?
     • Biological Relational Databases
       – How to?
  4. 4. Flat Files
     What is a “flat file”?
     • “Flat file” refers to data stored in a plain, ordinary file on the hard disk
     • Example: RefSeq
       – See CD-ROM
       – FILE: hs.GBFF
         • Hs: Homo sapiens
         • GBFF: GenBank File Format
         • (associated with TextPad; use a monospaced font, e.g. Courier)
  5. 5. Sequence entries
     gene    10317..12529
             /gene="ZK822.4"
     CDS     join(10317..10375,10714..10821,10874..10912,10960..11013,
             11061..11114,11169..11222,11346..11739,11859..11912,
             11962..12195,12242..12529)
             /gene="ZK822.4"
             /codon_start=1
             /protein_id="CAA98068.1"
             /db_xref="PID:g3881817"
             /db_xref="GI:3881817"
             /db_xref="SPTREMBL:Q23615"
             /translation="MHRHTYRKLYWNLGADGFSQGNADASVSAGSSGSNFLSGLQNSS
             FGQAVMGGINTYNQAKNSSGGNWQTAVANSSVGNFFQNGIDFFNGMKNGTQNFLDTDT
             IQETIGNSSFGEVVQTGVEFFNNIKNGNSPFQGDASSVMSQFVPFLANASAEAKAEFY
             TILPNFGNMTIAEFETAVNAWAAKYNLTDEVEAFNERSKNATVVAEEHANVVVMNLPN
             VLNNLKAISSDKNQTVVEMHTRMMAYVNSLDDDTRDIVFIFFRNLLPPQFKKSKCVDQ
             GNFLTNMYNKASDFFAGRNNRTDGEGSFWSGQGQNGNSGGSGFSSFFNNFNGQGNGNG
             NGAQNPMIGMFNNFMKKNNITADEANAAMADGGASIQILPAISAGWGDVAQVKIGGDF
             KIAVEEETKTTKKNKKQQQQANKNKNKNKKKTTIAPEAAIDANIAAEVHTQVL"
  6. 6. Nucleotide Databases
     EMBL Nucleotide Sequence Database (European Molecular Biology Laboratory)
       http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
     GenBank at NCBI (National Center for Biotechnology Information)
       http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
     DDBJ (DNA Database of Japan)
       http://www.ddbj.nig.ac.jp/
       DDBJ, the Center for operating DDBJ, National Institute of Genetics (NIG), Japan, established in April 1995.
     http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
     Release Notes (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
       Genetic Sequence Data Bank - August 15 2003
       NCBI-GenBank Flat File Release 137.0
       Distribution Release Notes
       33 865 022 251 bases, from 27 213 748 reported sequences
  7. 7. GenBank Format
     LOCUS       LISOD        756 bp    DNA             BCT       30-JUN-1993
     DEFINITION  L.ivanovii sod gene for superoxide dismutase.
     ACCESSION   X64011.1 GI:37619753
     NID         g44010
     KEYWORDS    sod gene; superoxide dismutase.
     SOURCE      Listeria ivanovii.
       ORGANISM  Listeria ivanovii
                 Eubacteria; Firmicutes; Low G+C gram-positive bacteria;
                 Bacillaceae; Listeria.
     REFERENCE   1 (bases 1 to 756)
       AUTHORS   Haas,A. and Goebel,W.
       TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii
                 by functional complementation in Escherichia coli and
                 characterization of the gene product
       JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
       MEDLINE   92140371
     REFERENCE   2 (bases 1 to 756)
       AUTHORS   Kreft,J.
       TITLE     Direct Submission
       JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
                 Universitaet Wuerzburg, Biozentrum Am Hubland, 8700
                 Wuerzburg, FRG
  8. 8. FEATURES             Location/Qualifiers
          source          1..756
                          /organism="Listeria ivanovii"
                          /strain="ATCC 19119"
                          /db_xref="taxon:1638"
          RBS             95..100
                          /gene="sod"
          gene            95..746
                          /gene="sod"
          CDS             109..717
                          /gene="sod"
                          /EC_number="1.15.1.1"
                          /codon_start=1
                          /product="superoxide dismutase"
                          /db_xref="PID:g44011"
                          /db_xref="SWISS-PROT:P28763"
                          /transl_table=11
                          /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKL
                          NEAVSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPN
                          GGGAPTGNLKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVS
                          TANQDSPLSEGKTPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRF
                          DAAK"
          terminator      723..746
                          /gene="sod"
  9. 9. Example of location descriptors
     Location        Description
     476             Points to a single base in the presented sequence
     340..565        Points to a continuous range of bases bounded by and including the starting and ending bases
     <345..500       The exact lower boundary point of a feature is unknown.
     (102.110)       Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110.
     (23.45)..600    Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end base is 600
     123^124         Points to a site between bases 123 and 124
     145^177         Points to a site anywhere between bases 145 and 177
     J00193:hladr    Points to a feature whose location is described in another entry: the feature labeled hladr in the entry (in this database) with primary accession J00193
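The simpler descriptors in this table can be turned into coordinates with a few lines of code. The sketch below is an illustration only, not an official GenBank parser: the names `parse_location` and `location_length` are made up for this example, and only single bases, `a..b` ranges (with optional `<` or `>` markers) and `join(...)` are handled. The `(102.110)`, `^` and cross-entry forms are rejected.

```python
import re

def parse_location(loc):
    """Return a list of (start, end) spans, 1-based and inclusive."""
    loc = loc.strip()
    if loc.startswith("join(") and loc.endswith(")"):
        parts = loc[5:-1].split(",")
    else:
        parts = [loc]
    spans = []
    for part in parts:
        # Accept "476", "340..565" and "<345..500"; anything else is unsupported here.
        m = re.fullmatch(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?", part.strip())
        if m is None:
            raise ValueError(f"unsupported location: {part!r}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        spans.append((start, end))
    return spans

def location_length(loc):
    """Number of bases covered by the listed spans.

    For a "<" or ">" location this is only the length of the stated range,
    because the true boundary is unknown.
    """
    return sum(end - start + 1 for start, end in parse_location(loc))

print(parse_location("340..565"))
print(location_length("join(10317..10375,10714..10821)"))
```

Applied to the CDS `109..717` of the LISOD entry, `location_length` gives 609 bases, that is 203 codons: the 202 residues of the translation plus the stop codon.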
  10. 10. BASE COUNT      247 a    136 c    151 g    222 t
      ORIGIN
              1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
             61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
            121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
            181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca
            241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt
            301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta
            361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca
            421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg
            481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt
            541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat
            601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca
            661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta
            721 tcgaaaggct cacttaggtg ggtcttttta tttcta
      //
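As a small illustration of how the ORIGIN block relates to the BASE COUNT line and to the feature coordinates, the sketch below strips the position numbers and spaces from the first two sequence lines of the entry, counts the bases, and slices out two features by their 1-based coordinates. The helper names `clean_sequence` and `subsequence` are assumptions made for this example; only 120 of the 756 bases are included, so the counts will not match the full BASE COUNT line.

```python
import re
from collections import Counter

# First two ORIGIN lines of the LISOD entry (bases 1..120).
ORIGIN_BLOCK = """
        1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
       61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
"""

def clean_sequence(origin_text):
    """Drop position numbers, spaces and newlines, keeping only the letters."""
    return re.sub(r"[^a-z]", "", origin_text.lower())

def subsequence(seq, start, end):
    """Slice with GenBank-style coordinates (1-based, end inclusive)."""
    return seq[start - 1:end]

seq = clean_sequence(ORIGIN_BLOCK)
counts = Counter(seq)
print(len(seq), dict(counts))
print(subsequence(seq, 95, 100))   # RBS feature 95..100 -> aggagg
print(subsequence(seq, 109, 111))  # first codon of CDS 109..717 -> atg
```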
  11. 11. EMBL format
      ID   LISOD      standard; DNA; PRO; 756 BP.                       IDentification
      XX
      AC   X64011; S78972;                                              Accession (Axxxxx, Afxxxxxx), GUID
      XX
      NI   g44010                                                       Nucleotide Identifier --> x.x
      XX
      DT   28-APR-1992 (Rel. 31, Created)                               DaTe
      DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
      XX
      DE   L.ivanovii sod gene for superoxide dismutase                 DEscription
      XX
      KW   sod gene; superoxide dismutase.                              KeyWord
      XX
      OS   Listeria ivanovii                                            Organism Species
      OC   Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
      OC   Listeria.                                                    Organism Classification
      XX
      RN   [1]
      RA   Haas A., Goebel W.;                                          Reference
      RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
      RT   functional complementation in Escherichia coli and
      RT   characterization of the gene product.";
      RL   Mol. Gen. Genet. 231:313-322(1992).
      XX
  12. 12. GenBank, EMBL & DDBJ: Comments
      • Collaboration GenBank/EMBL/DDBJ
        – Effort: identical within 24 hours
      • Redundant information
      • Historical graveyard
        – BankIt (responsibility of the submitter)
        – Version conflicts
      • IDIOSYNCRATIC (peculiar to the individual)
        – Heterogeneous annotation
        – No consistent quality check
          • Vectors, sequence errors, etc.
  13. 13. Other GenBank Formats
      • ASN.1
        – Computer friendly, human unfriendly
      • FASTA
        – Brief, loses information
        – Easy to use
        – Compatible with multiple sequences
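To make the FASTA trade-off concrete (brief, multiple sequences per file, but no annotation beyond the header line), here is a minimal reader sketch. The function name `parse_fasta` is made up for this example, the header descriptions are written by hand, and the sequences are the first 120 bases of the LISOD entry and the first 60 residues of the TNFA_HUMAN entry shown elsewhere in these slides.

```python
FASTA_TEXT = """\
>LISOD L.ivanovii sod gene for superoxide dismutase (first 120 bases)
cgttatttaaggtgttacatagttctatggaaatagggtctatacctttcgccttacaat
gtaatttcttttcacataaataataaacaatccgaggaggaatttttaatgacttacgaa
>TNFA_HUMAN tumor necrosis factor precursor (first 60 residues)
MSTESMIRDVELAEEALPKKTGGPQGSRRCLFLSLFSFLIVAGATTLFCLLHFGVIGPQR
"""

def parse_fasta(text):
    """Return {identifier: sequence}; the identifier is the first word after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

records = parse_fasta(FASTA_TEXT)
for name, sequence in records.items():
    print(name, len(sequence))
```

Everything a GenBank or EMBL entry carries in its feature table (CDS coordinates, cross-references, literature) has no place in this format, which is the information loss the slide refers to.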
  14. 14. Web Query tools & Programming Query tools
      • NCBI website example:
        – http://www.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
      • EBI UniProtKB website example:
        – http://www.ebi.ac.uk/uniprot/index.html
        – http://www.ebi.uniprot.org/search/SearchTools.shtml
  15. 15. Batch download (FTP server)
      • Data available via a website can usually also be downloaded as a complete batch from an FTP server.
      • Examples:
        – ftp://ftp.ncbi.nih.gov/
        – ftp://ftp.ebi.ac.uk/pub/
  16. 16. Sequence file format tips
      • When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
      • When retrieving from a database or exchanging between programs, use an annotated text format such as GenBank
      • When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary is not available)
        – ASN.1 (NCBI)
        – GBFF (Sanger)
        – XML
  17. 17. Expressed Sequence Tags
      • Sequence that codes for protein is < 5% of the genome.
      • Coding sequence can be obtained from mRNA by reverse transcription.
      • Tags for that sequence can be obtained by end-sequencing.
      • Incyte and HGS gambled on this being the useful part:
        – Search for homologies to known proteins, motifs.
        – Search for changed levels of expression and tissue specificity (“virtual/electronic northern” used in GeneCards)
      • ESTs have driven the huge expansion of GenBank:
        – UniGene now contains some sequence from most genes.
        – > 4,000,000 human EST sequences
        – http://www.ncbi.nlm.nih.gov/dbEST/
  18. 18. dbEST release 100303
      Summary by Organism - October 3, 2003
      Number of public entries: 18,762,324
      Homo sapiens (human)                         5,426,001
      Mus musculus + domesticus (mouse)            3,881,878
      Rattus sp. (rat)                               538,073
      Triticum aestivum (wheat)                      500,898
      Ciona intestinalis                             492,488
      Gallus gallus (chicken)                        451,565
      Zea mays (maize)                               383,416
      Danio rerio (zebrafish)                        362,362
      Hordeum vulgare + subsp. vulgare (barley)      348,233
      Xenopus laevis (African clawed frog)           344,695
      Glycine max (soybean)                          341,573
      Bos taurus (cattle)                            322,074
      Drosophila melanogaster (fruit fly)            261,404
  19. 19. Traces <-> strings • Traces contain much more information – TraceDB: http://www.ncbi.nlm.nih.gov/Traces/Example
  20. 20. Traces <-> strings
      • Phred
        – base calling, vector trimming, end of sequence read trimming
      • Phrap
        – Phrap uses Phred’s base calling scores to determine the consensus sequences. Phrap examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence
      • Consed
        – graphical interface extension that controls both Phred and Phrap
  21. 21. What is Phred?
      • Phred is a program that observes the base trace, makes base calls, and assigns quality values (qv) to the bases in the sequence.
      • It then writes base calls and qv to output files that will be used for Phrap assembly. The qv will be useful for consensus sequence construction.
      • For example,
          ATGCATGC   string1
          ATTCATGC   string2
          AT-CATGC   superstring
      • Here we have a mismatch between ‘G’ and ‘T’; the qv will determine what replaces the dash in the superstring. The base with the higher qv will replace the dash.
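The mismatch rule on this slide can be sketched in a few lines. This is a deliberate simplification for two pre-aligned reads of equal length, not Phrap's actual assembly algorithm; the function name `consensus` and the quality values are invented for the illustration.

```python
def consensus(read1, qual1, read2, qual2):
    """Per position, keep the base call with the higher quality value.

    Assumes both reads are already aligned and have the same length.
    Ties are resolved in favour of the first read.
    """
    if not (len(read1) == len(qual1) == len(read2) == len(qual2)):
        raise ValueError("reads and quality lists must have equal length")
    out = []
    for b1, q1, b2, q2 in zip(read1, qual1, read2, qual2):
        if b1 == b2 or q1 >= q2:
            out.append(b1)
        else:
            out.append(b2)
    return "".join(out)

string1 = "ATGCATGC"
string2 = "ATTCATGC"
qv1 = [40, 38, 12, 35, 40, 41, 39, 40]  # made-up values: low confidence in the G
qv2 = [39, 40, 33, 36, 40, 40, 38, 41]  # made-up values: higher confidence in the T

print(consensus(string1, qv1, string2, qv2))
```

With these made-up values the T (qv 33) wins over the G (qv 12) at the third position.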
  22. 22. How does Phred calculate qv?
      • From the base trace Phred knows the number of peaks and the actual peak locations.
      • Phred predicts peak locations.
      • Phred reads the actual peak locations from the base trace.
      • Phred matches the actual locations with the predicted locations by using Dynamic Programming.
      • The qv is related to the base call error probability (ep) by the formula qv = -10 * log10(ep)
      • Example: an error probability of 1 in 10,000 gives qv = 40
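The formula is easy to check numerically. The two helper names below are chosen for this example only; they simply apply qv = -10 * log10(ep) and its inverse.

```python
import math

def phred_quality(error_probability):
    """qv = -10 * log10(ep), for 0 < ep <= 1."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)

def error_probability(qv):
    """Inverse of the formula: ep = 10 ** (-qv / 10)."""
    return 10 ** (-qv / 10)

print(round(phred_quality(1 / 10000)))  # 1 error in 10,000 calls -> 40
print(error_probability(20))            # qv 20 -> about 1 error in 100 calls
```

Each step of 10 in qv corresponds to a tenfold drop in the error probability.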
  23. 23. Why Phred?
      • The output sequence might contain errors.
      • Vector contamination might occur.
      • The dye-terminator reaction might not occur.
      • Segment migration may be abnormal in gel electrophoresis.
      • The signal strength of the peak corresponding to a base may be weak or variable.
  24. 24. Vector Trimming
  25. 25. End of Sequence Cropping
      • It is common for the end of a sequencing read to contain poor data. This is due to the difficulty of resolving larger fragments, around 1 kb (it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp).
      • Phred assigns a non-value of ‘x’ to this data by comparing peak separation and peak intensity to internal standards. If the standard threshold score is not reached, the data will not be used.
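A minimal sketch of end cropping, assuming the base calls and their quality values are already available as a string and a list. This is not Phred's own procedure (which works from peak separation and intensity, as described above); it only illustrates the idea of discarding the low-quality tail. The function name and the threshold of 20 are arbitrary choices for the example.

```python
def trim_low_quality_end(bases, quals, threshold=20):
    """Drop trailing bases whose quality value is below the threshold."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    end = len(bases)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return bases[:end], quals[:end]

read = "ACGTACGT"
quality = [30, 32, 35, 31, 28, 12, 9, 4]  # made-up values, degrading towards the end

trimmed, kept = trim_low_quality_end(read, quality)
print(trimmed, kept)
```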
  26. 26. Traces <-> strings
      • Handle traces
        – abiview (EMBOSS)
        – BioEdit
        – Acembly, …
      • EXAMPLE
  27. 27. NCBI reference sequences
      The RefSeq database is a non-redundant set of reference standards that includes chromosomes, complete genomic molecules, intermediate assembled genomic contigs, curated genomic regions, mRNAs, RNAs, and proteins.
  28. 28. RefSeq nomenclature
      NC_####   complete genomic
      NG_####   incomplete genomic
      NM_####   mRNA
      NR_####   noncoding transcripts
      NP_####   proteins
      NT_####   intermediate genomic contigs
  29. 29. RefSeq nomenclature - models
      XM_####   mRNA
      XR_####   RNA
      XP_####   protein
      Automated Homo sapiens models provided by the Genome Annotation process; sequence corresponds to the genomic contig.
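The prefix tables on the two nomenclature slides map directly onto a lookup table. The sketch below is an illustration: the function name `refseq_type` is made up, the accession numbers in the usage lines are placeholders rather than real records, and only the prefixes listed on the slides are covered.

```python
# Prefix -> molecule type, as listed on the two RefSeq nomenclature slides.
REFSEQ_PREFIXES = {
    "NC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "NR": "noncoding transcript",
    "NP": "protein",
    "NT": "intermediate genomic contig",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}

def refseq_type(accession):
    """Classify an accession by its two-letter RefSeq prefix."""
    prefix, separator, _ = accession.partition("_")
    if not separator or prefix not in REFSEQ_PREFIXES:
        return "unknown"
    return REFSEQ_PREFIXES[prefix]

# Placeholder accession numbers, used only to exercise the lookup.
print(refseq_type("NM_123456"))
print(refseq_type("XP_123456.1"))
print(refseq_type("X64011"))  # an EMBL/GenBank accession, not a RefSeq one
```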
  30. 30. Open reading frame
      • Definition:
        – A stretch of triplet codons with an initiator codon at one end and a stop codon at the other, as identifiable by nucleotide sequences.
      • Example
        – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=6688473&dopt=GenBank&term=Y18948.1&qty=1
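The definition translates into a simple scan: walk each reading frame codon by codon, remember the first initiator codon seen, and close the ORF at the next stop codon. The sketch below rests on several assumptions not stated on the slide: only the forward strand is scanned, ATG is the only initiator, and the standard stop codons TAA, TAG and TGA are used. The function name `find_orfs` and the test sequence are made up.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def find_orfs(seq, min_codons=1):
    """Return (start, end) pairs, 1-based inclusive, for ATG...stop stretches.

    min_codons counts the coding codons (start codon included, stop excluded).
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "atg":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return orfs

# Made-up sequence: frame 3 holds atg aaa tga, frame 2 holds an atg with no stop.
print(find_orfs("ccatgaaatgattaacc"))
```

An initiator codon without a downstream in-frame stop, as in the second frame of the made-up sequence, is not reported.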
  31. 31. Protein sequence database: SWISS-PROT & TREMBL
      SwissProt - http://expasy.hcuge.ch/sprot/
      SWISS-PROT is an annotated protein sequence database.
      The sequences are translated from the EMBL Nucleotide Sequence Database.
      Sequence entries are composed of different lines. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database.
      Continuously updated (daily).
  32. 32. Different Features of SWISS-PROT
      • Format follows as closely as possible that of EMBL’s
      • Curated protein sequence database
      • Three differences:
        1. Strives to provide a high level of annotation
        2. Minimal level of redundancy
        3. High level of integration with other databases
  33. 33. 1. Annotation
      The core data of an entry are the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein). The annotation covers items such as protein functions, post-translational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, variants, etc.
  34. 34. 2. Minimal Redundancy
      Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize the redundancy. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
  35. 35. 3. Integration With Other Databases
      • SWISS-PROT and TrEMBL - Protein sequences
      • PROSITE - Protein families and domains
      • SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
      • SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
      • SWISS-MODEL Repository - Automatically generated protein models
      • CD40Lbase - CD40 ligand defects
      • ENZYME - Enzyme nomenclature
      • SeqAnalRef - Sequence analysis bibliographic references
  36. 36. TREMBL - http://expasy.hcuge.ch/sprot/
      Translated EMBL sequences not (yet) in Swissprot. Updated faster than SWISS-PROT.
      TREMBL - two parts
      1. SP-TREMBL
         Will eventually be incorporated into Swissprot
         Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT.
      2. REM-TREMBL (remaining)
         Will NOT be incorporated into Swissprot
         Divided into: Immunoglobulins and T-cell receptors, Synthetic sequences, Patent application sequences, Small fragments, CDS not coding for real proteins
  37. 37. SWISS-PROT/TrEMBL
      • TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
      • SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152 entries
        TrEMBL Release 16.2 of 23-Mar-2001: 436,924 entries
  38. 38. Example of a SwissProt entry
      ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.                     IDentification
      AC   P01375;                                                          ACcession
      DT   21-JUL-1986 (REL. 01, CREATED)                                   DaTe
      DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
      DT   15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
      DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
      GN   TNFA.                                                            Gene name
      OS   HOMO SAPIENS (HUMAN).                                            Organism Species
      OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
      OC   EUTHERIA; PRIMATES.                                              Organism Classification
      RN   [1]                                                              Reference
      RP   SEQUENCE FROM N.A.
      RX   MEDLINE; 87217060.
      RA   NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
      RA   AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
      RA   FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
      RA   CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
      RL   COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
      RN   [2]
      RP   SEQUENCE FROM N.A.
      RX   MEDLINE; 85086244.
      RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
      RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
      RL   NATURE 312:724-729(1984).
      ...
  39. 39. CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
      CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
      CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
      CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
      CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
      CC       UNDER CERTAIN CONDITIONS.                                    Comments
      CC   -!- SUBUNIT: HOMOTRIMER.
      CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
      CC       AN EXTRACELLULAR SOLUBLE FORM.
      CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
      CC       PROTEOLYTIC PROCESSING.
      CC   -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
      CC       CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
      CC       HEALTH AND MALNUTRITION.
      CC   -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
      DR   EMBL; X02910; G37210; -.                                         Database Cross-references
      DR   EMBL; M16441; G339741; -.
      DR   EMBL; X01394; G37220; -.
      DR   EMBL; M10988; G339738; -.
      DR   EMBL; M26331; G339764; -.
      DR   EMBL; Z15026; G37212; -.
      DR   PIR; B23784; QWHUN.
      DR   PIR; A44189; A44189.
      DR   PDB; 1TNF; 15-JAN-91.
      DR   PDB; 2TUN; 31-JAN-94.
  40. 40. KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
      KW   MYRISTYLATION; 3D-STRUCTURE.                                     KeyWord
      FT   PROPEP        1     76                                           Feature Table
      FT   CHAIN        77    233       TUMOR NECROSIS FACTOR.
      FT   TRANSMEM     36     56       SIGNAL-ANCHOR (TYPE-II PROTEIN).
      FT   LIPID        19     19       MYRISTATE.
      FT   LIPID        20     20       MYRISTATE.
      FT   DISULFID    145    177
      FT   MUTAGEN     105    105       L->S: LOW ACTIVITY.
      FT   MUTAGEN     108    108       R->W: BIOLOGICALLY INACTIVE.
      FT   MUTAGEN     112    112       L->F: BIOLOGICALLY INACTIVE.
      FT   MUTAGEN     162    162       S->F: BIOLOGICALLY INACTIVE.
      FT   MUTAGEN     167    167       V->A,D: BIOLOGICALLY INACTIVE.
      FT   MUTAGEN     222    222       E->K: BIOLOGICALLY INACTIVE.
      FT   CONFLICT     63     63       F -> S (IN REF. 5).
      FT   STRAND       89     93
      FT   TURN         99    100
      FT   TURN        109    110
      FT   STRAND      112    113
      FT   TURN        115    116
      FT   STRAND      118    119
      FT   STRAND      124    125
  41. 41. FT   STRAND      130    143
      FT   STRAND      152    159
      FT   STRAND      166    170
      FT   STRAND      173    174
      FT   TURN        183    184
      FT   STRAND      189    202
      FT   TURN        204    205
      FT   STRAND      207    212
      FT   HELIX       215    217
      FT   STRAND      218    218
      FT   STRAND      227    232
      SQ   SEQUENCE   233 AA;  25644 MW;  666D7069 CRC32;
           MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
           EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
           DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
           TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
      //
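Because every line of a SWISS-PROT (or EMBL) entry starts with a two-letter line code, a first-pass parser can simply group lines by that code. The sketch below uses a hand-shortened excerpt of the TNFA_HUMAN entry; the function name `group_by_line_code` is made up, and a real parser would also have to deal with continuation lines, the feature table columns and the sequence block.

```python
from collections import defaultdict

# Shortened excerpt of the TNFA_HUMAN entry shown in the slides.
ENTRY = """\
ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN   TNFA.
OS   HOMO SAPIENS (HUMAN).
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
//
"""

def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its line contents."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code, _, rest = line.partition(" ")
        fields[code].append(rest.strip())
    return dict(fields)

fields = group_by_line_code(ENTRY)
print(fields["ID"][0].split()[0])  # entry name
print(fields["AC"])
print(fields["DR"])                # cross-references to other databases
```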
  42. 42. Protein searching
      3 levels of protein searching
      1. Swissprot                              Little noise   Annotated entries
      2. Swissprot + TREMBL                     More noisy     All probable entries
      3. Translated EMBL (tblast or tfasta)     Most noisy     All possible entries
  43. 43. New initiatives
      • IPI: International Protein Index
        – http://www.ebi.ac.uk/IPI/IPIhelp.html
      • UNIPROT: Universal Protein Knowledgebase
        – http://www.pir.uniprot.org/
      • HPRD: Human Protein Reference Database
        – http://www.hprd.org/
  44. 44. UniProt
      UniProt Consortium
      • European Bioinformatics Institute (EBI)
      • Swiss Institute of Bioinformatics (SIB)
      • Protein Information Resource (PIR)
      UniProt Databases
      • UniProt Knowledgebase (UniProtKB)
      • UniProt Reference Clusters (UniRef)
      • UniProt Archive (UniParc)
      UniProtKB
      • Swiss-Prot (annotated protein sequence db, gold standard)
      • TrEMBL (translated EMBL + automated electronic annotations)
  45. 45. Understanding molecular structure is critical to the understanding of biology because structure determines function
  46. 46. From Structure to Function
      • the drug morphine has chemical groups that are functionally equivalent to the natural endorphins found in the human body
  47. 47. From Structure to Function
      • the drug morphine has chemical groups that are functionally equivalent to the natural endorphins found in the human body
      • the receptor molecules located at the synapse (between two neurons) bind morphine much the same way as endorphins
      • therefore, morphine is able to attenuate the pain response
  48. 48. Structure databases: Protein Data Bank (PDB)
      Protein Data Bank - http://www.rcsb.org/pdb
      Diffraction          7373 structures determined by X-ray diffraction
      NMR                  388 structures determined by NMR spectroscopy
      Theoretical Model    201 structures proposed by modeling
  49. 49. • PDB holds three-dimensional structures of proteins, plus some nucleic acids
      • PDB is operated by the RCSB (Research Collaboratory for Structural Bioinformatics), funded by NSF, DOE, and two units of NIH: NIGMS (National Institute of General Medical Sciences) and NLM (National Library of Medicine).
      • Established at Brookhaven National Laboratory (BNL) in 1971, as an archive for biological macromolecular crystal structures
      • In the 1980s, the number of deposited structures began to increase dramatically.
      • In October 1998, the management of the PDB became the responsibility of the RCSB.
      • Website: http://www.rcsb.org
  50. 50. PDB Holdings List: 27-Mar-2001
                                   Molecule Type
      Exp. Tech.                   Proteins, Peptides,   Protein/Nucleic   Nucleic   Carbohydrates   total
                                   and Viruses           Acid Complexes    Acids
      X-ray Diffraction and other  11045                 526               552       14              12137
      NMR                          1832                  71                366       4               2273
      Theoretical Modeling         281                   19                21        0               321
      total                        13158                 616               939       18              14731
      5032 Structure Factor Files
      968 NMR Restraint Files
  51. 51. PDB Content Growth
  52. 52. PDB Growth in New Folds
  53. 53. Other structure databasesBioMagResBank http://www.bmrb.wisc.edu/A Repository for Data from NMR Spectroscopy on Proteins, Peptides, and NucleicAcidsBiological Macromolecule Crystallization Database (BMCD) http://h178133.carb.nist.gov:4400/bmcd/bmcd.htmlContains crystal data and the crystallization conditions, which have been compiledfrom literatureNucleic Acid Database (NDB) http://ndbserver.rutgers.edu:80/Assembles and distributes structural information about nucleic acidsStructural Classification of Proteins (SCOP) http://scop.mrc-lmb.cam.ac.uk/scop/Structure similarity search. Hierarchic organization.MOOSE http://db2.sdsc.edu/moose/Macromolecular Structure QueryCambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/Small molecules.
  54. 54. Protein Splicing?
• Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein protein and the free intein
• http://www.neb.com/inteins/intein_intro.html
  55. 55. Biological databases • NAR Database Issue – Every year: NAR DB Issue – The 2006 update includes 858 databases – Citation top 5 are: • Pfam • Gene Ontology • UniProt • SMART • KEGG – Primary Nucleotide DB’s and PDB are not cited anymore
  56. 56. Outline • Molecular Biology • Flat files “sequence” databases – DNA – Protein – Structure • Relational Databases – What ? – Why ? • Biological Relational Databases – Howto ?
  57. 57. Why biological databases ? • Explosive growth in biological data • Data (sequences, 3D structures, 2D gel analysis, MS analysis….) are no longer published in a conventional manner, but directly submitted to databases • Essential tools for biological research, as classical publications used to be !
  58. 58. Problems with Flat files … • Wasted storage space • Wasted processing time • Data control problems • Problems caused by changes to data structures • Access to data difficult • Data out of date • Constraints are system based • Limited querying eg. all single exon GPCRs (<1000 bp)
  59. 59. Relational
• The relational model is not only very mature; there is also a large body of knowledge on how to make a relational back-end fast and reliable, and on how to exploit technologies such as massive SMP, optical jukeboxes, and clustering. Object databases are nowhere near this, and I do not expect them to get there in the short or medium term.
• Relational databases rest on a well-known, proven, and simple mathematical theory (set theory), which makes possible
  – automatic cost-based query optimization,
  – schema generation from high-level models, and
  – many other features that are now vital for the development and operation of mission-critical information systems.
  60. 60.
• What is a relational database ?
  – Sets of tables and links (the data)
  – A language to query the database (Structured Query Language, SQL)
  – A program to manage the data (RDBMS)
• Flat files are not relational
  – Data type (attribute) is part of the data
  – Record order matters
  – Multiline records
  – Massive duplication
    • E.g. Organism: Homo sapiens, Eukaryota, …
  – Some records are hierarchical
    • Xrefs
  – Records contain multiple “sub-records”
  – Implicit “Key”
  61. 61. The Benefits of Databases • Redundancy can be reduced • Inconsistency can be avoided • Conflicting requirements can be balanced • Standards can be enforced • Data can be shared • Data independence • Integrity can be maintained • Security restrictions can be applied
  62. 62. Disadvantages • size • complexity • cost • Additional hardware costs • Higher impact of failure • Recovery more difficult
  63. 63. Relational Terminology
Table (Relation): CUSTOMER
ID   NAME              PHONE           EMP_ID
201  Unisports         55-2066101      12
202  Simms Atheletics  81-20101        14
203  Delhi Sports      91-10351        14
204  Womansport        1-206-104-0103  11
Each horizontal line of the table is a Row (Tuple); each vertical field is a Column (Attribute).
  64. 64. Relational Database Terminology
• Each row of data in a table is uniquely identified by a primary key (PK)
• Information in multiple tables can be logically related by foreign keys (FK)
Table Name: CUSTOMER (ID is the primary key; EMP_ID is a foreign key)
ID   NAME              PHONE           EMP_ID
201  Unisports         55-2066101      12
202  Simms Atheletics  81-20101        14
203  Delhi Sports      91-10351        14
204  Womansport        1-206-104-0103  11
Table Name: EMP (ID is the primary key)
ID  LAST_NAME  FIRST_NAME
10  Havel      Marta
11  Magee      Colin
12  Giljum     Henry
14  Nguyen     Mai
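The primary key / foreign key idea on this slide can be tried out directly. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in RDBMS (the slides do not prescribe one); the table contents are the slide's, while the helper `try_insert_customer` and the rejected test rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE emp (
    id         INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL
);
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    phone  TEXT,
    emp_id INTEGER REFERENCES emp(id)   -- foreign key to EMP
);
""")

with conn:  # commit on success
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
        (10, "Havel", "Marta"), (11, "Magee", "Colin"),
        (12, "Giljum", "Henry"), (14, "Nguyen", "Mai"),
    ])
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
        (201, "Unisports", "55-2066101", 12),
        (202, "Simms Atheletics", "81-20101", 14),
        (203, "Delhi Sports", "91-10351", 14),
        (204, "Womansport", "1-206-104-0103", 11),
    ])

def try_insert_customer(row):
    """Hypothetical helper: return True if the row was accepted."""
    try:
        with conn:  # rolls back automatically if the insert fails
            conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# The foreign key lets us follow each customer to its employee.
customers_by_rep = conn.execute("""
    SELECT c.name, e.last_name
    FROM customer AS c JOIN emp AS e ON c.emp_id = e.id
    ORDER BY c.id
""").fetchall()

# Employee 99 does not exist, and customer 201 already exists.
dangling_fk_accepted = try_insert_customer((205, "Hypothetical Sports", None, 99))
duplicate_pk_accepted = try_insert_customer((201, "Hypothetical Sports", None, 12))
```

Both rejected inserts illustrate constraints that the database enforces itself, rather than leaving them to every program that touches the data.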
  65. 65.
• RDBMS products
  – Free
    • MySQL: very fast, widely used, easy to jump into, but limited, non-standard SQL
    • PostgreSQL: full SQL, limited OO, steeper learning curve than MySQL
  – Commercial
    • MS Access: great query builder, GUI interfaces
    • MS SQL Server: full SQL, NT only
    • Oracle: everything, including the kitchen sink
    • IBM DB2, Sybase
  66. 66. A simple datamodel (tables and relations)
Prot_id  name        seq          Species_id
1        GTM1_HUMAN  MGTDHG…      1
2        GTM1_RAT    MGHJADSW..   2
3        GTM2_HUMAN  MVSDBSVD..   1

Species_id  name   Full Lineage
1           human  Homo Sapiens …
2           rat    Rattus rattus
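As an illustration of how such a two-table model removes the duplication found in flat files, here is a minimal sketch using Python's built-in `sqlite3` module (an assumption; any RDBMS would do). The `flat_records` list is hypothetical input, and the sequence strings are only the truncated placeholders shown on the slide, not real sequences.

```python
import sqlite3

# Hypothetical flat-file style records: the organism is repeated on every record.
flat_records = [
    ("GTM1_HUMAN", "MGTDHG", "human", "Homo Sapiens"),
    ("GTM1_RAT", "MGHJADSW", "rat", "Rattus rattus"),
    ("GTM2_HUMAN", "MVSDBSVD", "human", "Homo Sapiens"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (
    species_id   INTEGER PRIMARY KEY,
    name         TEXT UNIQUE NOT NULL,
    full_lineage TEXT
);
CREATE TABLE protein (
    prot_id    INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    seq        TEXT NOT NULL,
    species_id INTEGER NOT NULL REFERENCES species(species_id)
);
""")

with conn:
    for prot_name, seq, species, lineage in flat_records:
        # Store each species once; a repeated name is silently skipped.
        conn.execute(
            "INSERT OR IGNORE INTO species (name, full_lineage) VALUES (?, ?)",
            (species, lineage),
        )
        (species_id,) = conn.execute(
            "SELECT species_id FROM species WHERE name = ?", (species,)
        ).fetchone()
        conn.execute(
            "INSERT INTO protein (name, seq, species_id) VALUES (?, ?, ?)",
            (prot_name, seq, species_id),
        )

species_count = conn.execute("SELECT COUNT(*) FROM species").fetchone()[0]

# The join reassembles the flat view on demand.
protein_species = conn.execute("""
    SELECT p.name, s.name
    FROM protein AS p JOIN species AS s USING (species_id)
    ORDER BY p.prot_id
""").fetchall()
```

Three protein records end up pointing at two species rows, so a correction to a lineage has to be made in only one place.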
  67. 67. Relational Database Fundamentals
• Basic SQL
  – SELECT
  – FROM
  – WHERE
  – JOIN
  – NATURAL, INNER, OUTER
• Other SQL functions
  – COUNT()
  – MAX(), MIN(), AVG()
  – DISTINCT
  – ORDER BY
  – GROUP BY
  – LIMIT
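A short sketch of these keywords in action, again assuming Python's built-in `sqlite3` module. The table `protein` and all of its rows are hypothetical toy data. Note that standard SQL spells the average aggregate `AVG()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (name TEXT, species TEXT, length INTEGER)")
with conn:
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
        ("P1", "human", 120),   # hypothetical toy rows
        ("P2", "human", 300),
        ("P3", "rat", 180),
        ("P4", "rat", 240),
        ("P5", "mouse", 90),
    ])

# COUNT, MIN, MAX, AVG over the whole table
summary = conn.execute(
    "SELECT COUNT(*), MIN(length), MAX(length), AVG(length) FROM protein"
).fetchone()

# DISTINCT with ORDER BY
species = [row[0] for row in conn.execute(
    "SELECT DISTINCT species FROM protein ORDER BY species"
)]

# GROUP BY: one result row per species
per_species = conn.execute("""
    SELECT species, COUNT(*), AVG(length)
    FROM protein
    GROUP BY species
    ORDER BY species
""").fetchall()

# ORDER BY ... LIMIT: the two longest proteins
two_longest = conn.execute(
    "SELECT name, length FROM protein ORDER BY length DESC LIMIT 2"
).fetchall()
```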
  68. 68. BioSQL
  69. 69.
• Query: a request to retrieve data from a database is called a query
• eg. MyGPCRdb
  – Bioentry
  – Taxid (include full lineage)
  – Linking table (bioentry_tax)
  70. 70. MyGPCR: give me all GPCRs that are shorter than 1000 bp
select * from bioentry;
select count(*) from bioentry;
select * from bioentry inner join biosequence on bioentry.bioentry_id = biosequence.bioentry_id;
select * from bioentry inner join biosequence on bioentry.bioentry_id = biosequence.bioentry_id where length(biosequence_str) < 1000;
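The final query on this slide can be tried against a toy database. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` module instead of the MySQL back-end used in the course, and the two tables keep only the columns the slide's queries mention, not the full BioSQL schema. The entry names and sequences are invented so that their lengths are known.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE biosequence (
    bioentry_id     INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    biosequence_str TEXT NOT NULL
);
""")

# Hypothetical entries with sequence lengths 400, 999, 1000 and 1200.
toy_entries = [
    (1, "TOY_GPCR_A", "ACGT" * 100),
    (2, "TOY_GPCR_B", "ACGT" * 249 + "ACG"),
    (3, "TOY_GPCR_C", "ACGT" * 250),
    (4, "TOY_GPCR_D", "ACGT" * 300),
]
with conn:
    for bioentry_id, name, seq in toy_entries:
        conn.execute("INSERT INTO bioentry VALUES (?, ?)", (bioentry_id, name))
        conn.execute("INSERT INTO biosequence VALUES (?, ?)", (bioentry_id, seq))

def entries_shorter_than(max_length):
    """Names and lengths of entries whose sequence is shorter than max_length."""
    return conn.execute("""
        SELECT bioentry.name, length(biosequence_str)
        FROM bioentry
        INNER JOIN biosequence
            ON bioentry.bioentry_id = biosequence.bioentry_id
        WHERE length(biosequence_str) < ?
        ORDER BY bioentry.bioentry_id
    """, (max_length,)).fetchall()

short_entries = entries_shorter_than(1000)
```

Passing the threshold as a bound parameter (`?`) instead of pasting it into the SQL string keeps the query reusable for other cut-offs.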
  71. 71. Example 3-tier model in biological database
Example of different interfaces to the same back-end database (MySQL) http://www.bioinformatics.be
  72. 72. Overview
• Databases
  – Flat files (FF)
    • *.txt
    • Indexed version
  – Relational (RDBMS)
    • Access, MySQL, PostgreSQL, Oracle
  – OO (OODBMS)
    • AceDB, ObjectStore
  – Hierarchical
    • XML
  – Frame-based systems
    • Eg. DAML+OIL
  – Hybrid systems
  73. 73. Object
• The object paradigm is already proven for application design and development, but it may simply not be an adequate paradigm for the data store.
• Object databases are modelled as graphs. Graph theory plays a great role in computer science, but it is also a rich source of intractable problems, the NP-complete class: problems for which no computationally efficient solution is known, so that exponential complexity cannot currently be avoided. This is not a limit of current technology; it is a limit inherent to the problem domain.
• Hybrid object-relational databases will probably be the long-term solution for the industry. They put a thin object layer above the relational structure, thus providing a syntax and semantics closer to object-oriented design and programming tools. They simply make it easier to build the data-layer classes.
  74. 74. Conclusions
• A database is a central component of any contemporary information system
• The operations on the database and the maintenance of database consistency are handled by a DBMS
• There exist stand-alone query languages and embedded languages, but both deal with definition (DDL) and manipulation (DML) aspects
• The structural properties, constraints and operations permitted within a DBMS are defined by a data model: hierarchical, network, relational
• Recovery and concurrency control are essential
• Linking of heterogeneous data sources is a central theme in modern bioinformatics
  75. 75.
• How do you know which databases exist ?
• NAR list
• Weblinks on Nexus
  – Searchable
  – Maintainable
  76. 76. • Tools available in public domain for simultaneous access – entrez – srs• Batch queries for offload in local databases for subsequent analysis (see further)
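For batch retrieval through Entrez, NCBI offers the E-utilities web service. The following is a minimal sketch that only builds an EFetch request URL and does not contact the server. It assumes the public `efetch.fcgi` endpoint with its `db`, `id`, `rettype` and `retmode` parameters; the function name `efetch_url` is hypothetical, and the two accession numbers are the ones used in the Weblems of this lesson. Consult the NCBI documentation for current usage limits before sending real batch requests.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the NCBI EFetch service (assumed current endpoint).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accessions, db="nuccore", rettype="gb", retmode="text"):
    """Build (but do not send) an EFetch URL for a batch of accessions.

    rettype="gb" asks for GenBank flat-file records; "fasta" asks for FASTA.
    """
    params = {
        "db": db,
        "id": ",".join(accessions),  # several records in one request
        "rettype": rettype,
        "retmode": retmode,
    }
    return EFETCH_BASE + "?" + urlencode(params)

url = efetch_url(["Z71230", "AJ311677"])

# Decode the query string again to check what would be sent.
sent = parse_qs(urlsplit(url).query)
```

The resulting URL could be handed to any HTTP client, and the downloaded flat files parsed and loaded into a local database for further analysis.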
  77. 77. • What if you want to search the complete human genome (golden path coordinates) instead of separate NCBI entries ?• ENSEMBL
  78. 78. BioMart
• Joint project between EBI and CSHL, http://www.biomart.org/
• Aim is to develop a generic, query-oriented data management system capable of integrating distributed data sources
• 3 step system:
  – Start by selecting a dataset to query
  – Filter this dataset by applying the appropriate filters
  – Generate the output by selecting the attributes and output format
• Available public biomart websites: http://www.biomart.org/biomart/martview
  79. 79. BioMart - Single access point - Generic interface
  80. 80. BioMart - ‘Out of the box’ website
  81. 81. BioMart – 3 step system Dataset Attribute Filter
  82. 82. BioMart - 3 step system
• Attribute: name, chromosome position, description
• Dataset: for all Ensembl genes
• Filter: located on chromosome 1, expressed in lung, associated with human homologues
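The dataset / filter / attribute pattern can be mimicked in a few lines of Python. This is only a sketch of the order of operations, not the BioMart software or its API: the dataset `toy_genes`, its records, the field names and the function `run_query` are all hypothetical.

```python
# Hypothetical in-memory "mart" with a single toy dataset.
TOY_DATASETS = {
    "toy_genes": [
        {"name": "GENE_A", "chromosome": "1", "start": 1500,
         "description": "toy gene A", "tissues": {"lung", "liver"}},
        {"name": "GENE_B", "chromosome": "1", "start": 9000,
         "description": "toy gene B", "tissues": {"brain"}},
        {"name": "GENE_C", "chromosome": "2", "start": 4200,
         "description": "toy gene C", "tissues": {"lung"}},
    ]
}

def run_query(dataset_name, filters, attributes):
    """Step 1: pick a dataset. Step 2: apply filters. Step 3: select attributes."""
    rows = TOY_DATASETS[dataset_name]
    kept = [row for row in rows if all(keep(row) for keep in filters)]
    return [tuple(row[attr] for attr in attributes) for row in kept]

result = run_query(
    "toy_genes",
    filters=[
        lambda gene: gene["chromosome"] == "1",   # located on chromosome 1
        lambda gene: "lung" in gene["tissues"],   # expressed in lung
    ],
    attributes=["name", "start", "description"],
)
```

Keeping the three steps as separate arguments is what allows one generic interface to serve many different data sources.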
  83. 83. BioMart - EnsMart • The first in line was EnsMart, a powerful data mining toolset for retrieving customized data sets from annotated genomes. EnsMart integrates data from Ensembl and various worldwide data sources. • EnsMart provides .... – Gene and protein annotation – Disease information – Cross-species analyses – SNPs affecting proteins – Allele frequency data – Retrieval by external identifiers – Retrieval by Gene Ontology – Customized sequence datasets – Microarray annotation tools
  84. 84. Other BioMart implementations • Other data resources also implemented a BioMart interface: – Wormbase – Gramene – HapMap – DictyBase – euGenes
  85. 85. Single interface
  86. 86. BioBar
• A toolbar for browsing biological data and databases http://biobar.mozdev.org/
• The following databases are included http://biobar.mozdev.org/Databases.html
• A toolbar for Mozilla-based browsers including Firefox and Netscape 7+
  87. 87. Weblems
Weblems Online (example posting)
W2.1: Which tobacco isolate was used in the record with accession Z71230, and which human sample in the GenBank entry with accession AJ311677 ?
W2.2: Find all structures of GFP in the Protein Data Bank and draw a histogram of their dates of deposition.
W2.3: What is the chromosomal location of the human gene for insulin ?
W2.4: How many different human NHRs (nuclear hormone receptors) exist ? How many of these are single exon genes ? Are there any drugs working on this class of receptors ?
W2.5: The gene for Berardinelli-Seip syndrome was initially localized between two markers on chromosome band 11q13: D11S4191 and D11S987.
  a. How many base pairs are there in the interval between these two markers ?
  b. How many known genes are there ?
  c. List the gene ontology terms for that region.
