Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformatics Course - Session 1.2 - VHIR, Barcelona)

Course: Bioinformatics for Biomedical Research (2014).
Session: 1.2- Storing and Accessing Information. Databases and Queries.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.

Transcript

  • 1. STORING AND ACCESSING INFORMATION: DATABASES AND QUERIES. Course: Bioinformatics for Biomedical Research (Bioinformàtica per la Recerca Biomèdica), http://ueb.vhir.org/2014BRB. Alex Sánchez, alex.sanchez@vhir.org, 13/05/2014. Hospital Universitari Vall d’Hebron Institut de Recerca - VHIR, Institut d’Investigació Sanitària de l’Instituto de Salud Carlos III (ISCIII).
  • 2. PRESENTATION OUTLINE
      1. Data banks and databases: Information in the genomics era; Distinct DB usages; To take into account; Main resources providers
      2. Types of databases: EMBL vs NCBI; Bibliography DB; Taxonomy DB; Nucleotide DB; Genome DB; Protein DB; Microarray DB; Other DB; Lists of DB
      3. Structure and formats of the databases: Structure of the DB; Formats of the DB; Sequence FASTA format; GenBank entry example; EMBL entry example
      4. Submitting data: Submitting sequences; Submitting expression data
      5. Tools for DB exploitation: ENTREZ; Cross-search tables; Entrez queries; Entrez fields; Help system
  • 3. Data banks and databases
  • 4. INFORMATION IN THE GENOMICS ERA
      • Genomics era: huge amount of data
      • To be usable, this information must be properly stored
      • Access to that information
          – Must be quick
          – Has to be flexible
      • That is possible thanks to
          – The creation of databases
          – Their online availability
  • 5. DISTINCT DB USAGES
      • Information search
          – By keyword, accession number, authors…
      • Homology search
          – Is there any sequence identical or similar to mine?
      • Pattern search
          – Does my sequence contain any known pattern?
      • Predictions
          – Can I find proteins with an already known function that are similar to mine?
  • 6. Bioinformatics reagent: Databases
      • Organized array of information
      • Place where you put things in, and (if all is well) you should be able to get them out again
      • Resource for other databases and tools
      • Simplify the information space by specialization
      • Bonus: allows you to make discoveries
      • Important question to ask: what is the data model?
  • 7. Bioinformatics experiments: BLAST search / sequence alignment
      • Reagents: sequence, databases
      • Method (query type - database type):
          – P-P: BLASTP
          – N-P: BLASTX
          – P-N: TBLASTN
          – N-N: BLASTN
          – N (P) - N (P): TBLASTX
      • Interpretation: similarity, hypothesis testing
      • Know your reagents. Know your methods. Do your controls.
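The method table on this slide can be read as a lookup from (query type, database type) to a BLAST program. A minimal sketch of that lookup follows; the function name `choose_blast` and the `translate_both` flag are illustrative assumptions, not part of any BLAST distribution.

```python
# Lookup table taken from the slide: (query type, database type) -> program.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "nucleotide"): "blastn",
}


def choose_blast(query_type, db_type, translate_both=False):
    """Return the BLAST program name for a query/database type pair.

    translate_both (hypothetical flag) selects TBLASTX, where a nucleotide
    query and a nucleotide database are both compared as translated proteins.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type!r}, {db_type!r}")


if __name__ == "__main__":
    print(choose_blast("nucleotide", "protein"))
    print(choose_blast("nucleotide", "nucleotide", translate_both=True))
```

Keeping the table as data rather than as a chain of conditions makes it easy to check against the slide line by line.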
  • 8. Bioinformatics Citizenship: what does it mean, and what does it cost? (Nature 409:452)
  • 9. Databases, as layers: Information system > Query system > Storage system > Data
  • 10. Databases, data layer. Examples: GenBank flat file, COSMIC record, interaction record, title of a book, book
  • 11. Databases, storage system layer. Examples: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
  • 12. Databases, query system layer. Examples: a list you look at, a catalogue, indexed files, SQL, grep
  • 13. Databases, information system layer. Examples: the Library of Congress, Google, Entrez, EnsEMBL, UCSC genome browser
  • 14. TO TAKE INTO ACCOUNT: information organization
      • Resources providers: organizations or centers devoted to offering and maintaining the databases
      • Databases: diverse and very different information
      • Tools: to find, check and export information into or from the DB
  • 15. MAIN RESOURCES PROVIDERS
      • The National Center for Biotechnology Information (NCBI) offers data banks, databases and tools in the USA
      • The European Bioinformatics Institute (EBI) plays a similar role in Europe
      • GenomeNet gathers several databases from Japan
  • 16. Types of databases
  • 17. TYPES OF DB
      • There are hundreds of DB, so it is not feasible to enumerate them (but they have tried here)
      • We can classify them by multiple criteria
      • The structural organization of the EMBL and the NCBI resources is radically different
  • 18. EMBL vs NCBI
      • EMBL: Bibliographic DB, Taxonomic DB, Nucleotide DB, Genomic DB, Protein DB, Microarrays DB …
      • NCBI: PubMed, Entrez, OMIM, Books, TaxBrowser, Structure …
  • 19. BIBLIOGRAPHY DB
      • Collection of papers published in scientific journals
          – PubMed (NCBI)
          – Medline (EBI)
          – Biocatalog: papers organized by specific molecular biology topics
  • 20. TAXONOMY DB
      • Information on the classification of living things
          – Basically hierarchical
          – Based on molecular evidence
      • Aims to classify any organism for which at least one nucleic acid sequence has been determined
      • There is indeed some controversy in the scientific community
  • 21. NUCLEOTIDE DB
      • Sequences from experimental laboratories
      • Updated daily
      • Contents exchanged daily among
          – GenBank (NCBI)
          – EMBL (EBI)
          – DDBJ (Japan)
  • 22. Sequences NOT in Nucleotide DB
      • WGS: whole genome shotgun
      • TPA: third party annotations
      • SNPs
      • SAGE tags (serial analysis of gene expression)
      • RefSeq (genomic, mRNA, or protein)
      • Consensus sequences
  • 23. GENOME DB
      • Sequences and annotations of whole genomes
          – Ensembl (EBI)
          – Genome viewer (NCBI)
          – Goldenpath (UCSC)
      • Specialized genomic resources
          – TRANSFAC
          – EST
          – UTRDB
          – SpliceSitesDB …
  • 24. PROTEIN DB (I)
      • Amino acid primary sequences
          – Without human revision: TrEMBL (EBI), NR (NCBI)
          – With curated annotation: UniProt (EBI)
          – Proteome DB: Proteome analysis (EBI)
  • 25. PROTEIN DB (II)
      • Secondary structures or protein domains
      • They depend on the protein source and the analysis performed on it
          – PROSITE: regular expressions over Swiss-Prot
          – PRINTS: set of motifs that define a family, over Swiss-Prot/TrEMBL
          – BLOCKS: aligned motifs from PROSITE/PRINTS
          – PFAM: Markov models over Swiss-Prot
          – INTERPRO: integrates information from several domain-focused databases
  • 26. PROTEIN DB (III)
      • 3D structures with the coordinates of each atom
          – PDB: reference protein 3D structure (X-ray, NMR) database
          – CATH: classification of the PDB into different functional and structural groups
          – MMDB: subset of the PDB maintained by the NCBI
          – MSD: subset of the PDB maintained by the EBI
  • 27. MICROARRAY DB
      • Expression array results
          – ArrayExpress
          – caArray
          – Gene Expression Omnibus
  • 28. OTHER DB (1)
      • Biological annotations
          – Gene Ontology
          – KEGG
          – GeneCards
      • Therapeutic targets
          – Therapeutic Target Database
          – PharmGKB …
  • 29. Historical perspective on the Human Genome Data
      • Human Expressed Sequence Tags (mRNA) sequencing
      • Human genome mapping and sequencing
      • Population analysis and polymorphism measurements
      • Genome Wide Association Studies
      • <the Homer paper>
      • The Cancer Genome Atlas pilot
      • The 1000 genome project
      • The Cancer Genome Atlas
      • The International Cancer Genome Consortium
  • 30. Main source of Cancer Data: ICGC (http://goo.gl/w4mrV)
      • ICGC Controlled Access Datasets
          – Detailed Phenotype and Outcome data
          – Region of residence
          – Risk factors
          – Examination
          – Surgery
          – Drugs
          – Radiation
          – Sample
          – Slide
          – Specific histological features
          – Analyte
          – Aliquot
          – Donor notes
          – Gene Expression (probe-level data)
          – Raw genotype calls
          – Gene-sample identifier links
          – Genome sequence files
      • ICGC OA Datasets
          – Cancer Pathology: histologic type or subtype, histologic nuclear grade
          – Patient/Person: gender, age range
          – Gene Expression (normalized)
          – DNA methylation
          – Genotype frequencies
          – Computed Copy Number and Loss of Heterozygosity
          – Newly discovered somatic variants
  • 31. http://dcc.icgc.org/
  • 32. Module 2a bioinformatics.ca
  • 33. Another important source of Cancer Data: http://www.sanger.ac.uk/genetics/CGP/cosmic/
  • 34. What is Cancer Data?
      • Structured clinical data about the patient
      • Structured clinical data about the treatment
      • Structured clinical data about the tumor
      • Associated with a number of positions (hundreds, if not thousands) in the nucleotide coordinate system of one reference genome
  • 35. ICGC is implementing NCBI’s bioprojects http://www.ncbi.nlm.nih.gov/bioproject
  • 36. LISTS OF DB
      • Nucleic Acids Research Database Listing
          – Annual Database issue: http://www.oxfordjournals.org/nar/database/c/
          – Supplement that comes with each year’s January issue
          – The 2013 issue describes 1512 databases, sorted into 14 categories and 41 subcategories
          – They are added to the Nucleic Acids Research online Molecular Biology Database Collection
          – Good starting point for selecting the appropriate DB
  • 37. LISTS OF DB
  • 38. Structure and formats of the DB
  • 39. STRUCTURE OF THE DB
      • The way data are organized in any DB depends mainly on the model or architecture on which it is based
      • There are multiple models (relational, hierarchical, network-based…), but the most common is the relational one
          – Several tables, which can have relationships between them
          – The relationships are established through key fields
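To make "several tables related through key fields" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The two-table schema (`organism`, `sequence`) and the integer keys are invented for illustration and do not reflect how any public sequence database is actually organized; the accessions and names are the ones from the FASTA slide of this deck.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tables linked by a key field."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organism (
            organism_id INTEGER PRIMARY KEY,   -- key field (arbitrary local id)
            name        TEXT NOT NULL
        );
        CREATE TABLE sequence (
            accession   TEXT PRIMARY KEY,
            organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
            description TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO organism VALUES (?, ?)",
        [(1, "Human echovirus 29"), (2, "Human echovirus 6")],
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [
            ("AF405321.1", 1, "strain JV-10 5' UTR, partial sequence"),
            ("AF405325.1", 2, "strain D' Amori 5' UTR, partial sequence"),
        ],
    )
    conn.commit()
    return conn


def sequences_for(conn, organism_name):
    """Follow the key field from organism to sequence with a JOIN."""
    rows = conn.execute(
        """
        SELECT s.accession
        FROM sequence AS s
        JOIN organism AS o ON s.organism_id = o.organism_id
        WHERE o.name = ?
        ORDER BY s.accession
        """,
        (organism_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(sequences_for(db, "Human echovirus 6"))
```

The organism name is stored once and referenced by key, which is the point of the relational model: the relationship lives in the `organism_id` field rather than in repeated text.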
  • 40. FORMATS OF THE DB
      • Working with relational DB implies the use of plain (flat) data formats
          – Text files
          – Some kind of labels to specify the contents of every line or region of the file
      • There are multiple formats, so a good program or application should be able to recognize them (and even convert between them)
  • 41. SEQUENCE FASTA FORMAT. The first line of each record starts with ">" and holds the identifier followed by additional info; the remaining lines hold the sequence.
      >gi|15341523|gb|AF405321.1| Human echovirus 29 strain JV-10 5' UTR, partial sequence
      CAAGCACTTCTGTTTCCCCGGACTGAGTATCAATAGACTGCTCACGCGGTTGAAGGAGAAAACGTTCGTT
      ATCCGGCCAACTACTTCGAGAAACCTAGTAACGCCATGGAAGTTGTGGAGTGTTTCGCTCAGCACTACCC
      CAGTGTAGATCAGGTTGATGAGTCACCGCATTCCCCACGGGTGACCGTGGCGGTGGCTGCGTTGGCGGCC
      TGCCCATGGGGAAACCCATGGGACGCTCTTATACAGACATGGTGCGAAGAGTCTATTGAGCTAGTTGGTA
      GTCCTCCGGCCCCTGAATGCGGCTAATCCCAACTGCGGAGCATACACTCTCAAGCCAGAGGGTAGTGTGT
      CGTAATGGGCAACTCTGCAGCGGAACCGACTACTTTGGGT
      >gi|15341527|gb|AF405325.1| Human echovirus 6 strain D' Amori 5' UTR, partial sequence
      CAAGCACTTCTGTTTCCCCGGACCGAGTATCAATAAGCTGCTCACGCGGCTGAAGGAGAAAGTGTTCGTT
      ACCCGGCTAGTTACTTCGAGAAACCTAGTACCACCATGAAGGTTGCGCAGCGTTTCGCTCCGCACAACCC
      CAGTGTAGATCAGGTCGATGAGTCACCGCGTTCCCCACGGGCGACCGTGGCGGTGGCTGCGTTGGCGGCC
      TGCCCATGGGGCAACCCATGGGACGCTTCAATACTGACATGGTGCGAAGAGTCTATTGAGCTAACTAGTA
      GTCCTCCGGCCCCTGAATGCGGATAATCTTAACTGCGGAGCAGGTGCTCACAATCCAGTGGGTGGCCTGT
      CGTAACGGGCAACTCTGCAGCGGAACCGACTACTTTGGGT
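Because FASTA is a labelled plain-text format, a few lines of code are enough to read it. The sketch below is a minimal reader written for this note (the function name `parse_fasta` is hypothetical); it splits each header into identifier and additional info at the first space and joins the sequence lines. For real work an established library would normally be preferable.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence).

    The header line starts with '>'; the identifier is everything up to the
    first space and the rest of the line is treated as additional info.
    Lines before the first header are ignored.
    """
    records = []
    identifier, description, chunks = None, "", []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                records.append((identifier, description, "".join(chunks)))
            identifier, _, description = line[1:].partition(" ")
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, description, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Header taken from the slide; the sequence is truncated to its first line.
    example = (
        ">gi|15341523|gb|AF405321.1| Human echovirus 29 strain JV-10 5' UTR, "
        "partial sequence\n"
        "CAAGCACTTCTGTTTCCCCGGACTGAGTATCAATAGACTGCTCACGCGGTTGAAGGAGAAAACGTTCGTT\n"
    )
    for ident, desc, seq in parse_fasta(example):
        print(ident, "|", desc, "|", len(seq), "bases")
```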
  • 42. GENBANK ENTRY EXAMPLE
  • 43. EMBL ENTRY EXAMPLE
  • 44. Submitting data
  • 45. SUBMITTING DATA
      • Several biological databases are public, so any (properly identified) user can contribute by uploading new data
      • There are multiple types of data to upload, but the most common are
          – Sequences
          – Expression data (from microarrays)
  • 46. SUBMITTING SEQUENCES. How to submit your sequences to…
      • EMBL
          – http://www.ebi.ac.uk/embl/Submission/
      • GenBank
          – http://www.nlm.nih.gov/pubs/factsheets/sdgenbk.html
  • 47. SUBMITTING EXPRESSION DATA. And your expression data to…
      • ArrayExpress (EBI)
          – http://www.ebi.ac.uk/microarray/submissions.html
      • Gene Expression Omnibus (NCBI)
          – https://www.ncbi.nlm.nih.gov/geo/info/faq.html
  • 48. Tools for DB exploitation
  • 49. ENTREZ
      • It is the NCBI’s search system
      • Great power and versatility, but less intuitive than SRS
      • It doesn’t provide forms for each field
      • Usually used in a top-down manner
          – Perform a first query
          – Refine the results until you reach what you are looking for
  • 50. CROSS-SEARCH TABLES
  • 51. ENTREZ QUERIES
      • Boolean operators: AND, OR, NOT; also quotes ("") for exact phrases and * as a wildcard
      • AND is applied by default
      • Query by Accession Numbers (AC) in
          – GenBank / EMBL / DDBJ:
              • 1 letter + 5 digits (U12345)
              • 2 letters + 6 digits (AF123456)
          – SwissProt / PIR:
              • 1 letter + 5 digits (P12345)
      • Refine queries with the reserved word LIMITS
      • Combine queries with HISTORY
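The accession number shapes listed on this slide translate directly into regular expressions. The sketch below encodes only the simplified patterns given here; real accession formats are broader, and the optional ".N" version suffix is an added assumption based on identifiers such as AF405321.1 in the FASTA slide. The function name `classify_accession` is hypothetical.

```python
import re

# Patterns as stated on the slide (simplified; real formats are broader).
# Assumption: nucleotide accessions may carry a ".N" version suffix.
NUCLEOTIDE_AC = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?")
PROTEIN_AC = re.compile(r"[A-Z]\d{5}")


def classify_accession(accession):
    """Return the database families whose slide pattern the accession fits."""
    ac = accession.strip().upper()
    matches = []
    if NUCLEOTIDE_AC.fullmatch(ac):
        matches.append("GenBank/EMBL/DDBJ")
    if PROTEIN_AC.fullmatch(ac):
        matches.append("SwissProt/PIR")
    return matches


if __name__ == "__main__":
    for ac in ("U12345", "AF123456", "P12345", "MLH1"):
        print(ac, classify_accession(ac))
```

Note that a "1 letter + 5 digits" accession fits both families, so the shape alone cannot tell a nucleotide accession from a protein one; the database being queried has to disambiguate.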
  • 52. ENTREZ AVAILABLE FIELDS
  • 53. HELP AND INFORMATION SYSTEM
  • 54. Entrez search examples (Vall d'Hebron Institut de Recerca, 21/06/2011). We are interested in the human MLH1 gene, which is involved in colon cancer
      – Separate the wheat from the chaff: identify a representative, well-annotated mRNA sequence of the MLH1 gene.
      – Obtain the associated literature and its protein sequence.
      – Identify similar proteins.
      – Identify conserved domains within the protein.
      – Identify known mutations in the gene or the protein.
      – Find the three-dimensional structure of the protein, if it is known; if not, identify structures of homologous sequences.
      – View the genomic context of the gene and download the region that contains it.
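The first step of the MLH1 example (a direct query, then refinement) can also be expressed as a query string. The sketch below only builds the search term and an NCBI E-utilities `esearch` URL; it makes no network request. The field tags `[Gene Name]`, `[Organism]` and the `biomol_mrna[PROP]` property are written from memory of the Entrez query syntax and should be checked against the Entrez help before use; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(*clauses):
    """Join query clauses with an explicit AND."""
    return " AND ".join(clauses)


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for the given database and search term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Human MLH1 mRNA records (field tags are assumptions, see the note above).
term = build_term(
    "MLH1[Gene Name]",
    '"Homo sapiens"[Organism]',
    "biomol_mrna[PROP]",
)
url = build_esearch_url("nuccore", term)

if __name__ == "__main__":
    print(term)
    print(url)
```

Building the term from separate clauses mirrors the top-down way Entrez is used in the slides: start with a broad clause and append restrictions one at a time.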
  • 55. Direct query (1.1)
  • 56. Direct query (1.2): Limits
  • 57. Direct query (1.3): Filters
  • 58. Direct query (1.4): Record
  • 59. Query (2): Links to other DB
  • 60. Query (3): Sequences
  • 61. Query (4): Protein
  • 62. Query (5.1): Mutations
  • 63. Query (5.2): SNPs
  • 64. Query (5.3): OMIM
  • 65. Query (6.1): Structures
  • 66. Query (6.2): Sequence and structure alignment. Mouse over the residues of NP_000240 until the grey footer bar shows ‘gi 4557757, loc 67’ (Glycine). Click on the corresponding Glycine residue in 1H7U_A (loc 74) to highlight it. In the structure window, use the left mouse button to spin the 3D structure until you can clearly see and identify the highlighted residue. Is it possibly in the active site? For example, is it within 5 Å of the ATPS molecule? Double-click on the Mg-complexed ATPS to highlight it. Then use the menu bar option ‘Show/Hide|Select By Distance|Residues Only’ to highlight all residues within 5 Å of the ATPS. Indeed, the Glycine at position #74 is within 5 Å and is likely part of the active site for this energy-producing domain. This hints at the possible problems a Gly → Trp mutation might cause at that position.
  • 67. Query (7): Visualization in genomic context