
Understanding Genome



Understanding genome through biological databases and their usages.

Published in: Education, Technology

  1. 1. Understanding Genome: Biological Database Overview, Part-1. DAY-2, SESSION-1 (25-10-2010). Rajendra K. Labala, Biomedical Informatics Centre, NICED, ICMR, Kolkata
  2. 2. Major Challenges with Genomes: (1) the scientific challenge of decoding a genome from its nucleotides to a set of functional elements; (2) the development of software capable of storing, manipulating, and evaluating genomes; (3) the challenge of providing comprehensive and informative access to a large amount of data in a user-friendly way.
  3. 3. The Genome Problem: the problem with the genome (particularly the human genome) is that it is “large, complicated, and opaque to analysis”. Genome features to identify include: genes (protein coding, RNA, pseudogenes); regulatory elements; SNPs, repeats, etc.
  4. 4. Solutions: Ensembl, NCBI, PATRIC. You will learn: a detailed overview; sequence-related information and data mining.
  5. 5. The Ensembl Project: Ensembl is a joint project between EMBL-EBI (the European Bioinformatics Institute, which is part of the European Molecular Biology Laboratory) and the WTSI (Wellcome Trust Sanger Institute) to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes.
  6. 6. What is Ensembl: Ensembl is one of 3 main systems currently available that annotate and display genomic information: Ensembl; UCSC Genome Browser; NCBI Genome Browser. It offers public annotation of mammalian and other genomes, is open source software, and is built on a relational database system.
  7. 7. Genomes and Annotation: Ensembl does not assemble any genome directly; it works with the sequencing centers that generate the genome assembly. Ensembl provides high-quality annotation for genomes that do not have existing annotation, and also works with genomes that already have high-quality annotation.
  8. 8. Ensembl Genome Annotation: utilizes raw DNA sequence data from public sources; creates a tracking database (the “Ensembl database”); joins the sequences, based on a sequence scaffold or “Golden Path”; automatically finds genes and other features of the sequence; associates sequence and features with data from other sources; provides a publicly accessible web-based interface to the database.
  9. 9. Ensembl genomes 57
  10. 10. Species tree
  11. 11. Ensembl Software System: makes extensive use of BioPerl and the free MySQL database. The entire Ensembl code base is freely available under an Apache open source license. It is mainly written in Perl, with extensions in C. Some viewers have been written in Java (e.g. Apollo). The software can be accessed by FTP, and it is possible to set up a mirror of the entire Ensembl system.
  12. 12. Ensembl Databases: 4 main databases: Ensembl Core Database; Ensembl EST Database; Ensembl Compara Database; Ensembl Variation Database. Ensembl uses MySQL to store information in relational databases. Ensembl also provides APIs (Application Programming Interfaces), which serve as a connection between the databases and specific application programs. Ensembl has a Perl API and a Java API; the Perl API is more “complete” than the Java API.
  13. 13. Ensembl Databases: Ensembl Core Databases: species-specific Ensembl core databases store genome sequence and annotation information; they hold the gene, transcript, and protein models annotated by the Ensembl automated genome analysis; the databases also store information about cDNA and protein alignments, as well as external references (e.g. NCBI accession numbers such as AB012211).
  14. 14. Ensembl Databases: Ensembl Compara Database: a multi-species database that stores the results of genome-wide species comparisons; the comparative genomics dataset allows for pairwise whole genome alignments; the comparative proteomics dataset allows for orthologue predictions and protein family clusters. Ensembl EST: species-specific Ensembl EST databases hold an independent EST gene set, provided for all well-characterised species with a suitable amount of biological evidence. The layout of the Ensembl EST databases is identical to the Ensembl Core Database schema, so schema descriptions and API access are equally applicable. Variation: the large amount of genetic variation information is organised in a set of species-specific Ensembl Variation databases.
  15. 15. Data Mining with Ensembl: BioMart: a generic data management system built specifically for use in Ensembl; it gives users the ability to conduct fast and powerful searches; it simplifies the task of integrating external data sets (provided by the user) with the Ensembl databases. Help & Documentation Link
  16. 16. Data mining through BioMart: choose a dataset; choose the data to be retrieved (attributes); narrow your dataset (filters).
  17. 17. BioMart Dataset: select your dataset through the dropdown list
  18. 18. Filters: filter your query by the given options
  19. 19. Attributes: choose which data fields to retrieve for your search through these attributes
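
The three BioMart choices (dataset, filters, attributes) can also be written down as an XML query, which is the form BioMart web services accept. The following is a minimal sketch in Python, not part of the original slides: the helper name `build_biomart_query` is hypothetical, and the dataset, filter, and attribute names (`hsapiens_gene_ensembl`, `chromosome_name`, `ensembl_gene_id`, `start_position`, `end_position`) are assumed to match what the BioMart interface lists for human genes and should be checked there.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string (hypothetical helper).

    dataset    -- dataset name, as chosen from the dropdown list
    filters    -- dict of filter name -> value (narrows the dataset)
    attributes -- list of attribute names (the columns to retrieve)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "0",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed names: human genes on chromosome 2, with gene ID and coordinates.
xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "2"},
    attributes=["ensembl_gene_id", "start_position", "end_position"],
)
print(xml_query)
```

The sketch only builds the query text and does not contact any server; the resulting string would typically be submitted to a BioMart service as its query parameter.
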
  20. 20. Try Yourself: (1) Retrieve all SNPs for 'novel' human G-protein coupled receptor genes (GPCRs – IPR000276) on chromosome 2. (2) Retrieve the sequences of the exons of the human MEFV gene in FASTA format. (3) Retrieve the gene structure (i.e. start and end coordinates of exons) of the mouse gene ENSMUSG00000042351. (4) Retrieve all human disease genes containing transmembrane domains located between p11.2 and q22. (5) The file contains a list of probeset IDs from a microarray experiment using the Affymetrix array HG-U133 Plus 2.0 (human). Retrieve the 500 bp upstream of the transcripts matching these probeset IDs. (6) Retrieve the sequences 5 kb upstream of all human 'known' genes between D1S2806 and D1S464. (7) Retrieve all human SNPs that have an ID from The SNP Consortium (TSC), from chromosome 6 between 15 Mb and 15.2 Mb, with 200 bases of flanking sequence. (8) Retrieve the mouse homologues of the Homo sapiens genes CASP1, CASP2, CASP3, and CASP4.
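
Several of these exercises ask for sequence "upstream" of a transcript or gene, and which side counts as upstream depends on the strand. The sketch below shows the coordinate arithmetic only. It is an illustration rather than how BioMart computes flanks, it assumes 1-based inclusive coordinates with strand given as +1 or -1, and the function name `upstream_region` is hypothetical.

```python
def upstream_region(start, end, strand, length=500):
    """Return (start, end) of the region `length` bp upstream of a feature.

    Assumes 1-based, inclusive coordinates; strand is +1 or -1.
    """
    if strand == 1:
        # Forward strand: upstream lies before the feature start.
        return max(1, start - length), start - 1
    if strand == -1:
        # Reverse strand: upstream lies after the feature end.
        # (Not clamped to the chromosome length, which is unknown here.)
        return end + 1, end + length
    raise ValueError("strand must be +1 or -1")


# Made-up coordinates, for illustration only.
print(upstream_region(10_000, 12_000, 1))    # (9500, 9999)
print(upstream_region(10_000, 12_000, -1))   # (12001, 12500)
```
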
  21. 21. NCBI Genome projects  After DNA sequencing, several contigs were generated and are submitted to NCBI through WGS Submissions Whole Genome Shotgun Sequences WGS List Download (GenBank format  WGS  FASTA)
  22. 22. NCBI Genome Project: go for WGS Sequences
  23. 23. WGS: home page of WGS, where you can find the WGS project lists
  24. 24. GenBank format file for the WGS: click on the link for a detailed view of the data
  25. 25. WGS project page: check out the FASTA format
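
The GenBank and FASTA views of a contig carry the same sequence, and the relationship between them can be sketched in a few lines of Python. This is a minimal illustration, not an NCBI tool: it assumes records that contain a `LOCUS` line, an `ORIGIN` sequence block, and a `//` terminator, the function name `genbank_to_fasta` is hypothetical, and the example record is made up. For real work a maintained parser (for example Biopython's `SeqIO`) would be the safer choice.

```python
def genbank_to_fasta(genbank_text, width=60):
    """Convert GenBank flat-file text to FASTA text (minimal sketch)."""
    fasta_lines = []
    name, in_origin, seq_parts = None, False, []
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]          # second field is the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True                # sequence lines follow
        elif line.startswith("//"):
            # End of record: emit header plus wrapped sequence, then reset.
            seq = "".join(seq_parts).upper()
            fasta_lines.append(f">{name}")
            fasta_lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
            name, in_origin, seq_parts = None, False, []
        elif in_origin:
            # Lines look like "        1 atgcatgcat gcatgcatgc": drop the position.
            seq_parts.extend(line.split()[1:])
    return "\n".join(fasta_lines)


# Made-up record, for illustration only.
example = """LOCUS       CONTIG_1                  24 bp    DNA     linear   BCT 25-OCT-2010
DEFINITION  Made-up contig for illustration.
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""
fasta_text = genbank_to_fasta(example)
print(fasta_text)
```
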
  26. 26. NCBI FTP: for downloading the sequences/genomes in the required formats: FAA (amino acid file in FASTA format); FNA (nucleic acid file in FASTA format); FFN (coding sequences in FASTA format); GBK (GenBank format); PTT (CDS file in tab-delimited format).
  27. 27. NCBI FTP
  28. 28. Genome files in different formats: FAA (amino acid file in FASTA format); FNA (nucleic acid file in FASTA format); FFN (coding sequences in FASTA format); GBK (GenBank format); PTT (CDS file in tab-delimited format).
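
FAA, FNA, and FFN files all share the FASTA layout (a `>` header line followed by sequence lines), so one small reader covers all three. The sketch below is an illustration under that assumption: the function name `read_fasta` is hypothetical and the two records are made up rather than taken from any NCBI file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                        # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []   # start a new record
        else:
            chunks.append(line)             # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up FAA-style content; a downloaded file would be opened with open(path).
demo = io.StringIO(
    ">protein_1 hypothetical protein\nMKV\nLLA\n"
    ">protein_2 made-up example\nMST\n"
)
records = list(read_fasta(demo))
for header, seq in records:
    print(header.split()[0], len(seq))
```
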
  29. 29. PATRIC WGS annotations download For details visit the website and the FAQ page e
  30. 30. PATRIC home/search page
  31. 31. CDS links: check out the CDS links for the searched organism
  32. 32. Downloading: check out the different downloading options
  33. 33. Exercise: explore all the databases thoroughly, following the problem given in the “part-1.doc” file in the “day-2” folder (on the desktop).