Understanding Genome - Biological Database Overview, Part 1
  Day 2, Session 1 (25-10-2010)
  Rajendra K. Labala
  Biomedical Informatics Centre, NICED, ICMR, Kolkata
Major Challenges with Genomes
  Scientific challenge of decoding a genome from its nucleotides to a set of functional elements
  Development of software that is capable of storing, manipulating, and evaluating genomes
  Challenge of providing comprehensive and informative access to a large amount of data in a user-friendly way
The Genome Problem
  The problem with the genome (particularly the human genome) is that it is “large, complicated, and opaque to analysis”
  Genome features to identify include:
    Genes: protein-coding, RNA, pseudogenes
    Regulatory elements
    SNPs, repeats, etc.
Solutions
  Ensembl
  NCBI
  PATRIC
  You will learn
    Detailed overview
    Sequence-related information and data mining
The Ensembl Project
  Ensembl is a joint project between the European Bioinformatics Institute (EBI), which is part of the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute (WTSI)
  Its aim is to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes
What is Ensembl
  Ensembl is one of 3 main systems currently available that annotate and display genomic information
    Ensembl http://www.ensembl.org
    UCSC Genome Browser http://genome.ucsc.edu
    NCBI Genome Browser http://www.ncbi.nlm.nih.gov
  Public annotation of mammalian and other genomes
  Open source software
  Relational database system
Genomes and Annotation
  Ensembl does not assemble any genome directly
    It works with the sequencing centers that generate the genome assembly
  Ensembl provides high-quality annotation for genomes that do not have existing annotation
    For genomes that already have high-quality annotation, it works alongside that annotation
Ensembl Genome Annotation
  Utilizes raw DNA sequence data from public sources
  Creates a tracking database (the “Ensembl database”)
  Joins the sequences, based on a sequence scaffold or “Golden Path”
  Automatically finds genes and other features of the sequence
  Associates sequence and features with data from other sources
  Provides a publicly accessible web-based interface to the database
Ensembl Software System
  Makes extensive use of BioPerl (www.bioperl.org)
  Uses the free MySQL database
  The entire Ensembl code base is freely available under an Apache open source license
  Mainly written in Perl, with extensions in C
  Some viewers have been written in Java (e.g. Apollo)
  The software can be accessed by FTP
  It is possible to set up a mirror of the entire Ensembl system
Ensembl Databases
  4 main databases
    Ensembl Core Database
    Ensembl EST Database
    Ensembl Compara Database
    Ensembl Variation Database
  Ensembl uses MySQL to store information in relational databases
  Ensembl also provides APIs (Application Programming Interfaces)
    These serve as a connection between the databases and specific application programs
    Ensembl has a Perl API and a Java API
    The Perl API is more “complete” than the Java API
Ensembl Databases
  Ensembl Core Databases
    Species-specific Ensembl core databases store genome sequence and annotation information
    They hold the gene, transcript, and protein models annotated by the Ensembl automated genome analysis
    The databases also store information about cDNA and protein alignments, as well as external references
      e.g. NCBI accession numbers such as AB012211
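To make the idea of a relational core database concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The table names, column names, and the rows `GENE_A`, `GENE_B`, and their transcripts are simplified placeholders invented for illustration; they are not the real Ensembl schema or real annotation. Only the accession AB012211 comes from the slide, and its link to `GENE_A` here is purely illustrative.

```python
import sqlite3

# Toy stand-in for a species-specific core database. Table and column
# names are simplified assumptions, NOT the real Ensembl schema.
SCHEMA = """
CREATE TABLE gene (
    gene_id    INTEGER PRIMARY KEY,
    stable_id  TEXT NOT NULL,
    chromosome TEXT NOT NULL,
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL,
    strand     INTEGER NOT NULL
);
CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
    stable_id     TEXT NOT NULL
);
CREATE TABLE external_ref (
    gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
    source    TEXT NOT NULL,
    accession TEXT NOT NULL
);
"""


def build_toy_core_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder genes and transcripts, not real annotation.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
        [(1, "GENE_A", "2", 1000, 5000, 1),
         (2, "GENE_B", "2", 9000, 12000, -1)],
    )
    conn.executemany(
        "INSERT INTO transcript VALUES (?, ?, ?)",
        [(1, 1, "TRANSCRIPT_A1"),
         (2, 1, "TRANSCRIPT_A2"),
         (3, 2, "TRANSCRIPT_B1")],
    )
    # External reference: the accession is the slide's example; attaching
    # it to GENE_A is an illustrative assumption.
    conn.execute("INSERT INTO external_ref VALUES (1, 'NCBI', 'AB012211')")
    return conn


def transcripts_per_gene(conn):
    """Return (gene stable_id, number of transcripts), sorted by stable_id."""
    rows = conn.execute(
        """
        SELECT g.stable_id, COUNT(t.transcript_id)
        FROM gene AS g
        LEFT JOIN transcript AS t ON t.gene_id = g.gene_id
        GROUP BY g.gene_id
        ORDER BY g.stable_id
        """
    )
    return rows.fetchall()


def genes_for_accession(conn, accession):
    """Return stable_ids of genes linked to an external accession."""
    rows = conn.execute(
        """
        SELECT g.stable_id
        FROM gene AS g
        JOIN external_ref AS x ON x.gene_id = g.gene_id
        WHERE x.accession = ?
        ORDER BY g.stable_id
        """,
        (accession,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_toy_core_db()
    print(transcripts_per_gene(db))
    print(genes_for_accession(db, "AB012211"))
```

The point of the sketch is the join: gene models, transcripts, and external references live in separate tables and are connected through shared keys, which is what lets an API or a tool such as BioMart answer combined questions.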
Ensembl Databases
  Ensembl Compara Database
    A multi-species database that stores the results of genome-wide species comparisons
    The comparative genomics dataset allows for pairwise whole-genome alignments
    The comparative proteomics dataset allows for orthologue predictions and protein family clusters
  Ensembl EST
    Species-specific Ensembl EST databases hold an independent EST gene set, provided for all well-characterised species with a suitable amount of biological evidence
    The layout of the Ensembl EST databases is identical to the Ensembl Core Database schema, so schema descriptions and API access are equally applicable
  Variation
    The large amount of genetic variation information is organised in a set of species-specific Ensembl Variation databases
Data Mining with Ensembl
  BioMart
    Generic data management system built for use in Ensembl
    It gives users the ability to conduct fast and powerful searches
    It simplifies the task of integrating external data sets (provided by the user) with the Ensembl databases
  Help & Documentation
    http://asia.ensembl.org/info/index.html
Data mining through BioMart
  Choose dataset
  Choose data to be retrieved (attributes)
  Narrow your dataset (filters)
BioMart Dataset
  Select your dataset through the drop-down list
Attributes
  Narrow your search through these attributes
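The three BioMart steps (dataset, attributes, filters) can also be written down as a query document instead of clicked through in the browser. The sketch below only builds such an XML query with Python's standard library and sends nothing over the network. The element layout follows the BioMart XML query format as commonly documented; the dataset name `hsapiens_gene_ensembl`, the filter `chromosome_name`, and the attributes `ensembl_gene_id` and `external_gene_name` are assumptions that should be checked against the names shown in the BioMart interface for your release.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters):
    """Build a BioMart-style XML query string.

    dataset    -- dataset name (step 1: choose dataset)
    attributes -- list of attribute names (step 2: data to retrieve)
    filters    -- dict of filter name -> value (step 3: narrow the dataset)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated output
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # All names below are assumed examples, not verified against a release.
    xml_query = build_biomart_query(
        dataset="hsapiens_gene_ensembl",
        attributes=["ensembl_gene_id", "external_gene_name"],
        filters={"chromosome_name": "2"},
    )
    print(xml_query)
```

Keeping the query as a text document makes a search repeatable: the same dataset, attributes, and filters can be rerun later or shared with a colleague.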
Try Yourself
  Retrieve all SNPs for “novel” human G-protein coupled receptor genes (GPCRs, IPR000276) on chromosome 2.
  Retrieve the sequences of the exons of the human MEFV gene in FASTA format.
  Retrieve the gene structure (i.e. start and end coordinates of exons) of the mouse gene ENSMUSG00000042351.
  Retrieve all human disease genes containing transmembrane domains located between p11.2 and q22.
  The file contains a list of probeset IDs from a microarray experiment using the Affymetrix array HG-U133 Plus 2.0 (human). Retrieve the 500 bp upstream of the transcripts matching these probeset IDs.
  Retrieve the sequences 5 kb upstream of all human “known” genes between D1S2806 and D1S464.
  Retrieve all human SNPs that have an ID from The SNP Consortium (TSC), from chromosome 6 between 15 Mb and 15.2 Mb, with 200 bases of flanking sequence.
  Retrieve the mouse homologues of the Homo sapiens genes CASP1, CASP2, CASP3, and CASP4.
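Several of these exercises ask for sequence "upstream" of a transcript or gene. BioMart works this out for you, but it helps to see what upstream means on each strand. The sketch below is a hypothetical helper, not part of any Ensembl or BioMart API, and it assumes 1-based inclusive coordinates with strand given as +1 or -1.

```python
def upstream_region(start, end, strand, length, chrom_length=None):
    """Return (region_start, region_end) for `length` bases upstream.

    Coordinates are 1-based and inclusive. On the forward strand (+1)
    upstream lies before `start`; on the reverse strand (-1) it lies
    after `end`. The region is clipped to the chromosome, and None is
    returned when no upstream base exists.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    if length <= 0:
        raise ValueError("length must be positive")

    if strand == 1:
        region_end = start - 1
        region_start = max(1, region_end - length + 1)
    else:
        region_start = end + 1
        region_end = end + length
        if chrom_length is not None:
            region_end = min(region_end, chrom_length)

    if region_start > region_end:
        return None  # feature sits at the very edge of the chromosome
    return region_start, region_end


if __name__ == "__main__":
    # 500 bp upstream of a forward-strand feature at 1000..2000
    print(upstream_region(1000, 2000, 1, 500))
    # The same feature on the reverse strand
    print(upstream_region(1000, 2000, -1, 500))
```

For the reverse strand the bases returned still need to be reverse-complemented before they read 5' to 3' relative to the transcript; that step is left out to keep the coordinate logic clear.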
NCBI Genome Projects
  After DNA sequencing, the contigs that were generated are submitted to NCBI through WGS Submissions
  Whole Genome Shotgun Sequences
  WGS List
  Download (GenBank format, WGS FASTA)
NCBI FTP
  For downloading the sequences/genomes in the different formats required
    FAA (amino acid file in FASTA format)
    FNA (nucleic acid file in FASTA format)
    FFN (coding sequences in FASTA format)
    GBK (GenBank format)
    PTT (CDS file in tab-delimited format)
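The FAA, FNA, and FFN files all share the FASTA layout: a header line starting with ">" followed by one or more sequence lines. As a minimal sketch of reading such a file once it has been downloaded, the function below parses FASTA text into a dictionary. The record names and sequences in the example are invented, and for real work an established parser such as the one in Biopython is the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    Sequence lines belonging to one record are joined together.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Invented example records, laid out like an FAA (protein) file.
    example = ">seq1 hypothetical protein\nMKT\nAYI\n>seq2\nMSS\n"
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```

The same function works for nucleotide (FNA, FFN) and protein (FAA) files, because the FASTA layout does not depend on the alphabet.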