Understanding Genome

-Biological Database Overview
Part-1

DAY-2, SESSION-1
(25-10-2010)

Rajendra K. Labala
Biomedical Informatics Centre, NICED, ICMR, Kolkata

Major Challenges with Genomes

 Scientific challenge of decoding a genome from its
nucleotides to a set of functional elements
 Development of software which is capable of
storing, manipulating, and evaluating genomes
 Challenge of providing comprehensive and
informative access to a large amount of data in a
user-friendly way

The Genome Problem

 The problem with the genome (particularly human)
is that it is “large, complicated, and opaque to
analysis”
 Genome features to identify include:
 Genes: protein coding, RNA, pseudogenes
 Regulatory elements
 SNPs, repeats, etc.

Solutions

 Ensembl
 NCBI
 PATRIC

 You will learn
 Detailed overview
 Sequence related information/data mining!

The Ensembl Project

 Ensembl is a joint project between EMBL-EBI and the
Wellcome Trust Sanger Institute to develop a software
system which produces and maintains automatic
annotation on selected eukaryotic genomes
 EMBL: European Molecular Biology Laboratory
 EBI: European Bioinformatics Institute (an EMBL outstation)
 WTSI: Wellcome Trust Sanger Institute

What is Ensembl

 Ensembl is one of 3 main systems that are currently
available that annotate and display genomic
information
 Ensembl
 http://www.ensembl.org
 UCSC Genome Browser
 http://genome.ucsc.edu
 NCBI Genome Browser
 http://www.ncbi.nlm.nih.gov
 Public annotation of mammalian and other genomes
 Open source software
 Relational database system

Genomes and Annotation

 Ensembl does not assemble any genome
directly
 Works in collaboration with the sequencing centers that
generate the genome assembly

 Ensembl provides high-quality annotation for
genomes that do not have existing annotation
 Works with the existing annotation for genomes that
already have high-quality annotation

Ensembl Genome Annotation

 Utilizes raw DNA sequence data from public sources
 Creates a tracking database (the “Ensembl database”)
 Joins the sequences, based on a sequence scaffold or “Golden Path”
 Automatically finds genes and other features of the sequence
 Associates sequence and features with data from other sources
 Provides a publicly accessible web-based interface to the database

Ensembl Software System

 Makes extensive use of BioPerl (www.bioperl.org)
 Uses the free MySQL database
 Entire Ensembl code base is freely available under
Apache open source license.
 Mainly written in Perl, extensions in C. Some
viewers have been written in Java (e.g. Apollo).
 Software can be accessed by FTP
 Possible to set up a mirror of the entire Ensembl
system.

Ensembl Databases

 4 Main Databases
 Ensembl Core Database
 Ensembl EST Database
 Ensembl Compara Database
 Ensembl Variation Database
 Ensembl uses MySQL to store information in relational
databases
 Ensembl also utilizes APIs (Application Programming
Interfaces)
 Serve as a connection between the databases and specific application
programs
 Ensembl has Perl API and Java API
 Perl API more “complete” than Java API
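
The idea of an API sitting between a relational database and an application can be sketched in a few lines. The example below is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL, and the two tables, their columns, and all identifiers are simplified assumptions, not the real Ensembl schema or API.

```python
import sqlite3

# Toy, in-memory stand-in for a relational genome database.
# Table and column names are simplified assumptions, not the real Ensembl schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id   INTEGER PRIMARY KEY,
        stable_id TEXT NOT NULL,
        biotype   TEXT NOT NULL
    );
    CREATE TABLE transcript (
        transcript_id INTEGER PRIMARY KEY,
        gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
        stable_id     TEXT NOT NULL
    );
""")

# Made-up identifiers, for illustration only.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [(1, "GENE_A", "protein_coding"), (2, "GENE_B", "pseudogene")],
)
conn.executemany(
    "INSERT INTO transcript VALUES (?, ?, ?)",
    [(10, 1, "TRANSCRIPT_A1"), (11, 1, "TRANSCRIPT_A2"), (12, 2, "TRANSCRIPT_B1")],
)


def transcripts_for_gene(connection, gene_stable_id):
    """API-style helper: hide the SQL join behind a function call."""
    rows = connection.execute(
        """
        SELECT t.stable_id
        FROM transcript AS t
        JOIN gene AS g ON g.gene_id = t.gene_id
        WHERE g.stable_id = ?
        ORDER BY t.stable_id
        """,
        (gene_stable_id,),
    )
    return [row[0] for row in rows]


print(transcripts_for_gene(conn, "GENE_A"))
```

The application only calls transcripts_for_gene; it never needs to know how the tables are laid out, which is the role the Perl and Java APIs play for the Ensembl databases.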

Ensembl Databases

 Ensembl Core Databases
 Species specific Ensembl core databases that store
genome sequence and annotation information
 Gene, transcript, and protein models that are annotated by the
Ensembl automated genome analysis
 Databases also store information about cDNA and
protein alignments, as well as external references
 Ex.: NCBI accession numbers such as AB012211
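
The gene, transcript, and protein models in a core database are referred to by stable identifiers, such as the mouse gene ENSMUSG00000042351 used in the exercises. A minimal sketch of how such an identifier can be taken apart follows. The pattern used here (the letters ENS, an optional species prefix, one feature letter, and 11 digits) is an assumption that covers the common cases only, and parse_stable_id is a hypothetical helper, not part of any Ensembl API.

```python
import re

# Assumed pattern: "ENS", optional species letters, one feature letter, 11 digits.
STABLE_ID = re.compile(r"ENS([A-Z]*?)([GTPE])(\d{11})")

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def parse_stable_id(stable_id):
    """Split an Ensembl-style stable ID into (species_prefix, feature_type, number)."""
    match = STABLE_ID.fullmatch(stable_id)
    if match is None:
        raise ValueError(f"not a recognised stable ID: {stable_id!r}")
    prefix, letter, digits = match.groups()
    return prefix, FEATURE_TYPES[letter], int(digits)


print(parse_stable_id("ENSMUSG00000042351"))
```

Human identifiers carry no species prefix (they start with ENSG, ENST, and so on), so the prefix comes back as an empty string for them.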

Ensembl Databases

 Ensembl Compara Database
 Is a multi-species database that stores the results of genome wide species
comparisons
 The comparative genomic dataset allows for pairwise whole genome
alignments
 The comparative proteomics dataset allows for orthologue predictions
and protein family clusters
 Ensembl EST
 Species-specific Ensembl EST databases hold an independent EST gene set
provided for all well-characterised species with a suitable amount of
biological evidence. The layout of Ensembl EST Databases is identical to the
Ensembl Core Database schema so that schema descriptions and API access
are equally applicable
 Variation
 The large amount of genetic variation information is organised in a set of
species-specific Ensembl Variation databases.

Data Mining with Ensembl

 BioMart
 Generic data management system, originally developed for use
with Ensembl
 Gives users the ability to conduct fast and powerful
searches
 It simplifies the task of integrating external data sets (provided
by the user) with the Ensembl databases

 Help & Documentation Link
 http://asia.ensembl.org/info/index.html

Data mining through BioMart

 Choose dataset
 Choose data to be retrieved (attributes)
 Narrow your dataset (filters)

BioMart

Dataset: select your dataset through the dropdown list
Filters: narrow your query using the given options
Attributes: choose the data to be retrieved (the output columns)
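
The same three choices (dataset, filters, attributes) can also be written down as an XML query, which is how BioMart queries are usually expressed outside the web interface. The sketch below only builds the query text and does not contact any server. The helper build_biomart_query is hypothetical, and the dataset, filter, and attribute names in the usage example are assumptions that should be checked against the names shown in the BioMart interface.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query string.

    dataset    -- dataset name (which species/database to query)
    filters    -- dict of filter name -> value (narrows the rows returned)
    attributes -- list of attribute names (the columns returned)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n' + body


# Assumed names: human genes on chromosome 2, with their coordinates.
xml_text = build_biomart_query(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "2"},
    ["ensembl_gene_id", "chromosome_name", "start_position", "end_position"],
)
print(xml_text)
```

Keeping filters and attributes as separate arguments mirrors the web interface: filters decide which records are returned, attributes decide which fields of those records are shown.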

Try Yourself

 Retrieve all SNPs for ‘novel’ human G-protein coupled receptor genes (GPCRs,
IPR000276) on chromosome 2.
 Retrieve the sequences of the exons of the human MEFV gene in FASTA format.
 Retrieve the gene structure (i.e. start and end coordinates of exons) of the mouse
gene ENSMUSG00000042351.
 Retrieve all human disease genes containing transmembrane domains located
between p11.2 and q22.
 The file contains a list of probeset IDs from a microarray experiment using the
Affymetrix array HG-U133 Plus 2.0 (human). Retrieve the 500 bp upstream of the
transcripts matching these probeset IDs.
 Retrieve the sequences 5 kb upstream of all human ‘known’ genes between D1S2806
and D1S464.
 Retrieve all human SNPs that have an ID from The SNP Consortium (TSC), from
chromosome 6 between 15 Mb and 15.2 Mb, with 200 bases flanking sequence.
 Retrieve the mouse homologues of Homo sapiens genes CASP1, CASP2, CASP3, and
CASP4.
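
Several of these exercises ask for sequence upstream of a gene or transcript (500 bp, 5 kb). BioMart works this out for you, but it helps to see what "upstream" means in coordinates, because it depends on the strand. The function below is a minimal sketch, assuming 1-based inclusive coordinates and a strand given as +1 or -1; upstream_region is a hypothetical helper, not a BioMart or Ensembl function.

```python
def upstream_region(start, end, strand, length):
    """Return (region_start, region_end) of the `length` bases upstream.

    Coordinates are 1-based and inclusive; `strand` is +1 or -1.
    On the forward strand, upstream lies before `start`;
    on the reverse strand, upstream lies after `end`.
    """
    if strand == 1:
        region_start = max(1, start - length)  # do not run off the chromosome start
        region_end = start - 1
    elif strand == -1:
        # No chromosome length is known here, so the far end is not capped.
        region_start = end + 1
        region_end = end + length
    else:
        raise ValueError("strand must be +1 or -1")
    return region_start, region_end


print(upstream_region(1000, 2000, 1, 500))
print(upstream_region(1000, 2000, -1, 500))
```

For a feature at 1000 to 2000, the 500 bp upstream are 500 to 999 on the forward strand and 2001 to 2500 on the reverse strand.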

NCBI

 Genome projects
 After DNA sequencing, the contigs that are generated are
submitted to NCBI through WGS Submissions
 Whole Genome Shotgun Sequences
 WGS List
 Download (GenBank format → WGS → FASTA)

NCBI Genome Project page: go to the WGS sequences
WGS home page: where you can find the WGS project lists
GenBank format file for the WGS: click on the link for a detailed view of the data
WGS project page: check out the FASTA format

NCBI FTP

 For downloading the sequences/genomes in
different required formats.
 FAA (amino acid file in fasta format)
 FNA (nucleic acid file in fasta format)
 FFN (Coding Sequences in fasta format)
 GBK (GenBank format)
 PTT (CDS file in tab delimited format)
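
The FAA, FNA, and FFN files are all plain FASTA text, so one small reader handles all three. The sketch below is a minimal example using only the Python standard library; read_fasta is a hypothetical helper and the two records are made up to stand in for the contents of a downloaded file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, standing in for the contents of an .faa file.
example = io.StringIO(
    ">protein_1 hypothetical protein\nMKT\nAYI\n>protein_2 example\nMSS\n"
)
records = list(read_fasta(example))
print(records)
```

To read a real download, pass an open file instead of the StringIO object, for example `with open(path) as handle:` where path points to the .faa, .fna, or .ffn file.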


PATRIC

 WGS annotations download
 For details visit the website and the FAQ page

 http://www.patricbrc.org/portal/portal/patric/Home

PATRIC home/search page: http://www.patricbrc.org/portal/portal/patric/Home
CDS links: check out the CDS links for the searched organism
Downloading: check out the different downloading options

Exercise

 Explore all the databases thoroughly, following the
problems given in the “part-1.doc” file in the
“day-2” folder (on the desktop).
