Understanding Genome
  -Biological Database Overview
               Part-1


            DAY-2, SESSION-1
                (25-10-2010)




                  Rajendra K. Labala
 Biomedical Informatics Centre, NICED, ICMR, Kolkata
Major Challenges with Genomes

 • The scientific challenge of decoding a genome from its
   nucleotides into a set of functional elements
 • Developing software capable of storing, manipulating,
   and evaluating genomes
 • Providing comprehensive and informative access to a
   large amount of data in a user-friendly way
The Genome Problem

 • The problem with the genome (particularly the human genome)
   is that it is “large, complicated, and opaque to
   analysis”
 • Genome features to identify include:
    - Genes: protein-coding, RNA, pseudogenes
    - Regulatory elements
    - SNPs, repeats, etc.
Solutions

 • Ensembl
 • NCBI
 • PATRIC

    You will learn
     - A detailed overview of each resource
     - Sequence-related information retrieval and data mining
The Ensembl Project

 • Ensembl is a joint project to develop a software system
   that produces and maintains automatic annotation of
   selected eukaryotic genomes
 • It is run jointly by:
    - EMBL-EBI: the European Bioinformatics Institute, an
      outstation of the European Molecular Biology Laboratory
    - WTSI: the Wellcome Trust Sanger Institute
What is Ensembl

 • Ensembl is one of the 3 main systems currently
   available that annotate and display genomic
   information
    - Ensembl
        http://www.ensembl.org
    - UCSC Genome Browser
        http://genome.ucsc.edu
    - NCBI Genome Browser
        http://www.ncbi.nlm.nih.gov
 • Public annotation of mammalian and other genomes
 • Open source software
 • Relational database system
Genomes and Annotation

 • Ensembl does not assemble any genome itself
    - It works with the sequencing centers that
      generate the genome assemblies

 • Ensembl provides high-quality annotation for
   genomes that do not have existing annotation
    - For genomes that already have high-quality annotation,
      it works with that existing annotation
Ensembl Genome Annotation

 • Uses raw DNA sequence data from public sources
 • Creates a tracking database (the “Ensembl database”)
 • Joins the sequences based on a sequence scaffold, or “Golden Path”
 • Automatically finds genes and other features of the sequence
 • Associates sequence and features with data from other sources
 • Provides a publicly accessible web-based interface to the database
Ensembl genomes: 57
[Figure: species tree]
Ensembl Software System

 • Makes extensive use of BioPerl (www.bioperl.org)
 • Uses the free MySQL database
 • The entire Ensembl code base is freely available under
   an Apache open source license
 • Mainly written in Perl, with extensions in C; some
   viewers have been written in Java (e.g. Apollo)
 • The software can be downloaded by FTP
 • It is possible to set up a mirror of the entire Ensembl
   system
Ensembl Databases

 • 4 main databases
    - Ensembl Core Database
    - Ensembl EST Database
    - Ensembl Compara Database
    - Ensembl Variation Database
 • Ensembl uses MySQL to store information in relational
   databases
 • Ensembl also provides APIs (Application Programming
   Interfaces)
    - These serve as the connection between the databases and specific
      application programs
    - Ensembl has a Perl API and a Java API
        The Perl API is more “complete” than the Java API
Ensembl Databases

 • Ensembl Core Databases
    - Species-specific Ensembl core databases store
      genome sequence and annotation information
         Gene, transcript, and protein models annotated by the
         Ensembl automated genome analysis
    - The databases also store information about cDNA and
      protein alignments, as well as external references
         e.g. NCBI accession numbers such as AB012211
Ensembl Databases

 • Ensembl Compara Database
    - A multi-species database that stores the results of genome-wide species
      comparisons
    - The comparative genomics dataset provides pairwise whole-genome
      alignments
    - The comparative proteomics dataset provides orthologue predictions
      and protein family clusters
 • Ensembl EST
    - Species-specific Ensembl EST databases hold an independent EST gene set,
      provided for all well-characterised species with a suitable amount of
      biological evidence. The layout of the Ensembl EST databases is identical to the
      Ensembl Core Database schema, so the schema descriptions and API access
      are equally applicable
 • Variation
    - The large amount of genetic variation information is organised in a set of
      species-specific Ensembl Variation databases.
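
To make the idea of a relational core database more concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module instead of MySQL so that it runs anywhere. The table names, column names, and the TOY... identifiers are simplified assumptions for illustration only; they are not the actual Ensembl core schema or real Ensembl records. The accession AB012211 is reused from the slide above purely as a sample external reference.

```python
import sqlite3


def build_toy_core_db():
    """Create an in-memory database with a simplified, hypothetical layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            gene_id   INTEGER PRIMARY KEY,
            stable_id TEXT NOT NULL,
            biotype   TEXT NOT NULL
        );
        CREATE TABLE transcript (
            transcript_id INTEGER PRIMARY KEY,
            gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
            stable_id     TEXT NOT NULL
        );
        CREATE TABLE xref (
            xref_id   INTEGER PRIMARY KEY,
            gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
            source    TEXT NOT NULL,
            accession TEXT NOT NULL
        );
        """
    )
    # Placeholder rows: one gene, two transcripts, one external reference.
    conn.execute("INSERT INTO gene VALUES (1, 'TOYGENE0001', 'protein_coding')")
    conn.executemany(
        "INSERT INTO transcript VALUES (?, ?, ?)",
        [(1, 1, "TOYTRANS0001"), (2, 1, "TOYTRANS0002")],
    )
    conn.execute("INSERT INTO xref VALUES (1, 1, 'GenBank', 'AB012211')")
    conn.commit()
    return conn


def transcripts_for_accession(conn, accession):
    """Follow an external reference to its gene and that gene's transcripts."""
    return conn.execute(
        """
        SELECT g.stable_id, t.stable_id
        FROM xref x
        JOIN gene g       ON g.gene_id = x.gene_id
        JOIN transcript t ON t.gene_id = g.gene_id
        WHERE x.accession = ?
        ORDER BY t.stable_id
        """,
        (accession,),
    ).fetchall()


conn = build_toy_core_db()
for gene_id, transcript_id in transcripts_for_accession(conn, "AB012211"):
    print(gene_id, transcript_id)
```

The join is the point of the sketch: annotation (genes, transcripts) and external references live in separate tables linked by keys, which is what lets an API answer questions such as "which transcripts belong to the gene behind this accession".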
Data Mining with Ensembl

 • BioMart
    - A generic data management system, originally developed for use
      with Ensembl
    - It gives users the ability to conduct fast and powerful
      searches
    - It simplifies the task of integrating external data sets (provided
      by the user) with the Ensembl databases

 • Help & Documentation Link
    - http://asia.ensembl.org/info/index.html
Data mining through BioMart

 • Choose a dataset
 • Choose the data to be retrieved (attributes)
 • Narrow your dataset (filters)
BioMart
Dataset
Select your dataset from the dropdown list
Filters
Narrow your query using the given filter options
Attributes
Choose the data to be retrieved from these attributes
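
The same three choices (dataset, filters, attributes) can also be written down as a BioMart XML query instead of being clicked in the web interface. The sketch below only builds the query text in Python and does not contact any server. The helper name build_biomart_query is hypothetical, and the dataset, filter, and attribute names used in the example are assumptions that should be checked against the lists shown in the BioMart interface. The mouse gene ID is the one from the "Try Yourself" exercises.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters=None):
    """Build a BioMart-style XML query string (hypothetical helper).

    dataset    -- dataset name, e.g. "mmusculus_gene_ensembl" (assumed name)
    attributes -- list of columns to retrieve
    filters    -- dict of {filter name: value} used to narrow the dataset
    """
    query = ET.Element(
        "Query",
        {
            "virtualSchemaName": "default",
            "formatter": "TSV",
            "header": "0",
            "uniqueRows": "0",
            "datasetConfigVersion": "0.6",
        },
    )
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters narrow the dataset ...
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # ... attributes choose which columns come back.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Example: exon coordinates for one mouse gene (attribute names are assumptions).
xml_query = build_biomart_query(
    "mmusculus_gene_ensembl",
    attributes=[
        "ensembl_gene_id",
        "ensembl_exon_id",
        "exon_chrom_start",
        "exon_chrom_end",
    ],
    filters={"ensembl_gene_id": "ENSMUSG00000042351"},
)
print(xml_query)
```

A query string like this is typically sent as the query parameter to a BioMart "martservice" URL; that step is left out here so the example stays runnable offline.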
Try Yourself

 • Retrieve all SNPs for 'novel' human G-protein coupled receptor genes (GPCRs;
   InterPro IPR000276) on chromosome 2.
 • Retrieve the sequences of the exons of the human MEFV gene in FASTA format.
 • Retrieve the gene structure (i.e. start and end coordinates of exons) of the mouse
   gene ENSMUSG00000042351.
 • Retrieve all human disease genes containing transmembrane domains located
   between p11.2 and q22.
 • The file contains a list of probeset IDs from a microarray experiment using the
   Affymetrix array HG-U133 Plus 2.0 (human). Retrieve the 500 bp upstream of the
   transcripts matching these probeset IDs.
 • Retrieve the sequences 5 kb upstream of all human 'known' genes between D1S2806
   and D1S464.
 • Retrieve all human SNPs that have an ID from The SNP Consortium (TSC), from
   chromosome 6 between 15 Mb and 15.2 Mb, with 200 bases of flanking sequence.
 • Retrieve the mouse homologues of the Homo sapiens genes CASP1, CASP2, CASP3, and
   CASP4.
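
Several of these exercises ask for sequence "upstream" of a gene or transcript. What upstream means depends on the strand, which is easy to get wrong when working with raw coordinates. The sketch below shows the arithmetic, assuming 1-based inclusive coordinates and strand values of +1 and -1; the function name upstream_region is hypothetical, and BioMart performs this calculation for you when you request upstream flanking sequence.

```python
def upstream_region(start, end, strand, length):
    """Return (region_start, region_end) for `length` bases upstream of a feature.

    Coordinates are 1-based and inclusive with start <= end, as on a
    chromosome. Returns None if the feature starts at position 1 on the
    forward strand, because nothing lies upstream of it.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")

    if strand == 1:
        # Forward strand: upstream is to the left of the feature start.
        if start == 1:
            return None
        return max(1, start - length), start - 1

    # Reverse strand: the feature begins at `end`, so upstream is to the right.
    # The extracted sequence would also need reverse-complementing, and the
    # region is not clamped here because the chromosome length is unknown.
    return end + 1, end + length


print(upstream_region(1000, 2000, 1, 500))   # forward-strand feature
print(upstream_region(1000, 2000, -1, 500))  # reverse-strand feature
```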
NCBI

 • Genome projects
    - After DNA sequencing, the resulting contigs are
      submitted to NCBI through WGS submissions
 • Whole Genome Shotgun (WGS) sequences
 • WGS list
 • Download (GenBank format → WGS → FASTA)
NCBI Genome Project
Go to the WGS sequences
WGS
Home page of WGS, where you can find the WGS project lists
GenBank format file for the WGS
Click on the link for a detailed view of the data
WGS project page
Check out the FASTA format
NCBI FTP

 • For downloading sequences and genomes in
   the required formats:
    - FAA (amino acid sequences in FASTA format)
    - FNA (nucleic acid sequences in FASTA format)
    - FFN (coding sequences in FASTA format)
    - GBK (GenBank format)
    - PTT (protein table: CDS features in tab-delimited format)
NCBI FTP
Genome files in different formats: FAA, FNA, FFN, GBK, and PTT
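
Three of the formats above (FAA, FNA, FFN) are plain FASTA text, so one small reader handles all of them. The following is a minimal sketch in Python using only the standard library; the function names and the example records are made up for illustration, and for real work an established library such as Biopython is the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record back out, wrapping the sequence at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Made-up example records, not real database entries.
example = ">seq1 toy nucleotide record\nATGC\nGGTA\n\n>seq2 toy protein record\nMKV\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The same function reads a downloaded file with parse_fasta(open(path).read()), where path is wherever the FAA, FNA, or FFN file was saved.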
PATRIC

 • WGS annotations download
 • For details, visit the website and the FAQ page

 • http://www.patricbrc.org/portal/portal/patric/Home
PATRIC home/search page
http://www.patricbrc.org/portal/portal/patric/Home
CDS links
Check out the CDS links for the searched organism
Downloading
Check out the different download options
Exercise

 • Work through all the databases according to
   the problem given in the “part-1.doc” file in the
   “day-2” folder (on the desktop).
