Browsing Genes, Variation and Regulation data with Ensembl
1. Training materials
Ensembl materials are protected by a CC BY license
http://creativecommons.org/licenses/by/4.0/
If you wish to re-use these, please credit Ensembl for
their creation
If you use Ensembl for your work, please cite our papers
http://www.ensembl.org/info/about/publications.html
2. Denise Carvalho-Silva
European Molecular Biology Laboratory
European Bioinformatics Institute
Browsing Genes, Variation and
Regulation data with Ensembl
CRUK - Cambridge Institute
5. Course Objectives
What is Ensembl? What type of data can
you get in Ensembl?
How to navigate the
Ensembl browser website?
How to connect
with Ensembl
6. bli blo
bla blu
bla bla
bli blo
bla blu
bla bla
bli blo
bla blu
bla bla
bla blu
bla bla
bli blo
11. Annotation of vertebrate genomeswww.ensembl.org
pre.ensembl.org
>80 genomes*
D. melanogaster
C. elegans
S. cerevisae
*Release 84
March 2016
12. 1 human genome à 3 assemblies
www.ensembl.org grch37.ensembl.org
e54.ensembl.org
13. EBI is an Outstation of the European Molecular Biology Laboratory.
Comparative GenomicsGene models
RegulationVariation
Custom data displayProgrammatic access
Toolkit
Ensembl Features
14. EBI is an Outstation of the European Molecular Biology Laboratory.
Comparative GenomicsGene models
RegulationVariation
Custom data displayProgrammatic access
Toolkit
Ensembl Features
16. Gene models in Ensembl
Goal: Generate set of well-supported genes
Automatic Manual
17. • many species
• genome-wide at once
• ~ 4 months
• fewer species
• gene by gene
• many years
Automatic and coding (20_)
Manual and coding (00_)
Automatic + Manual
(“gold”)
Manual and non-coding (00_)
Automatic annotation* Manual annotation*
* based on experimental, biological evidence (INSDC, UniProtKB…)
20. Which transcript to use?
http://www.ensembl.org/Help/Glossary?id=493
http://www.ensembl.org/Help/Glossary?id=492
APPRIS
TSLs
21. CCDS project
• annotate a consensus coding DNA sequence set
• EBI, WTSI, UCSC and NCBI
•
Genome Res. 19:1316-23 (2009)
http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
CCDS transcript
22. Disclaimer: which transcript to use
No single method will tell us which transcript to use
Decision on a case by case basis
• All transcripts OR one/two well supported ones?
List of transcripts: we offer choices based on
• CCDS (Ensembl, HAVANA, NCBI, UCSC)
• Golden transcripts (identical Ensembl and HAVANA)
• Cross reference entries (e.g. UniProtKB, RefSeq)
• APPRIS
• TSLs
23. Annotation based on RNASeq
http://www.ensembl.org/info/genome/genebuild/
rnaseq_annotation.html
25. Ensembl stable identifiers
• ENSG########### Ensembl Gene ID
• ENST########### Ensembl Transcript ID
• ENSP########### Ensembl Peptide ID
• ENSE########### Ensembl Exon ID
• For non-human species a suffix is added:
ENSMUSG MUS (Mus musculus) for mouse
28. The ESPN gene products are active in the
inner ear, where it appears to play an
essential role in normal hearing and balance.
Let’s explore ESPN
Before we start: background
29. A) What is the location and strand of the
human ESPN gene?
B) How can I view protein alignments and
variants mapped to this location?
C) Can I move data tracks up and down,
share and delete tracks?
Human ESPN: location
30. A) How can I find the genomic sequence of
this gene? What is the ID of its first exon?
B) Can I display the genomic coordinates and
variants on this sequence?
C) Can I find information on the expression
of this gene in different tissues?
Human ESPN: gene
31. A) How many exons does the longest ESPN
transcript have? Are there any completely
untranslated exons?
B) Can I find its cDNA sequence?
C) What are the UniProt and RefSeq entries
cross referenced to this transcript?
Human ESPN: transcript
33. EBI
is
an
Outsta,on
of
the
European
Molecular
Biology
Laboratory.
BioMart
34. Outline
• Definitions
• The principle: 4 steps
• Tutorial: simple query in human
• Find Ensembl BioMart and BioMart elsewhere
• Sophisticated platforms: mart services, APIs, etc…
• Exercises
35. What is BioMart?
• Free service for easy retrieval of Ensembl data
• Data export tool with little/no programming required
• Complex queries with a few mouse clicks
• Output formats (.xls, .csv, fasta, tsv, html)
36. The four-step principle
DATA FILTERS ATTRIBUTES RESULTS
IDs
Regions
Domains
Expression
Tables
Fasta
Dataset
Database
Homologs
Sequences
Features
Structures
43. For the IL23R, PTPN22, CUL2, C1orf106 and
IL18RAP genes, use BioMart to retrieve a
table (.xls) containing:
• Associated gene name, ENSG and ENST IDs
• Chromosome name, gene start and end
• GO term name and Interpro description
Tutorial: BioMart
44. The four-step principle
DATA FILTERS ATTRIBUTES RESULTS
Gene
Gene name,
ENSG/ENST ID,
Chr start end,
GO term name,
Interpro
description
.xls table
Human
IL23R,
CUL2,
PTPN22,
C1orf106
IL18RAP
49. More sophisticated platforms
• BioMart queries: MartService
www.biomart.org/martservice.html
• APIs: PERL, Java, Web Services
• Third party softwares
galaxyproject.orgbioconductor.orgtaverna.org.uk
53. EBI
is
an
Outsta,on
of
the
European
Molecular
Biology
Laboratory.
Genetic
Variation
54. Outline
• Classes of variation, species and sources
• Browsing variation data: some entry points
Location tab
Gene tab
Variation tab
• Phenotypic data and population genetics
• How to annotate your own variants
• Exercises
55. 1) Large scale: structural (> 50 base pairs)
Genetic variation
duplication deletion inversion translocation loss
2) Short scale: SNPs (or SNVs), indels
G A C TG A C TA T C G GG GTT TCC CA A A
G A A TG A C TT T C G G- G-T TCC -A A A
56. Species with variation data
Understand the types of genetic variation data and
how to view them in the context of our genomes
57. Sources of variation data
• Import alleles and frequencies
• Annotate variants
http://www.ensembl.org/info/docs/variation/sources_documentation.html
65. Coffee intake is a worldwide phenomenon
with Finland at the top, and UK in the 44th
place. Is caffeine consumption in our genes?
A) What are the chromosome locations of variants associated
with this phenotype?
B) Which variant has got the most significant association?
C) What is the ancestral allele of this variant? Is it conserved
in eutherian mammals?
D) What is the most frequent allele in GBR?
E) Can you download this variant and 200 nt upstream and
downstream flanking sequence in RTF (Rich Text Format)?
Live demo
66. You can annotate your
SNPs and SVs too!
• Variant Effect Predictor
• Different input formats
• SIFT/PolyPhen for missense variants
PMID: 20562413
Perl
script
Web
interface
REST
API
XML
67. CODING
Synonymous
INTRONIC5’ UTR
ATG AAAAAAA
Regulatory
Splice sites
CODING
Missense
3’ UTR5’ Upstream 3’ downstream
Mapping variants on transcripts
Identify transcripts that overlap variants and
predict the consequence of these on Ensembl
(or RefSeq) transcripts using
68. Consequence terms for variants
http://www.ensembl.org/info/genome/variation/predicted_data.html#consequence_type_table
* defined by the Sequence Ontology (SO) project (http://www.sequenceontology.org/)
76. Navigate results
(one row per variant/
transcript overlap)
Show/hide columns in
results table
more columns:
scroll right
• Download results
• Send results to BioMart
Create and edit filters
ensembl.org/info/docs/tools/vep/online/results.html#table
results table
77. Filters consist of three components
Field
• e.g. Consequence, biotype
Operator
• e.g. is, matches (partial string matches)
Value
• the value to compare against
• some fields have autocomplete values
Multiple filters allowed with logical relationship (AND, OR)
Active filters can be edited too!
ensembl.org/info/docs/tools/vep/online/results.html#filter
Filtering results
79. Things to bear in mind
1) No distinction between polymorphisms and mutations.
Exception HGMD and COSMIC: all mutations;
2) C/T à first allele is the one in the reference genome,
not necessarily the major or the ancestral;
3) Ensembl reports all alleles on the forward strand
(different from dbSNP).
83. Ensembl Genetic Variation
Exercises
pages 52-56
Answers
www.ebi.ac.uk/~denise/workshops/2016/
cruk/answers
Feel free to explore your favourite variant/phenotype too!
84. EBI
is
an
Outsta,on
of
the
European
Molecular
Biology
Laboratory.
Gene
Regulation
85. Outline
• Definition and models
• Epigenetics and Epigenomics
• Ensembl Regulation: goal, data sources, species
• Viewing / accessing regulation data in Ensembl
• Track hubs: ENCODE, Blueprint
• Exercises
86. Regulation of gene expression
• Change in the production of mRNA/proteins ( or )
• From the transcription to post-translational levels
• Models of regulation of gene transcription
• Basic
• Expanded
• Complete ??
91. Epigenetics/Epigenomics
Epigenetics*
The study of inherited changes in
phenotype without changes in
genotype
Epigenomics
Epigenetics on a genome-wide scale
http://integratedhealthcare.eu/
*One of the routes to regulate gene transcription
93. ChIP-sequencing
crosslink and shear
TF1 TF2
TF3
TF1
TF3TF2
Antibodies and IP
unlink, purify and DNA sequencing
Y YY
TF1
TF3
TF2
ACGTC CGCTT GAACA
map back to the genome
DNA and proteins
94. Ensembl Regulation
Goal: Annotate the genome with features that may play a
role in the transcriptional regulation of genes
Multiple data sources: collection and summary
http://www.ensembl.org/info/docs/funcgen/regulation_sources.html
http://www.ensembl.org/Homo_sapiens/Experiment/
95. Data source: ENCODE
“Encyclopedia of DNA Elements”
Trying to assign function to many regions as possible
Transcription and regulatory information
4,626 datasets, 2,498 cell types à functional elements
PMID: 22955616, PMID: 17571346
http://www.nature.com/encode/#/threads
96. Data source: Roadmap
NIH consortium: public resource of normal epigenomes
DNA methylation, histone marks, open chromatin, small RNA
http://www.roadmapepigenomics.org/data
http://www.roadmapepigenomics.org/publications
97. • EU consortium: generate 100 reference epigenomes
• Blood cells: healthy individuals and malignant leukaemic
counterparts
• 1046 experiments {ChIP, RNA, Bisulfite, DNase}-Seq
• 425 cell types and seven cell lines
• http://www.blueprint-epigenome.eu/
Data source: Blueprint
98. Dataset for the Ensembl build
raw data à Ensembl Regulation pipeline à Ensembl annotation
102. Regulatory features: view
Configure this page à Regulation à Regulatory features
For single and individual cell lines, e.g. GM12878, HUVEC
103. ChiP-Seq signal for TF
signal
Regulatory Features: motifs
Ensembl regulatory feature
Position Weight Matrix for TF
(JASPAR database)
104. Viewing the raw NGS data
DNaseI and TFBS
Histone marks
and polymerases
Configure this page à Regulation à Open chromatin &…
Configure this page à Regulation à Histones &…
105. How to choose raw data: matrix
Supporting evidence:
1) Open chromatin & TFBS
2) Histones & polymerases
http://tinyurl.com/matrix-ensembl
106. CTCF
enriched
Predicted
Weak
Enhancer/Cis-‐reg
element
Predicted
Transcribed
Region
Predicted
Enhancer
Predicted
Promoter
Flank
Predicted
Repressed/Low
AcAvity
Predicted
Promoter
with
TSS
Segmentation data in Ensemblcategoriesof
combinedsegments
Configure this page à Regulation à Regulatory features
107. Experimental confirmation
• CTCF: good recall, reproducible across
multiple cell lines, tight boundaries.
• TSS:
• 88.9% of FANTOM 5 strict TSSs were covered.
• Enhancers:
• 92.4% of 882 VISTA enhancers were detected.
• 80.3% of 40279 robust FANTOM 5 enhancers were found.
108. Methylation data in Ensembl
CpG DNA methylation (RRBS, WGBS, MeDIP)
ENCODE and PMID: 18577705
Configure this page à Regulation à DNA Methylation
109. The STRADA controls tumor suppressor activities of LKB1
(https://www.wikigenes.org/)
A. What are the Ensembl regulatory features annotated in this
gene?
B. Are there any features in the 5’ region of STRADA?
C. Do the regulatory features for K562, CD8+ cells (ENCODE)
and erythroblast (Blueprint) differ at this region?
D. What is the methylation pattern (WGBS) at the 5’ end of
this gene?
E. What is the stable IDs of the most 5’ regulatory feature?
Tutorial
111. Things to bear in mind
1) The annotation of regulatory elements in Ensembl
highlight where the biochemical data (ChIP-seq, etc)
maps to on the human (mouse) genomes;
2) Features can be nearby genes but might not affect
their transcription/expression;
3) Disclaimer: Ensembl can not tell you how your
favourite gene is regulated.
112. In addition to the big names
CpG islands, TSS, miRNA target predictions (TarBase)
Configure this page à Regulation à Other regulatory regions
Configure this page à Sequence and assembly à Simple features
113. Subset of cell types*
http://www.ensembl.org/info/genome/funcgen/regulatory_segmentation.html
Minimum requirement for cell selection:
CTCF binding, DNaseI, H3K4me3, H3K27me3 and H3K6me3
• Display all TFBS and histone marks for them
• Data processing to predict activity e.g. promoter
* ENCODE
128. Training materials
Ensembl materials are protected by a CC BY license
http://creativecommons.org/licenses/by/4.0/
If you wish to re-use these, please credit Ensembl for
their creation
If you use Ensembl for your work, please cite our papers
http://www.ensembl.org/info/about/publications.html