New insights into the human genome by encode 14.12.12

New insights into the human
genome by ENCODE

What is a gene?
ENCODE
• A gene is a union of genomic sequences encoding a
coherent set of potentially overlapping
functional products.
(Gerstein et al., 2007)

It's been ten years since scientists sequenced the human
genome.
But what do all these letters mean?

ENCODE, the Encyclopedia of DNA Elements, has answers
Aiming to delineate all of the functional elements encoded in the
human genome sequence

ENCODE Consortium
(The ENCODE Project Consortium, 2011)

• Pilot phase and technology development phase: 2003-2007
• Production phase: 2007-2012, 30 papers

ENCODE
• Major methods
• Data production and initial analysis
• Accessing ENCODE data
• Working with ENCODE data
• Data analysis
• Limitations
• Threads – Nature explorer

Major Methods

Overall data flow

• RNA-seq – Isolation of RNA sequences followed by high-throughput
sequencing
• CAGE – Capture of the methylated cap at the 5’ end of RNA, followed
by high-throughput sequencing
• RNA-PET – Simultaneous capture of RNAs with both a 5’ methyl cap
and a poly(A) tail
• ChIP-seq – Chromatin immunoprecipitation followed by sequencing
• FAIRE-seq – Formaldehyde-assisted isolation of regulatory
elements: crosslinking, phenol extraction, and sequencing of the DNA
fragments in the aqueous phase

ENCODE cell types

ENCODE data production and initial analyses
• Since 2007, ENCODE has developed methods and performed a large
number of sequence-based studies to map functional elements across
the human genome.
• The elements mapped (and approaches used) include:
➢ RNA transcribed regions (RNA-seq, CAGE, RNA-PET and manual
annotation)
➢ Protein-coding regions (mass spectrometry)
➢ Transcription-factor-binding sites (ChIP-seq and DNase-seq)
➢ Chromatin structure (DNase-seq, FAIRE-seq, histone ChIP-seq)
➢ DNA methylation sites (RRBS assay)

Transcribed and protein-coding regions
• In total, GENCODE-annotated exons of protein-coding genes cover 2.94% of the
genome or 1.22% for protein-coding exons.
• Protein-coding genes span 33.45% from the outermost start to stop codons, or
39.54% from promoter to poly(A) site.
• Additional protein-coding genes remain to be found.
• In addition, they annotated 8,801 automatically derived small RNAs and 9,640
manually curated long non-coding RNA (lncRNA) loci.
• GENCODE also annotated 11,224 pseudogenes.
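
Coverage percentages like those above come from taking the union of the annotated intervals and dividing by the sequence length. The following is a minimal sketch of that calculation, assuming half-open (start, end) coordinates on a single sequence. The function names and the toy coordinates are illustrative only and are not GENCODE data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by the union of the intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / sequence_length


# Toy example: exons from two overlapping transcripts on a 1,000 bp sequence.
exons = [(100, 150), (140, 200), (400, 420)]
print(merge_intervals(exons))         # [(100, 200), (400, 420)]
print(covered_fraction(exons, 1000))  # 0.12
```

Merging first matters because exons shared by several transcripts would otherwise be counted more than once.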

Process flow of experimental evaluation of
pseudogene transcription
Experimental validation
results showing the
transcription of pseudogenes
in different tissues
(Pei et al., 2012)

ENCODE gene and transcript annotations.

RNA
• They sequenced RNA from different cell lines and multiple
subcellular fractions to develop an extensive RNA expression
catalogue.
• They used CAGE-seq (5’ cap-targeted RNA isolation and
sequencing) to identify 62,403 transcription start sites (TSSs) in
tier 1 and 2 cell types.

A large majority of GENCODE elements are detected by
RNA-seq data
(Djebali et al., 2012)

Protein bound regions
• They mapped the binding locations of 119 different DNA-binding
proteins and a number of RNA polymerase components in 72 cell types
using ChIP-seq.
• Overall, 636,336 binding regions covering 231 megabases
(8.1%) of the genome are enriched for regions bound by
DNA-binding proteins across all cell types.

Occupancy of transcription factors and RNA
polymerase 2 on human chromosome 6p as
determined by ChIP-seq

DNase I hypersensitive sites and footprinting
• Chromatin accessibility characterized by DNase I hypersensitivity
is the hallmark of regulatory DNA regions.
• They mapped 2.89 million unique, non-overlapping DNase I
hypersensitive sites (DHSs) by DNase-seq in 125 cell types; most of
these lie distal to TSSs.
• In tier 1 and tier 2 cell types they found an average of 205,109
DHSs per cell type, encompassing an average of 1.0% of the genomic
sequence in each cell type, and 3.9% in aggregate.

Density of DNase I cleavage sites for selected cell types
(Thurman et al., 2012)

• On average, 98.5% of the occupancy sites of transcription factors
mapped by ENCODE ChIP-seq lie within accessible chromatin defined by
DNase I hotspots.
• Using genomic DNase I footprinting on 41 cell types, they
identified 8.4 million distinct DNase I footprints.

Regions of histone modification
• They assayed chromosomal locations for up to 12 histone
modifications and variants in 46 cell types, across tier 1 and 2.
(The ENCODE Project Consortium, 2012) (http://www.factorbook.org)

DNA methylation
• They used reduced representation bisulphite sequencing (RRBS)
to profile DNA methylation quantitatively for an average of 1.2
million CpGs in each of 82 cell lines and tissues (8.6% of
non-repetitive genomic CpGs), including CpGs in intergenic
regions, proximal promoters and intragenic regions.
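
Bisulphite sequencing can be quantitative because bisulphite treatment converts unmethylated cytosines to uracil (read as T) while methylated cytosines remain C. The methylation level at a CpG can then be estimated as the share of reads showing C. The following is a minimal sketch, assuming per-CpG counts of C and T reads are already available. The function name, the minimum-coverage cutoff and the read counts are hypothetical and are not ENCODE values.

```python
def methylation_level(c_reads, t_reads, min_coverage=10):
    """Return the fraction of reads supporting methylation at one CpG.

    Returns None when coverage is below min_coverage (an assumed cutoff).
    """
    coverage = c_reads + t_reads
    if coverage < min_coverage:
        return None
    return c_reads / coverage


# Hypothetical read counts per CpG position: (C reads, T reads).
cpg_counts = {
    ("chr1", 10468): (18, 2),
    ("chr1", 10470): (5, 15),
    ("chr1", 10483): (3, 1),   # too few reads to call
}

for site, (c, t) in cpg_counts.items():
    print(site, methylation_level(c, t))
```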

Proteomics
➢ To assess putative protein products generated from novel RNA
transcripts and isoforms, proteins are sequenced and quantified
by mass spectrometry and mapped back to their encoding
transcripts.
➢ Protein studies have begun in the K562 and GM12878 cell lines.

ENCODE chromatin annotations in the HLA
locus

Accessing ENCODE Data
ENCODE Data Release and Use Policy
• The ENCODE Data Release and Use Policy is described at
http://www.encodeproject.org/ENCODE/terms.html.
• ENCODE data are released for viewing in a publicly accessible
browser (initially at http://genome-preview.ucsc.edu/ENCODE
and, after additional quality checks, at http://encodeproject.org)
Public Repositories
• UCSC Genome Browser database (http://genome.ucsc.edu).

Working with ENCODE Data
Using ENCODE Data in the UCSC Browser
• Many users will want to view and interpret the ENCODE data for
particular genes of interest. At the online ENCODE portal
(http://encodeproject.org), users should follow a “Genome
Browser” link to visualize the data in the context of other genome
annotations.
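
Beyond the browser, peak calls are commonly distributed as tab-separated narrowPeak files (a BED6+4 layout documented by the UCSC Genome Browser). The following is a minimal sketch of reading such records and selecting those near a gene of interest, assuming that column layout. The `Peak` class, the helper names and the example lines are hypothetical and are not real ENCODE records.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    chrom: str
    start: int      # 0-based, inclusive
    end: int        # exclusive
    name: str
    score: int
    strand: str
    signal: float
    p_value: float  # -log10 p-value, -1 if not assigned
    q_value: float  # -log10 q-value, -1 if not assigned
    summit: int     # offset of the summit from start, -1 if not called


def parse_narrowpeak(lines):
    """Yield Peak records from narrowPeak-formatted text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and browser headers
        f = line.split("\t")
        yield Peak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                   float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def peaks_in_region(peaks, chrom, start, end):
    """Return peaks overlapping the half-open region [start, end) on chrom."""
    return [p for p in peaks
            if p.chrom == chrom and p.start < end and p.end > start]


# Made-up example records.
example = [
    "chr6\t31000100\t31000400\tpeak1\t850\t.\t12.5\t8.1\t6.3\t140",
    "chr6\t32500000\t32500250\tpeak2\t400\t.\t5.2\t3.0\t2.1\t90",
    "chr1\t1000\t1300\tpeak3\t600\t.\t7.7\t4.4\t3.2\t150",
]
peaks = list(parse_narrowpeak(example))
hits = peaks_in_region(peaks, "chr6", 31000000, 31001000)
print([p.name for p in hits])  # ['peak1']
```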

ENCODE Data Analysis
• Development and implementation of algorithms and pipelines for
processing and analyzing data is a major activity of the ENCODE
Project.
• 1st phase: short sequences are aligned to the reference genome
• 2nd phase: enriched regions are identified
• 3rd phase: the identified regions of enriched signal are integrated
with each other and with other data types
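
The three phases above can be illustrated with a toy example. This is only a sketch under strong assumptions: reads are taken as already aligned (phase 1 is not implemented), enrichment is called with a fixed read-count threshold per bin as a stand-in for a real peak caller, and integration is reduced to a simple interval overlap. None of the names or numbers come from the ENCODE pipelines.

```python
from collections import Counter


def bin_counts(read_starts, bin_size):
    """Count aligned read start positions (phase 1 output) per bin."""
    return Counter(pos // bin_size for pos in read_starts)


def enriched_regions(counts, bin_size, min_reads):
    """Phase 2: merge adjacent bins with at least min_reads into regions."""
    regions = []
    for b in sorted(b for b, n in counts.items() if n >= min_reads):
        start, end = b * bin_size, (b + 1) * bin_size
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)  # extend the previous region
        else:
            regions.append((start, end))
    return regions


def overlapping(regions_a, regions_b):
    """Phase 3: regions in regions_a that overlap any region in regions_b."""
    return [a for a in regions_a
            if any(a[0] < b[1] and b[0] < a[1] for b in regions_b)]


# Toy aligned read start positions on one chromosome (hypothetical data).
reads = [105, 110, 130, 160, 210, 220, 250, 905, 1510, 1520, 1530]
counts = bin_counts(reads, bin_size=100)
peaks = enriched_regions(counts, bin_size=100, min_reads=3)
print(peaks)                             # [(100, 300), (1500, 1600)]
print(overlapping(peaks, [(250, 400)]))  # [(100, 300)]
```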

Analysis tools applied by the ENCODE
consortium

Integrating ENCODE with other projects and the
Scientific Community
1. defining promoter and enhancer regions by combining transcript
mapping and biochemical marks,
2. delineating distinct classes of regions within the genomic
landscape by their specific combinations of biochemical and
functional characteristics, and
3. defining transcription factor co-associations and regulatory
networks.

• The ENCODE Project supports interpretation of human genome
variation that is associated with disease or quantitative phenotypes.
• Integration with the 1000 Genomes Project shows how SNPs and
structural variation may affect transcript, regulatory and DNA
methylation data.
• ENCODE data complement GWAS and other sequence-variation-driven
studies of human phenotypes.
• ENCODE is a major contributor not only of data but also of novel
technologies for deciphering the human genome.

Limitations of ENCODE Annotations
• Cell types are physiologically and genetically inhomogeneous.
• Local micro-environments in culture may also vary.
• Use of DNA sequencing to annotate functional genomic features is
also constrained.
• There is considerable quantitative variation in signal strength
along the genome.

Challenges
• The adult human body contains several hundred distinct cell types,
each of which expresses a unique subset of the 1,800 TFs
encoded in the human genome
• Brain alone contains thousands of types of neurons that are likely
to express not only different sets of TFs but also a larger variety
of non-coding RNAs
• A truly comprehensive atlas of human functional elements is not
practical with current technologies

Outcome
• Understanding of the human genome
• The broad coverage of ENCODE annotations enhances our
understanding of common diseases with a genetic
component and of rare genetic diseases
• ENCODE has sampled only 119 of 1,800 known transcription factors
and 13 of more than 60 currently known histone or DNA modifications
across 147 cell types
• Overall these data reflect a minor fraction of the potential
functional information encoded in the human genome

http://www.nature.com/encode/#/threads

13 Threads
1. Transcription factor motifs
2. Chromatin patterns at transcription factor binding sites
3. Characterization of intergenic regions and gene definition
4. RNA and chromatin modification patterns around promoters
5. Epigenetic regulation of RNA processing
6. Non-coding RNA characterization
7. DNA methylation
8. Enhancer discovery and characterization
9. Three-dimensional connections across the genome
10. Characterization of network topology
11. Machine learning approaches to genomics
12. Impact of functional information on understanding variation
13. Impact of evolutionary selection on functional regions

Schematic overview of the functional SNP
approach
(Schaub et al., 2012)

Comparison of GWAS identified loci with
ENCODE data
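
At its simplest, comparing GWAS loci with ENCODE data means asking which variant positions fall inside annotated elements. The following is a minimal sketch using binary search over sorted, non-overlapping intervals. It is not the method of Schaub et al., and the function names, coordinates, element labels and SNP identifiers are hypothetical.

```python
from bisect import bisect_right


def build_index(elements):
    """Sort non-overlapping half-open (start, end, label) elements by start."""
    ordered = sorted(elements)
    starts = [e[0] for e in ordered]
    return starts, ordered


def annotate(position, index):
    """Return the label of the element containing position, or None."""
    starts, ordered = index
    i = bisect_right(starts, position) - 1  # last element starting at or before position
    if i >= 0 and ordered[i][0] <= position < ordered[i][1]:
        return ordered[i][2]
    return None


# Hypothetical regulatory elements and SNP positions on one chromosome.
elements = [(1000, 1200, "DHS"), (5000, 5300, "TF binding site"),
            (9000, 9100, "enhancer")]
snps = {"snpA": 1100, "snpB": 4000, "snpC": 5299, "snpD": 9100}

index = build_index(elements)
for snp_id, pos in snps.items():
    print(snp_id, annotate(pos, index))
```

Because the intervals are half-open, a SNP at position 9100 falls just outside the element that ends there.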

Future goal
• Understand the mechanistic processes that generate these elements,
and how and where they function
• Enlarge the data set to additional factors, modifications and cell
types, complementing the other related projects
• Constitute foundational resources for human genomics, allowing a
deeper interpretation of the organization of gene and regulatory
information and the mechanisms of regulation, and thereby
provide important insights into human health and disease

Conclusion
The project is still far from complete.
For updates: https://www.facebook.com/ENCODEProject

Encode – assign word to letter

Thank you:)
Presented by: R. Veera Ranjani
