“Chromosome-territory–interchromatin-compartment” (CT-IC) Model
http://www.nature.com/nrg/journal/v2/n4/full/nrg0401_292a.html
CAGE: Cap analysis gene expression
“FANTOM is an international research consortium established by Dr. Hayashizaki and his colleagues in 2000 to assign functional annotations to the full-length cDNAs that were collected during the Mouse Encyclopedia Project at RIKEN.”
http://www.nature.com/nprot/journal/v7/n3/fig_tab/nprot.2012.005_F1.html
FAIRE-Seq: Formaldehyde-Assisted Isolation of Regulatory Elements
DNase-seq vs FAIRE-seq
FAIRE-seq was developed in the laboratory of Jason D. Lieb at the University of North Carolina, Chapel Hill. In contrast to DNase-seq, the FAIRE-seq protocol doesn't require the permeabilization of cells or isolation of nuclei, and can analyse any cell type.
DNase-seq and FAIRE-seq:
• produced strong cross-validation, with each cell type having 1-2% of the human genome as open chromatin.
• are not fully overlapping: FAIRE is more sensitive at finding distal regulatory elements that DNase-seq does not detect, but misses promoter regions that DNase-seq does detect.
References
^ Giresi, PG; Kim, J; McDaniell, RM; Iyer, VR; Lieb, JD (2007 Jun). "FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin." Genome Research 17 (6): 877–85. doi:10.1101/gr.5533506. PMC 1891346. PMID 17179217.
^ Song, L; Zhang, Z; Grasfeder, LL; Boyle, AP; Giresi, PG; Lee, BK; Sheffield, NC; Gräf, S; Huss, M; Keefe, D; Liu, Z; London, D; McDaniell, RM; Chibata, Y; Showers, KA; Simon, JM; Vales, T; Wang, T; Winter, D; Zhang, Z; Clarke, ND; Birney, E; Iyer, VR; Crawford, GE; Lieb, JD; Furey, TS (2011-07-12). "Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity." Genome Research 21 (10): 1757–67. doi:10.1101/gr.121541.111. PMC 3202292. PMID 21750106.
^ Simon, Jeremy M; Giresi, Paul G; Davis, Ian J; Lieb, Jason D. "Using formaldehyde-assisted isolation of regulatory elements (FAIRE) to isolate active regulatory DNA". Nature Protocols 7 (2): 256–267. doi:10.1038/nprot.2011.444.
• First, FAIRE requires no treatment of the cells before the addition of formaldehyde.
  • Formaldehyde is applied directly to the growing cells and enters quickly because of its small size. Therefore, the state of chromatin just before the addition of the formaldehyde is likely to be captured.
  • Nuclease sensitivity assays often require that cells be permeabilized, or that nuclei be prepared, both of which allow time for artifacts based on these preparations to occur.
• Second, each time a nuclease-sensitivity assay is performed, the appropriate enzyme concentration and incubation time must be determined, because of lot-to-lot variations in commercial DNase activity and variations in individual nuclei preparations. With FAIRE, a wide range of incubation times (1, 2, 4, and 7 min) at a single formaldehyde concentration (1%) appears to be equally effective.
• Third, in contrast with ChIP, there is no dependence on antibodies. FAIRE can analyze any cells: wild type, mutant, or those that contain transgenes that would make histone ChIPs technically difficult.
• Another important advantage of FAIRE is that it positively selects genomic regions at which nucleosomes are disrupted. These same regions would be degraded in nuclease sensitivity assays and require identification by their absence or by cloning and identification of flanking DNA.
References
^ Giresi, PG; Kim, J; McDaniell, RM; Iyer, VR; Lieb, JD (2007 Jun). "FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin." Genome Research 17 (6): 877–85. doi:10.1101/gr.5533506. PMC 1891346. PMID 17179217.
147 different cell types
http://genome.ucsc.edu/ENCODE/cellTypes.html#TOP Details.
http://www.genome.gov/26524238 Rationale for the selection.
Cell types were selected largely for practical reasons:
• wide availability
• the ability to grow them easily
• capacity to produce sufficient numbers of cells for use in all technologies being used by ENCODE investigators.
Secondary considerations:
• diversity in tissue source of the cells
• germ layer lineage representation
• the availability of existing data generated using the cell type, and coordination with other ongoing projects.
1640 Data Sets
• Tier 1: 3
  • GM12878 (B-lymphocyte), H1-hESC (embryonic stem cells), K562 (leukemia)
• Tier 2: 15
  • A549, CD20+, CD20+_RO01778, CD20+_RO01794, H1-neurons, HeLa-S3, HepG2, HUVEC, IMR90, LHCN-M2, MCF-7, Monocytes-CD14+, Monocytes-CD14+_RO01746, Monocytes-CD14+_RO01826, SK-N-SH
• Tier 3: 338; a few may be useful in our studies:
  • 8988T, pancreas adenocarcinoma;
  • BC_Esophagus_H12817N: esophagus, DNA;
  • Caco-2, colorectal adenocarcinoma;
  • HIPEpiC, iris pigment epithelial cells;
  • HMVEC-dBl-Ad, adult lymphatic microvascular endothelial cells
Note:
• 30 HapMap cell lines, only 1 Chinese sample: Coriell GM18526
• Normal tissues and fetal tissues
• Stem cells
Together: Important features about the organization and function of the human genome
• The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type. 95% of the genome lies within 8 kb of a DNA–protein interaction, and 99% is within 1.7 kb of at least one of the biochemical events measured by ENCODE. (Debate!)
• Primate-specific elements as well as elements without detectable mammalian constraint show, in aggregate, evidence of negative selection. (Functional?)
• An initial set of 399,124 regions with enhancer-like features and 70,292 regions with promoter-like features, as well as hundreds of thousands of quiescent regions.
• RNA sequence production and processing correlate quantitatively with both chromatin marks and transcription factor binding at promoters, indicating that promoter functionality can explain most of the variation in RNA expression.
• Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions; this number is at least as large as those that lie in protein-coding genes.
• SNPs associated with disease by GWAS are enriched within non-coding functional elements, with a majority residing in or near ENCODE-defined regions that are outside of protein-coding genes. In many cases, the disease phenotypes can be associated with a specific cell type or transcription factor. (Right)
Transcribed and protein-coding regions
• How much of the genome do they cover?
  • GENCODE-annotated exons of protein-coding genes cover 2.94% (~3%) of the genome
  • or 1.22% (~1%) for protein-coding exons.
• How much of the genome do the gene models span?
  • Protein-coding genes span 33.45% from the outermost start to stop codons
  • or 39.54% from promoter to poly(A) site.
• Do additional protein-coding genes remain to be found?
  • Analysis of mass spectrometry data from K562 and GM12878 cell lines yielded 57 confidently identified unique peptide sequences in intergenic regions relative to GENCODE annotation.
• Other:
  • 8,801 automatically derived small RNAs and 9,640 manually curated long non-coding RNA (lncRNA) loci.
  • 11,224 pseudogenes, of which 863 were transcribed and associated with active chromatin
RNA
• How much of the genome sequence can be transcribed?
  • 62% of genomic bases are reproducibly represented in sequenced long (>200 nucleotides) RNA molecules or GENCODE exons.
  • Of these bases, only 5.5% are explained by GENCODE exons.
  • Most transcribed bases are within or overlapping annotated gene boundaries (that is, intronic)
  • Only 31% of bases in sequenced transcripts were intergenic
• How many transcription start sites (TSSs) were identified?
  • 62,403 TSSs at high confidence (IDR of 0.01) in tier 1 and 2 cell types.
  • Of these, 27,362 (44%) are within 100 bp of the 5' end of a GENCODE-annotated transcript or previously reported full-length messenger RNA.
  • The remaining regions predominantly lie across exons and 3' UTRs, and some show cell-type-restricted expression; these may be the start sites of novel, cell-type-specific transcripts.
Protein-bound regions
“binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components in 72 cell types using ChIP-seq”
• How sequence-specific is the binding?
  • 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif, and in most (55%) cases the known motif was most enriched
• How to explain protein-binding regions lacking high- or moderate-affinity cognate recognition sites?
  • 82% have high-affinity recognition sequences for other factors
  • the median DNase I accessibility is twofold higher in the bottom 20% of peaks than in the upper 80%
DNase I hypersensitive sites (DHSs) and footprints
• How many DHSs in the whole genome?
  • DNase-seq: 2.89 million unique, non-overlapping DHSs in 125 cell types, most of them distal to TSSs
  • FAIRE-seq: 4.8 million sites across 25 cell types displayed reduced nucleosomal crosslinking, many of which coincide with DHSs.
• How many DHSs in one cell type?
  • In tier 1 and tier 2 cell types, a mean of 205,109 DHSs per cell type
  • ~1.0% of the genomic sequence in each cell type, and 3.9% in aggregate.
• On average, 98.5% of the occupancy sites of transcription factors mapped by ENCODE ChIP-seq lie within accessible chromatin defined by DNase I hotspots.
• Genomic DNase I footprinting on 41 cell types identified 8.4 million distinct DNase I footprints. De novo motif discovery on DNase I footprints recovered ~90% of known transcription factor motifs, together with hundreds of novel evolutionarily conserved motifs, many displaying highly cell-selective occupancy patterns similar to major developmental and tissue-specific regulators.
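The gap between per-cell-type coverage (~1.0%) and aggregate coverage (3.9%) comes from taking the union of sites across cell types. A minimal sketch of that bookkeeping follows, assuming half-open (start, end) intervals on a single toy chromosome; the function names and the toy coordinates are hypothetical, not ENCODE data or ENCODE's pipeline.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by the union of the intervals."""
    merged = merge_intervals(intervals)
    return sum(e - s for s, e in merged) / genome_length


# Toy DHS calls for two hypothetical cell types on a 1,000 bp "genome".
cell_type_a = [(100, 150), (400, 450)]
cell_type_b = [(120, 180), (700, 720)]

aggregate = merge_intervals(cell_type_a + cell_type_b)
print(aggregate)
print(covered_fraction(cell_type_a, 1000))
print(covered_fraction(cell_type_a + cell_type_b, 1000))
```

Sites shared between cell types are counted once in the aggregate, which is why the aggregate fraction grows more slowly than the sum of the per-cell-type fractions.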
Regions of histone modification
“12 histone modifications and variants in 46 cell types, including a complete matrix of eight modifications across tier 1 and tier 2.”
• global patterns of modification are highly variable across cell types, in accordance with changes in transcriptional activity.
DNA methylation
• an average of 1.2 million CpGs in each of 82 cell lines and tissues
• 96% of CpGs exhibited differential methylation in at least one cell type or tissue
• The most variably methylated CpGs are found more often in gene bodies and intergenic regions.
• unexpected correspondence between unmethylated genic CpG islands and binding by P300, a histone acetyltransferase linked to enhancer activity
• CpGs with allele-specific methylation consistent with genomic imprinting; these loci exhibit aberrant methylation in cancer cell lines
• reproducible cytosine methylation outside CpG dinucleotides in adult tissues (ref. 45), providing further support that this non-canonical methylation event may have important roles in human biology
Chromosome-interacting regions
• The average number of distal elements interacting with a TSS was 3.9, and the average number of TSSs interacting with a distal element was 2.5, indicating a complex network of interconnected chromatin.
• Whereas promoter regions of 2,324 genes were involved in ‘single-gene’ enhancer–promoter interactions, those of 19,813 genes were involved in ‘multi-gene’ interaction complexes spanning up to several megabases, including promoter–promoter and enhancer–promoter interactions
• long-range gene–element connectivity across ranges of hundreds of kilobases to several megabases
• 50–60% of long-range interactions occurred in only one of the four cell lines, indicative of a high degree of tissue specificity for gene–element connectivity
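The two averages in the first bullet are the mean degrees on each side of a TSS-to-distal-element interaction graph. A minimal sketch of how such averages could be computed from a list of interaction pairs follows; the function name and the toy pairs are hypothetical, and averaging only over nodes that take part in at least one interaction is an assumption, since the slide does not state the denominator.

```python
from collections import defaultdict


def mean_partners(interactions):
    """Return (mean distal elements per TSS, mean TSSs per distal element).

    `interactions` is an iterable of (tss_id, element_id) pairs; duplicate
    pairs are counted once.
    """
    elements_by_tss = defaultdict(set)
    tss_by_element = defaultdict(set)
    for tss, element in interactions:
        elements_by_tss[tss].add(element)
        tss_by_element[element].add(tss)
    if not elements_by_tss:
        raise ValueError("no interactions given")
    per_tss = sum(len(v) for v in elements_by_tss.values()) / len(elements_by_tss)
    per_element = sum(len(v) for v in tss_by_element.values()) / len(tss_by_element)
    return per_tss, per_element


# Toy interaction list: three TSSs and four distal elements.
toy = [
    ("tssA", "e1"), ("tssA", "e2"), ("tssA", "e3"),
    ("tssB", "e1"), ("tssB", "e4"),
    ("tssC", "e1"),
]
print(mean_partners(toy))
```

Under that assumption both averages share the same numerator (the number of distinct pairs), so they differ only because the number of participating TSSs differs from the number of participating distal elements.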
Summary of ENCODE-identified elements
• 80.4% of the genome is covered by at least one ENCODE-identified element
• Order of region classes:
  • 1. Different RNA types, covering 62% of the genome (although the majority is inside of introns or near genes).
  • 2. Regions highly enriched for histone modifications (56.1%).
• Excluding RNA elements and broad histone elements, 44.2% of the genome is covered
  • open chromatin (15.2%)
  • sites of transcription factor binding (8.1%)
  • 19.4% covered by at least one DHS or transcription factor ChIP-seq peak across all cell lines.
• Using our most conservative assessment, 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%). This, however, is still about 4.5-fold higher than the amount of protein-coding exons, and about twofold higher than the estimated amount of pan-mammalian constraint.
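The conservative 8.5% figure is a union, so it is smaller than 4.6% + 5.7%. As a worked check, inclusion-exclusion gives the share of bases covered by both a motif and a footprint. The helper name below is hypothetical, and the result is only as precise as the one-decimal percentages quoted on the slide.

```python
def overlap_from_union(a_pct, b_pct, union_pct):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B| (in percent)."""
    return round(a_pct + b_pct - union_pct, 1)


motif_pct = 4.6       # bases in a transcription-factor-binding-site motif
footprint_pct = 5.7   # bases in a DHS footprint
union_pct = 8.5       # bases in either

both_pct = overlap_from_union(motif_pct, footprint_pct, union_pct)
print(both_pct)  # 1.8
```

So, on these rounded numbers, roughly 1.8% of bases fall in both classes.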
Promoter-anchored integration
• two relatively distinct types of promoter:
  • (1) broad, mainly (C+G)-rich, TATA-less promoters;
  • (2) narrow, TATA-box-containing promoters.
• a limited set of chromatin marks is sufficient to ‘explain’ transcription, and a variety of transcription factors might have broad roles in general transcription levels across many genes
• there is enough information present at the promoter regions of genes to explain most of the variation in RNA expression.
Transcription-factor-binding site-anchored integration
Figure 3 | Patterns and asymmetry of chromatin modification at transcription-factor-binding sites. a, Results of clustered aggregation of H3K27me3 modification signal around CTCF-binding sites (a multifunctional protein involved with chromatin structure). The first three plots (left column) show the signal behaviour of the histone modification over all sites (top) and then split into the high and low signal components. The solid lines show the mean signal distribution by relative position with the blue shaded area delimiting the tenth and ninetieth percentile range. The high signal component is then decomposed further into six different shape classes on the right (see ref. 30 for details). The shape decomposition process is strand aware.
Transcription-factor-binding site-anchored integration
Figure 3 | Patterns and asymmetry of chromatin modification at transcription-factor-binding sites. b, Summary of shape asymmetry for DNase I, nucleosome and histone modification signals by plotting an asymmetry ratio for each signal over all transcription-factor-binding sites. All histone modifications measured in this study show predominantly asymmetric patterns at transcription-factor-binding sites. An interactive version of this figure is available in the online version of the paper.
Transcription factor co-associations
Figure 4 | Co-association between transcription factors. a, Significant co-associations of transcription factor pairs using the GSC statistic across the entire genome in K562 cells. The colour strength represents the extent of association (from red (strongest), orange, to yellow (weakest)), whereas the depth of colour represents the fit to the GSC model (ref. 20), where white indicates that the statistical model is not appropriate, as indicated by the key. Most transcription factors have a nonrandom association to other transcription factors, and these associations are dependent on the genomic context, meaning that once the genome is separated into promoter proximal and distal regions, the overall levels of co-association decrease, but more specific relationships are uncovered.
Transcription factor co-associations
b, Three classes of behaviour are shown. The first column shows a set of associations for which strength is independent of location in promoter and distal regions, whereas the second column shows a set of transcription factors that have stronger associations in promoter-proximal regions. Both of these examples are from data in K562 cells and are highlighted on the genome-wide co-association matrix (a) by the labelled boxes A and B, respectively. The third column shows a set of transcription factors that show stronger association in distal regions (in the H1 hESC line).
Genome-wide integration
Figure 5 | Integration of ENCODE data by genome-wide segmentation. a, Illustrative region with the two segmentation methods (ChromHMM and Segway) in a dense view and the combined segmentation expanded to show each state in GM12878 cells, beneath a compressed view of the GENCODE gene annotations. Note that at this level of zoom and genome browser resolution, some segments appear to overlap although they do not. Segmentation classes are named and coloured according to the scheme in Table 3. Beneath the segmentations are shown each of the normalized signals that were used as the input data for the segmentations. Open chromatin signals from DNase-seq from the University of Washington group (UW DNase) or the ENCODE open chromatin group (Openchrom DNase) and FAIRE assays are shown in blue; signal from histone modification ChIP-seq in red; and transcription factor ChIP-seq signal for Pol II and CTCF in green. The mauve ChIP-seq control signal (input control) at the bottom was also included as an input to the segmentation.
Genome-wide integration
Transcription Start Site (TSS), Promoter Flanking (PF), Enhancer (E), Weak Enhancer (WE), CTCF binding (CTCF), Transcribed Region (T) and Repressed or Inactive Region (R)
Figure 5 | Integration of ENCODE data by genome-wide segmentation. b, Association of selected transcription factor (left) and RNA (right) elements in the combined segmentation states (x axis) expressed as an observed/expected ratio (obs./exp.) for each combination of transcription factor or RNA element and segmentation class using the heat-map scale shown in the key beside each heat map.
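An observed/expected ratio of the kind plotted in panel b can be sketched as follows. This assumes the expectation is "elements land in a state in proportion to the bases that state covers"; the slide does not give the paper's exact normalization, so the function name, its parameters, and the toy numbers are all hypothetical.

```python
def obs_exp_ratio(elements_in_state, total_elements, state_bases, genome_bases):
    """Observed count in a state divided by the count expected by chance.

    Expected count assumes elements are placed uniformly over the genome,
    so a state covering x% of bases should hold x% of the elements.
    """
    expected = total_elements * state_bases / genome_bases
    return elements_in_state / expected


# Toy numbers: 1,000 binding sites on a 1,000-unit genome.
# Enriched: 300 sites fall in a state covering 1% of the genome.
print(obs_exp_ratio(300, 1000, 10, 1000))
# Depleted: 50 sites fall in a state covering 50% of the genome.
print(obs_exp_ratio(50, 1000, 500, 1000))
```

Ratios above 1 indicate enrichment of the element in that segmentation state, and ratios below 1 indicate depletion.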
Genome-wide integration
Transcription Start Site (TSS), Promoter Flanking (PF), Enhancer (E), Weak Enhancer (WE), CTCF binding (CTCF), Transcribed Region (T) and Repressed or Inactive Region (R)
Figure 5 | Integration of ENCODE data by genome-wide segmentation. c, Variability of states between cell lines, showing the distribution of occurrences of the state in the six cell lines at specific genome locations: from unique to one cell line to ubiquitous in all six cell lines for five states (CTCF, E, T, TSS and R).
Genome-wide integration
Transcription Start Site (TSS), Promoter Flanking (PF), Enhancer (E), Weak Enhancer (WE), CTCF binding (CTCF), Transcribed Region (T) and Repressed or Inactive Region (R)
Figure 5 | Integration of ENCODE data by genome-wide segmentation. d, Distribution of methylation level at individual sites from RRBS analysis in GM12878 cells across the different states, showing the expected hypomethylation at TSSs and hypermethylation of gene bodies (T state) and repressed (R) regions.
Genome-wide integration
Figure 7 | High-resolution segmentation of ENCODE data by self-organizing maps (SOM). a–c, The training of the SOM (a) and analysis of the results (b, c) are shown. Initially we arbitrarily placed genomic segments from the ChromHMM segmentation on to the toroidal map surface, although the SOM does not use the ChromHMM state assignments (a). We then trained the map using the signal of the 12 different ChIP-seq and DNase-seq assays in the six cell types analysed. Each unit of the SOM is represented here by a hexagonal cell in a planar two-dimensional view of the toroidal map. Curved arrows indicate that traversing the edges of the two-dimensional view leads back to the opposite edge. The resulting map can be overlaid with any class of ENCODE or other data to view the distribution of that data within this high-resolution segmentation.
Genome-wide integration
c, The association of Gene Ontology (GO) terms on the same representation of the same trained SOM. We assigned genes that are within 20 kb of a genomic segment in a SOM unit to that unit, and then associated this set of genes with GO terms using a hypergeometric distribution after correcting for multiple testing. Map units that are significantly associated to GO terms are coloured green, with increasing strength of colour reflecting increasing numbers of genes significantly associated with the GO terms for either immune response (left) or sequence-specific transcription factor activity (centre). In each case, specific SOM units show association with these terms. The right-hand panel shows the distribution on the same SOM of all significantly associated GO terms, now colouring by GO term count per SOM unit. For sequence-specific transcription factor activity, two example genomic regions are extracted at the bottom of panel c from neighbouring SOM units. These are regions around the DBX1 (from SOM unit 26,31, left panel) and IRX6 (SOM unit 27,30, right panel) genes, respectively, along with their H3K27me3 ChIP-seq signal for each of the tier 1 and 2 cell types. For DBX1, representative of a set of primarily neuronal transcription factors associated with unit 26,31, there is a repressive H3K27me3 signal in both H1 hESCs and HUVECs; for IRX6, representative of a set of body patterning transcription factors associated with SOM unit 27,30, the repressive mark is restricted largely to the embryonic stem (ES) cell. An interactive version of this figure is available in the online version of the paper.
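The GO-term association described in the caption is a hypergeometric enrichment test followed by a multiple-testing correction. A minimal standard-library sketch is shown below. The caption does not say which correction was used, so Bonferroni here is an assumption chosen for simplicity, and the function names and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k): probability of seeing at least k annotated genes.

    M = genes in the background, n = background genes carrying the GO term,
    N = genes assigned to the SOM unit, k = unit genes carrying the term.
    """
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


def bonferroni(p_values):
    """Bonferroni correction (an assumed choice, not the paper's stated one)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy example: 10 background genes, 4 carry the term, a unit holds 3 genes,
# 2 of which carry the term.
p = hypergeom_sf(2, 10, 4, 3)
print(p)
print(bonferroni([p, 0.5]))
```

In a real analysis one P value would be computed per (SOM unit, GO term) pair, and the correction applied across all of them.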
Insights into human genomic variation
Preferential binding towards each parental allele
Figure 8 | Allele-specific ENCODE elements. a, Representative allele-specific information from GM12878 cells for selected assays around the first exon of the NACC2 gene (genomic region Chr9: 138950000–138995000, GRCh37). Transcription signal is shown in green, and the three sections show allele-specific data for three data sets (POLR2A, H3K79me2 and H3K27me3 ChIP-seq). In each case the purple signal is the processed signal for all sequence reads for the assay, whereas the blue and red signals show sequence reads specifically assigned to either the paternal or maternal copies of the genome, respectively. The set of common SNPs from dbSNP, including the phased, heterozygous SNPs used to provide the assignment, are shown at the bottom of the panel. NACC2 has a statistically significant paternal bias for POLR2A and the transcription-associated mark H3K79me2, and has a significant maternal bias for the repressive mark H3K27me3.
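One common way to decide whether reads at heterozygous SNPs favour one parental allele is a binomial test against a 50:50 expectation. The sketch below illustrates that idea only; the caption does not say which test ENCODE used, so the choice of test, the function name, and the read counts are assumptions.

```python
from math import comb


def allelic_bias_p(paternal_reads, maternal_reads):
    """Two-sided binomial P value for a departure from a 50:50 allele ratio."""
    n = paternal_reads + maternal_reads
    k = max(paternal_reads, maternal_reads)
    # Upper tail for the more abundant allele under p = 0.5, doubled because
    # the null distribution is symmetric; capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(allelic_bias_p(10, 0))  # strong paternal bias in this toy example
print(allelic_bias_p(5, 5))   # balanced counts, no evidence of bias
```

With many loci tested genome-wide, such P values would also need a multiple-testing correction before calling a bias significant.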
Insights into human genomic variation
The correlation of selected allele-specific signals across the whole genome. For instance, we found a strong allelic correlation between POLR2A and BCLAF1 binding, as well as negative correlation between H3K79me2 and H3K27me3, both at genes (Fig. 8b, below the diagonal, bottom left) and chromosomal segments (top right). Overall, we found that positive allelic correlations among the 193 ENCODE assays are stronger and more frequent than negative correlations.
Figure 8 | Allele-specific ENCODE elements. b, Pair-wise correlations of allele-specific signal within single genes (below the diagonal) or within individual ChromHMM segments across the whole genome for selected DNase-seq and histone modification and transcription factor ChIP-seq assays. The extent of correlation is coloured according to the heat-map scale indicated from positive correlation (red) through to anti-correlation (blue). An interactive version of this figure is available in the online version of the paper.
Rare variants, individual genomes and somatic variants
A: Variant annotation
B: 1% of transcription-factor-binding sites in GM12878 cells are detected in a haplotype-specific fashion (Fig. 9b, a CTCF-binding site).
C: Overall, somatic variation is relatively depleted from ENCODE-annotated regions, particularly for elements specific to a cell type matching the putative tumour source.
Common variants associated with disease
Figure 10 | Comparison of genome-wide-association-study-identified loci with ENCODE data. a, Overlap of lead SNPs in the NHGRI GWAS SNP catalogue (June 2011) with DHSs (left) or transcription-factor-binding sites (right) as red bars compared with various control SNP sets in blue. The control SNP sets are (from left to right): SNPs on the Illumina 2.5M chip as an example of a widely used GWAS SNP typing panel; SNPs from the 1000 Genomes project; SNPs extracted from 24 personal genomes (see personal genome variants track at http://main.genome-browser.bx.psu.edu (ref. 80)), all shown as blue bars. In addition, a further control used 1,000 randomizations from the genotyping SNP panel, matching the SNPs with each NHGRI catalogue SNP for allele frequency and distance to the nearest TSS (light blue bars with bounds at 1.5 times the interquartile range). For both DHSs and transcription-factor-binding regions, a larger proportion of overlaps with GWAS-implicated SNPs is found compared to any of the control sets. b, Aggregate overlap of phenotypes to selected transcription-factor-binding sites (left matrix) or DHSs in selected cell lines (right matrix), with a count of overlaps between the phenotype and the cell line/factor. Values in blue squares pass an empirical P-value threshold ≤ 0.01 (based on the same analysis of overlaps between randomly chosen, GWAS-matched SNPs and these epigenetic features) and have at least a count of three overlaps. The P value for the total number of phenotype–transcription factor associations is < 0.001.
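The empirical P values in this caption come from comparing the observed overlap with overlaps of randomly drawn control SNPs. A simplified sketch of that procedure follows. It draws controls uniformly from a pool, whereas the paper matched controls for allele frequency and distance to the nearest TSS; that matching step is omitted here, and all names and coordinates are hypothetical.

```python
import random


def overlaps(position, intervals):
    """True if a SNP position falls inside any half-open (start, end) interval."""
    return any(start <= position < end for start, end in intervals)


def empirical_p(observed_snps, control_pool, intervals, n_draws=1000, seed=0):
    """Return (observed overlap count, empirical P value).

    Each randomization draws as many control SNPs as there are observed SNPs
    and counts how often the control overlap is at least the observed one.
    The +1 terms keep the estimate away from an impossible P value of zero.
    """
    rng = random.Random(seed)
    observed = sum(overlaps(p, intervals) for p in observed_snps)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(control_pool, len(observed_snps))
        if sum(overlaps(p, intervals) for p in sample) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_draws + 1)


# Toy data: one regulatory interval, three lead SNPs inside it, and a control
# pool that never touches the interval.
dhs = [(0, 10)]
lead_snps = [1, 2, 3]
controls = list(range(100, 200))
print(empirical_p(lead_snps, controls, dhs))
```

With 1,000 randomizations the smallest reportable P value under this estimator is 1/1001, in line with the "< 0.001" style of statement in the caption.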
Common variants associated with disease
c, Several SNPs associated with Crohn’s disease and other inflammatory diseases that reside in a large gene desert on chromosome 5, along with some epigenetic features indicative of function. The SNP (rs11742570) strongly associated to Crohn’s disease overlaps a GATA2 transcription-factor-binding signal determined in HUVECs. This region is also DNase I hypersensitive in HUVECs and T-helper TH1 and TH2 cells. An interactive version of this figure is available in the online version of the paper.
Limitations of ENCODE Annotations
• Cell types: physiologically and genetically inhomogeneous.
• Local microenvironments in culture may also vary
• Use of DNA sequencing to annotate functional genomic features is also constrained.
• Considerable quantitative variation in the signal strength along the genome
(The ENCODE Project Consortium, 2011)
Challenges
• Adult human body contains several hundred distinct cell types
• Each of which expresses a unique subset of the 1,800 TFs encoded in the human genome
• Brain alone contains thousands of types of neurons that are likely to express not only different sets of TFs but also a larger variety of non-coding RNAs
• A truly comprehensive atlas of human functional elements is not practical with current technologies
(The ENCODE Project Consortium, 2011)
Outcome
• Understanding of the human genome
• The broad coverage of ENCODE annotations enhances our understanding of common diseases with a genetic component and of rare genetic diseases
• 119 of 1,800 known transcription factors and 13 of more than 60 currently known histone or DNA modifications across 147 cell types
• Overall these data reflect a minor fraction of the potential functional information encoded in the human genome
(The ENCODE Project Consortium, 2012)
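As a quick worked check of "minor fraction", the counts in the third bullet can be turned into percentages. Because the slide says "more than 60" modifications, the second figure is an upper bound; the variable names are illustrative only.

```python
# Counts taken from the slide.
tfs_assayed, tfs_known = 119, 1800
mods_assayed, mods_known_at_least = 13, 60

tf_share = tfs_assayed / tfs_known
mod_share_upper = mods_assayed / mods_known_at_least  # upper bound

print(f"Transcription factors assayed: {tf_share:.1%}")        # 6.6%
print(f"Modifications assayed (at most): {mod_share_upper:.1%}")  # 21.7%
```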
Future goal
• Mechanistic processes that generate these elements and how and where they function
• Enlarge the data set to additional factors, modifications and cell types, complementing the other related projects
• Constitute foundational resources for human genomics, allowing a deeper interpretation of the organization of gene and regulatory information and the mechanisms of regulation, and thereby provide important insights into human health and disease
(The ENCODE Project Consortium, 2012)
13 Threads
1. Transcription factor motifs
2. Chromatin patterns at transcription factor binding sites
3. Characterization of intergenic regions and gene definition
4. RNA and chromatin modification patterns around promoters
5. Epigenetic regulation of RNA processing
6. Non-coding RNA characterization
7. DNA methylation
8. Enhancer discovery and characterization
9. Three-dimensional connections across the genome
10. Characterization of network topology
11. Machine learning approaches to genomics
12. Impact of functional information on understanding variation
13. Impact of evolutionary selection on functional regions
Spanking #ENCODE
“We note that ENCODE used almost exclusively pluripotent stem cells and cancer cells, which are known as transcriptionally permissive environments.”
Another criticism in the paper is the sensitivity vs specificity choice for reporting on the data.
“Unfortunately, the ENCODE data are neither easily accessible nor very useful: without ENCODE, researchers would have had to examine 3.5 billion nucleotides in search of function, with ENCODE, they would have to sift through 2.7 billion nucleotides.” (“Big Science,” “small science,” and “ENCODE”)
Questions: How much of the genome is functional?
• “80 percent is the figure only if your definition is so loose as to be all but meaningless.”
• “‘Functional’ simply means a little bit of DNA that’s been identified in an assay of some sort or another. That’s a remarkably silly definition of function and if you’re using it to discount junk DNA it’s downright disingenuous.”
• “The upshot is that you’d expect many of the elements that ENCODE identified if you just wrote out a random string of As, Gs, Cs, and Ts.”
• “does an onion have around five times as much non-coding DNA as we do? Or why pufferfishes can get by with just a tenth as much?”
• Junk vs garbage