Topological associated domains- Hi-C

Topological Associated Domains identification using Hi-C
Speaker : Djekidel Mohamed Nadhir
Date : 03/03/2014

Background
• Although the sequence of the genome has been determined, little is known about its 3D structure
• High-throughput chromosome conformation capture (Hi-C) is a 3C-based technology
• It can detect chromatin interactions between loci across the entire genome
Biological experiment:
Ming, H., et al. (2013). "Understanding spatial organizations of chromosomes via statistical analysis of Hi-C data." Quantitative Biology 1.

Background
• Hi-C in the chromatin conformation study map
Smallwood, A. and B. Ren (2013). "Genome organization and long-range regulation of gene expression by enhancers." Current Opinion in Cell Biology 25(3): 387-394.

Background- Processing pipeline
• 4 main steps:
• Read mapping : Each side (50 bp) is mapped independently to the reference genome
• Read level filtering
• Fragment filtering : Filter fragments with low mappability score
• Creation of the Hi-C contact matrix
Ming, H., et al. (2013). "Understanding spatial organizations of chromosomes via statistical analysis of Hi-C data." Quantitative Biology 1.

• Read filtering step: the following types of reads should be removed:
• Self-ligation reads
• Dangling reads: un-ligated reads
• PCR amplification reads: many reads that map to the same location
• Random breaking reads: reads located far from the enzyme cut site (d1 + d2 > 500 bp)
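The read-level filters above can be sketched as follows. This is only an illustration: the `ReadPair` record and its field names are hypothetical, `d1` and `d2` are assumed to be the distances from each read end to its nearest enzyme cut site, and self-ligation and dangling reads are both approximated as pairs whose two ends fall in the same restriction fragment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    # Hypothetical record layout, not taken from the slides.
    chrom1: str
    pos1: int
    frag1: int  # id of the restriction fragment containing end 1
    d1: int     # distance (bp) from end 1 to its nearest cut site
    chrom2: str
    pos2: int
    frag2: int  # id of the restriction fragment containing end 2
    d2: int     # distance (bp) from end 2 to its nearest cut site


def filter_read_pairs(pairs, max_dist_sum=500):
    """Keep only read pairs that pass the three read-level filters."""
    seen = set()
    kept = []
    for p in pairs:
        # Self-ligation / dangling reads: both ends in the same fragment
        # (a simplifying assumption for this sketch).
        if p.frag1 == p.frag2:
            continue
        # Random breaking reads: too far from the enzyme cut sites.
        if p.d1 + p.d2 > max_dist_sum:
            continue
        # PCR amplification reads: keep one pair per mapping location.
        key = (p.chrom1, p.pos1, p.chrom2, p.pos2)
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept
```

The 500 bp threshold is the one quoted above; real pipelines usually also take strand orientation into account, which is left out here.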

• Fragment filtering step: remove fragments with a low mappability score (< 0.5)
• Fragments near centromere or telomere regions tend to contain a large proportion of repetitive sequence, which leads to a low mappability score
• Additional suggestions:
• Remove fragments shorter than 100 bp or longer than 100 kb
• Remove the 0.5% of fragments with the highest number of reads (these can be a source of PCR artifacts)
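A minimal sketch of the fragment-level filters, assuming each fragment is a dict with the hypothetical keys `length`, `mappability` and `n_reads`. Whether the top 0.5% is taken before or after the other filters is not stated in the slides; here it is applied to the fragments that survive the first two filters.

```python
def filter_fragments(fragments, min_mappability=0.5, min_len=100,
                     max_len=100_000, top_fraction=0.005):
    """Apply the mappability, length and read-count filters to fragments."""
    # Mappability and length filters.
    kept = [
        f for f in fragments
        if f["mappability"] >= min_mappability
        and min_len <= f["length"] <= max_len
    ]
    # Drop the fraction of fragments with the highest read counts
    # (assumption: computed on the fragments kept so far).
    n_drop = int(len(kept) * top_fraction)
    if n_drop > 0:
        by_reads = sorted(kept, key=lambda f: f["n_reads"], reverse=True)
        drop_ids = {id(f) for f in by_reads[:n_drop]}
        kept = [f for f in kept if id(f) not in drop_ids]
    return kept
```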

Background
• Construction of the Hi-C interaction matrix:
• The number of possible pairs of enzyme cut sites is about 10^12, whereas a typical Hi-C experiment generates only about 10^8 reads
• Thus, we need to partition the genome into large-scale bins.
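The binning step can be illustrated with a small sketch for a single chromosome. The function name and the input layout (a list of `(pos1, pos2)` pairs with 0-based positions) are assumptions made for this example.

```python
def build_contact_matrix(pairs, chrom_length, bin_size):
    """Count read pairs per pair of fixed-size bins (symmetric matrix)."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            # Mirror the count so the matrix stays symmetric.
            matrix[j][i] += 1
    return matrix
```

Larger bins give more reads per cell and therefore less noisy counts, at the cost of resolution.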
Processing pipeline:
Hi-C vs FISH

Discussed paper
• Aim :
• Investigate the 3D organization of the human and mouse genomes in embryonic stem (ES) cells and differentiated cells.
• Data :
• Mouse :
• Mouse embryonic stem cell (mESC)
• Cortex cell (generated by another group)
• Human :
• Human embryonic stem cell (hESC)
• IMR90

Data control (1)
• Remove cut site bias
[Figure panels: raw data, normalized data]

Data control (2)
• Comparison with 5C data for the HoxA locus (correlation > 0.73)
• Comparison with 3C data for the Phc1 locus
• Comparison with FISH data for 6 loci

Data control (3)
Pearson Correlation between replicates

Visualization of interactions
We can notice a Topological Associated Domain (TAD) structure at bin sizes < 100 kb

Identification of topological domains
Step 1: Detection of the interaction bias
We notice that within a TAD:
• The upstream portion is highly biased to interact downstream
• The downstream portion is highly biased to interact upstream
A directionality index (DI) was defined to quantify this bias:
• DI > 0: downstream bias
• DI < 0: upstream bias
• |DI|: the strength of the bias

DI calculation
Steps:
• The genome was split into bins of length 40 kb
• Let :
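• A: number of reads that map in the 2 Mb upstream of the bin
• B: number of reads that map in the 2 Mb downstream of the bin
• E: expected number of reads, $E = \frac{A + B}{2}$
• Then:
• $DI = \left( \frac{B - A}{|B - A|} \right) \left( \frac{(A - E)^2}{E} + \frac{(B - E)^2}{E} \right)$
[Figure: a 40 kb bin flanked by window A (2 Mb upstream, -2 Mb) and window B (2 Mb downstream, +2 Mb)]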
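The DI calculation can be sketched directly from the definitions of A, B and E. The function below assumes a square contact matrix (list of lists) for one chromosome. Two choices are assumptions of this sketch rather than part of the slides: windows are truncated at the chromosome ends, and DI is set to 0 when A = B, where the formula is undefined.

```python
def directionality_index(matrix, bin_size=40_000, window=2_000_000):
    """Return the DI of every bin of a single-chromosome contact matrix."""
    n = len(matrix)
    w = window // bin_size  # number of bins in a 2 Mb window
    di = []
    for i in range(n):
        a = sum(matrix[i][max(0, i - w):i])   # A: reads to the upstream window
        b = sum(matrix[i][i + 1:i + 1 + w])   # B: reads to the downstream window
        if a == b:
            # Assumption: no bias when A == B (this also covers A == B == 0).
            di.append(0.0)
            continue
        e = (a + b) / 2                       # E: expected number of reads
        sign = 1.0 if b > a else -1.0         # (B - A) / |B - A|
        di.append(sign * ((a - e) ** 2 / e + (b - e) ** 2 / e))
    return di
```

With this sign convention a bin that interacts mostly downstream gets a positive DI, and a bin that interacts mostly upstream gets a negative DI.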

Domain detection (1)
• Each bin can have 3 states :
• Upstream biased
• Downstream biased
• No bias
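• Use an HMM based on the DI to infer the bias state
• We define:
• $Y = [Y_1, Y_2, \dots, Y_n]$: the observed DI
• $Q = [Q_1, Q_2, \dots, Q_n]$: the hidden bias state, $Q_i \in \{D, U, N\}$
• $M = [M_1, M_2, \dots, M_m]$: the mixture components, $m \in [1, 20]$
• The probabilities are calculated as follows:
• $P(Y_t \mid Q_t = i, M_t = m) = \mathcal{N}(Y_t; \mu_{im}, \Sigma_{im})$
• $P(M_t = m \mid Q_t = i) = C(i, m)$
• $C(i, m)$: the mixture weight
[Figure: example state sequence D D D D U U U N N N D D D U U, read as domain, boundary, domain]
[Figure: HMM graphical model linking the hidden states Q_t and Q_{t+1} (values D, U, N), the mixture components M_t and M_{t+1} (M_1, M_2, M_3) and the observations y_t and y_{t+1}]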
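To make the model above concrete, here is a small sketch of the mixture emission probability and of a state decoding step. Viterbi decoding is used here only for illustration, since the slides do not say how the states are inferred, and every parameter value (`EXAMPLE_START`, `EXAMPLE_TRANS`, `EXAMPLE_MIXTURES`) is made up rather than fitted.

```python
import math

STATES = ("D", "U", "N")  # downstream bias, upstream bias, no bias

# Hypothetical parameters for illustration only (not fitted values).
EXAMPLE_START = {s: 1 / 3 for s in STATES}
EXAMPLE_TRANS = {r: {s: 0.8 if r == s else 0.1 for s in STATES} for r in STATES}
# Per state: list of (weight C(i, m), mean mu_im, std dev sigma_im).
EXAMPLE_MIXTURES = {
    "D": [(1.0, 50.0, 20.0)],
    "U": [(1.0, -50.0, 20.0)],
    "N": [(1.0, 0.0, 5.0)],
}


def gaussian_pdf(y, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def emission(y, state, mixtures):
    """P(Y_t = y | Q_t = state) = sum over m of C(i, m) * N(y; mu_im, sigma_im)."""
    return sum(w * gaussian_pdf(y, mu, sigma) for w, mu, sigma in mixtures[state])


def viterbi(di, start, trans, mixtures):
    """Most likely state sequence for a list of DI values (log space)."""
    tiny = 1e-300  # avoids log(0) for very unlikely observations
    score = {s: math.log(start[s]) + math.log(emission(di[0], s, mixtures) + tiny)
             for s in STATES}
    back = []
    for y in di[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr[s] = best
            score[s] = (prev[best] + math.log(trans[best][s])
                        + math.log(emission(y, s, mixtures) + tiny))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

A call such as `viterbi(di_values, EXAMPLE_START, EXAMPLE_TRANS, EXAMPLE_MIXTURES)` returns one state label per bin. In practice the parameters would be estimated from the data, for example with EM.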

Domain detection (2)
• The region between two TADs is termed:
• Topological boundary: if size < 400 kb
• Unrecognized chromatin: if size ≥ 400 kb
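The size rule above is easy to express in code. The sketch assumes the domains have already been called and are given as a sorted list of non-overlapping `(start, end)` positions in bp; the function name is hypothetical.

```python
def classify_gaps(domains, max_boundary=400_000):
    """Label each region between consecutive domains by its size."""
    gaps = []
    for (_, end_prev), (start_next, _) in zip(domains, domains[1:]):
        size = start_next - end_prev
        if size < max_boundary:
            label = "topological boundary"
        else:
            label = "unrecognized chromatin"
        gaps.append((end_prev, start_next, label))
    return gaps
```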

What separates two TADs
• Studied the HoxA locus, which is known to be separated into two compartments
• Found that the CS5 insulator resides in the boundary
• Are insulators enriched at the boundaries?

CTCF role in the boundary
• Studied another known insulator protein, CTCF

Heterochromatin and boundary
• The H3K9me3 profile changes between hESC and IMR90 cells, but the boundary structure does not
• Potential link between the topological domains and transcriptional control in the mammalian genome

Characteristics of TADs
• TADs are stable between cell lines
[Figure panels: hESC, IMR90]

Characteristics of TADs
• TADs are conserved between species

Cell type specific interactions
• A binomial test is performed for each 20 kb bin to determine whether it is cell-type specific
• Calculate $n = I_{mESC} + I_{cortex}$, the number of interactions at a distance $d$
• Calculate the expected proportion $p = \frac{I_{mESC}}{n}$ or $p = \frac{I_{cortex}}{n}$
• Then, for each bin, perform a binomial test to see whether the number of cell-type-specific interactions deviates from this expectation
[Figure: example interactions at distance d in mESC and cortex]
Example: $n = 3 + 2 + 1 + 1 + 2 + 1 + 4 + 1 = 15$, so $p = \frac{7}{15}$ or $p = \frac{8}{15}$
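A sketch of the per-bin test, using an exact two-sided binomial test written with the standard library only. The way the test is applied (k = number of interactions of one cell type in the bin, out of the bin's total in both cell types, compared with the expected proportion p) is my reading of the slide, not a confirmed description of the paper's procedure.

```python
from math import comb


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under Binomial(n, p).
    """
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Expected proportion of mESC interactions, using the example counts above.
p_mesc = 7 / 15
# Hypothetical bin: 9 of its 10 interactions were observed in mESC.
p_value = binom_two_sided_pvalue(9, 10, p_mesc)
```

A small `p_value` suggests that the bin carries more (or fewer) mESC interactions than expected. Because many bins are tested, a multiple-testing correction would normally follow.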

Cell type specific interactions
• 20% of the genes with a fold change (FC) ≥ 4 are found in dynamically interacting loci
• > 96% of the dynamic interactions occur within the same domain
• Model:
• Domain organization is stable between cell types
• but the regions within each domain may be dynamic

Factors forming the boundary (1)
• Boundaries are enriched for active promoter signals and gene bodies

Factors forming the boundary (2)

TAD vs A/B compartments (1)
At a higher order, the chromatin is organized into A and B compartments
• Loci found clustered in A compartments are generally:
• gene rich,
• transcriptionally active,
• and DNase I hypersensitive
• Loci found clustered in B compartments are generally:
• gene poor,
• transcriptionally silent,
• and DNase I insensitive
[Figure: compartment A and compartment B]
Lieberman-Aiden, E., et al. (2009). Science 326(5950): 289-293.

TADs are smaller than A/B compartments

In summary :
Gibcus, J. and J. Dekker (2013). "The hierarchy of the 3D genome." Molecular cell 49(5): 773-782.

TAD vs Lamina associated domains (LAD) (1)

TAD vs Lamina associated domains (LAD) (2)
Nora, E., et al. (2013). BioEssays : news and reviews in molecular, cellular and developmental biology 35(9): 818-828.

TAD vs LOCKs
• LOCK: Large Organized Chromatin K9-modifications
• Conserved regions exhibiting large H3K9me2 differences between cell lines

Summary
• The mammalian genome is segmented into megabase-scale domains
• Domain boundaries are stable between cell lines and species, suggesting that they are a basic property of the chromosome architecture.
• Domain boundaries:
• Are enriched for transcriptionally active genes
• Coincide with heterochromatin boundaries
• Are enriched for insulator proteins
• Are enriched for tRNA, SINE and housekeeping genes
• The study also developed several data-analysis approaches
