# Reproducibility and differential analysis with selfish

Chrocotalk, January 31st 2020
INRAE Toulouse
with Matthias Zytnicki

Published in: Science

### Reproducibility and differential analysis with selfish

1. selfish: A. R. Ardakany, F. Ay, S. Lonardi. Bioinformatics, 2019 (ISMB/ECCB)
2. Principle (Yardımcı et al. 2019): review
3. Method used
   - Compute the sum of interactions inside each block.
   - Store the results in $B_i$, with $i \in [1, 2k]$.
   - Compute $C(s, t) = \mathbb{1}(B_s, B_t)$.
   - The similarity between two matrices $A$ and $B$ is: $S(A, B) = e^{-c \|C_A - C_B\|_2} = e^{-c \sqrt{\sum_{(s,t) \in [1, 2k]^2} [C_A(s,t) - C_B(s,t)]^2}}$
   - In practice: $k = 100$ and $c = 5$.
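The block-comparison similarity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block sums $B_i$ are taken as already computed (how blocks are cut out of the contact map is not shown), the indicator is assumed to be $B_s > B_t$, and the function names `comparison_matrix` and `self_similarity` are hypothetical.

```python
import math


def comparison_matrix(block_sums):
    """Binary matrix C with C[s][t] = 1 if B_s > B_t, else 0.

    The strict '>' is an assumption; the comparison used inside the
    indicator is not spelled out in the slides.
    """
    return [[1 if bs > bt else 0 for bt in block_sums] for bs in block_sums]


def self_similarity(block_sums_a, block_sums_b, c=5.0):
    """S(A, B) = exp(-c * ||C_A - C_B||_2), summing over all (s, t)."""
    ca = comparison_matrix(block_sums_a)
    cb = comparison_matrix(block_sums_b)
    squared = sum(
        (x - y) ** 2
        for row_a, row_b in zip(ca, cb)
        for x, y in zip(row_a, row_b)
    )
    return math.exp(-c * math.sqrt(squared))
```

Because only the ordering of the block sums matters, two maps whose blocks rank identically get a score of 1 whatever their sequencing depth.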
4. QuASAR (Sauria and Taylor 2017)
   - Discard regions with low counts: $R$.
   - Take the square root of the counts: $S$.
   - Normalize diagonal-wise.
   - Compute "local" counts: $n^l_{ij} = \{n_{ik} : j - 100 \le k < i + 100\}$.
   - Compute a correlation matrix $C$: $c_{ij} = \mathrm{corr}(n^l_{ij}, n^l_{ji})$.
   - Compute the transformed matrix $T$ as the element-wise product of $S$ and $C$.
   - The replicate score between samples $A$ and $B$ is the correlation of $T_A$ and $T_B$.
   - Implementation: Python (in HiFive)
5. HiC-Spector (Yan et al. 2017)
   - Raw counts: $W$.
   - Compute the Laplacian $L = D - W$ ($D$ is a diagonal matrix, $D_{ii}$ is the coverage of bin $i$).
   - Take the normalized form: $\ell = D^{-1/2} L D^{-1/2}$.
   - The set of eigenvalues is the spectrum of $\ell$.
   - Normalize the eigenvectors $v_i$ of $\ell$.
   - $r$ is the number of "leading" eigenvectors (here, $r \le 20$).
   - The distance between samples $A$ and $B$ is $\sum_{i=0}^{r-1} \|v_i^A - v_i^B\|$.
   - Implementation: Julia
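The first two steps (Laplacian and its normalized form) are easy to write out. The sketch below is not the HiC-Spector code (which is in Julia); it assumes that the coverage of a bin is its row sum, leaves bins with zero coverage at zero, and stops before the eigendecomposition, which would normally be delegated to a linear algebra library. The name `normalized_laplacian` is hypothetical.

```python
import math


def normalized_laplacian(w):
    """Return D^{-1/2} (D - W) D^{-1/2} for a square contact matrix W."""
    n = len(w)
    # Coverage of each bin, assumed here to be the row sum of W.
    coverage = [sum(row) for row in w]
    # Bins with zero coverage get a zero scaling factor (assumption).
    inv_sqrt = [1.0 / math.sqrt(c) if c > 0 else 0.0 for c in coverage]
    result = []
    for i in range(n):
        row = []
        for j in range(n):
            laplacian_ij = (coverage[i] if i == j else 0.0) - w[i][j]
            row.append(inv_sqrt[i] * laplacian_ij * inv_sqrt[j])
        result.append(row)
    return result
```

The leading eigenvectors of the returned matrix, computed for each sample, are what enter the distance between samples.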
6. HiCRep (Yang et al. 2017)
   - 2D local smoothing: square smoothing of side $h$ (here, $h = 20$).
   - (Linear) stratification of interactions (125 strata for a bin size of 40 kb).
   - SCC (stratum-adjusted correlation coefficient) statistic: something similar to the Pearson correlation, adapted to stratified data (big equations, too much for me to understand).
   - Implementation: R
7. GenomeDISCO (Ursu et al. 2018)
   - Equalize sequencing depth by random subsampling.
   - Model bins as vertices and interactions as edges.
   - Smooth counts using random walks of length $t$: what is the probability of going from $i$ to $j$ with a random path of length $t$?
   - Compute $\mathrm{Conc}_t = 1 - \frac{\sum_{i,j} |A^t_{ij} - B^t_{ij}|}{\#\,\text{non-zero nodes}} \in [-1, 1]$.
   - Empirically, $t = 3$.
   - Implementation: Python
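A small pure-Python sketch of the random-walk concordance, for illustration only. It is not the GenomeDISCO implementation: the subsampling step is assumed to be done already, the transition matrix is obtained by simple row normalization, and the number of non-zero nodes is taken as the average, over the two matrices, of the number of bins with at least one contact. All function names are hypothetical.

```python
def transition_matrix(m):
    """Row-normalize a contact matrix into random-walk probabilities."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total > 0 else 0.0 for x in row])
    return out


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def random_walk(m, t):
    """Probabilities of going from i to j with a path of length t."""
    p = transition_matrix(m)
    out = p
    for _ in range(t - 1):
        out = matmul(out, p)
    return out


def concordance(a, b, t=3):
    """1 - (L1 difference of the smoothed matrices) / (# non-zero nodes)."""
    at, bt = random_walk(a, t), random_walk(b, t)
    l1 = sum(abs(x - y) for ra, rb in zip(at, bt) for x, y in zip(ra, rb))
    nonzero_a = sum(1 for row in a if sum(row) > 0)
    nonzero_b = sum(1 for row in b if sum(row) > 0)
    # Averaging the two node counts is an assumption of this sketch.
    return 1.0 - l1 / (0.5 * (nonzero_a + nonzero_b))
```

Each row of a smoothed matrix sums to 1, so the L1 difference per node is at most 2, which is why the score stays in $[-1, 1]$.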
8. Figure 3
9. Figure 4
10. Table 1
11. References
    - Sauria, Michael E. G., and James Taylor. 2017. “QuASAR: Quality Assessment of Spatial Arrangement Reproducibility in Hi-C Data.” bioRxiv.
    - Ursu, Oana, Nathan Boley, Maryna Taranova, Y. X. Rachel Wang, Galip Gürkan Yardımcı, William Stafford Noble, and Anshul Kundaje. 2018. “GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs.” Bioinformatics 34 (16): 2701–7.
    - Yan, Koon-Kiu, Galip Gürkan Yardımcı, Chengfei Yan, William S. Noble, and Mark Gerstein. 2017. “HiC-spector: a matrix library for spectral and reproducibility analysis of Hi-C contact maps.” Bioinformatics 33 (14): 2199–2201.
    - Yang, Tao, Feipeng Zhang, Galip Gürkan Yardımcı, Fan Song, Ross C. Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. 2017. “HiCRep: Assessing the Reproducibility of Hi-C Data Using a Stratum-Adjusted Correlation Coefficient.” Genome Research.
    - Yardımcı, Galip Gürkan, Hakan Ozadam, Michael E. G. Sauria, Oana Ursu, Koon-Kiu Yan, Tao Yang, Abhijit Chakraborty, et al. 2019. “Measuring the reproducibility and quality of Hi-C data.” Genome Biology 20: 57.
12. ‘selfish’ (and related tools) for differential Hi-C analysis. Nathalie Vialaneix, INRAE/MIAT. Chrocogen, February 21st, 2020
13. First of all... a short review of what Hi-C differential analysis is
14. Topic (What is this presentation about?)
    - When two sets of Hi-C matrices have been collected in two different conditions, what methods are available to compare the matrices and identify regions that differ significantly between the conditions?
    - Comparison usually means: at the bin pair level.
15. Notations and formal definition of the problem
    - Hi-C matrices: $T$ Hi-C matrices $H^t$, for $t = 1, \ldots, T$
    - Conditions: 2 conditions $C_1$ and $C_2$ such that $C_1 \cup C_2 = \{1, \ldots, T\}$ and $C_1 \cap C_2 = \emptyset$
    - Interactions: $h^t_{ij}$ is the interaction frequency (in $\mathbb{N}^+$) for bin pair $(i, j)$, where $i$ and $j$ are two genomic loci, in matrix $t$
    - Question: for all pairs $(i, j)$, test the following null hypothesis: $H_0^{ij}: N^{C_1}_{ij} = N^{C_2}_{ij}$, in which $N^{C_r}_{ij}$ is the random variable that represents the number of contacts (interaction frequency) between loci $i$ and $j$ in condition $C_r$.
16. Step 1: Prior to differential analysis
    - Most methods start by correcting sequencing bias (between-matrix normalization):
      - Standard sequencing depth normalization [Anders & Huber, 2010], to obtain an equal total number of counts across the different samples (R package edgeR)
      - MA plot correction [Lun & Smyth, 2015], improved by [Stansfield et al, 2019], to correct the trend in MA (mean versus difference) plots for every pair of samples (R packages diffHic/csaw and multiHiCcompare)
      - MD plot correction [Stansfield et al, 2018], to correct the trend in MD (distance versus difference) plots for every pair of samples (R package HiCcompare)
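The simplest of these corrections, equalizing the total number of counts, can be illustrated as follows. This is plain rescaling to the mean library size, written for illustration; it is not what edgeR computes, and the name `depth_normalize` is hypothetical.

```python
def depth_normalize(matrices):
    """Rescale each contact matrix so that all totals are equal.

    The common target is the mean total count across samples
    (an arbitrary choice made for this sketch).
    """
    totals = [sum(sum(row) for row in m) for m in matrices]
    target = sum(totals) / len(totals)
    return [
        [[x * target / total for x in row] for row in m]
        for m, total in zip(matrices, totals)
    ]
```

The MA and MD corrections go further: they remove a trend in the differences as a function of the mean count or of the genomic distance, which a single scaling factor cannot do.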
17. Normalization
18. Step 2: Compute a $p$-value per bin
    - Z score computation [Stansfield et al, 2018] (R package HiCcompare)
      - is based on quantiles of scaled and centered M values
      - is used when there is no replicate (one sample per condition, $T = 2$)
      - is very fast and easy to use
      - but is a bit weak on the theoretical side (no strong evidence)
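To make the Z score idea concrete, here is a minimal sketch. It assumes that M is the log2 ratio of the (already normalized, strictly positive) counts of the two samples and that $p$-values come from a standard normal distribution; it does not reproduce the HiCcompare code, and `z_score_pvalues` is a hypothetical name.

```python
import math


def z_score_pvalues(counts_1, counts_2):
    """Two-sided normal p-values from scaled and centered M values.

    counts_1 and counts_2 hold the interaction frequencies of the same
    bin pairs in the two samples; all values must be > 0, and the M
    values must not all be identical.
    """
    m_values = [math.log2(b / a) for a, b in zip(counts_1, counts_2)]
    mean = sum(m_values) / len(m_values)
    sd = math.sqrt(
        sum((m - mean) ** 2 for m in m_values) / (len(m_values) - 1)
    )
    z_scores = [(m - mean) / sd for m in m_values]
    # P(|Z| > |z|) for a standard normal variable.
    return [math.erfc(abs(z) / math.sqrt(2)) for z in z_scores]
```

With one sample per condition, the variability is estimated across bin pairs rather than across replicates, which is what makes the approach fast but weakly justified.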
19. Step 2: Compute a $p$-value per bin
    - Z score computation [Stansfield et al, 2018] (R package HiCcompare)
    - NB models [Lun & Smyth, 2015] (R package diffHic)
      - are based on Negative Binomial GLMs and statistical tests
      - need at least 3 replicates per condition
      - are not restricted to two conditions and can include various covariates
      - are statistically better justified
    - [Stansfield et al., 2019] (R package multiHiCcompare) also do that with small changes (normalization...)
    - [Zaborowski and Wilczyński, 2020] also use this distribution, but within distance pools, and counts are explained by counts in the other condition rather than by the condition itself
20. Step 2: Compute a $p$-value per bin
    - Z score computation [Stansfield et al, 2018] (R package HiCcompare)
    - NB models [Lun & Smyth, 2015] (R package diffHic)
    - In both approaches, a $p$-value is computed for every bin pair and $p$-values are corrected by multiple testing correction procedures (not described).
    - But spatial dependencies between pairs of bins are not included in the methods!!
21. Step 2: Compute a $p$-value per bin, taking spatial dependencies into account
    - Using an analogy with neuroimaging and spatial Poisson processes [Djekidel et al, 2018] (R package FIND)
      - needs at least 2 replicates per condition
      - seems to be restricted to two conditions (but could maybe be easily extended to more) and can include various covariates
      - is statistically (more or less) justified (from previous work on image analysis)
      - uses tests at the bin pair level with multiple testing corrections, but those tests are based on the value of the bin pair and its neighbors
      - is shown to work well for high-resolution differential analysis (seems to provide better results for 5 kb bins)
22. Now... coming back to selfish!
23. selfish features
    - Python tool available on GitHub
    - not sensitive to sequencing bias and does not require normalization
    - only suited to $T = 2$ (no replicate) for 2 conditions
24. selfish steps
    1. Distance based correction: $\tilde{h}^t_{ij} = \frac{h^t_{ij} - \mu^t_d}{\sigma^t_d}$, with:
       - $\mu^t_d$: average interaction frequency in matrix $t$ for bin pairs at distance $d$
       - $\sigma^t_d$: standard deviation of the interaction frequency in matrix $t$ for bin pairs at distance $d$
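The distance based correction is a per-diagonal z-score, which can be sketched as below. This is an illustration, not the selfish code: it assumes a symmetric square matrix, measures distance as $d = |i - j|$ in bins, uses the population standard deviation, and sets the corrected value to 0 on diagonals with no variability. The name `distance_correct` is hypothetical.

```python
import math


def distance_correct(h):
    """Center and scale each diagonal of a symmetric contact matrix."""
    n = len(h)
    out = [[0.0] * n for _ in range(n)]
    for d in range(n):
        # All bin pairs (i, i + d) at genomic distance d.
        diagonal = [h[i][i + d] for i in range(n - d)]
        mu = sum(diagonal) / len(diagonal)
        sigma = math.sqrt(
            sum((x - mu) ** 2 for x in diagonal) / len(diagonal)
        )
        for i in range(n - d):
            # Diagonals with zero variance are set to 0 (assumption).
            z = (h[i][i + d] - mu) / sigma if sigma > 0 else 0.0
            out[i][i + d] = z
            out[i + d][i] = z
    return out
```

For example, on the matrix `[[10, 4, 1], [4, 20, 6], [1, 6, 30]]` the first off-diagonal `[4, 6]` has mean 5 and standard deviation 1, so it becomes `[-1.0, 1.0]`.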
25. selfish steps
    1. Distance based correction
26. selfish steps
    1. Distance based correction
    2. Computation of Gaussian filters: for a given bin pair $(i, j)$, average in the neighborhood of radius $r_k$: $G^{t,k}_{ij} = \sum_{(i',j'):\, \|(i',j') - (i,j)\| \le r_k} h^t_{i'j'}\, e^{-\gamma \|(i',j') - (i,j)\|^2}$, for radii $r_k = r_0 s^k$
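The Gaussian-weighted neighborhood sum for one bin pair can be written directly from its definition. The sketch below is illustrative and unoptimized (the actual tool would presumably use a convolution): the radius and $\gamma$ are passed explicitly, the Euclidean norm is assumed, neighbors falling outside the matrix are ignored, and `gaussian_filter_at` is a hypothetical name.

```python
import math


def gaussian_filter_at(h, i, j, radius, gamma):
    """Sum of h[i'][j'] * exp(-gamma * dist^2) over the disc of given radius."""
    n = len(h)
    reach = int(math.floor(radius))
    total = 0.0
    for ii in range(max(0, i - reach), min(n, i + reach + 1)):
        for jj in range(max(0, j - reach), min(n, j + reach + 1)):
            dist2 = (ii - i) ** 2 + (jj - j) ** 2
            # Keep only neighbors within the Euclidean radius.
            if dist2 <= radius ** 2:
                total += h[ii][jj] * math.exp(-gamma * dist2)
    return total
```

Calling this function for a list of increasing radii gives, for each bin pair, the sequence of filter responses that the comparison step works on.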
27. selfish steps
    1. Distance based correction
    2. Computation of Gaussian filters
    3. Comparison step: check whether the evolution of the Gaussian filters (for increasing $r_k$) is similar between the two matrices:
       - compute the evolution: $\Gamma^{t,k}_{ij} = G^{t,k+1}_{ij} - G^{t,k}_{ij}$
       - compute the difference in evolution between the two matrices: $\Gamma^{1,k}_{ij} - \Gamma^{2,k}_{ij}$
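For a single bin pair, the comparison step reduces to simple differences. In this sketch, `g1` and `g2` are assumed to hold the filter responses of the two matrices at the same bin pair for increasing radii; the function names are hypothetical.

```python
def evolution(responses):
    """Change in the filter response between consecutive radii."""
    return [responses[k + 1] - responses[k] for k in range(len(responses) - 1)]


def evolution_difference(g1, g2):
    """Per-scale difference between the evolutions of matrix 1 and matrix 2."""
    return [a - b for a, b in zip(evolution(g1), evolution(g2))]
```

For instance, `evolution_difference([1, 3, 6], [1, 2, 3])` gives `[1, 2]`: the first matrix gains contacts faster than the second as the neighborhood grows, at both scales.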
28. selfish steps
    1. Distance based correction
    2. Computation of Gaussian filters
    3. Comparison step
    4. $p$-value: differences between the two matrices are assumed Gaussian (why???) and $p$-values are derived from this assumption (similar to the Z score approach) for different $k$: $p^k_{ij}$. The conclusion is based on $p_{ij} = \min_k p^k_{ij}$ and multiple testing correction.
29. Comparison with FIND
    - Run on Hi-C contact maps from cell types GM12878 and K562 from [Rao et al, 2014]. Evaluated for:
      - enrichment in epigenetic markers (CTCF, POLII, P300)
      - presence of histone modification H3K4me3
      - expression fold change of nearby genes
      - computational time (selfish 120 times faster than FIND)
    - Run on simulated datasets and evaluated for precision/recall
30. References
    - Ardakany, A. R., Ay, F., and Lonardi, S. (2019). Selfish: discovery of differential chromatin interactions via a self-similarity measure. Bioinformatics, 35(14): i145–i153.
    - Djekidel, M. N., Chen, Y., and Zhang, M. Q. (2018). FIND: difFerential chromatin INteractions Detection using a spatial Poisson process. Genome Research, 28: 412–422.
    - Lun, A. and Smyth, G. (2015). diffHic: a Bioconductor package to detect differential genomic interactions in Hi-C data. BMC Bioinformatics, 16: 258.
    - Rao, S. S. P. et al. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159: 1665–1680.
    - Stansfield, J. C., Cresswell, K. G., Vladimirov, V. I., and Dozmorov, M. G. (2018). HiCcompare: an R-package for joint normalization and comparison of HI-C datasets. BMC Bioinformatics, 19: 279.
    - Stansfield, J. C., Cresswell, K. G., and Dozmorov, M. G. (2019). multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments. Bioinformatics, 35(17): 2916–2923.
    - Zaborowski, R. and Wilczyński, B. (2020). DiADeM: differential analysis via dependency modelling of chromatin interactions with robust generalized linear models. bioRxiv preprint.