
Investigating the 3D structure of the genome with Hi-C data analysis

MIAT unit seminar, INRA Toulouse
https://mia.toulouse.inra.fr
Toulouse, France
June 2, 2017

Published in: Science

Investigating the 3D structure of the genome with Hi-C data analysis

  1. Investigating the 3D structure of the genome with Hi-C data analysis. Sylvain Foissac & Nathalie Villa-Vialaneix, prenom.nom@inra.fr. MIAT Seminar, Toulouse, June 2, 2017. SF & NV2 | Hi-C data analysis 1/28
  2. Outline: 1 Normalization; 2 TAD identification; 3 A/B compartments; 4 Differential analysis
  5. Purpose of normalization: (1) within-matrix normalization makes bins comparable within a matrix (not needed for differential analysis); (2) between-matrix normalization makes the same bin pair comparable between two matrices (needed for differential analysis).
  7. Different within-matrix normalizations to correct technical biases (GC content, mappability...). Explicit correction [Yaffe and Tanay, 2011, Hu et al., 2012]: every factor causing bias is identified and estimated. Non-parametric correction: ICE correction using matrix balancing [Imakaev et al., 2012], $K_{ij} = b_i \tilde{K}_{ij} b_j$ for a $\tilde{K}$ such that, $\forall\, i = 1, \ldots, p$, $\sum_{j=1}^{p} \tilde{K}_{ij}$ is constant.
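The matrix balancing idea behind ICE can be sketched in a few lines. This is a minimal pure-Python illustration, assuming a small, dense, symmetric count matrix with no empty rows; the function name `balance` and its parameters are hypothetical and this is not the implementation of [Imakaev et al., 2012].

```python
def balance(K, n_iter=200, tol=1e-10):
    """Iteratively rescale a symmetric count matrix until all row sums match.

    Returns the balanced matrix W and the bias vector b such that
    K[i][j] is approximately b[i] * W[i][j] * b[j].
    """
    p = len(K)
    bias = [1.0] * p  # accumulated bias vector b
    W = [[float(v) for v in row] for row in K]
    for _ in range(n_iter):
        sums = [sum(row) for row in W]
        mean = sum(sums) / p
        # relative deviation of each row sum from the average row sum
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        W = [[W[i][j] / (scale[i] * scale[j]) for j in range(p)] for i in range(p)]
        bias = [b * s for b, s in zip(bias, scale)]
        if max(abs(s - 1.0) for s in scale) < tol:
            break
    return W, bias


if __name__ == "__main__":
    K = [[10, 4, 2], [4, 8, 6], [2, 6, 20]]
    W, b = balance(K)
    print([round(sum(row), 6) for row in W])  # row sums are now equal
```

Real Hi-C matrices are sparse and contain empty bins, which dedicated tools filter out before balancing.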
  8. Different within-matrix normalizations to correct technical biases (picture from [Schmitt et al., 2016]).
  9. Different within-matrix normalizations to take distances into account. Theoretical distribution taken from [Belton et al., 2012]: $K^{d}_{ij} = \frac{K_{ij} - \bar{K}_{d(i,j)}}{\sigma(D_{d(i,j)})}$, with $\bar{K}_d$ the average count at distance $d$ and $\sigma(D_d)$ the corresponding standard deviation. Available in HiTC [Servant et al., 2012].
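A distance-based normalization of this kind can be sketched as a per-diagonal z-score. The code below is an illustration only: it assumes the mean and standard deviation are taken empirically over all bin pairs at the same distance, and the function name `distance_zscore` is hypothetical (it is not the HiTC API).

```python
from statistics import mean, pstdev


def distance_zscore(K):
    """Center and scale each count by the mean and standard deviation
    of all counts observed at the same genomic distance |i - j|."""
    p = len(K)
    by_dist = {}
    for i in range(p):
        for j in range(p):
            by_dist.setdefault(abs(i - j), []).append(K[i][j])
    stats = {d: (mean(v), pstdev(v)) for d, v in by_dist.items()}
    Z = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            m, s = stats[abs(i - j)]
            # a diagonal with no variability carries no signal: set to 0
            Z[i][j] = (K[i][j] - m) / s if s > 0 else 0.0
    return Z


if __name__ == "__main__":
    for row in distance_zscore([[5, 3, 1], [3, 7, 2], [1, 2, 9]]):
        print([round(z, 3) for z in row])
```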
  12. Between-matrix normalization: correct for differences in sequencing depth. Standard approach: similar to RNA-seq normalization. However... density adjustment by LOESS fit [Robinson and Oshlack, 2010] (implemented in csaw).
  13. Outline: 1 Normalization; 2 TAD identification; 3 A/B compartments; 4 Differential analysis
  14. Topologically Associated Domains (TADs) [Rao et al., 2014]
  18. TAD method jungle. Directionality index [Dixon et al., 2012]: compute the divergence between upstream and downstream interaction counts, then use an HMM to identify TADs. armatus [Filippova et al., 2013]: maximize a criterion that evaluates a within/between count ratio, then combine multi-resolution results into a consensus segmentation. Segmentation method [Brault et al., 2017]: block boundary estimation in the matrix. ... (many others); interestingly, very few provide a hierarchical clustering. Comparisons in [Fotuhi Siahpirani et al., 2016, Dali and Blanchette, 2017].
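To make the "divergence between up/downstream interaction counts" concrete, here is a small sketch of a directionality index. It uses the signed chi-square-like statistic generally associated with [Dixon et al., 2012]; the window size (in bins) and the function name are assumptions, and the HMM step that turns the index into TADs is not shown.

```python
def directionality_index(K, window):
    """Signed imbalance between downstream and upstream contacts of each bin.

    Positive values: the bin interacts mostly downstream (start of a domain).
    Negative values: the bin interacts mostly upstream (end of a domain).
    """
    p = len(K)
    di = []
    for i in range(p):
        up = sum(K[i][j] for j in range(max(0, i - window), i))
        down = sum(K[i][j] for j in range(i + 1, min(p, i + window + 1)))
        expected = (up + down) / 2
        if expected == 0 or up == down:
            di.append(0.0)
            continue
        sign = 1.0 if down > up else -1.0
        chi2 = ((up - expected) ** 2 + (down - expected) ** 2) / expected
        di.append(sign * chi2)
    return di


if __name__ == "__main__":
    # toy matrix with two domains: bins {0, 1} and bins {2, 3}
    K = [[0, 9, 1, 1], [9, 0, 1, 1], [1, 1, 0, 9], [1, 1, 9, 0]]
    print(directionality_index(K, window=2))
```

On the toy matrix, the index switches from negative to positive between bins 1 and 2, which is where the domain boundary lies.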
  19. DI evolution with respect to armatus TADs
  20. CTCF at TAD boundaries
  21. Enrichment of genomic features around TAD boundaries: Homo sapiens [Dixon et al., 2012]; Sus scrofa (PORCINET project)
  23. Current methodological development: constrained HAC as a way to compare/combine TADs between samples. Constrained HAC: hierarchical clustering with contiguity constraints. Challenges (currently under development with Pierre Neuvial and Marie Chavent): methodological issues: what happens when using Ward's linkage criterion with a non-Euclidean similarity (counts of the Hi-C matrix)? What happens when adding constraints to HAC? (partially solved); development of the R package adjclust (Google Summer of Code selected project).
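The contiguity constraint can be illustrated with a naive sketch in which only adjacent clusters are allowed to merge. Note the assumptions: this version uses average linkage on the similarity matrix rather than Ward's criterion, it has cubic cost, and it is not the algorithm of the adjclust package; the function names are hypothetical.

```python
def average_link(S, a, b):
    """Average similarity between two clusters of bin indices."""
    return sum(S[i][j] for i in a for j in b) / (len(a) * len(b))


def constrained_hac(S):
    """Agglomerative clustering where only neighbouring clusters can merge.

    S is a symmetric similarity matrix (e.g. Hi-C counts). Returns the list
    of merges as (first_bin, last_bin) of each newly created cluster.
    """
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        # pick the most similar pair among adjacent clusters only
        k = max(range(len(clusters) - 1),
                key=lambda t: average_link(S, clusters[t], clusters[t + 1]))
        merges.append((clusters[k][0], clusters[k + 1][-1]))
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return merges


if __name__ == "__main__":
    S = [[0, 9, 1, 1], [9, 0, 1, 1], [1, 1, 0, 9], [1, 1, 9, 0]]
    print(constrained_hac(S))
```

Because clusters are always intervals of consecutive bins, every level of the resulting hierarchy is a valid segmentation of the chromosome.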
  24. Outline: 1 Normalization; 2 TAD identification; 3 A/B compartments; 4 Differential analysis
  25. A/B compartments [Lieberman-Aiden et al., 2009] [Giorgetti et al., 2013]. Method (in theory): compute Pearson correlations between bins (using interaction counts with all the other bins of the same chromosome); compute eigenvectors (or perform PCA) on this correlation matrix; assign A/B compartments to the +/- values of the PCs.
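The three steps of the theoretical method can be sketched as follows. This is a toy pure-Python version assuming a small dense matrix; the leading eigenvector is obtained by power iteration, all function names are hypothetical, and the sign returned is arbitrary, so deciding which group is A and which is B needs additional information.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two count profiles (0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def first_eigenvector(C, n_iter=500):
    """Leading eigenvector of a symmetric PSD matrix by power iteration."""
    p = len(C)
    v = [1.0 + 0.1 * i for i in range(p)]  # deterministic, non-constant start
    for _ in range(n_iter):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def compartment_signs(K):
    """Return +1/-1 per bin from the sign of the first eigenvector
    of the bin-by-bin Pearson correlation matrix."""
    C = [[pearson(ri, rj) for rj in K] for ri in K]
    return [1 if x >= 0 else -1 for x in first_eigenvector(C)]


if __name__ == "__main__":
    groups = [0, 0, 1, 1, 0, 1]  # toy checkerboard pattern
    K = [[10 if gi == gj else 1 for gj in groups] for gi in groups]
    print(compartment_signs(K))
```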
  27. A/B compartments in practice, after ICED and distance-based normalizations. Method: differentiate between A and B using the sign of the correlation between PCs and diagonal counts; choose the relevant PC and method by maximizing $-\log_{10}(p\text{-value})$ between diagonal counts in the +/- PC groups (two-group comparison, Student test).
  28. Biological validation
  29. Outline: 1 Normalization; 2 TAD identification; 3 A/B compartments; 4 Differential analysis
  33. Filtering. In differential analysis of sequencing data, filtering is a crucial step: removing low-count features (that have little or no chance of being found differential) improves the power of the test (it alleviates the effect of the multiple testing correction) and can save unnecessary computational time. It can be performed (1) at the beginning of the analysis or after the estimation of the parameters of the model used for differential analysis; (2) with a threshold fixed to an arbitrary value (minimum total count per sample) or automated from the data. For Hi-C data, filtering was performed at the beginning of the analysis (to limit the computational burden), using either an arbitrary threshold or a threshold based on the estimation of the noise background by a quantile of inter-chromosomal counts (as in the R package diffHic).
  34. Filtering at 500 kb: the automatic filter (filters counts < ∼5) filters out 96.4% of pairs (figure: before filtering, after filtering).
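A data-driven filter of the kind described for Hi-C can be sketched as below. The assumptions are loud ones: the threshold is a plain empirical quantile of inter-chromosomal counts, pairs are kept when their mean count across samples exceeds it, and the names `quantile_threshold` and `filter_pairs` are hypothetical. This is not the procedure implemented in diffHic.

```python
def quantile_threshold(background_counts, q=0.95):
    """Empirical quantile of background (e.g. inter-chromosomal) counts."""
    ordered = sorted(background_counts)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def filter_pairs(counts, threshold):
    """Keep bin pairs whose mean count across samples exceeds the threshold.

    counts maps a bin pair to the list of its counts, one per sample.
    """
    return {pair: c for pair, c in counts.items()
            if sum(c) / len(c) > threshold}


if __name__ == "__main__":
    background = [0, 0, 1, 1, 2, 3, 0, 1, 5, 1]
    counts = {("bin1", "bin2"): [4, 6, 5], ("bin1", "bin9"): [0, 1, 1]}
    thr = quantile_threshold(background, q=0.5)
    print(thr, filter_pairs(counts, thr))
```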
  36. Exploratory analysis (500kb bins), chromosome 1. Cosine similarity (Frobenius norm) between samples:

      Sample                    (1)     (2)     (3)     (4)     (5)     (6)
      (1) LW90-160216-GCCAAT    1
      (2) LW90-160223-CTTGTA    0.911   1
      (3) LW90-160308-AGTTCC    0.8886  0.8866  1
      (4) LW110-160307-CGATGT   0.8566  0.8651  0.8288  1
      (5) LW110-160308-AGTCAA   0.8973  0.9118  0.8912  0.8692  1
      (6) LW110-160517-ACAGTG   0.8935  0.9032  0.8818  0.8799  0.906   1

      Good reproducibility between experiments; no clear organization with respect to the condition. All data after filtering and between-matrix normalization (LOESS): 2 outliers, but PC1 is organized with respect to the condition.
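The similarity used in this exploratory analysis can be read as the cosine of the angle between two contact matrices for the Frobenius inner product. A minimal sketch, assuming two dense matrices of identical shape and a hypothetical function name:

```python
from math import sqrt


def frobenius_cosine(A, B):
    """Cosine similarity between two matrices of the same shape,
    using the Frobenius inner product and norm."""
    inner = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    norm_a = sqrt(sum(a * a for row in A for a in row))
    norm_b = sqrt(sum(b * b for row in B for b in row))
    return inner / (norm_a * norm_b)


if __name__ == "__main__":
    A = [[1, 2], [2, 1]]
    B = [[1, 0], [0, 1]]
    print(round(frobenius_cosine(A, B), 4))
```

The measure is invariant to a global rescaling of either matrix, so it is not affected by a difference in sequencing depth alone.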
  39. Methods for differential analysis of Hi-C. Similar to RNA-seq [Lun and Smyth, 2015] and the R package diffHic (essentially a wrapper for edgeR): count data are modeled by a negative binomial distribution; parameters (mean and variance per gene) are estimated from the data, with the variance vs mean relationship being modeled; the test is an exact test (similar to Fisher's) or a likelihood ratio test (GLM).
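In the negative binomial model, the variance is linked to the mean by var = mu + phi * mu^2, where phi is the dispersion. The sketch below estimates phi for one feature by the method of moments; this is only an illustration of the mean-variance relationship, it ignores library sizes, and it is not the shrinkage estimator used by edgeR. The function name is hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion phi from var = mu + phi * mu**2.

    Returns 0.0 when the counts show no more variability than a Poisson.
    """
    m = mean(counts)
    if m <= 0:
        return 0.0
    v = variance(counts)  # unbiased sample variance
    return max(0.0, (v - m) / (m * m))


if __name__ == "__main__":
    print(round(moment_dispersion([2, 4, 6, 8, 10]), 4))
```

With few replicates such per-feature estimates are very noisy, which is why the variance vs mean relationship is modeled across features.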
  43. Complementary remarks about DE analysis. Hi-C data contain more zeros than RNA-seq data: some people propose to use a zero-inflated negative binomial distribution (unpublished as far as I know). The method provides a p-value for every pair of bins, so the analysis involves a very large number of bin pairs at finer resolutions (500kb after filtering: 998,623 pairs of bins; without filtering: 13,509,221 pairs of bins): the problem is solved for 500kb bins but still under study for 40kb bins. Tests are performed as if bin pairs were independent whereas they are spatially correlated: the estimation of model parameters might be improved if (1) smoothed with respect to spatial proximity (similar to what is sometimes done in methylation data analysis); (2) performed independently for pairs of bins at a given distance (future work). Post-analysis of the spatial distribution of p-values: work in progress with Pierre Neuvial (submitted CNRS project).
  44. Because the last page had no picture (probably not suited for the youngest).
  45. Preliminary results: 913 bin pairs found differential (after multiple testing correction); most of them are related to 3 chromosomes; parameter setting (filters...) and biological analysis are work in progress...
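The slides do not say which multiple testing correction was applied. As an illustration only, and assuming a false discovery rate control of the Benjamini-Hochberg type (a common default in this kind of analysis), adjusted p-values can be computed as follows; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # go from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Bin pairs whose adjusted p-value falls below the chosen level would then be reported as differential. This procedure assumes independent or positively dependent tests, which ties in with the spatial correlation issue raised earlier.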
  48. Differential TADs (state-of-the-art): detecting differential domains between the two conditions. Existing approaches: [Fraser et al., 2015] (3 conditions, no replicate): HMM on TAD boundaries (with a tolerance threshold) to identify different TAD boundaries between samples; HAC on TADs, with the cophenetic distance used to obtain locally conserved structures through a z-score approach. The R package diffHic computes up/downstream counts (within ± 100 kb) and uses the GLM implemented in edgeR with an interaction between stream direction (up/down) and condition. However, the first approach does not take biological variability into account (no replicate) and the second uses only a very aggregated criterion.
  50. Differential TADs (perspectives). Ideas for future work: using constrained HAC, are we able to compute a consensus dendrogram using several biological replicates, and to identify branches that are significantly (in which sense?) different between conditions, taking the within-condition variability into account?
  51. Conclusions and perspectives. Honestly, it's late and I really do not believe that I will have enough time to make a conclusion and discuss perspectives so... Questions?
  52. References
      Belton, J., Patton MacCord, R., Harmen Gibcus, J., Naumova, N., Zhan, Y., and Dekker, J. (2012). Hi-C: a comprehensive technique to capture the conformation of genomes. Methods, 58:268–276.
      Brault, V., Chiquet, J., and Lévy-Leduc, C. (2017). Efficient block boundaries estimation in block-wise constant matrices: an application to HiC data. Electronic Journal of Statistics, 11(1):1570–1599.
      Dali, R. and Blanchette, M. (2017). A critical assessment of topologically associating domain prediction tools. Nucleic Acids Research, 45(6):2994–3005.
      Dixon, J., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J., and Ren, B. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485:376–380.
      Filippova, D., Patro, R., Duggal, G., and Kingsford, C. (2013). Identification of alternative topological domains in chromatin. Algorithms for Molecular Biology, 9:14.
      Fotuhi Siahpirani, A., Ay, F., and Roy, S. (2016). A multi-task graph-clustering approach for chromosome conformation capture data sets identifies conserved modules of chromosomal interactions. Genome Biology, 17:114.
      Fraser, J., Ferrai, C., Chiariello, A., Schueler, M., Rito, T., Laudanno, G., Barbieri, M., Moore, B., Kraemer, D., Aitken, S., Xie, S., Morris, K., Itoh, M., Kawaji, H., Jaeger, I., Hayashizaki, Y., Carninci, P., Forrest, A., The FANTOM Consortium, Semple, C., Dostie, J., Pombo, A., and Nicodemi, M. (2015). Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Molecular Systems Biology, 11:852.
      Giorgetti, L., Servant, N., and Heard, E. (2013). Changes in the organization of the genome during the mammalian cell cycle. Genome Biology, 14:142.
      Hu, M., Deng, K., Selvaraj, S., Qin, Z., Ren, B., and Liu, J. (2012). HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics, 28(23):3131–3133.
      Imakaev, M., Fudenberg, G., McCord, R., Naumova, N., Goloborodko, A., Lajoie, B., Dekker, J., and Mirny, L. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature Methods, 9:999–1003.
      Lieberman-Aiden, E., van Berkum, N., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B., Sabo, P., Dorschner, M., Sandstrom, R., Bernstein, B., Bender, M., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L., Lander, E., and Dekker, J. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–293.
      Lun, A. and Smyth, G. (2015). diffHic: a Bioconductor package to detect differential genomic interactions in Hi-C data. BMC Bioinformatics, 16:258.
      Rao, S., Huntley, M., Durand, N., Stamenova, E., Bochkov, I., Robinson, J., Sanborn, A., Machol, I., Omer, A., Lander, E., and Lieberman Aiden, E. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–1680.
      Robinson, M. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11:R25.
      Schmitt, A., Hu, M., and Ren, B. (2016). Genome-wide mapping and analysis of chromosome architecture. Nature Reviews Molecular Cell Biology, 17(12):743–755.
      Servant, N., Lajoie, B., Nora, E., Giorgetti, L., Chen, C., Heard, E., Dekker, J., and Barillot, E. (2012). HiTC: exploration of high-throughput 'C' experiments. Bioinformatics, 28(21):2843–2844.
      Yaffe, E. and Tanay, A. (2011). Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nature Genetics, 43:1059–1065.
