My (Angela) Chung, Data Enthusiast, San Jose State University
Cancer is a complex disease that arises from interactions between cell-intrinsic alterations and the tumor microenvironment. The connection between epigenetics and genomic structure plays a key role in chromatin interactions and in the enhancer-promoter communication that drives transcription. Alterations of these components in oncogenic signaling pathways can cause cancer cell-intrinsic changes and send inappropriate instructions to the normal cell cycle, leading to abnormal cell growth.
• Topologically associating domains (TADs) and A/B compartments are the main structures of higher-order chromatin structure. These contact domains, chromatin states, super-enhancers, and histone modifications together regulate transcription and gene expression for normal/abnormal cell cycles.
• Several bioinformatics tools were utilized: FANC for processing raw FASTQ data to Hi-C contact matrices, JuicerTools for obtaining the locations of contact domains on the entire genome, and CoolBox for visualizing chromatin contacts in different cell lines.
• High-resolution chromatin contacts showed dynamic interactions among chromosomal regions in different cell lines.
• Qualitative and quantitative features were comprehensively engineered from 3D chromatin folding and epigenetic regulators using available packages (scikit-learn, PyTorch, pandas, NumPy, Matplotlib, etc.).
• XGBoost multi-class classifier achieved the highest accuracy of 80.90% in classifying normal and cancer cell lines based on chromatin interactions, followed by Random Forest at 73.76% and TabNet classifier at 70.00%.
Data Con LA 2022 - Early cancer detection using higher-order genome architecture
1. Effective Cancer Detection Using
Higher-Order Genome Architecture
and Chromatin Interactions
My (Angela) Chung
Master of Science in Bioinformatics
2. Conceptual question
How do higher-order genomic
structures and epigenetic regulations
influence the chromatin interactions
in normal and cancerous cell lines?
6. TAD structure Super-enhancer
Rowley et al. (2018) Li et al. (2021)
Accumulation of multiple
enhancers at TAD domains
CTCF: the regulator of chromatin
organization
Cohesin: the structural maintenance
of chromosome cohesin complex
Genomic Structure
Loop extrusion
7. Epigenetic marks & chromatin states
Epigenetics
H3K27ac is an ACTIVATOR (activators mark super-enhancers)
H3K27me3 is a REPRESSOR
Martire et al. (2017) & Chen et al. (2020)
8. Chromatin state in oncogenic
signaling pathways
Hnisz et al. (2015) & Li et al. (2021)
Epigenetics – Histone modifications Pathway effects
Transcriptional super-enhancers
• Chromatin structure
• Epigenetic regulation
• Effects on cancer pathway
11. High-throughput chromosome
conformation capture (Hi-C)
• Comprehensive mapping of chromatin contacts
• Detect loops across the entire genome
• DNA-DNA proximity ligation is combined with high-throughput sequencing
• Chromosome spatially positions into 3D conformation
Rao et al. (2014)
In situ Hi-C
18. Engineer and organize data for visualization
and building a detection model.
Data Engineering &
Visualization
19. Flowchart for Data Engineering
[Flowchart with columns: Key Components, Analysis, Features, Structure Effect]
Key Components: TADs, CTCF, Loops, RAD21, Histone marks
Analysis: ChIP-Seq, FANC, Arrowhead, HICCUPS
Features: Insulation; Directionality Index (DI); CTCF at both ends; CTCF at TAD start; CTCF at TAD end; No CTCF; DI at TAD start; DI at TAD end; Sum of insulation scores; Start boundary score; End boundary score; Number of histone marks; Average location; Histone marks are not present; Within loop window; Outside loop window
Structure Effect: CTCF loop; Cohesin-independent loop; Cohesin-associated loop; Ordinary domain; Loop extrusion; Extrusion loss; CTCF loop loss
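The insulation and directionality-index (DI) features named in the data-engineering flowchart can be illustrated with a small sketch. The slides computed these with FANC and do not state the exact definitions, so the formulas below are one commonly used formulation, and the function names, the window size, and the toy matrix are all assumptions made for illustration.

```python
def insulation_score(matrix, i, window):
    """Mean contact frequency between the `window` bins upstream of bin i
    and the `window` bins downstream of it (low values suggest a boundary)."""
    n = len(matrix)
    lo, hi = i - window, i + window
    if lo < 0 or hi >= n:
        return None  # window would run off the edge of the matrix
    values = [matrix[r][c] for r in range(lo, i) for c in range(i + 1, hi + 1)]
    return sum(values) / len(values)


def directionality_index(matrix, i, window):
    """Signed chi-square-like bias of bin i toward downstream (+) or
    upstream (-) contacts; assumes non-negative contact counts."""
    n = len(matrix)
    if i - window < 0 or i + window >= n:
        return None
    a = sum(matrix[i][j] for j in range(i - window, i))          # upstream
    b = sum(matrix[i][j] for j in range(i + 1, i + window + 1))  # downstream
    if a == b:
        return 0.0
    e = (a + b) / 2  # expected count if there were no directional bias
    sign = 1.0 if b > a else -1.0
    return sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)


# Hypothetical toy matrix: two 4-bin domains, strong contacts inside a domain.
toy = [[10 if (r < 4) == (c < 4) else 1 for c in range(8)] for r in range(8)]

for bin_index in range(2, 6):
    print(bin_index,
          insulation_score(toy, bin_index, 2),
          round(directionality_index(toy, bin_index, 2), 3))
```

In this toy case the insulation score drops at the bins flanking the domain boundary, and the DI switches from negative at the end of the first domain to positive at the start of the second, which is the pattern behind the "DI at TAD end" and "DI at TAD start" features.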
24. Chromatin State vs Boundary Strength Distribution
ACTIVATOR = H3K27ac
Enhancers and promoters are potentially protected at strong TAD boundaries.
27. Chromatin State vs Boundary Strength Distribution
REPRESSOR = H3K27me3
[Panels: Cancer cell, Cancer cell]
28. Chromatin State vs Boundary Strength Distribution
REPRESSOR = H3K27me3
Lower density of weak boundaries and negative super-enhancers detected in cancerous cell lines.
31. Random Forest performance evaluation
• Model accuracy is 73.76%
• Max depth is unassigned for the highest performance
• Number of tree estimators is around 200
• Numerical attributes represent feature importance
• K562 is best classified at 83%
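To show how an overall accuracy and a per-cell-line figure such as "K562 at 83%" can differ, here is a minimal sketch. It assumes the per-cell-line number is the fraction of that cell line's samples predicted correctly (the slide does not say how it was computed), and the labels below are made up for illustration, not the project's data.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


# Hypothetical labels, for illustration only.
y_true = ["K562"] * 6 + ["IMR90"] * 4
y_pred = ["K562"] * 5 + ["A549"] + ["IMR90"] * 3 + ["GM12878"]

print(per_class_accuracy(y_true, y_pred))
print(overall_accuracy(y_true, y_pred))
```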
34. Future direction
• Try different scalable and memory-efficient tools to remove biases and normalize Hi-C contact maps.
• Consider gene expression profile (RNA-Seq) for chromatin state categorization.
• Consider new visualization tools to further enhance image quality, especially for triangular heatmap.
• Improve feature engineering to enhance model performance.
• Consider Hi-C image classification to distinguish cell lines.
35. References
Chang, P., Gohain, M., et al., Computational Methods for Assessing Chromatin Hierarchy. Comp Struc Biotech J., 16, 2018
Rowley, M.J., & Corces, V.G., Organizational principles of 3D genome architecture. Nature Reviews Genetics, 19, 2018
Li, G.H., Qu, Q., et al., Super-enhancers: a new frontier for epigenetic modifiers in cancer chemoresistance. J Exp Clin Cancer Res., 40, 2021
Fudenberg, G., Imakaev, M., et al., Formation of chromosomal domains by loop extrusion. Cell Reports, 15, 2038-2049, 2016
Kyrchanova, O., & Georgiev, P., Mechanisms of enhancer-promoter interactions in higher eukaryotes. Int J Mol Sci., vol. 22, no. 2, 2021
Rao, S., Huang, S.C., et al., Cohesin loss eliminates all loop domains. Cell, vol. 171, no. 2, 2017
Ong, C.T., & Corces, V.G., Enhancer function: new insights into the regulation of tissue-specific gene expression. Nat Rev Genet., vol. 12, no. 4, 2011
Ernst, J., Kheradpour, P., et al., Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 2011
Hnisz, D., Schuijers, J., et al., Convergence of developmental and oncogenic signaling pathways at transcriptional super-enhancers. Molecular Cell, vol. 58, no. 2, 2015
Rao, S. S. P., Huntley, M. H., Durand, N. C., et al., A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159, 2014
36. Project Committee Members:
Wendy Lee, Ph.D Department of Computer Science
Carlos Rojas, Ph.D Department of Computer Engineering
William Andreopoulos, Ph.D Department of Computer Science
Thank you for your time and attention!
40. CTCF and histone marks in contact domains
• Cell lines: A549 and IMR90
• Chromosome: 8
• Resolution window: 80 Mb – 83 Mb
• CTCF and epigenetic factors in genomic architecture
41. Mapping (to generate aligned reads) & Fragmentation (to assign reads to restriction fragments)
Basic mapping:
• Detect paired-end reads and split them at the ligation junctions created by the MboI or HindIII restriction enzyme.
• Map reads (MAPQ > 30) to the hg19 reference genome using a Bowtie2 index.
Iterative mapping:
• Eliminate unaligned reads
Fragmentation:
• Assign mapped reads to restriction fragments (MboI or HindIII)
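The fragmentation step, assigning each mapped read to a restriction fragment, can be sketched with a sorted list of cut sites and a binary search. This is an illustration only, not how FANC implements it: the function names and the toy sequence are hypothetical, and the cut position is simplified to the start of the recognition motif (GATC for MboI, AAGCTT for HindIII).

```python
from bisect import bisect_right


def find_cut_sites(sequence, motif):
    """Return the sorted start positions of every occurrence of `motif`."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)
    return sites


def assign_fragment(cut_sites, position):
    """Index of the fragment containing `position`.

    Fragment 0 runs from the chromosome start to the first cut site,
    fragment k from cut_sites[k - 1] up to (not including) cut_sites[k].
    """
    return bisect_right(cut_sites, position)


# Hypothetical toy chromosome with two MboI (GATC) sites.
toy_sequence = "AAGATCTTTTGATCAA"
cuts = find_cut_sites(toy_sequence, "GATC")
print(cuts)
print([assign_fragment(cuts, pos) for pos in (0, 5, 12)])
```

Binary search keeps each lookup logarithmic in the number of cut sites, which matters when tens of millions of reads are assigned per sample.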
42. Filtering read pairs to eliminate invalid read pairs from experimental procedures
Distance filters: filter out self-ligation products <25kb
Restriction distance: remove reads mapping >1000kb
Strand filters (inward and outward ligations):
• Inward same strand (un-ligated) <1kb
• Outward same strand (self-ligated) <1kb
Features of invalid reads:
• Unmapped
• Multi-mapping
• PCR duplicates
• Low map quality
• Ligation errors
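The mapping-quality and strand filters above can be sketched as a simple predicate on a read pair. This is a hedged illustration rather than the FANC implementation: the `ReadPair` type and function names are hypothetical, both reads are assumed to lie on the same chromosome, "inward" and "outward" are interpreted as the relative orientation of the two reads, and the 1 kb cutoffs and MAPQ > 30 threshold are taken from the slides as defaults.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    pos1: int
    strand1: str  # "+" or "-"
    mapq1: int
    pos2: int
    strand2: str
    mapq2: int


def orientation(pair):
    """Classify a pair as 'inward' (+ then -), 'outward' (- then +) or 'same'."""
    (_, s_left), (_, s_right) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if s_left == "+" and s_right == "-":
        return "inward"
    if s_left == "-" and s_right == "+":
        return "outward"
    return "same"


def is_valid(pair, min_mapq=30, inward_cutoff=1000, outward_cutoff=1000):
    """Keep a pair only if both reads map well and it is not a short
    inward (likely un-ligated) or short outward (likely self-ligated) pair."""
    if pair.mapq1 <= min_mapq or pair.mapq2 <= min_mapq:
        return False
    distance = abs(pair.pos2 - pair.pos1)
    kind = orientation(pair)
    if kind == "inward" and distance < inward_cutoff:
        return False
    if kind == "outward" and distance < outward_cutoff:
        return False
    return True


# Hypothetical read pairs, for illustration only.
pairs = [
    ReadPair(100, "+", 40, 600, "-", 40),    # short inward pair
    ReadPair(100, "-", 40, 600, "+", 40),    # short outward pair
    ReadPair(100, "+", 40, 50100, "-", 40),  # long-range contact
    ReadPair(100, "+", 10, 50100, "-", 40),  # low mapping quality
]
print([is_valid(p) for p in pairs])
```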
43. Data processing summary
Four cell lines: GM12878, IMR90, K562, and A549
Total 34 samples including both primary and replicates

Samples        Total reads    Valid pairs    % Valid pairs    Sum pairs filtered
GM12878_577     87,569,310     75,182,513    85.9             16,519,508
GM12878_580     63,592,550     54,573,765    85.8             12,121,642
GM12878_581     31,678,266     27,537,842    86.9              5,570,781
GM12878_583     69,592,394     57,017,845    81.9             16,266,949
GM12878_586     59,628,563     49,030,434    82.2             13,838,160
IMR90_672       99,266,806     94,680,468    95.4             25,551,020
IMR90_673      107,407,597     90,914,417    84.6             21,623,779
IMR90_674       14,526,316     12,731,251    87.6              2,345,638
IMR90_675       66,649,091     56,971,226    85.5             26,162,938
IMR90_676      133,365,970    116,341,749    87.2             22,032,033
IMR90_677      118,965,610     94,694,521    79.6             34,048,741
44. Matrix balancing
Goal:
• Correct known and potentially unknown systematic biases
• Generate true contact frequencies that preserve the underlying architecture of the matrix
• Ensure visibility for all loci
Binning size: 10kb, 100kb, and more
Normalization (per chromosome):
• Knight-Ruiz (KR): scale the non-negative contact matrix by diagonal matrices D1 and D2 so that the product P is balanced (doubly stochastic)
• ICE: model the expected contact frequency as the product of biases and relative contact probabilities
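The idea behind matrix balancing can be sketched with a plain iterative-correction loop in the spirit of ICE: repeatedly divide each entry by the relative coverage of its row and column until every bin has the same total. This is a minimal sketch on a hypothetical toy matrix, not the Knight-Ruiz solver and not the FANC or Juicer implementation; the function name and iteration count are assumptions.

```python
def ice_balance(matrix, iterations=50):
    """Iteratively rescale a symmetric, non-negative contact matrix so that
    all non-empty rows end up with (nearly) equal sums.

    Returns the balanced matrix and the accumulated per-bin bias, such that
    matrix[i][j] == balanced[i][j] * bias[i] * bias[j] up to rounding.
    """
    n = len(matrix)
    work = [row[:] for row in matrix]
    bias = [1.0] * n
    for _ in range(iterations):
        sums = [sum(row) for row in work]
        nonzero = [s for s in sums if s > 0]
        mean = sum(nonzero) / len(nonzero)
        # Relative coverage of each bin; empty bins are left untouched.
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return work, bias


# Hypothetical 3-bin contact matrix, for illustration only.
raw = [
    [10.0, 4.0, 2.0],
    [4.0, 20.0, 6.0],
    [2.0, 6.0, 8.0],
]
balanced, bias = ice_balance(raw)
print([round(sum(row), 6) for row in balanced])
print([round(b, 4) for b in bias])
```

After balancing, the row sums are equal, so every locus is equally "visible", while the accumulated bias vector records how much each bin was over- or under-covered in the raw data.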