This document discusses organizational heterogeneity in the human genome. It finds significant variation in recombination rates of 100 kbp sequences within different GC content ranges. Various methods are used to analyze genome heterogeneity, detect organizational patterns of segments, and estimate heterogeneity between organizational pattern groups. Certain oligonucleotides are found to be important in classifying segments by recombination rate. Functionally related genes tend to reside in organizationally similar genomic regions. The work is supported by the Israeli Ministry of Immigrant Absorption and Israel Council for Higher Education.
1. Organizational Heterogeneity of Human Genome:
Significant variation of recombination rate of 100 kbp sequences within GC ranges
Svetlana Frenkel
Valery Kirzhner
Abraham Korol
Department of Evolutionary and Environmental Biology
Institute of Evolution
University of Haifa
2. Some aspects of intra-genome heterogeneity
Varying gene density
Clusters of tissue-specific and housekeeping genes
Linkage disequilibrium (LD) blocks
Mutation and recombination rates
Conserved and ultraconserved segments
Localization of inversions, deletions, insertions and duplications
3. Genome Heterogeneity: GC content
From: Costantini, M., Clay, O., Auletta, F., Bernardi, G. (2006) An isochore map of
human chromosomes. Genome Res., 16, 536-541.
From: UHN Microarray Centre's CpG Island Database
http://data.microarrays.ca/cpg/index.htm
The level of redness denotes the relative number of CpG islands that can be located on the chromosome in that region
4. Genome Signature
Samuel Karlin et al., 1997
Local:
• preliminary searches of candidates for gene alignment
• detecting candidate regulatory signals
• detecting promoter regions
• detecting repetitive elements
• duplications of genomic
• horizontal gene transfer
Genome-wide:
• phylogenetic analysis
• species recognition
• whole-genome sequence comparisons
5. Linguistic-like methods
• Detecting all “words” with a certain maximal length
• Characterizing the sequence “vocabulary”
• Scoring the occurrences of fixed-length “words” from a predefined “vocabulary”
• Comparison of “word” frequencies obtained from different sequences
• Comparison of the “vocabularies” of different sequences
• Compositional Spectra Analysis
6. Compositional Spectra
A linguistic-like method of genome analysis based on occurrences of “words” in the A,C,G,T alphabet
Compositional spectrum (CS) is measured as a histogram of imperfect word occurrences
From: V. Kirzhner et al., 2002-2005
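A minimal sketch of how a histogram of imperfect word occurrences could be computed. It assumes that "imperfect" means a match with at most a fixed number of mismatches (Hamming distance) and that the histogram is normalized to relative frequencies; the function names, the toy vocabulary, and the mismatch limit are illustrative assumptions, not the parameters used by the authors.

```python
def hamming_within(a, b, max_mismatches):
    """Return True if equal-length strings differ in at most max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def compositional_spectrum(sequence, vocabulary, max_mismatches=1):
    """Count imperfect occurrences of each vocabulary word in the sequence.

    Assumption: an occurrence is any window of the word's length that differs
    from the word in at most max_mismatches positions. Counts are normalized
    so that the spectrum sums to 1 (all zeros if nothing matches).
    """
    counts = {word: 0 for word in vocabulary}
    for word in vocabulary:
        k = len(word)
        for start in range(len(sequence) - k + 1):
            if hamming_within(sequence[start:start + k], word, max_mismatches):
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    return {word: count / total for word, count in counts.items()}


# Toy example: two 3-letter words scored on a 10-letter sequence.
spectrum = compositional_spectrum("ACGTACGTTT", ["ACG", "TTT"], max_mismatches=1)
print(spectrum)
```

Allowing mismatches makes each word collect counts from its close variants, so two spectra can be compared even when exact word matches are rare.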
8. Methods: Detection of Organizational Pattern groups of segments
Neighbor-Joining Clustering with “adaptive cutoff”
[Figure: clustering tree over genome segment number; scale labels: Low, High; annotations: relative distance between two clusters, maximal distance between segments]
10. Significant variation of evolutionary features of 100 kbp sequences within GC ranges
Testing for potential association between the genome-wide distribution of organizational patterns (OPs) and various evolutionary and structural features reveals inter-OP heterogeneity in such features as SNP and indel frequency, recombination rate, number of segmental duplications, size of linkage disequilibrium blocks, and proportion of evolutionarily conserved sequence.
12. Estimation of heterogeneity between OP groups
• Kruskal–Wallis non-parametric rank test
• 10,000 segment reshuffles to estimate the test critical value
• FDR correction for multiple comparisons
• Reshuffled sequences within every segment as a control
[Figure: -log(FDR-corrected p-value) by GC]
Values extracted from the figure:
0.22 8.8×10^-5 8.8×10^-5 8.8×10^-5 8.8×10^-5 8.8×10^-5 8.8×10^-5 8.8×10^-5 8.8×10^-5 0.03 1.9×10^-3 0.01 0.11 3.9×10^-3
2.3 5.1 86.1 48.6 81.9 35.7 21.0 26.0 46.7 36.6 13.6 15.7 15.5 16.9
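The testing scheme named on this slide (Kruskal–Wallis test, reshuffles, FDR correction) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the reshuffling is assumed to permute feature values across groups, the H statistic omits the tie correction, and the FDR step is assumed to be Benjamini–Hochberg. All function names are hypothetical.

```python
import random


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        total += rank_sum ** 2 / len(g)
        pos += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(groups, n_reshuffles=10000, seed=0):
    """Estimate a p-value by reshuffling values across groups (assumed scheme)."""
    rng = random.Random(seed)
    observed = kruskal_wallis_h(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_reshuffles):
        rng.shuffle(pooled)
        shuffled, pos = [], 0
        for size in sizes:
            shuffled.append(pooled[pos:pos + size])
            pos += size
        if kruskal_wallis_h(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_reshuffles + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one p-value would be computed per GC range with `permutation_p_value`, and the resulting list passed through `benjamini_hochberg` before taking the negative logarithm for plotting.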
13. Detecting the words related to recombination rate
GC, %   Average RR in the compared OPGs   Proportion of correct classifications of segments to OP groups, %
        low RR     high RR                all words   set of 47 words   set of 8 words
35      0.82       0.93                   98.60       98.62             76.03
36      0.62       1.16                   98.40       96.56             82.34
37      0.83       1.28                   94.10       93.88             80.47
38      0.80       1.46                   99.58       99.17             98.33
39      0.91       1.59                   97.32       97.32             96.55
40      0.96       1.50                   100.0       100.0             100.0
41      1.13       1.81                   98.80       98.50             98.50
42      1.05       1.80                   100.0       100.0             99.62
43      1.29       1.99                   97.48       96.98             95.46
44      1.44       1.83                   99.01       99.21             98.81
45      1.35       2.06                   100.0       98.93             98.22
46      1.30       1.88                   98.53       98.53             97.35
47      1.15       1.74                   94.62       94.61             91.48
48      1.33       2.04                   98.78       98.77             97.55
14. Oligonucleotides that showed high importance in more than half of the OPG comparisons when classifying 100 kbp segments by high and low recombination rate
Columns: Oligonucleotide; GC, %; appeared in the list of 10 most important variables (times); appeared as the most important variable (times).
Under each entry: previously described pattern aligned above the oligonucleotide, followed by the reference.

CAGCCAGGTT   60   11   4
    -CCNCCNTNNCCNC-
    -CAGCCAGGTT----    Myers et al. 2008
GACCGGACTG   70   10   1
    ---CCTCCCT--
    -GACCGGACTG-       Myers et al. 2005
    -CCNCCNTNNCCNC-
    ---GACCGGACTG--    Myers et al. 2008
CGCCGGGACT   80   10   3
    -CCNCCNTNNCCNC-
    --CGCCGGGACT---    Myers et al. 2008
GCGTAGGCTA   60   9    0
    -CCNCCNTNNCCNC-
    ---GCGTAGGCTA--    Myers et al. 2008
TGGGCCCGGC   90   8    4
    n/a
GGCGTGCGCG   90   8    1
    -GGNGGNAGGGG-
    -GGCGTGCGCG--      Zheng et al. 2010
    -CCNCCNTNNCCNC-
    ---GGCGTGCGCG--    Myers et al. 2008
CCCGGTATCG   70   8    0
    -CCNCCNTNNCCNC-
    --CCCGGTATCG---    Myers et al. 2008
GCCCTTTCCT   60   7    0
    ---CCTCCCT--
    -GCCCTTTCCT-       Myers et al. 2005
    -CCNCCNTNNCCNC-
    ---GCCCTTTCCT--    Myers et al. 2008
    -CCTCCCTNNCCAC-
    ---GCCCTTTCCT--    Myers et al. 2008
15. Functionally related genes tend to reside in organizationally similar genomic regions
Genes that provided the GO enrichment of the four organizational pattern clusters with the most significant GO enrichments:
• L2-a cluster is enriched for “mitochondrion”, “intracellular non-membrane-bounded organelle”, “nuclear envelope” and “ribonucleoprotein complex” GO terms;
• L2-h cluster is enriched for “G-protein-coupled receptor protein signaling pathway” and “sensory perception of smell” GO terms;
• H1-i cluster is enriched for “epithelial cell differentiation” and “epithelium development” GO terms;
• H2-a cluster is enriched for “skeletal system development” GO term.
Paz A, Frenkel S, Snir S, Kirzhner V, Korol A. 2014. BMC Genomics 15:252.
16. Thank you for your attention
Acknowledgments
Dr. Valery Kirzhner
Prof. Abraham Korol
Prof. Edward Trifonov
Dr. Arnon Paz and Dr. Zeev Frenkel
This work was supported by
The Israeli Ministry of Immigrant Absorption
The Israel Council for Higher Education
18. Spearman's rank correlation coefficient rho
Spearman's rank correlation coefficient is a non-parametric measure of correlation.
ρ is given by:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$
where:
• Di = xi − yi = the difference between the ranks of corresponding values Xi and Yi, and
• n = the number of values in each data set (same for both sets).
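A small sketch of the rank-difference computation, assuming no tied values in either data set (the closed form 1 − 6·ΣDi² / (n(n² − 1)) is exact only in that case). The function names are illustrative.

```python
def ranks_no_ties(values):
    """Return 1-based ranks of the values; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    rank_x, rank_y = ranks_no_ties(x), ranks_no_ties(y)
    # D_i is the difference between the ranks of corresponding values.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Two adjacent swaps in a list of five values.
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```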
Additional slide
19. The Kendall tau distance
The Kendall tau distance is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are.
The Kendall tau distance between two lists τ1 and τ2 is
$$K(\tau_1,\tau_2) = \left|\{(i,j) : i < j,\ (\tau_1(i) < \tau_1(j) \wedge \tau_2(i) > \tau_2(j)) \vee (\tau_1(i) > \tau_1(j) \wedge \tau_2(i) < \tau_2(j))\}\right|$$
where τ1(i) and τ2(i) are the ranks of element i in the two lists.
K(τ1,τ2) will be equal to 0 if the two lists are identical and n(n − 1) / 2 (where n is the list size) if one list is the reverse of the other. Often the Kendall tau distance is normalized by dividing by n(n − 1) / 2, so a value of 1 indicates maximum disagreement. The normalized Kendall tau distance therefore lies in the interval [0,1].
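A short sketch of counting pairwise disagreements between two lists, assuming both lists contain the same distinct items in different orders. The function name and the `normalize` flag are illustrative.

```python
from itertools import combinations


def kendall_tau_distance(list1, list2, normalize=False):
    """Count item pairs that appear in opposite relative order in the two lists."""
    if (len(list1) != len(list2)
            or len(set(list1)) != len(list1)
            or set(list1) != set(list2)):
        raise ValueError("lists must contain the same distinct items")
    position1 = {item: i for i, item in enumerate(list1)}
    position2 = {item: i for i, item in enumerate(list2)}
    disagreements = sum(
        1
        for a, b in combinations(list1, 2)
        # A negative product means the pair is ordered differently in the two lists.
        if (position1[a] - position1[b]) * (position2[a] - position2[b]) < 0
    )
    n = len(list1)
    if normalize:
        # Divide by the maximum possible number of disagreements, n(n - 1) / 2.
        return disagreements / (n * (n - 1) / 2) if n > 1 else 0.0
    return disagreements


print(kendall_tau_distance([1, 2, 3, 4, 5], [3, 4, 1, 2, 5]))
print(kendall_tau_distance([1, 2, 3], [3, 2, 1], normalize=True))
```

This pairwise loop is quadratic in the list size, which is fine for short rankings; longer lists would call for a merge-sort-based inversion count.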
Additional slide