Organizational Patterns Reveal Genome Heterogeneity

Organizational Heterogeneity of Human Genome:
Significant variation of recombination rate of 100 kbp sequences within GC ranges
Svetlana Frenkel
Valery Kirzhner
Abraham Korol
Department of Evolutionary and Environmental Biology
Institute of Evolution
University of Haifa

Some aspects of intra-genome heterogeneity
• Varying gene density
• Clusters of tissue-specific and housekeeping genes
• Linkage disequilibrium (LD) blocks
• Mutation and recombination rates
• Conserved and ultraconserved segments
• Localization of inversions, deletions, insertions and duplications

Genome Heterogeneity: GC content
From: Costantini, M., Clay, O., Auletta, F., Bernardi, G. (2006) An isochore map of human chromosomes. Genome Res., 16, 536-541.
From: UHN Microarray Centre's CpG Island Database, http://data.microarrays.ca/cpg/index.htm
The level of redness denotes the relative number of CpG islands that can be located on the chromosome in that region.

Genome Signature
Samuel Karlin et al., 1997
Local:
• preliminary searches of candidates for gene alignment
• detecting candidate regulatory signals
• detecting promoter regions
• detecting repetitive elements
• duplications of genomic
• horizontal gene transfer
Genome-wide:
• phylogenetic analysis
• species recognition
• whole-genome sequence comparisons

Linguistic-like methods
• Detecting all “words” with a certain maximal length
• Characterizing the sequence “vocabulary”
• Scoring the occurrences of fixed-length “words” from a predefined “vocabulary”
• Comparison of “word” frequencies obtained from different sequences
• Comparison of the “vocabularies” of different sequences
• Compositional Spectra Analysis

Compositional Spectra
• A linguistic-like method of genome analysis based on occurrences of “words” in the A,C,G,T alphabet
• Compositional spectrum (CS) is measured as a histogram of imperfect word occurrences
From: V. Kirzhner et al., 2002-2005
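The idea of a compositional spectrum as a histogram of imperfect word occurrences can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy word set, and the choice of a Hamming-distance threshold (`max_mismatches`) as the definition of an "imperfect" match are all assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(sequence, words, max_mismatches=1):
    """Histogram of imperfect occurrences (hypothetical helper).

    For every word, count the windows of `sequence` that match it with
    at most `max_mismatches` substitutions. All words are assumed to
    have the same length.
    """
    k = len(words[0])
    counts = [0] * len(words)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        for idx, word in enumerate(words):
            if hamming(window, word) <= max_mismatches:
                counts[idx] += 1
    return counts


# Toy example with 3-letter words; the slides use much longer words
# and a far larger word set.
toy_words = ["ACG", "TTT"]
print(compositional_spectrum("ACGTTTACC", toy_words, max_mismatches=1))
# → [2, 3]
```

With `max_mismatches=0` the same function reduces to exact word counting, which makes the effect of allowing imperfect matches easy to compare.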

Methods: calculating distances
[Figure: compositional spectra F(Si, W), F(S'i, W), F(Sj, W) and F(S'j, W), with strand ends labeled 5' and 3' and distances d1, d'1, d2, d'2; also F(Si, W') and F(Sj, W')]
Distance measures:
• Manhattan (city block) distance
• Spearman rank correlation ρ (d = 1 − ρ)
• Kendall distance τ
d = min(di, d'i, dj, d'j)
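One possible reading of the distance scheme above is sketched below. It is an assumption, not something the slide states explicitly, that S' denotes the reverse-complement strand of S and that the final distance is the minimum over the strand pairings. The helper names are hypothetical, and exact word counts are used in place of imperfect matches to keep the example short.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def word_frequencies(seq, words):
    """Relative frequencies of exact occurrences of each word in seq."""
    k = len(words[0])
    counts = {w: 0 for w in words}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(p, q):
    """Manhattan (city block) distance between two spectra."""
    return sum(abs(a - b) for a, b in zip(p, q))


def strand_aware_distance(seq_i, seq_j, words):
    """Minimum Manhattan distance over all four strand pairings
    (assumed interpretation of d = min(...))."""
    strands_i = (seq_i, reverse_complement(seq_i))
    strands_j = (seq_j, reverse_complement(seq_j))
    return min(
        manhattan(word_frequencies(a, words), word_frequencies(b, words))
        for a in strands_i
        for b in strands_j
    )


words = ["AA", "TT", "AC", "GT"]
# "GTTTTT" is the reverse complement of "AAAAAC", so the strand-aware
# distance is zero although the direct spectra differ completely.
print(strand_aware_distance("AAAAAC", "GTTTTT", words))
```

Taking the minimum makes the comparison insensitive to which strand of each segment happened to be sequenced, which is the usual motivation for this kind of symmetrization.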

Methods: Detection of Organizational Pattern groups of segments
[Figure: clustering tree; axis: genome segment number; scale: low to high; annotations: relative distance between two clusters, maximal distance between segments]
Neighbor-Joining clustering with an “adaptive cutoff”

Analysis of Organizational Pattern
groups of segments

Significant variation of evolutionary features of 100 kbp sequences within GC ranges
Testing for potential associations between the genome-wide distribution of organizational patterns (OPs) and various evolutionary and structural features reveals inter-OP heterogeneity in SNP and indel frequency, recombination rate, number of segmental duplications, size of linkage disequilibrium blocks, and proportion of evolutionarily conserved sequence.

Estimation of heterogeneity
between OP groups
[Figure axes: GC; recombination rate]

Estimation of heterogeneity
between OP groups
[Figure: −log(FDR-corrected p-value) plotted against GC]
First row of values on the slide: 0.22, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 0.03, 1.9×10^-3, 0.01, 0.11, 3.9×10^-3
Second row of values on the slide: 2.3, 5.1, 86.1, 48.6, 81.9, 35.7, 21.0, 26.0, 46.7, 36.6, 13.6, 15.7, 15.5, 16.9
• Kruskal–Wallis non-parametric rank test
• 10,000 segment reshuffles to estimate the critical value of the test
• FDR correction for multiple comparisons
• Reshuffled sequences within every segment as control
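The testing scheme listed above (a Kruskal–Wallis rank test, reshuffling to obtain an empirical significance level, and FDR correction) can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the H statistic omits the tie correction, the reshuffle permutes feature values among groups, and Benjamini–Hochberg is assumed as the FDR procedure since the slide does not name one. The input numbers are invented toy values.

```python
import random


def rank_with_ties(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction, for brevity)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = rank_with_ties(pooled)
    total, start = 0.0, 0
    for g in groups:
        total += sum(ranks[start:start + len(g)]) ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(groups, n_shuffles=10000, seed=0):
    """Empirical p-value: share of reshuffles with H >= observed H."""
    rng = random.Random(seed)
    observed = kruskal_wallis_h(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if kruskal_wallis_h(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: recombination rates of segments in two hypothetical OP groups,
# for two hypothetical GC ranges.
gc_ranges = {
    "GC 40%": [[0.9, 1.0, 1.1, 0.8], [1.5, 1.6, 1.4, 1.7]],
    "GC 41%": [[1.1, 1.3, 1.2, 1.0], [1.2, 1.1, 1.3, 1.0]],
}
raw = [permutation_p_value(g, n_shuffles=2000) for g in gc_ranges.values()]
print(dict(zip(gc_ranges, benjamini_hochberg(raw))))
```

One test is run per GC range, so the FDR step is applied across the GC ranges rather than within one.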

Detecting the words related to
recombination rate
GC, %   Average RR in the compared OPGs   Proportion of correct classifications of segments to OP groups, %
        low RR    high RR                 all words    set of 47 words    set of 8 words
35      0.82      0.93                    98.60        98.62              76.03
36      0.62      1.16                    98.40        96.56              82.34
37      0.83      1.28                    94.10        93.88              80.47
38      0.80      1.46                    99.58        99.17              98.33
39      0.91      1.59                    97.32        97.32              96.55
40      0.96      1.50                    100.0        100.0              100.0
41      1.13      1.81                    98.80        98.50              98.50
42      1.05      1.80                    100.0        100.0              99.62
43      1.29      1.99                    97.48        96.98              95.46
44      1.44      1.83                    99.01        99.21              98.81
45      1.35      2.06                    100          98.93              98.22
46      1.30      1.88                    98.53        98.53              97.35
47      1.15      1.74                    94.62        94.61              91.48
48      1.33      2.04                    98.78        98.77              97.55

Oligonucleotides that showed high importance in more than half of the OPG comparisons when classifying 100 kbp segments by high versus low recombination rate
Columns: "In top 10" = times the oligonucleotide appeared in the list of the 10 most important variables; "Top 1" = times it appeared as the most important variable. In each alignment, the previously described pattern is on the first line and the oligonucleotide on the second.

Oligonucleotide  GC, %  In top 10  Top 1  Previously described pattern  Reference
CAGCCAGGTT       60     11         4      -CCNCCNTNNCCNC-               Myers et al. 2008
                                          -CAGCCAGGTT----
GACCGGACTG       70     10         1      ---CCTCCCT--                  Myers et al. 2005
                                          -GACCGGACTG-
                                          -CCNCCNTNNCCNC-               Myers et al. 2008
                                          ---GACCGGACTG--
CGCCGGGACT       80     10         3      -CCNCCNTNNCCNC-               Myers et al. 2008
                                          --CGCCGGGACT---
GCGTAGGCTA       60     9          0      -CCNCCNTNNCCNC-               Myers et al. 2008
                                          ---GCGTAGGCTA--
TGGGCCCGGC       90     8          4      n/a
GGCGTGCGCG       90     8          1      -GGNGGNAGGGG-                 Zheng et al. 2010
                                          -GGCGTGCGCG--
                                          -CCNCCNTNNCCNC-               Myers et al. 2008
                                          ---GGCGTGCGCG--
CCCGGTATCG       70     8          0      -CCNCCNTNNCCNC-               Myers et al. 2008
                                          --CCCGGTATCG---
GCCCTTTCCT       60     7          0      ---CCTCCCT--                  Myers et al. 2005
                                          -GCCCTTTCCT-
                                          -CCNCCNTNNCCNC-               Myers et al. 2008
                                          ---GCCCTTTCCT--
                                          -CCTCCCTNNCCAC-               Myers et al. 2008
                                          ---GCCCTTTCCT--

Functionally related genes tend to reside in organizationally similar genomic regions
Genes that provided the GO enrichment of the four organizational pattern clusters with the most significant GO enrichments:
• The L2-a cluster is enriched for the “mitochondrion”, “intracellular non-membrane-bounded organelle”, “nuclear envelope” and “ribonucleoprotein complex” GO terms;
• The L2-h cluster is enriched for the “G-protein-coupled receptor protein signaling pathway” and “sensory perception of smell” GO terms;
• The H1-i cluster is enriched for the “epithelial cell differentiation” and “epithelium development” GO terms;
• The H2-a cluster is enriched for the “skeletal system development” GO term.
Paz A, Frenkel S, Snir S, Kirzhner V, Korol A. 2014. BMC Genomics 15:252.

Thank you for your attention
Acknowledgments
Dr. Valery Kirzhner
Prof. Abraham Korol
Prof. Edward Trifonov
Dr. Arnon Paz and Dr. Zeev Frenkel
This work was supported by
The Israeli Ministry of Immigrant Absorption
The Israel Council for Higher Education

Calculating compositional
spectra
…
AGTAGTTACA
CTACTATAGT
GACGACTCCA
TCGTCGTCGA
GAACGTACCT
TCTATATCCA
AGGTACTACA
CTCGCGACCG
…
3676
CTACTATAGT
…
…
CTACTATAGT
CTACTAAAGT
CTAGTAAAGT
CTAGTAAAGT
CTAGTAACGT
CGCCTAAAGT
CCACTAAGGT
…
256 × 3676 = 941056 86.7%
Additional slide

Spearman's rank correlation coefficient rho
• Spearman's rank correlation coefficient is a non-parametric measure of correlation.
• ρ is given by:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$
• where:
  • D_i = x_i − y_i is the difference between the ranks of corresponding values X_i and Y_i, and
  • n is the number of values in each data set (same for both sets).
Additional slide
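The rank-difference formula for ρ can be checked with a short sketch. The function names are hypothetical, and the code assumes there are no tied values, which is the case in which this formula is exact.

```python
def simple_ranks(values):
    """1-based ranks of the values, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(D_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = simple_ranks(x), simple_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_distance(x, y):
    """Distance d = 1 - rho, as used for comparing spectra."""
    return 1 - spearman_rho(x, y)


x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(spearman_rho(x, y))  # → 0.8
```

Here the rank differences are 0, −1, 1, −1, 1, so the sum of squares is 4 and ρ = 1 − 24/120 = 0.8.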

The Kendall tau distance
• The Kendall tau distance is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are.
• The Kendall tau distance between two lists τ1 and τ2 is
$$K(\tau_1, \tau_2) = \left|\{(i, j) : i < j,\ (\tau_1(i) < \tau_1(j) \wedge \tau_2(i) > \tau_2(j)) \vee (\tau_1(i) > \tau_1(j) \wedge \tau_2(i) < \tau_2(j))\}\right|$$
where τ1(i) and τ2(i) are the ranks of element i in the two lists.
• K(τ1,τ2) equals 0 if the two lists are identical and n(n − 1) / 2 (where n is the list size) if one list is the reverse of the other. The Kendall tau distance is often normalized by dividing by n(n − 1) / 2, so that a value of 1 indicates maximum disagreement. The normalized Kendall tau distance therefore lies in the interval [0,1].
Additional slide
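Counting pairwise disagreements can be sketched directly from the definition. The function name and the `normalize` flag are hypothetical; the code assumes both lists contain the same distinct items and uses a quadratic-time loop, which is adequate for illustration only.

```python
from itertools import combinations


def kendall_tau_distance(list_a, list_b, normalize=False):
    """Number of item pairs ordered differently in the two lists.

    With normalize=True the count is divided by n(n - 1) / 2, giving a
    value in [0, 1]. Both lists must hold the same distinct items.
    """
    pos_a = {item: rank for rank, item in enumerate(list_a)}
    pos_b = {item: rank for rank, item in enumerate(list_b)}
    disagreements = sum(
        1
        for u, v in combinations(list_a, 2)
        if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0
    )
    if normalize:
        n = len(list_a)
        return disagreements / (n * (n - 1) / 2)
    return disagreements


words = ["AAC", "CGT", "GGA", "TTC"]
print(kendall_tau_distance(words, words))                         # → 0
print(kendall_tau_distance(words, words[::-1]))                   # → 6
print(kendall_tau_distance(words, words[::-1], normalize=True))   # → 1.0
```

For a list of 4 items there are 4 · 3 / 2 = 6 pairs, so a fully reversed list reaches the maximum of 6, or 1.0 after normalization.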
