Multi-Level Comparison of CGI Presence and
Genomic Architecture Across Animal Phylogeny
Christopher Carroll, Lauren Kordonowy, Lindsay Havens, Kaelina Lombardo, Dr. David
Plachetzki, Dr. Matthew MacManes
Introduction:
CpG islands are high-density clusters of cytosine and guanine nucleotides. The islands are
characterized by a lack of 5’-cytosine methylation, in contrast to typical background CpG
dinucleotides. This attribute alone makes them curious genomic structures, as ~80% of CpG dinucleotides
in mammalian genomes are estimated to be methylated [1]. Additionally, vertebrate genomes have
low levels of CpGs because methylated cytosines frequently mutate into thymines during the next
cycle of replication, converting CpGs to TpGs [2].
Research has shown not only that CpG islands (CGIs) strongly correlate with promoter regions and even
exonic regions, but also that the methylation state of CGIs is important for proper transcription. In
mammalian genomes, an estimated 50% of gene promoters are associated with one or more
CGIs [2]. One study found that 72% of promoters in the human genome were associated with CGIs [3]. A
commonly held view is that the methylation state of CGIs regulates transcription by recruiting proteins
of the transcription machinery that recognize unmethylated CpG moieties. This machinery is
thought to alter chromatin configuration, thereby regulating transcription through physical intervention.
Whereas unmethylated CGIs allow transcription and gene expression to occur, methylated CGIs
prevent the transcriptional machinery from accessing the transcription start sites (TSSs) of genes [3, 4].
Irregular methylation of CGIs in promoter regions is associated with disease, most notably
cancer [1, 4, 5]. Additional studies suggest that the biological relevance of CGIs goes beyond aiding
transcription. In the mouse, CGIs in genes on the X chromosome became methylated only after other
mechanisms had already silenced the gene, leading to the hypothesis that methylation is involved in
stabilizing gene silencing and X inactivation [6]. CGIs are also consistently implicated in genomic
imprinting; indeed, CpG methylation is one of the few well-documented factors of imprinting. The
majority of imprinted genes have methylated CGIs on only one of the parental alleles. Experiments in
mouse involving deletion of the methyltransferase gene Dnmt1 produced mice that were deficient in
methylation and lacked imprinting [8]. CGIs are therefore useful gene markers as well as important
epigenetic elements that merit further research.
Since CGIs have been established as useful gene markers and have been suggested to play roles
in so many genetic regulatory processes, this study sought evolutionary context for CGI
development by running a statistical analysis across a representative sample of thirty-four animal
genomes. To provide alternative windows through which to understand and explain possible patterns of
CGIs across this sample, we also examined the relationships between CGIs and genome size, background GC
content, transcription factor (TF) diversity, number of unique protein isoforms, and phylogeny. The
phylogenetic framework for this study is the tree constructed by Dr. Plachetzki and associates [7].
Bioinformatic approaches to analyzing genomic CGI content necessarily involve setting statistical
thresholds. Because there is no definitive boundary at which a CGI starts or stops,
definitions of CGIs are somewhat arbitrary by nature, although the accuracy of a given definition can be
tested against known CGI locations. The parameters can then be applied to an entire analysis. The first
definition of CGIs was proposed by Gardiner-Garden and Frommer: 200-bp sequences with GC content ≥50% and
an observed/expected (O/E) CpG ratio ≥0.6. These parameters were refined experimentally by Takai and
Jones, whose criteria were a sequence length of 500 bp, GC content ≥55%, and an
O/E CpG ratio ≥0.65. These thresholds excluded most of the Alu elements that were called
as CGIs under the original parameters, yet retained most of the 5’ region CGIs [9]. This study took
advantage of the methods created by Takai and Jones.
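The two sets of thresholds above can be illustrated with a minimal sketch. It is not the Takai and Jones code used in this study; it assumes the commonly used O/E formula (number of CpGs × sequence length) / (number of Cs × number of Gs) and treats the stated lengths as minimum lengths. The function and constant names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).

    Assumption: this is the commonly used form of the ratio, not
    necessarily the exact expression in the code used by the study.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_criteria(seq: str, min_len: int, min_gc: float, min_oe: float) -> bool:
    """True if seq passes the length, GC content, and O/E thresholds."""
    return (
        len(seq) >= min_len
        and gc_content(seq) >= min_gc
        and cpg_obs_exp(seq) >= min_oe
    )


# Thresholds as stated in the text; lengths are treated as minimums (assumption).
GARDINER_GARDEN_FROMMER = {"min_len": 200, "min_gc": 0.50, "min_oe": 0.60}
TAKAI_JONES = {"min_len": 500, "min_gc": 0.55, "min_oe": 0.65}
```

For example, a 200-bp run of alternating C and G passes the first definition but fails the second on length alone, which shows how the stricter length cutoff removes short CpG-rich elements.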
This study hypothesizes that the number of CGIs per genome increases as a function of
genome complexity, that is, with the number of unique TFs and coding sequences (CDSs), which give rise to
greater expression variation and control. Accordingly, this study also predicts that CGI distribution
across animal phylogeny should have explanatory power for the relatedness of species and the
topology of the given phylogenetic tree.
Methods:
Although the algorithm proposed by Takai and Jones was used in this study, one notable factor
was altered for the full analysis. Defining a CGI involves considering the GC content of a stretch of DNA,
but this consideration depends on the background GC content. Statistical thresholds for
CGIs have involved setting a GC% parameter; the Takai and Jones algorithm used GC
content ≥55%. However, GC content varies widely within genomes and between species. Average
GC content of 100-kb fragments in humans ranges from 35% to 60%, a range twice as wide as that found in
teleostean fishes [10]. Ignoring this dynamic range and assuming a fixed percentage
across species might cause data to be analyzed erroneously. A previous study found that 28% of human
gene promoters had CpG content similar to that of the background GC; these were classified as low CpG
concentration (LCG) promoters [3]. To account for this, this study employed a GC content threshold of
≥15% of the background GC. The remaining parameters and code set by Takai and Jones were used
unaltered.
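A background-relative threshold of this kind can be sketched as follows. The text does not state whether the 15% margin is relative or absolute; the sketch assumes, purely for illustration, a relative margin (background GC × 1.15). The function names and the default window handling are hypothetical and are not taken from the code used in this study.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int) -> List[float]:
    """GC fraction of each full, non-overlapping window of the given size."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


def relative_gc_threshold(background_gc: float, margin: float = 0.15) -> float:
    """GC cutoff scaled to the genome's background GC fraction.

    ASSUMPTION: the margin is applied multiplicatively, i.e.
    background * (1 + margin), capped at 1.0. The source text does not
    specify this interpretation.
    """
    return min(1.0, background_gc * (1.0 + margin))
```

Under this assumption, a genome with a background GC fraction of 0.40 would use a cutoff near 0.46 rather than the fixed 0.55, and `windowed_gc` gives the spread of GC content across fragments of a chosen size.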
Thirty-four genomes, representing the spectrum of animal evolution and diversity, were analyzed.
The annotation software CEGMA was used to assess the completeness of the genome assemblies [11].
Assembly quality was measured as the number of contigs per Mb of genome size. CGI data were
characterized as both the raw count of CGIs per genome and CGI density, that is, the number of CGIs
per Mb. This study also looked for relationships between transcription diversity
and CGIs. To this end, TransDecoder was used to cluster similar isoforms, ensuring that the
proteins considered in the results were unique [12]. Additionally, Pfam was used to extract unique
transcription factors, giving a reliable reference for the relative complexity of transcription between
species [13].
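Both assembly quality and CGI density are counts normalized per Mb of genome. A minimal sketch of that normalization, assuming 1 Mb = 1,000,000 bp and using a hypothetical helper name and made-up example values:

```python
def per_mb(count: int, genome_size_bp: int) -> float:
    """Normalize a raw count to a per-megabase density (1 Mb = 1,000,000 bp)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return count / (genome_size_bp / 1_000_000)


# Hypothetical values for illustration only; not data from this study.
cgi_density = per_mb(30_000, 3_000_000_000)      # CGIs per Mb
assembly_quality = per_mb(1_500, 100_000_000)    # contigs per Mb
```

The same helper serves both measures, since each is simply a count divided by genome size in Mb.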
To determine whether the number of CGIs or CGI density carries phylogenetic signal, that is,
the degree to which CGI data can explain the topology of the tree and the relatedness of species, the
data were compared to the phylogenetic tree constructed using Bayesian methods by Plachetzki and
associates [7] (Figure 1). The tree is rooted by the animal outgroup consisting of Monosiga brevicollis and
Salpingoeca rosetta. For comparison with any signal given by the CGI data, unique TFs, background GC
content, genome size, average O/E, and unique CDSs were also tested for phylogenetic signal. The
aforementioned genomic features of each species were plotted against number of CGIs and CGI density for
regression analysis. Each feature was also plotted against genome size to serve as the null hypothesis in
each case. The P-value and R-squared value are reported in each correlation graph, and the P-values for
each phylogenetic signal analysis are given.
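The regression statistics can be illustrated with a small standard-library sketch. R-squared is computed as the squared Pearson correlation, which holds for simple linear regression. The text does not state which significance test was used, so the P-value here comes from a permutation test substituted for illustration; it is not the study's method, and the function names are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def regression_summary(
    x: Sequence[float], y: Sequence[float], n_perm: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Return (R-squared, two-sided permutation P-value) for y regressed on x.

    ASSUMPTION: a permutation test stands in for whatever parametric
    test produced the P-values reported in the figures.
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("x and y must have equal length of at least 3")
    r_obs = pearson_r(x, y)
    rng = random.Random(seed)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, shuffled)) >= abs(r_obs) - 1e-12:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return r_obs ** 2, p_value
```

Calling `regression_summary(genome_sizes, cgi_counts)` on two equal-length lists of per-species values would give the pair of statistics shown on each correlation graph, with the caveat that the P-value is resampling-based.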
Figure 1: Topology of the tree constructed using Bayesian methods
Results:
Figure 21: Assembly quality (# contigs/Mb) does not display phylogenetic signal.
Figure 22: Genome size (Mb) does not display phylogenetic signal.
Figure 23: Unique TFs (# unique TFs) interestingly display similarly insignificant phylogenetic signal to
that of genome size.
Figure 24: Unique CDSs (# unique CDSs) do not display phylogenetic signal.
Figure 25: Average O/E GC content of CGIs gives significant phylogenetic signal.
Figure 26: Like average O/E GC content of CGIs, background GC displays phylogenetic signal.
Figure 27: Unique TFs/unique CDSs display significant phylogenetic signal.
Discussion:
The results of this study reject the hypothesis: neither CGI density nor the number of CGIs per
genome relates significantly to an increase in unique CDSs or unique TFs across a representative sample
of the animal kingdom. Instead, genome size was the biggest contributor to an increase in CGIs,
positively driving the correlations. This does not necessarily indicate a lack of relationship between CGI
distribution and significant aspects of genome architecture such as CDSs or TFs, but it casts
uncertainty on the correlational data. Two of the most interesting results from the phylogenetic
analysis are that average O/E GC content of CGIs (Figure 25) and background GC content (Figure
26) both display significant phylogenetic signal, suggesting that general genomic GC content has been
maintained throughout animal phylogeny. Another interesting result is that the ratio of
TFs to CDSs (Figure 27) also gives significant phylogenetic signal, even though the ratio correlates
strongly with genome size. Number of CGIs (Figure 29) demonstrates punctuation across the species,
which is echoed, to a less exaggerated degree, by the ratio of TFs to CDSs (Figure 27). This observation
underpins a general one about this study’s analysis: in all relationships tested, there is a large degree
of variation. Even if the relationships explain a quarter to half of the variation across the entire
kingdom, as is observed, it is difficult to ignore the strong outliers. As the observed punctuation in
the phylogenetic graphs suggests, greater resolution might be obtained from this kind of analysis by
narrowing its scope to an area of punctuation, such as the vertebrates, and comparing
results for each punctuated group. Genome size (Figure 22) may be said to demonstrate a small amount of
this punctuation as well.
As in all studies requiring statistical parameters, this study is limited by the definitions it uses.
Because no structural, physical border separates a CGI from the background, defined
parameters must be used to conduct the investigation, and any definition of a CGI is therefore somewhat
arbitrary. This study has operated on the assumption that the parameters capture, to the degree of
confidence reflected in the literature, the biologically relevant moieties. This limitation underscores
the need for biochemical, functional assays to supplement and aid this kind of research.
Figure 28: CGI density (# CGIs/Mb) does not display phylogenetic signal.
Figure 29: Number of CGIs (# CGIs), although closer to significance than CGI density, does not give
phylogenetic signal.
References:
1) Zhao, Z., Han, L. (2009) CpG islands: algorithms and applications in methylation studies. BBRC. 382:
4; 643-645
2) Ioshikhes, I. P., Zhang, M. Q. (2000) Large-scale human promoter mapping using CpG islands. Nature
Genetics. 26: 61-63
3) Saxonov, S., Berg, P., Brutlag, D. L. (2006) A genome-wide analysis of CpG dinucleotides in the
human genome distinguishes two distinct classes of promoters. PNAS. 103: 5; 1412-1417
4) Deaton, A. M., Bird, A. (2011) CpG islands and regulation of transcription. Genes and Dev. 25:
1010-1022
5) Elango, N., Yi, S. V. (2008) DNA methylation and structural and functional bimodality of
vertebrate promoters. Mol Biol Evol. 25: 8; 1602-1608
6) Bird, A. (2002) DNA methylation patterns and epigenetic memory. Genes and Dev. 16: 6-21
7) Borowiec, M. L., Lee, E. K., Chiu, J. C., Plachetzki, D. C. (2015) Dissecting phylogenetic signal and
accounting for bias in whole-genome data sets: a case study of the Metazoa. BioRxiv. doi:
http://dx.doi.org/10.1101/013946
8) Feil, R., Khosla, S. (1999) Genomic imprinting in mammals: an interplay between chromatin and DNA
methylation? Trends in Genetics. 15: 11; 431-435
9) Zhao, Z., Han, L. (2009) CpG islands: algorithms and applications in methylation studies. BBRC. 382:
4; 643-645
10) Romiguier, J., Ranwez, V., Douzery, E. J. P., Galtier, N. (2010) Contrasting GC-content dynamics
across 33 mammalian genomes: relationship with life-history traits and chromosome sizes. Genome Res.
20: 1001-1009
11) Parra, G., Bradnam, K., Korf, I. (2007) CEGMA: a pipeline to accurately annotate core genes in
eukaryotic genomes. Bioinformatics. 23: 1061-1067
12) Haas, B., Papanicolaou, A. (2015) TransDecoder (Find Coding Regions Within Transcripts).
http://transdecoder.github.io/
13) Finn, R. D. et al. (2014) Pfam: the protein families database. Nucl. Acids Res. 42: D1; D222-D230