Verifying the role of activation-induced
deaminase in chronic lymphocytic leukemia
Charlotte Broadbent
Ward Melville High School
380 Old Town Road
East Setauket, NY 11733
Abstract
Chronic lymphocytic leukemia (CLL) is a relatively common disease amongst aging
adults (Cheson, 2001). To successfully design a drug to combat this disease, more background on
its mechanisms is vital. One aspect of CLL that remains unclear is the role of the enzyme
activation-induced deaminase (AID). AID is normally involved in somatic hypermutation, a
mechanism B cells use to diversify their variable regions to combat pathogens. AID has
already been associated with CLL, but the sequencing method used, known as pyrosequencing, is
not accurate enough to detect all of the variable region clones when analyzing specific sequences
such as AID hotspots (Ansorge, 2009). The goal of this research was to verify the role of AID
using a statistical method known as bootstrapping, or resampling, so these inaccuracies would
not affect the total distribution of variable region clones. Two parameters were measured to test
for AID activity – mutations in the variable region compared to the constant region, and
mutations at AID hotspots compared to total mutations. Results from both tests were statistically
significant, verifying the role of AID in chronic lymphocytic leukemia.
Introduction
B-cell chronic lymphocytic leukemia (B-CLL) is the most prevalent type of leukemia
among aging Caucasians (Cheson, 2001), and is a disease which is currently incurable. Patients
with B-CLL follow one of two clinical outcomes: an indolent outcome, in which median
survival is greater than 25 years, and an aggressive outcome, in which those
afflicted decline relatively rapidly, with a median survival of less than 8 years (Chiorazzi et
al., 2005). The ineffectiveness of treatment in prolonging lifespan (Rai et al., 2000) has led many
physicians to delay aggressive treatment until the nature of the disease is evident. However, the
two clinical outcomes have been associated with a biological difference – those patients whose
immunoglobulin variable-region heavy chain (IgVH) is relatively free of mutations follow the
more aggressive outcome, while patients whose IgVH region contains considerable mutations
follow the more indolent outcome (Fais et al., 1998).
One of the primary defenses of the immune system is the diverse repertoire of B cells that
identify pathogens in the body. Unique antigen receptors on the surface of the B cells bind to
foreign antigens, after which the B cell divides and produces millions of antibodies to destroy the
pathogen. These antigen receptors consist of a variable (V) region and a constant (C) region,
and each is unique due to variation in the amino acid sequence of its antigen-binding site. The
distribution of V regions in immunoglobulin genes will always have a dominant clone, known as
the consensus sequence. This clone accounts for approximately 95% of the V region clones.
However, for this process to work, the unique variable region on each antigen receptor must have
an efficient method of differentiating.
Somatic hypermutation is the process which diversifies B cells so they can recognize
threats to the immune system. The enzyme AID (activation-induced deaminase) induces point
mutations, usually at locations in the variable region of the DNA strand known as WRC hotspots,
where W represents A or T, R represents G or A, and C is always cytosine, or at their reverse
complement, GYW hotspots, where G is always guanine, Y represents a pyrimidine (C or T), and W
again represents A or T. AID tends to avoid SYC cold spots, where S represents G or C. Somatic
hypermutation is responsible for many of the
mutations in immunoglobulin genes – this research seeks to verify, using statistical methods,
whether the mutations in cells of patients with the indolent form of B-CLL are in fact caused by
AID activity, since it is already known that AID is associated with CLL (Patten et al., 2012).
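
The motif definitions above can be written down directly as patterns. The following is a minimal Python sketch (the analysis in this paper was done in R, so this is an illustration rather than the code used); the names HOTSPOT_WRC, HOTSPOT_GYW, COLDSPOT_SYC and motif_starts are hypothetical.

```python
import re

# IUPAC codes from the text: W = A/T, R = A/G, Y = C/T, S = C/G.
# Lookaheads are used so that overlapping motifs are all reported.
HOTSPOT_WRC = re.compile(r"(?=([AT][AG]C))")   # hotspot on the given strand
HOTSPOT_GYW = re.compile(r"(?=(G[CT][AT]))")   # reverse complement of WRC
COLDSPOT_SYC = re.compile(r"(?=([CG][CT]C))")  # coldspot

def motif_starts(pattern, seq):
    """Return the 0-based start position of every (possibly overlapping) match."""
    return [m.start() for m in pattern.finditer(seq.upper())]

if __name__ == "__main__":
    example = "AGCTGCA"  # made-up sequence, not patient data
    print(motif_starts(HOTSPOT_WRC, example))
    print(motif_starts(HOTSPOT_GYW, example))
    print(motif_starts(COLDSPOT_SYC, example))
```

Because GYW is the reverse complement of WRC, scanning one strand for both patterns covers AID targeting of either strand.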
The DNA of leukemic and other cells has been sequenced since 1977 primarily via the
chain termination sequencing method, also known as the Sanger method (Sanger et al., 1977).
This method has several advantages: it can sequence strands up to five hundred base pairs in
length with fairly high accuracy. However, the Sanger method also has limitations: there are few
opportunities for parallel sequencing, making sequencing of large quantities of DNA very
difficult. Also, sample preparation can take up to four hours (Fakruddin et al., 2012).
These limitations are addressed by a relatively new method known as pyrosequencing.
Pyrosequencing has many distinct advantages over the Sanger method. Even though it can only
process strands around four hundred base pairs in length, it can sequence many strands in
parallel, making it more efficient for much larger quantities of DNA. It eliminates the need to
label the primers, which are the synthetically prepared starting points for the DNA polymerase.
Pyrosequencing is also easy to automate, the cost is lower because of the smaller strands, and
preparation time is as little as fifteen minutes (Fakruddin et al., 2012). In this project, each
sequence was a different B cell's variable region. Pyrosequencing was therefore the logical
choice, due to the immense quantity of sequences that were each relatively short in length.
However, the pyrosequencing method is much more error prone than the Sanger method,
resulting in many inaccuracies throughout the sequences (Ansorge, 2009). Another problem with
this technique is that the number of sequences changes during the course of the experiment. After
AID is stimulated and the results are analyzed, there is often a discrepancy between how many
sequences were originally counted and how many remain. This is attributed to the death of the
cells containing the DNA during the course of the experiment. There is therefore uncertainty in
the exact number and type of variable region clones after AID stimulation.
These problems can be solved by applying a statistical technique known as resampling, or
bootstrapping. The term “bootstrap” was first used by Efron in 1979 (Efron, 1979). In particular,
he described the nonparametric bootstrap, in which resamples of size n are drawn with replacement
from an original sample of size n (Geyer, 2006). Therefore, when a sample is resampled thousands
of times, a more representative distribution of possible samples can be obtained regardless of the
shape of the distribution. In this case, this technique allows for a more representative distribution
of immunoglobulin mutations, allowing confirmation of AID activity in leukemic cells. Such a
confirmation would be instrumental in aiding drug design for CLL.
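
As a generic illustration of the nonparametric bootstrap described above, the following Python sketch resamples a small data set with replacement and collects the mean of each resample. It is not the code used in this study; the function name bootstrap_means, the choice of the mean as the statistic, and the example numbers are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Draw n_resamples resamples of size n (with replacement) from a
    sample of size n and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]  # made-up values
    means = bootstrap_means(data, n_resamples=2000)
    # The spread of the resampled means approximates the sampling
    # variability of the mean without assuming a distribution shape.
    print(min(means), max(means))
```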
Methods
Sequence data were obtained from collaborators at Northshore LIJ. DNA from patients
with chronic lymphocytic leukemia was sequenced using Roche 454 GS FLX pyrosequencing
technology, both before and after AID stimulation. Here stimulation means that different
experiments were performed to create an AID expression environment. The sequencing was
performed for different amounts of time for different patients, lasting up to ninety-seven days.
Each day, the DNA was sequenced once from the 5'-end of the chain of bases and once from the
3'-end. The first 20 nucleotides from both ends were removed to eliminate the primer. The data
from the 5'-end sequences were considered to be the variable region of the immunoglobulin
gene, while the first 103 nucleotides from the 3'-end sequence were considered to be the constant
region of the immunoglobulin heavy chains.
All analysis was performed using the statistical program R. In addition, the R packages
Biostrings, ape, lattice, and ShortRead were used.
The data was first resampled using the bootstrap technique. A function was created to
identify the unique sequences of variable regions in each patient. The output of this function,
readUniqueSequences, was four different parameters: the unique sequences (vUniqueSeqs), the
associated number of times each of those sequences appeared (vCounts), the consensus sequence
(consensusSequence), and the length of the consensus sequence (blockWidth). These values were
then used in the bootstrap function, getBootstrapCount. The R sample function was used, taking
a sample of size 5000 from the values 1 through the length of vUniqueSeqs, with replacement,
using vCounts as the probability distribution, so that the output was a set of 5000 integers
ranging from 1 to the length of vUniqueSeqs. Repeated values were then removed using the R
function unique. Finally, each remaining index was mapped back to its entry in vUniqueSeqs,
so the final variable calculated, bootstrapUniqueSequences, was a list of at most 5000 unique
sequences that was representative of the original set of sequences.
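
The resampling step described above can be sketched as follows. This is a Python approximation of the logic, not the original R functions; the names read_unique_sequences and get_bootstrap_unique mirror readUniqueSequences and getBootstrapCount but are hypothetical, and taking the most abundant sequence as the consensus is an assumption based on the description of the dominant clone in the Introduction.

```python
import random
from collections import Counter

def read_unique_sequences(reads):
    """Return unique sequences, their counts, the consensus (assumed here
    to be the most abundant sequence), and the consensus length."""
    counts = Counter(reads)
    unique_seqs = list(counts)
    v_counts = [counts[s] for s in unique_seqs]
    consensus = counts.most_common(1)[0][0]
    return unique_seqs, v_counts, consensus, len(consensus)

def get_bootstrap_unique(unique_seqs, v_counts, size=5000, seed=0):
    """Sample `size` indices with replacement, weighted by v_counts, then
    drop repeated indices and map the rest back to sequences."""
    rng = random.Random(seed)
    idx = rng.choices(range(len(unique_seqs)), weights=v_counts, k=size)
    # dict.fromkeys removes repeats while keeping first-seen order,
    # which is also what R's unique() does.
    return [unique_seqs[i] for i in dict.fromkeys(idx)]

if __name__ == "__main__":
    reads = ["ACGT"] * 95 + ["ACGA"] * 4 + ["ACTT"]  # made-up reads
    seqs, counts, consensus, width = read_unique_sequences(reads)
    print(consensus, width, get_bootstrap_unique(seqs, counts, size=50))
```

Weighting by the counts means abundant clones are almost always retained, while rare clones (which are the ones most likely to be sequencing errors) appear in only some resamples.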
To confirm AID activity in patients with CLL, the mutation frequency in the variable region
(IGHV) was compared to that in the constant region (IGHC): if the mutations were caused by
somatic hypermutation, there would be few or no mutations in the constant region
compared to the variable region. Only mutations occurring at C:G were considered, since only
these would be indicative of AID activity. A function to determine the number of GC mutations
compared to the consensus sequence was made, compareMutationsGCsites. The sequence was
split into individual characters using the R function strsplit and assigned to the variable seq_split.
To ensure the sequence and the consensus sequence were the same length, whichever had the
smaller length was taken using the R function min. The variable GCsitesIndex was created,
which contained only those values in the consensus sequence which were G or C. Then, to
compare the two sequences, a for loop was created for i in 1:length(GCsitesIndex). For every
number, 1 through the length of GCsitesIndex, the corresponding element was assigned to the
variable Pos. The variable consensusNt was created to represent the values of Pos in the
consensus sequence. Meanwhile, the variable seqNt was created to represent the values of Pos in
seq_split. An if statement then tested whether consensusNt and seqNt differed and, if so,
incremented the output, mutationNum. This value therefore gives the
total number of GC mutations in one sequence compared to the consensus sequence.
Another function, countMutationGCsites, was created to count the number of GC
mutations for all of the sequences in vUniqueSeqs, instead of just one. A for loop was created for
k in 1:length(vUniqueSeqs) which calculated the function compareMutationsGCsites for every
sequence in vUniqueSeqs and compiled all of the mutationNum values for each sequence into
one vector.
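
The two functions described above can be sketched in Python as follows. This is an illustration of the described logic rather than the original R code; the snake_case names are hypothetical stand-ins for compareMutationsGCsites and countMutationGCsites, and a mutation is assumed to be any position where the read differs from the consensus.

```python
def compare_mutations_gc_sites(seq, consensus):
    """Count positions where the consensus has G or C and seq differs."""
    n = min(len(seq), len(consensus))  # compare only the shared length
    gc_sites = [i for i in range(n) if consensus[i] in "GC"]
    return sum(1 for i in gc_sites if seq[i] != consensus[i])

def count_mutation_gc_sites(unique_seqs, consensus):
    """Apply compare_mutations_gc_sites to every sequence; one count each."""
    return [compare_mutations_gc_sites(s, consensus) for s in unique_seqs]

if __name__ == "__main__":
    consensus = "AGCTGC"            # made-up consensus
    seqs = ["AGTTGA", "TGCT"]       # made-up reads
    print(count_mutation_gc_sites(seqs, consensus))
```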
The data from the variable region and constant region were then compared using this
function after having been resampled. The values were then tested for statistical significance
using a two sample t test.
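
For reference, the statistic behind a two sample t test can be computed as in the Python sketch below. It assumes the unequal-variance (Welch) form, which is the default of R's t.test; the text does not state which form was used, and the function name welch_t is hypothetical. The p-value lookup from the t distribution is omitted.

```python
import math
import statistics

def welch_t(x, y):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    nx, ny = len(x), len(y)
    se2 = vx / nx + vy / ny                 # squared standard error of mx - my
    t = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

if __name__ == "__main__":
    variable_counts = [2, 3, 4, 5, 6]   # made-up per-sequence mutation counts
    constant_counts = [0, 1, 2, 3, 4]
    print(welch_t(variable_counts, constant_counts))
```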
Another way in which AID activity was measured was by counting the mutations
occurring at WRC/GYW hotspots compared to all of the mutations, since it is known that AID
preferentially targets such sites. To test whether mutations in the variable region
were caused by somatic hypermutation, it was verified that mutations occurring at
WRC/GYW hotspots were significantly enriched relative to all mutations. A function
counttotalMutations was created to count all of the mutations in the resampled sequences. The R
function consensusMatrix was used to compare the consensus sequence with the unique
bootstrapped sequences, and all of the mutations were counted. Then, to count the mutations
occurring at hotspots, a function countHotSpotPosition was created to locate the hotspots in the
consensus sequence. The R function matchPattern was used to locate positions on the sequence
that matched the criteria for WRC/GYW hotspots. Then, another function,
countHotSpotMutations, was created to count all of the hotspot mutations that occurred in the
bootstrapped sequences relative to the consensus sequence. The R function consensusMatrix was
used to compare the hotspot positions of the consensus sequence (obtained from the function
countHotSpotPosition) with the bootstrapped sequences. The output of the function was the total
number of hotspot mutations within the resampled sequences relative to the consensus sequence.
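
The hotspot counting described above might look like the following Python sketch. It does not reproduce the original R functions or the Biostrings calls; the names hotspot_positions and count_mutations are hypothetical, and counting only the targeted base of each motif (the C of WRC, the G of GYW) rather than all three motif positions is an assumption, since the text does not specify this.

```python
import re

# WRC or GYW, with a lookahead so overlapping motifs are all found
HOTSPOT = re.compile(r"(?=([AT][AG]C|G[CT][AT]))")

def hotspot_positions(consensus):
    """Return the set of 0-based positions of the targeted base in each
    WRC (last base, C) or GYW (first base, G) motif of the consensus."""
    positions = set()
    for m in HOTSPOT.finditer(consensus):
        motif = m.group(1)
        positions.add(m.start() + 2 if motif.endswith("C") else m.start())
    return positions

def count_mutations(seqs, consensus, positions=None):
    """Count mismatches against the consensus over all sequences,
    optionally restricted to the given positions."""
    total = 0
    for seq in seqs:
        n = min(len(seq), len(consensus))
        for i in range(n):
            if positions is not None and i not in positions:
                continue
            if seq[i] != consensus[i]:
                total += 1
    return total

if __name__ == "__main__":
    consensus = "AGCTGCA"                 # made-up consensus
    seqs = ["AGTTGCA", "TGCTGCA"]         # made-up bootstrapped sequences
    hot = hotspot_positions(consensus)
    print(count_mutations(seqs, consensus), count_mutations(seqs, consensus, hot))
```

Passing a set of coldspot positions instead of hotspot positions gives the corresponding coldspot count with the same function.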
The ratio of the resulting values, the total number of mutations and the total number of
hotspot mutations, was then calculated. This ratio was tested for significance against the null
hypothesis of one using a one sample t test.
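
The one sample t statistic against a null value of one can be computed as in this Python sketch. The function name one_sample_t and the example ratios are hypothetical, and the p-value lookup from the t distribution is omitted.

```python
import math
import statistics

def one_sample_t(values, mu0=1.0):
    """Return the t statistic for H0: mean(values) == mu0, and the
    degrees of freedom (n - 1)."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu0) / standard_error, n - 1

if __name__ == "__main__":
    ratios = [1.5, 2.0, 2.5]  # made-up bootstrap ratios, not study data
    print(one_sample_t(ratios))
```

A large positive t supports a ratio above one (as expected for hotspots), while a large negative t supports a ratio below one (as expected for coldspots).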
Finally, a third test was performed to compare the number of mutations at SYC coldspots
to all of the mutations in the variable region. The same procedure was used for this test as the
previous one, except the pattern used in matchPattern was the coldspot sequence rather than the
hotspot sequence. Also, instead of being tested for being significantly higher than one, the
coldspot ratio was tested for being significantly lower than the null value of one.
Results
Figure 1 shows the ratio between variable region mutations and constant region
mutations, so if there was no AID activity, the ratio between the two would be one. If AID did
cause mutations in the variable region, the ratio would be higher than one. The test to determine
whether there were more mutations in the variable region compared to the constant region
produced many statistically significant results. Assuming a significance level of .05, eight of the
eleven samples of DNA rejected the null hypothesis that there was no difference in number of
mutations between the variable region and constant region.
Figure 2 shows the ratio between hotspot mutations and total mutations in the variable
region, so if there was no AID activity, the ratio between the two would be one. If AID was
active in the variable region, the ratio would be higher than one. The test to determine the
prevalence of mutations at AID hotspots versus all mutations also produced many statistically
significant results. Assuming a significance level of .05, six of the eleven samples of DNA
rejected the null hypothesis that there was no targeting of WRC/GYW hotspots in the
immunoglobulin genes.
Figure 3 shows the ratio between coldspot mutations and total mutations in the
variable region, so if there was no AID activity, the ratio between the two would be one. If AID
was active in the variable region, the ratio would be lower than one, since AID tends to avoid
such coldspots. The test to determine the prevalence of mutations at AID coldspots versus all
mutations did not show as many statistically significant results as did the other two tests.
Assuming a significance level of .05, only four of the eleven samples of DNA rejected the null
hypothesis that there was no avoidance of SYC coldspots in the immunoglobulin genes.
Figure 1. Histogram of mutations occurring at variable region over constant region. The red line
is the expected value 1. The more area to the right of the red line, the more likely the mutations
are caused by somatic hypermutation.
Figure 2. Histogram of AID hotspot mutation over random mutation. Shows the ratio between
mutations occurring at AID hotspots (WRC/GYW) and total mutations for each new clone. The red
vertical line is the expected value 1, since if there is no bias for AID targeting, the expectation
should be 1. The more area to the right of the red line, the more likely the mutations are targeted
by AID.
Figure 3. Histogram of AID coldspot mutation over random mutation. Shows the ratio between
mutations occurring at AID coldspots and total mutations for each new clone. The red vertical
line is the expected value 1, since if there is no bias for AID targeting, the expectation should be
1. The more area to the left of the red line, the more likely the coldspots are avoided by AID.
Discussion
The results of this study confirmed the activity of AID in patients with chronic
lymphocytic leukemia, linking the mutations that occur in the variable regions of these patients
with somatic hypermutation. Of the three major parameters tested, two returned promising
results and the third still returned several significant results. Both the
comparison of mutations in the variable region versus constant region and the comparison of
mutations occurring at AID hotspots versus all mutations in the variable region showed evidence
of AID activity. Although the test for lack of mutations at AID coldspots versus all mutations in
the variable region did not show results that were as promising as the other tests, several
examples did show significance and a few more were only just above the significance threshold.
These findings are consistent with studies that confirm the presence of AID in IGHV
mutated and unmutated CLL cells. Patten et al. (2012) showed that both mutated and unmutated CLL
cells were able to produce AID mRNA and protein, confirming the presence of AID in these cells. It
was therefore expected that AID would perform some function in the variable region of these
cells, although no distinction between mutated and unmutated cells was made.
These findings are pivotal in advancing our knowledge of CLL. Being able to confirm the
activity of AID in activated CLL cells enhances our understanding of the disease, which is
necessary if any progress is to be made with curing CLL. In addition, these findings help to
explain the differences between the two types of CLL – the indolent form and aggressive form.
Since the aggressive form has few mutations in the variable region, our results suggest the
aggressive form is perhaps due to some lack of function in the somatic hypermutation process,
which prevents the B-cells from mutating and providing some form of defense for the immune
system.
One aspect of our findings that was not what we had expected was the low significance in
the results of the test for lack of AID activity at AID coldspots. Although several of the results
were significant, fewer than half were, and several had p-values higher than 0.5. There are several
possible explanations for this result. The first, and most obvious, would be experimental error.
However, since the results of the other two tests did not show any similar lack of significance,
this explanation, although possible, cannot solely explain this occurrence. Another possible
explanation for the results of the AID coldspot test would be that AID does not actually avoid
coldspots as much as it targets hotspots. In other words, even though AID does avoid these
specific DNA sequences, they cannot serve as thorough an indicator of AID activity as do AID
hotspots. Although there are no indications in the literature to suggest this possibility, it could
explain the results.
Although this study was important in furthering our knowledge of AID's role in CLL, the
possibilities of future projects are abundant. One such project could seek to further examine the
difference between mutated and unmutated CLL. It is now apparent that AID is active in mutated
CLL cells. However, the reason why this occurs is still ambiguous. If we were to know why AID
functions in the indolent form of CLL and doesn't in the aggressive form of CLL, we could try to
somehow change some aspect of the aggressive form to become the indolent form, if not treat the
disease entirely. Since patients with the indolent form often die with the disease and not from it,
this could save countless lives.
Another study could follow up on why the AID coldspot test did not return results as
significant as expected. More of the same test could be performed to confirm the results; if the
results are consistent, a study could be designed to compare the relative targeting of hotspots versus
coldspots. A statistical analysis of hotspot/coldspot mutation frequency could reveal whether or
not there is a difference between the two. This information would advance our knowledge of
AID and how it targets the variable regions in immunoglobulin genes.
The goal of our research was to verify the role of AID in patients with CLL. Based on the
results of a study by Patten et al. (2012), which confirmed the presence of AID mRNA and protein in
mutated and unmutated CLL cells, we believed AID would be shown to cause the mutations in the
mutated CLL cells. Our results confirmed this, as many of the samples showed significant
results. The tests for mutations in the variable region versus mutations in the constant region and
AID hotspot mutations versus all mutations were promising, and even though the test for
coldspot mutation versus total mutation did not show results as significant as expected, a
few samples did show significance. This slight discrepancy should be investigated further. This
information will aid future investigations that seek to design treatment for CLL.
References
Ansorge, Wilhelm J. (2009). Next-generation DNA Sequencing Techniques. New Biotechnology,
25 (4), n. pag.
Cheson, B.D. (2001). The chronic lymphocytic leukemias. The Annals of Oncology, 13 (12),
1957 – 1957-a.
Chiorazzi, Nicholas, Katerina Hatzi, and Emilia Albesiano. (2005). B-Cell Chronic Lymphocytic
Leukemia, a Clonal Disease of B Lymphocytes with Receptors That Vary in Specificity
for (Auto)antigens. New York Academy of Sciences, 1062, 1-12.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
Fais, F. et al. (1998). Chronic lymphocytic leukemia B cells express restricted sets of mutated
and unmutated antigen receptors. The Journal of Clinical Investigation, 102, 1515 –
1525.
Fakruddin, M.D. et al. (2012). Pyrosequencing - Principles and Applications. International
Journal of Life Science and Pharma Research, 2 (2), n. pag.
Geyer, Charles J. (2006). 5601 Notes: The Subsampling Bootstrap.
Patten, P.E. et al. (2012). IGHV-unmutated and IGHV-mutated chronic lymphocytic leukemia
cells produce activation-induced deaminase protein with a full range of biologic
functions. Blood. 120 (24), 4802–4811.
Rai, K.R. et al. (2000). Fludarabine compared with chlorambucil as primary therapy for chronic
lymphocytic leukemia. The New England Journal of Medicine, 343, 1750 – 1757.
Sanger F., Nicklen S. and Coulson A.R. (1977). DNA sequencing with chain-terminating
inhibitors. Proceedings of the National Academy of Sciences of the United States of