1. A New Method to Identify and Study Palindromic DNA
Devin Petersohn, Matt Spencer, Chi-Ren Shyu
University of Missouri – Informatics Institute, Department of Computer Science
A genetic palindrome is a DNA sequence that reads
the same on both strands when each strand is read
from its 5’ end to its 3’ end. Palindromes are
studied because they are known to be a source of
diseases, including cancer.
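As a minimal sketch of this definition, a sequence is a genetic palindrome exactly when it equals its own reverse complement. The function names here are illustrative and are not taken from the authors' code.

```python
# Watson-Crick complement of each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read from its 5' end to its 3' end."""
    return seq.translate(COMPLEMENT)[::-1]

def is_genetic_palindrome(seq: str) -> bool:
    """True when both strands read the same in the 5' to 3' direction."""
    return seq == reverse_complement(seq)
```

For example, GAATTC satisfies this test, while GATTACA does not.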
[Figure: Palindromes. A cruciform structure with the 5’ and 3’ ends of both strands labeled.]
Cruciforms (displayed above), which palindromic
sequences can form, are associated with both
beneficial and harmful functions. Knowing where
palindromes are located is therefore important
for researching the effect that cruciforms have
on cell functions.
Module 4: Extract Palindromes
Module 5: Iterative Doubling
Module 6: Index and Retrieval
With our filtered k-blocks, palindromes are
easily identified by finding cases where
identical sequences on opposing strands
overlap. This gives us all palindromes of lengths
in the range [k, 2k) without the use of heuristics
and with no false positives.
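The overlap test can be sketched on a single in-memory string. This is only an illustration under stated assumptions: it skips the filtering modules, and `find_palindromes` and its return format are hypothetical names, not the authors' implementation. A k-block at position p occurs on the reverse strand wherever the forward strand holds its reverse complement; if that occurrence starts at q with p <= q < p + k, the two copies overlap and the span from p to q + k is a palindrome of length in [k, 2k).

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_palindromes(genome, k):
    """Return (start, end) spans of palindromes with length in [k, 2k)."""
    # Index every forward-strand k-mer by its sequence.
    positions = defaultdict(list)
    for p in range(len(genome) - k + 1):
        positions[genome[p:p + k]].append(p)

    spans = []
    for p in range(len(genome) - k + 1):
        block = genome[p:p + k]
        # The reverse strand carries `block` wherever the forward
        # strand carries the reverse complement of `block`.
        for q in positions.get(reverse_complement(block), []):
            if p <= q < p + k:  # the two occurrences overlap
                spans.append((p, q + k))
    return spans
```

With the toy input `"ACGAATTCGT"` and k = 4, the spans found are AATT and GAATTC; the longer palindromes CGAATTCG and ACGAATTCGT fall outside [4, 8) and would only be found after doubling.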
To double the size of our k-blocks, each k-block
is hashed together with the next one in the
genome. The two sequences are joined to form a
new k-block, with the new k being twice as
large.
Modules 2-5 are repeated until no palindromes
are found in an iteration. This process
guarantees that all palindromes are located, as
larger palindromes are always extensions of
smaller ones.
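One doubling step might look like the sketch below. It assumes the k-blocks of one chromosome are held in a dictionary keyed by start position, and that "the next one" means the block starting k positions downstream; both are assumptions made for illustration, as is the name `double_blocks`.

```python
def double_blocks(blocks, k):
    """Join each k-block with the block that starts k positions later.

    `blocks` maps start position -> sequence for the blocks that are
    still present. A block whose downstream neighbor is missing
    produces no 2k-block.
    """
    doubled = {}
    for pos, seq in blocks.items():
        neighbor = blocks.get(pos + k)
        if neighbor is not None:
            doubled[pos] = seq + neighbor  # new block of length 2k
    return doubled
```

Calling `double_blocks({0: "AC", 1: "CG", 2: "GT", 4: "AC"}, 2)` joins only the blocks at positions 0 and 2, because the neighbors of the other blocks are absent. A full run would apply the filters and the extraction step to the doubled blocks, then double again.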
The extracted palindromes are stored in a
database in the form of a Spark RDD. This
allows indexing by species, chromosome,
sequence length, and more. The database
trivializes further exploration of palindromes,
even when performing multi-species analyses.
The coarse-grained filter relies on the fact that
a palindromic sequence is present on both the
forward and reverse strands of the same
chromosome. Thus, we remove any sequences
that do not fit this criterion, as they cannot be
part of a palindrome.
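A minimal sketch of this filter for one chromosome, assuming the k-blocks are held in a dictionary keyed by start position (the data layout and the name `coarse_filter` are illustrative): a sequence occurs on the reverse strand exactly when its reverse complement occurs on the forward strand, so a set lookup is enough.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def coarse_filter(blocks):
    """Keep only blocks whose sequence occurs on both strands."""
    forward_seqs = set(blocks.values())
    return {
        pos: seq
        for pos, seq in blocks.items()
        # seq is on the reverse strand iff its reverse complement
        # appears somewhere on the forward strand.
        if reverse_complement(seq) in forward_seqs
    }
```

On the toy sequence GAATTCAAAA with k = 4, only the blocks GAAT, AATT, and ATTC survive.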
Module 1: Sequence Processing
Module 2: Coarse-Grained Filter
Module 3: Fine-Grained Filter
The fine-grained filter checks whether a k-block
might be part of a palindrome with length in
[k, 2k). This requires a complementary core
around which the flanking nucleotides are
complementary. Without a complementary
core, we know the k-block is not part of a
palindrome in this length range, but it could still
be part of a larger palindrome.
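The poster does not spell out the exact test, so the following is one possible reading rather than the authors' method. Any palindrome shorter than 2k that contains a k-block has its center inside that block, so the block must contain two adjacent complementary bases (taken here as the core) whose mirrored neighbors remain complementary out to the nearer edge of the block. The name `has_complementary_core` is hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def has_complementary_core(block):
    """True if some center inside the block is flanked by complementary bases."""
    k = len(block)
    # c is the boundary between block[c - 1] and block[c].
    for c in range(1, k):
        reach = min(c, k - c)  # mirrored pairs available before an edge
        if all(PAIRS.get(block[c - 1 - i]) == block[c + i]
               for i in range(reach)):
            return True
    return False
```

Under this reading, AATT passes (core AT with complementary flanks), while CATT fails: its only complementary adjacent pair, AT, is flanked by C and T, which are not complementary.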
In sequence processing, a sliding window is used
to scan the raw genome sequence and collect all
subsequences of 6 base pairs along with their
reverse complements. Each pair is stored in a
tuple with the genome, chromosome, and position
information as the key. We call this tuple a
“k-block”, with our initial k being 6.
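The sliding window can be sketched as follows. The tuple layout mirrors the description above, but the function name, argument names, and the use of a plain list instead of a distributed collection are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def make_k_blocks(genome_id, chromosome, sequence, k=6):
    """Return a list of (key, value) k-blocks for one chromosome.

    key   = (genome_id, chromosome, start position)
    value = (subsequence, reverse complement of the subsequence)
    """
    blocks = []
    for pos in range(len(sequence) - k + 1):
        sub = sequence[pos:pos + k]
        key = (genome_id, chromosome, pos)
        blocks.append((key, (sub, reverse_complement(sub))))
    return blocks
```

For the 7-base toy sequence GAATTCA this produces two k-blocks, one starting at position 0 (GAATTC) and one at position 1 (AATTCA).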
Findings
[Figure: Observed and Expected Palindromic DNA Occurrences. Series: Observed, Expected. Horizontal axis values: 6, 12, 24, 48. Vertical axis: logarithmic, with ticks from 0.0001 to 1E+10.]
[Figures: GC Content and Center Bases, one panel each for Length 6, Length 12, Length 24, and Length 48. Center-base categories: AT, CG, GC, TA. Composition categories: GC Content, AT Content.]
• The longest palindrome in the dataset was found in I. tridecemlineatus (ground squirrel), with a length of 101,980 bp.
• Extraordinarily long palindromes are abundant in the Gorilla gorilla genome.
• 13 of the 24 Gorilla chromosomes have palindromes over 6 kb long.
System Architecture
Future Work & Implications
• There is a genetic bias toward certain lengths and compositions of palindromic DNA
• Properly identifying this bias could lead to innovations in disease treatment
• Plants are very different from animals in their genetic makeup. Study of their palindromic makeup is vital to continued