To convert a DNA sequence into a grayscale image, we first convert each character into a unique value: A=0, C=1, G=2, T=3. Then, to convert those values into a 4-bit grayscale value (gray color values from 0-15), we use the following formula: (P1*4)+(P2), where P1 is the value of the character in the first position and P2 is that of the second. The resulting grayscale values form the pixels of images that represent the original sequence. To produce a 10x10 image, a sequence of 101 base pairs is required.
Example: CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = C = 1, P2 = A = 0 => (1*4) + (0) = 4
Using a sliding window, the second position becomes the first:
CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = A = 0, P2 = T = 3 => (0*4) + (3) = 3
Each two-character sequence receives a unique value from 0-15, which corresponds to its grayscale value in the 10x10 image.
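The encoding described above can be sketched as follows; the function names are illustrative, not the poster's own code.

```python
# Map each base to its value, then each overlapping pair to a 4-bit pixel.
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_pixels(seq):
    """Slide a two-character window over seq; each pair (P1, P2) maps to P1*4 + P2 in 0-15."""
    return [BASE[p1] * 4 + BASE[p2] for p1, p2 in zip(seq, seq[1:])]

def to_image(seq, side=10):
    """side*side + 1 bases yield side*side pixels, reshaped into a side x side image."""
    pixels = to_pixels(seq)
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

print(to_pixels("CATG"))  # [4, 3, 14]
```

Note that 101 bases produce exactly 100 overlapping pairs, which fill the 10x10 grid.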
Abstract

Co-Occurrence Matrix and Texture Measurement

Locating Potential DNA Mutations

Discussion
Identifying the Long Ultra Similar Elements (LUSEs) in genomes can yield a myriad of new information regarding the result of a genetically and evolutionarily significant mutation. However, current methods of identifying LUSEs cannot capture every possible mutation (insertion, deletion, and base pair substitution) without an exhaustive pair-wise comparison using the Levenshtein Similarity measurement. Alignment algorithms attempt to solve this problem, but can only calculate the maximum consecutively similar elements in a string of base pairs. We have developed an image-based method of identifying LUSEs in genomes that correlates strongly with the Levenshtein Similarity measurement. Our approach first converts a sequence into a 10x10 grayscale image. Then, using existing co-occurrence matrix based texture feature metrics, we generate a unique feature vector for each sequence by which other sequences can be compared. These feature vectors can then be plotted and, using a clustering algorithm, we can identify clusters of sequences that share a Levenshtein Similarity greater than 90% (or another threshold of our choosing). Because of the correlation between clusters and the Levenshtein Similarity measurement, we can avoid pair-wise comparisons altogether. With no pairwise comparisons, the algorithms can run in parallel using a MapReduce function in a Big Data ecosystem (Hadoop), offering a solution to this Big Data problem that scales with the amount of hardware available. The final product is a searchable database, backed by a hash function that can return all clustered LUSEs very quickly, allowing evolutionary biologists to upload organism genomes and compare them in real time against all other genomes already in the database.
The Levenshtein Similarity measurement calculates similarity between strings based on the minimum number of deletions, insertions, and substitutions it takes to get from one string to another [7].

Image retrieved from: http://images.flatworldknowledge.com/ballgob/ballgob-fig19_015.jpg
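A minimal sketch of the Levenshtein distance just described; the normalization into a similarity score (1 minus distance over the longer length) is an assumption for illustration.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions, and substitutions to turn a into b [7]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

def similarity(a, b):
    """Assumed normalization: 1.0 means identical strings."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
```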
Purpose of this approach:
- Work in a Big Data ecosystem
- Algorithm can run in parallel
- Performance scalable to the amount of hardware available
- No pairwise comparison
Correlation between Levenshtein Similarity and Texture Feature Measurement Methods (1 is perfectly correlated):

Texture Feature Measurement Method(s)              Correlation
Contrast                                           0.8738
Homogeneity                                        0.4313
Entropy                                            0.7540
Dissimilarity                                      0.8691
Contrast & Homogeneity                             0.8270
Homogeneity & Entropy                              0.7884
Entropy & Dissimilarity                            0.8861
Contrast & Entropy                                 0.8697
Contrast & Dissimilarity                           0.8737
Homogeneity & Dissimilarity                        0.8198
Contrast, Homogeneity, & Entropy                   0.8648
Contrast, Homogeneity, & Dissimilarity             0.8507
Contrast, Entropy, & Dissimilarity                 0.8986
Homogeneity, Entropy, & Dissimilarity              0.8750
Contrast, Homogeneity, Entropy, & Dissimilarity    0.8880
The Co-Occurrence Matrix is created by counting the number of grayscale pixel values that occur near one another in a given image [4]. From the Co-Occurrence Matrix, we can generate features with existing methods [4]:
- Contrast
- Dissimilarity
- Homogeneity
- Entropy
These feature measurement metrics are used to reduce the co-occurrence matrix down to values that can be measured or plotted against other images [4]. The chart above details the correlation between Levenshtein Similarity and all possible combinations of the above feature metrics. The most correlated combination of metrics is Contrast, Entropy, and Dissimilarity, with a strong 0.8986 correlation (1 is perfectly correlated).
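A sketch of the co-occurrence matrix and the four feature metrics, under the simplifying assumption of a single horizontal neighbor offset (the poster compares a pixel against all neighbors in its window):

```python
import math

def glcm(img, levels=16, dx=1, dy=0):
    """Normalized co-occurrence matrix: counts of grayscale pairs at offset (dx, dy) [4]."""
    p = [[0.0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    total = 0
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                p[img[y][x]][img[ny][nx]] += 1
                total += 1
    return [[v / total for v in row] for row in p]

def texture_features(p):
    """Contrast, dissimilarity, homogeneity, and entropy of a normalized GLCM [4]."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    dissimilarity = sum(v * abs(i - j) for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    return contrast, dissimilarity, homogeneity, entropy
```

The three best-correlated metrics (contrast, entropy, dissimilarity) then form the feature vector for each sequence image.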
[Figure: co-occurrence window over the image. The highlighted pixel is compared against all of its neighbors, shown at window positions 1 and 2.]
Finding LUSE Overview

User (Start/End) -> Query (Sequence) -> Feature Metric Calculation -> Cluster with Similar Sequences
- User submits a query sequence of at least 101 base pairs
- Feature metrics are generated from the query
- Metrics are plotted and clustered
These metrics can next be plotted in 3-dimensional space and clustered using the K-Means algorithm. Because of the strong correlation, each cluster will represent sequences that fall within a measurable similarity threshold.
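A minimal K-Means sketch over the three chosen feature dimensions (contrast, entropy, dissimilarity); the deterministic spread-out initialization is an illustrative assumption, not necessarily the poster's setup.

```python
def kmeans(points, k, iters=20):
    """Cluster 3-D feature vectors into k groups with plain Lloyd iterations."""
    # Deterministic initialization: pick k points spread across the input list.
    centers = [points[i * len(points) // k] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep old center if empty).
        centers = [tuple(sum(d) / len(g) for d in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups
```

Feature vectors from dissimilar sequence families should land in separate clusters, which is what lets a cluster stand in for a similarity threshold.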
Benefits to approach:
- MapReduce works in parallel => very fast: linear time vs. exponential
- Same time cost to compare 1 vs. 1 and 1 vs. all
- Scalable to amount of hardware available: more nodes = better performance
- Setup can handle entire genomes to be compared at once
- Only need to run a sequence once; results will continue to be added as the database grows
Potential Uses:
- Identify Ultra Conserved Elements (UCEs) [1]
- Identify evolutionarily significant mutations
- Potential for medical uses: disease diagnosis, genetic research, etc.
- Others
What's Next:
- Testing different clustering algorithms, e.g. soft clustering
- Implement and test Spark
- June publication
Yellow area is calculated; blank pixels are not.
References

[1] Reneker J, Lyons E, Conant GC, Pires JC, Freeling M, Shyu CR, Korkin D. Proc Natl Acad Sci U S A. 2012 May 8;109(19):E1183-91. doi: 10.1073/pnas.1121356109. Epub 2012 Apr 10.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," ACM Commun., vol. 51, Jan. 2008, pp. 107-113.
[3] Hadoop, http://hadoop.apache.org/
[4] Co-Occurrence Matrix, http://www.fp.ucalgary.ca/mhallbey/texture_calculations.htm
[5] Apache Spark, http://spark.apache.org/
[6] Apache HBase, http://hbase.apache.org/
[7] Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707-710.
MapReduce, Hadoop, Spark, & HBase - Big Data Ecosystem

Retrieved from: http://hadoop.apache.org

MapReduce Overview
Cluster Setup
10 Intel NUC computers:
- 1 Master Node: 16GB RAM, dual-core 2.0GHz CPU, 1TB hard disk space, 480GB solid state drive
- 9 Compute Nodes: 8GB RAM, dual-core 2.0GHz CPU, 1TB hard disk space, 480GB solid state drive

Retrieved from: http://spark.apache.org
// Map Function 1: input <k,v> - k is the offset of the current file block (in bytes);
// v is a sequence in chromosome C
v = P(v)                          // remove invalid characters
for i = 0 to m-n do {
    FV = generateFV(v[i to i+n])  // generate feature vector
    start_pos = i + k
    return (FV, (start_pos, C))
}

// Reduce Function 1: input <k,v> - k is the feature vector (FV); v is the starting
// position of the subsequence w.r.t. the chromosome sequence
pos = merge(v)
return (k, pos)

// Map Function 2: input <k,v> - k is the feature vector; v is the list of positions
// matching the feature vector
k = normalize(k)                  // normalize data
return (k, v)

// Reduce Function 2: input <k,v> - k is the normalized feature vector; v is the list
// of starting positions
cl = kmean(k)                     // cluster data using k-means
return (cl, v)
Original Data (Sequence) -> Mapper 1..n, each emitting <FV, (Ch ID, Pos)> -> Shuffling -> Reducer 1..n, each emitting <FV, (List of Pos IDs)> -> Output to HBase, coordinated by the Master Node

Retrieved from: http://hbase.apache.org
Co-occurrence Matrix -> FV calculated -> Aggregate elements with matching FV
Identifying Long Ultra Similar Elements (LUSEs) in Genomes Using Image Based Texture Co-Occurrence Matrix
Devin Petersohn1 and Chi-Ren Shyu (Mentor)1,2
1Department of Computer Science, College of Engineering, 2MU Informatics Institute, University of Missouri
HBase Table Schema

Feature Vector <Contrast, Entropy, Dissimilarity> | Table of ordered pairs (Ch ID, Pos): (Ch ID 1, Pos 1), (Ch ID 2, Pos 2), ..., (Ch ID n, Pos n) | K-Means Cluster ID (calculated in the 2nd iteration)
Acknowledgements
This project was sponsored by the MU College of Engineering Undergraduate Honors Research Program.

Undergraduate Research Forum - Spring 2014
[Chart: Running Time for 1st MapReduce Function on a 6 Node Cluster. X-axis: Number of Base Pairs (in millions), 0 to 2,250; Y-axis: Time (minutes), 0 to 200.]