To convert a DNA sequence into a grayscale image, we first convert each character into a unique value: A=0, C=1, G=2, T=3. Then, to convert those values into a 4-bit grayscale value (gray color values from 0-15), we use the following formula: (P1*4)+(P2), where P1 is the value of the character in the first position and P2 is that of the second. The resulting grayscale values form the pixels of images that represent the original sequence. To produce a 10x10 image, a sequence of 101 base pairs is required.
Example: CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = C = 1, P2 = A = 0 => (1*4) + (0) = 4
Using a sliding window, the second position becomes the first:
CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT
P1 = A = 0, P2 = T = 3 => (0*4) + (3) = 3
Each two-character sequence receives a unique value from 0-15, which corresponds to its grayscale value in the 10x10 image.
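The encoding described above can be sketched as follows; the function names are illustrative, not the poster's own code.

```python
# Map each base to its value, then each overlapping pair to a 4-bit pixel.
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_pixels(seq):
    """Slide a two-character window over seq; each pair (P1, P2) maps to P1*4 + P2 in 0-15."""
    return [BASE[p1] * 4 + BASE[p2] for p1, p2 in zip(seq, seq[1:])]

def to_image(seq, side=10):
    """side*side + 1 bases yield side*side pixels, reshaped into a side x side image."""
    pixels = to_pixels(seq)
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

print(to_pixels("CATG"))  # [4, 3, 14]
```

Note that 101 bases produce exactly 100 overlapping pairs, which fill the 10x10 grid.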
Abstract

Co-Occurrence Matrix and Texture Measurement

Locating Potential DNA Mutations

Discussion
Identifying the Long Ultra Similar Elements (LUSEs) in genomes can yield a myriad of new information regarding the result of a genetically and evolutionarily significant mutation. However, current methods of identifying LUSEs cannot capture every possible mutation (insertion, deletion, and base pair substitution) without an exhaustive pair-wise comparison using the Levenshtein Similarity measurement. Alignment algorithms attempt to solve this problem, but can only calculate the maximum consecutively similar elements in a string of base pairs. We have developed an image-based method of identifying LUSEs in genomes that correlates strongly with the Levenshtein Similarity measurement. Our approach first converts a sequence into a 10x10 grayscale image. Then, using existing co-occurrence matrix based texture feature metrics, we generate a unique feature vector for each sequence by which other sequences can be compared. These feature vectors can then be plotted and, using a clustering algorithm, we can identify clusters of sequences that share a Levenshtein Similarity greater than 90% (or another threshold of our choosing). Because of the correlation between clusters and the Levenshtein Similarity measurement, we can avoid pair-wise comparisons altogether. With no pairwise comparisons, the algorithms can run in parallel using a MapReduce function in a Big Data ecosystem (Hadoop), offering a solution to this Big Data problem that scales with the amount of hardware available. The final product is a searchable database, backed by a hash function that can return all clustered LUSEs very quickly, allowing evolutionary biologists to upload organism genomes and compare them in real time against all other genomes already in the database.
The Levenshtein Similarity measurement calculates similarity between strings based on the minimum number of deletions, insertions, and substitutions it takes to get from one string to another [7].

Image retrieved from: http://images.flatworldknowledge.com/ballgob/ballgob-fig19_015.jpg
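A minimal sketch of the Levenshtein distance just described; the normalization into a similarity score (1 minus distance over the longer length) is an assumption for illustration.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions, and substitutions to turn a into b [7]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

def similarity(a, b):
    """Assumed normalization: 1.0 means identical strings."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
```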
Purpose of this approach:
- Work in a Big Data ecosystem
- Algorithm can run in parallel
- Performance scalable to the amount of hardware available
- No pairwise comparison
Correlation between Levenshtein Similarity and Texture Feature Measurement Methods (1 is perfectly correlated):

Texture Feature Measurement Method(s)              Correlation
Contrast                                           0.8738
Homogeneity                                        0.4313
Entropy                                            0.7540
Dissimilarity                                      0.8691
Contrast & Homogeneity                             0.8270
Homogeneity & Entropy                              0.7884
Entropy & Dissimilarity                            0.8861
Contrast & Entropy                                 0.8697
Contrast & Dissimilarity                           0.8737
Homogeneity & Dissimilarity                        0.8198
Contrast, Homogeneity, & Entropy                   0.8648
Contrast, Homogeneity, & Dissimilarity             0.8507
Contrast, Entropy, & Dissimilarity                 0.8986
Homogeneity, Entropy, & Dissimilarity              0.8750
Contrast, Homogeneity, Entropy, & Dissimilarity    0.8880
The Co-Occurrence Matrix is created by counting the number of grayscale pixel values that occur near one another in a given image [4]. From the Co-Occurrence Matrix, we can generate features with existing methods [4]:
- Contrast
- Dissimilarity
- Homogeneity
- Entropy
These feature measurement metrics are used to reduce the co-occurrence matrix down to values that can be measured or plotted against other images [4]. The chart above details the correlation between Levenshtein Similarity and all possible combinations of the above feature metrics. The most correlated combination of metrics is Contrast, Entropy, and Dissimilarity, with a strong 0.8986 correlation (1 is perfectly correlated).
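A sketch of the co-occurrence matrix and the four feature metrics, under the simplifying assumption of a single horizontal neighbor offset (the poster compares a pixel against all neighbors in its window):

```python
import math

def glcm(img, levels=16, dx=1, dy=0):
    """Normalized co-occurrence matrix: counts of grayscale pairs at offset (dx, dy) [4]."""
    p = [[0.0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    total = 0
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                p[img[y][x]][img[ny][nx]] += 1
                total += 1
    return [[v / total for v in row] for row in p]

def texture_features(p):
    """Contrast, dissimilarity, homogeneity, and entropy of a normalized GLCM [4]."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    dissimilarity = sum(v * abs(i - j) for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    return contrast, dissimilarity, homogeneity, entropy
```

The three best-correlated metrics (contrast, entropy, dissimilarity) then form the feature vector for each sequence image.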
[Figure: co-occurrence window over the image. The highlighted pixel is compared against all of its neighbors, shown at window positions 1 and 2.]
Finding LUSE Overview

User (Start/End) -> Query (Sequence) -> Feature Metric Calculation -> Cluster with Similar Sequences
- User submits a query sequence of at least 101 base pairs
- Feature metrics are generated from the query
- Metrics are plotted and clustered
These metrics can next be plotted in 3-dimensional space and clustered using the K-Means algorithm. Because of the strong correlation, each cluster will represent sequences that fall within a measurable similarity threshold.
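A minimal K-Means sketch over the three chosen feature dimensions (contrast, entropy, dissimilarity); the deterministic spread-out initialization is an illustrative assumption, not necessarily the poster's setup.

```python
def kmeans(points, k, iters=20):
    """Cluster 3-D feature vectors into k groups with plain Lloyd iterations."""
    # Deterministic initialization: pick k points spread across the input list.
    centers = [points[i * len(points) // k] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep old center if empty).
        centers = [tuple(sum(d) / len(g) for d in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups
```

Feature vectors from dissimilar sequence families should land in separate clusters, which is what lets a cluster stand in for a similarity threshold.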
Benefits to approach:
- MapReduce works in parallel => very fast: linear time vs. exponential
- Same time cost to compare 1 vs. 1 and 1 vs. all
- Scalable to amount of hardware available: more nodes = better performance
- Setup can handle entire genomes to be compared at once
- Only need to run a sequence once; results will continue to be added as the database grows
Potential Uses:
- Identify Ultra Conserved Elements (UCEs) [1]
- Identify evolutionarily significant mutations
- Potential for medical uses: disease diagnosis, genetic research, etc.
- Others
What's Next:
- Testing different clustering algorithms, e.g. soft clustering
- Implement and test Spark
- June publication
Yellow area is calculated; blank pixels are not.
References

[1] Reneker J, Lyons E, Conant GC, Pires JC, Freeling M, Shyu CR, Korkin D. Proc Natl Acad Sci U S A. 2012 May 8;109(19):E1183-91. doi: 10.1073/pnas.1121356109. Epub 2012 Apr 10.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," ACM Commun., vol. 51, Jan. 2008, pp. 107-113.
[3] Hadoop, http://hadoop.apache.org/
[4] Co-Occurrence Matrix, http://www.fp.ucalgary.ca/mhallbey/texture_calculations.htm
[5] Apache Spark, http://spark.apache.org/
[6] Apache HBase, http://hbase.apache.org/
[7] Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707-710.
MapReduce, Hadoop, Spark, & HBase - Big Data Ecosystem

Retrieved from: http://hadoop.apache.org

MapReduce Overview
Cluster Setup
10 Intel NUC computers:
- 1 Master Node: 16GB RAM, dual-core 2.0GHz CPU, 1TB hard disk space, 480GB solid state drive
- 9 Compute Nodes: 8GB RAM, dual-core 2.0GHz CPU, 1TB hard disk space, 480GB solid state drive

Retrieved from: http://spark.apache.org
// Map Function 1: input <k,v> - k is the offset of the current file block (in bytes);
// v is a sequence in chromosome C
v = P(v)                          // remove invalid characters
for i = 0 to m-n do {
    FV = generateFV(v[i to i+n])  // generate feature vector
    start_pos = i + k
    return (FV, (start_pos, C))
}

// Reduce Function 1: input <k,v> - k is the feature vector (FV); v is the starting
// position of the subsequence w.r.t. the chromosome sequence
pos = merge(v)
return (k, pos)

// Map Function 2: input <k,v> - k is the feature vector; v is the list of positions
// matching the feature vector
k = normalize(k)                  // normalize data
return (k, v)

// Reduce Function 2: input <k,v> - k is the normalized feature vector; v is the list
// of starting positions
cl = kmean(k)                     // cluster data using k-means
return (cl, v)
Original Data (Sequence) -> Mapper 1..n, each emitting <FV, (Ch ID, Pos)> -> Shuffling -> Reducer 1..n, each emitting <FV, (List of Pos IDs)> -> Output to HBase, coordinated by the Master Node

Retrieved from: http://hbase.apache.org
Co-occurrence Matrix -> FV calculated -> Aggregate elements with matching FV
Identifying Long Ultra Similar Elements (LUSEs) in Genomes Using Image Based Texture Co-Occurrence Matrix
Devin Petersohn1 and Chi-Ren Shyu (Mentor)1,2
1Department of Computer Science, College of Engineering, 2MU Informatics Institute, University of Missouri
HBase Table Schema

Feature Vector <Contrast, Entropy, Dissimilarity> | Table of ordered pairs (Ch ID, Pos): (Ch ID 1, Pos 1), (Ch ID 2, Pos 2), ..., (Ch ID n, Pos n) | K-Means Cluster ID (calculated in the 2nd iteration)
Acknowledgements
This project was sponsored by the MU College of Engineering Undergraduate Honors Research Program.

Undergraduate Research Forum - Spring 2014
[Chart: Running Time for 1st MapReduce Function on a 6 Node Cluster. X-axis: Number of Base Pairs (in millions), 0 to 2,250; Y-axis: Time (minutes), 0 to 200.]