Locating Potential DNA Mutations

To convert a DNA sequence into a grayscale image, we first convert each character into a unique value:

A=0, C=1, G=2, T=3

Then, to convert those values into a 4-bit grayscale value (gray color values from 0-15), we use the following formula:

(P1*4) + (P2)

where P1 is the value of the character in the first position of a two-character window and P2 is the value of the character in the second.

The resulting grayscale values form the pixels of images that represent the original sequence. To get a 10x10 image, a sequence of 101 base pairs is required (101 bases yield 100 overlapping two-character windows, and therefore 100 pixels).

Example:

    CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT

    P1 = C = 1, P2 = A = 0 => (1*4) + (0) = 4

Using a sliding window, the second position becomes the first:

    CATGCTAACTGATCACTATAGCGCGCTATCATACGCGATCTACGCT

    P1 = A = 0, P2 = T = 3 => (0*4) + (3) = 3

Each two-character sequence receives a unique value from 0-15, which corresponds to its grayscale value in the 10x10 image.
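A minimal sketch of this encoding, assuming Python with NumPy; the function names dna_to_pixels and dna_to_image are illustrative, not taken from the poster.

import numpy as np

# Base-to-value mapping described above: A=0, C=1, G=2, T=3
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}

def dna_to_pixels(seq):
    # Slide a two-character window over seq and map each pair to 0-15
    # using (P1*4) + (P2).
    values = [BASE_VALUE[b] for b in seq]
    return [values[i] * 4 + values[i + 1] for i in range(len(values) - 1)]

def dna_to_image(seq, side=10):
    # A sequence of side*side + 1 bases yields a side x side grayscale image.
    pixels = dna_to_pixels(seq)
    if len(pixels) != side * side:
        raise ValueError("need %d bases, got %d" % (side * side + 1, len(seq)))
    return np.array(pixels, dtype=np.uint8).reshape(side, side)

# Worked example from the poster: "CA" -> (1*4)+(0) = 4, then "AT" -> (0*4)+(3) = 3
print(dna_to_pixels("CAT"))  # [4, 3]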
  
Abstract
Identifying the Long Ultra Similar Elements (LUSEs) in genomes can yield a wealth of new information regarding the results of genetically and evolutionarily significant mutations. However, current methods of identifying LUSEs cannot capture every possible mutation (insertion, deletion, and base pair substitution) without an exhaustive pair-wise comparison using the Levenshtein Similarity measurement. Alignment algorithms attempt to solve this problem, but can only calculate the maximum consecutively similar elements in a string of base pairs. We have developed an image-based method of identifying LUSEs in genomes that has a strong correlation to the Levenshtein Similarity measurement. Our approach first converts a sequence into a 10x10 grayscale image. Then, using existing co-occurrence matrix based texture feature metrics, we generate a unique feature vector for each sequence by which other sequences can be compared. These feature vectors can then be plotted and, using a clustering algorithm, we can identify clusters of sequences that share a Levenshtein Similarity greater than 90% (or another threshold of our choosing). Because of the correlation between clusters and the Levenshtein Similarity measurement, we can avoid pair-wise comparisons altogether. Because there are no pairwise comparisons, these algorithms can run in parallel using a MapReduce function in a Big Data ecosystem (Hadoop), offering a solution to this Big Data problem that is scalable to the amount of hardware available. The final product will be a hash function that can return all clustered LUSEs very quickly for biology researchers to access in real time.
The final product is a searchable database that allows evolutionary biologists to upload and compare organism genomes against all other genomes already in the database.
The Levenshtein Similarity measurement calculates similarity between strings based on the minimum number of deletions, insertions, and substitutions it takes to get from one string to another [7].
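For reference, a minimal sketch of the Levenshtein distance and a similarity derived from it; normalizing by the length of the longer string is an assumption, since the poster does not state how the distance is turned into a percent similarity.

def levenshtein_distance(a, b):
    # Classic dynamic-programming edit distance: minimum number of
    # deletions, insertions, and substitutions [7].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    # Assumed normalization: 1.0 means identical strings.
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein_distance(a, b) / longest

print(levenshtein_similarity("GATTACA", "GATTTCA"))  # one substitution -> ~0.857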
  
Retrieved from: http://images.flatworldknowledge.com/ballgob/ballgob-fig19_015.jpg
Purpose of this approach:
- Work in a Big Data ecosystem
- Algorithm can run in parallel
- Scalable performance to the amount of hardware available
- No pairwise comparison
Co-Occurrence Matrix and Texture Measurement

Correlation between Levenshtein Distance and Texture Feature Measurement Methods
(correlation with Levenshtein Similarity; 1 is perfectly correlated)

Texture Feature Measurement Method(s)              Correlation
Contrast                                           0.8738
Homogeneity                                        0.4313
Entropy                                            0.7540
Dissimilarity                                      0.8691
Contrast & Homogeneity                             0.8270
Homogeneity & Entropy                              0.7884
Entropy & Dissimilarity                            0.8861
Contrast & Entropy                                 0.8697
Contrast & Dissimilarity                           0.8737
Homogeneity & Dissimilarity                        0.8198
Contrast, Homogeneity, & Entropy                   0.8648
Contrast, Homogeneity, & Dissimilarity             0.8507
Contrast, Entropy, & Dissimilarity                 0.8986
Homogeneity, Entropy, & Dissimilarity              0.8750
Contrast, Homogeneity, Entropy, & Dissimilarity    0.8880
The Co-Occurrence Matrix is created by counting the number of grayscale pixel values that occur near one another in a given image [4]. From the Co-Occurrence Matrix, we can generate features with existing methods [4]: Contrast, Dissimilarity, Homogeneity, and Entropy.

These feature measurement metrics are used to reduce the co-occurrence matrix down to values that can be measured or plotted against other images [4].

The table above details the correlation between Levenshtein Similarity and all possible combinations of the above feature metrics. The most correlated combination of metrics is Contrast, Entropy, and Dissimilarity, with a strong 0.8986 correlation (1 is perfectly correlated).
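A minimal sketch of how these texture features could be computed from a 10x10 grayscale image, assuming scikit-image (version 0.19 or later, where the functions are named graycomatrix/graycoprops). Entropy is not one of graycoprops' built-in properties, so it is computed directly from the normalized matrix; the distance/angle choices and the helper name texture_features are illustrative assumptions, not details given on the poster.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_features(image, levels=16):
    # Co-occurrence matrix for horizontally adjacent pixels
    # (distance 1, angle 0), normalized so its entries sum to 1 [4].
    glcm = graycomatrix(image, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    contrast = graycoprops(glcm, "contrast")[0, 0]
    dissimilarity = graycoprops(glcm, "dissimilarity")[0, 0]
    homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
    # Entropy: -sum(p * log2(p)) over the non-zero co-occurrence probabilities.
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return contrast, entropy, dissimilarity, homogeneity

# Example: a random 10x10 image with 4-bit gray levels (0-15)
img = np.random.randint(0, 16, size=(10, 10), dtype=np.uint8)
print(texture_features(img))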
  
[Figure: co-occurrence window over the image. Labels: Image; Pixel that is compared against all neighbors; Window; Window Position 1; Window Position 2. The yellow area is calculated; blank pixels are not.]
Finding LUSE Overview

[Flowchart: User (Start) → Query (Sequence): user submits a query sequence of at least 101 base pairs → Feature Metric Calculation: feature metrics are generated from the query → metrics are plotted and clustered → Cluster with Similar Sequences → User (End).]
These metrics can next be plotted in 3-dimensional space and clustered using the K-Means algorithm. Because of the strong correlation, each cluster will represent sequences within a measurable similarity threshold.
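A minimal sketch of this clustering step, assuming scikit-learn's KMeans; the number of clusters and the toy feature values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Each row is one sequence's feature vector <Contrast, Entropy, Dissimilarity>.
feature_vectors = np.array([
    [12.4, 3.1, 2.8],   # toy values, for illustration only
    [12.6, 3.0, 2.9],
    [40.2, 3.9, 5.5],
    [39.8, 3.8, 5.4],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feature_vectors)
print(kmeans.labels_)  # sequences with the same label fall into the same cluster

# Within a cluster, members are expected to exceed the chosen Levenshtein
# Similarity threshold (e.g. 90%), so pairwise comparisons can be skipped.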
  
[3D cluster plot; labeled axes include Contrast and Entropy.]
Discussion

Benefits to approach:
- MapReduce works in parallel => very fast:
  - Linear time vs. exponential
  - Same time cost to compare 1 vs. 1 and 1 vs. all
- Scalable to the amount of hardware available:
  - More nodes = better performance
- Setup can handle entire genomes to be compared at once
- Only need to run a sequence once – results will continue to be added as the database grows

Potential Uses:
- Identify Ultra Conserved Elements (UCEs) [1]
- Identify evolutionarily significant mutations
- Potential for medical uses: disease diagnosis, genetic research, etc.
- Others

What's Next:
- Testing different clustering algorithms – Soft Clustering
- Implement and test Spark
- June publication
References

[1] Reneker J, Lyons E, Conant GC, Pires JC, Freeling M, Shyu CR, Korkin D. Proc Natl Acad Sci U S A. 2012 May 8;109(19):E1183-91. doi: 10.1073/pnas.1121356109. Epub 2012 Apr 10.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008, pp. 107-113.
[3] Hadoop, http://hadoop.apache.org/
[4] Co-Occurrence Matrix, http://www.fp.ucalgary.ca/mhallbey/texture_calculations.htm
[5] Apache Spark, http://spark.apache.org/
[6] Apache HBase, http://hbase.apache.org/
[7] Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707–710.
MapReduce, Hadoop, Spark, & HBase – Big Data Ecosystem

Retrieved from: http://hadoop.apache.org
Cluster Setup
- 10 Intel NUC computers
- 1 Master Node:
  - 16GB RAM
  - Dual Core 2.0GHz CPU
  - 1TB Hard Disk Space
  - 480GB Solid State Drive
- 9 Compute Nodes:
  - 8GB RAM
  - Dual Core 2.0GHz CPU
  - 1TB Hard Disk Space
  - 480GB Solid State Drive
Retrieved from: http://spark.apache.org
// Map Function 1: input <k,v>; k is the offset of the current file block (in bytes); v is a sequence in chromosome C
v = P(v)                          // remove invalid characters
for i = 0 to m-n do {
    FV = generateFV(v[i to i+n])  // generate feature vector
    start_pos = i + k
    return (FV, (start_pos, C))
}

// Reduce Function 1: input <k,v>; k is the feature vector (FV); v is the starting position of the subsequence w.r.t. the chromosome sequence
pos = merge(v)
return (k, pos)

// Map Function 2: input <k,v>; k is the feature vector; v is the list of positions matching the feature vector
k = normalize(k)                  // normalize data
return (k, v)

// Reduce Function 2: input <k,v>; k is the normalized feature vector; v is the list of starting positions
cl = kmean(k)                     // cluster data using k-means
return (cl, v)

MapReduce Overview

[Diagram: Original Data (Sequence) → Master Node → Mapper 1 .. Mapper n, each emitting <FV, (Ch ID, Pos)> (co-occurrence matrix FV calculated) → Shuffling → Reducer 1 .. Reducer n, each emitting <FV, (List of Pos IDs)> (aggregate elements with matching FV) → Output to HBase.]

Retrieved from: http://hbase.apache.org
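To make the <FV, (Ch ID, Pos)> data flow concrete, here is a small pure-Python simulation of the first map/shuffle/reduce round; the feature-vector function is stubbed out, the 101-base window comes from the encoding section, and all function names and sample data are illustrative assumptions rather than the poster's actual implementation.

from collections import defaultdict

WINDOW = 101  # bases per subsequence, matching the 10x10 image encoding

def clean(seq):
    # P(v): remove invalid characters, keeping only A, C, G, T.
    return "".join(b for b in seq.upper() if b in "ACGT")

def generate_fv(subseq):
    # Stub for the co-occurrence-matrix feature vector
    # <Contrast, Entropy, Dissimilarity>; a real implementation would build
    # the 10x10 grayscale image and compute texture features from it.
    return (len(set(subseq)), subseq.count("G"), subseq.count("T"))

def map_function_1(offset, seq, chromosome):
    # Emit <FV, (Ch ID, Pos)> for every window in the block.
    seq = clean(seq)
    for i in range(len(seq) - WINDOW + 1):
        fv = generate_fv(seq[i:i + WINDOW])
        yield fv, (chromosome, offset + i)

def reduce_function_1(pairs):
    # Shuffle: group positions by feature vector, then merge per key,
    # giving <FV, list of (Ch ID, Pos)>.
    grouped = defaultdict(list)
    for fv, pos in pairs:
        grouped[fv].append(pos)
    return dict(grouped)

# Toy run on a short made-up sequence
records = map_function_1(0, "ACGT" * 30, "chr1")
print(reduce_function_1(records))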
  
Identifying Long Ultra Similar Elements (LUSEs) in Genomes Using Image Based Texture Co-Occurrence Matrix

Devin Petersohn¹ and Chi-Ren Shyu (Mentor)¹,²
¹Department of Computer Science, College of Engineering, ²MU Informatics Institute, University of Missouri
HBase Table Schema

Each row stores:
- Feature Vector: <Contrast, Entropy, Dissimilarity>
- Table of ordered pairs: (Ch ID 1, Pos 1), (Ch ID 2, Pos 2), ..., (Ch ID n, Pos n)
- K-Means Cluster ID (calculated in the 2nd iteration)
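A minimal sketch of how one such row might be written from Python, assuming the happybase client and an HBase Thrift gateway; the table name, column family names (pos, cluster), and row-key encoding are illustrative assumptions, not details given on the poster.

import happybase

# Connect to the HBase Thrift gateway (hostname is an assumption).
connection = happybase.Connection("master-node")
table = connection.table("luse_features")  # assumed table name

# Row keyed by the feature vector <Contrast, Entropy, Dissimilarity>.
row_key = b"12.4000|3.1000|2.8000"  # assumed key encoding
table.put(row_key, {
    b"pos:chr1_10234": b"",  # one ordered pair (Ch ID, Pos) per column (assumed layout)
    b"pos:chr2_99120": b"",
    b"cluster:id": b"17",    # K-Means cluster ID, filled in during the 2nd iteration
})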
  
Acknowledgements

This project was sponsored by the MU College of Engineering Undergraduate Honors Research Program.

Undergraduate Research Forum – Spring 2014
[Chart: Running Time for 1st MapReduce Function on a 6 Node Cluster. X-axis: Number of Base Pairs (in Millions), 0 to 2,250; Y-axis: Time (minutes), 0 to 200.]