Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays
Jillian Aurisano
Master of Science Defense
April 16, 2014
  
	
  
Science has historically looked like this:
Up until very recently
“Observations!”
Expertise
explore, make observations
Collect samples
  
“No one looks under a microscope anymore. It’s all DNA.”
  
How do scientists make discoveries?
How do we bring experts into the loop?
• From direct collection of data, direct observation of results, direct interpretation and analysis
• To automated data collection, automated filtering and automated analysis
• Need visualization to bring experts into the loop
• But how do we handle big data?
• What’s our Big Data microscope?
  
“Picard: Computer; scan everything, run diagnostics, and tell us the answer.”
“Computer: Results are inconclusive”
  
Can Big Displays help?
• Evidence suggests that these environments can have a positive impact on perception and cognition
• But how do we use them to effectively address big data problems?
• Can existing visualizations simply be ‘scaled-up’ to fit or are new approaches needed?
  
In this thesis I will…
Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics
I worked closely over several years with a team of computational biologists
This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays
BactoGeNIE
(‘Bact(o)erial Gene Neighborhood Investigation Environment’)
  
	
  
Outline
1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop
2) Examine potential impact of Big Displays on Big Data visualization
3) Evaluate scalability in existing comparative genomics visualizations
My work: BactoGeNIE
4/5/6) Describe my design, implementation, results
7) Think about the future
In the process, learn something about scaling up visual approaches to big data and big displays
  
Warning: Biology is used in this thesis!
  
Genome sequencing boom
• Sequencing costs decreasing faster than Moore’s Law
• So, we are able to produce massive volumes of sequence data
• Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences
Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
  	
  
What is a genome? What is a gene?
• Genomes consist of one or more long molecules of ‘DNA’
• DNA consists of chained nucleotide molecules (A, C, T, G) also called ‘base pairs’
• All the genes in an organism are in its ‘genome’
• Genes determine traits in an organism
• Genes ‘code’ for proteins, and proteins do the work to make traits happen
  
	
  
How are genomes sequenced?
• Sequencing
• Assembly
• Annotation
• Output:
  – Genome feature files
  – Raw sequence files
Michael Schatz
Cold Spring Harbor
  	
  
Lots of genome sequences -> opportunity
Big challenge: Hard to figure out what a novel gene does
• Traditionally: do wet-lab research to figure out
  – but expensive, time-consuming
• Sequence the gene, and use computational methods to predict the function of the protein
  – If novel gene, may not provide answer
• Can complete genome sequences help?
• Comparative gene neighborhood analysis
  
From genome structure to gene-product function
• In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome.
• Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors
• But, need to examine lots of genomes for statistical significance?
[Figure: gene1, gene2, gene3, gene4; biological process?]
Comparing gene neighborhoods across different genomes
• Genes with similar sequences likely produce proteins with similar functions
• Orthologs: similar genes from different genomes
• Algorithms to compare genes between different genomes
DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2
  
Role for visualization in this problem
• Why not use automated methods to find common sets of genes around gene targets?
• Why visualization?
• 3 E’s: Exploration, Expertise, Errors
  
Exploration
• Patterns and anomalies without knowing in advance what you are looking for
Automated methods:
Target: gene B
Common subsequences:
Strains 1, 2, 3: {A, B, C, D}
[Figure panels: Duplication, Truncation, Deletion, Inversion; each shows genes A to D in Strains 1, 2 and 3]
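The contrast on this slide can be sketched in a few lines of Python. This is an illustrative toy, not the automated method referred to on the slide: the strain data, `common_gene_set`, and `anomalies` are all hypothetical names invented here. It shows how a set-based summary can report the same {A, B, C, D} for every strain while a duplication or an inversion goes unreported, which is the kind of pattern a gene-neighborhood view is meant to expose.

```python
from collections import Counter

# Hypothetical toy data: gene order per strain (not taken from the thesis).
neighborhoods = {
    "Strain 1": ["A", "B", "C", "D"],
    "Strain 2": ["A", "B", "C", "C", "D"],  # C is duplicated
    "Strain 3": ["A", "C", "B", "D"],       # B and C are inverted
}


def common_gene_set(strains):
    """Genes present in every strain, ignoring order and copy number."""
    return set.intersection(*(set(genes) for genes in strains.values()))


def anomalies(strains, reference):
    """Describe how each strain differs from the reference gene order."""
    ref = strains[reference]
    report = {}
    for name, genes in strains.items():
        notes = []
        counts = Counter(genes)
        duplicated = sorted(g for g, n in counts.items() if n > 1)
        if duplicated:
            notes.append(f"duplicated: {duplicated}")
        missing = sorted(set(ref) - set(genes))
        if missing:
            notes.append(f"missing: {missing}")
        # Compare gene order after collapsing repeated genes.
        order = list(dict.fromkeys(genes))
        ref_order = [g for g in ref if g in counts]
        if order != ref_order:
            notes.append("order differs from reference")
        if notes:
            report[name] = notes
    return report


print(common_gene_set(neighborhoods))
print(anomalies(neighborhoods, "Strain 1"))
```

The set summary is identical for all three strains; only the second function, which looks at order and copy number, separates them.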
Expertise
• Experts make connections that will be missed by automated methods
  – Not just the anomaly, but significance of the anomaly
  – Knowledge about strains, protein families involved in finding significant anomalies
[Figure: StrainA, StrainB, StrainC]
Errors
• Verify automated methods
• Uncertainty and errors in data generation
[Figure: two panels, "Breaks in assembly" and "Missed gene boundaries", each comparing ground truth with data for Strains 1, 2 and 3]
Automated methods, first panel: Common subsequences: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, D}
Automated methods, second panel: Common subsequences: Strains 1 and 3: {A, B, C, D}; Strain 2: {A, B}
  
To address this problem:
• Visualization must help bring experts into the data mining loop
  1) Helps experts identify sources of error
  2) Allows experts to explore the data
  3) Enables researchers to integrate expertise in data analysis
So: overview visualization not enough.
Need gene-neighborhood details
• Visualization must scale to enable comparisons between hundreds to thousands of genomes
  
Big displays: Opportunity for big data?
• The question is: can these environments be used to visualize big data sets better?
• Evidence suggests yes:
  – Physical navigation over virtual navigation
    • Reduced need to pan and zoom
    • Reduced need for context switching
    • Utilize embodied cognition
    • Multiple levels of detail accessible through physical movement
  – Externalize more information that can be accessed simultaneously
Lance Long
  
Porting from small to big displays
• Maybe porting genome visualizations to these environments is sufficient?
• Ruddle2013:
  – Export high-resolution graphical output from existing genomics visualizations
  – Display these large images on big display
  – Evidence that this had a positive impact on researcher reasoning
• However, effective visualization on big displays involves more than simply scaling up the representation
  
Pixel-Density Scalability
• As pixel-density increases, does a visual approach take advantage of increased pixels-per-inch to show more entities, relationships or to show data at higher detail?
Evaluation:
• High-Density Representation?
  • use increased pixels per inch to show more entities and relationships at higher detail?
• Simultaneous detail and overview?
  • With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
  
Display-Size Scalability
• As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships?
Evaluation
• Encode big data spatially
• Cluster related elements:
  • spatial memory
  • direct, visual comparisons
• Physical navigation over virtual navigation:
  • Overviews at a distance, details up-close
  
	
  
Perceptual and Analytic Task Scalability
• Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels?
• Does perception suffer if you scale the approach up?
  • Analytic tasks performed pre-attentively
  • Analytic tasks aided by visual queries
  • Aids to visual search for performing analytic tasks
  
Examining current genomic data visualizations
• Does it address this problem?
  • Show gene neighborhoods
  • Comparative
• Does this visualization allow comparison between more than a few gene neighborhoods?
• If you scale the visual approach up, does it:
  • Allow more comparisons of gene neighborhoods (Analytic Task Scalability)
  • Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability)
  • In the process, remain sensible to a human viewer (Perceptual scalability)
  
	
  
Line-based comparative approaches
• On load, align 1-2 genes to a chosen gene in a reference genome
• Draw a line or a band to connect orthologs
• In many cases, repurpose genome browsers to be comparative by adding comparative track
• Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve
Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England).
McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons
  
Line-based approaches expanded: Mauve
• Like parallel coordinates
• Draw lines between orthologs
• Color genes by their block with that genome (not colored by orthology)
• Example shows 9 genomes
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  
Line-based approaches: Critique
• Pixel-density scalable?
  – Not a high-density representation
  – Need space for the ‘comparative track’
• Display size scalable?
  – Hard to follow lines across a display
  – Hard to compare similar neighborhoods across the display
  – No overview from a distance, details up close
• Perceptual scalability for comparing gene neighborhoods?
  – Lots of visual clutter
  – Comparisons not pre-attentive
  – No aid to visual search
• Number of genomes
  – Published up to 9
  – Private groups have adapted frameworks for 10-50 genomes on big display
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  
PSAT: Color and alignment
• PSAT
  – Orthologs encoded using color
  – Strand on which gene is positioned is encoded by orientation to the center line
  – Text is given by default
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  
PSAT: Critique
• Pixel-Density Scalability
  – Not high-density representation because of text labels
• Perceptual scalability for comparing gene neighborhoods?
  – Can’t scale to large number of genes: not enough colors
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  
GeneRiViT: Alignment and color
• GeneRiViT
  – Align against arbitrary gene
  – Color by presence/absence
  – Examples show 4 genomes
  – Critique:
    • No discussion of scalability
    • Overview visualization
    • Doesn’t address our problem
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  
Dot plots
• Coordinates of genes in two genomes are used as x and y axis
• Orthologous genes in other genomes are plotted
• Each genome given a unique color
• Critique:
  – Doesn’t provide ‘gene-neighborhood’ view
  – Overview tool
  – Hard to follow beyond a few genomes
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
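As a rough illustration of the dot-plot idea for the two-genome case, the sketch below pairs up genes that share an ortholog cluster and emits one (x, y) point per pair. The input format (lists of `(start_position, cluster_id)` tuples) and the function name `dot_plot_points` are assumptions made for this example; they are not taken from Gene-RiViT or any other tool named on the slide.

```python
def dot_plot_points(genome_x, genome_y):
    """Return (x, y) start coordinates for every pair of orthologous genes.

    genome_x, genome_y: lists of (start_position, cluster_id) tuples, where
    genes sharing a cluster_id are treated as orthologs (an assumption).
    """
    # Index the y-axis genome by cluster so each lookup is constant time.
    starts_by_cluster = {}
    for start, cluster in genome_y:
        starts_by_cluster.setdefault(cluster, []).append(start)

    points = []
    for x_start, cluster in genome_x:
        for y_start in starts_by_cluster.get(cluster, []):
            points.append((x_start, y_start))
    return points


# Hypothetical gene positions in two genomes.
genome_a = [(100, "c1"), (900, "c2"), (2000, "c3")]
genome_b = [(150, "c1"), (2500, "c2"), (1200, "c3"), (4000, "c9")]
print(dot_plot_points(genome_a, genome_b))
```

Points that fall along a diagonal indicate conserved gene order; the plot says nothing about which genes sit next to each other in a third or fourth genome, which is the limitation the critique above points to.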
  
Overview Visualization: Sequence Surveyor
• Not this domain problem, but interesting approach
• Each gene is drawn as a rectangle
• Several possible variables for position: Ordinal position
• Several possible variables for color:
  – Position in one reference genome
  – Use a color ramp, for wide range of colors
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  
Overview Visualization: Sequence Surveyor
• Pixel-density scalable
  – High-density representation
  – High-detail representation
• Display size scalability
  – May be difficult to compare patterns from one side of display to another
• Perceptual Scalability
  – Colors allow for pre-attentive identification of patterns
  – Avoids visual clutter
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  
Copy number variations on big displays
• Orchestral:
  – Visualization of a different data type
  – Effective use of color to enable pre-attentive identification of similarities across genomes
  – High-density representation
  – Details-up-close, overview from a distance
Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
  
BactoGeNIE Demo
  
Program details
• Implemented in C++ using Qt and the QGraphicsView framework
• Upload:
  – genome feature files
  – Fasta files (raw gene sequences)
• Cd-hit algorithm processes sequence files to compute ortholog ‘clusters’
• MySQL database to store big datasets
  – Loads 1000 contigs into memory, rest stored in database
• Optimized for PubMed datasets
• Prototyped on E. coli draft genomes
  – Capable of displaying any contigs from thousands of E. coli draft genomes
• On EVL Cyber-commons wall, around 400 contigs in view
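The slide does not say how BactoGeNIE reads the CD-HIT results, so the following is only a sketch of one plausible step: turning CD-HIT's plain-text `.clstr` cluster report into a gene-to-cluster lookup. It assumes the usual `.clstr` layout (a `>Cluster N` header followed by member lines containing `>sequence_id...`); the gene identifiers and the function name `parse_clstr` are hypothetical.

```python
import re

# Member lines look like: "0\t300aa, >some_gene_id... *"
MEMBER = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(text):
    """Map each sequence id to its cluster number from .clstr-style text."""
    gene_to_cluster = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        else:
            match = MEMBER.search(line)
            if match and current is not None:
                gene_to_cluster[match.group(1)] = current
    return gene_to_cluster


# Hypothetical report with made-up gene ids.
sample = (
    ">Cluster 0\n"
    "0\t300aa, >strain1_gene0001... *\n"
    "1\t298aa, >strain2_gene0417... at 97.00%\n"
    ">Cluster 1\n"
    "0\t120aa, >strain1_gene0002... *\n"
)
print(parse_clstr(sample))
```

With such a lookup, two genes from different strains can be treated as orthologs when they map to the same cluster number, which is the relationship the color and alignment encodings described later rely on.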
  
BactoGeNIE: High density representation
• Compressed genome encoding
• No text labels, instead ‘on-demand’
• No ‘comparative track’
• Encode orthology using
  – User applied color: pre-attentive orthology identification
  – Coordinated highlighting: scalable visual query
  – Alignment: use space to encode similarity
  
Use space to encode similarity
• Goals:
  – Make it easier to perform comparisons across many genomes (Analytic task scalability)
  – Accommodate increased display size (Display Size Scalability)
  – Make similarities and differences easy to see (Perceptual Scalability)
• Sorting and Alignment
  – Sort by contig length
  – Sort by gene content
  – Dynamically align against any gene
  
	
  
Interactivity
• On hovering, contig expands in height, so easier to select genes of interest in high-density view
• User can modify the contig density, or the gene density (nucleotides per pixel)
• ‘Pop-up’ menu for each gene that gives info and allows for:
  – application of color:
    • ‘tagging’ operation
    • Scalable query
  – “targeting” operation (described next)
• User can sort genomes by:
  – Gene target
  – Contig length: to show common assembly break-points in related contigs
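The two density controls mentioned above can be read as simple scale factors. The sketch below is an assumed interpretation, not BactoGeNIE's code: gene density is taken as the number of nucleotides drawn per horizontal pixel, and contig density as the number of contig rows sharing the display height. Function names and the one-pixel minimum width are choices made for this example.

```python
def gene_to_pixels(start_nt, end_nt, nt_per_pixel, min_width=1):
    """Map a gene's nucleotide span to (x, width) in pixels.

    nt_per_pixel is the 'gene density'; min_width keeps very short genes
    visible at coarse scales (an assumption, not a documented behavior).
    """
    x = start_nt // nt_per_pixel
    width = max(min_width, (end_nt - start_nt) // nt_per_pixel)
    return x, width


def contig_row_height(display_height_px, contigs_in_view):
    """Pixel height available to each contig row for a given contig density."""
    return display_height_px // contigs_in_view


print(gene_to_pixels(1000, 2500, nt_per_pixel=100))  # (10, 15)
print(gene_to_pixels(0, 50, nt_per_pixel=100))       # (0, 1)
print(contig_row_height(2160, 400))                  # 5
```

Under these assumptions, 400 contigs on a 2160-pixel-high display leave about 5 pixels per row, which is why expanding a contig on hover makes individual genes easier to select.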
  
‘Gene Targeting’ Function to create high resolution, comparative ‘maps’
• User selects a gene of interest
• This gene is given a base color
• Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’
• Orthologous genes in related genomes are given the same colors
• Contigs containing this gene are brought to the top
• The target gene is centered
• Orthologs are aligned to the target
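The steps listed above can be sketched as a small layout function. This is a simplified model written for illustration, not the BactoGeNIE implementation: contigs are plain lists of ortholog-cluster ids, positions are gene slots rather than nucleotide coordinates, strand orientation is ignored, and colors are represented as signed ramp steps (0 for the target, negative upstream, positive downstream). The name `target_gene` and its parameters are hypothetical.

```python
def target_gene(contigs, target_contig, target_index, ramp_length=3):
    """Return display rows as (contig_name, offset, ramp_steps).

    contigs: dict mapping contig name -> list of cluster ids in gene order.
    offset: slots to shift the row so its target ortholog lands in column 0,
            or None if the contig lacks the target.
    ramp_steps: per-gene ramp step, or None for genes left uncolored.
    """
    reference = contigs[target_contig]
    target_cluster = reference[target_index]

    # Base color for the target, then one ramp in each direction.
    ramp = {target_cluster: 0}
    for step in range(1, ramp_length + 1):
        upstream = target_index - step
        downstream = target_index + step
        if upstream >= 0:
            ramp.setdefault(reference[upstream], -step)
        if downstream < len(reference):
            ramp.setdefault(reference[downstream], step)

    with_target, without_target = [], []
    for name, genes in contigs.items():
        # Orthologs (same cluster id) receive the same ramp step everywhere.
        steps = [ramp.get(cluster) for cluster in genes]
        if target_cluster in genes:
            offset = -genes.index(target_cluster)
            with_target.append((name, offset, steps))
        else:
            without_target.append((name, None, steps))
    # Contigs containing the target are brought to the top.
    return with_target + without_target


# Hypothetical contigs; letters stand for ortholog clusters.
contigs = {
    "contig_1": ["a", "b", "c", "d", "e"],
    "contig_2": ["x", "b", "c", "d"],
    "contig_3": ["p", "q"],
    "contig_4": ["c", "d", "e", "f"],
}
for row in target_gene(contigs, "contig_1", 2, ramp_length=2):
    print(row)
```

Because every row is shifted so the target ortholog shares one column, a missing or extra neighbor shows up as a break in the color gradient at the same horizontal position across rows.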
  
Gene targeting function
• Clustering to promote direct comparisons
• Overviews at a distance
• Details up close
• Pre-attentive identification of similarities and differences between gene neighborhoods
Lance Long
  
Examples	
  
Pixel-density Scalability
BactoGeNIE fits the pixel-density scalability criteria:
High-density data display, identifier display and orthology encoding
  
Display Size Scalability
• BactoGeNIE is the only approach to use clustering and show multiple levels of detail
  
Perceptual Scalability and Analytic Tasks
BactoGeNIE:
• Similarity is pre-attentively accessible
• Avoids visual clutter
• Visual query for orthologs
  
Graphical Scalability: Display Resolution vs Number of Genomes
[Chart: Genomes (0 to 1000) plotted against Pixels (480, 720, 1080, 1440, 2160, 2880, 3240, 4320) for BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco and Mauve]
  
Preliminary User Feedback
• A version of BactoGeNIE used by computational biology team on 2 monitor x 2 monitor tiled display wall
• “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes”
• “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest.
• This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons.
• The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns.
• This tool has a potential to serve both for visualization as well as data mining needs.”
Usage of a version without the gene targeting approach.
Future study will concentrate on this feature with a wider community of users
  	
  
Summary of contributions
• A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view
• First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps, as well as upload of custom data
• First to show overviews with gene-neighborhood details that can be accessed through physical movement
• Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
  
What’s next?
Design
• Multiple color ramps
• Advanced ordering in y, based on similarity to target or strain phylogeny
• Show additional properties, such as pathway membership
Implementation
• Scalability in rendering using parallelization on the GPU
• Port to SAGE
Evaluation
• More feedback, case studies and evaluations of scalability vs other approaches
  
Scalable Design, Big Data, Big Displays
• Need visualization to provide an interface between automated methods and the expert
• For big data problems, challenge is to represent data effectively, avoiding information overload
• Porting existing visual approaches to big data and big displays will not always work
• Need to design for increased data volumes and
  – pixel-density
  – display size
  – volume of analytical tasks
  
	
  
Thanks!
• Acknowledgements:
  – Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory
  – Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto
• Questions?
  

More Related Content

Viewers also liked

Our mobile planet_spain_es
Our mobile planet_spain_esOur mobile planet_spain_es
Our mobile planet_spain_esXosé María Cid
 
Infografía de Branded Content de IAB Spain
Infografía de Branded Content de IAB SpainInfografía de Branded Content de IAB Spain
Infografía de Branded Content de IAB SpainIAB Spain
 
Spain digital future in focus 2013
Spain digital future in focus 2013Spain digital future in focus 2013
Spain digital future in focus 2013Xosé María Cid
 
Presentación del Informe siE[14
Presentación del Informe siE[14Presentación del Informe siE[14
Presentación del Informe siE[14Prodigioso Volcán
 
Informe ADigital Tendencias Negocio Digital 2015
Informe ADigital Tendencias Negocio Digital 2015Informe ADigital Tendencias Negocio Digital 2015
Informe ADigital Tendencias Negocio Digital 2015Planimedia
 

Viewers also liked (6)

Our mobile planet_spain_es
Our mobile planet_spain_esOur mobile planet_spain_es
Our mobile planet_spain_es
 
Comscore: Spain Digital Future in Focus 2013 (Español)
Comscore: Spain Digital Future in Focus 2013 (Español)Comscore: Spain Digital Future in Focus 2013 (Español)
Comscore: Spain Digital Future in Focus 2013 (Español)
 
Infografía de Branded Content de IAB Spain
Infografía de Branded Content de IAB SpainInfografía de Branded Content de IAB Spain
Infografía de Branded Content de IAB Spain
 
Spain digital future in focus 2013
Spain digital future in focus 2013Spain digital future in focus 2013
Spain digital future in focus 2013
 
Presentación del Informe siE[14
Presentación del Informe siE[14Presentación del Informe siE[14
Presentación del Informe siE[14
 
Informe ADigital Tendencias Negocio Digital 2015
Informe ADigital Tendencias Negocio Digital 2015Informe ADigital Tendencias Negocio Digital 2015
Informe ADigital Tendencias Negocio Digital 2015
 

Similar to Bacterial Gene Neighborhood Visualization for Big Displays

Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917GenomeInABottle
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GenomeInABottle
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisUniversity of California, Davis
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
Nitant_Choksi_CAP6545_Presentation_Slides.pptx
Nitant_Choksi_CAP6545_Presentation_Slides.pptxNitant_Choksi_CAP6545_Presentation_Slides.pptx
Nitant_Choksi_CAP6545_Presentation_Slides.pptxNitantChoksi1
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talkc.titus.brown
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™Affymetrix
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use casesGuy Coates
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshopGenomeInABottle
 

Bacterial Gene Neighborhood Visualization for Big Displays

  • 1. Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays. Jillian Aurisano, Master of Science Defense, April 16, 2014
  • 2. Science has historically looked like this:
  • 3. Up until very recently: “Observations!” Expertise. Explore, make observations. Collect samples.
  • 4. “No one looks under a microscope anymore. It's all DNA.” How do scientists make discoveries?
  • 5. How do we bring experts into the loop? • From direct collection of data, direct observation of results, direct interpretation and analysis • To automated data collection, automated filtering and automated analysis • Need visualization to bring experts into the loop • But how do we handle big data? • What’s our Big Data microscope? “Picard: Computer, scan everything, run diagnostics, and tell us the answer.” “Computer: Results are inconclusive.”
  • 6. Can Big Displays help? • Evidence suggests that these environments can have a positive impact on perception and cognition • But how do we use them to effectively address big data problems? • Can existing visualizations simply be ‘scaled-up’ to fit, or are new approaches needed?
  • 7. In this thesis I will… Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics. I worked closely over several years with a team of computational biologists. This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays: BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)
  • 8. Outline 1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop 2) Examine potential impact of Big Displays on Big Data visualization 3) Evaluate scalability in existing comparative genomics visualizations. My work: BactoGeNIE 4/5/6) Describe my design, implementation, results 7) Think about the future. In the process, learn something about scaling up visual approaches to big data and big displays
  • 9. Warning: Biology is used in this thesis!
  • 10. Genome sequencing boom • Sequencing costs decreasing faster than Moore’s Law • So, we are able to produce massive volumes of sequence data • Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences. Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
  • 11. What is a genome? What is a gene? • Genomes consist of one or more long molecules of ‘DNA’ • DNA consists of chained nucleotide molecules (A, C, T, G), also called ‘base pairs’ • All the genes in an organism are in its ‘genome’ • Genes determine traits in an organism • Genes ‘code’ for proteins, and proteins do the work to make traits happen
  • 12. How are genomes sequenced? • Sequencing • Assembly • Annotation • Output: – Genome feature files – Raw sequence files. Michael Schatz, Cold Spring Harbor
  • 13. Lots of genome sequences -> opportunity. Big challenge: Hard to figure out what a novel gene does • Traditionally: do wet-lab research to figure out – but expensive, time-consuming • Sequence the gene, and use computational methods to predict the function of the protein – If novel gene, may not provide answer • Can complete genome sequences help? • Comparative gene neighborhood analysis
  • 14. From genome structure to gene-product function • In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome. • Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors • But, need to examine lots of genomes for statistical significance? [Figure: gene1, gene2, gene3, gene4; biological process?]
  • 15. Comparing gene neighborhoods across different genomes • Genes with similar sequences likely produce proteins with similar functions • Orthologs: similar genes from different genomes • Algorithms to compare genes between different genomes. DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2
  • 16. Role for visualization in this problem • Why not use automated methods to find common sets of genes around gene targets? • Why visualization? • 3 E’s: Exploration, Expertise, Errors. Automated methods: Target: gene B. Common subsequences: Strains 1, 2, 3: {A, B, C, D}
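The automated-method output quoted on slide 16 can be illustrated with a minimal sketch. It assumes each strain is given as an ordered list of ortholog labels; the function name `common_neighbors` and the `window` parameter are hypothetical and are not taken from the thesis or from any tool it cites.

```python
def common_neighbors(strains, target, window=2):
    """Return ortholog labels found within `window` genes of `target` in every strain.

    `strains` maps a strain name to its gene order (a list of ortholog labels).
    Strains that lack the target are skipped in this sketch.
    """
    common = None
    for genes in strains.values():
        if target not in genes:
            continue
        i = genes.index(target)
        neighborhood = set(genes[max(0, i - window): i + window + 1])
        common = neighborhood if common is None else common & neighborhood
    return common or set()


# Toy input mirroring the slide: three strains, target gene B.
strains = {
    "Strain 1": ["A", "B", "C", "D"],
    "Strain 2": ["A", "B", "C", "D"],
    "Strain 3": ["A", "B", "C", "D"],
}
print(sorted(common_neighbors(strains, "B")))
```

A set-based summary like this discards gene order and copy number, so a duplication or inversion in one strain would produce the same output, which is the gap the slides argue visualization should fill.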
  • 17. Exploration • Patterns and anomalies without knowing in advance what you are looking for. Automated methods: Target: gene B. Common subsequences: Strains 1, 2, 3: {A, B, C, D}. [Figure: gene-order diagrams for Strains 1, 2 and 3 illustrating duplication, truncation, deletion and inversion]
  • 18. Expertise • Experts make connections that will be missed by automated methods – Not just the anomaly, but significance of the anomaly – Knowledge about strains, protein families involved in finding significant anomalies. [Figure: StrainA, StrainB, StrainC]
  • 19. Errors • Verify automated methods • Uncertainty and errors in data generation. [Figure: two examples, labeled ‘Breaks in assembly’ and ‘Missed gene boundaries’, each comparing the data, the automated-method output and the ground truth for Strains 1, 2 and 3. In one, automated methods report common subsequences {A, B, C, D} for Strains 1 and 3 and {A, D} for Strain 2; in the other, {A, B, C, D} for Strains 1 and 3 and {A, B} for Strain 2.]
  • 20. To address this problem: • Visualization must help bring experts into the data mining loop 1) Help experts identify sources of error 2) Allow experts to explore the data 3) Enable researchers to integrate expertise in data analysis. So: overview visualization is not enough. Need gene-neighborhood details • Visualization must scale to enable comparisons between hundreds to thousands of genomes
  • 21. Big displays: Opportunity for big data? • The question is: can these environments be used to visualize big data sets better? • Evidence suggests yes: – Physical navigation over virtual navigation • Reduced need to pan and zoom • Reduced need for context switching • Utilize embodied cognition • Multiple levels of detail accessible through physical movement – Externalize more information that can be accessed simultaneously. Lance Long
  • 22. Porting from small to big displays • Maybe porting genome visualizations to these environments is sufficient? • Ruddle2013: – Export high-resolution graphical output from existing genomics visualizations – Display these large images on big display – Evidence that this had a positive impact on researcher reasoning • However, effective visualization on big displays involves more than simply scaling up the representation
  • 23. Pixel-Density Scalability • As pixel density increases, does a visual approach take advantage of increased pixels per inch to show more entities and relationships, or to show data at higher detail? Evaluation: • High-Density Representation? • Use increased pixels per inch to show more entities and relationships at higher detail? • Simultaneous detail and overview? • With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
  • 24. Display-Size Scalability • As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships? Evaluation • Encode big data spatially • Cluster related elements: • spatial memory • direct, visual comparisons • Physical navigation over virtual navigation: • Overviews at a distance, details up-close
  • 25. Perceptual and Analytic Task Scalability • Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels? • Does perception suffer if you scale the approach up? • Analytic tasks performed pre-attentively • Analytic tasks aided by visual queries • Aids to visual search for performing analytic tasks
  • 26. Examining current genomic data visualizations • Does it address this problem? • Show gene neighborhoods • Comparative • Does this visualization allow comparison between more than a few gene neighborhoods? • If you scale the visual approach up, does it: • Allow more comparisons of gene neighborhoods (Analytic Task Scalability) • Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability) • In the process, remain sensible to a human viewer (Perceptual Scalability)
  • 27. Line-based comparative approaches • On load, align 1-2 genes to a chosen gene in a reference genome • Draw a line or a band to connect orthologs • In many cases, repurpose genome browsers to be comparative by adding comparative track • Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve. Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England). McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons
  • 28. Line-based approaches expanded: Mauve • Like parallel coordinates • Draw lines between orthologs • Color genes by their block with that genome (not colored by orthology) • Example shows 9 genomes. Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  • 29. Line-based approaches: Critique • Pixel-density scalable? – Not a high-density representation – Need space for the ‘comparative track’ • Display size scalable? – Hard to follow lines across a display – Hard to compare similar neighborhoods across the display – No overview from a distance, details up close • Perceptual scalability for comparing gene neighborhoods? – Lots of visual clutter – Comparisons not pre-attentive – No aid to visual search • Number of genomes – Published up to 9 – Private groups have adapted frameworks for 10-50 genomes on big display. Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  • 30. PSAT: Color and alignment • PSAT – Orthologs encoded using color – Strand on which gene is positioned is encoded by orientation to the center line – Text is given by default. Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  • 31. PSAT: Critique • Pixel-Density Scalability – Not high-density representation because of text labels • Perceptual scalability for comparing gene neighborhoods? – Can’t scale to large number of genes: not enough colors. Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  • 32. GeneRiViT: Alignment and color • GeneRiViT – Align against arbitrary gene – Color by presence/absence – Examples show 4 genomes – Critique: • No discussion of scalability • Overview visualization • Doesn’t address our problem. Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  • 33. Dot plots • Coordinates of genes in two genomes are used as x and y axis • Orthologous genes in other genomes are plotted • Each genome given a unique color • Critique: – Doesn’t provide ‘gene-neighborhood’ view – Overview tool – Hard to follow beyond a few genomes. Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  • 34. Overview Visualization: Sequence Surveyor • Not this domain problem, but interesting approach • Each gene is drawn as a rectangle • Several possible variables for position: Ordinal position • Several possible variables for color: – Position in one reference genome – Use a color ramp, for wide range of colors. Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  • 35. Overview Visualization: Sequence Surveyor • Pixel-density scalable – High-density representation – High-detail representation • Display size scalability – May be difficult to compare patterns from one side of display to another • Perceptual Scalability – Colors allow for pre-attentive identification of patterns – Avoids visual clutter. Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  • 36. Copy number variations on big displays • Orchestral: – Visualization of a different data type – Effective use of color to enable pre-attentive identification of similarities across genomes – High-density representation – Details up close, overview from a distance. Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
  • 38. Program details • Implemented in C++ using Qt and the QGraphicsView framework • Upload: – Genome feature files – FASTA files (raw gene sequences) • CD-HIT algorithm processes sequence files to compute ortholog ‘clusters’ • MySQL database to store big datasets – Loads 1000 contigs into memory, rest stored in database • Optimized for PubMed datasets • Prototyped on E. coli draft genomes – Capable of displaying any contigs from thousands of E. coli draft genomes • On EVL Cyber-commons wall, around 400 contigs in view
  • 39. BactoGeNIE: High density representation • Compressed genome encoding • No text labels, instead ‘on-demand’ • No ‘comparative track’ • Encode orthology using – User applied color: pre-attentive orthology identification – Coordinated highlighting: scalable visual query – Alignment: use space to encode similarity
  • 40. Use space to encode similarity • Goals: – Make it easier to perform comparisons across many genomes (Analytic task scalability) – Accommodate increased display size (Display Size Scalability) – Make similarities and differences easy to see (Perceptual Scalability) • Sorting and Alignment – Sort by contig length – Sort by gene content – Dynamically align against any gene
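The sorting and alignment operations listed on slide 40 could be sketched roughly as follows. This is an illustrative assumption about one way to compute them, not BactoGeNIE's actual C++/Qt implementation; the `Gene` and `Contig` types and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    ortholog: str  # ortholog cluster label
    start: int     # start coordinate on the contig, in nucleotides
    end: int       # end coordinate on the contig, in nucleotides


@dataclass
class Contig:
    name: str
    length: int    # contig length in nucleotides
    genes: list    # list of Gene, in contig order


def sort_by_length(contigs):
    """Longest contigs first."""
    return sorted(contigs, key=lambda c: c.length, reverse=True)


def alignment_offsets(contigs, ortholog):
    """Horizontal shift (in nucleotides) per contig so that the first gene
    belonging to `ortholog` starts at the same x position in every row.
    Contigs without that ortholog are left out of the result."""
    starts = {}
    for contig in contigs:
        for gene in contig.genes:
            if gene.ortholog == ortholog:
                starts[contig.name] = gene.start
                break
    if not starts:
        return {}
    anchor = max(starts.values())
    return {name: anchor - start for name, start in starts.items()}


contigs = [
    Contig("c1", 900, [Gene("A", 0, 90), Gene("B", 100, 300)]),
    Contig("c2", 1500, [Gene("C", 0, 350), Gene("B", 400, 600)]),
]
print([c.name for c in sort_by_length(contigs)])
print(alignment_offsets(contigs, "B"))
```

Shifting rows by a per-contig offset, rather than rescaling them, keeps gene widths comparable across rows, which matters when position is meant to encode similarity.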
  • 41. Interactivity • On hovering, contig expands in height, so it is easier to select genes of interest in high-density view • User can modify the contig density, or the gene density (nucleotides per pixel) • ‘Pop-up’ menu for each gene that gives info and allows for: – application of color: • ‘tagging’ operation • Scalable query – ‘targeting’ operation (described next) • User can sort genomes by: – Gene target – Contig length: to show common assembly break-points in related contigs
  • 42. ‘Gene Targeting’ Function to create high resolution, comparative ‘maps’ • User selects a gene of interest • This gene is given a base color • Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’ • Orthologous genes in related genomes are given the same colors • Contigs containing this gene are brought to the top • The target gene is centered • Orthologs are aligned to the target
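The coloring and ordering steps of gene targeting on slide 42 can be sketched as below. The sketch makes several assumptions that are not in the slides: contigs are plain lists of ortholog labels, lower list indices are treated as ‘upstream’ regardless of strand, the specific RGB values are arbitrary, and the names `lerp_color`, `target_palette` and `gene_target` are hypothetical. Centering and alignment are left out.

```python
GRAY = (128, 128, 128)  # color for genes with no ortholog near the target


def lerp_color(c0, c1, t):
    """Linear interpolation between two RGB colors, t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def target_palette(reference, target, span=3,
                   base=(255, 255, 0),
                   up=((0, 0, 255), (200, 200, 255)),
                   down=((255, 0, 0), (255, 200, 200))):
    """Map ortholog labels near `target` in the reference contig to colors:
    a base color for the target and one ramp for each side."""
    i = reference.index(target)
    palette = {target: base}
    for d in range(1, span + 1):
        t = (d - 1) / max(span - 1, 1)  # 0 next to the target, 1 at the far end
        if i - d >= 0:
            palette.setdefault(reference[i - d], lerp_color(*up, t))
        if i + d < len(reference):
            palette.setdefault(reference[i + d], lerp_color(*down, t))
    return palette


def gene_target(contigs, reference_name, target, span=3):
    """Return (row order, per-contig gene colors). Contigs containing the
    target come first; orthologs share the reference contig's colors."""
    palette = target_palette(contigs[reference_name], target, span)
    # sorted() is stable, so the original order is kept within each group
    order = sorted(contigs, key=lambda name: target not in contigs[name])
    colored = {name: [palette.get(g, GRAY) for g in contigs[name]] for name in order}
    return order, colored


contigs = {
    "s1": ["X", "Y"],
    "ref": ["A", "B", "C"],
    "s3": ["C", "B", "A"],
}
order, colored = gene_target(contigs, "ref", "B", span=1)
print(order)
print(colored["s3"])
```

Because colors are keyed by ortholog label rather than by position, an inverted neighborhood such as `s3` shows the two ramps swapped, which is the kind of difference the slides describe as visible pre-attentively.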
  • 43. Gene targeting function • Clustering to promote direct comparisons • Overviews at a distance • Details up close • Pre-attentive identification of similarities and differences between gene neighborhoods. Lance Long
  • 45. Pixel-density Scalability. BactoGeNIE fits the pixel-density scalability criteria: High-density data display, identifier display and orthology encoding
  • 46. Display Size Scalability • BactoGeNIE is the only approach to use clustering and show multiple levels of detail
  • 47. Perceptual Scalability and Analytic Tasks. BactoGeNIE: • Similarity is pre-attentively accessible • Avoids visual clutter • Visual query for orthologs
  • 48. Graphical Scalability: Display Resolution vs Number of Genomes. [Chart: number of genomes (axis from 0 to 1000) against display resolution in pixels (480 to 4320) for BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco and Mauve]
  • 49. Preliminary User Feedback • A version of BactoGeNIE used by computational biology team on 2 monitor x 2 monitor tiled display wall • “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes” • “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest. • This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons. • The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns. • This tool has a potential to serve both for visualization as well as data mining needs.” Usage of a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users
  • 50. Summary of contributions • A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view • First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps, as well as upload of custom data • First to show overviews with gene neighborhood details that can be accessed through physical movement • Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
  • 51. What’s next? Design • Multiple color ramps • Advanced ordering in y, based on similarity to target or strain phylogeny • Show additional properties, such as pathway membership. Implementation • Scalability in rendering using parallelization on the GPU • Port to SAGE. Evaluation • More feedback, case studies and evaluations of scalability vs other approaches
  • 52. Scalable Design, Big Data, Big Displays • Need visualization to provide an interface between automated methods and the expert • For big data problems, challenge is to represent data effectively, avoiding information overload • Porting existing visual approaches to big data and big displays will not always work • Need to design for increased data volumes and – pixel-density – display size – volume of analytical tasks
  • 53. Thanks! • Acknowledgements: – Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory – Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto • Questions?