Jillian ms defense-4-14-14-ja-novideo


  1. Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays. Jillian Aurisano, Master of Science Defense, April 16, 2014
  2. Science has historically looked like this:
  3. Up until very recently: “Observations!” • Expertise • Explore • Collect samples • Catch errors
  4. “No one looks under a microscope anymore. It's all DNA.” How do scientists make discoveries?
  5. How do we bring experts into the loop? • From direct collection of data, direct observation of results, direct interpretation and analysis • To automated data collection, automated filtering and automated analysis • Need visualization to bring experts into the loop • But how do we handle big data? • What’s our Big Data microscope? “Picard: Computer; scan everything, run diagnostics, and tell us the answer.” “Computer: Results are inconclusive”
  6. Can Big Displays help? • Evidence suggests that these environments can have a positive impact on perception and cognition • But how do we use them to effectively address big data problems? • Can existing visualizations simply be ‘scaled-up’ to fit, or are new approaches needed?
  7. In this thesis I will… Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics. I worked closely over several years with a team of computational biologists. This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays: BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)
  8. Outline 1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop 2) Examine potential impact of Big Displays on Big Data visualization 3) Evaluate scalability in existing comparative genomics visualizations. My work: BactoGeNIE 4/5/6) Describe my design, implementation, results 7) Think about the future. In the process, learn something about scaling up visual approaches to big data and big displays
  9. Warning: Biology is used in this thesis!
  10. Genome sequencing boom • Sequencing costs are decreasing faster than Moore’s Law • So, we are able to produce massive volumes of sequence data • Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences. Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
  11. What is a genome? What is a gene? • Genomes consist of one or more long molecules of ‘DNA’ • DNA consists of chained nucleotide molecules (A, C, T, G), also called ‘base pairs’ • All the genes in an organism are in its ‘genome’ • Genes determine traits in an organism • Genes ‘code’ for proteins, and proteins do the work to make traits happen
  12. How are genomes sequenced? • Sequencing • Assembly • Annotation • Output: – Genome feature files – Raw sequence files. Michael Schatz, Cold Spring Harbor
  13. Lots of genome sequences -> opportunity. Big challenge: Hard to figure out what a novel gene does • Traditionally: do wet-lab research to figure out – but expensive, time-consuming • Sequence the gene, and use computational methods to predict the function of the protein – If novel gene, may not provide answer • Can complete genome sequences help? • Comparative gene neighborhood analysis
  14. From genome structure to gene-product function • In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome. • Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors • But, need to examine lots of genomes for statistical significance? [Figure: gene1, gene2, gene3, gene4 linked to a biological process, with one link marked ‘?’]
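The idea of "commonly recurring gene neighbors" can be illustrated with a small counting sketch. This is only an illustration, not the method used in the thesis: the function name `neighbor_counts`, the `window` parameter, and the toy genomes are all assumptions, and each genome is simplified to an ordered list of ortholog labels.

```python
from collections import Counter

def neighbor_counts(genomes, target, window=1):
    """Count in how many genomes each gene occurs within `window`
    positions of `target`. Each genome is an ordered list of labels."""
    counts = Counter()
    for genes in genomes.values():
        seen = set()  # count each neighbor at most once per genome
        for i, gene in enumerate(genes):
            if gene != target:
                continue
            lo, hi = max(0, i - window), min(len(genes), i + window + 1)
            seen.update(g for g in genes[lo:hi] if g != target)
        counts.update(seen)
    return counts

# Hypothetical toy data: gene2 sits next to gene3 in every genome.
genomes = {
    "strain1": ["gene1", "gene2", "gene3", "gene4"],
    "strain2": ["gene5", "gene2", "gene3", "gene1"],
    "strain3": ["gene2", "gene3", "gene6"],
}
counts = neighbor_counts(genomes, "gene3", window=1)
```

A neighbor that recurs in most genomes (here `gene2`) is a candidate for sharing a function with the target; with only a handful of genomes such counts carry little statistical weight, which is the point the slide raises.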
  15. Comparing gene neighborhoods across different genomes • Genes with similar sequences likely produce proteins with similar functions • Orthologs: similar genes from different genomes • Algorithms to compare genes between different genomes. DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2
  16. Role for visualization in this problem • Why not use automated methods to find common sets of genes around gene targets? • Why visualization? • 3 E’s: Exploration, Expertise, Errors. [Figure: automated-method output. Target: gene B. Common subsequences: Strains 1, 2, 3: {A, B, C, D}]
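The automated output shown on the slide (a common gene set around a target) can be sketched as a set intersection. This is a minimal illustration under stated assumptions: `common_neighborhood`, `radius`, and the strain data are hypothetical, and strains are simplified to ordered lists of gene labels.

```python
def common_neighborhood(strains, target, radius):
    """Intersect the sets of genes found within `radius` positions of the
    first occurrence of `target` in each strain. Strains that lack the
    target are skipped."""
    common = None
    for genes in strains.values():
        if target not in genes:
            continue
        i = genes.index(target)
        nearby = set(genes[max(0, i - radius): i + radius + 1])
        common = nearby if common is None else common & nearby
    return common or set()

strains = {
    "Strain 1": ["A", "B", "C", "D"],
    "Strain 2": ["A", "B", "C", "D"],
    "Strain 3": ["A", "B", "C", "D"],
}
result = common_neighborhood(strains, "B", radius=2)

# The same summary is produced when one strain carries an inversion of B and C.
with_inversion = dict(strains, **{"Strain 3": ["A", "C", "B", "D"]})
result_inverted = common_neighborhood(with_inversion, "B", radius=2)
```

Both calls return the same set, because a set-based summary discards gene order. That is one reason a purely automated summary can hide the kinds of structural differences an expert would want to see.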
  17. Exploration • Patterns and anomalies without knowing in advance what you are looking for. [Figure: automated-method output (Target: gene B; Common subsequences: Strains 1, 2, 3: {A, B, C, D}) alongside panels showing gene order in Strains 1, 2 and 3 for four cases: Duplication, Truncation, Deletion, Inversion]
  18. Expertise • Experts make connections that will be missed by automated methods – Not just the anomaly, but significance of the anomaly – Knowledge about strains, protein families involved in finding significant anomalies. [Figure: Strain A, Strain B, Strain C, with an anomaly marked ‘!’]
  19. Errors • Verify automated methods • Uncertainty and errors in data generation. [Figure: two examples, each comparing data, automated-method output and ground truth for Strains 1, 2 and 3. In one, automated methods report common subsequences {A, B, C, D} for Strains 1 and 3 and {A, D} for Strain 2; in the other, {A, B, C, D} for Strains 1 and 3 and {A, B} for Strain 2. Panel captions: Breaks in assembly; Missed gene boundaries]
  20. To address this problem: • Visualization must help bring experts into the data mining loop 1) Helps experts identify sources of error 2) Allows experts to explore the data 3) Enables researchers to integrate expertise in data analysis. So: overview visualization is not enough. Need gene-neighborhood details • Visualization must scale to enable comparisons between hundreds to thousands of genomes
  21. Big displays: Opportunity for big data? • The question is: can these environments be used to visualize big data sets better? • Evidence suggests yes: – Physical navigation over virtual navigation • Reduced need to pan and zoom • Reduced need for context switching • Utilize embodied cognition • Multiple levels of detail accessible through physical movement – Externalize more information that can be accessed simultaneously. Lance Long
  22. Porting from small to big displays • Maybe porting genome visualizations to these environments is sufficient? • Ruddle2013: – Export high-resolution graphical output from existing genomics visualizations – Display these large images on big display – Evidence that this had a positive impact on researcher reasoning • However, effective visualization on big displays involves more than simply scaling up the representation
  23. Pixel-Density Scalability • As pixel-density increases, does a visual approach take advantage of increased pixels-per-inch to show more entities, relationships or to show data at higher detail? Evaluation: • High-Density Representation? • Use increased pixels per inch to show more entities and relationships at higher detail? • Simultaneous detail and overview? • With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
  24. Display-Size Scalability • As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships? Evaluation • Encode big data spatially • Cluster related elements: • spatial memory • direct, visual comparisons • Physical navigation over virtual navigation: • Overviews at a distance, details up-close
  25. Perceptual and Analytic Task Scalability • Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels? • Does perception suffer if you scale the approach up? • Analytic tasks performed pre-attentively • Analytic tasks aided by visual queries • Aids to visual search for performing analytic tasks
  26. Examining current genomic data visualizations • Does it address this problem? • Show gene neighborhoods • Comparative • Does this visualization allow comparison between more than a few gene neighborhoods? • If you scale the visual approach up, does it: • Allow more comparisons of gene neighborhoods (Analytic Task Scalability) • Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability) • In the process, remain sensible to a human viewer (Perceptual Scalability)
  27. Line-based comparative approaches • On load, align 1-2 genes to a chosen gene in a reference genome • Draw a line or a band to connect orthologs • In many cases, repurpose genome browsers to be comparative by adding comparative track • Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve. Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England). McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons
  28. Line-based approaches expanded: Mauve • Like parallel coordinates • Draw lines between orthologs • Color genes by their block within that genome (not colored by orthology) • Example shows 9 genomes. Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  29. Line-based approaches: Critique • Pixel-density scalable? – Not a high-density representation – Need space for the ‘comparative track’ • Display size scalable? – Hard to follow lines across a display – Hard to compare similar neighborhoods across the display – No overview from a distance, details up close • Perceptual scalability for comparing gene neighborhoods? – Lots of visual clutter – Comparisons not pre-attentive – No aid to visual search • Number of genomes – Published up to 9 – Private groups have adapted frameworks for 10-50 genomes on big display. Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
  30. PSAT: Color and alignment • PSAT – Orthologs encoded using color – Strand on which gene is positioned is encoded by orientation to the center line – Text is given by default. Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  31. PSAT: Critique • Pixel-Density Scalability – Not high-density representation because of text labels • Perceptual scalability for comparing gene neighborhoods? – Can’t scale to large number of genes: not enough colors. Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
  32. GeneRiViT: Alignment and color • GeneRiViT – Align against arbitrary gene – Color by presence/absence – Examples show 4 genomes – Critique: • No discussion of scalability • Overview visualization • Doesn’t address our problem. Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  33. Dot plots • Coordinates of genes in two genomes are used as x and y axis • Orthologous genes in other genomes are plotted • Each genome given a unique color • Critique: – Doesn’t provide ‘gene-neighborhood’ view – Overview tool – Hard to follow beyond a few genomes. Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
  34. Overview Visualization: Sequence Surveyor • Not this domain problem, but interesting approach • Each gene is drawn as a rectangle • Several possible variables for position: Ordinal position • Several possible variables for color: – Position in one reference genome – Use a color ramp, for wide range of colors. Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  35. Overview Visualization: Sequence Surveyor • Pixel-density scalable – High-density representation – High-detail representation • Display size scalability – May be difficult to compare patterns from one side of display to another • Perceptual Scalability – Colors allow for pre-attentive identification of patterns – Avoids visual clutter. Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
  36. Copy number variations on big displays • Orchestral: – Visualization of a different data type – Effective use of color to enable pre-attentive identification of similarities across genomes – High-density representation – Details up close, overview from a distance. Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
  37. BactoGeNIE Demo • Video at: https://www.youtube.com/watch?v=yrSyi1RWcUw
  38. Program details • Implemented in C++ using Qt and the QGraphicsView framework • Upload: – genome feature files – FASTA files (raw gene sequences) • CD-HIT algorithm processes sequence files to compute ortholog ‘clusters’ • MySQL database to store big datasets – Loads 1000 contigs into memory, rest stored in database • Optimized for PubMed datasets • Prototyped on E. coli draft genomes – Capable of displaying any contigs from thousands of E. coli draft genomes • On EVL Cyber-commons wall, around 400 contigs in view
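To give a feel for what computing ortholog ‘clusters’ from sequence files involves, here is a toy greedy clustering sketch. It is loosely in the spirit of greedy incremental clustering (longest sequences first, each sequence joins the first representative it resembles), but it is not CD-HIT and does not use CD-HIT's similarity measure or interface. The function `greedy_cluster`, the threshold, and the sequences are assumptions for illustration, and `difflib.SequenceMatcher` stands in for a real sequence-identity computation.

```python
from difflib import SequenceMatcher

def greedy_cluster(seqs, threshold=0.8):
    """Toy greedy clustering: visit sequences longest first; each one joins
    the first cluster whose representative is similar enough, otherwise it
    starts a new cluster and becomes its representative."""
    clusters = []  # list of (representative_id, [member_ids])
    for sid in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        for rep, members in clusters:
            # Stand-in similarity score, not a biological identity measure.
            score = SequenceMatcher(None, seqs[rep], seqs[sid],
                                    autojunk=False).ratio()
            if score >= threshold:
                members.append(sid)
                break
        else:
            clusters.append((sid, [sid]))
    return clusters

# Hypothetical protein fragments: g1 and g2 differ in one residue.
seqs = {
    "g1": "MKTAYIAKQR",
    "g2": "MKTAYIAKQK",
    "g3": "GGGGGGGGGG",
}
clusters = greedy_cluster(seqs, threshold=0.8)
```

Each resulting cluster can then be treated as one ortholog group, which is the label the visualization needs in order to color and align genes across genomes.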
  39. BactoGeNIE: High density representation • Compressed genome encoding • No text labels, instead ‘on-demand’ • No ‘comparative track’ • Encode orthology using – User applied color: pre-attentive orthology identification – Coordinated highlighting: scalable visual query – Alignment: use space to encode similarity
  40. Use space to encode similarity • Goals: – Make it easier to perform comparisons across many genomes (Analytic Task Scalability) – Accommodate increased display size (Display Size Scalability) – Make similarities and differences easy to see (Perceptual Scalability) • Sorting and Alignment – Sort by contig length – Sort by gene content – Dynamically align against any gene
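Sorting by contig length and aligning against a chosen gene can be sketched as follows. This is an illustrative sketch only, not BactoGeNIE's C++/Qt implementation: the `Gene` record, the function names, and the coordinates are assumptions.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    ortholog: str  # ortholog cluster label (hypothetical)
    start: int     # start coordinate on the contig
    end: int       # end coordinate on the contig

def sort_by_length(contigs):
    """Order contig names from longest to shortest, using the largest gene
    end coordinate as a stand-in for contig length."""
    return sorted(contigs,
                  key=lambda name: max(g.end for g in contigs[name]),
                  reverse=True)

def alignment_shifts(contigs, target):
    """For each contig containing `target`, return the horizontal shift
    that places the start of its first `target` gene at x = 0."""
    shifts = {}
    for name, genes in contigs.items():
        hit = next((g for g in genes if g.ortholog == target), None)
        if hit is not None:
            shifts[name] = -hit.start
    return shifts

contigs = {
    "c2": [Gene("B", 40, 190), Gene("C", 210, 400)],
    "c1": [Gene("A", 0, 100), Gene("B", 150, 300), Gene("C", 320, 500)],
}
order = sort_by_length(contigs)
shifts = alignment_shifts(contigs, "B")
```

Drawing each contig at its shift lines up every copy of the chosen gene in one column, so position on screen carries the comparison rather than connecting lines.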
  41. Interactivity • On hovering, contig expands in height, so easier to select genes of interest in high-density view • ‘Pop-up’ menu for each gene that gives info and allows for: – application of color: • ‘tagging’ operation • Scalable query – “targeting” operation (described next) • User can sort genomes by: – Gene target – Contig length
  42. ‘Gene Targeting’ Function to create high resolution, comparative ‘maps’ • User selects a gene of interest • This gene is given a base color • Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’ • Orthologous genes in related genomes are given the same colors • Contigs containing this gene are brought to the top • The target gene is centered • Orthologs are aligned to the target
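The targeting steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis implementation: contigs are simplified to ordered lists of ortholog labels, ‘upstream’ is taken to mean a lower index, and a color is represented abstractly as a (ramp, step) pair rather than an RGB value.

```python
def ramp_assignment(reference, target):
    """Assign each gene on the reference contig a (ramp, step) pair by its
    distance from the first occurrence of `target`."""
    t = reference.index(target)
    colors = {target: ("target", 0)}
    for i, gene in enumerate(reference):
        if i < t:
            colors.setdefault(gene, ("upstream", t - i))
        elif i > t:
            colors.setdefault(gene, ("downstream", i - t))
    return colors

def target_first(contigs, target):
    """Order contig names so those containing `target` come first;
    the sort is stable, so the original order is otherwise kept."""
    return sorted(contigs, key=lambda name: target not in contigs[name])

contigs = {
    "c1": ["A", "B", "C", "D"],
    "c2": ["X", "Y"],
    "c3": ["C", "B", "E"],
}
colors = ramp_assignment(contigs["c1"], "B")
order = target_first(contigs, "B")
# Orthologs elsewhere reuse the reference colors; unknown genes stay uncolored.
c3_colors = [colors.get(g) for g in contigs["c3"]]
```

In `c3`, gene C keeps its ‘downstream’ color even though it appears before B there, so a rearrangement shows up as a color out of place.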
  43. Gene targeting function • Clustering to promote direct comparisons • Overviews at a distance • Details up close • Pre-attentive identification of similarities and differences between gene neighborhoods. Lance Long
  44. Examples
  45. Pixel-density Scalability: BactoGeNIE fits the pixel-density scalability criteria: High-density data display, identifier display and orthology encoding
  46. Display Size Scalability • BactoGeNIE is the only approach to use clustering and show multiple levels of detail
  47. Perceptual Scalability and Analytic Tasks. BactoGeNIE: • Similarity is pre-attentively accessible • Avoids visual clutter • Visual query for orthologs
  48. Graphical Scalability: Display Resolution vs Number of Genomes. [Chart: axes Pixels (480, 720, 1080, 1440, 2160, 2880, 3240, 4320) and Genomes (0 to 1000); series: BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco, Mauve]
  49. Preliminary User Feedback • A version of BactoGeNIE used by computational biology team on NxN pixels and MxM inches resolution tiled display wall • “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes” • “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest. • This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons. • The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns. • This tool has a potential to serve both for visualization as well as data mining needs.” Usage of a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users
  50. Summary of contributions • A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view • First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps • First to show overviews with gene neighborhood details that can be accessed through physical movement • Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
  51. What’s next? Design • Integration with different levels of detail • Multiple color ramps • Advanced ordering in y, based on similarity to target or strain phylogeny. Implementation • Scalability in rendering using parallelization on the GPU • Port to SAGE. Evaluation • User studies and evaluations of perceptual scalability
  52. Scalable Design, Big Data, Big Displays • Need visualization to provide an interface between automated analysis and the expert • Porting existing visual approaches to big data and big displays will not always work • Need to design for increased – pixel-density – display size – volume of analytical tasks
  53. Thanks! • Acknowledgements: – Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory – Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto • Questions?
