Holland R - Pistoia Alliance Sequence Squeeze

Presentation at BOSC2012 by Holland R - Pistoia Alliance Sequence Squeeze

Published in: Technology

  1. Pistoia Alliance Sequence Squeeze
     Using a competition model to spur development of novel open-source algorithms
     Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia)
     BOSC, July 2012
     ©Eagle Genomics Ltd.
  2. Order of Service
     • What/who is the Pistoia Alliance?
     • What is/was Sequence Squeeze?
     • Who won, how, and why?
     • Why did Pistoia do this?
     • Why is this good for BOSC delegates?
     • Will it happen again?
     Pistoia Alliance Sequence Squeeze  ©Eagle Genomics Ltd  July 14, 2012
  3. What/who is the Pistoia Alliance?
  4. Who is Pistoia?
     • The Pistoia Alliance is
       – global
       – not-for-profit
       – a precompetitive alliance
       – of life science companies, vendors, publishers, and academic groups
       – that aims to lower barriers to innovation
       – by improving the interoperability of R&D business processes.
     • We differ from standards groups because
       – we bring together the key constituents to identify the root causes that lead to R&D inefficiencies
       – we develop best practices and technology pilots to overcome common obstacles.
  5. What is/was Sequence Squeeze?
  6. The NGS problem
     • Storing millions of NGS reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate.
     • There is a need for a novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
  7. What was Sequence Squeeze?
     • Contest to find a better FASTQ compression algorithm
       – FASTQ was the easiest format for ranking entries in an automated setting.
     • Open-source, non-restrictive licence required for entries
       – to benefit the whole community.
     • Entries tested on an extract of the 1000 Genomes data stored in AWS.
     • Prize fund of US$15,000 for the best algorithm submitted before the closing date of 15 March 2012.
     • Winner was announced at the Pistoia Alliance Conference in Boston MA on 24 April 2012
       – more on that story later.
     • Organised and administered by Eagle under contract to Pistoia.
  8. Who entered?
     • 108 distinct entries.
     • But all these from only 12 entrants!
       – some entrants were groups or consortia but most were individuals.
     • Public leaderboard encouraged fiercer competition.
     • Entrants seemingly driven to outdo their competitors.
  9. Who judged?
     • Yingrui Li – Duty Operation Officer of Science & Technology Department of the BGI-Shenzhen.
     • Nick Lynch – President of the Pistoia Alliance (2009-11).
     • Guy Coates – leader of the Informatics Systems Group at the Wellcome Trust Sanger Institute.
     • Tim Fennell – Assistant Director for Sequencing Pipeline Informatics at the Broad Institute.
  10. Who won, how, and why?
  11. What were the results?
      • Entrants were judged by
        – compression ratio
        – compression time and memory
        – decompression time and memory
        – accuracy (lossiness – 100% target)
        – manual review for code quality, scalability, and other factors.
      • The same three people showed up at the top of every category
        – in a different order
        – with different versions of their entries.
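
The automated criteria listed on this slide can be approximated with a few lines of Python. This is only a minimal sketch, not the contest's actual scoring harness: the function names, the synthetic sample data, and the use of `bz2` as the codec under test are all assumptions made for illustration. The ratio is expressed as compressed size as a percentage of the original, which matches how the figures are quoted elsewhere in the talk. Note that `tracemalloc` reports memory traced by the Python allocator, so the memory figures are rough.

```python
import bz2
import time
import tracemalloc


def make_sample_fastq(n_reads: int = 1000) -> bytes:
    """Build a small synthetic FASTQ file (hypothetical data, not 1000 Genomes)."""
    record = b"@read%d\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n"
    return b"".join(record % i for i in range(n_reads))


def _timed(func, payload: bytes):
    """Run func(payload) and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(payload)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak


def evaluate(compress, decompress, data: bytes) -> dict:
    """Score one codec on ratio, time, memory and round-trip fidelity."""
    packed, c_seconds, c_peak = _timed(compress, data)
    restored, d_seconds, d_peak = _timed(decompress, packed)
    return {
        "ratio_percent": 100.0 * len(packed) / len(data),
        "compress_seconds": c_seconds,
        "decompress_seconds": d_seconds,
        "compress_peak_bytes": c_peak,
        "decompress_peak_bytes": d_peak,
        "lossless": restored == data,  # the 100% accuracy target
    }


if __name__ == "__main__":
    report = evaluate(bz2.compress, bz2.decompress, make_sample_fastq())
    for name, value in report.items():
        print(f"{name}: {value}")
```

Any pair of `compress`/`decompress` callables that take and return `bytes` can be passed to `evaluate`, so several codecs can be ranked per category in the same way.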
  12. Who won, and why?
      • James Bonfield won overall
        – majority of top places in each category
        – using various versions of his entry
        – forming a suite of suitable tools.
      • 11.41% compression ratio (test data ~6GB)
        – or 109.90 seconds compression time
        – or 100.91 seconds decompression time
        – or 35.76MB compression memory usage
        – or 16.01MB decompression memory usage
        – but not all at once!
  13. Implications of winning entry
      • The approach is very simple – essentially:
        – convert the FASTQ to BAM alignments against a reference genome, preserving quality scores.
        – compress the BAM files.
      • Many other entries followed the same pattern:
        – convert to some other format then compress using standard techniques.
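
The "convert to some other format, then compress using standard techniques" pattern can be illustrated with a much simpler transformation than BAM alignment. The sketch below is not the winning entry's method: it merely splits a FASTQ file into three streams (read names, bases, qualities) and hands each to `bz2` separately. It assumes a strict four-line FASTQ layout with a bare `+` separator line and a trailing newline; the function names are hypothetical.

```python
import bz2


def split_streams(fastq: bytes):
    """Split four-line FASTQ records into (names, sequences, qualities)."""
    if not fastq.endswith(b"\n"):
        raise ValueError("expected a trailing newline")
    lines = fastq[:-1].split(b"\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    # Only a bare '+' separator can be rebuilt without storing it.
    if any(sep != b"+" for sep in lines[2::4]):
        raise ValueError("expected a bare '+' separator line")
    return (
        b"\n".join(lines[0::4]),
        b"\n".join(lines[1::4]),
        b"\n".join(lines[3::4]),
    )


def join_streams(names: bytes, seqs: bytes, quals: bytes) -> bytes:
    """Inverse of split_streams: rebuild the original FASTQ bytes."""
    records = zip(names.split(b"\n"), seqs.split(b"\n"), quals.split(b"\n"))
    return b"".join(
        name + b"\n" + seq + b"\n+\n" + qual + b"\n" for name, seq, qual in records
    )


def pack(fastq: bytes):
    """Convert to the three-stream layout, then compress each stream with bz2."""
    return tuple(bz2.compress(stream) for stream in split_streams(fastq))


def unpack(blobs) -> bytes:
    """Decompress the three streams and restore the FASTQ file exactly."""
    return join_streams(*(bz2.decompress(blob) for blob in blobs))


if __name__ == "__main__":
    sample = b"@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
    assert unpack(pack(sample)) == sample
    print("round trip OK")
```

Whether separating the streams actually helps depends on the data and the compressor, so the packed size should be measured against plain `bz2.compress(fastq)` rather than assumed to be smaller.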
  14. Other interesting results
      • Matt Mahoney (Dell) submitted a specialised version of the standard tool paq which performed extremely well.
      • Even vanilla paq wasn't too bad.
      • Discarding the quality scores entirely gets a compression ratio of 2.87% vs. the original FASTQ (not FASTA).
      • If this contest truly represented the latest and greatest ideas in the field, then NGS storage must therefore either be
        – highly compressed, very slow access,
        – or less compressed, relatively fast access.
      • It's quite hard to beat bzip2.
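
The effect of discarding quality scores can be reproduced in miniature. The sketch below is an illustration on synthetic data, not a re-run of the contest measurement: it generates random reads and qualities with a fixed seed, drops the separator and quality lines, and reports both compressed sizes as a percentage of the original FASTQ size. The generator, the function names, and the choice of `bz2` are assumptions; the percentages it prints will not match the 2.87% quoted on the slide.

```python
import bz2
import random


def make_fastq(n_reads: int = 2000, read_len: int = 50, seed: int = 0) -> bytes:
    """Generate synthetic four-line FASTQ records with random bases and qualities."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(rng.randint(33, 73)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def drop_qualities(fastq: bytes) -> bytes:
    """Keep only the name and sequence lines (lossy: qualities are discarded)."""
    lines = fastq.split(b"\n")[:-1]  # assumes a trailing newline
    kept = [line for idx, line in enumerate(lines) if idx % 4 in (0, 1)]
    return b"\n".join(kept) + b"\n"


def ratio_percent(payload: bytes, original: bytes) -> float:
    """Compressed size of payload as a percentage of the ORIGINAL FASTQ size."""
    return 100.0 * len(bz2.compress(payload)) / len(original)


if __name__ == "__main__":
    fastq = make_fastq()
    print(f"with qualities:    {ratio_percent(fastq, fastq):.2f}%")
    print(f"without qualities: {ratio_percent(drop_qualities(fastq), fastq):.2f}%")
```

Both ratios use the full FASTQ size as the denominator, mirroring the slide's "vs. the original FASTQ (not FASTA)" comparison.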
  15. And unexpected benefits
      • Photo: David Flanders (Eagle CEO) and John Wise (Pistoia chairman) present James Bonfield with his prize.
      • James Bonfield donated his entire prize fund (US$15,000) to charity.
        – 50% to the Wellcome Trust Sanger Institute.
        – 50% to the British Heart Foundation.
  16. Publication
      • Formal paper being written at the moment by James Bonfield
        – in collaboration with close-second Matt Mahoney
        – and judge Nick Lynch
        – and the authors of other significant entries.
      • Source code of ALL entries is available at www.sequencesqueeze.org
        – all under BSD licence
        – all hosted at SourceForge or similar
        – click entry names to be taken to download page.
      • Interviews with entrants at the Pistoia blog www.pistoiaalliance.org/blog
        – search for articles with the tag 'compression algorithms'.
  17. Why did Pistoia do this?
  18. Why did Pistoia do this?
      • Encouraging innovation through prize-backed contests.
      • Open innovation model allows industry to state its requirements
        – then let the free market decide how to deliver something that satisfies these.
  19. Why did Pistoia do this?
      • Typical bioinformatics open-source hackers do things because they enjoy them
        – but sometimes also because of the challenge, the kudos, the satisfaction of solving a real-world problem.
      • James' charity donation is a great example of this
        – he wasn't in it for the money
        – but the prize fund created a tangible goal to aim at.
      • Amazon kindly sponsored vouchers for all participants that should have covered the cost of developing and submitting an entry
        – contest was AWS-based
        – entries had to be submitted as S3 buckets.
  20. Why did Pistoia do this?
      • Leaderboard encouraged competition
        – one-upmanship
        – innovation.
      • Does not discourage collaboration
        – James and Matt both discussed their entries with the data compression community at encode.ru
  21. Why did Pistoia do this?
      • BSD-licence requirement ensured that the winning entry was not going to be available only to those willing to pay a fee.
      • Entire community benefits, not just Pistoia members or those with deep pockets to pay for software licence agreements.
  22. Why is this good for BOSC delegates?
  23. Why is this good for BOSC delegates?
      • If the entries had been closed/commercial then only organisations willing to pay to license/buy the resulting products would benefit.
      • But this way the entire community benefits from the results, for free, without restriction.
      • Beneficiaries include big pharma and other large corporations that commissioned the contest
        – but also all universities
        – all non-profits
        – all small businesses in biotech
        – and everyone else involved in NGS work.
      • Pistoia is a pre-competitive alliance
        – there is no reason to make the Alliance's output exclusive
        – they are there to develop and share ideas, not to build an empire.
  24. Will it happen again?
  25. Will it happen again?
      • Pleased with outcome and level of interest.
      • So, yes.
      • Goal is to run two such contests a year.
      • But, your community needs you!
        – we need a topic/subject/idea that can be rationally/objectively judged/ranked
        – and that is relevant to the research activities of life science companies and other Pistoia members.
      • Ideas can be sent to Pistoia Ops team c/o execdirector@pistoiaalliance.org
  26. Credits
      • Pistoia Alliance for the idea and funding.
      • Eagle for organising and administering.
      • All contestants for entering.
      • 1000 Genomes for the test data.
      • AWS for sponsoring participants.
      • BOSC/OBF for accepting this talk.
  27. www.pistoiaalliance.org
      richard.holland@eaglegenomics.com
      www.sequencesqueeze.org
      +44 (0)1223 654481 x3
      (ideas to: execdirector@pistoiaalliance.org)
      www.eaglegenomics.com
      @eaglegen
      blog.eaglegenomics.com
      facebook.com/eaglegenomics
      @sequencesqueeze
      www.pistoiaalliance.org/blog
      @pistoiaalliance
      Eagle® is a registered trademark no. 010418135 of Eagle Genomics Ltd.
      Postal address: Eagle Genomics Ltd., Babraham Research Campus, Cambridge CB22 3AT, United Kingdom.
      ©Eagle Genomics Ltd.
