Holland R - Pistoia Alliance Sequence Squeeze
 


Presentation at BOSC2012 by Holland R - Pistoia Alliance Sequence Squeeze

    Presentation Transcript

    • Pistoia Alliance Sequence Squeeze
        • Using a competition model to spur development of novel open-source algorithms
        • Richard Holland (Eagle/Pistoia), Nick Lynch (AZ/Pistoia)
        • BOSC, July 2012
        • ©Eagle Genomics Ltd.
    • Order of Service
        • What/who is the Pistoia Alliance?
        • What is/was Sequence Squeeze?
        • Who won, how, and why?
        • Why did Pistoia do this?
        • Why is this good for BOSC delegates?
        • Will it happen again?
        Pistoia Alliance Sequence Squeeze   ©Eagle Genomics Ltd   July 14, 2012
    • What/who is the Pistoia Alliance?
    • Who is Pistoia?
        • The Pistoia Alliance is
            – global
            – not-for-profit
            – precompetitive alliance
            – life science companies, vendors, publishers, and academic groups
            – aims to lower barriers to innovation
            – by improving the interoperability of R&D business processes.
        • We differ from standards groups because
            – we bring together the key constituents to identify the root causes that lead to R&D inefficiencies
            – develop best practices and technology pilots to overcome common obstacles.
    • What is/was Sequence Squeeze?
    • The NGS problem
        • Storing millions of NGS reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate.
        • There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
    • What was Sequence Squeeze?
        • Contest to find a better FASTQ compression algorithm
            – easiest format for ranking entries in an automated setting.
        • Open source, non-restrictive licence required for entries
            – benefit the whole community.
        • Entries tested on an extract of the 1000 Genomes data stored in AWS.
        • Prize fund of US$15,000 to the best algorithm submitted before the closing date of 15 March 2012.
        • Winner was announced at the Pistoia Alliance Conference in Boston MA on 24 April 2012
            – more on that story later.
        • Organised and administered by Eagle under contract to Pistoia.
    • Who entered?
        • 108 distinct entries.
        • But all these from only 12 entrants!
            – some entrants were groups or consortia but most were individuals.
        • Public leaderboard encouraged fiercer competition.
        • Entrants seemingly driven to outdo their competitors.
    • Who judged?
        • Yingrui Li – Duty Operation Officer of Science & Technology Department of the BGI-Shenzhen.
        • Nick Lynch – President of the Pistoia Alliance (2009-11).
        • Guy Coates – leader of the Informatics Systems Group at the Wellcome Trust Sanger Institute.
        • Tim Fennell – Assistant Director for Sequencing Pipeline Informatics at the Broad Institute.
    • Who won, how, and why?
    • What were the results?
        • Entrants were judged by
            – compression ratio
            – compression time and memory
            – decompression time and memory
            – accuracy (lossiness – 100% target)
            – manual review for code quality, scalability, and other factors.
        • The same three people showed up at the top of every category
            – in a different order
            – with different versions of their entries.
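
The automated criteria on this slide (compression ratio, time, memory, and a lossless round trip) can be sketched as a small harness. This is an illustration only, not the contest's scoring code: the names `evaluate` and `_timed`, the use of Python's `bz2` as the compressor under test, and the definition of ratio as compressed size over original size are all assumptions.

```python
import bz2
import time
import tracemalloc


def _timed(func, payload):
    """Run func(payload); return (result, seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(payload)
    seconds = time.perf_counter() - start
    # tracemalloc only sees memory allocated through Python's allocator,
    # so the peak is a rough proxy, not a true process-level measurement.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak


def evaluate(compress, decompress, data):
    """Score one compressor on one input (hypothetical metric names)."""
    if not data:
        raise ValueError("need a non-empty input to compute a ratio")
    packed, c_sec, c_peak = _timed(compress, data)
    restored, d_sec, d_peak = _timed(decompress, packed)
    return {
        # Assumed definition: compressed size as a percentage of original.
        "ratio_percent": 100.0 * len(packed) / len(data),
        "compress_seconds": c_sec,
        "decompress_seconds": d_sec,
        "compress_peak_bytes": c_peak,
        "decompress_peak_bytes": d_peak,
        # The contest targeted 100% accuracy, i.e. an exact round trip.
        "lossless": restored == data,
    }


if __name__ == "__main__":
    sample = b"@read1\nACGTACGTAC\n+\nIIIIIHHHGG\n" * 1000
    for name, value in evaluate(bz2.compress, bz2.decompress, sample).items():
        print(name, value)
```

Any pair of callables taking and returning `bytes` can be passed in, so the same harness could compare several candidate compressors on identical input.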
    • Who won, and why?
        • James Bonfield won overall
            – majority of top places in each category
            – using various versions of his entry
            – forming a suite of suitable tools.
        • 11.41% compression ratio (test data ~6GB)
            – or 109.90 seconds compression time
            – or 100.91 seconds decompression time
            – or 35.76MB compression memory usage
            – or 16.01MB decompression memory usage
            – but not all at once!
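
To put the headline figure in context: assuming "compression ratio" means compressed size as a percentage of the original (the slides do not define it), the reported 11.41% applied to the roughly 6 GB test set works out as below. The helper names are illustrative, and 6 GB is the slide's approximation, so the results are approximate too.

```python
def compressed_size_gb(original_gb, ratio_percent):
    """Compressed size, given ratio = compressed / original * 100 (assumed)."""
    return original_gb * ratio_percent / 100.0


def reduction_factor(ratio_percent):
    """How many times smaller the compressed data is than the original."""
    return 100.0 / ratio_percent


print(f"{compressed_size_gb(6.0, 11.41):.2f} GB")  # 0.68 GB
print(f"{reduction_factor(11.41):.2f}x smaller")   # 8.76x smaller
```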
    • Implications of winning entry
        • The approach is very simple – essentially:
            – convert the FASTQ to BAM alignments against a reference genome, preserving quality scores.
            – compress the BAM files.
        • Many other entries followed the same pattern:
            – convert to some other format then compress using standard techniques.
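
The general pattern on this slide, "convert to some other format then compress using standard techniques", can be illustrated with a deliberately simple transform. The sketch below is not the winning entry (which, per the slide, aligned reads against a reference genome to produce BAM); it only splits a FASTQ file into separate identifier, sequence, and quality streams and compresses each with `bz2`. The function names and the restriction to records with a bare `+` separator line are assumptions made to keep the round trip exact.

```python
import bz2


def split_streams(fastq):
    """Split FASTQ text into (ids, sequences, qualities) lists."""
    if fastq and not fastq.endswith("\n"):
        raise ValueError("expected FASTQ text ending in a newline")
    lines = fastq.split("\n")[:-1]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly four lines")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Simplification: only a bare "+" separator is supported, so that
        # nothing is lost when the record is rebuilt.
        if not header.startswith("@") or plus != "+":
            raise ValueError(f"unsupported record at line {i + 1}")
        ids.append(header[1:])
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def pack(fastq):
    """Compress each stream separately with a standard compressor."""
    return tuple(
        bz2.compress("".join(item + "\n" for item in stream).encode("ascii"))
        for stream in split_streams(fastq)
    )


def unpack(blobs):
    """Invert pack(): decompress the three streams and rebuild the FASTQ."""
    ids, seqs, quals = (
        bz2.decompress(blob).decode("ascii").split("\n")[:-1] for blob in blobs
    )
    return "".join(f"@{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals))


if __name__ == "__main__":
    example = "@r1\nACGTAC\n+\nIIIHHG\n@r2\nGGCATT\n+\nHHIIGG\n"
    assert unpack(pack(example)) == example
```

Grouping like with like gives the compressor three more homogeneous inputs instead of one interleaved one; whether that helps on real data depends on the reads and would need measuring.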
    • Other interesting results
        • Matt Mahoney (Dell) submitted a specialised version of the standard tool paq which performed extremely well.
        • Even vanilla paq wasn’t too bad.
        • Discarding the quality scores entirely gets a compression ratio of 2.87% vs. the original FASTQ (not FASTA).
        • If this contest truly represented the latest and greatest ideas in the field, then NGS storage must therefore either be
            – highly compressed, very slow access,
            – or less compressed, relatively fast access.
        • It’s quite hard to beat bzip2.
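
The effect of discarding quality scores can be explored with a small experiment. This sketch uses synthetic reads with uniformly random bases and qualities, which are harder to compress than real data, so its numbers will not match the 2.87% on the slide. The names `synthetic_fastq`, `drop_qualities`, and `ratio_percent` are hypothetical, and `bz2` stands in for whichever compressor is being studied.

```python
import bz2
import random


def synthetic_fastq(n_reads, read_len, seed=0):
    """Build made-up FASTQ text with random bases and quality characters."""
    rng = random.Random(seed)
    records = []
    for n in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randrange(41)) for _ in range(read_len))
        records.append(f"@read{n}\n{seq}\n+\n{qual}\n")
    return "".join(records)


def drop_qualities(fastq):
    """Keep only the header and sequence line of each four-line record.

    This is lossy by design: the separator and quality lines are discarded.
    """
    lines = fastq.split("\n")[:-1]
    kept = [line for i, line in enumerate(lines) if i % 4 in (0, 1)]
    return "".join(line + "\n" for line in kept)


def ratio_percent(payload, original):
    """Compressed payload size as a percentage of the original FASTQ size."""
    packed = bz2.compress(payload.encode("ascii"))
    return 100.0 * len(packed) / len(original.encode("ascii"))


if __name__ == "__main__":
    fastq = synthetic_fastq(n_reads=2000, read_len=100)
    print("with qualities:   ", ratio_percent(fastq, fastq))
    print("without qualities:", ratio_percent(drop_qualities(fastq), fastq))
```

Both ratios are taken against the original FASTQ size, matching the slide's "vs. the original FASTQ (not FASTA)" wording.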
    • And unexpected benefits
        • Photo: David Flanders (Eagle CEO) and John Wise (Pistoia chairman) present James Bonfield with his prize.
        • James Bonfield donated his entire prize fund – US$15,000 – to charity.
            – 50% to the Wellcome Trust Sanger Institute.
            – 50% to the British Heart Foundation.
    • Publication
        • Formal paper being written at the moment by James Bonfield
            – in collaboration with close-second Matt Mahoney
            – and judge Nick Lynch
            – and the authors of other significant entries.
        • Source code of ALL entries is available at www.sequencesqueeze.org
            – all under BSD licence
            – all hosted at SourceForge or similar
            – click entry names to be taken to download page.
        • Interviews with entrants at the Pistoia blog www.pistoiaalliance.org/blog
            – search for articles with the tag ‘compression algorithms’.
    • Why did Pistoia do this?
    • Why did Pistoia do this?
        • Encouraging innovation through prize-backed contests.
        • Open innovation model allows industry to state its requirements
            – then let the free market decide how to deliver something that satisfies these.
    • Why did Pistoia do this?
        • Typical bioinformatics open-source hackers do things because they enjoy them
            – but sometimes also because of the challenge, the kudos, the satisfaction of solving a real-world problem.
        • James’ charity donation is a great example of this
            – he wasn’t in it for the money
            – but the prize fund created a tangible goal to aim at.
        • Amazon kindly sponsored vouchers for all participants that should have covered the cost of developing and submitting an entry
            – contest was AWS-based
            – entries had to be submitted as S3 buckets.
    • Why did Pistoia do this?
        • Leaderboard encouraged competition
            – one-upmanship
            – innovation.
        • Does not discourage collaboration
            – James and Matt both discussed their entries with the data compression community at encode.ru
    • Why did Pistoia do this?
        • BSD-licence requirement ensured that the winning entry was not going to be available only to those willing to pay a fee.
        • Entire community benefits, not just Pistoia members or those with deep pockets to pay for software licence agreements.
    • Why is this good for BOSC delegates?
    • Why is this good for BOSC delegates?
        • If the entries had been closed/commercial then only organisations willing to pay to license/buy the resulting products would benefit.
        • But this way the entire community benefits from results, for free, without restriction.
        • Beneficiaries include big pharma and other large corporations that commissioned the contest
            – but also all universities
            – all non-profits
            – all small businesses in biotech
            – and everyone else involved in NGS work.
        • Pistoia is about pre-competitive alliance
            – there is no reason to make the Alliance’s output exclusive
            – they are there to develop and share ideas, not to build an empire.
    • Will it happen again?
    • Will it happen again?
        • Pleased with outcome and level of interest.
        • So, yes.
        • Goal is to run two such contests a year.
        • But, your community needs you!
            – we need a topic/subject/idea that can be rationally/objectively judged/ranked
            – and that is relevant to the research activities of life science companies and other Pistoia members.
        • Ideas can be sent to Pistoia Ops team c/o execdirector@pistoiaalliance.org
    • Credits
        • Pistoia Alliance for the idea and funding.
        • Eagle for organising and administering.
        • All contestants for entering.
        • 1000 Genomes for the test data.
        • AWS for sponsoring participants.
        • BOSC/OBF for accepting this talk.
    • richard.holland@eaglegenomics.com, +44 (0)1223 654481 x3 (ideas to: execdirector@pistoiaalliance.org)
        • www.pistoiaalliance.org, www.sequencesqueeze.org, www.eaglegenomics.com
        • blog.eaglegenomics.com, www.pistoiaalliance.org/blog
        • @eaglegen, @sequencesqueeze, @pistoiaalliance
        • facebook.com/eaglegenomics
        • Eagle® is a registered trademark no. 010418135 of Eagle Genomics Ltd.
        • Postal address: Eagle Genomics Ltd., Babraham Research Campus, Cambridge CB22 3AT, United Kingdom.
        • ©Eagle Genomics Ltd.