
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)



This is a talk I gave at XLDB 2012 on September 11, 2012 at Stanford University.



  1. Bionimbus: Lessons from a Petabyte-Scale Science Cloud Service Provider (CSP). Robert Grossman, Institute for Genomics & Systems Biology, Center for Research Informatics, Computation Institute, Department of Medicine, University of Chicago, & Open Data Group. September 11, 2012.
  2. The OSDC & Bionimbus Teams
     • Open Science Data Cloud (OSDC) Team: Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez. Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
     • Bionimbus Team: Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White. Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part the OSDC infrastructure.
  3. Let's Step Back 20 Years
     • 1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC.
     • It developed & benchmarked federated relational, OO DB, object store, & column-oriented data warehouse solutions at the TB scale.
  4. A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0.
  5. Part 1. Genomics as a Big Data Science
  6. Source: Lincoln Stein
  7. One Million Genomes
     • Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
     • The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
     • One million genomes is about 1000 PB, or 1 EB.
     • With compression, it may be about 100 PB.
     • At $1000/genome, the sequencing would cost about $1B.
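The arithmetic on this slide can be checked in a few lines. A minimal sketch using only the slide's figures (decimal units; the ~10x compression ratio is implied by the slide's 1 EB vs. ~100 PB estimates):

```python
# Back-of-the-envelope check of the slide's figures.
TB_PER_GENOME = 1            # tumor + normal tissue per patient, ~1 TB
GENOMES = 1_000_000
COST_PER_GENOME_USD = 1000

total_tb = TB_PER_GENOME * GENOMES
total_pb = total_tb / 1000           # 1 PB = 1000 TB (decimal units)
total_eb = total_pb / 1000           # 1 EB = 1000 PB

compressed_pb = total_pb / 10        # ~10x compression, implied by the slide

sequencing_cost = GENOMES * COST_PER_GENOME_USD

print(f"{total_pb:.0f} PB = {total_eb:.0f} EB")      # 1000 PB = 1 EB
print(f"~{compressed_pb:.0f} PB compressed")         # ~100 PB
print(f"${sequencing_cost / 1e9:.0f}B to sequence")  # $1B
```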
  8. Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development lead to precision diagnosis and treatment, and to preventive health care.
  9. ER+ vs. TNBC: with genomics, we can stratify diseases and treat each stratum differently. Source: White Lab, University of Chicago.
  10. Clonal Evolution of Tumors. Tumors evolve temporally and spatially. Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
  11. Combinations of Rare Alleles. [Figure: penetrance (high to low) vs. allele frequency (0.001-0.1, very rare to common). Rare alleles causing Mendelian disease have high penetrance; low-frequency variants have intermediate penetrance; rare variants of small effect are very hard to identify by genetic means; most common variants implicated in common disease by GWA have modest effects; examples of high-penetrance common variants influencing common disease are rare.] Source: Mark McCarthy
  12. TCGA Analysis of Lung Cancer
     • 178 cases of SQCC (lung cancer), matched tumor & normal.
     • Mean of 360 exonic mutations, 323 CNVs, & 165 rearrangements per tumor.
     Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
  13. Some Examples of Big Data Science
     Discipline         Duration    Size            # Devices
     HEP - LHC          10 years    15 PB/year*     One
     Astronomy - LSST   10 years    12 PB/year**    One
     Genomics - NGS     2-4 years   0.5 TB/genome   1000's
     *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://-en.html
     **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://-1004.html
  14. One large instrument vs. many smaller instruments.
  15. Part 2. What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope"?
  16. TB? PB? EB? ZB? What is big data?
  17. Another way: think of data as big if you measure it in MW; Facebook's Prineville Data Center, for example, is 30 MW.
  18. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
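This definition is essentially weak scaling: wall-clock time stays flat when data and processors grow in proportion. A toy sketch of that property (the per-rack throughput number is illustrative, not a measured OSDC figure):

```python
def wall_clock_hours(data_tb: float, racks: int,
                     tb_per_rack_hour: float = 50.0) -> float:
    """Idealized wall-clock time when work is spread evenly across racks.

    tb_per_rack_hour is an illustrative per-rack throughput, not a
    measured number for any real system.
    """
    return data_tb / (racks * tb_per_rack_hour)

# "Big-data scalable": doubling the data AND the racks keeps time constant.
t1 = wall_clock_hours(data_tb=500, racks=1)
t2 = wall_clock_hours(data_tb=1000, racks=2)
print(t1, t2)  # 10.0 10.0 under this idealized model
```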
  19. Commercial Cloud Service Provider (CSP): a 15 MW data center with 100,000 servers, 1 PB of DRAM, 100's of PB of disk, and ~1 Tbps egress bandwidth, run by about 25 operators. Core services: monitoring, network security and forensics; accounting and billing; a customer-facing portal; automatic provisioning and infrastructure management; and the data center network.
  20. What are some of the important differences between commercial and research-focused CSPs?
  21. Science CSP vs. Commercial CSP
     • POV: democratize access to data, integrate data to make discoveries, long-term archive, vs. as long as you pay the bill and the business model holds.
     • Data & storage: data-intensive computing & high-performance storage (Science Clouds), vs. Internet-style scale-out and object-based storage.
     • Flows: large data flows in and out, vs. lots of small web flows.
     • Streams: streaming processing required, vs. not applicable.
     • Accounting: essential for both.
     • Lock-in: moving environments between CSPs is essential, vs. lock-in is good.
  22. Part 3. The Open Cloud Consortium's Open Science Data Cloud
  23. The Open Cloud Consortium
     • U.S.-based not-for-profit corporation.
     • Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
     • Manages cloud computing testbeds: the Open Cloud Testbed.
     www.opencloudconsortium.org
  24. Cloud Services Operations Centers (CSOC)
     • The OSDC operates a Cloud Services Operations Center (or CSOC).
     • It is a CSOC focused on supporting Science Clouds for researchers.
     • Compare to a Network Operations Center, or NOC.
     • Both are an important part of the cyberinfrastructure for big data science.
  25. Different Styles of OSDC Racks (2012 OSDC rack design, draft: 950 TB/rack, 600 cores/rack)
     • Design 1: put cores over spindles. Higher cost, but easy to compute over all the data.
     • Design 2: separate (some of the) storage from the compute.
  26. Open Science Data Cloud (OSDC): 3 PB in 2011, 10 PB in 2012, able to scale to 100 PB?; ~100 Gbps bandwidth; 5-12 operators to operate a 1-5 MW Science Cloud. Services: accounting and billing; monitoring, compliance, & security; a customer-facing portal (Tukey); Science Cloud software & services; automatic provisioning and infrastructure management; and the data center network. The OSDC data stack is based upon OpenStack, Hadoop, GlusterFS, UDT, …
  27. OSDC Philosophy
     • We try to automate as much as possible (we automate the setup & operations of a rack).
     • We try to write as little software as possible.
     • Each project is a bit different, but in general:
     • We assign (permanent) IDs to data managed by the OSDC and manage associated metadata.
     • We assign and enforce permissions for users & groups of users and for files/objects, collections of files/objects, and collections of collections.
     • We support RESTful interfaces.
     • We do accounting for storage and core-hours.
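The ID-and-permissions scheme in these bullets could be sketched roughly as follows. All names and structures here are hypothetical illustrations, not the actual OSDC schema: the key ideas are permanent IDs with attached metadata, and permission checks that walk up from an object through its enclosing collections.

```python
# Hypothetical sketch: permanent IDs plus nested permission checks.
# Objects live in collections, collections in collections of collections,
# and a grant at any enclosing level permits access.
import uuid

metadata = {}   # permanent ID -> metadata dict
parents = {}    # permanent ID -> enclosing collection's ID (or None)
acl = {}        # permanent ID -> set of users/groups allowed to read

def register(meta, parent=None):
    """Assign a permanent ID to an object or collection."""
    pid = str(uuid.uuid4())
    metadata[pid] = meta
    parents[pid] = parent
    acl[pid] = set()
    return pid

def can_read(user, pid):
    """Walk up the collection hierarchy looking for a grant."""
    while pid is not None:
        if user in acl[pid]:
            return True
        pid = parents[pid]
    return False

project = register({"name": "chipseq-2012"})        # collection of collections
run = register({"name": "run-01"}, parent=project)  # collection
bam = register({"name": "sample.bam"}, parent=run)  # object

acl[project].add("white-lab")
print(can_read("white-lab", bam))  # True: granted at the project level
print(can_read("stranger", bam))   # False
```

In a real service the grant would be checked by the RESTful front end on each request, and the same IDs would key the storage and core-hour accounting records.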
  28. Some of Our Biggest Mistakes
     • Not charging those who were the largest users of our services. This resulted in a lot of bad behavior.
     • Trying to support donated equipment without adequate staff.
     • Being too optimistic about when big data software would be ready for prime time.
     • Some problems with big data software don't show up at less than the full scale of the OSDC, but we have only one OSDC and it is difficult to test at this scale.
  29. Essential Services for a Science CSP
     • Support for data intensive computing
     • Support for big data flows
     • Account management, authentication, and authorization services
     • Health and status monitoring
     • Billing and accounting
     • Ability to rapidly provision infrastructure
     • Security services, logging, event reporting
     • Access to large amounts of public data
     • High performance storage
     • Simple data export and import services
  30. [Figure: number of users vs. data size] 1000's of individual scientists & small projects work with small data on public infrastructure; 100's do community-based science via Science as a Service on shared community infrastructure with medium to large data; 10's of very large projects use dedicated infrastructure for very large data.
  31. Part 4. Bionimbus. Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
  32. Step 1. Prepare a sample.
  33. Step 2. Log in to Bionimbus and get a Bionimbus Key.
  34. Step 3. Send your sample to the sequencing center.
  35. Step 4. Log in to Bionimbus and view your data.
  36. Step 5. Use Bionimbus to perform standard and custom pipelines. Bionimbus can launch multiple virtual machines.
  37. Bionimbus Virtual Machine Releases
     • Peak calling: MAT, MA2C, PeakSeq, MACS, SPP
     • Quality control: various
     • Alignment & genotyping: Bowtie, TopHat, Samtools, Picard
  38. Software Tools: Moving Genomes
  39. Bionimbus Community Genomic Cloud: the researcher works against a cloud for public data (1K Genomes, PubMed, etc.) plus a personal "dropbox" + compute.
  40. Bionimbus Private Genomic Cloud: the researcher works against a cloud for public data (1K Genomes, PubMed, etc.) and a cloud for controlled data (TCGA, dbGaP), plus a personal "dropbox" & compute.
  41. Bionimbus Private Biomedical Cloud: the researcher works against a cloud for public data (1K Genomes, PubMed, etc.), a cloud for controlled data (TCGA, dbGaP), and a cloud for PHI data (clinical research data warehouse), plus a personal "dropbox" and compute. Scatter, gather queries span the clouds.
  42. The Bionimbus workflow:
     • Step 1. Get a Bionimbus ID (BID) from the BID generator; assign the project, private/community/public cloud, etc.
     • Step 2. Send the sample to be sequenced, either on internal sequencers or by an external sequencing partner.
     • Step 3a. Return raw reads. Step 3b. Return variant calls, CNVs, annotation…
     • Step 4. Secure data routing to the appropriate cloud based upon the BID: the Bionimbus Private Cloud (UC), the Bionimbus Community Cloud, Bionimbus Private Cloud XY, dbGaP, or Amazon.
     • Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications.
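Step 4's BID-based routing amounts to a lookup against the registration made in Step 1: each BID records which cloud its data belongs in, so returning data is routed by lookup rather than guesswork. A hypothetical sketch (the registry entries, tier names, and cloud names are illustrative, not Bionimbus's actual data model):

```python
# Hypothetical sketch of BID-based secure data routing (Step 4).
bid_registry = {
    # Recorded at BID creation (Step 1): project and cloud tier.
    "BID-0001": {"project": "modENCODE", "tier": "community"},
    "BID-0002": {"project": "tcga-lusc", "tier": "private"},
}

clouds = {
    "community": "Bionimbus Community Cloud",
    "private": "Bionimbus Private Cloud (UC)",
    "public": "Public cloud (e.g. Amazon)",
}

def route(bid: str) -> str:
    """Return the destination cloud for a BID; refuse unregistered BIDs."""
    entry = bid_registry.get(bid)
    if entry is None:
        raise KeyError(f"unregistered BID: {bid}")
    return clouds[entry["tier"]]

print(route("BID-0002"))  # Bionimbus Private Cloud (UC)
```

Refusing unknown BIDs, rather than defaulting to any cloud, is the "secure" part of the routing: controlled-access data never lands anywhere it was not registered for.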
  43. Bionimbus architecture: a web2py-based front end; utility cloud services (Eucalyptus, OpenStack); database services (PostgreSQL); analysis pipelines & re-analysis services; intercloud services (IDs, etc., with UDT and replication); data ingestion services; and cloud services (Hadoop, Sector/Sphere).
  44. >300 ChIP datasets: chromatin/RNA timecourse, CBP, PolII, Pho/silencers, HDACs, insulators, TFs. Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators. Negre et al., Nature 2011.
  45. Part 5. Managing One Million Genomes
  46. The storage hierarchy, enriched with clinical data:
     • Summary level (10-100 TB): relational databases.
     • Variation (VCF) files (1-10 PB), i.e. genomic variation: NoSQL & scientific databases.
     • Sequence (BAM) files (100-1000 PB), i.e. sequence data in binary form: NoSQL, DFS, file overlays?
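At the million-genome scale, the tier totals above shrink roughly 100x at each step up the hierarchy, which is what makes the upper tiers queryable in databases. A small sketch of the per-genome sizes these totals imply (decimal units; the tier totals are read off the slide, the rest is arithmetic):

```python
# Per-genome sizes implied by the slide's tier totals for 10^6 genomes.
GENOMES = 1_000_000
GB_PER_PB = 1e6                     # decimal units: 1 PB = 10^6 GB

tiers_pb = {                        # (low, high) totals from the slide
    "sequence (BAM)": (100, 1000),
    "variation (VCF)": (1, 10),
    "summary (relational)": (0.01, 0.1),   # 10-100 TB
}

per_genome_gb = {
    tier: (lo * GB_PER_PB / GENOMES, hi * GB_PER_PB / GENOMES)
    for tier, (lo, hi) in tiers_pb.items()
}

for tier, (lo, hi) in per_genome_gb.items():
    print(f"{tier}: {lo:g}-{hi:g} GB per genome")
```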
  47. Acknowledgements
     Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan, and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
     • The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
     • Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
     • NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
     • OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
     • The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
     The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at
  48. For more information
     • You can find some more information on my blog.
     • Some of my technical papers are also available there.
     • My email address is robert.grossman at uchicago dot edu.
     • I recently wrote a popular book about computing called The Structure of Digital Computing: From Mainframes to Big Data, which you can buy from Amazon.
     Center for Research Informatics
  49. Sources for images
     • The image of the hard disk is from Norlando Pobre, Creative Commons.
     • The image of the Facebook Prineville Data Center is from the Intel Free Press, Creative Commons BY 2.0.
     • The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, photos/58220828@N07/5350788732