The Open Science Data Cloud: Empowering the Long Tail of Science



This is a talk I gave at the GLIF Workshop in Chicago on October 11, 2012.


  1. A 501(c)(3) not-for-profit operating clouds for science. The Open Science Data Cloud: Empowering the Long Tail of Science. October 12, 2012. Robert L. Grossman, University of Chicago and Open Cloud Consortium.
  2. Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
  3. Question 2. What is the analogy of the GLIF* for analytic infrastructure? *GLIF, the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
  4. [Chart: number of projects vs. data size]

     Number   Projects                                            Data Size          Infrastructure
     1000's   Individual scientists & small projects              Small              Public infrastructure
     100's    Community-based science via Science as a Service    Medium to Large    Shared community infrastructure
     10's     Very large projects                                 Very Large         Dedicated infrastructure
  5. The long tail of data science: a few large data science projects, and many smaller data science projects.
  6. Part 1. What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope"?
  7. TB? PB? EB? ZB? What is big data?
  8. Another way: think of data as big if you measure it in MW. For example, Facebook's Prineville Data Center is 30 MW.
  9. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
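The scalability property on this slide can be sketched as a simple model. The throughput figure below is an illustrative assumption, not a measurement; the point is only that when each added rack brings processors in proportion to its data, wall-clock time stays flat as the data grows:

```python
# A minimal sketch of the "big-data scalable" property: if each added rack
# brings both more data and proportionally more processors, the wall-clock
# time of a perfectly parallel computation stays constant.
# The throughput number below is an illustrative assumption.

def computation_time(data_tb, racks, tb_per_hour_per_rack=10.0):
    """Wall-clock hours for an embarrassingly parallel scan of data_tb
    terabytes spread evenly over `racks` racks."""
    return data_tb / (tb_per_hour_per_rack * racks)

# One rack scanning 100 TB:
t1 = computation_time(100, 1)
# Ten racks scanning 1000 TB -- ten times the data, same time:
t10 = computation_time(1000, 10)

assert t1 == t10 == 10.0
```

Real systems fall short of this ideal (shuffle traffic and coordination grow with cluster size), but the definition on the slide is exactly this weak-scaling criterion.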
  10. Commercial Cloud Service Provider (CSP): a 15 MW data center.
      - Customer facing portal
      - Monitoring, network security and forensics
      - Accounting and billing
      - Automatic provisioning and infrastructure management
      - Data center network
      - 100,000 servers; 1 PB DRAM; 100's of PB of disk; ~1 Tbps egress bandwidth
      - ~25 operators for a 15 MW commercial cloud
  11. My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure. What would a global integrated facility for datascopes look like?
  12. Some Examples of Big Data Science

      Discipline          Duration     Size             # Devices
      HEP - LHC           10 years     15 PB/year*      One
      Astronomy - LSST    10 years     12 PB/year**     One
      Genomics - NGS      2-4 years    0.5 TB/genome    1000's

      *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://…-en.html
      **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://…-1004.html
  13. One large instrument; many smaller instruments.
  14. Datascope: a Science Cloud Service Provider (Sci CSP) offering Sci CSP services to the data scientist.
  15. What are some of the important differences between commercial and research-focused Sci CSPs?
  16. Science CSP vs. Commercial CSP

                        Science CSP                                Commercial CSP
      POV               Democratize access to data. Integrate     As long as you pay the bill; as long
                        data to make discoveries. Long term       as the business model holds.
                        archive.
      Data & Storage    Data intensive Science Clouds:            Internet style scale out and
                        computing & HP storage                    object-based storage
      Flows             Large data flows in and out               Lots of small web flows
      Streams           Streaming processing required             N/A
      Accounting        Essential                                 Essential
      Lock in           Moving environment between CSPs           Lock in is good
                        essential
  17. Part 2. The Open Cloud Consortium's Open Science Data Cloud
  18. The Open Cloud Consortium:
      • U.S.-based not-for-profit corporation.
      • Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
      • Manages cloud computing testbeds: the Open Cloud Testbed.
  19. OCC Members & Partners
      • Companies: Cisco, Yahoo!, Citrix, …
      • Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, University of Illinois at Chicago, …
      • Federal agencies and labs: NASA, LLNL, ORNL
      • International partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
      • Partners: National Lambda Rail
  20. OCC 2011 Resources

      Resource                Type             Comments
      OSDC Adler & Sullivan   Utility Cloud    1248 cores and 0.4 PB disk
      OCC-Y                   Data Cloud       928 cores and 1.0 PB disk
      OCC-Matsu               Mixed            1 rack
      OSDC Root               Storage          0.8 PB

      • OCC-Adler, Sullivan & Root will more than double in size in 2012.
  21. Bionimbus WG (biological data)
  22. One Million Genomes
      • Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
      • The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
      • One million genomes is about 1000 PB or 1 EB.
      • With compression, it may be about 100 PB.
      • At $1000/genome, the sequencing would cost about $1B.
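The back-of-the-envelope arithmetic on this slide checks out directly (in round decimal units, as the slide uses them; the 10x compression ratio is inferred from the slide's 1 EB → 100 PB figure):

```python
# Back-of-the-envelope check of the "one million genomes" numbers.
# Units are round decimal ones (1 PB = 1000 TB, 1 EB = 1000 PB),
# matching the approximations on the slide.

genomes = 1_000_000
tb_per_genome = 1                # tumor + normal tissue, ~1 TB per patient
total_tb = genomes * tb_per_genome

total_pb = total_tb / 1000       # 1,000,000 TB -> 1000 PB
total_eb = total_pb / 1000       # 1000 PB -> 1 EB
assert total_pb == 1000 and total_eb == 1

compressed_pb = total_pb / 10    # ~10x compression, per the slide's figures
assert compressed_pb == 100

cost_usd = genomes * 1000        # $1000 per genome
assert cost_usd == 1_000_000_000 # about $1B
```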
  23. Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development. Precision diagnosis and treatment. Preventive health care.
  24. Project Matsu WG: Clouds to Support Earth
  25. UDR
      • UDT is a high performance network transport protocol.
      • UDR = rsync + UDT.
      • It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
      • We are using it to distribute c. 1 PB from the OSDC.
  26. OpenFlow-Enabled Hadoop WG
      • When running Hadoop, some map and reduce tasks take significantly longer than others.
      • These are stragglers and can significantly slow down a MapReduce computation.
      • Stragglers are common (a dirty secret about Hadoop).
      • Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
      • We have a testbed for a wide area version of this project.
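Why stragglers matter: a MapReduce stage finishes only when its slowest task does, so a single slow task sets the completion time of the whole stage. A minimal sketch, with made-up task durations, of how giving the straggler extra bandwidth (the idea behind the OpenFlow-enabled approach) shortens the stage:

```python
# A MapReduce stage completes when its LAST task completes, so a single
# straggler dominates the stage time. Task durations are illustrative.

task_hours = [1.0, 1.1, 0.9, 1.0, 4.0]   # the 4.0-hour task is a straggler

stage_time = max(task_hours)
assert stage_time == 4.0                  # the straggler sets the stage time

# Suppose extra network bandwidth (e.g. OpenFlow-steered flows) speeds the
# straggler up 2x; every other task is unchanged.
slowest = max(task_hours)
helped = [t / 2.0 if t == slowest else t for t in task_hours]

assert max(helped) == 2.0                 # stage finishes in half the time
```

Note that halving every non-straggler task instead would not have helped at all, which is why targeting bandwidth specifically at stragglers pays off.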
  27. OSDC PIRE Project. We select OSDC PIRE Fellows (US citizens or permanent residents):
      • We give them tutorials and training on big data science.
      • We provide them fellowships to work with OSDC international partners.
      • We give them preferred access to the OSDC.
      Nominate your favorite scientist as an OSDC PIRE Fellow. (look for PIRE)
  28. Part 3. Cloud Services Operations Centers
  29. Open Science Data Cloud (OSDC): the Science Cloud analog of a commercial CSP.
      - Customer facing portal (Tukey)
      - Monitoring, compliance, & security
      - Accounting and billing
      - Automatic provisioning and infrastructure management
      - Science Cloud SW & Services
      - Data center network
      - 3 PB in 2011; 10 PB in 2012; able to scale to 100 PB?
      - ~100 Gbps bandwidth
      - 5-12 operators to operate a 1-5 MW science cloud
      OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
  30. Cloud Services Operations Centers (CSOC)
      • The OSDC operates a Cloud Services Operations Center (or CSOC).
      • It is a CSOC focused on supporting Science Clouds for researchers.
      • Compare to a Network Operations Center or NOC.
      • Both are an important part of cyberinfrastructure for big data science.
  31. OSDC Racks
      • How quickly can we set up a rack?
      • How efficiently can we operate a rack? (racks/admin)
      2012 OSDC rack design (draft):
      • 950 TB / rack
      • 600 cores / rack
  32. Essential Services for a Science CSP
      • Support for data intensive computing
      • Support for big data flows
      • Account management, authentication and authorization services
      • Health and status monitoring
      • Billing and accounting
      • Ability to rapidly provision infrastructure
      • Security services, logging, event reporting
      • Access to large amounts of public data
      • High performance storage
      • Simple data export and import services
  33. Please Join Us! (Help keep us from making even more mistakes.)
  34. Acknowledgements. Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
      • The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
      • Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
      • NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
      • OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
      • The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
      The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us.
  35. For more information
      • You can find some more information on my blog.
      • Some of my technical papers are also available there.
      • My email address is robert.grossman at uchicago dot edu.