What Are Science Clouds?


This is a talk I gave at Data Cloud 2013 on November 17, 2013, titled "What is So Special About Science Clouds and Why Does It Matter?"

  1. 1. What is So Special About Science Clouds and Why Does It Matter? November 17, 2013. Robert L. Grossman, University of Chicago, Open Data Group, Open Cloud Consortium
  2. 2. Part 1: Clouds
  3. 3. In 2011, after several years and 15 drafts, NIST developed a definition of a cloud that is now the standard definition.
  4. 4. Essential Characteristics of a Cloud: 1. Self Service 2. Scale
  5. 5. Self Service
  6. 6. Scale
  7. 7. Cloud Deployment Models • Public Clouds – Vendors offering cloud services, such as Amazon. • Private Clouds – Run internally by a company or organization, such as the University of Chicago. • Community Clouds – Run by a community of organizations (either formally or informally), such as the Open Cloud Consortium.
  8. 8. How do you measure compute capacity for science clouds? TB? PB? EB? 100’s? 1,000’s? 10,000’s?
  9. 9. Another way: opencompute.org. Think of science clouds as large if you measure them in MW; for example, Facebook’s Prineville Data Center is 30 MW.
  10. 10. What about automatic provisioning and infrastructure management?
  11. 11. This is not a cloud.
  12. 12. This is a cloud.
  13. 13. Commercial Cloud Service Provider (CSP), 15 MW Data Center. Components: monitoring, network security and forensics; automatic provisioning and infrastructure management; accounting and billing; customer facing portal; data center network. 100,000 servers, 1 PB DRAM, 100’s of PB of disk, ~1 Tbps egress bandwidth. 25 operators for 15 MW Commercial Cloud.
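The headline figures on this slide can be turned into per-server numbers with simple division. The sketch below is only back-of-the-envelope arithmetic on the slide's values; it assumes decimal units (1 PB = 10^15 bytes) and treats the 15 MW as total facility power, neither of which the slide states. The function and constant names are illustrative.

```python
# Figures quoted on the slide; decimal units are an assumption.
SERVERS = 100_000
DRAM_BYTES = 1e15        # 1 PB of DRAM
FACILITY_WATTS = 15e6    # 15 MW data center (assumed to be total facility power)
OPERATORS = 25


def per_server_figures():
    """Return (DRAM in GB per server, facility watts per server, servers per operator)."""
    dram_gb = DRAM_BYTES / SERVERS / 1e9
    watts = FACILITY_WATTS / SERVERS
    servers_per_operator = SERVERS / OPERATORS
    return dram_gb, watts, servers_per_operator


if __name__ == "__main__":
    dram_gb, watts, per_op = per_server_figures()
    print(f"{dram_gb:.0f} GB DRAM per server")
    print(f"{watts:.0f} W of facility power per server")
    print(f"{per_op:.0f} servers per operator")
```

Under these assumptions each operator is responsible for thousands of servers, which is only plausible with the automatic provisioning and infrastructure management the slide lists.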
  14. 14. Requirement of a cloud computing infrastructure. Rack / Container Test: The addition of racks / containers of cores and disks is automated and does not require changing the software stack, but afterwards the capacity of the system has increased.
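One way to read the Rack / Container Test is as an invariant: after a rack is added, the software stack is unchanged and capacity is strictly larger. The following is a minimal sketch of that reading, not the talk's implementation. The `Rack` and `Cloud` classes and `passes_rack_test` are hypothetical names, and the example rack uses the 1150 cores and 1 PB per rack quoted later in the talk for the 2013 OSDC rack design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rack:
    cores: int
    disk_pb: float


@dataclass
class Cloud:
    stack_version: str                      # identifies the deployed software stack
    racks: list = field(default_factory=list)

    def add_rack(self, rack: Rack) -> None:
        # Stand-in for the automated step: only the inventory changes.
        self.racks.append(rack)

    @property
    def cores(self) -> int:
        return sum(r.cores for r in self.racks)

    @property
    def disk_pb(self) -> float:
        return sum(r.disk_pb for r in self.racks)


def passes_rack_test(cloud: Cloud, rack: Rack) -> bool:
    """Add a rack, then check: same software stack, more cores, more disk."""
    before = (cloud.stack_version, cloud.cores, cloud.disk_pb)
    cloud.add_rack(rack)
    return (cloud.stack_version == before[0]
            and cloud.cores > before[1]
            and cloud.disk_pb > before[2])


if __name__ == "__main__":
    cloud = Cloud(stack_version="v1")
    print(passes_rack_test(cloud, Rack(cores=1150, disk_pb=1.0)))
```

In a real deployment the check would compare configuration-management state rather than a version string; the string is used here only to keep the sketch self-contained.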
  15. 15. • For many organizations, system administrators are just performing a service. • It’s considered a good practice to outsource the service to the lowest cost provider. • At good cloud service providers, development and operations are integrated (devops). • SRE/Devops are considered key personnel.
  16. 16. Latency is Difficult
  17. 17. Essential Characteristics of a Cloud: 1. Self Service 2. Scale 3. Infrastructure management and automation 4. Focus on devops
  18. 18. Part 2: Science Clouds
  19. 19. Some Examples of the Sizes of Datasets Produced by Instruments
      Discipline | Duration | Size | # Devices
      HEP - LHC | 10 years | 15 PB/year* | One
      Astronomy - LSST | 10 years | 12 PB/year** | One
      Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000’s
      N.B. This is just the data produced by the instrument itself. The analysis of this data produces significantly more data.
      *At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
      **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
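To relate these yearly volumes to network capacity, it helps to convert them to a sustained data rate. The sketch below does that conversion for the LHC and LSST figures and counts how many 0.5 TB genomes make up a petabyte. It assumes decimal units and a 365-day year; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_gbps(petabytes_per_year: float) -> float:
    """Average rate in Gbit/s needed to move a yearly volume continuously."""
    bits = petabytes_per_year * 1e15 * 8
    return bits / SECONDS_PER_YEAR / 1e9


def genomes_per_petabyte(tb_per_genome: float = 0.5) -> float:
    """How many genomes of the given size fit in 1 PB (1 PB = 1000 TB)."""
    return 1000.0 / tb_per_genome


if __name__ == "__main__":
    for name, pb in [("LHC", 15), ("LSST", 12)]:
        print(f"{name}: {pb} PB/year is about {sustained_gbps(pb):.2f} Gbit/s sustained")
    print(f"{genomes_per_petabyte():.0f} genomes per PB at 0.5 TB/genome")
```

Both instruments work out to a few Gbit/s averaged over a year, before counting the derived data that analysis produces.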
  20. 20. Science Cloud Service Provider (Sci CSP); Sci CSP services; Data scientist
  21. 21. What are some of the important differences between commercial and research-focused Sci CSPs?
  22. 22. Amazon Web Services (AWS) vs. community clouds, science clouds, etc.?
      Amazon Web Services (AWS): • Scale • Simplicity of a credit card • Wide variety of offerings.
      Community clouds, science clouds, etc.: • Lower cost (at medium & large scale) • Some data too important to be stored exclusively in commercial cloud • Computing over scientific data is a core competency • Can support any required governance / security model
      It is essential that community science clouds interoperate with public clouds.
  23. 23. Aspect | Science Clouds | Commercial Clouds
      POV | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds.
      Data & Storage | In addition, data intensive computing & HP storage | Internet style scale out and object-based storage
      Flows | Large & small data flows | Lots of small web flows
      Accounting | Essential | Essential
      Lock in | Moving environment between CSPs essential | Lock in is good
      Interop | Critical, but difficult | Customers will drive to some degree
  24. 24. Essential Services for a Science CSP • Support for data intensive computing • Support for big data flows • Account management, authentication and authorization services • Health and status monitoring • Billing and accounting • Ability to rapidly provision infrastructure • Security services, logging, event reporting • Access to large amounts of public data • High performance storage • Simple data export and import services
  25. 25. Datascope – Science Cloud Service Provider (Sci CSP); Sci CSP services; Data scientist; Cloud Service Operations Center (CSOC)
  26. 26. Part 3: Open Science Data Cloud
  27. 27. Number of projects | Project type | Data size | Infrastructure
      1000’s | Individual scientists & small projects | Small | Public infrastructure
      100’s | Community based science via Science as a Service | Medium to Large | Shared community infrastructure
      10’s | Very large projects | Very Large | Dedicated infrastructure
  28. 28. The long tail of data science: A few large data science projects. Many smaller data science projects.
  29. 29. Commercial Cloud Service Provider (CSP), 15 MW Data Center. Components: monitoring, network security and forensics; automatic provisioning and infrastructure management; accounting and billing; customer facing portal; data center network. 100,000 servers, 1 PB DRAM, 100’s of PB of disk, ~1 Tbps egress bandwidth. 25 operators for 15 MW Commercial Cloud.
  30. 30. Open Science Data Cloud. Components: compliance & security (OCM); infrastructure automation & management (Yates); accounting & billing (Salesforce.com); customer facing portal (Tukey); science cloud SW & services; cores & disks (OpenStack, GlusterFS & Hadoop); data center network; ~10-100 Gbps bandwidth. 6 engineers to operate 0.5 MW Science Cloud. • Virtual Machine (VM) containing common applications & pipelines • Tukey (OSDC portal & middleware v0.2) • Yates (infrastructure automation and management v0.1) • UDR / UDT for high performance data transport • Interoperate with other clouds (upcoming) and proprietary systems (such as Globus Online).
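The staffing figures on the commercial CSP slide (25 operators for 15 MW) and on this one (6 engineers for 0.5 MW) can be compared as staff per megawatt. This is a rough sketch of that comparison only; the two facilities differ in scale, workload, and what "operator" and "engineer" mean, so the ratio is indicative at best. The function name is illustrative.

```python
def staff_per_mw(staff: int, megawatts: float) -> float:
    """Operations staff per megawatt of facility power."""
    if megawatts <= 0:
        raise ValueError("megawatts must be positive")
    return staff / megawatts


COMMERCIAL = staff_per_mw(25, 15.0)   # commercial CSP slide
SCIENCE = staff_per_mw(6, 0.5)        # Open Science Data Cloud slide
RATIO = SCIENCE / COMMERCIAL

if __name__ == "__main__":
    print(f"Commercial CSP: {COMMERCIAL:.2f} staff/MW")
    print(f"Science cloud:  {SCIENCE:.2f} staff/MW")
    print(f"Ratio:          {RATIO:.1f}x")
```

By this crude measure the science cloud needs several times more staff per megawatt, which is the gap that infrastructure automation is meant to narrow.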
  31. 31. The Open Science Data Cloud (OSDC) is a production 5 PB*, 7500 core, wide area 10G cloud. *10 PB raw storage. www.opensciencedatacloud.org
  32. 32. • U.S. based not-for-profit corporation. • Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud. • Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud. • Manages cloud computing testbeds: Open Cloud Testbed. www.opencloudconsortium.org
  33. 33. • Companies: Cisco, Yahoo!, Infoblox, … • Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, LLNL, University of Illinois at Chicago, … • Federal agencies and labs: NASA, LLNL, … • International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, … www.opencloudconsortium.org
  34. 34. Science Cloud: • Earth sciences • Biological sciences • Social sciences • Digital humanities • ACL, groups, etc. Biomedical Cloud: Designed to hold Protected Health Information (PHI), e.g. genomic data, electronic medical records, etc. (HIPAA, FISMA)
  35. 35. What You Get with the OSDC • Login with your university credentials via InCommon • Launch virtual machines, virtual clusters, access to large Hadoop clusters, etc. • Access PB+ of open and protected data • Manage files, collections of files, collections of collections • Manage users, groups of users • Manage accounts, sub-accounts • Efficient transfer of large data (UDT, UDR)
  36. 36. Our Point of View • We want to develop as little technology and software as possible – we want others to develop software and technology. • We focus on providing researchers the ability to compute over large and very large datasets. • We need open source solutions. • We can interoperate with proprietary solutions. • We are working to make interoperation with AWS seamless. • Run lights out over multiple data centers connected with 10G (soon 100G) networks.
  37. 37. OSDC Cloud Services Operations Center (CSOC) • The OSDC operates a Cloud Services Operations Center (or CSOC). • It is a CSOC focused on supporting Science Clouds for researchers.
  38. 38. OSDC Racks. 2013 OSDC rack design: • 1 PB / rack • 1150 cores / rack • How quickly can we set up a rack? • How efficiently can we operate a rack? (racks/admin) • How few changes do our software stack and operations require when we add new racks?
  39. 39. Tukey • Tukey (based in part on Horizon). • We have factored out digital ID service, file sharing, and transport from the Bionimbus and Matsu Projects.
  40. 40. Yates • Automated installation of the OSDC software stack on a rack of computers. • Based upon Chef • Version 0.1
  41. 41. UDR • UDT is a high performance network transport protocol • UDR = rsync + UDT • It is easy for an average systems administrator to keep 100’s of TB of distributed data synchronized. • We are using it to distribute c. 1 PB from the OSDC
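Because UDR wraps rsync, an administrator's existing rsync invocation carries over largely unchanged. The sketch below only assembles such a command line and does not run it. It assumes the invocation form `udr [udr options] rsync [rsync options] src dest` as described in the UDR project's README; verify that against the installed version. The host name and paths are placeholders, and `build_udr_command` is a hypothetical helper.

```python
import shlex


def build_udr_command(src, dest, rsync_options=("-av",), udr_options=()):
    """Assemble `udr [udr options] rsync [rsync options] src dest` as an argv list.

    The argument order is an assumption taken from the UDR README.
    """
    return ["udr", *udr_options, "rsync", *rsync_options, src, dest]


if __name__ == "__main__":
    # Placeholder source path and destination host.
    cmd = build_udr_command("/data/project/", "user@cloud.example.org:/data/project/")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where both ends have UDR installed.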
  42. 42. Bionimbus Protected Data Cloud
  43. 43. Analyzing Data From The Cancer Genome Atlas (TCGA)
      Current Practice: 1. Apply to dbGaP for access to data. 2. Hire staff, set up and operate secure compliant computing environment to manage 10 – 100+ TB of data. 3. Get environment approved by your research center. 4. Set up analysis pipelines. 5. Download data from CG-Hub (takes days to weeks). 6. Begin analysis.
      With Protected Data Cloud (PDC): 1. Apply to dbGaP for access to data. 2. Use your existing NIH grant eRA credentials to login to the PDC, select the data that you want to analyze, and the pipelines that you want to use. 3. Begin analysis.
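The "days to weeks" download step follows directly from data volume and effective throughput. Here is a small sketch of that arithmetic for the 10 to 100+ TB range mentioned above. The link speeds and the efficiency factor are illustrative assumptions, not figures from the talk, and decimal units are assumed.

```python
def transfer_days(terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `terabytes` over a link of `link_gbps` Gbit/s.

    `efficiency` is the fraction of the nominal link rate actually achieved
    (an assumed parameter; real transfers rarely reach 1.0).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400


if __name__ == "__main__":
    for tb in (10, 100):
        for gbps in (1, 10):
            days = transfer_days(tb, gbps, efficiency=0.3)
            print(f"{tb} TB over {gbps} Gbit/s at 30% efficiency: {days:.1f} days")
```

Even at the full nominal rate, 100 TB over a 1 Gbit/s link takes more than nine days, which is why moving the computation to the data removes the slowest step.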
  44. 44. OCC Project Matsu: Clouds to Support Earth Science. matsu.opensciencedatacloud.org
  45. 45. Biomedical Community Cloud. Participants: Medical Research Center A, Medical Research Center B, Medical Research Center C, Hospital D, Company E. Clouds: Cloud for Public Data; Cloud for Controlled Genomic Data; Cloud for EMR, PHI data. Example: Open Cloud Consortium’s Biomedical Commons Cloud (BCC)
  46. 46. Part 4: Cloud Condos
  47. 47. Cyber Condo Model • Research institutions today have access to high performance networks – 10G & 100G. • They couldn’t afford access to these networks from commercial providers. • Over a decade ago, they got together to buy and light fiber. • This changed how we do scientific research.
  48. 48. Cloud Condos • The Open Cloud Consortium’s Burnham Facility (in planning) is a Cloud Condo model. • This infrastructure provides a sustainable home for large commons of research data (and an infrastructure to compute over it). • Please join us.
  49. 49. Some Data Commons Guidelines for the Next Five Years • There is a societal benefit when research data is available in data commons operated by a NFP (vs sold exclusively as data products by commercial entities or only offered for download by the USG). • Large data commons providers should peer. • Data commons providers should develop standards for interoperating. • Standards should not be developed ahead of open source reference implementations. • We need a period of experimentation as we develop the best technology and practices. • The details are hard (consent, publication, IDs, open vs controlled access, sustainability, etc.)
  50. 50. Working with the OSDC - CSP • If you have a cloud, please interoperate it with the OSDC. • Work with us to design and prototype standards so that Science Clouds and Science Data Commons can interoperate. – Data synchronization between two clouds – APIs to access data – RESTful queries – Scattering queries, gathering the results – Coordinated analysis
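"Scattering queries, gathering the results" can be illustrated with a small scatter-gather sketch: the same query is sent to every participating cloud in parallel and the partial answers are merged. This is not an OSDC API. The in-memory `CLOUDS` catalog, the record fields, and the function names are all hypothetical stand-ins for real remote endpoints.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-cloud catalogs standing in for remote data commons.
CLOUDS = {
    "cloud_a": [{"id": "s1", "tissue": "lung"}, {"id": "s2", "tissue": "breast"}],
    "cloud_b": [{"id": "s3", "tissue": "lung"}],
}


def query_cloud(name: str, tissue: str) -> list:
    """Answer the query against a single cloud's catalog."""
    return [rec["id"] for rec in CLOUDS[name] if rec["tissue"] == tissue]


def scatter_gather(tissue: str) -> list:
    """Scatter the query to every cloud in parallel, then gather and merge."""
    names = sorted(CLOUDS)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        partials = pool.map(lambda n: query_cloud(n, tissue), names)
        return sorted(sample for part in partials for sample in part)


if __name__ == "__main__":
    print(scatter_gather("lung"))
```

With real clouds, `query_cloud` would issue a network request, so the gather step would also need timeouts and a policy for clouds that fail to answer.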
  51. 51. OSDC Software Ecosystem. Participants: CSP A, Medical Research Center B, Hospital D, University E, Startup F, Startup G. Software and services: Hadoop, AWS, Tukey, Bionimbus, GlusterFS, OpenStack, R, Globus Online, UDT
  52. 52. Working with the OSDC - Researchers • Apply for an account and make a discovery • Add data to the OSDC • Add your software to the OSDC • Suggest someone else’s data to add • Suggest someone else’s software to add
  53. 53. Data Commons. Participants: CSP A, Medical Research Center B, Hospital D, University E, Startup F, Startup G. Data: TCGA, EO1, social sciences data, 1000 Genomes, census, urban sciences data, EMR, Bookworm, earth cube data
  54. 54. Questions?
  55. 55. Thank You!
  56. 56. For more information • @bobgrossman • You can find more information on my blog: rgrossman.com. • You can find more of my talks on: slideshare.net/rgrossman. Center for Research Informatics
  57. 57. Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors: • The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E. • The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011. • Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks. • The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE – 1129076) to train scientists to use the OSDC and to further develop the underlying technology. • OSDC technology for high performance data transport is supported in part by NSF Award 1127316. • The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher. • Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research. The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
  58. 58. Please join us! www.opensciencedatacloud.org www.opencloudconsortium.org