Adapting federated cyberinfrastructure for shared data collection facilities in structural biology


Published on

Early stage experimental data in structural biology is generally unmaintained
and inaccessible to the public. It is increasingly believed that this data, which
forms the basis for each macromolecular structure discovered by this field, must
be archived and, in due course, published. Furthermore, the widespread use of
shared scientific facilities such as synchrotron beamlines complicates the issue of
data storage, access and movement, as does the increase of remote users. This
work describes a prototype system that adapts existing federated cyberinfrastructure
technology and techniques to significantly improve the operational
environment for users and administrators of synchrotron data collection
facilities used in structural biology. This is achieved through software from the
Virtual Data Toolkit and Globus, bringing together federated users and facilities
from the Stanford Synchrotron Radiation Lightsource, the Advanced Photon
Source, the Open Science Grid, the SBGrid Consortium and Harvard Medical
School. The performance and experience with the prototype provide a model for
data management at shared scientific facilities.

Stokes-Rees, Ian, Ian Levesque, Frank V. Murphy, Wei Yang, Ashley Deacon, and Piotr Sliz. “Adapting Federated Cyberinfrastructure for Shared Data Collection Facilities in Structural Biology.” Journal of Synchrotron Radiation 19, no. 3 (April 6, 2012).

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

  1. 1. Adapting  federated   cyberinfrastructure  for   shared  data  collection  facilities  in  structural  biology GlobusWorld  2012Ian  Stokes-­‐Rees  @ijstokes Harvard  Medical  School
  2. 2. J.  Synchrotron  Rad.  (2012)  19 doi:10.1107/S0909049512009776Online  6  April  2012
  3. 3. Structural  Biology: Study  of  Protein  Structure  and  Function 400m 1mm 10nm • Shared  scientiRic  data  collection  facility • Data  intensive  (10-­‐100  GB/day)Grid Overview - Ian Stokes-Rees
  4. 4. Cryo  Electron  Microscopy • Previously,  1-­10,000  images,  managed  by  hand • Now,  robotic  systems  collect  millions  of  hi-­res  images • estimate  250,000  CPU-­hours  to  reconstruct  model • 20-­50  TB  of  data  in  most  extreme  caseGrid Overview - Ian Stokes-Rees
  5. 5. Molecular  Dynamics  Simulations 1  fs  time  step 1ns  snapshot 1  us  simulation 1e6  steps 1000  frames 10  MB  /  frame 10  GB  /  sim 20  CPU-­years 3  months  (wall-­ clock)Grid Overview - Ian Stokes-Rees
  6. 6. SBGrid  Consortium Cornell U. Washington U. School of Med. R. Cerione NE-CAT T. Ellenberger B. Crane R. Oswald D. Fremont S. Ealick C. Parrish Rosalind Franklin NIH M. Jin H. Sondermann D. Harrison M. Mayer A. Ke UMass Medical U. Washington T. Gonen U. Maryland W. Royer E. Toth Brandeis U. UC Davis N. Grigorieff H. Stahlberg Tufts U. K. Heldwein UCSF Columbia U. JJ Miranda Q. Fan Y. Cheng Rockefeller U. Stanford R. MacKinnon A. Brunger Yale U. K. Garcia T. Boggon K. Reinisch T. Jardetzky D. Braddock J. Schlessinger Y. Ha F. Sigworth CalTech E. Lolis F. Zhou P. Bjorkman Harvard and Affiliates W. Clemons N. Beglova A. Leschziner G. Jensen Rice University S. Blacklow K. Miller D. Rees E. Nikonowicz B. Chen A. Rao Y. Shamoo Vanderbilt J. Chou T. Rapoport Y.J. Tao Center for Structural Biology J. Clardy M. Samso WesternU W. Chazin C. Sanders M. Eck P. Sliz M. Swairjo B. Eichman B. Spiller B. Furie T. Springer M. Egli M. Stone R. Gaudet G. Verdine UCSD B. Lacy M. Waterman M. Grant G. Wagner T. Nakagawa M. Ohi S.C. Harrison L. Walensky H. Viadiu Thomas Jefferson J. Hogle S.Walker J. Williams D. Jeruzalmi T.Walz D. Kahne J. Wang Not Pictured: University of Toronto: L. Howell, E. Pai, F. Sicheri; NHRI (Taiwan): G. Liou; Trinity College, Dublin: Amir Khan T. Kirchhausen S. WongGrid Overview - Ian Stokes-Rees
  7. 7. Boston  Life  Sciences  Hub • Biomedical  researchers • Government  agencies • Life  sciences Tufts • Universities Universit y School of Medicin e • HospitalsGrid Overview - Ian Stokes-Rees
  8. 8. Data  Access
  9. 9. Globus  Online:  High  Performance   Reliable  3rd  Party  File  TransferGUMS DN  to  user  mapping CertiRicate  AuthorityVOMS root  of  trust VO  membership portal cluster Globus  Online Rile  transfer  service data collection lab file facility server desktop laptop
  10. 10. User  can  directly  access  lab   or  facility  data  from  laptop "Sco?"+from Harvard general+publicLocal  accounts  within the+Sliz+lab Public  access  available  to  lab  infrastructure ~sue archived  data  through  web   ~andy interface ~sco? Sliz+lab NEBCAT+beamline+at+APS WWW data collec(onShared  (lab  level)  accounts  at  facility /data/sliz /public/2009/ Embargo  policy  to   /stage/sliz /data/murphy /embarg/2010/ hold  deposited  data   /stage/murphy for  agreed  timeTiered  storage /data/deacon /embarg/2011/ 6+month 10+TB+per+group+ 10+PB staging+storage permanent+archive public+archive Tier1 Tier2 Tier3 VO  management VOMS GlobusOnline SBGridSciencePortal
  11. 11. Architecture✦ SBGrid • manages  all  user  account  creation  and  credential  mgmt • hosts  MyProxy,  VOMS,  GridFTP,  and  user  interfaces✦ Facility • knows  about  lab  groups • e.g.  “Harrison”,  “Sliz” • delegates  knowledge  of  group  membership  to  SBGrid  VOMS • facility  can  poll  VOMS  for  list  of  current  members • uses  X.509  for  user  identiRication • deploys  GridFTP  server✦ Lab  group • designates  group  manager  that  adds/removes  individuals • deploys  GridFTP  server  or  Globus  Connect  client✦ Individual • username/password  to  access  facility  and  lab  storage • Globus  Connect  for  personal  GridFTP  server  to  laptop • Globus  Online  web  interface  to  “drive”  transfers
  12. 12. Objective✦ Easy  to  use  high  performance  data  mgmt   environment✦ Fast  and  reliable  Rile  transfer • facility-­‐to-­‐lab,  facility-­‐to-­‐individual,  lab-­‐to-­‐individual✦ Reduced  administrative  overhead✦ Better  data  curation✦ Release  data  to  public  after  embargo  period
  13. 13. Challenges✦ Access  control • visibility • policies✦ Provenance • data  origin • history✦ Meta-­‐data • attributes • searching
  14. 14. Grid  Computing  Today
  15. 15. Capabilities✦ Server-­‐to-­‐server  interaction✦ Federated  aggregation  of  CPU  power✦ Predictable  data  patterns✦ Command-­‐line  access  from  specialized  servers✦ Web-­‐based  access  through  “Science  Gateways” • pre-­‐deRined  computing  workRlows • e.g.  SBGrid  Science  Portal
  16. 16. Weaknesses✦ Federated  user  identity  system • X.509  digital  certiRicates  and  associated  apparatus✦ Ease  of  use  for  end  users✦ Data  management  tools✦ Support  for  collaborations
  17. 17. Opportunities✦ “Last  mile”  challenge • to  the  desktop • to  the  laptop✦ UniRied  identity  management • centralized  set  of  credentials  for  each  person✦ Empower  collaborations  to  self-­‐manage✦ Shift  of  focus  from  “compute”  to  “data” • for  users • for  facilities  where  data  is  the  main  challenge
  18. 18. User  Credentials
  19. 19. X.509  Digital  CertiRicates✦ Analogy  to  a  passport: • Application  form • Sponsor’s  attestation • Consular  services • veriRication  of  application,  sponsor,  and  accompanying   identiRication  and  eligibility  documents • Passport  issuing  ofRice✦ Portable,  digital  passport • Rixed  and  secure  user  identiRiers • name,  email,  home  institution • signed  by  widely  trusted  issuer • time  limited • ISO  standard
  20. 20. U1 U1 U1Addressing  CertiRicate  Problems /. -..)"*& 012*%2! 3%"! )"*"!4&" ,"!&5":14(! !"#$"%&%()*"+,"!& U1 !"&$!*&!4,5(*)*$67"! *289:4)"*&% !";("<!"#$"%& R1time ;"!(9:$%"!"=()(7(=(&: ,2*>!6"=()(7(=(&: S1 411!2;","!& R2 %()*,"!& *289:4;4(=47(=(&: !"&!(";","!& U2a "?12!&%()*"+ ,"!&5":14(!
  21. 21. VO  (Group)  Membership   Registration !")*# !"#$%&(# *+,(-,.# /-0.# +.0-0(:#50.:#:,#.0?>0-:#&0&90.-<+#93#@A# .0?>0-:#!"#8.,>+-#4(%#.,70-# U2b (,123#4%&(# V1 ;0.23#>-0.#07897:3# 5,(6.&#07897:3# S2time 4++.,;0#&0&90.-<+=# 8.,>+-=#4(%#.,70-# 4%%#@A# V2 :,#!")*# (,123# .0?>0-:#!")*#$B# .0:>.(#!")*#$B# 4%%#$B#:,# +.,C3#50.:#
  22. 22. () AB)! !"#$%& *+",-"# .-/# ;< #/>:/-$+"#$%&%66":,$ =3!#"I3 ;<=* )@81, 0/#176%?",/8%1&-/,$ /8%1&0/#17/@ U1 4/,/#%$/ !"#$%&(%)*+% #/>:/-$-14,/@6/#$ 6/#$9/3+%1# ,,#""%+#0% 1*$/2% #/$:#,$#%691,4,:85/# ,"?23%4/,$- 0/#123/&14151&1$3 A1a 6#/%$/ 6",7#8/&14151&1$3 &"6%&%66$ S1* %++#"0/6/#$time -14,6/#$ ,"?23%0%1&%51&1$3 A1b -/$#/$#1/0%&-/#1%&,:85/# -/$;<#14C$- %66":,$#/%@3,"?76%?", +"#$%&&"41, U2* #/>:/-$-14,/@6/#?76%$/ #/$:#,-14,/@6/#?76%$/ DE+%1#-14,/@6/#$ 1,$"!F(*GDH7&/ #/41-$/#+#"I36/#$ HE6#/%$/&"6%& J1$C=3!#"I3 !"#$%&(%)*+% +#"I36/#$ ,,#""%-#.#$/#.% $#"*!$,#"%
  23. 23. Process  and  Design  Improvements ✦ Single  web-­‐form  application • includes  e-­‐mail  veriRicationn ✦ Centralized  and  connected  credential  management • FreeIPA  LDAP  -­‐  user  directory  and  credential  store • VOMS  -­‐  lab,  institution,  and  collaboration  afRiliations • MyProxy  -­‐  X.509  credential  store ✦ Overlap  administrative  roles • system  admin • registration  agent  for  certiRicate  authority  (approve  X.509   request) • VO  administrator  to  register  group  afRiliations ✦ Automation
  24. 24. “Last  Mile”  and Ease  of  Use
  25. 25. Grid  computing  at  your  desk✦ Platforms • Windows • OS  X✦ Web  interfaces • SBGrid  Science  Portal✦ One-­‐stop-­‐shop  for  account  management • Centralized  services • Automated  processes • Administrator  intervention • Support
  26. 26. Ryan,  a  postdoc  in  the   Frank  Lab  at  Columbia Access  NRAMM  facilities   securely  and  transfer  data   back  to  home  institute automated  X.509 check  SBGrid  for   application group   Ryan’s   membership /data/columbia/frank facility file server transfer  data  to  lab veriRication  in  Frank  Lab,  so   of   grant  access  to  Riles lab  membership SBGrid Ryan  initiate  tfransfer  at   applies   or  an   request  access Science account  at  the  SBGrid   NRAMM to  NRAMM Portal Science  Portal facility using  credential   notify  user  of   held  by  SBGrid lab file desktop completion serverautomated   /nfs/data/rsmithGlobus  Online  application use  Globus  Online   to  manage transfer  from   /Users/Ryan laptop NRAMM  back  to  lab
  27. 27. Architecture  Diagrams
  28. 28. Service  Architecture GlobusOnline UC San Diego @Argonne GUMSUser GUMS GridFTP + glideinWMS data Hadoop factory Open Science Grid computations MyProxy @NCSA, UIUC monitoring interfaces data computation ID mgmt Ganglia scp Condor FreeIPA Apache DOEGrids CA Nagios GridFTP Cycle Server @Lawrence GridSite LDAP RSV SRM VDT Berkley Labs Django VOMS Globus pacct WebDAV Sage Math GUMS glideinWMS Gratia Accting R-Studio GACL @FermiLab file SQL shell CLI server DB cluster Monitoring SBGrid Science Portal @ Harvard Medical School @Indiana
  29. 29. Grid Overview - Ian Stokes-Rees
  30. 30. Acknowledgements✦ Piotr  Sliz✦ SBGrid  Science  Portal • Daniel  O’Donovan,  Meghan  Porter-­‐Mahoney✦ SBGrid  System  Administrators • Ian  Levesque,  Peter  Doherty,  Steve  Jahl✦ Facility  Collaborators • Frank  Murphy  (NE-­‐CAT/APS) • Ashley  Deacon  (JCSG/SLAC)✦ Globus  Online  Team • Steve  Tueke,  Ian  Foster,  Rachana   Ananthakrishnan,  Raj  Kettimuthu  ✦ Ruth  Pordes • Director  of  OSG,  for  championing  SBGrid