Science as a Service: How On-Demand Computing can Accelerate Discovery

My talk at ScienceCloud 2013 in NYC. Thanks to the organizers for the invitation to talk.

A bit of new material relative to previous talks posted, e.g., on Globus Genomics.

Notes
  • For example in genomics
  • 10,000 80% of awards and 50% of grant $$ are < $350K
  • Concern:
  • Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
  • We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data … and not just the 2 GB of PowerPoints … or the 100 GB of family photos and videos … but the petabytes and exabytes of data that will soon be the norm for many.
  • So how would such a drop box for science be used? Let’s look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user. So the data is typically moved from a staging area to some type of ingest store. Etcetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
  • We started with the seemingly simple, mundane task of transferring files … etc.
  • And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
  • This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF).  The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.   
  • Assuming 1 core can handle all tasks, and a basis of 5GB of memory per analysis
  • http://datasets.globus.org/carl-catalog/query/propertyA=value1
  • Questions remain:
      – What capabilities? Where does time go?
      – How do we turn them into usable solutions?
      – How do we scale from thousands to millions?
      – How do we incentivize contributions? Long tail.
  • Science as a Service: How On-Demand Computing can Accelerate Discovery

    1. Science as a Service: How On-Demand Computing Can Accelerate Discovery. Ian Foster, foster@anl.gov. computationinstitute.org
    2. A time of disruptive change, as evidenced by cost per human genome
    3. A time of disruptive change, as evidenced by cost per human genome
    4. But most labs have extremely limited resources. Heidorn: NSF grants in 2007: grants < $350,000 are 80% of awards and 50% of grant $$. [Chart; axis ticks: $1,000, $10,000, $100,000, $1,000,000 and 2000, 4000, 6000, 8000]
    5. Automation is required to apply more sophisticated methods to far more data
    6. Automation is required to apply more sophisticated methods to far more data. Outsourcing is needed to achieve economies of scale in the use of automated methods
    7. Building a discovery cloud
        • Identify time-consuming activities amenable to automation and outsourcing
        • Implement as high-quality, low-touch SaaS
        • Leverage IaaS for reliability, economies of scale
        • Extract common elements as research automation platform
        Bonus question: Sustainability
        [Diagram labels: Software as a service; Platform as a service; Infrastructure as a service]
    8. Where does time go in research? 42%!!
    9. We aspire (initially) to create a great user experience for research data management. What would a “dropbox for science” look like?
    10. Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, Archive BIG DATA
    11. It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it’s often very challenging
        [Diagram labels, shown twice: Registry; Staging Store; Ingest Store; Analysis Store; Community Store; Archive; Mirror]
        [Error callouts: Quota exceeded! Expired credentials! Network failed. Retry. Permission denied!]
    12. Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, Archive BIG DATA
    13. Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, Archive. Collect, Move, Sync, Share: capabilities delivered using Software-as-a-Service (SaaS) model
    14. Data Source, Data Destination
        1. User initiates transfer request
        2. Globus Online moves/syncs files
        3. Globus Online notifies user
    15. Data Source
        1. User A selects file(s) to share; selects user/group, sets share permissions
        2. Globus Online tracks shared files; no need to move files to cloud storage!
        3. User B logs in to Globus Online and accesses shared file
    16. Extreme ease of use
        • InCommon, OAuth, OpenID, X.509, …
        • Credential management
        • Group definition and management
        • Transfer management and optimization
        • Reliability via transfer retries
        • Web interface, REST API, command line
        • One-click “Globus Connect” install
        • 5-minute Globus Connect Multi User install
    17. Early adoption is encouraging
    18. Early adoption is encouraging
        • 10,000 registered users; >100 daily
        • ~18 PB moved; ~1B files
        • 10x (or better) performance vs. scp
        • 99.9% availability
        • Entirely hosted on Amazon
    19. [Chart] Duration of runs, in seconds, over time (2011 to 2012). Red: >10 TB transfer; green: >1 TB transfer. Duration axis ticks from 1e-01 to 1e+07, with reference marks at 1 second, 1 minute, 1 hour, 1 day, 1 week.
    20. K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
    21. B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC
    22. Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
    23. Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL. Credit: Kerstin Kleese-van Dam
    24. Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, Archive BIG DATA
    25. Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, Archive BIG DATA
    26. Globus Online already does a lot
        [Diagram labels: Globus Toolkit; Sharing Service; Transfer Service; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]
    27. A platform for integration
    28. A platform for integration
    29. A platform for integration
    30. Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists. globus genomics
    31. Ravi Madduri, Bo Liu, Paul Davé, et al.
    32. RNA-Seq pipeline. Ravi Madduri, Bo Liu, Paul Davé, et al.
    33. Amazon pricing for Diffusion Tensor Imaging pipeline. Credit: Kyle Chard
        [Chart: Cost per Subject ($), axis 0 to 0.5, for instance types m1.large, m1.xlarge, m3.xlarge, m3.2xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge; series: On-Demand, Spot (Low), Spot (High)]
    34. We are adding capabilities
        [Diagram labels: Globus Toolkit; Sharing Service; Transfer Service; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]
    35. We are adding capabilities
        [Diagram labels: Globus Toolkit; Sharing Service; Transfer Service; Dataset Services; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]
    36. We are adding capabilities
        • Ingest and publication
            – Imagine a DropBox that not only replicates, but also extracts metadata, catalogs, converts
        • Cataloging
            – Virtual views of data based on user-defined and/or automatically extracted metadata
        • Computation
            – Associate computational procedures, orchestrate application, catalog results, record provenance
    37. Looking deeply at how researchers use data
        • A single research question often requires the integration of many data elements, that are:
            – In different locations
            – In different formats (Excel, text, CDF, HDF, …)
            – Described in different ways
        • Best grouping can vary during investigation
            – Longitudinal, vertical, cross-cutting
        • But always needs to be operated on as a unit
            – Share, annotate, process, copy, archive, …
    38. How do we manage data today?
        • Often, a curious mix of ad hoc methods
            – Organize in directories using file and directory naming conventions
            – Capture status in README files, spreadsheets, notebooks
        • Time-consuming, complex, error prone
        Why can’t we manage our data like we manage our pictures and music?
    39. Introducing the dataset
        • Group data based on use, not location
            – Logical grouping to organize, reorganize, search, and describe usage
        • Tag with characteristics that reflect content …
            – Capture as much existing information as we can
        • … or to reflect current status in investigation
            – Stage of processing, provenance, validation, …
        • Share data sets for collaboration
            – Control access to data and metadata
        • Operate on datasets as units
            – Copy, export, analyze, tag, archive, …
    40. Builds on catalog as a service
        Approach
        • Hosted user-defined catalogs
        • Based on tag model <subject, name, value>
        • Optional schema constraints
        • Integrated with other Globus services
        Three REST APIs
        • /query/: Retrieve subjects
        • /tags/: Create, delete, retrieve tags
        • /tagdef/: Create, delete, retrieve tag definitions
        Builds on USC Tagfiler project (C. Kesselman et al.)
    41. Provide more capability for more people at lower cost by building a “Discovery Cloud”. Delivering “Science as a service”. Our vision for a 21st century discovery infrastructure
    42. It’s a time of great opportunity … to develop and apply Science aaS
        [Diagram labels: Globus Nexus (Identity, Group, Profile); …; Sharing Service; Transfer Service; Dataset Services; Globus Toolkit; Globus Online APIs; Globus Connect]
    43. Thanks to great colleagues and collaborators
        • Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
        • Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
        • Francesco de Carlo, Chris Jacobsen, and others at Argonne
    44. Thank you to our sponsors!
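
Slide 40 describes a catalog built on a <subject, name, value> tag model, with optional schema constraints and three REST APIs (/query/, /tags/, /tagdef/). As a rough illustration of that model only, here is a minimal in-memory sketch in Python. The class name `TagCatalog`, its method names, and the sample subjects and tags are all hypothetical, and this is not the Globus catalog API: the slides do not give request formats, so each method is only a loose analogue of one of the three endpoints.

```python
class TagCatalog:
    """Toy in-memory catalog of <subject, name, value> tags (illustrative only)."""

    def __init__(self):
        self.tagdefs = {}  # tag name -> Python type that values must have
        self.tags = {}     # subject -> {tag name: value}

    def define_tag(self, name, value_type=str):
        """Loose analogue of /tagdef/: register a tag definition."""
        self.tagdefs[name] = value_type

    def set_tag(self, subject, name, value):
        """Loose analogue of /tags/: attach a tag, enforcing the schema constraint."""
        if name not in self.tagdefs:
            raise KeyError(f"undefined tag: {name}")
        if not isinstance(value, self.tagdefs[name]):
            raise TypeError(f"{name} expects {self.tagdefs[name].__name__}")
        self.tags.setdefault(subject, {})[name] = value

    def delete_tag(self, subject, name):
        """Remove a tag from a subject; a missing tag is ignored."""
        self.tags.get(subject, {}).pop(name, None)

    def query(self, **criteria):
        """Loose analogue of /query/: subjects whose tags match every name=value pair."""
        return sorted(
            subject
            for subject, tags in self.tags.items()
            if all(name in tags and tags[name] == value
                   for name, value in criteria.items())
        )


# Hypothetical subjects and tags, chosen to echo "stage of processing" on slide 39.
catalog = TagCatalog()
catalog.define_tag("stage")
catalog.define_tag("resolution_um", float)
catalog.set_tag("scan-001", "stage", "raw")
catalog.set_tag("scan-002", "stage", "reconstructed")
catalog.set_tag("scan-002", "resolution_um", 1.3)
print(catalog.query(stage="reconstructed"))
```

The keyword-argument query mirrors the property=value form of the example URL in the notes; a hosted service would presumably add access control and persistence, which this sketch leaves out.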
