Accelerating data-intensive science by outsourcing the mundane
Ian Foster
Alfred North Whitehead (1911): Civilization advances by extending the number of important operations which we can perform without thinking about them
J.C.R. Licklider reflects on thinking (1960): About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
For example … (Licklider again): At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.
Research hasn’t changed much in 300 years
  • Cycle diagram labels: Analyze data, Collect data, Publish results, Identify patterns, Design experiment, Pose question, Test hypotheses, Hypothesize explanation
Discovery 1960: Data collection dominates
  • Janet Rowley: chromosome translocations and cancer
Discovery 2010: Data overflows
  • 800,000,000,000 bases/day
  • 30,000,000,000,000 bases/year
Meanwhile, we drown in administrivia
  • 42%!!
  • The Federal Demonstration Partnership’s faculty burden survey
You can run a company from a coffee shop
Varieties of “* as a service” (*aaS)
  • Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
  • Platform: Google, Microsoft, Amazon, …
  • Infrastructure: Amazon, GoGrid, Microsoft, Flexiscale, …
Perform important tasks without thinking
  • Web presence
  • Email (hosted Exchange)
  • Calendar
  • Telephony (hosted VOIP)
  • Human resources and payroll
  • Accounting
  • Customer relationship mgmt
  • Data analytics
  • Content distribution
  • Slide labels: SaaS, IaaS
What about small and medium labs?
Research IT is a growing burden
  • Big projects can build sophisticated solutions to IT problems
  • Small labs and collaborations have problems with both
  • They need solutions, not toolkits: ideally, outsourced solutions
Medium science: Dark Energy Survey
  • Blanco 4m on Cerro Tololo (Image credit: Roger Smith/NOAO/AURA/NSF)
  • Every night, they receive 100,000 files in Illinois
  • They transmit these files to Texas for analysis (35 msec latency)
  • Then move the results back to Illinois
  • This whole process must run reliably & routinely
Open transfer sockets vs. time [Image: Don Petravick, NCSA]
A new approach to research IT
  • Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
  • Leverage software-as-a-service (SaaS) to provide millions of researchers with unprecedented access to powerful research tools, and enable a massive shortening of cycle times in time-consuming research processes
Time-consuming tasks in science
  • Run experiments
  • Collect data
  • Manage data
  • Move data
  • Acquire computers
  • Analyze data
  • Run simulations
  • Compare experiment with simulation
  • Search the literature
  • Communicate with colleagues
  • Publish papers
  • Find, configure, install relevant software
  • Find, access, analyze relevant data
  • Order supplies
  • Write proposals
  • Write reports
  • …
Data movement can be surprisingly difficult
  • Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
  • [Diagram labels: B, A]
Grid (aka federation) as a service
  • Globus Toolkit: Build the Grid. Components for building custom grid solutions. globustoolkit.org
  • Globus Online: Use the Grid. Cloud-hosted file transfer service. globusonline.org
Globus Online’s Web 2.0 architecture
  • Command line interface:
      ls alcf#dtn:/
      scp alcf#dtn:/myfile nersc#dtn:/myfile
  • HTTP REST interface:
      POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
  • Web interface
  • Fire-and-forget data movement
  • Many files and lots of data
  • Credential management
  • Performance optimization
  • Expert operations and monitoring
  • GridFTP servers
  • FTP servers
  • High-performance data transfer nodes
  • Globus Connect on local computers
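The HTTP REST interface named on the architecture slide can be sketched roughly as follows. This is a minimal illustration, not the service's actual API: only the URL and the idea of POSTing a transfer document come from the slide. The JSON field names (`source_endpoint`, `source_path`, and so on) and the helper `build_transfer_request` are hypothetical, authentication is omitted, and the request is built but never sent.

```python
import json
import urllib.request

# Endpoint URL as shown on the slide.
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_request(source, destination, url=TRANSFER_URL):
    """Build (but do not send) a POST request carrying a transfer document.

    `source` and `destination` use the endpoint:path form from the slide's
    command-line example, e.g. "alcf#dtn:/myfile".
    """
    src_endpoint, src_path = source.split(":", 1)
    dst_endpoint, dst_path = destination.split(":", 1)

    # Hypothetical document layout: these field names are assumptions made
    # for illustration, not the service's real schema.
    transfer_doc = {
        "source_endpoint": src_endpoint,
        "source_path": src_path,
        "destination_endpoint": dst_endpoint,
        "destination_path": dst_path,
    }
    body = json.dumps(transfer_doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Mirror the slide's command-line example as a REST request.
req = build_transfer_request("alcf#dtn:/myfile", "nersc#dtn:/myfile")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the sketch runnable offline; a real client would add credentials and submit the request, then poll for completion in keeping with the fire-and-forget model.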
Globus Connect to/from your laptop
Almost always faster than other methods [Figure: x-axis in megabytes per file, 0.001 to 1000; sites: Argonne, NERSC]
Monitoring provides deep visibility

Globus Online runs on the cloud
Data movers scale well on Amazon
SaaS facilitates troubleshooting
  • 11 x 125 files, 200 MB each
  • 11 users
  • 12 sites
NSF XSEDE architecture incorporates Globus Toolkit and Globus Online
Next steps: Outsource additional activities
  • Cycle diagram labels: Analyze data, Collect data, Publish results, Identify patterns, Design experiment, Pose question, Test hypotheses, Hypothesize explanation
A use case for the next steps
  • Medical image data is acquired at multiple sites
  • Uploaded to a commercial cloud
  • Quality control algorithms applied
  • Anonymization procedures applied
  • Metadata extracted and stored
  • Access granted to clinical trial team
  • Interactive access and analysis
  • More metadata generated and stored
  • Access granted to subset of data for education
Required building blocks
  • Group management for data sharing: scheduled September 2011, for BIRN biomedical
  • Metadata management: create, update, query a hosted metadata catalog
  • Data publication workflows: data movement, naming, metadata operations, etc.
  • Cloud storage access: and HTTP, WebDAV, SRM, iRODS, …
  • Computation on shared data: e.g., via Galaxy workflow system
Summary
  • To accelerate discovery, automate the mundane
  • Data-intensive computing is particularly full of mundane tasks
  • Outsourcing complexity to SaaS providers is a promising route to automation
  • Globus Online is an early experiment in SaaS for science
For more information
  • Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
  • Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.

Editor's Notes

  • #3 Whitehead points out that a powerful tool for enhancing human capabilities is to automate the mundane. He was talking about mathematics: e.g., decimal system, algebra, calculus, all facilitated thinking. But in an era in which information and its processing increasingly dominate human activities, computing. For example, arithmetic and mathematics: thus, calculus, Excel, Matlab, supercomputers. Increasingly also discovery and innovation depend on integration of diverse resources: data sources, software, computing power, human expertise.
  • #6 The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th century. Collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data, a key issue for Midway. It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab. Each lab has a faculty member, some postdocs, students, so maybe 5000 total just at UC. Academic research is a cottage industry, albeit one that is increasingly interconnected, and is likely to stay that way.
  • #7 The abnormality seen by Nowell and Hungerford on chromosome 22. Now known as the Philadelphia Chromosome
  • #8 Sequencing capacity of a big lab is doubling every nine months. 5 orders of magnitude in ~5 years. Single lab with 10 sequencing machines can generate 400 gigabase pairs per day.
  • #9 Federal Demonstration Partnership.
  • #11, #12, #13 Many interesting questions. What is the right mix of services at the platform level? How do we build services that meet scalability, performance, reliability needs? How can we leverage such offerings to build innovative applications? Legal, business model issues.
  • #14, #15 Of course, people also make effective use of IaaS, but only for more specialized tasks.
  • #20 More specifically, the opportunity is to apply a very modern technology (software as a service, or SaaS) to address a very modern problem, namely the enormous challenges inherent in translating revolutionary 21st century technologies into scientific advances. Midway’s SaaS approach will address these challenges, both making powerful tools far more widely available and reducing the cycle time associated with research and discovery.
  • #21, #22 So let’s look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
  • #23 Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
  • #32 Explain attempts; a cornerstone of our failure mitigation strategy. Through repeated attempts GO was able to overcome transient errors at OLCF and ranger. The expired host certs on bigred were not updated until after the run had completed.
  • #39 Self-healing. SLA-driven. Multi-tenancy: multitasking, … much more. Service-oriented. Virtualized. Linearly scalable. Data, data, data.