Accelerating data-intensive science by outsourcing the mundane

Published on

Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)

Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.

  • Whitehead points out that a powerful tool for enhancing human capabilities is to automate the mundane. He was talking about mathematics: the decimal system, algebra, and calculus all facilitated thinking. But in an era in which information and its processing increasingly dominate human activities, the same holds for computing. For example, for arithmetic and mathematics: calculus, Excel, Matlab, supercomputers. Increasingly, discovery and innovation also depend on the integration of diverse resources: data sources, software, computing power, and human expertise.
  • The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th century: collect data, analyze data, identify patterns within the data, seek explanations for those patterns, and collect new data to test the explanations. The speed of discovery depends to a significant degree on the time required for this cycle, and new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research: Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients, whereas today we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, such as managing and analyzing data, a key issue for Midway. It is important to realize that the vast majority of research is performed within "small and medium labs." For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own labs; each lab has a faculty member, some postdocs, and students, so perhaps 5000 people in total just at UChicago. Academic research is a cottage industry, albeit one that is increasingly interconnected, and it is likely to stay that way.
  • The abnormality seen by Nowell and Hungerford on chromosome 22, now known as the Philadelphia chromosome.
  • The sequencing capacity of a big lab is doubling every nine months: five orders of magnitude in ~5 years. A single lab with 10 sequencing machines can generate 400 gigabase-pairs per day.
  • Federal Demonstration Partnership.
  • Many interesting questions: What is the right mix of services at the platform level? How do we build services that meet scalability, performance, and reliability needs? How can we leverage such offerings to build innovative applications? And there are legal and business-model issues.
  • Of course, people also make effective use of IaaS, but only for more specialized tasks
  • More specifically, the opportunity is to apply a very modern technology, software as a service (SaaS), to address a very modern problem: the enormous challenges inherent in translating revolutionary 21st-century technologies into scientific advances. Midway's SaaS approach will address these challenges, both making powerful tools far more widely available and reducing the cycle time associated with research and discovery.
  • So let's look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
  • Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
  • Explain attempts: a cornerstone of our failure-mitigation strategy. Through repeated attempts, Globus Online was able to overcome transient errors at OLCF and Ranger. The expired host certs on Big Red were not updated until after the run had completed.
  • Self-healing; SLA-driven; multi-tenancy (multitasking, … much more); service-oriented; virtualized; linearly scalable; data, data, data.
  • Transcript

    • 1. Accelerating data-intensive science by outsourcing the mundane
      Ian Foster
    • 2. Alfred North Whitehead (1911)
      Civilization advances by extending the number of important operations which we can perform without thinking about them
    • 3. J.C.R. Licklider reflects on thinking (1960)
      About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
    • 4. For example … (Licklider again)
      At one point, it was necessary to compare six experimental determinations of a function relating speech intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.
    • 5. Research hasn’t changed much in 300 years
      Analyze data
      Collect data
      Publish results
      Identify patterns
      Design experiment
      Pose question
      Test hypotheses
      Hypothesize explanation
    • 6. Discovery 1960: Data collection dominates
      Janet Rowley: chromosome translocations and cancer
    • 7. 800,000,000,000 bases/day
      30,000,000,000,000 bases/year
      Discovery 2010: Data overflows
    • 8. 42%!!
      Meanwhile, we drown in administrivia
      The Federal Demonstration Partnership’s faculty burden survey
    • 9. You can run a company from a coffee shop
    • 10. Salesforce.com, Google,
      Animoto, …, …, caBIG,
      TeraGrid gateways
      Software
      Platform
      Infrastructure
      Varieties of “* as a Service” (*aaS)
    • 11. Salesforce.com, Google,
      Animoto, …, …, caBIG,
      TeraGrid gateways
      Software
      Platform
      Amazon, GoGrid, Microsoft, Flexiscale, …
      Infrastructure
      Varieties of * as a service (*aaS)
    • 12. Salesforce.com, Google,
      Animoto, …, …, caBIG,
      TeraGrid gateways
      Software
      Google, Microsoft, Amazon, …
      Platform
      Amazon, GoGrid, Microsoft, Flexiscale, …
      Infrastructure
      Varieties of * as a service (*aaS)
    • 13. Perform important tasks without thinking
      Web presence
      Email (hosted Exchange)
      Calendar
      Telephony (hosted VOIP)
      Human resources and payroll
      Accounting
      Customer relationship mgmt
      Data analytics
      Content distribution
      IaaS
    • 14. Perform important tasks without thinking
      Web presence
      Email (hosted Exchange)
      Calendar
      Telephony (hosted VOIP)
      Human resources and payroll
      Accounting
      Customer relationship mgmt
      Data analytics
      Content distribution
      SaaS
      IaaS
    • 15. What about small and medium labs?
    • 16. Research IT is a growing burden
      Big projects can build sophisticated solutions to IT problems
      Small labs and collaborations have problems with both
      They need solutions, not toolkits—ideally outsourced solutions
    • 17. Medium science: Dark Energy Survey
      Blanco 4m on Cerro Tololo
      Image credit: Roger Smith/NOAO/AURA/NSF
      Every night, they receive 100,000 files in Illinois
      They transmit these files to Texas for analysis (35 msec latency)
      Then move the results back to Illinois
      This whole process must run reliably & routinely
    • 18. Open transfer sockets vs. time
      [Image: Don Petravick, NCSA]
    • 19. A new approach to research IT
      Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
      Leverage software-as-a-service (SaaS) to
      provide millions of researchers with unprecedented access to powerful research tools, and
      enable a massive shortening of cycle times in time-consuming research processes
    • 20. Time-consuming tasks in science
      Run experiments
      Collect data
      Manage data
      Move data
      Acquire computers
      Analyze data
      Run simulations
      Compare experiment with simulation
      Search the literature
      Communicate with colleagues
      Publish papers
      Find, configure, install relevant software
      Find, access, analyze relevant data
      Order supplies
      Write proposals
      Write reports
    • 34. Data movement can be surprisingly difficult
      Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
    • 35. Grid (aka federation) as a service
      Globus Toolkit
      Globus Online
      Build the Grid
      Components for building custom grid solutions
      globustoolkit.org
      Use the Grid
      Cloud-hostedfile transfer service
      globusonline.org
    • 36. Globus Online’s Web 2.0 architecture
      Command line interface
      ls alcf#dtn:/
      scp alcf#dtn:/myfile nersc#dtn:/myfile
      HTTP REST interface
      POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
      Web interface
      Fire-and-forget data movement
      Many files and lots of data
      Credential management
      Performance optimization
      Expert operations and monitoring
      GridFTP servers
      FTP servers
      High-performance
      data transfer nodes
      Globus Connect
      on local computers
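      The REST interface above can be exercised with a short script. The transfer-document fields shown here are an illustrative sketch, not the exact v0.10 schema; the endpoint names (alcf#dtn, nersc#dtn) are the examples used elsewhere in the talk.

```python
import json

# Sketch of assembling a transfer document for the Globus Online REST
# interface. Field names are assumptions for illustration; the actual
# v0.10 schema may differ.

def build_transfer_doc(source_endpoint, destination_endpoint, file_pairs):
    """Assemble a transfer document listing (source, destination) path pairs."""
    return {
        "DATA_TYPE": "transfer",
        "source_endpoint": source_endpoint,
        "destination_endpoint": destination_endpoint,
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
            }
            for src, dst in file_pairs
        ],
    }

doc = build_transfer_doc("alcf#dtn", "nersc#dtn", [("/myfile", "/myfile")])
body = json.dumps(doc)

# The document would then be submitted (with the user's credentials) as:
#   POST https://transfer.api.globusonline.org/v0.10/transfer
# after which Globus Online takes over: retries, tuning, monitoring.
print(body)
```

      The point of the design is that the client only states *what* to move; all of the "mundane" steps listed on slide 34 are handled by the hosted service.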
    • 37. Globus Connect to/from your laptop
    • 38. Almost always faster than other methods
      [Chart: transfer rate vs. file size (0.001 to 1000 megabytes/file), Argonne to NERSC]
    • 39. Monitoring provides deep visibility
    • 40.
    • 41. Globus Online runs on the cloud
    • 42. Data movers scale well on Amazon
    • 43. SaaS facilitates troubleshooting
      11 × 125 files, 200 MB each
      11 users
      12 sites
    • 44. Moving 586 Terabytes in two weeks
    • 45. NSF XSEDE architecture incorporates Globus Toolkit and Globus Online
    • 46. Next steps: Outsource additional activities
      Analyze data
      Collect data
      Publish results
      Identify patterns
      Design experiment
      Pose question
      Test hypotheses
      Hypothesize explanation
    • 47. A use case for the next steps
      Medical image data is acquired at multiple sites
      Uploaded to a commercial cloud
      Quality control algorithms applied
      Anonymization procedures applied
      Metadata extracted and stored
      Access granted to clinical trial team
      Interactive access and analysis
      More metadata generated and stored
      Access granted to subset of data for education
    • 48. Required building blocks
      Group management for data sharing
      Scheduled for September 2011, for BIRN biomedical
      Metadata management
      Create, update, query a hosted metadata catalog
      Data publication workflows
      Data movement, naming, metadata operations, etc.
      Cloud storage access
      And HTTP, WebDAV, SRM, iRODS, …
      Computation on shared data
      E.g., via Galaxy workflow system
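      The "metadata management" building block above (create, update, and query a hosted catalog) can be illustrated with a minimal in-memory sketch. The class and method names here are hypothetical, chosen for illustration; the real service would sit behind a REST interface with access control, not live in one process.

```python
# Minimal in-memory sketch of the create/update/query operations a hosted
# metadata catalog would expose. All names are hypothetical.

class MetadataCatalog:
    def __init__(self):
        self._entries = {}  # dataset name -> metadata dict

    def create(self, name, **metadata):
        """Register a new dataset with its initial metadata."""
        self._entries[name] = dict(metadata)

    def update(self, name, **metadata):
        """Merge new metadata into an existing entry."""
        self._entries[name].update(metadata)

    def query(self, **criteria):
        """Return dataset names whose metadata matches all criteria."""
        return [
            name for name, md in self._entries.items()
            if all(md.get(k) == v for k, v in criteria.items())
        ]

# Mirroring the medical-imaging use case on slide 47:
catalog = MetadataCatalog()
catalog.create("scan-001", site="A", modality="MRI", qc_passed=False)
catalog.create("scan-002", site="B", modality="MRI", qc_passed=True)
catalog.update("scan-001", qc_passed=True)  # after quality control
print(catalog.query(modality="MRI", qc_passed=True))
# → ['scan-001', 'scan-002']
```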
    • 49. www.globusonline.org
    • 50. Summary
      To accelerate discovery, automate the mundane
      Data-intensive computing is particularly full of mundane tasks
      Outsourcing complexity to SaaS providers is a promising route to automation
      Globus Online is an early experiment in SaaS for science
    • 51. For more information
      Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
      Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.
    • 52. Thank you!
      foster@anl.gov
      foster@uchicago.edu
