Accelerating data-intensive science by outsourcing the mundane

Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)

Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking about them." I propose that cloud computing can allow us to dramatically accelerate the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.

Published in: Technology, Business

  • Whitehead points out that a powerful tool for enhancing human capabilities is to automate the mundane. He was talking about mathematics: the decimal system, algebra, and calculus all facilitated thinking. But in an era in which information and its processing increasingly dominate human activities, the tool is computing. For example, in arithmetic and mathematics: calculus, Excel, Matlab, supercomputers. Increasingly, discovery and innovation also depend on the integration of diverse resources: data sources, software, computing power, and human expertise.
  • The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th century: collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data, a key issue for Midway. It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab. Each lab has a faculty member, some postdocs, and students, so maybe 5000 total just at UC. Academic research is a cottage industry, albeit one that is increasingly interconnected, and is likely to stay that way.
  • The abnormality seen by Nowell and Hungerford on chromosome 22, now known as the Philadelphia chromosome.
  • Sequencing capacity of a big lab is doubling every nine months. Five orders of magnitude in ~5 years. A single lab with 10 sequencing machines can generate 400 Gbase pairs per day.
  • Federal Demonstration Partnership.
  • Many interesting questions. What is the right mix of services at the platform level? How do we build services that meet scalability, performance, and reliability needs? How can we leverage such offerings to build innovative applications? Legal and business model issues.
  • Of course, people also make effective use of IaaS, but only for more specialized tasks
  • More specifically, the opportunity is to apply a very modern technology (software as a service, or SaaS) to address a very modern problem, namely the enormous challenges inherent in translating revolutionary 21st century technologies into scientific advances. Midway’s SaaS approach will address these challenges, making powerful tools far more widely available and reducing the cycle time associated with research and discovery.
  • So let’s look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
  • Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
  • Explain attempts, a cornerstone of our failure mitigation strategy. Through repeated attempts, GO was able to overcome transient errors at OLCF and ranger. The expired host certs on bigred were not updated until after the run had completed.
  • Self-healing; SLA-driven; multi-tenancy – multitasking, … much more; service-oriented; virtualized; linearly scalable; data, data, data.
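
The note on repeated attempts describes retrying as a way to ride out transient errors. A minimal Python sketch of that idea follows. It is not Globus Online's implementation: the names `TransientError`, `retry`, and `make_flaky_transfer`, the attempt limit, and the exponential backoff delays are all assumptions made for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    After each transient failure, wait base_delay * 2**(attempt - 1) seconds
    (exponential backoff). Other exceptions propagate immediately.
    Returns a (result, attempts_used) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: the error outlasted every attempt
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_transfer(failures_before_success):
    """Build a stand-in transfer that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def flaky_transfer():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated transient error")
        return "transfer complete"

    return flaky_transfer
```

Calling `retry(make_flaky_transfer(2), sleep=lambda s: None)` succeeds on the third attempt. A persistent fault, such as the expired host certificates mentioned in the note, would exhaust the attempts and surface the error instead.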

    1. Accelerating data-intensive science by outsourcing the mundane
        Ian Foster
    2. Alfred North Whitehead (1911)
        Civilization advances by extending the number of important operations which we can perform without thinking about them
    3. J.C.R. Licklider reflects on thinking (1960)
        About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
    4. For example … (Licklider again)
        At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.
    5. Research hasn’t changed much in 300 years
        Analyze data
        Collect data
        Publish results
        Identify patterns
        Design experiment
        Pose question
        Test hypotheses
        Hypothesize explanation
    6. Discovery 1960: Data collection dominates
        Janet Rowley: chromosome translocations and cancer
    7. Discovery 2010: Data overflows
        800,000,000,000 bases/day
        30,000,000,000,000 bases/year
    8. Meanwhile, we drown in administrivia
        42%!!
        The Federal Demonstration Partnership’s faculty burden survey
    9. You can run a company from a coffee shop
    10. Varieties of * as a service (*aaS)
        Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
        Platform
        Infrastructure
    11. Varieties of * as a service (*aaS)
        Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
        Platform
        Infrastructure: Amazon, GoGrid, Microsoft, Flexiscale, …
    12. Varieties of * as a service (*aaS)
        Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
        Platform: Google, Microsoft, Amazon, …
        Infrastructure: Amazon, GoGrid, Microsoft, Flexiscale, …
    13. Perform important tasks without thinking
        Web presence
        Email (hosted Exchange)
        Calendar
        Telephony (hosted VOIP)
        Human resources and payroll
        Accounting
        Customer relationship mgmt
        Data analytics
        Content distribution
        IaaS
    14. Perform important tasks without thinking
        Web presence
        Email (hosted Exchange)
        Calendar
        Telephony (hosted VOIP)
        Human resources and payroll
        Accounting
        Customer relationship mgmt
        Data analytics
        Content distribution
        SaaS
        IaaS
    15. What about small and medium labs?
    16. Research IT is a growing burden
        Big projects can build sophisticated solutions to IT problems
        Small labs and collaborations have problems with both
        They need solutions, not toolkits, ideally outsourced solutions
    17. Medium science: Dark Energy Survey
        Blanco 4m on Cerro Tololo
        Image credit: Roger Smith/NOAO/AURA/NSF
        Every night, they receive 100,000 files in Illinois
        They transmit these files to Texas for analysis (35 msec latency)
        Then move the results back to Illinois
        This whole process must run reliably & routinely
    18. Open transfer sockets vs. time
        [Image: Don Petravick, NCSA]
    19. A new approach to research IT
        Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
        Leverage software-as-a-service (SaaS) to
            provide millions of researchers with unprecedented access to powerful research tools, and
            enable a massive shortening of cycle times in time-consuming research processes
    20. Time-consuming tasks in science
        Run experiments
        Collect data
        Manage data
        Move data
        Acquire computers
        Analyze data
        Run simulations
        Compare experiment with simulation
        Search the literature
        Communicate with colleagues
        Publish papers
        Find, configure, install relevant software
        Find, access, analyze relevant data
        Order supplies
        Write proposals
        Write reports
        …
    21. Time-consuming tasks in science
        Run experiments
        Collect data
        Manage data
        Move data
        Acquire computers
        Analyze data
        Run simulations
        Compare experiment with simulation
        Search the literature
        Communicate with colleagues
        Publish papers
        Find, configure, install relevant software
        Find, access, analyze relevant data
        Order supplies
        Write proposals
        Write reports
        …
    22. Data movement can be surprisingly difficult
        Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
        [Figure labels: A, B]
    23. Grid (aka federation) as a service
        Globus Toolkit: Build the Grid
            Components for building custom grid solutions
            globustoolkit.org
        Globus Online: Use the Grid
            Cloud-hosted file transfer service
            globusonline.org
    24. Globus Online’s Web 2.0 architecture
        Command line interface
            ls alcf#dtn:/
            scp alcf#dtn:/myfile nersc#dtn:/myfile
        HTTP REST interface
            POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
        Web interface
        Fire-and-forget data movement
        Many files and lots of data
        Credential management
        Performance optimization
        Expert operations and monitoring
        GridFTP servers
        FTP servers
        High-performance data transfer nodes
        Globus Connect on local computers
    25. Globus Connect to/from your laptop
    26. Almost always faster than other methods
        [Chart: x-axis Megabyte/file, ticks at 0.001, 0.01, 0.1, 1, 10, 100, 1000; sites: Argonne, NERSC]
    27. Monitoring provides deep visibility
    28. [No text on this slide]
    29. Globus Online runs on the cloud
    30. Data movers scale well on Amazon
    31. SaaS facilitates troubleshooting
        11 x 125 files
        200 MB each
        11 users
        12 sites
    32. Moving 586 Terabytes in two weeks
    33. NSF XSEDE architecture incorporates Globus Toolkit and Globus Online
        XSEDE
    34. Next steps: Outsource additional activities
        Analyze data
        Collect data
        Publish results
        Identify patterns
        Design experiment
        Pose question
        Test hypotheses
        Hypothesize explanation
    35. A use case for the next steps
        Medical image data is acquired at multiple sites
        Uploaded to a commercial cloud
        Quality control algorithms applied
        Anonymization procedures applied
        Metadata extracted and stored
        Access granted to clinical trial team
        Interactive access and analysis
        More metadata generated and stored
        Access granted to subset of data for education
    36. Required building blocks
        Group management for data sharing
            Scheduled September, 2011, for BIRN biomedical
        Metadata management
            Create, update, query a hosted metadata catalog
        Data publication workflows
            Data movement, naming, metadata operations, etc.
        Cloud storage access
            And HTTP, WebDAV, SRM, iRODS, …
        Computation on shared data
            E.g., via Galaxy workflow system
    37. www.globusonline.org
    38. Summary
        To accelerate discovery, automate the mundane
        Data-intensive computing is particularly full of mundane tasks
        Outsourcing complexity to SaaS providers is a promising route to automation
        Globus Online is an early experiment in SaaS for science
    39. For more information
        Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
        Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.
    40. Thank you!
        foster@anl.gov
        foster@uchicago.edu
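
The slide on Globus Online's Web 2.0 architecture lists an HTTP REST interface, a POST to `https://transfer.api.globusonline.org/v0.10/transfer` carrying a `<transfer-doc>`, but does not show what the transfer document contains. The Python sketch below only builds such a request and does not send it. The JSON field names (`source_endpoint`, `destination_endpoint`, `items`), the helper names, and the use of JSON at all are assumptions for illustration, not the documented Globus Online API. The endpoint names `alcf#dtn` and `nersc#dtn` and the path `/myfile` come from the command-line example on the same slide.

```python
import json
import urllib.request

# URL as shown on the slide (the stray space in the transcript is removed).
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_doc(source, destination, paths):
    """Return a hypothetical transfer document; every field name is an assumption."""
    return {
        "source_endpoint": source,
        "destination_endpoint": destination,
        "items": [{"source_path": p, "destination_path": p} for p in paths],
    }


def build_request(doc, url=TRANSFER_URL):
    """Wrap the document in a POST request object without sending it."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


doc = build_transfer_doc("alcf#dtn", "nersc#dtn", ["/myfile"])
request = build_request(doc)
print(request.get_method(), request.full_url)
print(json.dumps(doc, indent=2))
```

Authentication, which the slide lists under credential management, is left out of this sketch. The point is only that a single small document can describe a whole transfer, which is what makes fire-and-forget submission possible.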
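
The slide titled "Moving 586 Terabytes in two weeks" implies a sustained average rate that is easy to work out. The Python sketch below does the arithmetic, assuming decimal terabytes (10^12 bytes) and exactly 14 days of continuous transfer. Both are assumptions, since the slide gives only the total and the duration.

```python
TERABYTE = 10 ** 12            # assumption: decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_bytes_per_second(total_bytes, days):
    """Average rate needed to move total_bytes in the given number of days."""
    return total_bytes / (days * SECONDS_PER_DAY)


rate = average_rate_bytes_per_second(586 * TERABYTE, days=14)
print(f"{rate / 10 ** 6:.0f} MB/s, or {rate * 8 / 10 ** 9:.2f} Gbit/s")
```

Under these assumptions the average works out to roughly 484 MB/s, a little under 4 Gbit/s, sustained around the clock for two weeks.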