Accelerating data-intensive science by outsourcing the mundane
 


Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)

Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.



Usage Rights

CC Attribution License

  • Whitehead points out that a powerful tool for enhancing human capabilities is to automate the mundane. He was talking about mathematics: the decimal system, algebra, and calculus all facilitated thinking. But in an era in which information and its processing increasingly dominate human activities, the same applies to computing. For example, arithmetic and mathematics gave us calculus, Excel, Matlab, and supercomputers. Increasingly, discovery and innovation also depend on the integration of diverse resources: data sources, software, computing power, and human expertise.
  • The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th century: collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. The speed of discovery depends to a significant degree on the time required for this cycle, and new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research: for example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day, so other steps become bottlenecks, like managing and analyzing data (a key issue for Midway). It is important to realize that the vast majority of research is performed within "small and medium labs." For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own labs; each lab has a faculty member, some postdocs, and students, so maybe 5000 people in total just at UChicago. Academic research is a cottage industry, albeit one that is increasingly interconnected, and is likely to stay that way.
  • The abnormality seen by Nowell and Hungerford on chromosome 22, now known as the Philadelphia chromosome.
  • Sequencing capacity of a big lab is doubling every nine months: five orders of magnitude in ~5 years. A single lab with 10 sequencing machines can generate 400 gigabase pairs per day.
  • Federal Demonstration Partnership.
  • Many interesting questions. What is the right mix of services at the platform level? How do we build services that meet scalability, performance, and reliability needs? How can we leverage such offerings to build innovative applications? Plus legal and business-model issues.
  • Of course, people also make effective use of IaaS, but only for more specialized tasks
  • More specifically, the opportunity is to apply a very modern technology—software as a service, or SaaS—to address a very modern problem, namely the enormous challenges inherent in translating revolutionary 21st century technologies into scientific advances. Midway’s SaaS approach will address these challenges, and both make powerful tools far more widely available, and reduce the cycle time associated with research and discovery.
  • So let’s look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
  • Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
  • Explain attempts: a cornerstone of our failure-mitigation strategy. Through repeated attempts, GO was able to overcome transient errors at OLCF and Ranger. The expired host certs on BigRed were not updated until after the run had completed.
  • Self-healing; SLA-driven; multi-tenancy (multitasking, … much more); service-oriented; virtualized; linearly scalable; data, data, data
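The failure-mitigation note above (overcoming transient errors through repeated attempts) can be sketched as a retry loop with exponential backoff. This is an illustrative pattern only, not Globus Online's actual retry policy; the function name, retry limit, and delays are all assumptions.

```python
import time

def transfer_with_retries(do_transfer, max_attempts=5, base_delay=1.0):
    """Retry a transfer callable, backing off exponentially between
    attempts. Illustrative sketch only: the retry limit and delay
    schedule are assumptions, not Globus Online's actual policy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_transfer()
        except IOError:  # stand-in for a transient endpoint error
            if attempt == max_attempts:
                # A persistent fault (e.g. expired host certs) still
                # fails after the final attempt.
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A transfer that fails transiently twice and then succeeds completes on the third attempt; a persistently broken endpoint, like the expired certificates mentioned above, still surfaces as a failure once the attempts are exhausted.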

Presentation Transcript

  • Accelerating data-intensive science by outsourcing the mundane
    Ian Foster
  • Alfred North Whitehead (1911)
    Civilization advances by extending the number of important operations which we can perform without thinking about them
  • J.C.R. Licklider reflects on thinking (1960)
    About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
  • For example … (Licklider again)
    At one point, it was necessary to compare six experimental determinations of a function relating speech intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.
  • Research hasn’t changed much in 300 years
    Analyze data
    Collect data
    Publish
    results
    Identify patterns
    Design experiment
    Pose question
    Test hypotheses
    Hypothesize explanation
  • Discovery 1960: Data collection dominates
    Janet Rowley: chromosome translocations and cancer
  • 800,000,000,000 bases/day
    30,000,000,000,000 bases/year
    Discovery 2010: Data overflows
  • 42%!!
    Meanwhile, we drown in administrivia
    The Federal Demonstration Partnership’s faculty burden survey
  • You can run a company from a coffee shop
  • Salesforce.com, Google,
    Animoto, …, …, caBIG,
    TeraGrid gateways
    Software
    Google, Microsoft, Amazon, …
    Platform
    Amazon, GoGrid, Microsoft, Flexiscale, …
    Infrastructure
    Varieties of * as a service (*aaS)
  • Perform important tasks without thinking
    Web presence
    Email (hosted Exchange)
    Calendar
    Telephony (hosted VOIP)
    Human resources and payroll
    Accounting
    Customer relationship mgmt
    Data analytics
    Content distribution
    SaaS
    IaaS
  • What about small and medium labs?
  • Research IT is a growing burden
    Big projects can build sophisticated solutions to IT problems
    Small labs and collaborations have problems with both
    They need solutions, not toolkits—ideally outsourced solutions
  • Medium science: Dark Energy Survey
    Blanco 4m on Cerro Tololo
    Image credit: Roger Smith/NOAO/AURA/NSF
    Every night, they receive 100,000 files in Illinois
    They transmit these files to Texas for analysis (35 msec latency)
    Then move the results back to Illinois
    This whole process must run reliably & routinely
  • Open transfer sockets vs. time
    [Image: Don Petravick, NCSA]
  • A new approach to research IT
    Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
    Leverage software-as-a-service (SaaS) to
    provide millions of researchers with unprecedented access to powerful research tools, and
    enable a massive shortening of cycle times in time-consuming research processes
  • Time-consuming tasks in science
    Run experiments
    Collect data
    Manage data
    Move data
    Acquire computers
    Analyze data
    Run simulations
    Compare experiment with simulation
    Search the literature
    • Communicate with colleagues
    • Publish papers
    • Find, configure, install relevant software
    • Find, access, analyze relevant data
    • Order supplies
    • Write proposals
    • Write reports
  • Data movement can be surprisingly difficult
    Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
    [Diagram: moving data from endpoint A to endpoint B]
  • Grid (aka federation) as a service
    Globus Toolkit
    Globus Online
    Build the Grid
    Components for building custom grid solutions
    globustoolkit.org
    Use the Grid
    Cloud-hosted file transfer service
    globusonline.org
  • Globus Online’s Web 2.0 architecture
    Command line interface
    ls alcf#dtn:/
    scp alcf#dtn:/myfile nersc#dtn:/myfile
    HTTP REST interface
    POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
    Web interface
    Fire-and-forget data movement
    Many files and lots of data
    Credential management
    Performance optimization
    Expert operations and monitoring
    GridFTP servers
    FTP servers
    High-performance
    data transfer nodes
    Globus Connect
    on local computers
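The REST interface on the slide above can be driven from any HTTP client. Below is a minimal sketch in Python's standard library of building and posting a transfer document to the v0.10 endpoint shown there. The JSON field names (`source_endpoint`, `transfer_items`, etc.) are illustrative assumptions rather than the actual v0.10 schema, and authentication is omitted entirely.

```python
import json
import urllib.request

# Endpoint and version prefix as shown on the slide.
API_BASE = "https://transfer.api.globusonline.org/v0.10"

def make_transfer_doc(src_endpoint, src_path, dst_endpoint, dst_path):
    """Build a transfer document. Field names here are illustrative
    assumptions, not the documented v0.10 schema."""
    return {
        "source_endpoint": src_endpoint,
        "destination_endpoint": dst_endpoint,
        "transfer_items": [
            {"source_path": src_path, "destination_path": dst_path}
        ],
    }

def submit_transfer(doc, opener=urllib.request.urlopen):
    """POST the document, mirroring the slide's
    `POST .../v0.10/transfer <transfer-doc>` call (auth omitted)."""
    req = urllib.request.Request(
        API_BASE + "/transfer",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return opener(req)
```

The same transfer corresponds to the `scp`-style command-line form shown above: one source endpoint and path, one destination endpoint and path, submitted fire-and-forget.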
  • Globus Connect to/from your laptop
  • Almost always faster than other methods
    [Chart: transfer performance vs. file size, 0.001–1000 MB/file, Argonne to NERSC]
  • Monitoring provides deep visibility
  • Globus Online runs on the cloud
  • Data movers scale well on Amazon
  • 11 x 125 files
    200 MB each
    11 users
    12 sites
    SaaS facilitates troubleshooting
  • Moving 586 Terabytes in two weeks
  • NSF XSEDE architecture incorporatesGlobus Toolkit and Globus Online
    XSEDE
  • Next steps: Outsource additional activities
    Analyze data
    Collect data
    Publish
    results
    Identify patterns
    Design experiment
    Pose question
    Test hypotheses
    Hypothesize explanation
  • A use case for the next steps
    Medical image data is acquired at multiple sites
    Uploaded to a commercial cloud
    Quality control algorithms applied
    Anonymization procedures applied
    Metadata extracted and stored
    Access granted to clinical trial team
    Interactive access and analysis
    More metadata generated and stored
    Access granted to subset of data for education
  • Required building blocks
    Group management for data sharing
    Scheduled September 2011 for BIRN biomedical
    Metadata management
    Create, update, query a hosted metadata catalog
    Data publication workflows
    Data movement, naming, metadata operations, etc.
    Cloud storage access
    And HTTP, WebDAV, SRM, iRODS, …
    Computation on shared data
    E.g., via Galaxy workflow system
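The "metadata management" building block above names three operations: create, update, and query against a hosted catalog. A minimal in-memory sketch of such a catalog follows; the class and method names are hypothetical, and this says nothing about how Globus Online's actual hosted catalog is implemented.

```python
class MetadataCatalog:
    """In-memory sketch of a metadata catalog supporting the
    create/update/query operations named on the slide. Illustrative
    only; not Globus Online's actual catalog service."""

    def __init__(self):
        self._entries = {}  # dataset name -> metadata dict

    def create(self, name, metadata):
        """Register a new entry; refuse to clobber an existing one."""
        if name in self._entries:
            raise KeyError(name)
        self._entries[name] = dict(metadata)

    def update(self, name, metadata):
        """Merge new metadata into an existing entry."""
        self._entries[name].update(metadata)

    def query(self, **criteria):
        """Return names of entries whose metadata matches all criteria."""
        return [
            name for name, md in self._entries.items()
            if all(md.get(k) == v for k, v in criteria.items())
        ]
```

In the clinical-trial use case above, the QC and anonymization steps would `update` each image's entry, and the trial team would `query` by site or modality to select data for analysis.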
  • www.globusonline.org
  • Summary
    To accelerate discovery, automate the mundane
    Data-intensive computing is particularly full of mundane tasks
    Outsourcing complexity to SaaS providers is a promising route to automation
    Globus Online is an early experiment in SaaS for science
  • For more information
    Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
    Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.
  • Thank you!
    foster@anl.gov
    foster@uchicago.edu