RPI talk, Foster, September 2011

A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy NY, USA.

We have made much progress over the past decade toward effectively
harnessing the collective power of IT resources distributed across the
globe. In fields such as high-energy physics, astronomy, and climate,
thousands benefit daily from tools that manage and analyze large
quantities of data produced and consumed by large collaborative teams.

But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that far more--ultimately
most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled
with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?

Consumers and businesses face similar challenges, and industry has
responded by moving IT out of homes and offices to so-called cloud providers (e.g., GMail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.

I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards
large-scale delivery of these capabilities.

Speaker notes

  • New capabilities represent a tremendous opportunity for science. The challenge that I want to speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I will suggest, is to get computation out of the lab: to outsource it to third-party providers. I will explain how this can be achieved.
  • The need to deal with and benefit from large quantities of data is not a new concept: it has been noted in many policy reports, particularly in the US and UK, over the past several years.
  • But the data deluge is now upon us. A few examples highlight the developments:
    Genome sequencing machines are doubling in output every nine months, leaving the rather stately 18-month Moore's Law doubling of computer performance in the shade.
    Astronomy, which only entered the digital era around 2000 (2MASS was completed in 2001), projects 100,000 TB of data from LSST by the end of the decade.
    Simulation output is growing too (see the CMIP figures on slide 3).
    And it is not just volume but also complexity. Trends: scale, complexity, distributed generation, …
    Source for genomic data: http://www.sciencemag.org/content/331/6018/728.short ("Output from next-generation sequencing (NGS) has grown from 10 Mb per day to 40 Gb per day on a single sequencer, and there are now 10 to 20 major sequencing labs worldwide that have each deployed more than 10 sequencers")
    Source for molecular biology databases: http://nar.oxfordjournals.org/content/39/suppl_1/D1.full.pdf+html
    Source for climate change image: http://serc.carleton.edu/details/images/17685.html
  • Not just small labs; medium science too. E.g., the Dark Energy Survey.
  • For many researchers, projects, and institutions, large data volumes are not an opportunity but a fundamental challenge to their competitiveness. How can they keep up?
  • 200 universities × 250 faculty per university = 50,000 faculty, of whom perhaps 5,000 run small data-intensive labs.
    Summary:
    Big projects can build sophisticated solutions to IT problems.
    Small labs and collaborations have problems with both.
    They need solutions, not toolkits; ideally, outsourced solutions.
  • Need date
  • Of course, people also make effective use of IaaS, but only for more specialized tasks
  • More specifically, the opportunity is to apply a very modern technology, software as a service (SaaS), to address a very modern problem: the enormous challenges inherent in translating revolutionary 21st-century technologies into scientific advances. Midway's SaaS approach will address these challenges, both making powerful tools far more widely available and reducing the cycle time associated with research and discovery.
    Achieve economies of scale.
    Reduce cost per researcher dramatically.
    Achieve positive returns to scale (PRTS).
    Most academic solutions do NOT have PRTS; most industrial solutions DO.
  • So let's look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
  • Example: a small lab generates data at the Texas Advanced Computing Center or the Advanced Photon Source and needs to move it back to their lab. Or: they need to move data from an experimental facility (e.g., a sequencing center or the Dark Energy Survey) to a computing facility for analysis.
  • Data movement is conceptually simple, but can be surprisingly difficult
  • Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
  • Reliable file transfer:
    Easy "fire and forget" file transfers
    Automatic fault recovery
    High performance
    Across multiple security domains
    No IT required:
    No client software installation
    New features automatically available
    Consolidated support and troubleshooting
    Works with existing GridFTP servers
    Globus Connect solves the "last mile" problem
  • I’ll talk about integration with the Galaxy workflow system later …
  • Reduce costs. Improve performance. Enable new science.
  • What else do we need?
  • Add university logos?
  • Slide 33: Is the task of creating reusable workflows part of these 6 steps? Is publication and discovery of workflows/derived data products part of this as well? Is reproducible research part of it as well?
  • Researchers vote with their dollars
  • Before: lots of little labs; big science; XSEDE. After: lots of empowered small and medium labs (SMLs), entrepreneurship in science, reproducible/reusable research, etc.

Transcript

  • 1. Accelerating data-intensive science by outsourcing the mundane
    Ian Foster
  • 2.
  • 3. The data deluge
    MACHO et al.: 1 TB
    Palomar: 3 TB
    2MASS: 10 TB
    GALEX: 30 TB
    Sloan: 40 TB
    Pan-STARRS: 40,000 TB
    LSST: 100,000 TB
    Genomic sequencing output doubles every 9 months
    >300 public centers
    1,330 molecular biology databases listed in Nucleic Acids Research (96 in Jan 2001)
    Climate Model Intercomparison Project (CMIP) of the IPCC: 36 TB in 2004; 2,300 TB in 2012
  • 4. Big science has achieved big successes
    OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010
    LIGO: 1 PB data in last science run, distributed worldwide
    Robust production solutions
    Substantial teams and expense
    Sustained, multi-year effort
    Application-specific solutions, built on common technology
    ESG: 1.2 PB climate data
    delivered to 23,000 users; 600+ pubs
    All built on NSF OCI (& DOE)-supported Globus Toolkit software
  • 5. But small science is struggling
    More data, more complex data
    Ad-hoc solutions
    Inadequate software, hardware
    Data plan mandates
  • 6. Medium-scale science struggles too!
    Blanco 4m on Cerro Tololo
    Image credit: Roger Smith/NOAO/AURA/NSF
    Dark Energy Survey receives 100,000 files each night in Illinois
    They transmit files to Texas for analysis … then move results back to Illinois
    Process must be reliable, routine, and efficient
    The cyberinfrastructure team is not large
  • 7. The challenge of staying competitive
    "Well, in our country," said Alice … "you'd generally get to somewhere else — if you run very fast for a long time, as we've been doing.”
    "A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"
  • 8. Current approaches are unsustainable
    Small laboratories
    PI, postdoc, technician, grad students
    Estimate 5,000 across US university community
    Average ill-spent/unmet need of 0.5 FTE/lab?
    Medium-scale projects
    Multiple PIs, a few software engineers
    Estimate 500 across US university community
    Average ill-spent/unmet need of 3 FTE/project?
    Total 4000 FTE: at ~$100K/FTE => $400M/yr
    Plus computers, storage, opportunity costs, …
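
As a quick sanity check, the slide's arithmetic in a few lines of Python (all figures are the estimates above):

```python
# Back-of-envelope check of the cost estimate above.
small_labs = 5_000        # small labs across the US university community
fte_per_lab = 0.5         # ill-spent/unmet need per lab (FTE)
projects = 500            # medium-scale projects
fte_per_project = 3       # ill-spent/unmet need per project (FTE)
cost_per_fte = 100_000    # ~$100K per FTE per year

total_fte = small_labs * fte_per_lab + projects * fte_per_project
print(total_fte)                  # 4000.0 FTE
print(total_fte * cost_per_fte)   # 400000000.0 -> $400M/yr
```
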
  • 9. And don’t forget administrative costs
    42% of the time spent by an average PI on a federally funded research project was reported to be expended on administrative tasks related to that project rather than on research
    — Federal Demonstration Partnership faculty burden survey, 2007
  • 10. You can run a company from a coffee shop
  • 11. Because businesses outsource their IT
    Web presence
    Email (hosted Exchange)
    Calendar
    Telephony (hosted VOIP)
    Human resources and payroll
    Accounting
    Customer relationship mgmt
    Software as a Service
    (SaaS)
  • 12. And often their large-scale computing too
    Web presence
    Email (hosted Exchange)
    Calendar
    Telephony (hosted VOIP)
    Human resources and payroll
    Accounting
    Customer relationship mgmt
    Data analytics
    Content distribution
    Software as a Service
    (SaaS)
    Infrastructure as a Service(IaaS)
  • 13. Let’s rethink how we provide research IT
    Accelerate discovery and innovation worldwide by providing research IT as a service
    Leverage software-as-a-service to
    provide millions of researchers with unprecedented access to powerful tools;
    enable a massive shortening of cycle times in time-consuming research processes; and
    reduce research IT costs dramatically via economies of scale
  • 14. Time-consuming tasks in science
    Run experiments
    Collect data
    Manage data
    Move data
    Acquire computers
    Analyze data
    Run simulations
    Compare experiment with simulation
    Search the literature
    Communicate with colleagues
    Publish papers
    Find, configure, install relevant software
    Find, access, analyze relevant data
    Order supplies
    Write proposals
    Write reports
  • 21. Time-consuming tasks in science (the same list, shown again)
  • 28. Data movement can be surprisingly difficult
    (Diagram: moving data from endpoint A to endpoint B)
  • 29. Data movement can be surprisingly difficult
    Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
    It took 2 weeks and much help from many people to move 10 TB between California and Tennessee.
    (2007 BES report)
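
Failure handling alone illustrates the burden: absent a service, each lab ends up hand-rolling retry logic like the following sketch (scp stands in for any copy tool; file and host names are hypothetical):

```python
# DIY fault recovery for a file transfer: retry with exponential
# backoff. This is the kind of babysitting code a hosted service
# makes unnecessary. "scp" stands in for any copy tool.
import subprocess
import time

def reliable_transfer(src, dst, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            subprocess.run(["scp", src, dst], check=True)
            return                        # success
        except subprocess.CalledProcessError:
            if attempt == max_attempts:
                raise                     # give up after the last attempt
            time.sleep(2 ** attempt)      # back off: 2, 4, 8, 16 seconds

reliable_transfer("data.tar", "user@host.example.org:/scratch/data.tar")
```
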
  • 30. Globus Online's SaaS/Web 2.0 architecture
    Command line interface
    ls alcf#dtn:/
    scp alcf#dtn:/myfile nersc#dtn:/myfile
    HTTP REST interface
    POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
    Web interface
    OpenID
    OAuth
    Shibboleth
    (Operate)
    Fire-and-forget data movement
    Automatic fault recovery
    High performance
    No client software install
    Across multiple security domains
    (Hosted on)
    GridFTP servers
    FTP servers
    Other protocols:
    HTTP, WebDAV, SRM, …
    Globus Connect
    on local computers
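
To make the REST interface concrete, here is a minimal sketch of submitting a fire-and-forget transfer. The transfer-document fields are illustrative assumptions, not the actual v0.10 schema, and authentication (OpenID/OAuth/Shibboleth, per the slide) is omitted:

```python
# Sketch only: POST a transfer document to the REST interface shown
# above. Field names in transfer_doc are assumptions for illustration;
# consult the Globus Online API documentation for the real schema.
import json
import urllib.request

transfer_doc = {
    "source_endpoint": "alcf#dtn",
    "destination_endpoint": "nersc#dtn",
    "items": [{"source_path": "/myfile", "destination_path": "/myfile"}],
}

req = urllib.request.Request(
    "https://transfer.api.globusonline.org/v0.10/transfer",
    data=json.dumps(transfer_doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Authentication headers/credentials omitted in this sketch.
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```
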
  • 31. Example application: UC sequencing facility
    Mac using Globus Connect
    Delivery of data to customer
    iBi File Server
    Mount drive
    iBi general-purpose compute cluster
    Sequencing-specific compute cluster
    Sequencing instrument
  • 32. Statistics and user feedback
    Launched November 2010
    >1700 users registered
    >500 TB user data moved
    >30 million user files moved
    >150 endpoints registered
    Widely used on TeraGrid/XSEDE; other centers & facilities; internationally
    >20x faster than SCP
    Faster than hand-tuned
    “Last time I needed to fetch 100,000 files from NERSC, a graduate student babysat the process for a month.”
    “I expected to spend four weeks writing code to manage my data transfers; with Globus Online, I was up and running in five minutes.”
    “Transferred 28 MB in 20 minutes instead of 61 hours. Makes these global climate simulations manageable.”
  • 33. Moving 586 Terabytes in two weeks
  • 34. Monitoring provides deep visibility
  • 35. 20 terabytes moved in less than one day, versus 20 gigabytes in more than two days
    (Chart: transfer progress on a log scale from kilobytes to terabytes)
  • 36. Common research data management steps
    Dark Energy Survey
    Galaxy genomics
    LIGO observatory
    SBGrid structural biology consortium
    NCAR climate data applications
    Land use change; economics
  • 37. We have choices of where to compute
    Campus systems
    First target for many researchers
    XSEDE supercomputers
    220,000 cores, peer-reviewed awards
    Optimized for scientific computing
    Open Science Grid
    60,000 cores; high throughput
    Commercial cloud providers
    Instant access for small tasks
    Expensive for big projects
    Users insist that they need everything connected
  • 38. Towards “research IT as a service”
  • 39. Research data management as a service
    GO-User
    Credentials and other profile information
    GO-Transfer
    Data movement
    GO-Team
    Group membership
    GO-Collaborate
    Connect to collaborative tools: Jira, Confluence, …
    GO-Store
    Access to campus, cloud, XSEDE storage
    GO-Catalog
    On-demand metadata catalogs
    GO-Compute
    Access to computers
    GO-Galaxy
    Share, create, run workflows
    (Service status: available today, in prototype, or planned for fall)
  • 40. SaaS services in action: The XSEDE vision
    XUAS
  • 41. Data analysis as a service: Early steps
    Securely and reliably:
    Assemble code
    Find computers
    Deploy code
    Run program
    Access data
    Store data
    Record workflow
    Reuse workflow
    We have built such systems for biological, environmental, and economics researchers
    (Diagram: steps labeled [1, 2] through [7, 8] spanning app code, VM image, workflow, Galaxy, Condor, and a data store)
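
The "record workflow / reuse workflow" steps can be pictured with a small sketch; the wrapper, the JSON record, and the stand-in commands below are all hypothetical, not part of any Globus or Galaxy API:

```python
# Hypothetical provenance wrapper: run each analysis stage and log
# its command and timing to a JSON record that can be replayed later.
import json
import subprocess
import time

record = {"steps": []}

def run_step(name, cmd):
    start = time.time()
    subprocess.run(cmd, check=True)            # run the stage
    record["steps"].append({"name": name, "command": cmd,
                            "seconds": round(time.time() - start, 2)})

run_step("align", ["echo", "aligning reads"])            # stand-in commands
run_step("call_variants", ["echo", "calling variants"])

with open("workflow.json", "w") as f:          # the reusable record
    json.dump(record, f, indent=2)

# Replaying the record re-executes the same pipeline:
for step in json.load(open("workflow.json"))["steps"]:
    subprocess.run(step["command"], check=True)
```
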
  • 42. SaaS economics: A quick tutorial
    Lower per-user cost (x10?) via aggregation onto common infrastructure
    $400M/yr → $40M/yr?
    Initial “cost trough” due to fixed costs
    Per-user revenue permits positive return to scale
    Further reduce per-user cost over time
    (Chart: dollars versus time, showing the initial cost trough before returns to scale)
    X10 reduction in per-user cost:
    $50K → $5K/yr per lab
    $300K → $30K/yr per project
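
A toy model of the slide's economics (all constants invented for illustration) shows both the initial cost trough and the positive return to scale:

```python
# Toy SaaS cost model: a fixed cost spread over more users drives
# per-lab cost down; at a $5K/yr price the margin turns positive
# once enough labs subscribe. All numbers are illustrative.
FIXED_COST = 2_000_000   # $/yr to run the shared service
MARGINAL_COST = 1_000    # incremental $/yr per additional lab
PRICE = 5_000            # $/yr per lab (the slide's x10 target)

for labs in (100, 500, 1_000, 5_000):
    cost_per_lab = FIXED_COST / labs + MARGINAL_COST
    print(f"{labs:>5} labs: cost/lab ${cost_per_lab:,.0f}, "
          f"margin ${PRICE - cost_per_lab:,.0f}/lab")
# Below 500 labs the service loses money (the cost trough);
# at 500 it breaks even, and beyond that the margin grows with scale.
```
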
  • 43. A national cyberinfrastructure strategy?
    To provide more capability for more people at less cost …
    Create infrastructure
    Robust and universal
    Economies of scale
    Positive returns to scale
    Via the creative use of
    Aggregation (“cloud”)
    Federation (“grid”)
    Small and medium laboratories and projects
    (Diagram: many labs (L) and projects (P) connected to shared "aaS" infrastructure providing research data management; collaboration and computation; and research administration)
  • 44. Acknowledgments
    Colleagues at UChicago and Argonne
    Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, and others listed at www.globusonline.org/about/goteam/
    Carl Kesselman and other colleagues at other institutions
    Participants in the recent ICiS workshop on “Human-Computer Symbiosis: 50 Years On”
    NSF OCI and MPS; DOE ASCR; and NIH for support
  • 45. For more information
    www.globusonline.org; @globusonline on Twitter
    Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
    Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Communications of the ACM, 2011.
  • 46. Thank you!
    foster@uchicago.edu
    www.globusonline.org
    @globusonline