Accelerating data-intensive science by outsourcing the mundane

Accelerating data-intensive science by outsourcing the mundane. Ian Foster

Alfred North Whitehead (1911) Civilization advances by extending the number of important operations which we can perform without thinking about them

J.C.R. Licklider reflects on thinking (1960) About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know

For example … (Licklider again) At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.

Research hasn’t changed much in 300 years: Analyze data, Collect data, Publish results, Identify patterns, Design experiment, Pose question, Test hypotheses, Hypothesize explanation

Discovery 1960: Data collection dominates. Janet Rowley: chromosome translocations and cancer

Discovery 2010: Data overflows. 800,000,000,000 bases/day; 30,000,000,000,000 bases/year

Meanwhile, we drown in administrivia: 42%!! (The Federal Demonstration Partnership’s faculty burden survey)

You can run a company from a coffee shop

Varieties of * as a service (*aaS). Software: Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways. Platform: Google, Microsoft, Amazon, … Infrastructure: Amazon, GoGrid, Microsoft, Flexiscale, …

Perform important tasks without thinking Web presence Email (hosted Exchange) Calendar Telephony (hosted VOIP) Human resources and payroll Accounting Customer relationship mgmt Data analytics Content distribution SaaS IaaS

What about small and medium labs?

Research IT is a growing burden. Big projects can build sophisticated solutions to IT problems. Small labs and collaborations have problems with both. They need solutions, not toolkits: ideally outsourced solutions.

Medium science: Dark Energy Survey. Blanco 4m on Cerro Tololo (image credit: Roger Smith/NOAO/AURA/NSF). Every night, they receive 100,000 files in Illinois. They transmit these files to Texas for analysis (35 msec latency), then move the results back to Illinois. This whole process must run reliably and routinely.
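
To see why a 35 msec link matters at 100,000 files per night, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each file pays one latency period and that a fixed number of transfers are in flight at once; the function name and the concurrency levels are hypothetical, and real transfers involve more round trips plus the time to move the data itself.

```python
FILES_PER_NIGHT = 100_000  # from the slide
LATENCY_S = 0.035          # 35 msec from the slide; treated as a per-file cost (assumption)


def latency_overhead_s(files: int, latency_s: float, concurrency: int = 1) -> float:
    """Total latency cost if every file pays one latency period and
    `concurrency` files are in flight at a time (simplifying assumption)."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return files * latency_s / concurrency


# Hypothetical concurrency levels, chosen only to show the trend.
for streams in (1, 8, 64):
    seconds = latency_overhead_s(FILES_PER_NIGHT, LATENCY_S, streams)
    print(f"{streams:>2} concurrent transfers: {seconds / 60:.1f} min of latency overhead")
```

Under these assumptions, a strictly sequential loop spends close to an hour per night on latency alone, which is one reason concurrency and pipelining matter for many-small-file workloads.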

Open transfer sockets vs. time [Image: Don Petravick, NCSA]

A new approach to research IT. Goal: Accelerate discovery and innovation worldwide by providing research IT as a service. Leverage software-as-a-service (SaaS) to provide millions of researchers with unprecedented access to powerful research tools, and enable a massive shortening of cycle times in time-consuming research processes.

Time-consuming tasks in science: Run experiments, Collect data, Manage data, Move data, Acquire computers, Analyze data, Run simulations, Compare experiment with simulation, Search the literature

Find, configure, install relevant software

Find, access, analyze relevant data

Grid (aka federation) as a service. Globus Toolkit: Build the Grid. Components for building custom grid solutions. globustoolkit.org. Globus Online: Use the Grid. Cloud-hosted file transfer service. globusonline.org

Globus Online’s Web 2.0 architecture. Command line interface: ls alcf#dtn:/ and scp alcf#dtn:/myfile nersc#dtn:/myfile. HTTP REST interface: POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>. Web interface. Fire-and-forget data movement. Many files and lots of data. Credential management. Performance optimization. Expert operations and monitoring. GridFTP servers. FTP servers. High-performance data transfer nodes. Globus Connect on local computers.
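
The REST call on this slide can be sketched with Python's standard library. This is a minimal illustration, not the service's documented client: the function name is hypothetical, the request body is left as the slide's own `<transfer-doc>` placeholder because the slide does not show the document format, and the content type is an assumption. The request is only constructed, never sent.

```python
import urllib.request

# Endpoint as shown on the slide.
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_request(transfer_doc: bytes, content_type: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a transfer document.

    The document format is not given on the slide, so the caller supplies
    both the raw body and its content type.
    """
    return urllib.request.Request(
        TRANSFER_URL,
        data=transfer_doc,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder body: the slide elides the document as <transfer-doc>.
# "application/json" is an assumption, not something the slide states.
request = build_transfer_request(b"<transfer-doc>", "application/json")
print(request.get_method(), request.full_url)
```

Sending the request would additionally require credentials, which the slide lists under credential management but does not detail.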

Globus Connect to/from your laptop

Almost always faster than other methods [Figure: x-axis Megabyte/file, 0.001 to 1000; Argonne, NERSC]

Monitoring provides deep visibility
