In the academic research community we've made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high energy physics, astronomy, and climate, thousands benefit daily from tools that enable the distributed management and analysis of large quantities of data. Exploding data volumes and powerful simulation tools mean that most researchers will soon require similar capabilities, but they often do not have the resources or expertise to build and maintain the necessary IT infrastructure. Faced with a similar problem in industry, companies have adopted the Software-as-a-Service (SaaS) model to "free" themselves from IT complexity. We see the same shift occurring in the academic research world over the next decade - indeed, many of us use SaaS services such as Google Docs and Dropbox on a daily basis as an integral part of our research workflow. Here we describe a vision for the next generation of research cyberinfrastructure, and work that the University of Chicago has embarked on to further empower investigators and enable them to access new capabilities beyond the boundaries of their campus.
10. Most labs have limited resources
[Chart: distribution of NSF grant sizes in 2007 (award size, $1,000 to $1,000,000, vs. number of awards); 80% of awards, and 50% of grant dollars, fall below $350,000. Source: Bryan Heidorn]
12. 2012 Faculty Burden Survey, National Academies
[Chart: Active Research Time (%) vs. Federal Funding Amount — active research time ranges roughly 40-65% across funding bins from < $50K to > $3M]
13. Potential economies of scale
Small laboratories
– PI, postdoc, technician, grad students
– Estimate 10,000 across US research community
– Average ill-spent/unmet need of 0.5 FTE/lab?
+ Medium-scale projects
– Multiple PIs, a few software engineers
– Estimate 1,000 across US research community
– Average ill-spent/unmet need of 3 FTE/project?
= Total 8,000 FTE: at ~$100K/FTE => $800M/yr
(If we could even find 8,000 skilled people)
Plus computers, storage, opportunity costs, …
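Written out, the back-of-envelope arithmetic behind the slide's total is:

\[
\underbrace{10{,}000 \times 0.5}_{\text{small labs}} + \underbrace{1{,}000 \times 3}_{\text{medium projects}} = 8{,}000 \text{ FTE},
\qquad 8{,}000 \times \$100\text{K/FTE} \approx \$800\text{M/yr}
\]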
14. Is there a better way to deliver research cyberinfrastructure?
Frictionless
Affordable
Sustainable
19. A simple problem
• “Transfers often take longer than expected based on available network capacities”
• “Lack of an easy to use interface to some of the high-performance tools”
• “Tools [are] too difficult to install and use”
• “Time and interruption to other work required to supervise large data transfers”
• “Need data transfer tools that are easy to use, well-supported, and permitted by site and facility cybersecurity organizations”
Excerpts from ESnet reports
20. Exemplar: APS Beamline 2-BM
X-ray imaging and tomography, at resolutions from a few µm down to 30 nm
Currently can generate >100TB per day
<1GB/s data rate today; ~3-5GB/s in 5-10 years
21. Transforming data acquisition
Current
• Experimental parameters optimized manually
• Collected data combined with visual inspection to confirm optimal condition
• Data reconstructed and sent to users via external drive
• User team starts data reduction at home institution
22. Transforming data acquisition
Envisaged
• Experimental parameters optimized automatically
• Collected data available to optimization programs
• Data are automatically reconstructed, reduced, and shared with local and remote participants
• User team leaves the APS with reduced data
23. Research Data Management as a Service
[Diagram: data acquired at the facility flows through the Globus transfer service; the reduced data is then analyzed and shared via the Globus sharing service and the Globus data publication service (in development)]
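The slide names these services but shows no code; as a rough illustration of what "research data management as a service" can look like from the user's side, here is a minimal sketch that uses the (later) Globus Python SDK to hand a transfer of the reduced data off to the Globus transfer service. The client ID, endpoint UUIDs, and paths are placeholders, not values from the talk.

import globus_sdk

# --- Placeholders: substitute your own app registration and endpoint UUIDs ---
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"      # hypothetical; register at developers.globus.org
SRC_ENDPOINT = "UUID-OF-FACILITY-ENDPOINT"   # e.g. the beamline's Globus endpoint
DST_ENDPOINT = "UUID-OF-HOME-ENDPOINT"       # e.g. the user's campus storage

# Log in once via the native-app OAuth2 flow and obtain a transfer token.
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Please log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Paste authorization code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

# Describe the move of the reduced data set as a single fire-and-forget task.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="Reduced tomography data")
tdata.add_item("/data/reduced/", "/home/user/aps-2bm/", recursive=True)  # hypothetical paths

task = tc.submit_transfer(tdata)
print("Submitted Globus transfer task:", task["task_id"])

# The service handles retries and integrity checks; the script can simply poll for completion.
done = tc.task_wait(task["task_id"], timeout=3600, polling_interval=30)
print("Transfer complete" if done else "Still running; check the Globus web app")

The point of the sketch is the division of labor: the user describes what to move, and the hosted service does the monitoring, retrying, and verification that would otherwise fall to hand-written scp scripts.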
24. 730GB in 90 minutes
“…frees up my time to do more creative work rather than typing scp commands or devising scripts to initiate and monitor progress to move many files.”
Steven Gottlieb, Indiana University
25. San Diego to Miami: 1 click, 20 minutes
“Twenty minutes instead of sixty one hours. Globus makes OLAM global climate simulations manageable.”
Craig Mattocks, University of Miami
38. “We are close to having a $1,000 genome sequence, but this may be accompanied by a $1 million interpretation.”
Bruce Korf, M.D.,
Past President, American College of Medical Genetics
Will data kill genomics?
56. “Affordable” and “Sustainable”?
Either
High-priced commercial software (with generally higher levels of quality)
Or
Free, open source software (with generally lower levels of quality)
Is there a happy medium?
57. Industry and economics themes
• Matlab: commercial closed-source software. Sustainability achieved via license fees.
• Kitware: commercial open source software. Sustainability achieved via services (mostly gov.?).
• DUNE: community of university and lab people, with some commercial involvement.
• MVAPICH: open source software. University team. Sustainability by continued fed. funding, some industry.
59. Our vision for a 21st century discovery infrastructure
To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources
Editor's Notes
Here are some of the areas where we have active projects. Much of our legacy is in the physical sciences, but increasingly we are finding ourselves working in the life sciences…
173 TB/day
Two examples to illustrate some of these issues… LIGO searches for gravitational waves to explore fundamental physics concepts. It runs three observatories around the world and generated over a petabyte of data in their most recent experiment. It’s not just the volume of data – arguably 1PB is becoming commonplace – the real complexity is that this data has to be made available to almost a thousand researchers all over the world, and it has to be actively managed for many years while experiments and analyses are run against it. A very complex undertaking. And by the way, their next experiment, Advanced LIGO, will generate a couple of orders of magnitude more data.
Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what’s notable about these examples? It’s the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10PB of spinning disk last year… and that it’s not a big deal. To a select few, these numbers are routine… And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense: sustained, multi-year efforts; application-specific solutions, built mostly on common/homogeneous technology platforms.
The point is, the 1% of projects are in good shape
But what about the 99% set? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don’t have the resources to deal with these challenges.
So, there are two parts to this problem. One is the size of awards that most labs get… 80% of awards and 50% of grant $$ are < $350K.
The other part of the problem is the amount of time spent, on average, on active research.
Hard to make the case for more funding if this is how the money is really spent!
Result: in the hundreds of thousands of small labs, research suffers… and over time many may become irrelevant. So at the CI we asked ourselves a question… many questions, actually, about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
Today’s startup can operate every bit as efficiently as a large company… perhaps even more so! Without massive capital investment.
All these services share common features: sign up and get started with just a few clicks, nothing to deploy; a slick web user interface; highly scalable; subscription-based pricing.
We have the network. Now we need the apps.
We believe that SaaS can be equally transformational for researchers…
A sophisticated instrument such as this is not readily accessible to small labs. They can get beam time, but managing the data makes it challenging.
Steve Gottlieb is the world’s foremost Lattice QCD expert. He moved data between Oak Ridge National Lab and TACC.
Meteorologist and oceanographer. Moved 28GB of data from Trestles to his local server.
For example in genomics
Competitive TCO. Alternatives are campus computing cores and commercial sequence analysis services.
Modest scalability… but sufficient for the needs of the majority of small labs. Spot Instances. Total: over 350K core hours in the last 6 months.
The first shift we are experiencing is from being installers to capability brokers. We are less concerned with building a data center or installing and configuring software. There is absolutely still a role for that, but there are a few that have the skills and experience… so we take advantage of that experience, focus instead on selecting various components, and spend our time making them easy to use. Again, it’s the user experience. An example of this is the Globus Storage service. We are working with multiple providers… talk to UC IT Services deployment and EMC Isilon relationship. Cloud storage providers will keep driving the unit cost of storage down. We believe the value lies in making it trivial to use that storage in the normal course of their work. Other components for Globus Collaborate, and even for internal use… Zendesk for support. In the case of Zendesk we’re using Globus Integrate, and Globus Nexus in particular, so that from the user’s perspective they only have a single account on Globus and can access external services like Zendesk to track their support tickets, post to forums, etc.
We’re also moving from being developers to playing more of an integrator role. Again, there are lots of smart people out there that have figured out the hard bits, for example in identity management and security. So…
It really is all about the user experience. We’ve shifted the makeup of our team from…