SaaS and the Transformation of Research


In the academic research community we have made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high-energy physics, astronomy, and climate, thousands of researchers benefit daily from tools that enable the distributed management and analysis of large quantities of data. Exploding data volumes and powerful simulation tools mean that most researchers will soon require similar capabilities, but they often lack the resources or expertise to build and maintain the necessary IT infrastructure. Faced with a similar problem, industry has adopted the Software-as-a-Service (SaaS) model to "free" itself from IT complexity. We see the same shift occurring in the academic research world over the next decade; indeed, many of us already use SaaS services such as Google Docs and Dropbox daily as an integral part of our research workflow. Here we describe a vision for the next generation of research cyberinfrastructure, and work that the University of Chicago has embarked on to further empower investigators and give them access to new capabilities beyond the boundaries of their campus.

Published in: Data & Analytics
  • Here are some of the areas where we have active projects. Much of our legacy is in the physical sciences, but increasingly we are finding ourselves working in the life sciences.
  • 173 TB/day
  • Two examples to illustrate some of these issues. LIGO searches for gravitational waves to explore fundamental physics concepts. It runs three observatories around the world and generated over a petabyte of data in its most recent experiment. It's not just the volume of data (arguably 1PB is becoming commonplace); the real complexity is that this data has to be made available to almost a thousand researchers all over the world, and it has to be actively managed for many years while experiments and analyses are run against it. A very complex undertaking. And by the way, their next experiment, Advanced LIGO, will generate a couple of orders of magnitude more data.
  • Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what's notable about these examples? It's the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10PB of spinning disk last year, and that it's not a big deal. To a select few, these numbers are routine. And for the projects I just talked about, the IT infrastructure is in place: they have robust production solutions, built by substantial teams at great expense through sustained, multi-year efforts; application-specific solutions, built mostly on common/homogeneous technology platforms.
  • The point is, the 1% of projects are in good shape
  • But what about the 99%? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges, and they don't have the resources to deal with these challenges.
  • So, there are two parts to this problem. One is the size of awards that most labs get: 80% of awards and 50% of grant $$ are < $350K.
  • The other part of the problem is the amount of time spent, on average, on active research.
  • Hard to make the case for more funding if this is how the money is really spent!
  • Result: in the hundreds of thousands of small labs, research suffers, and over time many may become irrelevant. So at the CI we asked ourselves a question (many questions, actually) about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
  • Today's startup can operate every bit as efficiently as a large company, perhaps even more so, and without massive capital investment.
  • All these services share common features: sign up and get started with just a few clicks (nothing to deploy); slick web user interface; highly scalable; subscription-based pricing.
  • We have the network. Now we need the apps.
  • We believe that SaaS can be equally transformational for researchers…
  • A sophisticated instrument such as this is not readily accessible to small labs. They can get beam time, but managing the data makes it challenging.
  • Steve Gottlieb is the world's foremost Lattice QCD expert. He moved data between Oak Ridge National Lab and TACC.
  • A meteorologist and oceanographer who moved 28GB of data from Trestles to his local server.
  • For example in genomics
  • Competitive TCO. Alternatives are campus computing cores and commercial sequence analysis services.
  • Modest scalability, but sufficient for the needs of the majority of small labs. Spot instances; total: over 350K core hours in the last 6 months.
  • The first shift we are experiencing is from being installers to capability brokers. We are less concerned with building a data center or installing and configuring software. There is absolutely still a role for that, but there are few that have the skills and experience, so we take advantage of that experience and focus instead on selecting various components, spending our time making them easy to use. Again, it's the user experience. An example of this is the Globus Storage service. We are working with multiple providers (talk to UC IT Services about the deployment and the EMC Isilon relationship). Cloud storage providers will keep driving the unit cost of storage down; we believe the value lies in making it trivial for researchers to use that storage in the normal course of their work. Other components support Globus Collaborate, and even internal use: Zendesk for support. In the case of Zendesk we're using Globus Integrate, and Globus Nexus in particular, so that from the user's perspective they have only a single account on Globus and can access external services like Zendesk to track their support tickets, post to forums, etc.
  • We're also moving from being developers to playing more of an integrator role. Again, there are lots of smart people out there who have figured out the hard bits, for example in identity management and security.
  • It really is all about the user experience. We've shifted the makeup of our team from…
  • Institutions recognizing value

    1. SaaS and the Transformation of Research. Vas Vasiliadis
    2. Urban Science
    3. Thank you to our sponsors! U.S. DEPARTMENT OF ENERGY
    4. Higgs discovery "only possible because of the extraordinary achievements of …grid computing" Rolf Heuer, CERN DG
    5. 25PB per year; 8,000 scientists worldwide
    6. 1PB in last experiment; 800 scientists worldwide
    7. 1.2PB of climate data delivered to 23,000 users
    8. We have exceptional infrastructure for the 1%
    9. What about the 99%?
    10. Most labs have limited resources. NSF grants in 2007: 80% of awards and 50% of grant $$ were < $350,000. (Chart: distribution of award sizes, $1,000 to $1,000,000; data: Bryan Heidorn)
    11. 57.7%
    12. Active Research Time vs. Federal Funding Amount. (Chart: active research time (%) by federal funding amount, < $50K to > $3M; source: 2012 Faculty Burden Survey, National Academies)
    13. Potential economies of scale. Small laboratories (PI, postdoc, technician, grad students): estimate 10,000 across the US research community, with average ill-spent/unmet need of 0.5 FTE/lab? Plus medium-scale projects (multiple PIs, a few software engineers): estimate 1,000 across the US research community, with average ill-spent/unmet need of 3 FTE/project? Total: 8,000 FTE; at ~$100K/FTE => $800M/yr (if we could even find 8,000 skilled people). Plus computers, storage, opportunity costs, …
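The estimate on slide 13 can be reproduced in a few lines. All figures below are the slide's own back-of-the-envelope assumptions, not measurements:

```python
# Slide 13's estimate: ill-spent/unmet effort across US labs, priced per FTE.
SMALL_LABS = 10_000       # estimated small labs in the US research community
FTE_PER_LAB = 0.5         # assumed ill-spent/unmet need per lab
MEDIUM_PROJECTS = 1_000   # estimated medium-scale projects
FTE_PER_PROJECT = 3       # assumed ill-spent/unmet need per project
COST_PER_FTE = 100_000    # ~$100K per FTE per year

total_fte = SMALL_LABS * FTE_PER_LAB + MEDIUM_PROJECTS * FTE_PER_PROJECT
annual_cost_usd = total_fte * COST_PER_FTE
print(f"{total_fte:.0f} FTE, ${annual_cost_usd:,.0f}/yr")  # 8000 FTE, $800,000,000/yr
```

The $800M/yr figure is thus the product of two rough headcount guesses and a salary assumption, which is why the slide frames both FTE estimates as questions.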
    14. Is there a better way to deliver research cyberinfrastructure? Frictionless. Affordable. Sustainable.
    15. Commercial startups as "role models"
    16. My Shiny New Startup
    17. "Frictionless": great user experience + high-performance (but invisible) infrastructure
    18. SaaS is transformational for… Researchers
    19. A simple problem. "Transfers often take longer than expected based on available network capacities"; "Lack of an easy to use interface to some of the high-performance tools"; "Tools [are] too difficult to install and use"; "Time and interruption to other work required to supervise large data transfers"; "Need data transfer tools that are easy to use, well-supported, and permitted by site and facility cybersecurity organizations". (Excerpts from ESnet reports)
    20. Exemplar: APS Beamline 2-BM. X-ray imaging and tomography, ~few µm to 30nm resolution. Currently can generate >100TB per day; <1GB/s data rate today, ~3-5GB/s in 5-10 years.
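As a sanity check on slide 20's numbers, a sustained rate of about 1 GB/s does land in the ~100 TB/day range. A quick sketch, assuming decimal (SI) units:

```python
# Convert a sustained acquisition rate (GB/s) into daily data volume (TB/day).
def daily_volume_tb(rate_gb_per_s: float) -> float:
    """TB produced per day at a sustained rate, using decimal (SI) units."""
    return rate_gb_per_s * 86_400 / 1_000  # 86,400 seconds/day; 1,000 GB/TB

print(daily_volume_tb(1.0))  # 86.4 TB/day at today's ~1 GB/s
print(daily_volume_tb(5.0))  # 432.0 TB/day at the projected ~5 GB/s
```

So the projected 3-5 GB/s rates imply several hundred terabytes per day from a single beamline, which is the scale the envisaged automated pipeline has to absorb.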
    21. Transforming data acquisition, current state: experimental parameters optimized manually; collected data combined with visual inspection to confirm optimal conditions; data reconstructed and sent to users via external drive; user team starts data reduction at home institution.
    22. Transforming data acquisition, envisaged: experimental parameters optimized automatically; collected data available to optimization programs; data automatically reconstructed, reduced, and shared with local and remote participants; user team leaves the APS with reduced data.
    23. Facility data acquisition. Research Data Management as a Service: Globus transfer service (reduced data); analysis/sharing (Globus sharing service); Globus data publication service (*in development).
    24. 730GB in 90 minutes. "…frees up my time to do more creative work rather than typing scp commands or devising scripts to initiate and monitor progress to move many files." Steven Gottlieb, Indiana University
    25. San Diego to Miami: 1 click, 20 minutes. "Twenty minutes instead of sixty-one hours. Globus makes OLAM global climate simulations manageable." Craig Mattocks, University of Miami
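The effective throughput behind the two testimonials on slides 24-25 is easy to derive. A quick sketch using the slides' own figures and decimal GB:

```python
# Average throughput implied by "730GB in 90 minutes" (slide 24).
def throughput_mb_per_s(gigabytes: float, minutes: float) -> float:
    """Mean transfer rate in MB/s, decimal units."""
    return gigabytes * 1_000 / (minutes * 60)

print(round(throughput_mb_per_s(730, 90), 1))  # ~135.2 MB/s sustained

# "Twenty minutes instead of sixty-one hours" (slide 25) is a 183x speed-up.
print(round(61 * 60 / 20))  # 183
```

A sustained ~135 MB/s over the wide area is the kind of rate that is hard to reach with hand-driven scp, which is the point of the Gottlieb quote.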
    26. Early adoption is encouraging
    27. 15,327 endpoints
    28. 182* daily users (*30-day average)
    29. 41.8PB
    30. 2B files
    31. Other innovative science SaaS projects
    32. "Affordable": competitive TCO at modest scale
    33. A time of disruptive change
    34. A time of disruptive change
    35. Will data kill genomics?
    36. "We are close to having a $1,000 genome sequence, but this may be accompanied by a $1 million interpretation." Bruce Korf, M.D., Past President, American College of Medical Genetics. Will data kill genomics?
    37. Globus Genomics: flexible, scalable, affordable genomics analysis for all biologists
    38. Data management SaaS + next-gen sequence analysis pipelines + scalable IaaS
    39. Exome: $3-$20; Whole Genome: $20-$50; RNA-Seq: <$5. Alternatives are at 10-20x.
    40. Affordable scalability: 350K core hours in the last 6 months
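Slide 40's 350K core hours over roughly six months corresponds to a modest average footprint, which is consistent with the "affordable at modest scale" framing. A rough estimate (182 days for "six months" is my assumption):

```python
# Average number of continuously busy cores implied by 350K core hours / 6 months.
core_hours = 350_000
wall_hours = 182 * 24  # ~six months of wall-clock time (assumed 182 days)
print(round(core_hours / wall_hours))  # ~80 cores busy on average
```

An average of roughly 80 cores is well within what spot instances can supply cheaply, even though peak usage for an individual analysis may burst far higher.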
    41. Dobyns Lab: exome analysis, 20x speed-up. Next: 50x.
    42. Cox Lab: consensus variant calling, 134 samples in 4 days, <0.01% Mendel error rate. Next: 13,000 samples.
    43. Another example: DTI pipelines. (Chart: cost per subject ($) under On-Demand, Spot (Low), and Spot (High) pricing)
    44. SaaS is transformational for… Researchers, Resource Providers
    45. installers → brokers
    46. Cede (some) control. Evolve financial models. Adapt institutional policies. Become a lawyer!
    47. developers → integrators. GSI-OpenSSH
    48. A platform for integration
    49. A platform for integration
    50. A platform for integration
    51. administrators → curators (of the user experience). 1 : 1 : 0 UX : Dev : Ops
    52. We are a non-profit service provider to the non-profit research community
    53. Our challenge: sustainability. We are a non-profit service provider to the non-profit research community.
    54. "Affordable" and "Sustainable"? Either high-priced commercial software (with generally higher levels of quality) or free, open source software (with generally lower levels of quality). Is there a happy medium?
    55. Industry and economics themes. Matlab: commercial closed-source software; sustainability achieved via license fees. Kitware: commercial open source software; sustainability achieved via services (mostly gov.?). DUNE: community of university and lab people, with some commercial involvement. MVAPICH: open source software, university team; sustainability via continued federal funding, some industry.
    56. Globus: subscriptions. Globus Provider plans; Globus Plus.
    57. Our vision for a 21st century discovery infrastructure: to provide more capability for more people at substantially lower cost by creatively aggregating ("cloud") and federating ("grid") resources.