Research Automation
for Data-Driven Discovery
Ian Foster
Argonne National Laboratory &
The University of Chicago
foster@anl.gov
A productivity crisis in research
Data volumes are growing
much faster than Moore’s law …
(10,000x more over 6 years for
genome data)
Kahn, Science, 331 (6018): 728-729
But most labs
have extremely
limited resources
Heidorn: NSF grants in 2007 under $350,000
accounted for 80% of awards
and 50% of grant dollars
"Well, in our country," said Alice …
"you'd generally get to somewhere else
— if you run very fast for a long time,
as we've been doing."
"A slow sort of country!" said the
Queen. "Now, here, you see, it
takes all the running you can do,
to keep in the same place. If you
want to get somewhere else, you
must run at least twice as fast as that!"
The challenge of staying competitive
https://bit.ly/2l4gfgu
How industry handles complexity
cloud4scieng.org
Industry software builds on powerful platform services
Cloud platforms have transformed how software is
developed and delivered
Can we do the same for science?
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted
implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission, and so
reduce costs, increase quality, and promote interoperability
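To make "simple REST APIs" concrete, here is a minimal sketch of how a tool developer might ask a hosted service to move files on their behalf. It only builds the HTTP request and does not send it. The base URL, the `/transfers` path, and the JSON field names are hypothetical placeholders chosen for illustration; they are not the actual Globus API.

```python
import json
import urllib.request

# Hypothetical service address, used for illustration only.
BASE_URL = "https://transfer.example.org/v1"


def build_transfer_request(source, destination, paths, token):
    """Build a POST request describing a file transfer; nothing is sent."""
    body = {
        "source": source,
        "destination": destination,
        "items": [{"path": p} for p in paths],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/transfers",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_transfer_request(
    source="lab-storage",
    destination="campus-cluster",
    paths=["/data/run42.h5"],
    token="EXAMPLE-TOKEN",
)
print(request.get_method(), request.full_url)
```

The point of the design is that the caller describes what should happen (source, destination, items) and leaves retries, integrity checks, and monitoring to the hosted service.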
What capabilities?
• Auth: Manage identities, authentication, and authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
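As one illustration of the Sharing capability listed above ("who can access data at a location"), the following sketch checks access against a list of per-folder rules. The `AccessRule` type, the `can_access` helper, and the identities and paths are all hypothetical; this is a toy model of the idea, not how any particular service implements it.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AccessRule:
    identity: str    # who the rule applies to
    path: str        # folder the rule covers, including everything below it
    permission: str  # "r" (read) or "rw" (read and write)


def can_access(rules, identity, path, need="r"):
    """Return True if any rule grants `identity` the needed permission on `path`."""
    target = PurePosixPath(path)
    for rule in rules:
        if rule.identity != identity:
            continue
        root = PurePosixPath(rule.path)
        # A rule covers its own folder and every path nested beneath it.
        covered = target == root or root in target.parents
        if covered and need in rule.permission:
            return True
    return False


rules = [
    AccessRule("alice@example.org", "/shared/projectA", "rw"),
    AccessRule("bob@example.org", "/shared/projectA/public", "r"),
]
print(can_access(rules, "alice@example.org", "/shared/projectA/raw/run1.csv", "rw"))
print(can_access(rules, "bob@example.org", "/shared/projectA/raw/run1.csv"))
```

Comparing path components rather than raw string prefixes keeps a rule on `/shared/projectA` from accidentally covering a sibling such as `/shared/projectAB`.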
globus.org
Science services
operated by UChicago for researchers worldwide
Monitor transfer
Monitor activities
Manage data
Automate and outsource with
REST APIs and Python SDK
Automate and outsource with
REST APIs and Python SDK
UK
NIST
NSF
NSF
NSF
DOE
NSF
Canada
Automate and outsource:
Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
Data Publication
Indexing
materialsdatafacility.org
2 petabytes
100 Gbps
Globus APIs
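The five publication steps on this slide can be sketched as a single function. This is a local, standard-library stand-in written under stated assumptions: the `publish` function, the record fields, and the in-memory word index are hypothetical, files are copied rather than moved, and a random UUID stands in for a persistent identifier that a real service would mint.

```python
import hashlib
import shutil
import tempfile
import uuid
from pathlib import Path


def publish(source, archive_dir, metadata, index):
    """Run the five publication steps for one file and return its record."""
    source, archive_dir = Path(source), Path(archive_dir)

    # 1. Move to a permanent location (copied here so the original is kept).
    archive_dir.mkdir(parents=True, exist_ok=True)
    stored = Path(shutil.copy2(source, archive_dir / source.name))

    # 2. Compute and record a checksum of the stored copy.
    checksum = hashlib.sha256(stored.read_bytes()).hexdigest()

    # 3. Obtain and record metadata (supplied by the caller, plus file facts).
    record = {
        **metadata,
        "path": str(stored),
        "size_bytes": stored.stat().st_size,
        "sha256": checksum,
    }

    # 4. Assign an identifier (a UUID as a placeholder for a persistent one).
    record["identifier"] = f"urn:uuid:{uuid.uuid4()}"

    # 5. Index for discovery: map each lower-cased title word to identifiers.
    for word in record.get("title", "").lower().split():
        index.setdefault(word, set()).add(record["identifier"])
    return record


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"temperature,strain\n300,0.01\n")
    index = {}
    record = publish(sample, Path(tmp) / "archive", {"title": "Steel strain test"}, index)
    print(record["identifier"], record["sha256"][:12])
```

Outsourcing means a researcher would make one call like this against a hosted service instead of scripting each step by hand for every dataset.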
Automate and outsource:
Publication and discovery
Programmatic access (Python, Jupyter)
Web browse and search
Data Publication
Indexing
materialsdatafacility.org
2 petabytes
100 Gbps
Globus APIs
Example: NCAR’s Research Data Archive
Globus used for
• Single sign-on via
streamlined account
provisioning
• Data sharing
• Data downloads
Beyond transfer
(Experimental)
Cloud platforms have transformed how software is
developed and delivered
We can do the same for science
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted
implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission, and so
reduce costs, increase quality, and promote interoperability
We have identified some needed capabilities
• Auth: Manage identities, authentication, authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
Established: 12,000 endpoints, 100,000+ users
New: 100s of users
Experimental: 10s of users
globus.org — Ian Foster — foster@anl.gov

Editor's Notes

  • #3 Genome data volumes increased 10,000x over the last six years, much faster than Moore’s law
  • #4 For many researchers, projects, and institutions, large data volumes are not an opportunity but a fundamental challenge to their competitiveness as researchers. How can they keep up?