A presentation given to the Federal MAGIC group. Similar to the GlobusWorld 2014 keynote, with publication slides added.

  • Review what the Globus team has done over the past year. Announce an exciting new capability.
  • Joel Brownstein is the data archivist of the Sloan Digital Sky Survey-IV. He transfers daily telescope observations to the University of Utah, where they have a large cluster to run their various data reduction pipelines. Using the Globus command-line interface within their Python API, Joel has moved more than 70 TB of data so far.
  • Ann develops numerical simulations of severe storms using the Weather Research and Forecasting (WRF) model. Uses several HPC facilities throughout the country. Moved more than 100 TB of data using Globus, 50 TB last January alone! Moves data between various XSEDE resources, NCSA's mass storage system, and PSC's data archiver.
  • Collects tissue samples from young patients and their families and then extracts, sequences, and analyzes the genetic material to understand the underlying cause of disease. Uses Globus to move NGS data to and from public clouds where he runs analysis pipelines. More on Bill’s work later on in this talk (under Globus Genomics).
  • Can use standard tools such as apt and yum to deploy. Uses a configuration file. Allows incremental config changes. Supports multiple I/O nodes, an ID node (MyProxy), and a web node (OAuth).
  • Allows site administrators to monitor traffic to/from their site. Ultimately will allow for control.
  • Geoffrey Moore
  • Highlight CI Connect. Highlight XSEDE’s planned adoption of user, group and profile management.
  • Highlight CI Connect; coming up in Rob Gardner’s talk. Highlight XSEDE’s planned adoption of user, group and profile management.
  • Competitive TCO. Alternatives are campus computing cores and commercial sequence analysis services.
  • Collection is a set of Datasets. Dataset is data + metadata. Collection is within a Community. Policies on a Collection: metadata, access control, curation workflow, license, storage.
  • Demo scenario: A scientist, referred to throughout as “the Scientist” and associated with the user Blaiszik, has just published a paper associated with his research on nanoscale materials. He now wants to go ahead and publish the data associated with this publication. Using the Globus publication system, he is able to select the Argonne community, and the Center for Nanoscale Materials (CNM) collection. He selects to publish his data. He describes the submission with both publication (Dublin core) and scientific metadata. The CNM collection has been preconfigured with its own storage provided at Argonne. As part of this submission, a unique endpoint is created for "The Scientist"; the endpoint is created so that only "The Scientist" can write to it. "The Scientist" assembles his dataset on this endpoint by transferring files from one or more locations. He can assemble this dataset over a long period of time and can return to the submission workflow when he is happy with the submission. The CNM collection has also been preconfigured with a workflow requiring that an Argonne curator must approve the submission. A curator, referred to throughout as “the Curator” and associated with the user Chard, is able to view and edit the metadata and files of the dataset. Once approved, the submission is published in the CNM collection with a DOI. Other users (with permission to view the collection) can then discover published datasets by their DOI or by using the Globus discovery interface to find datasets by their metadata. These users can choose to browse published datasets and download datasets to other resources (including local resources).
  • Users can log in using any of their linked Globus identities, e.g., campus credentials (via InCommon), Google account, XSEDE account, ...
  • The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research. Note: "The Scientist" can only see collections he is allowed to publish to.
  • "The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined. Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication. Note: "The Scientist" has missed an ORCID for one of his co-authors.
  • Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11). This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist". The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and he can arrange these files in any hierarchy. At the completion of the submission the permissions on the endpoint will be changed such that the dataset is immutable. "The Scientist" will be given read access to the dataset; collection curators will also be given read access to the data so that they can view the contents.
  • Having verified the submission, "The Scientist" must grant the submission license. This license is again configured by the collection (i.e. each collection can customize their individual licenses), and allows the submitting user to grant rights to the collection (CNM) and the Globus system to manage and disseminate the dataset based on the agreed upon policies.
  • The Argonne CNM collection has defined a workflow that requires a curator to view and approve all submissions. The curation workflow enables the curator to view the submitted files and to edit the submitted metadata.
  • At this point, the dataset is now published in the collection with a unique DOI (handle in this case) for other researchers to reference this published dataset. Access to the dataset (both metadata and files) is changed to reflect the policies of the collection. Access may be restricted to particular users, or groups of users, or it may be made public for any user to access.
  • “The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags. Each of these fields can be used to search for a particular dataset.
  • Knowing that other collections may well have datasets of interest, “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from 2 collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
  • Going further, “The Researcher” can use different queries such as key-value and ranges. In this case, “The Researcher” searches for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo with an associated key-value pair of energy-density:2000 that fits the range query criteria.
  • Having found the desired published dataset, “The Researcher” can navigate to the summary page.
  • The summary page shows a summary of the dataset and the list of files. “The Researcher” can choose to download individual files, browse the dataset using Globus, or download the entire dataset. Ability to view the dataset and download files is governed by the access control on the collection and permissions associated with “The Researcher”.
  • Finally, “The Researcher” can view the downloaded dataset on their desktop PC.
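The publication flow described in the demo notes above (assemble, submit, curate, publish, then search by keyword and key-value range) can be sketched roughly as follows. This is an illustrative Python model only, not the Globus publication API: the class names, method names, dataset title, and placeholder identifier format are all assumptions made for the example. Only the search terms ("microcapsules", energy density > 1500) and the energy-density value of 2000 come from the demo.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

MetadataValue = Union[str, int, float]


@dataclass
class Dataset:
    """A dataset is data (file names here) plus metadata."""
    title: str
    metadata: Dict[str, MetadataValue] = field(default_factory=dict)
    files: List[str] = field(default_factory=list)
    state: str = "assembling"          # assembling -> submitted -> published
    identifier: Optional[str] = None   # stands in for a DOI or handle


@dataclass
class Collection:
    """A collection lives in a community and holds datasets under its policies."""
    name: str
    community: str
    requires_curation: bool = True     # hypothetical policy flag
    datasets: List[Dataset] = field(default_factory=list)

    def submit(self, dataset: Dataset) -> None:
        """The submitter finishes assembling; the dataset awaits curation."""
        if dataset.state != "assembling":
            raise ValueError("only a dataset being assembled can be submitted")
        dataset.state = "submitted"
        self.datasets.append(dataset)
        if not self.requires_curation:
            self._publish(dataset)

    def approve(self, dataset: Dataset) -> None:
        """A curator approves a submitted dataset, which publishes it."""
        if dataset.state != "submitted":
            raise ValueError("only a submitted dataset can be approved")
        self._publish(dataset)

    def _publish(self, dataset: Dataset) -> None:
        published = sum(1 for d in self.datasets if d.state == "published")
        dataset.state = "published"
        # Placeholder identifier; a real system would mint a DOI or handle.
        dataset.identifier = f"placeholder-id-{published + 1}"

    def search(self, keyword: Optional[str] = None, key: Optional[str] = None,
               greater_than: Optional[float] = None) -> List[Dataset]:
        """Return published datasets matching a title keyword and/or a range query."""
        hits = []
        for d in self.datasets:
            if d.state != "published":
                continue
            if keyword is not None and keyword.lower() not in d.title.lower():
                continue
            if key is not None:
                value = d.metadata.get(key)
                if not isinstance(value, (int, float)):
                    continue
                if greater_than is not None and not value > greater_than:
                    continue
            hits.append(d)
        return hits


if __name__ == "__main__":
    cnm = Collection(name="Center for Nanoscale Materials", community="Argonne")
    ds = Dataset(title="Microcapsules example dataset",  # hypothetical title
                 metadata={"energy-density": 2000})
    cnm.submit(ds)    # the Scientist submits
    cnm.approve(ds)   # the Curator approves
    for hit in cnm.search(keyword="microcapsules", key="energy-density",
                          greater_than=1500):
        print(hit.title, hit.identifier)
```

In this sketch a dataset becomes searchable only after the curator's approval, which mirrors the curation policy the CNM collection is described as using; access control and file immutability are left out to keep the example short.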
    1. 1. Delivering a Campus Research Data Service with Globus MAGIC Meeting Ian Foster May 7, 2014
    2. 2. Give me your data, your terabytes, Your huddled files yearning to breathe free … Building campus research data services
    3. 3. “It’s deja vu all over again.” Yogi Berra Globus Toolkit Globus Online Globus Globus
    4. 4. What is Globus (today)? Big data transfer and sharing… …simply, securely, and fast… …directly from your own storage systems
    5. 5. Reliable, secure, high-performance file transfer and synchronization • “Fire-and-forget” transfers • Automatic fault recovery • Seamless security integration • Powerful GUI and APIs • Steps (Data Source to Data Destination): (1) User initiates transfer request (2) Globus moves and syncs files (3) Globus notifies user
    6. 6. Simple, secure sharing off existing storage systems • Easily share large data with any user or group • No cloud storage required • Steps (from the Data Source): (1) User A selects file(s) to share, selects user or group, and sets permissions (2) Globus tracks shared files; no need to move files to cloud storage! (3) User B logs in to Globus and accesses shared file
    7. 7. 15,000 registered users
    8. 8. 8,000 active endpoints (in the past year)
    9. 9. 3 billion files transferred
    10. 10. Globus is enabling… Study of the structure and evolution of galaxies, the nature of dark energy, and cosmological history of the universe Sloan Digital Sky Survey Source: University of Utah Joel Brownstein University of Utah
    11. 11. Globus is enabling… Development of numerical simulations of severe storms for improved responsiveness to weather events Weather Research and Forecasting Model Source: UCAR Ann Syrowski University of Illinois
    12. 12. Globus is enabling… Pediatric brain research by enhancing analysis of genetic material in pursuit of the underlying cause Communication impairment by genetic variants Source: Wikimedia Commons William Dobyns U. Washington
    13. 13. Globus increasingly used to build campus-wide data service Source: University of Nebraska Holland Computing Center Enable campus computing facilities to better utilize high performance network infrastructure
    14. 14. Typical deployment: Science DMZ + Globus [Network diagram: UNL Science DMZ connecting the Lincoln and Omaha cores, the East and West Campus border routers, CMS data transfer nodes, HPC clusters, and perfSONAR and Bro IDS monitoring, with Internet2 connectivity via GPN and CIC] Source: University of Nebraska Holland Computing Center
    15. 15. Instruments are increasingly driving the need for broader data service deployments Next Gen Sequencer Light Sheet Microscope MRI Advanced Light Source
    16. 16. Globus enables users to manage data as research requirements scale up or down Research Computing HPC Cluster Lab Server Campus Home Filesystem Desktop Workstation Personal Laptop XSEDE Resource Public Cloud
    17. 17. Globus product development highlights in 2013-14
    18. 18. Sharing generally available
    19. 19. Much improved Web UI
    20. 20. Globus Connect Server • Native RPM and Debian packaging • Improved configuration management • Multi-server setup • OAuth support
    21. 21. Management console: “Flight Control”
    22. 22. Amazon S3 Endpoints
    23. 23. 85 U.S. campuses
    24. 24. We are a non-profit, delivering a production-grade service to the non-profit research community
    25. 25. Our challenge: Sustainability We are a non-profit, delivering a production-grade service to the non-profit research community
    26. 26. Globus Provider Subscriptions • Managed Endpoints – Priority support – Management console – Usage reports – Mass Storage System optimization – Host shared endpoints – Integration support • Plus Subscriptions – Create and manage shared endpoints – Personal transfers • Branded Web Site • Alternate Identity Provider (InCommon is standard)
    27. 27. NET+ Globus • Internet2 members get discounted Globus Provider subscriptions • Completing “Service Validation” phase – Sponsors: Cornell, U.Michigan, Yale, U.Missouri, and U.Chicago • Available to “Early Adopters” soon
    28. 28. Bridging the gap to sustainability • $500,000 from Sloan Foundation • Recognition of what it takes to “cross the chasm” • Funds non-R&D activities – User Support – Operations – Marketing
    29. 29. Globus Behind the Scenes Identity, Group, Profile Management Services … Sharing Service Transfer Service Globus Toolkit GlobusConnect
    30. 30. Globus Platform-as-a-Service Identity, Group, Profile Management Services … Sharing Service Transfer Service Globus Toolkit GlobusAPIs GlobusConnect
    31. 31. Globus Genomics Flexible, scalable, affordable genomics analysis for all biologists
    32. 32. + Data management PaaS Next-gen sequence analysis SaaS + Scalable IaaS
    33. 33. Globus Genomics on AWS
    34. 34. Exome: $3 – $20 Whole Genome: $20 – $50 RNA-Seq: <$5 Alternatives are at 10-20x
    35. 35. Dobyns Lab Exome analysis 20x speed-up Next: 50x
    36. 36. Cox Lab Consensus variant calling 134 samples; 4 days <0.01% Mendel error rate Next: 13,000 samples
    37. 37. Campus Data Service User Stories • “I need a good place to store / backup / archive my (big) research data, at a reasonable price.” • “I need to easily, quickly, and reliably move or mirror portions of my data to other places.” • “I need a way to easily and securely share my data with my colleagues at other institutions.”
    38. 38. Campus Data Service User Stories • “I need a good place to store / backup / archive my (big) research data, at a reasonable price.” • “I need to easily, quickly, and reliably move or mirror portions of my data to other places.” • “I need a way to easily and securely share my data with my colleagues at other institutions.” • “I want to publish my data.” • “I want to discover published data.”
    39. 39. An all-too familiar tale …
    40. 40. Data is: Identified Described Curated Verifiable Accessible Preserved What does it mean to publish?
    41. 41. I can: Search Browse Access the data What does it mean to discover?
    42. 42. Globus data publication services Announcing…
    43. 43. Teeing Up a Few Terms … [Diagram: a Community contains Collections; each Collection holds Datasets (data + metadata) and carries Policies covering Metadata, Access Control, Curation Workflow, License, and Storage]
    44. 44. Demo Scenario [Diagram: Scientist, Shared Endpoint on Argonne Storage, Argonne Curator; institutions shown: Univ. of Chicago, Argonne, IIT, UIUC] 1. Publish Data 2. Describe Submission 3. Assemble Dataset (Transfer Data) 4. Curate Dataset 5. Search 6. Download
    45. 45. Login with Campus or Globus Identity
    46. 46. Start a New Submission
    47. 47. Describe Submission Dublin Core + Scientific Metadata
    48. 48. Assemble Dataset and Transfer to Submission Endpoint
    49. 49. Grant Submission License
    50. 50. Recap: Globus Data Publication • SaaS for publishing large research data • Bring your own storage • Extensible metadata • Publication and curation workflows • Public and restricted collections • Rich discovery model
    51. 51. Curation Workflow
    52. 52. Submission is now Published with DOI
    53. 53. Search Published Datasets by Collection
    54. 54. Search Published Datasets across Collections
    55. 55. Discovering a Published Dataset
    56. 56. Find the Published Dataset
    57. 57. Download the Published Dataset
    58. 58. Locally Downloaded Dataset
    59. 59. Looking for 3-5 early adopters Summer: Use and provide feedback on alpha Fall: Test beta on your campus Winter: Celebrate General Availability Spring: Tell us about it at GlobusWorld 2015!
    60. 60. Thank you to our sponsors! U.S. Department of Energy