Delivering a Campus Research Data
Service with Globus
GlobusWorld 2014
Keynote
Give me your data,
your terabytes,
Your huddled files
yearning to
breathe free …
Building campus research
data services
GlobusWorld 2014
“It’s deja vu all over again.”
Yogi Berra
Globus Toolkit
Globus Online
Globus
Globus
Higgs discovery “only possible because
of the extraordinary achievements of
… grid computing”
Rolf Heuer, CERN DG
10s of PB, 100s of institutions,1000s of
scientists, 100Ks of CPUs, Bs of tasks
What is Globus (today)?
Big data transfer
and sharing…
…with Dropbox-like
simplicity…
…directly from your own
storage systems
Reliable, secure, high-performance
file transfer and synchronization
• “Fire-and-forget”
transfers
• Automatic fault
recovery
• Seamless security
integration
• Powerful GUI
and APIs
Data
Source
Data
Destination
User initiates
transfer
request
1
Globus
moves and
syncs files
2
Globus
notifies user
3
Simple, secure sharing off existing
storage systems
Data
Source
User A selects
file(s) to share,
selects user or
group, and sets
permissions
1
Globus tracks shared
files; no need to
move files to cloud
storage!
2
User B logs in
to Globus and
accesses
shared file
3
• Easily share large data
with any user or group
• No cloud storage
required
15,000
registered users
8,000
active endpoints
(in the past year)
3 billion
files transferred
Globus is enabling…
Study of the structure
and evolution of
galaxies, the nature
of dark energy, and
cosmological history
of the universe
Sloan Digital Sky Survey
Source: University of Utah
Joel Brownstein
University of Utah
Globus is enabling…
Development
of numerical
simulations of
severe storms
for improved
responsiveness
to weather
events
Weather Research and Forecasting Model
Source: UCAR
Ann Syrowski
University of Illinois
Globus is enabling…
Pediatric brain
research by
enhancing
analysis of
genetic material
in pursuit of the
underlying
cause
Communication impairment by genetic variants
Source: Wikimedia Commons
William Dobyns
U. Washington
“I need a good place to store / backup
/ archive my (big) research data, at a
reasonable price.”
Public Cloud ArchiveMass StoreCampus Store
“I need to easily, quickly, & reliably move or
mirror portions of my data to other places.”
Research Computing HPC Cluster
Lab Server
Campus Home Filesystem
Desktop Workstation
Personal Laptop
XSEDE Resource
Public Cloud
“I need to easily and securely share
my data with my colleagues at other
institutions.”
“I need to get data from a scientific
instrument to my analysis server.”
Next Gen
Sequencer
Light Sheet Microscope
MRI Advanced
Light Source
Product highlights
since GlobusWorld 2013
Sharing generally available
Much improved Web UI
Globus Connect Server
• Native RPM and Debian packaging
• Improved configuration management
• Multi-server setup
• OAuth support
Management console: “Flight Control”
Amazon S3 Endpoints
Demonstration
85
U.S. campuses
Best practice (fasterdata.es.net)
Create Data Transfer Nodes on
existing (or new) storage with
Globus Connect Server
…deploy in a Science DMZ…
…use Globus as the interface
We are a non-profit, delivering a
production-grade service to the
non-profit research community
Our challenge:
Sustainability
We are a non-profit, delivering a
production-grade service to the
non-profit research community
Globus Provider Subscriptions
• Managed Endpoints
– Priority support
– Management console
– Usage reports
– Mass Storage System optimization
– Host shared endpoints
– Integration support
• Plus Subscriptions
– Create and manage shared endpoints
– Personal transfers
• Branded Web Site
• Alternate Identity Provider (InCommon is standard)
https://www.globus.org/provider-plans
NET+ Globus
• Internet2 members get discounted
Globus Provider subscriptions
• Completing “Service Validation” phase
– Sponsors: Cornell, U.Michigan, Yale,
U.Missouri, and U.Chicago
• Available to “Early Adopters” soon
Bridging the gap to sustainability
• $500,000 from Sloan Foundation
• Recognition of what it takes to
“cross the chasm”
• Funds non-R&D
activities
– User Support
– Operations
– Marketing
Globus Under the Covers
Identity, Group, Profile
Management Services
…
Sharing Service
Transfer Service
Globus Toolkit
GlobusConnect
Globus Platform-as-a-Service
Identity, Group, Profile
Management Services
…
Sharing Service
Transfer Service
Globus Toolkit
GlobusAPIs
GlobusConnect
globus
genomics
Flexible, scalable,
affordable
genomics analysis
for all biologists
+
Data management
PaaS
Next-gen sequence
analysis SaaS
+
Scalable IaaS
Globus Genomics on AWS
Exome: $3 – $20
Whole Genome: $20 – $50
RNA-Seq: <$5
Alternatives are at 10-20x
Dobyns Lab
Exome analysis
20x speed-up
Next: 50x
Cox Lab
Consensus variant calling
134 samples; 4 days
<0.01% Mendel error rate
Next: 13,000 samples
Campus Data Service User Stories
• “I need a good place to store / backup / archive
my (big) research data, at a reasonable price.”
• “I need to easily, quickly, and reliably move or
mirror portions of my data to other places.”
• “I need a way to easily and securely share my
data with my colleagues at other institutions.”
Campus Data Service User Stories
• “I need a good place to store / backup / archive
my (big) research data, at a reasonable price.”
• “I need to easily, quickly, and reliably move or
mirror portions of my data to other places.”
• “I need a way to easily and securely share my
data with my colleagues at other institutions.”
• “I want to publish my data.”
• “I want to discover published data.”
An all-too familiar tale …
Data is:
Identified
Described
Curated
Verifiable
Accessible
Preserved
What does it mean to publish?
I can:
Search
Browse
Access
the data
What does it mean to discover?
Globus
data
publication
services
Announcing…
Metadata
Access Control
License
Storage
Curation
Workflow
Policies
Collection
Teeing Up a Few Terms …
Metadata
DataMetadata
Data
Metadata
Data
Dataset
Dataset
Dataset
Community
Demonstration
Recap: Globus Data Publication
• SaaS for publishing large research data
• Bring your own storage
• Extensible metadata
• Publication and curation workflows
• Public and restricted collections
• Rich discovery model
Looking for 3-5 early adopters
Summer:
Use and
provide
feedback
on alpha
Fall:
Test beta on
your campus
Winter:
Celebrate
General
Availability
Spring:
Tell us about it
at GlobusWorld
2015!
To provide affordable,
advanced capabilities for
all researchers, delivering
sustainable services that
aggregate and federate
existing resources
Our vision for 21st century
research data management
Thank you to our sponsors!
U . S . D E P A R T M E N T O F
ENERGY

Delivering a Campus Research Data Service with Globus

  • 1.
    Delivering a CampusResearch Data Service with Globus GlobusWorld 2014 Keynote
  • 2.
    Give me yourdata, your terabytes, Your huddled files yearning to breathe free … Building campus research data services GlobusWorld 2014
  • 3.
    “It’s deja vuall over again.” Yogi Berra Globus Toolkit Globus Online Globus Globus
  • 4.
    Higgs discovery “onlypossible because of the extraordinary achievements of … grid computing” Rolf Heuer, CERN DG 10s of PB, 100s of institutions,1000s of scientists, 100Ks of CPUs, Bs of tasks
  • 5.
    What is Globus(today)? Big data transfer and sharing… …with Dropbox-like simplicity… …directly from your own storage systems
  • 6.
    Reliable, secure, high-performance filetransfer and synchronization • “Fire-and-forget” transfers • Automatic fault recovery • Seamless security integration • Powerful GUI and APIs Data Source Data Destination User initiates transfer request 1 Globus moves and syncs files 2 Globus notifies user 3
  • 7.
    Simple, secure sharingoff existing storage systems Data Source User A selects file(s) to share, selects user or group, and sets permissions 1 Globus tracks shared files; no need to move files to cloud storage! 2 User B logs in to Globus and accesses shared file 3 • Easily share large data with any user or group • No cloud storage required
  • 8.
  • 9.
  • 10.
  • 12.
    Globus is enabling… Studyof the structure and evolution of galaxies, the nature of dark energy, and cosmological history of the universe Sloan Digital Sky Survey Source: University of Utah Joel Brownstein University of Utah
  • 13.
    Globus is enabling… Development ofnumerical simulations of severe storms for improved responsiveness to weather events Weather Research and Forecasting Model Source: UCAR Ann Syrowski University of Illinois
  • 14.
    Globus is enabling… Pediatricbrain research by enhancing analysis of genetic material in pursuit of the underlying cause Communication impairment by genetic variants Source: Wikimedia Commons William Dobyns U. Washington
  • 15.
    “I need agood place to store / backup / archive my (big) research data, at a reasonable price.” Public Cloud ArchiveMass StoreCampus Store
  • 16.
    “I need toeasily, quickly, & reliably move or mirror portions of my data to other places.” Research Computing HPC Cluster Lab Server Campus Home Filesystem Desktop Workstation Personal Laptop XSEDE Resource Public Cloud
  • 17.
    “I need toeasily and securely share my data with my colleagues at other institutions.”
  • 18.
    “I need toget data from a scientific instrument to my analysis server.” Next Gen Sequencer Light Sheet Microscope MRI Advanced Light Source
  • 19.
  • 20.
  • 21.
  • 22.
    Globus Connect Server •Native RPM and Debian packaging • Improved configuration management • Multi-server setup • OAuth support
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
    Best practice (fasterdata.es.net) CreateData Transfer Nodes on existing (or new) storage with Globus Connect Server …deploy in a Science DMZ… …use Globus as the interface
  • 28.
    We are anon-profit, delivering a production-grade service to the non-profit research community
  • 29.
    Our challenge: Sustainability We area non-profit, delivering a production-grade service to the non-profit research community
  • 30.
    Globus Provider Subscriptions •Managed Endpoints – Priority support – Management console – Usage reports – Mass Storage System optimization – Host shared endpoints – Integration support • Plus Subscriptions – Create and manage shared endpoints – Personal transfers • Branded Web Site • Alternate Identity Provider (InCommon is standard) https://www.globus.org/provider-plans
  • 31.
    NET+ Globus • Internet2members get discounted Globus Provider subscriptions • Completing “Service Validation” phase – Sponsors: Cornell, U.Michigan, Yale, U.Missouri, and U.Chicago • Available to “Early Adopters” soon
  • 32.
    Bridging the gapto sustainability • $500,000 from Sloan Foundation • Recognition of what it takes to “cross the chasm” • Funds non-R&D activities – User Support – Operations – Marketing
  • 33.
    Globus Under theCovers Identity, Group, Profile Management Services … Sharing Service Transfer Service Globus Toolkit GlobusConnect
  • 34.
    Globus Platform-as-a-Service Identity, Group,Profile Management Services … Sharing Service Transfer Service Globus Toolkit GlobusAPIs GlobusConnect
  • 35.
  • 36.
  • 37.
  • 38.
    Exome: $3 –$20 Whole Genome: $20 – $50 RNA-Seq: <$5 Alternatives are at 10-20x
  • 39.
  • 40.
    Cox Lab Consensus variantcalling 134 samples; 4 days <0.01% Mendel error rate Next: 13,000 samples
  • 41.
    Campus Data ServiceUser Stories • “I need a good place to store / backup / archive my (big) research data, at a reasonable price.” • “I need to easily, quickly, and reliably move or mirror portions of my data to other places.” • “I need a way to easily and securely share my data with my colleagues at other institutions.”
  • 42.
    Campus Data ServiceUser Stories • “I need a good place to store / backup / archive my (big) research data, at a reasonable price.” • “I need to easily, quickly, and reliably move or mirror portions of my data to other places.” • “I need a way to easily and securely share my data with my colleagues at other institutions.” • “I want to publish my data.” • “I want to discover published data.”
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
    Metadata Access Control License Storage Curation Workflow Policies Collection Teeing Upa Few Terms … Metadata DataMetadata Data Metadata Data Dataset Dataset Dataset Community
  • 48.
  • 49.
    Recap: Globus DataPublication • SaaS for publishing large research data • Bring your own storage • Extensible metadata • Publication and curation workflows • Public and restricted collections • Rich discovery model
  • 50.
    Looking for 3-5early adopters Summer: Use and provide feedback on alpha Fall: Test beta on your campus Winter: Celebrate General Availability Spring: Tell us about it at GlobusWorld 2015!
  • 51.
    To provide affordable, advancedcapabilities for all researchers, delivering sustainable services that aggregate and federate existing resources Our vision for 21st century research data management
  • 52.
    Thank you toour sponsors! U . S . D E P A R T M E N T O F ENERGY

Editor's Notes

  • #3 Review what the Globus team has done over the past year.Announce an exciting new capability.
  • #5 Peter Higgs
  • #13 Joel Brownstein is the data archivist of the Sloan Digital Sky Survey-IVTransfers daily telescope observations to the University of UtahThere they have a large cluster to run their various data reduction pipelinesUsing the Globus command-line interface within their Python APIJoel has moved more than 70 TB of data so far
  • #14 Ann develops numerical simulations of severe storms using the Weather Research and Forecasting (WRF) modelUses several HPC facilities throughout the countryMoved more than 100 TB of data using Globus— 50 TB last January alone!Moves data between various XSEDE resources, NCSA&apos;s mass storage system, and PSC&apos;s data archiver
  • #15 Collects tissue samples from young patients and their families and then extracts, sequences, and analyzesthe genetic material to understand underlying cause of disease.Uses Globus to move NGS data to and from public clouds where he runs analysis pipelines.More on Bill’s work later on in this talk (under Globus Genomics)
  • #16 A number of common patterns, each supported by Globus—plus services deployed on various campus resources. Explains why so many endpoints!
  • #23 Can use standard tools such as apt and yum to deployUses configuration fileAllows incremental config changesMultiple I/O nodesID node (MyProxy)Web node (OAuth)
  • #24 Alllows site administrators to monitor traffic to/from their site. Ultimately will allow for control.
  • #26 Steve5 minutes
  • #28 Science DMZ: Increasingly means a dedicated research network, separate from administrative network, without firewalls etc.
  • #33 Geoffrey Moore
  • #34 Highlight CI Connect; coming up in Rob Gardner’s talkHighlight XSEDE’s planned adoption of user, group and profile management
  • #35 Highlight CI Connect; coming up in Rob Gardner’s talkHighlight XSEDE’s planned adoption of user, group and profile management
  • #39 Competitive TCOAlternatives are campus computing cores and commercial sequence analysis services
  • #48 Collection is a set of DatasetsDataset is data + metadataCollection is within a CommunityPolicies on a CollectionMetadataAccess control Curation workflowLicenseStorage