Overview
• Data management challenges
• Managing instrument data with Globus
• Use cases and lessons learned
• Notebook demo
Data management challenges
• Event Horizon Telescope
– 12 telescopes
– Generate 900TB per 5 day run
– Data written to ~1000 HDDs
– Transported to MIT & Max Planck via airplane
– Aggregated and analyzed
• Global resources, long timescales
• Too much data for manual processing
• Data loss due to HDD failure
Research data management challenges
• Data acquired at various locations/times
• Analyses executed on distributed resources
• Catalogs of descriptive metadata and provenance
• Dynamic collaborations around data and analysis
[Figure: data flows from instruments to a raw data store and catalog, spanning DOE lab, campus, community archive, and NIH resources]
Exacerbated by large scale science
• Best practices overlooked, useful data forgotten, errors propagate
• Researchers allocated short periods of instrument time
• Inefficiencies -> less science
• Errors -> long delays, missed opportunity …forever!
Automation and Globus
• Globus provides a rich data management ecosystem for both admins and researchers
• Compose multiple services into reliable, secure data management pipelines
• Execute on behalf of users
• Create data-aware automations that respond to data events
Globus Services
• Transfer: Move data, set ACLs, create shares (see the sketch below)
• Search: Find data, catalog metadata
• Identifiers: Mint persistent IDs for datasets
• Auth: Glue that ties everything together
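To make the Transfer bullet concrete, here is a minimal sketch that grants a collaborator read access to one directory on a guest collection using the Globus Python SDK. The collection ID, identity ID, and access token are placeholders you would supply for your own deployment.

import globus_sdk

# Placeholders: substitute your own guest collection, identity, and token
COLLECTION_ID = "guest-collection-uuid"
COLLABORATOR_ID = "collaborator-identity-uuid"
TRANSFER_TOKEN = "transfer-scoped-access-token"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Grant read-only access to one directory on the shared collection
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": COLLABORATOR_ID,
    "path": "/shared/run042/",
    "permissions": "r",
}
result = tc.add_endpoint_acl_rule(COLLECTION_ID, rule)
print("Created ACL rule:", result["access_id"])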
Globus Auth
• Programmatic and secure access to both Globus services and any third-party services that support it
• Grant permission to apps to act on your behalf
– Dependent tokens
• Refresh tokens enable one-time authentication that can be embedded in long-running pipelines
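A minimal sketch of the refresh-token pattern with the Globus Python SDK, assuming a native-app client registered with Globus Auth; CLIENT_ID is a placeholder. The user logs in once, and the refresh token then lets a long-running pipeline obtain fresh access tokens without further interaction.

import globus_sdk

CLIENT_ID = "your-native-app-client-id"  # placeholder
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)

# One-time interactive login that also requests refresh tokens
auth_client.oauth2_start_flow(
    requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all",
    refresh_tokens=True,
)
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]

# RefreshTokenAuthorizer silently renews access tokens inside the pipeline
authorizer = globus_sdk.RefreshTokenAuthorizer(
    transfer_tokens["refresh_token"], auth_client
)
tc = globus_sdk.TransferClient(authorizer=authorizer)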
Automation via Globus
Glue services together
• Globus SDK (docs.globus.org)
• Scripting with the Globus CLI
– globus task wait
• Automate (see the sketch below)
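A minimal scripted-automation sketch with the Globus SDK: submit a transfer and block until it completes, the SDK analogue of `globus task wait` in the CLI. The endpoint IDs and token are placeholders.

import globus_sdk

SOURCE = "source-endpoint-uuid"       # placeholder
DEST = "destination-endpoint-uuid"    # placeholder

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("transfer-scoped-access-token")
)

tdata = globus_sdk.TransferData(tc, SOURCE, DEST, label="beamline-run-042")
tdata.add_item("/data/run042/", "/ingest/run042/", recursive=True)
task_id = tc.submit_transfer(tdata)["task_id"]

# Poll until the transfer reaches a terminal state (cf. `globus task wait`)
while not tc.task_wait(task_id, timeout=60, polling_interval=15):
    print("still transferring...")
print("final status:", tc.get_task(task_id)["status"])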
Example Use Cases
• Advanced Photon Source
– Connectomics
– Time series spectroscopy
• Scanning Electron Microscope
UChicago Kasthuri Lab: Brain aging and disease
• Construct connectomes—mapping of neuron connections
• Use APS synchrotron to rapidly image brains
– Beam time available once every few months
– ~20GB/minute for large (cm) unsectioned brains
• Generate segmented datasets/visualizations for the community
• Perform semi-standard reconstruction on all data across HPC
resources
Original Approach
• Collect data—20 mins
• Move to a local machine
• Generate previews for a couple of images—5 mins
• Collect more data
• Initiate local reconstruction—1 hour
• Batch process the rest after beamtime
Requirements
• Accommodate many different beamline users with different skill sets
– Automatically apply a “base” reconstruction to data
• Leverage HPC due to computational requirements
• Unobtrusive to the user
Ripple: A Trigger-Action platform for data
• Provides a set of triggers and actions to create rules
• Ripple processes data triggers and reliably executes actions
• Usable by non-experts
• Daisy-chain rules for complex flows
Not a product!
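To make the trigger-action idea concrete, here is an illustrative rule and matcher. This is not Ripple's actual rule format or API, only a hypothetical sketch of "when a file matching this pattern appears, run this action"; every name in it is made up.

from fnmatch import fnmatch

# Hypothetical rule: watch a directory for new HDF5 files and, when one
# appears, launch a Globus transfer to a compute facility (names are placeholders)
rule = {
    "trigger": {"event": "file_created", "path": "/beamline/data/", "pattern": "*.h5"},
    "action": {"type": "globus_transfer", "destination": "alcf-endpoint-uuid"},
}

def matches(rule, event):
    """Return True if a filesystem event satisfies the rule's trigger."""
    trig = rule["trigger"]
    return (
        event["type"] == trig["event"]
        and event["path"].startswith(trig["path"])
        and fnmatch(event["path"], trig["path"] + trig["pattern"])
    )

# Example event represented as a plain dict
print(matches(rule, {"type": "file_created", "path": "/beamline/data/scan_001.h5"}))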
Data-driven automation
• Filesystem-specific tools monitor and report events
– inotify (Linux)
– FSWatch (macOS)
• Capture local data events
– Create, delete, move
Watchdog: github.com/gorakhargosh/watchdog
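A minimal sketch using the Watchdog library linked above: watch a directory and react when new files appear. The watched path and file pattern are placeholders, and the handler just prints instead of launching a transfer.

import time
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

WATCH_DIR = "/beamline/data"  # placeholder

class NewDataHandler(PatternMatchingEventHandler):
    def __init__(self):
        super().__init__(patterns=["*.h5"], ignore_directories=True)

    def on_created(self, event):
        # In a real pipeline this is where a Globus transfer would be triggered
        print("new file detected:", event.src_path)

observer = Observer()
observer.schedule(NewDataHandler(), WATCH_DIR, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()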
New Approach
• Detect data as they are collected
• Automatically move data to ALCF
• Initiate a preview and reconstruction
• Detect the preview and move it back to the APS
• Move results to shared storage
• Catalog data in Search for users to explore
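A minimal sketch of the final cataloging step, assuming a Globus Search index already exists; the index UUID, token, subject URL, and metadata fields are all placeholders.

import globus_sdk

SEARCH_INDEX = "search-index-uuid"  # placeholder
sc = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("search-scoped-access-token")
)

# One GMetaEntry describing a reconstructed dataset
entry = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://collection-uuid/results/run042/",
        "visible_to": ["public"],
        "content": {
            "sample": "sample-042",
            "instrument": "APS beamline",
            "processing": "base reconstruction + preview",
        },
    },
}
sc.ingest(SEARCH_INDEX, entry)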
Lessons Learned
• Automate data capture where possible - far easier than convincing people to run things
• Transparency is critical - operators need the ability to debug at 3am
• “Manual” automation is better than no automation
SEM Focus
1. Slice the image into 6 random subsections
2. Apply Laplacian blob detection
3. Use NN to classify as in or out of focus
Credit: Aarthi Koripelly
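A minimal sketch of the first two steps (random crops plus Laplacian-of-Gaussian blob detection via scikit-image); the neural-network classifier is omitted, and the file name, crop size, and detection thresholds are placeholders.

import numpy as np
from skimage import io
from skimage.feature import blob_log

# Assumes the SEM tile is larger than the 256x256 crop size used here
image = io.imread("sem_tile.png", as_gray=True)
h, w = image.shape

rng = np.random.default_rng()
crops = [
    image[y:y + 256, x:x + 256]
    for y, x in ((rng.integers(0, h - 256), rng.integers(0, w - 256)) for _ in range(6))
]

# Blob counts/scales per crop become features for the focus classifier
features = [blob_log(c, max_sigma=10, threshold=0.05) for c in crops]
print("blobs per crop:", [len(f) for f in features])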
DLHub
• Collect, publish, categorize models from many disciplines (materials science, physics, chemistry, genomics, etc.)
• Serve model inference on-demand via API to simplify sharing, consumption, and access
• Enable new science through reuse, real-time model-in-the-loop integration, and synthesis & ensembling of existing models
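A minimal sketch of invoking a published model through the DLHub SDK; the servable name and input payload are placeholders, and the exact run() signature may vary between SDK versions.

from dlhub_sdk.client import DLHubClient

dl = DLHubClient()  # triggers a one-time Globus Auth login

# Invoke a hypothetical published focus-classification servable by name
result = dl.run("username/sem_focus_classifier", inputs=["/path/to/sem_tile.png"])
print(result)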
Processing SEM Data with DLHub
• Detect files placed in a “/process” directory
• Move data to Petrel
• Generate input for DLHub
• Invoke DLHub
• Put results in a Search index for users
• Append to a list in a “/results” folder
Lessons Learned
• Perfect, complex hooks are sometimes unnecessary
• No value if the user can’t easily find and use the result
• Outsource and leverage special-purpose services
– You don’t need to do everything
X-ray Photon Correlation Spectroscopy
• APS Beamline 8-ID
• Generates a large volume of data
– Images every ~10 seconds
• Apply the XPCS-Eigen tool to HDF files containing many images
Current Approach
• Internal workflow engine is started in response to data
• Enormous bash scripts everywhere
• Restricted to local resources
• Any new tools must fit into their dashboard
New Approach
• Plug in a new step to kick off Globus actions
• Fits into existing dashboards
• Easy for them to debug -- stand alone, or in the flow
Lessons Learned
• Everyone loves automation
• Everyone has a “working” solution and doesn’t want to change
• Make results easy to find - no value without results
• HPC timeouts are still a pain