Globus Labs: Forging the Next Frontier
Kyle Chard
chard@uchicago.edu
Globus Labs
Research data management and analysis challenges
• Data acquired at various locations/times
• Analyses executed on distributed resources with different capabilities
– Processing time decreases with distance
• Dynamic collaborations around data and analysis
[Diagram: raw data flowing among a DOE lab, campus, community archive, FPGA, and cloud resources, indexed in a catalog]
Exacerbated by large-scale science
• Best practices overlooked, useful data forgotten, errors propagate
• Researchers allocated short periods of instrument and compute time
• Inefficiencies → less science
• Errors → long delays, missed opportunity … forever!
Making research data reliably, rapidly, and securely accessible, discoverable, and usable
• Automation: encode research pipelines composed of triggers and actions
• funcX: scalable function as a service for science
• Parsl: intuitive parallel programming in Python
• PolyNER: extracting scientific facts from published literature
• DLHub: model publication and inference
• MDF: publication and scraping of materials datasets
• XtractHub: deriving metadata from scientific files
• Cost-aware computing: application profiling, resource prediction, automated provisioning
• Cloud classification: identifying different types of (real) clouds in climate data
Ripple: A trigger-action platform for data
• Monitors events on various file system types
• Includes a set of triggers and actions for creating rules (example rule sketched below)
• Ripple processes data triggers and reliably executes actions
• Usable by non-experts
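To make the trigger-action model concrete, the sketch below shows what a Ripple-style rule might look like, written as a Python dict. The event names, fields, and action type are illustrative assumptions, not Ripple's actual schema.

```python
# Hypothetical Ripple-style rule (illustrative field names, not Ripple's schema):
# when a new HDF5 file lands in a watched directory, trigger a transfer action.
rule = {
    "trigger": {
        "event": "file_created",       # fire on file-creation events
        "path": "/instrument/raw/",    # watched directory
        "pattern": "*.h5",             # only match HDF5 outputs
    },
    "action": {
        "type": "transfer",            # move the file for downstream analysis
        "destination": "campus-cluster:/scratch/incoming/",
    },
}
```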
Automating the research lifecycle
• Simple state machine model (sketched below)
– JSON-based language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
[Diagram: Globus platform services — Auth, Search, Manage, Execute]
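The sketch below illustrates the kind of JSON-based state machine such a flow language describes, written here as a Python dict. The state names, parameters, and action URLs are hypothetical; real flow definitions will differ in detail.

```python
# Minimal two-state flow sketch: transfer data, then run an analysis action.
# The ActionUrl values are placeholders, not real action providers.
flow_definition = {
    "StartAt": "TransferData",
    "States": {
        "TransferData": {
            "Type": "Action",
            "ActionUrl": "https://actions.example.org/transfer",  # hypothetical
            "Parameters": {
                "source.$": "$.input.source",          # read from the flow's input
                "destination.$": "$.input.destination",
            },
            "ResultPath": "$.TransferResult",          # propagate state through the flow
            "Next": "AnalyzeData",
        },
        "AnalyzeData": {
            "Type": "Action",
            "ActionUrl": "https://actions.example.org/analyze",   # hypothetical
            "Parameters": {"file.$": "$.TransferResult.destination"},
            "End": True,
        },
    },
}
```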
Remote execution of scientific workloads
• Compute wherever it makes the most sense:
– Hardware or software availability, data location, analysis time, wait time, etc.
• Remote computing has always been complex and expensive
– Now we have high-speed networks, universal trust fabrics (Globus Auth), and containers
• Many scientific workloads are composed of collections of short-duration functions
– E.g., machine learning inference, real-time analyses, metadata extraction, image reconstruction, sensor stream analysis
funcX: High-Performance Function as a Service for Science
• Endpoints deployed at resources
– Manage provisioning and scheduling of resources and data
– Scale out based on resource needs
• Cloud service routes requests to endpoints (see the usage sketch below)
• Singularity containers run functions securely
• Globus Auth secures communication
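The snippet below sketches how a function is registered and invoked through the funcX SDK. The endpoint UUID is a placeholder, and the method names reflect the SDK at the time of writing, so they may differ across versions.

```python
from funcx.sdk.client import FuncXClient

fxc = FuncXClient()  # authenticates the user via Globus Auth

def double(x):
    return 2 * x

# Register the function once; the service returns a reusable function UUID.
func_uuid = fxc.register_function(double)

# Route the invocation to a chosen endpoint (placeholder UUID below).
endpoint_uuid = "00000000-0000-0000-0000-000000000000"
task_id = fxc.run(21, endpoint_id=endpoint_uuid, function_id=func_uuid)

# Fetch the result once the endpoint has executed the task.
print(fxc.get_result(task_id))  # -> 42
```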
Composition and parallelism in Python
• Software is increasingly assembled rather than written
– High-level languages (e.g., Python) integrate and wrap components from many sources
• Parallel and distributed computing is pervasive
– Increasing data sizes combined with plateauing sequential processing power
– Parallel hardware (e.g., accelerators) and distributed computing systems
parsl-project.org
Parsl: Pervasive Parallel Programming in Python
• Apps define opportunities for parallelism
– Python apps call Python functions
– Bash apps call external applications
• Apps return “futures”: proxies for results that might not yet be available
• Apps run concurrently, respecting data dependencies. Natural parallel programming!
• Parsl scripts are independent of where they run. Write once, run anywhere!
pip install parsl
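Below is a minimal, runnable example of Parsl's app-and-futures model, using the documented local-threads configuration: apps return futures immediately, and passing one app's future to another expresses a data dependency.

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)  # run apps on local threads; swap configs to run elsewhere

@python_app
def double(x):
    return 2 * x

@python_app
def add(a, b):
    return a + b

# Both double() calls run concurrently; add() waits on their futures.
d1, d2 = double(10), double(16)
total = add(d1, d2)

print(total.result())  # blocks until the dependent tasks finish -> 52
```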
Parsl executors scale to 2M tasks / 256K workers (weak scaling)
• Weak scaling: 10 tasks (0-1 s) per worker
• HTEX and EXEX outperform other Python-based approaches and scale to millions of tasks
• HTEX and EXEX scale to 2K* and 8K* nodes, respectively, with >1K tasks/s
Scientific literature is inaccessible to most machines
[Image: materials informatics example]
PolyNER: Generalizable Scientific Named Entity Recognition
[Diagram: PolyNER pipeline — word embedding → labelling → trained classifier, with an active learning loop]
• Scientific NER challenges:
– NLP approaches are not yet suitable for scientific information extraction
– There is a lack of training data for applying ML
• PolyNER automates the creation of training data with minimal human guidance
– Word embedding models generate entity-rich corpora
– Context- and content-based classifiers
– Active learning prioritizes expert effort (sketched below)
• Better performance than leading chemical entity extractors at a fraction of the cost
– 1,000 labels, 5 hours of expert time
• Provides training data for a lexicon-infused Bi-LSTM
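The active learning step is described only at a high level above; the sketch below shows generic uncertainty sampling, a standard way to prioritize expert labeling effort. It is illustrative, not PolyNER's actual implementation, and assumes a scikit-learn-style classifier.

```python
import numpy as np

def select_for_labeling(features, classifier, budget=20):
    """Rank unlabeled candidates by classifier uncertainty and return the
    indices of the `budget` most uncertain ones for expert labeling.
    Generic uncertainty sampling, not PolyNER's actual code."""
    probs = classifier.predict_proba(features)   # shape: (n_candidates, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)        # low top-class confidence = uncertain
    return np.argsort(uncertainty)[::-1][:budget]
```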
Questions?
labs.globus.org
