
Globus Labs: Forging the Next Frontier


This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Kyle Chard from the University of Chicago.



  1. Globus Labs: Forging the Next Frontier (Kyle Chard)
  2. Globus Labs
  3. Research data management and analysis challenges • Data acquired at various locations/times • Analyses executed on distributed resources with different capabilities – Processing time decreases with distance • Dynamic collaborations around data and analysis [Diagram: raw data flowing among campus, DOE lab, community archive, catalog, FPGA, and cloud]
  4. Exacerbated by large-scale science • Best practices overlooked, useful data forgotten, errors propagate • Researchers allocated short periods of instrument and compute time • Inefficiencies → less science • Errors → long delays, missed opportunity… forever!
  5. Making research data reliably, rapidly, and securely accessible, discoverable, and usable • Automation: encode research pipelines composed of triggers and actions • funcX: scalable function as a service for science • Parsl: intuitive parallel programming in Python • PolyNER: extracting scientific facts from published literature • DLHub: model publication and inference • MDF: publication and scraping of materials datasets • XtractHub: deriving metadata from scientific files • Cost-aware computing: application profiling, resource prediction, automated provisioning • Cloud classification: identifying different types of (real) clouds in climate data
  6. Ripple: A trigger-action platform for data • Monitors events on various file system types • Includes a set of triggers and actions to create rules • Ripple processes data triggers and reliably executes actions • Usable by non-experts
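The trigger-action model the slide describes can be sketched as a predicate-plus-callback pair dispatched against file-system events. This is an illustrative sketch only, not Ripple's actual API; the `Rule` and `dispatch` names are invented for the example.

```python
# Hypothetical sketch of a Ripple-style trigger-action rule.
# A rule pairs a trigger (a predicate over a file-system event)
# with an action to run when the trigger matches.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: Callable[[dict], bool]
    action: Callable[[dict], None]

def dispatch(event: dict, rules: list) -> None:
    """Run the action of every rule whose trigger matches the event."""
    for rule in rules:
        if rule.trigger(event):
            rule.action(event)

# Example rule: when a new .h5 file appears, hand it off for transfer
# (here we just record the path instead of moving data).
transferred = []
rule = Rule(
    trigger=lambda e: e["type"] == "created" and e["path"].endswith(".h5"),
    action=lambda e: transferred.append(e["path"]),
)
dispatch({"type": "created", "path": "/data/scan01.h5"}, [rule])
print(transferred)  # ['/data/scan01.h5']
```

The point of the pattern is that non-experts only write (or pick) the trigger and action; the platform handles event monitoring and reliable execution.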
  7. Automating the research lifecycle • Simple state machine model – JSON-based language – Conditions, loops, fault tolerance, etc. – Propagates state through the flow • Standardized API for integrating custom event and action services – Actions: synchronous or asynchronous – Custom Web forms prompt for user input • Actions secured with Globus Auth [Diagram: Auth, Search, Manage, Execute services]
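A flow in a JSON-based state-machine language of the kind the slide describes might look like the sketch below. The state names and field names (`StartAt`, `Next`, `End`) are illustrative assumptions, not the actual flow schema; the tiny interpreter just shows how state propagates through the flow.

```python
# Hypothetical flow definition: transfer data, extract metadata,
# then ingest into a search index. Field names are illustrative.
flow = {
    "StartAt": "Transfer",
    "States": {
        "Transfer": {"Type": "Action", "Next": "Extract"},
        "Extract":  {"Type": "Action", "Next": "Ingest"},
        "Ingest":   {"Type": "Action", "End": True},
    },
}

def run_flow(flow):
    """Walk the state machine, returning the states visited in order."""
    visited, state = [], flow["StartAt"]
    while True:
        visited.append(state)
        spec = flow["States"][state]
        if spec.get("End"):
            return visited
        state = spec["Next"]

print(run_flow(flow))  # ['Transfer', 'Extract', 'Ingest']
```

A real flow engine would additionally evaluate conditions, retry failed actions, and pass each action's output as input to the next state.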
  8. Remote execution of scientific workloads • Compute wherever it makes the most sense: – Hardware or software availability, data location, analysis time, wait time, etc. • Remote computing has always been complex and expensive – Now we have high-speed networks, universal trust fabrics (Globus Auth), and containers • Many scientific workloads are composed of a collection of short-duration functions – E.g., machine learning inference, real-time analyses, metadata extraction, image reconstruction, sensor stream analysis
  9. funcX: High Performance Function as a Service for Science • Endpoints deployed at resources – Manage provisioning and scheduling of resources and data – Scale out based on resource needs • Cloud service routes requests to endpoints • Singularity containers run functions securely • Globus Auth secures communication
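The core function-as-a-service pattern funcX implements (register a function once, then invoke it by ID against a named endpoint) can be sketched in a few lines. All names here (`register_function`, `run`, the endpoint name) are illustrative, not the funcX SDK's API, and the "endpoint" executes inline rather than on a remote resource.

```python
# Toy sketch of the FaaS register/route/execute pattern.
import uuid

functions = {}                       # function_id -> callable
endpoints = {"hpc-endpoint": []}     # endpoint name -> results log

def register_function(func):
    """Register a function with the service and get back an ID."""
    fid = str(uuid.uuid4())
    functions[fid] = func
    return fid

def run(fid, endpoint, *args):
    """Route an invocation request to an endpoint and execute it.
    A real service would queue the task, run it in a container on the
    remote resource, and return the result asynchronously."""
    result = functions[fid](*args)
    endpoints[endpoint].append(result)
    return result

fid = register_function(lambda x: x ** 2)
print(run(fid, "hpc-endpoint", 12))  # 144
```

The separation matters: registration happens once in the cloud service, while execution is routed to whichever endpoint makes the most sense for the data and hardware.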
  10. Composition and parallelism in Python • Software is increasingly assembled rather than written – High-level language (e.g., Python) to integrate and wrap components from many sources • Parallel and distributed computing is pervasive – Increasing data sizes combined with plateauing sequential processing power – Parallel hardware (e.g., accelerators) and distributed computing systems
  11. Parsl: Pervasive Parallel Programming in Python • Apps define opportunities for parallelism – Python apps call Python functions – Bash apps call external applications • Apps return "futures": a proxy for a result that might not yet be available • Apps run concurrently, respecting data dependencies — natural parallel programming! • Parsl scripts are independent of where they run — write once, run anywhere! • pip install parsl
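The futures model the slide describes can be illustrated with Python's standard library alone; the sketch below is an analogy, not Parsl code, and it runs locally on threads rather than on distributed resources.

```python
# The "future" idea behind Parsl apps, sketched with the stdlib:
# submitting work returns immediately with a proxy for the result.
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return 2 * x

with ThreadPoolExecutor() as pool:
    future = pool.submit(double, 21)          # returns a future at once
    # Chain a second task on the first task's result (a data dependency).
    result = pool.submit(double, future.result()).result()

print(result)  # 84
```

In Parsl itself the equivalent is a function decorated with `@python_app`: calling it returns an `AppFuture`, and passing one app's future as another app's argument is how Parsl infers the dependency graph and schedules tasks concurrently.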
  12. Parsl executors scale to 2M tasks / 256K workers (weak scaling) • Weak scaling: 10 tasks (0-1 s) per worker • HTEX and EXEX outperform other Python-based approaches and scale to millions of tasks • HTEX and EXEX scale to 2K* and 8K* nodes, respectively, with >1K tasks/s
  13. Scientific literature is inaccessible to most machines [Figure: Materials Informatics]
  14. PolyNER: Generalizable Scientific Named Entity Recognition [Diagram: word embedding → labelling → trained classifier → active learning loop] • Scientific NER challenges: – NLP approaches are not yet suitable for application to scientific information extraction – There is a lack of training data for applying ML • PolyNER automates the creation of training data using minimal human guidance – Word embedding models to generate entity-rich corpora – Context- and content-based classifiers – Active learning to prioritize expert effort • Better performance than leading chemical entity extractors at a fraction of the cost – 1000 labels, 5 hours of expert time • Training data for lexicon-infused Bi-LSTM
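The active-learning step that prioritizes expert effort can be sketched as uncertainty sampling: ask the expert to label only the candidate entities the current classifier is least sure about. The scoring function and the toy probabilities below are illustrative assumptions, not PolyNER's actual classifier.

```python
# Uncertainty sampling: pick the k candidates whose predicted
# probability of being an entity is closest to 0.5, i.e. where an
# expert label would be most informative.
def least_confident(scores, k):
    return sorted(scores, key=lambda w: abs(scores[w] - 0.5))[:k]

# Toy classifier confidences for candidate entity strings.
scores = {"polystyrene": 0.95, "toluene": 0.52, "sample": 0.10, "PMMA": 0.47}
print(least_confident(scores, 2))  # ['toluene', 'PMMA']
```

Labeling only these borderline candidates, retraining, and repeating is what lets a few hundred expert labels go as far as a much larger randomly labeled set.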
  15. Questions?