Globus Labs: Forging the Next Frontier

Globus
Kyle Chard
chard@uchicago.edu
Globus Labs
Research data management and analysis challenges
• Data acquired at various locations/times
• Analyses executed on distributed resources with different capabilities
– Processing time decreases with distance
• Dynamic collaborations around data and analysis
[Figure labels: Raw data, Catalog, DOE Lab, Campus, Community Archive, FPGA, Cloud]
Exacerbated by large-scale science
• Best practices overlooked, useful data forgotten, errors propagate
• Researchers allocated short periods of instrument and compute time
• Inefficiencies → less science
• Errors → long delays, missed opportunity … forever!
Making research data reliably, rapidly, and securely accessible, discoverable, and usable
• Automation: encode research pipelines composed of triggers and actions
• funcX: scalable function as a service for science
• Parsl: intuitive parallel programming in Python
• PolyNER: extracting scientific facts from published literature
• DLHub: model publication and inference
• MDF: publication and scraping of materials datasets
• XtractHub: deriving metadata from scientific files
• Cost-aware computing: application profiling, resource prediction, automated provisioning
• Cloud classification: identifying different types of (real) clouds in climate data
Ripple: A Trigger-Action platform for data
• Monitors events on various file system types
• Includes a set of triggers and actions to create rules
• Ripple processes data triggers and reliably executes actions
• Usable by non-experts
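The trigger-action idea can be sketched in a few lines of Python. This is a toy illustration only, not Ripple's rule format or API: the `Rule` class, the `dispatch` helper, and the event fields (`type`, `path`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical rule: when an event matches the trigger, run the action."""
    event_type: str                  # e.g. "created" or "modified"
    pattern: str                     # glob pattern matched against the file path
    action: Callable[[str], str]     # called with the path of the matching file


def dispatch(event: Dict[str, str], rules: List[Rule]) -> List[str]:
    """Run every rule whose trigger matches the event and collect the results."""
    results = []
    for rule in rules:
        if event["type"] == rule.event_type and fnmatch(event["path"], rule.pattern):
            results.append(rule.action(event["path"]))
    return results


rules = [
    Rule("created", "*.tiff", lambda p: f"extract metadata from {p}"),
    Rule("created", "*.h5", lambda p: f"transfer {p} to archive"),
]

outcome = dispatch({"type": "created", "path": "scan_001.tiff"}, rules)
print(outcome)
```

Here only the first rule fires, because the event type and the file pattern both have to match. A real deployment would receive events from a file system monitor rather than from a hand-built dictionary.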
Automating the research lifecycle
• Simple state machine model
– JSON-based language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
[Figure labels: Auth, Search, Manage, Execute]
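A minimal sketch of a JSON-defined state machine with a condition and state propagation might look like the following. The flow schema (`start`, `states`, `choice`, `if_true`, `if_false`, `next`) and the three toy actions are assumptions made for illustration; they are not the actual flow language or any Globus API.

```python
import json

# Hypothetical flow definition; the field names are illustrative only.
FLOW_JSON = """
{
  "start": "transfer",
  "states": {
    "transfer": {"action": "transfer", "next": "check"},
    "check":    {"choice": "ok", "if_true": "analyze", "if_false": "notify"},
    "analyze":  {"action": "analyze", "next": null},
    "notify":   {"action": "notify", "next": null}
  }
}
"""

# Toy actions: each takes the flow state and returns an updated copy.
ACTIONS = {
    "transfer": lambda s: {**s, "ok": s["size"] > 0},
    "analyze": lambda s: {**s, "result": s["size"] * 2},
    "notify": lambda s: {**s, "result": "empty input"},
}


def run_flow(flow: dict, state: dict) -> dict:
    """Walk the state machine, passing the accumulated state to each step."""
    name = flow["start"]
    while name is not None:
        step = flow["states"][name]
        if "choice" in step:
            # Condition: branch on a value already present in the state.
            name = step["if_true"] if state[step["choice"]] else step["if_false"]
        else:
            state = ACTIONS[step["action"]](state)
            name = step["next"]
    return state


flow = json.loads(FLOW_JSON)
final = run_flow(flow, {"size": 21})
print(final)
```

Each action sees everything earlier steps wrote into the state, which is what lets the `check` step branch on the outcome of `transfer`. Loops, fault tolerance, asynchronous actions, and authentication are left out of this sketch.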
Remote execution of scientific workloads
• Compute wherever it makes the most sense:
– Hardware or software availability, data location, analysis time, wait time, etc.
• Remote computing has always been complex and expensive
– Now we have high-speed networks, universal trust fabrics (Globus Auth), and containers
• Many scientific workloads are composed of a collection of short-duration functions
– E.g., machine learning inference, real-time analyses, metadata extraction, image reconstruction, sensor stream analysis
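One way to make "compute wherever it makes the most sense" concrete is to estimate the time to result on each candidate resource and pick the smallest. The sketch below assumes a simple additive model (transfer + wait + analysis); the `Resource` class, the resource names, and all the timings are made-up values for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    name: str
    transfer_s: float   # time to move the data to the resource
    wait_s: float       # expected queue or provisioning wait
    analysis_s: float   # time to run the analysis once it starts

    def total_s(self) -> float:
        """Estimated time to result under the additive model."""
        return self.transfer_s + self.wait_s + self.analysis_s


def best_resource(resources: List[Resource]) -> Resource:
    """Pick the resource with the smallest estimated time to result."""
    return min(resources, key=lambda r: r.total_s())


# Hypothetical candidates and timings, in seconds.
candidates = [
    Resource("laptop", transfer_s=0, wait_s=0, analysis_s=3600),
    Resource("campus cluster", transfer_s=120, wait_s=600, analysis_s=300),
    Resource("cloud", transfer_s=900, wait_s=60, analysis_s=200),
]

choice = best_resource(candidates)
print(choice.name, choice.total_s())
```

With these numbers the fastest machine is not the best choice: the cloud analyzes quickest but loses to the campus cluster once data transfer is counted.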
funcX: High Performance Function as a Service for Science
• Endpoints deployed at the resource
– Manage provisioning and scheduling of resources and data
– Scale out based on resource needs
• Cloud service routes requests to endpoints
• Singularity containers run functions securely
• Globus Auth secures communication
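The register-then-route pattern can be illustrated with an in-process toy. This is not the funcX SDK: the `Service` and `Endpoint` classes and their methods are hypothetical stand-ins, a thread pool plays the role of the remote resource, and containers and authentication are omitted entirely.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class Endpoint:
    """Stand-in for an endpoint that runs functions on a local resource."""

    def __init__(self, workers: int = 2):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def execute(self, func: Callable, *args) -> Future:
        return self._pool.submit(func, *args)


class Service:
    """Stand-in for the cloud service: registers functions, routes requests."""

    def __init__(self):
        self.functions: Dict[str, Callable] = {}
        self.endpoints: Dict[str, Endpoint] = {}

    def register_function(self, func: Callable) -> str:
        func_id = str(uuid.uuid4())
        self.functions[func_id] = func
        return func_id

    def register_endpoint(self, endpoint: Endpoint) -> str:
        endpoint_id = str(uuid.uuid4())
        self.endpoints[endpoint_id] = endpoint
        return endpoint_id

    def run(self, func_id: str, endpoint_id: str, *args) -> Future:
        # Route the request to the chosen endpoint, which executes it.
        return self.endpoints[endpoint_id].execute(self.functions[func_id], *args)


def pixel_mean(pixels):
    """Example short-duration function: the mean of a list of pixel values."""
    return sum(pixels) / len(pixels)


service = Service()
fid = service.register_function(pixel_mean)
eid = service.register_endpoint(Endpoint())
task = service.run(fid, eid, [2, 4, 6])
print(task.result())
```

The caller only ever handles identifiers and a future, so the same request could be sent to a different endpoint by changing `eid`.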
Composition and parallelism in Python
• Software is increasingly assembled rather than written
– High-level language (e.g., Python) to integrate and wrap components from many sources
• Parallel and distributed computing is pervasive
– Increasing data sizes combined with plateauing sequential processing power
– Parallel hardware (e.g., accelerators) and distributed computing systems
parsl-project.org
Parsl: Pervasive Parallel Programming in Python
Apps define opportunities for parallelism
• Python apps call Python functions
• Bash apps call external applications
Apps return “futures”: a proxy for a result that might not yet be available
Apps run concurrently, respecting data dependencies. Natural parallel programming!
Parsl scripts are independent of where they run. Write once, run anywhere!
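The futures-and-dependencies idea can be imitated with the Python standard library alone. The sketch below is not Parsl and does not use its API: the `app` decorator is a hypothetical toy built on `concurrent.futures`, written only to show how a call can return a future immediately and how passing a future as an argument expresses a data dependency.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Shared pool for the toy example; large enough for the three tasks below.
_pool = ThreadPoolExecutor(max_workers=4)


def app(func):
    """Toy decorator: calling the wrapped function returns a future at once."""
    def launch(*args):
        def run():
            # Wait for any argument that is itself a future (a data dependency).
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return launch


@app
def double(x):
    return x * 2


@app
def add(a, b):
    return a + b


# The two double() calls can run concurrently; add() waits for both results.
total = add(double(3), double(4))
print(total.result())
```

Blocking a worker thread while it waits for its inputs is fine for three tasks but would not scale; a real dataflow runtime tracks dependencies and launches a task only once its inputs are ready.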
pip install parsl
Parsl executors scale to 2M tasks/256K workers (weak scaling)
Weak scaling: 10 tasks (0-1s) per worker
HTEX and EXEX outperform other Python-based approaches and scale to millions of tasks
HTEX and EXEX scale to 2K* and 8K* nodes, respectively, with >1K tasks/s
Scientific literature is inaccessible to most machines
Materials Informatics
PolyNER: Generalizable Scientific Named Entity Recognition
[Figure labels: Word embedding, Labelling, Trained classifier, Active learning]
• Scientific NER challenges:
– NLP approaches are not yet suitable for application to scientific information extraction
– There is a lack of training data for applying ML
• PolyNER automates the creation of training data using minimal human guidance
– Word embedding models to generate entity-rich corpora
– Context- and content-based classifiers
– Active learning to prioritize expert effort
• Better performance than leading chemical entity extractors at a fraction of the cost
– 1000 labels, 5 hours of expert time
• Training data for lexicon-infused Bi-LSTM
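How active learning can prioritize expert effort is easiest to see with uncertainty sampling, one common strategy. The slides do not say which strategy PolyNER uses, so this is an assumed stand-in; the candidate words and classifier scores below are invented for the example.

```python
from typing import Dict, List


def most_uncertain(scores: Dict[str, float], budget: int) -> List[str]:
    """Return the `budget` candidates whose score is closest to 0.5."""
    ranked = sorted(scores, key=lambda word: abs(scores[word] - 0.5))
    return ranked[:budget]


# Hypothetical classifier probabilities that each candidate is a target entity.
scores = {
    "polystyrene": 0.97,
    "toluene": 0.08,
    "PMMA": 0.55,
    "nylon-6": 0.62,
    "viscosity": 0.41,
}

# Ask the expert to label only the candidates the classifier is least sure of.
to_label = most_uncertain(scores, budget=2)
print(to_label)
```

Candidates the classifier already scores near 0 or 1 are skipped, so a small labelling budget is spent where a human answer is most likely to change the model.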
Questions?
labs.globus.org