
A FAIR Approach to Publishing and Sharing Machine Learning Models


While there has been a significant increase in the amount of machine learning research across various domains of science, the processes to publish the results and make the resulting models and code available for reuse have been lacking. In this talk, we discuss the FAIR data principles applied to machine learning models and how the Data and Learning Hub for Science (DLHub) can help make models more easily discoverable and usable in common scientific workflows. Visit https://www.dlhub.org for more information.

Published in: Science

A FAIR Approach to Publishing and Sharing Machine Learning Models

  1. 1. Funding: 2018 Argonne Advanced Computing LDRD Collaborators: Ryan Chard, Logan Ward, Marcus Schwarting, Kyle Chard, Zhuozhao Li, Anna Woodard, Yadu Babuji, Steve Tuecke, Mike Franklin, Ian Foster Blue – also presenting at this workshop Data and Learning Hub for Science https://www.dlhub.org A FAIR Approach to Publishing and Sharing Machine Learning Models Ben Blaiszik (blaiszik@uchicago.edu)
  2. 2. Quick Polls • How many of you have trained a machine learning model? • How many of you have published papers using machine learning? • How many of you have tried to reuse models from others?
  3. 3. State of Machine Learning in Science Highs • Rapid increase in number of journal publications • Advances across the scientific domains • Achievements on par with experts or best-in-class methods in many domains • Funding agencies are coalescing around ML (AI Initiative etc.) Chart Source and Method: https://github.com/blaiszik/ml_publication_charts
  4. 4. State of Machine Learning in Science Lows For a given model: • Where is the code? • Where are the trained models? • Where is the training data? • How can I reproduce these results? Without all of these pieces, progress is drastically slowed (Image: the location of many ML models after a paper is finished; GitHub is another location…)
  5. 5. FAIR Data Principles • Findable • Accessible • Interoperable • Reusable https://www.force11.org/group/fairgroup/fairprinciples Set of principles to help make data as useful as possible to the community
  6. 6. FAIR Data Principles Findable • Data have an identifier • Data are registered in a searchable resource Accessible • Data accessible via identifier • Data retrievable by open protocols
  7. 7. FAIR Data Principles Interoperable • Data leverage formalized shared vocabularies • Vocabularies themselves follow FAIR principles Reusable • Clear licensing • Descriptive metadata is sufficient to promote reuse
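The FAIR criteria on the two slides above can be made concrete as a metadata record attached to a published model. The sketch below is illustrative only: the field names loosely follow the DataCite schema, and the DOIs, names, and the `is_findable` helper are hypothetical, not the actual DLHub schema.

```python
# A minimal sketch of FAIR-style descriptive metadata for a published ML
# model, loosely following DataCite field names (values are placeholders).
model_record = {
    # Findable: a persistent identifier...
    "identifier": {"identifier": "10.1234/example-model", "identifierType": "DOI"},
    "creators": [{"creatorName": "Doe, Jane"}],
    "titles": [{"title": "Formation enthalpy predictor"}],
    # Interoperable/Reusable: links back to the paper, code, and data
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/example-paper", "relationType": "IsDescribedBy"},
        {"relatedIdentifier": "10.1234/example-data", "relationType": "IsDerivedFrom"},
    ],
    # Reusable: clear licensing
    "rightsList": [{"rights": "Apache-2.0"}],
    # ...and tags so the record can be registered in a searchable index
    "subjects": [{"subject": "machine learning"}, {"subject": "materials science"}],
}

def is_findable(record):
    """A record is findable if it carries an identifier and searchable tags."""
    return bool(record.get("identifier")) and bool(record.get("subjects"))
```

A check like `is_findable` is the kind of validation a publication service can run automatically before minting a DOI.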
  8. 8. What Would FAIR Look Like in ML? (1) Find Interesting Science Paper • Links to code repository (Github/DOI) • Links to data repository (DOI) • Publication describes the model and its uses and limitations
  9. 9. What Would FAIR Look Like in ML? (2) Find Code • Has unique identifier (DOI) • Links back to publication (DOI) • Has well-documented code • Tagged with metadata to aid discovery • Registered in a search index • Open license
  10. 10. What Would FAIR Look Like in ML? (3) Find and Run Model • Model has identifier (DOI) • Model has links to data (DOI) • Model has links to the code (DOI/Github) • Model has links to publication (DOI) • Data are accessible • Inference run from the cloud - no installation necessary!
  11. 11. 11 • Collect, publish, categorize models and pre/post processing code • Operate models as a service to simplify sharing, consumption, and access • Identify models with unique and persistent identifiers (e.g., DOI) • Implement versioning, search, access controls etc. Goal: Deliver FAIR for ML 2018 Argonne Adv. Computing LDRD DATA AND LEARNING HUB FOR SCIENCE (DLHUB)
  12. 12. DLHub: Key Concepts Run() • Servables are containers with defined inputs and outputs • Servables may represent machine learning models or other data transformations • Outputs can be cached for previously seen inputs
  13. 13. DLHub: Key Concepts • Servables are containers with defined inputs and outputs • Servables may represent machine learning models or other data transformations • Outputs can be cached for previously seen inputs Preprocess 1 Run() → Preprocess 2 Run() → Model predict Run()
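The servable concept on the slides above can be sketched in plain Python: each servable is a callable with defined inputs and outputs, outputs are cached per input, and pre-processing servables chain into a model servable. This is an illustrative sketch of the idea, not the DLHub implementation; the stand-in functions are hypothetical.

```python
from functools import lru_cache

def make_servable(fn):
    # A "servable": a callable with defined inputs/outputs whose outputs
    # are cached for previously seen (hashable) inputs.
    return lru_cache(maxsize=None)(fn)

# Stand-in pipeline: two pre-processing servables feeding a "model".
preprocess_1 = make_servable(lambda x: x.strip().lower())   # normalize
preprocess_2 = make_servable(lambda x: tuple(x.split()))    # tokenize
model_predict = make_servable(lambda tokens: len(tokens))   # toy "model"

def run_pipeline(raw):
    # Chaining servables: Preprocess 1 -> Preprocess 2 -> Model predict
    return model_predict(preprocess_2(preprocess_1(raw)))
```

Because each stage caches on its input, repeated requests with the same data skip recomputation, which is the point of caching outputs at the servable level.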
  14. 14. Example: Predicting Formation Enthalpy This is what a user has This is what a user wants
  15. 15. Example: Predicting Formation Enthalpy This is what a user has This is what a user wants
  16. 16. PUBLISHING A MACHINE LEARNING MODEL 16
  17. 17. Marking up a Model – Python SDK Existing Model → User Mark Up with SDK (SDK Extracts Metadata for Known Model Types) → Send to DLHub (via Globus or HTTPS) → DLHub Containerization → Populate Search Index / Mint Identifiers
  18. 18. Python SDK – Automated Metadata Generation Citation Metadata Following Datacite DLHub Metadata Servable Metadata Access Control • Public • Globus users • Globus groups
  19. 19. Using DLHub is Easy! 19 2018 Argonne Adv. Computing LDRD Python SDK $ pip install dlhub_sdk 1 Describe • Specify the model files • Mark up the model with information to make it discoverable and usable 2 Publish • Publish to DLHub • DLHub service creates containers • DLHub service creates a unique endpoint for the servable
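The describe-and-publish steps above can be sketched as follows. This is a toy, in-memory stand-in for the service (not the real `dlhub_sdk` API): the `FakeDLHubService` class, its `publish` method, and the description fields are all hypothetical, meant only to show the flow of specifying model files, marking them up, and getting back a unique endpoint.

```python
import uuid

class FakeDLHubService:
    """Toy stand-in for the publish flow: register a described model and
    mint a unique endpoint for the resulting servable."""

    def __init__(self):
        self.index = {}  # search index: servable name -> description

    def publish(self, description):
        # The real service also builds a container around the model files;
        # here we just mint an endpoint and register the metadata.
        description["endpoint"] = str(uuid.uuid4())
        self.index[description["name"]] = description
        return description["endpoint"]

# 1 Describe: specify the model files and mark the model up
description = {
    "name": "formation_enthalpy",
    "files": ["model.pkl"],                   # the model files to publish
    "title": "Formation enthalpy predictor",  # markup that aids discovery
    "authors": ["Doe, Jane"],
}

# 2 Publish: send the description to the service
service = FakeDLHubService()
endpoint = service.publish(description)
```

The returned endpoint is what clients later target when they want to run the servable.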
  20. 20. Using DLHub is Easy! 20 2018 Argonne Adv. Computing LDRD 3 Discover • Discover servables with advanced search capabilities through the Python SDK or web UI (under construction) 4 Run • Make predictions by sending data to DLHub and specifying the servable to use
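The discover-and-run steps can likewise be sketched with a toy in-memory index (again, not the real SDK; the servable names, subjects, and stand-in model below are hypothetical):

```python
# Toy metadata index and servable table standing in for the DLHub service.
index = {
    "blaiszik/formation_enthalpy": {"subjects": ["materials", "enthalpy"]},
    "someone/image_tagger": {"subjects": ["xrd", "imaging"]},
}
servables = {
    # Stand-in model: returns a dummy prediction per input composition.
    "blaiszik/formation_enthalpy": lambda comps: [0.0 for _ in comps],
}

def search(keyword):
    # 3 Discover: find servables whose metadata matches a keyword.
    return [name for name, meta in index.items() if keyword in meta["subjects"]]

def run(name, inputs):
    # 4 Run: send data to the named servable and get predictions back.
    return servables[name](inputs)

matches = search("enthalpy")
predictions = run(matches[0], ["NaCl", "MgO"])
```

The real service performs the same two operations, with the inference executing in the cloud so no local installation of the model is needed.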
  21. 21. NEXT STEPS 21
  22. 22. Combining DLHub with Data Repositories Get Data Run Model 2018 Argonne Adv. Computing LDRD 22 • Using high-throughput optical imaging to predict material bandgap
  23. 23. Get Data Run Model Combining DLHub with Data Repositories 23 2018 Argonne Adv. Computing LDRD
  24. 24. Model-in-the-Loop Science: Select DLHub Use Cases Funding: 2018 Argonne Adv. Computing LDRD Community Model Benchmarking • Crystal structure (Ward, ANL/UC) • NIST PFHub (Wheeler, Warren, Heinonen, NIST/UC/Argonne/NU; Center for Hierarchical Materials Design (CHiMaD), NIST/UC/Argonne/NU) Automated Model Retraining with New Data • Models linked to dynamic data sources • Metallic glass discovery [active learning] (Ward, ANL/UC) XRD applications • XRD image tagging (Yager, BNL) • XRD intensity → structure/phase (Cherukara, Argonne)
  25. 25. More Examples Available In Our Repositories 25 2018 Argonne Adv. Computing LDRD X-ray science (Cherukara et al.) • Energy storage (Ward et al.) • Tomography with TomoGAN (Liu et al.)
  26. 26. DLHub Architecture and Performance • Task Managers (TM) to support execution on various compute resources • Executors chosen by the TM to invoke a given servable • Caching at the TM • Data staging with Globus • Batch submissions • Scalability through deployment of model replicas https://arxiv.org/abs/1811.11213 (Architecture diagram: REST/CLI/SDK clients → DLHub Management Service → Model Repository; Task Managers connected via zmq; Executors such as TF Serving, Parsl, and SageMaker serving models on servable nodes) Ryan Chard, Zhuozhao Li
  27. 27. Open Source Opportunities https://www.dlhub.org https://github.com/DLHub-Argonne 2018 Argonne Adv. Computing LDRD • Deposit models from the community • Help build client functionality • Build examples using existing servables • Be you! Contact: Ben Blaiszik (blaiszik@uchicago.edu)
  28. 28. Thanks to our sponsors! U.S. DEPARTMENT OF ENERGY ALCF DF Parsl Globus IMaD DLHub Argonne LDRD
