ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak

Building a machine learning model is an iterative process. A data scientist will build tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad hoc, and there is no practical way for a data scientist to manage models that are built over time. In addition, there is no means to run complex queries on models and related data.

In this talk, we present ModelDB, a novel end-to-end system for managing machine learning (ML) models. Using client libraries, ModelDB automatically tracks and versions ML models in their native environments (e.g., spark.ml, scikit-learn). A common set of abstractions enables ModelDB to capture models and pipelines built across different languages and environments. The structured representation of models and metadata then provides a platform for users to issue complex queries across various modeling artifacts. Our rich web frontend provides a way to query ModelDB at varying levels of granularity.

ModelDB has been open-sourced at https://github.com/mitdbg/modeldb.
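
The tracking and versioning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ModelDB's client API: `ModelRecord`, `ModelStore`, and `log_run` are hypothetical names, the store is in-memory, and the parameter and metric values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ModelRecord:
    """One logged modeling run (hypothetical schema, not ModelDB's)."""
    version: int
    model_type: str
    params: Dict[str, Any]
    dataset: str
    metrics: Dict[str, float] = field(default_factory=dict)


class ModelStore:
    """In-memory stand-in for a model-management backend (assumption)."""

    def __init__(self) -> None:
        self._records: List[ModelRecord] = []

    def log_run(self, model_type: str, params: Dict[str, Any],
                dataset: str, metrics: Dict[str, float]) -> ModelRecord:
        # Each run receives the next version number, so earlier runs
        # are never overwritten.
        record = ModelRecord(
            version=len(self._records) + 1,
            model_type=model_type,
            params=dict(params),
            dataset=dataset,
            metrics=dict(metrics),
        )
        self._records.append(record)
        return record

    def history(self) -> List[ModelRecord]:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._records)


store = ModelStore()
# Illustrative values only; these are not results from the talk.
store.log_run("LogisticRegression", {"elasticNetParam": 0.1},
              "credit-default-clean.csv", {"accuracy": 0.62})
store.log_run("RandomForestClassifier", {"maxDepth": 10, "numTrees": 50},
              "credit-default-clean.csv", {"accuracy": 0.70})
```

Keeping the model type, parameters, dataset, and metrics in one structured record is what makes later comparison and search possible; a real system would persist these records rather than hold them in a list.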

  1. ModelDB: A system to manage machine learning models. Manasi Vartak, PhD Student, MIT DB Group
  2. People: Manasi Vartak (PhD student, MIT), Srinidhi Viswanathan (MEng, MIT), Samuel Madden (Faculty, MIT), Matei Zaharia (Faculty, Stanford), Harihar Subramanyam (MEng, MIT), Wei-En Lee (MEng student, MIT)
  3. Building a default prediction algorithm
       Profession          Credit History                     Risk of Default
       Politician          Reasonable                         0.3
       Struggling artist   Poor                               0.7
       Investor            Has more money than our company    0.0
       …                   …                                  …
     Pictured: Barack Obama, Lindsay Lohan, Warren Buffett
  4. Model 1. Accuracy: 62%
  5. Model 3. RandomForestClassifier
       val udf1: (Int => Int) = (delayed..)
       df.withColumn("timesDelayed", udf1)
  6. Model 5. RandomForestClassifier. Data: credit-default-clean.csv
       df.withColumn("timesDelayed", udf1)
         .withColumn("percentPaid", udf2)
       val lrGrid = new ParamGridBuilder()
         .addGrid(rf.maxDepth, Array(5, 10, 15))
         .addGrid(rf.numTrees, Array(50, 100))
  7. Model 50. Data: credit-default-clean.csv
       df.withColumn("timesDelayed", udf1)
         .withColumn("percentPaid", udf2)
         .withColumn("creditUsed", udf3)
       …
       val lrGrid = new ParamGridBuilder()
         .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))
       val scaler = new StandardScaler()
         .setInputCol("features")
       …
       val labelIndexer1 = new LabelIndexer()
       val labelIndexer2 = new LabelIndexer()
       …
       val udf1: (Int => Int) = (delayed..)
       val udf2: (String, Int) = …
  8. I’m willing to bet… no one in here tracks (all of) their models …and this is not unusual
  9. Why is this a problem?
       • No record of experiments
       • Insights lost along the way
       • Difficult to reproduce results
       • Cannot search for or query models
       • Difficult to collaborate
     Did my colleague do that already?
     How did normalization affect my ROC?
     How does someone review your model?
     Where’s the LR model I tried last week with featureX?
     What params did I use?
  10. Model Management: track, store, and index modeling artifacts so that they may subsequently be reproduced, shared, queried, and analyzed
  11. ModelDB: a system to manage machine learning models. http://modeldb.csail.mit.edu
  12. ModelDB: an end-to-end model management system
      Diagram boxes: Ingest (models, metadata); Model artifact Storage & Versioning; Query (Collaboration, Reproducibility)
      Diagram stage labels: track; store & index; query, reproduce++
  13. Demo
  14. ModelDB w/ scikit-learn
  15. ModelDB Architecture & Design Decisions
      1. Support for diverse languages and environments
      2. Minimal changes to existing workflows
      3. Rich visual interface
      4. Support for complex queries
      Diagram labels: spark.ml, scikit-learn, Native Client (Scala, Python, …), Events, thrift, ModelDB Backend, Storage, ModelDB Frontend: vis + query
  16. ModelDB Features
       • Experiment tracking
       • Versioning
       • Reproducibility
       • Comparisons, queries, search
       • Collaboration
      Slide callouts: Log models, params, pipelines etc. via ModelDB API; Model search, query, comparison via frontend; Central repository of models; Review models, annotate; All pipeline details, params logged; Every modeling run = version
  17. Ongoing Work
       • Unified querying of modeling artifacts
       • Mining data in ModelDB
       • Model monitoring and retraining
  18. ModelDB available now! http://modeldb.csail.mit.edu (*MIT License)
  19. ModelDB available now!
       • Download, try it out!
       • Tell us what you think; what can we do better?
       • Contribute! (see Issues on repo for some ideas)
  20. ModelDB: a system to manage machine learning models. mvartak@csail.mit.edu | @DataCereal | http://modeldb.csail.mit.edu
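
The kind of question the slides raise ("Where's the LR model I tried last week with featureX?") becomes a simple filter once runs are logged as structured records. The sketch below assumes a hypothetical record layout and a hypothetical `find_models` helper; it is not ModelDB's query interface, and every value in `RUNS` is invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical logged runs; field names and values are illustrative only.
RUNS = [
    {"id": 1, "model_type": "LogisticRegression",
     "features": ["timesDelayed"],
     "logged_on": date(2017, 2, 1), "roc_auc": 0.61},
    {"id": 2, "model_type": "LogisticRegression",
     "features": ["timesDelayed", "percentPaid"],
     "logged_on": date(2017, 2, 6), "roc_auc": 0.66},
    {"id": 3, "model_type": "RandomForestClassifier",
     "features": ["timesDelayed", "percentPaid"],
     "logged_on": date(2017, 2, 7), "roc_auc": 0.71},
]


def find_models(runs, model_type=None, feature=None, since=None):
    """Return the runs that satisfy every filter that was supplied."""
    matches = []
    for run in runs:
        if model_type is not None and run["model_type"] != model_type:
            continue
        if feature is not None and feature not in run["features"]:
            continue
        if since is not None and run["logged_on"] < since:
            continue
        matches.append(run)
    return matches


# "The LR model I tried in the last week that used percentPaid."
today = date(2017, 2, 8)  # fixed date so the example is deterministic
recent_lr = find_models(RUNS, model_type="LogisticRegression",
                        feature="percentPaid",
                        since=today - timedelta(days=7))
print([run["id"] for run in recent_lr])  # prints [2]
```

Without logged metadata the same question can only be answered from memory or by digging through old notebooks, which is the gap the talk describes.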