ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak

Building a machine learning model is an iterative process. A data scientist will build tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad hoc, and there is no practical way for a data scientist to manage the models built over time. In addition, there is no way to run complex queries on models and their related data.

In this talk, we present ModelDB, a novel end-to-end system for managing machine learning (ML) models. Using client libraries, ModelDB automatically tracks and versions ML models in their native environments (e.g., spark.ml, scikit-learn). A common set of abstractions enables ModelDB to capture models and pipelines built across different languages and environments. The structured representation of models and metadata then provides a platform for users to issue complex queries across various modeling artifacts. Our rich web frontend provides a way to query ModelDB at varying levels of granularity.
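
To make the idea of a structured, queryable record of models concrete, here is a minimal in-memory sketch. It is not ModelDB's client API: `ModelRecord`, `ToyModelStore`, `log`, and `query` are hypothetical names invented for illustration, and the accuracy numbers are placeholders rather than results from the talk. It only shows the general pattern of logging each modeling run as a numbered version and then filtering over the logged metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ModelRecord:
    """One logged modeling run (hypothetical schema, not ModelDB's)."""
    version: int
    model_type: str
    params: Dict[str, Any]
    metrics: Dict[str, float] = field(default_factory=dict)


class ToyModelStore:
    """In-memory stand-in for a model management backend."""

    def __init__(self) -> None:
        self._records: List[ModelRecord] = []

    def log(self, model_type: str, params: Dict[str, Any],
            metrics: Dict[str, float]) -> ModelRecord:
        # Every logged run gets the next version number.
        record = ModelRecord(len(self._records) + 1, model_type,
                             dict(params), dict(metrics))
        self._records.append(record)
        return record

    def query(self, predicate: Callable[[ModelRecord], bool]) -> List[ModelRecord]:
        # Return every logged run that satisfies the predicate.
        return [r for r in self._records if predicate(r)]


store = ToyModelStore()
# Placeholder metrics, chosen only to make the query below interesting.
store.log("RandomForestClassifier", {"maxDepth": 5, "numTrees": 50}, {"accuracy": 0.60})
store.log("RandomForestClassifier", {"maxDepth": 10, "numTrees": 100}, {"accuracy": 0.70})
store.log("LogisticRegression", {"elasticNetParam": 0.1}, {"accuracy": 0.65})

# "Which random forests scored above 0.65?"
good_forests = store.query(
    lambda r: r.model_type == "RandomForestClassifier"
    and r.metrics["accuracy"] > 0.65
)
for r in good_forests:
    print(r.version, r.params)
```

A real system would persist these records and capture them from the native library calls instead of explicit `log` calls; the sketch keeps everything in a Python list so the query step is easy to follow.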

ModelDB has been open-sourced at https://github.com/mitdbg/modeldb.

Published in: Data & Analytics


  1. ModelDB: A system to manage machine learning models
     Manasi Vartak, PhD Student, MIT DB Group
  2. People
     • Manasi Vartak, PhD student, MIT
     • Srinidhi Viswanathan, MEng, MIT
     • Samuel Madden, Faculty, MIT
     • Matei Zaharia, Faculty, Stanford
     • Harihar Subramanyam, MEng, MIT
     • Wei-En Lee, MEng student, MIT
  3. Building a default prediction algorithm
     Profession        | Credit History                  | Risk of Default
     Politician        | Reasonable                      | 0.3
     Struggling artist | Poor                            | 0.7
     Investor          | Has more money than our company | 0.0
     …                 | …                               | …
     Barack Obama, Lindsay Lohan, Warren Buffett
  4. Accuracy: 62%
     Model 1
  5. Model 3
     RandomForestClassifier
     val udf1: (Int => Int) = (delayed..)
     df.withColumn("timesDelayed", udf1)
  6. RandomForestClassifier
     df.withColumn("timesDelayed", udf1)
       .withColumn("percentPaid", udf2)
     val lrGrid = new ParamGridBuilder()
       .addGrid(rf.maxDepth, Array(5, 10, 15))
       .addGrid(rf.numTrees, Array(50, 100))
     Model 5
     credit-default-clean.csv
  7. df.withColumn("timesDelayed", udf1)
       .withColumn("percentPaid", udf2)
       .withColumn("creditUsed", udf3)
     …
     val lrGrid = new ParamGridBuilder()
       .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))
     val scaler = new StandardScaler()
       .setInputCol("features")
     …
     val labelIndexer1 = new LabelIndexer()
     val labelIndexer2 = new LabelIndexer()
     …
     Model 50
     val udf1: (Int => Int) = (delayed..)
     val udf2: (String, Int) = …
     credit-default-clean.csv
  8. I’m willing to bet… no one in here tracks (all of) their models
     …and this is not unusual
  9. Why is this a problem?
     • No record of experiments
     • Insights lost along the way
     • Difficult to reproduce results
     • Cannot search for or query models
     • Difficult to collaborate
     Did my colleague do that already?
     How did normalization affect my ROC?
     How does someone review your model?
     Where’s the LR model I tried last week with featureX?
     What params did I use?
  10. Model Management
      track, store and index modeling artifacts so that they may subsequently be reproduced, shared, queried, and analyzed
  11. ModelDB: a system to manage machine learning models
      http://modeldb.csail.mit.edu
  12. ModelDB: an end-to-end model management system
      • Ingest models, metadata
      • Model artifact Storage & Versioning
      • Query
      • Collaboration, Reproducibility
      track | store & index | query, reproduce++
  13. Demo
  14. ModelDB w/ scikit-learn
  15. ModelDB Architecture & Design Decisions
      1. Support for diverse languages and environments
      2. Minimal changes to existing workflows
      3. Rich visual interface
      4. Support for complex queries
      Diagram labels: spark.ml, scikit-learn, ModelDB Backend, Storage, thrift, Scala, Python, …, ModelDB Frontend: vis + query, Native Client, Events
  16. ModelDB Features
      • Experiment tracking
      • Versioning
      • Reproducibility
      • Comparisons, queries, search
      • Collaboration
      Log models, params, pipelines etc. via ModelDB API
      Model search, query, comparison via frontend
      Central repository of models
      Review models, annotate
      All pipeline details, params logged
      Every modeling run = version
  17. Ongoing Work
      • Unified querying of modeling artifacts
      • Mining data in ModelDB
      • Model monitoring and retraining
  18. ModelDB available now!
      http://modeldb.csail.mit.edu
      *MIT License
  19. ModelDB available now!
      • Download, try it out!
      • Tell us what you think; what can we do better?
      • Contribute! (see Issues on repo for some ideas)
  20. ModelDB: a system to manage machine learning models
      mvartak@csail.mit.edu | @DataCereal
      http://modeldb.csail.mit.edu
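
The parameter grid on slide 6 hints at why models pile up so quickly: a grid search trains one candidate per combination of grid values. The sketch below expands that grid in plain Python to count the candidates. The values for `maxDepth` and `numTrees` are copied from the slide; the dictionary layout and the `expand_grid` helper are assumptions made for illustration and are not part of spark.ml or ModelDB.

```python
from itertools import product
from typing import Any, Dict, List

# Grid values copied from slide 6; the dict layout is an assumption.
grid: Dict[str, List[Any]] = {
    "maxDepth": [5, 10, 15],
    "numTrees": [50, 100],
}


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one parameter dict per combination (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


candidates = expand_grid(grid)
print(len(candidates))   # 6 = 3 maxDepth values x 2 numTrees values
print(candidates[0])     # {'maxDepth': 5, 'numTrees': 50}
```

Six candidates from one small grid, repeated across feature sets and model families, is how a project reaches the tens to hundreds of models the talk describes.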
