ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak

Spark Summit
ModelDB: A system to manage machine learning models
Manasi Vartak
PhD Student, MIT DB Group
People
Manasi Vartak, PhD student, MIT
Srinidhi Viswanathan, MEng, MIT
Samuel Madden, Faculty, MIT
Matei Zaharia, Faculty, Stanford
Harihar Subramanyam, MEng, MIT
Wei-En Lee, MEng student, MIT
Building a default prediction algorithm
Profession        | Credit History                  | Risk of Default
Politician        | Reasonable                      | 0.3
Struggling artist | Poor                            | 0.7
Investor          | Has more money than our company | 0.0
…                 | …                               | …
Pictured examples: Barack Obama, Lindsay Lohan, Warren Buffett
Accuracy: 62%
Model 1
Model 3
RandomForestClassifier
val udf1: (Int => Int) = (delayed..)
df.withColumn("timesDelayed", udf1)
RandomForestClassifier
df.withColumn("timesDelayed", udf1)
  .withColumn("percentPaid", udf2)
val lrGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(5, 10, 15))
  .addGrid(rf.numTrees, Array(50, 100))
Model 5
credit-default-clean.csv
df.withColumn("timesDelayed", udf1)
  .withColumn("percentPaid", udf2)
  .withColumn("creditUsed", udf3)
…
val lrGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))
val scaler = new StandardScaler()
  .setInputCol("features")
…
val labelIndexer1 = new LabelIndexer()
val labelIndexer2 = new LabelIndexer()
…
Model 50
val udf1: (Int => Int) = (delayed..)
val udf2: (String, Int) = …
credit-default-clean.csv
I’m willing to bet…
No one in here tracks (all of) their models
…and this is not unusual
Why is this a problem?
• No record of experiments
• Insights lost along the way
• Difficult to reproduce results
• Cannot search for or query models
• Difficult to collaborate
Did my colleague do that already?
How did normalization affect my ROC?
How does someone review your model?
Where’s the LR model I tried last week with featureX?
What params did I use?
Model Management
Track, store and index modeling artifacts so that they may subsequently be reproduced, shared, queried, and analyzed.
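
The definition above (track, store, index, then query) can be illustrated with a minimal sketch that uses only Python's standard library. The table layout, the helper names (open_store, track_run, find_runs) and the metric value in the usage lines are assumptions made for illustration; they are not ModelDB's actual schema or API.

```python
import hashlib
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a SQLite store for runs (hypothetical layout, not ModelDB's)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT PRIMARY KEY, model_type TEXT, dataset TEXT, "
        "params TEXT, features TEXT, metrics TEXT)"
    )
    # Index the column we expect to filter on most often.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_model_type ON runs(model_type)")
    return conn


def track_run(conn, model_type, dataset, params, features, metrics):
    """Store one modeling run and return an id derived from its content."""
    record = {
        "model_type": model_type,
        "dataset": dataset,
        "params": params,
        "features": sorted(features),
        "metrics": metrics,
    }
    payload = json.dumps(record, sort_keys=True)
    # Identical runs hash to the same id, so re-logging does not duplicate rows.
    run_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    conn.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?, ?, ?)",
        (
            run_id,
            model_type,
            dataset,
            json.dumps(params, sort_keys=True),
            json.dumps(sorted(features)),
            json.dumps(metrics, sort_keys=True),
        ),
    )
    conn.commit()
    return run_id


def find_runs(conn, model_type):
    """Return all stored runs of the given model type as dictionaries."""
    rows = conn.execute(
        "SELECT run_id, params, features, metrics FROM runs "
        "WHERE model_type = ? ORDER BY run_id",
        (model_type,),
    ).fetchall()
    return [
        {
            "run_id": row[0],
            "params": json.loads(row[1]),
            "features": json.loads(row[2]),
            "metrics": json.loads(row[3]),
        }
        for row in rows
    ]


# Usage: the accuracy value is a placeholder, not a result from the talk.
conn = open_store()
run_id = track_run(
    conn,
    "RandomForestClassifier",
    "credit-default-clean.csv",
    {"maxDepth": 10, "numTrees": 50},
    ["timesDelayed", "percentPaid"],
    {"accuracy": 0.5},
)
stored = find_runs(conn, "RandomForestClassifier")
```

Deriving the id from the run's content is one possible design choice: it makes "did I already try this?" a lookup rather than a memory exercise.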
ModelDB: a system to manage machine learning models
http://modeldb.csail.mit.edu
ModelDB: an end-to-end model management system
Diagram labels: Ingest models, metadata; Model artifact Storage & Versioning; Query; Collaboration, Reproducibility
Stage captions: track; store & index; query, reproduce++
Demo
ModelDB w/
scikit-learn
ModelDB Architecture & Design Decisions
1. Support for diverse languages and environments
2. Minimal changes to existing workflows
3. Rich visual interface
4. Support for complex queries
Architecture diagram labels: spark.ml; scikit-learn; Native Client; Events; thrift; Scala; Python; …; ModelDB Backend; Storage; ModelDB Frontend: vis + query
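
To make design decision 2 concrete, here is a hedged sketch of how a client could log a "fit" event with a one-line change to an existing workflow. Everything in it is hypothetical: EventLog stands in for a backend, instrument is an illustrative helper, and MeanThresholdClassifier is a toy model. None of these names come from ModelDB's real clients.

```python
import functools
import json
import time


class EventLog:
    """In-memory stand-in for a backend that receives client events."""

    def __init__(self):
        self.events = []

    def emit(self, kind, **fields):
        event = {"kind": kind, "timestamp": time.time(), **fields}
        # Serialize the event, as a client would before sending it over the wire.
        self.events.append(json.dumps(event, sort_keys=True))


def instrument(model_cls, log):
    """Wrap model_cls.fit so that every call also emits a 'fit' event."""
    original_fit = model_cls.fit

    @functools.wraps(original_fit)
    def fit_and_log(self, X, y=None):
        # Capture hyperparameters before fitting adds learned attributes.
        params = dict(vars(self))
        fitted = original_fit(self, X, y)
        log.emit("fit", model_type=model_cls.__name__, params=params, n_rows=len(X))
        return fitted

    model_cls.fit = fit_and_log
    return model_cls


class MeanThresholdClassifier:
    """Toy model: predicts 1 when a value exceeds scale * mean of the training values."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, X, y=None):
        # y is unused in this toy model; it is kept for a familiar fit(X, y) shape.
        self.threshold_ = self.scale * sum(X) / len(X)
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


# The only change to the existing workflow is the instrument(...) line.
log = EventLog()
instrument(MeanThresholdClassifier, log)
model = MeanThresholdClassifier(scale=2.0).fit([1.0, 2.0, 3.0], [0, 0, 1])
predictions = model.predict([5.0, 1.0])
```

Wrapping the training call keeps the modeling code itself untouched, which is the spirit of "minimal changes"; a real client would send the serialized event to the backend instead of keeping it in a list.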
ModelDB Features
• Experiment tracking
• Versioning
• Reproducibility
• Comparisons, queries, search
• Collaboration
Log models, params, pipelines etc. via ModelDB API
Model search, query, comparison via frontend
Central repository of models
Review models, annotate
All pipeline details, params logged
Every modeling run = version
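
The ideas "every modeling run = version" and "search, query, comparison" can be sketched in a few lines of plain Python. The Project and Run classes and the diff_params helper below are hypothetical names for illustration only, not the ModelDB API; the model types, parameters and feature names are borrowed from the earlier slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One modeling run; its version number is assigned when it is logged."""
    version: int
    model_type: str
    params: dict
    features: tuple


class Project:
    """Central, append-only list of runs (hypothetical, in-memory)."""

    def __init__(self):
        self._runs = []

    def log_run(self, model_type, params, features):
        # Each logged run becomes the next version.
        run = Run(len(self._runs) + 1, model_type, dict(params), tuple(features))
        self._runs.append(run)
        return run

    def search(self, model_type=None, feature=None):
        """Filter runs by model type and/or by a feature they used."""
        return [
            run
            for run in self._runs
            if (model_type is None or run.model_type == model_type)
            and (feature is None or feature in run.features)
        ]


def diff_params(first, second):
    """Return {name: (value_in_first, value_in_second)} for params that differ."""
    names = sorted(set(first.params) | set(second.params))
    return {
        name: (first.params.get(name), second.params.get(name))
        for name in names
        if first.params.get(name) != second.params.get(name)
    }


project = Project()
v1 = project.log_run("LogisticRegression", {"elasticNetParam": 0.1}, ["timesDelayed"])
v2 = project.log_run("LogisticRegression", {"elasticNetParam": 0.5}, ["timesDelayed", "featureX"])
v3 = project.log_run("RandomForestClassifier", {"maxDepth": 10, "numTrees": 50}, ["timesDelayed", "percentPaid"])

# "Where's the LR model I tried last week with featureX?"
hits = project.search(model_type="LogisticRegression", feature="featureX")
# "What params did I use?" compared across two versions.
changes = diff_params(v1, v2)
```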
Ongoing Work
• Unified querying of modeling artifacts
• Mining data in ModelDB
• Model monitoring and retraining
ModelDB available now!
http://modeldb.csail.mit.edu
*MIT License
ModelDB available now!
• Download, try it out!
• Tell us what you think; what can we do better?
• Contribute! (see Issues on repo for some ideas)
ModelDB: a system to manage machine learning models
mvartak@csail.mit.edu | @DataCereal
http://modeldb.csail.mit.edu