
Enabling Scalable Data Science Pipelines with MLflow at Thermo Fisher Scientific

Thermo Fisher Scientific has one of the most extensive product portfolios in the industry, ranging from reagents to capital instruments, serving customers in biotechnology, pharmaceuticals, academia, and more.



  1. Enabling Scalable Data Science Pipelines with MLflow and Model Registry at Thermo Fisher Scientific
     Allison Wu, Data Scientist, Data Science Center of Excellence, Thermo Fisher Scientific
  2. Key Summary
     ▪ We standardized development of machine learning models by integrating MLflow tracking into the development pipeline.
     ▪ We improved reproducibility of machine learning models by integrating GitHub and Delta Lake into the development and deployment pipelines.
     ▪ We streamlined our deployment process for machine learning models on different platforms through MLflow and a centralized Model Registry.
  3. What do data scientists at our Data Science Center of Excellence do?
     ▪ Generate novel algorithms that can be applied across different divisions
     ▪ Work with cross-divisional teams for model migration and standardization
     ▪ Enable data science in different divisions and functions
     ▪ Establish data science best practices
     (Diagram: Data Science at Thermo Fisher spanning Operations, Human Resources, Commercial & Marketing, and R&D)
  4. Commercial & Marketing Data Science Life Cycle
     Actionable insights from customer interactions that create a competitive advantage and drive growth and profitability.
     (Diagram: data sources such as install base, cloud, transactional, external, web behavioral, customer interaction, and call-center data feed model development and deployment; machine learning and rule-based legacy models deliver the relevant offer through automatic email campaigns, website marketing strategies, and prescriptive recommendations for sales reps; customer engagement, leads, and revenue feed back into the machine learning models)
  5. Model Development and Deployment Cycle
     Development (DEV): ▪ Exploratory analysis ▪ Model development: feature engineering, feature selection, model optimization
     Deployment: ▪ Deployment to production environment ▪ Audit ▪ Scoring
     Delivery: ▪ Web recommendation ▪ Email campaign ▪ Commercial dashboard
     Management (PRD): ▪ Monitoring ▪ Feedback
     (Cycle annotations: production model retraining and retuning stays in PRD; feedback flows back from PRD to DEV for new model development)
  6. An Example Model Development / Deployment Cycle
     A model that makes product recommendations based on customer behaviors, such as web activities, sales transactions, etc.
     • 6-8 weeks of EDA and prototyping
     • Scoring daily
     • Retrain/retune on new data in production every 2 weeks
     • Deliver through email campaigns or commercial sales rep channels
     • Monitor model performance metrics
  7. What we used to do…
     • All work is in Databricks notebooks
     • No version control on either data or models
     • No unit testing
     • No regression testing against different versions of models
     • Hard to share modularized functions across projects (lots of copy-pasting)
  8. What we now do…
     Databricks notebook (DEV): • Exploratory analysis • Feature engineering
     Notebook & MLflow (DEV): • ML model experiments • Hyperparameter tracking • Feature selection • Model comparison (see the tracking sketch below)
     Development Model Registry (DEV): • Streamlined regression testing against previous model versions • Documented model review process • Clean version management for better collaboration within the same DEV environment
     ML model library: • Python modules for sharable and testable ML functions such as feature functions, utility functions, and ML tuning functions • Version controlled on GitHub • Integrated with Databricks Projects to version control Databricks notebooks • Documented code review process • Version-controlled data sources with Delta Lake
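The slide describes MLflow tracking inside a Databricks notebook. Below is a minimal sketch of that pattern, not Thermo Fisher's actual code: the experiment path, run name, and the scikit-learn model are assumptions for illustration.

```python
# Minimal sketch of experiment tracking with MLflow (hypothetical names).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("/Shared/product-recommendation-dev")  # hypothetical experiment path

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    mlflow.log_params(params)  # hyperparameter tracking
    mlflow.log_metric("precision", precision_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")  # enables later model comparison
```

Each run then shows up in the MLflow UI with its parameters, metrics, and model artifact side by side, which is what makes the model comparisons on the following slides possible.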
  9. Tracking Feature Improvements Becomes Easy
     Boss: What are the important features in this version versus the previous version?
     What we used to do…
     ▪ "Let me find out how the features do in my… uh… model_version_10.dbc? Maybe?"
     ▪ "I wish I had a screenshot of the feature importance figure from before…"
  10. Tracking Feature Improvements Becomes Easy
     What we now do…
     ▪ "I got it. Let me pull it out from MLflow…"
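One way to make the feature-importance figure "pullable" later is to log it as a run artifact. A short sketch, continuing the hypothetical run above (assumes `model` and `X` from that block):

```python
# Sketch: persist the feature-importance plot on the run so it can be retrieved
# from the MLflow UI later instead of hunting for old screenshots.
import matplotlib.pyplot as plt
import mlflow

feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

fig, ax = plt.subplots(figsize=(6, 8))
ax.barh(feature_names, model.feature_importances_)
ax.set_title("Feature importance")

with mlflow.start_run(run_name="rf-baseline-importances"):
    mlflow.log_figure(fig, "feature_importance.png")  # browsable per run in the UI
```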
  11. (Figure-only slide)
  12. Sharing ML Features Becomes Easy
     Colleague: I really like the feature you used in your last model. Can I use that as well?
     What we used to do…
     ▪ "Sure! Just copy-paste this part of the notebook… oh, but I also have a slightly different version in this other part of the notebook… I THINK this is the one I used…"
  13. Sharing ML Features Becomes Easy
     What we now do…
     ▪ "Sure! I added that feature to the shared ML repo. Feel free to use it by importing the module, and if you modify the feature, just contribute back to the repo so that I can use it next time as well!"
     ▪ What's even cooler: you can log the exact version of the repo you used in MLflow, so even if the repo has evolved since your model development, you can still trace back to the exact version you used for your model (see the sketch below).
     (Diagram: internal shared ML repo)
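A minimal sketch of pinning the shared-repo version on a run. The repo path and tag name are hypothetical; note that MLflow also records the `mlflow.source.git.commit` tag automatically when a run is launched from a Git checkout.

```python
# Sketch: record the exact commit of the shared ML repo used for this run.
import subprocess
import mlflow

repo_path = "/Workspace/Repos/shared/internal-ml-repo"  # hypothetical location
commit = subprocess.check_output(
    ["git", "rev-parse", "HEAD"], cwd=repo_path
).decode().strip()

with mlflow.start_run(run_name="model-with-shared-features"):
    mlflow.set_tag("shared_ml_repo_commit", commit)
    # ... train and log the model as usual ...
```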
  14. What We Learned
     • Reproducing model results relies not just on version control of code and notebooks but also on the training data, environments, and dependencies.
     • MLflow and Delta Lake allow us to track everything needed to reproduce model results (see the data-versioning sketch below).
     • GitHub allows us to: • establish best practices for accessing our data warehouses • standardize our ML models • encourage collaboration and review among data scientists.
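A sketch of tying a run to the exact version of its Delta training table, so the same data can be reloaded later. The table name is hypothetical and a Databricks/Spark session (`spark`) is assumed.

```python
# Sketch: log the Delta table version used for training alongside the run.
import mlflow
from delta.tables import DeltaTable

table_name = "analytics.customer_features"  # hypothetical table
latest_version = DeltaTable.forName(spark, table_name).history(1).collect()[0]["version"]

train_df = spark.read.option("versionAsOf", latest_version).table(table_name)

with mlflow.start_run():
    mlflow.set_tag("training_data_table", table_name)
    mlflow.set_tag("training_data_version", str(latest_version))
    # ... feature engineering and model training on train_df ...
```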
  15. Let's talk about deployment…
  16. What we used to do…
     • Manually export Databricks notebooks and dependent libraries.
     • Manually set up clusters in the PRD instance to match cluster settings in DEV.
     • Difficulty troubleshooting differences between the PRD and DEV shard environments, as data scientists don't have the access required to pre-deploy in the PRD environment.
  17. What we can now do…
     Centralized Model Registry (PRD): • Regression testing in the production environment • Model version management in a centralized workspace • Manage production models from different DEV environments • Streamlined deployment with logged dependencies and environment setup (see the registration sketch below)
     (Diagram: Development Model Registry in DEV feeding the Centralized Model Registry in PRD)
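A sketch of promoting a tracked model into the Model Registry and moving it through stages. The registered-model name is hypothetical, and the run id is a placeholder for one produced by a DEV tracking run.

```python
# Sketch: register a run's model and move it through review stages.
import mlflow
from mlflow.tracking import MlflowClient

model_name = "product_recommender"      # hypothetical registered-model name
run_id = "<run-id-from-dev-tracking>"   # placeholder

model_version = mlflow.register_model(f"runs:/{run_id}/model", model_name)

client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",  # promoted to "Production" after regression testing and review
)
```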
  18. What we can now do…
     PRD notebook: • Execute model pipelines • Deliver results through various channels • Monitor regular model retraining/retuning and scoring processes • Model feedback logging (see the scoring sketch below)
     Centralized Model Registry (PRD): • Regression testing in the production environment • Model version management in a centralized workspace • Manage production models from different DEV environments
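A sketch of the daily PRD scoring step: load whatever version currently sits in the "Production" stage and score today's input. The table name, feature columns, and Spark session are assumptions.

```python
# Sketch: score the daily input table with the current Production model.
import mlflow.pyfunc

predict = mlflow.pyfunc.spark_udf(spark, "models:/product_recommender/Production")

feature_cols = ["web_activity_score", "recent_purchase_count"]  # hypothetical features
scored = (spark.table("analytics.daily_scoring_input")          # hypothetical table
          .withColumn("prediction", predict(*feature_cols)))

scored.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_predictions")
```

Because the URI refers to the stage rather than a fixed version, the PRD notebook automatically picks up a newly promoted model without a code change.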
  19. What we can also do…
     Deploying and managing models across different platforms through a Centralized Model Registry.
     (Diagram: multiple Development Model Registries in different DEV environments feed the Centralized Model Registry in PRD, where PRD notebooks execute model pipelines, deliver results, monitor retraining/retuning and scoring, and log model feedback)
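On Databricks, pointing a DEV workspace at a registry hosted in another workspace is typically done through a registry URI backed by a secret scope. A sketch, with the scope and prefix left as placeholders:

```python
# Sketch: direct registry operations at a centralized registry workspace.
import mlflow

mlflow.set_registry_uri("databricks://<scope>:<prefix>")  # placeholders for the secret scope/prefix

# Subsequent registry calls (register_model, "models:/..." loads, stage
# transitions) now target the centralized registry instead of the local one.
```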
  20. Regression Testing Becomes Easy
     Boss: How does your new model's performance compare to the old model in production?
     What we used to do…
     ▪ "Let me look through the previous colleague's notebook to find out what the performance was…"
     ▪ After digging through the notebook, you can't find performance metrics logged anywhere…
     What we now do…
     ▪ "From the record in the Model Registry, it looks like I have improved the precision by X%." (see the comparison sketch below)
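A sketch of that comparison: look up the run behind each registered version and read its logged metric. The model name, metric name, and candidate version are hypothetical.

```python
# Sketch: compare a candidate version's precision against the Production version.
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "product_recommender"  # hypothetical

def precision_of(version: str) -> float:
    run_id = client.get_model_version(model_name, version).run_id
    return client.get_run(run_id).data.metrics["precision"]

production = client.get_latest_versions(model_name, stages=["Production"])[0]
candidate_version = "12"  # hypothetical version under review

print("production precision:", precision_of(production.version))
print("candidate precision: ", precision_of(candidate_version))
```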
  21. (Figure-only slide)
  22. Troubleshooting Transient Data Discrepancies Becomes Easy
     Data Engineer: The daily run yesterday yielded only <1,000 rows of predictions. Do you know what happened?
     What we used to do…
     ▪ "Uh… the input table is already overwritten by today's run. I can rerun the model and see if the prediction comes back to normal now?"
  23. Troubleshooting Transient Data Discrepancies Becomes Easy
     What we now do…
     ▪ "Let me pull out that version of the input table, since it's saved as a Delta table. Looks like there were a lot fewer rows in the input table due to a delay in the data refresh job?" (see the time-travel sketch below)
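A sketch of that Delta time-travel lookup: reload the snapshot the suspect run consumed and inspect it. The table name is hypothetical, the timestamp is a placeholder, and a Spark session is assumed.

```python
# Sketch: read the input table as of the suspect run's time and check its size.
bad_run_timestamp = "2021-01-01 00:00:00"  # placeholder: time of the suspect run

suspect_input = (spark.read
                 .option("timestampAsOf", bad_run_timestamp)   # or option("versionAsOf", <n>)
                 .table("analytics.daily_scoring_input"))      # hypothetical table

print(suspect_input.count())  # far fewer rows than usual points to an upstream refresh delay
```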
  24. What We Learned
     • Data scientists like the freedom to try out new platforms and tools.
     • Allowing that freedom of platforms and tools can be a nightmare for deployment in a production environment.
     • The MLflow tracking server and Model Registry can log a wide range of "flavors" of ML models, from Spark ML and scikit-learn to SageMaker. This allows management and comparison across different platforms in the same centralized workspace (see the sketch below).
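A short sketch of logging two flavors into the same tracking server; the fitted `sklearn_model` and `spark_pipeline_model` objects are assumed to exist and are purely illustrative.

```python
# Sketch: the same tracking server/registry holds models of different flavors.
import mlflow
import mlflow.sklearn
import mlflow.spark

with mlflow.start_run(run_name="sklearn-candidate"):
    mlflow.sklearn.log_model(sklearn_model, artifact_path="model")          # assumed fitted model

with mlflow.start_run(run_name="spark-ml-candidate"):
    mlflow.spark.log_model(spark_pipeline_model, artifact_path="model")     # assumed fitted pipeline
```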
  25. Thank you!
  26. Feedback
     Your feedback is important to us. Don't forget to rate and review the sessions.
