
When Models Go Rogue: Hard Earned Lessons About Using Machine Learning in Production

Much progress has been made over the past decade on process and tooling for managing large-scale, multitier, multicloud apps and APIs, but there is far less common knowledge on best practices for managing machine-learned models (classifiers, forecasters, etc.) once they are in production, beyond the initial modeling, optimization, and deployment process.

Machine-learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Topics include:
- Concept drift: Identifying and correcting for changes in the distribution of production data that cause pretrained models to decline in accuracy
- Selecting the right retrain pipeline for your specific problem, from automated batch retraining to online active learning
- A/B testing challenges: Recognizing common pitfalls like the primacy and novelty effects and best practices for avoiding them (like A/A testing)
- Offline versus online measurement: Why both are often needed and best practices for getting them right (refreshing labeled datasets, judgement guidelines, etc.), with a brief illustrative sketch after this list
- Delivering semi-supervised and adversarial learning systems, where most of the learning happens in production and depends on a well-designed closed feedback loop
- The impact of all of the above on project management, planning, staffing, scheduling, and expectation setting
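As a rough illustration of the offline-versus-online point above (not code from the talk), here is a minimal sketch that computes the same precision and recall two ways: against a refreshed, human-labeled holdout set, and against outcomes later observed in production logs. The dataframe columns, the model object, and the function names are hypothetical placeholders.

```python
# Hypothetical sketch: measuring the same model offline and online.
# Column names and helpers are illustrative, not from the talk.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def offline_metrics(model, labeled_holdout: pd.DataFrame) -> dict:
    """Evaluate against a recently refreshed, human-labeled holdout set."""
    preds = model.predict(labeled_holdout.drop(columns=["label"]))
    return {
        "precision": precision_score(labeled_holdout["label"], preds),
        "recall": recall_score(labeled_holdout["label"], preds),
    }

def online_metrics(production_log: pd.DataFrame) -> dict:
    """Evaluate logged predictions against outcomes later observed in production."""
    observed = production_log["observed_outcome"]
    predicted = production_log["predicted_label"]
    return {
        "precision": precision_score(observed, predicted),
        "recall": recall_score(observed, predicted),
    }

# A persistent gap between the two views is itself a signal: stale labels,
# drift, or a feedback loop may be skewing one of the measurements.
```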


When Models Go Rogue: Hard Earned Lessons About Using Machine Learning in Production

  1. WHEN MODELS GO ROGUE: HARD EARNED LESSONS ABOUT USING MACHINE LEARNING IN PRODUCTION (Dr. David Talby, @davidtalby)
  2. MODEL DEVELOPMENT ≠ SOFTWARE DEVELOPMENT
  3. Lesson 1: The moment you put a model in production, it starts degrading
  4. CONCEPT DRIFT: AN EXAMPLE. Medical claims > 4.7 billion; pharmacy claims > 1.2 billion; providers > 500,000; patients > 120 million. Sources of drift: locality (epidemics), seasonality, changes in the hospital / population, the impact of deploying the system itself, and combinations of all of the above.
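To make the drift idea concrete, here is a minimal sketch (my illustration, not the speaker's code) that compares a feature's production distribution against its training-time baseline using the Population Stability Index; the bucket count and the 0.2 alert threshold are conventional rule-of-thumb choices, not values from the talk.

```python
# Minimal drift check: Population Stability Index (PSI) between a training-time
# baseline and a recent production window. Illustrative only.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((p_actual - p_expected) * ln(p_actual / p_expected)) over bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)
    eps = 1e-6  # small epsilon avoids division by zero in empty buckets
    p_expected = expected_counts / max(expected_counts.sum(), 1) + eps
    p_actual = actual_counts / max(actual_counts.sum(), 1) + eps
    return float(np.sum((p_actual - p_expected) * np.log(p_actual / p_expected)))

baseline = np.random.normal(50, 10, 100_000)   # feature values seen at training time
this_week = np.random.normal(55, 12, 10_000)   # same feature in production this week
if psi(baseline, this_week) > 0.2:             # 0.2 is a common rule-of-thumb threshold
    print("Feature has drifted - investigate and consider retraining.")
```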
  5. HOW FAST DEPENDS ON THE PROBLEM (MUCH MORE THAN ON YOUR ALGORITHM). Examples span a spectrum from never changing to always changing: physical models (face recognition, voice recognition, climate models) change slowly; natural language & social behavior models, political & economic models, and cyber security sit in between; online social networking models/rules, Google or Amazon search models, banking & eCommerce fraud, automated trading, and real-time ad bidding change constantly.
  6. SO PUT THE RIGHT MACHINERY IN PLACE. Match the machinery to how fast the problem changes, ranging from hand-crafted rules and the traditional scientific method (test a hypothesis), through hand-crafted machine-learned models and daily/weekly batch retraining, to automated ensemble, boosting & feature selection techniques, automated 'challenger' online evaluation & deployment, active learning via active feedback, and real-time online learning via passive feedback.
  7. … WHEN YOU'RE ALLOWED. (Same spectrum of machinery as the previous slide, from hand-crafted rules through batch retraining to real-time online learning via passive feedback.)
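One concrete version of the machinery in the two slides above is a scheduled batch retrain that only promotes a 'challenger' model when it beats the current champion on fresh held-out data. The sketch below is a generic illustration under my own assumptions (scikit-learn, AUC as the promotion gate, a hypothetical deploy hook), not the pipeline from the talk's systems.

```python
# Sketch of a daily/weekly batch-retrain step with champion/challenger gating.
# `champion` and `deploy` are hypothetical stand-ins supplied by the caller.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def retrain_and_maybe_promote(champion, X, y, deploy, min_gain=0.002):
    X_train, X_eval, y_train, y_eval = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    challenger = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    champ_auc = roc_auc_score(y_eval, champion.predict_proba(X_eval)[:, 1])
    chall_auc = roc_auc_score(y_eval, challenger.predict_proba(X_eval)[:, 1])

    # Promote only on a meaningful improvement; otherwise keep the champion.
    if chall_auc >= champ_auc + min_gain:
        deploy(challenger)
        return challenger
    return champion
```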
  8. WATCH FOR SELF-INFLICTED CHANGES: hidden feedback loops, undeclared consumers, and data dependencies that bite. [D. Sculley et al., Google, December 2014]
  9. BEST PRACTICE: A NEW LEVEL OF MONITORING. Monitor that the distribution of predicted labels matches the distribution of observed labels.
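A minimal sketch of that monitoring check, assuming logged predictions can be joined with the labels eventually observed for the same window; the chi-square test and the 0.01 alert level are my illustrative choices, not prescribed in the talk.

```python
# Compare predicted vs. observed label distributions in a production window.
# Illustrative monitoring check, not the speaker's implementation.
import numpy as np
from scipy.stats import chi2_contingency

def labels_diverge(predicted: np.ndarray, observed: np.ndarray, alpha: float = 0.01) -> bool:
    labels = np.union1d(predicted, observed)
    pred_counts = np.array([(predicted == label).sum() for label in labels])
    obs_counts = np.array([(observed == label).sum() for label in labels])
    _, p_value, _, _ = chi2_contingency(np.vstack([pred_counts, obs_counts]))
    return p_value < alpha  # True -> the two distributions differ beyond chance

# Example: the model predicts ~12% positives but only ~7% positives are observed.
rng = np.random.default_rng(0)
predicted = rng.random(10_000) < 0.12
observed = rng.random(10_000) < 0.07
if labels_diverge(predicted, observed):
    print("Predicted and observed label distributions disagree - investigate.")
```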
  10. Lesson 2: You rarely get to deploy the same model twice
  11. REUSE METHODOLOGY, NOT MODELS. Charlson comorbidity index (1987): predicts 1-year mortality; sample size 607; 1 hospital in NYC, April 1984. Elixhauser comorbidity index (1998): hospital charges, length of stay & in-hospital mortality; sample size 1,779,167; 438 hospitals in CA, 1992. LACE index (van Walraven et al., 2010): 30-day mortality or readmission; sample size 4,812; 11 hospitals in Ontario, 2002-2006. [Cotter PE, Bhalla VK, Wallis SJ, Biram RW. Predicting readmissions: Poor performance of the LACE index in an older UK population. Age Ageing. 2012 Nov;41(6):784-9.]
  12. DON'T ASSUME WHAT MAKES A DIFFERENCE. Clinical coding for two outpatient radiology clinics, inferring the procedure (CPT) code, with 90% overlap between the clinics: train on one clinic, then use & measure that model on both. [Chart: precision and recall of the specialized vs. non-specialized model, on a 0-1 scale.]
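To show how a gap like this surfaces in practice, the sketch below evaluates a single trained model separately on each site's held-out data; the "site" and "label" columns and the model object are hypothetical.

```python
# Evaluate one trained model per deployment site. Illustrative sketch only.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def per_site_metrics(model, holdout: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for site, group in holdout.groupby("site"):
        preds = model.predict(group.drop(columns=["site", "label"]))
        rows.append({
            "site": site,
            "precision": precision_score(group["label"], preds),
            "recall": recall_score(group["label"], preds),
        })
    # A large gap between sites means the model specialized to its training site.
    return pd.DataFrame(rows)
```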
  13. Lesson 3: It’s really hard to know how well you’re doing
  14. HOW OPTIMIZELY (ALMOST) GOT ME FIRED: “it seemed we were only seeing about 10%-15% of the predicted lift, so we decided to run a little experiment. And that’s when the wheels totally flew off the bus.” [Peter Borden, SumAll, June 2014]
  15. THE PITFALLS OF A/B TESTING [Alice Zheng, Dato, June 2015]: Which metric? How many observations do we need? What does the p-value mean? One- or two-sided test? Are the variances equal? Is the distribution of the metric Gaussian? How much change counts as real change? How many false positives can we tolerate? How long to run the test? Plus multiple models and multiple hypotheses, separation of experiences, and catching distribution drift.
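Several of these pitfalls (how many observations, one- or two-sided test) come down to doing the power calculation before launching. Below is the standard two-proportion sample-size formula, sketched with illustrative baseline and lift numbers rather than figures from the talk.

```python
# How many users per arm to detect a lift from a 5.0% to a 5.5% conversion rate
# with alpha = 0.05 (two-sided) and 80% power. Numbers are illustrative.
from scipy.stats import norm

def required_n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2
    return int(round(n))

print(required_n_per_arm(0.050, 0.055))  # roughly 31,000 users per arm
```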
  16. FIVE PUZZLING OUTCOMES EXPLAINED [Ron Kohavi et al., Microsoft, August 2012]: the primacy effect, the novelty effect, regression to the mean. Best practice: A/A testing.
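The A/A best practice can be sanity-checked with a quick simulation: split identical traffic two ways many times and confirm that only about 5% of comparisons come out "significant" at alpha = 0.05. If an experimentation setup reports a much higher rate on A/A data, the measurement itself is suspect. A minimal sketch, with made-up metric parameters:

```python
# Simulate repeated A/A tests: both arms are drawn from the same distribution,
# so about 5% of runs should be "significant" at alpha = 0.05 purely by chance.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
runs = 2_000
false_positives = 0
for _ in range(runs):
    arm_a = rng.normal(loc=10.0, scale=3.0, size=5_000)  # same metric, same population
    arm_b = rng.normal(loc=10.0, scale=3.0, size=5_000)
    _, p_value = ttest_ind(arm_a, arm_b)
    false_positives += p_value < 0.05

print(f"A/A 'significant' rate: {false_positives / runs:.1%}")  # expect roughly 5%
```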
  17. Lesson 4: Often, the real modeling work only starts in production
  18. SEMI-SUPERVISED LEARNING
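Since the slide is a title only, here is a generic self-training (pseudo-labeling) sketch of the semi-supervised idea that most of the learning happens after deployment: a model trained on a small labeled set labels incoming unlabeled production data, and only its high-confidence predictions are folded back into training. The classifier, confidence threshold, and number of rounds are my assumptions, not details from the talk.

```python
# Generic self-training / pseudo-labeling loop. Illustrative, not the talk's system.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, rounds=3, confidence=0.95):
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    for _ in range(rounds):
        proba = model.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= confidence  # keep only high-confidence pseudo-labels
        if not confident.any():
            break
        pseudo_labels = model.classes_[proba[confident].argmax(axis=1)]
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, pseudo_labels])
        X_unlabeled = X_unlabeled[~confident]
        model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    return model
```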
  19. IN NUMBERS: 99.9999% ‘good’ messages; 50+ schemes (and counting); 6+ months per case.
  20. ADVERSARIAL LEARNING
  21. ADVERSARIAL LEARNING
  22. Lesson 5: Your best people are needed on the project after it goes to production
  23. SOFTWARE DEVELOPMENT. DESIGN: the most important, hardest-to-change technical decisions are made here. BUILD & TEST: the riskiest & most reused code components are built and tested first. DEPLOY: the first deployment is hands-on, then we automate it and iterate to build lower-priority features. OPERATE: ongoing, repetitive tasks are either automated away or handed off to support & operations.
  24. MODEL DEVELOPMENT. MODEL: feature engineering, model selection & optimization are done for the first model built. DEPLOY & MEASURE: online metrics are key in production, since results will often differ from offline ones. EXPERIMENT: design & run as many experiments as possible, as fast as possible, with new inputs, features & feedback. AUTOMATE: automate the retraining or active learning pipeline, including online metrics & labeled data collection.
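For the AUTOMATE step's active-learning pipeline, one common ingredient is uncertainty sampling: route the production predictions the model is least sure about to human annotators, so the labeled dataset keeps improving where it matters most. A hedged sketch, with a hypothetical batch size and model object:

```python
# Uncertainty sampling: pick the production examples the model is least sure about
# and queue them for human labeling. Illustrative sketch only.
import numpy as np

def select_for_labeling(model, X_production: np.ndarray, batch_size: int = 100) -> np.ndarray:
    proba = model.predict_proba(X_production)
    uncertainty = 1.0 - proba.max(axis=1)         # low top-class confidence = high uncertainty
    return np.argsort(uncertainty)[-batch_size:]  # indices of the most uncertain examples

# The resulting labels feed both the next scheduled retrain and the online metrics.
```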
  25. To conclude…
  26. MODEL DEVELOPMENT ≠ SOFTWARE DEVELOPMENT: rethink your development process; rethink your role in the business; think “run a hospital”, not “build a hospital”.
  27. @davidtalby
