Much progress has been made over the past decade on process and tooling for managing large-scale, multitier, multicloud apps and APIs, but there is far less shared knowledge of best practices for managing machine-learned models (classifiers, forecasters, etc.), especially once those models are in production and beyond the initial modeling, optimization, and deployment process.
Machine-learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
- Concept drift: Identifying and correcting for changes in the distribution of production data, which cause pretrained models to decline in accuracy (a minimal drift-detection sketch follows this list)
- Selecting the right retrain pipeline for your specific problem, from automated batch retraining to online active learning
- A/B testing challenges: Recognizing common pitfalls like the primacy and novelty effects and best practices for avoiding them, such as A/A testing (a minimal A/A testing sketch follows this list)
- Offline versus online measurement: Why both are often needed and best practices for getting them right (refreshing labeled datasets, judgement guidelines, etc.)
- Delivering semi-supervised and adversarial learning systems, where most of the learning happens in production and depends on a well-designed closed feedback loop
- The impact of all of the above on project management, planning, staffing, scheduling, and expectation setting
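
To make the concept-drift bullet concrete, here is a minimal sketch of one common monitoring approach: compare the distribution of a single numeric feature at training time against a recent window of production data with a two-sample Kolmogorov-Smirnov test. The feature, data sources, and threshold are illustrative assumptions, not part of the talk itself.

```python
# Minimal concept-drift check: compare a feature's training-time distribution
# against a recent production window. Names and thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp


def drift_detected(train_values: np.ndarray,
                   prod_values: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test on one numeric feature.

    A small p-value means the production distribution differs from the one the
    model was trained on, which is a signal to investigate (and possibly
    retrain) before accuracy degrades further.
    """
    result = ks_2samp(train_values, prod_values)
    return result.pvalue < p_threshold


# Example: a simulated shift in a feature's mean between training and production.
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training snapshot
prod = rng.normal(loc=0.4, scale=1.0, size=2_000)    # recent production window
print(drift_detected(train, prod))  # True: the distributions have drifted
```

In practice this kind of check runs per feature (and on model outputs) on a schedule, with drift alerts feeding whatever retraining pipeline fits the problem.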
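
The A/B testing bullet mentions A/A tests as a guard against broken experiment plumbing. The sketch below shows the basic idea under assumed names and metrics: split traffic into two groups that receive the identical experience and confirm that the measured metric shows no statistically significant difference before trusting the real A/B test.

```python
# Minimal A/A test sketch: both groups get the same experience, so a
# "significant" difference points to a problem with randomization, logging,
# or the metric pipeline rather than a real effect. Illustrative only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Simulated per-user metric (e.g., session length) under identical treatment.
group_a = rng.exponential(scale=5.0, size=50_000)
group_b = rng.exponential(scale=5.0, size=50_000)

result = ttest_ind(group_a, group_b, equal_var=False)
if result.pvalue < 0.05:
    print(f"A/A test failed (p={result.pvalue:.3f}): check assignment and logging")
else:
    print(f"A/A test passed (p={result.pvalue:.3f}): safe to proceed to the A/B test")
```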