This document discusses the challenges of bringing AI models to production. It notes that ML systems have stateful data dependencies that change over time, which can cause models to break down if the input data distribution shifts. It advocates separating model code from data preprocessing and batching logic to improve efficiency. The document also argues that ML systems should be treated as interconnected entities rather than isolated components, as changes in one model's output distribution can cascade through linked models. New engineering strategies are needed to track dependencies between models based on distribution analysis.
5. Software Ate The World
Top 10 Most Valuable Companies, 2019 Q1 (market capitalization in USD millions; source: Wikipedia)
■ Microsoft: 904,860
■ Apple: 895,670
■ Amazon: 874,710
■ Alphabet: 818,160
■ Berkshire Hathaway: 493,750
■ Facebook: 475,730
■ Alibaba: 472,940
■ Tencent: 440,980
■ Johnson & Johnson: 372,230
■ ExxonMobil: 342,170
[Chart legend: Software and IT; Consumer and Media; Finances/banking]
6. What is “Software” Anyway
“Traditional” Computing
■ Deterministic
■ Linear Models
“Artificial Intelligence”
■ Probabilistic
■ Non-linear models
7. Why Not Biomed
■ Nature is not deterministic
■ Decisions are not clear cut
■ Independent from IT and computing
[Diagram labels: Traditional Computing, Biology, Medical]
10. Challenges of ML Engineering
■ Model Code Organization
■ Data dependency
■ Data/Model Drift
■ Model co-dependency
11. The New Wall Of Confusion
ML Scientists: “Here’s the BERT on GCN, got accuracy to 99%. Can you deploy it?”
Software Engineers: “What’s a Tensor? Can I npm it?”
12. The Engineering in ML
■ ML engineering is more than fine-tuning training metrics
■ Run-time efficiency
■ Coding structure for extensibility
■ Deployment scaling
■ Good ML engineers are good software engineers first
13. Model Code Structure
■ NLP and image tasks often require transforming input data
■ Data transformation at run-time is expensive
■ Model classes should not include this preprocessing logic
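A minimal sketch of this separation, using only the Python standard library. The vocabulary, the `encode` function, and the `SentimentModel` class are hypothetical names chosen for illustration; caching with `lru_cache` is one possible way to avoid repeating an expensive transformation at run-time, not the only one.

```python
from functools import lru_cache

# Hypothetical toy vocabulary, assumed for this example only.
VOCAB = {"good": 0, "bad": 1, "movie": 2}

@lru_cache(maxsize=10_000)
def encode(text: str) -> tuple:
    """Preprocessing lives outside the model: raw text -> word counts.

    The cache means a repeated input is transformed only once.
    """
    counts = [0] * len(VOCAB)
    for token in text.lower().split():
        idx = VOCAB.get(token)
        if idx is not None:
            counts[idx] += 1
    return tuple(counts)

class SentimentModel:
    """Operates on numeric features only; it never sees raw text."""

    def __init__(self, weights):
        self.weights = list(weights)

    def forward(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

model = SentimentModel([1.0, -1.0, 0.0])
score = model.forward(encode("Good movie good"))
```

Because the model only accepts numeric features, the encoding step can be swapped, cached, or precomputed offline without touching the model class.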
14. Use Classes To Encapsulate Models
■ Do use classes to encapsulate model training/prediction and model definitions
■ Separate training and prediction from the model
■ Don’t rely on ad-hoc linear scripts that do everything within a single file
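One way to lay this out is sketched below in plain Python, without any ML framework. `LinearModel`, `Trainer`, and `Predictor` are illustrative names, and the per-sample gradient descent is a stand-in for whatever training procedure a real project uses.

```python
class LinearModel:
    """Model definition only: parameters and the forward computation."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def forward(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias


class Trainer:
    """Owns the training loop and hyperparameters, not the model definition."""

    def __init__(self, model, lr=0.05):
        self.model = model
        self.lr = lr

    def fit(self, xs, ys, epochs=500):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                err = self.model.forward(x) - y  # gradient of 0.5 * err**2
                for i, xi in enumerate(x):
                    self.model.weights[i] -= self.lr * err * xi
                self.model.bias -= self.lr * err


class Predictor:
    """Thin inference wrapper; it needs the model but not the Trainer."""

    def __init__(self, model):
        self.model = model

    def predict(self, x):
        return self.model.forward(x)


model = LinearModel(n_features=1)
Trainer(model).fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
predictor = Predictor(model)
```

With this split, a deployment only has to ship `LinearModel` and `Predictor`; the training code can change freely during R&D.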
15. Separate Forward and Loss
■ Separate model forward computation and loss calculation
■ Optimizer and loss can change often during R&D
■ Forward function will be reused for inference
■ The forward function needs to be as efficient as possible
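The sketch below illustrates why the split helps: the loss is passed in as a plain function, so it can be swapped without editing the model. All names are hypothetical, and the finite-difference gradient is only a stand-in for the automatic differentiation a real framework would provide.

```python
def mse(pred, target):
    return (pred - target) ** 2

def mae(pred, target):
    return abs(pred - target)

class ScaleModel:
    """Forward computation only; it has no knowledge of any loss."""

    def __init__(self, w=0.0):
        self.w = w

    def forward(self, x):
        return self.w * x

def train_step(model, loss_fn, x, y, lr=0.1, eps=1e-6):
    """One update step with a pluggable loss.

    A central finite difference approximates the gradient here,
    assumed only to keep the example framework-free.
    """
    w = model.w
    model.w = w + eps
    up = loss_fn(model.forward(x), y)
    model.w = w - eps
    down = loss_fn(model.forward(x), y)
    grad = (up - down) / (2 * eps)
    model.w = w - lr * grad
    return loss_fn(model.forward(x), y)

mse_model = ScaleModel()
for _ in range(100):
    train_step(mse_model, mse, x=1.0, y=3.0)

mae_model = ScaleModel()
for _ in range(50):
    train_step(mae_model, mae, x=1.0, y=3.0)

# Inference reuses forward() alone; no loss code is involved.
prediction = mse_model.forward(2.0)
```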
16. Separate Batching and Single Compute
■ Model assumes tensor I/O only; do not include batching logic within a model
■ In TensorFlow and PyTorch, the data loader is a separate class that can include preprocessing logic and outputs an input batch
■ This should be included in the training class, not the model definition
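As a framework-free illustration of the same idea, the hypothetical `BatchLoader` below owns both preprocessing and batching, while the model only ever receives a ready-made numeric batch. It is a simplified stand-in written for this example, not the API of either framework.

```python
class BatchLoader:
    """Owns preprocessing and batching; yields lists of numeric rows."""

    def __init__(self, samples, batch_size, transform):
        self.samples = samples
        self.batch_size = batch_size
        self.transform = transform

    def __iter__(self):
        for start in range(0, len(self.samples), self.batch_size):
            chunk = self.samples[start:start + self.batch_size]
            yield [self.transform(sample) for sample in chunk]


class SumModel:
    """Assumes numeric batch input; contains no parsing or batching logic."""

    def forward(self, batch):
        return [sum(row) for row in batch]


def parse_csv_row(line):
    # Hypothetical preprocessing step: "1,2" -> [1.0, 2.0]
    return [float(value) for value in line.split(",")]


loader = BatchLoader(["1,2", "3,4", "5,6"], batch_size=2, transform=parse_csv_row)
model = SumModel()
outputs = [model.forward(batch) for batch in loader]
```

Changing the batch size or the input format only touches the loader, so the model definition stays stable across training and serving.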
17. Data Dependency
■ Source control (Git) tracks stateless logic changes as code
■ ML systems are stateful: they depend on data
[Diagram labels: ML Code, Training Data, Model Weights, Git, Inference Data, Prediction]
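One possible way to make the data state visible next to the code state is to record a content hash of the training data with every set of weights. The sketch below assumes a hypothetical manifest layout and a placeholder commit id; it is an illustration, not a description of any particular tool.

```python
import hashlib
import json

def data_fingerprint(rows):
    """Hash the training rows so a data change is as visible as a code change."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def build_manifest(code_version, rows, weights):
    """Bundle code version, data fingerprint, and weights (hypothetical layout)."""
    return {
        "code_version": code_version,
        "data_sha256": data_fingerprint(rows),
        "weights": weights,
    }

rows_v1 = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}]
rows_v2 = rows_v1 + [{"x": 3.0, "y": 9.0}]

# "abc1234" is a placeholder commit id, assumed for illustration.
manifest_v1 = build_manifest("abc1234", rows_v1, [2.0])
manifest_v2 = build_manifest("abc1234", rows_v2, [2.6])
```

Here both manifests share the same code version, yet their data fingerprints differ, which is exactly the kind of change that source control alone would not show.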
18. Data/Model Drift
“It’s not that I don’t understand, the world changes too fast”
– Cui Jian
■ Model captures training data assumptions
■ If the input changes, the model will break down
■ 1. Data format contract (strings instead of numbers)
■ 2. Data input distribution (Here be dragons)
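Both failure modes can be checked before a record reaches the model. The sketch below is a deliberately simple illustration: the schema, field names, and the mean-shift score (shift of the live mean measured in training standard deviations) are assumptions for this example, and real drift detection usually needs richer statistics.

```python
import statistics

def contract_violations(record, schema):
    """Return the fields whose value is missing or has the wrong type."""
    return [
        name for name, expected in schema.items()
        if not isinstance(record.get(name), expected)
    ]

def mean_shift(train_values, live_values):
    """Shift of the live mean, in units of the training standard deviation."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

# 1. Format contract: a string arrives where a number is expected.
schema = {"age": (int, float)}
bad_fields = contract_violations({"age": "42"}, schema)

# 2. Input distribution: the live data has moved away from the training data.
train_ages = [10, 11, 9, 10, 10, 12, 8]
stable_score = mean_shift(train_ages, [10, 9, 11])
drifted_score = mean_shift(train_ages, [15, 16, 14])
```

A format violation is easy to catch and fails loudly; a distribution shift passes every type check, which is why it needs its own monitoring.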
21. Data Dependency
■ Need to track input distribution assumptions
■ Metadata describing these assumptions should be captured with the model weights
[Diagram labels: Meta, Distribution, Monitor, ML Code, Training Data, Model Weights, Inference Data, Prediction]
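A minimal sketch of storing distribution metadata in the same artifact as the weights, so that a monitor can later compare inference data against it. The JSON layout, the chosen summary statistics, and the function names are all assumptions made for illustration.

```python
import json
import statistics

def export_model(weights, train_inputs):
    """Serialize weights together with a summary of the training inputs."""
    meta = {
        "mean": statistics.mean(train_inputs),
        "stdev": statistics.stdev(train_inputs),
        "min": min(train_inputs),
        "max": max(train_inputs),
    }
    return json.dumps({"weights": weights, "input_meta": meta})

def out_of_range_fraction(artifact_json, live_inputs):
    """Monitor: share of live inputs outside the range seen in training."""
    meta = json.loads(artifact_json)["input_meta"]
    outside = [x for x in live_inputs if not meta["min"] <= x <= meta["max"]]
    return len(outside) / len(live_inputs)

artifact = export_model([0.5], [1.0, 2.0, 3.0, 4.0])
fraction = out_of_range_fraction(artifact, [2.0, 5.0, 0.5, 3.0])
```

Because the metadata travels with the weights, the monitor does not need access to the original training data.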
23. ML systems grow together
■ Real-world systems are composites of many ML deployments
■ A single end-to-end model is not realistic
■ Multiple models are intimately linked by data distribution dependency
■ A change in a top-level output distribution will cause failure cascades
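To make the cascade concrete, the sketch below walks a dependency graph to list every model downstream of one whose output distribution changed. The graph and the model names are hypothetical, chosen only to illustrate the traversal.

```python
from collections import deque

# Hypothetical graph: each model maps to the models that consume its output.
DOWNSTREAM = {
    "ocr": ["entity_tagger"],
    "entity_tagger": ["relation_extractor", "search_ranker"],
    "relation_extractor": ["knowledge_graph"],
    "search_ranker": [],
    "knowledge_graph": [],
}

def affected_models(graph, changed):
    """Breadth-first walk listing every model downstream of `changed`."""
    seen = []
    queue = deque(graph.get(changed, []))
    while queue:
        model = queue.popleft()
        if model not in seen:
            seen.append(model)
            queue.extend(graph.get(model, []))
    return seen

impacted = affected_models(DOWNSTREAM, "ocr")
```

In this toy graph, a shift in the first model's output reaches all four of the others, even though none of their code changed.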
26. New Strategy Is Needed
■ Combine modular system design with awareness of ML system dependencies
■ Current coding practice only solves part of the problem
■ Better tools are needed to track multiple ML systems based on distribution analysis
■ Rethink engineering roles and organization
27. Conclusion
■ ML data dependency poses challenges at all levels of system engineering
■ ML system reliability is particularly critical in biomedical domains
■ ML deployment is a different beast from ML R&D
■ ML engineers will require a wider range of expertise