ML AutoDoc
Auto Documentation for ML Models
Nikhil Shekhar (Machine Learning Engineer)
Confidential2
Machine Learning Documentation
Challenges
• Tedious for data scientists
• Time Consuming
– Iterations and reviews
• Inconsistent
• Incomplete
• Error prone
• Compliance Requirements
– Banks
– Healthcare
ML AutoDoc Overview
• Automatically generates an editable Word document that records how the model was created (algorithms, techniques, data, etc.)
• Saves data science resources
– Automatically builds the required documentation
• Customize to your business needs
• Simple to use
AutoDoc Supported Products
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
H2O-3
H2O-3 ML AutoDoc
Supported Algorithms
• AutoDocs for Supervised Learning Models (H2O-3 and XGBoost)
– XGBoost
– Gradient Boosting Machine
– Generalized Linear Model
– Deep Learning
– Distributed Random Forest (including Extremely Randomized Forest)
– Stacked Ensembles
Packaging
• Integrated in H2O Steam
• Python Package
Docs
https://s3.amazonaws.com/artifacts.h2o.ai/snapshots/ai/h2o/ml_autodoc/1.0.1-6/steamguide/autodoc_intro.html
Panels: Prediction Stats; Partial Dependence; Feature Importance; Actual vs. Predicted
Generation Code Example
4 Lines of Code
Steam: ML AutoDoc
• Steam exposes a service that generates a model report (ML AutoDoc) for OSS H2O-3 models (e.g., GBM, AutoML)
– The same service is used in Driverless AI to generate documentation for DAI models
AutoDoc for OSS H2O-3 AutoML model
Scikit-learn Examples
Scikit-learn ML AutoDocs
• ML AutoDoc has initial support for third-party models: scikit-learn
• AutoDocs for Supervised Learning Models
• Scikit-learn Linear Models:
– LogisticRegression
• Scikit-learn Ensemble Methods:
– RandomForestClassifier
– GradientBoostingClassifier
– GradientBoostingRegressor
AutoDoc for OSS Scikit-learn Gradient Boosting Model
Panels: Partial Dependence; Response Rate; AUC; Shift Detection; Confusion Matrix
Driverless AI Examples
AutoDocs in Driverless AI
Algorithms Supported
• XGBoost, LightGBM, TensorFlow
Additional Features
• Included in Driverless AI
• Explainability
• Customized Reports
Panels: Feature Importance; Scoring Pipeline; Actual vs. Predicted; PDP/ICE
Driverless AI
Driverless AI AutoDocs
Experiment Summary
Data Overview
Shift Detection
Methodology
Validation Strategy
Model Tuning
Panels: All Models; More Details
Features and Feature Engineering
Final Model
Driverless AI AutoDoc
Experiment Overview
Data Overview
Methodology
Model Tuning
Panels: All Models; More Details
Features
Final Model
Customization
Model Diagnostics
• Additional Performance Metrics
– Population Stability Index
– Prediction Statistics per Quantile
– Actual vs Predicted Plots
– GINI Plot
• Diagnostics on New Datasets
– Perform model diagnostics on a list of new datasets
Model Interpretability
• Partial Dependence Plots
– Generate partial dependence plots on the n most important features
– Includes histogram with frequency of each feature
• Individual Conditional Expectation Plot
– Add partial dependence plot for specific records only
• Variable Importance
– Calculate variable importance on original features using Permutation Importance
– Filter variables to top relative importance or top n features
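The customization options above are switched on through a config_overrides string, as the following slides show one flag at a time. A minimal sketch of assembling several flags at once is given here. The helper build_config_overrides is hypothetical, the option names are assumed to be autodoc_* keys separated by newlines, and how the resulting string is handed to Driverless AI is not shown in the slides.

```python
def build_config_overrides(options):
    """Render a dict of options as newline-separated key=value text."""
    lines = []
    for key, value in options.items():
        # Booleans are written lowercase (true/false), matching the slides.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}={rendered}")
    return "\n".join(lines)


# Option names assumed from the slides that follow (hypothetical selection).
config_overrides = build_config_overrides({
    "autodoc_population_stability_index": True,
    "autodoc_prediction_stats": True,
    "autodoc_include_permutation_feature_importance": True,
    "autodoc_response_rate": True,
    "autodoc_gini_plot": True,
})
print(config_overrides)
```

Keeping the options in a dict makes it easy to see which report sections are enabled before the experiment is launched.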
Population Stability Index
# Calculate PSI
config_overrides += "\nautodoc_population_stability_index=true"
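For readers unfamiliar with the metric, here is a minimal sketch of the commonly used PSI formula, sum((actual% - expected%) * ln(actual% / expected%)) over bins. The binning scheme (quantile bins from the reference sample), the function name, and the eps floor are assumptions for illustration; the slides do not say how Driverless AI computes PSI internally.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, binned on quantiles of `expected`."""
    ordered = sorted(expected)
    # Interior bin edges at evenly spaced quantiles of the reference sample.
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = sum(v > e for e in edges)  # number of edges strictly below v
            counts[idx] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(a_shares, e_shares))


train_scores = [i / 100 for i in range(100)]
shifted_scores = [s + 0.5 for s in train_scores]
print(population_stability_index(train_scores, train_scores))
print(population_stability_index(train_scores, shifted_scores))
```

Identical samples give a PSI of zero, and the value grows as the two distributions drift apart.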
Prediction Statistics per Quantile
# Enable the Prediction Statistics for each dataset split
config_overrides += "\nautodoc_prediction_stats=true"
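One plausible reading of a per-quantile prediction table is sketched below: records are sorted by predicted value, cut into equal-sized groups, and summarized per group. The function name and the chosen columns are assumptions; the exact table produced by AutoDoc is not specified in the slides.

```python
from statistics import mean


def prediction_stats_by_quantile(actual, predicted, n_quantiles=10):
    """Summarize predictions and actuals within equal-sized prediction quantiles."""
    pairs = sorted(zip(predicted, actual))  # ascending by predicted value
    n = len(pairs)
    rows = []
    for q in range(n_quantiles):
        chunk = pairs[q * n // n_quantiles:(q + 1) * n // n_quantiles]
        if not chunk:
            continue  # fewer records than quantiles
        preds = [p for p, _ in chunk]
        acts = [a for _, a in chunk]
        rows.append({
            "quantile": q + 1,
            "count": len(chunk),
            "min_pred": min(preds),
            "max_pred": max(preds),
            "mean_pred": mean(preds),
            "mean_actual": mean(acts),
        })
    return rows


predicted = [i / 100 for i in range(100)]
actual = [1 if i >= 50 else 0 for i in range(100)]
for row in prediction_stats_by_quantile(actual, predicted):
    print(row)
```

Comparing mean_pred with mean_actual in each row shows where the model is over- or under-predicting.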
Variable Importance
# Enable the permutation feature importance table and plot
config_overrides += "\nautodoc_include_permutation_feature_importance=true"
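Permutation importance, in its generic form, shuffles one feature at a time and measures how much the model error increases. The sketch below illustrates that idea with a toy model; the function names, the mean-squared-error metric, and the single shuffle per feature are assumptions, not a description of the Driverless AI implementation.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_features, metric, seed=0):
    """Increase in error when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        score = metric(targets, [predict(r) for r in permuted])
        importances.append(score - baseline)
    return importances


# Toy model that uses only the first feature.
rows = [[i, (i * 7) % 13] for i in range(50)]
targets = [2 * r[0] for r in rows]
print(permutation_importance(lambda r: 2 * r[0], rows, targets, 2, mean_squared_error))
```

A feature the model ignores gets an importance of zero, since shuffling it leaves every prediction unchanged.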
Partial Dependence Plots
Partial Dependence Plots with ICE
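The relationship between the two plot types can be sketched briefly: an ICE curve varies one feature for a single record, and the partial dependence curve is the average of the ICE curves over all records. The function names and the toy model below are assumptions for illustration only.

```python
def ice_curves(predict, rows, feature_index, grid):
    """One curve per record: predictions as the chosen feature sweeps the grid."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)  # copy so the original record is untouched
            modified[feature_index] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(predict, rows, feature_index, grid):
    """Average of the ICE curves at each grid value."""
    curves = ice_curves(predict, rows, feature_index, grid)
    return [sum(c[k] for c in curves) / len(curves) for k in range(len(grid))]


def toy_model(row):
    return 3 * row[0] + row[1]


records = [[0, 1], [0, 3]]
grid = [0, 1, 2]
print(ice_curves(toy_model, records, 0, grid))
print(partial_dependence(toy_model, records, 0, grid))
```

Plotting individual ICE curves alongside the average makes it visible when specific records respond differently from the population as a whole.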
Response Rate Plots (Binary Use Cases Only)
# Enable the response rate plot for each dataset split
config_overrides += "\nautodoc_response_rate=true"
GINI Plot (Binary Use Cases Only)
# Show the Gini Plot
config_overrides += "\nautodoc_gini_plot=true"
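For a binary classifier, the Gini coefficient is commonly defined as 2 * AUC - 1. The sketch below computes it with a simple pairwise AUC; it assumes that definition and labels coded as 0/1, and it does not reproduce the plot itself or claim to match how AutoDoc derives it.

```python
def auc(actual, predicted):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(actual, predicted):
    """Gini coefficient under the assumed definition 2 * AUC - 1."""
    return 2 * auc(actual, predicted) - 1


labels = [0, 0, 1, 1]
print(gini(labels, [0.1, 0.2, 0.8, 0.9]))
print(gini(labels, [0.5, 0.5, 0.5, 0.5]))
```

The pairwise loop is quadratic in the number of records, which is fine for a sketch but would be replaced by a rank-based computation on large datasets.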

Automatic Model Documentation with H2O