For a machine learning application to be successful, it is not enough to give highly accurate predictions: customers also want to know why the model made a prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. There is often a trade-off between model accuracy and explainability: for example, the more complex your feature transformations become, the harder it is to explain to the end customer what the resulting features mean. With the right system design, however, this does not have to be a binary choice. It is possible to combine complex, even automatic, feature engineering with highly accurate models and good explanations. We will describe how we use lineage tracing at Salesforce Einstein to solve this problem, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into TransmogrifAI, an open source AutoML library that extends Spark MLlib, we ensure a consistent level of transparency across all of our ML applications. Because model explanations are provided out of the box, data scientists do not need to reinvent the wheel whenever explanations need to be surfaced.
3. Roadmap for this talk
Automation vs. Explanation:
● Why explain your model
● Why automate your modeling
● How to automate with explanation in mind
● What does this look like? Our solution: TransmogrifAI on Spark
● How to explain your model
9. Keep it DRY (don’t repeat yourself) and DRO (don’t repeat others)
https://transmogrif.ai/
10. Simple building blocks for automatic model generation
// Automated feature engineering
val featureVector = Seq(pClass, name, sex, age, sibSp, parch, ticket, cabin, embarked).transmogrify()
// Automated feature selection
val checkedFeatures = survived.sanityCheck(featureVector = featureVector, removeBadFeatures = true)
// Automated model selection
val prediction = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Model insights
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(prediction).train()
println("Model insights:\n" + model.modelInsights(prediction).prettyPrint())
// Add individual prediction insights
val predictionInsights = new RecordInsightsLOCO(model.getOriginStageOf(prediction)).setInput(prediction).getOutput()
val insights = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(predictionInsights)
  .withModelStages(model).train().score()
18. 24 Hours in the Life of Salesforce
● 1.5B emails sent
● 4M orders
● 260M social posts
● 3B Einstein predictions
● 888M commerce page views
● 44M reports and dashboards
● 41M case interactions
● 3M opportunities created
● 2M leads created
● 8M cases logged
Spanning B2C scale to B2B scale.
Source: Salesforce, March 2018.
19. The typical Machine Learning pipeline
ETL → Data Cleansing → Feature Engineering → Model Training → Model Evaluation → Deploy and Operationalize Models → Score and Update Models
21. We can’t build one global model
• Privacy concerns
  • Customers don’t want data cross-pollinated
• Business use cases
  • Industries are very different
  • Processes are different
• Platform customization
  • Ability to create custom fields and objects
• Scale, Automation
  • Ability to create
24. How to explain your model
Options:
● Feature weights / importance
● Model-agnostic feature impact
● Secondary models
● Feature / label interactions
● Global versus local granularity
Concerns:
● How interpretable is the model?
● How interpretable are the raw features?
● Do you care about explaining the model or gaining insights into the data?
● Do we need to explain individual predictions?
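As a sketch of the "model-agnostic feature impact" option above, here is a minimal leave-one-feature-out calculation in plain Scala. The toy linear `score` function, the weights, and all names are illustrative assumptions, not the TransmogrifAI API: any black-box scoring function could be substituted.

```scala
// Hypothetical sketch of model-agnostic feature impact: score the record with
// and without each feature; the difference is that feature's contribution.
object LocoSketch {
  // Toy linear model standing in for any trained scorer (illustrative weights).
  val weights = Map("gender" -> 1.2, "pClass" -> -0.8, "age" -> -0.3)

  def score(features: Map[String, Double]): Double =
    features.map { case (name, value) => weights.getOrElse(name, 0.0) * value }.sum

  // Impact of a feature = score with the feature minus score without it.
  def recordInsights(features: Map[String, Double]): Map[String, Double] = {
    val base = score(features)
    features.map { case (name, _) => name -> (base - score(features - name)) }
  }
}
```

For a passenger with `gender = 1.0` and `pClass = 3.0`, this attributes +1.2 to gender and -2.4 to cabin class, which is exactly the kind of per-record reason list shown later in the talk.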
27. Feature Weight / Importance / Impact
X1 X2 X3 X4 X5  Y
 0  1  0  0  0  A
 1  1  1  0  0  B
 0  0  1  1  0  B
 1  1  1  1  1  A
 1  0  1  0  0  A
28. Global vs local explanations
Titanic model top features:
● Gender
● Cabin class (pClass)
● Age
Titanic passenger top features (ranking x value):
● Prediction = 1 (survived), Reasons = female, 1st Class
● Prediction = 0 (died), Reasons = male, 3rd Class
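One way to relate the two granularities, as a hedged sketch rather than the TransmogrifAI implementation: global top features can be approximated by averaging the absolute per-record contributions and ranking. All names and inputs here are illustrative.

```scala
// Hypothetical sketch: roll local (per-record) insights up into global top
// features by averaging the magnitude of each feature's contribution.
def globalTopFeatures(recordInsights: Seq[Map[String, Double]], k: Int): Seq[String] = {
  val averages = recordInsights
    .flatMap(_.toSeq)                       // flatten to (feature, contribution) pairs
    .groupBy { case (feature, _) => feature }
    .map { case (feature, contribs) =>
      feature -> contribs.map { case (_, c) => math.abs(c) }.sum / recordInsights.size
    }
  averages.toSeq.sortBy { case (_, avg) => -avg }.take(k).map(_._1)
}

// Two illustrative passengers with opposite-signed local contributions:
val localInsights = Seq(
  Map("gender" -> 0.9, "pClass" -> -0.5, "age" -> 0.1),
  Map("gender" -> -0.8, "pClass" -> 0.6, "age" -> -0.2)
)
```

Averaging magnitudes (rather than raw values) keeps features that push different records in different directions, like gender above, from cancelling out in the global ranking.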
30. Issues with Feature Importance / Weight / Impact
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_toolbox/multicollinearity.htm
31. A secondary model that tells you about your data
https://www.statmethods.net/advgraphs/images/corrgram1.png
33. Where did you get the feature matrix?
X1 X2 X3 X4 X5  Y
 0  1  0  0  0  A
 1  1  1  0  0  B
 0  0  1  1  0  B
 1  1  1  1  1  A
 1  0  1  0  0  A
34. Making a DAG for feature engineering - and tracking it!
val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).transmogrify()
• Each feature is mapped to an appropriate transmogrify() stage based on its type
  • gender (a Picklist) and age (an Integral) are automatically assigned to different stages
• Metadata is updated with what happened at each step of transmogrification
  • age_bucket_0-10 (parentFeature = age, parentType = integral, grouping = age, value = 0-10)
• Metadata is combined with feature importance measures to produce insights
  • age max contribution = -0.27
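To illustrate the kind of tracking described above, here is a minimal sketch of lineage metadata being attached to bucketized columns. `ColumnMeta` and `bucketize` are hypothetical stand-ins for illustration, not the TransmogrifAI internals.

```scala
// Hypothetical sketch of lineage tracking: every derived column carries
// metadata about the raw feature and transformation that produced it, so
// model importances can later be mapped back to raw features.
case class ColumnMeta(name: String, parentFeature: String, parentType: String,
                      grouping: String, value: String)

// Bucketizing a feature produces one indicator column per bucket, each
// tagged with its origin.
def bucketize(feature: String, parentType: String, buckets: Seq[String]): Seq[ColumnMeta] =
  buckets.map { b =>
    ColumnMeta(s"${feature}_bucket_$b", parentFeature = feature,
               parentType = parentType, grouping = feature, value = b)
  }

val ageColumns = bucketize("age", "integral", Seq("0-10", "10-20", ">20"))
// First column: ColumnMeta("age_bucket_0-10", "age", "integral", "age", "0-10")
```

The point is that the metadata is produced by the transformation stage itself, so however deep the DAG gets, each output column still knows where it came from.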
36. Automated feature engineering DAG (with metadata!!)
[Diagram: raw features Zipcode, Subject, Phone, Email, and Age feed derived columns - Age [0-15], Age [15-35], Age [>35], Email Is Spammy, Top 10 Email Domain, Country Code, Phone Is Valid, Top TF-IDF Terms, Average Income - which are assembled into a single feature Vector]
For every column in the vector, the metadata records:
● The name of the RAW feature(s) the column was made from
● The name of the feature the column was made from
● Everything you did to get the column
● Any grouping information across columns
● A description of the value in the column
37. Combining the origin metadata with model interpretations
Information we add:
● Correlation
● Mutual information
● Feature weight / importance
● Feature distribution description
● Feature contribution to each score (optionally)
39. Example input data
case class Passenger
(
id: Id,
survived: RealNN,
pClass: Integral,
name: Text,
sex: Picklist,
age: Integral,
sibSp: Integral,
parCh: Integral,
ticket: Id,
fare: Currency,
cabin: Picklist,
embarked: Picklist
)
40. Example Code
// Automated feature engineering
val featureVector = Seq(pClass, name, sex, age, sibSp, parch, ticket, cabin, embarked).transmogrify()
// Automated feature selection
val checkedFeatures = survived.sanityCheck(featureVector = featureVector, removeBadFeatures = true)
// Automated model selection
val prediction = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Model insights
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(prediction).train()
println("Model insights:\n" + model.modelInsights(prediction).prettyPrint())
// Add individual prediction insights
val predictionInsights = new RecordInsightsLOCO(model.getOriginStageOf(prediction)).setInput(prediction).getOutput()
val insights = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(predictionInsights)
  .withModelStages(model).train().score()
41. Example metadata (values for one column shown as comments)
case class OpVectorColumnHistory
(
  columnName: String,                // age_modeFill_bucketize_0-10
  parentFeatureName: Seq[String],    // Seq(age_modeFill)
  parentFeatureOrigins: Seq[String], // Seq(age)
  parentFeatureStages: Seq[String],  // Seq(modeFill, bucketize)
  parentFeatureType: Seq[String],    // Seq(Integral)
  grouping: Option[String],          // Some(age_modeFill)
  indicatorValue: Option[String],    // Some(0-10)
  descriptorValue: Option[String],   // None
  index: Int                         // 17
)
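As a sketch of how such column history can be joined with per-column model importances to yield a readable insight line: the `ColumnHistory` fields below mirror the example above, but the rendering logic is an illustrative assumption, not the TransmogrifAI implementation.

```scala
// Hypothetical sketch: combine a column's lineage metadata with its model
// importance (looked up by vector index) into a human-readable insight.
case class ColumnHistory(columnName: String,
                         parentFeatureOrigins: Seq[String],
                         parentFeatureStages: Seq[String],
                         indicatorValue: Option[String],
                         index: Int)

def insight(history: ColumnHistory, importances: Map[Int, Double]): String = {
  val origin = history.parentFeatureOrigins.mkString(", ")       // raw feature(s)
  val stages = history.parentFeatureStages.mkString(" -> ")      // what was done to them
  val value  = history.indicatorValue.map(v => s" = $v").getOrElse("")
  val weight = importances.getOrElse(history.index, 0.0)
  s"$origin ($stages)$value: importance $weight"
}

val h = ColumnHistory("age_modeFill_bucketize_0-10", Seq("age"),
                      Seq("modeFill", "bucketize"), Some("0-10"), 17)
// insight(h, Map(17 -> -0.27)) renders the example column above as
// "age (modeFill -> bucketize) = 0-10: importance -0.27"
```

Because every column keeps its full lineage, the insight is phrased in terms of the raw feature the customer recognizes (age), not the opaque engineered column name.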
44. So what is the point?
● The choice is not binary - you can have automation and explanation
● It takes a lot of work and tracking to get good explanations
● You can skip that and just use our solution :-)