For a machine learning application to be successful, it is not enough to give highly accurate predictions: customers also want to know why the model made a prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. There is often a trade-off between model accuracy and explainability: for example, the more complex your feature transformations become, the harder it is to explain to the end customer what the resulting features mean. With the right system design, however, this does not have to be a binary choice. It is possible to combine complex, even automatic, feature engineering with highly accurate models and good explanations. We will describe how we use lineage tracing at Salesforce Einstein to solve this problem, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into TransmogrifAI, an open source AutoML library that extends Spark MLlib, we ensure a consistent level of transparency across all of our ML applications. Because model explanations are provided out of the box, data scientists do not need to reinvent the wheel whenever explanations need to be surfaced.
3. Roadmap for this talk
Automation vs. Explanation:
● Why explain your model
● Why automate your modeling
● How to automate with explanation in mind
● What does this look like? Our solution: TransmogrifAI on Spark
● How to explain your model
9. Keep it DRY (don’t repeat yourself) and DRO (don’t repeat others)
https://transmogrif.ai/
10. Simple building blocks for automatic model generation
// Automated feature engineering
val featureVector = Seq(pClass, name, sex, age, sibSp, parch, ticket, cabin, embarked).transmogrify()
// Automated feature selection
val checkedFeatures = survived.sanityCheck(featureVector = featureVector, removeBadFeatures = true)
// Automated model selection
val prediction = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Model insights
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(prediction).train()
println("Model insights:\n" + model.modelInsights(prediction).prettyPrint())
// Add individual prediction insights
val predictionInsights = new RecordInsightsLOCO(model.getOriginStageOf(prediction)).setInput(prediction).getOutput()
val insights = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(predictionInsights)
  .withModelStages(model).train().score()
18. 24 Hours in the Life of Salesforce
● 1.5B emails sent
● 4M orders
● 260M social posts
● 3B Einstein predictions
● 888M commerce page views
● 44M reports and dashboards
● 41M case interactions
● 3M opportunities created
● 2M leads created
● 8M cases logged
Spanning B2C scale to B2B scale.
Source: Salesforce, March 2018.
19. The typical Machine Learning pipeline
ETL → Data Cleansing → Feature Engineering → Model Training → Model Evaluation → Deploy and Operationalize Models → Score and Update Models
21. We can’t build one global model
• Privacy concerns
  • Customers don’t want data cross-pollinated
• Business use cases
  • Industries are very different
  • Processes are different
• Platform customization
  • Ability to create custom fields and objects
• Scale, Automation
  • Ability to create
24. How to explain your model
Options:
● Feature weights / importance
● Model-agnostic feature impact
● Secondary models
● Feature / label interactions
● Global versus local granularity
Concerns:
● How interpretable is the model?
● How interpretable are the raw features?
● Do you care about explaining the model or gaining insights into the data?
● Do we need to explain individual predictions?
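As a sketch of the "model-agnostic feature impact" option above, here is a minimal leave-one-feature-out calculation in plain Scala. The toy linear `score` function, the weights, and all names are illustrative assumptions, not the TransmogrifAI API: any black-box scoring function could be substituted.

```scala
// Hypothetical sketch of model-agnostic feature impact: score the record with
// and without each feature; the difference is that feature's contribution.
object LocoSketch {
  // Toy linear model standing in for any trained scorer (illustrative weights).
  val weights = Map("gender" -> 1.2, "pClass" -> -0.8, "age" -> -0.3)

  def score(features: Map[String, Double]): Double =
    features.map { case (name, value) => weights.getOrElse(name, 0.0) * value }.sum

  // Impact of a feature = score with the feature minus score without it.
  def recordInsights(features: Map[String, Double]): Map[String, Double] = {
    val base = score(features)
    features.map { case (name, _) => name -> (base - score(features - name)) }
  }
}
```

For a passenger with `gender = 1.0` and `pClass = 3.0`, this attributes +1.2 to gender and -2.4 to cabin class, which is exactly the kind of per-record reason list shown later in the talk.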
27. Feature Weight / Importance / Impact
X1 X2 X3 X4 X5  Y
 0  1  0  0  0  A
 1  1  1  0  0  B
 0  0  1  1  0  B
 1  1  1  1  1  A
 1  0  1  0  0  A
28. Global vs local explanations
Titanic model top features:
● Gender
● Cabin class (pClass)
● Age
Titanic passenger top features (ranking x value):
● Prediction = 1 (survived), Reasons = female, 1st Class
● Prediction = 0 (died), Reasons = male, 3rd Class
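One way to relate the two granularities, as a hedged sketch rather than the TransmogrifAI implementation: global top features can be approximated by averaging the absolute per-record contributions and ranking. All names and inputs here are illustrative.

```scala
// Hypothetical sketch: roll local (per-record) insights up into global top
// features by averaging the magnitude of each feature's contribution.
def globalTopFeatures(recordInsights: Seq[Map[String, Double]], k: Int): Seq[String] = {
  val averages = recordInsights
    .flatMap(_.toSeq)                       // flatten to (feature, contribution) pairs
    .groupBy { case (feature, _) => feature }
    .map { case (feature, contribs) =>
      feature -> contribs.map { case (_, c) => math.abs(c) }.sum / recordInsights.size
    }
  averages.toSeq.sortBy { case (_, avg) => -avg }.take(k).map(_._1)
}

// Two illustrative passengers with opposite-signed local contributions:
val localInsights = Seq(
  Map("gender" -> 0.9, "pClass" -> -0.5, "age" -> 0.1),
  Map("gender" -> -0.8, "pClass" -> 0.6, "age" -> -0.2)
)
```

Averaging magnitudes (rather than raw values) keeps features that push different records in different directions, like gender above, from cancelling out in the global ranking.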
30. Issues with Feature Importance / Weight / Impact
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_toolbox/multicollinearity.htm
31. A secondary model that tells you about your data
https://www.statmethods.net/advgraphs/images/corrgram1.png
33. Where did you get the feature matrix?
X1 X2 X3 X4 X5  Y
 0  1  0  0  0  A
 1  1  1  0  0  B
 0  0  1  1  0  B
 1  1  1  1  1  A
 1  0  1  0  0  A
34. Making a DAG for feature engineering - and tracking it!
val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).transmogrify()
• Each feature is mapped to an appropriate transmogrify() stage based on its type
  • gender (a Picklist) and age (an Integral) are automatically assigned to different stages
• Metadata is updated with what happened at each step of transmogrification
  • age_bucket_0-10 (parentFeature = age, parentType = integral, grouping = age, value = 0-10)
• Metadata is combined with feature importance measures to produce insights
  • age max contribution = -0.27
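To illustrate the kind of tracking described above, here is a minimal sketch of lineage metadata being attached to bucketized columns. `ColumnMeta` and `bucketize` are hypothetical stand-ins for illustration, not the TransmogrifAI internals.

```scala
// Hypothetical sketch of lineage tracking: every derived column carries
// metadata about the raw feature and transformation that produced it, so
// model importances can later be mapped back to raw features.
case class ColumnMeta(name: String, parentFeature: String, parentType: String,
                      grouping: String, value: String)

// Bucketizing a feature produces one indicator column per bucket, each
// tagged with its origin.
def bucketize(feature: String, parentType: String, buckets: Seq[String]): Seq[ColumnMeta] =
  buckets.map { b =>
    ColumnMeta(s"${feature}_bucket_$b", parentFeature = feature,
               parentType = parentType, grouping = feature, value = b)
  }

val ageColumns = bucketize("age", "integral", Seq("0-10", "10-20", ">20"))
// First column: ColumnMeta("age_bucket_0-10", "age", "integral", "age", "0-10")
```

The point is that the metadata is produced by the transformation stage itself, so however deep the DAG gets, each output column still knows where it came from.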
36. Automated feature engineering DAG (with metadata!!)
[Diagram: raw features Zipcode, Subject, Phone, Email, and Age feed derived columns - Age [0-15], Age [15-35], Age [>35], Email Is Spammy, Top 10 Email Domain, Country Code, Phone Is Valid, Top TF-IDF Terms, Average Income - which are assembled into a single feature Vector]
For every column in the vector, the metadata records:
● The name of the RAW feature(s) the column was made from
● The name of the feature the column was made from
● Everything you did to get the column
● Any grouping information across columns
● A description of the value in the column
37. Combining the origin metadata with model interpretations
Information we add:
● Correlation
● Mutual information
● Feature weight / importance
● Feature distribution description
● Feature contribution to each score (optionally)
39. Example input data
case class Passenger
(
id: Id,
survived: RealNN,
pClass: Integral,
name: Text,
sex: Picklist,
age: Integral,
sibSp: Integral,
parCh: Integral,
ticket: Id,
fare: Currency,
cabin: Picklist,
embarked: Picklist
)
40. Example Code
// Automated feature engineering
val featureVector = Seq(pClass, name, sex, age, sibSp, parch, ticket, cabin, embarked).transmogrify()
// Automated feature selection
val checkedFeatures = survived.sanityCheck(featureVector = featureVector, removeBadFeatures = true)
// Automated model selection
val prediction = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Model insights
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(prediction).train()
println("Model insights:\n" + model.modelInsights(prediction).prettyPrint())
// Add individual prediction insights
val predictionInsights = new RecordInsightsLOCO(model.getOriginStageOf(prediction)).setInput(prediction).getOutput()
val insights = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(predictionInsights)
  .withModelStages(model).train().score()
41. Example metadata (values for one column shown as comments)
case class OpVectorColumnHistory
(
  columnName: String,                // age_modeFill_bucketize_0-10
  parentFeatureName: Seq[String],    // Seq(age_modeFill)
  parentFeatureOrigins: Seq[String], // Seq(age)
  parentFeatureStages: Seq[String],  // Seq(modeFill, bucketize)
  parentFeatureType: Seq[String],    // Seq(Integral)
  grouping: Option[String],          // Some(age_modeFill)
  indicatorValue: Option[String],    // Some(0-10)
  descriptorValue: Option[String],   // None
  index: Int                         // 17
)
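As a sketch of how such column history can be joined with per-column model importances to yield a readable insight line: the `ColumnHistory` fields below mirror the example above, but the rendering logic is an illustrative assumption, not the TransmogrifAI implementation.

```scala
// Hypothetical sketch: combine a column's lineage metadata with its model
// importance (looked up by vector index) into a human-readable insight.
case class ColumnHistory(columnName: String,
                         parentFeatureOrigins: Seq[String],
                         parentFeatureStages: Seq[String],
                         indicatorValue: Option[String],
                         index: Int)

def insight(history: ColumnHistory, importances: Map[Int, Double]): String = {
  val origin = history.parentFeatureOrigins.mkString(", ")       // raw feature(s)
  val stages = history.parentFeatureStages.mkString(" -> ")      // what was done to them
  val value  = history.indicatorValue.map(v => s" = $v").getOrElse("")
  val weight = importances.getOrElse(history.index, 0.0)
  s"$origin ($stages)$value: importance $weight"
}

val h = ColumnHistory("age_modeFill_bucketize_0-10", Seq("age"),
                      Seq("modeFill", "bucketize"), Some("0-10"), 17)
// insight(h, Map(17 -> -0.27)) renders the example column above as
// "age (modeFill -> bucketize) = 0-10: importance -0.27"
```

Because every column keeps its full lineage, the insight is phrased in terms of the raw feature the customer recognizes (age), not the opaque engineered column name.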
44. So what is the point?
● The choice is not binary - you can have automation and explanation
● It takes a lot of work and tracking to get good explanations
● You can skip that and just use our solution :-)