VIZ4ML
THE ROLE OF DATA VISUALIZATION IN
IMPROVING MACHINE LEARNING MODELS
BSidesLV 2017
Phil Roth
Or...
Screenshots of this internal
visualization tool I built to test
MalwareScore along with lessons I
learned building it
whoami
Phil Roth
@mrphilroth
proth@endgame.com
PhD in physics using ML
Radar Imager
Data Scientist at Endgame
MalwareScore and
Bit Inspector
MalwareScore
MalwareScore is a
machine-learning-first solution built
for detecting and preventing malware.
MalwareScore
Static features
Deployed to customer
machines
Available at VirusTotal
https://www.virustotal.com/
Bit Inspector
Bit Inspector is an internal tool for
communicating progress, soliciting
feedback, and identifying errors
related to MalwareScore.
Bit Inspector
Built with
Flask (http://flask.pocoo.org/)
D3.js (https://d3js.org/)
matplotlib (https://matplotlib.org/)
seaborn (https://seaborn.pydata.org/)
Connects to multiple internal data and
processing resources
Sample Page
Model Page
Basic Visualizations
ROC Curve
Receiver Operating Characteristic (ROC) curve
Created by plotting the true
positive rate (TPR) against the
false positive rate (FPR) at
various threshold settings.
The area under the curve (AUC) summarizes
performance in a single number.
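To make the threshold sweep concrete, here is a minimal pure-Python sketch of how the points on such a curve, and the area under it, can be computed. The function names, labels, and scores are made up for illustration and are not from MalwareScore.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per candidate threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sweep thresholds from the highest score down, starting at (0, 0).
    thresholds = sorted(set(scores), reverse=True)
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve, using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area


# Made-up labels (1 = positive class) and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
area = auc(fpr, tpr)
print(list(zip(fpr, tpr)), area)
```

Each threshold yields one (FPR, TPR) point; lowering the threshold can only move the point up or to the right.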
ROC Curve
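import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
# metricsdf: DataFrame holding holdout labels (y) and model scores (y_pred_holdout)
y_true = metricsdf.y
y_pred = metricsdf.y_pred_holdout
# roc_curve returns FPR, TPR, and the thresholds (unused here)
fpr_plot, tpr_plot, _ = roc_curve(y_true, y_pred)
plt.plot(fpr_plot, tpr_plot, lw=2, color="k")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")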
Confusion Matrix
A table where columns represent
the predicted class and rows
represent the actual class.
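As a small illustration of that layout, the sketch below builds a 2x2 confusion matrix by hand from made-up labels and scores. The function name and the 0.5 threshold are assumptions chosen for the example.

```python
def confusion_matrix_2x2(y_true, scores, threshold):
    """Rows are the actual class, columns the predicted class (0, then 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, score in zip(y_true, scores):
        predicted = 1 if score > threshold else 0
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels and scores; the threshold is an illustrative choice.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.35, 0.8, 0.9, 0.2]
matrix = confusion_matrix_2x2(y_true, scores, threshold=0.5)
(tn, fp), (fn, tp) = matrix
print(matrix)
```

Changing the threshold moves samples between columns but never between rows, since the actual class is fixed.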
Confusion Matrix
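import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
# metricsdf: DataFrame holding holdout labels (y) and model scores (y_pred_holdout)
y_true = metricsdf.y
y_pred = metricsdf.y_pred_holdout
# threshold: score cutoff above which a sample counts as a positive prediction
matrix = confusion_matrix(y_true, y_pred > threshold)
# sns.heatmap returns a matplotlib Axes, not a Figure
ax = sns.heatmap(matrix, annot=True, fmt="d")
ax.invert_yaxis()
plt.xlabel("Predicted")
plt.ylabel("Actual")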
Role of Data Visualization
Feature Experimentation
Byte Histogram
Sliding Window Byte Entropy
0 3 0 0 0 4 0 0 0 255 255 0 0 184 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 216 0 0 77 90 144 14 31 186 14 0 180 9
205 33 184 1 76 205 33 84 104 105 115 32 112 114 111 103 114 97 109 32 99
97 110 110 111 116 32 98 101 32 117 110 32 105 110 32 68 79 83 32 109 111
100 101 46 13 13 10 36 0 0 0 0 0 0 0 49 184 132 58 117 217 234 105 117 217
234 105 11 217 234 105 182 214 181 105 119 217 234 105 117 217 235 105
238 217 234 105 182 214 183 105 100 217 234 105 33 250 218 105 127 217
234 105 178 223 236 105 116 217 234 105 82 105 99 104 117 21 234 105 0 2
https://arxiv.org/pdf/1508.03096.pdf
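A rough sketch of the two features named on this slide, written in plain Python. The window and step sizes are illustrative assumptions, not the values used in MalwareScore or in the linked paper, and the sample input is just zero padding followed by a DOS stub message like the one encoded in the byte dump above.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Count how often each of the 256 possible byte values occurs."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts


def entropy(window):
    """Shannon entropy of a byte window, in bits (between 0.0 and 8.0)."""
    total = len(window)
    result = 0.0
    for count in Counter(window).values():
        p = count / total
        result -= p * math.log2(p)
    return result


def sliding_entropy(data, window=16, step=8):
    """Entropy of each window; window and step are illustrative values."""
    return [entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)]


# Sample input: zero padding followed by a DOS stub message.
data = bytes(32) + b"This program cannot be run in DOS mode."
hist = byte_histogram(data)
entropies = sliding_entropy(data)
print(hist[0], [round(e, 2) for e in entropies])
```

Windows that cover only padding have zero entropy, while windows over the text score higher, which is the kind of contrast a plot of this feature makes visible.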
Model Performance
Communicating results and performance
Finding Problems
There’s a rudimentary system for gathering
feedback and using it in future model trainings
Tracking Solutions
For each model,
problem areas
are broken out
and analyzed
separately
Visualization Time Budgets
Visualization Time Budget
Explainability
Trustworthiness
Beauty
Explainability
Can this visualization be
understood on its own?
Annotations
Explanations in readable
prose
https://xkcd.com/1732/
…
Trustworthiness
https://www.economist.com/blogs/dailychart/2010/11/us_human_development_state
Can you trust the source of
this visualization?
Consistent styling
Data sources listed
Logos
Beauty
https://pudding.cool/2017/02/vocabulary/
Audiences
Yourself
The purpose of the visualization is to
convince yourself you’ve done
something useful.
[Chart: time budget for Explainability, Trustworthiness, and Beauty, each on a scale from Less Time to More Time]
Yourself
“The first principle is that you must
not fool yourself – and you are the easiest
person to fool.”
Richard Feynman
Yourself
Model Building Process
Try something
Visualize and inspect the results
Looks wrong
Think critically
Woohoo! I’m done!
How might I be fooling myself?
Data Science Team
Purpose is to communicate what you’ve done
and get feedback you didn’t consider
Add context (data sources, model parameters)
[Chart: time budget for Explainability, Trustworthiness, and Beauty, each on a scale from Less Time to More Time]
Domain Experts
[Chart: time budget for Explainability, Trustworthiness, and Beauty, each on a scale from Less Time to More Time]
Same as for the data science team, but now the context is domain-specific.
For me, that means hashes, PE header information, links to
VirusTotal, etc.
Managers and Executives
[Chart: time budget for Explainability, Trustworthiness, and Beauty, each on a scale from Less Time to More Time]
Purpose is to communicate progress and
current performance.
Public
[Chart: time budget for Explainability, Trustworthiness, and Beauty, each on a scale from Less Time to More Time]
Tools and Resources
Python Plotting
http://pythonplot.com/
Comparison of plotting
syntax between:
pandas
matplotlib
plotnine
ggplot2 (R)
altair (planned)
Jupyter Notebooks
http://jupyter.org/
Excellent for exploratory data
analysis
Changes can be made and the
results will update as fast as
the code can run
Kibana
https://www.elastic.co/products/kibana
Allows for rapidly
building constantly
updating dashboards
Works best when
querying against data
that’s in Elasticsearch
Internal tool by @laborious_dtg
D3.js
https://d3js.org/
JavaScript
Probably requires data
translation
Large time commitment.
Payoff is the customization
possibilities (and thus
trustworthiness/beauty).
Yellowbrick
https://github.com/DistrictDataLabs/yellowbrick
Yellowbrick
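from yellowbrick.features import Rank2D
# X: feature matrix, y: labels, features: list of feature names
# Instantiate the visualizer
visualizer = Rank2D(features=features, algorithm='covariance')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.poof() # Draw/show/poof the data (renamed show() in Yellowbrick 1.0 and later)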
The idea is to have prebaked
model evaluation visualizations
that adhere to the scikit-learn
API
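To show what adhering to that API means in practice, here is a hypothetical, dependency-free stand-in. It is not Yellowbrick code: the class name, the `describe` method, and the data are invented for illustration, and it only mimics the fit/transform convention while computing pairwise feature covariance by hand.

```python
class CovarianceRanker:
    """Hypothetical visualizer-like object following the fit/transform convention."""

    def __init__(self, features):
        self.features = features  # feature names, used only for labeling

    def fit(self, X, y=None):
        """Compute the sample covariance of every feature pair, then return self."""
        n = len(X)
        columns = list(zip(*X))
        means = [sum(col) / n for col in columns]
        self.ranks_ = [
            [sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b)) / (n - 1)
             for col_b, mean_b in zip(columns, means)]
            for col_a, mean_a in zip(columns, means)
        ]
        return self

    def transform(self, X):
        """Return the data unchanged, so the object could sit inside a pipeline."""
        return X

    def describe(self):
        """Text stand-in for drawing: one line per distinct feature pair."""
        lines = []
        for i, name_i in enumerate(self.features):
            for j in range(i + 1, len(self.features)):
                lines.append(f"{name_i} vs {self.features[j]}: {self.ranks_[i][j]:.2f}")
        return lines


# Made-up data: the second feature is exactly twice the first.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
ranker = CovarianceRanker(features=["a", "b"])
fitted = ranker.fit(X)
report = ranker.describe()
print(report)
```

Because `fit` returns the object itself and `transform` passes the data through, an object like this can be chained or dropped into existing scikit-learn style workflows.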
Facets
https://github.com/pair-code/facets
It’s early, but so far this looks like the
best method for truly responsive
Exploratory Data Analysis that I’ve seen
Facets
Time for a demo?
THANK YOU
proth@endgame.com @mrphilroth

Data Visualization for Machine Learning