Data Visualization for Machine Learning

I talk about lessons I've learned maintaining Bit Inspector, an internal data visualization tool meant to aid the building of MalwareScore.

Published in: Technology

  1. VIZ4ML: The Role of Data Visualization in Improving Machine Learning Models. BSidesLV 2017. Phil Roth.
  2. Or… screenshots of the internal visualization tool I built to test MalwareScore, along with lessons I learned building it.
  3. whoami: Phil Roth (@mrphilroth, proth@endgame.com). PhD in physics using ML. Radar Imager. Data Scientist at Endgame.
  4. MalwareScore and Bit Inspector
  5. MalwareScore: MalwareScore is a machine-learning-first solution built for detecting and preventing malware.
  6. MalwareScore: Static features. Deployed to customer machines. Available at VirusTotal (https://www.virustotal.com/).
  7. Bit Inspector: Bit Inspector is an internal tool for communicating progress, soliciting feedback, and identifying errors related to MalwareScore.
  8. Bit Inspector: Built with Flask (http://flask.pocoo.org/), D3.js (https://d3js.org/), matplotlib (https://matplotlib.org/), and seaborn (https://seaborn.pydata.org/). Connects to multiple internal data and processing resources.
  9. Sample Page
  10. Model Page
  11. Basic Visualizations
  12. ROC Curve: Area Under the Receiver Operating Characteristic Curve. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
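  13. ROC Curve
        import seaborn as sns  # imported on the slide but not called in this snippet
        import matplotlib.pyplot as plt
        from sklearn.metrics import roc_curve

        # metricsdf is the author's DataFrame of labels and holdout scores (not defined on the slide)
        y_true = metricsdf.y
        y_pred = metricsdf.y_pred_holdout
        fpr_plot, tpr_plot, _ = roc_curve(y_true, y_pred)
        plt.plot(fpr_plot, tpr_plot, lw=2, color="k")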
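
The slide code relies on scikit-learn's `roc_curve`. As a rough illustration of what "plotting TPR against FPR at various threshold settings" means, here is a minimal pure-Python sketch. The function names `roc_points` and `auc` and the toy labels and scores are assumptions made up for this example; they are not part of Bit Inspector or MalwareScore.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists, one point per distinct score threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sweep thresholds from high to low; a sample is predicted positive
    # when its score is >= the threshold.
    thresholds = sorted(set(y_score), reverse=True)
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing is flagged
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


# Toy labels (1 = malicious, 0 = benign) and scores, invented for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)
roc_auc = auc(fpr, tpr)
```

Each threshold contributes one point to the curve, so lowering the threshold can only move the curve up (more true positives) or to the right (more false positives).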
  14. Confusion Matrix: A table where columns represent the predicted class and rows represent the actual class.
  15. Confusion Matrix
        import seaborn as sns
        import matplotlib.pyplot as plt
        from sklearn.metrics import confusion_matrix

        # metricsdf and threshold are the author's objects (not defined on the slide)
        y_true = metricsdf.y
        y_pred = metricsdf.y_pred_holdout
        matrix = confusion_matrix(y_true, y_pred > threshold)
        fig = sns.heatmap(matrix, annot=True, fmt="d")  # heatmap returns a matplotlib Axes
        fig.invert_yaxis()
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
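
To make the row and column convention concrete, the following is a small pure-Python sketch of a two-class confusion matrix with rows as the actual class and columns as the predicted class, thresholding scores the same way the slide does (`y_pred > threshold`). The helper name `confusion_matrix_2x2` and the toy data are assumptions for illustration only.

```python
def confusion_matrix_2x2(y_true, y_score, threshold):
    """Rows are the actual class, columns the predicted class (0 then 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, score in zip(y_true, y_score):
        predicted = 1 if score > threshold else 0
        matrix[actual][predicted] += 1
    return matrix


# Toy labels (1 = malicious, 0 = benign) and scores, invented for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.35, 0.8, 0.9, 0.2]
matrix = confusion_matrix_2x2(y_true, y_score, threshold=0.5)

# Unpack in the row-major order used above.
(tn, fp), (fn, tp) = matrix
```

Reading the matrix row by row gives true negatives and false positives for the benign class, then false negatives and true positives for the malicious class.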
  16. Role of Data Visualization
  17. Feature Experimentation: Byte Histogram. Sliding Window Byte Entropy. Source: https://arxiv.org/pdf/1508.03096.pdf
      Raw bytes shown on the slide:
      0 3 0 0 0 4 0 0 0 255 255 0 0 184 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 216 0 0 77 90 144 14 31 186 14 0 180 9 205 33 184 1 76 205 33 84 104 105 115 32 112 114 111 103 114 97 109 32 99 97 110 110 111 116 32 98 101 32 117 110 32 105 110 32 68 79 83 32 109 111 100 101 46 13 13 10 36 0 0 0 0 0 0 0 49 184 132 58 117 217 234 105 117 217 234 105 11 217 234 105 182 214 181 105 119 217 234 105 117 217 235 105 238 217 234 105 182 214 183 105 100 217 234 105 33 250 218 105 127 217 234 105 178 223 236 105 116 217 234 105 82 105 99 104 117 21 234 105 0 2
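
The two feature names on this slide can be sketched in a few lines. This is a minimal illustration, not the MalwareScore feature extractor: the window and step sizes are arbitrary assumptions, and the input is a short toy byte string assembled from values that appear on the slide, arranged in the order of a typical DOS ("MZ") header.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Count how often each of the 256 possible byte values occurs."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts


def shannon_entropy(window):
    """Shannon entropy in bits of the byte values in one window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_window_entropy(data, window=8, step=4):
    """Entropy of each full window, moving `step` bytes at a time."""
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data) - window + 1, step)
    ]


# Toy input: 16 bytes, hypothetical and far shorter than a real file.
data = bytes([77, 90, 144, 0, 3, 0, 0, 0, 4, 0, 0, 0, 255, 255, 0, 0])
hist = byte_histogram(data)
entropies = sliding_window_entropy(data, window=8, step=4)
```

For a window of `n` bytes the entropy is bounded above by `log2(n)` bits, so regions dominated by a single value (such as zero padding) score low and regions with many distinct values score high.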
  18. Model Performance: Communicating results and performance.
  19. Finding Problems: There’s a rudimentary system for gathering feedback and using it in future model training runs.
  20. Tracking Solutions: For each model, problem areas are broken out and analyzed separately.
  21. Visualization Time Budgets
  22. Visualization Time Budget: Explainability, Trustworthiness, Beauty.
  23. Explainability: Can this visualization be understood on its own? Annotations. Explanations in readable prose. Example: https://xkcd.com/1732/ …
  24. Trustworthiness: Can you trust the source of this visualization? Consistent styling. Data sources listed. Logos. Example: https://www.economist.com/blogs/dailychart/2010/11/us_human_development_state
  25. Beauty: https://pudding.cool/2017/02/vocabulary/
  26. Audiences
  27. Yourself: The purpose of the visualization is to convince yourself you’ve done something useful. [Time budget chart: Explainability, Trustworthiness, and Beauty, each on a scale from less time to more time]
  28. Yourself: “The first principle is that you must not fool yourself – and you are the easiest person to fool.” Richard Feynman
  29. Yourself: Model Building Process. [Flowchart labels: Try something; Visualize and inspect the results; Looks wrong; Think critically; Woohoo! I’m done!; How might I be fooling myself?]
  30. Data Science Team: Purpose is to communicate what you’ve done and get feedback you didn’t consider. Add context (data sources, model parameters). [Time budget chart: Explainability, Trustworthiness, and Beauty, each on a scale from less time to more time]
  31. Domain Experts: Same as for the data science team, but now the context is domain specific. For me: hashes, PE header information, links to VirusTotal, etc. [Time budget chart: Explainability, Trustworthiness, and Beauty, each on a scale from less time to more time]
  32. Managers and Executives: Purpose is to communicate progress and current performance. [Time budget chart: Explainability, Trustworthiness, and Beauty, each on a scale from less time to more time]
  33. Public: [Time budget chart: Explainability, Trustworthiness, and Beauty, each on a scale from less time to more time]
  34. Tools and Resources
  35. Python Plotting: http://pythonplot.com/
  36. Python Plotting: http://pythonplot.com/ Comparison of plotting syntax between pandas, matplotlib, plotnine, ggplot2 (R), and altair (planned).
  37. Jupyter Notebooks: http://jupyter.org/ Excellent for exploratory data analysis. Changes can be made and the results will update as fast as the code can run.
  38. Kibana: https://www.elastic.co/products/kibana Allows for rapidly building constantly updating dashboards. Works best when querying against data that’s in Elasticsearch. Internal tool by @laborious_dtg.
  39. D3js: https://d3js.org/ JavaScript. Probably requires data translation. Large time commitment. Payoff is the customization possibilities (and thus trustworthiness/beauty).
  40. Yellowbrick: https://github.com/DistrictDataLabs/yellowbrick
  41. Yellowbrick
        from yellowbrick.features import Rank2D  # import not shown on the slide

        # features, X, and y are the author's data (not defined on the slide)
        # Instantiate the visualizer
        visualizer = Rank2D(features=features, algorithm='covariance')
        visualizer.fit(X, y)       # Fit the data to the visualizer
        visualizer.transform(X)    # Transform the data
        visualizer.poof()          # Draw/show/poof the data
      The idea is to have prebaked model evaluation visualizations that adhere to the scikit-learn API.
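
For readers without Yellowbrick installed, the kind of quantity a pairwise feature ranking with `algorithm='covariance'` is built on can be sketched in plain Python. This is an assumption-laden illustration, not Yellowbrick's implementation: it uses the sample covariance (dividing by n - 1), and the helper names, feature names, and values are all hypothetical. Note also that later Yellowbrick releases renamed `poof()` to `show()`.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pairwise_covariance(columns):
    """Covariance for every unordered pair of named feature columns."""
    names = list(columns)
    return {
        (a, b): covariance(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical feature columns, invented for illustration.
columns = {
    "file_size": [1.0, 2.0, 3.0, 4.0],
    "entropy": [2.0, 4.0, 6.0, 8.0],
    "num_sections": [4.0, 3.0, 2.0, 1.0],
}
cov = pairwise_covariance(columns)

# Rank feature pairs by the magnitude of their covariance, strongest first.
ranked = sorted(cov.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

A visualizer would draw these pairwise values as a heatmap; the sorted list is just a text-only way to see which feature pairs move together most strongly.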
  42. Facets: https://github.com/pair-code/facets It’s early, but so far this looks like the best method for truly responsive exploratory data analysis that I’ve seen.
  43. Facets: Time for a demo?
  44. Thank you. proth@endgame.com @mrphilroth
