
Data Visualization for Machine Learning

I talk about lessons I've learned maintaining Bit Inspector, an internal data visualization tool built to aid the development of MalwareScore.


  1. VIZ4ML: The Role of Data Visualization in Improving Machine Learning Models. BSidesLV 2017, Phil Roth.
  2. Or… screenshots of this internal visualization tool I built to test MalwareScore, along with lessons I learned building it.
  3. whoami: Phil Roth, @mrphilroth, proth@endgame.com. PhD in physics using ML. Radar imaging. Data Scientist at Endgame.
  4. MalwareScore and Bit Inspector
  5. MalwareScore: MalwareScore is a machine-learning-first solution built for detecting and preventing malware.
  6. MalwareScore: Static features. Deployed to customer machines. Available at VirusTotal (https://www.virustotal.com/).
  7. Bit Inspector: Bit Inspector is an internal tool for communicating progress, soliciting feedback, and identifying errors related to MalwareScore.
  8. Bit Inspector: Built with Flask (http://flask.pocoo.org/), D3.js (https://d3js.org/), matplotlib (https://matplotlib.org/), and seaborn (https://seaborn.pydata.org/). Connects to multiple internal data and processing resources. (A rough sketch of how those pieces typically fit together follows below.)
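     This is not Bit Inspector's actual code, just a minimal sketch of how Flask and matplotlib are commonly wired together to serve plots from a web tool like this; the /roc/<model_id> route and its contents are assumptions for illustration:

        import io
        from flask import Flask, send_file
        import matplotlib
        matplotlib.use("Agg")  # render off-screen; no display needed on a server
        import matplotlib.pyplot as plt

        app = Flask(__name__)

        @app.route("/roc/<model_id>")
        def roc_png(model_id):
            # placeholder plot; a real tool would load the model's holdout scores here
            fig, ax = plt.subplots()
            ax.plot([0, 1], [0, 1], "k--")
            ax.set_title("ROC for model %s" % model_id)
            buf = io.BytesIO()
            fig.savefig(buf, format="png")
            plt.close(fig)
            buf.seek(0)
            return send_file(buf, mimetype="image/png")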
  9. Sample Page
  10. Model Page
  11. Basic Visualizations
  12. ROC Curve: Area under the Receiver Operating Characteristic curve. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
  13. ROC Curve:

        import seaborn as sns
        import matplotlib.pyplot as plt
        from sklearn.metrics import roc_curve

        y_true = metricsdf.y
        y_pred = metricsdf.y_pred_holdout
        fpr_plot, tpr_plot, _ = roc_curve(y_true, y_pred)
        plt.plot(fpr_plot, tpr_plot, lw=2, color="k")
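     The slide stops at the curve itself; the single-number AUC that slide 12 refers to can be pulled from the same data with scikit-learn. A minimal sketch, assuming metricsdf holds the true labels and holdout scores as above:

        from sklearn.metrics import roc_auc_score

        # assumes metricsdf has true labels (y) and holdout scores (y_pred_holdout)
        auc = roc_auc_score(metricsdf.y, metricsdf.y_pred_holdout)
        print("Holdout ROC AUC: %.4f" % auc)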
  14. Confusion Matrix: A table where columns represent the predicted class and rows represent the actual class.
  15. Confusion Matrix:

        import seaborn as sns
        import matplotlib.pyplot as plt
        from sklearn.metrics import confusion_matrix

        y_true = metricsdf.y
        y_pred = metricsdf.y_pred_holdout
        matrix = confusion_matrix(y_true, y_pred > threshold)
        fig = sns.heatmap(matrix, annot=True, fmt="d")
        fig.invert_yaxis()
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
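     The threshold above is not defined on the slide. One common way to pick it (not necessarily what Bit Inspector does) is to take the score cutoff that hits a target false positive rate on the ROC curve; a sketch, where target_fpr is a made-up value:

        import numpy as np
        from sklearn.metrics import roc_curve

        # thresholds[i] is the score cutoff that produces fpr_plot[i] / tpr_plot[i]
        fpr_plot, tpr_plot, thresholds = roc_curve(y_true, y_pred)

        target_fpr = 0.001  # hypothetical target false positive rate
        idx = np.searchsorted(fpr_plot, target_fpr, side="right") - 1
        # lowest score cutoff that keeps FPR at or below the target (maximizing TPR)
        threshold = thresholds[max(idx, 0)]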
  16. Role of Data Visualization
  17. Feature Experimentation: Byte Histogram and Sliding Window Byte Entropy features, illustrated over the raw bytes of a PE file header. https://arxiv.org/pdf/1508.03096.pdf (A rough sketch of both features is below.)
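     A rough illustration of what those two features compute; this is a sketch under assumed parameters, not Endgame's implementation, and the window size, step, and normalization are guesses:

        import numpy as np

        def byte_histogram(data):
            # normalized count of each byte value 0-255 over the whole file
            counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
            return counts / max(len(data), 1)

        def sliding_window_entropy(data, window=1024, step=256):
            # Shannon entropy of the byte values inside each sliding window
            arr = np.frombuffer(data, dtype=np.uint8)
            entropies = []
            for start in range(0, len(arr) - window + 1, step):
                counts = np.bincount(arr[start:start + window], minlength=256)
                probs = counts[counts > 0] / float(window)
                entropies.append(-np.sum(probs * np.log2(probs)))
            return np.array(entropies)

        with open("sample.exe", "rb") as f:  # hypothetical input file
            raw = f.read()
        hist = byte_histogram(raw)
        entropy_profile = sliding_window_entropy(raw)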
  18. Model Performance: Communicating results and performance.
  19. Finding Problems: There is a rudimentary system for gathering feedback and using it in future model trainings.
  20. Tracking Solutions: For each model, problem areas are broken out and analyzed separately.
  21. Visualization Time Budgets
  22. Visualization Time Budget: Explainability, Trustworthiness, Beauty.
  23. Explainability: Can this visualization be understood on its own? Annotations; explanations in readable prose; … https://xkcd.com/1732/
  24. Trustworthiness: Can you trust the source of this visualization? Consistent styling; data sources listed; logos. https://www.economist.com/blogs/dailychart/2010/11/us_human_development_state
  25. Beauty: https://pudding.cool/2017/02/vocabulary/
  26. Audiences
  27. Yourself: The purpose of the visualization is to convince yourself you've done something useful. [Time-budget chart: Explainability / Trustworthiness / Beauty, rated from less time to more time.]
  28. Yourself: "The first principle is that you must not fool yourself – and you are the easiest person to fool." – Richard Feynman
  29. Yourself: The model building process: try something → visualize and inspect the results → if it looks wrong, think critically and try again; otherwise, "Woohoo! I'm done!" The question to keep asking: how might I be fooling myself?
  30. Data Science Team: The purpose is to communicate what you've done and get feedback on things you didn't consider. Add context (data sources, model parameters). [Time-budget chart.]
  31. Domain Experts: Same, but now the context is domain specific. For me: hashes, PE header information, links to VirusTotal, etc. [Time-budget chart.]
  32. Managers and Executives: The purpose is to communicate progress and current performance. [Time-budget chart.]
  33. Public: [Time-budget chart.]
  34. Tools and Resources
  35. Python Plotting: http://pythonplot.com/
  36. Python Plotting: http://pythonplot.com/ compares plotting syntax between pandas, matplotlib, plotnine, ggplot2 (R), and altair (planned). (A small side-by-side example is sketched below.)
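     To give a flavor of the syntax differences that pythonplot.com catalogs, here is the same bar chart in pandas and in raw matplotlib; the data is made up for illustration:

        import matplotlib.pyplot as plt
        import pandas as pd

        df = pd.DataFrame({"family": ["ransomware", "trojan", "adware"],
                           "count": [120, 340, 80]})  # made-up example data

        # pandas: one line off the DataFrame
        df.plot.bar(x="family", y="count", legend=False)

        # matplotlib: build the same chart by hand
        fig, ax = plt.subplots()
        ax.bar(range(len(df)), df["count"], color="gray")
        ax.set_xticks(range(len(df)))
        ax.set_xticklabels(df["family"])
        ax.set_ylabel("count")
        plt.show()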
  37. Jupyter Notebooks: http://jupyter.org/ Excellent for exploratory data analysis. Changes can be made and the results update as fast as the code can run.
  38. Kibana: https://www.elastic.co/products/kibana Allows for rapidly building constantly updating dashboards. Works best when querying against data that is already in Elasticsearch. Internal tool by @laborious_dtg.
  39. D3.js: https://d3js.org/ JavaScript, and it probably requires data translation. A large time commitment, but the payoff is the customization possibilities (and thus trustworthiness/beauty). (The data translation step is sketched below.)
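     That data translation usually just means exporting whatever Python structure you have into JSON that the D3 page can fetch. A minimal sketch, where the scores.json filename and the DataFrame contents are made up:

        import json
        import pandas as pd

        # hypothetical per-sample model scores
        df = pd.DataFrame({"sha256": ["aaa111", "bbb222"], "score": [0.97, 0.03]})

        # d3.json("scores.json", callback) on the page can then load this file
        records = df.to_dict(orient="records")
        with open("scores.json", "w") as f:
            json.dump(records, f)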
  40. Yellowbrick: https://github.com/DistrictDataLabs/yellowbrick
  41. Yellowbrick: The idea is to have prebaked model evaluation visualizations that adhere to the scikit-learn API.

        from yellowbrick.features import Rank2D

        # Instantiate the visualizer
        visualizer = Rank2D(features=features, algorithm='covariance')
        visualizer.fit(X, y)      # Fit the data to the visualizer
        visualizer.transform(X)   # Transform the data
        visualizer.poof()         # Draw/show/poof the data
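     Yellowbrick has similar wrappers for the evaluation plots from slides 12-15; a sketch using its ROC curve visualizer, with a placeholder classifier rather than the MalwareScore model:

        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import train_test_split
        from yellowbrick.classifier import ROCAUC

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        model = GradientBoostingClassifier()  # placeholder classifier
        visualizer = ROCAUC(model)
        visualizer.fit(X_train, y_train)      # fit the wrapped model
        visualizer.score(X_test, y_test)      # draw the ROC curve from test set predictions
        visualizer.poof()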
  42. Facets: https://github.com/pair-code/facets It's early, but so far this looks like the best method for truly responsive exploratory data analysis that I've seen.
  43. Facets: Time for a demo?
  44. Thank You: proth@endgame.com @mrphilroth
