Information Visualization for Large-Scale Data Workflows

The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to know where in a workflow to employ data visualization, and developing revealing visualizations that suggest clear next steps can be just as daunting.

In this talk, we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Organized around clear, structured insights drawn from proven workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day.

Published in: Education, Technology

Information Visualization for Large-Scale Data Workflows

  1. Information Visualization for Large-Scale Data Workflows. Michael Conover, Senior Data Scientist, LinkedIn. @vagabondjack, reasonengine.wordpress.com. Wednesday, October 9, 2013
  2. Emergent Structure
  3. Elegant Complexity. Credit: Pedro Cruz, University of Coimbra; David Crandall, Indiana University; John Nelson, IDV Solutions
  4. Intellectual Dividends: Realistic Mental Models; Verification of Assumptions; Shortened Iteration Cycles; Improved Predictive Performance; Product Insights; Clarity of Communication
  5. Hypothesis Generation
  6. (no text)
  7. Color Commentary: @whitehouse, #RSVP
  8. Flock Together
  9. Political Polarization On Twitter
  10. Basic Workflow Structure
  11. Basic Visualization Battery: aes_string()
  12. Feature Development
  13. Anscombe’s Quartet: http://en.wikipedia.org/wiki/Anscombe's_quartet
  14. Standard Normal Density (figure: two density panels, labeled 100,000 and 1,000,000)
  15. A Lens On The Joint Distribution: log(Connections), log(EndorsementPagerank); geom_point()
  16. A Lens On The Joint Distribution: log(Connections), log(EndorsementPagerank); geom_point(alpha=1/5)
  17. A Lens On The Joint Distribution: log(Connections), log(EndorsementPagerank); legend: count; geom_bin2d(bins=35)
  18. A Lens On The Joint Distribution: log(Connections), log(EndorsementPagerank); legend: Class (Negative, Positive); geom_point(alpha=1/5, aes(color=label))
  19. A Lens On The Joint Distribution: log(Connections), log(EndorsementPagerank); legend: Class (Negative, Positive); geom_density2d(aes(color=label), bins=10)
  20. A Lens On The Joint Distribution: Marginal Histograms
  21. A Lens On The Joint Distribution: GGally (ggpairs) scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width by Species (setosa, versicolor, virginica). Overall correlations: Sepal.Length/Sepal.Width −0.118; Sepal.Length/Petal.Length 0.872; Sepal.Length/Petal.Width 0.818; Sepal.Width/Petal.Length −0.428; Sepal.Width/Petal.Width −0.366; Petal.Length/Petal.Width 0.963
  22. Model Fitting & Evaluation
  23. Model Selection: Model A, Model B; Training Data I, Training Data II; Battery (×4)
  24. Homework At Scale: stanford.edu/~jhuang11/
  25. Topic Modeling: vis.stanford.edu/papers/termite
  26. Layercake
  27. Workflow Principles: Latent, Pervasive; Modular; Consistent Visual Language
  28. Workflow Management
  29. Azkaban: data.linkedin.com/opensource/azkaban
  30. White Elephant: data.linkedin.com/opensource/white-elephant
  31. Netflix’s Lipstick: github.com/Netflix/Lipstick
  32. Information Visualization for Large-Scale Data Workflows. Michael Conover, Senior Data Scientist, LinkedIn. @vagabondjack, reasonengine.wordpress.com
  33. Extended Toolbox
  34. Tableau: tableausoftware.com/public
  35. RStudio Shiny: rstudio.com/shiny/
  36. GoogleVis: code.google.com/p/google-motion-charts-with-r
  37. rweb.stat.ucla.edu/ggplot2/
  38. Adobe Kuler: kuler.adobe.com
  39. Color Brewer: colorbrewer2.org
  40. D3.js: d3js.org
  41. Bostock’s Blocks: bl.ocks.org/mbostock
  42. Stamen OpenStreetMap Tiles: maps.stamen.com
  43. SF Health Inspections: zipfianacademy.com/maps/h3/
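
Slide 13 cites Anscombe’s quartet, the usual argument for plotting a feature before trusting its summary statistics. The sketch below is an illustration, not material from the talk: it is written in Python rather than the R/ggplot2 used on the slides, the dataset values are transcribed from Anscombe’s published quartet, and the helper names (`pearson`, `summarize`, `SUMMARIES`) are hypothetical. It computes the mean, sample variance, and Pearson correlation for each of the four datasets so the near-identical summaries can be compared side by side.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet. Values transcribed from the published dataset,
# not taken from the slides. Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics that hide the differences between the datasets."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four datasets report nearly the same means, variances, and correlation (about 0.82), yet a scatterplot of each looks entirely different, which is why a plot belongs in the workflow alongside the summary table.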
