Information Visualization for Large-Scale Data Workflows

The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to know where in a workflow to employ data visualization, and, once you have committed to doing so, developing revealing visualizations that suggest clear next steps can be similarly daunting.

In this talk we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem, and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Broken down into clear, structured insights based on proven workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day.

Transcript

  • 1. Information Visualization for Large-Scale Data Workflows. Michael Conover, Senior Data Scientist, LinkedIn. @vagabondjack, reasonengine.wordpress.com. Wednesday, October 9, 2013
  • 2. Emergent Structure
  • 3. Elegant Complexity. Credit: Pedro Cruz, University of Coimbra; David Crandall, Indiana University; John Nelson, IDV Solutions
  • 4. Intellectual Dividends: Realistic Mental Models; Verification of Assumptions; Shortened Iteration Cycles; Improved Predictive Performance; Product Insights; Clarity of Communication
  • 5. Hypothesis Generation
  • 6.
  • 7. Color Commentary @whitehouse #RSVP
  • 8. Flock Together
  • 9. Political Polarization On Twitter
  • 10. Basic Workflow Structure
  • 11. Basic Visualization Battery: aes_string()
  • 12. Feature Development
  • 13. Anscombe’s Quartet (http://en.wikipedia.org/wiki/Anscombe's_quartet)
  • 14. [Figure: two Standard Normal Density plots, labeled 100,000 and 1,000,000]
  • 15. A Lens On The Joint Distribution: geom_point(). Axes: log(Connections), log(EndorsementPagerank)
  • 16. A Lens On The Joint Distribution: geom_point(alpha=1/5). Axes: log(Connections), log(EndorsementPagerank)
  • 17. A Lens On The Joint Distribution: geom_bin2d(bins=35). Axes: log(Connections), log(EndorsementPagerank). Legend: count
  • 18. A Lens On The Joint Distribution: geom_point(alpha=1/5, aes(color=label)). Axes: log(Connections), log(EndorsementPagerank). Legend: Class (Negative, Positive)
  • 19. A Lens On The Joint Distribution: geom_density2d(aes(color=label), bins=10). Axes: log(Connections), log(EndorsementPagerank). Legend: Class (Negative, Positive)
  • 20. A Lens On The Joint Distribution: Marginal Histograms
  • 21. A Lens On The Joint Distribution: GGally (ggpairs). [Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, with overall and per-species (setosa, versicolor, virginica) correlations]
  • 22. Model Fitting & Evaluation
  • 23. Model Selection: Model A, Model B; Training Data I, Training Data II; Battery (×4)
  • 24. Homework At Scale (stanford.edu/~jhuang11/)
  • 25. Topic Modeling (vis.stanford.edu/papers/termite)
  • 26. Layercake
  • 27. Workflow Principles: Latent, Pervasive; Modular; Consistent Visual Language
  • 28. Workflow Management
  • 29. Azkaban (data.linkedin.com/opensource/azkaban)
  • 30. White Elephant (data.linkedin.com/opensource/white-elephant)
  • 31. Netflix’ Lipstick (github.com/Netflix/Lipstick)
  • 32. Information Visualization for Large-Scale Data Workflows. Michael Conover, Senior Data Scientist, LinkedIn. @vagabondjack, reasonengine.wordpress.com
  • 33. Extended Toolbox
  • 34. Tableau (tableausoftware.com/public)
  • 35. RStudio Shiny (rstudio.com/shiny/)
  • 36. GoogleVis (code.google.com/p/google-motion-charts-with-r)
  • 37. rweb.stat.ucla.edu/ggplot2/
  • 38. Adobe Kuler (kuler.adobe.com)
  • 39. Color Brewer (colorbrewer2.org)
  • 40. D3.js (d3js.org)
  • 41. Bostock’s Blocks (bl.ocks.org/mbostock)
  • 42. Stamen OpenStreetMap Tiles (maps.stamen.com)
  • 43. SF Health Inspections (zipfianacademy.com/maps/h3/)
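
Slide 13 cites Anscombe's quartet as motivation for plotting features during development. The point can be checked numerically with a minimal sketch in standard-library Python (not the R/ggplot2 calls named on the slides). The four datasets are the published Anscombe values; the helper names pearson and summarize are illustrative assumptions and do not come from the talk.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(quartet):
    """Map each dataset name to (mean of x, mean of y, correlation)."""
    return {
        name: (mean(xs), mean(ys), pearson(xs, ys))
        for name, (xs, ys) in quartet.items()
    }


if __name__ == "__main__":
    for name, (mx, my, r) in summarize(QUARTET).items():
        print(f"{name:>3}: mean(x)={mx:.2f}  mean(y)={my:.2f}  r={r:.3f}")
```

Running the script prints one line per dataset. The means and correlations agree closely across all four, even though their scatterplots look nothing alike, which is the case for inspecting the joint distribution visually rather than relying on summary statistics.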