Information Visualization for Large-Scale Data Workflows

The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to know where in a workflow to employ data visualization, and, once you have committed to doing so, developing revealing visualizations that suggest clear next steps can be similarly daunting.

In this talk we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem, and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Broken down into clear, structured insights based on proven workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day.

Published in: Education, Technology

Transcript of "Information Visualization for Large-Scale Data Workflows"

  1. Information Visualization for Large-Scale Data Workflows. Michael Conover, Senior Data Scientist, LinkedIn. @vagabondjack, reasonengine.wordpress.com. Wednesday, October 9, 2013
  2. Emergent Structure
  3. Elegant Complexity. Credit: Pedro Cruz (University of Coimbra), David Crandall (Indiana University), John Nelson (IDV Solutions)
  4. Intellectual Dividends: Realistic Mental Models; Verification of Assumptions; Shortened Iteration Cycles; Improved Predictive Performance; Product Insights; Clarity of Communication
  5. Hypothesis Generation
  6. (no text)
  7. Color Commentary: @whitehouse #RSVP
  8. Flock Together
  9. Political Polarization On Twitter
  10. Basic Workflow Structure
  11. Basic Visualization Battery: aes_string()
  12. Feature Development
  13. Anscombe’s Quartet: http://en.wikipedia.org/wiki/Anscombe's_quartet
  14. [Figure: two Standard Normal density plots, labeled 100,000 and 1,000,000]
  15. A Lens On The Joint Distribution: log(Connections) and log(EndorsementPagerank), geom_point()
  16. A Lens On The Joint Distribution: log(Connections) and log(EndorsementPagerank), geom_point(alpha=1/5)
  17. A Lens On The Joint Distribution: log(Connections) and log(EndorsementPagerank), geom_bin2d(bins=35), with a count legend
  18. A Lens On The Joint Distribution: log(Connections) and log(EndorsementPagerank), by Class (Negative, Positive), geom_point(alpha=1/5, aes(color=label))
  19. A Lens On The Joint Distribution: log(Connections) and log(EndorsementPagerank), by Class (Negative, Positive), geom_density2d(aes(color=label), bins=10)
  20. A Lens On The Joint Distribution: Marginal Histograms
  21. A Lens On The Joint Distribution: GGally (ggpairs) plot of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width by Species. Correlations, overall and then for setosa, versicolor, virginica: Sepal.Length and Sepal.Width −0.118 (0.743, 0.526, 0.457); Sepal.Length and Petal.Length 0.872 (0.267, 0.754, 0.864); Sepal.Length and Petal.Width 0.818 (0.278, 0.546, 0.281); Sepal.Width and Petal.Length −0.428 (0.178, 0.561, 0.401); Sepal.Width and Petal.Width −0.366 (0.233, 0.664, 0.538); Petal.Length and Petal.Width 0.963 (0.332, 0.787, 0.322)
  22. Model Fitting & Evaluation
  23. Model Selection: [Grid: Model A and Model B against Training Data I and Training Data II, with a Battery in each of the four cells]
  24. Homework At Scale: stanford.edu/~jhuang11/
  25. Topic Modeling: vis.stanford.edu/papers/termite
  26. Layercake
  27. Workflow Principles: Latent, Pervasive; Modular; Consistent Visual Language
  28. Workflow Management
  29. Azkaban: data.linkedin.com/opensource/azkaban
  30. White Elephant: data.linkedin.com/opensource/white-elephant
  31. Netflix’ Lipstick: github.com/Netflix/Lipstick
  32. Information Visualization for Large-Scale Data Workflows. Michael Conover, Senior Data Scientist, LinkedIn. @vagabondjack, reasonengine.wordpress.com
  33. Extended Toolbox
  34. Tableau: tableausoftware.com/public
  35. RStudio Shiny: rstudio.com/shiny/
  36. GoogleVis: code.google.com/p/google-motion-charts-with-r
  37. rweb.stat.ucla.edu/ggplot2/
  38. Adobe Kuler: kuler.adobe.com
  39. Color Brewer: colorbrewer2.org
  40. D3.js: d3js.org
  41. Bostock’s Blocks: bl.ocks.org/mbostock
  42. Stamen OpenStreetMap Tiles: maps.stamen.com
  43. SF Health Inspections: zipfianacademy.com/maps/h3/
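
The Anscombe's Quartet slide in the Feature Development section is the usual argument for plotting a feature before trusting its summary statistics. The sketch below is an illustration added to this transcript, not code from the talk. It uses the `anscombe` data frame that ships with base R; the `summary_stats` name is an assumption made for the example.

```r
# anscombe is part of base R's datasets package: columns x1..x4 and y1..y4.
# Compute the same summary statistics for each of the four x/y pairs.
summary_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), var_x = var(x), cor_xy = cor(x, y))
})
colnames(summary_stats) <- paste0("set", 1:4)

# The four columns are nearly identical even though the scatterplots differ.
print(round(summary_stats, 2))
```

All four sets have a mean of 9 for x and a correlation of roughly 0.816, which is why the summary table alone cannot distinguish them and a plot can.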
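
The Basic Visualization Battery slide and the "A Lens On The Joint Distribution" slides name a sequence of ggplot2 layers. One way they might be combined into a reusable battery is sketched below. This is a guess at the pattern, not the speaker's code: the data is synthetic, and the column names and the `joint_battery` helper are assumptions made for illustration. `aes_string()` is used because the slide names it; recent ggplot2 releases deprecate it, so expect a deprecation warning there.

```r
library(ggplot2)

# Synthetic stand-in data (assumption): the data used in the talk is not available.
set.seed(42)
n <- 5000
df <- data.frame(
  log_connections = rnorm(n, mean = 5, sd = 1.5),
  label = factor(sample(c("Negative", "Positive"), n, replace = TRUE))
)
df$log_endorsement_pagerank <- 0.6 * df$log_connections + rnorm(n)

# Hypothetical helper: build the same set of views for any pair of columns.
# Columns are passed as strings, which is what aes_string() is for.
joint_battery <- function(data, x, y, class) {
  base <- ggplot(data, aes_string(x = x, y = y))
  list(
    points  = base + geom_point(),                    # raw scatter, overplots badly
    faded   = base + geom_point(alpha = 1/5),         # transparency reveals density
    binned  = base + geom_bin2d(bins = 35),           # counts per 2D bin
    classed = base + geom_point(alpha = 1/5, aes_string(color = class)),
    density = base + geom_density2d(aes_string(color = class), bins = 10)
  )
}

plots <- joint_battery(df, "log_connections", "log_endorsement_pagerank", "label")
```

Printing any element, for example `print(plots$binned)`, renders that view. Because the helper takes column names as strings, the same battery can be looped over every feature pair in a data set.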