Information Visualization for
Large-Scale Data Workflows
Michael Conover
Senior Data Scientist, LinkedIn
@vagabondjack
Emergent Structure
Credit Pedro Cruz, University of Coimbra
David Crandall, Indiana University
John Nelson, IDV Solutions Elegant Complexity
Intellectual Dividends
Realistic Mental Models
Verification of Assumptions
Shortened Iteration Cycles
Improved Predictive P...
Hypothesis Generation
@whitehouse #RSVP
Color Commentary
Flock Together
Political Polarization on Twitter
Feature Development
Anscombe’s Quartet
http://en.wikpedia.org/Anscombe’s_quartet
Basic Workflow Structure
0.0
0.1
0.2
0.3
0.4
−5.0 −2.5 0.0 2.5 5.0
Standard Normal
Density
0.0
0.1
0.2
0.3
0.4
−5.0 −2.5 0.0 2.5 5.0
Standard Norma...
geom_point()
A Lens on the Joint Distribution
geom_point(alpha=1/5)
A Lens on the Joint Distribution
geom_bin2d(bins=35)
A Lens on the Joint Distribution
geom_point(alpha=1/5, aes(color=label))
A Lens on the Joint Distribution
geom_density2d(aes(color=label), bins=20)
A Lens on the Joint Distribution
Marginal Histograms
A Lens on the Joint Distribution
GGally (ggpairs)
A Lens on the Joint Distribution
Model Fitting & Evaluation
Model A Model B
Training Data I
Training Data II
Evaluation Batteries
stanford.edu/~jhuang11/
Homework at Scale
github.com/StanfordHCI/termite
Topic Modeling
Layercake
Workflow Principles
Latent, Pervasive
Modular
Consistent Visual Language
Workflow Management
data.linkedin.com/opensource/azkaban
LinkedIn Azkaban
data.linkedin.com/opensource/white-elephant
LinkedIn White Elephant
github.com/Netflix/Lipstick
Netflix Lipstick
driven.cascading.io Concurrent Driven
bigml.com
BigML Sunburst
Information Visualization for
Large-Scale Data Workflows
Michael Conover
Senior Data Scientist, LinkedIn
@vagabondjack
Tableau
RStudio Shiny
rstudio.com/shiny/
RStudio Shiny
rweb.stat.ucla.edu/ggplot2/
Data Driven Documents (D3)
d3js.org
Superconductor
superconductor.github.io
Stamen Tiles
maps.stamen.com
Information Visualization for Large-Scale Data Workflows by Michael Conover (Linkedin)
Upcoming SlideShare
Loading in …5
×

Information Visualization for Large-Scale Data Workflows by Michael Conover (Linkedin)

1,340 views

Published on

The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effective machine learning at scale. Applied in this capacity, information visualization technologies drive product innovation, shorten iteration cycles, reduce uncertainty, and ultimately improve the performance of predictive models. It can be challenging, however, to understand where in a workflow to employ data visualization, and, once committed to doing so, developing revealing visualizations that suggest clear next steps can be similarly daunting.

In this talk we’ll describe the role that information visualization technologies play in the LinkedIn data science ecosystem, and explore best practices for understanding the structure of large-scale data in a production environment. From hypothesis generation and feature development to model evaluation and tooling, visualization is at the heart of LinkedIn’s machine learning workflows, enabling our data scientists to reason and communicate more effectively. Broken down into clear, structured insights based on proven technology and workflow patterns, this talk will help you understand how to apply information visualization to the analytical challenges you encounter every day.

Meet Michael Conover
Mike Conover builds machine learning technologies that leverage the behavior and relationships of hundreds of millions of people. A senior data scientist at LinkedIn, Mike has a Ph.D. in complex systems analysis with a focus on information propagation in large-scale social networks. His work has appeared in the New York Times, the Wall Street Journal, Science, MIT Technology Review and on National Public Radio.

Published in: Technology, Education
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,340
On SlideShare
0
From Embeds
0
Number of Embeds
15
Actions
Shares
0
Downloads
32
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Information Visualization for Large-Scale Data Workflows by Michael Conover (Linkedin)

  1. 1. Information Visualization for Large-Scale Data Workflows Michael Conover Senior Data Scientist, LinkedIn @vagabondjack
  2. 2. Emergent Structure
  3. 3. Credit Pedro Cruz, University of Coimbra David Crandall, Indiana University John Nelson, IDV Solutions Elegant Complexity
  4. 4. Intellectual Dividends Realistic Mental Models Verification of Assumptions Shortened Iteration Cycles Improved Predictive Performance Product Insights Clarity of Communication
  5. 5. Hypothesis Generation
  6. 6. @whitehouse #RSVP Color Commentary
  7. 7. Flock Together
  8. 8. Political Polarization on Twitter
  9. 9. Feature Development
  10. 10. Anscombe’s Quartet http://en.wikpedia.org/Anscombe’s_quartet
  11. 11. Basic Workflow Structure
  12. 12. 0.0 0.1 0.2 0.3 0.4 −5.0 −2.5 0.0 2.5 5.0 Standard Normal Density 0.0 0.1 0.2 0.3 0.4 −5.0 −2.5 0.0 2.5 5.0 Standard Normal Density 100,0001,000,000
  13. 13. geom_point() A Lens on the Joint Distribution
  14. 14. geom_point(alpha=1/5) A Lens on the Joint Distribution
  15. 15. geom_bin2d(bins=35) A Lens on the Joint Distribution
  16. 16. geom_point(alpha=1/5, aes(color=label)) A Lens on the Joint Distribution
  17. 17. geom_density2d(aes(color=label), bins=20) A Lens on the Joint Distribution
  18. 18. Marginal Histograms A Lens on the Joint Distribution
  19. 19. GGally (ggpairs) A Lens on the Joint Distribution
  20. 20. Model Fitting & Evaluation
  21. 21. Model A Model B Training Data I Training Data II Evaluation Batteries
  22. 22. stanford.edu/~jhuang11/ Homework at Scale
  23. 23. github.com/StanfordHCI/termite Topic Modeling
  24. 24. Layercake
  25. 25. Workflow Principles Latent, Pervasive Modular Consistent Visual Language
  26. 26. Workflow Management
  27. 27. data.linkedin.com/opensource/azkaban LinkedIn Azkaban
  28. 28. data.linkedin.com/opensource/white-elephant LinkedIn White Elephant
  29. 29. github.com/Netflix/Lipstick Netflix Lipstick
  30. 30. driven.cascading.io Concurrent Driven
  31. 31. bigml.com BigML Sunburst
  32. 32. Information Visualization for Large-Scale Data Workflows Michael Conover Senior Data Scientist, LinkedIn @vagabondjack
  33. 33. Tableau
  34. 34. RStudio Shiny rstudio.com/shiny/
  35. 35. RStudio Shiny rweb.stat.ucla.edu/ggplot2/
  36. 36. Data Driven Documents (D3) d3js.org
  37. 37. Superconductor superconductor.github.io
  38. 38. Stamen Tiles maps.stamen.com

×