This session was recorded in NYC on October 22nd, 2019 and can be viewed here: https://youtu.be/1X93Sum_SyM
The Grammar of Graphics and the Future of Big Data Visualization
This year marks the twentieth anniversary of the first edition of The Grammar of Graphics (GG). The book laid the foundation for a major visualization component at the statistical software company SPSS. Since that edition, a commercial company, Tableau, evolved from a Stanford seminar and dissertation devoted to the book. Not long after, the widely used open-source visualization package, ggplot2, arose from a dissertation at Iowa State University. GG not only provided for the first time a formal mathematical foundation for generating statistical charts, but also introduced a wider range of graphics than seen in previous graphical systems. This talk will briefly review that history and then outline recent efforts at H2O to apply GG to very large datasets where classical rendering and analytic methods are infeasible.
Bio: Leland Wilkinson is Chief Scientist at H2O and Adjunct Professor of Computer Science at the University of Illinois Chicago. He received an A.B. degree from Harvard in 1966, an S.T.B. degree from Harvard Divinity School in 1969, and a Ph.D. from Yale in 1975. Wilkinson wrote the SYSTAT statistical package and founded SYSTAT Inc. in 1984. After the company grew to 50 employees, he sold SYSTAT to SPSS in 1994 and worked there for ten years on research and development of visualization systems. Wilkinson subsequently worked at Skytree and Tableau before joining H2O.
Leland Wilkinson, H2O.ai - The Grammar of Graphics and the Future of Big Data Visualization - H2O World 2019 NYC
1. The Grammar of Graphics and the Future of Big Data Visualization
Leland Wilkinson
Chief Scientist
H2O
2. The Grammar of Graphics (1999, 2005)
• Show programmers how to design and implement statistical graphics.
• Reveal the mathematical foundation of statistical graphics.
8. The GG Rule
• A statistical graphic is a representation of the graph of a function.
• The graph of a function is a subset of the product of its domain and codomain.
• The graphic representing F(z) = (z) here is blue.
• The rest is annotation.
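The definition of a graph as a subset of the product of domain and codomain can be sketched in a few lines. This is only an illustration: the finite `domain`, the `codomain`, and the squaring function `f` are hypothetical choices, not taken from the talk.

```python
from itertools import product

# Hypothetical finite domain and codomain, chosen only for illustration.
domain = [-2, -1, 0, 1, 2]
codomain = [0, 1, 2, 3, 4]

def f(x):
    """Illustrative function; any function from domain to codomain would do."""
    return x * x

# The graph of f is the set of pairs (x, f(x)).
graph = {(x, f(x)) for x in domain}

# The product of domain and codomain contains every possible pair.
full_product = set(product(domain, codomain))

# The graph is a subset of that product; a graphic is a drawing of this subset.
print(graph <= full_product)
print(sorted(graph))
```

Everything else on a chart (axes, labels, legends) sits outside `graph`, which is the sense in which the rest is annotation.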
14. Soleto Map ca. 500 BCE, southern region of Italy’s heel, discovered in a dig supervised by Thierry van Compernolle, Montpellier University
And this
15. The GG Function Chain
• Each ellipse is a class.
• Each class contains member functions.
  • functions are interchangeable within a class.
• The chain is a total order.
  • changing this order produces a meaningless graphic.
• The chain is invertible.
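A toy sketch of a chain with a fixed total order follows. The stage names are taken from the chapter sequence of the book and are an assumption relative to this slide, which shows the stages only as ellipses in a figure; `make_stage` and `run` are hypothetical helpers, and the "state" is just a list of stage names rather than real graphics data.

```python
from functools import reduce

# Stage names assumed from the book's chapter sequence (not listed on the slide).
CHAIN = ["variables", "algebra", "scales", "statistics",
         "geometry", "coordinates", "aesthetics"]

def make_stage(name):
    """Return a stage that accepts only the output of its predecessor."""
    position = CHAIN.index(name)

    def stage(state):
        # state is the list of stage names applied so far
        if len(state) != position:
            raise ValueError(f"{name} applied out of order after {state}")
        return state + [name]

    return stage

def run(order):
    """Apply the stages in the given order, starting from an empty state."""
    return reduce(lambda state, name: make_stage(name)(state), order, [])

print(run(CHAIN))  # stages applied in the required order
try:
    run(["scales", "variables"])
except ValueError as err:
    print("rejected:", err)
```

The sketch only enforces the ordering; it does not attempt to show invertibility.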
19. Variables
• Variables is a class that builds a data view.
• Variables receives a Dataset and outputs a Varset.
• A Dataset is a set of data.
• A Varset is a set of variables.
• A Variable is a function from a set of objects O to a set of values V (a many-to-one mapping).
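The definitions above can be sketched as follows. The `dataset` contents and the helper names `make_variable` and `make_varset` are hypothetical; the point is only that a variable is a function from objects to values and a varset is a collection of such functions.

```python
# Hypothetical dataset: each key is an object in O, each row holds raw data.
dataset = {
    "case1": {"species": "cow", "weight_kg": 650.0},
    "case2": {"species": "hen", "weight_kg": 2.5},
    "case3": {"species": "cow", "weight_kg": 700.0},
}

def make_variable(dataset, column):
    """Build a variable: a function from objects O to values V."""
    def variable(obj):
        return dataset[obj][column]
    return variable

def make_varset(dataset, columns):
    """Build a varset: a named collection of variables over the same objects."""
    return {col: make_variable(dataset, col) for col in columns}

varset = make_varset(dataset, ["species", "weight_kg"])
species = varset["species"]

# Many-to-one: two different objects map to the same value.
print(species("case1"), species("case3"))
```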
23. Algebra
barnyard + zoo
• Barnyard and Zoo must be from a common class or measurement scale for blend to be legal.
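A minimal sketch of the blend rule, assuming each varset carries a unit label. The `Varset` dataclass, its `unit` field, and the sample values are hypothetical and serve only to show a legal and an illegal blend.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Varset:
    """A one-column varset: values plus a unit label (hypothetical field)."""
    name: str
    unit: str
    values: tuple

def blend(a, b):
    """Blend (+): stack the cases of a and b into a single column."""
    if a.unit != b.unit:
        raise TypeError(f"cannot blend {a.unit} with {b.unit}")
    return Varset(f"{a.name}+{b.name}", a.unit, a.values + b.values)

barnyard = Varset("barnyard", "animal", ("cow", "hen"))
zoo = Varset("zoo", "animal", ("lion", "zebra"))
weight = Varset("weight", "kilogram", (650.0, 2.5))

combined = blend(barnyard, zoo)
print(combined.name, combined.values)
try:
    blend(barnyard, weight)
except TypeError as err:
    print("illegal blend:", err)
```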
24. Algebra
Head injury data from the National Highway Traffic Safety Administration (NHTSA)
The full design model for these data is
H = C + M + V + O + T(MV) + MV + MO + VO + OT(MV) + MVO
where the symbols are
H: Head Injury Index
C: constant term (grand mean)
M: Manufacturer
V: Vehicle (car or truck)
O: Occupant (driver or passenger)
T: Model
The smallest plausible subset model is:
H = C + V + O + T(MV)
The GG algebra expression for this graphic is
H*T/(M*V)*O
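The difference between cross (*) and nest (/) in an expression such as T/(M*V) can be sketched as below. This is a simplified reading of the two operators, not the book's formal definition, and the rows are made-up placeholders rather than NHTSA data.

```python
# Hypothetical rows: (model T, manufacturer M, vehicle V); not NHTSA data.
rows = [
    ("A1", "Acme", "car"),
    ("A2", "Acme", "car"),
    ("B1", "Bolt", "truck"),
]
T = [r[0] for r in rows]
M = [r[1] for r in rows]
V = [r[2] for r in rows]

def cross(x, y):
    """Cross (*): pair the columns case by case; the frame spans every
    combination of the distinct values on each side."""
    tuples = list(zip(x, y))
    frame = {(a, b) for a in set(x) for b in set(y)}
    return tuples, frame

def nest(x, y):
    """Nest (/): same case tuples, but the frame keeps only the
    combinations that occur, since x is defined only within y."""
    tuples = list(zip(x, y))
    frame = set(tuples)
    return tuples, frame

mv_tuples, mv_frame = cross(M, V)
_, crossed = cross(T, mv_tuples)
_, nested = nest(T, mv_tuples)
print(len(crossed), len(nested))
```

Nesting model within manufacturer and vehicle avoids empty cells for combinations that cannot exist, such as one manufacturer's model name under another manufacturer.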
26. Scales
• Scales map Varsets to Dimensions.
• A Dimension is a dimension of R^n.
• Scales use one-to-one mappings within Frames.
  • identity(), log(), permutation(), ...
• More than one Varset can map to a Dimension.
• Representational measurement is not sufficient.
  • nominal, ordinal, interval, ratio (Stevens)
  • too general for blend (+) operator
  • we can’t blend area with weight
  • we can’t blend speed with acceleration
  • we can’t blend a density and a distribution function
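The one-to-one scale mappings named on this slide can be sketched as pairs of forward and inverse functions. The function names follow the slide, but the signatures, the `base` parameter, and the choice to return a `(forward, inverse)` pair are assumptions made for illustration.

```python
import math

def identity():
    """Identity scale: values are used as positions unchanged."""
    return (lambda v: v), (lambda d: d)

def log(base=10.0):
    """Log scale: one-to-one on positive values, inverted by exponentiation."""
    return (lambda v: math.log(v, base)), (lambda d: base ** d)

def permutation(order):
    """Categorical scale: map categories to positions in an explicit ordering."""
    to_pos = {cat: i for i, cat in enumerate(order)}
    return (lambda v: to_pos[v]), (lambda d: order[d])

forward, inverse = permutation(["car", "truck"])
print(forward("truck"), inverse(forward("truck")))

log_forward, log_inverse = log()
print(round(log_inverse(log_forward(1000.0)), 6))
```

Because each mapping has an inverse, a position on a dimension can be traced back to the data value that produced it.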
27. Scales
• We need measurement units.
  • more restrictive than axiomatic measurement scales
37. nViZn
a cover tribute to Martin Wattenberg by Graham Wills
left: Wattenberg visualizing music; right: Wills visualizing his own lecture.
Wills programmed his example in nViZn the evening before his talk at the 2008 Joint Statistical Meetings (the day after Wattenberg’s talk).
38. nViZn
a cover tribute to Brad Paley by Graham Wills (drawn by nViZn)
left: Paley visualizing Alice in Wonderland; right: Wills visualizing Jungle Book.
Wills programmed his example in nViZn the evening before his talk at the 2008 Joint Statistical Meetings.
39. TODO
• Functions: Getting the function chain right
• Data: Plotting Big Data
• UI: A new UI
• Automatic Visualization: Visualizing Anomalies
• Inverse Problems: Reading graphics
42. TODO
• UI: A new UI
[Figure panels: Grammar of Graphics, Tableau, Q]
43. TODO
• Automatic Visualization: Visualizing Anomalies
• Not automatic picking of chart type based on data.
• Decision trees too simple.
• Mackinlay’s work is still the best on this topic.
• Not using novice judgments of what is “interesting.”
• Novices are biased and ignorant.
• Instead, identify the kinds of visual anomalies Tukey would have found.
44. TODO
• Inverse Problems: Reading graphics
• Chapter 19 of GG (Reader) outlines a strategy for inverting the function chain
• ICDAR and other conferences on this topic
• ReVision: automated classification, analysis and redesign of chart images