I survey three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I also discuss some methods for visualizing large data sets.
“A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009. Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I’ll also discuss some methods for visualizing large data sets. I’ll end with an overview of rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E. Driscoll, http://www.dataspora.com, [email_address]
Hal Varian said that “The sexy job in the next ten years will be statisticians…” (in a 2009 interview with McKinsey Quarterly). Data visualization is the fastest means of feeding our brains data, because it leverages our highest-bandwidth sensory organ: our eyes. Statistical visualization is sexy both because high-density information plots tickle our brains – we crave information – and because it is hard to do well.
A data visualization is often the final step in a three-step data sense-making process, whereby data is (i) “munged” (e.g. collected, cleansed, and structured), (ii) modeled – relationships in the data are explored and hypotheses tested – and finally (iii) visualized – a particular model of the data is represented graphically. At Facebook, their data engineers are called “data scientists.” I like this term because it conveys that working with data involves the scientific method, predicated on making hypotheses and testing them. Ultimately, we are interested in using data to make hypotheses about the world.
Like this one, from Jessica Hagy’s witty blog, Indexed (thisisindexed.com). She visualizes a hypothesis that free time and money are related – e.g. that you have the most free time when you’re broke and when you’re rich. I decided to test this hypothesis with data on working hours (whose complement is free time) and GDP from 29 OECD countries.
Using R, I tested this hypothesis with data for 29 OECD countries: 2006 figures on annual hours worked and GDP per capita. I modeled it with both linear and polynomial regression models – just a few lines of code.
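In code, fits like these really are just a few lines. Here is a minimal sketch, with a few illustrative values standing in for the actual OECD figures (the data frame `oecd` and its columns `gdp` and `hours` are my placeholders, not the original data):

```r
# Hypothetical stand-in for the OECD data: GDP per capita (in $1000s)
# and annual hours worked, for illustration only.
oecd <- data.frame(
  gdp   = c(20, 28, 35, 44, 60, 75),
  hours = c(1900, 1800, 1700, 1600, 1550, 1500)
)

fit.linear <- lm(hours ~ gdp, data = oecd)           # linear model
fit.poly   <- lm(hours ~ poly(gdp, 2), data = oecd)  # quadratic polynomial

summary(fit.linear)          # inspect coefficients and goodness of fit
anova(fit.linear, fit.poly)  # does the quadratic term improve the fit?
```

The `anova()` comparison is one quick way to ask whether Hagy’s U-shaped hypothesis (the quadratic) beats a straight line.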
And using R, I visualized it. Her wealth–free time hypothesis was half-right. The richer you are, the more free time you have (the extreme rightmost point is Luxembourg). But at least for this subset of countries, the relationship is strictly linear – the poorest OECD countries have the least free time. (In the code shown on the right, I’m using ggplot2, not the base graphics plot function from the previous slide. But ggplot2 will automatically do a loess fit for us.)
In this section, I describe the graphics functions built into R, which require no external packages.
First, let’s peek under the covers of the R graphics stack. At the top-most level are packages, like “maps”, “lattice”, and “ggplot2”. These packages make calls to a lower-level graphics system, of which R has two – called “graphics” and “grid”. According to Nicholas Lewin-Koh, the goal of these graphics systems is to “create coordinates for each graphical object and render them to a device or canvas. In addition the system may manage (i) a stack of graphics objects, (ii) local state information, (iii) redrawing and resizing.” Finally, these graphics systems render output to a variety of devices – which, for our purposes, can be considered image formats such as PNG, JPG, and PDF. Devices most commonly include interactive displays – such as those on Windows or Mac OS X – which R sends its output to by default during an interactive session. Grid is the newer system, and both “lattice” and “ggplot2”, which I’ll discuss later, use grid.
plot() is a “do the right thing” graphics command. plot() is the simplest R command for generating a visualization of an R object. It’s an overloaded function that just “does the right thing”, yielding a quick view of many R objects that are passed to it. These built-in basic plotting commands are useful if you’re just doing quick, exploratory analysis, and publication-quality graphs are not what you’re looking for.
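A quick sketch of that overloading, using only built-in datasets – plot() dispatches on the class of whatever you hand it:

```r
# plot() does the right thing for different classes of object.
plot(1:10)                      # numeric vector: index vs. value
plot(factor(c("a", "b", "b")))  # factor: bar plot of level counts
plot(cars)                      # two-column data frame: scatter plot

fit <- lm(dist ~ speed, data = cars)
plot(fit)                       # lm object: a series of diagnostic plots
```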
We can interactively add layers – lines, points, and text – to plots using basic graphics functions. One such example is abline – so named for the a (intercept) and b (slope) parameters it uses to draw the line y = a + bx.
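Layering in practice might look like this (using the built-in cars data as a stand-in for the slide’s example):

```r
# Start with a scatter plot, then layer lines, points, and text onto it.
plot(cars$speed, cars$dist, main = "Stopping distance vs. speed")

fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red")            # abline also accepts a fitted lm directly
abline(a = 0, b = 3, lty = 2)       # intercept a = 0, slope b = 3, dashed

points(10, 100, pch = 19)           # add a single highlighted point
text(10, 100, "outlier?", pos = 4)  # and label it to the right
```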
par is a function for setting graphical parameters for base graphics – and, nota bene, these parameters are often shared by the higher-level packages I discuss later. Once parameters are defined via par, graphics functions like plot will use these new parameters in subsequent plots. The example above shows the setting of three parameters: pch to set a plotting character (21 denotes a filled circle), cex to set size or character expansion (1 is the default, 5 is bigger), and col to set color, which is definable as a name (“blue”), an integer (1–7 for primaries), or an RGB value (as above).
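A small sketch of the par() pattern, including the idiom of saving and restoring the old settings:

```r
# par() sets parameters globally and returns the previous values,
# so you can restore them when you're done.
op <- par(pch = 21, cex = 1.5, col = "blue")
plot(rnorm(20), rnorm(20))   # drawn with the new defaults
par(op)                      # restore the prior settings
```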
Graphics parameters can be set via par(), or passed directly to graphics functions. Above are some more parameters that you can set using par(). For a full list, type help(par) at the R prompt. You can also pass these parameters directly to graphics functions, for example, points(5, 3, pch=19, col="blue"). The chart on the right is an example of a plot painstakingly created with the low-level plotting parameters and functions above. This was done by interactively layering additional text labels and legends on after the initial points were plotted.
Edward Tufte has lauded the value of “small multiples” in information graphics: namely, the incorporation of many small plots in a single graphic. R provides a basic facility for subdividing a display device (or ultimately its printed representation) into several panels. This can be achieved by setting the graphics parameter mfrow, which stands for multiple figures plotted row-wise.
With the mfrow parameter, a 2 x 2 matrix of sub-panels – as in the example above – can be set up, and plots will be interactively drawn in these sub-panels. The code above illustrates the creation of four figures in a single graphic, and the result is shown in the next slide. (There is also an mfcol parameter for plotting multiple figures in a column-wise manner.)
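The pattern, sketched with built-in data rather than the slide’s figures:

```r
# A 2 x 2 grid of figures; subsequent plots fill the grid row-wise.
par(mfrow = c(2, 2))
plot(cars)                # panel 1: scatter plot
hist(cars$dist)           # panel 2: histogram
boxplot(cars$speed)       # panel 3: boxplot
plot(density(cars$dist))  # panel 4: kernel density
par(mfrow = c(1, 1))      # back to a single figure per device
```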
Unless a data visualization is of unusually high density, most modern display devices allow for upwards of 16 figures to be suitably resolved on a single device. See the splom() function for automatic creation of such dense graphics.
R graphics devices can present some “gotchas”. Normally one need not have any knowledge of the graphics devices that underlie the R graphics system. But in a few cases, it’s worth knowing something about them. While typical users can save R graphics on Windows or Mac OS X via a “Save As” dialog in the graphics window, if one is not using a GUI, exporting graphics requires manually opening a device – with one of several device commands (such as pdf() or png()) – and closing it properly (using dev.off()). Also, when exporting graphics in a non-interactive environment (via a script, for instance), it’s critical to invoke the print() function, which will properly write a graphic to the available device. This “print” issue can be a real gotcha for scripts.
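The full open-plot-close cycle, as it would appear in a script:

```r
# Exporting a plot from a script: open a device explicitly, then close it.
pdf("speed-vs-dist.pdf", width = 6, height = 4)  # or png("plot.png")
plot(cars$speed, cars$dist)  # base graphics draw straight to the open device
# lattice and ggplot2 objects, by contrast, must be print()-ed in a script:
# print(lattice::xyplot(dist ~ speed, data = cars))
dev.off()                    # close the device so the file is fully written
```

Forgetting dev.off() leaves a truncated or empty file; forgetting print() for lattice/ggplot2 objects leaves no plot at all.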
Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a new visualization package, formally released in 2009, developed by Professor Hadley Wickham. It is based on a different perspective on developing graphics, and has its own set of functions and parameters.
The ‘gg’ in ggplot2 is a reference to a book called The Grammar of Graphics, written by Leland Wilkinson. The book conceives of graphics as compositional – made up of colors, visual shapes, and coordinates, much as sentences are made up of parts of speech.
I’ve illustrated an incomplete version of Wilkinson’s grammar in this slide, to convey how graphics are built up from – and out of – their component parts. As such, Wilkinson advocates that graphical tools should leave behind what he deems “chart typologies” – rigid casts of pie charts, bar graphs, or scatter plots into which data is poured. (Excel’s chart wizard might be thought of as the Mad Libs of graphics – a pre-defined structure with limited degrees of freedom.) Conceived as compositional, a graphical grammar allows for an infinite variety of graphical constructions.
In the upcoming examples, drawn directly from Hadley Wickham’s book on ggplot2, we’ll visualize data concerning ~50,000 diamonds. We’ll start simple and build to more complex graphs by specifying additional elements of the graphical grammar. This data is in the ggplot2 package; more information is available with help(diamonds) (after loading ggplot2). For our purposes, we’re concerned with examining relationships among just a few dimensions of this data, namely: carat, cut, clarity, and price.
We begin with a basic scatter plot of these 50,000 diamonds. In ggplot2, the command to build this plot is qplot(), which stands for “quick plot”. We pass qplot() two dimensions of our data (carat and price), and it defaults to a scatter plot representation. Also worth noting: ggplot2’s visual defaults are quite easy on the eyes – in contrast to most of R’s base graphics. This plot reveals that, not surprisingly, the price of diamonds increases as they get bigger (in terms of carats). Somewhat more interesting is how: we perceive that price seems to increase exponentially (and we test this hypothesis in the next slide).
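The whole plot is one line once ggplot2 is loaded:

```r
library(ggplot2)  # provides qplot() and the diamonds data set

# Two dimensions of the data; qplot() defaults to a scatter plot.
qplot(carat, price, data = diamonds)
```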
Next, we log-transform our data, and reveal that, as we suspected, the relationship between a diamond’s price and its carat is exponential. It should be noted that we can achieve this transformation in two equivalent ways: (i) we can directly transform our data with the log function, or (ii) we can transform the coordinate scales on which our data is plotted. In ggplot2, this latter approach is achieved by passing the parameter log="xy" to qplot. Because the two normalization approaches rely on different parts of graphical speech – data and scale – this nicely illustrates that, as in language, there is more than one way to express data visually using this grammar of graphics and ggplot2.
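Both routes, side by side:

```r
library(ggplot2)

# (i) transform the data itself...
qplot(log(carat), log(price), data = diamonds)

# (ii) ...or transform the scales the data is plotted on.
qplot(carat, price, data = diamonds, log = "xy")
```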
Another element of the graphical grammar is the aesthetic appearance of plotted points. Here, we pass a parameter, alpha, which controls the transparency of the points plotted. The parameter’s value, I(1/20), indicates that each point should have 1/20th of full intensity: thus 20 overplotted points are required at any given location to achieve full saturation (in this case, to black). (Note: the I() function in R inhibits further interpretation of its argument, so the value can be thought of as simply the fraction 1/20.) This method uncovers some interesting distributions in the data that were previously obscured by overplotting. For example, we can detect that points are highly concentrated around specific carat sizes. Contrast this method with our earlier approach to alpha blending with base graphics, which required manually specifying an RGB hex code.
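As code:

```r
library(ggplot2)

# Each point carries 1/20 of full intensity; 20 overplotted points
# are needed at one location to reach full saturation.
qplot(carat, price, data = diamonds, alpha = I(1/20))
```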
Here we layer on yet another element of the grammar – color – to show how clearer stones are more expensive. ggplot2 automatically creates a legend for the mapping of the variable onto color. (Note: Wickham’s choice of a default color palette is not accidental – the colors are of equal luminance, so no one color dominates the others. For more than you ever wanted to know about color choice, see http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture13.pdf.)
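In code – I’m assuming, from the “clearer stones” remark, that the mapped variable is the diamonds data’s clarity column:

```r
library(ggplot2)

# Map a data dimension onto color; ggplot2 builds the legend automatically.
qplot(carat, price, data = diamonds, colour = clarity)
```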
Now we use another element of the grammar – what is termed ‘facets’ – to splinter our graphic into a number of subplots along a given dimension. Here we achieve the small multiples that we previously created using the par function and the mfrow parameter. These sorts of sub-divided plots are what the lattice system excels at, as we’ll see later. What can we say from this plot? Well, if anything, clear-colored diamonds (“D”) seem to get expensive more quickly (a slightly steeper slope as a function of their size) versus yellower diamonds.
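Sketched in code – the D-versus-yellower comparison suggests the faceting dimension is the diamonds data’s color column, which is my assumption here:

```r
library(ggplot2)

# One subplot per diamond color grade (D through J), side by side.
qplot(carat, price, data = diamonds, facets = . ~ color)
```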
Let’s take another view of the data. Here we’re interested in seeing how color influences the per-carat cost of a diamond. The boxplot on the left shows that nearly clear diamonds (color categories ‘D’ and ‘E’) have a greater number of high-priced outliers, but their median (the center line of each box) is nearly identical to the others. The so-called jitter plot on the right shows this same view of the data, but all of the points are shown – in this case, the points are plotted into bins according to a categorical variable, diamond color, and “jittered” within each bin to prevent overplotting and allow a sense of the local density at different values along the common y-dimension of price/carat.
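Both views, as code:

```r
library(ggplot2)

# Boxplot view: price per carat, binned by color grade.
qplot(color, price / carat, data = diamonds, geom = "boxplot")

# Jitter view: every point shown, spread within each color bin;
# alpha blending keeps the dense regions readable.
qplot(color, price / carat, data = diamonds,
      geom = "jitter", alpha = I(1/20))
```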
A display of 50,000 data points. Why not? Our eyes can handle – and, I submit, crave – these kinds of rich visualizations. This also allows us to detect features of the data (for example, several thin white bands across the bottom of the bars – perhaps preferred price/carat combinations?) that may be missing from more simplified views of the data.
lattice is an alternative high-level graphics package for R. Like ggplot2 it is built on the grid graphics system.
lattice is named in honor of its predecessor, trellis , which was a visualization library developed for the S language by William Cleveland. trellis was so named because of how it visualizes higher dimensions of data: it splinters these dimensions across space, producing a grid of small multiples that resemble a trellis. In the next series of slides I show how we can use lattice to visualize up to six dimensions of data in a single plot.
To demonstrate lattice’s multivariate visualizing abilities, we’ll use a fascinating data set called MLB Gameday. Since 2007, Major League Baseball has tracked the path and velocity of > 1 million pitches thrown. Sample data is here: http://gd2.mlb.com/components/game/mlb/year_2008/month_03/day_30/gid_2008_03_30_atlmlb_wasmlb_1/pbp/pitchers/400010.xml
With just two dimensions of data to describe – the x and y location in the strike zone – we can use lattice’s xyplot function. Unlike ggplot2, the first argument that we pass to lattice’s plotting functions (of which xyplot is just one) is a formula describing a relationship in the data to be plotted. In this case, “x ~ y” can be read as “x depends on y”. Note the visual defaults: not as easy on the eyes as ggplot2 (which has a lower-contrast gray background), but an improvement on R’s base graphics plots.
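A minimal sketch of the call – the `pitches` data frame and its `x`/`y` columns here are stand-ins I’ve made up, not the actual Gameday field names:

```r
library(lattice)

# Hypothetical pitch-location data, for illustration only.
pitches <- data.frame(x = rnorm(200), y = rnorm(200, mean = 2.5))

# The formula interface: the left-hand variable plotted against the right.
xyplot(x ~ y, data = pitches)
```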
In this plot, I’ve layered a third dimension, pitch type, into our plot by using lattice’s “groups” parameter, which uses a different plotting symbol for each type, and includes a legend across the top. Alas, this is not a particularly informative chart. The symbols are overplotted on top of each other: trends among the pitch types are hard to discern. With lattice, we can use yet another approach.
Now we’re doing what lattice does best – splintering a dimension, in this case pitch type, across space. We do this by using R’s condition operator, |, in the formula we pass to lattice (the formula “x ~ y | type” can be read as “x depends on y, conditioned on type”).
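Sketched with the same hypothetical data – the `type` codes here are invented placeholders for whatever pitch-type labels Gameday uses:

```r
library(lattice)

# Hypothetical pitches with an assumed pitch-type column.
pitches <- data.frame(
  x    = rnorm(300),
  y    = rnorm(300, mean = 2.5),
  type = sample(c("fastball", "curveball", "changeup"), 300, replace = TRUE)
)

# The | operator conditions on type: one panel per pitch type.
xyplot(x ~ y | type, data = pitches)
```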
Now we include a fourth dimension in our plot – pitch speed – by using color. The speed to color mapping is relatively intuitive (seen in upper right), red is fast, blue is slow. How we achieve this is not particularly simple: we must use what lattice deems “panel functions”, which allow us to extend the default appearance of the chart.
Finally we add a fifth dimension, local density, to our plots using a two-dimensional color palette, where speed is related to chroma, and local density to luminance. This is an attempt to control for some overplotting that might otherwise occur when we shrink these pitch plots down in size.
Now we can compare two different pitchers – the sixth dimension – in a single graphic. The six dimensions of data we visualized with lattice are thus: 1. and 2. the x and y location of the pitch; 3. pitch type; 4. pitch speed; 5. pitch density (lots of pitches make darker luminosity without changing hue); 6. pitcher (Cole or Hamels).
As mentioned, the lattice package provides several other graphics functions besides xyplot. Some are listed above, and the densityplot() function is highlighted at the bottom. This is a particularly useful alternative to standard histograms, which can suffer from binning artifacts.
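A quick comparison, using the built-in cars data:

```r
library(lattice)

# A smooth kernel-density estimate: no bin-width artifacts.
densityplot(~ dist, data = cars)

# The lattice histogram of the same variable, for comparison.
histogram(~ dist, data = cars)
```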
In this section I mention a couple of techniques for handling large data sets.
This is bad for two reasons: (1) overplotting obscures data, even when alpha blending is used; (2) it’s highly inefficient, both on screen and especially if saved as a vector graphic (huge PDFs). Two solutions: (i) resort to sampling; (ii) map the density of points onto some other attribute, such as color. hexbinplot and geneplotter do just this.
hexbinplot() is a graphics function (in the hexbin package) that divides a scatter plot area into hexagons, counts occurrences within each of these hexagonal areas, and maps these counts onto a color scale. The result is a plot, as shown, where the graphics device need only draw as many points as there are hexagons. In the case of the diamond data, rather than 50,000 points being graphed, just ~2,000 hexagons are. This also reveals some of the clumpiness in the data, though not as well as ggplot2’s alpha-blended scatter plots.
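A sketch of the call on the diamonds data (the xbins value is illustrative, not the one used for the slide):

```r
library(hexbin)
library(ggplot2)  # only for the diamonds data set

# ~50,000 points collapse to a grid of hexagonal bins colored by count.
hexbinplot(price ~ carat, data = diamonds, xbins = 50)
```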
This is an Affymetrix gene chip, with 100,000 data points. On the right we have the output of a typical microarray assay: the colors correspond to RNA expression levels. With R, I can distill these 100,000 data points down to a simple model – and visualize it.
The data visualization on the right, called an M-A plot, is a variation of an XY scatter plot in which we compare the observed signals for a particular microarray to a composite background distribution, with both ordered by intensity of signal. Deviations from the straight line show differences between our array and the background (in this case, our array tends to have higher signals across the board). Typically we generate an M-A plot for every array in our compendium to yield a big-picture view of the consistency of our arrays across experiments – the flatter the red lines, the better (remember that in most models of cellular behavior we expect only a small fraction of genes to change in expression).
Ross Ihaka’s colorspace package provides access to useful color spaces beyond RGB, like LAB and HSV. These color spaces are preferred by artists and designers for their more intuitive properties. This is the package I used to design the palettes in the pitching plots shown earlier. For my opinionated comments on using color in data visualizations, visit: http://dataspora.com/blog/how-to-color-multivariate-data/
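Even without the colorspace package, base R’s hcl() function illustrates the idea – building a palette in hue-chroma-luminance space so every color has equal perceptual weight (the specific h/c/l values below are illustrative choices, not the ones from my pitch plots):

```r
# Six hues at fixed chroma and luminance: no color dominates the others.
pal <- hcl(h = seq(0, 300, length.out = 6), c = 60, l = 70)
pal  # a vector of six hex color strings
```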
Before we end, some thoughts on how R can be used a visualization engine on the web.
So I’ve pushed this pitch visualization application into a web app, using RApache. I can do this because R is open source – without licensing restrictions. Data and processing can both live on the server – important when your data set is huge (this one is around 20 gigabytes). And when the data changes, the dashboard updates. No local software installation is needed, and updates are instantly available to all web users. It can be part of the open source web-analytics stack, with a catchy name – LAMR. If you can think of something less lame, let me know.
Why Embed R into a Web-based Architecture? Immediately access the many benefits of a web architecture that is: * Stateless/Scalable – URL requests can be distributed across one or many servers * Cacheable - common requests made to the R server can be cached by Apache * Secure - we can piggyback on existing HTTPS architecture for analysis of sensitive data
rapache: Embedding R within the Apache Server Our tool of choice is rapache, developed by Jeff Horner at Vanderbilt University. http://biostat.mc.vanderbilt.edu/rapache/
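To give a flavor of the idea, here is a minimal handler sketch in the style of rapache’s examples – it assumes an Apache instance already configured to route a URL to this R script, and uses setContentType()/cat(), which I believe are the rapache idioms for writing a response (treat the details as an assumption and consult the rapache documentation for the authoritative API):

```r
# Hypothetical rapache handler: Apache runs this script per request.
setContentType("text/html")          # set the HTTP Content-Type header
cat("<html><body>")
cat("<h1>Pitch dashboard</h1>")
cat("<p>Generated at ", format(Sys.time()), "</p>", sep = "")
cat("</body></html>")
DONE                                 # rapache's signal that the response is complete
```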
Naturally this is just scratching the surface of what rapache can do. An alternative to printing HTML directly is to use a templating system, similar to PHP. This is available via the R package brew (also developed by Jeffrey Horner), downloadable on CRAN and at: http://www.rforge.net/brew/
The ggplot2 and lattice books are both published by Springer (ggplot2 as of July 2009) and available via Amazon. Example code and figures from the ggplot2 book: http://had.co.nz/ggplot2 Example code and figures from the lattice book: http://lmdvr.r-forge.r-project.org/
Michael E. Driscoll is Principal and Founder of Dataspora LLC. He has a decade of experience developing large-scale databases and data mining algorithms within industry, government, and academic institutions. He founded and until 2008 served on the board of CustomInk.com, an Inc. 500 online retailer. Michael has a Ph.D. in Bioinformatics from Boston University and an A.B. from Harvard College.