Example sweavefunnelplot

Published in: Technology

1 Example of self-documenting data journalism notes

This is an example of using Sweave to combine code and output from the R statistical programming environment and the LaTeX document processing environment to generate a self-documenting script in which the actual code used to do stats and generate statistical graphics is displayed alongside the charts it directly produces.

1.1 Getting Started...

The aim is to try to replicate a graphic included by Ben Goldacre in his article "DIY statistical analysis: experience the thrill of touching real data" 1.

1 http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis

> # The << echo = T >>= identifies an R code region;
> # echo=T means the code is shown in the output, along with what happens when it's run
> # In the code area, lines beginning with a # are comment lines and are not executed
>
> #First, we need to load in the XML library that contains the scraper function
> library(XML)
> #Now we scrape the table
> srcURL="http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis"
> cancerdata=data.frame(
+   readHTMLTable( srcURL, which=1, header=c("Area","Rate","Population","Number") ) )
>
> #The @ symbol on its own at the start of a line marks the end of a code block

The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to extract the N'th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).

We can inspect the data we've imported as follows:

> #Look at the whole table (the whole table is quite long,
> # so don't display it; comment out the command for now instead).
> #cancerdata
> #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
> #Look at the column headers
> names(cancerdata)
[1] "Area"       "Rate"       "Population" "Number"
> #Look at the first few rows (head shows 6 by default)
> head(cancerdata)
              Area  Rate Population Number
1 Shetland Islands 19.15      31332      6
2         Limavady 21.49      32573      7
3       Ballymoney 17.05      35191      6
4   Orkney Islands 29.87      36826     11
5            Larne 27.54      39942     11
6      Magherafelt 15.26      45872      7
> #Look at the last few rows (tail also shows 6 by default)
> tail(cancerdata)
          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251
> #What sort of datatype is in the Number column?
> class(cancerdata$Number)
[1] "factor"

The last line, class(cancerdata$Number), identifies the data as type factor. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers. (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters, so rather than seeing each number as a number, R identifies it as a category.)

> #Convert the numerical columns to a numeric datatype
> cancerdata$Rate =
+   as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
> cancerdata$Population =
+   as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
> cancerdata$Number =
+   as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
> #Just check it worked
> class(cancerdata$Number)
[1] "numeric"
> class(cancerdata$Rate)
[1] "numeric"
> class(cancerdata$Population)
[1] "numeric"
> head(cancerdata)
              Area  Rate Population Number
1 Shetland Islands 19.15      31332      6
2         Limavady 21.49      32573      7
3       Ballymoney 17.05      35191      6
4   Orkney Islands 29.87      36826     11
5            Larne 27.54      39942     11
6      Magherafelt 15.26      45872      7

We can now plot the data as a simple scatterplot using the plot command (figure 1), or we can add a title to the graph and tweak the axis labels (figure 2).

The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn't part of the standard R bundle, so you'll need to install the package yourself if you haven't already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies.) You can see the sort of chart ggplot creates out of the box in figure 3.
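The factor-to-numeric conversion used earlier is easy to get wrong, so here is a minimal sketch of why the levels() lookup is needed. The toy factor f and the variable names are illustrative assumptions; its three values happen to match the first Rate entries in the table.

```r
# Toy factor standing in for a scraped column (illustrative, not the real data frame)
f <- factor(c("19.15", "21.49", "17.05"))

# Calling as.numeric() directly returns the internal level codes, not the values
codes <- as.numeric(f)

# Looking the codes up in levels(f) recovers the original numbers
values <- as.numeric(levels(f)[as.integer(f)])

# An equivalent, slightly more readable form goes via character strings
values2 <- as.numeric(as.character(f))

print(codes)
print(values)
```

The first print shows small integers (the positions of each value in the sorted list of levels), which is why converting a factor column with as.numeric() alone silently produces the wrong numbers.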
> #Plot the Number of deaths by the Population
> plot(Number ~ Population, data=cancerdata)

Figure 1: Vanilla scatter plot (y-axis: Number, x-axis: Population)
> #Plot the Number of deaths by the Population.
> #Add in a title (main) and tweak the y-axis label (ylab).
> plot(Number ~ Population, data=cancerdata,
+   main="Bowel Cancer Occurrence by Population", ylab="Number of deaths")

Figure 2: Vanilla scatter plot (title: Bowel Cancer Occurrence by Population; y-axis: Number of deaths, x-axis: Population)
> require(ggplot2)
> #Plot the Number of deaths by the Population
> p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))
> print(p)

Figure 3: A rather prettier plot (y-axis: Number, x-axis: Population)
1.2 Generating the Funnel Plot

Doing a bit of searching for the "funnel plot" chart type used to display the data in Goldacre's article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics related Q&A: How to draw funnel plot using ggplot2 in R? 2

The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a couple of things:

  1. work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the range 0..1. The x-axis is the population.
  2. change the range and width of samples used to create the curves.
  3. change the y-axis range.

You can see the result in the funnel plot produced by the code that follows.

2 http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
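As a quick sanity check on step 1, the normalisation and the standard error can be worked through by hand for a single row. This is only a sketch: it uses the Shetland Islands figures from the table shown earlier, the variable names are illustrative, and the 1.96 and 3.29 multipliers are the usual normal-approximation values for 95% and 99.9% limits.

```r
# Values for Shetland Islands, taken from the table shown earlier
rate_per_100k <- 19.15
population <- 31332

# Normalise the per-100,000 rate to a proportion in the range 0..1
p <- rate_per_100k / 100000

# Binomial standard error of that proportion for this population size
p.se <- sqrt(p * (1 - p) / population)

# Half-widths of the approximate 95% and 99.9% intervals
half95 <- 1.96 * p.se
half999 <- 3.29 * p.se

print(c(p = p, p.se = p.se, half95 = half95, half999 = half999))
```

Because the standard error shrinks with the square root of the population, the limits drawn around the overall rate narrow as the population grows, which is what gives the plot its funnel shape.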
> #TH: funnel plot code from:
> #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
> #TH: Use our cancerdata
> number=cancerdata$Population
> #TH: The rate is given as a per 100,000 value, so normalise it
> p=cancerdata$Rate/100000
> p.se <- sqrt((p*(1-p)) / (number))
> df <- data.frame(p, number, p.se, Area=cancerdata$Area)
> ## common effect (fixed effect model)
> p.fem <- weighted.mean(p, 1/p.se^2)
> ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
> #TH: I'm going to alter the spacing of the samples used to generate the curves
> number.seq <- seq(1000, max(number), 1000)
> number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
> dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
> ## draw plot
> #TH: note that we need to tweak the limits of the y-axis
> #linetype is a fixed setting rather than a data mapping, so it sits outside aes()
> fp <- ggplot(aes(x = number, y = p), data = df) +
+   geom_point(shape = 1) +
+   geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+   geom_hline(aes(yintercept = p.fem), data = dfCI) +
+   xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
> #Automatically set the maximum y-axis value to be just a bit larger than the max data value
> fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
> #Label the outlier point
> fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
> print(fp)

Figure: funnel plot of Bowel cancer death rate (y-axis) against Population (x-axis); the labelled outlier is Glasgow City.
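The code above picks out the outlier with a hand-chosen threshold (p>0.0003). As a rough alternative sketch, points could instead be flagged when they fall outside the 99.9% limits for their own population size. Everything here is an assumption for illustration: the toy data frame, its made-up values, and the fixed centre rate p0 are not taken from the article's data.

```r
# Toy data with made-up values (hypothetical, not from the article's table)
toy <- data.frame(
  Area = c("A", "B", "C"),
  p = c(0.00018, 0.00031, 0.00017),
  number = c(50000, 600000, 900000),
  stringsAsFactors = FALSE
)

# Hypothetical centre rate, standing in for the pooled estimate p.fem
p0 <- 0.000176

# 99.9% limits evaluated at each area's own population
se0 <- sqrt(p0 * (1 - p0) / toy$number)
toy$lower <- p0 - 3.29 * se0
toy$upper <- p0 + 3.29 * se0

# Keep only the areas whose rate lies outside the funnel
outside <- toy[toy$p < toy$lower | toy$p > toy$upper, ]
print(outside$Area)
```

Passing a data frame like outside to geom_text, in place of subset(df,p>0.0003), would label whichever points sit outside the funnel without needing a threshold tuned by eye.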
