1      Example of self-documenting data journalism notes
This is an example of using Sweave to combine code and output from the R statistical programming
environment with the LaTeX document processing environment to generate a self-documenting
script, in which the actual code used to do the stats and generate the statistical graphics is displayed
alongside the charts it directly produces.

1.1     Getting Started...
The aim is to try to replicate a graphic included by Ben Goldacre in his article DIY statistical
analysis: experience the thrill of touching real data 1.

>   # The << echo = T >>= identifies an R code region;
>   # echo=T means the code itself is shown in the document, along with the output it produces
>   # In the code area, lines beginning with a # are comment lines and are not executed
>
>   #First, we need to load in the XML library that contains the scraper function
>   library(XML)
>   #Now we scrape the table
>   srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
>   cancerdata=data.frame(
+     readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )
>
>   #The @ symbol on its own at the start of a line marks the end of a code block

   The format is simple: readHTMLTable(url, which=TABLENUMBER), where TABLENUMBER selects
the N'th table in the page. The header argument supplies the column names (the data pulled in
from the HTML table itself contains all sorts of clutter).
   We can inspect the data we’ve imported as follows:

>   #Look at the whole table (the whole table is quite long,
>   # so don't display it; the command is commented out for now).
>   #cancerdata
>   #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
>   #Look at the column headers
>   names(cancerdata)

[1] "Area"       "Rate"       "Population" "Number"

> #Look at the first few rows (head shows 6 rows by default)
> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7

> #Look at the last few rows (tail shows 6 rows by default)
> tail(cancerdata)

          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251

    1 http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis

> #What sort of datatype is in the Number column?
> class(cancerdata$Number)

[1] "factor"

   The last line, class(cancerdata$Number), identifies the data as type factor. In order to
do stats and plot graphs, we need the Number, Rate and Population columns to contain actual
numbers. (Factors organise data according to categories; when the table is loaded in, the data is
loaded in as strings of characters, so rather than seeing each number as a number, R treats it as
a category.) The following code converts those columns to a numeric type.

>   #Convert the numerical columns to a numeric datatype
>   #(look up the level label for each entry, then convert that label to a number)
>   cancerdata$Rate =
+     as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
>   cancerdata$Population =
+     as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
>   cancerdata$Number =
+     as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
>   #Just check it worked...
>   class(cancerdata$Number)

[1] "numeric"

> class(cancerdata$Rate)

[1] "numeric"

> class(cancerdata$Population)

[1] "numeric"

> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7
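
   The conversion above goes via levels() because calling as.numeric() directly on a factor
returns the internal category codes rather than the values printed in the table. The following
is a minimal sketch of the difference, using a small made-up factor (the variable names and
values are illustrative assumptions, not part of the scraped data):

```r
# Toy factor standing in for a scraped column (illustrative values only)
rate <- factor(c("19.15", "21.49", "17.05", "19.15"))

# Calling as.numeric() directly returns the internal level codes, not the values
codes <- as.numeric(rate)

# Route 1: index the level labels by the codes (the approach used above)
via_levels <- as.numeric(levels(rate)[as.integer(rate)])

# Route 2: convert to character first, then to numeric
via_character <- as.numeric(as.character(rate))

print(codes)
print(via_levels)
print(identical(via_levels, via_character))
```

   Both routes should recover the original values; the character route is arguably easier to
read, while the levels route only converts each distinct label once.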

   We can now plot the data as a simple scatterplot using the plot command (figure 1), or we
can add a title to the graph and tweak the axis labels (figure 2).
   The plot command is great for generating quick charts. If we want a bit more control over
the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn't part of the standard R
bundle, so you'll need to install the package yourself if you haven't already done so. In RStudio,
find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its
dependencies.) You can see the sort of chart ggplot2 creates out of the box in figure 3.
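
   If you prefer to work from the R console rather than the RStudio Packages tab, the install
step can be scripted. This is only a sketch: ensure_package is a hypothetical helper name
introduced here for illustration, and it assumes a CRAN mirror is already configured for
install.packages() to use.

```r
# Hypothetical helper: install a package only if it is not already available.
# Assumes a CRAN mirror has been configured for install.packages().
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}
```

   Calling ensure_package("ggplot2") once before library(ggplot2) should then leave the package
ready to use.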


> #Plot the Number of deaths by the Population
> plot(Number ~ Population, data=cancerdata)

   [Scatterplot: Number (y-axis, 0 to 250) against Population (x-axis, 0 to 1200000)]

                                           Figure 1: Vanilla scatter plot

> #Plot the Number of deaths by the Population.
> #Add in a title (main) and tweak the y-axis label (ylab).
> plot(Number ~ Population, data=cancerdata,
+      main='Bowel Cancer Occurrence by Population', ylab='Number of deaths')



   [Scatterplot titled "Bowel Cancer Occurrence by Population": Number of deaths (y-axis, 0 to 250)
   against Population (x-axis, 0 to 1200000)]

                                           Figure 2: Vanilla scatter plot

>   require(ggplot2)
>   #Plot the Number of deaths by the Population
>   p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))
>   print(p)


   [ggplot2 scatterplot: Number (y-axis, 50 to 250) against Population (x-axis, 200000 to 1200000)]

                                           Figure 3: A rather prettier plot

1.2    Generating the Funnel Plot
Doing a bit of searching for the “funnel plot” chart type used to display the data in Goldacre’s
article, I came across a post on Cross Validated, the Stack Exchange site dedicated
to statistics-related Q&A: How to draw funnel plot using ggplot2 in R? 2
    The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing
the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a few
things:

  1. work out what values to use where! I did this by looking at the ggplot code to see what
     was plotted. p was on the y-axis and should be used to present the death rate. The data
     provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the
     range 0..1. The x-axis is the population.
  2. change the range and spacing of the samples used to create the curves.
  3. change the y-axis range.

   You can see the result in the funnel plot produced by the code that follows.
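
   As a quick sanity check on the rate conversion in step 1, the normalised rate multiplied by the
population should give back roughly the number of deaths. The sketch below retypes the six rows
shown in the head(cancerdata) output earlier; the names sample_rows, p and expected are
illustrative assumptions rather than part of the original script.

```r
# The six rows shown by head(cancerdata), retyped by hand for illustration
sample_rows <- data.frame(
  Area = c("Shetland Islands", "Limavady", "Ballymoney",
           "Orkney Islands", "Larne", "Magherafelt"),
  Rate = c(19.15, 21.49, 17.05, 29.87, 27.54, 15.26),
  Population = c(31332, 32573, 35191, 36826, 39942, 45872),
  Number = c(6, 7, 6, 11, 11, 7)
)

# Normalise the 'per 100,000' rate to a proportion in the range 0..1
sample_rows$p <- sample_rows$Rate / 100000

# Proportion times population should be close to the recorded number of deaths
sample_rows$expected <- sample_rows$p * sample_rows$Population

print(sample_rows[, c("Area", "Number", "expected")])
```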




   2 http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210

>   #TH: funnel plot code from:
>   #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
>   #TH: Use our cancerdata
>   number=cancerdata$Population
>   #TH: The rate is given as a 'per 100,000' value, so normalise it
>   p=cancerdata$Rate/100000
>   p.se <- sqrt((p*(1-p)) / (number))
>   df <- data.frame(p, number, p.se, Area=cancerdata$Area)
>   ## common effect (fixed effect model)
>   p.fem <- weighted.mean(p, 1/p.se^2)
>   ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
>   #TH: I'm going to alter the spacing of the samples used to generate the curves
>   number.seq <- seq(1000, max(number), 1000)
>   number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
>   ## draw plot
>   #TH: note that we need to tweak the limits of the y-axis
>   #linetype is a fixed setting rather than a data mapping, so it goes outside aes()
>   fp <- ggplot(aes(x = number, y = p), data = df) +
+   geom_point(shape = 1) +
+   geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+   geom_hline(aes(yintercept = p.fem), data = dfCI) +
+   xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
>   #Automatically set the maximum y-axis value to be just a bit larger than the max data value
>   fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
>   #Label the outlier point
>   fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
>   print(fp)




   [Funnel plot: Bowel cancer death rate (y-axis, 0.00000 to 0.00030) against Population (x-axis,
   200000 to 1200000), with the outlying point labelled Glasgow City]
