Exploratory Data Analysis 
Wesley GOI
In today’s session 
• Principles behind exploratory analyses 
• Plotting data out on to popular exploratory graphs 
• Plotting Systems in R 
• Base (Week1) 
• Lattice (Week2) 
• GGPLOT2 (Week2) 
• Choosing and using Graphic Devices aka the output formats 
Scripts can be downloaded at: 
https://www.dropbox.com/s/ii1yj8f650d4l1q/lesson1.r?dl=0 
https://www.dropbox.com/s/eme44h6lrhn775l/final.r?dl=0
Principles behind exploratory analyses 
• Show comparisons 
• Show causality, mechanism, explanation 
• Show multivariate data 
• Integrate multiple modes of evidence 
• Describe and document the evidence 
• Content is king 
• SPEED
Dimensionality 
• Five-number summary 
• Boxplots 
• Histograms 
• Density plot 
• Barplot 
Multiple-overlayed 1D plots 
Scatter plots
Downloading our dataset 
R code 
dir.create("exploring_data") 
setwd(“exploring_data”) 
download.file(“http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/therbook.zip",dest="data.zip") 
unzip(“data.zip”)
R code 
Boxplots 
weather = read.table("SilwoodWeather.txt",h=T) 
onemonth = subset(weather, 
month==1 & yr == 2004) 
boxplot(onemonth$rain) 
Header = T
Histograms 
R code 
hist(weather$upper) 
rug(weather$upper) ticks for each value
Barplot 
R code 
Barplot( 
table(weather$month), 
col = "wheat", 
main = "Number of Observations in 
Months”)
Raster Vector 
PNG PDF SVG 
grDevices 
Filesize small medium medium 
Scalable No Yes Yes 
Web friendly Yes No Yes
Plotting Systems 
Plotting Systems 
Base Lattice Grid 
Libraries lattice grid, gridExtras 
ggplot2 
Example 
functions 
hist✔ 
barplot✔ 
boxplot✔ 
Plot 
xyplot (scatterplots) 
bwplot (boxplots) 
levelplot 
qplot 
ggplot 
geom 
Facetted plots Yes Yes Yes 
Grammar of 
NO No Yes 
graphics 
Interface with 
statistical 
functions 
Yes Partial Partial + 
Workarounds 
Cannot 
be mixed
Base plots: Scatterplot 
R code 
data1 = read.table("scatter1.txt", h=T) 
data2 = read.table("scatter2.txt", h=T)
Base plots: Scatterplot 
R code 
data1 = read.table("scatter1.txt", h=T) 
data2 = read.table("scatter2.txt", h=T) 
#Color 
with(data1, plot(xv, ys, col="red")) 
#Regression Line 
with(data1, abline(lm(ys~xv))) 
Color
Base plots: Scatterplot 
Set symbol to represent data point
Base plots: Scatterplot 
R code 
data1 = read.table("scatter1.txt", h=T) 
data2 = read.table("scatter2.txt", h=T) 
#Color 
with(data1, plot(xv, ys, col="red")) 
with(data1, abline(lm(ys~xv))) 
#shape 
with(data2, 
points(xv2, ys2, col="blue", 
pch =11)) 
Symbol shape
Base plots: Scatterplot 
R code 
data1 = read.table("scatter1.txt", h=T) 
data2 = read.table("scatter2.txt", h=T) 
#Color 
with(data1, plot(xv, ys, col="red")) 
with(data1, abline(lm(ys~xv))) 
#shape 
with(data2, 
points(xv2, ys2, col="blue", 
pch =11)) 
Symbol shape
Base plots: Using par for multiple plots 
R code 
par(mfrow=c(1,2)) 
with(data1, plot(xv, ys, col="red")) 
with(data1, abline(lm(ys~xv))) 
#Plot2 
with(data2, 
plot(xv2, ys2, col="blue", 
pch =11)) 
title(“My Title", outer=TRUE)
Par: To set global settings 
R code 
mfrow( 
mar=c(5.1,4.1,4.1,2.1), 
oma=c(2,2,2,2) 
)
Lattice 
productivity = read.table("productivity.txt",h=T) 
# of species in forest against differing productivity 
library(lattice) 
#plotting 
xyplot( x~y, productivity, 
xlab=list(label="Productivity"), 
ylab=list(label="Mammal Species")) 
R code 
Formular 
Data frame
Lattice 
productivity = read.table("productivity.txt",h=T) 
# of species in forest against differing productivity 
library(lattice) 
#plotting 
xyplot( x~y, productivity, 
xlab=list(label="Productivity"), 
ylab=list(label="Mammal Species")) 
xyplot( x~y | f, productivity, 
xlab=list(label="Productivity"), 
ylab=list(label="Mammal Species")) 
R code 
Formular 
Data frame 
given
ggplot2 
• Grammar of graphics (gg) 
• Based on GRID plotting system, cannot be 
mixed with base 
ggplot2.org
ggplot 
Components 
• Data & relationship 
• GEOMetric Object 
• Statistical transformation 
• Scales 
• Coordinate system 
• Facetting
ggplot 
Data
ggplot 
Mapping
ggplot 
Geometric objects 
aka 
Geoms 
Coordinate system 
wrt 
scales 
Log scale / sqrt / log ratio 
Title 
Plot 
Theme 
etc
ggplot 
Geometric objects 
aka 
Geoms
ggplot 
Components 
• Data & relationship ✔ 
• GEOMetric Object 
• Statistical transformation 
• Scales 
• Coordinate system 
• Facetting 
R code 
Rmbr to change 
month into a 
factor 
data.frame 
Aesthetics function which maps the relationships 
ggplot(weather, aes(x=month, y=upper))+ 
geom_boxplot()
ggplot 
Components 
• Data & relationship ✔ 
• GEOMetric Object ✔ 
• Statistical transformation✔ 
• Scales 
• Coordinate system 
• Facetting 
R code 
weather2 = weather %>% 
group_by(month) %>% 
summarise(average.upper = mean(upper)) 
ggplot(weather2, aes(month, average.upper))+ 
geom_bar(stat="identity")
ggplot 
Components 
• Data & relationship ✔ 
• GEOMetric Object ✔ 
• Statistical transformation✔ 
• Scales 
• Coordinate system 
• Facetting 
R code 
weather2 = weather %>% 
group_by(month) %>% 
summarise(average.upper = mean(upper)) 
ggplot(weather2, aes(month, average.upper))+ 
geom_bar(stat="identity")
ggplot 
Components 
• Data & relationship ✔ 
• GEOMetric Object ✔ 
• Statistical transformation✔ 
• Scales✔ 
• Coordinate system 
• Facetting 
R code 
plot2 = ggplot(weather2, 
aes(month, average.upper))+ 
geom_bar(aes(fill=month),stat="identity")+ 
scale_fill_brewer(palette="Set3")+ 
xlab("Months")+ 
ylab("Upper Quantile")+theme_bw()
ggplot 
Components 
• Data & relationship ✔ 
• GEOMetric Object ✔ 
• Statistical transformation✔ 
• Scales✔ 
• Coordinate system 
• Facetting 
R code 
plot2 = ggplot(weather2, 
aes(month, average.upper))+ 
geom_bar(aes(fill=month),stat="identity")+ 
scale_fill_brewer(palette="Set3")+ 
xlab("Months")+ 
ylab("Upper Quantile")+theme_bw()
ggplot
qplot 
A separate function which wraps ggplot, for simpler syntax 
R code 
qplot(month, upper, fill=month, data=weather, facets = ~yr, geom="bar", 
stat="identity")
Ethos behind visualization 
http://keylines.com/network-visualization
Final Challenge
Final Challenge 
R code 
library(ggplot2) 
#Reads in data 
data = read.csv("final.csv") 
#Preparing for the rectangle background 
areas=unique(subset(data, select=c(Planning_Area,Planning_Region))) 
areas=areas[order(areas$Planning_Region),] 
areas$rectid=1:nrow(areas) 
rectdata = areas %>% group_by(Planning_Region) %>% summarise(xstart=min(rectid)- 
0.5,xend= max(rectid)+0.5) 
#Order the levels 
data$Planning_Area=factor(data$Planning_Area, 
levels=as.character(areas[order(areas$Planning_Region),]$Planning_Area))
Final challenge 
#Plot 
p0 = 
ggplot(data, aes(Planning_Area, Unit_Price____psm_))+ 
geom_boxplot(outlier.colour=NA)+ 
geom_rect(data=rectdata,aes(xmin=xstart,xmax=xend,ymin = -Inf, ymax = Inf, fill = 
Planning_Region,group=Planning_Region), alpha = 0.4,inherit.aes=F)+ 
geom_jitter(alpha=0.40, aes(color=as.factor(Year)))+ 
scale_color_brewer("Year", palette='RdBu')+ 
scale_fill_brewer(palette="Set1",name='Region')+ 
theme_minimal()+ 
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))+ 
xlab("Planning Area")+ylab("Unit Price (PSM)") 
R code 
#Save plot 
ggsave(p0, file="areaboxplots.pdf",w=20,h=10,units="in",dpi=300)
“Above all else show the data.” 
― Edward R. Tufte, The Visual Display of Quantitative Information 
Thank you for your time
gridExtras

Exploratory Analysis Part1 Coursera DataScience Specialisation

  • 1.
  • 2.
    In today’s session • Principles behind exploratory analyses • Plotting data out on to popular exploratory graphs • Plotting Systems in R • Base (Week1) • Lattice (Week2) • GGPLOT2 (Week2) • Choosing and using Graphic Devices aka the output formats Scripts can be downloaded at: https://www.dropbox.com/s/ii1yj8f650d4l1q/lesson1.r?dl=0 https://www.dropbox.com/s/eme44h6lrhn775l/final.r?dl=0
  • 3.
    Principles behind exploratoryanalyses • Show comparisons • Show causality, mechanism, explanation • Show multivariate data • Integrate multiple modes of evidence • Describe and document the evidence • Content is king • SPEED
  • 4.
    Dimensionality • Five-numbersummary • Boxplots • Histograms • Density plot • Barplot Multiple-overlayed 1D plots Scatter plots
  • 5.
    Downloading our dataset R code dir.create("exploring_data") setwd(“exploring_data”) download.file(“http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/therbook.zip",dest="data.zip") unzip(“data.zip”)
  • 6.
    R code Boxplots weather = read.table("SilwoodWeather.txt",h=T) onemonth = subset(weather, month==1 & yr == 2004) boxplot(onemonth$rain) Header = T
  • 7.
    Histograms R code hist(weather$upper) rug(weather$upper) ticks for each value
  • 8.
    Barplot R code Barplot( table(weather$month), col = "wheat", main = "Number of Observations in Months”)
  • 9.
    Raster Vector PNGPDF SVG grDevices Filesize small medium medium Scalable No Yes Yes Web friendly Yes No Yes
  • 10.
    Plotting Systems PlottingSystems Base Lattice Grid Libraries lattice grid, gridExtras ggplot2 Example functions hist✔ barplot✔ boxplot✔ Plot xyplot (scatterplots) bwplot (boxplots) levelplot qplot ggplot geom Facetted plots Yes Yes Yes Grammar of NO No Yes graphics Interface with statistical functions Yes Partial Partial + Workarounds Cannot be mixed
  • 11.
    Base plots: Scatterplot R code data1 = read.table("scatter1.txt", h=T) data2 = read.table("scatter2.txt", h=T)
  • 12.
    Base plots: Scatterplot R code data1 = read.table("scatter1.txt", h=T) data2 = read.table("scatter2.txt", h=T) #Color with(data1, plot(xv, ys, col="red")) #Regression Line with(data1, abline(lm(ys~xv))) Color
  • 13.
    Base plots: Scatterplot Set symbol to represent data point
  • 14.
    Base plots: Scatterplot R code data1 = read.table("scatter1.txt", h=T) data2 = read.table("scatter2.txt", h=T) #Color with(data1, plot(xv, ys, col="red")) with(data1, abline(lm(ys~xv))) #shape with(data2, points(xv2, ys2, col="blue", pch =11)) Symbol shape
  • 15.
    Base plots: Scatterplot R code data1 = read.table("scatter1.txt", h=T) data2 = read.table("scatter2.txt", h=T) #Color with(data1, plot(xv, ys, col="red")) with(data1, abline(lm(ys~xv))) #shape with(data2, points(xv2, ys2, col="blue", pch =11)) Symbol shape
  • 16.
    Base plots: Usingpar for multiple plots R code par(mfrow=c(1,2)) with(data1, plot(xv, ys, col="red")) with(data1, abline(lm(ys~xv))) #Plot2 with(data2, plot(xv2, ys2, col="blue", pch =11)) title(“My Title", outer=TRUE)
  • 17.
    Par: To setglobal settings R code mfrow( mar=c(5.1,4.1,4.1,2.1), oma=c(2,2,2,2) )
  • 18.
    Lattice productivity =read.table("productivity.txt",h=T) # of species in forest against differing productivity library(lattice) #plotting xyplot( x~y, productivity, xlab=list(label="Productivity"), ylab=list(label="Mammal Species")) R code Formular Data frame
  • 20.
    Lattice productivity =read.table("productivity.txt",h=T) # of species in forest against differing productivity library(lattice) #plotting xyplot( x~y, productivity, xlab=list(label="Productivity"), ylab=list(label="Mammal Species")) xyplot( x~y | f, productivity, xlab=list(label="Productivity"), ylab=list(label="Mammal Species")) R code Formular Data frame given
  • 22.
    ggplot2 • Grammarof graphics (gg) • Based on GRID plotting system, cannot be mixed with base ggplot2.org
  • 23.
    ggplot Components •Data & relationship • GEOMetric Object • Statistical transformation • Scales • Coordinate system • Facetting
  • 24.
  • 25.
  • 26.
    ggplot Geometric objects aka Geoms Coordinate system wrt scales Log scale / sqrt / log ratio Title Plot Theme etc
  • 27.
  • 28.
    ggplot Components •Data & relationship ✔ • GEOMetric Object • Statistical transformation • Scales • Coordinate system • Facetting R code Rmbr to change month into a factor data.frame Aesthetics function which maps the relationships ggplot(weather, aes(x=month, y=upper))+ geom_boxplot()
  • 29.
    ggplot Components •Data & relationship ✔ • GEOMetric Object ✔ • Statistical transformation✔ • Scales • Coordinate system • Facetting R code weather2 = weather %>% group_by(month) %>% summarise(average.upper = mean(upper)) ggplot(weather2, aes(month, average.upper))+ geom_bar(stat="identity")
  • 30.
    ggplot Components •Data & relationship ✔ • GEOMetric Object ✔ • Statistical transformation✔ • Scales • Coordinate system • Facetting R code weather2 = weather %>% group_by(month) %>% summarise(average.upper = mean(upper)) ggplot(weather2, aes(month, average.upper))+ geom_bar(stat="identity")
  • 31.
    ggplot Components •Data & relationship ✔ • GEOMetric Object ✔ • Statistical transformation✔ • Scales✔ • Coordinate system • Facetting R code plot2 = ggplot(weather2, aes(month, average.upper))+ geom_bar(aes(fill=month),stat="identity")+ scale_fill_brewer(palette="Set3")+ xlab("Months")+ ylab("Upper Quantile")+theme_bw()
  • 32.
    ggplot Components •Data & relationship ✔ • GEOMetric Object ✔ • Statistical transformation✔ • Scales✔ • Coordinate system • Facetting R code plot2 = ggplot(weather2, aes(month, average.upper))+ geom_bar(aes(fill=month),stat="identity")+ scale_fill_brewer(palette="Set3")+ xlab("Months")+ ylab("Upper Quantile")+theme_bw()
  • 33.
  • 34.
    qplot A separatefunction which wraps ggplot, for simpler syntax R code qplot(month, upper, fill=month, data=weather, facets = ~yr, geom="bar", stat="identity")
  • 35.
    Ethos behind visualization http://keylines.com/network-visualization
  • 36.
  • 37.
    Final Challenge Rcode library(ggplot2) #Reads in data data = read.csv("final.csv") #Preparing for the rectangle background areas=unique(subset(data, select=c(Planning_Area,Planning_Region))) areas=areas[order(areas$Planning_Region),] areas$rectid=1:nrow(areas) rectdata = areas %>% group_by(Planning_Region) %>% summarise(xstart=min(rectid)- 0.5,xend= max(rectid)+0.5) #Order the levels data$Planning_Area=factor(data$Planning_Area, levels=as.character(areas[order(areas$Planning_Region),]$Planning_Area))
  • 38.
    Final challenge #Plot p0 = ggplot(data, aes(Planning_Area, Unit_Price____psm_))+ geom_boxplot(outlier.colour=NA)+ geom_rect(data=rectdata,aes(xmin=xstart,xmax=xend,ymin = -Inf, ymax = Inf, fill = Planning_Region,group=Planning_Region), alpha = 0.4,inherit.aes=F)+ geom_jitter(alpha=0.40, aes(color=as.factor(Year)))+ scale_color_brewer("Year", palette='RdBu')+ scale_fill_brewer(palette="Set1",name='Region')+ theme_minimal()+ theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))+ xlab("Planning Area")+ylab("Unit Price (PSM)") R code #Save plot ggsave(p0, file="areaboxplots.pdf",w=20,h=10,units="in",dpi=300)
  • 39.
    “Above all elseshow the data.” ― Edward R. Tufte, The Visual Display of Quantitative Information Thank you for your time
  • 40.

Editor's Notes

  • #3 In this course we will be learning how to
  • #4 In this course we will be learning how to
  • #5 In this course we will be learning how to
  • #6 In this course we will be learning how to
  • #9 barplot(table(weather$month), col = "wheat", main = "Number of Observations in Months")
  • #12 In this course we will be learning how to
  • #13 In this course we will be learning how to
  • #14 In this course we will be learning how to
  • #15 In this course we will be learning how to
  • #16 In this course we will be learning how to
  • #17 In this course we will be learning how to title("My Title", outer=TRUE)
  • #18 In this course we will be learning how to
  • #29 ggplot(weather, aes(month, upper))+ geom_boxplot()
  • #30 ggplot(weather, aes(month, upper))+ geom_boxplot()
  • #31 ggplot(weather, aes(month, upper))+ geom_boxplot()
  • #32 ggplot(weather, aes(month, upper))+ geom_boxplot()
  • #33 ggplot(weather, aes(month, upper))+ geom_boxplot()
  • #34 ggplot(weather, aes(month, upper))+ geom_boxplot()
  • #38 In this course we will be learning how to
  • #39 In this course we will be learning how to