Exploratory Analysis Part1 Coursera DataScience Specialisation

Exploratory Data Analysis
Wesley GOI

In today’s session
• Principles behind exploratory analyses
• Plotting data on popular exploratory graphs
• Plotting systems in R
  • Base (Week 1)
  • Lattice (Week 2)
  • ggplot2 (Week 2)
• Choosing and using graphics devices, aka the output formats
Scripts can be downloaded at:
https://www.dropbox.com/s/ii1yj8f650d4l1q/lesson1.r?dl=0
https://www.dropbox.com/s/eme44h6lrhn775l/final.r?dl=0

Principles behind exploratory analyses
• Show comparisons
• Show causality, mechanism, explanation
• Show multivariate data
• Integrate multiple modes of evidence
• Describe and document the evidence
• Content is king
• SPEED

Dimensionality
One dimension
• Five-number summary
• Boxplots
• Histograms
• Density plot
• Barplot
Two dimensions
• Multiple overlaid 1D plots
• Scatter plots
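
The one-dimensional summaries listed above can be tried without downloading anything. The following is a minimal sketch that assumes R's built-in faithful dataset as a stand-in; it is not part of the original session, and the variable names are illustrative only.

```r
# Stand-in data (assumption): waiting times from the built-in faithful dataset
x <- faithful$waiting

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# boxplot() computes related statistics; with plot = FALSE it only returns them.
# bp$stats holds the whisker ends, the hinges and the median.
bp <- boxplot(x, plot = FALSE)
print(bp$stats)

# Kernel density estimate, the smooth counterpart of a histogram
d <- density(x)
plot(d, main = "Density of waiting times")
```

The whisker ends in bp$stats can differ from the minimum and maximum when the data contain outliers, which is why both are printed.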

Downloading our dataset
R code
# Create a working folder and move into it
dir.create("exploring_data")
setwd("exploring_data")
# Download and unpack the example datasets
download.file("http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/therbook.zip", destfile = "data.zip")
unzip("data.zip")

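Boxplots
R code
# header = TRUE (h = T for short): the first line of the file holds the column names
weather = read.table("SilwoodWeather.txt", header = TRUE)
# Keep only the observations from January 2004
onemonth = subset(weather, month == 1 & yr == 2004)
boxplot(onemonth$rain)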

Histograms
R code
hist(weather$upper)
rug(weather$upper) # adds a tick on the axis for each value

Barplot
R code
# table() counts the observations per month; barplot() draws one bar per count
barplot(
table(weather$month),
col = "wheat",
main = "Number of Observations in Months")

Graphics devices (grDevices)
               Raster   Vector   Vector
Format         PNG      PDF      SVG
File size      small    medium   medium
Scalable       No       Yes      Yes
Web friendly   Yes      No       Yes
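
A graphics device is opened before plotting and closed with dev.off(), at which point the file is written. The following is a minimal sketch of that open, plot, close cycle; the temporary file paths and the random data are assumptions made for illustration and do not come from the session scripts.

```r
# Hypothetical output locations; tempfile() avoids cluttering the working directory
png_file <- tempfile(fileext = ".png")
pdf_file <- tempfile(fileext = ".pdf")

set.seed(1)
x <- rnorm(100)  # stand-in data (assumption)

# Raster device: size is given in pixels.
# Guarded because some R builds lack PNG support.
if (capabilities("png")) {
  png(png_file, width = 480, height = 480)
  hist(x, main = "Raster output")
  dev.off()  # closing the device writes the file
}

# Vector device: size is given in inches
pdf(pdf_file, width = 7, height = 7)
hist(x, main = "Vector output")
dev.off()
```

Forgetting dev.off() is a common reason for an empty or unreadable output file.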

Plotting Systems
                          Base         Lattice                 Grid
Libraries                 -            lattice                 grid, gridExtra, ggplot2
Example functions         hist ✔       xyplot (scatterplots)   qplot
                          barplot ✔    bwplot (boxplots)       ggplot
                          boxplot ✔    levelplot               geom
                          plot
Facetted plots            Yes          Yes                     Yes
Grammar of graphics       No           No                      Yes
Interface with            Yes          Partial                 Partial + workarounds
statistical functions
Note: base graphics cannot be mixed with the grid-based systems.

Base plots: Scatterplot
R code
data1 = read.table("scatter1.txt", h=T)

R code
#Color: draw the points in red
with(data1, plot(xv, ys, col="red"))
#Regression line: fit ys against xv and add the fitted line to the plot
with(data1, abline(lm(ys~xv)))

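Set the symbol that represents each data point

R code
#data2: a second data frame with columns xv2 and ys2, assumed to be loaded already
#col sets the color, pch sets the symbol shape
with(data2,
points(xv2, ys2, col="blue",
pch=11))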

Base plots: Using par for multiple plots
R code
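#One row, two columns of plots
par(mfrow=c(1,2))
#Plot1
with(data1, plot(xv, ys, col="red"))
#Plot2
with(data2,
plot(xv2, ys2, col="blue",
pch=11))
#outer=TRUE puts the title in the outer margin, so an outer margin (oma) must be set
title("My Title", outer=TRUE)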

par: to set global settings
R code
par(
mar=c(5.1,4.1,4.1,2.1), # inner margins in lines: bottom, left, top, right
oma=c(2,2,2,2) # outer margins in lines
)
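
Because par() changes settings globally, a common pattern is to keep the old values and restore them once the figure is done. The following is a minimal sketch of that pattern; the random data and the titles are assumptions made for illustration.

```r
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50)  # stand-in data (assumption)

# par() returns the previous values of the settings it changes
old <- par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))

plot(x, y, col = "red", main = "Scatter")
hist(x, col = "wheat", main = "Histogram")

# outer = TRUE places the title in the outer margin reserved by oma
title("Two panels, one title", outer = TRUE)

# Restore the earlier layout and margins
par(old)
```

After par(old), later plots go back to a single panel with the default margins.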

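Lattice
R code
productivity = read.table("productivity.txt",h=T)
# number of species in forest against differing productivity
library(lattice)
#plotting
#first argument: the formula, vertical ~ horizontal (here y = species, x = productivity)
#second argument: the data frame holding those columns
xyplot( y~x, productivity,
xlab=list(label="Productivity"),
ylab=list(label="Mammal Species"))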

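Lattice
R code
productivity = read.table("productivity.txt",h=T)
# number of species in forest against differing productivity
library(lattice)
#plotting
#formula: vertical ~ horizontal | given, which draws one panel per level of f
#second argument: the data frame holding those columns
xyplot( y~x | f, productivity,
xlab=list(label="Productivity"),
ylab=list(label="Mammal Species"))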

ggplot2
• Grammar of graphics (gg)
• Based on the grid plotting system, so it cannot be mixed with base graphics
ggplot2.org

ggplot
Components
• Data & relationship
• GEOMetric Object
• Statistical transformation
• Scales
• Coordinate system
• Facetting

ggplot
[Figure: anatomy of a ggplot. Labelled parts: geometric objects (aka geoms); coordinate system with respect to scales (log scale / sqrt / log ratio); title; plot; theme; etc.]

ggplot
Components
• Data & relationship ✔
• GEOMetric Object
• Statistical transformation
• Scales
• Facetting
R code
library(ggplot2)
# Remember to change month into a factor
weather$month = factor(weather$month)
# First argument: the data.frame
# aes(): the aesthetics function, which maps the relationships
ggplot(weather, aes(x=month, y=upper))+
geom_boxplot()

ggplot
Components
• GEOMetric Object ✔
• Statistical transformation✔
• Scales
• Facetting
R code
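library(dplyr) # provides %>%, group_by() and summarise()
# Statistical transformation: mean of upper for each month
weather2 = weather %>%
group_by(month) %>%
summarise(average.upper = mean(upper))
# stat="identity": bar heights are the values already in the data
ggplot(weather2, aes(month, average.upper))+
geom_bar(stat="identity")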

ggplot
Components
• GEOMetric Object ✔
• Statistical transformation✔
• Scales✔
• Facetting
R code
plot2 = ggplot(weather2,
aes(month, average.upper))+
geom_bar(aes(fill=month),stat="identity")+
scale_fill_brewer(palette="Set3")+
xlab("Months")+
ylab("Upper Quantile")+theme_bw()

qplot
A separate function that wraps ggplot to give a simpler syntax
R code
qplot(month, upper, fill=month, data=weather, facets = ~yr, geom="bar",
stat="identity")
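
Recent ggplot2 releases mark qplot() as deprecated, so the call above may produce warnings or fail depending on the installed version. The following is a hedged sketch of the same kind of plot written with ggplot() directly. The toy data frame is hypothetical and only mimics the month, yr and upper columns of the weather data.

```r
library(ggplot2)

# Hypothetical stand-in for the weather data: one value per month and year
toy <- data.frame(
  yr    = rep(c(2003, 2004), each = 3),
  month = factor(rep(1:3, times = 2)),
  upper = c(6.1, 7.4, 10.2, 5.8, 8.0, 11.5)
)

# geom_col() draws bars whose heights are the y values,
# which is what geom = "bar" with stat = "identity" asks for
p <- ggplot(toy, aes(x = month, y = upper, fill = month)) +
  geom_col() +
  facet_wrap(~ yr)  # one panel per year, like facets = ~yr
print(p)
```

With the real weather data there are many rows per month, so the bars stack the daily values unless the data are summarised first.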

Ethos behind visualization
http://keylines.com/network-visualization

Final Challenge
R code
library(ggplot2)
library(dplyr) # provides %>%, group_by() and summarise()
#Reads in data
data = read.csv("final.csv")
#Preparing for the rectangle background
areas = unique(subset(data, select=c(Planning_Area, Planning_Region)))
areas = areas[order(areas$Planning_Region),]
areas$rectid = 1:nrow(areas)
#One rectangle per region, spanning its first to last planning area
rectdata = areas %>%
group_by(Planning_Region) %>%
summarise(xstart=min(rectid)-0.5, xend=max(rectid)+0.5)
#Order the levels (areas is already sorted by region)
data$Planning_Area = factor(data$Planning_Area,
levels=as.character(areas$Planning_Area))

Final challenge
R code
#Plot
p0 =
ggplot(data, aes(Planning_Area, Unit_Price____psm_))+
geom_boxplot(outlier.colour=NA)+
#Shaded background rectangle for each planning region
geom_rect(data=rectdata,
aes(xmin=xstart, xmax=xend, ymin=-Inf, ymax=Inf,
fill=Planning_Region, group=Planning_Region),
alpha=0.4, inherit.aes=FALSE)+
#Individual observations, colored by year
geom_jitter(alpha=0.40, aes(color=as.factor(Year)))+
scale_color_brewer("Year", palette='RdBu')+
scale_fill_brewer(palette="Set1", name='Region')+
theme_minimal()+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))+
xlab("Planning Area")+ylab("Unit Price (PSM)")
#Save plot
ggsave("areaboxplots.pdf", plot=p0, width=20, height=10, units="in", dpi=300)

“Above all else show the data.”
― Edward R. Tufte, The Visual Display of Quantitative Information
Thank you for your time
