The good, the bad & the pretty
Spatial data analysis with R
Robert Hijmans
University of California, Davis
May 2013
Spatial is special
• Complex: geometry and attributes
• Earth is flat? Map projections
• Size: lots and lots of it, multivariate, time series
• Special plots: maps
• First Law of Geography: nearby things are similar
– Statistical assumptions: violated
– Interpolation: possible
Don't we have GIS for that?
GIS*                          R
Visual interaction            Data & model focused
Data management               Analysis
Geometric operations          Attributes as important
Standard workflows            Creativity & innovation
Single map production         Many (simpler) maps
Click, click, click & click   Repeatability (single script)
Speed of execution            Speed of development
Cumbersome                    Easy & powerful (& free)
* there are many different GISs and they evolve
Geometry of spatial objects (‘vector’)
points, lines, polygons
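[Figure: points, lines and polygons drawn on X and Y axes]
Geometry of spatial field (grid / raster data)
[Figure: grid of 6 rows (row 1 to row 6) and 5 columns (col 1 to col 5); cells numbered 1 to 30, row by row from the top-left; corners labelled (Xmin, Ymax) and (Xmax, Ymin); dimensions labelled dimX and dimY]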
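The cell numbering of the grid above (6 rows, 5 columns, numbered row by row from the top-left) comes down to simple arithmetic. A minimal base-R sketch; the helper names `cell_from_rowcol` and `rowcol_from_cell` are hypothetical and used only for illustration (the raster package provides `cellFromRowCol` and `rowColFromCell` for the same purpose).

```r
# Grid dimensions taken from the slide: 6 rows, 5 columns
nrows <- 6
ncols <- 5

# Hypothetical helpers: cells are numbered row-wise, starting at the top-left
cell_from_rowcol <- function(row, col, ncols) (row - 1) * ncols + col
rowcol_from_cell <- function(cell, ncols) {
  c(row = (cell - 1) %/% ncols + 1, col = (cell - 1) %% ncols + 1)
}

first_in_last_row <- cell_from_rowcol(6, 1, ncols)     # first cell of row 6
last_cell <- cell_from_rowcol(nrows, ncols, ncols)     # bottom-right cell
rc <- rowcol_from_cell(7, ncols)                       # row and column of cell 7
```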
MODIS, 22 May, 2013
Representing spatial data
sp classes:
SpatialPointsDataFrame
SpatialLinesDataFrame
SpatialPolygonsDataFrame
SpatialGridDataFrame
SpatialPixelsDataFrame
rgdal
read/write of object (vector) and raster data,
(shapefiles, geotiff)
> library(rgdal)
> city <- readOGR('d:/data', 'city')
> elev <- readGDAL('d:/data/elevation.tif')
Map projections
coordinate reference system
Class: CRS
proj4string(city) <- CRS('+proj=lonlat +datum=WGS84')
cityutm <- spTransform(city, CRS('+proj=utm +zone=51'))
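To give a feel for what a projection does to longitude/latitude coordinates, here is a sketch of the forward spherical Mercator formulas in base R. It only illustrates the idea and is not how the transformation is carried out in practice (that work is delegated to the PROJ library); the function name `to_mercator`, the sphere radius of 6378137 m and the example coordinates are assumptions.

```r
# Forward spherical Mercator: degrees of longitude/latitude -> metres
# (hypothetical helper; radius is an assumed sphere of 6378137 m)
to_mercator <- function(lon, lat, r = 6378137) {
  lambda <- lon * pi / 180
  phi <- lat * pi / 180
  cbind(x = r * lambda, y = r * log(tan(pi / 4 + phi / 2)))
}

# Origin, a point on the antimeridian, and a point roughly at Davis, CA
xy <- to_mercator(c(0, 180, -121.74), c(0, 0, 38.54))
```

Distances measured on such planar coordinates are increasingly distorted away from the equator, which is one reason to pick a projection suited to the study area.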
Types of spatial analysis*
• Query and reasoning
Where is? How much is this here? How to get from A to B?
• Measurement
Area, Distance, Length, Slope
• Transformation
Buffering, overlay, interpolation
• Exploration and description
clusters, trends, spatial dependence, fragmentation
• Optimization
Site selection, re-districting, traveling salesman
• Inference
Samples from a population, problem of spatial autocorrelation
• Modeling
Climate change effects, impact of nuclear accident, dispersal
* After Michael Goodchild: http://www.csiss.org/aboutus/presentations/files/goodchild_qmss_oct02.pdf
Spatial statistics
• Point pattern analysis
• Geostatistics (kriging)
• Inference (hypothesis testing)
Point patterns
1. Location of points is of prime interest
2. Points are not a sample
3. Points are within a defined study area
4. Points should be true incidents (not centroids)
Point patterns
> library(spatstat); library(maptools)
> cityOwin <- as(city, "owin")
> pts <- coordinates(crime)
> p <- ppp(pts[,1], pts[,2], window=cityOwin)
> s <- density(p)   # kernel-smoothed intensity (p has no marks to smooth)
> e <- envelope(p)  # simulation envelope, K function by default
http://www.spatstat.org/
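envelope() compares a summary function of the observed pattern with simulations of complete spatial randomness (CSR). The same idea can be sketched in base R with a simpler statistic, the mean nearest-neighbour distance. Everything below is an assumption made for illustration: the unit-square window, the simulated clustered "observed" points, and the helper name `mean_nn`.

```r
set.seed(1)

# Mean distance from each point to its nearest neighbour
mean_nn <- function(xy) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf
  mean(apply(d, 1, min))
}

# Hypothetical "observed" pattern: 100 points clustered around 5 centres
centres <- matrix(runif(10), ncol = 2)
obs <- centres[sample(5, 100, replace = TRUE), ] +
  matrix(rnorm(200, sd = 0.02), ncol = 2)

# 99 simulations of CSR with the same number of points in the unit square
sim <- replicate(99, mean_nn(matrix(runif(200), ncol = 2)))

obs_nn <- mean_nn(obs)
env <- range(sim)              # simulation envelope of the statistic
clustered <- obs_nn < env[1]   # below the envelope suggests clustering
```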
Geostatistics
> library(gstat)
> data(meuse)
> coordinates(meuse) <- ~x+y
> spplot(meuse, 'zinc')
1. Measurements are of prime interest (not locations)
2. Points are a sample
3. Unbiased estimates for locations that were not sampled
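> data(meuse.grid); gridded(meuse.grid) <- ~x+y
> v <- variogram(log(zinc)~1, meuse)            # empirical variogram
> m <- fit.variogram(v, vgm(1, "Sph", 300, 1))  # fitted variogram model
> x <- krige(log(zinc)~1, meuse, meuse.grid, model = m)
> spplot(x["var1.pred"], main = "ordinary kriging predictions")
> spplot(x["var1.var"], main = "ordinary kriging variance")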
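Kriging needs a variogram model (the `model` argument of krige() above), which is normally fitted to an empirical semivariogram. The sketch below computes such an empirical semivariogram in base R to show what is being estimated; the simulated data, the bin width and all variable names are assumptions, and gstat's own estimator offers many more options.

```r
set.seed(42)

# Hypothetical sample: 150 locations with a smooth spatial trend plus noise
n <- 150
x <- runif(n)
y <- runif(n)
z <- sin(3 * x) + cos(3 * y) + rnorm(n, sd = 0.1)

# All pairwise distances and semivariances 0.5 * (z_i - z_j)^2
h <- as.vector(dist(cbind(x, y)))
g <- 0.5 * as.vector(dist(z))^2

# Average the semivariance within distance bins up to 0.5
breaks <- seq(0, 0.5, by = 0.05)
bin <- cut(h, breaks)
gamma_hat <- tapply(g, bin, mean)
```

Semivariance that rises with distance is the signature of spatial dependence: nearby measurements are more alike than distant ones.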
Regression with spatial data
> f1 <- houseValue ~ age + nBedrooms
> m <- lm(f1, data=hh)
> summary(m)
Call:
lm(formula = f1, data = hh)
Residuals:
    Min      1Q  Median      3Q     Max
-222541  -67489   -6128   60509  217655
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -628578     233217  -2.695  0.00931 **
age            12695       2480   5.119 4.05e-06 ***
nBedrooms     191889      76756   2.500  0.01543 *
Analyse the model residuals for spatial autocorrelation (SA), e.g. with Moran's I
> library(spdep)
> cb <- poly2nb(ca)
> lw <- nb2listw(cb)
> plot(ca)
> plot(lw, coordinates(ca), add=TRUE, col="red")
> moran.test(residuals, lw)
Moran's I test under randomisation
Moran I statistic standard deviate = 2.6926, p-value = 0.003545
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
      0.158977893      -0.010101010       0.003943149
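What moran.test() reports can be reproduced by hand. The sketch below computes Moran's I and its expectation under no autocorrelation, -1/(n - 1), in base R. The 10 x 10 lattice, the rook neighbours and the simulated residuals are assumptions for illustration. With n = 100 the expectation equals the -0.010101 printed above, which implies that the example also had 100 regions.

```r
set.seed(7)

# Hypothetical 10 x 10 lattice of residuals with a left-to-right gradient
n_side <- 10
rc <- expand.grid(row = 1:n_side, col = 1:n_side)
res <- rc$col + rnorm(nrow(rc))

# Rook neighbours: cells that share an edge (Manhattan distance of 1)
W <- as.matrix(dist(rc, method = "manhattan")) == 1
W <- W / rowSums(W)   # row-standardise, like nb2listw(style = "W")

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with S0 = sum of weights
z <- res - mean(res)
n <- length(z)
I <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
EI <- -1 / (n - 1)    # expectation under no spatial autocorrelation
```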
If SA ‘significant’ then you could
• Re-specify your model
• Permit the coefficients, β, to vary spatially (GWR)
• Modify the regression model to incorporate the SA
• Proceed and ignore SA?
OLS: Y = Xβ + e
Autoregressive model: Y = ρWY + e
Simultaneous Autoregressive Models:
SAR-lag: Y = ρWY + Xβ + e
(endogenous, inherent spatial autocorrelation, diffusion)
SAR-err: Y = Xβ + u, u = λWu + e
(exogenous, induced spatial autocorrelation)
SAR-mix: Y = ρWY + Xβ + WXγ + e
Conditional Autoregressive Models (CAR)
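The SAR-lag equation above can be solved for Y, giving Y = (I - ρW)^-1 (Xβ + e), which makes it easy to simulate data from the model. A minimal base-R sketch; the chain-shaped neighbourhood, the parameter values and the variable names are assumptions for illustration. Fitting such models is what functions like lagsarlm() and errorsarlm() are for.

```r
set.seed(3)

# Hypothetical neighbours: 50 regions on a line, each linked to the next one
n <- 50
W <- matrix(0, n, n)
W[cbind(1:(n - 1), 2:n)] <- 1
W <- W + t(W)
W <- W / rowSums(W)          # row-standardised weights

rho <- 0.6                   # assumed spatial lag parameter
beta <- c(1, 2)              # assumed regression coefficients
X <- cbind(1, rnorm(n))
e <- rnorm(n, sd = 0.5)

# SAR-lag: Y = rho W Y + X beta + e  =>  Y = (I - rho W)^-1 (X beta + e)
Y <- solve(diag(n) - rho * W, X %*% beta + e)

# Check that the simulated Y satisfies the structural equation
resid_check <- max(abs(Y - (rho * W %*% Y + X %*% beta + e)))
```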
raster package
• new classes (‘S4’) for raster data
• no file size restrictions
• file formats: gdal, ncdf, ‘native’
• > 200 functions
RasterLayer
> library(raster)
> x <- raster(ncol=10, nrow=5)
> x <- raster('volcano.tif')
> x
class       : RasterLayer
dimensions  : 87, 61, 5307 (nrow, ncol, ncell)
resolution  : 10, 10 (x, y)
extent      : 2667400, 2668010, 6478700, 6479570 (xmin, xmax, …
coord. ref. : +proj=nzmg +lat_0=-41 +lon_0=173 +x_0=251
values      : d:\data\volcano.tif
min value   : 94
max value   : 195
RasterLayer
> str(x)
Formal class 'RasterLayer' [package "raster"] with 16 slots
  ..@ file      :Formal class '.RasterFile' [package "raster"] with 9 slots
  .. .. ..@ name    : chr "d:\data\volcano.tif"
  .. .. ..@ driver  : chr "gdal"
  ..@ data      :Formal class '.SingleLayerData' [package "raster"] with 11 slots
  .. .. ..@ values  : logi(0)
  .. .. ..@ inmemory: logi FALSE
  .. .. ..@ min     : num 94
  .. .. ..@ max     : num 195
  ..@ extent    :Formal class 'Extent' [package "raster"] with 4 slots
  .. .. ..@ xmin: num 2667400
  .. .. ..@ xmax: num 2668010
  ..@ rotation  :Formal class '.Rotation' [package "raster"] with 2 slots
  .. .. ..@ geotrans: num(0)
  .. .. ..@ transfun:function ()
  ..@ ncols     : int 61
  ..@ nrows     : int 87
  ..@ crs       :Formal class 'CRS' [package "sp"] with 1 slots
  .. .. ..@ projargs: chr " +proj=nzmg +lat_0=-41 +lon_0=173 +x_0=2510000 +y_0=6023150
  ..@ layernames: chr "volcano"
Multiple layers
RasterStack: layers can come from many files
RasterBrick: all layers come from a single file
> s <- stack(x, x*2, sqrt(x))
>
> s
class : RasterStack
dimensions : 87, 61, 5307, 3 (nrow, ncol, ncell, nlayers)
resolution : 0.01639344, 0.01149425 (x, y)
extent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
coord. ref. : NA
min values : 94.0, 188.0, 9.7
max values : 195, 390, 14
layer names : layer.1, layer.2, layer.3
[Figure: map of daily rainfall; legend classes: 0, 1 – 10, 11 – 25, 26 – 50, 51 – 100, > 100]
Some functions
“Low level”
ncell(x)
xyFromCell(x, 10)
getValues(x, row)
adjacent(x, 10)
writeRaster(x, filename, …)
“High level”
merge, crop, projectRaster, aggregate,
reclassify, resample,
rasterize, distance, focal …
Raster algebra
r <- raster(ncol=10, nrow=10)
values(r) <- 1:ncell(r)
q <- sqrt(r)
x <- (q + r) * 2
s <- stack(r, q, x)
ss <- s * r
Modeling bigfoot
(after Hickerson et al., 2008)
data from: http://www.bfro.net/news/google_earth.asp
> elev <- getData('worldclim', var='alt', res=2.5)
> usa1 <- getData('GADM', country='USA', level=1)
> ca <- usa1[usa1$NAME_1 == 'California', ]
> bio <- getData('worldclim', var='bio', res=5)
> library(dismo)
> bg <- sampleRandom(bio, ext=extent(ca), size=1000)
> obs <- extract(bio, bigfoot)
> alt <- crop(elev, ca)
> alt <- mask(alt, ca)
> plot(alt)
> points(bigfoot)
> d <- data.frame(pa=c(rep(1, nrow(obs)), rep(0, nrow(bg))), rbind(obs, bg))
> library(randomForest)
> rf <- randomForest(pa~., data=d)
> pred <- predict(bio, rf)
> plot(pred)
> plot(ca, add=TRUE)
> points(sel2, col='blue', pch=20)
[Figure: map of the predicted likelihood of occurrence]
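The presence/background layout used above (a 0/1 response stacked on top of the predictor values) works with any classifier or regression method. As a dependency-free illustration, the sketch below swaps in a logistic regression fitted with glm() in place of randomForest. The predictor names `temp` and `prec`, their simulated values and the new locations are all made up; they are not the WorldClim variables or the bigfoot data.

```r
set.seed(10)

# Hypothetical predictors at 200 presence and 1000 background locations
obs <- data.frame(temp = rnorm(200, mean = 12, sd = 2),
                  prec = rnorm(200, mean = 900, sd = 150))
bg <- data.frame(temp = rnorm(1000, mean = 16, sd = 4),
                 prec = rnorm(1000, mean = 500, sd = 250))

# Same layout as above: pa = 1 for presence, 0 for background
d <- data.frame(pa = c(rep(1, nrow(obs)), rep(0, nrow(bg))), rbind(obs, bg))

# Logistic regression swapped in for randomForest (base R only)
m <- glm(pa ~ temp + prec, data = d, family = binomial)

# Predictions for two new sets of conditions, on the 0-1 scale
new <- data.frame(temp = c(11, 20), prec = c(950, 300))
p <- predict(m, newdata = new, type = "response")
```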
Visualization
plot
plotRGB
contour
plot3D
…
> library(rasterVis)
> plot(s, addfun=function() plot(esp, add=TRUE))
> library(rasterVis)
> alt <- getData('worldclim', var='alt', res=2.5)
> usa1 <- getData('GADM', country='USA', level=1)
> ca <- usa1[usa1$NAME_1 == 'California', ]
> alt <- crop(alt, extent(ca)+ 0.5)
> alt <- mask(alt, ca)
> levelplot(alt, par.settings=GrTheme)
http://www.revolutionanalytics.com/news-events/free-webinars/2012/ggplot2-with-hadley-wickham/
http://spatialanalysis.co.uk/2012/02/great-maps-ggplot2/
> library(dismo)
> g <- gmap('Mountain View, CA')
> plot(g, interpolate=T)
> xy <- geocode("2600 Casey Ave, Mountain View, CA")
> points(Mercator(xy[,2:3]), col='red', pch='*', cex=5)
http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
> library(geosphere)
> inter <- gcIntermediate(lonlat1, lonlat2, n=100)
> lines(inter, col=colors, lwd=lwd)
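The length of such a great-circle path can be computed with the haversine formula. A base-R sketch on a sphere; the function name `haversine` and the radius of 6378137 m are assumptions (geosphere offers distHaversine() for this).

```r
# Great-circle distance on a sphere (haversine formula); result in metres
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# A quarter of the equator: from (0, 0) to (90, 0)
quarter <- haversine(0, 0, 90, 0)
```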
More info
http://cran.r-project.org/web/views/Spatial.html
