PROCESSING GEOSPATIAL
DATA @ SCALE
Rob Emanuele
What we’ll be covering…
Background on geospatial concepts	

What is LocationTech?	

Background on big data frameworks	

Overview of LocationTech projects for
processing big geo data.
Geospatial Data
Core of GIS (Geographic information system)	

Raster (images, weather data)	

Vector (points of interest, country boundaries)
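To make the raster/vector distinction concrete, here is a minimal sketch in plain Python. The dictionary layout, the origin, the cell size, and the helper `cell_at` are all illustrative assumptions, not the format of any particular GIS library.

```python
# A raster is a grid of cell values plus enough information to place the
# grid on the earth (here a hypothetical top-left origin and cell size).
raster = {
    "origin": (-75.5, 40.5),   # (x, y) of the top-left corner, assumed lon/lat degrees
    "cell_size": 0.5,          # assumed width and height of one cell, in degrees
    "cells": [
        [12.0, 13.5, 14.1],
        [11.2, 12.8, 13.9],
    ],
}

# Vector data is geometry built from coordinates, usually with attributes.
point = {"type": "Point", "coordinates": (-75.16, 39.95), "name": "example point"}
polygon = {
    "type": "Polygon",
    "coordinates": [(-76, 39), (-74, 39), (-74, 41), (-76, 41), (-76, 39)],
}

def cell_at(raster, x, y):
    """Return the raster value under map coordinate (x, y), or None if outside the grid."""
    ox, oy = raster["origin"]
    size = raster["cell_size"]
    col = int((x - ox) // size)   # columns grow to the east
    row = int((oy - y) // size)   # rows grow to the south
    cells = raster["cells"]
    if 0 <= row < len(cells) and 0 <= col < len(cells[0]):
        return cells[row][col]
    return None

value_under_point = cell_at(raster, *point["coordinates"])
```

Looking up the raster value under a vector point, as in the last line, is the simplest case of combining the two data types.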
Raster Data
Vector Data (Points)
Vector Data (Lines)
Vector Data (Polygons)
Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/
Vector Data
Contains
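A "contains" test asks whether a point lies inside a polygon. One common way to answer it is ray casting, sketched below for a simple polygon without holes. The function name and the sample square are assumptions for illustration; points exactly on the boundary are not handled specially.

```python
def contains(polygon, point):
    """Ray casting: count how many polygon edges a ray going right from the point crosses.

    `polygon` is a list of (x, y) vertices; the last vertex connects back to the first.
    An odd number of crossings means the point is inside.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside_square = contains(square, (2, 2))
outside_square = contains(square, (5, 2))
```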
Heatmap (Kernel Density)
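A kernel density heatmap turns a set of points into a raster by summing a smooth "bump" around every point. The sketch below uses an unnormalised Gaussian kernel evaluated at cell centres; the grid size, bandwidth, and sample points are assumptions chosen only to keep the example small.

```python
import math

def kernel_density(points, width, height, bandwidth):
    """Return a height x width grid where each cell sums a Gaussian kernel over all points."""
    grid = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Evaluate the density at the centre of the cell.
            cx, cy = col + 0.5, row + 0.5
            for px, py in points:
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                grid[row][col] += math.exp(-d2 / (2 * bandwidth ** 2))
    return grid

heat = kernel_density([(1.5, 1.5), (1.5, 2.5)], width=4, height=4, bandwidth=1.0)
```

Cells close to the two points receive the largest values; this brute-force loop is quadratic in practice and only meant to show the idea.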
Zonal Statistics
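Zonal statistics summarise raster values per zone, for example the mean temperature per county. In this minimal sketch the zones are given as a second grid of zone ids aligned with the value grid; that representation, the function name, and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def zonal_mean(values, zones):
    """Mean of `values` cells grouped by the zone id at the same position in `zones`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None:          # cell belongs to no zone
                continue
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}

values = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
zones = [["a", "a", "b"],
         ["a", None, "b"]]

means = zonal_mean(values, zones)
```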
Feature Extraction (Image Segmentation)
Source: http://www.professeurs.polymtl.ca/christopher.pal/
Large geospatial data
Landsat 8 on AWS: 311,405 scenes @ ~800 MB
each. That's roughly 250 TB and counting.

OpenStreetMap: planet.osm is 617 GB.	

3 years of geotagged tweets: 3 TB
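The Landsat figure can be checked with a quick calculation, assuming decimal units (1 TB = 1,000,000 MB) and the approximate per-scene size quoted above.

```python
scenes = 311_405
mb_per_scene = 800                        # approximate size per scene, from the slide
total_mb = scenes * mb_per_scene
total_tb = total_mb / 1_000_000           # assumes decimal units: 1 TB = 1,000,000 MB
```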
WHAT IS LOCATIONTECH?
Nutch: a project to build a better search engine, back in
the early 2000s.

Worked for small datasets, but was not scalable.
The Google papers
After reading the papers, Nutch developers
added a distributed file system and MapReduce
model to Nutch.	

In 2006, those portions were spun out of Nutch
to form…
Hadoop
Matei Zaharia
Worked with Hadoop at UC Berkeley
Noticed Hadoop was not a good fit for
Machine Learning algorithms and other
iterative models.	

So in 2009, he created…
Spark
Apache Accumulo
Created by the NSA in 2008	

Donated to the Apache Foundation in 2011	

Graduated to a top level project in 2012	

Almost defunded by the US government the
same year.
(Sec. 929) Prohibits any DOD component from utilizing the
cloud computing database developed by the National Security
Agency (NSA) and known as "Accumulo" after the end of
FY2013, unless the DOD CIO certifies that: (1) there are no
viable commercial open source databases that have such security
features, or (2) Accumulo itself has become a successful open
source database project. Requires DOD and intelligence
community officials to coordinate the use by DOD components
of cloud computing infrastructure and services offered by the
intelligence community for purposes other than intelligence
analysis.
[Diagram: HDFS (one Name Node, three Data Nodes) and Accumulo (one Master, three Tablet Servers)]
Accumulo
BigTable clone (columnar database)
Records stored on HDFS	

Lexicographically sorted table index
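A lexicographically sorted index means a range of keys can be read with one contiguous scan. The toy class below is an in-memory stand-in for that idea, not Accumulo's API; the class, its methods, and the row keys are all hypothetical.

```python
import bisect

class SortedTable:
    """Toy in-memory table whose rows are kept sorted by string key."""

    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value            # overwrite an existing row
        else:
            self._keys.insert(i, key)          # keep the key list sorted
            self._values.insert(i, value)

    def scan(self, start, end):
        """Return rows with start <= key < end as one contiguous slice."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))

table = SortedTable()
for key in ["road_0100", "road_0020", "tweet_0003", "road_0003", "tweet_0001"]:
    table.put(key, key.upper())

road_rows = table.scan("road_", "tweet_")
```

Because the order is lexicographic rather than numeric ("10" sorts before "9"), the hypothetical keys are zero-padded. Designing keys so that nearby things sort near each other is what makes such an index useful for spatial data.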
GEOJINNI	

(FORMERLY SPATIALHADOOP)
Spatial language
Built-in spatial data types
Spatial Indexes
Spatial Operations
R-TREE INDEX OF A 400 GB ROAD NETWORK
72 Frames × 14 Billion points per frame

Total = 1 Trillion points
Generated in three hours on a 10-node cluster
HEAT MAP FROM 2009 TO 2014 MONTH-BY-MONTH
Geo +
accessed through
SELECT tweet.text, user.name
FROM tweet, user
WHERE bbox(tweet.location, -115, 45, -110, 50) AND
      tweet.user_id = user.user_id
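The query above combines a spatial filter with an ordinary join. The sketch below spells out the same logic in plain Python, assuming the `bbox` arguments are (min x, min y, max x, max y) and that locations are (longitude, latitude) pairs; the sample tweets and users are invented for illustration.

```python
def bbox(location, xmin, ymin, xmax, ymax):
    """True if the (x, y) location falls inside the bounding box (argument order assumed)."""
    x, y = location
    return xmin <= x <= xmax and ymin <= y <= ymax

# Hypothetical sample data standing in for the tweet and user tables.
tweets = [
    {"user_id": 1, "text": "inside the box", "location": (-112.0, 46.6)},
    {"user_id": 2, "text": "outside the box", "location": (-75.16, 39.95)},
]
users = {1: "alice", 2: "bob"}   # user_id -> name

results = [
    (tweet["text"], users[tweet["user_id"]])      # join on user_id
    for tweet in tweets
    if bbox(tweet["location"], -115, 45, -110, 50)
]
```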
GeoTrellis
A Scala library for geospatial data types and
operations.

Gives Spark geospatial capabilities (raster
now, vector soon!).

Stores and queries rasters on HDFS,
Accumulo, and S3.
Zonal Summaries
Benchmark Results
439.5 GB of monthly temperature model output data
USA temperature yearly average, 2006 to 2100
40 m3.xlarge instances
(estimated $2.00 USD per hour on spot market)
GEOWAVE
Geo +
accessed through
Three dimensional Z-order curve
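A Z-order (Morton) curve maps multi-dimensional coordinates to a single sortable number by interleaving their bits, so points that are close in space tend to get nearby index values. The sketch below shows the three-dimensional case for non-negative integer coordinates; it is a generic illustration, not GeoWave's implementation, and the function name and bit width are assumptions.

```python
def interleave3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y and z into one Z-order (Morton) index.

    Bit i of x lands at position 3*i, bit i of y at 3*i + 1, bit i of z at 3*i + 2.
    """
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (3 * i)
        index |= ((y >> i) & 1) << (3 * i + 1)
        index |= ((z >> i) & 1) << (3 * i + 2)
    return index

# The eight corners of the unit cube map to the indexes 0 through 7.
corners = sorted(
    interleave3(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)
)
```

Used as a row key in a lexicographically sorted table, such an index lets a query over a box in three dimensions be answered with a small number of range scans.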
THANK YOU
@lossyrob	

gitter.im/geotrellis/geotrellis	

github.com/geotrellis/geotrellis	

remanuele@azavea.com

Processing Geospatial at Scale at LocationTech