Processing Geospatial Data At Scale @locationtech

PROCESSING GEOSPATIAL
DATA @ SCALE
Rob Emanuele
GEO(MESA/WAVE/TRELLIS/JINNI)

What we’ll be covering…
What does “processing geospatial data at scale”
mean?
Background on big data frameworks
What is LocationTech?
Overview of LocationTech projects for
processing big geo data.

PROCESSING GEOSPATIAL DATA
@ SCALE

Large geospatial data
Landsat 8 on AWS: 465,68 scenes @ ~800 MB
each. That's 355 TB and counting.
OpenStreetMap edit history: 75 GB
compressed.
3 years of geotagged tweets: 3 TB

Nutch: a project to build a better search engine,
started in the early 2000s.
Worked for small datasets, but was not scalable.

After reading Google's papers on the Google File
System and MapReduce, Nutch developers added a
distributed file system and a MapReduce model to Nutch.
In 2006, those portions were spun out of Nutch
to form…

Apache Hadoop
Heavily supported by Yahoo, which moved its
large data processing to Hadoop.
By 2007, Twitter, Facebook, LinkedIn and many
others were doing serious work with Hadoop.
In 2008, Hadoop graduated to a top-level Apache
project.

Hadoop
Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
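
The word-count diagram referenced above is the standard introduction to MapReduce. A minimal single-process sketch of the three phases in plain Python follows. It does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Run map over every line, then group by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["deer bear river", "car car river", "deer car bear"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; only the shape of the computation is the same here.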

Matei Zaharia
Worked with Hadoop at UC Berkeley
Noticed Hadoop was not a good fit for
machine learning algorithms and other
iterative models.
So in 2009, he created…

Apache Spark
Open sourced in 2010 under a BSD license
Maintained by UC Berkeley’s AMPLab
Donated to the Apache Software Foundation in
2013 and relicensed as Apache 2.0
Graduated to a top-level Apache project in 2014

Apache Spark
A distributed computation engine.
An API that lets you work with distributed data
as a collection.
Written in Scala, with language bindings for use
with Java, Python, and R.

Apache Accumulo
Created by the NSA in 2008
Donated to the Apache Software Foundation in 2011
Graduated to a top-level project in 2012
Almost defunded by the US government the
same year.

(Sec. 929) Prohibits any DOD component from utilizing the
cloud computing database developed by the National Security
Agency (NSA) and known as "Accumulo" after the end of
FY2013, unless the DOD CIO certifies that: (1) there are no
viable commercial open source databases that have such security
features, or (2) Accumulo itself has become a successful open
source database project. Requires DOD and intelligence
community officials to coordinate the use by DOD components
of cloud computing infrastructure and services offered by the
intelligence community for purposes other than intelligence
analysis.

Accumulo
[Diagram: HDFS Name Node and three Data Nodes; Accumulo Master and three Tablet Servers]
BigTable clone (columnar database)
Records stored on HDFS
Lexicographically sorted table index
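
Because keys are kept in lexicographic order, a range scan amounts to a binary search for the start key followed by a sequential read. A rough sketch of that idea in plain Python, using the standard `bisect` module rather than any Accumulo API; the `SortedTable` class and the row keys are hypothetical.

```python
import bisect


class SortedTable:
    """Toy stand-in for a lexicographically sorted key/value table."""

    def __init__(self):
        self._keys = []    # row keys, always kept sorted
        self._values = {}  # row key -> value

    def put(self, key, value):
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for key in ["tile_0010", "tile_0002", "tile_0100", "tile_0003", "user_42"]:
    table.put(key, f"value for {key}")

rows = table.scan("tile_0002", "tile_0011")
print([key for key, _ in rows])
```

The zero-padding in the keys matters: with plain string comparison, `tile_10` would sort before `tile_2`. Designing keys so that useful ranges are contiguous is the central problem the geospatial indexes later in this deck address.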

GEOJINNI
(FORMERLY SPATIALHADOOP)

Spatial language
Built-in spatial data types
Spatial indexes
Spatial operations

HEAT MAP FROM 2009 TO 2014, MONTH BY MONTH
72 frames × 14 billion points per frame
Total ≈ 1 trillion points
Generated in three hours on a 10-node cluster

SELECT
tweet.text,
user.name
FROM
tweet, user
WHERE
bbox(tweet.location, -115, 45, -110, 50) AND
tweet.user_id = user.user_id
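
To make the query's semantics concrete, here is a small plain-Python sketch of the same filter and join. It assumes that `bbox` takes its arguments as (minimum x, minimum y, maximum x, maximum y) with x as longitude and y as latitude, which the slide does not state, and the sample records are invented.

```python
def bbox(location, min_x, min_y, max_x, max_y):
    """True if an (x, y) point lies inside the box, edges included."""
    x, y = location
    return min_x <= x <= max_x and min_y <= y <= max_y


# Invented sample data; location is (longitude, latitude).
users = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]
tweets = [
    {"user_id": 1, "text": "hello from Montana", "location": (-112.0, 46.6)},
    {"user_id": 2, "text": "hello from Philly", "location": (-75.2, 39.9)},
    {"user_id": 2, "text": "road trip", "location": (-111.0, 45.7)},
]

# Index users by id so the join is a dictionary lookup.
users_by_id = {u["user_id"]: u for u in users}

results = [
    (t["text"], users_by_id[t["user_id"]]["name"])
    for t in tweets
    if bbox(t["location"], -115, 45, -110, 50) and t["user_id"] in users_by_id
]
print(results)
```

This version checks every tweet against the box. A spatial index exists to avoid that full scan by visiting only the partitions that can intersect the query box.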

GeoTrellis
A Scala library for geospatial data types and
operations.
Adds geospatial capabilities to Spark (mainly
raster; vector support is in progress).
Stores and queries rasters on HDFS,
Accumulo, and S3 (Cassandra support in
development).
0.10 is released!

Rendering elevation with hillshade + NLCD on AWS EMR
100 spot instance m3.xlarge workers @ $0.04 / hr = $4.00 / hr
400 CPUs / ≈1.5 TB memory
1 master m3.xlarge on-demand instance @ $0.26 / hr
EMR cluster charge, $0.07 / hr
$4.37 / hr
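
The cluster figures above can be checked with a few lines of arithmetic. The per-instance specs for m3.xlarge (4 vCPUs, 15 GiB of memory) are AWS's published numbers rather than something stated on the slide. Prices are handled in whole cents to avoid floating-point rounding.

```python
WORKERS = 100
VCPUS_PER_INSTANCE = 4        # m3.xlarge, per AWS specs (not from the slide)
MEMORY_GIB_PER_INSTANCE = 15  # m3.xlarge, per AWS specs (not from the slide)

SPOT_CENTS_PER_WORKER = 4     # $0.04 / hr per spot worker
MASTER_CENTS = 26             # $0.26 / hr on-demand master
EMR_CENTS = 7                 # $0.07 / hr, as listed on the slide

total_vcpus = WORKERS * VCPUS_PER_INSTANCE
total_memory_gib = WORKERS * MEMORY_GIB_PER_INSTANCE
worker_cents = WORKERS * SPOT_CENTS_PER_WORKER
listed_total_cents = worker_cents + MASTER_CENTS + EMR_CENTS

print(f"{total_vcpus} vCPUs, {total_memory_gib} GiB memory")
print(f"workers: ${worker_cents / 100:.2f} / hr")
print(f"sum of listed rates: ${listed_total_cents / 100:.2f} / hr")
```

The three listed rates sum to $4.33 / hr, slightly below the $4.37 / hr shown on the slide; the source does not say what accounts for the difference.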

Geo +
accessed through
GEOWAVE

Index type
GeoMesa: Z-order spatial & spatiotemporal, binned by week
GeoWave: Hilbert in N dimensions, with tiered indexing and binning
Backends supported
GeoMesa: Accumulo (main), Cassandra, HBase, DynamoDB, Google Cloud Bigtable
GeoWave: Accumulo (main), HBase
Servers supported
GeoMesa: GeoServer
GeoWave: GeoServer, Mapnik
Processing frameworks supported
GeoMesa: Hadoop, Spark, Storm, Kafka
GeoWave: Hadoop, Spark
Language
GeoMesa: Scala
GeoWave: Java
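
A Z-order (Morton) index interleaves the bits of each dimension so that a single sortable integer stands for a 2D cell. A minimal sketch, assuming a hypothetical grid of 16 bits per axis over longitude and latitude. It illustrates the general technique only and is not the key layout of either project.

```python
BITS = 16  # assumed resolution per axis, chosen for illustration


def normalize(value, lo, hi):
    """Scale value in [lo, hi] to an integer cell index in [0, 2**BITS - 1]."""
    cells = (1 << BITS) - 1
    return min(cells, int((value - lo) / (hi - lo) * (1 << BITS)))


def interleave(x, y):
    """Interleave bits: x occupies even bit positions, y the odd ones."""
    z = 0
    for i in range(BITS):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_index(lon, lat):
    """Map a longitude/latitude pair to a single Z-order integer."""
    return interleave(normalize(lon, -180.0, 180.0), normalize(lat, -90.0, 90.0))


print(z_index(-75.16, 39.95))
```

Points that are close in space usually share their high-order bits and so tend to sit near each other in a sorted table, which lets a bounding-box query be turned into a small number of key-range scans.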

Binning (time dimension)
1997 1998 1999
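
Binning the time dimension splits an unbounded axis into fixed periods, so each record can be indexed by a (bin, offset within bin) pair. A sketch using yearly bins to match the 1997/1998/1999 labels on the slide; the function name and the choice of seconds for the offset are assumptions.

```python
from datetime import datetime, timezone


def year_bin(ts):
    """Split a timezone-aware UTC timestamp into (year, seconds into that year)."""
    start = datetime(ts.year, 1, 1, tzinfo=timezone.utc)
    return ts.year, int((ts - start).total_seconds())


bin_id, offset = year_bin(datetime(1998, 1, 2, 0, 0, 30, tzinfo=timezone.utc))
print(bin_id, offset)
```

The offset is bounded by the length of one bin, so it can be given a fixed number of bits in an index key, while the bin identifier can grow without limit.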

Binning (arbitrary dimensions)
Time
Elevation
Velocity

• Working together to learn to collaborate
• Making the connections that allow
collaboration to flourish

Get involved!
• Join the locationtech-iwg mailing list
• Share your big geospatial data challenges
• Propose projects

THANK YOU
@lossyrob
gitter.im/geotrellis/geotrellis
github.com/geotrellis/geotrellis
remanuele@azavea.com
