LocationTech GeoMesa is a project that builds on open-source, distributed databases like Accumulo, HBase, and Cassandra to scale up indexing, querying, and analyzing billions of spatio-temporal data points. GeoMesa uses space-filling curves to index multi-dimensional data in Accumulo, and we'll discuss recent improvements for non-point geometries. Over the two and a half years GeoMesa has been an open-source project, GeoMesa's Accumulo schemas have evolved and our team has had a chance to work through creating and optimizing custom Accumulo iterators. These custom iterators allow for better query performance and interesting aggregations. GeoMesa provides support for distributed processing in Spark via MapReduce input and output formats that extend their Accumulo counterparts. We will discuss the performance benefit gained by reducing the number of default map/Spark tasks created for complex query patterns. The talk will conclude with updates about GeoMesa's integration with Jupyter notebook and improvements to GeoMesa's Spark integration.
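As a rough illustration of the space-filling-curve idea, the sketch below interleaves the bits of two discretized coordinates into a single Z-order index, so that points close in space end up with nearby indices and therefore nearby Accumulo row keys. This is a simplification with invented function names: GeoMesa's actual schemas also encode time, use configurable precision, and need extra handling for non-point geometries.

```python
def interleave_bits(dims, bits_per_dim=8):
    """Interleave the bits of several discretized dimensions into one
    Z-order (Morton) index -- a simplified sketch of the space-filling
    curves GeoMesa uses to linearize multi-dimensional data."""
    z = 0
    for bit in range(bits_per_dim - 1, -1, -1):   # most-significant bit first
        for d in dims:
            z = (z << 1) | ((d >> bit) & 1)
    return z

def discretize(value, lo, hi, bits=8):
    """Map a coordinate onto an integer cell index of a 2**bits grid."""
    cells = (1 << bits) - 1
    return round((value - lo) / (hi - lo) * cells)

# Nearby points get nearby indices, hence nearby sorted row keys.
x = discretize(-78.48, -180, 180)   # longitude (Charlottesville, VA)
y = discretize(38.03, -90, 90)      # latitude
key = interleave_bits([x, y])
```

A range query then becomes a small set of scans over contiguous key ranges, which is what makes the layout friendly to a sorted key/value store.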
– Speaker –
Dr. James Hughes
Mathematician, Commonwealth Computer Research, Inc (CCRi)
Dr. James Hughes is a mathematician at Commonwealth Computer Research, Inc. in Charlottesville, Virginia. He is a core committer for GeoMesa which leverages Accumulo and other distributed database systems to provide distributed computation and query engines. He is a LocationTech committer for GeoMesa, SFCurve, and GeoBench. He serves on the LocationTech Project Management Committee and Steering Committee. Through work with LocationTech and OSGeo projects like GeoTools and GeoServer, he works to build end-to-end solutions for big spatio-temporal problems. He holds a PhD in algebraic topology from the University of Virginia.
— More Information —
For more information see http://www.accumulosummit.com/
Accumulo Collections is a lightweight library that dramatically simplifies development of fast NoSQL applications by encapsulating many powerful, distributed features of Accumulo in the familiar Java Collections interface. Accumulo is a giant sorted map with rich server-side functionality, and our AccumuloSortedMap is a robust java SortedMap implementation that is backed by an Accumulo table. It handles serialization and foreign keys, and provides extensive server-side features like entry timeout, aggregates, filtering, efficient one-to-many mapping, partitioning and sampling. Users can define custom server-side transformations and aggregates with Accumulo iterators.
More information on this project can be found on github at: https://github.com/isentropy/accumulo-collections/wiki
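To make the entry-timeout feature concrete, here is a toy in-memory analogue. The class and method names are invented, not the library's API: in Accumulo Collections the expiry check would run server-side in an iterator during scans, while this sketch simply filters on read.

```python
import time

class TimeoutSortedMap:
    """Toy in-memory analogue of a sorted map with per-entry timeout.
    Illustrative only -- not the Accumulo Collections API."""
    def __init__(self, ttl_seconds=None, clock=time.time):
        self._data = {}          # key -> (value, insert_time)
        self._ttl = ttl_seconds
        self._clock = clock      # injectable for testing

    def put(self, key, value):
        self._data[key] = (value, self._clock())

    def _live(self, key):
        _, ts = self._data[key]
        return self._ttl is None or self._clock() - ts < self._ttl

    def get(self, key, default=None):
        if key in self._data and self._live(key):
            return self._data[key][0]
        return default

    def items(self):
        """Iterate live entries in key order, like a scan of a sorted table."""
        for key in sorted(self._data):
            if self._live(key):
                yield key, self._data[key][0]
```

The point of doing this server-side, as the library does with iterators, is that expired entries are filtered where the data lives instead of being shipped to the client first.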
– Speaker –
Jonathan Wolff
Founder, Director of Engineering, Isentropy LLC
Jonathan is a former physicist who operates a consultancy specializing in big data and data science project work. He worked for Bloomberg last year and built their Accumulo File System, which was presented as the keynote of the 2015 Accumulo Summit. He has also done distributed-computing project work for Yahoo! in Pig.
Jonathan holds a BA in Physics (Harvard, magna cum laude 2001) and an MS in Mechanical Engineering (Columbia, 2003), and has been avidly programming since the 1980s.
— More Information —
For more information see http://www.accumulosummit.com/
LocationTech is an Eclipse Foundation industry working group for location-aware technologies. This presentation introduces LocationTech and looks at what it means for our industry and the participating projects.
Libraries: JTS Topology Suite is the rocket science of GIS, providing an implementation of Geometry. Mobile Map Tools provides a C++ foundation that is translated into Java and JavaScript for maps on iOS, Android, and WebGL. GeoMesa is a distributed spatio-temporal datastore built on Accumulo. Spatial4j integrates with JTS to provide geometry on a curved surface.
Process: GeoTrellis offers real-time distributed processing using Scala, Akka, and Spark. GeoJinni mixes spatial data and indexing with Hadoop.
Applications: GEOFF offers OpenLayers 3 as an SWT component. GeoGit offers distributed revision control for feature data. GeoScript brings spatial data to Groovy, JavaScript, Python, and Scala. uDig offers an Eclipse-based desktop GIS solution.
Attend this presentation if you want to know what LocationTech is about, are interested in these projects, or are curious about what projects will be next.
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec... — Jen Aman
Kyle Foreman presented on using Spark for large-scale global health simulations. The talk discussed (1) the motivation for simulations to model disease burden forecasts and alternative scenarios, (2) SimBuilder for constructing modular simulation workflows as directed acyclic graphs, and (3) benchmarks showing Spark backends can efficiently distribute simulations across a cluster. Future work aims to optimize Spark DataFrame joins and take better advantage of Numpy's vectorization for panel data simulations.
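The fan-out pattern behind such simulations can be sketched without a cluster: each random seed becomes one independent task, and the driver aggregates the draws. The toy trajectory model and every parameter below are invented for illustration; with a Spark backend the loop in `run_draws` would be distributed, roughly `sc.parallelize(seeds).map(simulate).collect()`.

```python
import random
import statistics

def simulate(seed, years=10, baseline=100.0, trend=-0.02, noise=0.05):
    """One Monte Carlo draw of a toy disease-burden trajectory."""
    rng = random.Random(seed)
    burden = baseline
    for _ in range(years):
        burden *= 1 + trend + rng.gauss(0, noise)
    return burden

def run_draws(seeds):
    """Single-process stand-in for the distributed map over seeds."""
    return [simulate(s) for s in seeds]

draws = run_draws(range(1000))
print(statistics.mean(draws))    # central forecast
print(statistics.stdev(draws))   # spread across scenarios
```

Because draws are independent, the work is embarrassingly parallel, which is why the talk's benchmarks focus on scheduling overhead rather than algorithmic coupling.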
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache... — Accumulo Summit
D4M is a software tool that connects scientists with big data technologies like Apache Accumulo. The D4M-Accumulo binding provides high performance connectivity to Accumulo for quick analytic prototyping. Current research looks to implement GraphBLAS server-side iterators and operators on Accumulo tables to support high performance graph analytics.
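The associative-array style that D4M builds on can be illustrated with a sparse matrix product over dictionaries, the same shape of operation a GraphBLAS iterator would push to the server side. This is a plain-Python sketch, not D4M's actual syntax:

```python
def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {(row, col): value} dicts --
    the kind of associative-array operation D4M expresses over tables."""
    out = {}
    b_rows = {}                      # index b by row for fast lookup
    for (i, j), v in b.items():
        b_rows.setdefault(i, []).append((j, v))
    for (i, k), v in a.items():
        for j, w in b_rows.get(k, ()):
            out[(i, j)] = out.get((i, j), 0) + v * w
    return out

# Squaring an adjacency matrix counts 2-hop paths between vertices.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1}
two_hop = sparse_matmul(edges, edges)   # {("a", "c"): 1}
```

Expressing graph analytics as sparse linear algebra is exactly what makes a GraphBLAS-style server-side operator attractive: the heavy multiply runs where the table data lives.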
Hopsworks - ExtremeEarth Open Workshop — ExtremeEarth
This document summarizes a presentation about the three-year ExtremeEarth project. It discusses the ExtremeEarth platform architecture, which brings together Earth observation data access from DIASes, end-user products from TEPs, and scalable AI capabilities from Hopsworks. The architecture provides infrastructure on Creodias and uses Hopsworks to develop end-to-end machine learning pipelines for processing petabytes of Earth observation data. Results have been exploited through additional research projects and a product offering on Hopsworks.ai. The project has also led to several publications and blog posts about applying AI to Earth observation data.
A time energy performance analysis of map reduce on heterogeneous systems wit... — newmooxx
This paper presents a time-energy performance analysis of MapReduce workloads on heterogeneous systems with GPUs. The authors evaluate three MapReduce applications on a Hadoop-CUDA framework using a novel lazy processing technique that requires no modifications to the underlying Hadoop framework. Their results show that heterogeneous systems with GPUs can achieve similar execution times as traditional CPU-only clusters while realizing energy savings of up to two-thirds. This finding indicates that heterogeneous systems with integrated GPUs have potential for improving the energy efficiency of big data analytics.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters — Xiao Qin
An increasing number of popular applications are becoming data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy consistently improves MapReduce performance by rebalancing data across nodes before a data-intensive application runs in a heterogeneous Hadoop cluster.
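The core placement idea can be sketched in a few lines: give each node a share of file blocks proportional to its measured processing speed. This is a deliberate simplification of HDFS-HC, which also has to respect replication, rack awareness, and incremental rebalancing; the function and names below are illustrative only.

```python
def place_blocks(num_blocks, node_speeds):
    """Assign blocks to nodes in proportion to measured processing speed --
    the core idea of capacity-aware placement, heavily simplified."""
    total = sum(node_speeds.values())
    shares = {n: num_blocks * s / total for n, s in node_speeds.items()}
    placement = {n: int(share) for n, share in shares.items()}
    # Hand out blocks lost to rounding, largest fractional remainder first.
    leftover = num_blocks - sum(placement.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - int(shares[n]),
                          reverse=True)
    for n in by_remainder[:leftover]:
        placement[n] += 1
    return placement

# A node twice as fast receives twice the data, so both finish together.
print(place_blocks(90, {"fast": 2.0, "slow": 1.0}))
```

The payoff is that map tasks stay data-local on every node: the slow node simply has less local data to chew through, instead of stalling the job or forcing remote reads.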
The document proposes a system called Twiche that uses caching to improve the efficiency of incremental MapReduce jobs. Twiche indexes cached items from the map phase by their original input and applied operations. This allows it to identify duplicate computations and avoid reprocessing the same data. The experimental results show that Twiche can eliminate all duplicate tasks in incremental MapReduce jobs, reducing execution time and CPU utilization compared to traditional MapReduce.
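The caching scheme can be sketched as a store keyed by a digest of the input split plus the operations applied to it, so an identical re-run becomes a cache hit. The class structure and names here are invented for illustration, not Twiche's implementation:

```python
import hashlib
import json

class MapPhaseCache:
    """Toy sketch of caching map-phase output, keyed by input identity
    plus the applied operations, so incremental jobs skip duplicate work."""
    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(input_id, operations):
        return hashlib.sha256(
            json.dumps([input_id, operations]).encode()).hexdigest()

    def run_map(self, input_id, operations, map_fn, records):
        key = self._key(input_id, operations)
        if key in self._store:
            self.hits += 1           # duplicate computation avoided
            return self._store[key]
        result = [map_fn(r) for r in records]
        self._store[key] = result
        return result

cache = MapPhaseCache()
cache.run_map("split-0", ["lower"], str.lower, ["A", "B"])
cache.run_map("split-0", ["lower"], str.lower, ["A", "B"])  # cache hit
assert cache.hits == 1
```

Indexing by both the input and the operations matters: the same split processed by a different map function must not collide with the cached result.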
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ... — Databricks
Apache Spark 2.2 shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL count, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are inherent in many real-world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histograms helps Spark make better decisions in picking the optimal query plan for real-world scenarios.
In this talk, we'll take a deep dive into how Spark's Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for a skewed-distribution workload such as TPC-DS, we will show the histogram's impact on query-plan changes and the resulting performance gains.
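A minimal sketch of an equal-height (equi-depth) histogram and the range-predicate estimate it enables follows. It is simplified relative to Spark's catalog statistics, which also track distinct counts and nulls per bucket; the function names are illustrative.

```python
def equal_height_histogram(values, num_buckets):
    """Build an equal-height histogram: each bucket holds roughly the
    same number of rows, so heavily skewed values get narrow buckets."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [ordered[min(i * n // num_buckets, n - 1)]
              for i in range(num_buckets)]
    bounds.append(ordered[-1])
    return bounds, n / num_buckets   # boundaries, rows per bucket

def estimate_le(bounds, rows_per_bucket, x):
    """Estimate cardinality of `col <= x`: count full buckets below x
    and linearly interpolate inside the straddling bucket."""
    rows = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if x >= hi:
            rows += rows_per_bucket
        elif x > lo:
            rows += rows_per_bucket * (x - lo) / (hi - lo)
    return rows
```

Equal-height buckets are what make this robust under skew: a single hot value cannot dominate a wide bucket, so the interpolation error stays bounded per bucket.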
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod... — Xiao Qin
Hadoop and the term 'Big Data' go hand in hand. The information explosion caused by cloud and distributed computing has led to the curiosity to process and analyze massive amounts of data, and that processing and analysis helps an organization add value or derive valuable information.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Hadoop relies on its capability to take computation to the nodes rather than migrating data around the nodes, which might cause a significant network overhead. This strategy has potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous one: the time taken to process data on a slower node might be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is necessary to study a data placement policy that distributes data based on the processing power of each node. The project explores this data placement policy and notes the ramifications of the strategy by running a few benchmark applications.
This document provides an overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop was developed based on Google's MapReduce algorithm and how it uses HDFS for scalable storage and MapReduce as an execution engine. Key components of Hadoop architecture include HDFS for fault-tolerant storage across data nodes and the MapReduce programming model for parallel processing of data blocks. The document also gives examples of how MapReduce works and industries that use Hadoop for big data applications.
Dache is a data-aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework with a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
GeoMesa is an open-source project that provides scalable geospatial analytics on large datasets. It allows querying and analyzing data stored in Apache Accumulo using a geospatial index. GeoMesa implements the GeoTools API and supports point, line, polygon, raster, and time-enabled data through flexible space-filling curves. It enables distributed computation and analytics through features like multi-step query planning, secondary indexes, and integration with frameworks like Spark and streaming APIs. The project is developed and supported by a community including LocationTech.
This presentation will give you information about:
1. Configuring HDFS
2. Interacting with HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Overview and Architecture
6. HDFS Installation
7. Hadoop File System Shell
8. File System Java API
Dache: A Data Aware Caching for Big-Data using Map Reduce framework — Safir Shah
This document proposes Dache, a data-aware caching system for big data applications using the MapReduce framework. It aims to extend MapReduce by provisioning a cache layer to efficiently identify and access cached items. The proposed system identifies input sources and operations applied to cache items for proper indexing. It describes cache requests and replies for the map and reduce phases. Experimental results show the proposed system eliminates duplicate tasks, reduces execution time and CPU utilization compared to traditional MapReduce.
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 — Deanna Kosaraju
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting an optimal number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, chosen from the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
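The provisioning trade-off can be sketched as a small search: try each resource-set count, keep only those that meet the deadline, and pick the cheapest. The linear scaling assumption and the cost model below are stand-ins invented for illustration, not the framework proposed in the talk.

```python
def provision(job, deadline_hours, max_rs, cost_per_rs_hour=1.0):
    """Brute-force search for the cheapest number of resource sets (RS)
    that still meets the deadline, under a toy linear-scaling model."""
    best = None
    for rs in range(1, max_rs + 1):
        # Bottleneck work divides across RS; per-job overhead does not.
        runtime = job["serial_hours"] / rs + job["overhead_hours"]
        if runtime > deadline_hours:
            continue
        cost = rs * runtime * cost_per_rs_hour
        if best is None or cost < best[2]:
            best = (rs, runtime, cost)
    return best  # (resource sets, hours, cost), or None if infeasible

job = {"serial_hours": 100.0, "overhead_hours": 0.5}
print(provision(job, deadline_hours=6.0, max_rs=64))
```

Under this toy model the cost grows with every extra resource set (each one pays the fixed overhead), so the cheapest feasible plan is the smallest RS count that meets the deadline; a real profile with different bottlenecks per phase would change that shape.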
The document discusses how ArcGIS can be used to ingest, visualize, analyze, and share scientific data stored in formats like netCDF, HDF, and GRIB, including directly reading these files, creating multidimensional mosaics for aggregation, analyzing spatial and temporal patterns, publishing services and maps, and extending capabilities through Python tools and custom geoprocessing. ArcGIS supports the full scientific data workflow from ingesting data to sharing final results and apps on the web and with other platforms like WMS and Dapple Earth Explorer.
EDF2012 Kostas Tzoumas - Linking and analyzing big data - Stratosphere — European Data Forum
Stratosphere is a collaborative research project between universities to build an open-source platform for big data analytics. It bridges relational databases and MapReduce using a functional programming language called Meteor. The platform includes data pools, tools for data linkage and analysis, and a scalable execution engine called Nephele. Stratosphere is optimized for parallelism using its PACT programming model and optimizer. Ongoing work focuses on UDFs, caching, and advancing the MapReduce paradigm.
1) Stratosphere is a distributed data processing system that extends the MapReduce model by supporting more operators and advanced data flow graphs composed of operators.
2) It has components like a query parser, compiler, and optimizer that translate queries into execution plans composed of operators like Map, Reduce, Join, Cross, CoGroup, and Union.
3) Stratosphere supports arbitrary data flows, while MapReduce supports only the fixed map-shuffle-reduce pipeline, and Stratosphere achieves better performance through in-memory processing and pipelining, whereas MapReduce always writes intermediate results to disk.
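CoGroup, one of the operators listed above, is easy to sketch: for each key, the user function receives both groups at once, something plain MapReduce can only emulate with a tagged join. This is illustrative Python, not Stratosphere's actual API:

```python
from collections import defaultdict

def cogroup(left, right):
    """CoGroup two (key, value) datasets: for each key, collect the
    values from both sides into a pair of lists."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)

orders = [("alice", "book"), ("bob", "pen"), ("alice", "lamp")]
payments = [("alice", 30)]
print(cogroup(orders, payments))
# {'alice': (['book', 'lamp'], [30]), 'bob': (['pen'], [])}
```

Note that "bob" appears with an empty right-hand group, which is exactly the case (outer-join-like semantics) that makes CoGroup strictly more expressive than a key-equality Join.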
This document introduces MapReduce, including its architecture, advantages, frameworks for writing MapReduce programs, and an example WordCount MapReduce program. It also discusses how to compile, deploy, and run MapReduce programs using Hadoop and Eclipse.
The document outlines the anatomy of MapReduce applications including common phases like input splitting, mapping, shuffling, and reducing. It then provides high-level and low-level views of how a word counting MapReduce job works, explaining that it takes a text corpus as input, maps words to counts of 1, shuffles to reduce by word, and outputs final word counts. The map and reduce functions are explained at a high-level, and then implementation details like MapRunner, RecordReader, and OutputCollector are described at a lower level.
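The word-count anatomy described above can be sketched end to end in a few lines, as a single-process stand-in for the splitting, mapping, shuffling, and reducing phases:

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit (word, 1) for every word in the input split."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key before reduction."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

corpus = "the quick fox and the lazy dog"
counts = reduce_phase(shuffle(map_phase(corpus)))
print(counts["the"])   # 2
```

In real Hadoop, `map_phase` corresponds to the Mapper fed by a RecordReader, the shuffle is performed by the framework between nodes, and `reduce_phase` runs in the Reducer with an OutputCollector writing the results.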
Applying stratosphere for big data analytics — Avinash Pandu
Stratosphere is a next-generation data analytics platform that can perform complex operations like JOIN, CROSS, and GROUPS more efficiently than traditional MapReduce. It uses MapReduce as its basic building block but introduces optimizations that reduce computational time. Stratosphere supports a query language called Meteor and can execute analytical tasks formulated as Meteor queries using its distributed processing capabilities.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
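The partial-centroid optimization mentioned above can be sketched as follows: each mapper emits one (sum, count) pair per centroid instead of one record per point, which shrinks shuffle traffic, and the reducer merges those partial sums. One-dimensional points are used for brevity; this mirrors the idea, not Dumbo's API.

```python
def mapper_partial_sums(points, centroids):
    """Mapper: assign local points to the nearest centroid and emit only
    per-centroid (sum, count) partial aggregates."""
    partial = {}
    for x in points:
        cid = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        s, c = partial.get(cid, (0.0, 0))
        partial[cid] = (s + x, c + 1)
    return partial

def reducer_new_centroids(partials, k):
    """Reducer: merge partial sums and recompute each centroid mean."""
    totals = {i: (0.0, 0) for i in range(k)}
    for partial in partials:
        for cid, (s, c) in partial.items():
            ts, tc = totals[cid]
            totals[cid] = (ts + s, tc + c)
    return [ts / tc if tc else None for ts, tc in totals.values()]

splits = [[1.0, 2.0, 9.0], [1.5, 10.0, 11.0]]
partials = [mapper_partial_sums(split, [0.0, 10.0]) for split in splits]
print(reducer_new_centroids(partials, 2))   # [1.5, 10.0]
```

Because each mapper ships at most k small tuples regardless of how many points it saw, the shuffle cost drops from O(points) to O(mappers × k).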
This document describes a parallel and scalable approach called Big-SeqSB-Gen for generating large synthetic sequence databases. It implements Whitney enumerators to generate distinct sequences and uses a parallel sequence generator (PSG) built on Hadoop MapReduce. The PSG was tested on a French Grid5000 cluster and achieved generation of over 18 billion sequences in under 2 hours, demonstrating good scalability and throughput. Future work involves mining patterns from large real sequence datasets.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
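A minimal sketch of the fragmentation idea: hash-partition rows across backends, let each backend aggregate its fragment independently, and merge the partial results at a coordinator. The toy table and all names are illustrative, not the paper's schemes.

```python
def hash_partition(rows, key_fn, num_backends):
    """Fragment a table across backends by hashing a partitioning key."""
    parts = [[] for _ in range(num_backends)]
    for row in rows:
        parts[hash(key_fn(row)) % num_backends].append(row)
    return parts

def parallel_sum(parts, value_fn):
    """Each backend aggregates its fragment; the coordinator merges the
    partial results -- the same shape as a distributed SUM query."""
    partials = [sum(value_fn(r) for r in part) for part in parts]
    return sum(partials)

lineitem = [{"orderkey": i, "price": 10.0 + i} for i in range(1000)]
parts = hash_partition(lineitem, lambda r: r["orderkey"], num_backends=4)
total = parallel_sum(parts, lambda r: r["price"])
assert total == sum(r["price"] for r in lineitem)
```

Distributive aggregates like SUM and COUNT merge trivially; a query with joins is where the choice of partitioning key starts to dominate response time, which is what the TPC-H experiments probe.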
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Re... — Accumulo Summit
Talk Abstract
GeoWave is an open source software project developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with Booz Allen Hamilton and RadiantBlue Technologies. GeoWave leverages Accumulo’s architecture to manage petabytes of raster and vector data by serving as an enterprise level geospatial data store. To efficiently index geospatial data and answer queries with geospatial constraints, GeoWave employs a space filling curve to form bidirectional mappings between multi-dimensional data and Accumulo’s sorted row identifiers. As a complete offering, GeoWave provides a plug-in to the Open Source Geospatial Foundation’s GeoServer platform, enabling management of geospatial data and associated attributes through Open Geospatial Consortium (OGC) standard services, and MapReduce input/output formats to support scalable post-processing and analysis of geospatial data.
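The bidirectional mapping can be sketched in both directions: encoding interleaves the bits of discretized coordinates into a sorted row ID, and decoding inverts it so that ranges of rows can be mapped back to spatial regions during query planning. This is a two-dimensional simplification with invented names; GeoWave's actual keys carry more dimensions and tiering metadata.

```python
def interleave(x, y, bits=8):
    """Encode two grid coordinates into one Z-order row ID."""
    z = 0
    for b in range(bits - 1, -1, -1):
        z = (z << 2) | (((x >> b) & 1) << 1) | ((y >> b) & 1)
    return z

def deinterleave(z, bits=8):
    """The inverse direction of the bidirectional mapping: recover the
    grid cell from a row ID."""
    x = y = 0
    for b in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * b + 1)) & 1)
        y = (y << 1) | ((z >> (2 * b)) & 1)
    return x, y

assert deinterleave(interleave(37, 201)) == (37, 201)
```

Having an exact inverse is what makes the index usable for query planning: a bounding box is translated into a small set of row-ID ranges, and candidate rows scanned from those ranges can be decoded and precisely filtered.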
Speakers
Eric Robertson
Lead Technologist, Booz Allen Hamilton
Eric Robertson is a Data Scientist at Booz Allen Hamilton and has over twenty years of experience in software development across many diverse vertical domains including telecommunication, pharmaceuticals, finance, economics and defense. Eric has extensive experience in designing and developing identity correlation systems using graph analytics. Eric holds a M.S. in Computer Science from University of Maryland Baltimore County. Eric's current interests include machine learning and linear programming.
Rich Fecher
Senior Software Engineer, RadiantBlue
Over the past 10 years, Rich Fecher has been solving the hard technical challenges that face the U.S. Defense and Intelligence Communities. Rich has extensive expertise in architecting and building end-to-end systems. His experience ranges from visualization to distributed computing, and he has primarily focused his career toward enriching geospatial content and delivery. Rich holds a M.S. in Computer Science from George Mason University; he received his post-graduate certificate in GIS from Pennsylvania State University, and received a B.S. in Computer Science with minors in Applied Math and Physics from the University of Virginia.
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ... — Accumulo Summit
Speaker: Aaron Cordova
Most users of Accumulo start developing applications on a single machine and want to scale up to four orders of magnitude more machines without having to rewrite them. In this talk we describe techniques for designing applications for scale, planning a large-scale cluster, tuning the cluster for high-speed ingest, dealing with a large amount of data over time, and unique features of Accumulo for taking advantage of up to ten thousand nodes in a single instance. We also include the largest public metrics gathered on Accumulo clusters to date and a discussion of overcoming practical limits to scaling in the future.
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
Apache Spark 2.2 shipped with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are often inherent in many real world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histogram helps Spark make better decisions in picking the most optimal query plan for real world scenarios.
In this talk, we’ll take a deep dive into how Spark’s Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for skewed distribution workload such as TPC-DS, we will show histogram’s impact on query plan change, hence leading to performance gain.
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
Hadoop and the term 'Big Data' go hand in hand. The information explosion brought on by cloud and distributed computing has driven the desire to process and analyze massive amounts of data, which helps organizations add value and derive useful information.
The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Hadoop relies on its ability to take computation to the nodes rather than migrating data among nodes, which could cause significant network overhead. This strategy has clear benefits in a homogeneous environment, but it may not be suitable in a heterogeneous one: the time taken to process data on a slower node can be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is worth studying a data placement policy that distributes data based on the processing power of each node. The project explores this data placement policy and notes the ramifications of the strategy by running a few benchmark applications.
This document provides an overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop was developed based on Google's MapReduce algorithm and how it uses HDFS for scalable storage and MapReduce as an execution engine. Key components of Hadoop architecture include HDFS for fault-tolerant storage across data nodes and the MapReduce programming model for parallel processing of data blocks. The document also gives examples of how MapReduce works and industries that use Hadoop for big data applications.
Dache is a data-aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework by provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
GeoMesa is an open-source project that provides scalable geospatial analytics on large datasets. It allows querying and analyzing data stored in Apache Accumulo using a geospatial index. GeoMesa implements the GeoTools API and supports point, line, polygon, raster, and time-enabled data through flexible space-filling curves. It enables distributed computation and analytics through features like multi-step query planning, secondary indexes, and integration with frameworks like Spark and streaming APIs. The project is developed and supported by a community including LocationTech.
This presentation will give you information about:
HDFS Overview and Architecture
1. Configuring HDFS
2. Interacting With HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Installation
6. Hadoop File System Shell
7. File System Java API
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkSafir Shah
This document proposes Dache, a data-aware caching system for big data applications using the MapReduce framework. It aims to extend MapReduce by provisioning a cache layer to efficiently identify and access cached items. The proposed system identifies input sources and operations applied to cache items for proper indexing. It describes cache requests and replies for the map and reduce phases. Experimental results show the proposed system eliminates duplicate tasks, reduces execution time and CPU utilization compared to traditional MapReduce.
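The cache-identification idea can be sketched as follows. This is a toy illustration of keying cached results by input source plus the chain of applied operations; the names and structure are assumptions, not Dache's actual protocol:

```python
# Toy sketch of a data-aware cache: a cache item is keyed by its input
# source plus the operations applied to it, so a later job that reuses
# the same (input, operations) pair can skip the duplicate map task.

cache = {}

def cache_key(input_split: str, operations: tuple) -> tuple:
    return (input_split, operations)

def run_map_task(input_split, operations, compute):
    key = cache_key(input_split, operations)
    if key in cache:                 # cache hit: duplicate task skipped
        return cache[key], True
    result = compute(input_split)    # cache miss: do the work, store it
    cache[key] = result
    return result, False

word_count = lambda text: {w: text.split().count(w) for w in set(text.split())}

r1, hit1 = run_map_task("split-0: a b a", ("tokenize", "count"), word_count)
r2, hit2 = run_map_task("split-0: a b a", ("tokenize", "count"), word_count)
assert not hit1 and hit2 and r1 == r2   # identical second task hits the cache
```

Recording the operation chain in the key is what makes the cache "data aware": the same input split processed by a different map function must not reuse the cached result.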
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run on cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimum number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs to pick a different combination of these parameters based on the job profile such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
The document discusses how ArcGIS can be used to ingest, visualize, analyze, and share scientific data stored in formats like netCDF, HDF, and GRIB, including directly reading these files, creating multidimensional mosaics for aggregation, analyzing spatial and temporal patterns, publishing services and maps, and extending capabilities through Python tools and custom geoprocessing. ArcGIS supports the full scientific data workflow from ingesting data to sharing final results and apps on the web and with other platforms like WMS and Dapple Earth Explorer.
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - StratosphereEuropean Data Forum
Stratosphere is a collaborative research project between universities to build an open-source platform for big data analytics. It bridges relational databases and MapReduce using a functional programming language called Meteor. The platform includes data pools, tools for data linkage and analysis, and a scalable execution engine called Nephele. Stratosphere is optimized for parallelism using its PACT programming model and optimizer. Ongoing work focuses on UDFs, caching, and advancing the MapReduce paradigm.
1) Stratosphere is a distributed data processing system that extends the MapReduce model by supporting more operators and advanced data flow graphs composed of operators.
2) It has components like a query parser, compiler, and optimizer that translate queries into execution plans composed of operators like Map, Reduce, Join, Cross, CoGroup, and Union.
3) Stratosphere supports arbitrary data flows while MapReduce only supports MapReduce, and Stratosphere has better performance through in-memory processing and pipelining compared to MapReduce which always writes to disk.
This document introduces MapReduce, including its architecture, advantages, frameworks for writing MapReduce programs, and an example WordCount MapReduce program. It also discusses how to compile, deploy, and run MapReduce programs using Hadoop and Eclipse.
The document outlines the anatomy of MapReduce applications, including common phases like input splitting, mapping, shuffling, and reducing. It then provides high-level and low-level views of how a word-counting MapReduce job works, explaining that it takes a text corpus as input, maps each word to a count of 1, shuffles to group counts by word, and outputs final word counts. The map and reduce functions are explained at a high level, and then implementation details like MapRunner, RecordReader, and OutputCollector are described at a lower level.
Applying stratosphere for big data analyticsAvinash Pandu
Stratosphere is a next-generation data analytics platform that can perform complex operations like JOIN, CROSS, and GROUPS more efficiently than traditional MapReduce. It uses MapReduce as its basic building block but introduces optimizations that reduce computational time. Stratosphere supports a query language called Meteor and can execute analytical tasks formulated as Meteor queries using its distributed processing capabilities.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
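The partial-centroid optimization described above can be sketched without any Hadoop machinery: each mapper emits per-cluster sums and counts instead of raw points, and the reducer averages the partials. This is a pure-Python illustration; Dumbo's actual API is not shown:

```python
# Combiner-style K-means: mappers emit per-cluster (sum, count) partials
# instead of raw points, so the reducer averages a handful of partials
# rather than shuffling every point across the network.

def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def mapper_partials(points, centroids):
    """One mapper's output: cluster id -> (sum_vector, count)."""
    partials = {}
    for pt in points:
        k = nearest(pt, centroids)
        s, n = partials.get(k, ((0.0,) * len(pt), 0))
        partials[k] = (tuple(a + b for a, b in zip(s, pt)), n + 1)
    return partials

def reduce_centroids(all_partials, num_clusters, dim):
    sums = {k: [(0.0,) * dim, 0] for k in range(num_clusters)}
    for partials in all_partials:
        for k, (s, n) in partials.items():
            sums[k][0] = tuple(a + b for a, b in zip(sums[k][0], s))
            sums[k][1] += n
    return [tuple(x / n for x in s) if n else None for s, n in sums.values()]

centroids = [(0.0, 0.0), (10.0, 10.0)]
split1 = [(1.0, 1.0), (2.0, 0.0)]          # one input split per mapper
split2 = [(9.0, 11.0), (11.0, 9.0)]
partials = [mapper_partials(s, centroids) for s in (split1, split2)]
new_centroids = reduce_centroids(partials, 2, 2)
print(new_centroids)   # [(1.5, 0.5), (10.0, 10.0)]
```

The shuffle volume drops from one record per point to at most one record per cluster per mapper, which is the optimization the document credits for the speedup.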
This document describes a parallel and scalable approach called Big-SeqSB-Gen for generating large synthetic sequence databases. It implements Whitney enumerators to generate distinct sequences and uses a parallel sequence generator (PSG) built on Hadoop MapReduce. The PSG was tested on a French Grid5000 cluster and achieved generation of over 18 billion sequences in under 2 hours, demonstrating good scalability and throughput. Future work involves mining patterns from large real sequence datasets.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
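The map/shuffle/reduce dataflow described above can be simulated in a single process. This sketch is illustrative only; in a real cluster the map calls run in parallel over input splits and the shuffle moves data across the network, but the dataflow is the same:

```python
# Single-process simulation of the MapReduce word-count pipeline:
# map emits (word, 1) pairs, shuffle groups pairs by key, reduce sums.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)                 # intermediate key-value pair

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:                # group values by key, as the
        groups[key].append(value)           # framework does between phases
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

corpus = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(corpus)))
print(counts["the"])   # 3
print(counts["fox"])   # 2
```

The framework's value is that everything outside the two user functions (splitting, grouping, retrying failed tasks) is handled transparently at cluster scale.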
Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Re...Accumulo Summit
Talk Abstract
GeoWave is an open source software project developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with Booz Allen Hamilton and RadiantBlue Technologies. GeoWave leverages Accumulo’s architecture to manage petabytes of raster and vector data by serving as an enterprise level geospatial data store. To efficiently index geospatial data and answer queries with geospatial constraints, GeoWave employs a space filling curve to form bidirectional mappings between multi-dimensional data and Accumulo’s sorted row identifiers. As a complete offering, GeoWave provides a plug-in to the Open Source Geospatial Foundation’s GeoServer platform, enabling management of geospatial data and associated attributes through Open Geospatial Consortium (OGC) standard services, and map-reduce input/output formats to support scalable post-processing and analysis of geospatial data.
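The bidirectional mapping between multi-dimensional data and sorted row identifiers can be illustrated with a minimal Z-order (Morton) curve. GeoWave supports more sophisticated curves, so treat this toy version as a sketch of the idea rather than GeoWave's implementation:

```python
# Minimal Z-order (Morton) curve: interleave the bits of two dimensions
# into one sortable integer, and invert the mapping.

def interleave(x, y, bits=16):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits -> even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits -> odd positions
    return z

def deinterleave(z, bits=16):
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y

# Nearby cells get nearby curve values, so a spatial box query becomes
# a small set of contiguous row ranges in the sorted key space.
z = interleave(5, 9)
assert deinterleave(z) == (5, 9)   # the mapping is bidirectional
```

The bidirectionality matters: encoding turns a coordinate into a sorted row identifier at ingest time, and decoding lets the query planner translate a geospatial constraint back into row ranges.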
Speakers
Eric Robertson
Lead Technologist, Booz Allen Hamilton
Eric Robertson is a Data Scientist at Booz Allen Hamilton and has over twenty years of experience in software development across many diverse vertical domains including telecommunication, pharmaceuticals, finance, economics and defense. Eric has extensive experience in designing and developing identity correlation systems using graph analytics. Eric holds a M.S. in Computer Science from University of Maryland Baltimore County. Eric's current interests include machine learning and linear programming.
Rich Fecher
Senior Software Engineer, RadiantBlue
Over the past 10 years, Rich Fecher has been solving the hard technical challenges that face the U.S. Defense and Intelligence Communities. Rich has extensive expertise in architecting and building end-to-end systems. His experience ranges from visualization to distributed computing, and he has primarily focused his career toward enriching geospatial content and delivery. Rich holds a M.S. in Computer Science from George Mason University; he received his post-graduate certificate in GIS from Pennsylvania State University, and received a B.S. in Computer Science with minors in Applied Math and Physics from the University of Virginia.
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit
Speaker: Aaron Cordova
Most users of Accumulo start developing applications on a single machine and want to scale up to four orders of magnitude more machines without having to rewrite them. In this talk we describe techniques for designing applications for scale, planning a large scale cluster, tuning the cluster for high speed ingest, dealing with a large amount of data over time, and unique features of Accumulo for taking advantage of up to ten thousand nodes in a single instance. We also include the largest public metrics gathered on Accumulo clusters to date and include a discussion of overcoming practical limits to scaling in the future.
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit
Talk Abstract
Having the ability to diagnose and understand what is happening in distributed systems is essential. Tracing is one mechanism that enables analysis of operations in distributed systems by dividing each operation into a tree of measurable sub-tasks. HDFS, Accumulo, and HBase are now converging on a single tracing system utilizing HTrace, an open source tracing instrumentation library that recently became a new Apache Incubator project. This talk will cover tracing fundamentals, the instrumentation that has been added to HDFS to support tracing, and changes that have been made in Accumulo's tracing. It will also cover options for collecting and visualizing traces, as well as the current status of the HTrace podling.
Speaker
Billie Rinaldi
Sr. Member of Technical Staff, Hortonworks
Billie Rinaldi is a Senior Member of Technical Staff at Hortonworks, Inc., currently prototyping new features related to application monitoring and deployment in the Apache Hadoop ecosystem. Prior to August 2012, Billie engaged in big data science and research at the National Security Agency. Since 2008, she has been providing technical leadership regarding the software that is now Apache Accumulo. Billie is the VP of Apache Accumulo, the Accumulo Project Management Committee Chair, and a member of the Apache Software Foundation. She holds a Ph.D. in applied mathematics from Rensselaer Polytechnic Institute.
The document discusses Apache Accumulo, an open source distributed key-value store based on Google's Bigtable design. It provides an overview of Accumulo, including its timeline, strengths in security, scalability and adaptability. It describes Accumulo's basic schema of sorted key-value pairs with row, column family, qualifier, visibility and timestamp. It also outlines Accumulo's architecture, tablet organization, data flow, iterator framework and table design strategies.
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit
Many organizations are looking to Hadoop clusters in order to store and manage an ever-increasing amount of data. As the volume and variety of data in these systems grows, administrators are being confronted with more information, from more sources, than they have ever seen concentrated in a single place. The responsibility for securing all this data can be daunting to an administrator, even intimidating. Could the answer lie in Accumulo?
Conventional approaches to data security usually do not suffice for this scenario. They are often coarse-grained, applying only at the file or table level. In a world where arbitrary compute tasks can be pushed into the cluster, defining a security perimeter is difficult or impossible. On the other hand, relegating access policy enforcement to the application level instead of the database level ultimately invites a security disaster.
This is the world that Chief Security Officers, Chief Information Officers, and Chief Data Officers live in, and the problem of security for big data is the single biggest impediment to delivering a Hadoop-based solution in the enterprise’s production network. Numerous organizations have implemented Hadoop as a pilot, but find themselves blocked by similar considerations when the time comes to move into production:
• How do you implement fine-grained access controls in a Hadoop system?
• What about encryption at rest? Encryption in motion?
• How will this tie into our identity infrastructure?
• How will this fit into our operational workflow?
This keynote will explore the ways in which Apache Accumulo is uniquely positioned to mitigate or resolve problems around access control and security for big data, thus enabling Hadoop clusters to move from pilot to production.
– Speaker –
Russ Weeks
Software Architect, PHEMI Systems
Russ Weeks is a Software Architect at PHEMI. Prior to joining PHEMI Systems, Russ worked in the network management groups at Ericsson and Cray Supercomputers, where he discovered a passion for distributed data structures and algorithms. PHEMI Systems is a Vancouver, BC-based startup focused on the storage, retention and governance of structured and unstructured data.
— More Information —
For more information see http://www.accumulosummit.com/
Aaron Cordova outlines how Accumulo helps provide the essential features of a "Data Lake": a system in which all types of data from all sources can be imported, secured, analyzed, and delivered to decision makers.
The document summarizes how Accumulo can scale to support large clusters storing petabytes of data. It discusses how Accumulo maintains low administrative effort and scan latency as the data size scales up. Key techniques for scaling Accumulo include distributing writes across all servers, designing schemas to minimize the number of scans needed, and using temporal or binned keys to parallelize writes. The document also provides estimates for planning Accumulo clusters capable of ingesting millions of entries per second and storing data in the petabyte range.
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit
Speaker: Mike Drob
Apache Accumulo has long held a reputation for enabling high-throughput operations in write-heavy workloads. In this talk, we use the Yahoo! Cloud Serving Benchmark (YCSB) to put real numbers on Accumulo performance. We then compare these numbers to previous versions, to other databases, and wrap up with a discussion of parameters that can be tweaked to improve them.
Accumulo is a distributed key-value store that runs on Hadoop clusters. It is very scalable, able to store trillions of records and petabytes of data. Accumulo provides cell-level security and was originally developed by NSA as an open source version of Google's BigTable. It uses a master node to monitor tablet servers that store and serve partitions of tables. Potential applications of Accumulo include use as a massive datastore, for graph databases, machine learning/classification using sparse feature vectors.
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit
Accumulo requires its users to trust each Accumulo installation with their data — a malicious server or user could easily compromise critical data or learn secrets they are not authorized to access. One particular threat is a malicious Accumulo server compromising data’s integrity, by tampering with query results and returning forged, modified, or incomplete results to a user. In prior work, we implemented a lightweight client-side tool to protect against this kind of threat. We now present improvements to this tool that handle a wider range of attacks by a malicious server and reduce overhead for the client.
In our solution, Accumulo clients use Authenticated Data Structures (ADSs) to verify their range queries’ integrity. ADS metadata is stored in Accumulo, so that after each query, the server must construct a proof that the query has not been tampered with. We use Accumulo iterators to compute these proofs on the server without requiring an unnecessary computational burden from the client. We will present our approach to adding ADSs to Accumulo, our schema for storing the ADS metadata, and opportunities for future work in efficiency and expressiveness.
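The proof mechanism can be sketched with a toy Merkle tree, a classic authenticated data structure: the client stores only the root hash, and with each entry the server must return a sibling-hash path that recomputes that root. The layout below is illustrative, not the tool's actual schema:

```python
# Toy Merkle tree: a tampered or forged entry cannot produce a
# sibling-hash path that recomputes the root hash the client holds.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Return all tree levels, leaves first; duplicate last node if odd."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def proof_path(levels, index):
    """Sibling hashes from leaf to root for the leaf at `index`."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2 == 0))  # sibling of i is i^1
        index //= 2
    return path

def verify(leaf, path, root):
    node = h(leaf)
    for sibling, leaf_is_left in path:
        node = h(node + sibling) if leaf_is_left else h(sibling + node)
    return node == root

entries = [b"row1:val1", b"row2:val2", b"row3:val3", b"row4:val4"]
levels = build_levels(entries)
root = levels[-1][0]                      # this is all the client stores
assert verify(entries[2], proof_path(levels, 2), root)
assert not verify(b"row2:forged", proof_path(levels, 1), root)
```

Proving completeness of a range query additionally requires the tree to be ordered by key, so the client can check that no entries between the returned boundaries were omitted.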
– Speaker –
Leo St. Amour
Military Fellow, MIT Lincoln Laboratory
Leo St. Amour is a master’s student at Northeastern University and a military fellow at MIT Lincoln Laboratory. He graduated from the United States Military Academy in May 2015, where he worked on a TLS library with enhanced usability and security. In addition to his work on TLS and Accumulo, he is currently working on binary analysis, with a focus on discovering and hardening security properties.
— More Information —
For more information see http://www.accumulosummit.com/
Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]Accumulo Summit
Talk Abstract
Bulk ingest enables Accumulo to import externally-prepared data into existing tables. Unlike ingest via batch writers, much of the work of organizing data can be left to external processing frameworks such as MapReduce and scaled independently of the Accumulo cluster itself. This reduces the work required of the tablet servers to support ingest, freeing resources to support other operations.
Under the hood, bulk ingest involves a number of moving parts and must account for a variety of failure scenarios. This talk covers the components of the bulk ingest process in-depth and describes past, current and future implementations of this capability. Attendees will leave this session with an understanding of bulk ingest that will enable troubleshooting, capacity estimation and performance management.
Speaker
Eric Newton
Senior Software Developer, SWComplete
Eric Newton has been a programmer for over 30 years, and has worked on Accumulo since 2009. He has been an open-source contributor and consumer since 1988. Through the years, his distributed communications systems work has included Air Traffic Control, Systems Monitoring and Databases. Eric has started 3 of his own companies and helped several other businesses start.
GeoMesa's index uses a shard id at the beginning of the key to allow for horizontal scalability. It encodes spatio-temporal data in Accumulo keys using space-filling curves. Queries are applied in parallel at scan time through stacked server-side iterators to implement (E)CQL standard queries.
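The key scheme above can be sketched as follows; the exact byte layout here is an assumption for illustration, not GeoMesa's actual binary format:

```python
# Illustrative sharded spatio-temporal row key: a leading shard id
# spreads writes across tablet servers, and a space-filling-curve value
# keeps spatially nearby records in nearby rows within each shard.
import struct

NUM_SHARDS = 4

def row_key(feature_id: str, curve_value: int) -> bytes:
    shard = hash(feature_id) % NUM_SHARDS          # hash-based shard id
    return (struct.pack(">B", shard)               # 1-byte shard prefix
            + struct.pack(">Q", curve_value)       # 8-byte curve index
            + feature_id.encode())                 # unique suffix

def scan_ranges(curve_lo: int, curve_hi: int):
    """A spatial query becomes one contiguous range per shard."""
    return [(struct.pack(">B", s) + struct.pack(">Q", curve_lo),
             struct.pack(">B", s) + struct.pack(">Q", curve_hi + 1))
            for s in range(NUM_SHARDS)]

key = row_key("track-42", 123456)
ranges = scan_ranges(100000, 200000)
assert len(ranges) == NUM_SHARDS               # queries fan out per shard
assert any(lo <= key < hi for lo, hi in ranges)
```

The trade-off is explicit: sharding parallelizes both writes and scans across tablet servers, at the cost of one scan range per shard for every query.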
Processing Geospatial Data At Scale @locationtechRob Emanuele
This document discusses processing large geospatial data at scale. It provides background on big data frameworks like Apache Hadoop, Apache Spark, and geospatial projects like GeoTrellis, GeoWave, and SpatialHadoop that enable processing geospatial data using these frameworks. The document outlines how these tools allow geospatial data from sources like satellite imagery, OpenStreetMap, and geotagged social media to be analyzed using distributed computing platforms and algorithms.
We have two great organisations hosting FOSS4G this year: The Open Source Geospatial Foundation and LocationTech. Putting on a great event is not the primary responsibility of these software foundations - supporting our great open source software is!
This talk will introduce OSGeo and LocationTech, and balance the tricky topic of comparison for those interested in what each organisation offers and identifying possibilities for collaboration.
Each of these software foundations has an “incubation” process set up to onboard new projects. This incubation process matches the organization's priorities and will address many factors important to you, and a few ideas you may not have considered yet.
This talk draws on the incubation experience of:
* GeoServer (OSGeo), GeoTools (OSGeo),
* GeoGig (LocationTech), uDig (LocationTech)
If you are an open source developer interested in joining a foundation, we will cover some of the resource, marketing and infrastructure benefits that may be a factor for consideration. We will also look into some of the long-term benefits a software foundation provides both you and, importantly, the users of your software.
If you are a team member faced with the difficult choice of selecting open source technologies, this talk can help. We can learn a lot about the risks associated with open source based on how each foundation seeks to protect you. The factors a software foundation considers for its projects provide useful criteria you can use to evaluate any project.
Sqrrl Data, Inc. is a startup company founded in July 2012 that is focused on building secure, scalable, and adaptive applications using Apache Accumulo. The company was founded by former engineers and contributors to Accumulo, including the former Tech Director of Accumulo at NSA. Sqrrl aims to develop lightweight applications for discovery analytics, targeted analysis, and big-picture analytics using Accumulo's capabilities for security, scalability, and flexibility.
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
Talk Abstract
As with all open-source databases, Accumulo developers often balance building exciting new features against hacking on performance and stability. As the core features solidify and expand, we see many opportunities to improve performance. An effective methodology for performance improvement is scientific in nature, and follows a well-defined modeling and simulation approach, matching theory to experimentation in an iterative fashion.
Ingest performance is one of the most differentiating characteristics of Accumulo. However, there is much room for improvement for typical ingest-heavy applications. Accumulo supports two mechanisms to bring data in: streaming ingest and bulk ingest. In bulk ingest, the goal is to maximize throughput without constraining latency. Bulk ingest involves creating a set of files that conform to Accumulo's internal RFile format and then registering those files with Accumulo. MapReduce provides a framework for generating, sorting, and storing key/value pairs, which form the primary elements of preparing RFiles for bulk ingest. MapReduce has been used many times over the years to break sorting records, such as Terasort, so we can expect it to be a reasonable choice for maximizing bulk ingest throughput. However, the theory often proves challenging to implement, as there are many performance pitfalls along the way.
In this talk, we dive deep into optimizing MapReduce for Accumulo bulk ingest. We share detailed theoretical and empirical performance models, we discuss techniques for profiling performance, and we suggest reusable techniques for squeezing the maximum performance out of enterprise-grade Accumulo bulk ingest.
Speaker
Chris McCubbin
Director of Data Science, Sqrrl
Chris is the Director of Data Science for Sqrrl. He has extensive experience with the Hadoop ecosystem and applying scientific computation algorithms to real-world datasets. Previously, Chris developed Big Data analysis tools for the Intelligence Community and applied artificial intelligence techniques to unmanned vehicle systems. He holds a MS in Computer Science and BS in Computer Science and Mathematics from the University of Maryland.
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit
Talk Abstract
In this talk we will walk through how Apache Kafka and Apache Accumulo can be used together to orchestrate a de-coupled, real-time distributed and reactive request/response system at massive scale. Multiple data pipelines can perform complex operations for each message in parallel at high volumes with low latencies. The final result will be in line with the initiating call. The architecture gains are immense. They allow for the requesting system to receive a response without the need for direct integration with the data pipeline(s) that messages must go through. By utilizing Apache Kafka and Apache Accumulo, these gains sustain at scale and allow for complex operations of different messages to be applied to each response in real-time.
Speaker
Joe Stein
Principal Consultant, Big Data Open Source Security, LLC
Joe Stein is an Apache Kafka committer and PMC member. Joe is the Founder and Principal Architect of Big Data Open Source Security LLC, a professional services and product solutions company. Joe has been a developer, architect and technologist professionally for 15 years, having built back-end systems that supported over one hundred million unique devices a day processing trillions of events. He blogs and hosts a podcast about Hadoop and related systems at All Things Hadoop and tweets @allthingshadoop.
Processing Geospatial at Scale at LocationTechRob Emanuele
This document discusses processing large geospatial data at scale. It provides background on geospatial concepts like raster and vector data. It then discusses big data frameworks like Hadoop, Spark, and Accumulo that can be used to process geospatial data in parallel across large clusters. Finally, it presents several LocationTech projects like GeoTrellis, GeoJinni, and GeoWave that build geospatial capabilities on top of these frameworks to allow distributed processing and querying of large raster and vector maps.
Apache Accumulo, originally developed by the National Security Agency and now an Apache Software Foundation project, builds upon Google's Bigtable design to provide a scalable, lightly-structured database capability complementing the ubiquitous Hadoop environment. The core capabilities of Accumulo include cell-level security, flexible schemas, real-time analytics, bulk I/O, and linear scalability beyond trillions of entries and petabytes of data. These new capabilities lead to techniques that unlock the power of Big Data, but don't fit into traditional database design patterns. Learn about the advantages of Apache Accumulo and how it fits into the Hadoop and NoSQL ecosystem.
Presenter: Adam Fuchs, CTO, sqrrl
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing
1. GeoMesa: Using
Accumulo for optimized
spatio-temporal
processing
Dr. James Hughes, CCRi
james.hughes@ccri.com
2. GeoMesa is
● A collection of libraries and modules which can be used to
solve Big Geo Data problems
○ Great for managing billions to trillions of vector data records
○ Great for streaming vector data
● Open sourced through Eclipse’s LocationTech working group and has
graduated incubation
● Built on top of great open source libraries
GeoMesa Background
3. Such architectures allow for live views and near-real time processing (speed layer)
while persisting the data for historic queries and batch analysis (batch layer).
Client access to both layers can be handled by GeoServer.
GeoMesa enables Lambda architectures
4. Suppose we wish to monitor and understand a group of GPS-enabled and
internet-enabled devices (ex: sensors, vehicles).
● GeoMesa’s ETL / converter library aids in re-usable data modeling.
● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest
into Accumulo and Kafka topics.
● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as
1) geo-fencing, 2) location trackers, and 3) complex alerting rules.
● Effective storage in Accumulo allows for fast query returns.
● End-to-end visualization and analysis support allows aggregations to be
pushed down to the Accumulo tablet servers.
● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc
interactive analysis and data discovery.
Example Use Case: Managing Internet-Aware Devices
5. Suppose we wish to monitor and understand a group of GPS-enabled and
internet-enabled devices (ex: sensors, vehicles).
● GeoMesa’s ETL / converter library aids in re-usable data modeling.
● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest
into Accumulo and Kafka topics.
● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as
1) geo-fencing, 2) location trackers, and 3) complex alerting rules.
● Effective storage in Accumulo allows for fast query returns.
● End-to-end visualization and analysis support allows aggregations to be
pushed down to the Accumulo tablet servers.
● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc
interactive analysis and data discovery.
All of this adds up to “Speed! Speed! Speed!” whether you are looking at
a live view of the data or pulling back an analysis product.
Example Use Case: Managing Internet-Aware Devices
6. Making visualization and analysis fast has been a journey, and this
talk is about our steps so far
Talk Outline
7. Making visualization and analysis fast has been a journey, and this
talk is about our steps so far
1. Space-filling curves and storing spatio-temporal data
2. Improvements to GeoMesa use and implementation of Accumulo Iterators
3. Spark and MapReduce for distributed computation
Talk Outline
8. Making visualization and analysis fast has been a journey, and this
talk is about our steps so far
1. Space-filling curves and storing spatio-temporal data
2. Improvements to GeoMesa use and implementation of Accumulo Iterators
3. Spark and MapReduce for distributed computation
Not in this talk
1. Storm / NiFi - Streaming Ingest
2. Live views and online processing with Kafka
3. Command line tools
4. ETL / parser library
5. Machine learning / Deep Analytics
Talk Outline
9. ● Accumulo Key Design
● Space Filling Curves 101
● Indices for Points with Time
● Indices for Lines and Polygons
● Lessons Learned
GeoMesa's
evolution of
Accumulo
schemas
10. In a traditional stack, the application
issues queries to a database which is
responsible for query planning.
Overview of query planning in Accumulo
11. In a traditional stack, the application
issues queries to a database which is
responsible for query planning.
Overview of query planning in Accumulo
With Accumulo, the query planning is
handled by library code in the
application.
12. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
Space Filling Curves (in one slide!)
13. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
Space Filling Curves (in one slide!)
14. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
Space Filling Curves (in one slide!)
15. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
Space Filling Curves (in one slide!)
16. ● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
● Space filling curves have higher
dimensional analogs.
Space Filling Curves (in one slide!)
17. To query for points in the grey rectangle, the
query planner enumerates a collection of index
ranges which cover the area.
Note: Most queries won’t line up perfectly with the
gridding strategy.
Further filtering can be run on the Accumulo
tablet servers with Iterators (next section)
or we can return ‘loose’ bounding box results
(likely more quickly).
Query planning with Space Filling Curves
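The range enumeration can be illustrated with a brute-force toy planner over the Z-order sketch above. (Real planners, like GeoMesa's via sfcurve, recurse over the curve instead of visiting every cell, but they produce the same kind of range list.)

```python
# Brute-force "query planning": collect the curve label of every cell
# inside the query rectangle, then merge consecutive labels into
# contiguous scan ranges.
def z_index(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def covering_ranges(x0, y0, x1, y1):
    zs = sorted(z_index(x, y)
                for x in range(x0, x1 + 1)
                for y in range(y0, y1 + 1))
    ranges, start, prev = [], zs[0], zs[0]
    for z in zs[1:]:
        if z != prev + 1:                 # gap: the curve left the box
            ranges.append((start, prev))
            start = z
        prev = z
    ranges.append((start, prev))
    return ranges
```

A square aligned with the gridding yields one range; a box the curve exits and re-enters yields several, which is exactly why further filtering (or 'loose' results) is needed.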
18. GeoMesa has several tables; each optimized for a particular use case.
The Z3 table is used with and optimized for spatio-temporal point data. (Think sensor
observations, track reports, or other events which happen at a particular location.)
GeoMesa Key Structure for the ‘Z3’ table
Key:
Row = Shard (1 byte) + Epoch Week (2 bytes) + Z3(x,y,t) (8 bytes)
Column Family = 'F'
Value = the record
Here and now:
(38.9864985, -76.9561856)
10:15am, Tuesday, Oct. 11th, 2016
Epoch Week: 2440
X value: 1275689
Y value: 151972
T value: 2097151
Z3 (as a long):
6430470637115132837
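The pieces above can be packed into a sortable row key; here is a sketch in Python using the field widths from the slide (this shows the layout idea, not GeoMesa's exact wire format):

```python
# Illustrative packing of the Z3 row-key pieces:
# shard (1 byte) + epoch week (2 bytes) + Z3(x, y, t) (8 bytes).
# Big-endian packing makes lexicographic byte order match numeric order,
# which is what makes Accumulo range scans over the curve work.
import struct
from datetime import date

def epoch_week(d):
    return (d - date(1970, 1, 1)).days // 7

def z3_row_key(shard, week, z3):
    return struct.pack('>BHQ', shard, week, z3)

# The slide's example date falls in epoch week 2440:
key = z3_row_key(0, epoch_week(date(2016, 10, 11)), 6430470637115132837)
```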
19. Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results, which is expensive.
Indexing non-point geometries: New XZ Index
20. Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results, which is expensive.
Böhm, Klump, and Kriegel describe an
indexing strategy that allows such
geometries to be stored once.
GeoMesa has implemented this
strategy in XZ2 (spatial-only) and XZ3
(spatio-temporal) tables.
The key is to store data by resolution,
separate geometries by size, and then
index them by their lower left corner.
This does require consideration on the
query planning side, but avoiding
deduplication is worth the trade-off.
Indexing non-point geometries: New XZ Index
For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial
extension.” 6th. Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China.
(http://www.dbs.ifi.lmu.de/Publikationen/Boehm/Ordering_99.pdf)
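The size-to-resolution step can be sketched in a toy form: pick the finest grid level whose cells are still at least as large as the geometry's extent, so the geometry can be indexed once by the cell holding its lower-left corner. (The cell "enlargement" and actual curve encoding from Böhm et al. are omitted; the function name and clamping below are illustrative.)

```python
# Toy size-to-resolution selection for an XZ-style index.
import math

def xz_level(width, height, max_level=12):
    extent = max(width, height)   # geometry extent, normalized to a unit square
    if extent <= 0:
        return max_level
    # finest level whose cells (side 2**-level) still cover the extent
    level = int(math.floor(math.log2(1.0 / extent)))
    return max(0, min(level, max_level))
```

Large geometries land at coarse levels and small ones at fine levels, which is how the scheme separates geometries by size without storing copies.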
21. ● Accumulo Iterator Overview
● GeoMesa Iterators for Analysis
and Visualization
● Iterator Lessons Learned
GeoMesa's use
of Accumulo
Iterators
22. “Iterators provide a modular mechanism for adding functionality to be executed by
TabletServers when scanning or compacting data. This allows users to efficiently
summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation
Part of the modularity is that iterators can be stacked:
the output of one can be wired into the next.
Example: The first iterator might read from disk, the second could filter with
Authorizations, and a final iterator could filter by column family.
Other notes:
● Iterators provide a sorted view of the key/values.
● Iterator code can be loaded from HDFS and namespaced!
Accumulo Iterators
24. Visualization Example: Heatmaps
Without powerful visualization options,
big data is big nonsense.
Consider this view of shipping in the
Mediterranean Sea.
Heatmaps help show patterns and
they can be accelerated with
GeoMesa
25. Visualization Example: Heatmaps
Without powerful visualization options,
big data is big nonsense.
Consider this view of shipping in the
Mediterranean Sea.
Heatmaps help show patterns and
they can be accelerated with
GeoMesa
Heatmap
Request
HeatMap WPS
Query Hints
26. A request to GeoMesa consists of two broad pieces:
1. A filter restricting the data to act on, e.g.:
a. Records in Maryland with ‘Accumulo’ in the text field.
b. Records during the first week of 2016.
2. A request for ‘how’ to return the data, e.g.:
a. Return the full records
b. Return a subset of the record (either a projection or ‘bin’ file format)
c. Return a histogram
d. Return a heatmap / kernel density
Generally, a filter can be handled partially by selecting which ranges to scan; the
remainder can be handled by an Iterator.
Modifications to selected data can also be handled by a GeoMesa Iterator.
GeoMesa Data Requests
27. The first pass of GeoMesa iterators separated concerns into separate iterators.
The GeoMesa query planner assembled a stack of iterators to achieve the desired
result.
Initial GeoMesa Iterator design
Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by
Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
28. The key benefit to having decomposed iterators is that they are easier to
understand and re-mix.
In terms of performance, each one needs to understand the bytes in the Key and
Value. In many cases, this will lead to additional serialization/deserialization.
Now, we prefer to write Iterators which handle transforming the underlying data
into what the client code is expecting in one go.
Second GeoMesa Iterator design
29. 1. Using fewer iterators in the stack can be beneficial
2. Using lazy evaluation / deserialization when filtering Values can yield speed
improvements.
3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and
Values.
4. Accumulo 1.8.0 has an Iterator Test Harness!
https://accumulo.apache.org/release_notes/1.8.0#iterator-test-harness
https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterator_testing
Lessons learned about Iterators
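Lesson 2 can be sketched concretely: if the Value carries a small offset table, a server-side filter can decode just the one attribute it tests instead of the whole record. The encoding below is invented for illustration; GeoMesa's actual (Kryo-based) serialization differs.

```python
# Toy length/offset-table encoding enabling lazy attribute access.
import struct

def encode(attrs):
    # header: attribute count (2 bytes) + one 2-byte offset per attribute
    offsets, body = [], b''
    for a in attrs:
        offsets.append(len(body))
        body += a
    header = struct.pack('>H', len(attrs)) + b''.join(
        struct.pack('>H', o) for o in offsets)
    return header + body

def lazy_attr(value, i):
    # decode only attribute i, without touching the others
    (n,) = struct.unpack_from('>H', value, 0)
    start = 2 + 2 * n                         # end of the offset table
    (off,) = struct.unpack_from('>H', value, 2 + 2 * i)
    if i + 1 < n:
        (end,) = struct.unpack_from('>H', value, 2 + 2 * (i + 1))
    else:
        end = len(value) - start
    return value[start + off:start + end]
```

A filter like `Who = 'Bierce'` then touches one slice of the Value per entry rather than fully deserializing every record it rejects.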
30. Through our use of a) space filling curves, b) a cost-based query optimizer, and
c) carefully configured iterators, the GeoMesa query planner has a lot going on.
The GeoMesa query explainer logs 1) which index was used, 2) which ranges
were scanned, 3) Iterator configuration, etc.
Putting it all together: the GeoMesa Query Explainer
geomesa> geomesa explain -u USER -p PASS -i INSTANCE -c geomesa -z zoo1,zoo2,zoo3 -f AccumuloQuickStart -q "Who =
'Bierce'"
Planning 'AccumuloQuickStart' Who = 'Bierce'
Original filter: Who = 'Bierce'
Hints: density[false] bin[false] stats[false] map-aggregate[false] sampling[none]
Sort: none
Transforms: None
Strategy selection:
Query processing took 69ms and produced 1 options
Filter plan: FilterPlan[ATTRIBUTE[Who = 'Bierce'][None]]
Strategy selection took 8ms for 1 options
Strategy 1 of 1: AttributeIdxStrategy
Strategy filter: ATTRIBUTE[Who = 'Bierce'][None]
Plan: org.locationtech.geomesa.accumulo.index.BatchScanPlan
Table: geomesa_attr
Deduplicate: false
Column Families: all
Ranges (1): [%01;%00;%00;Bierce%00;::%01;%00;%00;Bierce%01;)
Iterators (0):
Query planning took 119ms
Verify hints
Inspect strategies considered
See table and ranges to be scanned
Quantify planning time
31. ● GeoMesa + Spark Setup
● GeoMesa + Spark Analytics
● GeoMesa powered notebooks
(Jupyter and Zeppelin)
GeoMesa’s
Spark Support:
Data Analysis
and Discovery
32. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
GeoMesa MapReduce and Spark Support
33. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
GeoMesa MapReduce and Spark Support
34. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
GeoMesa MapReduce and Spark Support
35. Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
So with a little glue code and Spark classpath/environment
management, GeoMesa has Spark support!
GeoMesa MapReduce and Spark Support
36. GeoMesa Spark Example 1: Time Series
Step 1: Get an RDD[SimpleFeature]
Step 2: Calculate the time series
Step 3: Plot the time series in R.
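Step 2 can be shown with a plain-Python stand-in: bucket records by day and count. (With GeoMesa + Spark the same shape is a map to `(day, 1)` followed by a `reduceByKey` over the RDD of SimpleFeatures; the record shape here is invented.)

```python
# Minimal time-series calculation: day -> record count.
from collections import Counter
from datetime import datetime

records = [
    {"dtg": datetime(2016, 10, 11, 10, 15)},
    {"dtg": datetime(2016, 10, 11, 18, 0)},
    {"dtg": datetime(2016, 10, 12, 9, 30)},
]
series = Counter(r["dtg"].date() for r in records)
```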
37. Using one dataset (country boundaries) to group another (here, GDELT) is
effectively a join.
Our summer intern, Atallah, worked out the details of doing this analysis in Spark
and created a tutorial and blog post.
This picture shows the ‘stability’ of a region, derived from GDELT Goldstein values
GeoMesa Spark Example 2: Aggregating by Regions
http://www.ccri.com/2016/08/17/new-geomesa-tutorial-aggregating-visualizing-data/
http://www.geomesa.org/documentation/tutorials/shallow-join.html
38. GeoMesa Spark Example 3: Aggregating Tweets about #traffic
Virginia Polygon CQL
GeoMesa RDD
Aggregate by County
Calculate ratio of #traffic
Store back to GeoMesa
39. GeoMesa Spark Example 3: Aggregating Tweets about #traffic
#traffic by Virginia county
Darker blue has a higher count
40. Problem: Another developer came by and mentioned that his Spark job using
GeoMesa had quite a few tasks (far more than expected).
Around the same time, Eugene Cheipesh (Azavea / GeoTrellis) wrote in to the
Accumulo user list…
In Accumulo 1.6.x, each range in the Accumulo InputFormat becomes a Split.
With space filling curves, it is easy to enumerate plenty of ranges for a query.
Solution: In the short term, we created a custom InputFormat whose
Splits each contain more than one range.
A small bump in the road…
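The workaround can be sketched as simple range grouping: instead of one Split per curve range, group the ranges into a bounded number of multi-range Splits so MapReduce/Spark creates far fewer tasks. (The grouping below ignores tablet locality, which the real custom InputFormat would need to account for.)

```python
# Group many scan ranges into at most n_splits multi-range splits.
def group_ranges(ranges, n_splits):
    per_split = max(1, -(-len(ranges) // n_splits))   # ceiling division
    return [ranges[i:i + per_split]
            for i in range(0, len(ranges), per_split)]
```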
41. Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
42. Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
There are two big things to work out:
1. Getting the right libraries on the
classpath.
2. Wiring up visualizations.
43. Interactive Data Discovery at Scale in GeoMesa Notebooks
GeoMesa Notebook Roadmap:
● Improved JavaScript integration
● D3.js and other visualization
libraries
● OpenLayers and Leaflet
● Python Bindings
44. Questions?
Find out more at http://geomesa.org
Connect with us on Gitter:
https://gitter.im/locationtech/geomesa
See applications at CCRi’s blog:
http://www.ccri.com/blog/
47. GeoMesa Converter Library
The Converter library is used in
1. The GeoMesa command line tools
2. GeoMesa’s NiFi processors
Configurations support XML, CSV, TSV, JSON, Avro, and more!
Examples are available for GeoNames, GDELT, OSM-GPX, Twitter, and others.
48. Live view with the GeoMesa Kafka DataStore
Q: How did you get billions of points?
A: Data is streaming in continually.
Examples come from IoT related
applications:
10 thousand sensors reporting
every 5 seconds generate 1.2 billion
records in a week.
In these cases, we want to see where
things are right now.
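The slide's arithmetic checks out:

```python
# 10,000 sensors reporting every 5 seconds for one week.
sensors = 10_000
reports_per_sensor_per_week = (7 * 24 * 3600) // 5   # 120,960
total = sensors * reports_per_sensor_per_week        # ~1.2 billion records
```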
49. GeoMesa Kafka DataStore Architecture
We have two issues to address:
1. In-memory index of
SimpleFeatures
2. Durable message passing system
For indexing, we use a combination of
Guava and CQEngine (efficient Java
collections).
Kafka serves as the message passing
system.
Consumer KDSes can be run in Storm
(for event processing), GeoServer (OGC
access), etc.
50. Around 100 years ago, mathematicians asked the question,
“Is there a continuous function from the unit interval to the unit square
which covers it?”
(Curve examples pictured: Row-Major, Z-Order, Hilbert)
Space Filling Curves: The Math