GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing
Accumulo Summit 2016
Dr. James Hughes, CCRi
GeoMesa is
● A collection of libraries and modules which can be used to
solve Big Geo Data problems
○ Great for managing billions to trillions of vector data records
○ Great for streaming vector data
● Open sourced through Eclipse’s LocationTech working group and has
graduated incubation
● Built on top of great open source libraries
GeoMesa Background
Such architectures allow for live views and near-real-time processing (speed layer)
while persisting the data for historical queries and batch analysis (batch layer).
Client access to both layers can be handled by GeoServer.
GeoMesa enables Lambda architectures
Suppose we wish to monitor and understand a group of GPS-enabled and
internet-enabled devices (ex: sensors, vehicles).
● GeoMesa’s ETL / converter library aids in re-usable data modeling.
● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest
into Accumulo and Kafka topics.
● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as
1) geo-fencing, 2) location trackers, and 3) complex alerting rules.
● Effective storage in Accumulo allows for fast query returns.
● End-to-end visualization and analysis support allows aggregations to be pushed
down to the Accumulo tablet servers.
● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc
interactive analysis and data discovery.
All of this adds up to “Speed! Speed! Speed!” whether you are looking at
a live view of the data or pulling back an analysis product.
Example Use Case: Managing Internet-Aware Devices
Making visualization and analysis fast has been a journey, and this
talk covers our steps so far:
1. Space-filling curves and storing spatio-temporal data
2. Improvements to GeoMesa's use and implementation of Accumulo Iterators
3. Spark and MapReduce for distributed computation
Not in this talk
1. Storm / NiFi - Streaming Ingest
2. Live views and online processing with Kafka
3. Command line tools
4. ETL / parser library
5. Machine learning / Deep Analytics
Talk Outline
● Accumulo Key Design
● Space Filling Curves 101
● Indices for Points with Time
● Indices for Lines and Polygons
● Lessons Learned
evolution of
In a traditional stack, the application
issues queries to a database which is
responsible for query planning.
Overview of query planning in Accumulo
With Accumulo, the query planning is
handled by library code in the
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
that the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
● Space filling curves have higher
dimensional analogs.
Space Filling Curves (in one slide!)
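The gridding and labelling steps above can be sketched in a few lines. This is a minimal illustration that assumes a Z-order (Morton) curve over longitude/latitude and a tiny 4-bit-per-dimension grid; the function names and the bit width are hypothetical choices for the example, not GeoMesa's actual code.

```python
def interleave2(x: int, y: int, bits: int) -> int:
    """Interleave the low `bits` bits of x and y (x takes the even bit positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def grid_cell(lon: float, lat: float, bits: int) -> tuple:
    """Bin a lon/lat pair into a (column, row) cell of a 2^bits x 2^bits grid."""
    n = 1 << bits
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return col, row


def z2_key(lon: float, lat: float, bits: int = 4) -> bytes:
    """Label the cell by its position on the curve and return it as big-endian bytes."""
    col, row = grid_cell(lon, lat, bits)
    z = interleave2(col, row, bits)
    return z.to_bytes((2 * bits + 7) // 8, "big")
```

Big-endian bytes are used so that the byte-wise sort order of the keys matches the order in which the curve visits the cells, which is what a sorted key/value store needs.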
To query for points in the grey rectangle, the
query planner enumerates a collection of index
ranges which cover the area.
Note: Most queries won’t line up perfectly with the
gridding strategy.
Further filtering can be run on the Accumulo
tablet servers with Iterators (next section)
or we can return ‘loose’ bounding box results
(likely more quickly).
Query planning with Space Filling Curves
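As a rough sketch of how a rectangle turns into index ranges, the example below enumerates every grid cell in the query box, computes its Z-order label, and merges consecutive labels into ranges. It assumes a Z-order curve and uses brute-force enumeration for clarity; a production planner would more likely decompose the box recursively, and the function names here are hypothetical.

```python
def interleave2(x: int, y: int, bits: int) -> int:
    """Interleave the low `bits` bits of x and y (x takes the even bit positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def covering_ranges(col_min, col_max, row_min, row_max, bits):
    """Return inclusive (start, end) label ranges covering a box of grid cells."""
    labels = sorted(
        interleave2(c, r, bits)
        for c in range(col_min, col_max + 1)
        for r in range(row_min, row_max + 1)
    )
    ranges = []
    for z in labels:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z          # extend the current range
        else:
            ranges.append([z, z])      # start a new range
    return [tuple(r) for r in ranges]
```

On a 4 x 4 grid, a box aligned with the curve, `covering_ranges(0, 1, 0, 1, 2)`, collapses to a single range, while the same-sized but misaligned box `covering_ranges(1, 2, 1, 2, 2)` needs four separate ranges. This is why query boxes that do not line up with the grid produce many ranges.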
GeoMesa has several tables; each optimized for a particular use case.
The Z3 table is used with, and optimized for, temporal point data. (Think sensor
observations, track reports, or other events which happen at a particular location.)
GeoMesa Key Structure for the ‘Z3’ table
[Diagram: Accumulo entry layout, showing the Key (including column Family and Qualifier) and the Value]
Here and now:
(38.9864985, -76.9561856)
10:15am, Tuesday, Oct. 11th, 2016
Epoch Week: 2440
X value: 1275689
Y value: 151972
T value: 2097151
Z3 (as a long):
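The epoch week in the example can be checked with a short calculation, and the three-way bit interleaving behind a Z3 value can be sketched alongside it. The sketch assumes the slide's 10:15am is US Eastern daylight time (14:15 UTC), and it infers 21 bits per dimension from the T value shown (2097151 = 2^21 - 1); the bit order and function names are assumptions for illustration, not GeoMesa's implementation.

```python
from datetime import datetime, timezone

WEEK_SECONDS = 7 * 24 * 60 * 60


def epoch_week_and_offset(ts: datetime) -> tuple:
    """Split a timestamp into (weeks since 1970-01-01, seconds into that week)."""
    return divmod(int(ts.timestamp()), WEEK_SECONDS)


def interleave3(x: int, y: int, t: int, bits: int) -> int:
    """Interleave the low `bits` bits of x, y and t into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (3 * i)
        z |= ((y >> i) & 1) << (3 * i + 1)
        z |= ((t >> i) & 1) << (3 * i + 2)
    return z


# Assumption: 10:15am local time on Oct. 11th, 2016 is 14:15 UTC.
when = datetime(2016, 10, 11, 14, 15, tzinfo=timezone.utc)
week, offset = epoch_week_and_offset(when)   # week is 2440, as on the slide
```

With 21 bits per dimension the interleaved value uses at most 63 bits, so it fits in a signed 64-bit long.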
Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results which is expensive.
Böhm, Klump, and Kriegel describe an
indexing strategy that allows such
geometries to be stored once.
GeoMesa has implemented this
strategy in XZ2 (spatial-only) and XZ3
(spatio-temporal) tables.
The key is to store data by resolution,
separate geometries by size, and then
index them by their lower left corner.
This does require consideration on the
query planning side, but avoiding
deduplication is worth the trade-off.
Indexing non-point geometries: New XZ Index
For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial
extension.” 6th. Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China.
● Accumulo Iterator Overview
● GeoMesa Iterators for Analysis
and Visualization
● Iterator Lessons Learned
GeoMesa's use of Accumulo Iterators
“Iterators provide a modular mechanism for adding functionality to be executed by
TabletServers when scanning or compacting data. This allows users to efficiently
summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation
Part of the modularity is that the iterators can be stacked:
the output of one can be wired into the next.
Example: The first iterator might read from disk, the second could filter with
Authorizations, and a final iterator could filter by column family.
Other notes:
● Iterators provide a sorted view of the key/values.
● Iterator code can be loaded from HDFS and namespaced!
Accumulo Iterators
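The stacking idea can be illustrated with Python generators, where each stage consumes the sorted output of the one below it. This is only a conceptual analogy: real Accumulo iterators are Java classes running inside the tablet servers, visibility labels are boolean expressions rather than the single label assumed here, and every name in the sketch is hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Entry(NamedTuple):
    row: str
    family: str
    visibility: str   # simplified: a single label, "" meaning public
    value: bytes


def source(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Bottom of the stack: emit entries in sorted key order."""
    yield from sorted(entries)


def visibility_filter(upstream: Iterator[Entry], auths: Set[str]) -> Iterator[Entry]:
    """Drop entries whose label the caller is not authorized to see."""
    return (e for e in upstream if e.visibility == "" or e.visibility in auths)


def family_filter(upstream: Iterator[Entry], family: str) -> Iterator[Entry]:
    """Keep only entries in the requested column family."""
    return (e for e in upstream if e.family == family)


data = [
    Entry("r2", "geom", "", b"b"),
    Entry("r1", "geom", "secret", b"a"),
    Entry("r3", "attr", "", b"c"),
]
stack = family_filter(visibility_filter(source(data), {"public"}), "geom")
rows = [e.row for e in stack]
```

Because each stage only filters, the output stays in sorted order, which is the contract the next stage relies on.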
Visualization Example: Heatmaps
Without powerful visualization options,
big data is big nonsense.
Consider this view of shipping in the
Mediterranean Sea.
Heatmaps help show patterns and
they can be accelerated with
HeatMap WPS
Query Hints
A request to GeoMesa consists of two broad pieces:
1. A filter restricting the data to act on, e.g.:
a. Records in Maryland with ‘Accumulo’ in the text field.
b. Records during the first week of 2016.
2. A request for ‘how’ to return the data, e.g.:
a. Return the full records
b. Return a subset of the record (either a projection or ‘bin’ file format)
c. Return a histogram
d. Return a heatmap / kernel density
Generally, a filter can be handled partially by selecting which ranges to scan; the
remainder can be handled by an Iterator.
Modifications to selected data can also be handled by a GeoMesa Iterator.
GeoMesa Data Requests
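To make the heatmap case concrete, here is a small sketch of the aggregation pattern: each server bins only its own points into a sparse pixel grid, and the client merges the partial grids. The function names, grid size and sample points are made up for illustration, and the code says nothing about how GeoMesa encodes or ships these partial results.

```python
from collections import Counter


def partial_density(points, bbox, width, height):
    """Bin the points that fall inside bbox into a sparse width x height grid."""
    xmin, ymin, xmax, ymax = bbox
    grid = Counter()
    for x, y in points:
        if xmin <= x <= xmax and ymin <= y <= ymax:
            i = min(int((x - xmin) / (xmax - xmin) * width), width - 1)
            j = min(int((y - ymin) / (ymax - ymin) * height), height - 1)
            grid[(i, j)] += 1
    return grid


def merge(partials):
    """Client side: sum the sparse grids returned by each server."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total


bbox = (0.0, 0.0, 10.0, 10.0)
server_a = partial_density([(1.0, 1.0), (9.0, 9.0)], bbox, 2, 2)
server_b = partial_density([(2.0, 2.0), (11.0, 11.0)], bbox, 2, 2)
heatmap = merge([server_a, server_b])
```

Only the small grids cross the network, not the underlying records, which is where the speed-up comes from.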
The first pass of GeoMesa iterators separated concerns into separate iterators.
The GeoMesa query planner assembled a stack of iterators to achieve the desired
Initial GeoMesa Iterator design
Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by
Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
The key benefit to having decomposed iterators is that they are easier to
understand and re-mix.
In terms of performance, each one needs to understand the bytes in the Key and
Value. In many cases, this will lead to additional serialization/deserialization.
Now, we prefer to write Iterators which handle transforming the underlying data
into what the client code is expecting in one go.
Second GeoMesa Iterator design
1. Using fewer iterators in the stack can be beneficial
2. Using lazy evaluation / deserialization for filtering Values can improve speed
3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and Values
4. Accumulo 1.8.0 has an Iterator Test Harness!
Lessons learned about Iterators
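The lazy deserialization point can be illustrated with a toy value format: an attribute count, a table of offsets, then the attribute bytes. A filter that needs one attribute reads just that slice instead of decoding the whole record. The layout and function names are invented for this sketch and are not GeoMesa's serialization format.

```python
import struct


def encode(attrs):
    """Serialize byte-string attributes as: count, offset table, payloads."""
    offsets, pos = [], 0
    for a in attrs:
        offsets.append(pos)
        pos += len(a)
    offsets.append(pos)                       # end offset of the last attribute
    header = struct.pack(">H", len(attrs))
    table = struct.pack(">%dI" % len(offsets), *offsets)
    return header + table + b"".join(attrs)


def read_attr(buf, i):
    """Read attribute i without touching the other attributes."""
    (n,) = struct.unpack_from(">H", buf, 0)
    start, end = struct.unpack_from(">2I", buf, 2 + 4 * i)
    base = 2 + 4 * (n + 1)                    # payloads start after the table
    return buf[base + start: base + end]


value = encode([b"Bierce", b"POINT (1 2)", b"2016-10-11"])
matches = read_attr(value, 0) == b"Bierce"    # filter on the first attribute only
```

A record that fails the filter is discarded after reading a handful of bytes, and only the survivors pay the full decoding cost.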
Through our use of a) space filling curves, b) a cost-based query optimizer, and
c) carefully configured iterators, the GeoMesa query planner has a lot going on.
The GeoMesa query explainer logs 1) which index was used, 2) which ranges
were scanned, 3) Iterator configuration, etc.
Putting it all together: the GeoMesa Query Explainer
geomesa> geomesa explain -u USER -p PASS -i INSTANCE -c geomesa -z zoo1,zoo2,zoo3 -f AccumuloQuickStart -q "Who =
Planning 'AccumuloQuickStart' Who = 'Bierce'
Original filter: Who = 'Bierce'
Hints: density[false] bin[false] stats[false] map-aggregate[false] sampling[none]
Sort: none
Transforms: None
Strategy selection:
Query processing took 69ms and produced 1 options
Filter plan: FilterPlan[ATTRIBUTE[Who = 'Bierce'][None]]
Strategy selection took 8ms for 1 options
Strategy 1 of 1: AttributeIdxStrategy
Strategy filter: ATTRIBUTE[Who = 'Bierce'][None]
Plan: org.locationtech.geomesa.accumulo.index.BatchScanPlan
Table: geomesa_attr
Deduplicate: false
Column Families: all
Ranges (1): [%01;%00;%00;Bierce%00;::%01;%00;%00;Bierce%01;)
Iterators (0):
Query planning took 119ms
Verify hints
Inspect strategies considered
See table and ranges to be scanned
Quantify planning time
● GeoMesa + Spark Setup
● GeoMesa + Spark Analytics
● GeoMesa powered notebooks
(Jupyter and Zeppelin)
Spark Support:
Data Analysis
and Discovery
Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo implements the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
So with a little glue code and Spark classpath/environment
management, GeoMesa has Spark support!
GeoMesa MapReduce and Spark Support
GeoMesa Spark Example 1: Time Series
Step 1: Get an RDD[SimpleFeature]
Step 2: Calculate the time series
Step 3: Plot the time series in R.
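Step 2 boils down to a map to (day, 1) pairs followed by a sum per key. The sketch below shows that logic in plain Python, with dictionaries standing in for SimpleFeatures and a hypothetical `dtg` date attribute; in a Spark job the same shape would typically be expressed as a map followed by a reduce by key on the RDD.

```python
from collections import Counter
from datetime import datetime


def daily_counts(features):
    """Count features per calendar day and return (day, count) pairs in date order."""
    counts = Counter()
    for f in features:
        counts[f["dtg"].date()] += 1          # map to (day, 1), then sum by key
    return sorted(counts.items())


features = [
    {"dtg": datetime(2016, 1, 1, 8, 30)},
    {"dtg": datetime(2016, 1, 1, 17, 5)},
    {"dtg": datetime(2016, 1, 2, 9, 0)},
]
series = daily_counts(features)
```

The resulting list of (day, count) pairs is small enough to collect on the driver and hand to a plotting tool.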
Using one dataset (country boundaries) to group another (here, GDELT) is
effectively a join.
Our summer intern, Atallah, worked out the details of doing this analysis in Spark
and created a tutorial and blog post.
This picture shows ‘stability’ of a region from GDELT Goldstein values
GeoMesa Spark Example 2: Aggregating by Regions
GeoMesa Spark Example 3: Aggregating Tweets about #traffic
Virginia Polygon CQL
GeoMesa RDD
Aggregate by County
Calculate ratio of #traffic
Store back to GeoMesa
GeoMesa Spark Example 3: Aggregating Tweets about #traffic
#traffic by Virginia county
Darker blue has a higher count
Problem: Another developer came by and mentioned that his Spark job using
GeoMesa had quite a few tasks (far more than expected).
Around the same time, Eugene Cheipesh (Azavea / GeoTrellis) wrote in to the
Accumulo user list…
In Accumulo 1.6.x, each range in the Accumulo InputFormat becomes a Split.
With space filling curves, it is easy to enumerate plenty of ranges for a query.
Solution: The short-term solution was to create a custom InputFormat which
produces Splits, each containing more than one range.
A small bump in the road…
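The grouping idea can be sketched independently of Hadoop: take the list of ranges and chunk it so that the number of splits (and therefore tasks) is bounded. This sketch ignores where the data lives, which a real InputFormat would presumably take into account, and the function name and parameters are hypothetical.

```python
def group_ranges(ranges, max_splits):
    """Chunk a list of ranges into at most max_splits groups, preserving order."""
    if max_splits < 1:
        raise ValueError("max_splits must be at least 1")
    if not ranges:
        return []
    size = -(-len(ranges) // max_splits)      # ceiling division
    return [ranges[i:i + size] for i in range(0, len(ranges), size)]


ranges = [(i * 10, i * 10 + 5) for i in range(10)]   # ten made-up index ranges
splits = group_ranges(ranges, 3)                     # groups of 4, 4 and 2
```

With one range per split the ten ranges above would become ten tasks; grouped, they become three.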
Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
There are two big things to work out:
1. Getting the right libraries on the
2. Wiring up visualizations.
Interactive Data Discovery at Scale in GeoMesa Notebooks
GeoMesa Notebook Roadmap:
● Improved JavaScript integration
● D3.js and other visualization
● OpenLayers and Leaflet
● Python Bindings
Find out more at
Connect with us on Gitter:
See applications at CCRi’s blog:
Backup slides
Talk filling curves
GeoMesa Converter Library
The Converter library is used in
1. The GeoMesa command line tools
2. GeoMesa’s NiFi processors
Configurations support XML, CSV, TSV, JSON, Avro, and more!
Examples are available for GeoNames, GDELT, OSM-GPX, Twitter, and others.
Live view with the GeoMesa Kafka DataStore
Q: How did you get billions of points?
A: Data is streaming in continually.
Examples come from IoT related
10 thousand sensors reporting
every 5 seconds generate 1.2 billion
records in a week.
In these cases, we want to see where
things are right now.
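The record count above is easy to verify; the variable names below are just labels for the figures quoted on the slide.

```python
sensors = 10_000
report_interval_s = 5
seconds_per_week = 7 * 24 * 60 * 60           # 604,800 seconds

reports_per_sensor = seconds_per_week // report_interval_s
records_per_week = sensors * reports_per_sensor
print(f"{records_per_week:,}")
```

That comes to 1,209,600,000 records, or roughly 1.2 billion per week.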
GeoMesa Kafka DataStore Architecture
We have two issues to address:
1. In-memory index of
2. Durable message passing system
For indexing, we use a combination of
Guava and CQEngine (efficient Java
Kafka serves as the message passing
Consumer KDSes can be run in Storm
(for event processing), GeoServer (OGC
access), etc.
Z-Order Hilbert
Around 100 years ago, mathematicians asked the question,
“Is there a continuous function from the unit interval to the unit square
which covers it?”
Space Filling Curves: The Math
Streaming Data Architecture; Part 1
Continuous ingest:
leverages the
GeoMesa converter

Similar to Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
Rob Emanuele
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
DataWorks Summit
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
Guy K. Kloss
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
Anant Kumar
The design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesThe design and implementation of modern column oriented databases
The design and implementation of modern column oriented databases
Tilak Patidar
Geoservices Activities at EDINA
Geoservices Activities at EDINAGeoservices Activities at EDINA
Geoservices Activities at EDINA
EDINA, University of Edinburgh
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
IRJET Journal
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
Cloudera, Inc.
Ptidej Team
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET Journal
Watershed Delineation in ArcGIS
Watershed Delineation in ArcGISWatershed Delineation in ArcGIS
Watershed Delineation in ArcGIS
Arthur Green
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
Skyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed EnvironmentSkyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed Environment
Watershed Delineation Using ArcMap
Watershed Delineation Using ArcMapWatershed Delineation Using ArcMap
Watershed Delineation Using ArcMap
Arthur Green
Characteristics of an on chip cache on nec sx
Characteristics of an on chip cache on nec sxCharacteristics of an on chip cache on nec sx
Characteristics of an on chip cache on nec sx
Léia de Sousa

Similar to Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing (20)

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
MataNui - Building a Grid Data Infrastructure that "doesn't suck!"
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
The design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesThe design and implementation of modern column oriented databases
The design and implementation of modern column oriented databases
Geoservices Activities at EDINA
Geoservices Activities at EDINAGeoservices Activities at EDINA
Geoservices Activities at EDINA
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
Watershed Delineation in ArcGIS
Watershed Delineation in ArcGISWatershed Delineation in ArcGIS
Watershed Delineation in ArcGIS
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Skyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed EnvironmentSkyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed Environment
Watershed Delineation Using ArcMap
Watershed Delineation Using ArcMapWatershed Delineation Using ArcMap
Watershed Delineation Using ArcMap
Characteristics of an on chip cache on nec sx
Characteristics of an on chip cache on nec sxCharacteristics of an on chip cache on nec sx
Characteristics of an on chip cache on nec sx

Recently uploaded

A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah

Recently uploaded (20)

A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"

Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing

  • 1. GeoMesa: Using Accumulo for optimized spatio-temporal processing Dr. James Hughes, CCRi
  • 2. GeoMesa is ● A collection of libraries and modules which can be used to solve Big Geo Data problems ○ Great for managing billions to trillions of vector data ○ Great for streaming vector data ● Open sourced through Eclipse’s LocationTech working group and has graduated incubation ● Built on top of great open source libraries GeoMesa Background
  • 3. GeoMesa enables Lambda architectures. Such architectures allow for live views and near-real-time processing (speed layer) while persisting the data for historical queries and batch analysis (batch layer). Client access to both layers can be handled by GeoServer.
  • 5. Suppose we wish to monitor and understand a group of GPS-enabled and internet-enabled devices (ex: sensors, vehicles). ● GeoMesa’s ETL / converter library aids in re-usable data modeling. ● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest into Accumulo and Kafka topics. ● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as 1) geo-fencing, 2) location trackers, and 3) complex alerting rules. ● Effective storage in Accumulo allows for fast query returns. ● End-to-end visualization and analysis support allows aggregations to be pushed down to the Accumulo tablet servers. ● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc interactive analysis and data discovery. All of this adds up to “Speed! Speed! Speed!” whether you are looking at a live view of the data or pulling back an analysis product. Example Use Case: Managing Internet-Aware Devices
  • 8. Making visualization and analysis quick has been a journey, and this talk is about our steps so far: 1. Space-filling curves and storing spatio-temporal data 2. Improvements to GeoMesa’s use and implementation of Accumulo Iterators 3. Spark and MapReduce for distributed computation Not in this talk: 1. Storm / NiFi - Streaming Ingest 2. Live views and online processing with Kafka 3. Command line tools 4. ETL / parser library 5. Machine learning / Deep Analytics Talk Outline
  • 9. ● Accumulo Key Design ● Space Filling Curves 101 ● Indices for Points with Time ● Indices for Lines and Polygons ● Lessons Learned GeoMesa's evolution of Accumulo schemas
  • 11. In a traditional stack, the application issues queries to a database which is responsible for query planning. Overview of query planning in Accumulo With Accumulo, the query planning is handled by library code in the application.
  • 16. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves ● First, ‘grid’ the data space into bins. ● Next, order the grid cells with a space filling curve. ○ Label the grid cells by the order in which the curve visits them. ○ Associate the data in that grid cell with a byte representation of the label. ● We prefer “good” space filling curves: ○ Want recursive curves and locality. ● Space filling curves have higher dimensional analogs. Space Filling Curves (in one slide!)
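The gridding and labeling steps on this slide can be sketched in a few lines of Python. This is a minimal illustration using a Z-order (Morton) curve, not GeoMesa's implementation; the function names, the 16-bit resolution, and the choice of putting x bits in the even positions are all assumptions made for the example.

```python
from typing import Tuple


def grid_cell(lon: float, lat: float, bits: int = 16) -> Tuple[int, int]:
    """Bin a longitude/latitude pair into a 2^bits x 2^bits grid."""
    cells = 1 << bits
    # Clamp so that lon == 180 or lat == 90 lands in the last cell.
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return x, y


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Label cell (x, y) by the order in which a Z-order curve visits it."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits fill the even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits fill the odd positions
    return z


def z2_label(lon: float, lat: float, bits: int = 16) -> bytes:
    """Byte representation of the label; big-endian keeps byte order == numeric order."""
    x, y = grid_cell(lon, lat, bits)
    return interleave_bits(x, y, bits).to_bytes((2 * bits + 7) // 8, "big")
```

Because the label is written big-endian, a lexicographically sorted key space visits the cells in curve order, which is what lets a range scan stand in for a spatial query.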
  • 17. To query for points in the grey rectangle, the query planner enumerates a collection of index ranges which cover the area. Note: Most queries won’t line up perfectly with the gridding strategy. Further filtering can be run on the Accumulo tablet servers with Iterators (next section) or we can return ‘loose’ bounding box results (likely more quickly). Query planning with Space Filling Curves
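To make the "collection of index ranges" concrete, here is a brute-force sketch that assumes a Z-order curve over a small integer grid. A real query planner would decompose the rectangle recursively instead of listing every cell; the function names and the tiny grid are assumptions for illustration only.

```python
from typing import List, Tuple


def z_index(x: int, y: int, bits: int) -> int:
    """Z-order label of grid cell (x, y), with x bits in the even positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def covering_ranges(xmin: int, ymin: int, xmax: int, ymax: int,
                    bits: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) label ranges covering the cell rectangle."""
    labels = sorted(z_index(x, y, bits)
                    for x in range(xmin, xmax + 1)
                    for y in range(ymin, ymax + 1))
    ranges = []
    start = prev = labels[0]
    for z in labels[1:]:
        if z != prev + 1:            # gap in the curve: close the current run
            ranges.append((start, prev))
            start = z
        prev = z
    ranges.append((start, prev))
    return ranges


# A box aligned with the curve collapses to one range;
# shifting it by one cell splits it into three.
print(covering_ranges(0, 0, 1, 1))
print(covering_ranges(1, 0, 2, 1))
```

The second call shows why queries that do not line up with the grid produce more ranges to scan.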
  • 18. GeoMesa Key Structure for the ‘Z3’ table. GeoMesa has several tables, each optimized for a particular use case. The Z3 table is used with and optimized for temporal point data. (Think sensor observations, track reports, or other events which happen at a particular location.) Key = Row [Shard (1 byte) | Epoch Week (2 bytes) | Z3(x,y,t) (8 bytes)] + Column [Family | Qualifier: ‘F’]; Value = Record. Here and now: (38.9864985, -76.9561856), 10:15am, Tuesday, Oct. 11th, 2016. Epoch Week: 2440. X value: 1275689. Y value: 151972. T value: 2097151. Z3 (as a long): 6430470637115132837.
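The row layout above (1-byte shard, 2-byte epoch week, 8-byte Z3 value) can be sketched with Python's `struct` module. This is an illustration, not GeoMesa code: the helper names and the shard value of 0 are assumptions, and the epoch week is assumed to count whole weeks since 1970-01-01 UTC, which agrees with the value 2440 on the slide.

```python
import struct
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_week(ts: datetime) -> int:
    """Whole weeks elapsed since the Unix epoch (assumed definition)."""
    return (ts - EPOCH).days // 7


def z3_row_key(shard: int, week: int, z3: int) -> bytes:
    """Pack the row as big-endian: 1-byte shard, 2-byte week, 8-byte Z3 long."""
    # Z3 values are non-negative, so a signed 8-byte field still sorts correctly.
    return struct.pack(">BHq", shard, week, z3)


when = datetime(2016, 10, 11, 10, 15, tzinfo=timezone.utc)
# The Z3 long is copied from the slide; shard 0 is a placeholder.
key = z3_row_key(shard=0, week=epoch_week(when), z3=6430470637115132837)
print(epoch_week(when), len(key))
```

Putting the shard first spreads writes across tablets, while the week prefix keeps each Z3 value scoped to a bounded time window.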
  • 20. Most approaches to indexing non-point geometries involve covering the geometry with a number of grid cells and storing a copy with each index. This means that the client has to deduplicate results, which is expensive. Böhm, Klump, and Kriegel describe an indexing strategy that allows such geometries to be stored once. GeoMesa has implemented this strategy in XZ2 (spatial-only) and XZ3 (spatio-temporal) tables. The key is to store data by resolution, separate geometries by size, and then index them by their lower left corner. This does require consideration on the query planning side, but avoiding deduplication is worth the trade-off. Indexing non-point geometries: New XZ Index For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial extension.” 6th Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China.
  • 21. ● Accumulo Iterator Overview ● GeoMesa Iterators for Analysis and Visualization ● Iterator Lessons Learned GeoMesa's use of Accumulo Iterators
  • 22. “Iterators provide a modular mechanism for adding functionality to be executed by TabletServers when scanning or compacting data. This allows users to efficiently summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation Part of the modularity is that the iterators can be stacked: the output of one can be wired into the next. Example: The first iterator might read from disk, the second could filter with Authorizations, and a final iterator could filter by column family. Other notes: ● Iterators provide a sorted view of the key/values. ● Iterator code can be loaded from HDFS and namespaced! Accumulo Iterators
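The stacking idea can be mimicked with Python generators. This sketch is only an analogy and does not use the Accumulo API: keys are simplified to (row, family, visibility) tuples, and visibility is treated as a single label rather than a boolean expression. All names are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

Key = Tuple[str, str, str]      # (row, column family, visibility label)
Entry = Tuple[Key, bytes]


def source(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Bottom of the stack: present the stored entries in sorted key order."""
    yield from sorted(entries)


def visibility_filter(upstream: Iterator[Entry],
                      authorizations: Set[str]) -> Iterator[Entry]:
    """Drop entries whose visibility label the scanner is not authorized for."""
    for key, value in upstream:
        if key[2] in authorizations:
            yield key, value


def family_filter(upstream: Iterator[Entry], family: str) -> Iterator[Entry]:
    """Keep only entries in the requested column family."""
    for key, value in upstream:
        if key[1] == family:
            yield key, value


data = [
    (("r2", "F", "public"), b"b"),
    (("r1", "F", "secret"), b"a"),
    (("r1", "G", "public"), b"c"),
    (("r0", "F", "public"), b"d"),
]
# Wire the output of each stage into the next, as on the slide.
stack = family_filter(visibility_filter(source(data), {"public"}), "F")
rows = [key[0] for key, _ in stack]
print(rows)
```

Each stage only filters, so the sorted order established at the bottom of the stack is preserved all the way up.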
  • 25. Visualization Example: Heatmaps. Without powerful visualization options, big data is big nonsense. Consider this view of shipping in the Mediterranean Sea. Heatmaps help show patterns and they can be accelerated with GeoMesa. [Figure labels: Heatmap Request, HeatMap WPS, Query Hints]
  • 26. A request to GeoMesa consists of two broad pieces: 1. A filter restricting the data to act on, e.g.: a. Records in Maryland with ‘Accumulo’ in the text field. b. Records during the first week of 2016. 2. A request for ‘how’ to return the data, e.g.: a. Return the full records b. Return a subset of the record (either a projection or ‘bin’ file format) c. Return a histogram d. Return a heatmap / kernel density Generally, a filter can be handled partially by selecting which ranges to scan; the remainder can be handled by an Iterator. Modifications to selected data can also be handled by a GeoMesa Iterator. GeoMesa Data Requests
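To illustrate why a heatmap request can be answered with server-side aggregation, the sketch below has each simulated tablet server reduce its points to a sparse density grid, and the client only sums the partial grids. It is a simplified model, not GeoMesa's density iterator; the function names and the grid dimensions are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def partial_density(points: Iterable[Tuple[float, float]], bbox: BBox,
                    width: int, height: int) -> Counter:
    """Count the points of one server into a sparse width x height grid."""
    xmin, ymin, xmax, ymax = bbox
    grid = Counter()
    for lon, lat in points:
        if not (xmin <= lon <= xmax and ymin <= lat <= ymax):
            continue                      # the filter part of the request
        col = min(int((lon - xmin) / (xmax - xmin) * width), width - 1)
        row = min(int((lat - ymin) / (ymax - ymin) * height), height - 1)
        grid[(col, row)] += 1
    return grid


def merge_densities(partials: Iterable[Counter]) -> Counter:
    """Client side: add up the partial grids returned by each server."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


bbox = (0.0, 0.0, 10.0, 10.0)
server_a = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0)]
server_b = [(1.0, 4.0), (11.0, 5.0)]      # the second point is outside the box
heatmap = merge_densities(partial_density(p, bbox, 2, 2)
                          for p in (server_a, server_b))
print(dict(heatmap))
```

Only the small grids cross the network, regardless of how many records matched the filter.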
  • 27. The first pass of GeoMesa iterators separated concerns into separate iterators. The GeoMesa query planner assembled a stack of iterators to achieve the desired result. Initial GeoMesa Iterator design Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
  • 28. The key benefit to having decomposed iterators is that they are easier to understand and re-mix. In terms of performance, each one needs to understand the bytes in the Key and Value. In many cases, this will lead to additional serialization/deserialization. Now, we prefer to write Iterators which handle transforming the underlying data into what the client code is expecting in one go. Second GeoMesa Iterator design
  • 29. 1. Using fewer iterators in the stack can be beneficial 2. Using lazy evaluation / deserialization for filtering Values can power speed improvements. 3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and Values. 4. Accumulo 1.8.0 has an Iterator Test Harness! Lessons learned about Iterators
  • 30. Through our use of a) space filling curves, b) a cost-based query optimizer, and c) carefully configured iterators, the GeoMesa query planner has a lot going on. The GeoMesa query explainer logs 1) which index was used, 2) which ranges were scanned, 3) Iterator configuration, etc. Putting it all together: the GeoMesa Query Explainer geomesa> geomesa explain -u USER -p PASS -i INSTANCE -c geomesa -z zoo1,zoo2,zoo3 -f AccumuloQuickStart -q "Who = 'Bierce'" Planning 'AccumuloQuickStart' Who = 'Bierce' Original filter: Who = 'Bierce' Hints: density[false] bin[false] stats[false] map-aggregate[false] sampling[none] Sort: none Transforms: None Strategy selection: Query processing took 69ms and produced 1 options Filter plan: FilterPlan[ATTRIBUTE[Who = 'Bierce'][None]] Strategy selection took 8ms for 1 options Strategy 1 of 1: AttributeIdxStrategy Strategy filter: ATTRIBUTE[Who = 'Bierce'][None] Plan: org.locationtech.geomesa.accumulo.index.BatchScanPlan Table: geomesa_attr Deduplicate: false Column Families: all Ranges (1): [%01;%00;%00;Bierce%00;::%01;%00;%00;Bierce%01;) Iterators (0): Query planning took 119ms Verify hints Inspect strategies considered See table and ranges to be scanned Quantify planning time
  • 31. ● GeoMesa + Spark Setup ● GeoMesa + Spark Analytics ● GeoMesa powered notebooks (Jupyter and Zeppelin) GeoMesa’s Spark Support: Data Analysis and Discovery
  • 35. Using Accumulo Iterators, we’ve seen how one can easily perform simple ‘MapReduce’ style jobs without needing more infrastructure. NB: Those tasks are limited. One can filter inputs, transform/map records and aggregate partial results on each tablet server. To implement more complex processes, we look to MapReduce and Spark. Accumulo implements the MapReduce InputFormat interface. Spark provides a way to turn InputFormats into RDDs. So with a little glue code and Spark classpath/environment management, GeoMesa has Spark support! GeoMesa MapReduce and Spark Support
  • 36. GeoMesa Spark Example 1: Time Series Step 1: Get an RDD[SimpleFeature] Step 2: Calculate the time series Step 3: Plot the time series in R.
  • 37. Using one dataset (country boundaries) to group another (here, GDELT) is effectively a join. Our summer intern, Atallah, worked out the details of doing this analysis in Spark and created a tutorial and blog post. This picture shows ‘stability’ of a region from GDELT Goldstein values GeoMesa Spark Example 2: Aggregating by Regions
  • 38. GeoMesa Spark Example 3: Aggregating Tweets about #traffic Virginia Polygon CQL GeoMesa RDD Aggregate by County Calculate ratio of #traffic Store back to GeoMesa
  • 39. GeoMesa Spark Example 3: Aggregating Tweets about #traffic #traffic by Virginia county Darker blue has a higher count
  • 40. Problem: Another developer came by and mentioned that his Spark job using GeoMesa had quite a few tasks (far more than expected). Around the same time, Eugene Cheipesh (Azavea / GeoTrellis) wrote in to the Accumulo user list… In Accumulo 1.6.x, each range in the Accumulo InputFormat becomes a Split. With space filling curves, it is easy to enumerate plenty of ranges for a query. Solution: The short-term solution was to create a custom InputFormat which produces Splits that each contain more than one range. A small bump in the road…
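The idea behind the fix, several ranges per split instead of one, can be sketched as a simple chunking function. This is not the custom InputFormat itself; the function name and the even chunking are assumptions, and a real implementation would probably also take tablet locations into account when grouping.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]


def group_ranges(ranges: Sequence[Range], max_splits: int) -> List[List[Range]]:
    """Pack scan ranges into at most max_splits groups (one group per task)."""
    if max_splits < 1:
        raise ValueError("max_splits must be at least 1")
    ranges = list(ranges)
    if not ranges:
        return []
    per_split = -(-len(ranges) // max_splits)    # ceiling division
    return [ranges[i:i + per_split] for i in range(0, len(ranges), per_split)]


# Ten ranges from a space-filling-curve query become three tasks instead of ten.
query_ranges = [(i * 10, i * 10 + 3) for i in range(10)]
splits = group_ranges(query_ranges, max_splits=3)
print([len(s) for s in splits])
```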
  • 42. Interactive Data Discovery at Scale in GeoMesa Notebooks Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise. A long development cycle for an analytic saps energy and creativity. The answer to both is interactive ‘notebook’ servers like Apache Zeppelin and Jupyter (formerly IPython Notebook). There are two big things to work out: 1. Getting the right libraries on the classpath. 2. Wiring up visualizations.
  • 43. Interactive Data Discovery at Scale in GeoMesa Notebooks GeoMesa Notebook Roadmap: ● Improved JavaScript integration ● D3.js and other visualization libraries ● OpenLayers and Leaflet ● Python Bindings
  • 44. Questions? Find out more at Connect with us on Gitter: a See applications at CCRi’s blog:
  • 47. GeoMesa Converter Library The Converter library is used in 1. The GeoMesa command line tools 2. GeoMesa’s NiFi processors Configurations support XML, CSV, TSV, JSON, Avro, and more! Examples are available for GeoNames, GDELT, OSM-GPX, Twitter, and others.
  • 48. Live view with the GeoMesa Kafka DataStore Q: How did you get billions of points? A: Data is streaming in continually. Examples come from IoT related applications: 10 thousand sensors reporting every 5 seconds generate 1.2 billion records in a week. In these cases, we want to see where things are right now.
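The 1.2 billion figure follows directly from the reporting rate. A short check, with a hypothetical helper name:

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604,800


def records_per_week(sensors: int, interval_seconds: int) -> int:
    """Records produced in one week if every sensor reports once per interval."""
    return sensors * (SECONDS_PER_WEEK // interval_seconds)


print(records_per_week(10_000, 5))    # 1209600000, i.e. about 1.2 billion
```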
  • 49. GeoMesa Kafka DataStore Architecture We have two issues to address: 1. In-memory index of SimpleFeatures 2. Durable message passing system For indexing, we use a combination of Guava and CQEngine (efficient Java collections). Kafka serves as the message passing system. Consumer KDSes can be run in Storm (for event processing), GeoServer (OGC access), etc.
  • 50. Space Filling Curves: The Math. Around 100 years ago, mathematicians asked the question, “Is there a continuous function from the unit interval to the unit square which covers it?” [Figure: Row-Major, Z-Order, and Hilbert curves]
  • 51. Streaming Data Architecture; Part 1 Continuous ingest: GeoMesa-NiFi leverages the GeoMesa converter library