Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Retrieval in Accumulo [Geo]

GeoWave
Geospatial Indexing on Accumulo
Eric Robertson, Rich Fecher

• Geographic Information Systems (GIS)
• GeoWave Overview
– Features
– Components
– Data Types
• The Fundamentals
– How does GeoWave organize geospatial data?
• Set of problems and solutions with Accumulo
– Deduplication
– WFS-T Transaction Isolation
– Map Occlusion Culling
– Raster Data
– Statistics
OUTLINE

• GIS Technology Explosion
– E.g. Smart Phone and GPS Applications
• Data Explosion
– Satellite Imagery, Ground Based Imagery, Aerial Photography
• Problems:
– Generate Maps: Create base image and add vector data (shapes):
• points of interest
• roads
• boundaries
– Find Features: “restaurants near you”
– Analysis: Density, Surface Analysis, Interpolation, Pattern Discovery
GIS: GEOGRAPHIC INFORMATION SYSTEM
Generated by OpenStreetMap.org

• Leverage Accumulo offerings as distributed
data store
– High-performance ingest
– Horizontally scalable
– Per-entry access constraints
• Fast geospatial retrieval
• Geo-temporal indexing
• Pre-calculated statistics:
– Counts per Data Type
– Bounding Region
– Time Range
– Numeric Range
– Histograms
FEATURES OF GEOWAVE

• Accumulo Data Store
• Hadoop Map-Reduce input/output formats
• GeoServer integration with GeoTools
• Vector and Raster Data
• Multi-Threaded Ingest Tools
• Administrative RESTful Services
– Layers and Data Stores
• Analytics
– Kernel Density
– K-means Clustering
– Sampling
– PDBScan coming soon with Apache Spark 1.2.1
• Tested Versions
– Accumulo 1.5.1, 1.6.x
– Cloudera 2.0.0-cdh4.7.0, 2.5.0-cdh5.2
– Hortonworks HDP 2.1
– Apache 2.6
– GeoTools 11.4, 12.1, 12.2
– GeoServer 2.5.2, 2.6.1
INTEGRATED COMPONENTS

• Data Structures
– Simple Feature (ISO 19125) via GeoTools (http://www.geotools.org/).
– Raster Images
– Custom
• Provided Ingest Types
– Vector Data Sources (GeoTools)
• Examples: Shapefiles, GeoJSON, PostGIS, etc.
– Grid Formats (GeoTools)
• Examples: ArcGrid, GeoTIFF, etc.
– GeoLife GPS Trajectories (http://research.microsoft.com/en-us/projects/GeoLife/)
– GPX (http://www.topografix.com/gpx.asp)
– T-Drive (http://research.microsoft.com/en-us/projects/tdrive/)
– PDAL
DATA TYPES

• Basic Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g., a bounding box).
• Accumulo: Each table is organized into blocks of sorted row identifiers.
• Revised Problem: Provide a two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries.
MAIN PROBLEM:
INDEX TWO DIMENSIONS IN A SINGLE-DIMENSION INDEX

GENERALIZED PROBLEMS
Solve the general problem first. Then apply it to geospatial-specific problems.
• Multi-dimension index supporting efficient data retrieval given a bounded set of constraints for each dimension.
• Indexed data includes scalars and intervals per dimension. For example, a range of time or a polygon.
• Index over a mix of bounded and unbounded dimensions.

Curves are constructed iteratively. The iterations produce a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit.
Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.
Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube.
FUNDAMENTAL APPROACH:
SPACE FILLING CURVES TRAVERSE N-DIMENSIONAL SPACE

Achieve optimal read performance through contiguous series of values
across two or more dimensions.
Reading 11 records over a contiguous range 23->33 is faster than reading a non-contiguous range such as 15,18,34,56-58,83,99,101-102.
Consider: A latitude and longitude range (latA, lonA) -> (latB, lonB) should map to the fewest possible ranges on the space-filling curve.
Haverkort and Walderveen[1] describe three metrics to help quantify this.
CURVE SELECTION : SEQUENTIAL IO OPTIMIZATION
Worst Case Dilation = (squared distance between p and q) / (area filled by curve between p and q)
Average Bounding Box and Worst Case Bounding Box = (area of minimum bounding rectangle (blue)) / (area filled by curve (green))

                         Z-Order  Hilbert  H-order  Peano  AR2W2  βΩ
Worst Case Dilation L∞   ∞        6        4        8      5.40   5.00
Worst Case Dilation L2   ∞        6        4        8      6.04   5.00
Worst Case Dilation L1   ∞        9        8        10.66  12.00  9.00
Worst Case Area          ∞        2.40     3.00     2.00   3.05   2.22
Average Box Area         2.86     1.41     1.69     1.42   1.47   1.40
[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2
CURVE SELECTION : LOCALITY

• Place a grid on the globe (dotted lines)
• Connect all the points on the grid with a
Hilbert SFC.
• Curve provides linear ordering over two
dimensional space.
• Bounding box is defined by the set of ranges
covered by the Hilbert SFC.
HILBERT CURVE MAPPING IN 2D: THE GLOBE
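One way to see how a Hilbert SFC gives a linear ordering over the grid is to walk the curve and list the cell visited at each position. The sketch below uses the widely published iterative index-to-cell conversion for the Hilbert curve. The names `d2xy` and `walk` are illustrative assumptions, and GeoWave's own curve implementation may orient or compute the curve differently.

```python
def d2xy(order, d):
    """Convert a position d along a Hilbert curve into a grid cell (x, y).

    order is the number of bits per dimension, so the grid is
    2**order cells on each side.
    """
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the sub-square built so far to match this quadrant.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def walk(order):
    """List every grid cell in curve order."""
    return [d2xy(order, d) for d in range(4 ** order)]
```

For a 16x16 grid, `walk(4)` visits all 256 cells exactly once, and each step moves to a neighboring cell, which is the locality property the index relies on.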

• Precision is determined by the ‘depth’ of the curve. In this example, the globe is divided into a 16x16 grid.
• Resolution is 22.5 degrees longitude and 11.25 degrees latitude per cell.
• Each elbow (discrete point) in the Hilbert SFC maps to a grid cell.
• The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer-grained cells.
HILBERT CURVE PRECISION
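The relationship between bits of precision and cell size can be checked with a little arithmetic. This is a minimal sketch assuming a plain equirectangular grid over longitude [-180, 180] and latitude [-90, 90]; the function names are hypothetical and are not GeoWave APIs.

```python
def cell_size_degrees(bits):
    """Return (longitude, latitude) degrees covered by one cell."""
    cells = 1 << bits  # cells per dimension
    return 360.0 / cells, 180.0 / cells


def lonlat_to_cell(lon, lat, bits):
    """Map a coordinate to its (x, y) grid cell, clamping the upper edge."""
    cells = 1 << bits
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return x, y
```

With 4 bits per dimension (the 16x16 grid), each cell spans 22.5 degrees of longitude and 11.25 degrees of latitude. Each additional bit halves both.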

Recursively decompose the Hilbert region to find only those covered regions that overlap the query box.
The figure depicts a third-order (2^3 = 8 “buckets” per dimension) Hilbert curve in 2D.
The decomposition forms a quad-tree view over the data.
Each pair of bits, from most significant to least significant, represents a “quadrant.”
[Figure: nested quadrants labeled 00, 01, 10, and 11 at each level.]
Hilbert Index (52) = 11 01 00
RECURSIVE DECOMPOSITION : TWO DIMENSION EXAMPLE
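The quad-tree reading of an index can be made concrete by splitting it into two-bit groups. This is a small sketch with hypothetical helper names. It only shows the bit grouping and the index range shared by a quadrant prefix, not how a particular curve lays the quadrants out in space.

```python
def quadrant_path(index, order):
    """Split a curve index into `order` two-bit groups, most significant first."""
    return [(index >> (2 * level)) & 0b11 for level in reversed(range(order))]


def format_path(index, order):
    """Render the two-bit groups as a string such as '11 01 00'."""
    return " ".join(format(q, "02b") for q in quadrant_path(index, order))


def prefix_range(prefix, order):
    """Inclusive index range of all cells whose path starts with `prefix`."""
    remaining = order - len(prefix)
    start = 0
    for q in prefix:
        start = (start << 2) | q
    start <<= 2 * remaining
    return start, start + (1 << (2 * remaining)) - 1
```

For the third-order curve, `format_path(52, 3)` gives `11 01 00`, and every index from 48 to 63 shares the top-level quadrant `11`.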

Bounding box over grid cells (2,9) (lower left) to (5,13) (upper right).
Decompose the cells intersecting the bounding box, shown in blue.
The range decomposes into three (color-coded) ranges:
• 70 -> 75
• 92 -> 99
• 116 -> 121
Note: A bounding box from a geospatial query window does not necessarily “snap” perfectly to the grid cells (e.g., 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells.
DECODE THE BOUNDING BOX: RANGE DECOMPOSITION

Here we see the query range fully decomposed into the underlying “quadrants.”
Decomposition stops when the query window fully contains the quad (see segment 3 and segment 8).
RANGE DECOMPOSITION OPTIMIZATION
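The decomposition with early stopping can be sketched as a recursive quad-tree walk. The code below is an illustration under assumptions: it uses the widely published iterative cell-to-index conversion for the Hilbert curve (`xy2d`) and relies on the fact that an aligned quad occupies one contiguous run of indices. The function names are hypothetical and this is not GeoWave's implementation.

```python
def xy2d(order, x, y):
    """Map grid cell (x, y) to its position along a Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip so the next level sees a canonically oriented quad.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def decompose(order, xmin, ymin, xmax, ymax):
    """Return sorted, merged inclusive index ranges covering a box of cells."""
    ranges = []

    def visit(x0, y0, size):
        x1, y1 = x0 + size - 1, y0 + size - 1
        if x1 < xmin or x0 > xmax or y1 < ymin or y0 > ymax:
            return  # quad is disjoint from the query box
        if xmin <= x0 and x1 <= xmax and ymin <= y0 and y1 <= ymax:
            # Query fully contains the quad: emit one range and stop.
            span = size * size
            start = xy2d(order, x0, y0) & ~(span - 1)
            ranges.append((start, start + span - 1))
            return
        half = size // 2
        for dx in (0, half):
            for dy in (0, half):
                visit(x0 + dx, y0 + dy, half)

    visit(0, 0, 1 << order)
    ranges.sort()
    merged = []
    for lo, hi in ranges:
        if merged and lo == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], hi)
        else:
            merged.append((lo, hi))
    return merged
```

With this curve orientation, `decompose(4, 2, 9, 5, 13)` on the 16x16 grid returns the three ranges 70->75, 92->99, and 116->121 for the bounding box from cells (2,9) to (5,13).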

INTERVALS: POLYGONS AND MULTI-POLYGONS
An entry is duplicated for each hyper-rectangle that the interval intersects.
The polygon covers 66 cells in the example.
Remove duplicate data for each cell: 66 duplicates.
De-duplication is applied in an Accumulo iterator as well as client-side.
Query is defined by a range per dimension
(a bounding rectangle in 2D)
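Client-side de-duplication can be pictured with a toy model in which each stored row is a (cell, feature ID) pair and a polygon appears once per intersecting cell. This is a minimal sketch with hypothetical names and an in-memory list standing in for a table scan. It does not reproduce the iterator-side filter.

```python
def scan(rows, ranges):
    """Yield (cell, feature_id) rows whose cell falls inside any query range."""
    for cell, feature_id in rows:
        if any(lo <= cell <= hi for lo, hi in ranges):
            yield cell, feature_id


def deduplicate(hits):
    """Return each feature ID once, in first-seen order."""
    seen = set()
    unique = []
    for _, feature_id in hits:
        if feature_id not in seen:
            seen.add(feature_id)
            unique.append(feature_id)
    return unique
```

A polygon stored under cells 70, 71, and 72 is returned three times by `scan` for the range 70->75, and `deduplicate` collapses those hits into a single result.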

INTERVALS: POLYGONS AND MULTI-POLYGONS
High-resolution curves force an excessive number of duplicates for large intervals.
Consider a high-resolution 2D curve (2^31 x 2^31) and a large polygon such as the Pacific Ocean.
The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries.
Solution: Tiered Indexing[8]
• Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each successively finer tier doubles the resolution in each dimension.
• Polygons are stored in the lowest tier possible that minimizes the number of duplicates.
• Example: Blue polygon indexed in tier 2; red polygon indexed in tier 3.
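Tier selection can be sketched as picking the finest tier at which a shape's bounding box still touches only a few cells. The code below is an assumption-laden illustration: coordinates are normalized to [0, 1], the cap of 4 duplicates is a parameter, and the function names are hypothetical rather than GeoWave's.

```python
def cells_covered(tier, xmin, ymin, xmax, ymax):
    """Count the tier's cells touched by a box in normalized coordinates."""
    n = 1 << tier  # the tier is a 2**tier x 2**tier grid
    x0, x1 = int(xmin * n), min(int(xmax * n), n - 1)
    y0, y1 = int(ymin * n), min(int(ymax * n), n - 1)
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def choose_tier(box, max_tier, max_duplicates=4):
    """Pick the finest tier whose cell count stays within the cap."""
    for tier in range(max_tier, -1, -1):
        if cells_covered(tier, *box) <= max_duplicates:
            return tier
    return 0  # tier 0 is a single cell covering everything
```

A small box straddling the center of the space, such as `(0.49, 0.49, 0.51, 0.51)`, touches four cells at every tier it fits in, and `choose_tier` settles on tier 6 when tiers up to 10 are available. For scale, 33% of a 2^31 x 2^31 grid is about 1.5 x 10^18 cells.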

TIERS: QUERY REGIONS WITH FALSE POSITIVES
Tiers balance an acceptable number of duplicates against false positives caused by the lower granularity of higher tiers.
Consider the query region in orange. It does not intersect either polygon. However, it does intersect the quadrants that hold each shape at its respective tier. Thus, more rows are filtered during the range scan.
Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g., 2^10), hundreds of duplicates occur.

TIERS: WORST CASE
Cap the number of duplicates by choosing an appropriate tier.
Our analysis indicates that the optimal cap on duplicates is 2^d, where d is the number of dimensions (i.e., in 2 dimensions, cap at 4).
Consider the worst case: a small square polygon centered on the inner intersecting boundary (example polygon in red).
Regardless of size, there are always four duplicates at all tiers except the 2^0 tier (the orange box), which represents the entire world.

UNBOUNDED DIMENSION: TIME
To normalize real-world values to fit on a space-filling curve, the sample space must be bounded.
Solution: Binning
• A bin represents a period for EACH
dimension. For example, a
periodicity of a year can be used
for time.
• Each bin covers its own Hilbert
space.
• Entries that contain ranges may
span multiple bins resulting in
duplicates.
• The Bin ID is part of row identifier.
1997 1998 1999
A single bin for an unbounded dimension :
[min + (period * period duration),
min + ((period+1) * period duration))
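The bin formula can be exercised directly. This is a minimal sketch assuming time is expressed as a plain number (fractional years here) with a fixed period duration; the function names are hypothetical.

```python
import math


def bin_id(value, minimum, duration):
    """Index of the bin (period) that contains `value`."""
    return math.floor((value - minimum) / duration)


def bin_bounds(period, minimum, duration):
    """Half-open interval [start, end) covered by a bin."""
    return (minimum + period * duration, minimum + (period + 1) * duration)


def bins_for_range(start, end, minimum, duration):
    """All bins touched by a range; an entry is duplicated into each one."""
    first = bin_id(start, minimum, duration)
    last = bin_id(end, minimum, duration)
    return list(range(first, last + 1))
```

With `minimum=1997` and a one-year duration, a value in mid-1998 falls in bin 1, whose bounds are [1998, 1999), and a range from late 1997 to early 1999 spans bins 0, 1, and 2.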

BIN: VARIABILITY OVER DIMENSIONS
Time
Elevation
Velocity
Each Bin is a hyper-rectangle representing ranges of data labeled by points on a Hilbert curve.
Bounded dimensions assume a single Bin.
For example, Latitude and Longitude.

THAT’S ENOUGH THEORY, LET’S APPLY IT
ACCUMULO TECHNIQUES YOU MIGHT FIND INTERESTING

[Figure: key layout labeled SFC Curve Hierarchy, Feature Type, Feature ID, Hint to Dedupe Filter, Feature Attribute Name, and From Field Visibility Handlers.]
VECTOR DATA PERSISTENCE MODEL
Column per feature identifier.
Column per feature attribute.
Types include: Geometry, Integer, Double, BigDecimal, Date, Time, String, Boolean, etc.

MAP OCCLUSION CULLING
At a given zoom level, each pixel represents a range in degrees.
When scanning the data, only one entry is needed within each pixel range. The rest of the entries can be skipped.
The block identified in red represents many data points, but is rendered by only 9 pixels.

The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel.
• Scan to the first pixel
• Seek to the beginning of the next pixel
[Figure: database data beside the displayed pixels, marking the points the rendering engine receives and the points that were skipped.]
MAP OCCLUSION CULLING: ITERATORS
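The scan-then-seek pattern can be sketched over a sorted list of keys. This toy model assumes each pixel corresponds to a contiguous key range of fixed width, which simplifies the real mapping between pixels and row IDs. The function name is hypothetical, and `bisect` stands in for an iterator seek.

```python
from bisect import bisect_left


def cull(sorted_keys, pixel_width):
    """Keep at most one key per pixel-sized range, seeking past the rest."""
    kept = []
    i = 0
    while i < len(sorted_keys):
        key = sorted_keys[i]
        kept.append(key)  # first entry found in this pixel
        next_pixel_start = (key // pixel_width + 1) * pixel_width
        # Seek: jump straight to the first key at or after the next pixel.
        i = bisect_left(sorted_keys, next_pixel_start, i + 1)
    return kept
```

For keys `[0, 1, 2, 10, 11, 25, 26, 27, 39]` and a pixel width of 10, only `[0, 10, 25, 39]` reach the renderer; the other five entries are never read.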

[Figure labels: Map Request, Layer, Style, GeoServer (GeoWave Plugin), Accumulo (GeoWave Iterators), Rendered Map, Map Response.]
DISTRIBUTED RENDERING
Each scan result is an image with the data in the range.
All resultant images are composited together.

DISTRIBUTED RENDERING WITH OCCLUSION CULLING

[Figure: key layout labeled SFC Curve Hierarchy and Coverage Name, with the value holding Image Data Buffer + Image Metadata.]
RASTER DATA PERSISTENCE MODEL
SFC value is effectively a tile ID.
Tiles are unique; ignore duplication.
Coverage Name: unique name for global coverage.
Image metadata is customizable. The default is to store “no data” values, but this can be customized.

RASTER DATA: GRID COVERAGE
Tiled, each “cell” fit to boundary.
“No Data” values must be maintained.
Multi-band, more than just RGB.

Histogram Equalization [10]
Image Pyramid [11]
Tile Merge Strategy: f(f(t1, t2), t3) = final, for tiles t1, t2, t3, ..., tn
Value: Image Data Buffer + Metadata
Custom data per tile, in scope for f(x)
RASTER DATA: ADVANCED OPTIONS
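The nested form f(f(t1, t2), t3) is a left fold over the tiles that share a key. The sketch below assumes one simple merge rule (newer values win unless they are "no data") and represents a tile as a flat list. Both the rule and the names are illustrative, since the merge strategy is described as customizable.

```python
from functools import reduce

NO_DATA = None  # hypothetical marker for a cell without a value


def merge_tiles(older, newer):
    """Prefer the newer tile's value unless that cell holds no data."""
    return [o if n is NO_DATA else n for o, n in zip(older, newer)]


def merge_all(tiles):
    """Fold the merge function over the tiles: f(f(t1, t2), t3) ..."""
    return reduce(merge_tiles, tiles)
```

Merging `[1, None, 3]`, `[None, 5, None]`, and `[7, None, None]` in that order yields `[7, 5, 3]`: each cell keeps the most recent real value.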

STATISTICS: STRUCTURE
Statistics infrastructure supports summary data.
Currently, each row ID includes adapter ID and a statistics ID.
Current statistics types include population bounding boxes, counts and ranges.
Key:
– Row ID: Adapter ID, Statistic ID (Attribute Name & Statistic Type)
– Column Family: “STATS”
– Column Qualifier
– Column Visibility: Matches represented data
– Time
Value

STATISTICS: COMBINER
Statistic ID  Value  Adapter ID  Family   Visibility
“Count”       30     0xA43E      “STATS”  A&B
“Count”       60     0xA43E      “STATS”  A&C
MERGE
Time: 2, 4, 7, 9
BBOX: Grow Envelope to Minimum and Maximum corners.
RANGE: Minimum and Maximum
HISTOGRAM: Update bins from coverage over raster image
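The merge rules for the simpler statistic types can be written as small pure functions. This is a sketch of the merge logic only, with hypothetical names and tuple layouts. In Accumulo such logic would run inside a combiner iterator over the values that share a key.

```python
def merge_count(counts):
    """COUNT: add the partial counts together."""
    return sum(counts)


def merge_range(ranges):
    """RANGE: keep the smallest minimum and the largest maximum."""
    lows, highs = zip(*ranges)
    return min(lows), max(highs)


def merge_bbox(boxes):
    """BBOX: grow the envelope to cover every (minx, miny, maxx, maxy) box."""
    min_xs, min_ys, max_xs, max_ys = zip(*boxes)
    return min(min_xs), min(min_ys), max(max_xs), max(max_ys)
```

Because each rule is associative and commutative, partial results can be merged in any order and at any time, for example during compaction or at scan time.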

STATISTICS: TRANSFORMATION ITERATOR
Statistic ID  Value  Adapter ID  Family   Visibility
“Count”       60     0xA43E      “STATS”  A&C
“Count”       110    0xA43E      “STATS”  A&B&C
MERGE
Time: 9, 4, 9
Query authorization may authorize multiple rows.
Query with authorization A, B & C
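Merging statistics across visibilities at query time can be pictured as follows. This is a simplified sketch: it handles only `&`-joined visibility labels (real Accumulo visibility expressions also allow `|` and parentheses), the row layout is a hypothetical (visibility, count) pair, and the numbers are made up for illustration.

```python
def visible(visibility, auths):
    """True when every '&'-joined label in the expression is authorized."""
    return all(label in auths for label in visibility.split("&"))


def merged_count(rows, auths):
    """Sum the counts of all statistic rows the caller is allowed to read."""
    return sum(count for visibility, count in rows if visible(visibility, auths))
```

Given rows `("A&B", 5)` and `("A&C", 7)`, a query with authorizations A, B, and C sees both rows and gets a combined count of 12, while a query with only A and B gets 5.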

WFS-T[12] TRANSACTIONS: ISOLATION
• Problem: Isolation of updates and new records until commit.
• Solution:
– Use a managed set of transaction identifiers as authorization tags. A single transaction
places an authorization tag in all new entries.
– Upon commit, the authorization tag is removed using a transforming iterator.
Commit: Role1, role2, tx123 -> Role1, role2
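The isolation scheme can be modeled with an in-memory store in which every entry carries a set of required labels. This is a toy sketch with a hypothetical `Store` class. It mimics the idea of adding a transaction tag on write and stripping it on commit, not the transforming iterator itself.

```python
class Store:
    """Toy table: each entry needs all of its labels to be readable."""

    def __init__(self):
        self.entries = []  # list of (value, set_of_required_labels)

    def write(self, value, roles, tx_id):
        # New entries require the transaction tag in addition to the roles.
        self.entries.append((value, set(roles) | {tx_id}))

    def commit(self, tx_id):
        # Removing the tag makes the entries visible to the roles alone.
        self.entries = [(v, labels - {tx_id}) for v, labels in self.entries]

    def scan(self, auths):
        return [v for v, labels in self.entries if labels <= set(auths)]
```

Before commit, a reader holding only `role1` and `role2` sees nothing from transaction `tx123`, while the transaction itself (which also holds `tx123`) sees its own writes. After `commit("tx123")`, the same reader sees the new entries.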

SO WHAT?
EYE-CANDY YOU’VE BEEN WAITING FOR

Microsoft GeoLife
Microsoft Research has made available a trajectory data set that contains the GPS coordinates of 182 users over a period of more than five years (April 2007 to August 2012).
There are 17,621 trajectories in this data set, with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours, recorded by GPS loggers and GPS phones, often sampling every 1-5 seconds or every 5-10 meters.
http://research.microsoft.com/jump/131675

Let’s bring out some detail:
Kernel Density Estimate (Gaussian Kernel)

OSM – Planet GPX dump
Every track ever uploaded to OpenStreetMap
Complete data attribution
2.9 billion spatial entities (points)
https://blog.openstreetmap.org/2013/04/12/bulk-gpx-track-data/

Level 0 Overview (all the points!)

Let’s bring out some detail again:
Kernel Density Estimate (Gaussian Kernel)

Let’s zoom a bit – and try some different styling
options

[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2
[2] Hamilton, Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. 2008. Information Processing Letters 105 (155-163)
[3] Hayes. Crinkly Curves. 2013. American Scientist 100-3 (178). DOI: 10.1511/2013.102.1
[4] Skilling. Programming the Hilbert Curve. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings. 2004. American Institute of Physics 0-7354-0182-9/04
[5] Wikipedia. Well-known_binary. http://en.Wikipedia.org/wiki/Well-known_binary 2013
[6] Wikipedia. Hilbert curve. http://en.wikipedia.org/wiki/Hilbert_curve 2013
[7] Aioanei. Uzaygezen: Compact Hilbert Index implementation in Java. http://code.google.com/p/uzaygezen/ 2008 Google Inc.
[8] Surratt, Boyd, Russelavage. Z-Value Curve Index Evaluation. 2012. Internal Presentation.
[9] Open Geospatial Consortium. Standard List. http://www.opengeospatial.org/standards/is
[10] Remote Sensed Image Processing on Grids for Training in Earth Observation. http://www.intechopen.com/source/html/6674/media/image3.jpeg
[11] OSGeo Wiki. http://wiki.osgeo.org/images/thumb/d/d0/Pyramid.jpg/286px-Pyramid.jpg
[12] WFS-T. http://www.opengeospatial.org/standards/wfs
BIBLIOGRAPHY
