GeoWave
Geospatial Indexing on Accumulo
Eric Robertson Rich Fecher
• Geographic Information Systems (GIS)
• GeoWave Overview
– Features
– Components
– Data Types
• The Fundamentals
– How does GeoWave organize geospatial data?
• Set of problems and solutions with Accumulo
– Deduplication
– WFS-T Transaction Isolation
– Map Occlusion Culling
– Raster Data
– Statistics
OUTLINE
• GIS Technology Explosion
– E.g. Smart Phone and GPS Applications
• Data Explosion
– Satellite Imagery, Ground Based Imagery,
Aerial Photography
• Problems:
– Generate Maps: Create base image and
add vector data (shapes):
• points of interest
• roads
• boundaries
– Find Features
“restaurants near you”
– Analysis
Density, Surface Analysis, Interpolation,
Pattern Discovery
GIS: GEOGRAPHIC INFORMATION SYSTEM
Generated by OpenStreetMap.org
• Leverage Accumulo offerings as distributed
data store
– High-performance ingest
– Horizontally scalable
– Per-entry access constraints
• Fast geospatial retrieval
• Geo-temporal indexing
• Pre-calculated statistics:
– Counts per Data Type
– Bounding Region
– Time Range
– Numeric Range
– Histograms
FEATURES OF GEOWAVE
Accumulo 1.5.1, 1.6.x
Cloudera 2.0.0-cdh4.7.0, 2.5.0-cdh5.2
Hortonworks HDP 2.1
Apache 2.6
GeoTools 11.4, 12.1, 12.2
GeoServer 2.5.2, 2.6.1
Accumulo Data Store
Hadoop Map-Reduce input/output formats
GeoServer integration with GeoTools
Vector and Raster Data
Multi-Threaded Ingest Tools
Administrative RESTful Services
Layers and Data Stores
Analytics
Kernel Density
K-means Clustering
Sampling
INTEGRATED COMPONENTS
Tested Versions
PDBScan coming soon with Apache Spark 1.2.1
• Data Structures
– Simple Feature (ISO 19125) via GeoTools (http://www.geotools.org/).
– Raster Images
– Custom
• Provided Ingest Types
– Vector Data Sources (GeoTools)
• Examples: Shapefiles, GeoJSON, PostGIS, etc.
– Grid Formats (GeoTools)
• Examples: ArcGrid, GeoTIFF, etc.
– GeoLife GPS Trajectories (http://research.microsoft.com/en-us/projects/GeoLife/)
– GPX (http://www.topografix.com/gpx.asp)
– T-Drive (http://research.microsoft.com/en-us/projects/tdrive/)
– PDAL
DATA TYPES
• Basic Problem: Efficiently
locate and retrieve vectors or
tiles intersecting a polygon (e.g.,
a bounding box).
• Accumulo: Each table
organized into blocks of sorted
row identifiers.
• Revised Problem: Two-way
mapping between multiple
dimensions and a single
dimension row ID to support
location efficient storage and
retrieval of vectors or tiles
given constraints in terms of
multi-dimensional boundaries.
MAIN PROBLEM:
INDEX TWO DIMENSIONS IN A SINGLE-DIMENSION INDEX
GENERALIZED PROBLEMS
Solve the general problem first. Then apply to Geospatial specific problems.
• Multi-Dimension Index supporting efficient data retrieval given a bounded
set of constraints for each dimension.
• Indexed data includes scalars and intervals per dimension.
For example, a range of time or a polygon.
• Index over a mix of bounded and unbounded dimensions.
Curves are constructed iteratively. Each iteration produces a sequence of piecewise
linear continuous curves, each one more closely approximating the space-filling limit.
Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.
Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube.
FUNDAMENTAL APPROACH:
SPACE FILLING CURVES TRAVERSE N-DIMENSIONAL SPACE
Achieve optimal read performance through contiguous series of values
across two or more dimensions.
Reading 11 records over a contiguous range 23->33 is faster than reading a non-contiguous
set of ranges such as 15, 18, 34, 56-58, 83, 99, 101-102.
Consider: a query window defined by the range (latA, lonA) -> (latB, lonB) should
map to the fewest possible ranges on the space filling curve.
Haverkort and Walderveen[1] describe 3 metrics to help quantify this.
CURVE SELECTION : SEQUENTIAL IO OPTIMIZATION
Worst Case Dilation = (squared distance between p and q) / (area filled by curve between p and q)
Average Bounding Box and Worst Case Bounding Box = (area of minimum bounding rectangle (blue)) / (area filled by curve (green))

                          Z-Order  Hilbert  H-order  Peano  AR2W2  βΩ
Worst Case Dilation, L∞   ∞        6        4        8      5.40   5.00
Worst Case Dilation, L2   ∞        6        4        8      6.04   5.00
Worst Case Dilation, L1   ∞        9        8        10.66  12.00  9.00
Worst Case Area           ∞        2.40     3.00     2.00   3.05   2.22
Average Box Area          2.86     1.41     1.69     1.42   1.47   1.40
[1] Haverkort, Walderveen Locality and Bounding-Box Quality of Two-
Dimensional Space-Filling Curves 2008 arXiv:0806.4787v2
CURVE SELECTION : LOCALITY
• Place a grid on the globe (dotted lines)
• Connect all the points on the grid with a
Hilbert SFC.
• Curve provides linear ordering over two
dimensional space.
• Bounding box is defined by the set of ranges
covered by the Hilbert SFC.
HILBERT CURVE MAPPING IN 2D: THE GLOBE
• Precision is determined by the ‘depth’ of the
curve. In this example the globe is defined by a
16 x 16 grid.
• Resolution is 22.5 degrees longitude and
11.25 degrees latitude per cell.
• Each elbow (discrete point) in the Hilbert
SFC maps to a grid cell.
• The precision of the Hilbert SFC, defined in
terms of the number of bits, determines the
grid. Thus, more bits equate to finer-grained
cells.
HILBERT CURVE PRECISION
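The grid arithmetic can be checked with a small sketch. The function name `cell_size_degrees` is hypothetical and not part of GeoWave; the sketch assumes longitude spans 360 degrees and latitude spans 180 degrees, so a 16 x 16 grid (4 bits per dimension) yields cells of 22.5 degrees of longitude by 11.25 degrees of latitude.

```python
def cell_size_degrees(bits_per_dimension):
    """Return (longitude_step, latitude_step) in degrees for a grid that has
    2**bits_per_dimension cells along each axis.

    Assumption: longitude covers 360 degrees and latitude covers 180 degrees.
    """
    cells = 2 ** bits_per_dimension
    return 360.0 / cells, 180.0 / cells


# 4 bits per dimension gives the 16 x 16 grid used in the example.
lon_step, lat_step = cell_size_degrees(4)
print(lon_step, lat_step)
```

Each additional bit per dimension halves both step sizes.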
Recursively decompose the Hilbert
region to find only those covered
regions that overlap the query box.
The figure depicts a third order (2^3
“buckets” per dimension) Hilbert
curve in 2D.
Forms a quad-tree view over the
data.
Each two bits, from most significant
to least represents a “quadrant.”
[Figure: quadrants labeled 00, 01, 10, 11 at each level of the decomposition]
Hilbert Index (52) = 11 01 00
RECURSIVE DECOMPOSITION : TWO DIMENSION EXAMPLE
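To make the mapping between grid cells and curve positions concrete, here is a minimal sketch of the widely published iterative Hilbert conversion for a 2D grid. It is not taken from GeoWave (the bibliography points to the Uzaygezen library for that); the function names are hypothetical, and the curve orientation may differ from the one drawn in the figure. `quadrant_path` simply splits an index into the two-bit groups described above.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate or flip a quadrant so the sub-curve is oriented correctly."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def xy_to_hilbert(order, x, y):
    """Map cell (x, y) on a 2**order x 2**order grid to its curve index."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


def hilbert_to_xy(order, d):
    """Inverse mapping: curve index back to cell (x, y)."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def quadrant_path(order, d):
    """Split an index into two-bit groups, most significant first."""
    bits = format(d, "0{}b".format(2 * order))
    return [bits[i:i + 2] for i in range(0, len(bits), 2)]


print(quadrant_path(3, 52))  # the two-bit groups for index 52
print(hilbert_to_xy(3, 52))
```

Consecutive indices always land in neighboring cells, which is the locality property that makes range scans over the curve worthwhile.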
Bounding box over grid cells (2,9) (lower left) to (5,13) (upper right)
Decompose cells intersecting
bounding box as shown in the blue.
Range decomposes to three (color
coded) ranges –
• 70 -> 75
• 92 -> 99
• 116 -> 121
Note: Bounding box from a geospatial
query window does not necessarily
“snap” perfectly to the grid cells. (e.g.
6.2, 8.8 instead of 6, 9). The bounding
box is expanded to encompass all
intersecting cells.
DECODE THE BOUNDING BOX: RANGE DECOMPOSITION
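The idea of turning a bounding box into a handful of contiguous index ranges can be sketched by brute force: enumerate every cell in the box, compute its curve index, and merge consecutive indices. This is only an illustration under assumptions. The names are hypothetical, a real implementation would decompose recursively rather than visit every cell, and because the curve orientation here may differ from the figure, the resulting ranges need not match the three listed above.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate or flip a quadrant so the sub-curve is oriented correctly."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def xy_to_hilbert(order, x, y):
    """Map cell (x, y) on a 2**order x 2**order grid to its curve index."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


def bounding_box_ranges(order, x_min, y_min, x_max, y_max):
    """Return sorted (start, end) index ranges, inclusive, covering the box."""
    indices = sorted(
        xy_to_hilbert(order, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    )
    ranges = []
    start = prev = indices[0]
    for d in indices[1:]:
        if d != prev + 1:          # gap on the curve: close the current range
            ranges.append((start, prev))
            start = d
        prev = d
    ranges.append((start, prev))
    return ranges


# Box from cell (2, 9) to cell (5, 13) on a 16 x 16 grid (order 4).
ranges = bounding_box_ranges(4, 2, 9, 5, 13)
print(ranges)
```

Each tuple corresponds to one range scan; fewer, longer ranges mean more sequential reads.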
Here we see the query
range fully decomposed
into the underlying
“quadrants.”
Decomposition stops
when the query window
fully contains the quad.
(See segment 3 and
segment 8)
RANGE DECOMPOSITION OPTIMIZATION
INTERVALS: POLYGONS AND MULTI-POLYGON
Duplicate entry for each intersecting hyper-rectangle over the interval.
Polygon covers 66 cells in the example
Remove duplicate data for each cell – 66
duplicates.
De-Duplication is applied in Accumulo
Iterator as well as client-side.
Query is defined by a range per dimension
(a bounding rectangle in 2D)
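Client-side de-duplication can be pictured as follows. This is a simplified sketch, not the GeoWave filter: it assumes each scanned row carries a feature ID, and the tuple layout and the name `deduplicate` are made up for illustration.

```python
def deduplicate(rows):
    """Yield each feature once, keeping the first row seen for its ID.

    Each row is assumed to be a (row_id, feature_id, feature) tuple.
    """
    seen = set()
    for _row_id, feature_id, feature in rows:
        if feature_id in seen:
            continue
        seen.add(feature_id)
        yield feature_id, feature


# A polygon stored under three cells of the curve, plus one point.
rows = [
    (70, "poly-1", "POLYGON A"),
    (71, "poly-1", "POLYGON A"),
    (72, "pt-9", "POINT B"),
    (92, "poly-1", "POLYGON A"),
]
unique = list(deduplicate(rows))
print(unique)
```

Tracking seen IDs costs memory proportional to the number of distinct features returned, which is one reason to filter on the server side as well.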
INTERVALS: POLYGONS AND MULTI-POLYGONS
High resolution curves force an excessive number of duplicates for large intervals.
Consider a high resolution 2D curve (2^31 x 2^31) and a large polygon such as the Pacific Ocean.
The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion
duplicate entries.
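The order of magnitude can be reproduced with a few lines of arithmetic. The sketch assumes 31 bits per dimension and a coverage fraction of 0.33, both taken from the figures above; the variable names are illustrative.

```python
BITS_PER_DIMENSION = 31
COVERED_FRACTION = 0.33          # approximate share of the grid covered

# A 2^31 x 2^31 grid has 2^62 cells in total.
total_cells = (2 ** BITS_PER_DIMENSION) ** 2

# One duplicate entry per covered cell.
duplicates = COVERED_FRACTION * total_cells
print("{:.2e}".format(duplicates))
```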
Solution: Tiered Indexing[8]
• Each tier has a resolution of 2^n x 2^n, where n
is the tier number. Thus, each lower tier
doubles the resolution in each dimension.
• Polygons are stored in the lowest tier
possible that minimizes the number of
duplicates.
• Example: Blue polygon indexed in tier 2;
Red polygon indexed in tier 3.
TIERS: QUERY REGIONS WITH FALSE POSITIVES
Balance between an acceptable amount of duplicates and false positives due to
lower granularity of higher tiers.
Consider a query region in orange. It
does not intersect either polygon.
However, it does intersect shared
quadrants at the respective tiers for
both shapes. Thus, more rows are
filtered during the range scan.
Without tiers, using a higher resolution,
this false positive does not occur.
However, consider that, for a resolution
of 10 (e.g. 2^10), hundreds of duplicates
occur.
TIERS: WORST CASE
Cap the amount of duplicates by choosing an appropriate tier.
Our analysis indicates that an optimal
number of duplicates is represented by 2^d,
where d is the number of dimensions (i.e.,
in 2 dimensions, cap at 4).
Consider the worst case, a small square
polygon centered on the inner intersecting
boundary (example polygon in red).
Regardless of size, there are always four
duplicates at all tiers except the 2^0 tier
(the orange box, representing the entire
world).
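One way to picture tier selection is the sketch below. It is an assumption-laden illustration rather than GeoWave's algorithm: coordinates are normalized to the unit square, tier n has 2^n cells per dimension, only the bounding box of the shape is considered, and the names `cells_overlapped` and `choose_tier` are hypothetical. It picks the finest tier whose cell count stays within the cap of 4.

```python
def cells_overlapped(tier, x_min, y_min, x_max, y_max):
    """Count the tier cells touched by a box given in [0, 1] coordinates."""
    n = 2 ** tier

    def span(lo, hi):
        first = min(int(lo * n), n - 1)
        last = min(int(hi * n), n - 1)
        return last - first + 1

    return span(x_min, x_max) * span(y_min, y_max)


def choose_tier(x_min, y_min, x_max, y_max, max_tier=31, max_duplicates=4):
    """Return the finest tier that keeps duplicates within the cap."""
    for tier in range(max_tier, -1, -1):
        if cells_overlapped(tier, x_min, y_min, x_max, y_max) <= max_duplicates:
            return tier
    return 0


# Worst case: a small square centered on the middle of the space, so it
# straddles a cell boundary in both dimensions at every tier above 0.
box = (0.499, 0.499, 0.501, 0.501)
print([cells_overlapped(t, *box) for t in range(0, 10)])
print(choose_tier(*box))
```

For this box, every tier coarser than the square itself reports four cells, and only tier 0 (the whole world) reports one.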
UNBOUNDED DIMENSION: TIME
To normalize real-world values to fit on a space filling curve, the sample space must be
bounded.
Solution: Binning
• A bin represents a period for EACH
dimension. For example, a
periodicity of a year can be used
for time.
• Each bin covers its own Hilbert
space.
• Entries that contain ranges may
span multiple bins resulting in
duplicates.
• The Bin ID is part of row identifier.
[Figure: consecutive yearly bins labeled 1997, 1998, 1999]
A single bin for an unbounded dimension:
[min + (period * period duration),
min + ((period + 1) * period duration))
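The bin formula can be exercised with a short sketch. It assumes a fixed period duration (real calendar years vary in length), and the function names are illustrative, not GeoWave API.

```python
import math


def bin_for(value, minimum, period_duration):
    """Return (period, lower, upper) for the half-open bin [lower, upper)
    that contains value."""
    period = math.floor((value - minimum) / period_duration)
    lower = minimum + period * period_duration
    upper = minimum + (period + 1) * period_duration
    return period, lower, upper


def bins_for_range(start, end, minimum, period_duration):
    """Return every period touched by the interval [start, end].

    An entry spanning several bins is stored once per bin, which is where
    the duplicates mentioned above come from.
    """
    first = bin_for(start, minimum, period_duration)[0]
    last = bin_for(end, minimum, period_duration)[0]
    return list(range(first, last + 1))


# Time measured in fractional years, with yearly bins starting at 1997.
print(bin_for(1998.5, 1997, 1))
print(bins_for_range(1997.8, 1999.1, 1997, 1))
```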
BIN: VARIABILITY OVER DIMENSIONS
Time
Elevation
Velocity
Each Bin is a
hyper-rectangle
representing
ranges of data
labeled by points
on a Hilbert
curve.
Bounded dimensions assume a single Bin.
For example, Latitude and Longitude.
THAT’S ENOUGH THEORY, LET’S APPLY IT
ACCUMULO TECHNIQUES YOU MIGHT FIND INTERESTING
[Diagram labels: SFC Curve Hierarchy; Feature Type; Feature ID; Hint to Dedupe Filter; From Field Visibility Handlers]
VECTOR DATA PERSISTENCE MODEL
Column per feature identifier.
Column per each feature attribute.
Types include:
Geometry
Integer
Double
BigDecimal
Date
Time
String
Boolean
etc.
Feature Attribute Name
MAP OCCLUSION CULLING
At a given zoom level, each pixel represents a range in degrees.
When scanning the data, only one entry is needed within each pixel range. The rest of the
entries can be skipped.
The block identified in red represents many data points, but is rendered
by only 9 pixels.
[Figure: numbered pixels (1 to 4) mapped onto ranges of the database data]
The Accumulo iterator starts at the first pixel,
scans until it hits a geometry, then skips to
the next pixel.
Scan to the first pixel
Seek to the beginning of the
next pixel
The rendering engine receives only
these points
Points that were all skipped.
MAP OCCLUSION CULLING: ITERATORS
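The scan-then-seek behavior can be simulated outside Accumulo. The sketch below is a simplification under stated assumptions: keys are plain integers in sorted order, each pixel is treated as one contiguous key range of width `pixel_width`, and `cull_scan` is a hypothetical name rather than a GeoWave or Accumulo API. A binary search stands in for the iterator's seek.

```python
from bisect import bisect_left


def cull_scan(sorted_keys, pixel_width):
    """Return one key per pixel-sized key range, skipping the rest."""
    kept = []
    i = 0
    while i < len(sorted_keys):
        key = sorted_keys[i]                 # first entry hit in this pixel
        kept.append(key)
        pixel = key // pixel_width
        next_pixel_start = (pixel + 1) * pixel_width
        # Seek: jump straight to the first key at or beyond the next pixel.
        i = bisect_left(sorted_keys, next_pixel_start, i + 1)
    return kept


keys = [0, 1, 2, 3, 10, 11, 25, 26, 27, 40]
rendered = cull_scan(keys, 10)
print(rendered)
```

Only the kept keys would be returned to the rendering engine; every other entry in an already covered pixel is never read by the client.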
Displayed
Pixels
GeoServer
(GeoWave Plugin)
DISTRIBUTED RENDERING
Map Request
Map Response
Layer
Style
Accumulo (GeoWave Iterators)
Rendered Map
Each scan result is an image with the data in the range.
All resultant images are composited together.
DISTRIBUTED RENDERING WITH OCCLUSION CULLING
SFC Curve Hierarchy
SFC Value is effectively a Tile ID
Coverage Name
RASTER DATA PERSISTENCE MODEL
Image Data Buffer + Image Metadata
Image Metadata is customizable. Default is to store “no data” values, but can be customized.
Tiles are unique, ignore duplication.
Unique name for global coverage.
RASTER DATA: GRID COVERAGE
Tiled, each “cell” fit to boundary.
“No Data” values must be maintained.
Multi-band, more than just RGB.
Histogram Equalization [10]
Image Pyramid [11]
Tile Merge Strategy
f(f(t1, t2), t3) = final, applied over tiles t1, t2, t3, ..., tn
Each tile value holds an Image Data Buffer and Metadata.
Metadata: custom data per tile, in scope for f(x)
RASTER DATA: ADVANCED OPTIONS
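A tile merge strategy of the form f(f(t1, t2), t3) is a left fold over the tiles that share a tile ID. The sketch below illustrates that shape only: tiles are modeled as flat lists, `None` stands in for a "no data" value, and the rule in `merge_tiles` (keep the existing cell unless it is empty) is one assumed strategy among many, not necessarily GeoWave's default.

```python
from functools import reduce

NO_DATA = None  # stand-in for a "no data" cell value


def merge_tiles(base, overlay):
    """Keep each base cell unless it holds no data, then take the overlay."""
    return [b if b is not NO_DATA else o for b, o in zip(base, overlay)]


t1 = [1, NO_DATA, NO_DATA, 4]
t2 = [9, 2, NO_DATA, 9]
t3 = [9, 9, 3, 9]

# reduce applies merge_tiles pairwise: f(f(t1, t2), t3)
final = reduce(merge_tiles, [t1, t2, t3])
print(final)
```

Because the fold is applied pairwise, tiles can be merged as they arrive instead of being held in memory all at once.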
STATISTICS: STRUCTURE
Statistics infrastructure supports summary data.
Currently, each row ID includes adapter ID and a statistics ID.
Current statistics types include population bounding boxes, counts and ranges.
Key
  Row ID: Adapter ID, Statistic ID (Attribute Name & Statistic Type)
  Column: Family (“STATS”), Qualifier, Visibility (matches represented data)
  Time
Value
STATISTICS: COMBINER
Statistic ID  Value  Adapter ID  Family   Qualifier  Visibility  Time
“Count”       30     0xA43E      “STATS”             A&B         2
“Count”       60     0xA43E      “STATS”             A&C         4
“Count”       20     0xA43E      “STATS”             A&B         7
“Count”       50     0xA43E      “STATS”             A&B         9
MERGE
BBOX: Grow Envelope to Minimum and Maximum corners.
RANGE: Minimum and Maximum
HISTOGRAM: Update bins from coverage over raster image
STATISTICS: TRANSFORMATION ITERATOR
Statistic ID  Value  Adapter ID  Family   Qualifier  Visibility  Time
“Count”       50     0xA43E      “STATS”             A&B         9
“Count”       60     0xA43E      “STATS”             A&C         4
“Count”       110    0xA43E      “STATS”             A&B&C       9
MERGE
Query authorization may authorize multiple rows.
Query with authorizations A, B and C
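The merge across visibilities can be sketched in plain Python. This is an illustration of the idea, not the GeoWave transformation iterator: each visibility expression is modeled as a set of labels that must all be held, the dictionary layout is invented, and `merge_counts` is a hypothetical name. The sample rows use the counts 50 (A&B) and 60 (A&C) from the example.

```python
def merge_counts(rows, authorizations):
    """Merge every count row the caller is authorized to read.

    A row is readable when all of its visibility labels are among the
    caller's authorizations. The merged row carries the union of the
    labels and the latest timestamp.
    """
    visible = [r for r in rows if r["visibility"] <= authorizations]
    if not visible:
        return None
    return {
        "statistic": "Count",
        "value": sum(r["value"] for r in visible),
        "visibility": set().union(*(r["visibility"] for r in visible)),
        "time": max(r["time"] for r in visible),
    }


rows = [
    {"value": 50, "visibility": {"A", "B"}, "time": 9},
    {"value": 60, "visibility": {"A", "C"}, "time": 4},
]
merged = merge_counts(rows, {"A", "B", "C"})
print(merged["value"], sorted(merged["visibility"]), merged["time"])
```

A caller holding only A and B would see just the first row, so the same stored statistics yield different totals for different authorizations.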
WFS-T[12] TRANSACTIONS: ISOLATION
• Problem: Isolation of updates and new records until commit.
• Solution:
– Use a managed set of transaction identifiers as authorization tags. A single transaction
places an authorization tag in all new entries.
– Upon commit, the authorization tag is removed using a transforming iterator.
Before commit: Role1, role2, tx123
After commit: Role1, role2
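The isolation scheme can be modeled with label sets. The sketch below is a toy model under assumptions, not Accumulo or GeoWave code: an entry is readable when the reader holds every label on it, the transaction ID `tx123` comes from the example above, and the names `can_read` and `commit` are hypothetical. In the real system the label removal is done by a transforming iterator.

```python
def can_read(entry, authorizations):
    """An entry is readable when the reader holds all of its labels."""
    return entry["labels"] <= authorizations


def commit(entries, tx_id):
    """Return copies of the entries with the transaction tag removed."""
    return [{**e, "labels": e["labels"] - {tx_id}} for e in entries]


entries = [{"id": "feature-1", "labels": {"role1", "role2", "tx123"}}]

reader = {"role1", "role2"}            # an ordinary user
writer = {"role1", "role2", "tx123"}   # the session that owns the transaction

print(can_read(entries[0], reader))    # uncommitted entry, ordinary user
print(can_read(entries[0], writer))    # uncommitted entry, owning session

committed = commit(entries, "tx123")
print(can_read(committed[0], reader))  # same entry after commit
```

Until commit, only the session that was granted the transaction tag can see its own writes.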
SO WHAT?
EYE-CANDY YOU’VE BEEN WAITING FOR
Microsoft GeoLife
Microsoft Research has made available a trajectory data set that contains the GPS
coordinates of 182 users over a three-year period (April 2007 to August 2012).
There are 17,621 trajectories in this data set with a total distance of about 1.2
million kilometers and a total duration of 48,000+ hours recorded by GPS loggers and
GPS phones often sampling every 1-5 seconds or every 5-10 meters.
http://research.microsoft.com/jump/131675
GeoLife – Just the tracks
Let’s bring out some detail –
Kernel Density Estimate (Gaussian Kernel)
Let’s zoom in a bit
Density estimate again
OSM – Planet GPX dump
Every track ever uploaded to OpenStreetMap
Complete data attribution
2.9 billion spatial entities (points)
https://blog.openstreetmap.org/2013/04/12/bulk-gpx-track-data/
Level 0 Overview (all the points!)
Let’s go deeper..
Let’s bring out some detail again –
Kernel Density Estimate (Gaussian Kernel)
Let’s zoom a bit – and try some different styling
options
Questions?
[1] Haverkort, Walderveen Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves 2008
arXiv:0806.4787v2
[2] Hamilton, Rau-Chaplin Compact Hilbert indices: Space-filling curves for domains with unequal side lengths 2008
Information Processing Letters 105 (155-163)
[3] Hayes Crinkly Curves 2013 American Scientist 100-3 (178). DOI: 10.1511/2013.102.1
[4] Skilling Programming the Hilbert Curve Bayesian Inference and Maximum Entropy Methods in Science and Engineering:
23rd Workshop Proceedings. 2004. American Institute of Physics 0-7354-0182-9/04
[5] Wikipedia Well-known_binary http://en.Wikipedia.org/wiki/Well-known_binary 2013
[6] Wikipedia Hilbert curve http://en.wikipedia.org/wiki/Hilbert_curve 2013
[7] Aioanei Uzaygezen–Compact Hilbert Index implementation in Java http://code.google.com/p/uzaygezen/ 2008 Google Inc.
[8] Surratt, Boyd, Russelavage Z-Value Curve Index Evaluation 2012 Internal Presentation.
[9] Open Geospatial Consortium Standard List http://www.opengeospatial.org/standards/is
[10] Remote Sensed Image Processing on Grids for Training in Earth Observation
http://www.intechopen.com/source/html/6674/media/image3.jpeg
[11] OSGeo Wiki http://wiki.osgeo.org/images/thumb/d/d0/Pyramid.jpg/286px-Pyramid.jpg
[12] WFS-T (http://www.opengeospatial.org/standards/wfs )
BIBLIOGRAPHY

Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Retrieval in Accumulo [Geo]

  • 1. GeoWave Geospatial Indexing on Accumulo Eric Robertson Rich Fecher
  • 2. • Geographic Information Systems (GIS) • GeoWave Overview – Features – Components – Data Types • The Fundamentals – How does GeoWave organize geospatial data? • Set of problems and solutions with Accumulo – Deduplication – WFS-T Transaction Isolation – Map Occlusion Culling – Raster Data – Statistics OUTLINE
  • 3. • GIS Technology Explosion – E.g. Smart Phone and GPS Applications • Data Explosion – Satellite Imagery, Ground Based Imagery, Aerial Photography • Problems: – Generate Maps: Create base image and add vector data (shapes): • points of interest • roads • boundaries – Find Features “restaurants near you” – Analysis Density, Surface Analysis, Interpolation, Pattern Discovery GIS: GEOGRAPHIC INFORMATION SYSTEM Generated by OpenStreetMap.org
  • 4. • Leverage Accumulo offerings as distributed data store – High-performance ingest – Horizontally scalable – Per-entry access constraints • Fast geospatial retrieval • Geo-temporal indexing • Pre-calculated statistics: – Counts per Data Type – Bounding Region – Time Range – Numeric Range – Histograms FEATURES OF GEOWAVE
  • 5. INTEGRATED COMPONENTS Accumulo Data Store. Hadoop Map-Reduce input/output formats. GeoServer integration with GeoTools. Vector and Raster Data. Multi-Threaded Ingest Tools. Administrative RESTful Services. Layers and Data Stores. Analytics: Kernel Density, K-means Clustering, Sampling. PDBScan coming soon with Apache Spark 1.2.1. Tested Versions: Accumulo 1.5.1, 1.6.x; Cloudera 2.0.0-cdh4.7.0, 2.5.0-cdh5.2; Hortonworks HDP 2.1; Apache 2.6; GeoTools 11.4, 12.1, 12.2; GeoServer 2.5.2, 2.6.1.
  • 6. • Data Structures – Simple Feature (ISO 19125) via GeoTools (http://www.geotools.org/). – Raster Images – Custom • Provided Ingest Types – Vector Data Sources (GeoTools) • Examples: Shapefiles, GeoJSON, PostGIS, etc. – Grid Formats (GeoTools) • Examples: ArcGrid, GeoTIFF, etc. – GeoLife GPS Trajectories (http://research.microsoft.com/en-us/projects/GeoLife/) – GPX (http://www.topografix.com/gpx.asp) – T-Drive (http://research.microsoft.com/en-us/projects/tdrive/) – PDAL DATA TYPES
  • 7. • Basic Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g., a bounding box). • Accumulo: Each table is organized into blocks of sorted row identifiers. • Revised Problem: A two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries. MAIN PROBLEM: INDEX TWO DIMENSIONS IN A SINGLE DIMENSION INDEX
  • 8. GENERALIZED PROBLEMS Solve the general problem first. Then apply to Geospatial specific problems. • Multi-Dimension Index supporting efficient data retrieval given bounded set of constraints for each dimension. • Indexed data includes scalars and intervals per dimension. For example, a range of time or a polygon. • Index over a mix of bounded and unbounded dimensions.
  • 9. Curves are constructed iteratively. Each iteration produces a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit. Each discrete value on the curve represents a hyper-rectangle in n-dimensional space. Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube. FUNDAMENTAL APPROACH: SPACE FILLING CURVES TRAVERSE N-DIMENSIONAL SPACE
  • 10. Achieve optimal read performance through contiguous series of values across two or more dimensions. Reading 11 records over a contiguous range 23->33 is faster than reading non-contiguous range such as 15,18,34,56-58,83,99,101-102. Consider: Latitude and Longitude defined by a range (latA, lonA) -> (latB, lonB) should map to the least number of ranges on the space filling curve. Haverkort and Walderveen[1] describe 3 metrics to help quantify this: Worst Case Dilation, Worst Case Bounding Box, and Average Bounding Box. Worst Case Dilation = (squared distance between p and q) / (area filled by curve between p and q). Bounding Box ratio = (area of minimum bounding rectangle (blue)) / (area filled by curve (green)). CURVE SELECTION : SEQUENTIAL IO OPTIMIZATION
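The contrast between contiguous and scattered reads can be made concrete by counting how many separate range scans a set of row IDs needs. The following is a minimal sketch; the helper name `to_ranges` is hypothetical and not part of GeoWave or Accumulo.

```python
def to_ranges(sorted_ids):
    """Collapse sorted integer row IDs into inclusive (start, end) runs."""
    ranges = []
    for i in sorted_ids:
        if ranges and i == ranges[-1][1] + 1:
            # Extends the current run, so no extra seek is needed.
            ranges[-1] = (ranges[-1][0], i)
        else:
            # Starts a new run, which costs one more seek.
            ranges.append((i, i))
    return ranges


contiguous = to_ranges(range(23, 34))
scattered = to_ranges([15, 18, 34, 56, 57, 58, 83, 99, 101, 102])
print(len(contiguous), len(scattered))  # prints "1 7"
```

The contiguous read is a single scan, while the scattered example from the slide needs seven, which is why the curve should map a query box to as few ranges as possible.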
  • 11. CURVE SELECTION : LOCALITY
      Curve | Worst Case Dilation L∞ | L2 | L1 | Worst Case Area | Average Box Area
      Z-Order | ∞ | ∞ | ∞ | ∞ | 2.86
      Hilbert | 6 | 6 | 9 | 2.40 | 1.41
      H-order | 4 | 4 | 8 | 3.00 | 1.69
      Peano | 8 | 8 | 10.66 | 2.00 | 1.42
      AR2W2 | 5.40 | 6.04 | 12.00 | 3.05 | 1.47
      BΩ | 5.00 | 5.00 | 9.00 | 2.22 | 1.40
      [1] Haverkort, Walderveen Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves 2008 arXiv:0806.4787v2
  • 12. • Place a grid on the globe (dotted lines) • Connect all the points on the grid with a Hilbert SFC. • Curve provides linear ordering over two dimensional space. • Bounding box is defined by the set of ranges covered by the Hilbert SFC. HILBERT CURVE MAPPING IN 2D: THE GLOBAL
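As an illustration of how a Hilbert curve gives a linear ordering over grid cells, here is a minimal sketch of the widely published iterative cell-to-index conversion. The function name `xy2d` and the curve orientation are assumptions for illustration; GeoWave's own implementation (the bibliography cites the Uzaygezen library) may orient the curve differently.

```python
def xy2d(n, x, y):
    """Map cell (x, y) on an n x n grid (n a power of two) to a Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each level contributes one of four quadrant digits, weighted by s*s.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-quadrant is traversed in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Linear ordering of a 16 x 16 grid: index -> cell.
order = {xy2d(16, x, y): (x, y) for x in range(16) for y in range(16)}
```

Consecutive indices in `order` always refer to neighbouring cells, which is the locality property that makes the curve useful as a sorted row ID.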
  • 13. • Precision is determined by the ‘depth’ of the curve. In this example the globe is defined by a 16x16 grid. • Resolution is 22.5 degrees longitude and 11.25 degrees latitude per cell. • Each elbow (discrete point) in the Hilbert SFC maps to a grid cell. • The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer grained cells. HILBERT CURVE PRECISION
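The relationship between bits and cell size can be checked with a few lines of arithmetic. This is a sketch under the assumption that longitude spans 360 degrees and latitude 180 degrees, each split into 2^bits cells; the function names are hypothetical.

```python
def cell_size_degrees(bits_per_dimension):
    """Return (longitude, latitude) degrees per cell for 2**bits cells per axis."""
    cells = 1 << bits_per_dimension
    return 360.0 / cells, 180.0 / cells


def to_cell(lon, lat, bits_per_dimension):
    """Map a lon/lat pair to its (column, row) grid cell, clamping the upper edge."""
    cells = 1 << bits_per_dimension
    col = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    row = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return col, row


print(cell_size_degrees(4))  # (22.5, 11.25) for the 16 x 16 grid
print(cell_size_degrees(5))  # one more bit halves the cell in each dimension
```

With 4 bits per dimension the grid is 16 x 16, so each cell spans 360/16 = 22.5 degrees of longitude and 180/16 = 11.25 degrees of latitude.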
  • 14. Recursively decompose the Hilbert region to find only those covered regions that overlap the query box. The figure depicts a third order (2^3 “buckets” per dimension) Hilbert curve in 2D. Forms a quad-tree view over the data. Each two bits, from most significant to least, represents a “quadrant.” [Figure: quadrants labeled 00, 01, 10, 11 at each level.] Hilbert Index (52) = 11 01 00 RECURSIVE DECOMPOSITION : TWO DIMENSION EXAMPLE
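The quad-tree reading of an index can be sketched by splitting it into two-bit groups. The helper names below are hypothetical; the only facts taken from the slide are the third-order curve and the example index 52.

```python
def quadrant_path(index, order):
    """Split a Hilbert index into `order` two-bit quadrant codes, most significant first."""
    return [(index >> (2 * level)) & 0b11 for level in reversed(range(order))]


def prefix_range(path_prefix, order):
    """Inclusive index range covered by the quad identified by a quadrant prefix."""
    start = 0
    for q in path_prefix:
        start = (start << 2) | q
    span = 4 ** (order - len(path_prefix))
    return start * span, start * span + span - 1


path = quadrant_path(52, 3)
print([format(q, "02b") for q in path])  # ['11', '01', '00']
print(prefix_range(path[:1], 3))         # (48, 63): the top-level quadrant holding 52
```

Every prefix of the path names a quad whose cells occupy one contiguous index range, which is what lets a whole quad be scanned with a single range.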
  • 15. Bounding Box over grid cells (2,9) to (5,13) (lower left) to (upper right) Decompose cells intersecting bounding box as shown in the blue. Range decomposes to three (color coded) ranges – • 70 -> 75 • 92 -> 99 • 116 -> 121 Note: Bounding box from a geospatial query window does not necessarily “snap” perfectly to the grid cells. (e.g. 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells. DECODE THE BOUNDING BOX: RANGE DECOMPOSITION
  • 16. Here we see the query range fully decomposed into the underlying “quadrants.” Decomposition stops when the query window fully contains the quad. (See segment 3 and segment 8) RANGE DECOMPOSITION OPTIMIZATION
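The recursive decomposition with its early stop can be sketched as follows. This is an illustrative implementation based on the classic recursive Hilbert construction, not GeoWave's code; the function name `decompose` and the curve orientation are assumptions, so the index values it produces need not match those in the figures.

```python
def decompose(order, qx0, qy0, qx1, qy1):
    """Return merged Hilbert index ranges covering the inclusive cell box
    (qx0, qy0)-(qx1, qy1) on a 2**order x 2**order grid."""
    side = 1 << order
    ranges = []

    def visit(x0, y0, ax, ay, bx, by, start, count):
        # The quad is spanned by vectors a and b from corner (x0, y0).
        xmin, xmax = sorted((x0, x0 + ax + bx))
        ymin, ymax = sorted((y0, y0 + ay + by))
        if xmax <= qx0 or xmin > qx1 or ymax <= qy0 or ymin > qy1:
            return  # quad does not overlap the query box
        if xmin >= qx0 and xmax <= qx1 + 1 and ymin >= qy0 and ymax <= qy1 + 1:
            # Query fully contains the quad: stop and emit its whole range.
            if ranges and ranges[-1][1] + 1 == start:
                ranges[-1] = (ranges[-1][0], start + count - 1)
            else:
                ranges.append((start, start + count - 1))
            return
        hax, hay, hbx, hby = ax // 2, ay // 2, bx // 2, by // 2
        q = count // 4
        # Four sub-quads, visited in curve order.
        visit(x0, y0, hbx, hby, hax, hay, start, q)
        visit(x0 + hax, y0 + hay, hax, hay, hbx, hby, start + q, q)
        visit(x0 + hax + hbx, y0 + hay + hby, hax, hay, hbx, hby, start + 2 * q, q)
        visit(x0 + hax + bx, y0 + hay + by, -hbx, -hby, -hax, -hay, start + 3 * q, q)

    visit(0, 0, 0, side, side, 0, 0, side * side)
    return ranges


# Box over grid cells (2, 9) to (5, 13) on a 16 x 16 grid.
print(decompose(4, 2, 9, 5, 13))
```

Because sub-quads are visited in curve order, adjacent results can be merged on the fly, and a query that covers the whole grid collapses to a single range without descending to individual cells.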
  • 17. INTERVALS: POLYGONS AND MULTI-POLYGONS An entry is duplicated for each hyper-rectangle that the interval intersects. The polygon in the example covers 66 cells, producing 66 duplicates that must be removed from query results. De-duplication is applied in an Accumulo iterator as well as client-side. A query is defined by a range per dimension (a bounding rectangle in 2D).
  • 18. INTERVALS: POLYGONS AND MULTI-POLYGONS High resolution curves force an excessive number of duplicates for large intervals. Consider a high resolution 2D curve (2^31 x 2^31) and a large polygon such as the Pacific Ocean. The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries. Solution: Tiered Indexing[8] • Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each successive tier doubles the resolution in each dimension. • Polygons are stored in the lowest tier possible that minimizes the number of duplicates. • Example: Blue polygon indexed in tier 2; Red polygon indexed in tier 3.
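A rough sketch of tier selection follows. It assumes coordinates normalized to [0, 1], approximates a polygon by its bounding box, and picks the finest tier whose cell count stays within a duplicate cap; these simplifications and the function names are illustrative assumptions, not GeoWave's actual policy.

```python
def cells_covered(bbox, tier):
    """Count tier cells (2**tier per axis) intersected by a normalized bbox."""
    n = 1 << tier
    x0, y0, x1, y1 = bbox
    cols = min(int(x1 * n), n - 1) - min(int(x0 * n), n - 1) + 1
    rows = min(int(y1 * n), n - 1) - min(int(y0 * n), n - 1) + 1
    return cols * rows


def choose_tier(bbox, max_tier, max_duplicates=4):
    """Finest tier at which the bbox produces at most `max_duplicates` entries."""
    for tier in range(max_tier, -1, -1):
        if cells_covered(bbox, tier) <= max_duplicates:
            return tier
    return 0


# Scale of the problem without tiers: a third of a 2^31 x 2^31 grid.
untiered_duplicates = 0.33 * 2 ** 62   # about 1.5e18

print(choose_tier((0.1, 0.1, 0.9, 0.9), 31))    # a very large shape lands in a coarse tier
print(choose_tier((0.30, 0.30, 0.31, 0.31), 31))  # a small shape lands in a fine tier
```

The default cap of 4 corresponds to 2^d for two dimensions; a different cap trades duplicates against false positives at query time.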
  • 19. TIERS: QUERY REGIONS WITH FALSE POSITIVES Balance between an acceptable number of duplicates and false positives due to lower granularity of higher tiers. Consider a query region in orange. It does not intersect either polygon. However, it does intersect shared quadrants at the respective tiers for both shapes. Thus, more rows are filtered during the range scan. Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g. 2^10), hundreds of duplicates occur.
  • 20. TIERS: WORST CASE Cap the number of duplicates by choosing an appropriate tier. Our analysis indicates that an optimal number of duplicates is represented by 2^d, where d is the number of dimensions (i.e., in 2 dimensions, cap at 4). Consider the worst case, a small square polygon centered on the inner intersecting boundary (example polygon in red). Regardless of size, there are always four duplicates at all tiers except at the 2^0 tier (the orange box, representing the entire world).
  • 21. UNBOUNDED DIMENSION: TIME To normalize real-world values to fit on a space filling curve, the sample space must be bounded. Solution: Binning • A bin represents a period for EACH dimension. For example, a periodicity of a year can be used for time. • Each bin covers its own Hilbert space. • Entries that contain ranges may span multiple bins, resulting in duplicates. • The Bin ID is part of the row identifier. [Figure: yearly bins 1997, 1998, 1999.] A single bin for an unbounded dimension: [min + (period * period duration), min + ((period+1) * period duration))
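The bin formula can be sketched directly. The example uses fractional years with `minimum = 1997` and a period duration of one year purely for illustration; all function names are hypothetical.

```python
import math


def bin_id(value, minimum, period):
    """Index of the bin whose half-open interval contains `value`."""
    return math.floor((value - minimum) / period)


def bin_bounds(b, minimum, period):
    """Half-open interval [lower, upper) covered by bin `b`."""
    return minimum + b * period, minimum + (b + 1) * period


def bins_for_range(start, end, minimum, period):
    """All bins touched by a range entry; each one yields a duplicate."""
    return list(range(bin_id(start, minimum, period), bin_id(end, minimum, period) + 1))


def normalize(value, minimum, period):
    """Position of `value` inside its bin, scaled to [0, 1) for the curve."""
    lower, _ = bin_bounds(bin_id(value, minimum, period), minimum, period)
    return (value - lower) / period


print(bin_id(1998.5, 1997, 1), normalize(1998.5, 1997, 1))  # 1 0.5
print(bins_for_range(1997.9, 1999.1, 1997, 1))              # [0, 1, 2]
```

The bin ID would be placed in the row identifier ahead of the curve value, so each bin holds its own bounded Hilbert space.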
  • 22. BIN: VARIABILITY OVER DIMENSIONS Time Elevation Velocity Each Bin is a hyper-rectangle representing ranges of data labeled by points on a Hilbert curve. Bounded dimensions assume a single Bin. For example, Latitude and Longitude.
  • 23. THAT’S ENOUGH THEORY, LET’S APPLY IT ACCUMULO TECHNIQUES YOU MIGHT FIND INTERESTING
  • 24. SFC Curve Hierarchy Feature Type Feature ID Hint to Dedupe Filter From Field Visibility Handlers VECTOR DATA PERSISTENCE MODEL Column per feature identifier. Column per each feature attribute. Types include: Geometry Integer Double BigDecimal Date Time String Boolean etc. Feature Attribute Name
  • 25. MAP OCCLUSION CULLING At a specific zoom level, each pixel signifies a range in degrees. When scanning the data, only one entry is needed within each pixel range; the rest of the entries can be skipped. The block identified in red represents many data points, but is rendered by only 9 pixels.
  • 26. MAP OCCLUSION CULLING: ITERATORS The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel. [Figure: database data mapped to displayed pixels 1-4. Labels: Scan to the first pixel; Seek to the beginning of the next pixel; The rendering engine received only these points; Points that were all skipped.]
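The scan-then-seek pattern can be simulated over a sorted list of keys. This sketch assumes, as a simplification, that each pixel corresponds to a fixed-width contiguous key range; `cull` and `pixel_width` are hypothetical names, and the binary search stands in for a server-side iterator seek.

```python
from bisect import bisect_left


def cull(sorted_keys, pixel_width):
    """Keep the first key found in each pixel-sized key range, seeking past the rest."""
    kept = []
    i = 0
    while i < len(sorted_keys):
        key = sorted_keys[i]
        kept.append(key)  # one entry is enough to paint this pixel
        next_pixel_start = (key // pixel_width + 1) * pixel_width
        # Seek to the beginning of the next pixel instead of reading every entry.
        i = bisect_left(sorted_keys, next_pixel_start, i + 1)
    return kept


print(cull([0, 1, 2, 10, 11, 25, 26, 27, 39], 10))  # [0, 10, 25, 39]
```

Only four of the nine entries reach the renderer; the others are never read, which is where the savings come from when a pixel covers thousands of points.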
  • 27. DISTRIBUTED RENDERING [Figure labels: GeoServer (GeoWave Plugin); Map Request; Map Response; Layer Style; Accumulo (GeoWave Iterators); Rendered Map.] Each scan result is an image with the data in the range. All resultant images are composited together.
  • 28. DISTRIBUTED RENDERING WITH OCCLUSION CULLING
  • 29. SFC Curve Hierarchy SFC Value is Effectively a Tile ID Coverage Name RASTER DATA PERSISTENCE MODEL Image Data Buffer + Image Metadata Image Metadata is customizable. Default is to store “no data” values, but can be customized Tiles are unique, ignore duplication Unique name for global coverage
  • 30. RASTER DATA: GRID COVERAGE Tiled, each “cell” fit to boundary. “No Data” values must be maintained. Multi-band, more than just RGB.
  • 31. RASTER DATA: ADVANCED OPTIONS Histogram Equalization [10]. Image Pyramid [11]. Tile Merge Strategy: f(f(t1, t2), t3) = t_final. [Figure: tile tn stored as a Value holding Image Data Buffer and Metadata.] Custom data per tile, in scope for f(x).
  • 32. STATISTICS: STRUCTURE Statistics infrastructure supports summary data. Currently, each row ID includes an adapter ID and a statistics ID. Current statistics types include population bounding boxes, counts and ranges. [Figure: Key = Row ID (Adapter ID, Statistic ID), Column (Family, Qualifier, Visibility), Time; followed by Value. Labels: “STATS”; Matches represented data; Attribute Name & Statistic Type.]
  • 33. STATISTICS: COMBINER
      Adapter ID | Statistic ID | Family | Qualifier | Visibility | Time | Value
      0xA43E | “Count” | “STATS” | | A&B | 2 | 30
      0xA43E | “Count” | “STATS” | | A&C | 4 | 60
      0xA43E | “Count” | “STATS” | | A&B | 7 | 20
      MERGE:
      0xA43E | “Count” | “STATS” | | A&B | 9 | 50
      BBOX: Grow Envelope to Minimum and Maximum corners. RANGE: Minimum and Maximum. HISTOGRAM: Update bins from coverage over raster image.
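The merge behaviour can be sketched in plain Python, independent of the Accumulo API. Entries sharing the same adapter ID, statistic ID and visibility are folded into one value; the function names and tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def combine_counts(entries):
    """Sum count statistics that share (adapter ID, statistic ID, visibility)."""
    merged = defaultdict(int)
    for adapter_id, statistic_id, visibility, value in entries:
        merged[(adapter_id, statistic_id, visibility)] += value
    return dict(merged)


def grow_envelope(a, b):
    """BBOX merge: envelope (min_x, min_y, max_x, max_y) covering both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


entries = [
    (0xA43E, "Count", "A&B", 30),
    (0xA43E, "Count", "A&C", 60),
    (0xA43E, "Count", "A&B", 20),
]
print(combine_counts(entries))
```

Both merges are associative and commutative, so the result does not depend on the order in which entries arrive, which is what makes them safe to apply incrementally on the server side.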
  • 34. STATISTICS: TRANSFORMATION ITERATOR
      Adapter ID | Statistic ID | Family | Qualifier | Visibility | Time | Value
      0xA43E | “Count” | “STATS” | | A&B | 9 | 50
      0xA43E | “Count” | “STATS” | | A&C | 4 | 60
      MERGE:
      0xA43E | “Count” | “STATS” | | A&B&C | 9 | 110
      Query authorization may authorize multiple rows. Query with authorization A, B & C.
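A sketch of the query-time merge across visibilities is shown below. It handles only simple `&`-joined labels (real Accumulo visibility expressions also allow `|` and parentheses), and the function names are hypothetical.

```python
def is_visible(visibility, authorizations):
    """True when every label in an '&'-joined visibility is authorized."""
    return set(visibility.split("&")) <= set(authorizations)


def merge_for_query(entries, authorizations):
    """Combine all count rows the query may see into one (visibility, value) row."""
    total = 0
    labels = set()
    for visibility, value in entries:
        if is_visible(visibility, authorizations):
            total += value
            labels |= set(visibility.split("&"))
    return "&".join(sorted(labels)), total


rows = [("A&B", 50), ("A&C", 60)]
print(merge_for_query(rows, {"A", "B", "C"}))  # ('A&B&C', 110)
print(merge_for_query(rows, {"A", "B"}))       # ('A&B', 50)
```

A caller holding only A and B never sees the contribution of the A&C row, so the merged statistic reflects exactly the data that caller is allowed to read.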
  • 35. WFS-T[12] TRANSACTIONS: ISOLATION • Problem: Isolation of updates and new records until commit. • Solution: – Use a managed set of transaction identifiers as authorization tags. A single transaction places an authorization tag in all new entries. – Upon commit, the authorization tag is removed using a transforming iterator. [Figure: visibility “Role1, role2, tx123” becomes “Role1, role2” on Commit.]
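The isolation scheme can be modelled with sets of labels. In this sketch an entry is readable only when the reader holds every label on it, and commit strips the transaction label; the data layout and function names are illustrative assumptions rather than GeoWave's implementation.

```python
def readable(entries, authorizations):
    """IDs of entries whose labels are all held by the reader."""
    return [fid for fid, labels in entries if set(labels) <= set(authorizations)]


def commit(entries, tx_id):
    """Remove the transaction label from every entry, as a transforming pass would."""
    return [(fid, tuple(l for l in labels if l != tx_id)) for fid, labels in entries]


entries = [
    ("feature-1", ("role1", "role2")),
    ("feature-2", ("role1", "role2", "tx123")),  # written inside transaction tx123
]

other_user = {"role1", "role2"}
tx_owner = {"role1", "role2", "tx123"}

print(readable(entries, other_user))                   # uncommitted entry is hidden
print(readable(entries, tx_owner))                     # the transaction sees its own writes
print(readable(commit(entries, "tx123"), other_user))  # visible to everyone after commit
```

Rolling back would amount to deleting the entries that still carry the transaction label, since no other reader was ever able to see them.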
  • 36. SO WHAT? EYE-CANDY YOU’VE BEEN WAITING FOR
  • 37. Microsoft GeoLife Microsoft Research has made available a trajectory data set that contains the GPS coordinates of 182 users over a period of more than five years (April 2007 to August 2012). There are 17,621 trajectories in this data set with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours, recorded by GPS loggers and GPS phones, often sampling every 1-5 seconds or every 5-10 meters. http://research.microsoft.com/jump/131675
  • 38. GeoLife – Just the tracks
  • 39. Let’s bring out some detail – Kernel Density Estimate (Gaussian Kernel)
  • 42. OSM – Planet GPX dump Every track ever uploaded to Open Street Map Complete data attribution 2.9 Billion spatial entities (points) https://blog.openstreetmap.org/2013/04/12/bulk-gpx-track-data/
  • 43. Level 0 Overview (all the points!)
  • 46. Let’s bring out some detail again – Kernel Density Estimate (Gaussian Kernel)
  • 47. Let’s zoom a bit – and try some different styling options
  • 49. BIBLIOGRAPHY
      [1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2
      [2] Hamilton, Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. 2008. Information Processing Letters 105 (155-163)
      [3] Hayes. Crinkly Curves. 2013. American Scientist 100-3 (178). DOI: 10.1511/2013.102.1
      [4] Skilling. Programming the Hilbert Curve. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings. 2004. American Institute of Physics 0-7354-0182-9/04
      [5] Wikipedia. Well-known_binary. http://en.Wikipedia.org/wiki/Well-known_binary 2013
      [6] Wikipedia. Hilbert curve. http://en.wikipedia.org/wiki/Hilbert_curve 2013
      [7] Aioanei. Uzaygezen – Compact Hilbert Index implementation in Java. http://code.google.com/p/uzaygezen/ 2008. Google Inc.
      [8] Surratt, Boyd, Russelavage. Z-Value Curve Index Evaluation. 2012. Internal Presentation.
      [9] Open Geospatial Consortium. Standard List. http://www.opengeospatial.org/standards/is
      [10] Remote Sensed Image Processing on Grids for Training in Earth Observation. http://www.intechopen.com/source/html/6674/media/image3.jpeg
      [11] OSGeo Wiki. http://wiki.osgeo.org/images/thumb/d/d0/Pyramid.jpg/286px-Pyramid.jpg
      [12] WFS-T (http://www.opengeospatial.org/standards/wfs)

Editor's Notes

  1. Expect velocity in the beginning. Slow down during the fundamentals and go even slower during the Accumulo-specific material. Thinking no more than 2 minutes a slide up to the fundamentals.
  2. Shift Focus to two dimensions
  3. TBD: Change examples from red to other color and only show four, not six
  4. Three dimensions a box, etc. Per side ~ side of box. In this example, lowest tier is an 8X8 grid
  5. Use Polygons and Multi-Polygons as an example of intervals
  6. Special focus on generality…without specifics
  7. TO DO: Fix graphic
  8. Index ID…if added and slide not updated, Reason for Index ID: Keep index/adapter statistics…all measures missing components…a like a partialed file index. Eventually to support a cost-based optimizer, picking the best index. Non-idempotent operations such as count.
  9. Statistics: Commutative Monoids. Recall: the Combiner occurs before the Versioner. A semigroup M is a nonempty set equipped with a binary operation, which is required (only!) to be associative. An element I of a semigroup M is said to be an identity if for all x ∈ M, Ix = xI = x. A semigroup can have at most one identity. Definition: A monoid is a semigroup with an identity element. A semigroup M is commutative if x · y = y · x for all x, y ∈ M.
  10. Dropped slide on statistics future, so mention Bloom filters, HyperLogLog, bin stats, etc.
  11. Note about WFS-T specific meaning (Sessions)