Building a geospatial processing pipeline using Hadoop and HBase and how Mons... - DataWorks Summit
Monsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. As a result of the extreme data volumes and compute complexities, we were forced to migrate our data processing from a traditional RDBMS to a scale-out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exists on how to process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline, we integrated the popular open-source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits needed to be shared by multiple tasks due to the nature of the computation being performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally, we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern.
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned - DataWorks Summit
Scientific data services are a critical aspect of the NASA Center for Climate Simulation's (NCCS) mission. Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving useful for data-intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern-Era Retrospective Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data-intensive analytic workflows, we invested in building a cyberinfrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent in the preparation of reanalysis data used in data-model intercomparison, a long-sought goal of the climate community. This paper summarizes the related use cases and lessons learned.
Sept 17 2013 - THUG - HBase a Technical Introduction - Adam Muise
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
Introduction to Big Data & Hadoop Architecture - Module 1 - Rohit Agrawal
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of a file write and read.
This is an updated version of Amr's Hadoop presentation. Amr gave this talk recently at the NASA CIDU event, the TDWI LA chapter, and Netflix HQ. You should watch the PowerPoint version, as it has animations. The slides also include handout notes with additional information.
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING - ijiert bestjournal
Unstructured data poses challenges for storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format with a proper schema, while unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. In the existing system, unstructured data is processed against a Cassandra dataset. In the present system, MongoDB is implemented alongside the Cassandra dataset, as MongoDB provides a flexible data model and a large number of options for querying unstructured data, whereas Cassandra models its data so as to minimize the total number of queries through more careful planning and denormalization. Cassandra offers basic secondary indexes, but for the best performance it is recommended to model the data so that they are used infrequently. So to process...
Learning objectives
• Understand how to handle massive amounts of data using a data grid.
• Explain data replication and namespaces.
• Identify the various data access models.
Definitive Guide to Select Right Data Warehouse (2020) - Sprinkle Data Inc
Choosing the right data warehouse is a big challenge for organisations. In this doc, we have made an end-to-end comparison of leading data warehouses: Snowflake vs Redshift vs BigQuery vs Hive vs Athena.
Sprinkledata.com
This is a slide deck that I have been using to present on GeoTrellis at various meetings and workshops. The information speaks to GeoTrellis pre-1.0 as of Q4 2016.
Cosmos DB Real-time Advanced Analytics Workshop - Databricks
The workshop implements an innovative fraud detection solution as a PoC for a bank that provides payment processing services for commerce to merchant customers all across the globe, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Since their customers are around the world, the right solution should minimize any latencies experienced using their service by distributing as much of the solution as possible, as closely as possible, to the regions in which their customers use the service. The workshop designs a data pipeline solution that leverages Cosmos DB for both the scalable ingest of streaming data and the globally distributed serving of both pre-scored data and machine learning models. Cosmos DB's major advantage when operating at a global scale is its high concurrency with low latency and predictable results.
This combination is unique to Cosmos DB and ideal for the bank's needs. The solution leverages the Cosmos DB change data feed in concert with Azure Databricks Delta and Spark capabilities to enable a modern data warehouse solution that can be used to create risk reduction solutions for scoring transactions for fraud in an offline, batch approach and in a near real-time, request/response approach. https://github.com/Microsoft/MCW-Cosmos-DB-Real-Time-Advanced-Analytics Takeaway: How to leverage Azure Cosmos DB + Azure Databricks along with Spark ML for building innovative advanced analytics pipelines.
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
Which storage technology, HDDs or SSDs, excels in big data architecture? SSDs clearly win on speed, offering higher sequential read/write speeds and higher IOPS. However, deploying SSDs in hundreds or thousands of nodes could add up to a very expensive proposition. A better approach identifies critical locations where SSDs enable immediate cost-per-performance wins. This whitepaper will look at the basics of big data tools, review two performance wins with SSDs in a well-known framework, as well as present some examples of emerging opportunities on the leading edge of big data technology.
Assimilating sense into disaster recovery databases and judgement framing pr... - IJECEIAES
The replication between the primary and secondary (standby) databases can be configured in either synchronous or asynchronous mode. It is referred to as out-of-sync in either mode if there is any lag between the primary and standby databases. In the previous research, the advantages of the asynchronous method were demonstrated over the synchronous method on highly transactional databases. The asynchronous method requires human intervention and a great deal of manual effort to configure disaster recovery database setups. Moreover, in existing setups there was no accurate calculation process for estimating the lag between the primary and standby databases in terms of sequences and time factors with intelligence. To address these research gaps, the current work has implemented a self-image looping database link process and provided decision-making capabilities at standby databases. Those decisions from standby are always in favor of selecting the most efficient data retrieval method and being in sync with the primary database. The purpose of this paper is to add intelligence and automation to the standby database to begin taking decisions based on the rate of concurrency in transactions at primary and out-of-sync status at standby.
Similar to HGrid A Data Model for Large Geospatial Data Sets in HBase (20)
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf - Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party. I will share these foundational concepts to build on:
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
3. The General Research Problem
The Geospatial Problem Instance
The Data Set
HBase data-organization alternatives
Performance analysis
Some Lessons Learned
10. [1] built a multi-dimensional index layer on top of a one-dimensional key-value store (HBase) to perform spatial queries.
[2] presented a novel key formulation scheme, based on the R+-tree, for spatial indexing in HBase.
Both focus on row-key design; there is no discussion of column and version design.
[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
11. Two Synthetic Datasets
Uniform and Zipf distributions
Based on the Bixi dataset; each object includes:
▪ station ID
▪ latitude, longitude, station name, terminal name
▪ number of docks
▪ number of bikes
100 million objects (70 GB) in a 100 km × 100 km simulated space
12. Regular Grid Indexing
Row key: Grid rowID
Column: Grid columnID
Version: counter of Objects
Value: one object in JSON format
[Figure: a 4x4 regular grid with row IDs and column IDs 00-03; the counter (version) dimension stacks the objects that fall into the same cell.]
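To make the layout concrete, here is a minimal sketch of writing one Bixi-style object under the regular-grid model. The table name, the single column family "d", and the string encoding of the grid indexes are assumptions for illustration (the slides do not specify them), and the sketch uses the current HBase client API rather than the 0.94 API used in the experiments.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RegularGridWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rg_points"))) {   // hypothetical table name

            String gridRowId = "02";      // row key: the grid cell's row index
            String gridColumnId = "03";   // column qualifier: the grid cell's column index
            long counter = 5;             // version: per-cell object counter
            String json = "{\"stationId\":42,\"lat\":45.51,\"lon\":-73.58,"
                        + "\"docks\":20,\"bikes\":7}";   // value: one object in JSON format

            Put put = new Put(Bytes.toBytes(gridRowId));
            // The object counter is written as the cell's version in the single column family "d".
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(gridColumnId), counter, Bytes.toBytes(json));
            table.put(put);
        }
    }
}
```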
13. Trie-based quad-tree indexing
Z-value linearization
Row key: Z-value
Column: object ID
Value: one object in JSON format
[Figure: table layout with Z-value row keys and object-ID column qualifiers.]
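For reference, a minimal sketch of the Z-value computation by bit interleaving of a tile's column and row indexes. The slides note that the keys were ultimately encoded in decimal to keep HBase row keys short, so the exact encoding below is illustrative only.

```java
public class ZOrder {
    // Interleave the bits of the tile's column index (x) and row index (y) to form the Z-value.
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 31; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i);       // x bits go to even positions
            z |= ((long) ((y >> i) & 1)) << (2 * i + 1);   // y bits go to odd positions
        }
        return z;
    }

    public static void main(String[] args) {
        // Neighbouring tiles (2, 3) and (3, 3) map to Z-values 14 and 15.
        System.out.println(zValue(2, 3) + " " + zValue(3, 3));
    }
}
```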
14. Quad-tree data model
More rows with a deeper tree
Z-ordering linearization (violates data locality)
In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation
Regular Grid data model
Very easy to locate a cell by row ID and column ID
Cannot handle a large space and a fine-grained grid, because in-memory indexes are subject to memory constraints
How much unrelated data is examined in a query matters a lot!
16. The row key is the QT Z-value + the RG row index.
The column name is the RG column index + the object ID.
The attributes of the data point are stored in the third dimension.
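A minimal sketch of how such composite keys could be assembled. The concatenation order follows the slide, but the zero-padding widths, the separator-free format, and the "d" column family are assumptions rather than the authors' exact encoding.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HGridKeys {
    // Row key: quad-tree tile Z-value followed by the regular-grid row index within that tile.
    static byte[] rowKey(long zValue, int rgRow) {
        return Bytes.toBytes(String.format("%06d%03d", zValue, rgRow));   // padding widths are assumptions
    }

    // Column qualifier: regular-grid column index within the tile followed by the object ID.
    static byte[] columnQualifier(int rgCol, String objectId) {
        return Bytes.toBytes(String.format("%03d%s", rgCol, objectId));
    }

    public static void main(String[] args) {
        Put put = new Put(rowKey(9L, 12));
        // The point's attributes go into the third dimension (the cell's value/versions), as before.
        put.addColumn(Bytes.toBytes("d"), columnQualifier(7, "station-42"),
                      Bytes.toBytes("{\"docks\":20,\"bikes\":7}"));
        System.out.println(put);
    }
}
```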
17. Range query processing:
1. Compute the minimum bounding square from the query's input location and range.
2. Compute the quad-tree tiles that overlap the bounding square (their Z-codes).
3. Compute the indexes of all regular-grid cells in these quad-tree tiles (the secondary index of rows and columns).
4. Issue one sub-query for each selected quad-tree tile; process it with user-level coprocessors on the HBase regions (see the sketch below).
5. Collect the results of the sub-queries at the client side.
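A sketch of step 4: one Scan per overlapping quad-tree tile, bounded by the regular-grid row indexes from step 3, so that the irrelevant tiles lying between the Z-codes are never read. The row-key format follows the hypothetical encoding sketched earlier, and withStartRow/withStopRow assume a recent HBase client (older clients use setStartRow/setStopRow).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HGridRangeQuery {
    // Build one Scan per quad-tree tile that overlaps the query's bounding square.
    static List<Scan> buildSubQueries(List<Long> overlappingTiles, int minRgRow, int maxRgRow) {
        List<Scan> scans = new ArrayList<>();
        for (long z : overlappingTiles) {
            Scan scan = new Scan();
            scan.withStartRow(Bytes.toBytes(String.format("%06d%03d", z, minRgRow)));
            scan.withStopRow(Bytes.toBytes(String.format("%06d%03d", z, maxRgRow + 1)));  // stop row is exclusive
            scan.setCaching(5000);   // matches the 5K scan-cache size used in the experiments
            scans.add(scan);
        }
        // Each Scan is then executed (ideally inside a coprocessor on the region), and the client
        // merges the results, discarding regular-grid cells outside the bounding square.
        return scans;
    }
}
```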
21. k-nearest-neighbor (kNN) query processing:
1. Estimate the search range (density-based range estimation).
2. Compute the row and column indexes (steps 2 and 3 of the range query).
3. Issue a scan query to retrieve the relevant data points.
4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3 (see the sketch of this loop below).
5. Sort the result set in increasing distance from the input location.
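A minimal, self-contained sketch of this loop. For brevity, the HBase range scan of steps 2-3 is replaced by a filter over an in-memory list, and the radius is simply doubled on each shortfall instead of being re-estimated from density; both simplifications are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KnnSketch {
    record Point(double x, double y) {}

    static double dist(Point a, Point b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    // Stand-in for the HGrid range query of steps 2-3: here it just filters an in-memory list.
    static List<Point> rangeQuery(List<Point> data, Point q, double radius) {
        List<Point> hits = new ArrayList<>();
        for (Point p : data) {
            if (dist(p, q) <= radius) hits.add(p);
        }
        return hits;
    }

    static List<Point> knn(List<Point> data, Point q, int k, double initialRadius) {
        double radius = initialRadius;                        // step 1: density-based estimate in the paper
        List<Point> candidates = rangeQuery(data, q, radius);
        while (candidates.size() < k && candidates.size() < data.size()) {
            radius *= 2;                                      // step 4: enlarge the range and retry
            candidates = rangeQuery(data, q, radius);
        }
        candidates.sort(Comparator.comparingDouble(p -> dist(p, q)));   // step 5
        return candidates.subList(0, Math.min(k, candidates.size()));
    }
}
```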
22. Experiment Environment
A four-node cluster on virtual machines with Ubuntu on OpenStack
Hadoop 1.0.2 (replication factor of 2), HBase 0.94
HBase configuration
▪ 5K scan caching size
▪ Block cache enabled
▪ ROWCOL Bloom filter
Query-processing implementation
Native Java API
User-level coprocessor implementation
23. The granularity of the grid affects query-processing performance.
Explore the "best" cell configuration of each model:
Quad-tree => (t = 1)
RG => (t = 0.1)
HGrid => (T = 10, t = 0.1)
HG ≈ 10:0.1: fewer sub-queries, more false positives
HG ≈ 1:0.1: more sub-queries, fewer false positives
HG ≈ 10:0.01: more rows
HG ≈ 10:0.1: fewer rows
24. Range query: given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location.
25. kNN query: given the coordinates of a location, return the K points nearest to that location.
28. Data Organization
Keep row keys and column names short.
Better to have one column family and few columns.
Avoid large amounts of data in one row.
Row-key design should make it easy to prune unrelated data.
The third dimension can store data as well.
The Bloom filter should be configured so that it can prune both rows and columns.
Compression can reduce the amount of data transmitted (see the configuration sketch below).
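A sketch of how these recommendations (a single short column family, a ROWCOL Bloom filter, compression) might be applied when creating the table. The table and family names are placeholders, and the builder-style API shown is from the HBase 2.x client; the experiments themselves used HColumnDescriptor on HBase 0.94.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateHGridTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("hgrid"))        // placeholder name
                    .setColumnFamily(
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d")) // one short family
                            .setBloomFilterType(BloomType.ROWCOL)                    // prune rows and columns
                            .setCompressionType(Compression.Algorithm.GZ)            // gzip, as in the experiments
                            .setBlockCacheEnabled(true)
                            .build())
                    .build());
        }
    }
}
```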
29. Query Processing
The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries.
"Scan" is better than "Get" for retrieving discontinuous keys, even though the scan result also includes unrelated data.
Use "Scan" for small queries and coprocessors for large queries.
It is better to split one large query into multiple sub-queries than to use one query with the row-filter mechanism.
30. HGrid benefits from the good locality of the RG index; it suffers from the poor locality of the Z-ordering QT linearization.
Performance could be improved with other linearization techniques.
It can be flexibly configured and extended:
The QT index can be replaced by the hash code of each sub-space.
The granularity in the second stage can be varied from sub-space to sub-space based on the varying densities.
HGrid is more suitable for homogeneously covered and discontinuous spaces.
31. A data model for spatio-temporal datasets
Towards general, systematic guidance for column families and other NoSQL databases
Applying the data model to cloud-based applications and big-data analytics systems
Editor's Notes
Cloud computing is attracting business owners because of its perceived benefits, such as (1) elasticity, which provides on-demand service based on fluctuating load; (2) excellent system scalability, with no practical limit to storage; and (3) low latency and high availability of service (latency is a strange thing to discuss...). Given these advantages, some enterprises have been working on migrating legacy applications to the cloud. Currently, the majority of applications deployed in the cloud include social networking, online shopping, and monitoring systems. In these applications, the data grows monotonically over time. In order to improve existing services and discover new knowledge, business owners are committing substantial budgets to the analysis of these large time-series data, ranging from simple descriptive statistics to complex analytics. In this movement, successful adoption requires a new model of storage.
Therefore, to address these challenges in this migration, we aim to develop a systematic method for guiding data organization in NoSQL databases, given the data type, the data size, and its usage pattern. We start our investigation with HBase, a NoSQL database offering built on top of Hadoop. It provides two frameworks for parallel distributed computation: MapReduce and Coprocessor. MapReduce is very effective for distributed computation over data stored within HBase tables, but in many cases, for example simple additive or aggregating operations like summing and counting, Coprocessor can give a dramatic performance improvement over HBase's already good scanning performance. Therefore, we used the Coprocessor implementation in our experiments.
Both of them investigate how to access multidimensional data efficiently with spatial indices, which is part of the problem that we are addressing in this paper. Their methods demonstrate efficient performance with spatial indices. However, both focus only on the row-key design in HBase for data organization; there is little discussion of column and version design. To work out an appropriate data model for geospatial datasets which can be easily and directly applied to location-based applications, in addition to the row key, we also take column and version into account when modeling the data in HBase. Furthermore, we implemented the queries with the HBase Coprocessor to harness the benefits of parallelism, while the above studies processed the queries with HBase Scan.
It relies on a regular-grid index. The row key is the row index of the cell in the grid, the column is the column index of the cell, and the version is a counter of objects, so the third dimension holds a stack of data points located in the same grid cell. The value of each storage cell represents one object in JSON format, holding all other attributes and values.
It relies on a trie-based quad-tree index and applies Z-ordering to transform the two-dimensional spatial data into a one-dimensional array. In this model, the row key is the Z-value. The column is the ID of the object located in the cell. Usually the cells are encoded with binary digits, but as the row key should be short in HBase, we use decimal encoding here.
QT: If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the size of the grid cells becomes smaller. RG: The third dimension holds a stack of data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
1) the data-set space is divided into equally-sized rectangular tiles T, encoded with their Z-value. 2) the data points are organized in a regular grid of continuous uniform fine-grained cells. In this model, each data point is uniquely identified in terms of its row key and column name.
The row key is the concatenation of the quad-tree Z-value and the Regular Grid row index. The column name is the concatenation of the Regular Grid column and the object id of the data point. The attributes of the data point are stored in the third dimension.
Range queries are commonly used in location-based applications. Given the coordinates of a location and a radius, a vector of data points located within a distance less than or equal to the radius from the input location is returned. Relying on the HGrid data model, the following process is followed to answer this query:
In this example, four tiles are involved. If the query is computed in one call, the scan range is from 03 to 09, and it will include the irrelevant tiles 04, 05, 07, 08, 10, and 11. Therefore, instead of triggering one query, we split it into four sub-queries, each corresponding to one tile. The sub-queries are called in sequence, and within each sub-query the work is executed by a coprocessor in parallel. But still, within a coprocessor tile, irrelevant regular-grid cells will be seen, so we filter them out.
Have not got a better way to describe this algorithm
Our experiments were performed on a four-node cluster, running on four virtual machines on an Openstack Cloud. The virtual machines run 64bit Ubuntu 11.10 and have 2 cores, 4GB of RAM, and a 200 GB disk. We used Hadoop version 1.0.2, and HBase version 0.94. Hadoop and HBase were each given 2GB of Heap size in every running node. Configurations: 1) HDFS was configured with a replication factor of 2. 2) The data was compressed with gzip. 3) the ROWCOL filter was applied on each table. 4) The scan cache size was set to 5K and the block cache was set to true, for the query processing. We implemented the Range Query processing with the Coprocessor framework and KNN with Scan in order to examine the implications of these different implementations to performance.
Across these three data models, a common configuration parameter is the size of each cell, which determines the granularity of grid. This is a very important variable which substantially affects the query-processing performance. Therefore, before we compare the three data models against each other, we have explored the best cell configuration for each model. We varied the size of the cell to observe how the different sizes of cell affect the performance of each data model. Result: In our experiments, for the QT data model, the appropriate cell configuration was 1, while for the RG data model the acceptable cell size was 0.1.
We evaluated range-query performance under the three data models with both uniform and Zipf-distributed data. The table shows the query response time of the three data models for various ranges when the system contains 100 million objects. As the radius increases, the ratio of irrelevant data to the return-set size increases, and the running time also increases because more data points are retrieved. Comparing the three models, we can see that the regular-grid data model outperforms the others: because it supports better data locality, it demonstrates better performance, since the percentage of irrelevant rows scanned is low. The HGrid data model is much better than the quad-tree data model and worse than the regular-grid data model. The same performance trends persist with both uniform and skewed data.
We now evaluate the performance of k-nearest-neighbor (kNN) queries using the same data set, under the three data models. This table shows the response time (in seconds) for kNN queries, where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range estimation method is employed, there is only one scan operation in the query processing for uniform data, while for skewed data more than one scan iteration is invoked to retrieve the data. That is why the performance with skewed data under all data models is a little worse than with the uniform data set. For both uniform and skewed data, the regular-grid data model demonstrates the best performance among the three data models; the HGrid data model comes second, with slightly worse performance than the regular-grid data model; and the quad-tree data model is outperformed by the other two. The poor locality preservation due to the Z-order linearization method contributes to the poor performance of the quad-tree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, queries on data points having more than 70% probability cannot return a result below the timeout threshold under any data model when k equals 10K. To improve performance, a finer granularity is required to filter out irrelevant data scanning.
In summary, the query performance of the HGrid data model is better than the quad-tree data model and worse than the RG data model. It benefits from the good locality of the second-tier regular-grid index, while at the same time suffering from the poor locality of the Z-ordering linearization at the first tier. Better performance can potentially be obtained with alternative linearization techniques. For skewed data, HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints. It can be flexibly configured and extended: 1) the quad-tree index can be replaced by the hash code of each sub-space; 2) the point-based quad-tree index method can be employed; 3) the granularity in the second stage can be varied from sub-space to sub-space based on the varying densities. Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
The row key and column name should be short, since they are stored with every cell in the row. It is better to have one column family, only introducing more column families where data access is usually column-scoped. The number of columns should be limited; a number in the hundreds is likely to lead to good performance. The amount of data in one row should be kept relatively small: the cost (in time) of retrieving a row with n data points increases more than linearly with n [12]. The row key should be designed to support pruning of unrelated data easily. When the third dimension is used for storing other information rather than time-to-live values, it is preferable to keep it shallow, limited to no more than hundreds of data points, as deep stacks lead to poor insertion performance. The Bloom filter [10] should be configured, as it can accelerate performance by pruning the data from both the row and column sides. Compression can improve performance by reducing the amount of data transmission.
Scan operations are preferable to Get operations for retrieving discontinuous keys, even though the Scan result is bound to also include data points that are not part of the response data set. It is more efficient to Get one row with n data points than n rows with one data point each [12]. It is advisable to narrow the range of queried columns with the Filter mechanism. The number of rows to be scanned for a query should not exceed the scan cache size, which depends on client and server memory; otherwise, it is better to split the query into several sub-queries. When there are too many unrelated rows within the defined scan range, splitting one query into multiple sub-queries with multiple Scan operations is more efficient than one query with the Filter mechanism retrieving rows one by one. The Scan operation is preferable for small queries, while Coprocessor should be used for large queries.