HGrid: A Data Model for Large Geospatial Data Sets in HBase

  1. Dan Han and Eleni Stroulia, University of Alberta. stroulia@ualberta.ca, http://ssrg.cs.ualberta.ca. 07/12/13, Cloud 2013.
  3. The General Research Problem; The Geospatial Problem Instance; The Data Set; HBase data-organization alternatives; Performance analysis; Some Lessons Learned
  7. Appropriate data models for
       • time-series applications (MESOCA 2012)
       • geospatial applications (CLOUD 2013)
       • in progress: spatio-temporal applications
  10. [1] built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.
      [2] presented a novel key-formulation scheme, based on the R+-tree, for spatial indexing in HBase.
      Both focus on row-key design, with no discussion of column and version design.
      [1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
      [2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
  11. Two Synthetic Datasets
       • Uniform and Zipf distributions
       • Based on the Bixi dataset; each object includes
           ▪ station ID
           ▪ latitude, longitude, station name, terminal name
           ▪ number of docks
           ▪ number of bikes
       • 100 million objects (70 GB)
       • In a simulated 100 km × 100 km space
  12. Regular Grid Indexing
       • Row key: grid row ID
       • Column: grid column ID
       • Version: counter of objects
       • Value: one object in JSON format
      [Figure: grid with row IDs 00 to 03 and column IDs 00 to 03; the version dimension holds the object counter.]
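The regular-grid addressing on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `grid_cell`, the kilometre units, and the two-digit zero padding (chosen to match the 00 to 03 labels in the figure) are assumptions.

```python
def grid_cell(x_km, y_km, cell_km):
    """Return (row_id, column_id) of the grid cell containing a point.

    Assumes the space starts at (0, 0) and that IDs are zero-padded
    decimal strings; both choices are illustrative.
    """
    row = int(y_km // cell_km)   # grid row index becomes the HBase row key
    col = int(x_km // cell_km)   # grid column index becomes the column name
    return f"{row:02d}", f"{col:02d}"


row_key, column = grid_cell(2.5, 1.2, cell_km=1.0)
print(row_key, column)
```

Under this layout, every object that falls in the same cell shares one row key and column name, which is why the slide uses the version dimension as an object counter.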
  13. Trie-based Quad-Tree Indexing
       • Z-value linearization
       • Row key: Z-value
       • Column: object ID
       • Value: one object in JSON format
      [Figure: rows keyed by Z-value, columns keyed by object ID.]
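Z-value linearization interleaves the bits of a tile's column and row indexes so that a two-dimensional tile maps to a single sortable key. The sketch below shows the standard Morton (Z-order) encoding; the function name `z_value`, the argument order, and the choice to put the row bit in the higher position are assumptions, since the slides do not specify them.

```python
def z_value(col, row, depth):
    """Interleave the low `depth` bits of col and row into one Z-value."""
    z = 0
    for bit in range(depth):
        z |= ((col >> bit) & 1) << (2 * bit)        # column bit, even position
        z |= ((row >> bit) & 1) << (2 * bit + 1)    # row bit, odd position
    return z


# Row keys for the four tiles of a depth-1 quad-tree, as binary strings.
for r in range(2):
    for c in range(2):
        print((c, r), format(z_value(c, r, 1), "02b"))
```

Tiles that are adjacent in space can end up far apart in key order (for example, where the curve jumps between quadrants), which is the locality cost the later slides attribute to Z-ordering.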
  14. Quad-Tree data model
       • More rows with a deeper tree
       • Z-ordering linearization (violates data locality)
       • In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation
      Regular Grid data model
       • Very easy to locate a cell by row ID and column ID
       • Cannot handle a large space with a fine-grained grid, because in-memory indexes are subject to memory constraints
      How much unrelated data is examined in a query matters a lot!
  15. [Figure: the HGrid layout. The space is divided into quad-tree tiles (00, 01, 11, 10), each subdivided by a regular grid (rows and columns 01 to 03). Axes: QTId-RowId, ColumnId-ObjectId, ObjectAttribute.]
  16. The row key is the QT Z-value + the RG row index.
      The column name is the RG column and the object ID.
      The attributes of the data point are stored in the third dimension.
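A minimal sketch of how such a composite key might be assembled is shown below. The hyphen separator follows the "QTId-RowId" and "Columnid-ObjectId" labels in the deck's figure; everything else (the function name `hgrid_key`, the binary encoding of the Z-value, the three-digit padding, and the default tile and cell sizes) is an assumption made for illustration.

```python
def hgrid_key(x_km, y_km, object_id, tile_km=10.0, cell_km=0.1, depth=4):
    """Return (row_key, column_name) for a point under an HGrid-style layout."""
    # First level: which quad-tree tile contains the point?
    tile_col, tile_row = int(x_km // tile_km), int(y_km // tile_km)
    z = 0
    for bit in range(depth):                         # Z-order bit interleaving
        z |= ((tile_col >> bit) & 1) << (2 * bit)
        z |= ((tile_row >> bit) & 1) << (2 * bit + 1)
    # Second level: regular-grid cell inside that tile.
    local_row = int((y_km - tile_row * tile_km) // cell_km)
    local_col = int((x_km - tile_col * tile_km) // cell_km)
    row_key = f"{z:0{2 * depth}b}-{local_row:03d}"   # QT Z-value + RG row index
    column = f"{local_col:03d}-{object_id}"          # RG column + object ID
    return row_key, column


print(hgrid_key(12.5, 3.5, "s42", tile_km=10.0, cell_km=1.0, depth=2))
```

The object's attributes would then be written as the values under that row key and column name, using the third (version) dimension as the slide describes.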
  17. Range Query
       1. Compute the minimum bounding square from the query's input location and range.
       2. Compute the quad-tree tiles that overlap the bounding square (their Z-codes).
       3. Compute the indexes of all regular-grid cells in these quad-tree tiles (the secondary index of rows and columns).
       4. Issue one sub-query for each selected quad-tree tile; process it with user-level coprocessors on the HBase regions.
       5. Collect the results of the sub-queries at the client side.
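Steps 1 to 3 of this procedure can be sketched in plain Python. The sketch only plans the sub-queries; issuing them to HBase coprocessors and merging the results (steps 4 and 5) are left out. The function names, the dictionary return shape, and the assumption that the space starts at (0, 0) are illustrative, not taken from the slides.

```python
def _local_range(lo, hi, origin, tile_km, cell_km, cells_per_tile):
    """Clip [lo, hi] to one tile and convert it to local grid indexes."""
    first = int((max(lo, origin) - origin) // cell_km)
    last = int((min(hi, origin + tile_km) - origin) // cell_km)
    return first, min(last, cells_per_tile - 1)


def range_query_plan(x, y, radius, tile_km=10.0, cell_km=1.0):
    """Map each overlapping tile to (row_lo, row_hi, col_lo, col_hi)."""
    n = int(round(tile_km / cell_km))
    # Step 1: minimum bounding square, clamped at the origin of the space
    # (the upper edge of the space is not clamped in this sketch).
    x_lo, x_hi = max(x - radius, 0.0), x + radius
    y_lo, y_hi = max(y - radius, 0.0), y + radius
    plan = {}
    # Step 2: quad-tree tiles that overlap the bounding square.
    for tc in range(int(x_lo // tile_km), int(x_hi // tile_km) + 1):
        for tr in range(int(y_lo // tile_km), int(y_hi // tile_km) + 1):
            # Step 3: regular-grid rows and columns to read inside the tile.
            rows = _local_range(y_lo, y_hi, tr * tile_km, tile_km, cell_km, n)
            cols = _local_range(x_lo, x_hi, tc * tile_km, tile_km, cell_km, n)
            plan[(tc, tr)] = rows + cols
    return plan


print(range_query_plan(9.5, 5.5, 1.0))
```

Each entry of the plan corresponds to one sub-query. Because the plan covers a square rather than a circle, the returned points still need a final distance check against the radius.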
  19. [Figure: grid with row and column labels 00, 02, 04, 06.]
  20. [Figure: the same grid, with keys 09-00 and 09-04 marked.]
  21. K-Nearest-Neighbors Query
       1. Estimate the search range (density-based range estimation).
       2. Compute the indexes of rows and columns (steps 2 and 3 of Range Query).
       3. Issue a scan query to retrieve the relevant data points.
       4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3.
       5. Sort the returned set by increasing distance from the input location.
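The estimate-scan-retry loop above can be sketched as follows. The slides do not say how the density-based estimate or the re-estimation is computed, so this sketch makes two assumptions: under uniform density a circle of radius r holds about density·π·r² points, which gives the initial radius, and the radius is doubled after a miss. An in-memory list stands in for the HBase scan, and the name `knn` is hypothetical.

```python
import math


def knn(points, x, y, k, density):
    """Return the k points nearest to (x, y).

    points:  list of (x, y) tuples standing in for the stored objects
    density: assumed average number of points per unit area (must be > 0)
    """
    k = min(k, len(points))
    # Step 1: density-based range estimation (assumed formula).
    radius = math.sqrt(k / (math.pi * density))

    def dist(p):
        return math.hypot(p[0] - x, p[1] - y)

    while True:
        # Steps 2-3: stand-in for computing cell indexes and scanning them.
        hits = [p for p in points if dist(p) <= radius]
        if len(hits) >= k:
            break
        radius *= 2          # Step 4: re-estimate the range (assumed policy)
    # Step 5: sort by increasing distance and keep the k closest.
    return sorted(hits, key=dist)[:k]


stations = [(0, 0), (1, 0), (3, 0), (10, 10)]
print(knn(stations, 0, 0, 2, density=100))
```

Once at least k points lie within the current radius, the k nearest points are guaranteed to be among them, so sorting the hits is sufficient.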
  22. Experiment Environment
       • A four-node cluster of virtual machines running Ubuntu on OpenStack
       • Hadoop 1.0.2 (replication factor 2), HBase 0.94
       • HBase configuration
           ▪ 5K caching size
           ▪ Block cache enabled
           ▪ ROWCOL bloom filter
      Query-Processing Implementation
       • Native Java API
       • User-level coprocessor implementation
  23. The granularity of the grid affects query-processing performance
       • Explore the “best” cell configuration of each model
           ▪ Quad-tree => (t = 1)
           ▪ RG => (t = 0.1)
           ▪ HGrid => (T = 10, t = 0.1)
      [Figure annotations: HG ≈ 10:0.1 gives fewer sub-queries but more false positives, while HG ≈ 1:0.1 gives more sub-queries but fewer false positives; HG ≈ 10:0.01 gives more rows, while HG ≈ 10:0.1 gives fewer rows.]
  24. Range query: given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location.
  25. Given the coordinates of a location, return the K points nearest to the location.
  28. Data Organization
       • Keep row keys and column names short
       • Prefer one column family and few columns
       • Avoid storing a large amount of data in one row
       • Design the row key so that unrelated data is easy to prune
       • The third dimension can store data as well
       • Configure the Bloom filter to prune rows and columns
       • Compression can reduce the amount of data transmitted
  29. Query Processing
       • The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries.
       • “Scan” is better than “Get” for retrieving discontinuous keys, even though it also reads unrelated data.
       • Use “Scan” for small queries and a coprocessor for large queries.
       • It is better to split one large query into multiple sub-queries than to use one query with a row-filter mechanism.
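The first lesson, splitting a query whose row range exceeds the scan cache, might look like the sketch below. It assumes rows can be addressed by consecutive integer indexes and uses the 5K caching size from the experiment setup as the example limit; the function name `split_scan` is hypothetical and no HBase API is involved.

```python
def split_scan(first_row, last_row, cache_rows):
    """Split the inclusive row range into sub-ranges of at most cache_rows rows."""
    return [
        (start, min(start + cache_rows - 1, last_row))
        for start in range(first_row, last_row + 1, cache_rows)
    ]


# A 12,000-row range with a 5,000-row scan cache becomes three sub-queries.
print(split_scan(0, 11999, 5000))
```

Each resulting pair would serve as the start and stop bounds of one sub-query.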
  30. HGrid
       • Benefits from the good locality of the RG index; suffers from the poor locality of the Z-ordering QT linearization
           ▪ Performance could be improved with other linearization techniques
       • Can be flexibly configured and extended
           ▪ The QT index can be replaced by the hash code of each sub-space
           ▪ The granularity in the second stage can vary from sub-space to sub-space, based on their densities
       • Is more suitable for homogeneously covered and discontinuous spaces
  31. • A data model for spatio-temporal datasets
      • Towards general, systematic guidance for column-family and other NoSQL databases
      • Applying the data model to cloud-based applications and big-data analytics systems
