MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
  • Animate this slide
  • Scalability for data size: # of users; continuously generated data
  • High insertion throughput: # of users; data collection frequency
  • Efficient complex query performance: multi-dimensional range queries; k nearest neighbor queries
  • Near real-time: data goes stale easily
  • Synchronize text and figures
  • Put an example
  • Transcript

    • 1. MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services Shoji Nishimura (NEC Service Platforms Labs.), Sudipto Das, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara)
      • Work done as a visiting researcher at UCSB
      • Appeared in MDM 2011, Lulea, Sweden
    • 2. Overview
      • A Motivating Story
      • Existing Technologies
      • Our proposal
      • Evaluation
      • Conclusion
    • 3. Motivating Scenario: Mobile Coupon Distribution
      [Figure labels: Mobile Coupon Distributor, Coupon, Current Location]
      • Distribution Policy
      • Area
      • # of coupons
    • 4. Motivating Scenario: Mobile Coupon Distribution
      • 125,000,000 subscribers in Japan, each reporting a current location
      • Distribution Policy
      • Area
      • # of coupons
      • System scalability: large amounts of data, high throughput
      • Efficient complex queries: multi-dimensional query, nearest neighbors query
    • 5. Existing Technologies
      • Key-Value Stores: open source products; scalability
      • Relational DBs, Spatial DBs: multi-dimensional queries; commercial products but expensive
      • What we want: scalability and multi-dimensional queries at a reasonable price
    • 6. Ordered Key-Value Stores
      • Sorted by key
      • Good at 1-D range queries
      • ex. BigTable, HBase
      • But our target is multi-dimensional (longitude, latitude, time)…
      [Figure: an index over key ranges (key00, key11, …, keynn) pointing to buckets of sorted key-value pairs]
    • 7. Naïve Solution: Linearization
      • Projects n-D space to 1-D space
      • Apply a Z-ordering curve…
      • Simple, but problematic…
      [Figure: a 4×4 space whose cells are numbered 0 to 15 along a Z-ordering curve, mapped onto the index and buckets of the key-value store]
    • 8. Problem: False positive scans
      • MD-query on the linearized space
        • Translate the MD-query into a linearized range query.
          • Ex. Query from 2 to 9.
        • Scan the queried linearized range.
        • Filter out points that lie outside the queried area.
          • ex. blue-hatched area (4 to 7)
      • Requires the boundary information of the original space.
      [Figure: the 4×4 Z-ordered space with the linearized query range from 2 to 9 highlighted]
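The false-positive scan above can be reproduced with a small sketch. It assumes the 2-bit-per-dimension space from the slides and the interleaving order of slide 11 (x bit first, so x = 00, y = 11 gives 0101). The function names are illustrative and do not come from MD-HBase.

```python
from typing import List, Tuple

def interleave(x: int, y: int, bits: int = 2) -> int:
    """Z-order key: interleave the bits of x and y, x bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def deinterleave(z: int, bits: int = 2) -> Tuple[int, int]:
    """Inverse of interleave: recover (x, y) from a Z-order key."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y

def naive_range_scan(x_lo: int, x_hi: int, y_lo: int, y_hi: int,
                     bits: int = 2) -> Tuple[List[int], List[int]]:
    """Scan the whole linearized range, then filter cell by cell."""
    z_lo = interleave(x_lo, y_lo, bits)
    z_hi = interleave(x_hi, y_hi, bits)
    hits, false_positives = [], []
    for z in range(z_lo, z_hi + 1):
        x, y = deinterleave(z, bits)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append(z)
        else:
            false_positives.append(z)  # scanned, but outside the queried area
    return hits, false_positives

hits, wasted = naive_range_scan(1, 2, 0, 1)
print(hits)    # [2, 3, 8, 9]
print(wasted)  # [4, 5, 6, 7]
```

For the query rectangle x in [01, 10], y in [00, 01], the linearized range is 2 to 9, and half of the scanned cells (4 to 7) are discarded by the filter.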
    • 9. Our Approach: MD-HBase
      • Build a Multi-dimensional Index Layer on top of an Ordered Key-Value store
      [Figure: MD-HBase places a multi-dimensional index above the single-dimensional index of an ordered key-value store (ex. BigTable, HBase, …)]
    • 10. Introduce Multi-dimensional Index
      • Multi-dimensional Index (ex. the K-d tree, the Quad tree)
        • Divide a space into subspaces containing almost the same # of points
        • Organize the subspaces as a tree
      • Efficient subspace pruning -> avoids false positive scans
    • 11. Space Partition by the K-d tree
      • How do we represent these subspaces?
      • Bitwise interleaving: ex. x = 00, y = 11 -> 0101
      [Figure: the binary Z-ordering space (axes labeled 00, 01, 10, 11; cells labeled 0000 to 1111) next to the same space partitioned by the K-d tree]
    • 12. Key Idea: The longest common prefix naming scheme
      • Subspaces are represented as the longest common prefix of their keys (ex. 000*, 1***)
      • Remarkable property: preserves the boundary information of the original space
        • Ex. 1***: left-bottom corner, * -> 0, gives 1000 = (10, 00); right-top corner, * -> 1, gives 1111 = (11, 11)
      [Figure: the binary Z-ordering space with the subspaces 000* and 1*** marked]
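The boundary reconstruction on this slide can be sketched in a few lines. The sketch assumes subspace names are bit strings padded with `*` and that keys interleave x first, as in the slide's example. The helper name `corners` is hypothetical.

```python
def corners(prefix: str):
    """Return ((x_lo, y_lo), (x_hi, y_hi)) for a subspace name such as '1***'."""
    def split(key: str):
        # Even positions hold the x bits, odd positions hold the y bits.
        return int(key[0::2], 2), int(key[1::2], 2)
    lower = prefix.replace("*", "0")  # left-bottom corner: every * becomes 0
    upper = prefix.replace("*", "1")  # right-top corner: every * becomes 1
    return split(lower), split(upper)

print(corners("1***"))  # ((2, 0), (3, 3)), i.e. (10, 00) and (11, 11)
```

No side table of boundaries is needed: the name of the subspace is enough to recover its rectangle.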
    • 13. Build an index with the longest common prefix of keys
      • Index entries: 000*, 001*, 01**, 1***
      • Buckets are allocated per subspace
      [Figure: the partitioned Z-ordering space, with each subspace pointing to its own bucket]
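One way such an index could be grown is sketched below. It is an assumption-laden illustration, not the MD-HBase implementation: buckets live in an in-memory dict instead of an ordered key-value store, a bucket splits when it exceeds a hypothetical `capacity`, and a split always extends the prefix by one bit. The class and method names are invented for this example.

```python
class PrefixIndex:
    """Toy index whose bucket names are common key prefixes."""

    def __init__(self, key_bits: int = 4, capacity: int = 2):
        self.key_bits = key_bits
        self.capacity = capacity          # assumed bucket size limit
        self.buckets = {"": []}           # prefix (without '*') -> list of keys

    def _find(self, key: str) -> str:
        # Exactly one prefix of the key names an existing bucket.
        for n in range(self.key_bits + 1):
            if key[:n] in self.buckets:
                return key[:n]
        raise KeyError(key)

    def insert(self, key: str) -> None:
        p = self._find(key)
        self.buckets[p].append(key)
        # Split an overflowing bucket on the next bit of the key.
        while len(self.buckets[p]) > self.capacity and len(p) < self.key_bits:
            keys = self.buckets.pop(p)
            self.buckets[p + "0"] = [k for k in keys if k[len(p)] == "0"]
            self.buckets[p + "1"] = [k for k in keys if k[len(p)] == "1"]
            p = p + key[len(p)]           # follow the half that holds the new key

    def names(self):
        return sorted(p.ljust(self.key_bits, "*") for p in self.buckets)

idx = PrefixIndex()
for k in ["0000", "0001", "0010", "0110", "1100"]:
    idx.insert(k)
print(idx.names())  # ['000*', '001*', '01**', '1***']
```

With points concentrated in the lower-left corner, the sketch ends up with the same four subspaces as the slide: dense regions get long prefixes and sparse regions keep short ones.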
    • 14. Multi-dimensional Range Query
      • Index entries: 000*, 001*, 01**, 10**, 11**
      • Scan 0010-1001 on the index
      • Filter (subspace pruning): reconstruct the boundary info. and check whether each subspace intersects the queried area
      • Scan only the remaining buckets (ex. 001*, 10**)
      [Figure: the partitioned Z-ordering space with the queried area and the scanned buckets highlighted]
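The pruning step can be sketched as follows, using the five subspaces and the 0010 to 1001 query shown on this slide. The index is a plain Python list here, which is an assumption for illustration. In MD-HBase the index lives in the key-value store. The function names are hypothetical.

```python
def zkey(x: int, y: int, bits: int = 2) -> str:
    """Interleaved bit-string key, x bit first."""
    xb, yb = format(x, f"0{bits}b"), format(y, f"0{bits}b")
    return "".join(a + b for a, b in zip(xb, yb))

def corners(prefix: str):
    """Rebuild the subspace rectangle from its name (* -> 0 and * -> 1)."""
    def split(key: str):
        return int(key[0::2], 2), int(key[1::2], 2)
    return split(prefix.replace("*", "0")), split(prefix.replace("*", "1"))

def buckets_to_scan(index, x_lo, x_hi, y_lo, y_hi, bits=2):
    z_lo, z_hi = zkey(x_lo, y_lo, bits), zkey(x_hi, y_hi, bits)
    selected = []
    for prefix in index:
        first, last = prefix.replace("*", "0"), prefix.replace("*", "1")
        # Step 1: the index scan keeps subspaces overlapping the linearized range.
        if last < z_lo or first > z_hi:
            continue
        # Step 2: prune subspaces whose rectangle misses the queried area.
        (bx_lo, by_lo), (bx_hi, by_hi) = corners(prefix)
        if bx_hi < x_lo or bx_lo > x_hi or by_hi < y_lo or by_lo > y_hi:
            continue
        selected.append(prefix)
    return selected

index = ["000*", "001*", "01**", "10**", "11**"]
print(buckets_to_scan(index, 1, 2, 0, 1))  # ['001*', '10**']
```

The subspace 01** overlaps the linearized range but not the queried rectangle, so it is dropped before any bucket is read. That is the false-positive region (4 to 7) from the naïve approach.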
    • 15. K Nearest Neighbors Query
      • The best-first algorithm can be applied.
        • The most efficient technique in practical cases
      • See our paper for the details.
      [Figure: regions labeled 1 to 5]
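A generic best-first search over buckets might look like the sketch below. It is not the algorithm from the paper, whose details the slide defers. It assumes each bucket is described by its rectangle corners and holds its points in memory, and it visits buckets in order of their minimum possible distance to the query point. All names are illustrative.

```python
import heapq

def min_dist2(q, lo, hi):
    """Squared distance from point q to the rectangle [lo, hi] (0 if inside)."""
    dx = max(lo[0] - q[0], 0, q[0] - hi[0])
    dy = max(lo[1] - q[1], 0, q[1] - hi[1])
    return dx * dx + dy * dy

def knn(buckets, q, k):
    """buckets maps (lo, hi) corner pairs to lists of (x, y) points."""
    frontier = [(min_dist2(q, lo, hi), lo, hi) for (lo, hi) in buckets]
    heapq.heapify(frontier)
    best = []  # max-heap of the k closest points, stored as (-dist2, point)
    while frontier:
        d, lo, hi = heapq.heappop(frontier)
        # Stop once the nearest unvisited bucket cannot beat the k-th neighbor.
        if len(best) == k and d > -best[0][0]:
            break
        for p in buckets[(lo, hi)]:
            d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            if len(best) < k:
                heapq.heappush(best, (-d2, p))
            elif d2 < -best[0][0]:
                heapq.heapreplace(best, (-d2, p))
    return [p for _, p in sorted((-nd, p) for nd, p in best)]

# Rectangles of the subspaces 000*, 001*, 01**, 1*** with made-up points.
buckets = {
    ((0, 0), (0, 1)): [(0, 0), (0, 1)],
    ((1, 0), (1, 1)): [(1, 1)],
    ((0, 2), (1, 3)): [(0, 3)],
    ((2, 0), (3, 3)): [(3, 3), (2, 0)],
}
print(knn(buckets, (0, 0), 2))  # [(0, 0), (0, 1)]
```

In this example only two of the four buckets are read. The remaining two are skipped because their rectangles are farther away than the current second-nearest point.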
    • 16. Variations of Storage Layer
      • Table Share Model
        • Uses a single table and maintains bucket boundaries
        • Most space-efficient
        • Bucket co-location may cause disk access congestion
      • Table per Bucket Model
        • Allocates a table per bucket
        • Most flexible mapping
          • One-to-one, one-to-many, many-to-one
        • Bucket split is expensive
          • Copy all points to the new buckets.
      • Region per Bucket Model
        • Allocates a region per bucket
        • Most efficient bucket split
          • Asynchronous bucket split
        • Requires modification of HBase
    • 17. Experimental Results: Multi-dimensional Range Query
      • Dataset: 400,000,000 points
      • Queries: select objects within MD ranges while varying the selectivity
      • Cluster size: 4 nodes
      • MD-HBase responds 10 to 100 times faster than the others, with response time proportional to selectivity.
    • 18. Experimental Results: k Nearest Neighbors Query
      • Dataset: 400,000,000 points
      • Queries: choose a point and vary the number of neighbors
      • Cluster size: 4 nodes
      • MD-HBase responds in 1.5 s when k ≤ 100, and in 11 s even when k = 10,000.
    • 19. Experimental Results: Insert
      • Dataset: spatially skewed data generated from a Zipfian distribution
      • MD-HBase shows good scalability without significant overhead.
    • 20. Conclusions
      • Designed a scalable multi-dimensional data store.
        • Scalability & Efficient multi-dimensional queries
        • Key Idea: indexing the longest common prefix of keys
        • Easily extends general ordered key-value stores.
      • Demonstrated scalable insert throughput and excellent query performance.
        • Range Query: 10-100 times faster than existing technologies.
        • kNN Query: 1.5 s when k ≤ 100.
        • Insert: 220K inserts/sec on a 16-node cluster without significant overhead
      • Thank you.
      • Any Questions?