A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915 Presentation Transcript

  • 1. Dan Han, Eleni Stroulia, University of Alberta. MESOCA 2012, 9/20/2012
  • 2. Outline
    » Background and Motivation
    » Related Work
    » A 3-Dimensional Data Model in HBase
    » Case Study and Experiment Results
    » Discussion
    » Conclusions and Future Work
  • 3. Migrating Applications To the Cloud
    » Cloud is an attractive computing platform
      ˃ Elasticity, Excellent Scalability, High Availability, Low Operating Cost
    » Applications are moving to the cloud
      ˃ Social networking, online shopping, monitoring systems
      ˃ Time-series data: grows monotonically over time
      ˃ Analysis of large-scale time-series data
        + May lead to new knowledge
        + May lead to improvements of existing services
    » Successful adoption of this migration requires a new model of storage
  • 4. Migrating RDBMS Content To NoSQL
    » From RDBMS to NoSQL storage systems
      ˃ Enable the storage of big data, ordered by row key
      ˃ Scale horizontally across storage nodes easily
      ˃ Not much data-organization support
    » Migration challenges
      ˃ Little experience and few principles to follow
      ˃ Steep learning curve for programming
      ˃ Much experimentation is required before deployment
        + Much time is spent in designing the data schema
        + The “wrong” schema may lead to inefficient, high-latency queries
  • 5. We need Design Patterns for HBase Schemas
    » Our objective is to develop a systematic method for guiding data organization in NoSQL databases, given
      ˃ The types of data stored
      ˃ The amount of data
      ˃ The data-usage patterns
    » We start our investigation with HBase
      ˃ A NoSQL database offering, built on top of Hadoop
      ˃ Parallel distributed computation
        + MapReduce Framework
        + Coprocessor Framework
  • 6. Related Work
    » Talks at HBaseCon 2012, held in May
      ˃ Data schema and Coprocessor are two main topics
      ˃ Experience from 30 enterprises, e.g., Facebook, Yapmap, eBay, Adobe
    » Organizing time-series data in period-specific “buckets”
      ˃ OpenTSDB: a distributed, scalable time-series database on top of HBase
      ˃ A data model in Cassandra, another NoSQL database offering
      ˃ Applied in our case study
  • 7. Data Organization in HBase
    » Cell in HBase
      ˃ (Row, Family:Column, Version) => (X, Y, Z) = value
    » [Figure: 3-D axes (X, Y, Z) versus 2-D axes (X, Y)]
    » Schema dimensions (Row; Family:Column; Version)
      ˃ 2-D: unique id - timestamp; varying properties; current timestamp
      ˃ 3-D: unique id; varying properties; timestamps
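The difference between the 2-D and 3-D layouts on slide 7 can be illustrated with a toy in-memory table. This is only a sketch: the nested dictionary stands in for an HBase table and is not the HBase client API, and the row keys, the `particle:mass` column, and the helper names are hypothetical.

```python
from collections import defaultdict


def make_table():
    # Toy stand-in for an HBase table: row -> family:column -> version -> value.
    return defaultdict(lambda: defaultdict(dict))


def put(table, row, column, version, value):
    # Store one cell, addressed by (row, family:column, version).
    table[row][column][version] = value


def get_versions(table, row, column):
    # Return all (version, value) pairs of one cell, newest version first.
    return sorted(table[row][column].items(), reverse=True)


# 2-D layout: the timestamp is part of the row key, so there is one row per
# (id, timestamp) pair. The version slot carries no application meaning here;
# 0 is a placeholder for the write-time timestamp HBase would assign.
flat = make_table()
put(flat, "p42-24", "particle:mass", 0, 1.5)
put(flat, "p42-29", "particle:mass", 0, 1.7)

# 3-D layout: one row per id, and the timestamps live in the version dimension.
cube = make_table()
put(cube, "p42", "particle:mass", 24, 1.5)
put(cube, "p42", "particle:mass", 29, 1.7)

print(len(flat), len(cube))
print(get_versions(cube, "p42", "particle:mass"))
```

In the 3-D layout the full history of one entity sits in a single row, which is what the later queries rely on.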
  • 8. Case study: The Datasets
    » Cosmology Dataset
      ˃ Product of an N-body simulation
      ˃ Three types of particles: dark matter, gas and star
      ˃ Particles evolve over a series of discrete timestamps
      ˃ Each snapshot records the properties of all particles at the time of the snapshot
      ˃ 9 snapshots, totaling 321,065,547 particles
    » Bixi Dataset
      ˃ Data from a bicycle-renting service in the city of Montreal
      ˃ Every minute, a sensor collects statistics about bike usage at each station
      ˃ 100,800 timestamps across 404 stations
  • 9. Three Schemas for the Cosmology Dataset
    » Schema dimensions (Row; Family:Column; Version)
      ˃ Schema1: sid-type-pid; particle properties; no meaning
      ˃ Schema2: type-pid; particle properties; snapshot id
      ˃ Schema3: type-reversedpid; particle properties; snapshot id
    » [Figure: X, Y, Z axes]
    » Example row keys per region (Schema1; Schema2; Schema3)
      ˃ Region: 24-2-33446666; 2-33446666; 2-00005533
      ˃ Region: 64-2-33559999; 2-33550000; 2-66664433
      ˃ Region: 84-2-33550000; 2-33559999; 2-99995533
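The three row-key layouts on slide 9 might be built as follows. This is a minimal sketch: the function names are hypothetical, and the `-` separator and unpadded decimal ids are assumptions read off the example keys on the slide, not a confirmed encoding.

```python
def schema1_key(sid, ptype, pid):
    # Schema1: snapshot id, particle type and particle id all in the row key.
    return f"{sid}-{ptype}-{pid}"


def schema2_key(ptype, pid):
    # Schema2: type and particle id; the snapshot id moves to the version.
    return f"{ptype}-{pid}"


def schema3_key(ptype, pid):
    # Schema3: like Schema2, but with the digits of the particle id reversed.
    return f"{ptype}-{str(pid)[::-1]}"


# Consecutive particle ids stay adjacent under Schema2 but are spread across
# the key space under Schema3, because the fastest-changing digit comes first.
for pid in (33550000, 33550001, 33550002):
    print(schema2_key(2, pid), schema3_key(2, pid))
```

Because HBase stores rows sorted by key, reversing the id changes which rows end up in the same region.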
  • 10. The cosmology dataset
    » Dataset called “cosmo50”
      ˃ 9 snapshots
    » Particles per snapshot (S-ID: star particles; total particles)
      ˃ 24: 1,291; 33,555,723
      ˃ 29: 5,568; 33,559,998
      ˃ 36: 20,246; 33,574,630
      ˃ 45: 67,268; 33,620,890
      ˃ 60: 259,219; 33,800,108
      ˃ 84: 907,025; 34,369,014
      ˃ 128: 2,743,966; 35,908,164
      ˃ 216: 6,396,955; 38,889,220
      ˃ 512: 12,417,544; 43,787,800
  • 11. Three Schemas for the Bixi Dataset
    » Schema dimensions (Row; Family:Column; Version)
      ˃ Schema1: hour-sid; minutes [0,59]; no meaning
      ˃ Schema2: hour-sid; monitoring metrics; minutes [0,59]
      ˃ Schema3: day-sid; monitoring metrics; minutes [0,1439]
    » [Figure: one panel per schema (Schema1, Schema2, Schema3), with axes labeled Time, X and metrics]
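To make the Bixi layouts on slide 11 concrete, the sketch below maps one sensor reading to a row key and a version number under Schema2 and Schema3. The function names and the date formatting of the hour and day parts are assumptions; the slide only gives the key shapes `hour-sid` and `day-sid`.

```python
from datetime import datetime


def schema2_cell(ts, sid):
    # Schema2: one row per (hour, station); version = minute of the hour, 0..59.
    row = f"{ts:%Y%m%d%H}-{sid}"
    return row, ts.minute


def schema3_cell(ts, sid):
    # Schema3: one row per (day, station); version = minute of the day, 0..1439.
    row = f"{ts:%Y%m%d}-{sid}"
    return row, ts.hour * 60 + ts.minute


reading_time = datetime(2010, 9, 24, 13, 5)
print(schema2_cell(reading_time, 17))
print(schema3_cell(reading_time, 17))
```

Schema3 packs 24 times more versions into each row than Schema2, so a one-day query touches one row per station instead of 24.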
  • 12. The Bixi dataset
    » A period of 70 days, from Sep 24, 2010 to Dec 1, 2010
    » 100,800 timestamps
    » 404 stations involved
    » Stored as XML
  • 13. Experiment Results
    » Experiment Environment
      ˃ A four-node cluster on virtual machines with Ubuntu
      ˃ Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
      ˃ HBase Configuration
        + Replication factor of 2
        + 5KB caching size
    » Queries for each dataset
      ˃ Three queries on the Cosmology dataset, from related research
      ˃ One query on the Bixi dataset, from a business requirement
    » Query-processing implementation
      ˃ Native Java API
      ˃ User-level Coprocessor implementation
  • 14. Query1 of Cosmology Dataset
    » Get all the particles of a type: star
    » in a single snapshot
    » with a given property: tform
    » whose property matches the expression
      ˃ [>0.01;84]
      ˃ [>0.08;128]
      ˃ [>0.05;128]
      ˃ [>0.08;216]
      ˃ [>0.08;512]
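Query1 can be sketched against a toy table laid out like Schema2 (row = type-pid, version = snapshot id). The dictionary below is an in-memory stand-in rather than HBase, and the row keys, the sample values, and the `query1` helper are hypothetical.

```python
def query1(table, ptype, prop, snapshot, threshold):
    # Return row keys of the given type whose property exceeds the threshold
    # in one snapshot. Rows are visited in key order, as in an HBase scan.
    prefix = f"{ptype}-"
    hits = []
    for row in sorted(table):
        if not row.startswith(prefix):
            continue
        value = table[row].get(prop, {}).get(snapshot)
        if value is not None and value > threshold:
            hits.append(row)
    return hits


# row -> property -> snapshot id (version) -> value
table = {
    "star-1": {"tform": {84: 0.02, 128: 0.02}},
    "star-2": {"tform": {128: 0.09}},
    "gas-3": {"tform": {84: 0.5}},
}

print(query1(table, "star", "tform", 84, 0.01))
print(query1(table, "star", "tform", 128, 0.08))
```

The two calls correspond to the expressions [>0.01;84] and [>0.08;128] from the slide.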
  • 15. Query2 of Cosmology Dataset
    » Get all the particles added/destroyed
    » between s1 and s2
      ˃ [29;24]
      ˃ [60;24]
      ˃ [84;24]
      ˃ [128;24]
      ˃ [216;128]
      ˃ [216;24]
      ˃ [512;24]
      ˃ [512;128]
      ˃ [512;216]
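Query2 amounts to a set difference between the particles present in two snapshots. The sketch below runs on a toy in-memory table with the snapshot id as the version; the helper names, the sample rows, and the reading of "added" as "present only in the later snapshot" are assumptions.

```python
def particles_in(table, snapshot):
    # Row keys that have at least one cell stored under this snapshot id.
    return {
        row
        for row, columns in table.items()
        if any(snapshot in versions for versions in columns.values())
    }


def query2(table, earlier, later):
    # Return (added, destroyed) row keys between two snapshots.
    before = particles_in(table, earlier)
    after = particles_in(table, later)
    return sorted(after - before), sorted(before - after)


# row -> property -> snapshot id (version) -> value
table = {
    "gas-1": {"particle:mass": {24: 1.0}},
    "gas-2": {"particle:mass": {24: 1.0, 29: 1.1}},
    "star-3": {"particle:mass": {29: 0.4}},
}

added, destroyed = query2(table, 24, 29)
print(added, destroyed)
```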
  • 16. Query3 of Cosmology Dataset
    » Get the values of a property
    » for a set of particle IDs
    » across the selected snapshots
      ˃ 10;[24]
      ˃ 10;[24,512]
      ˃ 10;[24,60,128,512]
      ˃ 10;[24,29,60,84,128,512]
      ˃ 10;[24,36,45,60,84,128,216,512]
      ˃ 50;[24,29,84,512]
      ˃ 50;[24,29,36,45,60,84,128,216,512]
      ˃ 100;[24,29,36,45,60,84,128,216,512]
      ˃ 150;[24,29,36,45,60,84,128,216,512]
  • 17. Query3 of Cosmology Dataset
    » Get the values of a property: star:eps
    » for a set of particle IDs: a continuous range of particle IDs
    » across the selected snapshots
      ˃ 10;[24]
      ˃ 10;[24,512]
      ˃ 10;[24,60,128,512]
      ˃ 10;[24,29,60,84,128,512]
      ˃ 10;[24,36,45,60,84,128,216,512]
      ˃ 50;[24,29,84,512]
      ˃ 50;[24,29,36,45,60,84,128,216,512]
      ˃ 100;[24,29,36,45,60,84,128,216,512]
      ˃ 150;[24,29,36,45,60,84,128,216,512]
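With the snapshot id stored as the version, Query3 becomes one row lookup per particle followed by a filter on versions. The sketch below uses a toy in-memory table; the `query3` helper, the row keys, and the sample values are hypothetical, while `star:eps` is the property named on the slide.

```python
def query3(table, ptype, pids, prop, snapshots):
    # For each particle id, return {snapshot id: value} for one property,
    # restricted to the selected snapshots.
    wanted = set(snapshots)
    result = {}
    for pid in pids:
        versions = table.get(f"{ptype}-{pid}", {}).get(prop, {})
        result[pid] = {s: v for s, v in sorted(versions.items()) if s in wanted}
    return result


# row -> property -> snapshot id (version) -> value
table = {
    "star-7": {"star:eps": {24: 0.1, 60: 0.2, 512: 0.3}},
    "star-8": {"star:eps": {60: 0.5}},
}

print(query3(table, "star", range(7, 9), "star:eps", [24, 512]))
```

A continuous id range, as on this slide, maps to adjacent row keys under Schema2, so the lookups could be served by a single scan.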
  • 18. Bixi Query
    » For a given list of stations: 200 stations
    » get average bike usage in a given period
      ˃ [1day]
      ˃ [2day]
      ˃ [4day]
      ˃ [8day]
      ˃ [16day]
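The Bixi query can be sketched over a toy table laid out like Schema3 (row = day-sid, version = minute of the day). The slide does not define "average bike usage" precisely, so this sketch assumes it is the mean of one metric over all readings in the period; the `metrics:bikes_in_use` column and the helper name are hypothetical.

```python
def average_usage(table, stations, days, metric):
    # Mean of one metric over all readings of the given stations and days.
    total = 0
    count = 0
    for day in days:
        for sid in stations:
            versions = table.get(f"{day}-{sid}", {}).get(metric, {})
            total += sum(versions.values())
            count += len(versions)
    return total / count if count else None


# row (day-sid) -> metric -> minute of day (version) -> value
table = {
    "20100924-1": {"metrics:bikes_in_use": {0: 2, 1: 4}},
    "20100924-2": {"metrics:bikes_in_use": {0: 6}},
    "20100925-1": {"metrics:bikes_in_use": {0: 8}},
}

print(average_usage(table, [1, 2], ["20100924"], "metrics:bikes_in_use"))
```

Each (day, station) pair is a single row read, so widening the period from 1 to 16 days grows the number of rows touched linearly.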
  • 19. Discussion
    » “Qualitative” versus “Quantitative” Suggestions
    » Dynamic Data versus Static Data
    » Historical Dataset versus Real-Time Datasets
    » Supported versus Non-Supported Datasets
  • 20. Conclusion
    » The objective is to make queries local
    » To do that, you have to design the right key, so that every query traverses a range of keys
      ˃ That contains all the answers
      ˃ That contains little irrelevant data
    » But hotspotting occurs when
      ˃ ???
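The idea of a query that traverses one contiguous range of keys can be shown on a sorted key list. This is a sketch of the principle only: `range_scan` is a hypothetical helper that mimics a scan with an inclusive start key and an exclusive stop key, and the keys are made-up examples in the type-pid style.

```python
import bisect


def range_scan(sorted_keys, start, stop):
    # Return the keys in [start, stop) from a lexicographically sorted list.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted(["2-33446666", "2-33550000", "2-33559999", "1-00000001"])

# All type-2 particles whose id starts with 3355 sit next to each other,
# so the query reads exactly the rows it needs and nothing else.
print(range_scan(keys, "2-3355", "2-3356"))
```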
  • 21. Conclusion
    » A 3-dimensional data model
      ˃ Data schemas that use the version dimension of HBase can improve performance
    » Fits “write-once, read-many” systems
      ˃ Monitoring systems
      ˃ Sensor-based systems
      ˃ Version-based analysis
  • 22. Future Work
    » More evaluation of this data model
      ˃ Scalability
      ˃ Elasticity
      ˃ Utilization
    » How to design data models for other datasets
      ˃ Spatial dataset
      ˃ Graphic dataset
  • 23. Questions? Thank you