A 3-Dimensional Data Model in HBase for Large Time-Series Datasets

Dan Han, Eleni Stroulia
University of Alberta
9/20/2012

MESOCA 2012 1

Outline
» Background and Motivation
» Related Work
» A 3-Dimensional Data Model in HBase
» Case Study and Experiment Results
» Discussion
» Conclusions and Future Work

Migrating Applications to the Cloud
» Cloud is an attractive computing platform
˃ Elasticity, Excellent Scalability, High Availability, Low Operating Cost

» Applications are moving to the cloud
˃ Social networking, online shopping, monitoring systems
˃ Time-series data: grows monotonically over time
˃ Analysis of large scale time-series data
+ May lead to new knowledge
+ May lead to improvements of existing services

» Successful adoption of this migration paradigm requires a new model of storage

Migrating RDBMS Content to NoSQL
» From RDBMS to NoSQL storage systems
˃ Enable the storage of big data, sorted by row key
˃ Scale horizontally across storage nodes easily
˃ Not much data-organization support

» Migration challenges
˃ Few experiences and principles to follow
˃ Steep learning curve for programming
˃ Much experimentation is required before deployment
+ Much time is spent in designing the data schema
+ The “wrong” schema may lead to inefficient, high-latency queries

We Need Design Patterns for HBase Schemas
» Our objective is to develop a systematic method for guiding data organization in NoSQL databases, given
˃ the types of data stored
˃ the amount of data
˃ the data-usage patterns

» We start our investigation with HBase
˃ A NoSQL database offering, built on top of Hadoop
˃ Parallel Distributed Computation
+ MapReduce Framework
+ Coprocessor Framework

Related Work
» Talks at HBaseCon 2012, held in May
˃ Data schemas and Coprocessors were the two main topics
˃ Experience from 30 enterprises, e.g., Facebook, Yapmap, eBay, Adobe

» Organizing time-series data in period-specific “buckets”
˃ OpenTSDB: a distributed, scalable time-series database on top of HBase
˃ A data model in Cassandra, another NoSQL database offering
˃ Applied in our case study

Data Organization in HBase
» Cell in HBase
˃ (Row, Family: Column, Version) => (X, Y, Z) = value
[Figure: 2-D layout with axes X and Y versus 3-D layout with axes X, Y, and Z]

Schema/dimension   Row                     Family: Column       Version
2-D                unique id - timestamp   varying properties   current timestamp
3-D                unique id               varying properties   timestamps
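The cell addressing on this slide can be pictured as a nested map. The following is a minimal Python sketch, not HBase client code; the class `VersionedTable`, its methods, and the sample row and column names are hypothetical and exist only for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory stand-in for an HBase table (hypothetical, for illustration).

    Each value is addressed by three coordinates:
    X = row key, Y = "family:column", Z = version.
    """

    def __init__(self):
        # row -> column -> version -> value
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, version, value):
        self._cells[row][column][version] = value

    def get(self, row, column, version=None):
        """Return one version, or the highest-numbered version if none is given."""
        versions = self._cells[row][column]
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)


table = VersionedTable()
# 3-D layout: one row per entity, one version per timestamp.
table.put("particle-42", "props:mass", 24, 1.5)
table.put("particle-42", "props:mass", 29, 1.7)
```

In the 2-D layout the timestamp would be part of the row key instead, so each (entity, timestamp) pair becomes its own row.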

Case Study: The Datasets
» Cosmology Dataset
˃ Product of an N-body simulation
˃ Three types of particles: dark matter, gas and star
˃ Particles evolve over a series of discrete timestamps
˃ Each snapshot records the properties of all particles at the time of the snapshot
˃ 9 snapshots, comprising 321,065,547 particle records in total
» Bixi Dataset
˃ Data from a bicycle-renting service in the city of Montreal
˃ Every minute, sensors collect usage statistics for the bikes at each station
˃ 100,800 timestamps, covering 404 stations

Three Schemas for the Cosmology Dataset

Schema/dimension   Row                Family: Column        Version
Schema1            sid-type-pid       particle properties   No meaning
Schema2            type-pid           particle properties   Snapshot id
Schema3            type-reversedpid   particle properties   Snapshot id

         Schema1         Schema2      Schema3
Region   24-2-33446666   2-33446666   2-00005533
Region   64-2-33559999   2-33550000   2-66664433
Region   84-2-33550000   2-33559999   2-99995533
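The three row-key layouts can be sketched as small key-building functions. This is an illustrative sketch only: the function names are hypothetical, and zero-padding particle ids to 8 digits is an assumption based on the sample keys shown on the slide.

```python
def schema1_key(sid, ptype, pid):
    """Schema1 row key: sid-type-pid (snapshot id leads the key)."""
    return f"{sid}-{ptype}-{pid:08d}"


def schema2_key(ptype, pid):
    """Schema2 row key: type-pid (snapshot id moves to the version dimension)."""
    return f"{ptype}-{pid:08d}"


def schema3_key(ptype, pid):
    """Schema3 row key: type-reversedpid (digits of the padded pid reversed)."""
    reversed_pid = f"{pid:08d}"[::-1]
    return f"{ptype}-{reversed_pid}"


# The three particle ids that appear in the region table on the slide.
pids = [33446666, 33550000, 33559999]
schema2_sorted = sorted(schema2_key(2, pid) for pid in pids)
schema3_sorted = sorted(schema3_key(2, pid) for pid in pids)
```

Sorting the keys reproduces the Schema2 and Schema3 columns of the region table: reversing the id changes the order in which the same particles are stored.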

The Cosmology Dataset
» Dataset called “cosmo50”
˃ 9 snapshots

S-ID   Star particles   Total particles
24     1,291            33,555,723
29     5,568            33,559,998
36     20,246           33,574,630
45     67,268           33,620,890
60     259,219          33,800,108
84     907,025          34,369,014
128    2,743,966        35,908,164
216    6,396,955        38,889,220
512    12,417,544       43,787,800

Three Schemas for the Bixi Dataset

Schema/dimension   Row        Family: Column       Version
Schema1            hour-sid   minutes [0,59]       no meaning
Schema2            hour-sid   monitoring metrics   minutes [0,59]
Schema3            day-sid    monitoring metrics   minutes [0,1439]

[Figure: axis layouts of Schema1, Schema2, and Schema3, with axes labelled X, time, and metrics]
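To make the three Bixi layouts concrete, the sketch below maps one sensor reading to its (row, column, version) coordinates under each schema. It is a hedged illustration: the function name, the `YYYYMMDDHH` and `YYYYMMDD` encodings of the hour and day, and the metric name `bikes` are assumptions, not details given on the slide.

```python
from datetime import datetime


def bixi_cell(schema, ts, station_id, metric):
    """Map one reading to (row, column, version) under schema 1, 2, or 3."""
    hour_row = f"{ts:%Y%m%d%H}-{station_id}"  # hour-sid (assumed encoding)
    day_row = f"{ts:%Y%m%d}-{station_id}"     # day-sid (assumed encoding)
    if schema == 1:
        # Column is the minute of the hour; the version carries no meaning.
        return (hour_row, str(ts.minute), None)
    if schema == 2:
        # Column is the metric; version is the minute of the hour, 0..59.
        return (hour_row, metric, ts.minute)
    if schema == 3:
        # Column is the metric; version is the minute of the day, 0..1439.
        return (day_row, metric, ts.hour * 60 + ts.minute)
    raise ValueError(f"unknown schema: {schema}")


reading_time = datetime(2010, 9, 24, 13, 5)
cells = [bixi_cell(s, reading_time, 42, "bikes") for s in (1, 2, 3)]
```

Under Schema3 a full day for one station sits in a single row, so a multi-day query touches one row per station per day rather than 24.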

The Bixi Dataset
» A period of 70 days, from Sep 24, 2010 to Dec 1, 2010
» 100,800 timestamps
» 404 stations involved
» Stored in XML format

Experiment Results
» Experiment Environment
˃ A four-node cluster of virtual machines running Ubuntu
˃ Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
˃ HBase Configuration
+ Replication factor of 2
+ 5 KB caching size

» Queries for each dataset
˃ Three queries on the Cosmology dataset, drawn from related research
˃ One query on the Bixi dataset, drawn from business requirements

» Query-processing implementation
˃ Native Java API
˃ User-level Coprocessor implementation

Query1 of Cosmology Dataset
» Get all the particles of a type: star
» in a single snapshot
» with a given property: tform
» whose property matches the expression
˃ [>0.01;84]
˃ [>0.08;128]
˃ [>0.05;128]
˃ [>0.08;216]
˃ [>0.08;512]
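A minimal sketch of this query over a toy in-memory table, assuming a Schema2-style layout in which the version is the snapshot id. The function name, the string type label `star` in the row key (the slides use numeric type codes), and all sample values are made up for illustration.

```python
def query1(table, ptype, prop, snapshot, threshold):
    """Return row keys of `ptype` particles whose `prop` at `snapshot` exceeds `threshold`.

    `table` maps a row key "type-pid" to {property: {snapshot_id: value}}.
    """
    prefix = f"{ptype}-"
    matches = []
    for row in sorted(table):
        if not row.startswith(prefix):
            continue  # skip particles of other types
        value = table[row].get(prop, {}).get(snapshot)
        if value is not None and value > threshold:
            matches.append(row)
    return matches


# Made-up sample data.
toy_table = {
    "gas-0001": {"tform": {84: 0.50}},
    "star-0001": {"tform": {84: 0.02, 128: 0.02}},
    "star-0002": {"tform": {128: 0.09}},
}
```

With this toy data, `query1(toy_table, "star", "tform", 84, 0.01)` corresponds to the case written `[>0.01;84]` on the slide.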

Query2 of Cosmology Dataset
» Get all the particles added/destroyed
» between s1 and s2
˃ [29;24]
˃ [60;24]
˃ [84;24]
˃ [128;24]
˃ [216;128]
˃ [216;24]
˃ [512;24]
˃ [512;128]
˃ [512;216]
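When the snapshot id is stored as the version, this query reduces to a set difference between the particles present at each snapshot. The sketch below assumes that reading; the function names and sample data are hypothetical, and it treats "added" as present only in the later snapshot and "destroyed" as present only in the earlier one.

```python
def particles_at(table, snapshot):
    """Row keys that have at least one property value stored at `snapshot`."""
    return {
        row
        for row, props in table.items()
        if any(snapshot in versions for versions in props.values())
    }


def added_and_destroyed(table, earlier, later):
    """Return (added, destroyed) row-key sets between two snapshot ids."""
    before = particles_at(table, earlier)
    after = particles_at(table, later)
    return after - before, before - after


# Made-up sample data: row key -> {property: {snapshot_id: value}}.
toy_table = {
    "gas-0001": {"mass": {24: 2.0}},
    "star-0001": {"mass": {24: 1.0, 29: 1.1}},
    "star-0002": {"mass": {29: 0.4}},
}
added, destroyed = added_and_destroyed(toy_table, 24, 29)
```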

Query3 of Cosmology Dataset
» Get the values of a property
» for a set of particle IDs
» across the selected snapshots
˃ 10;[24]
˃ 10;[24,512]
˃ 10;[24,60,128,512]
˃ 10;[24,29,60,84,128,512]
˃ 10;[24,36,45,60,84,128,216,512]
˃ 50;[24,29,84,512]
˃ 50;[24,29,36,45,60,84,128,216,512]
˃ 100;[24,29,36,45,60,84,128,216,512]
˃ 150;[24,29,36,45,60,84,128,216,512]

» Get the values of a property: star:eps
» for a set of particle IDs: a continuous range of particle IDs
» across the selected snapshots
˃ 10;[24]
˃ 10;[24,512]
˃ 10;[24,60,128,512]
˃ 10;[24,29,60,84,128,512]
˃ 10;[24,36,45,60,84,128,216,512]
˃ 50;[24,29,84,512]
˃ 50;[24,29,36,45,60,84,128,216,512]
˃ 100;[24,29,36,45,60,84,128,216,512]
˃ 150;[24,29,36,45,60,84,128,216,512]
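With snapshots stored as versions, the history of a property for one particle lives in a single row, so this query becomes one row lookup per particle. The following is a sketch under that assumption; the function name, the row-key labels, and the sample values are made up.

```python
def property_history(table, row_keys, prop, snapshots):
    """Return {row_key: {snapshot_id: value}} for the requested snapshots.

    `table` maps a row key to {property: {snapshot_id: value}}.
    Snapshots with no stored value for a particle are simply omitted.
    """
    result = {}
    for row in row_keys:
        versions = table.get(row, {}).get(prop, {})
        result[row] = {s: versions[s] for s in snapshots if s in versions}
    return result


# Made-up sample data.
toy_table = {
    "star-0001": {"eps": {24: 0.1, 60: 0.2, 512: 0.3}},
}
history = property_history(toy_table, ["star-0001", "star-0009"], "eps", [24, 512])
```

For a continuous range of particle ids, the list of row keys could be replaced by a single range scan over adjacent keys.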

Bixi Query
» For a given list of stations: 200 stations
» get average bike usage in a given period
˃ [1day]
˃ [2day]
˃ [4day]
˃ [8day]
˃ [16day]
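A possible shape for this query under the Schema3 layout (one row per station per day, minute of day as the version). This is a hedged sketch: the slide does not define "average bike usage", so the code simply takes the mean of one metric over all readings; the function name, the metric name `bikes`, the day encoding, and the data are assumptions.

```python
def average_usage(table, station_ids, days, metric):
    """Mean of `metric` over all readings of the given stations and days.

    `table` maps a row key "day-sid" to {metric: {minute_of_day: value}}.
    Returns None when no reading matches.
    """
    total = 0.0
    count = 0
    for day in days:
        for sid in station_ids:
            readings = table.get(f"{day}-{sid}", {}).get(metric, {})
            total += sum(readings.values())
            count += len(readings)
    return total / count if count else None


# Made-up sample data: two stations, one day, three readings in total.
toy_table = {
    "20100924-1": {"bikes": {0: 4, 1: 6}},
    "20100924-2": {"bikes": {0: 11}},
}
mean_bikes = average_usage(toy_table, [1, 2], ["20100924"], "bikes")
```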

Discussion
» “Qualitative” versus “Quantitative” Suggestions
» Dynamic Data versus Static Data
» Historical Dataset versus Real-Time Datasets
» Supported versus Non-Supported Datasets

Conclusion
» The objective is to make queries local
» To do that, you have to design the right key, so that all queries traverse a range of keys
˃ With all the answers in it
˃ With little irrelevant data in it
» But, hotspotting occurs when
˃???
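The key-design point about traversing a tight key range can be illustrated by counting how many rows a single range scan touches under two key layouts. This is a toy sketch: the `scan` helper, the small id ranges, and zero-padding the snapshot id to 3 digits so that string order matches numeric order are all assumptions made for illustration.

```python
from bisect import bisect_left


def scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), as a range scan over sorted row keys would."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]


sids = [24, 29, 36]
pids = range(1, 6)
# Schema1-style keys (sid-type-pid): one row per particle per snapshot.
schema1 = sorted(f"{sid:03d}-2-{pid:08d}" for sid in sids for pid in pids)
# Schema2-style keys (type-pid): one row per particle, snapshots held as versions.
schema2 = sorted(f"2-{pid:08d}" for pid in pids)

# History of particle 3 across all three snapshots.
# Schema1: a single scan must run from its first-snapshot row to its last-snapshot row.
traversed1 = scan(schema1, "024-2-00000003", "036-2-00000004")
# Schema2: the whole history sits in one row.
traversed2 = scan(schema2, "2-00000003", "2-00000004")
```

Here the Schema1 scan touches 11 rows to collect 3 relevant ones, while the Schema2 scan touches exactly the 1 row it needs. Schema1 could instead issue three separate lookups, at the cost of three round trips.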

Conclusion
» A 3-dimensional data model
˃ Data schemas that use the version dimension of HBase can improve performance
» Fits “write-once, read-many” systems
˃ Monitoring systems
˃ Sensor-based systems
˃ Version-based analysis

Future Work
» Further evaluation of this data model
˃ Scalability
˃ Elasticity
˃ Utilization
» How to design data models for other datasets
˃ Spatial datasets
˃ Graphic datasets

Questions?

Thank you