1. Design Patterns of HBase Configuration
Dan Han
Supervisor: Eleni Stroulia and Paul Sorenson
Department of Computing Science
University of Alberta, Canada
2. Data Modeling in HBase
Time–Series datasets (MESOCA 2012)
Geospatial datasets (CLOUD 2013)
Migrating
an existing application with geospatial, temporal and
categorical data
to SAVI, a hierarchical cloud
Contributions
Future Work
My Agenda for Today
Time-Series Geo-Spatial Migration to Cloud Conclusion
1
3.
RDBMS
Application
Server
Web
Server
Data Warehouse
Our Motivation: Big Data Problem
2
real-time or near real-time
4.
NoSQL
database
RDBMS
Too big
Too fast
Too hard
for existing tools
Our Motivation: Big Data Problem
3
5. Our Motivation: Application Extension
RDBMS
Web
Server
Application
Server
Data
Analytics
NoSQL
database
4
In-house IT system
6.
Cloud Infrastructure + NoSQL Storage Platform
A New Software Architecture + NoSQL expertise
Good Elasticity
Excellent Scalability
High Availability
Low Cost
NoSQL
database
Our Motivation: Application Migration
5
Data
Analytics
8. To develop guidelines for data organization
On HBase
A NoSQL database built on top of Hadoop
With a coprocessor framework (for parallel
processing)
Given
the application data model
the amount of data involved
query patterns
The Research Problem
Focusing on
Time-Series and
Geospatial data
7
10. Querying Time-Series Datasets
Modeling Data in HBase (1/2)
9
11. In an order-dependent manner
Typical applications
Sensor-based applications
Monitoring applications
Analysis Questions
What happened?
Why did it happen? (what happened before)
What may happen next?
Time-Series Datasets
10
12. Case Study: The Cosmology and Bixi Datasets
Cosmology dataset: 9 snapshots, 321,065,547 particles, 12 metrics
Bixi dataset: 100,800 timestamps, 404 stations, 12 metrics
Gas:
iorder, mass, x, y, z, velocity x, y, z,
phi, rho, temp, hsmooth, metals
Dark Matter:
iorder, mass, x, y, z, velocity x, y, z,
phi, eps
Star:
iorder, mass, x, y, z, velocity x, y, z,
phi, metals, tform, eps
11
13. Inspired by OpenTSDB and Facebook
Messages
Data Model
Row key: object id-period timestamp
Column name: object attributes
Version: offset of period-timestamp
A 3-Dimensional Data Model
Column
Row Key
12
14. Experiment Environment
A four-node cluster on virtual machines with Ubuntu
Hadoop 0.20, HBase 0.93-snapshot
Queries for each dataset
Three queries of Cosmology dataset from related research
One query of Bixi dataset from business requirement
Query processing implementation
Native Java API
User-Level Coprocessor Implementation
The Experiment
HBase Cluster
13
Schema | Row key | Family:Column | Version
Schema1 | sid-type-pid | particle properties | no meaning
Schema2 | type-pid | particle properties | snapshot id
Schema3 | type-reversedpid | particle properties | snapshot id

Example row keys across regions:
Schema1: 24-2-33446666, 64-2-33559999, 84-2-33550000
Schema2: 2-33446666, 2-33550000, 2-33559999
Schema3: 2-00005533, 2-66664433, 2-99995533

Three Schemas for the Cosmology Dataset
14
16. Query1: within one snapshot
Get all the particles of a type in a single snapshot
with a given property matching an expression
Query2: across two snapshots
Get all the particles added/destroyed between
two snapshots
Query3: across multiple snapshots
Get the values of a property for a continuous
range of particles, across a set of snapshots
Queries of Cosmology Dataset
15
17. Query1: Single Snapshot Selection
The data schema substantially impacts performance
16
18. Query2: Comparison Across two Snapshots
Using the 3rd dimension improves locality and consequently performance.
17
19. Query3: Projection Across Multiple Snapshots
The row-key design is a key aspect of the schema design.
18
20. Three Schemas for the Bixi Dataset
Schema | Row key | Family:Column | Version
Schema1 | hour-sid | minutes [0,59] | no meaning
Schema2 | hour-sid | monitoring metrics | minutes [0,59]
Schema3 | day-sid | monitoring metrics | minutes [0,1439]

[Figure: for each schema, how row, time, and metrics map onto the three dimensions]
19
21. Query3: Projection and Stats Across Snapshots
The period length (defining the 3rd dimension stack) is a key decision.
Get average bike usage in a given period (30 days) for a
given list of stations (200)
20
22. Investigated how to store time-series data
We explored data with
Few versions, many objects (Cosmology)
Many versions, few objects (Bixi)
Examined the impact of row-key design and versioning
Guidelines
“Few versions, many objects”: disperse the sequential data with row
key design, e.g. reversed object id
“Many versions, few objects”: versions can be deeper
These findings are applicable to “write-once, read-many”
systems
Contributions (1)
21
23. Querying Geospatial Datasets
Modeling Data in HBase (2/2)
22
24. Multi-dimensional data
Locations (latitude, longitude)
attributes
Applications
Location-aware applications
Analysis Questions
Who are my neighbors?
Which restaurants are close to me?
Geospatial Datasets
23
25. Nishimura et al:
built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.
Hsu et al:
presented a novel key formulation scheme, based on the R+-tree, for spatial indexing in HBase.
Focus on row-key design
no discussion about columns and versions
Geospatial Dataset Schema Design: Related Work
24
26. Two Synthetic Datasets
Uniform and ZipF distribution
Based on Bixi dataset, each object includes
station ID,
latitude, longitude, station name, terminal name,
number of docks
number of bikes
100 Million objects (70GB)
in a 100km*100km simulated space
Case Study: The Datasets
25
27. Trie-based quad-tree Indexing
Z-value Linearization
Data Model
Row key: Z-value
Column: Object ID
Value: one object in JSON Format
A typical Data Model: Quad-Tree
Z-Value
Object ID
05 07 13 15
04 06 12 14
01 03 09 11
00 02 08 10
Z-value
26
28. Regular Grid Indexing
Data Model
Row key: Grid rowID
Column: Grid columnID
Version: counter of Objects
Value: one object in JSON format
A typical Data Model: Regular Grid
Column ID
RowID
00 01 02 03
00
01
02
03
27
29. Two Possible Data Models
Quad-Tree
More rows with deeper
tree
Z-ordering linearization
(violates data locality)
In-time construction vs.
pre-construction implies a
tradeoff between query
performance and memory
allocation
Regular Grid
Very easy to locate a cell
by row id and column id
Cannot handle large
space and fine-grained
grid because in-memory
indexes are subject to
memory constraints
How much unrelated data is examined in a query matters a lot!
28
30. A Hybrid Model: the HGrid
Columnid-ObjectId
QTId-RowId
29
31. HGrid: Index Structure Construction
The row key is the QT Z-value + the RG row index.
The column name is the RG column index + the object ID.
The attributes of the data point are stored in the third dimension.
30
32. HGrid: Serialized Data at the Physical Level
[Figure: data points of the four tiles 00, 01, 10, 11 (labeled A-D) serialized row by row at the physical level]
31
33. Experiment Environment
A four-node cluster on virtual machines with Ubuntu
on OpenStack
Hadoop 1.0.2 (replication factor is 2), HBase 0.94
Query processing Implementation
Native Java API
User-Level Coprocessor Implementation
Typical Spatial Queries
Range Query and kNN Query
The Experiment
HBase Cluster
32
34. Given a location and a radius,
Return the data points, located within a distance less
or equal to the radius from the input location
Range Query
33
35. Given the coordinates of a location,
Return the K points nearest to the location
kNN Query
34
36. HGrid: A Geospatial Data Model that Performs
Worse than Regular-Grid data model
Much better than the Quad-Tree data model
Broader applicability: Both Quad-tree and Regular-Grid
data models suffer from memory constraints
HGrid is not subject to memory constraints
Qualities of HGrid
Benefits from the good locality of the regular-grid index
Suffers from the poor locality of Z-ordering linearization
Contributions (2)
35
37. Migrating an Existing Geospatial Application
to a Hierarchical Cloud
A Migration Case Study
SAVI
36
39. Home Care Aides Technology
Scheduling Service: schedule care plan
Assistant Service: audio/images/video/notes
Location Service: plan the path
The New Requirement
Analyze the data as a whole (Nationally)
The HCAT Application
MySQL
Web
Server
Tomcat
Server
38
40.
Given the need for
Low latency for end users
Centralized data analysis
How to re-architect HCAT application for all
users across Canada on SAVI cloud?
HCA-T to HCA-T2
39
Core
Smart
Edges
BC
ON
45. A Federation-style Architecture on SAVI Cloud
BC
Tomcat Server
Web Server
MySQL
ON
Tomcat Server
Web Server
MySQL
44
46. HBase
Core
DAAS
A Federation-style Architecture on SAVI Cloud
BC
Tomcat Server
Web Server
MySQL
ON
Tomcat Server
Web Server
MySQL
Data Analytics
45
AB QC…
48. The Most Challenging Research Problem
How can we transform data
schemas from RDBMSs to HBase?
47
49. Li:
Three guidelines for the transition to de-normalize the original relations.
Schram et al.:
A case study of mapping the Twitter data schema from MySQL to
Cassandra to support crisis informatics.
Gupta et al.:
Four transition guidelines from a traditional data warehouse to Hive, based on the universal data model in relational databases.
De-normalization was suggested
No discussion of how to model the data in HBase for efficient query performance.
Data Migration from RDBMSs to HBase: Related Work
48
50. Data Schema Transition from RDBMSs to HBase
1. Classify relations into active and inactive
2. De-normalize relations
3. Apply appropriate data models: descriptive DM, time-series DM, geospatial DM, ...
4. Adjust and optimize
49
54. Sample Queries
Window query
Which area (East/West/North/South) has the most/fewest
patients per region in 2012/2011?
Range query
Find the neighbors of a given client based on his/her
home location and a given distance.
Time-series statistical query
Get the total number of appointments/services/uploaded
images per week/month in a given region in 2012/2011.
53
55. Experiment Environment
SAVI Core: 10 VMs for DAAS
SAVI smart edges: Two HCAT application instances
Hadoop 1.0.2, HBase 0.94, Sqoop 1.4.3, Oozie 3.3.2
Dataset
200GB per province (one dataset duplicated for both)
Sets of Experiments
One set for migration performance evaluation
Two sets for query performance evaluation
The Experiment
Core Ontario edge
HCAT
BC edge
HCAT
DAAS
54
56. Aggregate Statistics Over Time
How many appointments took place in Ontario and BC
in a given period of time (from 1 week to 3 months)?
55
57. Aggregate Statistics Over Time
How many appointments took place in Ontario and BC
in a given period of time (from 1 week to 3 months)?
56
58. Aggregate Statistics Over Time
How many appointments took place in BC
in a given period of time (from 1 week to 3 months)?
57
59. We proposed a novel federation-style
architecture
We constructed a systematic way of
migrating geospatial applications to this
architecture
We proposed a method of transforming data
schemas from RDBMSs to HBase
Contributions (3)
58
60. Data Model in HBase
Time-Series dataset (MESOCA 2012)
Geospatial dataset (CLOUD 2013)
Migrating an existing application to cloud
Designed a novel federation-style architecture
Proposed a method for transforming data schemas from RDBMSs
to HBase
A practical case study
Presented the application of the above guidelines in the design
Verified the aforementioned data schema transition method
Concluding Summary
59
62. Investigate the data beyond the geo-spatial domain
text, images and videos from social network application
Investigate other NoSQL databases, beyond HBase
Key-Value Store, Document databases, and Graph databases
The method of transforming data schemas from
RDBMSs to HBase
Needs to be more general for various data
Future Work
61
Hello, everyone.
Today, I’ll talk about my thesis entitled “design patterns of HBase configuration”
In this thesis, there are two main problems we have addressed.
The first problem is how to model the data in HBase for a given application.
The second problem is how to migrate an existing geospatial application to a hierarchical cloud.
I will explain them in detail following this structure.
In the end, I will conclude with some future work.
There is a typical class of applications, which are structured in three layers.
The base layer is the database which is to store the data.
The middle layer is the application server, which manages all the application logics.
And web server handles all the requests from both web client and mobile client in the front layer.
Over the years, these applications have generated large amounts of data.
Traditionally, this data has been managed by a data warehouse.
Offline business intelligence is performed on the data warehouse to help executives understand their businesses and make decisions.
As the data grows rapidly, the warehouses and the solutions built around them can no longer ensure reasonable response times.
Making this problem even more challenging is the fact that these applications want to evolve by collecting more data and analyzing it in real time, or near real time.
In addition, some new analytics-based applications, such as the Insights of the Facebook Ad Network, Google Analytics, and recommendation systems, are also a big challenge for traditional database systems because of the large amounts of structured and unstructured data involved.
This problem is called “Big data”.
Sam Madden, a professor at MIT, said, and I quote, “Big data means that the data is too big, too fast, or too hard for existing tools”.
RDBMSs cannot help us solve the problem.
Compared to RDBMSs, NoSQL databases are becoming more and more attractive for these big-data applications.
To use NoSQL databases to address the problem,
There are two solutions in terms of the underlying infrastructure.
The first one is to build up an in-house NoSQL database system on their own IT infrastructure.
The second one is to migrate the application to cloud.
Some applications prefer this way because cloud offers many attractive features beyond scalability.
In both cases, a new software architecture is required, as well as NoSQL expertise: someone who knows how to model the data and prepare it for analysis.
Our objective is to re-architect the traditional SQL+REST applications to use NoSQL data stores for either migration or extension purpose.
Unfortunately, to date, there is very little methodological and tool support for software development on NoSQL databases.
In this work, we want to build a systematic method to guide developers to organize the data on HBase given the application data model and query patterns.
And we focused on time-series and geospatial applications, because they are crucial for understanding performance problems in databases.
Here is a little background on HBase.
HBase is an implementation of a NoSQL data store.
It stores the data in a Bigtable-style table, which is structured with row key, column family, column, and version.
The first problem is about time-series data.
Time-series data is generated in an order-dependent manner.
Time-series data analysis is very useful for answering questions like “What happened? Why did it happen? What may happen next?”
This kind of data is generated by many applications, such as sensor-based systems and monitoring systems. Ganglia, as an example of a monitoring system, monitors high-performance computing systems such as clusters and grids.
In our case study, we used the Cosmology dataset and the Bixi dataset.
Cosmology dataset was produced by an N-Body simulation of the universe evolution.
There are three types of particles: dark matter, gas, and star.
Particles evolve over time. Each snapshot records all particles at the corresponding timestamp.
In our dataset, there are 9 snapshots consisting of 321,065,547 particles.
Bixi dataset is from a bicycle-renting service in Montreal.
The data was collected every minute from the sensors.
There are 100,800 timestamps for 404 stations.
Our idea was inspired by two related works: OpenTSDB groups the data collected within a period of time into one row,
and the Facebook Messages system stores the message id in the third dimension.
We combined these two ideas and proposed a 3-dimensional data model to manage time-series datasets.
In this data model,
The first dimension is the combination of object Id and the timestamp of a period.
The columns are the varying attributes of objects
The 3rd dimension is the offset of timestamp in a period.
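To make the mapping concrete, here is a minimal sketch of how a reading's row key and version could be derived under this scheme; the helper names and the one-hour period length are assumptions for illustration, not the thesis implementation.

```java
// Hypothetical sketch of the 3-D time-series key scheme: one row groups
// an object's readings for one period; the HBase version slot stores the
// offset of each reading within that period.
public class TimeSeriesKey {
    static final long PERIOD_MS = 3_600_000L; // assumed period length: one hour

    // Row key: "<objectId>-<periodStartTimestamp>"
    static String rowKey(String objectId, long timestampMs) {
        long periodStart = (timestampMs / PERIOD_MS) * PERIOD_MS;
        return objectId + "-" + periodStart;
    }

    // Version (3rd dimension): offset of the timestamp within its period
    static long version(long timestampMs) {
        return timestampMs % PERIOD_MS;
    }
}
```

All readings of one object within the same hour thus land in one row, stacked in the version dimension.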
We did our experiments on a four-node cluster.
We evaluated the data model with Cosmology dataset and Bixi dataset
To evaluate performance implication of the third dimension, we designed one two-dimensional data schema, and two three dimensional data schemas. The difference between the two three-dimensional data schemas is the row key.
Based on requirements from related research, we designed three queries for the Cosmology dataset.
The queried data in the first one is from one snapshot.
And the queried data in query 2 & 3 come from multiple snapshots.
From the experiment results, we can see that the performance of query 1 differs greatly under the three schemas.
This leads us to conclude that the data schema has a large impact on query performance.
This conclusion was confirmed by the subsequent experiments.
For the second query, schema 1, the two-dimensional instance, performed worse than the other, three-dimensional, schemas. So we can see that performance can be improved with the 3rd dimension.
This conclusion was confirmed again by the performance of query 3.
By comparing the performance of schema 2 and schema 3 on query 3, we also learned that row-key design is especially important in data modeling.
We also did the same thing with the Bixi dataset.
To investigate the performance implications of the depth of the third dimension,
we designed schema 3 with more elements in the version dimension than schema 2.
The results show that schema 3 performs better than schema 2. So we can conclude that more performance can be obtained by localizing more values in the 3rd dimension.
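The contrast between the two three-dimensional Bixi schemas can be sketched as follows; the helper names are illustrative, and only the row-key shapes and version ranges come from the schema table.

```java
// Hypothetical sketch of the two 3-D Bixi schemas: Schema2 stacks up to
// 60 versions per row (minute of hour), Schema3 stacks up to 1440
// (minute of day), making the third dimension deeper.
public class BixiKeys {
    // Schema2: one row per station-hour, version = minute within the hour
    static String schema2RowKey(int hour, String stationId) {
        return hour + "-" + stationId;
    }
    static int schema2Version(int minuteOfDay) {
        return minuteOfDay % 60;      // [0,59]
    }

    // Schema3: one row per station-day, version = minute within the day
    static String schema3RowKey(int day, String stationId) {
        return day + "-" + stationId;
    }
    static int schema3Version(int minuteOfDay) {
        return minuteOfDay;           // [0,1439]
    }
}
```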
We investigated how to store time-series data using two datasets with different characteristics.
Geospatial data is multi-dimensional data, including locations and object attributes.
A very typical example is Foursquare, a location-based social networking service for mobile devices; it allows users to connect with friends based on their check-in locations.
Geospatial data analysis is necessary to answer questions like: Who are my neighbors? Which restaurants are close to me?
We first reviewed two related works.
But they only focused on the row-key design; we also explored the performance implications of the column and version design.
In our study, we used two synthetic datasets, generated from the Bixi dataset introduced before. We augmented the number of stations from 404 to one hundred million, placing them at random coordinates following uniform and Zipf distributions.
This dataset basically represents one hundred million objects in a 100km * 100km simulated space.
We began our work with quad-tree data model.
It relies on a trie-based quad-tree index and applies Z-ordering to transform the two-dimensional spatial data into a one-dimensional array.
In this model, the row key is the Z-value, and the column is the id of the object located in the cell.
Usually the cells are encoded with binary digits, but since row keys should be short in HBase, we use a decimal encoding here.
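A minimal sketch of the Z-ordering step, assuming the standard bit-interleaving construction implied by the slide's 4x4 grid (cell indices counted from the bottom-left):

```java
// Z-value linearization sketch: interleave the bits of the cell's
// column index (x) and row-from-bottom index (y), with the x bit taking
// the higher position of each pair. This reproduces the slide's grid,
// e.g. cell (2,0) -> 8, cell (3,3) -> 15.
public class ZOrder {
    static int zValue(int x, int y) {
        int z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((y >> i) & 1) << (2 * i);     // y bit -> even position
            z |= ((x >> i) & 1) << (2 * i + 1); // x bit -> odd position
        }
        return z;
    }
}
```

Nearby cells can receive distant Z-values (e.g. the jump between columns 1 and 2), which is the locality violation discussed later.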
The next data model we worked on is the regular-grid data model.
It relies on a regular-grid index. The row key is the row index of the cell in the grid, and the column is the column index of the cell. The version is a counter of objects, so the third dimension holds a stack of data points located in the same grid cell. The value is one object in JSON format, holding all other attributes and values.
After investigating these two data models, we found that it is easier to prune unrelated data with the regular-grid data model than with the quad-tree data model, because the Z-ordering linearization in the quad-tree data model violates data locality. Both of them, however, have issues with limited memory resources.
===========================================================
QT If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the size of the grid cells becomes smaller.
RG The third dimension holds a stack of data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
Considering the advantages and disadvantages of these two data models, we proposed HGrid data model including a hybrid index structure.
In this index structure, the data-set space is divided into equally-sized rectangular tiles T, encoded with their Z-value. And the data points are organized in a regular grid consisting of continuous uniform fine-grained cells.
In this model, each data point is uniquely identified in terms of its row key and column name.
The row key is the concatenation of the quad-tree Z-value and the Regular Grid row index.
The column name is the concatenation of the Regular Grid column index and the object id of the data point.
The attributes of the data point are stored in the third dimension.
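As an illustration, the HGrid addressing could be composed as below; the helper names are hypothetical, and a production schema would also zero-pad the numeric parts so that lexicographic byte order matches numeric order:

```java
// Hypothetical composition of HGrid coordinates: a point is located by
// its coarse quad-tree tile (Z-value) plus its fine regular-grid cell.
public class HGridKeys {
    // Row key: quad-tree tile Z-value + regular-grid row index
    static String rowKey(int tileZValue, int rgRow) {
        return tileZValue + "-" + rgRow;
    }
    // Column name: regular-grid column index + object id
    static String columnName(int rgCol, String objectId) {
        return rgCol + "-" + objectId;
    }
}
```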
This graph shows how the data is stored in physical level with HGrid data model.
Our experiments were performed on a four-node Hadoop and HBase cluster.
We implemented the Range Query and KNN query for three data models with both uniform and zipf distribution data.
The range query: given a location and a radius, return the data points located within the radius of the input location.
The results show that the HGrid data model is much better than the quad-tree data model and worse than the regular-grid data model.
The same performance trends persist with both uniform and skewed data.
================================
Comparing the three models, we can see that the regular-grid data model outperforms the others.
Because it supports better data locality, it demonstrates better performance since the percentage of irrelevant rows scanned is low.
The kNN query: given the coordinates of a location, return the K points nearest to that location.
We now evaluate the performance for k Nearest Neighbor (kNN) queries using the same data set, under the three data models. This table shows the response time (in seconds) for kNN queries, where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range estimation method is employed , there is only one scan operation in the query processing for uniform data, while for skewed data, more than one scan iterations are invoked to retrieve the data.
That is why the performance with skewed data under all data models is a little worse than that with the uniform data set. For both uniform and skewed data, the Regular-grid data model demonstrates best performance among the three data models; the HGrid data model come second with slightly worse performance than the regular-grid data model; and the quadtree data model is outperformed by the other two.
The poor locality preservation, due to the Z-order linearization method, contributes to the poor performance of the quadtree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, queries at data points with more than 70% probability cannot return a result within the timeout threshold under any data model when k equals 10K.
To improve performance, a finer granularity is required to filter irrelevant data scanning.
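For reference, a common density-based range estimate works as follows (the exact formula used in the thesis is not shown here, so treat this as an assumption): a circle expected to hold k points under uniform density d has radius sqrt(k / (pi * d)).

```java
// Density-based search-radius estimate for kNN: with a uniform density
// d (points per unit area), a circle of radius r contains about
// pi * r^2 * d points, so solving for k points gives
// r = sqrt(k / (pi * d)). Skewed data can make the estimate too small,
// which is why extra scan iterations are needed there.
public class KnnEstimate {
    static double searchRadius(int k, double densityPerUnitArea) {
        return Math.sqrt(k / (Math.PI * densityPerUnitArea));
    }
}
```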
In summary, the query performance of the HGrid data model is better than the quad-tree data model and worse than the regular-grid data model.
However, the quad-tree and regular-grid data models suffer from memory constraints, while HGrid is not subject to these constraints.
HGrid benefits from the good locality of the regular-grid index and suffers from the poor locality of Z-ordering linearization, so better performance could be obtained with alternative linearization techniques.
===========================
For skewed data, the HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints.
Can be flexibly configured and extended
1) The quad-tree index can be replaced by the hash code of each sub-space
2) The point-based quad-tree index method is employed.
3) The granularity in the second stage can be varied from sub-space to sub-space based on the various densities.
Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
The third problem.
With the two proposed data models and a set of guidelines at hand, we decided to apply them in a real application.
So we extended our study to migrating a geospatial application to the cloud.
This work has been done in the context of SAVI project.
SAVI cloud is a hierarchical cloud with two tiers.
Smart edges provide limited resources, are geographically near to the user, offering fast on-demand deployment to applications and low-latency to users.
The core has powerful computation and storage resources, providing centralized management services.
Connections between smart edges and the core go through the internal network rather than the wide-area network, which avoids interoperation issues among different clouds.
The HCAT application is designed to make home care services more efficient.
There are three main services:
The scheduling service assigns the HCAs to visit patients and carry out their care plans.
The assistant service enables HCAs to access and edit the client's care plan, as well as to provide textual information and images documenting care status.
The location service instructs the HCAs about the location of their client and how to get there, based on traffic reports.
The HCA-T system is structured in the typical three-layer architecture.
MySQL, a relational database system storing the persistent data, constitutes the base layer.
Tomcat, an application server containing most of the application logic, is the middle layer.
Finally, an HTTP server that handles requests coming from application clients through the web is the top layer.
There is a new requirement from research and administration services. They want to investigate and analyze the HCAs and clients across Canada.
Given the need for low latency for end users, and centralized data analysis,
With the new requirement of analyzing the data as a whole, the current architecture is not effective.
I will take two edges as an example here.
If we deploy HCAT on the BC edge, users in BC will be happy, as they get very good performance because of service locality,
while users in Ontario will be unhappy because of the high latency they might get.
If we deploy it on the Ontario edge, BC users will be unhappy.
If we deploy it in the core, all users will be disappointed.
If we deploy it on both edges, all users will be happy.
So rather than deploying one application instance on one edge, we install many instances on multiple edges
to make sure all users get low latency.
In addition, we designed a data access and aggregation system in the core to give the HCAT application the capacity for centralized analysis.
Users can query and access the data through RESTful services provided by the data analytics component of this system.
So we can see that this architecture promises low latency to end users and provides centralized data analysis to researchers.
Now, to make this architecture work, the question is how to migrate the data from the legacy databases on smart edges to HBase cluster on the core?
To address this problem, we utilized Sqoop, which is designed for efficiently transferring data between Hadoop and structured data stores. Here, Sqoop is used to import the data from the relational databases into HBase.
A shell action updates the start index in each interval to ensure the data migration is incremental.
We used Oozie to periodically invoke many parallel workflow jobs, consisting of the Sqoop and shell actions, to migrate the data periodically and incrementally from the smart edges to the core.
So we can see, as the application runs, the data will be transferred from the legacy databases on smart edges to the core gradually.
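The incremental idea can be modeled with a toy checkpoint loop; this is only a sketch of the mechanism, not the actual Sqoop/Oozie jobs:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of incremental migration: each periodic run ships only the
// rows past the saved start index, then advances the checkpoint, so edge
// data reaches the core gradually as the application runs.
public class IncrementalMigration {
    private long lastExportedId = 0; // the persisted "start index"

    // One scheduled run: select rows newer than the checkpoint.
    public List<Long> runOnce(List<Long> sourceIds) {
        List<Long> batch = new ArrayList<>();
        for (long id : sourceIds) {
            if (id > lastExportedId) batch.add(id);
        }
        for (long id : batch) lastExportedId = Math.max(lastExportedId, id);
        return batch; // rows transferred to the core this interval
    }
}
```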
During this migration, we found that the most challenging problem was transforming the data schema from RDBMSs to HBase, as the data in a real application is not as clean as the data we used in the previous experiments.
There are three related works that tried to address a similar problem.
All of them proposed to de-normalize the relations during the transition, but they did not discuss much about how to model the data in HBase for better performance.
Based on our data-modeling experience with HBase, we think this part is even more important during the transition.
Inspired by the third related work, which proposed four transition guidelines based on the universal data model in relational databases, we proposed a four-step method based on the entity-relationship data model.
The method is as follows.
First of all, we classify the relations into active and inactive.
Then, we de-normalize the relations, following different rules for active and inactive relations.
Next, we apply the appropriate data models based on the type of the data and the query patterns. Here, we presented three data models which commonly exist in geospatial applications.
Last but not least, we revisit and adjust the schemas based on HBase storage characteristics and new query requirements.
Here is the ER data model in the MySQL database.
There are five regular entities, two weak entities, one regular relationship, and one weak relationship.
In the case study, we designed and implemented three types of queries: window queries, range queries, and statistical queries.
The experiments were performed on the SAVI cloud, where we launched 10 virtual machines in the core to set up the data access and aggregation system; two application instances were installed on two smart edges, each with a single-node MySQL instance in a VM.
We used one set of 200GB of simulated data, duplicated for both provinces.
In this architecture, the migration and query performance is our biggest concern.
We performed one set of experiments for migration performance evaluation and two sets for query performance evaluation.
Here, I am going to present one set of experiments, where we evaluated time-series query performance by comparing against MySQL.
The query is to find out how many appointments took place in the two provinces in a given period, from 1 week to 3 months.
We implemented the query in HBase with the coprocessor framework. Within one request, we called the coprocessor twice and then aggregated the results on the client side.
The yellow bar is the longest time of any coprocessor instance involved in this query.
As many coprocessor instances might be launched in parallel within one query, the longest time is the actual time for processing the query in HBase.
We also implemented the query with the MySQL CLI and with JDBC. The execution times for these two are for
one province only, because the CLI cannot work on the two disparate data sources in parallel.
With JDBC, the query cannot produce a result because of limited memory resources.
In this work, we designed a federation-style cloud-enabled architecture for geospatial applications. A systematic method of migrating an existing geospatial application to this architecture was also proposed.
In addition, we proposed a method of transforming data schemas from relational database to HBase based on entity-relationship data model in relational database.
In this method, we apply different data models based on different categories of the data, which ensures the efficient query performance in HBase.
In this thesis, we proposed …..
We also proposed ….
In addition,
We successfully migrated an existing application to a hierarchical cloud. During the migration,
we designed …
we constructed..
we proposed
There are three avenues for extending our work.
First, we plan to investigate data beyond the geo-spatial domain, such as text, images, and videos from social-network applications.
Second, we plan to investigate other NoSQL tools beyond HBase, such as key-value stores, document databases, and graph databases.
Finally, a more general method for transforming data schemas from RDBMSs to HBase
is needed for various kinds of data.