Design Patterns of HBase Configuration
Dan Han
Supervisors: Eleni Stroulia and Paul Sorenson
Department of Computing Science
University of Alberta, Canada
My Agenda for Today
- Data modeling in HBase
  - Time-series datasets (MESOCA 2012)
  - Geospatial datasets (CLOUD 2013)
- Migrating
  - an existing application with geospatial, temporal, and categorical data
  - to SAVI, a hierarchical cloud
- Contributions
- Future work
Our Motivation: Big Data Problem
[Diagram: a three-tier application (web server, application server, RDBMS) feeding a data warehouse; the new demand is real-time or near-real-time analytics.]
Our Motivation: Big Data Problem (cont.)
Data has become "too big, too fast, too hard for existing tools"; RDBMSs fall short, and NoSQL databases become the attractive alternative.
Our Motivation: Application Extension
[Diagram: the three-tier stack (web server, application server, RDBMS) extended with a NoSQL database for data analytics, all within an in-house IT system.]
Our Motivation: Application Migration
Cloud infrastructure + NoSQL storage platform
=> a new software architecture + NoSQL expertise
- Good elasticity
- Excellent scalability
- High availability
- Low cost
[Diagram: the NoSQL database and data analytics hosted on the cloud.]
The Objective
Re-architecting traditional SQL+REST applications to use NoSQL data stores, in order to extend them with analytics features.
The Research Problem
- To develop guidelines for data organization
  - on HBase
    - a NoSQL database built on top of Hadoop
    - with a coprocessor framework (for parallel processing)
- Given
  - the application data model
  - the amount of data involved
  - the query patterns
Focus: time-series and geospatial data.
HBase: NoSQL Data Store
[Diagram omitted.]
Modeling Data in HBase (1/2): Querying Time-Series Datasets
Time-Series Datasets
- Data is generated and consumed in an order-dependent manner
- Typical applications
  - Sensor-based applications
  - Monitoring applications
- Analysis questions
  - What happened?
  - Why did it happen? (what happened before?)
  - What may happen next?
Case Study: The Cosmology and Bixi Datasets
Cosmology dataset: 9 snapshots, 321,065,547 particles, 12 metrics
- Gas: iorder, mass, x, y, z, velocity x, y, z, phi, rho, temp, hsmooth, metals
- Dark matter: iorder, mass, x, y, z, velocity x, y, z, phi, eps
- Star: iorder, mass, x, y, z, velocity x, y, z, phi, metals, tform, eps
Bixi dataset: 100,800 timestamps, 404 stations, 12 metrics
A 3-Dimensional Data Model
- Inspired by OpenTSDB and Facebook Messages
- Data model (a write sketch follows below)
  - Row key: object id-period timestamp
  - Column name: object attributes
  - Version: offset within the period timestamp
[Diagram: rows indexed by key, columns by attribute, versions stacking values in the third dimension.]
The Experiment
- Experiment environment
  - A four-node HBase cluster on virtual machines running Ubuntu
  - Hadoop 0.20, HBase 0.93-snapshot
- Queries for each dataset
  - Three queries on the Cosmology dataset, drawn from related research
  - One query on the Bixi dataset, drawn from a business requirement
- Query-processing implementations
  - Native Java API
  - User-level coprocessor implementation
Three Schemas for the Cosmology Dataset

Schema  | Row key          | Family:Column       | Version
Schema1 | sid-type-pid     | particle properties | no meaning
Schema2 | type-pid         | particle properties | snapshot id
Schema3 | type-reversedpid | particle properties | snapshot id

Example row keys, as distributed across regions:
- Schema1: 24-2-33446666, 64-2-33559999, 84-2-33550000
- Schema2: 2-33446666, 2-33550000, 2-33559999
- Schema3: 2-00005533, 2-66664433, 2-99995533
(Schema3's reversed-id construction is sketched below.)
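The reversed particle id in Schema3 is what disperses otherwise sequential keys across regions; a small sketch of the construction (the 8-digit padding is an assumption, but the slide's own examples come out as shown):

```java
public class CosmologyRowKeys {
    // Schema3: reverse the zero-padded particle id, e.g.
    // type=2, pid=33550000 -> "2-00005533" (matches the slide's example).
    static String schema3Key(int type, long pid) {
        String padded = String.format("%08d", pid);   // assumed 8-digit ids
        String reversed = new StringBuilder(padded).reverse().toString();
        return type + "-" + reversed;
    }

    public static void main(String[] args) {
        System.out.println(schema3Key(2, 33550000L)); // 2-00005533
        System.out.println(schema3Key(2, 33446666L)); // 2-66664433
    }
}
```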
Queries of the Cosmology Dataset
- Query 1: within one snapshot (sketched in code below)
  - Get all the particles of a type in a single snapshot with a given property matching an expression
- Query 2: across two snapshots
  - Get all the particles added/destroyed between two snapshots
- Query 3: across multiple snapshots
  - Get the values of a property for a continuous range of particles, across a set of snapshots
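To make the schema/query pairing concrete, here is a hedged sketch of Query 1 against Schema2 (row key type-pid, version = snapshot id) using the native Java API of that HBase generation; the table name, family, property, and snapshot id are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CosmologyQuery1 {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "cosmology");  // assumed table name

        // All rows of particle type 2 ('.' is the next byte after '-').
        Scan scan = new Scan(Bytes.toBytes("2-"), Bytes.toBytes("2."));
        long snapshotId = 24;
        scan.setTimeRange(snapshotId, snapshotId + 1); // versions are snapshot ids
        // Property expression; raw-byte comparison is order-correct here
        // because the stored doubles are positive.
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("p"), Bytes.toBytes("mass"),
                CompareFilter.CompareOp.GREATER, Bytes.toBytes(1.0e10)));

        ResultScanner results = table.getScanner(scan);
        for (Result r : results) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        results.close();
        table.close();
    }
}
```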
Query 1: Single-Snapshot Selection
[Chart omitted.] Takeaway: the data schema substantially impacts performance.
Query 2: Comparison Across Two Snapshots
[Chart omitted.] Takeaway: using the 3rd dimension improves locality and, consequently, performance.
Query 3: Projection Across Multiple Snapshots
[Chart omitted.] Takeaway: the row-key design is a key aspect of the schema design.
Three Schemas for the Bixi Dataset

Schema  | Row key  | Family:Column      | Version
Schema1 | hour-sid | minutes [0,59]     | no meaning
Schema2 | hour-sid | monitoring metrics | minutes [0,59]
Schema3 | day-sid  | monitoring metrics | minutes [0,1439]

[Diagram: Schema1 spreads time across columns; Schema2 and Schema3 stack time in the version dimension, per hour and per day respectively.]
(A Schema2 read sketch follows.)
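Under Schema2, one station-hour is a single row with the minutes stacked as versions, so one Get can return all 60 readings of a metric. A hedged read sketch (the table name, family, metric, and row-key format are assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BixiHourRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "bixi");            // assumed table name

        Get get = new Get(Bytes.toBytes("2012040112-s42")); // hour-sid (assumed format)
        get.addColumn(Bytes.toBytes("m"), Bytes.toBytes("bikes"));
        get.setMaxVersions(60);                             // all minutes of the hour
        Result result = table.get(get);
        for (KeyValue kv : result.raw()) {
            // The version carries the minute offset within the hour.
            System.out.println("minute " + kv.getTimestamp()
                    + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}
```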
Query 3: Projection and Stats Across Snapshots
Get the average bike usage in a given period (30 days) for a given list of stations (200).
[Chart omitted.] Takeaway: the period length (which defines the 3rd-dimension stack) is a key decision. (A coprocessor-style aggregation sketch follows.)
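The coprocessor implementation referred to above pushes such aggregation to the region servers. Purely as a hedged illustration of that style (not necessarily the thesis's own endpoint), HBase 0.94 bundles an AggregateImplementation endpoint with a matching client; the table, family, and qualifier names below are assumptions, and the endpoint must be loaded on the table for this to run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class BikeUsageAvg {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        AggregationClient aggs = new AggregationClient(conf);

        Scan scan = new Scan();  // one station/period row range in practice
        scan.addColumn(Bytes.toBytes("m"), Bytes.toBytes("bikes")); // assumed names
        // The average is computed region-side, not by shipping rows back.
        double avg = aggs.avg(Bytes.toBytes("bixi"),                // assumed table
                              new LongColumnInterpreter(), scan);
        System.out.println("average bike usage: " + avg);
    }
}
```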
Contributions (1)
- Investigated how to store time-series data
- Explored data with
  - few versions, many objects (Cosmology)
  - many versions, few objects (Bixi)
- Examined the impact of row-key design and versioning
- Guidelines
  - "Few versions, many objects": disperse the sequential data through row-key design, e.g. a reversed object id
  - "Many versions, few objects": the version dimension can be deeper
- These findings apply to "write-once, read-many" systems
Modeling Data in HBase (2/2): Querying Geospatial Datasets
Geospatial Datasets
- Multi-dimensional data
  - locations (latitude, longitude)
  - attributes
- Applications
  - location-aware applications
- Analysis questions
  - Who are my neighbors?
  - Which restaurants are close to me?
Geospatial Dataset Schema Design: Related Work
- Nishimura et al. built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.
- Hsu et al. presented a novel key-formulation schema, based on the R+-tree, for spatial indexing in HBase.
- Both focus on row-key design; neither discusses columns and versions.
Case Study: The Datasets
- Two synthetic datasets
  - uniform and Zipf distributions
- Based on the Bixi dataset; each object includes
  - station ID
  - latitude, longitude, station name, terminal name
  - number of docks
  - number of bikes
- 100 million objects (70GB) in a 100km x 100km simulated space
A Typical Data Model: Quad-Tree
- Trie-based quad-tree indexing
- Z-value linearization
- Data model
  - Row key: Z-value
  - Column: object ID
  - Value: one object in JSON format

Z-values over a 4x4 grid (Morton order; construction sketched below):
05 07 13 15
04 06 12 14
01 03 09 11
00 02 08 10
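For reference, a minimal sketch of the Z-value (Morton) linearization behind this grid: interleave the bits of the cell's x and y indices, with x in the odd bit positions so the numbering matches the slide.

```java
public class ZValue {
    // Interleave bits of (x, y): x fills the odd bits, y the even bits,
    // which reproduces the slide's 4x4 numbering (e.g. x=1, y=1 -> 3).
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= (long) ((x >> i) & 1) << (2 * i + 1);
            z |= (long) ((y >> i) & 1) << (2 * i);
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(zValue(1, 1)); // 3
        System.out.println(zValue(3, 0)); // 10
    }
}
```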
A Typical Data Model: Regular Grid
- Regular grid indexing
- Data model
  - Row key: grid row ID
  - Column: grid column ID
  - Version: counter of objects in the cell
  - Value: one object in JSON format

[Diagram: a 4x4 grid addressed by row IDs 00-03 and column IDs 00-03.]
Two Possible Data Models
- Quad-tree
  - More rows as the tree grows deeper
  - Z-ordering linearization (violates data locality)
  - In-time vs. pre-construction of the index implies a tradeoff between query performance and memory allocation
- Regular grid
  - Very easy to locate a cell by row ID and column ID
  - Cannot handle a large space with a fine-grained grid, because the in-memory index is subject to memory constraints
How much unrelated data a query examines matters a lot!
A Hybrid Model: the HGrid
Row key: QTId-RowId; column name: ColumnId-ObjectId.
HGrid: Index Structure Construction
- The row key is the quad-tree Z-value + the regular-grid row index.
- The column name is the regular-grid column index + the object ID.
- The attributes of the data point are stored in the third dimension.
(Key construction is sketched below.)
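A minimal sketch of this key layout, with assumed separators and digit widths: the coarse quad-tree Z-value prefixes the row key, while the regular grid inside each quad-tree leaf supplies the row suffix and the column prefix.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class HGridKeys {
    // Row key: quad-tree Z-value + regular-grid row index.
    static byte[] rowKey(long zValue, int gridRow) {
        return Bytes.toBytes(zValue + "-" + String.format("%02d", gridRow));
    }

    // Column name: regular-grid column index + object ID.
    static byte[] columnName(int gridCol, String objectId) {
        return Bytes.toBytes(String.format("%02d", gridCol) + "-" + objectId);
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(rowKey(3, 1)));          // "3-01"
        System.out.println(Bytes.toString(columnName(2, "s404"))); // "02-s404"
    }
}
```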
HGrid: Serialized Data at the Physical Level
[Diagram: the space split into four quad-tree cells (00, 01, 10, 11), each holding a block of points (A, B, C, D) addressed by regular-grid row and column; points in the same cell are serialized contiguously.]
The Experiment
- Experiment environment
  - A four-node HBase cluster on virtual machines with Ubuntu, on OpenStack
  - Hadoop 1.0.2 (replication factor 2), HBase 0.94
- Query-processing implementations
  - Native Java API
  - User-level coprocessor implementation
- Typical spatial queries
  - Range query and kNN query
Range Query
Given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location. (A grid-based sketch follows.)
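A hedged sketch of how a range query maps onto the regular-grid model (cell size, key format, and table name are all assumptions): scan the grid rows covering the circle's bounding box, then post-filter candidates by exact distance.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GridRangeQuery {
    static final double CELL_KM = 1.0;  // assumed cell edge length

    public static void main(String[] args) throws Exception {
        double qx = 50.0, qy = 50.0, radiusKm = 2.5;

        // Grid rows covered by the circle's bounding box.
        int rowLo = (int) ((qy - radiusKm) / CELL_KM);
        int rowHi = (int) ((qy + radiusKm) / CELL_KM);

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "grid");  // assumed table name
        Scan scan = new Scan(Bytes.toBytes(String.format("%03d", rowLo)),
                             Bytes.toBytes(String.format("%03d", rowHi + 1)));
        ResultScanner rs = table.getScanner(scan);
        for (Result r : rs) {
            // Each row is one grid row; its columns (grid column + object id)
            // hold candidate objects. Columns outside the circle's x-range can
            // be skipped, and the rest checked against the exact distance.
        }
        rs.close();
        table.close();
    }
}
```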
kNN Query
Given the coordinates of a location, return the K points nearest to that location. (The expanding-ring idea is sketched below.)
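As a plain-Java illustration of the usual grid strategy (not the thesis's code), the sketch below keeps the K closest candidates in a max-heap while cells are visited in expanding rings around the query point; the candidate points here are made up.

```java
import java.util.PriorityQueue;

public class KnnSketch {
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[] q = {50.0, 50.0};
        int k = 3;
        double[][] candidates = { {49, 50}, {51, 52}, {50, 50.5}, {60, 60} };

        // Max-heap of the K best candidates by distance to q.
        PriorityQueue<double[]> best = new PriorityQueue<>(
                (p1, p2) -> Double.compare(dist2(p2, q), dist2(p1, q)));
        for (double[] c : candidates) {
            best.offer(c);
            if (best.size() > k) best.poll();
        }
        // In the real model, candidates come from scanning the query point's
        // cell and its neighbors, ring by ring, until K points are confirmed
        // closer than anything the next unvisited ring could contain.
        best.forEach(p -> System.out.println(p[0] + "," + p[1]));
    }
}
```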
Contributions (2)
- HGrid: a geospatial data model that performs
  - worse than the regular-grid data model
  - much better than the quad-tree data model
- Broader applicability: both the quad-tree and regular-grid data models suffer from memory constraints; HGrid does not
- Qualities of HGrid
  - benefits from the good locality of the regular-grid index
  - suffers from the poor locality of Z-ordering linearization
A Migration Case Study: Migrating an Existing Geospatial Application to a Hierarchical Cloud (SAVI)
SAVI: A Hierarchical Cloud
[Diagram: a core datacentre connected to smart edges, e.g. a BC edge and an ON edge.]
[1] SAVI: Smart Applications on Virtual Infrastructure, a national research project in Canada.
The HCAT Application
- Home Care Aides Technology (HCAT)
  - Scheduling service: schedule the care plan
  - Assistant service: audio/images/video/notes
  - Location service: plan the path
- The new requirement
  - analyze the data as a whole (nationally)
[Diagram: web server, Tomcat server, and MySQL.]
HCA-T to HCA-T2
Given the need for
- low latency for end users
- centralized data analysis,
how do we re-architect the HCAT application for all users across Canada on the SAVI cloud?
HCAT2 Geographical Deployment
[Diagrams: HCAT2 instances deployed at the smart edges (BC edge, ON edge), with the analytics platform at the SAVI core.]
A Federation-style Architecture on SAVI Cloud
[Diagram: each edge (BC, Ontario) runs its own web server, Tomcat server, and MySQL instance.]
A Federation-style Architecture on SAVI Cloud (cont.)
[Diagram: the edge stacks (BC, ON, AB, QC, ...) feed a DAAS at the core, where HBase supports the data analytics.]
The Data Flow
[Diagram: data flows from each edge's HCAT MySQL instance into the core HBase store.]
The Most Challenging Research Problem
How can we transform data schemas from RDBMSs to HBase?
Data Migration from RDBMSs to HBase: Related Work
- Li: three guidelines for the transition, to de-normalize the original relations.
- Schram et al.: a case study of mapping the Twitter data schema from MySQL to Cassandra to support crisis informatics.
- Gupta et al.: four transition guidelines from a traditional data warehouse to Hive, based on a universal data model in a relational database.
- De-normalization was suggested, but with no discussion of how to model the data in HBase for efficient query performance.
Data Schema Transition from RDBMSs to HBase
1. Classify relations into active and inactive
2. De-normalize relations
3. Apply appropriate data models (descriptive DM, time-series DM, geospatial DM, ...)
4. Adjust and optimize
(Steps 2-3 are sketched in code below.)
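Purely as an illustration of steps 2 and 3 (hypothetical table and column names throughout), the sketch below de-normalizes a client-appointment join from an edge MySQL instance into a wide HBase row whose key keeps a region's time range contiguous:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DenormalizeAppointments {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://edge-db/hcat", "user", "pass"); // assumed source
        Statement st = db.createStatement();
        ResultSet rs = st.executeQuery(
                "SELECT c.region, c.name, a.id, a.start_time, a.service " +
                "FROM appointment a JOIN client c ON a.client_id = c.id");

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "appointments"); // assumed target table
        while (rs.next()) {
            // Row key region-day-appointmentId keeps one region's time range
            // contiguous while keeping each appointment's row unique.
            String day = rs.getString("start_time").substring(0, 10);
            String rowKey = rs.getString("region") + "-" + day + "-" + rs.getLong("id");
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("client"),
                    Bytes.toBytes(rs.getString("name")));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("service"),
                    Bytes.toBytes(rs.getString("service")));
            table.put(put);
        }
        table.close();
        db.close();
    }
}
```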
Case Study: from HCA-T to HCA-T2
[Diagrams: the HCA-T relational schema and its step-by-step transition into HBase data models.]
Sample Queries
- Window query
  - Which area (East/West/North/South) had the most/fewest patients per region in 2011/2012?
- Range query
  - Find the neighbors of a given client based on his/her home location and a given distance.
- Time-series statistical query
  - Get the total number of appointments/services/uploaded images per week/month in a given region in 2011/2012.
The Experiment
- Experiment environment
  - SAVI core: 10 VMs for the DAAS
  - SAVI smart edges: two HCAT application instances
  - Hadoop 1.0.2, HBase 0.94, Sqoop 1.4.3, Oozie 3.3.2
- Dataset
  - 200GB per edge, with two duplications
- Sets of experiments
  - one set for migration-performance evaluation
  - two sets for query-performance evaluation
[Diagram: the BC and ON edges running HCAT, feeding the DAAS at the core.]
Aggregate Statistics Over Time
How many appointments took place in Ontario and BC in a given period of time (from 1 week to 3 months)?
[Charts omitted.]

Aggregate Statistics Over Time (BC only)
How many appointments took place in BC in a given period of time (from 1 week to 3 months)?
[Chart omitted.]
Contributions (3)
- We proposed a novel federation-style architecture
- We constructed a systematic way of migrating geospatial applications to this architecture
- We proposed a method for transforming data schemas from RDBMSs to HBase
Concluding Summary
- Data modeling in HBase
  - time-series datasets (MESOCA 2012)
  - geospatial datasets (CLOUD 2013)
- Migrating an existing application to the cloud
  - designed a novel federation-style architecture
  - proposed a method for transforming data schemas from RDBMSs to HBase
- A practical case study
  - presented the application of the above guidelines in the design
  - verified the aforementioned data-schema transition method

Contributions
[Summary diagram omitted.]
Future Work
- Investigate data beyond the geospatial domain
  - text, images, and videos from social-network applications
- Investigate other NoSQL databases, beyond HBase
  - key-value stores, document databases, and graph databases
- Generalize the method for transforming data schemas from RDBMSs to HBase to cover more kinds of data
Thank You!
谢谢!
Ευχαριστώ!
More Related Content

What's hot

SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudAnsgar Scherp
 
Secondary Spectrum Usage for Mobile Devices
Secondary Spectrum Usage for Mobile DevicesSecondary Spectrum Usage for Mobile Devices
Secondary Spectrum Usage for Mobile DevicesAmjed Majid
 
Multi-thematic spatial databases
Multi-thematic spatial databasesMulti-thematic spatial databases
Multi-thematic spatial databasesConor Mc Elhinney
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...newmooxx
 
On the value of Sampling and Pruning for SBSE
On the value of Sampling and Pruning for SBSEOn the value of Sampling and Pruning for SBSE
On the value of Sampling and Pruning for SBSEJianfeng Chen
 
Mask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationMask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationDat Nguyen
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachSpark Summit
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
 
Evaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past RequestsEvaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past RequestsSmartenIT
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science PlatformTed Habermann
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pRobert Grossman
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedVed Mulkalwar
 
Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?Rob Emanuele
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 

What's hot (20)

SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data CloudSchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
 
Secondary Spectrum Usage for Mobile Devices
Secondary Spectrum Usage for Mobile DevicesSecondary Spectrum Usage for Mobile Devices
Secondary Spectrum Usage for Mobile Devices
 
Multi-thematic spatial databases
Multi-thematic spatial databasesMulti-thematic spatial databases
Multi-thematic spatial databases
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...
 
Summary of HDF-EOS5 Files, Data Model and File Format
Summary of HDF-EOS5 Files, Data Model and File FormatSummary of HDF-EOS5 Files, Data Model and File Format
Summary of HDF-EOS5 Files, Data Model and File Format
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
On the value of Sampling and Pruning for SBSE
On the value of Sampling and Pruning for SBSEOn the value of Sampling and Pruning for SBSE
On the value of Sampling and Pruning for SBSE
 
Mask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationMask-RCNN for Instance Segmentation
Mask-RCNN for Instance Segmentation
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
Evaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past RequestsEvaluation of Caching Strategies Based on Access Statistics on Past Requests
Evaluation of Caching Strategies Based on Access Statistics on Past Requests
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics ved
 
Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?Deep Learning on Aerial Imagery: What does it look like on a map?
Deep Learning on Aerial Imagery: What does it look like on a map?
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 

Similar to Design Pattern of HBase Configuration

0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track
0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track
0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech TrackRudolf Husar
 
2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...
2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...
2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...Rudolf Husar
 
060128 Galeon Rept
060128 Galeon Rept060128 Galeon Rept
060128 Galeon ReptRudolf Husar
 
Making sense of your data
Making sense of your dataMaking sense of your data
Making sense of your dataGerald Muecke
 
My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)Robert Grossman
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010BOSC 2010
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMOVING Project
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataAnsgar Scherp
 
Infotech's Spatial Conflation Tool TruShift
Infotech's Spatial Conflation Tool TruShift Infotech's Spatial Conflation Tool TruShift
Infotech's Spatial Conflation Tool TruShift Maria H
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)Robert Grossman
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Godiva2 Overview
Godiva2 OverviewGodiva2 Overview
Godiva2 Overviewjonblower
 
BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overviewBigData_Europe
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataAlessandro Adamou
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!Ian Foster
 

Similar to Design Pattern of HBase Configuration (20)

0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track
0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track
0603 Esip Fed Wash Dc Tech Pres 060103 Esip Aq Tech Track
 
2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...
2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...
2006-01-11 Data Flow & Interoperability in DataFed Service-based AQ Analysis ...
 
060128 Galeon Rept
060128 Galeon Rept060128 Galeon Rept
060128 Galeon Rept
 
3DRepo
3DRepo3DRepo
3DRepo
 
Making sense of your data
Making sense of your dataMaking sense of your data
Making sense of your data
 
My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open Data
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open Data
 
Infotech's Spatial Conflation Tool TruShift
Infotech's Spatial Conflation Tool TruShift Infotech's Spatial Conflation Tool TruShift
Infotech's Spatial Conflation Tool TruShift
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Lambda Data Grid
Lambda Data GridLambda Data Grid
Lambda Data Grid
 
Godiva2 Overview
Godiva2 OverviewGodiva2 Overview
Godiva2 Overview
 
Soumyadip_Chandra
Soumyadip_ChandraSoumyadip_Chandra
Soumyadip_Chandra
 
BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overview
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked data
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Design Pattern of HBase Configuration

  • 1. LOGO Design Patterns of HBase Configuration Dan Han Supervisor: Eleni Stroulia and Paul Sorenson Department of Computing Science University of Alberta, Canada
  • 2.  Data Modeling in HBase  Time–Series datasets (MESOCA 2012)  Geospatial datasets (CLOUD 2013)  Migrating  an existing application with geospatial, temporal and categorical data  to SAVI, a hierarchical cloud  Contributions  Future Work My Agenda for Today Time-Series Geo-Spatial Migration to Cloud Conclusion 1
  • 3. Time-Series Geo-Spatial Migration to Cloud Conclusion RDBMS Application Server Web Server Data Warehouse Our Motivation: Big Data Problem 2 real-time or near real-time
  • 4. Time-Series Geo-Spatial Migration to Cloud Conclusion NoSQL database RDBMS Too big Too fast Too hard for existing tools Our Motivation: Big Data Problem 3
  • 5. Our Motivation: Application Extension Time-Series Geo-Spatial Migration to Cloud Conclusion RDBMS Web Server Application Server Data Analytics NoSQL database 4 In-house IT system
  • 6. Time-Series Geo-Spatial Migration to Cloud Conclusion Cloud Infrastructure + NoSQL Storage Platform  A New Software Architecture + an NoSQL expertise Good Elasticity Excellent Scalability High Availability Low Cost NoSQL database Our Motivation: Application Migration 5 Data Analytics
  • 7. Re-architecting traditional SQL+REST applications to use NoSQL data stores For the purpose of extending them with analytics features The Objective Time-Series Geo-Spatial Migration to Cloud Conclusion 6
  • 8.  To develop guidelines for data organization  On HBase  A NoSQL database built on top of Hadoop  With a coprocessor framework (for parallel processing)  Given  the application data model  the amount of data involved  query patterns The Research Problem Time-Series Geo-Spatial Migration to Cloud Conclusion Focusing on Time-Series and Geospatial data 7
  • 9. Time-Series Geo-Spatial Migration to Cloud Conclusion HBase: NoSQL Data Store 8
  • 10.  Querying Time-Series Datasets Modeling Data in HBase (1/2) Introduction Geo-Spatial Migration to Cloud Conclusion 9
  • 11.  In an order-dependent manner  Typical applications  Sensor-based applications  Monitoring applications  Analysis Questions  What happened?  Why did it happen? (what happened before)  What may happen next? Time-Series Datasets Introduction Geo-Spatial Migration to Cloud Conclusion 10
  • 12. Case Study: The Cosmology and Bixi Datasets Introduction Geo-Spatial Migration to Cloud Conclusion Cosmology dataset: 9 snapshots, 321,065,547 particles,12 metrics Bixi dataset: 100,800 timestamps, 404 stations, 12 metrics  Gas:  iorder, mass, x, y, z, velocity x, y, z, phi, rho, temp, hsmooth, metals  Dark Matter:  iorder, mass, x, y, z, velocity x, y, z, phi, eps  Star:  iorder, mass, x, y, z, velocity x, y, z, phi, metals, tform, eps 11
  • 13.  Inspired by OpenTSDB and Facebook Messages  Data Model  Row key: object id-period timestamp  Column name: object attributes  Version: offset of period-timestamp A 3-Dimensional Data Model Column Row Key Introduction Geo-Spatial Migration to Cloud Conclusion 12
  • 14.  Experiment Environment  A four-node cluster on virtual machines with Ubuntu  Hadoop 0.20, HBase 0.93-snapshot  Queries for each dataset  Three queries of Cosmology dataset from related research  One query of Bixi dataset from business requirement  Query processing implementation  Native java API  User-Level Coprocessor Implementation The Experiment Introduction Geo-Spatial Migration to Cloud Conclusion HBase Cluster 13
  • 15. Schema/ dimension Row Family: Column Version Schema1 sid-type-pid particle properties no meaning Schema2 type-pid particle properties snapshot id Schema3 type-reversedpid particle properties snapshot id Region Region Region 24-2-33446666 64-2-33559999 84-2-33550000 2-33446666 2-33550000 2-33559999 2-00005533 2-66664433 2-99995533 Schema1 Schema2 Schema3 Introduction Geo-Spatial Migration to Cloud Conclusion Three Schemas for the Cosmology Dataset 14
  • 16. Query1: within one snapshot  Get all the particles of a type in a single snapshot with a given property matching an expression Query2: across two snapshots  Get all the particles added/destroyed between two snapshots Query3: across multiple snapshots  Get the values of a property for a continuous range of particles, across a set of snapshots Queries of Cosmology Dataset Introduction Geo-Spatial Migration to Cloud Conclusion 15
  • 17. Query1: Single Snapshot Selection The data schema substantially impacts performance Introduction Geo-Spatial Migration to Cloud Conclusion 16
  • 18. Query2: Comparison Across two Snapshots Using the 3rd dimension improves locality and consequently performance. Introduction Geo-Spatial Migration to Cloud Conclusion 17
  • 19. Query3: Projection Across Multiple Snapshots Introduction Geo-Spatial Migration to Cloud Conclusion The row-key design is a key aspect of the schema design. 18
  • 20. Three Schemas for the Bixi Dataset Schema/ dimension Row Family: Column Version Schema1 hour-sid minutes[0,59] no meaning Schema2 hour-sid monitoring metrics minutes [0,59] Schema3 day-sid monitoring metrics minutes [0,1439] Schema1 Schema2 Schema3 Time Row metrics Time Row metrics Row Time Introduction Geo-Spatial Migration to Cloud Conclusion 19
  • 21. Query3: Projection and Stats Across Snapshots The period length (defining the 3rd dimension stack) is a key decision. Introduction Geo-Spatial Migration to Cloud Conclusion Get average bike usage in a given period (30 days) for a given list of stations(200) 20
  • 22.  Investigated how to store time-series data  We explored data with  Few versions, many objects (Cosmology)  Many versions, few objects (Bixi)  Examined the impact of row-key design and versioning  Guidelines  “Few versions, many objects”: disperse the sequential data with row key design, e.g. reversed object id  “Many version, few objects”: version can be deeper  These findings are applicable to “write-once, read-many” systems Contributions (1) Introduction Geo-Spatial Migration to Cloud Conclusion 21
  • 23.  Querying Geospatial Datasets Modeling Data in HBase (2/2) Introduction Migration to Cloud ConclusionTime-Series 22
  • 24.  Multi-dimensional data  Locations (latitude, longitude)  attributes  Applications  Location-aware applications  Analysis Questions  Who are my neighbors?  Which restaurants are close to me? Geospatial Datasets Introduction Migration to Cloud ConclusionTime-Series 23
  • 25.  Nishimura et al:  built a multi-dimensional index layer on top of a one- dimensional key-value store HBase to perform spatial queries.  Hsu et al:  presented a novel key formulation schema, based on R+-tree for spatial index in HBase.  Focus on row-key design  no discussion about columns and versions Geospatial Dataset Schema Design: Related Work Introduction Migration to Cloud ConclusionTime-Series 24
  • 26.  Two Synthetic Datasets  Uniform and ZipF distribution  Based on Bixi dataset, each object includes  station ID,  latitude, longitude, station name, terminal name,  number of docks  number of bikes  100 Million objects (70GB)  in a 100km*100km simulated space Case Study: The Datasets Introduction Migration to Cloud ConclusionTime-Series 25
  • 27.  Trie-based quad-tree Indexing  Z-value Linearization  Data Model  Row key: Z-value  Column: Object ID  Value: one object in JSON Format A typical Data Model: Quad-Tree Z-Value Object ID 05 07 13 15 04 06 12 14 01 03 09 11 00 02 08 10 Z-value Introduction Migration to Cloud ConclusionTime-Series 26
  • 28.  Regular Grid Indexing  Data Model  Row key: Grid rowID  Column: Grid columnID  Version: counter of Objects  Value: one object in JSON format A typical Data Model: Regular Grid Column ID RowID 00 01 02 03 00 01 02 03 Introduction Migration to Cloud ConclusionTime-Series 27
  • 29. Two Possible Data Models  Quad-Tree  More rows with deeper tree  Z-ordering linearization (violates data locality)  In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation  Regular Grid  Very easy to locate a cell by row id and column id  Cannot handle large space and fine-grained grid because in-memory indexes are subject to memory constraints How much unrelated data is examined in a query matters a lot! Introduction Migration to Cloud ConclusionTime-Series 28
  • 30. A Hybrid Model: the HGrid Columnid-ObjectId QTId-RowId Introduction Migration to Cloud ConclusionTime-Series 29
  • 31. HGrid: Index Structure Construction The row key is the QT Z-value + the RG row index. The column name is the RG column and the object-ID The attributes of the data point are stored in the third dimension. Introduction Migration to Cloud ConclusionTime-Series T 30
  • 32. HGrid: Serialized Data at the Physical Level A A A A A A A A A B B B B B B B B B C C C C C C C C C D D D D D D D D D 00 01 11 10 01 02 03 01 02 03 Space Introduction Migration to Cloud ConclusionTime-Series 31
  • 33.  Experiment Environment  A four-node cluster on virtual machines with Ubuntu on OpenStack  Hadoop 1.0.2 (replication factor is 2), HBase 0.94  Query processing Implementation  Native java API  User-Level Coprocessor Implementation  Typical Spatial Queries  Range Query and kNN Query The Experiment Introduction Migration to Cloud ConclusionTime-Series HBase Cluster 32
  • 34.  Given a location and a radius,  Return the data points, located within a distance less or equal to the radius from the input location Range Query Introduction Migration to Cloud ConclusionTime-Series 33
  • 35.  Given the coordinates of a location,  Return the K points nearest to the location kNN Query Introduction Migration to Cloud ConclusionTime-Series 34
  • 36.  HGrid: A Geospatial Data Model that Performs  Worse than Regular-Grid data model  Much better than the Quad-Tree data model  Broader applicability: Both Quad-tree and Regular-Grid data models suffer from memory constraints  HGrid is not subject to memory constraints  Qualities of HGrid  Benefit from good locality of regular grid index  Suffer from poor locality of z-ordering linearization Contributions (2) Introduction Migration to Cloud ConclusionTime-Series 35
  • 37.  Migrating an Existing Geospatial Application to a Hierarchical Cloud A Migration Case Study Introduction Geo-spatial ConclusionTime-Series SAVI 36
  • 38. Introduction Geo-spatial ConclusionTime-Series [1] SAVI: Smart Application on Virtual Infrastructure national research project in Canada Core Smart Edges BC edge ON edge SAVI: A Hierarchical Cloud 37
  • 39.  Home Care Aides Technology  Scheduling Service: schedule care plan  Assistant Service: audio/images/video/notes  Location Service: plan the path  The New Requirement  Analyze the data as a whole (Nationally) The HCAT Application Introduction Geo-spatial ConclusionTime-Series MySQL Web Server Tomcat Server 38
  • 40. Introduction Geo-spatial ConclusionTime-Series  Given the need for  Low latency for end users  Centralized data analysis  How to re-architect HCAT application for all users across Canada on SAVI cloud? HCA-T to HCA-T2 39 Core Smart Edges BC ON
  • 41. HCAT2 Geographical Deployment: BC Edge Introduction Geo-spatial ConclusionTime-Series 40
  • 42. HCAT2 Geographical Deployment: ON Edge Introduction Geo-spatial ConclusionTime-Series 41
  • 43. HCAT2 Geographical Deployment: Core Introduction Geo-spatial ConclusionTime-Series 42
  • 44. HCAT2 Geographical Deployment Introduction Geo-spatial ConclusionTime-Series 43
  • 45. A Federation-style Architecture on SAVI Cloud Introduction Geo-spatial ConclusionTime-Series BC Tomcat Server Web Server MySQL OntarioON Tomcat Server Web Server MySQL BC 44
  • 46. HBase Core DAAS A Federation-style Architecture on SAVI Cloud Introduction Geo-spatial ConclusionTime-Series BC Tomcat Server Web Server MySQL ON Tomcat Server Web Server MySQL Data Analytics 45 AB QC…
  • 48. The Most Challenging Research Problem Introduction Geo-spatial ConclusionTime-Series How can we transform data schemas from RDBMSs to HBase? 47
  • 49.  Li:  Three guidelines for the transition to de-normalize the original relations.  Schram et al.:  A case study of mapping the Twitter data schema from MySQL to Cassandra to support crisis informatics.  Gupta et al.:  Four Transition guidelines from traditional data warehouse to Hive based on Universal data model in relational database.  De-normalization was suggested  No discuss on how to model the data in HBase to get efficient query performance. Data Migration from RDBMSs to HBase: Related Work Introduction Geo-spatial ConclusionTime-Series 48
  • 50. Data Schema Transition from RDBMSs to HBase Introduction Geo-spatial ConclusionTime-Series Classify relations into active and inactive De-normalize relations Descriptive DM Time-series DMAdjust and Optimize Geospatial DM …… Apply appropriate data models 49
  • 51. Introduction Geo-spatial ConclusionTime-Series Case Study: from HCA-T to HCA-T2 50
  • 52. Introduction Geo-spatial ConclusionTime-Series Case Study: from HCA-T to HCA-T2 51
  • 53. Introduction Geo-spatial ConclusionTime-Series Case Study: from HCA-T to HCA-T2 52
  • 54. Sample Queries Window query  Which area (East/West/North/South) has most/least patients per region in 2012/2011? Range query  Find the neighbors of a given client based on his/her home location and a given distance. Time-series statistical query  Get the total number of appointments/services/uploaded images per week/month in a given region in 2012/2011. Introduction Geo-spatial ConclusionTime-Series 53
  • 55.  Experiment Environment  SAVI Core: 10 VMs for DAAS  SAVI smart edges: Two HCAT application instances  Hadoop 1.0.2, HBase 0.94, Sqoop 1.4.3, Oozie 3.3.2  Dataset  200GB for each, two duplications  Sets of Experiments  One set for migration performance evaluation  Two sets for query performance evaluation The Experiment Introduction Geo-spatial ConclusionTime-Series Core Ontario edge HCAT BC edge HCAT DAAS 54
  • 56. Aggregate Statistics Over Time How many appointments took place in Ontario and BC in a given period of time (from 1 week to 3 months)? Introduction Geo-spatial ConclusionTime-Series 55
  • 57. Aggregate Statistics Over Time How many appointments took place in Ontario and BC in a given period of time (from 1 week to 3 months)? Introduction Geo-spatial ConclusionTime-Series 56
  • 58. Aggregate Statistics Over Time How many appointments took place in BC in a given period of time (from 1 week to 3 months)? Introduction Geo-spatial ConclusionTime-Series 57
  • 59.  We proposed a novel federation-style architecture  We constructed a systematic way of migrating geospatial applications to this architecture  We proposed a method of transforming data schemas from RDBMSs to HBase Contributions (3) Introduction Geo-spatial ConclusionTime-Series 58
  • 60. Concluding Summary. Data modeling in HBase: time-series datasets (MESOCA 2012), geospatial datasets (CLOUD 2013). Migrating an existing application to the cloud: designed a novel federation-style architecture; proposed a method for transforming data schemas from RDBMSs to HBase. A practical case study: presented the application of the above guidelines in the design and verified the aforementioned data-schema transition method.
  • 62. Future Work. Investigate data beyond the geo-spatial domain: text, images, and videos from social-network applications. Investigate other NoSQL databases beyond HBase: key-value stores, document databases, and graph databases. Make the method of transforming data schemas from RDBMSs to HBase more general for various kinds of data.

Editor's Notes

  1. Hello, everyone. Today, I’ll talk about my thesis entitled “design patterns of HBase configuration”
  2. In this thesis, there are two main problems we have addressed. The first problem is how to model the data in HBase for a given application. The second problem is how to migrate an existing geospatial application to a hierarchical cloud. I will explain them in detail following this structure. In the end, I will conclude with some future work.
  3. There is a typical class of applications structured in three layers. The base layer is the database, which stores the data. The middle layer is the application server, which manages all the application logic. And the web server handles all requests from both web and mobile clients in the front layer. Over the years, these applications have generated large amounts of data. Traditionally, this data is managed by a data warehouse, and offline business intelligence is performed on it to help executives understand their businesses and make decisions. As the data grows rapidly, the warehouses and the solutions built around them can no longer ensure reasonable response times. Making this problem even more challenging is the fact that these applications want to evolve by collecting more data and analyzing it in real time, or near real time.
  4. In addition, some new applications that are based on analytics, such as Facebook Ads insights, Google Analytics, and recommendation systems, also pose a big challenge for traditional database systems because of the large amount of unstructured and structured data. This problem is called “big data”. Sam Madden, a professor at MIT, said, and I quote, “Big data means that the data is too big, too fast, too hard for existing tools.” RDBMSs cannot help us solve this problem. Compared to RDBMSs, NoSQL databases are becoming more and more attractive for these big-data applications.
  5. To use NoSQL databases to address the problem, there are two solutions in terms of the underlying infrastructure. The first is to build an in-house NoSQL database system on their own IT systems.
  6. The second is to migrate the application to the cloud. Some applications prefer this way because the cloud offers many attractive features beyond scalability. In both cases, a new software architecture is required, as well as NoSQL expertise: someone who knows how to model the data and prepare it for analysis.
  7. Our objective is to re-architect traditional SQL+REST applications to use NoSQL data stores, for either migration or extension purposes.
  8. Unfortunately, to date, there is very little methodological and tool support for software development on NoSQL databases. In this work, we want to build a systematic method that guides developers in organizing data on HBase, given the application data model and query patterns. We focused on time-series and geospatial applications, because they are crucial for understanding performance problems in databases.
  9. Here is a little background on HBase. HBase is an implementation of a NoSQL data store. It stores the data in a Bigtable-style table, which is structured by row key, column family, column, and version.
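A cell in such a table is addressed by all four coordinates. As a minimal sketch using the 0.94-era Java client (the table and column names here are hypothetical), reading the last few versions of one cell looks like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo");       // hypothetical table
        Get get = new Get(Bytes.toBytes("row-1"));     // 1. row key
        get.addColumn(Bytes.toBytes("f"),              // 2. column family
                      Bytes.toBytes("attr"));          // 3. column qualifier
        get.setMaxVersions(3);                         // 4. version dimension
        Result result = table.get(get);                // cells, newest first
        table.close();
    }
}
```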
  10. The first problem is about time-series data.
  11. Time-series data is generated in an order-dependent manner. Time-series data analysis is very useful for answering questions like “What happened? Why did it happen? What is going to happen next?” This kind of data is generated by many applications, such as sensor-based systems and monitoring systems. Ganglia, as an example of a monitoring system, monitors high-performance computing systems such as clusters and grids.
  12. In our case study, we used the Cosmology dataset and the Bixi dataset. The Cosmology dataset was produced by an N-body simulation of the evolution of the universe. There are three types of particles: dark matter, gas, and star. Particles evolve over time, and each snapshot records all particles at the corresponding timestamp. In our dataset, there are 9 snapshots consisting of 321,065,547 particles. The Bixi dataset comes from a bicycle-renting service in Montreal. The data was collected every minute from the sensors; there are 100,800 timestamps for 404 stations.
  13. Our idea was inspired by two related works: OpenTSDB groups the data collected within a period of time into one row, and the Facebook Messages system stores the message id in the third dimension. We combined these two ideas and proposed a 3-dimensional data model to manage time-series datasets. In this data model, the first dimension is the combination of the object id and the timestamp of a period; the columns are the varying attributes of the objects; and the third dimension is the offset of the timestamp within the period.
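A minimal sketch of a write under this model, using the 0.94-era Java client; the family "m" and qualifier "nbBikes" are hypothetical stand-ins for a Bixi-style metric:

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeSeriesWriter {
    // Row key = object id + period start; the cell version stores the
    // offset of the reading's timestamp within that period, so a whole
    // period's readings live in one row, stacked in the third dimension.
    static void writeReading(HTable table, String objectId,
                             long periodStart, long eventTime,
                             int value) throws Exception {
        String rowKey = objectId + "-" + periodStart;
        long offset = eventTime - periodStart;          // version dimension
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("m"), Bytes.toBytes("nbBikes"),
                offset, Bytes.toBytes(value));
        table.put(put);
    }
}
```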
  14. We did our experiments on a four-node cluster. We evaluated the data model with the Cosmology and Bixi datasets.
  15. To evaluate the performance implication of the third dimension, we designed one two-dimensional data schema and two three-dimensional data schemas. The difference between the two three-dimensional schemas is the row key.
  16. Based on research requirements, we designed three queries for the Cosmology dataset. The data queried in the first one comes from one snapshot, while the data queried in queries 2 and 3 comes from multiple snapshots.
  17. From the experiment results, we can see that the performance of query 1 differs greatly across the three schemas. This leads us to conclude that the data schema has a large impact on query performance, a conclusion confirmed by the following experiments.
  18. For the second query, schema 1, the two-dimensional instance, performed worse than the three-dimensional data schemas. So we can see that performance can be improved with the third dimension. This conclusion was confirmed again by the performance of query 3.
  19. By comparing the performance of schema 2 and schema 3 on query 3, we also learned that row-key design is especially important in data modeling.
  20. We also did the same with the Bixi dataset. To investigate the performance implication of the depth of the third dimension, we designed more elements in the version dimension in schema 3 than in schema 2.
  21. The result shows that schema 3 performs better than schema 2. So we can conclude that more performance can be obtained with more values localized in the third dimension.
  22. We investigated how to store time-series data using these two datasets with different characteristics.
  23. Geospatial data is multi-dimensional data that includes locations and object attributes. A typical example is Foursquare, a location-based social-networking website for mobile devices that allows users to connect with friends based on their check-in locations. Geospatial data analysis is essential to answer questions like: Who are my neighbors? Which restaurants are close to me?
  24. We first reviewed two related works. But they focused only on the row-key design; we also explored the performance implications of the column and version design.
  25. In our study, we chose to use two synthetic datasets, generated from the Bixi dataset I introduced before. We augmented the number of stations from 404 to one hundred million, placing them at random coordinates following uniform and Zipf distributions. Each dataset essentially represents one hundred million objects in a simulated 100 km × 100 km space.
  26. We began our work with the quad-tree data model. It relies on a trie-based quad-tree index and applies Z-ordering to transform the two-dimensional spatial data into a one-dimensional array. In this model, the row key is the Z-value, and the column is the id of the object located in the cell. Usually people encode the cells with binary digits, but as the row key should be short in HBase, we use decimal encoding here.
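The Z-value of a cell is obtained by interleaving the bits of its x and y grid indices. A minimal sketch of that interleaving (the thesis then stores the resulting value in decimal form in the row key):

```java
public class ZOrder {
    // Interleave the bits of a cell's (x, y) indices into one Z-value,
    // linearizing the 2-D grid into a 1-D key space.
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i);      // even bits from x
            z |= ((long) ((y >> i) & 1)) << (2 * i + 1);  // odd bits from y
        }
        return z;
    }
}
```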
  27. The next data model we worked on is the regular-grid data model, which relies on a regular-grid index. The row key is the row index of the cell in the grid; the column is the column index of the cell; and the version is a counter of objects, so the third dimension holds a stack of data points located in the same grid cell. Each stored value represents one object in JSON format, holding all its other attributes and values.
  28. After investigating these two data models, we found that it is easier to prune unrelated data with the regular-grid data model than with the quad-tree data model, because the Z-order linearization in the quad-tree data model violates data locality. And both of them have issues with limited memory resources.
=========================================================== QT: If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the grid cells become smaller. RG: The third dimension holds a stack of data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
  29. Considering the advantages and disadvantages of these two data models, we proposed the HGrid data model, which includes a hybrid index structure.
  30. In this index structure, the dataset space is divided into equally-sized rectangular tiles, encoded with their Z-values, and within each tile the data points are organized in a regular grid of contiguous, uniform, fine-grained cells. In this model, each data point is uniquely identified by its row key and column name: the row key is the concatenation of the quad-tree Z-value and the regular-grid row index; the column name is the concatenation of the regular-grid column index and the object id of the data point. The attributes of the data point are stored in the third dimension.
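A minimal sketch of the key construction; the separator character and the zero-padding (needed so that HBase's lexicographic ordering matches the numeric grid order) are assumptions for illustration, not the thesis's exact encoding:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class HGridKeys {
    // Row key: tile Z-value concatenated with the regular-grid row index.
    static byte[] rowKey(long tileZ, int gridRow) {
        return Bytes.toBytes(tileZ + "-" + String.format("%04d", gridRow));
    }
    // Column name: regular-grid column index concatenated with the object id.
    static byte[] columnName(int gridCol, String objectId) {
        return Bytes.toBytes(String.format("%04d", gridCol) + "-" + objectId);
    }
}
```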
  31. This graph shows how the data is stored at the physical level with the HGrid data model.
  32. Our experiments were performed on a four-node Hadoop and HBase cluster. We implemented the range query and the kNN query for the three data models, with both uniformly and Zipf-distributed data.
  33. The range query is that ….. The results show that the HGrid data model is much better than the quad-tree data model and worse than the regular-grid data model. The same performance trends persist with both uniform and skewed data.
================================ Comparing the three models, we can see that the regular-grid data model outperforms the others. Because it supports better data locality, it demonstrates better performance, since the percentage of irrelevant rows scanned is low.
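To make the scan pattern concrete, here is a hedged sketch of how one tile of an HGrid range query might be processed, reusing the key helpers sketched above: scan the tile's overlapping row range, then check the column-index prefix client-side to drop cells outside the query window. This is an illustration under those encoding assumptions, not the thesis's exact implementation.

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeQuery {
    static void scanTile(HTable table, long tileZ, int rowLo, int rowHi,
                         int colLo, int colHi) throws Exception {
        // The row range covers the grid rows overlapping the query window.
        Scan scan = new Scan(HGridKeys.rowKey(tileZ, rowLo),
                             HGridKeys.rowKey(tileZ, rowHi + 1));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            for (KeyValue kv : r.raw()) {
                // The qualifier starts with the zero-padded column index.
                int col = Integer.parseInt(
                        Bytes.toString(kv.getQualifier()).substring(0, 4));
                if (col >= colLo && col <= colHi) {
                    // candidate point inside the query window
                }
            }
        }
        scanner.close();
    }
}
```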
  34. The kNN query is … We evaluated the performance of k-nearest-neighbor (kNN) queries using the same datasets under the three data models. This table shows the response time (in seconds), where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range-estimation method is employed, there is only one scan operation in the query processing for uniform data, while for skewed data more than one scan iteration is invoked to retrieve the data; that is why the performance with skewed data under all data models is a little worse than with the uniform dataset. For both uniform and skewed data, the regular-grid data model demonstrates the best performance among the three; the HGrid data model comes second, with slightly worse performance than the regular grid; and the quad-tree data model is outperformed by the other two. The poor locality preservation due to the Z-order linearization method contributes to the poor performance of the quad-tree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, the query for data points having more than 70% probability cannot return a result below the timeout threshold under any data model when k equals 10K. To improve performance, a finer granularity is required to filter out irrelevant data scanning.
  35. In summary, the query performance of the HGrid data model is better than the quad-tree data model and worse than the regular-grid data model. However, the quad-tree and regular-grid data models suffer from memory constraints, while HGrid is not subject to them. HGrid benefits from the good locality of the regular-grid index and suffers from the poor locality of the Z-order linearization, so better performance could be obtained with alternative linearization techniques.
=========================== For skewed data, HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints. HGrid can be flexibly configured and extended: 1) the quad-tree index can be replaced by the hash code of each sub-space; 2) the point-based quad-tree index method can be employed; 3) the granularity in the second stage can be varied from sub-space to sub-space based on the varying densities. Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
  36. The third problem: with the two proposed data models and a set of guidelines at hand, we decided to apply them in a real application, so we extended our study to migrating a geospatial application to the cloud.
  37. This work was done in the context of the SAVI project. The SAVI cloud is a hierarchical cloud with two tiers. Smart edges provide limited resources, are geographically near the user, and offer fast on-demand deployment to applications and low latency to users. The core has powerful computation and storage resources, providing centralized management services. Connections between the smart edges and the core go through the inner network rather than the wide-area network, which avoids interoperation issues among different clouds.
  38. The HCAT application is designed to make home-care service more efficient. There are three main services: the schedule service assigns the HCAs to visit patients and carry out their care plans; the assistant service enables HCAs to access and edit a client's care plan, as well as to provide textual information and images documenting their care status; and the location service instructs the HCAs about the location of their client and how to get there based on the traffic report. The HCA-T system is structured in the typical three-layer architecture: MySQL, a relational database system storing the persistent data, constitutes the base layer; Tomcat, an application server containing most of the application logic, forms the middle layer; and an HTTP server handling requests that come from application clients through the web is the top layer. There is a new requirement from research and administration services: they want to investigate and analyze the HCAs and clients across Canada.
  39. Given the need for low latency for end users and for centralized data analysis, the current architecture is not effective under the new requirement of analyzing the data as a whole. I will take two edges as an example here.
  40. If we deploy HCAT on the BC edge, users in BC will be happy, as they get very good performance because of service locality, while users in Ontario will be unhappy because of the high latency they might experience.
  41. If we deploy it on the Ontario edge, BC users will be unhappy.
  42. If we deploy it in the core, all users will be disappointed.
  43. If we deploy it on both edges, all users will be happy.
  44. So rather than deploying one application instance on one edge, we install many instances on multiple edges to make sure all users get low latency.
  45. In addition, we designed a data access and aggregation system in the core to extend the HCAT application with the capacity for centralized analysis. Users can query and access the data through RESTful services provided by the data analytics component of this system. So this architecture promises low latency to end users and provides centralized data analysis to researchers. Now, to make this architecture work, the question is how to migrate the data from the legacy databases on the smart edges to the HBase cluster in the core.
  46. To address this problem, we utilized Sqoop, which is designed for efficiently transferring data between Hadoop and structured data stores; here, Sqoop is used to import the data from the relational databases into HBase. A shell action updates the start index in each interval to ensure the data migration is incremental. We used Oozie to periodically invoke many parallel workflow jobs, each consisting of the Sqoop and shell actions, to migrate the data periodically and incrementally from the smart edges to the core. So, as the application runs, the data is gradually transferred from the legacy databases on the smart edges to the core.
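As a hedged sketch of one such import interval (the connection string, table names, and check column are hypothetical, and in the thesis this invocation is wrapped in an Oozie workflow action rather than called directly), Sqoop 1.4's Java entry point can run an incremental append import into HBase:

```java
import org.apache.sqoop.Sqoop;

public class EdgeToCoreImport {
    public static void main(String[] args) {
        // The last imported id, maintained externally by the shell action.
        String lastValue = args.length > 0 ? args[0] : "0";
        int ret = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://edge-db/hcat", // hypothetical edge DB
            "--table", "appointment",                 // hypothetical relation
            "--hbase-table", "appointment",
            "--column-family", "d",
            "--hbase-row-key", "id",
            "--incremental", "append",                // only new rows
            "--check-column", "id",
            "--last-value", lastValue
        });
        System.exit(ret);
    }
}
```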
  47. During this migration, we found that the most challenging problem we dealt with was transforming the data schema from RDBMSs to HBase, as the data in a real application is not as clean as the data we used in the previous experiments.
  48. There are three related works that tried to address a similar problem. All of them proposed de-normalizing the relations during the transition, but they did not discuss much about how to model the data in HBase to get better performance. Based on our data-modeling experience in HBase, we think this part is even more important during the transition. Inspired by the third related work, which proposed four transition guidelines based on the universal data model in relational databases, we proposed a four-step method based on the entity-relationship data model. The method is described as follows.
  49. First of all, we classify the relations into active and inactive. Then, we de-normalize the relations, following different rules for active and inactive relations. Next, we apply the appropriate data model based on the type of the data and the query patterns; here, we presented three data models that commonly exist in geospatial applications. Last but not least, we revisit and adjust the schemas based on HBase storage characteristics and new query requirements.
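To illustrate the de-normalization step, here is a minimal, hypothetical sketch of folding a one-to-many relation into a single wide HBase row, so that a parent entity and all its children can be read with one Get; the entity names and the JSON encoding are assumptions for illustration:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class Denormalize {
    // One MySQL appointment row becomes one column in its patient's
    // HBase row: parent key as row key, child key as column qualifier,
    // and the child tuple serialized as a JSON value.
    static Put appointmentToColumn(String patientId, String apptId,
                                   String apptJson) {
        Put put = new Put(Bytes.toBytes(patientId));
        put.add(Bytes.toBytes("appt"), Bytes.toBytes(apptId),
                Bytes.toBytes(apptJson));
        return put;
    }
}
```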
  50. Here is the ER data model in the MySQL database. There are five regular entities, two weak entities, one regular relationship, and one weak relationship.
  53. In the case study, we designed and implemented three types of queries: window queries, range queries, and statistical queries.
  54. The experiments were performed on the SAVI cloud, where we launched 10 virtual machines in the core to set up the data access and aggregation system, and installed two application instances on two smart edges, each with a single-node MySQL instance in a VM. We used one set of 200 GB of simulated data and duplicated it for both provinces. In this architecture, the migration and query performance is our biggest concern, so we performed one set of experiments for migration performance evaluation and two sets for query performance evaluation.
  55. Here, I present one set of experiments in which we evaluated time-series query performance by comparing against MySQL. The query finds how many appointments took place in the two provinces in a given period, from 1 week to 3 months. We implemented the query in HBase with the coprocessor framework: within one request, we call the coprocessor twice and then aggregate the results on the client side.
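The thesis used its own user-level coprocessor implementation. As a hedged illustration of the same parallel pattern, HBase 0.94 ships an aggregation endpoint (which must be enabled on the table as org.apache.hadoop.hbase.coprocessor.AggregateImplementation) whose client sums per-region partial counts; the table and family names below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class AppointmentCount {
    // Count appointment rows in [startRow, stopRow): each region server
    // counts its own rows in parallel, and the client sums the partials.
    static long countAppointments(Configuration conf, byte[] startRow,
                                  byte[] stopRow) throws Throwable {
        AggregationClient client = new AggregationClient(conf);
        Scan scan = new Scan(startRow, stopRow);
        scan.addFamily(Bytes.toBytes("d"));            // hypothetical family
        return client.rowCount(Bytes.toBytes("appointment"),
                               new LongColumnInterpreter(), scan);
    }
}
```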
  56. The yellow bar is the longest time taken by any coprocessor instance involved in this query. Since, within one query, many coprocessor instances may be launched in parallel, the longest time is the actual time for processing the query in HBase.
  57. We also implemented the query with the MySQL CLI and JDBC. The execution times for these two are for one province only: the CLI cannot work on the two disparate data sources in parallel, and with JDBC the query could not return a result because of the limited memory resources.
  58. In this work, we designed a federation-style, cloud-enabled architecture for geospatial applications. A systematic method of migrating an existing geospatial application to this architecture was also proposed. In addition, we proposed a method of transforming data schemas from relational databases to HBase, based on the entity-relationship data model in the relational database. In this method, we apply different data models to different categories of data, which ensures efficient query performance in HBase.
  59. In this thesis, we proposed ….. We also proposed …. In addition, we successfully migrated an existing application to a hierarchical cloud. During the migration, we designed …, we constructed …, we proposed …
  60. Here you put the three contributions slides
  61. There are three avenues for extending our work. First, we plan to investigate data beyond the geo-spatial domain, such as …. Second, we plan to investigate other NoSQL tools beyond HBase, such as …. Finally, a more general method for transforming data schemas from RDBMSs to HBase is needed for various kinds of data.