1. Design Patterns of HBase Configuration
Dan Han
Supervisor: Eleni Stroulia and Paul Sorenson
Department of Computing Science
University of Alberta, Canada
2. Data Modeling in HBase
Time–Series datasets (MESOCA 2012)
Geospatial datasets (CLOUD 2013)
Migrating
an existing application with geospatial, temporal and
categorical data
to SAVI, a hierarchical cloud
Contributions
Future Work
My Agenda for Today
Time-Series Geo-Spatial Migration to Cloud Conclusion
1
3.
RDBMS
Application
Server
Web
Server
Data Warehouse
Our Motivation: Big Data Problem
2
real-time or near real-time
4.
NoSQL
database
RDBMS
Too big
Too fast
Too hard
for existing tools
Our Motivation: Big Data Problem
3
5. Our Motivation: Application Extension
RDBMS
Web
Server
Application
Server
Data
Analytics
NoSQL
database
4
In-house IT system
6.
Cloud Infrastructure + NoSQL Storage Platform
A New Software Architecture + NoSQL expertise
Good Elasticity
Excellent Scalability
High Availability
Low Cost
NoSQL
database
Our Motivation: Application Migration
5
Data
Analytics
8. To develop guidelines for data organization
On HBase
A NoSQL database built on top of Hadoop
With a coprocessor framework (for parallel
processing)
Given
the application data model
the amount of data involved
query patterns
The Research Problem
Focusing on
Time-Series and
Geospatial data
7
10. Querying Time-Series Datasets
Modeling Data in HBase (1/2)
9
11. In an order-dependent manner
Typical applications
Sensor-based applications
Monitoring applications
Analysis Questions
What happened?
Why did it happen? (what happened before)
What may happen next?
Time-Series Datasets
10
12. Case Study: The Cosmology and Bixi Datasets
Cosmology dataset: 9 snapshots, 321,065,547 particles, 12 metrics
Bixi dataset: 100,800 timestamps, 404 stations, 12 metrics
Gas:
iorder, mass, x, y, z, velocity x, y, z,
phi, rho, temp, hsmooth, metals
Dark Matter:
iorder, mass, x, y, z, velocity x, y, z,
phi, eps
Star:
iorder, mass, x, y, z, velocity x, y, z,
phi, metals, tform, eps
11
13. Inspired by OpenTSDB and Facebook
Messages
Data Model
Row key: object id-period timestamp
Column name: object attributes
Version: offset of period-timestamp
A 3-Dimensional Data Model
Column
Row Key
12
14. Experiment Environment
A four-node cluster on virtual machines with Ubuntu
Hadoop 0.20, HBase 0.93-snapshot
Queries for each dataset
Three queries of Cosmology dataset from related research
One query of Bixi dataset from business requirement
Query processing implementation
Native Java API
User-Level Coprocessor Implementation
The Experiment
HBase Cluster
13
Schema | Row key | Family:Column | Version
Schema1 | sid-type-pid | particle properties | no meaning
Schema2 | type-pid | particle properties | snapshot id
Schema3 | type-reversedpid | particle properties | snapshot id

Example row keys across regions:
Schema1: 24-2-33446666, 64-2-33559999, 84-2-33550000
Schema2: 2-33446666, 2-33550000, 2-33559999
Schema3: 2-00005533, 2-66664433, 2-99995533

Three Schemas for the Cosmology Dataset
14
16. Query1: within one snapshot
Get all the particles of a type in a single snapshot
with a given property matching an expression
Query2: across two snapshots
Get all the particles added/destroyed between
two snapshots
Query3: across multiple snapshots
Get the values of a property for a continuous
range of particles, across a set of snapshots
Queries of Cosmology Dataset
15
17. Query1: Single Snapshot Selection
The data schema substantially impacts performance
16
18. Query2: Comparison Across two Snapshots
Using the 3rd dimension improves locality and consequently performance.
17
19. Query3: Projection Across Multiple Snapshots
The row-key design is a key aspect of the schema design.
18
20. Three Schemas for the Bixi Dataset
Schema | Row key | Family:Column | Version
Schema1 | hour-sid | minutes [0,59] | no meaning
Schema2 | hour-sid | monitoring metrics | minutes [0,59]
Schema3 | day-sid | monitoring metrics | minutes [0,1439]

[Figure: for each schema, how row, time, and metrics map onto the three dimensions]
19
21. Query3: Projection and Stats Across Snapshots
The period length (defining the 3rd dimension stack) is a key decision.
Get average bike usage in a given period (30 days) for a
given list of stations (200)
20
22. Investigated how to store time-series data
We explored data with
Few versions, many objects (Cosmology)
Many versions, few objects (Bixi)
Examined the impact of row-key design and versioning
Guidelines
“Few versions, many objects”: disperse the sequential data with row
key design, e.g. reversed object id
“Many versions, few objects”: versions can be deeper
These findings are applicable to “write-once, read-many”
systems
Contributions (1)
21
23. Querying Geospatial Datasets
Modeling Data in HBase (2/2)
22
24. Multi-dimensional data
Locations (latitude, longitude)
attributes
Applications
Location-aware applications
Analysis Questions
Who are my neighbors?
Which restaurants are close to me?
Geospatial Datasets
23
25. Nishimura et al:
built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.
Hsu et al:
presented a novel key formulation scheme, based on the R+-tree, for spatial indexing in HBase.
Focus on row-key design
no discussion about columns and versions
Geospatial Dataset Schema Design: Related Work
24
26. Two Synthetic Datasets
Uniform and ZipF distribution
Based on Bixi dataset, each object includes
station ID,
latitude, longitude, station name, terminal name,
number of docks
number of bikes
100 Million objects (70GB)
in a 100km*100km simulated space
Case Study: The Datasets
25
27. Trie-based quad-tree Indexing
Z-value Linearization
Data Model
Row key: Z-value
Column: Object ID
Value: one object in JSON Format
A typical Data Model: Quad-Tree
Z-Value
Object ID
05 07 13 15
04 06 12 14
01 03 09 11
00 02 08 10
Z-value
26
28. Regular Grid Indexing
Data Model
Row key: Grid rowID
Column: Grid columnID
Version: counter of Objects
Value: one object in JSON format
A typical Data Model: Regular Grid
Column ID
RowID
00 01 02 03
00
01
02
03
27
29. Two Possible Data Models
Quad-Tree
More rows with deeper
tree
Z-ordering linearization
(violates data locality)
In-time construction vs.
pre-construction implies a
tradeoff between query
performance and memory
allocation
Regular Grid
Very easy to locate a cell
by row id and column id
Cannot handle large
space and fine-grained
grid because in-memory
indexes are subject to
memory constraints
How much unrelated data is examined in a query matters a lot!
28
30. A Hybrid Model: the HGrid
Columnid-ObjectId
QTId-RowId
29
31. HGrid: Index Structure Construction
The row key is the QT Z-value + the RG row index.
The column name is the RG column index + the object ID.
The attributes of the data point are stored in the third dimension.
30
32. HGrid: Serialized Data at the Physical Level
[Figure: data points of the four tiles 00, 01, 10, 11 (labeled A-D) serialized row by row at the physical level]
31
33. Experiment Environment
A four-node cluster on virtual machines with Ubuntu
on OpenStack
Hadoop 1.0.2 (replication factor is 2), HBase 0.94
Query processing Implementation
Native Java API
User-Level Coprocessor Implementation
Typical Spatial Queries
Range Query and kNN Query
The Experiment
HBase Cluster
32
34. Given a location and a radius,
Return the data points, located within a distance less
or equal to the radius from the input location
Range Query
33
35. Given the coordinates of a location,
Return the K points nearest to the location
kNN Query
34
36. HGrid: A Geospatial Data Model that Performs
Worse than Regular-Grid data model
Much better than the Quad-Tree data model
Broader applicability: Both Quad-tree and Regular-Grid
data models suffer from memory constraints
HGrid is not subject to memory constraints
Qualities of HGrid
Benefits from the good locality of the regular-grid index
Suffers from the poor locality of Z-ordering linearization
Contributions (2)
35
37. Migrating an Existing Geospatial Application
to a Hierarchical Cloud
A Migration Case Study
SAVI
36
39. Home Care Aides Technology
Scheduling Service: schedule care plan
Assistant Service: audio/images/video/notes
Location Service: plan the path
The New Requirement
Analyze the data as a whole (Nationally)
The HCAT Application
MySQL
Web
Server
Tomcat
Server
38
40.
Given the need for
Low latency for end users
Centralized data analysis
How to re-architect HCAT application for all
users across Canada on SAVI cloud?
HCA-T to HCA-T2
39
Core
Smart
Edges
BC
ON
45. A Federation-style Architecture on SAVI Cloud
BC
Tomcat Server
Web Server
MySQL
ON
Tomcat Server
Web Server
MySQL
44
46. HBase
Core
DAAS
A Federation-style Architecture on SAVI Cloud
BC
Tomcat Server
Web Server
MySQL
ON
Tomcat Server
Web Server
MySQL
Data Analytics
45
AB QC…
48. The Most Challenging Research Problem
How can we transform data
schemas from RDBMSs to HBase?
47
49. Li:
Three guidelines for the transition to de-normalize the original relations.
Schram et al.:
A case study of mapping the Twitter data schema from MySQL to
Cassandra to support crisis informatics.
Gupta et al.:
Four transition guidelines from a traditional data warehouse to Hive, based on the universal data model in relational databases.
De-normalization was suggested
No discussion of how to model the data in HBase for efficient query performance.
Data Migration from RDBMSs to HBase: Related Work
48
50. Data Schema Transition from RDBMSs to HBase
1. Classify relations into active and inactive
2. De-normalize relations
3. Apply appropriate data models: descriptive DM, time-series DM, geospatial DM, ...
4. Adjust and optimize
49
54. Sample Queries
Window query
Which area (East/West/North/South) has the most/fewest
patients per region in 2012/2011?
Range query
Find the neighbors of a given client based on his/her
home location and a given distance.
Time-series statistical query
Get the total number of appointments/services/uploaded
images per week/month in a given region in 2012/2011.
53
55. Experiment Environment
SAVI Core: 10 VMs for DAAS
SAVI smart edges: Two HCAT application instances
Hadoop 1.0.2, HBase 0.94, Sqoop 1.4.3, Oozie 3.3.2
Dataset
200GB per province (one dataset duplicated for both)
Sets of Experiments
One set for migration performance evaluation
Two sets for query performance evaluation
The Experiment
Core Ontario edge
HCAT
BC edge
HCAT
DAAS
54
56. Aggregate Statistics Over Time
How many appointments took place in Ontario and BC
in a given period of time (from 1 week to 3 months)?
55
57. Aggregate Statistics Over Time
How many appointments took place in Ontario and BC
in a given period of time (from 1 week to 3 months)?
56
58. Aggregate Statistics Over Time
How many appointments took place in BC
in a given period of time (from 1 week to 3 months)?
57
59. We proposed a novel federation-style
architecture
We constructed a systematic way of
migrating geospatial applications to this
architecture
We proposed a method of transforming data
schemas from RDBMSs to HBase
Contributions (3)
58
60. Data Model in HBase
Time-Series dataset (MESOCA 2012)
Geospatial dataset (CLOUD 2013)
Migrating an existing application to cloud
Designed a novel federation-style architecture
Proposed a method for transforming data schemas from RDBMSs
to HBase
A practical case study
Presented the application of the above guidelines in the design
Verified the aforementioned data schema transition method
Concluding Summary
59
62. Investigate the data beyond the geo-spatial domain
text, images and videos from social network application
Investigate other NoSQL databases, beyond HBase
Key-Value Store, Document databases, and Graph databases
The method of transforming data schemas from
RDBMSs to HBase
Needs to be more general for various data
Future Work
61
Hello, everyone.
Today, I’ll talk about my thesis entitled “design patterns of HBase configuration”
In this thesis, there are two main problems we have addressed.
The first problem is how to model the data in HBase for a given application.
The second problem is how to migrate an existing geospatial application to a hierarchical cloud.
I will explain them in detail following this structure.
In the end, I will conclude with some future work.
There is a typical class of applications, which are structured in three layers.
The base layer is the database which is to store the data.
The middle layer is the application server, which manages all the application logics.
And web server handles all the requests from both web client and mobile client in the front layer.
Over the years, these applications have generated large amounts of data.
Traditionally, this data has been managed by a data warehouse.
Offline business intelligence is performed on the data warehouse to help executives understand their businesses and make decisions.
As the data grows rapidly, the warehouses and the solutions built around them can no longer ensure reasonable response times.
Making this problem even more challenging is the fact that these applications want to evolve by collecting more data and analyzing it in real time, or near real time.
In addition, some new analytics-based applications, such as the Insights of the Facebook Ad Network, Google Analytics, and recommendation systems, are also a big challenge for traditional database systems because of the large amounts of structured and unstructured data involved.
This problem is called “Big data”.
Sam Madden, a professor at MIT, said, and I quote, “Big data means that the data is too big, too fast, or too hard for existing tools”.
RDBMSs cannot help us solve the problem.
Compared to RDBMSs, NoSQL databases are becoming more and more attractive for these big-data applications.
To use NoSQL databases to address the problem,
There are two solutions in terms of the underlying infrastructure.
The first one is to build up an in-house NoSQL database system on their own IT infrastructure.
The second one is to migrate the application to cloud.
Some applications prefer this way because cloud offers many attractive features beyond scalability.
In both cases, a new software architecture is required, as well as NoSQL expertise: someone who knows how to model the data and prepare it for analysis.
Our objective is to re-architect the traditional SQL+REST applications to use NoSQL data stores for either migration or extension purpose.
Unfortunately, to date, there is very little methodological and tool support for software development on NoSQL databases.
In this work, we want to build a systematic method to guide developers to organize the data on HBase given the application data model and query patterns.
And we focused on time-series and geospatial applications, because they are crucial for understanding performance problems in databases.
Here is a little background on HBase.
HBase is an implementation of a NoSQL data store.
It stores the data in a Bigtable-style table, which is structured with row key, column family, column, and version.
The first problem is about time-series data.
Time-series data is generated in an order-dependent manner.
Time-series data analysis is very useful for answering questions like “What happened? Why did it happen? What may happen next?”
This kind of data is generated by many applications, such as sensor-based systems and monitoring systems. Ganglia, as an example of a monitoring system, monitors high-performance computing systems such as clusters and grids.
In our case study, we used the Cosmology dataset and the Bixi dataset.
Cosmology dataset was produced by an N-Body simulation of the universe evolution.
There are three types of particles: dark matter, gas, and star.
Particles evolve over time. Each snapshot records all particles at the corresponding timestamp.
In our dataset, there are 9 snapshots consisting of 321,065,547 particles.
Bixi dataset is from a bicycle-renting service in Montreal.
The data was collected every minute from the sensors.
There are 100,800 timestamps for 404 stations.
Our idea was inspired by two related works: OpenTSDB groups the data collected within a period of time into one row,
and the Facebook Messages system stores the message id in the third dimension.
We combined these two ideas and proposed a 3-dimensional data model to manage time-series datasets.
In this data model,
The first dimension is the combination of object Id and the timestamp of a period.
The columns are the varying attributes of objects
The 3rd dimension is the offset of timestamp in a period.
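To make the mapping concrete, here is a minimal sketch of how a reading's row key and version could be derived under this scheme; the helper names and the one-hour period length are assumptions for illustration, not the thesis implementation.

```java
// Hypothetical sketch of the 3-D time-series key scheme: one row groups
// an object's readings for one period; the HBase version slot stores the
// offset of each reading within that period.
public class TimeSeriesKey {
    static final long PERIOD_MS = 3_600_000L; // assumed period length: one hour

    // Row key: "<objectId>-<periodStartTimestamp>"
    static String rowKey(String objectId, long timestampMs) {
        long periodStart = (timestampMs / PERIOD_MS) * PERIOD_MS;
        return objectId + "-" + periodStart;
    }

    // Version (3rd dimension): offset of the timestamp within its period
    static long version(long timestampMs) {
        return timestampMs % PERIOD_MS;
    }
}
```

All readings of one object within the same hour thus land in one row, stacked in the version dimension.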
We did our experiments on a four-node cluster.
We evaluated the data model with Cosmology dataset and Bixi dataset
To evaluate performance implication of the third dimension, we designed one two-dimensional data schema, and two three dimensional data schemas. The difference between the two three-dimensional data schemas is the row key.
Based on requirements from related research, we designed three queries for the Cosmology dataset.
The queried data in the first one is from one snapshot.
And the queried data in query 2 & 3 come from multiple snapshots.
From the experiment results, we can see that the performance of query 1 differs greatly under the three schemas.
This leads us to conclude that the data schema has a large impact on query performance.
This conclusion was confirmed by the subsequent experiments.
For the second query, schema 1, the two-dimensional instance, performed worse than the other, three-dimensional, schemas. So we can see that performance can be improved with the 3rd dimension.
This conclusion was confirmed again by the performance of query 3.
By comparing the performance of schema 2 and schema 3 on query 3, we also learned that row-key design is especially important in data modeling.
We also did the same thing with the Bixi dataset.
To investigate the performance implications of the depth of the third dimension,
we designed schema 3 with more elements in the version dimension than schema 2.
The results show that schema 3 performs better than schema 2. So we can conclude that more performance can be obtained by localizing more values in the 3rd dimension.
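The contrast between the two three-dimensional Bixi schemas can be sketched as follows; the helper names are illustrative, and only the row-key shapes and version ranges come from the schema table.

```java
// Hypothetical sketch of the two 3-D Bixi schemas: Schema2 stacks up to
// 60 versions per row (minute of hour), Schema3 stacks up to 1440
// (minute of day), making the third dimension deeper.
public class BixiKeys {
    // Schema2: one row per station-hour, version = minute within the hour
    static String schema2RowKey(int hour, String stationId) {
        return hour + "-" + stationId;
    }
    static int schema2Version(int minuteOfDay) {
        return minuteOfDay % 60;      // [0,59]
    }

    // Schema3: one row per station-day, version = minute within the day
    static String schema3RowKey(int day, String stationId) {
        return day + "-" + stationId;
    }
    static int schema3Version(int minuteOfDay) {
        return minuteOfDay;           // [0,1439]
    }
}
```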
We investigated how to store time-series data using two datasets with different characteristics.
Geospatial data is multi-dimensional data, including locations and object attributes.
A very typical example is Foursquare, a location-based social networking service for mobile devices; it allows users to connect with friends based on their check-in locations.
Geospatial data analysis is necessary to answer questions like: Who are my neighbors? Which restaurants are close to me?
We first reviewed two related works.
But they only focused on the row-key design; we also explored the performance implications of the column and version design.
In our study, we used two synthetic datasets, generated from the Bixi dataset introduced before. We augmented the number of stations from 404 to one hundred million, placing them at random coordinates following uniform and Zipf distributions.
This dataset basically represents one hundred million objects in a 100km * 100km simulated space.
We began our work with quad-tree data model.
It relies on a trie-based quad-tree index and applies Z-ordering to transform the two-dimensional spatial data into a one-dimensional array.
In this model, the row key is the Z-value, and the column is the id of the object located in the cell.
Usually the cells are encoded with binary digits, but since row keys should be short in HBase, we use a decimal encoding here.
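A minimal sketch of the Z-ordering step, assuming the standard bit-interleaving construction implied by the slide's 4x4 grid (cell indices counted from the bottom-left):

```java
// Z-value linearization sketch: interleave the bits of the cell's
// column index (x) and row-from-bottom index (y), with the x bit taking
// the higher position of each pair. This reproduces the slide's grid,
// e.g. cell (2,0) -> 8, cell (3,3) -> 15.
public class ZOrder {
    static int zValue(int x, int y) {
        int z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((y >> i) & 1) << (2 * i);     // y bit -> even position
            z |= ((x >> i) & 1) << (2 * i + 1); // x bit -> odd position
        }
        return z;
    }
}
```

Nearby cells can receive distant Z-values (e.g. the jump between columns 1 and 2), which is the locality violation discussed later.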
The next data model we worked on is the regular-grid data model.
It relies on a regular-grid index. The row key is the row index of the cell in the grid, and the column is the column index of the cell. The version is a counter of objects, so the third dimension holds a stack of data points located in the same grid cell. The value is one object in JSON format, holding all other attributes and values.
After investigating these two data models, we found that it is easier to prune unrelated data with the regular-grid data model than with the quad-tree data model, because the Z-ordering linearization in the quad-tree data model violates data locality. Both of them, however, have issues with limited memory resources.
===========================================================
QT If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the size of the grid cells becomes smaller.
RG The third dimension holds a stack of data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
Considering the advantages and disadvantages of these two data models, we proposed HGrid data model including a hybrid index structure.
In this index structure, the data-set space is divided into equally-sized rectangular tiles T, encoded with their Z-value. And the data points are organized in a regular grid consisting of continuous uniform fine-grained cells.
In this model, each data point is uniquely identified in terms of its row key and column name.
The row key is the concatenation of the quad-tree Z-value and the Regular Grid row index.
The column name is the concatenation of the Regular Grid column index and the object id of the data point.
The attributes of the data point are stored in the third dimension.
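As an illustration, the HGrid addressing could be composed as below; the helper names are hypothetical, and a production schema would also zero-pad the numeric parts so that lexicographic byte order matches numeric order:

```java
// Hypothetical composition of HGrid coordinates: a point is located by
// its coarse quad-tree tile (Z-value) plus its fine regular-grid cell.
public class HGridKeys {
    // Row key: quad-tree tile Z-value + regular-grid row index
    static String rowKey(int tileZValue, int rgRow) {
        return tileZValue + "-" + rgRow;
    }
    // Column name: regular-grid column index + object id
    static String columnName(int rgCol, String objectId) {
        return rgCol + "-" + objectId;
    }
}
```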
This graph shows how the data is stored in physical level with HGrid data model.
Our experiments were performed on a four-node Hadoop and HBase cluster.
We implemented the Range Query and KNN query for three data models with both uniform and zipf distribution data.
The range query: given a location and a radius, return the data points located within the radius of the input location.
The results show that the HGrid data model is much better than the quad-tree data model and worse than the regular-grid data model.
The same performance trends persist with both uniform and skewed data.
================================
Comparing the three models, we can see that the regular-grid data model outperforms the others.
Because it supports better data locality, it demonstrates better performance since the percentage of irrelevant rows scanned is low.
The kNN query: given the coordinates of a location, return the K points nearest to that location.
We now evaluate the performance for k Nearest Neighbor (kNN) queries using the same data set, under the three data models. This table shows the response time (in seconds) for kNN queries, where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range estimation method is employed , there is only one scan operation in the query processing for uniform data, while for skewed data, more than one scan iterations are invoked to retrieve the data.
That is why the performance with skewed data under all data models is a little worse than that with the uniform data set. For both uniform and skewed data, the Regular-grid data model demonstrates best performance among the three data models; the HGrid data model come second with slightly worse performance than the regular-grid data model; and the quadtree data model is outperformed by the other two.
The poor locality preservation, due to the Z-order linearization method, contributes to the poor performance of the quadtree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, queries at data points with more than 70% probability cannot return a result within the timeout threshold under any data model when k equals 10K.
To improve performance, a finer granularity is required to filter irrelevant data scanning.
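For reference, a common density-based range estimate works as follows (the exact formula used in the thesis is not shown here, so treat this as an assumption): a circle expected to hold k points under uniform density d has radius sqrt(k / (pi * d)).

```java
// Density-based search-radius estimate for kNN: with a uniform density
// d (points per unit area), a circle of radius r contains about
// pi * r^2 * d points, so solving for k points gives
// r = sqrt(k / (pi * d)). Skewed data can make the estimate too small,
// which is why extra scan iterations are needed there.
public class KnnEstimate {
    static double searchRadius(int k, double densityPerUnitArea) {
        return Math.sqrt(k / (Math.PI * densityPerUnitArea));
    }
}
```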
In summary, the query performance of the HGrid data model is better than the quad-tree data model and worse than the regular-grid data model.
However, the quad-tree and regular-grid data models suffer from memory constraints, while HGrid is not subject to these constraints.
HGrid benefits from the good locality of the regular-grid index and suffers from the poor locality of Z-ordering linearization, so better performance could be obtained with alternative linearization techniques.
===========================
For skewed data, the HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints.
Can be flexibly configured and extended
1) The quad-tree index can be replaced by the hash code of each sub-space
2) The point-based quad-tree index method is employed.
3) The granularity in the second stage can be varied from sub-space to sub-space based on the various densities.
Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
The third problem.
With the two proposed data models and a set of guidelines at hand, we decided to apply them in a real application.
So we extended our study to migrating a geospatial application to the cloud.
This work has been done in the context of SAVI project.
SAVI cloud is a hierarchical cloud with two tiers.
Smart edges provide limited resources, are geographically near to the user, offering fast on-demand deployment to applications and low-latency to users.
The core has powerful computation and storage resources, providing centralized management services.
Connections between smart edges and the core go through the internal network rather than the wide-area network, which avoids interoperation issues among different clouds.
The HCAT application is designed to make home care services more efficient.
There are three main services:
The scheduling service assigns the HCAs to visit patients and carry out their care plans.
The assistant service enables HCAs to access and edit the client's care plan, as well as to provide textual information and images documenting care status.
The location service instructs the HCAs about the location of their client and how to get there, based on traffic reports.
The HCA-T system is structured in the typical three-layer architecture.
MySQL, a relational database system storing the persistent data, constitutes the base layer.
Tomcat, an application server containing most of the application logic, is the middle layer.
Finally, an HTTP server that handles requests coming from application clients through the web is the top layer.
There is a new requirement from research and administration services. They want to investigate and analyze the HCAs and clients across Canada.
Given the need for low latency for end users, and centralized data analysis,
With the new requirement of analyzing the data as a whole, the current architecture is not effective.
I will take two edges as an example here.
If we deploy HCAT on the BC edge, users in BC will be happy, as they get very good performance because of service locality,
while users in Ontario will be unhappy because of the high latency they might get.
If we deploy it on the Ontario edge, BC users will be unhappy.
If we deploy it in the core, all users will be disappointed.
If we deploy it on both edges, all users will be happy.
So rather than deploying one application instance on one edge, we install many instances on multiple edges
to make sure all users get low latency.
In addition, we designed a data access and aggregation system in the core to give the HCAT application the capacity for centralized analysis.
Users can query and access the data through RESTful services provided by the data analytics component of this system.
So we can see that this architecture promises low latency to end users and provides centralized data analysis to researchers.
Now, to make this architecture work, the question is how to migrate the data from the legacy databases on smart edges to HBase cluster on the core?
To address this problem, we utilized Sqoop, which is designed for efficiently transferring data between Hadoop and structured data stores. Here, Sqoop is used to import the data from the relational databases into HBase.
A shell action updates the start index in each interval to ensure the data migration is incremental.
We used Oozie to periodically invoke many parallel workflow jobs, consisting of the Sqoop and shell actions, to migrate the data periodically and incrementally from the smart edges to the core.
So we can see, as the application runs, the data will be transferred from the legacy databases on smart edges to the core gradually.
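The incremental idea can be modeled with a toy checkpoint loop; this is only a sketch of the mechanism, not the actual Sqoop/Oozie jobs:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of incremental migration: each periodic run ships only the
// rows past the saved start index, then advances the checkpoint, so edge
// data reaches the core gradually as the application runs.
public class IncrementalMigration {
    private long lastExportedId = 0; // the persisted "start index"

    // One scheduled run: select rows newer than the checkpoint.
    public List<Long> runOnce(List<Long> sourceIds) {
        List<Long> batch = new ArrayList<>();
        for (long id : sourceIds) {
            if (id > lastExportedId) batch.add(id);
        }
        for (long id : batch) lastExportedId = Math.max(lastExportedId, id);
        return batch; // rows transferred to the core this interval
    }
}
```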
During this migration, we found that the most challenging problem was transforming the data schema from RDBMSs to HBase, as the data in a real application is not as clean as the data we used in the previous experiments.
There are three related works that tried to address a similar problem.
All of them proposed to de-normalize the relations during the transition, but they did not discuss much about how to model the data in HBase for better performance.
Based on our data-modeling experience with HBase, we think this part is even more important during the transition.
Inspired by the third related work, which proposed four transition guidelines based on the universal data model in relational databases, we proposed a four-step method based on the entity-relationship data model.
The method is as follows.
First of all, we classify the relations into active and inactive.
Then, we de-normalize the relations, following different rules for active and inactive relations.
Next, we apply the appropriate data models based on the type of the data and the query patterns. Here, we presented three data models which commonly exist in geospatial applications.
Last but not least, we revisit and adjust the schemas based on HBase storage characteristics and new query requirements.
Here is the ER data model in the MySQL database.
There are five regular entities, two weak entities, one regular relationship, and one weak relationship.
In the case study, we designed and implemented three types of queries: window queries, range queries, and statistical queries.
The experiments were performed on the SAVI cloud, where we launched 10 virtual machines in the core to set up the data access and aggregation system; two application instances were installed on two smart edges, each with a single-node MySQL instance in a VM.
We used one set of 200GB of simulated data, duplicated for both provinces.
In this architecture, the migration and query performance is our biggest concern.
We performed one set of experiments for migration performance evaluation and two sets for query performance evaluation.
Here, I am going to present one set of experiments, where we evaluated time-series query performance by comparing against MySQL.
The query is to find out how many appointments took place in the two provinces in a given period, from 1 week to 3 months.
We implemented the query in HBase with the coprocessor framework. Within one request, we called the coprocessor twice and then aggregated the results on the client side.
The yellow bar is the longest time of any coprocessor instance involved in this query.
As many coprocessor instances might be launched in parallel within one query, the longest time is the actual time for processing the query in HBase.
We also implemented the query with the MySQL CLI and with JDBC. The execution times for these two are for
one province only, because the CLI cannot work on the two disparate data sources in parallel.
With JDBC, the query cannot produce a result because of limited memory resources.
In this work, we designed a federation-style cloud-enabled architecture for geospatial applications. A systematic method of migrating an existing geospatial application to this architecture was also proposed.
In addition, we proposed a method of transforming data schemas from relational database to HBase based on entity-relationship data model in relational database.
In this method, we apply different data models based on different categories of the data, which ensures the efficient query performance in HBase.
In this thesis, we proposed …..
We also proposed ….
In addition,
We successfully migrated an existing application to a hierarchical cloud. During the migration,
we designed …
we constructed..
we proposed
There are three avenues for extending our work.
First, we plan to investigate data beyond the geo-spatial domain, such as text, images, and videos from social-network applications.
Second, we plan to investigate other NoSQL tools beyond HBase, such as key-value stores, document databases, and graph databases.
Finally, a more general method for transforming data schemas from RDBMSs to HBase
is needed for various kinds of data.