SlideShare a Scribd company logo
1 of 14
Download to read offline
Splitting Large Medical Data Sets Based
on Normal Distribution in Cloud Environment
Hao Lan Zhang , Member, IEEE, Yali Zhao, Chaoyi Pang, and Jinyuan He
Abstract—The surge of medical and e-commerce applications has generated tremendous amount of data, which brings people to a
so-called “Big Data” era. Different from traditional large data sets, the term “Big Data” not only means the large size of data volume but
also indicates the high velocity of data generation. However, current data mining and analytical techniques are facing the challenge of
dealing with large volume data in a short period of time. This paper explores the efficiency of utilizing the Normal Distribution (ND)
method for splitting and processing large volume medical data in cloud environment, which can provide representative information in
the split data sets. The ND-based new model consists of two stages. The first stage adopts the ND method for large data sets splitting
and processing, which can reduce the volume of data sets. The second stage implements the ND-based model in a cloud computing
infrastructure for allocating the split data sets. The experimental results show substantial efficiency gains of the proposed method over
the conventional methods without splitting data into small partitions. The ND-based method can generate representative data sets,
which can offer efficient solution for large data processing. The split data sets can be processed in parallel in Cloud computing
environment.
Index Terms—Data splitting, cloud computing, Big Data, medical data processing, normal distribution
Ç
1 INTRODUCTION
THE arrival of the Big Data age urges researchers to find
optimized solutions for dealing with large quantity of
data, particularly for online commercial and medical data.
One of the goals is to reduce high dimensional data sets to
lower dimensional data sets without sacrificing its inherent
properties (knowledge content and relationship). Normally,
two classical methods can be considered for handling the
volume of large data sets: (1) splitting data into smaller
chunks so that each chunk can be analyzed independently;
(2) reducing data columns (dimensionality reduction). For
splitting large data sets, an enhanced UV-decomposition
method can be applied.
According to UV-Decomposition [1], a sample data table
containing 1,000 rows and 100 columns can be decomposed
into two tables based on as follows:
x1;1    x1;100
.
.
. ..
. .
.
.
x1;000;1    x1;000;100
2
6
6
4
3
7
7
5
¼
y1;1 y1;2
.
.
. .
.
.
y1;000;1 y1;000;2
2
6
6
4
3
7
7
5 
z1;1    z1;100
z2;1    z2;100
 
:
For a very large data set, the UV-decomposition method
cannot reduce the large data set, particularly for rows as
shown in the above matrices. This demands a more efficient
method of splitting the data sets. We observe that it is a com-
mon phenomenon that online data presents a mixture of nor-
mal- distribution forms in different groups as indicated in
Fig. 1. The figure on the bottom presents the distribution
based on an example data. The recalculated matrices can be
further broken down into smaller data sets based on the num-
ber of generated grouping results. In this way, the ND-based
data splitting can be more efficient for very large data sets.
Based on Fig. 1, the figure on top contains the informa-
tion about how data points are distributed. We applied the
ksdensity function to generate a probability distribution
shown on the bottom figure. we can observe that this data
set demonstrates similarity to a normal distribution form.
We further split the data set into more ND-like-sets. These
split data sets can be analyzed individually as a representa-
tive of the whole data set. This method can solve the time
and cost problems for dealing with large data sets and is
efficient for cloud environment.
The prerequisite of employing ND for resizing is that
data sets need to contain at least one measurable/numeric
column. The increase of more numeric data in data sets can
improve the accuracy of the analytical results. The primary
motivation of this research and major contributions of the
proposed N (m, s2
) method for data splitting are:
 The ND-based splitting method can create small and rep-
resentative data sets. The representative character of
ND-based data sets is reflected and evidenced by
various medical and physiological data sets such as
birth weight distribution, the population distribution
of Body Mass Index (BMI) [2], the population dis-
tribution of breast cancer patients [3], etc. This
 H. L. Zhang is with the Center for SCDM, NIT, Zhejiang University, ,
China and the Department of Computer and Information Systems, The
University of Melbourne, Australia. E-mail: haolan.zhang@nit.zju.edu.cn.
 Y. Zhao is with the Department of Computing and Information Systems,
The University of Melbourne, Australia. E-mail: yalizhao@unimelb.edu.au.
 C. Pang and J. He are with the Center for SCDM, NIT, Zhejiang Univer-
sity, China. E-mail: {chaoyi.pang, hjysix}@gmail.com.
Manuscript received 12 Jan. 2015; revised 8 June 2015; accepted 14 July 2015.
Date of publication 29 July 2015; date of current version 9 June 2020.
Recommended for acceptance by J. He, R. Kotagiri, and F.M. Sanchez.
Digital Object Identifier no. 10.1109/TCC.2015.2462361
518 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
2168-7161 ß 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
type of data can provide relatively comprehensive
information for data analysis rather than randomly
splitting. The experimental analysis of the represen-
tativeness of splitting data is provided in Section 4.
 The splitting data sets can be efficiently deployed in the
cloud environment. The distributed cloud computing
environment allows a system to analyse and process
data locally. The ND-based data sets can be distrib-
uted in any cloud frameworks in different locations.
However, these distributed data sets are not always
required for integration when conducting analysis
since a ND-based data set represents all characteris-
tics. This aspect will allow a single cloud node to pro-
cess its local ND-based data sets for global analysis.
Furthermore, the split data sets are suitable for paral-
lel processing in Cloud environment and increase the
flexibility of memory usage. The experimental results
based on CloudSim has provided in Section 5.
 The ND-based method supports semi-structured data sets,
which can be efficiently applied to medical data analysis.
There is a board range of data types in the practice of
medical and health sciences ranging from narrative,
textual data to numerical measurements, record sig-
nals, and so on [4]. Nevertheless, most medical data
sets normally contain several numeric attributes. It
makes the ND-based splitting process feasible and
efficient. Table 1 shows the sample a set of e-health,
which consists of narrative information and numeric
measurements.
In this sample data, both ‘age’ and ‘duration’ attributes
are numeric, which form a number of ND sub-sets. These
sub-sets might not be the precise ND form. These data sets
are named as ND-Approximate (NDA) sets, i.e. NDA, which
contain representative information compared with random
splitting sets, as shown in Fig. 2 (refer to Fig. 1 for splitting).
The UV decomposition methods can be applied to ND/
NDA data sets for further dimensionality reduction. The
current methods for generating fast analytical results based
on large medical or online data are limited due to the mas-
sive data size and high velocity [6]. This paper explores
the underlining issues and proposes the ND-based split-
ting method for fast data processing and analysis in cloud
environment.
This paper is organized as follows: the next section
reviews the previous work on large medical and online data
processing as well as the related methods. Section 3 introdu-
ces the ND method for large data splitting, which can pro-
vide relatively small and inclusive data sets for analysis.
Section 4 illustrates the ND-based medical data storage in
cloud systems. Section 5 provides the experimental results.
The final section concludes the research findings.
2 Related Work
The data splitting and partition methods have been studied
extensively. Basically, the previous work focus on two
aspects of data splitting, which include the algorithmic
aspect for data splitting and the system modelling aspect
for data file splitting and partition.
2.1 Related Data Splitting Algorithms
Baxter et al. [7] proposed the systematic data splitting
method, in which every kth sample from a random starting
point is selected to form the training, test and validation
datasets. In this method, data sets are first ordered in
TABLE 1
Example Medical Data for Drug Test Results [5]
Age Marital Status Duration Education
30 MARRIED 79 PRIMARY
33 MARRIED 220 SECONDARY
35 SINGLE 185 TERTIARY
30 MARRIED 199 TERTIARY
59 MARRIED 226 SECONDARY
35 SINGLE 141 TERTIARY
36 MARRIED 341 TERTIARY
39 MARRIED 151 SECONDARY
41 MARRIED 57 TERTIARY
43 MARRIED 313 PRIMARY
39 MARRIED 273 SECONDARY
43 MARRIED 113 SECONDARY
36 MARRIED 328 TERTIARY
20 SINGLE 261 SECONDARY
31 MARRIED 89 SECONDARY
Agei MARRIEDi DURATIONi EDUCATIONi
Fig. 1. The distribution of a large data set (CASP F2 Data).
Fig. 2. A small sample ND-based data sets.
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 519
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
increasing values along the output variable dimension. This
procedure requires huge time and memory, which is appar-
ently not feasible for big data processing.
May et al. [8] utilize the Self Organizing Maps (SOM)
method for generating smaller data sets, i.e. dimensionality
reduction, which primarily considers the columns reduction
of a large data set [9], [10]. Most of SOM has been employed
to a fixed window if it is used for online clustering for data
streams [10], [11], [12]. SBSS-N data splitting approach [13]
further extends SOM method through incorporating uni-
form random intra-cluster sampling process for data split-
ting. Neyman allocation rule [14] is adopted in this process,
which is expressed as:
nm ¼
Nmsm
PM
i¼1 Nisi

n
N
; (1)
where N is the size of the dataset, n is the required sample
size, Ni is the size of stratum i, si is the intra-stratum multi-
variate standard deviation of stratum i.
Current SOM-based solutions have certain limitations in
generating fast analytical results comparing with K-means
and other clustering methods. Like other clustering meth-
ods, the accuracy of self-learning of SOM is highly depen-
dent on the initial conditions: initial weight of the map,
sequence of training vector, etc. A possible method is to
select the initial weights from the same space spanned by
the linear principal component.
2.2 Related System Modelling for Data Splitting and
Partition
Existing techniques for solving large data processing prob-
lems have been applied to several notable models including
Google File Systems (GFS), Map-Reduce model, etc. These
techniques mainly concern file splitting using optimized file
storage frameworks. For instance, a large volume data file
can be split to smaller files in GFS [15], [16].
GFS model can provide efficient, reliable access to data
using large clusters of commodity hardware, which is able
to accommodate large data sets processing. In this model
consists of a Master sever and a large number of chunk
severs. Files are split into file chunks that are stored in
chunk sever. MapReduce model employs Map () and
Reduce () functions that divide a task into several smaller
tasks and assemble the final results. The MapReduce
model generates a number of key values. Map (k1, v1) !
list (k2, v2); Reduce (k2, list (v2)) ! list (v3). Some research-
ers believe that MapReduce model can solve large data set
processing problems to certain extent [15].These methods
basically focus on data sets decomposition using key-value
or file splitting methods, which do not consider providing
inclusive information in a data set. Therefore, a fast analy-
sis based on a number of representative data sets is not
available in GFS and MapReduce models. In other words,
GFS and MapReduce models cannot efficiently handle a
fast inquiry based on a number of small data sets due to
the lack of representative information.
Instead of focusing on file system framework design, this
paper primarily focuses on converting high dimensional data
sets into several low dimensional data sets through incorpo-
rating the enhanced N (m, s2
) method into cloud environment.
3 THE ND-BASED METHOD FOR SPLITTING
LARGE DATA
3.1 Data Observation
While observing various large data sets including bank data
(obtained from the World Bank’s 2013 open source data)
[17], superstore sales data (obtained from Tableau’s 2012
open source data)[18], personal tax data (obtained from
Australian 2003 online open source data), online transaction
data (obtained from an anonymous travel agency’s 2012
data), and e-health data (obtained from Diagnostic Wiscon-
sin Breast Cancer Database)[3].
Fig. 3a 3b, 3c, and 3d (eliminating specific labels and
details) indicate that the observed large data sets contain a
number of sub data sets. Each sub data set form normal dis-
tribution feature.
Most random data sets omit some particular attributes or
features, such as dates, gender, etc., normally contain a
number of sub data sets. Each sub data set has its own
group feature. In this paper we consider the group feature
as the normal distribution form and name such sub data set
as ND sub-data sets or simply ND sets. In other words, a
collection of data forms normal distribution would be con-
sidered as a sub data set. The reason for adopting normal
distribution to determine a sub data set is that this distribu-
tion form widely observed in practical data sets. Each ND
set has its own uptrend, peek and downtrend, which can
provide useful information for data analysis.
3.2 Sub Data Sets (ND/NDA Sets) Determination
A large data set might contain a number of ND sets. Identi-
fying ND sets is a crucial step for splitting large data sets.
Definition 1. A collection of data in a large data set is defined as
a ND set if its distribution is of the form N (m, s2
), m is the
mean or expectation of the distribution, s is its standard devia-
tion and its variance is s2
.
A ND set might not strictly form a standard normal distri-
bution. However, a non-standard ND set still contains
uptrend, peek and downtrend. These ND sets are called
approx. normal-distribution ND sets or NDA sets in short. In
a large set A [1, k], if (ND1[ ND2 [ . . . NDi  A) and NDi has
the distribution form N (m, s2
) or NDA (refer to Proposition
2) as shown in Fig. 4a, however, A might not have an exact
distribution of N (m, s2
) after transformed as shown in
Fig. 4b, however they distribute a ND-approximate form.
Hereafter, A [1, k] represents the complete large data set; [1, k]
is the boundary of A’s record number.
Definition 2. S [a, b]  A [1, k], if x2 [a, b] satisfies:
fðAðxÞ; m; sÞ ¼
1
s
F
AðxÞ  m
s
 
)
axb
a6¼b
S½a; b  ND: (2)
This proposition can be alternatively expressed as:
f AðxÞ
ð Þ ¼
t
ffiffiffiffiffiffi
2p
p e
t2ðAðxÞmÞ2
2 ) S½a; b  ND: (3)
As mentioned above, a NDA set might not strictly form a
normal distribution (refer to Fig. 4b). Instead of using f(x)
for probability distribution, the ND-based method converts
the data values using ksdensity function to generate a
520 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
probability distribution. Therefore, a NDA data set needs to
satisfy the following characteristics:
1) An NDA data set has a maximum possibility value
point within 2D, and this value is considered as the
identifier, i.e. p(x).
2) An NDA data set is basically practical data; therefore
it refers to Chebyshev Inequality [19] in a certain
way that is more than 89 percent data lay within 3 s.
3) An initial threshold mr (0  mr  1) is assigned to
determine maximum tolerance for the number of
data items in a NDA set does not form N(m, s2
),
which lie out of the normal distribution curve and
called disqualified data items. For instance, mr ¼ 0.3
means that the NDA set contains 30 percent of
disqualified data items. In this paper, a quick deter-
mination method is deployed for the identification
of disqualified data item.
Definition 3. In a large data set A [1, k] of k data elements, we
choose a contiguous subset S [pl, pr] starting at position
pl and choose position pr such that jðplþprÞ
2  pj  D where
A(p) denotes the elements with maximum possibility in the
subset S [c - D, c þ D], where c ¼ ðplþprÞ
2 ¼ pl þ B
2, where
B ¼ k/m and m is the number of total partitions.
Proposition 1. S [a, b]  A [1, k], A(p) 2 S, if S [a, b] satisfies:
A(p) ¼ Max(S), c a þ b
ð Þ=2 , c  p
j j  D; and if FðSÞ
ð1mrÞ  1
s
ffiffiffiffi
2p
p b
a expð c2
2 Þdx, mr is the adjusting parameter,
then S [a, b] is a NDA set.
The initial adjusting parameter c and adjusting factor
mr can be adjusted during the S [a, b] selection process.
The determination of A(p) is adjusted based on c. In this
paper, initial c is selected as (a þ b) / 2 when fðpÞ 
Pb
i¼a fðxÞ= ðb  aÞ and the maximum A(p) has c  p
j j  D
within S [a, b]; then p is selected as the real center for the
first ND/NDA set. D is calculated based on total data sets
and m.
Based on Proposition 1 and Definition 3, we modified the
determination process for identifying ND/NDA sets in the
following algorithm.
Algorithm 1 deploys m to determine the total number of
the splitting data sets. The adjustable value m denotes the ini-
tial number of the splitting data sets, which is pre-defined.
Fig. 4. The transformed data distributes N(m, s2
)-approx.
Fig. 3. The division of large data sets (example illustration).
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 521
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
During the splitting process, the preliminary boundary of the
first splitting set is predetermined. The boundary of the first
set normally starts from 1 to k / m. The adjusting process for
m is based on the center point with maximum probability
and its distance to k / 2m, i.e. D. m can be increased when the
center point c is on the left side of (pr – pl) / 2, which means D
is negative to predefined J  b). That will accumulatively cre-
ate one or several extra data sets instead of predefined m. In
the same way, m can be reduced when D is positive. The final
splitting sets could be variant to predefined m.
Algorithm 1. NDA sets Determination Process
Input: Dataset A; // with k data points;
m; // number of initial partitions;
D; // is the acceptable distance between center and
// the max point in the subset;
Output: ND // list of partitions;
1: B ¼ k / m // size of each partition;
2: ND ¼ [ ]; // initialize partitions to empty;
3: J ¼ 1; // partition number;
4: pl ¼ 1; // left position of the current partition;
5: pr ¼ B; // right position of the current partition;
6: c ¼ B / 2; // the center of the partition;
7: LP ¼ [ ]; // List of the unsatisfied max points;
8: cl ¼ c – D, cr ¼ c þ D;
9: p ¼ MaxPoint (A, cl, cr, LP); // Find the position of
10: // the max probability between positions pl and pr
11: // (inclusive) excluding the elements in the list LP;
12: while pr  k
13: while ( j p – c j  D)
14: LP ¼ [LP, p];
15: p ¼ MaxPoint (A, cl, cr, LP);
16: end
17: pr ¼ pr þ p – c;
18: ND ¼ [ND, [J, pl, pr]];
19: J ¼ J þ 1;
20: LP ¼ [ ];
21: pl ¼ pr þ 1; // Set next partition left pointer;
22: pr ¼ pl þ b; // Set next partition right pointer;
23: c ¼ (pl þ pr) / 2;
24: cl ¼ c – D, cr ¼ c þ D;
25: if pr  k
26: pr ¼ k;
27: end
28: end
We utilize mr for adjusting m value, which can be dynam-
ically applied in the splitting processing or re-applied to the
re-splitting process. The experiments conducted in Section 4
applies mr to the re-splitting process, which adjusts m after
the splitting
mi0
¼ xi 
k
2m
 
k; (4)
mr ¼
P
umi0  k
k
m
¼
xi  m
k

1
2
;
(5)
where xi is the position of the selected maximum point in
a boundary minus its left neighbor’s upper bound.
(
Px
i k=2m) indicates the size of the split data sets, which is
based on the accumulated deviation of the selected maxi-
mum points from the middle line (left neighbors’ upper
bounds þ k/2m). mr is a ratio between the actual split data
size and pre-defined k / m.
Adjusting m ¼ m  mr in the re-splitting process. For the
dynamic adjusting process, we divide k into several sets
successively. The mr value of the previous set will be
applied to the successive set until all sets are adjusted.
After splitting, the sample data set A [1, 1,000] can be
converted to several smaller ND/NDA data sets as below
x1;1    x1;100
.
.
. ..
. .
.
.
x1;000;1    x1;000;100
2
6
6
4
3
7
7
5
¼
x1;1    x1;100
.
.
. ..
. .
.
.
xi;1    xi;100
2
6
6
4
3
7
7
5 [ . . . [
xj;1    x1;100
.
.
. ..
. .
.
.
x1;000;1    x1;000;100
2
6
6
4
3
7
7
5:
These ND/NDA data sets basically reduce the size of a
large data set. It is efficient when the data analytical pro-
cesses are based on data sets with large volume and arriving
in short period time. The two-stage model in this paper
aims to reduce both data size (record numbers) and dimen-
sions. Smaller data sets, such as divided ND/NDA data
sets, can greatly improve time consumption for SOM-based
dimensionality reduction. Previous research showed that
different dimensionality reduction methods are seriously
limited to handle large datasets due to their computational
complexities and memory requirements [10]. The SOM-
based dimensional reduction for large data sets will be
introduced in the next section.
4 SPLITTING DATA SETS BASED ON THE ND/NDA
METHOD
4.1 Data Sources and Configurations
Three medical data source files have been processed based
on the N (m, s2
) method. These data sources are listed as
below.
1) Diabetic data source [20]: the dataset represents
10 years (1999-2008) of clinical care at 130 US hospi-
tals and integrated delivery networks. It includes
over 50 features representing patient and hospital
outcomes.
2) CASP data source [21]: this is a data set of physico-
chemical properties of protein tertiary structure pro-
vided by ABV—Indian institute of information
technology management.
3) Health data source: this data source contains the
patient information and their hospital records pro-
vided by the online sources, which record patients’
marriage status, employment status, in-hospital
duration, etc.
Propositions 1 and 2 indicate the basic method of split-
ting large files based on the ND method. We conducted a
set of experiments to generate and analyze the split files
based on Algorithm 1. The experimental settings are:
522 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
CPU Intel(R) Core(TM), CPU @ 2.10 GHz
RAM 4.00 GB
OS Windows 8, 64bits
Based on Algorithm 1, we select m as initial number
of data sets and a number of boundaries are created
based on k / m. The maximum value less than D distance
to the middle line will be selected as m within the initial
boundary. The middle line is k / 2m þ left neighbor’s
lower bound. Iteratively, all the data sets’ boundaries
will be determined. m is adjusted based on its deviation
from the middle line. A large file can be split into m or
adjusted m number of groups, i.e. [0, mi], . . . [m j, k], k is
the last data item in the data set. The maximum value xi
of each group and its position pi will be stored in an
array. This process will terminate when xi reaches the
boundary 0 and k.
We conducted a set of experiments based on medical
data sources. These data files are Diabetic data sets
(101,767 records, 50 attributes) [20], CASP data sets
(45,731 records, nine attributes) [21] and Health data
sets (4,522 records).
4.2 Splitting Data Sets
Based on the three data sources, we obtained the pi arrays.
The experimental results and experimental settings are
listed as follows:
 Dataset 1- Diabetic_data_source: m indicates the
mean of the splitting ND/NDA data sets in
Diabetic. s indicates the standard deviation of the
splitting ND/NDA data sets. The initialized
k /m¼ 7,269. This table is based on “Diabetic
Attribute: Diag_1”.
 Dataset 2 - CASP_data_source: m and s values are
the same as described in Dataset 1. The initialized
k /m ¼ 6,532. We only illustrate attributes F1, F2, F8
due to space limitation.
 Dataset 3-Health_data_source: m and s values are
the same as described in Dataset 1. The initialized
k /m ¼ 904. Based on “Health Attribute: Balance”.
4.3 Splitting Data Observation
The splitting results of the above three medical data sets are
illustrated as below. We observe the diabetic data sets,
which has been split into 13 files as shown in Fig. 5.
We further evaluate the CASP data set. However, due to
space limitation, we selectively illustrate the distribution
form of F8 attributes for the splitting results observation. In
this case, the KS Density are compared and evaluated as
shown in Fig. 6.
The Health data set and its splitting data sets based on
the balance attribute are evaluated. The Health data sets
has the smallest data records among the three source data,
however, the splitting results are obvious in terms of form
ND shapes as shown in Fig. 7.
4.4 Analysis
ND-based data sets can provide comprehensive informa-
tion. The inclusiveness of ND-based data sets is particularly
efficient for medical data sets according to real data obser-
vation [3], [20], [21]. The inclusiveness of ND-based data
sets can improve the performance of analyzing large vol-
ume data sets. In many cases, a quick inquiry can be based
on a split data set rather than going through the full data
set. The advantages of the inclusive attribute of ND-based
data sets are:
1) Improving data analytical time efficiency: ND-based
data sets support a quick inquiry to be conducted
TABLE 2
Boundary and Values of Splitting Diabetic Data Sets
m s NDA boundary NDA size
475.66 207.81 [1, 7,049] 7,049
485.09 213.41 [7,050, 14,103] 7,054
484.74 212.55 [14,104, 21,173] 7,070
481.79 212.94 [21,174, 28,232] 7,059
484.71 210.66 [28,233, 35,330] 7,098
488.83 211.49 [35,331, 42,398] 7,068
483.28 209.48 [42,399, 49,453] 7,055
490.41 219.02 [49,454, 56,515] 7,062
485.78 211.93 [56,516, 63,564] 7,049
493.44 216.27 [63,565, 70,617] 7,053
490.75 212.98 [70,618, 77,666] 7,049
488.74 213.07 [77,667, 84,729] 7,063
491.41 212.60 [84,730, 91,783] 7,054
486.57 207.08 [91,784, 98,837] 7,054
487.08 211.66 [98,838, 101,766] 2,929
TABLE 3
Boundary and Values of Splitting Casp Data Sets (f1,8)
CASP Attribute: F1
m s NDA boundary NDA size
9,849.98 3,959.98 [1, 6,179] 6,179
9,820.39 4,030.81 [6,180, 13,127] 6,948
9,896.54 4,116.42 [13,128, 19,224] 6,097
9,837.24 4,074.83 [19,225, 25,473] 6,249
9,886.61 4,083.54 [25,474, 32,395] 6,922
9,909.50 4,049.44 [32,396, 39,246] 6,851
9,900.65 4,089.03 [39,247, 45,730] 6,484
CASP Attribute: F8
m s NDA boundary NDA size
68.74 55.94 [1, 5,910] 5,910
69.92 56.40 [5,911, 11,820] 5,910
70.47 56.52 [11,821, 17,713] 5,893
70.11 57.03 [17,714, 23,721] 6,008
70.52 56.37 [23,722, 29,617] 5,896
70.29 56.48 [29,618, 35,543] 5,926
69.08 56.63 [35,544, 41,493] 5,950
70.94 56.55 [41,494, 45,730] 4,237
TABLE 4
Boundary and Values of Splitting Health Data Sets
m s NDA boundary NDA size
1,425.73 2,568.72 [1, 815] 815
1,434.42 2,767.34 [816, 1,645] 830
1,288.30 2,688.25 [1,646, 2,466] 821
1,430.35 3,119.67 [2,467, 3,310] 844
1,509.49 3,840.63 [3,311, 4,129] 819
1,474.81 2,746.92 [4,130, 4,521] 392
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 523
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
Fig. 5. Splitting results of diabetic data sets based on Diag1 attribute (a-o).
524 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
Fig. 6. Splitting results of CASP F8 data sets—KSDensity (a-i).
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 525
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
based on a split data set instead of the full data set.
This feature greatly improves the data analyzing
time.
2) Improving the accuracy of analytical process: ND-based
data sets contain abundant inclusive information.
Therefore, a quick inquiry or data processing based
on a split data set is considered as sufficient as the
full data set.
In order to verify the inclusiveness of ND-based data
sets, we conducted the following analysis based on the split-
ting data sets and the original data sets of the three data
sources. We observe the m and s values within a boundary
and compare these values with the original data.
 Dataset 1 - Diabetic_data_source:
In Dataset 1, we compare the m and s values of the
full data set with the corresponding values of the
split data sets.
The values contain in the split data sets are
very close to the original data set as show in
Table 5. The m value difference between the
original data set and the average split data sets is
0.0216, which is approximately 0.0044 percent var-
iation of the original data. The s value difference
between the original data set and the average split
data sets is 0.1095, which is approximately 0.0516
percent variation of the original data value. Fig. 8
shows the values of the original data set and the
values of each split set.
Fig. 7. Splitting results of health data based on Balance (a-g).
Fig. 8. Variation of original data values and split data values based on
diabetic source (mean, standard deviation).
526 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
Fig. 9 shows that the ND-like distribution of orig-
inal data set (a) is very similar to splitting data sets
(b, c and d). Overall, the split data sets contain rela-
tively comprehensive information, and they can be
representative for the full data set.
 Dataset 2 - CASP_data_source:
In dataset 2, we only illustrate the result of
CASP_F8 due to the limited space. The comparison
between the full data set and the average split sets is
listed as shown in Table 6.
The values of the split data sets have very slight
variation compared with the original full data set.
The m value difference between the original data set
and the average split data sets is 0.034, which is
approximately 0.0486 percent variation of the origi-
nal data. The s value difference between the
original data set and the average split data sets is 0.
Each split data set is approximate 1/8 size of the
original data set.
Fig. 10 shows the values of the original data set
and the values of each split set.
The m and s values of the split data sets have
very small variations compared with the full data
set. Therefore, the split data sets based on CASP_F8
can be representative for the full data set.
 Dataset 3 - Health_data_source:
Dataset 3 is split into 5 individual data sets. The
comparison between the full data set and the average
split sets is listed in Table 7.
The variations of the split health (balance attribute) data
sets are slightly larger than Diabetic-Diag1 and CASP-F8.
We noticed that the average size of each split data set is
small, which is the major factor for increasing the variations.
The variations of the s values, which we consider as the
crucial criteria, are stable as shown in Fig. 11. The difference
between them is 54.04. It is approximately 1.79 percent vari-
ation of the original data.
Fig. 12 shows that the ND-like distribution of splitting
data set (b, c and d) have slightly increased their s values
compared with the original data set (a). However, the varia-
tion of s is reasonably small (1.79 percent of original data),
and m variation is very small (0.315 percent of original
data). Therefore, they can be representative for the full data
set in a certain degree.
5 RESIZED MEDICAL DATA SIMULATION IN
CLOUDSIM
The Cloud computing paradigm provides an efficient solu-
tion for utilization of computing infrastructures through
dynamic allocation of resources in a pay as you go model,
which allows users to effectively utilize resources and save
Fig. 10. Variation of original data values and split data values based on
CASP source (mean, standard deviation).
TABLE 7
Split and Original Data Sets
Comparison- Health_Balance
Original Health_Balance Average of Split Sets
m 1,422.7 1,427.18
s 3,009.3 2,955.26
Size 4,521 753.5
TABLE 6
Split and Original Data Sets Comparison- CASP_F8
Orignial CASP_F8 Average of Split Sets
m 69.975 70.009
s 56.49 56.49
Size 45,730 5,716.25
TABLE 5
Split and Original Data Sets Comparison- Diabetic_Diag1
Original Diabetic_Diag1 Average of Split Sets
m 486.5304 486.552
s 212.3062 212.1967
Size 101,766 6,784.4
Fig. 9. ND-like form of original data values and split data values based on
diabetic source.
Fig. 11. Variation of original data values and split data values based on
health source (mean, standard deviation).
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 527
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
costs. The ND-based method is efficient to be applied in
Cloud environments as Cloud solutions provide abundant
support on distributed and parallel computing [22]. The ND-
based method is efficient to be applied in Cloud environ-
ments since Cloud solutions provide abundant support on
distributed and parallel computing [23], [24]. The ND-based
method allows files to be processed in a distributed and par-
allel manner, which can improve the performance and
reduce the costs.
To evaluate the performance of the ND-based method in
Cloud environments, we conduct experiments based on
CloudSim [25]. CloudSim is a discrete event simulator that
enables modeling of Cloud environment with on-demand
resource provision in a repeatable and controllable way. We
compared performance difference of resource configura-
tion, response time, resource cost, and flexibility between
applying ND-based method to process the split data sets
and not applying ND-based method to process the original
data sets. The flexibility refers to the failure rate of data
processing under limited resources. The higher failure rate
indicates the less flexibility of the data processing method.
5.1 Resource Configuration—CloudSim
We simulate a datacenter that consists of 500 physical
nodes. Each node has 64 CPU cores, 300 GB memory, 1 TB
storage, and 10 GB/s network bandwidth. Each CPU core
has performance equivalent to 4,000 Million Instructions
per Second (MIPS). We consider five types of memory opti-
mized Amazon EC2 VMs, which are: r3large, r3xlarge,
r32xlarge, r34xlarge, and r38xlarge [26] as listed in Table 8.
VMs in Amazon EC2 are charged in an hourly basis.
5.2 Workload Data
The workload generated in this study is based on the Big
Data Benchmark [27], which executes queries using differ-
ent data analytic frameworks on large data set in Amazon
EC2 and details the VM configuration and query processing
time. Based on the information, we model resource require-
ments for data processing tasks under different resource
configurations in CloudSim.
To compare performance difference between processing
split data sets and original data sets, we also need to take
data splitting into account, as this operation requires extra
VM resources and time. We use splitting time of CASP data
sets as the experimental data. However, the size of CASP
data sets is not sufficient large to show the obvious perfor-
mance benefits of applying ND-based method. Therefore,
the resources required by splitting original data sets are
enlarged to a suitable scale according to the original file size.
For ND-based data splitting application, we consider sin-
gle core splitting and multicore splitting. For multi-core
splitting, we model contentions for file assess as a penalty
in performance speed up for the extra cores. This means
that only a fraction of the execution time of the splitting
application can benefit from the extra cores. In this experi-
ment, we set this fraction to 50 percent. We model the
resource contention in a conservative way. If a better
method is chosen to provide the data sets to the threads of
multicore applications, the performance could be signifi-
cantly better. Since resource configuration and data splitting
time are known, we model resource requirements for file
splitting tasks for single core application and multicore
application in CloudSim.
As the performances of data processing and data splitting
vary, which do not consume the same amount of resources
each time, we simulate 10 percent performance variation
by introducing resource variation coefficient, which is a
random value generated from a Uniform Distribution with
upper bound 1.1 and lower bound 0.9. Based on the above
settings, we modeled three types of tasks in CloudSim:
(1) the task for splitting, (2) the task for processing split data
sets, (3) the task for processing original (non-split) data sets.
5.3 Experimental Results
The first part of this study aims to evaluate performance dif-
ferences of resource configuration, resource cost, and
response time for processing split data sets and non-split
data sets. We have five resource configurations for five types
of VM in our experiment. For each VM configuration, we
provision three VMs and run one type of tasks on each VM.
The first VM is utilized to run the data splitting tasks, the sec-
ond VM is used to run the data processing tasks for split data
sets, and the third VM is utilized to run the data processing
task for the original data sets. We assume our data analytic
application adopts single core in-memory processing.
We obtain results of resource configuration, VM execu-
tion time, and resource cost from the study, as shown in
Tables 2, 3, 4, and 5. From these results, we can see that
there are only four types of resource configurations avail-
able for processing the original file. VM configuration of
r3.large is not available as it has 15 GB memory, which
Fig. 12. ND-like form of original data values and split data values based
on health source.
TABLE 8
VM Configuration Information
VM type vCPU ECU Memory Storage Cost
r3.large 2 6.5 15 32 SSD $0.175 /Hour
r3.xlarge 4 13 30.5 80 SSD $0.350 /Hour
r3.2xlarge 8 26 61 160 SSD $0.700 /Hour
r3.4xlarge 16 52 122 320 SSD $1.400 /Hour
r3.8xlarge 32 104 244 640 SSD $2.800 /Hour
528 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
cannot fit 30 GB data in memory for processing. Five types
of resource configuration are all available for processing
split data sets as smaller data sets can be easily fit in mem-
ory for processing.
We can see that splitting data sets using single core appli-
cations does not have performance advantages (see Table 9).
In the contrary, it takes longer response time and incurs
higher costs. For the cheapest resource configuration using
r3.xlarge, the overall cost and processing time of single core
splitting is $2.100, 5.015 hours while for multi-core splitting
is $1.050, 2.383 hours. Multi-core processing is 50 percent
cheaper than single core processing and provides 2.10 times
speed up (see Table 10). We also notice that processing split
data sets not only generates less resource cost but also
reduces data processing time. For cost minimized resource
configuration of data processing, we can see that the overall
cost and time for processing original data sets using r3.
xlarge is $9.100, 25.997 hours, while for processing split
data sets using r3.xlarge is $2.800, 7.721 hours. Processing
split data sets is 69.2 percent cheaper and offers 3.37 times
speed up.
We use cost-minimized resource configurations with
minimized execution time to compare the performance dif-
ferences between applying ND-based method and not
applying the method. For applying ND-based method, we
select VM r3.xlarge to execute data splitting tasks with
multi-core splitting and select VM r3.xlarge to process split
data sets in parallel. The overall cost and execution time of
this configuration is $3.850 and 10.104 hours. For processing
original data sets, we select VM r3.xlarge as the resource
configuration, which costs $10.150 and takes 28.380 hours.
The result shows that applying ND-based method performs
significantly better as it not only gives 62.1 percent cost
saving but also leads to 2.81 times speed up. Tables 9-12
show the experimental results.
The second part of this study aims to analyze the flexibil-
ity of processing the split data sets and the original data
sets. Data processing applications adopt in-memory
processing, which require the whole data sets to be fitted in
memory for processing. In this way, when the available
memory in a VM is less than the size of data sets, the data
sets cannot be processed successfully and a failure is gener-
ated. A comparison has been conducted to analyze failures
of processing split data sets and original data sets as an indi-
cator for flexibility analysis.
The experiment scenario is that VMs have running tasks;
therefore, their capacities are not fully available. We con-
sider memory as the resource bottleneck for data processing
and introduce the Resource availability coefficient (RAC) to
model the available percentage of memory in VMs. RAC is
a random value generated using Uniform Distribution with
upper bound 1 and lower bound 0. We conduct 10 experi-
ments and each has a different RAC coefficient for five types
of VM configuration. For each VM configuration, we create
two VMs to execute the original data sets and split data sets.
For split data sets and original data sets, 50 data processing
tasks are executed accordingly. We obtain failure rates of
the 10 experiments, which are arranged in an ascending
order of the RAC value, as shown in Fig. 13.
TABLE 11
Resource Utilization of Processing Original File
Resource configuration Execution time VM Cost
r3.large null null
r3.xlarge 25.997 hours $9.100
r3.2xlarge 25.840 hours $18.200
r3.4xlarge 27.260 hours $39.200
r3.8xlarge 26.483 hours $75.600
TABLE 12
Resource Utilization of Processing Partitioned Files
Resource configuration Execution time VM Cost
r3.large 15.309 hours $2.800
r3.xlarge 7.721 hours $2.800
r3.2xlarge 4.103 hours $3.500
r3.4xlarge 4.045 hours $7.000
r3.8xlarge 4.018 hours $14.000
Fig. 13. Resource utilization of partitioning original file using multi core
application.
TABLE 10
Resource Utilization of Partitioning Original File
Using Multi Core Application
Resource configuration Execution time VM Cost
r3. large null null
r3.xlarge 2.383 hours $1.050
r3.2xlarge 1.224 hours $1.400
r3.4xlarge 0.590 hours $1.400
r3.8xlarge 0.309 hours $2.800
TABLE 9
Resource Utilization of Partitioning Original File
Using Single Core Application
Resource configuration Execution time VM Cost
r3.large null null
r3.xlarge 5.015 hours $2.100
r3.2xlarge 4.952 hours $3.500
r3.4xlarge 4.420 hours $7.000
r3.8xlarge 4.250 hours $14.000
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 529
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
We compare the failure rate of processing split data sets
and original data sets. Results show that 30 failures out of
50 tasks are generated during processing original data sets
while six failures out of 50 tasks are generated during proc-
essing split data sets. Applying ND-based data processing
method performs significantly better as the failure rate is 12
percent comparing to the failure rate of 60 percent for proc-
essing the original data sets.
For VMs with less available resources, as indicated by
RAC value from 0.06 to 0.25, failures are generated for both
split data sets and original data sets. When RAC is 0.06, five
failures are generated for processing original data sets while
three failures are generated for processing split data sets.
When available VM resource increases, indicated by increas-
ing RAC value from 0.13 to 0.25, the number of failures
decreases for both original data sets and split data sets.
When we further increase the available resources, indicated
by RAC value from 0.31 to 0.99, the number of failures fur-
ther decreases for processing original data set while no fail-
ures are generated for processing split data sets. The
advantage of applying ND-based method is obvious as proc-
essing split data sets generates less number of failures com-
paring to processing original data set for all RAC values.
It is to be noted that our definition of failure for process-
ing split data sets is rigorous. Even if the processing of one
split data set fails, the processing of the whole data sets fails,
regardless of the other successfully processed split data sets.
However, we can partially process split data sets that can be
fitted in memory of under-utilized VMs for higher cost sav-
ing. In this way, it can better utilize remaining resources on
under-utilized VMs as smaller split data sets requires less
resources so that they have more chances to be successfully
processed. The ND-based method only provisions new
VMs for the rest of split data sets that are not successfully
executed, which greatly satisfies cost-saving requirements
of cloud users.
6 CONCLUSION
A novel ND-based method has been introduced in this
paper for splitting large data sets. The rapid growth of data
has caused the problem for efficient data processing and
analysis due to the enormous data size in the “Big Data” era.
The proposed ND-based method can efficiently split
large data sets, particularly for medical and e-health data.
We observed that most medical data sets normally contain
several numeric attributes, which makes the ND-based
splitting process feasible and efficient. Furthermore, the
split data sets are suitable for parallel processing in cloud
environment, which makes real-time data analysis feasible
for emergency medical procedures. The ND-based split
data sets have been tested for their representative capability
based on the inclusive analysis provided in Section 4. This
feature allows systems return query results based on a sin-
gle split data set since the inclusive character of ND-based
split data sets.
The future work of this study will further conduct experi-
ments in Cloud sim to analyze the splitting algorithm by
exploring whether the splitting process can give advantages
for data analysis when parallel data processing is deployed.
Furthermore, we will investigate the effects that algorithm
parameters have on its performance, such as, selecting suit-
able mr and D values for generating more efficient and accu-
rate split data sets.
ACKNOWLEDGMENTS
The authors would like to express our gratitude to Professor
Rao Kotagiri from the University of Melbourne for his
valuable comments and suggestion and The CLOUDS Labo-
ratory at the University of Melbourne for providing Cloud-
Sim platform and experimental environment for simulation,
to student members from Sun Yat-Sen University for con-
ducting the splitting data evaluation. They would thank the
Center for machine learning and intelligent systems at Uni-
versity of California, Irvine, for providing open medical
data source. This work was supported in part by the
Zhejiang Provincial Natural Science Foundation of China
under Grant LY 14G010004, Zhejiang Provincial Philosophy
and Social Science Project under Grant 11JCSH03YB, and
National Natural Science Foundation of China under
Grant 61272480.
REFERENCES
[1] G. W. Stewart, “On the early history of the singular value decom-
position,” SIAM Rev., vol. 35, no. 4, pp. 551—566, 1993.
[2] A. D. Penman and W. D. Johnson, “The changing shape of the
body mass index distribution curve in the population: implica-
tions for public health policy to reduce the prevalence of adult
obesity,” Pre Chronic Disease, vol. 3, no. 3, p. A74, 2006.
[3] W. H. Wolberg, W. N. Street, and O. L. Mangasarian, Breast Cancer
Wisconsin (Diagnostic) Data Set, Nov. 2014.
[4] E. H. Shortliffe and G. O. Barnett, “Biomedical data: Their acquisi-
tion, storage, and use,” in Biomedical Informatics. New York, NY,
USA: Springer, 2006, pp. 46–79.
[5] M. G. Walker and J. Shi, Statistics Consulting—ANOVA Example.
Walker Bioscience, Nov. 2014.
[6] A. Labrinidis and H. V. Jagadish, “Challenges and opportunities
with big data,” Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032 –
2033. 2012.
[7] C. W. Baxter, S. J. Stanley, Q. Zhang, and D. W. Smith,
“Developing artificial neural network process models: A guide for
drinking water utilities,” in Proc. CSCE, 2000, pp. 376–383.
[8] R. J. May, H. R. Maier, and G. C. Dandy, “Data splitting for artifi-
cial neural networks using SOM-based stratified sampling,” Neu-
ral Netw., vol. 23, pp. 283–94. 2010.
[9] T. Kohonen, “Self-organized formation of topologically correct
feature maps,” Biological Cybern., vol. 43, no. 1, pp. 59–69,
1982.
[10] K. Tasdemir and E. Merenyi, “SOM-based topology visualisation
for interactive analysis of high-dimensional large datasets,” Mach.
Learning Rep., vol. 1, pp. 13–15, 2012.
[11] S. Rani and G. Sikka, “Recent techniques of clustering of time
series data: A survey,” Int. J. Comput. Appl., vol. 52, no. 15,
pp. 1–9, 2012.
[12] P. Russom, “Big data analytics,” TDWI Best Practices Report (4th
Quarter). Renton, WA, USA: TDWI Press, 2011, pp. 3–34.
[13] W. Wu, H. R. Maier, G. Dandy, and R. May, “Exploring the impact
of data splitting methods on artificial neural network models,” in
Proc. 10th Int. Conf. Hydroinformat.: Understanding Changing Climate
Environ. Finding Solutions, 2012, pp. 1–8.
[14] J. Neyman, “On the two different aspects of the representative
methods. The method stratified sampling and the method of
purposive selection,” J. Roy. Statistical Soc., vol. 97, pp. 558–606,
1934.
[15] J. Dean and S. Ghemawat, “MapReduce: Simplied data processing
on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[16] F. Marozzo, D. Talia, and P. Trunfio, “P2P-MapReduce: Parallel
data processing in dynamic cloud environments,” J. Comput. Syst.
Sci., vol. 78, no. 5, pp. 1382–1402, 2012.
[17] The World Bank Group, World Development Indicators, World Bank
Publication, Dec. 2013.
530 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
[18] Tableau Software, Sample Superstore Sales, Jan. 2014.
[19] R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics,
5th ed. Englewood Cliffs, NJ, USA: Prentice Hall Press. 2007.
[20] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J.
Cios, and J. N. Clore, “Impact of HbA1c measurement on hospital
readmission rates: Analysis of 70,000 clinical database patient
records,” BioMed. Res. Int., vol. 2014, p. 11, 2014.
[21] P. S. Rana, “Physicochemical properties of protein tertiary struc-
ture data set,” ABV -Indian Institute of Information Technology 
Management, MP, 2014.
[22] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic,
“Cloud computing and emerging IT platforms: Vision, hype, and
reality for delivering computing as the 5th utility,” Futur. Gener.
Comput. Syst., vol. 25, no. 6, pp. 599–616, Jun. 2009.
[23] A.N. Toosi, R. N. Calheri, and R. Buyya, “Interconnected cloud
computing environments: Challenges, taxonomy and survey,”
ACM Computi. Surveys, vol. 47, no 1, pp. 7:1–7:47, 2014.
[24] B. Hayes, “Cloud computing,” Commun. ACM, vol. 51, no. 7,
pp. 9–11, Jul. 2008.
[25] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. D. Rose, and
R. Buyya. “CloudSim: A toolkit for modeling and simulation
of cloud computing environments and evaluation of resource
provisioning algorithms,” Software— Practice experience, vol. 41,
pp. 23–50, 2010.
[26] Amazon Web Services Inc. Amazon Elastic Compute Cloud—User
Guide for Microsoft Windows, Amazon Copyright, 2014.
[27] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden,
and M. Stonebraker, “A comparison of approaches to large-scale
data analysis,” in Proc. ACM SIGMOD Int. Conf. Manage. Data,
2009, pp. 165–178.
Hao Lan Zhang received the PhD degree in
computer science from Victoria University, Aus-
tralia. He is currently an associate professor at
NIT, Zhejiang University in China and the 151
Talented Scholar of Zhejiang Province. Prior to
that, he was a research fellow and academic staff
at RMIT University, Australia. He has published
over forty papers in refereed journals and confer-
ences. His research interests include intelligent
information systems, multiagent systems, health
informatics, knowledge-based systems, data
mining, etc. He serves as the editorial board member and a reviewer of
several journals including: IEEE Transactions on SCM, IEEE Transac-
tions on Cloud Computing, The Computer Journal, Information Scien-
ces, Journal of AAMAS, etc. He is a member of the IEEE.
Yali Zhao received the bachelor’s degree from
Harbin Institute of Technology, China, and the
joint master’s degree from Harbin Institute of
Technology, China, and University of Bordeaux
1, France. She is currently working toward
the PhD degree at the University of Melbourne,
Australia. Her research interest is SLA-based
resource scheduling for big data analytic as a ser-
vice in cloud computing environments.
Chaoyi Pang received the PhD degree from the
University of Melbourne. He is currently a full pro-
fessor at NIT, Zhejiang University, and a 1,000
talented professor of Zhejiang Province. His
research interests are in algorithm, data security/
privacy, access control, data warehousing, data
integration, database theory, graph theory. He is
a senior member of ACM.
Jinyuan He received the bachelor’s and master’s
degrees from Sun Yat-Sen University in 2013 and
2015, respectively. He was a research assistant
in the Center for Social Computing and Data
Management, NIT, Zhejiang University in 2014.
He is currently working on the large data sets
splitting projects.
 For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/csdl.
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 531
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.

More Related Content

Similar to NAD Vs MapReduce

Framework for efficient transformation for complex medical data for improving...
Framework for efficient transformation for complex medical data for improving...Framework for efficient transformation for complex medical data for improving...
Framework for efficient transformation for complex medical data for improving...IJECEIAES
 
Analysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetAnalysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetIRJET Journal
 
IRJET-Survey on Data Mining Techniques for Disease Prediction
IRJET-Survey on Data Mining Techniques for Disease PredictionIRJET-Survey on Data Mining Techniques for Disease Prediction
IRJET-Survey on Data Mining Techniques for Disease PredictionIRJET Journal
 
Assigning Scores For Ordered Categorical Responses
Assigning Scores For Ordered Categorical ResponsesAssigning Scores For Ordered Categorical Responses
Assigning Scores For Ordered Categorical ResponsesMary Montoya
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHcscpconf
 
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop ClusterIRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop ClusterIRJET Journal
 
Disease prediction in big data healthcare using extended convolutional neural...
Disease prediction in big data healthcare using extended convolutional neural...Disease prediction in big data healthcare using extended convolutional neural...
Disease prediction in big data healthcare using extended convolutional neural...IJAAS Team
 
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...IJECEIAES
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MININGAshish Salve
 
Ijarcet vol-2-issue-4-1393-1397
Ijarcet vol-2-issue-4-1393-1397Ijarcet vol-2-issue-4-1393-1397
Ijarcet vol-2-issue-4-1393-1397Editor IJARCET
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
A comparative analysis of classification techniques on medical data sets
A comparative analysis of classification techniques on medical data setsA comparative analysis of classification techniques on medical data sets
A comparative analysis of classification techniques on medical data setseSAT Publishing House
 
Propose a Enhanced Framework for Prediction of Heart Disease
Propose a Enhanced Framework for Prediction of Heart DiseasePropose a Enhanced Framework for Prediction of Heart Disease
Propose a Enhanced Framework for Prediction of Heart DiseaseIJERA Editor
 
IRJET- Predicting Heart Disease using Machine Learning Algorithm
IRJET- Predicting Heart Disease using Machine Learning AlgorithmIRJET- Predicting Heart Disease using Machine Learning Algorithm
IRJET- Predicting Heart Disease using Machine Learning AlgorithmIRJET Journal
 
Multi Disease Detection using Deep Learning
Multi Disease Detection using Deep LearningMulti Disease Detection using Deep Learning
Multi Disease Detection using Deep LearningIRJET Journal
 
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...IRJET Journal
 
Disease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big DataDisease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big DataCSEIJJournal
 
Disease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big DataDisease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big DataCSEIJJournal
 

Similar to NAD Vs MapReduce (20)

Framework for efficient transformation for complex medical data for improving...
Framework for efficient transformation for complex medical data for improving...Framework for efficient transformation for complex medical data for improving...
Framework for efficient transformation for complex medical data for improving...
 
Analysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetAnalysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease Dataset
 
IRJET-Survey on Data Mining Techniques for Disease Prediction
IRJET-Survey on Data Mining Techniques for Disease PredictionIRJET-Survey on Data Mining Techniques for Disease Prediction
IRJET-Survey on Data Mining Techniques for Disease Prediction
 
Assigning Scores For Ordered Categorical Responses
Assigning Scores For Ordered Categorical ResponsesAssigning Scores For Ordered Categorical Responses
Assigning Scores For Ordered Categorical Responses
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approach
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
 
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop ClusterIRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
 
Disease prediction in big data healthcare using extended convolutional neural...
Disease prediction in big data healthcare using extended convolutional neural...Disease prediction in big data healthcare using extended convolutional neural...
Disease prediction in big data healthcare using extended convolutional neural...
 
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
Ijarcet vol-2-issue-4-1393-1397
Ijarcet vol-2-issue-4-1393-1397Ijarcet vol-2-issue-4-1393-1397
Ijarcet vol-2-issue-4-1393-1397
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
A comparative analysis of classification techniques on medical data sets
A comparative analysis of classification techniques on medical data setsA comparative analysis of classification techniques on medical data sets
A comparative analysis of classification techniques on medical data sets
 
Propose a Enhanced Framework for Prediction of Heart Disease
Propose a Enhanced Framework for Prediction of Heart DiseasePropose a Enhanced Framework for Prediction of Heart Disease
Propose a Enhanced Framework for Prediction of Heart Disease
 
IRJET- Predicting Heart Disease using Machine Learning Algorithm
IRJET- Predicting Heart Disease using Machine Learning AlgorithmIRJET- Predicting Heart Disease using Machine Learning Algorithm
IRJET- Predicting Heart Disease using Machine Learning Algorithm
 
Multi Disease Detection using Deep Learning
Multi Disease Detection using Deep LearningMulti Disease Detection using Deep Learning
Multi Disease Detection using Deep Learning
 
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
 
Disease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big DataDisease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big Data
 
Disease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big DataDisease Prediction Using Machine Learning Over Big Data
Disease Prediction Using Machine Learning Over Big Data
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Recently uploaded (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

NAD Vs MapReduce

  • 1. Splitting Large Medical Data Sets Based on Normal Distribution in Cloud Environment Hao Lan Zhang , Member, IEEE, Yali Zhao, Chaoyi Pang, and Jinyuan He Abstract—The surge of medical and e-commerce applications has generated tremendous amount of data, which brings people to a so-called “Big Data” era. Different from traditional large data sets, the term “Big Data” not only means the large size of data volume but also indicates the high velocity of data generation. However, current data mining and analytical techniques are facing the challenge of dealing with large volume data in a short period of time. This paper explores the efficiency of utilizing the Normal Distribution (ND) method for splitting and processing large volume medical data in cloud environment, which can provide representative information in the split data sets. The ND-based new model consists of two stages. The first stage adopts the ND method for large data sets splitting and processing, which can reduce the volume of data sets. The second stage implements the ND-based model in a cloud computing infrastructure for allocating the split data sets. The experimental results show substantial efficiency gains of the proposed method over the conventional methods without splitting data into small partitions. The ND-based method can generate representative data sets, which can offer efficient solution for large data processing. The split data sets can be processed in parallel in Cloud computing environment. Index Terms—Data splitting, cloud computing, Big Data, medical data processing, normal distribution Ç 1 INTRODUCTION THE arrival of the Big Data age urges researchers to find optimized solutions for dealing with large quantity of data, particularly for online commercial and medical data. One of the goals is to reduce high dimensional data sets to lower dimensional data sets without sacrificing its inherent properties (knowledge content and relationship). Normally, two classical methods can be considered for handling the volume of large data sets: (1) splitting data into smaller chunks so that each chunk can be analyzed independently; (2) reducing data columns (dimensionality reduction). For splitting large data sets, an enhanced UV-decomposition method can be applied. According to UV-Decomposition [1], a sample data table containing 1,000 rows and 100 columns can be decomposed into two tables based on as follows: x1;1 x1;100 . . . .. . . . . x1;000;1 x1;000;100 2 6 6 4 3 7 7 5 ¼ y1;1 y1;2 . . . . . . y1;000;1 y1;000;2 2 6 6 4 3 7 7 5 z1;1 z1;100 z2;1 z2;100 : For a very large data set, the UV-decomposition method cannot reduce the large data set, particularly for rows as shown in the above matrices. This demands a more efficient method of splitting the data sets. We observe that it is a com- mon phenomenon that online data presents a mixture of nor- mal- distribution forms in different groups as indicated in Fig. 1. The figure on the bottom presents the distribution based on an example data. The recalculated matrices can be further broken down into smaller data sets based on the num- ber of generated grouping results. In this way, the ND-based data splitting can be more efficient for very large data sets. Based on Fig. 1, the figure on top contains the informa- tion about how data points are distributed. We applied the ksdensity function to generate a probability distribution shown on the bottom figure. we can observe that this data set demonstrates similarity to a normal distribution form. We further split the data set into more ND-like-sets. These split data sets can be analyzed individually as a representa- tive of the whole data set. This method can solve the time and cost problems for dealing with large data sets and is efficient for cloud environment. The prerequisite of employing ND for resizing is that data sets need to contain at least one measurable/numeric column. The increase of more numeric data in data sets can improve the accuracy of the analytical results. The primary motivation of this research and major contributions of the proposed N (m, s2 ) method for data splitting are: The ND-based splitting method can create small and rep- resentative data sets. The representative character of ND-based data sets is reflected and evidenced by various medical and physiological data sets such as birth weight distribution, the population distribution of Body Mass Index (BMI) [2], the population dis- tribution of breast cancer patients [3], etc. This H. L. Zhang is with the Center for SCDM, NIT, Zhejiang University, , China and the Department of Computer and Information Systems, The University of Melbourne, Australia. E-mail: haolan.zhang@nit.zju.edu.cn. Y. Zhao is with the Department of Computing and Information Systems, The University of Melbourne, Australia. E-mail: yalizhao@unimelb.edu.au. C. Pang and J. He are with the Center for SCDM, NIT, Zhejiang Univer- sity, China. E-mail: {chaoyi.pang, hjysix}@gmail.com. Manuscript received 12 Jan. 2015; revised 8 June 2015; accepted 14 July 2015. Date of publication 29 July 2015; date of current version 9 June 2020. Recommended for acceptance by J. He, R. Kotagiri, and F.M. Sanchez. Digital Object Identifier no. 10.1109/TCC.2015.2462361 518 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 2168-7161 ß 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See ht_tps://www.ieee.org/publications/rights/index.html for more information. Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 2. type of data can provide relatively comprehensive information for data analysis rather than randomly splitting. The experimental analysis of the represen- tativeness of splitting data is provided in Section 4. The splitting data sets can be efficiently deployed in the cloud environment. The distributed cloud computing environment allows a system to analyse and process data locally. The ND-based data sets can be distrib- uted in any cloud frameworks in different locations. However, these distributed data sets are not always required for integration when conducting analysis since a ND-based data set represents all characteris- tics. This aspect will allow a single cloud node to pro- cess its local ND-based data sets for global analysis. Furthermore, the split data sets are suitable for paral- lel processing in Cloud environment and increase the flexibility of memory usage. The experimental results based on CloudSim has provided in Section 5. The ND-based method supports semi-structured data sets, which can be efficiently applied to medical data analysis. There is a board range of data types in the practice of medical and health sciences ranging from narrative, textual data to numerical measurements, record sig- nals, and so on [4]. Nevertheless, most medical data sets normally contain several numeric attributes. It makes the ND-based splitting process feasible and efficient. Table 1 shows the sample a set of e-health, which consists of narrative information and numeric measurements. In this sample data, both ‘age’ and ‘duration’ attributes are numeric, which form a number of ND sub-sets. These sub-sets might not be the precise ND form. These data sets are named as ND-Approximate (NDA) sets, i.e. NDA, which contain representative information compared with random splitting sets, as shown in Fig. 2 (refer to Fig. 1 for splitting). The UV decomposition methods can be applied to ND/ NDA data sets for further dimensionality reduction. The current methods for generating fast analytical results based on large medical or online data are limited due to the mas- sive data size and high velocity [6]. This paper explores the underlining issues and proposes the ND-based split- ting method for fast data processing and analysis in cloud environment. This paper is organized as follows: the next section reviews the previous work on large medical and online data processing as well as the related methods. Section 3 introdu- ces the ND method for large data splitting, which can pro- vide relatively small and inclusive data sets for analysis. Section 4 illustrates the ND-based medical data storage in cloud systems. Section 5 provides the experimental results. The final section concludes the research findings. 2 Related Work The data splitting and partition methods have been studied extensively. Basically, the previous work focus on two aspects of data splitting, which include the algorithmic aspect for data splitting and the system modelling aspect for data file splitting and partition. 2.1 Related Data Splitting Algorithms Baxter et al. [7] proposed the systematic data splitting method, in which every kth sample from a random starting point is selected to form the training, test and validation datasets. In this method, data sets are first ordered in TABLE 1 Example Medical Data for Drug Test Results [5] Age Marital Status Duration Education 30 MARRIED 79 PRIMARY 33 MARRIED 220 SECONDARY 35 SINGLE 185 TERTIARY 30 MARRIED 199 TERTIARY 59 MARRIED 226 SECONDARY 35 SINGLE 141 TERTIARY 36 MARRIED 341 TERTIARY 39 MARRIED 151 SECONDARY 41 MARRIED 57 TERTIARY 43 MARRIED 313 PRIMARY 39 MARRIED 273 SECONDARY 43 MARRIED 113 SECONDARY 36 MARRIED 328 TERTIARY 20 SINGLE 261 SECONDARY 31 MARRIED 89 SECONDARY Agei MARRIEDi DURATIONi EDUCATIONi Fig. 1. The distribution of a large data set (CASP F2 Data). Fig. 2. A small sample ND-based data sets. ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 519 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 3. increasing values along the output variable dimension. This procedure requires huge time and memory, which is appar- ently not feasible for big data processing. May et al. [8] utilize the Self Organizing Maps (SOM) method for generating smaller data sets, i.e. dimensionality reduction, which primarily considers the columns reduction of a large data set [9], [10]. Most of SOM has been employed to a fixed window if it is used for online clustering for data streams [10], [11], [12]. SBSS-N data splitting approach [13] further extends SOM method through incorporating uni- form random intra-cluster sampling process for data split- ting. Neyman allocation rule [14] is adopted in this process, which is expressed as: nm ¼ Nmsm PM i¼1 Nisi n N ; (1) where N is the size of the dataset, n is the required sample size, Ni is the size of stratum i, si is the intra-stratum multi- variate standard deviation of stratum i. Current SOM-based solutions have certain limitations in generating fast analytical results comparing with K-means and other clustering methods. Like other clustering meth- ods, the accuracy of self-learning of SOM is highly depen- dent on the initial conditions: initial weight of the map, sequence of training vector, etc. A possible method is to select the initial weights from the same space spanned by the linear principal component. 2.2 Related System Modelling for Data Splitting and Partition Existing techniques for solving large data processing prob- lems have been applied to several notable models including Google File Systems (GFS), Map-Reduce model, etc. These techniques mainly concern file splitting using optimized file storage frameworks. For instance, a large volume data file can be split to smaller files in GFS [15], [16]. GFS model can provide efficient, reliable access to data using large clusters of commodity hardware, which is able to accommodate large data sets processing. In this model consists of a Master sever and a large number of chunk severs. Files are split into file chunks that are stored in chunk sever. MapReduce model employs Map () and Reduce () functions that divide a task into several smaller tasks and assemble the final results. The MapReduce model generates a number of key values. Map (k1, v1) ! list (k2, v2); Reduce (k2, list (v2)) ! list (v3). Some research- ers believe that MapReduce model can solve large data set processing problems to certain extent [15].These methods basically focus on data sets decomposition using key-value or file splitting methods, which do not consider providing inclusive information in a data set. Therefore, a fast analy- sis based on a number of representative data sets is not available in GFS and MapReduce models. In other words, GFS and MapReduce models cannot efficiently handle a fast inquiry based on a number of small data sets due to the lack of representative information. Instead of focusing on file system framework design, this paper primarily focuses on converting high dimensional data sets into several low dimensional data sets through incorpo- rating the enhanced N (m, s2 ) method into cloud environment. 3 THE ND-BASED METHOD FOR SPLITTING LARGE DATA 3.1 Data Observation While observing various large data sets including bank data (obtained from the World Bank’s 2013 open source data) [17], superstore sales data (obtained from Tableau’s 2012 open source data)[18], personal tax data (obtained from Australian 2003 online open source data), online transaction data (obtained from an anonymous travel agency’s 2012 data), and e-health data (obtained from Diagnostic Wiscon- sin Breast Cancer Database)[3]. Fig. 3a 3b, 3c, and 3d (eliminating specific labels and details) indicate that the observed large data sets contain a number of sub data sets. Each sub data set form normal dis- tribution feature. Most random data sets omit some particular attributes or features, such as dates, gender, etc., normally contain a number of sub data sets. Each sub data set has its own group feature. In this paper we consider the group feature as the normal distribution form and name such sub data set as ND sub-data sets or simply ND sets. In other words, a collection of data forms normal distribution would be con- sidered as a sub data set. The reason for adopting normal distribution to determine a sub data set is that this distribu- tion form widely observed in practical data sets. Each ND set has its own uptrend, peek and downtrend, which can provide useful information for data analysis. 3.2 Sub Data Sets (ND/NDA Sets) Determination A large data set might contain a number of ND sets. Identi- fying ND sets is a crucial step for splitting large data sets. Definition 1. A collection of data in a large data set is defined as a ND set if its distribution is of the form N (m, s2 ), m is the mean or expectation of the distribution, s is its standard devia- tion and its variance is s2 . A ND set might not strictly form a standard normal distri- bution. However, a non-standard ND set still contains uptrend, peek and downtrend. These ND sets are called approx. normal-distribution ND sets or NDA sets in short. In a large set A [1, k], if (ND1[ ND2 [ . . . NDi A) and NDi has the distribution form N (m, s2 ) or NDA (refer to Proposition 2) as shown in Fig. 4a, however, A might not have an exact distribution of N (m, s2 ) after transformed as shown in Fig. 4b, however they distribute a ND-approximate form. Hereafter, A [1, k] represents the complete large data set; [1, k] is the boundary of A’s record number. Definition 2. S [a, b] A [1, k], if x2 [a, b] satisfies: fðAðxÞ; m; sÞ ¼ 1 s F AðxÞ m s ) axb a6¼b S½a; b ND: (2) This proposition can be alternatively expressed as: f AðxÞ ð Þ ¼ t ffiffiffiffiffiffi 2p p e t2ðAðxÞmÞ2 2 ) S½a; b ND: (3) As mentioned above, a NDA set might not strictly form a normal distribution (refer to Fig. 4b). Instead of using f(x) for probability distribution, the ND-based method converts the data values using ksdensity function to generate a 520 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 4. probability distribution. Therefore, a NDA data set needs to satisfy the following characteristics: 1) An NDA data set has a maximum possibility value point within 2D, and this value is considered as the identifier, i.e. p(x). 2) An NDA data set is basically practical data; therefore it refers to Chebyshev Inequality [19] in a certain way that is more than 89 percent data lay within 3 s. 3) An initial threshold mr (0 mr 1) is assigned to determine maximum tolerance for the number of data items in a NDA set does not form N(m, s2 ), which lie out of the normal distribution curve and called disqualified data items. For instance, mr ¼ 0.3 means that the NDA set contains 30 percent of disqualified data items. In this paper, a quick deter- mination method is deployed for the identification of disqualified data item. Definition 3. In a large data set A [1, k] of k data elements, we choose a contiguous subset S [pl, pr] starting at position pl and choose position pr such that jðplþprÞ 2 pj D where A(p) denotes the elements with maximum possibility in the subset S [c - D, c þ D], where c ¼ ðplþprÞ 2 ¼ pl þ B 2, where B ¼ k/m and m is the number of total partitions. Proposition 1. S [a, b] A [1, k], A(p) 2 S, if S [a, b] satisfies: A(p) ¼ Max(S), c a þ b ð Þ=2 , c p j j D; and if FðSÞ ð1mrÞ 1 s ffiffiffiffi 2p p b a expð c2 2 Þdx, mr is the adjusting parameter, then S [a, b] is a NDA set. The initial adjusting parameter c and adjusting factor mr can be adjusted during the S [a, b] selection process. The determination of A(p) is adjusted based on c. In this paper, initial c is selected as (a þ b) / 2 when fðpÞ Pb i¼a fðxÞ= ðb aÞ and the maximum A(p) has c p j j D within S [a, b]; then p is selected as the real center for the first ND/NDA set. D is calculated based on total data sets and m. Based on Proposition 1 and Definition 3, we modified the determination process for identifying ND/NDA sets in the following algorithm. Algorithm 1 deploys m to determine the total number of the splitting data sets. The adjustable value m denotes the ini- tial number of the splitting data sets, which is pre-defined. Fig. 4. The transformed data distributes N(m, s2 )-approx. Fig. 3. The division of large data sets (example illustration). ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 521 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 5. During the splitting process, the preliminary boundary of the first splitting set is predetermined. The boundary of the first set normally starts from 1 to k / m. The adjusting process for m is based on the center point with maximum probability and its distance to k / 2m, i.e. D. m can be increased when the center point c is on the left side of (pr – pl) / 2, which means D is negative to predefined J b). That will accumulatively cre- ate one or several extra data sets instead of predefined m. In the same way, m can be reduced when D is positive. The final splitting sets could be variant to predefined m. Algorithm 1. NDA sets Determination Process Input: Dataset A; // with k data points; m; // number of initial partitions; D; // is the acceptable distance between center and // the max point in the subset; Output: ND // list of partitions; 1: B ¼ k / m // size of each partition; 2: ND ¼ [ ]; // initialize partitions to empty; 3: J ¼ 1; // partition number; 4: pl ¼ 1; // left position of the current partition; 5: pr ¼ B; // right position of the current partition; 6: c ¼ B / 2; // the center of the partition; 7: LP ¼ [ ]; // List of the unsatisfied max points; 8: cl ¼ c – D, cr ¼ c þ D; 9: p ¼ MaxPoint (A, cl, cr, LP); // Find the position of 10: // the max probability between positions pl and pr 11: // (inclusive) excluding the elements in the list LP; 12: while pr k 13: while ( j p – c j D) 14: LP ¼ [LP, p]; 15: p ¼ MaxPoint (A, cl, cr, LP); 16: end 17: pr ¼ pr þ p – c; 18: ND ¼ [ND, [J, pl, pr]]; 19: J ¼ J þ 1; 20: LP ¼ [ ]; 21: pl ¼ pr þ 1; // Set next partition left pointer; 22: pr ¼ pl þ b; // Set next partition right pointer; 23: c ¼ (pl þ pr) / 2; 24: cl ¼ c – D, cr ¼ c þ D; 25: if pr k 26: pr ¼ k; 27: end 28: end We utilize mr for adjusting m value, which can be dynam- ically applied in the splitting processing or re-applied to the re-splitting process. The experiments conducted in Section 4 applies mr to the re-splitting process, which adjusts m after the splitting mi0 ¼ xi k 2m k; (4) mr ¼ P umi0 k k m ¼ xi m k 1 2 ; (5) where xi is the position of the selected maximum point in a boundary minus its left neighbor’s upper bound. ( Px i k=2m) indicates the size of the split data sets, which is based on the accumulated deviation of the selected maxi- mum points from the middle line (left neighbors’ upper bounds þ k/2m). mr is a ratio between the actual split data size and pre-defined k / m. Adjusting m ¼ m mr in the re-splitting process. For the dynamic adjusting process, we divide k into several sets successively. The mr value of the previous set will be applied to the successive set until all sets are adjusted. After splitting, the sample data set A [1, 1,000] can be converted to several smaller ND/NDA data sets as below x1;1 x1;100 . . . .. . . . . x1;000;1 x1;000;100 2 6 6 4 3 7 7 5 ¼ x1;1 x1;100 . . . .. . . . . xi;1 xi;100 2 6 6 4 3 7 7 5 [ . . . [ xj;1 x1;100 . . . .. . . . . x1;000;1 x1;000;100 2 6 6 4 3 7 7 5: These ND/NDA data sets basically reduce the size of a large data set. It is efficient when the data analytical pro- cesses are based on data sets with large volume and arriving in short period time. The two-stage model in this paper aims to reduce both data size (record numbers) and dimen- sions. Smaller data sets, such as divided ND/NDA data sets, can greatly improve time consumption for SOM-based dimensionality reduction. Previous research showed that different dimensionality reduction methods are seriously limited to handle large datasets due to their computational complexities and memory requirements [10]. The SOM- based dimensional reduction for large data sets will be introduced in the next section. 4 SPLITTING DATA SETS BASED ON THE ND/NDA METHOD 4.1 Data Sources and Configurations Three medical data source files have been processed based on the N (m, s2 ) method. These data sources are listed as below. 1) Diabetic data source [20]: the dataset represents 10 years (1999-2008) of clinical care at 130 US hospi- tals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. 2) CASP data source [21]: this is a data set of physico- chemical properties of protein tertiary structure pro- vided by ABV—Indian institute of information technology management. 3) Health data source: this data source contains the patient information and their hospital records pro- vided by the online sources, which record patients’ marriage status, employment status, in-hospital duration, etc. Propositions 1 and 2 indicate the basic method of split- ting large files based on the ND method. We conducted a set of experiments to generate and analyze the split files based on Algorithm 1. The experimental settings are: 522 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 6. CPU Intel(R) Core(TM), CPU @ 2.10 GHz RAM 4.00 GB OS Windows 8, 64bits Based on Algorithm 1, we select m as initial number of data sets and a number of boundaries are created based on k / m. The maximum value less than D distance to the middle line will be selected as m within the initial boundary. The middle line is k / 2m þ left neighbor’s lower bound. Iteratively, all the data sets’ boundaries will be determined. m is adjusted based on its deviation from the middle line. A large file can be split into m or adjusted m number of groups, i.e. [0, mi], . . . [m j, k], k is the last data item in the data set. The maximum value xi of each group and its position pi will be stored in an array. This process will terminate when xi reaches the boundary 0 and k. We conducted a set of experiments based on medical data sources. These data files are Diabetic data sets (101,767 records, 50 attributes) [20], CASP data sets (45,731 records, nine attributes) [21] and Health data sets (4,522 records). 4.2 Splitting Data Sets Based on the three data sources, we obtained the pi arrays. The experimental results and experimental settings are listed as follows: Dataset 1- Diabetic_data_source: m indicates the mean of the splitting ND/NDA data sets in Diabetic. s indicates the standard deviation of the splitting ND/NDA data sets. The initialized k /m¼ 7,269. This table is based on “Diabetic Attribute: Diag_1”. Dataset 2 - CASP_data_source: m and s values are the same as described in Dataset 1. The initialized k /m ¼ 6,532. We only illustrate attributes F1, F2, F8 due to space limitation. Dataset 3-Health_data_source: m and s values are the same as described in Dataset 1. The initialized k /m ¼ 904. Based on “Health Attribute: Balance”. 4.3 Splitting Data Observation The splitting results of the above three medical data sets are illustrated as below. We observe the diabetic data sets, which has been split into 13 files as shown in Fig. 5. We further evaluate the CASP data set. However, due to space limitation, we selectively illustrate the distribution form of F8 attributes for the splitting results observation. In this case, the KS Density are compared and evaluated as shown in Fig. 6. The Health data set and its splitting data sets based on the balance attribute are evaluated. The Health data sets has the smallest data records among the three source data, however, the splitting results are obvious in terms of form ND shapes as shown in Fig. 7. 4.4 Analysis ND-based data sets can provide comprehensive informa- tion. The inclusiveness of ND-based data sets is particularly efficient for medical data sets according to real data obser- vation [3], [20], [21]. The inclusiveness of ND-based data sets can improve the performance of analyzing large vol- ume data sets. In many cases, a quick inquiry can be based on a split data set rather than going through the full data set. The advantages of the inclusive attribute of ND-based data sets are: 1) Improving data analytical time efficiency: ND-based data sets support a quick inquiry to be conducted TABLE 2 Boundary and Values of Splitting Diabetic Data Sets m s NDA boundary NDA size 475.66 207.81 [1, 7,049] 7,049 485.09 213.41 [7,050, 14,103] 7,054 484.74 212.55 [14,104, 21,173] 7,070 481.79 212.94 [21,174, 28,232] 7,059 484.71 210.66 [28,233, 35,330] 7,098 488.83 211.49 [35,331, 42,398] 7,068 483.28 209.48 [42,399, 49,453] 7,055 490.41 219.02 [49,454, 56,515] 7,062 485.78 211.93 [56,516, 63,564] 7,049 493.44 216.27 [63,565, 70,617] 7,053 490.75 212.98 [70,618, 77,666] 7,049 488.74 213.07 [77,667, 84,729] 7,063 491.41 212.60 [84,730, 91,783] 7,054 486.57 207.08 [91,784, 98,837] 7,054 487.08 211.66 [98,838, 101,766] 2,929 TABLE 3 Boundary and Values of Splitting Casp Data Sets (f1,8) CASP Attribute: F1 m s NDA boundary NDA size 9,849.98 3,959.98 [1, 6,179] 6,179 9,820.39 4,030.81 [6,180, 13,127] 6,948 9,896.54 4,116.42 [13,128, 19,224] 6,097 9,837.24 4,074.83 [19,225, 25,473] 6,249 9,886.61 4,083.54 [25,474, 32,395] 6,922 9,909.50 4,049.44 [32,396, 39,246] 6,851 9,900.65 4,089.03 [39,247, 45,730] 6,484 CASP Attribute: F8 m s NDA boundary NDA size 68.74 55.94 [1, 5,910] 5,910 69.92 56.40 [5,911, 11,820] 5,910 70.47 56.52 [11,821, 17,713] 5,893 70.11 57.03 [17,714, 23,721] 6,008 70.52 56.37 [23,722, 29,617] 5,896 70.29 56.48 [29,618, 35,543] 5,926 69.08 56.63 [35,544, 41,493] 5,950 70.94 56.55 [41,494, 45,730] 4,237 TABLE 4 Boundary and Values of Splitting Health Data Sets m s NDA boundary NDA size 1,425.73 2,568.72 [1, 815] 815 1,434.42 2,767.34 [816, 1,645] 830 1,288.30 2,688.25 [1,646, 2,466] 821 1,430.35 3,119.67 [2,467, 3,310] 844 1,509.49 3,840.63 [3,311, 4,129] 819 1,474.81 2,746.92 [4,130, 4,521] 392 ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 523 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 7. Fig. 5. Splitting results of diabetic data sets based on Diag1 attribute (a-o). 524 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 8. Fig. 6. Splitting results of CASP F8 data sets—KSDensity (a-i). ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 525 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 9. based on a split data set instead of the full data set. This feature greatly improves the data analyzing time. 2) Improving the accuracy of analytical process: ND-based data sets contain abundant inclusive information. Therefore, a quick inquiry or data processing based on a split data set is considered as sufficient as the full data set. In order to verify the inclusiveness of ND-based data sets, we conducted the following analysis based on the split- ting data sets and the original data sets of the three data sources. We observe the m and s values within a boundary and compare these values with the original data. Dataset 1 - Diabetic_data_source: In Dataset 1, we compare the m and s values of the full data set with the corresponding values of the split data sets. The values contain in the split data sets are very close to the original data set as show in Table 5. The m value difference between the original data set and the average split data sets is 0.0216, which is approximately 0.0044 percent var- iation of the original data. The s value difference between the original data set and the average split data sets is 0.1095, which is approximately 0.0516 percent variation of the original data value. Fig. 8 shows the values of the original data set and the values of each split set. Fig. 7. Splitting results of health data based on Balance (a-g). Fig. 8. Variation of original data values and split data values based on diabetic source (mean, standard deviation). 526 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 10. Fig. 9 shows that the ND-like distribution of orig- inal data set (a) is very similar to splitting data sets (b, c and d). Overall, the split data sets contain rela- tively comprehensive information, and they can be representative for the full data set. Dataset 2 - CASP_data_source: In dataset 2, we only illustrate the result of CASP_F8 due to the limited space. The comparison between the full data set and the average split sets is listed as shown in Table 6. The values of the split data sets have very slight variation compared with the original full data set. The m value difference between the original data set and the average split data sets is 0.034, which is approximately 0.0486 percent variation of the origi- nal data. The s value difference between the original data set and the average split data sets is 0. Each split data set is approximate 1/8 size of the original data set. Fig. 10 shows the values of the original data set and the values of each split set. The m and s values of the split data sets have very small variations compared with the full data set. Therefore, the split data sets based on CASP_F8 can be representative for the full data set. Dataset 3 - Health_data_source: Dataset 3 is split into 5 individual data sets. The comparison between the full data set and the average split sets is listed in Table 7. The variations of the split health (balance attribute) data sets are slightly larger than Diabetic-Diag1 and CASP-F8. We noticed that the average size of each split data set is small, which is the major factor for increasing the variations. The variations of the s values, which we consider as the crucial criteria, are stable as shown in Fig. 11. The difference between them is 54.04. It is approximately 1.79 percent vari- ation of the original data. Fig. 12 shows that the ND-like distribution of splitting data set (b, c and d) have slightly increased their s values compared with the original data set (a). However, the varia- tion of s is reasonably small (1.79 percent of original data), and m variation is very small (0.315 percent of original data). Therefore, they can be representative for the full data set in a certain degree. 5 RESIZED MEDICAL DATA SIMULATION IN CLOUDSIM The Cloud computing paradigm provides an efficient solu- tion for utilization of computing infrastructures through dynamic allocation of resources in a pay as you go model, which allows users to effectively utilize resources and save Fig. 10. Variation of original data values and split data values based on CASP source (mean, standard deviation). TABLE 7 Split and Original Data Sets Comparison- Health_Balance Original Health_Balance Average of Split Sets m 1,422.7 1,427.18 s 3,009.3 2,955.26 Size 4,521 753.5 TABLE 6 Split and Original Data Sets Comparison- CASP_F8 Orignial CASP_F8 Average of Split Sets m 69.975 70.009 s 56.49 56.49 Size 45,730 5,716.25 TABLE 5 Split and Original Data Sets Comparison- Diabetic_Diag1 Original Diabetic_Diag1 Average of Split Sets m 486.5304 486.552 s 212.3062 212.1967 Size 101,766 6,784.4 Fig. 9. ND-like form of original data values and split data values based on diabetic source. Fig. 11. Variation of original data values and split data values based on health source (mean, standard deviation). ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 527 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 11. costs. The ND-based method is efficient to be applied in Cloud environments as Cloud solutions provide abundant support on distributed and parallel computing [22]. The ND- based method is efficient to be applied in Cloud environ- ments since Cloud solutions provide abundant support on distributed and parallel computing [23], [24]. The ND-based method allows files to be processed in a distributed and par- allel manner, which can improve the performance and reduce the costs. To evaluate the performance of the ND-based method in Cloud environments, we conduct experiments based on CloudSim [25]. CloudSim is a discrete event simulator that enables modeling of Cloud environment with on-demand resource provision in a repeatable and controllable way. We compared performance difference of resource configura- tion, response time, resource cost, and flexibility between applying ND-based method to process the split data sets and not applying ND-based method to process the original data sets. The flexibility refers to the failure rate of data processing under limited resources. The higher failure rate indicates the less flexibility of the data processing method. 5.1 Resource Configuration—CloudSim We simulate a datacenter that consists of 500 physical nodes. Each node has 64 CPU cores, 300 GB memory, 1 TB storage, and 10 GB/s network bandwidth. Each CPU core has performance equivalent to 4,000 Million Instructions per Second (MIPS). We consider five types of memory opti- mized Amazon EC2 VMs, which are: r3large, r3xlarge, r32xlarge, r34xlarge, and r38xlarge [26] as listed in Table 8. VMs in Amazon EC2 are charged in an hourly basis. 5.2 Workload Data The workload generated in this study is based on the Big Data Benchmark [27], which executes queries using differ- ent data analytic frameworks on large data set in Amazon EC2 and details the VM configuration and query processing time. Based on the information, we model resource require- ments for data processing tasks under different resource configurations in CloudSim. To compare performance difference between processing split data sets and original data sets, we also need to take data splitting into account, as this operation requires extra VM resources and time. We use splitting time of CASP data sets as the experimental data. However, the size of CASP data sets is not sufficient large to show the obvious perfor- mance benefits of applying ND-based method. Therefore, the resources required by splitting original data sets are enlarged to a suitable scale according to the original file size. For ND-based data splitting application, we consider sin- gle core splitting and multicore splitting. For multi-core splitting, we model contentions for file assess as a penalty in performance speed up for the extra cores. This means that only a fraction of the execution time of the splitting application can benefit from the extra cores. In this experi- ment, we set this fraction to 50 percent. We model the resource contention in a conservative way. If a better method is chosen to provide the data sets to the threads of multicore applications, the performance could be signifi- cantly better. Since resource configuration and data splitting time are known, we model resource requirements for file splitting tasks for single core application and multicore application in CloudSim. As the performances of data processing and data splitting vary, which do not consume the same amount of resources each time, we simulate 10 percent performance variation by introducing resource variation coefficient, which is a random value generated from a Uniform Distribution with upper bound 1.1 and lower bound 0.9. Based on the above settings, we modeled three types of tasks in CloudSim: (1) the task for splitting, (2) the task for processing split data sets, (3) the task for processing original (non-split) data sets. 5.3 Experimental Results The first part of this study aims to evaluate performance dif- ferences of resource configuration, resource cost, and response time for processing split data sets and non-split data sets. We have five resource configurations for five types of VM in our experiment. For each VM configuration, we provision three VMs and run one type of tasks on each VM. The first VM is utilized to run the data splitting tasks, the sec- ond VM is used to run the data processing tasks for split data sets, and the third VM is utilized to run the data processing task for the original data sets. We assume our data analytic application adopts single core in-memory processing. We obtain results of resource configuration, VM execu- tion time, and resource cost from the study, as shown in Tables 2, 3, 4, and 5. From these results, we can see that there are only four types of resource configurations avail- able for processing the original file. VM configuration of r3.large is not available as it has 15 GB memory, which Fig. 12. ND-like form of original data values and split data values based on health source. TABLE 8 VM Configuration Information VM type vCPU ECU Memory Storage Cost r3.large 2 6.5 15 32 SSD $0.175 /Hour r3.xlarge 4 13 30.5 80 SSD $0.350 /Hour r3.2xlarge 8 26 61 160 SSD $0.700 /Hour r3.4xlarge 16 52 122 320 SSD $1.400 /Hour r3.8xlarge 32 104 244 640 SSD $2.800 /Hour 528 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 12. cannot fit 30 GB data in memory for processing. Five types of resource configuration are all available for processing split data sets as smaller data sets can be easily fit in mem- ory for processing. We can see that splitting data sets using single core appli- cations does not have performance advantages (see Table 9). In the contrary, it takes longer response time and incurs higher costs. For the cheapest resource configuration using r3.xlarge, the overall cost and processing time of single core splitting is $2.100, 5.015 hours while for multi-core splitting is $1.050, 2.383 hours. Multi-core processing is 50 percent cheaper than single core processing and provides 2.10 times speed up (see Table 10). We also notice that processing split data sets not only generates less resource cost but also reduces data processing time. For cost minimized resource configuration of data processing, we can see that the overall cost and time for processing original data sets using r3. xlarge is $9.100, 25.997 hours, while for processing split data sets using r3.xlarge is $2.800, 7.721 hours. Processing split data sets is 69.2 percent cheaper and offers 3.37 times speed up. We use cost-minimized resource configurations with minimized execution time to compare the performance dif- ferences between applying ND-based method and not applying the method. For applying ND-based method, we select VM r3.xlarge to execute data splitting tasks with multi-core splitting and select VM r3.xlarge to process split data sets in parallel. The overall cost and execution time of this configuration is $3.850 and 10.104 hours. For processing original data sets, we select VM r3.xlarge as the resource configuration, which costs $10.150 and takes 28.380 hours. The result shows that applying ND-based method performs significantly better as it not only gives 62.1 percent cost saving but also leads to 2.81 times speed up. Tables 9-12 show the experimental results. The second part of this study aims to analyze the flexibil- ity of processing the split data sets and the original data sets. Data processing applications adopt in-memory processing, which require the whole data sets to be fitted in memory for processing. In this way, when the available memory in a VM is less than the size of data sets, the data sets cannot be processed successfully and a failure is gener- ated. A comparison has been conducted to analyze failures of processing split data sets and original data sets as an indi- cator for flexibility analysis. The experiment scenario is that VMs have running tasks; therefore, their capacities are not fully available. We con- sider memory as the resource bottleneck for data processing and introduce the Resource availability coefficient (RAC) to model the available percentage of memory in VMs. RAC is a random value generated using Uniform Distribution with upper bound 1 and lower bound 0. We conduct 10 experi- ments and each has a different RAC coefficient for five types of VM configuration. For each VM configuration, we create two VMs to execute the original data sets and split data sets. For split data sets and original data sets, 50 data processing tasks are executed accordingly. We obtain failure rates of the 10 experiments, which are arranged in an ascending order of the RAC value, as shown in Fig. 13. TABLE 11 Resource Utilization of Processing Original File Resource configuration Execution time VM Cost r3.large null null r3.xlarge 25.997 hours $9.100 r3.2xlarge 25.840 hours $18.200 r3.4xlarge 27.260 hours $39.200 r3.8xlarge 26.483 hours $75.600 TABLE 12 Resource Utilization of Processing Partitioned Files Resource configuration Execution time VM Cost r3.large 15.309 hours $2.800 r3.xlarge 7.721 hours $2.800 r3.2xlarge 4.103 hours $3.500 r3.4xlarge 4.045 hours $7.000 r3.8xlarge 4.018 hours $14.000 Fig. 13. Resource utilization of partitioning original file using multi core application. TABLE 10 Resource Utilization of Partitioning Original File Using Multi Core Application Resource configuration Execution time VM Cost r3. large null null r3.xlarge 2.383 hours $1.050 r3.2xlarge 1.224 hours $1.400 r3.4xlarge 0.590 hours $1.400 r3.8xlarge 0.309 hours $2.800 TABLE 9 Resource Utilization of Partitioning Original File Using Single Core Application Resource configuration Execution time VM Cost r3.large null null r3.xlarge 5.015 hours $2.100 r3.2xlarge 4.952 hours $3.500 r3.4xlarge 4.420 hours $7.000 r3.8xlarge 4.250 hours $14.000 ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 529 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 13. We compare the failure rate of processing split data sets and original data sets. Results show that 30 failures out of 50 tasks are generated during processing original data sets while six failures out of 50 tasks are generated during proc- essing split data sets. Applying ND-based data processing method performs significantly better as the failure rate is 12 percent comparing to the failure rate of 60 percent for proc- essing the original data sets. For VMs with less available resources, as indicated by RAC value from 0.06 to 0.25, failures are generated for both split data sets and original data sets. When RAC is 0.06, five failures are generated for processing original data sets while three failures are generated for processing split data sets. When available VM resource increases, indicated by increas- ing RAC value from 0.13 to 0.25, the number of failures decreases for both original data sets and split data sets. When we further increase the available resources, indicated by RAC value from 0.31 to 0.99, the number of failures fur- ther decreases for processing original data set while no fail- ures are generated for processing split data sets. The advantage of applying ND-based method is obvious as proc- essing split data sets generates less number of failures com- paring to processing original data set for all RAC values. It is to be noted that our definition of failure for process- ing split data sets is rigorous. Even if the processing of one split data set fails, the processing of the whole data sets fails, regardless of the other successfully processed split data sets. However, we can partially process split data sets that can be fitted in memory of under-utilized VMs for higher cost sav- ing. In this way, it can better utilize remaining resources on under-utilized VMs as smaller split data sets requires less resources so that they have more chances to be successfully processed. The ND-based method only provisions new VMs for the rest of split data sets that are not successfully executed, which greatly satisfies cost-saving requirements of cloud users. 6 CONCLUSION A novel ND-based method has been introduced in this paper for splitting large data sets. The rapid growth of data has caused the problem for efficient data processing and analysis due to the enormous data size in the “Big Data” era. The proposed ND-based method can efficiently split large data sets, particularly for medical and e-health data. We observed that most medical data sets normally contain several numeric attributes, which makes the ND-based splitting process feasible and efficient. Furthermore, the split data sets are suitable for parallel processing in cloud environment, which makes real-time data analysis feasible for emergency medical procedures. The ND-based split data sets have been tested for their representative capability based on the inclusive analysis provided in Section 4. This feature allows systems return query results based on a sin- gle split data set since the inclusive character of ND-based split data sets. The future work of this study will further conduct experi- ments in Cloud sim to analyze the splitting algorithm by exploring whether the splitting process can give advantages for data analysis when parallel data processing is deployed. Furthermore, we will investigate the effects that algorithm parameters have on its performance, such as, selecting suit- able mr and D values for generating more efficient and accu- rate split data sets. ACKNOWLEDGMENTS The authors would like to express our gratitude to Professor Rao Kotagiri from the University of Melbourne for his valuable comments and suggestion and The CLOUDS Labo- ratory at the University of Melbourne for providing Cloud- Sim platform and experimental environment for simulation, to student members from Sun Yat-Sen University for con- ducting the splitting data evaluation. They would thank the Center for machine learning and intelligent systems at Uni- versity of California, Irvine, for providing open medical data source. This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LY 14G010004, Zhejiang Provincial Philosophy and Social Science Project under Grant 11JCSH03YB, and National Natural Science Foundation of China under Grant 61272480. REFERENCES [1] G. W. Stewart, “On the early history of the singular value decom- position,” SIAM Rev., vol. 35, no. 4, pp. 551—566, 1993. [2] A. D. Penman and W. D. Johnson, “The changing shape of the body mass index distribution curve in the population: implica- tions for public health policy to reduce the prevalence of adult obesity,” Pre Chronic Disease, vol. 3, no. 3, p. A74, 2006. [3] W. H. Wolberg, W. N. Street, and O. L. Mangasarian, Breast Cancer Wisconsin (Diagnostic) Data Set, Nov. 2014. [4] E. H. Shortliffe and G. O. Barnett, “Biomedical data: Their acquisi- tion, storage, and use,” in Biomedical Informatics. New York, NY, USA: Springer, 2006, pp. 46–79. [5] M. G. Walker and J. Shi, Statistics Consulting—ANOVA Example. Walker Bioscience, Nov. 2014. [6] A. Labrinidis and H. V. Jagadish, “Challenges and opportunities with big data,” Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032 – 2033. 2012. [7] C. W. Baxter, S. J. Stanley, Q. Zhang, and D. W. Smith, “Developing artificial neural network process models: A guide for drinking water utilities,” in Proc. CSCE, 2000, pp. 376–383. [8] R. J. May, H. R. Maier, and G. C. Dandy, “Data splitting for artifi- cial neural networks using SOM-based stratified sampling,” Neu- ral Netw., vol. 23, pp. 283–94. 2010. [9] T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biological Cybern., vol. 43, no. 1, pp. 59–69, 1982. [10] K. Tasdemir and E. Merenyi, “SOM-based topology visualisation for interactive analysis of high-dimensional large datasets,” Mach. Learning Rep., vol. 1, pp. 13–15, 2012. [11] S. Rani and G. Sikka, “Recent techniques of clustering of time series data: A survey,” Int. J. Comput. Appl., vol. 52, no. 15, pp. 1–9, 2012. [12] P. Russom, “Big data analytics,” TDWI Best Practices Report (4th Quarter). Renton, WA, USA: TDWI Press, 2011, pp. 3–34. [13] W. Wu, H. R. Maier, G. Dandy, and R. May, “Exploring the impact of data splitting methods on artificial neural network models,” in Proc. 10th Int. Conf. Hydroinformat.: Understanding Changing Climate Environ. Finding Solutions, 2012, pp. 1–8. [14] J. Neyman, “On the two different aspects of the representative methods. The method stratified sampling and the method of purposive selection,” J. Roy. Statistical Soc., vol. 97, pp. 558–606, 1934. [15] J. Dean and S. Ghemawat, “MapReduce: Simplied data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008. [16] F. Marozzo, D. Talia, and P. Trunfio, “P2P-MapReduce: Parallel data processing in dynamic cloud environments,” J. Comput. Syst. Sci., vol. 78, no. 5, pp. 1382–1402, 2012. [17] The World Bank Group, World Development Indicators, World Bank Publication, Dec. 2013. 530 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.
  • 14. [18] Tableau Software, Sample Superstore Sales, Jan. 2014. [19] R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 5th ed. Englewood Cliffs, NJ, USA: Prentice Hall Press. 2007. [20] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore, “Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records,” BioMed. Res. Int., vol. 2014, p. 11, 2014. [21] P. S. Rana, “Physicochemical properties of protein tertiary struc- ture data set,” ABV -Indian Institute of Information Technology Management, MP, 2014. [22] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,” Futur. Gener. Comput. Syst., vol. 25, no. 6, pp. 599–616, Jun. 2009. [23] A.N. Toosi, R. N. Calheri, and R. Buyya, “Interconnected cloud computing environments: Challenges, taxonomy and survey,” ACM Computi. Surveys, vol. 47, no 1, pp. 7:1–7:47, 2014. [24] B. Hayes, “Cloud computing,” Commun. ACM, vol. 51, no. 7, pp. 9–11, Jul. 2008. [25] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. D. Rose, and R. Buyya. “CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms,” Software— Practice experience, vol. 41, pp. 23–50, 2010. [26] Amazon Web Services Inc. Amazon Elastic Compute Cloud—User Guide for Microsoft Windows, Amazon Copyright, 2014. [27] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, “A comparison of approaches to large-scale data analysis,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2009, pp. 165–178. Hao Lan Zhang received the PhD degree in computer science from Victoria University, Aus- tralia. He is currently an associate professor at NIT, Zhejiang University in China and the 151 Talented Scholar of Zhejiang Province. Prior to that, he was a research fellow and academic staff at RMIT University, Australia. He has published over forty papers in refereed journals and confer- ences. His research interests include intelligent information systems, multiagent systems, health informatics, knowledge-based systems, data mining, etc. He serves as the editorial board member and a reviewer of several journals including: IEEE Transactions on SCM, IEEE Transac- tions on Cloud Computing, The Computer Journal, Information Scien- ces, Journal of AAMAS, etc. He is a member of the IEEE. Yali Zhao received the bachelor’s degree from Harbin Institute of Technology, China, and the joint master’s degree from Harbin Institute of Technology, China, and University of Bordeaux 1, France. She is currently working toward the PhD degree at the University of Melbourne, Australia. Her research interest is SLA-based resource scheduling for big data analytic as a ser- vice in cloud computing environments. Chaoyi Pang received the PhD degree from the University of Melbourne. He is currently a full pro- fessor at NIT, Zhejiang University, and a 1,000 talented professor of Zhejiang Province. His research interests are in algorithm, data security/ privacy, access control, data warehousing, data integration, database theory, graph theory. He is a senior member of ACM. Jinyuan He received the bachelor’s and master’s degrees from Sun Yat-Sen University in 2013 and 2015, respectively. He was a research assistant in the Center for Social Computing and Data Management, NIT, Zhejiang University in 2014. He is currently working on the large data sets splitting projects. For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl. ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 531 Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.