NAD Vs MapReduce

Splitting Large Medical Data Sets Based
on Normal Distribution in Cloud Environment
Hao Lan Zhang , Member, IEEE, Yali Zhao, Chaoyi Pang, and Jinyuan He
Abstract—The surge of medical and e-commerce applications has generated tremendous amount of data, which brings people to a
so-called “Big Data” era. Different from traditional large data sets, the term “Big Data” not only means the large size of data volume but
also indicates the high velocity of data generation. However, current data mining and analytical techniques are facing the challenge of
dealing with large volume data in a short period of time. This paper explores the efficiency of utilizing the Normal Distribution (ND)
method for splitting and processing large volume medical data in cloud environment, which can provide representative information in
the split data sets. The ND-based new model consists of two stages. The first stage adopts the ND method for large data sets splitting
and processing, which can reduce the volume of data sets. The second stage implements the ND-based model in a cloud computing
infrastructure for allocating the split data sets. The experimental results show substantial efficiency gains of the proposed method over
the conventional methods without splitting data into small partitions. The ND-based method can generate representative data sets,
which can offer efficient solution for large data processing. The split data sets can be processed in parallel in Cloud computing
environment.
Index Terms—Data splitting, cloud computing, Big Data, medical data processing, normal distribution
Ç
1 INTRODUCTION
THE arrival of the Big Data age urges researchers to find
optimized solutions for dealing with large quantity of
data, particularly for online commercial and medical data.
One of the goals is to reduce high dimensional data sets to
lower dimensional data sets without sacrificing its inherent
properties (knowledge content and relationship). Normally,
two classical methods can be considered for handling the
volume of large data sets: (1) splitting data into smaller
chunks so that each chunk can be analyzed independently;
(2) reducing data columns (dimensionality reduction). For
splitting large data sets, an enhanced UV-decomposition
method can be applied.
According to UV-Decomposition [1], a sample data table
containing 1,000 rows and 100 columns can be decomposed
into two tables based on as follows:
x1;1 x1;100
.
.
. ..
. .
.
.
x1;000;1 x1;000;100
2
6
6
4
3
7
7
5
¼
y1;1 y1;2
.
.
. .
.
.
y1;000;1 y1;000;2
2
6
6
4
3
7
7
5
z1;1 z1;100
z2;1 z2;100

:
For a very large data set, the UV-decomposition method
cannot reduce the large data set, particularly for rows as
shown in the above matrices. This demands a more efficient
method of splitting the data sets. We observe that it is a com-
mon phenomenon that online data presents a mixture of nor-
mal- distribution forms in different groups as indicated in
Fig. 1. The figure on the bottom presents the distribution
based on an example data. The recalculated matrices can be
further broken down into smaller data sets based on the num-
ber of generated grouping results. In this way, the ND-based
data splitting can be more efficient for very large data sets.
Based on Fig. 1, the figure on top contains the informa-
tion about how data points are distributed. We applied the
ksdensity function to generate a probability distribution
shown on the bottom figure. we can observe that this data
set demonstrates similarity to a normal distribution form.
We further split the data set into more ND-like-sets. These
split data sets can be analyzed individually as a representa-
tive of the whole data set. This method can solve the time
and cost problems for dealing with large data sets and is
efficient for cloud environment.
The prerequisite of employing ND for resizing is that
data sets need to contain at least one measurable/numeric
column. The increase of more numeric data in data sets can
improve the accuracy of the analytical results. The primary
motivation of this research and major contributions of the
proposed N (m, s2
) method for data splitting are:
The ND-based splitting method can create small and rep-
resentative data sets. The representative character of
ND-based data sets is reflected and evidenced by
various medical and physiological data sets such as
birth weight distribution, the population distribution
of Body Mass Index (BMI) [2], the population dis-
tribution of breast cancer patients [3], etc. This
H. L. Zhang is with the Center for SCDM, NIT, Zhejiang University, ,
China and the Department of Computer and Information Systems, The
University of Melbourne, Australia. E-mail: haolan.zhang@nit.zju.edu.cn.
Y. Zhao is with the Department of Computing and Information Systems,
The University of Melbourne, Australia. E-mail: yalizhao@unimelb.edu.au.
C. Pang and J. He are with the Center for SCDM, NIT, Zhejiang Univer-
sity, China. E-mail: {chaoyi.pang, hjysix}@gmail.com.
Manuscript received 12 Jan. 2015; revised 8 June 2015; accepted 14 July 2015.
Date of publication 29 July 2015; date of current version 9 June 2020.
Recommended for acceptance by J. He, R. Kotagiri, and F.M. Sanchez.
Digital Object Identifier no. 10.1109/TCC.2015.2462361
518 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 8, NO. 2, APRIL-JUNE 2020
2168-7161 ß 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Zhejiang University. Downloaded on June 14,2020 at 12:42:51 UTC from IEEE Xplore. Restrictions apply.

type of data can provide relatively comprehensive
information for data analysis rather than randomly
splitting. The experimental analysis of the represen-
tativeness of splitting data is provided in Section 4.
The splitting data sets can be efficiently deployed in the
cloud environment. The distributed cloud computing
environment allows a system to analyse and process
data locally. The ND-based data sets can be distrib-
uted in any cloud frameworks in different locations.
However, these distributed data sets are not always
required for integration when conducting analysis
since a ND-based data set represents all characteris-
tics. This aspect will allow a single cloud node to pro-
cess its local ND-based data sets for global analysis.
Furthermore, the split data sets are suitable for paral-
lel processing in Cloud environment and increase the
flexibility of memory usage. The experimental results
based on CloudSim has provided in Section 5.
The ND-based method supports semi-structured data sets,
which can be efficiently applied to medical data analysis.
There is a board range of data types in the practice of
medical and health sciences ranging from narrative,
textual data to numerical measurements, record sig-
nals, and so on [4]. Nevertheless, most medical data
sets normally contain several numeric attributes. It
makes the ND-based splitting process feasible and
efficient. Table 1 shows the sample a set of e-health,
which consists of narrative information and numeric
measurements.
In this sample data, both ‘age’ and ‘duration’ attributes
are numeric, which form a number of ND sub-sets. These
sub-sets might not be the precise ND form. These data sets
are named as ND-Approximate (NDA) sets, i.e. NDA, which
contain representative information compared with random
splitting sets, as shown in Fig. 2 (refer to Fig. 1 for splitting).
The UV decomposition methods can be applied to ND/
NDA data sets for further dimensionality reduction. The
current methods for generating fast analytical results based
on large medical or online data are limited due to the mas-
sive data size and high velocity [6]. This paper explores
the underlining issues and proposes the ND-based split-
ting method for fast data processing and analysis in cloud
environment.
This paper is organized as follows: the next section
reviews the previous work on large medical and online data
processing as well as the related methods. Section 3 introdu-
ces the ND method for large data splitting, which can pro-
vide relatively small and inclusive data sets for analysis.
Section 4 illustrates the ND-based medical data storage in
cloud systems. Section 5 provides the experimental results.
The final section concludes the research findings.
2 Related Work
The data splitting and partition methods have been studied
extensively. Basically, the previous work focus on two
aspects of data splitting, which include the algorithmic
aspect for data splitting and the system modelling aspect
for data file splitting and partition.
2.1 Related Data Splitting Algorithms
Baxter et al. [7] proposed the systematic data splitting
method, in which every kth sample from a random starting
point is selected to form the training, test and validation
datasets. In this method, data sets are first ordered in
TABLE 1
Example Medical Data for Drug Test Results [5]
Age Marital Status Duration Education
30 MARRIED 79 PRIMARY
33 MARRIED 220 SECONDARY
35 SINGLE 185 TERTIARY
30 MARRIED 199 TERTIARY
35 SINGLE 141 TERTIARY
43 MARRIED 313 PRIMARY
20 SINGLE 261 SECONDARY
Agei MARRIEDi DURATIONi EDUCATIONi
Fig. 1. The distribution of a large data set (CASP F2 Data).
Fig. 2. A small sample ND-based data sets.
ZHANG ET AL.: SPLITTING LARGE MEDICAL DATA SETS BASED ON NORMAL DISTRIBUTION IN CLOUD ENVIRONMENT 519

increasing values along the output variable dimension. This
procedure requires huge time and memory, which is appar-
ently not feasible for big data processing.
May et al. [8] utilize the Self Organizing Maps (SOM)
method for generating smaller data sets, i.e. dimensionality
reduction, which primarily considers the columns reduction
of a large data set [9], [10]. Most of SOM has been employed
to a fixed window if it is used for online clustering for data
streams [10], [11], [12]. SBSS-N data splitting approach [13]
further extends SOM method through incorporating uni-
form random intra-cluster sampling process for data split-
ting. Neyman allocation rule [14] is adopted in this process,
which is expressed as:
nm ¼
Nmsm
PM
i¼1 Nisi

n
N
; (1)
where N is the size of the dataset, n is the required sample
size, Ni is the size of stratum i, si is the intra-stratum multi-
variate standard deviation of stratum i.
Current SOM-based solutions have certain limitations in
generating fast analytical results comparing with K-means
and other clustering methods. Like other clustering meth-
ods, the accuracy of self-learning of SOM is highly depen-
dent on the initial conditions: initial weight of the map,
sequence of training vector, etc. A possible method is to
select the initial weights from the same space spanned by
the linear principal component.
2.2 Related System Modelling for Data Splitting and
Partition
Existing techniques for solving large data processing prob-
lems have been applied to several notable models including
Google File Systems (GFS), Map-Reduce model, etc. These
techniques mainly concern file splitting using optimized file
storage frameworks. For instance, a large volume data file
can be split to smaller files in GFS [15], [16].
GFS model can provide efficient, reliable access to data
using large clusters of commodity hardware, which is able
to accommodate large data sets processing. In this model
consists of a Master sever and a large number of chunk
severs. Files are split into file chunks that are stored in
chunk sever. MapReduce model employs Map () and
Reduce () functions that divide a task into several smaller
tasks and assemble the final results. The MapReduce
model generates a number of key values. Map (k1, v1) !
list (k2, v2); Reduce (k2, list (v2)) ! list (v3). Some research-
ers believe that MapReduce model can solve large data set
processing problems to certain extent [15].These methods
basically focus on data sets decomposition using key-value
or file splitting methods, which do not consider providing
inclusive information in a data set. Therefore, a fast analy-
sis based on a number of representative data sets is not
available in GFS and MapReduce models. In other words,
GFS and MapReduce models cannot efficiently handle a
fast inquiry based on a number of small data sets due to
the lack of representative information.
Instead of focusing on file system framework design, this
paper primarily focuses on converting high dimensional data
sets into several low dimensional data sets through incorpo-
rating the enhanced N (m, s2
) method into cloud environment.
3 THE ND-BASED METHOD FOR SPLITTING
LARGE DATA
3.1 Data Observation
While observing various large data sets including bank data
(obtained from the World Bank’s 2013 open source data)
[17], superstore sales data (obtained from Tableau’s 2012
open source data)[18], personal tax data (obtained from
Australian 2003 online open source data), online transaction
data (obtained from an anonymous travel agency’s 2012
data), and e-health data (obtained from Diagnostic Wiscon-
sin Breast Cancer Database)[3].
Fig. 3a 3b, 3c, and 3d (eliminating specific labels and
details) indicate that the observed large data sets contain a
number of sub data sets. Each sub data set form normal dis-
tribution feature.
Most random data sets omit some particular attributes or
features, such as dates, gender, etc., normally contain a
number of sub data sets. Each sub data set has its own
group feature. In this paper we consider the group feature
as the normal distribution form and name such sub data set
as ND sub-data sets or simply ND sets. In other words, a
collection of data forms normal distribution would be con-
sidered as a sub data set. The reason for adopting normal
distribution to determine a sub data set is that this distribu-
tion form widely observed in practical data sets. Each ND
set has its own uptrend, peek and downtrend, which can
provide useful information for data analysis.
3.2 Sub Data Sets (ND/NDA Sets) Determination
A large data set might contain a number of ND sets. Identi-
fying ND sets is a crucial step for splitting large data sets.
Definition 1. A collection of data in a large data set is defined as
a ND set if its distribution is of the form N (m, s2
), m is the
mean or expectation of the distribution, s is its standard devia-
tion and its variance is s2
.
A ND set might not strictly form a standard normal distri-
bution. However, a non-standard ND set still contains
uptrend, peek and downtrend. These ND sets are called
approx. normal-distribution ND sets or NDA sets in short. In
a large set A [1, k], if (ND1[ ND2 [ . . . NDi A) and NDi has
the distribution form N (m, s2
) or NDA (refer to Proposition
2) as shown in Fig. 4a, however, A might not have an exact
distribution of N (m, s2
) after transformed as shown in
Fig. 4b, however they distribute a ND-approximate form.
Hereafter, A [1, k] represents the complete large data set; [1, k]
is the boundary of A’s record number.
Definition 2. S [a, b] A [1, k], if x2 [a, b] satisfies:
fðAðxÞ; m; sÞ ¼
1
s
F
AðxÞ m
s

)
axb
a6¼b
S½a; b ND: (2)
This proposition can be alternatively expressed as:
f AðxÞ
ð Þ ¼
t
ffiffiffiffiffiffi
2p
p e
t2ðAðxÞmÞ2
2 ) S½a; b ND: (3)
As mentioned above, a NDA set might not strictly form a
normal distribution (refer to Fig. 4b). Instead of using f(x)
for probability distribution, the ND-based method converts
the data values using ksdensity function to generate a

probability distribution. Therefore, a NDA data set needs to
satisfy the following characteristics:
1) An NDA data set has a maximum possibility value
point within 2D, and this value is considered as the
identifier, i.e. p(x).
2) An NDA data set is basically practical data; therefore
it refers to Chebyshev Inequality [19] in a certain
way that is more than 89 percent data lay within 3 s.
3) An initial threshold mr (0 mr 1) is assigned to
determine maximum tolerance for the number of
data items in a NDA set does not form N(m, s2
),
which lie out of the normal distribution curve and
called disqualified data items. For instance, mr ¼ 0.3
means that the NDA set contains 30 percent of
disqualified data items. In this paper, a quick deter-
mination method is deployed for the identification
of disqualified data item.
Definition 3. In a large data set A [1, k] of k data elements, we
choose a contiguous subset S [pl, pr] starting at position
pl and choose position pr such that jðplþprÞ
2 pj D where
A(p) denotes the elements with maximum possibility in the
subset S [c - D, c þ D], where c ¼ ðplþprÞ
2 ¼ pl þ B
2, where
B ¼ k/m and m is the number of total partitions.
Proposition 1. S [a, b] A [1, k], A(p) 2 S, if S [a, b] satisfies:
A(p) ¼ Max(S), c a þ b
ð Þ=2 , c p
j j D; and if FðSÞ
ð1mrÞ 1
s
ffiffiffiffi
2p
p b
a expð c2
2 Þdx, mr is the adjusting parameter,
then S [a, b] is a NDA set.
The initial adjusting parameter c and adjusting factor
mr can be adjusted during the S [a, b] selection process.
The determination of A(p) is adjusted based on c. In this
paper, initial c is selected as (a þ b) / 2 when fðpÞ
Pb
i¼a fðxÞ= ðb aÞ and the maximum A(p) has c p
j j D
within S [a, b]; then p is selected as the real center for the
first ND/NDA set. D is calculated based on total data sets
and m.
Based on Proposition 1 and Definition 3, we modified the
determination process for identifying ND/NDA sets in the
following algorithm.
Algorithm 1 deploys m to determine the total number of
the splitting data sets. The adjustable value m denotes the ini-
tial number of the splitting data sets, which is pre-defined.
Fig. 4. The transformed data distributes N(m, s2
)-approx.
Fig. 3. The division of large data sets (example illustration).

During the splitting process, the preliminary boundary of the
first splitting set is predetermined. The boundary of the first
set normally starts from 1 to k / m. The adjusting process for
m is based on the center point with maximum probability
and its distance to k / 2m, i.e. D. m can be increased when the
center point c is on the left side of (pr – pl) / 2, which means D
is negative to predefined J b). That will accumulatively cre-
ate one or several extra data sets instead of predefined m. In
the same way, m can be reduced when D is positive. The final
splitting sets could be variant to predefined m.
Algorithm 1. NDA sets Determination Process
Input: Dataset A; // with k data points;
m; // number of initial partitions;
D; // is the acceptable distance between center and
// the max point in the subset;
Output: ND // list of partitions;
1: B ¼ k / m // size of each partition;
2: ND ¼ [ ]; // initialize partitions to empty;
3: J ¼ 1; // partition number;
4: pl ¼ 1; // left position of the current partition;
5: pr ¼ B; // right position of the current partition;
6: c ¼ B / 2; // the center of the partition;
7: LP ¼ [ ]; // List of the unsatisfied max points;
8: cl ¼ c – D, cr ¼ c þ D;
9: p ¼ MaxPoint (A, cl, cr, LP); // Find the position of
10: // the max probability between positions pl and pr
11: // (inclusive) excluding the elements in the list LP;
12: while pr k
13: while ( j p – c j D)
14: LP ¼ [LP, p];
15: p ¼ MaxPoint (A, cl, cr, LP);
16: end
17: pr ¼ pr þ p – c;
18: ND ¼ [ND, [J, pl, pr]];
19: J ¼ J þ 1;
20: LP ¼ [ ];
21: pl ¼ pr þ 1; // Set next partition left pointer;
22: pr ¼ pl þ b; // Set next partition right pointer;
23: c ¼ (pl þ pr) / 2;
24: cl ¼ c – D, cr ¼ c þ D;
25: if pr k
26: pr ¼ k;
27: end
28: end
We utilize mr for adjusting m value, which can be dynam-
ically applied in the splitting processing or re-applied to the
re-splitting process. The experiments conducted in Section 4
applies mr to the re-splitting process, which adjusts m after
the splitting
mi0
¼ xi
k
2m

k; (4)
mr ¼
P
umi0 k
k
m
¼
xi m
k

1
2
;
(5)
where xi is the position of the selected maximum point in
a boundary minus its left neighbor’s upper bound.
(
Px
i k=2m) indicates the size of the split data sets, which is
based on the accumulated deviation of the selected maxi-
mum points from the middle line (left neighbors’ upper
bounds þ k/2m). mr is a ratio between the actual split data
size and pre-defined k / m.
Adjusting m ¼ m mr in the re-splitting process. For the
dynamic adjusting process, we divide k into several sets
successively. The mr value of the previous set will be
applied to the successive set until all sets are adjusted.
After splitting, the sample data set A [1, 1,000] can be
converted to several smaller ND/NDA data sets as below
x1;1 x1;100
.
.
. ..
. .
.
.
x1;000;1 x1;000;100
2
6
6
4
3
7
7
5
¼
x1;1 x1;100
.
.
. ..
. .
.
.
xi;1 xi;100
2
6
6
4
3
7
7
5 [ . . . [
xj;1 x1;100
.
.
. ..
. .
.
.
x1;000;1 x1;000;100
2
6
6
4
3
7
7
5:
These ND/NDA data sets basically reduce the size of a
large data set. It is efficient when the data analytical pro-
cesses are based on data sets with large volume and arriving
in short period time. The two-stage model in this paper
aims to reduce both data size (record numbers) and dimen-
sions. Smaller data sets, such as divided ND/NDA data
sets, can greatly improve time consumption for SOM-based
dimensionality reduction. Previous research showed that
different dimensionality reduction methods are seriously
limited to handle large datasets due to their computational
complexities and memory requirements [10]. The SOM-
based dimensional reduction for large data sets will be
introduced in the next section.
4 SPLITTING DATA SETS BASED ON THE ND/NDA
METHOD
4.1 Data Sources and Configurations
Three medical data source files have been processed based
on the N (m, s2
) method. These data sources are listed as
below.
1) Diabetic data source [20]: the dataset represents
10 years (1999-2008) of clinical care at 130 US hospi-
tals and integrated delivery networks. It includes
over 50 features representing patient and hospital
outcomes.
2) CASP data source [21]: this is a data set of physico-
chemical properties of protein tertiary structure pro-
vided by ABV—Indian institute of information
technology management.
3) Health data source: this data source contains the
patient information and their hospital records pro-
vided by the online sources, which record patients’
marriage status, employment status, in-hospital
duration, etc.
Propositions 1 and 2 indicate the basic method of split-
ting large files based on the ND method. We conducted a
set of experiments to generate and analyze the split files
based on Algorithm 1. The experimental settings are:

CPU Intel(R) Core(TM), CPU @ 2.10 GHz
RAM 4.00 GB
OS Windows 8, 64bits
Based on Algorithm 1, we select m as initial number
of data sets and a number of boundaries are created
based on k / m. The maximum value less than D distance
to the middle line will be selected as m within the initial
boundary. The middle line is k / 2m þ left neighbor’s
lower bound. Iteratively, all the data sets’ boundaries
will be determined. m is adjusted based on its deviation
from the middle line. A large file can be split into m or
adjusted m number of groups, i.e. [0, mi], . . . [m j, k], k is
the last data item in the data set. The maximum value xi
of each group and its position pi will be stored in an
array. This process will terminate when xi reaches the
boundary 0 and k.
We conducted a set of experiments based on medical
data sources. These data files are Diabetic data sets
(101,767 records, 50 attributes) [20], CASP data sets
(45,731 records, nine attributes) [21] and Health data
sets (4,522 records).
4.2 Splitting Data Sets
Based on the three data sources, we obtained the pi arrays.
The experimental results and experimental settings are
listed as follows:
Dataset 1- Diabetic_data_source: m indicates the
mean of the splitting ND/NDA data sets in
Diabetic. s indicates the standard deviation of the
splitting ND/NDA data sets. The initialized
k /m¼ 7,269. This table is based on “Diabetic
Attribute: Diag_1”.
Dataset 2 - CASP_data_source: m and s values are
the same as described in Dataset 1. The initialized
k /m ¼ 6,532. We only illustrate attributes F1, F2, F8
due to space limitation.
Dataset 3-Health_data_source: m and s values are
the same as described in Dataset 1. The initialized
k /m ¼ 904. Based on “Health Attribute: Balance”.
4.3 Splitting Data Observation
The splitting results of the above three medical data sets are
illustrated as below. We observe the diabetic data sets,
which has been split into 13 files as shown in Fig. 5.
We further evaluate the CASP data set. However, due to
space limitation, we selectively illustrate the distribution
form of F8 attributes for the splitting results observation. In
this case, the KS Density are compared and evaluated as
shown in Fig. 6.
The Health data set and its splitting data sets based on
the balance attribute are evaluated. The Health data sets
has the smallest data records among the three source data,
however, the splitting results are obvious in terms of form
ND shapes as shown in Fig. 7.
4.4 Analysis
ND-based data sets can provide comprehensive informa-
tion. The inclusiveness of ND-based data sets is particularly
efficient for medical data sets according to real data obser-
vation [3], [20], [21]. The inclusiveness of ND-based data
sets can improve the performance of analyzing large vol-
ume data sets. In many cases, a quick inquiry can be based
on a split data set rather than going through the full data
set. The advantages of the inclusive attribute of ND-based
data sets are:
1) Improving data analytical time efficiency: ND-based
data sets support a quick inquiry to be conducted
TABLE 2
Boundary and Values of Splitting Diabetic Data Sets
m s NDA boundary NDA size
475.66 207.81 [1, 7,049] 7,049
485.09 213.41 [7,050, 14,103] 7,054
484.74 212.55 [14,104, 21,173] 7,070
481.79 212.94 [21,174, 28,232] 7,059
484.71 210.66 [28,233, 35,330] 7,098
488.83 211.49 [35,331, 42,398] 7,068
483.28 209.48 [42,399, 49,453] 7,055
490.41 219.02 [49,454, 56,515] 7,062
485.78 211.93 [56,516, 63,564] 7,049
493.44 216.27 [63,565, 70,617] 7,053
490.75 212.98 [70,618, 77,666] 7,049
488.74 213.07 [77,667, 84,729] 7,063
491.41 212.60 [84,730, 91,783] 7,054
486.57 207.08 [91,784, 98,837] 7,054
487.08 211.66 [98,838, 101,766] 2,929
TABLE 3
Boundary and Values of Splitting Casp Data Sets (f1,8)
CASP Attribute: F1
9,849.98 3,959.98 [1, 6,179] 6,179
9,820.39 4,030.81 [6,180, 13,127] 6,948
9,896.54 4,116.42 [13,128, 19,224] 6,097
9,837.24 4,074.83 [19,225, 25,473] 6,249
9,886.61 4,083.54 [25,474, 32,395] 6,922
9,909.50 4,049.44 [32,396, 39,246] 6,851
9,900.65 4,089.03 [39,247, 45,730] 6,484
CASP Attribute: F8
68.74 55.94 [1, 5,910] 5,910
69.92 56.40 [5,911, 11,820] 5,910
70.47 56.52 [11,821, 17,713] 5,893
70.11 57.03 [17,714, 23,721] 6,008
70.52 56.37 [23,722, 29,617] 5,896
70.29 56.48 [29,618, 35,543] 5,926
69.08 56.63 [35,544, 41,493] 5,950
70.94 56.55 [41,494, 45,730] 4,237
TABLE 4
Boundary and Values of Splitting Health Data Sets
1,425.73 2,568.72 [1, 815] 815
1,434.42 2,767.34 [816, 1,645] 830
1,288.30 2,688.25 [1,646, 2,466] 821
1,430.35 3,119.67 [2,467, 3,310] 844
1,509.49 3,840.63 [3,311, 4,129] 819
1,474.81 2,746.92 [4,130, 4,521] 392

Fig. 5. Splitting results of diabetic data sets based on Diag1 attribute (a-o).

Fig. 6. Splitting results of CASP F8 data sets—KSDensity (a-i).

based on a split data set instead of the full data set.
This feature greatly improves the data analyzing
time.
2) Improving the accuracy of analytical process: ND-based
data sets contain abundant inclusive information.
Therefore, a quick inquiry or data processing based
on a split data set is considered as sufficient as the
full data set.
In order to verify the inclusiveness of ND-based data
sets, we conducted the following analysis based on the split-
ting data sets and the original data sets of the three data
sources. We observe the m and s values within a boundary
and compare these values with the original data.
Dataset 1 - Diabetic_data_source:
In Dataset 1, we compare the m and s values of the
full data set with the corresponding values of the
split data sets.
The values contain in the split data sets are
very close to the original data set as show in
Table 5. The m value difference between the
original data set and the average split data sets is
0.0216, which is approximately 0.0044 percent var-
iation of the original data. The s value difference
between the original data set and the average split
data sets is 0.1095, which is approximately 0.0516
percent variation of the original data value. Fig. 8
shows the values of the original data set and the
values of each split set.
Fig. 7. Splitting results of health data based on Balance (a-g).
Fig. 8. Variation of original data values and split data values based on
diabetic source (mean, standard deviation).

Fig. 9 shows that the ND-like distribution of orig-
inal data set (a) is very similar to splitting data sets
(b, c and d). Overall, the split data sets contain rela-
tively comprehensive information, and they can be
representative for the full data set.
Dataset 2 - CASP_data_source:
In dataset 2, we only illustrate the result of
CASP_F8 due to the limited space. The comparison
between the full data set and the average split sets is
listed as shown in Table 6.
The values of the split data sets have very slight
variation compared with the original full data set.
The m value difference between the original data set
and the average split data sets is 0.034, which is
approximately 0.0486 percent variation of the origi-
nal data. The s value difference between the
original data set and the average split data sets is 0.
Each split data set is approximate 1/8 size of the
original data set.
Fig. 10 shows the values of the original data set
and the values of each split set.
The m and s values of the split data sets have
very small variations compared with the full data
set. Therefore, the split data sets based on CASP_F8
can be representative for the full data set.
Dataset 3 - Health_data_source:
Dataset 3 is split into 5 individual data sets. The
comparison between the full data set and the average
split sets is listed in Table 7.
The variations of the split health (balance attribute) data
sets are slightly larger than Diabetic-Diag1 and CASP-F8.
We noticed that the average size of each split data set is
small, which is the major factor for increasing the variations.
The variations of the s values, which we consider as the
crucial criteria, are stable as shown in Fig. 11. The difference
between them is 54.04. It is approximately 1.79 percent vari-
ation of the original data.
Fig. 12 shows that the ND-like distribution of splitting
data set (b, c and d) have slightly increased their s values
compared with the original data set (a). However, the varia-
tion of s is reasonably small (1.79 percent of original data),
and m variation is very small (0.315 percent of original
data). Therefore, they can be representative for the full data
set in a certain degree.
5 RESIZED MEDICAL DATA SIMULATION IN
CLOUDSIM
The Cloud computing paradigm provides an efficient solu-
tion for utilization of computing infrastructures through
dynamic allocation of resources in a pay as you go model,
which allows users to effectively utilize resources and save
CASP source (mean, standard deviation).
TABLE 7
Split and Original Data Sets
Comparison- Health_Balance
Original Health_Balance Average of Split Sets
m 1,422.7 1,427.18
s 3,009.3 2,955.26
Size 4,521 753.5
TABLE 6
Split and Original Data Sets Comparison- CASP_F8
Orignial CASP_F8 Average of Split Sets
m 69.975 70.009
s 56.49 56.49
Size 45,730 5,716.25
TABLE 5
Split and Original Data Sets Comparison- Diabetic_Diag1
Original Diabetic_Diag1 Average of Split Sets
m 486.5304 486.552
s 212.3062 212.1967
Size 101,766 6,784.4
Fig. 9. ND-like form of original data values and split data values based on
diabetic source.
health source (mean, standard deviation).

costs. The ND-based method is efficient to be applied in
Cloud environments as Cloud solutions provide abundant
support on distributed and parallel computing [22]. The ND-
based method is efficient to be applied in Cloud environ-
ments since Cloud solutions provide abundant support on
distributed and parallel computing [23], [24]. The ND-based
method allows files to be processed in a distributed and par-
allel manner, which can improve the performance and
reduce the costs.
To evaluate the performance of the ND-based method in
Cloud environments, we conduct experiments based on
CloudSim [25]. CloudSim is a discrete event simulator that
enables modeling of Cloud environment with on-demand
resource provision in a repeatable and controllable way. We
compared performance difference of resource configura-
tion, response time, resource cost, and flexibility between
applying ND-based method to process the split data sets
and not applying ND-based method to process the original
data sets. The flexibility refers to the failure rate of data
processing under limited resources. The higher failure rate
indicates the less flexibility of the data processing method.
5.1 Resource Configuration—CloudSim
We simulate a datacenter that consists of 500 physical
nodes. Each node has 64 CPU cores, 300 GB memory, 1 TB
storage, and 10 GB/s network bandwidth. Each CPU core
has performance equivalent to 4,000 Million Instructions
per Second (MIPS). We consider five types of memory opti-
mized Amazon EC2 VMs, which are: r3large, r3xlarge,
r32xlarge, r34xlarge, and r38xlarge [26] as listed in Table 8.
VMs in Amazon EC2 are charged in an hourly basis.
5.2 Workload Data
The workload generated in this study is based on the Big
Data Benchmark [27], which executes queries using differ-
ent data analytic frameworks on large data set in Amazon
EC2 and details the VM configuration and query processing
time. Based on the information, we model resource require-
ments for data processing tasks under different resource
configurations in CloudSim.
To compare performance difference between processing
split data sets and original data sets, we also need to take
data splitting into account, as this operation requires extra
VM resources and time. We use splitting time of CASP data
sets as the experimental data. However, the size of CASP
data sets is not sufficient large to show the obvious perfor-
mance benefits of applying ND-based method. Therefore,
the resources required by splitting original data sets are
enlarged to a suitable scale according to the original file size.
For ND-based data splitting application, we consider sin-
gle core splitting and multicore splitting. For multi-core
splitting, we model contentions for file assess as a penalty
in performance speed up for the extra cores. This means
that only a fraction of the execution time of the splitting
application can benefit from the extra cores. In this experi-
ment, we set this fraction to 50 percent. We model the
resource contention in a conservative way. If a better
method is chosen to provide the data sets to the threads of
multicore applications, the performance could be signifi-
cantly better. Since resource configuration and data splitting
time are known, we model resource requirements for file
splitting tasks for single core application and multicore
application in CloudSim.
As the performances of data processing and data splitting
vary, which do not consume the same amount of resources
each time, we simulate 10 percent performance variation
by introducing resource variation coefficient, which is a
random value generated from a Uniform Distribution with
upper bound 1.1 and lower bound 0.9. Based on the above
settings, we modeled three types of tasks in CloudSim:
(1) the task for splitting, (2) the task for processing split data
sets, (3) the task for processing original (non-split) data sets.
5.3 Experimental Results
The first part of this study aims to evaluate performance dif-
ferences of resource configuration, resource cost, and
response time for processing split data sets and non-split
data sets. We have five resource configurations for five types
of VM in our experiment. For each VM configuration, we
provision three VMs and run one type of tasks on each VM.
The first VM is utilized to run the data splitting tasks, the sec-
ond VM is used to run the data processing tasks for split data
sets, and the third VM is utilized to run the data processing
task for the original data sets. We assume our data analytic
application adopts single core in-memory processing.
We obtain results of resource configuration, VM execu-
tion time, and resource cost from the study, as shown in
Tables 2, 3, 4, and 5. From these results, we can see that
there are only four types of resource configurations avail-
able for processing the original file. VM configuration of
r3.large is not available as it has 15 GB memory, which
Fig. 12. ND-like form of original data values and split data values based
on health source.
TABLE 8
VM Configuration Information
VM type vCPU ECU Memory Storage Cost
r3.large 2 6.5 15 32 SSD $0.175 /Hour
r3.xlarge 4 13 30.5 80 SSD $0.350 /Hour
r3.2xlarge 8 26 61 160 SSD $0.700 /Hour
r3.4xlarge 16 52 122 320 SSD $1.400 /Hour
r3.8xlarge 32 104 244 640 SSD $2.800 /Hour

cannot fit 30 GB data in memory for processing. Five types
of resource configuration are all available for processing
split data sets as smaller data sets can be easily fit in mem-
ory for processing.
We can see that splitting data sets using single core appli-
cations does not have performance advantages (see Table 9).
In the contrary, it takes longer response time and incurs
higher costs. For the cheapest resource configuration using
r3.xlarge, the overall cost and processing time of single core
splitting is $2.100, 5.015 hours while for multi-core splitting
is $1.050, 2.383 hours. Multi-core processing is 50 percent
cheaper than single core processing and provides 2.10 times
speed up (see Table 10). We also notice that processing split
data sets not only generates less resource cost but also
reduces data processing time. For cost minimized resource
configuration of data processing, we can see that the overall
cost and time for processing original data sets using r3.
xlarge is $9.100, 25.997 hours, while for processing split
data sets using r3.xlarge is $2.800, 7.721 hours. Processing
split data sets is 69.2 percent cheaper and offers 3.37 times
speed up.
We use cost-minimized resource configurations with
minimized execution time to compare the performance dif-
ferences between applying ND-based method and not
applying the method. For applying ND-based method, we
select VM r3.xlarge to execute data splitting tasks with
multi-core splitting and select VM r3.xlarge to process split
data sets in parallel. The overall cost and execution time of
this configuration is $3.850 and 10.104 hours. For processing
original data sets, we select VM r3.xlarge as the resource
configuration, which costs $10.150 and takes 28.380 hours.
The result shows that applying ND-based method performs
significantly better as it not only gives 62.1 percent cost
saving but also leads to 2.81 times speed up. Tables 9-12
show the experimental results.
The second part of this study aims to analyze the flexibil-
ity of processing the split data sets and the original data
sets. Data processing applications adopt in-memory
processing, which require the whole data sets to be fitted in
memory for processing. In this way, when the available
memory in a VM is less than the size of data sets, the data
sets cannot be processed successfully and a failure is gener-
ated. A comparison has been conducted to analyze failures
of processing split data sets and original data sets as an indi-
cator for flexibility analysis.
The experiment scenario is that VMs have running tasks;
therefore, their capacities are not fully available. We con-
sider memory as the resource bottleneck for data processing
and introduce the Resource availability coefficient (RAC) to
model the available percentage of memory in VMs. RAC is
a random value generated using Uniform Distribution with
upper bound 1 and lower bound 0. We conduct 10 experi-
ments and each has a different RAC coefficient for five types
of VM configuration. For each VM configuration, we create
two VMs to execute the original data sets and split data sets.
For split data sets and original data sets, 50 data processing
tasks are executed accordingly. We obtain failure rates of
the 10 experiments, which are arranged in an ascending
order of the RAC value, as shown in Fig. 13.
TABLE 11
Resource Utilization of Processing Original File
Resource configuration Execution time VM Cost
r3.large null null
r3.xlarge 25.997 hours $9.100
r3.2xlarge 25.840 hours $18.200
r3.4xlarge 27.260 hours $39.200
r3.8xlarge 26.483 hours $75.600
TABLE 12
Resource Utilization of Processing Partitioned Files
r3.large 15.309 hours $2.800
r3.2xlarge 4.103 hours $3.500
Fig. 13. Resource utilization of partitioning original file using multi core
application.
TABLE 10
Resource Utilization of Partitioning Original File
Using Multi Core Application
r3. large null null
TABLE 9
Resource Utilization of Partitioning Original File
Using Single Core Application
r3.large null null

We compare the failure rate of processing split data sets
and original data sets. Results show that 30 failures out of
50 tasks are generated during processing original data sets
while six failures out of 50 tasks are generated during proc-
essing split data sets. Applying ND-based data processing
method performs significantly better as the failure rate is 12
percent comparing to the failure rate of 60 percent for proc-
essing the original data sets.
For VMs with less available resources, as indicated by
RAC value from 0.06 to 0.25, failures are generated for both
split data sets and original data sets. When RAC is 0.06, five
failures are generated for processing original data sets while
three failures are generated for processing split data sets.
When available VM resource increases, indicated by increas-
ing RAC value from 0.13 to 0.25, the number of failures
decreases for both original data sets and split data sets.
When we further increase the available resources, indicated
by RAC value from 0.31 to 0.99, the number of failures fur-
ther decreases for processing original data set while no fail-
ures are generated for processing split data sets. The
advantage of applying ND-based method is obvious as proc-
essing split data sets generates less number of failures com-
paring to processing original data set for all RAC values.
It is to be noted that our definition of failure for process-
ing split data sets is rigorous. Even if the processing of one
split data set fails, the processing of the whole data sets fails,
regardless of the other successfully processed split data sets.
However, we can partially process split data sets that can be
fitted in memory of under-utilized VMs for higher cost sav-
ing. In this way, it can better utilize remaining resources on
under-utilized VMs as smaller split data sets requires less
resources so that they have more chances to be successfully
processed. The ND-based method only provisions new
VMs for the rest of split data sets that are not successfully
executed, which greatly satisfies cost-saving requirements
of cloud users.
6 CONCLUSION
A novel ND-based method has been introduced in this
paper for splitting large data sets. The rapid growth of data
has caused the problem for efficient data processing and
analysis due to the enormous data size in the “Big Data” era.
The proposed ND-based method can efficiently split
large data sets, particularly for medical and e-health data.
We observed that most medical data sets normally contain
several numeric attributes, which makes the ND-based
splitting process feasible and efficient. Furthermore, the
split data sets are suitable for parallel processing in cloud
environment, which makes real-time data analysis feasible
for emergency medical procedures. The ND-based split
data sets have been tested for their representative capability
based on the inclusive analysis provided in Section 4. This
feature allows systems return query results based on a sin-
gle split data set since the inclusive character of ND-based
split data sets.
The future work of this study will further conduct experi-
ments in Cloud sim to analyze the splitting algorithm by
exploring whether the splitting process can give advantages
for data analysis when parallel data processing is deployed.
Furthermore, we will investigate the effects that algorithm
parameters have on its performance, such as, selecting suit-
able mr and D values for generating more efficient and accu-
rate split data sets.
ACKNOWLEDGMENTS
The authors would like to express our gratitude to Professor
Rao Kotagiri from the University of Melbourne for his
valuable comments and suggestion and The CLOUDS Labo-
ratory at the University of Melbourne for providing Cloud-
Sim platform and experimental environment for simulation,
to student members from Sun Yat-Sen University for con-
ducting the splitting data evaluation. They would thank the
Center for machine learning and intelligent systems at Uni-
versity of California, Irvine, for providing open medical
data source. This work was supported in part by the
Zhejiang Provincial Natural Science Foundation of China
under Grant LY 14G010004, Zhejiang Provincial Philosophy
and Social Science Project under Grant 11JCSH03YB, and
National Natural Science Foundation of China under
Grant 61272480.
REFERENCES
[1] G. W. Stewart, “On the early history of the singular value decom-
position,” SIAM Rev., vol. 35, no. 4, pp. 551—566, 1993.
[2] A. D. Penman and W. D. Johnson, “The changing shape of the
body mass index distribution curve in the population: implica-
tions for public health policy to reduce the prevalence of adult
obesity,” Pre Chronic Disease, vol. 3, no. 3, p. A74, 2006.
[3] W. H. Wolberg, W. N. Street, and O. L. Mangasarian, Breast Cancer
Wisconsin (Diagnostic) Data Set, Nov. 2014.
[4] E. H. Shortliffe and G. O. Barnett, “Biomedical data: Their acquisi-
tion, storage, and use,” in Biomedical Informatics. New York, NY,
USA: Springer, 2006, pp. 46–79.
[5] M. G. Walker and J. Shi, Statistics Consulting—ANOVA Example.
Walker Bioscience, Nov. 2014.
[6] A. Labrinidis and H. V. Jagadish, “Challenges and opportunities
with big data,” Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032 –
2033. 2012.
[7] C. W. Baxter, S. J. Stanley, Q. Zhang, and D. W. Smith,
“Developing artificial neural network process models: A guide for
drinking water utilities,” in Proc. CSCE, 2000, pp. 376–383.
[8] R. J. May, H. R. Maier, and G. C. Dandy, “Data splitting for artifi-
cial neural networks using SOM-based stratified sampling,” Neu-
ral Netw., vol. 23, pp. 283–94. 2010.
[9] T. Kohonen, “Self-organized formation of topologically correct
feature maps,” Biological Cybern., vol. 43, no. 1, pp. 59–69,
1982.
[10] K. Tasdemir and E. Merenyi, “SOM-based topology visualisation
for interactive analysis of high-dimensional large datasets,” Mach.
Learning Rep., vol. 1, pp. 13–15, 2012.
[11] S. Rani and G. Sikka, “Recent techniques of clustering of time
series data: A survey,” Int. J. Comput. Appl., vol. 52, no. 15,
pp. 1–9, 2012.
[12] P. Russom, “Big data analytics,” TDWI Best Practices Report (4th
Quarter). Renton, WA, USA: TDWI Press, 2011, pp. 3–34.
[13] W. Wu, H. R. Maier, G. Dandy, and R. May, “Exploring the impact
of data splitting methods on artificial neural network models,” in
Proc. 10th Int. Conf. Hydroinformat.: Understanding Changing Climate
Environ. Finding Solutions, 2012, pp. 1–8.
[14] J. Neyman, “On the two different aspects of the representative
methods. The method stratified sampling and the method of
purposive selection,” J. Roy. Statistical Soc., vol. 97, pp. 558–606,
1934.
[15] J. Dean and S. Ghemawat, “MapReduce: Simplied data processing
on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[16] F. Marozzo, D. Talia, and P. Trunfio, “P2P-MapReduce: Parallel
data processing in dynamic cloud environments,” J. Comput. Syst.
Sci., vol. 78, no. 5, pp. 1382–1402, 2012.
[17] The World Bank Group, World Development Indicators, World Bank
Publication, Dec. 2013.

[18] Tableau Software, Sample Superstore Sales, Jan. 2014.
[19] R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics,
5th ed. Englewood Cliffs, NJ, USA: Prentice Hall Press. 2007.
[20] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J.
Cios, and J. N. Clore, “Impact of HbA1c measurement on hospital
readmission rates: Analysis of 70,000 clinical database patient
records,” BioMed. Res. Int., vol. 2014, p. 11, 2014.
[21] P. S. Rana, “Physicochemical properties of protein tertiary struc-
ture data set,” ABV -Indian Institute of Information Technology
Management, MP, 2014.
[22] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic,
“Cloud computing and emerging IT platforms: Vision, hype, and
reality for delivering computing as the 5th utility,” Futur. Gener.
Comput. Syst., vol. 25, no. 6, pp. 599–616, Jun. 2009.
[23] A.N. Toosi, R. N. Calheri, and R. Buyya, “Interconnected cloud
computing environments: Challenges, taxonomy and survey,”
ACM Computi. Surveys, vol. 47, no 1, pp. 7:1–7:47, 2014.
[24] B. Hayes, “Cloud computing,” Commun. ACM, vol. 51, no. 7,
pp. 9–11, Jul. 2008.
[25] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. D. Rose, and
R. Buyya. “CloudSim: A toolkit for modeling and simulation
of cloud computing environments and evaluation of resource
provisioning algorithms,” Software— Practice experience, vol. 41,
pp. 23–50, 2010.
[26] Amazon Web Services Inc. Amazon Elastic Compute Cloud—User
Guide for Microsoft Windows, Amazon Copyright, 2014.
[27] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden,
and M. Stonebraker, “A comparison of approaches to large-scale
data analysis,” in Proc. ACM SIGMOD Int. Conf. Manage. Data,
2009, pp. 165–178.
Hao Lan Zhang received the PhD degree in
computer science from Victoria University, Aus-
tralia. He is currently an associate professor at
NIT, Zhejiang University in China and the 151
Talented Scholar of Zhejiang Province. Prior to
that, he was a research fellow and academic staff
at RMIT University, Australia. He has published
over forty papers in refereed journals and confer-
ences. His research interests include intelligent
information systems, multiagent systems, health
informatics, knowledge-based systems, data
mining, etc. He serves as the editorial board member and a reviewer of
several journals including: IEEE Transactions on SCM, IEEE Transac-
tions on Cloud Computing, The Computer Journal, Information Scien-
ces, Journal of AAMAS, etc. He is a member of the IEEE.
Yali Zhao received the bachelor’s degree from
Harbin Institute of Technology, China, and the
joint master’s degree from Harbin Institute of
Technology, China, and University of Bordeaux
1, France. She is currently working toward
the PhD degree at the University of Melbourne,
Australia. Her research interest is SLA-based
resource scheduling for big data analytic as a ser-
vice in cloud computing environments.
Chaoyi Pang received the PhD degree from the
University of Melbourne. He is currently a full pro-
fessor at NIT, Zhejiang University, and a 1,000
talented professor of Zhejiang Province. His
research interests are in algorithm, data security/
privacy, access control, data warehousing, data
integration, database theory, graph theory. He is
a senior member of ACM.
Jinyuan He received the bachelor’s and master’s
degrees from Sun Yat-Sen University in 2013 and
2015, respectively. He was a research assistant
in the Center for Social Computing and Data
Management, NIT, Zhejiang University in 2014.
He is currently working on the large data sets
splitting projects.
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/csdl.

NAD Vs MapReduce

Recommended

Recommended

More Related Content

Similar to NAD Vs MapReduce

Similar to NAD Vs MapReduce (20)

Recently uploaded

Recently uploaded (20)

NAD Vs MapReduce