CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2013. Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3083
SPECIAL ISSUE PAPER
SaC-FRAPP: a scalable and cost-effective framework for privacy preservation over big data on cloud
Xuyun Zhang1,*,†, Chang Liu1, Surya Nepal2, Chi Yang1, Wanchun Dou3 and Jinjun Chen1
1 Faculty of Engineering and IT, University of Technology, Sydney, Sydney, Australia
2 Information and Communication Technologies Centre, Commonwealth Scientific and Industrial Research Organisation, Sydney, Australia
3 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
*Correspondence to: Xuyun Zhang, Faculty of Engineering and IT, University of Technology, Sydney, Sydney, Australia. †E-mail: xyzhanggz@gmail.com
SUMMARY
Big data and cloud computing are two disruptive trends nowadays, provisioning numerous opportunities to
the current information technology industry and research communities while posing significant challenges
to them as well. Cloud computing provides powerful and economical infrastructural resources for cloud
users to handle ever-increasing data sets in big data applications. However, processing or sharing
privacy-sensitive data sets on cloud probably engenders severe privacy concerns because of multi-tenancy.
Data encryption and anonymization are two widely-adopted ways to combat privacy breach. However,
encryption is not suitable for data that are processed and shared frequently, and anonymizing big data and
managing numerous anonymized data sets remain challenging for traditional anonymization approaches. As such,
we propose a scalable and cost-effective framework for privacy preservation over big data on cloud in this
paper. The key idea of the framework is that it leverages cloud-based MapReduce to conduct data
anonymization and manage anonymous data sets, before releasing data to others. The framework provides a
holistic conceptual foundation for privacy preservation over big data. Further, a corresponding
proof-of-concept prototype system is implemented. Empirical evaluations demonstrate that the scalable and
cost-effective framework for privacy preservation can anonymize large-scale data sets and manage
anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. Copyright © 2013
John Wiley & Sons, Ltd.
Received 28 May 2013; Accepted 4 June 2013
KEY WORDS: big data; cloud; privacy preservation; framework; anonymization
1. INTRODUCTION
Big data and cloud computing, two disruptive trends at present, offer a large number of business
opportunities, and likewise pose considerable challenges to the current information technology (IT)
industry and research communities [1, 2]. Data sets in most big data applications such as social
networks and sensor networks have become increasingly large and complex so that it is a
considerable challenge for traditional data processing tools to handle the data processing pipeline
(including collection, storage, processing, mining, sharing, etc.). Generally, such data sets are often
from various sources and of different types (Variety) such as unstructured social media content and
semi-structured medical records and business transactions, and are of large sizes (Volume) with fast
data in or out (Velocity). Cloud systems provide massive computation power and storage capacity that
enable users to deploy applications without infrastructure investment. Because of its salient features,
cloud is promising for users to handle the big data processing pipeline with its elastic and
economical infrastructural resources. For instance, MapReduce [3], an extensively studied and widely
adopted large-scale data processing paradigm, is incorporated with cloud infrastructure to provide
more flexible, scalable, and cost-effective computation for big data processing. A typical example is
the Amazon Elastic MapReduce service. Users can invoke Amazon Elastic MapReduce to conduct
their MapReduce computations based on the powerful infrastructure offered by Amazon Web
Services (e.g., Elastic Compute Cloud and Simple Storage Service) and are charged in proportion to
the usage of the services. In this way, it is economical and convenient for companies and
organizations to collect, store, analyze, and share big data to gain competitive advantages.
However, because cloud systems are multi-tenant, processing or sharing privacy-
sensitive data sets on them, for example, ‘other parties’ mining healthcare data sets, probably
engenders severe privacy concerns. Moreover, privacy concerns on MapReduce platforms are aggravated
because the privacy-sensitive information scattered among various data sets can be recovered with
more ease when data and computational power are considerably abundant. Although some privacy
issues are not new, their importance is amplified by cloud computing and big data [2]. With the
wide adoption of online cloud services and proliferation of mobile devices, the privacy concern
about processing on and sharing of sensitive personal information is increasing. For instance,
HealthVault, an online health service provided by Microsoft (Redmond, Washington, USA), is
deployed on the Windows Azure cloud platform. Although the data in such cloud services are
usually deemed extremely privacy-sensitive, they can offer significant benefits if
analyzed and mined by organizations such as disease research centers.
Data encryption [4, 5] and anonymization [6, 7], having been extensively studied recently, are
promising and widely-adopted ways to combat privacy breach and violation on cloud. Mechanisms
such as encryption [4], access control [8], and differential privacy [9] are exploited to protect the
data privacy. These mechanisms are well-known pillars of privacy protection and still leave open
questions in the context of cloud computing and big data [2]. Usually, the data sets uploaded into
cloud are not only for simple storage but also for online cloud applications, that is, the data sets are
dynamic. If we encrypt these data sets, processing them efficiently will be quite a
challenging task, because most existing applications only run on unencrypted data sets. Although
recent progress has been made in homomorphic encryption, which theoretically allows performing
computation on encrypted data sets, applying current algorithms is rather expensive because of
their inefficiency [10]. Moreover, data holders and data users in cloud are different parties in most
applications, for example, cloud health service providers and pharmaceutical companies. In such
cases, encryption or access control mechanisms alone fail to ensure both privacy preservation and data
utility exposure. Data anonymization is a promising category of approaches to achieve such a goal
[7]. However, most existing anonymization algorithms lack scalability over big data. Hence,
how to leverage state-of-the-art cloud-based techniques to address the scalability issues of
current anonymization approaches deserves considerable attention. Moreover, anonymous data sets
scattered on cloud probably compromise data privacy if they are not properly managed. These data
sets can be highly related because they may be parts of one data set [11], differently anonymized
versions of a data set [12], incremental versions of a data set [13], and so on. Adversaries
are able to collect such a group of data sets from cloud and infer private information from them even if
each data set individually satisfies a privacy requirement. As such, it is still a challenge to manage
multiple anonymous data sets on cloud.
In this paper, we propose a scalable and cost-effective framework for privacy preservation (SaC-
FRAPP) over big data on cloud by integrating our previous work along the lifecycle
of data anonymization. The key idea of the framework is that it leverages cloud-based MapReduce
and HDFS (Hadoop Distributed File System) to conduct data anonymization and manage
anonymous data sets, respectively, before releasing data sets to other parties. The framework is built
on top of MapReduce and functions as a filter to preserve the privacy of data sets before these
data sets are accessed and processed by MapReduce. Specifically, the framework provides interfaces
to data holders to specify various privacy requirements based on different privacy models. Once
privacy requirements are specified, the framework launches MapReduce versions of anonymization
algorithms to efficiently anonymize data sets for subsequent MapReduce tasks. Anonymous data sets are
retained and reused to avoid re-computation cost. Thus, the framework also handles the dynamic update
of data sets to maintain the privacy requirements of such data sets. Besides anonymization,
SaC-FRAPP also integrates encryption techniques to cost-effectively ensure the privacy of multiple
data sets that are independently anonymized in terms of different privacy requirements. Finally, a
corresponding proof-of-concept prototype system based on our cloud environment is implemented
for the framework. The framework provides a holistic conceptual foundation for privacy
preservation over big data and enables users to realize the full potential of the high scalability,
elasticity, and cost-effectiveness of the cloud. We conduct extensive experiments on real-world data
sets to evaluate the proposed framework. Empirical evaluations demonstrate that SaC-FRAPP can
anonymize big data and manage the anonymous data sets in a highly flexible, scalable, efficient, and
cost-effective fashion.
The contributions of this paper are threefold. Firstly, we propose SaC-FRAPP over big data on
cloud by integrating our previous work along the lifecycle of data anonymization.
Secondly, a corresponding proof-of-concept prototype system is presented. Thirdly, extensive
experimental evaluations on modules of the framework are provided to demonstrate that SaC-FRAPP
can anonymize big data and manage the anonymous data sets in a highly flexible, scalable, efficient,
and cost-effective fashion.
The remainder of this paper is organized as follows. The next section reviews the related work
about privacy protection in cloud computing, big data, and MapReduce. In Section 3, we briefly
present the preliminary background knowledge about cloud systems and MapReduce. Section 4
describes the lifecycle of data anonymization on cloud and formulates the details of the proposed
framework. The proof-of-concept prototype system is designed and implemented in Section 5. We
present the empirical evaluation results of each module of the prototype system in Section 6.
Finally, we conclude this paper and discuss our future work in Section 7.
2. RELATED WORK
We briefly review recent research on data privacy preservation and privacy protection in MapReduce
and cloud computing environments.
The privacy-preserving data publishing research community has investigated privacy-preserving
issues extensively and made fruitful progress with a variety of privacy models and preserving
methods [7]. Privacy principles such as k-anonymity [14], l-diversity [15], and t-closeness [16] are
put forth to model and quantify privacy. Privacy principles for multiple datasets are also proposed,
but they aim at specific scenarios such as continuous data publishing or sequential data releasing [7].
Several anonymizing operations are leveraged to anonymize data sets, including generalization [17–19],
anatomization [20], slicing [21], disassociation [22], and so on. Roughly, there are four generalization
schemes [7], namely, full-domain generalization (FG) [23], sub-tree generalization (SG) [19], multidimensional generalization (MG) [17], and cell generalization (CG) [18].
Encryption is widely adopted as a straightforward approach to ensure data privacy on cloud against
malicious users. Special operations such as query or search on encrypted data sets stored on cloud have
been extensively studied [4, 24, 25], although performing general operations on encrypted data sets is
still quite challenging [10]. Instead of encrypting all data sets, Puttaswamy et al. [8] described a set of
tools called Silverline that can separate all functionally encryptable data from other cloud application
data, where the former are encrypted for privacy preservation while the latter are left unencrypted for
application functionality. However, the sensitivity of data is required to be labeled in advance. Zhang
et al. [26] proposed a privacy leakage upper bound constraint-based approach to preserve the
privacy of multiple data sets by only encrypting a part of data sets on cloud.
As SaC-FRAPP is proposed based on MapReduce, we review MapReduce-related research work
in the following. The Kerberos authentication mechanism [27] has been integrated into the MapReduce
framework of Hadoop since version 1.0.0. Basically, access control alone fails to preserve privacy
because data users can infer privacy-sensitive information if they have access to the unencrypted data.
Roy et al. [28] investigated the data privacy problem caused by the MapReduce framework and
presented a system named Airavat, which incorporates mandatory access control with differential
privacy. The mandatory access control will be triggered when the privacy leakage exceeds a
threshold, so that both privacy preservation and high data utility are ensured. However, the results
produced in this system are mixed with certain noise, which is unsuitable for many applications that
need data sets without noise, for example, medical experiment data mining and analysis. Based on
MapReduce, several systems have been proposed to handle privacy concerns of computation and
storage on cloud. Blass et al. [29] proposed a privacy-preserving scheme named PRISM for
MapReduce on cloud to perform parallel word search over encrypted data sets. Nevertheless,
many cloud applications require MapReduce to conduct tasks such as data mining and analytics
over these data sets besides search. Ko et al. [30] proposed the HybrEx MapReduce model to
provide a way in which sensitive and private data are processed within a private cloud, whereas other
data can be safely offloaded to a public cloud. Similarly, Zhang et al. [31] proposed a system named Sedic,
which partitions MapReduce computing jobs in terms of the security labels of the data they work on
and then assigns the computation without sensitive data to a public cloud. However, the sensitivity of
data is also required to be known in advance in the above two systems. Wei et al. [32] proposed a
service integrity assurance framework named SecureMR for the MapReduce framework. But we
mainly focus herein on privacy-preserving issues in the proposed layer. Our privacy-preserving
framework attempts to produce and retain anonymous data sets according to data holders' privacy
requirements for subsequent MapReduce tasks.
3. CLOUD SYSTEMS AND MAPREDUCE PRELIMINARY
3.1. Cloud systems
Cloud computing is one of the most hyped IT innovations at present, having sparked plenty of interest in both
the IT industry and academia. Recently, IT giants such as Amazon (Seattle, Washington, US), Google
(Mountain View, California, US), IBM (Armonk, New York, US), and Microsoft have invested huge sums
of money in building up their public cloud products, and indeed they have developed their own products,
for example, Amazon's Web Services, Google's App Engine and Compute Engine, and Microsoft's Azure.
Meanwhile, several corresponding open source cloud computing solutions have also been developed, such as Hadoop,
Eucalyptus, OpenNebula, and OpenStack. The cloud computing definition published by the US National
Institute of Standards and Technology comprehensively covers the commonly agreed aspects of cloud
computing [33]. In terms of the definition, the cloud model consists of five essential characteristics, three
service delivery models, and four deployment models. The five key features encompass on-demand
self-service, broad network access, resource pooling (multi-tenancy), rapid elasticity, and measured services. The
three service delivery models are cloud software as a service, for example, Google Docs; cloud platform as
a service, for example, Google App Engine; and cloud infrastructure as a service, for example, Amazon
Elastic Compute Cloud, and Amazon Simple Storage Service. The four deployment models include private
cloud, community cloud, public cloud, and hybrid cloud.
Technically, cloud computing could be regarded as an ingenious combination of a series of
developed or developing ideas and technologies, establishing a novel business model by offering IT
services using economies of scale. In general, the basic ideas encompass service computing, grid
computing, distributed computing, and so on. The core technologies that cloud computing is
principally built on include web service technologies and standards, virtualization, novel distributed
programming models like MapReduce, and cryptography. All participants in cloud
computing can benefit from this new business model. The giant IT enterprises can not only run their
own core businesses, but also make profit by delivering the spare infrastructure services to others.
Small-size and medium-size businesses are capable of focusing on their own core businesses via
outsourcing the boring and complicated IT management to other cloud service providers, usually at
a fairly low cost. In particular, cloud computing facilitates start-ups considerably, enabling them to
build up their business with low upfront IT investments, as well as cheap ongoing costs. Moreover,
because of the flexibility of cloud computing, companies can adapt their business readily and swiftly
by enlarging or shrinking the business scale dynamically without concerns about losing anything.
3.2. MapReduce basics
MapReduce is a scalable and fault-tolerant data processing framework that enables processing huge
volumes of data in parallel with many low-end commodity computers [3]. MapReduce was first
introduced in 2004 by Google, although similar concepts appeared in functional languages as early
as the 1960s. It has been widely adopted and has received extensive attention from both academia and
industry because of its promising capability. In the context of cloud computing, the MapReduce
framework becomes more scalable and cost-effective because infrastructure resources can be
provisioned on demand. Simplicity, scalability, and fault tolerance are three main salient features of
the MapReduce framework. Therefore, it is convenient and beneficial for companies and organizations to
utilize MapReduce services, such as Amazon Elastic MapReduce, to process big data and obtain
core competitiveness.
Basically, a MapReduce task consists of two primitive functions, map and reduce, defined over a
data structure named key-value pair (key, value). Specifically, the map function can be
formalized as map: (k1, v1) → (k2, v2), that is, the map function takes a pair (k1, v1) as input and then
outputs another intermediate key-value pair (k2, v2). These intermediate pairs will be consumed by
the reduce function as input. Formally, the reduce function can be represented as reduce: (k2, list(v2)) → (k3, v3),
that is, the reduce function takes an intermediate key k2 and the list of all its corresponding values list(v2)
as input and outputs another pair (k3, v3). Usually, the list of (k3, v3) pairs is the result that MapReduce
users attempt to obtain. Both map and reduce functions are specified by data users in terms of their
specific applications.
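To make these signatures concrete, the following minimal Python sketch (independent of any particular MapReduce implementation; the comma-separated record layout and quasi-identifier positions are assumed for illustration) counts how many records share each quasi-identifier value, the kind of statistic the anonymization jobs discussed later repeatedly compute.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# map: (k1, v1) -> list of (k2, v2)
# Here k1 is a record offset, v1 a CSV record; k2 is the quasi-identifier, v2 a count of 1.
def map_fn(k1: int, v1: str) -> List[Tuple[str, int]]:
    fields = v1.split(",")
    quasi_identifier = ",".join(fields[:2])   # e.g. (Age, Education); positions are assumed
    return [(quasi_identifier, 1)]

# reduce: (k2, list(v2)) -> (k3, v3)
# Here the reducer sums the counts for each quasi-identifier group.
def reduce_fn(k2: str, values: Iterable[int]) -> Tuple[str, int]:
    return (k2, sum(values))

def run_job(records: List[str]) -> List[Tuple[str, int]]:
    """Simulate the shuffle phase that groups intermediate pairs by k2."""
    groups = defaultdict(list)
    for offset, record in enumerate(records):
        for k2, v2 in map_fn(offset, record):
            groups[k2].append(v2)
    return [reduce_fn(k2, vs) for k2, vs in groups.items()]

if __name__ == "__main__":
    data = ["34,Bachelors,Engineer", "36,Bachelors,Lawyer", "34,Bachelors,Doctor"]
    print(run_job(data))   # [('34,Bachelors', 2), ('36,Bachelors', 1)]
```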
To make such a simple programming model work effectively and efficiently, MapReduce
implementations provide a variety of fundamental mechanisms such as data replication and data
sorting. Besides, distributed file systems such as the Hadoop Distributed File System [34] are
crucial to making the MapReduce framework run in a highly scalable and fault-tolerant
fashion. Recently, the standard MapReduce framework has been extensively revised into many
variations in order to handle data in different scenarios. For instance, Incoop [35] is proposed for
incremental MapReduce computation, which detects changes to the input and automatically
updates the output by employing an efficient, fine-grained result reuse mechanism. Several novel
techniques, including a storage system, a contraction phase for reduce tasks, and an affinity-based
scheduling algorithm, are utilized to achieve efficiency without sacrificing transparency. As the
standard MapReduce framework lacks built-in support for iterative programming, which arises
naturally in many applications including data mining, web ranking, graph analysis, model fitting,
and so on, HaLoop [36] and Twister [37] are designed to support iterative MapReduce
computation. HaLoop is built on top of Hadoop and extends it with a new programming model
and several important optimizations that include: a loop-aware task scheduler, loop-invariant data
caching, and caching for efficient fixpoint verification. Twister is a distributed in-memory
MapReduce runtime optimized for iterative MapReduce computations.
4. SCALABLE AND COST-EFFECTIVE FRAMEWORK FOR PRIVACY PRESERVATION
In this section, we mainly present the design of SaC-FRAPP. Section 4.1 first describes the
lifecycle of data anonymization on cloud and specifies corresponding system requirements for the
lifecycle. An overview of the framework is presented in Section 4.2. Then, we elaborate the details
of each module of the framework from Section 4.3 to 4.6.
4.1. Lifecycle of data anonymization on cloud
As a large number of big data applications are deployed in cloud platforms, most of the data in such
applications are collected, stored, processed, and shared on cloud. A typical lifecycle of big data
anonymization on cloud is described in Figure 1. All participants are illustrated in the figure. We
describe the phases in Figure 1 as follows.
Phase I represents the data collection phase, where data owners submit their privacy-sensitive data to
services or applications provided by data holders. A data owner is the individual who is associated with
the data record she or he submitted. Usually, a data owner is an end user of some cloud-based online
services, for example, online healthcare services. Also, group data owners such as institutes place
data sets containing records from a group of individuals into cloud, for example, hospitals may put
patients' diagnosis information into online healthcare services. A data holder is usually the service or
application owner, for example, Microsoft is the data holder of its cloud-based healthcare
service, HealthVault. Data holders are responsible for the privacy protection of data owners. Data
owners can specify personalized privacy requirements to data holders. But mostly, data holders
develop privacy protection strategies according to governments' privacy policies and regulations.
To make full use of the collected data sets, data holders would like to release the data sets to other
parties for analysis or data mining. However, given the privacy concerns, the data holder anonymizes
the data sets. Unlike the traditional case that data holders anonymize data sets themselves, data
anonymization is conducted by dedicated data anonymization services, that is, anonymization as a
service. Phase II illustrates such a process. We assume that anonymization services are trusted.
Privacy preserving requirements will be specified as parameters to anonymization services. After
anonymization, the anonymous data sets are released to cloud or certain data recipients. In general,
there will be multiple anonymous data sets because of the different data usage or data recipients.
Hence, Phase III chiefly manages such anonymous data sets to ensure privacy preservation. Because
some big data applications are online services, for example, Facebook and HealthVault, the data sets
in these applications are dynamic and incremental. Consequently, Phase IV is responsible for
updating anonymous data sets when new data records are added. To anonymize the newly added
data, the already anonymized data sets should be taken into account, and the whole anonymized data sets are
then offered to users. Phases II to IV will be elaborated in the following sections.
In general, anonymous data sets are consumed by other parties. Phase V in Figure 1 shows such a
process. Various data users will access different anonymous data sets for diverse purposes. For
instance, research centers train data models such as decision trees or association rules on the
anonymous data sets to reveal certain trends such as disease infection. Meanwhile, malicious data
users deliberately attempt to infer privacy-sensitive information from the released data sets. Note
that privacy can also be breached incidentally when a legitimate data user processes a data set.
In big data and cloud environments, data anonymization encounters several challenges because of
the new features coming with cloud systems and big data applications. In order to comply with such
features specific to cloud computing and big data, we identify several system requirements that
should be satisfied when designing a framework for privacy preservation over big data on cloud as
follows. Note that scalability and cost-effectiveness are the two most important issues in such a
framework for big data privacy preservation.
• Flexible. The framework should provide a user interface through which the data owners can
specify various privacy requirements. Usually, data sets will be accessed and processed by
different data users for different applications.
Figure 1. Lifecycle of data anonymization on cloud (Phases I–V: data collection, data anonymization, anonymous data management, data update, and data consumption; participants: individual and group data owners, the data holder, query users, data mining users, and adversaries).
• Scalable. Scalability is necessary for current privacy-preserving approaches because the scale of
data sets is too large to be processed by existing serial algorithms. So the privacy-preserving
framework should also be scalable to handle data anonymization and management. Concretely,
the data-intensive or computation-intensive operations in the framework should be executed
efficiently in a parallel and scalable fashion.
• Dynamic. In cloud computing, most applications accumulate data sets over time, for example,
cloud healthcare services will receive a large amount of information from users. The scale of such
data sets becomes larger and larger, forming big data. Hence, the privacy-preserving framework
should handle dynamic data sets so that the privacy and utility of such data sets can still be ensured
when updates occur.
• Cost-effective. Given the pay-as-you-go feature of cloud computing, saving IT cost is one of the core
enablers. Thus, it is also desired for the privacy-preserving framework to save the expense of
privacy preservation as much as possible while the privacy preservation and data utility can still
be ensured.
4.2. Framework overview
In terms of the four system requirements formulated earlier in the text, we designed SaC-FRAPP over
big data on cloud. Accordingly, the framework consists of four main modules, namely, privacy
specification interface (PSI), data anonymization (DA), data update (DU), and anonymous data sets
management (ADM). Based on the four modules, the privacy-preserving framework can achieve the
four system requirements. The framework overview is depicted systematically in Figure 2, where the
submodules (dashed rectangles) for each module are illustrated and the relationships among the
aforementioned four modules, big data application layer, MapReduce or HDFS, and cloud
infrastructure are described as well. Note that although the framework consists of only the four
modules, we also present in Figure 2 its potential users, that is, end users, big data applications, or
tools like Mahout, as well as infrastructure and drivers, that is, MapReduce, HDFS, and cloud
infrastructure.
As shown in Figure 2, DA, DU, and ADM are the three main functional modules. They conduct
concrete operations on data sets in terms of the privacy models specified in the PSI module. The DA
and DU modules take advantage of MapReduce and HDFS to anonymize data sets or adjust
anonymized data sets when updates occur. The ADM module is responsible for managing
anonymized data sets in order to save expense by avoiding re-computation. Thus, ADM utilizes the
cloud infrastructure directly to accomplish its tasks. We will discuss the four proposed modules and
their submodules in detail in the following sections.
Figure 2. Framework overview of scalable and cost-effective framework for privacy preservation.
Unlike traditional MapReduce platforms, the MapReduce platform in our research is deployed on
top of the cloud infrastructure to gain high scalability and elasticity. To preserve the privacy of data
sets processed by data users using MapReduce, a privacy-preserving layer is introduced between original data sets
and user-specified MapReduce tasks. Basically, the layer, including the DA and DU modules, is the engine
of SaC-FRAPP as it conducts the most computation in the framework. Figure 3 depicts the privacy-
preserving layer based on MapReduce and HDFS.
As shown in Figure 3, original data sets are stored confidentially on cloud by data owners, and can
never be accessed directly by data users. Then data holders specify privacy requirements and submit
them to the privacy-preserving layer. The layer is responsible for anonymizing original data sets
according to privacy requirements. Certain anonymous data sets are retained to avoid re-
computation. Thus, the layer is also responsible for updating anonymous data when new data join.
Data users can then specify their application tasks in MapReduce jobs and run these jobs on the
anonymized data sets. The results are stored in HDFS or further stored in cloud storage. The
privacy-preserving layer itself exploits MapReduce jobs to conduct the computation required in data
anonymization. It is both feasible and necessary to use MapReduce to accomplish the computation
for anonymizing and managing these data sets because data sets are usually of huge volume and
complexity in the context of cloud computing and big data.
4.3. Privacy specification interface
As to privacy models and protection techniques, the privacy-preserving data publishing research
community has extensively investigated these issues and made fruitful progress with a variety of
privacy models, privacy-preserving methods, and algorithms [7]. Usually, original data sets can be
accessed and processed by different data users for different purposes, leading to various privacy risks
in these scenarios. Moreover, the privacy requirements of data owners possibly vary over time. As
such, a systematic and flexible privacy specification model is proposed to frame privacy
requirements. We have the following definition on privacy specification.
Definition 1 (Privacy Specification)
The privacy requirements specified by a data owner are defined as a Privacy Specification (PS). A PS is
formally represented by a vector of parameters, that is, PS = &lt;PMN, Thr, AT, Alg, Gra, Uti&gt;. The
parameters in the vector are elaborated subsequently.
The name of a privacy model is represented by PMN. Out of recently proposed privacy
models, we employ three widely adopted privacy models in the privacy-preserving
framework, namely, k-anonymity, l-diversity, and t-closeness. The three privacy principles provide
different degrees of privacy preservation. The privacy principle k-anonymity means that the number
of anonymized records that correspond to a quasi-identifier is required to be no less than a threshold.
Otherwise, if certain quasi-identifiers are so specific that only a small group of people is linked to
them, these individuals are linked to sensitive information with high confidence, resulting in privacy
breach. Here, quasi-identifiers represent the groups of anonymized data records. Based on k-anonymity,
l-diversity requires that the number of distinct sensitive values corresponding to a quasi-identifier is not less than a threshold,
and is therefore stricter than k-anonymity. The strictest is t-closeness, requiring the distribution of
sensitive values corresponding to a quasi-identifier to be close to that of the original data set.
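Operationally, the three principles can be checked over records grouped by their generalized quasi-identifier, as in the following sketch; the grouping and the total-variation distance used for t-closeness are illustrative choices rather than part of the principles themselves.

```python
from collections import Counter
from typing import Dict, List

def satisfies_k_anonymity(groups: Dict[str, List[str]], k: int) -> bool:
    # Every quasi-identifier group must contain at least k records.
    return all(len(sens_values) >= k for sens_values in groups.values())

def satisfies_l_diversity(groups: Dict[str, List[str]], l: int) -> bool:
    # Every group must contain at least l distinct sensitive values.
    return all(len(set(sens_values)) >= l for sens_values in groups.values())

def satisfies_t_closeness(groups: Dict[str, List[str]], t: float) -> bool:
    # The sensitive-value distribution of every group must stay within t of the
    # overall distribution (total-variation distance as an illustrative metric).
    all_values = [v for vs in groups.values() for v in vs]
    overall = Counter(all_values)
    n = len(all_values)
    for sens_values in groups.values():
        local = Counter(sens_values)
        m = len(sens_values)
        distance = 0.5 * sum(abs(local[v] / m - overall[v] / n) for v in overall)
        if distance > t:
            return False
    return True

# Example: two groups keyed by their generalized quasi-identifier.
groups = {"[30-40],Bachelors": ["Flu", "HIV", "Flu"],
          "[40-50],Masters":   ["Flu", "Cancer", "HIV"]}
print(satisfies_k_anonymity(groups, 3), satisfies_l_diversity(groups, 2))   # True True
```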
Figure 3. Privacy-preserving layer based on MapReduce and Hadoop Distributed File System.
Parameter Thr is the threshold of the specified privacy model, that is, k, l, or t in the three
previously mentioned privacy principles.
Parameter AT denotes application type. Data owners can specify the goals of anonymized data sets,
for example, classification, clustering, or general use. If the use of anonymized data sets is known in
advance, anonymization algorithms can produce anonymized data sets with higher data utility while
privacy is still preserved. Without knowing the types of applications that consume the
anonymized data sets, the DA module produces anonymized data sets for general use. Different
information metrics will be utilized for different application types [7].
The anonymization algorithms are indicated by the parameter Alg. A variety of algorithms
have been developed for different privacy principles and application types. Details will be
described in Section 4.4.
Parameter Gra represents the granularity of the privacy specification. It determines the scope of the
privacy preservation. Usually, only part of the original data sets is shared with data users, and different
data users are possibly interested in different parts of the data. Moreover, only part of the attributes of a data
record is considered in the process of anonymization. Thus, the granularity parameter is quite useful.
The data utility parameter Uti is an optional one. Data owners can specify how much data utility
they allow to be exposed to data users. In general, privacy and data utility are two roughly opposite
aspects of privacy preservation. When privacy thresholds are given, most anonymization algorithms
usually expose as much data utility as possible. Conversely, we can anonymize the data sets as
much as possible if the data utility is fixed.
In summary, the SaC-FRAPP framework systematically and comprehensively provides diverse
privacy specifications to achieve the flexibility and diversity of privacy preservation.
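For illustration, a PS vector could be represented in code along the following lines; the field names mirror Definition 1, whereas the concrete types, defaults, and example values are assumptions of this sketch rather than part of the prototype.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class PrivacySpecification:
    """PS = <PMN, Thr, AT, Alg, Gra, Uti> from Definition 1 (field values are illustrative)."""
    pmn: str                      # privacy model name: 'k-anonymity', 'l-diversity' or 't-closeness'
    thr: float                    # threshold: k, l or t, depending on pmn
    at: str = "general"           # application type: 'classification', 'clustering' or 'general'
    alg: str = "SG"               # generalization scheme: 'FG', 'SG', 'MG' or 'CG'
    gra: Sequence[str] = ()       # granularity: the quasi-identifier attributes in scope
    uti: Optional[float] = None   # optional bound on the data utility exposed to data users

# A data holder asking for 50-anonymity over two attributes, using sub-tree generalization:
ps = PrivacySpecification(pmn="k-anonymity", thr=50,
                          alg="SG", gra=("Age", "Education"))
```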
4.4. Data anonymization
Several categories of anonymization techniques have been proposed, including generalization, slicing,
disassociation, and anatomization. In SaC-FRAPP, we utilize generalization for the anonymization
because it is widely investigated and adopted in existing algorithms. Specifically, four generalization
schemes have been proposed, namely, full-domain generalization (FG), sub-tree generalization (SG),
multi-dimensional generalization (MG), and cell generalization (CG). Their corresponding
sub-modules are depicted in Figure 2. Roughly speaking, the data utility exposed by these four schemes
increases in the order FG &lt; SG &lt; MG &lt; CG when the same privacy requirement is given. But note
that the anonymized data sets produced by MG and CG suffer from the data exploration problem, that
is, the anonymized data sets contain inconsistent data. For instance, one original attribute value can
be generalized into two different higher-level values. The Alg parameter in a privacy specification
indicates which generalization scheme is used to anonymize data sets.
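To make the generalization operation itself concrete, the following sketch walks an attribute value up a small, hypothetical taxonomy tree; sub-tree generalization repeatedly applies such steps until every quasi-identifier group satisfies the privacy threshold.

```python
# A toy taxonomy for the Education attribute; PARENT maps each value to a more general one.
PARENT = {
    "9th": "Junior", "10th": "Junior",
    "11th": "Senior", "12th": "Senior",
    "Junior": "Secondary", "Senior": "Secondary",
    "Bachelors": "University", "Masters": "University",
    "Secondary": "ANY", "University": "ANY",
}

def generalize(value: str, levels: int = 1) -> str:
    """Replace a value by an ancestor `levels` steps up the taxonomy tree."""
    for _ in range(levels):
        value = PARENT.get(value, value)   # the root ("ANY") generalizes to itself
    return value

print(generalize("9th"))          # 'Junior'
print(generalize("Masters", 2))   # 'ANY'
```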
However, most existing anonymization algorithms are centralized or serial, meaning that these
algorithms fail to handle big data. Usually, big data are so large that they fail to fit in the memory of
one normal cloud computation node. Hence, they are usually stored across a number of nodes.
Therefore, it is a challenging problem for existing anonymization algorithms to anonymize large-scale
data sets on cloud. Hence, we revise existing algorithms into MapReduce versions, in order
to exploit MapReduce to efficiently anonymize data sets in a scalable and parallel fashion.
The data anonymization (DA) module consists of a series of MapReduce versions of anonymization
algorithms. Basically, each anonymization algorithm has a MapReduce driver program and several pairs
of map and reduce programs. Usually, these map and reduce programs, constituting a MapReduce
job, will be executed iteratively because anonymization is an iterative process. Although the
standard MapReduce implementation mainly supports one-pass data processing rather than iterative
processing, we design our MapReduce algorithms carefully to make full use of MapReduce. For
instance, we improve the degree of parallelization by partitioning data in advance and launching
multiple MapReduce jobs simultaneously.
4.5. Data update
The anonymized data sets are retained on cloud for different data users. So, these data sets are
persisted every time they are generated. However, the data sets in applications
on cloud are dynamic and increase dramatically over time, resulting in big data. Hence, we have to
update original data sets and anonymized data sets.
A straightforward way is to anonymize the whole updated data sets from scratch. From the
efficiency perspective, it is usually unacceptable to anonymize all data sets once an update occurs.
Furthermore, privacy preservation fails to be ensured according to the analysis in [13]. Therefore, a
better way is to anonymize only the updated part and adjust the already anonymized data sets. For
anonymous data sets, anonymization level is used to describe the degree of privacy preservation.
Usually, the anonymization level for the already anonymized data sets satisfies the given privacy
requirements, so new data can simply be anonymized to the current anonymization level when updates
occur. But the newly anonymized data sets possibly violate the privacy requirements because they
are likely to be too specific. In such a case, we have to adjust the anonymization level of the whole
anonymized data sets to ensure the privacy preservation for all data.
Another aspect of privacy preservation is to expose as much data utility as possible to data users
while privacy requirements are satisfied. For data anonymization, an interesting phenomenon is that,
for a given privacy requirement, the more data are anonymized, the lower the anonymization level will
be. A lower anonymization level means more data utility can be produced because the anonymized
values are more specific. Consequently, just anonymizing new data to the current anonymization
level is not sufficient, even though this definitely satisfies the privacy requirements. To expose more
data utility, we need to lower the anonymization level of the whole anonymized data sets.
In summary, three basic operations are provided in the DU module, namely, update, generalization,
and specialization. Generalization is utilized to raise the anonymization level, whereas specialization
is to lower the anonymization level.
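The interplay of the three operations can be sketched as follows; the single numeric anonymization level and the interval-based generalization of an age attribute are deliberate simplifications of the taxonomy-tree levels the DU module actually manipulates.

```python
from collections import Counter
from typing import List

def bucket(age: int, level: int) -> str:
    """Generalize an age into an interval; higher levels give wider (more general) intervals."""
    width = 5 * (2 ** level)
    lo = (age // width) * width
    return f"[{lo}-{lo + width})"

def satisfies_k(ages: List[int], level: int, k: int) -> bool:
    counts = Counter(bucket(a, level) for a in ages)
    return all(c >= k for c in counts.values())

def update(existing: List[int], new: List[int], level: int, k: int) -> int:
    """Toy DU loop: merge the update batch, then adjust the anonymization level."""
    ages = existing + new                               # update: new records join at the current level
    while not satisfies_k(ages, level, k) and level < 8:
        level += 1                                      # generalization: raise the level if privacy is violated
    while level > 0 and satisfies_k(ages, level - 1, k):
        level -= 1                                      # specialization: lower the level to expose more utility
    return level

print(update(existing=[23, 24, 26, 27], new=[31, 33, 34], level=1, k=3))   # stays at level 1
```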
4.6. Anonymous data management
As described in the last section, anonymous data sets are retained for data sharing, mining, and
analytics. Another consideration is to save IT expense. In the context of cloud computing, both
computation and storage resources will be charged in proportion to their usage in terms of the
pay-as-you-go feature. In this sense, it is beneficial to store certain part of intermediate data sets
rather than re-compute them repeatedly. Yuan et al. [38] have extensively investigated the trade-offs
between data storage and re-computation and proposed a series of strategies. In SaC-FRAPP,
anonymous data sets are stored to save cost and managed systematically. We exploit data
provenance [39] to manage the data generation relationships among these data sets.
Retaining a large number of independently anonymized data sets potentially suffers from privacy
breach problems. Because the PSI module provides flexible privacy specifications, a data set can be
possibly anonymized into different anonymous data sets. As a result, privacy-sensitive information
can be recovered from different anonymous data sets. To address this inference problem in multiple
data sets, we have proposed an approach that incorporates encryption to ensure privacy preservation
[26]. Basically, we can encrypt all the anonymous data sets and share them with specific users.
However, encrypting all data sets will incur expensive overhead because these anonymous data sets
are usually accessed or processed frequently by many data users, so we propose to encrypt part of
these data sets to save privacy-preserving cost while privacy preservation can still be ensured.
In the privacy-preserving framework, the privacy of all anonymous data sets is quantified according
to [26] or differential privacy [9]. Then, we carefully select part of anonymous data sets to encrypt
via our proposed approaches. In this way, SaC-FRAPP can achieve cost-effective and efficient
privacy preservation.
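As a rough illustration only, and not the algorithm of [26], the selection can be viewed as encrypting data sets until the leakage of the unencrypted remainder falls below the user-specified threshold; the additive leakage model and greedy choice below are toy assumptions.

```python
from typing import Dict, Set

def choose_datasets_to_encrypt(leakage: Dict[str, float], cost: Dict[str, float],
                               epsilon_d: float) -> Set[str]:
    """Toy greedy heuristic: encrypt data sets until the summed leakage of the
    unencrypted ones drops below the threshold epsilon_d.
    (Illustrative only; leakage of real data sets is not simply additive.)"""
    encrypted: Set[str] = set()
    remaining = dict(leakage)
    while remaining and sum(remaining.values()) > epsilon_d:
        # Encrypt the data set with the best leakage reduction per unit of encryption cost.
        target = max(remaining, key=lambda d: remaining[d] / cost[d])
        encrypted.add(target)
        del remaining[target]
    return encrypted

leakage = {"d1": 0.5, "d2": 0.2, "d3": 0.05}   # hypothetical per-data-set leakage scores
cost    = {"d1": 1.0, "d2": 4.0, "d3": 2.0}    # hypothetical encryption costs
print(choose_datasets_to_encrypt(leakage, cost, epsilon_d=0.3))   # {'d1'}
```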
5. PROTOTYPE SYSTEM DESIGN
We have developed a proof-of-concept prototype system for the privacy-preserving framework
SaC-FRAPP based on Hadoop, an open-source implementation of MapReduce, and the OpenStack
cloud platform. The basic system design of SaC-FRAPP is depicted as a class diagram in Figure 4.
In general, we implement the prototype system according to the formulation of the framework in the
last section. The four main modules are designed as classes or interfaces. The submodules are designed
as their child classes or aggregate components. With these organized components, the developed
prototype system achieves the four system requirements as described in Section 4.1. As a
framework, SaC-FRAPP is also extensible and more privacy models and MapReduce-based
anonymization algorithms can be included.
The Privacy Specification Interface class is responsible for front-end interaction with users, and
invokes the three functional back-end modules (DA, DU, and ADM). So far, we have developed
several MapReduce-based anonymization algorithms for the DA class. Such MapReduce programs can
anonymize large-scale data sets in an efficient and scalable fashion. For the SG scheme, we develop three
MapReduce algorithms, that is, TDS, BUG, and their combination. The major parts of top-down
specialization (TDS) and bottom-up generalization (BUG) are MapReduce jobs that dominate the
computation in the anonymization process. The MapReduce drivers, mappers, and reducers are
illustrated in Figure 4. For the multidimensional scheme, we develop the median and info-median
MapReduce algorithms for general purposes and classification workloads, respectively. Note that
finding the median of an attribute dominates the computation in multidimensional anonymization, so
the MapReduce jobs are designed to conduct such computation. Because the execution process of
multidimensional anonymization is recursive, we design the recursion control class to determine
whether multiple nodes are launched or only one is utilized to do the recursive computation. For
the local recoding anonymization scheme, we model it as the k-member clustering problem [40] and
adopt clustering techniques to address it. Similarity computation, which dominates the clustering
performance, is conducted by MapReduce jobs. The DU class leverages most functions provided by
the DA class to carry out data update. For the ADM class, several components are aggregated. Specifically,
sensitive intermediate data tree (SIT) and sensitive intermediate data graph (SIG) classes are
leveraged to manage anonymous data generation relationships; encryption and key management
classes are responsible for encrypting data sets that are selected to be hidden. The encryption decision class,
based on the privacy quantification and cost estimation classes, aims at determining which anonymous
data sets should be encrypted and which should not, in order to ensure cost-effective privacy
preservation globally. Besides the classes that correspond to the components in the framework, other
classes are also necessary for the system. The taxonomy forest class is an essential data structure for
most functional classes in Figure 4. Note that some components are not presented because of space
limits, but this does not affect the discussion of our design.
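Because median finding dominates the multidimensional (Mondrian-style) scheme, a simplified single-round sketch of the corresponding map and reduce functions is given below; a production job would typically approximate the median from a value histogram instead of collecting raw values at one reducer.

```python
from typing import Iterable, List, Tuple

# map: (record_id, record) -> (partition_id, attribute_value)
# Emits the value of the split attribute, keyed by the current data partition.
def median_map(record_id: int, record: List[str], partition_id: str,
               split_attr_index: int) -> List[Tuple[str, float]]:
    return [(partition_id, float(record[split_attr_index]))]

# reduce: (partition_id, list(values)) -> (partition_id, median)
# The median becomes the cut point that splits the partition into two sub-partitions.
def median_reduce(partition_id: str, values: Iterable[float]) -> Tuple[str, float]:
    ordered = sorted(values)
    return (partition_id, ordered[len(ordered) // 2])

ages = [["34"], ["29"], ["41"], ["37"], ["52"]]
pairs = [p for i, r in enumerate(ages) for p in median_map(i, r, "root", 0)]
print(median_reduce("root", [v for _, v in pairs]))   # ('root', 37.0)
```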
Figure 4. Basic System design (class diagram) of scalable and cost-effective framework for privacy
preservation.
6. EXPERIMENTAL EVALUATION
In this section, we describe the deployment environment of the prototype system and summarize
the experimental evaluation results of the main components of SaC-FRAPP. A series of experiments
are conducted on both real-world data sets and synthesized data sets. Specifically, we compare the
performance of the DA, DU, and ADM components with their corresponding existing approaches.
We describe the experiment results of the three components in the following subsections.
6.1. Deployment environment and experiment settings
We develop and deploy the privacy-preserving framework on the basis of our cloud environment U-Cloud.
U-Cloud is a cloud computing environment at the University of Technology, Sydney. The system
overview of the U-Cloud system is depicted in Figure 5. The computing facilities of this system are
located among several labs in the Faculty of Engineering and IT, University of Technology,
Sydney. On top of hardware and Linux operating system (Ubuntu), we install KVM virtualization
software, which virtualizes the infrastructure and provides unified computing and storage resources.
To create virtualized data centers, we install OpenStack open source cloud environment, which is
responsible for virtual machine management, resource scheduling, task distribution, and interaction
with users. Furthermore, Hadoop clusters are built on the basis of OpenStack to facilitate
MapReduce computing paradigm and big data processing. All the experiments are conducted in
such a cloud environment.
We use the Adult data set [41], a public data set commonly used as a de facto benchmark for testing
data anonymization for privacy preservation. We also generated enlarged data sets based on the
Adult data set. After pre-processing the Adult data set, the sanitized data set consists of 30 162
records. In this data set, each record has 14 attributes. We utilize eight attributes out of them in
our experiments. The basic information about attribute taxonomy trees is described in Table I, in
which the number of domain values and the number of tree levels are listed.
Figure 5. System overview of U-Cloud.
Table I. Description of the Adult data set.
Attribute (taxonomy tree levels, domain values): Age (5, 99); Education (4, 26); Marital status (3, 9); Occupation (4, 23); Relationship (3, 8); Race (3, 7); Sex (2, 3); Native country (4, 50).
The original Adult data set is blown up to generate a series of larger data sets, following the approach
adopted in [19]. Specifically, for each
original record r, α − 1 ‘variations’ of r are created, where α > 1 is the blowup scale. As a result, an
enlarged data set is α times as large as the original one.
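The enlargement procedure can be sketched as follows; how a 'variation' perturbs a record is a placeholder of our own, as the exact perturbation is not material here.

```python
import random
from typing import List

def blow_up(records: List[List[str]], alpha: int, seed: int = 0) -> List[List[str]]:
    """For each original record, add (alpha - 1) 'variations', so the result is alpha times larger.
    Here a variation simply perturbs the numerical Age field; the real generator may differ."""
    rng = random.Random(seed)
    enlarged = []
    for r in records:
        enlarged.append(r)
        for _ in range(alpha - 1):
            variation = list(r)
            variation[0] = str(int(r[0]) + rng.randint(-2, 2))   # Age assumed to be the first field
            enlarged.append(variation)
    return enlarged

adult_sample = [["39", "Bachelors", "Never-married"], ["50", "Masters", "Divorced"]]
print(len(blow_up(adult_sample, alpha=3)))   # 6 records: 3 times the original 2
```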
6.2. Summary of experiment results
In this section, we briefly summarize the experiment results of the three main functional modules, that is,
DA, DU, and ADM, from the perspective of scalability and cost-effectiveness. We have already
evaluated these modules in our previous work. Interested readers can refer to [6, 26, 42, 43].
6.2.1. Scalability of data anonymization. We compare our approach of data anonymization with
state-of-the-art centralized approaches proposed in [19, 44]. We run both approaches on data sets
varying from 50 MB to 2.5 GB. We check whether both approaches can scale over large-scale data
sets and measure execution time to evaluate the efficiency. The change of execution time with
respect to data set size is depicted in Figure 6. The execution time of our approach is denoted as
TMR, while that of the centralized approach is denoted as TCent.
From Figure 6 (a), it is seen that TCent surges when the data size increases, whereas TMR increases
slightly even though it has a higher start value. The centralized approach suffers from memory
insufficiency when the data size is larger than 500 MB, whereas Figure 6 (b) shows that our
approach can still scale over much larger data sets. Hence, the proposed privacy-preserving
framework can significantly improve the scalability and efficiency compared with existing state-
of-the-art anonymization approaches.
6.2.2. Scalability of data update. We compare our approach of anonymous data update with a
state-of-the-art approach proposed in [45]. We run both approaches with the number of records in a data
update batch ranging from 2000 to 20 000. We measure the update time to evaluate the efficiency.
The execution time of our approach is denoted as tI while that of the existing approach is denoted as
tE. Figure 7 illustrates how the difference between tI and tE changes with respect to the number of
data records in an update batch when K is fixed, where K is the user-specified k-anonymity parameter.
When K is fixed, it can be seen from Figure 7 that the difference between tI and tE becomes bigger
and bigger as the number of data records increases. This trend demonstrates that the
privacy-preserving framework can significantly improve the efficiency of privacy preservation on
large-volume incremental data sets over existing approaches.
6.2.3. Cost-effectiveness of anonymous data management. We compare our approach of retaining
anonymous data sets with the existing approach, which encrypts all data sets [4, 25]. We run both
approaches with the number of data sets varying from 100 to 1000. The privacy-
preserving monetary cost is measured to evaluate the cost-effectiveness. The costs of our approach
and the existing one are denoted as CHEU and CALL, respectively. The change of the cost with
respect to the number of data sets is illustrated in Figure 8 when ϵd is fixed. The parameter ϵd is a
user-specified privacy requirement threshold, meaning that the degree of privacy disclosure must be
under ϵd.
Figure 6. Change of execution time with respect to data size.
We can see from Figure 8 that the difference between CALL and CHEU becomes bigger and
bigger as the number of intermediate data sets increases, that is, more expense can be saved
when the number of data sets becomes larger. Thus, this trend means that the privacy-preserving
framework can reduce the privacy-preserving cost of retaining anonymous data sets significantly
over the existing approach in real-world big data applications.
In conclusion, the evaluation results of the experiments mentioned earlier demonstrate that the
proposed privacy-preserving framework SaC-FRAPP can anonymize big data and manage the
anonymous data sets in a highly scalable, efficient, and cost-effective fashion.
7. CONCLUSIONS AND FUTURE WORK
In this paper, we have proposed SaC-FRAPP over big data on cloud. We have analyzed the lifecycle
of data anonymization and formulated four basic system requirements for a privacy-preserving
framework in the context of cloud computing and big data. To achieve the four system
requirements, we have presented four modules for the privacy-preserving framework SaC-FRAPP,
namely, PSI, DA, DU, and ADM. We leverage cloud-based MapReduce and HDFS to conduct data
anonymization and manage anonymous data sets, respectively, before releasing data sets to other
parties. Also, a corresponding proof-of-concept prototype system has been implemented for the
privacy-preserving framework and deployed in our cloud environment. Empirical evaluations have
demonstrated that SaC-FRAPP can anonymize large-scale data sets and manage the anonymous data
sets in a highly flexible, scalable, efficient, and cost-effective fashion. The framework provides a
holistic conceptual foundation for privacy preservation over big data and enables users to
realize the full potential of the advantages of cloud platforms.
Privacy concerns of big data on cloud have attracted the attention of researchers in different research
communities. But ensuring privacy preservation of big data sets still needs extensive investigation.
Figure 8. Change of privacy-preserving cost with respect to the number of data sets.
Figure 7. Change of update time with respect to the number of update records.
Based on the contributions of the proposed framework, we plan to integrate the privacy-preserving
framework with other data processing frameworks that employ MapReduce as the computation
engine, for example, the Apache Mahout project, which is a data mining library built on top of
MapReduce. Further, we will investigate privacy-aware data distribution and scheduling in cloud
environments in the future.
REFERENCES
1. Borkar V, Carey MJ, Li C. Inside “Big Data Management”: Ogres, Onions, or Parfaits? Proceedings of the 15th
International Conference on Extending Database Technology (EDBT'12), 2012; 3–14.
2. Chaudhuri S. What Next?: A Half-Dozen Data Management Research Goals for Big Data and the Cloud.
Proceedings of the 31st Symposium on Principles of Database Systems (PODS'12), 2012; 1–4.
3. Dean J, Ghemawat S. Mapreduce: a flexible data processing tool. Communications of the ACM 2010; 53(1):72–77.
DOI: 10.1145/1629175.1629198.
4. Cao N, Wang C, Li M, Ren K, Lou W. Privacy-preserving Multi-Keyword Ranked Search Over Encrypted Cloud
Data. Proceedings of the 31st Annual IEEE International Conference on Computer Communications
(INFOCOM'11), 2011; 829–837.
5. Liu C, Zhang X, Yang C, Chen J. Ccbke—session key negotiation for fast and secure scheduling of scientific
applications in cloud computing. Future Generation Computer Systems 2013; 29(5):1300–1308. DOI: http://dx.
doi.org/10.1016/j.future.2012.07.001.
6. Zhang X, Yang LT, Liu C, Chen J. A scalable two-phase top-down specialization approach for data anonymization
using mapreduce on cloud. IEEE Transactions on Parallel and Distributed Systems 2013; in press. DOI: 10.1109/
TPDS.2013.48.
7. Fung BCM, Wang K, Chen R, Yu PS. Privacy-preserving data publishing: a survey of recent developments. ACM
Computer Survey 2010; 42(4):1–53. DOI: 10.1145/1749603.1749605.
8. Puttaswamy KPN, Kruegel C, Zhao BY. Silverline: Toward Data Confidentiality in Storage-Intensive Cloud
Applications. Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11), 2011; Article 10.
9. Dwork C. Differential Privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and
Programming (ICALP'06), 2006; 1–12.
10. Gentry C. Fully Homomorphic Encryption Using Ideal Lattices. Proceedings of the 41st Annual ACM Symposium on
Theory of Computing (STOC'09), 2009; 169–178.
11. Nergiz ME, Clifton C, Nergiz AE. Multirelational k-anonymity. IEEE Transactions on Knowledge and Data
Engineering 2009; 21(8):1104–1117.
12. Iwuchukwu T, Naughton JF. K-anonymization as Spatial Indexing: Toward Scalable and Incremental
Anonymization. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07), 2007;
746–757.
13. Pei J, Xu J, Wang Z, Wang W, Wang K. Maintaining k-anonymity Against Incremental Updates. Proceedings of the
19th International Conference on Scientific and Statistical Database Management (SSDBM '07), 2007; Article 5.
14. Sweeney L. K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems 2002; 10(5):557–570. DOI: 10.1142/s0218488502001648.
15. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. ACM
Transactions on Knowledge Discovery from Data 2007; 1(1):Article 3. DOI: 10.1145/1217299.1217302.
16. Li N, Li T, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Transactions on
Knowledge and Data Engineering 2010; 22(7):943–956. DOI: 10.1109/TKDE.2009.139.
17. LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional k-anonymity. Proceedings of 22nd
International Conference on Data Engineering (ICDE '06), 2006; 25–25.
18. Xu J, Wang W, Pei J, Wang X, Shi B, Fu AWC. Utility-based Anonymization Using Local Recoding. Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006; 785–790.
19. Mohammed N, Fung B, Hung PCK, Lee CK. Centralized and distributed anonymization for high-dimensional
healthcare data. ACM Transactions on Knowledge Discovery from Data 2010; 4(4):Article 18.
20. Xiao X, Tao Y. Anatomy: Simple and Effective Privacy Preservation. Proceedings of 32nd International Conference
on Very Large Data Bases (VLDB'06), 2006; 139–150.
21. Li T, Li N, Zhang J, Molloy I. Slicing: a new approach for privacy preserving data publishing. IEEE Transactions on
Knowledge and Data Engineering 2012; 24(3):561–574.
22. Terrovitis M, Liagouris J, Mamoulis N, Skiadopoulos S. Privacy preservation by disassociation. Proceedings of the
VLDB Endowment 2012; 5(10):944–955.
23. LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: Efficient Full-Domain k-anonymity. Proceedings of 2005 ACM
SIGMOD International Conference on Management of Data (SIGMOD '05), 2005; 49–60.
24. Hu H, Xu J, Ren C, Choi B. Processing Private Queries Over Untrusted Data Cloud Through Privacy
Homomorphism. Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE'11),
2011; 601–612.
25. Li M, Yu S, Cao N, Lou W. Authorized Private Keyword Search Over Encrypted Data in Cloud Computing.
Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS'11), 2011; 383–392.
26. Zhang X, Liu C, Nepal S, Pandey S, Chen J. A privacy leakage upper bound constraint-based approach for
cost-effective privacy preserving of intermediate data sets in cloud. IEEE Transactions on Parallel and Distributed
Systems 2013; 24(6):1192–1202. DOI: 10.1109/tpds.2012.238.
27. Neuman BC, Ts'o T. Kerberos: an authentication service for computer networks. IEEE Communications Magazine
1994; 32(9):33–38.
28. Roy I, Setty STV, Kilzer A, Shmatikov V, Witchel E. Airavat: Security and Privacy for MapReduce. Proceedings of
7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10), 2010; 297–312.
29. Blass E-O, Pietro RD, Molva R, Önen M. PRISM—Privacy-Preserving Search in MapReduce. Proceedings of the
12th International Conference on Privacy Enhancing Technologies (PETS'12), 2012; 180–200.
30. Ko SY, Jeon K, Morales R. The HybrEx Model for Confidentiality and Privacy in Cloud Computing. Proceedings of
the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'11), 2011; Article 8.
31. Zhang K, Zhou X, Chen Y, Wang X, Ruan Y. Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds.
Proceedings of 18th ACM Conference on Computer and Communications Security (CCS'11), 2011; 515–526.
32. Wei W, Du J, Yu T, Gu X. SecureMR: A Service Integrity Assurance Framework for MapReduce.
Proceedings of Annual Computer Security Applications Conference (ACSAC '09), 2009; 73–82.
33. Mell P, Grance T. The NIST Definition of Cloud Computing (Version 15). U.S. National Institute of Standards and
Technology, Information Technology Laboratory, 2009.
34. Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop Distributed File System. Proceedings of 2010 IEEE 26th
Symposium on Mass Storage Systems and Technologies (MSST'10), 2010; 1–10.
35. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R. Incoop: MapReduce for Incremental Computations.
Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11), 2011; 1–14.
36. Bu Y, Howe B, Balazinska M, Ernst MD. The HaLoop approach to large-scale iterative data analysis. The VLDB
Journal 2012; 21(2):169–190. DOI: 10.1007/s00778-012-0269-7.
37. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G. Twister: A Runtime for Iterative MapReduce.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC'10),
2010; 810–818.
38. Yuan D, Yang Y, Liu X, Chen J. On-demand minimum cost benchmarking for intermediate dataset storage in
scientific cloud workflow systems. Journal of Parallel and Distributed Computing 2011; 71(2):316–332. DOI:
10.1016/j.jpdc.2010.09.003.
39. Davidson SB, Khanna S, Milo T, Panigrahi D, Roy S. Provenance Views for Module Privacy. Proceedings of
the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'11), 2011;
175–186.
40. Byun J-W, Kamra A, Bertino E, Li N. Efficient k-anonymization Using Clustering Techniques. Proceedings of the
12th International Conference on Database Systems for Advanced Applications (DASFAA'07), 2007; 188–200.
41. UCI Machine Learning Repository. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/ [Accessed on April 01,
2013].
42. Zhang X, Liu C, Nepal S, Chen J. An efficient quasi-identifier index based approach for privacy preservation over
incremental data sets on cloud. Journal of Computer and System Sciences 2013; 79(5):542–555. DOI: 10.1016/j.jcss.2012.11.008.
43. Zhang X, Liu C, Nepal S, Yang C, Dou W, Chen J. Combining Top-down and Bottom-up: Scalable Sub-tree
Anonymization Over Big Data Using MapReduce on Cloud. Proceedings of 12th IEEE International Conference
on Trust, Security and Privacy in Computing and Communications (IEEE TrustCom-13), 2013; Accepted.
44. Fung BCM, Wang K, Yu PS. Anonymizing classification data for privacy preservation. IEEE Transactions on
Knowledge and Data Engineering 2007; 19(5):711–725.
45. Doka K, Tsoumakos D, Koziris N. KANIS: Preserving k-anonymity Over Distributed Data. Proceedings of the
5th International Workshop on Personalized Access, Profile Management, and Context Awareness in Databases
(PersDB'11), 2011.
  • 1. SPECIAL ISSUE PAPER SaC-FRAPP: a scalable and cost-effective framework for privacy preservation over big data on cloud Xuyun Zhang1, *,† , Chang Liu1 , Surya Nepal2 , Chi Yang1 , Wanchun Dou3 and Jinjun Chen1 1 Faculty of Engineering and IT, University of Technology, Sydney, Sydney, Australia 2 Information and Communication Technologies Centre, Commonwealth Scientific and Industrial Research Organisation, Sydney, Australia 3 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China SUMMARY Big data and cloud computing are two disruptive trends nowadays, provisioning numerous opportunities to the current information technology industry and research communities while posing significant challenges on them as well. Cloud computing provides powerful and economical infrastructural resources for cloud users to handle ever increasing data sets in big data applications. However, processing or sharing privacy-sensitive data sets on cloud probably engenders severe privacy concerns because of multi-tenancy. Data encryption and anonymization are two widely-adopted ways to combat privacy breach. However, encryption is not suitable for data that are processed and shared frequently, and anonymizing big data and manage numerous anonymized data sets are still challenges for traditional anonymization approaches. As such, we propose a scalable and cost-effective framework for privacy preservation over big data on cloud in this paper. The key idea of the framework is that it leverages cloud-based MapReduce to conduct data anonymization and manage anonymous data sets, before releasing data to others. The framework provides a holistic conceptual foundation for privacy preservation over big data. Further, a corresponding proof-of-concept prototype system is implemented. Empirical evaluations demonstrate that scalable and cost-effective framework for privacy preservation can anonymize large-scale data sets and mange anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. Copyright © 2013 John Wiley & Sons, Ltd. Received 28 May 2013; Accepted 4 June 2013 KEY WORDS: big data; cloud; privacy preservation; framework; anonymization 1. INTRODUCTION Big data and cloud computing, two disruptive trends at present, offers a large number of business opportunities, and likewise pose considerable challenges on the current information technology (IT) industry and research communities [1, 2]. Data sets in most big data applications such as social networks and sensor networks have become increasingly large and complex so that it is a considerable challenge for traditional data processing tools to handle the data processing pipeline (including collection, storage, processing, mining, sharing, etc.). Generally, such data sets are often from various sources and of different types (Variety) such as unstructured social media content and half-structured medical records and business transactions, and are of large sizes (Volume) with fast *Correspondence to: Xuyun Zhang, Faculty of Engineering and IT, University of Technology, Sydney, Sydney, Australia. † E-mail: xyzhanggz@gmail.com Copyright © 2013 John Wiley & Sons, Ltd. CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2013 Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3083
  • 2. data in or out (Velocity). Cloud systems provides massive computation power and storage capacity that enable users to deploy applications without infrastructure investment. Because of its salient features, cloud is promising for users to handle the big data processing pipeline with its elastic and economical infrastructural resources. For instance, MapReduce [3], extensively studied and widely adopted large-scale data processing paradigm, is incorporated with cloud infrastructure to provide more flexible, scalable, and cost-effective computation for big data processing. A typical example is the Amazon Elastic MapReduce service. Users can invoke Amazon Elastic MapReduce to conduct their MapReduce computations based on the powerful infrastructure offered by Amazon Web Services (e.g., Elastic Compute Cloud and Simple Storage Service) and are charged in proportion to the usage of the services. In this way, it is economical and convenient for companies and organizations to collect, store, analyze, and share big data to gain competitive advantages. However, because of the feature that cloud systems are multi-tenant, processing or sharing privacy- sensitive data sets on them, for example, ‘other parties’ mining healthcare data sets, probably engenders severe privacy concerns. Still, privacy concerns in MapReduce platforms are aggravated because the privacy-sensitive information scattered among various data sets can be recovered with more ease when data and computational power are considerably abundant. Although some privacy issues are not new, their importance is amplified by cloud computing and big data [2]. With the wide adoption of online cloud services and proliferation of mobile devices, the privacy concern about processing on and sharing of sensitive personal information is increasing. For instance, HealthVault, an online health service provided by Microsoft (Redmond, Washington, USA), is deployed on the Windows Azure cloud platform. Although these data in such cloud services are usually deemed extremely privacy-sensitive, they usually offer significant human benefits if analyzed and mined by organizations similar to disease research centers. Data encryption [4, 5] and anonymization [6, 7], having been extensively studied recently, are promising and widely-adopted ways to combat privacy breach and violation on cloud. Mechanisms such as encryption [4], access control [8], and differential privacy [9] are exploited to protect the data privacy. These mechanisms are well-know pillars of privacy protection and still have open questions in the context of cloud computing and big data [2]. Usually, the data sets uploaded into cloud are not only for simply storage, but also for online cloud applications, that is, the data sets are dynamical. If we encrypt these data sets, processing on data sets efficiently will be quite a challenging task, because most existing applications only run on unencrypted data sets. Although recent progress has been made in homomorphic encryption, which theoretically allows performing computation on encrypted data sets, applying current algorithms are rather expensive because of their inefficiency [10]. Still, data holders and data users in cloud are different parties in most applications, for example, cloud health service providers and pharmaceutical companies. In such cases, encryption or access control mechanisms alone fail to ensure privacy preservation and data utility exposure. 
Data anonymization is a promising category of approaches to achieve such a goal [7]. However, most existing anonymization algorithms lack of scalability over big data. Thence, how to leverage the state-of-the-art cloud-based techniques to address the scalability issues of current anonymization approaches deserve considerable attention. Still, anonymous data sets scattered on cloud probably compromise data privacy if they are not properly managed. These data sets can be highly related because they may be parts of a data set [11], different levels of anonymous version of a data set [12], incremental versions of a data set [13], and so on. Adversaries are able to collect such a group of data sets from cloud and infer certain privacy from them even if each data set individually satisfies a privacy requirement. As such, it is still a challenge to manage multiple anonymous data sets on cloud. In this paper, we propose a scalable and cost-effective framework for privacy preservation (SaC- FRAPP) over big data on cloud via integrating our previous work together in terms of the lifecycle of data anonymization. The key idea of the framework is that it leverages cloud-based MapReduce and HDFS (Hadoop Distributed File System) to conduct data anonymization and manage anonymous data sets, respectively, before releasing data sets to other parties. The framework is built on the top of MapReduce and functions as a filter to preserve the privacy of data sets before these data sets are accessed and processed by MapReduce. Specifically, the framework provides interfaces to data holders to specify various privacy requirements based on different privacy models. Once X. ZHANG ET AL. Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
  • 3. privacy requirements are specified, the framework launches anonymization algorithms of MapReduce version to efficiently anonymize data sets for subsequent MapReduce tasks. Anonymous data sets are retained and reused to avoid re-computation cost. Thus, the framework handles the dynamical update of data sets as well to maintain the privacy requirements of such data sets. Besides anonymization, SaC-FRAPP also integrates encryption techniques to cost-effectively ensure the privacy of multiples data sets that are independently anonymized in terms of different privacy requirements. Finally, a corresponding proof-of-concept prototype system based on our cloud environment is implemented for the framework. The framework provides a holistic conceptual foundation for privacy preservation over big data and enables users to accomplish full potential of the high scalability, elasticity, and cost-effectiveness of the cloud. We conduct extensive experiments on real-world data sets to evaluate the proposed framework. Empirical evaluations demonstrate that SaC-FRAPP can anonymize big data and mange the anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. The contributions of this paper are three folds. Firstly, we propose a SaC-FRAPP over big data on cloud via integrating our previous work together in terms of the lifecycle of data anonymization. Secondly, a corresponding proof-of-concept prototype system is presented. Thirdly, extensive experimental evaluations on modules of the framework are provided to demonstrate SaC-FRAPP can anonymize big data and mange the anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. The remainder of this paper is organized as follows. The next section reviews the related work about privacy protection in cloud computing, big data, and MapReduce. In Section 3, we briefly present the preliminary background knowledge about cloud systems and MapReduce. Section 4 describes the lifecycle of data anonymization on cloud and formulates the details of the proposed framework. The proof-of-concept prototype system is designed and implemented in Section 5. We present the empirical evaluation results of each module of the prototype system in Section 6. Finally, we conclude this paper and discuss our future work in Section 7. 2. RELATED WORK We briefly review recent research on data privacy preservation and privacy protection in MapReduce and cloud computing environments. The privacy-preserving data publishing research community has investigated extensively on privacy-preserving issues and made fruitful progress with a variety of privacy models and preserving methods [7]. Privacy principles such as k-anonymity [14], l-diversity [15], and t-closeness [16] are put forth to model and quantify privacy. Privacy principles for multiple datasets are also proposed, but they aim at specific scenarios such as continuous data publishing or sequential data releasing [7]. Several anonymizing operations are leveraged to anonymize data sets, including generalization [17–19], anatomization [20], slicing [21], disassociation [22], and so on. Roughly, there are four generalization schemes [7], namely, FG [23], SG [19], MG [17], and CG [18]. Encryption is widely adopted as a straightforward approach to ensure data privacy on cloud against malicious users. 
Special operations such as query or search on encrypted data sets stored on cloud has been extensively studied [4,24, 25], although performing general operations on encrypted data sets are still quite challenging [10]. Instead of encrypting all data sets, Puttaswamy et al. [8] described a set of tools called Silverline that can separate all functionally encryptable data from other cloud application data, where the former is encrypted for privacy preservation while the later is left unencrypted for application functions. However, the sensitivity of data is required be labeled in advance. Zhang et al. [26] proposed a privacy leakage upper bound constraint-based approach to preserve the privacy of multiple data sets by only encrypting a part of data sets on cloud. As SaC-FRAPP is proposed based on MapReduce, we reviewed MapReduce relevant research work in the succeeding texts. The Kerberos authentication mechanism [27] is integrated into the MapReduce framework of Hadoop after the 1.0.0 version. Basically, access control fails to preserve privacy because data users can infer the privacy-sensitive information if they access to the unencrypted data. Roy et al. [28] investigated the data privacy problem caused by the MapReduce framework and SAC-FRAPP: A SCALABLE AND COST-EFFECTIVE FRAMEWORK FOR PRIVACY PRESERVATION Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
  • 4. presented a system named Airavat, which incorporates mandatory access control with differential privacy. The mandatory access control will be triggered when the privacy leakage exceeds a threshold, so that both privacy preservation and high data utility are ensured. However, the results produced in this system are mixed with certain noise, which is unsuitable to many applications that need data sets without noise, for example, medical experiment data mining and analysis. Based on MapReduce, several systems have been proposed to handle privacy concerns of computation and storage on cloud. Blass et al. [29] proposed a privacy-preserving scheme named PRISM for MapReduce on cloud to perform parallel word search on over encrypted data sets. Nevertheless, many cloud applications require MapReduce to conduct tasks similar to data mining and analytics over these data sets besides search. Ko et al. [30] proposed the HybrEx MapReduce model to provide a way that sensitive and private data are processed within a private cloud, whereas others can be safely extended to public cloud. Similarly, Zhang et al. [31] proposed a system named Sedic, which partitions MapReduce computing jobs in terms of the security labels of data they work on and then assigns the computation without sensitive data to a public cloud. However, sensitivity of data is also required be acquired in advance in above two systems. Wei et al. [32] proposed a service integrity assurance framework named SecureMR for the MapReduce framework. But we mainly focus herein on privacy-preserving issues in the proposed layer. Our privacy-preserving framework attempts to produce and retain anonymous data sets according to data holders' privacy requirements for subsequent MapReduce tasks. 3. CLOUD SYSTEMS AND MAPREDUCE PRELIMINARY 3.1. Cloud systems Cloud computing is one of the most hyped IT innovations at present, having sparked plenty of interest in both the IT industry and academia. Recently, IT giants such as Amazon (Seattle, Washington, US), Google (Mountain View, California, US), IBM (Armonk, New York, US), and Microsoft have invested huge sums of money in building up their public cloud products, and indeed they have developed their own products, for example, Amazon's Web Services, Google's App Engine, and Compute, and Microsoft's Azure. Meanwhile, several corresponding open source cloud computing solutions are also developed, like Hadoop, Eucalyptus, OpenNebula, and OpenStack. The cloud computing definition published by the US National Institute of Standards and Technology comprehensively covers the commonly agreed aspects of cloud computing [33]. In terms of the definition, the cloud model consists of five essential characteristics, three service delivery models, and four deployment models. The five key features encompass on demand self- service, broad network access, resource pooling (multi-tenancy), rapid elasticity, and measured services. The three service delivery models are cloud software as a service, for example, Google Docs; cloud platform as a service, for example, Google App Engine; and cloud infrastructure as a service, for example, Amazon Elastic Compute Cloud, and Amazon Simple Storage Service. The four deployment models include private cloud, community cloud, public cloud, and hybrid cloud. Technically, cloud computing could be regarded as an ingenious combination of a series of developed or developing ideas and technologies, establishing a novel business model by offering IT services using economies of scale. 
In general, the basic ideas encompass service computing, grid computing, distributed computing, and so on. The core technologies that cloud computing principally built on include web service technologies and standards, virtualization, novel distributed programming models like MapReduce, and cryptography. All the participants in the cloud computing can benefit from this new business model. The giant IT enterprises can not only run their own core businesses, but also make profit by delivering the spare infrastructure services to others. Small-size and medium-size businesses are capable of focusing on their own core businesses via outsourcing the boring and complicated IT management to other cloud service providers, usually at a fairly low cost. Especially, cloud computing facilitates start-ups considerably, enabling them to build up their business with low upfront IT investments, as well as cheap ongoing costs. Moreover, because of the flexibility of cloud computing, companies can adapt their business readily and swiftly by enlarging or shrinking the business scale dynamically without concerns about losing anything. X. ZHANG ET AL. Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
  • 5. 3.2. MapReduce basics MapReduce is a scalable and fault-tolerant data processing framework that enables to process huge volume of data in parallel with many low-end commodity computers [3]. MapReduce was first introduced in the year 2004 by Google with similar concepts in functional languages dated as early as 60s. It has been widely adopted and received extensive attention from both the academia and the industry because of its promising capability. In the context of cloud computing, the MapReduce framework becomes more scalable and cost-effective because infrastructure resources can be provisioned on demand. Simplicity, scalability, and fault tolerance are three main salient features of MapReduce framework. Therefore, it is convenient and beneficial for companies and organizations utilize MapReduce services, such as Amazon Elastic MapReduce, to process big data and obtain core competiveness. Basically, a MapReduce task consists of two primitive functions, map and reduce, defined over a data structure named as key-value pair (key, value). Specifically, the map function can be formalized as map: (k1, v1) → (k2, v2), that is, map function takes a pair (k1, v1) as input and then output another intermediate key-value pair (k2, v2). These intermediate pairs will be consumed by reduce function as input. Formally, the reduce function can be represented as reduce: (k2, list (v2)) → (k3, v3), that is, reduce function takes intermediate k2 and all its corresponding values list (v2) as input and output another pair (k3, v3). Usually, (k3, v3) list is the results which MapReduce users attempt to get. Both map and reduce functions are specified by data users in terms of their specific applications. To make such a simple programming model work effectively and efficiently, MapReduce implementations provide a variety of fundamental mechanisms such as data replication and data sorting. Besides, distributed file systems similar to Hadoop distributed file system [34] are substantially crucial to make the MapReduce framework run in a highly scalable and fault- tolerant fashion. Recently, the standard MapReduce has been extensively revised into many variations in order to handle data in different scenarios. For instance, Incoop [35] is proposed for incremental MapReduce computation, which detects changes to the input and automatically updates the output by employing an efficient, fine-grained result reuse mechanism. Several novel techniques: a storage system, a contraction phase for reduce tasks, and an affinity-based scheduling algorithm are utilized to achieve efficiency without sacrificing transparency. As standard MapReduce framework lack built-in supports for iterative programming, which arise naturally in many applications including data mining, web ranking, graph analysis, model fitting, and so on, HaLoop [36] and Twister [37] are designed to support iterative MapReduce computation. HaLoop is built on top of Hadoop and extends it with a new programming model and several important optimizations that include: a loop-aware task scheduler, loop-invariant data caching, and caching for efficient fixpoint verification. Twister is a distributed in-memory MapReduce runtime optimized for iterative MapReduce computations. 4. SCALABLE AND COST-EFFECTIVE FRAMEWORK FOR PRIVACY PRESERVATION In this section, we mainly present the design of the SaC-FRAPP. Section 4.1 first describes the lifecycle of data anonymization on cloud and specifies corresponding system requirements for the lifecycle. 
An overview of the framework is presented in Section 4.2. Then, we elaborate the details of each module of the framework from Section 4.3 to 4.6. 4.1. Lifecycle of data anonymization on cloud As a large number of big data applications are deployed in cloud platforms, most of the data in such applications are collected, stored, processed, and shared on cloud. A typical lifecycle of big data anonymization on cloud is described in Figure 1. All participants are illustrated in the figure. We describe the phases in Figure 1 as follows. Phase I represents data collection phase, where data owners submit their privacy-sensitive data into services or applications provided by data holders. A data owner is the individual who is associated with SAC-FRAPP: A SCALABLE AND COST-EFFECTIVE FRAMEWORK FOR PRIVACY PRESERVATION Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
  • 6. the data record she or he submitted. Usually, a data owner is an end user of some cloud-based online services, for example, online healthcare services. Also, group data owners like some institutes place data sets containing records from a group of individuals into cloud, for example, hospitals may put patients' diagnosis information into online healthcare services. A data holder is usually the service or application owner, for example, Microsoft is the data owner because of its cloud-based healthcare service, HealthVault. Data holders are responsible for the privacy protection of data owners. Data owners can specify personalized privacy requirements to data holders. But mostly, data holders develop privacy protection strategies according to governments' privacy policies and regulations. To make full use of the collected data sets, data holders would like to release the data sets to other parties for analysis or data mining. However, given the privacy concerns, the data holder anonymizes the data sets. Unlike the traditional case that data holders anonymize data sets themselves, data anonymization is conducted by dedicated data anonymization services, that is, anonymization as a service. Phase II illustrates such a process. We assume that anonymization services are trusted. Privacy preserving requirements will be specified as parameters to anonymization services. After anonymization, the anonymous data sets are released to cloud or certain data recipients. In general, there will be multiple anonymous data sets because of the different data usage or data recipients. Hence, Phase III chiefly manages such anonymous data sets to ensure privacy preservation. Because some big data applications are online services, for example, Facebook and HealthVault, the data sets these applications that are dynamic and incremental. Consequently, Phase IV is responsible for updating anonymous data sets when new data records are added. To anonymize the newly added data, the anonymized data sets should be taken into account and the whole anonymized data sets are offered to users. Phases II to IV will be elaborated in the following sections. In general, anonymous data sets are consumed by other parties. Phase V in Figure 1 shows such a process. Various data users will access different anonymous data sets for diverse purposes. For instance, research centers training data models similar to decision trees or association rules from the anonymous data sets to reveal certain trends similar to disease infection. Meanwhile, malicious data users deliberately attempt to infer privacy-sensitive information from the released data sets. Note that the privacy can also be breached incidentally when a legal data user process on a data set. In big data and cloud environments, data anonymization encounters several challenges because of the new features coming with cloud systems and big data applications. In order to comply with such features specific to cloud computing and big data, we identify several system requirements that should be satisfied when designing a framework for privacy preservation over big data on cloud as follows. Note that scalability and cost-effectiveness are the most important two issues in such a framework for big data privacy preservation. • Flexible. The framework should provide a user interface through which the data owners can specify various privacy requirements. Usually, data sets will be accessed and processed by different data users for different application. 
I Data Collection II Data Anonymization III Anonymous Data Management IV Data Update V Data Consumption Adversaries Query Users Data Mining Users Individual Owners Data Holder Group Owners Figure 1. Lifecycle of data anonymization on cloud. X. ZHANG ET AL. Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
  • 7. • Scalable. Scalability is necessary for current privacy-preserving approaches because the scale of data sets is too large to be processed by existing serial algorithms. So the privacy-preserving framework should be also scalable to handle data anonymization and managements. Concretely, the data-intensive or computation-intensive operations in the framework should be executed efficiently in a parallel and scalable fashion. • Dynamical. In cloud computing, most applications accumulate data sets over time, for example, cloud healthcare services will receive a large number of information from users. The scale of such data sets becomes larger and larger, forming big data. Hence, the privacy-preserving framework should handle dynamical data sets, the privacy, and utility of such data sets can still be ensured when updates occur. • Cost-effective. In the pay-as-you-go feature of cloud computing, saving IT cost is one of the core enablers. Thus, it is also desired for the privacy-preserving framework to save the expense of privacy preservation as much as possible while the privacy preservation and data utility can still be ensured. 4.2. Framework overview In terms of the four system requirements formulated earlier in the text, we designed a SaC-FRAPP over big data on cloud. Accordingly, the framework consists of four main modules, namely, privacy specification interface (PSI), data anonymization (DA), data update (DU), and anonymous data sets management (ADM). Based on the four modules, the privacy-preserving framework can achieve the four system requirements. The framework overview is depicted systematically in Figure 2, where the submodules (dashed rectangles) for each module are illustrated and the relationships among the aforementioned four modules, big data application layer, MapReduce or HDFS, and cloud infrastructure are described as well. Note that although the framework only consists the four modules, we also present in Figure 2 its potential users, that are, end users, big data applications, or tools like Mahout, as well as infrastructure and drivers, that are, MapReduce, HDFS, and cloud infrastructure. As shown in Figure 2, DA, DU, and ADM are the three main functional modules. They conduct concrete operations on data sets in terms of the privacy models specified in the PSI module. The DA and DU modules take advantage of MapReduce and HDFS to anonymize data sets or adjust anonymized data sets when updates occur. The ADM module is responsible for managing anonymized data sets in order to save expense via avoiding re-computation. Thus, ADM utilize cloud infrastructure directly to accomplish the tasks. We will discuss the four proposed modules and their submodules in detail in following sections. Unlike traditional MapReduce platforms, the MapReduce platform in our research is deployed on top of the cloud infrastructure to gain high scalability and elasticity. To preserve the privacy of data sets processed by data users using MapReduce, a privacy-preserving layer between original data sets Figure 2. Framework overview of scalable and cost-effective framework for privacy preservation. SAC-FRAPP: A SCALABLE AND COST-EFFECTIVE FRAMEWORK FOR PRIVACY PRESERVATION Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
  • 8. and user-specified MapReduce tasks. Basically, the layer including DA and DU modules, is the engine of SaC-FRAPP as it conducts the most computation in the framework. Figure 3 depicts the privacy- preserving layer based on MapReduce and HDFS. As shown in Figure 3, original data sets are stored confidentially on cloud by data owners, and can never be accessed directly by data users. Then data holders specify privacy requirements and submit them to the privacy-preserving layer. The layer is responsible for anonymizing original data sets according to privacy requirements. Certain anonymous data sets are retained to avoid re- computation. Thus, the layer is also responsible for updating anonymous data when new data join. Data users can then specify their application tasks in MapReduce jobs and run these jobs on the anonymized data sets. The results are stored in HDFS or further stored in cloud storage. The privacy-preserving layer itself exploits MapReduce jobs to conduct the computation required in data anonymization. This is plausible and necessary to use MapReduce to accomplish the computation for anonymizing and managing these data sets because data sets are usually of huge volume and complexity in the context of cloud computing and big data. 4.3. Privacy specification interface As to privacy models and protection techniques, privacy-preserving data publishing research community has extensively investigated on the issues and made fruitful progress with a variety of privacy models, privacy preserving methods, and algorithms [7]. Usually, original data sets can be accessed and process by different data users for different purposes, leading to various privacy risks in these scenarios. Moreover, the privacy requirements of data owners possibly vary over time. As such, a systematic and flexible privacy specification model is proposed to frame privacy requirements. We have the following definition on privacy specification. Definition 1 (Privacy Specification) The privacy requirements specified by a data owner are defined as a Privacy Specification (PS). A PS is formally represented by a vector of parameters,that is, PS = <PMN, Thr, AT, Alg, Gra, Uti>. The parameters in the vector are elaborated subsequently. The name of a privacy model is represented by PMN. Out of recent proposed privacy models, we employ three widely adopted privacy models in the privacy-preserving framework, namely, k-anonymity, l-diversity, and t-closeness. The three privacy principles provide different levels of privacy-preserving extent. The privacy principle k-anonymity means that the number of the anonymized records that correspond to a quasi-identifier is required to be larger than a threshold. Otherwise, once certain quasi-identifiers are too specific that only a small group of people is linked to them, these individuals are linked to sensitive information with high confidence, resulting to privacy breach. Here, quasi-identifiers represent the groups of anonymized data records. Based on k-anonymity, l-diversity requires that the sensitive values correspond to a quasi-identifier is not less than a threshold and therefore stricter than k-anonymity. The strictest is t-closeness, requiring the distribution of sensitive values correspond to a quasi-identifier to be close to that of original data sets. Figure 3. Privacy-preserving layer based on MapReduce and Hadoop Distributed File System. X. ZHANG ET AL. Copyright © 2013 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. (2013) DOI: 10.1002/cpe
Parameter Thr is the threshold of the specified privacy model, that is, k, l, or t in the three previously mentioned privacy principles.

Parameter AT denotes the application type. Data owners can specify the goal of the anonymized data sets, for example, classification, clustering, or general use. If the use of the anonymized data sets is known in advance, anonymization algorithms can produce anonymized data sets with higher data utility while privacy is still preserved. Without knowing the types of applications that will consume the anonymized data sets, the DA module produces anonymized data sets for general use. Different information metrics are utilized for different application types [7].

The anonymization algorithm is indicated by the parameter Alg. A variety of algorithms have been developed for different privacy principles and application types. Details are described in Section 4.4.

Parameter Gra represents the granularity of the privacy specification. It determines the scope of the privacy preservation. Usually, only part of the original data sets is shared with data users, and different data users are possibly interested in different parts of the big data. Moreover, only part of the attributes of a data record is considered in the process of anonymization. Thus, the granularity parameter is quite useful.

The data utility parameter Uti is an optional one. Data owners can specify how much data utility they are willing to expose to data users. In general, privacy and data utility are two roughly opposite aspects of privacy preservation. When privacy thresholds are given, most anonymization algorithms attempt to expose as much data utility as possible. Conversely, if the data utility is fixed, the data sets can be anonymized as much as possible.

Above all, the SaC-FRAPP framework systematically and comprehensively provides diverse privacy specifications to achieve flexibility and diversity of privacy preservation.
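Purely as an illustration of Definition 1 (the field names and types below are assumptions, not the PSI module's actual interface), a privacy specification could be captured as a simple value object:

```java
import java.util.List;

// Illustrative sketch of the PS vector <PMN, Thr, AT, Alg, Gra, Uti> from
// Definition 1; names and types are assumptions, not the actual PSI interface.
public class PrivacySpecification {
    public enum PrivacyModel { K_ANONYMITY, L_DIVERSITY, T_CLOSENESS }   // PMN
    public enum ApplicationType { CLASSIFICATION, CLUSTERING, GENERAL }  // AT
    public enum Scheme { FG, SG, MG, CG }                                // Alg (generalization scheme)

    private final PrivacyModel model;          // PMN: which privacy principle to enforce
    private final double threshold;            // Thr: k, l, or t, depending on the model
    private final ApplicationType application; // AT: intended use of the anonymized data
    private final Scheme scheme;               // Alg: anonymization scheme to apply
    private final List<String> attributes;     // Gra: attributes in scope for anonymization
    private final Double utilityBound;         // Uti: optional utility bound, null if unspecified

    public PrivacySpecification(PrivacyModel model, double threshold, ApplicationType application,
                                Scheme scheme, List<String> attributes, Double utilityBound) {
        this.model = model;
        this.threshold = threshold;
        this.application = application;
        this.scheme = scheme;
        this.attributes = attributes;
        this.utilityBound = utilityBound;
    }
    // getters omitted for brevity
}
```

For instance, a specification equivalent to PS = <k-anonymity, 50, classification, SG, {Age, Education}, unspecified> would request 50-anonymity over the Age and Education attributes via sub-tree generalization, optimized for classification workloads.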
4.4. Data anonymization

Several categories of anonymization techniques have been proposed, including generalization, slicing, disassociation, and anatomization. In SaC-FRAPP, we utilize generalization for anonymization because it has been widely investigated and adopted in existing algorithms. Specifically, four generalization schemes have been proposed, namely, full-domain generalization (FG), sub-tree generalization (SG), multi-dimensional generalization (MG), and cell generalization (CG). Their corresponding submodules are depicted in Figure 2. Roughly speaking, the data utility exposed by these four schemes increases in the order FG < SG < MG < CG when the same privacy requirement is given. Note, however, that the anonymized data sets produced by MG and CG suffer from the data exploration problem, that is, the anonymized data sets contain inconsistent data; for instance, one original attribute value can be generalized into two different higher-level values. The Alg parameter in a privacy specification indicates which anonymization scheme is used to anonymize the data sets.

However, most existing anonymization algorithms are centralized or serial, meaning that they fail to handle big data. Usually, big data sets are too large to fit in the memory of a single cloud compute node and are therefore stored across a number of nodes. Consequently, anonymizing large-scale data sets on cloud is a challenging problem for existing anonymization algorithms. Hence, we revise existing algorithms into MapReduce versions in order to anonymize data sets efficiently in a scalable and parallel fashion. The DA module consists of a series of MapReduce versions of anonymization algorithms. Basically, each anonymization algorithm has a MapReduce driver program and several pairs of map and reduce functions. Usually, these map and reduce functions, constituting a MapReduce job, are executed iteratively because anonymization is an iterative process. Although the standard MapReduce implementation mainly supports one-pass data processing rather than iterative processing, we design our MapReduce algorithms carefully to make full use of MapReduce. For instance, we improve the degree of parallelization by partitioning data in advance and launching multiple MapReduce jobs simultaneously.
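To give a flavour of how such jobs are structured (this is a generic sketch rather than the framework's actual algorithms [6, 43]), the core statistic needed in each round of generalization-based anonymization is the number of records falling into each quasi-identifier group at the current anonymization level. A Hadoop job along these lines is shown below; the generalize() helper is a hypothetical stand-in for a lookup against the attribute taxonomy trees.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Generic sketch: count records per quasi-identifier group at the current
// anonymization level. A driver would re-run this job each iteration with an
// updated level until the privacy requirement (e.g. k-anonymity) is met.
public class QuasiIdentifierCount {

    public static class QidMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Hypothetical helper: map attribute values to their generalized
            // domain values at the current anonymization level (taxonomy lookup).
            String qid = generalize(fields);
            context.write(new Text(qid), ONE);
        }

        private String generalize(String[] fields) {
            // Placeholder: a real implementation would consult taxonomy trees
            // distributed to each node (e.g. via the distributed cache).
            return fields[0] + "|" + fields[1];
        }
    }

    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // group size for this quasi-identifier
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "qid-count");
        job.setJarByClass(QuasiIdentifierCount.class);
        job.setMapperClass(QidMapper.class);
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The driver would inspect the resulting group sizes, decide whether further generalization is required, and re-submit the job with an adjusted anonymization level, which is where the iterative nature of the anonymization process shows up.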
4.5. Data update

The anonymized data sets are retained on cloud for different data users, so these data sets are persisted every time they are generated. However, the data sets in cloud applications are dynamic and increase dramatically over time, resulting in big data. Hence, we have to update both the original data sets and the anonymized data sets. A straightforward way is to anonymize the updated data sets from scratch. From the efficiency perspective, it is usually unacceptable to re-anonymize all data sets once an update occurs. Furthermore, privacy preservation may fail to be ensured according to the analysis in [13]. Therefore, a better way is to anonymize only the updated part and adjust the already anonymized data sets.

For anonymous data sets, the anonymization level is used to describe the degree of privacy preservation. Usually, the anonymization level of the already anonymized data sets satisfies the given privacy requirements, so new data can simply be anonymized to the current anonymization level when updates occur. However, the newly anonymized data sets possibly violate the privacy requirements because they are likely to be too specific. In such a case, we have to adjust the anonymization level of the whole anonymized data sets to ensure privacy preservation for all data.

Another aspect of privacy preservation is to expose as much data utility as possible to data users while the privacy requirements are satisfied. For data anonymization, an interesting phenomenon is that, for a given privacy requirement, the more data are anonymized together, the lower the anonymization level can be. A lower anonymization level means more data utility can be exposed because the anonymized values are more specific. Consequently, just anonymizing new data to the current anonymization level is not sufficient, even though it definitely satisfies the privacy requirements. To expose more data utility, we need to lower the anonymization level of the whole anonymized data sets. Above all, three basic operations are provided in the DU module, namely, update, generalization, and specialization. Generalization is utilized to raise the anonymization level, whereas specialization is utilized to lower it.
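The resulting control flow can be pictured roughly as follows; this is a simplified sketch in which checkPrivacy, generalize, specialize, and anonymizeToLevel are assumed to wrap the corresponding MapReduce jobs, and it is not the DU module's literal implementation.

```java
// Simplified sketch of the DU module's decision flow for an update batch.
// The abstract operations are assumed to wrap MapReduce jobs; this is not
// the literal implementation of the DU module.
public abstract class DataUpdateSketch {

    public AnonymizedDataSet onUpdate(AnonymizedDataSet current, DataSet newBatch, PrivacySpec ps) {
        // 1. Anonymize only the new records to the current anonymization level.
        AnonymizedDataSet newPart = anonymizeToLevel(newBatch, current.anonymizationLevel());
        AnonymizedDataSet merged = current.merge(newPart);

        // 2. If the merged data set violates the privacy requirement
        //    (some quasi-identifier group is now too specific), generalize.
        while (!checkPrivacy(merged, ps)) {
            merged = generalize(merged);                  // raise the anonymization level
        }

        // 3. Otherwise, try to specialize to expose more data utility,
        //    as long as the privacy requirement still holds.
        while (true) {
            AnonymizedDataSet candidate = specialize(merged);
            if (candidate.anonymizationLevel() >= merged.anonymizationLevel()
                    || !checkPrivacy(candidate, ps)) {
                break;                                    // no further specialization possible
            }
            merged = candidate;                           // keep the lower, more specific level
        }
        return merged;
    }

    // MapReduce-backed operations (assumptions for this sketch).
    protected abstract AnonymizedDataSet anonymizeToLevel(DataSet data, int level);
    protected abstract boolean checkPrivacy(AnonymizedDataSet data, PrivacySpec ps);
    protected abstract AnonymizedDataSet generalize(AnonymizedDataSet data);
    protected abstract AnonymizedDataSet specialize(AnonymizedDataSet data);

    // Minimal placeholder types so the sketch is self-contained.
    public interface DataSet {}
    public interface AnonymizedDataSet extends DataSet {
        int anonymizationLevel();
        AnonymizedDataSet merge(AnonymizedDataSet other);
    }
    public interface PrivacySpec {}                       // stand-in for the PS of Section 4.3
}
```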
4.6. Anonymous data management

As described in the last section, anonymous data sets are retained for data sharing, mining, and analytics. Another consideration is to save IT expense. In the context of cloud computing, both computation and storage resources are charged in proportion to their usage under the pay-as-you-go model. In this sense, it is beneficial to store certain intermediate data sets rather than re-compute them repeatedly. Yuan et al. [38] have extensively investigated the trade-offs between data storage and re-computation and proposed a series of strategies. In SaC-FRAPP, anonymous data sets are stored to save cost and are managed systematically. We exploit data provenance [39] to manage the data generation relationships among these data sets.

Retaining a large number of independently anonymized data sets potentially leads to privacy breaches. Because the PSI module provides flexible privacy specifications, a data set can possibly be anonymized into several different anonymous data sets. As a result, privacy-sensitive information can be recovered by combining different anonymous data sets. To address this inference problem over multiple data sets, we have proposed an approach that incorporates encryption to ensure privacy preservation [26]. Basically, we can encrypt all the anonymous data sets and share them with specific users. However, encrypting all data sets incurs expensive overhead because these anonymous data sets are usually accessed or processed frequently by many data users. Therefore, we propose to encrypt only part of these data sets to save privacy-preserving cost while privacy preservation can still be ensured. In the privacy-preserving framework, the privacy of all anonymous data sets is quantified according to [26] or differential privacy [9]. Then, we carefully select a subset of the anonymous data sets to encrypt via our proposed approaches. In this way, SaC-FRAPP can achieve cost-effective and efficient privacy preservation.
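Purely to illustrate the idea of encrypting only a subset (the selection approach in [26] is more involved, and treating the overall disclosure risk as a simple sum of per-data-set risks is an assumption made here for brevity), a greedy heuristic could keep hiding the riskiest unencrypted data set until the remaining aggregate risk drops below a user-specified bound:

```java
import java.util.*;

// Illustrative greedy heuristic for choosing which anonymous data sets to
// encrypt. The real approach in [26] is more involved; summing per-data-set
// disclosure risks into a total risk is a simplifying assumption.
public class EncryptionSelectionSketch {

    public static class AnonymousDataSet {
        final String name;
        final double disclosureRisk;   // quantified privacy risk if left unencrypted
        AnonymousDataSet(String name, double disclosureRisk) {
            this.name = name;
            this.disclosureRisk = disclosureRisk;
        }
    }

    // Returns the names of the data sets selected for encryption so that the
    // remaining (unencrypted) aggregate risk does not exceed epsilonD.
    public static List<String> selectForEncryption(List<AnonymousDataSet> dataSets, double epsilonD) {
        List<AnonymousDataSet> remaining = new ArrayList<>(dataSets);
        remaining.sort((a, b) -> Double.compare(b.disclosureRisk, a.disclosureRisk)); // riskiest first
        double totalRisk = remaining.stream().mapToDouble(d -> d.disclosureRisk).sum();

        List<String> toEncrypt = new ArrayList<>();
        for (AnonymousDataSet d : remaining) {
            if (totalRisk <= epsilonD) break;      // privacy requirement already met
            toEncrypt.add(d.name);                 // hide the riskiest data set next
            totalRisk -= d.disclosureRisk;
        }
        return toEncrypt;
    }

    public static void main(String[] args) {
        List<AnonymousDataSet> sets = Arrays.asList(
                new AnonymousDataSet("d1", 0.40),
                new AnonymousDataSet("d2", 0.25),
                new AnonymousDataSet("d3", 0.10));
        // With a bound of 0.4, encrypting d1 suffices (0.25 + 0.10 <= 0.4).
        System.out.println(selectForEncryption(sets, 0.4)); // [d1]
    }
}
```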
5. PROTOTYPE SYSTEM DESIGN

We have developed a proof-of-concept prototype system for the privacy-preserving framework SaC-FRAPP based on Hadoop, an open-source implementation of MapReduce, and the OpenStack cloud platform. The basic system design of SaC-FRAPP is depicted as a class diagram in Figure 4. In general, we implement the prototype system according to the formulation of the framework in the last section. The four main modules are designed as classes or interfaces, and the submodules are designed as their child classes or aggregate components. With these organized components, the developed prototype system achieves the four system requirements described in Section 4.1. As a framework, SaC-FRAPP is also extensible, so more privacy models and MapReduce-based anonymization algorithms can be included.

Figure 4. Basic system design (class diagram) of SaC-FRAPP.

The Privacy Specification Interface class is responsible for front-end interaction with users and invokes the three functional back-end modules (DA, DU, and ADM). So far, we have developed several MapReduce-based anonymization algorithms for the DA class. Such MapReduce programs can anonymize large-scale data sets in an efficient and scalable fashion. For the SG scheme, we develop three MapReduce algorithms, that is, TDS, BUG, and their combination. The major parts of top-down specialization (TDS) and bottom-up generalization (BUG) are MapReduce jobs that dominate the computation in the anonymization process. The MapReduce drivers, mappers, and reducers are illustrated in Figure 4. For the multidimensional scheme, we develop the median and info median MapReduce algorithms for general purposes and classification workloads, respectively. Note that finding the median of an attribute dominates the computation in multidimensional anonymization, so the MapReduce jobs are designed to conduct such computation. Because the execution process of multidimensional anonymization is recursive, we design a recursion control class to determine whether multiple nodes are launched or only one is utilized to carry out the recursive computation. For the local recoding anonymization scheme, we model it as the k-member clustering problem [40] and adopt clustering techniques to address it. Similarity computation, which dominates the clustering performance, is conducted by MapReduce jobs. The DU class leverages most of the functions provided by the DA class to carry out data updates.

For the ADM class, several components are aggregated. Specifically, the sensitive intermediate data tree (SIT) and sensitive intermediate data graph (SIG) classes are leveraged to manage the generation relationships among anonymous data sets; the encryption and key management classes are responsible for encrypting the data sets that are selected to be hidden. The encryption decision class, based on the privacy quantification and cost estimation classes, determines which anonymous data sets should be encrypted and which should not, in order to ensure cost-effective privacy preservation globally. Besides the classes that correspond to the components in the framework, other classes are also necessary for the system. The taxonomy forest class is an essential data structure for most of the functional classes in Figure 4. Note that some components are not presented because of space limits, but this does not affect the discussion of our design.
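Since the taxonomy forest underpins most of the functional classes, the following is a minimal illustration of what a single attribute taxonomy tree might look like; the structure, method names, and example values are assumptions for exposition rather than the prototype's actual classes.

```java
import java.util.*;

// Minimal sketch of one attribute taxonomy tree (a forest holds one per attribute).
// Structure, names, and values are assumptions, not the prototype's actual classes.
public class TaxonomyTree {
    private final Map<String, String> parent = new HashMap<>(); // child value -> more general value

    public void addEdge(String generalValue, String specificValue) {
        parent.put(specificValue, generalValue);
    }

    // Generalize a domain value by walking 'levels' steps towards the root.
    public String generalize(String value, int levels) {
        String current = value;
        for (int i = 0; i < levels && parent.containsKey(current); i++) {
            current = parent.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        TaxonomyTree education = new TaxonomyTree();
        education.addEdge("Any", "Higher-education");
        education.addEdge("Higher-education", "Bachelors");
        education.addEdge("Higher-education", "Masters");
        System.out.println(education.generalize("Masters", 1)); // Higher-education
        System.out.println(education.generalize("Masters", 2)); // Any
    }
}
```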
6. EXPERIMENTAL EVALUATION

In this section, we describe the deployment environment of the prototype system and summarize the experimental evaluation results for the main components of SaC-FRAPP. A series of experiments are conducted on both real-world data sets and synthesized data sets. Specifically, we compare the performance of the DA, DU, and ADM components with their corresponding existing approaches. We describe the experimental results of the three components in the following subsections.

6.1. Deployment environment and experiment settings

We develop and deploy the privacy-preserving framework on the basis of our cloud environment U-Cloud. U-Cloud is a cloud computing environment at the University of Technology, Sydney. The system overview of U-Cloud is depicted in Figure 5. The computing facilities of this system are located among several labs in the Faculty of Engineering and IT, University of Technology, Sydney. On top of the hardware and the Linux operating system (Ubuntu), we install KVM virtualization software, which virtualizes the infrastructure and provides unified computing and storage resources. To create virtualized data centers, we install the OpenStack open-source cloud environment, which is responsible for virtual machine management, resource scheduling, task distribution, and interaction with users. Furthermore, Hadoop clusters are built on the basis of OpenStack to facilitate the MapReduce computing paradigm and big data processing. All the experiments are conducted in this cloud environment.

Figure 5. System overview of U-Cloud.

We use the Adult data set [41], a public data set commonly used as a de facto benchmark for testing data anonymization for privacy preservation. We also generate enlarged data sets based on the Adult data set. After pre-processing, the sanitized Adult data set consists of 30 162 records, each with 14 attributes; we utilize eight of these attributes in our experiments. The basic information about the attribute taxonomy trees is given in Table I, which lists the number of domain values and the number of levels of each tree. The original Adult data set is blown up to generate a series of larger data sets, following the approach adopted in [19]. Specifically, for each original record r, α − 1 "variations" of r are created, where α > 1 is the blowup scale. As a result, an enlarged data set is α times as large as the original one.

Table I. Description of the Adult data set.

Attribute       Age  Education  Marital status  Occupation  Relationship  Race  Sex  Native country
Levels            5          4               3           4             3     3    2               4
Domain values    99         26               9          23             8     7    3              50
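As a rough illustration of this enlargement (the variation rule below, which merely perturbs the Age attribute, is an assumption for the sketch; the actual generation procedure of [19] may differ), a blowup by a factor α could be carried out as follows:

```java
import java.util.*;

// Rough sketch of blowing up a data set by a factor alpha: for each original
// record, alpha - 1 variations are added. Perturbing only the Age field is an
// illustrative assumption, not necessarily the rule used in [19].
public class DataBlowupSketch {

    public static List<String[]> blowUp(List<String[]> original, int alpha) {
        List<String[]> enlarged = new ArrayList<>(original);
        Random random = new Random(42);                 // fixed seed for reproducibility
        for (String[] record : original) {
            for (int i = 0; i < alpha - 1; i++) {
                String[] variation = record.clone();
                int age = Integer.parseInt(record[0]);  // assume attribute 0 is Age
                variation[0] = String.valueOf(Math.max(17, age + random.nextInt(5) - 2));
                enlarged.add(variation);
            }
        }
        return enlarged;                                // alpha times the original size
    }

    public static void main(String[] args) {
        List<String[]> adult = Arrays.asList(
                new String[]{"39", "Bachelors", "Never-married"},
                new String[]{"50", "Masters", "Married"});
        System.out.println(blowUp(adult, 3).size());    // 6 records for alpha = 3
    }
}
```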
6.2. Summary of experiment results

In this section, we briefly summarize the experimental results for the three main functional modules, that is, DA, DU, and ADM, from the perspective of scalability and cost-effectiveness. We have already evaluated these modules in our previous work; interested readers can refer to [6, 26, 42, 43].

6.2.1. Scalability of data anonymization.

We compare our data anonymization approach with state-of-the-art centralized approaches proposed in [19, 44]. We run both approaches on data sets varying from 50 MB to 2.5 GB. We check whether both approaches can scale over large-scale data sets and measure the execution time to evaluate efficiency. The change of execution time with respect to data set size is depicted in Figure 6. The execution time of our approach is denoted as T_MR, and that of the centralized approach is denoted as T_Cent. From Figure 6(a), it can be seen that T_Cent surges when the data size increases, whereas T_MR increases only slightly even though it has a higher starting value. The centralized approach suffers from memory insufficiency when the data size exceeds 500 MB, whereas Figure 6(b) shows that our approach can still scale over much larger data sets. Hence, the proposed privacy-preserving framework can significantly improve scalability and efficiency compared with existing state-of-the-art anonymization approaches.

Figure 6. Change of execution time with respect to data size.

6.2.2. Scalability of data update.

We compare our anonymous data update approach with a state-of-the-art approach proposed in [45]. We run both approaches with the number of records in a data update batch ranging from 2000 to 20 000. We measure the update time to evaluate efficiency. The execution time of our approach is denoted as t_I, and that of the existing approach is denoted as t_E. Figure 7 illustrates how the difference between t_I and t_E changes with respect to the number of data records in an update batch when K, the user-specified k-anonymity parameter, is fixed. It can be seen from Figure 7 that the difference between t_I and t_E grows as the number of data records increases. This trend demonstrates that the privacy-preserving framework can significantly improve the efficiency of privacy preservation on large-volume incremental data sets over existing approaches.

Figure 7. Change of update time with respect to the number of update records.

6.2.3. Cost-effectiveness of anonymous data management.

We compare our approach of retaining anonymous data sets with the existing approach, which encrypts all data sets [4, 25]. We run both approaches with the number of data sets varying from 100 to 1000. The privacy-preserving monetary cost is measured to evaluate cost-effectiveness. The costs of our approach and the existing one are denoted as C_HEU and C_ALL, respectively. The change of the cost with respect to the number of data sets is illustrated in Figure 8 when ϵ_d is fixed. The parameter ϵ_d is a user-specified privacy requirement threshold, meaning that the degree of privacy disclosure must stay under ϵ_d.

Figure 8. Change of privacy-preserving cost with respect to the number of data sets.
We can see from Figure 8 that the difference between C_ALL and C_HEU grows as the number of intermediate data sets increases, that is, more expense can be saved when the number of data sets becomes larger. This trend means the privacy-preserving framework can reduce the privacy-preserving cost of retaining anonymous data sets significantly compared with the existing approach in real-world big data applications.

In conclusion, the evaluation results of the experiments mentioned earlier demonstrate that the proposed privacy-preserving framework SaC-FRAPP can anonymize big data and manage the anonymous data sets in a highly scalable, efficient, and cost-effective fashion.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed SaC-FRAPP, a scalable and cost-effective framework for privacy preservation over big data on cloud. We have analyzed the lifecycle of data anonymization and formulated four basic system requirements for a privacy-preserving framework in the context of cloud computing and big data. To achieve the four system requirements, we have presented four modules for the privacy-preserving framework SaC-FRAPP, namely, PSI, DA, DU, and ADM. We leverage cloud-based MapReduce and HDFS to conduct data anonymization and manage anonymous data sets, respectively, before releasing data sets to other parties. Also, a corresponding proof-of-concept prototype system has been implemented for the privacy-preserving framework and deployed in our cloud environment. Empirical evaluations have demonstrated that SaC-FRAPP can anonymize large-scale data sets and manage the anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. The framework provides a holistic conceptual foundation for privacy preservation over big data and enables users to realize the full potential of cloud platforms. Privacy concerns over big data on cloud have attracted the attention of researchers in different research communities, but ensuring privacy preservation of big data sets still needs extensive investigation.
Based on the contributions of the proposed framework, we plan to integrate the privacy-preserving framework with other data processing frameworks that employ MapReduce as the computation engine, for example, the Apache Mahout project, a data mining library built atop MapReduce. Further, we will investigate privacy-aware data distribution and scheduling in cloud environments in the future.

REFERENCES
1. Borkar V, Carey MJ, Li C. Inside "Big Data Management": Ogres, Onions, or Parfaits? Proceedings of the 15th International Conference on Extending Database Technology (EDBT'12), 2012; 3–14.
2. Chaudhuri S. What Next?: A Half-Dozen Data Management Research Goals for Big Data and the Cloud. Proceedings of the 31st Symposium on Principles of Database Systems (PODS'12), 2012; 1–4.
3. Dean J, Ghemawat S. MapReduce: a flexible data processing tool. Communications of the ACM 2010; 53(1):72–77. DOI: 10.1145/1629175.1629198.
4. Cao N, Wang C, Li M, Ren K, Lou W. Privacy-preserving Multi-Keyword Ranked Search Over Encrypted Cloud Data. Proceedings of the 31st Annual IEEE International Conference on Computer Communications (INFOCOM'11), 2011; 829–837.
5. Liu C, Zhang X, Yang C, Chen J. CCBKE—session key negotiation for fast and secure scheduling of scientific applications in cloud computing. Future Generation Computer Systems 2013; 29(5):1300–1308. DOI: 10.1016/j.future.2012.07.001.
6. Zhang X, Yang LT, Liu C, Chen J. A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Transactions on Parallel and Distributed Systems 2013; in press. DOI: 10.1109/TPDS.2013.48.
7. Fung BCM, Wang K, Chen R, Yu PS. Privacy-preserving data publishing: a survey of recent developments. ACM Computing Surveys 2010; 42(4):1–53. DOI: 10.1145/1749603.1749605.
8. Puttaswamy KPN, Kruegel C, Zhao BY. Silverline: Toward Data Confidentiality in Storage-Intensive Cloud Applications. Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11), 2011; Article 10.
9. Dwork C. Differential Privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP'06), 2006; 1–12.
10. Gentry C. Fully Homomorphic Encryption Using Ideal Lattices. Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC'09), 2009; 169–178.
11. Nergiz ME, Clifton C, Nergiz AE. Multirelational k-anonymity. IEEE Transactions on Knowledge and Data Engineering 2009; 21(8):1104–1117.
12. Iwuchukwu T, Naughton JF. K-anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07), 2007; 746–757.
13. Pei J, Xu J, Wang Z, Wang W, Wang K. Maintaining k-anonymity Against Incremental Updates. Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM'07), 2007; Article 5.
14. Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2002; 10(5):557–570. DOI: 10.1142/S0218488502001648.
15. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data 2007; 1(1):Article 3. DOI: 10.1145/1217299.1217302.
16. Li N, Li T, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Transactions on Knowledge and Data Engineering 2010; 22(7):943–956. DOI: 10.1109/TKDE.2009.139.
17. LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional k-anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), 2006; 25.
18. Xu J, Wang W, Pei J, Wang X, Shi B, Fu AWC. Utility-based Anonymization Using Local Recoding. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006; 785–790.
19. Mohammed N, Fung B, Hung PCK, Lee CK. Centralized and distributed anonymization for high-dimensional healthcare data. ACM Transactions on Knowledge Discovery from Data 2010; 4(4):Article 18.
20. Xiao X, Tao Y. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), 2006; 139–150.
21. Li T, Li N, Zhang J, Molloy I. Slicing: a new approach for privacy preserving data publishing. IEEE Transactions on Knowledge and Data Engineering 2012; 24(3):561–574.
22. Terrovitis M, Liagouris J, Mamoulis N, Skiadopolous S. Privacy preservation by disassociation. Proceedings of the VLDB Endowment 2012; 5(10):944–955.
23. LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: Efficient Full-Domain k-anonymity. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD'05), 2005; 49–60.
24. Hu H, Xu J, Ren C, Choi B. Processing Private Queries Over Untrusted Data Cloud Through Privacy Homomorphism. Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE'11), 2011; 601–612.
25. Li M, Yu S, Cao N, Lou W. Authorized Private Keyword Search Over Encrypted Data in Cloud Computing. Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS'11), 2011; 383–392.
26. Zhang X, Liu C, Nepal S, Pandey S, Chen J. A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud. IEEE Transactions on Parallel and Distributed Systems 2013; 24(6):1192–1202. DOI: 10.1109/TPDS.2012.238.
27. Neuman BC, Ts'o T. Kerberos: an authentication service for computer networks. IEEE Communications Magazine 1994; 32(9):33–38.
28. Roy I, Setty STV, Kilzer A, Shmatikov V, Witchel E. Airavat: Security and Privacy for MapReduce. Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10), 2010; 297–312.
29. Blass E-O, Pietro RD, Molva R, Önen M. PRISM—Privacy-Preserving Search in MapReduce. Proceedings of the 12th International Conference on Privacy Enhancing Technologies (PETS'12), 2012; 180–200.
30. Ko SY, Jeon K, Morales R. The HybrEx Model for Confidentiality and Privacy in Cloud Computing. Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'11), 2011; Article 8.
31. Zhang K, Zhou X, Chen Y, Wang X, Ruan Y. Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds. Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS'11), 2011; 515–526.
32. Wei W, Juan D, Ting Y, Xiaohui G. SecureMR: A Service Integrity Assurance Framework for MapReduce. Proceedings of the Annual Computer Security Applications Conference (ACSAC'09), 2009; 73–82.
33. Mell P, Grance T. The NIST Definition of Cloud Computing (Version 15). U.S. National Institute of Standards and Technology, Information Technology Laboratory, 2009.
34. Shvachko K, Hairong K, Radia S, Chansler R. The Hadoop Distributed File System. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST'10), 2010; 1–10.
35. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R. Incoop: MapReduce for Incremental Computations. Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11), 2011; 1–14.
36. Bu Y, Howe B, Balazinska M, Ernst MD. The HaLoop approach to large-scale iterative data analysis. The VLDB Journal 2012; 21(2):169–190. DOI: 10.1007/s00778-012-0269-7.
37. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G. Twister: A Runtime for Iterative MapReduce. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC'10), 2010; 810–818.
38. Yuan D, Yang Y, Liu X, Chen J. On-demand minimum cost benchmarking for intermediate dataset storage in scientific cloud workflow systems. Journal of Parallel and Distributed Computing 2011; 71(2):316–332. DOI: 10.1016/j.jpdc.2010.09.003.
39. Davidson SB, Khanna S, Milo T, Panigrahi D, Roy S. Provenance Views for Module Privacy. Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'11), 2011; 175–186.
40. Byun J-W, Kamra A, Bertino E, Li N. Efficient k-anonymization Using Clustering Techniques. Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA'07), 2007; 188–200.
41. UCI Machine Learning Repository. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/ [Accessed on April 01, 2013].
42. Zhang X, Liu C, Nepal S, Chen J. An efficient quasi-identifier index based approach for privacy preservation over incremental data sets on cloud. Journal of Computer and System Sciences 2013; 79(5):542–555. DOI: 10.1016/j.jcss.2012.11.008.
43. Zhang X, Liu C, Nepal S, Yang C, Dou W, Chen J. Combining Top-down and Bottom-up: Scalable Sub-tree Anonymization Over Big Data Using MapReduce on Cloud. Proceedings of the 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (IEEE TrustCom-13), 2013; accepted.
44. Fung BCM, Wang K, Yu PS. Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering 2007; 19(5):711–725.
45. Doka K, Tsoumakos D, Koziris N. KANIS: Preserving k-anonymity Over Distributed Data. Proceedings of the 5th International Workshop on Personalized Access, Profile Management, and Context Awareness in Databases (PersDB'11), 2011.