Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy

IDL - International Digital Library Of
Technology & Research
Volume 1, Issue 6, June 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
IDL - International Digital Library 1 | P a g e Copyright@IDL-2017
Two-Phase TDS Approach for Data Anonymization
To Preserving Bigdata Privacy
1.Ambika M Patil, M.Tech Computer Science Engineering, Center for P G Studies Jnana Sangama VTU Belagavi,
Belagavi, INDIA, Ambika702@gmail.com
2.Assistant Prof.Ranjana B Nadagoudar, Computer Science Engineering Department, Center for P G Studies Jnana Sangama
VTU Belagavi, Belagavi, INDIA
3.Dhananjay A Potdar, Dhananjay.potdar@gmail.com
ABSTRACT - Big Data has gradually become a hot topic in
research and business and is now used in many industries, and
Big Data security and privacy are of increasing concern.
However, there is an obvious tension between Big Data
security and privacy and the widespread use of Big Data.
Various privacy-preserving mechanisms have been developed to
protect privacy at different stages (e.g., data generation, data
storage, data processing) of the big data life cycle. The goal of
this paper is to provide a comprehensive overview of the
privacy preservation mechanisms in big data and to present the
challenges facing existing mechanisms. We also illustrate the
infrastructure of big data and the state-of-the-art
privacy-preserving mechanisms in each stage of the big data
life cycle. This paper focuses on the anonymization process and
presents an approach that significantly improves the scalability
and efficiency of TDS (top-down specialization) for data
anonymization over existing approaches. We also discuss the
challenges and future research directions related to preserving
privacy in big data.
KEYWORDS - Big data, privacy, big data storage, big data
processing, data anonymization, top-down specialization,
MapReduce, cloud, privacy preservation.
I. INTRODUCTION
As a result of recent technological development, the amount of
data generated by social networking sites, sensor networks, the
Internet, healthcare applications, and many other sources is
increasing significantly day by day. The term “Big Data”
reflects the trend and salient features of the data being
produced from various sources. Big Data can be described by
the “3Vs”: Volume, Velocity and Variety. Volume refers to the
huge amount of data being produced from multiple sources.
Velocity concerns both how fast data are produced and
collected and how fast some of the collected data change.
Variety refers to the highly distributed and heterogeneous
nature of the data. The data generation rate is growing so
rapidly that it is becoming very difficult to handle the data
using traditional methods or systems [1]. In the “3Vs” model,
Variety covers structured, semi-structured and unstructured
data; Volume means the data scale is large; and Velocity means
all Big Data processes must be quick and timely in order to
maximize the value of Big Data, as shown in Fig. 1. Big Data
handles huge amounts of data and uses various types of data,
including unstructured data and attributes that were never used
in the past; these features distinguish Big Data from data
mining.
In 2011, IDC defined big data as “big data technologies
describe a new generation of technologies and architectures,
designed to economically extract value from very large
volumes of a wide variety of data, by enabling the high-
velocity capture, discovery, and/or analysis”[2].
In this definition, the features of big data may be summarized
as the 4Vs, i.e., Variety, Velocity, Volume and Value, where
Variety, Velocity and Volume have the same meaning as in the
3Vs model and Value refers to the great social value of big
data. The 4Vs model is widely recognized because it highlights
the most critical problem: how to discover value from
enormous, varied, and rapidly generated datasets.

FIGURE 1. Illustration of the 3 V's of big data.
Although big data can be used effectively to better understand
the world and to innovate in various aspects of human activity,
the exploding amount of data has increased the potential for
breaches of individual privacy. For example, Amazon and
Google can learn our shopping preferences and browsing
habits. Social networking sites such as Facebook store
information about our personal lives and social relationships.
Popular video sharing websites such as YouTube recommend
videos based on our search history. With all the power of big
data, gathering, storing and reusing our personal information
for commercial profit threatens our privacy and security. In
2006, AOL released 20 million search queries from roughly
650,000 users for research purposes, with the AOL ID and IP
address removed. However, it took researchers only a couple of
days to re-identify users. Users' privacy may be breached
under the following circumstances [3]:
I. Personal information, when combined with external
datasets, may lead to the inference of new facts about the
users. Those facts may be secret and not supposed to be
exposed to others.
II. Personal information is sometimes collected and used to
add value to business. For example, an individual's shopping
habits may disclose a lot of personal information.
III. Sensitive data are stored and processed in a location that
is not properly secured, and data leakage may occur during the
storage and processing phases.
In order to safeguard big data privacy, numerous mechanisms
have been developed in recent years. These mechanisms can
be grouped by the stages of the big data life cycle, i.e., data
generation, data storage, and data processing. In the data
generation phase, access restriction and data falsification
techniques are used to protect privacy. While access
restriction techniques try to limit access to individuals'
private data, data falsification techniques alter the original
data before they are released to a non-trusted party. The
approaches to privacy protection in the data storage phase are
mainly based on encryption techniques. Encryption-based
techniques can be further divided into attribute-based
encryption (ABE), identity-based encryption (IBE), and
storage path encryption. In addition, to protect sensitive
information, hybrid clouds are used, where sensitive data are
stored in a private cloud. The data processing phase includes
privacy-preserving data publishing (PPDP) and knowledge
extraction from the data. In PPDP, anonymization techniques
such as generalization and suppression are used to protect the
privacy of data. Ensuring the utility of the data while
preserving privacy is a great challenge in PPDP. In the
knowledge extraction process, several mechanisms exist to
extract useful information from large-scale and complex data.
These mechanisms can be further divided into clustering,
classification and association rule mining based techniques.
While clustering and classification split the input data into
different groups, association rule mining based techniques find
the useful relationships and trends in the input data.
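The generalization and suppression techniques mentioned for PPDP can be illustrated with a minimal sketch. The records, the choice of age and ZIP code as quasi-identifiers, and the function names below are all hypothetical; they are not taken from any cited system. Ages are generalized to decades, ZIP codes to a three-digit prefix, and any row whose quasi-identifier group is still smaller than k is suppressed.

```python
from collections import Counter

# Hypothetical records: (age, zip_code, diagnosis).
# Age and ZIP code are treated as quasi-identifiers; diagnosis is sensitive.
records = [
    (23, "59001", "flu"),
    (27, "59004", "asthma"),
    (25, "59002", "flu"),
    (41, "59310", "diabetes"),
    (46, "59315", "flu"),
    (62, "60112", "asthma"),
]

def generalize(record):
    """Generalization: replace exact values with coarser ones."""
    age, zip_code, diagnosis = record
    decade = age // 10 * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**", diagnosis)

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs at least k times."""
    sizes = Counter((age, zip_code) for age, zip_code, _ in rows)
    return all(count >= k for count in sizes.values())

def anonymize(rows, k):
    """Generalize every row, then suppress rows in groups smaller than k."""
    generalized = [generalize(r) for r in rows]
    sizes = Counter((age, zip_code) for age, zip_code, _ in generalized)
    return [r for r in generalized if sizes[(r[0], r[1])] >= k]

released = anonymize(records, k=2)
```

The sketch also shows the utility trade-off noted above: coarser values and suppressed rows protect individuals but leave less detail for analysis.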
FIGURE 2. Illustration of big data life cycle.
Protecting privacy in big data is a fast-growing research
area. Although some related papers have been published, only
a few of them are survey or review papers [4], [5]. Moreover,
while these papers introduced the basic concept of privacy
protection in big data, they failed to cover several important
aspects of this area. For example, neither [4] nor [5] provides
a detailed discussion of big data privacy with respect to cloud
computing. Besides, none of the papers discussed future
challenges in detail.
In this paper, we give a comprehensive overview of the
state-of-the-art technologies for preserving the privacy of big
data at each stage of the big data life cycle. This paper focuses
on the anonymization process and presents an approach that
significantly improves the scalability and efficiency of TDS
(top-down specialization) for data anonymization over
existing approaches.
The major contributions of our research are threefold.
First, we creatively apply MapReduce on cloud to TDS for
data anonymization and deliberately design a group of
innovative MapReduce jobs to concretely accomplish the
specializations in a highly scalable fashion. Second, we
propose a two-phase TDS approach to gain high scalability by
allowing specializations to be conducted on multiple data
partitions in parallel during the first phase. Third,
implementation results show that our approach can
significantly improve the scalability and efficiency of TDS for
data anonymization over existing approaches.
The remainder of this paper is organized as follows.
Section II describes the infrastructure of big data and the
issues and challenges related to the privacy of big data that
arise from the underlying structure of cloud computing.
Section III reviews traditional data privacy preservation
methods. Section IV covers privacy preservation for big data,
formulates the two-phase TDS approach, and elaborates the
algorithmic details of the MapReduce jobs. Section V presents
the implementation of our approach. Finally, Section VI
concludes this paper and discusses future work.
II. INFRASTRUCTURE OF BIG DATA
To handle the different dimensions of big data in terms of
volume, velocity, and variety, we need to design efficient and
effective systems to process large amounts of data arriving at
very high speed from different sources. Big data has to go
through multiple phases during its life cycle, as shown in
Figure 2. Data are distributed nowadays, and new technologies
are being developed to store and process large repositories of
data. For example, cloud computing technologies, such as
Hadoop MapReduce, are explored for big data storage and
processing. In this section we explain the life cycle of big
data. In addition, we discuss how big data benefits from cloud
computing technologies and the challenges related to cloud
computing when it is used for the storage and processing of
big data.
A. LIFE CYCLE OF BIG DATA
Data generation: Data can be generated from many distributed
sources. The amount of data generated by humans and
machines has exploded in the past few years. For example,
every day 2.5 quintillion bytes of data are generated on the
web, and 90 percent of the data in the world was generated in
the past few years. Facebook, a social networking site, alone
generates 25 TB of new data every day. Usually, the data
generated are large, diverse and complex. Therefore, it is hard
for traditional systems to handle them. The data generated are
normally associated with a specific domain such as business,
Internet, research, etc.
Data storage: This phase refers to storing and managing large-
scale data sets. A data storage system consists of two parts i.e.,
hardware infrastructure and data management [6]. Hardware
infrastructure refers to using information and communications
technology (ICT) resources for several tasks (such as
distributed storage). Data management refers to the set of
software positioned on top of hardware infrastructure to
manage and query large scale data sets. It should also provide
many interfaces to interact with and analyze stored data.
Data processing: The data processing phase refers to the
processes of data collection, data transmission, pre-processing
and extraction of useful information. Data collection is needed
because data may come from various sources, e.g., sites that
contain text, images and videos. In the data collection phase,
data are acquired from a specific data production environment
using dedicated data collection technology. In the data
transmission phase, after raw data are collected from a specific
data production environment, a high-speed transmission
mechanism is needed to transmit the data into proper storage
for different types of analytic applications. Finally, the
pre-processing phase aims at removing meaningless and
redundant parts of the data so that storage space can be saved.
Various applications use the vast amount of data together
with domain-specific analytical methods to derive significant
information. Although different fields in data analytics require
different data characteristics, some of these fields may leverage
similar underlying technology to inspect, transform and model
data to extract value from it. Emerging data analytics research
can be categorized into the following six technical areas:
structured data analytics, text analytics, multimedia analytics,
web analytics, network analytics, and mobile analytics [6].
B. CHALLENGES OF BIG DATA
The application of Big Data leads to a set of new challenges,
since Big Data sets are so large and complex that their
acquisition, storage, management and analysis are difficult.
The main challenges are listed as follows [7][8]:
1. Data preparation. According to the definition of strong and
accurate techniques for big data, an important basis of big data
analysis and management is the availability of high-quality,
precise, and trustworthy data. Data preparation is paramount
for increasing the value of big data.
2. Efficient distributed storage and search. Timeliness of data
collection is fundamental to offer fast analysis of big data.
Therefore, there is an increasing need of providing efficient
distributed storage with faster memories and enhancing search
algorithms.
3. Effective online data analysis. Online analysis of
multidimensional data becomes a must and potential source of
information for decision making. This would require adapting
existing OLAP approaches to big data.
4. Effective machine learning techniques for big data mining.
Machine learning and data mining should be adapted to big
data to unleash the full potential of collected data.

5. Efficient handling of big data streams. Some specific
scenarios (e.g., stock exchange) would require analysis of data
in the form of streams. Fast and optimized solutions should be
developed to make inference on big data streams.
6. Semantic lifting techniques. The semantics of collected big
data represent an important aspect for the future development
of big data applications. Future approaches to big data analysis
should be able to manage these semantics.
7. Programming models. Many programming models of big
data infrastructures are available. Some examples include
MapReduce and Hadoop. We should consider different
approaches for storing and managing data.
8. Social analytics. The ability to differentiate those data that
can be trusted and that comply with users’ needs and
preferences is important as well as difficult to achieve. Social
analytics should then address this problem by providing
correct and sound approaches to social data analysis.
9. Security and privacy. Big data are a priceless source of
information. However, they often contain sensitive information
that needs to be protected from unauthorized access and
release.
III. TRADITIONAL DATA PRIVACY PRESERVATION METHODS
Cryptography refers to a set of techniques and algorithms for
protecting data. In cryptography, plaintext is transformed into
ciphertext using various encryption schemes. Numerous
methods build on this idea, such as public key cryptography
and digital signatures.
Cryptography alone cannot enforce the privacy demanded by
common cloud computing and big data services [9]. This is
because big data differs from traditional large data sets on the
basis of the three V’s (velocity, variety, volume) [10, 11].
These features make big data architecture different from
traditional information architectures. These changes in
architecture and its complex nature mean that cryptography
and traditional encryption schemes do not scale to the privacy
needs of big data.
The challenge with cryptography is the all-or-nothing
retrieval policy of encrypted data [12]. Less sensitive data that
could be useful in big data analytics are also encrypted, and
users are not allowed to access them. Encryption makes data
unreachable to those who do not have access to the decryption
key. Privacy may also be breached if data are stolen before
encryption or if cryptographic keys are misused.
Attribute-based encryption can also be used for big data
privacy [13, 14]. This method of securing big data is based on
relationships among the attributes present in big data. The
attributes that need to be protected are identified based on the
type of big data and company policies.
In a nutshell, encryption or cryptography alone cannot serve
as a big data privacy preservation method. They can support
data anonymization but cannot be used directly for big data
privacy.
IV. PRIVACY PRESERVATION FOR BIG DATA
TWO-PHASE TOP-DOWN SPECIALIZATION (TPTDS)
The sketch of the TPTDS approach is shown in Figure 3. The
TPTDS approach has three components, namely, data
partition, anonymization level (AL) merging, and data
specialization.
Figure 3. Execution framework overview of MRTDS.
The TPTDS approach conducts the computation required in
TDS in a highly scalable and efficient fashion. The two phases
of our approach are based on the two levels of parallelization
provisioned by MapReduce on cloud, i.e., job level and task
level. Job level parallelization means that multiple MapReduce
jobs can be executed simultaneously to make full use of cloud
infrastructure resources. Combined with cloud, MapReduce
becomes more powerful and elastic, as the cloud can offer
infrastructure resources on demand, for example, through the
Amazon Elastic MapReduce service. Task level parallelization
means that multiple mapper/reducer tasks in a MapReduce job
are executed simultaneously over data splits.
To achieve high scalability, we parallelize multiple jobs on
data partitions in the first phase, but the resultant
anonymization levels are not identical. To obtain finally
consistent anonymous data sets, the second phase is essential
to integrate the intermediate results and further anonymize the
entire data set. In the first phase, we run a subroutine over
each of the partitioned data sets in parallel to make full use of
the job level parallelization of MapReduce. The subroutine is
a MapReduce version of centralized TDS (MRTDS), which
concretely conducts the computation required in TPTDS.
MRTDS anonymizes data partitions to generate intermediate
anonymization levels. An intermediate anonymization level
means that further specialization can be performed without
violating k-anonymity. MRTDS only leverages the task level
parallelization of MapReduce.
ALGORITHM 1. SKETCH OF TWO-PHASE TDS (TPTDS).
Input: Data set D, anonymity parameters k, k^I and the
number of partitions p.
Output: Anonymous data set D*.
1: Partition D into D_i, 1 ≤ i ≤ p.
2: Execute MRTDS(D_i, k^I, AL^0) → AL'_i, 1 ≤ i ≤ p, in
parallel as multiple MapReduce jobs.
3: Merge all intermediate anonymization levels into one,
Merge(AL'_1, AL'_2, ..., AL'_p) → AL^I.
4: Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.
5: Specialize D according to AL*, Output D*.
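The five steps of Algorithm 1 can be traced with a small single-machine sketch. It is only an illustration under strong assumptions: there is one quasi-identifier with a hypothetical taxonomy, `toy_tds` is a simplified stand-in for MRTDS that accepts the first valid specialization instead of scoring candidates, threads stand in for parallel MapReduce jobs, and the data are split by striding rather than by the partitioning the paper describes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical taxonomy for one quasi-identifier: parent -> children.
TAXONOMY = {
    "Any": ["Professional", "Artist"],
    "Professional": ["Engineer", "Lawyer"],
    "Artist": ["Dancer", "Writer"],
}
PARENT = {child: parent for parent, kids in TAXONOMY.items() for child in kids}

def ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def generalize(value, cut):
    """Map a leaf value to the node of the cut that covers it."""
    return next(n for n in ancestors(value) if n in cut)

def toy_tds(data, k, cut):
    """Stand-in for MRTDS: specialize the cut top-down while k-anonymity holds."""
    cut = set(cut)
    changed = True
    while changed:
        changed = False
        for node in sorted(cut):
            children = TAXONOMY.get(node)
            if not children:
                continue  # leaf values cannot be specialized further
            trial = (cut - {node}) | set(children)
            sizes = Counter(generalize(v, trial) for v in data)
            # Valid only if every non-empty child group keeps at least k records.
            if all(sizes[c] == 0 or sizes[c] >= k for c in children):
                cut, changed = trial, True
                break
    return cut

def merge(cuts):
    """Keep a node only if none of its proper ancestors appears in any cut."""
    nodes = set().union(*cuts)
    return {n for n in nodes if not any(a in nodes for a in ancestors(n)[1:])}

def tptds(data, k, k_i, p):
    partitions = [data[i::p] for i in range(p)]              # step 1
    with ThreadPoolExecutor(max_workers=p) as pool:          # step 2
        levels = list(pool.map(lambda d: toy_tds(d, k_i, {"Any"}), partitions))
    merged = merge(levels)                                   # step 3
    final_cut = toy_tds(data, k, merged)                     # step 4
    return final_cut, [generalize(v, final_cut) for v in data]  # step 5

data = ["Engineer"] * 4 + ["Lawyer"] * 4 + ["Dancer"] * 3 + ["Writer"]
final_cut, released = tptds(data, k=2, k_i=2, p=2)
```

With this toy data the two partitions reach different intermediate cuts, the merge keeps the more general node Artist, and the second phase cannot specialize it further because the single Writer record would form a group smaller than k.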
Modules Description:
Data Partition:
o In this module, data partitioning is performed on the
cloud.
o Here we collect a large number of data sets.
o We split the large data sets into small data sets.
o Then we assign a random number to each data
set.
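One possible reading of the random-number step is that every record draws a random partition number and records with the same number form one small data set. The sketch below follows that reading; the function name `random_partition` and the `seed` parameter are hypothetical and added only for illustration.

```python
import random
from collections import defaultdict

def random_partition(records, p, seed=None):
    """Assign each record a random number in [0, p) and group records by it."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    partitions = defaultdict(list)
    for record in records:
        partitions[rng.randrange(p)].append(record)
    # Return exactly p lists, including empty ones if a number was never drawn.
    return [partitions[i] for i in range(p)]

parts = random_partition(list(range(1000)), p=4, seed=7)
```

Random assignment tends to give each partition a similar mix of values, which makes the intermediate results of the partitions more comparable than a split by arrival order would.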
Anonymization:
o After obtaining the individual data sets, we apply
anonymization. Anonymization means hiding or
removing the sensitive fields in the data sets.
o Then we get the intermediate results for the small
data sets; the intermediate results are used for the
specialization process.
o All intermediate anonymization levels are merged
into one in the second phase. The merging of
anonymization levels is completed by merging cuts.
To ensure that the merged intermediate
anonymization level AL^I never violates privacy
requirements, the more general one is selected as
the merged one.
Merging:
o The intermediate results of the small data sets are
merged here.
o The MRTDS driver is used to organize the small
intermediate results for merging; the merged data sets
are collected on cloud.
o Anonymization is applied again to the merged result;
this step is called specialization.
Specialization:
o After the intermediate results are obtained, they are
merged into one.
o Then we apply anonymization again on the merged
data; this is called specialization.
o Here we use two kinds of jobs, IGPL Initialization
and IGPL Update.
o The jobs are organized through the web using the
driver.
OBS:
OBS stands for optimized balancing scheduling.
o Here we focus on two kinds of scheduling, by time
and by size.
o Data sets are split into a specified size, and
anonymization is applied at a specified time.
o The OBS approach aims to deliver high capability in
handling large data sets.
V. IMPLEMENTATION AND IMPROVEMENT
To elaborate how data sets are processed in MRTDS, the
execution framework based on standard MapReduce is
depicted in Fig. 3. The solid arrow lines represent the data
flows within the canonical MapReduce framework. From
Fig. 3, we can see that the iteration of MapReduce jobs is
controlled by the anonymization level AL in the Driver. The
data flows for handling iterations are represented by dotted
arrow lines. AL is sent from the Driver to all workers,
including Mappers and Reducers, via the distributed cache
mechanism. The value of AL is modified in the Driver
according to the output of the IGPL Initialization or IGPL
Update jobs. Because the amount of such data is extremely
small compared with the data sets to be anonymized, it can be
transmitted efficiently between the Driver and the workers.
We adopt Hadoop, an open-source implementation of
MapReduce, to implement MRTDS. Since most of the Map and
Reduce functions need to access the current anonymization
level AL, we use the distributed cache mechanism to pass the
content of AL to every Mapper or Reducer node, as shown in
Fig. 3. Also, Hadoop provides a mechanism to set simple
global variables for Mappers and Reducers. The best
specialization is passed into the Map function of the IGPL
Update job in this way. The partition hash function in the
shuffle phase is modified because the two jobs require that
key-value pairs with the same key.p field, rather than the same
entire key, go to the same Reducer. To reduce communication
traffic, MRTDS exploits the combiner mechanism, which
aggregates the key-value pairs with the same key into one on
the nodes running Map functions. The following are snapshots
of the implementation of the Two-Phase TDS approach for
data anonymization for preserving big data privacy.
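The modified partition function and the combiner can be mimicked in plain Python, outside Hadoop. This is a sketch under assumptions: keys are modelled as hypothetical (p, value) tuples, the values are record counts, and CRC32 is swapped in as a stable hash; none of these details are specified in the text above.

```python
import zlib
from collections import Counter, defaultdict

def partition_by_p(key, num_reducers):
    """Choose a reducer from the key's p field only, not from the entire key."""
    p_field = key[0]
    return zlib.crc32(p_field.encode("utf-8")) % num_reducers

def combine(pairs):
    """Combiner: sum the counts of identical keys on the map side."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return list(totals.items())

def shuffle(map_outputs, num_reducers):
    """Route the combined pairs of every mapper to a reducer index."""
    reducers = defaultdict(list)
    for pairs in map_outputs:
        for key, count in combine(pairs):
            reducers[partition_by_p(key, num_reducers)].append((key, count))
    return reducers

# Hypothetical map output: keys are (p, value) tuples, values are counts.
mapper_a = [
    (("Professional", "Engineer"), 1),
    (("Professional", "Engineer"), 1),
    (("Professional", "Lawyer"), 1),
]
mapper_b = [(("Professional", "Lawyer"), 1), (("Artist", "Dancer"), 1)]
routed = shuffle([mapper_a, mapper_b], num_reducers=3)
```

Here the combiner shrinks the three pairs of `mapper_a` to two before they are transferred, and every key whose p field is "Professional" lands on the same reducer even though the full keys differ.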

VI. CONCLUSION AND FUTURE RESEARCH
CHALLENGES
In this paper, we have examined the scalability problem of
large-scale data anonymization by TDS, and proposed a highly
scalable two-phase TDS approach using MapReduce on cloud.
Data sets are partitioned and anonymized in parallel in the first
phase, producing intermediate results. Then, the intermediate
results are merged and further anonymized to produce
consistent k-anonymous data sets in the second phase. We
have creatively applied MapReduce on cloud to data
anonymization and deliberately designed a group of
innovative MapReduce jobs to concretely achieve the
specialization computation in a highly scalable way.
Experimental results on real-world data sets have
demonstrated that with our approach, the scalability and
efficiency of TDS are improved significantly over existing
approaches. In cloud environments, privacy preservation for
data analysis, sharing and mining is a challenging research issue
due to increasingly large volumes of data sets, thereby
requiring intensive investigation. We will investigate the
adoption of our approach to the bottom-up generalization
algorithms for data anonymization. Based on the contributions
herein, we plan to further explore the next step on scalable
privacy preservation aware analysis and scheduling on large-
scale data sets. Optimized balanced scheduling strategies are
expected to be developed towards overall scalable privacy
preservation aware data set scheduling.
REFERENCES
[1] J. Manyika et al., Big Data: The Next Frontier for
Innovation, Competition, and Productivity. Zürich,
Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] J. Gantz and D. Reinsel, “Extracting value from chaos,”
IDC iView, 2011, pp. 1-12.
[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: Issues,
challenges, tools and good practices,” in Proc. IEEE Int. Conf.
Contemp. Comput., Aug. 2013, pp. 404-409.
[4] B. Matturdi, X. Zhou, S. Li, and F. Lin, “Big data security
and privacy: A review,” China Commun., vol. 11, no. 14, pp.
135-145, Apr. 2014.
[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren,
“Information security in big data: Privacy and data mining,”
IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward scalable
systems for big data analytics: A technology tutorial,” IEEE
Access, vol. 2, pp. 652-687, Jul. 2014.
[7] C. A. Ardagna and E. Damiani, “Business Intelligence
meets Big Data: An Overview on Security and Privacy.”
[8] A. Labrinidis and H. V. Jagadish, “Challenges and
opportunities with big data,” Proceedings of the VLDB
Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[9] M. V. Dijk and A. Juels, “On the impossibility of
cryptography alone for privacy-preserving cloud computing,”
Proceedings of the 5th USENIX Conference on Hot Topics in
Security, Aug. 2010, pp. 1-8.
[10] S. Sagiroglu and D. Sinanc, “Big Data: A Review,” Proc.
International Conference on Collaboration Technologies and
Systems, 2013, pp. 42-47.
[11] Y. Demchenko, P. Grosso, C. De Laat, and P. Membrey,
“Addressing Big Data Issues in Scientific Data Infrastructure,”
Proc. International Conference on Collaboration Technologies
and Systems, 2013, pp. 48-55.
[12] “Top Ten Big Data Security and Privacy Challenges,”
Technical report, Cloud Security Alliance, Nov. 2012.
[13] S. H. Kim, N. U. Kim, and T. M. Chung, “Attribute
Relationship Evaluation Methodology for Big Data Security,”
Proc. International Conference on IT Convergence and
Security (ICITCS), 2013, pp. 1-4.
[14] S. H. Kim, J. H. Eom, and T. M. Chung, “Big Data
Security Hardening Methodology Using Attributes
Relationship,” Proc. International Conference on Information
Science and Applications (ICISA), 2013, pp. 1-2.
[15] H. Takabi, J. B. D. Joshi, and G. Ahn, “Security and
Privacy Challenges in Cloud Computing Environments,” IEEE
Security and Privacy, vol. 8, no. 6, pp. 24-31, Nov. 2010.
[16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan,
“Incognito: Efficient Full-Domain K-Anonymity,” Proc. ACM
SIGMOD Int'l Conf. Management of Data (SIGMOD '05),
pp. 49-60, 2005.
[17] A. Mehmood, I. Natgunanathan, Y. Xiang, G. Hua, and
S. Guo, “Protection of Big Data Privacy,” IEEE Access, 2016.
