IDL - International Digital Library Of
Technology & Research
Volume 1, Issue 6, June 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
IDL - International Digital Library 1 | P a g e Copyright@IDL-2017
Two-Phase TDS Approach for Data Anonymization
To Preserve Big Data Privacy
1.Ambika M Patil, M.Tech Computer Science Engineering, Center for P G Studies Jnana Sangama VTU Belagavi,
Belagavi, INDIA, Ambika702@gmail.com
2.Assistant Prof.Ranjana B Nadagoudar, Computer Science Engineering Department, Center for P G Studies Jnana Sangama
VTU Belagavi, Belagavi, INDIA
3.Dhananjay A Potdar, Dhananjay.potdar@gmail.com
ABSTRACT - As Big Data has become a hot topic in research and
business and is now used in many industries, Big Data security
and privacy have attracted increasing concern. There is an
obvious tension between Big Data security and privacy and the
widespread use of Big Data. Various privacy-preserving
mechanisms have been developed for protecting privacy at
different stages (e.g., data generation, data storage, data
processing) of the big data life cycle. The goal of this paper
is to provide a comprehensive overview of the privacy
preservation mechanisms in big data and to present the
challenges for existing mechanisms. We also illustrate the
infrastructure of big data and the state-of-the-art
privacy-preserving mechanisms in each stage of the big data
life cycle. This paper focuses on the anonymization process and
aims to significantly improve the scalability and efficiency of
TDS (top-down specialization) for data anonymization over
existing approaches. We also discuss the challenges and future
research directions related to preserving privacy in big data.
KEYWORDS - Big data, privacy, big data storage, big data
processing, data anonymization, top-down specialization,
MapReduce, cloud, privacy preservation.
I. INTRODUCTION
As a result of recent technological development, the amount of
data generated by social networking sites, sensor networks, the
Internet, healthcare applications, and many other sources is
increasing significantly day by day. The term “Big Data”
reflects the trend and salient features of the data being
produced from various sources. Big Data can be described by the
“3Vs”, which stand for Volume, Velocity and Variety. Volume
refers to the huge amount of data being produced from multiple
sources. Velocity concerns both how fast data are produced and
collected and how fast some of the collected data change.
Variety refers to the highly distributed and diverse nature of
the data. The data generation rate is growing so rapidly that
it is becoming very difficult to handle using traditional
methods or systems [1]. In the “3Vs” model, Variety covers the
various types of data, including structured, semi-structured
and unstructured data; Volume means the data scale is large;
and Velocity means all Big Data processes must be quick and
timely in order to maximize the value of Big Data, as shown in
Fig. 1. These features, namely that Big Data handles huge
amounts of data and uses various types of data, including
unstructured data and attributes that were never used in the
past, distinguish Big Data from traditional data mining.
In 2011, IDC defined big data as follows: “big data
technologies describe a new generation of technologies and
architectures, designed to economically extract value from very
large volumes of a wide variety of data, by enabling the
high-velocity capture, discovery, and/or analysis” [2].
In this definition, the features of big data may be summarized
as the 4Vs, i.e., Variety, Velocity, Volume and Value, where
Variety, Velocity and Volume have the same meaning as in the
3Vs model and Value refers to the great social value of big
data. The 4Vs model is widely recognized because it highlights
the most critical problem, which is how to discover value from
enormous, diverse, and rapidly generated datasets.
FIGURE 1. Illustration of the 3 V's of big data.
Although big data can be used effectively to better understand
the world and to innovate in various aspects of human activity,
the exploding amount of data has increased the potential for
breaches of individual privacy. For example, Amazon and Google
can learn our shopping preferences and browsing habits. Social
networking sites such as Facebook store a great deal of
information about our personal lives and social relationships.
Popular video sharing websites such as YouTube recommend videos
to us based on our search history. With all the power afforded
by big data, the gathering, storing and reusing of our personal
information for commercial profit threatens our privacy and
security. In 2006, AOL released 20 million search queries from
roughly 650,000 users for research purposes, after removing the
AOL ID and IP address. Nevertheless, it took researchers only a
couple of days to re-identify some of the users. Users' privacy
may be breached under the following circumstances [3]:
I. Personal information, when combined with external
datasets, may lead to the inference of new facts about the
users. Those facts may be private and not meant to be
exposed to others.
II. Personal information is sometimes collected and used to add
value to business. For example, an individual's shopping habits
may disclose a lot of personal information.
III. Sensitive data may be stored and processed in a
location that is not properly secured, and data leakage may
occur during the storage and processing phases.
In order to safeguard big data privacy, numerous mechanisms
have been developed in recent years. These mechanisms can
be grouped based on the stages of the big data life cycle, i.e.,
data generation, data storage, and data processing. In the data
generation phase, access restriction and data falsification
techniques are used to protect privacy. While access
restriction techniques try to limit access to individuals'
private data, data falsification techniques alter the original
data before they are released to a non-trusted party. The
approaches to privacy protection in the data storage phase are
mainly based on encryption. Encryption-based techniques can be
further divided into attribute-based encryption (ABE),
identity-based encryption (IBE), and storage path encryption.
In addition, to protect sensitive information, hybrid clouds
are used, where sensitive data are stored in a private cloud.
The data processing phase includes privacy-preserving data
publishing (PPDP) and knowledge extraction from the data. In
PPDP, anonymization techniques such as generalization and
suppression are used to protect the privacy of data. Ensuring
the utility of the data while preserving privacy is a great
challenge in PPDP. In the knowledge extraction process, several
mechanisms exist to extract useful information from large-scale
and complex data. These mechanisms can be further divided into
clustering, classification and association rule mining based
techniques. While clustering and classification split the input
data into different groups, association rule mining based
techniques find the useful relationships and trends in the
input data.
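As an illustration of the generalization and suppression techniques mentioned for PPDP, the following is a minimal Python sketch. The record fields (age, zip, disease), the ten-year age bands and the three-character prefix are assumptions made for this example only; they are not taken from the approach evaluated in this paper.

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a coarser range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_zip(zip_code: str, keep: int = 3) -> str:
    """Suppression: mask the trailing characters of a postal code."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def anonymize(record: dict) -> dict:
    """Anonymize the quasi-identifiers; keep the sensitive value usable."""
    return {
        "age": generalize_age(record["age"]),
        "zip": suppress_zip(record["zip"]),
        "disease": record["disease"],
    }


# Hypothetical records, used only to show the transformation.
records = [
    {"age": 34, "zip": "590018", "disease": "flu"},
    {"age": 37, "zip": "590010", "disease": "asthma"},
]
released = [anonymize(r) for r in records]
for row in released:
    print(row)
```

After the transformation both records share the same age range and masked postal code, so they can no longer be told apart by these attributes, while the disease column remains available for analysis. Wider bands or shorter prefixes give more privacy at the cost of utility, which is the trade-off noted above.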
FIGURE 2. Illustration of big data life cycle.
Protecting privacy in big data is a fast-growing
research area. Although some related papers have been
published, only a few of them are survey or review papers
[4], [5]. Moreover, while these papers introduced the
basic concepts of privacy protection in big data, they failed to
cover several important aspects of this area. For example,
neither [4] nor [5] provides a detailed discussion of big
data privacy with respect to cloud computing. Besides, neither
paper discussed future challenges in detail.
In this paper, we give a comprehensive overview
of the state-of-the-art technologies to preserve privacy of big
data at each stage of the big data life cycle. This paper focuses
on the anonymization process and aims to significantly improve
the scalability and efficiency of TDS (top-down specialization)
for data anonymization over existing approaches.
The major contributions of our research are threefold.
First, we creatively apply MapReduce on cloud to TDS for
data anonymization and deliberately design a group of
innovative MapReduce jobs to concretely accomplish the
specializations in a highly scalable fashion. Second, we
propose a two-phase TDS approach to gain high scalability by
allowing specializations to be conducted on multiple data
partitions in parallel during the first phase. Third,
implementation results show that our approach can
significantly improve the scalability and efficiency of TDS for
data anonymization over existing approaches.
The remainder of this paper is organized as follows.
Section II describes the infrastructure of big data and the
challenges related to big data privacy that arise from the
underlying structure of cloud computing. Section III reviews
traditional data privacy preservation methods. Section IV
covers privacy preservation for big data, formulates the
two-phase TDS approach, and elaborates the algorithmic details
of the MapReduce jobs. The implementation of our approach is
shown in Section V. Finally, we conclude this paper and discuss
future work in Section VI.
II. INFRASTRUCTURE OF BIG DATA
To handle the different dimensions of big data in terms of
volume, velocity, and variety, we need to design efficient and
effective systems to process large amounts of data arriving at
very high speed from different sources. Big data has to go
through multiple phases during its life cycle, as shown in
Fig. 2. Data are now distributed, and new technologies are
being developed to store and process large repositories of
data. For example, cloud computing technologies, such as Hadoop
MapReduce, are explored for big data storage and processing.
In this section we explain the life cycle of big data. In
addition, we discuss how big data benefits from cloud computing
technologies and the challenges that arise when cloud computing
is used for the storage and processing of big data.
A. LIFE CYCLE OF BIG DATA
Data generation: Data can be generated from many distributed
sources. The amount of data generated by humans and
machines has exploded in the past few years. For example,
2.5 quintillion bytes of data are generated on the web every
day, and 90 percent of the data in the world was generated in
the past few years. Facebook, a social networking site, alone
generates 25 TB of new data every day. Usually, the data
generated are large, diverse and complex. Therefore, it is hard
for traditional systems to handle them. The data generated are
normally associated with a specific domain such as business,
the Internet, research, etc.
Data storage: This phase refers to storing and managing large-
scale data sets. A data storage system consists of two parts i.e.,
hardware infrastructure and data management [6]. Hardware
infrastructure refers to using information and communications
technology (ICT) resources for several tasks (such as
distributed storage). Data management refers to the set of
software positioned on top of hardware infrastructure to
manage and query large scale data sets. It should also provide
many interfaces to interact with and analyze stored data.
Data processing: The data processing phase refers to the
process of data collection, data transmission, pre-processing
and extraction of useful information. Data collection is needed
because data may come from various sources,
e.g., sites that contain text, images and videos. In the data
collection phase, data are acquired from a specific data
production environment using dedicated data collection
technology. In the data transmission phase, after collecting raw
data from a specific data production environment, we need a
high-speed transmission mechanism to transmit the data into
proper storage for different types of analytic applications.
Finally, the pre-processing phase aims at removing
meaningless and redundant parts of the data so that
storage space can be saved.
Massive data and domain-specific analytical
methods are used by various applications to derive significant
information. Although different fields in data analytics require
different data characteristics, a few of these fields may leverage
similar underlying technology to inspect, transform and model
data to extract value from it. Emerging data analytics research
can be categorized into the following six technical areas:
structured data analytics, text analytics, multimedia analytics,
web analytics, network analytics, and mobile analytics [6].
B. CHALLENGES OF BIG DATA
The application of Big Data leads to a set of new
challenges, since Big Data sets are so large and complex
that they are difficult to acquire, store, manage and
analyze. The main challenges are listed as follows [7][8]:
1. Data preparation. According to the definition of strong and
accurate techniques for big data, an important basis of big data
analysis and management is the availability of high-quality,
precise, and trustworthy data. Data preparation is paramount
for increasing the value of big data.
2. Efficient distributed storage and search. Timeliness of data
collection is fundamental to offer fast analysis of big data.
Therefore, there is an increasing need of providing efficient
distributed storage with faster memories and enhancing search
algorithms.
3. Effective online data analysis. Online analysis of
multidimensional data becomes a must and potential source of
information for decision making. This would require adapting
existing OLAP approaches to big data.
4. Effective machine learning techniques for big data mining.
Machine learning and data mining should be adapted to big
data to unleash the full potential of collected data.
5. Efficient handling of big data streams. Some specific
scenarios (e.g., stock exchange) would require analysis of data
in the form of streams. Fast and optimized solutions should be
developed to make inference on big data streams.
6. Semantic lifting techniques. The semantics of collected big
data represent an important aspect for the future development of
big data applications. Future approaches to big data analysis
should be able to handle these semantics.
7. Programming models. Many programming models for big
data infrastructures are available. Some examples include
MapReduce and Hadoop. We should consider different
approaches for storing and managing data.
8. Social analytics. The ability to identify data that
can be trusted and that comply with users' needs and preferences
is important as well as difficult to achieve. Social analytics
should therefore address this problem by providing correct and
sound approaches to social data analysis.
9. Security and privacy. Big data are a priceless source of
information. However, they often contain sensitive information
that needs to be protected from unauthorized access and
release.
III. TRADITIONAL DATA PRIVACY PRESERVATION
METHODS
Cryptography refers to a set of techniques and algorithms for
protecting data. In cryptography, plaintext is transformed into
ciphertext using various encryption schemes. Numerous
methods build on this idea, such as public key
cryptography and digital signatures.
Cryptography alone cannot enforce the privacy demanded by
common cloud computing and big data services [9]. This is
because big data differs from traditional large data sets on the
basis of the three V's (velocity, variety, volume) [10, 11]. It is
these features that make big data architecture
different from traditional information architectures. These
changes in architecture and its complex nature mean that
cryptography and traditional encryption schemes do not scale
up to the privacy needs of big data.
The challenge with cryptography is the all-or-nothing retrieval
policy of encrypted data [12]. Less sensitive data that could
be useful in big data analytics are also encrypted, and users
are not allowed to access them. Encryption makes data
unreachable to those who do not have access to the decryption
key. Privacy may also be breached if data are stolen before
encryption or if cryptographic keys are misused.
Attribute-based encryption can also be used for big data
privacy [13, 14]. This method of securing big data is based on
relationships among the attributes present in big data. The
attributes that need to be protected are identified based on the
type of big data and company policies.
In a nutshell, encryption or cryptography alone cannot serve as
a big data privacy preservation method. These techniques can
support data anonymization but cannot be used directly for big
data privacy.
IV. PRIVACY PRESERVATION FOR BIG DATA
TWO-PHASE TOP-DOWN SPECIALIZATION (TPTDS)
The sketch of the TPTDS approach is shown in Fig. 3. The
TPTDS approach has three components, namely, data partition,
anonymization level (AL) merging, and data specialization.
FIGURE 3. Execution framework overview of MRTDS.
We propose the TPTDS approach to conduct the computation
required in TDS in a highly scalable and efficient fashion. The
two phases of our approach are based on the two levels of
parallelization provisioned by MapReduce on cloud. Essentially,
MapReduce on cloud has two levels of parallelization, i.e., job
level and task level. Job level parallelization means that
multiple MapReduce jobs can be executed simultaneously to make
full use of cloud infrastructure resources. Combined with cloud,
MapReduce becomes more powerful and elastic, as cloud can
offer infrastructure resources on demand, for example, through
the Amazon Elastic MapReduce service. Task level
parallelization means that multiple Mapper/Reducer tasks in a
MapReduce job are executed simultaneously over data splits.
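The two levels of parallelization can be illustrated with a small Python sketch. It is not Hadoop code: thread pools stand in for cluster resources, and the names map_task and run_job, as well as the toy data, are assumptions introduced only for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """One Mapper task: count the attribute values in its data split."""
    return Counter(split)


def run_job(records, n_splits=2):
    """One MapReduce-style job. Mapper tasks run in parallel (task level)."""
    splits = [records[i::n_splits] for i in range(n_splits)]
    with ThreadPoolExecutor(max_workers=n_splits) as tasks:
        partial_counts = list(tasks.map(map_task, splits))
    total = Counter()  # Reducer: sum the partial counts per value
    for partial in partial_counts:
        total.update(partial)
    return total


# Two hypothetical data partitions, each processed by its own job.
partitions = [["a", "b", "a"], ["b", "b", "c"]]

# Several jobs run at the same time (job level).
with ThreadPoolExecutor(max_workers=len(partitions)) as jobs:
    results = list(jobs.map(run_job, partitions))

for counts in results:
    print(dict(counts))
```

The outer pool corresponds to job level parallelization (one job per data partition), and the inner pool corresponds to task level parallelization (several Mapper tasks inside one job).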
To achieve high scalability, we parallelize multiple
jobs on data partitions in the first phase, but the resultant
anonymization levels are not identical. To obtain
finally consistent anonymous data sets, the second phase is
essential to integrate the intermediate results and further
anonymize the entire data set. In the first phase, we run a
subroutine over each of the partitioned data sets in parallel to
make full use of the job level parallelization of MapReduce. The
subroutine is a MapReduce version of centralized TDS (MRTDS),
which concretely conducts the computation required in TPTDS.
MRTDS anonymizes data partitions to generate intermediate
anonymization levels. An intermediate anonymization level
means that further specialization can be performed without
violating k-anonymity. MRTDS only leverages the task level
parallelization of MapReduce.
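The k-anonymity requirement referred to above can be checked with a few lines of Python. This is a minimal sketch under assumed inputs: the quasi-identifier columns (age, zip) and the sample table are hypothetical. A data set is k-anonymous when every combination of quasi-identifier values is shared by at least k records.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier group has at least k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(size >= k for size in groups.values())


# Hypothetical, already generalized table.
table = [
    {"age": "30-39", "zip": "590***", "disease": "flu"},
    {"age": "30-39", "zip": "590***", "disease": "cold"},
    {"age": "40-49", "zip": "590***", "disease": "flu"},
    {"age": "40-49", "zip": "590***", "disease": "asthma"},
]

print(is_k_anonymous(table, ["age", "zip"], k=2))
print(is_k_anonymous(table, ["age", "zip"], k=3))
```

In this table each group holds two records, so the check passes for k = 2 and fails for k = 3. An intermediate anonymization level is one at which the check still passes and at least one further specialization would keep it passing.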
ALGORITHM 1. SKETCH OF TWO-PHASE TDS (TPTDS).
Input: Data set D, anonymity parameters k, k^I and the
number of partitions p.
Output: Anonymous data set D*.
1: Partition D into D_i, 1 ≤ i ≤ p.
2: Execute MRTDS(D_i, k^I, AL^0) → AL'_i, 1 ≤ i ≤ p, in
parallel as multiple MapReduce jobs.
3: Merge all intermediate anonymization levels into one,
Merge(AL'_1, AL'_2, ..., AL'_p) → AL^I.
4: Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.
5: Specialize D according to AL*, output D*.
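The five steps of Algorithm 1 can be walked through with a single-machine Python sketch. Several simplifications are assumed and should be read as such: there is one quasi-identifier with a hypothetical taxonomy tree, the function tds is a greedy stand-in for MRTDS that applies any specialization preserving k-anonymity (it does not compute the IGPL score), a thread pool replaces parallel MapReduce jobs, and records are assigned to partitions at random.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical taxonomy tree: parent -> children. Leaves are raw values.
TAXONOMY = {
    "Any": ["Professional", "Artist"],
    "Professional": ["Engineer", "Lawyer"],
    "Artist": ["Dancer", "Writer"],
}
PARENT = {c: p for p, children in TAXONOMY.items() for c in children}


def generalize(value, cut):
    """Climb the taxonomy from a raw value to its ancestor in the cut."""
    while value not in cut:
        value = PARENT[value]
    return value


def is_k_anonymous(values, cut, k):
    counts = Counter(generalize(v, cut) for v in values)
    return all(c >= k for c in counts.values())


def tds(values, k, cut):
    """Greedy stand-in for MRTDS: specialize while k-anonymity holds."""
    cut, changed = set(cut), True
    while changed:
        changed = False
        for node in sorted(cut):
            if node in TAXONOMY:
                candidate = (cut - {node}) | set(TAXONOMY[node])
                if is_k_anonymous(values, candidate, k):
                    cut, changed = candidate, True
                    break
    return cut


def merge(cuts):
    """Merge cuts by keeping, on every path, the more general node."""
    union = set().union(*cuts)

    def has_ancestor_in_union(node):
        while node in PARENT:
            node = PARENT[node]
            if node in union:
                return True
        return False

    return {n for n in union if not has_ancestor_in_union(n)}


def tptds(values, k, k_i, p, seed=0):
    rng = random.Random(seed)
    parts = [[] for _ in range(p)]
    for v in values:                                   # step 1: partition D
        parts[rng.randrange(p)].append(v)
    with ThreadPoolExecutor() as pool:                 # step 2: parallel jobs
        levels = list(pool.map(lambda d: tds(d, k_i, {"Any"}), parts))
    al_i = merge(levels)                               # step 3: merge levels
    al_star = tds(values, k, al_i)                     # step 4: finish on D
    return [generalize(v, al_star) for v in values], al_star  # step 5


values = ["Engineer"] * 4 + ["Lawyer"] * 4 + ["Dancer"] + ["Writer"] * 3
released, al_star = tptds(values, k=2, k_i=2, p=2)
print(sorted(al_star), Counter(released))
```

In this toy data set the single Dancer record cannot form a group of size two on its own, so the final anonymization level keeps Artist unspecialized while Professional is specialized into Engineer and Lawyer. Choosing k_i no smaller than k keeps the merged level k-anonymous on the whole data set, because merging only moves toward more general values.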
Modules Description:
Data Partition:
o In this module, data partitioning is performed on the
cloud.
o Here we collect a large number of data sets.
o We split the large data set into small data sets.
o Then we assign a random number to each data
set.
Anonymization:
o After getting the individual data sets, we apply
anonymization. Anonymization means hiding or
removing the sensitive fields in the data sets.
o Then we get the intermediate results for the small
data sets. The intermediate results are used for the
specialization process.
o All intermediate anonymization levels are merged
into one in the second phase. The merging of
anonymization levels is completed by merging cuts.
To ensure that the merged intermediate
anonymization level ALI (Anonymization Service
Level Improve) never violates privacy requirements,
the more general one is selected as the merged one.
Merging:
o The intermediate results of the small data
sets are merged here.
o The MRTDS driver is used to organize the small
intermediate results for merging; the merged data sets
are collected on cloud.
o Anonymization is applied again to the merging result;
this step is called specialization.
Specialization:
o After getting the intermediate results, those results are
merged into one.
o Then we again apply anonymization to the
merged data; this is called specialization.
o Here we use two kinds of jobs, namely IGPL
Update and IGPL Initialization.
o The jobs are organized by web using the driver.
OBS:
OBS stands for optimized balancing scheduling.
o Here we focus on two kinds of scheduling,
by time and by size.
o Here data sets are split into the specified size and
anonymization is applied at the specified time.
o The OBS approach aims to deliver high capability in
handling large data sets.
V. IMPLEMENTATION AND IMPROVEMENT
To elaborate how data sets are processed in
MRTDS, the execution framework based on standard
MapReduce is depicted in Fig. 3. The solid arrow lines
represent the data flows within the canonical
MapReduce framework. From Fig. 3, we can see that
the iteration of MapReduce jobs is controlled by the
anonymization level AL in the Driver. The data flows for
handling iterations are represented by dotted arrow lines. AL
is dispatched from the Driver to all workers, including Mappers
and Reducers, via the distributed cache mechanism. The value of
AL is modified in the Driver according to the output of the IGPL
Initialization or IGPL Update jobs. Because the amount of
such data is extremely small compared with the data
sets that will be anonymized, it can be efficiently
transmitted between the Driver and workers. We
adopt Hadoop, an open-source
implementation of MapReduce, to implement MRTDS. Since
most of the Map and Reduce functions need to
access the current anonymization level AL, we
use the distributed cache mechanism to pass the content of
AL to every Mapper or Reducer node, as shown in Fig. 3.
Also, Hadoop provides a mechanism to set simple
global variables for Mappers and Reducers. The
best specialization is passed into the Map function of the IGPL
Update job in this way. The partition hash function
in the shuffle phase is modified because the two jobs require that
the key-value pairs with the same key.p field, rather than the
entire key, go to the same Reducer. To reduce
communication traffic, MRTDS exploits the combiner
mechanism, which aggregates the key-value pairs with the same
key into one on the nodes running Map functions. The
following are snapshots of the implementation of the Two-Phase
TDS approach for data anonymization for preserving big data
privacy.
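Separately from those snapshots, the modified partition function and the combiner described in this section can be illustrated in plain Python, without Hadoop. The key layout is an assumption made for this sketch: each key is a tuple (p, c) whose first element plays the role of the key.p field, and the function names partition_by_p, combine and shuffle are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_by_p(key, n_reducers):
    """Choose a Reducer from the p field only, not from the entire key."""
    p, _child = key
    return zlib.crc32(p.encode("utf-8")) % n_reducers


def combine(pairs):
    """Combiner: sum the values that share the same full key on a Map node."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def shuffle(combined, n_reducers):
    """Group the combined pairs by the Reducer they are routed to."""
    buckets = defaultdict(dict)
    for key, value in combined.items():
        buckets[partition_by_p(key, n_reducers)][key] = value
    return dict(buckets)


# Hypothetical Map output: ((p, c), count) pairs.
map_output = [
    (("Artist", "Dancer"), 1),
    (("Artist", "Writer"), 1),
    (("Artist", "Writer"), 1),
    (("Professional", "Engineer"), 1),
]

combined = combine(map_output)
buckets = shuffle(combined, n_reducers=4)
print(combined)
print(buckets)
```

The combiner shrinks the four Map output pairs to three before they leave the Map node, and every key whose p field is Artist lands in the same bucket even though the full keys differ.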
VI. CONCLUSION AND FUTURE RESEARCH
CHALLENGES
In this paper, we have examined the scalability problem of
large-scale data anonymization by TDS, and proposed a highly
scalable two-phase TDS approach using MapReduce on cloud.
Data sets are partitioned and anonymized in parallel in the first
phase, producing intermediate results. Then, the intermediate
results are merged and further anonymized to produce
consistent k-anonymous data sets in the second phase. We
have creatively applied MapReduce on cloud to data
anonymization and deliberately designed a group of
innovative MapReduce jobs to concretely achieve the
specialization computation in a highly scalable way.
Experimental results on real-world data sets have
demonstrated that with our approach, the scalability and
efficiency of TDS are improved significantly over existing
approaches. In cloud environment, the privacy preservation for
data analysis, share and mining is a challenging research issue
due to increasingly larger volumes of data sets, thereby
requiring intensive investigation. We will investigate the
adoption of our approach to the bottom-up generalization
algorithms for data anonymization. Based on the contributions
herein, we plan to further explore the next step on scalable
privacy preservation aware analysis and scheduling on large-
scale data sets. Optimized balanced scheduling strategies are
expected to be developed towards overall scalable privacy
preservation aware data set scheduling.
REFERENCES
[1] J. Manyika et al., Big Data: The Next Frontier for
Innovation, Competition, and Productivity. Zürich,
Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] J. Gantz and D. Reinsel, "Extracting value from chaos,"
IDC iView, 2011, pp. 1-12.
[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues,
challenges, tools and good practices," in Proc. IEEE Int. Conf.
Contemp. Comput., Aug. 2013, pp. 404-409.
[4] B. Matturdi, X. Zhou, S. Li, and F. Lin, "Big data security
and privacy: A review," China Commun., vol. 11, no. 14, pp.
135-145, Apr. 2014.
[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren,
"Information security in big data: Privacy and data mining,"
IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable
systems for big data analytics: A technology tutorial," IEEE
Access, vol. 2, pp. 652-687, Jul. 2014.
[7] C. A. Ardagna and E. Damiani, "Business intelligence meets
big data: An overview on security and privacy."
[8] A. Labrinidis and H. V. Jagadish, "Challenges and
opportunities with big data," Proceedings of the VLDB
Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[9] M. van Dijk and A. Juels, "On the impossibility of
cryptography alone for privacy-preserving cloud computing,"
Proceedings of the 5th USENIX Conference on Hot Topics in
Security, Aug. 2010, pp. 1-8.
[10] S. Sagiroglu and D. Sinanc, "Big data: A review," Proc.
International Conference on Collaboration Technologies and
Systems, 2013, pp. 42-47.
[11] Y. Demchenko, P. Grosso, C. de Laat, and P. Membrey,
"Addressing big data issues in scientific data infrastructure,"
Proc. International Conference on Collaboration Technologies
and Systems, 2013, pp. 48-55.
[12] Top Ten Big Data Security and Privacy Challenges,
Technical report, Cloud Security Alliance, Nov. 2012.
[13] S. H. Kim, N. U. Kim, and T. M. Chung, "Attribute
relationship evaluation methodology for big data security,"
Proc. International Conference on IT Convergence and
Security (ICITCS), 2013, pp. 1-4.
[14] S. H. Kim, J. H. Eom, and T. M. Chung, "Big data security
hardening methodology using attributes relationship," Proc.
International Conference on Information Science and
Applications (ICISA), 2013, pp. 1-2.
[15] H. Takabi, J. B. D. Joshi, and G. Ahn, "Security and
privacy challenges in cloud computing environments," IEEE
Security and Privacy, vol. 8, no. 6, pp. 24-31, Nov. 2010.
[16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan,
"Incognito: Efficient full-domain k-anonymity," Proc. ACM
SIGMOD Int'l Conf. Management of Data (SIGMOD '05),
pp. 49-60, 2005.
[17] A. Mehmood, I. Natgunanathan, Y. Xiang, G. Hua, and
S. Guo, "Protection of big data privacy," IEEE Access, 2016.

Big data pptBig data ppt
Big data ppt
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 
IRJET- Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
IRJET-  	  Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...IRJET-  	  Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
IRJET- Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
 

Similar to Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy

RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWieijjournal
 
Research in Big Data - An Overview
Research in Big Data - An OverviewResearch in Big Data - An Overview
Research in Big Data - An Overviewieijjournal
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWieijjournal1
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWieijjournal
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfvvpadhu
 
Big Data: Privacy and Security Aspects
Big Data: Privacy and Security AspectsBig Data: Privacy and Security Aspects
Big Data: Privacy and Security AspectsIRJET Journal
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A reviewShilpa Soi
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesKaran Deep Singh
 
A Deep Dissertion Of Data Science Related Issues And Its Applications
A Deep Dissertion Of Data Science  Related Issues And Its ApplicationsA Deep Dissertion Of Data Science  Related Issues And Its Applications
A Deep Dissertion Of Data Science Related Issues And Its ApplicationsTracy Hill
 
A Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesA Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesDr. Amarjeet Singh
 
Big Data A Review
Big Data A ReviewBig Data A Review
Big Data A Reviewijtsrd
 
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...Arab Federation for Digital Economy
 

Similar to Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy (20)

RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEW
 
Research in Big Data - An Overview
Research in Big Data - An OverviewResearch in Big Data - An Overview
Research in Big Data - An Overview
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEW
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEW
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
 
Big Data: Privacy and Security Aspects
Big Data: Privacy and Security AspectsBig Data: Privacy and Security Aspects
Big Data: Privacy and Security Aspects
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A review
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
big-data.pdf
big-data.pdfbig-data.pdf
big-data.pdf
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and Issues
 
Big data survey
Big data surveyBig data survey
Big data survey
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
A Deep Dissertion Of Data Science Related Issues And Its Applications
A Deep Dissertion Of Data Science  Related Issues And Its ApplicationsA Deep Dissertion Of Data Science  Related Issues And Its Applications
A Deep Dissertion Of Data Science Related Issues And Its Applications
 
A Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesA Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: Challenges
 
Big data Paper
Big data PaperBig data Paper
Big data Paper
 
Big Data A Review
Big Data A ReviewBig Data A Review
Big Data A Review
 
Data Mining Applications And Feature Scope Survey
Data Mining Applications And Feature Scope SurveyData Mining Applications And Feature Scope Survey
Data Mining Applications And Feature Scope Survey
 
Sample
Sample Sample
Sample
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
Privacy in the Age of Big Data: Exploring the Role of Modern Identity Managem...
 

Recently uploaded

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIkoyaldeepu123
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture designssuser87fa0c1
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 

Recently uploaded (20)

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AI
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture design
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 

Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy

IDL - International Digital Library Of Technology & Research
Volume 1, Issue 6, June 2017. Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Copyright@IDL-2017

1. Ambika M Patil, M.Tech Computer Science Engineering, Center for P G Studies Jnana Sangama VTU Belagavi, Belagavi, INDIA, Ambika702@gmail.com
2. Assistant Prof. Ranjana B Nadagoudar, Computer Science Engineering Department, Center for P G Studies Jnana Sangama VTU Belagavi, Belagavi, INDIA
3. Dhananjay A Potdar, Dhananjay.potdar@gmail.com

ABSTRACT - As Big Data has become a hot topic in research and business and is now used across many industries, Big Data security and privacy have become a growing concern. There is an obvious tension between Big Data security and privacy and the widespread use of Big Data. Various privacy-preserving mechanisms have been developed to protect privacy at different stages of the big data life cycle (e.g., data generation, data storage, data processing). The goal of this paper is to provide a complete overview of the privacy preservation mechanisms in big data and to present the challenges facing existing mechanisms. We also illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. This paper focuses on the anonymization process and significantly improves the scalability and efficiency of TDS (top-down specialization) for data anonymization over existing approaches. We also discuss the challenges and future research directions related to preserving privacy in big data.

KEYWORDS - Big data, privacy, big data storage, big data processing, data anonymization, top-down specialization, MapReduce, cloud, privacy preservation.

I. INTRODUCTION
As a result of recent technological development, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing significantly day by day. The term “Big Data” reflects the trend and salient features of the data being produced from these sources. Big Data is commonly described by the “3Vs”: Volume, Velocity, and Variety. Volume refers to the huge amount of data produced from multiple sources. Velocity concerns both how fast data is produced and collected and how fast some of the collected data changes. Variety refers to the highly distributed and heterogeneous nature of the data. The data generation rate is growing so rapidly that it is becoming very difficult to handle with traditional methods or systems [1].

In the “3Vs” model, Variety covers structured, semi-structured, and unstructured data; Volume means the data scale is large; and Velocity means all Big Data processes must be quick and timely in order to maximize the value of Big Data, as shown in Fig. 1. Handling huge amounts of data and using many types of data, including unstructured data and attributes that were never used in the past, is what distinguishes Big Data from traditional data mining.

In 2011, IDC defined big data as follows: “big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis” [2]. Under this definition, the features of big data can be summarized as the 4Vs: Variety, Velocity, Volume, and Value. Variety, Velocity, and Volume have the same meaning as in the 3Vs model, and Value refers to the great social value of big data. The 4Vs model is widely recognized because it points to the most critical problem: how to discover value from enormous, varied, and rapidly generated datasets.
FIGURE 1. Illustration of the 3 V's of big data.

Although big data can help us better understand the world and innovate in various aspects of human activity, the exploding amount of data has increased the potential for breaches of individual privacy. For example, Amazon and Google can learn our shopping preferences and browsing habits. Social networking sites such as Facebook store information about our personal lives and social relationships. Popular video sharing websites such as YouTube recommend videos based on our search history. With all the power that big data provides, gathering, storing, and reusing personal information for commercial profit threatens our privacy and security. In 2006, AOL released 20 million search queries from about 650,000 users for research purposes, after removing the AOL ID and IP address. Nevertheless, it took researchers only a couple of days to re-identify users. Users' privacy may be breached under the following circumstances [3]:

I. Personal information, when combined with external datasets, may lead to the inference of new facts about the users. Those facts may be secret and not meant to be exposed to others.
II. Personal information is sometimes collected and used to add value to a business. For example, an individual's shopping habits may disclose a lot of personal information.
III. Sensitive data are stored and processed in a location that is not properly secured, and data leakage may occur during the storage and processing phases.

In order to safeguard big data privacy, numerous mechanisms have been developed in recent years.
These mechanisms can be grouped by the stage of the big data life cycle they address: data generation, data storage, and data processing. In the data generation phase, access restriction and data falsification techniques are used to protect privacy. Access restriction techniques try to limit access to individuals' private data, while data falsification techniques alter the original data before they are released to a non-trusted party. The approaches to privacy protection in the data storage phase are mainly based on encryption. Encryption-based techniques can be further divided into attribute-based encryption (ABE), identity-based encryption (IBE), and storage path encryption. In addition, hybrid clouds are used to protect sensitive information, with sensitive data stored in the private cloud. The data processing phase includes privacy-preserving data publishing (PPDP) and knowledge extraction from the data. In PPDP, anonymization techniques such as generalization and suppression are used to protect the privacy of data. Ensuring the utility of the data while preserving privacy is a great challenge in PPDP. In the knowledge extraction process, several mechanisms exist to extract useful information from large-scale and complex data. These mechanisms can be further divided into clustering, classification, and association rule mining based techniques. While clustering and classification split the input data into different groups, association rule mining based techniques find the useful relationships and trends in the input data.

FIGURE 2. Illustration of big data life cycle.

Protecting privacy in big data is a fast-growing research area. Although some related papers have been published, only a few of them are survey or review papers [4], [5]. Moreover, while these papers introduced the basic concepts of privacy protection in big data, they failed to cover several important aspects of this area.
For example, neither [4] nor [5] provides a detailed discussion of big data privacy with respect to cloud computing. Besides, none of the papers discussed future challenges in detail. In this paper, we give a comprehensive overview of the state-of-the-art technologies for preserving the privacy of big data at each stage of the big data life cycle. This paper focuses on the anonymization process and significantly improves the scalability and efficiency of TDS (top-down specialization) for data anonymization over existing approaches.

The major contributions of our research are threefold. First, we creatively apply MapReduce on cloud to TDS for data anonymization and deliberately design a group of innovative MapReduce jobs to accomplish the specializations in a highly scalable fashion. Second, we propose a two-phase TDS approach that gains high scalability by allowing specializations to be conducted on multiple data partitions in parallel during the first phase. Third, implementation results show that our approach can significantly improve the scalability and efficiency of TDS for data anonymization over existing approaches.

The remainder of this paper is organized as follows. Section II describes the infrastructure of big data and the privacy challenges that arise from the underlying structure of cloud computing. Section III reviews traditional data privacy preservation methods. Section IV covers privacy preservation for big data, formulates the two-phase TDS approach, and elaborates the algorithmic details of the MapReduce jobs. Section V shows the implementation of our approach. Finally, Section VI concludes the paper and discusses future work.

II. INFRASTRUCTURE OF BIG DATA
To handle the different dimensions of big data in terms of volume, velocity, and variety, we need to design efficient and effective systems that process large amounts of data arriving at very high speed from different sources. Big data goes through multiple phases during its life cycle, as shown in Figure 2. Data are distributed nowadays, and new technologies are being developed to store and process large repositories of data. For example, cloud computing technologies such as Hadoop MapReduce are explored for big data storage and processing. In this section we explain the life cycle of big data. We also discuss how big data benefits from cloud computing technologies and the challenges that arise when cloud computing is used for the storage and processing of big data.

A. LIFE CYCLE OF BIG DATA
Data generation: Data can be generated from many distributed sources.
The amount of data generated by humans and machines has exploded in the past few years. For example, 2.5 quintillion bytes of data are generated on the web every day, and 90 percent of the data in the world was generated in the past few years. Facebook, a social networking site, alone generates 25 TB of new data every day. The data generated is usually large, diverse, and complex, so it is hard for traditional systems to handle. The data generated are normally associated with a specific domain such as business, Internet, or research.

Data storage: This phase refers to storing and managing large-scale data sets. A data storage system consists of two parts: hardware infrastructure and data management [6]. Hardware infrastructure refers to the use of information and communications technology (ICT) resources for tasks such as distributed storage. Data management refers to the set of software deployed on top of the hardware infrastructure to manage and query large-scale data sets. It should also provide interfaces for interacting with and analyzing the stored data.

Data processing: The data processing phase refers to data collection, data transmission, pre-processing, and the extraction of useful information. Data collection is needed because data may come from various sources, such as sites that contain text, images, and videos. In the data collection phase, data are acquired from a specific data production environment using dedicated data collection technology. In the data transmission phase, after raw data has been collected from a specific data production environment, a high-speed transmission mechanism is needed to move the data into suitable storage for different types of analytic applications. Finally, the pre-processing phase aims at removing meaningless and redundant parts of the data so that storage space can be saved.
Extensive data and domain-specific analytical methods are used by various applications to derive significant information. Although different fields in data analytics require different data characteristics, some of these fields may leverage similar underlying technology to inspect, transform, and model data in order to extract value from it. Emerging data analytics research can be categorized into the following six technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics [6].

B. CHALLENGES OF BIG DATA
The application of Big Data leads to a set of new challenges, since Big Data sets are so large and complex that they are difficult to acquire, store, manage, and analyze. The main challenges are listed below [7][8]:
1. Data preparation. For robust and accurate big data techniques, an important basis of big data analysis and management is the availability of high-quality, precise, and trustworthy data. Data preparation is paramount for increasing the value of big data.
2. Efficient distributed storage and search. Timeliness of data collection is fundamental to fast analysis of big data. Therefore, there is an increasing need for efficient distributed storage with faster memories and for improved search algorithms.
3. Effective online data analysis. Online analysis of multidimensional data is becoming a necessity and a potential source of information for decision making. This would require adapting existing OLAP approaches to big data.
4. Effective machine learning techniques for big data mining. Machine learning and data mining should be adapted to big data to unleash the full potential of collected data.
5. Efficient handling of big data streams. Some specific scenarios (e.g., stock exchange) require analysis of data in the form of streams. Fast and optimized solutions should be developed to make inferences on big data streams.
6. Semantic lifting techniques. The semantics of collected big data is an important aspect for the future development of big data applications. Future approaches to big data analysis should be able to manage these semantics.
7. Programming models. Many programming models for big data infrastructures are available. Some examples include MapReduce and Hadoop. We should consider different approaches for storing and managing data.
8. Social analytics. The ability to identify data that can be trusted and that complies with users' needs and preferences is important as well as difficult to achieve. Social analytics should address this problem by providing correct and sound approaches to social data analysis.
9. Security and privacy. Big data are a priceless source of information. However, they often contain sensitive information that needs to be protected from unauthorized access and release.

III. TRADITIONAL DATA PRIVACY PRESERVATION METHODS
Cryptography refers to a set of techniques and algorithms for protecting data. In cryptography, plaintext is transformed into ciphertext using various encryption schemes. Numerous methods build on this scheme, such as public key cryptography and digital signatures. Cryptography alone cannot enforce the privacy demanded by common cloud computing and big data services [9]. This is because big data differs from traditional large data sets on the basis of the three V's (velocity, variety, volume) [10, 11].
It is these features of big data that make big data architecture different from traditional information architectures. These architectural changes and the complex nature of big data mean that cryptography and traditional encryption schemes do not scale to the privacy needs of big data.

The challenge with cryptography is the all-or-nothing retrieval policy of encrypted data [12]. Less sensitive data that could be useful in big data analytics is also encrypted, and the user is not allowed to access it. The data becomes unreachable to anyone who does not have the decryption key. Privacy may also be breached if data is stolen before encryption or if cryptographic keys are misused.

Attribute-based encryption can also be used for big data privacy [13, 14]. This method of securing big data is based on relationships among the attributes present in big data. The attributes that need to be protected are identified based on the type of big data and on company policies.

In a nutshell, encryption or cryptography alone cannot serve as a big data privacy preservation method. It can support data anonymization but cannot be used directly for big data privacy.

IV. PRIVACY PRESERVATION FOR BIG DATA

TWO-PHASE TOP-DOWN SPECIALIZATION (TPTDS)

The sketch of the TPTDS approach is shown in Figure 3. The approach has three components: data partition, anonymization level (AL) merging, and data specialization.

Figure 3. Execution framework overview of MRTDS.

We propose the TPTDS approach to conduct the computation required in TDS in a highly scalable and efficient fashion. The two phases of our approach are based on the two levels of parallelization provisioned by MapReduce on cloud: job level and task level. Job-level parallelization means that many MapReduce jobs can be executed simultaneously to make full use of cloud infrastructure resources.
Combined with cloud, MapReduce becomes more powerful and elastic, as the cloud can offer infrastructure resources on demand, for example through the Amazon Elastic MapReduce service. Task-level parallelization means that many Mapper/Reducer tasks in a MapReduce job are executed simultaneously over data splits. To achieve high scalability, we parallelize multiple jobs on data partitions in the first phase, but the resultant anonymization levels are not identical. To obtain finally consistent anonymous data sets, the second phase is essential to integrate the intermediate results and further anonymize the entire data sets.

In the first phase, we run a subroutine over each of the partitioned data sets in parallel to make full use of the job-level parallelization of MapReduce. The subroutine is a MapReduce version of centralized TDS (MRTDS), which concretely conducts the computation required in TPTDS. MRTDS anonymizes data partitions to generate intermediate anonymization levels. An intermediate anonymization level means that further specialization can be performed without violating k-anonymity. MRTDS itself leverages only the task-level parallelization of MapReduce.

ALGORITHM 1. SKETCH OF TWO-PHASE TDS (TPTDS).
Input: data set D, anonymity parameters k, k^I, and the number of partitions p.
Output: anonymous data set D*.
1: Partition D into D_i, 1 ≤ i ≤ p.
2: Execute MRTDS(D_i, k^I, AL^0) → AL'_i, 1 ≤ i ≤ p, in parallel as multiple MapReduce jobs.
3: Merge all intermediate anonymization levels into one: Merge(AL'_1, AL'_2, ..., AL'_p) → AL^I.
4: Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.
5: Specialize D according to AL*, and output D*.

Modules Description:

Data Partition:
o In this module, the data partition is performed on the cloud.
o Here we collect a large number of data sets.
o We split the large data sets into small data sets.
o Then we assign a random number to each data set.

Anonymization:
o After getting the individual data sets, we apply anonymization. Anonymization means hiding or removing the sensitive fields in the data sets.
o Then we get the intermediate results for the small data sets; the intermediate results are used for the specialization process.
o All intermediate anonymization levels are merged into one in the second phase. The merging of anonymization levels is completed by merging cuts. To ensure that the merged intermediate anonymization level AL^I never violates privacy requirements, the more general one is selected as the merged one.

Merging:
o The intermediate results of the small data sets are merged here.
o The MRTDS driver is used to organize the small intermediate results for merging; the merged data sets are collected on cloud.
o The merged result is anonymized again; this step is called specialization.

Specialization:
o After getting the intermediate results, those results are merged into one.
o Then we again apply anonymization on the merged data; this is called specialization.
o Here we use two kinds of jobs: IGPL Update and IGPL Initialization.
o The jobs are organized by the driver.

OBS:
o OBS stands for optimized balanced scheduling.
o Here we focus on two kinds of scheduling, by time and by size.
o Data sets are split into the specified size, and anonymization is applied at the specified time.
o The OBS approach is intended to deliver high capability in handling large data sets.

V. IMPLEMENTATION AND IMPROVEMENT

To elaborate how data sets are processed in MRTDS, the execution framework based on standard MapReduce is depicted in Fig. 1. The solid arrow lines represent the data flows in the canonical MapReduce framework. From Fig. 1, we can see that the iteration of MapReduce jobs is controlled by the anonymization level AL in Driver. The data flows for handling iterations are represented by dotted arrow lines. AL is sent from Driver to all workers, including Mappers and Reducers, via the distributed cache mechanism. The value of AL is modified in Driver according to the output of the IGPL Initialization or IGPL Update jobs.
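As a rough illustration of this control flow, the following single-process Python sketch mimics a driver that holds AL and repeatedly refines it from job output. It is not the MRTDS implementation: the taxonomy, the function names `initialization_job`, `update_job`, and `driver`, and the placeholder score (the smallest resulting group size, used here in place of IGPL) are all assumptions made for the example, and the data set is assumed to be non-empty with a single quasi-identifier.

```python
from collections import Counter

# Hypothetical taxonomy tree for one quasi-identifier: parent -> children.
TAXONOMY = {
    "Any": ["Europe", "Asia"],
    "Europe": ["FR", "DE"],
    "Asia": ["IN", "CN"],
}
PARENT = {child: parent for parent, kids in TAXONOMY.items() for child in kids}


def generalize(value, level):
    """Walk up the taxonomy until reaching a node of the anonymization level."""
    while value not in level:
        value = PARENT[value]
    return value


def group_sizes(records, level):
    """Count how many records fall under each node of the level."""
    return Counter(generalize(value, level) for value in records)


def initialization_job(records, level, k):
    """Stand-in for IGPL Initialization: score each specialization that keeps k-anonymity."""
    scores = {}
    for node in sorted(level):
        if node not in TAXONOMY:
            continue  # leaf values cannot be specialized further
        candidate = (level - {node}) | frozenset(TAXONOMY[node])
        smallest = min(group_sizes(records, candidate).values())
        if smallest >= k:
            scores[node] = smallest  # placeholder score, NOT the IGPL metric
    return scores


def update_job(records, level, k):
    """Stand-in for IGPL Update: this sketch simply recomputes all scores."""
    return initialization_job(records, level, k)


def driver(records, k):
    """Driver loop: AL starts at the most general level and is refined per iteration."""
    level = frozenset({"Any"})
    scores = initialization_job(records, level, k)
    while scores:
        best = max(sorted(scores), key=scores.get)
        level = (level - {best}) | frozenset(TAXONOMY[best])
        scores = update_job(records, level, k)
    return level


records = ["FR", "FR", "DE", "DE", "IN", "IN", "CN"]
print(sorted(driver(records, 2)))  # ['Asia', 'DE', 'FR']
```

In this toy run, "Asia" is left unspecialized because splitting it would leave a group with a single "CN" record, which would violate k = 2.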
Because AL is extremely small compared with the data sets to be anonymized, it can be transmitted efficiently between Driver and workers.

We adopt Hadoop, an open-source implementation of MapReduce, to implement MRTDS. Since most of the Map and Reduce functions need to access the current anonymization level AL, we use the distributed cache mechanism to pass the content of AL to each Mapper or Reducer node, as shown in Fig. 1. Hadoop also provides a mechanism to set simple global variables for Mappers and Reducers. The best specialization is passed into the Map function of the IGPL Update job in this way. The partition hash function in the shuffle phase is modified because the two jobs require that key-value pairs with the same key.p field, rather than the same entire key, go to the same Reducer. To reduce communication traffic, MRTDS exploits the combiner mechanism, which aggregates the key-value pairs with the same key into one on the nodes running Map functions.

The following are snapshots of the implementation of the Two-Phase TDS approach for data anonymization for preserving big data privacy.
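Separately from those snapshots, the partitioning and combiner behaviour described in this section can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not Hadoop code: the key is modelled as a hypothetical `(p, rest)` tuple, values are assumed to be counts that can be summed, and `partition`, `combine`, and `shuffle` are names invented for the example.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Route a pair by the key's p field only, so keys sharing p meet at one Reducer."""
    p, _rest = key  # hypothetical key layout: (p, remaining fields)
    return zlib.crc32(p.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: sum the values of identical keys on the node that ran the Map task."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(map_outputs, num_reducers):
    """Apply the combiner per Map task, then send each pair to its Reducer bucket."""
    buckets = defaultdict(list)
    for pairs in map_outputs:  # one list of key-value pairs per Map task
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets


map_outputs = [
    [(("Europe", "FR"), 1), (("Europe", "FR"), 1), (("Europe", "DE"), 1)],
    [(("Europe", "DE"), 1), (("Asia", "IN"), 1)],
]
buckets = shuffle(map_outputs, num_reducers=3)
for reducer, pairs in sorted(buckets.items()):
    print(reducer, pairs)
```

Hashing only the p field keeps every pair for "Europe" on one Reducer even though the full keys differ, and the combiner shrinks the first Map task's output from three pairs to two before anything is sent. `zlib.crc32` is used instead of the built-in `hash` because string hashing in Python varies between processes.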
VI. CONCLUSION AND FUTURE RESEARCH CHALLENGES

In this paper, we have examined the scalability problem of large-scale data anonymization by TDS, and proposed a highly scalable two-phase TDS approach using MapReduce on cloud. Data sets are partitioned and anonymized in parallel in the first phase, producing intermediate results. Then, the intermediate results are merged and further anonymized to produce consistent k-anonymous data sets in the second phase. We have applied MapReduce on cloud to data anonymization and designed a group of MapReduce jobs to concretely achieve the specialization computation in a highly scalable way. Experimental results on real-world data sets have demonstrated that with our approach, the scalability and efficiency of TDS are improved significantly over existing approaches.

In a cloud environment, privacy preservation for data analysis, sharing, and mining is a challenging research issue due to increasingly larger volumes of data sets, and it therefore requires intensive investigation. We will investigate the adoption of our approach for the bottom-up generalization algorithms for data anonymization. Based on the contributions herein, we plan to further explore scalable privacy-preservation-aware analysis and scheduling on large-scale data sets. Optimized balanced scheduling strategies are expected to be developed towards overall scalable privacy-preservation-aware data set scheduling.

REFERENCES

[1] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. Zürich, Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] J. Gantz and D. Reinsel, "Extracting value from chaos," IDC iView, 2011, pp. 1-12.
[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Proc. IEEE Int. Conf. Contemp. Comput., Aug. 2013, pp. 404-409.
[4] B. Matturdi, X. Zhou, S. Li, and F. Lin, "Big data security and privacy: A review," China Commun., vol. 11, no. 14, pp. 135-145, Apr. 2014.
[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information security in big data: Privacy and data mining," IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: A technology tutorial," IEEE Access, vol. 2, pp. 652-687, Jul. 2014.
[7] C. A. Ardagna and E. Damiani, "Business Intelligence meets Big Data: An Overview on Security and Privacy."
[8] A. Labrinidis and H. V. Jagadish, "Challenges and opportunities with big data," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[9] M. V. Dijk and A. Juels, "On the impossibility of cryptography alone for privacy-preserving cloud computing," Proceedings of the 5th USENIX Conference on Hot Topics in Security, August 10, 2010, pp. 1-8.
[10] S. Sagiroglu and D. Sinanc, "Big Data: A Review," Proc. International Conference on Collaboration Technologies and Systems, 2013, pp. 42-47.
[11] Y. Demchenko, P. Grosso, C. De Laat, and P. Membrey, "Addressing Big Data Issues in Scientific Data Infrastructure," Proc. International Conference on Collaboration Technologies and Systems, 2013, pp. 48-55.
[12] Top Ten Big Data Security and Privacy Challenges, Technical report, Cloud Security Alliance, November 2012.
[13] S. H. Kim, N. U. Kim, and T. M. Chung, "Attribute Relationship Evaluation Methodology for Big Data Security," Proc. International Conference on IT Convergence and Security (ICITCS), 2013, pp. 1-4.
[14] S. H. Kim, J. H. Eom, and T. M. Chung, "Big Data Security Hardening Methodology Using Attributes Relationship," Proc. International Conference on Information Science and Applications (ICISA), 2013, pp. 1-2.
[15] H. Takabi, J. B. D. Joshi, and G. Ahn, "Security and Privacy Challenges in Cloud Computing Environments," IEEE Security and Privacy, vol. 8, no. 6, pp. 24-31, Nov. 2010.
[16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[17] A. Mehmood, I. Natgunanathan, Y. Xiang, G. Hua, and S. Guo, "Protection of Big Data Privacy," IEEE Access, 2016.