GENERALIZED APPROACH FOR DATA ANONYMIZATION USING MAP REDUCE ON CLOUD
K.R.VIGNESH,
M.Tech CSE,
SRM University,
Kattankulathur,
Chennai, India.
P.SARANYA,
Asst. Professor, Dept. of CSE,
SRM University,
Kattankulathur,
Chennai, India
ABSTRACT— Data anonymization is an extensively studied and widely adopted method for privacy preservation in data publishing and sharing scenarios. Data anonymization hides sensitive data in an owner's data records to avoid identification risk. The privacy of an individual can be effectively preserved while some aggregate information is shared with data users for data analysis and data mining. The proposed method is a generalized approach to data anonymization using Map Reduce on cloud, based on Two-Phase Top-Down Specialization. In the first phase, the original data set is partitioned into a group of smaller data sets, which are anonymized in parallel to produce intermediate results. In the second phase, the intermediate results are further anonymized to achieve a consistent data set, and the data is presented in generalized form using the generalized approach.
Keywords: Cloud computing, Data Anonymization, Map Reduce, Privacy Preserving.
1. INTRODUCTION:
Cloud computing, a disruptive trend at present, attracts significant attention from the current IT industry and research organizations. Cloud computing provides massive storage and computation capacity, enabling users to deploy applications cost-effectively without heavy upfront investment in infrastructure. However, privacy preservation is one of the major concerns in the cloud environment. Some privacy issues are not new, e.g., when personal health records are shared with research organizations for data analysis, as in Microsoft HealthVault, an online health cloud service.
Data anonymization is a widely used method for privacy preservation in non-interactive data publishing scenarios. It refers to hiding the identity and/or sensitive data of record owners. The privacy of individuals can be effectively preserved while some aggregate information is shared for data analysis and mining. A variety of anonymization algorithms with different operations have been proposed [3, 4, 5, 6]. However, data set sizes have increased tremendously in the big data trend [1, 7], which makes anonymizing large data sets a challenge. To process such large data sets, we use Map Reduce integrated with the cloud to provide high computational capability for applications.
2. RELATED WORK:
Recently, data privacy preservation has been extensively studied and investigated [2]. LeFevre et al. addressed the scalability of anonymization algorithms by introducing scalable decision trees and sampling techniques, and Iwuchukwu et al. [8] proposed an R-tree-based index approach that builds a spatial index over data sets, achieving high efficiency. However, these approaches aim at multidimensional generalization [6], which fails to work in Top-Down Specialization (TDS).
Fung et al. [2, 9, 10] proposed TDS approaches that produce anonymized data sets without the data exploration problem. A data structure, Taxonomy Indexed PartitionS (TIPS), is exploited to improve the efficiency of TDS, but it fails to handle large data sets. Moreover, these approaches are centralized, leading to inadequacy for large data sets.
Several distributed algorithms have been proposed to preserve the privacy of multiple data sets retained by multiple parties. Jiang et al. [12] proposed a distributed algorithm to anonymize vertically partitioned data. However, the above algorithms mainly address secure anonymization and integration, whereas our aim is the scalability of TDS anonymization.
Further, Zhang et al. [13] leveraged Map Reduce itself to automatically partition a computing job in terms of security levels, thereby protecting data. In contrast, we exploit Map Reduce to anonymize large-scale data before it is further processed by other Map Reduce jobs, arriving at privacy preservation.
3. Top-Down Specialization:
Generally, Top-Down Specialization (TDS) is an iterative process starting from the topmost domain values in the taxonomy trees of attributes. Each round of iteration consists of three main steps, namely, finding the best specialization, performing the specialization, and updating values of the search metric for the next round [3]. Such a process is repeated until k-anonymity would be violated, in order to expose the maximum data utility. The goodness of a specialization is measured by a search metric. We adopt the Information Gain per Privacy Loss (IGPL), a trade-off metric that considers both the privacy and information requirements, as the search metric in our approach. The specialization with the highest IGPL value is regarded as the best one and is selected in each round. We briefly describe how to calculate the IGPL value subsequently so that readers can understand our approach well. Interested readers can refer to [3] for more details.
Given a specialization spec: p → child(p), the IGPL of the specialization is calculated by
IGPL(spec) = IG(spec) / (PL(spec) + 1).    (1)
The term IG(spec) is the information gain after performing spec, and PL(spec) is the privacy loss. IG(spec) and PL(spec) can be computed via statistical information derived from data sets. Let Rx denote the set of original records containing attribute values that can be generalized to x, and let |Rx| be the number of data records in Rx. Let I(Rx) be the entropy of Rx. Then, IG(spec) is calculated by
IG(spec) = I(Rp) − Σ c ∈ child(p) (|Rc| / |Rp|) · I(Rc).    (2)
Let |(Rx, sv)| denote the number of data records with sensitive value sv in Rx. I(Rx) is computed by
I(Rx) = − Σ sv ∈ SV (|(Rx, sv)| / |Rx|) · log2(|(Rx, sv)| / |Rx|).    (3)
The anonymity of a data set is defined by the minimum group size out of all QI-groups, denoted as A, i.e., A = min qid ∈ QID {|QID(qid)|}, where |QID(qid)| is the size of QID(qid). Let Ap(spec) denote the anonymity before performing spec, and Ac(spec) that after performing spec. The privacy loss caused by spec is calculated by
PL(spec) = Ap(spec) − Ac(spec).    (4)
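Putting equations (1)-(4) together, the following sketch computes IGPL for a single hypothetical specialization; the group sizes, sensitive-value counts, and anonymity values are made up purely for illustration.

```python
import math

def entropy(sv_counts):
    """I(Rx): entropy over sensitive-value counts in a record set Rx (Eq. 3)."""
    total = sum(sv_counts)
    return -sum((c / total) * math.log2(c / total) for c in sv_counts if c > 0)

# Hypothetical parent group Rp: 6 records, sensitive values split 3/3.
parent = [3, 3]
# Hypothetical child groups after spec: one of 4 records (3/1), one of 2 (0/2).
children = [[3, 1], [0, 2]]

n_parent = sum(parent)
# IG(spec) = I(Rp) - sum over children of (|Rc|/|Rp|) * I(Rc)   (Eq. 2)
ig = entropy(parent) - sum((sum(c) / n_parent) * entropy(c) for c in children)

# PL(spec) = Ap(spec) - Ac(spec); anonymity = smallest QI-group size (Eq. 4)
ap, ac = 6, 2          # before: one group of 6; after: smallest group has 2
pl = ap - ac

igpl = ig / (pl + 1)   # Eq. 1
print(round(ig, 3), pl, round(igpl, 3))
```

A specialization that gains much information while shrinking the smallest QI-group only a little would score higher than this one.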
4. Two-Phase Top-Down Specialization:
In the Two-Phase Top-Down Specialization (TPTDS) approach, the given data set is first partitioned and anonymized in the first phase, producing intermediate results; in the second phase, the intermediate results are further anonymized and stored in the database. The three components of the TPTDS approach are data partition, anonymization level merging, and data specialization.
4.1 Sketch of Two-Phase Top-Down Specialization:
We propose a Two-Phase Top-Down
Specialization (TPTDS) approach to conduct the
computation required in TDS in a highly scalable and
efficient fashion. The two phases of our approach are
based on the two levels of parallelization provisioned
by Map Reduce on cloud. Basically, Map Reduce on
cloud has two levels of parallelization, i.e., job level
and task level. Job level parallelization means that
multiple Map Reduce jobs can be executed
simultaneously to make full use of cloud
infrastructure resources. Combined with cloud, Map
Reduce becomes more powerful and elastic as cloud
can offer infrastructure resources on demand, e.g.,
Amazon Elastic Map Reduce service [11]. Task level
parallelization refers to that multiple mapper/reducer
tasks in a Map Reduce job are executed
simultaneously over data splits. To achieve high scalability, we parallelize multiple jobs on data partitions in the first phase, but the resultant anonymization levels are not identical. To obtain finally consistent anonymous data sets, the second phase is necessary to integrate the intermediate results and further anonymize the entire data set.
In the first phase, an original data set D is partitioned into smaller ones. Let Di, 1 ≤ i ≤ p, denote the data sets partitioned from D, where p is the number of partitions, D = ∪ i=1..p Di, and Di ∩ Dj = ∅, 1 ≤ i < j ≤ p. Then, we run a subroutine over each of the partitioned data sets in parallel to make full use of the job-level parallelization of Map Reduce. The subroutine is a Map Reduce version of centralized TDS (MRTDS) which concretely conducts the computation required in TPTDS. MRTDS anonymizes data partitions to generate intermediate anonymization levels. An intermediate anonymization level means that further specialization can be performed without violating k-anonymity. MRTDS only leverages the task-level parallelization of Map Reduce. Formally, let function MRTDS(D, k, AL) → AL' represent an MRTDS routine that anonymizes data set D to satisfy k-anonymity, proceeding from anonymization level AL to AL'. AL0 is the initial anonymization level, i.e., AL0 = ({TOP1}, {TOP2}, …, {TOPm}), where TOPj, 1 ≤ j ≤ m, is the topmost domain value in the taxonomy tree TTj. AL'i is the resultant intermediate anonymization level for partition Di.
In the second phase, all intermediate anonymization levels are merged into one. The merged anonymization level is denoted as AL^I. The merging process is formally represented as the function merge(AL'1, AL'2, …, AL'p) → AL^I. Then, the whole data set D is further anonymized based on AL^I, finally achieving k-anonymity, i.e., MRTDS(D, k, AL^I) → AL*, where AL* denotes the final anonymization level. Ultimately, D is concretely anonymized according to AL*. Algorithm 1 depicts the sketch of the two-phase TDS approach.
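As a rough illustration of the two-phase control flow only (not the paper's actual MRTDS implementation), the sketch below makes several simplifying assumptions: an anonymization level is modeled as a per-attribute generalization depth, string truncation stands in for taxonomy-tree generalization, and thread-level concurrency stands in for Map Reduce job-level parallelism. All data and names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def mrtds(records, k, level):
    """Toy MRTDS stand-in: specialize each attribute (keep more prefix
    characters of its value) as far as k-anonymity allows on these records."""
    level = dict(level)
    for attr in level:
        while level[attr] < 5:                     # assumed taxonomy depth bound
            trial = dict(level)
            trial[attr] += 1
            groups = {}
            for r in records:
                qid = tuple(r[a][:trial[a]] for a in trial)  # truncate = generalize
                groups[qid] = groups.get(qid, 0) + 1
            if groups and min(groups.values()) >= k:
                level = trial                      # still k-anonymous: accept
            else:
                break                              # further spec would violate k
    return level

def merge(levels):
    """Keep the most general depth per attribute, so the merged level is
    consistent with every intermediate level."""
    return {a: min(lv[a] for lv in levels) for a in levels[0]}

random.seed(0)
D = [{"zip": z} for z in
     ["47901", "47902", "47906", "47905", "53711", "53715", "53703", "53706"]]
k, p = 2, 2
parts = [[] for _ in range(p)]
for r in D:                                        # random partition (Algorithm 2)
    parts[random.randrange(p)].append(r)
AL0 = {"zip": 0}
with ThreadPoolExecutor() as ex:                   # phase 1: parallel jobs
    intermediate = list(ex.map(lambda Di: mrtds(Di, k, AL0), parts))
AL_I = merge(intermediate)                         # merge intermediate levels
AL_star = mrtds(D, k, AL_I)                        # phase 2: whole data set
print(AL_star)
```

Whatever the random partition, phase 2 restarts from the merged (more general) level and specializes over the whole data set, so the final level depends only on D and k.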
5. Data Partition:
When D is partitioned into Di, 1 ≤ i ≤ p, it is required that the distribution of data records in Di is similar to that in D. A data record here can be treated as a point in an m-dimensional space, where m is the number of attributes. Thus, the intermediate anonymization levels derived from Di, 1 ≤ i ≤ p, can be more similar, so that we can get a better merged anonymization level. A random sampling technique is adopted to partition D, which satisfies the above requirement. Specifically, a random number rand, 1 ≤ rand ≤ p, is generated for each data record, and the record is assigned to the partition D_rand. Algorithm 2 shows the Map Reduce program for data partition. Note that the number of Reducers should be equal to p, so that each Reducer handles one value of rand, producing exactly p resultant files. Each file contains a random sample of D. Once the partitioned data sets Di, 1 ≤ i ≤ p, are obtained, we run MRTDS(Di, k', AL0) on these data sets in parallel to derive the intermediate anonymization levels AL'i, 1 ≤ i ≤ p.
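The partition step amounts to a shuffle keyed on a random number. A minimal single-process simulation of Algorithm 2 (the Map Reduce machinery replaced by plain functions, record contents hypothetical):

```python
import random
from collections import defaultdict

def partition_map(record, p):
    """Map: emit (rand, record) with 1 <= rand <= p, as in Algorithm 2."""
    return (random.randint(1, p), record)

def partition_reduce(pairs):
    """Reduce: each key rand collects its records into partition D_rand
    (one output file per Reducer when run as a real Map Reduce job)."""
    parts = defaultdict(list)
    for rand, record in pairs:
        parts[rand].append(record)
    return dict(parts)

random.seed(42)
D = [{"id": i} for i in range(20)]     # stand-in data records
p = 4
partitions = partition_reduce(partition_map(r, p) for r in D)
print({key: len(v) for key, v in sorted(partitions.items())})
```

Every record lands in exactly one of the p partitions, and each partition is a random sample of D.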
6. Data Specialization:
The original data set D is concretely specialized for anonymization in a one-pass MapReduce job. After obtaining the merged intermediate anonymization level AL^I, we run MRTDS(D, k, AL^I) on the entire data set D to get the final anonymization level AL*. Then, the data set D is anonymized by replacing original attribute values in D with the corresponding domain values in AL*.
Details of the Map and Reduce functions of the data specialization MapReduce job are described in Algorithm 3. The Map function emits anonymous records and their counts. The Reduce function simply aggregates these anonymous records and counts their number. An anonymous record and its count represent a QI-group. The QI-groups constitute the final anonymous data set.
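The Map/Reduce pair of Algorithm 3 can be mimicked in a few lines. The taxonomy below, mapping each specific value to the generalized domain value assumed to be in AL*, is purely hypothetical.

```python
from collections import Counter

# Hypothetical AL* ancestors: each original value maps to its generalized value.
ANCESTOR = {"Engineer": "Professional", "Lawyer": "Professional",
            "Dancer": "Artist", "Writer": "Artist"}

def spec_map(record):
    """Map: build the anonymous record r* = (generalized values, sv)."""
    job, sv = record
    return ((ANCESTOR[job], sv), 1)          # emit (r*, count)

def spec_reduce(pairs):
    """Reduce: sum counts per r*; each (r*, sum) is one QI-group."""
    groups = Counter()
    for r_star, count in pairs:
        groups[r_star] += count
    return dict(groups)

D = [("Engineer", "HIV"), ("Lawyer", "Flu"), ("Engineer", "Flu"),
     ("Dancer", "HIV"), ("Writer", "Flu")]
qi_groups = spec_reduce(spec_map(r) for r in D)
print(qi_groups)
```

Records that become identical after generalization collapse into a single QI-group whose count is the group size.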
Algorithm 1: Two-Phase Top-Down Specialization.
Input: Data set D, anonymity parameters k, k', and the number of partitions p.
Output: Anonymous data set D*.
1. Partition D into Di, 1 ≤ i ≤ p.
2. Execute MRTDS(Di, k', AL0) → AL'i, 1 ≤ i ≤ p, in parallel as multiple Map Reduce jobs.
3. Merge all intermediate anonymization levels into one, merge(AL'1, AL'2, …, AL'p) → AL^I.
4. Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.
5. Specialize D according to AL*, and output D*.
Algorithm 3: Data Specialization.
Input: Data record (IDr, r), r ∈ D; anonymization level AL*.
Output: Anonymous record (r*, count).
Map: Construct the anonymous record r* = (p1, p2, …, pm, sv), where pi, 1 ≤ i ≤ m, is the parent of a specialization in the current AL and is also an ancestor of vi in r; emit (r*, count).
Reduce: For each r*, sum ← Σ count; emit (r*, sum).
Algorithm 2: Data Partition.
Input: Data record (IDr, r), r ∈ D; number of partitions p.
Output: Di, 1 ≤ i ≤ p.
Map: Generate a random number rand, where 1 ≤ rand ≤ p; emit (rand, r).
Reduce: For each rand, emit (null, list(r)).
7. MRTDS Driver:
Usually, a single MapReduce job is inadequate to accomplish a complex task in many applications. Thus, a group of MapReduce jobs are orchestrated in a driver program to achieve such an objective. MRTDS consists of the MRTDS Driver and two types of jobs, i.e., IGPL Initialization and IGPL Update. The driver arranges the execution of the jobs. Algorithm 4 frames the MRTDS Driver, where a data set is anonymized by TDS. It is the algorithmic design of the function MRTDS(D, k, AL) → AL'. Note that we leverage the anonymization level to manage the process of anonymization. Step 1 initializes the values of information gain and privacy loss for all specializations, which is done by the job IGPL Initialization. Step 2 is iterative. First, the best specialization is selected from the valid specializations in the current anonymization level, as described in Step 2.1. A specialization spec is valid if it satisfies two conditions: one is that its parent value is not a leaf, and the other is that the anonymity Ac(spec) ≥ k, i.e., the data set is still k-anonymous if spec is performed. Then, the current anonymization level is modified by performing the best specialization in Step 2.2, i.e., removing the old specialization and inserting the new ones derived from it. In Step 2.3, the information gain of the newly added specializations and the privacy loss of all specializations need to be recomputed, which is accomplished by the job IGPL Update. The iteration continues until all specializations become invalid, achieving the maximum data utility. MRTDS produces the same anonymous data as the centralized TDS in [3], because they follow the same steps. MRTDS mainly differs from centralized TDS in how IGPL values are calculated. However, calculating IGPL values dominates the scalability of TDS approaches, as it requires TDS algorithms to count statistical information of data sets iteratively. MRTDS exploits Map Reduce on cloud to make the computation of IGPL parallel and scalable.
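Stripped of the Map Reduce jobs, the driver's Step 2 is a greedy loop. In the sketch below, the taxonomy, IGPL scores, and post-specialization anonymity values are static hypothetical stand-ins for what the IGPL Initialization and IGPL Update jobs would compute over real data.

```python
# Hypothetical one-attribute taxonomy and precomputed metrics (stand-ins).
TAXONOMY = {"ANY": ["Professional", "Artist"],
            "Professional": ["Engineer", "Lawyer"],
            "Artist": [], "Engineer": [], "Lawyer": []}
IGPL = {"ANY": 0.30, "Professional": 0.12, "Artist": 0.05}
ANONYMITY_AFTER = {"ANY": 4, "Professional": 3, "Artist": 1}
k = 2

cut = {"ANY"}                                # current anonymization level
while True:
    # A spec is valid if its parent value is not a leaf and Ac(spec) >= k.
    valid = [s for s in cut if TAXONOMY[s] and ANONYMITY_AFTER[s] >= k]
    if not valid:
        break                                # all specs invalid: stop (Step 2)
    best = max(valid, key=IGPL.get)          # Step 2.1: highest IGPL wins
    cut.remove(best)                         # Step 2.2: perform specialization,
    cut.update(TAXONOMY[best])               # replacing the value by its children
    # Step 2.3 would re-run the IGPL Update job here; values are static
    # in this sketch.
print(sorted(cut))
```

The loop specializes ANY into Professional and Artist, then Professional into Engineer and Lawyer, and stops once every remaining value is a leaf.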
Algorithm 4: MRTDS Driver.
Input: Data set D, anonymization level AL, and k-anonymity parameter k.
Output: Anonymization level AL'.
1. Initialize the values of the search metric IGPL, i.e., for each specialization spec ∈ ∪ j=1..m Cutj, the IGPL value of spec is computed by the job IGPL Initialization.
2. While ∃ spec ∈ ∪ j=1..m Cutj that is valid:
2.1 Find the best specialization specBest from ALi.
2.2 Update ALi to ALi+1 by performing specBest.
2.3 Update the information gain of the new specializations in ALi+1, and the privacy loss of every specialization, via the job IGPL Update.
End while.
3. AL' ← ALi.
REFERENCE:
[1] S. Chaudhuri, "What Next?: A Half-Dozen Data Management Research Goals for Big Data and the Cloud," Proc. 31st Symp. Principles of Database Systems (PODS '12), pp. 1-4, 2012.
[2] B.C.M. Fung, K. Wang, R. Chen and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Comput. Surv., vol. 42, no. 4, pp. 1-53, 2010.
[3] B.C.M. Fung, K. Wang and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowl. Data Eng., vol. 19, no. 5, pp. 711-725, 2007.
[4] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 139-150, 2006.
[5] K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. 2005 ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[6] K. LeFevre, D.J. DeWitt and R. Ramakrishnan, "Mondrian Multidimensional K-Anonymity," Proc. 22nd Int'l Conf. Data Engineering (ICDE '06), article 25, 2006.
[7] V. Borkar, M.J. Carey and C. Li, "Inside 'Big Data Management': Ogres, Onions, or Parfaits?," Proc. 15th Int'l Conf. Extending Database Technology (EDBT '12), pp. 3-14, 2012.
[8] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 746-757, 2007.
[9] N. Mohammed, B. Fung, P.C.K. Hung and C.K.
Lee, “Centralized and Distributed Anonymization for
High-Dimensional Healthcare Data,” ACM Trans.
Knowl. Discov. Data, vol. 4, no. 4, article 18, 2010.
[10] B. Fung, K. Wang, L. Wang and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data Knowl. Eng., vol. 68, no. 6, pp. 552-575, 2009.
[11] Amazon Web Services, "Amazon Elastic MapReduce," http://aws.amazon.com/elasticmapreduce/, accessed on: Jan. 05, 2013.
[12] W. Jiang and C. Clifton, "A Secure Distributed Framework for Achieving k-Anonymity," VLDB J., vol. 15, no. 4, pp. 316-333, 2006.
[13] K. Zhang, X. Zhou, Y. Chen, X. Wang and Y. Ruan, "Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds," Proc. 18th ACM Conf. Computer and Communications Security (CCS '11), pp. 515-526, 2011.
BIOGRAPHY
K. R. VIGNESH is an M.Tech student in Computer Science and Engineering at SRM University, India. His main area of interest is cloud computing.
P. SARANYA is an Assistant Professor in the Department of Computer Science and Engineering, Kattankulathur Campus, SRM University, India. Her main areas of interest are Data Mining and Web Mining.