Record Ordering Heuristics for Disclosure Control through Microaggregation
ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012

Brook Heaton and Sumitra Mukherjee
Graduate School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA
Email: willheat@nova.edu, sumitra@nova.edu

Abstract— Statistical disclosure control (SDC) methods reconcile the need to release information to researchers with the need to protect the privacy of individual records. Microaggregation is an SDC method that protects data subjects by guaranteeing k-anonymity: records are partitioned into groups of size at least k, and actual data values are replaced by the group means so that each record in the group is indistinguishable from at least k-1 other records. The goal is to create groups of similar records such that the information loss due to data modification is minimized, where information loss is measured by the sum of squared deviations between the actual data values and their group means. Since optimal multivariate microaggregation is NP-hard, heuristics have been developed for microaggregation. It has been shown that, for a given ordering of records, the optimal partition consistent with that ordering can be computed efficiently, and some of the best existing microaggregation methods are based on this approach. This paper improves on previous heuristics by adapting tour construction and tour improvement heuristics for the traveling salesman problem (TSP) to microaggregation. Specifically, the Greedy heuristic and the Quick Boruvka heuristic are investigated for tour construction, and the 2-opt, 3-opt, and Lin-Kernighan heuristics are used for tour improvement. Computational experiments using benchmark datasets indicate that our method results in lower information loss than extant microaggregation heuristics.

Index Terms— Disclosure Control, Microaggregation, Privacy protection, Tour construction heuristics, Tour improvement heuristics, Shortest path.

I. INTRODUCTION

Many databases, such as health information systems and U.S. census data, contain information that is valuable to researchers for statistical purposes. At the same time, these databases contain private information about individuals. There is a tradeoff between providing information for the benefit of society and restricting information for the benefit of individuals. Statistical disclosure control (SDC) is a set of techniques for providing access to aggregate information in statistical databases while at the same time protecting the privacy of individual data subjects. See [1], [12], and [27] for good overviews of statistical disclosure control methods.

Microaggregation is a popular SDC method that protects data subjects by guaranteeing k-anonymity (see [2], [23], and [26]). Under microaggregation, records are partitioned into groups of size at least k, and actual data values are replaced by the group means so that each record in the group is indistinguishable from at least k-1 other records. Replacing actual data values with group means leads to information loss and renders the data less valuable to users. Optimal microaggregation aims to minimize information loss by grouping together records that are very similar, thereby maximizing homogeneity among the records of each group. The optimal microaggregation problem can be formally defined as follows: Given a data set with n records and p numerical attributes per record, partition the records into groups containing at least k records each, such that the sum of squared errors (SSE) within groups is minimized. SSE is defined as

$\mathrm{SSE} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \lVert X_{ij} - \bar{X}_i \rVert^2$,

where g is the number of groups, $X_{ij}$ is the jth record of the ith group, and $\bar{X}_i$ is the mean of the ith group.
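As a concrete reading of this definition, the following is a minimal Java sketch of the SSE computation (Java being the language the authors report using in Section IV). The class and method names are illustrative, not taken from the paper's implementation; a k-partition is represented simply as a list of groups, each a list of attribute vectors.

    // Minimal sketch: SSE of a partition, per the definition above.
    // Names (InformationLoss, sse, centroid) are illustrative only.
    import java.util.List;

    public class InformationLoss {

        /** Sum of squared Euclidean deviations of each record from its group mean. */
        public static double sse(List<List<double[]>> groups) {
            double total = 0.0;
            for (List<double[]> group : groups) {
                double[] mean = centroid(group);
                for (double[] record : group) {
                    total += squaredDistance(record, mean);
                }
            }
            return total;
        }

        /** Component-wise mean of the records in a non-empty group. */
        static double[] centroid(List<double[]> group) {
            int p = group.get(0).length;
            double[] mean = new double[p];
            for (double[] record : group)
                for (int a = 0; a < p; a++) mean[a] += record[a];
            for (int a = 0; a < p; a++) mean[a] /= group.size();
            return mean;
        }

        static double squaredDistance(double[] x, double[] y) {
            double d = 0.0;
            for (int a = 0; a < x.length; a++) {
                double diff = x[a] - y[a];
                d += diff * diff;
            }
            return d;
        }
    }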
This problem has been shown to be NP-hard when p > 1 [21]. Since microaggregation is extensively used for SDC, several heuristics have been proposed that lead to low information loss (see, e.g., [4], [6], [7], [9], [10], [11], [13], [16], [17], [18], [20], [24], and [25]). There is a need to develop new heuristics that perform better than those currently known, either by producing groups with lower information loss, or by achieving the same information loss at lower computational cost. Optimal univariate microaggregation (i.e., involving a single attribute) has been shown to be solvable in polynomial time by taking an ordered list of records and using a shortest-path algorithm to compute the lowest information loss k-partition for the given ordered list [14]. A k-partition is a set of groups such that every group contains at least k elements. Optimality is guaranteed, since univariate data can be strictly ordered. Practical applications, however, require microaggregation to be performed on records containing multiple attributes. While multivariate data cannot be strictly ordered, Domingo-Ferrer et al. [9] show that, for a given ordering of records, the best partition consistent with that ordering can be identified efficiently as a shortest path in a network using Hansen and Mukherjee's method [14]. Exploiting this idea, they develop the Multivariate Hansen-Mukherjee (MHM) algorithm and propose several heuristics for ordering the records. Empirical results indicate that MHM, when used with good ordering heuristics, outperforms extant microaggregation methods. This paper builds on the work of [9] by investigating new methods for ordering records as a first step in multivariate microaggregation. The techniques for ordering records are based on tour construction and tour improvement heuristics for the traveling salesman problem (TSP). Using previously studied benchmark datasets, our method is shown to outperform extant microaggregation heuristics. Section II discusses the MHM-based microaggregation method and the record ordering heuristics proposed in [9]. Section III introduces our TSP-based tour construction and improvement heuristics. Section IV describes the computational experiments used to compare our method with extant microaggregation heuristics. Section V presents our results.

II. MHM MICROAGGREGATION HEURISTIC

A. The MHM Algorithm

The MHM algorithm introduced in [9] involves constructing a graph based on an ordered list of records and finding the shortest path in the graph. The arcs in the shortest path correspond to a partition of the records that is guaranteed to be the lowest-cost partition consistent with the specified ordering. A partition is said to be consistent with an ordering if the following holds: for any three records $x_i$, $x_j$, and $x_k$ where i < j < k, if $x_i$ and $x_k$ belong to the same group g, then $x_j$ also belongs to group g. The graph is constructed as follows: Given an ordered set of n records, create n nodes labeled 1 to n. Additionally, create a node labeled 0. For each pair of nodes (i, j) where i + k <= j < i + 2k, create an arc directed from i to j with length

$L(i, j) = \sum_{h=i+1}^{j} \lVert x_h - \bar{x}_{(i,j)} \rVert^2$,

where $x_h$ is a multivariate record and $\bar{x}_{(i,j)}$ is the centroid of the records in positions i+1 through j. Once the graph is constructed, a shortest-path algorithm such as Dijkstra's algorithm [8] is used to determine the shortest path from node 0 to node n. The length of the shortest path gives the information loss corresponding to the best partition consistent with the specified ordering.
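A sketch of this construction in Java follows. Because every arc points from a lower-numbered node to a higher-numbered one, the graph is acyclic, so a simple dynamic program over the nodes in order finds the same shortest path that Dijkstra's algorithm would; the method names and the plain-array record representation are assumptions for illustration, not the authors' code.

    // Sketch of the MHM shortest-path step: nodes 0..n, an arc (i, j)
    // for i+k <= j < i+2k whose length is the within-group SSE of
    // records i+1..j, and a forward dynamic program standing in for
    // Dijkstra's algorithm on this acyclic graph. Names are illustrative.
    public class MHM {

        /**
         * records: the n records in the chosen order (p attributes each).
         * Returns dist[n], the minimum SSE over all k-partitions
         * consistent with that order.
         */
        public static double mhm(double[][] records, int k) {
            int n = records.length;
            double[] dist = new double[n + 1];
            java.util.Arrays.fill(dist, Double.POSITIVE_INFINITY);
            dist[0] = 0.0;
            for (int j = k; j <= n; j++) {
                // predecessors i with i+k <= j < i+2k, i.e. j-2k+1 <= i <= j-k
                for (int i = Math.max(0, j - 2 * k + 1); i <= j - k; i++) {
                    if (dist[i] == Double.POSITIVE_INFINITY) continue;
                    double candidate = dist[i] + arcLength(records, i, j);
                    if (candidate < dist[j]) dist[j] = candidate;
                }
            }
            return dist[n];
        }

        /** L(i, j): SSE of records i+1..j (1-based) around their centroid. */
        static double arcLength(double[][] records, int i, int j) {
            int p = records[0].length;
            double[] mean = new double[p];
            for (int h = i; h < j; h++)            // 0-based indices i..j-1
                for (int a = 0; a < p; a++) mean[a] += records[h][a];
            for (int a = 0; a < p; a++) mean[a] /= (j - i);
            double sse = 0.0;
            for (int h = i; h < j; h++)
                for (int a = 0; a < p; a++) {
                    double d = records[h][a] - mean[a];
                    sse += d * d;
                }
            return sse;
        }
    }

Tracking the predecessor of each node along the way would additionally recover the arcs of the shortest path, and hence the groups of the best consistent partition.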
B. Existing Record Ordering Heuristics

Four heuristics for ordering records have been proposed in [9]: nearest-point-next (NPN), maximum distance (MD-MHM), maximum distance to average vector (MDAV-MHM), and centroid-based fixed-size (CBFS-MHM). The NPN heuristic selects as the first record the record furthest from the centroid of the entire dataset. The record closest to the initial record is designated as the second record; the third record is the one closest to the second record. This process continues until all of the records have been added to the tour. The NPN heuristic runs in O(n^2) time, where n is the number of records in the dataset. The MD-MHM heuristic works as follows: Find the two records r and s that are furthest from one another in the dataset, using Euclidean distance. Find the k-1 closest records to r and form a group. Find the k-1 closest records to s and form a group. Repeat this process on the records that have not yet been assigned to a group until all records are assigned. Stitch the groups together by applying NPN to their centroids. The MD-MHM heuristic runs in O(n^3) time. The MDAV-MHM heuristic is very similar to MD-MHM, but considers distances of records to group centroids rather than distances to other records. The MDAV heuristic runs in O(n^2) time. The CBFS-MHM heuristic designates as the first record the record r that is furthest from the centroid. The k-1 closest records to r are added to its group. This process is repeated using the remaining records until all the records are assigned. The complexity of CBFS is also O(n^2).
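As an illustration, the NPN ordering can be sketched as below, reusing the centroid and squared-distance helpers from the SSE sketch above; all names are hypothetical, and this is a reading of the description in [9] rather than the authors' implementation.

    // Sketch of nearest-point-next (NPN): start from the record furthest
    // from the dataset centroid, then repeatedly append the unvisited
    // record closest to the last one added. O(n^2) distance evaluations.
    import java.util.ArrayList;
    import java.util.List;

    public class NpnOrdering {

        public static List<double[]> order(List<double[]> records) {
            List<double[]> remaining = new ArrayList<>(records);
            List<double[]> tour = new ArrayList<>();
            double[] centroid = InformationLoss.centroid(remaining);
            // First record: furthest from the centroid of the entire dataset.
            double[] current = farthestFrom(centroid, remaining);
            remaining.remove(current);
            tour.add(current);
            // Each subsequent record: nearest remaining record to the last one.
            while (!remaining.isEmpty()) {
                current = nearestTo(current, remaining);
                remaining.remove(current);
                tour.add(current);
            }
            return tour;
        }

        static double[] farthestFrom(double[] ref, List<double[]> candidates) {
            double[] best = null;
            double bestDist = -1.0;
            for (double[] c : candidates) {
                double d = InformationLoss.squaredDistance(c, ref);
                if (d > bestDist) { bestDist = d; best = c; }
            }
            return best;
        }

        static double[] nearestTo(double[] ref, List<double[]> candidates) {
            double[] best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (double[] c : candidates) {
                double d = InformationLoss.squaredDistance(c, ref);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }
    }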
III. OUR ORDERING AND IMPROVEMENT HEURISTICS

We use two TSP-based tour construction heuristics – Greedy and Quick Boruvka (QB) – and three tour improvement heuristics – 2-opt, 3-opt, and Lin-Kernighan (LK). The improvement heuristics are applied not only to the tours generated by Greedy and QB but also to those produced by the record ordering heuristics discussed in the previous section – NPN, MD-MHM, MDAV-MHM, and CBFS-MHM.

The Greedy heuristic produces an ordering of records in several steps. First, the distances between every pair of records in the dataset are calculated and sorted in ascending order. The pair of records with the smallest distance is chosen and added to the path. The remaining pairs are iteratively considered in order, and added to the path as long as they do not introduce sub-tours or vertices of degree greater than 2. The computational complexity of the Greedy heuristic is O(n^2 log n). In Quick Boruvka, the records are first sorted by one of the attributes to generate an initial ordering. For each record in the initial ordering, the distance between the current record and the other records in the dataset is computed. The pair of records with the minimum distance is added, subject to the same conditions outlined for the Greedy heuristic above. The QB heuristic also has complexity O(n^2 log n), but empirically runs slightly faster than the Greedy heuristic [3].

The 2-opt heuristic is a tour improvement heuristic; that is, it starts with a tour and iteratively attempts to improve it. It operates by first randomly selecting two non-adjacent edges (a total of four points) from the tour. If re-connecting the edges results in a shorter overall tour, the originally selected edges are replaced with the newly computed ones. This process continues until either no further exchanges result in a shorter tour, or some other stopping criteria are met. In the 3-opt heuristic, three disconnected edges are broken, and every possible way of reconnecting the edges into a valid tour is examined. If a tour with shorter distance is found, the edges are swapped, and the process repeats until no more exchanges are possible or some other stopping criteria are met. The Lin-Kernighan (LK) heuristic is a tour improvement heuristic similar to 2-opt and 3-opt: it takes an existing tour and iteratively swaps edges to create a better tour. Lin and Kernighan [19] note that the choice of swapping two or three edges in 2-opt or 3-opt is arbitrary, since swapping more than three edges may actually result in a better tour. Therefore, LK attempts to build a series of sequential k-opt moves (sometimes referred to as swaps, flips, or exchanges), rather than a single 2-opt or 3-opt move, that will result in an overall shorter tour. A good overview of these heuristics can be found in [22].
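A common deterministic rendering of the 2-opt pass is sketched below on a record ordering treated as an open path. It scans all pairs of non-adjacent edges rather than sampling them at random as described above, but applies the same accept-only-if-shorter rule; reversing the segment between the two removed edges is what "reconnecting" amounts to on an array representation. Names are illustrative, and the distance helper from the SSE sketch is reused.

    // Sketch of a 2-opt improvement pass: remove edges (i, i+1) and
    // (j, j+1), reconnect as (i, j) and (i+1, j+1) by reversing the
    // segment i+1..j, and keep the swap only if the tour gets shorter.
    public class TwoOpt {

        public static void improve(double[][] tour) {
            int n = tour.length;
            boolean improved = true;
            while (improved) {                   // stop when no exchange helps
                improved = false;
                for (int i = 0; i < n - 2; i++) {
                    for (int j = i + 2; j < n - 1; j++) {
                        double before = dist(tour[i], tour[i + 1])
                                      + dist(tour[j], tour[j + 1]);
                        double after  = dist(tour[i], tour[j])
                                      + dist(tour[i + 1], tour[j + 1]);
                        if (after < before) {
                            reverse(tour, i + 1, j); // reconnect the two edges
                            improved = true;
                        }
                    }
                }
            }
        }

        static void reverse(double[][] tour, int from, int to) {
            for (; from < to; from++, to--) {
                double[] tmp = tour[from]; tour[from] = tour[to]; tour[to] = tmp;
            }
        }

        static double dist(double[] x, double[] y) {
            return Math.sqrt(InformationLoss.squaredDistance(x, y));
        }
    }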
IV. EXPERIMENTAL DESIGN

A. Datasets

We used three benchmark datasets from previous studies [9] to evaluate our proposed methods. The first dataset, "Tarragona", consists of 834 records with 13 variables. The second dataset, "EIA", consists of 4092 records with 10 variables. Finally, the "Census" dataset consists of 1080 records from the U.S. census, with 13 variables. Consistent with previous studies, the variables were normalized to ensure that results are not dominated by any single variable.

B. Record Ordering Algorithms

The record ordering algorithms used by Domingo-Ferrer et al. [9] are NPN, MD-MHM, CBFS-MHM, and MDAV-MHM. These algorithms were run on the three selected datasets. Additionally, the two new record ordering algorithms introduced in this study – Greedy and Quick Boruvka – were also run. Three different algorithms for improving existing record orderings – 2-opt, 3-opt, and LK – were used with all tour construction heuristics.

C. Comparisons

The six microaggregation algorithms (i.e., NPN, MD-MHM, MDAV-MHM, CBFS-MHM, Greedy, and QB) were run on the three datasets (Census, Tarragona, and EIA) using five typical values of k (3, 4, 5, 6, and 10). These values of k are consistent with those used in [9]. The three tour improvement heuristics (2-opt, 3-opt, and LK) were then applied to the resulting record orderings. That is, for a given dataset and a given value of k, four different experiments were run using each record ordering heuristic (the original algorithm with no improvement, plus the three improvement heuristics). Consistent with previous studies, we report the normalized information loss IL = 100 * SSE/SST (with possible values between 0 and 100), where

$\mathrm{SST} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \lVert X_{ij} - \bar{X} \rVert^2$,

and $\bar{X}$ is the centroid of the entire dataset. All of the algorithms were implemented in the Java programming language, using the Sun Java 6.0 SDK. All algorithms were run on a computer with a dual 3.0 GHz processor and 2 GB of RAM, running the Linux operating system (Ubuntu 10.04).
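The normalized measure can be computed directly from the two sums, as in this short sketch; it again reuses the helpers from the SSE sketch above, and the names are illustrative rather than the authors'.

    // Sketch: normalized information loss IL = 100 * SSE / SST, where
    // SST measures deviations from the centroid of the entire dataset.
    import java.util.ArrayList;
    import java.util.List;

    public class NormalizedLoss {

        public static double il(List<List<double[]>> groups) {
            List<double[]> all = new ArrayList<>();
            for (List<double[]> g : groups) all.addAll(g);
            double[] globalMean = InformationLoss.centroid(all);
            double sst = 0.0;
            for (double[] record : all)
                sst += InformationLoss.squaredDistance(record, globalMean);
            return 100.0 * InformationLoss.sse(groups) / sst;
        }
    }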
V. RESULTS

This section details the results from applying the MD-MHM, MDAV-MHM, CBFS-MHM, NPN-MHM, Greedy-MHM, and QB-MHM algorithms, along with their corresponding 2-opt, 3-opt, and LK improvement heuristics, to the Census, Tarragona, and EIA datasets for five different values of k. The information loss under each method is reported. For each specified value of k, the lowest information loss obtained for each dataset is highlighted in bold to emphasize the best performing method. Table I shows that in all but one case (k=10 for Census), our tour improvement heuristics improved upon the results obtained using MD-MHM. Results for the MDAV-MHM and CBFS-MHM families of algorithms are very similar, as shown in Tables II and III. Reductions in information loss are often substantial. For example, with the EIA dataset, the MDAV-MHM+LK heuristic for k=10 resulted in an information loss of 1.968, which is 31% lower than the information loss (2.867) obtained using MDAV-MHM alone.

TABLE I. INFORMATION LOSS UNDER MD-MHM AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

TABLE II. INFORMATION LOSS UNDER MDAV-MHM AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

TABLE III. INFORMATION LOSS UNDER CBFS-MHM AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

Tables IV, V, and VI present the results for the NPN-MHM, QB-MHM, and Greedy-MHM families of algorithms respectively. Our tour improvement heuristics resulted in lower information loss in every instance.

TABLE IV. INFORMATION LOSS UNDER NPN-MHM AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

TABLE V. INFORMATION LOSS UNDER QB-MHM AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

TABLE VI. INFORMATION LOSS UNDER GREEDY-MHM AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

Table VII summarizes the results from Tables I–VI as follows: the best performing combination of tour construction and tour improvement heuristics for each dataset is identified under each specified k. The row labeled "Best under" identifies the specific improvement heuristic that produced the best results. For example, with the Census dataset, when k=3, the best results are obtained using a combination of Greedy-MHM and the LK tour improvement heuristic, with an information loss of 5.002%. While no tour construction heuristic dominates, out of 15 total instances the best result was found using the LK improvement heuristic 7 times and using the 3-opt improvement heuristic 6 times.

TABLE VII. BEST COMBINATION OF TOUR CONSTRUCTION AND IMPROVEMENT HEURISTICS [table values not preserved in this transcript]

Table VIII shows the difference between the lowest information loss obtained using extant microaggregation methods (the minimum value found from the MD, MDAV, CBFS, MD-MHM, CBFS-MHM, NPN-MHM, and MDAV-MHM heuristics) and that obtained using the new methods described in this paper. Under our methods, information losses are decreased by 4%, 5.5%, and 13.9% on average for the Census, Tarragona, and EIA datasets respectively. In all but one instance (k=10 with the Census dataset) our proposed method outperformed existing methods and should hence be considered a promising microaggregation heuristic.
TABLE VIII. RESULTS COMPARED TO EXTANT METHODS [table values not preserved in this transcript]

REFERENCES

[1] Adam, N. R., & Wortmann, J. C. (1989). Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4), 515-556.
[2] Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., & Zhu, A. (2005). Approximation algorithms for k-anonymity. Journal of Privacy Technology, Paper No. 20051120001.
[3] Applegate, D., Cook, W., & Rohe, A. (2003). Chained Lin-Kernighan for large traveling salesman problems. INFORMS Journal on Computing, 15(1), 82-92.
[4] Balleste, A. M., Solanas, A., Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2007). A genetic approach to multivariate microaggregation for database privacy. IEEE 23rd International Conference on Data Engineering Workshop, 180-185.
[5] Bentley, J. L. (1992). Fast algorithms for geometric traveling salesman problems. ORSA Journal on Computing, 4, 387-411.
[6] Chang, C.-C., Li, Y.-C., & Huang, W.-H. (2007). TFRP: An efficient microaggregation algorithm for statistical disclosure control. Journal of Systems and Software, 80(11), 1866-1878.
[7] Defays, D., & Anwar, N. (1995). Micro-aggregation: a generic method. Proceedings of the 2nd International Symposium on Statistical Confidentiality, Eurostat, Luxembourg, 69-78.
[8] Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1, 269-271.
[9] Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J. M., & Sebé, F. (2006). Efficient multivariate data-oriented microaggregation. The VLDB Journal, 15, 355-369.
[10] Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1), 189-201.
[11] Domingo-Ferrer, J., Sebé, F., & Solanas, A. (2008). A polynomial-time approximation to optimal multivariate microaggregation. Computers & Mathematics with Applications, 55(4), 714-732.
[12] Duncan, G. T., Elliot, M., & Salazar-González, J. (2011). Statistical Confidentiality: Principles and Practice. Springer.
[13] Fayyoumi, E., & Oommen, B. J. (2009). Achieving microaggregation for secure statistical databases using fixed-structure partitioning-based learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39(5), 1192-1205.
[14] Hansen, S. L., & Mukherjee, S. (2003). A polynomial algorithm for optimal univariate microaggregation. IEEE Transactions on Knowledge and Data Engineering, 15(4), 1043-1044.
[15] Hundepool, A., Wetering, A., Ramaswamy, R., Franconi, L., Capobianchi, A., Wolf, P., Domingo-Ferrer, J., Torra, V., Brand, R., & Giessing, A. (2004). µ-ARGUS Ver. 4.0 Software and User's Manual. Voorburg, The Netherlands: Statistics Netherlands.
[16] Kabir, M. E., Wang, H., & Zhang, Y. (2010). A pairwise-systematic microaggregation for statistical disclosure control. International Conference on Data Mining 2010, 266-273.
[17] Laszlo, M., & Mukherjee, S. (2005). Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7), 902-911.
[18] Laszlo, M., & Mukherjee, S. (2009). Approximation bounds for minimum information loss microaggregation. IEEE Transactions on Knowledge and Data Engineering, 21(11), 1643-1647.
[19] Lin, S., & Kernighan, B. (1973). An effective heuristic algorithm for the traveling-salesman problem. Operations Research, 21(2), 498-516.
[20] Mateo-Sanz, J. M., & Domingo-Ferrer, J. (1999). A method for data-oriented multivariate microaggregation. In J. Domingo-Ferrer (Ed.), Statistical Data Protection (pp. 89-99). Luxembourg: Office for Official Publications of the European Communities.
[21] Oganian, A., & Domingo-Ferrer, J. (2001). On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18(4), 345-354.
[22] Rego, C., Gamboa, D., Glover, F., & Osterman, C. (2011). Traveling salesman problem heuristics: leading methods, implementations, and latest advances. European Journal of Operational Research, 211(3), 427-441.
[23] Samarati, P. (2001). Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6), 1010-1027.
[24] Sande, G. (2002). Exact and approximate methods for data directed microaggregation in one or more dimensions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 459-476.
[25] Solanas, A., Martinez-Balleste, A., Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2006). A 2^d-tree-based blocking method for microaggregating very large data sets. First International Conference on Availability, Reliability and Security (ARES'06), 922-928.
[26] Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570.
[27] Willenborg, L., & de Waal, T. (2001). Elements of Statistical Disclosure Control. New York: Springer-Verlag.
Proceedings of the 2nd International Symposium on [24] Sande, G. (2002). Exact and approximate methods for dataStatistical Confidentiality. Eurostat, Luxemburg, 69-78. directed microaggregation in one or more dimensions. International[8] Dijkstra, E. W. (1959). A note on two problems in connexion Journal of Uncertainty and Fuzziness in Knowledge Based Systems,with graphs. Numerische Mathematik, 1, 269-271. 10(5), 459-476.[9] Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J. M., [25] Solanas, A., Martinez-Balleste, A., Domingo-Ferrer, J., && Sebé, F. (2006). Efficient multivariate data-oriented Mateo-Sanz, M. (2006). A 2^d-tree-based blocking method formicroaggregation. The VLDB Journal, 15, 355–369. microaggregating very large data sets. In First International[10] Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical Conference on Availability, Reliability and Security (ARES’06), 922-data-oriented microaggregation for statistical disclosure control. IEEE 928.Transactions on Knowledge and Data Engineering, 14(1), 189- [26] Sweeney, L. (2002). K-anonymity: a model for protecting201. privacy. International Journal on Uncertainty, Fuzziness and[11] Domingo-Ferrer, J., Sebe, F., & Solanas, A. A polynomial- Knowledge-based Systems, 10(5), 557–570.time approximation to optimal multivariate microaggregation. [27] Willenborg, L., & de Waal, T. (2001). Elements of statisticalComputers & Mathematics with Applications, 55(4), 714-732. disclosure control. New York: Springer-Verlag.© 2012 ACEEE 26DOI: 01.IJCSI.03.01. 11