
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge By Kato Mivule for the Degree of D.Sc. in Computer Science - Bowie State University


  1. AN INVESTIGATION OF DATA PRIVACY AND UTILITY USING MACHINE LEARNING AS A GAUGE. Dissertation Defense by Kato Mivule for the Degree of D.Sc. in Computer Science, Department of Computer Science. Venue and Date: Center for Business and Graduate Studies, Dean’s Conference Room 1303; open to the public; Thursday, April 17, 2014, at 1 pm. Dissertation Committee: Claude Turner, Ph.D. (Chair); Soo-Yeon Ji, Ph.D. (Member); Hoda El-Sayed, D.Sc. (Member); Darsana Josyula, Ph.D. (Member); Anthony Joseph, Ph.D. (External Examiner). Cosmas U. Nwokeafor, Ph.D. (Dean, The Graduate School); Lethia Jackson, D.Sc. (Chair, Computer Science Department).
  2. DISSERTATION DEFENSE PRESENTATION BY KATO MIVULE. OUTLINE: • Introduction (The Problem; Contributions) • Literature Review • Methodology • Results and Discussion (Results; Discussion) • Conclusion and Future Work. Kato Mivule – Bowie State University, Department of Computer Science.
  3. CONTRIBUTIONS: 1. A proposed data privacy engineering framework, SIED. 2. A proposed Comparative x-CEG data utility analysis heuristic. 3. Proposed Initial and Subsequent Basic Privacy (IBP and SBP) indexes. 4. A proposed data swapping and noise addition hybrid model for privacy. 5. A proposed privatized synthetic data generation model using image and signal processing techniques (DT, DCT, and DWT). 6. An implementation of k-anonymity that minimizes information loss via frequency count analysis and synthetic data replacement.
  4. THE PROBLEM: Finding a user-defined balance between data privacy and utility needs, with trade-offs, amid the challenge of ambiguous definitions of privacy and utility. “Perfect privacy can be achieved by publishing nothing at all, but this has no utility; perfect utility can be obtained by publishing the data exactly as received, but this offers no privacy” – Cynthia Dwork (2006). Data Privacy: differential privacy, noise addition, k-anonymity, etc. Data Utility: completeness, currency, accuracy.
  5. MOTIVATION: • Generate privatized synthetic data sets that meet acceptable privacy and utility requirements. • Data Privacy Engineering: adapt engineering principles to the data privacy and utility process. HYPOTHESIS: Fine-tuning parameters in the data privacy procedure, specifically perturbation methods such as noise addition and differential privacy, lowers the classification error and thus generates better data utility.
  6. LITERATURE REVIEW: The data privacy and utility problem. • Wong et al. (2007); Meyerson & Williams (2004); Park & Shim (2007): data privatization diminishes data utility – an NP-hard problem. • Krause & Horvitz (2010); Wang & Wu (2005): optimal data utility with privacy is a well-documented NP-hard problem. • Ghosh et al. (2008); Brenner & Nissim (2010): trade-offs are needed in the privacy versus utility process – also NP-hard. • Li & Li (2009): it is not possible to equate privacy and utility. • Fienberg, Rinaldo, & Yang (2010): even with differential privacy, privacy is granted but at a loss of data utility.
  7. LITERATURE REVIEW: Techniques and algorithms used in this study. • Data Privacy: noise addition, logarithmic noise, multiplicative noise, differential privacy, k-anonymity. • Image and Signal Processing: Distance Transform, Discrete Cosine Transform, Discrete Wavelet Transform, Gaussian filtering. • Machine Learning: KNN, Neural Networks, Naïve Bayes, Decision Trees, AdaBoost M1.
  8. METHODOLOGY: Contribution 1 – SIED, a data privacy engineering framework. • SIED phases: Specification, Implementation, Evaluation, and Dissemination. • Motivation: given any original dataset X, a set of data privacy engineering phases should be followed from start to completion in generating the privatized dataset X′.
  9. METHODOLOGY: Contribution 1 – SIED. The SIED Specification phase (diagram).
  10. METHODOLOGY: Contribution 1 – SIED. The SIED Implementation phase (diagram).
  11. METHODOLOGY: Contribution 1 – SIED. The SIED Evaluation phase (diagram).
  12. METHODOLOGY: Contribution 1 – SIED. The SIED Dissemination phase (diagram).
  13. METHODOLOGY: Contribution 2 – a data privacy parameter mapping heuristic. • Categorize parameters for effective fine-tuning, and thus better privacy and utility: what parameters need adjustment in the data privacy process? • Category 1 – data utility goal parameters, e.g., accuracy, currency, and completeness. • Category 2 – data privacy algorithm parameters, e.g., k in k-anonymity, ε in noise addition and differential privacy. • Category 3 – application parameters (e.g., machine learning classifier settings such as the weak learners in AdaBoost). • Parameter adjustment and fine-tuning trade-offs drive data privacy and utility preservation.
  14. METHODOLOGY: Contribution 3 – the x-CEG and Comparative x-CEG heuristics. The Classification Error Gauge (x-CEG): get the original dataset; apply data privacy; classify the privatized dataset; if error ≤ t, better utility might be achieved – publish; if error > t, adjust the data privacy and classifier parameters and repeat. The procedure replicates x times until the threshold t is reached. The Comparative x-CEG heuristic employs multiple data privacy and classifier algorithms in each run.
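As a minimal sketch, the x-CEG loop described above can be written in Python. The Gaussian-noise privatizer and the mean-absolute-deviation stand-in for classification error below are illustrative assumptions, not the study's MATLAB and RapidMiner pipeline.

```python
import random

random.seed(0)  # reproducible illustration

def classify_error(private, original):
    # Stand-in for a real classifier's error: mean absolute
    # deviation of the privatized data from the original.
    return sum(abs(p - o) for p, o in zip(private, original)) / len(original)

def x_ceg(dataset, t, max_runs=25):
    """Replicate the privatize-classify-adjust loop until error <= t."""
    sigma = 1.0                                   # starting noise parameter
    for _ in range(max_runs):
        private = [x + random.gauss(0, sigma) for x in dataset]  # apply data privacy
        error = classify_error(private, dataset)                 # classify privatized data
        if error <= t:
            return private, error                 # threshold met: publish
        sigma *= 0.5                              # adjust privacy parameter, retry
    return private, error

data = [5.1, 4.9, 4.7, 4.6, 5.0]                  # e.g., Iris sepal lengths
private, err = x_ceg(data, t=0.05)
```

Each pass through the loop tightens the noise parameter, trading privacy for utility until the error threshold is met.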
  15. METHODOLOGY: Contribution 4 – the x-CEG threshold determination heuristic. • Average value of a function = integral / interval: $\mathrm{AVF} = \frac{1}{b-a}\int_a^b f(x)\,dx \approx \frac{1}{b-a}\sum_{i=1}^{n} f(\bar{x}_i)\,\Delta x$, where $\Delta x = \frac{b-a}{n}$ and $\bar{x}_i = \frac{1}{2}(x_{i-1} + x_i)$. • The mean $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$. • The threshold $t = \max[\max(\text{mean}),\ \max(\text{midpoint})]$ is chosen as the higher of the max mean and max mid-point values. • The classification error of the original data set is used as a benchmark in measuring privatized synthetic data sets.
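The threshold rule can be sketched directly: take each classifier's series of accuracies, compute the ordinary mean and the midpoint-rule average, and return the larger of the two maxima. The function name and the dictionary layout of the input are assumptions for illustration.

```python
def threshold(accuracy_by_classifier):
    """t = Max[max(mean), max(midpoint)] over the classifiers'
    accuracy series, per the x-CEG threshold heuristic."""
    means, midpoints = [], []
    for acc in accuracy_by_classifier.values():
        means.append(sum(acc) / len(acc))                      # ordinary mean
        mids = [(a + b) / 2.0 for a, b in zip(acc, acc[1:])]   # midpoints of consecutive values
        midpoints.append(sum(mids) / len(mids))
    return max(max(means), max(midpoints))

# Toy accuracy series (illustrative values, not the study's results):
t = threshold({"KNN": [84.0, 86.0, 85.0], "NeuralNet": [88.0, 86.0, 88.0]})
```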
  16. METHODOLOGY: Contribution 5 – the Initial and Subsequent Basic Privacy indices. • Let $X = \{X_1, \dots, X_n\}$ be the set of all values in the database. • Let $X' = \{X'_1, \dots, X'_n\}$ be the set of items to be privatized. • Let $Y$ be the set of items that get revealed after the initial privacy measurement, where $|X'| \le |X|$ and $|Y| \le |X|$. • $X$, $X'$, and $Y$ must be countable, i.e., there are one-to-one (injective) functions $f\colon X \to \mathbb{N}$, $X' \to \mathbb{N}$, $Y \to \mathbb{N}$ to the natural numbers $\mathbb{N} = \{0, 1, 2, 3, \dots\}$. • Initial Basic Privacy: $\mathrm{IBP} = \frac{|X'|}{|X|} \times 100$. • Subsequent Basic Privacy: $\mathrm{SBP} = \frac{|X' - Y|}{|X|} \times 100$. • Here $|\cdot|$ denotes cardinality, the count of elements in a set. • IBP and SBP can be reported as percentages or normalized between 0 and 1.
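The two indices translate to one line of set arithmetic each. The toy sets below are illustrative assumptions, not data from the study.

```python
def ibp(X_priv, X):
    """Initial Basic Privacy: privatized share of the database, as a percent."""
    return 100.0 * len(X_priv) / len(X)

def sbp(X_priv, Y, X):
    """Subsequent Basic Privacy: privatized items minus those later
    revealed (set Y), as a percent of the database."""
    return 100.0 * len(X_priv - Y) / len(X)

X = set(range(100))        # all database values
X_priv = set(range(80))    # items privatized
Y = set(range(5))          # items revealed after the initial measurement
```

With these sets, 80% of the database starts out privatized (IBP), dropping to 75% once the revealed items are discounted (SBP).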
  17. METHODOLOGY: Contribution 6 – the Filtered Comparative x-CEG heuristic: using image and signal processing techniques to generate privatized synthetic data.
  18. METHODOLOGY: Contribution 7 – a data swapping and noise addition data privacy hybrid: generating privatized synthetic data using data swapping and noise perturbation.
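A minimal sketch of the hybrid, assuming a single numeric column: swap a small fraction of record positions, then add Gaussian noise to every value. The 5% swap and (μ=1, σ=0.1) defaults echo the parameter region the results later identify as optimal; the function itself is an illustration, not the study's implementation.

```python
import random

random.seed(1)  # reproducible illustration

def swap_and_noise(values, swap_frac=0.05, mu=1.0, sigma=0.1):
    """Hybrid sketch: swap a small fraction of record positions,
    then perturb every value with Gaussian noise N(mu, sigma)."""
    out = list(values)
    n_pairs = max(1, int(swap_frac * len(out)) // 2)
    for _ in range(n_pairs):
        i, j = random.sample(range(len(out)), 2)       # pick two rows
        out[i], out[j] = out[j], out[i]                # data swapping
    return [v + random.gauss(mu, sigma) for v in out]  # noise addition

vals = [float(v) for v in range(20)]
priv = swap_and_noise(vals)
```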
  19. METHODOLOGY: Contribution 8 – minimizing information loss with k-anonymity: an implementation of k-anonymity that minimizes information loss via frequency count analysis and synthetic data replacement.
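The frequency-count idea can be sketched for a single quasi-identifier column: values appearing fewer than k times would normally be suppressed outright; here each is replaced with the most frequent value as a synthetic stand-in, so every value in the output occurs at least k times. The ZIP-code-like values are an illustrative assumption.

```python
from collections import Counter

def k_anonymize(quasi_ids, k=2):
    """Frequency-count sketch: replace values occurring fewer than
    k times with the most frequent value (a synthetic stand-in),
    instead of suppressing them, to minimize information loss."""
    counts = Counter(quasi_ids)
    synthetic = counts.most_common(1)[0][0]     # most frequent value
    return [v if counts[v] >= k else synthetic for v in quasi_ids]

out = k_anonymize(["20715", "20715", "20601", "20743", "20743"], k=2)
```

The lone "20601" is the only record with count < k, so it alone is rewritten; the published attribute loses one value rather than a whole suppressed column.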
  20. RESULTS AND DISCUSSION: Comparative x-CEG results. • The Fisher Iris multivariate dataset from the UCI repository was used. • 165 experiment runs, generating 165 privatized synthetic data sets. • Classifiers: KNN, Neural Nets, Decision Trees, AdaBoost, and Naïve Bayes. • MATLAB for data privacy and RapidMiner for machine learning. Classification accuracy (%):

      NOISE LEVEL             KNN    NEURAL NETS  NAÏVE BAYES  DECISION TREES  ADABOOST M1
      Original                96.00  96.67        96.00        94.67           97.33
      Noise1 (μ=5, σ=0.8)     66.67  74.00        64.00        66.67           64.00
      Noise2 (μ=0, σ=0.8)     61.33  72.00        66.67        63.33           54.67
      Noise3 (μ=1, σ=0.8)     68.67  74.00        69.33        66.67           60.00
      Noise4 (μ=2, σ=0.8)     68.67  62.67        62.00        59.33           54.67
      Noise5 (μ=3, σ=0.8)     72.67  66.67        67.33        61.33           50.67
      Noise6 (μ=4, σ=0.8)     75.33  82.67        70.00        72.00           63.33
      Noise1a (μ=5, σ=0.1)    94.00  93.33        92.67        91.33           92.67
      Noise1b (μ=5, σ=0.2)    92.00  94.67        91.33        90.00           90.67
      Noise1c (μ=5, σ=0.3)    93.33  94.00        90.67        92.00           94.00
      Noise1d (μ=5, σ=0.4)    90.00  93.33        87.33        86.67           86.67
      Noise2b (μ=0, σ=0.1)    96.67  96.67        94.00        96.67           92.00
      Noise2c (μ=0, σ=0.2)    89.33  92.00        86.67        87.33           90.00
      Noise2d (μ=0, σ=0.3)    87.33  90.00        86.67        84.67           85.33
      Noise2e (μ=0, σ=0.4)    87.33  90.00        86.67        84.67           85.33
      Noise3a (μ=1, σ=0.4)    87.33  87.33        85.33        84.00           83.33
      Noise3b (μ=1, σ=0.1)    97.33  94.00        96.00        96.00           94.67
      Noise3c (μ=1, σ=0.2)    92.67  95.33        91.33        90.67           93.33
      Noise3d (μ=1, σ=0.3)    94.67  95.33        91.33        94.00           90.00
      Noise4a (μ=2, σ=0.1)    94.67  98.00        98.00        96.67           98.00
      Noise4b (μ=2, σ=0.2)    93.33  96.00        92.67        91.33           90.67
      Noise4c (μ=2, σ=0.3)    88.00  91.33        89.33        90.00           86.67
      Noise4d (μ=2, σ=0.4)    87.33  87.33        85.33        84.00           83.33
      Noise5a (μ=3, σ=0.1)    97.33  94.00        96.00        96.00           94.67
      Noise5b (μ=3, σ=0.2)    92.67  95.33        91.33        90.67           93.33
      Noise5c (μ=3, σ=0.3)    94.67  95.33        91.33        94.00           90.00
      Noise5d (μ=3, σ=0.4)    93.33  94.00        93.33        92.00           87.33
      Noise6a (μ=4, σ=0.1)    78.00  87.33        87.33        82.67           84.67
      Noise6b (μ=4, σ=0.2)    93.33  95.33        94.00        93.33           92.67
      Noise6c (μ=4, σ=0.3)    91.33  92.00        92.00        90.00           92.00
      Noise6d (μ=4, σ=0.4)    78.00  87.33        88.67        82.67           84.67
      Multiplicative          56.67  68.67        59.33        64.67           58.00
      Logarithmic             50.67  58.00        56.00        53.33           57.33
  21. RESULTS AND DISCUSSION: Comparative x-CEG results. • A bar chart depiction of the Comparative x-CEG classification accuracy results (chart).
  22. RESULTS AND DISCUSSION: Comparative x-CEG classifier performance results – Neural Nets were the most resilient.
  23. RESULTS AND DISCUSSION: x-CEG threshold determination results. • Threshold $t = \max[\max(\text{mean}),\ \max(\text{midpoint})]$. • The threshold value is chosen heuristically using the mid-point classification accuracy of 87.33% for the Neural Nets.

      Statistic   KNN    NEURAL NETS  NAÏVE BAYES  DECISION TREES  ADABOOST M1  MAX
      Mean        84.87  87.41        84.54        83.74           82.30        87.41
      Mid-Point   80.18  82.48        79.81        79.05           77.51        82.48
      Max         84.87  87.41        84.54        83.74           82.30        87.41
  24. RESULTS AND DISCUSSION: x-CEG threshold determination results (chart).
  25. RESULTS AND DISCUSSION: x-CEG threshold determination results (chart).
  26. RESULTS AND DISCUSSION: How much privacy? Statistical traits of the original and privatized data (chart).
  27. RESULTS AND DISCUSSION: How much privacy? Statistical traits of the original and privatized data (chart).
  28. RESULTS AND DISCUSSION: How much privacy? Statistical traits of the original and privatized data (chart).
  29. RESULTS AND DISCUSSION: How much privacy? Statistical traits of the original and privatized data.

      Statistic                Value
      Original Data MSE        15.8937
      Privatized Data MSE      24.0875
      Original Data Entropy    -3.05E+04
      Privatized Data Entropy  -5.05E+04
      Correlation              0.9808
      MSE Difference           8.1938
      Entropy Difference       -2.00E+04
  30. RESULTS AND DISCUSSION: data swapping and noise addition hybrid. • 330 data sets were generated from the data swapping and noise addition hybrid experiment. • The optimal data swap for acceptable privacy and utility levels is between 5% and 10%. • Two data sets satisfied the threshold criteria after the Comparative x-CEG: noise ~ (μ=1, σ=0.1) at 5% swap, and noise ~ (μ=5, σ=0.1) at 5% swap.
  31. RESULTS AND DISCUSSION: data swapping and noise addition hybrid (chart).
  32. RESULTS AND DISCUSSION: data swapping and noise addition hybrid. • The best classification accuracy was obtained between 5% and 10% data swap.
  33. RESULTS AND DISCUSSION: signal processing and data privacy hybrid. Privatized synthetic data sets using Discrete Cosine Transforms (DCT): synthetic DCT-based Sepal Length data results (chart).
  34. RESULTS AND DISCUSSION: signal processing and data privacy hybrid. Synthetic filtered DCT-based Sepal Length data results (chart).
  35. RESULTS AND DISCUSSION: signal processing and data privacy hybrid. Filtered DCT-based data descriptive statistics – skeletal structure not kept, unlike DT-based data (chart).
  36. RESULTS AND DISCUSSION: signal processing and data privacy hybrid. Filtered DCT-based data inference statistics – low correlation (chart).
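As a rough sketch of the DCT route to synthetic data: transform a numeric column to the frequency domain, zero out high-frequency coefficients as a crude filter, and invert; the result is a synthetic series preserving the broad shape of the original. The DCT-II/DCT-III pair below is hand-rolled so the example stays dependency-free, and the `keep` cutoff is an illustrative assumption, not the study's parameter.

```python
import math

def dct(x):
    # DCT-II: X_k = sum_n x_n * cos(pi/N * (n + 0.5) * k)
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):
    # DCT-III inverse of the above, with 1/N and 2/N scaling.
    N = len(X)
    return [X[0] / N + (2.0 / N) * sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                                       for k in range(1, N))
            for n in range(N)]

def dct_synthetic(values, keep):
    """Zero out all but the first `keep` DCT coefficients, then invert."""
    X = dct(values)
    X = [c if k < keep else 0.0 for k, c in enumerate(X)]
    return idct(X)

sepal = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0]  # Iris-like sepal lengths
synthetic = dct_synthetic(sepal, keep=3)
```

Keeping all coefficients reproduces the original exactly; dropping more of them trades fidelity (utility) for distortion (privacy).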
  37. RESULTS AND DISCUSSION: image processing and data privacy hybrid. Privatized synthetic data sets using Distance Transforms (DT) – skeletal structure kept: DT-based Sepal Length data results (chart).
  38. RESULTS AND DISCUSSION: image processing and data privacy hybrid. Filtered DT-based Sepal Length data results (chart).
  39. RESULTS AND DISCUSSION: image processing and data privacy hybrid. Filtered DT-based data descriptive statistics – skeletal structure kept (chart).
  40. RESULTS AND DISCUSSION: image processing and data privacy hybrid. Filtered DT-based data inference statistics – high correlation (chart).
  41. RESULTS AND DISCUSSION: Distance Transform based data and the clustering test. DT-based synthetic data produced the best Davies-Bouldin criterion, at 0.419 after filtering, outperforming the original data.
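The Davies-Bouldin criterion used in the clustering test can be computed directly; lower values indicate tighter, better-separated clusters. The sketch below handles one-dimensional clusters with L1 scatter for simplicity (an assumption, not the tool the study used).

```python
def davies_bouldin(clusters):
    """DB = (1/k) * sum_i max_{j != i} (s_i + s_j) / d(c_i, c_j),
    where c_i is the cluster centroid and s_i the mean distance to it."""
    cents = [sum(c) / len(c) for c in clusters]
    scatter = [sum(abs(x - m) for x in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    return sum(
        max((scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

# Two tight, well-separated toy clusters give a small DB value.
db = davies_bouldin([[1.0, 1.2], [5.0, 5.2]])
```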
  42. RESULTS AND DISCUSSION: Distance Transform based data and the clustering test (chart).
  43. RESULTS AND DISCUSSION: clustering results of the original Fisher Iris data (chart).
  44. RESULTS AND DISCUSSION: clustering results of the DT-based synthetic Fisher Iris data (chart).
  45. RESULTS AND DISCUSSION: clustering results of the filtered DT-based Fisher Iris data – clustering greatly improved after filtering (chart).
  46. RESULTS AND DISCUSSION: DT, DCT, and DWT all improved classification accuracy after filtering.
  47. RESULTS: signal processing – the machine learning classification error test. Classification accuracy (%) of the privatized synthetic data:

      Statistic   NN     KNN    NB     DT     ADABOOST  MAX
      Mean        91.00  87.95  86.07  86.74  84.33     91.00
      Mid-Point   75.83  72.78  71.65  72.31  70.39     75.83
      Max         91.00  87.95  86.07  86.74  84.33     91.00
  48. RESULTS AND DISCUSSION: non-interactive differential privacy (DP). • Results of the Fisher Iris data after DP – too much noise is an issue with DP (chart).
  49. RESULTS AND DISCUSSION: non-interactive differential privacy (DP). • Classification accuracy of DP data (before filtering) falls as DP noise levels increase (chart).
  50. RESULTS AND DISCUSSION: non-interactive differential privacy (DP). • Improved classification accuracy of DP data sets after filtering (chart).
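A minimal non-interactive DP sketch: perturb each released value with Laplace(sensitivity/epsilon) noise, then smooth the release. A moving average stands in here for the study's Gaussian filtering step, and the epsilon, sensitivity, and window values are illustrative assumptions.

```python
import math
import random

random.seed(0)  # reproducible illustration

def laplace(scale):
    # Laplace(0, scale) via inverse-CDF sampling (stdlib only).
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_release(values, epsilon, sensitivity=1.0):
    """Non-interactive DP: add Laplace(sensitivity/epsilon) noise per value.
    Smaller epsilon means stronger privacy but heavier noise."""
    b = sensitivity / epsilon
    return [v + laplace(b) for v in values]

def smooth(values, window=3):
    # Moving average as a crude stand-in for Gaussian filtering:
    # damps the outlier noise that DP injects.
    h = window // 2
    return [sum(values[max(0, i - h):i + h + 1]) /
            len(values[max(0, i - h):i + h + 1])
            for i in range(len(values))]

noisy = dp_release([5.0] * 50, epsilon=0.5)
filtered = smooth(noisy)
```

Because each smoothed value is an average of its neighbors, the filtered series never exceeds the extremes of the noisy one, which is exactly the outlier-damping effect the slides report.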
  51. RESULTS AND DISCUSSION: non-interactive differential privacy (DP). • Comparative descriptive statistics of the original, DP, and filtered DP based data. • The skeletal structure is not kept as in DT-based data, but outlier noise is removed in the filtered DP-based data (chart).
  52. RESULTS: non-interactive differential privacy – inference statistics (chart).
  53. RESULTS: non-interactive differential privacy – how much DP? (chart)
  54. RESULTS: non-interactive differential privacy – how much DP? (chart)
  55. RESULTS AND DISCUSSION: data privacy using k-anonymity. • Suppress all items where k = 1 (table).
  56. RESULTS AND DISCUSSION: data privacy using k-anonymity. • Replace suppressed items with new synthetic values (the most frequent values) such that k > 1 for all items (table).
  57. RESULTS AND DISCUSSION: data privacy using k-anonymity. • Only sensitive attributes removed – information loss minimized in the published attributes (table).
  58. RESULTS AND DISCUSSION: data privacy using k-anonymity. • Only sensitive attributes removed – information loss minimized in the published attributes (table).
  59. CONCLUSION
  • The Comparative x-CEG: empirical results from this study show that fine-tuning parameters in the data privacy procedure (specifically noise addition and differential privacy), together with adjustments to the machine learning classifiers, lowers the classification error and thus generates better, desirable data utility. The hypothesis holds. The x-CEG model could help in presenting acceptable trade-off points between privacy and utility.
  • The SIED model: the appropriate solicitation of data privacy requirements, which vary case by case, is vital; SIED could serve as a suitable framework for such a data privacy engineering process.
  • Privatized synthetic data generation: data swapping, Distance Transforms, Discrete Cosine Transforms, and Discrete Wavelet Transforms, combined with data privacy procedures, allow the generation of privatized synthetic data sets. However, more research is needed on optimal parameterization and on other signal processing techniques.
  • Distance Transforms and filtering: empirical results from this study show that a hybrid of Distance Transforms (DT) and data privacy, combined with filtering, maintains the skeletal structure of the original data and generates privatized synthetic data with better classification accuracy, and thus better utility. However, more study is needed on securing DT-based privatized data against attackers reconstructing private data.
  • Differential privacy and filtering: Differential Privacy (DP) offers strong privacy guarantees but at a loss of data utility. However, empirical results from this study show that Gaussian filtering reduces outlier noise in DP-based data, with improved classification accuracy.
  • K-anonymity: information loss could be minimized using frequency count analysis for privatized data models requiring k-anonymity for confidentiality: remove only the sensitive attributes and use synthetic values for suppressed items.
  • Privacy versus utility: achieving optimal utility while granting privacy is still an open problem; accurate classification can also mean loss of privacy; trade-offs must be made between privacy and utility.
  60. FUTURE WORK: • Further the state of the art in data privacy engineering by developing data-privacy-compliant software, data privacy modeling, and autonomous intelligent data privacy agent systems following the SIED framework. • Apply data privacy and utility principles to digital forensics data, network traffic data, bioinformatics data, and big data. • Study the efficient generation of privatized synthetic data sets. • Apply data privacy principles to real-time data, including realistic scenarios where users of the data provide feedback on how useful it was to them. • Show, analytically, the differences in performance between the methods introduced in this work and other state-of-the-art methods.
  61. PUBLICATIONS
  1. Kato Mivule, “Towards Agent-based Data Privacy Engineering”, Proceedings of the Sixth International Conference on Advanced Cognitive Technologies and Applications (COGNITIVE 2014), May 25-30, 2014 (in print), Venice, Italy.
  2. Kato Mivule and Claude Turner, “SIED, A Data Privacy Engineering Framework”, Abstracts, Emerging Researchers National Conference in STEM (ERN 2014), page A239, ISBN 978-0-87168-757-9, Feb 20-22, 2014, Washington, DC, USA. [Best Oral Presentation Award]
  3. Kato Mivule and Claude Turner, International Journal of Computer Science and Mobile Computing, ICMIC13, December 2013, pp. 36-43, Dec 17-18, 2013, Trivandrum, Kerala, India.
  4. Kato Mivule and Claude Turner, “A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge”, Procedia Computer Science, Volume 20, 2013, pages 414-419, ISSN 1877-0509, Nov 13-15, Baltimore, MD, USA.
  5. Kato Mivule and Claude Turner, “An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge”, International Conference on Information and Knowledge Engineering (IKE 2013), July 22-25, pages 203-204, Las Vegas, NV, USA.
  6. Kato Mivule, Darsana Josyula, and Claude Turner, “Data Privacy Preservation in Multi-Agent Learning Systems”, Proceedings of the Fifth International Conference on Advanced Cognitive Technologies and Applications (COGNITIVE 2013), May 27 - June 1, 2013, pages 14-20, Valencia, Spain.
  7. Kato Mivule, Claude Turner, and Soo-Yeon Ji, “Towards A Differential Privacy and Utility Preserving Machine Learning Classifier”, Procedia Computer Science, 2012, pages 176-181, Washington, DC, USA.
  8. Kato Mivule, Stephen Otunba, Tattwamasi Tripathy, and Sharad Sharma, “Implementation of Data Privacy and Security in an Online Student Health Records System”, Proceedings of the ISCA 21st International Conference on Software Engineering and Data Engineering (SEDE-2012), pages 143-148, Los Angeles, CA, USA.
  9. Kato Mivule and Claude Turner, “Applying Data Privacy Techniques on Published Data in Uganda”, Proceedings of the 2012 International Conference on e-Learning, e-Business, Enterprise Information Systems, and e-Government (EEE 2012), pages 110-115, Las Vegas, NV, USA.
  10. Kato Mivule, “Utilizing Noise Addition for Data Privacy, an Overview”, Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), pages 65-71, Las Vegas, NV, USA.
  62. THANK YOU! QUESTIONS? kmivule@gmail.com
