Towards A Differential Privacy Preserving Utility Machine Learning Classifier

  1. Towards A Differential Privacy and Utility Preserving Machine Learning Classifier
     Kato Mivule, Claude Turner, and Soo-Yeon Ji
     Computer Science Department, Bowie State University
     Complex Adaptive Systems 2012, Washington DC, USA, November 14-16
  2. Outline
     - Introduction
     - Related work
     - Essential terms
     - Methodology
     - Results
     - Conclusion
  3. Introduction
     - Entities transact in ‘big data’ containing personally identifiable information (PII).
     - Organizations are bound by federal and state law to ensure data privacy.
     - In the process of achieving privacy, the utility of privatized datasets diminishes.
     - Achieving a balance between privacy and utility is an ongoing problem.
     - Therefore, we investigate a differential privacy preserving machine learning classification approach that seeks an acceptable level of utility.
  4. Related Work
     There is growing interest in privacy preserving data mining solutions that balance data privacy and utility.
     - Kifer and Gehrke (2006) broadly studied enhancing data utility in privacy preserving data publishing using statistical approaches.
     - Wong (2007) described how achieving globally optimal privacy while maintaining utility is an NP-hard problem.
     - Krause and Horvitz (2010) noted that finding trade-offs between privacy and utility is likewise NP-hard.
     - Muralidhar and Sarathy (2011) showed that differential privacy provides strong privacy guarantees, but utility remains a problem due to noise levels.
     - Finding the optimal balance between privacy and utility remains a challenge, even with differential privacy.
  5. Data Utility versus Privacy
     - Data utility is the extent to which a published dataset is useful to the consumer of that dataset.
     - In the course of a data privacy process, the original data lose statistical value despite privacy guarantees.
     (Image source: Kenneth Corbin, Internet News.)
  6. Objective
     - Achieving an optimal balance between data privacy and utility remains an ongoing challenge.
     - Such optimality is highly desired and is the goal of our investigation.
     (Image source: Wikipedia, on confidentiality.)
  7. Ensemble Classification
     - A machine learning process in which several independently trained classifiers are combined to achieve better prediction.
     - Examples include individually trained decision trees joined to make more accurate predictions.
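The combination idea above can be sketched with a majority vote over independently trained classifiers. This is a minimal Python illustration (the paper's own experiments used MATLAB); the threshold "classifiers" and values below are hypothetical, chosen only to show the voting mechanism:

```python
# Minimal ensemble sketch: combine several independently built classifiers
# by majority vote. The threshold rules stand in for trained classifiers.
def make_threshold_classifier(thr):
    # Returns a trivial classifier: predict 1 if x >= thr, else 0.
    return lambda x: 1 if x >= thr else 0

# Three "independently trained" weak classifiers (thresholds are made up).
ensemble = [make_threshold_classifier(t) for t in (40, 50, 60)]

def majority_vote(classifiers, x):
    # The ensemble prediction is the class receiving the most votes.
    votes = sum(c(x) for c in classifiers)
    return 1 if votes > len(classifiers) / 2 else 0
```

An input of 70 clears all three thresholds and is classified 1, while 30 clears none and is classified 0; disagreement among members is resolved by the vote.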
  8. AdaBoost Ensemble (Adaptive Boosting)
     - Proposed by Freund and Schapire (1995); runs several iterations, adding weak learners to build a strong learner and adjusting weights to focus on data misclassified in earlier iterations.
     - The classification error in the AdaBoost ensemble is computed as shown below.
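The equation on this slide was an image and did not survive extraction. The standard formulation from Freund and Schapire, which the slide presumably showed, is:

```latex
% Weighted classification error of weak learner h_t at round t,
% over weights w_i^{(t)} summing to 1:
\varepsilon_t = \sum_{i=1}^{n} w_i^{(t)} \, \mathbf{1}\!\left[ h_t(x_i) \neq y_i \right]
% The corresponding learner weight used in the final vote:
\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}
```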
  9. AdaBoost Ensemble (Cont’d)
     - The AdaBoost ensemble computes as follows.
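The boosting loop described above can be sketched from scratch. This is a minimal Python illustration of the standard AdaBoost procedure with one-dimensional threshold stumps, not the authors' MATLAB implementation; the toy data are hypothetical, with labels in {+1, -1}:

```python
# Minimal from-scratch AdaBoost sketch: repeatedly fit the best threshold
# stump under the current weights, then upweight misclassified points.
import math

def train_stump(xs, ys, w):
    """Pick the (threshold, polarity) with the lowest weighted error."""
    best = None
    for thr in xs:
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best  # (weighted error, threshold, polarity)

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n          # start with uniform weights
    learners = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)  # guard against division by zero
        if err >= 0.5:
            break              # weak learner no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        preds = [pol if x >= thr else -pol for x in xs]
        # Increase weight on misclassified points, then renormalize.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]
        learners.append((alpha, thr, pol))
    return learners

def predict(learners, x):
    # Final prediction is the sign of the alpha-weighted vote.
    s = sum(a * (pol if x >= thr else -pol) for a, thr, pol in learners)
    return 1 if s >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
model = adaboost(xs, ys)
train_err = sum(predict(model, x) != y for x, y in zip(xs, ys)) / len(xs)
```

On this separable toy data the training error reaches zero; on harder data the error falls gradually as more weak learners are added, which is the behavior examined in the results slides.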
  10. Differential Privacy
  11. Differential Privacy (Cont’d)
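The content of these two slides was graphical and did not survive extraction. The standard definition they presumably covered, due to Dwork, is: a randomized mechanism M is ε-differentially private if, for all datasets D1 and D2 differing in at most one record and for all S in the range of M,

```latex
\Pr[M(D_1) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D_2) \in S]
% The Laplace mechanism achieves this by adding noise calibrated to the
% global sensitivity \Delta f of the query f:
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)
```

Smaller ε means more noise and stronger privacy, which is the trade-off driving the utility results that follow.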
  12. Methodology (Cont’d)
     - We utilized a publicly available Barack Obama 2008 campaign donations dataset.
     - The dataset contained 17,695 records of original, unperturbed data.
     - Two attributes, donation amount and income status, are used to classify the data into three classes.
     - The three classes are low, middle, and high income, for donations of $1 to $49, $50 to $80, and $81 and above, respectively.
     - To validate our approach, the dataset was split 50 percent for training and the remainder for testing, for both the original and privatized datasets.
     - An Oracle database is queried via the MATLAB ODBC connector; MATLAB is used for differential privacy and machine learning classification.
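The perturbation and binning steps above can be sketched as follows. This is a hedged Python illustration (the paper used MATLAB); the ε and sensitivity values, the sampling seed, and the sample donations are all illustrative, not those of the study:

```python
# Sketch: add Laplace noise (scale = sensitivity / epsilon) to each donation
# amount, and bin amounts into the three income classes from the slide.
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def laplace_noise(scale):
    # Sample Lap(0, scale) by inverse-CDF from a uniform draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(donations, epsilon=1.0, sensitivity=1.0):
    b = sensitivity / epsilon
    return [d + laplace_noise(b) for d in donations]

def income_class(amount):
    # Boundaries follow the slide: $1-$49 low, $50-$80 middle, $81+ high.
    if amount <= 49:
        return "low"
    if amount <= 80:
        return "middle"
    return "high"

donations = [10, 45, 60, 75, 90, 120]   # hypothetical sample records
noisy = privatize(donations)
classes = [income_class(d) for d in donations]
```

Classes assigned from the original amounts serve as ground-truth labels; the classifier is then asked to recover them from the noisy copies.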
  13. Results
     - Essential statistical traits of the original and differentially private datasets, a necessary requirement for publishing privatized datasets, are preserved.
     - As depicted, the mean, standard deviation, and variance of the original and differentially private datasets remained the same.
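The utility check described above can be sketched with the standard library. The data below are made up for illustration; they only demonstrate the comparison, not the paper's reported values:

```python
# Sketch of the utility check: compare mean, standard deviation, and
# variance of an original series against a noisy copy.
import statistics

original = [10, 45, 60, 75, 90, 120]              # hypothetical values
privatized = [11.2, 43.5, 61.7, 73.9, 91.4, 118.6]  # hypothetical noisy copy

def summary(xs):
    return {
        "mean": statistics.mean(xs),
        "stdev": statistics.stdev(xs),
        "variance": statistics.variance(xs),
    }

orig_stats = summary(original)
priv_stats = summary(privatized)
```

Because zero-mean Laplace noise cancels out in expectation, the two summaries stay close, which is the property the slide reports.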
  14. Results (Cont’d)
     - There is a strong positive covariance of 1060.8 between the two datasets, which means that they tend to increase together, as illustrated below.
  15. Results (Cont’d)
     - There is almost no correlation (0.0054) between the original and differentially privatized datasets.
     - This indicates some privacy assurance: an attacker with only the privatized dataset would find it difficult to correctly infer the alterations.
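The covariance and correlation figures reported on the last two slides can be computed as below. This is a plain Python sketch of the standard sample definitions; the two series are invented for illustration:

```python
# Sample covariance and Pearson correlation between two series.
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    # Pearson correlation: covariance normalized by both standard deviations.
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

original = [10.0, 45.0, 60.0, 75.0, 90.0]       # hypothetical values
privatized = [12.3, 41.8, 66.5, 70.1, 95.2]     # hypothetical noisy copy
cov = covariance(original, privatized)
corr = correlation(original, privatized)
```

Note that covariance is scale-dependent while correlation is normalized to [-1, 1], which is why a large covariance can coexist with a near-zero correlation on the real dataset.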
  16. Results (Cont’d)
     - After applying differential privacy, the AdaBoost ensemble classifier is run.
     - The outcome for the donors’ dataset was low, middle, and high income, for donations of 0 to 50, 51 to 80, and 81 to 100, respectively.
     - The same classification outcome is used for the perturbed dataset to investigate whether the classifier would categorize it correctly.
  17. Results (Cont’d)
     - On the training set of the original data, the classification error dropped from 0.25 to 0 as weak decision tree learners were added.
     - The results changed on the training set of the differentially private data, where the classification error dropped only from 0.588 to 0.58.
  18. Results (Cont’d)
     - When the same procedure is applied to the test set of the original data, the classification error dropped from 0.03 to 0.
     - However, when the procedure is performed on the differentially private data, the error rate did not change even as the number of weak decision tree learners increased.
  19. Conclusion
     - In this study, we found that while differential privacy may guarantee strong confidentiality, providing data utility remains a challenge.
     - However, the study is instructive in several ways:
       - The level of Laplace noise does affect the classification error.
       - Increasing the number of weak learners is not very significant.
       - Adjusting the Laplace noise parameter, ε, is essential for further study.
       - Accurate classification, however, implies loss of privacy.
       - Trade-offs must be made between privacy and utility.
       - We plan to investigate optimization approaches for such trade-offs.
  20. Questions?
     Contact: Kato Mivule, kmivule@gmail.com
     Thank you.
