PPID3 AICCSA08

Privacy-Preserving ID3, presented at the AICCSA 2008 conference in Doha, Qatar, April 2008

  • Good afternoon everyone. Our paper is about “Privacy Preserving ID3 using Gini Index over Horizontally Partitioned Data”. This work was done by me and my supervisor, Dr. Miri, at the University of Ottawa.

Transcript

  • 1. Privacy Preserving ID3 using Gini Index over Horizontally Partitioned Data
    • Saeed Samet and Ali Miri
    • School of Information Technology and Engineering, University of Ottawa
  • 2. Outline
    • Our motivation
    • Decision Tree and ID3
    • Information Gain, Entropy, and Gini Index
    • Privacy-Preserving Data Mining
    • Background
    • Our Main Protocol and Sub-Protocols
    • Complexity
    • Future Work
  • 3. Our Motivation
    • All existing work on privacy-preserving decision trees uses entropy
    • The Gini index can also be used to compute information gain
    • With the Gini index, the largest class tends to go into one pure node, while the other classes go into the other node
    • Entropy normally tries to create a balanced tree
  • 4. Decision Tree and ID3
    • A Decision Tree describes a tree structure wherein leaves represent classifications and branches represent conjunctions of features that lead to those classifications
    • ID3 is a decision tree induction algorithm developed by Quinlan. ID3 stands for "Iterative Dichotomizer 3"
  • 5. Decision Tree
    • Predictive model: maps observations to conclusions
    • Observations: variables (Normal Attributes) and their possible values
    • Conclusions: predicted values of the target variable (Class Attribute)
  • 6. Decision Tree Example
    • Normal (or Independent) Attributes: Outlook, Humidity, Wind
    • Class (or Dependent) Attribute: Play

    Day | Outlook | Humidity | Wind   | Play
    1   | Sunny   | High     | Weak   | No
    2   | Sunny   | High     | Weak   | No
    3   | Cloudy  | High     | Strong | No
    4   | Rain    | Normal   | Strong | No
    5   | Rain    | Normal   | Weak   | Yes
    6   | Rain    | High     | Weak   | No
    7   | Cloudy  | High     | Weak   | Yes
    8   | Sunny   | Normal   | Strong | No
    9   | Sunny   | Normal   | Strong | No
    10  | Sunny   | High     | Strong | No
    11  | Rain    | Normal   | Weak   | Yes
    12  | Cloudy  | Normal   | Weak   | Yes
    13  | Cloudy  | High     | Weak   | Yes
    ... | ...     | ...      | ...    | ...
  • 7. Decision Tree Example (cont.)
    • [Figure: decision tree for the example data, with internal nodes Wind, Outlook, and Humidity and Yes/No counts at the leaves]
    • Target Data: (Outlook = Cloudy, Wind = Strong, Humidity = High), classified as Play = No
  • 8. Information Gain
    • The information gain of a given attribute A with respect to the class attribute C is the reduction in uncertainty about the value of C when we know the value of A .
    • The uncertainty about the value of C is measured by its entropy.
    • The uncertainty about the value of C when we know the value of A is given by the conditional entropy of C given A .
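    • $Gain(S, A) = H(C) - H(C \mid A)$, with $H(C \mid A) = \sum_{a \in A} \frac{|S_a|}{|S|} H(C \mid A = a)$
    • where a is a value of A, $S_a$ is the subset of instances of S where A takes the value a, and $|S|$ is the number of instances.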
  • 9. Entropy
    • Amount of uncertainty about an event associated with a given probability distribution
    • Shannon defines entropy in terms of a discrete random event X, with possible states $x_1, \dots, x_n$, as: $H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$
    • where $p(x_i)$ is the probability of the i-th outcome of X.
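
The entropy and information-gain definitions on slides 8 and 9 can be sketched in Python on the 13 visible rows of the slide 6 example. This is only an illustration: the names `ROWS`, `entropy`, and `information_gain` are assumptions made here, and the rows elided on the slide are not included.

```python
from collections import Counter
from math import log2

# The 13 rows visible on slide 6: (Outlook, Humidity, Wind, Play), days 1..13.
ROWS = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Weak", "No"),
    ("Cloudy", "High", "Strong", "No"),
    ("Rain", "Normal", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "High", "Weak", "No"),
    ("Cloudy", "High", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "No"),
    ("Sunny", "Normal", "Strong", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Cloudy", "Normal", "Weak", "Yes"),
    ("Cloudy", "High", "Weak", "Yes"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(rows, attr_index, class_index=-1):
    """H(C) minus the conditional entropy H(C | A) for one attribute column."""
    labels = [r[class_index] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[class_index])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    for name, idx in (("Outlook", 0), ("Humidity", 1), ("Wind", 2)):
        print(name, round(information_gain(ROWS, idx), 4))
```

ID3 would pick the attribute with the largest gain at each node and recurse on the resulting subsets.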
  • 10. Gini Index
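    • Another sensible measure of impurity is the Gini Index: $Gini(S) = 1 - \sum_{j} p_j^2$
    • where $p_j$ is the relative frequency of class j in S.
    • Therefore, information gain using the Gini Index is: $Gain(S, A) = Gini(S) - \sum_{a \in A} \frac{|S_a|}{|S|} Gini(S_a)$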
    • We will come back to this formula…
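
A minimal Python sketch of the Gini-based gain, assuming the standard definition of the Gini index as one minus the sum of squared class frequencies. The helper names and the two column lists (the Outlook and Play columns of the 13 visible example rows) are illustrative assumptions.

```python
from collections import Counter

# Outlook and Play columns of the 13 visible example rows, days 1..13.
OUTLOOK = ["Sunny", "Sunny", "Cloudy", "Rain", "Rain", "Rain", "Cloudy",
           "Sunny", "Sunny", "Sunny", "Rain", "Cloudy", "Cloudy"]
PLAY = ["No", "No", "No", "No", "Yes", "No", "Yes",
        "No", "No", "No", "Yes", "Yes", "Yes"]


def gini(labels):
    """Gini index: 1 minus the sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def gini_gain(values, labels):
    """Gini(S) minus the size-weighted Gini index of each subset S_a."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(v, []).append(c)
    weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - weighted


if __name__ == "__main__":
    print("Gini(S) =", gini(PLAY))
    print("Gain(S, Outlook) =", gini_gain(OUTLOOK, PLAY))
```

Only class counts per attribute value are needed, which is what makes this criterion convenient when the rows are split across several parties.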
  • 11. Privacy-Preserving Data Mining
    • Privacy-preserving data mining
      • Extracting desired knowledge without revealing private data values, by developing new algorithms or modifying standard ones
      • In co-operative and distributed computation, preventing access to unnecessary and private information while each party still obtains the aggregate results it wants
  • 12. Privacy-Preserving Approaches
    • Data Distribution
      • Centralized data environment
      • Distributed data environment
        • Horizontal
        • Vertical
        • Arbitrary
    • Main approaches
      • Secure Multi-party Computation (SMC)
      • Randomization and perturbation
  • 13. Background
    • Pinkas and Lindell
        • Computing information gain using Entropy
        • Presenting a secure protocol to compute $x \ln x$ when x is distributed between two parties
        • Working only for two parties (because it uses the Oblivious Polynomial Evaluation protocol)
    • Xiao et al.
        • Computing information gain using Entropy
        • Working for multi-party case
        • Using Homomorphic encryption
  • 14. Our Protocol
    • Privacy Preserving ID3 over Horizontally Partitioned Data
    • Using Gini Index to compute information gain
    • Working for multi-party cases
    • Sub-protocols:
      • Secure multi-party addition
      • Secure multi-party multiplication
      • Secure multi-party square-division
  • 15. Main Protocol
    • Computing information gain using Gini Index
    • $p(c \mid a)$ is the probability that the value of class attribute C is c while the value of attribute A is a.
  • 16. Main Protocol (cont.)
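    • $Gini(S)$ is fixed for the current node. Therefore, for each candidate attribute we only have to compute the remaining, attribute-dependent term.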
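
A short derivation, under the standard Gini definitions, of why the attribute-dependent term reduces to squared counts divided by counts. The notation $|S_{a,c}|$ (number of instances with $A = a$ and $C = c$) is an assumption introduced here, not taken from the slides.

```latex
\begin{align*}
\mathrm{Gain}(S, A)
  &= \mathrm{Gini}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\,\mathrm{Gini}(S_a) \\
  &= \mathrm{Gini}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}
     \left( 1 - \sum_{c \in C} \left( \frac{|S_{a,c}|}{|S_a|} \right)^{2} \right) \\
  &= \mathrm{Gini}(S) - 1
     + \frac{1}{|S|} \sum_{a \in A} \sum_{c \in C} \frac{|S_{a,c}|^{2}}{|S_a|}
\end{align*}
```

The last step uses $\sum_{a} |S_a| / |S| = 1$. With horizontally partitioned data, each count $|S_{a,c}|$ and $|S_a|$ is a sum of the parties' local counts, so every term has the form of a squared sum divided by a sum, which is presumably what the square-division sub-protocol is for.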
  • 17. Main Protocol (cont.)
    • For instance, in the previous example, for attribute Outlook we have to compute this term over its three values (Sunny, Cloudy, Rain) and the two class values (Yes, No).
  • 18. Main Protocol (cont.)
    • For instance, two parties have to compute $\frac{(x_1 + x_2)^2}{y_1 + y_2}$
    • $x_1$ and $y_1$ belong to party 1
    • $x_2$ and $y_2$ belong to party 2
  • 19. Secure Multi-party Addition
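    • n parties are involved, $P_1, \dots, P_n$
      • Inputs: $x_1, \dots, x_n$
      • Outputs: $y_1, \dots, y_n$
      • such that: $\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} y_i$
      • Suppose E is an Additive Homomorphic Encryption, with public key e and private key d.
      • Thus we have: $E(m_1) \cdot E(m_2) = E(m_1 + m_2)$ and $E(m)^k = E(k \cdot m)$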
  • 20. Secure Multi-party Addition (cont.)
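      • $P_1$ selects an additive homomorphic encryption and sends the public key e to all other parties.
      • $P_1$ encrypts its input, $E(x_1)$, and sends it to $P_2$.
      • For i=2 to n-1
        • $P_i$ encrypts its input, $E(x_i)$, multiplies it by the received value, and
        • sends the result, $E(x_1 + \dots + x_i)$, to $P_{i+1}$.
      • $P_n$ encrypts its input, $E(x_n)$, and computes $E(x_1 + \dots + x_n)$.
      • $P_n$ randomly selects its nonzero output share $y_n$,
      • calculates $E(x_1 + \dots + x_n)^{y_n^{-1}}$ and sends it to $P_{n-1}$.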
  • 21. Secure Multi-party Addition (cont.)
      • For i=n-1 to 2
        • $P_i$ randomly selects its nonzero output $y_i$, raises the received value to the power $y_i^{-1}$,
        • and sends it to $P_{i-1}$.
      • $P_1$ decrypts the received value from $P_2$ and sets it as its output $y_1$.
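
The steps on slides 19 to 21 can be simulated in a single Python process. This is a sketch under several assumptions, not the authors' implementation: it uses a toy Paillier cryptosystem as the additive homomorphic scheme, the primes are far too small for real use, shares are combined modulo the Paillier modulus `N`, and all function names are hypothetical. In a real run each party would see only the ciphertext it receives.

```python
import random
from math import gcd

# Toy Paillier keys built from two Mersenne primes (insecure, illustration only).
P, Q = 2**31 - 1, 2**61 - 1
N, N2 = P * Q, (P * Q) ** 2
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)  # modular inverse, needs Python 3.8+


def random_unit():
    """Random nonzero element of Z_N that is invertible modulo N."""
    while True:
        r = random.randrange(1, N)
        if gcd(r, N) == 1:
            return r


def encrypt(m):
    """Paillier encryption with generator N + 1: (1 + m*N) * r^N mod N^2."""
    return (1 + (m % N) * N) * pow(random_unit(), N, N2) % N2


def decrypt(c):
    """Paillier decryption: L(c^PHI mod N^2) * MU mod N, with L(u) = (u-1)/N."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


def secure_addition(inputs):
    """Turn inputs x_1..x_n into shares y_1..y_n whose product is their sum (mod N)."""
    n = len(inputs)
    # P_1 encrypts x_1; P_2..P_n each multiply in E(x_i), giving E(x_1 + ... + x_n).
    c = encrypt(inputs[0])
    for x in inputs[1:]:
        c = c * encrypt(x) % N2
    # P_n down to P_2: choose a random share y_i and divide the hidden sum by it,
    # using E(m)^k = E(k*m) with k = y_i^{-1} mod N.
    shares = [None] * n
    for i in range(n - 1, 0, -1):
        shares[i] = random_unit()
        c = pow(c, pow(shares[i], -1, N), N2)
    # P_1 decrypts what is left and keeps it as its own share.
    shares[0] = decrypt(c)
    return shares


if __name__ == "__main__":
    ys = secure_addition([3, 5, 7])
    product = 1
    for y in ys:
        product = product * y % N
    print(product == 15)
```

The output shares look random individually; only their product modulo `N` reveals the sum of the inputs.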
  • 22. Other Sub-Protocols
    • Secure Multi-party Multiplication
      • Same as Secure Multi-party Addition
    • Secure Multi-party Square-Division
      • Using two previous sub-protocols
  • 23. Complexity
    • Common parameters
      • Size of the database
      • # of parties involved
      • # of attributes
      • # of possible values for attributes (on average)
    • To compute the cost of the protocol, suppose:
      • # of parties involved in the protocol is denoted by n
      • # of remaining normal attributes at the current step and node is denoted by a
      • # of possible values for those normal attributes, on average, is denoted by v
      • # of possible values for the class attribute is denoted by c
      • # of bits exchanged from one party to another party is denoted by b
      • Computational overhead of sub-protocols 1 and 2 is denoted by $CP_n$
  • 24. Complexity (cont.)
    • $CP_n$ includes n encryptions, one decryption, n-1 multiplications and n-1 power computations.
    • The overall computational cost, by assuming that c is dominated by b and n , is
    • The overall communication cost is
  • 25. Future Work
    • Using proposed sub-protocols in other techniques in PPDM and presenting new building blocks in SMC
    • Implementation of the protocol to
      • Find the exact cost and efficiency of the algorithm
      • Compare with other existing techniques
  • 26. References
    • Rakesh Agrawal, Alexandre V. Evfimievski, and Ramakrishnan Srikant. Information sharing across private databases. In ACM Special Interest Group on Management of Data (SIGMOD) Conference , pages 86–97, 2003.
    • Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In ACM Special Interest Group on Management of Data (SIGMOD) Conference , pages 439–450, 2000.
    • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees . Chapman & Hall, New York, 1984.
    • Leo Breiman. Technical note: Some properties of splitting criteria. Machine Learning , 24(1):41–47, 1996.
    • Christian Cachin and Jan Camenisch, editors. Advances in Cryptology - EUROCRYPT 2004, International Conference on the Theory and Applications of Cryptographic Techniques, Interlaken, Switzerland, May 2-6, 2004, Proceedings , volume 3027 of Lecture Notes in Computer Science . Springer, 2004.
    • Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) , 4(2):28–34, 2003.
    • DTREG. How trees are built. http://www.dtreg.com/treebuild.htm, 2006. (Last posted: 22/7/2006).
    • W. Du and M. Atallah. Privacy-preserving cooperative statistical analysis. In ACSAC ’01: Proceedings of the 17th Annual Computer Security Applications Conference , pages 102–110, New Orleans, Louisiana, USA, December 10-14 2001.
    • Wenliang Du and Zhijun Zhan. Building decision tree classifier on private data. In CRPITS’14: Proceedings of the IEEE international conference on Privacy, security and data mining , pages 1–8, Darlinghurst, Australia, 2002. Australian Computer Society, Inc.
    • Bart Goethals, Sven Laur, Helger Lipmaa, and Taneli Mielikäinen. On private scalar product computation for privacy preserving data mining. In ICISC , pages 104–120, 2004.
    • R. J. Light and B. H. Margolin. An analysis of variance for categorical data. In Journal of The American Statistical Association , volume 66, pages 534–544, 1971.
    • Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In CRYPTO , pages 36–54, 2000.
  • 27. References (cont.)
    • Behzad Malek and Ali Miri. Secure dot-product protocol using trace functions. 2006 IEEE International Symposium on Information Theory , 2006.
    • Moni Naor and Benny Pinkas. Oblivious transfer and polynomial evaluation. In STOC ’99: Proceedings of the thirty-first annual ACM Symposium on Theory of Computing , pages 245–254, New York, NY, USA, 1999. ACM Press.
    • Moni Naor and Benny Pinkas. Efficient oblivious transfer protocols. In SODA ’01: Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms , pages 448–457, Philadelphia, PA, USA, 2001. Society for Industrial and Applied Mathematics.
    • Benny Pinkas. Cryptographic techniques for privacy-preserving data mining. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) , 4(2):12–19, 2002.
    • Laura Elena Raileanu and Kilian Stoffel. Theoretical comparison between the gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence , 41(1):77–93, 2004.
    • Eakalak Suthampan and Songrit Maneewongvatana. Privacy preserving decision tree in multi party environment. In Asia Information Retrieval Symposium (AIRS) , pages 727–732, 2005.
    • Salford Systems. Do splitting rules really matter? http://www.salford-systems.com/423.php, 2006.
    • Jaideep Vaidya and Chris Clifton. Privacy-preserving decision trees over vertically partitioned data. In Data and Application Security (DBSec) , pages 139–152, 2005.
    • Jaideep Vaidya and Chris Clifton. Secure set intersection cardinality with application to association rule mining. Journal of Computer Security , 13(4):593–622, 2005.
    • Ming-Jun Xiao, Liu-Sheng Huang, Yong-Long Luo, and Hong Shen. Privacy preserving ID3 algorithm over horizontally partitioned data. In Parallel and Distributed Computing, Applications and Technologies , pages 239–243, 2005.
  • 28. Secure Multi-party Multiplication
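    • n parties are involved, $P_1, \dots, P_n$
      • Inputs: $x_1, \dots, x_n$
      • Outputs: $y_1, \dots, y_n$
      • such that: $\prod_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
      • Suppose E is an Additive Homomorphic Encryption, with public key e and private key d . Thus we have: $E(m_1) \cdot E(m_2) = E(m_1 + m_2)$ and $E(m)^k = E(k \cdot m)$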
  • 29. Secure Multi-party Multiplication (cont.)
      • $P_1$ selects an additive homomorphic encryption and sends the public key e to all other parties.
      • $P_1$ encrypts its input, $E(x_1)$, and sends it to $P_2$.
      • For i=2 to n-1
        • $P_i$ raises the received value to the power of its input $x_i$, and
        • sends it to $P_{i+1}$.
      • For i=n to 2
        • $P_i$ randomly selects its nonzero output share $y_i$, encrypts it,
        • $E(y_i)$, computes its inverse, $E(y_i)^{-1}$, multiplies the received
        • value by that, and sends it to $P_{i-1}$.
  • 30. Secure Multi-party Multiplication (cont.)
      • $P_1$ decrypts the received value from $P_2$ and sets it as its output $y_1$.
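
The multiplication sub-protocol can be simulated the same way. Again this is a hedged sketch rather than the authors' code: toy Paillier stands in for the additive homomorphic scheme, the primes are insecure, shares are combined modulo `N`, the function names are hypothetical, and the sketch assumes that the last party also raises the ciphertext to its own input before masking.

```python
import random
from math import gcd

# Toy Paillier keys built from two Mersenne primes (insecure, illustration only).
P, Q = 2**31 - 1, 2**61 - 1
N, N2 = P * Q, (P * Q) ** 2
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)  # modular inverse, needs Python 3.8+


def encrypt(m):
    """Paillier encryption with generator N + 1 and a random unit r."""
    while True:
        r = random.randrange(1, N)
        if gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Paillier decryption: L(c^PHI mod N^2) * MU mod N, with L(u) = (u-1)/N."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


def secure_multiplication(inputs):
    """Turn inputs x_1..x_n into shares y_1..y_n whose sum is their product (mod N)."""
    n = len(inputs)
    # P_1 encrypts x_1; each later party raises the ciphertext to its own input,
    # using E(m)^x = E(m*x). Assumption: P_n does this too before masking.
    c = encrypt(inputs[0])
    for x in inputs[1:]:
        c = pow(c, x % N, N2)
    # P_n down to P_2: choose a random share y_i and subtract it from the hidden
    # product by multiplying with the inverse ciphertext E(y_i)^{-1} = E(-y_i).
    shares = [None] * n
    for i in range(n - 1, 0, -1):
        shares[i] = random.randrange(1, N)
        c = c * pow(encrypt(shares[i]), -1, N2) % N2
    # P_1 decrypts the remainder and keeps it as its own share.
    shares[0] = decrypt(c)
    return shares


if __name__ == "__main__":
    ys = secure_multiplication([2, 3, 4])
    print(sum(ys) % N == 24)
```

Together with the addition sub-protocol, this lets the parties switch between additive and multiplicative sharings of a hidden value.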
  • 31. Secure Multi-party Square-Division
    • Suppose two parties are horizontally involved, and
      • Inputs: ,
      • Outputs:
    • Using two-party multiplication
      • Inputs: ,
      • Outputs: ,
      • such that: and
  • 32. Secure Multi-party Square-Division (cont.)
    • Next step
      • Inputs: and
      • and
      • Outputs:
      • Sub-step, using two-party addition
        • Inputs: ,
        • Outputs: ,
        • Such that: and
      • computes and sends it to
      • computes and sends it to
  • 33. Secure Multi-party Square-Division (cont.)
    • Each party computes