Privacy Preserving Data Mining: Challenges


  1. Data Mining Technologies for Digital Libraries & Web Information Systems (Ramakrishnan Srikant)
  2. Talk Outline
     - Taxonomy Integration (WWW 2001, with R. Agrawal)
     - Searching with Numbers
     - Privacy-Preserving Data Mining
  3. Taxonomy Integration
     - B2B electronics portal: 2000 categories, 200K datasheets.
     - [Figure: Master Catalog (ICs: DSP, Mem., Logic; documents a-f) alongside New Catalog (ICs: Cat1, Cat2; documents w, x, y, z)]
  4. Taxonomy Integration (2)
     - After integration:
     - [Figure: merged catalog (ICs: DSP, Mem., Logic) containing documents a-f together with w, x, y, z]
  5. Goal
     - Use the affinity information in the new catalog:
       - Products in the same category are similar.
     - The accuracy boost depends on how well the two categorizations match.
  6. Problem Statement
     - Given:
       - master categorization M: categories C1, C2, ..., Cn, with the set of documents in each category
       - new categorization N: categories S1, S2, ..., Sn, with the set of documents in each category
     - Find the category in M for each document in N.
       - Standard algorithm: estimate Pr(Ci | d)
       - Enhanced algorithm: estimate Pr(Ci | d, S)
  7. Naive Bayes Classifier
     - Estimate the probability of document d belonging to class Ci:
       Pr(Ci | d) = Pr(Ci) * Pr(d | Ci) / Pr(d), with Pr(d | Ci) = product over words w in d of Pr(w | Ci)
     - where Pr(Ci) and Pr(w | Ci) are estimated from class frequencies and (smoothed) word counts in the training documents. (The slide's formula images were lost; the above is the standard multinomial Naive Bayes estimate.)
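The classifier on this slide can be sketched in a few lines. This is a minimal illustration of standard multinomial Naive Bayes with Laplace smoothing, not the talk's actual implementation; the function names and toy data are invented for the example.

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """docs_by_class: {class_name: [token lists]}.
    Returns per-class (log-prior, smoothed log-likelihoods) and the vocabulary."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        n = sum(counts.values())
        # Laplace smoothing so unseen words do not zero out the product
        loglik = {w: math.log((counts[w] + 1) / (n + len(vocab))) for w in vocab}
        model[c] = (math.log(len(docs) / total_docs), loglik)
    return model, vocab

def classify(model, vocab, doc):
    """Return the class Ci maximizing log Pr(Ci) + sum over w in d of log Pr(w | Ci)."""
    def score(c):
        logprior, loglik = model[c]
        return logprior + sum(loglik[w] for w in doc if w in vocab)
    return max(model, key=score)
```

Working in log space avoids underflow when documents are long; that is the usual implementation choice, not something the slide specifies.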
  8. Enhanced Naive Bayes
     - Standard: Pr(Ci | d)
     - Enhanced: Pr(Ci | d, S), using the new-taxonomy category S as additional evidence
     - How do we estimate Pr(Ci | S)?
       - Apply standard Naive Bayes to get the number of documents in S that are classified into Ci.
       - Incorporate a weight w reflecting the match between the two taxonomies.
         - Only affects the classification of borderline documents.
       - For w = 0, this defaults to the standard classifier.
  9. Enhanced Naive Bayes (2)
     - Use a tuning set to determine w.
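The weighting idea from the two slides above can be sketched as follows. The exact estimate of Pr(Ci | S) is in the WWW 2001 paper and its formula image was lost here, so the blending below (a smoothed category-affinity prior raised to weight w in log space) is only an assumed stand-in that preserves the stated behavior: w = 0 recovers the standard classifier, and larger w lets S sway borderline documents.

```python
import math

def enhanced_scores(std_log_scores, category_counts, w):
    """Blend per-document Naive Bayes log-scores with affinity evidence from
    the new-taxonomy category S.

    std_log_scores: {Ci: log-score of d under the standard classifier}.
    category_counts: {Ci: number of documents of S classified into Ci}.
    w: taxonomy-match weight; w = 0 leaves the standard scores unchanged.
    """
    total = sum(category_counts.values())
    out = {}
    for c, s in std_log_scores.items():
        # smoothed fraction of S's documents landing in Ci (assumed estimator)
        prior = (category_counts.get(c, 0) + 1) / (total + len(std_log_scores))
        out[c] = s + w * math.log(prior)
    return out
```

With a near-tie between classes, the category evidence flips the decision only when w is large enough, which matches the "borderline documents" remark on the slide.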
  10. Intuition behind the Algorithm
     - [Figure: decision regions of the standard algorithm vs. the enhanced algorithm]
  11. Electronic Parts Dataset
     - 1150 categories; 37,000 documents.
  12. Yahoo & OpenDirectory
     - 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software.
       - Typical match: 69%, 15%, 3%, 3%, 1%, ...
     - Merging Yahoo into OpenDirectory:
       - 30% fewer errors (14.1% absolute difference in accuracy)
     - Merging OpenDirectory into Yahoo:
       - 26% fewer errors (14.3% absolute difference)
  13. Summary
     - New algorithm for taxonomy integration.
       - Exploits the affinity information in the new (source) taxonomy's categorizations.
       - Can do substantially better than standard Naive Bayes, and never does significantly worse.
     - Open problems: SVM, decision trees, ...
  14. Talk Outline
     - Taxonomy Integration
     - Searching with Numbers (WWW 2002, with R. Agrawal)
     - Privacy-Preserving Data Mining
  15. Motivation
     - A large fraction of the useful web consists of specification documents:
       - <attribute name, value> pairs embedded in text.
     - Examples:
       - Data sheets for electronic parts.
       - Classified ads.
       - Product catalogs.
  16. Search Engines Treat Numbers as Strings
     - Search for 6798.32 (lunar nutation cycle):
       - Returns 2 pages on Google.
       - However, a search for 6798.320 yielded no pages on Google (or any other search engine).
     - Current search technology is inadequate for retrieving specification documents.
  17. Data Extraction Is Hard
     - Synonyms for attribute names and units.
       - "lb" and "pounds", but no "lbs" or "pound".
     - Attribute names are often missing.
       - No "Speed", just "MHz Pentium III".
       - No "Memory", just "MB SDRAM".
     - Example listing:
       - 850 MHz Intel Pentium III
       - 192 MB RAM
       - 15 GB Hard Disk
       - DVD Recorder: Included
       - Windows Me
       - 14.1 inch display
       - 8.0 pounds
  18. Searching with Numbers
     - [Figure: documents such as "IBM ThinkPad: 750 MHz Pentium 3, 196 MB DRAM, ..." and "Dell Computer: 700 MHz Celeron, 256 MB SDRAM, ..." are reduced to number sets in a database, e.g. IBM ThinkPad (750 MHz, 196 MB) and Dell (700 MHz, 256 MB); number-only queries such as "800 200" or "3 lb" are matched against them]
  19. Reflectivity
     - If we get a close match on the numbers, how likely is it that we have correctly matched the attribute names?
       - That likelihood corresponds to the non-reflectivity of the data.
     - Non-overlapping attributes imply the data is non-reflective.
       - Memory: 64-512 MB, Disk: 10-40 GB
     - Correlations or clustering imply low reflectivity.
       - Memory: 64-512 MB, Disk: 10-100 GB
  20. Reflectivity: Examples
  21. Reflectivity: Definition
     - Let:
       - D: dataset
       - ni: co-ordinates of point xi
       - reflections(xi): permutations of ni
       - the number of data points within distance r of ni
       - the number of reflections within distance r of ni
  22. Algorithm
     - How do we compute the match score (rank) of a document for a given query?
     - How do we limit the number of documents for which the match score is computed?
  23. Match Score of a Document
     - Select the k numbers from D yielding the minimum distance between Q and D.
     - Compute a relative distance for each term.
     - Combine the term distances with an L_p norm (Euclidean distance for p = 2).
  24. Bipartite Graph Matching
     - Map the problem to bipartite graph matching:
       - k source nodes, corresponding to the query numbers
       - m target nodes, corresponding to the document numbers
       - An edge from each source to its k nearest targets, with weight f(qi, nj)^p on edge (qi, nj).
     - [Figure: query numbers 20 and 60 matched against document numbers 10, 25, 75, with edge weights .5, .25, .58, .25]
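A toy version of the match-score computation described on the last two slides. The slide's relative-distance formula was an image, so `rel_dist` below is an assumed stand-in (relative difference normalized by the larger magnitude), and the minimum-cost assignment is found by brute force over permutations rather than a bipartite-matching algorithm; both choices are for illustration only.

```python
import itertools

def rel_dist(q, n):
    # Assumed relative distance between query number q and document number n;
    # the talk's exact formula f(q, n) was not preserved in this transcript.
    return abs(q - n) / max(abs(q), abs(n), 1e-9)

def match_score(query, doc_numbers, p=2):
    """Minimum combined L_p distance over all ways of matching each query
    number to a distinct document number (brute force; fine for small k)."""
    best = float("inf")
    for combo in itertools.permutations(doc_numbers, len(query)):
        d = sum(rel_dist(q, n) ** p for q, n in zip(query, combo)) ** (1 / p)
        best = min(best, d)
    return best
```

For the slide's ThinkPad example, querying its own numbers scores a perfect 0, while a nearby query scores small but nonzero; real implementations would use the k-nearest-edge bipartite graph instead of enumerating permutations.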
  25. Limiting the Set of Documents
     - Similar to the score aggregation problem [Fagin, PODS '96].
     - The proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS '01].
  26. Limiting the Set of Documents (2)
     - k conceptual sorted lists, one for each query term.
     - Do round-robin access to the lists. For each document found, compute its distance F(D, Q).
     - Let ni := the number last looked at for query term qi.
     - Let tau := the combined distance F((n1, ..., nk), Q).
     - Halt when t documents have been found whose distance <= tau.
       - tau is a lower bound on the distance of any unseen document.
  27. Empirical Results
  28. Empirical Results (2)
     - Screen shot.
  29. Incorporating Hints
     - Use simple data extraction techniques to get hints.
     - Names/units in the query are matched against the hints.
     - Example: "256 MB SDRAM memory" yields unit hint "MB" and attribute hints "SDRAM", "memory".
  30. Summary
     - Allows querying using only numbers, or numbers + hints.
     - Data can come from raw text (e.g. product descriptions) or databases.
     - An end run around data extraction:
       - Use a simple extractor to generate hints.
     - Open problems: integration with keyword search.
  31. Talk Outline
     - Taxonomy Integration
     - Searching with Numbers
     - Privacy-Preserving Data Mining
       - Motivation
       - Classification
       - Associations
  32. Growing Privacy Concerns
     - Popular press:
       - Economist: "The End of Privacy" (May '99)
       - Time: "The Death of Privacy" (Aug '97)
     - Government legislation:
       - European directive on privacy protection (Oct '98)
       - Canadian Personal Information Protection Act (Jan 2001)
     - Special issue on internet privacy, CACM, Feb '99.
     - S. Garfinkel, "Database Nation: The Death of Privacy in the 21st Century", O'Reilly, Jan 2000.
  33. Privacy Concerns (2)
     - Surveys of web users:
       - 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned ("Understanding net users' attitude about online privacy", April '99).
       - 82% said having a privacy policy would matter ("Freebies & Privacy: What net users think", July '99).
  34. Technical Question
     - Fear:
       - "Join" (record overlay) was the original sin.
       - Is data mining a new, more powerful adversary?
     - The primary task in data mining is the development of models about aggregated data.
     - Can we develop accurate models without access to the precise information in individual data records?
  35. Talk Outline
     - Taxonomy Integration
     - Searching with Numbers
     - Privacy-Preserving Data Mining
       - Motivation
       - Private Information Retrieval
       - Classification (SIGMOD 2000, with R. Agrawal)
       - Associations
  36. Web Demographics
     - The Volvo S40 website targets people in their 20s.
       - Are its visitors in their 20s or 40s?
       - Which demographic groups like/dislike the website?
  37. Solution Overview
     - [Figure: each client passes its record (e.g. age 30, salary 70K) through a randomizer, which emits a perturbed record (e.g. age 65, salary 20K); the server reconstructs the distributions of Age and Salary from the randomized records, and the data mining algorithms build the model from those distributions]
  38. Reconstruction Problem
     - Original values x1, x2, ..., xn
       - drawn from a probability distribution X (unknown).
     - To hide these values, we add y1, y2, ..., yn
       - drawn from a probability distribution Y.
     - Given:
       - x1+y1, x2+y2, ..., xn+yn
       - the probability distribution of Y
     - Estimate the probability distribution of X.
  39. Intuition (Reconstruct a single point)
     - Use Bayes' rule for density functions.
  40. Intuition (Reconstruct a single point) (2)
     - Use Bayes' rule for density functions.
  41. Reconstructing the Distribution
     - Combine the estimates of where each point came from, over all points:
       - This gives an estimate of the original distribution.
  42. Reconstruction: Bootstrapping
     - fX^0 := uniform distribution
     - j := 0  // iteration number
     - repeat
       - fX^{j+1}(a) := (1/n) * sum over i of [ fY(wi - a) * fX^j(a) / integral over z of fY(wi - z) * fX^j(z) dz ]   (Bayes' rule, with wi = xi + yi)
       - j := j + 1
     - until (stopping criterion met)
     - Converges to the maximum likelihood estimate.
       - D. Agrawal & C. C. Aggarwal, PODS 2001.
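The bootstrapping loop above can be sketched with a discretized support. This is an illustration, not the paper's implementation: X is restricted to a small grid of candidate values, the integral becomes a sum over the grid, and normalization replaces the explicit 1/n factor.

```python
def reconstruct(w, noise_pdf, grid, iters=50):
    """Iteratively re-estimate the distribution of X from observations
    w_i = x_i + y_i and the known noise density f_Y.

    w: list of randomized values.
    noise_pdf: f_Y, callable on a real number.
    grid: discretized candidate values for X.
    Returns estimated probabilities over grid."""
    fX = [1.0 / len(grid)] * len(grid)            # fX^0: uniform
    for _ in range(iters):
        new = [0.0] * len(grid)
        for wi in w:
            # posterior over grid for this observation (Bayes' rule)
            denom = sum(noise_pdf(wi - a) * fa for a, fa in zip(grid, fX))
            if denom == 0:
                continue
            for k, (a, fa) in enumerate(zip(grid, fX)):
                new[k] += noise_pdf(wi - a) * fa / denom
        s = sum(new)
        fX = [v / s for v in new]                 # normalize (the 1/n factor)
    return fX
```

With uniform U[-1, 1] noise and true values clustered at 0 and 10, the estimate recovers the 25%/75% split even though no individual x_i can be pinned down beyond its +/-1 window, which is exactly the privacy argument of the next slide.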
  43. Seems to work well!
  44. Recap: Why is privacy preserved?
     - Individual values cannot be reconstructed accurately.
     - Only distributions can be reconstructed.
  45. Talk Outline
     - Taxonomy Integration
     - Searching with Numbers
     - Privacy-Preserving Data Mining
       - Motivation
       - Private Information Retrieval
       - Classification
       - Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)
  46. Association Rules
     - Given:
       - a set of transactions,
       - where each transaction is a set of items.
     - Association rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain all three items.
       - 30%: confidence of the rule.
       - 5%: support of the rule.
     - Find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
     - Can be used to generate recommendations.
  47. Recommendations Overview
     - [Figure: Alice and Bob send randomized transactions (e.g. "Book 5, Book 25" and "Book 1, Book 11, Book 21" in place of "Book 3, Book 25" and "Book 1, Book 7, Book 21") to the recommendation service, which performs support recovery to mine associations and returns recommendations]
  48. Private Information Retrieval
     - Retrieve 1 of n documents from a digital library without the library learning which document was retrieved.
     - Trivial solution: download the entire library.
     - Can you do better?
       - Yes, with multiple servers.
       - Yes, with a single server and computational privacy.
     - Problem introduced in [Chor et al., FOCS '95].
  49. Uniform Randomization
     - Given a transaction:
       - keep each item with 20% probability,
       - replace it with a new random item with 80% probability.
     - Appears to give around 80% privacy...
       - 80% chance that an item in the randomized transaction was not in the original transaction.
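The scheme on this slide is a one-liner. A minimal sketch, assuming the replacement item is drawn uniformly from the full item catalog (the slide does not specify the replacement distribution beyond "a new random item"):

```python
import random

def uniform_randomize(transaction, items, keep_p=0.2, rng=random):
    """Per-item randomization: keep each item with probability keep_p,
    otherwise replace it with a random item from the catalog."""
    return [it if rng.random() < keep_p else rng.choice(items)
            for it in transaction]
```

Passing an explicit `random.Random(seed)` as `rng` makes runs reproducible; as the next slide shows, this per-item scheme is weaker than the "80% on average" figure suggests.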
  50. Privacy Breach Example
     - 80% privacy "on average," but not for all items!
     - 10M transactions of size 3 over 1000 items:
       - 100,000 (1%) contain {x, y, z}: all three items survive randomization with probability 0.2^3 = 0.008, i.e. about 800 transactions.
       - 9,900,000 (99%) contain zero items from {x, y, z}: such a transaction randomizes into {x, y, z} with probability 6 * (0.8/1000)^3 = 3 * 10^-9, i.e. about 0.03 transactions (<< 1).
     - So 99.99% of the randomized transactions containing {x, y, z} truly contained it, versus 0.01% that did not.
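The arithmetic behind the breach can be checked directly. The factor 6 = 3! counts the ways the three replaced items of a size-3 transaction can land on x, y, z, each with probability 0.8/1000:

```python
# Expected counts for 10M transactions of size 3 over 1000 items,
# with 1% of transactions containing {x, y, z}.
keep = 0.2 ** 3                       # all three true items survive
fake = 6 * (0.8 / 1000) ** 3          # {x, y, z} appears purely by chance
true_positives = 100_000 * keep       # ~800 randomized transactions
false_positives = 9_900_000 * fake    # ~0.03 randomized transactions
# If the miner sees {x, y, z} after randomization, how likely was it real?
breach = true_positives / (true_positives + false_positives)
```

So observing {x, y, z} in a randomized transaction reveals, with near certainty, that the original transaction contained it; the "80% on average" guarantee says nothing about this conditional probability.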
  51. Solution
     - "Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?" "He grows a forest to hide it in." (G. K. Chesterton)
     - Insert many false items into each transaction.
     - Hide the true itemsets among the false ones.
     - No free lunch: more transactions are needed to discover the associations.
  52. Related Work
     - S. Rizvi, J. Haritsa, "Privacy-Preserving Association Rule Mining", VLDB 2002.
     - Protecting privacy across databases:
       - Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining", Crypto 2000.
       - J. Vaidya and C. W. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data", KDD 2002.
  53. Summary
     - Have your cake and mine it too!
       - Preserve privacy at the individual level, but still build accurate models.
       - Works for both classification and association rules.
     - Open problems: clustering, lower bounds on discoverability versus privacy, faster algorithms, ...
  54. Slides available from ...
  55. Backup
  56. Lowest Discoverable Support
     - LDS is the support such that, when predicted, it is 4 standard deviations away from zero.
     - Roughly, LDS is proportional to: [formula]
     - [Chart: |t| = 5, with a 50% parameter setting]
  57. LDS vs. Breach Level
     - [Chart: |t| = 5, |T| = 5M]
  58. Basic 2-Server Scheme
     - Each server returns the XOR of the (green) database bits selected by its query set.
     - The client XORs the bits returned by the servers.
     - Communication complexity: O(n).
     - [Figure: database bits 1-8, with each server's selected positions highlighted]
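The classic 2-server scheme sketched above fits in a few lines: the client picks a random subset S1 of indexes, sends S1 to one server and S1 xor {i} to the other, and XORs the two answers; the sets differ only at position i, so everything else cancels. A minimal bit-level sketch (function names are illustrative):

```python
import secrets

def pir_query(n, i):
    """Client: random set S1, and S2 differing from S1 exactly at index i.
    Each set alone is a uniformly random subset, so neither server learns i."""
    s1 = {j for j in range(n) if secrets.randbits(1)}
    s2 = s1 ^ {i}            # symmetric difference: toggle membership of i
    return s1, s2

def pir_answer(db_bits, s):
    """Server: XOR of the database bits indexed by the query set."""
    out = 0
    for j in s:
        out ^= db_bits[j]
    return out

def pir_reconstruct(a1, a2):
    """Client: the shared bits cancel, leaving db_bits[i]."""
    return a1 ^ a2
```

Sending a set of up to n indexes to each server is what gives the O(n) communication complexity; the sqrt(n) scheme on the next slide reduces this by querying whole blocks.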
  59. Sqrt(n) Algorithm
     - Each server returns the bit-wise XOR of the specified blocks.
     - The client XORs the two blocks and selects the desired bits.
     - Each block has sqrt(n) elements => 4*sqrt(n) communication complexity.
     - Server computation time is still O(n).
     - [Figure: database bits 1-8 arranged into blocks]
  60. Computationally Private IR
     - Use a pseudo-random function plus a mask to generate the sets.
     - Quadratic residuosity.
     - Difficulty of deciding whether a small prime divides phi(m):
       - m: composite integer of unknown factorization.
       - phi(m): Euler totient function, i.e. the number of positive integers <= m that are relatively prime to m.
  61. Extensions
     - Retrieve documents (blocks), not bits.
       - If n <= l, communication complexity 4l.
       - If n <= l^2/4, communication complexity 8l.
     - Lower communication complexity.
     - Select documents using keywords.
     - Protect data privacy.
     - Preprocessing to reduce computation time.
     - Computationally-private information retrieval with a single server.
  62. Potential Privacy Breaches
     - The distribution is a spike.
       - Example: everyone is of age 40.
     - Some randomized values are only possible from a given range.
       - Example: add U[-50, +50] to age and get 125 => the true age is >= 75.
       - Not an issue with Gaussian noise.
  63. Potential Privacy Breaches (2)
     - Most randomized values in a given interval come from a given interval.
       - Example: 60% of the people whose randomized value is in [120, 130] have their true age in [70, 80].
       - Implication: higher levels of randomization will be required.
     - Correlations can make the previous effect worse.
       - Example: 80% of the people whose randomized value of age is in [120, 130] and whose randomized value of income is in [...] have their true age in [70, 80].
  64. Work in Statistical Databases
     - Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82).
     - Techniques:
       - Query restriction
       - Data perturbation
     - Negative result: one cannot give high-quality statistics and simultaneously prevent partial disclosure of individual information [AW89].
  65. Statistical Databases: Techniques
     - Query restriction:
       - restrict the size of the query result (e.g. FEL72, DDS79)
       - control overlap among successive queries (e.g. DJL79)
       - suppress small data cells (e.g. CO82)
     - Output perturbation:
       - sample the result of the query (e.g. Den80)
       - add noise to the query result (e.g. Bec80)
     - Data perturbation:
       - replace the db with a sample (e.g. LST83, LCL85, Rei84)
       - swap values between records (e.g. Den82)
       - add noise to the values (e.g. TYW84, War65)
  66. Statistical Databases: Comparison
     - We do not assume the original data is aggregated into a single database.
     - Concept of reconstructing the original distribution.
       - Adding noise to data values is problematic without such reconstruction.