Sampling Attacks Against Hidden Databases


    1. LEVERAGING COUNT INFORMATION IN SAMPLING HIDDEN DATABASES
        Presenter: Nan Zhang, George Washington Univ.
        Joint work with Arjun Dasgupta and Gautam Das, The University of Texas at Arlington
    2. OUTLINE
        • Introduction
        • Baseline Algorithm
        • COUNT-DECISION-TREE
        • ALERT-HYBRID
        • Experimental Results
        • Related Work
        • Conclusion
    3. THE DEEP WEB
        • Deep Web vs. Surface Web
            • Dynamic contents, unlinked pages, private web, contextual web, etc.
            • Estimated size [1]: 91,850 vs. 167 terabytes
        [1] SIMS, UC Berkeley, How much information? 2003 http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
    4. HIDDEN DATABASES
        • Form-like interface
        • Returns top-k tuples
    5. SAMPLING HIDDEN DATABASES THROUGH PUBLIC INTERFACES
        • Problem definition
            • Given such restricted query interfaces, how can one efficiently obtain a uniform random sample of the backend database by accessing it only via the public front-end interface?
        • Applications
            • In which geographic area, and for which industry, do MSN Careers’ job sources have especially low presence?
            • Which flight on which date is more likely to be relatively empty?
            • What is the real size of the hidden database?
    6. PERFORMANCE MEASURES OF SAMPLING HIDDEN DATABASES
        • Sample bias
            • Over- or under-representing a portion of the population
            • Objective: minimize sample bias
        • Efficiency (query cost)
            • The number of queries issued to the web interface of a hidden database
            • Note: many hidden databases charge for each issued query or limit the number of queries one can issue per day.
            • Objective: minimize query cost
    7. TWO TYPES OF HIDDEN DATABASE INTERFACE: TOP-k-ALERT
        The two types differ in how overflowing query results are displayed.
        • Top-k-ALERT displays an overflow flag only.
        • Example result banner: "Showing 481-500 of more than 500 results"
    8. TWO TYPES OF HIDDEN DATABASE INTERFACE: TOP-k-COUNT
        The two types differ in how overflowing query results are displayed.
        • Top-k-COUNT displays the real COUNT.
        • Example result banner: "Showing 481-500 of 15,167 results"
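The difference between the two interface types can be illustrated with a small simulator. This is a minimal sketch, not a model of any real site: the `HiddenDB` class, the `Result` fields, and the four toy tuples are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, ...]


@dataclass
class Result:
    rows: List[Row]        # at most k matching tuples
    overflow: bool         # True when more than k tuples match
    count: Optional[int]   # exact number of matches, Top-k-COUNT only


class HiddenDB:
    """Toy stand-in for a form-based interface (hypothetical, for illustration)."""

    def __init__(self, rows: List[Row], k: int, reveal_count: bool):
        self.rows = rows
        self.k = k
        self.reveal_count = reveal_count

    def query(self, predicate: Dict[int, int]) -> Result:
        # predicate maps attribute index -> required value (conjunctive query)
        matches = [r for r in self.rows
                   if all(r[a] == v for a, v in predicate.items())]
        return Result(rows=matches[:self.k],
                      overflow=len(matches) > self.k,
                      count=len(matches) if self.reveal_count else None)


# Toy data: four tuples over three Boolean attributes (an assumption).
ROWS = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]
alert_db = HiddenDB(ROWS, k=1, reveal_count=False)  # Top-k-ALERT
count_db = HiddenDB(ROWS, k=1, reveal_count=True)   # Top-k-COUNT
```

Querying `alert_db` with `{0: 0}` only reveals that the result overflows, whereas the same query against `count_db` also reports that three tuples match.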
    9. OUTLINE OF TECHNICAL RESULTS
        • Existing work
            • HIDDEN-DB-SAMPLER [DDM07]
        • Our results
            • COUNT-DECISION-TREE
                • An efficient unbiased sampling algorithm for Top-k-COUNT interfaces
            • ALERT-HYBRID
                • An efficient sampling algorithm with slight bias for Top-k-ALERT interfaces
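    11. A RUNNING EXAMPLE
        Four tuples over the Boolean attributes A1, A2, A3:

              A1  A2  A3
          t1   0   0   1
          t2   0   1   0
          t3   0   1   1
          t4   1   1   0

        [Figure: binary query tree with leaves 000 through 111; t1, t2, t3, and t4 occupy leaves 001, 010, 011, and 110.]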
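    12. BASELINE ALGORITHM: COUNT-ORDER
        [Figure: query tree over the attribute order A1, A2, A3. Nodes are conjunctive queries (A1 = 0; A1 = 1; A1 = 0 & A2 = 0; A1 = 0 & A2 = 1; A1 = 0 & A2 = 0 & A3 = 0; A1 = 0 & A2 = 1 & A3 = 1), each marked valid, underflow, or overflow.]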
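    13. BASELINE ALGORITHM: COUNT-ORDER
        [Figure: random walk down the query tree over A1, A2, A3 (leaves 000 through 111), with nodes annotated by their COUNT values. Branches are followed with probabilities 3/4, 2/3, and 1/2, so the tuple is selected with probability 3/4 * 2/3 * 1/2 = 1/4.]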
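    14. BASELINE ALGORITHM: COUNT-ORDER
        [Figure: a shorter random walk down the same query tree, with nodes annotated by their COUNT values. Branches are followed with probabilities 3/4 and 1/3, so the tuple is selected with probability 3/4 * 1/3 = 1/4.]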
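The arithmetic on the COUNT-ORDER slides can be checked with a short sketch of a COUNT-weighted drill-down over the four-tuple running example. The function names and the prefix-based query encoding are assumptions made for illustration; the sketch assumes k = 1 and the fixed attribute order A1, A2, A3.

```python
from fractions import Fraction
import random

# Running example: four tuples over Boolean attributes A1, A2, A3.
ROWS = {"t1": (0, 0, 1), "t2": (0, 1, 0), "t3": (0, 1, 1), "t4": (1, 1, 0)}
K = 1  # top-k limit of the interface (assumed)


def count(prefix):
    """COUNT returned for the query A1 = prefix[0] & A2 = prefix[1] & ..."""
    return sum(1 for r in ROWS.values() if r[:len(prefix)] == prefix)


def walk(rng):
    """One drill-down. Returns (tuple name, probability of selecting it)."""
    prefix, prob = (), Fraction(1)
    while count(prefix) > K:                      # overflow: drill further
        parent = count(prefix)
        weights = [count(prefix + (v,)) for v in (0, 1)]
        v = rng.choices((0, 1), weights=weights)[0]
        prob *= Fraction(weights[v], parent)      # branch chosen in proportion to COUNT
        prefix += (v,)
    matches = [n for n, r in ROWS.items() if r[:len(prefix)] == prefix]
    return rng.choice(matches), prob / len(matches)


rng = random.Random(0)
samples = [walk(rng) for _ in range(20)]
```

Every walk reports a selection probability of 1/4 regardless of which tuple it reaches: for example, t1 is reached with 3/4 * 1/3 and t3 with 3/4 * 2/3 * 1/2.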
    16. COUNT-DECISION-TREE
        • First result of the paper
        • Two main ideas for improving the efficiency of sampling
            • Utilizing query history
            • Attribute order tree → Decision tree
    17. UTILIZING QUERY HISTORY: BASIC IDEA
        [Figure: query tree over A1, A2, A3 with leaves 000 through 111.]
    18. UTILIZING QUERY HISTORY: SAVING
        • Saving from query history is significant
            • When minimum domain size is b, saving for collecting s samples is:
            • Expected saving:
                • 5,000 samples from a 100,000-tuple i.i.d. Boolean database with uniform distribution: 49.90% (83,048 → 41,610)
                • b = 5: 62.62% (143,067 → 53,317)
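The idea of reusing query history can be sketched as a cache keyed by query, so that a query answered during an earlier walk costs nothing when a later walk reaches the same node. This is an illustrative sketch on the four-tuple running example, not the authors' implementation; `CountingInterface` and `collect` are hypothetical names.

```python
import random

ROWS = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0)]


class CountingInterface:
    """Toy Top-k-COUNT interface that tallies how many queries are issued."""

    def __init__(self, rows, use_history):
        self.rows = rows
        self.use_history = use_history
        self.history = {}   # query prefix -> COUNT seen earlier
        self.issued = 0     # queries actually sent to the interface

    def count(self, prefix):
        if self.use_history and prefix in self.history:
            return self.history[prefix]          # answered from history, no cost
        self.issued += 1
        c = sum(1 for r in self.rows if r[:len(prefix)] == prefix)
        self.history[prefix] = c
        return c


def collect(db, samples, seed):
    """Run `samples` COUNT-weighted drill-downs with k = 1."""
    rng = random.Random(seed)
    for _ in range(samples):
        prefix = ()
        c = db.count(prefix)
        while c > 1:
            counts = [db.count(prefix + (v,)) for v in (0, 1)]
            v = rng.choices((0, 1), weights=counts)[0]
            prefix, c = prefix + (v,), counts[v]


plain = CountingInterface(ROWS, use_history=False)
cached = CountingInterface(ROWS, use_history=True)
collect(plain, samples=100, seed=7)
collect(cached, samples=100, seed=7)
print(plain.issued, cached.issued)
```

With history enabled, the number of issued queries is bounded by the number of distinct nodes the walks can visit, while the history-free variant pays for every node on every walk.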
    19. ATTRIBUTE ORDER TREE → DECISION TREE: BASIC IDEA
        Note: not to be confused with a decision tree for classification.

              A1  A2  A3
          t1   0   0   1
          t2   0   1   0
          t3   0   1   1
          t4   1   1   0

        [Figure, attribute order tree: fixed order A1, A2, A3 with leaves 000 through 111 holding t1, t2, t3, t4.]
        [Figure, decision tree: root A3. Branch A3 = 0 splits on A1 into t2 (A1 = 0) and t4 (A1 = 1); branch A3 = 1 splits on A2 into t1 (A2 = 0) and t3 (A2 = 1).]
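The benefit of letting each node choose its own attribute can be seen by comparing drill-down depths on the running example. The sketch below assumes k = 1 and uses depth as a rough proxy for query cost; `fixed_order`, `decision_tree`, and `depth` are hypothetical helper names.

```python
from fractions import Fraction

ROWS = {"t1": (0, 0, 1), "t2": (0, 1, 0), "t3": (0, 1, 1), "t4": (1, 1, 0)}


def fixed_order(pred):
    """Attribute order tree: always A1, then A2, then A3 (indices 0, 1, 2)."""
    return len(pred)


def decision_tree(pred):
    """Tree from the slide: root A3; A3 = 0 splits on A1, A3 = 1 splits on A2."""
    if 2 not in pred:
        return 2
    second = 0 if pred[2] == 0 else 1
    if second not in pred:
        return second
    return next(a for a in range(3) if a not in pred)


def matching(pred):
    return sum(all(r[a] == v for a, v in pred.items()) for r in ROWS.values())


def depth(row, choose, k=1):
    """Number of attributes that must be fixed before the query stops overflowing."""
    pred = {}
    while matching(pred) > k:
        a = choose(pred)
        pred[a] = row[a]
    return len(pred)


fixed = {n: depth(r, fixed_order) for n, r in ROWS.items()}
tree = {n: depth(r, decision_tree) for n, r in ROWS.items()}
avg_fixed = Fraction(sum(fixed.values()), len(ROWS))
avg_tree = Fraction(sum(tree.values()), len(ROWS))
```

Under the fixed order, t2 and t3 need three levels and the average depth is 9/4; under the decision tree every tuple is isolated after two levels.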
    20. CONSTRUCTING AN OPTIMAL DECISION TREE: TWO MAIN CHALLENGES
        • The problem is hard even if one has access to the entire database
            • When k = 1, to collect s = 1 sample, constructing an optimal decision tree over a Boolean DB is equivalent to the well-known NP-hard problem of constructing an optimal decision tree for entity identification in the database.
        • Furthermore, without knowledge of the database, construction must be done on-the-fly
            • Note: tree construction costs queries too!
    21. INTUITION OF A HEURISTIC ALGORITHM
        The decision-tree construction algorithm must consider query history.
        Consider the number of unique queries required to acquire an infinite number of samples.
        • Observation I: m - 1 = 7 queries are required for Trees A, B, and C (m = 8 is the number of tuples).
        • Observation II: empty leaves in Tree D lead to more required queries.
    22. DECISION TREES FOR COLLECTING A FINITE NUMBER OF TUPLES
        • Loss: empty leaves
        • Saving: since the number of samples to be collected is finite, not all m - 1 queries need to be issued.
        • Constructing an optimal decision tree
            • Given the number of samples to be collected, maximize (Saving - Loss)
        [Figure: tree annotated with regions of possible saving and possible loss.]
    23. A GREEDY HEURISTIC ALGORITHM
        • Saving:
        • Loss:
        • Expense:
        • Net Saving Per Expense:
        if then total cost <=
    24. COMPUTATION OF SER
        • How to compute branch COUNTs (i.e., |u_j|) for all candidates of a node?
            • Exact computation would defeat the purpose of minimizing cost
            • Fortunately, in many cases a rough estimate is enough, e.g., fanout of 2 vs. 10:
            • Unfortunately, a uniform assumption does not suffice
            • Proposed solution: issue a small number of marginal queries first, then apply a conditional independence assumption
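One way to read the proposed solution is that a candidate branch COUNT is estimated by scaling the node's COUNT with the attribute's marginal distribution, instead of issuing a query per branch. The sketch below shows that reading only; the function name, the marginal counts, and the database size are hypothetical, and the slide does not give the exact estimator.

```python
from typing import Dict


def estimate_branch_counts(node_count: float,
                           marginals: Dict[str, Dict[str, int]],
                           total: int) -> Dict[str, Dict[str, float]]:
    """Estimate |u_j| for every candidate attribute at a node.

    Assumes each candidate attribute is independent of the predicates already
    fixed at the node, so the marginal proportions carry over unchanged.
    """
    return {attr: {value: node_count * c / total for value, c in values.items()}
            for attr, values in marginals.items()}


# Hypothetical marginal COUNTs obtained from a few single-attribute queries.
MARGINALS = {
    "Transmission": {"Automatic": 800, "Manual": 200},
    "Price Segment": {"High": 100, "Medium": 500, "Low": 400},
}

# A node whose query returned COUNT = 200 in a database of 1,000 tuples.
estimates = estimate_branch_counts(node_count=200, marginals=MARGINALS, total=1000)
```

The estimates for each attribute sum to the node's COUNT, and the marginal queries are paid for once and reused at every node.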
    25. ALGORITHM COUNT-DECISION-TREE
        [Figure: first walk. The root splits on [Transmission] (Automatic, Manual); lower levels split on [Make] (Honda, Toyota, Nissan, Ford, VW) and [Price Segment] (High, Medium, Low). At each node the attribute that maximizes SER is selected, until a sample is found.]
    26. ALGORITHM COUNT-DECISION-TREE
        [Figure: the same tree over [Transmission], [Make], and [Price Segment] after the first sample is found. Go back to the root and start another walk!]
    27. ALGORITHM COUNT-DECISION-TREE
        [Figure: second walk. Starting again from [Transmission] (Automatic, Manual), the walk selects the attribute that maximizes SER at each new node, splitting on [Price Segment] (High, Medium, Low) and [Make] (Honda, Toyota, Nissan, Ford, VW) until a sample is found. Nodes already queried in the first walk are a saving from history.]
    28. ALGORITHM COUNT-DECISION-TREE
        [Figure: the tree after the second walk, with both explored subtrees over [Transmission], [Price Segment], and [Make]. Go back to the root and start another walk!]
    30. TWO MAIN IDEAS OF ALERT-HYBRID
        • Use a small number of pilot samples to estimate COUNT
            • Motivation: COUNT eliminates bias for sampling, but is unavailable for Top-k-ALERT interfaces.
        • Switch on-the-fly from COUNT-DECISION-TREE to ALERT-ORDER during the drill-down process for collecting the remaining samples
            • Motivation: an inaccurate COUNT may introduce additional bias, so switch when confidence is low
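The two ideas can be sketched together: pilot samples stand in for the missing COUNT, and the walk switches algorithms once too few pilot samples remain under the current query. This is only an illustration of the idea, with a hypothetical threshold and hypothetical pilot data; the slides do not specify the estimator or the switching rule in this form.

```python
from typing import Dict, List

Row = Dict[str, str]


def pilot_count(pilot: List[Row], predicate: Row) -> int:
    """Number of pilot samples matching a conjunctive query (a COUNT estimate up to scale)."""
    return sum(all(row.get(a) == v for a, v in predicate.items()) for row in pilot)


def next_mode(pilot: List[Row], predicate: Row, threshold: int) -> str:
    """Keep using estimated COUNTs while enough pilot samples support them."""
    if pilot_count(pilot, predicate) >= threshold:
        return "COUNT-DECISION-TREE"
    return "ALERT-ORDER"   # estimate too unreliable: fall back to the ALERT algorithm


# Hypothetical pilot samples collected before the main sampling phase.
PILOT = [
    {"Transmission": "Automatic", "Make": "Honda"},
    {"Transmission": "Automatic", "Make": "Toyota"},
    {"Transmission": "Automatic", "Make": "Honda"},
    {"Transmission": "Manual", "Make": "VW"},
]

shallow = next_mode(PILOT, {"Transmission": "Automatic"}, threshold=2)
deep = next_mode(PILOT, {"Transmission": "Manual", "Make": "VW"}, threshold=2)
```

Near the root many pilot samples match, so the estimated COUNTs are used; deeper in the tree the supporting sample shrinks and the walk falls back to ALERT-ORDER.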
    32. EXPERIMENTAL SETUP
        • Synthetic Boolean-iid:
            • 200,000 tuples, 80 attributes, p = 0.25
        • Synthetic Boolean-mixed:
            • 200,000 tuples, 40 independent attributes
            • 5 with uniform distribution; the others have p from 1/160 to 35/160 in steps of 1/160
        • Yahoo! Auto: http://autos.yahoo.com
            • 15,211 tuples, 32 Boolean and 6 categorical attributes
            • Domain size ranges from 5 to 447
        • Census: UCI Data Mining Archive
            • 1990 census data; after removing all attributes with domain size > 100: 12 attributes and 32,561 tuples
            • Domain size ranges from Boolean to 92
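The two synthetic datasets are simple enough to regenerate. The sketch below is one plausible reading of the setup, under two assumptions: p is the probability that an attribute equals 1, and "uniform distribution" means p = 0.5. The function names are hypothetical.

```python
import random
from typing import List, Tuple


def boolean_iid(n: int, m: int, p: float, seed: int = 0) -> List[Tuple[int, ...]]:
    """n tuples with m i.i.d. Boolean attributes, each equal to 1 with probability p."""
    rng = random.Random(seed)
    return [tuple(int(rng.random() < p) for _ in range(m)) for _ in range(n)]


def mixed_probabilities() -> List[float]:
    """5 uniform attributes (p = 0.5, assumed) plus p = 1/160, 2/160, ..., 35/160."""
    return [0.5] * 5 + [i / 160 for i in range(1, 36)]


def boolean_mixed(n: int, seed: int = 0) -> List[Tuple[int, ...]]:
    """n tuples with 40 independent Boolean attributes of differing skew."""
    rng = random.Random(seed)
    ps = mixed_probabilities()
    return [tuple(int(rng.random() < p) for p in ps) for _ in range(n)]


# Scaled-down instances; the slide uses 200,000 tuples for both datasets.
iid_small = boolean_iid(n=1000, m=80, p=0.25)
mixed_small = boolean_mixed(n=1000)
```

The 5 uniform attributes and the 35 skewed ones together give the 40 attributes stated on the slide.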
    33. EFFICIENCY OF COUNT-DECISION-TREE VS. COUNT-ORDER
    34. EFFICIENCY AND BIAS OF ALERT-HYBRID VS. ALERT-ORDER
    35. ILLUSTRATION OF ALERT-HYBRID
    37. RELATED WORK
        • Crawling hidden text databases
            • [BB98, AIG03, NZC05]
        • Extracting data from hidden structured databases
            • [RG01, LES+02, ARP+07]
        • Sampling a search engine’s index using a public interface
            • [BB98, BJ04, BG06, BG07]
        • Sampling hidden structured databases
            • [DDM07]
    39. CONCLUSION
        • Main Technical Contribution 1: COUNT-DECISION-TREE
            • Unbiased sampling algorithm for Top-k-COUNT interfaces
            • Orders of magnitude more efficient than the existing algorithms
        • Main Technical Contribution 2: ALERT-HYBRID
            • Sampling algorithm with slight bias for Top-k-ALERT interfaces
            • Orders of magnitude more efficient, with smaller bias, than the existing algorithms
    40. CONCLUSION
        • Our studies unveil powerful techniques to perform data analytics over hidden databases
            • Hidden database owners may be extremely concerned about the privacy of aggregates over their hidden databases.
        • Open question: how to reveal individual tuples truthfully and efficiently while hiding aggregated views of the data
            • A. Dasgupta, N. Zhang, G. Das, and S. Chaudhuri, Privacy Preservation of Aggregates in Hidden Databases: Why and How? SIGMOD 2009.
    41. THANK YOU
    42. BACKUP SLIDES
    43. EFFICIENCY AND BIAS OF COUNT-DECISION-TREE VS. ALERT-ORDER
    44. TWO TYPES OF HIDDEN DATABASE INTERFACE
        • Top-k-ALERT
            • MSN Stock Screener (k = 24)
                • http://moneycentral.msn.com/investor/finder/customstocks.asp
            • Microsoft Solution Finder (k = 500)
                • https://solutionfinder.microsoft.com/Solutions/SolutionsDirectory.aspx?mode=searchproblem
        • Top-k-COUNT
            • MSN Careers (k = 4,000)
                • http://msn.careerbuilder.com/JobSeeker/Jobs/JobFindAdv.aspx
    45. SETTINGS OF s_1 AND c_S
    46. COUNT-DECISION-TREE VS. ALERT-RANDOM
    47. IMPROVEMENT BY CONSIDERING HISTORY
        Note: almost unrelated to k
    48. IMPROVEMENT BY DECISION TREE
    49. ALGORITHM ALERT-HYBRID
        [Figure: a walk that starts with COUNT-DECISION-TREE, selecting the attribute that maximizes SER at each node using estimated counts (|Automatic|, |Manual| under [Transmission]; |High|, |Medium|, |Low| under [Price Segment]). Once the count falls below THRESHOLD, ALERT-ORDER starts and continues over [Make] (VW) and [State] (TX, VA, NY, CA). Go back to the root and start another walk!]
    50. ALGORITHM ALERT-HYBRID
        [Figure: a later walk. COUNT-DECISION-TREE selects the attribute that maximizes SER using estimated counts (|Automatic|, |Manual| under [Transmission]; |Honda|, |Toyota|, |VW|, |Nissan| under [Make]; |High|, |Medium|, |Low| under [Price Segment]). Once the count falls below THRESHOLD, ALERT-ORDER starts and continues over [State] (TX, VA, NY, CA).]
