Sampling Attacks Against Hidden Databases


Transcript

  • 1. LEVERAGING COUNT INFORMATION IN SAMPLING HIDDEN DATABASES Presenter: Nan Zhang, George Washington Univ. Joint work with Arjun Dasgupta and Gautam Das, The University of Texas at Arlington
  • 2. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 3. THE DEEP WEB
    • Deep Web vs Surface Web
      • Dynamic content, unlinked pages, private web, contextual web, etc.
      • Estimated size [1]: 91,850 vs. 167 terabytes
    [1] SIMS, UC Berkeley, How much information? 2003 http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
  • 4. HIDDEN DATABASES
    • Form-like interface
    • Return top-k tuples
  • 5. SAMPLING HIDDEN DATABASES THROUGH PUBLIC INTERFACES
    • Problem definition
      • Given such a restricted query interface, how can one efficiently obtain a uniform random sample of the backend database while only accessing it through the public front-end interface?
    • Applications
      • In which geographic area, and for which industry, do MSN Careers’ job sources have especially low presence?
      • Which flight at which date is more likely to be relatively empty?
      • What is the real size of the hidden database?
  • 6. PERFORMANCE MEASURES OF SAMPLING HIDDEN DATABASES
    • Sample bias
      • Over- or under-representing a portion of the population
      • Objective: minimize sample bias
    • Efficiency (query cost)
      • The number of queries issued to the web interface of a hidden database
      • Note: many hidden databases charge for each issued query or have limits on the number of queries one can issue per day.
      • Objective: minimize query cost
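As a rough illustration of how sample bias could be quantified offline (this metric is an assumption, not taken from the paper), one can compare single-attribute marginals of the sample against those of the full database:

```python
# Hedged sketch: total variation distance between the sample and population
# marginals of one attribute (0 = no bias on that attribute). This assumes
# offline access to the full population, which is only possible in evaluation.
from collections import Counter

def marginal_bias(sample, population, attribute):
    s_freq = Counter(t[attribute] for t in sample)
    p_freq = Counter(t[attribute] for t in population)
    values = set(s_freq) | set(p_freq)
    return 0.5 * sum(abs(s_freq[v] / len(sample) - p_freq[v] / len(population))
                     for v in values)
```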
  • 7. TWO TYPES OF HIDDEN DATABASE INTERFACE: TOP-k-ALERT (classified by how overflowing query results are displayed). A Top-k-ALERT interface displays only an overflow flag, e.g., “Showing 481-500 of more than 500 results”.
  • 8. TWO TYPES OF HIDDEN DATABASE INTERFACE: TOP-k-COUNT (classified by how overflowing query results are displayed). A Top-k-COUNT interface displays the real COUNT, e.g., “Showing 481-500 of 15,167 results”.
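A minimal sketch of the two interface types as described on slides 7 and 8; the class and field names below are illustrative assumptions, not the paper's notation:

```python
# Hedged model of a hidden database behind a top-k web form. A Top-k-ALERT
# interface exposes only the overflow flag; a Top-k-COUNT interface also
# exposes the exact match count.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class QueryResult:
    tuples: List[dict]            # at most k returned tuples
    overflow: bool                # True if more than k tuples matched
    count: Optional[int] = None   # exact COUNT (Top-k-COUNT interfaces only)

class HiddenDB:
    def __init__(self, data: List[dict], k: int, expose_count: bool):
        self.data, self.k, self.expose_count = data, k, expose_count

    def query(self, conditions: Dict[str, object]) -> QueryResult:
        """Conjunctive query: attribute -> required value."""
        matches = [t for t in self.data
                   if all(t[a] == v for a, v in conditions.items())]
        return QueryResult(tuples=matches[:self.k],
                           overflow=len(matches) > self.k,
                           count=len(matches) if self.expose_count else None)
```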
  • 9. OUTLINE OF TECHNICAL RESULTS
    • Existing work
      • HIDDEN-DB-SAMPLER [DDM07]
    • Our results
      • COUNT-DECISION-TREE
        • An efficient unbiased sampling algorithm for Top-k-COUNT interfaces
      • ALERT-HYBRID
        • An efficient sampling algorithm with slight bias for Top-k-ALERT interfaces
  • 10. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 11. A RUNNING EXAMPLE: a Boolean database with attributes A1, A2, A3 and tuples t1 = (0, 0, 1), t2 = (0, 1, 0), t3 = (0, 1, 1), t4 = (1, 1, 0); the leaves of the full query tree correspond to the value combinations 000 through 111.
  • 12. BASELINE ALGORITHM: COUNT-ORDER. [Figure: attribute-order tree over A1, A2, A3 for the running example; each node is a conjunctive query (e.g., A1 = 0, A1 = 0 & A2 = 0, A1 = 0 & A2 = 0 & A3 = 0, A1 = 0 & A2 = 1 & A3 = 1) classified as valid, underflow, or overflow.]
  • 13. BASELINE ALGORITHM: COUNT-ORDER. [Figure: a random walk down the attribute-order tree, choosing each branch with probability proportional to its COUNT (3/4, then 2/3, then 1/2); the reached tuple is selected with probability 3/4 * 2/3 * 1/2 = 1/4.]
  • 14. BASELINE ALGORITHM: COUNT-ORDER. [Figure: another COUNT-ORDER walk on the running example; the branch probabilities 3/4 and 1/3 give the reached tuple a selection probability of 3/4 * 1/3 = 1/4, so every tuple is equally likely.]
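Slides 12-14 only show the COUNT-ORDER walk pictorially; below is a hedged Python sketch of the same drill-down idea, built on the hypothetical HiddenDB/QueryResult model sketched earlier (the paper's exact baseline may differ in details):

```python
import random

def count_order_sample(db, attributes, domains):
    """One random walk of a COUNT-ORDER style sampler on a Top-k-COUNT
    interface: fix the attribute order, choose each branch with probability
    proportional to its COUNT, and pick uniformly among the tuples of the
    first valid (non-overflowing) query. Each tuple then has probability
    1/N of being returned, so the sample is uniform."""
    conditions = {}
    for attr in attributes:
        result = db.query(conditions)
        if not result.overflow:                  # valid query: at most k matches
            return random.choice(result.tuples) if result.tuples else None
        counts = {v: db.query({**conditions, attr: v}).count
                  for v in domains[attr]}
        values = [v for v, c in counts.items() if c > 0]
        weights = [counts[v] for v in values]
        conditions[attr] = random.choices(values, weights=weights)[0]
    result = db.query(conditions)
    return random.choice(result.tuples) if result.tuples else None
```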
  • 15. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 16. COUNT-DECISION-TREE
    • First result of the paper
    • Two main ideas for improving the efficiency of sampling
      • Utilizing query history
      • Attribute order tree → Decision tree
  • 17. UTILIZING QUERY HISTORY: BASIC IDEA. [Figure: the attribute-order tree over A1, A2, A3; queries answered on earlier random walks are remembered, so later walks need not re-issue them.]
  • 18. UTILIZING QUERY HISTORY: SAVING
    • Saving from query history is significant
      • When the minimum domain size is b, the saving for collecting s samples can be quantified analytically
      • Expected saving:
        • 5,000 samples from a 100,000-tuple i.i.d. Boolean database with uniform distribution: 49.90% (83,048 → 41,610 queries)
        • b = 5: 62.62% (143,067 → 53,317 queries)
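One natural way to realize the query-history saving described above is to memoize the interface; the wrapper below is an illustrative sketch (names are assumptions), so that a query answered on an earlier walk is never charged again:

```python
class CachedHiddenDB:
    """Hedged sketch: wrap a hidden-database interface so that repeated
    queries across random walks hit a local cache instead of the live
    interface."""
    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.queries_issued = 0   # only cache misses count toward query cost

    def query(self, conditions):
        key = tuple(sorted(conditions.items()))
        if key not in self.cache:
            self.cache[key] = self.db.query(conditions)
            self.queries_issued += 1
        return self.cache[key]
```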
  • 19. ATTRIBUTE ORDER TREE → DECISION TREE: BASIC IDEA. Note: not to be confused with a decision tree for classification. [Figure: the attribute-order tree always splits in the fixed order A1, A2, A3, while a decision tree may split on a different attribute at each node; for the running example, splitting on A3 at the root and then on A1 (reaching t2, t4) or A2 (reaching t1, t3) identifies every tuple.]
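To make the contrast concrete, a decision tree for sampling can be represented so that each node carries its own split attribute rather than following a global attribute order; the data structure below is an illustrative sketch, not the paper's:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TreeNode:
    """A node of the sampling decision tree: the conjunctive query reaching
    this node plus the attribute this particular node splits on (chosen
    adaptively per node)."""
    conditions: Dict[str, object]
    split_attr: Optional[str] = None
    children: Dict[object, "TreeNode"] = field(default_factory=dict)
```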
  • 20. CONSTRUCTING AN OPTIMAL DECISION TREE: TWO MAIN CHALLENGES
    • Problem is hard even if one has access to the entire database
      • When k = 1, to collect s = 1 sample, the construction of an optimal decision tree over a Boolean DB is equivalent to a well-known NP-hard problem of constructing an optimal decision tree for entity identification in the database.
    • Furthermore, without knowledge of the database, construction must be done on-the-fly
      • Note: Tree construction costs queries too!
  • 21. INTUITION OF A HEURISTIC ALGORITHM
    • The decision-tree construction algorithm must consider query history
    • Consider the number of unique queries required to acquire an infinite number of samples
      • Observation I: m – 1 = 7 queries are required for Trees A, B, and C (m = 8 is the number of tuples)
      • Observation II: empty leaves in Tree D lead to more required queries
  • 22. DECISION TREES FOR COLLECTING A FINITE NUMBER OF TUPLES
    • Loss: empty leaves
    • Saving: since the number of samples to be collected is finite, not all m – 1 queries need to be issued
    • Constructing an optimal decision tree
      • Given the number of samples to be collected, maximize (Saving – Loss)
  • 23. A GREEDY HEURISTIC ALGORITHM
    • Saving, Loss, and Expense: defined per candidate split (formulas given on the slide)
    • Net Saving Per Expense (SER): the net saving (Saving – Loss) per unit of Expense
    • Cost bound: if the SER-based condition on the slide holds, then the total cost is bounded from above
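The exact Saving, Loss, and Expense formulas appear only as images on the slide, so the sketch below substitutes simple illustrative proxies just to show the shape of the greedy choice; it should not be read as the paper's definition of SER:

```python
def ser_score(branch_counts):
    """Illustrative proxy for Net Saving per Expense: Saving ~ non-empty
    branches exposed by the split, Loss ~ empty branches (wasted queries),
    Expense ~ COUNT queries needed to evaluate the split (the fanout)."""
    non_empty = sum(1 for c in branch_counts if c > 0)
    saving = max(non_empty - 1, 0)
    loss = len(branch_counts) - non_empty
    expense = len(branch_counts)
    return (saving - loss) / expense

def choose_split(candidate_attrs, estimated_branch_counts):
    """Greedy step: pick the attribute with the highest (proxy) SER.
    estimated_branch_counts(attr) -> list of estimated COUNTs, one per value."""
    return max(candidate_attrs,
               key=lambda a: ser_score(estimated_branch_counts(a)))
```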
  • 24. COMPUTATION OF SER
    • How to compute the branch COUNTs (i.e., |u_j|) for all candidate attributes of a node?
      • Exact computation would defeat the purpose of minimizing query cost
      • Fortunately, in many cases a rough estimate suffices, e.g., to distinguish a fanout of 2 from a fanout of 10
      • Unfortunately, a uniform-distribution assumption does not suffice
      • Proposed solution: issue a small number of marginal queries first, then estimate the branch COUNTs under a conditional-independence assumption (sketched below)
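A hedged sketch of the proposed estimation step: issue one round of marginal COUNT queries up front, then approximate branch COUNTs under the stated conditional-independence assumption instead of querying every branch exactly (function names and the exact estimator are assumptions):

```python
def collect_marginals(db, domains):
    """One COUNT query per attribute value on a Top-k-COUNT interface."""
    total = db.query({}).count
    marginals = {attr: {v: db.query({attr: v}).count for v in values}
                 for attr, values in domains.items()}
    return total, marginals

def estimate_branch_counts(current_count, attr, total, marginals):
    """COUNT(conditions AND attr=v) ~ COUNT(conditions) * COUNT(attr=v) / N,
    i.e., attr is assumed (conditionally) independent of the current conditions."""
    return [current_count * marginals[attr][v] / total for v in marginals[attr]]
```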
  • 25. ALGORITHM COUNT-DECISION-TREE. [Figure: walkthrough on a car database: at each node, select the attribute that maximizes SER (e.g., [Transmission]: Automatic/Manual, then [Make]: Honda/Toyota/Nissan/Ford/VW, then [Price Segment]: High/Medium/Low) and drill down until a sample is found.]
  • 26. ALGORITHM COUNT-DECISION-TREE. [Figure: after a sample is found, go back to the root and start another walk over the tree built so far ([Transmission], [Make], [Price Segment]).]
  • 27. ALGORITHM COUNT-DECISION-TREE. [Figure: on the next walk, branches already explored are reused (saving from history); new nodes again select the attribute that maximizes SER ([Transmission], then [Price Segment], then [Make]) until another sample is found.]
  • 28. ALGORITHM COUNT-DECISION-TREE. [Figure: after the sample, go back to the root and start another walk.]
  • 29. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 30. TWO MAIN IDEAS OF ALERT-HYBRID
    • Use a small number of pilot samples to estimate COUNT
      • Motivation: COUNT eliminates bias for sampling, but is unavailable for Top-k-ALERT interfaces.
    • On-the-fly switch from COUNT-DECISION-TREE to ALERT-ORDER during the drill-down process for collecting the remaining samples
      • Motivation: an inaccurate COUNT may introduce additional bias – switch when confidence is low
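The control flow of the switch can be sketched roughly as below; this is a simplified, assumption-laden illustration (it omits the bias-correction details of ALERT-ORDER) built on the hypothetical interface model from earlier, with estimate_count standing in for the pilot-sample COUNT estimator:

```python
import random

def alert_hybrid_walk(db, attributes, domains, estimate_count, threshold):
    """One drill-down on a Top-k-ALERT interface: branch using estimated
    COUNTs while they are above the confidence threshold, then fall back to
    uniform (ALERT-ORDER style) branching for the rest of the walk."""
    conditions = {}
    for attr in attributes:
        result = db.query(conditions)
        if not result.overflow:
            return random.choice(result.tuples) if result.tuples else None
        if estimate_count(conditions) >= threshold:
            # COUNT-DECISION-TREE phase: branch proportional to estimated COUNTs
            weights = [max(estimate_count({**conditions, attr: v}), 0.0)
                       for v in domains[attr]]
        else:
            # ALERT-ORDER phase: estimates no longer trusted, branch uniformly
            weights = [1.0] * len(domains[attr])
        if sum(weights) == 0:
            weights = [1.0] * len(domains[attr])
        conditions[attr] = random.choices(domains[attr], weights=weights)[0]
    result = db.query(conditions)
    return random.choice(result.tuples) if result.tuples else None
```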
  • 31. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 32. EXPERIMENTAL SETUP
    • Synthetic Boolean-iid:
      • 200,000 tuples, 80 attributes, p = 0.25
    • Synthetic Boolean-mixed:
      • 200,000 tuples, 40 independent attributes
      • 5 attributes with uniform distribution; the others have p ranging from 1/160 to 35/160 in steps of 1/160
    • Yahoo! Auto: http://autos.yahoo.com
      • 15,211 tuples, 32 Boolean, 6 categorical attributes
      • Domain size ranges from 5 to 447
    • Census: UCI Data Mining Archive
      • 1990 census data; after removing all attributes with domain size > 100, 12 attributes and 32,561 tuples remain
      • Domain size from Boolean to 92
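For reference, the Boolean-iid synthetic dataset could be generated along these lines (a sketch under the stated parameters; the paper's actual generator is not given here):

```python
import random

def boolean_iid(n_tuples=200_000, n_attrs=80, p=0.25, seed=0):
    """Each attribute of each tuple is 1 independently with probability p,
    matching the Boolean-iid setup on slide 32."""
    rng = random.Random(seed)
    return [{f"A{i}": int(rng.random() < p) for i in range(n_attrs)}
            for _ in range(n_tuples)]
```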
  • 33. EFFICIENCY OF COUNT-DECISION-TREE VS. COUNT-ORDER
  • 34. EFFICIENCY AND BIAS OF ALERT-HYBRID VS. ALERT-ORDER
  • 35. ILLUSTRATION OF ALERT-HYBRID
  • 36. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 37. RELATED WORK
    • Crawling hidden text databases
      • [BB98, AIG03, NZC05]
    • Extracting data from hidden structured databases
      • [RG01, LES+02, ARP+07]
    • Sampling search engine’s index using a public interface
      • [BB98, BJ04, BG06, BG07]
    • Sampling hidden structural databases
      • [DDM07]
  • 38. OUTLINE
    • Introduction
    • Baseline Algorithm
    • COUNT-DECISION-TREE
    • ALERT-HYBRID
    • Experimental Results
    • Related Work
    • Conclusion
  • 39. CONCLUSION
    • Main Technical Contribution 1: COUNT-DECISION-TREE
      • Unbiased sampling algorithm for Top-k-COUNT interfaces
      • Orders of magnitude more efficient than the existing algorithms
    • Main Technical Contribution 2: ALERT-HYBRID
      • Sampling algorithm with slight bias for Top-k-ALERT interfaces
      • Orders of magnitude more efficient, with smaller bias, than the existing algorithms
  • 40. CONCLUSION
    • Our studies unveil powerful techniques to perform data analytics over hidden databases
      • Hidden database owners may be extremely concerned about the privacy of aggregates over their hidden databases
    • How to reveal individual tuples truthfully and efficiently while hiding aggregate views of the data
      • A. Dasgupta, N. Zhang, G. Das, and S. Chaudhuri, Privacy Preservation of Aggregates in Hidden Databases: Why and How? SIGMOD 2009.
  • 41. THANK YOU
  • 42. BACKUP SLIDES
  • 43. EFFICIENCY AND BIAS OF COUNT-DECISION-TREE VS. ALERT-ORDER
  • 44. TWO TYPES OF HIDDEN DATABASE INTERFACE
    • Top-k-ALERT
      • MSN Stock Screener (k = 24)
        • http://moneycentral.msn.com/investor/finder/customstocks.asp
      • Microsoft Solution Finder (k = 500)
        • https://solutionfinder.microsoft.com/Solutions/SolutionsDirectory.aspx?mode=searchproblem
    • Top-k-COUNT
      • MSN Careers (k = 4,000)
        • http://msn.careerbuilder.com/JobSeeker/Jobs/JobFindAdv.aspx
  • 45. SETTINGS OF s1 AND Cs
  • 46. COUNT-DECISION-TREE VS. ALERT-RANDOM
  • 47. IMPROVEMENT BY CONSIDERING HISTORY Note: Almost unrelated to k
  • 48. IMPROVEMENT BY DECISION TREE
  • 49. ALGORITHM ALERT-HYBRID. [Figure: the walk starts with COUNT-DECISION-TREE, selecting at each node the attribute that maximizes SER ([Transmission], [Price Segment], [Make]); when the estimated count falls below THRESHOLD, it switches to ALERT-ORDER for the remaining attributes (e.g., [State]: TX/VA/NY/CA), then goes back to the root and starts another walk.]
  • 50. ALGORITHM ALERT-HYBRID. [Figure: the same hybrid walk continued: COUNT-DECISION-TREE selects attributes by SER ([Transmission], [Make], [Price Segment]) until the count drops below THRESHOLD, then ALERT-ORDER takes over ([State]: TX/VA/NY/CA) and the walk continues.]