PrivatePond: Outsourced Management of Web Corpuses


We propose PrivatePond, a system that lets an end user create, store, and search a corpus of web documents through an untrusted service provider without compromising the confidentiality of the documents in the corpus.

Published in: Technology, Education

  • Consider a small company’s intranet. Offload management responsibilities.
  • Secure boolean search on encrypted documents / secure inverted indexes for document retrieval. Transparency: seamless interaction for the user. Query run time.
  • Traditional search architecture: a query returns a ranked list of documents
  • Download each encrypted document to search
  • So not confidential?
  • One example to strike a balance between searchability and confidentiality
  • Impact on search quality: lose proximity-based search; lose term frequency; padding of tokens introduces false positives
  • Given a ranking model, examine the change in search quality; we do not determine the best ranking model. N: the N highest-ranked documents
  • Meaning of N
  • BW = 1
  • Varying confidentiality and search quality characteristics

    1. PrivatePond: Outsourced Management of Web Corpuses
       Daniel Fabbri, Arnab Nandi, Kristen LeFevre, H.V. Jagadish
       University of Michigan
    2. Outsourcing Data to the Cloud
       • Increase in cloud computing
       • Outsource document management to service providers
       • Search and retrieve documents from the cloud
       • Leverage existing search infrastructure
       • High-quality search results
    3. Outsourcing Challenge: Confidentiality
       • Documents may contain private information
       • The service provider and the public should not have access to the contents
       • How can we balance confidentiality and search quality?
       [Figure labels: Web, Intranet, Search Engines]
    4. PrivatePond
       • Create and store a corpus of confidential hyperlinked documents
       • Search confidential documents using an unmodified search engine
       • Balance privacy and searchability with a secure indexable representation
       [Figure labels: Web, Intranet, Search Engines]
    5. PrivatePond Design Goals
       User Experience:
       • Document Confidentiality
       • Search Quality
       • Transparency
       Search System:
       • Minimal Overhead
       • Leverage Existing Search Infrastructure
       Previous work requires modification to the search engine [Song 2000, Bawa 2003, Zerr 2008]
    6. Outsourcing Architecture
       • Outsource the original corpus
       • Does not maintain confidentiality
       [Figure labels: D, Service, (Unmodified) Search Engine, Ranked Result Document(s) D, Q, User Search]
    7. Outsourcing Architecture
       • Outsource encrypted documents
       • Local proxy encrypts and decrypts
       • Local proxy performs the searches
       • High search overhead
       [Figure labels: E(D), Service, (Unmodified) Search Engine, Local Proxy, Ranked Result Document(s) D, Q, User Search]
    8. PrivatePond Architecture
       Secure Indexable Representation:
       • Attached to encrypted document
       • Indexable
       • Searchable
       [Figure labels: Secure Indexable Representation, E(D), Service, (Unmodified) Search Engine, Q’, Local Proxy, Ranked Result Document(s) D, Q, User Search]
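
The local proxy's role in this architecture (turning the user's query Q into a query Q’ over the indexable representation) might look roughly like the following. This is a minimal sketch under stated assumptions: the slides do not say how tokens are derived, so a keyed HMAC-SHA256 truncated to 12 hex characters is assumed here, and `SECRET_KEY`, `token_for`, and `rewrite_query` are hypothetical names.

```python
import hashlib
import hmac
import re

# Hypothetical key held only by the local proxy; it is never sent to the service.
SECRET_KEY = b"example-key-known-only-to-the-proxy"


def token_for(word: str, key: bytes = SECRET_KEY) -> str:
    """Map a word to a deterministic opaque token (assumed scheme: HMAC-SHA256)."""
    digest = hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def rewrite_query(query: str, key: bytes = SECRET_KEY) -> str:
    """Turn a plaintext query Q into a token query Q' for the unmodified engine."""
    words = re.findall(r"[A-Za-z0-9]+", query)
    return " ".join(token_for(w, key) for w in words)


q_prime = rewrite_query("Twinkle little star")
print(q_prime)  # three opaque tokens separated by spaces
```

Because the same key produces the same token at indexing time and at query time, the search engine can match Q’ against the indexable representations without seeing any plaintext word.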
    9. Outsourcing Search
       Practical Tradeoffs…
       • Outsource Original Corpus: searchable, not confidential
       • Outsource Encrypted Corpus: confidential, not easily searched
       [Figure labels: Search Quality, Confidentiality, Indexable Representation]
    10. Sample Indexable Representation
        • First, consider encrypting each word in a document
        • Maintain links between indexable representations
        • Vulnerable to attacks [Kumar 2007]:
          • Language structure (e.g., <noun> <verb> <noun>)
          • Frequency of words (e.g., "twinkle" is most frequent)
        Document: Twinkle, twinkle little star
        Indexable Representation: AAA AAA BBB CCC
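
The frequency weakness of word-by-word encryption can be illustrated with a few lines of code. This is a sketch, not the authors' scheme: it assumes a deterministic keyed hash (HMAC-SHA256) as the per-word encryption, and `KEY`, `encrypt_word`, and `encrypt_each_word` are hypothetical names.

```python
import hashlib
import hmac
import re
from collections import Counter
from typing import List

# Hypothetical secret key; the slides do not specify the cipher or key handling.
KEY = b"hypothetical-proxy-key"


def encrypt_word(word: str, key: bytes = KEY) -> str:
    """Deterministically map one word to an opaque token (assumed scheme)."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()[:8]


def encrypt_each_word(text: str, key: bytes = KEY) -> List[str]:
    """Encrypt every word in order, keeping repeats and positions."""
    return [encrypt_word(w, key) for w in re.findall(r"[a-z0-9]+", text.lower())]


tokens = encrypt_each_word("Twinkle, twinkle little star")
counts = Counter(tokens)
most_common_token, freq = counts.most_common(1)[0]

# The repeated word maps to the same token both times, so an observer who
# never sees the plaintext still learns which token is most frequent.
print(tokens[0] == tokens[1], freq)
```

The token sequence has the same shape as the slide's AAA AAA BBB CCC example: word order and repetition survive encryption, which is what the language-structure and frequency attacks exploit.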
    11. Sample Indexable Representation
        • Second, represent documents as an encrypted set-of-words
        • Prevents attacks on a single indexable representation
        • Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
        [Figure labels: Doc 1, Doc 2, Doc 3; tokens AAA BBB CCC; Sample Indexable Representation; Corpus of Indexable Representations; Aggregate Document Frequency]
    12. Third, Set-of-words representation + Padding (BW = 3)
        • Bin width (BW): require that each token have the same document frequency as BW − 1 other tokens
        [Figure labels: Sample Indexable Representation; token sets "AAA BBB CCC", "BBB CCC", "CCC"; Corpus of Indexable Representations; Aggregate Document Frequency]
    13. Set-of-words representation + Padding (BW = 3)
        [Figure labels: PrivatePond Indexable Representation; token sets "AAA BBB CCC", "AAA BBB CCC", "AAA BBB CCC"; Corpus of Indexable Representations; Aggregate Document Frequency]
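
The set-of-words representation with padding can be sketched as follows. The binning and padding strategy here is an assumption made for illustration (tokens are sorted by document frequency, grouped into bins of BW tokens, and padded up to the highest frequency in their bin by adding them to documents in sorted-id order); the slides only state the bin-width requirement, and `pad_corpus` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Set


def pad_corpus(docs: Dict[str, Set[str]], bin_width: int) -> Dict[str, Set[str]]:
    """Pad token sets so every token shares its document frequency with its bin."""
    df = Counter(tok for toks in docs.values() for tok in toks)
    # Sort tokens by document frequency so each bin groups similar tokens.
    ordered = sorted(df, key=lambda t: (-df[t], t))
    bins: List[List[str]] = [
        ordered[i:i + bin_width] for i in range(0, len(ordered), bin_width)
    ]
    # Fold a short trailing bin into the previous one so no bin is under-sized.
    if len(bins) > 1 and len(bins[-1]) < bin_width:
        tail = bins.pop()
        bins[-1].extend(tail)
    padded = {doc_id: set(toks) for doc_id, toks in docs.items()}
    for group in bins:
        target = max(df[t] for t in group)
        for tok in group:
            missing = target - df[tok]
            # Assumed policy: pad the first documents (by id) lacking the token.
            candidates = [d for d in sorted(padded) if tok not in padded[d]]
            for doc_id in candidates[:missing]:
                padded[doc_id].add(tok)
    return padded


corpus = {"doc1": {"AAA", "BBB", "CCC"}, "doc2": {"BBB", "CCC"}, "doc3": {"CCC"}}
padded_corpus = pad_corpus(corpus, bin_width=3)
print(padded_corpus)
```

With BW = 3 all three tokens fall into one bin, so each is padded to the document frequency of CCC and every document ends up holding AAA, BBB, and CCC. The added tokens are the source of the false positives noted in the slides.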
    14. PrivatePond Indexable Representation: Impact on Search Quality
        • Lose proximity-based search
        • Lose term frequency
        • Padding of tokens introduces false positives
        What is the effect of the indexable representation on search quality?
    15. Evaluation
        Data:
        • Sample of Simple Wikipedia (Small Corpus)
        • Full Simple Wikipedia (Large Corpus)
        • Query workload of 10K queries
        Evaluation performed with MySQL
    16. Ranking Models
        Ranking Models:
        • TFIDF (as implemented in MySQL FULLTEXT)
        • PageRank
        • Combination of Ranking Models
        Measure change in search quality due to the indexable representation
    17. Search Quality Metrics
        • Original corpus, run through the search engine, gives ranked results: the Gold List
        • Indexable representation, run through the search engine, gives ranked results: the Pond List
    18. Search Quality Metrics
        • Precision at N:
          • N: consider documents ranked from 1 to N
          • P(N) = |gold list ∩ pond list| / N
          • Example: P(3) = 2/3
        • Two additional metrics (included in the paper):
          • Mean Average Precision
          • Rank Perturbation
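
Precision at N, as defined on this slide, can be computed directly from the two ranked lists. A minimal sketch follows; the function name `precision_at_n` and the document ids are hypothetical, chosen so that the result matches the slide's P(3) = 2/3 example.

```python
from typing import Sequence


def precision_at_n(gold: Sequence[str], pond: Sequence[str], n: int) -> float:
    """Fraction of the top-n gold documents that also appear in the top-n pond list."""
    if n <= 0:
        raise ValueError("n must be positive")
    overlap = set(gold[:n]) & set(pond[:n])
    return len(overlap) / n


# Hypothetical ranked lists: two of the top three documents coincide.
gold_list = ["d1", "d2", "d3", "d4"]
pond_list = ["d1", "d5", "d2", "d3"]
p3 = precision_at_n(gold_list, pond_list, 3)
print(p3)
```

Note that this metric ignores the order of documents within the top N; the rank perturbation metric mentioned on the slide is the one that captures reordering.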
    19. Effects of the Indexable Representation, BW = 1: Search Quality Per Corpus
        • Drop in search quality for TFIDF; loses 2 of the top 10 for the small corpus
        • PageRank is unaffected by the set-of-words representation
    20. Effects of Bin Width (Small Corpus)
        • Loss in search quality as bin width increases
        • Padding in documents with high PageRank or low document frequency
    21. Combining Ranking Models (Small Corpus, BW = 10)
        Weighted Ranking = (w) · (PageRank) + (1 − w) · (TFIDF)
        • The combined ranking models have comparable search quality
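
The weighted ranking formula can be sketched in code as below. This assumes both scores have already been normalized to a comparable range (the slides do not say how the scores are scaled), and `combined_ranking` and the sample scores are hypothetical.

```python
from typing import Dict, List


def combined_ranking(
    pagerank: Dict[str, float], tfidf: Dict[str, float], w: float
) -> List[str]:
    """Rank documents by w * PageRank + (1 - w) * TFIDF, highest score first."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    docs = set(pagerank) | set(tfidf)
    # Documents missing from one model are assumed to score 0 in that model.
    score = {
        d: w * pagerank.get(d, 0.0) + (1.0 - w) * tfidf.get(d, 0.0) for d in docs
    }
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(docs, key=lambda d: (-score[d], d))


# Hypothetical normalized scores for three documents.
pagerank_scores = {"a": 0.9, "b": 0.1, "c": 0.5}
tfidf_scores = {"a": 0.2, "b": 0.8, "c": 0.5}
print(combined_ranking(pagerank_scores, tfidf_scores, w=0.5))
```

Setting w = 1 ranks purely by PageRank and w = 0 purely by TFIDF, so sweeping w between the two shows how much of the TFIDF loss from the indexable representation the link-based signal can offset.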
    22. Conclusion
        • Present the PrivatePond architecture
        • Outsourcing search
        • Goal of balancing searchability and confidentiality
        • Leverages existing search engine infrastructure
        • Future Work: Alternative Indexable Representations
    23. More info at
    24. All Metrics