PrivatePond: Outsourced Management of Web Corpuses

We propose a novel system called PrivatePond, which was designed with the goal of allowing an end-user to create, store, and search a corpus of web documents, using an untrusted service provider, and without compromising the confidentiality of the documents in the corpus.

  • Consider a small company’s intranet; offload management responsibilities
  • Secure boolean search on encrypted documents / secure inverted indexes for document retrieval. Transparency: seamless interaction for the user. Query run time.
  • Traditional search architecture query returns ranked list of documents
  • Download each encrypted document to search
  • So not confidential?
  • One example to strike a balance between searchability and confidentiality
  • Impact on search quality: lose proximity-based search; lose term frequency; padding of tokens introduces false positives
  • Given a ranking model, examine the change in search quality; we do not determine the best ranking model. N: the N highest-ranked documents
  • Meaning of N
  • BW = 1
  • Varying confidentiality and search quality characteristics

Presentation Transcript

  • PrivatePond: Outsourced Management of Web Corpuses
    Daniel Fabbri, Arnab Nandi,
    Kristen LeFevre, H.V. Jagadish
    University of Michigan
    1
  • Outsourcing Data to the Cloud
    Increase in cloud computing
    Outsource document management to service providers
    Search and retrieve documents from the cloud
    Leverage existing search infrastructure
    High quality search results
    2
  • Outsourcing Challenge: Confidentiality
    Documents may contain private information
    The service provider/public should not have access to the contents
    How can we balance confidentiality and search quality?
    [Diagram labels: Web; Intranet; Search Engines]
    3
  • PrivatePond
    Create and store a corpus of confidential hyperlinked documents
    Search confidential documents using an unmodified search engine
    Balance privacy and searchability with a secure indexable representation
    [Diagram labels: Web; Intranet; Intranet; Search Engines]
    4
  • PrivatePond Design Goals
    User Experience:
    Document Confidentiality
    Search Quality
    Transparency
    Search System:
    Minimal Overhead
    Leverage Existing Search Infrastructure
    Previous work requires modification to the search engine
    [Song 2000, Bawa 2003, Zerr 2008]
    5
  • Outsourcing Architecture
    6
    Outsource the original corpus
    Does not maintain confidentiality
    [Diagram labels: D; Service; (Unmodified) Search Engine; Ranked Result Document(s) D; Q; User Search]
  • Outsourcing Architecture
    Outsource encrypted documents
    Local proxy encrypts and decrypts
    Local proxy performs the searches
    High search overhead
    7
    [Diagram labels: E(D); Service; (Unmodified) Search Engine; Local Proxy; Ranked Result Document(s) D; Q; User Search]
  • PrivatePond Architecture
    8
    Secure Indexable Representation
    Attached to encrypted document
    Indexable
    Searchable
    [Diagram labels: Secure Indexable Representation; E(D); Service; (Unmodified) Search Engine; Q’; Local Proxy; Ranked Result Document(s) D; Q; User Search]
  • Outsourcing Search
    9
    Practical Tradeoffs…
    [Plot: Search Quality vs. Confidentiality, locating three options]
    Indexable Representation
    Outsource Original Corpus: searchable, not confidential
    Outsource Encrypted Corpus: confidential, not easily searched
  • Sample Indexable Representation
    First, consider encrypting each word in a document
    Maintain links between indexable representations
    Vulnerable to attacks:
    Language structure (e.g., <noun> <verb> <noun>)
    Frequency of words (e.g., twinkle is most frequent)
    [Kumar 2007]
    Document: Twinkle, twinkle little star
    Indexable Representation: AAA AAA BBB CCC
    10
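The frequency weakness of word-by-word encryption can be reproduced in a few lines. The sketch below is an illustration only: it assumes a keyed hash (HMAC-SHA256, truncated) as a stand-in for deterministic per-word encryption, since the slides do not say which primitive PrivatePond uses. `DEMO_KEY` and `tokenize_word` are hypothetical names.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical demo key; in the architecture above it would stay with the local proxy.
DEMO_KEY = b"demo-key-not-for-production"


def tokenize_word(word: str, key: bytes = DEMO_KEY) -> str:
    """Map a word to a deterministic opaque token (assumed stand-in for encryption)."""
    normalized = word.lower().strip(".,;:!?\"'")
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def word_by_word(document: str) -> list[str]:
    """Replace each word with its token, preserving order and repetition."""
    return [tokenize_word(w) for w in document.split()]


tokens = word_by_word("Twinkle, twinkle little star")
counts = Counter(tokens)

# The same word always yields the same token, so word order and word
# frequency are still visible to anyone who sees the tokens.
for token, count in counts.most_common():
    print(token, count)
```

The first two tokens are identical, mirroring the `AAA AAA BBB CCC` example: the repeated word is hidden by name but not by position or count.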
  • Sample Indexable Representation
    Second, represent documents as an encrypted set-of-words
    Prevents attacks on a single indexable representation
    Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
    [Diagram labels: Doc 1, Doc 2, Doc 3; sample indexable representation AAA BBB CCC; Corpus of Indexable Representations; Aggregate Document Frequency]
    11
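A small sketch of the set-of-words idea and of the aggregation attack it remains open to. As an assumption, tokens are again produced with a truncated HMAC-SHA256, which the slides do not specify; the three-document corpus and all names are made up for illustration.

```python
import hashlib
import hmac
from collections import Counter

DEMO_KEY = b"demo-key-not-for-production"  # hypothetical key


def token(word: str, key: bytes = DEMO_KEY) -> str:
    """Deterministic opaque token for a word (assumed stand-in for encryption)."""
    normalized = word.lower().strip(".,;:!?\"'")
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def set_of_words(document: str) -> set[str]:
    """Drop order and repetition: keep each distinct token once."""
    return {token(w) for w in document.split() if w.strip(".,;:!?\"'")}


# Made-up corpus for illustration.
corpus = {
    "doc1": "Twinkle, twinkle little star",
    "doc2": "little star",
    "doc3": "star",
}
representations = {doc_id: set_of_words(text) for doc_id, text in corpus.items()}

# Within one document every token now appears exactly once, but counting
# across the corpus still reveals each token's document frequency.
document_frequency = Counter(
    t for rep in representations.values() for t in rep
)
```

Here `document_frequency` separates the token for "star" (three documents) from the token for "twinkle" (one document), which is the aggregate leak the slide describes.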
  • Sample Indexable Representation
    Third, set-of-words representation + padding (BW = 3)
    • Bin width (BW): each token must have the same document frequency as BW − 1 other tokens
    [Diagram: Corpus of Indexable Representations before padding: "AAA BBB CCC", "BBB CCC", "CCC"; Aggregate Document Frequency]
    12
  • PrivatePond Indexable Representation
    Set-of-words representation + Padding (BW = 3)
    [Diagram: Corpus of Indexable Representations after padding: "AAA BBB CCC", "AAA BBB CCC", "AAA BBB CCC"; Aggregate Document Frequency]
    13
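One way the padding step could work, sketched under stated assumptions: tokens are sorted by document frequency, grouped into bins of BW consecutive tokens, and each token is added to extra documents until it matches the highest frequency in its bin. The slides do not give the binning procedure or say which documents receive padding, so the grouping rule, the choice of documents (here simply the first ones in sorted order), and the name `pad_corpus` are all assumptions.

```python
from collections import defaultdict


def pad_corpus(reps: dict[str, set[str]], bin_width: int) -> dict[str, set[str]]:
    """Return padded copies of the set-of-words representations.

    After padding, every token shares its document frequency with the other
    tokens in its bin (bin_width - 1 others when enough tokens exist).
    """
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")

    padded = {doc_id: set(tokens) for doc_id, tokens in reps.items()}

    # Build posting lists: token -> documents that contain it.
    postings: dict[str, set[str]] = defaultdict(set)
    for doc_id, tokens in reps.items():
        for t in tokens:
            postings[t].add(doc_id)

    # Sort by descending document frequency so each bin holds similar tokens.
    ordered = sorted(postings, key=lambda t: (-len(postings[t]), t))
    bins = [ordered[i:i + bin_width] for i in range(0, len(ordered), bin_width)]

    # Fold an undersized last bin into the previous one (assumed policy).
    if len(bins) > 1 and len(bins[-1]) < bin_width:
        tail = bins.pop()
        bins[-1].extend(tail)

    doc_ids = sorted(reps)
    for group in bins:
        target = len(postings[group[0]])  # highest frequency in this bin
        for t in group:
            missing = [d for d in doc_ids if d not in postings[t]]
            for d in missing[: target - len(postings[t])]:
                padded[d].add(t)  # padding: a deliberate false positive
    return padded


before = {"d1": {"AAA", "BBB", "CCC"}, "d2": {"BBB", "CCC"}, "d3": {"CCC"}}
after = pad_corpus(before, bin_width=3)
```

With the three-document example from the slides and BW = 3, every representation becomes `{"AAA", "BBB", "CCC"}`. With BW = 1 the corpus is returned unchanged, which corresponds to the plain set-of-words representation.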
  • PrivatePond Indexable Representation
    Impact on Search Quality
    • Lose proximity-based search
    • Lose term frequency
    • Padding of tokens introduces false positives
    14
    What is the effect of the indexable representation on search quality?
  • Evaluation
    Data:
    Sample of Simple Wikipedia (Small Corpus)
    Full Simple Wikipedia (Large Corpus)
    Query workload of 10K queries
    Evaluation performed with MySQL
    15
  • Ranking Models
    Ranking Models:
    TFIDF (as implemented in MySQL FULLTEXT)
    PageRank
    Combination of Ranking Models
    Measure change in search quality due to the indexable representation
    16
  • Search Quality Metrics
    [Diagram: Original Corpus → Search Engine → Ranked Results: Gold List; Indexable Representation → Search Engine → Ranked Results: Pond List]
    17
  • Search Quality Metrics
    • Precision at N:
    • N: consider documents ranked from 1 to N
    • P(N) = |gold list ∩ pond list| / N
    • Example: P(3) = 2/3
    • Two additional metrics (included in the paper):
    • Mean Average Precision
    • Rank Perturbation
    18
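Precision at N as defined above can be computed directly from the two ranked lists. This is a minimal sketch; the document identifiers are invented, chosen so that the result matches the P(3) = 2/3 example on the slide.

```python
def precision_at_n(gold: list[str], pond: list[str], n: int) -> float:
    """Fraction of the top-n gold results that also appear in the top-n pond results."""
    if n <= 0:
        raise ValueError("n must be positive")
    return len(set(gold[:n]) & set(pond[:n])) / n


# Invented rankings: results from the original corpus and from the
# indexable representation for the same query.
gold_list = ["d1", "d2", "d3", "d4"]
pond_list = ["d1", "d5", "d2", "d3"]

p3 = precision_at_n(gold_list, pond_list, 3)  # top 3 share d1 and d2
```

Note that this metric ignores the order within the top N; the rank perturbation metric mentioned on the slide addresses ordering.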
  • Effects of the Indexable Representation (BW = 1)
    Search Quality Per Corpus
    • Drop in search quality for TFIDF; loses 2 of the top 10 for the small corpus
    • PageRank is unaffected by the set-of-words representation
    19
  • Effects of Bin Width (Small Corpus)
    • Loss in search quality as bin width increases
    • Padding in documents with high PageRank or low document frequency
    20
  • Combining Ranking Models (Small Corpus, BW = 10)
    Weighted Ranking = (w) · (PageRank) + (1 − w) · (TFIDF)
    • The combined ranking models have comparable search quality
    21
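The weighted ranking formula can be sketched as follows. The slides do not say how the two scores are scaled before mixing; this example assumes both are already normalized to a comparable range, and the scores and the name `combined_ranking` are invented for illustration.

```python
def combined_ranking(
    pagerank: dict[str, float], tfidf: dict[str, float], w: float
) -> list[str]:
    """Rank documents by w * PageRank + (1 - w) * TFIDF, highest score first."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    docs = set(pagerank) | set(tfidf)
    score = {
        d: w * pagerank.get(d, 0.0) + (1.0 - w) * tfidf.get(d, 0.0) for d in docs
    }
    # Break ties by document id so the ordering is deterministic.
    return sorted(docs, key=lambda d: (-score[d], d))


# Invented, pre-normalized scores for three documents.
pagerank_scores = {"a": 0.9, "b": 0.1, "c": 0.5}
tfidf_scores = {"a": 0.2, "b": 0.8, "c": 0.5}

by_pagerank = combined_ranking(pagerank_scores, tfidf_scores, w=1.0)
by_tfidf = combined_ranking(pagerank_scores, tfidf_scores, w=0.0)
```

Setting w = 1 reduces to pure PageRank, which the earlier slide reports as unaffected by the set-of-words representation, and w = 0 reduces to pure TFIDF.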
  • Conclusion
    Present the PrivatePond architecture
    Outsourcing search
    Goal of balancing searchability and confidentiality
    Leverages existing search engine infrastructure
    Future Work: Alternative Indexable Representations
    22
  • more info at
    www.eecs.umich.edu/db
    23
  • All Metrics
    24