PrivatePond: Outsourced Management of Web Corpuses
  Daniel Fabbri, Arnab Nandi, Kristen LeFevre, H.V. Jagadish
  University of Michigan

Outsourcing Data to the Cloud
  - Increase in cloud computing
  - Outsource document management to service providers
  - Search and retrieve documents from the cloud
  - Leverage existing search infrastructure
  - High-quality search results

Outsourcing Challenge: Confidentiality
  - Documents may contain private information
  - The service provider and the public should not have access to the contents
  - How can we balance confidentiality and search quality?
  [Diagram labels: Web, Intranet, Search Engines]

PrivatePond
  - Create and store a corpus of confidential hyperlinked documents
  - Search confidential documents using an unmodified search engine
  - Balance privacy and searchability with a secure indexable representation
  [Diagram labels: Web, Intranet, Search Engines]

PrivatePond Design Goals
  - User experience: document confidentiality, search quality, transparency
  - Search system: minimal overhead, leverage existing search infrastructure
  - Previous work requires modification to the search engine [Song 2000, Bawa 2003, Zerr 2008]

Outsourcing Architecture
  - Outsource the original corpus
  - Does not maintain confidentiality
  [Diagram labels: User Search, Q, Service with (Unmodified) Search Engine and D, Ranked Result Document(s) D]

Outsourcing Architecture
  - Outsource encrypted documents
  - Local proxy encrypts and decrypts
  - Local proxy performs the searches
  - High search overhead
  [Diagram labels: User Search, Q, Local Proxy, Service with (Unmodified) Search Engine and E(D), Ranked Result Document(s) D]

PrivatePond Architecture
  - Secure indexable representation: attached to the encrypted document, indexable, searchable
  [Diagram labels: User Search, Q, Local Proxy, Q’, Service with (Unmodified) Search Engine, Secure Indexable Representation and E(D), Ranked Result Document(s) D]
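The slides label the query leaving the local proxy as Q’ but do not say how it is derived from Q. A minimal sketch, assuming the proxy replaces each query word with the same deterministic token used in the indexable representation: the key, the HMAC-based `encrypt_word` helper, and `rewrite_query` are all hypothetical names introduced for illustration, not taken from the talk.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical secret held only by the local proxy


def encrypt_word(word, key=SECRET_KEY):
    """Assumed stand-in for word encryption: a truncated keyed hash of the lowercased word."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def rewrite_query(query, key=SECRET_KEY):
    """Turn a plaintext keyword query Q into a token query Q' for the unmodified search engine."""
    return " ".join(encrypt_word(word, key) for word in query.split())


q_prime = rewrite_query("little star")
```

Under this assumption the search engine only ever sees opaque tokens, and the proxy decrypts the returned E(D) before showing results to the user.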
Outsourcing Search: Practical Tradeoffs
  - Outsource original corpus: searchable, not confidential
  - Outsource encrypted corpus: confidential, not easily searched
  [Diagram labels: Search Quality, Confidentiality, Indexable Representation]
Sample Indexable Representation
  - First, consider encrypting each word in a document
  - Maintain links between indexable representations
  - Vulnerable to attacks:
      - Language structure (e.g., <noun> <verb> <noun>)
      - Frequency of words (e.g., "twinkle" is most frequent) [Kumar 2007]
  - Example: the document "Twinkle, twinkle little star" becomes the indexable representation "AAA AAA BBB CCC"
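The word-by-word scheme and its frequency leak can be sketched in a few lines. This is an illustration only: the slides do not name a cipher, so a truncated HMAC-SHA256 is assumed here as a deterministic stand-in, and the key and function names are hypothetical.

```python
import hashlib
import hmac
import re
from collections import Counter

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def encrypt_word(word, key=SECRET_KEY):
    """Map a word to a deterministic opaque token (assumed: truncated HMAC-SHA256)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def word_by_word(text, key=SECRET_KEY):
    """Encrypt each word in place, keeping word order and repetitions."""
    return [encrypt_word(w, key) for w in re.findall(r"[A-Za-z']+", text)]


tokens = word_by_word("Twinkle, twinkle little star")
counts = Counter(tokens)
# The repeated word yields a repeated token, so its frequency is visible
# to anyone who can read the indexable representation.
most_common_token, most_common_count = counts.most_common(1)[0]
```

Because the mapping is deterministic, the token for "twinkle" appears twice, which is exactly the frequency signal the slide warns about.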
Sample Indexable Representation
  - Second, represent documents as an encrypted set-of-words
  - Prevents attacks on a single indexable representation
  - Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
  [Diagram: corpus of indexable representations (Doc 1, Doc 2, Doc 3) over the tokens AAA, BBB, CCC, with the aggregate document frequency of each token]
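A small sketch of the set-of-words step and of the aggregate attack surface it leaves behind. The toy corpus and the token names are placeholders chosen for illustration, not data from the talk.

```python
from collections import Counter


def set_of_words(tokens):
    """Drop order and repetition: keep each token once, in sorted order."""
    return sorted(set(tokens))


def document_frequency(corpus):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc_tokens in corpus:
        df.update(set(doc_tokens))
    return df


# Hypothetical corpus of already-encrypted tokens.
corpus = [
    set_of_words(["AAA", "AAA", "BBB", "CCC"]),
    set_of_words(["BBB", "CCC"]),
    set_of_words(["CCC"]),
]
df = document_frequency(corpus)
```

Within one document every token now appears once, but `df` still differs per token across the corpus, which is the aggregate signal an attacker could match against known word statistics.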
Sample Indexable Representation
  - Third, set-of-words representation + padding (BW = 3)
  [Diagram: corpus of indexable representations and aggregate document frequency; token labels AAA BBB CCC BBB CCC CCC]

PrivatePond Indexable Representation
  - Set-of-words representation + padding (BW = 3)
  [Diagram: corpus of indexable representations and aggregate document frequency; token labels AAA BBB CCC AAA BBB CCC AAA BBB CCC]
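The slides do not spell out the padding algorithm. One plausible reading, assumed here, is that tokens are ranked by document frequency, grouped into buckets of width BW, and added to extra documents until every token in a bucket has the same document frequency. The function name and the rule for picking which documents receive padding are hypothetical.

```python
from collections import Counter


def pad_corpus(corpus, bw):
    """Pad documents so that tokens in each bucket of `bw` share one document frequency.

    Assumption: the padded token goes into the first documents that lack it.
    """
    docs = [set(d) for d in corpus]
    df = Counter(token for d in docs for token in d)
    # Rank tokens by descending document frequency (ties broken by name).
    ranked = sorted(df, key=lambda t: (-df[t], t))
    for start in range(0, len(ranked), bw):
        bucket = ranked[start:start + bw]
        target = df[bucket[0]]  # highest document frequency in this bucket
        for token in bucket:
            deficit = target - df[token]
            for d in docs:
                if deficit == 0:
                    break
                if token not in d:
                    d.add(token)  # padding: a false positive for this token
                    deficit -= 1
    return [sorted(d) for d in docs]


corpus = [["AAA", "BBB", "CCC"], ["BBB", "CCC"], ["CCC"]]
padded = pad_corpus(corpus, bw=3)
padded_df = Counter(token for d in padded for token in d)
```

With BW = 3 all three tokens end up with the same document frequency, at the cost of documents that now match tokens they never contained. With BW = 1 each token is its own bucket and nothing is padded.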
PrivatePond Indexable Representation: Impact on Search Quality
  - Lose proximity-based search
  - Lose term frequency
  - Padding of tokens introduces false positives
  What is the effect of the indexable representation on search quality?
Evaluation
  - Data: sample of Simple Wikipedia (small corpus); full Simple Wikipedia (large corpus)
  - Query workload of 10K queries
  - Evaluation performed with MySQL

Ranking Models
  - TFIDF (as implemented in MySQL FULLTEXT)
  - PageRank
  - Combination of ranking models
  - Measure change in search quality due to the indexable representation

Search Quality Metrics
  [Diagram: the original corpus goes through a search engine to produce ranked results, the gold list; the indexable representation goes through a search engine to produce ranked results, the pond list]
Example: Search Quality Metrics
  - N: consider documents ranked from 1 to N
  - P(N) = |gold list ∩ pond list| / N
  - P(3) = 2/3
  - Two additional metrics (included in the paper):
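The P(N) metric defined on this slide can be sketched as a short Python function. The document identifiers below are made up for illustration, chosen so that two of the top three results overlap, as in the slide's P(3) = 2/3 example.

```python
def precision_at_n(gold, pond, n):
    """P(N): share of the top-N gold results that also appear in the top-N pond results."""
    if n <= 0:
        raise ValueError("n must be positive")
    return len(set(gold[:n]) & set(pond[:n])) / n


gold_list = ["d1", "d2", "d3", "d4"]  # hypothetical ranking on the original corpus
pond_list = ["d1", "d3", "d5", "d2"]  # hypothetical ranking on the indexable representation
p3 = precision_at_n(gold_list, pond_list, 3)
```

Here the top three of each list share `d1` and `d3`, so `p3` is 2/3. The metric ignores the order of documents inside the top N.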

  • 26. Rank Perturbation
  • 28. PageRank is unaffected by the set-of-words representation
  • 30. Padding in documents with high PageRank or low document frequency
  • 32. Conclusion
      - Present the PrivatePond architecture for outsourcing search
      - Goal of balancing searchability and confidentiality
      - Leverages existing search engine infrastructure
      - Future work: alternative indexable representations
  • 33. More info at www.eecs.umich.edu/db

Editor's Notes

  1. Consider a small company’s intranet. Offload management responsibilities.
  2. Secure Boolean search on encrypted documents / secure inverted indexes for document retrieval. Transparency: seamless interaction for the user. Query run time.
  3. Traditional search architecture: a query returns a ranked list of documents.
  4. Download each encrypted document to search.
  5. So not confidential?
  6. One example to strike a balance between searchability and confidentiality.
  7. Impact on search quality: lose proximity-based search; lose term frequency; padding of tokens introduces false positives.
  8. Given a ranking model, examine the change in search quality; we do not determine the best ranking model. N: the N highest-ranked documents.
  9. Meaning of N.
  10. BW = 1
  11. Varying confidentiality and search quality characteristics.
  11. Varying confidentiality and search quality characteristics