Information Retrieval AICTE FDP at GCT Coimbatore

INFORMATION RETRIEVAL (IR)
(PRIVATE VS. PUBLIC)
VENINGSTON. K
Ph.D. Student, Department of CSE,
Government College of Technology, Coimbatore.
veningstonk@gct.ac.in

PRESENTATION OUTLINE
• Public IR
  ◦ What is Web IR?
  ◦ Overview of Web IR Technologies
  ◦ Web IR Models
  ◦ Web Search Architecture
  ◦ Semantic Matching
  ◦ Personalization in Web IR
  ◦ Challenges in Web-based IR
  ◦ Challenges in Personalizing Web IR
  ◦ Summary Note
• Private IR
  ◦ What is Private IR?
  ◦ How Does It Work?
  ◦ PIR Model
  ◦ Approaches to PIR
  ◦ PIR Properties
  ◦ Summary Note
11/December/2013 AICTE FDP on Web Application Security

WHY INFORMATION RETRIEVAL?

WEB INFORMATION RETRIEVAL
(WEB SEARCH)
• Technologies that help users find information on the web accurately, quickly, and easily

GOAL OF WEB SEARCH
• Accurate: results are relevant, comprehensive, and novel
• Efficient: response time is short
• Easy to use: good user experience
• Fast task completion

WEB USERS HEAVILY RELY ON SEARCH ENGINES

HUGE DATA CENTERS

OVERVIEW OF WEB SEARCH TECHNOLOGIES
• General Web Search, Entity Search, Facet Search, Question Answering, Multimedia Search
• Ranking, Matching, Retrieval, Document Understanding, Query Understanding, Crawling, Indexing, Result Presentation, Anti-spam
• Classification, Clustering, Ranking, Graph Learning, Tagging, Distributed Computing

WEB SEARCH ARCHITECTURE
[Figure: a query string is passed to the IR system, which searches a document corpus collected by a Web spider and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]

COMPONENT TECHNOLOGIES FOR WEB IR
• Relevance Ranking
• Importance Ranking
• Web Page Understanding
• Query Understanding
• Crawling
• Indexing
• Search Result Presentation
• Anti-Spam
• Search Log Data Mining / Web Mining

THREE IMPORTANT PROCESSES IN WEB IR
• Retrieval
  ◦ Finding documents from the inverted index
• Matching
  ◦ Calculating a relevance score for each query-document pair
• Ranking
  ◦ Ranking documents based on relevance scores, importance scores, etc.
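The three processes can be sketched on a toy corpus. This is a minimal illustration only: the documents, the function names, and the term-count relevance score are assumptions made for this sketch, not something taken from the slides or from any particular search engine.

```python
from collections import defaultdict

# Toy corpus; the documents and the scoring rule are illustrative assumptions.
DOCS = {
    1: "web search engines rank web pages",
    2: "private information retrieval hides the query",
    3: "search engines crawl and index pages",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Retrieval: collect every document containing at least one query term."""
    candidates = set()
    for term in query.split():
        candidates |= index.get(term, set())
    return candidates


def match(docs, doc_id, query):
    """Matching: score = number of query-term occurrences in the document."""
    words = docs[doc_id].split()
    return sum(words.count(term) for term in query.split())


def rank(docs, index, query):
    """Ranking: order candidates by descending score, ties broken by doc id."""
    scored = [(match(docs, d, query), d) for d in retrieve(index, query)]
    return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


INDEX = build_inverted_index(DOCS)
RANKED = rank(DOCS, INDEX, "web search")
```

In a real engine the matching step would use a model such as BM25, and the ranking step would also combine importance scores; the raw term count here only keeps the example short.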

WEB IR MODELS
• Vector Space Model (Salton 1975)
• Probabilistic Model
• Okapi or BM25 Model (Robertson and Walker 1994)
• Language Model (Ponte and Croft 1998)
• User Model

VECTOR SPACE MODEL
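As a hedged illustration of the vector space idea, the sketch below represents documents and the query as sparse TF-IDF vectors and compares them with cosine similarity. The whitespace tokenization, the `log(N/df) + 1` IDF variant, and all function names are assumptions of this sketch; other weighting schemes are equally valid.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vectors, idf): one sparse TF-IDF dict per document plus the IDF table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF variant: log(N / df) + 1, so terms found in every document keep a weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def query_vector(query, idf):
    """Weight query terms with the corpus IDF; unseen terms are ignored."""
    counts = Counter(query.split())
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Ranking then amounts to sorting documents by `cosine(query_vector(q, idf), doc_vector)` in descending order.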

PROBABILISTIC MODEL

OKAPI OR BM25 MODEL
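As a rough illustration of BM25 scoring, the sketch below uses the commonly cited form with a term-frequency saturation parameter `k1` and a length-normalization parameter `b`. The default values (`k1=1.2`, `b=0.75`), the whitespace tokenization, and the non-negative IDF variant are assumptions of this sketch rather than details from the slides.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            df = doc_freq[term]
            # Assumed IDF variant: the "+ 1" inside the log keeps it non-negative.
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Setting `b=0` switches off length normalization, and a larger `k1` lets repeated occurrences of a term keep adding to the score for longer.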

LANGUAGE MODEL

USER MODEL
• A user model is the set of personal characteristics of the user that the system maintains
• A user profile can be thought of as a user model
• Types of user models
  ◦ Depending on the user being modeled
    - Individual
    - Canonical (group)
  ◦ Depending on the acquisition method
    - Explicit (stated)
    - Implicit (inferred)

SEMANTIC MATCHING

PERSONALIZATION - ENVIRONMENTS WHERE IT IS BEING USED
• Databases
• Newsgroups
• Personal Information Management (desktop files, e-mail, bookmarks, etc.)
• News: electronic journals
• Search engines
• Web sites
• Business
  ◦ e-commerce
  ◦ e-health
  ◦ etc.

OBJECTIVES
• To enhance personalized Web search and retrieval so that it satisfies the user's search context
• To customize Web Information Retrieval (IR) for users
• To provide results specific to individual users
  ◦ This is particularly important because different users expect different information even for the same query
• To predict whether personalization is required or not
• To develop a computationally intelligent and efficient algorithm for this personalization task

PERSONALIZATION IN WEB IR [1/2]
• Web personalization is viewed as an application of data mining and machine learning techniques to build models of user behavior that can be applied to the task of predicting user needs and adapting future interactions, with the ultimate goal of improved user satisfaction.

PERSONALIZATION IN WEB IR [2/2]
• Initially, search engines were concerned only with retrieving documents relevant to a query.
• With the information overload on the web, it is increasingly difficult for search engines to satisfy individual user needs.
• Personalization has long been recognized as an avenue to greatly improve the search experience.
• It disambiguates web search by modeling the user's interests and preferences in a user profile.

PROBLEM DESCRIPTION
• Personalization in Web IR
  ◦ Customize search results according to each individual user
• Research questions in Personalized Web IR
  ◦ What to use to personalize?
    - How to model and represent past search contexts?
  ◦ How to personalize?
    - How to use it to improve search results?
  ◦ When not to personalize?
    - How to decide whether personalization is required or not?
  ◦ How to know that personalization helped?
    - How to evaluate personalized results?

GENERAL PROBLEM STATEMENT
• When a search query is issued, most search engines return the same results irrespective of the user's interests
• Lack of semantic structure, which makes it difficult for the machine to understand the information provided by the user
• Failure to identify the intention of the user
• Poor handling of inaccurate or ambiguous queries → imprecise keywords

RELATED WORKS
• Short-term personalization - bookmarks
• Long-term personalization - browsing history
• Result diversification - query reformulation
• Collaborative personalization - for a group of users
• Search interaction personalization - clicks
• Session-based personalization
• Location-based personalization
• Task-based personalization
• and so on…

ARCHITECTURE OF PERSONALIZATION BASED WEB IR
[Figure: a query string is submitted to the personalized IR system, which searches the Web document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). Feedback on the marked results drives query reformulation into a revised query and updated rankings, yielding re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]

CHALLENGES FOR WEB IR
• Distributed Data: Documents spread over millions of different web servers.
• Volatile Data: Many documents change or disappear rapidly (e.g. dead links).
• Large Volume: Billions of separate documents.
• Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% near-duplicate documents.
• Quality of Data: No editorial control, false information, poor quality writing, typos, etc.
• Heterogeneous Data: Multiple media types (images, video), languages, character sets, etc.

CHALLENGES FOR PERSONALIZATION IN WEB IR
• Moving from a system-centered approach to a user-centered approach to IR
• Modeling the user context in personalized IR
• Exploiting the user context to enhance search quality
• The privacy issues (the focus of the next part of the presentation)
• The evaluation issues

POSSIBLE APPROACHES TO INFORMATION RETRIEVAL
• Statistical approaches
  ◦ Co-occurrence of features between document and query
  ◦ Rank documents based on similarity
• Semantic approaches
  ◦ “Understand” the query, find matching documents
• User profile approaches
  ◦ User profiles store approximations of user interests

BENEFITS OF PERSONALIZED SEARCH
• Resolving ambiguity
  ◦ The profile provides a context for the query in order to reduce ambiguity.
    - Example: A profile of interests helps determine whether a user who asks about “Jaguar” means the animal or the car.
• Revealing hidden treasures
  ◦ The profile helps surface the most relevant documents, which may be hidden beyond the first results page.
    - Example: The owner of an iPhone searches for Google Android. Pages referring to both would be the most interesting.

WHERE TO APPLY USER PROFILES?
• The user profile can be applied in several ways
  ◦ To modify the query itself → pre-processing
    - Query expansion → the user profile is applied to add terms to the query
  ◦ To process the results of a query → post-processing
  ◦ To present document snippets
  ◦ Adaptation of meta-search
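The pre-processing and post-processing uses of a profile can be sketched as follows. This is a minimal illustration assuming the profile is a plain dictionary of interest terms and weights; the function names, the `max_terms` parameter, and the additive scoring rule are hypothetical choices made for this sketch.

```python
def expand_query(query, profile, max_terms=2):
    """Pre-processing: append the strongest profile terms not already in the query."""
    terms = query.split()
    present = set(terms)
    # Strongest interests first; ties broken alphabetically for determinism.
    by_weight = sorted(profile.items(), key=lambda item: (-item[1], item[0]))
    extra = [term for term, _ in by_weight if term not in present]
    return terms + extra[:max_terms]


def rerank(results, profile):
    """Post-processing: reorder result snippets by their summed profile weight."""

    def profile_score(text):
        return sum(profile.get(word, 0.0) for word in text.lower().split())

    # sorted() is stable, so results with equal scores keep the engine's order.
    return sorted(results, key=profile_score, reverse=True)


# Hypothetical profile of a user mostly interested in cars.
PROFILE = {"car": 0.9, "engine": 0.6, "zoo": 0.1}
RESULTS = [
    "jaguar habitat in the rainforest",
    "jaguar car dealership prices",
    "jaguar engine specifications",
]
EXPANDED = expand_query("jaguar", PROFILE)
RERANKED = rerank(RESULTS, PROFILE)
```

For the ambiguous query "jaguar", the expansion adds the user's two strongest interests, and the re-ranking moves the car-related snippets ahead of the wildlife one.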

VARIATIONS OF USER PROFILE USAGE

SUMMARY ON IR
• Web Information Retrieval is a very challenging yet exciting area!
• Solution: learn about the individual user in order to match the query with documents
  ◦ Personalized Web Information Retrieval
    - It promises significant quality improvements; however, current approaches are far from optimal
    - Thus, more research is necessary in the field of IR
• “Computational Intelligence” could be adopted by search tools to effectively manage the search, retrieval, filtering and presentation of relevant information.

PRIVATE INFORMATION RETRIEVAL (PIR) [1995]
• Goal: allow the user to query a database while hiding the identity of the data items retrieved.
• Note: PIR hides the identity of the data items, not the existence of the interaction with the user.
• Motivation: patent databases, stock quotes, web access and so on.
• Paradox(?): imagine buying in a store without the seller knowing what you buy.
(Encrypting requests is useful against third parties, not against the owner of the data.)

WHAT IS PRIVATE INFORMATION RETRIEVAL?
• Real-World Example:
  ◦ Suppose there is a movie database and we want to find information on the movie 'Indian'
  ◦ We do not want anyone to know about our interest in this movie.

THE GOAL OF PIR
• Suppose there is a movie database and we want to find information on the movie 'Endiran'
• We do not want the database operator to know about our interest in this movie.
• Users' intentions are to be kept secret

HOW DOES IT WORK?
• Very simple approach
  ◦ Download the entire database
• Improved approach
  ◦ Suppose there is a database with blocks D_1, ..., D_r.
  ◦ A client wants to retrieve block D_α from the database in such a way that the database operator learns nothing about α.
  ◦ Do this without downloading the entire database.

GOLDBERG'S SCHEME
• We can represent a database of r blocks as an r × s matrix D and get the α-th block (α-th row) of D using simple linear algebra
  ◦ D_α = e_α · D
  ◦ where e_α = [0 0 … 1 … 0] is a vector of all zeros, except for a one in coordinate α.
• There are ℓ servers, each with a copy of the database.
• We secret-share e_α into v_1, ..., v_ℓ and send one share to each server.
• Each server computes and sends its response
  ◦ r_i = v_i · D
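The linear-algebra idea above can be sketched in a few lines. Note the assumptions: this sketch uses simple additive secret sharing modulo a small prime, chosen here only for brevity, so it needs every server's response and cannot tolerate missing or wrong ones; a threshold scheme such as Shamir's would be needed for that. The modulus `P`, the function names, and the 0-based index `alpha` are all choices made for this illustration.

```python
import secrets

P = 257  # small prime modulus; assumed so that every byte value fits in the field


def share_unit_vector(alpha, r, n_servers):
    """Split e_alpha (length r) into n_servers additive shares modulo P."""
    e_alpha = [1 if j == alpha else 0 for j in range(r)]
    # All shares but the last are uniformly random vectors.
    shares = [[secrets.randbelow(P) for _ in range(r)] for _ in range(n_servers - 1)]
    # The last share is chosen so that all shares sum to e_alpha (mod P).
    last = [(e_alpha[j] - sum(share[j] for share in shares)) % P for j in range(r)]
    return shares + [last]


def server_response(v, database):
    """Each server returns r_i = v_i . D, a vector of s field elements."""
    n_cols = len(database[0])
    return [
        sum(v[j] * database[j][c] for j in range(len(database))) % P
        for c in range(n_cols)
    ]


def reconstruct(responses):
    """Adding the responses column-wise gives e_alpha . D, i.e. row alpha of D."""
    return [sum(column) % P for column in zip(*responses)]


def retrieve_block(alpha, database, n_servers=3):
    """Run the whole exchange locally: share, query every server, recombine."""
    queries = share_unit_vector(alpha, len(database), n_servers)
    return reconstruct([server_response(v, database) for v in queries])
```

Any proper subset of the additive shares is uniformly random, so in this sketch a server learns nothing about α unless all servers pool their queries.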

GOLDBERG'S SCHEME
• The responses r_1, ..., r_k are secret shares of D_α (k is the number of responses).
• What happens if some of the responses are wrong?

AOL SEARCH LOG DATA SCANDAL
Queries logged for user #4417749:
• clothes for age 60
• 60 single men
• best retirement city
• jarrett arnold
• jack t. arnold
• jaylene and jarrett arnold
• gwinnett county yellow pages
• rescue of older dogs
• movies for dogs
• sinus infection
Identified as: Thelma Arnold, 62-year-old widow, Lilburn, Georgia

OBSERVATION
• The owners of databases know a lot about the users!
• This poses a risk to users' privacy.
  ◦ E.g. consider a database with stock prices
• What can we do?
  ◦ Trust them to protect our privacy, or
  ◦ Use cryptography

HOW CAN CRYPTO HELP?
Note: This problem has nothing to do with secure communication!
[Figure: user U communicating with database D]

CURRENT SETTING
[Figure: user U and database D connected by a secure link]
A new primitive: Private Information Retrieval (PIR)

MODELING PIR
• Server: holds an n-bit string x
  ◦ n should be thought of as very large
• User: desires
  ◦ to retrieve x_i and
  ◦ to keep i private

PRIVATE PROTOCOL FOR INFORMATION RETRIEVAL
[Figure: the SERVER holds x = x_1, x_2, ..., x_n ∈ {0,1}^n; the USER holds an index i ∈ {1, ..., n} and obtains x_i. The figure also labels indices i and j.]

NON-PRIVATE PROTOCOL
• The USER, holding i ∈ {1, ..., n}, sends i to the SERVER, which holds x = x_1, x_2, ..., x_n and returns x_i.
• There is NO privacy preservation.
• Communication cost: log n

TRIVIAL PRIVATE PROTOCOL
• The server sends the entire database x = x_1, x_2, ..., x_n to the user, who reads off x_i.
• Information-theoretic privacy.
• Communication cost: n
• Is this optimal? “The number of bits communicated between U and S has to be smaller than n.”

PROBLEM
• In any 1-server PIR with information-theoretic privacy, the communication is at least n.

POSSIBLE SOLUTIONS
• The user also asks for additional random indices.
  ◦ Drawback: reveals a lot of information
• Employ general crypto protocols to compute x_i privately.
  ◦ Drawback: highly inefficient (polynomial in n).
• Anonymity.
  ◦ Note: hides the identity of the user, not the fact that x_i is retrieved.

ANONYMITY - EXAMPLE
[Figure: Original Data vs. Anonymized Data]

TWO APPROACHES
• Information-Theoretic PIR
  ◦ Replicate the database among k servers.
  ◦ Unconditional privacy against t servers.
• Computational PIR
  ◦ Computational privacy, based on cryptographic assumptions.

INFORMATION THEORETIC PRIVACY
(PERFECT PRIVACY)
• The distribution of the queries the user sends to any server is independent of the index he/she wishes to retrieve.
• This means that no server can gain any information about the user's interest, regardless of the server's computational power.

COMPUTATIONAL PRIVACY
• The distributions of the queries the user sends to any server for different indices are computationally indistinguishable.
• This means that no server can gain any information about the user's interest, provided that the server is computationally bounded.

COMMUNICATION COST
• Multiple servers, information-theoretic PIR:
  ◦ 2 servers, comm. n^(1/2)
  ◦ k servers, comm. n^(1/k)
  ◦ log n servers, comm. poly(log n)
• Single server, computational PIR:
  ◦ Comm. poly(log n)

K-SERVER PIR
[Figure: user U, holding index i, interacts with servers S_1, S_2, ..., S_k, each of which stores the same database x ∈ {0,1}^n.]
• Correctness: User obtains x_i
• Privacy: No single server gets information about i
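A classic way to see how replication helps is the basic 2-server XOR protocol, sketched below. It is a well-known textbook construction rather than something taken from these slides, and it is only meant to illustrate correctness and privacy: each query is still n bits long, so it does not achieve the sublinear communication bounds. Function names and the 0-based index are assumptions of this sketch.

```python
import secrets


def make_queries(i, n):
    """Return two n-bit selection vectors that differ only at position i."""
    q1 = [secrets.randbelow(2) for _ in range(n)]  # uniformly random subset
    q2 = list(q1)
    q2[i] ^= 1  # flip membership of index i for the second server
    return q1, q2


def answer(x, q):
    """Server side: XOR of all database bits selected by the query vector."""
    bit = 0
    for x_j, q_j in zip(x, q):
        bit ^= x_j & q_j
    return bit


def retrieve_bit(i, x):
    """User side: every selected bit cancels except x[i]."""
    q1, q2 = make_queries(i, len(x))
    # In a deployment q1 and q2 go to two servers that do not collude.
    return answer(x, q1) ^ answer(x, q2)
```

Taken on its own, each query vector is a uniformly random subset of the indices, so neither server alone learns anything about i; privacy is lost only if the two servers compare their queries.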

PIR PROPERTIES
• Database input: B_1, B_2, …, B_w
• User input: index i = 1, …, w
• Correctness: the user learns B_i
• Secrecy (of the user): the database does not learn i
• Non-triviality: the total communication is < w
Note: secrecy of the database is not required.
These properties need to be defined more formally! The user and the database are modeled as polynomial-time randomized interactive algorithms.

PIR PROPERTIES
• Correctness
  ◦ In every invocation of the protocol, the user retrieves the bit he/she is interested in (i.e. x_i)
• Privacy
  ◦ In every invocation of the protocol, no server gains any information about the index of the bit retrieved by the user (i.e. i).

PIR DOESN'T EXIST [1/4]
Correctness, non-triviality and secrecy CANNOT be satisfied simultaneously.
• Def: A transcript T is possible for (i, B) if P(T(i, B) = T) > 0
• Take some T', and look where it is possible:
[Figure: grid with databases B as rows and indices i as columns; the cells (i, B) for which T' is possible are marked.]

PIR DOESN'T EXIST [2/4]
secrecy → if T' is possible for some B and i, then it is possible for B and all the other i's
[Figure: the same grid of databases B against indices i; every row that contained a T' mark is now marked for all indices i.]

PIR DOESN'T EXIST [3/4]
non-triviality → length(transcript) < length(database)
↓
# transcripts < # databases
↓
there has to exist a T' that is possible for two databases B_0 and B_1
[Figure: the same grid; the two fully marked rows are labeled B_0 and B_1.]

PIR DOESN'T EXIST [4/4]
• B_0 and B_1 differ on at least one index i'. By secrecy, T' is possible for both (i', B_0) and (i', B_1), so if i' is the input of the user, the same transcript would have to yield two different answers:
correctness → contradiction
[Figure: the same grid; column i' is highlighted across rows B_0 and B_1.]

THUS, IDEAL PIR DOESN'T EXIST!
• How to bypass the impossibility result?
• Two ideas:
  ◦ limit the computing power of a cheating database
  ◦ use a larger number of “independent” databases

SUMMARY
• Complexity of PIR
  ◦ Communication
  ◦ Computation
• Possible Extensions
  ◦ Symmetric PIR
    - User may not learn any item other than the one he/she requested
  ◦ Searching by keywords
  ◦ Public-key encryption with keyword search

REFERENCES
• X. Tao, Y. Li, and N. Zhong, “A Personalized Ontology Model for Web Information Gathering,” IEEE Trans. Knowledge and Data Engg., vol. 23, no. 4, pp. 496-511, April 2011.
• M. Strohmaier and M. Kröll, “Acquiring Knowledge about Human Goals from Search Query Logs,” ACM Transactions on Information Systems, March 2011.
• K.W.-T. Leung, W. Ng, and D.L. Lee, “Deriving Concept-Based User Profiles from Search Engine Logs,” IEEE Trans. Knowledge and Data Engg., vol. 22, no. 7, pp. 969-982, July 2010.
• Z. Dou, R. Song, J.-R. Wen, and X. Yuan, “Evaluating the Effectiveness of Personalized Web Search,” IEEE Trans. Knowledge and Data Engg., vol. 21, no. 8, pp. 1178-1190, Aug. 2009.
• Y. Li and N. Zhong, “Mining Ontology for Automatically Acquiring Web User Information Needs,” IEEE Trans. Knowledge and Data Engg., vol. 18, no. 4, pp. 554-568, April 2006.
• F. Liu, C. Yu, and W. Meng, “Personalized Web Search for Improving Retrieval Effectiveness,” IEEE Trans. Knowledge and Data Engg., vol. 16, no. 1, pp. 28-40, January 2004.
• B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private Information Retrieval,” Journal of the ACM, vol. 45, no. 6, pp. 965-982, 1998 (earlier version presented at FOCS 1995).
