1
Web search engines
Web search engines
Rooted in Information Retrieval (IR) systems
•Prepare a keyword index for corpus
•Respond to keyword queries with a ranked list of
documents.
ARCHIE
•Earliest application of rudimentary IR systems to
the Internet
•Title search across sites serving files over FTP
3
Boolean queries: Examples
 Simple queries involving relationships
between terms and documents
• Documents containing the word Java
• Documents containing the word Java but not
the word coffee
 Proximity queries
• Documents containing the phrase Java beans
or the term API
• Documents where Java and island occur in
the same sentence
4
Document preprocessing
 Tokenization
• Filtering away tags
• Tokens regarded as nonempty sequences of
characters excluding spaces and
punctuation.
• Token represented by a suitable integer, tid,
typically 32 bits
• Optional: stemming/conflation of words
• Result: document (did) transformed into a
sequence of integers (tid, pos)
5
Storing tokens
 Straight-forward implementation using a
relational database
• Example figure
• Space needed scales to almost 10 times the original corpus size
 Accesses to table show common pattern
• reduce the storage by mapping tids to a
lexicographically sorted buffer of (did, pos)
tuples.
• Indexing = transposing document-term matrix
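To make the "transposing" idea concrete, here is a minimal in-memory sketch (not the implementation discussed above; names such as build_index are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    """docs: mapping did -> list of tids in document order.
    Returns tid -> lexicographically sorted list of (did, pos) postings."""
    index = defaultdict(list)
    for did, tids in docs.items():
        for pos, tid in enumerate(tids):
            index[tid].append((did, pos))
    for postings in index.values():
        postings.sort()                      # sorted buffer of (did, pos) tuples
    return dict(index)

# toy corpus of term IDs
docs = {1: [10, 42, 10], 2: [42, 7]}
print(build_index(docs))   # {10: [(1, 0), (1, 2)], 42: [(1, 1), (2, 0)], 7: [(2, 1)]}
```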
6
Two variants of the inverted index data structure, usually stored on disk. The simpler
version in the middle does not store term offset information; the version to the right stores
term
offsets. The mapping from terms to documents and positions (written as
“document/position”) may
be implemented using a B-tree or a hash-table.
7
Storage
 For dynamic corpora
• Berkeley DB storage manager
• Can frequently add, modify and delete
documents
 For static collections
• Index compression techniques (to be
discussed)
8
Stopwords
 Function words and connectives
 Appear in large number of documents and little
use in pinpointing documents
 Indexing stopwords
• Stopwords not indexed
 For reducing index space and improving performance
• Replace stopwords with a placeholder (to remember
the offset)
 Issues
• Queries containing only stopwords ruled out
• Polysemous words that are stopwords in one sense
but not in others
 E.g.; can as a verb vs. can as a noun
9
Stemming
 Conflating words to help match a query term with a
morphological variant in the corpus.
 Remove inflections that convey parts of speech, tense
and number
 E.g.: university and universal both stem to universe.
 Techniques
• morphological analysis (e.g., Porter's algorithm)
• dictionary lookup (e.g., WordNet).
 Stemming may increase recall but at the price of
precision
• Abbreviations, polysemy and names coined in the technical and
commercial sectors
• E.g.: Stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to
“gate”, may be bad !
10
Batch indexing and updates
 Incremental indexing
• Time-consuming due to random disk IO
• High level of disk block fragmentation
 Simple sort-merges.
• To replace the indexed update of variable-
length postings
 For a dynamic collection
• single document-level change may need to
update hundreds to thousands of records.
• Solution : create an additional “stop-press”
index.
11
Maintaining indices over dynamic collections.
12
Stop-press index
 Collection of document in flux
• Model document modification as deletion followed by insertion
• Documents in flux represented by a signed record (d,t,s)
• “s” specifies if “d” has been deleted or inserted.
 Getting the final answer to a query
• Main index returns a document set D0.
• Stop-press index returns two document sets
 D+ : documents not yet indexed in D0 matching the query
 D- : documents matching the query removed from the collection
since D0 was constructed.
 Stop-press index getting too large
• Rebuild the main index
 signed (d, t, s) records are sorted in (t, d, s) order and merge-
purged into the master (t, d) records
• Stop-press index can be emptied out.
13
Index compression techniques
 Compressing the index so that much of it
can be held in memory
• Required for high-performance IR installations
(as with Web search engines),
 Redundancy in index storage
• Storage of document IDs.
 Delta encoding
• Sort Doc IDs in increasing order
• Store the first ID in full
• Subsequently store only difference (gap) from
previous ID
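A small sketch of the delta (gap) encoding just described; gaps/ungaps are illustrative names:

```python
def gaps(sorted_doc_ids):
    """Sorted doc IDs -> first ID followed by gaps from the previous ID."""
    out, prev = [], 0
    for d in sorted_doc_ids:
        out.append(d - prev)
        prev = d
    return out

def ungaps(gap_list):
    """Inverse: cumulative sum restores the original doc IDs."""
    out, total = [], 0
    for g in gap_list:
        total += g
        out.append(total)
    return out

ids = [3, 7, 8, 20, 21]
print(gaps(ids))                 # [3, 4, 1, 12, 1]
assert ungaps(gaps(ids)) == ids
```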
14
Encoding gaps
 Small gap must cost far fewer bits than a
document ID.
 Binary encoding
• Optimal when all symbols are equally likely
 Unary code
• optimal if probability of large gaps decays
exponentially
15
Encoding gaps
 Gamma code
• Represent gap x as
 Unary code for 1 + ⌊log x⌋, followed by
 x − 2^⌊log x⌋ represented in binary (⌊log x⌋ bits)
 Golomb codes
• Further enhancement
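A sketch of gamma coding for a gap x ≥ 1, following the description above (the unary part is written here as ⌊log x⌋ ones followed by a zero; other bit conventions exist):

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer x, returned as a bit string."""
    assert x >= 1
    b = x.bit_length() - 1                       # floor(log2 x)
    unary = "1" * b + "0"                        # unary code for 1 + floor(log2 x)
    binary = format(x - (1 << b), "b").zfill(b) if b else ""
    return unary + binary                        # 2*b + 1 bits in total

def gamma_decode(bits):
    b = bits.index("0")                          # length of the unary prefix
    rest = bits[b + 1 : 2 * b + 1]
    return (1 << b) + (int(rest, 2) if rest else 0)

for x in (1, 2, 5, 9, 13):
    code = gamma_encode(x)
    assert gamma_decode(code) == x
    print(x, code)
```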
16
Lossy compression mechanisms
 Trading off space for time
 collect documents into buckets
• Construct inverted index from terms to bucket
IDs
• Document IDs shrink to half their size.
 Cost: time overheads
• For each query, all documents in that bucket
need to be scanned
 Solution: index documents in each bucket
separately
• E.g.: Glimpse
17
General dilemmas
 Messy updates vs. High compression rate
 Storage allocation vs. Random I/Os
 Random I/O vs. large scale
implementation
18
Relevance ranking
 Keyword queries
• In natural language
• Not precise, unlike SQL
 Boolean decision for response unacceptable
• Solution
 Rate each document for how likely it is to satisfy the user's
information need
 Sort in decreasing order of the score
 Present results in a ranked list.
 No algorithmic way of ensuring that the ranking
strategy always favors the information need
• Query: only a part of the user's information need
19
Responding to queries
 Set-valued response
• Response set may be very large
 (E.g., by recent estimates, over 12 million Web
pages contain the word java.)
 Demanding selective query from user
 Guessing user's information need and
ranking responses
 Evaluating rankings
20
Evaluating procedure
 Given benchmark
• Corpus of n documents D
• A set of queries Q
• For each query q ∈ Q, an exhaustive set of
relevant documents D_q ⊆ D identified
manually
 Query submitted to system
• Ranked list of documents
(d_1, d_2, ..., d_n) retrieved
• compute a 0/1 relevance list (r_1, r_2, ..., r_n)
 r_i = 1 iff d_i ∈ D_q
 r_i = 0 otherwise
21
Recall and precision
 Recall at rank
• Fraction of all relevant documents included in
the top k responses (d_1, d_2, ..., d_k), for k ≥ 1
• recall(k) = (1/|D_q|) Σ_{i=1..k} r_i
 Precision at rank k
• Fraction of the top k responses that are
actually relevant
• precision(k) = (1/k) Σ_{i=1..k} r_i
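A minimal sketch of these two definitions over a 0/1 relevance list (variable names are illustrative):

```python
def recall_at(r, k, n_relevant):
    """r: 0/1 relevance list of the ranked responses; n_relevant = |D_q|."""
    return sum(r[:k]) / n_relevant

def precision_at(r, k):
    return sum(r[:k]) / k

r = [1, 0, 1, 1, 0]                     # relevance of the top 5 responses
print(recall_at(r, 3, n_relevant=4))    # 0.5
print(precision_at(r, 3))               # 0.666...
```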
22
Other measures
 Average precision
• Sum of precision at each relevant hit position in the
response list, divided by the total number of relevant
documents
• avg.precision = (1/|D_q|) Σ_{k=1..|D|} r_k · precision(k)
• avg.precision = 1 iff engine retrieves all relevant
documents and ranks them ahead of any irrelevant
document
 Interpolated precision
• To combine precision values from multiple queries
• Gives precision-vs.-recall curve for the benchmark.
 For each query, take the maximum precision obtained for the
query at any recall greater than or equal to that recall level
 average them together for all queries
 Others like measures of authority, prestige etc
23
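Continuing the previous slide, a small sketch of average precision as defined there:

```python
def average_precision(r, n_relevant):
    """Sum of precision at each relevant rank, divided by |D_q| = n_relevant."""
    total, hits = 0.0, 0
    for k, rel in enumerate(r, start=1):
        if rel:
            hits += 1
            total += hits / k            # precision(k) at a relevant position
    return total / n_relevant

print(average_precision([1, 0, 1, 1, 0], n_relevant=3))   # (1 + 2/3 + 3/4) / 3 ~= 0.806
```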
Precision-Recall tradeoff
 Interpolated precision cannot increase with
recall
• Interpolated precision at recall level 0 may be less
than 1
 At level k = 0
• Precision (by convention) = 1, Recall = 0
 Inspecting more documents
• Can increase recall
• Precision may decrease
 we will start encountering more and more irrelevant
documents
 Search engine with a good ranking function will
generally show a negative relation between
recall and precision.
24
Precision and interpolated precision plotted against recall for the given relevance vector.
Missing r_k are zeroes.
25
The vector space model
 Documents represented as vectors in a
multi-dimensional Euclidean space
• Each axis = a term (token)
 Coordinate of document d in direction of
term t determined by:
• Term frequency TF(d,t)
 number of times term t occurs in document d,
scaled in a variety of ways to normalize document
length
• Inverse document frequency IDF(t)
 to scale down the coordinates of terms that occur
in many documents
26
Term frequency
 Raw or normalized count of term t in document d
• TF(d,t) = n(d,t), or
• TF(d,t) = n(d,t) / Σ_τ n(d,τ), or TF(d,t) = n(d,t) / max_τ n(d,τ)
 Cornell SMART system uses a smoothed
version
• TF(d,t) = 0 if n(d,t) = 0
• TF(d,t) = 1 + log(1 + log(n(d,t))) otherwise
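A one-function sketch of the SMART-style smoothed term frequency above:

```python
import math

def tf_smart(n_dt):
    """Smoothed TF from the raw count n(d,t): 0 if absent, else 1 + log(1 + log n)."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))

for n in (0, 1, 5, 100):
    print(n, round(tf_smart(n), 3))      # damped growth: 0, 1.0, 1.959, 2.724
```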
27
Inverse document frequency
 Given
• D is the document collection and D_t is the set
of documents containing t
 Formulae
• mostly dampened functions of |D| / |D_t|
• SMART
 IDF(t) = log(1 + |D| / |D_t|)
28
Vector space model
 Coordinate of document d in axis t
• d_t = TF(d,t) · IDF(t)
• d transformed to the vector d⃗ in the TFIDF-space
 Query q
• Interpreted as a document
• Transformed to the vector q⃗ in the same TFIDF-space
as d
29
Measures of proximity
 Distance measure
• Magnitude of the vector difference |d⃗ − q⃗|
• Document vectors must be normalized to unit
(L_1 or L_2) length
 Else shorter documents dominate (since queries
are short)
 Cosine similarity
• cosine of the angle between d⃗ and q⃗
 Shorter documents are penalized
30
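Putting the last few slides together, a compact sketch that builds TFIDF vectors and ranks documents against a query by cosine similarity (raw counts are used for TF here; the SMART damping could be substituted):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(1 + n / df_t) for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [["java", "island"], ["java", "api", "beans"], ["coffee", "island"]]
vectors = tfidf_vectors(corpus + [["java", "beans"]])   # treat the query as a document
query, doc_vectors = vectors[-1], vectors[:-1]
print([round(cosine(query, d), 3) for d in doc_vectors])
```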
Relevance feedback
 Users learning how to modify queries
• Response list must have at least some relevant
documents
• Relevance feedback
 `correcting' the ranks to the user's taste
 automates the query refinement process
 Rocchio's method
• Folding-in user feedback
• To the query vector q⃗
 Add a weighted sum of vectors for relevant documents D+
 Subtract a weighted sum of the irrelevant documents D-
• q⃗' = α q⃗ + β Σ_{d ∈ D+} d⃗ − γ Σ_{d ∈ D-} d⃗
31
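A minimal sketch of Rocchio's update from the previous slide; the weights α, β, γ below are illustrative defaults, not prescribed values:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query and each document are {term: weight} dicts; returns the new query q'."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        w += beta * sum(d.get(t, 0.0) for d in relevant)      # pull toward D+
        w -= gamma * sum(d.get(t, 0.0) for d in irrelevant)   # push away from D-
        new_q[t] = max(w, 0.0)                                # clip negative weights
    return new_q

q = {"java": 1.0}
d_plus = [{"java": 0.5, "api": 0.8}]
d_minus = [{"coffee": 0.9, "java": 0.2}]
print(rocchio(q, d_plus, d_minus))
```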
Relevance feedback (contd.)
 Pseudo-relevance feedback
• D+ and D- generated automatically
 E.g.: Cornell SMART system
 top 10 documents reported by the first round of
query execution are included in D+
• γ typically set to 0; D- not used
 Not a commonly available feature
• Web users want instant gratification
• System complexity
 Executing the second-round query is slower and
more expensive for major search engines
32
Ranking by odds ratio
 R : Boolean random variable which
represents the relevance of document d
w.r.t. query q.
 Ranking documents by their odds ratio for
relevance
• Pr(R | d, q) / Pr(R̄ | d, q)
= [Pr(d, R, q) / Pr(d, q)] / [Pr(d, R̄, q) / Pr(d, q)]
= [Pr(d | R, q) Pr(R | q)] / [Pr(d | R̄, q) Pr(R̄ | q)]
 Approximating probability of d by product
of the probabilities of individual terms in d
• Pr(d | R, q) ≈ Π_t Pr(x_t | R, q) and
Pr(d | R̄, q) ≈ Π_t Pr(x_t | R̄, q),
where x_t indicates whether term t occurs in d
• Approximately, with a_{q,t} = Pr(x_t = 1 | R, q) and
b_{q,t} = Pr(x_t = 1 | R̄, q):
Pr(R | d, q) / Pr(R̄ | d, q) ∝ Π_{t ∈ q ∩ d} [a_{q,t} (1 − b_{q,t})] / [b_{q,t} (1 − a_{q,t})]
33
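Before moving on, a small sketch of ranking by the log odds ratio from the previous slide under the term-independence approximation; the per-term estimates a and b are assumed to be given (in practice they come from feedback or collection statistics):

```python
import math

def log_odds_score(doc_terms, query_terms, a, b):
    """a[t] ~ Pr(x_t=1 | R, q), b[t] ~ Pr(x_t=1 | not R, q); both strictly in (0, 1)."""
    score = 0.0
    for t in query_terms & doc_terms:
        score += math.log(a[t] * (1 - b[t])) - math.log(b[t] * (1 - a[t]))
    return score

a = {"java": 0.8, "island": 0.6}        # estimated from relevant documents
b = {"java": 0.3, "island": 0.1}        # estimated from the whole collection
doc = {"java", "island", "beach"}
print(round(log_odds_score(doc, {"java", "island"}, a, b), 3))
```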
Bayesian Inferencing
Bayesian inference network for relevance ranking. A
document is relevant to the extent that setting its
corresponding belief node to true lets us assign a high
degree of belief in the node corresponding to the query.
Manual specification of
mappings between terms
to approximate concepts.
34
Bayesian Inferencing (contd.)
 Four layers
1.Document layer
2.Representation layer
3.Query concept layer
4.Query
 Each node is associated with a random
Boolean variable, reflecting belief
 Directed arcs signify that the belief of a
node is a function of the belief of its
immediate parents (and so on..)
35
Bayesian Inferencing systems
 2 & 3 same for basic vector-space IR
systems
 Verity's Search97
• Allows administrators and users to define
hierarchies of concepts in files
 Estimation of relevance of a document d
w.r.t. the query q
• Set the belief of the corresponding node to 1
• Set all other document beliefs to 0
• Compute the belief of the query
• Rank documents in decreasing order of belief
that they induce in the query
36
Other issues
 Spamming
• Adding popular query terms to a page unrelated to
those terms
• E.g.: Adding “Hawaii vacation rental” to a page about
“Internet gambling”
• Little setback due to hyperlink-based ranking
 Titles, headings, meta tags and anchor-text
• TFIDF framework treats all terms the same
• Meta search engines:
 Assign weight age to text occurring in tags, meta-tags
• Using anchor-text on pages u which link to v
 Anchor-text on u offers valuable editorial judgment about v as
well.
37
Other issues (contd..)
 Including phrases to rank complex queries
• Operators to specify word inclusions and
exclusions
• With operators and phrases
queries/documents can no longer be treated
as ordinary points in vector space
 Dictionary of phrases
• Could be cataloged manually
• Could be derived from the corpus itself using
statistical techniques
• Two separate indices:
 one for single terms and another for phrases
38
Corpus derived phrase dictionary
 Two terms t_1 and t_2
 Null hypothesis = occurrences of t_1 and t_2 are
independent
 To the extent the pair violates the null hypothesis, it is
likely to be a phrase
• Measuring violation with likelihood ratio of the
hypothesis
• Pick phrases that violate the null hypothesis
with large confidence
 Contingency table built from statistics
• k_11 = #(t_1, t_2): both terms occur
• k_10 = #(t_1, t̄_2) and k_01 = #(t̄_1, t_2): only one occurs
• k_00 = #(t̄_1, t̄_2): neither occurs
39
Corpus derived phrase dictionary
 Hypotheses
• Null hypothesis H_0: the two terms occur independently,
with per-term probabilities p_1 and p_2
• Alternative hypothesis: a separate probability
p_00, p_01, p_10, p_11 for each cell of the contingency table
• Likelihood ratio
λ = [max_p H(p; k)] / [max_p′ H(p′; k)],
maximizing the numerator over the null hypothesis and the
denominator over the alternative
 Under the null hypothesis
H(p_1, p_2; k_00, k_01, k_10, k_11)
= ((1−p_1)(1−p_2))^k_00 · ((1−p_1)p_2)^k_01 · (p_1(1−p_2))^k_10 · (p_1 p_2)^k_11
 Under the alternative hypothesis
H(p_00, p_01, p_10, p_11; k_00, k_01, k_10, k_11)
= p_00^k_00 · p_01^k_01 · p_10^k_10 · p_11^k_11
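A sketch of this test on a 2x2 contingency table: it plugs in the maximum-likelihood estimates under both hypotheses and reports -2 log λ, a common way to use the ratio (function names are illustrative):

```python
import math

def log_likelihood(counts, probs):
    """log of prod p^k over the four cells, treating 0 * log 0 as 0."""
    return sum(k * math.log(p) for k, p in zip(counts, probs) if k)

def neg2_log_lambda(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n                          # MLE of Pr(t1 occurs)
    p2 = (k01 + k11) / n                          # MLE of Pr(t2 occurs)
    null = [(1 - p1) * (1 - p2), (1 - p1) * p2, p1 * (1 - p2), p1 * p2]
    alt = [k / n for k in (k00, k01, k10, k11)]   # unrestricted cell probabilities
    counts = [k00, k01, k10, k11]
    return -2 * (log_likelihood(counts, null) - log_likelihood(counts, alt))

# strong association between t1 and t2 gives a large score
print(round(neg2_log_lambda(k00=10000, k01=50, k10=60, k11=40), 1))
```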
40
Approximate string matching
 Non-uniformity of word spellings
• dialects of English
• transliteration from other languages
 Two ways to reduce this problem.
1. Aggressive conflation mechanism to
collapse variant spellings into the same
token
2. Decompose terms into a sequence of q-
grams or sequences of q characters
41
Approximate string matching
1. Aggressive conflation mechanism to collapse
variant spellings into the same token
• E.g.: Soundex : takes phonetics and pronunciation details
into account
• used with great success in indexing and searching last
names in census and telephone directory data.
2. Decompose terms into a sequence of q-grams
or sequences of q characters (2 ≤ q ≤ 4)
• Check for similarity in the grams
• Looking up the inverted index : a two-stage affair:
• Smaller index of q-grams consulted to expand each query
term into a set of slightly distorted query terms
• These terms are submitted to the regular index
• Used by Google for spelling correction
• Idea also adopted for eliminating near-duplicate pages
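A small sketch of q-gram decomposition and the similarity check it enables (the padding and the Jaccard-style overlap used here are one common choice, not the only one):

```python
def qgrams(term, q=3):
    """Character q-grams of a term, padded so the ends are represented too."""
    padded = "$" * (q - 1) + term.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a, b, q=3):
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

print(sorted(qgrams("java")))
print(round(qgram_similarity("receive", "recieve"), 2))   # spelling variants share most grams
```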
42
Meta-search systems
• Take the search engine to the document
• Forward queries to many geographically distributed
repositories
• Each has its own search service
• Consolidate their responses.
• Advantages
• Perform non-trivial query rewriting
• Suit a single user query to many search engines with
different query syntax
• Surprisingly small overlap between crawls
• Consolidating responses
• Function goes beyond just eliminating duplicates
• Search services do not provide standard ranks which
can be combined meaningfully
43
Similarity search
• Cluster hypothesis
• Documents similar to relevant documents are
also likely to be relevant
• Handling “find similar” queries
• Replication or duplication of pages
• Mirroring of sites
44
Document similarity
• Jaccard coefficient of similarity between
document d_1 and d_2
• T(d) = set of tokens in document d
• r′(d_1, d_2) = |T(d_1) ∩ T(d_2)| / |T(d_1) ∪ T(d_2)|
• Symmetric, reflexive, not a metric
• Forgives any number of occurrences and any
permutations of the terms.
• 1 − r′(d_1, d_2) is a metric
45
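A direct computation of the coefficient defined on the previous slide:

```python
def jaccard(tokens1, tokens2):
    """r'(d1, d2) over the token sets T(d1), T(d2)."""
    t1, t2 = set(tokens1), set(tokens2)
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

d1 = "the quick brown fox".split()
d2 = "the quick brown dog".split()
print(jaccard(d1, d2))        # 0.6
print(1 - jaccard(d1, d2))    # 1 - r' is a metric distance
```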
Estimating Jaccard coefficient with
random permutations
1. Generate a set of m random
permutations π
2. for each permutation π do
3. compute π(d_1) and π(d_2)
4. check if min π(T(d_1)) = min π(T(d_2))
5. end for
6. if equality was observed in k cases,
estimate r′(d_1, d_2) = k/m
46
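A sketch of the estimation procedure above, simulating each random permutation π by shuffling the token vocabulary (a real implementation would typically use random hash functions instead):

```python
import random

def estimate_jaccard(tokens1, tokens2, m=200, seed=0):
    """Fraction of m random permutations for which min pi(T(d1)) == min pi(T(d2))."""
    rng = random.Random(seed)
    t1, t2 = set(tokens1), set(tokens2)
    vocab = list(t1 | t2)
    equal = 0
    for _ in range(m):
        order = rng.sample(vocab, len(vocab))          # one random permutation pi
        rank = {tok: i for i, tok in enumerate(order)}
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            equal += 1
    return equal / m

d1 = "the quick brown fox".split()
d2 = "the quick brown dog".split()
print(estimate_jaccard(d1, d2))   # close to the exact value of 0.6
```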
Fast similarity search with random
permutations
1. for each random permutation π do
2. create a file f_π
3. for each document d do
4. write out ⟨s = min π(T(d)), d⟩ to f_π
5. end for
6. sort f_π using key s--this results in contiguous blocks with fixed
s containing all associated d
7. create a file g_π
8. for each pair (d_1, d_2) within a run of f_π having a given s do
9. write out a document-pair record ⟨d_1, d_2⟩ to g_π
10. end for
11. sort g_π on key (d_1, d_2)
12. end for
13. merge g_π for all π in (d_1, d_2) order, counting the number of
⟨d_1, d_2⟩ entries
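An in-memory sketch of the same sort-merge idea: group documents by their minimum under each permutation, emit candidate pairs per run, and count how many permutations each pair agrees on (the files f_π and g_π become dictionaries here):

```python
import random
from collections import Counter, defaultdict
from itertools import combinations

def candidate_pairs(docs, m=50, seed=0):
    """docs: {doc_id: set of tokens}. Returns Counter mapping pairs -> #agreeing permutations."""
    rng = random.Random(seed)
    vocab = list({t for toks in docs.values() for t in toks})
    pair_counts = Counter()
    for _ in range(m):                                   # one pass per permutation pi
        rank = {tok: i for i, tok in enumerate(rng.sample(vocab, len(vocab)))}
        runs = defaultdict(list)                         # plays the role of the sorted file f_pi
        for d, toks in docs.items():
            runs[min(rank[t] for t in toks)].append(d)
        for members in runs.values():                    # a run with a fixed key s
            for d1, d2 in combinations(sorted(members), 2):
                pair_counts[(d1, d2)] += 1               # the records written to g_pi
    return pair_counts

docs = {"a": {"the", "quick", "brown", "fox"},
        "b": {"the", "quick", "brown", "dog"},
        "c": {"lorem", "ipsum", "dolor"}}
print(candidate_pairs(docs).most_common(2))
```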
47
Eliminating near-duplicates via shingling
• “Find-similar” algorithm reports all duplicate/near-
duplicate pages
• Eliminating duplicates
• Maintain a checksum with every page in the corpus
• Eliminating near-duplicates
• Represent each document as a set T(d) of q-grams (shingles)
• Find Jaccard similarity r(d_1, d_2) between d_1 and d_2
• Eliminate the pair from step 9 if it has similarity above a
threshold
48
Detecting locally similar sub-graphs of the
Web
• Similarity search and duplicate elimination on the
graph structure of the web
• To improve quality of hyperlink-assisted ranking
• Detecting mirrored sites
• Approach 1 [Bottom-up Approach]
1. Start process with textual duplicate detection
• cleaned URLs are listed and sorted to find duplicates/near-
duplicates
• each set of equivalent URLs is assigned a unique token ID
• each page is stripped of all text, and represented as a sequence
of outlink IDs
2. Continue using link sequence representation
3. Until no further collapse of multiple URLs is possible
• Approach 2 [Bottom-up Approach]
1. identify single nodes which are near duplicates (using text-
shingling)
2. extend single-node mirrors to two-node mirrors
3. continue on to larger and larger graphs which are likely mirrors of
one another
49
Detecting mirrored sites (contd.)
• Approach 3 [Step before fetching all pages]
• Uses regularity in URL strings to identify host-pairs which are
mirrors
• Preprocessing
• Hosts are represented as sets of positional bigrams
• Convert host and path to all lowercase characters
• Let any punctuation or digit sequence be a token separator
• Tokenize the URL into a sequence of tokens, (e.g.,
www6.infoseek.com gives www, infoseek, com)
• Eliminate stop terms such as htm, html, txt, main, index, home,
bin, cgi
• Form positional bigrams from the token sequence
• Two hosts are said to be mirrors if
• A large fraction of paths are valid on both web sites
• These common paths link to pages that are near-duplicates.
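A sketch of the preprocessing above, under the assumption that a positional bigram pairs two adjacent tokens with their position in the sequence (the names and the exact bigram definition are illustrative):

```python
import re

STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}

def positional_bigrams(url):
    """Lowercase, split on punctuation/digit runs, drop stop terms, form positional bigrams."""
    tokens = [t for t in re.split(r"[^a-z]+", url.lower()) if t and t not in STOP_TERMS]
    return {(tokens[i], tokens[i + 1], i) for i in range(len(tokens) - 1)}

a = positional_bigrams("www6.infoseek.com/sports/news")
b = positional_bigrams("www.infoseek.com/sports/news")
print(len(a & b) / len(a | b))    # high overlap suggests a candidate mirror pair
```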