3. Summaries
Having ranked the documents matching a
query, we wish to present a results list
Most commonly, the document title plus a
short summary
The title is typically automatically extracted
from document metadata
What about the summaries?
4. Summaries
Two basic kinds:
Static
Dynamic
A static summary of a document is always
the same, regardless of the query that hit
the doc
Dynamic summaries are query-dependent
attempt to explain why the document was
retrieved for the query at hand
5. Static summaries
In typical systems, the static summary is a
subset of the document
Simplest heuristic: the first 50 (or so – this
can be varied) words of the document
Summary cached at indexing time
More sophisticated: extract from each
document a set of “key” sentences
Simple NLP heuristics to score each sentence
Summary is made up of top-scoring
sentences.
Most sophisticated: NLP used to synthesize
a summary
Seldom used in IR; cf. text summarization
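A minimal sketch of the simplest heuristic above (first 50 or so words, cached at indexing time). The function name, the dict used as a cache, and the " ..." truncation marker are illustrative assumptions, not part of the slides.

```python
def static_summary(text: str, max_words: int = 50) -> str:
    """Return the first max_words whitespace-separated words of text."""
    words = text.split()
    summary = " ".join(words[:max_words])
    # Mark truncation so the reader can tell the document continues.
    return summary + " ..." if len(words) > max_words else summary


def build_summary_cache(docs: dict) -> dict:
    """Compute every summary once, at indexing time (hypothetical cache)."""
    return {doc_id: static_summary(text) for doc_id, text in docs.items()}
```

Because the summary does not depend on the query, it can be looked up at query time with no access to the document itself.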
6. Dynamic summaries
Present one or more “windows” within the
document that contain several of the query
terms
“KWIC” snippets: Keyword in Context
presentation
Generated in conjunction with scoring
If query found as a phrase, the/some
occurrences of the phrase in the doc
If not, windows within the doc that contain
multiple query terms
The summary itself gives the entire content
of the window – all terms, not only the query terms
7. Generating dynamic summaries
If we have only a positional index, we cannot
(easily) reconstruct context surrounding hits
If we cache the documents at index time, can
run the window through it, cueing to hits
found in the positional index
E.g., positional index says “the query is a
phrase in position 4378” so we go to this
position in the cached document and stream
out the content
Most often, cache a fixed-size prefix of the
doc
Note: Cached copy can be outdated
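A rough sketch of the procedure above: hit positions come from the positional index, and the window is read from the cached copy. The token-list representation of the cache, the window width, and the function name are assumptions for illustration only.

```python
def kwic_snippet(cached_tokens, hit_positions, width=5):
    """Build one KWIC window per hit from a cached (possibly prefix-only) doc."""
    windows = []
    for pos in hit_positions:
        if pos >= len(cached_tokens):
            # The hit lies beyond the cached prefix: no context to show.
            continue
        start = max(0, pos - width)
        end = min(len(cached_tokens), pos + width + 1)
        # The window shows all terms, not only the query terms.
        windows.append(" ".join(cached_tokens[start:end]))
    return " ... ".join(windows)
```

For example, with the cached tokens of "the quick brown fox jumps over the lazy dog" and a hit at position 3, a width of 1 gives the window "brown fox jumps".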
8. Dynamic summaries
Producing good dynamic summaries is a
tricky optimization problem
The real estate for the summary is normally
small and fixed
Want short item, so show as many KWIC
matches as possible, and perhaps other
things like title
Want snippets to be long enough to be useful
Want linguistically well-formed snippets:
users prefer snippets that contain complete
phrases
Want snippets maximally informative about
doc
But users really like snippets, even if they
complicate IR system design
10. Adversarial IR (Spam)
Motives
Commercial, political, religious, lobbies
Promotion funded by advertising budget
Operators
Contractors (Search Engine Optimizers) for lobbies,
companies
Web masters
Hosting services
Forum
Web master world ( www.webmasterworld.com )
Search engine specific tricks
Discussions about academic papers
11. Search Engine Optimization II
Search Engine Optimization
Adversarial IR
(“search engine wars”)
12. Can you trust words on the page?
[Screenshots: auctions.hitsoffice.com/ (pornographic content) and www.ebay.com/]
Examples from July 2002
13. Simplest forms
Early engines relied on the density of terms
The top-ranked pages for the query maui
resort were the ones containing the most
maui’s and resort’s
SEOs responded with dense repetitions of
chosen terms
e.g., maui resort maui resort maui resort
Often, the repetitions would be in the same
color as the background of the web page
Repeated terms got indexed by crawlers
But not visible to humans on browsers
Can’t trust the words on a web page, for ranking.
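To see why term density is easy to game, here is a toy scorer that simply counts query-term occurrences. It is an illustrative assumption, not the ranking function of any real engine; the two page texts are made up.

```python
from collections import Counter


def term_count_score(page_text: str, query: str) -> int:
    """Score a page by how many times the query terms occur in it."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())


honest = "Our maui resort offers ocean views and a quiet beach"
stuffed = "maui resort maui resort maui resort cheap pills"

# The stuffed page outranks the honest one for the query "maui resort".
ranking = sorted([honest, stuffed],
                 key=lambda page: term_count_score(page, "maui resort"),
                 reverse=True)
```

Hiding the repeated terms in the background color changes nothing for this scorer, since it sees only the indexed text.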
14. A few spam technologies
Cloaking
Serve fake content to search engine robot
DNS cloaking: Switch IP address. Impersonate
Doorway pages
Pages optimized for a single keyword that
redirect to the real target page
Keyword Spam
Misleading meta-keywords, excessive
repetition of a term, fake “anchor text”
Hidden text with colors, CSS tricks, etc.
Link spamming
Mutual admiration societies, hidden links,
awards
Domain flooding: numerous domains that
point or redirect to a target page
Robots
Fake click stream
Fake query stream
Millions of submissions via Add-Url
15. More spam techniques
Cloaking
Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate
[Diagram: Cloaking. Is this a search engine spider? Y: serve SPAM; N: serve the real doc]
16. Tutorial on Cloaking & Stealth Technology
17. Variants of keyword stuffing
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks,
etc.
Meta-Tags =
“… London hotels, hotel, holiday inn, hilton, discount, booking, reservation,
sex, mp3, britney spears, viagra, …”
18. More spam techniques
Doorway pages
Pages optimized for a single keyword that
redirect to the real target page
Link spamming
Mutual admiration societies, hidden links,
awards – more on these later
Domain flooding: numerous domains that
point or redirect to a target page
Robots
Fake query stream – rank checking programs
“Curve-fit” ranking programs of search engines
Millions of submissions via Add-Url
19. The war against spam
Quality signals - Prefer authoritative
pages based on:
Votes from authors (linkage signals)
Votes from users (usage signals)
Policing of URL submissions
Anti robot test
Limits on meta-keywords
Robust link analysis
Ignore statistically implausible linkage (or text)
Use link analysis to detect spammers (guilt by
association)
20. The war against spam
Spam recognition by machine learning
Training set based on known spam
Family friendly filters
Linguistic analysis, general classification
techniques, etc.
For images: flesh tone detectors, source text
analysis, etc.
Editorial intervention
Blacklists
Top queries audited
Complaints addressed
21. Acid test
Which SEOs rank highly on the query seo?
Web search engines have policies on SEO
practices they tolerate/block
See pointers in Resources
Adversarial IR: the unending (technical)
battle between SEOs and web search
engines
See for instance
http://airweb.cse.lehigh.edu/
23. Duplicate/Near-Duplicate Detection
Duplication: Exact match with fingerprints
Near-Duplication: Approximate match
Overview
Compute syntactic similarity with an edit-
distance measure
Use similarity threshold to detect near-
duplicates
E.g., Similarity > 80% => Documents are “near
duplicates”
Not transitive though sometimes used
transitively
24. Computing Similarity
Segments of a document (natural or artificial
breakpoints) [Brin95]
Shingles (Word k-Grams) [Brin95, Brod98]
“a rose is a rose is a rose” =>
a_rose_is_a
rose_is_a_rose
is_a_rose_is
Similarity Measure between two docs (= sets
of shingles)
Set intersection [Brod98]
(Specifically, Size_of_Intersection /
Size_of_Union )
Jaccard measure
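The shingle example and the Jaccard measure above can be sketched as follows. Whitespace tokenization, the underscore join, and the convention that two empty sets have similarity 1.0 are assumptions of this sketch.

```python
def shingles(text: str, k: int = 4) -> set:
    """Return the set of word k-grams (shingles) of text."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(s1: set, s2: set) -> float:
    """Size_of_Intersection / Size_of_Union."""
    if not s1 and not s2:
        return 1.0  # assumed convention for two empty documents
    return len(s1 & s2) / len(s1 | s2)
```

Applied to "a rose is a rose is a rose" with k = 4, this yields exactly the three shingles listed above, because repeated shingles collapse in the set.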
25. Shingles + Set Intersection
Computing exact set intersection of shingles
between all pairs of documents is expensive
Approximate using a cleverly chosen subset of
shingles from each (a sketch)
Estimate Jaccard from a short sketch
Create a “sketch vector” (e.g., of size 200) for
each document
Documents which share more than t (say 80%)
corresponding vector elements are similar
For doc d, sketch_d[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m
Let πi be a specific random permutation on 0..2^m
Pick MIN πi(f(s)) over all shingles s in d
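A hedged sketch of the sketch-vector computation. Two substitutions are assumptions of this example and not from the slides: f is a 64-bit BLAKE2b digest reduced modulo the prime P = 2^61 - 1 (instead of the range 0..2^m), and each πi is the map x -> (a*x + b) mod P, which is a true permutation of 0..P-1 but only an approximation of a uniformly random one. Shingle sets are assumed non-empty.

```python
import hashlib
import random

P = (1 << 61) - 1  # Mersenne prime; fingerprints live in 0..P-1 (assumption)


def fingerprint(shingle: str) -> int:
    """f: map a shingle to an integer in 0..P-1."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % P


def make_permutations(n: int = 200, seed: int = 0) -> list:
    """One (a, b) pair per permutation x -> (a*x + b) mod P, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]


def sketch(shingle_set: set, perms: list) -> list:
    """sketch[i] = MIN over shingles s of pi_i(f(s))."""
    fps = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % P for x in fps) for a, b in perms]


def estimated_resemblance(sk1: list, sk2: list) -> float:
    """Fraction of corresponding sketch elements that agree."""
    return sum(1 for u, v in zip(sk1, sk2) if u == v) / len(sk1)
```

Two documents would then be called near-duplicates when estimated_resemblance exceeds the threshold t (say 0.8); both sketches must be built with the same list of permutations.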
26. Shingling with sampling minima
Given two documents A1, A2.
Let S1 and S2 be their shingle sets
Resemblance = |Intersection of S1 and S2| / |Union of S1 and S2|.
Let Alpha = min ( π (S1))
Let Beta = min (π(S2))
Probability (Alpha = Beta) = Resemblance
27. Computing Sketch[i] for Doc1
Document 1
Start with 64-bit shingles (values on the number line 0..2^64)
Permute on the number line with πi
Pick the min value
28. Test if Doc1.Sketch[i] = Doc2.Sketch[i]
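[Figure: Document 1 and Document 2, each with its shingles on the number line 0..2^64, permuted with πi; A is the min value for Document 1, B the min value for Document 2]
Are these equal?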
Test for 200 random permutations: π1, π2,… π200
29. However…
[Figure: Document 1 and Document 2 on the number line 0..2^64, with min values A and B after permutation]
A = B iff the shingle with the MIN value in the union of
Doc1 and Doc2 is common to both (I.e., lies in the
intersection)
This happens with probability:
Size_of_intersection / Size_of_union
Why?
30. Set Similarity
Set Similarity (Jaccard measure)
simJ(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
View sets as columns of a matrix; one row for
each element in the universe. aij = 1 indicates
presence of item i in set j
Example:
C1 C2
0  1
1  0
1  1
0  0
1  1
0  1
simJ(C1, C2) = 2/5 = 0.4
31. Key Observation
For columns Ci, Cj, four types of rows
Ci Cj
A 1 1
B 1 0
C 0 1
D 0 0
Overload notation: A = # of rows of type A
Claim: simJ(Ci, Cj) = A / (A + B + C)
32. Min Hashing
Randomly permute rows
h(Ci) = index of first row with 1 in column Ci
Surprising Property: P[h(Ci) = h(Cj)] = simJ(Ci, Cj)
Why?
Both are A/(A+B+C)
Look down columns Ci, Cj until first non-Type-D row
h(Ci) = h(Cj) iff that row is a type A row
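The property can be checked exhaustively on a small case. The sketch below uses the two 6-row columns C1, C2 from the Set Similarity example (Jaccard 2/5) and enumerates all 720 row permutations; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]


def jaccard_columns(c1, c2):
    """A / (A + B + C): rows with 1 in both, over rows with 1 in either."""
    both = sum(1 for a, b in zip(c1, c2) if a and b)
    either = sum(1 for a, b in zip(c1, c2) if a or b)
    return Fraction(both, either)


def min_hash(column, row_order):
    """Position, in the permuted order, of the first row with a 1."""
    for rank, row in enumerate(row_order):
        if column[row]:
            return rank
    return None  # column is all zeros


def collision_probability(c1, c2):
    """Exact P[h(c1) = h(c2)] over all row permutations."""
    orders = list(permutations(range(len(c1))))
    agree = sum(1 for o in orders if min_hash(c1, o) == min_hash(c2, o))
    return Fraction(agree, len(orders))
```

The two hashes agree exactly when the first non-type-D row in the permuted order is a type A row, and each of the A + B + C non-D rows is equally likely to come first.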
33. Mirror Detection
Mirroring is systematic replication of web pages
across hosts.
Single largest cause of duplication on the web
Host1/α and Host2/β are mirrors iff
for all (or most) paths p, whenever
http://Host1/α/p exists,
http://Host2/β/p exists as well
with identical (or near identical) content, and
vice versa.
34. Mirror Detection example
http://www.elsevier.com/ and http://www.elsevier.nl/
Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop
http://scop.berkeley.edu/
http://scop.wehi.edu.au/scop
http://pdb.weizmann.ac.il/scop
http://scop.protres.ru/
36. Motivation
Why detect mirrors?
Smart crawling
Fetch from the fastest or freshest server
Avoid duplication
Better connectivity analysis
Combine inlinks
Avoid double counting outlinks
Redundancy in result listings
“If that fails you can try: <mirror>/samepath”
Proxy caching
37. Bottom Up Mirror Detection
[Cho00]
Maintain clusters of subgraphs
Initialize clusters of trivial subgraphs
Group near-duplicate single documents into a cluster
Subsequent passes
Merge clusters of the same cardinality and corresponding linkage
Avoid decreasing cluster cardinality
To detect mirrors we need:
Adequate path overlap
Contents of corresponding pages within a small time range
38. Can we use URLs to find mirrors?
[Figure: site graphs for www.synthesis.org and synthesis.stanford.edu, each with nodes a, b, c, d]
www.synthesis.org
www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
www.synthesis.org/Docs/annual.report96.final.html
www.synthesis.org/Docs/cicee-berlin-paper.html
www.synthesis.org/Docs/myr5
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/myr5/mech/mech-take-home.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
www.synthesis.org/Docs/yr5ar
www.synthesis.org/Docs/yr5ar/assess
www.synthesis.org/Docs/yr5ar/cicee
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
synthesis.stanford.edu
synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
synthesis.stanford.edu/Docs/annual.report96.final.html
synthesis.stanford.edu/Docs/annual.report96.final_fn.html
synthesis.stanford.edu/Docs/myr5/assessment
synthesis.stanford.edu/Docs/myr5/assessment/assessment-…
synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
synthesis.stanford.edu/Docs/myr5/cicee
synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
39. Top Down Mirror Detection
[Bhar99, Bhar00c]
E.g.,
www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
What features could indicate mirroring?
Hostname similarity:
word unigrams and bigrams: { www, www.synthesis, synthesis, …}
Directory similarity:
Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
IP address similarity:
3 or 4 octet overlap
Many hosts sharing an IP address => virtual hosting by an ISP
Host outlink overlap
Path overlap
Potentially, path + sketch overlap
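Two of the features above, sketched under assumptions: hostnames are split on dots into words, path bigrams are prefixed with their position, and the file name counts as the last path component. The cited papers may define these features differently; the function names are hypothetical.

```python
def hostname_features(host: str) -> set:
    """Word unigrams and bigrams of a hostname."""
    words = host.lower().split(".")
    bigrams = [".".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)


def positional_path_bigrams(path: str) -> set:
    """Bigrams of adjacent path components, tagged with their position."""
    parts = [p for p in path.split("/") if p]
    return {f"{i}:{a}/{b}" for i, (a, b) in enumerate(zip(parts, parts[1:]))}


def feature_overlap(f1: set, f2: set) -> float:
    """Jaccard overlap of two feature sets (0.0 if both are empty)."""
    union = f1 | f2
    return len(f1 & f2) / len(union) if union else 0.0
```

For www.synthesis.org and synthesis.stanford.edu the hostname features share only the word "synthesis", while the paths under /Docs/ProjAbs/synsys/ on both hosts share the bigrams 0:Docs/ProjAbs and 1:ProjAbs/synsys.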
40. Implementation
Phase I - Candidate Pair Detection
Find features that pairs of hosts have in common
Compute a list of host pairs which might be mirrors
Phase II - Host Pair Validation
Test each host pair and determine extent of mirroring
Check if 20 paths sampled from Host1 have near-
duplicates on Host2 and vice versa
Use transitive inferences:
IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
Evaluation
140 million URLs on 230,000 hosts (1999)
Best approach combined 5 sets of features
Top 100,000 host pairs had precision = 0.57 and recall =
0.86
41. WebIR Infrastructure
Connectivity Server
Fast access to links to support link
analysis
Term Vector Database
Fast access to document vectors to augment
link analysis
42. Connectivity Server
[CS1: Bhar98b, CS2 & 3: Rand01]
Fast web graph access to support connectivity
analysis
Stores mappings in memory from
URL to outlinks, URL to inlinks
Applications
HITS, Pagerank computations
Crawl simulation
Graph algorithms: web connectivity, diameter etc.
more on this later
Visualizations
43. Usage
Input: Graph algorithm + URLs + Values
Translation: URLs to FPs to IDs
Execution: Graph algorithm runs in memory
Translation: IDs to URLs
Output: URLs + Values
Translation Tables on Disk
URL text: 9 bytes/URL (compressed from ~80 bytes )
FP(64b) -> ID(32b): 5 bytes
ID(32b) -> FP(64b): 8 bytes
ID(32b) -> URLs: 0.5 bytes
44. ID assignment
Partition URLs into 3 sets, sorted lexicographically
High: Max degree > 254
Medium: 254 > Max degree > 24
Low: remaining (75%)
IDs assigned in sequence (densely)
Adjacency lists
In memory tables for Outlinks, Inlinks
List index maps from a Source ID to start of adjacency list
E.g., HIGH IDs: Max(indegree, outdegree) > 254
ID URL
…
9891 www.amazon.com/
9912 www.amazon.com/jobs/
…
9821878 www.geocities.com/
…
40930030 www.google.com/
…
85903590 www.yahoo.com/
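A small sketch of the ID assignment scheme: partition by max(indegree, outdegree), sort each partition lexicographically, and number the URLs densely. The thresholds come from the slide; treating a degree of exactly 254 as Medium, and the function and variable names, are assumptions.

```python
def assign_ids(degrees: dict, high: int = 254, medium: int = 24) -> dict:
    """degrees maps url -> (indegree, outdegree); returns url -> dense ID."""
    def partition(url):
        max_degree = max(degrees[url])
        if max_degree > high:
            return 0  # High
        if max_degree > medium:
            return 1  # Medium (boundary handling is an assumption)
        return 2      # Low
    # High URLs first, then Medium, then Low; lexicographic within each set.
    ordered = sorted(degrees, key=lambda url: (partition(url), url))
    return {url: new_id for new_id, url in enumerate(ordered)}


example = {
    "b.com/": (300, 2),
    "a.com/": (1, 1),
    "c.com/": (30, 10),
    "a.org/": (5, 400),
}
```

Sorting lexicographically within a partition keeps URLs on the same host close together in ID space, which is what makes small deltas common in the adjacency lists.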
45. Adjacency List Compression - I
[Figure: a list index (source IDs 104, 105, 106) pointing into a sequence of adjacency lists (98, 132, 153, 98, 147, 153), and the same list index pointing into the delta encoded adjacency lists (-6, 34, 21, -8, 49, 6)]
• Adjacency List:
- Smaller delta values are exponentially more frequent (80% to same host)
- Compress deltas with variable length encoding (e.g., Huffman)
• List Index pointers: 32b for high, Base+16b for med, Base+8b for low
- Avg = 12b per pointer
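A sketch of delta encoding for one adjacency list. The first delta is taken relative to the source ID, which reproduces the values in the figure (source 104 with targets 98, 132, 153 gives -6, 34, 21). The slide suggests Huffman coding for the deltas; this example swaps in a simpler zigzag plus 7-bit varint byte code, purely as an illustrative variable-length encoding.

```python
def delta_encode(source_id: int, targets: list) -> list:
    """First delta is relative to the source ID, the rest to the previous target."""
    deltas, prev = [], source_id
    for target in sorted(targets):
        deltas.append(target - prev)
        prev = target
    return deltas


def delta_decode(source_id: int, deltas: list) -> list:
    targets, prev = [], source_id
    for delta in deltas:
        prev += delta
        targets.append(prev)
    return targets


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1


def varint(u: int) -> bytes:
    """7 bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_bytes(source_id: int, targets: list) -> bytes:
    return b"".join(varint(zigzag(d)) for d in delta_encode(source_id, targets))
```

With this byte code the three-link list of source 104 takes 3 bytes instead of three 32-bit IDs; a Huffman code tuned to the delta distribution would do better still.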
46. Adjacency List Compression - II
Inter List Compression
Basis: Similar URLs may share links
Close in ID space => adjacency lists may overlap
Approach
Define a representative adjacency list for a block of IDs
Adjacency list of a reference ID
Union of adjacency lists in the block
Represent adjacency list in terms of deletions and additions
when it is cheaper to do so
Measurements
Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
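A sketch of the inter-list idea: store an adjacency list as deletions from and additions to a reference list when that is cheaper. The cost model here (number of stored IDs) and the tuple encoding are assumptions; the measured system counts bits.

```python
def encode_against_reference(adjacency: list, reference: list) -> tuple:
    """Encode adjacency relative to reference if that stores fewer IDs."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)
    additions = sorted(adj - ref)
    if len(deletions) + len(additions) < len(adj):
        return ("ref", deletions, additions)
    return ("raw", sorted(adj))


def decode_against_reference(encoded: tuple, reference: list) -> list:
    if encoded[0] == "raw":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))
```

For a reference list 98, 132, 153 and a nearby list 98, 147, 153, the encoding stores one deletion (132) and one addition (147) instead of three IDs.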
47. Term Vector Database
[Stat00]
Fast access to 50 word term vectors for web pages
Term Selection:
Restricted to middle 1/3rd of lexicon by document frequency
Top 50 words in document by TF.IDF.
Term Weighting:
Deferred till run-time (can be based on term freq, doc freq, doc length)
Applications
Content + Connectivity analysis (e.g., Topic Distillation)
Topic specific crawls
Document classification
Performance
Storage: 33GB for 272M term vectors
Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk
block)
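A sketch of the term selection step: keep only terms in the middle third of the lexicon by document frequency, then take the top 50 by TF.IDF. The exact IDF formula (log(N/df)), the tie-breaking, and storing raw term frequencies (since weighting is deferred to run time) are assumptions of this sketch.

```python
import math
from collections import Counter


def middle_third(doc_freq: dict) -> set:
    """Terms in the middle third of the lexicon, ranked by document frequency."""
    ranked = sorted(doc_freq, key=lambda term: (doc_freq[term], term))
    n = len(ranked)
    return set(ranked[n // 3: n - n // 3])


def term_vector(doc_tokens: list, doc_freq: dict, num_docs: int, size: int = 50) -> dict:
    """Top `size` allowed terms by TF.IDF; values are raw term frequencies."""
    allowed = middle_third(doc_freq)
    tf = Counter(token for token in doc_tokens if token in allowed)
    score = {t: f * math.log(num_docs / doc_freq[t]) for t, f in tf.items()}
    top = sorted(score, key=lambda t: (-score[t], t))[:size]
    return {t: tf[t] for t in top}


doc_freq = {"the": 100, "of": 90, "web": 40, "mirror": 30, "qqq": 2, "zyx": 1}
```

Dropping the most and least frequent thirds removes both stop-word-like terms and rare noise before the TF.IDF ranking is applied.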
48. Architecture
[Figure: URLid to Term Vector Lookup. URLid * 64 / 480 gives the offset; URL Info holds a Base (4 bytes) and a bit vector for 480 URLids; each 128-byte TV record holds Terms as LC:TID entries and Freq as FRQ:RL entries]
Editor's Notes
Arms race
Small biotech firm; query example from last time; Infoseek example
Talk about expert witness; George W. Bush example
More complex problem of finding the “original” site