Special Topics in Search Engines     Result Summaries      Anti-spamming    Duplicate elimination
Results summaries
Summaries   Having ranked the documents matching a    query, we wish to present a results list   Most commonly, the docu...
Summaries   Two basic kinds:       Static       Dynamic    A static summary of a document is always    the same, regar...
Static summaries   In typical systems, the static summary is a    subset of the document   Simplest heuristic: the first...
Dynamic summaries   Present one or more “windows” within the    document that contain several of the query    terms     ...
Generating dynamic summaries   If we have only a positional index, we cannot    (easily) reconstruct context surrounding ...
Dynamic summaries   Producing good dynamic summaries is a    tricky optimization problem       The real estate for the s...
Anti-spamming
Adversarial IR (Spam)   Motives       Commercial, political, religious, lobbies       Promotion funded by advertising b...
Search Engine Optimization IISearch Engine Optimization       Adversarial IR       Adversarial IR  (“search engine wars”) ...
Can you trust words on the page?auctions.hitsoffice.com/     Pornographic           www.ebay.com/        ContentExamples f...
Simplest forms   Early engines relied on the density of terms        The top-ranked pages for the query maui         res...
A few spam technologies   Cloaking       Serve fake content to search engine robot       DNS cloaking: Switch IP addres...
More spam techniques   Cloaking Serve fake content to search engine spider DNS cloaking: Switch IP address. Impersonate...
Tutorial on    Tutorial onCloaking & StealthCloaking & Stealth   Technology    Technology
Variants of keyword stuffing   Misleading meta-tags, excessive repetition   Hidden text with colors, style sheet tricks,...
More spam techniques   Doorway pages       Pages optimized for a single keyword that re-        direct to the real targe...
The war against spamQuality signals - Prefer authoritativepages based on:   Votes from authors (linkage signals)   Vote...
The war against spam   Spam recognition by machine learning       Training set based on known spam   Family friendly fi...
Acid test   Which SEO’s rank highly on the query seo?   Web search engines have policies on SEO    practices they tolera...
Duplicate detection
Duplicate/Near-Duplicate Detection   Duplication: Exact match with fingerprints   Near-Duplication: Approximate match  ...
Computing Similarity   Segments of a document (natural or artificial    breakpoints) [Brin95]   Shingles (Word k-Grams) ...
Shingles + Set IntersectionComputing exact set intersection of shinglesbetween all pairs of documents is expensive   App...
Shingling with samplingminima   Given two documents A1, A2.   Let S1 and S2 be their shingle sets   Resemblance = |Inte...
Computing Sketch[i] for Doc1    Document 1                 264   Start with 64 bit shingles                 264           ...
Test if Doc1.Sketch[i] = Doc2.Sketch[i]       Document 1                 Document 2                         264           ...
However…            Document 1                 Document 2                                 264                264          ...
Set Similarity   Set Similarity (Jaccard measure)                                    Ci  C j                simJ(Ci , C ...
Key Observation   For columns Ci, Cj, four types of rows             Ci     Cj       A      1      1       B      1      ...
Min Hashing   Randomly permute rows   h(Ci) = index of first row with 1 in column Ci   Surprising Property   Why? P [ ...
Mirror Detection   Mirroring is systematic replication of web pages    across hosts.      Single largest cause of duplic...
Mirror Detection example   http://www.elsevier.com/ and http://www.elsevier.nl/   Structural Classification of Proteins ...
Repackaged MirrorsAuctions.msn.com        Auctions.lycos.com                                   Aug
Motivation   Why detect mirrors?       Smart crawling                      Fetch from the fastest or freshest server   ...
Bottom Up Mirror Detection[Cho00]   Maintain clusters of subgraphs   Initialize clusters of trivial subgraphs       Gro...
Can we use URLs to find    mirrors?                 www.synthesis.org                               synthesis.stanford.edu...
Top Down Mirror Detection    [Bhar99, Bhar00c]   E.g.,     www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html     synth...
Implementation   Phase I - Candidate Pair Detection        Find features that pairs of hosts have in common        Comp...
WebIR Infrastructure   Connectivity Server       Fast access to links to support for link        analysis   Term Vector...
Connectivity Server[CS1: Bhar98b, CS2 & 3: Rand01]   Fast web graph access to support connectivity    analysis   Stores ...
Usage    Input                      Execution             Output   Graph  algorithm            Graph                URLs  ...
E.g., HIGH IDs:    ID assignment                        Max(indegree , outdegree) > 254   Partition URLs into 3 sets, sor...
Adjacency List Compression - I                        …           …           98                             …            ...
Adjacency List Compression - II   Inter List Compression       Basis: Similar URLs may share links            Close in ...
Term Vector Database    [Stat00]   Fast access to 50 word term vectors for web pages        Term Selection:            ...
ArchitectureURLid * 64 /480                                       offset                                         URL Info ...
Upcoming SlideShare
Loading in …5
×

seo tutorial

1,349 views

Published on

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,349
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
10
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Arms race
  • Small biotech firm ; query example from last time ; infoseek exapmle
  • Talk about expert witness; george w bush example
  • More complex problem of finding the “original” site
  • seo tutorial

    1. 1. Special Topics in Search Engines Result Summaries Anti-spamming Duplicate elimination
    2. 2. Results summaries
    3. 3. Summaries Having ranked the documents matching a query, we wish to present a results list Most commonly, the document title plus a short summary The title is typically automatically extracted from document metadata What about the summaries?
    4. 4. Summaries Two basic kinds:  Static  Dynamic A static summary of a document is always the same, regardless of the query that hit the doc Dynamic summaries are query-dependent attempt to explain why the document was retrieved for the query at hand
    5. 5. Static summaries In typical systems, the static summary is a subset of the document Simplest heuristic: the first 50 (or so – this can be varied) words of the document  Summary cached at indexing time More sophisticated: extract from each document a set of “key” sentences  Simple NLP heuristics to score each sentence  Summary is made up of top-scoring sentences. Most sophisticated: NLP used to synthesize a summary  Seldom used in IR; cf. text summarization
    6. 6. Dynamic summaries Present one or more “windows” within the document that contain several of the query terms  “KWIC” snippets: Keyword in Context presentation Generated in conjunction with scoring  If query found as a phrase, the/some occurrences of the phrase in the doc  If not, windows within the doc that contain multiple query terms The summary itself gives the entire content of the window – all terms, not only the query
    7. 7. Generating dynamic summaries If we have only a positional index, we cannot (easily) reconstruct context surrounding hits If we cache the documents at index time, can run the window through it, cueing to hits found in the positional index  E.g., positional index says “the query is a phrase in position 4378” so we go to this position in the cached document and stream out the content Most often, cache a fixed-size prefix of the doc  Note: Cached copy can be outdated
    8. 8. Dynamic summaries Producing good dynamic summaries is a tricky optimization problem  The real estate for the summary is normally small and fixed  Want short item, so show as many KWIC matches as possible, and perhaps other things like title  Want snippets to be long enough to be useful  Want linguistically well-formed snippets: users prefer snippets that contain complete phrases  Want snippets maximally informative about doc But users really like snippets, even if they complicate IR system design
    9. 9. Anti-spamming
    10. 10. Adversarial IR (Spam) Motives  Commercial, political, religious, lobbies  Promotion funded by advertising budget Operators  Contractors (Search Engine Optimizers) for lobbies, companies  Web masters  Hosting services Forum  Web master world ( www.webmasterworld.com )  Search engine specific tricks  Discussions about academic papers 
    11. 11. Search Engine Optimization IISearch Engine Optimization Adversarial IR Adversarial IR (“search engine wars”) (“search engine wars”)
    12. 12. Can you trust words on the page?auctions.hitsoffice.com/ Pornographic www.ebay.com/ ContentExamples from July 2002
    13. 13. Simplest forms Early engines relied on the density of terms  The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s SEOs responded with dense repetitions of chosen terms  e.g., maui resort maui resort maui resort  Often, the repetitions would be in the same color as the background of the web page  Repeated terms got indexed by crawlers  But not visible to humans on browsers Can’t trust the words on a web page, for ranking.
    14. 14. A few spam technologies Cloaking  Serve fake content to search engine robot  DNS cloaking: Switch IP address. Impersonate Doorway pages  Pages optimized for a single keyword that re- direct to the real target page Keyword Spam  Misleading meta-keywords, excessive repetition of a term, fake “anchor text”  Hidden text with colors, CSS tricks, etc. Link spamming  Mutual admiration societies, hidden links, awards  Domain flooding: numerous domains that point or re-direct to a target page Robots  Fake click stream  Fake query stream  Millions of submissions via Add-Url
    15. 15. More spam techniques Cloaking Serve fake content to search engine spider DNS cloaking: Switch IP address. Impersonate SPAM Y Is this a Search Engine spider? N Real Cloaking Doc
    16. 16. Tutorial on Tutorial onCloaking & StealthCloaking & Stealth Technology Technology
    17. 17. Variants of keyword stuffing Misleading meta-tags, excessive repetition Hidden text with colors, style sheet tricks, etc. Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
    18. 18. More spam techniques Doorway pages  Pages optimized for a single keyword that re- direct to the real target page Link spamming  Mutual admiration societies, hidden links, awards – more on these later  Domain flooding: numerous domains that point or re-direct to a target page Robots  Fake query stream – rank checking programs  “Curve-fit” ranking programs of search engines  Millions of submissions via Add-Url
    19. 19. The war against spamQuality signals - Prefer authoritativepages based on: Votes from authors (linkage signals) Votes from users (usage signals) Policing of URL submissions Anti robot test Limits on meta-keywordsRobust link analysis Ignore statistically implausible linkage (or text) Use link analysis to detect spammers (guilt by association)
    20. 20. The war against spam Spam recognition by machine learning  Training set based on known spam Family friendly filters  Linguistic analysis, general classification techniques, etc.  For images: flesh tone detectors, source text analysis, etc. Editorial intervention  Blacklists  Top queries audited  Complaints addressed
    21. 21. Acid test Which SEO’s rank highly on the query seo? Web search engines have policies on SEO practices they tolerate/block  See pointers in Resources Adversarial IR: the unending (technical) battle between SEO’s and web search engines See for instance http://airweb.cse.lehigh.edu/
    22. 22. Duplicate detection
    23. 23. Duplicate/Near-Duplicate Detection Duplication: Exact match with fingerprints Near-Duplication: Approximate match Overview  Compute syntactic similarity with an edit- distance measure  Use similarity threshold to detect near- duplicates  E.g., Similarity > 80% => Documents are “near duplicates”  Not transitive though sometimes used transitively
    24. 24. Computing Similarity Segments of a document (natural or artificial breakpoints) [Brin95] Shingles (Word k-Grams) [Brin95, Brod98] “a rose is a rose is a rose” => a_rose_is_a rose_is_a_rose is_a_rose_is Similarity Measure between two docs (= sets of shingles)  Set intersection [Brod98] (Specifically, Size_of_Intersection / Size_of_Union ) Jaccard measure
    25. 25. Shingles + Set IntersectionComputing exact set intersection of shinglesbetween all pairs of documents is expensive Approximate using a cleverly chosen subset of shingles from each (a sketch) Estimate Jaccard from a short sketchCreate a “sketch vector” (e.g., of size 200) foreach document Documents which share more than t (say 80%) corresponding vector elements are similar For doc d, sketchd[i] is computed as follows:  Let f map all shingles in the universe to 0..2 m  Let πi be a specific random permutation on 0..2 m  Pick MIN πi (f(s)) over all shingles s in d
    26. 26. Shingling with samplingminima Given two documents A1, A2. Let S1 and S2 be their shingle sets Resemblance = |Intersection of S1 and S2| / | Union of S1 and S2|. Let Alpha = min ( π (S1)) Let Beta = min (π(S2))  Probability (Alpha = Beta) = Resemblance
    27. 27. Computing Sketch[i] for Doc1 Document 1 264 Start with 64 bit shingles 264 Permute on the number line 264 with πi 264 Pick the min value
    28. 28. Test if Doc1.Sketch[i] = Doc2.Sketch[i] Document 1 Document 2 264 264 264 264 264 264 A B 2 64 264 Are these equal?Test for 200 random permutations: π1, π2,… π200
    29. 29. However… Document 1 Document 2 264 264 264 264 A 264 B 264 264 264A = B iff the shingle with the MIN value in the union ofDoc1 and Doc2 is common to both (I.e., lies in theintersection)This happens with probability: Size_of_intersection / Size_of_union Why?
    30. 30. Set Similarity Set Similarity (Jaccard measure) Ci  C j simJ(Ci , C j ) = Ci  C j View sets as columns of a matrix; one row for each element in the universe. aij = 1 indicates presence of item i in set j Example C1 C2 0 1 1 0 1 1 simJ(C1,C2) = 2/5 = 0.4 0 0 1 1 0 1
    31. 31. Key Observation For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0 Overload notation: A = # of rows of type A Claim A simJ(Ci , C j ) = A+B+C
    32. 32. Min Hashing Randomly permute rows h(Ci) = index of first row with 1 in column Ci Surprising Property Why? P [ h(Ci ) = h(C j ) ] = simJ ( Ci , C j )  Both are A/(A+B+C)  Look down columns Ci, Cj until first non-Type- D row  h(Ci) = h(Cj)  type A row
    33. 33. Mirror Detection Mirroring is systematic replication of web pages across hosts.  Single largest cause of duplication on the web Host1/α and Host2/β are mirrors iff For all (or most) paths p such that when http://Host1/ α / p exists http://Host2/ β / p exists as well with identical (or near identical) content, and vice versa.
    34. 34. Mirror Detection example http://www.elsevier.com/ and http://www.elsevier.nl/ Structural Classification of Proteins  http://scop.mrc-lmb.cam.ac.uk/scop  http://scop.berkeley.edu/  http://scop.wehi.edu.au/scop  http://pdb.weizmann.ac.il/scop  http://scop.protres.ru/
    35. 35. Repackaged MirrorsAuctions.msn.com Auctions.lycos.com Aug
    36. 36. Motivation Why detect mirrors?  Smart crawling  Fetch from the fastest or freshest server  Avoid duplication  Better connectivity analysis  Combine inlinks  Avoid double counting outlinks  Redundancy in result listings  “If that fails you can try: <mirror>/samepath”  Proxy caching
    37. 37. Bottom Up Mirror Detection[Cho00] Maintain clusters of subgraphs Initialize clusters of trivial subgraphs  Group near-duplicate single documents into a cluster Subsequent passes  Merge clusters of the same cardinality and corresponding linkage  Avoid decreasing cluster cardinality To detect mirrors we need:  Adequate path overlap  Contents of corresponding pages within a small time range
    38. 38. Can we use URLs to find mirrors? www.synthesis.org synthesis.stanford.edu a b a b d d c cwww.synthesis.org/Docs/ProjAbs/synsys/synalysis.html synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…www.synthesis.org/Docs/annual.report96.final.html synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…www.synthesis.org/Docs/cicee-berlin-paper.html synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…www.synthesis.org/Docs/myr5 synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…www.synthesis.org/Docs/myr5/cicee/bridge-gap.html synthesis.stanford.edu/Docs/annual.report96.final.htmlwww.synthesis.org/Docs/myr5/cs/cs-meta.html synthesis.stanford.edu/Docs/annual.report96.final_fn.htmlwww.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html synthesis.stanford.edu/Docs/myr5/assessmentwww.synthesis.org/Docs/myr5/mech/mech-take-home.html synthesis.stanford.edu/Docs/myr5/assessment/assessment-…www.synthesis.org/Docs/myr5/synsys/experiential-learning.html synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.htmlwww.synthesis.org/Docs/yr5ar synthesis.stanford.edu/Docs/myr5/assessment/not-available.htmlwww.synthesis.org/Docs/yr5ar/assess synthesis.stanford.edu/Docs/myr5/ciceewww.synthesis.org/Docs/yr5ar/cicee synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.htmlwww.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.htmlwww.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
    39. 39. Top Down Mirror Detection [Bhar99, Bhar00c] E.g., www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html What features could indicate mirroring?  Hostname similarity:  word unigrams and bigrams: { www, www.synthesis, synthesis, …}  Directory similarity:  Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }  IP address similarity:  3 or 4 octet overlap  Many hosts sharing an IP address => virtual hosting by an ISP  Host outlink overlap  Path overlap  Potentially, path + sketch overlap
    40. 40. Implementation Phase I - Candidate Pair Detection  Find features that pairs of hosts have in common  Compute a list of host pairs which might be mirrors Phase II - Host Pair Validation  Test each host pair and determine extent of mirroring  Check if 20 paths sampled from Host1 have near- duplicates on Host2 and vice versa  Use transitive inferences: IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B) IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B) Evaluation  140 million URLs on 230,000 hosts (1999)  Best approach combined 5 sets of features  Top 100,000 host pairs had precision = 0.57 and recall = 0.86
    41. 41. WebIR Infrastructure Connectivity Server  Fast access to links to support for link analysis Term Vector Database  Fast access to document vectors to augment link analysis
    42. 42. Connectivity Server[CS1: Bhar98b, CS2 & 3: Rand01] Fast web graph access to support connectivity analysis Stores mappings in memory from  URL to outlinks, URL to inlinks Applications  HITS, Pagerank computations  Crawl simulation  Graph algorithms: web connectivity, diameter etc.  more on this later  Visualizations
    43. 43. Usage Input Execution Output Graph algorithm Graph URLs + URLs algorithm IDs + URLs to runs in to Values + FPs memory URLs Values to IDs Translation Tables on Disk URL text: 9 bytes/URL (compressed from ~80 bytes ) FP(64b) -> ID(32b): 5 bytes ID(32b) -> FP(64b): 8 bytes ID(32b) -> URLs: 0.5 bytes
    44. 44. E.g., HIGH IDs: ID assignment Max(indegree , outdegree) > 254 Partition URLs into 3 sets, sorted ID URL lexicographically  High: Max degree > 254 …  Medium: 254 > Max degree > 24 9891 www.amazon.com/  Low: remaining (75%) 9912 www.amazon.com/jobs/ … IDs assigned in sequence (densely) 9821878 www.geocities.com/ … 40930030 www.google.com/ Adjacency lists …  In memory tables for Outlinks, Inlinks 85903590 www.yahoo.com/  List index maps from a Source ID to start of adjacency list
    45. 45. Adjacency List Compression - I … … 98 … 132 … -6 104 153 34 105 98 104 21 106 147 105 -8 153 106 49 … 6 … … Sequence … of Delta List Adjacency Encoded Index Lists List Adjacency Index Lists• Adjacency List: - Smaller delta values are exponentially more frequent (80% to same host) - Compress deltas with variable length encoding (e.g., Huffman)• List Index pointers: 32b for high, Base+16b for med, Base+8b for low - Avg = 12b per pointer
    46. 46. Adjacency List Compression - II Inter List Compression  Basis: Similar URLs may share links  Close in ID space => adjacency lists may overlap  Approach  Define a representative adjacency list for a block of IDs  Adjacency list of a reference ID  Union of adjacency lists in the block  Represent adjacency list in terms of deletions and additions when it is cheaper to do so  Measurements  Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)  Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
    47. 47. Term Vector Database [Stat00] Fast access to 50 word term vectors for web pages  Term Selection:  Restricted to middle 1/3rd of lexicon by document frequency  Top 50 words in document by TF.IDF.  Term Weighting:  Deferred till run-time (can be based on term freq, doc freq, doc length) Applications  Content + Connectivity analysis (e.g., Topic Distillation)  Topic specific crawls  Document classification Performance  Storage: 33GB for 272M term vectors  Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
    48. 48. ArchitectureURLid * 64 /480 offset URL Info Base (4 bytes) LC:TID Terms LC:TID … 128 Bit vector Byte For LC:TID TV 480 URLids Record FRQ:RL FRQ:RL Freq … FRQ:RL URLid to Term Vector Lookup

    ×