Special Topics in
 Search Engines
     Result Summaries
      Anti-spamming
    Duplicate elimination
Results summaries
Summaries
   Having ranked the documents matching a
    query, we wish to present a results list
   Most commonly, the document title plus a
    short summary
   The title is typically automatically extracted
    from document metadata
   What about the summaries?
Summaries
   Two basic kinds:
       Static
       Dynamic

    A static summary of a document is always
    the same, regardless of the query that hit
    the doc
   Dynamic summaries are query-dependent: they
    attempt to explain why the document was
    retrieved for the query at hand
Static summaries
   In typical systems, the static summary is a
    subset of the document
   Simplest heuristic: the first 50 (or so – this
    can be varied) words of the document
       Summary cached at indexing time
   More sophisticated: extract from each
    document a set of “key” sentences
       Simple NLP heuristics to score each sentence
       Summary is made up of top-scoring
        sentences.
   Most sophisticated: NLP used to synthesize
    a summary
       Seldom used in IR; cf. text summarization
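A minimal sketch of the simplest heuristic above (the first 50 or so words, computed once at indexing time). The function name `static_summary` and the trailing " ..." truncation marker are illustrative assumptions, not part of the slides.

```python
def static_summary(text: str, max_words: int = 50) -> str:
    """Return the first max_words whitespace-separated words of text."""
    words = text.split()
    summary = " ".join(words[:max_words])
    # Mark truncation so the reader can tell the document continues.
    return summary + " ..." if len(words) > max_words else summary


doc = "Search engines rank documents and then present a results list to the user"
print(static_summary(doc, max_words=5))
```

Because the result does not depend on the query, it can be stored alongside the document and reused for every hit.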
Dynamic summaries
   Present one or more “windows” within the
    document that contain several of the query
    terms
       “KWIC” snippets: Keyword in Context
        presentation
   Generated in conjunction with scoring
       If query found as a phrase, the/some
        occurrences of the phrase in the doc
       If not, windows within the doc that contain
        multiple query terms
   The summary itself gives the entire content
    of the window – all terms, not only the query
    terms
Generating dynamic summaries
   If we have only a positional index, we cannot
    (easily) reconstruct context surrounding hits
   If we cache the documents at index time, can
    run the window through it, cueing to hits
    found in the positional index
       E.g., positional index says “the query is a
        phrase in position 4378” so we go to this
        position in the cached document and stream
        out the content
   Most often, cache a fixed-size prefix of the
    doc
       Note: Cached copy can be outdated
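The cueing step can be sketched as follows, assuming the cached document is available as a list of tokens and the positional index reports word offsets of query-term hits. The function name `kwic_snippet`, the fixed `width`, and the "most hits per window" criterion are assumptions made for illustration; a real system would also weigh the other goals listed for dynamic summaries.

```python
def kwic_snippet(tokens, hit_positions, width=8):
    """Return the width-token window that covers the most query-term hits.

    tokens: the cached document as a list of words.
    hit_positions: word offsets of query-term occurrences, as a positional
    index would report them.
    """
    if not hit_positions:
        # No hits inside the cached prefix: fall back to the document start.
        return " ".join(tokens[:width])
    hits = sorted(hit_positions)
    best_start, best_count = hits[0], 0
    for start in hits:  # a best window can always be shifted to begin at a hit
        count = sum(1 for p in hits if start <= p < start + width)
        if count > best_count:
            best_start, best_count = start, count
    return "... " + " ".join(tokens[best_start:best_start + width]) + " ..."


tokens = ("we stayed at a quiet maui resort near the beach and "
          "the resort staff in maui were kind").split()
print(kwic_snippet(tokens, [5, 6, 12, 15], width=6))
```

The window shows all of its terms, not only the query terms, which matches the KWIC presentation described earlier.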
Dynamic summaries
   Producing good dynamic summaries is a
    tricky optimization problem
       The real estate for the summary is normally
        small and fixed
       Want short item, so show as many KWIC
        matches as possible, and perhaps other
        things like title
       Want snippets to be long enough to be useful
       Want linguistically well-formed snippets:
        users prefer snippets that contain complete
        phrases
       Want snippets maximally informative about
        doc
   But users really like snippets, even if they
    complicate IR system design
Anti-spamming
Adversarial IR (Spam)
   Motives
       Commercial, political, religious, lobbies
       Promotion funded by advertising budget
   Operators
       Contractors (Search Engine Optimizers) for lobbies,
        companies
       Web masters
       Hosting services
   Forum
       Web master world ( www.webmasterworld.com )
            Search engine specific tricks
            Discussions about academic papers 
Search Engine Optimization II
       Adversarial IR
       (“search engine wars”)
Can you trust words on the page?
   [Figure: page screenshots labeled auctions.hitsoffice.com/, “Pornographic Content”, and www.ebay.com/. Examples from July 2002]
Simplest forms
   Early engines relied on the density of terms
        The top-ranked pages for the query maui
         resort were the ones containing the most
         maui’s and resort’s
   SEOs responded with dense repetitions of
    chosen terms
        e.g., maui resort maui resort maui resort
        Often, the repetitions would be in the same
         color as the background of the web page
             Repeated terms got indexed by crawlers
             But not visible to humans on browsers

    Can’t trust the words on a web page, for ranking.
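To see why density-based ranking is easy to game, here is a small sketch of a term-density score. The scoring function is a deliberately naive stand-in for what early engines did, not a reconstruction of any particular engine; `term_density` and the sample pages are made up for illustration.

```python
def term_density(page_text: str, query: str) -> float:
    """Fraction of the page's words that are query terms."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    query_terms = set(query.lower().split())
    return sum(w in query_terms for w in words) / len(words)


honest = "our maui resort offers ocean views and a quiet beach"
stuffed = "maui resort maui resort maui resort"
print(term_density(honest, "maui resort"), term_density(stuffed, "maui resort"))
```

The stuffed page scores higher than the honest one, even though it carries no information for a human reader.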
A few spam technologies
   Cloaking
       Serve fake content to search engine robot
       DNS cloaking: Switch IP address. Impersonate
   Doorway pages
       Pages optimized for a single keyword that
        redirect to the real target page
   Keyword Spam
       Misleading meta-keywords, excessive
        repetition of a term, fake “anchor text”
       Hidden text with colors, CSS tricks, etc.
   Link spamming
       Mutual admiration societies, hidden links,
        awards
       Domain flooding: numerous domains that
        point or re-direct to a target page
   Robots
       Fake click stream
       Fake query stream
       Millions of submissions via Add-Url
More spam techniques
   Cloaking
       Serve fake content to search engine spider
       DNS cloaking: Switch IP address. Impersonate
   [Figure: cloaking decision flow. “Is this a Search Engine spider?” Y: serve SPAM; N: serve the Real Doc]
Tutorial on Cloaking & Stealth Technology
Variants of keyword stuffing
   Misleading meta-tags, excessive repetition
   Hidden text with colors, style sheet tricks,
    etc.

      Meta-Tags =
      “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation,
      sex, mp3, britney spears, viagra, …”
More spam techniques
   Doorway pages
       Pages optimized for a single keyword that
        redirect to the real target page
   Link spamming
       Mutual admiration societies, hidden links,
        awards – more on these later
       Domain flooding: numerous domains that
        point or re-direct to a target page
   Robots
       Fake query stream – rank checking programs
            “Curve-fit” ranking programs of search engines
       Millions of submissions via Add-Url
The war against spam
Quality signals - Prefer authoritative
pages based on:
   Votes from authors (linkage signals)
   Votes from users (usage signals)
   Policing of URL submissions
   Anti robot test
 Limits on meta-keywords
Robust link analysis
   Ignore statistically implausible linkage (or text)
   Use link analysis to detect spammers (guilt by
    association)
The war against spam
   Spam recognition by machine learning
       Training set based on known spam
   Family friendly filters
       Linguistic analysis, general classification
        techniques, etc.
       For images: flesh tone detectors, source text
        analysis, etc.
   Editorial intervention
       Blacklists
       Top queries audited
       Complaints addressed
Acid test
   Which SEOs rank highly on the query seo?
   Web search engines have policies on SEO
    practices they tolerate/block
       See pointers in Resources
   Adversarial IR: the unending (technical)
    battle between SEOs and web search
    engines
   See for instance
    http://airweb.cse.lehigh.edu/
Duplicate detection
Duplicate/Near-Duplicate Detection

   Duplication: Exact match with fingerprints
   Near-Duplication: Approximate match
   Overview
       Compute syntactic similarity with an edit-distance
        measure
       Use similarity threshold to detect near-duplicates
            E.g., Similarity > 80% => Documents are “near
             duplicates”
            Not transitive, though sometimes used
             transitively
Computing Similarity
   Segments of a document (natural or artificial
    breakpoints) [Brin95]
   Shingles (Word k-Grams) [Brin95, Brod98]
        “a rose is a rose is a rose” =>
           a_rose_is_a
              rose_is_a_rose
                    is_a_rose_is
   Similarity Measure between two docs (= sets
    of shingles)
       Set intersection [Brod98]
        (Specifically, Size_of_Intersection /
        Size_of_Union )
                   Jaccard measure
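The shingling example and the Jaccard measure above can be sketched as follows. The function names and the choice to return 0.0 for two empty sets (where the ratio is undefined) are assumptions for illustration.

```python
def shingles(text: str, k: int = 4) -> set:
    """Set of word k-grams of text, joined with underscores as on the slide."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(s1: set, s2: set) -> float:
    """Size_of_Intersection / Size_of_Union."""
    union = s1 | s2
    if not union:
        return 0.0  # undefined for two empty sets; 0.0 is chosen here
    return len(s1 & s2) / len(union)


a = shingles("a rose is a rose is a rose")
b = shingles("a rose is a flower")
print(sorted(a))
print(jaccard(a, b))
```

Repeated shingles collapse because a document is treated as a set, so the eight-word example yields only three distinct 4-grams.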
Shingles + Set Intersection
Computing exact set intersection of shingles
between all pairs of documents is expensive
   Approximate using a cleverly chosen subset of
    shingles from each (a sketch)
   Estimate Jaccard from a short sketch
Create a “sketch vector” (e.g., of size 200) for
each document
   Documents which share more than t (say 80%)
    corresponding vector elements are similar
   For doc d, sketch_d[i] is computed as follows:
       Let f map all shingles in the universe to 0..2^m
       Let π_i be a specific random permutation on 0..2^m
       Pick MIN π_i(f(s)) over all shingles s in d
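A hedged sketch of the sketch-vector computation. Two substitutions are assumptions, not the slides' method: f is a 64-bit BLAKE2b hash reduced modulo a prime P, and each "random permutation" is replaced by a linear map x -> (a*x + b) mod P, which is a true permutation of 0..P-1 but not a uniformly random one. Shingle sets are assumed non-empty.

```python
import hashlib
import random

P = (1 << 61) - 1  # a Mersenne prime; the hash universe here is 0..P-1


def f(shingle: str) -> int:
    """Map a shingle to an integer in 0..P-1 (stand-in for the slide's f)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % P


def make_permutations(n: int, seed: int = 0):
    """n pairs (a, b); each defines the permutation x -> (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]


def sketch(shingle_set, perms):
    """sketch[i] = MIN over shingles s of perm_i(f(s))."""
    values = [f(s) for s in shingle_set]
    return [min((a * v + b) % P for v in values) for a, b in perms]


def sketch_similarity(sk1, sk2) -> float:
    """Fraction of corresponding sketch elements that agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)


perms = make_permutations(200)
d1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
d2 = {"a_rose_is_a", "rose_is_a_flower"}
print(sketch_similarity(sketch(d1, perms), sketch(d2, perms)))
```

Two documents would then be flagged as near-duplicates when the agreeing fraction exceeds the threshold t.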
Shingling with sampling
minima
   Given two documents A1, A2.
   Let S1 and S2 be their shingle sets
   Resemblance = |Intersection of S1 and S2| /
     |Union of S1 and S2|
   Let Alpha = min ( π (S1))
   Let Beta = min (π(S2))
       Probability (Alpha = Beta) = Resemblance
Computing Sketch[i] for Doc1
   [Figure: Document 1 mapped onto number lines labeled 2^64]
       Start with 64-bit shingles
       Permute on the number line with π_i
       Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
   [Figure: Document 1 and Document 2, each mapped onto number lines labeled 2^64, with minima A and B]
       Are these equal?
       Test for 200 random permutations: π_1, π_2, …, π_200
However…
   [Figure: Document 1 and Document 2 on number lines labeled 2^64, with minima A and B]

A = B iff the shingle with the MIN value in the union of
Doc1 and Doc2 is common to both (i.e., lies in the
intersection)

This happens with probability:
       Size_of_intersection / Size_of_union
                     Why?
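One way to answer the question, as a short derivation. It assumes π is drawn uniformly at random from all permutations, so that every element of the union is equally likely to receive the smallest value; the symbol $x^{*}$ is introduced here for the argument.

```latex
Let $x^{*} = \arg\min_{x \in S_1 \cup S_2} \pi(x)$ be the element of the union
with the smallest permuted value. Since $\pi$ is uniform, $x^{*}$ is uniformly
distributed over $S_1 \cup S_2$.

If $x^{*} \in S_1 \cap S_2$, then $\min \pi(S_1) = \min \pi(S_2) = \pi(x^{*})$.
If $x^{*}$ lies in only one set, say $S_1$, then
$\min \pi(S_1) = \pi(x^{*}) < \min \pi(S_2)$, so the minima differ. Hence
\[
\Pr\bigl[\min \pi(S_1) = \min \pi(S_2)\bigr]
  = \Pr\bigl[x^{*} \in S_1 \cap S_2\bigr]
  = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}.
\]
```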
Set Similarity
   Set Similarity (Jaccard measure)
       $\mathrm{sim}_J(C_i, C_j) = \dfrac{|C_i \cap C_j|}{|C_i \cup C_j|}$
   View sets as columns of a matrix; one row for
    each element in the universe. aij = 1 indicates
    presence of item i in set j
    Example      C1 C2

                  0   1
                  1   0
                  1   1      simJ(C1,C2) = 2/5 = 0.4
                  0   0
                  1   1
                  0   1
Key Observation
   For columns Ci, Cj, four types of rows
             Ci     Cj
       A      1      1
       B      1      0
       C      0      1
       D      0      0
   Overload notation: A = # of rows of type A
   Claim: $\mathrm{sim}_J(C_i, C_j) = \dfrac{A}{A + B + C}$
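The claim can be checked on the example columns C1 and C2 from the Set Similarity slide. This is a small sketch; the function names are illustrative, and returning 0.0 when every row is type D (where the ratio is undefined) is an assumption.

```python
from collections import Counter


def row_type_counts(ci, cj):
    """Count rows of type A (1,1), B (1,0), C (0,1), and D (0,0)."""
    kinds = {(1, 1): "A", (1, 0): "B", (0, 1): "C", (0, 0): "D"}
    return Counter(kinds[(x, y)] for x, y in zip(ci, cj))


def sim_j(ci, cj) -> float:
    """Jaccard similarity computed as A / (A + B + C)."""
    n = row_type_counts(ci, cj)
    denom = n["A"] + n["B"] + n["C"]
    return n["A"] / denom if denom else 0.0  # undefined if all rows are type D


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
print(row_type_counts(C1, C2), sim_j(C1, C2))
```

Type D rows never enter the ratio, which is why sparse columns can be compared without looking at the rows where both are zero.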
Min Hashing
   Randomly permute rows
   h(Ci) = index of first row with 1 in column Ci
   Surprising property: $P[\,h(C_i) = h(C_j)\,] = \mathrm{sim}_J(C_i, C_j)$
   Why?
        Both are A/(A+B+C)
        Look down columns Ci, Cj until first non-Type-D
         row
        h(Ci) = h(Cj) ⟺ that row is a type A row
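The property can be checked empirically with a small simulation. This sketch shuffles the row order explicitly, which is only practical for tiny matrices; `min_hash` returns the rank of the first 1 in the permuted order, which serves the same purpose as the row index for testing equality. Names, trial count, and seed are arbitrary choices for illustration.

```python
import random


def min_hash(column, order):
    """Rank, in the permuted row order, of the first row with a 1 in column."""
    for rank, row in enumerate(order):
        if column[row] == 1:
            return rank
    return None  # the column is all zeros


def estimate_similarity(ci, cj, trials=10000, seed=0):
    """Fraction of random row permutations under which h(Ci) == h(Cj)."""
    rng = random.Random(seed)
    rows = list(range(len(ci)))
    agree = 0
    for _ in range(trials):
        rng.shuffle(rows)
        if min_hash(ci, rows) == min_hash(cj, rows):
            agree += 1
    return agree / trials


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
print(estimate_similarity(C1, C2))  # should land close to 2/5
```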
Mirror Detection
   Mirroring is systematic replication of web pages
    across hosts.
      Single largest cause of duplication on the web

   Host1/α and Host2/β are mirrors iff
        For all (or most) paths p, whenever
          http://Host1/α/p exists,
          http://Host2/β/p exists as well
        with identical (or near-identical) content, and
          vice versa.
Mirror Detection example
   http://www.elsevier.com/ and http://www.elsevier.nl/
   Structural Classification of Proteins
       http://scop.mrc-lmb.cam.ac.uk/scop
       http://scop.berkeley.edu/
       http://scop.wehi.edu.au/scop
       http://pdb.weizmann.ac.il/scop
       http://scop.protres.ru/
Repackaged Mirrors
   [Figure: side-by-side screenshots of Auctions.msn.com and Auctions.lycos.com, labeled “Aug”]
Motivation
   Why detect mirrors?
       Smart crawling
            Fetch from the fastest or freshest server
            Avoid duplication
       Better connectivity analysis
            Combine inlinks
            Avoid double counting outlinks
       Redundancy in result listings
            “If that fails you can try: <mirror>/samepath”
       Proxy caching
Bottom Up Mirror Detection
[Cho00]
   Maintain clusters of subgraphs
   Initialize clusters of trivial subgraphs
       Group near-duplicate single documents into a cluster
   Subsequent passes
       Merge clusters of the same cardinality and corresponding linkage




       Avoid decreasing cluster cardinality
   To detect mirrors we need:
       Adequate path overlap
       Contents of corresponding pages within a small time range
Can we use URLs to find mirrors?
   [Figure: two hosts, www.synthesis.org and synthesis.stanford.edu, each drawn as a small link graph over pages a, b, c, d]
   URLs on www.synthesis.org
       www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
       www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
       www.synthesis.org/Docs/annual.report96.final.html
       www.synthesis.org/Docs/cicee-berlin-paper.html
       www.synthesis.org/Docs/myr5
       www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
       www.synthesis.org/Docs/myr5/cs/cs-meta.html
       www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
       www.synthesis.org/Docs/myr5/mech/mech-take-home.html
       www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
       www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
       www.synthesis.org/Docs/yr5ar
       www.synthesis.org/Docs/yr5ar/assess
       www.synthesis.org/Docs/yr5ar/cicee
       www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
       www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
   URLs on synthesis.stanford.edu
       synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
       synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
       synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
       synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
       synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
       synthesis.stanford.edu/Docs/annual.report96.final.html
       synthesis.stanford.edu/Docs/annual.report96.final_fn.html
       synthesis.stanford.edu/Docs/myr5/assessment
       synthesis.stanford.edu/Docs/myr5/assessment/assessment-…
       synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
       synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
       synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
       synthesis.stanford.edu/Docs/myr5/cicee
       synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
       synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
       synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
Top Down Mirror Detection
    [Bhar99, Bhar00c]
   E.g.,
     www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
     synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
   What features could indicate mirroring?
        Hostname similarity:
             word unigrams and bigrams: { www, www.synthesis, synthesis, …}
        Directory similarity:
             Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
        IP address similarity:
             3 or 4 octet overlap
             Many hosts sharing an IP address => virtual hosting by an ISP
        Host outlink overlap
        Path overlap
             Potentially, path + sketch overlap
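Two of the features above, hostname word unigrams/bigrams and positional path bigrams, can be sketched as follows. Splitting hostnames on "." and paths on "/" is an assumption about how "words" are defined; the function names are illustrative and the cited papers may tokenize differently.

```python
def hostname_features(host: str) -> set:
    """Word unigrams and bigrams of a hostname, e.g. www and www.synthesis."""
    words = host.split(".")
    unigrams = set(words)
    bigrams = {".".join(words[i:i + 2]) for i in range(len(words) - 1)}
    return unigrams | bigrams


def path_bigrams(path: str) -> set:
    """Positional bigrams of path components, e.g. 0:Docs/ProjAbs."""
    parts = [p for p in path.split("/") if p]
    return {f"{i}:{parts[i]}/{parts[i + 1]}" for i in range(len(parts) - 1)}


h1 = hostname_features("www.synthesis.org")
h2 = hostname_features("synthesis.stanford.edu")
print(h1 & h2)
print(sorted(path_bigrams("Docs/ProjAbs/synsys/synalysis.html")))
```

Host pairs that share many such features become candidates; the features alone do not establish that two hosts are mirrors.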
Implementation
   Phase I - Candidate Pair Detection
        Find features that pairs of hosts have in common
        Compute a list of host pairs which might be mirrors
   Phase II - Host Pair Validation
         Test each host pair and determine extent of mirroring
              Check if 20 paths sampled from Host1 have near-duplicates
               on Host2 and vice versa
              Use transitive inferences:
                   IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
                   IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
   Evaluation
       140 million URLs on 230,000 hosts (1999)
       Best approach combined 5 sets of features
            Top 100,000 host pairs had precision = 0.57 and recall = 0.86
WebIR Infrastructure
   Connectivity Server
       Fast access to links to support link
        analysis
   Term Vector Database
       Fast access to document vectors to augment
        link analysis
Connectivity Server
[CS1: Bhar98b, CS2 & 3: Rand01]
   Fast web graph access to support connectivity
    analysis
   Stores mappings in memory from
       URL to outlinks, URL to inlinks
   Applications
       HITS, Pagerank computations
       Crawl simulation
       Graph algorithms: web connectivity, diameter etc.
            more on this later
       Visualizations
Usage
   Input: graph algorithm + URLs + values
       URLs to FPs to IDs
   Execution: graph algorithm runs in memory
       IDs to URLs
   Output: URLs + values
   Translation Tables on Disk
       URL text: 9 bytes/URL (compressed from ~80 bytes)
       FP(64b) -> ID(32b): 5 bytes
       ID(32b) -> FP(64b): 8 bytes
       ID(32b) -> URLs: 0.5 bytes
ID assignment
   Partition URLs into 3 sets, sorted
    lexicographically
       High: Max degree > 254
       Medium: 254 > Max degree > 24
       Low: remaining (75%)
   IDs assigned in sequence (densely)
   E.g., HIGH IDs: Max(indegree, outdegree) > 254
       ID          URL
       …
       9891        www.amazon.com/
       9912        www.amazon.com/jobs/
       …
       9821878     www.geocities.com/
       …
       40930030    www.google.com/
       …
       85903590    www.yahoo.com/
Adjacency lists
   In memory tables for Outlinks, Inlinks
   List index maps from a Source ID to start of
    adjacency list
Adjacency List Compression - I
   [Figure: a List Index (entries 104, 105, 106) pointing into a Sequence of Adjacency Lists (… 98, 132, 153, 98, 147, 153 …), shown next to the same List Index pointing into Delta Encoded Adjacency Lists (… -6, 34, 21, -8, 49, 6 …)]
• Adjacency List:
     - Smaller delta values are exponentially more frequent (80% to same host)
     - Compress deltas with variable length encoding (e.g., Huffman)
• List Index pointers: 32b for high, Base+16b for med, Base+8b for low
     - Avg = 12b per pointer
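A sketch of delta encoding for one adjacency list. It assumes the first delta is taken relative to the source ID and later deltas relative to the previous target, which is consistent with the numbers in the figure (source 104 with targets 98, 132, 153 gives -6, 34, 21) but is not stated on the slide. The slide suggests Huffman coding for the deltas; the zigzag plus variable-byte code below is a simpler substitute chosen for illustration.

```python
def delta_encode(source_id, adjacency):
    """First delta is relative to the source ID, the rest to the previous target."""
    deltas, prev = [], source_id
    for target in adjacency:
        deltas.append(target - prev)
        prev = target
    return deltas


def delta_decode(source_id, deltas):
    """Invert delta_encode by accumulating the deltas."""
    out, prev = [], source_id
    for d in deltas:
        prev += d
        out.append(prev)
    return out


def zigzag(n: int) -> int:
    """Map signed to unsigned so that small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def varbyte(u: int) -> bytes:
    """Variable-byte code: 7 payload bits per byte, high bit means 'more follows'."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


deltas = delta_encode(104, [98, 132, 153])
encoded = b"".join(varbyte(zigzag(d)) for d in deltas)
print(deltas, len(encoded), "bytes")
```

Small deltas are common because most links stay on the same host and IDs are assigned in lexicographic URL order, so short codes cover most entries.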
Adjacency List Compression - II

   Inter List Compression
       Basis: Similar URLs may share links
            Close in ID space => adjacency lists may overlap
       Approach
            Define a representative adjacency list for a block of IDs
                  Adjacency list of a reference ID
                  Union of adjacency lists in the block
            Represent adjacency list in terms of deletions and additions
             when it is cheaper to do so
       Measurements
            Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
            Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
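The inter-list approach can be sketched as encoding a list against a reference list. The cost model (count of entries to store) and the tuple layout are assumptions for illustration; the measurements above come from the cited systems, not from this sketch.

```python
def encode_against_reference(adjacency, reference):
    """Store deletions and additions relative to reference when that is cheaper."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)
    additions = sorted(adj - ref)
    # Assumed cost model: one unit per stored entry.
    if len(deletions) + len(additions) < len(adj):
        return ("ref", deletions, additions)
    return ("raw", sorted(adj))


def decode_against_reference(encoded, reference):
    """Rebuild the sorted adjacency list from either representation."""
    if encoded[0] == "raw":
        return encoded[1]
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))


reference = [98, 132, 153]
enc = encode_against_reference([98, 147, 153], reference)
print(enc, decode_against_reference(enc, reference))
```

When a list has little in common with the reference, the encoder falls back to the raw list, matching the "when it is cheaper to do so" condition.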
Term Vector Database
    [Stat00]
   Fast access to 50 word term vectors for web pages
        Term Selection:
             Restricted to middle 1/3rd of lexicon by document frequency
             Top 50 words in document by TF.IDF.
        Term Weighting:
             Deferred till run-time (can be based on term freq, doc freq, doc length)
   Applications
        Content + Connectivity analysis (e.g., Topic Distillation)
        Topic specific crawls
        Document classification
   Performance
        Storage: 33GB for 272M term vectors
        Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk
         block)
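A sketch of the term selection step: keep only terms from the middle third of the lexicon by document frequency, then take the top terms by TF.IDF. The exact IDF formula (natural log of N/df), the tie-breaking, and the function names are assumptions; raw term frequencies are stored so that weighting can be deferred to run time, as the slide describes.

```python
import math
from collections import Counter


def middle_third(doc_freq: dict) -> set:
    """Terms in the middle third of the lexicon when sorted by document frequency."""
    ranked = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(ranked)
    return set(ranked[n // 3: n - n // 3])


def term_vector(tokens, doc_freq, n_docs, size=50):
    """Top `size` allowed terms by TF.IDF, stored as (term, raw frequency) pairs."""
    allowed = middle_third(doc_freq)
    tf = Counter(t for t in tokens if t in allowed)
    scored = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
    top = sorted(scored, key=lambda t: (-scored[t], t))[:size]
    return [(t, tf[t]) for t in top]


doc_freq = {"the": 1000, "of": 900, "maui": 50, "resort": 40, "zyzzyva": 1, "qux": 2}
tokens = "the maui resort of maui zyzzyva".split()
print(term_vector(tokens, doc_freq, n_docs=1000))
```

Dropping the most and least frequent thirds removes both stop words and very rare terms before ranking.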
Architecture
   [Figure: URLid to Term Vector Lookup. Labels: URLid * 64 /480; Base (4 bytes); Bit vector for 480 URLids; offset; 128 Byte TV Record containing URL Info, Terms (LC:TID, LC:TID, …, LC:TID), and Freq (FRQ:RL, FRQ:RL, …, FRQ:RL)]
More Related Content

Similar to seo tutorial

Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic
 

Similar to seo tutorial (20)

Anatomy of google
Anatomy of googleAnatomy of google
Anatomy of google
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your Code
 
Measuring Your Code 2.0
Measuring Your Code 2.0Measuring Your Code 2.0
Measuring Your Code 2.0
 
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 
PoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus ManagementPoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus Management
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
 
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsQuality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
 
Quality, quantity, web and semantics
Quality, quantity, web and semanticsQuality, quantity, web and semantics
Quality, quantity, web and semantics
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
 
The search engine index
The search engine indexThe search engine index
The search engine index
 

Recently uploaded

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Recently uploaded (20)

HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

seo tutorial

  • 1. Special Topics in Search Engines Result Summaries Anti-spamming Duplicate elimination
  • 3. Summaries  Having ranked the documents matching a query, we wish to present a results list  Most commonly, the document title plus a short summary  The title is typically automatically extracted from document metadata  What about the summaries?
  • 4. Summaries  Two basic kinds:  Static  Dynamic  A static summary of a document is always the same, regardless of the query that hit the doc  Dynamic summaries are query-dependent attempt to explain why the document was retrieved for the query at hand
  • 5. Static summaries  In typical systems, the static summary is a subset of the document  Simplest heuristic: the first 50 (or so – this can be varied) words of the document  Summary cached at indexing time  More sophisticated: extract from each document a set of “key” sentences  Simple NLP heuristics to score each sentence  Summary is made up of top-scoring sentences.  Most sophisticated: NLP used to synthesize a summary  Seldom used in IR; cf. text summarization
  • 6. Dynamic summaries  Present one or more “windows” within the document that contain several of the query terms  “KWIC” snippets: Keyword in Context presentation  Generated in conjunction with scoring  If query found as a phrase, the/some occurrences of the phrase in the doc  If not, windows within the doc that contain multiple query terms  The summary itself gives the entire content of the window – all terms, not only the query terms
  • 7. Generating dynamic summaries  If we have only a positional index, we cannot (easily) reconstruct context surrounding hits  If we cache the documents at index time, can run the window through it, cueing to hits found in the positional index  E.g., positional index says “the query is a phrase in position 4378” so we go to this position in the cached document and stream out the content  Most often, cache a fixed-size prefix of the doc  Note: Cached copy can be outdated
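A minimal sketch of the lookup described above, assuming the cached document is kept as a list of tokens and the positional index reports a token offset for the hit. The names `kwic_snippet`, `phrase_len` and `context` are hypothetical.

```python
from typing import List, Optional


def kwic_snippet(cached_tokens: List[str], hit_position: int,
                 phrase_len: int = 1, context: int = 5) -> Optional[str]:
    """Stream out a window of the cached document around a hit.

    Returns None when the hit lies beyond the cached copy, which can
    happen if only a fixed-size prefix of the document was cached.
    """
    if hit_position >= len(cached_tokens):
        return None
    start = max(0, hit_position - context)
    end = min(len(cached_tokens), hit_position + phrase_len + context)
    # The window contains all terms, not only the query terms.
    return " ".join(cached_tokens[start:end])


tokens = "the quick brown fox jumps over the lazy dog".split()
print(kwic_snippet(tokens, hit_position=3, phrase_len=1, context=2))
```

The `None` branch reflects the prefix-caching caveat: a hit position reported by the index may simply not exist in the cached copy.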
  • 8. Dynamic summaries  Producing good dynamic summaries is a tricky optimization problem  The real estate for the summary is normally small and fixed  Want short item, so show as many KWIC matches as possible, and perhaps other things like title  Want snippets to be long enough to be useful  Want linguistically well-formed snippets: users prefer snippets that contain complete phrases  Want snippets maximally informative about doc  But users really like snippets, even if they complicate IR system design
  • 10. Adversarial IR (Spam)  Motives  Commercial, political, religious, lobbies  Promotion funded by advertising budget  Operators  Contractors (Search Engine Optimizers) for lobbies, companies  Web masters  Hosting services  Forum  Web master world (www.webmasterworld.com)  Search engine specific tricks  Discussions about academic papers
  • 11. Search Engine Optimization II: Adversarial IR (“search engine wars”)
  • 12. Can you trust words on the page?  [Figure: screenshots of auctions.hitsoffice.com/ and www.ebay.com/, with the label “Pornographic Content”]  Examples from July 2002
  • 13. Simplest forms  Early engines relied on the density of terms  The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s  SEOs responded with dense repetitions of chosen terms  e.g., maui resort maui resort maui resort  Often, the repetitions would be in the same color as the background of the web page  Repeated terms got indexed by crawlers  But not visible to humans on browsers Can’t trust the words on a web page, for ranking.
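To make the weakness concrete, here is a toy scorer that ranks pages purely by how often the query terms occur. This is only an illustration of the idea of term-count ranking; `term_count_score` and the sample pages are hypothetical and do not reproduce any real engine.

```python
from collections import Counter
from typing import List


def term_count_score(page_text: str, query_terms: List[str]) -> int:
    """Score a page by the total number of query-term occurrences."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query_terms)


query = ["maui", "resort"]
honest = "our maui resort offers ocean views and a quiet beach"
stuffed = "maui resort maui resort maui resort"

# The stuffed page outranks the honest one under this scoring.
print(term_count_score(honest, query), term_count_score(stuffed, query))
```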
  • 14. A few spam technologies  Cloaking  Serve fake content to search engine robot  DNS cloaking: Switch IP address. Impersonate  Doorway pages  Pages optimized for a single keyword that re-direct to the real target page  Keyword Spam  Misleading meta-keywords, excessive repetition of a term, fake “anchor text”  Hidden text with colors, CSS tricks, etc.  Link spamming  Mutual admiration societies, hidden links, awards  Domain flooding: numerous domains that point or re-direct to a target page  Robots  Fake click stream  Fake query stream  Millions of submissions via Add-Url
  • 15. More spam techniques  Cloaking  Serve fake content to search engine spider  DNS cloaking: Switch IP address. Impersonate  [Figure: Cloaking. Is this a Search Engine spider? Y: serve SPAM; N: serve Real Doc]
  • 16. Tutorial on Cloaking & Stealth Technology
  • 17. Variants of keyword stuffing  Misleading meta-tags, excessive repetition  Hidden text with colors, style sheet tricks, etc. Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
  • 18. More spam techniques  Doorway pages  Pages optimized for a single keyword that re-direct to the real target page  Link spamming  Mutual admiration societies, hidden links, awards – more on these later  Domain flooding: numerous domains that point or re-direct to a target page  Robots  Fake query stream – rank checking programs  “Curve-fit” ranking programs of search engines  Millions of submissions via Add-Url
  • 19. The war against spam Quality signals - Prefer authoritative pages based on:  Votes from authors (linkage signals)  Votes from users (usage signals)  Policing of URL submissions  Anti robot test  Limits on meta-keywords Robust link analysis  Ignore statistically implausible linkage (or text)  Use link analysis to detect spammers (guilt by association)
  • 20. The war against spam  Spam recognition by machine learning  Training set based on known spam  Family friendly filters  Linguistic analysis, general classification techniques, etc.  For images: flesh tone detectors, source text analysis, etc.  Editorial intervention  Blacklists  Top queries audited  Complaints addressed
  • 21. Acid test  Which SEO’s rank highly on the query seo?  Web search engines have policies on SEO practices they tolerate/block  See pointers in Resources  Adversarial IR: the unending (technical) battle between SEO’s and web search engines  See for instance http://airweb.cse.lehigh.edu/
  • 23. Duplicate/Near-Duplicate Detection  Duplication: Exact match with fingerprints  Near-Duplication: Approximate match  Overview  Compute syntactic similarity with an edit-distance measure  Use similarity threshold to detect near-duplicates  E.g., Similarity > 80% => Documents are “near duplicates”  Not transitive though sometimes used transitively
  • 24. Computing Similarity  Segments of a document (natural or artificial breakpoints) [Brin95]  Shingles (Word k-Grams) [Brin95, Brod98]  “a rose is a rose is a rose” => a_rose_is_a, rose_is_a_rose, is_a_rose_is  Similarity Measure between two docs (= sets of shingles)  Set intersection [Brod98] (specifically, Size_of_Intersection / Size_of_Union: the Jaccard measure)
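The shingle example above can be reproduced with a short sketch. The function names `shingles` and `jaccard`, whitespace tokenization, and the choice k = 4 (which matches the three shingles listed on the slide) are assumptions for illustration.

```python
from typing import Set


def shingles(text: str, k: int = 4) -> Set[str]:
    """Return the set of word k-grams of text, joined with underscores."""
    words = text.split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size_of_Intersection / Size_of_Union.

    Returns 0.0 when both sets are empty (a convention chosen here,
    not something the slides specify).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose")
print(sorted(s1))
print(jaccard(s1, s2))
```

Because a document is treated as a set of shingles, repeated k-grams count once, which is why the eight-word example yields only three shingles.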
  • 25. Shingles + Set Intersection  Computing exact set intersection of shingles between all pairs of documents is expensive  Approximate using a cleverly chosen subset of shingles from each (a sketch)  Estimate Jaccard from a short sketch  Create a “sketch vector” (e.g., of size 200) for each document  Documents which share more than t (say 80%) corresponding vector elements are similar  For doc d, sketch_d[i] is computed as follows:  Let f map all shingles in the universe to 0..2^m  Let π_i be a specific random permutation on 0..2^m  Pick MIN π_i(f(s)) over all shingles s in d
  • 26. Shingling with sampling minima  Given two documents A1, A2.  Let S1 and S2 be their shingle sets  Resemblance = |Intersection of S1 and S2| / |Union of S1 and S2|.  Let Alpha = min(π(S1))  Let Beta = min(π(S2))  Probability (Alpha = Beta) = Resemblance
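The sketch construction can be approximated in a few lines. One loud assumption: instead of true random permutations π_i, this sketch salts a SHA-1 hash with the index i and keeps 64 bits, which behaves like a different random ordering per i for practical purposes but is not what the slides literally describe. The names `sketch` and `estimate_resemblance` are hypothetical.

```python
import hashlib
from typing import Iterable, List


def _salted_hash(i: int, shingle: str) -> int:
    """Stand-in for pi_i(f(s)): a 64-bit value from a hash salted with i."""
    digest = hashlib.sha1(f"{i}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(shingle_set: Iterable[str], size: int = 200) -> List[int]:
    """Keep the minimum salted hash for each of `size` orderings.

    The shingle set must be non-empty.
    """
    items = list(shingle_set)
    return [min(_salted_hash(i, s) for s in items) for i in range(size)]


def estimate_resemblance(sk1: List[int], sk2: List[int]) -> float:
    """Fraction of corresponding sketch elements that agree."""
    matches = sum(1 for a, b in zip(sk1, sk2) if a == b)
    return matches / len(sk1)


s1 = {f"shingle{n}" for n in range(0, 100)}
s2 = {f"shingle{n}" for n in range(50, 150)}
# True resemblance is 50 / 150; the estimate should land near it.
print(estimate_resemblance(sketch(s1), sketch(s2)))
```

With 200 elements the estimate is noisy by a few percentage points, so a threshold such as 80% is a statistical test rather than an exact one.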
  • 27. Computing Sketch[i] for Doc1  [Figure: Document 1 mapped onto the number line 0..2^64]  Start with 64-bit shingles  Permute on the number line with π_i  Pick the min value
  • 28. Test if Doc1.Sketch[i] = Doc2.Sketch[i]  [Figure: Document 1 and Document 2 each mapped onto the number line 0..2^64, with min values A and B]  Are these equal?  Test for 200 random permutations: π_1, π_2, …, π_200
  • 29. However…  [Figure: Document 1 and Document 2 on the number line 0..2^64, with min values A and B]  A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)  This happens with probability Size_of_intersection / Size_of_union  Why?
  • 30. Set Similarity  Set Similarity (Jaccard measure): $\mathrm{sim}_J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$  View sets as columns of a matrix; one row for each element in the universe. $a_{ij} = 1$ indicates presence of item $i$ in set $j$  Example:
      C1  C2
       0   1
       1   0
       1   1
       0   0
       1   1
       0   1
      sim_J(C1, C2) = 2/5 = 0.4
  • 31. Key Observation  For columns C_i, C_j, four types of rows:
      Type  C_i  C_j
       A     1    1
       B     1    0
       C     0    1
       D     0    0
     Overload notation: A = # of rows of type A  Claim: $\mathrm{sim}_J(C_i, C_j) = \frac{A}{A + B + C}$
  • 32. Min Hashing  Randomly permute rows  h(C_i) = index of first row with 1 in column C_i  Surprising Property: $P[h(C_i) = h(C_j)] = \mathrm{sim}_J(C_i, C_j)$  Why?  Both are A/(A+B+C)  Look down columns C_i, C_j until first non-Type-D row  h(C_i) = h(C_j) ⇔ type A row
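The property can be checked exhaustively on the six-row example from the Set Similarity slide (columns C1 and C2, Jaccard similarity 2/5). The sketch below enumerates every row permutation instead of sampling; `min_hash` and `collision_probability` are hypothetical names.

```python
from fractions import Fraction
from itertools import permutations
from typing import List, Optional, Sequence

# Columns from the Set Similarity example, one entry per row.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]


def min_hash(column: Sequence[int], order: Sequence[int]) -> Optional[int]:
    """Index, in permuted order, of the first row holding a 1."""
    for rank, row in enumerate(order):
        if column[row] == 1:
            return rank
    return None  # all-zero column


def collision_probability(c1: List[int], c2: List[int]) -> Fraction:
    """Exact P[h(c1) = h(c2)] over all row permutations."""
    total = agree = 0
    for order in permutations(range(len(c1))):
        total += 1
        if min_hash(c1, order) == min_hash(c2, order):
            agree += 1
    return Fraction(agree, total)


print(collision_probability(C1, C2))  # prints 2/5
```

The first non-type-D row in a random permutation is equally likely to be any of the A + B + C rows, and the two hashes agree exactly when that row is of type A, which gives A/(A+B+C) = 2/5 here.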
  • 33. Mirror Detection  Mirroring is systematic replication of web pages across hosts.  Single largest cause of duplication on the web  Host1/α and Host2/β are mirrors iff, for all (or most) paths p, whenever http://Host1/α/p exists, http://Host2/β/p exists as well with identical (or near-identical) content, and vice versa.
  • 34. Mirror Detection example  http://www.elsevier.com/ and http://www.elsevier.nl/  Structural Classification of Proteins  http://scop.mrc-lmb.cam.ac.uk/scop  http://scop.berkeley.edu/  http://scop.wehi.edu.au/scop  http://pdb.weizmann.ac.il/scop  http://scop.protres.ru/
  • 35. Repackaged Mirrors  [Figure: screenshots of Auctions.msn.com and Auctions.lycos.com; Aug]
  • 36. Motivation  Why detect mirrors?  Smart crawling  Fetch from the fastest or freshest server  Avoid duplication  Better connectivity analysis  Combine inlinks  Avoid double counting outlinks  Redundancy in result listings  “If that fails you can try: <mirror>/samepath”  Proxy caching
  • 37. Bottom Up Mirror Detection [Cho00]  Maintain clusters of subgraphs  Initialize clusters of trivial subgraphs  Group near-duplicate single documents into a cluster  Subsequent passes  Merge clusters of the same cardinality and corresponding linkage  Avoid decreasing cluster cardinality  To detect mirrors we need:  Adequate path overlap  Contents of corresponding pages within a small time range
  • 38. Can we use URLs to find mirrors?  [Figure: link graphs over pages a, b, c, d on www.synthesis.org and on synthesis.stanford.edu]
    URLs on www.synthesis.org:
      www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
      www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
      www.synthesis.org/Docs/annual.report96.final.html
      www.synthesis.org/Docs/cicee-berlin-paper.html
      www.synthesis.org/Docs/myr5
      www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
      www.synthesis.org/Docs/myr5/cs/cs-meta.html
      www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
      www.synthesis.org/Docs/myr5/mech/mech-take-home.html
      www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
      www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
      www.synthesis.org/Docs/yr5ar
      www.synthesis.org/Docs/yr5ar/assess
      www.synthesis.org/Docs/yr5ar/cicee
      www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
      www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
    URLs on synthesis.stanford.edu:
      synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
      synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
      synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
      synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
      synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
      synthesis.stanford.edu/Docs/annual.report96.final.html
      synthesis.stanford.edu/Docs/annual.report96.final_fn.html
      synthesis.stanford.edu/Docs/myr5/assessment
      synthesis.stanford.edu/Docs/myr5/assessment/assessment-…
      synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
      synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
      synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
      synthesis.stanford.edu/Docs/myr5/cicee
      synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
      synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
      synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
  • 39. Top Down Mirror Detection [Bhar99, Bhar00c]  E.g., www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html  What features could indicate mirroring?  Hostname similarity:  word unigrams and bigrams: { www, www.synthesis, synthesis, …}  Directory similarity:  Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }  IP address similarity:  3 or 4 octet overlap  Many hosts sharing an IP address => virtual hosting by an ISP  Host outlink overlap  Path overlap  Potentially, path + sketch overlap
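Two of the listed features, hostname unigrams/bigrams and positional path bigrams, can be sketched as below. Splitting hostnames on dots and paths on slashes, and treating the file name as the last path component, are assumptions; the function names are hypothetical.

```python
from typing import Set


def hostname_features(host: str) -> Set[str]:
    """Word unigrams and bigrams of a hostname, split on dots."""
    words = host.split(".")
    bigrams = {f"{a}.{b}" for a, b in zip(words, words[1:])}
    return set(words) | bigrams


def positional_path_bigrams(path: str) -> Set[str]:
    """Bigrams of adjacent path components, tagged with their depth."""
    parts = [p for p in path.split("/") if p]
    return {f"{i}:{a}/{b}" for i, (a, b) in enumerate(zip(parts, parts[1:]))}


h1 = hostname_features("www.synthesis.org")
h2 = hostname_features("synthesis.stanford.edu")
print(h1 & h2)  # features the two hostnames share
print(sorted(positional_path_bigrams("Docs/ProjAbs/synsys/synalysis.html")))
```

Shared features such as `synthesis` or `0:Docs/ProjAbs` only nominate candidate host pairs; they do not confirm mirroring on their own.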
  • 40. Implementation  Phase I - Candidate Pair Detection  Find features that pairs of hosts have in common  Compute a list of host pairs which might be mirrors  Phase II - Host Pair Validation  Test each host pair and determine extent of mirroring  Check if 20 paths sampled from Host1 have near-duplicates on Host2 and vice versa  Use transitive inferences: IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B) IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)  Evaluation  140 million URLs on 230,000 hosts (1999)  Best approach combined 5 sets of features  Top 100,000 host pairs had precision = 0.57 and recall = 0.86
  • 41. WebIR Infrastructure  Connectivity Server  Fast access to links to support link analysis  Term Vector Database  Fast access to document vectors to augment link analysis
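The two transitive inference rules can be sketched as a lookup over already-validated host pairs. Storing results keyed by unordered pairs (so Mirror is treated as symmetric) is an assumption, and `infer_mirror` is a hypothetical name.

```python
from typing import Dict, FrozenSet, Optional


def infer_mirror(known: Dict[FrozenSet[str], bool],
                 a: str, b: str) -> Optional[bool]:
    """Infer Mirror(a, b) from validated pairs, or None if undecided.

    known maps an unordered host pair to the outcome of Phase II
    validation for that pair.
    """
    hosts = {h for pair in known for h in pair}
    for x in hosts - {a, b}:
        a_x = known.get(frozenset((a, x)))
        x_b = known.get(frozenset((x, b)))
        if a_x is True and x_b is True:
            return True   # Mirror(A,x) AND Mirror(x,B) => Mirror(A,B)
        if a_x is True and x_b is False:
            return False  # Mirror(A,x) AND !Mirror(x,B) => !Mirror(A,B)
    return None


known = {
    frozenset(("A", "X")): True,
    frozenset(("X", "B")): True,
    frozenset(("X", "C")): False,
}
print(infer_mirror(known, "A", "B"), infer_mirror(known, "A", "C"))
```

Each inference of this kind saves a round of sampling and comparing paths for the pair in question.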
  • 42. Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]  Fast web graph access to support connectivity analysis  Stores mappings in memory from  URL to outlinks, URL to inlinks  Applications  HITS, Pagerank computations  Crawl simulation  Graph algorithms: web connectivity, diameter etc.  more on this later  Visualizations
  • 43. Usage  Input: Graph algorithm + URLs + Values; URLs to FPs to IDs  Execution: Graph algorithm runs in memory  Output: URLs + Values; IDs to URLs  Translation Tables on Disk  URL text: 9 bytes/URL (compressed from ~80 bytes)  FP(64b) -> ID(32b): 5 bytes  ID(32b) -> FP(64b): 8 bytes  ID(32b) -> URLs: 0.5 bytes
  • 44. ID assignment  Partition URLs into 3 sets, sorted lexicographically  High: Max degree > 254  Medium: 254 > Max degree > 24  Low: remaining (75%)  IDs assigned in sequence (densely)  Adjacency lists  In memory tables for Outlinks, Inlinks  List index maps from a Source ID to start of adjacency list
    E.g., HIGH IDs: Max(indegree, outdegree) > 254
      ID        URL
      …
      9891      www.amazon.com/
      9912      www.amazon.com/jobs/
      …
      9821878   www.geocities.com/
      …
      40930030  www.google.com/
      …
      85903590  www.yahoo.com/
  • 45. Adjacency List Compression - I  [Figure: a list index pointing into a sequence of adjacency lists, shown next to a list index pointing into delta-encoded adjacency lists] • Adjacency List: - Smaller delta values are exponentially more frequent (80% to same host) - Compress deltas with variable length encoding (e.g., Huffman) • List Index pointers: 32b for high, Base+16b for med, Base+8b for low - Avg = 12b per pointer
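A minimal sketch of the delta step only. It assumes the first delta is taken relative to the source ID and later deltas relative to the previous target, and it uses illustrative ID values; the variable-length coding of the deltas (e.g., Huffman) is not shown.

```python
from typing import List


def delta_encode(source_id: int, adjacency: List[int]) -> List[int]:
    """Replace each target ID by its gap from the previous value.

    The first gap is measured from source_id (an assumption), which
    keeps it small when most links stay on the same host.
    """
    deltas = []
    previous = source_id
    for target in sorted(adjacency):
        deltas.append(target - previous)
        previous = target
    return deltas


def delta_decode(source_id: int, deltas: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the gaps."""
    targets = []
    previous = source_id
    for delta in deltas:
        previous += delta
        targets.append(previous)
    return targets


encoded = delta_encode(104, [98, 132, 153])
print(encoded)
print(delta_decode(104, encoded))
```

Since IDs are assigned in lexicographic URL order, links within one host produce small gaps, which is what makes a variable-length code pay off.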
  • 46. Adjacency List Compression - II  Inter List Compression  Basis: Similar URLs may share links  Close in ID space => adjacency lists may overlap  Approach  Define a representative adjacency list for a block of IDs  Adjacency list of a reference ID  Union of adjacency lists in the block  Represent adjacency list in terms of deletions and additions when it is cheaper to do so  Measurements  Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)  Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
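The inter-list idea can be sketched as storing a list as deletions and additions against a reference list whenever that is smaller. The cost model here (number of stored entries) is a simplifying assumption, as are the function names and the tagged-tuple representation.

```python
from typing import List, Tuple


def encode_against_reference(adjacency: List[int],
                             reference: List[int]) -> Tuple:
    """Store a diff against the reference list only when it is cheaper."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)
    additions = sorted(adj - ref)
    if len(deletions) + len(additions) < len(adj):
        return ("diff", deletions, additions)
    return ("plain", sorted(adj))


def decode_against_reference(encoded: Tuple, reference: List[int]) -> List[int]:
    """Rebuild the adjacency list from either representation."""
    if encoded[0] == "plain":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))


reference = [10, 11, 12, 13, 14]
similar = [10, 11, 12, 14, 15]
encoded = encode_against_reference(similar, reference)
print(encoded)
print(decode_against_reference(encoded, reference))
```

A list with little overlap falls back to the plain form, matching the "when it is cheaper to do so" condition above.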
  • 47. Term Vector Database [Stat00]  Fast access to 50 word term vectors for web pages  Term Selection:  Restricted to middle 1/3rd of lexicon by document frequency  Top 50 words in document by TF.IDF.  Term Weighting:  Deferred till run-time (can be based on term freq, doc freq, doc length)  Applications  Content + Connectivity analysis (e.g., Topic Distillation)  Topic specific crawls  Document classification  Performance  Storage: 33GB for 272M term vectors  Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
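The term selection rule can be sketched as follows: keep only terms from the middle third of the lexicon by document frequency, pick the top terms by TF.IDF, and store raw frequencies so that weighting can be deferred to run time. The exact IDF formula, the tie-breaking, and the names `middle_third` and `term_vector` are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def middle_third(doc_freq: Dict[str, int]) -> Set[str]:
    """Terms in the middle third of the lexicon, ranked by document frequency."""
    ranked = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(ranked)
    return set(ranked[n // 3:(2 * n) // 3])


def term_vector(doc_tokens: List[str], doc_freq: Dict[str, int],
                num_docs: int, size: int = 50) -> List[Tuple[str, int]]:
    """Select up to `size` terms by TF.IDF; store raw term frequencies."""
    allowed = middle_third(doc_freq)
    tf = Counter(t for t in doc_tokens if t in allowed)
    score = {t: tf[t] * math.log(num_docs / doc_freq[t]) for t in tf}
    top = sorted(score, key=lambda t: (-score[t], t))[:size]
    return [(t, tf[t]) for t in top]


doc_freq = {"the": 100, "of": 90, "web": 40, "graph": 30, "mirror": 5, "shingle": 2}
tokens = "the web graph of the web mirror".split()
print(term_vector(tokens, doc_freq, num_docs=100))
```

Dropping both the most common and the rarest thirds of the lexicon removes stop words and one-off terms before the TF.IDF ranking is applied.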
  • 48. Architecture  [Figure: URLid to Term Vector Lookup. Labels: offset = URLid * 64 / 480; Base (4 bytes); bit vector for 480 URLids; 128-byte TV record holding URL info, terms (LC:TID entries) and frequencies (FRQ:RL entries)]

Editor's Notes

  1. Arms race
  2. Small biotech firm; query example from last time; infoseek example
  3. Talk about expert witness; george w bush example
  4. More complex problem of finding the “original” site