Special Topics in
 Search Engines
     Result Summaries
    Duplicate elimination
Results summaries
   Having ranked the documents matching a
    query, we wish to present a results list
   Most commonly, the document title plus a
    short summary
   The title is typically automatically extracted
    from document metadata
   What about the summaries?
   Two basic kinds:
       Static
       Dynamic

    A static summary of a document is always
    the same, regardless of the query that hit
    the doc
   Dynamic summaries are query-dependent: they
    attempt to explain why the document was
    retrieved for the query at hand
Static summaries
   In typical systems, the static summary is a
    subset of the document
   Simplest heuristic: the first 50 (or so – this
    can be varied) words of the document
       Summary cached at indexing time
   More sophisticated: extract from each
    document a set of “key” sentences
       Simple NLP heuristics to score each sentence
       Summary is made up of top-scoring sentences
   Most sophisticated: NLP used to synthesize
    a summary
       Seldom used in IR; cf. text summarization
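The two simpler heuristics above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the regex-based sentence splitter, and the frequency-based sentence score are all assumptions standing in for the "simple NLP heuristics" the slide mentions.

```python
import re
from collections import Counter


def first_words_summary(text, max_words=50):
    """Static summary: the first max_words words of the document."""
    return " ".join(text.split()[:max_words])


def key_sentence_summary(text, num_sentences=2):
    """Static summary built from the top-scoring sentences.

    Assumed heuristic: a sentence scores the average document-wide
    frequency of its words. Selected sentences keep document order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence):
        words = re.findall(r"\w+", sentence)
        return sum(freq[w.lower()] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:num_sentences]))
```

Either summary can be computed once and cached at indexing time, since it does not depend on the query.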
Dynamic summaries
   Present one or more “windows” within the
    document that contain several of the query terms
       “KWIC” snippets: Keyword in Context
   Generated in conjunction with scoring
       If query found as a phrase, the/some
        occurrences of the phrase in the doc
       If not, windows within the doc that contain
        multiple query terms
   The summary itself gives the entire content
    of the window – all terms, not only the query terms
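A minimal sketch of picking one KWIC window, assuming the document is already tokenized. The function name, the fixed window width, and the "most distinct query terms wins" rule are illustrative assumptions; the slides do not specify a selection rule.

```python
def best_window(tokens, query_terms, width=8):
    """Return (start, window) for the width-token window that covers
    the most distinct query terms; ties go to the earliest window."""
    query = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for start in range(max(len(tokens) - width, 0) + 1):
        window = tokens[start:start + width]
        hits = len(query & {t.lower() for t in window})
        if hits > best_hits:
            best_start, best_hits = start, hits
    return best_start, tokens[best_start:best_start + width]
```

The returned window contains every token in the span, not just the query terms, which matches the point that the summary shows the entire content of the window.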
Generating dynamic summaries
   If we have only a positional index, we cannot
    (easily) reconstruct context surrounding hits
   If we cache the documents at index time, can
    run the window through it, cueing to hits
    found in the positional index
       E.g., positional index says “the query is a
        phrase in position 4378” so we go to this
        position in the cached document and stream
        out the content
   Most often, cache a fixed-size prefix of the doc
       Note: Cached copy can be outdated
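The cueing step can be sketched like this, assuming a token-level positional index and a cached token list for the document. All names here are hypothetical, and a real system would cue into a cached prefix on disk rather than an in-memory list.

```python
from collections import defaultdict


def build_positional_index(tokens):
    """Map each term to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)
    return index


def phrase_positions(index, phrase):
    """Start positions where the terms of phrase occur consecutively."""
    terms = [t.lower() for t in phrase]
    return [p for p in index.get(terms[0], [])
            if all(p + k in index.get(t, []) for k, t in enumerate(terms))]


def snippet_at(cached_tokens, position, phrase_len, context=3):
    """Stream out the cached content around a hit position."""
    start = max(position - context, 0)
    end = min(position + phrase_len + context, len(cached_tokens))
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(cached_tokens) else ""
    return prefix + " ".join(cached_tokens[start:end]) + suffix
```

The positional index alone yields only positions; the surrounding words come from the cached copy, which is why the index by itself cannot easily reconstruct context.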
Dynamic summaries
   Producing good dynamic summaries is a
    tricky optimization problem
       The real estate for the summary is normally
        small and fixed
       Want short item, so show as many KWIC
        matches as possible, and perhaps other
        things like title
       Want snippets to be long enough to be useful
       Want linguistically well-formed snippets:
        users prefer snippets that contain complete
        phrases
       Want snippets maximally informative about
        the doc
   But users really like snippets, even if they
    complicate IR system design
Adversarial IR (Spam)
   Motives
       Commercial, political, religious, lobbies
       Promotion funded by advertising budget
   Operators
       Contractors (Search Engine Optimizers) for lobbies,
       Web masters
       Hosting services
   Forum
       Web master world
            Search engine specific tricks
            Discussions about academic papers 
Search Engine Optimization II
       Adversarial IR
   (“search engine wars”)
Can you trust words on the page?


Examples from July 2002
Simplest forms
   Early engines relied on the density of terms
        The top-ranked pages for the query maui
         resort were the ones containing the most
         maui’s and resort’s
   SEOs responded with dense repetitions of
    chosen terms
        e.g., maui resort maui resort maui resort
        Often, the repetitions would be in the same
         color as the background of the web page
             Repeated terms got indexed by crawlers
             But not visible to humans on browsers

    Can’t trust the words on a web page, for ranking.
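A minimal sketch of why density-based ranking is easy to game. The scoring function is an assumed simplification (fraction of page tokens that are query terms), not the formula of any particular early engine.

```python
def term_density(page_text, query):
    """Fraction of the page's tokens that are query terms."""
    tokens = page_text.lower().split()
    if not tokens:
        return 0.0
    terms = set(query.lower().split())
    return sum(tok in terms for tok in tokens) / len(tokens)


honest_page = "our maui resort has a pool and a beach"
stuffed_page = "maui resort maui resort maui resort"

# The stuffed page wins on density even though it says nothing useful.
scores = {
    "honest": term_density(honest_page, "maui resort"),
    "stuffed": term_density(stuffed_page, "maui resort"),
}
```

Hiding the repeated terms in the background color changes nothing in this computation, because the crawler indexes the text regardless of how it is rendered.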
A few spam technologies
   Cloaking
       Serve fake content to search engine robot
       DNS cloaking: Switch IP address. Impersonate
   Doorway pages
       Pages optimized for a single keyword that re-
        direct to the real target page
   Keyword Spam
       Misleading meta-keywords, excessive
        repetition of a term, fake “anchor text”
       Hidden text with colors, CSS tricks, etc.
   Link spamming
       Mutual admiration societies, hidden links, awards
       Domain flooding: numerous domains that
        point or re-direct to a target page
   Robots
       Fake click stream
       Fake query stream
       Millions of submissions via Add-Url
More spam techniques
   Cloaking
 Serve fake content to search engine spider
 DNS cloaking: Switch IP address. Impersonate


   [Figure: cloaking flow. “Is this a Search Engine spider?” N: serve the Real Doc]
Tutorial on Cloaking & Stealth
Variants of keyword stuffing
   Misleading meta-tags, excessive repetition
   Hidden text with colors, style sheet tricks, etc.

      Meta-Tags =
      “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation,
      sex, mp3, britney spears, viagra, …”
More spam techniques
   Doorway pages
       Pages optimized for a single keyword that re-
        direct to the real target page
   Link spamming
       Mutual admiration societies, hidden links,
        awards – more on these later
       Domain flooding: numerous domains that
        point or re-direct to a target page
   Robots
       Fake query stream – rank checking programs
            “Curve-fit” ranking programs of search engines
       Millions of submissions via Add-Url
The war against spam
Quality signals - Prefer authoritative
pages based on:
   Votes from authors (linkage signals)
   Votes from users (usage signals)
   Policing of URL submissions
   Anti robot test
 Limits on meta-keywords
Robust link analysis
   Ignore statistically implausible linkage (or text)
   Use link analysis to detect spammers (guilt by association)
The war against spam
   Spam recognition by machine learning
       Training set based on known spam
   Family friendly filters
       Linguistic analysis, general classification
        techniques, etc.
       For images: flesh tone detectors, source text
        analysis, etc.
   Editorial intervention
       Blacklists
       Top queries audited
       Complaints addressed
Acid test
   Which SEOs rank highly on the query seo?
   Web search engines have policies on SEO
    practices they tolerate/block
       See pointers in Resources
   Adversarial IR: the unending (technical)
    battle between SEOs and web search engines
Duplicate detection
Duplicate/Near-Duplicate Detection

   Duplication: Exact match with fingerprints
   Near-Duplication: Approximate match
   Overview
       Compute syntactic similarity with an edit-distance
        measure
       Use similarity threshold to detect near-duplicates
             E.g., Similarity > 80% => Documents are “near duplicates”
             Not transitive, though sometimes used transitively
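Exact-duplicate detection with fingerprints can be sketched as below. The choice of SHA-1 and the whitespace/case normalization are assumptions for illustration; the slides only say "exact match with fingerprints".

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Fingerprint of a document after a light, assumed normalization
    (collapse whitespace, lowercase)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def exact_duplicate_groups(docs):
    """Group document ids that share a fingerprint.

    docs maps doc id -> text; only groups with 2+ members are returned.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[fingerprint(text)].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

A single changed character produces a different fingerprint, which is why near-duplicates need the approximate matching described above rather than this exact test.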
Computing Similarity
   Segments of a document (natural or artificial
    breakpoints) [Brin95]
   Shingles (Word k-Grams) [Brin95, Brod98]
        “a rose is a rose is a rose” =>
   Similarity Measure between two docs (= sets
    of shingles)
       Set intersection [Brod98]
        (Specifically, Size_of_Intersection /
        Size_of_Union )
                   Jaccard measure
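A small sketch of word k-gram shingling and the Jaccard measure. The shingle size k = 4 is an assumption chosen for the example; the slide does not fix k.

```python
def shingles(text, k=4):
    """Set of word k-grams (shingles) of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size_of_Intersection / Size_of_Union for two shingle sets."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


rose = shingles("a rose is a rose is a rose")
# With k = 4 the eight words yield five shingles, only three of them distinct.
```

Because a document is treated as a set, repeated shingles count once, which is why the rose example collapses to three shingles.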
Shingles + Set Intersection
Computing exact set intersection of shingles
between all pairs of documents is expensive
   Approximate using a cleverly chosen subset of
    shingles from each (a sketch)
   Estimate Jaccard from a short sketch
Create a “sketch vector” (e.g., of size 200) for
each document
   Documents which share more than t (say 80%)
    corresponding vector elements are similar
    For doc d, sketch_d[i] is computed as follows:
         Let f map all shingles in the universe to 0..2^m
         Let π_i be a specific random permutation on 0..2^m
         Pick MIN π_i(f(s)) over all shingles s in d
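A hedged sketch of the sketch-vector computation. True random permutations of a 2^m universe are impractical, so this example swaps in random linear hash functions modulo a prime as an approximation; that substitution, the BLAKE2b-based fingerprint f, and all function names are assumptions, not the method of the cited papers.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus for the assumed linear hash family


def f(shingle):
    """Map a shingle to a 64-bit integer fingerprint."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hashes(n=200, seed=0):
    """n (a, b) pairs; x -> (a*x + b) mod PRIME stands in for pi_i."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def sketch(shingle_set, hashes):
    """sketch[i] = MIN over shingles s of pi_i(f(s)); set must be non-empty."""
    values = [f(s) for s in shingle_set]
    return [min((a * v + b) % PRIME for v in values) for a, b in hashes]


def sketch_similarity(s1, s2):
    """Fraction of corresponding sketch elements that agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)
```

Two documents would then be flagged as similar when `sketch_similarity` exceeds the threshold t (say 0.8). Both sketches must be built with the same list of hash functions.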
Shingling with sampling
   Given two documents A1, A2.
   Let S1 and S2 be their shingle sets
   Resemblance = |Intersection of S1 and S2| /
    |Union of S1 and S2|.
   Let Alpha = min ( π (S1))
   Let Beta = min (π(S2))
       Probability (Alpha = Beta) = Resemblance
Computing Sketch[i] for Doc1
   Start with 64-bit shingles of Document 1 (values in 0..2^64)
   Permute on the number line with π_i
   Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]

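   [Figure: shingles of Document 1 and Document 2 on the 0..2^64 number line, permuted with π_i; A is the min value for Document 1, B the min value for Document 2]
   Are these equal?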
Test for 200 random permutations: π1, π2, …, π200
A = B iff the shingle with the MIN value in the union of
Doc1 and Doc2 is common to both (i.e., lies in the
intersection)

This happens with probability:
       Size_of_intersection / Size_of_union
Set Similarity
   Set Similarity (Jaccard measure)
       $\mathrm{sim}_J(C_i, C_j) = \dfrac{|C_i \cap C_j|}{|C_i \cup C_j|}$
   View sets as columns of a matrix; one row for
    each element in the universe. a_ij = 1 indicates
    presence of item i in set j
    Example      C1 C2

                  0   1
                  1   0
                  1   1      simJ(C1,C2) = 2/5 = 0.4
                  0   0
                  1   1
                  0   1
Key Observation
   For columns Ci, Cj, four types of rows
             Ci     Cj
       A      1      1
       B      1      0
       C      0      1
       D      0      0
   Overload notation: A = # of rows of type A
   Claim: $\mathrm{sim}_J(C_i, C_j) = \dfrac{A}{A + B + C}$
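The row-type counting can be checked on the example columns C1 and C2 from the Set Similarity slide. This is a small sketch; the function names are illustrative.

```python
def row_type_counts(ci, cj):
    """Count rows of type A (1,1), B (1,0), C (0,1), D (0,0)."""
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for x, y in zip(ci, cj):
        if x and y:
            counts["A"] += 1
        elif x:
            counts["B"] += 1
        elif y:
            counts["C"] += 1
        else:
            counts["D"] += 1
    return counts


def jaccard_from_counts(counts):
    """A / (A + B + C); type D rows do not affect the similarity."""
    denom = counts["A"] + counts["B"] + counts["C"]
    return counts["A"] / denom if denom else 0.0


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
counts = row_type_counts(C1, C2)
```

For these columns the counts are A = 2, B = 1, C = 2, D = 1, giving 2/5 = 0.4, the same value as the set-based computation.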
Min Hashing
   Randomly permute rows
   h(Ci) = index of first row with 1 in column Ci
   Surprising Property: $P[h(C_i) = h(C_j)] = \mathrm{sim}_J(C_i, C_j)$
   Why?
        Both are A/(A+B+C)
        Look down columns Ci, Cj until first non-Type-D
         row
        h(Ci) = h(Cj) ⟺ type A row
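The property can be verified exhaustively on a small matrix by enumerating every row permutation instead of sampling. The sketch below does this for the six-row example columns C1 and C2 from the Set Similarity slide; names are illustrative, and exhaustive enumeration is only feasible for tiny examples.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(column, order):
    """Rank of the first row (in permuted order) with a 1 in column."""
    for rank, row in enumerate(order):
        if column[row]:
            return rank
    return None  # column is all zeros


def collision_probability(ci, cj):
    """Exact P[h(Ci) = h(Cj)] over all row permutations."""
    total = matches = 0
    for order in permutations(range(len(ci))):
        total += 1
        hi, hj = min_hash(ci, order), min_hash(cj, order)
        matches += hi is not None and hi == hj
    return Fraction(matches, total)


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
prob = collision_probability(C1, C2)
```

Over all 720 permutations the two min-hashes agree in exactly 2/5 of the cases, equal to A/(A+B+C) for these columns.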
Mirror Detection
   Mirroring is systematic replication of web pages
    across hosts.
      Single largest cause of duplication on the web

   Host1/α and Host2/β are mirrors iff
        for all (or most) paths p, whenever
          http://Host1/α/p exists,
          http://Host2/β/p exists as well
        with identical (or near-identical) content, and
          vice versa.
Mirror Detection example
   Structural Classification of Proteins
Repackaged Mirrors

   Why detect mirrors?
       Smart crawling
             Fetch from the fastest or freshest server
             Avoid duplication
       Better connectivity analysis
             Combine inlinks
             Avoid double counting outlinks
       Redundancy in result listings
            “If that fails you can try: <mirror>/samepath”
       Proxy caching
Bottom Up Mirror Detection
   Maintain clusters of subgraphs
   Initialize clusters of trivial subgraphs
       Group near-duplicate single documents into a cluster
   Subsequent passes
       Merge clusters of the same cardinality and corresponding linkage

       Avoid decreasing cluster cardinality
   To detect mirrors we need:
       Adequate path overlap
       Contents of corresponding pages within a small time range
Can we use URLs to find mirrors?
   [Figure: two host directory trees with matching nodes a, b, c, d]
Top Down Mirror Detection
    [Bhar99, Bhar00c]
   E.g.,
   What features could indicate mirroring?
        Hostname similarity:
             word unigrams and bigrams: { www, www.synthesis, synthesis, …}
        Directory similarity:
             Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
        IP address similarity:
             3 or 4 octet overlap
             Many hosts sharing an IP address => virtual hosting by an ISP
        Host outlink overlap
        Path overlap
             Potentially, path + sketch overlap
   Phase I - Candidate Pair Detection
        Find features that pairs of hosts have in common
        Compute a list of host pairs which might be mirrors
   Phase II - Host Pair Validation
         Test each host pair and determine extent of mirroring
          Check if 20 paths sampled from Host1 have near-duplicates
           on Host2 and vice versa
          Use transitive inferences:
               IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
               IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
   Evaluation
       140 million URLs on 230,000 hosts (1999)
       Best approach combined 5 sets of features
          Top 100,000 host pairs had precision = 0.57 and recall =
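The hostname and directory features listed above can be sketched as simple set extractors. The input host and path below are hypothetical, chosen only so that the output resembles the feature examples on the slide; the exact tokenization used in [Bhar99] may differ.

```python
def hostname_features(host):
    """Word unigrams and bigrams of a hostname, split on dots."""
    words = host.lower().split(".")
    bigrams = [".".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)


def positional_path_bigrams(path):
    """Bigrams of adjacent path components, tagged with their depth."""
    parts = [p for p in path.split("/") if p]
    return {f"{i}:{a}/{b}" for i, (a, b) in enumerate(zip(parts, parts[1:]))}


def shared_features(features_a, features_b):
    """Features two hosts have in common (Phase I candidate evidence)."""
    return features_a & features_b
```

In Phase I, host pairs that share enough such features would become candidates; Phase II then validates them by sampling paths.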

WebIR Infrastructure
   Connectivity Server
       Fast access to links to support link analysis
   Term Vector Database
       Fast access to document vectors to augment
        link analysis
Connectivity Server
[CS1: Bhar98b, CS2 & 3: Rand01]
   Fast web graph access to support connectivity analysis
   Stores mappings in memory from
           URL to outlinks, URL to inlinks
   Applications
           HITS, Pagerank computations
           Crawl simulation
           Graph algorithms: web connectivity, diameter etc.
               more on this later
   [Figure: input (graph algorithm + URLs + values) -> URLs to FPs to IDs -> execution (graph algorithm runs in memory) -> IDs to URLs -> output (URLs + values)]

    Translation Tables on Disk
    URL text: 9 bytes/URL (compressed from ~80 bytes )
    FP(64b) -> ID(32b): 5 bytes
    ID(32b) -> FP(64b): 8 bytes
    ID(32b) -> URLs: 0.5 bytes
ID assignment
   Partition URLs into 3 sets, sorted
        High: Max degree > 254
        Medium: 254 > Max degree > 24
        Low: remaining (75%)
   IDs assigned in sequence (densely)
   E.g., HIGH IDs: Max(indegree, outdegree) > 254
        [Table: ID -> URL, showing IDs …, 9891, 9912, …, 85903590]
Adjacency lists
   In-memory tables for Outlinks, Inlinks
   List index maps from a Source ID to start of adjacency list
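A minimal sketch of the ID assignment scheme, under stated assumptions: URLs are sorted by their text within each degree class, and the boundary degrees 254 and 24 (which the slide leaves unspecified) fall into the lower class. The function name and input shape are hypothetical.

```python
def assign_ids(degrees):
    """Assign dense sequential IDs: high-degree URLs first, then
    medium, then low.

    degrees maps URL -> (indegree, outdegree).
    """
    def degree_class(url):
        max_degree = max(degrees[url])
        if max_degree > 254:
            return 0  # high
        if max_degree > 24:
            return 1  # medium
        return 2      # low

    ordered = sorted(degrees, key=lambda url: (degree_class(url), url))
    return {url: new_id for new_id, url in enumerate(ordered)}
```

Putting high-degree URLs in one contiguous ID range lets the list index use wider pointers only where long adjacency lists make them necessary.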
Adjacency List Compression - I

   [Figure: a list index (source IDs 104, 105, 106) pointing into a sequence of adjacency lists (…, 98, 132, 153, 98, 147, 153, …), shown beside the same list index pointing into delta-encoded adjacency lists (…, -6, 34, 21, -8, 49, 6, …)]
• Adjacency List:
     - Smaller delta values are exponentially more frequent (80% to same host)
     - Compress deltas with variable length encoding (e.g., Huffman)
• List Index pointers: 32b for high, Base+16b for med, Base+8b for low
     - Avg = 12b per pointer
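Delta encoding of an adjacency list can be sketched as follows, assuming the first delta is taken relative to the source ID, which is consistent with the numbers in the figure (source 104 with links 98, 132, 153 gives -6, 34, 21). The slide suggests Huffman coding for the deltas; this example swaps in a simpler zigzag plus variable-byte code, which is an assumption made to keep the sketch short.

```python
def delta_encode(source_id, adjacency):
    """First delta is relative to the source ID, the rest to the
    previous (sorted) target."""
    deltas, prev = [], source_id
    for target in sorted(adjacency):
        deltas.append(target - prev)
        prev = target
    return deltas


def delta_decode(source_id, deltas):
    """Inverse of delta_encode."""
    targets, prev = [], source_id
    for d in deltas:
        prev += d
        targets.append(prev)
    return targets


def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def varint(n):
    """Variable-byte code: 7 payload bits per byte, high bit = more."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)
```

Since most links point to nearby IDs (often the same host), most deltas are small and encode in a single byte here.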
Adjacency List Compression - II

   Inter List Compression
       Basis: Similar URLs may share links
            Close in ID space => adjacency lists may overlap
       Approach
            Define a representative adjacency list for a block of IDs
                  Adjacency list of a reference ID
                  Union of adjacency lists in the block
            Represent adjacency list in terms of deletions and additions
             when it is cheaper to do so
       Measurements
            Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
            Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
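The inter-list idea can be sketched as encoding a list by its differences from a representative list. The cost model (number of stored entries) and the tuple layout are assumptions for illustration; the slides do not describe the actual encoding.

```python
def encode_against_reference(adjacency, reference):
    """Encode adjacency as deletions/additions relative to reference
    when that stores fewer entries than the raw list."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)
    additions = sorted(adj - ref)
    if len(deletions) + len(additions) < len(adj):
        return ("ref", deletions, additions)
    return ("raw", sorted(adj))


def decode_against_reference(encoded, reference):
    """Recover the sorted adjacency list."""
    if encoded[0] == "raw":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))
```

The representative list could be the adjacency list of a reference ID or the union of the lists in a block, as the slide notes; either way the decoder needs the same reference.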
Term Vector Database
   Fast access to 50 word term vectors for web pages
        Term Selection:
             Restricted to middle 1/3rd of lexicon by document frequency
             Top 50 words in document by TF.IDF.
        Term Weighting:
             Deferred till run-time (can be based on term freq, doc freq, doc length)
   Applications
        Content + Connectivity analysis (e.g., Topic Distillation)
        Topic specific crawls
        Document classification
   Performance
        Storage: 33GB for 272M term vectors
        Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk
   [Figure: URLid to term vector lookup. An index entry at URLid * 64 / 480 holds a base (4 bytes) and a bit vector for 480 URLids, leading to a 128-byte TV record (URL info, LC:TID entries)]
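The term selection rule described for the Term Vector Database can be sketched as below. The IDF form log(N/df), the tie-breaking, and the function names are assumptions; the slides only state "middle 1/3rd of lexicon by document frequency" and "top 50 words by TF.IDF".

```python
import math
from collections import Counter


def middle_third(doc_freq):
    """Terms in the middle third of the lexicon by document frequency."""
    ordered = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(ordered)
    return set(ordered[n // 3: n - n // 3])


def select_terms(doc_tokens, doc_freq, num_docs, allowed_terms, k=50):
    """Top-k allowed terms of a document by TF.IDF (assumed tf * log(N/df))."""
    tf = Counter(t for t in doc_tokens if t in allowed_terms)
    score = {t: c * math.log(num_docs / doc_freq[t]) for t, c in tf.items()}
    return sorted(score, key=lambda t: (-score[t], t))[:k]
```

Only the selection uses TF.IDF here; the stored vector can keep raw counts so that weighting is deferred until run time, as the slide describes.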

More Related Content

Similar to seo tutorial

Anatomy of google
Anatomy of googleAnatomy of google
Anatomy of google
Iftikhar Alam
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Max Irwin
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
Chidanand Byahatti
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your Code
Nate Abele
Measuring Your Code 2.0
Measuring Your Code 2.0Measuring Your Code 2.0
Measuring Your Code 2.0
Nate Abele
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
Chris Fregly
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
Shannan Butler
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Sean Golliher
PoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus ManagementPoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus Management
Andreas Blumauer
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
Bhaskar Mitra
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsQuality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, quantity, web and semantics
Quality, quantity, web and semanticsQuality, quantity, web and semantics
Quality, quantity, web and semantics
Andraz Tori
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
Tobias Wunner
The search engine index
The search engine indexThe search engine index
The search engine index
CJ Jenkins

Similar to seo tutorial (20)

Anatomy of google
Anatomy of googleAnatomy of google
Anatomy of google
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your Code
Measuring Your Code 2.0
Measuring Your Code 2.0Measuring Your Code 2.0
Measuring Your Code 2.0
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
PoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus ManagementPoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus Management
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsQuality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, quantity, web and semantics
Quality, quantity, web and semanticsQuality, quantity, web and semantics
Quality, quantity, web and semantics
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
The search engine index
The search engine indexThe search engine index
The search engine index

Recently uploaded

Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
Colégio Santa Teresinha
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
Celine George
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
Nicholas Montgomery
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf (প্রয়োজনীয় বাংলা বই)
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
Celine George
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Fajar Baskoro
BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
Katrina Pritchard

Recently uploaded (20)

Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University

seo tutorial

  • 1. Special Topics in Search Engines: Result Summaries, Anti-spamming, Duplicate elimination
  • 3. Summaries  Having ranked the documents matching a query, we wish to present a results list  Most commonly, the document title plus a short summary  The title is typically automatically extracted from document metadata  What about the summaries?
  • 4. Summaries  Two basic kinds:  Static  Dynamic  A static summary of a document is always the same, regardless of the query that hit the doc  Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand
  • 5. Static summaries  In typical systems, the static summary is a subset of the document  Simplest heuristic: the first 50 (or so – this can be varied) words of the document  Summary cached at indexing time  More sophisticated: extract from each document a set of “key” sentences  Simple NLP heuristics to score each sentence  Summary is made up of top-scoring sentences.  Most sophisticated: NLP used to synthesize a summary  Seldom used in IR; cf. text summarization
  • 6. Dynamic summaries  Present one or more “windows” within the document that contain several of the query terms  “KWIC” snippets: Keyword in Context presentation  Generated in conjunction with scoring  If query found as a phrase, the/some occurrences of the phrase in the doc  If not, windows within the doc that contain multiple query terms  The summary itself gives the entire content of the window: all terms, not only the query terms
  • 7. Generating dynamic summaries  If we have only a positional index, we cannot (easily) reconstruct context surrounding hits  If we cache the documents at index time, can run the window through it, cueing to hits found in the positional index  E.g., positional index says “the query is a phrase in position 4378” so we go to this position in the cached document and stream out the content  Most often, cache a fixed-size prefix of the doc  Note: Cached copy can be outdated
  • 8. Dynamic summaries  Producing good dynamic summaries is a tricky optimization problem  The real estate for the summary is normally small and fixed  Want short item, so show as many KWIC matches as possible, and perhaps other things like title  Want snippets to be long enough to be useful  Want linguistically well-formed snippets: users prefer snippets that contain complete phrases  Want snippets maximally informative about doc  But users really like snippets, even if they complicate IR system design
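The window selection described on slides 6 to 8 can be sketched as follows. This is a minimal illustration, not the method of any particular engine: it assumes the cached document is already tokenized, scans every fixed-size window instead of cueing from a positional index, and the function name `kwic_snippet` and the `window` parameter are hypothetical.

```python
def kwic_snippet(doc_tokens, query_terms, window=8):
    """Return the first window of `window` tokens covering the most distinct query terms."""
    query = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    last_start = max(len(doc_tokens) - window, 0)
    for start in range(last_start + 1):
        span = doc_tokens[start:start + window]
        # Count distinct query terms in this window (case-insensitive).
        hits = len({t.lower() for t in span} & query)
        if hits > best_hits:  # strict '>' keeps the earliest best window
            best_start, best_hits = start, hits
    # The snippet shows the whole window, not only the query terms.
    return " ".join(doc_tokens[best_start:best_start + window])


doc = "the hotel offers cheap rooms near the beach and the maui resort has a private pool".split()
print(kwic_snippet(doc, ["maui", "resort"], window=5))
```

A production system would more likely jump straight to the hit positions recorded in the positional index and trim the window to phrase boundaries, as slide 8 suggests.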
  • 10. Adversarial IR (Spam)  Motives  Commercial, political, religious, lobbies  Promotion funded by advertising budget  Operators  Contractors (Search Engine Optimizers) for lobbies, companies  Web masters  Hosting services  Forum  Web master world  Search engine specific tricks  Discussions about academic papers
  • 11. Search Engine Optimization II: Adversarial IR (“search engine wars”)
  • 12. Can you trust words on the page? Pornographic Content Examples from July 2002
  • 13. Simplest forms  Early engines relied on the density of terms  The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s  SEOs responded with dense repetitions of chosen terms  e.g., maui resort maui resort maui resort  Often, the repetitions would be in the same color as the background of the web page  Repeated terms got indexed by crawlers  But not visible to humans on browsers Can’t trust the words on a web page, for ranking.
  • 14. A few spam technologies  Cloaking  Serve fake content to search engine robot  DNS cloaking: Switch IP address. Impersonate  Doorway pages  Pages optimized for a single keyword that redirect to the real target page  Keyword Spam  Misleading meta-keywords, excessive repetition of a term, fake “anchor text”  Hidden text with colors, CSS tricks, etc.  Link spamming  Mutual admiration societies, hidden links, awards  Domain flooding: numerous domains that point or redirect to a target page  Robots  Fake click stream  Fake query stream  Millions of submissions via Add-Url
  • 15. More spam techniques  Cloaking  Serve fake content to search engine spider  DNS cloaking: Switch IP address. Impersonate  [Diagram: Cloaking. Is this a search engine spider? Y: serve the SPAM page; N: serve the real doc]
  • 16. Tutorial on Cloaking & Stealth Technology
  • 17. Variants of keyword stuffing  Misleading meta-tags, excessive repetition  Hidden text with colors, style sheet tricks, etc. Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
  • 18. More spam techniques  Doorway pages  Pages optimized for a single keyword that redirect to the real target page  Link spamming  Mutual admiration societies, hidden links, awards (more on these later)  Domain flooding: numerous domains that point or redirect to a target page  Robots  Fake query stream: rank checking programs  “Curve-fit” ranking programs of search engines  Millions of submissions via Add-Url
  • 19. The war against spam Quality signals - Prefer authoritative pages based on:  Votes from authors (linkage signals)  Votes from users (usage signals)  Policing of URL submissions  Anti robot test  Limits on meta-keywords Robust link analysis  Ignore statistically implausible linkage (or text)  Use link analysis to detect spammers (guilt by association)
  • 20. The war against spam  Spam recognition by machine learning  Training set based on known spam  Family friendly filters  Linguistic analysis, general classification techniques, etc.  For images: flesh tone detectors, source text analysis, etc.  Editorial intervention  Blacklists  Top queries audited  Complaints addressed
  • 21. Acid test  Which SEOs rank highly on the query seo?  Web search engines have policies on SEO practices they tolerate/block  See pointers in Resources  Adversarial IR: the unending (technical) battle between SEOs and web search engines  See for instance
  • 23. Duplicate/Near-Duplicate Detection  Duplication: Exact match with fingerprints  Near-Duplication: Approximate match  Overview  Compute syntactic similarity with an edit-distance measure  Use similarity threshold to detect near-duplicates  E.g., Similarity > 80% => Documents are “near duplicates”  Not transitive though sometimes used transitively
  • 24. Computing Similarity  Segments of a document (natural or artificial breakpoints) [Brin95]  Shingles (Word k-Grams) [Brin95, Brod98] “a rose is a rose is a rose” => a_rose_is_a rose_is_a_rose is_a_rose_is  Similarity Measure between two docs (= sets of shingles)  Set intersection [Brod98] (Specifically, Size_of_Intersection / Size_of_Union ) Jaccard measure
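The shingling and Jaccard measure from slide 24 can be sketched in a few lines. This is a minimal illustration: whitespace tokenization, lowercasing, and the function names `shingles` and `jaccard` are assumptions, not details from the slides.

```python
def shingles(text, k=4):
    """Return the set of word k-grams (shingles) of `text`, joined with underscores."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size_of_Intersection / Size_of_Union for two shingle sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(a & b) / len(a | b)


s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose")
print(sorted(s1))       # the three distinct 4-word shingles from the slide
print(jaccard(s1, s2))  # 2 shared shingles out of 3 in the union
```

Because the shingles form a set, the repeated phrases in the example collapse to three distinct 4-grams, matching the slide.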
  • 25. Shingles + Set Intersection  Computing exact set intersection of shingles between all pairs of documents is expensive  Approximate using a cleverly chosen subset of shingles from each (a sketch)  Estimate Jaccard from a short sketch  Create a “sketch vector” (e.g., of size 200) for each document  Documents which share more than t (say 80%) corresponding vector elements are similar  For doc d, sketch_d[i] is computed as follows:  Let f map all shingles in the universe to $0..2^m$  Let $\pi_i$ be a specific random permutation on $0..2^m$  Pick MIN $\pi_i(f(s))$ over all shingles s in d
  • 26. Shingling with sampling minima  Given two documents A1, A2.  Let S1 and S2 be their shingle sets  Resemblance = |Intersection of S1 and S2| / | Union of S1 and S2|.  Let Alpha = min ( π (S1))  Let Beta = min (π(S2))  Probability (Alpha = Beta) = Resemblance
  • 27. Computing Sketch[i] for Doc1  Start with 64 bit shingles of Document 1 (points on the number line $0..2^{64}$)  Permute on the number line with $\pi_i$  Pick the min value
  • 28. Test if Doc1.Sketch[i] = Doc2.Sketch[i]  [Figure: Document 1 and Document 2 each mapped onto the number line $0..2^{64}$, with min values A and B]  Are these equal?  Test for 200 random permutations: $\pi_1, \pi_2, \ldots, \pi_{200}$
  • 29. However…  [Figure: Document 1 and Document 2 on the number line $0..2^{64}$, with min values A and B]  A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)  This happens with probability Size_of_intersection / Size_of_union  Why?
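The sketch construction of slides 25 to 29 can be illustrated as below. Two substitutions are assumptions of this sketch and not taken from the slides: `f` is a truncated SHA-1 fingerprint, and each "random permutation" is an affine map x -> (a*x + b) mod p over a prime field, which is a true permutation of 0..p-1 but only approximates a uniformly random one. All names are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime; hash universe is 0..PRIME-1 (assumption)


def f(shingle):
    """Map a shingle to an integer fingerprint in 0..PRIME-1."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_permutations(n, seed=0):
    """Return n affine permutations (a, b) of 0..PRIME-1, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def sketch(shingle_set, perms):
    """sketch[i] = MIN over shingles s of pi_i(f(s)); shingle_set must be non-empty."""
    fps = [f(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in fps) for a, b in perms]


def estimate_resemblance(sk1, sk2):
    """Fraction of sketch positions on which the two documents agree."""
    return sum(1 for x, y in zip(sk1, sk2) if x == y) / len(sk1)


perms = make_permutations(200)
s1 = {f"s{i}" for i in range(100)}
s2 = {f"s{i}" for i in range(50, 150)}  # true Jaccard resemblance is 50/150
print(estimate_resemblance(sketch(s1, perms), sketch(s2, perms)))
```

Both documents must be sketched with the same list of permutations, since the estimate compares corresponding vector elements.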
  • 30. Set Similarity  Set Similarity (Jaccard measure): $\mathrm{sim}_J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$  View sets as columns of a matrix; one row for each element in the universe. $a_{ij} = 1$ indicates presence of item i in set j  Example: rows (C1, C2) = (0,1), (1,0), (1,1), (0,0), (1,1), (0,1), so $\mathrm{sim}_J(C_1, C_2) = 2/5 = 0.4$
  • 31. Key Observation  For columns Ci, Cj, four types of rows (Ci, Cj): A = (1,1), B = (1,0), C = (0,1), D = (0,0)  Overload notation: A = # of rows of type A  Claim: $\mathrm{sim}_J(C_i, C_j) = \frac{A}{A+B+C}$
  • 32. Min Hashing  Randomly permute rows  h(Ci) = index of first row with 1 in column Ci  Surprising Property: $P[h(C_i) = h(C_j)] = \mathrm{sim}_J(C_i, C_j)$  Why?  Both are A/(A+B+C)  Look down columns Ci, Cj until first non-Type-D row  h(Ci) = h(Cj) ⟺ type A row
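The min-hashing property can be checked exactly on the six-row example from slide 30 by enumerating every row permutation instead of sampling. This is a small verification sketch; the function names are hypothetical, and exhaustive enumeration is only practical for tiny matrices.

```python
from fractions import Fraction
from itertools import permutations

# Rows of the slide 30 example as (C1, C2) pairs.
ROWS = [(0, 1), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1)]


def jaccard_from_rows(rows):
    """A / (A + B + C), counting row types as on slide 31."""
    a = sum(1 for x, y in rows if x == 1 and y == 1)  # type A rows
    b_plus_c = sum(1 for x, y in rows if x != y)      # type B or type C rows
    return Fraction(a, a + b_plus_c)


def min_hash(rows, order, col):
    """Rank, in the permuted order, of the first row with a 1 in column `col`."""
    for rank, row_index in enumerate(order):
        if rows[row_index][col] == 1:
            return rank
    raise ValueError("column contains no 1")


def collision_probability(rows):
    """Exact P[h(C1) = h(C2)] over all row permutations."""
    orders = list(permutations(range(len(rows))))
    equal = sum(1 for o in orders if min_hash(rows, o, 0) == min_hash(rows, o, 1))
    return Fraction(equal, len(orders))


print(jaccard_from_rows(ROWS), collision_probability(ROWS))
```

Both values come out as 2/5: the first non-type-D row in a random order is equally likely to be any of the five non-D rows, and the two hashes agree exactly when that row is one of the two type A rows.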
  • 33. Mirror Detection  Mirroring is systematic replication of web pages across hosts.  Single largest cause of duplication on the web  Host1/α and Host2/β are mirrors iff for all (or most) paths p, whenever http://Host1/α/p exists, http://Host2/β/p exists as well with identical (or near identical) content, and vice versa.
  • 34. Mirror Detection example  and  Structural Classification of Proteins     
  • 35. Repackaged Mirrors Aug
  • 36. Motivation  Why detect mirrors?  Smart crawling  Fetch from the fastest or freshest server  Avoid duplication  Better connectivity analysis  Combine inlinks  Avoid double counting outlinks  Redundancy in result listings  “If that fails you can try: <mirror>/samepath”  Proxy caching
  • 37. Bottom Up Mirror Detection [Cho00]  Maintain clusters of subgraphs  Initialize clusters of trivial subgraphs  Group near-duplicate single documents into a cluster  Subsequent passes  Merge clusters of the same cardinality and corresponding linkage  Avoid decreasing cluster cardinality  To detect mirrors we need:  Adequate path overlap  Contents of corresponding pages within a small time range
  • 38. Can we use URLs to find mirrors?  [Figure: link graphs over pages a, b, c, d on two hosts]
  • 39. Top Down Mirror Detection [Bhar99, Bhar00c]  E.g.,  What features could indicate mirroring?  Hostname similarity:  word unigrams and bigrams: { www, www.synthesis, synthesis, …}  Directory similarity:  Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }  IP address similarity:  3 or 4 octet overlap  Many hosts sharing an IP address => virtual hosting by an ISP  Host outlink overlap  Path overlap  Potentially, path + sketch overlap
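The hostname and directory features listed on slide 39 can be sketched as below. The example URL on the slide is missing from this transcript, so the host `www.synthesis.example` and the path used here are hypothetical stand-ins chosen to reproduce the feature strings the slide shows; the function names are also assumptions.

```python
def hostname_features(host):
    """Word unigrams and bigrams of a hostname, split on dots."""
    parts = host.lower().split(".")
    unigrams = set(parts)
    bigrams = {".".join(parts[i:i + 2]) for i in range(len(parts) - 1)}
    return unigrams | bigrams


def path_bigrams(path):
    """Positional bigrams of path segments, e.g. '0:Docs/ProjAbs'."""
    segments = [s for s in path.split("/") if s]
    return {f"{i}:{segments[i]}/{segments[i + 1]}" for i in range(len(segments) - 1)}


# Hypothetical host and path, for illustration only.
print(sorted(hostname_features("www.synthesis.example")))
print(sorted(path_bigrams("/Docs/ProjAbs/synsys/index.html")))
```

Hosts that share many such features become candidate pairs for Phase I; the features only suggest mirroring and do not confirm it.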
  • 40. Implementation  Phase I - Candidate Pair Detection  Find features that pairs of hosts have in common  Compute a list of host pairs which might be mirrors  Phase II - Host Pair Validation  Test each host pair and determine extent of mirroring  Check if 20 paths sampled from Host1 have near-duplicates on Host2 and vice versa  Use transitive inferences: IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B) IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)  Evaluation  140 million URLs on 230,000 hosts (1999)  Best approach combined 5 sets of features  Top 100,000 host pairs had precision = 0.57 and recall = 0.86
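The two transitive inference rules in Phase II can be sketched with a union-find structure, so that a host pair is only tested directly when neither rule settles it. This is one possible realization under assumed names (`MirrorClasses`, `status`); the slides do not say how the inferences were implemented.

```python
class MirrorClasses:
    """Tracks validated mirror / non-mirror host pairs and answers inferred queries."""

    def __init__(self):
        self.parent = {}     # union-find forest over host names
        self.negatives = []  # host pairs validated as NOT mirrors

    def find(self, host):
        self.parent.setdefault(host, host)
        while self.parent[host] != host:
            self.parent[host] = self.parent[self.parent[host]]  # path halving
            host = self.parent[host]
        return host

    def add_mirror(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def add_non_mirror(self, a, b):
        self.negatives.append((a, b))

    def status(self, a, b):
        """True or False if inferable from known pairs, None if a direct test is needed."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True  # Mirror(A,x) AND Mirror(x,B) => Mirror(A,B)
        for x, y in self.negatives:
            if {self.find(x), self.find(y)} == {ra, rb}:
                return False  # Mirror(A,x) AND !Mirror(x,B) => !Mirror(A,B)
        return None


m = MirrorClasses()
m.add_mirror("A", "x")
m.add_mirror("x", "B")
m.add_non_mirror("x", "C")
print(m.status("A", "B"), m.status("A", "C"), m.status("A", "D"))
```

Since near-duplication is not strictly transitive (slide 23), these inferences trade some accuracy for fewer pairwise tests.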
  • 41. WebIR Infrastructure  Connectivity Server  Fast access to links to support link analysis  Term Vector Database  Fast access to document vectors to augment link analysis
  • 42. Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]  Fast web graph access to support connectivity analysis  Stores mappings in memory from  URL to outlinks, URL to inlinks  Applications  HITS, Pagerank computations  Crawl simulation  Graph algorithms: web connectivity, diameter etc.  more on this later  Visualizations
  • 43. Usage  Input: graph algorithm + URLs + values  URLs to FPs to IDs  Execution: graph algorithm runs in memory  IDs to URLs  Output: URLs + values  Translation Tables on Disk  URL text: 9 bytes/URL (compressed from ~80 bytes)  FP(64b) -> ID(32b): 5 bytes  ID(32b) -> FP(64b): 8 bytes  ID(32b) -> URLs: 0.5 bytes
  • 44. ID assignment  Partition URLs into 3 sets, sorted lexicographically  High: Max degree > 254  Medium: 254 > Max degree > 24  Low: remaining (75%)  IDs assigned in sequence (densely)  Adjacency lists  In memory tables for Outlinks, Inlinks  List index maps from a Source ID to start of adjacency list  [Figure: ID-to-URL table illustrating HIGH IDs, where Max(indegree, outdegree) > 254; sample IDs 9891, 9912, 9821878, 40930030, 85903590]
  • 45. Adjacency List Compression - I  Adjacency List:  Smaller delta values are exponentially more frequent (80% to same host)  Compress deltas with variable length encoding (e.g., Huffman)  List Index pointers: 32b for high, Base+16b for med, Base+8b for low  Avg = 12b per pointer  [Figure: a list index (IDs 104, 105, 106) pointing into a sequence of adjacency lists, beside the same structure with delta-encoded adjacency lists]
  • 46. Adjacency List Compression - II  Inter List Compression  Basis: Similar URLs may share links  Close in ID space => adjacency lists may overlap  Approach  Define a representative adjacency list for a block of IDs  Adjacency list of a reference ID  Union of adjacency lists in the block  Represent adjacency list in terms of deletions and additions when it is cheaper to do so  Measurements  Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)  Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
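The intra-list delta encoding from slide 45 can be sketched as below. The slide names Huffman coding as an example of a variable length code; this sketch swaps in zigzag plus base-128 varints, a simpler byte-oriented code, purely for illustration. Encoding the first target relative to the source ID is an assumption, suggested by the figure, whose first list appears to turn targets 98, 132, 153 for source 104 into deltas -6, 34, 21. All function names are hypothetical.

```python
def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def varint(z):
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_adjacency(source_id, targets):
    """Delta-encode sorted target IDs; the first delta is relative to source_id."""
    out = bytearray()
    prev = source_id
    for target in sorted(targets):
        out += varint(zigzag(target - prev))
        prev = target
    return bytes(out)


def decode_adjacency(source_id, data):
    targets, prev, z, shift = [], source_id, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this delta
        else:
            prev += unzigzag(z)
            targets.append(prev)
            z, shift = 0, 0
    return targets


encoded = encode_adjacency(104, [98, 132, 153])
print(len(encoded), decode_adjacency(104, encoded))
```

Because most links point to nearby IDs on the same host, most deltas fit in a single byte here, which is the effect the slide attributes to variable length encoding.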
  • 47. Term Vector Database [Stat00]  Fast access to 50 word term vectors for web pages  Term Selection:  Restricted to middle 1/3rd of lexicon by document frequency  Top 50 words in document by TF.IDF.  Term Weighting:  Deferred till run-time (can be based on term freq, doc freq, doc length)  Applications  Content + Connectivity analysis (e.g., Topic Distillation)  Topic specific crawls  Document classification  Performance  Storage: 33GB for 272M term vectors  Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
  • 48. Architecture: URLid to Term Vector Lookup  [Diagram: URLid * 64 / 480 gives the offset into URL Info, where each entry holds a Base (4 bytes) and a bit vector for 480 URLids; each 128-byte TV record holds Terms (LC:TID, LC:TID, …) and Freq (FRQ:RL, FRQ:RL, …)]

Editor's Notes

  1. Arms race
  2. Small biotech firm; query example from last time; Infoseek example
  3. Talk about expert witness; george w bush example
  4. More complex problem of finding the “original” site