Anti Spam Algorithm


    Anti Spam Algorithm - Presentation Transcript

    1. Anti-spam Algorithms: TrustRank and Hilltop. 954203041 林裕得, 954203057 蔡繼正
    2. Outline
      • Introduction
      • Comparison of PageRank, TrustRank, and Hilltop
      • TrustRank
        • Combating Web Spam with TrustRank
      • Hilltop
        • Hilltop: A Search Engine based on Expert Documents
      • Evaluation
    3. Introduction (1/3)
      • PageRank
      • Current problem
        • Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results.
      • [Figure: example graph with page scores 0.4, 0.4, 0.2]
    4. Introduction (2/3)
      • Types of web spam
        • Content spam
          • Hidden or invisible text 
          • Keyword stuffing 
          • Meta tag stuffing 
        • Link spam
          • Link farms 
          • Hidden links 
        • Other types
          • Mirror websites 
          • URL redirection 
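    5. Introduction (3/3)
      • How to combat web spam
        • TrustRank or Hilltop:
          • Identify which pages are definitely not spam
        • BadRank or SpamRank:
          • Identify which pages are definitely spam
        • Sandbox:
          • Cannot reliably tell spam pages from non-spam pages, but effectively suppresses the SEO market
        • Manual reporting and specific anti-spam methods:
          • Help build a more comprehensive spam pool
          • http://www.google.com/contact/spamreport.html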
    6. Comparison of PageRank, TrustRank, and Hilltop (1/3)
      • All three are connectivity-based algorithms: they assume that the number and quality of the sources linking to a page are a good measure of the page's quality.
    7. Comparison of PageRank, TrustRank, and Hilltop (2/3)
      • Basic assumption
        • PageRank
          • A good page has many important inlinks.
        • TrustRank
          • Good pages point to good ones.
        • Hilltop
          • Only expert pages point to good ones.
    8. Comparison of PageRank, TrustRank, and Hilltop (3/3)
      • Inlink source
        • PageRank: all pages
        • TrustRank: all pages
        • Hilltop: expert pages
      • Initial score
        • PageRank: average
        • TrustRank: 1 or 0
        • Hilltop: determined by the algorithm
      • [Figure: example score values 0.16 0.16 0.16 0.16 0.16 0.16; 0.33 0 0.33 0 0.33 0; 0.5 0.2 0.3]
    9. TrustRank (1/7)
      • [Figure: example web graph with two link matrices; the surviving matrix entries are 0, 0.5, and 1]
    10.  
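    11. TrustRank (2/7)
      • Step 1: Evaluate the seed-desirability of pages by inverse PageRank
    12. Inverse PageRank: starting from $s = \mathbf{1}_N$, repeat $s \leftarrow \alpha \cdot U \cdot s + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$ for $M$ iterations, where $U$ is the inverse transition matrix and $\alpha$ is the decay factor.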
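A minimal sketch of the seed-desirability step, assuming the web graph is given as a dict mapping each page to the list of pages it links to (every linked page must also appear as a key). The function name `inverse_pagerank` and the default values for `alpha` and `iterations` are illustrative choices, not taken from the slides.

```python
def inverse_pagerank(out_links, alpha=0.85, iterations=20):
    """Seed desirability: PageRank computed on the graph with its links reversed.

    out_links maps page -> list of pages it links to (assumed input format).
    """
    pages = list(out_links)
    n = len(pages)
    # In-degree of every page, needed for the inverse transition matrix U.
    in_degree = {p: 0 for p in pages}
    for targets in out_links.values():
        for q in targets:
            in_degree[q] += 1

    scores = {p: 1.0 for p in pages}  # s = 1_N
    for _ in range(iterations):
        new_scores = {}
        for p in pages:
            # U[p][q] = 1 / in_degree(q) when p links to q, else 0.
            flow = sum(scores[q] / in_degree[q] for q in out_links[p])
            new_scores[p] = alpha * flow + (1 - alpha) / n
        scores = new_scores
    return scores


# A hub that links to many pages is a desirable seed.
graph = {"hub": ["a", "b", "c"], "a": [], "b": [], "c": []}
scores = inverse_pagerank(graph)
```

Pages with many outgoing links to otherwise poorly connected pages score highest, which is why they make useful seeds: trust placed on them reaches many other pages quickly.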
    13. TrustRank (3/7)
      • Step 2: Generate good seeds
    14. TrustRank (4/7)
      • Step 3: Select good seeds
      • Example: with L = 3, the seed set is {2, 4, 5}
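The seed selection can be sketched as follows, assuming the desirability scores come from the previous step and that a human reviewer is modelled as an `oracle` callable returning True for good pages. The function name, the `oracle` parameter, and the sample scores are all hypothetical; only L = 3 and the resulting seed set {2, 4, 5} come from the slide.

```python
def select_good_seeds(desirability, oracle, limit):
    """Take the `limit` most desirable pages and keep those the oracle calls good."""
    ranked = sorted(desirability, key=desirability.get, reverse=True)
    return {page for page in ranked[:limit] if oracle(page)}


# Hypothetical desirability scores for pages 1..7, chosen so that
# pages 2, 4 and 5 rank highest.
desirability = {1: 0.10, 2: 0.90, 3: 0.20, 4: 0.80, 5: 0.70, 6: 0.05, 7: 0.15}
seeds = select_good_seeds(desirability, oracle=lambda page: True, limit=3)
```

If the oracle rejects one of the top L pages, the seed set simply shrinks; this sketch does not pull in a replacement.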
    15. TrustRank (5/7)
      • Step 4: Normalize the static score distribution vector
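As a rough illustration of the normalization, assuming the distribution vector starts at 1 for every good seed and 0 elsewhere, and is then scaled so its entries sum to 1. The function name `seed_distribution` is a hypothetical helper.

```python
def seed_distribution(pages, good_seeds):
    """Static score distribution d: uniform over the good seeds, zero elsewhere."""
    d = {p: (1.0 if p in good_seeds else 0.0) for p in pages}
    total = sum(d.values())
    if total == 0:
        raise ValueError("at least one good seed is required")
    return {p: value / total for p, value in d.items()}


d = seed_distribution(pages=range(1, 8), good_seeds={2, 4, 5})
```

With three good seeds each seed receives 1/3 and all other pages receive 0.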
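    16. TrustRank (6/7)
      • Step 5: Compute the TrustRank scores
      • Starting from $t^* = d$, repeat $t^* \leftarrow \alpha \cdot T \cdot t^* + (1 - \alpha) \cdot d$ for $M$ iterations, where $T$ is the transition matrix, $d$ is the normalized static score distribution vector, and $\alpha$ is the decay factor.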
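The propagation step might look like the sketch below. It assumes the same dict-of-lists graph format as a biased PageRank would use (every linked page also appears as a key) and a normalized seed vector `d`; the defaults for `alpha` and `iterations` and the toy graph are illustrative assumptions.

```python
def trustrank(out_links, d, alpha=0.85, iterations=20):
    """Biased PageRank: trust flows out from the seed pages along out-links."""
    trust = dict(d)  # t* = d
    for _ in range(iterations):
        # Every page is topped up in proportion to its static seed score.
        new_trust = {p: (1 - alpha) * d[p] for p in out_links}
        for p, targets in out_links.items():
            if targets:
                # A page splits its damped trust evenly among its out-links.
                share = alpha * trust[p] / len(targets)
                for q in targets:
                    new_trust[q] += share
        trust = new_trust
    return trust


# Toy graph: the good seed reaches "a" and "b"; nothing good links to the spam pages.
graph = {"seed": ["a"], "a": ["b"], "b": [], "spam": ["farm"], "farm": ["spam"]}
d = {"seed": 1.0, "a": 0.0, "b": 0.0, "spam": 0.0, "farm": 0.0}
trust = trustrank(graph, d)
```

Trust decays with distance from the seed, and pages that no trusted page links to, directly or indirectly, stay at zero.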
    17. TrustRank (7/7)
      • Conclusion
    18. Hilltop (1/9)
      • Expert page
        • A page that is about a certain topic and links to many non-affiliated pages on that topic.
      • Non-affiliated
        • Two pages are conceptually non-affiliated if their authors belong to non-affiliated organizations.
    19. Hilltop (2/9)
      • Step 1: Expert Lookup
        • Detecting Host Affiliation
        • Selecting the Experts
        • Indexing the Experts
      • [Figure: an expert page pointing to non-affiliated pages, with its key phrases indexed]
    20. Hilltop (3/9)
      • Detecting Host Affiliation
        • Rules: two hosts are affiliated if one or both of the following are true
          • They share the same first 3 octets of the IP address.
          • The rightmost non-generic token in the hostname is the same.
            • e.g., "www.ibm.com" and "ibm.co.mx"
      • The affiliation relation is transitive
        • If A and B are affiliated and B and C are affiliated, then A and C are taken to be affiliated.
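One way to sketch the affiliation rules is a union-find over hosts, which gives transitivity for free. The input format (hostname mapped to an IPv4 string), the function names, and especially the small `GENERIC_TOKENS` set are assumptions for illustration; a real system would need a much fuller list of generic suffix tokens.

```python
# Hypothetical, deliberately tiny list of generic suffix tokens.
GENERIC_TOKENS = {"com", "co", "org", "net", "edu", "gov", "mx", "uk", "tw"}


def rightmost_non_generic_token(hostname):
    """Walk the hostname from the right and return the first non-generic token."""
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return hostname.lower()


def affiliation_groups(hosts):
    """hosts maps hostname -> IPv4 address; returns hostname -> group representative."""
    parent = {h: h for h in hosts}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving
            h = parent[h]
        return h

    first_seen = {}
    for host, ip in hosts.items():
        keys = (
            ("ip", ".".join(ip.split(".")[:3])),  # first 3 octets
            ("token", rightmost_non_generic_token(host)),
        )
        for key in keys:
            if key in first_seen:
                parent[find(host)] = find(first_seen[key])  # merge the groups
            else:
                first_seen[key] = host
    return {h: find(h) for h in hosts}


groups = affiliation_groups({
    "www.ibm.com": "192.0.2.10",
    "ibm.co.mx": "198.51.100.5",
    "mirror.example.org": "198.51.100.77",
    "other.net": "203.0.113.9",
})
```

Here "www.ibm.com" and "ibm.co.mx" share the token "ibm", and "ibm.co.mx" and "mirror.example.org" share an IP prefix, so all three end up in one group by transitivity.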
    21. Hilltop (4/9)
      • Selecting the Experts
        • Among all pages with out-degree greater than a threshold k (e.g., k = 5), we test whether their URLs point to k distinct non-affiliated hosts. Every such page is considered an expert page.
      • [Figure: an expert page pointing to non-affiliated pages]
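The expert test can be sketched like this, assuming the page's out-links are available as a list of absolute URLs and that `host_group` is a callable mapping a hostname to its affiliation group (both are hypothetical interfaces; the slides only give the threshold k).

```python
from urllib.parse import urlparse


def is_expert(out_urls, host_group, k=5):
    """True if the page has more than k out-links reaching k distinct non-affiliated hosts."""
    if len(out_urls) <= k:
        return False
    groups = {host_group(urlparse(url).hostname) for url in out_urls}
    return len(groups) >= k


def demo_group(hostname):
    # Stand-in affiliation function: hosts sharing the second-to-last token are affiliated.
    return hostname.split(".")[-2]


distinct = [f"http://www.site{i}.com/page" for i in range(6)]
same_owner = [f"http://host{i}.example.com/" for i in range(6)]
```

With these inputs, `is_expert(distinct, demo_group)` holds, while the six links to one organization's hosts do not qualify.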
    22. Hilltop (5/9)
      • Indexing the Experts
        • Index the text contained within the "key phrases" of the expert. The following are considered key phrases:
          • title
          • headings (e.g., <H1> </H1> tags)
          • anchor text
        • A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text.
    23. Hilltop (6/9)
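      • Example
        • The title qualifies 4 URLs
        • Each heading qualifies 2 URLs
        • Each anchor qualifies 1 URL
      <title> National Central University </title>
      <h1> Department of Information Management </h1> <A> 001 </A> <A> 002 </A>
      <h1> Department of Business Administration </h1> <A> 001 </A> <A> 002 </A>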
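To make the scoping concrete, here is a small sketch built on Python's standard `html.parser`. It assumes the simplified page structure of the example (one title, `h1` headings, anchors) and records, for each anchor, the title and the most recent heading that qualify it. The class name, the function name, and the placeholder page text are hypothetical.

```python
from html.parser import HTMLParser


class KeyPhraseParser(HTMLParser):
    """Collects [title, heading, anchor text] for every anchor in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.heading = ""
        self.links = []
        self._open = None  # tag whose text is currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "a"):
            self._open = tag
        if tag == "h1":
            self.heading = ""  # a new heading starts a new scope
        elif tag == "a":
            self.links.append([self.title, self.heading, ""])

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open == "title":
            self.title += text
        elif self._open == "h1":
            self.heading += text
        elif self._open == "a":
            self.links[-1][2] += text


def qualifying_phrases(html):
    parser = KeyPhraseParser()
    parser.feed(html)
    parser.close()
    return [tuple(link) for link in parser.links]


page = ("<title>University</title>"
        "<h1>Dept A</h1><A>001</A><A>002</A>"
        "<h1>Dept B</h1><A>001</A><A>002</A>")
links = qualifying_phrases(page)
```

Each of the four anchors carries the title, the first two carry "Dept A", and the last two carry "Dept B", matching the 4 / 2 / 1 counts in the example.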
    24. Hilltop (7/9)
      • Step 2: Target Ranking
        • Computing the Expert Score
        • Computing the Target Score
      • [Figure: N = 200 expert pages point to a target page; at least 2 experts must point to the target]
    25. Hilltop (8/9)
      • Computing the Expert Score
        • The expert score reflects the number and importance of the key phrases that contain the query keywords.
    26. Computing the Expert Score (1/2)
      • S0: total score of the key phrases that contain all k query keywords
      • S1: total score of the key phrases that contain k-1 query keywords
      • S2: total score of the key phrases that contain k-2 query keywords
      • $S_i = \sum_{\text{key phrases } p \text{ with } k-i \text{ query terms}} \mathrm{LevelScore}(p) \cdot \mathrm{FullnessFactor}(p, q)$
      • LevelScore: 16 for a title, 6 for a heading, 1 for anchor text
      • FullnessFactor: let m be the number of terms in p that are not in q, and plen the length of p
        • If m <= 2, FullnessFactor(p,q) = 1
        • If m > 2, FullnessFactor(p,q) = 1 - (m-2) / plen
      • Example: Query: A B; Title: A B C; H1: A
        • S0 = 16*1
        • S1 = 16*1 + 6*1 + 16*1
        • S2 = 0
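A small sketch of the level sums, assuming key phrases are given as (level, list of terms) pairs and that a phrase containing none of the query keywords contributes nothing (an assumption made here so that unrelated phrases are ignored). The function names and the sample phrases are hypothetical; the level scores and the fullness rule are the ones stated on the slide.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """Penalize phrases that carry more than two terms outside the query."""
    m = sum(1 for term in phrase_terms if term not in query_terms)
    if m <= 2:
        return 1.0
    return 1.0 - (m - 2) / len(phrase_terms)


def level_sums(key_phrases, query_terms):
    """Return [S0, S1, S2] for a list of (level, terms) key phrases."""
    query = set(query_terms)
    k = len(query)
    sums = [0.0, 0.0, 0.0]
    for level, terms in key_phrases:
        matched = len(query & set(terms))
        missing = k - matched
        # Assumption: phrases with no query keyword at all are skipped.
        if matched > 0 and missing <= 2:
            sums[missing] += LEVEL_SCORE[level] * fullness_factor(terms, query)
    return sums


phrases = [
    ("title", ["anti", "spam", "algorithm"]),                     # both keywords
    ("heading", ["spam"]),                                        # one keyword
    ("anchor", ["web", "spam", "detection", "survey", "notes"]),  # one keyword, m = 4
]
s0, s1, s2 = level_sums(phrases, ["anti", "spam"])
```

For this input the title contributes 16 to S0, the heading contributes 6 to S1, and the long anchor contributes 1 * (1 - 2/5) = 0.6 to S1.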
    27. Computing the Expert Score (2/2)
      • With S0, S1, and S2 the total scores of the key phrases that contain k, k-1, and k-2 query keywords:
      • $\mathrm{Expert\_Score} = 2^{32} \cdot S_0 + 2^{16} \cdot S_1 + S_2$
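The weighting can be illustrated in a few lines. The function name and the sample values are hypothetical; the powers of two are the ones on the slide.

```python
def expert_score(s0, s1, s2):
    """Combine the level sums so that fuller keyword matches dominate."""
    return 2**32 * s0 + 2**16 * s1 + s2


full_match = expert_score(1, 0, 0)
many_partial_matches = expert_score(0, 1000, 1000)
```

Because of the large gaps between the weights, a single unit of S0 outweighs even a thousand units of S1 and S2 in this example.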
    28. Hilltop (9/9)
      • Computing the Target Score
        • The target score reflects both the number and relevance of the experts pointing to the target, and the relevance of the phrases qualifying the links.
    29. Computing the Target Score (1/2)
      • occ(w,T) is the number of distinct key phrases in E that contain w and qualify the edge (E,T)
      • If occ(w,T) is 0 for any query keyword w, then Edge_Score(E,T) = 0
      • Otherwise, $\mathrm{Edge\_Score}(E,T) = \mathrm{Expert\_Score}(E) \cdot \sum_{\text{query keywords } w} \mathrm{occ}(w,T)$
      • [Figure: expert E linked to target T by an edge]
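A minimal sketch of the edge score, assuming the key phrases that qualify the link from E to T are passed in as a list of distinct term lists. The function and parameter names are hypothetical.

```python
def edge_score(expert_score, qualifying_phrases, query_terms):
    """Edge_Score(E,T): zero unless every query keyword occurs in some qualifying phrase."""
    total = 0
    for w in query_terms:
        occ = sum(1 for phrase in qualifying_phrases if w in phrase)
        if occ == 0:
            return 0  # one missing keyword voids the whole edge
        total += occ
    return expert_score * total


phrases = [["anti", "spam"], ["spam"]]
score = edge_score(10, phrases, ["anti", "spam"])    # occ: anti=1, spam=2
no_match = edge_score(10, phrases, ["anti", "rank"])  # "rank" never occurs
```

The first call yields 10 * (1 + 2), and the second yields 0 because one keyword is not covered.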
    30. Computing the Target Score (2/2)
      • $\mathrm{Target\_Score}(T) = \sum_{\text{non-affiliated } E} \mathrm{Edge\_Score}(E,T)$
      • Example: experts E1, E2, and E3 point to T; E2 and E3 are affiliated, and ES(E2,T) > ES(E3,T), so the edge from E3 is dropped
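The sum over non-affiliated experts can be sketched as below, assuming that among affiliated experts only the edge with the highest score is kept. The function name, the dict-based inputs, and the numeric scores are hypothetical; the E1/E2/E3 setup mirrors the slide's example.

```python
def target_score(edge_scores, group_of):
    """Sum edge scores, counting only the best edge from each affiliation group."""
    best_per_group = {}
    for expert, score in edge_scores.items():
        group = group_of[expert]
        best_per_group[group] = max(best_per_group.get(group, 0), score)
    return sum(best_per_group.values())


# E2 and E3 are affiliated and ES(E2,T) > ES(E3,T), so only E2's edge counts.
edge_scores = {"E1": 10, "E2": 7, "E3": 4}
group_of = {"E1": "g1", "E2": "g2", "E3": "g2"}
total = target_score(edge_scores, group_of)
```

Counting one edge per group prevents a single organization from inflating a target's score by pointing at it from several of its own hosts.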
    31. Evaluation: TrustRank (1/3)
    32. Evaluation: TrustRank (2/3)
      • Pairwise Orderedness
    33. Evaluation: TrustRank (3/3)
      • Precision
      • Recall
    34. Evaluation-Hilltop(1/2)
      • Precision
    35. Evaluation-Hilltop(2/2)
      • Recall
    36. Reference
      • Combating Web Spam with TrustRank
        • http://www.vldb.org/conf/2004/RS15P3.PDF
      • Hilltop: A Search Engine based on Expert Documents
        • http://www.cs.toronto.edu/~georgem/hilltop/
      • Types of web spam
        • http://en.wikipedia.org/wiki/Spamdexing
    37. Q&A

    Uploaded by flyingsheep, 3 years ago


    Two algorithms for anti-spam


    CC Attribution-NonCommercial-ShareAlike License
