Anti Spam Algorithm

Two algorithms for anti-spam

Published in: Economy & Finance, Technology
  • Transcript of "Anti Spam Algorithm"

    1. Anti-spam Algorithms: TrustRank and Hilltop
       954203041 林裕得, 954203057 蔡繼正
    2. Outline
       • Introduction
       • Comparing PageRank, TrustRank, and Hilltop
       • TrustRank
         • Combating Web Spam with TrustRank
       • Hilltop
         • Hilltop: A Search Engine based on Expert Documents
       • Evaluation
    3. Introduction (1/3)
       • PageRank
       • Current problem
         • Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results.
       [Figure: example graph with scores 0.4, 0.4, 0.2]
    4. Introduction (2/3)
       • Types of web spam
         • Content spam
           • Hidden or invisible text
           • Keyword stuffing
           • Meta tag stuffing
         • Link spam
           • Link farms
           • Hidden links
         • Other types
           • Mirror websites
           • URL redirection
    5. Introduction (3/3)
       • How to combat web spam
         • TrustRank or Hilltop: identify which pages are definitely not spam.
         • BadRank or SpamRank: identify which pages are definitely spam.
         • Sandbox: cannot reliably tell spam pages from non-spam pages, but the practice effectively suppresses the SEO market.
         • Manual reporting and specific anti-spam methods: help build a more comprehensive spam pool.
           • http://www.google.com/contact/spamreport.html
    6. Comparing PageRank, TrustRank, and Hilltop (1/3)
       • All three are connectivity algorithms: they assume that the number and quality of the sources referring to a page are a good measure of the page's quality.
    7. Comparing PageRank, TrustRank, and Hilltop (2/3)
       • Basic assumption
         • PageRank: a good page has many important inlinks.
         • TrustRank: good pages point to good ones.
         • Hilltop: only expert pages are trusted to point to good ones.
    8. Comparing PageRank, TrustRank, and Hilltop (3/3)
                         PageRank    TrustRank   Hilltop
       Inlinks source    All pages   All pages   Expert pages
       Initial score     Average     1 or 0      Algorithm
       [Figure: example initial scores: 0.16 0.16 0.16 0.16 0.16 0.16; 0.33 0 0.33 0 0.33 0; 0.5 0.2 0.3]
    9. TrustRank (1/7)
       [Figure: example matrices; the surviving rows contain the entries 0, 0.5, and 1, but the full layout is not recoverable from the transcript]
    11. TrustRank (2/7)
       • Step 1: Evaluate the seed-desirability of pages by inverse PageRank
    12. [Figure: inverse PageRank computation; surviving symbols: U, S, 1, N, M]
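
Step 1 can be sketched as follows, assuming the formulation in the referenced TrustRank paper: start from s = 1_N and repeat s = alpha * U * s + (1 - alpha) * (1/N) * 1_N for M iterations, where U is the inverse transition matrix (the link graph with its edges reversed). The toy graph, `alpha = 0.85`, and `iterations = 20` are illustrative assumptions, not values from the slides.

```python
def inverse_pagerank(edges, n, alpha=0.85, iterations=20):
    """Seed desirability: pages from which much of the graph is reachable score high."""
    indegree = [0] * n
    for _, dst in edges:
        indegree[dst] += 1

    s = [1.0] * n  # s = 1_N
    for _ in range(iterations):
        nxt = [(1.0 - alpha) / n] * n
        for src, dst in edges:
            # U[src][dst] = 1 / indegree(dst): score flows backwards along each link.
            nxt[src] += alpha * s[dst] / indegree[dst]
        s = nxt
    return s


# Hypothetical toy graph: page 0 links to 1, 2, 3; pages 1 and 3 link to 2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 2)]
scores = inverse_pagerank(edges, n=4)
best_seed_candidate = scores.index(max(scores))
```

Page 0 reaches every other page, so it becomes the most desirable seed candidate, while page 2 (no outlinks) keeps only the uniform share.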
    13. TrustRank (3/7)
       • Step 2: Generate good seeds
    14. TrustRank (4/7)
       • Step 3: Select good seeds
       Example: with L = 3, the seed set is {2, 4, 5}.
    15. TrustRank (5/7)
       • Step 4: Normalize the static score distribution vector
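
Steps 2 to 4 can be sketched as below, assuming the procedure in the referenced TrustRank paper: rank pages by seed desirability, ask an oracle (a human judge) about the top L pages, keep the good ones as seeds, and normalize the resulting vector d so it sums to 1. The scores, the `known_good` set standing in for the oracle, and the function names are hypothetical.

```python
def select_seeds(scores, oracle, limit):
    """Return the pages judged good among the `limit` highest-scoring candidates."""
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return {p for p in ranked[:limit] if oracle(p)}


def static_distribution(seeds, n):
    """d[p] = 1/|seeds| for seed pages and 0 elsewhere, so that d sums to 1."""
    if not seeds:
        raise ValueError("no good seeds found among the inspected pages")
    return [1.0 / len(seeds) if p in seeds else 0.0 for p in range(n)]


# Hypothetical desirability scores for 7 pages and a stand-in for the oracle.
scores = [0.08, 0.13, 0.07, 0.10, 0.09, 0.06, 0.02]
known_good = {0, 1, 4}

seeds = select_seeds(scores, lambda p: p in known_good, limit=3)
d = static_distribution(seeds, len(scores))
```

Here the three inspected pages are 1, 3, and 4; the oracle rejects page 3, so only pages 1 and 4 become seeds and each receives half of the static score.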
    16. TrustRank (6/7)
       • Step 5: Compute the TrustRank scores
       [Figure: TrustRank computation; surviving symbols: T, d, t*, M]
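
Step 5 can be sketched as a biased PageRank, assuming the update from the referenced TrustRank paper: t* = d, then t* = alpha * T * t* + (1 - alpha) * d repeated M times, where T is the transition matrix and d is the normalized seed vector. The toy graph, `alpha = 0.85`, and `iterations = 20` are illustrative assumptions.

```python
def trustrank(edges, d, alpha=0.85, iterations=20):
    """Propagate trust from the seed distribution d along outgoing links."""
    n = len(d)
    outdegree = [0] * n
    for src, _ in edges:
        outdegree[src] += 1

    t = list(d)  # t* = d
    for _ in range(iterations):
        nxt = [(1.0 - alpha) * d[p] for p in range(n)]
        for src, dst in edges:
            # T[dst][src] = 1 / outdegree(src): a page splits its trust among its outlinks.
            nxt[dst] += alpha * t[src] / outdegree[src]
        t = nxt
    return t


# Hypothetical graph: good pages 0 and 1 link to page 2;
# spam pages 3 and 4 link only to each other.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (4, 3)]
d = [1.0, 0.0, 0.0, 0.0, 0.0]  # page 0 is the only seed
trust = trustrank(edges, d)
```

Because no path leads from the seed to pages 3 and 4, their trust stays at zero even though they link to each other.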
    17. TrustRank (7/7)
       • Conclusion
    18. Hilltop (1/9)
       • Expert page
         • A page that is about a certain topic and has links to many non-affiliated pages on that topic.
       • Non-affiliated
         • Conceptually, two pages are non-affiliated if they are authored by authors from non-affiliated organizations.
    19. Hilltop (2/9)
       • Step 1: Expert lookup
         • Detecting host affiliation
         • Selecting the experts
         • Indexing the experts
       [Figure: an expert page linking to non-affiliated pages, with its key phrases indexed]
    20. Hilltop (3/9)
       • Detecting host affiliation
         • Rule: two hosts are affiliated if one or both of the following are true:
           • They share the same first 3 octets of the IP address.
           • The rightmost non-generic token in the hostname is the same.
             • Example: "www.ibm.com" and "ibm.co.mx"
       • The affiliation relation is transitive
         • If A and B are affiliated and B and C are affiliated, then A and C are taken to be affiliated.
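
The two rules and the transitive closure can be sketched with a union-find structure. This is a minimal illustration, not the authors' implementation: the `GENERIC_TOKENS` list is an assumed stand-in for whatever the real system treats as generic suffixes, and the hostnames other than the IBM pair, as well as all IP addresses, are made up.

```python
# Assumed list of generic hostname tokens; a real system would derive this from data.
GENERIC_TOKENS = {"com", "co", "org", "net", "edu", "gov", "mx", "uk", "tw"}


def rightmost_non_generic(hostname):
    """Return the rightmost dot-separated token that is not a generic suffix."""
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return None


def directly_affiliated(a, b):
    """a and b are (hostname, ipv4) pairs; apply the two rules from the slide."""
    same_subnet = a[1].split(".")[:3] == b[1].split(".")[:3]
    token_a, token_b = rightmost_non_generic(a[0]), rightmost_non_generic(b[0])
    return same_subnet or (token_a is not None and token_a == token_b)


def affiliation_groups(hosts):
    """Return one group id per host, merging hosts transitively (union-find)."""
    parent = list(range(len(hosts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(hosts)):
        for j in range(i + 1, len(hosts)):
            if directly_affiliated(hosts[i], hosts[j]):
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(hosts))]


hosts = [
    ("www.ibm.com", "192.0.2.10"),
    ("ibm.co.mx", "198.51.100.7"),     # same token "ibm" as the first host
    ("example.org", "198.51.100.99"),  # same first 3 octets as the second host
    ("other.net", "203.0.113.5"),
]
groups = affiliation_groups(hosts)
```

The first and third hosts share neither a token nor a subnet, yet they end up in the same group because each is affiliated with the second host.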
    21. Hilltop (4/9)
       • Selecting the experts
         • For every page with out-degree greater than a threshold k (e.g., k = 5), test whether its URLs point to k distinct non-affiliated hosts. Every page that passes is considered an expert page.
       [Figure: an expert page linking to non-affiliated pages]
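
The expert test can be sketched as below, assuming "k distinct non-affiliated hosts" means the outlinks fall into at least k different affiliation groups. The `group_of` mapping (hostname to affiliation group) is a hypothetical input that would come from the affiliation detection step, and the example URLs are made up.

```python
from urllib.parse import urlsplit


def is_expert(out_urls, group_of, k=5):
    """True if the page has more than k outlinks reaching at least k affiliation groups."""
    if len(out_urls) <= k:
        return False
    groups = set()
    for url in out_urls:
        host = urlsplit(url).hostname
        # Hosts without a known group count as their own group.
        groups.add(group_of.get(host, host))
    return len(groups) >= k


group_of = {"www.ibm.com": "ibm", "ibm.co.mx": "ibm"}

directory_page = [
    "http://www.ibm.com/", "http://ibm.co.mx/", "http://a.example/",
    "http://b.example/", "http://c.example/", "http://d.example/",
]
vendor_page = [
    "http://www.ibm.com/1", "http://www.ibm.com/2", "http://ibm.co.mx/1",
    "http://ibm.co.mx/2", "http://a.example/", "http://a.example/2",
]

directory_is_expert = is_expert(directory_page, group_of)
vendor_is_expert = is_expert(vendor_page, group_of)
```

Both pages have six outlinks, but the second one reaches only two affiliation groups and is therefore not treated as an expert.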
    22. Hilltop (5/9)
       • Indexing the experts
         • Index the text contained within the "key phrases" of the expert. The following are considered key phrases:
           • Title
           • Headings (e.g., text within <H1> </H1> tags)
           • Anchor text
         • A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text.
    23. Hilltop (6/9)
       • Example
         • The title qualifies 4 URLs.
         • Each heading qualifies 2 URLs.
         • Each anchor qualifies 1 URL.
         <title> Central University </title>
         <h1> Department of Information Management </h1> <A> 001 </A> <A> 002 </A>
         <h1> Department of Business Administration </h1> <A> 001 </A> <A> 002 </A>
    24. Hilltop (7/9)
       • Step 2: Target ranking
         • Computing the expert score
         • Computing the target score
       [Figure: N = 200 expert pages pointing to target pages; at least 2 experts must point to a target]
    25. Hilltop (8/9)
       • Computing the expert score
         • The expert score reflects the number and importance of the key phrases that contain the query keywords.
    26. Computing the Expert Score (1/2)
       • S_0: total score of the key phrases that contain all k query keywords
       • S_1: total score of the key phrases that contain k-1 query keywords
       • S_2: total score of the key phrases that contain k-2 query keywords
       • S_i = SUM over key phrases p with k-i query terms of LevelScore(p) * FullnessFactor(p, q)
       • LevelScore: 16 for a title, 6 for a heading, 1 for anchor text
       • m is the number of terms in p that are not in q; plen is the length of p
         • If m <= 2, FullnessFactor(p, q) = 1
         • If m > 2, FullnessFactor(p, q) = 1 - (m - 2) / plen
       • Example: query "A B"; title "A B C"; H1 "A"
         • S_0 = 16*1
         • S_1 = 16*1 + 6*1 + 16*1
         • S_2 = 0
    27. Computing the Expert Score (2/2)
       • Expert_Score = (2^32 * S_0) + (2^16 * S_1) + S_2
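
The formulas above can be sketched as follows. Two readings are assumptions here: "with k-i query terms" is taken to mean exactly k-i distinct query terms, and key phrases containing no query term at all are ignored. Under that reading only the heading contributes to S_1 for the slide's query, which differs from the worked S_1 value shown on the slide, so treat this as one possible interpretation rather than the authors' calculation.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """Penalize key phrases that carry more than two terms outside the query."""
    plen = len(phrase_terms)
    m = sum(1 for term in phrase_terms if term not in query_terms)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / plen


def expert_score(key_phrases, query):
    """key_phrases: list of (level, text). Returns ([S_0, S_1, S_2], Expert_Score)."""
    query_terms = set(query)
    k = len(query_terms)
    s = [0.0, 0.0, 0.0]
    for level, text in key_phrases:
        terms = text.split()
        matched = len(query_terms & set(terms))
        i = k - matched  # phrase contributes to S_i when it has exactly k-i query terms
        if matched > 0 and i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(terms, query_terms)
    return s, 2**32 * s[0] + 2**16 * s[1] + s[2]


phrases = [("title", "A B C"), ("heading", "A")]
components, score = expert_score(phrases, ["A", "B"])
```

The large powers of two make S_0 dominate S_1, and S_1 dominate S_2, so an expert matching all query keywords always outranks one matching fewer.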
    28. Hilltop (9/9)
       • Computing the target score
         • The target score reflects both the number and relevance of the experts pointing to the target, and the relevance of the phrases qualifying the links.
    29. Computing the Target Score (1/2)
       • occ(w, T) is the number of distinct key phrases in E that contain w and qualify the edge (E, T).
       • If occ(w, T) is 0 for any query keyword w, then Edge_Score(E, T) = 0.
       • Otherwise, Edge_Score(E, T) = Expert_Score(E) * SUM over query keywords w of occ(w, T).
       [Figure: expert E linked to target T by an edge]
    30. Computing the Target Score (2/2)
       • Target_Score = SUM over non-affiliated experts E of Edge_Score(E, T)
       [Figure: experts E_1, E_2, E_3 pointing to target T; E_2 and E_3 are affiliated and ES(E_2, T) > ES(E_3, T), so of the two only the edge from E_2 is counted]
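
The edge and target scores can be sketched as below, assuming that among affiliated experts pointing to the same target only the highest-scoring edge is kept, as the E_2/E_3 example suggests. The expert scores, occurrence counts, and group labels are made-up inputs.

```python
def edge_score(expert_score, occ):
    """occ maps each query keyword w to occ(w, T) for one expert's edge to T."""
    if any(count == 0 for count in occ.values()):
        return 0.0  # some query keyword never qualifies this edge
    return expert_score * sum(occ.values())


def target_score(edges):
    """edges: list of (affiliation_group, edge_score) for experts pointing at T."""
    best_per_group = {}
    for group, score in edges:
        # Keep only the strongest edge from each group of affiliated experts.
        best_per_group[group] = max(score, best_per_group.get(group, 0.0))
    return sum(best_per_group.values())


es1 = edge_score(10.0, {"A": 1, "B": 2})  # expert E1, its own group
es2 = edge_score(5.0, {"A": 1, "B": 1})   # expert E2
es3 = edge_score(4.0, {"A": 1, "B": 1})   # expert E3, affiliated with E2
es4 = edge_score(9.0, {"A": 3, "B": 0})   # keyword B never qualifies this edge

total = target_score([("g1", es1), ("g2", es2), ("g2", es3), ("g3", es4)])
```

E_3's edge is dropped in favor of E_2's, and the fourth expert contributes nothing because one query keyword is missing from its qualifying phrases.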
    31. Evaluation: TrustRank (1/3)
    32. Evaluation: TrustRank (2/3)
       • Pairwise orderedness
    33. Evaluation: TrustRank (3/3)
       • Precision
       • Recall
    34. Evaluation: Hilltop (1/2)
       • Precision
    35. Evaluation: Hilltop (2/2)
       • Recall
    36. References
       • Combating Web Spam with TrustRank
         • http://www.vldb.org/conf/2004/RS15P3.PDF
       • Hilltop: A Search Engine based on Expert Documents
         • http://www.cs.toronto.edu/~georgem/hilltop/
       • Types of web spam
         • http://en.wikipedia.org/wiki/Spamdexing
    37. Q&A