Anti-spam Algorithms: TrustRank, Hilltop. 954203041 林裕得, 954203057 蔡繼正
Outline
- Introduction
- Comparison of PageRank, TrustRank, and Hilltop
- TrustRank: Combating Web Spam with TrustRank
- Hilltop: A Search Engine based on Expert Documents
- Evaluation
Introduction (1/3)
Current problem with PageRank: web spam. Spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results.
[Figure: example link graph with node scores 0.4, 0.4, 0.2]
Introduction (2/3) Types of web spam
- Content spam: hidden or invisible text, keyword stuffing, meta tag stuffing
- Link spam: link farms, hidden links
- Other types: mirror websites, URL redirection
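Introduction (3/3) How to combat web spam
- TrustRank or Hilltop: identify which pages are definitely not spam.
- BadRank or SpamRank: identify which pages definitely are spam.
- Sandbox: cannot effectively tell spam pages from non-spam pages, but the practice effectively suppresses the SEO market.
- Manual reporting and specific anti-spam methods: help build a more comprehensive spam pool. http://www.google.com/contact/spamreport.html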
Comparison of PageRank, TrustRank, and Hilltop (1/3)
All three are connectivity-based algorithms: they assume that the number and quality of the sources referring to a page are a good measure of the page's quality.
Comparison of PageRank, TrustRank, and Hilltop (2/3) Basic assumptions
- PageRank: a good page has many important inlinks.
- TrustRank: good pages point to good ones.
- Hilltop: only expert pages point to good ones.
Comparison of PageRank, TrustRank, and Hilltop (3/3)
                    PageRank     TrustRank    Hilltop
Source of inlinks   All pages    All pages    Expert pages
Initial score       Average      1 or 0       Determined by the algorithm
[Figure: example initial scores. PageRank: 0.16 for each of six pages. TrustRank: 0.33, 0, 0.33, 0, 0.33, 0. Hilltop: 0.5, 0.2, 0.3]
TrustRank (1/7)
[Figure: example matrices with entries 0, 0.5, and 1; the layout is not recoverable]
 
TrustRank (2/7)
Step 1: Evaluate the seed desirability of pages using inverse PageRank.
$s = \alpha \cdot U \cdot s + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$, iterated $M$ times
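The seed-desirability step can be sketched as PageRank run on the reversed graph. This is a minimal illustration, not the presenters' implementation: it assumes the update rule s = α·U·s + (1 − α)·(1/N)·1_N from the TrustRank paper, where U(p, q) = 1/in-degree(q) whenever p links to q. The function name, the toy graph, and the defaults `alpha=0.85` and `iterations=20` are assumptions chosen for the example.

```python
def inverse_pagerank(out_links, alpha=0.85, iterations=20):
    """Seed-desirability scores: PageRank computed on the reversed graph.

    out_links maps each page to the list of pages it links to.
    Assumption: every page appears as a key, even if it has no out-links.
    """
    pages = sorted(out_links)
    n = len(pages)
    # In-degree plays the role that out-degree plays in ordinary PageRank.
    in_degree = {p: 0 for p in pages}
    for targets in out_links.values():
        for q in targets:
            in_degree[q] += 1

    s = {p: 1.0 for p in pages}  # s starts as the all-ones vector
    for _ in range(iterations):
        nxt = {p: (1 - alpha) / n for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                # U[p][q] = 1 / in_degree(q) when p links to q
                nxt[p] += alpha * s[q] / in_degree[q]
        s = nxt
    return s


# Hypothetical toy graph: "a" reaches the most pages, so it is the best seed candidate.
graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}
scores = inverse_pagerank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Pages with many out-links to pages that few others point to score highest, which is why the score is used to pick pages whose trust would spread widely.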
TrustRank (3/7)
Step 2: Generate good seeds.
TrustRank (4/7)
Step 3: Select good seeds. Example: with L = 3, the seed set is {2, 4, 5}.
TrustRank (5/7)
Step 4: Normalize the static score distribution vector.
TrustRank (6/7)
Step 5: Compute the TrustRank scores: $t^* = \alpha \cdot T \cdot t^* + (1 - \alpha) \cdot d$, iterated $M$ times
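Steps 4 and 5 can be sketched together. The sketch assumes the good seeds have already been chosen, and that trust propagates with the update rule t* = α·T·t* + (1 − α)·d, where T(p, q) = 1/out-degree(q) whenever q links to p and d is the normalized seed vector. The function name, the toy graph, and the default parameters are illustrative assumptions.

```python
def trustrank(out_links, good_seeds, alpha=0.85, iterations=20):
    """Propagate trust from a non-empty set of good seeds along out-links.

    out_links maps each page to the list of pages it links to.
    Assumption: every page appears as a key, even if it has no out-links.
    """
    pages = sorted(out_links)
    # Step 4: static score distribution d (1 for good seeds, 0 otherwise),
    # normalized so that its entries sum to 1.
    d = {p: (1.0 if p in good_seeds else 0.0) for p in pages}
    total = sum(d.values())
    d = {p: v / total for p, v in d.items()}

    # Step 5: iterate t = alpha * T * t + (1 - alpha) * d
    t = dict(d)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * d[p] for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                # T[p][q] = 1 / out_degree(q) when q links to p
                nxt[p] += alpha * t[q] / len(targets)
        t = nxt
    return t


# Hypothetical toy graph: only "seed" is trusted; "spam" has no trusted inlinks.
graph = {"seed": ["x"], "x": ["y"], "y": [], "spam": ["y"]}
trust = trustrank(graph, good_seeds={"seed"})
```

Trust decays with distance from the seed, and a page that no trusted page reaches keeps a score of 0.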
TrustRank (7/7) Conclusion
Hilltop (1/9)
- Expert page: a page that is about a certain topic and has links to many non-affiliated pages on that topic.
- Non-affiliated: two pages are conceptually non-affiliated if they are authored by authors from non-affiliated organizations.
Hilltop (2/9)
Step 1: Expert Lookup
- Detecting host affiliation
- Selecting the experts
- Indexing the experts
[Figure: an expert page pointing to non-affiliated pages; its key phrases are indexed]
Hilltop (3/9) Detecting Host Affiliation
Two hosts are affiliated if one or both of the following are true:
- They share the same first 3 octets of the IP address.
- The rightmost non-generic token in the hostname is the same, e.g., "www.ibm.com" and "ibm.co.mx".
The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated, then A and C are taken to be affiliated.
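The two affiliation rules plus transitivity map naturally onto a union-find structure. The sketch below is an illustration under stated assumptions: `GENERIC_TOKENS` is a tiny hand-picked set rather than a complete suffix list, the IP addresses come from documentation ranges, and the host "mirror.example.org" is made up.

```python
# Assumption: a small illustrative set of generic hostname tokens.
GENERIC_TOKENS = {"com", "co", "org", "net", "edu", "gov", "mx", "uk"}


def host_key(hostname):
    """Rightmost non-generic token, e.g. 'ibm' for 'www.ibm.com'."""
    tokens = hostname.lower().split(".")
    while tokens and tokens[-1] in GENERIC_TOKENS:
        tokens.pop()
    return tokens[-1] if tokens else hostname.lower()


def ip_prefix(ip):
    """First 3 octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])


def affiliation_groups(hosts):
    """hosts maps hostname -> IPv4 string; returns hostname -> group representative."""
    parent = {h: h for h in hosts}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path compression
            h = parent[h]
        return h

    first_seen = {}
    for h, ip in hosts.items():
        # Hosts sharing either key are merged; union-find makes this transitive.
        for key in (("ip", ip_prefix(ip)), ("name", host_key(h))):
            if key in first_seen:
                ra, rb = find(first_seen[key]), find(h)
                if ra != rb:
                    parent[rb] = ra
            else:
                first_seen[key] = h
    return {h: find(h) for h in hosts}


hosts = {
    "www.ibm.com": "192.0.2.10",
    "ibm.co.mx": "198.51.100.7",          # same token "ibm" as www.ibm.com
    "mirror.example.org": "198.51.100.99",  # same /24 as ibm.co.mx
    "www.other.net": "203.0.113.5",
}
groups = affiliation_groups(hosts)
```

Here "www.ibm.com" and "mirror.example.org" share neither an IP prefix nor a hostname token, yet they land in the same group because both are affiliated with "ibm.co.mx".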
Hilltop (4/9) Selecting the Experts
Considering all pages with out-degree greater than a threshold k (e.g., k = 5), we test whether their URLs point to k distinct non-affiliated hosts. Every such page is considered an expert page.
[Figure: an expert page pointing to non-affiliated pages]
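A minimal sketch of the expert test, assuming each page's out-links have already been reduced to host names and that a hypothetical `group_of` mapping assigns affiliated hosts to a shared group label (hosts missing from the mapping are treated as their own group). The host names are made up.

```python
def is_expert(link_hosts, group_of, k=5):
    """Return True if a page qualifies as an expert.

    link_hosts: one host name per out-link of the page.
    group_of:   host -> affiliation group label (hypothetical input).
    """
    if len(link_hosts) <= k:  # out-degree must be greater than k
        return False
    distinct_groups = {group_of.get(h, h) for h in link_hosts}
    return len(distinct_groups) >= k  # k distinct non-affiliated hosts


links = ["h1.example", "h2.example", "h3.example",
         "h4.example", "h5.example", "h6.example"]
independent = is_expert(links, group_of={})
# Three of the six hosts are affiliated, leaving only four distinct groups.
affiliated = is_expert(links, group_of={"h1.example": "g1",
                                        "h2.example": "g1",
                                        "h3.example": "g1"})
too_few_links = is_expert(links[:5], group_of={})
```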
Hilltop (5/9) Indexing the Experts
We index the text contained within the "key phrases" of the expert. A key phrase is a piece of text that qualifies one or more URLs in the page. The following are considered key phrases:
- title
- headings (e.g., text within <H1> </H1> tags)
- anchor text
Every key phrase has a scope within the document text.
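Hilltop (6/9) Example
The title qualifies 4 URLs, each heading qualifies 2 URLs, and each anchor qualifies 1 URL.
  <title> National Central University </title>
  <h1> Department of Information Management </h1>
  <A> 001 </A> <A> 002 </A>
  <h1> Department of Business Administration </h1>
  <A> 001 </A> <A> 002 </A>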
Hilltop (7/9)
Step 2: Target Ranking
- Computing the expert score
- Computing the target score
[Figure: the top N = 200 expert pages pointing to a target page; at least 2 experts must point to the target]
Hilltop (8/9) Computing the Expert Score
The expert score reflects the number and importance of the key phrases that contain the query keywords.
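Computing the Expert Score (1/2)
For a query q with k keywords:
- S_0: total score of the key phrases that contain all k keywords
- S_1: total score of the key phrases that contain exactly k-1 keywords
- S_2: total score of the key phrases that contain exactly k-2 keywords
$S_i = \sum_{\text{key phrases } p \text{ with } k-i \text{ query terms}} \mathrm{LevelScore}(p) \cdot \mathrm{FullnessFactor}(p, q)$
LevelScore(p): 16 for a title, 6 for a heading, 1 for anchor text.
FullnessFactor(p, q): let plen be the length of p and m the number of terms in p that are not in q.
- If m <= 2, FullnessFactor(p, q) = 1
- If m > 2, FullnessFactor(p, q) = 1 - (m - 2) / plen
Example: query "A B", title "A B C", H1 "A"
- S_0 = 16 * 1 (the title contains both keywords)
- S_1 = 6 * 1 (the H1 contains one keyword)
- S_2 = 0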
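The component scores can be sketched as follows. Two readings are assumed here and are not confirmed by the slides: a key phrase counts toward S_i only when it contains exactly k − i distinct query terms, and phrases matching no query term are ignored. The query and phrases are made-up sample data.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """1 when at most 2 phrase terms fall outside the query, otherwise discounted."""
    plen = len(phrase_terms)
    m = sum(1 for t in phrase_terms if t not in query_terms)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / plen


def component_scores(key_phrases, query_terms):
    """key_phrases: list of (level, [terms]); query_terms: set. Returns [S_0, S_1, S_2]."""
    k = len(query_terms)
    s = [0.0, 0.0, 0.0]
    for level, terms in key_phrases:
        matched = len(set(terms) & query_terms)
        i = k - matched
        # Assumption: exact-match reading; phrases with no query term are skipped.
        if matched > 0 and 0 <= i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(terms, query_terms)
    return s


query = {"web", "spam"}
phrases = [
    ("title", ["web", "spam", "detection"]),   # both keywords, m = 1
    ("heading", ["spam"]),                     # one keyword, m = 0
    ("anchor", ["web", "directory", "of", "useful", "resources", "online"]),  # m = 5
]
s0, s1, s2 = component_scores(phrases, query)
```

The long anchor phrase contributes only 1 * (1 − 3/6) = 0.5 to S_1, which shows how the fullness factor discounts phrases that are mostly about something other than the query.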
Computing the Expert Score (2/2)
With S_0, S_1, and S_2 defined as the total scores of the key phrases containing k, k-1, and k-2 of the query keywords:
$\mathrm{Expert\_Score} = 2^{32} \cdot S_0 + 2^{16} \cdot S_1 + S_2$
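The powers of two make the combination behave like a lexicographic ordering, provided S_1 and S_2 stay below 2^16. A small sketch with made-up component values:

```python
def expert_score(s0, s1, s2):
    """Combine the component scores; S_0 dominates S_1, which dominates S_2."""
    return (2 ** 32) * s0 + (2 ** 16) * s1 + s2


# Hypothetical experts: one weak full match versus many partial matches.
full_match = expert_score(1, 0, 0)
partial_matches = expert_score(0, 100, 500)
```

Even a single anchor-level phrase containing every query keyword outranks an expert whose phrases each miss at least one keyword.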
Hilltop (9/9) Computing the Target Score
The target score reflects both the number and relevance of the experts pointing to the target and the relevance of the phrases qualifying the links.
Computing the Target Score (1/2)
For an edge (E, T) from expert E to target T, occ(w, T) is the number of distinct key phrases in E that contain the query keyword w and qualify the edge (E, T).
- If occ(w, T) = 0 for any query keyword w, then Edge_Score(E, T) = 0.
- Otherwise, $\mathrm{Edge\_Score}(E, T) = \mathrm{Expert\_Score}(E) \cdot \sum_{\text{query keywords } w} \mathrm{occ}(w, T)$
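The edge score rule can be sketched directly. The function assumes the caller has already collected the distinct key phrases of expert E that qualify the edge (E, T), each as a list of terms; the expert score of 10 and the phrases are made-up values.

```python
def edge_score(expert_score, qualifying_phrases, query_terms):
    """Edge_Score(E, T) for one expert-to-target edge.

    qualifying_phrases: distinct key phrases of E that qualify the edge,
                        each given as a list of terms.
    """
    occ = {w: sum(1 for p in qualifying_phrases if w in p) for w in query_terms}
    if any(count == 0 for count in occ.values()):
        return 0  # every query keyword must appear in some qualifying phrase
    return expert_score * sum(occ.values())


phrases = [["web", "spam"], ["spam", "filter"]]
covered = edge_score(10, phrases, {"web", "spam"})   # occ: web = 1, spam = 2
missing = edge_score(10, phrases, {"web", "rank"})   # "rank" never occurs
```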
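Computing the Target Score (2/2)
$\mathrm{Target\_Score}(T) = \sum_{\text{non-affiliated } E} \mathrm{Edge\_Score}(E, T)$
Example: experts E_1, E_2, and E_3 all point to T. E_2 and E_3 are affiliated and ES(E_2, T) > ES(E_3, T), so only the higher edge score of the affiliated pair is kept: Target_Score(T) = ES(E_1, T) + ES(E_2, T).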
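A minimal sketch of the final summation, assuming that among affiliated experts pointing to the same target only the highest edge score is kept. The `group_of` mapping and the edge scores are hypothetical inputs.

```python
def target_score(edge_scores, group_of):
    """Sum edge scores over mutually non-affiliated experts.

    edge_scores: expert -> Edge_Score(expert, T)
    group_of:    expert -> affiliation group label (hypothetical input);
                 experts missing from the mapping form their own group.
    """
    best_per_group = {}
    for expert, score in edge_scores.items():
        group = group_of.get(expert, expert)
        # Keep only the highest edge score within each affiliation group.
        best_per_group[group] = max(best_per_group.get(group, 0), score)
    return sum(best_per_group.values())


# E2 and E3 are affiliated, so the lower score of E3 is discarded.
scores = {"E1": 30, "E2": 20, "E3": 5}
total = target_score(scores, group_of={"E2": "g", "E3": "g"})
```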
Evaluation: TrustRank (1/3)
Evaluation: TrustRank (2/3) Pairwise Orderedness
Evaluation: TrustRank (3/3) Precision and Recall
Evaluation: Hilltop (1/2) Precision
Evaluation: Hilltop (2/2) Recall
References
- Combating Web Spam with TrustRank: http://www.vldb.org/conf/2004/RS15P3.PDF
- Hilltop: A Search Engine based on Expert Documents: http://www.cs.toronto.edu/~georgem/hilltop/
- Types of web spam: http://en.wikipedia.org/wiki/Spamdexing
Q&A
