Hilltop: A Search Engine based on Expert Documents
Evaluation
Introduction (1/3)
Page rank
Current Problem
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results.
[Figure: example graph with page scores 0.4, 0.4, and 0.2]
Introduction (2/3)
Types of web spam
Content spam
Hidden or invisible text
Keyword stuffing
Meta tag stuffing
Link spam
Link farms
Hidden links
Other types
Mirror websites
URL redirection
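Introduction (3/3)
How to combat web spam
TrustRank or Hilltop:
Identify which pages are definitely not spam
BadRank or SpamRank:
Identify which pages are definitely spam
Sandbox:
Cannot effectively tell spam pages from non-spam pages, but this mechanism can effectively suppress the SEO market
Manual reporting and specific anti-spam methods:
Help build a more comprehensive spam pool
http://www.google.com/contact/spamreport.html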
Comparing Page Rank, Trust Rank, and Hilltop (1/3)
All three are connectivity-based algorithms: they assume that the number and quality of the sources referring to a page are a good measure of the page's quality.
Comparing Page Rank, Trust Rank, and Hilltop (2/3)
Basic assumption
Page rank
A good page has many important inlinks.
Trust rank
Good pages point to good ones.
Hilltop rank
Only expert pages point to good ones.
Comparing Page Rank, Trust Rank, and Hilltop (3/3)
                 Page Rank    Trust Rank    Hilltop
Inlinks source   All pages    All pages     Expert pages
Initial score    Average      1 or 0
Algorithm        [Figures: example graphs with per-page scores; values shown are 0.16 (six times), then 0.33, 0, 0.33, 0, 0.33, 0, then 0.5, 0.2, 0.3]
Trust Rank(1/7)
[Figure: two sparse matrices for an example graph, with entries 0, 0.5, and 1; layout not recoverable]
Trust Rank (2/7)
Step 1: Evaluate the seed desirability of pages by Inverse Page Rank
[Figure: iterative computation of the seed-desirability vector s; surviving symbols: U, s, M, 1, N]
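A minimal sketch of Step 1, assuming the web graph is a dict that maps every page to the list of pages it links to. Inverse Page Rank is computed like Page Rank on the graph with every link reversed, so a page scores high when many pages can be reached from it. The update rule follows the usual formulation from the TrustRank paper; the function name `inverse_pagerank`, the decay factor `alpha`, and the iteration count are illustrative assumptions, not values from the slides.

```python
def inverse_pagerank(out_links, alpha=0.85, iterations=20):
    """Score each page by seed desirability (PageRank on the reversed graph)."""
    pages = sorted(out_links)
    n = len(pages)
    # in_degree[q] = number of pages linking to q; it normalises reversed links.
    in_degree = {p: 0 for p in pages}
    for p in pages:
        for q in out_links[p]:
            in_degree[q] += 1
    score = {p: 1.0 for p in pages}  # start from the all-ones vector
    for _ in range(iterations):
        new = {}
        for p in pages:
            # On the reversed graph, p collects score from the pages it links to.
            flow = sum(score[q] / in_degree[q] for q in out_links[p])
            new[p] = alpha * flow + (1 - alpha) / n
        score = new
    return score


# Hypothetical four-page graph: 1 -> 2, 2 -> 3, 2 -> 4.
graph = {1: [2], 2: [3, 4], 3: [], 4: []}
desirability = inverse_pagerank(graph)
ranking = sorted(desirability, key=lambda p: (-desirability[p], p))
```

In this toy graph page 1 ranks first because every other page is reachable from it, which is the property a good seed should have.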
Trust Rank(3/7)
Step 2: Generate good seeds
Trust Rank (4/7)
Step 3: Select good seeds
e.g., with L = 3, the seed set is {2, 4, 5}
Trust Rank(5/7)
Step 4: Normalize the static score distribution vector
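Steps 2 to 4 can be sketched as follows, assuming the desirability scores from Step 1 are available as a dict and that a human or automated oracle is modelled by a function `is_good`. The scores below are made up so that the example reproduces the seed set {2, 4, 5} for L = 3; the function names are illustrative.

```python
def select_good_seeds(desirability, limit, is_good):
    """Rank pages by desirability, inspect the top `limit`, keep the good ones."""
    ranked = sorted(desirability, key=lambda p: (-desirability[p], p))
    return [p for p in ranked[:limit] if is_good(p)]


def static_distribution(pages, good_seeds):
    """Give each good seed 1 and every other page 0, then normalise to sum to 1."""
    seeds = set(good_seeds)
    if not seeds:
        raise ValueError("at least one good seed is required")
    return {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}


# Made-up desirability scores for six pages (assumption for illustration only).
desirability = {1: 0.10, 2: 0.50, 3: 0.20, 4: 0.40, 5: 0.30, 6: 0.05}
seeds = select_good_seeds(desirability, limit=3, is_good=lambda page: True)
d = static_distribution(desirability, seeds)
```

Here `seeds` is `[2, 4, 5]` and each seed receives a static score of 1/3, while the remaining pages receive 0.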
Trust Rank(6/7)
Step 5: Compute the TrustRank score
[Figure: iterative computation of the TrustRank vector t*; surviving symbols: T, d, t*, M]
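A minimal sketch of Step 5, assuming the usual biased Page Rank update in which each page splits its trust evenly among its outlinks and a fraction of trust is re-injected from the static distribution d on every iteration. The function name `trustrank`, the decay factor `alpha`, and the iteration count are illustrative assumptions.

```python
def trustrank(out_links, d, alpha=0.85, iterations=20):
    """Propagate trust from the seed distribution d along outgoing links."""
    pages = sorted(out_links)
    t = dict(d)  # start from the static distribution
    for _ in range(iterations):
        # Every page first receives its share of the static score.
        new = {p: (1 - alpha) * d[p] for p in pages}
        for p in pages:
            targets = out_links[p]
            for q in targets:
                # p passes an equal share of its damped trust to each target.
                new[q] += alpha * t[p] / len(targets)
        t = new
    return t


# Hypothetical graph: 1 -> 2 -> 3 and 4 -> 3, with page 1 as the only good seed.
graph = {1: [2], 2: [3], 3: [], 4: [3]}
d = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}
trust = trustrank(graph, d)
```

Trust decays with distance from the seed, and page 4, which no trusted page links to, ends with a score of 0 even though it links to a trusted page.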
Trust Rank(7/7)
Conclusion
Hilltop (1/9)
Expert page
A page that is about a certain topic and has links to many non-affiliated pages on that topic.
Non-affiliated
Two pages are conceptually non-affiliated if they are authored by authors from non-affiliated organizations.
Hilltop (2/9)
Step 1: Expert Lookup
Detecting Host Affiliation
Selecting the Experts
Indexing the Experts
[Figure: an expert page linking to non-affiliated pages, with its key phrases indexed]
Hilltop (3/9)
Detecting Host Affiliation
Rules: two hosts are affiliated if one or both of the following are true
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same.
e.g., "www.ibm.com" and "ibm.co.mx"
The affiliation relation is transitive
If A and B are affiliated and B and C are affiliated, then A and C are taken to be affiliated.
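The two rules plus transitivity map naturally onto a union-find structure. The sketch below assumes each host's IP address is already known and uses a small, deliberately incomplete set of generic hostname tokens; `GENERIC_TOKENS`, the function names, and the IP addresses (taken from documentation-only ranges) are all illustrative assumptions.

```python
# Illustrative and incomplete list of generic hostname tokens (assumption).
GENERIC_TOKENS = {"com", "co", "org", "net", "edu", "gov", "mx", "uk"}


def rightmost_non_generic_token(hostname):
    """Return the last hostname token that is not a generic suffix."""
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return hostname.lower()


def ip_prefix(ip):
    """Return the first three octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])


def affiliation_groups(host_ips):
    """Map each host to a representative of its transitive affiliation group."""
    parent = {h: h for h in host_ips}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving
            h = parent[h]
        return h

    first_seen = {}
    for host, ip in host_ips.items():
        keys = (("ip", ip_prefix(ip)), ("token", rightmost_non_generic_token(host)))
        for key in keys:
            if key in first_seen:
                # Same IP prefix or same token: merge the two groups.
                parent[find(host)] = find(first_seen[key])
            else:
                first_seen[key] = host
    return {h: find(h) for h in host_ips}


hosts = {
    "www.ibm.com": "192.0.2.10",
    "ibm.co.mx": "198.51.100.7",
    "example.org": "198.51.100.99",
    "other.net": "203.0.113.1",
}
groups = affiliation_groups(hosts)
```

In this example "www.ibm.com" and "ibm.co.mx" share the token "ibm", and "ibm.co.mx" and "example.org" share an IP prefix, so all three land in one group by transitivity, while "other.net" stays on its own.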
Hilltop (4/9)
Selecting the Experts
Consider all pages with out-degree greater than a threshold k (e.g., k = 5) and test whether their URLs point to k distinct non-affiliated hosts. Every such page is considered an expert page.
[Figure: an expert page linking to non-affiliated pages]
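The expert test can be sketched as below, assuming affiliation has already been resolved into a dict `host_group` that maps a hostname to its affiliation group (hosts missing from the dict are treated as their own group). The function name `is_expert` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_expert(out_urls, host_group, k=5):
    """True if the page has more than k outlinks reaching at least k distinct groups."""
    if len(out_urls) <= k:
        return False
    groups = set()
    for url in out_urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip relative or malformed URLs
        groups.add(host_group.get(host, host))
    return len(groups) >= k


# Small threshold chosen only to keep the example short.
urls = ["http://a.example/x", "http://b.example/y", "http://a.example/z"]
independent = is_expert(urls, host_group={}, k=2)
affiliated = is_expert(urls, host_group={"a.example": "g", "b.example": "g"}, k=2)
```

With no known affiliations the page reaches two distinct hosts and qualifies; once both hosts are placed in the same group it no longer does.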
Hilltop (5/9)
Indexing the Experts
Index the text contained within the "key phrases" of the expert. The following are considered key phrases.
title
headings (e.g., <H1> </H1> tags)
anchor text
A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text.
[Figure: N = 200 expert pages pointing to a target page; at least 2 experts must point to the target]
Hilltop (8/9)
Computing the Expert Score
The expert score reflects the number and importance of the key phrases that contain the query keywords.
Computing the Expert Score (1/2)
S_0: total score of the key phrases containing all k query keywords
S_1: total score of the key phrases containing k-1 query keywords
S_2: total score of the key phrases containing k-2 query keywords
S_i = SUM (key phrases p with k-i query terms) LevelScore(p) * FullnessFactor(p,q)
LevelScore: 16 for title, 6 for heading, 1 for anchor text
m is the number of terms in p that are not in q; plen is the number of terms in p
If m <= 2, FullnessFactor(p,q) = 1
If m > 2, FullnessFactor(p,q) = 1 - (m-2) / plen
Example
Query: A B
Title: A B C
H1: A
S_0 = 16*1
S_1 = 16*1 + 6*1 + 16*1
S_2 = 0
Computing the Expert Score (2/2)
S_0: total score of the key phrases containing all k query keywords
S_1: total score of the key phrases containing k-1 query keywords
S_2: total score of the key phrases containing k-2 query keywords
Expert_Score = (2^32 * S_0) + (2^16 * S_1) + S_2
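A minimal sketch of the expert score, assuming each key phrase is given as a (level, terms) pair. It takes one particular reading of the definitions: a key phrase counts only toward the level S_i that matches the exact number of query terms it contains, and phrases with no query term are ignored. Under a reading in which a phrase also counts toward lower levels, S_1 would come out larger than it does here. `LEVEL_SCORE` and the function names are illustrative.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """Penalise key phrases with more than two terms outside the query."""
    plen = len(phrase_terms)
    m = sum(1 for term in phrase_terms if term not in query_terms)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / plen


def expert_score(key_phrases, query):
    """Return ([S_0, S_1, S_2], Expert_Score) for one expert and one query."""
    q = set(query)
    k = len(q)
    s = [0.0, 0.0, 0.0]
    for level, terms in key_phrases:
        matched = len(q & set(terms))
        i = k - matched  # the phrase is missing i query terms
        if matched > 0 and i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(terms, q)
    return s, 2**32 * s[0] + 2**16 * s[1] + s[2]


phrases = [("title", ["A", "B", "C"]), ("heading", ["A"])]
levels, score = expert_score(phrases, ["A", "B"])
```

The large multipliers make S_0 dominate S_1, and S_1 dominate S_2, so an expert matching all query keywords in a key phrase always outranks one that matches fewer.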
Hilltop(9/9)
Computing the Target Score
The target score reflects both the number and relevance of the experts pointing to the target and the relevance of the phrases qualifying the links.
Computing the Target Score (1/2)
occ(w,T) is the number of distinct key phrases in E that contain w and qualify the edge (E,T)
If occ(w,T) is 0 for any query keyword, then Edge_Score(E,T) = 0
Otherwise, Edge_Score(E,T) = Expert_Score(E) * SUM (query keywords w) occ(w,T)
[Figure: expert E connected by an edge to target T]
Computing the Target Score (2/2)
Target_Score = SUM (non-affiliated E) Edge_Score(E,T)
[Figure: experts E_1, E_2, and E_3 pointing to target T]
Example: E_2 and E_3 are affiliated and ES(E_2,T) > ES(E_3,T), so only the edge from E_2 is counted.
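The edge and target scores can be sketched as follows, assuming the occurrence counts occ(w,T) have already been gathered per edge and that affiliated experts are identified by a shared group label. Keeping only the highest-scoring edge within each affiliation group is an assumption inferred from the E_2 and E_3 example; the function names and numbers are illustrative.

```python
def edge_score(expert_score, occ):
    """occ maps each query keyword w to occ(w, T) for one expert-to-target edge."""
    if any(count == 0 for count in occ.values()):
        return 0.0  # every query keyword must qualify the edge
    return expert_score * sum(occ.values())


def target_score(edges, expert_group):
    """Sum edge scores, keeping only the best edge per affiliation group."""
    best = {}
    for expert, score in edges:
        group = expert_group.get(expert, expert)
        best[group] = max(best.get(group, 0.0), score)
    return sum(best.values())


# Made-up edge scores; E2 and E3 belong to the same affiliation group.
edges = [("E1", 10.0), ("E2", 8.0), ("E3", 5.0)]
total = target_score(edges, expert_group={"E2": "g", "E3": "g"})
```

Here the weaker edge from E3 is dropped, so the target score is 10 + 8 = 18.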
Evaluation-Trust Rank(1/3)
Evaluation-Trust Rank(2/3)
Pairwise Orderness
Evaluation-Trust Rank(3/3)
Precision
Recall
Evaluation-Hilltop(1/2)
Precision
Evaluation-Hilltop(2/2)
Recall
Reference
Combating Web Spam with TrustRank
http://www.vldb.org/conf/2004/RS15P3.PDF
Hilltop: A Search Engine based on Expert Documents