Link Analysis for Web Information Retrieval
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Link Analysis for Web Information Retrieval

on

  • 4,152 views

Talk from February 2008 @ FADOC, Complutense University, Madrid

Talk from February 2008 @ FADOC, Complutense University, Madrid

Statistics

Views

Total Views
4,152
Views on SlideShare
4,143
Embed Views
9

Actions

Likes
4
Downloads
123
Comments
0

2 Embeds 9

http://www.slideee.com 8
http://localhost 1

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Link Analysis for Web Information Retrieval Presentation Transcript

  • 1. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Link Analysis for Web Information Retrieval Levels of link analysis With Applications to Adversarial IR Ranking Web spam Carlos Castillo1 ... detection ... links chato@yahoo-inc.com ... contents With: R. Baeza-Yates1,3 , L. Becchetti2 , P. Boldi5 , ... both D. Donato1 , A. Gionis1 , S. Leonardi2 , V.Murdock1 , Summary M. Santini5 , F. Silvestri4 , S. Vigna5 1. Yahoo! Research Barcelona – Catalunya, Spain 2. Universit` di Roma “La Sapienza” – Rome, Italy a 3. Yahoo! Research Santiago – Chile 4. ISTI-CNR –Pisa,Italy 5. Universit` degli Studi di Milano – Milan, Italy a
  • 2. Link Analysis for When you have a hammer Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 3. Link Analysis for Everything looks like a graph! Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 4. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 5. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are not placed at random Ranking Web spam ... detection ... links Topical locality hypothesis ... contents Link endorsement hypothesis ... both Summary
  • 6. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are not placed at random Ranking Web spam ... detection ... links Topical locality hypothesis ... contents Link endorsement hypothesis ... both Summary
  • 7. Link Analysis for Topical locality hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam “We found that pages are significantly more likely to ... detection be related topically to pages to which they are ... links linked, as opposed to other pages selected at ... contents random or other nearby pages.” [Davison, 2000] ... both Summary
  • 8. Link Analysis for Web Information Retrieval 0.7 C. Castillo Average text similarity Hypothesis 0.6 Levels of link analysis Ranking 0.5 Web spam ... detection 0.4 ... links ... contents 0.3 ... both Summary 0.2 1 2 3 4 5 Link distance [Baeza-Yates et al., 2006], data from UK 2006
  • 9. Link Analysis for Link similarity cases Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Link (geodesic) distance ... links Co-citation ... contents Bibliographic coupling ... both Summary
  • 10. Link Analysis for Co-citation Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 11. Link Analysis for Bibliographic coupling Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 12. Link Analysis for (Both can be generalized) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection (Both co-citation and bibliographic coupling can be ... links generalized. E.g.: SimRank [Jeh and Widom, 2002]: ... contents generalizes the idea of co-citation to several levels) ... both Summary
  • 13. Link Analysis for Link endorsement hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are assumed to be endorsements (votes, positive Ranking opinions) [Li, 1998] Web spam ... detection But they can represent: ... links Disagreement ... contents Self citations ... both Summary Nepotism Citations to methodological documents etc.
  • 14. Link Analysis for Link endorsement hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are assumed to be endorsements (votes, positive Ranking opinions) [Li, 1998] Web spam ... detection But they can represent: ... links Disagreement ... contents Self citations ... both Summary Nepotism Citations to methodological documents etc.
  • 15. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  • 16. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  • 17. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  • 18. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  • 19. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  • 20. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  • 21. Link Analysis for Nevertheless Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Both the topical locality hypothesis and the link endorsement Web spam hypothesis are meaningful on the Web ... detection ... links Analogy with Economy ... contents ... both Think on the hypothesis requiring many buyers/sellers, zero Summary transaction costs, perfect information, etc. in economic sciences
  • 22. Link Analysis for Nevertheless Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Both the topical locality hypothesis and the link endorsement Web spam hypothesis are meaningful on the Web ... detection ... links Analogy with Economy ... contents ... both Think on the hypothesis requiring many buyers/sellers, zero Summary transaction costs, perfect information, etc. in economic sciences
  • 23. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 24. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 25. Link Analysis for How to find meaningful patterns? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Several levels of analysis: ... detection Macroscopic view: overall structure ... links ... contents Microscopic view: nodes ... both Mesoscopic view: regions Summary
  • 26. Link Analysis for How to find meaningful patterns? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Several levels of analysis: ... detection Macroscopic view: overall structure ... links ... contents Microscopic view: nodes ... both Mesoscopic view: regions Summary
  • 27. Link Analysis for How to find meaningful patterns? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Several levels of analysis: ... detection Macroscopic view: overall structure ... links ... contents Microscopic view: nodes ... both Mesoscopic view: regions Summary
  • 28. Link Analysis for Macroscopic view, e.g. Bow-tie Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Broder et al., 2000]
  • 29. Link Analysis for Macroscopic view, e.g. Jellyfish Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
  • 30. Link Analysis for Macroscopic view, e.g. Jellyfish Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 31. Link Analysis for Microscopic view, e.g. Degree Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Barab´si, 2002] and others a
  • 32. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary “While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barab´si, 2002] a
  • 33. Link Analysis for Other scale-free networks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Power grid designs ... detection Sexual partners in humans ... links Collaboration of movie actors in films ... contents ... both Citations in scientific publications Summary Protein interactions
  • 34. Link Analysis for Microscopic view, e.g. Degree Web Information Retrieval C. Castillo Greece Chile Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents Spain Korea ... both Summary [Baeza-Yates et al., 2007] - compares this distribution in 8 countries . . . guess what is the result?
  • 35. Link Analysis for Mesoscopic view, e.g. Hop-plot Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 36. Link Analysis for Mesoscopic view, e.g. Hop-plot Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 37. Link Analysis for Mesoscopic view, e.g. Hop-plot Web Information Retrieval C. Castillo .it (40M pages) .uk (18M pages) Hypothesis 0.3 0.3 Levels of link analysis 0.2 0.2 Ranking Frequency Frequency Web spam 0.1 0.1 ... detection ... links 0.0 0.0 5 10 15 20 25 30 5 10 15 20 25 30 ... contents Distance Distance .eu.int (800K pages) Synthetic graph (100K pages) ... both Summary 0.3 0.3 0.2 0.2 Frequency Frequency 0.1 0.1 0.0 0.0 5 10 15 20 25 30 5 10 15 20 25 30 Distance Distance [Baeza-Yates et al., 2006]
  • 38. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 39. Link Analysis for Models Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Preferential attachment ... links Copy model ... contents Hybrid models ... both Summary
  • 40. Link Analysis for Models Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Preferential attachment ... links Copy model ... contents Hybrid models ... both Summary
  • 41. Link Analysis for Models Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Preferential attachment ... links Copy model ... contents Hybrid models ... both Summary
  • 42. Link Analysis for Preferential attachment Web Information Retrieval C. Castillo Hypothesis Levels of link analysis “A common property of many large networks is that Ranking the vertex connectivities follow a scale-free Web spam power-law distribution. This feature was found to be ... detection a consequence of two generic mechanisms: (i) ... links networks expand continuously by the addition of ... contents new vertices, and (ii) new vertices attach ... both preferentially to sites that are already well Summary connected.” [Barab´si and Albert, 1999] a “rich get richer”
  • 43. Link Analysis for Preferential attachment Web Information Retrieval C. Castillo Hypothesis Levels of link analysis “A common property of many large networks is that Ranking the vertex connectivities follow a scale-free Web spam power-law distribution. This feature was found to be ... detection a consequence of two generic mechanisms: (i) ... links networks expand continuously by the addition of ... contents new vertices, and (ii) new vertices attach ... both preferentially to sites that are already well Summary connected.” [Barab´si and Albert, 1999] a “rich get richer”
  • 44. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 45. Link Analysis for Counting in-links does not work Web Information Retrieval C. Castillo Hypothesis Levels of link analysis “With a simple program, huge numbers of pages can Ranking Web spam be created easily, artificially inflating citation counts. ... detection Because the Web environment contains profit ... links seeking ventures, attention getting strategies evolve ... contents in response to search engine algorithms. For this ... both reason, any evaluation strategy which counts Summary replicable features of web pages is prone to manipulation” [Page et al., 1998]
  • 46. Link Analysis for PageRank: simplified version Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam PageRank ′ (v ) PageRank ′ (u) = ... detection |Γ+ (v )| ... links v ∈Γ− (u) ... contents ... both Γ− (·): in-links Summary Γ+ (·): out-links
  • 47. Link Analysis for Iterations with pseudo-PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 48. Link Analysis for Iterations with pseudo-PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 49. Link Analysis for So far, so good, but ... Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam The Web includes many pages with no out-links, these ... detection will accumulate all of the score ... links ... contents We would like Web pages to accumulate ranking ... both We add random jumps (teleportation) Summary
  • 50. Link Analysis for PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ǫ PageRank(v ) PageRank(u) = + (1 − ǫ) ... detection |Γ+ (v )| N v ∈Γ− (u) ... links ... contents ... both Γ− (·): in-links Summary Γ+ (·): out-links ǫ/N: jump to a random page with probability ǫ ≈ 0.15
  • 51. Link Analysis for HITS Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Two scores per page: “hub score” and “authority score”.
  • 52. Link Analysis for HITS Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Two scores per page: “hub score” and “authority score”.
  • 53. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 54. Link Analysis for Iterations Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Initialize: Web spam hub(u, 0) = auth(u, 0) = 0 ... detection ... links ... contents Iterate: ... both auth(v ,t−1) hub(u, t) = v ∈Γ+ (u) |Γ− (v )| Summary hub(v ,t−1) auth(u, t) = |Γ+ (v )| v ∈Γ− (u)
  • 55. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 56. Link Analysis for What is on the Web? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 57. Link Analysis for What is on the Web [2.0]? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 58. Link Analysis for What else is on the Web? Web Information Retrieval C. Castillo “The sum of all human knowledge plus porn” – Robert Gilbert Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 59. Link Analysis for What’s happening on the Web? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam There is a fierce competition ... detection ... links ... contents for your attention ... both Summary
  • 60. Link Analysis for What’s happening on the Web? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Search engines are to some extent ... detection ... links arbiters of this competition ... contents ... both and they must watch it closely, otherwise ... Summary
  • 61. Link Analysis for Some cheating occurs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary 1986 FIFA World Cup, Argentina vs England
  • 62. Link Analysis for Simple web spam Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 63. Link Analysis for Hidden text Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 64. Link Analysis for Made for advertising Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 65. Link Analysis for Search engine? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 66. Link Analysis for Fake search engine Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 67. Link Analysis for “Normal” content in link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 68. Link Analysis for “Normal” content in link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 69. Link Analysis for Cloaking Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 70. Link Analysis for Redirection Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 71. Link Analysis for Redirects using Javascript Web Information Retrieval C. Castillo Hypothesis Simple redirect Levels of link analysis <script> Ranking document.location=quot;http://www.topsearch10.com/quot;; Web spam </script> ... detection ... links “Hidden” redirect ... contents ... both <script> Summary var1=24; var2=var1; if(var1==var2) { document.location=quot;http://www.topsearch10.com/quot;; } </script>
  • 72. Link Analysis for Problem: obfuscated code Web Information Retrieval C. Castillo Hypothesis Levels of link Obfuscated redirect analysis Ranking <script> Web spam var a1=quot;winquot;,a2=quot;dowquot;,a3=quot;locaquot;,a4=quot;tion.quot;, ... detection a5=quot;replacequot;,a6=quot;(’http://www.top10search.com/’)quot;; ... links var i,str=quot;quot;; ... contents for(i=1;i<=6;i++) ... both { Summary str += eval(quot;aquot;+i); } eval(str); </script>
  • 73. Link Analysis for Problem: really obfuscated code Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Encoded javascript Web spam <script> ... detection var s = quot;%5CBE0D%5C%05GDHJ BDE%16...%04%0Equot;; ... links var e = ’’, i; ... contents eval(unescape(’s%eDunescape%28s%29%3Bfor...%3B’)); ... both Summary </script> More examples: [Chellapilla and Maykov, 2007]
  • 74. Link Analysis for There are many attempts of cheating on the Web Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Most of these are spam: ... detection 1,630,000 results for “free mp3 hilton viagra” in SE1 ... links ... contents 1,760,000 results for “credit vicodin loan” in SE2 ... both 1,320,000 results for “porn mortgage” in SE3 Summary
  • 75. Link Analysis for Costs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Costs: Web spam X Costs for users: lower precision for some queries ... detection ... links X Costs for search engines: wasted storage space, ... contents network resources, and processing cycles ... both X Costs for the publishers: resources invested in cheating Summary and not in improving their contents
  • 76. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 77. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 78. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 79. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 80. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 81. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 82. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 83. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 84. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 85. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 86. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 87. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  • 88. Link Analysis for Opportunities for Web spam Web Information Retrieval C. Castillo Hypothesis Levels of link analysis X Spamdexing Ranking Keyword stuffing Web spam Link farms ... detection Spam blogs (splogs) ... links Cloaking ... contents ... both Adversarial relationship Summary Every undeserved gain in ranking for a spammer, is a loss of precision for the search engine.
  • 89. Link Analysis for Opportunities for Web spam Web Information Retrieval C. Castillo Hypothesis Levels of link analysis X Spamdexing Ranking Keyword stuffing Web spam Link farms ... detection Spam blogs (splogs) ... links Cloaking ... contents ... both Adversarial relationship Summary Every undeserved gain in ranking for a spammer, is a loss of precision for the search engine.
  • 90. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 91. Link Analysis for Motivation Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking [Fetterly et al., 2004] hypothesized that studying the Web spam distribution of statistics about pages could be a good way of ... detection detecting spam pages: ... links ... contents “in a number of these distributions, outlier values are ... both associated with web spam” Summary
  • 92. Link Analysis for Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 93. Link Analysis for Training of a Decision Tree Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 94. Link Analysis for Decision Tree (error = 15%) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 95. Link Analysis for Decision Tree (error = 15% → 12%) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 96. Link Analysis for Machine Learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 97. Link Analysis for Feature Extraction Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 98. Link Analysis for Challenges: Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Machine Learning Challenges: ... detection Instances are not really independent (graph) ... links ... contents Learning with few examples ... both Scalability Summary
  • 99. Link Analysis for Challenges: Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Machine Learning Challenges: ... detection Instances are not really independent (graph) ... links ... contents Learning with few examples ... both Scalability Summary
  • 100. Link Analysis for Challenges: Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Machine Learning Challenges: ... detection Instances are not really independent (graph) ... links ... contents Learning with few examples ... both Scalability Summary
  • 101. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  • 102. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  • 103. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  • 104. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  • 105. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  • 106. Link Analysis for Challenges: Data Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Data is difficult to collect ... detection Data is expensive to label ... links ... contents Labels are sparse ... both Humans do not always agree Summary
  • 107. Link Analysis for Agreement Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 108. Link Analysis for Results Web Information Retrieval C. Castillo Labels Hypothesis Label Frequency Percentage Levels of link analysis Normal 4,046 61.75% Ranking Borderline 709 10.82% Web spam Spam 1,447 22.08% ... detection Can not classify 350 5.34% ... links ... contents Agreement ... both Category Kappa Interpretation Summary normal 0.62 Substantial agreement spam 0.63 Substantial agreement borderline 0.11 Slight agreement global 0.56 Moderate agreement Reference collection [Castillo et al., 2006]
  • 109. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 110. Link Analysis for Topological spam: link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
  • 111. Link Analysis for Topological spam: link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
  • 112. Link Analysis for Handling large graphs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam For large graphs, random access is not possible. ... detection ... links Large graphs do not fit in main memory ... contents ... both Streaming model of computation Summary
  • 113. Link Analysis for Handling large graphs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam For large graphs, random access is not possible. ... detection ... links Large graphs do not fit in main memory ... contents ... both Streaming model of computation Summary
  • 114. Link Analysis for Handling large graphs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam For large graphs, random access is not possible. ... detection ... links Large graphs do not fit in main memory ... contents ... both Streaming model of computation Summary
  • 115. Link Analysis for Semi-streaming model Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Memory size enough to hold some data per-node ... links Disk size enough to hold some data per-edge ... contents A small number of passes over the data ... both Summary
  • 116. Link Analysis for Restriction Web Information Retrieval C. Castillo Semi-streaming model: graph on disk Hypothesis Levels of link 1: for node : 1 . . . N do analysis INITIALIZE-MEM(node) 2: Ranking 3: end for Web spam 4: for distance : 1 . . . d do {Iteration step} ... detection for src : 1 . . . N do {Follow links in the graph} ... links 5: ... contents for all links from src to dest do 6: ... both COMPUTE(src,dest) 7: Summary end for 8: end for 9: NORMALIZE 10: 11: end for 12: POST-PROCESS 13: return Something
  • 117. Link Analysis for Restriction Web Information Retrieval C. Castillo Semi-streaming model: graph on disk Hypothesis Levels of link 1: for node : 1 . . . N do analysis INITIALIZE-MEM(node) 2: Ranking 3: end for Web spam 4: for distance : 1 . . . d do {Iteration step} ... detection for src : 1 . . . N do {Follow links in the graph} ... links 5: ... contents for all links from src to dest do 6: ... both COMPUTE(src,dest) 7: Summary end for 8: end for 9: NORMALIZE 10: 11: end for 12: POST-PROCESS 13: return Something
  • 118. Link Analysis for Restriction Web Information Retrieval C. Castillo Semi-streaming model: graph on disk Hypothesis Levels of link 1: for node : 1 . . . N do analysis INITIALIZE-MEM(node) 2: Ranking 3: end for Web spam 4: for distance : 1 . . . d do {Iteration step} ... detection for src : 1 . . . N do {Follow links in the graph} ... links 5: ... contents for all links from src to dest do 6: ... both COMPUTE(src,dest) 7: Summary end for 8: end for 9: NORMALIZE 10: 11: end for 12: POST-PROCESS 13: return Something
  • 119. Link Analysis for Link-Based Features Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Degree-related measures Web spam PageRank ... detection ... links TrustRank [Gy¨ngyi et al., 2004] o ... contents Truncated PageRank [Becchetti et al., 2006] ... both Estimation of supporters [Becchetti et al., 2006] Summary 140 features per host (2 pages per host)
  • 120. Link Analysis for Degree-Based Web Information Retrieval C. Castillo Hypothesis 0.12 Normal Spam Levels of link 0.10 analysis 0.08 Ranking 0.06 Web spam 0.04 ... detection 0.02 ... links ... contents 0.00 4 18 76 323 1380 5899 25212 107764 460609 1968753 0.14 ... both Normal Spam 0.12 Summary 0.10 0.08 0.06 0.04 0.02 0.00 0.0 0.0 0.0 0.1 0.6 4.9 40.0 327.9 2686.5 22009.9
  • 121. Link Analysis for TrustRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking TrustRank [Gy¨ngyi et al., 2004] o Web spam A node with high PageRank, but far away from a core set of ... detection “trusted nodes” is suspicious ... links ... contents Start from a set of trusted nodes, then do a random walk, ... both returning to the set of trusted nodes with probability 1 − α at Summary each step i Trusted nodes: data from http://www.dmoz.org/
  • 122. Link Analysis for TrustRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking TrustRank [Gy¨ngyi et al., 2004] o Web spam A node with high PageRank, but far away from a core set of ... detection “trusted nodes” is suspicious ... links ... contents Start from a set of trusted nodes, then do a random walk, ... both returning to the set of trusted nodes with probability 1 − α at Summary each step i Trusted nodes: data from http://www.dmoz.org/
  • 123. Link Analysis for TrustRank Idea Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 124. Link Analysis for TrustRank / PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking 1.00 Normal Spam Web spam 0.90 0.80 ... detection 0.70 ... links 0.60 0.50 ... contents 0.40 0.30 ... both 0.20 Summary 0.10 0.00 0.4 1 4 1e+01 4e+01 1e+02 3e+02 1e+03 3e+03 9e+03
  • 125. Link Analysis for High and low-ranked pages are different Web Information Retrieval C. Castillo 4 x 10 Hypothesis Top 0%−10% 12 Levels of link Top 40%−50% analysis Top 60%−70% 10 Ranking Number of Nodes Web spam 8 ... detection ... links 6 ... contents ... both 4 Summary 2 0 1 5 10 15 20 Distance Areas below the curves are equal if we are in the same strongly-connected component
  • 126. Link Analysis for High and low-ranked pages are different Web Information Retrieval C. Castillo 4 x 10 Hypothesis Top 0%−10% 12 Levels of link Top 40%−50% analysis Top 60%−70% 10 Ranking Number of Nodes Web spam 8 ... detection ... links 6 ... contents ... both 4 Summary 2 0 1 5 10 15 20 Distance Areas below the curves are equal if we are in the same strongly-connected component
  • 127. Link Analysis for Probabilistic counting Web Information Retrieval C. Castillo Hypothesis 1 1 0 0 Levels of link 0 0 analysis 0 0 0 1 1 1 1 1 Ranking 0 0 1 1 0 0 0 0 0 0 Web spam Propagation of 0 0 1 1 bits using the 1 ... detection 0 1 1 “OR” operation 1 0 1 0 ... links 1 Target 0 Count bits set ... contents 0 page 0 to estimate ... both 0 0 supporters 0 0 1 1 Summary 1 1 0 0 1 1 0 0 0 0 1 1 0 0 [Becchetti et al., 2006] shows an improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  • 128. Link Analysis for Probabilistic counting Web Information Retrieval C. Castillo Hypothesis 1 1 0 0 Levels of link 0 0 analysis 0 0 0 1 1 1 1 1 Ranking 0 0 1 1 0 0 0 0 0 0 Web spam Propagation of 0 0 1 1 bits using the 1 ... detection 0 1 1 “OR” operation 1 0 1 0 ... links 1 Target 0 Count bits set ... contents 0 page 0 to estimate ... both 0 0 supporters 0 0 1 1 Summary 1 1 0 0 1 1 0 0 0 0 1 1 0 0 [Becchetti et al., 2006] shows an improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  • 129. Link Analysis for Bottleneck number Web Information Retrieval C. Castillo Hypothesis bd (x) = minj≤d {|Nj (x)|/|Nj−1 (x)|}. Minimum rate of growth Levels of link analysis of the neighbors of x up to a certain distance. We expect that Ranking spam pages form clusters that are somehow isolated from the Web spam rest of the Web graph and they have smaller bottleneck ... detection numbers than non-spam pages. ... links 0.40 Normal ... contents Spam 0.35 ... both 0.30 Summary 0.25 0.20 0.15 0.10 0.05 0.00 1.11 1.30 1.52 1.78 2.07 2.42 2.83 3.31 3.87 4.52
  • 130. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 131. Link Analysis for Content-Based Features Web Information Retrieval C. Castillo Hypothesis Most of these reported in [Ntoulas et al., 2006]: Levels of link Number of word in the page and title analysis Ranking Average word length Web spam Fraction of anchor text ... detection Fraction of visible text ... links ... contents Compression rate ... both From [Castillo et al., 2007]: Summary Corpus precision and corpus recall Query precision and query recall Independent trigram likelihood Entropy of trigrams
  • 132. Link Analysis for Average word length Web Information Retrieval C. Castillo Hypothesis 0.12 Normal Levels of link Spam analysis 0.10 Ranking 0.08 Web spam ... detection 0.06 ... links ... contents 0.04 ... both 0.02 Summary 0.00 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
  • 133. Link Analysis for Corpus precision Web Information Retrieval C. Castillo Hypothesis 0.10 Normal Levels of link 0.09 Spam analysis 0.08 Ranking 0.07 Web spam 0.06 ... detection 0.05 ... links 0.04 ... contents 0.03 ... both 0.02 Summary 0.01 0.00 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Figure: Histogram of the corpus precision in non-spam vs. spam pages.
  • 134. Link Analysis for Query precision Web Information Retrieval C. Castillo Hypothesis 0.12 Normal Levels of link Spam analysis 0.10 Ranking 0.08 Web spam ... detection 0.06 ... links ... contents 0.04 ... both 0.02 Summary 0.00 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
  • 135. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 136. Link Analysis for General hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Pages topologically close to each other are more likely to have ... detection the same label (spam/nonspam) than random pairs of pages ... links ... contents Ideas for exploiting this: clustering, propagation, stacked ... both learning Summary
  • 137. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Castillo et al., 2007]
  • 138. Link Analysis for Topological dependencies: in-links Web Information Retrieval C. Castillo Hypothesis Histogram of fraction of spam hosts in the in-links Levels of link analysis 0 = no in-link comes from spam hosts Ranking 1 = all of the in-links come from spam hosts Web spam ... detection 0.4 ... links In-links of non spam In-links of spam 0.35 ... contents 0.3 ... both 0.25 Summary 0.2 0.15 0.1 0.05 0 0.0 0.2 0.4 0.6 0.8 1.0
  • 139. Link Analysis for Topological dependencies: out-links Web Information Retrieval C. Castillo Hypothesis Histogram of fraction of spam hosts in the out-links Levels of link analysis 0 = none of the out-links points to spam hosts Ranking 1 = all of the out-links point to spam hosts Web spam ... detection 1 ... links Out-links of non spam 0.9 Outlinks of spam ... contents 0.8 ... both 0.7 Summary 0.6 0.5 0.4 0.3 0.2 0.1 0 0.0 0.2 0.4 0.6 0.8 1.0
  • 140. Link Analysis for Idea 1: Clustering Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Classify, then cluster hosts, then assign the same label to all ... links hosts in the same cluster by majority voting ... contents ... both Summary
  • 141. Link Analysis for Idea 1: Clustering (cont.) Web Information Retrieval C. Castillo Hypothesis Initial prediction: Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 142. Link Analysis for Idea 1: Clustering (cont.) Web Information Retrieval C. Castillo Hypothesis Clustering: Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 143. Link Analysis for Idea 1: Clustering (cont.) Web Information Retrieval C. Castillo Hypothesis Final prediction: Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 144. Link Analysis for Idea 1: Clustering – Results Web Information Retrieval C. Castillo Hypothesis Levels of link Baseline Clustering analysis Without bagging Ranking Web spam True positive rate 75.6% 74.5% ... detection False positive rate 8.5% 6.8% ... links F-Measure 0.646 0.673 ... contents With bagging ... both True positive rate 78.7% 76.9% Summary False positive rate 5.7% 5.0% F-Measure 0.723 0.728 V Reduces error rate
  • 145. Link Analysis for Idea 2: Propagate the label Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Classify, then interpret “spamicity” as a probability, then do a ... links random walk with restart from those nodes ... contents ... both Summary
  • 146. Link Analysis for Idea 2: Propagate the label (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link Initial prediction: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 147. Link Analysis for Idea 2: Propagate the label (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link Propagation: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 148. Link Analysis for Idea 2: Propagate the label (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link Final prediction, applying a threshold: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 149. Link Analysis for Idea 2: Propagate the label – Results Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Baseline Fwds. Backwds. Both Ranking Classifier without bagging Web spam True positive rate 75.6% 70.9% 69.4% 71.4% ... detection False positive rate 8.5% 6.1% 5.8% 5.8% ... links F-Measure 0.646 0.665 0.664 0.676 ... contents ... both Classifier with bagging Summary True positive rate 78.7% 76.5% 75.0% 75.2% False positive rate 5.7% 5.4% 4.3% 4.7% F-Measure 0.723 0.716 0.733 0.724
  • 150. Link Analysis for Idea 3: Stacked graphical learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Meta-learning scheme [Cohen and Kou, 2006] ... detection Derive initial predictions ... links Generate an additional attribute for each object by ... contents combining predictions on neighbors in the graph ... both Summary Append additional attribute in the data and retrain
  • 151. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Let p(x) ∈ [0..1] be the prediction of a classification Ranking algorithm for a host x using k features Web spam ... detection Let N(x) be the set of pages related to x (in some way) ... links Compute ... contents g ∈N(x) p(g ) ... both f (x) = |N(x)| Summary Add f (x) as an extra feature for instance x and learn a new model with k + 1 features
  • 152. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Initial prediction: Web spam ... detection ... links ... contents ... both Summary
  • 153. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Computation of new feature: Ranking Web spam ... detection ... links ... contents ... both Summary
  • 154. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link New prediction with k + 1 features: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  • 155. Link Analysis for Idea 3: Stacked graphical learning - Results Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Avg. Avg. Avg. Web spam Baseline of in of out of both ... detection True positive rate 78.7% 84.4% 78.3% 85.2% ... links False positive rate 5.7% 6.7% 4.8% 6.1% ... contents F-Measure 0.723 0.733 0.742 ... both 0.750 Summary V Increases detection rate
  • 156. Link Analysis for Idea 3: Stacked graphical learning x2 Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking And repeat ... Web spam ... detection Baseline First pass Second pass ... links True positive rate 78.7% 85.2% 88.4% ... contents False positive rate 5.7% 6.1% 6.3% ... both F-Measure 0.723 0.750 0.763 Summary V Significant improvement over the baseline
  • 157. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  • 158. Link Analysis for Concluding remarks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Hypothesis: topical locality + link endorsement ... links Primitives: similarity, ranking, propagation, etc. ... contents Application to Web spam ... both Summary
  • 159. Link Analysis for Concluding remarks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Hypothesis: topical locality + link endorsement ... links Primitives: similarity, ranking, propagation, etc. ... contents Application to Web spam ... both Summary
  • 160. Link Analysis for Concluding remarks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Hypothesis: topical locality + link endorsement ... links Primitives: similarity, ranking, propagation, etc. ... contents Application to Web spam ... both Summary
  • 161. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Thank you! ... detection ... links ... contents ... both Summary
  • 162. Link Analysis for Web Information Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Retrieval Generalizing pagerank: Damping functions for link-based ranking C. Castillo algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. Hypothesis ACM Press. Levels of link Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). analysis Characterization of national web domains. Ranking ACM Transactions on Internet Technology, 7(2). Web spam Baeza-Yates, R. and Poblete, B. (2006). ... detection Dynamics of the chilean web structure. ... links Comput. Networks, 50(10):1464–1473. ... contents Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). ... both Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), Summary volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer. Barab´si, A.-L. (2002). a Linked: The New Science of Networks. Perseus Books Group. Barab´si, A. L. and Albert, R. (1999). a Emergence of scaling in random networks. Science, 286(5439):509–512.
  • 163. Link Analysis for Web Information Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. Retrieval (2006). Using rank propagation and probabilistic counting for link-based spam C. Castillo detection. Hypothesis In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. Levels of link analysis Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Ranking Stata, R., Tomkins, A., and Wiener, J. (2000). Web spam Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages ... detection 309–320, Amsterdam, Netherlands. ACM Press. ... links Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., ... contents and Vigna, S. (2006). ... both A reference collection for web spam. SIGIR Forum, 40(2):11–24. Summary Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM. Chellapilla, K. and Maykov, A. (2007). A taxonomy of javascript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
  • 164. Link Analysis for Web Information Retrieval Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in markov random C. Castillo fields using very short inhomogeneous markov chains. Technical report. Hypothesis Levels of link Davison, B. D. (2000). analysis Topical locality in the web. Ranking In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Web spam Greece. ACM Press. ... detection Fetterly, D., Manasse, M., and Najork, M. (2004). ... links Spam, damn spam, and statistics: Using statistical analysis to locate spam ... contents web pages. In Proceedings of the seventh workshop on the Web and databases ... both (WebDB), pages 1–6, Paris, France. Summary Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209. Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
  • 165. Link Analysis for Web Information Retrieval Gy¨ngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). o C. Castillo Combating web spam with trustrank. In Proceedings of the 30th International Conference on Very Large Data Hypothesis Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann. Levels of link analysis Jeh, G. and Widom, J. (2002). Simrank: a measure of structural-context similarity. Ranking In KDD ’02: Proceedings of the eighth ACM SIGKDD international Web spam conference on Knowledge discovery and data mining, pages 538–543, New ... detection York, NY, USA. ACM Press. ... links Li, Y. (1998). ... contents Toward a qualitative search engine. IEEE Internet Computing. ... both Summary Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland. Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.
  • 166. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). Ranking ANF: a fast and scalable tool for data mining in massive graphs. Web spam In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ... detection ACM Press. ... links Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). ... contents A simple conceptual model for the internet topology. ... both In Global Internet, San Antonio, Texas, USA. IEEE CS Press. Summary