Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis
                    Link Analysis for Web Info...
Link Analysis for
                    When you have a hammer
Web Information
    Retrieval

   C. Castillo

Hypothesis

Le...
Link Analysis for
                    Everything looks like a graph!
Web Information
    Retrieval

   C. Castillo

Hypoth...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis
                     ...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis
                     ...
Link Analysis for
                    Topical locality hypothesis
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
Web Information
    Retrieval

                                                    0.7
   C. Castillo


...
Link Analysis for
                    Link similarity cases
Web Information
    Retrieval

   C. Castillo

Hypothesis

Lev...
Link Analysis for
                    Co-citation
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Bibliographic coupling
Web Information
    Retrieval

   C. Castillo

Hypothesis

Le...
Link Analysis for
                    (Both can be generalized)
Web Information
    Retrieval

   C. Castillo

Hypothesis
...
Link Analysis for
                    Link endorsement hypothesis
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
                    Link endorsement hypothesis
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
                    Furthermore
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Furthermore
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Furthermore
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Furthermore
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Furthermore
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Furthermore
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Nevertheless
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of li...
Link Analysis for
                    Nevertheless
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of li...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis

Ranking

Web spam

....
Link Analysis for
                    How to find meaningful patterns?
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    How to find meaningful patterns?
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    How to find meaningful patterns?
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    Macroscopic view, e.g. Bow-tie
Web Information
    Retrieval

   C. Castillo

Hypoth...
Link Analysis for
                    Macroscopic view, e.g. Jellyfish
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    Macroscopic view, e.g. Jellyfish
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    Microscopic view, e.g. Degree
Web Information
    Retrieval

   C. Castillo

Hypothe...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis

Ranking

Web spam

....
Link Analysis for
                    Other scale-free networks
Web Information
    Retrieval

   C. Castillo

Hypothesis
...
Link Analysis for
                    Microscopic view, e.g. Degree
Web Information
    Retrieval

   C. Castillo
        ...
Link Analysis for
                    Mesoscopic view, e.g. Hop-plot
Web Information
    Retrieval

   C. Castillo

Hypoth...
Link Analysis for
                    Mesoscopic view, e.g. Hop-plot
Web Information
    Retrieval

   C. Castillo

Hypoth...
Link Analysis for
                    Mesoscopic view, e.g. Hop-plot
Web Information
    Retrieval

   C. Castillo

      ...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis

Ranking

Web spam

....
Link Analysis for
                    Models
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
ana...
Link Analysis for
                    Models
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
ana...
Link Analysis for
                    Models
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
ana...
Link Analysis for
                    Preferential attachment
Web Information
    Retrieval

   C. Castillo

Hypothesis

L...
Link Analysis for
                    Preferential attachment
Web Information
    Retrieval

   C. Castillo

Hypothesis

L...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    Counting in-links does not work
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    PageRank: simplified version
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
                    Iterations with pseudo-PageRank
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    Iterations with pseudo-PageRank
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    So far, so good, but ...
Web Information
    Retrieval

   C. Castillo

Hypothesis

...
Link Analysis for
                    PageRank
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
a...
Link Analysis for
                    HITS
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analy...
Link Analysis for
                    HITS
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analy...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis

Ranking

Web spam

....
Link Analysis for
                    Iterations
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    What is on the Web?
Web Information
    Retrieval

   C. Castillo

Hypothesis

Level...
Link Analysis for
                    What is on the Web [2.0]?
Web Information
    Retrieval

   C. Castillo

Hypothesis
...
Link Analysis for
                    What else is on the Web?
Web Information
    Retrieval

   C. Castillo
             ...
Link Analysis for
                    What’s happening on the Web?
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    What’s happening on the Web?
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Some cheating occurs
Web Information
    Retrieval

   C. Castillo

Hypothesis

Leve...
Link Analysis for
                    Simple web spam
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of...
Link Analysis for
                    Hidden text
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Made for advertising
Web Information
    Retrieval

   C. Castillo

Hypothesis

Leve...
Link Analysis for
                    Search engine?
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of ...
Link Analysis for
                    Fake search engine
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
                    “Normal” content in link farms
Web Information
    Retrieval

   C. Castillo

Hypoth...
Link Analysis for
                    “Normal” content in link farms
Web Information
    Retrieval

   C. Castillo

Hypoth...
Link Analysis for
                    Cloaking
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
a...
Link Analysis for
                    Redirection
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of lin...
Link Analysis for
                    Redirects using Javascript
Web Information
    Retrieval

   C. Castillo

Hypothesis...
Link Analysis for
                    Problem: obfuscated code
Web Information
    Retrieval

   C. Castillo

Hypothesis

...
Link Analysis for
                    Problem: really obfuscated code
Web Information
    Retrieval

   C. Castillo

Hypot...
Link Analysis for
                    There are many attempts of cheating on the Web
Web Information
    Retrieval

   C. ...
Link Analysis for
                    Costs
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
anal...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Adversarial IR Issues on the Web
Web Information
    Retrieval

   C. Castillo

Hypo...
Link Analysis for
                    Opportunities for Web spam
Web Information
    Retrieval

   C. Castillo

Hypothesis...
Link Analysis for
                    Opportunities for Web spam
Web Information
    Retrieval

   C. Castillo

Hypothesis...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    Motivation
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link...
Link Analysis for
                    Machine Learning
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels o...
Link Analysis for
                    Training of a Decision Tree
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
                    Decision Tree (error = 15%)
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
                    Decision Tree (error = 15% → 12%)
Web Information
    Retrieval

   C. Castillo

Hyp...
Link Analysis for
                    Machine Learning (cont.)
Web Information
    Retrieval

   C. Castillo

Hypothesis

...
Link Analysis for
                    Feature Extraction
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
                    Challenges: Machine Learning
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Challenges: Machine Learning
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Challenges: Machine Learning
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Challenges: Information Retrieval
Web Information
    Retrieval

   C. Castillo

Hyp...
Link Analysis for
                    Challenges: Information Retrieval
Web Information
    Retrieval

   C. Castillo

Hyp...
Link Analysis for
                    Challenges: Information Retrieval
Web Information
    Retrieval

   C. Castillo

Hyp...
Link Analysis for
                    Challenges: Information Retrieval
Web Information
    Retrieval

   C. Castillo

Hyp...
Link Analysis for
                    Challenges: Information Retrieval
Web Information
    Retrieval

   C. Castillo

Hyp...
Link Analysis for
                    Challenges: Data
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels o...
Link Analysis for
                    Agreement
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
...
Link Analysis for
                    Results
Web Information
    Retrieval

   C. Castillo


                            ...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    Topological spam: link farms
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Topological spam: link farms
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Handling large graphs
Web Information
    Retrieval

   C. Castillo

Hypothesis

Lev...
Link Analysis for
                    Handling large graphs
Web Information
    Retrieval

   C. Castillo

Hypothesis

Lev...
Link Analysis for
                    Handling large graphs
Web Information
    Retrieval

   C. Castillo

Hypothesis

Lev...
Link Analysis for
                    Semi-streaming model
Web Information
    Retrieval

   C. Castillo

Hypothesis

Leve...
Link Analysis for
                    Restriction
Web Information
    Retrieval

   C. Castillo


                    Semi...
Link Analysis for
                    Restriction
Web Information
    Retrieval

   C. Castillo


                    Semi...
Link Analysis for
                    Restriction
Web Information
    Retrieval

   C. Castillo


                    Semi...
Link Analysis for
                    Link-Based Features
Web Information
    Retrieval

   C. Castillo

Hypothesis

Level...
Link Analysis for
                    Degree-Based
Web Information
    Retrieval

   C. Castillo

Hypothesis              ...
Link Analysis for
                    TrustRank
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
...
Link Analysis for
                    TrustRank
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
...
Link Analysis for
                    TrustRank Idea
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of ...
Link Analysis for
                    TrustRank / PageRank
Web Information
    Retrieval

   C. Castillo

Hypothesis

Leve...
Link Analysis for
                    High and low-ranked pages are different
Web Information
    Retrieval

   C. Castillo...
Link Analysis for
                    High and low-ranked pages are different
Web Information
    Retrieval

   C. Castillo...
Link Analysis for
                    Probabilistic counting
Web Information
    Retrieval

   C. Castillo

Hypothesis
   ...
Link Analysis for
                    Probabilistic counting
Web Information
    Retrieval

   C. Castillo

Hypothesis
   ...
Link Analysis for
                    Bottleneck number
Web Information
    Retrieval

   C. Castillo

Hypothesis

       ...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    Content-Based Features
Web Information
    Retrieval

   C. Castillo

Hypothesis
   ...
Link Analysis for
                    Average word length
Web Information
    Retrieval

   C. Castillo

Hypothesis
      ...
Link Analysis for
                    Corpus precision
Web Information
    Retrieval

   C. Castillo

Hypothesis
         ...
Link Analysis for
                    Query precision
Web Information
    Retrieval

   C. Castillo

Hypothesis
          ...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    General hypothesis
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis

Ranking

Web spam

....
Link Analysis for
                    Topological dependencies: in-links
Web Information
    Retrieval

   C. Castillo

Hy...
Link Analysis for
                    Topological dependencies: out-links
Web Information
    Retrieval

   C. Castillo

H...
Link Analysis for
                    Idea 1: Clustering
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
                    Idea 1: Clustering (cont.)
Web Information
    Retrieval

   C. Castillo

Hypothesis...
Link Analysis for
                    Idea 1: Clustering (cont.)
Web Information
    Retrieval

   C. Castillo

Hypothesis...
Link Analysis for
                    Idea 1: Clustering (cont.)
Web Information
    Retrieval

   C. Castillo

Hypothesis...
Link Analysis for
                    Idea 1: Clustering – Results
Web Information
    Retrieval

   C. Castillo

Hypothes...
Link Analysis for
                    Idea 2: Propagate the label
Web Information
    Retrieval

   C. Castillo

Hypothesi...
Link Analysis for
                    Idea 2: Propagate the label (cont.)
Web Information
    Retrieval

   C. Castillo

H...
Link Analysis for
                    Idea 2: Propagate the label (cont.)
Web Information
    Retrieval

   C. Castillo

H...
Link Analysis for
                    Idea 2: Propagate the label (cont.)
Web Information
    Retrieval

   C. Castillo

H...
Link Analysis for
                    Idea 2: Propagate the label – Results
Web Information
    Retrieval

   C. Castillo
...
Link Analysis for
                    Idea 3: Stacked graphical learning
Web Information
    Retrieval

   C. Castillo

Hy...
Link Analysis for
                    Idea 3: Stacked graphical learning (cont.)
Web Information
    Retrieval

   C. Cast...
Link Analysis for
                    Idea 3: Stacked graphical learning (cont.)
Web Information
    Retrieval

   C. Cast...
Link Analysis for
                    Idea 3: Stacked graphical learning (cont.)
Web Information
    Retrieval

   C. Cast...
Link Analysis for
                    Idea 3: Stacked graphical learning (cont.)
Web Information
    Retrieval

   C. Cast...
Link Analysis for
                    Idea 3: Stacked graphical learning - Results
Web Information
    Retrieval

   C. Ca...
Link Analysis for
                    Idea 3: Stacked graphical learning x2
Web Information
    Retrieval

   C. Castillo
...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
                      Hypothes...
Link Analysis for
                    Concluding remarks
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
                    Concluding remarks
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
                    Concluding remarks
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels...
Link Analysis for
Web Information
    Retrieval

   C. Castillo

Hypothesis

Levels of link
analysis

Ranking

Web spam

 ...
Link Analysis for
Web Information
                    Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
    Retrieval
 ...
Link Analysis for
Web Information     Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R.
    Retri...
Link Analysis for Web Information Retrieval
Link Analysis for Web Information Retrieval
Link Analysis for Web Information Retrieval
Link Analysis for Web Information Retrieval
Upcoming SlideShare
Loading in...5
×

Link Analysis for Web Information Retrieval

2,746

Published on

Talk from February 2008 @ FADOC, Complutense University, Madrid

Published in: Technology, News & Politics
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,746
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
126
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Link Analysis for Web Information Retrieval

  1. 1. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Link Analysis for Web Information Retrieval Levels of link analysis With Applications to Adversarial IR Ranking Web spam Carlos Castillo1 ... detection ... links chato@yahoo-inc.com ... contents With: R. Baeza-Yates1,3 , L. Becchetti2 , P. Boldi5 , ... both D. Donato1 , A. Gionis1 , S. Leonardi2 , V.Murdock1 , Summary M. Santini5 , F. Silvestri4 , S. Vigna5 1. Yahoo! Research Barcelona – Catalunya, Spain 2. Universit` di Roma “La Sapienza” – Rome, Italy a 3. Yahoo! Research Santiago – Chile 4. ISTI-CNR –Pisa,Italy 5. Universit` degli Studi di Milano – Milan, Italy a
  2. 2. Link Analysis for When you have a hammer Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  3. 3. Link Analysis for Everything looks like a graph! Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  4. 4. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  5. 5. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are not placed at random Ranking Web spam ... detection ... links Topical locality hypothesis ... contents Link endorsement hypothesis ... both Summary
  6. 6. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are not placed at random Ranking Web spam ... detection ... links Topical locality hypothesis ... contents Link endorsement hypothesis ... both Summary
  7. 7. Link Analysis for Topical locality hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam “We found that pages are significantly more likely to ... detection be related topically to pages to which they are ... links linked, as opposed to other pages selected at ... contents random or other nearby pages.” [Davison, 2000] ... both Summary
  8. 8. Link Analysis for Web Information Retrieval 0.7 C. Castillo Average text similarity Hypothesis 0.6 Levels of link analysis Ranking 0.5 Web spam ... detection 0.4 ... links ... contents 0.3 ... both Summary 0.2 1 2 3 4 5 Link distance [Baeza-Yates et al., 2006], data from UK 2006
  9. 9. Link Analysis for Link similarity cases Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Link (geodesic) distance ... links Co-citation ... contents Bibliographic coupling ... both Summary
  10. 10. Link Analysis for Co-citation Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  11. 11. Link Analysis for Bibliographic coupling Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  12. 12. Link Analysis for (Both can be generalized) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection (Both co-citation and bibliographic coupling can be ... links generalized. E.g.: SimRank [Jeh and Widom, 2002]: ... contents generalizes the idea of co-citation to several levels) ... both Summary
  13. 13. Link Analysis for Link endorsement hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are assumed to be endorsements (votes, positive Ranking opinions) [Li, 1998] Web spam ... detection But they can represent: ... links Disagreement ... contents Self citations ... both Summary Nepotism Citations to methodological documents etc.
  14. 14. Link Analysis for Link endorsement hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Links are assumed to be endorsements (votes, positive Ranking opinions) [Li, 1998] Web spam ... detection But they can represent: ... links Disagreement ... contents Self citations ... both Summary Nepotism Citations to methodological documents etc.
  15. 15. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  16. 16. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  17. 17. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  18. 18. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  19. 19. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  20. 20. Link Analysis for Furthermore Web Information Retrieval C. Castillo Hypothesis Levels of link analysis They measure quantity not quality (e.g.: “Stop the Ranking numbers game!” in ACM communications a few months Web spam ago) ... detection Self-citations are frequent ... links ... contents In some topics there is more linking ... both Citations go from newer to older Summary New documents get few citations [Baeza-Yates et al., 2002] Many of the citations are irrelevant
  21. 21. Link Analysis for Nevertheless Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Both the topical locality hypothesis and the link endorsement Web spam hypothesis are meaningful on the Web ... detection ... links Analogy with Economy ... contents ... both Think on the hypothesis requiring many buyers/sellers, zero Summary transaction costs, perfect information, etc. in economic sciences
  22. 22. Link Analysis for Nevertheless Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Both the topical locality hypothesis and the link endorsement Web spam hypothesis are meaningful on the Web ... detection ... links Analogy with Economy ... contents ... both Think on the hypothesis requiring many buyers/sellers, zero Summary transaction costs, perfect information, etc. in economic sciences
  23. 23. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  24. 24. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  25. 25. Link Analysis for How to find meaningful patterns? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Several levels of analysis: ... detection Macroscopic view: overall structure ... links ... contents Microscopic view: nodes ... both Mesoscopic view: regions Summary
  26. 26. Link Analysis for How to find meaningful patterns? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Several levels of analysis: ... detection Macroscopic view: overall structure ... links ... contents Microscopic view: nodes ... both Mesoscopic view: regions Summary
  27. 27. Link Analysis for How to find meaningful patterns? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Several levels of analysis: ... detection Macroscopic view: overall structure ... links ... contents Microscopic view: nodes ... both Mesoscopic view: regions Summary
  28. 28. Link Analysis for Macroscopic view, e.g. Bow-tie Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Broder et al., 2000]
  29. 29. Link Analysis for Macroscopic view, e.g. Jellyfish Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
  30. 30. Link Analysis for Macroscopic view, e.g. Jellyfish Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  31. 31. Link Analysis for Microscopic view, e.g. Degree Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Barab´si, 2002] and others a
  32. 32. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary “While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barab´si, 2002] a
  33. 33. Link Analysis for Other scale-free networks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Power grid designs ... detection Sexual partners in humans ... links Collaboration of movie actors in films ... contents ... both Citations in scientific publications Summary Protein interactions
  34. 34. Link Analysis for Microscopic view, e.g. Degree Web Information Retrieval C. Castillo Greece Chile Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents Spain Korea ... both Summary [Baeza-Yates et al., 2007] - compares this distribution in 8 countries . . . guess what is the result?
  35. 35. Link Analysis for Mesoscopic view, e.g. Hop-plot Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  36. 36. Link Analysis for Mesoscopic view, e.g. Hop-plot Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  37. 37. Link Analysis for Mesoscopic view, e.g. Hop-plot Web Information Retrieval C. Castillo .it (40M pages) .uk (18M pages) Hypothesis 0.3 0.3 Levels of link analysis 0.2 0.2 Ranking Frequency Frequency Web spam 0.1 0.1 ... detection ... links 0.0 0.0 5 10 15 20 25 30 5 10 15 20 25 30 ... contents Distance Distance .eu.int (800K pages) Synthetic graph (100K pages) ... both Summary 0.3 0.3 0.2 0.2 Frequency Frequency 0.1 0.1 0.0 0.0 5 10 15 20 25 30 5 10 15 20 25 30 Distance Distance [Baeza-Yates et al., 2006]
  38. 38. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  39. 39. Link Analysis for Models Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Preferential attachment ... links Copy model ... contents Hybrid models ... both Summary
  40. 40. Link Analysis for Models Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Preferential attachment ... links Copy model ... contents Hybrid models ... both Summary
  41. 41. Link Analysis for Models Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Preferential attachment ... links Copy model ... contents Hybrid models ... both Summary
  42. 42. Link Analysis for Preferential attachment Web Information Retrieval C. Castillo Hypothesis Levels of link analysis “A common property of many large networks is that Ranking the vertex connectivities follow a scale-free Web spam power-law distribution. This feature was found to be ... detection a consequence of two generic mechanisms: (i) ... links networks expand continuously by the addition of ... contents new vertices, and (ii) new vertices attach ... both preferentially to sites that are already well Summary connected.” [Barab´si and Albert, 1999] a “rich get richer”
  43. 43. Link Analysis for Preferential attachment Web Information Retrieval C. Castillo Hypothesis Levels of link analysis “A common property of many large networks is that Ranking the vertex connectivities follow a scale-free Web spam power-law distribution. This feature was found to be ... detection a consequence of two generic mechanisms: (i) ... links networks expand continuously by the addition of ... contents new vertices, and (ii) new vertices attach ... both preferentially to sites that are already well Summary connected.” [Barab´si and Albert, 1999] a “rich get richer”
  44. 44. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  45. 45. Link Analysis for Counting in-links does not work Web Information Retrieval C. Castillo Hypothesis Levels of link analysis “With a simple program, huge numbers of pages can Ranking Web spam be created easily, artificially inflating citation counts. ... detection Because the Web environment contains profit ... links seeking ventures, attention getting strategies evolve ... contents in response to search engine algorithms. For this ... both reason, any evaluation strategy which counts Summary replicable features of web pages is prone to manipulation” [Page et al., 1998]
  46. 46. Link Analysis for PageRank: simplified version Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam PageRank ′ (v ) PageRank ′ (u) = ... detection |Γ+ (v )| ... links v ∈Γ− (u) ... contents ... both Γ− (·): in-links Summary Γ+ (·): out-links
  47. 47. Link Analysis for Iterations with pseudo-PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  48. 48. Link Analysis for Iterations with pseudo-PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  49. 49. Link Analysis for So far, so good, but ... Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam The Web includes many pages with no out-links, these ... detection will accumulate all of the score ... links ... contents We would like Web pages to accumulate ranking ... both We add random jumps (teleportation) Summary
  50. 50. Link Analysis for PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ǫ PageRank(v ) PageRank(u) = + (1 − ǫ) ... detection |Γ+ (v )| N v ∈Γ− (u) ... links ... contents ... both Γ− (·): in-links Summary Γ+ (·): out-links ǫ/N: jump to a random page with probability ǫ ≈ 0.15
  51. 51. Link Analysis for HITS Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Two scores per page: “hub score” and “authority score”.
  52. 52. Link Analysis for HITS Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Two scores per page: “hub score” and “authority score”.
  53. 53. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  54. 54. Link Analysis for Iterations Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Initialize: Web spam hub(u, 0) = auth(u, 0) = 0 ... detection ... links ... contents Iterate: ... both auth(v ,t−1) hub(u, t) = v ∈Γ+ (u) |Γ− (v )| Summary hub(v ,t−1) auth(u, t) = |Γ+ (v )| v ∈Γ− (u)
  55. 55. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  56. 56. Link Analysis for What is on the Web? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  57. 57. Link Analysis for What is on the Web [2.0]? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  58. 58. Link Analysis for What else is on the Web? Web Information Retrieval C. Castillo “The sum of all human knowledge plus porn” – Robert Gilbert Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  59. 59. Link Analysis for What’s happening on the Web? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam There is a fierce competition ... detection ... links ... contents for your attention ... both Summary
  60. 60. Link Analysis for What’s happening on the Web? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Search engines are to some extent ... detection ... links arbiters of this competition ... contents ... both and they must watch it closely, otherwise ... Summary
  61. 61. Link Analysis for Some cheating occurs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary 1986 FIFA World Cup, Argentina vs England
  62. 62. Link Analysis for Simple web spam Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  63. 63. Link Analysis for Hidden text Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  64. 64. Link Analysis for Made for advertising Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  65. 65. Link Analysis for Search engine? Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  66. 66. Link Analysis for Fake search engine Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  67. 67. Link Analysis for “Normal” content in link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  68. 68. Link Analysis for “Normal” content in link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  69. 69. Link Analysis for Cloaking Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  70. 70. Link Analysis for Redirection Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  71. 71. Link Analysis for Redirects using Javascript Web Information Retrieval C. Castillo Hypothesis Simple redirect Levels of link analysis <script> Ranking document.location=quot;http://www.topsearch10.com/quot;; Web spam </script> ... detection ... links “Hidden” redirect ... contents ... both <script> Summary var1=24; var2=var1; if(var1==var2) { document.location=quot;http://www.topsearch10.com/quot;; } </script>
  72. 72. Link Analysis for Problem: obfuscated code Web Information Retrieval C. Castillo Hypothesis Levels of link Obfuscated redirect analysis Ranking <script> Web spam var a1=quot;winquot;,a2=quot;dowquot;,a3=quot;locaquot;,a4=quot;tion.quot;, ... detection a5=quot;replacequot;,a6=quot;(’http://www.top10search.com/’)quot;; ... links var i,str=quot;quot;; ... contents for(i=1;i<=6;i++) ... both { Summary str += eval(quot;aquot;+i); } eval(str); </script>
  73. 73. Link Analysis for Problem: really obfuscated code Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Encoded javascript Web spam <script> ... detection var s = quot;%5CBE0D%5C%05GDHJ BDE%16...%04%0Equot;; ... links var e = ’’, i; ... contents eval(unescape(’s%eDunescape%28s%29%3Bfor...%3B’)); ... both Summary </script> More examples: [Chellapilla and Maykov, 2007]
  74. 74. Link Analysis for There are many attempts of cheating on the Web Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Most of these are spam: ... detection 1,630,000 results for “free mp3 hilton viagra” in SE1 ... links ... contents 1,760,000 results for “credit vicodin loan” in SE2 ... both 1,320,000 results for “porn mortgage” in SE3 Summary
  75. 75. Link Analysis for Costs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Costs: Web spam X Costs for users: lower precision for some queries ... detection ... links X Costs for search engines: wasted storage space, ... contents network resources, and processing cycles ... both X Costs for the publishers: resources invested in cheating Summary and not in improving their contents
  76. 76. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  77. 77. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  78. 78. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  79. 79. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  80. 80. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  81. 81. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  82. 82. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  83. 83. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  84. 84. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  85. 85. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  86. 86. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  87. 87. Link Analysis for Adversarial IR Issues on the Web Web Information Retrieval C. Castillo Hypothesis Link spam Levels of link Content spam analysis Ranking Cloaking Web spam Comment/forum/wiki spam ... detection Spam-oriented blogging ... links ... contents Click fraud ×2 ... both Reverse engineering of ranking algorithms Summary Web content filtering Advertisement blocking Stealth crawling Malicious tagging . . . more?
  88. 88. Link Analysis for Opportunities for Web spam Web Information Retrieval C. Castillo Hypothesis Levels of link analysis X Spamdexing Ranking Keyword stuffing Web spam Link farms ... detection Spam blogs (splogs) ... links Cloaking ... contents ... both Adversarial relationship Summary Every undeserved gain in ranking for a spammer, is a loss of precision for the search engine.
  89. 89. Link Analysis for Opportunities for Web spam Web Information Retrieval C. Castillo Hypothesis Levels of link analysis X Spamdexing Ranking Keyword stuffing Web spam Link farms ... detection Spam blogs (splogs) ... links Cloaking ... contents ... both Adversarial relationship Summary Every undeserved gain in ranking for a spammer, is a loss of precision for the search engine.
  90. 90. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  91. 91. Link Analysis for Motivation Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking [Fetterly et al., 2004] hypothesized that studying the Web spam distribution of statistics about pages could be a good way of ... detection detecting spam pages: ... links ... contents “in a number of these distributions, outlier values are ... both associated with web spam” Summary
  92. 92. Link Analysis for Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  93. 93. Link Analysis for Training of a Decision Tree Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  94. 94. Link Analysis for Decision Tree (error = 15%) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  95. 95. Link Analysis for Decision Tree (error = 15% → 12%) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  96. 96. Link Analysis for Machine Learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  97. 97. Link Analysis for Feature Extraction Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  98. 98. Link Analysis for Challenges: Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Machine Learning Challenges: ... detection Instances are not really independent (graph) ... links ... contents Learning with few examples ... both Scalability Summary
  99. 99. Link Analysis for Challenges: Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Machine Learning Challenges: ... detection Instances are not really independent (graph) ... links ... contents Learning with few examples ... both Scalability Summary
  100. 100. Link Analysis for Challenges: Machine Learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Machine Learning Challenges: ... detection Instances are not really independent (graph) ... links ... contents Learning with few examples ... both Scalability Summary
  101. 101. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  102. 102. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  103. 103. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  104. 104. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  105. 105. Link Analysis for Challenges: Information Retrieval Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Information Retrieval Challenges: Web spam Feature extraction: which features? ... detection ... links Feature aggregation: page/host/domain ... contents Feature propagation (graph) ... both Recall/precision tradeoffs Summary Scalability
  106. 106. Link Analysis for Challenges: Data Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Data is difficult to collect ... detection Data is expensive to label ... links ... contents Labels are sparse ... both Humans do not always agree Summary
  107. 107. Link Analysis for Agreement Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  108. 108. Link Analysis for Results Web Information Retrieval C. Castillo Labels Hypothesis Label Frequency Percentage Levels of link analysis Normal 4,046 61.75% Ranking Borderline 709 10.82% Web spam Spam 1,447 22.08% ... detection Can not classify 350 5.34% ... links ... contents Agreement ... both Category Kappa Interpretation Summary normal 0.62 Substantial agreement spam 0.63 Substantial agreement borderline 0.11 Slight agreement global 0.56 Moderate agreement Reference collection [Castillo et al., 2006]
  109. 109. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  110. 110. Link Analysis for Topological spam: link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
  111. 111. Link Analysis for Topological spam: link farms Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
  112. 112. Link Analysis for Handling large graphs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam For large graphs, random access is not possible. ... detection ... links Large graphs do not fit in main memory ... contents ... both Streaming model of computation Summary
  113. 113. Link Analysis for Handling large graphs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam For large graphs, random access is not possible. ... detection ... links Large graphs do not fit in main memory ... contents ... both Streaming model of computation Summary
  114. 114. Link Analysis for Handling large graphs Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam For large graphs, random access is not possible. ... detection ... links Large graphs do not fit in main memory ... contents ... both Streaming model of computation Summary
  115. 115. Link Analysis for Semi-streaming model Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Memory size enough to hold some data per-node ... links Disk size enough to hold some data per-edge ... contents A small number of passes over the data ... both Summary
  116. 116. Link Analysis for Restriction Web Information Retrieval C. Castillo Semi-streaming model: graph on disk Hypothesis Levels of link 1: for node : 1 . . . N do analysis INITIALIZE-MEM(node) 2: Ranking 3: end for Web spam 4: for distance : 1 . . . d do {Iteration step} ... detection for src : 1 . . . N do {Follow links in the graph} ... links 5: ... contents for all links from src to dest do 6: ... both COMPUTE(src,dest) 7: Summary end for 8: end for 9: NORMALIZE 10: 11: end for 12: POST-PROCESS 13: return Something
  117. 117. Link Analysis for Restriction Web Information Retrieval C. Castillo Semi-streaming model: graph on disk Hypothesis Levels of link 1: for node : 1 . . . N do analysis INITIALIZE-MEM(node) 2: Ranking 3: end for Web spam 4: for distance : 1 . . . d do {Iteration step} ... detection for src : 1 . . . N do {Follow links in the graph} ... links 5: ... contents for all links from src to dest do 6: ... both COMPUTE(src,dest) 7: Summary end for 8: end for 9: NORMALIZE 10: 11: end for 12: POST-PROCESS 13: return Something
  118. 118. Link Analysis for Restriction Web Information Retrieval C. Castillo Semi-streaming model: graph on disk Hypothesis Levels of link 1: for node : 1 . . . N do analysis INITIALIZE-MEM(node) 2: Ranking 3: end for Web spam 4: for distance : 1 . . . d do {Iteration step} ... detection for src : 1 . . . N do {Follow links in the graph} ... links 5: ... contents for all links from src to dest do 6: ... both COMPUTE(src,dest) 7: Summary end for 8: end for 9: NORMALIZE 10: 11: end for 12: POST-PROCESS 13: return Something
  119. 119. Link Analysis for Link-Based Features Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Degree-related measures Web spam PageRank ... detection ... links TrustRank [Gy¨ngyi et al., 2004] o ... contents Truncated PageRank [Becchetti et al., 2006] ... both Estimation of supporters [Becchetti et al., 2006] Summary 140 features per host (2 pages per host)
  120. 120. Link Analysis for Degree-Based Web Information Retrieval C. Castillo Hypothesis 0.12 Normal Spam Levels of link 0.10 analysis 0.08 Ranking 0.06 Web spam 0.04 ... detection 0.02 ... links ... contents 0.00 4 18 76 323 1380 5899 25212 107764 460609 1968753 0.14 ... both Normal Spam 0.12 Summary 0.10 0.08 0.06 0.04 0.02 0.00 0.0 0.0 0.0 0.1 0.6 4.9 40.0 327.9 2686.5 22009.9
  121. 121. Link Analysis for TrustRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking TrustRank [Gy¨ngyi et al., 2004] o Web spam A node with high PageRank, but far away from a core set of ... detection “trusted nodes” is suspicious ... links ... contents Start from a set of trusted nodes, then do a random walk, ... both returning to the set of trusted nodes with probability 1 − α at Summary each step i Trusted nodes: data from http://www.dmoz.org/
  122. 122. Link Analysis for TrustRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking TrustRank [Gy¨ngyi et al., 2004] o Web spam A node with high PageRank, but far away from a core set of ... detection “trusted nodes” is suspicious ... links ... contents Start from a set of trusted nodes, then do a random walk, ... both returning to the set of trusted nodes with probability 1 − α at Summary each step i Trusted nodes: data from http://www.dmoz.org/
  123. 123. Link Analysis for TrustRank Idea Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  124. 124. Link Analysis for TrustRank / PageRank Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking 1.00 Normal Spam Web spam 0.90 0.80 ... detection 0.70 ... links 0.60 0.50 ... contents 0.40 0.30 ... both 0.20 Summary 0.10 0.00 0.4 1 4 1e+01 4e+01 1e+02 3e+02 1e+03 3e+03 9e+03
  125. 125. Link Analysis for High and low-ranked pages are different Web Information Retrieval C. Castillo 4 x 10 Hypothesis Top 0%−10% 12 Levels of link Top 40%−50% analysis Top 60%−70% 10 Ranking Number of Nodes Web spam 8 ... detection ... links 6 ... contents ... both 4 Summary 2 0 1 5 10 15 20 Distance Areas below the curves are equal if we are in the same strongly-connected component
  126. 126. Link Analysis for High and low-ranked pages are different Web Information Retrieval C. Castillo 4 x 10 Hypothesis Top 0%−10% 12 Levels of link Top 40%−50% analysis Top 60%−70% 10 Ranking Number of Nodes Web spam 8 ... detection ... links 6 ... contents ... both 4 Summary 2 0 1 5 10 15 20 Distance Areas below the curves are equal if we are in the same strongly-connected component
  127. 127. Link Analysis for Probabilistic counting Web Information Retrieval C. Castillo Hypothesis 1 1 0 0 Levels of link 0 0 analysis 0 0 0 1 1 1 1 1 Ranking 0 0 1 1 0 0 0 0 0 0 Web spam Propagation of 0 0 1 1 bits using the 1 ... detection 0 1 1 “OR” operation 1 0 1 0 ... links 1 Target 0 Count bits set ... contents 0 page 0 to estimate ... both 0 0 supporters 0 0 1 1 Summary 1 1 0 0 1 1 0 0 0 0 1 1 0 0 [Becchetti et al., 2006] shows an improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  128. 128. Link Analysis for Probabilistic counting Web Information Retrieval C. Castillo Hypothesis 1 1 0 0 Levels of link 0 0 analysis 0 0 0 1 1 1 1 1 Ranking 0 0 1 1 0 0 0 0 0 0 Web spam Propagation of 0 0 1 1 bits using the 1 ... detection 0 1 1 “OR” operation 1 0 1 0 ... links 1 Target 0 Count bits set ... contents 0 page 0 to estimate ... both 0 0 supporters 0 0 1 1 Summary 1 1 0 0 1 1 0 0 0 0 1 1 0 0 [Becchetti et al., 2006] shows an improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  129. 129. Link Analysis for Bottleneck number Web Information Retrieval C. Castillo Hypothesis bd (x) = minj≤d {|Nj (x)|/|Nj−1 (x)|}. Minimum rate of growth Levels of link analysis of the neighbors of x up to a certain distance. We expect that Ranking spam pages form clusters that are somehow isolated from the Web spam rest of the Web graph and they have smaller bottleneck ... detection numbers than non-spam pages. ... links 0.40 Normal ... contents Spam 0.35 ... both 0.30 Summary 0.25 0.20 0.15 0.10 0.05 0.00 1.11 1.30 1.52 1.78 2.07 2.42 2.83 3.31 3.87 4.52
  130. 130. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  131. 131. Link Analysis for Content-Based Features Web Information Retrieval C. Castillo Hypothesis Most of these reported in [Ntoulas et al., 2006]: Levels of link Number of word in the page and title analysis Ranking Average word length Web spam Fraction of anchor text ... detection Fraction of visible text ... links ... contents Compression rate ... both From [Castillo et al., 2007]: Summary Corpus precision and corpus recall Query precision and query recall Independent trigram likelihood Entropy of trigrams
  132. 132. Link Analysis for Average word length Web Information Retrieval C. Castillo Hypothesis 0.12 Normal Levels of link Spam analysis 0.10 Ranking 0.08 Web spam ... detection 0.06 ... links ... contents 0.04 ... both 0.02 Summary 0.00 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
  133. 133. Link Analysis for Corpus precision Web Information Retrieval C. Castillo Hypothesis 0.10 Normal Levels of link 0.09 Spam analysis 0.08 Ranking 0.07 Web spam 0.06 ... detection 0.05 ... links 0.04 ... contents 0.03 ... both 0.02 Summary 0.01 0.00 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Figure: Histogram of the corpus precision in non-spam vs. spam pages.
  134. 134. Link Analysis for Query precision Web Information Retrieval C. Castillo Hypothesis 0.12 Normal Levels of link Spam analysis 0.10 Ranking 0.08 Web spam ... detection 0.06 ... links ... contents 0.04 ... both 0.02 Summary 0.00 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
  135. 135. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  136. 136. Link Analysis for General hypothesis Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Pages topologically close to each other are more likely to have ... detection the same label (spam/nonspam) than random pairs of pages ... links ... contents Ideas for exploiting this: clustering, propagation, stacked ... both learning Summary
  137. 137. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary [Castillo et al., 2007]
  138. 138. Link Analysis for Topological dependencies: in-links Web Information Retrieval C. Castillo Hypothesis Histogram of fraction of spam hosts in the in-links Levels of link analysis 0 = no in-link comes from spam hosts Ranking 1 = all of the in-links come from spam hosts Web spam ... detection 0.4 ... links In-links of non spam In-links of spam 0.35 ... contents 0.3 ... both 0.25 Summary 0.2 0.15 0.1 0.05 0 0.0 0.2 0.4 0.6 0.8 1.0
  139. 139. Link Analysis for Topological dependencies: out-links Web Information Retrieval C. Castillo Hypothesis Histogram of fraction of spam hosts in the out-links Levels of link analysis 0 = none of the out-links points to spam hosts Ranking 1 = all of the out-links point to spam hosts Web spam ... detection 1 ... links Out-links of non spam 0.9 Outlinks of spam ... contents 0.8 ... both 0.7 Summary 0.6 0.5 0.4 0.3 0.2 0.1 0 0.0 0.2 0.4 0.6 0.8 1.0
  140. 140. Link Analysis for Idea 1: Clustering Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Classify, then cluster hosts, then assign the same label to all ... links hosts in the same cluster by majority voting ... contents ... both Summary
  141. 141. Link Analysis for Idea 1: Clustering (cont.) Web Information Retrieval C. Castillo Hypothesis Initial prediction: Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  142. 142. Link Analysis for Idea 1: Clustering (cont.) Web Information Retrieval C. Castillo Hypothesis Clustering: Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  143. 143. Link Analysis for Idea 1: Clustering (cont.) Web Information Retrieval C. Castillo Hypothesis Final prediction: Levels of link analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  144. 144. Link Analysis for Idea 1: Clustering – Results Web Information Retrieval C. Castillo Hypothesis Levels of link Baseline Clustering analysis Without bagging Ranking Web spam True positive rate 75.6% 74.5% ... detection False positive rate 8.5% 6.8% ... links F-Measure 0.646 0.673 ... contents With bagging ... both True positive rate 78.7% 76.9% Summary False positive rate 5.7% 5.0% F-Measure 0.723 0.728 V Reduces error rate
  145. 145. Link Analysis for Idea 2: Propagate the label Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Classify, then interpret “spamicity” as a probability, then do a ... links random walk with restart from those nodes ... contents ... both Summary
  146. 146. Link Analysis for Idea 2: Propagate the label (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link Initial prediction: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  147. 147. Link Analysis for Idea 2: Propagate the label (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link Propagation: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  148. 148. Link Analysis for Idea 2: Propagate the label (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link Final prediction, applying a threshold: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  149. 149. Link Analysis for Idea 2: Propagate the label – Results Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Baseline Fwds. Backwds. Both Ranking Classifier without bagging Web spam True positive rate 75.6% 70.9% 69.4% 71.4% ... detection False positive rate 8.5% 6.1% 5.8% 5.8% ... links F-Measure 0.646 0.665 0.664 0.676 ... contents ... both Classifier with bagging Summary True positive rate 78.7% 76.5% 75.0% 75.2% False positive rate 5.7% 5.4% 4.3% 4.7% F-Measure 0.723 0.716 0.733 0.724
  150. 150. Link Analysis for Idea 3: Stacked graphical learning Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Meta-learning scheme [Cohen and Kou, 2006] ... detection Derive initial predictions ... links Generate an additional attribute for each object by ... contents combining predictions on neighbors in the graph ... both Summary Append additional attribute in the data and retrain
  151. 151. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Let p(x) ∈ [0..1] be the prediction of a classification Ranking algorithm for a host x using k features Web spam ... detection Let N(x) be the set of pages related to x (in some way) ... links Compute ... contents g ∈N(x) p(g ) ... both f (x) = |N(x)| Summary Add f (x) as an extra feature for instance x and learn a new model with k + 1 features
  152. 152. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Initial prediction: Web spam ... detection ... links ... contents ... both Summary
  153. 153. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Computation of new feature: Ranking Web spam ... detection ... links ... contents ... both Summary
  154. 154. Link Analysis for Idea 3: Stacked graphical learning (cont.) Web Information Retrieval C. Castillo Hypothesis Levels of link New prediction with k + 1 features: analysis Ranking Web spam ... detection ... links ... contents ... both Summary
  155. 155. Link Analysis for Idea 3: Stacked graphical learning - Results Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Avg. Avg. Avg. Web spam Baseline of in of out of both ... detection True positive rate 78.7% 84.4% 78.3% 85.2% ... links False positive rate 5.7% 6.7% 4.8% 6.1% ... contents F-Measure 0.723 0.733 0.742 ... both 0.750 Summary V Increases detection rate
  156. 156. Link Analysis for Idea 3: Stacked graphical learning x2 Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking And repeat ... Web spam ... detection Baseline First pass Second pass ... links True positive rate 78.7% 85.2% 88.4% ... contents False positive rate 5.7% 6.1% 6.3% ... both F-Measure 0.723 0.750 0.763 Summary V Significant improvement over the baseline
  157. 157. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link Hypothesis 1 analysis Levels of link analysis 2 Ranking Ranking 3 Web spam Web spam 4 ... detection ... detection 5 ... links ... links 6 ... contents 7 ... contents ... both 8 ... both Summary 9 Summary
  158. 158. Link Analysis for Concluding remarks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Hypothesis: topical locality + link endorsement ... links Primitives: similarity, ranking, propagation, etc. ... contents Application to Web spam ... both Summary
  159. 159. Link Analysis for Concluding remarks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Hypothesis: topical locality + link endorsement ... links Primitives: similarity, ranking, propagation, etc. ... contents Application to Web spam ... both Summary
  160. 160. Link Analysis for Concluding remarks Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam ... detection Hypothesis: topical locality + link endorsement ... links Primitives: similarity, ranking, propagation, etc. ... contents Application to Web spam ... both Summary
  161. 161. Link Analysis for Web Information Retrieval C. Castillo Hypothesis Levels of link analysis Ranking Web spam Thank you! ... detection ... links ... contents ... both Summary
  162. 162. Link Analysis for Web Information Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Retrieval Generalizing pagerank: Damping functions for link-based ranking C. Castillo algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. Hypothesis ACM Press. Levels of link Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). analysis Characterization of national web domains. Ranking ACM Transactions on Internet Technology, 7(2). Web spam Baeza-Yates, R. and Poblete, B. (2006). ... detection Dynamics of the chilean web structure. ... links Comput. Networks, 50(10):1464–1473. ... contents Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). ... both Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), Summary volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer. Barab´si, A.-L. (2002). a Linked: The New Science of Networks. Perseus Books Group. Barab´si, A. L. and Albert, R. (1999). a Emergence of scaling in random networks. Science, 286(5439):509–512.
  163. 163. Link Analysis for Web Information Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. Retrieval (2006). Using rank propagation and probabilistic counting for link-based spam C. Castillo detection. Hypothesis In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. Levels of link analysis Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Ranking Stata, R., Tomkins, A., and Wiener, J. (2000). Web spam Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages ... detection 309–320, Amsterdam, Netherlands. ACM Press. ... links Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., ... contents and Vigna, S. (2006). ... both A reference collection for web spam. SIGIR Forum, 40(2):11–24. Summary Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM. Chellapilla, K. and Maykov, A. (2007). A taxonomy of javascript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×