Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
Transcript

  • 1. Link Analysis in National Web Domains. Ricardo Baeza-Yates and Carlos Castillo, ICREA / Cátedra Telefónica, Universitat Pompeu Fabra, Barcelona, Spain. http://www.upf.edu/dtecn/ OSWIR 2005, Compiegne, France, September 19, 2005.
  • 2. Outline: 1. Motivation 2. Results 3. Conclusions
  • 3. Motivation: Sampling the Web
      ✗ We don’t have access to a global-scale collection
      ✗ A set of Web sites in the same organization is not diverse enough
      ✗ A set of Web sites in the same topic might not be representative
      ✗ A set of random Web sites might not be connected
      ✓ A national domain has a good balance between diversity and completeness
  • 8. Collections used: different economic, historical, linguistic, and geographical contexts. Brazil (2005), Chile (2004), Greece (2004), Indochina (2004), Italy (2004), South Korea (2004), Spain (2004), U.K. (2002).
  • 9. Collections used
      Collection    Year   Available hosts [mill] (rank)   Pages [mill]
      Brazil        2005   3.9 (11th)                      4.7
      Chile         2004   0.3 (42nd)                      3.3
      Greece        2004   0.3 (40th)                      3.7
      Indochina     2004   0.5 (38th)                      7.4
      Italy         2004   9.3 (4th)                       41.3
      South Korea   2004   0.2 (47th)                      8.9
      Spain         2004   1.3 (25th)                      16.2
      U.K.          2002   4.4 (10th)                      18.5
  • 10. Scale-free topology. If we sort pages by the number of in-links, the $k$-th page has in-degree proportional to $k^{-\alpha}$ (Zipf’s law). Equivalently, the fraction of pages with $x$ in-links is proportional to $x^{-\theta}$ (power law). Experimentally, $\theta \approx 2.1$ on the Web. Partial explanation: a multiplicative process; if $d_t$ is the number of links at time $t$, then $d_{t+1} = C \times d_t$.
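As a hedged illustration of the two statements on this slide: under a rank-size law with exponent α, the rank of a page with x in-links scales as x^(-1/α), so the two exponents are linked by θ = 1 + 1/α. The sketch below encodes that relation and fits θ from a degree histogram with a straight line on log-log axes. The function names are hypothetical and the histogram is synthetic; the slides do not say which fitting procedure the authors used, so least squares is an assumption chosen only for simplicity.

```python
import math


def zipf_to_power_law(alpha):
    """Map a Zipf (rank-size) exponent alpha to the power-law exponent theta."""
    return 1.0 + 1.0 / alpha


def fit_power_law_exponent(histogram):
    """Estimate theta from {degree: fraction of pages} by a log-log line fit.

    Assumption: plain least squares, which is simple but known to be biased
    on noisy tails; it is not claimed to be the authors' procedure.
    """
    points = [(math.log(x), math.log(f))
              for x, f in histogram.items() if x > 0 and f > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive points")
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    return -sxy / sxx  # the slope of the fitted line is -theta


# Synthetic histogram that follows x^(-2.1) exactly (made-up data).
synthetic = {x: x ** -2.1 for x in range(1, 1001)}
theta = fit_power_law_exponent(synthetic)
```

On real crawl data the head and the noisy tail of the histogram usually have to be trimmed before fitting, which this sketch leaves to the caller.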
  • 13. In-degree [Figure: log-log plots of the in-degree distribution, one panel each for Brazil, Chile, Greece, Italy, Korea, Spain, and U.K.]
  • 14. Out-degree [Figure: log-log plots of the out-degree distribution, one panel each for Brazil, Chile, Greece, Italy, Korea, Spain, and U.K.]
  • 15. Link scores (PageRank, Hubs, Authorities) [Figure: log-log plots of the link score distributions, three rows of panels for Brazil, Chile, Greece, and Korea]
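The three link scores named on slide 15 can each be computed by power iteration. The following is a minimal sketch on a toy graph, not the authors' implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the uniform handling of dangling pages are all assumptions made for illustration.

```python
import math


def all_nodes(graph):
    """Collect every page that appears as a link source or a link target."""
    return sorted(set(graph) | {v for targets in graph.values() for v in targets})


def pagerank(graph, damping=0.85, iterations=50):
    """PageRank by power iteration; graph maps each page to a list of targets."""
    nodes = all_nodes(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Pages without out-links spread their score uniformly (assumption).
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = graph.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


def hits(graph, iterations=50):
    """HITS hub and authority scores, L2-normalized after every update."""
    nodes = all_nodes(graph)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {u: 0.0 for u in nodes}
        for u in nodes:
            for v in graph.get(u, []):
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {u: a / norm for u, a in auth.items()}
        # Hub score: sum of the authority scores of pages linked to.
        hub = {u: sum(auth[v] for v in graph.get(u, [])) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {u: h / norm for u, h in hub.items()}
    return hub, auth


# Toy graph (made up): pages "a" and "b" both link to page "c".
toy_graph = {"a": ["c"], "b": ["c"], "c": []}
scores = pagerank(toy_graph)
hub_scores, auth_scores = hits(toy_graph)
```

In the toy graph, "c" receives the highest PageRank and all of the authority weight, while "a" and "b" share the hub weight equally.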
  • 16. Power-law exponents
      Collection            In-degree
      Brazil                1.9
      Chile                 2.0
      Greece                1.9
      Indochina             1.6
      Italy                 1.8
      South Korea           1.9
      Spain                 2.1
      U.K.                  1.8
      (Broder... 2000)      2.1
      (Dill... 2002)        2.1
      (Kleinberg... 1999)   ≈2
  • 17. Power-law exponents
      Collection              In-degree   Out-degree        PageRank   HITS
                                          Small    Large               Hubs   Auth.
      Brazil                  1.9         0.7      2.7      1.8        2.9    1.8
      Chile                   2.0         0.7      2.6      1.9        2.7    1.9
      Greece                  1.9         0.6      1.9      1.8        2.6    1.8
      Indochina               1.6         0.7      2.6
      Italy                   1.8         0.7      2.5
      South Korea             1.9         0.3      2.0      1.8        3.7    1.8
      Spain                   2.1         0.9      4.2      2.0
      U.K.                    1.8         0.7      3.4
      (Broder... 2000)        2.1                  2.7
      (Dill... 2002)          2.1                  2.2
      (Pandurangan... 2002)                                 2.1
      (Kleinberg... 1999)     ≈2
  • 18. Hostgraph [Figure: the hosts www.example1.com, www.example2.com, and www.example3.com shown as nodes S1, S2, and S3]
  • 19. Hostgraph also exhibits a power law
      Hostgraph degree
      Collection         In        Out
      Brazil             1.9       1.9
      Chile              2.0       1.7
      Greece             2.0       1.6
      South Korea        1.2       1.4
      Spain              1.8       1.3
      (Bharat... 2001)   1.6-1.7   1.7-1.8
      (Dill... 2002)     2.3
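A hostgraph like the one on slides 18 and 19 can be derived from page-level links by collapsing every page into its host. The sketch below is an illustration under stated assumptions: the function names and example URLs are hypothetical, links inside one host are dropped, and multiple page links between the same pair of hosts count as a single edge. The slides do not say whether the authors weighted host edges.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_hostgraph(page_links):
    """Collapse (source URL, target URL) page links into host-to-host edges."""
    out_links = defaultdict(set)
    for source_url, target_url in page_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Skip unparseable URLs and links that stay inside one host.
        if not source_host or not target_host or source_host == target_host:
            continue
        out_links[source_host].add(target_host)
    return out_links


def host_degrees(hostgraph):
    """Return (in_degree, out_degree) dictionaries for the hostgraph."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source_host, targets in hostgraph.items():
        out_degree[source_host] = len(targets)
        for target_host in targets:
            in_degree[target_host] += 1
    return dict(in_degree), dict(out_degree)


# Made-up page links between the three example hosts named on slide 18.
example_links = [
    ("http://www.example1.com/a.html", "http://www.example2.com/"),
    ("http://www.example1.com/b.html", "http://www.example2.com/x.html"),
    ("http://www.example1.com/a.html", "http://www.example1.com/b.html"),
    ("http://www.example2.com/", "http://www.example3.com/"),
    ("http://www.example3.com/", "http://www.example1.com/a.html"),
]
hostgraph = build_hostgraph(example_links)
in_degree, out_degree = host_degrees(hostgraph)
```

The resulting degree dictionaries are the input one would histogram to obtain host-level exponents such as those in the table above.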
  • 20. Web structure: connected components. “Normal” vs “Giant” strongly connected components [Figure: log-log plots of strongly connected components, one panel each for Brazil, Chile, Greece, Korea, and Spain]
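To make the notion of strongly connected components on slide 20 concrete, here is a minimal sketch that finds them with an iterative form of Kosaraju's algorithm and reports the largest one. The algorithm choice, the function name, and the toy graph are assumptions for illustration; the slides do not state how the components were computed.

```python
def strongly_connected_components(graph):
    """Return the strongly connected components of a directed graph.

    graph maps each node to an iterable of successor nodes.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}

    # Pass 1: depth-first search recording nodes in order of completion.
    visited = set()
    finish_order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                finish_order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {u: [] for u in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    # Pass 2: search the reversed graph in decreasing finish order.
    assigned = set()
    components = []
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        component = []
        stack = [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Toy graph (made up): one 3-cycle, one 2-cycle, and a page that only links in.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"],
             "d": ["e"], "e": ["d"], "f": ["a"]}
components = strongly_connected_components(toy_graph)
sizes = sorted(len(c) for c in components)
largest = set(max(components, key=len))
```

Counting how many components have each size gives the kind of distribution that a log-log plot of component sizes would show, with the largest component standing apart from the rest.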
  • 21. Conclusions
      ✓ Consistent results across collections
      ✓ Differences in the amount of spam
      ✓ Comparison of other aspects [to be available soon]
      Thank you