Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
Transcript

  • 1. Link Analysis in National Web Domains
       Ricardo Baeza-Yates and Carlos Castillo
       ICREA / Cátedra Telefónica, Universitat Pompeu Fabra - Barcelona, Spain
       http://www.upf.edu/dtecn/
       OSWIR 2005, Compiegne, France, September 19, 2005
  • 2. Outline
       1 Motivation
       2 Results
       3 Conclusions
  • 3. Motivation: Sampling the Web
       ✗ We don’t have access to a global-scale collection
       ✗ A set of Web sites in the same organization is not diverse enough
       ✗ A set of Web sites on the same topic might not be representative
       ✗ A set of random Web sites might not be connected
       ✓ A national domain has a good balance between diversity and completeness
  • 8. Collections used
       ✓ Different economic, historical, linguistic and geographical contexts
       Collection     Year
       Brazil         2005
       Chile          2004
       Greece         2004
       Indochina      2004
       Italy          2004
       South Korea    2004
       Spain          2004
       U. K.          2002
  • 9. Collections used
       Collection     Year   Available hosts   Pages
                             [mill] (rank)     [mill]
       Brazil         2005   3.9 (11th)         4.7
       Chile          2004   0.3 (42nd)         3.3
       Greece         2004   0.3 (40th)         3.7
       Indochina      2004   0.5 (38th)         7.4
       Italy          2004   9.3 (4th)         41.3
       South Korea    2004   0.2 (47th)         8.9
       Spain          2004   1.3 (25th)        16.2
       U. K.          2002   4.4 (10th)        18.5
  • 10. Scale-free topology
        If we sort pages by the number of in-links, the $k$-th page has in-degree proportional to $k^{-\alpha}$ (Zipf’s law).
        Equivalently, the fraction of pages with $x$ in-links is proportional to $x^{-\theta}$ (power law).
        Experimentally, $\theta \approx 2.1$ on the Web.
        Partial explanation: a multiplicative process; if $d_t$ is the number of links at time $t$, then $d_{t+1} = C \times d_t$.
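The transcript states that the Zipf form and the power-law form describe the same distribution but does not spell out how the two exponents relate. A short derivation, assuming both laws hold exactly and ignoring normalization constants, could go as follows:

```latex
\begin{align*}
x(k) &\propto k^{-\alpha}
  && \text{(Zipf: in-degree of the page of rank $k$)} \\
k(x) &\propto x^{-1/\alpha}
  && \text{(number of pages with at least $x$ in-links)} \\
p(x) &\propto \left| \frac{dk}{dx} \right| \propto x^{-(1 + 1/\alpha)}
  && \text{(fraction of pages with $x$ in-links)} \\
\theta &= 1 + \frac{1}{\alpha}
\end{align*}
```

Under this idealization, $\theta \approx 2.1$ corresponds to $\alpha = 1/(\theta - 1) \approx 0.91$.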
  • 13. In-degree
        [Figure: log-log in-degree plots, one panel each for Brazil, Chile, Greece, Italy, Korea, Spain and U.K.; horizontal axis from 10^0 to 10^4, vertical axis from 10^-7 to 10^-1.]
  • 14. Out-degree
        [Figure: log-log out-degree plots, one panel each for Brazil, Chile, Greece, Italy, Korea, Spain and U.K.; horizontal axis from 10^0 to 10^3, vertical axis from 10^-6 to 10^-1.]
  • 15. Link scores (PageRank, Hubs, Authorities)
        [Figure: three rows of log-log plots, each row with panels for Brazil, Chile, Greece and Korea.]
  • 16. Power-law exponents
        Collection             In-degree
        Brazil                 1.9
        Chile                  2.0
        Greece                 1.9
        Indochina              1.6
        Italy                  1.8
        South Korea            1.9
        Spain                  2.1
        U. K.                  1.8
        (Broder... 2000)       2.1
        (Dill... 2002)         2.1
        (Kleinberg... 1999)    ≈2
  • 17. Power-law exponents
        Collection              In-degree   Out-degree      PageRank   HITS
                                            Small   Large              Hubs   Auth.
        Brazil                  1.9         0.7     2.7     1.8        2.9    1.8
        Chile                   2.0         0.7     2.6     1.9        2.7    1.9
        Greece                  1.9         0.6     1.9     1.8        2.6    1.8
        Indochina               1.6         0.7     2.6     -          -      -
        Italy                   1.8         0.7     2.5     -          -      -
        South Korea             1.9         0.3     2.0     1.8        3.7    1.8
        Spain                   2.1         0.9     4.2     2.0        -      -
        U. K.                   1.8         0.7     3.4     -          -      -
        (Broder... 2000)        2.1         -       2.7     -          -      -
        (Dill... 2002)          2.1         -       2.2     -          -      -
        (Pandurangan... 2002)   -           -       -       2.1        -      -
        (Kleinberg... 1999)     ≈2          -       -       -          -      -
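The slides report fitted exponents but the transcript does not say how they were estimated. As a minimal sketch of one common option, the code below uses the maximum-likelihood estimator for a continuous power law above a cutoff. The function names, the cutoff parameter `x_min` and the synthetic sample are assumptions made for illustration, not the authors' procedure. For integer degrees, treating the values as continuous is only an approximation.

```python
import math
import random


def powerlaw_exponent_mle(values, x_min=1.0):
    """Estimate theta in p(x) ~ x^(-theta) for x >= x_min.

    Continuous maximum-likelihood estimate:
        theta = 1 + n / sum(ln(x_i / x_min))
    Values below x_min are ignored.
    """
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0.0:
        raise ValueError("all values equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def sample_power_law(theta, n, x_min=1.0, seed=0):
    """Draw n synthetic values from a continuous power law (hypothetical test data).

    Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (theta - 1)),
    with u uniform in [0, 1).
    """
    rng = random.Random(seed)
    return [
        x_min * (1.0 - rng.random()) ** (-1.0 / (theta - 1.0))
        for _ in range(n)
    ]


if __name__ == "__main__":
    synthetic = sample_power_law(theta=2.1, n=20000)
    print(round(powerlaw_exponent_mle(synthetic), 2))
```

With real crawl data, `values` would be the list of per-page in-degrees (or out-degrees, or link scores), and the choice of `x_min` would noticeably affect the estimate.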
  • 18. Hostgraph
        [Diagram: the hosts www.example1.com, www.example2.com and www.example3.com appear as nodes S1, S2 and S3.]
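The hostgraph slide only shows a diagram. A minimal sketch of how page-level links might be collapsed into a host-level graph is given below, assuming that two hosts are connected when at least one page on the first links to a page on the second, and that links within the same host are dropped. The function name `build_hostgraph` and the input format are hypothetical.

```python
from urllib.parse import urlsplit


def build_hostgraph(page_links):
    """Collapse (source_url, target_url) page links into host-level edges.

    Returns a dict mapping each host to the set of distinct hosts it links to.
    Links between pages of the same host are ignored (an assumption).
    """
    hostgraph = {}
    for source_url, target_url in page_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        if source_host is None or target_host is None:
            continue  # skip URLs without a host part
        # Make sure both hosts exist as nodes, even with no outgoing edges.
        hostgraph.setdefault(source_host, set())
        hostgraph.setdefault(target_host, set())
        if source_host != target_host:
            hostgraph[source_host].add(target_host)
    return hostgraph


if __name__ == "__main__":
    links = [
        ("http://www.example1.com/a.html", "http://www.example2.com/"),
        ("http://www.example1.com/b.html", "http://www.example2.com/x"),
        ("http://www.example3.com/", "http://www.example1.com/a.html"),
    ]
    for host, targets in sorted(build_hostgraph(links).items()):
        print(host, "->", sorted(targets))
```

The host out-degree is then the size of each set, and the host in-degree can be obtained by counting how many sets contain a given host.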
  • 19. Hostgraph also exhibits a power-law
        Hostgraph degree
        Collection           In        Out
        Brazil               1.9       1.9
        Chile                2.0       1.7
        Greece               2.0       1.6
        South Korea          1.2       1.4
        Spain                1.8       1.3
        (Bharat... 2001)     1.6-1.7   1.7-1.8
        (Dill... 2002)       2.3       -
  • 20. Web structure: connected components
        “Normal” vs “Giant” strongly connected components
        [Figure: log-log plots, one panel each for Brazil, Chile, Greece, Korea and Spain; horizontal axis from 10^0 to 10^5, vertical axis from 10^-6 to 10^0.]
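The slide compares the sizes of strongly connected components without describing how they were computed. As an illustration only, the sketch below finds component sizes with an iterative version of Kosaraju's algorithm. The function name `scc_sizes` and the adjacency-dict input format are assumptions, and this is not necessarily the method used by the authors.

```python
def scc_sizes(graph):
    """Return the sizes of all strongly connected components, largest first.

    graph: dict mapping a node to an iterable of its successors.
    Uses Kosaraju's two-pass algorithm with explicit stacks (no recursion).
    """
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    finish_order = []
    visited = set()
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                finish_order.append(node)
                stack.pop()

    # Build the reversed graph.
    reversed_graph = {node: [] for node in nodes}
    for node, successors in graph.items():
        for nxt in successors:
            reversed_graph[nxt].append(node)

    # Pass 2: explore the reversed graph in decreasing finish order;
    # each exploration collects exactly one component.
    assigned = set()
    sizes = []
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        size = 0
        stack = [start]
        while stack:
            node = stack.pop()
            size += 1
            for prev in reversed_graph[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

On a crawl of a national domain, the first entry of the result would be the size of the largest ("giant") component, and the remaining entries give the sizes of the other components.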
  • 21. Conclusions
        ✓ Consistent results across collections
        ✓ Differences in the amount of spam
        ✓ Comparison of other aspects [to be available soon]
        Thank you