The Graph Structure of the Web
- Aggregated by Pay-Level Domain
Oliver Lehmberg, Robert Meusel, Christian Bizer

The Graph Structure of the Web - Aggregated by Pay-Level Domain @ Web Science 2014

Transcript of "The Graph Structure of the Web - Aggregated by Pay-Level Domain"

  1. The Graph Structure of the Web - Aggregated by Pay-Level Domain
     Oliver Lehmberg, Robert Meusel, Christian Bizer
     Research Group Data and Web Science
  2. General Knowledge about the Web Graph
     • Broder et al.* in 2000:
       – In- and outdegree follow power laws
       – There is a directed path between two pages in 25% of all cases
       – The Web graph has the bow-tie structure
     * A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In WWW ’00, pages 309–320. North-Holland Publishing Co., 2000.
     Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain Lehmberg/Meusel/Bizer
  3. Our Contributions
     • R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
       – Analysis of the 2012 Web graph on page level
     • This presentation:
       – Analysis of the same graph, aggregated by pay-level domain (PLD)
       – Focus on inter-website connections
       – No intra-website links
     • Additionally:
       – Interconnections between topical groups of websites
       – Public suffix aggregation
  4. DATA SET
  5. Web Data Commons Hyperlink Graph
     • Page level: the largest hyperlink graph available to the public
       – Extracted from Common Crawl
       – 3.5 billion nodes (web pages)
       – 128 billion arcs (hyperlinks)
     • Aggregated by pay-level domain
       – 43 million nodes (websites)
       – 623 million arcs (aggregated hyperlinks)
       – 240 million registered domains in the Web in 2012*, so the PLD graph covers about 18% of them
     • Pay-level domain:
       – dws.informatik.uni-mannheim.de → uni-mannheim.de
     * http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
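The aggregation step on this slide can be sketched as follows. This is a minimal illustration, not the published WDC extraction code: `PUBLIC_SUFFIXES` is a tiny hypothetical stand-in for the full Public Suffix List, the function names are invented for this example, and keeping a per-arc link count is an illustrative choice (the arc set of the PLD graph is simply the set of keys).

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption);
# a real run would load the complete list instead.
PUBLIC_SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return public suffix plus one label, e.g. a.b.example.co.uk -> example.co.uk."""
    labels = host.lower().rstrip(".").split(".")
    # Scanning from the left means the longest matching suffix is found first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels)  # unknown suffix: leave the host unchanged


def aggregate_by_pld(page_links):
    """Collapse page-level arcs (src_url, dst_url) into PLD-level arcs with link counts."""
    arcs = Counter()
    for src, dst in page_links:
        s = pay_level_domain(urlsplit(src).hostname or "")
        d = pay_level_domain(urlsplit(dst).hostname or "")
        if s and d and s != d:  # drop intra-website links
            arcs[(s, d)] += 1
    return arcs
```

For example, `pay_level_domain("dws.informatik.uni-mannheim.de")` yields `uni-mannheim.de`, matching the mapping shown on the slide.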
  6. Downloading the WDC Hyperlink Graph
     http://webdatacommons.org/hyperlinkgraph/
     • 4 aggregation levels:

       Graph                       #Nodes         #Arcs            Size (zipped)
       Page graph                  3.56 billion   128.73 billion   376 GB
       Subdomain graph             101 million    2,043 million    10 GB
       1st level subdomain graph   95 million     1,937 million    9.5 GB
       PLD graph                   43 million     623 million      3.1 GB

     • Extraction code is published under the Apache License
       – Extraction costs per run: ~200 US$ in Amazon EC2 fees
  7. GRAPH HANDS-ON
  8. Node Centrality Ranking
     http://wwwranking.webdatacommons.org
  9. Top PLD Lists

     Rank  Website              Outdegree  Website          Indegree   Website                 PageRank
     1     blogspot.com         3.898.561  wordpress.org    1.822.440  wordpress.org           113,388
     2     wordpress.com        2.249.553  youtube.com      1.319.548  gmpg.org                111,173
     3     youtube.com          1.078.938  wikipedia.org    1.243.291  youtube.com             88,206
     4     wikipedia.org        862.705    gmpg.org         1.156.727  twitter.com             54,644
     5     serebella.com        699.609    blogspot.com     1.034.450  wikipedia.org           54,081
     6     refertus.info        668.271    google.com       782.660    blogspot.com            40,901
     7     top20directory.com   650.884    wordpress.com    710.590    google.com              40,799
     8     typepad.com          551.360    twitter.com      646.239    wordpress.com           28,018
     9     botw.org             496.645    yahoo.com        554.251    yahoo.com               27,594
     10    tumblr.com           496.045    flickr.com       339.231    networkadvertising.org  27,395
     11    dmoz.org             476.890    facebook.com     314.051    apple.com               23,929
     12    vindhetviahier.nl    424.646    apple.com        312.396    phpbb.com               22,329
     13    jcsearch.com         423.918    miibeian.gov.cn  289.605    miibeian.gov.cn         22,165
     14    startpagina.nl       392.543    vimeo.com        269.003    hugedomains.com         20,793
     15    yahoo.com            371.087    tumblr.com       226.596    facebook.com            20,254
     16    tatu.us              370.918    joomla.org       201.863    joomla.org              18,146
     17    freeseek.org         362.310    amazon.com       196.690    flickr.com              17,966
     18    lap.hu               352.668    w3.org           196.507    adobe.com               17,903
     19    blau-webkatalog.com  312.924    nytimes.com      193.907    linkedin.com            16,083
     20    allepaginas.nl       276.578    sourceforge.net  189.663    w3.org                  15,539
  10. Most interlinked PLDs
  11. GRAPH ANALYSIS
  12. In- and Outdegree – Power-Laws?
      Power law: $y \propto x^{-\gamma}$
      Methodology:
      • Clauset et al.*: maximum-likelihood fitting (plfit*²)
      • Goodness-of-fit test
      Indegree results:
      • $x_0 = 3{,}062$
      • $\gamma = 2.40$
      • Cannot reject power law hypothesis
      * Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
      *² https://github.com/ntamas/plfit
  13. In- and Outdegree – Power-Laws?
      Outdegree results:
      • $x_0 = 496$
      • $\gamma = 2.39$
      • Must reject power law hypothesis
      • Yet unclear which distribution fits
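The maximum-likelihood step named in the methodology can be sketched with the approximate discrete estimator from Clauset et al., $\hat{\gamma} \approx 1 + n \left[ \sum_i \ln \frac{x_i}{x_0 - 1/2} \right]^{-1}$. This is an illustration only, not a reimplementation of plfit: the cutoff `x_min` (the slides' $x_0$) is assumed to be given rather than selected automatically, the goodness-of-fit test is omitted, and the function name is hypothetical.

```python
import math


def fit_exponent(degrees, x_min):
    """Approximate discrete MLE for gamma over the tail x >= x_min.

    x_min is assumed to be an integer >= 1 and is taken as given here.
    Returns (gamma, standard_error).
    """
    tail = [x for x in degrees if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    # Sum of log ratios against the shifted cutoff x_min - 1/2.
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    gamma = 1.0 + len(tail) / log_sum
    std_err = (gamma - 1.0) / math.sqrt(len(tail))
    return gamma, std_err
```

Only degrees at or above the cutoff enter the estimate, which is why the slides report $x_0$ alongside $\gamma$.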
  14. Bow-Tie Structure
      Observations:
      • Small IN component
      • Large OUT component
      • TEND and TUBES almost non-existent
      Compared to Broder et al.:
      • Unbalanced
      • LSCC much larger
      Compared to our page graph*:
      • Proportions of IN and OUT exchanged
      • Large fraction of IN pages were merged into LSCC (ca. 1 billion pages)
      * R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
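How the bow-tie components relate to each other can be illustrated on a toy graph. The sketch below is an assumption-laden simplification, not the method used for the slides: it finds the largest strongly connected component by intersecting forward and backward reachability (fine for small graphs, far too slow for 43 million nodes), and it lumps tendrils, tubes, and disconnected nodes into a single `other` bucket. All names are hypothetical.

```python
from collections import defaultdict


def reachable(adj, start):
    """Return all nodes reachable from start (including start) via adj."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie(arcs):
    """Split the nodes of a directed graph into LSCC, IN, OUT and other."""
    fwd, bwd, nodes = defaultdict(set), defaultdict(set), set()
    for u, v in arcs:
        fwd[u].add(v)
        bwd[v].add(u)
        nodes.update((u, v))
    lscc, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # The SCC of n is everything that n reaches and that reaches n.
        scc = reachable(fwd, n) & reachable(bwd, n)
        assigned |= scc
        if len(scc) > len(lscc):
            lscc = scc
    if not lscc:
        return {"LSCC": set(), "IN": set(), "OUT": set(), "other": set()}
    seed = next(iter(lscc))
    in_part = reachable(bwd, seed) - lscc   # can reach the LSCC
    out_part = reachable(fwd, seed) - lscc  # reachable from the LSCC
    other = nodes - lscc - in_part - out_part
    return {"LSCC": lscc, "IN": in_part, "OUT": out_part, "other": other}
```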
  15. Distance Distribution
      Methodology:
      • Approximate the distribution several times (using HyperBall*)
      • Connected pairs: 42.42 (±3.59)%
      • Avg. distance: 4.27 (±0.085)
      • Diameter (at least): 48
      * P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013.
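To make the three reported quantities concrete, the sketch below computes them exactly with one breadth-first search per node. This is deliberately not HyperBall, which approximates the same distribution with probabilistic counters so that it scales to very large graphs. It assumes that "connected pairs" means ordered pairs (u, v), u ≠ v, joined by a directed path; the function name is hypothetical.

```python
from collections import deque


def distance_stats(adj, nodes):
    """Exact fraction of connected ordered pairs, average distance, and diameter.

    adj maps a node to an iterable of its successors; nodes lists all nodes.
    """
    connected = total_dist = max_dist = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        connected += len(dist) - 1          # exclude the pair (s, s)
        total_dist += sum(dist.values())
        max_dist = max(max_dist, max(dist.values()))
    pairs = len(nodes) * (len(nodes) - 1)
    frac = connected / pairs if pairs else 0.0
    avg = total_dist / connected if connected else 0.0
    return frac, avg, max_dist
```

On the chain a → b → c, half of the six ordered pairs are connected, which illustrates why a directed graph can have well under 100% connected pairs even when it is weakly connected.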
  16. High connectivity based on Hubs?
      • LSCC of 51.9%, 42% connected pairs & avg. distance of 4.27
        – How important are hubs in this graph?
      • Approach:
        – A) Remove links to hubs (i.e. nodes with high indegree)
        – B) Keep only links to hubs
        – Repeat this for different indegree values as thresholds and then measure the largest remaining WCC/SCC
      • Results:
        – Removing links to nodes with high indegree: no large SCC once all links to nodes with indegree 10 or higher are removed
        – Removing links to nodes with low indegree: the more links we remove, the more likely the remaining nodes are to be part of the largest SCC
  17. Two Layer Model
      Approach:
      • Remove incoming links from the graph and measure the sizes of the largest SCC/WCC
      Subgraph with indegree < 10
      • 73.7% of all nodes weakly connected
      • No large strongly connected component
      • → Low Degree Layer
      Subgraph with indegree ≥ 10
      • Removed incoming links of 79.2% of all nodes
      • 16.1% of all nodes strongly connected
      • → High Degree Layer
      7/4/2014 Data and Web Science Group
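The link-removal experiment behind the last two slides can be sketched as below. It is a simplified illustration under stated assumptions: indegree is computed once on the original graph, only the largest weakly connected component is measured (the SCC measurement is omitted), and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def filter_by_indegree(arcs, threshold, keep_hubs=False):
    """Variant A (default): drop arcs into nodes with indegree >= threshold.
    Variant B (keep_hubs=True): keep only those arcs.
    arcs must be a list of (source, target) pairs."""
    indeg = Counter(v for _, v in arcs)
    if keep_hubs:
        return [(u, v) for u, v in arcs if indeg[v] >= threshold]
    return [(u, v) for u, v in arcs if indeg[v] < threshold]


def largest_wcc_fraction(arcs, nodes):
    """Fraction of all nodes that lie in the largest weakly connected component."""
    adj = defaultdict(set)
    for u, v in arcs:  # ignore direction for weak connectivity
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for s in nodes:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best / len(nodes) if nodes else 0.0
```

Sweeping `threshold` over a range of indegree values and recording the resulting fraction for both variants mirrors the "repeat for different thresholds" step described in the approach.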
  18. PLD Topic Graph
      Approach:
      • Use topical categories from the Open Directory Project* to categorise our websites
      • 15 topical categories
      Results:
      • “computers”: 6th largest, but largest number of links
      • “shopping”: much more incoming than outgoing links, few internal links
      Conclusion:
      • No obvious patterns, more properties needed
      [Figure: topic graph; legible node labels: health, kids and teens, news]
      * http://dmoz.org
  19. Public Suffix (PS) Graph
      Approach:
      • Top ten PSs from our PLD graph + “others”
      • Generally agrees with Verisign Domain Industry Brief*
      • gTLDs: more external than internal links
      • ccTLDs: more internal than external links
      Extreme cases:
      • .com does not follow this rule
      • .de → half of all links are from a single spammer
      [Figure: public suffix graph; node labels: co.uk, ru, others, org, nl, net, it, info, de, com]
      * http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
  20. WebDataCommons.org also offers:
      1. Corpus of 17 billion RDFa, Microdata, Microformats statements
      2. Corpus of 147 million relational HTML tables
      Thank you for your attention!