SlideShare a Scribd company logo
1 of 4
Download to read offline
Link Analysis in National Web Domains

                       Ricardo Baeza-Yates                                    Carlos Castillo
                        ICREA Professor                                  Department of Technology
                     University Pompeu Fabra                             University Pompeu Fabra
                   ricardo.baeza@upf.edu                              carlos.castillo@upf.edu


                         Abstract                                        Table 1 summarizes the characteristics of the collections.
                                                                      The number of unique hosts was measured by the ISC2 ; the
    The Web can be seen as a graph in which every page                last column is the number of pages actually downloaded.
is a node, and every hyper-link between two pages is an
edge. This Web graph forms a scale-free network: a graph
in which the distribution of the degree of the nodes is very              Table 1. Characteristics of the collections.
skewed. This graph is also self-similar, in terms that a small              Collection  Year Available hosts Pages
part of the graph shares most properties with the entire                                       [mill] (rank) [mill]
graph.                                                                        Brazil    2005    3.9     11th     4.7
    This paper compares the characteristics of several na-                    Chile     2004    0.3     42th     3.3
tional Web domains, by studying the Web graph of large                        Greece    2004    0.3     40th     3.7
collections obtained using a Web crawler; the comparison                    Indochina   2004    0.5     38th     7.4
unveils striking similarities between the Web graphs of very                   Italy    2004    9.3      4th    41.3
different countries.                                                       South Korea 2004     0.2     47th     8.9
                                                                              Spain     2004    1.3     25th    16.2
                                                                              U. K.     2002    4.4     10th    18.5
1 Introduction
                                                                         By observing the number of available hosts and the
    Large samples from specific communities, such as na-               downloaded pages in each collection, we consider that most
tional domains, have a good balance between diversity and             of them have a high coverage. The collections of Brazil
completeness. They include pages inside a common geo-                 and the United Kingdom are smaller samples in compari-
graphical, historical and cultural context that are written by        son with the others, but their sizes are large enough to show
diverse authors in different organizations. National Web do-          results that are consistent with the others.
mains also have a moderate size that allows good accuracy
                                                                         Zipf’s law: the graph representing the connections be-
in the results; because of this, they have attracted the atten-
                                                                      tween Web pages has a scale-free topology. Scale-free net-
tion of several researchers.
                                                                      works, as opposed to random networks, are characterized
    In this paper, we study eight national domains. The
                                                                      by an uneven distribution of links. For a page p, we have
collection studied include four collections obtained using
                                                                      P r(p has k links) ∝ k −θ . We find this distribution on the
WIRE [3]: Brazil (BR domain) [18, 15], Chile (CL domain)
                                                                      Web in almost every aspect, and it is the same distribution
[1, 8, 4], Greece (GR domain) [12] and South Korea (KR
                                                                      found by economist Vilfredo Pareto in 1896 for the distribu-
domain) [7]; three collections obtained from the Laboratory
                                                                      tion of wealth in large populations, and by George K. Zipf
of Web Algorithmics1: Indochina (KH, LA, MM, TH and VN
                                                                      in 1932 for the frequency of words in texts. This distribu-
domains), Italy (IT domain) and the United Kingdom (UK
                                                                      tion later turned out to be applicable to several domains [19]
domain); and one collection obtained using Akwan [10]:
                                                                      and was called by Zipf the law of minimal effort.
Spain (ES domain) [6]. Our 104-million page sample is
less than 1% of the indexable Web [13] but presents charac-              Section 2 studies the Web graph, and section 3 the Host-
teristics that are very similar to those of the full Web.             graph. The last section presents our conclusions.
   1 Laboratory   of   Web   Algorithmics,     Dipartimento di
Scienze dell’Informazione, Universit´ degli
                                    a         studi di Milano,          2 Internet Systems  Consortium’s          domain      survey,
<http://law.dsi.unimi.it/>.                                           <http://www.isc.org/ds/>


                                                                  1
2 Web graph                                                                      Indegree                                         Outdegree

                                                                                          Brazil                                        Brazil
                                                                      10−1                                             10−1
2.1 Degree                                                            10
                                                                        −2                                               −2
                                                                                                                       10
                                                                        −3
                                                                      10                                                 −3
                                                                        −4
                                                                                                                       10
                                                                      10                                                 −4
                                                                        −5                                             10
                                                                      10
   The distributions of the indegree and outdegree are                10−6                                             10
                                                                                                                         −5


shown in Figure 1; both are consistent with a power-law               10
                                                                        −7
                                                                          10
                                                                            0
                                                                                 10
                                                                                      1
                                                                                            10
                                                                                                 2
                                                                                                     10
                                                                                                          3
                                                                                                              10
                                                                                                                   4
                                                                                                                       10
                                                                                                                         −6
                                                                                                                            10
                                                                                                                              0
                                                                                                                                   10
                                                                                                                                     1
                                                                                                                                                 10
                                                                                                                                                   2
                                                                                                                                                       10
                                                                                                                                                         3

distribution. When examining the distribution of outdegree,
we found two different curves: one for smaller outdegrees                                 Chile                                          Chile
                                                                      10−1                                             10−1
–less than 20 to 30 out-links– and another one for larger               −2
                                                                      10                                               10−2
outdegrees. They both show a power-law distribution and               10
                                                                        −3
                                                                        −4                                             10
                                                                                                                         −3
                                                                      10
we estimated the exponents for both parts separately.                 10
                                                                        −5                                             10
                                                                                                                         −4

                                                                        −6                                               −5
                                                                                                                       10
   For the in-degree, the average power-law exponent θ we             10
                                                                      10
                                                                        −7
                                                                                                                       10
                                                                                                                         −6
                                                                            0         1          2        3        4          0      1             2     3
observed was 1.9±0.1; this can be compared with the value                 10     10         10       10       10            10     10            10    10


of 2.1 observed by other authors [9, 11] in samples of the
                                                                                          Greece                                    Greece
global Web. For the out-degree, the exponent was 0.6 ± 0.2            10−1                                             10−1
                                                                        −2
for small outdegrees, and 2.8 ± 0.8 for large out-degrees;            10
                                                                      10
                                                                        −3
                                                                                                                       10−2

                                                                                                                         −3
                                                                                                                       10
the latter can be compared with the parameters 2.7 [9] and            10
                                                                        −4
                                                                        −5                                             10
                                                                                                                         −4
                                                                      10
2.2 [11] found for samples of the global Web.                         10−6                                             10
                                                                                                                         −5

                                                                        −7                                               −6
                                                                      10     0        1          2        3        4
                                                                                                                       10     0      1             2     3
                                                                           10    10         10       10       10            10     10            10    10


2.2 Ranking
                                                                                           Italy                                         Italy
                                                                      10−1                                             10−1
                                                                        −2
                                                                      10                                               10−2
                                                                        −3
   One of the main algorithms for link-based ranking of               10
                                                                        −4
                                                                                                                       10
                                                                                                                         −3
                                                                      10                                                 −4
Web pages is PageRank [16]. We calculated the PageRank                10
                                                                        −5                                             10
                                                                                                                         −5
                                                                      10−6                                             10
distribution for several collections and found a power-law in         10−7                                             10−6
                                                                          100    101        102      103      104          100     101           102   103
the distribution of the obtained scores, with average expo-
nent 1.86 ± 0.06. In theory, the PageRank exponent should
                                                                                          Korea                                         Korea
be similar to the indegree exponent [17] (the value they              10−1                                             10−1
                                                                      10−2                                               −2
measured for the exponent was 2.1), and this is indeed the            10−3
                                                                                                                       10
                                                                                                                       10−3
case. The distribution of PageRank values can be seen in              10−4
                                                                                                                       10−4
                                                                      10−5
Figure 2.                                                             10−6                                             10−5
                                                                      10−7                                             10−6
   We also calculated a static version of the HITS scores                 100    101
                                                                                            102
                                                                                                     103
                                                                                                              104
                                                                                                                           100     101           102   103

[14], counting only external links and calculating the scores
in the whole graph, instead of only on a set of pages.                                    Spain                                          Spain
                                                                      10−1                                             10−1
The tail of the distribution of authority-score also follows          10−2
                                                                                                                       10−2
                                                                      10−3
a power law. In the case of hub-score, it is difficult to assert       10−4
                                                                                                                       10−3
                                                                                                                       10−4
that the data follows a power-law because the frequencies             10−5
                                                                      10−6                                             10−5
seems to be much more dispersed. The average exponent                 10−7                                             10−6
                                                                          100    101        102      103      104          100     101           102   103
observed was 3.0 ± 0.5 for hub score, and 1.84 ± 0.01 for
authority score.                                                                          U.K.                                           U.K.
                                                                      10−1                                             10−1
                                                                      10−2                                             10−2

                                                                      10−3
                                                                                                                       10−3
3 Hostgraph                                                           10−4
                                                                                                                       10−4
                                                                        −5
                                                                      10
                                                                      10−6                                             10−5
                                                                      10−7                                             10−6
                                                                          100      1          2        3        4
                                                                                                                           100     101           102   103
    We studied the hostgraph [11], this is, the graph created                    10         10       10       10

by changing all the nodes representing Web pages in the
same Web site by a single node representing the Web site.             Figure 1. Histograms of the indegree and out-
The hostgraph is a graph in which there is a node for each            degree of Web pages, including a fit for a
Web site, and two nodes A and B are connected iff there is            power-law distribution.
at least one link on site A pointing to a page in site B. In
this section, we consider only the collections from which
we have a hostgraph.

                                                                  2
Brazil                                        Chile                                               Greece                                                                                            Korea
       10-2                                           10-2                                             10-2                                                                                    10-2
            -3                                          -3                                               -3                                                                                      -3
       10                                             10                                               10                                                                                      10
            -4                                          -4                                               -4                                                                                      -4
       10                                             10                                               10                                                                                      10
            -5                                          -5                                               -5                                                                                      -5
       10                                             10                                               10                                                                                      10
            -6                                          -6                                               -6                                                                                      -6
       10                                             10                                               10                                                                                      10
            -7                                          -7                                               -7                                                                                      -7
       10         -7        -6         -5        -4
                                                      10        -7        -6        -5        -4
                                                                                                       10     -7               -6                       -5                    -4
                                                                                                                                                                                               10          -7                  -6                     -5                     -4
             10        10         10        10             10        10        10        10                 10            10                          10                10                           10                   10                     10                     10



                             Brazil                                        Chile                                               Greece                                                                                            Korea
       10-3                                           10-3                                             10-3                                                                                    10-3
            -4                                          -4                                               -4                                                                                      -4
       10                                             10                                               10                                                                                      10
            -5                                          -5                                               -5                                                                                      -5
       10                                             10                                               10                                                                                      10
            -6                                          -6                                               -6                                                                                      -6
       10                                             10                                               10                                                                                      10
            -7                                          -7                                               -7                                                                                      -7
       10                                             10                                               10                                                                                      10
                  -7        -6         -5        -4             -7        -6        -5        -4              -7               -6                       -5                    -4                           -7                  -6                  -5                        -4
             10        10         10        10             10        10        10        10                 10            10                          10                10                           10                   10                     10                     10



                             Brazil                                        Chile                                               Greece                                                                                            Korea
       10-3                                           10-3                                             10-3                                                                                    10-3
         -4                                             -4                                               -4                                                                                      -4
       10                                             10                                               10                                                                                      10
         -5                                             -5                                               -5                                                                                      -5
       10                                             10                                               10                                                                                      10
         -6                                             -6                                               -6                                                                                      -6
       10                                             10                                               10                                                                                      10
         -7                                             -7                                               -7                                                                                      -7
       10        -7     -6            -5     -4
                                                      10     -7       -6           -5     -4
                                                                                                       10     -7               -6                       -5                    -4
                                                                                                                                                                                               10          -7                  -6                  -5                        -4
             10        10         10        10             10        10        10        10                 10            10                          10                10                           10                   10                     10                     10



    Figure 2. Histograms of the scores using PageRank (top), hubs (middle) and authorities (bottom).


3.1 Degree                                                                                                  3.2 Web structure


   The average indegree per Web site (average number of                                                        Broder et al. [9] proposed a partition of the Web graph
different Web sites inside the same country linking to a                                                    based on the relationship of pages with the larger strongly
given Web site) was 3.5 for Brazil, 1.2 for Chile, 1.6 for                                                  connected component (SCC) on the graph. The pages in
Greece, 37.0 for South Korea and 1.5 for Spain. The his-                                                    the larger strongly connected component belong to the cate-
tograms of indegree is consistent with a Zipfian distribution,                                               gory MAIN. All the pages reachable from MAIN by follow-
with parameter 1.8 ± 0.3.                                                                                   ing links forwards belong to the category OUT, and by fol-
   By manual inspection we observed that in Brazil and                                                      lowing links backwards to the category IN. The rest of the
specially in South Korea, there is a significant use –and                                                    Web that is weakly connected (disregarding the direction of
abuse– of DNS wildcarding. DNS wildcarding is a way                                                         links) to MAIN is in a component called TENDRILS.
of configuring DNS servers so they reply with the same IP                                                        In [2] we showed that this macroscopic structure is sim-
address no matter which host name is used in a DNS query.                                                   ilar at the hostgraph level: the hostgraphs we examined
    The average outdegree per Web site (average number                                                      are scale-free networks and have a giant strongly connected
of different Web sites inside the same country linked by a                                                  component. We observed that distribution of the sizes of
given Web site) was 2.2 for Brazil, 2.4 for Chile, 4.8 for                                                  their strongly connected components is shown in Figure 3.
Greece, 16.5 for South Korea and 11.2 for Spain. The dis-
tribution of outdegree also exhibits a power-law with pa-
rameter 1.6 ± 0.3.                                                                                                       Brazil                                                                      Chile                                                                   Greece
                                                                                                        100                                                              100                                                                      100
                                                                                                       10-1                                                             10-1                                                                     10-1

    We also measured the number of internal links, that is,                                            10-2
                                                                                                       10-3
                                                                                                                                                                        10-2
                                                                                                                                                                        10-3
                                                                                                                                                                                                                                                 10
                                                                                                                                                                                                                                                 10-3
                                                                                                                                                                                                                                                      -2



                                                                                                       10-4                                                             10-4                                                                     10-4
links going to pages inside the same Web site. We normal-                                              10-5                                                             10-5                                                                     10-5
                                                                                                       10-6                                                             10-6                                                                     10-6
ized this by the number of pages in each Web site, to be                                                   100     101   102        103         104         105             100          101         102        103      104         105             100         101         102   103   104   105


able to compare values. We observed a combination of two
                                                                                                                                                                  Korea                                                                    Spain
power-law distributions: one for Web sites with up to 10 in-                                                                         100                                                                         100
                                                                                                                                    10-1                                                                        10-1
ternal links per Web page on average, and one for Web sites                                                                         10-2
                                                                                                                                    10-3
                                                                                                                                                                                                                10-2
                                                                                                                                                                                                                10-3
with more internal links per Web page. For the sites with                                                                           10-4
                                                                                                                                    10-5
                                                                                                                                                                                                                10-4
                                                                                                                                                                                                                10-5

less than 10 internal links per page on average, the parame-                                                                        10-6
                                                                                                                                          100         101         102   103        104         105
                                                                                                                                                                                                                10-6 0
                                                                                                                                                                                                                    10         101         102   103       104         105

ter for the power-law was 1.1 ± 0.3, and for sites with more
internal links per page on average, 3.0 ± 0.3.                                                                       Figure 3. Histograms of the sizes of SCCs.


                                                                                                   3
The parameter for the power-law distribution was 2.7 ±              [4] R. Baeza-Yates and C. Castillo. Caracter´       ısticas de la Web
0.7. In Chile, Greece and Spain, a sole giant SCC appears                   Chilena 2004. Technical report, Center for Web Research,
having at least 2 orders of magnitude more Web sites than                   University of Chile, 2005.
the second largest SCC component. In the case of Brazil,                [5] R. Baeza-Yates and C. Castillo. Characterization of national
there are two giant SCCs. The larger one is a “natural” one,                Web domains. Technical report, Universitat Pompeu Fabra,
                                                                            July 2005.
containing Web sites from different domains. The second
                                                                        [6] R. Baeza-Yates, C. Castillo, and V. L´    opez. Caracter´   ısticas
larger is an “artificial” one, containing only Web sites under
                                                                            de la Web de Espa˜ a. Technical report, Universitat Pompeu
                                                                                                 n
a domain that uses DNS wildcarding to create a “link farm”                  Fabra, 2005.
(a strongly connected community of mutual links). In the                [7] R. Baeza-Yates and F. Lalanne. Characteristics of the Korean
case of South Korea, we detected at least 5 large link farms.               Web. Technical report, Korea–Chile IT Cooperation Center
    Regarding the Web structure, the distribution between                   ITCC, 2004.
sites in general gives the component called OUT a large                 [8] R. Baeza-Yates and B. Poblete. Evolution of the Chilean Web
share. If we do not consider sites that are weakly connected                structure composition. In Proceedings of Latin American Web
to MAIN, IN has on average 8% of the sites, MAIN 28%,                       Conference, pages 11–13, Santiago, Chile, 2003. IEEE CS
OUT 58% and TENDRILS 6%. The sites that are discon-                         Press.
nected from MAIN are 40% on average, but contribute less                [9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Ra-
than 10% of the pages.                                                      jagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph struc-
                                                                            ture in the Web: Experiments and models. In Proceedings
                                                                            of the Ninth Conference on World Wide Web, pages 309–320,
4 Conclusions                                                               Amsterdam, Netherlands, May 2000. ACM Press.
                                                                        [10] A. S. da Silva, E. A. Veloso, P. B. Golgher, , A. H. F. Laender,
                                                                            and N. Ziviani. CoBWeb - A crawler for the Brazilian Web. In
   Even when the collections were obtained from countries                   Proceedings of String Processing and Information Retrieval
with different economical, historical and geographical con-                 (SPIRE), pages 184–191, Cancun, Mxico, 1999. IEEE CS
texts, and speaking different languages we observed that                    Press.
the results across different collections are always consis-             [11] S. Dill, R. Kumar, K. S. Mccurley, S. Rajagopalan,
tent when the observed characteristic exhibits a power-law                  D. Sivakumar, and A. Tomkins. Self-similarity in the web.
in one collection. In this class we include the distribution of             ACM Trans. Inter. Tech., 2(3):205–223, 2002.
degrees, link-based scores, internal links, etc.                        [12] E. Efthimiadis and C. Castillo. Charting the Greek Web.
   Besides links, we are working in a detailed account of                   In Proceedings of the Conference of the American Society
the characteristics of the contents and technologies used in                for Information Science and Technology (ASIST), Providence,
several collections [5].                                                    Rhode Island, USA, November 2004. American Society for
                                                                            Information Science and Technology.
   Acknowledgments: We worked with Vicente L´ pez ino                   [13] A. Gulli and A. Signorini. The indexable Web is more than
the study of the Spanish Web, with Efthimis N. Efthimiadis                  11.5 billion pages. In Poster proceedings of the 14th interna-
in the study of the Greek Web, with Felipe Ortiz, B´ rbara
                                                     a                      tional conference on World Wide Web, pages 902–903, Chiba,
Poblete and Felipe Saint-Jean in the studies of the Chilean                 Japan, 2005. ACM Press.
                                                                        [14] J. M. Kleinberg. Authoritative sources in a hyperlinked en-
Web and with Felipe Lalanne in the study of the Korean
                                                                            vironment. Journal of the ACM, 46(5):604–632, 1999.
Web. We also thank the Laboratory of Web Algorithmics
                                                                        [15] M. Modesto, ´ Pereira, N. Ziviani, C. Castillo, and R. Baeza-
                                                                                            a.
for making their Web collections available for research.                    Yates. Un novo retrato da Web Brasileira. In Proceedings of
                                                                            SEMISH, S˜ o Leopoldo, Brazil, 2005.
                                                                                        a
References                                                              [16] L. Page, S. Brin, R. Motwani, and T. Winograd. The Page-
                                                                            Rank citation ranking: bringing order to the Web. Technical
                                                                            report, Stanford Digital Library Technologies Project, 1998.
[1] R. Baeza-Yates and C. Castillo. Caracterizando la Web               [17] G. Pandurangan, P. Raghavan, and E. Upfal. Using Page-
    Chilena. In Encuentro chileno de ciencias de la computaci´ n,
                                                             o              rank to characterize Web structure. In Proceedings of the 8th
    Punta Arenas, Chile, 2000. Sociedad Chilena de Ciencias de              Annual International Computing and Combinatorics Confer-
    la Computaci´  on.                                                      ence (COCOON), volume 2387 of Lecture Notes in Computer
[2] R. Baeza-Yates and C. Castillo. Relating Web characteristics            Science, pages 330–390, Singapore, August 2002. Springer.
    with link based Web page ranking. In Proceedings of String          [18] E. A. Veloso, E. de Moura, P. Golgher, A. da Silva,
    Processing and Information Retrieval SPIRE, pages 21–32,                R. Almeida, A. Laender, R. B. Neto, and N. Ziviani. Um
    Laguna San Rafael, Chile, 2001. IEEE CS Press.                          retrato da Web Brasileira. In Proceedings of Simposio
[3] R. Baeza-Yates and C. Castillo. Balancing volume, quality               Brasileiro de Computacao, Curitiba, Brasil, 2000.
    and freshness in Web crawling. In Soft Computing Systems -          [19] G. K. Zipf. Human behavior and the principle of least ef-
    Design, Management and Applications, pages 565–572, San-                fort: An introduction to human ecology. Addison-Wesley,
    tiago, Chile, 2002. IOS Press Amsterdam.                                Cambridge, MA, USA, 1949.


                                                                    4

More Related Content

Similar to Link Analysis in National Web Domains

Characterization of National Web Domains
Characterization of National Web DomainsCharacterization of National Web Domains
Characterization of National Web Domainswebhostingguy
 
Characterization of National Web Domains
Characterization of National Web DomainsCharacterization of National Web Domains
Characterization of National Web Domainswebhostingguy
 
Fiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents FicheFiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents FicheChristopher Brown
 
Towards a Shared European and National State of the Environment - SoER makes ...
Towards a Shared European and National State of the Environment - SoER makes ...Towards a Shared European and National State of the Environment - SoER makes ...
Towards a Shared European and National State of the Environment - SoER makes ...Antonio De Marinis
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)Emil Eifrem
 
Map4rdf - Faceted Browser for Geospatial Datasets
Map4rdf - Faceted Browser for Geospatial DatasetsMap4rdf - Faceted Browser for Geospatial Datasets
Map4rdf - Faceted Browser for Geospatial DatasetsBoris Villazón-Terrazas
 
2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final
2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final
2016_NGWA_SUMMIT_HydroDaVE_Presentation_FinalEric Chiang
 
NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)
NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)
NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)Emil Eifrem
 
Comparing Vocabularies for Representing Geographical Features and Their Geometry
Comparing Vocabularies for Representing Geographical Features and Their GeometryComparing Vocabularies for Representing Geographical Features and Their Geometry
Comparing Vocabularies for Representing Geographical Features and Their GeometryGhislain Atemezing
 
rworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global Datarworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global DataDr. Volkan OBAN
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA KeynoteAxel Polleres
 
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and SqoopAnalysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoopsushantparte
 
Bingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman PresentationBingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman PresentationWARCnet
 
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...CONUL Conference
 
Pal gov.tutorial2.session14.lab rdf-dataintegration
Pal gov.tutorial2.session14.lab rdf-dataintegrationPal gov.tutorial2.session14.lab rdf-dataintegration
Pal gov.tutorial2.session14.lab rdf-dataintegrationMustafa Jarrar
 
Pal gov.tutorial2.session4.lab xml document and schemas
Pal gov.tutorial2.session4.lab xml  document and schemasPal gov.tutorial2.session4.lab xml  document and schemas
Pal gov.tutorial2.session4.lab xml document and schemasMustafa Jarrar
 
TERN Facility Portals - Stuart Phinn
TERN Facility Portals - Stuart PhinnTERN Facility Portals - Stuart Phinn
TERN Facility Portals - Stuart PhinnTERN Australia
 

Similar to Link Analysis in National Web Domains (20)

GeoLinkedData
GeoLinkedDataGeoLinkedData
GeoLinkedData
 
Characterization of National Web Domains
Characterization of National Web DomainsCharacterization of National Web Domains
Characterization of National Web Domains
 
Characterization of National Web Domains
Characterization of National Web DomainsCharacterization of National Web Domains
Characterization of National Web Domains
 
Fiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents FicheFiche Online: A Vision for Digitizing All Documents Fiche
Fiche Online: A Vision for Digitizing All Documents Fiche
 
Towards a Shared European and National State of the Environment - SoER makes ...
Towards a Shared European and National State of the Environment - SoER makes ...Towards a Shared European and National State of the Environment - SoER makes ...
Towards a Shared European and National State of the Environment - SoER makes ...
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
 
Map4rdf - Faceted Browser for Geospatial Datasets
Map4rdf - Faceted Browser for Geospatial DatasetsMap4rdf - Faceted Browser for Geospatial Datasets
Map4rdf - Faceted Browser for Geospatial Datasets
 
2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final
2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final
2016_NGWA_SUMMIT_HydroDaVE_Presentation_Final
 
Going for GOLD - Adventures in Open Linked Metadata
Going for GOLD - Adventures in Open Linked MetadataGoing for GOLD - Adventures in Open Linked Metadata
Going for GOLD - Adventures in Open Linked Metadata
 
NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)
NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)
NOSQL Overview, Neo4j Intro And Production Example (QCon London 2010)
 
Phd defense slides
Phd defense slidesPhd defense slides
Phd defense slides
 
Comparing Vocabularies for Representing Geographical Features and Their Geometry
Comparing Vocabularies for Representing Geographical Features and Their GeometryComparing Vocabularies for Representing Geographical Features and Their Geometry
Comparing Vocabularies for Representing Geographical Features and Their Geometry
 
rworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global Datarworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global Data
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA Keynote
 
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and SqoopAnalysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop
 
Bingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman PresentationBingham, De Wild & Aasman Presentation
Bingham, De Wild & Aasman Presentation
 
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
 
Pal gov.tutorial2.session14.lab rdf-dataintegration
Pal gov.tutorial2.session14.lab rdf-dataintegrationPal gov.tutorial2.session14.lab rdf-dataintegration
Pal gov.tutorial2.session14.lab rdf-dataintegration
 
Pal gov.tutorial2.session4.lab xml document and schemas
Pal gov.tutorial2.session4.lab xml  document and schemasPal gov.tutorial2.session4.lab xml  document and schemas
Pal gov.tutorial2.session4.lab xml document and schemas
 
TERN Facility Portals - Stuart Phinn
TERN Facility Portals - Stuart PhinnTERN Facility Portals - Stuart Phinn
TERN Facility Portals - Stuart Phinn
 

More from webhostingguy

Running and Developing Tests with the Apache::Test Framework
Running and Developing Tests with the Apache::Test FrameworkRunning and Developing Tests with the Apache::Test Framework
Running and Developing Tests with the Apache::Test Frameworkwebhostingguy
 
MySQL and memcached Guide
MySQL and memcached GuideMySQL and memcached Guide
MySQL and memcached Guidewebhostingguy
 
Novell® iChain® 2.3
Novell® iChain® 2.3Novell® iChain® 2.3
Novell® iChain® 2.3webhostingguy
 
Load-balancing web servers Load-balancing web servers
Load-balancing web servers Load-balancing web serversLoad-balancing web servers Load-balancing web servers
Load-balancing web servers Load-balancing web serverswebhostingguy
 
SQL Server 2008 Consolidation
SQL Server 2008 ConsolidationSQL Server 2008 Consolidation
SQL Server 2008 Consolidationwebhostingguy
 
Master Service Agreement
Master Service AgreementMaster Service Agreement
Master Service Agreementwebhostingguy
 
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
PHP and MySQL PHP Written as a set of CGI binaries in C in ...PHP and MySQL PHP Written as a set of CGI binaries in C in ...
PHP and MySQL PHP Written as a set of CGI binaries in C in ...webhostingguy
 
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
Dell Reference Architecture Guide Deploying Microsoft® SQL ...Dell Reference Architecture Guide Deploying Microsoft® SQL ...
Dell Reference Architecture Guide Deploying Microsoft® SQL ...webhostingguy
 
Managing Diverse IT Infrastructure
Managing Diverse IT InfrastructureManaging Diverse IT Infrastructure
Managing Diverse IT Infrastructurewebhostingguy
 
Web design for business.ppt
Web design for business.pptWeb design for business.ppt
Web design for business.pptwebhostingguy
 
IT Power Management Strategy
IT Power Management Strategy IT Power Management Strategy
IT Power Management Strategy webhostingguy
 
Excel and SQL Quick Tricks for Merchandisers
Excel and SQL Quick Tricks for MerchandisersExcel and SQL Quick Tricks for Merchandisers
Excel and SQL Quick Tricks for Merchandiserswebhostingguy
 
Parallels Hosting Products
Parallels Hosting ProductsParallels Hosting Products
Parallels Hosting Productswebhostingguy
 
Microsoft PowerPoint presentation 2.175 Mb
Microsoft PowerPoint presentation 2.175 MbMicrosoft PowerPoint presentation 2.175 Mb
Microsoft PowerPoint presentation 2.175 Mbwebhostingguy
 
Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...
Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...
Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...webhostingguy
 

More from webhostingguy (20)

Running and Developing Tests with the Apache::Test Framework
Running and Developing Tests with the Apache::Test FrameworkRunning and Developing Tests with the Apache::Test Framework
Running and Developing Tests with the Apache::Test Framework
 
MySQL and memcached Guide
MySQL and memcached GuideMySQL and memcached Guide
MySQL and memcached Guide
 
Novell® iChain® 2.3
Novell® iChain® 2.3Novell® iChain® 2.3
Novell® iChain® 2.3
 
Load-balancing web servers Load-balancing web servers
Load-balancing web servers Load-balancing web serversLoad-balancing web servers Load-balancing web servers
Load-balancing web servers Load-balancing web servers
 
SQL Server 2008 Consolidation
SQL Server 2008 ConsolidationSQL Server 2008 Consolidation
SQL Server 2008 Consolidation
 
What is mod_perl?
What is mod_perl?What is mod_perl?
What is mod_perl?
 
What is mod_perl?
What is mod_perl?What is mod_perl?
What is mod_perl?
 
Master Service Agreement
Master Service AgreementMaster Service Agreement
Master Service Agreement
 
Notes8
Notes8Notes8
Notes8
 
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
PHP and MySQL PHP Written as a set of CGI binaries in C in ...PHP and MySQL PHP Written as a set of CGI binaries in C in ...
PHP and MySQL PHP Written as a set of CGI binaries in C in ...
 
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
Dell Reference Architecture Guide Deploying Microsoft® SQL ...Dell Reference Architecture Guide Deploying Microsoft® SQL ...
Dell Reference Architecture Guide Deploying Microsoft® SQL ...
 
Managing Diverse IT Infrastructure
Managing Diverse IT InfrastructureManaging Diverse IT Infrastructure
Managing Diverse IT Infrastructure
 
Web design for business.ppt
Web design for business.pptWeb design for business.ppt
Web design for business.ppt
 
IT Power Management Strategy
IT Power Management Strategy IT Power Management Strategy
IT Power Management Strategy
 
Excel and SQL Quick Tricks for Merchandisers
Excel and SQL Quick Tricks for MerchandisersExcel and SQL Quick Tricks for Merchandisers
Excel and SQL Quick Tricks for Merchandisers
 
OLUG_xen.ppt
OLUG_xen.pptOLUG_xen.ppt
OLUG_xen.ppt
 
Parallels Hosting Products
Parallels Hosting ProductsParallels Hosting Products
Parallels Hosting Products
 
Microsoft PowerPoint presentation 2.175 Mb
Microsoft PowerPoint presentation 2.175 MbMicrosoft PowerPoint presentation 2.175 Mb
Microsoft PowerPoint presentation 2.175 Mb
 
Reseller's Guide
Reseller's GuideReseller's Guide
Reseller's Guide
 
Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...
Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...
Installation of MySQL 5.1 Cluster Software on the Solaris 10 ...
 

Link Analysis in National Web Domains

  • 1. Link Analysis in National Web Domains Ricardo Baeza-Yates Carlos Castillo ICREA Professor Department of Technology University Pompeu Fabra University Pompeu Fabra ricardo.baeza@upf.edu carlos.castillo@upf.edu Abstract Table 1 summarizes the characteristics of the collections. The number of unique hosts was measured by the ISC2 ; the The Web can be seen as a graph in which every page last column is the number of pages actually downloaded. is a node, and every hyper-link between two pages is an edge. This Web graph forms a scale-free network: a graph in which the distribution of the degree of the nodes is very Table 1. Characteristics of the collections. skewed. This graph is also self-similar, in terms that a small Collection Year Available hosts Pages part of the graph shares most properties with the entire [mill] (rank) [mill] graph. Brazil 2005 3.9 11th 4.7 This paper compares the characteristics of several na- Chile 2004 0.3 42th 3.3 tional Web domains, by studying the Web graph of large Greece 2004 0.3 40th 3.7 collections obtained using a Web crawler; the comparison Indochina 2004 0.5 38th 7.4 unveils striking similarities between the Web graphs of very Italy 2004 9.3 4th 41.3 different countries. South Korea 2004 0.2 47th 8.9 Spain 2004 1.3 25th 16.2 U. K. 2002 4.4 10th 18.5 1 Introduction By observing the number of available hosts and the Large samples from specific communities, such as na- downloaded pages in each collection, we consider that most tional domains, have a good balance between diversity and of them have a high coverage. The collections of Brazil completeness. They include pages inside a common geo- and the United Kingdom are smaller samples in compari- graphical, historical and cultural context that are written by son with the others, but their sizes are large enough to show diverse authors in different organizations. National Web do- results that are consistent with the others. mains also have a moderate size that allows good accuracy Zipf’s law: the graph representing the connections be- in the results; because of this, they have attracted the atten- tween Web pages has a scale-free topology. Scale-free net- tion of several researchers. works, as opposed to random networks, are characterized In this paper, we study eight national domains. The by an uneven distribution of links. For a page p, we have collection studied include four collections obtained using P r(p has k links) ∝ k −θ . We find this distribution on the WIRE [3]: Brazil (BR domain) [18, 15], Chile (CL domain) Web in almost every aspect, and it is the same distribution [1, 8, 4], Greece (GR domain) [12] and South Korea (KR found by economist Vilfredo Pareto in 1896 for the distribu- domain) [7]; three collections obtained from the Laboratory tion of wealth in large populations, and by George K. Zipf of Web Algorithmics1: Indochina (KH, LA, MM, TH and VN in 1932 for the frequency of words in texts. This distribu- domains), Italy (IT domain) and the United Kingdom (UK tion later turned out to be applicable to several domains [19] domain); and one collection obtained using Akwan [10]: and was called by Zipf the law of minimal effort. Spain (ES domain) [6]. Our 104-million page sample is less than 1% of the indexable Web [13] but presents charac- Section 2 studies the Web graph, and section 3 the Host- teristics that are very similar to those of the full Web. graph. The last section presents our conclusions. 1 Laboratory of Web Algorithmics, Dipartimento di Scienze dell’Informazione, Universit´ degli a studi di Milano, 2 Internet Systems Consortium’s domain survey, <http://law.dsi.unimi.it/>. <http://www.isc.org/ds/> 1
  • 2. 2 Web graph Indegree Outdegree Brazil Brazil 10−1 10−1 2.1 Degree 10 −2 −2 10 −3 10 −3 −4 10 10 −4 −5 10 10 The distributions of the indegree and outdegree are 10−6 10 −5 shown in Figure 1; both are consistent with a power-law 10 −7 10 0 10 1 10 2 10 3 10 4 10 −6 10 0 10 1 10 2 10 3 distribution. When examining the distribution of outdegree, we found two different curves: one for smaller outdegrees Chile Chile 10−1 10−1 –less than 20 to 30 out-links– and another one for larger −2 10 10−2 outdegrees. They both show a power-law distribution and 10 −3 −4 10 −3 10 we estimated the exponents for both parts separately. 10 −5 10 −4 −6 −5 10 For the in-degree, the average power-law exponent θ we 10 10 −7 10 −6 0 1 2 3 4 0 1 2 3 observed was 1.9±0.1; this can be compared with the value 10 10 10 10 10 10 10 10 10 of 2.1 observed by other authors [9, 11] in samples of the Greece Greece global Web. For the out-degree, the exponent was 0.6 ± 0.2 10−1 10−1 −2 for small outdegrees, and 2.8 ± 0.8 for large out-degrees; 10 10 −3 10−2 −3 10 the latter can be compared with the parameters 2.7 [9] and 10 −4 −5 10 −4 10 2.2 [11] found for samples of the global Web. 10−6 10 −5 −7 −6 10 0 1 2 3 4 10 0 1 2 3 10 10 10 10 10 10 10 10 10 2.2 Ranking Italy Italy 10−1 10−1 −2 10 10−2 −3 One of the main algorithms for link-based ranking of 10 −4 10 −3 10 −4 Web pages is PageRank [16]. We calculated the PageRank 10 −5 10 −5 10−6 10 distribution for several collections and found a power-law in 10−7 10−6 100 101 102 103 104 100 101 102 103 the distribution of the obtained scores, with average expo- nent 1.86 ± 0.06. In theory, the PageRank exponent should Korea Korea be similar to the indegree exponent [17] (the value they 10−1 10−1 10−2 −2 measured for the exponent was 2.1), and this is indeed the 10−3 10 10−3 case. The distribution of PageRank values can be seen in 10−4 10−4 10−5 Figure 2. 10−6 10−5 10−7 10−6 We also calculated a static version of the HITS scores 100 101 102 103 104 100 101 102 103 [14], counting only external links and calculating the scores in the whole graph, instead of only on a set of pages. Spain Spain 10−1 10−1 The tail of the distribution of authority-score also follows 10−2 10−2 10−3 a power law. In the case of hub-score, it is difficult to assert 10−4 10−3 10−4 that the data follows a power-law because the frequencies 10−5 10−6 10−5 seems to be much more dispersed. The average exponent 10−7 10−6 100 101 102 103 104 100 101 102 103 observed was 3.0 ± 0.5 for hub score, and 1.84 ± 0.01 for authority score. U.K. U.K. 10−1 10−1 10−2 10−2 10−3 10−3 3 Hostgraph 10−4 10−4 −5 10 10−6 10−5 10−7 10−6 100 1 2 3 4 100 101 102 103 We studied the hostgraph [11], this is, the graph created 10 10 10 10 by changing all the nodes representing Web pages in the same Web site by a single node representing the Web site. Figure 1. Histograms of the indegree and out- The hostgraph is a graph in which there is a node for each degree of Web pages, including a fit for a Web site, and two nodes A and B are connected iff there is power-law distribution. at least one link on site A pointing to a page in site B. In this section, we consider only the collections from which we have a hostgraph. 2
  • 3. Brazil Chile Greece Korea 10-2 10-2 10-2 10-2 -3 -3 -3 -3 10 10 10 10 -4 -4 -4 -4 10 10 10 10 -5 -5 -5 -5 10 10 10 10 -6 -6 -6 -6 10 10 10 10 -7 -7 -7 -7 10 -7 -6 -5 -4 10 -7 -6 -5 -4 10 -7 -6 -5 -4 10 -7 -6 -5 -4 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 Brazil Chile Greece Korea 10-3 10-3 10-3 10-3 -4 -4 -4 -4 10 10 10 10 -5 -5 -5 -5 10 10 10 10 -6 -6 -6 -6 10 10 10 10 -7 -7 -7 -7 10 10 10 10 -7 -6 -5 -4 -7 -6 -5 -4 -7 -6 -5 -4 -7 -6 -5 -4 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 Brazil Chile Greece Korea 10-3 10-3 10-3 10-3 -4 -4 -4 -4 10 10 10 10 -5 -5 -5 -5 10 10 10 10 -6 -6 -6 -6 10 10 10 10 -7 -7 -7 -7 10 -7 -6 -5 -4 10 -7 -6 -5 -4 10 -7 -6 -5 -4 10 -7 -6 -5 -4 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 Figure 2. Histograms of the scores using PageRank (top), hubs (middle) and authorities (bottom). 3.1 Degree 3.2 Web structure The average indegree per Web site (average number of Broder et al. [9] proposed a partition of the Web graph different Web sites inside the same country linking to a based on the relationship of pages with the larger strongly given Web site) was 3.5 for Brazil, 1.2 for Chile, 1.6 for connected component (SCC) on the graph. The pages in Greece, 37.0 for South Korea and 1.5 for Spain. The his- the larger strongly connected component belong to the cate- tograms of indegree is consistent with a Zipfian distribution, gory MAIN. All the pages reachable from MAIN by follow- with parameter 1.8 ± 0.3. ing links forwards belong to the category OUT, and by fol- By manual inspection we observed that in Brazil and lowing links backwards to the category IN. The rest of the specially in South Korea, there is a significant use –and Web that is weakly connected (disregarding the direction of abuse– of DNS wildcarding. DNS wildcarding is a way links) to MAIN is in a component called TENDRILS. of configuring DNS servers so they reply with the same IP In [2] we showed that this macroscopic structure is sim- address no matter which host name is used in a DNS query. ilar at the hostgraph level: the hostgraphs we examined The average outdegree per Web site (average number are scale-free networks and have a giant strongly connected of different Web sites inside the same country linked by a component. We observed that distribution of the sizes of given Web site) was 2.2 for Brazil, 2.4 for Chile, 4.8 for their strongly connected components is shown in Figure 3. Greece, 16.5 for South Korea and 11.2 for Spain. The dis- tribution of outdegree also exhibits a power-law with pa- rameter 1.6 ± 0.3. Brazil Chile Greece 100 100 100 10-1 10-1 10-1 We also measured the number of internal links, that is, 10-2 10-3 10-2 10-3 10 10-3 -2 10-4 10-4 10-4 links going to pages inside the same Web site. We normal- 10-5 10-5 10-5 10-6 10-6 10-6 ized this by the number of pages in each Web site, to be 100 101 102 103 104 105 100 101 102 103 104 105 100 101 102 103 104 105 able to compare values. We observed a combination of two Korea Spain power-law distributions: one for Web sites with up to 10 in- 100 100 10-1 10-1 ternal links per Web page on average, and one for Web sites 10-2 10-3 10-2 10-3 with more internal links per Web page. For the sites with 10-4 10-5 10-4 10-5 less than 10 internal links per page on average, the parame- 10-6 100 101 102 103 104 105 10-6 0 10 101 102 103 104 105 ter for the power-law was 1.1 ± 0.3, and for sites with more internal links per page on average, 3.0 ± 0.3. Figure 3. Histograms of the sizes of SCCs. 3
  • 4. The parameter for the power-law distribution was 2.7 ± [4] R. Baeza-Yates and C. Castillo. Caracter´ ısticas de la Web 0.7. In Chile, Greece and Spain, a sole giant SCC appears Chilena 2004. Technical report, Center for Web Research, having at least 2 orders of magnitude more Web sites than University of Chile, 2005. the second largest SCC component. In the case of Brazil, [5] R. Baeza-Yates and C. Castillo. Characterization of national there are two giant SCCs. The larger one is a “natural” one, Web domains. Technical report, Universitat Pompeu Fabra, July 2005. containing Web sites from different domains. The second [6] R. Baeza-Yates, C. Castillo, and V. L´ opez. Caracter´ ısticas larger is an “artificial” one, containing only Web sites under de la Web de Espa˜ a. Technical report, Universitat Pompeu n a domain that uses DNS wildcarding to create a “link farm” Fabra, 2005. (a strongly connected community of mutual links). In the [7] R. Baeza-Yates and F. Lalanne. Characteristics of the Korean case of South Korea, we detected at least 5 large link farms. Web. Technical report, Korea–Chile IT Cooperation Center Regarding the Web structure, the distribution between ITCC, 2004. sites in general gives the component called OUT a large [8] R. Baeza-Yates and B. Poblete. Evolution of the Chilean Web share. If we do not consider sites that are weakly connected structure composition. In Proceedings of Latin American Web to MAIN, IN has on average 8% of the sites, MAIN 28%, Conference, pages 11–13, Santiago, Chile, 2003. IEEE CS OUT 58% and TENDRILS 6%. The sites that are discon- Press. nected from MAIN are 40% on average, but contribute less [9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Ra- than 10% of the pages. jagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph struc- ture in the Web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, 4 Conclusions Amsterdam, Netherlands, May 2000. ACM Press. [10] A. S. da Silva, E. A. Veloso, P. B. Golgher, , A. H. F. Laender, and N. Ziviani. CoBWeb - A crawler for the Brazilian Web. In Even when the collections were obtained from countries Proceedings of String Processing and Information Retrieval with different economical, historical and geographical con- (SPIRE), pages 184–191, Cancun, Mxico, 1999. IEEE CS texts, and speaking different languages we observed that Press. the results across different collections are always consis- [11] S. Dill, R. Kumar, K. S. Mccurley, S. Rajagopalan, tent when the observed characteristic exhibits a power-law D. Sivakumar, and A. Tomkins. Self-similarity in the web. in one collection. In this class we include the distribution of ACM Trans. Inter. Tech., 2(3):205–223, 2002. degrees, link-based scores, internal links, etc. [12] E. Efthimiadis and C. Castillo. Charting the Greek Web. Besides links, we are working in a detailed account of In Proceedings of the Conference of the American Society the characteristics of the contents and technologies used in for Information Science and Technology (ASIST), Providence, several collections [5]. Rhode Island, USA, November 2004. American Society for Information Science and Technology. Acknowledgments: We worked with Vicente L´ pez ino [13] A. Gulli and A. Signorini. The indexable Web is more than the study of the Spanish Web, with Efthimis N. Efthimiadis 11.5 billion pages. In Poster proceedings of the 14th interna- in the study of the Greek Web, with Felipe Ortiz, B´ rbara a tional conference on World Wide Web, pages 902–903, Chiba, Poblete and Felipe Saint-Jean in the studies of the Chilean Japan, 2005. ACM Press. [14] J. M. Kleinberg. Authoritative sources in a hyperlinked en- Web and with Felipe Lalanne in the study of the Korean vironment. Journal of the ACM, 46(5):604–632, 1999. Web. We also thank the Laboratory of Web Algorithmics [15] M. Modesto, ´ Pereira, N. Ziviani, C. Castillo, and R. Baeza- a. for making their Web collections available for research. Yates. Un novo retrato da Web Brasileira. In Proceedings of SEMISH, S˜ o Leopoldo, Brazil, 2005. a References [16] L. Page, S. Brin, R. Motwani, and T. Winograd. The Page- Rank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998. [1] R. Baeza-Yates and C. Castillo. Caracterizando la Web [17] G. Pandurangan, P. Raghavan, and E. Upfal. Using Page- Chilena. In Encuentro chileno de ciencias de la computaci´ n, o rank to characterize Web structure. In Proceedings of the 8th Punta Arenas, Chile, 2000. Sociedad Chilena de Ciencias de Annual International Computing and Combinatorics Confer- la Computaci´ on. ence (COCOON), volume 2387 of Lecture Notes in Computer [2] R. Baeza-Yates and C. Castillo. Relating Web characteristics Science, pages 330–390, Singapore, August 2002. Springer. with link based Web page ranking. In Proceedings of String [18] E. A. Veloso, E. de Moura, P. Golgher, A. da Silva, Processing and Information Retrieval SPIRE, pages 21–32, R. Almeida, A. Laender, R. B. Neto, and N. Ziviani. Um Laguna San Rafael, Chile, 2001. IEEE CS Press. retrato da Web Brasileira. In Proceedings of Simposio [3] R. Baeza-Yates and C. Castillo. Balancing volume, quality Brasileiro de Computacao, Curitiba, Brasil, 2000. and freshness in Web crawling. In Soft Computing Systems - [19] G. K. Zipf. Human behavior and the principle of least ef- Design, Management and Applications, pages 565–572, San- fort: An introduction to human ecology. Addison-Wesley, tiago, Chile, 2002. IOS Press Amsterdam. Cambridge, MA, USA, 1949. 4