Analysis of hybrid P2P overlay network topology

Available online at www.sciencedirect.com
Computer Communications 31 (2008) 190–200
www.elsevier.com/locate/comcom

Chao Xie (a,*), Guihai Chen (c), Art Vandenberg (d), Yi Pan (b,*)

a Department of Computer Science, University of Wisconsin-Madison, Madison, WI, 53706-1685, USA
b Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
c State Key Laboratory of Novel Software, Nanjing University, Nanjing 210093, China
d Department of Information Systems and Technology, Georgia State University, Atlanta, GA 30302-3968, USA

Available online 19 August 2007

Abstract

Modeling peer-to-peer (P2P) networks is a challenge for P2P researchers. In this paper, we provide a detailed analysis of large-scale hybrid P2P overlay network topology, using Gnutella as a case study. First, we re-examine the power-law distributions of the Gnutella network discovered by previous researchers. Our results show that the current Gnutella network deviates from the earlier power-laws, suggesting that the Gnutella network topology may have evolved a lot over time. Second, we identify important trends with regard to the evolution of the Gnutella network between September 2005 and February 2006. Upon analyzing the limitations of the power-laws, we provide a novel two-layered approach to study the topology of the Gnutella network. We divide the Gnutella network into two layers, namely the mesh and the forest, to model the hybrid and highly dynamic architecture of the current Gnutella network. We give a detailed analysis of the two-layered overlay and present six power-laws and one empirical law to characterize the topology. Using the two-layered approach and laws proposed, realistic topologies can be generated and the realism of artificial topologies can be validated.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Peer-to-peer; Overlay network; Network topology; Power-law

Footnotes to the title and author list:

q This paper extends and supplants the earlier version of this paper presented at IEEE GLOBECOM'06 [1].

qq Guihai Chen's work is supported by China NSF under Grant 60573131, China Jiangsu Provincial NSF under Grant BK2005208, China 973 projects under Grants 2006CB303000 and 2002CB312002, and Nokia Bridging the World Program. Yi Pan's work is supported in part by the National Science Foundation (NSF) under Grants ECS-0196569, ECS-0334813, and CCF-0514750. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF, China NSF or Nokia.

* Corresponding authors. Tel.: +1 404 651 0649; fax: +1 404 463 9912.
E-mail addresses: cxie@cs.wisc.edu (C. Xie), gchen@nju.edu.cn (G. Chen), avandenberg@gsu.edu (A. Vandenberg), pan@cs.gsu.edu (Y. Pan).
URLs: http://www.cs.wisc.edu/~cxie (C. Xie), http://www.cs.gsu.edu/pan (Y. Pan).

0140-3664/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.comcom.2007.08.014

1. Introduction

Modeling the topologies of peer-to-peer (P2P) networks is an important open problem. An accurate topological model can have significant influence on P2P research. First, we can gain detailed insight into the nature of the underlying system. Second, the model can enable detailed analysis of algorithms and facilitate design of more efficient protocols that take advantage of topology properties. Third, we can generate more accurate artificial topologies for simulation purposes. Furthermore, we can predict future trends and thereby address potential problems in advance.

Previous researchers [2] and [7] tended to use power-laws to characterize the topology of P2P networks. Recent advances in P2P networks have resulted in hybrid architectures, represented by the success of Gnutella protocol 0.6 [3] and Kazaa [4]. In this paper, we provide a detailed analysis of large-scale hybrid P2P network topology, giving results concerning major topology properties and main distributions. In our study, we choose Gnutella as a case study, as it has a large user community and open architecture. Our work can be summarized by the following points.

First, we re-examine the power-law distributions of the Gnutella network discovered by previous researchers. Our results show that the current Gnutella network deviates from the earlier power-laws. This observation suggests that the Gnutella network topology may have evolved a lot over time.
Second, we identify important trends with regard to the evolution of the Gnutella network between September 2005 and February 2006.

As our primary contribution, we provide a novel two-layered approach to study the topology of the Gnutella network. Due to the limitations of the power-laws, we divide the Gnutella network into two layers, namely the mesh and the forest, to model the hybrid and highly dynamic architecture of the current Gnutella network. We give a detailed analysis of the two-layered overlay and present six power-laws and one empirical law to characterize the topology.

Finally, we focus on the generation of realistic topologies and the validation of artificial topologies using our approach and the proposed laws.

The rest of this paper is organized as follows. Section 2 presents background and previous work. In Section 3, we present our traces of the Gnutella network. In Section 4, we re-examine the power-law distributions discovered by previous researchers and identify the trends concerning the evolution of the Gnutella network. In Section 5, we analyze the limitations of the power-laws and introduce our new two-layered approach to study the topology of the Gnutella network. In Section 6, we analyze the topological properties of the mesh and present power-laws concerning the mesh topology. In Section 7, we examine the topology properties of the forest and provide one empirical law concerning the tree size. In Section 8, we present two power-laws concerning the overlay network as a whole and discuss the practical uses of our approach and laws. Finally, Section 9 concludes our work.

2. Background and previous work

2.1. Gnutella protocol

Gnutella protocol 0.4 [5] employs a purely decentralized model. In this model, individual nodes, also called servents, are equal in terms of functionality. They not only perform server-side roles such as matching incoming queries against their local resources and responding with applicable results, but also offer client-side functions such as issuing queries and collecting search results. All servents are connected to each other randomly. Fig. 1 illustrates the topology of the Gnutella 0.4 network.

[Fig. 1. Topology of the Gnutella 0.4 Network.]

Gnutella protocol 0.6 [3] employs a hybrid architecture combining the centralized and decentralized models. Servents are categorized into leaves and ultrapeers. A leaf keeps only a small number of connections to ultrapeers. An ultrapeer maintains connections with other ultrapeers and acts as a proxy to the Gnutella network for the leaves connected to it. An ultrapeer only forwards a query to a leaf if it believes the leaf can answer it, and leaves never relay queries between ultrapeers. Fig. 2 illustrates the topology of the Gnutella 0.6 network. Protocol 0.6 is compatible with protocol 0.4, which implies that the current Gnutella network can contain some fraction of nodes that still follow the former protocol specification 0.4.

[Fig. 2. Topology of the Gnutella 0.6 Network.]

2.2. Power-law

Power-laws have been found in numerous diverse fields spanning sociological, geological, natural and biological systems. Power-laws of the form $y \propto x^{a}$ enable a compact characterization of topologies through their exponents. Faloutsos et al. [8] discovered four power-laws characterizing the topology of the Internet, while Magoni et al. [9] found another four power-laws of the Internet.

In [2,7,11], several power-laws were found with regard to the topology of the Gnutella network. In 2002, Ripeanu et al. [10] argued that the connection distribution of the more recent Gnutella network may follow a two-tier power-law distribution. P2P studies usually assume that these power-laws characterize the topology of P2P networks and use synthetically generated topologies following these power-laws [12–17].

3. Our Gnutella network traces and the crawler

We developed a crawler to collect topology information of the Gnutella network, taking advantage of the message communication mechanisms of both protocol 0.4 and protocol 0.6. The crawler is based on the Limewire [6] open source client and performs a parallel breadth-first search on the network. It can discover more than 100,000 nodes in minutes.

We can build the graph of nodes by analyzing the collected data on the Gnutella network. We model two adjacent nodes that have at least one connection between each other by an edge. We treat the Gnutella network as an undirected graph.

In this paper, we provide two traces of the Gnutella network, namely the 091505 trace and the 021106 trace. Note that we studied the topology of the Gnutella network from September 2005 until February 2006, and all the traces we collected accord with the results given in this paper. In Table 1, we present some basic statistics about our traces and previous work [2,11]. In Table 1, l represents the average shortest distance and k represents the average degree.

Table 1. Basic statistics of the Gnutella network

Source   Ours                [11]                [2]
Trace    091505    021106    V34206    V57926    -         -
Time     09–2005   02–2006   09–2003   10–2003   11–2000   12–2000
Nodes    107,205   118,925   34,206    57,926    992       1,125
Edges    118,187   130,612   43,958    80,276    2,465     4,080
l        6.4       7.9       5.4       5.8       3.7       3.3
Diam.    22        24        16        15        9         8
k        2.20      2.20      2.57      2.72      4.97      7.25

4. Current Gnutella network topology

In this section, we examine the power-laws of the Gnutella network described in the previous literature against our two traces. The goal is to find out whether the topology of the current Gnutella network accords with the early power-laws.

We use linear regression to fit a line to a set of two-dimensional points using the least-squares method. The validity of the approximation is quantified by the correlation coefficient, which ranges from −1.0 to 1.0. We denote the absolute value of the correlation coefficient by ACC. An ACC value of 1.0 indicates perfect linear correlation. In general, the ACC level should be greater than 0.90 to validate linear correlation.

4.1. Rank distribution

In this section, we study the degrees of the nodes in the Gnutella network.

Power-law of rank exponent R: The degree $d_v$ of a node $v$ is proportional to the rank of the node $r_v$ to the power of a constant $R$:

$d_v \propto r_v^{R}$.

The rank $r_v$ of a node $v$ is defined as its index in the order of decreasing degree.

Jovanovic [2] found that the early Gnutella network followed the above power-law with rank exponent of −0.98 and ACC of 0.94. For our two traces, the rank exponent is −0.64268 and −0.60681 and ACC is 0.92178 and 0.88120 in chronological order, as we see in Fig. 3. The low ACC values imply that this power-law is relatively weak in the 091505 graph and even invalid for the 021106 graph.

[Fig. 3. Log–log plot of the degree $d_v$ versus the rank $r_v$ in the sequence of decreasing degree.]

Compared with a pure power-law distribution, the two graphs deviate from the linear regression with similar patterns. On the one hand, the nodes with high rank have too small a degree. This is because the Gnutella protocol 0.6 imposes a limit on the maximal number of connections of an ultrapeer. On the other hand, there are too many nodes with degree around 30, with the result that the curve breaks away from the linear regression. This pattern suggests that ultrapeers in the Gnutella 0.6 network tend to have a connection limit of around 30.

Moreover, the 021106 graph is somewhat different from the 091505 graph. First, the nodes with high rank in the former graph are of smaller degree compared with their counterparts in the latter, implying that protocol 0.6 is effectively replacing protocol 0.4. Second, the curve after a degree of approximately 30 drops much more suddenly in the former graph than in the latter, which suggests that ultrapeers tend to employ as many connections as they can.

4.2. Degree distribution

In this section, we study the distribution of the degrees of the nodes. Note that the degree power-law we present in the current work is different from the one in earlier work [2]. However, they both refer to the same distribution. The difference is that the current work uses the cumulative probability distribution function, while the earlier work uses the probability distribution function. As a result, the exponents of the two power-laws differ approximately by one. The cumulative distribution is preferable because it can be estimated in a statistically robust way.

Power-law of degree exponent D: The complementary cumulative distribution function (CCDF) $D_d$ of a degree $d$ is proportional to the degree to the power of a constant $D$:

$D_d \propto d^{D}$.

The CCDF of a degree $d$ is the percentage of nodes that have degree greater than $d$.

Jovanovic [2] showed a degree exponent of −1.4 and ACC of 0.96 for the early Gnutella network using the probability distribution. For our two traces, the degree exponent is −2.25926 and −2.31074 and ACC is 0.91744 and 0.87718 in chronological order, as we see in Fig. 4. Again, the low ACC values imply that this power-law is relatively weak in the 091505 graph and even invalid for the 021106 graph.

[Fig. 4. Log–log plot of $D_d$ versus the degree $d$.]

Compared with a pure power-law distribution, the graphs share some common patterns. There are too many nodes with degree around 30, and the resulting curves deviate from the linear regression. This coincides with what we found in the rank distribution.

Furthermore, in the 021106 graph, degrees in the interval 5–20 follow an almost constant distribution, which means there are too few ultrapeers with a degree in this interval. This confirms our previous conclusion that ultrapeers try to hold more connections up to the limit. The curve of higher degree in the 021106 graph drops much more sharply, which agrees with our previous comment that the Gnutella protocol 0.6 prevents ultrapeers from employing a large number of connections.

5. The two-layered approach

In this section, we first discuss the limitations of the power-laws and then present a new approach to study the topology of the Gnutella network.

5.1. Limitations of the power-laws

Previous research [18] and [19] suggests two key causes for power-law distributions in network topologies: incremental growth and preferential connectivity. Incremental growth refers to open networks that form by the continual addition of new nodes, and thus the gradual increase in the size of the network. Preferential connectivity refers to the tendency of a new node to connect to existing nodes that are highly connected or popular.

The topology of the Gnutella network is highly dynamic, since a node can join or leave the Gnutella network at any time. More specifically, most leaves tend to disconnect from the Gnutella network within several minutes after they connect to it. The transient lifetime of the leaves works against incremental growth. Moreover, due to the hybrid architecture of Gnutella protocol 0.6 [3], a leaf keeps only a small number of connections to ultrapeers and cannot connect to other leaves. This limitation on leaves also works against preferential connectivity, because leaves can never become highly connected. Combining the above factors, we can explain why the current Gnutella network does not follow the early power-law distributions. It is the limitations of the power-laws that make them inappropriate for modeling hybrid and highly dynamic topologies.

As we mentioned earlier, P2P studies usually use synthetically generated topologies characterized by the early power-laws. These topologies may not reflect properties of current P2P networks, so a new approach is needed to model current P2P networks.

5.2. Our approach

In our study, we propose a new two-layered approach to model the topology of the current Gnutella network. We split the Gnutella network into two layers, namely the mesh and the forest.

Before we present the analysis of our approach, we provide below a few definitions. Note that Magoni et al. [9] proposed some definitions to describe the AS network.
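The regression procedure used throughout Section 4 can be sketched as follows. This is only a minimal illustration, assuming base-10 logarithms and plain least squares; the function names `loglog_fit`, `rank_points` and `ccdf_points` are hypothetical and are not taken from the paper or its crawler.

```python
import math


def loglog_fit(xs, ys):
    """Fit a least-squares line to (log10 x, log10 y).

    Returns (slope, acc), where slope is the power-law exponent estimate and
    acc is the absolute value of the correlation coefficient (ACC).
    Assumes all values are positive and not all equal.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    acc = abs(sxy) / math.sqrt(sxx * syy)
    return slope, acc


def rank_points(degrees):
    """Return (ranks, degrees) with degrees sorted in decreasing order, ranks from 1."""
    ordered = sorted(degrees, reverse=True)
    return list(range(1, len(ordered) + 1)), ordered


def ccdf_points(degrees):
    """Return (d, fraction of nodes with degree > d) for each distinct degree d.

    Points with a zero fraction are dropped because they cannot be log-plotted.
    The quadratic scan is kept for clarity, not for speed.
    """
    n = len(degrees)
    ds, fracs = [], []
    for d in sorted(set(degrees)):
        greater = sum(1 for x in degrees if x > d)
        if greater > 0:
            ds.append(d)
            fracs.append(greater / n)
    return ds, fracs
```

Calling `loglog_fit(*rank_points(degrees))` gives an estimate of the rank exponent R and its ACC, and `loglog_fit(*ccdf_points(degrees))` does the same for the degree exponent D; under the criterion stated in Section 4, an ACC below 0.90 would not validate the fit.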
We keep these definitions and modify them into the following ones. Fig. 5 shows different kinds of nodes in a sample graph.

  • Cycle node: a node that belongs to a cycle (i.e. it is on a closed path of disjoint nodes; in Fig. 5, there are eleven cycle nodes).
  • Bridge node: a node which is not a cycle node and is on a path connecting 2 cycle nodes (in Fig. 5, there is one bridge node).
  • In-mesh node: a node which is a cycle node or a bridge node (in Fig. 5, the mesh has twelve in-mesh nodes).
  • In-tree node: a node which is not an in-mesh node (i.e. it belongs to a tree; in Fig. 5, each tree has four in-tree nodes).

The mesh is the set of in-mesh nodes and the forest is the set of in-tree nodes.

  • Branch node: an in-tree node of degree at least 2.
  • Leaf node: an in-tree node of degree 1.
  • Root node: an in-mesh node which is the root of a tree.
  • Relay node: a node having exactly 2 connections.
  • Border node: a node located on the diameter of the network.

[Fig. 5. Different kinds of nodes.]

If we split the Gnutella network into the mesh and the forest, we can analyze the topological properties of the mesh and the forest, respectively.

After careful comparison between Figs. 2 and 5, we find that the mesh in Fig. 5 is composed merely of ultrapeers and acts as the backbone of the Gnutella network. Since ultrapeers are relatively stable and tend to stay in the Gnutella network for a longer time, the mesh can meet the requirement of incremental growth. Furthermore, since ultrapeers can connect to other ultrapeers, it can meet the requirement of preferential connectivity. Hence, the topology of the mesh should theoretically comply with power-laws (see Section 6 for detailed validation). On the other hand, we can also obtain major topology properties and distributions of the forest (see Section 7). Note that it is not necessary to have all ultrapeers in the mesh.

With the knowledge of both the topology of the mesh and the topology of the forest, we can model the topology of the Gnutella network easily by merging these two layers.

6. Mesh topology analysis

In this section, we study the topology properties concerning the mesh in the Gnutella network. In Table 2, we present some basic statistics about the mesh in our traces. In Table 2, p(m) represents the percentage of nodes in the mesh, l represents the average shortest distance, and k represents the average degree.

Table 2. Basic statistics of the mesh

Stat.         091505    021106
Nb of Nodes   16,487    11,852
p(m)          15.4%     10.0%
Nb of Edges   27,467    23,539
l             5.2       6.5
Diameter      14        17
k             3.33      3.97

6.1. Mesh node rank exponent $R_m$

In this section, we study the degrees of the nodes in the mesh. We sort the nodes in the mesh in decreasing order of degree $d_{v_m}$ and define the mesh node rank $r_{v_m}$ as the index of the node in the sequence. We plot the $(d_{v_m}, r_{v_m})$ pairs in log–log scale. The plots are shown in Fig. 6. The data values are represented by points, while the solid lines represent the least-squares approximation.

[Fig. 6. Log–log plot of the mesh node degree $d_{v_m}$ versus the rank $r_{v_m}$ in the sequence of decreasing degree.]

The points of Fig. 6 are well approximated by the linear regression. The ACC is 0.96425 for the 091505 trace and 0.96580 for the 021106 trace. This leads us to the following power-law and definition.

Power-law 1 (Mesh node rank exponent): The degree $d_{v_m}$ of a mesh node $v_m$ is proportional to the rank of the mesh node $r_{v_m}$ to the power of a constant $R_m$:

$d_{v_m} \propto r_{v_m}^{R_m}$.

Definition 1. Let us sort the mesh nodes of a graph in decreasing order of degree. We define the mesh rank exponent $R_m$ to be the slope of the plot of the degrees of the mesh nodes versus the rank of the nodes in log–log scale.

6.2. Mesh node degree exponent $O_m$

In this section, we study the distribution of the degrees of the nodes in the mesh. We define the frequency $f_{d_m}$ of a mesh node degree $d_m$ as the number of nodes in the mesh with degree $d_m$. We plot the $(f_{d_m}, d_m)$ pairs in log–log scale in Fig. 7. In these plots, we exclude a small percentage of nodes of higher degree that have a frequency of one, but still plot 99.9% of the total number of nodes. As we saw earlier, the higher degrees are described and captured by the mesh rank exponent.

[Fig. 7. Log–log plot of frequency $f_{d_m}$ versus the mesh node degree $d_m$.]

The major observation of Fig. 7 is that the plots are approximately linear, with ACC of 0.97171 for the 091505 trace and 0.96016 for the 021106 trace. We infer the following power-law and definition.

Power-law 2 (Mesh node degree exponent): The frequency $f_{d_m}$ of a mesh node degree $d_m$ is proportional to the degree to the power of a constant $O_m$:

$f_{d_m} \propto d_m^{O_m}$.

Definition 2. We define the mesh node degree exponent $O_m$ to be the slope of the plot of the frequency of the mesh node degrees versus the degrees in log–log scale.

6.3. Mesh pair rank exponent $P_m$

In this section, we study the Number of distinct Shortest Paths (NSP) of each pair of vertices in the mesh. The number of distinct shortest paths between two vertices is the number of shortest paths such that any of these paths have at least one vertex not in common [9]. The distribution of the NSP is useful for evaluating the amount of redundant edges involved in shortest paths. Higher NSP values mean that if one edge of a shortest path between a pair of nodes is removed, there is still a probability that another shortest path of the same length exists for this pair. We sort the pairs of in-mesh nodes in decreasing NSP $n_{p_m}$ and define the pair rank $r_{p_m}$ as the index of the pair in the sequence. We plot the $(n_{p_m}, r_{p_m})$ pairs in log–log scale. The plots are shown in Fig. 8. Due to the enormous number of node pairs, we plot the first $10^6$ pairs only.

The points of Fig. 8 are well approximated by the linear regression, with ACC of 0.99157 for the 091505 trace and 0.99632 for the 021106 trace. Note that in Fig. 8(a) a significant portion of the upper left part of the curve appears to go off the straight line. However, this is a visual illusion. The dots in the lower right part of the curve are much denser than the dots in the upper left part, so the ACC value remains high. This leads us to the following power-law and definition.

Power-law 3 (Mesh pair rank exponent): The NSP $n_{p_m}$ between a pair of mesh nodes $p_m$ is proportional to the rank of the pair $r_{p_m}$ to the power of a constant $P_m$:

$n_{p_m} \propto r_{p_m}^{P_m}$.
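The NSP values behind the pair rank distribution can be obtained with a breadth-first search that accumulates path counts. The sketch below is one possible way to do this, assuming an undirected graph stored as an adjacency dictionary and reading "distinct shortest paths" as ordinary shortest-path counting; the names `shortest_path_counts` and `nsp_sequence` are hypothetical, and the text does not say how the authors computed NSP.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from source; return {node: number of distinct shortest paths from source}."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                # First time w is reached: fix its distance.
                dist[w] = dist[u] + 1
                count[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                # u is a predecessor of w on some shortest path.
                count[w] += count[u]
    return count


def nsp_sequence(adj):
    """Return the NSP of every connected unordered node pair, in decreasing order."""
    values = []
    nodes = sorted(adj)
    for i, s in enumerate(nodes):
        counts = shortest_path_counts(adj, s)
        for t in nodes[i + 1:]:
            if t in counts:
                values.append(counts[t])
    return sorted(values, reverse=True)
```

The position of a value in the list returned by `nsp_sequence`, counted from 1, plays the role of the pair rank. Running one BFS per node is affordable for small graphs only; for a mesh with more than ten thousand nodes a sampled or truncated variant would probably be needed.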
[Fig. 8. Log–log plot of the mesh NSP $n_{p_m}$ versus the rank $r_{p_m}$ in the sequence of decreasing NSP.]

Definition 3. Let us sort the pairs of nodes in the mesh of a graph in decreasing order of NSP. We define the mesh pair rank exponent $P_m$ to be the slope of the plot of the NSP versus the rank of the mesh node pairs in log–log scale.

6.4. Mesh NSP exponent $N_m$

In this section, we study the distribution of NSP of in-mesh nodes. We define the frequency $f_{n_m}$ of an NSP $n_m$ as the number of pairs with NSP of $n_m$ in the mesh. We plot the $(f_{n_m}, n_m)$ pairs in log–log scale in Fig. 9. In these plots, we exclude a small percentage of pairs of higher NSP that have the lowest frequency, but still plot more than 99.9% of the total number of pairs. The solid lines are the result of the linear regression.

[Fig. 9. Log–log plot of frequency $f_{n_m}$ versus the mesh NSP $n_m$.]

The major observation of Fig. 9 is that the plots are approximately linear, with ACC of 0.94301 for the 091505 trace and 0.99840 for the 021106 trace. We infer the following power-law and definition.

Power-law 4 (Mesh NSP exponent): The frequency $f_{n_m}$ of an NSP $n_m$ between a pair of nodes in the mesh is proportional to the NSP to the power of a constant $N_m$:

$f_{n_m} \propto n_m^{N_m}$.

Definition 4. We define the mesh NSP exponent $N_m$ to be the slope of the plot of the frequency of the mesh NSP versus the mesh NSP in log–log scale.

7. Forest topology analysis

In this section, we study the topology properties concerning the forest in the Gnutella network. In Table 3, we present some basic statistics about the forest in our traces. In Table 3, p(t) represents the percentage of nodes in the forest.

7.1. Tree depth distribution

We define the probability $p(t_d)$ of a tree depth $t_d$ as the percentage of trees in the forest with depth $t_d$. Fig. 10 describes the tree depth distribution.

Table 3. Basic statistics of the forest

Stat.             091505    021106
Nb of Nodes       90,718    107,073
p(t)              84.6%     90.0%
Nb of trees       9,886     6,830
Mean tree size    10.18     16.68
Max tree size     4,824     231
Mean tree depth   1.52      1.30
Max tree depth    8         10
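Per-tree statistics of the kind reported in Table 3 require splitting a graph into mesh and forest first. A minimal sketch is given below, under the assumption that the mesh (cycle nodes plus bridge nodes) is exactly what remains after repeatedly stripping nodes of degree at most one; that reading follows from the definitions in Section 5.2 but is not stated as the authors' procedure, and the function names `split_mesh_forest` and `tree_stats` are hypothetical.

```python
from collections import deque


def split_mesh_forest(adj):
    """Return (mesh, forest) node sets by repeatedly pruning nodes of degree <= 1."""
    degree = {v: len(ns) for v, ns in adj.items()}
    forest = set()
    queue = deque(v for v, d in degree.items() if d <= 1)
    while queue:
        v = queue.popleft()
        if v in forest:
            continue
        forest.add(v)
        for w in adj[v]:
            if w not in forest:
                degree[w] -= 1
                if degree[w] == 1:
                    # w has become a leaf of the remaining graph.
                    queue.append(w)
    return set(adj) - forest, forest


def tree_stats(adj, mesh):
    """Return {root: (tree size including the root, tree depth)} for every root node."""
    stats = {}
    for root in mesh:
        depth = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                # Only descend into in-tree nodes; other mesh nodes are ignored.
                if w not in mesh and w not in depth:
                    depth[w] = depth[u] + 1
                    queue.append(w)
        if len(depth) > 1:
            stats[root] = (len(depth), max(depth.values()))
    return stats
```

Here all in-tree nodes hanging off the same root are counted as one tree whose size includes the root, which is consistent with the mean tree sizes in Table 3 (for example 90,718 / 9,886 + 1 ≈ 10.18). A tree made only of leaves attached directly to its root has depth 1 in this sketch.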
[Fig. 10. Tree depth distribution.]

In Fig. 10, we notice that more than 56% of trees are simply composed of leaves that are directly connected to their corresponding root. We can also observe that more than 27% of trees have depth 2 and less than 4% of trees have depth larger than 3.

7.2. Tree rank distribution

In this section, we study the size of each tree, which is defined as the number of vertices composing the tree plus the root. We sort the trees in decreasing tree size $s_t$ and define the tree rank $r_t$ as the index of the tree in the sequence. We plot the $(s_t, r_t)$ pairs in Fig. 11, applying log-scale only on the y-axis. The solid lines are given by linear regression.

[Fig. 11. Plot of the tree size $s_t$ (log-scale) versus the rank $r_t$ in the sequence of decreasing size.]

The plots of Fig. 11 match the linear regression line. The ACC is 0.95621 for the 091505 trace and 0.95465 for the 021106 trace. Consequently, we infer the following empirical law and definition.

Empirical law 1: The size $s_t$ of a tree $t$ is proportional to an exponential function whose exponent is the product of the rank of the tree $r_t$ and a constant $T$:

$s_t \propto \exp(T r_t)$.

Definition 5. Let us sort the trees of a graph in decreasing order of size. We define $T$ to be the slope of the plot of the sizes of trees versus the rank of the trees with log-scale applied on the sizes of trees.

This empirical law provides a formula for the sizes of the trees in a sequence of trees.

8. Discussion

In this section, we first present two more power-laws concerning all the nodes (including both in-mesh nodes and in-tree nodes) in the Gnutella network. Then we focus on the generation of synthetic topologies of P2P networks.

8.1. Additional power-laws

In our study, we find that the NSP rank distribution and the NSP distribution of all the nodes in the Gnutella network follow power-laws as well. This can be explained easily. Because the mesh is the core part of the network, shortest paths are mainly made up of nodes in the mesh, while nodes in the forest barely contribute to shortest paths. However, the two power-laws presented below could be used as minor metrics to distinguish P2P topologies.

8.1.1. Pair rank exponent $P$

Here we study the NSP of all the nodes (including both in-mesh nodes and in-tree nodes). We sort the pairs of nodes in decreasing NSP $n_p$ and plot the $(n_p, r_p)$ pairs in log–log scale in Fig. 12. Due to the enormous number of node pairs, we plot the first $10^6$ pairs only. The data values are represented by points, while the solid lines represent the least-squares approximation.

[Fig. 12. Log–log plot of the NSP $n_p$ versus the rank of the pairs $r_p$ in the sequence of decreasing NSP.]

The points of Fig. 12 are well approximated by the linear regression, with ACC of 0.98184 for the 091505 trace and 0.99259 for the 021106 trace. Note that in both Fig. 12(a) and (b), a significant portion of the upper left part of the curves appears to go off the straight line. However, this is again a visual illusion. The dots in the lower right part of the curve are much denser than the dots in the upper left part, so the ACC value remains high. This leads us to the following power-law and definition.

Power-law 5 (Pair rank exponent): The NSP $n_p$ between a pair of nodes $p$ is proportional to the rank of the pair $r_p$ to the power of a constant $P$:

$n_p \propto r_p^{P}$.

Definition 6. Let us sort the pairs of nodes of a graph in decreasing order of NSP. We define the pair rank exponent $P$ to be the slope of the plot of the NSP versus the rank of the pairs in log–log scale.

8.1.2. NSP exponent $N$

Here we study the distribution of NSP of all the nodes (including both in-mesh nodes and in-tree nodes). We define the frequency $f_n$ of an NSP $n$ as the number of pairs with NSP of $n$. We plot the $(f_n, n)$ pairs in log–log scale in Fig. 13. In these plots, we exclude a small percentage of pairs of higher NSP that have the lowest frequency. In any case, we plot more than 99.9% of the total number of pairs. The solid lines are the result of the linear regression.

[Fig. 13. Log–log plot of frequency $f_n$ versus the NSP $n$.]

The major observation is that the plots are approximately linear, with ACC of 0.93510 for the 091505 trace and 0.98810 for the 021106 trace. We infer the following power-law and definition.

Power-law 6 (NSP exponent): The frequency $f_n$ of an NSP $n$ between a pair of nodes is proportional to the NSP to the power of a constant $N$:

$f_n \propto n^{N}$.

Definition 7. We define the NSP exponent $N$ to be the slope of the plot of the frequency of the NSP versus the NSP in log–log scale.

8.2. Topology generation

The regularity observed in our traces of the Gnutella network between September 2005 and February 2006 (including but not restricted to the two traces specifically discussed in this paper) is unlikely to be a coincidence.
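As a small illustration of how such regularities could feed a topology generator, Empirical law 1 from Section 7.2 can be turned into a list of tree sizes for a synthetic forest. The sketch below is hedged: the function name `tree_sizes`, its parameters, and the choice to anchor the curve at the largest tree are assumptions made for illustration, and the text does not report a numeric value for the constant T, so it is left as an input.

```python
import math


def tree_sizes(num_trees, max_size, t_const):
    """Sizes for trees of rank 1..num_trees following s_t = max_size * exp(T * (r_t - 1)).

    t_const is the (negative) slope T of Empirical law 1. Sizes are rounded to
    integers and never fall below 2, since a tree needs a root and at least
    one in-tree node.
    """
    sizes = []
    for rank in range(1, num_trees + 1):
        size = round(max_size * math.exp(t_const * (rank - 1)))
        sizes.append(max(2, size))
    return sizes
```

For example, `tree_sizes(3, 100, -math.log(2))` halves the size from one rank to the next. In a fuller generator, the sizes would be attached as trees to randomly chosen root nodes of a separately generated mesh, and the resulting graph checked against the laws above.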
We could reasonably conjecture that our laws might continue to hold, at least for the near future.

Our work can facilitate the generation of realistic topologies of P2P networks, especially those which employ a hybrid and highly dynamic architecture like the Gnutella network. As an overview, we list the following guidelines for creating P2P network topologies. First, a small percentage of the nodes (15.4% or 10.0%) belong to the mesh and a large percentage of the nodes (84.6% or 90.0%) belong to the forest. Second, the degree distribution of the mesh is skewed following our power-laws 1 and 2. Third, more than 56% of the trees have depth one, less than 4% of the trees have depth larger than 3, and the maximum depth is 7 or 10. Fourth, the size distribution of the trees is skewed following our empirical law 1. As a final step, we merge the generated mesh and the generated forest together to get the P2P network topology. We can further use our laws 3, 4, 5 and 6 to examine the quality of the generated topologies. If we fine-tune the parameters, we can get specific topologies that meet our needs.

9. Conclusion and future work

In this paper, we study the hybrid P2P network topology through the mesh perspective and the forest perspective, respectively. Using the two-layered approach and the proposed laws, realistic topologies can be generated.

References

[1] C. Xie, Y. Pan, Analysis of large-scale hybrid peer-to-peer network topology, in: Proc. IEEE GLOBECOM'06, San Francisco, USA, 2006.
[2] M.A. Jovanovic, Modelling large-scale peer-to-peer networks and a case study of gnutella, Master's thesis, University of Cincinnati, Cambridge, June 2000.
[3] Gnutella, The gnutella protocol v0.6, 2002.
[4] The KaZaA website, 2006.
[5] Clip2, The Gnutella protocol specification v0.4, 2001.
[6] The Limewire website, 2006.
[7] L.A. Adamic, R.M. Lukose, A.R. Puniyani, B.A. Huberman, Search in power-law networks, Physical Review E 64 (2001) 46135–46143.
[8] M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet topology, in: Proc. ACM SIGCOMM'99, New York, NY, 1999, pp. 251–262.
[9] D. Magoni, J.-J. Pansiot, Analysis of the autonomous system network topology, ACM SIGCOMM Computer Communication Review 31 (3) (2001) 26–37.
[10] M. Ripeanu, I. Foster, A. Iamnitchi, Mapping the Gnutella network: properties of large-scale peer-to-peer systems and implications for system design, IEEE Internet Computing Journal 6 (1) (2002) 50–57.
[11] H. Chen, H. Jin, J. Sun, D. Deng, X. Liao, Analysis of large-scale topological properties for peer-to-peer networks, in: Proc. IEEE CCGrid'04, 2004, pp. 27–34.
[12] Q. He, M. Ammar, G. Riley, H. Raj, R. Fujimoto, Mapping peer behavior to packet-level details: a framework for packet-level simulation of peer-to-peer systems, in: Proc. IEEE/ACM MASCOTS'03, Orlando, FL, October 2003.
[14] N.S. Ting, R. Deters, 3LS – A peer-to-peer network simulator, in: Proc. IEEE P2P'03, Sweden, 2003.
[15] N. Kotilainen, M. Vapa, T. Keltanen, A. Auvinen, J. Vuori, P2PRealm – Peer-to-Peer Network Simulator, in: Proc. 11th International Workshop on Computer-Aided Modeling, Analysis and Design of Communication Links and Networks, 2006, pp. 93–99.
[16] M. Jelasity, A. Montresor, G.P. Jesi, Peersim peer-to-peer simulator, 2004, Available from: <http://peersim.sourceforge.net/>.
[17] W. Yang, N. Abu-Ghazaleh, GPS: a general peer-to-peer simulator and its use for modeling BitTorrent, in: Proc. IEEE/ACM MASCOTS'05, Atlanta, GA, 2005.
[18] A.L. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.
[19] A. Medina, I. Matta, J. Byers, On the origin of power laws in internet topologies, ACM SIGCOMM Computer Communication Review 30 (2) (2000) 18–28.

Chao Xie currently is a Ph.D. student in the Department of Computer Science at University of Wisconsin-Madison. He obtained his M.S. degree in Computer Science from Georgia State University, USA, in 2007, obtained his M.Eng. degree in Computer Science from Huazhong University of Science and Technology, China, in 2005, and obtained his B.S. degree in Mechanical Engineering from Huazhong University of Science and Technology, China, in 2001. His main research interests include computer networks, distributed systems, parallel computing and data mining. Chao Xie is a member of the Association of Computing Machinery and the IEEE Computer Society.

Guihai Chen obtained his B.S. degree from Nanjing University, M.Eng. from Southeast University, and Ph.D. from University of Hong Kong. He visited Kyushu Institute of Technology, Japan in 1998 as a research fellow, and University of Queensland, Australia in 2000 as a visiting professor. From September 2001 to August 2003, he was a visiting professor at Wayne State University. He is now a full professor and deputy chair of the Department of Computer Science, Nanjing University. Prof. Chen has published more than 100 papers in peer-reviewed journals and refereed conference proceedings in the areas of wireless sensor networks, high-performance computer architecture, peer-to-peer computing and performance evaluation. He has also served on technical program committees of numerous international conferences. He is a member of the IEEE Computer Society.

Art Vandenberg was born in Grasonville, Maryland, 1950. Education includes B.A. English Literature, Swarthmore College, Swarthmore, PA, 1972; M.V.A. Painting and Drawing, Georgia State University, Atlanta, GA, 1979; and M.S. Information and Computer Systems, Georgia Institute of Technology, Atlanta, GA, 1985. He has worked in library systems, research and administrative computing since 1976, including 15 years in information technology positions at Georgia Institute of Technology. Since 1997 he has
been with Information Systems & Technology at Georgia State University, [13] S. Merugu, S. Srinivasan, E. Zegura, P-sim, A simulator for peer-to- as Director of Advanced Campus Services charged with deploying middle- peer networks, in: Proc. IEEE/ACM MASCOTS’03, Orlando, FL, ware and research computing infrastructure. His current activities include Oct. 2003. deploying grid computing solutions and establishing high-performance
  • 11. 200 C. Xie et al. / Computer Communications 31 (2008) 190–200 computing cyberinfrastructure. Recent research grants include a NSF ITR has also co-authored/co-edited 30 books (including proceedings) and Award 0312636 as Co-PI investigating a unique approach to resolving contributed several book chapters. His pioneer work on computing using metadata heterogeneity for information integration by combining moni- reconfigurable optical buses has inspired extensive subsequent work by toring, clustering and visualization to discover patterns or trends. He is a many researchers, and his research results have been cited by more than member of Georgia State’s IT Risk Management Research Group, the 100 researchers worldwide in books, theses, journal and conference Georgia State Information Integration Lab, and serves as Chair of papers. He is a co-inventor of three U.S. patents (pending) and 5 pro- SURAgrid, a regional grid initiative of the Southeastern Universities visional patents, and has received many awards from agencies such as Research Association. NSF, AFOSR, JSPS, IISF and Mellon Foundation. His recent research Mr. Vandenberg is a member of the Association of Computing has been supported by NSF, NIH, NSFC, AFOSR, AFRL, JSPS, IISF Machinery and the IEEE Computer Society. and the states of Georgia and Ohio. He has served as a reviewer/panelist for many research foundations/agencies such as the U.S. National Sci- ence Foundation, the Natural Sciences and Engineering Research Yi Pan is the chair and a professor in the Council of Canada, the Australian Research Council, and the Hong Department of Computer Science and a profes- Kong Research Grants Council. Dr. Pan has served as an editor-in-chief sor in the Department of Computer Information or editorial board member for 15 journals including 5 IEEE Transac- Systems at Georgia State University. Dr. Pan tions and a guest editor for 10 special issues for 9 journals including 2 received his B.Eng. and M.Eng. 
degrees in IEEE Transactions. He has organized several international conferences computer engineering from Tsinghua University, and workshops and has also served as a program committee member for China, in 1982 and 1984, respectively, and his several major international conferences such as INFOCOM, GLOBE- Ph.D. degree in computer science from the COM, ICC, IPDPS, and ICPP. Dr. Pan has delivered over 10 keynote University of Pittsburgh, USA, in 1991. Dr. speeches at many international conferences. Dr. Pan is an IEEE Dis- Pan’s research interests include parallel and tinguished Speaker (2000-2002), a Yamacraw Distinguished Speaker distributed computing, optical networks, wire- (2002), a Shell Oil Colloquium Speaker (2002), and a senior member of less networks, and bioinformatics. Dr. Pan has published more than 100 IEEE. He is listed in Men of Achievement, Who’sWho in Midwest, journal papers with 30 papers published in various IEEE journals. In Who’sWho in America, Who’sWho in American Education, Who’s Who addition, he has published over 100 papers in refereed conferences in Computational Science and Engineering, and Who’s Who of Asian (including IPDPS, ICPP, ICDCS, INFOCOM, and GLOBECOM). He Americans.
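Editorial sketch of the topology-generation guidelines given before the conclusion (split the nodes into a mesh and a forest, give the mesh a skewed degree distribution, keep the trees shallow, then merge the two layers). It is a minimal illustration under stated assumptions, not the authors' generator. The function name `generate_two_layer_topology` and the parameters `links_per_mesh_node` and `p_attach_to_mesh` are hypothetical. Preferential attachment is used only as a stand-in that produces a skewed mesh degree distribution; it is not fitted to power-laws 1 and 2, and the tree sizes are not fitted to empirical law 1. Trees are assumed to be rooted at mesh nodes, with depth counted in hops from the root.

```python
import random


def generate_two_layer_topology(n_nodes, mesh_fraction=0.154,
                                links_per_mesh_node=3, max_depth=7,
                                p_attach_to_mesh=0.8, seed=None):
    """Sketch of a mesh-plus-forest overlay.

    mesh_fraction and max_depth follow the guidelines in the text; every
    other default is an illustrative assumption.
    """
    rng = random.Random(seed)
    n_mesh = round(n_nodes * mesh_fraction)
    if n_mesh <= links_per_mesh_node or n_mesh >= n_nodes:
        raise ValueError("n_nodes is too small for these parameters")

    # Mesh layer. ASSUMPTION: preferential attachment stands in for the
    # paper's mesh power-laws; it only yields a skewed degree distribution.
    edges = set()
    endpoints = []  # a node appears here once per incident mesh edge
    core = range(links_per_mesh_node + 1)
    for i in core:
        for j in core:
            if i < j:
                edges.add((i, j))
                endpoints += [i, j]
    for v in range(links_per_mesh_node + 1, n_mesh):
        targets = set()
        while len(targets) < links_per_mesh_node:
            targets.add(rng.choice(endpoints))  # degree-proportional choice
        for t in targets:
            edges.add((t, v))
            endpoints += [t, v]

    # Forest layer. ASSUMPTION: each tree is rooted at a mesh node and a new
    # peer attaches directly to a uniformly chosen root with probability
    # p_attach_to_mesh, which keeps most trees at depth one.
    depth = {m: 0 for m in range(n_mesh)}
    root = {m: m for m in range(n_mesh)}
    parent = {}
    tree_depth = {}   # root -> depth of the tree hanging below it
    growable = []     # forest nodes that may still accept children
    for v in range(n_mesh, n_nodes):
        if not growable or rng.random() < p_attach_to_mesh:
            p = rng.randrange(n_mesh)
        else:
            p = rng.choice(growable)
        parent[v] = p
        depth[v] = depth[p] + 1
        root[v] = root[p]
        edges.add((p, v))  # merging step: forest edges join the mesh edges
        if depth[v] < max_depth:
            growable.append(v)
        tree_depth[root[v]] = max(tree_depth.get(root[v], 0), depth[v])

    return {"n_mesh": n_mesh, "edges": edges,
            "parent": parent, "tree_depth": tree_depth}
```

Calling `generate_two_layer_topology(1000, seed=1)` and computing the share of entries in `tree_depth` equal to one gives a figure that can be compared with the "more than 56%" guideline; `p_attach_to_mesh` is the knob to adjust if the generated forest is too deep or too shallow.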