A Survey and Comparison of Peer-to-Peer Overlay Network Schemes



IEEE COMMUNICATIONS SURVEY AND TUTORIAL, MARCH 2004

Eng Keong Lua, Jon Crowcroft, Marcelo Pias, Ravi Sharma and Steven Lim

Abstract— Over the Internet today, computing and communications environments are significantly more complex and chaotic than classical distributed systems, lacking any centralized organization or hierarchical control. There has been much interest in emerging Peer-to-Peer (P2P) network overlays because they provide a good substrate for creating large-scale data-sharing, content distribution and application-level multicast applications. These P2P networks try to provide a long list of features, such as: selection of nearby peers, redundant storage, efficient search/location of data items, data permanence or guarantees, hierarchical naming, trust and authentication, and anonymity. P2P networks potentially offer an efficient routing architecture that is self-organizing, massively scalable, and robust in the wide area, combining fault tolerance, load balancing and an explicit notion of locality. In this paper, we present a survey and comparison of various Structured and Unstructured P2P networks. We categorize the various schemes into these two groups in the design spectrum and discuss the application-level network performance of each group.

Index Terms— Peer-to-Peer, Distributed Scalable Algorithms, Lookup Protocols, Overlay Routing, Overlay Networks.

[Fig. 1. An Abstract P2P Overlay Network Architecture]

Manuscript received March 31, 2004; revised November 20, 2004. Eng Keong Lua, Jon Crowcroft and Marcelo Pias are with the University of Cambridge, Computer Laboratory. Ravi Sharma is with the Nanyang Technological University. Steven Lim is with Microsoft Asia.

I. INTRODUCTION

Peer-to-Peer (P2P) overlay networks are distributed systems in nature, without any hierarchical organization or centralized control. Peers form self-organizing overlay networks that are overlayed on the Internet Protocol (IP) networks, offering a mix of features such as a robust wide-area routing architecture, efficient search of data items, selection of nearby peers, redundant storage, permanence, hierarchical naming, trust and authentication, anonymity, massive scalability and fault tolerance. Peer-to-peer overlay systems go beyond the services offered by client-server systems by having symmetry in roles, where a client may also be a server: each peer allows access to its resources by other systems and supports resource sharing, which requires fault-tolerance, self-organization and massive-scalability properties. Unlike Grid systems, P2P overlay networks do not arise from the collaboration between established and connected groups of systems, and they operate without a reliable set of resources to share.

We can view P2P overlay network models as spanning a wide spectrum of the communication framework, which specifies a fully-distributed, cooperative network design with peers building a self-organizing system. Figure 1 shows an abstract P2P overlay architecture, illustrating the components in the overlay communications framework. The Network Communications layer describes the network characteristics of desktop machines connected over the Internet, or of small wireless or sensor-based devices connected in an ad-hoc manner; the dynamic nature of peers poses challenges for the communication paradigm. The Overlay Nodes Management layer covers the management of peers, which includes discovery of peers and routing algorithms for optimization. The Features Management layer deals with the security, reliability, fault-resiliency and aggregated resource-availability aspects of maintaining the robustness of P2P systems. The Services Specific layer supports the underlying P2P infrastructure and the application-specific components through scheduling of parallel and computation-intensive tasks, and content and file management; meta-data describes the content stored across the P2P peers and its location information. The Application-level layer is concerned with tools, applications and services implemented with specific functionalities on top of the underlying P2P overlay infrastructure.

There are two classes of P2P overlay networks: Structured and Unstructured. The technical meaning of Structured is that the P2P overlay network topology is tightly controlled and content is placed not at random peers but at specified locations that will make subsequent queries more efficient. Such Structured P2P systems use the Distributed Hash Table (DHT) as a substrate, in which data object (or value) location information is placed deterministically at the peers with identifiers corresponding to the data object's unique key. DHT-based systems have a
property that consistently assigns uniform random NodeIDs to the set of peers in a large space of identifiers. Data objects are assigned unique identifiers called keys, chosen from the same identifier space. Keys are mapped by the overlay network protocol to a unique live peer in the overlay network. The P2P overlay networks support the scalable storage and retrieval of {key,value} pairs on the overlay network, as illustrated in Figure 2. Given a key, a store operation (put(key,value)) and a lookup retrieval operation (value=get(key)) can be invoked to store and retrieve the data object corresponding to the key, which involves routing requests to the peer corresponding to the key.

[Fig. 2. Application Interface for Structured DHT-based P2P Overlay Systems]

Each peer maintains a small routing table consisting of its neighboring peers' NodeIDs and IP addresses. Lookup queries or messages are forwarded across overlay paths in a progressive manner, towards peers with NodeIDs that are closer to the key in the identifier space. Different DHT-based systems have different organization schemes for the data objects, their key space and routing strategies. In theory, DHT-based systems can guarantee that any data object can be located within a small O(log N) number of overlay hops on average, where N is the number of peers in the system. However, the underlying network path between two peers can be significantly different from the path on the DHT-based overlay network, so the lookup latency in DHT-based P2P overlay networks can be quite high and can adversely affect the performance of the applications running over them. Plaxton et al. [1] provide an elegant algorithm that achieves nearly optimal latency on graphs that exhibit power-law expansion [2], while preserving the scalable routing properties of the DHT-based system. However, this algorithm requires pair-wise probing between peers to determine latencies, and it is unlikely to scale to a large number of peers in the overlay. DHT-based systems [3]–[7] are an important class of P2P routing infrastructures. They support the rapid development of a wide variety of Internet-scale applications, ranging from distributed file and naming systems to application-layer multicast, and they enable scalable, wide-area retrieval of shared information.

In 1999, Napster [8] pioneered the idea of a peer-to-peer file-sharing system supporting a centralized file search facility. It was the first system to recognize that requests for popular content need not be sent to a central server but could instead be handled by the many peers that have the requested content. Such P2P file-sharing systems are self-scaling in that, as more peers join the system, they add to the aggregate download capability. Napster achieved this self-scaling behavior by using a centralized search facility based on file lists provided by each peer; thus it did not require much bandwidth for the centralized search. Such a system has the issue of a single point of failure due to the centralized search mechanism. A lawsuit filed by the Recording Industry Association of America (RIAA) eventually forced Napster to shut down its digital-music file-sharing service — literally, its killer application. However, the paradigm caught the imagination of platform providers and users alike.

Gnutella [9]–[11] is a decentralized system that distributes both the search and download capabilities, establishing an overlay network of peers. It was the first system to make use of an Unstructured P2P overlay network. An Unstructured P2P system is composed of peers joining the network under some loose rules, without any prior knowledge of the topology. The network uses flooding as the mechanism to send queries across the overlay, with a limited scope. When a peer receives a flood query, it sends a list of all content matching the query to the originating peer. While flooding-based techniques are effective for locating highly replicated items and are resilient to peers joining and leaving the system, they are poorly suited for locating rare items. Clearly this approach is not scalable, as the load on each peer grows linearly with the total number of queries and the system size. Thus, Unstructured P2P networks face one basic problem: peers readily become overloaded, so the system does not scale when handling a high rate of aggregate queries or a sudden increase in system size.

Although Structured P2P networks can efficiently locate rare items, since their key-based routing is scalable, they incur significantly higher overheads than Unstructured P2P networks for popular content. Consequently, over the Internet today, decentralized Unstructured P2P overlay networks are more commonly used. However, there are recent efforts on Key-based Routing (KBR) API abstractions [12], which allow more application-specific functionality to be built over a common basic KBR API, and on OpenHash (an openly accessible DHT service) [13], which provides a unifying platform supplying developers with basic DHT service models running on a set of infrastructure hosts, so that DHT-based overlay applications can be deployed without the burden of maintaining a DHT, spurring the deployment of DHT-based applications. In contrast, Unstructured P2P overlay systems are Ad-Hoc in nature and do not present the possibility of being unified under a common platform for application development.

In Sections II and IV of the paper, we describe the key features of Structured P2P and Unstructured P2P overlay networks and their operational functionalities. After providing a basic understanding of the various overlay schemes in these two classes, we proceed to evaluate them and discuss their developments in Sections III and V. Then, we attempt to use the taxonomy to make comparisons between the various discussed Structured
and Unstructured P2P overlay schemes:

• Decentralization — examine whether the overlay system is distributed.
• Architecture — describe the overlay system architecture with respect to its operation.
• Lookup Protocol — the lookup query protocol adopted by the overlay system.
• System Parameters — the system parameters required for the overlay system's operation.
• Routing Performance — the lookup routing protocol performance in overlay routing.
• Routing State — the routing state and scalability of the overlay system.
• Peers Join and Leave — describe the behavior of the overlay system under churn and self-organization.
• Security — look into the security vulnerabilities of the overlay system.
• Reliability and Fault Resiliency — examine how robust the overlay system is when subjected to faults.

Lastly, in Section VI, we conclude with some thoughts on the relative applicability of each class to some of the research problems that arise in Ad-Hoc, location-based or content-delivery networks.

II. STRUCTURED P2P OVERLAY NETWORKS

In this category, the overlay network assigns keys to data items and organizes its peers into a graph that maps each data key to a peer. This structured graph enables efficient discovery of data items using the given keys. However, in its simple form, this class of systems does not support complex queries, and it is necessary to store a copy of, or a pointer to, each data object (or value) at the peer responsible for the data object's key. In this section, we survey and compare the Structured P2P overlay networks: Content Addressable Network (CAN) [5], Tapestry [7], Chord [6], Pastry [4], Kademlia [14] and Viceroy [15].

A. Content Addressable Network (CAN)

The Content Addressable Network (CAN) [5] is a distributed, decentralized P2P infrastructure that provides hash-table functionality on an Internet-like scale. CAN is designed to be scalable, fault-tolerant, and self-organizing. The architectural design is a virtual multi-dimensional Cartesian coordinate space on a multi-torus; this d-dimensional coordinate space is completely logical. The entire coordinate space is dynamically partitioned among all the peers (N peers) in the system, such that every peer possesses its own individual, distinct zone within the overall space. A CAN peer maintains a routing table that holds the IP address and virtual coordinate zone of each of its neighbors in the coordinate space. A CAN message includes the destination coordinates. Using the neighbor coordinates, a peer routes a message towards its destination by simple greedy forwarding to the neighbor peer that is closest to the destination coordinates. CAN has a routing performance of O(d·N^(1/d)) and its routing state is bounded by 2·d.

[Fig. 3. Example of 2-d space CAN before and after Peer Z joins]

As shown in Figure 3, which we adapted from the CAN paper [5], the virtual coordinate space is used to store {key,value} pairs as follows: to store a pair {K,V}, key K is deterministically mapped onto a point P in the coordinate space using a uniform hash function. To retrieve the entry corresponding to key K, any peer can apply the same deterministic hash function to map K onto the point P and then retrieve the corresponding value V from that point. If the requesting peer or its immediate neighbors do not own the point P, the request must be routed through the CAN infrastructure until it reaches the peer in whose zone P lies. A peer maintains the IP addresses of those peers that hold coordinate zones adjoining its own zone. This set of immediate neighbors in the coordinate space serves as a coordinate routing table that enables efficient routing between points in the space.

A new peer that joins the system must have its own portion of the coordinate space allocated. This is achieved by splitting an existing peer's zone in half: retaining half for that peer and allocating the other half to the new peer. CAN has an associated DNS domain name, which is resolved into the IP address of one or more CAN bootstrap peers (which maintain a partial list of CAN peers). For a new peer to join the CAN
network, the peer looks up a CAN domain name in the DNS to retrieve a bootstrap peer's IP address, similar to the bootstrap mechanism in [16]. The bootstrap peer supplies the IP addresses of some randomly chosen peers in the system. The new peer then randomly chooses a point P and sends a JOIN request destined for point P. Each CAN peer uses the CAN routing mechanism to forward the message until it reaches the peer in whose zone P lies. The current peer in zone P then splits its zone in half and assigns the other half to the new peer. For example, in a 2-dimensional (2-d) space, a zone would first be split along the X dimension, then the Y, and so on. The {K,V} pairs from the half zone to be handed over are also transferred to the new peer. After obtaining its zone, the new peer learns the IP addresses of its neighbor set from the previous occupant of point P, and adds that previous peer to its own neighbor set.

When a peer leaves the CAN network, an immediate takeover algorithm ensures that one of the failed peer's neighbors takes over the zone, having started a takeover timer. The peer updates its neighbor set to eliminate those peers that are no longer its neighbors, and every peer in the system then sends soft-state updates to ensure that all of their neighbors learn about the change and update their own neighbor sets. The number of neighbors a peer maintains depends only on the dimensionality of the coordinate space (i.e. 2·d) and is independent of the total number of peers in the system.

The Figure 3 example illustrates a simple routing path from peer X to point E, and a new peer Z joining the CAN network. For a d-dimensional space partitioned into n equal zones, the average routing path length is (d/4)·(n^(1/d)) hops, and individual peers maintain a list of 2·d neighbors. Thus, the number of peers (or zones) can grow without increasing per-peer state, while the average path length grows as O(n^(1/d)). Since there are many different paths between two points in the space, when one or more of a peer's neighbors fail, the peer can still route along the next best available path.

The CAN algorithm can be improved by maintaining multiple, independent coordinate spaces, with each peer in the system being assigned a different zone in each coordinate space, called a reality. For a CAN with r realities, a single peer is assigned r coordinate zones, one on each reality available, and holds r independent neighbor sets. The contents of the hash table are replicated on every reality, thus improving data availability. For further data-availability improvement, CAN could use k different hash functions to map a given key onto k points in the coordinate space, resulting in the replication of a single {key,value} pair at k distinct peers in the system. A {key,value} pair is then unavailable only when all k replicas are simultaneously unavailable. Queries for a particular hash-table entry could thus be forwarded to all k peers in parallel, reducing the average query latency and enhancing the reliability and fault-resiliency properties.

CAN could be used in large-scale storage management systems such as OceanStore [17], Farsite [18], and Publius [19]. These systems require efficient insertion and retrieval of content in a large distributed storage network with a scalable indexing mechanism. Another potential application for CANs is the construction of wide-area name-resolution services that decouple the naming scheme from the name-resolution process, enabling an arbitrary and location-independent naming scheme.

B. Chord

Chord [6] uses consistent hashing [20] to assign keys to its peers. Consistent hashing is designed to let peers enter and leave the network with minimal interruption. This decentralized scheme tends to balance the load on the system, since each peer receives roughly the same number of keys, and there is little movement of keys when peers join and leave the system. In a steady state, for N peers in the system, each peer maintains routing-state information for only about O(log N) other peers. This may be efficient, but performance degrades gracefully when that information is out of date.

The consistent hash function assigns each peer and data key an m-bit identifier, using SHA-1 [21] as the base hash function. A peer's identifier is chosen by hashing the peer's IP address, while a key identifier is produced by hashing the data key. The length of the identifier m must be large enough to make the probability of two keys hashing to the same identifier negligible. Identifiers are ordered on an identifier circle modulo 2^m. Key k is assigned to the first peer whose identifier is equal to or follows k in the identifier space. This peer is called the successor peer of key k, denoted by successor(k). If identifiers are represented as a circle of numbers from 0 to 2^m − 1, then successor(k) is the first peer clockwise from k. The identifier circle is termed the Chord ring. To maintain the consistent hashing mapping when a peer n joins the network, certain keys previously assigned to n's successor now need to be reassigned to n. When peer n leaves the Chord system, all of its assigned keys are reassigned to n's successor. Therefore, peers join and leave the system with (log N)^2 performance; no other changes of key assignment to peers need to occur. In Figure 4 (adapted from [6]), the Chord ring is depicted with m = 6. This particular ring has ten peers and stores five keys. The successor of identifier 10 is peer 14, so key 10 is located at NodeID 14. Similarly, if a peer were to join with identifier 26, it would take over the key with identifier 24 from the peer with identifier 32.

Each peer in the Chord ring needs to know how to contact its current successor peer on the identifier circle. Lookup queries involve the matching of key and NodeID: a query for a given identifier could be passed around the circle via these successor pointers until it encounters a pair of peers that straddle the desired identifier; the second peer in the pair is the peer the query maps to. An example is presented in Figure 4, whereby peer 8 performs a lookup for key 54. Peer 8 invokes the find_successor operation for this key, which eventually returns the successor of that key, i.e. peer 56. In this scheme, the query visits every peer on the circle between peer 8 and peer 56, and the response is returned along the reverse of the path.

As m is the number of bits in the key/NodeID space, each peer n maintains a routing table with up to m entries, called the finger table. The ith entry in the table at peer n contains
the identity of the first peer s that succeeds n by at least 2^(i−1) on the identifier circle, i.e. s = successor(n + 2^(i−1)), where 1 ≤ i ≤ m. Peer s is the ith finger of peer n (n.finger[i]). A finger-table entry includes both the Chord identifier and the IP address (and port number) of the relevant peer. Figure 4 shows the finger table of peer 8: the first finger entry for this peer points to peer 14, as the latter is the first peer that succeeds (8 + 2^0) mod 2^6 = 9. Similarly, the last finger of peer 8 points to peer 42, i.e. the first peer that succeeds (8 + 2^5) mod 2^6 = 40. In this way, peers store information about only a small number of other peers, and know more about peers closely following them on the identifier circle than about peers farther away. Also, a peer's finger table does not generally contain enough information to directly determine the successor of an arbitrary key k. For example, peer 8 cannot determine the successor of key 34 by itself, as the successor of this key (peer 38) is not present in peer 8's finger table.

[Fig. 4. Chord ring with identifier circle consisting of ten peers and five data keys. It shows the path followed by a query originated at peer 8 for the lookup of key 54, and the finger table entries for peer 8.]

When a peer joins the system, the successor pointers of some peers need to be changed. It is important that the successor pointers are up to date at all times, because the correctness of lookups is otherwise not guaranteed. The Chord protocol uses a stabilization protocol [6], running periodically in the background, to update the successor pointers and the entries in the finger table. The correctness of the Chord protocol relies on the fact that each peer is aware of its successors. When peers fail, it is possible that a peer does not know its new successor and has no chance to learn about it. To avoid this situation, peers maintain a successor list of size r, which contains the peer's first r successors. When the successor peer does not respond, the peer simply contacts the next peer on its successor list. Assuming that peer failures occur with probability p, the probability that every peer on the successor list fails is p^r. Increasing r therefore makes the system more robust, and by tuning this parameter, any degree of robustness with good reliability and fault resiliency may be achieved.

The following applications are examples of how Chord could be used:

• Cooperative mirroring, or the Cooperative File System (CFS) [22], in which multiple providers of content cooperate to store and serve each other's data. Spreading the total load evenly over all participant hosts lowers the total cost of the system, since each participant needs to provide capacity only for the average load, not for the peak load. There are two layers in CFS. The DHash (Distributed Hash) layer performs block fetches for the peer, distributes the blocks among the servers, and maintains cached and replicated copies. The Chord-layer distributed lookup system is used to locate the servers responsible for a block.

• Chord-based DNS [23], which provides a lookup service with host names as keys and IP addresses (and other host information) as values. Chord could provide a DNS-like service by hashing each host name to a key [20]. Chord-based DNS would require no special servers, while ordinary DNS relies on a set of special root servers. DNS also requires manual management of the routing information (DNS records) that allows clients to navigate the name-server hierarchy; Chord automatically maintains the correctness of the analogous routing information. DNS only works well when host names are hierarchically structured to reflect administrative boundaries; Chord imposes no naming structure. DNS is specialized to the task of finding named hosts or services, while Chord can also be used to find data object values that are not tied to particular machines.

C. Tapestry

Sharing similar properties with Pastry, Tapestry [7] employs decentralized randomness to achieve both load distribution and routing locality. The difference between Pastry and Tapestry lies in the handling of network locality and data-object replication; this difference will become more apparent in the Pastry section. Tapestry's architecture uses a variant of the Plaxton et al. [1] distributed search technique, with additional mechanisms to provide availability, scalability, and adaptation in the presence of failures and attacks. Plaxton et al. propose a distributed data structure, known as the Plaxton mesh, optimized to support a network overlay for locating named data objects which are connected to one root peer. Tapestry, on the other hand, uses multiple roots for each data object to avoid a single point of failure. In the Plaxton mesh, peers can take on the roles of servers (where data objects are stored), routers (which forward messages), and clients (the origins of requests). It uses
  6. 6. 6 IEEE COMMUNICATIONS SURVEY AND TUTORIAL, MARCH 2004 local routing maps at each peer, to incrementally route overlay tains cached content for fault recovery by relying on TCP messages to the destination ID digit by digit, for instance, timeouts and UDP periodic heartbeats packets, to detect ∗ ∗ ∗7 ⇒ ∗ ∗ 97 ⇒ ∗297 ⇒ 3297, where ’∗’ is the wildcard, link, server failures during normal operations, and rerouting similar to the longest prefix routing in the CIDR IP address through its neighbors. During fault operation, each entry in allocation architecture [24]. The resolution of digits from right the neighbor map maintains two backup neighbors in addition to left or left to right is arbitrary. A peer’s local routing map to the closest/primary neighbor. On a testbed of 100 machines has multiple levels, where each of them represents a matching with 1000 peers simulations, the results in [103] shows that the suffix up to a digit position in the ID space. The nth peer the good routing rates and maintenance bandwidths during that a message reaches, shares a suffix of at least length n with instantaneous failures and continuing churn. the destination ID. To locate the next router, the (n + 1)th A variety of different applications have been designed and level map is examined to locate the entry matching the value implemented on Tapestry. Tapestry is self-organizing, fault- of the next digit in the destination ID. This routing method tolerant, resilient under load, and it is a fundamental com- guarantees that any existing unique peer in the system can be ponent of the OceanStore system [17], [25]. The OceanStore located within at most logB N logical hops, in a system with N is a global-scale, highly available storage utility deployed on peers using NodeIDs of base B. Since the peer’s local routing the PlanetLab [26] testbed. 
map assumes that the preceding digits all match the current peer's suffix, the peer needs only to keep a small constant number (B) of entries at each route level, yielding a routing map of fixed constant size: (entries/map) x no. of maps = B · logB N.

The lookup and routing mechanisms of Tapestry are similar to those of Plaxton, being based on matching the suffix in the NodeID as described above. Routing maps are organized into routing levels, where each level contains entries that point to the set of peers closest in distance that match the suffix for that level. Also, each peer holds a list of pointers to peers referred to as neighbors. Tapestry stores the locations of all data object replicas to increase semantic flexibility, allowing the application level to choose from a set of data object replicas based on some selection criterion, such as date. Each data object may include an optional application-specific metric in addition to a distance metric; e.g. the OceanStore [17] global storage architecture finds the closest cached document replica which satisfies the closest distance metric. Such queries deviate from the simple "find first" semantics, and Tapestry will route the message to the closest k distinct data objects.

Tapestry handles the problem of a single point of failure at a data object's root peer by assigning multiple roots to each object. Tapestry makes use of surrogate routing to select root peers incrementally during the publishing process, in order to insert location information into Tapestry. Surrogate routing provides a technique by which any identifier can be uniquely mapped to an existing peer in the network. A data object's root or surrogate peer is chosen as the peer whose NodeID matches the data object's ID, X. This is unlikely to happen, given the sparse nature of the NodeID space. Nevertheless, Tapestry assumes peer X exists by attempting to route a message to it. A route to a non-existent identifier will encounter empty neighbor entries at various positions along the way. The goal is to select an existing link which can act as an alternative to the desired link, i.e. the one associated with a digit of X. Routing terminates when a map is reached where the only non-empty routing entry belongs to the current peer. That peer is then designated as the surrogate root for the data object. While surrogate routing may take additional hops to reach a root compared with the Plaxton algorithm, the additional number of hops is small. Thus, surrogate routing in Tapestry has minimal routing overhead relative to the static global Plaxton algorithm. Tapestry also addresses the issues of fault adaptation and maintenance.

OceanStore servers use Tapestry to disseminate encoded file blocks efficiently, and clients can quickly locate and retrieve nearby file blocks by their ID, despite server and network failures. Other Tapestry applications include Bayeux [27], an efficient self-organizing application-level multicast system, and SpamWatch [28], a decentralized spam-filtering system that uses a similarity search engine implemented on Tapestry.

D. Pastry

Pastry [4], like Tapestry, makes use of Plaxton-like prefix routing to build a self-organizing decentralized overlay network, where each peer routes client requests and interacts with local instances of one or more applications. Each peer in Pastry is assigned a 128-bit peer identifier (NodeID). The NodeID is used to give a peer's position in a circular NodeID space, which ranges from 0 to 2^128 − 1. The NodeID is assigned randomly when a peer joins the system, and it is assumed to be generated such that the resulting set of NodeIDs is uniformly distributed in the 128-bit space. For a network of N peers, Pastry routes to the numerically closest peer to a given key in less than logB N steps under normal operation (where B = 2^b is a configuration parameter with a typical value of b = 4). The NodeIDs and keys are considered sequences of digits with base B. Pastry routes a message to the peer whose NodeID is numerically closest to the given key: a peer normally forwards the message to a peer whose NodeID shares with the key a prefix that is at least one digit (or b bits) longer than the prefix that the key shares with the current peer's NodeID.

As shown in Figure 5, each Pastry peer maintains a routing table, a neighborhood set, and a leaf set. A peer's routing table is designed with logB N rows, where each row holds B − 1 entries. The B − 1 entries at row n of the routing table each refer to a peer whose NodeID shares the current peer's NodeID in the first n digits, but whose (n + 1)th digit has one of the B − 1 possible values other than the (n + 1)th digit of the current peer's NodeID. Each entry in the routing table contains the IP address of a peer whose NodeID has the appropriate prefix, chosen according to a close-proximity metric. The choice of b involves a trade-off between the size of the populated portion of the routing table [approximately (logB N) x (B − 1) entries] and the maximum number of hops required to route between any pair of peers (logB N).
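The prefix-based forwarding rule described above (forward to a peer sharing a longer prefix with the key; if the matching table entry is empty, fall back to a numerically closer known peer) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the hex-string IDs, the shortened 8-digit ID length, and the helper names `shared_prefix_len` and `next_hop` are our own assumptions.

```python
# Illustrative sketch of Pastry-style prefix routing with base B = 2^b digits.
# IDs are shortened to 8 hex digits for readability; Pastry uses 128-bit NodeIDs.

DIGITS = 8  # hypothetical shortened ID length (base-16 digits)

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading base-B digits shared by two IDs (hex strings)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current: str, key: str, routing_table, leaf_set):
    """Forward to a peer whose NodeID shares a longer prefix with the key;
    on an empty table entry, fall back to the numerically closest known peer."""
    p = shared_prefix_len(current, key)
    if p == DIGITS:
        return current                       # current peer is responsible
    row, col = p, key[p]                     # row = shared prefix length, col = next digit
    candidate = routing_table[row].get(col)  # entry may be empty (None)
    if candidate is not None:
        return candidate
    # routing-table hole: pick the numerically closest known peer to the key
    return min(leaf_set + [current], key=lambda n: abs(int(n, 16) - int(key, 16)))

# toy state: one routing-table row (a dict keyed by next digit) per prefix length
table = [dict() for _ in range(DIGITS)]
table[0]['a'] = 'a0000000'   # hypothetical peer sharing 0 digits, next digit 'a'
table[1]['7'] = '37000000'   # hypothetical peer sharing prefix '3', next digit '7'

print(next_hop('31415926', 'a7654321', table, []))  # row 0 -> 'a0000000'
print(next_hop('31415926', '37654321', table, []))  # row 1 -> '37000000'
```

A real Pastry node consults its leaf set before the routing table and uses the proximity metric when filling entries; the sketch shows only the prefix-matching core.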
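The trade-off governed by b can be checked with simple arithmetic. The snippet below is our own back-of-the-envelope calculation, not code from the paper; the function name `pastry_table_size` is hypothetical.

```python
# Quick numeric check of Pastry's routing-table trade-off:
# roughly (log_B N) populated rows of (B - 1) entries each,
# and at most about log_B N routing hops, where B = 2^b.
import math

def pastry_table_size(n_peers: int, b: int):
    B = 2 ** b
    rows = math.log(n_peers, B)           # populated rows ~ log_B N
    return rows * (B - 1), rows           # (expected entries, hop bound)

entries, hops = pastry_table_size(10**6, 4)
print(round(entries), math.ceil(hops))    # ~75 entries, 5 hops
```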
With a value of b = 4 and 10^6 peers, a routing table contains on average 75 entries and the expected number of routing hops is 5. The neighborhood set, M, contains the NodeIDs and IP addresses of the |M| peers that are closest in proximity to the local peer. The network proximity that Pastry uses is based on a scalar proximity metric, such as the number of IP routing hops or the geographic distance. The leaf set, L, is the set of peers with the |L|/2 numerically closest larger NodeIDs and the |L|/2 numerically closest smaller NodeIDs, relative to the current peer's NodeID. Typical values for |L| and |M| are B or 2 x B. Even with concurrent peer failures, eventual delivery is guaranteed with good reliability and fault resiliency, unless |L|/2 peers with adjacent NodeIDs fail simultaneously (|L| is a configuration parameter with a typical value of 16 or 32).

Fig. 5. Pastry peer's routing table, leaf set and neighbor set. An example of a routing path for a Pastry peer.

When a new peer (with NodeID X) joins the network, it needs to initialize its state table and inform other peers of its presence. This joining peer needs to know the address of a contact peer in the network. A small list of contact peers, based on a proximity metric (e.g. the RTT to each peer) to provide better performance, could be provided as a service in the network, and the joining peer could select at random one of the peers to be its contact peer. So, this new peer knows initially about a nearby Pastry peer A, according to a proximity metric, from the list of contact peers. Peer X asks A to route a special join message with the key equal to X. Pastry routes the join message to the existing peer Z whose NodeID is numerically closest to X. In response to receiving the join request, peers A, Z and all peers encountered on the path from A to Z send their state tables to X. Finally, X informs any peers that need to be aware of its arrival. This ensures that X initializes its state with appropriate values and that the state in all other affected peers is updated. As peer A is assumed to be topologically close to the new peer X, A's neighborhood set initializes X's neighborhood set. Considering the general case, where the NodeIDs of A and X share no common prefix, let Ai denote peer A's row of the routing table at level i. A0 contains appropriate values for X0, since the entries in row 0 of the routing table are independent of a peer's NodeID. Other levels of A's routing table are of no use to X, since A's and X's NodeIDs share no common prefix. Appropriate values for X1 can be taken from B1, where B is the first peer along the route path from A to Z. The entries in B1 and X1 share the same prefix, because X and B have the same first digit in their NodeIDs. Finally, X transmits a copy of its resulting state to each of the peers found in its neighborhood set, leaf set and routing table. These peers then update their own state based on the information received.

A Pastry peer is considered failed when its immediate neighbors in the NodeID space can no longer communicate with the peer. To replace a failed peer in the leaf set, its neighbor in the NodeID space contacts the live peer with the largest index on the side of the failed peer, and requests its leaf table. For example, if Li failed for −|L|/2 ≤ i < 0, it requests the leaf set from L−|L|/2. Let the received leaf set be L', which overlaps the current peer's leaf set L and contains peers with nearby NodeIDs not present in L. The appropriate one is then chosen to insert into L, verifying that the peer is actually still alive by contacting it. To repair a failed routing table entry Rd at level l, a peer first contacts the peer referred to by another entry Ri, i ≠ d, of the same row, and asks for that peer's entry for Rd. If none of the entries in row l has a pointer to a live peer with the appropriate prefix, the peer contacts an entry Ri at level l + 1, i ≠ d, thereby casting a wider coverage. The neighborhood set is not used in the routing of messages, but it is still kept fresh and updated, because the set plays an important role in exchanging information about nearby peers. Therefore, a peer contacts each member of the neighborhood set periodically to see if it is still alive. If the peer is not responding, the peer asks other members for their neighborhood sets, checks the proximity of each of the newly discovered peers, and updates its own neighborhood set.

Pastry is being used in the implementation of an application-level multicast called SplitStream [29]. Instead of relying on a multicast infrastructure in the network, which is not widely available, the participating peers route and distribute multicast messages using only unicast network services. SplitStream allows a cooperative environment where peers contribute resources in exchange for using the service. The key idea is to split the content into k stripes and to multicast each stripe using a separate tree. Peers join as many trees as there are stripes they wish to receive and they
specify an upper bound on the number of stripes that they are willing to forward. The challenge is to construct this forest of multicast trees such that an interior peer in one tree is a leaf peer in all the remaining trees and the bandwidth constraints specified by the peers are satisfied. This ensures that the forwarding load can be spread across all participating peers. For example, if all peers wish to receive k stripes and they are willing to forward k stripes, SplitStream will construct a forest such that the forwarding load is evenly balanced across all peers while achieving low delay and link stress across the network.

Scribe [30], [31] is a scalable application-level multicast infrastructure that supports a large number of groups with a large number of members per group. Scribe is built on top of Pastry, which is used to create and manage groups and to build efficient multicast trees for the dissemination of messages to each group. Scribe builds a multicast tree formed by joining the Pastry routes from each group member to a rendezvous point associated with a group. Membership maintenance and message dissemination in Scribe leverage the robustness, self-organization, locality and reliability properties of Pastry.

Squirrel [32] uses Pastry as its data object location service, to identify and route to peers that cache copies of a requested data object. It facilitates mutual sharing of web data objects among client peers, and enables the peers to export their local caches to other peers in the network, thus creating a large shared virtual web cache. Each peer then performs both web browsing and web caching, without the need for expensive and dedicated hardware for centralized web caching. Squirrel faces a new challenge whereby peers in a decentralized cache incur the overhead of having to serve each other's requests, and this extra load must be kept low.

PAST [33], [34] is a large-scale P2P persistent storage utility, based on Pastry. The PAST system is composed of peers connected to the Internet, where each peer is capable of initiating and routing client requests to insert or retrieve files. Peers may also contribute storage to the system. A storage system like PAST is attractive because it exploits the multitude and diversity of peers in the Internet to achieve strong persistence and high availability. This eradicates the need for physical transport of storage media to protect backup and archival data, and the need for explicit mirroring to ensure high availability and throughput for shared data. A global storage utility also facilitates the sharing of storage and bandwidth, thus permitting a group of peers to jointly store or publish content that would exceed the capacity or bandwidth of any individual peer.

Pastiche [35] is a simple and inexpensive backup system that exploits excess disk capacity to perform P2P backup with no administrative costs. The cost and inconvenience of backup are unavoidable, and often prohibitive. Small-scale solutions require significant administrative effort. Large-scale solutions require the aggregation of substantial demand to justify the capital costs of a large, centralized repository. Pastiche builds on three architectures: Pastry, which provides the scalable P2P network with self-administered routing and peer location; Content-based indexing [36], [37], which provides flexible discovery of redundant data for similar files; and Convergent encryption [18], which allows hosts to use the same encrypted representation for common data without sharing keys.

E. Kademlia

The Kademlia [14] P2P decentralized overlay network takes the basic approach of assigning each peer a NodeID in the 160-bit key space; {key,value} pairs are stored on peers with IDs close to the key. A NodeID-based routing algorithm is used to locate peers near a destination key. One of the key architectural features of Kademlia is the use of a novel XOR metric for distance between points in the key space. XOR is symmetric, and it allows peers to receive lookup queries from precisely the same distribution of peers contained in their routing tables. Kademlia can send a query to any peer within an interval, allowing it to select routes based on latency, or to send parallel asynchronous queries. It uses a single routing algorithm throughout the process to locate peers near a particular ID.

Every message transmitted by a peer includes its peer ID, permitting the recipient to record the sender peer's existence. Data keys are also 160-bit identifiers. To locate {key,value} pairs, Kademlia relies on the notion of distance between two identifiers. Given two 160-bit identifiers, a and b, it defines the distance between them as their bitwise exclusive OR (XOR), interpreted as an integer: d(a, b) = a ⊕ b. This is a non-Euclidean metric: d(a, a) = 0; d(a, b) > 0 if a ≠ b; and for all a, b: d(a, b) = d(b, a). XOR also offers the triangle inequality property: d(a, b) + d(b, c) ≥ d(a, c), since d(a, c) = d(a, b) ⊕ d(b, c) and a + b ≥ a ⊕ b for all a, b ≥ 0. Similarly to Chord's clockwise circle metric, XOR is unidirectional: for any given point x and distance d > 0, there is exactly one point y such that d(x, y) = d. The unidirectional approach makes sure that all lookups for the same key converge along the same path, regardless of the originating peer. Hence, caching {key,value} pairs along the lookup path alleviates hot spots.

Each peer in the network stores a list of {IP address, UDP port, NodeID} triples for peers at a distance between 2^i and 2^(i+1) from itself. These lists are called k-buckets. Each k-bucket is kept sorted by the time last seen, i.e. the least-recently accessed peer at the head and the most-recently accessed at the tail. The Kademlia routing protocol consists of:
• PING probes a peer to check if it is active.
• STORE instructs a peer to store a {key,value} pair for later retrieval.
• FIND NODE takes a 160-bit ID and returns {IP address, UDP port, NodeID} triples for the k peers it knows that are closest to the target ID.
• FIND VALUE is similar to FIND NODE and returns {IP address, UDP port, NodeID} triples, except that when a peer has received a STORE for the key, it just returns the stored value.

Importantly, a Kademlia peer must be able to locate the k closest peers to some given NodeID. The lookup initiator starts by picking X peers from its closest non-empty k-bucket, and then sends parallel asynchronous FIND NODE requests to the X peers it has chosen. If FIND NODE fails to return a peer that is any closer
than the closest peers already seen, the initiator resends the FIND NODE to all of the k closest peers it has not already queried. It can route for lower latency because it has the flexibility of choosing any one of k peers to forward a request. To find a {key,value} pair, a peer starts by performing a FIND VALUE lookup to find the k peers with IDs closest to the key. To join the network, a peer n must have a contact to an already participating peer m. Peer n inserts peer m into the appropriate k-bucket, and then performs a peer lookup for its own peer ID. Peer n then refreshes all k-buckets farther away than its closest neighbor; during this refresh, peer n populates its own k-buckets and inserts itself into other peers' k-buckets, if needed.

F. Viceroy

The Viceroy [15] P2P decentralized overlay network is designed to handle the discovery and location of data and resources in a dynamic butterfly fashion. Viceroy employs consistent hashing [20] to distribute data so that it is balanced across the set of servers and resilient to servers joining and leaving the network. It utilizes the DHT to manage the distribution of data among a changing set of servers, allowing peers to contact any server in the network to locate any stored resource by name. In addition to this, Viceroy maintains an architecture that is an approximation of a butterfly network [38], as shown in Figure 6 (adapted from the diagram in [15]), and uses links between successors and predecessors on the ring (a key is mapped to its successor on the ring) for short distances; these ideas were based on Kleinberg [39] and Barrière et al. [40]. The diameter of its overlay is better than that of CAN, and its degree is better than those of Chord, Tapestry and Pastry.

When N peers are operational, one of log N levels is selected with near-equal probability. A level-l peer's two edges are connected to peers at level l + 1: a down-right edge is added to a long-range contact at level l + 1 at a distance of about 1/2^l away, and a down-left edge at a close distance on the ring to level l + 1. The up edge to a nearby peer at level l − 1 is included if l > 1. Then, level-ring links are added to the next and previous peers of the same level l. Routing is done by climbing using up connections to a level l − 1 peer; it then proceeds down the levels of the tree using the down links, moving from level l to level l + 1, following either the edge to the nearby down link or the farther down link, depending on the distance (about 1/2^l). This continues recursively until a peer with no down links is reached, which is in the vicinity of the target peer. So, a vicinity lookup is performed using the ring and level-ring links. For reliability and fault resiliency, when a peer leaves the overlay network, it hands over its key pairs to a successor from the ring pointers and notifies other peers to find a replacement. It is formalized and proved [15] that the routing process requires only O(log N) hops, where N is the number of peers in the network.

Fig. 6. A simplified Viceroy network. For simplicity, the up link, ring and level-ring links are not shown.

III. DISCUSSION ON STRUCTURED P2P OVERLAY NETWORK

The algorithm of Plaxton was originally devised to route web queries to nearby caches, and this influenced the design of Pastry, Tapestry and Chord. The Plaxton method has logarithmic expected join/leave complexity, and it ensures that queries never travel further in network distance than the peer where the key is stored. However, Plaxton has several disadvantages: it requires global knowledge to construct the overlay; an object's root peer is a single point of failure; there is no support for insertion or deletion of peers; and there is no avoidance of hotspot congestion. The Pastry and Tapestry schemes rely on the DHT to provide the substrate for semantic-free, data-centric references, through the assignment of a semantic-free NodeID such as a 160-bit key, and they perform efficient request routing between lookup peers using an efficient and dynamic routing infrastructure in which peers leave and join. Overlays that perform query routing in DHT-based systems have strong theoretical foundations, guaranteeing that a key can be found if it exists; however, they do not capture the relationships between the object name and its content. Moreover, DHT-based systems have a few problems in terms of data object lookup latency:
1) For each overlay hop, peers route a message to the next intermediate peer, which may be located very far away with regard to the physical topology of the underlying IP network. This can result in high network delay and unnecessary long-distance network traffic, even on a deterministically short overlay path of O(log N) hops (where N is the number of peers).
2) DHT-based systems assume that all peers participate equally in hosting published data objects or their location information. This would lead to a bottleneck at low-capacity peers.

The Pastry and Tapestry routing algorithms are randomized approximations of a hypercube, and routing towards an object is done by matching longer address suffixes, until either the object's root peer or another peer with a nearby copy is found. Rhea et al. [41] made use of the FreePastry implementation to discover that most lookups fail to complete when there is excessive churn. They claimed that short-lived peers leave the overlay with lookups that have not yet timed out. They outlined design issues pertaining to DHT-based performance under churn: lookup timeouts, reactive versus