A survey of peer-to-peer content distribution technologies



STEPHANOS ANDROUTSELLIS-THEOTOKIS AND DIOMIDIS SPINELLIS
Athens University of Economics and Business

Distributed computer architectures labeled "peer-to-peer" are designed for the sharing of computer resources (content, storage, CPU cycles) by direct exchange, rather than requiring the intermediation or support of a centralized server or authority. Peer-to-peer architectures are characterized by their ability to adapt to failures and accommodate transient populations of nodes while maintaining acceptable connectivity and performance. Content distribution is an important peer-to-peer application on the Internet that has received considerable research attention. Content distribution applications typically allow personal computers to function in a coordinated manner as a distributed storage medium by contributing, searching, and obtaining digital content. In this survey, we propose a framework for analyzing peer-to-peer content distribution technologies. Our approach focuses on nonfunctional characteristics such as security, scalability, performance, fairness, and resource management potential, and examines the way in which these characteristics are reflected in, and affected by, the architectural design decisions adopted by current peer-to-peer systems. We study current peer-to-peer systems and infrastructure technologies in terms of their distributed object location and routing mechanisms, their approach to content replication, caching and migration, their support for encryption, access control, authentication and identity, anonymity, deniability, accountability and reputation, and their use of resource trading and management schemes.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design - Network topology; C.2.2 [Computer-Communication Networks]: Network Protocols - Routing protocols; C.2.4 [Computer-Communication Networks]: Distributed Systems - Distributed databases; H.2.4 [Database Management]: Systems - Distributed databases; H.3.4 [Information Storage and Retrieval]: Systems and Software - Distributed systems

General Terms: Algorithms, Design, Performance, Reliability, Security

Additional Key Words and Phrases: Content distribution, DOLR, DHT, grid computing, p2p, peer-to-peer

Authors' address: Athens University of Economics and Business, 76 Patission St., GR-104 34, Athens, Greece; email: stheotok@aueb.gr.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.

© 2004 ACM 0360-0300/04/1200-0335 $5.00

ACM Computing Surveys, Vol. 36, No. 4, December 2004, pp. 335–371.

1. INTRODUCTION

A new wave of network architectures labeled peer-to-peer is the basis of operation of distributed computing systems such as [Gnutella 2003], Seti@Home [SetiAtHome 2003], OceanStore [Kubiatowicz et al. 2000], and many others. Such architectures are generally characterized by the direct sharing of computer resources (CPU cycles, storage, content) rather than requiring the intermediation of a centralized server.

The motivation behind basing applications on peer-to-peer architectures derives to a large extent from their ability to function, scale, and self-organize in the presence of a highly transient population of nodes and of network and computer failures, without the need for a central server and the overhead of its administration. Such architectures typically have as inherent characteristics scalability, resistance to censorship and centralized control, and increased access to resources. Administration, maintenance, responsibility for the operation, and even the notion of "ownership" of peer-to-peer systems are also distributed among the users, instead of being handled by a single company, institution or person (see also Agre [2003] for an interesting discussion of institutional change through decentralized architectures). Finally, peer-to-peer architectures have the potential to accelerate communication processes and reduce collaboration costs through the ad hoc administration of working groups [Schoder and Fischbach 2003].

This report surveys peer-to-peer content distribution technologies, aiming to provide a comprehensive account of applications, features, and implementation techniques. As this is a new and (thankfully) rapidly evolving field, and advances and new capabilities are constantly being introduced, this article presents what essentially constitutes a "snapshot" of the state of the art around the time of its writing, as is unavoidably the case for any survey of a thriving research field. We do, however, believe that the core information and principles presented will remain relevant and useful for the reader.

In the next section, we define the basic concepts of peer-to-peer computing. We classify peer-to-peer systems into three categories (communication and collaboration, distributed computation, and content distribution). Content distribution systems are further discussed and categorized. We then present the main attributes of peer-to-peer content distribution systems, and the aspects of their architectural design which affect these attributes are analyzed with reference to specific existing peer-to-peer content distribution systems and technologies.

Throughout this report the terms "node", "peer" and "user" are used interchangeably, according to the context, to refer to the entities that are connected in a peer-to-peer network.

1.1. Defining Peer-to-Peer Computing

A quick look at the literature reveals a considerable number of different definitions of "peer-to-peer", mainly distinguished by the "broadness" they attach to the term.

The strictest definitions of "pure" peer-to-peer refer to totally distributed systems, in which all nodes are completely equivalent in terms of functionality and the tasks they perform. These definitions fail to encompass, for example, systems that employ the notion of "supernodes" (nodes that function as dynamically assigned localized mini-servers) such as Kazaa [2003], which are, however, widely accepted as peer-to-peer, or systems that rely on some centralized server infrastructure for a subset of noncore tasks (e.g. bootstrapping, maintaining reputation ratings, etc.).

According to a broader and widely accepted definition in Shirky [2000], "peer-to-peer is a class of applications that take advantage of resources – storage, cycles, content, human presence – available at the edges of the internet". This definition, however, encompasses systems that completely rely upon centralized servers for their operation (such as seti@home, various instant messaging systems, or even the notorious Napster), as well as various applications from the field of Grid computing.

Overall, it is fair to say that there is no general agreement about what "is" and what "is not" peer-to-peer.

We feel that this lack of agreement on a definition, or rather the acceptance of various different definitions, is to a large extent due to the fact that systems or applications are labeled "peer-to-peer" not because of their internal operation or architecture, but rather as a result of how they are perceived "externally", that is, whether they give the impression of providing direct interaction between computers. As a result, different definitions of "peer-to-peer" are applied to accommodate the various different cases of such systems or applications.

From our perspective, we believe that the two defining characteristics of peer-to-peer architectures are the following:

- The sharing of computer resources by direct exchange, rather than requiring the intermediation of a centralized server. Centralized servers can sometimes be used for specific tasks (system bootstrapping, adding new nodes to the network, obtaining global keys for data encryption); however, systems that rely on one or more global centralized servers for their basic operation (e.g. for maintaining a global index and searching through it, as in Napster and Publius) are clearly stretching the definition of peer-to-peer.
  As the nodes of a peer-to-peer network cannot rely on a central server coordinating the exchange of content and the operation of the entire network, they are required to actively participate by independently and unilaterally performing tasks such as searching for other nodes, locating or caching content, routing information and messages, connecting to or disconnecting from other neighboring nodes, encrypting, introducing, retrieving, decrypting and verifying content, as well as others.
- Their ability to treat instability and variable connectivity as the norm, automatically adapting to failures in both network connections and computers, as well as to a transient population of nodes.
  This fault-tolerant, self-organizing capacity suggests the need for an adaptive network topology that will change as nodes enter or leave and network connections fail or recover, in order to maintain its connectivity and performance.

We therefore propose the following definition:

Peer-to-peer systems are distributed systems consisting of interconnected nodes able to self-organize into network topologies with the purpose of sharing resources such as content, CPU cycles, storage and bandwidth, capable of adapting to failures and accommodating transient populations of nodes while maintaining acceptable connectivity and performance, without requiring the intermediation or support of a global centralized server or authority.

This definition is meant to encompass "degrees of centralization" ranging from the pure, completely decentralized systems such as Gnutella, to "partially centralized" systems (see footnote 1) such as Kazaa. However, for the purposes of this survey, we shall not restrict our presentation and discussion of architectures and systems to our own proposed definition, and we will take into account systems that are considered peer-to-peer by other definitions as well, including systems that employ a centralized server (such as Napster, instant messaging applications, and others).

Footnote 1: In our definition, we refer to "global" servers, to make a distinction from the dynamically assigned "supernodes" of partially centralized systems (see also Section 3.3.3).

The focus of our study is content distribution, a significant area of peer-to-peer systems that has received considerable research attention.

1.2. Peer-to-Peer and Grid Computing

Peer-to-peer and Grid computing are two approaches to distributed computing, both concerned with the organization of resource sharing in large-scale computational societies.

Grids are distributed systems that enable the large-scale coordinated use and sharing of geographically distributed resources, based on persistent, standards-based service infrastructures, often with a high-performance orientation [Foster et al. 2001].
As Grid systems increase in scale, they begin to require solutions to issues of self-configuration, fault tolerance, and scalability, for which peer-to-peer research has much to offer.

Peer-to-peer systems, on the other hand, focus on dealing with instability, transient populations, fault tolerance, and self-adaptation. To date, however, peer-to-peer developers have worked mainly on vertically integrated applications, rather than seeking to define common protocols and standardized infrastructures for interoperability.

In summary, one can say that "Grid computing addresses infrastructure, but not yet failure, while peer-to-peer addresses failure, but not yet infrastructure" [Foster and Iamnitchi 2003].

In addition to this, the form of sharing initially targeted by peer-to-peer has been of limited functionality, providing a global content distribution and filesharing space lacking any form of access control.

As peer-to-peer technologies move into more sophisticated and complex applications, such as structured content distribution, desktop collaboration, and network computation, it is expected that there will be a strong convergence between peer-to-peer and Grid computing [Foster 2000]. The result will be a new class of technologies combining elements of both peer-to-peer and Grid computing, which will address scalability, self-adaptation, and failure recovery, while, at the same time, providing a persistent and standardized infrastructure for interoperability.

1.3. Classification of Peer-to-Peer Applications

Peer-to-peer architectures have been employed for a variety of different application categories, which include the following.

Communication and Collaboration. This category includes systems that provide the infrastructure for facilitating direct, usually real-time, communication and collaboration between peer computers. Examples include chat and instant messaging applications, such as Chat/Irc, Instant Messaging (Aol, Icq, Yahoo, Msn), and Jabber [Jabber 2003].

Distributed Computation. This category includes systems whose aim is to take advantage of the available peer computer processing power (CPU cycles). This is achieved by breaking down a compute-intensive task into small work units and distributing them to different peer computers, which execute their corresponding work unit and return the results. Central coordination is invariably required, mainly for breaking up and distributing the tasks and collecting the results. Examples of such systems include projects such as Seti@home [Sullivan III et al. 1997; SetiAtHome 2003], genome@home [Larson et al. 2003; GenomeAtHome 2003], and others.

Internet Service Support. A number of different applications based on peer-to-peer infrastructures have emerged for supporting a variety of Internet services. Examples of such applications include peer-to-peer multicast systems [VanRenesse et al. 2003; Castro et al. 2002], Internet indirection infrastructures [Stoica et al. 2002], and security applications, providing protection against denial of service or virus attacks [Keromytis et al. 2002; Janakiraman et al. 2003; Vlachos et al. 2004].

Database Systems. Considerable work has been done on designing distributed database systems based on peer-to-peer infrastructures. Bernstein et al. [2002] propose the Local Relational Model (LRM), in which the set of all data stored in a peer-to-peer network is assumed to consist of inconsistent local relational databases interconnected by sets of "acquaintances" that define translation rules and semantic dependencies between them. PIER [Huebsch et al. 2003] is a scalable distributed query engine built on top of a peer-to-peer overlay network topology that allows relational queries to run across thousands of computers. The Piazza system [Halevy et al. 2003] provides an infrastructure for building semantic Web [Berners-Lee et al. 2001] applications, consisting of nodes that can supply either source data (e.g. from a relational database), schemas (or ontologies) or both. Piazza nodes are transitively connected by chains of mappings between pairs of nodes, allowing queries to be distributed across the Piazza network. Finally, Edutella [Nejdl et al. 2003] is an open-source project that builds on the W3C metadata standard RDF to provide a metadata infrastructure and querying capability for peer-to-peer applications.

Content Distribution. Most of the current peer-to-peer systems fall within the category of content distribution, which includes systems and infrastructures designed for the sharing of digital media and other data between users. Peer-to-peer content distribution systems range from relatively simple direct filesharing applications, to more sophisticated systems that create a distributed storage medium for securely and efficiently publishing, organizing, indexing, searching, updating, and retrieving data. There are numerous such systems and infrastructures. Some examples are: the late Napster, Publius [Waldman et al. 2000], Gnutella [Gnutella 2003], Kazaa [Kazaa 2003], Freenet [Clarke et al. 2000], MojoNation [MojoNation 2003], OceanStore [Kubiatowicz et al. 2000], PAST [Druschel and Rowstron 2001], Chord [Stoica et al. 2001], Scan [Chen et al. 2000], FreeHaven [Dingledine et al. 2000], Groove [Groove 2003], and Mnemosyne [Hand and Roscoe 2002].

This survey will focus on content distribution, one of the most prominent application areas of peer-to-peer systems.

1.4. Peer-to-Peer Content Distribution

In its most basic form, a peer-to-peer content distribution system creates a distributed storage medium that allows for the publishing, searching, and retrieval of files by members of its network. As systems become more sophisticated, nonfunctional features may be provided, including provisions for security, anonymity, fairness, increased scalability and performance, as well as resource management and organization capabilities. All of these will be discussed in the following sections.

An examination of current peer-to-peer technologies in this context suggests that they can be grouped as follows (with examples in Tables I and II).

- Peer-to-Peer Applications. This category includes content distribution systems that are based on peer-to-peer technology. We attempt to further subdivide them into the following two groups, based on their application goals and perceived complexity:
  - Peer-to-peer "file exchange" systems. These systems are targeted towards simple, one-off file exchanges between peers. They are used for setting up a network of peers and providing facilities for searching and transferring files between them. These are typically light-weight applications that adopt a best-effort approach without addressing security, availability, and persistence. It is mainly systems in this category that are responsible for spawning the reputation (and in some cases notoriety) of peer-to-peer technologies.
  - Peer-to-peer content publishing and storage systems. These systems are targeted towards creating a distributed storage medium in which, and through which, users will be able to publish, store, and distribute content in a secure and persistent manner. Such content is meant to be accessible in a controlled manner by peers with appropriate privileges. The main focus of such systems is security and persistence, and often the aim is to incorporate provisions for accountability, anonymity and censorship resistance, as well as persistent content management (updating, removing, version control) facilities.
- Peer-to-Peer Infrastructures. This category includes peer-to-peer based infrastructures that do not constitute working applications, but provide peer-to-peer based services and application frameworks. The following infrastructure services are identified:
  - Routing and location. Any peer-to-peer content distribution system relies on a network of peers within which requests and messages must be routed with efficiency and fault tolerance, and through which peers and content can be efficiently located. Different infrastructures and algorithms have been developed to provide such services.
  - Anonymity. Peer-to-peer based infrastructure systems have been designed with the explicit aim of providing user anonymity.
  - Reputation Management. In a peer-to-peer network, there is no central organization to maintain reputation information for users and their behavior. Reputation information is, therefore, hosted in the various network nodes. In order for such reputation information to be kept secure, up-to-date, and available throughout the network, complex reputation management infrastructures need to be employed.

Table I. A Classification of Current Peer-to-Peer Systems (RM: Resource Management; CR: Censorship Resistance; PS: Performance and Scalability; SPE: Security, Privacy and Encryption; A: Anonymity; RA: Reputation and Accountability; RT: Resource Trading.)

Peer-to-Peer File Exchange Systems
System | Brief Description
Napster | Distributed file sharing, hybrid decentralized.
Kazaa [2003] | Distributed file sharing, partially centralized.
Gnutella [2003] | Distributed file sharing, purely decentralized.

Peer-to-Peer Content Publishing and Storage Systems
System | Brief Description | Main Focus
Scan [Chen et al. 2000] | A dynamic, scalable, efficient content distribution network. Provides dynamic content replication. | PS
Publius [Waldman et al. 2000] | A censorship-resistant system for publishing content. Static list of servers. Enhanced content management (update and delete). | RM
Groove [Groove 2003] | Internet communications software for direct real-time peer-to-peer interaction. | RM, PS, SPE
FreeHaven [Dingledine et al. 2000] | A flexible system for anonymous storage. | A, RA
Freenet [Clarke et al. 2000] | Distributed anonymous information storage and retrieval system. | A, RA
MojoNation [MojoNation 2003] | Distributed file storage. Fairness through the use of currency mojo. | SPE, RT
OceanStore [Kubiatowicz et al. 2000] | An architecture for global scale persistent storage. Scalable, provides security and access control. | RM, PS, SPE
Intermemory [Chen et al. 1999] | System of networked computers. Donate storage in exchange for the right to publish data. | RT
Mnemosyne [Hand and Roscoe 2002] | Peer-to-peer steganographic storage system. Provides privacy and plausible deniability. | SPE
PAST [Druschel and Rowstron 2001] | Large scale persistent peer-to-peer storage utility. | PS, SPE
Dagster [Stubblefield and Wallach 2001] | A censorship-resistant document publishing system. | CR, SPE
Tangler [Waldman and Mazières 2001] | A content publishing system based on document entanglements. | CR, SPE

Table II. A Classification of Current Peer-to-Peer Infrastructures

Infrastructures for routing and location
Chord [Stoica et al. 2001] | A scalable peer-to-peer lookup service. Given a key, it maps the key to a node.
CAN [Ratnasamy et al. 2001] | Scalable content addressable network. A distributed infrastructure that provides hash-table functionality for mapping file names to their locations.
Pastry [Rowstron and Druschel 2001] | Infrastructure for fault-tolerant wide-area location and routing.
Tapestry [Zhao et al. 2001] | Infrastructure for fault-tolerant wide-area location and routing.
Kademlia [Mayamounkov and Mazieres 2002] | A scalable peer-to-peer lookup service based on the XOR metric.

Infrastructures for anonymity
Anonymous remailer mixnet [Berthold et al. 1998] | Infrastructure for anonymous connection.
Onion Routing [Goldschlag et al. 1999] | Infrastructure for anonymous connection.
ZeroKnowledge Freedom [Freedom 2003] | Infrastructure for anonymous connection.
Tarzan [Freedman et al. 2002] | A peer-to-peer decentralized anonymous network layer.

Infrastructures for reputation management
Eigentrust [Kamvar et al. 2003] | A distributed reputation management algorithm.
A partially distributed reputation management system [Gupta et al. 2003] | A partially distributed approach based on a debit/credit or debit-only scheme.
PeerTrust [Xiong and Liu 2002] | A decentralized, feedback-based reputation management system using satisfaction and number of interaction metrics.

2. ANALYSIS FRAMEWORK

In this survey, we present a description and analysis of applications, systems, and infrastructures that are based on peer-to-peer architectures and aim at either offering content distribution solutions, or supporting content distribution related activities.

Our approach is based on:

- identifying the feature space of nonfunctional properties and characteristics of content distribution systems,
- determining the way in which the nonfunctional properties depend on, and can be affected by, various design features, and
- providing an account, analysis, and evaluation of the design features and solutions that have been adopted by current peer-to-peer systems, as well as their shortcomings, potential improvements, and proposed alternatives.

For the identification of the nonfunctional characteristics on which our study is based, we used the work of Shaw and Garlan [1995], which we adapted to the field of peer-to-peer content distribution.

The various design features and solutions that we examine, as well as their relationship to the relevant nonfunctional characteristics, were assembled through a detailed analysis and study of current peer-to-peer content distribution systems that are either deployed, researched, or proposed.

The resulting analysis framework is illustrated in Figure 1. The boxes in the periphery show the most important attributes of peer-to-peer content distribution systems, described as follows.

Security. Further analyzed in terms of:

- Integrity and authenticity. Safeguarding the accuracy and completeness of data and processing methods. Unauthorized entities cannot change data; adversaries cannot substitute a forged document for a requested one.
- Privacy and confidentiality. Ensuring that data is accessible only to those authorized to have access, and that there is control over what data is collected, how it is used, and how it is maintained.
- Availability and persistence. Ensuring that authorized users have access to data and associated assets when required. For a peer-to-peer content distribution system this often means always. This property entails stability in the presence of failure, or changing node populations.

Scalability. Maintaining the system's performance attributes independent of the number of nodes or documents in its network. A dramatic increase in the number of nodes or documents will have minimal effect on performance and availability.
Performance. The time required for performing the operations allowed by the system, typically publication, searching, and retrieval of documents.

Fairness. Ensuring that users offer and consume resources in a fair and balanced manner. May rely on accountability, reputation, and resource trading mechanisms.

Resource Management Capabilities. In their most basic form, peer-to-peer content distribution systems allow the publishing, searching, and retrieval of documents. More sophisticated systems may afford more advanced resource management capabilities, such as editing or removal of documents, management of storage space, and operations on metadata.

Semantic Grouping of Information. An area of research that has attracted considerable attention recently is the semantic grouping and organization of content and information in peer-to-peer networks. Various grouping schemes are encountered, such as semantic grouping, based on the content itself; grouping, based on locality or network distance; grouping, based on organization ties, as well as others.

The relationships between nonfunctional characteristics are depicted as a UML diagram in Figure 1. The figure schematically illustrates the relationship between the various design features and the main characteristics of peer-to-peer content distribution systems.

The boxes in the center show the design decisions that affect these attributes. We note that these design decisions are mostly independent and orthogonal.

Fig. 1. Illustration of the way in which various design features affect the main characteristics of peer-to-peer content distribution systems.

We see, for example, that the performance of a peer-to-peer content distribution system is affected by the distributed object location and routing mechanisms, as well as by the data replication, caching, and migration algorithms. Fairness, on the other hand, depends on the system's provisions for accountability and reputation, as well as on any resource trading mechanisms that may be implemented.
  9. 9. A Survey of Content Distribution Technologies 343

In the following sections of this survey, the different design decisions and features will be presented and discussed, based on an examination of existing peer-to-peer content distribution systems, and with reference to the way in which they affect the main attributes of these systems, as presented. A relevant presentation and discussion of various desired properties of peer-to-peer systems can also be found in Kubiatowicz [2003], while an analysis of peer-to-peer systems from the end-user perspective can be found in Lee [2003].

3. PEER-TO-PEER DISTRIBUTED OBJECT LOCATION AND ROUTING

The operation of any peer-to-peer content distribution system relies on a network of peer computers (nodes), and connections (edges) between them. This network is formed on top of, and independently from, the underlying physical computer (typically IP) network, and is thus referred to as an “overlay” network. The topology, structure, and degree of centralization of the overlay network, and the routing and location mechanisms it employs for messages and content are crucial to the operation of the system, as they affect its fault tolerance, self-maintainability, adaptability to failures, performance, scalability, and security; in short, almost all of the system’s attributes as laid out in Section 2 (see also Figure 1). Overlay networks can be distinguished in terms of their centralization and structure.

3.1. Overlay Network Centralization

Although in their purest form peer-to-peer overlay networks are supposed to be totally decentralized, in practice this is not always true, and systems with various degrees of centralization are encountered. Specifically, the following three categories are identified.

Purely Decentralized Architectures. All nodes in the network perform exactly the same tasks, acting both as servers and clients, and there is no central coordination of their activities. The nodes of such networks are often termed “servents” (SERVers+clieENTS).

Partially Centralized Architectures. The basis is the same as with purely decentralized systems. Some of the nodes, however, assume a more important role, acting as local central indexes for files shared by local peers. The way in which these supernodes are assigned their role by the network varies between different systems. It is important, however, to note that these supernodes do not constitute single points of failure for a peer-to-peer network, since they are dynamically assigned and, if they fail, the network will automatically take action to replace them with others.

Hybrid Decentralized Architectures. In these systems, there is a central server facilitating the interaction between peers by maintaining directories of metadata, describing the shared files stored by the peer nodes. Although the end-to-end interaction and file exchanges may take place directly between two peer nodes, the central servers facilitate this interaction by performing the lookups and identifying the nodes storing the files. The terms “peer-through-peer” or “broker mediated” are sometimes used for such systems [Kim 2001].

Obviously, in these architectures, there is a single point of failure (the central server). This typically renders them inherently unscalable and vulnerable to censorship, technical failure, or malicious attack.

3.2. Network Structure

By structure, we refer to whether the overlay network is created nondeterministically (ad hoc) as nodes and content are added, or whether its creation is based on specific rules. We categorize peer-to-peer networks as follows, in terms of their structure:

Unstructured. The placement of content (files) is completely unrelated to the overlay topology.

In an unstructured network, content typically needs to be located. Searching mechanisms range from brute force methods, such as flooding the network with propagating queries in a breadth-first
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
  10. 10. 344 S. Androutsellis-Theotokis and D. Spinellis

or depth-first manner until the desired content is located, to more sophisticated and resource-preserving strategies that include the use of random walks and routing indices (discussed in more detail in Section 3.3.4). The searching mechanisms employed in unstructured networks have obvious implications, particularly with regard to availability, scalability, and persistence.

Unstructured systems are generally more appropriate for accommodating highly-transient node populations. Some representative examples of unstructured systems are Napster, Publius [Waldman et al. 2000], Gnutella [Gnutella 2003], Kazaa [Kazaa 2003], Edutella [Nejdl et al. 2003], FreeHaven [Dingledine et al. 2000], as well as others.

Structured. These have emerged mainly in an attempt to address the scalability issues that unstructured systems were originally faced with. In structured networks, the overlay topology is tightly controlled and files (or pointers to them) are placed at precisely specified locations. These systems essentially provide a mapping between content (e.g. file identifier) and location (e.g. node address), in the form of a distributed routing table, so that queries can be efficiently routed to the node with the desired content [Lv et al. 2002].

Structured systems offer a scalable solution for exact-match queries, that is, queries where the exact identifier of the requested data object is known (as compared to keyword queries). Using exact-match queries as a substrate for keyword queries remains an open research problem for distributed environments [Witten et al. 1999].

A disadvantage of structured systems is that it is hard to maintain the structure required for efficiently routing messages in the face of a very transient node population, in which nodes are joining and leaving at a high rate [Lv et al. 2002].

Typical examples of structured systems include Chord [Stoica et al. 2001], CAN [Ratnasamy et al. 2001], PAST [Druschel and Rowstron 2001], and Tapestry [Zhao et al. 2001], among others.

Table III. A Classification of Peer-to-Peer Content Distribution Systems and Location and Routing Infrastructures in Terms of Their Network Structure, With Some Typical Examples

                              Centralization
                              Hybrid              Partial                                None
Unstructured                  Napster, Publius    Kazaa, Morpheus, Gnutella, Edutella    Gnutella, FreeHaven
Structured Infrastructures    -                   -                                      Chord, CAN, Tapestry, Pastry
Structured Systems            -                   -                                      OceanStore, Mnemosyne, Scan, PAST, Kademlia, Tarzan

A category of networks that are in between structured and unstructured are referred to as loosely structured networks. Although the location of content is not completely specified, it is affected by routing hints. A typical example is Freenet [Clarke et al. 2000; Clarke et al. 2002].

Table III summarizes the categories we outlined, with examples of peer-to-peer content distribution systems and architectures. Note that all structured and loosely structured systems are inherently purely decentralized; form follows function.

In the following sections, the overlay network topology and operation of different peer-to-peer systems are discussed, following the above classification, according to degree of centralization and network structure.

3.3. Unstructured Architectures

3.3.1. Hybrid Decentralized. Figure 2 illustrates the architecture of a typical hybrid decentralized peer-to-peer system.

Each client computer stores content (files) shared with the rest of the network. All clients connect to a central directory server that maintains:

- A table of registered user connection information (IP address, connection bandwidth etc.)
  11. 11.

- A table listing the files that each user holds and shares in the network, along with metadata descriptions of the files (e.g. filename, time of creation, etc.)

A computer that wishes to join the network contacts the central server and reports the files it maintains. Client computers send requests for files to the server. The server searches for matches in its index, returning a list of users that hold the matching file. The user then opens direct connections with one or more of the peers that hold the requested file, and downloads it (see Figure 2).

Fig. 2. Typical hybrid decentralized peer-to-peer architecture. A central directory server maintains an index of the metadata for all files in the network.

The advantage of hybrid decentralized systems is that they are simple to implement, and they locate files quickly and efficiently. Their main disadvantage is that they are vulnerable to censorship, legal action, surveillance, malicious attack, and technical failure, since the content shared, or at least descriptions of it, and the ability to access it are controlled by the single institution, company, or user maintaining the central server. Furthermore, these systems are considered inherently unscalable, as there are bound to be limitations to the size of the server database and its capacity to respond to queries. Large Web search engines have, however, repeatedly provided counterexamples to this notion.

Examples of hybrid decentralized content distribution systems include the notorious Napster and Publius systems [Waldman et al. 2000] that rely on a static, system-wide list of servers. Their architecture does not provide any smooth, decentralized support for adding a new server, or removing dead or malicious servers. A comprehensive study of the behavior of hybrid peer-to-peer systems and a comparison of their query characteristics and their performance in terms of bandwidth and CPU cycle consumption is presented in Yang and Garcia-Molina [2001].

It should be noted that systems that do not fall under the hybrid decentralized category may still use some central administration server to a limited extent, for example, for initial system bootstrapping (e.g. MojoNation [MojoNation 2003]), or for allowing new users to join the network by providing them with access to a list of current users (e.g. gnutellahosts.com for the Gnutella network).

3.3.2. Purely Decentralized. In this section, we examine the Gnutella network [Gnutella 2003], an interesting and representative member of purely decentralized peer-to-peer architectures, due to its open architecture, achieved scale, and self-organizing structure. FreeHaven [Dingledine et al. 2000] is another system using routing and search mechanisms similar to those of Gnutella.

Like most peer-to-peer systems, Gnutella builds a virtual overlay network with its own routing mechanisms [Ripeanu and Foster 2002], allowing its users to share files with other peers.

There is no central coordination of the activities in the network and users connect to each other directly through a software application that functions both as a client and a server (users are referred to as servents).

Gnutella uses IP as its underlying network service, while the communication between servents is specified in a form of application level protocol supporting four types of messages [Jovanovich et al. 2001]:

Ping. A request for a certain host to announce itself.
  12. 12.

Pong. Reply to a Ping message. It contains the IP and port of the responding host and the number and size of the files being shared.

Query. A search request. It contains a search string and the minimum speed requirements of the responding host.

Query Hits. Reply to a Query message. It contains the IP, port, and speed of the responding host, the number of matching files found, and their indexed result set.

After joining the Gnutella network (by connecting to nodes found in databases such as gnutellahosts.com), a node sends out a Ping message to any node it is connected to. The nodes send back a Pong message identifying themselves, and also propagate the Ping message to their neighbors.

In order to locate a file in unstructured systems such as Gnutella, nondeterministic searches are the only option since the nodes have no way of guessing where (at which nodes) the files may lie.

The original Gnutella architecture uses a flooding (or broadcast) mechanism to distribute Ping and Query messages: each Gnutella node forwards the received messages to all of its neighbors. The response messages received are routed back along the opposite path through which the original request arrived. To limit the spread of messages through the network, each message header contains a time-to-live (TTL) field. At each hop, the value of this field is decremented, and when it reaches zero, the message is dropped.

The above mechanism is implemented by assigning each message a unique identifier and equipping each host with a dynamic routing table of message identifiers and node addresses. Since the response messages contain the same ID as the original messages, the host checks its routing table to determine along which link the response message should be forwarded. In order to avoid loops, the nodes use the unique message identifiers to detect and drop duplicate messages, which improves efficiency and preserves network bandwidth [Jovanovic 2000].

Once a node receives a QueryHit message, indicating that the target file has been identified at a certain node, it initiates a download by establishing a direct connection between the two nodes. Figure 3 illustrates an example of the Gnutella search mechanism.

Scalability issues in the original purely decentralized systems arose from the fact that the use of the TTL effectively segmented the network into “sub-networks”, imposing on each user a virtual “horizon” beyond which their messages could not reach [Jovanovic 2000]. Removing the TTL limit, on the other hand, would result in the network being swamped with messages.

Significant research has been carried out to address the above issues, and various solutions have been proposed. These will be discussed in more detail in Section 3.3.4.

3.3.3. Partially Centralized. Partially centralized systems are similar to purely decentralized ones, but they use the concept of supernodes: nodes that are dynamically assigned the task of servicing a small subpart of the peer network by indexing and caching files contained therein.

Peers are automatically elected to become supernodes if they have sufficient bandwidth and processing power (although a configuration parameter may allow users to disable this feature) [FastTrack 2003].

Supernodes index the files shared by peers connected to them, and proxy search requests on behalf of these peers. All queries are therefore initially directed to supernodes.

Two major advantages of partially centralized systems are that:

- Discovery time is reduced in comparison with purely decentralized systems, while there still is no unique point of failure. If one or more supernodes go down, the nodes connected to them can open new connections with other supernodes, and the network will continue to operate.
  13. 13.

- The inherent heterogeneity of peer-to-peer networks is exploited. In a purely decentralized network, all of the nodes will be equally (and usually heavily) loaded, regardless of their CPU power, bandwidth, or storage capabilities. In partially centralized systems, however, the supernodes will undertake a large portion of the entire network load, while most of the other (so called “normal”) nodes will be very lightly loaded in comparison (see also [Lv et al. 2002; Zhichen et al. 2002]).

Kazaa is a typical, widely used instance of a partially centralized system implementation (as it is a proprietary system, there is no detailed documentation on its structure and operation). Edutella [Nejdl et al. 2003] is another partially centralized architecture.

Yang and Garcia-Molina [2002a, 2002b] present research addressing the design of, and searching techniques for, partially centralized peer-to-peer networks.

The concept of supernodes has also been proposed in a more recent version of the Gnutella protocol. A mechanism for dynamically selecting supernodes organizes the Gnutella network into an interconnection of superpeers (as they are referred to) and client nodes.

Fig. 3. An example of the Gnutella search mechanism. Solid lines between the nodes represent connections of the Gnutella network. The search originates at the “requestor” node, for a file maintained by another node. Request messages are dispatched to all neighboring nodes, and propagated from node-to-node as shown in the four consecutive steps (a) to (d). When a node receives the same request message (based on its unique identifier) multiple times, it replies with a failure message to avoid loops and minimize traffic. When the file is identified, a success message is returned.
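The flooding search illustrated in Fig. 3, together with the TTL and duplicate-detection rules of Section 3.3.2, can be illustrated with a minimal Python sketch. This is not the Gnutella implementation: the names `flood_search`, `graph`, `holders`, and `overlay` are hypothetical, the overlay is assumed to be a static in-memory adjacency map, a single query (one message identifier) is simulated, and duplicates are simply dropped rather than answered with a failure message.

```python
from collections import deque


def flood_search(graph, holders, requestor, ttl):
    """Flood one query breadth-first from `requestor` (illustrative sketch).

    graph:   dict mapping each node to the list of its overlay neighbors
    holders: set of nodes that store the requested file
    ttl:     maximum number of hops the query may travel
    Returns (hits, messages): nodes that matched, and transmissions made.
    """
    seen = {requestor}                      # nodes that already saw this message ID
    queue = deque([(requestor, None, ttl)])
    hits = []
    messages = 0
    while queue:
        node, came_from, remaining = queue.popleft()
        if remaining == 0:
            continue                        # TTL exhausted: do not forward further
        for neighbor in graph[node]:
            if neighbor == came_from:
                continue                    # never send back along the arrival link
            messages += 1
            if neighbor in seen:
                continue                    # duplicate message ID: dropped
            seen.add(neighbor)
            if neighbor in holders:
                hits.append(neighbor)       # would trigger a reply on the reverse path
            queue.append((neighbor, node, remaining - 1))
    return hits, messages


# Hypothetical five-node overlay; node E (three hops from A) holds the file.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(flood_search(overlay, {"E"}, "A", ttl=2))
print(flood_search(overlay, {"E"}, "A", ttl=3))
```

In this example overlay, a TTL of 2 leaves the holder beyond the requestor's horizon, while a TTL of 3 reaches it, at the cost of additional transmissions, some of which are dropped as duplicates.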
  14. 14.

When a node with enough CPU power joins the network, it immediately becomes a superpeer and establishes connections with other superpeers, forming a flat unstructured network of superpeers. If it establishes a minimum required number of connections to client nodes within a specified time, it remains a superpeer. Otherwise, it turns into a regular client node.

3.3.4. Shortcomings and Evolution of Unstructured Architectures. A number of methods have been proposed to overcome the original unscalability of unstructured peer-to-peer systems. These have been shown to drastically improve their performance.

Lv et al. [2002] proposed replacing the original flooding approach with multiple parallel random walks, where each node chooses a neighbor at random, and propagates the request only to it. The use of random walks, combined with proactive object replication (discussed in Section 4), was found to significantly improve the performance of the system as measured by query resolution time (in terms of numbers of hops), per-node query load, and message traffic generated. Proactive data replication schemes are further examined in Cohen and Shenker [2001].

Yang and Garcia-Molina [2002b] suggested the use of more sophisticated broadcast policies, selecting which neighbors to forward search queries to based on their past history, as well as the use of local indices: data structures where each node maintains an index of the data stored at nodes located within a radius from itself.

A similar solution to the information retrieval problem is proposed by Kalogeraki et al. [2002] in the form of an Intelligent Search Mechanism built on top of a Modified Random BFS Search Mechanism. Each peer forwards queries to a subset of its neighbors, selecting them based on a profile mechanism that maintains information about their performance in recent queries. Neighbors are ranked according to their profiles and queries are forwarded selectively to the most appropriate ones only.

Chawathe et al. [2003] presented the Gia System, which addresses the above issues by means of:

- A dynamic topology adaptation protocol that ensures that most nodes will be at a short distance from high capacity nodes. This protocol, coupled with replicating pointers to the content of neighbors, ensures that high capacity nodes will be able to provide answers to a very large number of queries.
- An active flow control scheme that exploits network heterogeneity to avoid hotspots, and
- A search protocol that is based on random walks directed towards high capacity nodes.

In simulations, Gia was found to increase overall system capacity by three to five orders of magnitude. Similar approaches, based on taking advantage of the underlying network heterogeneity, are described in Lv et al. [2002] and Zhichen et al. [2002].

In another approach, Crespo and Garcia-Molina [2002] use routing indices to address the searching and scalability issues. Routing indices are tables of information about other nodes, stored within each node. They provide a list of neighbors that are most likely to be “in the direction” of the content corresponding to the query. These tables contain information about the total number of files maintained by nearby nodes, as well as the number of files corresponding to various topics. Three different variations of the approach are presented, and simulations show that they improve performance by 1-2 orders of magnitude with respect to flooding techniques.

Finally, the connectivity properties and reliability of unstructured peer-to-peer networks such as Gnutella have been studied in Ripeanu and Foster [2002]. In particular, emphasis was placed on the fact that peer-to-peer networks exhibit the properties of power-law networks, in which the number of nodes with $L$ links is proportional to $L^{-k}$, where $k$ is a network dependent constant. In other words, most nodes have few links (thus, a large fraction of them can be taken away
  15. 15.

without seriously damaging the network connectivity), while there are a few highly-connected nodes which, if taken away, are likely to cause the whole network to break down into disconnected pieces. One implication of this is that such networks are robust when facing random node attacks, yet vulnerable to well-planned attacks. The topology mismatch between the Gnutella network and the underlying physical network infrastructure was also documented in Ripeanu and Foster [2002].

Tsoumakos and Roussopoulos [2003] present a more comprehensive analysis and comparison of the above methods.

Overall, unstructured peer-to-peer content distribution systems might be the preferred choice for applications where the following assumptions hold:

- keyword searching is the common operation,
- most content is typically replicated at a fair fraction of participating sites,
- the node population is highly transient,
- users will accept a best-effort content retrieval approach, and
- the network size is not so large as to incur scalability problems [Lv et al. 2002] (the last issue is alleviated through the various methods described in this section).

3.4. Structured Architectures

The various structured content distribution systems and infrastructures employ different mechanisms for routing messages and locating data. Four of the most interesting and representative mechanisms and their corresponding systems are examined in the following sections.

- Freenet is a loosely structured system that uses file and node identifier similarity to produce an estimate of where a file may be located, and a chain mode propagation approach to forward queries from node-to-node.
- Chord is a system whose nodes maintain a distributed routing table in the form of an identifier circle on which all nodes are mapped, and an associated finger table.
- CAN is a system using an n-dimensional Cartesian coordinate space to implement the distributed location and routing table, whereby each node is responsible for a zone in the coordinate space.
- Tapestry (and the similar Pastry and Kademlia) are based on the Plaxton mesh data structure, which maintains pointers to nodes in the network whose IDs match the elements of a tree-like structure of ID prefixes up to a digit position.

3.4.1. Freenet: A Loosely Structured System. The defining characteristic of loosely structured systems is that the nodes of the peer-to-peer network can produce an estimate (not with certainty) of which node is most likely to store certain content. This affords them the possibility of avoiding blindly broadcasting request messages to all (or a random subset) of their neighbors. Instead, they use a chain mode propagation approach, where each node makes a local decision about which node to send the request message to next.

Freenet [Clarke et al. 2000] is a typical, purely decentralized loosely-structured content distribution system. It operates as a self-organizing peer-to-peer network, pooling unused disk space in peer computers to create a collaborative virtual file system. Important features of Freenet include its focus on security, publisher anonymity, deniability, and data replication for availability and performance.

Files in Freenet are identified by unique binary keys. Three types of keys are supported; the simplest is based on applying a hash function to a short descriptive text string that accompanies each file as it is stored in the network by its original owner.

Each Freenet node maintains its own local data store, which it makes available to the network for reading and writing, as well as a dynamic routing table containing the addresses of other nodes and the files they are thought to hold. To search for a file, the user sends a request message specifying the key and a timeout (hops-to-live) value.

Freenet uses the following types of messages, which all include the node identifier
  16. 16.

(for loop detection), a hops-to-live value (similar to the Gnutella TTL, see Section 3.3.2), and the source and destination node identifiers:

Data insert. A node inserts new data in the network. A key and the actual data (file) are included.

Data request. A request for a certain file. The key of the file requested is also included.

Data reply. A reply initiated when the requested file is located. The actual file is also included in the reply message.

Data failed. A failure to locate a file. The location (node) of the failure and the reason are also included.

New nodes join the Freenet network by first discovering the address of one or more existing nodes, and then starting to send Data Insert messages. To insert a new file in the network, the node first calculates a binary key for the file, and then sends a data insert message to itself. Any node that receives the insert message first checks to see if the key is already taken. If the key is not found, the node looks up the closest key (in terms of lexicographic distance) in its routing table, and forwards the insert message to the corresponding node. By this mechanism, newly inserted files are placed at nodes possessing files with similar keys.

This continues as long as the hops-to-live limit is not reached. In this way, more than one node will store the new file. At the same time, all the participating nodes will update their routing tables with the new information (this is the mechanism through which the new nodes announce their presence to the rest of the network). If the hops-to-live limit is reached without any key collision, an “all clear” result will be propagated back to the original inserter, indicating that the insert was successful.

If the key is found to be taken, the node returns the preexisting file as if a request were made for it. In this way, malicious attempts to supplant existing files by inserting junk will result in the existing files being spread further.

If a node receives a request for a locally-stored file, the search stops and the data is forwarded back to the requestor.

If the node does not store the file that the requestor is looking for, it forwards the request to its neighbor that is most likely to have the file, by searching for the file key in its local routing table that is closest to the one requested. The messages, therefore, form a chain, as they propagate from node-to-node. To avoid huge chains, messages are deleted after passing through a certain number of nodes, based on the hops-to-live value they carry. Nodes also store the ID and other information of the requests they have seen, in order to handle “data reply” and “data failed” messages.

If a node receives a backtracking “data failed” message from a downstream node, it selects the next best node from its routing stack and forwards the request to it. If all nodes in the routing table have been explored in this way and failed, it sends back a “data failed” message to the node from which it originally received the data request message.

If the requested file is eventually found at a certain node, a reply is passed back through each node that forwarded the request to the original node that started the chain. This data reply message will include the actual data, which is cached in all intermediate nodes for future requests. A subsequent request with the same key will be served immediately with the cached data. A request for a similar key will be forwarded to the node that previously provided the data.

To address the problem of obtaining the key that corresponds to a specific file, Freenet recommends the use of a special class of lightweight files called “indirect files”. When a real file is inserted, the author also inserts a number of indirect files that are named according to search keywords and contain pointers to the real file. These indirect files differ from normal files in that multiple files with the same key (i.e. search keyword) are permitted to exist, and requests for such keys keep going until a specified number of results is accumulated, instead of stopping at the first file found. The problem of managing the
  17. 17.

large volume of such indirect files remains open. Figure 4 illustrates the use of indirect files.

Fig. 4. The use of indirect files in Freenet. The diagram illustrates a regular file with key “A652D17D88”, which is supposed to contain a tutorial about photography. The author chose to insert the file itself, and then a set of indirect files (marked with an i), named according to search keywords she considered relevant. The indirect files are distributed among the nodes of the network; they do not contain the actual document, simply a “pointer” to the location of the regular file containing the document.

The following properties of Freenet are a result of its routing and location algorithms:

- Nodes tend to specialize in searching for similar keys over time, as they get queries from other nodes for similar keys.
- Nodes store similar keys over time, due to the caching of files as a result of successful queries.
- Similarity of keys does not reflect similarity of files.
- Routing does not reflect the underlying network topology.

3.4.2. Chord. Chord [Stoica et al. 2001] is a peer-to-peer routing and location infrastructure that performs a mapping of file identifiers onto node identifiers. Data location can be implemented on top of Chord by identifying data items (files) with keys and storing the (key, data item) pairs at the node that the keys map to.

In Chord, nodes are also identified by keys. The keys are assigned both to files and nodes by means of a deterministic function, a variant of consistent hashing [Karger et al. 1997]. All node identifiers are ordered in an “identifier circle” modulo $2^m$ (Figure 5 shows an identifier circle with $m = 3$). Key k is assigned to the first node whose identifier is equal to, or follows k, in the identifier space. This node is called the successor node of key k. The use of consistent hashing tends to balance load, as each node receives roughly the same number of keys.

Fig. 5. A Chord identifier circle consisting of the three nodes 0, 1 and 3. In this example, key 1 is located at node 1, key 2 at node 3, and key 6 at node 0.

The only routing information required is for each node to be aware of its successor node on the circle. Queries for a given key are passed around the circle via these successor pointers until a node that contains the key is encountered. This is the node the query maps to.

When a new node n joins the network, certain keys previously assigned to n’s successor will become assigned to n. When node n leaves the network, all keys assigned to it will be reassigned to its successor. These are the only changes in key assignments that need to take place in order to maintain load balance.

As discussed, only one data element per node needs to be correct for Chord to guarantee correct (though slow) routing of queries. Performance degrades gracefully when routing information becomes out of date due to nodes joining and leaving the system, and availability remains high only as long as nodes fail independently. Since
  18. 18.

the overlay topology is not based on the underlying physical IP network topology, a single failure in the IP network may manifest itself as multiple, scattered link failures in the overlay [Saroiu et al. 2002].

To increase the efficiency of the location mechanism described previously, that may, in the worst case, require traversing all N nodes to find a certain key, Chord maintains additional routing information, in the form of a “finger table”. In this table, each entry $i$ points to the successor of node $n + 2^i$. For a node $n$ to perform a lookup for key $k$, the finger table is consulted to identify the highest node $n'$ whose ID is between $n$ and $k$. If such a node exists, the lookup is repeated starting from $n'$. Otherwise, the successor of $n$ is returned. Using the finger table, both the amount of routing information maintained by each node and the time required for resolving lookups are $O(\log N)$ for an $N$-node system in the steady state.

Achord [Hazel and Wiley 2002] is proposed as a variant of Chord that provides censorship resistance by limiting each node’s knowledge of the network in ways similar to Freenet (see Section 3.4.1).

3.4.3. CAN. The CAN (“Content Addressable Network”) [Ratnasamy et al. 2001] is essentially a distributed, Internet-scale hash table that maps file names to their location in the network, by supporting the insertion, lookup, and deletion of (key, value) pairs in the table.

Each individual node of the CAN network stores a part (referred to as a “zone”) of the hash table, as well as information about a small number of adjacent zones in the table. Requests to insert, lookup, or delete a particular key are routed via intermediate zones to the node that maintains the zone containing the key.

CAN uses a virtual d-dimensional Cartesian coordinate space (see Figure 6) to store (key K, value V) pairs. The zone of the hash table that a node is responsible for corresponds to a segment of this coordinate space. Any key K is, therefore, deterministically mapped onto a point P in the coordinate space. The (K, V) pair is then stored at the node that is responsible for the zone within which point P lies. For example, in the case of Figure 6(a), a key that maps to coordinate (0.1, 0.2) would be stored at the node responsible for zone B.

Fig. 6. CAN: (a) Example 2-d [0, 1] × [0, 1] coordinate space partitioned between 5 CAN nodes; (b) Example 2-d space after node F joins.

To retrieve the entry corresponding to K, any node can apply the same deterministic function to map K to P, and then retrieve the corresponding value V from the node covering P. Unless P happens to lie in the requesting node’s zone, the request must be routed from node-to-node until it reaches the node covering P.

CAN nodes maintain a routing table containing the IP addresses of nodes that hold zones adjoining their own, to enable
routing between arbitrary points in space. Intuitively, routing in CAN works by following the straight line path through the Cartesian space from source to destination coordinates. For example, in Figure 6(a), a request from node A for a key mapping to point p would be routed through nodes A, B, E, along the straight line represented by the arrow.

A new node that joins the CAN system is allocated its own portion of the coordinate space by splitting the allocated zone of an existing node in half, as follows:

(1) The new node identifies a node already in the CAN network, using a bootstrap mechanism as described in Francis [2000].
(2) Using the CAN routing mechanism, the node randomly chooses a point P in the coordinate space and sends a JOIN request to the node covering P. The zone is then split, and half of it is assigned to the new node.
(3) The new node builds its routing table with the IP addresses of its new neighbors, and the neighbors of the split zone are also notified to update their routing tables to include the new node.

When nodes gracefully leave CAN, the zones they occupy and the associated hash table entries are explicitly handed over to one of their neighbors. Under normal conditions, a node sends periodic update messages to each of its neighbors reporting its zone coordinates, its list of neighbors, and their zone coordinates. If there is a prolonged absence of such an update message, the neighbor nodes realize there has been a failure, and initiate a controlled takeover mechanism. If many of the neighbors of a failed node also fail, an expanding ring search mechanism is initiated by one of the neighboring nodes, to identify any functioning nodes outside the failure region.

A list of design improvements is proposed over the basic CAN design described, including the use of multi-dimensional coordinate space, or multiple coordinate spaces, for improving network latency and fault tolerance; the employment of more advanced routing metrics, such as connection latency and underlying IP topology, alongside the Cartesian distance between source and destination; allowing multiple nodes to share the same zone, mapping the same key onto different points; and the application of caching and replication techniques [Ratnasamy et al. 2001].

3.4.4. Tapestry. Tapestry [Zhao et al. 2001] supports the location of objects and the routing of messages to objects (or the closest copy of them, if more than one copy exists in the network) in a distributed, self-administering, and fault-tolerant manner, offering system-wide stability by bypassing failed routes and nodes, and rapidly adapting communication topologies to circumstances.

The topology of the network is self-organizing as nodes come and go, and network latencies vary. The routing and location information is distributed among the network nodes; the topology’s consistency is checked on-the-fly, and if it is lost due to failures or destroyed, it is easily rebuilt or refreshed.

Tapestry is based on the location and routing mechanisms introduced by Plaxton, Rajaraman and Richa [1997], in which they present the Plaxton mesh, a distributed data structure that allows nodes to locate objects and route messages to them across an arbitrarily-sized overlay network, while using routing maps of small and constant size. In the original Plaxton mesh, the nodes can take on the role of servers (where objects are stored), routers (which forward messages), and clients (origins of requests).

Each node maintains a neighbor map, as shown in the example in Table IV. The neighbor map has multiple levels, each level l containing pointers to nodes whose ID must be matched with l digits (the x’s represent wildcards). Each entry in the neighbor map corresponds to a pointer to the closest node in the network whose ID matches the number in the neighbor map, up to a digit position.

For example, the 5th entry for the 3rd level for node 67493 points to the node
closest to 67493 in network distance whose ID ends in ..593. Table IV shows an example neighbor map maintained by a node with ID 67493.

Table IV. The Neighbor Map Held by Tapestry Node With ID 67493

          Level 5   Level 4   Level 3   Level 2   Level 1
Entry 0   07493     x0493     xx093     xxx03     xxxx0
Entry 1   17493     x1493     xx193     xxx13     xxxx1
Entry 2   27493     x2493     xx293     xxx23     xxxx2
Entry 3   37493     x3493     xx393     xxx33     xxxx3
Entry 4   47493     x4493     xx493     xxx43     xxxx4
Entry 5   57493     x5493     xx593     xxx53     xxxx5
Entry 6   67493     x6493     xx693     xxx63     xxxx6
Entry 7   77493     x7493     xx793     xxx73     xxxx7
Entry 8   87493     x8493     xx893     xxx83     xxxx8
Entry 9   97493     x9493     xx993     xxx93     xxxx9

Each entry in this table corresponds to a pointer to another node.

Messages are, therefore, incrementally routed to the destination node digit-by-digit, from the right to the left. Figure 7 shows an example path taken by a message from node with ID = 67493, to node ID = 34567. The digits are resolved right to left as follows:

xxxx7 → xxx67 → xx567 → x4567 → 34567

Fig. 7. Tapestry: Plaxton mesh routing example, showing the path taken by a message originating from node 67493, and destined for node 34567 in a Plaxton mesh, using decimal digits of length 5.

The Plaxton mesh uses a root node for each object, that serves to provide a guaranteed node from which the object can be located. When an object o is inserted in the network and stored at node n_s, a root node n_r is assigned to it by using a globally consistent deterministic algorithm. A message is then routed from n_s to n_r, storing data in the form of a mapping (object id o, storer id n_s) at all nodes along the way. During a location query, messages destined for o are initially routed towards n_r, until a node is encountered containing the (o, n_s) location mapping.

The Plaxton mesh offers: (1) simple fault-handling by its potential to route around a single link or node by choosing a node with a similar suffix, and (2) scalability (with the only bottleneck existing at the root nodes). Its limitations include: (1) the need for global knowledge required for assigning and identifying root nodes, and (2) the vulnerability of the root nodes.

The Plaxton mesh assumes a static node population. Tapestry extends its design to adapt it to the transient populations of peer-to-peer networks and provide adaptability, fault tolerance, as well as various optimizations described in Zhao et al. [2001] and outlined below:

—Each node additionally maintains a list of back-pointers, which point to nodes where it is referred to as a neighbor. These are used in dynamic node insertion algorithms to generate the appropriate neighbor maps for new nodes. Dynamic algorithms are employed for node insertion, populating neighbor maps, and notifying neighbors of new node insertions.
—The concept of distance between nodes becomes semantically more flexible, and locations of more than one replica of an object are stored, allowing the application architecture to define how the “closest” node will be interpreted. For example, in the Oceanstore architecture [Kubiatowicz et al. 2000], a “freshness” metric is incorporated in the concept of distance, that is taken into account
when finding the closest replica of a document.
—The use of soft-state to maintain cached content, based on the announce/listen approach [Deering 1998], is adopted by Tapestry to detect, circumvent, and recover from failures in routing or object location. Caches are periodically updated by refreshment messages, or purged if no such messages are received. Additionally, the neighbor map is extended to maintain two backup neighbors, in addition to the closest (primary) neighbor. Furthermore, to avoid costly reinsertions of nodes after failures, when a node realizes that a neighbor is unreachable, instead of removing its pointer, it temporarily marks it as invalid in the hope that the failure will be repaired, and, in the meantime, routes messages through alternative paths.
—To avoid the single point of failure that root nodes constitute, Tapestry assigns multiple roots to each object. This enables a trade-off between reliability and redundancy. A distributed algorithm called surrogate routing is employed to compute a unique root node for an object in a globally consistent fashion, given the nonstatic set of nodes in the network.
—A set of optimizations improves performance by adapting to environment changes. Tapestry nodes tune their neighbor pointers by running refresher threads that update network latency values. Algorithms are implemented to detect query hotspots and offer suggestions as to where additional copies of objects can be placed to significantly improve query response time. A “hotspot cache” is also maintained at each node.

Tapestry is used by several systems such as Oceanstore [Kubiatowicz et al. 2000; Rhea et al. 2001], Mnemosyne [Hand and Roscoe 2002], and Scan [Chen et al. 2000].

Oceanstore further enhances the performance and fault tolerance of Tapestry by applying an additional “introspection layer”, where nodes monitor usage patterns, network activity, and resource availability to adapt to regional outages and denial of service attacks.

Pastry [Rowstron and Druschel 2001] is a scheme very similar to Tapestry, differing mainly in its approach to achieving network locality and object replication. It is employed by the PAST [Druschel and Rowstron 2001] large-scale persistent peer-to-peer storage utility.

Finally, Kademlia [Maymounkov and Mazières 2002] is proposed as an improved XOR-topology-based routing algorithm similar to Tapestry and Pastry, focusing on consistency and performance. It introduces the use of a concurrence parameter that lets users trade bandwidth for better latency and fault recovery.

3.4.5. Comparison, Shortcomings and Evolution of Structured Architectures. In the previous sections, we have presented characteristic examples of structured peer-to-peer object location and routing systems, and their main characteristics.

A comparison of these (and similar) systems can be based on the size of the routing information maintained by each node (the routing table), their search and retrieval performance (measured in number of hops), and the flexibility with which they can adapt to changing network topologies.

It turns out that in their basic design, most of these systems are equivalent in terms of routing table space cost, which is O(log N), where N is the size of the network. Similarly, their performance is again mostly O(log N), with the exception of CAN, where the performance is given by O((d/4) N^{1/d}), d being the number of employed dimensions.

Maintaining the routing table in the face of transient node populations is relatively costly for all systems, perhaps more for Chord, as nodes joining or leaving induce changes to all other nodes, and less so in Kademlia, which maintains a more flexible routing table. Liben-Nowell et al. [2002a] introduce the notion of the “half-life” of a system as an approach to this issue.
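To give a rough feel for the hop-count figures quoted in this comparison, the following minimal sketch evaluates log N against (d/4) N^(1/d) for a few values of d. It is illustrative only: the constants hidden by the O-notation are ignored, the base-2 logarithm is an assumption made here, and the function names are hypothetical.

```python
import math


def log_hops(n: int) -> float:
    """Hop estimate for the O(log N) systems (base 2 is an assumption)."""
    return math.log2(n)


def can_hops(n: int, d: int) -> float:
    """Hop estimate for CAN with d dimensions: (d/4) * N^(1/d)."""
    return (d / 4) * n ** (1 / d)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical network size
    print(f"log2(N)     = {log_hops(n):.1f}")
    for d in (2, 4, 10):
        print(f"CAN, d = {d:2d} = {can_hops(n, d):.1f}")
```

Under these assumptions, a low-dimensional CAN needs far more hops than a logarithmic scheme at large N, while increasing d narrows the gap at the price of more neighbors per node.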
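The logarithmic hop count of the Plaxton-style systems follows from the digit-by-digit resolution described in Section 3.4.4: each hop fixes one more trailing digit of the destination ID. The sketch below illustrates this under loud simplifying assumptions. It uses a global view of the node set instead of per-node neighbor maps, it picks the next hop by fewest matched digits rather than by network distance, and every node ID other than 67493 and 34567 is invented for the example.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that two IDs have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def resolution_patterns(dest: str) -> list[str]:
    """Suffix patterns resolved in turn, with 'x' as a wildcard digit."""
    return ["x" * (len(dest) - k) + dest[-k:] for k in range(1, len(dest) + 1)]


def route(source: str, dest: str, nodes: list[str]) -> list[str]:
    """Toy suffix routing: each hop matches at least one more trailing digit.

    Simplification: candidates come from a global node list, and the one with
    the fewest matched digits is chosen so that every step is visible.
    """
    path = [source]
    current = source
    while current != dest:
        need = shared_suffix_len(current, dest) + 1
        candidates = [n for n in nodes if shared_suffix_len(n, dest) >= need]
        if not candidates:
            raise LookupError(f"no node matches {need} trailing digits of {dest}")
        current = min(candidates, key=lambda n: (shared_suffix_len(n, dest), n))
        path.append(current)
    return path


if __name__ == "__main__":
    # Intermediate IDs are hypothetical; only the endpoints come from the text.
    nodes = ["67493", "12347", "80067", "11567", "94567", "34567"]
    print(" -> ".join(resolution_patterns("34567")))
    print(" -> ".join(route("67493", "34567", nodes)))
```

With five-digit decimal IDs the route takes at most five hops, one per digit, regardless of how many nodes share each suffix.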