we analyze in this paper. In Section IV, we examine the impact
that locality has upon the system's performance. In Section V, we
evaluate the hierarchical designs we have introduced in Section
III. Finally, we draw some conclusions in Section VI.
II. RELATED WORK
In spite of the growing interest in hierarchical DHTs, as far
as we know, this paper is the first attempt at providing a formal
analysis that compares the two main hierarchical DHT designs:
the homogenous design [2-4] against the superpeer design [5-
7]. However, recent work has extensively examined superpeer
architectures, shedding light on the performance tradeoffs, the
potential drawbacks, and their reliability. S. Zoels et al. [8]
proposed a cost model to analyze when a superpeer architecture
is better than a flat one. They found that there is a natural
tradeoff between minimizing total system costs and minimizing
the costs for the highest loaded peer in the network. However,
their study was tailored to a particular hierarchical architecture,
thus not providing an analytical framework to compare existing
hierarchical DHT designs. A more practical study by Yang and
Garcia-Molina [9] considered, besides performance tradeoffs,
redundancy and topology variations in superpeer design. They
were able to extract a few rules of thumb, though they did not
contemplate the homogenous alternative.
For flat structured P2P protocols, Christin and Chuang [10]
proposed a cost-based model to evaluate the resources that each
node has to contribute for participating in the overlay network.
They evaluated some of the topologies currently proposed for
overlay networks, and concluded that from the point of view of
reliability and scalability, all the geometries they analyzed can
create large imbalances in the load imposed on different nodes.
We argue that such a cost model is a useful complement to our
work, in that it allows valuing the resources nodes use to forward
traffic on behalf of other nodes, and characterizing the efficacy
of a system as a whole. We share their aim, but for the two basic
hierarchical DHT designs in the literature.
Next, we review some relevant works that presented the
advantages of hierarchical DHT design. Ganesan et al. [3]
proposed Canon, a technique to build a hierarchical DHT
from its flat counterpart, so that it inherits the homogeneity of
load and functionality offered by the original flat design. They
mostly adduced better fault isolation, more effective bandwidth
utilization, and better adaptation to the physical Internet as the
main arguments to use hierarchical DHT systems. Garcés-Erice
et al. [7] studied the potential advantages a hierarchical system
offers when the most “reliable” peers are at the top layers, the
peers are organized into clusters according to topological proxi-
mity, and when popular files are cached within the clusters.
III. HIERARCHICAL ARCHITECTURES
We start with a formal description of the two architectures
we compare in this work. For the sake of simplicity, both archi-
tectures are restricted to two layers. It has been discussed in [6]
that two is a pragmatic choice for the number of layers in a hie-
rarchy. Of course, to make a fair comparison, we suppose that
both architectures have the same number of nodes N, the same
number of peers in a cluster n, and the same number of clusters
K, where N = Kn. Also, we assume that N is evenly divisible
by K and each cluster has a unique id called clusterId. Further,
we instantiate both architectures with Chord. Mainly, we have
chosen Chord for its hypercube routing geometry.
Consider an undirected graph on N = 2^D nodes arranged in
a circle. Nodes are labeled with D-bit identifiers called nodeIds
from 0 through 2^D − 1 going clockwise, i.e., N = {0, 1, …,
2^D − 1}, where N denotes Chord's namespace. Each node has a
routing table of length D = log2N called its finger table, which
contains the addresses of nodes (fingers) that are located ahead
of it on the ring at distances 1, 2, 4, 8, …, N/2. The resulting
topology forms the basis of Chord.
In the standard Chord routing algorithm, messages are sent
along only those links that diminish the clockwise distance by
some power of two. Routing is clockwise and greedy, never
overshooting the destination. Routing is as follows. Each node
contacts the finger that is the closest predecessor (according to
clockwise distance) to the destination. Then, this finger repeats
the same operation, but using its own finger table. This process
continues until the destination is reached.
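The greedy bit-fixing behavior just described can be sketched as follows; this is our own illustration (not code from the paper), for a fully populated ring with D = 6:

```python
# Our illustration (not from the paper): clockwise-greedy Chord routing
# on a fully populated 2**D ring; hops equal the 1-bits of the distance.
D = 6
N = 2 ** D

def fingers(node):
    """Finger table: nodes ahead at distances 1, 2, 4, ..., N/2."""
    return [(node + 2 ** i) % N for i in range(D)]

def route(src, dst):
    """Greedy clockwise routing, never overshooting; returns the path."""
    path = [src]
    while path[-1] != dst:
        dist = (dst - path[-1]) % N
        jump = 1 << (dist.bit_length() - 1)   # largest non-overshooting jump
        path.append((path[-1] + jump) % N)
    return path

# distance 9 = 1001 in binary: jumps of size 8 then 1, i.e., two hops
assert route(3, 12) == [3, 11, 12]
```

Each hop clears the leftmost 1 in the remaining clockwise distance, which is exactly the bit-fixing view used later in Section IV.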
For the remainder of this work, we assume that all nodes in
N = {0, 1, 2, …, 2^D − 1} exist and are alive. This assumption
is plausible since we only establish "comparative results" under
the same conditions: we ignore the bias toward nodes that have
large clockwise arc lengths. This bias is more significant when
performing worst-case cost analysis. However, for an average-
case analysis, this simplification is acceptable and very helpful
in facilitating the model. To ease exposition, we also set n to
2^d (0 ≤ d < D). Hence, K = 2^(D−d).
A. Homogenous Design
Cyclone [4] is a generic technique to construct hierarchical
DHTs from their flat DHT versions, in such a way that one can
obtain the best of both worlds, without inheriting the disadvan-
tages of either. To be specific, Cyclone produces hierarchical
DHTs with all peers assuming equal roles, thereby maximizing
load balancing, which is the main shortcoming of the superpeer
design. To maximize load balancing, DHTs use the same hash
function h to map a set of objects O and a set of participants
P to a unique namespace N, i.e. h: O → N and h: P → N,
respectively, obscuring two key concerns: on the one hand,
the distribution of nodes over the namespace and, on the other
hand, the distribution of the objects over the namespace. Cyclone
considers these concerns separately. While all layers share
the entire namespace for the objects O, the namespace for peers
is hierarchically decomposed into interleaved partitions. In par-
ticular, each node takes its nodeId from the range N = [0, 2^D),
according to Chord's paper [1], and uses the i rightmost bits of
the binary representation of its nodeId to obtain its clusterId at
layer θ − i, where θ is the hierarchy depth. As a result, in each
layer all peers are organized into disjoint clusters, there exists a
“universal” cluster at layer θ that contains all peers in P, and
each peer belongs simultaneously to θ telescoping clusters (i.e.
clusters of clusters of … of peers), with one in each layer.
To clarify the preceding observations, let us represent each
nodeId by its binary representation x1x2…xD, where each xi ∈
{0, 1}. Suppose that x1x2…xD is the nodeId of a given peer p.
It can be easily seen that the clusterId for a peer p at layer θ − i
is given by the i-bit suffix xD−i+1…xD, while the prefix x1x2…xD−i
denotes the nodeId for p within that cluster. In our case, p will
be assigned to the layer-1 cluster given by the suffix xD−κ+1…xD,
where κ = log2K, besides the layer-2 cluster, also called the
universal cluster since it includes all nodeIds in N. In Fig. 1, we
represent the recursive construction of Cyclone's version of Chord
with 2^κ layer-1 clusters (κ = 1).
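As an illustration of this suffix-based decomposition (our own sketch, with hypothetical helper names), consider D = 4 and κ = 1 as in Fig. 1:

```python
# Our sketch of Cyclone's interleaved partition (helper names are ours):
# the kappa rightmost bits of a nodeId give its layer-1 clusterId; the
# remaining prefix identifies the node inside that cluster.
D, kappa = 4, 1

def layer1_cluster(node_id):
    cluster_id = node_id & ((1 << kappa) - 1)   # kappa-bit suffix
    inner_id = node_id >> kappa                 # (D - kappa)-bit prefix
    return cluster_id, inner_id

clusters = {}
for p in range(2 ** D):
    cid, _ = layer1_cluster(p)
    clusters.setdefault(cid, []).append(p)

# 2**kappa disjoint clusters of 2**(D - kappa) peers each, as in Fig. 1
assert len(clusters) == 2 ** kappa
assert all(len(m) == 2 ** (D - kappa) for m in clusters.values())
```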
1) Routing
Routing in Cyclone is identical to routing in Chord, namely,
clockwise greedy routing, but operating in θ loops. In the first
loop, a node that wishes to route a query for a key k initially
routes the message to the predecessor p of k within its lowest
cluster. In the second loop, p switches to the next higher cluster
and continues routing on that cluster. This operation is repeated
in each layer until the node responsible for k is reached. As the
routing procedure goes up, more and more peers are included
and the message is increasingly closer to the destination. At the
last loop, routing is executed on the universal cluster that spans
all peers in the system.
In this work, node p will first attempt to route on its layer-1
cluster, and only if the node that is responsible for key k is not
reached, the query will be routed through the universal cluster.
2) Advantages and Disadvantages
As noted in [3], the primary benefit of homogenous design
is that there is uniform distribution of load among the nodes in
the network, which also ensures that there is no single point of
failure. We illustrate fault resilience through a brief discussion.
We say that a cluster is faulty if all nodes in the cluster are
faulty. Considering a cluster as a single node, a cluster can only
be disconnected if the clusters (nodes) to which it is connected
at the next layer are faulty. To see this, note that all peers in a
cluster have intercluster links to the same set of clusters at each
layer. Hence, as long as a cluster has an intercluster link to at
least one non-faulty peer from such clusters, a message will have
the chance to leave the cluster, albeit through a (suboptimal) route.
To verify this, we include the following Lemma, which claims
that even a small n provides sufficient non-failure guarantees.
Lemma 1: If each node fails independently with probability
p in a time period of length λ(p) and n > ln(N/ln(1/ε))/ln(1/p),
where ε = 1 − 1/N^c, then no cluster is faulty with high probability.
Proof: The probability that any cluster is faulty is very low
even under a high presence of churn. Let λ(p) denote the time
period in which each peer disappears with probability p. For a
cluster to fail, it is necessary that all peers in the cluster are
faulty. The probability that this event happens is p^n. The total
number of clusters is K = N/n. Hence, the probability that no
cluster is faulty is (1 − p^n)^(N/n) > (1 − p^n)^N ≈ e^(−Np^n)
(N is supposed to be large). Picking n = ln(N/ln(1/ε))/ln(1/p),
this probability is more than ε and the lemma follows. ■
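A quick numeric check of the bound (our own sketch; the parameter values N = 2^20, p = 0.5, and c = 1 are illustrative assumptions, not from the paper):

```python
# Our numeric check of Lemma 1; N, p, and c below are illustrative.
import math

N, p, c = 2 ** 20, 0.5, 1.0
eps = 1.0 - 1.0 / N ** c
# cluster size prescribed by the lemma
n = math.log(N / math.log(1.0 / eps)) / math.log(1.0 / p)
prob_no_faulty = (1.0 - p ** n) ** (N / n)
assert prob_no_faulty > eps     # no cluster is faulty w.h.p.
assert n < 50                   # a few tens of peers per cluster suffice
```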
The main drawback is that transient and low-capacity peers
can seriously compromise the scalability of the whole system.
For example, low-bandwidth peers holding popular data can be
rapidly overwhelmed, becoming bottlenecks. However, as seen
in [11], proactive caching can compensate the disadvantages of
a homogenous treatment of low- and high-capacity peers.
B. Superpeer Design
We next consider the superpeer design proposed by Garcés-
Erice et al. [7]. In their superpeer design, peers are divided into
two layers: superlayer and regular-layer, thereby scaling better
by taking advantage of peers’ heterogeneity. Each peer in the
superlayer, or the layer-2 network, is called a superpeer, and is
responsible for propagating the queries on behalf of the regular-
peers in its cluster. To this end, each superpeer is required to be
connected to other superpeers in the superlayer, according to
the standard Chord rule for creating links. More specifically, all
clusters are organized into a Chord overlay network defined by
a directed graph (C, E), where C = {c1, c2, .., cK} is the set of
all clusters and E is the set of edges between the nodes, which
are the clusters in C. If si is a superpeer in cluster ci, and (ci, cj)
is an edge in the superlayer overlay (C, E), then si knows the IP
address of the superpeer sj in cluster cj. With this knowledge, si
can forward queries to sj. It is important to note that we assume
that each cluster has exactly one superpeer, though in [7] there
is no such restriction. However, this assumption is reasonable
in conjunction with the optimistic view we take here that
superpeers are always up and never leave the network.
A peer in the regular-layer, or equivalently in the layer-1, is
called a regular-peer. A regular-peer is characterized by keeping
connections only to other regular-peers in its cluster. Within
each cluster, there is also a Chord overlay network that is
used for query communication among the peers in that cluster.
Fig. 2 depicts Garcés-Erice's superpeer architecture with a single
superpeer per cluster.
1) Routing
The routing protocol exploits the multi-layer structure: first,
the routing protocol finds the cluster that is responsible for the
key; then it finds the peer within a cluster that is responsible for
the key.
Say that a cluster c ∈C is responsible for a key k if c is the
“closest” cluster to k among all the clusters. Here “closest” is
defined by the clockwise distance to key k on the ring. Routing
operates as follows. Consider that a peer p wishes to determine
the node responsible for k:
Fig. 2. Two-layer superpeer Chord-like architecture (layer-1 clusters
with clusterIds 00, 01, 10, 11, each containing one superpeer and its
regular peers; the superpeers form the supercluster at layer-2).
Fig. 1. Two-layer Cyclone Chord-like architecture with κ = 1 (layer-1
clusters with clusterIds 0 and 1; the universal cluster at layer-2).
• First, p routes a message for k to the superpeer s in its
cluster using the standard Chord routing algorithm.
• Once the message reaches superpeer s, the superlayer’s
routing algorithm routes the query through (C, E) to the
cluster ck that is responsible for k. During this phase,
the query only passes through superpeers.
• Finally, using the Chord overlay network in cluster ck,
superpeer sk routes the message to the regular peer that
is responsible for k.
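The three phases above can be sketched as a hop count (our own illustration; function and parameter names are hypothetical, and we assume fully populated cluster and superlayer rings):

```python
# Our illustration of the three-phase lookup (function names are ours).
# K = 2**(D-d) clusters of n = 2**d regular peers, one superpeer each;
# on a fully populated ring, hops = 1-bits of the clockwise distance.
D, d = 8, 3
K, n = 2 ** (D - d), 2 ** d

def chord_hops(dist):
    return bin(dist).count("1")

def superpeer_lookup_hops(src_pos, src_sp, cluster_dist, dst_sp, dst_pos):
    to_sp = chord_hops((src_sp - src_pos) % n)   # phase 1: reach own superpeer
    across = chord_hops(cluster_dist)            # phase 2: route on (C, E)
    inside = chord_hops((dst_pos - dst_sp) % n)  # phase 3: inside target cluster
    return to_sp + across + inside

# e.g. clockwise distances 4, 6 and 7 cost 1 + 2 + 3 = 6 hops
assert superpeer_lookup_hops(1, 5, 6, 0, 7) == 6
```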
2) Advantages and Disadvantages
The obvious benefit of superpeer systems is the exploitation
of heterogeneous peers to their advantage, by assigning greater
responsibility to those who are more capable of handling it. By
designating as superpeers the peers that are “up” the most, the
superlayer overlay network will be more stable, thereby letting
the system approach its theoretical optimal performance.
Disadvantages of superpeer systems include the potentially
high traffic rates on intercluster links, namely the source of any
potential degradation in performance, and the diminished fault-
tolerance due to the special role played by superpeers. There is,
however, another factor that most notably limits the feasibility
of a superpeer system (according to [12]): the utilization of an
effective management protocol such that the long-lifetime and
large-capacity peers are allowed to act as superpeers. Since no
complete knowledge is available in a decentralized system, it is
difficult to determine what values are “long” or “large” enough
in relation to the peers now present in a system.
IV. LOCALITY ANALYSIS
Previous works [3, 7, 8] that evaluate hierarchical DHT sys-
tems suppose that communication is uniform, that is, all nodes
have an equal probability of serving a request. While such an
assumption is realistic for flat DHTs, hierarchical DHTs exploit
the locality that exists in communication patterns to their bene-
fit, for example, by allowing nodes to specify in which clusters
the content is to be made accessible or to be stored [3]. In other
words, we drop the assumption that a node communicates with
any other node with equal probability, assuming that nodes in a
cluster communicate together with a higher (or lower) probabi-
lity than do two nodes from two different clusters.
Let γ be the probability that both the source and the destina-
tion nodes of a request are in the same cluster. Hence, (1 − γ)
denotes the probability of intercluster communication. That is,
the smaller the value of γ, the more likely the anti-locality in
communication. Further, we assume that:
• Intracluster communication is distributed uniformly at
random over the set of nodes in each cluster, that is, a
source node routes an intracluster query to each peer
within its cluster with equal probability, and
• Intercluster communication is uniformly random, that
is, a source node forwards an intercluster query to each
other cluster and to each node within the target cluster
with equal probability.
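The two assumptions above can be sketched as a sampling procedure (ours; names are illustrative). The model leaves unspecified whether an intracluster destination may equal the source; the sketch permits it:

```python
# Our sketch of the locality model; names and parameters are illustrative.
import random

def draw_pair(gamma, K, n, rng):
    src = (rng.randrange(K), rng.randrange(n))
    if rng.random() < gamma:                          # intracluster query
        dst = (src[0], rng.randrange(n))
    else:                                             # intercluster query
        other = rng.choice([c for c in range(K) if c != src[0]])
        dst = (other, rng.randrange(n))
    return src, dst

rng = random.Random(1)
pairs = [draw_pair(0.3, K=8, n=16, rng=rng) for _ in range(20000)]
frac_local = sum(s[0] == t[0] for s, t in pairs) / len(pairs)
assert abs(frac_local - 0.3) < 0.02   # concentrates around gamma
```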
In what follows, we characterize the performance measures
to evaluate both architectures as a function of γ. Notice that it is
difficult to predict what values of γ an architect may expect in
practice. We remark that although the results of the following
sections suggest that no design is “universally” better, it is clear
that if intercluster messages are frequent, it is more beneficial
to use a hierarchical DHT that requires higher maintenance but
saves significantly on requests.
We begin examining the performance of both designs
by clarifying the role that locality plays in the average path
length µ (measured in number of hops). The expected
value of this quantity can be viewed as a simple metric capturing
the overall routing cost suffered by each node.
A. Homogenous Design
We derive the average path length µHD for the homogenous
design under the assumption of locality γ in communication. In
Cyclone, the nodeId of a node, which is D bits long, is divided
into two parts: the d leftmost bits identify a node within a
cluster of nodes whose (D − d) rightmost bits are the same; these
(D − d) bits denote the clusterId of this cluster at layer-1. Then,
µHD = γ (µl1) + (1 − γ) (µl2) (1)
where µl1 is the average number of hops between two nodes
within a cluster at layer-1, and µl2 is the average path length of
the universal cluster (when both nodes belong to distinct layer-
1 clusters).
To compute µl1, we exploit the fact that under the all-exist-
all-alive assumption a Chord overlay perfectly embeds a hyper-
cube. Thus, if a message is for a node that is clockwise distance η
away, routing is equivalent to performing left-to-right bit fixing
to convert the 1s in the binary representation of η to 0s. That is,
if η is 9 (1001 in binary), Chord routing uses the jumps of size
8 and 1 in that order, thus fixing the leftmost 1 in the remaining
distance to 0 at each step. Consequently, µl1 can be calculated
as follows,

µl1 = (1/2^d)·Σ_{i=0}^{d} i·C(d, i) = d/2 , (2)

where C(a, b) denotes the binomial coefficient.
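The popcount identity behind Eq. (2) can be checked numerically by averaging over all intracluster distances (our own sketch):

```python
# Our check of Eq. (2): the mean popcount of all 2**d intracluster
# clockwise distances is d/2, since each bit is 1 half of the time.
for d in range(1, 12):
    mu_l1 = sum(bin(eta).count("1") for eta in range(2 ** d)) / 2 ** d
    assert mu_l1 == d / 2
```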
Once we compute µl2, we shall be done. Recall that at least
one bit in the rightmost (D − d) bits of the clockwise distance η
between the source and target nodeIds is required to be nonzero.
Noting that the number of nodes at a clockwise distance η
requiring i + j hops, where i (i ≥ 0) is the number of 1s
contributed by the d leftmost bits of η and j (j > 0) the number
contributed by the (D − d) rightmost bits, is C(d, i)·C(D − d, j),
then

µl2 = [Σ_{i=0}^{d} Σ_{j=1}^{D−d} (i + j)·C(d, i)·C(D − d, j)] / [2^d·(2^(D−d) − 1)]
    = d/2 + (D − d)·2^(D−d−1)/(2^(D−d) − 1) . (3)
Equations (1), (2), and (3) then give

µHD = d/2 + (1 − γ)·(D − d)·2^(D−d−1)/(2^(D−d) − 1) .

It is worth noting that when γ = 2^(d−D) (i.e., the probability of
intracluster communication is proportional to n), µHD = D/2, as
expected.
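The closed form can be checked against brute-force enumeration for small D and d (our own sketch, under the convention that the (D − d) rightmost bits of the clockwise distance decide whether the communication is intracluster):

```python
# Our check of mu_HD against brute force; intracluster <=> the (D - d)
# rightmost bits of the clockwise distance are zero.
def mu_hd_closed(D, d, gamma):
    return d / 2 + (1 - gamma) * (D - d) * 2 ** (D - d - 1) / (2 ** (D - d) - 1)

def mu_hd_brute(D, d, gamma):
    intra = [e for e in range(2 ** D) if e % 2 ** (D - d) == 0]
    inter = [e for e in range(2 ** D) if e % 2 ** (D - d) != 0]
    avg = lambda xs: sum(bin(x).count("1") for x in xs) / len(xs)
    return gamma * avg(intra) + (1 - gamma) * avg(inter)

for gamma in (0.0, 0.25, 0.5, 1.0):
    assert abs(mu_hd_closed(10, 4, gamma) - mu_hd_brute(10, 4, gamma)) < 1e-9
# uniform communication (gamma = n/N) recovers Chord's D/2
assert abs(mu_hd_closed(10, 4, 2 ** (4 - 10)) - 10 / 2) < 1e-9
```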
B. Superpeer Design
We now compute the average path length µSD of this design
as a function of γ. Clearly, µSD is given by
µSD = γ·µl1 + (1 − γ)·(2µl1 + µsl)
where µl1 is the average path length of the network at layer-
1; and µsl is the average path length of the superlayer network
(note that both source and destination nodes are required to lie
in distinct clusters). We have already shown that µl1 = d/2. We
derive µsl using reasoning analogous to that in (3), yielding

µsl = [Σ_{i=1}^{D−d} i·C(D − d, i)] / (2^(D−d) − 1)
    = (D − d)·2^(D−d−1)/(2^(D−d) − 1) . (4)
Therefore, combining equations (4) and (2), µSD is given by

µSD = γ·(d/2) + (1 − γ)·(d + (D − d)·2^(D−d−1)/(2^(D−d) − 1)) .

Again, when γ = 2^(d−D), µHD equals D/2, that of Chord; µSD,
in contrast, remains above D/2, since every intercluster lookup
pays roughly d/2 extra hops for the detour through the source
and destination superpeers.
C. Discussion of Results
The average path length for both designs is shown in Fig. 3
(left). Both networks have cluster size n = 2^10 and network size
N = 2^32. The value of γ is varied from 0 to 1. As γ approaches
1, however, both designs have similar performance, since inter-
cluster links are rarely used. It should also be mentioned that as
γ approaches 0, the required capacities of superpeers increases
as will be shown in next section. As Fig. 3 (right) demonstrates,
the average path length in both designs increases when cluster
size approaches N (γ is fixed at 0.5). This indicates that it is not
economical in number of hops to make big clusters. With a few
thousands of nodes in a cluster, the average path length of the
superpeer network is 40% higher than that of the homogenous
network. Consequently, if query traffic dominates maintenance
traffic, then hierarchical DHTs with large clusters will perform
poorly. Most of the traffic will cross the already congested
intercluster links, incurring intolerable network delays.
V. COST-BASED ANALYSIS
We measure the performance of both designs using both the
total traffic and the individual cost that each node should afford
under typical workload. We define a workload by a quadruplet
<N, n, l, f>, where l is the average node lifetime, and f is the
average number of queries that each node processes per second.
Remember that N is the number of nodes and n the cluster size.
Our typical workload assumes that node lifetime follows an
exponential distribution: pl(t) = λl·e^(−λl·t), where λl = 1/l. We take a
node lifetime l similar to that in real P2P systems. For instance,
the average node lifetime in Gnutella is 2.9 hours [13]. We also
assume that nodes generate traffic independently of each other,
following a Poisson process with a medium to high mean rate f
= 0.1 ∼10 queries/second (Note that a node itself can represent
a gateway in a large enterprise network).
In addition, node arrival is treated as a Poisson process with
rate λa. To maintain a stable population of N nodes, we set λa =
Nλl = N/l. In addition, summing the routing and maintenance
costs, we can compute the total cost of a hierarchical design.
More formally, the total cost C = Cm + Cr, where Cm denotes
the maintenance cost and Cr the routing cost.
For our purposes, we assume that both query messages and
maintenance messages have a size of b bytes. All messages are
explicitly acknowledged and the acknowledgments have length
0.5b. Below, we utilize the total traffic metric to compare both
hierarchical DHT designs.
A. Homogenous Design
1) Maintenance cost
We calculate the minimum traffic required for updating the
routing tables in the face of node arrivals and departures. In an
N-node homogenous DHT with node lifetime l, on average N/l
nodes join and N/l nodes leave each second. Since our analysis
uses Chord as a representative substrate, maintenance costs for
the homogenous design are strongly characterized by the costs
of keeping up the Chord overlay structure [1] in each cluster. In
fact, all nodes assume equal roles and take the same duties for
all the operations. This implies that the traffic imposed on each
node is the same, regardless of the individual capacities of each
node. In our cost model, this is advantageous in computing the
maintenance and routing costs of the homogenous DHT design.
In fact, this is equivalent to computing the total cost of a Chord
network (note that requests are uniformly distributed over the
set of nodes and all nodes have an equal number of fingers). To
that effect, we next compute the maintenance cost of Chord. In
Chord, maintenance traffic is generated by heartbeat messages,
stabilize algorithm and fixfinger traffic. Heartbeat messages are
sent periodically by each node to check if a finger is still alive.
Assuming that heartbeat messages are sent every Tbeat seconds,
they have size 0.5b, and each node handles D fingers, the total
traffic for the heartbeats is

Cbeat = 0.5bND/Tbeat .
Chord protocol has a stabilization algorithm that each node
executes periodically to check whether a new node has inserted
itself between a node and its successor. A detailed description
of the stabilize procedure can be found in [1]. For convenience,
we assume here that the stabilize procedure requires three
messages of length b. Assuming that stabilize is run every Tstab
seconds, the traffic for the stabilize algorithm is

Cstab = 3bN/Tstab .
.
In addition, adding or removing a node is accomplished at a
cost of O(log^2 N) messages. By way of explanation, each node
Fig. 3. Comparison between homogenous (HD) and superpeer (SD)
designs' average path length for D = 32 and d = 10. The value of γ is
varied from 0 to 1 (left). γ is fixed at 0.5 and d is varied from 1 to 32
(right).
periodically invokes fixfinger to make sure that its fingers are
correct; this is how existing nodes incorporate new nodes into
their finger tables. In particular, fixing a finger requires looking
up the finger's id, which costs O(log N) messages on average.
Since log N rounds of fixfinger are needed to initialize a finger
table, the fixfinger cost is O(log^2 N). For the time being, we
make a further simplification by assuming a fixfinger algorithm
that notifies each node that is affected by a topological change.
In other words, fixfinger is run solely when a node joins or
leaves. We thus take a conservative approach, in contrast to
Chord, which periodically updates all nodes, a simple strategy
that incurs a higher overhead. Consistent with this observation,
the cost for fixfinger traffic is presented below.
When a node joins, the number of existing nodes that need
to update their finger tables is O(log N) on average. In our case,
this value is exactly D (we suppose that the namespace is fully
populated). Since fixing a finger requires performing a lookup,
a mean of D(D/2) messages is required to update the present
nodes in the system in the face of a node arrival. Assuming all
these messages have unit size b and each is acknowledged by a
packet of length 0.5b, the total traffic needed for updating the
finger tables of existing nodes is (1 + 0.5)bND^2/(2l).
When a node leaves, at least one message is sent to notify
each of its D fingers, which in turn need to be updated. Hence,
node departures require ND^2/(2l) messages, yielding a cost of
(1 + 0.5)bND^2/(2l).
Moreover, a mean of D^2/2 messages is needed to set up
the finger table of a new node. Therefore, the fixfinger traffic is

Cfix,HD = (1 + 0.5)b(N/l)(D^2/2 + D^2/2 + D^2/2) = 4.5bND^2/(2l) .
The maintenance traffic (fixfinger + heartbeat + stabilize) is
then given by

Cm,HD = bN(4.5D^2/(2l) + 0.5D/Tbeat + 3/Tstab) .
2) Routing Cost
Last, we need to compute the routing cost to see whether it
is economical in traffic to use a homogenous design. To do so,
we know that each node processes f queries per second. Hence,
there are Nf lookups in total. Since each lookup takes µHD hops,
the total traffic for lookups is
Cr,HD = (1 + 0.5)bNfµHD .
3) Total cost
Therefore, the total cost (maintenance plus routing) is

CHD = bN(4.5D^2/(2l) + 0.5D/Tbeat + 3/Tstab + 1.5fµHD) . (5)
We can in turn use expression (5) to compute the individual
cost that each overlay node p has to afford for being part of the
overlay. Let CHD(p) denote the individual cost suffered by peer
p. Then, CHD(p) can be directly computed by dividing the total
cost CHD by N (recall that all nodes are assumed to have equal
responsibilities), that is, ∀p CHD(p) = CHD/N. Therefore,
CHD(p) = b(4.5D^2/(2l) + 0.5D/Tbeat + 3/Tstab + 1.5fµHD) for all p.
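To illustrate the per-node form of Eq. (5), the following sketch (ours) plugs in the Gnutella lifetime cited earlier and otherwise hypothetical parameter values (b = 100 bytes, f = 1 query/s, Tbeat = 30 s, Tstab = 60 s are our assumptions, not the paper's):

```python
# Our illustration of Eq. (5) per node; all parameter values are
# assumptions for illustration (only l = 2.9 h comes from the text).
def c_hd_per_node(D, d, gamma, b, l, f, t_beat, t_stab):
    mu_hd = d / 2 + (1 - gamma) * (D - d) * 2 ** (D - d - 1) / (2 ** (D - d) - 1)
    return b * (4.5 * D ** 2 / (2 * l) + 0.5 * D / t_beat
                + 3 / t_stab + 1.5 * f * mu_hd)

cost = c_hd_per_node(D=32, d=10, gamma=0.5, b=100,
                     l=2.9 * 3600, f=1.0, t_beat=30.0, t_stab=60.0)
assert 1500 < cost < 1800   # roughly 1.7 KB/s per node under these values
```

Under these assumed values the heartbeat and lookup terms dominate; the churn-driven fixfinger term is small because D^2/(2l) is tiny for a 2.9-hour lifetime.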
B. Superpeer Design
1) Maintenance cost
There are two classes of connections in superpeer networks:
the connections between superpeers and the connections which
have at least one regular-peer as an endpoint. The maintenance
cost on any peer is directly related to the number and stability
of the peers in its finger table. This means that the maintenance
cost is proportional to the number of finger nodes and inversely
proportional to the average lifetime of those fingers. To better
represent reality, we consider that superpeers are always up, so
we favor this design with regard to fixfinger traffic. Formally,
this leads us to establish that the total fixfinger traffic is (i.e.,
only links to regular-peers can fail)

Cfix,SD = (1 + 0.5)b(N/l)(d^2/2 + d^2/2 + d^2/2) = 4.5bNd^2/(2l) .
Consequently, the total maintenance cost Cm,SD is

Cm,SD = bN(4.5d^2/(2l) + 0.5D/Tbeat + 3/Tstab) . (6)
One important observation about the superpeer design is
that the individual maintenance cost Cm,SD(p) suffered by each
node p can be directly computed as Cm,SD/N.
2) Routing cost
Next, we derive the routing cost suffered by each superpeer.
Define CSD,r(s) as the individual routing cost that superpeer s
experiences each second. It is easy to see then that CSD,r(s) can
be calculated as follows:
CSD,r(s) = CSD,CL(s) + CSD,NCL(s)
where CSD,CL(s) is the routing cost at s due to intracluster
traffic, and CSD,NCL(s) is the routing cost at s due to intercluster
traffic.
In deriving CSD,r(s), observe that each intercluster message
traverses µsl = (D − d)·2^(D−d−1)/(2^(D−d) − 1) superpeers on
average to reach the destination cluster. Since each cluster
injects these messages into the system at rate (1 − γ)fn, then

CSD,NCL(s) = (1 − γ)fnµsl(1 + 0.5)b for all s,

while

CSD,CL(s) = γfµl1(1 + 0.5)b for all s.
This completes the computation of CSD,r(s).
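The dependence of the superpeer's routing load on γ can be sketched as follows (our own illustration with hypothetical parameter values); because the intercluster term scales with the cluster size n, small γ quickly dominates the load:

```python
# Our sketch of C_SD,r(s) = C_SD,CL(s) + C_SD,NCL(s); parameters are
# illustrative assumptions, not values from the paper.
def superpeer_routing_cost(D, d, gamma, f, b):
    n = 2 ** d
    mu_l1 = d / 2
    mu_sl = (D - d) * 2 ** (D - d - 1) / (2 ** (D - d) - 1)
    intra = gamma * f * mu_l1 * 1.5 * b            # C_SD,CL(s)
    inter = (1 - gamma) * f * n * mu_sl * 1.5 * b  # C_SD,NCL(s), scales with n
    return intra + inter

# as gamma shrinks, the n-proportional intercluster term dominates
low = superpeer_routing_cost(D=16, d=6, gamma=0.1, f=1.0, b=100)
high = superpeer_routing_cost(D=16, d=6, gamma=0.9, f=1.0, b=100)
assert low > high
```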
In an intercluster query process, when a query is issued by a
regular-peer, the regular-peer routes the query to its superpeer
using the lookup algorithm of the cluster. The amount of query
workload is proportional to the number of regular-peers and the
query frequency f of each peer.
Moreover, the individual cost imposed on each regular-peer
is not uniform. More specifically, the regular neighbors closer to
a superpeer receive more intercluster queries than those that are
not. In summary, the individual cost imposed on a regular-peer
is totally dependent on how close the node is to its superpeer.
To better understand why this happens, our analysis comes
back to the geometric intuition that routing in Chord resembles
hypercube geometry. To see why, let η(i, j) be the clockwise
distance between nodes i and j expressed in binary form. Then,
it can be easily seen that Chord routing is effectively achieved
by “correcting” the 1s in the binary representation of η(i, j) to
0s. In a hypercube, this is equivalent to routing from node η(i,
j) to node 0. The key difference between routing in Chord and
on the hypercube is that hypercubes allow bits to be corrected
in any order while on Chord bits have to be corrected from left-
to-right. Keeping this in mind, it can effectively be discussed,
in the context of spanning binomial trees [14], why intercluster
traffic is highly unbalanced. To explain, Chord routing spawns
a spanning binomial tree, which is formed when the clockwise-
greedy paths from all nodes to one specific node are combined.
Further, the spanning binomial tree of an N-node Chord system
is unique, irrespective of the destination (or root) node. The
uniqueness is easy to see since Chord is symmetric with respect
to every node. Label each node with its clockwise distance to the
root. Notice then that the spanning binomial tree rooted at each
node has indeed the same structure and it is therefore the same.
To simplify the analysis, we will represent each node in the tree
by its clockwise distance to the root rather than by its nodeId.
Thus, we shall avoid including the root as parameter, so we can
expose our findings in a somewhat more general context than is
needed. In fact, the spanning binomial tree of Chord, say TChord,
corresponds to the spanning binomial tree rooted at node 0 of
an N-node binary hypercube. In Fig. 4, we illustrate TChord for a
16-node (d = 4) Chord cluster.
We continue with a brief description of TChord. TChord can
be defined using the children(p) function, which returns for a
node p the set of children nodes of p in TChord. The children of
node p are obtained by complementing one of the leading zeros
in the binary representation of p. Also, TChord can be uniquely
defined using the parent(p) function, which returns for a node
p the parent node of p in TChord. The parent of p is obtained by
complementing the leftmost 1-bit in binary representation of p,
as standard Chord routing does. It is worth noting that TChord has
an optimal height d, and optimal average height equal to d/2,
though it is highly unbalanced. By simply looking at Fig. 4, it
is easy to see that the subtrees of the root have sizes 2^0, 2^1, …,
2^(d−1), which corresponds to a poorly balanced tree.
Let η = ηd-1ηd-2…η0 denote the binary representation of the
clockwise distance between the root and an arbitrary node other
than the root. Let i be such that ηi = 1 and ηj = 0, ∀j ∈ {i + 1,
i + 2, …, d − 1} ≡ LSBT(η), and let i = −1 if η = 0. LSBT(η) is
the set of leading zeros of η. Then,
children(η) = {(ηd-1ηd-2 … η̄j … η0)}, ∀j ∈ LSBT(η).
The next lemma establishes the number of times a node η is
seen on the clockwise-greedy paths from all nodes to one node
in Chord; to the best of our knowledge, our work appears to be
the first to claim it.
Lemma 2: Let i be an integer such that ηi = 1 and ηj = 0, ∀j
∈ LSBT(η). Then, there are 2^(d−i−1) nodes in the subtree of
TChord induced by node η (including node η itself).
Proof: In TChord, remember that the children of a node η are
obtained by complementing one of the leading 0s in the binary
representation of η. Let i be such that ηi = 1 and ηj = 0, ∀j ∈ {i
+ 1, i + 2, …, d − 1} ≡ LSBT(η). Consequently, each child of
node η is formed by drawing exactly one element of LSBT(η).
The number of ways to draw one object from a set of size
(d − i − 1) is C(d − i − 1, 1) = d − i − 1, which leads to the fact
that η has exactly C(d − i − 1, 1) children. By a similar argument,
one can prove without much difficulty that |children^k(η)| =
C(d − i − 1, k) (k > 0), where children^k(η) denotes the set of
nodes that result after applying the children function k times to
η. Then, the number of nodes in the subtree of TChord rooted at η
is given by:
1 + ∑k=1..d−i−1 C(d − i − 1, k) = 2^(d−i−1),
and the lemma follows. ■
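Lemma 2 can also be checked by brute force. The sketch below (our own illustration, not part of the paper's evaluation) simulates clockwise-greedy routing toward the root, where each hop clears the leftmost 1-bit of the remaining distance, and counts how many sources route through each node:

```python
# Brute-force check of Lemma 2: in a 2^d-node Chord cluster, the number of
# sources whose clockwise-greedy path to the root passes through the node at
# clockwise distance eta (eta itself included) is 2^(d-i-1), where i is the
# position of eta's leftmost 1-bit.

d = 6
n = 1 << d

def greedy_path(eta):
    """Nodes visited when routing from distance eta down to the root (0)."""
    visited = {eta}
    while eta != 0:
        eta &= ~(1 << (eta.bit_length() - 1))   # clear leftmost 1-bit per hop
        visited.add(eta)
    return visited

for eta in range(1, n):
    hits = sum(1 for src in range(1, n) if eta in greedy_path(src))
    i = eta.bit_length() - 1                    # position of leftmost 1-bit
    assert hits == 2 ** (d - i - 1)             # subtree size, per Lemma 2
```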
We are now in a position to compute the individual routing
cost imposed on a regular node p, CSD,r(p), depending on how
close p is to its superpeer, namely s. Let η(p, s) = ηd-1ηd-2…η0
be the clockwise distance from p to s expressed in binary form.
In addition, let i be the position of the leftmost 1-bit in η(p, s),
i.e., i is such that ηi = 1 and ηj = 0, ∀j ∈ {i + 1, i + 2, …,
d − 1} ≡ LSBT(η). Then, CSD,r(p) is given by:
CSD,r(p) = CSD,CL(p) + CSD,NCL(p)
where CSD,CL(p) denotes the routing load suffered by p due
to intracluster traffic, and CSD,NCL(p) denotes the routing cost at
p due to intercluster traffic.
Since the intracluster traffic cost is equally distributed over
the nodes within a cluster, CSD,CL(p) is given by
CSD,CL(p) = γf(µl1)(1 + 0.5)b for all p.
However, CSD,NCL(p) is not the same for all regular nodes
in a cluster. Note that the total number of intercluster messages
a cluster generates and receives each second equals 2(1 – γ)fn.
Then, CSD,NCL(p) can be calculated as follows:
Fig 4. Chord’s spanning binomial tree (d = 4).
CSD,NCL(p) = (1 – γ)fn(1 + 0.5)b(Rp,OUT + Rp,IN)
where Rp,OUT (resp., Rp,IN) denotes the probability that an
intercluster message gets routed through node p on its way out
of (resp., into) the cluster.
Now, we derive the expression for Rp,OUT. One way to do so
yields the result we seek. Say first that the set of source nodes
S(p) for which p acts as an intermediate hop (including p itself)
is the set of nodes that are in the subtree rooted at p. By Lemma
2, |S(p)| = 2^(d−i−1) since i is the position of the leftmost 1-bit
in η(p, s). Hence, Rp,OUT is given by (recall that n = 2^d):
Rp,OUT = (2^(d−i−1) − 1)/n = 1/2^(i+1) − 1/2^d.
By a similar argument, it can be shown that
Rp,IN = (2^k − 1)/n = 1/2^(d−k) − 1/2^d (7)
where k is the position of the rightmost 1-bit of η(s, p) =
ηd-1ηd-2…η0, i.e., k is such that ηk = 1 and ηj = 0, ∀j ∈ {k − 1,
k − 2, …, 0} ≡ TSBT(η(s, p)) (k = d if η(s, p) = 0). TSBT(η(s, p))
denotes the set of trailing zeros of η(s, p). This completes the
derivation of CSD,NCL(p). Consequently,
CSD,r(p) = 1.5bf [γµl1 + (1 − γ)(2^(d−i−1) + 2^k − 2)].
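The inbound count 2^k − 1 behind equation (7) can be validated numerically. The following sketch (ours, assuming the same MSB-first clockwise-greedy hops that Chord uses) routes from the superpeer to every regular peer and counts the messages each intermediate peer forwards:

```python
# Brute-force check of equation (7): a regular peer at clockwise distance m
# from its superpeer forwards inbound intercluster traffic destined to
# 2^k - 1 other peers, where k is the position of the rightmost 1-bit of m.

d = 5
n = 1 << d

def path_from_root(dist):
    """Positions visited routing greedily from the superpeer over distance dist."""
    pos, visited = 0, []
    while pos != dist:
        hop = 1 << ((dist - pos).bit_length() - 1)  # largest power of 2 <= remaining
        pos += hop
        visited.append(pos)
    return visited

for m in range(1, n):
    forwarded = sum(1 for q in range(1, n) if m in path_from_root(q) and q != m)
    k = (m & -m).bit_length() - 1                   # rightmost 1-bit of m
    assert forwarded == 2 ** k - 1
```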
One interesting point to be noted is that in general, Rp,OUT ≠
Rp,IN for every node. To see why, recall that η(i, j) = (j − i +
n) mod n is the clockwise distance from a node i to a node j.
Then one can verify without much difficulty that Rp,OUT = Rp,IN
iff d − i − 1 = k, where i is the position of the leftmost 1-bit
in η(p, s) and k the position of the rightmost 1-bit in η(s, p),
respectively.
Using equation (7), the PMF (probability mass function) of
Rp,IN is illustrated in Fig. 5 (a) (n = 256), which shows that a
small subset of nodes (we have excluded superpeers) is highly
loaded. As in the case of homogenous designs, this confirms
that the performance of superpeer systems is also conditioned
on the capacities of weak nodes, thus questioning one of the
key benefits of the superpeer design. For instance, while for a
peer that is at a clockwise distance 1 away from its superpeer
RIN = 0, the fraction of intercluster queries received by a node
at a clockwise distance 128 away is RIN ≈ 1/2.
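The PMF in Fig. 5 (a) can be reproduced directly from equation (7); a minimal Python sketch (variable names are ours) for n = 256:

```python
# PMF of R_{p,IN} from equation (7): R_{p,IN} = (2^k - 1)/n, where k is the
# position of the rightmost 1-bit of the clockwise distance eta(s, p) from
# the superpeer s to regular peer p.

from collections import Counter

d = 8
n = 1 << d                              # n = 256, as in Fig. 5 (a)

def r_in(eta):
    k = (eta & -eta).bit_length() - 1   # position of rightmost 1-bit
    return (2 ** k - 1) / n

pmf = Counter(r_in(eta) for eta in range(1, n))   # superpeer (eta = 0) excluded

# About half of the regular peers (those at odd distances) forward no
# intercluster traffic, while a single peer, at distance n/2 = 128, relays
# R_IN = 127/256, i.e., nearly half of it.
print(pmf[0.0] / (n - 1))
print(max(pmf))
```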
3) Total cost
To complete the picture, we next derive the total traffic cost
CSD for the superpeer design. Notice that to do so, one can use
the individual routing costs experienced by each node. Then,
CSD is given by
CSD = Cm,SD + K ∑p=0..n−1 CSD,r(p) + ∑s=0..K−1 µsl CSD,r(s).
Since each superpeer manages an equal number of regular-
peers, CSD is then given by the following expression (recall that
K is the number of clusters and n the cluster size):
CSD = Cm,SD + K [∑p=0..n−1 CSD,r(p) + µsl CSD,r(0)]. (8)
Recall that the cost CSD,r(s) is proportional to the number of
regular-peers s manages, which is n. We leave the expansion of
equation (8) to the reader.
C. Discussion of Results
In this section, we compare the traffic costs incurred by the
hierarchical designs we discussed. We choose D = 32 and d =
10, thus avoiding inadvertently biasing one design against the
other due to the traffic of intercluster queries arriving at and
leaving from each superpeer.
Furthermore, there exists some controversy in the literature
[8, 9, 13] about the optimal ratio of regular-peers to superpeers.
This is of great importance; however, we do not wish to involve
ourselves in this issue here. Rather, we provide a comparison
with a workload apparently similar to that in real P2P systems.
Unless otherwise noted, we assume that l = 2.9 hours (Gnutella)
and f = 0.1 queries/second. As a result, the workload quadruplet
used for the comparison conducted in this section is <32, 10,
2.9, 360>. In addition, Tbeat and Tstab are set to 30 seconds.
In Fig. 5 (b), we compare the total cost of the homogenous
design with that of the superpeer design we study. We use the
relative traffic cost to illustrate the traffic savings of one design
with respect to the reference design. Here, we define the total
relative traffic as CSD / CHD, where CSD refers to the total traffic
of the superpeer design and CHD is the total cost of the
homogenous design. The value of γ is varied from 0 to 1. For
communication patterns with up to 70% locality, the
maintenance cost exhibited by the homogenous design is rightly
compensated by the savings obtained from its greater routing
efficiency. In contrast, the superpeer design incurs a higher load
in this range. However, as γ approaches 1, the homogenous
design introduces more traffic
Fig 5. Traffic comparison between homogenous and superpeer designs for
D = 32 and d = 10. (a) illustrates the PMF of Rp,IN. The relative total cost,
namely CSD/CHD, is shown in (b) when γ is varied from 0 to 1. Similarly,
we plot the relative cost between the highest loaded peers (HLPs) of both
designs: (c) varies locality γ; (d) varies node lifetime (γ is set to 0.5).
than the superpeer design. The reason why this happens is easy
to infer. As γ approaches 1, intercluster links are used less, and
the maintenance summand emerges as the “dominant” factor in
equations (5) and (8). As a result, the homogenous design cannot
be recommended for distributed applications that anticipate
strong locality in their communication patterns.
Bearing in mind the special role played by superpeers, Fig.
5 (c) and (d) plot the relative traffic experienced by the highest
loaded peer (HLP) in each design. In (c), we vary the degree of
intracluster communication γ. In (d), we vary the average node
lifetime (γ = 0.5) to investigate the impact of churn, that is, the
continuous process of node arrivals and departures. Recall that
to maintain a stable population, we set the node arrival rate λa
to N/l. Consequently, if l is of the order of a few seconds, then
an important fraction of nodes is required to join the network
each second. The motivation behind these figures stems from
the conviction that the routing traffic imposed on weak
superpeers can limit the performance of a hierarchical system as
a whole. In fact, we compare the traffic imposed on a superpeer
against the traffic at a homogenous peer. As a rule of thumb, (c)
shows that the traffic experienced by a superpeer is 600 times
greater than that suffered by a homogenous peer when γ = 0
(reflecting a situation of strong “anti-locality”), which provides
a practical finding: depending on the communication patterns,
superpeer systems, contrary to what one might a priori
conjecture, do not always allow a system to achieve its optimal
performance.
In (d) we show that the positive results for the homogenous
design in the preceding evaluation no longer hold. In fact, (d)
confirms the well-known fact that if the average node lifetime l
is low (in our case, when l < 1 hour), the cost of maintaining a
network of transient peers does not compensate for the greater
routing efficacy. Notice that this result is rather pessimistic; a
more realistic model would suppose that superpeers join and
leave the network dynamically, which requires dropping the
assumption that superpeers stay forever.
In summary, while very appealing from the point of view of
resilience and scalability, superpeer designs (like that in [7])
are not always the best alternative: they can potentially create
large imbalances in the load imposed on nodes. Similarly, as
our analysis demonstrates, homogenous designs cannot be
considered the best solution either: they perform poorly if the
nodes participating in the system are too transient. Therefore,
given the lack of a “universal” design, we believe that the cost
model we propose can be very useful in helping an architect in
the arduous task of choosing the appropriate design for a given
workload.
VI. CONCLUSIONS
In this paper, we propose a novel analytical cost framework
aimed at evaluating hierarchical DHTs. Using this framework,
we compare the two main families of hierarchical DHTs: the
homogenous design, in which all nodes play equal roles, against
the superpeer design, in which a small subset of peers (i.e., the
most powerful and stable) behave as proxies, interconnecting
clusters with highly dynamic membership. More specifically,
our results demonstrate that no design is “universally” better,
and to that effect, we believe that our cost-based model can be
very useful in identifying the advantages and disadvantages of
a design against multiple alternatives for a specific problem.
In summary, while traditional superpeer systems have been
motivated from the point of view of resilience and scalability,
in the context of peer-to-peer systems, they do not always offer
the best alternative. For example, they potentially create large
imbalances in the traffic imposed on the different nodes when
intercluster communication is high. Although our model does
not consider churn in the superlayer, note that dynamic
superpeer maintenance protocols may impose excessive
overhead.
In contrast, homogenous designs offer a great deal of load
balancing, which derives from their inherent symmetry. Hence,
we believe that when one seeks simplicity, and assuming high
anti-locality in communication, heterogeneous node capacities
lose preponderance.
ACKNOWLEDGMENT
This work is supported in part by European Commission
under the FP6-2006-IST-034241 POPEYE project.
REFERENCES
[1] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan,
“Chord: A Scalable Peer-to-Peer Lookup Service for Internet
Applications,” ACM SIGCOMM’01 Conference, 2001.
[2] M. J. Freedman and D. Mazières. “Sloppy Hashing and Self-Organizing
Clusters”. In Proc. 2nd Intl. Workshop on Peer-to-Peer Systems (IPTPS
'03), Berkeley, CA, February 2003.
[3] P. Ganesan, K. Gummadi, and H. Garcia-Molina, “Canon in G Major:
Designing DHTs with Hierarchical Structure,” In Proc. International
Conference on Distributed Computing Systems (ICDCS 2004), 2004.
[4] M. S. Artigas, P. G. López, J. P. Ahulló, and A. F. Skarmeta, “Cyclone:
A novel design schema for hierarchical DHTs,” In Proc. of the Fifth IEEE
International Conference on Peer-to-Peer Computing (P2P’05), 2005.
[5] R. Tian, Y. Xiong, Q. Zhang, B. Li, B. Y. Zhao, and X. Li, “Hybrid
overlay structure based on random walks,” In Proc. of the 4th Intl.
Workshop on Peer-to-Peer Systems (IPTPS’05), Feb. 2005.
[6] Z. Xu, R. Min, and Y. Hu, “HIERAS: A DHT based hierarchical p2p
routing algorithm,” In Proc. of the 2003 Intl. Conf. on Parallel
Processing (ICPP’03), pages 187–194, Oct. 2003.
[7] L. Garcés-Erice, E. Biersack, K. W. Ross, P. A. Felber, and G. Urvoy-
Keller, “Hierarchical P2P Systems,” ACM/IFIP Conference on Parallel
and Distributed Computing (Euro-Par), 2003.
[8] S. Zoels, Z. Despotovic, and W. Kellerer, “Cost-Based Analysis of
Hierarchical DHT Design,” In Proc. of the Sixth IEEE International
Conference on Peer-to-Peer Computing (P2P’06), 2006.
[9] B. Yang and H. Garcia-Molina, “Designing a Super-Peer Network,” In
Proc. of 19th Int'l Conf. on Data Eng., Mar. 2003.
[10] N. Christin and J. Chuang, “A Cost-Based Analysis of Overlay Routing
Geometries,” IEEE INFOCOM’05, 2005.
[11] V. Ramasubramanian, and E. Sirer, “Beehive: The Design and
Implementation of a Next Generation Name Service for the Internet,” In
Proc. of ACM SIGCOMM’04 Conference, 2004.
[12] L. Xiao, Z. Zhuang, and Y. Liu, “Dynamic Layer Management in
Superpeer Architectures”, IEEE Trans. Parallel and Distributed Systems,
vol. 16, no. 11, pp. 1078–1091, Nov. 2005.
[13] S. Saroiu, P. K. Gummadi, and S. D. Gribble, “A measurement study of
peer-to-peer file sharing systems,” In Proc. of Multimedia Computing
and Networking (MCN), 2002.
[14] S. L. Johnson and C. T. Ho, “Optimum broadcasting and personalized
communication in hypercubes,” IEEE Trans. Comput., vol. 38, no. 9, pp.
1249-1268, Sept. 1989.