Paul presentation P2P Chord v1

Chord: A Scalable Peer-to-peer
Lookup Service for Internet
Applications
Paul Yang
楊曜年

What is a P2P system?
• A distributed system architecture:
• No centralized control
• Nodes are symmetric in function
Node
Node
Node Node
Node
Internet

3 layers - from implementation
Distributed hash table
Distributed application
get (key) data
node node node….
put(key, data)
Lookup service
lookup(key) node IP address
• Application may be distributed over many nodes
• DHT distributes data storage over many nodes
(Ivy)
(DHash)
(Chord)

A peer-to-peer storage problem
• 1000 scattered music enthusiasts
• Willing to store and serve replicas
• How do you find the data?

The lookup problem
Internet
N1
N2
N3
N6
N5
N4
Publisher
Key=“title”
Value=MP3 data…
Client
Lookup(“title”)
?

Centralized lookup (Napster)
Publisher@
Client
Lookup(“title”)
N6
N9 N7
DB
N8
N3
N2N1SetLoc(“title”, N4)
Simple, but O(N) state and a single point of failure
Key=“title”
Value=MP3 data…
N4

Flooded queries (Gnutella)
N4
Publisher@
Client
N6
N9
N7
N8
N3
N2N1
Robust, but worst case O(N) messages per lookup
Key=“title”
Value=MP3 data…
Lookup(“title”)

Routed queries (Freenet, Chord,
etc.)
N4Publisher
Client
N6
N9
N7
N8
N3
N2N1
Lookup(“title”)
Key=“title”
Value=MP3 data…

Routing challenges
• Keep the hop count small
• Keep the tables small
• Stay robust despite rapid change
• Chord: emphasizes efficiency and simplicity

Chord properties
• Efficient: O(log(N)) messages per lookup
• Load balance: closed to K/N
• Decentralization
• Scalable: O(log(N)) state per node
• Robust: survives massive failures

Chord overview
• Provides peer-to-peer hash lookup:
• Lookup(key) → IP address
• Chord does not store the data
• How does Chord route lookups?
• How does Chord maintain routing tables?

Chord IDs
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address & Port)
• Both are uniformly distributed
• Both exist in the same ID space
• For terribly distributed hash (collision) –
universal hash
• How to map key IDs to node IDs?

Simple lookup algorithm
Lookup(my-id, key-id) //if k=7, MyID=2, MyS = 8
n = my successor
if my-id < n < key-id
call Lookup(id) on node n // next hop
else
return my successor // done
• Correctness depends only on successors

6
1
2
6
0
4
26
5
1
3
7
2
identifier
circle
identifier
node
X key
Consistent Hashing - Successor
Nodes – Take O (N) hop
successor(1) = 1
successor(2) = 3successor(6) = 0

Scalable Key Location
• To accelerate lookups, Chord maintains
additional routing information.
• This additional information is not
essential for correctness, which is
achieved as long as each node knows
its correct successor.

Scalable Key Location –
Finger Tables
• Each node n’ maintains a routing table with up
to m entries (which is in fact the number of bits
in identifiers), called finger table.
• The ith
entry in the table at node n contains the
identity of the first node s that succeeds n by at
least 2
i-1
on the identifier circle.
• s = successor(n+2i-1
).
• s is called the ith
finger of node n, denoted by
n.finger(i)

Finger Tables
0
4
26
5
1
3
7
1
2
4
1
3
0
finger table
start succ.
keys
1
2
3
5
3
3
0
finger table
start succ.
keys
2
4
5
7
0
0
0
finger table
start succ.
keys
6
0+20
0+21
0+22
For.
1+20
1+21
1+22
For.
3+20
3+21
3+22
For.
Finer[k] = (n + 2k-1) mod 2m

Finger i points to successor of n+2i
N80
½¼
1/8
1/16
1/32
1/64
1/128
112
N120
M = 7 -> 128

Lookups take O(log(N)) hops
N32
N10
N5
N20
N110
N99
N80
N60
Lookup(K19)
K19

Lookup with fingers
Lookup(my-id, key-id)
look in local finger table for
highest node n s.t. my-id < n < key-id
if n exists
else
return my successor // done

Node Joins and Stabilizations
• The most important thing is the successor
pointer.
• If the successor pointer is ensured to be up to
date, which is sufficient to guarantee
correctness of lookups, then finger table can
always be verified.
• Each node runs a “stabilization” protocol
periodically in the background to update
successor pointer and finger table.

Node Joins and Stabilizations
• “Stabilization” protocol contains 6
functions:
• create() //create a network
• join()
• stabilize()
• notify()
• fix_fingers()
• check_predecessor()

Node Joins – join()
• When node n first starts, it calls
n.join(n’), where n’ is any known Chord
node.
• The join() function asks n’ to find the
immediate successor of n.
• join() does not make the rest of the
network aware of n.

Node Joins – join()
// create a new Chord ring.
n.create()
predecessor = nil;
successor = n;
// join a Chord ring containing node n’.
n.join(n’)
predecessor = nil;
successor = n’.find_successor(n);

find_successor()
• Pseudo code:
// ask node n to find the successor of id
// id = 36, n’ = 25 , successor=40
n.find_successor(id)
if (id ∈ (n, successor])
return successor;
else
n’ = closest_preceding_node(id);
return n’.find_successor(id);
// search the local table for the highest predecessor of id
n.closest_preceding_node(id)
for i = m downto 1
if (finger[i] ∈ (n, id))
return finger[i];
return n;

Joining: linked list insert
N36
N40
N25
1. Lookup(36)
K30
K38

Join (2)
N36
N40
N25
2. N36 sets its own
successor pointer
K30
K38

Join (3)
N36
N40
N25
3. Copy keys 26..36
from N40 to N36
K30
K38
K30

Join (4)
N36
N40
N25
4. Set N25’s successor
pointer
Update finger pointers in the background
Correct successors produce correct lookups
K30
K38
K30

Node Joins – stabilize()
• Each time node n runs stabilize(), it
asks its successor for the it’s
predecessor p, and decides whether p
should be n’s successor instead.
• stabilize() notifies node n’s successor of
n’s existence, giving the successor the
chance to change its predecessor to n.
• The successor does this only if it knows
of no closer predecessor than n.

Node Joins – stabilize()
// called periodically. verifies n’s immediate
// successor, and tells the successor about n.
// n=30, p=36, n’s successor = 40
n.stabilize()
x = successor.predecessor;
if (x ∈ (n, successor))
successor = x;
successor.notify(n);
// n’ thinks it might be our predecessor.
n.notify(n’)
if (predecessor is nil or n’ ∈ (predecessor, n))
predecessor = n’;

Node Joins – Join and Stabilization
np
succ(np)=ns
ns
n
pred(ns)=np  n joins
 predecessor = nil
 n acquires ns as successor via some n’
 n runs stabilize
 n notifies ns being the new predecessor
 ns acquires n as its predecessor
 np runs stabilize
 np asks ns for its predecessor (now n)
 np acquires n as its successor
 np notifies n
 n will acquire np as its predecessor
 all predecessor and successor
pointers are now correct
 fingers still need to be fixed, but old
fingers will still work
nil
pred(ns)=n
succ(np)=n

Node Joins – fix_fingers()
• Each node periodically calls fix fingers
to make sure its finger table entries are
correct.
• It is how new nodes initialize their finger
tables
• It is how existing nodes incorporate new
nodes into their finger tables.

Node Joins – fix_fingers()
// called periodically. refreshes finger table entries
//next = 1
n.fix_fingers()
next = next + 1 ;
if (next > m)
next = 1 ;
finger[next] = find_successor(n + 2next-1
);
// checks whether predecessor has failed.
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;

fix_fingers()
Node 6 Join Node 3 leave

Failures might cause incorrect
lookup
N120
N113
N102
N80
N85
N80 doesn’t know correct successor, so incorrect lookup
N10
Lookup(90)

Solution: successor lists
• Each node knows r immediate successors
• After failure, will know first live successor
• Correct successors guarantee correct lookups
• Guarantee is with some probability

Successor Lists Ensure Robust
Lookup
N32
N10
N5
N20
N110
N99
N80
N60
• Each node remembers r successors
• Lookup can skip over dead nodes to find blocks
N40
10, 20, 32
20, 32, 40
32, 40, 60
40, 60, 80
60, 80, 99
80, 99, 110
99, 110, 5
110, 5, 10
5, 10, 20

Lookup with fault tolerance
Lookup(my-id, key-id)
look in local finger table and successor-list
for highest node n s.t. my-id < n < key-id
if n exists
if call failed,
remove n from finger table
return Lookup(my-id, key-id)
else return my successor // done

Experimental overview
• Variation in load balance
• Quick lookup in large systems
• Low variation in lookup costs
• Robust despite massive failure
Experiments confirm theoretical results

Variation in load balance
The mean and 1st and 99th percentiles of the number of
keys stored per node in a 10x4 node network

Variation in load balance
The probability density function (PDF) of the number of keys
per node. The total number of keys is 5 x 10 square 5.

Virtual Node in Consistent
Hash
Hash(“202.168.14.241”);
Hash(“202.168.14.241#1”); // cache A1
Hash(“202.168.14.241#2”); // cache A2

Result when virtual node used
r virtual node, r = 1, 2, 5, 10, 20
99th
: 4.8x to 1.6x ; 1st
: 0 to 0.5x

Chord lookup cost is O(log N)
Number of Nodes
AverageMessagesperLookup
Constant is ½
Actually ½ log(N) due to finger table

Failure experimental setup
• Start 10000 node and 1000000 keys
• Successor list has 20 entries
• Insert 1,000 key/value pairs
• Five replicas of each
• Immediately perform 1,000 lookups

Massive failures have little impact
0
0.2
0.4
0.6
0.8
1
1.2
1.4
5 10 15 20 25 30 35 40 45 50
FailedLookups(Percent)
Failed Nodes (Percent)
(1/2)6
is 1.6%

Conclusion
• Efficient location of the node that stores a
desired data item is a fundamental problem in
P2P networks
• Chord protocol solves it in a efficient
decentralized manner
• Routing information: O(log N) nodes
• Lookup: O(log N) nodes
• Update: O(log2
N) messages
• It also adapts dynamically to the topology
changes introduced during the run

Improvement
Metadata Layer
Distributing Index
Finger Table
Original
Chord
Improved
Chord
Query in metadata
Put resource into
Support more search beside
Keyword
Index differs due to different
Hash function
More dynamically
Create the routing table
Nope
Produce index
By SHA1
Fixed size, can
not resist churn
Hashing Function
SHA-2 improve
collision ， Use Pearson to
speed up
SHA-1

Join: lazy finger update is OK
N36
N40
N25
N2
K30
N2 finger should now point to N36, not N40
Lookup(K30) visits only nodes < 30, will undershoot

CFS: a peer-to-peer storage system
• Inspired by Napster, Gnutella, Freenet
• Separates publishing from serving
• Uses spare disk space, net capacity
• Avoids centralized mechanisms
• Delete this slide?
• Mention “distributed hash lookup”

CFS architecture
move later?
Block storage
Availability / replication
Authentication
Caching
Consistency
Server selection
Keyword search
Lookup
Dhash distributed
block store
Chord
• Powerful lookup simplifies other mechanisms

Consistent hashing [Karger 97]
N32
N90
N105
K80
K20
K5
Circular 7-bit
ID space
Key 5
Node 105
A key is stored at its successor: node with next higher ID

Basic lookup
N32
N90
N105
N60
N10
N120
K80
“Where is key 80?”
“N90 has K80”

“Finger table” allows log(N)-time lookups
N80
½¼
1/8
1/16
1/32
1/64
1/128

Finger i points to successor of n+2i
N80
½¼
1/8
1/16
1/32
1/64
1/128
112
N120

Dynamic Operations and Failures
Need to deal with:
• Node Joins and Stabilization
• Impact of Node Joins on Lookups
• Failure and Replication
• Voluntary Node Departures

Node Joins and Stabilization
• Node’s successor pointer should be up
to date
• For correctly executing lookups
• Each node periodically runs a
“Stabilization” Protocol
• Updates finger tables and successor
pointers

Node Joins and Stabilization
• Contains 6 functions:
• create()
• join()
• stabilize()
• notify()
• fix_fingers()
• check_predecessor()

Create()
• Creates a new Chord ring
n.create()
predecessor = nil;
successor = n;

Join()
• Asks m to find the immediate successor
of n.
• Doesn’t make rest of the network aware
of n.
n.join(m)
predecessor = nil;
successor = m.find_successor(n);

Stabilize()
• Called periodically to learn about new nodes
• Asks n’s immediate successor about successor’s predecessor p
• Checks whether p should be n’s successor instead
• Also notifies n’s successor about n’s existence, so that
successor may change its predecessor to n, if necessary
n.stabilize()
x = successor.predecessor;
if (x ∈ (n, successor))
successor = x;
successor.notify(n);

Notify()
• m thinks it might be n’s predecessor
n.notify(m)
if (predecessor is nil or m ∈ (predecessor, n))
predecessor = m;

Fix_fingers()
• Periodically called to make sure that finger table entries
are correct
• New nodes initialize their finger tables
• Existing nodes incorporate new nodes into their finger tables
n.fix_fingers()
next = next + 1 ;
if (next > m)
next = 1 ;
finger[next] = find_successor(n + 2next-1
);

Check_predecessor()
• Periodically called to check whether
predecessor has failed
• If yes, it clears the predecessor pointer,
which can then be modified by notify()
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;

Theorem 3
• If any sequence of join operations is
executed interleaved with stabilizations,
then at some time after the last join the
successor pointers will form a cycle on
all nodes in the network

Stabilization Protocol
• Guarantees to add nodes in a fashion
to preserve reach ability
• By itself won’t correct a Chord system
that has split into multiple disjoint
cycles, or a single cycle that loops
multiple times around the identifier
space

Impact of Node Joins on
Lookups
• Correctness
• If finger table entries are reasonably
current
• Lookup finds the correct successor in O(log N)
steps
• If successor pointers are correct but finger
tables are incorrect
• Correct lookup but slower
• If incorrect successor pointers
• Lookup may fail

Impact of Node Joins on Lookups
• Performance
• If stabilization is complete
• Lookup can be done in O(log N) time
• If stabilization is not complete
• Existing nodes finger tables may not reflect the new
nodes
– Doesn’t significantly affect lookup speed
• Newly joined nodes can affect the lookup speed, if the
new nodes ID’s are in between target and target’s
predecessor
– Lookup will have to be forwarded through the intervening
nodes, one at a time

Theorem 4
• If we take a stable network with N
nodes with correct finger pointers, and
another set of up to N nodes joins the
network, and all successor pointers (but
perhaps not all finger pointers) are
correct, then lookups will still take O(log
N) time with high probability

Failure and Replication
• Correctness of the protocol relies on the
fact of knowing correct successor
• To improve robustness
• Each node maintains a successor list of ‘r’
nodes
• This can be handled using modified
version of stabilize procedure
• Also helps higher-layer software to
replicate data

Theorem 5
• If we use successor list of length r =
O(log N) in a network that is initially
stable, and then every node fails with
probability ½, then with high probability
find_successor returns the closest living
successor to the query key

Theorem 6
• In a network that is initially stable, if
every node fails with probability ½, then
the expected time to execute
find_successor is O(log N)

Voluntary Node Departures
• Can be treated as node failures
• Two possible enhancements
• Leaving node may transfers all its keys to
its successor
• Leaving node may notify its predecessor
and successor about each other so that
they can update their links

The promise of P2P computing
• High capacity through parallelism:
• Many disks
• Many network connections
• Many CPUs
• Reliability:
• Many replicas
• Geographic distribution
• Automatic configuration
• Useful in public and proprietary settings

A DHT has a good interface
• Put(key, value) and get(key) → value
• Call a key/value pair a “block”
• API supports a wide range of applications
• DHT imposes no structure/meaning on keys
• Key/value pairs are persistent and global
• Can store keys in other DHT values
• And thus build complex data structures

A DHT makes a good shared
infrastructure
• Many applications can share one DHT
service
• Much as applications share the Internet
• Eases deployment of new applications
• Pools resources from many participants
• Efficient due to statistical multiplexing
• Fault-tolerant due to geographic distribution

Many recent DHT-based projects
• File sharing [CFS, OceanStore, PAST, …]
• Web cache [Squirrel, ..]
• Backup store [Pastiche]
• Censor-resistant stores [Eternity, FreeNet,..]
• DB query and indexing [Hellerstein, …]
• Event notification [Scribe]
• Naming systems [ChordDNS, Twine, ..]
• Communication primitives [I3, …]
Common thread: data is location-independent

Related Work
• CAN (Ratnasamy, Francis, Handley, Karp,
Shenker)
• Pastry (Rowstron, Druschel)
• Tapestry (Zhao, Kubiatowicz, Joseph)
• Chord emphasizes simplicity

Chord Summary
• Chord provides peer-to-peer hash lookup
• Efficient: O(log(n)) messages per lookup
• Robust as nodes fail and join
• Good primitive for peer-to-peer systems
http://www.pdos.lcs.mit.edu/chord

find_successor()
// ask node n to find the successor of id
// id = 36, n’ = 25 , successor=40
n.find_successor(id)
n’ = n.find_predecessor(id)
return n’.succsor;
n.find_predecessor(id)
n’ = n;
While (id NOT∈ (n’, n’.successor])
n’ = n’.closest_preceding_finger(id)
return n’;
n.closest_preceding_finger(id)
for i = m downto 1
if (finger[i] ∈ (n, id))
return finger[i];
return n;

Choosing the successor list length
• Assume 1/2 of nodes fail
• P(successor list all dead) = (1/2)r
• I.e. P(this node breaks the Chord ring)
• Depends on independent failure

Improvement
Metadata Layer
( 資訊層 )
Distributing Index
Finger Table
原始 Chord 改良 Chord
將嘗試利用 metadata 以對資
源的描述更有彈性，並支援複
雜的查詢。將節點依照本身所
擁有的資源加入到適當的資源
層中，查詢時只需要在適當
因為雜湊函數的不同，
所以此索引亦會不同
可嚐試改用雙向路由法、鄰居
路由法，或是根據網路攪動程
度，動態調整路由表
無
由上數雜湊函數
所得到的檔案搜
尋索引
fixed ，對網路攪
動的抵抗缺乏強
健性
Hashing Function
( 雜湊函數 )
可改用 SHA-2 以增進加密功能
，
或是用 Pearson 等以加快運算
速度
SHA-1

Paul presentation P2P Chord v1

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Paul presentation P2P Chord v1

Similar to Paul presentation P2P Chord v1 (20)

More from Paul Yang

More from Paul Yang (20)

Recently uploaded

Recently uploaded (20)

Paul presentation P2P Chord v1

Editor's Notes