Chord: A Scalable Peer-to-peer
Lookup Service for Internet
Applications
Paul Yang
楊曜年
What is a P2P system?
• A distributed system architecture:
• No centralized control
• Nodes are symmetric in function
[Diagram: several symmetric nodes connected to each other over the Internet]
3 layers - from the implementation
Distributed application (Ivy): calls put(key, data) and get(key) → data
Distributed hash table (DHash): calls lookup(key) → node IP address
Lookup service (Chord): runs on many nodes (node, node, node, …)
• Application may be distributed over many nodes
• DHT distributes data storage over many nodes
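To make the three-layer split concrete, here is a minimal Python sketch. It is not the actual Ivy/DHash/Chord code; the class names, node addresses, and the simple mod-N placement rule are all invented for illustration. A toy lookup service maps a key to a node address, and a put/get DHT layer is built on top of it.

import hashlib

class ToyLookupService:
    """Maps a key to the responsible node (toy rule: hash mod number of nodes)."""
    def __init__(self, nodes):
        self.nodes = nodes                          # e.g. list of node "IP addresses"

    def lookup(self, key):
        key_id = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[key_id % len(self.nodes)]  # node responsible for this key

class ToyDHT:
    """put/get layered on the lookup service; storage is a dict per node."""
    def __init__(self, lookup_service):
        self.lookup_service = lookup_service
        self.storage = {n: {} for n in lookup_service.nodes}

    def put(self, key, data):
        node = self.lookup_service.lookup(key)
        self.storage[node][key] = data

    def get(self, key):
        node = self.lookup_service.lookup(key)
        return self.storage[node].get(key)

dht = ToyDHT(ToyLookupService(["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
dht.put("title", b"MP3 data...")
print(dht.get("title"))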
A peer-to-peer storage problem
• 1000 scattered music enthusiasts
• Willing to store and serve replicas
• How do you find the data?
The lookup problem
[Diagram: nodes N1–N6 scattered across the Internet; a publisher stores Key=“title”, Value=MP3 data… at some node, and a client asks Lookup(“title”); which node has the data?]
Centralized lookup (Napster)
[Diagram: nodes N1–N9 plus a central DB; the publisher at N4 holds Key=“title”, Value=MP3 data… and registers it with SetLoc(“title”, N4) at the DB; the client asks the DB Lookup(“title”) and is pointed to N4]
Simple, but O(N) state and a single point of failure
Flooded queries (Gnutella)
[Diagram: the client floods Lookup(“title”) to its neighbors, who forward it across nodes N1–N9 until it reaches the publisher at N4 holding Key=“title”, Value=MP3 data…]
Robust, but worst case O(N) messages per lookup
Routed queries (Freenet, Chord, etc.)
[Diagram: the client’s Lookup(“title”) is routed hop by hop through nodes N1–N9 toward the publisher at N4 holding Key=“title”, Value=MP3 data…]
Routing challenges
• Keep the hop count small
• Keep the tables small
• Stay robust despite rapid change
• Chord: emphasizes efficiency and simplicity
Chord properties
• Efficient: O(log(N)) messages per lookup
• Load balance: each node holds close to K/N keys
• Decentralization
• Scalable: O(log(N)) state per node
• Robust: survives massive failures
Chord overview
• Provides peer-to-peer hash lookup:
• Lookup(key) → IP address
• Chord does not store the data
• How does Chord route lookups?
• How does Chord maintain routing tables?
Chord IDs
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address & Port)
• Both are uniformly distributed
• Both exist in the same ID space
• If some key set is terribly distributed by the hash function (heavy collisions), a universal hash function can be chosen instead
• How to map key IDs to node IDs?
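As a small illustration of the ID scheme, the sketch below derives key and node identifiers from SHA-1 and truncates both to the same m-bit space. The value m = 7 and the address string format are assumptions for the example, not parameters taken from the paper.

import hashlib

M = 7  # number of bits in an identifier; 2**M = 128 positions on the ring

def chord_id(text, m=M):
    digest = hashlib.sha1(text.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

key_id = chord_id("title")                 # key identifier = SHA-1(key)
node_id = chord_id("192.0.2.10:4000")      # node identifier = SHA-1(IP address & port)
print(key_id, node_id)                     # both land in the same 0..127 ID space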
Simple lookup algorithm
Lookup(my-id, key-id) //if k=7, MyID=2, MyS = 8
n = my successor
if my-id < n < key-id
call Lookup(id) on node n // next hop
else
return my successor // done
• Correctness depends only on successors
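A runnable sketch of this successor-walk, assuming the tiny 3-bit ring with nodes 0, 1, 3 shown in the example circle that follows. The in_interval helper implements the circular (a, b] test; the function names are illustrative only.

M = 3
RING = 2 ** M

def in_interval(x, a, b):
    """True if x lies in the half-open circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b              # interval wraps around 0

def simple_lookup(start, key_id, successor):
    n = start
    while not in_interval(key_id, n, successor[n]):
        n = successor[n]                # next hop: O(N) hops in the worst case
    return successor[n]                 # node responsible for key_id

successor = {0: 1, 1: 3, 3: 0}          # nodes 0, 1, 3 on the 3-bit ring
print(simple_lookup(0, 6, successor))   # key 6 -> node 0
print(simple_lookup(0, 2, successor))   # key 2 -> node 3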
Consistent Hashing - Successor Nodes – takes O(N) hops
[Diagram: a 3-bit identifier circle 0–7 with nodes 0, 1, 3 and keys 1, 2, 6]
successor(1) = 1, successor(2) = 3, successor(6) = 0
Scalable Key Location
• To accelerate lookups, Chord maintains
additional routing information.
• This additional information is not
essential for correctness, which is
achieved as long as each node knows
its correct successor.
Scalable Key Location – Finger Tables
• Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
• The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle.
• s = successor(n + 2^(i-1)).
• s is called the i-th finger of node n, denoted by n.finger(i)
Scalable Key Location – Finger Tables
[Example: the 3-bit identifier circle with nodes 0, 1, 3 and keys 1, 2, 6]
Node 0: finger starts 0+2^0, 0+2^1, 0+2^2 = 1, 2, 4 → successors 1, 3, 0; stores key 6
Node 1: finger starts 1+2^0, 1+2^1, 1+2^2 = 2, 3, 5 → successors 3, 3, 0; stores key 1
Node 3: finger starts 3+2^0, 3+2^1, 3+2^2 = 4, 5, 7 → successors 0, 0, 0; stores key 2
finger[k].start = (n + 2^(k-1)) mod 2^m
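The finger-table rule above can be checked with a short sketch. The helper names are assumptions; the node set is the 3-bit example from this slide, and the output reproduces the three tables listed above.

M = 3
NODES = [0, 1, 3]                           # live node ids on the ring

def successor(ident):
    """First live node whose id >= ident, wrapping around the circle."""
    for n in sorted(NODES):
        if n >= ident % (2 ** M):
            return n
    return min(NODES)

def finger_table(n):
    table = []
    for k in range(1, M + 1):
        start = (n + 2 ** (k - 1)) % (2 ** M)
        table.append((start, successor(start)))
    return table

for n in NODES:
    print(n, finger_table(n))
# 0 [(1, 1), (2, 3), (4, 0)]
# 1 [(2, 3), (3, 3), (5, 0)]
# 3 [(4, 0), (5, 0), (7, 0)]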
Finger i points to successor of n + 2^i
[Diagram: node N80 on a 7-bit ring (m = 7, 2^7 = 128 identifiers); its fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the circle; the finger at 80 + 32 = 112 points to N120]
Lookups take O(log(N)) hops
[Diagram: ring with nodes N5, N10, N20, N32, N60, N80, N99, N110; Lookup(K19) is forwarded along finger pointers, roughly halving the remaining distance each hop, until it reaches K19’s successor N20]
Lookup with fingers
Lookup(my-id, key-id)
  look in local finger table for
    highest node n s.t. my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n   // next hop
  else
    return my successor             // done
Node Joins and Stabilizations
• The most important thing is the successor
pointer.
• If the successor pointer is kept up to date, which is sufficient to guarantee
correctness of lookups, then the finger tables can
always be verified and repaired.
• Each node runs a “stabilization” protocol
periodically in the background to update
successor pointer and finger table.
Node Joins and Stabilizations
• “Stabilization” protocol contains 6
functions:
• create() //create a network
• join()
• stabilize()
• notify()
• fix_fingers()
• check_predecessor()
Node Joins – join()
• When node n first starts, it calls
n.join(n’), where n’ is any known Chord
node.
• The join() function asks n’ to find the
immediate successor of n.
• join() does not make the rest of the
network aware of n.
Node Joins – join()
// create a new Chord ring.
n.create()
predecessor = nil;
successor = n;
// join a Chord ring containing node n’.
n.join(n’)
predecessor = nil;
successor = n’.find_successor(n);
Scalable Key Location –
find_successor()
• Pseudo code:
// ask node n to find the successor of id
// id = 36, n’ = 25 , successor=40
n.find_successor(id)
if (id ∈ (n, successor])
return successor;
else
n’ = closest_preceding_node(id);
return n’.find_successor(id);
// search the local table for the highest predecessor of id
n.closest_preceding_node(id)
for i = m downto 1
if (finger[i] ∈ (n, id))
return finger[i];
return n;
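Below is a hedged Python transcription of this pseudocode, illustrative only: the Node class, its field names, and the between() helper for circular intervals are assumptions, and the sketch presumes a correctly formed ring so the recursion terminates.

M = 7

def between(x, a, b, inclusive_right=False):
    """Circular interval test: x in (a, b) or (a, b] on a 2**M ring."""
    if a == b:
        return x != a or inclusive_right
    if a < b:
        return (a < x < b) or (inclusive_right and x == b)
    return (x > a or x < b) or (inclusive_right and x == b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.finger = [self] * M                # finger[i] ~ successor(id + 2**i)

    def find_successor(self, key_id):
        # id in (n, successor] -> the successor is responsible
        if between(key_id, self.id, self.successor.id, inclusive_right=True):
            return self.successor
        # otherwise forward to the closest preceding finger
        return self.closest_preceding_node(key_id).find_successor(key_id)

    def closest_preceding_node(self, key_id):
        for f in reversed(self.finger):         # scan fingers from farthest to nearest
            if between(f.id, self.id, key_id):
                return f
        return self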
Joining: linked list insert
[Diagram: N25 → N40, with N40 storing K30 and K38; new node N36 joins]
1. N36 does Lookup(36) to find its successor
Join (2)
2. N36 sets its own successor pointer to N40
Join (3)
3. Copy keys 26..36 (here K30) from N40 to N36
Join (4)
4. Set N25’s successor pointer to N36
Update finger pointers in the background
Correct successors produce correct lookups
Node Joins – stabilize()
• Each time node n runs stabilize(), it
asks its successor for that node’s
predecessor p, and decides whether p
should be n’s successor instead.
• stabilize() notifies node n’s successor of
n’s existence, giving the successor the
chance to change its predecessor to n.
• The successor does this only if it knows
of no closer predecessor than n.
Node Joins – stabilize()
// called periodically. verifies n’s immediate
// successor, and tells the successor about n.
// n=30, p=36, n’s successor = 40
n.stabilize()
x = successor.predecessor;
if (x ∈ (n, successor))
successor = x;
successor.notify(n);
// n’ thinks it might be our predecessor.
n.notify(n’)
if (predecessor is nil or n’ ∈ (predecessor, n))
predecessor = n’;
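A small runnable sketch of stabilize()/notify() on a toy two-node ring. The node ids 26 and 40 are arbitrary, and the Node class and between() helper are illustrative, not the paper's code.

M = 7

def between(x, a, b):
    """x strictly inside the circular interval (a, b)."""
    if a == b:
        return x != a
    if a < b:
        return a < x < b
    return x > a or x < b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x                  # a closer successor exists
        self.successor.notify(self)

    def notify(self, candidate):
        if self.predecessor is None or between(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

# N26 joins a one-node ring {N40}; after a few stabilize rounds the two
# nodes point at each other.
n40, n26 = Node(40), Node(26)
n26.successor = n40                             # join(): learned via find_successor
n26.stabilize(); n40.stabilize(); n26.stabilize()
print(n40.successor.id, n40.predecessor.id, n26.successor.id, n26.predecessor.id)
# prints: 26 26 40 40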
Node Joins – Join and Stabilization
[Diagram: existing nodes np and ns, with succ(np) = ns and pred(ns) = np; node n joins between them]
 n joins: predecessor = nil; n acquires ns as its successor via some n’
 n runs stabilize(): n notifies ns that it may be the new predecessor; ns acquires n as its predecessor
 np runs stabilize(): np asks ns for its predecessor (now n); np acquires n as its successor; np notifies n; n acquires np as its predecessor
 all predecessor and successor pointers are now correct
 fingers still need to be fixed, but old fingers will still work
Node Joins – fix_fingers()
• Each node periodically calls fix fingers
to make sure its finger table entries are
correct.
• It is how new nodes initialize their finger
tables
• It is how existing nodes incorporate new
nodes into their finger tables.
Node Joins – fix_fingers()
// called periodically. refreshes finger table entries
//next = 1
n.fix_fingers()
next = next + 1 ;
if (next > m)
next = 1 ;
finger[next] = find_successor(n + 2^(next-1));
// checks whether predecessor has failed.
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;
fix_fingers()
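The round-robin nature of fix_fingers() can be seen in a tiny sketch. Here find_successor is stubbed out and simply returns the target id, so the printed values are the finger targets themselves; everything else is illustrative.

M = 7

class Node:
    def __init__(self, ident):
        self.id = ident
        self.next = 0
        self.finger = [None] * (M + 1)          # 1-based finger indices

    def find_successor(self, ident):
        return ident % (2 ** M)                 # stub standing in for a real lookup

    def fix_fingers(self):
        self.next += 1                          # refresh one entry per call
        if self.next > M:
            self.next = 1
        self.finger[self.next] = self.find_successor(self.id + 2 ** (self.next - 1))

n = Node(80)
for _ in range(M):
    n.fix_fingers()
print(n.finger[1:])     # targets 81, 82, 84, 88, 96, 112, 16 (mod 128)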
Node 6 joins, Node 3 leaves
Failures might cause incorrect lookup
[Diagram: N10 issues Lookup(90); the nodes near key 90 are N80, N85, N102, N113, N120, and some of N80’s immediate successors have failed]
N80 doesn’t know its correct successor, so the lookup is incorrect
Solution: successor lists
• Each node knows r immediate successors
• After failure, will know first live successor
• Correct successors guarantee correct lookups
• Guarantee is with some probability
Successor Lists Ensure Robust Lookup
• Each node remembers r successors
• Lookup can skip over dead nodes to find blocks
[Diagram: ring with r = 3 successor lists:
N5: 10, 20, 32   N10: 20, 32, 40   N20: 32, 40, 60
N32: 40, 60, 80  N40: 60, 80, 99   N60: 80, 99, 110
N80: 99, 110, 5  N99: 110, 5, 10   N110: 5, 10, 20]
Lookup with fault tolerance
Lookup(my-id, key-id)
look in local finger table and successor-list
for highest node n s.t. my-id < n < key-id
if n exists
call Lookup(id) on node n // next hop
if call failed,
remove n from finger table
return Lookup(my-id, key-id)
else return my successor // done
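A rough sketch of the fallback idea only, not the full recursive lookup: pick the highest live finger candidate, pruning dead entries as the pseudocode above does, and fall back to the successor list when every finger candidate is dead. The node ids and the DEAD set are made up for illustration.

DEAD = {85, 102, 113}                           # hypothetical failed node ids

def alive(node_id):
    return node_id not in DEAD

def next_hop(fingers, successor_list):
    """Pick the best live candidate; prune dead fingers along the way."""
    for c in sorted(fingers, reverse=True):     # highest finger first
        if alive(c):
            return c
        fingers.remove(c)                       # remove n from finger table
    for s in successor_list:                    # skip over dead successors
        if alive(s):
            return s
    raise RuntimeError("no live successor known")

fingers = [85, 102, 113]
successors = [85, 102, 113, 120]
print(next_hop(fingers, successors))            # falls through to 120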
Experimental overview
• Variation in load balance
• Quick lookup in large systems
• Low variation in lookup costs
• Robust despite massive failure
Experiments confirm theoretical results
Variation in load balance
The mean and 1st and 99th percentiles of the number of
keys stored per node in a 10^4-node network
Variation in load balance
The probability density function (PDF) of the number of keys
per node. The total number of keys is 5 × 10^5.
Virtual Node in Consistent Hashing
Hash(“202.168.14.241”);
Hash(“202.168.14.241#1”); // cache A1
Hash(“202.168.14.241#2”); // cache A2
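Here is a sketch of consistent hashing with virtual nodes, following the naming pattern on this slide. The ConsistentHash class, the 32-bit truncation, and the second IP address are assumptions for illustration.

import bisect
import hashlib

def h(text, m=32):
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** m)

class ConsistentHash:
    def __init__(self, nodes, replicas=2):
        # each real node is hashed several times ("#1", "#2", ...) onto the ring
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(1, replicas + 1))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # successor virtual node on the ring owns the key
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ch = ConsistentHash(["202.168.14.241", "202.168.14.242"], replicas=2)
print(ch.node_for("object1"), ch.node_for("object2"))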
Result when virtual nodes are used
r virtual nodes per real node, r = 1, 2, 5, 10, 20
99th percentile: from 4.8x down to 1.6x; 1st percentile: from 0 up to 0.5x
Chord lookup cost is O(log N)
[Plot: average messages per lookup vs. number of nodes]
The constant is ½: actually about ½ log(N) hops, due to the finger table
Failure experimental setup
• Start 10,000 nodes and 1,000,000 keys
• Successor list has 20 entries
• Insert 1,000 key/value pairs
• Five replicas of each
• Immediately perform 1,000 lookups
Massive failures have little impact
[Plot: failed lookups (percent) vs. failed nodes (percent); failed lookups rise roughly linearly to about 1.4% as the fraction of failed nodes grows from 5% to 50%]
(1/2)^6 is 1.6%
Conclusion
• Efficient location of the node that stores a
desired data item is a fundamental problem in
P2P networks
• The Chord protocol solves it in an efficient,
decentralized manner
• Routing information: O(log N) nodes
• Lookup: O(log N) nodes
• Update: O(log^2 N) messages
• It also adapts dynamically to the topology
changes introduced during the run
Backup
Improvement (Original Chord vs. Improved Chord)
• Metadata Layer: original has none; improved puts resources into a metadata layer and queries in the metadata, supporting more kinds of search besides keyword
• Distributing Index: original produces the index by SHA-1; the improved index differs because a different hash function is used
• Finger Table: original is fixed size and cannot resist churn; improved creates the routing table more dynamically
• Hashing Function: original uses SHA-1; improved uses SHA-2 to improve collision resistance, or Pearson hashing to speed up
Join: lazy finger update is OK
[Diagram: N36 has just joined between N25 and N40 and now holds K30; N2’s finger still points to N40]
N2’s finger should now point to N36, not N40
Lookup(K30) visits only nodes < 30, so it will undershoot
CFS: a peer-to-peer storage system
• Inspired by Napster, Gnutella, Freenet
• Separates publishing from serving
• Uses spare disk space, net capacity
• Avoids centralized mechanisms
• Delete this slide?
• Mention “distributed hash lookup”
CFS architecture (move later?)
[Diagram: a stack of concerns (block storage, availability/replication, authentication, caching, consistency, server selection, keyword search, lookup) layered over the DHash distributed block store and Chord]
• Powerful lookup simplifies other mechanisms
Consistent hashing [Karger 97]
[Diagram: circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80]
A key is stored at its successor: the node with the next higher ID
Basic lookup
[Diagram: the query “Where is key 80?” is passed from node to node around the ring (N10, N32, N60, N90, N105, N120) until it reaches K80’s successor; the answer “N90 has K80” is returned]
“Finger table” allows log(N)-time lookups
[Diagram: node N80’s fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the ring]
Finger i points to successor of n + 2^i
[Diagram: same ring; N80’s finger at 80 + 32 = 112 points to N120]
Dynamic Operations and Failures
Need to deal with:
• Node Joins and Stabilization
• Impact of Node Joins on Lookups
• Failure and Replication
• Voluntary Node Departures
Node Joins and Stabilization
• Node’s successor pointer should be up
to date
• For correctly executing lookups
• Each node periodically runs a
“Stabilization” Protocol
• Updates finger tables and successor
pointers
Node Joins and Stabilization
• Contains 6 functions:
• create()
• join()
• stabilize()
• notify()
• fix_fingers()
• check_predecessor()
Create()
• Creates a new Chord ring
n.create()
predecessor = nil;
successor = n;
Join()
• Asks m to find the immediate successor
of n.
• Doesn’t make rest of the network aware
of n.
n.join(m)
predecessor = nil;
successor = m.find_successor(n);
Stabilize()
• Called periodically to learn about new nodes
• Asks n’s immediate successor about successor’s predecessor p
• Checks whether p should be n’s successor instead
• Also notifies n’s successor about n’s existence, so that
successor may change its predecessor to n, if necessary
n.stabilize()
x = successor.predecessor;
if (x ∈ (n, successor))
successor = x;
successor.notify(n);
Notify()
• m thinks it might be n’s predecessor
n.notify(m)
if (predecessor is nil or m ∈ (predecessor, n))
predecessor = m;
Fix_fingers()
• Periodically called to make sure that finger table entries
are correct
• New nodes initialize their finger tables
• Existing nodes incorporate new nodes into their finger tables
n.fix_fingers()
next = next + 1 ;
if (next > m)
next = 1 ;
finger[next] = find_successor(n + 2^(next-1));
Check_predecessor()
• Periodically called to check whether
predecessor has failed
• If yes, it clears the predecessor pointer,
which can then be modified by notify()
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;
Theorem 3
• If any sequence of join operations is
executed interleaved with stabilizations,
then at some time after the last join the
successor pointers will form a cycle on
all nodes in the network
Stabilization Protocol
• Guarantees to add nodes in a fashion
that preserves reachability
• By itself won’t correct a Chord system
that has split into multiple disjoint
cycles, or a single cycle that loops
multiple times around the identifier
space
Impact of Node Joins on
Lookups
• Correctness
• If finger table entries are reasonably
current
• Lookup finds the correct successor in O(log N)
steps
• If successor pointers are correct but finger
tables are incorrect
• Correct lookup but slower
• If incorrect successor pointers
• Lookup may fail
Impact of Node Joins on Lookups
• Performance
• If stabilization is complete
• Lookup can be done in O(log N) time
• If stabilization is not complete
• Existing nodes’ finger tables may not reflect the new nodes
– Doesn’t significantly affect lookup speed
• Newly joined nodes can affect the lookup speed if the new nodes’ IDs fall between the target and the target’s predecessor
– The lookup will have to be forwarded through the intervening nodes, one at a time
Theorem 4
• If we take a stable network with N
nodes with correct finger pointers, and
another set of up to N nodes joins the
network, and all successor pointers (but
perhaps not all finger pointers) are
correct, then lookups will still take O(log
N) time with high probability
Failure and Replication
• Correctness of the protocol relies on each
node knowing its correct successor
• To improve robustness
• Each node maintains a successor list of ‘r’
nodes
• This can be handled using modified
version of stabilize procedure
• Also helps higher-layer software to
replicate data
Theorem 5
• If we use successor list of length r =
O(log N) in a network that is initially
stable, and then every node fails with
probability ½, then with high probability
find_successor returns the closest living
successor to the query key
Theorem 6
• In a network that is initially stable, if
every node fails with probability ½, then
the expected time to execute
find_successor is O(log N)
Voluntary Node Departures
• Can be treated as node failures
• Two possible enhancements
• A leaving node may transfer all its keys to
its successor
• Leaving node may notify its predecessor
and successor about each other so that
they can update their links
The promise of P2P computing
• High capacity through parallelism:
• Many disks
• Many network connections
• Many CPUs
• Reliability:
• Many replicas
• Geographic distribution
• Automatic configuration
• Useful in public and proprietary settings
A DHT has a good interface
• Put(key, value) and get(key) → value
• Call a key/value pair a “block”
• API supports a wide range of applications
• DHT imposes no structure/meaning on keys
• Key/value pairs are persistent and global
• Can store keys in other DHT values
• And thus build complex data structures
A DHT makes a good shared
infrastructure
• Many applications can share one DHT
service
• Much as applications share the Internet
• Eases deployment of new applications
• Pools resources from many participants
• Efficient due to statistical multiplexing
• Fault-tolerant due to geographic distribution
Many recent DHT-based projects
• File sharing [CFS, OceanStore, PAST, …]
• Web cache [Squirrel, ..]
• Backup store [Pastiche]
• Censor-resistant stores [Eternity, FreeNet,..]
• DB query and indexing [Hellerstein, …]
• Event notification [Scribe]
• Naming systems [ChordDNS, Twine, ..]
• Communication primitives [I3, …]
Common thread: data is location-independent
Related Work
• CAN (Ratnasamy, Francis, Handley, Karp,
Shenker)
• Pastry (Rowstron, Druschel)
• Tapestry (Zhao, Kubiatowicz, Joseph)
• Chord emphasizes simplicity
Chord Summary
• Chord provides peer-to-peer hash lookup
• Efficient: O(log(n)) messages per lookup
• Robust as nodes fail and join
• Good primitive for peer-to-peer systems
http://www.pdos.lcs.mit.edu/chord
Scalable Key Location –
find_successor()
// ask node n to find the successor of id
// id = 36, n’ = 25 , successor=40
n.find_successor(id)
n’ = n.find_predecessor(id)
return n’.successor;
n.find_predecessor(id)
n’ = n;
while (id ∉ (n’, n’.successor])
n’ = n’.closest_preceding_finger(id)
return n’;
n.closest_preceding_finger(id)
for i = m downto 1
if (finger[i] ∈ (n, id))
return finger[i];
return n;
Choosing the successor list length
• Assume 1/2 of nodes fail
• P(successor list all dead) = (1/2)^r
• I.e. P(this node breaks the Chord ring)
• Depends on independent failure
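A quick numeric check of this bound, plain arithmetic under the stated assumption of independent failures with probability ½:

for r in (3, 6, 20):
    print(r, 0.5 ** r)      # r = 6 gives about 1.6%, as on the earlier failure slide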
Improvement (Original Chord vs. Improved Chord)
• Metadata Layer: original has none; the improved design tries to use metadata to describe resources more flexibly and to support complex queries, placing each node into the appropriate resource layer according to the resources it holds, so a query only needs to search the appropriate layer
• Distributing Index: original uses the file-search index obtained from the hash function above; in the improved design the index differs because a different hash function is used
• Finger Table: original is fixed and lacks robustness against network churn; the improved design can try bidirectional routing, neighbor-based routing, or adjusting the routing table dynamically according to the degree of churn
• Hashing Function: original uses SHA-1; the improved design can switch to SHA-2 for stronger cryptographic strength, or use Pearson hashing or similar to speed up computation
Editor's Notes
  2. Publisher – Put(value, key); Client – Get(key). Put(key, value) and get(key) → value. Call a key/value pair a “block”. The API supports a wide range of applications; the DHT imposes no structure/meaning on keys. Key/value pairs are persistent and global. Can store keys in other DHT values, and thus build complex data structures.
  3. 1000s of nodes. Set of nodes may change…
  4. O(N) state means it’s hard to keep the state up to date.
  5. Challenge: can we make it robust? Small state? Actually find stuff in a changing system? Consistent rendezvous point, between publisher and client.
  6. Load balance: distributed hash function, spreading keys evenly over nodes. Decentralization: Chord is fully distributed, no node is more important than any other, which improves robustness. Scalability: logarithmic growth of lookup costs with the number of nodes in the network, so even very large systems are feasible. Availability: Chord automatically adjusts its internal tables to ensure that the node responsible for a key can always be found.
  7. Linear probing, also called linear open addressing: represent the hash table as a one-dimensional array; if the array size is size, each element’s address is in 0 .. size-1. When a collision occurs at address i, probe the next position ((i+1) % size) linearly and insert there if it is empty, otherwise continue to the next linear position; when no empty position can be found, the table is full. SHA-1, SHA-2; other hash function constructions: folding, mid-square, division. The probability distribution is then over random choices of keys and nodes, and says that such a random choice is unlikely to produce an unbalanced distribution. A similar model is applied to analyze standard hashing. Standard hash functions distribute data well when the set of keys being hashed is random. When keys are not random, such a result cannot be guaranteed; indeed, for any hash function, there exists some key set that is terribly distributed by the hash function (e.g., the set of keys that all map to a single hash bucket). In practice, such potential bad sets are considered unlikely to arise. Techniques have also been developed [3] to introduce randomness in the hash function; given any set of keys, we can choose a hash function at random so that the keys are well distributed with high probability over the choice of hash function.
  8. Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.
  9. Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …
  10. Maybe note that fingers point to the first relevant node.
  11. Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.
  12. No problem until lookup gets to a node which knows of no node < key. There’s a replica of K90 at N113, but we can’t find it.
  13. All r successors have to fail before we have a problem. List ensures we find actual current successor.
  14. Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.
  15. If X is a random variable, its cumulative distribution function is defined as usual; if X is continuous, f(x) is called its probability density function (probability density function, pdf).
  16. The number of keys per node exhibits large variations that increase linearly with the number of keys. For example, in all cases some nodes store no keys. To clarify this, Figure 8(b) plots the probability density function (PDF) of the number of keys per node when there are 5 × 10^5 keys stored in the network. The maximum number of keys stored by any node in this case is 457.
  17. One reason for these variations is that node identifiers do not uniformly cover the entire identifier space. If we divide the identifier space in N equal-sized bins, where N is the number of nodes, then we might hope to see one node in each bin. But in fact, the probability that a particular bin does not contain any node is (1 - 1/N)^N; for large values of N this is around 0.368. As we discussed earlier, the consistent hashing paper solves this problem by associating keys with virtual nodes, and mapping multiple virtual nodes (with unrelated identifiers) to each real node. Intuitively, this will provide a more uniform coverage of the identifier space: for example, if we allocate log N randomly chosen virtual nodes to each real node, with high probability each of the bins is covered. We note that this does not affect the worst-case query path length, which remains O(log N). Virtual nodes: another metric for a hash algorithm is balance, meaning the hash results should be spread over all buckets as evenly as possible so that all of the buffer space is used. Hashing does not guarantee absolute balance; if there are only a few caches, objects may not be mapped evenly onto them. For example, with only cache A and cache C deployed, of 4 objects cache A stores only object1 while cache C stores object2, object3 and object4, a very uneven distribution. To solve this, consistent hashing introduces “virtual nodes”: a virtual node is a replica of an actual node in the hash space; one actual node corresponds to several virtual nodes (their number is the “replica count”), and virtual nodes are arranged in the hash space by their hash values. Taking the case with only cache A and cache C again, introduce virtual nodes with a replica count of 2, so there are 4 virtual nodes in total: cache A1 and cache A2 represent cache A, and cache C1 and cache C2 represent cache C. In mathematics, the probability density function of a continuous random variable (or simply the density function) describes the likelihood that the variable takes a value near a given point; the probability that the variable falls within a region is the integral of the density function over that region. When the density function exists, the cumulative distribution function is its integral. It is usually abbreviated “pdf” (Probability Density Function). The probability density function is sometimes also called the probability distribution function, which can be confused with the cumulative distribution function or the probability mass function.
  18. 4. Virtual nodes: another criterion for a hash algorithm is balance, defined as follows. Balance means that the hash results should be spread across all the buffers as evenly as possible, so that all the buffer space gets used. A hash algorithm cannot guarantee absolute balance: with only a few caches, objects may not be mapped to the caches uniformly. In the example above, with only cache A and cache C deployed, of the 4 objects cache A stores only object1 while cache C stores object2, object3, and object4; the distribution is very uneven. To address this, consistent hashing introduces the concept of "virtual nodes", defined as follows: a "virtual node" is a replica of an actual node in the hash space; one actual node corresponds to several virtual nodes, the number of which is called the "replica count", and the virtual nodes are arranged in the hash space by their hash values. Continuing with the case where only cache A and cache C are deployed, Figure 4 showed that the cache distribution is uneven. We now introduce virtual nodes and set the replica count to 2, which means there are 4 virtual nodes in total: cache A1 and cache A2 represent cache A, and cache C1 and cache C2 represent cache C; assuming a fairly ideal arrangement, see Figure 6. The object-to-virtual-node mapping then becomes: object1 -> cache A2; object2 -> cache A1; object3 -> cache C1; object4 -> cache C2. Objects object1 and object2 therefore map to cache A, while object3 and object4 map to cache C, so balance is greatly improved. With virtual nodes, the mapping changes from {object -> node} to {object -> virtual node}; the mapping used when looking up which cache holds an object is shown in Figure 7. The hash of a virtual node can be computed from the corresponding node's IP address plus a numeric suffix. For example, suppose cache A's IP address is 202.168.14.241. Before introducing virtual nodes, cache A's hash is computed as Hash("202.168.14.241"); after introducing virtual nodes, the hashes of the virtual nodes cache A1 and cache A2 are computed as Hash("202.168.14.241#1") for cache A1 and Hash("202.168.14.241#2") for cache A2.
  19. One reason for these variations is that node identifiers do not uniformly cover the entire identifier space. If we divide the identifier space into N equal-sized bins, where N is the number of nodes, we might hope to see one node in each bin. But in fact, the probability that a particular bin does not contain any node is (1 - 1/N)^N; for large values of N this is around 0.368. As we discussed earlier, the consistent hashing paper solves this problem by associating keys with virtual nodes, and mapping multiple virtual nodes (with unrelated identifiers) to each real node. Intuitively, this provides more uniform coverage of the identifier space. For example, if we allocate log N randomly chosen virtual nodes to each real node, then with high probability each of the N bins contains O(log N) nodes. We note that this does not affect the worst-case query path length, which remains O(log N). The tradeoff is that routing-table space usage increases, since each actual node now needs r times as much space to store the finger tables for its virtual nodes. However, we believe this increase can be easily accommodated in practice. For example, assuming a network with N = 10^6 nodes and r = log N, each node has to maintain a table with only r × log N ≈ 400 entries.
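  A small, illustrative Python simulation of the virtual-node idea in these notes (the node count, key count, and replica count r are arbitrary choices; the "#suffix" naming follows the IP-plus-suffix scheme described above):

  import bisect
  import hashlib
  from collections import Counter

  def h(s):
      # 160-bit SHA-1 identifier, as in Chord / consistent hashing
      return int(hashlib.sha1(s.encode()).hexdigest(), 16)

  def keys_per_node(keys, id_to_node):
      # Each key is stored at its successor on the ring; count keys per physical node.
      ring = sorted(id_to_node)
      counts = Counter()
      for k in keys:
          i = bisect.bisect_left(ring, h(k)) % len(ring)   # wrap around the circle
          counts[id_to_node[ring[i]]] += 1
      return counts

  nodes = ["10.0.0.%d" % i for i in range(50)]
  keys = ["key-%d" % i for i in range(10000)]

  # Without virtual nodes: one identifier per physical node.
  plain = keys_per_node(keys, {h(n): n for n in nodes})

  # With virtual nodes: r identifiers per physical node, e.g. Hash("10.0.0.7#3").
  r = 20
  virtual = keys_per_node(keys, {h("%s#%d" % (n, j)): n for n in nodes for j in range(r)})

  print("max keys on one node, without virtual nodes:", max(plain.values()))
  print("max keys on one node, with virtual nodes:   ", max(virtual.values()))

  With more identifiers per physical node, the per-node key counts concentrate around the mean, which is the balance improvement these notes describe.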
  20. Actually ½ log(N). Error bars: one std dev.
  21. *before* stabilization starts. All lookup failures attributable to loss of all 6 replicas.
  22. The figure above mainly shows how this work differs from the original Chord developed at MIT, and which improvements are intended to speed up file lookup.
  23. Say it maps IDs to data? I.e., not keyword search.
  24. IDs live in a single circular space. Consistent hashing is designed to let nodes enter and leave the network with minimal disruption. To maintain the consistent hashing mapping when a node n joins the network, certain keys previously assigned to n's successor now become assigned to n. When node n leaves the network, all of its assigned keys are reassigned to n's successor. No other changes in assignment of keys to nodes need occur. In the example above, if a node were to join with identifier 26, it would capture the key with identifier 24 from the node with identifier 32. An adversary can select a badly distributed set of keys for that hash function. In our application, an adversary can generate a large set of keys and insert into the Chord ring only those keys that map to a particular node, thus creating a badly distributed set of keys. As with standard hashing, however, we expect that a non-adversarial set of keys can be analyzed as if it were random. Using this assumption, we state many of our results below as "high probability" results.
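  A toy Python sketch of the key hand-off described in this note, using the 26/32 example literally (the dict-of-sets ring representation is made up for illustration; real Chord transfers the stored data itself between nodes):

  def in_half_open(x, a, b):
      # True if x lies in the circular interval (a, b].
      return a < x <= b if a < b else (x > a or x <= b)

  def join(ring, new_id):
      # ring: dict mapping node id -> set of key ids stored at that node.
      ids = sorted(ring)
      succ = next((n for n in ids if n >= new_id), ids[0])            # new node's successor
      pred = next((n for n in reversed(ids) if n < new_id), ids[-1])  # new node's predecessor
      moved = {k for k in ring[succ] if in_half_open(k, pred, new_id)}
      ring[succ] -= moved
      ring[new_id] = moved          # only keys in (pred, new_id] change hands
      return moved

  # Node 26 joins a ring where node 32 currently stores key 24.
  ring = {21: set(), 32: {24}, 38: {33}}
  print(join(ring, 26))    # -> {24}: node 26 captures key 24 from node 32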
  25. Just need to make progress, and not overshoot. Will talk about initialization later. And robustness. Now, how about speed?
  26. Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …
  27. Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …
  28. Just the right lookup for peer-to-peer storage systems. NATs? Mogul. What if most nodes are flaky? Details of noticing and reacting to failures? How to evaluate with a huge number of nodes?
  29. The figure above mainly shows how this work differs from the original Chord developed at MIT, and which improvements are intended to speed up file lookup.