Paul presentation P2P Chord v1

This presentation introduces the basic concepts of the MIT P2P Chord algorithm. The material was collected from the internet and reference books.

Speaker notes
  • The promise of P2P computing: high capacity through parallelism (many disks, many network connections, many CPUs); reliability (many replicas, geographic distribution); automatic configuration; useful in public and proprietary settings.
  • Publisher: Put(key, value). Client: Get(key). Put(key, value) and get(key) → value. Call a key/value pair a “block”. The API supports a wide range of applications; the DHT imposes no structure or meaning on keys. Key/value pairs are persistent and global. Keys can be stored in other DHT values, and thus complex data structures can be built.
  • 1000s of nodes. Set of nodes may change…
  • O(N) state means it’s hard to keep the state up to date.
  • Challenge: can we make it robust? Small state? Actually find stuff in a changing system? Consistent rendezvous point, between publisher and client.
  • Load balance: distributed hash function, spreading keys evenly over nodes. Decentralization: Chord is fully distributed, no node is more important than any other, which improves robustness. Scalability: logarithmic growth of lookup cost with the number of nodes in the network, so even very large systems are feasible. Availability: Chord automatically adjusts its internal tables to ensure that the node responsible for a key can always be found.
  • Linear probing, also called linear open addressing: represent the hash table as a one-dimensional array; if the array size is size, each element's address is 0 to size-1. When a collision occurs at address i, probe linearly for the next position ((i+1) % size); if an empty slot is found, insert there, otherwise continue to the next linear position. If no empty slot can be found, the table is full (a minimal sketch follows below). SHA-1, SHA-2; classic hash functions: folding, mid-square, division. The probability distribution is then over random choices of keys and nodes, and says that such a random choice is unlikely to produce an unbalanced distribution. A similar model is applied to analyze standard hashing. Standard hash functions distribute data well when the set of keys being hashed is random. When keys are not random, such a result cannot be guaranteed; indeed, for any hash function, there exists some key set that is terribly distributed by the hash function (e.g., the set of keys that all map to a single hash bucket). In practice, such potential bad sets are considered unlikely to arise. Techniques have also been developed [3] to introduce randomness into the hash function; given any set of keys, we can choose a hash function at random so that the keys are well distributed with high probability over the choice of hash function.
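    A minimal sketch of the linear probing scheme described in this note; the table size and the use of Python's built-in hash() are illustrative choices, not part of the original slides:

        # Linear probing / open addressing: probe (i+1) % size on collision.
        def lp_insert(table, key, value):
            size = len(table)
            i = hash(key) % size
            for _ in range(size):                  # at most `size` probes
                if table[i] is None or table[i][0] == key:
                    table[i] = (key, value)
                    return True                    # inserted (or updated in place)
                i = (i + 1) % size                 # move to the next linear position
            return False                           # every slot is occupied: table is full

        table = [None] * 8
        lp_insert(table, "K10", "block-10")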
  • Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.
  • Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …
  • Maybe note that fingers point to the first relevant node.
  • Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.
  • No problem until lookup gets to a node which knows of no node < key. There’s a replica of K90 at N113, but we can’t find it.
  • All r successors have to fail before we have a problem. List ensures we find actual current successor.
  • Always undershoots to predecessor. So never misses the real successor. Lookup procedure isn’t inherently log(n). But finger table causes it to be.
  • If X is a random variable, F(x) denotes its cumulative distribution function; when X is continuous, f(x) = dF/dx is called its probability density function (pdf).
  • The number of keys per node exhibits large variations that increase linearly with the number of keys. For example, in all cases some nodes store no keys. To clarify this, Figure 8(b) plots the probability density function (PDF) of the number of keys per node when there are 5 × 10^5 keys stored in the network. The maximum number of keys stored by any node in this case is 457.
  • One reason for these variations is that node identifiers do not uniformly cover the entire identifier space. If we divide the identifier space into N equal-sized bins, where N is the number of nodes, then we might hope to see one node in each bin. But in fact, the probability that a particular bin contains no node is (1 − 1/N)^N, which for large N is around 0.368. As we discussed earlier, the consistent hashing paper solves this problem by associating keys with virtual nodes, and mapping multiple virtual nodes (with unrelated identifiers) to each real node. Intuitively, this provides a more uniform coverage of the identifier space. For example, if we allocate log N randomly chosen virtual nodes to each real node, then with high probability each of the N bins contains O(log N) nodes. We note that this does not affect the worst-case query path length, which remains O(log N). 4. Virtual nodes. Another metric for evaluating a hash algorithm is balance, defined as follows: balance means the hash results are spread as evenly as possible across all caches, so that all cache space is used. A hash algorithm cannot guarantee absolute balance; when there are few caches, objects cannot be mapped to them evenly. For example, with only cache A and cache C deployed, of the four objects cache A stores only object1, while cache C stores object2, object3, and object4; the distribution is very unbalanced. To address this, consistent hashing introduces the notion of a “virtual node”, defined as follows: a virtual node is a replica of a real node in the hash space; one real node corresponds to several virtual nodes (the number of which is called the replica count), and virtual nodes are placed in the hash space by their hash values. Continuing with the example of deploying only cache A and cache C: Figure 4 showed that the cache distribution is uneven. Now we introduce virtual nodes with a replica count of 2, which means there are four virtual nodes in total: cache A1 and cache A2 represent cache A, and cache C1 and cache C2 represent cache C; assuming a fairly ideal arrangement, see Figure 6. In mathematics, the probability density function of a continuous random variable (often simply the density function) describes the likelihood that the variable takes a value near a given point; the probability that the variable falls in a region is the integral of the density function over that region. When the density function exists, the cumulative distribution function is its integral. It is usually abbreviated “pdf” (Probability Density Function). The density is sometimes also called the probability distribution function, but that name can be confused with the cumulative distribution function or the probability mass function.
  • (Virtual nodes, continued from the previous note.) With a replica count of 2, the mapping from objects to virtual nodes becomes: object1 → cache A2; object2 → cache A1; object3 → cache C1; object4 → cache C2. Objects object1 and object2 are therefore mapped to cache A, while object3 and object4 are mapped to cache C; balance is much improved. After introducing virtual nodes, the mapping changes from {object → node} to {object → virtual node}; the mapping used when looking up which cache holds an object is shown in Figure 7. A virtual node's hash can be computed from the corresponding node's IP address plus a numeric suffix. For example, suppose cache A's IP address is 202.168.14.241. Before introducing virtual nodes, compute cache A's hash as Hash(“202.168.14.241”); after introducing virtual nodes, compute the hashes of virtual nodes cache A1 and cache A2 as Hash(“202.168.14.241#1”) // cache A1 and Hash(“202.168.14.241#2”) // cache A2.
  • (See the earlier notes on why node identifiers do not uniformly cover the identifier space and how virtual nodes give more uniform coverage.) The tradeoff is that routing table space usage increases, since each actual node now needs r times as much space to store the finger tables of its virtual nodes. However, we believe this increase can be easily accommodated in practice. For example, assuming a network with N = 10^6 nodes and r = log N virtual nodes per real node, each node has to maintain a table with only about r · log N ≈ 400 entries.
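    A quick numerical check of the roughly 0.368 empty-bin probability mentioned in the notes above; (1 − 1/N)^N tends to 1/e as N grows:

        # Probability that a given bin holds no node when N node IDs fall uniformly into N bins.
        N = 10 ** 6
        print((1 - 1 / N) ** N)        # ~0.3679, i.e. about 36.8% of bins are empty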
  • Actually ½ log(N). Error bars: one std dev.
  • *before* stabilization starts. All lookup failures attributable to loss of all 6 replicas.
  • The figure above mainly shows how this work differs from the original Chord developed at MIT, and which improvements are intended to speed up file lookup.
  • Say maps Ids to data? I.e. not keyword search.
  • Ids live in a single circular space. Consistent hashing is designed to let nodes enter and leave the network with minimal disruption. To maintain the consistent hashing mapping when a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n. When node n leaves the network, all of its assigned keys are reassigned to n’s successor. No other changes in assignment of keys to nodes need occur. In the example above, if a node were to join with identifier 26, it would capture the key with identifier 24 from the node with identifier 32 An adversary can select a badly distributed set of keys for that hash function. In our application, an adversary can generate a large set of keys and insert into the Chord ring only those keys that map to a particular node, thus creating a badly distributed set of keys. As with standard hashing, however, we expect that a non-adversarial set of keys can be analyzed as if it were random. Using this assumption, we state many of our results below as “high probability” results.
  • Just need to make progress, and not overshoot. Will talk about initialization later. And robustness. Now, how about speed?
  • Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …
  • Small tables, but multi-hop lookup. Table entries: IP address and Chord ID. Navigate in ID space, route queries closer to successor. Log(n) tables, log(n) hops. Route to a document between ¼ and ½ …
  • Just the right lookup for peer-to-peer storage systems. NATs? Mogul. What if most nodes are flakey? Details of noticing and reacting to failures? How to eval with huge # of nodes?
  • The figure above mainly shows how this work differs from the original Chord developed at MIT, and which improvements are intended to speed up file lookup.
Slide transcript

    1. 1. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Paul Yang 楊曜年
    2. 2. What is a P2P system? • A distributed system architecture: • No centralized control • Nodes are symmetric in function Node Node Node Node Node Internet
    3. 3. 3 layers, from the implementation: Distributed application (Ivy): calls put(key, data) and get(key) → data. Distributed hash table (DHash): distributes data storage over many nodes. Lookup service (Chord): lookup(key) → node IP address. • Application may be distributed over many nodes • DHT distributes data storage over many nodes
    4. 4. A peer-to-peer storage problem • 1000 scattered music enthusiasts • Willing to store and serve replicas • How do you find the data?
    5. 5. The lookup problem: somewhere on the Internet (nodes N1–N6), a publisher holds Key=“title”, Value=MP3 data…; a client issues Lookup(“title”). How does the query find the data?
    6. 6. Centralized lookup (Napster): the publisher registers SetLoc(“title”, N4) with a central DB; the client asks the DB Lookup(“title”) and fetches Key=“title”, Value=MP3 data… from N4. Simple, but O(N) state and a single point of failure.
    7. 7. Flooded queries (Gnutella): the client floods Lookup(“title”) to its neighbors until it reaches the publisher holding Key=“title”, Value=MP3 data…. Robust, but worst case O(N) messages per lookup.
    8. 8. Routed queries (Freenet, Chord, etc.): Lookup(“title”) is forwarded hop by hop toward the node holding Key=“title”, Value=MP3 data….
    9. 9. Routing challenges • Keep the hop count small • Keep the tables small • Stay robust despite rapid change • Chord: emphasizes efficiency and simplicity
    10. 10. Chord properties • Efficient: O(log(N)) messages per lookup • Load balance: close to K/N keys per node • Decentralization • Scalable: O(log(N)) state per node • Robust: survives massive failures
    11. 11. Chord overview • Provides peer-to-peer hash lookup: • Lookup(key) → IP address • Chord does not store the data • How does Chord route lookups? • How does Chord maintain routing tables?
    12. 12. Chord IDs • Key identifier = SHA-1(key) • Node identifier = SHA-1(IP address & port) • Both are uniformly distributed • Both exist in the same ID space • For a terribly distributed (adversarial, collision-prone) key set, a universal hash can be used • How to map key IDs to node IDs?
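    A minimal sketch (not part of the original slides) of mapping keys and node addresses into the same m-bit identifier space with SHA-1; the small m and the sample strings are arbitrary illustrations, and real Chord uses m = 160:

        import hashlib

        M = 8                                   # tiny ID space for illustration

        def chord_id(text, m=M):
            """Map a key or an 'IP:port' string to an m-bit Chord identifier."""
            digest = hashlib.sha1(text.encode()).digest()
            return int.from_bytes(digest, "big") % (2 ** m)

        print(chord_id("Hey Jude.mp3"))         # key identifier
        print(chord_id("18.26.4.9:8000"))       # node identifier, same ID space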
    13. 13. Simple lookup algorithm Lookup(my-id, key-id) //if k=7, MyID=2, MyS = 8 n = my successor if my-id < n < key-id call Lookup(key-id) on node n // next hop else return my successor // done • Correctness depends only on successors
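    A runnable sketch of this successor-only lookup; the ring below is a hypothetical list of node IDs, and the interval test wraps around zero, which the slide's `<` notation glosses over:

        def in_interval(x, a, b):
            """True if x lies in the ring interval (a, b], wrapping modulo the ID space."""
            if a < b:
                return a < x <= b
            return x > a or x <= b                 # interval wraps past zero

        def simple_lookup(nodes, start, key_id):
            """Walk successor pointers only: O(N) hops, correct as long as successors are correct."""
            ring = sorted(nodes)
            n = start
            while True:
                succ = next((m for m in ring if m > n), ring[0])   # n's successor
                if in_interval(key_id, n, succ):
                    return succ                    # succ is responsible for key_id
                n = succ                           # next hop

        print(simple_lookup([2, 8, 12], start=2, key_id=7))   # -> 8, the slide's k=7, MyID=2, MyS=8 case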
    14. 14. Consistent Hashing – Successor Nodes (lookup by successors takes O(N) hops). Identifier circle with m = 3: identifiers 0–7, nodes 0, 1, 3, and keys 1, 2, 6. successor(1) = 1, successor(2) = 3, successor(6) = 0.
    15. 15. Scalable Key Location • To accelerate lookups, Chord maintains additional routing information. • This additional information is not essential for correctness, which is achieved as long as each node knows its correct successor.
    16. 16. Scalable Key Location – Finger Tables • Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table. • The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle. • s = successor(n + 2^(i-1)). • s is called the ith finger of node n, denoted by n.finger(i)
    17. 17. Scalable Key Location – Finger Tables. finger[k].start = (n + 2^(k-1)) mod 2^m, and finger[k] points to successor(start). Example with m = 3, nodes 0, 1, 3: node 0’s finger starts are 1, 2, 4 with successors 1, 3, 0; node 1’s starts are 2, 3, 5 with successors 3, 3, 0; node 3’s starts are 4, 5, 7 with successors 0, 0, 0. Keys stored: key 6 at node 0, key 1 at node 1, key 2 at node 3.
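    A small sketch that reproduces these finger tables from the formula, assuming the 3-bit ring and nodes 0, 1, 3 from the slide:

        M = 3
        NODES = [0, 1, 3]                       # example ring from the slide

        def successor(ident):
            """First node whose ID is >= ident, wrapping around the 2^M identifier circle."""
            for n in sorted(NODES):
                if n >= ident:
                    return n
            return min(NODES)

        def finger_table(n):
            """Entry k (1-based) starts at (n + 2^(k-1)) mod 2^M and points to successor(start)."""
            table = []
            for k in range(1, M + 1):
                start = (n + 2 ** (k - 1)) % (2 ** M)
                table.append((start, successor(start)))
            return table

        for n in NODES:
            print(n, finger_table(n))
        # 0 [(1, 1), (2, 3), (4, 0)]
        # 1 [(2, 3), (3, 3), (5, 0)]
        # 3 [(4, 0), (5, 0), (7, 0)]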
    18. 18. Finger i points to successor of n + 2^i. Example: N80’s fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring; with m = 7 the ID space has 2^7 = 128 identifiers, and the finger for 80 + 32 = 112 points to N120.
    19. 19. Lookups take O(log(N)) hops. (Figure: ring with N5, N10, N20, N32, N60, N80, N99, N110; a Lookup(K19) query is routed across the ring to N20, the successor of K19.)
    20. 20. Lookup with fingers Lookup(my-id, key-id) look in local finger table for highest node n s.t. my-id < n < key-id if n exists call Lookup(key-id) on node n // next hop else return my successor // done
    21. 21. Node Joins and Stabilizations • The most important thing is the successor pointer. • If the successor pointer is ensured to be up to date, which is sufficient to guarantee correctness of lookups, then the finger table can always be verified. • Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.
    22. 22. Node Joins and Stabilizations • “Stabilization” protocol contains 6 functions: • create() //create a network • join() • stabilize() • notify() • fix_fingers() • check_predecessor()
    23. 23. Node Joins – join() • When node n first starts, it calls n.join(n’), where n’ is any known Chord node. • The join() function asks n’ to find the immediate successor of n. • join() does not make the rest of the network aware of n.
    24. 24. Node Joins – join() // create a new Chord ring. n.create() predecessor = nil; successor = n; // join a Chord ring containing node n’. n.join(n’) predecessor = nil; successor = n’.find_successor(n);
    25. 25. Scalable Key Location – find_successor() • Pseudo code: // ask node n to find the successor of id // id = 36, n’ = 25 , successor=40 n.find_successor(id) if (id ∈ (n, successor]) return successor; else n’ = closest_preceding_node(id); return n’.find_successor(id); // search the local table for the highest predecessor of id n.closest_preceding_node(id) for i = m downto 1 if (finger[i] ∈ (n, id)) return finger[i]; return n;
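    A runnable sketch of find_successor()/closest_preceding_node() on locally constructed nodes; the Node class, the 6-bit ring, and the node IDs below are illustrative stand-ins, and real Chord issues these calls as RPCs between peers:

        M = 6                                    # identifier bits for this toy ring

        def between(x, a, b, incl_right=False):
            """Ring interval test for (a, b) or (a, b], handling wrap-around past 0."""
            if a < b:
                return a < x < b or (incl_right and x == b)
            return x > a or x < b or (incl_right and x == b)

        class Node:
            def __init__(self, ident):
                self.id = ident
                self.successor = self
                self.finger = []

            def find_successor(self, key_id):
                if between(key_id, self.id, self.successor.id, incl_right=True):
                    return self.successor
                return self.closest_preceding_node(key_id).find_successor(key_id)

            def closest_preceding_node(self, key_id):
                for f in reversed(self.finger):          # scan farthest finger first
                    if between(f.id, self.id, key_id):
                        return f
                return self

        ids = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
        nodes = {i: Node(i) for i in ids}
        succ_of = lambda x: next((i for i in sorted(ids) if i >= x % 2 ** M), min(ids))
        for n in nodes.values():
            n.successor = nodes[succ_of(n.id + 1)]
            n.finger = [nodes[succ_of(n.id + 2 ** k)] for k in range(M)]

        print(nodes[8].find_successor(54).id)            # -> 56, the node responsible for key 54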
    26. 26. Joining: linked list insert N36 N40 N25 1. Lookup(36) K30 K38
    27. 27. Join (2) N36 N40 N25 2. N36 sets its own successor pointer K30 K38
    28. 28. Join (3) N36 N40 N25 3. Copy keys 26..36 from N40 to N36 K30 K38 K30
    29. 29. Join (4) N36 N40 N25 4. Set N25’s successor pointer Update finger pointers in the background Correct successors produce correct lookups K30 K38 K30
    30. 30. Node Joins – stabilize() • Each time node n runs stabilize(), it asks its successor for its predecessor p, and decides whether p should be n’s successor instead. • stabilize() notifies node n’s successor of n’s existence, giving the successor the chance to change its predecessor to n. • The successor does this only if it knows of no closer predecessor than n.
    31. 31. Node Joins – stabilize() // called periodically. verifies n’s immediate // successor, and tells the successor about n. // n=30, p=36, n’s successor = 40 n.stabilize() x = successor.predecessor; if (x ∈ (n, successor)) successor = x; successor.notify(n); // n’ thinks it might be our predecessor. n.notify(n’) if (predecessor is nil or n’ ∈ (predecessor, n)) predecessor = n’;
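    A sketch of how join(), stabilize(), and notify() interact, using plain in-memory objects instead of RPCs and the N25/N36/N40 example from the join slides above; everything here is an illustrative simplification:

        def between(x, a, b):
            return a < x < b if a < b else (x > a or x < b)

        class Node:
            def __init__(self, ident):
                self.id, self.successor, self.predecessor = ident, self, None

            def join(self, existing):
                self.predecessor = None
                self.successor = existing          # simplified: assume `existing` is our true successor

            def stabilize(self):
                x = self.successor.predecessor
                if x is not None and between(x.id, self.id, self.successor.id):
                    self.successor = x             # a node slipped in between us and our successor
                self.successor.notify(self)

            def notify(self, n):
                if self.predecessor is None or between(n.id, self.predecessor.id, self.id):
                    self.predecessor = n

        n25, n36, n40 = Node(25), Node(36), Node(40)
        n25.successor, n40.predecessor = n40, n25    # initial two-node link
        n36.join(n40)                                # N36 joins with N40 as its successor
        n36.stabilize()                              # N40 learns N36 is its predecessor
        n25.stabilize()                              # N25 sees N36 and adopts it as successor
        print(n25.successor.id, n40.predecessor.id)  # 36 36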
    32. 32. Node Joins – Join and Stabilization (node n joins between np and its successor ns): n joins with predecessor = nil and acquires ns as its successor via some n’; n runs stabilize and notifies ns that n is its new predecessor, so ns acquires n as its predecessor; np runs stabilize, asks ns for its predecessor (now n), acquires n as its successor, and notifies n; n acquires np as its predecessor. All predecessor and successor pointers are now correct. Fingers still need to be fixed, but old fingers will still work.
    33. 33. Node Joins – fix_fingers() • Each node periodically calls fix_fingers() to make sure its finger table entries are correct. • It is how new nodes initialize their finger tables. • It is how existing nodes incorporate new nodes into their finger tables.
    34. 34. Node Joins – fix_fingers() // called periodically. refreshes finger table entries //next = 1 n.fix_fingers() next = next + 1 ; if (next > m) next = 1 ; finger[next] = find_successor(n + 2^(next-1)); // checks whether predecessor has failed. n.check_predecessor() if (predecessor has failed) predecessor = nil;
    35. 35. fix_fingers(): Node 6 joins, Node 3 leaves
    36. 36. Failures might cause incorrect lookup. (Figure: N10 issues Lookup(90) on a ring containing N80, N85, N102, N113, N120; because of failed nodes, N80 doesn’t know its correct successor, so the lookup is incorrect.)
    37. 37. Solution: successor lists • Each node knows r immediate successors • After failure, will know first live successor • Correct successors guarantee correct lookups • Guarantee is with some probability
    38. 38. Successor Lists Ensure Robust Lookup N32 N10 N5 N20 N110 N99 N80 N60 • Each node remembers r successors • Lookup can skip over dead nodes to find blocks N40 10, 20, 32 20, 32, 40 32, 40, 60 40, 60, 80 60, 80, 99 80, 99, 110 99, 110, 5 110, 5, 10 5, 10, 20
    39. 39. Lookup with fault tolerance Lookup(my-id, key-id) look in local finger table and successor-list for highest node n s.t. my-id < n < key-id if n exists call Lookup(key-id) on node n // next hop if call failed, remove n from finger table return Lookup(my-id, key-id) else return my successor // done
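    A minimal sketch of the retry idea in this pseudocode: pick the highest live candidate in (my-id, key-id) and skip candidates that have failed. The candidate list, the is_alive callback, and ignoring ring wrap-around in max() are simplifying assumptions:

        def between(x, a, b):
            return a < x < b if a < b else (x > a or x < b)

        def next_hop(candidates, my_id, key_id, is_alive):
            """Highest live node from the finger table / successor list that lies in (my_id, key_id)."""
            eligible = [n for n in candidates
                        if is_alive(n) and between(n, my_id, key_id)]
            return max(eligible, default=None)     # None -> fall back to our own successor

        alive = {80: True, 99: False, 110: True}
        print(next_hop([80, 99, 110], my_id=60, key_id=112,
                       is_alive=lambda n: alive[n]))   # -> 110; the dead N99 is skipped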
    40. 40. Experimental overview • Variation in load balance • Quick lookup in large systems • Low variation in lookup costs • Robust despite massive failure Experiments confirm theoretical results
    41. 41. Variation in load balance The mean and 1st and 99th percentiles of the number of keys stored per node in a 10^4-node network
    42. 42. Variation in load balance The probability density function (PDF) of the number of keys per node. The total number of keys is 5 × 10^5.
    43. 43. Virtual Node in Consistent Hash Hash(“202.168.14.241”); Hash(“202.168.14.241#1”); // cache A1 Hash(“202.168.14.241#2”); // cache A2
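    A sketch of consistent hashing with virtual nodes in the spirit of this slide: each real cache is hashed several times with a numeric suffix, and an object belongs to the first virtual node clockwise from its hash. The second IP address and the 32-bit ring size are invented for the example:

        import bisect, hashlib

        def h(s):
            return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")   # 32-bit ring

        def build_ring(caches, replicas=2):
            """Place `replicas` virtual nodes per real cache, e.g. Hash("202.168.14.241#1")."""
            return sorted((h(f"{c}#{i}"), c) for c in caches for i in range(1, replicas + 1))

        def lookup(ring, obj):
            """The first virtual node clockwise from the object's hash owns it (wrapping to the start)."""
            points = [p for p, _ in ring]
            idx = bisect.bisect_right(points, h(obj)) % len(ring)
            return ring[idx][1]

        ring = build_ring(["202.168.14.241", "202.168.14.243"], replicas=2)
        for obj in ["object1", "object2", "object3", "object4"]:
            print(obj, "->", lookup(ring, obj))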
    44. 44. Result when virtual nodes are used r virtual nodes per real node, r = 1, 2, 5, 10, 20; the 99th percentile of keys per node drops from 4.8x to 1.6x the mean, and the 1st percentile rises from 0 to 0.5x
    45. 45. Chord lookup cost is O(log N) (Plot: average messages per lookup vs. number of nodes.) The constant is ½: actually ½ log(N), due to the finger table
    46. 46. Failure experimental setup • Start 10,000 nodes and 1,000,000 keys • Successor list has 20 entries • Insert 1,000 key/value pairs • Five replicas of each • Immediately perform 1,000 lookups
    47. 47. Massive failures have little impact (Plot: failed lookups (percent) vs. failed nodes (percent), 5–50%.) (1/2)^6 is 1.6%
    48. 48. Conclusion • Efficient location of the node that stores a desired data item is a fundamental problem in P2P networks • The Chord protocol solves it in an efficient, decentralized manner • Routing information: O(log N) nodes • Lookup: O(log N) nodes • Update: O(log^2 N) messages • It also adapts dynamically to topology changes introduced during the run
    49. 49. Backup
    50. 50. Improvement (Original Chord vs. Improved Chord) • Metadata Layer: none → query in metadata, put resources into the appropriate layer, support more kinds of search besides keyword • Distributing Index: index produced by SHA-1 → index differs due to the different hash function • Finger Table: fixed size, cannot resist churn → create the routing table more dynamically • Hashing Function: SHA-1 → SHA-2 to improve collision resistance, or Pearson hashing to speed things up
    51. 51. Join: lazy finger update is OK N36 N40 N25 N2 K30 N2 finger should now point to N36, not N40 Lookup(K30) visits only nodes < 30, will undershoot
    52. 52. CFS: a peer-to-peer storage system • Inspired by Napster, Gnutella, Freenet • Separates publishing from serving • Uses spare disk space, net capacity • Avoids centralized mechanisms • Delete this slide? • Mention “distributed hash lookup”
    53. 53. CFS architecture move later? Block storage Availability / replication Authentication Caching Consistency Server selection Keyword search Lookup Dhash distributed block store Chord • Powerful lookup simplifies other mechanisms
    54. 54. Consistent hashing [Karger 97] N32 N90 N105 K80 K20 K5 Circular 7-bit ID space Key 5 Node 105 A key is stored at its successor: node with next higher ID
    55. 55. Basic lookup N32 N90 N105 N60 N10 N120 K80 “Where is key 80?” “N90 has K80”
    56. 56. “Finger table” allows log(N)-time lookups. (Figure: N80’s fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the ring.)
    57. 57. Finger i points to successor of n + 2^i. (Figure: the finger for 80 + 32 = 112 points to N120.)
    58. 58. Dynamic Operations and Failures Need to deal with: • Node Joins and Stabilization • Impact of Node Joins on Lookups • Failure and Replication • Voluntary Node Departures
    59. 59. Node Joins and Stabilization • Node’s successor pointer should be up to date • For correctly executing lookups • Each node periodically runs a “Stabilization” Protocol • Updates finger tables and successor pointers
    60. 60. Node Joins and Stabilization • Contains 6 functions: • create() • join() • stabilize() • notify() • fix_fingers() • check_predecessor()
    61. 61. Create() • Creates a new Chord ring n.create() predecessor = nil; successor = n;
    62. 62. Join() • Asks m to find the immediate successor of n. • Doesn’t make rest of the network aware of n. n.join(m) predecessor = nil; successor = m.find_successor(n);
    63. 63. Stabilize() • Called periodically to learn about new nodes • Asks n’s immediate successor about successor’s predecessor p • Checks whether p should be n’s successor instead • Also notifies n’s successor about n’s existence, so that successor may change its predecessor to n, if necessary n.stabilize() x = successor.predecessor; if (x ∈ (n, successor)) successor = x; successor.notify(n);
    64. 64. Notify() • m thinks it might be n’s predecessor n.notify(m) if (predecessor is nil or m ∈ (predecessor, n)) predecessor = m;
    65. 65. Fix_fingers() • Periodically called to make sure that finger table entries are correct • New nodes initialize their finger tables • Existing nodes incorporate new nodes into their finger tables n.fix_fingers() next = next + 1 ; if (next > m) next = 1 ; finger[next] = find_successor(n + 2^(next-1));
    66. 66. Check_predecessor() • Periodically called to check whether predecessor has failed • If yes, it clears the predecessor pointer, which can then be modified by notify() n.check_predecessor() if (predecessor has failed) predecessor = nil;
    67. 67. Theorem 3 • If any sequence of join operations is executed interleaved with stabilizations, then at some time after the last join the successor pointers will form a cycle on all nodes in the network
    68. 68. Stabilization Protocol • Guarantees to add nodes in a fashion that preserves reachability • By itself won’t correct a Chord system that has split into multiple disjoint cycles, or a single cycle that loops multiple times around the identifier space
    69. 69. Impact of Node Joins on Lookups • Correctness • If finger table entries are reasonably current • Lookup finds the correct successor in O(log N) steps • If successor pointers are correct but finger tables are incorrect • Correct lookup but slower • If incorrect successor pointers • Lookup may fail
    70. 70. Impact of Node Joins on Lookups • Performance • If stabilization is complete • Lookup can be done in O(log N) time • If stabilization is not complete • Existing nodes’ finger tables may not reflect the new nodes – Doesn’t significantly affect lookup speed • Newly joined nodes can affect the lookup speed, if the new nodes’ IDs are in between the target and the target’s predecessor – The lookup will have to be forwarded through the intervening nodes, one at a time
    71. 71. Theorem 4 • If we take a stable network with N nodes with correct finger pointers, and another set of up to N nodes joins the network, and all successor pointers (but perhaps not all finger pointers) are correct, then lookups will still take O(log N) time with high probability
    72. 72. Failure and Replication • Correctness of the protocol relies on each node knowing its correct successor • To improve robustness • Each node maintains a successor list of ‘r’ nodes • This can be handled using a modified version of the stabilize procedure • Also helps higher-layer software to replicate data
    73. 73. Theorem 5 • If we use successor list of length r = O(log N) in a network that is initially stable, and then every node fails with probability ½, then with high probability find_successor returns the closest living successor to the query key
    74. 74. Theorem 6 • In a network that is initially stable, if every node fails with probability ½, then the expected time to execute find_successor is O(log N)
    75. 75. Voluntary Node Departures • Can be treated as node failures • Two possible enhancements • The leaving node may transfer all its keys to its successor • The leaving node may notify its predecessor and successor about each other so that they can update their links
    76. 76. The promise of P2P computing • High capacity through parallelism: • Many disks • Many network connections • Many CPUs • Reliability: • Many replicas • Geographic distribution • Automatic configuration • Useful in public and proprietary settings
    77. 77. A DHT has a good interface • Put(key, value) and get(key) → value • Call a key/value pair a “block” • API supports a wide range of applications • DHT imposes no structure/meaning on keys • Key/value pairs are persistent and global • Can store keys in other DHT values • And thus build complex data structures
    78. 78. A DHT makes a good shared infrastructure • Many applications can share one DHT service • Much as applications share the Internet • Eases deployment of new applications • Pools resources from many participants • Efficient due to statistical multiplexing • Fault-tolerant due to geographic distribution
    79. 79. Many recent DHT-based projects • File sharing [CFS, OceanStore, PAST, …] • Web cache [Squirrel, ..] • Backup store [Pastiche] • Censor-resistant stores [Eternity, FreeNet,..] • DB query and indexing [Hellerstein, …] • Event notification [Scribe] • Naming systems [ChordDNS, Twine, ..] • Communication primitives [I3, …] Common thread: data is location-independent
    80. 80. Related Work • CAN (Ratnasamy, Francis, Handley, Karp, Shenker) • Pastry (Rowstron, Druschel) • Tapestry (Zhao, Kubiatowicz, Joseph) • Chord emphasizes simplicity
    81. 81. Chord Summary • Chord provides peer-to-peer hash lookup • Efficient: O(log(n)) messages per lookup • Robust as nodes fail and join • Good primitive for peer-to-peer systems http://www.pdos.lcs.mit.edu/chord
    82. 82. Scalable Key Location – find_successor() // ask node n to find the successor of id // id = 36, n’ = 25 , successor=40 n.find_successor(id) n’ = n.find_predecessor(id) return n’.successor; n.find_predecessor(id) n’ = n; while (id ∉ (n’, n’.successor]) n’ = n’.closest_preceding_finger(id) return n’; n.closest_preceding_finger(id) for i = m downto 1 if (finger[i] ∈ (n, id)) return finger[i]; return n;
    83. 83. Choosing the successor list length • Assume 1/2 of nodes fail • P(successor list all dead) = (1/2)^r • I.e. P(this node breaks the Chord ring) • Depends on independent failure
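    A quick numeric check of this estimate, assuming independent failures as the slide does:

        # P(all r successors dead) when each node fails independently with probability 1/2.
        for r in (6, 20):
            print(r, 0.5 ** r)     # r=6 -> 0.015625 (the 1.6% on the failure slide); r=20 -> ~9.5e-7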
    84. 84. Improvement (Original Chord vs. Improved Chord) • Metadata Layer: original Chord has none → try to use metadata to describe resources more flexibly and support complex queries; nodes join the appropriate resource layer according to the resources they hold, and queries only need to search within the appropriate layer • Distributing Index: the file-lookup index obtained from the hash function above → the index differs because the hash function differs • Finger Table: fixed size, lacking robustness against network churn → could switch to bidirectional routing or neighbor routing, or adjust the routing table dynamically according to the degree of churn • Hashing Function: SHA-1 → could switch to SHA-2 to strengthen the cryptographic properties, or to Pearson hashing to speed up computation
