Kai – An Open Source Implementation of Amazon’s Dynamo
Kai – An Open Source Implementation of Amazon’s Dynamo Presentation Transcript

  • 1. Kai – An Open Source Implementation of Amazon’s Dynamo (takemaru, 08.6.17)
  • 2. Outline
    - Amazon’s Dynamo: Motivation, Features, Algorithms
    - Kai: Build and Run, Internals, Roadmap
  • 3. Dynamo: Motivation
    - Largest e-commerce site: ~75K queries/sec (author’s estimate), i.e. 500 req/sec * 150 queries/req; O(10M) users and many items
    - Why not an RDBMS? Hard to scale out or load-balance, and many components need only primary-key access
    - Amazon built data stores for its own needs: Dynamo for primary-key access, SimpleDB for complex queries, S3 for large files
    - Dynamo is used for shopping carts, customer preferences, session management, sales rank, and product catalogs
  • 4. Dynamo: Features
    - Key-value store: a distributed hash table
    - Highly scalable: no master, peer-to-peer, large clusters of maybe O(1K) nodes
    - Fault tolerant: meets latency requirements even if an entire data center fails
    (Figure: get("cart/joe") served from replicas of joe’s data spread across nodes)
  • 5. Dynamo: Features, cont’d
    - Service Level Agreements: < 300 ms for 99.9% of queries; on average, 15 ms for reads and 30 ms for writes
    - Highly available: no locks, always writable
    - Eventually consistent: replicas are loosely synchronized, and inconsistencies are resolved by clients
    - Tradeoff between availability and consistency: an RDBMS chooses consistency, Dynamo prefers availability
    (Figure: get("cart/joe") returning loosely synchronized replicas of joe’s data)
  • 6. Dynamo: Overview
    - A Dynamo cluster (instance) consists of equivalent nodes and holds N replicas of each key
    - Dynamo APIs: get(key) and put(key, value, context); context is a kind of version, like memcache’s cas
    - delete is not described in the paper (presumably it is defined)
    - Requirements for clients: no need to know all nodes (unlike memcache clients); requests can be sent to any node
    (Figure: a Dynamo cluster with N = 3, keys joe and bob each replicated on three nodes)
  • 7. Dynamo: Partitioning
    - Consistent hashing: nodes and keys are positioned on a ring at their hash values (MD5, 128 bits)
    - Each key is stored on the N nodes that follow it on the ring (a minimal lookup sketch follows this slide)
    (Figure: a 2^128 hash ring with N = 3; hash(cart/joe) falls between node positions and is stored on the next three nodes)
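    - A minimal Erlang sketch of the ring lookup described above. This is not Kai’s kai_hash code: the module name, the use of term_to_binary, and the list-based ring are assumptions for illustration only.
      -module(ring_sketch).
      -export([find_nodes/3]).

      %% Position a term on the 2^128 ring by its MD5 hash.
      hash(Term) ->
          <<Int:128/integer>> = erlang:md5(term_to_binary(Term)),
          Int.

      %% Return the N nodes that follow Key clockwise on the ring.
      find_nodes(Key, Nodes, N) ->
          Ring = lists:keysort(1, [{hash(Node), Node} || Node <- Nodes]),
          KeyHash = hash(Key),
          %% Split the ring at the key's position and wrap around.
          {Before, After} = lists:partition(fun({H, _}) -> H < KeyHash end, Ring),
          [Node || {_, Node} <- lists:sublist(After ++ Before, N)].
    - For example, find_nodes("cart/joe", [node1, node2, node3, node4, node5, node6], 3) returns the three nodes whose ring positions follow hash("cart/joe").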
  • 8. Dynamo: Partitioning, cont’d
    - Advantage: only a small number of keys are remapped when membership changes
    - The remapped range lies between the failed node and its Nth predecessor
    (Figure: node2 goes down; only keys in the range between node2 and its Nth predecessor are remapped)
  • 9. Dynamo: Partitioning, cont’d
    - Physical placement of replicas: each key is replicated across multiple data centers
    - The arrangement scheme has not been revealed; the author guesses netmasks could help, but the impact on replica distribution is unknown
    - Advantage: no data outage on data center failures
    (Figure: joe’s key replicated on nodes in data centers A and B)
  • 10. Dynamo: Partitioning, cont’d
    - Virtual nodes: a single physical node takes multiple positions on the ring, roughly O(100) virtual nodes per physical node
    - Advantages: keys are distributed more uniformly (statistically), remapped keys are dispersed evenly across nodes, and the number of virtual nodes can be set according to each physical node’s capacity
    (Figure: a ring with two virtual nodes per physical node)
  • 11. Dynamo: Partitioning, cont’d
    - Buckets: the hash ring is divided into equal ranges; there should be more buckets than total virtual nodes
    - Keys in the same bucket map to the same nodes
    - Advantage: keys can be synchronized bucket by bucket, e.g. with the Merkle trees discussed later (see the sketch after this slide)
    (Figure: a ring divided into 8 buckets; bob and joe fall into the same bucket)
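    - A minimal sketch of mapping a key to a bucket, assuming the ring is simply split into equal ranges; the exact arithmetic (and the fact that Kai itself uses a 32-bit hash space, per slide 31) may differ from kai_hash.
      %% Map a binary Key to one of NumberOfBuckets equal ranges of the 2^128 ring.
      %% Keys in the same bucket are stored on the same nodes and can be
      %% synchronized together.
      bucket_of(Key, NumberOfBuckets) ->
          <<KeyHash:128/integer>> = erlang:md5(Key),
          KeyHash div ((1 bsl 128) div NumberOfBuckets).
    - For example, bucket_of(<<"cart/joe">>, 1024) returns a bucket number between 0 and 1023.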
  • 12. Dynamo: Membership
    - Gossip-based protocol: membership spreads like a rumor
    - Membership contains the node list and its change history
    - Each node exchanges membership with a randomly chosen node every second, and updates its view if a more recent one is received (see the merge sketch after this slide)
    - Advantages: robust (no single node can stop a rumor from spreading) and exponentially fast propagation
    (Figure: the fact that node2 is down spreads through the cluster by gossip)
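    - A minimal sketch of how gossiped membership lists might be merged, keeping the most recent entry per node. The tuple format and logical timestamps are assumptions for illustration; kai_network’s actual message format is not shown in the slides.
      -module(gossip_sketch).
      -export([merge/2]).

      %% Membership is a list of {Node, Timestamp, Status} entries, e.g.
      %% [{node2, 17, down}, {node3, 12, up}]. merge/2 keeps, for every node,
      %% whichever entry has the larger logical timestamp.
      merge(Local, Remote) ->
          lists:foldl(fun({Node, T, S}, Acc) ->
                          case lists:keyfind(Node, 1, Acc) of
                              {Node, T2, _} when T2 >= T ->
                                  Acc;  % our entry is at least as recent
                              _ ->
                                  lists:keystore(Node, 1, Acc, {Node, T, S})
                          end
                      end, Local, Remote).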
  • 13. Dynamo: Membership, cont’d
    - Chord, for large-scale clusters of more than O(10K) virtual nodes
    - Each node needs to know only O(log v) other nodes, where v is the number of virtual nodes; the nearer on the hash ring, the more it knows
    - Messages are routed hop by hop, with hop count below O(log v); for details, see the original Chord paper
    (Figure: routing a "who has cart/ken?" query in Chord; the brown node knows only the yellow nodes)
  • 14. Dynamo: get/put Operations
    - Client: sends a request to any Dynamo node; the request is forwarded to the coordinator, one of the nodes associated with the key
    - Coordinator: (1) chooses N nodes by consistent hashing, (2) forwards the request to those N nodes, (3) waits for responses from R or W nodes, or times out, (4) checks replica versions in the case of get, (5) sends a response to the client
    (Figure: request/response flow of get/put operations with N, R, W = 3, 2, 2)
  • 15. Dynamo: get/put Operations, cont’d
    - Quorum parameters: N = number of replicas, R = minimum number of successful reads, W = minimum number of successful writes
    - Condition R + W > N: the key can be read from at least one node that has successfully written it
    - Condition R < N: provides better latency
    - N, R, W = 3, 2, 2 is a common setting (a quorum-read sketch follows this slide)
    (Figure: get/put operations with N, R, W = 3, 2, 2)
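    - A minimal sketch of a quorum read as described on these two slides (a rough illustration, not kai_coordinator’s code, which did not exist yet): the coordinator asks all N replica nodes in parallel and returns once R have answered or a timeout expires. kai_api:get/2 is the internal RPC listed on slide 37; the error handling here is an assumption.
      -module(quorum_sketch).
      -export([quorum_get/3]).

      %% Ask every node holding a replica of Key; succeed once R replies arrive.
      quorum_get(Key, Nodes, R) ->
          Parent = self(),
          [spawn(fun() -> Parent ! {reply, catch kai_api:get(Node, Key)} end)
           || Node <- Nodes],
          collect(R, []).

      collect(0, Replies) ->
          {ok, Replies};                     % versions would be reconciled here
      collect(R, Replies) ->
          receive
              {reply, {'EXIT', _Reason}} -> collect(R, Replies);   % replica failed
              {reply, Data}              -> collect(R - 1, [Data | Replies])
          after 300 ->                       % the 300 ms SLA bound from slide 5
              {error, timeout}
          end.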
  • 16. Dynamo: Versioning
    - Vector clocks: order versions in a distributed manner, with no master and no clock synchronization; parallel branches can be detected
    - A vector clock is a list of clocks (counters), one per node, each managed by its coordinator
    - (A:2, C:1) means the value was updated when A’s clock was 2 and C’s was 1
    (Figure: version histories on nodeA, nodeB, and nodeC, showing cause, effect, and independent branches of version (A:1, B:1, C:1))
  • 17. Dynamo: Versioning, cont’d
    - When a coordinator updates a key, it increments its own clock and replaces that clock in the key’s vector
    (Figure: same version history as the previous slide)
  • 18. Dynamo: Versioning, cont’d
    - How is the ordering of versions determined?
    - If all clocks in one vector are equal to or greater than those in the other, that version is the more recent one, e.g. (A:1, B:1, C:1) < (A:3, B:1, C:1)
    - Otherwise the versions are parallel branches and cannot be ordered, e.g. (A:1, B:1, C:1) vs (A:2, C:1)
    (Figure: same version history as the previous slides; a vector-clock sketch follows this slide)
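    - A minimal sketch of the two vector-clock operations described on slides 16-18: a coordinator incrementing its own clock on a put, and the ordering test. kai_version was not implemented at the time of this talk, so names and representation here are illustrative only.
      -module(vclock_sketch).
      -export([increment/2, descends/2]).

      %% A vector clock is a list of {Node, Counter} pairs, e.g. [{a,2},{c,1}].

      %% The coordinator Node bumps its own counter when it handles a put.
      increment(Node, VClock) ->
          Counter = proplists:get_value(Node, VClock, 0),
          lists:keystore(Node, 1, VClock, {Node, Counter + 1}).

      %% true if VA is equal to or newer than VB on every counter, i.e. VA is
      %% the more recent version. If neither descends from the other, the
      %% versions are parallel branches and must be reconciled by the client.
      descends(VA, VB) ->
          lists:all(fun({Node, CB}) ->
                        proplists:get_value(Node, VA, 0) >= CB
                    end, VB).
    - For the slide’s examples, descends([{a,3},{b,1},{c,1}], [{a,1},{b,1},{c,1}]) is true, while (A:1, B:1, C:1) and (A:2, C:1) fail the test in both directions and are therefore parallel branches.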
  • 19. Dynamo: Synchronization
    - Replicas are synchronized with Merkle trees: hierarchical checksums that are kept up to date
    - A tree is constructed for each bucket
    - Synchronization runs periodically or when membership changes
    (Figure: comparison of hierarchical checksums in the Merkle trees of node1 and node2)
  • 20. Dynamo: Synchronization, cont’d
    - A node compares its Merkle tree with other nodes’ trees from root to leaf, descending until the checksums match
    (Figure: comparing checksums level by level: root, then children, then leaves)
  • 21. Dynamo: Synchronization, cont’d
    - Out-of-date replicas are synchronized in the background as a low-priority task
    - If the versions cannot be ordered, no synchronization is done
    (Figure: after comparing down to the leaves, only the differing leaf is synchronized)
  • 22. Dynamo: Synchronization, cont’d
    - Advantage of Merkle trees: comparisons are cheap when most replicas are already in sync; if the root checksums are equal, no further comparison is needed (see the sketch after this slide)
    (Figure: same Merkle-tree comparison as the previous slides)
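    - A minimal sketch of the root-to-leaf comparison from slides 19-22 (Kai had not implemented Merkle trees yet, so the tree representation here is an assumption): an inner node is {Hash, Children}, a leaf is {Hash, {keys, Keys}}, both trees cover the same bucket layout, and diff/2 returns the keys under leaves whose checksums disagree.
      -module(merkle_sketch).
      -export([diff/2]).

      %% Identical checksums: the whole subtree is in sync, stop descending.
      diff({Hash, _}, {Hash, _}) ->
          [];
      %% Differing leaf: these keys need synchronization.
      diff({_, {keys, Keys}}, {_, {keys, _}}) ->
          Keys;
      %% Differing inner node: descend into the children pairwise.
      diff({_, ChildrenA}, {_, ChildrenB}) ->
          lists:append(lists:zipwith(fun diff/2, ChildrenA, ChildrenB)).
    - If the root checksums already match, the first clause returns immediately, which is exactly the advantage this slide points out.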
  • 23. Dynamo: Implementation
    - Implementation: written in Java, closed source
    - APIs: over HTTP
    - Storage: BDB or MySQL
    - Security: no requirements
  • 24. Kai: Overview
    - Kai is an open source implementation of Amazon’s Dynamo
    - Named after the author’s origin; the name OpenDynamo had already been taken by a project unrelated to Amazon’s Dynamo
    - Written in Erlang, with a memcache API
    - Found at http://sourceforge.net/projects/kai/
  • 25. Kai: Building Kai
    - Requirements: Erlang OTP (>= R12B) and make
    - Build:
      % svn co http://kai.svn.sourceforge.net/svnroot/kai/trunk kai
      % cd kai/
      % make
      % make test
    - Edit RUN_TEST in the Makefile according to your environment before make test, if you are not on Mac OS X
    - A ./configure script is coming
  • 26. Kai: Configuration
    - kai.config: all parameters are optional, since they have default values (a sample file follows this slide)
      Parameter                Description                                          Default value
      logfile                  Name of log file                                     Standard output
      hostname                 Hostname; specify it if the machine has multiple     Auto detection
                               network interfaces
      port                     Port # of the internal API                           11011
      memcache_port            Port # of the memcache API                           11211
      n, r, w                  N, R, W for quorum                                   3, 2, 2
      number_of_buckets        # of buckets; must match the other nodes             1024
      number_of_virtual_nodes  # of virtual nodes                                   128
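    - A sample of what a kai.config with these parameters might look like, using the standard Erlang application-config format implied by the -config kai flag on the next slide; the concrete values and the string type for hostname are assumptions.
      %% kai.config (illustrative values)
      [{kai, [{logfile, "kai.log"},
              {hostname, "192.168.1.1"},
              {port, 11011},
              {memcache_port, 11211},
              {n, 3}, {r, 2}, {w, 2},
              {number_of_buckets, 1024},
              {number_of_virtual_nodes, 128}]}].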
  • 27. Kai: Running Kai
    - Run as a standalone server:
      % erl -pa src -config kai -kai n 1 -kai r 1 -kai w 1
      1> application:load(kai).
      2> application:start(kai).
    - Arguments:
      -pa src                      Adds the directory src to the load path
      -config kai                  Loads the configuration file kai.config
      -kai n 1 -kai r 1 -kai w 1   Overrides the configuration with N, R, W = 1, 1, 1
    - Access 127.0.0.1:11211 with your memcache client
  • 28. Kai: Running Kai, cont’d
    - Run as a cluster: start three nodes on different port numbers on a single computer
      Terminal 1:
      % erl -pa src -config kai -kai port 11011 -kai memcache_port 11211
      1> application:load(kai).
      2> application:start(kai).
      Terminal 2:
      % erl -pa src -config kai -kai port 11012 -kai memcache_port 11212
      1> application:load(kai).
      2> application:start(kai).
      Terminal 3:
      % erl -pa src -config kai -kai port 11013 -kai memcache_port 11213
      1> application:load(kai).
      2> application:start(kai).
  • 29. Kai: Running Kai, cont’d
    - Connect the nodes by telling each one about a neighbor:
      3> kai_api:check_node({{127,0,0,1}, 11011}, {{127,0,0,1}, 11012}).
      4> kai_api:check_node({{127,0,0,1}, 11012}, {{127,0,0,1}, 11013}).
      5> kai_api:check_node({{127,0,0,1}, 11013}, {{127,0,0,1}, 11011}).
    - Access 127.0.0.1:11211-11213 with your memcache client
    - Adding a new node to an existing cluster:
      %% Link the new node to the existing cluster
      %% (kai_api:check_node/2 establishes a one-directional link)
      1> kai_api:check_node(NewNode, NodeInCluster).
      %% Wait until the buckets have been synchronized…
      %% Link a node in the cluster to the new node
      2> kai_api:check_node(NodeInCluster, NewNode).
  • 30. Kai: Internals
      Function         Module         Comments
      Partitioning     kai_hash       Does not consider physical placement
      Membership       kai_network    No Chord yet; will be renamed to kai_membership
      Coordinator      kai_memcache   Will be re-implemented in kai_coordinator
      Versioning       (none)         Not implemented yet
      Synchronization  kai_sync       No Merkle trees yet
      Storage          kai_store      Uses ets
      Internal API     kai_api
      API              kai_memcache   get, set, and delete
      Logging          kai_log
      Configuration    kai_config
      Supervisor       kai_sup
  • 31. Kai: kai_hash
    - Base module: gen_server
    - Current status: consistent hashing, 32-bit hash space, virtual nodes, buckets
    - Future work: physical placement; parallel access will be allowed for read operations (avoiding long waits during re-hashing by not using gen_server:call)
    - Synopsis:
      kai_hash:start_link(),
      %% Re-hashes when membership changes
      {replaced_buckets, ListOfReplacedBuckets} =
          kai_hash:update_nodes(ListOfNodesToAdd, ListOfNodesToRemove),
      %% Finds the nodes associated with Key for get/put
      {nodes, ListOfNodes} = kai_hash:find_nodes(Key).
  • 32. Kai: kai_network (will be renamed to kai_membership)
    - Base module: gen_fsm
    - Current status: gossip-based protocol; Erlang’s built-in distribution facilities, such as EPMD, are not used, since they don’t seem to scale
    - Future work: Chord or Kademlia (Kademlia is used by BitTorrent)
    - Synopsis:
      kai_network:start_link(),
      %% Checks whether Node is alive, and updates consistent hashing if needed
      kai_network:check_node(Node),
      %% In the background, kai_network checks a node at random every second
  • 33. Kai: kai_coordinator (currently implemented in kai_memcache)
    - Base module: gen_server (planned)
    - Current status: implemented inside kai_memcache, with quorum; the node that receives a request currently behaves as the coordinator
    - Future work: separate it from kai_memcache; requests from clients will be routed to coordinators
    - Synopsis (planned):
      kai_coordinator:start_link(),
      %% Sends a get request to N nodes, and returns the received data to kai_memcache
      Data = kai_coordinator:get(Key),
      %% Sends a put request to N nodes, where Data is a data record
      kai_coordinator:put(Data).
  • 34. Kai: kai_version
    - Base module: gen_server (planned)
    - Current status: not implemented yet
    - Future work: vector clocks
    - Synopsis (planned):
      kai_version:start_link(),
      %% Updates LocalNode's clock in VectorClocks
      kai_version:update(VectorClocks, LocalNode),
      %% Determines the ordering of two vector clocks
      {order, Order} = kai_version:order(VectorClocks1, VectorClocks2).
  • 35. Kai: kai_sync
    - Base module: gen_fsm
    - Current status: retrieves data that does not exist locally; versions are not compared
    - Future work: bulk transport (download a whole bucket when membership changes), parallel download (synchronize a bucket with multiple nodes), and Merkle trees
    - Synopsis:
      kai_sync:start_link(),
      %% Compares the key list of Bucket with other nodes,
      %% and retrieves the keys that don't exist locally
      kai_sync:update_bucket(Bucket),
      %% In the background, kai_sync synchronizes a bucket at random every second
  • 36. Kai: kai_store
    - Base module: gen_server
    - Current status: uses ets, Erlang’s built-in memory storage
    - Future work: persistent storage without a capacity limit (dets, mnesia, or MySQL; unfortunately dets and mnesia tables are limited to < 4 GB) and delayed deletion
    - Synopsis:
      kai_store:start_link(),
      %% Retrieves the Data associated with Key
      Data = kai_store:get(Key),
      %% Stores Data, which is a data record
      kai_store:put(Data).
  • 37. Kai: kai_api
    - Base module: gen_tcp
    - Current status: internal API; RPC for kai_hash, kai_store, and kai_network
    - Future work: process pool (restrict the maximum number of processes receiving API calls) and connection pool (reuse TCP connections)
    - Synopsis:
      kai_api:start_link(),
      %% Retrieves the node list from Node
      {node_list, ListOfNodes} = kai_api:node_list(Node),
      %% Retrieves the Data associated with Key from Node
      Data = kai_api:get(Node, Key).
  • 38. Kai: kai_memcache
    - Base module: gen_tcp
    - Current status: get, set, and delete of the memcache API are supported; exptime of set must be zero, since Kai is persistent storage; get returns multiple values if multiple versions are found
    - Future work: cas and stats; process pool (restrict the maximum number of processes receiving API calls)
    - Synopsis, in Ruby:
      require 'memcache'
      cache = MemCache.new '127.0.0.1:11211'
      # Stores 'value' with 'key'
      cache['key'] = 'value'
      # Retrieves the data associated with 'key'
      p cache['key']
  • 39. Kai: Testing
    - Execution: run with make test
    - Implementation: common_test, Erlang’s built-in testing framework
    - Test servers are currently written from scratch; can test_server be used instead?
  • 40. Kai: Miscellaneous
    - Node ID:
      %% Nodes are identified by the socket address of their internal API
      {Addr, Port} = {{192,168,1,1}, 11011}.
    - Data structures:
      %% This record includes the value as well as metadata
      -record(data, {key, bucket, last_modified, checksum, flags, value}).
      %% This one includes only metadata, such as key, bucket #, and MD5 checksum
      -record(metadata, {key, bucket, last_modified, checksum}).
    - Return values:
      %% Returns 'undefined' if no data is associated with Key
      undefined = kai_store:get(Key).
      %% Returns Reason with an 'error' tag when something goes wrong
      {error, Reason} = function(Args).
  • 41. Kai: Roadmap
    - 1. Basic implementation (current version)
    - 2. Almost Dynamo:
      Module           Task
      kai_hash         Parallel access for read operations
      kai_coordinator  Route client requests to coordinators
      kai_version      Vector clocks
      kai_sync         Bulk and parallel transport
      kai_store        Persistent storage
      kai_api          Process pool
      kai_memcache     Process pool and cas
  • 42. Kai: Roadmap, cont’d
    - 3. Dynamo:
      Module           Task
      kai_hash         Physical placement
      kai_membership   Chord or Kademlia
      kai_sync         Merkle tree
      kai_store        Delayed deletion
      kai_api          Connection pool
      kai_memcache     stats
    - Development environment: configure, test_server
  • 43. Conclusion
    - Join us if you are interested! http://sourceforge.net/projects/kai/