Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Dynomite at Erlang Factory

2,269 views

Published on

This is my presentation from the 2009 Erlang Factory conference about Dynomite, a distributed key value store.

Published in: Technology, News & Politics
  • Be the first to comment

Dynomite at Erlang Factory

  1. 1. Dynomite Yet Another Distributed Key Value Store
  2. 2. @moonpolysoft - questions
  3. 3. A Crowded Field • Cassandra • Lightcloud • Memcachedb • Redis • Tokyo Tyrannical Cabinet Device Thing • Voldemort
  4. 4. Who here has written one?
  5. 5. Alpha - 0.6.0
  6. 6. Focus on Distribution
  7. 7. Focus on Performance
  8. 8. • Latency • Average ~ 10 ms • Median ~ 5 ms • 99.9% ~ 1 s
  9. 9. • Throughput • R12B ~ 2,000 req/s • R13B ~ 6,500 req/s
  10. 10. At Powerset • 12 machine production cluster • ~6 million images + metadata • ~2 TB of data including replicas • ~139KB average size
  11. 11. Production? • Data you can afford to lose • Compatibility will break • Migration will be provided
  12. 12. The Constants
  13. 13. Consistency Throughput Durability N – Replication
  14. 14. Max Replicas per Partition
  15. 15. Latency Consistency R – Read Quorum
  16. 16. Minimum participation for read
  17. 17. Latency Durability W – Write Quorum
  18. 18. Minimum participation for write
  19. 19. Throughput Scalability Q – Partitioning
  20. 20. Partitions = 2Q
  21. 21. QNRW – Defines your cluster
  22. 22. At Powerset • Batch writes • Online reads • Wikipedia pages are obese
  23. 23. Q – 6 N–3 R–1 W–2
  24. 24. (dynomite@galva)2> mediator:put( quot;prefs:user:mervquot;, undefined, term_to_binary( [{safe_search, false}, {results, 50}])).
  25. 25. (dynomite@galva)2> mediator:put( quot;prefs:user:mervquot;, undefined, term_to_binary( [{safe_search, false}, {results, 50}])).
  26. 26. (dynomite@galva)2> mediator:put( quot;prefs:user:mervquot;, undefined, term_to_binary( [{safe_search, false}, {results, 50}])).
  27. 27. (dynomite@galva)2> mediator:put( quot;prefs:user:mervquot;, undefined, term_to_binary( [{safe_search, false}, {results, 50}])).
  28. 28. Value is always Binary
  29. 29. (dynomite@galva)2> mediator:put( quot;prefs:user:mervquot;, undefined, term_to_binary( [{safe_search, false}, {results, 50}])).
  30. 30. (dynomite@galva)1> {ok, {Context, [Bin]}} = mediator:get(quot;prefs:user:mervquot;). {ok,{[{quot;<0.630.0>quot;,1240543271.698089}], [<<131,108,0,0,0,2,...>>]}} (dynomite@galva)2> Terms = binary_to_term(Bin). [{safe_search,false},{results,50}]
  31. 31. (dynomite@galva)1> {ok, {Context, [Bin]}} = mediator:get(quot;prefs:user:mervquot;). {ok,{[{quot;<0.630.0>quot;,1240543271.698089}], [<<131,108,0,0,0,2,...>>]}} (dynomite@galva)2> Terms = binary_to_term(Bin). [{safe_search,false},{results,50}]
  32. 32. mediator:delete/1 mediator:has_key/1
  33. 33. Up In Dem Gutz
  34. 34. Client Protocols • Native Erlang Messages • Thrift • Protocol Buffers • Ascii Protocol
  35. 35. Mediator
  36. 36. N – 3 ;R – 2
  37. 37. pmap(Fun, List, ReturnNum) -> N = if ReturnNum > length(List) -> length(List); true -> ReturnNum end, SuperParent = self(), SuperRef = erlang:make_ref(), Ref = erlang:make_ref(), %% we spawn an intermediary to collect the results %% this is so that there will be no leaked messages sitting in our mailbox Parent = spawn(fun() -> L = gather(N, length(List), Ref, []), SuperParent ! {SuperRef, pmap_sort(List, L)} end), Pids = [spawn(fun() -> Parent ! {Ref, {Elem, (catch Fun(Elem))}} end) || Elem <- List], Ret = receive {SuperRef, Ret} -> Ret end, % i think we need to cleanup here. lists:foreach(fun(P) -> exit(P, die) end, Pids),
  38. 38. Membership
  39. 39. Handles Nodes Joining
  40. 40. membership:nodes() =/= nodes().
  41. 41. • Receive join notification • Recalculate partition to node mapping • Start bootstrap routines • Stops old local storage servers • Starts new local storage servers • Install the new partition table
  42. 42. Maps Partitions To Nodes
  43. 43. Gossip Girl
  44. 44. • Gossip servers randomly wake up • Pick another membership server • Merge membership tables • Versioned by vector clocks
  45. 45. gossip_loop(Server) -> #membership{nodes=Nodes,node=Node} = gen_server:call(Server, state), case lists:delete(Node, Nodes) of [] -> ok; % no other nodes Nodes1 when is_list(Nodes1) -> fire_gossip(random_node(Nodes1)) end, SleepTime = random:uniform(5000) + 5000, receive stop -> gossip_paused(Server); _Val -> ok after SleepTime -> ok end, gossip_loop(Server).
  46. 46. Storage Server
  47. 47. Merkle Trees
  48. 48. Merkle Trees
  49. 49. O(log n)
  50. 50. • Implemented on disk as a B-Tree • Includes buddy-block allocator for key space • Extremely gnarly code
  51. 51. Minimal Transfer
  52. 52. Sync Server
  53. 53. Randomly Wakes Up
  54. 54. Choses two replicas
  55. 55. Transfers Merkle Diff
  56. 56. [cli@yourmom ~]$ dynomite console (dynomite@yourmom)1 sync_manager:running(). [{3623878657,'dynomite@yourmom','dynomite@yourd ad'}, {3892314113,'dynomite@yourmom','dynomite@cableg uy'},
  57. 57. Problem Areas
  58. 58. Membership
  59. 59. Problems • New partitions online before they are ready • A priori knowledge of what is running • Possible to dilute a cluster into uselessness • Migrations are tremendously painful
  60. 60. Direction • Storage servers rally with local membership server • Use monitors on local storage • Alerts for dangerous situations
  61. 61. Merkle
  62. 62. Erlang is good at many things
  63. 63. Low level storage ain’t one
  64. 64. Direction • Rewrite Dmerkle storage in C • Use async driver pool • Build in more crash recovery to the format
  65. 65. Web Console
  66. 66. Direction • More detailed visualizations of cluster health • Perform “safe” ops tasks
  67. 67. And other things I haven’t thought of
  68. 68. Demo!
  69. 69. Spare a patch, friend?
  70. 70. • http://github.com/cliffmoon/dynomite • http://wiki.github.com/cliffmoon/dynomite
  71. 71. Q and or A

×