Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Dynomite Nosql

1,877 views

Published on

Published in: Technology
  • Be the first to comment

Dynomite Nosql

  1. 1. Dynomite Yet Another Distributed Key Value Store
  2. 2. @moonpolysoft - questions
  3. 3. #dynomite on irc.freenode.net
  4. 4. Alpha - 0.6.0
  5. 5. Focus on Distribution
  6. 6. Focus on Performance
  7. 7. • Latency • Average ~ 10 ms • Median ~ 5 ms • 99.9% ~ 1 s
  8. 8. • Throughput • R12B ~ 2,000 req/s • R13B ~ 6,500 req/s
  9. 9. At Powerset • 12 machine production cluster • ~6 million images + metadata • ~2 TB of data including replicas • ~139KB average size
  10. 10. Production? • Data you can afford to lose • Compatibility will break • Migration will be provided
  11. 11. The Constants
  12. 12. Consistency Throughput Durability N – Replication
  13. 13. Max Replicas per Partition
  14. 14. Latency Consistency R – Read Quorum
  15. 15. Minimum participation for read
  16. 16. Latency Durability W – Write Quorum
  17. 17. Minimum participation for write
  18. 18. Throughput Scalability Q – Partitioning
  19. 19. Partitions = 2Q
  20. 20. QNRW – Defines your cluster
  21. 21. At Powerset • Batch writes • Online reads • Clients will read many images concurrently
  22. 22. Q – 6 N–3 R–1 W–2
  23. 23. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  24. 24. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  25. 25. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  26. 26. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  27. 27. Value is always Binary
  28. 28. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  29. 29. (dynomite@galva)1> {ok, {Context, [Bin]}} = mediator:get("prefs:user:merv"). {ok,{[{"<0.630.0>",1240543271.698089}], [<<131,108,0,0,0,2,...>>]}} (dynomite@galva)2> Terms = binary_to_term(Bin). [{safe_search,false},{results,50}]
  30. 30. (dynomite@galva)1> {ok, {Context, [Bin]}} = mediator:get("prefs:user:merv"). {ok,{[{"<0.630.0>",1240543271.698089}], [<<131,108,0,0,0,2,...>>]}} (dynomite@galva)2> Terms = binary_to_term(Bin). [{safe_search,false},{results,50}]
  31. 31. mediator:delete/1 mediator:has_key/1
  32. 32. Up In Dem Gutz
  33. 33. Client Protocols • Native Erlang Messages • Thrift • Protocol Buffers • Ascii Protocol
  34. 34. Mediator
  35. 35. N – 3 ;R – 2
  36. 36. pmap(Fun, List, ReturnNum) -> N = if ReturnNum > length(List) -> length(List); true -> ReturnNum end, SuperParent = self(), SuperRef = erlang:make_ref(), Ref = erlang:make_ref(), %% we spawn an intermediary to collect the results %% this is so that there will be no leaked messages sitting in our mailbox Parent = spawn(fun() -> L = gather(N, length(List), Ref, []), SuperParent ! {SuperRef, pmap_sort(List, L)} end), Pids = [spawn(fun() -> Parent ! {Ref, {Elem, (catch Fun(Elem))}} end) || Elem <- List], Ret = receive {SuperRef, Ret} -> Ret end, % i think we need to cleanup here. lists:foreach(fun(P) -> exit(P, die) end, Pids), Ret.
  37. 37. Membership
  38. 38. Handles Nodes Joining
  39. 39. membership:nodes() =/= nodes().
  40. 40. • Receive join notification • Recalculate partition to node mapping • Start bootstrap routines • Stops old local storage servers • Starts new local storage servers • Install the new partition table
  41. 41. Maps Partitions To Nodes
  42. 42. Gossip Girl
  43. 43. • Gossip servers randomly wake up • Pick another membership server • Merge membership tables • Versioned by vector clocks
  44. 44. gossip_loop(Server) -> #membership{nodes=Nodes,node=Node} = gen_server:call(Server, state), case lists:delete(Node, Nodes) of [] -> ok; % no other nodes Nodes1 when is_list(Nodes1) -> fire_gossip(random_node(Nodes1)) end, SleepTime = random:uniform(5000) + 5000, receive stop -> gossip_paused(Server); _Val -> ok after SleepTime -> ok end, gossip_loop(Server).
  45. 45. Storage Server
  46. 46. Merkle Trees
  47. 47. Merkle Trees
  48. 48. O(log n)
  49. 49. • Implemented on disk as a B-Tree • Includes buddy-block allocator for key space • Extremely gnarly code
  50. 50. Minimal Transfer
  51. 51. Sync Server
  52. 52. Randomly Wakes Up
  53. 53. Choses two replicas
  54. 54. Transfers Merkle Diff
  55. 55. [cliff@yourmom ~]$ dynomite console (dynomite@yourmom)1> sync_manager:running(). [{3623878657,'dynomite@yourmom','dynomite@yourdad'}, {3892314113,'dynomite@yourmom','dynomite@cableguy'}, {4026531841,'dynomite@yourmom','dynomite@mailman'}]
  56. 56. The Roadmap
  57. 57. Why Just Key/Value?
  58. 58. To Focus on Distribution
  59. 59. How to maintain focus...
  60. 60. And stay relevant?
  61. 61. Imagine a Generic Distribution Mechanism
  62. 62. Not Tied to Key Value Lookup
  63. 63. A Dynamo Framework
  64. 64. For Distributed Databases
  65. 65. Distribute any Data Model within reason
  66. 66. But How?
  67. 67. Storage Engine • Persists data according to its data model • Runs queries locally • Provides a means to stitch data back together during read repair • Can provide hashing • Publishes its own API
  68. 68. Dynomite • Distributes queries and updates to nodes • Invokes the correct read repair phase • Handles node membership transparently • Provides network interface for clients • Versions data
  69. 69. API Negotiation • Storage Engine Publishes an API • API Methods declare • Data Types • Conflict resolution • Result Collation • Partition Strategy
  70. 70. Client Protocol • Dynomite is a front end to storage • Clients can query API capabilities • Mediator follows distribution semantics • Single partition • Range of partitions • Quorum requirements
  71. 71. api_info() -> {api_info, HashFun = fun(Keys), MethodSpecs = [api_spec()] } api_spec() = { ReadWrite = read | write | cas, Plurality = single | many, Name = atom(), Arguments = [argument_type_spec()], PartitionStrategy = single | many, CollationFun = fun(ResultsA, ResultsB) | undefined, ResolutionFun = fun(ContextA, ValueA, ContextB, ValueB) | undefined } argument_type_spec() = { Key = true | false, Name = atom(), DataType = abstract_data_type() }
  72. 72. Plz to Contribute
  73. 73. • http://github.com/cliffmoon/dynomite • http://wiki.github.com/cliffmoon/dynomite
  74. 74. Q and or A

×