Dynomite
Yet Another Distributed Key Value Store
@moonpolysoft -
  questions
#dynomite on
irc.freenode.net
Alpha - 0.6.0
Focus on Distribution
Focus on Performance
• Latency
 • Average ~ 10 ms
 • Median ~ 5 ms
 • 99.9% ~ 1 s
• Throughput
 • R12B ~ 2,000 req/s
 • R13B ~ 6,500 req/s
At Powerset

• 12 machine production cluster
• ~6 million images + metadata
• ~2 TB of data including replicas
• ~139KB av...
Production?

• Data you can afford to lose
• Compatibility will break
• Migration will be provided
The Constants
Consistency
Throughput
                        Durability




         N – Replication
Max Replicas per
   Partition
Latency           Consistency




     R – Read Quorum
Minimum participation
      for read
Latency          Durability




    W – Write Quorum
Minimum participation
      for write
Throughput              Scalability




         Q – Partitioning
Partitions =   2Q
QNRW – Defines your
     cluster
At Powerset

• Batch writes
• Online reads
• Clients will read many images concurrently
Q – 6
N–3
R–1
W–2
(dynomite@galva)2> mediator:put(
  "prefs:user:merv",
  undefined,
  term_to_binary(
    [{safe_search, false}, {results, ...
(dynomite@galva)2> mediator:put(
  "prefs:user:merv",
  undefined,
  term_to_binary(
    [{safe_search, false}, {results, ...
(dynomite@galva)2> mediator:put(
  "prefs:user:merv",
  undefined,
  term_to_binary(
    [{safe_search, false}, {results, ...
(dynomite@galva)2> mediator:put(
  "prefs:user:merv",
  undefined,
  term_to_binary(
    [{safe_search, false}, {results, ...
Value is always Binary
(dynomite@galva)2> mediator:put(
  "prefs:user:merv",
  undefined,
  term_to_binary(
    [{safe_search, false}, {results, ...
(dynomite@galva)1> {ok, {Context, [Bin]}} =
mediator:get("prefs:user:merv").

{ok,{[{"<0.630.0>",1240543271.698089}],
    ...
(dynomite@galva)1> {ok, {Context, [Bin]}} =
mediator:get("prefs:user:merv").

{ok,{[{"<0.630.0>",1240543271.698089}],
    ...
mediator:delete/1
mediator:has_key/1
Up In Dem Gutz
Client Protocols

• Native Erlang Messages
• Thrift
• Protocol Buffers
• Ascii Protocol
Mediator
N – 3 ;R – 2
pmap(Fun, List, ReturnNum) ->
  N = if
    ReturnNum > length(List) -> length(List);
    true -> ReturnNum
  end,
  SuperP...
Membership
Handles Nodes Joining
membership:nodes() =/= nodes().
• Receive join notification
• Recalculate partition to node mapping
• Start bootstrap routines
• Stops old local storage se...
Maps Partitions To Nodes
Gossip Girl
• Gossip servers randomly wake up
• Pick another membership server
• Merge membership tables
• Versioned by vector clocks
gossip_loop(Server) ->
  #membership{nodes=Nodes,node=Node} =
gen_server:call(Server, state),
  case lists:delete(Node, No...
Storage Server
Merkle Trees
Merkle Trees
O(log n)
• Implemented on disk as a B-Tree
• Includes buddy-block allocator for key
  space

• Extremely gnarly code
Minimal Transfer
Sync Server
Randomly Wakes Up
Choses two replicas
Transfers Merkle Diff
[cliff@yourmom ~]$ dynomite console

(dynomite@yourmom)1> sync_manager:running().

[{3623878657,'dynomite@yourmom','dynomi...
The Roadmap
Why Just Key/Value?
To Focus on
Distribution
How to maintain
   focus...
And stay relevant?
Imagine a Generic
Distribution Mechanism
Not Tied to Key Value
       Lookup
A Dynamo Framework
For Distributed
  Databases
Distribute any Data
       Model
      within reason
But How?
Storage Engine
• Persists data according to its data model
• Runs queries locally
• Provides a means to stitch data back
 ...
Dynomite

• Distributes queries and updates to nodes
• Invokes the correct read repair phase
• Handles node membership tra...
API Negotiation
• Storage Engine Publishes an API
• API Methods declare
 • Data Types
 • Conflict resolution
 • Result Coll...
Client Protocol
• Dynomite is a front end to storage
• Clients can query API capabilities
• Mediator follows distribution ...
api_info() ->
  {api_info,
    HashFun               = fun(Keys),
    MethodSpecs           = [api_spec()]
  }

    api_sp...
Plz to Contribute
• http://github.com/cliffmoon/dynomite
• http://wiki.github.com/cliffmoon/dynomite
Q and or A
Dynomite Nosql
Dynomite Nosql
Dynomite Nosql
Dynomite Nosql
Upcoming SlideShare
Loading in …5
×

Dynomite Nosql

1,725 views
1,621 views

Published on

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,725
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
22
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Dynomite Nosql

  1. 1. Dynomite Yet Another Distributed Key Value Store
  2. 2. @moonpolysoft - questions
  3. 3. #dynomite on irc.freenode.net
  4. 4. Alpha - 0.6.0
  5. 5. Focus on Distribution
  6. 6. Focus on Performance
  7. 7. • Latency • Average ~ 10 ms • Median ~ 5 ms • 99.9% ~ 1 s
  8. 8. • Throughput • R12B ~ 2,000 req/s • R13B ~ 6,500 req/s
  9. 9. At Powerset • 12 machine production cluster • ~6 million images + metadata • ~2 TB of data including replicas • ~139KB average size
  10. 10. Production? • Data you can afford to lose • Compatibility will break • Migration will be provided
  11. 11. The Constants
  12. 12. Consistency Throughput Durability N – Replication
  13. 13. Max Replicas per Partition
  14. 14. Latency Consistency R – Read Quorum
  15. 15. Minimum participation for read
  16. 16. Latency Durability W – Write Quorum
  17. 17. Minimum participation for write
  18. 18. Throughput Scalability Q – Partitioning
  19. 19. Partitions = 2Q
  20. 20. QNRW – Defines your cluster
  21. 21. At Powerset • Batch writes • Online reads • Clients will read many images concurrently
  22. 22. Q – 6 N–3 R–1 W–2
  23. 23. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  24. 24. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  25. 25. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  26. 26. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  27. 27. Value is always Binary
  28. 28. (dynomite@galva)2> mediator:put( "prefs:user:merv", undefined, term_to_binary( [{safe_search, false}, {results, 50}])). {ok,1}
  29. 29. (dynomite@galva)1> {ok, {Context, [Bin]}} = mediator:get("prefs:user:merv"). {ok,{[{"<0.630.0>",1240543271.698089}], [<<131,108,0,0,0,2,...>>]}} (dynomite@galva)2> Terms = binary_to_term(Bin). [{safe_search,false},{results,50}]
  30. 30. (dynomite@galva)1> {ok, {Context, [Bin]}} = mediator:get("prefs:user:merv"). {ok,{[{"<0.630.0>",1240543271.698089}], [<<131,108,0,0,0,2,...>>]}} (dynomite@galva)2> Terms = binary_to_term(Bin). [{safe_search,false},{results,50}]
  31. 31. mediator:delete/1 mediator:has_key/1
  32. 32. Up In Dem Gutz
  33. 33. Client Protocols • Native Erlang Messages • Thrift • Protocol Buffers • Ascii Protocol
  34. 34. Mediator
  35. 35. N – 3 ;R – 2
  36. 36. pmap(Fun, List, ReturnNum) -> N = if ReturnNum > length(List) -> length(List); true -> ReturnNum end, SuperParent = self(), SuperRef = erlang:make_ref(), Ref = erlang:make_ref(), %% we spawn an intermediary to collect the results %% this is so that there will be no leaked messages sitting in our mailbox Parent = spawn(fun() -> L = gather(N, length(List), Ref, []), SuperParent ! {SuperRef, pmap_sort(List, L)} end), Pids = [spawn(fun() -> Parent ! {Ref, {Elem, (catch Fun(Elem))}} end) || Elem <- List], Ret = receive {SuperRef, Ret} -> Ret end, % i think we need to cleanup here. lists:foreach(fun(P) -> exit(P, die) end, Pids), Ret.
  37. 37. Membership
  38. 38. Handles Nodes Joining
  39. 39. membership:nodes() =/= nodes().
  40. 40. • Receive join notification • Recalculate partition to node mapping • Start bootstrap routines • Stops old local storage servers • Starts new local storage servers • Install the new partition table
  41. 41. Maps Partitions To Nodes
  42. 42. Gossip Girl
  43. 43. • Gossip servers randomly wake up • Pick another membership server • Merge membership tables • Versioned by vector clocks
  44. 44. gossip_loop(Server) -> #membership{nodes=Nodes,node=Node} = gen_server:call(Server, state), case lists:delete(Node, Nodes) of [] -> ok; % no other nodes Nodes1 when is_list(Nodes1) -> fire_gossip(random_node(Nodes1)) end, SleepTime = random:uniform(5000) + 5000, receive stop -> gossip_paused(Server); _Val -> ok after SleepTime -> ok end, gossip_loop(Server).
  45. 45. Storage Server
  46. 46. Merkle Trees
  47. 47. Merkle Trees
  48. 48. O(log n)
  49. 49. • Implemented on disk as a B-Tree • Includes buddy-block allocator for key space • Extremely gnarly code
  50. 50. Minimal Transfer
  51. 51. Sync Server
  52. 52. Randomly Wakes Up
  53. 53. Choses two replicas
  54. 54. Transfers Merkle Diff
  55. 55. [cliff@yourmom ~]$ dynomite console (dynomite@yourmom)1> sync_manager:running(). [{3623878657,'dynomite@yourmom','dynomite@yourdad'}, {3892314113,'dynomite@yourmom','dynomite@cableguy'}, {4026531841,'dynomite@yourmom','dynomite@mailman'}]
  56. 56. The Roadmap
  57. 57. Why Just Key/Value?
  58. 58. To Focus on Distribution
  59. 59. How to maintain focus...
  60. 60. And stay relevant?
  61. 61. Imagine a Generic Distribution Mechanism
  62. 62. Not Tied to Key Value Lookup
  63. 63. A Dynamo Framework
  64. 64. For Distributed Databases
  65. 65. Distribute any Data Model within reason
  66. 66. But How?
  67. 67. Storage Engine • Persists data according to its data model • Runs queries locally • Provides a means to stitch data back together during read repair • Can provide hashing • Publishes its own API
  68. 68. Dynomite • Distributes queries and updates to nodes • Invokes the correct read repair phase • Handles node membership transparently • Provides network interface for clients • Versions data
  69. 69. API Negotiation • Storage Engine Publishes an API • API Methods declare • Data Types • Conflict resolution • Result Collation • Partition Strategy
  70. 70. Client Protocol • Dynomite is a front end to storage • Clients can query API capabilities • Mediator follows distribution semantics • Single partition • Range of partitions • Quorum requirements
  71. 71. api_info() -> {api_info, HashFun = fun(Keys), MethodSpecs = [api_spec()] } api_spec() = { ReadWrite = read | write | cas, Plurality = single | many, Name = atom(), Arguments = [argument_type_spec()], PartitionStrategy = single | many, CollationFun = fun(ResultsA, ResultsB) | undefined, ResolutionFun = fun(ContextA, ValueA, ContextB, ValueB) | undefined } argument_type_spec() = { Key = true | false, Name = atom(), DataType = abstract_data_type() }
  72. 72. Plz to Contribute
  73. 73. • http://github.com/cliffmoon/dynomite • http://wiki.github.com/cliffmoon/dynomite
  74. 74. Q and or A

×