
Infinit's Next Generation Key-value Store - Julien Quintard and Quentin Hocquet, Docker


Key-value store projects have been widely adopted as a way to store metadata, but also as a low-level construct on top of which more advanced storage solutions can be built, from file systems to object storage APIs and more. Unfortunately, most key-value stores suffer from the same limitations when it comes to scalability, performance, and resilience. Infinit's key-value store takes a different approach, relying on a decentralized architecture rather than a master/slave model while offering strong consistency.



  1. Infinit’s Next Generation Key-Value Store. Julien Quintard, Technical Staff, Docker. Quentin Hocquet, Software Engineer, Docker.
  2. Agenda: 1. Analysis 2. Introducing Infinit’s key-value store 3. API 4. Demo
  3. Introduction: Key-value stores have increasingly gained in popularity as a fundamental layer for storing and sharing content in a distributed system. Such a construct can be used (and has been extensively) to manage: • Metadata • Logs • etc. The most well-known key-value stores today are etcd, ZooKeeper and Consul.
  4. 1. Analysis: Why the hell yet another key-value store?
  5. Problem: Depending on the use case, the requirements on the underlying key-value store vary on several levels: • Scalability • Resilience • Performance • Security • Consistency. We believe the community needs a key-value store for all the other applications, i.e. those whose requirements today’s stores do not satisfy.
  6. Model: The main problem comes from the distribution mechanism, which is based on a manager/worker model. (Diagram: two manager nodes coordinating three worker nodes.)
  7. Limitations: This model, even though well-suited to many use cases, suffers from its design on several levels: • Scalability: limited by the scalability of the manager nodes • Resilience: an overload of the managers can lead to cascading failures • Performance: limited capacity to handle workers’ and clients’ requests • Security: managers are ideal targets • Consistency: inability to handle many parallel update requests.
  8. 2. Introducing Infinit’s key-value store: So what makes it different (apart from not having a name, yet!)?
  9. Presentation: What makes it different from etcd, Consul, ZooKeeper and other key-value stores is the use of a decentralized model (i.e. peer-to-peer) where every node is equipotent. (Diagram: three interconnected equipotent nodes.)
  10. Scalability: Such a decentralized architecture is naturally adapted to scaling since nodes can join and leave without a need to keep track of them through a central directory. Instead, the directory is collectively managed by the cluster through algorithms known as an overlay network (routing requests to the right nodes) and a distributed hash table or DHT (redundancy, self-healing etc.). BONUS: Infinit’s key-value store can be deployed over a single-node cluster, something that is not possible for manager-worker-based systems.
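As a rough illustration of how a directory can be managed collectively rather than by a central manager, the sketch below routes a block address to its owning node with a consistent-hash ring. This is a generic overlay technique shown for illustration only, not Infinit's actual overlay network; all names are hypothetical.

```python
import hashlib
from bisect import bisect_right

def _hash(value: str) -> int:
    # Map any string onto a fixed numeric ring via SHA-256.
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring up to its hash position.
        self._points = sorted((_hash(n), n) for n in nodes)

    def route(self, address: str) -> str:
        # Walk clockwise from the address's position to the first node:
        # any peer can compute this locally, so no central directory is needed.
        hashes = [h for h, _ in self._points]
        i = bisect_right(hashes, _hash(address)) % len(self._points)
        return self._points[i][1]

    def join(self, node):
        # A joining node only takes over part of one arc; the rest of the
        # cluster's placement is unaffected.
        self._points = sorted(self._points + [(_hash(node), node)])

ring = Ring(['node-a', 'node-b', 'node-c'])
owner = ring.route('some-block-address')
```

Note how `join` never consults a manager: the placement of every address is a pure function of the cluster membership, which is what makes node churn cheap.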
  11. Resilience & Performance: Systems based on the manager/worker model need to be dimensioned to support the worker nodes and handle clients’ requests. On the contrary, a decentralized architecture does away with bottlenecks, single points of failure and the associated slow performance, since requests are not concentrated on a small subset of critical manager nodes. Even better, the more nodes in the system, the faster requests are processed, because the load is naturally distributed between all the nodes.
  12. Security: Unlike in manager-worker-based systems, there is no authoritative or otherwise privileged node in a decentralized architecture. As such, an attacker has no choice but to either take control of a large portion of the nodes composing the cluster or find a breach in the network protocols in order to attack the system.
  13. Consistency (1/4): Consistency is all about reaching consensus within the set of servers that host the replicas of a piece of data. Distributed systems based on a manager/worker model rely on the managers to maintain consistency whenever an update is requested. Because such requests are concentrated on the managers, the number of parallel requests that can be processed is limited. (Diagram: distributed manager/worker model with a leader manager coordinating workers.)
  14. Consistency (2/4): Infinit’s key-value store instead relies on block-based quorums, meaning there are as many quorums, hence potential parallel consensus runs, as there are blocks (a.k.a. values) in the system. This approach means that parallel requests are handled by disjoint quorums, leading to better performance, security and fault tolerance. (Diagram: decentralized peer-to-peer model of equipotent nodes.)
  15. Consistency (3/4): Block-based quorums also mean that the complexity of the consensus algorithm is a function of the redundancy factor, not of the number of nodes in the cluster (unlike manager/worker systems). In other words, in a manager/worker model, a cluster of 1 million worker nodes would require, say, 100 managers. Given the quadratic complexity of consensus algorithms, a consensus among that many nodes would take several seconds to be reached. In a decentralized architecture, the complexity remains the same no matter the number of nodes in the network.
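The key property above, that the quorum size equals the redundancy factor regardless of cluster size, can be sketched with rendezvous hashing: every block address deterministically selects its own small quorum of nodes. This is a generic placement technique chosen for illustration, not Infinit's actual algorithm; all names are hypothetical.

```python
import hashlib

REPLICATION_FACTOR = 3  # illustrative redundancy factor

def quorum_for(address: str, nodes: list) -> list:
    # Rank every node by the hash of (address, node) and keep the top few:
    # a deterministic, well-spread choice (rendezvous hashing). Every peer
    # computes the same quorum without any coordination.
    ranked = sorted(nodes,
                    key=lambda n: hashlib.sha256((address + n).encode()).digest())
    return ranked[:REPLICATION_FACTOR]

# Two different blocks usually land on different quorums, so their consensus
# rounds can run in parallel; and consensus always involves only
# REPLICATION_FACTOR nodes, whether the cluster has 10 nodes or 1 million.
cluster = ['node-%d' % i for i in range(1000)]
q1 = quorum_for('block-1', cluster)
q2 = quorum_for('block-2', cluster)
```

The cost of a consensus round thus depends only on `REPLICATION_FACTOR`, which is exactly why the quadratic complexity of consensus stops mattering as the cluster grows.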
  16. Consistency (4/4): Most distributed systems nowadays rely on Raft for consensus. However, because Raft generates a lot of noise (e.g. leader heartbeats) and because it is impractical in systems that can have millions of quorums, we have decided to use Paxos. Also, because the key-value store is a fundamental layer, Infinit’s is strongly consistent to allow for more demanding applications. NOTE: the consensus algorithm can be swapped for another one with different consistency guarantees.
  17. Conclusion: In summary, Infinit key-value store’s decentralized architecture brings a number of advantages over manager-worker-based distributed systems. This model offers better performance, security and resilience by removing the critical manager nodes. Coupled with block-based quorums, it also allows for extremely scalable applications.
  18. 3. API: I hope it speaks SOAP!
  19. Overview: Infinit’s key-value store differs from other key-value stores in two major ways: • Key: one cannot choose the key associated with the values one stores; Infinit’s key-value store generates an address so as to optimize data placement for load balancing, fault tolerance and more. • Value: in Infinit’s, there are different types of values (known as blocks), each with their own tradeoffs.
  20. Blocks: In order to properly use Infinit’s key-value store, one needs to perfectly understand the various block types. In its purest form, there are two types of blocks, on top of which many others can be created: • Mutable Blocks: costly, subject to conflicts, need consensus when updated, need their cache invalidated to refresh the value, etc. • Immutable Blocks (content hashing): cannot conflict, can be cached forever, can be fetched from any source (integrity is easy to validate), etc.
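The immutable case can be sketched in a few lines: under content hashing, a block's address is the hash of its payload, which is why integrity is easy to validate, the block can never conflict, and any source can serve it. The class below is an illustrative toy, not Infinit's actual block type.

```python
import hashlib

class ImmutableBlock:
    def __init__(self, data: bytes):
        self.data = data
        # The address IS the hash of the content: once created, the block's
        # payload can never legitimately change.
        self.address = hashlib.sha256(data).hexdigest()

    def verify(self) -> bool:
        # Anyone holding the address can validate the payload, so the block
        # can be cached forever and fetched from any (even untrusted) peer.
        return hashlib.sha256(self.data).hexdigest() == self.address

b = ImmutableBlock(b'hello')
assert b.verify()
b.data = b'tampered'
assert not b.verify()
```

A mutable block, by contrast, would keep a stable address while its payload changes, which is exactly what forces consensus on updates and cache invalidation on reads.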
  21. API: Infinit key-value store’s API is composed of two types of calls:
      • Block Generation:
        • MakeImmutableBlock() -> (Address, Block)
        • MakeMutableBlock() -> (Address, Block)
      • Key-Value Store Manipulation:
        • Insert(Block) -> Boolean
        • Update(Block) -> Boolean
        • Remove(Address) -> Boolean
        • Fetch(Address) -> Block
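To make the call semantics concrete, here is a toy in-memory model of the six calls above, mirroring the rule that the store, not the caller, chooses addresses (content hash for immutable blocks, a store-generated identifier for mutable ones). This is a sketch of the intended usage pattern, not the real gRPC interface; all names are hypothetical.

```python
import hashlib
import os

class Block:
    def __init__(self, address, data=b''):
        self.address = address
        self.data = data

class KeyValueStore:
    def __init__(self):
        self._blocks = {}

    def make_immutable_block(self, data: bytes) -> Block:
        # Address derives from the content, so the caller never picks a key.
        return Block(hashlib.sha256(data).hexdigest(), data)

    def make_mutable_block(self) -> Block:
        # Stable, store-generated address; payload can change later.
        return Block(os.urandom(16).hex())

    def insert(self, block: Block) -> bool:
        self._blocks[block.address] = block.data
        return True

    def update(self, block: Block) -> bool:
        # Only existing blocks can be updated.
        if block.address not in self._blocks:
            return False
        self._blocks[block.address] = block.data
        return True

    def remove(self, address: str) -> bool:
        return self._blocks.pop(address, None) is not None

    def fetch(self, address: str) -> Block:
        return Block(address, self._blocks[address])

kv = KeyValueStore()
mb = kv.make_mutable_block()
mb.data = b'v1'
kv.insert(mb)
mb.data = b'v2'
kv.update(mb)
assert kv.fetch(mb.address).data == b'v2'
```

The demo that follows uses the same pattern over gRPC: generate a block, set its payload, then Insert or Update it.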
  22. 4. Demo: Ok, now show me the money!
  23. Example (index: MutableBlock; images: ImmutableBlocks):

      def connect(endpoint):
          import grpc
          import doughnut_pb2_grpc
          channel = grpc.insecure_channel(endpoint)
          return doughnut_pb2_grpc.DoughnutStub(channel)

      def init(kv):
          # Create a mutable block representing the index
          index = kv.MakeMutableBlock(MakeMutableBlockRequest())
          # Set its payload to an empty list
          index.data_plain = pickle.dumps([])
          # Insert the block
          kv.Insert(InsertRequest(block = index))
          # Return its address
          return index.address.hex()

      def index(kv, addr):
          return kv.Fetch(FetchRequest(address = unhexlify(addr),
                                       decrypt_data = True)).block
  24. Example:

      def add(kv, addr, content):
          idx = index(kv, addr)
          # Create the content immutable block
          content_block = kv.MakeImmutableBlock(
              MakeImmutableBlockRequest(data = pickle.dumps(content)))
          # Append its address to the index
          l = pickle.loads(idx.data_plain)
          l.append(content_block.address)
          idx.data_plain = pickle.dumps(l)
          # Update the index
          update = kv.Update(UpdateRequest(block = idx))
          # Push the content block
          kv.Insert(InsertRequest(block = content_block))
  25. Example:

      def add(kv, addr, content):
          content['conflicts'] = 0
          idx = index(kv, addr)
          while True:
              content_block = kv.MakeImmutableBlock(
                  MakeImmutableBlockRequest(data = pickle.dumps(content)))
              l = pickle.loads(idx.data_plain)
              l.append(content_block.address)
              idx.data_plain = pickle.dumps(l)
              time.sleep(random.random() * 0.1)
              update = kv.Update(UpdateRequest(block = idx))
              if update.error == SUCCESS:
                  break
              elif update.error == CONFLICT:
                  content['conflicts'] = content['conflicts'] + 1
                  idx = update.current
              else:
                  raise Exception(update.error)
          kv.Insert(InsertRequest(block = content_block))
  26. Example: writer

      endpoint = sys.argv[1]
      address = sys.argv[2]
      kv = connect(endpoint)
      while True:
          images = requests.get('')
          for image in images:
              content = requests.get(image['link'], headers = headers).content
              add(kv, address, {
                  'img': content,
                  'host': socket.gethostname(),
              })
          time.sleep(random.random())
  27. Example: reader

      endpoint = sys.argv[1]
      address = sys.argv[2]
      kv = connect(endpoint)

      from wsgiref.simple_server import make_server

      def simple_app(environ, start_response):
          status = '200 OK'
          headers = [('Content-type', 'text/html')]
          start_response(status, headers)
          images = pickle.loads(index(kv, address).data_plain)
          for l in images[-32:]:
              data = pickle.loads(
                  kv.Fetch(FetchRequest(address = l)).block.data_plain)
              img = base64.b64encode(data['img']).decode('latin-1')
              yield '''<img src="data:image/jpeg;base64,{}"/>
      Host: {} Conflicts: {}<br/>'''.format(
                  img, data['host'], data['conflicts'])

      make_server('', 8000, simple_app).serve_forever()
  28. Q&A: @docker #dockercon