Bizur is a consensus algorithm invented by Elastifile to address problems with log-based algorithms such as Paxos. It is optimized for a strongly consistent distributed key-value store: keys are hashed into independent buckets, and each bucket is replicated and governed by leader election. Reads and writes succeed only with majority acknowledgement. Bizur includes recovery mechanisms, such as bringing buckets up-to-date after a leader change, and it can dynamically reconfigure the cluster membership or the number of buckets per shard using SMART-style migrations.
2. What’s Bizur?
● A consensus algorithm that Elastifile [1] invented while developing their distributed file system
● Aims to solve problems with Paxos-like log-based algorithms
● As of 14 Feb. 2017 it is available only as a preprint on arXiv [2]
● Blog posts are also available [3, 4]
3. The drawbacks of Paxos-like algorithms
● a) Reading an object requires having all related preceding log entries available
● b) A slow operation delays unrelated succeeding operations
● c) Detecting a failure takes a long time
● d) Once a failure is detected, recovery can take a long time
● e) The complex flow of log compaction must be handled
4. Bizur against Paxos-like algorithms
● Paxos-like algorithms try to solve a more general problem, which produces drawbacks a) - e)
● Bizur, on the other hand, is optimized for a specific use case: a strongly consistent distributed key-value store
● Bizur could be emulated with Paxos-like algorithms by giving each key its own distributed log, but that would be very inefficient
5. Data structures
● Bucket: where k-v pairs are packed
○ hash function :: k -> bucket_index
○ Buckets are all independent, but operations on the same bucket are serialized (e.g. protected by a mutex)
○ Version = (elect_id, counter), similar to Raft’s (term, index) pair
○ Each bucket can have a different leader. For example, the leader of bucket 0 can be Node A and the leader of bucket 1 can be Node B
■ elect_id = the term the leader claims
■ voted_elect_id = the highest term each node has voted for
○ Leader election and client interaction are the same as in Raft
● Node = [Bucket]
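As a concrete illustration, here is a minimal Go sketch of these data structures. Field and helper names follow the slide’s terminology; everything else (the concrete types, the FNV hash, the Peers field) is an assumption, not Elastifile’s code.

```go
package bizur

import "hash/fnv"

// Version orders bucket contents, similar to Raft's (term, index) pair.
type Version struct {
	ElectID uint64 // elect_id: term claimed by the bucket's leader
	Counter uint64 // incremented on every write to the bucket
}

// Bucket packs independent k-v pairs; in a real implementation,
// operations on the same bucket would be serialized (e.g. by a mutex).
type Bucket struct {
	Ver Version
	KV  map[string]string
}

// Node = [Bucket]; each bucket may have a different leader.
type Node struct {
	ElectID      uint64   // term this node claims when it is a leader
	VotedElectID uint64   // voted_elect_id: highest term this node voted for
	Peers        []string // other nodes in the replica group (assumed)
	Buckets      []Bucket
}

// bucketIndex implements "hash function :: k -> bucket_index".
func bucketIndex(k string, nBuckets int) int {
	h := fnv.New32a()
	h.Write([]byte(k))
	return int(h.Sum32()) % nBuckets
}
```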
6. API
● READ and WRITE are majority-aware operations
● The decode and encode_xxx operations read/write a local bucket
● A READ is required before updates (set or delete), as the sketch below shows
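Continuing the Go sketch above, an update could look like this. Set is a hypothetical client-facing helper, and Read/Write are the majority-aware operations sketched under the WRITE and READ slides below.

```go
// Set illustrates why a READ must precede updates: first fetch a
// majority-confirmed copy of the bucket, apply the change locally
// (the encode_set step), then WRITE the whole bucket back.
func (n *Node) Set(key, value string) error {
	idx := bucketIndex(key, len(n.Buckets))
	b, err := n.Read(idx) // majority-aware READ, sketched below
	if err != nil {
		return err
	}
	b.KV[key] = value      // encode_set applied to the local bucket
	return n.Write(idx, b) // majority-aware WRITE, sketched below
}
```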
7. WRITE
● Updates the bucket’s version
● Succeeds only if acked by a majority
● Leader
○ Claims that the bucket’s elect_id is its own elect_id
● Replica
○ Nacks if the claimed elect_id is older than the voted_elect_id it knows
○ Otherwise replaces the local bucket and acks
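A Go sketch of this flow, continuing the one above. sendReplicaWrite is a hypothetical RPC stub, not a message name from the paper.

```go
// Hypothetical RPC stub: deliver the bucket to a replica; true on ack.
var sendReplicaWrite func(peer string, idx int, b Bucket) bool

// Returned when fewer than a majority acked. (Requires: import "errors".)
var ErrNoMajority = errors.New("bizur: no majority")

// Write is the leader side: bump the version, claiming our elect_id,
// and succeed only if a majority (including ourselves) acks.
func (n *Node) Write(idx int, b Bucket) error {
	b.Ver = Version{ElectID: n.ElectID, Counter: b.Ver.Counter + 1}
	acks := 1 // the leader's own copy counts
	for _, peer := range n.Peers {
		if sendReplicaWrite(peer, idx, b) {
			acks++
		}
	}
	if acks*2 <= len(n.Peers)+1 {
		return ErrNoMajority
	}
	n.Buckets[idx] = b
	return nil
}

// onReplicaWrite is the replica side: nack stale terms, otherwise
// replace the local bucket and ack.
func (n *Node) onReplicaWrite(idx int, b Bucket) bool {
	if b.Ver.ElectID < n.VotedElectID {
		return false // nack: claimed elect_id older than voted_elect_id
	}
	n.Buckets[idx] = b
	return true // ack
}
```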
8. READ
● Returns the local bucket if a majority confirms that the bucket is the latest
● Running ENSURE_RECOVERY beforehand keeps the bucket up-to-date
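Continuing the sketch, the READ path might look like this. sendReplicaConfirm is again a hypothetical stub that asks a replica whether the given version is still the latest it knows of.

```go
// Hypothetical RPC stub: true if the replica agrees this version is
// the latest it has for the bucket.
var sendReplicaConfirm func(peer string, idx int, ver Version) bool

// Read returns the leader's local bucket only after a majority
// (including the leader) confirms that its version is the latest.
func (n *Node) Read(idx int) (Bucket, error) {
	b := n.Buckets[idx]
	confirms := 1 // the leader's own copy counts
	for _, peer := range n.Peers {
		if sendReplicaConfirm(peer, idx, b.Ver) {
			confirms++
		}
	}
	if confirms*2 <= len(n.Peers)+1 {
		return Bucket{}, ErrNoMajority
	}
	return b, nil
}
```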
9. ENSURE_RECOVERY
● When the local bucket is older than the current elect_id (term), the leader should repair it
● Processed in case of a leader change
● Uses REPLICA_READ and takes the most up-to-date replica of the bucket
○ Reading from a majority guarantees that at least one of the replicas is up-to-date
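A sketch of the repair step, continuing the code above. sendReplicaRead stands in for REPLICA_READ, and re-writing the winner under the current term is our reading of how the bucket is kept up-to-date.

```go
// Hypothetical RPC stub for REPLICA_READ: fetch a replica's copy of
// the bucket; ok is false if the replica did not answer.
var sendReplicaRead func(peer string, idx int) (b Bucket, ok bool)

// ensureRecovery repairs a bucket whose elect_id is older than the
// current term: read a majority of replicas, adopt the most
// up-to-date copy, and commit it under the new elect_id via Write.
func (n *Node) ensureRecovery(idx int) error {
	best := n.Buckets[idx]
	reads := 1 // the leader's own copy counts
	for _, peer := range n.Peers {
		b, ok := sendReplicaRead(peer, idx)
		if !ok {
			continue
		}
		reads++
		// Versions compare lexicographically: (elect_id, counter).
		if b.Ver.ElectID > best.Ver.ElectID ||
			(b.Ver.ElectID == best.Ver.ElectID && b.Ver.Counter > best.Ver.Counter) {
			best = b
		}
	}
	if reads*2 <= len(n.Peers)+1 {
		return ErrNoMajority
	}
	// A majority was read, so at least one copy was up-to-date;
	// Write re-replicates it under the current elect_id.
	return n.Write(idx, best)
}
```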
10. Extension: Shard
● Intermediate data structure between Node and Bucket
○ Node = N * Shard, Shard = M * Bucket
○ N: static (e.g. 256)
○ M: dynamic (e.g. 100k)
● Introduced as an optimization
○ Theoretically, we could maintain leadership on a per-bucket basis, but that is too expensive
○ Instead, we lift it up to the shard level so the buckets in a shard can share it: elect_id and voted_elect_id are maintained at the shard level and the buckets only refer to them (see the sketch below)
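In the Go sketch’s terms, the lift looks like this (illustrative only; the paper may arrange the state differently).

```go
// Shard-level leadership: elect_id and voted_elect_id move from the
// per-bucket/node level to the shard, and its M buckets share them.
type Shard struct {
	ElectID      uint64   // term the shard's leader claims
	VotedElectID uint64   // highest term this shard has voted for
	Buckets      []Bucket // M buckets (dynamic, e.g. 100k)
}

// ShardedNode = N * Shard, with N static (e.g. 256).
type ShardedNode struct {
	Shards []Shard
}
```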
11. Configuration change
● Based on SMART [5]
● Configuration
○ The cluster membership (X -> Y)
○ M (#buckets in a shard)
● High-level description
○ 1) Create a new Bizur instance running on membership Y
○ 2) Notify the old instance to return RECONFIG_ERROR to all requests
○ 3) On that error, clients switch over to the new instance
○ 4) The new instance migrates buckets from the old instance
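A client-side sketch of step 3, assuming RECONFIG_ERROR carries the new membership; the slide does not specify how clients learn it, and Instance, Get, and connect are all hypothetical stand-ins.

```go
// Hypothetical client handle; Get performs a majority READ plus decode.
type Instance struct{ addrs []string }

func (i *Instance) Get(key string) (string, error) { panic("elided") }

func connect(membership []string) *Instance { return &Instance{addrs: membership} }

// Hypothetical error the old instance returns after step 2, assumed
// to carry the new membership Y.
type ReconfigError struct{ NewMembership []string }

func (e *ReconfigError) Error() string { return "bizur: reconfigured" }

// get follows step 3: on RECONFIG_ERROR, switch to the new instance
// and retry the request there. (Requires: import "errors".)
func get(inst *Instance, key string) (string, error) {
	for {
		v, err := inst.Get(key)
		var re *ReconfigError
		if errors.As(err, &re) {
			inst = connect(re.NewMembership) // switch to membership Y
			continue
		}
		return v, err
	}
}
```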
12. Bucket migration on changing M
● As k-v pairs accumulate, the buckets become huge; it’s time to increase M
● The easiest way is to increase it by a power of two; that way, we can maintain the bucket boundaries
● It’s something like caching, where the old instance is the backing store
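The boundary-preserving property is simple modular arithmetic: if index = hash(k) mod M, then after doubling M every key from old bucket i lands in new bucket i or i+M. A tiny illustration (splitTargets is a hypothetical helper):

```go
// splitTargets returns the only two new bucket indexes that keys from
// old bucket i can map to after M doubles: hash mod 2M equals either
// (hash mod M) or (hash mod M) + M.
func splitTargets(i, oldM int) (int, int) {
	return i, i + oldM
}

// Example: with M = 8, keys in old bucket 3 end up in new bucket 3 or
// 11, so each old bucket lazily back-fills exactly two new buckets.
```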
13. Links
● [1] Elastifile
● [2] Bizur: A Key-Value Consensus Algorithm for Scalable File-systems
● [3] Bizur: A New Key-value Consensus Algorithm
● [4] Log-less Consensus
● [5] The SMART way to migrate replicated stateful services