Practical Byzantine Fault Tolerance
A presentation on work by Barbara Liskov and others on providing practical byzantine fault tolerance.

Practical Byzantine Fault Tolerance Presentation Transcript

  • 1. Practical Byzantine Fault Tolerance Castro & Liskov - Suman Karumuri
  • 2. Byzantine fault • A process behaves in an inconsistent (arbitrary) manner. • Can’t be solved unless • n ≥ 3f+1 – n: number of processes – f: number of faults tolerated.
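The n ≥ 3f+1 bound on the preceding slide can be turned around to ask how many faults a given cluster tolerates. A minimal sketch (the helper name `max_faults` is made up for illustration):

```python
def max_faults(n: int) -> int:
    """Largest Byzantine fault count f tolerable with n processes,
    from the requirement n >= 3f + 1, i.e. f = floor((n - 1) / 3)."""
    return (n - 1) // 3

assert max_faults(4) == 1   # 4 replicas is the minimum for f = 1
assert max_faults(7) == 2
assert max_faults(3) == 0   # 3 replicas tolerate no Byzantine fault
```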
  • 3. Replicated State machine problem • The state of the system is modeled as a state machine with – State variables. – Operations. • Reads • Updates • The state of the machine is replicated across processes some of which are byzantine. • Goals: – Keep all the state machines consistent. – Do it efficiently.
  • 4. Previous Work • Previous work is not really practical: too many messages. • Assumes synchrony. – Bounds on message delays and process speeds.
  • 5. The Model • Networks are unreliable – Can delay, reorder, drop, or retransmit messages • Adversary can – Co-ordinate faulty nodes. – Delay communication. • Adversary can’t – Delay correct nodes. – Break cryptographic protocols. • Some fraction of nodes are byzantine – May behave in any way, and need not follow the protocol. • Nodes can verify the authenticity of messages sent to them.
  • 6. Service properties • Every operation performed by the client is deterministic. • Safety: The replicated service satisfies linearizability. • Liveness: Clients eventually receive replies to their requests if – at most floor((n-1)/3) replicas are faulty and – the network delay does not grow faster than time. • Security and privacy are out of scope.
  • 7. Assumptions about Nodes • Maintain a state – Log – View number – State • Can perform a set of operations – Need not be simple read/write – Must be deterministic • Well behaved nodes must: – start at the same state – Execute requests in the same order
  • 8. Views • Operations occur within views • For a given view, a particular node is designated the primary node, and the others are backup nodes • Primary = v mod n – n is the number of nodes – v is the view number
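The primary-selection rule on this slide is a one-liner; a sketch showing how view changes rotate the role round-robin:

```python
def primary(view: int, n: int) -> int:
    """Deterministic primary for a view: replica id v mod n."""
    return view % n

assert primary(0, 4) == 0
assert primary(1, 4) == 1
assert primary(5, 4) == 1   # views wrap around the replica set
```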
  • 9. High level Algorithm 1. A client sends a request to invoke a service operation to the primary 2. The primary multicasts the request to the backups 3. Replicas execute the request and send a reply to the client 4. The client waits for f+1 replies from different replicas with the same result; this is the result of the operation
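Step 4 above hinges on counting matching replies from distinct replicas: with at most f faulty nodes, f+1 matching replies guarantee at least one correct replica vouched for the result. A minimal sketch of that client-side check (names are hypothetical):

```python
def accept_result(replies, f):
    """Return the result endorsed by at least f + 1 distinct replicas,
    or None if no result has enough votes yet.

    `replies` is a list of (replica_id, result) pairs."""
    voters_by_result = {}
    for replica_id, result in replies:
        # De-duplicate by replica id so a faulty replica cannot
        # inflate the count by repeating itself.
        voters_by_result.setdefault(result, set()).add(replica_id)
    for result, voters in voters_by_result.items():
        if len(voters) >= f + 1:
            return result
    return None

assert accept_result([(0, "x"), (1, "x"), (2, "y")], f=1) == "x"
assert accept_result([(0, "x"), (1, "y")], f=1) is None
```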
  • 10. High level Algo contd.. • If the client does not receive replies soon enough – it broadcasts the request to all replicas. If the request has already been processed, the replicas simply re-send the reply. • If the replica is not the primary, – it relays the request to the primary. • If the primary does not multicast the request to the group – it will eventually be suspected to be faulty by enough replicas to cause a view change
  • 11. Protocol • It is a three phase protocol – Pre-prepare: primary proposes an order – Prepare: backups agree on the sequence # – Commit: agree to commit
  • 12. Request to Primary {REQUEST,operation,ts,client}
  • 13. Pre-prepare message <{PRE_PREPARE,view,seq#,msg_digest}, msg>
  • 14. Prepare message {PREPARE,view,seq#,msg_digest,i}
  • 15. Prepared when i receives 2f prepares that match the pre-prepare.
  • 16. Commit message {COMMIT,view,seq#,msg_digest,i}
  • 17. Committed_local If prepared(i) and received 2f+1 commits
  • 18. Reply {REPLY,view,ts,client,i,response}
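The predicates on slides 15 and 17 can be sketched as a toy replica that tracks the three-phase exchange for a single request. Message formats follow the slides; the network, signatures, and the `Replica` class itself are illustrative assumptions, not the paper's implementation:

```python
class Replica:
    """Tracks the PBFT quorum predicates for one request slot."""

    def __init__(self, i, f):
        self.i, self.f = i, f
        self.pre_prepare = None   # (view, seq, digest) accepted from primary
        self.prepares = set()     # ids of replicas with matching PREPAREs
        self.commits = set()      # ids of replicas with matching COMMITs

    def on_pre_prepare(self, view, seq, digest):
        self.pre_prepare = (view, seq, digest)

    def on_prepare(self, view, seq, digest, j):
        if self.pre_prepare == (view, seq, digest):
            self.prepares.add(j)

    def on_commit(self, view, seq, digest, j):
        if self.pre_prepare == (view, seq, digest):
            self.commits.add(j)

    def prepared(self):
        # Slide 15: a pre-prepare plus 2f matching prepares.
        return self.pre_prepare is not None and len(self.prepares) >= 2 * self.f

    def committed_local(self):
        # Slide 17: prepared plus 2f + 1 matching commits.
        return self.prepared() and len(self.commits) >= 2 * self.f + 1

r = Replica(i=0, f=1)
r.on_pre_prepare(0, 1, "d")
r.on_prepare(0, 1, "d", 1)
r.on_prepare(0, 1, "d", 2)
assert r.prepared() and not r.committed_local()
for j in (0, 1, 2):
    r.on_commit(0, 1, "d", j)
assert r.committed_local()
```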
  • 19. Truncating the log • Checkpoints at regular intervals • Requests are in log, or already stable • Each node maintains multiple copies of state: – A copy of the last proven checkpoint – 0 or more unproven checkpoints – The current working state • A node sends a checkpoint message when it generates a new checkpoint • Checkpoint is proven when a quorum agrees – Then this checkpoint becomes stable – Log truncated, old checkpoints discarded
  • 20. View change • The view change mechanism – Protects against faulty primaries • Backups propose a view change when a timer expires – The timer runs whenever a backup has accepted some message & is waiting to execute it. – Once a view change is proposed, the backup will no longer do work (except checkpoint) in the current view.
  • 21. View change 2 • A view change message contains – # of the highest message in the stable checkpoint • And the check point messages – A pre-prepare message for non-checkpointed messages • And proof it was prepared • The new primary declares a new view when it receives a quorum of messages
  • 22. New view • New primary computes – Maximum checkpointed sequence number – Maximum sequence number not checkpointed • Constructs new pre-prepare messages – Either is a new pre-prepare for a message in the new view – Or a no-op pre-prepare so there are no gaps
  • 23. New view 2 • New primary sends a new view message – Contains all view change messages – All computed pre-prepare messages • Recipients verify: – The pre-prepare messages – They have the latest checkpoint • If not, they can get a copy – Sends a prepare message for each pre-prepare – Enters the new view
  • 24. Controlling View Changes • Moving through views too quickly – Nodes will wait longer if • No useful work was done in the previous view – I.e. only re-execution of previous requests – Or enough nodes accepted the change, but no new view was declared – If a node gets f+1 view change requests with a higher view number • It will send its own view change with the minimum view number • This is safe, because at least one non-faulty replica sent a message
  • 25. Optimization • Don’t send f+1 responses – Just send the f digests and 1 response – If that doesn’t work, switch to old protocol. • Tentative commits – After prepare, backup may tentatively execute request ( fails on view-change requests) – Client waits for a quorum of tentative replies, otherwise retries and waits for f+1 replies – Read-only • Clients multicast directly to replicas • Replicas execute the request, wait until no tentative request are pending, return the result • Client waits for a quorum of results
  • 26. Micro benchmark • Worst case. • Not so much difference on real work loads. • Still faster than Rampart and SecureRing.
  • 27. Benchmark • Andrew benchmark • Software development workload. 5 phases – Create subdirectories – Copy source tree – Look at file status – Look at file contents – Compile • Implementations compared – NFS – BFS strict – BFS (lookup, read are read only)
  • 28. Results
  • 29. Fault-Scalable Byzantine Fault-Tolerant Services
  • 30. Fault-scalability • Fault-scalable service is one in which performance degrades gradually, if at all, as more server faults are tolerated. • BFT is not fault-scalable. • Graceful-degradation in web parlance.
  • 31. Query/Update (Q/U) Protocol • The Q/U protocol is byzantine fault tolerant and fault scalable. • Provides the same operations and interfaces as replicated state machine systems. • Uses 3 techniques: – Optimistic non-destructive updates. – Minimizes consensus by using versioning with a logical time stamping scheme. – Uses preferred quorums, to minimize server-to-server communication. • Cost: Needs 5f+1 servers instead of 3f+1 servers for fault-scalable BFT.
  • 32. Quorum vs Agreement protocols • In a Quorum, only subsets of servers process requests. • In agreement based systems, all servers process all requests.
  • 33. Efficiency and Scalability • No need of prepare or commit phase because of optimism.
  • 34. Throughput-scalability • Additional servers, beyond those necessary to provide the desired fault tolerance, can provide additional throughput • No need to partition the services to achieve scalability.
  • 35. System model • Clients and servers – Asynchronous timing. – May be Byzantine faulty – Computationally bounded. (Crypto can’t be broken) – Communicate using point-to-point unreliable channels. – Know each other’s symmetric keys. • Failure model is a hybrid failure model – Benign (Follow protocol) – Malevolent (Don’t follow protocol) – Faulty (malevolent, or crashes and never recovers)
  • 36. Q/U Overview • Clients update objects by issuing requests stamped with object versions to version servers. • Version servers evaluate these requests. – If the request is over an out of date version, the client’s version is corrected and the request reissued – If an out of date server is required to reach a quorum, it retrieves an object history from a group of other servers – If the version matches the server version, of course, it is executed • Since we are optimistic, we should correct for errors when objects are concurrently accessed.
  • 37. Concurrency and Repair • Concurrent access to an object may fail • Two operations – Barrier • Create a barrier (a dummy request with a larger timestamp) that prevents any requests from being executed. – Copy • Copy the object past the barrier so that it can be acted upon. • Clients may repeatedly barrier each other; to combat this, an exponential backoff strategy is enforced.
  • 38. Classification and Constraints • Based on partial observations of the global system state, an operation may be – Established • Complete (4b+1 <= order) – Potential • Repairable (2b+1 <= order < 4b+1) – Can be repaired using the copy and barrier strategy • Incomplete (otherwise) – Must be reverted – Order is the number of matching responses received by the client.
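The classification thresholds above reduce to two comparisons on the order. A minimal sketch using the slide's bounds (the function name is hypothetical; b is the number of malevolent servers tolerated):

```python
def classify(order: int, b: int) -> str:
    """Classify a Q/U candidate by its order (matching responses),
    using the thresholds from the slide: complete at 4b+1,
    repairable at 2b+1, incomplete below that."""
    if order >= 4 * b + 1:
        return "complete"
    if order >= 2 * b + 1:
        return "repairable"
    return "incomplete"

assert classify(5, b=1) == "complete"
assert classify(3, b=1) == "repairable"
assert classify(2, b=1) == "incomplete"
```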
  • 39. Example
  • 40. Algorithm for increment (step-by-step diagram)
  • 41. (diagram continued)
  • 42. (diagram continued)
  • 43. Optimisations • Do everything in single step. – Inline repair • If an object is not contended, just sync it without a barrier phase. – Cached object history set. • Cache object history to reduce round trips. • Preferred Quorums – Always go to a quorum that already processed an object. • Improves cache locality. • Reduces serialization overhead.
  • 44. Optimizations contd.. • Reducing contention – Handling repeated requests. • Store answers, send same answers for same queries. – Retry and back-off policy. • Reduce contention on contended object. – Round-robin strategy for non-preferred quorum.
  • 45. Optimizations contd.. • Optimizing data structures. – Trade bandwidth for computation while syncing objects. – Use digital signatures instead of HMACs for authenticators. (large n) – Compact timestamps and replica histories.
  • 46. Multi-Object Updates • In this case, servers lock their local copies; if they approve the OHS, the update goes through • If not, a multi-object repair protocol goes through – In this case, repair depends on the ability to establish all objects in the set – Objects in the set are only repairable if all are repairable; if any is incomplete, objects in the set that would otherwise be repairable are reclassified as incomplete.
  • 47. Fault Scalability
  • 48. More fault-scalability
  • 49. Isolated vs Contending
  • 50. NFSv3 metadata
  • 51. Thank you
  • 52. Two generals problem • A city is in a valley. • The generals who wish to conquer it are on 2 mountains, surrounding the city. • They have to co-ordinate on a time to attack, by sending messages across the valley. • Channel is unreliable, as messengers may be caught.
  • 53. Solution • Message acknowledgements do not work. • Replication is the only solution. – Send 100 messengers instead of one.
  • 54. Byzantine generals problem • More general than the two generals problem. • What if messengers or generals are traitors? • They may manipulate messages. • All generals should reach consensus (majority agreement).
  • 55. Byzantine fault • A process behaves in an inconsistent manner.
  • 56. Properties of a solution • A solution has to guarantee that all correct processes eventually reach a decision regarding the value of the order they have been given. • All correct processes have to decide on the same value of the order they have been given. • If the source process is a correct process, all processes have to decide on the value that was originally given by the source process.
  • 57. Impossible in general case
  • 58. Simplified Lamport’s algo • The general sends his value to each lieutenant. • Each lieutenant sends the value he receives to the other lieutenants. (Stage 1) • Each lieutenant waits till he receives the messages from all the lieutenants and acts on the majority agreement. (Stage 2)
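Stage 2 of the algorithm above is a plain majority vote over the values each lieutenant has collected. A minimal sketch (function name is made up for illustration):

```python
from collections import Counter

def lieutenant_decision(received_values):
    """Stage 2: act on the majority of collected values.

    `received_values` holds the general's value plus the values
    echoed by the other lieutenants in stage 1."""
    value, _count = Counter(received_values).most_common(1)[0]
    return value

# With 4 processes and 1 traitor, loyal lieutenants still see a majority:
assert lieutenant_decision(["attack", "attack", "retreat"]) == "attack"
```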
  • 59. Properties of the Algo. • 3 * faulty processes < total processes.
  • 60. Overview • Queries are read only methods • Updates modify an object • Methods exported take arguments and return answers • Clients perform operations by issuing requests to a quorum • A server receives a request. If it accepts it, it invokes a method • Each update creates a new object version
  • 61. Overview • The object version is kept with its logical timestamp in a version history called the replica history • Servers return replica histories in response to requests • Clients store replica histories in their object history set, an array of replicas indexed by server
  • 62. Overview • Timestamps in these histories are candidates for future operations • Candidates are classified in order to determine which object version a method should be executed upon
  • 63. Overview • In non-optimistic operation, a client may need to perform a repair – Addressed later • To perform an operation, a client first retrieves an object history set. The client’s operation is conditioned on this set, which is transmitted with the operation.
  • 64. Overview • The client sends this operation to a quorum of servers. • To promote efficiency, the client sends the request to a preferred quorum – Addressed later • Single phase operation hinges on the availability of a preferred quorum, and on concurrency-free access.
  • 65. Overview • Before executing a request, servers first validate its integrity. • This is important: servers do not communicate object histories directly to each other, so the client’s data must be validated. • Servers use authenticators to do this, lists of HMACs that prevent malevolent nodes from fabricating replica histories. • Servers cull replica histories that they cannot validate from the conditioned-on OHS
  • 66. Overview – the last bit • Servers validate that they do not have a higher timestamp in their local replica histories • Failing this, the client repairs • Passing this, the method is executed, and the new timestamp created – Timestamps are crafted such that they always increase in value
  • 67. Preferred Quorums • Traditional quorum systems use random quorums, but this means that servers frequently need to be synced – This is to distribute the load • Preferred quorums choose to access servers with the most up to date data, assuring that syncs happen less often
  • 68. Preferred Quorums • If a preferred quorum cannot be met, clients probe for additional servers to add to the quorum – Authenticators make it impossible to forge object histories for benign servers – The new host syncs with b+1 host servers, in order to validate that the data is correct • In the prototype, probing selects servers such that the load is distributed using a method parameterized on object ID and server ID
  • 69. Concurrency and Repair • Concurrent access to an object may fail • Two operations – Barrier • Barrier candidates have no data associated with them, and so are safe to select during periods of contention • Barrier advances the logical clock so as to prevent earlier timestamps from completing – Copy • Copies the latest object data past the barrier, so it can be acted upon
  • 70. Concurrency and Repair • Clients may repeatedly barrier each other; to combat this, an exponential backoff strategy is enforced