Classical Distributed Algorithms with DDS


The OMG DDS standard has seen very strong adoption as the distribution middleware of choice for a large class of mission- and business-critical systems, such as Air Traffic Control, Automated Trading, SCADA, Smart Energy, etc.

The main reason for choosing DDS lies in its efficiency, scalability, high availability and configurability -- through its 20+ QoS policies. Yet all of these nice properties come at the cost of a relaxed consistency model, with no strong guarantees over global invariants.

As a result, many architects have to devise, by themselves – assuming the DDS primitives as a foundation – the correct algorithms for classical problems such as fault-detection, leader election, consensus, distributed mutual exclusion, atomic multicast, distributed queues, etc.

In this presentation we will explore DDS-based distributed algorithms for many classical, yet fundamental, problems in distributed systems. For simplicity, we'll start with algorithms that ignore the presence of failures. Then we will (1) demonstrate how these algorithms can be extended to deal with failures, and (2) introduce Paxos as one of the fundamental algorithms for consensus and atomic broadcast.

Finally, we'll show how these classical algorithms can be used to implement useful extensions of the DDS semantics, such as multi-writer / multi-reader distributed queues. 


  1. Classical Distributed Algorithms with DDS [Developing Higher Level Abstractions on DDS]
     Angelo Corsaro, Ph.D., Chief Technology Officer, PrismTech; OMG DDS SIG Co-Chair
     angelo.corsaro@prismtech.com
  2. Context
     Copyright 2011, PrismTech – All Rights Reserved.
     ☐ The Data Distribution Service (DDS) provides a very useful foundation for building highly dynamic, reconfigurable, dependable and high-performance systems
     ☐ However, in building distributed systems with DDS one is often faced with two kinds of problems:
       ☐ How can distributed coordination problems be solved with DDS? e.g. distributed mutual exclusion, consensus, etc.
       ☐ How can higher-order primitives and abstractions be supported over DDS? e.g. fault-tolerant distributed queues, total-order multicast, etc.
     ☐ In this presentation we will look at how DDS can be used to implement some of the classical distributed algorithms that solve these problems
  3. DDS Abstractions and Properties
  4. Data Distribution Service for Real-Time Systems
     DDS provides a Topic-based Publish/Subscribe abstraction based on:
     ☐ Topics: data distribution subjects
     ☐ DataWriters: data producers
     ☐ DataReaders: data consumers
     [Diagram: DataWriters and DataReaders exchanging TopicA..TopicD through the DDS Global Data Space]
  5. Data Distribution Service for Real-Time Systems
     ☐ DataWriters and DataReaders are automatically and dynamically matched by the DDS Dynamic Discovery
     ☐ A rich set of QoS policies allows control over existential, temporal, and spatial properties of data
  6. DDS Topics
     ☐ A Topic defines a class of streams, e.g. “Circle”, “Square”, “Triangle”, ...
     ☐ A Topic has an associated unique name, a user-defined extensible type, and a set of QoS policies (e.g. DURABILITY, DEADLINE, PRIORITY, ...)
     ☐ QoS policies capture the Topic's non-functional invariants
     ☐ Topics can be discovered or locally defined

     struct ShapeType {
       @Key string color;
       long x;
       long y;
       long shapesize;
     };
  7. Topic Instances
     ☐ Each unique key value (e.g. color = “red”, “Green”, “Blue”) identifies a unique stream of data
     ☐ DDS not only demultiplexes “streams” but also provides lifecycle information
     ☐ A DDS DataWriter can write multiple instances
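To make the instance concept concrete, here is a small, purely local sketch (plain Scala, no DDS involved; `ShapeType` mirrors the IDL type above, `InstanceDemux` is our own illustrative name) of how a reader conceptually demultiplexes one topic into per-key instance streams:

```scala
// Local model of DDS topic instances: each unique key value (the "color"
// field, marked @Key in the IDL) identifies its own stream within one topic.
case class ShapeType(color: String, x: Int, y: Int, shapesize: Int)

object InstanceDemux {
  // Group samples by key field, as a DDS reader conceptually does per instance.
  def demux(samples: Seq[ShapeType]): Map[String, Seq[ShapeType]] =
    samples.groupBy(_.color)

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      ShapeType("red", 0, 0, 30),
      ShapeType("blue", 5, 5, 30),
      ShapeType("red", 1, 0, 30))
    val streams = demux(samples)
    assert(streams("red").size == 2)  // two samples of the "red" instance
    assert(streams("blue").size == 1) // one sample of the "blue" instance
  }
}
```

A real DataReader additionally tracks per-instance lifecycle (alive, disposed, no writers), which this sketch omits.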
  8. Anatomy of a DDS Application
     ☐ Domain (e.g. Domain 123)
     ☐ DomainParticipant
     ☐ Partition (e.g. “Telemetry”, “Shapes”, ...)
     ☐ Topic, with its Instances/Samples
     ☐ Publisher containing DataWriters; Subscriber containing DataReaders
  9. Channel Properties
     ☐ We can think of a DataWriter and its matching DataReaders as connected by a logical typed communication channel
     ☐ The properties of this channel are controlled by means of QoS policies
     ☐ At the two extremes this logical communication channel can be:
       ☐ a Best-Effort/Reliable Last n-values Channel
       ☐ a Best-Effort/Reliable FIFO Channel
  10. Last n-values Channel
     ☐ The last n-values channel is useful when modeling distributed state
     ☐ When n = 1 the last-value channel provides a way of modeling an eventually consistent distributed state
     ☐ This abstraction is very useful if what matters is the current value of a given topic instance
     ☐ The QoS policies that give a last n-values channel are:
       ☐ RELIABILITY = BEST_EFFORT | RELIABLE
       ☐ HISTORY = KEEP_LAST(n)
       ☐ DURABILITY = TRANSIENT | PERSISTENT [in most cases]
  11. FIFO Channel
     ☐ The FIFO channel is useful when we care about every single sample that was produced for a given topic, as opposed to the “last value”
     ☐ This abstraction is very useful when writing distributed algorithms over DDS
     ☐ Depending on QoS policies, DDS provides:
       ☐ a Best-Effort/Reliable FIFO Channel
       ☐ an FT-Reliable FIFO Channel (using an OpenSplice-specific extension)
     ☐ The QoS policies that give a FIFO channel are:
       ☐ RELIABILITY = BEST_EFFORT | RELIABLE
       ☐ HISTORY = KEEP_ALL
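As an illustration (our own in-memory model, not the Escalier API), the two channel extremes can be sketched as follows: KEEP_LAST(n) retains only the newest n samples per instance key, while KEEP_ALL retains every sample in FIFO order.

```scala
import scala.collection.mutable

// Last n-values channel: keeps only the newest n samples per instance key
// (HISTORY = KEEP_LAST(n)); older samples are silently replaced.
class LastNChannel[K, V](n: Int) {
  private val hist = mutable.Map.empty[K, List[V]]
  def write(k: K, v: V): Unit =
    hist(k) = (v :: hist.getOrElse(k, Nil)).take(n) // newest sample first
  def read(k: K): List[V] = hist.getOrElse(k, Nil)
}

// FIFO channel: keeps every sample in arrival order (HISTORY = KEEP_ALL).
class FifoChannel[V] {
  private val q = mutable.Queue.empty[V]
  def write(v: V): Unit = q.enqueue(v)
  def take(): Option[V] = if (q.isEmpty) None else Some(q.dequeue())
}
```

With n = 1 the LastNChannel models the eventually consistent distributed state of slide 10 (a reader observes only the current value of each instance), while the FifoChannel delivers the full sample history that the algorithms in the rest of the talk rely on.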
  12. Membership
     ☐ We can think of a DDS Topic as defining a group
     ☐ The members of this group are the matching DataReaders and DataWriters
     ☐ DDS' dynamic discovery manages this group membership; however, it provides only a low-level interface to group management, with eventual consistency of views
     ☐ In addition, the group view provided by DDS exposes matched readers on the writer side and matched writers on the reader side
     ☐ This is not sufficient for certain distributed algorithms
  13. Fault-Detection
     ☐ DDS provides a built-in mechanism for detection of DataWriter faults through the LivelinessChangedStatus
     ☐ A writer is considered to have lost its liveliness if it has failed to assert it within its lease period
  14. System Model
  15. System Model
     ☐ Partially synchronous: after a Global Stabilization Time (GST), communication latencies are bounded, yet the bound is unknown
     ☐ Non-Byzantine fail/recovery: processes can fail and restart but don't perform malicious actions
  16. Programming Environment
     ☐ The algorithms shown next are implemented on OpenSplice using the Escalier Scala API
     ☐ All algorithms are available as part of the Open Source project dada (DDS-based Advanced Distributed Algorithms Toolkit): github.com/kydos/dada
     • OpenSplice DDS (#1 OMG DDS implementation, Open Source): www.opensplice.org
     • Scala (fastest growing JVM language, Open Source): www.scala-lang.org
     • Escalier (Scala API for OpenSplice DDS, Open Source): github.com/kydos/escalier
  17. Higher Level Abstractions
  18. Group Management
     ☐ A Group Management abstraction should provide the ability to join/leave a group, provide the current view, and detect failures of group members
     ☐ Ideally group management should also provide the ability to elect leaders
     ☐ A Group Member should represent a process

     abstract class Group {
       // Join/Leave API
       def join(mid: Int)
       def leave(mid: Int)

       // Group View API
       def size: Int
       def view: List[Int]
       def waitForViewSize(n: Int)
       def waitForViewSize(n: Int, timeout: Int)

       // Leader Election API
       def leader: Option[Int]
       def proposeLeader(mid: Int, lid: Int)

       // Reactions handling Group Events
       val reactions: Reactions
     }

     case class MemberJoin(val mid: Int)
     case class MemberLeave(val mid: Int)
     case class MemberFailure(mid: Int)
     case class EpochChange(epoch: Long)
     case class NewLeader(mid: Option[Int])
  19. Topic Types
     ☐ To implement the Group abstraction with support for leader election it is sufficient to rely on the following topic types:

     enum TMemberStatus {
       JOINED,
       LEFT,
       FAILED,
       SUSPECTED
     };

     struct TMemberInfo {
       long mid; // member-id
       TMemberStatus status;
     };
     #pragma keylist TMemberInfo mid

     struct TEventualLeaderVote {
       long long epoch;
       long mid;
       long lid; // voted leader-id
     };
     #pragma keylist TEventualLeaderVote mid
  20. Topics
     ☐ Group Management: the TMemberInfo topic is used to advertise presence and manage member state transitions
     ☐ Leader Election: the TEventualLeaderVote topic is used to cast votes for leader election
     This leads us to:
     ☐ Topic(name = MemberInfo, type = TMemberInfo, QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
     ☐ Topic(name = EventualLeaderVote, type = TEventualLeaderVote, QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
  21. Observation
     ☐ Notice that we are using two last-value channels for implementing both the (eventual) group management and the (eventual) leader election
     ☐ This makes it possible to:
       ☐ let DDS provide our latest known state automatically, thanks to the TransientLocal durability
       ☐ avoid periodically asserting our liveliness, as DDS will do that through our DataWriter
  22. Leader Election
     ☐ At the beginning of each epoch the leader is None
     ☐ Each new epoch a leader election algorithm is run
     [Timeline: M0, M1 and M2 join in successive epochs; each view change starts a new epoch (epoch = 0..3) in which the leader goes from None to the elected member; when M1 crashes, a new epoch starts and M0 is elected]
  23. Distinguishing Groups
     ☐ To isolate the traffic generated by different groups, we use the group-id gid to name the partition in which all the group-related traffic will take place
     [Diagram: partitions “1”, “2”, “3” within a DDS Domain; the partition named “2” is associated with the group with gid = 2]
  24. Example [1/2]
     ☐ Events provide notification of group membership changes
     ☐ These events are handled by registering partial functions with the Group reactions

     object GroupMember {
       def main(args: Array[String]) {
         if (args.length < 2) {
           println("USAGE: GroupMember <gid> <mid>")
           sys.exit(1)
         }
         val gid = args(0).toInt
         val mid = args(1).toInt
         val group = Group(gid)
         group.join(mid)

         val printGroupView = () => {
           print("Group[" + gid + "] = { ")
           group.view foreach (m => print(m + " "))
           println("}")
         }

         group.reactions += {
           case MemberFailure(mid) =>
             println("Member " + mid + " Failed.")
             printGroupView()
           case MemberJoin(mid) =>
             println("Member " + mid + " Joined")
             printGroupView()
           case MemberLeave(mid) =>
             println("Member " + mid + " Left")
             printGroupView()
         }
       }
     }
  25. Example [2/2]
     ☐ An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change
     ☐ A group epoch change takes place each time there is a change in the group view
     ☐ The leader is eventually elected only if a majority of the processes currently in the view agree
     ☐ Otherwise the group leader is set to None

     object EventualLeaderElection {
       def main(args: Array[String]) {
         if (args.length < 2) {
           println("USAGE: EventualLeaderElection <gid> <mid>")
           sys.exit(1)
         }
         val gid = args(0).toInt
         val mid = args(1).toInt
         val group = Group(gid)
         group.join(mid)

         group.reactions += {
           case EpochChange(e) =>
             val lid = group.view.min
             group.proposeLeader(mid, lid)
           case NewLeader(l) => println(">> NewLeader = " + l)
         }
       }
     }
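The voting rule above can be checked in isolation with a small local simulation (LeaderVote and elect are our own illustrative names, not part of dada): every member votes for the smallest member-id it sees, and a leader emerges only when a strict majority of the view agrees.

```scala
object LeaderVote {
  // votes maps each voter's member-id to the leader-id it proposes.
  // A leader is elected only if a strict majority of the view agrees;
  // otherwise the epoch ends with no leader (None).
  def elect(votes: Map[Int, Int], viewSize: Int): Option[Int] = {
    val tally = votes.values.groupBy(identity).map { case (lid, vs) => (lid, vs.size) }
    tally.collectFirst { case (lid, n) if n > viewSize / 2 => lid }
  }

  def main(args: Array[String]): Unit = {
    // All members of view {1, 2, 3} vote for the minimum id: leader elected
    assert(elect(Map(1 -> 1, 2 -> 1, 3 -> 1), 3) == Some(1))
    // Only two of three members have voted, and they disagree: no leader yet
    assert(elect(Map(1 -> 1, 2 -> 2), 3) == None)
  }
}
```

Because each member deterministically proposes the minimum id in its view, the votes converge as soon as the views do, which is exactly the "eventual" part of the guarantee.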
  26. Distributed Mutex
  27. Lamport's Distributed Mutex
     ☐ A relatively simple distributed mutex algorithm was proposed by Leslie Lamport as an example application of Lamport's logical clocks
     ☐ The basic protocol (with the Agrawala optimization) works as follows (sketched):
       ☐ When a process needs to enter a critical section, it sends a MUTEX request tagged with its current logical clock
       ☐ The process obtains the mutex only when it has received ACKs from all the other processes in the group
       ☐ When a process receives a mutex request, it sends an ACK only if it does not have an outstanding mutex request timestamped with a smaller logical clock
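The ACK rule is the heart of the algorithm: requests are totally ordered by (logical clock, member-id), with the member-id breaking ties. A minimal sketch of that rule (our own illustration, not dada code):

```scala
// A mutex request stamped with a Lamport logical clock; ties between equal
// clock values are broken by member-id, giving a total order on requests.
case class Req(ts: Int, mid: Int)

object LamportOrder {
  implicit val ord: Ordering[Req] = Ordering.by(r => (r.ts, r.mid))

  // Should a process with outstanding request `mine` immediately ACK
  // an incoming request `r`? Only if r precedes mine in the total order
  // (or if the process has no outstanding request at all).
  def ackImmediately(mine: Option[Req], r: Req): Boolean =
    mine.forall(m => ord.lt(r, m))

  def main(args: Array[String]): Unit = {
    val r1 = Req(1, 1)
    val r2 = Req(2, 2)
    assert(ackImmediately(Some(r2), r1))  // r1 precedes r2: ACK at once
    assert(!ackImmediately(Some(r1), r2)) // r2 must wait until r1 releases
    assert(ackImmediately(None, r2))      // no outstanding request: ACK
  }
}
```

Deferred requests go into a pending queue and are acknowledged on release, which is exactly what the LCMutex code later in the deck does.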
  28. Mutex Abstraction
     ☐ A base class defines the Mutex protocol
     ☐ The Mutex companion object uses dependency injection to decide which concrete mutex implementation to use

     abstract class Mutex {
       def acquire()
       def release()
     }
  29. Foundation Abstractions
     ☐ The mutual exclusion algorithm essentially requires:
       ☐ FIFO communication channels between group members
       ☐ Logical clocks
       ☐ MutexRequest and MutexAck messages
     ☐ These needs now have to be translated in terms of topic types, topics, readers/writers and QoS settings
  30. Topic Types
     ☐ For implementing the mutual exclusion algorithm it is sufficient to define the following topic types:

     struct TLogicalClock {
       long ts;
       long mid;
     };
     #pragma keylist TLogicalClock mid

     struct TAck {
       long amid; // acknowledged member-id
       TLogicalClock ts;
     };
     #pragma keylist TAck ts.mid
  31. Topics
     We essentially need two topics:
     ☐ one topic for representing the mutex requests, and
     ☐ another topic for representing the ACKs
     This leads us to:
     ☐ Topic(name = MutexRequest, type = TLogicalClock, QoS = {Reliability.Reliable, History.KeepAll})
     ☐ Topic(name = MutexAck, type = TAck, QoS = {Reliability.Reliable, History.KeepAll})
  32. Show me the Code!
     ☐ All the algorithms presented were implemented using DDS and Scala
     ☐ Specifically, we've used the OpenSplice Escalier language mapping for Scala
     ☐ The resulting library has been baptized “dada” (DDS Advanced Distributed Algorithms) and is available under LGPL-v3
  33. LCMutex
     ☐ The LCMutex is one of the possible mutex protocols, implementing the Agrawala variation of the classical Lamport algorithm

     class LCMutex(val mid: Int, val gid: Int, val n: Int)(implicit val logger: Logger) extends Mutex {
       private var group = Group(gid)
       private var ts = LogicalClock(0, mid)
       private var receivedAcks = new AtomicLong(0)
       private var pendingRequests = new SynchronizedPriorityQueue[LogicalClock]()
       private var myRequest = LogicalClock.Infinite

       private val reqDW = DataWriter[TLogicalClock](LCMutex.groupPublisher(gid), LCMutex.mutexRequestTopic, LCMutex.dwQos)
       private val reqDR = DataReader[TLogicalClock](LCMutex.groupSubscriber(gid), LCMutex.mutexRequestTopic, LCMutex.drQos)
       private val ackDW = DataWriter[TAck](LCMutex.groupPublisher(gid), LCMutex.mutexAckTopic, LCMutex.dwQos)
       private val ackDR = DataReader[TAck](LCMutex.groupSubscriber(gid), LCMutex.mutexAckTopic, LCMutex.drQos)
       private val ackSemaphore = new Semaphore(0)
     }
  34. LCMutex.acquire
     ☐ Notice that as the LCMutex is single-threaded we can't issue concurrent acquires

     def acquire() {
       ts = ts.inc()
       myRequest = ts
       reqDW ! myRequest
       ackSemaphore.acquire()
     }
  35. LCMutex.release
     ☐ Notice that as the LCMutex is single-threaded we can't issue a new request before we release

     def release() {
       myRequest = LogicalClock.Infinite
       (pendingRequests dequeueAll) foreach { req =>
         ts = ts.inc()
         ackDW ! new TAck(req.id, ts)
       }
     }
  36. LCMutex.onACK

     ackDR.reactions += {
       case DataAvailable(dr) => {
         // Count only the ACKs addressed to us
         val acks = (ackDR take) filter (_.amid == mid)
         val k = acks.length
         if (k > 0) {
           // Set the local clock to max(tsi, tsj) + 1
           synchronized {
             val maxTs = math.max(ts.ts, (acks map (_.ts.ts)).max) + 1
             ts = LogicalClock(maxTs, ts.id)
           }
           val ra = receivedAcks.addAndGet(k)
           val groupSize = group.size
           // If we have received sufficiently many ACKs we can enter our mutex!
           if (ra == groupSize - 1) {
             receivedAcks.set(0)
             ackSemaphore.release()
           }
         }
       }
     }
  37. LCMutex.onReq

     reqDR.reactions += {
       case DataAvailable(dr) => {
         val requests = (reqDR take) filterNot (_.mid == mid)
         if (!requests.isEmpty) {
           synchronized {
             val maxTs = math.max((requests map (_.ts)).max, ts.ts) + 1
             ts = LogicalClock(maxTs, ts.id)
           }
           requests foreach { r =>
             if (r < myRequest) {
               // The incoming request precedes ours: ACK it right away
               ts = ts.inc()
               ackDW ! new TAck(r.mid, ts)
             } else {
               // Otherwise defer the ACK until we release
               (pendingRequests find (_ == r)).getOrElse {
                 pendingRequests.enqueue(r)
                 r
               }
             }
           }
         }
       }
     }
  38. Distributed Queue
  39. Distributed Queue Abstraction
     ☐ A distributed queue conceptually provides the ability to enqueue and dequeue elements
     ☐ Depending on the invariants that are guaranteed, the distributed queue implementation can be more or less efficient
     ☐ In what follows we'll focus on a relaxed form of distributed queue, called an Eventual Queue, which, while providing a relaxed yet very useful semantics, is amenable to high-performance implementations
  40. Eventual Queue Specification
     ☐ Invariants:
       ☐ All enqueued elements will be eventually dequeued
       ☐ Each element is dequeued once
       ☐ If the queue is empty a dequeue returns nothing
       ☐ If the queue is non-empty a dequeue might return something
       ☐ Elements might be dequeued in a different order than they are enqueued
     [Diagram: several DataWriters enqueueing into and DataReaders dequeueing from a Distributed Eventual Queue]
  45. Eventual Queue Abstraction
     ☐ A Queue can be seen as the composition of two simpler data structures, a Dequeue and an Enqueue
     ☐ The Enqueue simply allows to add elements
     ☐ The Dequeue simply allows to get elements

     trait Enqueue[T] {
       def enqueue(t: T)
     }

     trait Dequeue[T] {
       def dequeue(): Option[T]
       def sdequeue(): Option[T]
       def length: Int
       def isEmpty: Boolean = length == 0
     }

     trait Queue[T] extends Enqueue[T] with Dequeue[T]
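As a purely local, non-distributed reference, a minimal implementation of these traits might look like the sketch below (the traits are repeated so it compiles standalone; sdequeue here trivially delegates to dequeue, since there are no remote replicas to synchronize with):

```scala
import scala.collection.mutable

trait Enqueue[T] { def enqueue(t: T): Unit }
trait Dequeue[T] {
  def dequeue(): Option[T]
  def sdequeue(): Option[T]
  def length: Int
  def isEmpty: Boolean = length == 0
}
trait Queue[T] extends Enqueue[T] with Dequeue[T]

// Single-process stand-in: every enqueued element is dequeued exactly once,
// and dequeue on an empty queue returns None, as the invariants require.
class LocalQueue[T] extends Queue[T] {
  private val elems = mutable.Queue.empty[T]
  def enqueue(t: T): Unit = elems.enqueue(t)
  def dequeue(): Option[T] = if (elems.isEmpty) None else Some(elems.dequeue())
  def sdequeue(): Option[T] = dequeue() // no remote peers to coordinate with
  def length: Int = elems.length
}
```

The distributed implementations described next keep one such local queue per consumer and add the coordination protocol that preserves the "dequeued exactly once" invariant across processes.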
  46. Eventual Queue on DDS
     ☐ One approach to implementing the eventual queue on DDS is to keep a local queue on each of the consumers and to run a coordination algorithm to enforce the eventual queue invariants
     ☐ The advantage of this approach is that the latency of dequeues is minimized and the throughput of enqueues is maximized (we'll see later that the latter is really a property of the eventual queue)
     ☐ The disadvantage, for some use cases, is that consumers need to store the whole queue locally; thus this solution is best suited to symmetric environments running on LANs
  47. Eventual Queue Invariants & DDS
     ☐ The first two invariants require that we implement a distributed protocol ensuring that values are eventually picked up, and picked up only once:
       ☐ All enqueued elements will be eventually dequeued
       ☐ Each element is dequeued once
     ☐ The remaining invariants:
       ☐ If the queue is empty a dequeue returns nothing
       ☐ If the queue is non-empty a dequeue might return something
       ☐ Elements might be dequeued in a different order than they are enqueued
  48. Eventual Queue Invariants & DDS
     ☐ Since elements might be dequeued in a different order than they are enqueued, we can have a different local order for the queue elements on each consumer; this in turn means that we can distribute enqueued elements by simple DDS writes!
     ☐ The implication is that the enqueue operation is going to be as efficient as a DDS write
     ☐ Finally, to ensure eventual consistency in the presence of writer faults we'll take advantage of OpenSplice FT-Reliability!
  49. Dequeue Protocol: General Idea
     ☐ A possible dequeue protocol can be derived from the Lamport/Agrawala distributed mutual exclusion algorithm
     ☐ The general idea is similar, as we want to order dequeues as opposed to accesses to some critical section; however, there are some important details to be sorted out to ensure that we really maintain the eventual queue invariants
     ☐ Key issues to be dealt with:
       ☐ DDS provides eventual consistency, thus we might have wildly different local views of the content of the queue (not just its order but the actual elements)
       ☐ Once a process has gained the right to dequeue, it has to be sure that it picks an element that nobody else has picked just before; then it has to ensure that, before it allows anybody else to pick a value, its choice has been popped from all other local queues
  50. Topic Types
     ☐ To implement the eventual queue over DDS we use three different topic types
     ☐ TQueueCommand represents all the commands used by the protocol (more on this later)
     ☐ TQueueElement represents a writer-timestamped queue element

     struct TLogicalClock {
       long long ts;
       long mid;
     };

     enum TCommandKind {
       DEQUEUE,
       ACK,
       POP
     };

     struct TQueueCommand {
       TCommandKind kind;
       long mid;
       TLogicalClock ts;
     };
     #pragma keylist TQueueCommand

     typedef sequence<octet> TData;
     struct TQueueElement {
       TLogicalClock ts;
       TData data;
     };
     #pragma keylist TQueueElement
  51. Topics
     To implement the eventual queue we need only two topics:
     ☐ one topic for representing the queue elements
     ☐ another topic for representing all the protocol messages. Notice that the choice of a single topic for all the protocol messages was carefully made so as to ensure FIFO ordering between protocol messages
  52. Topics
     This leads us to:
     ☐ Topic(name = QueueElement, type = TQueueElement, QoS = {Reliability.Reliable, History.KeepAll})
     ☐ Topic(name = QueueCommand, type = TQueueCommand, QoS = {Reliability.Reliable, History.KeepAll})
  53. Dequeue Protocol: A Sample Run
     [Message-sequence diagram: app1 and app2 each hold a local copy of the queue; app1 dequeues element a and app2 dequeues element b by exchanging req, ack and pop commands tagged with logical clocks such as (1,1), (1,2), (2,2), (3,1), (4,1), (5,2)]
  54. Example: Producer

     object MessageProducer {
       def main(args: Array[String]) {
         if (args.length < 4) {
           println("USAGE:\n\t MessageProducer <mid> <gid> <n> <samples>")
           sys.exit(1)
         }
         val mid = args(0).toInt
         val gid = args(1).toInt
         val n = args(2).toInt
         val samples = args(3).toInt

         val group = Group(gid)
         group.reactions += {
           case MemberJoin(mid) => println("Joined M[" + mid + "]")
         }
         group.join(mid)
         group.waitForViewSize(n)

         val queue = Enqueue[String]("CounterQueue", mid, gid)
         for (i <- 1 to samples) {
           val msg = "MSG[" + mid + ", " + i + "]"
           println(msg)
           queue.enqueue(msg)
           // Pace the writes so that you can see what's going on
           Thread.sleep(300)
         }
       }
     }
  55. Example: Consumer

     object MessageConsumer {
       def main(args: Array[String]) {
         if (args.length < 4) {
           println("USAGE:\n\t MessageConsumer <mid> <gid> <readers-num> <n>")
           sys.exit(1)
         }
         val mid = args(0).toInt
         val gid = args(1).toInt
         val rn = args(2).toInt
         val n = args(3).toInt

         val group = Group(gid)
         group.reactions += {
           case MemberJoin(mid) => println("Joined M[" + mid + "]")
         }
         group.join(mid)
         group.waitForViewSize(n)

         val queue = Queue[String]("CounterQueue", mid, gid, rn)
         val baseSleep = 1000
         while (true) {
           queue.sdequeue() match {
             case Some(s) => println(Console.MAGENTA_B + s + Console.RESET)
             case _ => println(Console.MAGENTA_B + "None" + Console.RESET)
           }
           val sleepTime = baseSleep + (math.random * baseSleep).toInt
           Thread.sleep(sleepTime)
         }
       }
     }
  56. Dealing with Faults
  57. Fault-Detectors
     ☐ The algorithms presented so far can be easily extended to deal with failures by taking advantage of the group abstraction presented earlier
     ☐ The main issue to consider carefully is that if a timing assumption is violated, leading to falsely suspecting the crash of a process, the safety of some of those algorithms might be violated!
  58. Paxos
  59. Paxos in Brief
     ☐ Paxos is a protocol for state-machine replication proposed by Leslie Lamport in his “The Part-Time Parliament”
     ☐ The Paxos protocol works under asynchrony -- to be precise, it is safe under asynchrony and makes progress under partial synchrony (both are not possible under asynchrony due to FLP) -- and admits a crash/recovery failure mode
     ☐ Paxos requires some form of stable storage
     ☐ The theoretical specification of the protocol is very simple and elegant
     ☐ Practical implementations of the protocol have to fill in many hairy details...
  60. Paxos in Brief
     ☐ The Paxos protocol considers three different kinds of agents (the same process can play multiple roles):
       ☐ Proposers
       ☐ Acceptors
       ☐ Learners
     ☐ To make progress the protocol requires that a proposer acts as the leader in issuing proposals to acceptors on behalf of clients
     ☐ The protocol is safe even if there are multiple leaders; in that case progress might be sacrificed
     ☐ This implies that Paxos can use an eventual leader election algorithm to decide the distinguished proposer
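To make the roles concrete, here is a minimal single-decree sketch (our own illustration, with in-process method calls standing in for DDS topics; one proposer at a time, no failures): it shows the acceptor state and the rule that forces later rounds to re-propose an already accepted value.

```scala
// What an acceptor reports back in Phase 1B.
case class Promise(rnd: Int, vrnd: Int, vval: Option[String])

class Acceptor {
  private var rnd = 0                     // highest round promised so far
  private var vrnd = 0                    // round in which a value was accepted
  private var vval: Option[String] = None // the accepted value, if any

  // Phase 1A/1B: promise not to accept rounds below cRnd.
  def phase1A(cRnd: Int): Option[Promise] =
    if (cRnd > rnd) { rnd = cRnd; Some(Promise(rnd, vrnd, vval)) } else None

  // Phase 2A/2B: accept the proposed value unless a higher round was promised.
  def phase2A(cRnd: Int, cVal: String): Boolean =
    if (cRnd >= rnd) { rnd = cRnd; vrnd = cRnd; vval = Some(cVal); true }
    else false
}

object Synod {
  def propose(acceptors: Seq[Acceptor], cRnd: Int, myVal: String): Option[String] = {
    val promises = acceptors.flatMap(_.phase1A(cRnd))
    if (promises.size <= acceptors.size / 2) return None // no phase-1 quorum
    // Paxos rule: adopt the value accepted in the highest round, if any
    val cVal = promises.filter(_.vval.isDefined)
      .sortBy(_.vrnd).lastOption.flatMap(_.vval).getOrElse(myVal)
    val accepted = acceptors.count(_.phase2A(cRnd, cVal))
    if (accepted > acceptors.size / 2) Some(cVal) else None
  }

  def main(args: Array[String]): Unit = {
    val as = Seq.fill(3)(new Acceptor)
    assert(propose(as, 1, "a") == Some("a")) // round 1 chooses "a"
    assert(propose(as, 2, "b") == Some("a")) // safety: round 2 must keep "a"
  }
}
```

The second assertion is the essence of Paxos safety: once a value is chosen by a majority, every later round discovers it through the promises and re-proposes it instead of its own value.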
  61. Paxos Synod Protocol
     [Pseudocode from “Ring Paxos: A High-Throughput Atomic Broadcast Protocol”, DSN 2010. Notice that the pseudocode is not correct, as it suffers from progress issues in several cases; however, it illustrates the key idea of the Paxos Synod protocol]
  62. Paxos in Action
     [Diagram: clients C1..Cn, proposers P1..Pk (one acting as leader), acceptors A1..Am, learners L1..Lh]
  63. Paxos in Action -- Phase 1A
     [Diagram: the leader sends phase1A(c-rnd) to the acceptors]
  64. Paxos in Action -- Phase 1B
     [Diagram: the acceptors reply with phase1B(rnd, v-rnd, v-val)]
  65. Paxos in Action -- Phase 2A
     [Diagram: the leader sends phase2A(c-rnd, c-val) to the acceptors]
  66. Paxos in Action -- Phase 2B
     [Diagram: the acceptors reply with phase2B(v-rnd, v-val)]
  67. Paxos in Action -- Decision
     [Diagram: Decision(v-val) is delivered to the learners]
  68. Eventual Queue with Paxos
     ☐ The eventual queue specified in the previous section can be implemented using an adaptation of the Paxos protocol
     ☐ In this case, consumers don't cache the queue locally but leverage a mid-tier running the Paxos protocol to serve dequeues
     [Diagram: clients C1..Cn talk to a mid-tier of proposers P1..Pm and acceptors Ai; the learners implement the eventual queue]
  69. Summing Up
  70. Concluding Remarks
     ☐ OpenSplice DDS provides a good foundation to effectively and efficiently express some of the most important distributed algorithms, e.g. through DataWriter fault-detection and OpenSplice FT-Reliable Multicast
     ☐ dada provides access to reference implementations of many of the most important distributed algorithms
     ☐ It is implemented in Scala, but that means you can use these libraries from Java too!
  71. References
     • OpenSplice DDS (#1 OMG DDS implementation, Open Source): www.opensplice.org
     • Scala (fastest growing JVM language, Open Source): www.scala-lang.org
     • Escalier (Scala API for OpenSplice DDS, Open Source): github.com/kydos/escalier
     • simd-cxx (Simple C++ API for DDS, Open Source): github.com/kydos/simd-cxx
     • simd-java (DDS-PSM-Java for OpenSplice DDS, Open Source): github.com/kydos/simd-java
     • dada (DDS-based Advanced Distributed Algorithms Toolkit, Open Source): github.com/kydos/dada
  72. :: Connect with Us ::
     • opensplice.com • opensplice.org • forums.opensplice.org
     • opensplicedds@prismtech.com • sales@prismtech.com • crc@prismtech.com
     • @acorsaro • @prismtech
     • youtube.com/opensplicetube • slideshare.net/angelo.corsaro
  73. OpenSplice DDS
