Failover and takeover contingency mechanisms for network partition and node failure

Properly defining suitable mechanisms to cope with network partition and to recover from node failure is among the most common problems when designing and implementing a fault-tolerant distributed system. The concern is even more serious when the failure scenarios could not be predicted beforehand and are only detected once the system has reached the deployment stage.

A number of decisions must be made when choosing the right contingency mechanisms to deal with these distribution-bound problems. The factors to take into account include not only the technology in use, the node layout, the message protocol, the properties of the messages to be exchanged, and desired or demanded features such as latency and bandwidth, but also the reliability of the communications network and even the hardware on which the system runs.

In this paper we present ADVERTISE, a distributed system for advertisement transmission to on-customer-home set-top boxes (STBs) over a Digital TV network (iDTV) of a cable operator. We use this system as a case study to explain how we addressed the aforementioned problems, and present a set of good practices that can be extrapolated to comparable systems.


Usage Rights: CC Attribution-NonCommercial-ShareAlike License

Presentation Transcript

  • Failover and Takeover Contingency Mechanisms for Network Partition and Node Failure
    Macías López, Laura M. Castro, David Cabrero
    MADS Research Group – Universidade da Coruña (Spain)
    Erlang Workshop, Copenhagen, 14th September 2012
  • Why are we (all) here?
  • Why are we (presenting this work) here? Concurrency! High availability! Distribution!
  • Why are we (presenting this work) here? Unexpected problems after deployment! Node failures! System failure!
  • Outline
    1 The system
    2 The problems at deployment
    3 The solution
    4 Final remarks
  • The system: ADVERTISE
    Distributed system for advertisement transmission to on-customer-home set-top boxes (STBs) over a Digital TV network (iDTV) of a cable operator
  • The system: ADVERTISE’s requirements
    Ensure the appropriate coordination of advertising mechanisms:
      • compilation of events
      • emission of advertising signals to STBs during a period of time
      • recording hits (displays) of a specific piece of advertisement
    Major challenge: management of the size of the communications network, given the growing number of the operator’s customers (∼100,000)
  • The system: ADVERTISE’s architecture
  • The system: ADVERTISE’s structure
  • The system: ADVERTISE as an Erlang distributed application
    To meet its requirements, ADVERTISE was designed as a distributed application over several nodes
  • The problems at deployment: the symptoms
    The ADVERTISE deployment environment presented some particularities that had not been foreseen:
      • some nodes showed a tendency to fail more often than others
      • network partition was common during some time periods (noon, night)
    In this situation, fault tolerance requirements were not met!
  • The problems at deployment: the diagnosis
    ADVERTISE was developed and tested over several physical machines.
    ADVERTISE was deployed over several virtual machines:
      • running on a single physical machine
      • using a shared hard disk
      • sharing the network link
      • sharing with other apps/VMs
    Frequent saturation of shared resources was perceived by ADVERTISE nodes as short network partitions.
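The diagnosis can be illustrated with a small, language-neutral sketch in Python (not ADVERTISE code). It assumes a heartbeat-style failure detector: if a peer stays silent for longer than a timeout, it is treated as unreachable until the next heartbeat arrives. The function name, the arrival times and the 4-second timeout are all hypothetical.

```python
from typing import List, Tuple


def perceived_partitions(heartbeats: List[float], timeout: float) -> List[Tuple[float, float]]:
    """Return (start, end) intervals during which a peer looks unreachable.

    heartbeats: sorted arrival times, in seconds, of heartbeats from one peer.
    timeout: longest silence tolerated before the peer is declared down
             (a hypothetical parameter, not taken from the slides).
    """
    gaps = []
    for prev, curr in zip(heartbeats, heartbeats[1:]):
        if curr - prev > timeout:
            # The peer is declared down when the timeout expires
            # and is seen again when the delayed heartbeat arrives.
            gaps.append((prev + timeout, curr))
    return gaps


# A saturated shared disk or link delays one heartbeat by several seconds.
arrivals = [0.0, 1.0, 2.0, 7.5, 8.5, 9.5]
print(perceived_partitions(arrivals, timeout=4.0))
```

The peer never actually failed here. A single delayed heartbeat is enough for the detector to report a short partition, which is the effect the slide attributes to shared-resource saturation.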
  • The problems at deployment: the consequences
    If nodes lose connectivity, believe that all the others are down, and assume system functions, there are likely to be inconsistencies when connectivity is restored (duplicated responsibilities, data inconsistencies).
    Perceived network partitions led to cascading failovers: duplicated registration of global names, random killing of conflicting processes, and overflow and eventual stop of the supervision mechanisms.
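A minimal sketch of how duplicated responsibilities can arise, written in Python for illustration only. It assumes each side of a partition independently registers the same global name on one of its own nodes. The election rule (lowest node name wins) and the names `node_a`, `node_b`, `node_c` and `global_supervisor` are hypothetical.

```python
from typing import Dict, List, Set


def elect_owners(partitions: List[Set[str]], name: str) -> Dict[str, str]:
    """Each partition assumes the others are down and registers `name`
    on its own lowest-sorted node (a hypothetical election rule)."""
    return {min(part): name for part in partitions}


def holders_on_heal(owners: Dict[str, str], name: str) -> List[str]:
    """Nodes holding `name` once connectivity is restored.
    More than one holder means a conflict has to be resolved."""
    return sorted(node for node, held in owners.items() if held == name)


# One node is cut off from the other two.
partitions = [{"node_a"}, {"node_b", "node_c"}]
owners = elect_owners(partitions, "global_supervisor")
holders = holders_on_heal(owners, "global_supervisor")
print(holders)
```

With two holders for the same name after the partition heals, one of them has to be stopped. This is the kind of conflict the slide reports as leading to killed processes and overloaded supervision.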
  • The solution
    For ADVERTISE, data consistency was more important than availability:
      • the system could not afford that advertising campaigns, rules, or media were lost or became inconsistent
      • instead, it was acceptable that no ads were sent to STBs (or that they were delayed)
    The solution: we re-designed ADVERTISE to be deployed over a minimum of 3 nodes, and never on an isolated node
  • The solution: ADVERTISE initialisation
  • The solution: ADVERTISE boot
  • The solution: node integrity check
    1 Retrieve the last known population of active nodes, List_actives
    2 Retrieve the list of all ADVERTISE nodes from the configuration, List_all
    3 Filter List_all, removing ping-unreachable nodes
    4 If (filtered(List_all) = List_actives) ∧ (|List_actives| = 1), ADVERTISE is suspended immediately, and the node is rebooted once connectivity is restored
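The four steps of the node integrity check can be sketched as follows, in Python and only as an illustration (ADVERTISE itself is an Erlang application, and this is not its code). The function name, the `is_reachable` callback and the returned strings are hypothetical. The sketch also assumes the local node counts as reachable from itself, so an isolated node sees exactly one reachable node.

```python
from typing import Callable, Iterable


def node_integrity_check(
    last_actives: Iterable[str],
    all_nodes: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> str:
    """Return "suspend" if this node appears to be isolated, else "continue".

    last_actives: last known population of active nodes (step 1).
    all_nodes: every configured node (step 2).
    is_reachable: hypothetical ping-like predicate used for step 3.
    """
    actives = set(last_actives)
    # Step 3: drop nodes that do not answer a ping.
    reachable = {node for node in all_nodes if is_reachable(node)}
    # Step 4: the reachable set equals the active set and has a single member.
    if reachable == actives and len(actives) == 1:
        return "suspend"
    return "continue"


configured = ["n1", "n2", "n3"]
# Only the local node n1 answers: the node is isolated.
print(node_integrity_check(["n1"], configured, lambda n: n == "n1"))
# Every node answers and all three are known to be active.
print(node_integrity_check(configured, configured, lambda n: True))
```

Suspending rather than taking over is what favours consistency here: an isolated node never assumes system functions on its own.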
  • The solution: distributed AC check
    1 The DAC is queried on all nodes, to get the PID of ADVERTISE’s local supervisor
    2 If ∃ n ∈ List_all for which the PID of ADVERTISE’s local supervisor could not be retrieved, node failure is assumed
      1 If n ∈ List_actives, it means it replies to ping from the global supervisor but cannot reach others; after a timeout:
        1 If n ∉ List_actives, node failure is confirmed
        2 If n ∈ List_actives, the node is up and we reboot it
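A rough sketch of the decision logic in this check, in Python for illustration only. It assumes that, after the timeout, a node absent from the active list is confirmed as failed, while a node still present is considered up and is rebooted. The function name, the verdict strings and the dictionary of PIDs are hypothetical. How the DAC is actually queried is not shown.

```python
from typing import Dict, Iterable, Optional, Set


def dac_check(
    all_nodes: Iterable[str],
    sup_pids: Dict[str, Optional[str]],
    actives_before: Set[str],
    actives_after_timeout: Set[str],
) -> Dict[str, str]:
    """Classify each node from the outcome of the supervisor PID query.

    sup_pids maps a node to the PID of its local supervisor, or None when
    the PID could not be retrieved (hypothetical representation).
    """
    verdicts = {}
    for node in all_nodes:
        if sup_pids.get(node) is not None:
            verdicts[node] = "ok"
        elif node not in actives_before:
            # No PID and not known as active: failure is assumed (step 2).
            verdicts[node] = "failure_assumed"
        elif node in actives_after_timeout:
            # Still answers ping after the timeout: the node is up, reboot it.
            verdicts[node] = "reboot"
        else:
            verdicts[node] = "failure_confirmed"
    return verdicts


result = dac_check(
    all_nodes=["n1", "n2", "n3", "n4"],
    sup_pids={"n1": "<0.42.0>", "n2": None, "n3": None, "n4": None},
    actives_before={"n1", "n2", "n3"},
    actives_after_timeout={"n1", "n2"},
)
print(result)
```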
  • The solution: current ADVERTISE deployment
      • Cluster of 3 virtual nodes, handling an average of 18K STBs per node, with peaks of 23K STBs during prime time
      • Our tests reached a maximum of 45K STBs per node
      • System running with no incidents reported in the last 4 months
      • Most intensive advertising campaign was a 2-month promotion: displayed over 66 million times, with a peak of 140K times in 1 hour
      • An average campaign can be displayed a total of 500K times, with peaks of up to 30K in 1 hour during prime time on Saturday night
  • Final remarks: lessons learned
    When designing a distributed Erlang app, one must take into account:
      • Network security
      • Network reliability
      • Network topology
      • Latency of requests
      • Heterogeneity of components
      • Bandwidth
      • Scalability
  • Final remarks: your mileage may vary!
    Had ADVERTISE’s requirements been substantially different, we would probably have favoured availability over consistency, for instance. And that would be a different story...
  • Questions? Thanks!
    Some images and icons were downloaded from: openclipart.org