Improving Robustness In Distributed Systems


Published in: Technology, News & Politics

  1. Improving Robustness in Distributed Systems. Per Bergqvist [email_address], Erlang User Conference 2001 (courtesy CellPoint Systems AB)
  2. Design base
       • Cluster of cooperating hosts
       • Erlang and C
       • COTS hardware based
       • Unix based (i.e. Solaris or Linux)
       • 10/100/1000BASE-T backplane ("system area network")
  3. Cluster
       • Shared, distributed system configuration
       • Each host has ONE cluster controller
       • Cluster controllers dispatch and supervise worker tasks
       • Master cluster controller: holds the configuration database (persistent replica)
       • Slave cluster controller: gets its configuration from the master cluster controllers
       • Cluster is DOWN when all master cluster controllers are inaccessible
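The master/slave rule on this slide can be sketched as a small model. This is only an illustration in Python, not the Erlang implementation from the talk; the `Controller` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    """One cluster controller per host (hypothetical record for illustration)."""
    host: str
    role: str        # "master" or "slave"
    reachable: bool


def cluster_is_up(controllers):
    """The cluster counts as up while at least one master controller is reachable."""
    return any(c.role == "master" and c.reachable for c in controllers)


def config_sources(controllers):
    """Hosts whose master controller a slave could currently fetch configuration from."""
    return [c.host for c in controllers if c.role == "master" and c.reachable]
```

Note that reachable slave controllers do not keep the cluster up on their own: with every master inaccessible, `cluster_is_up` returns `False` regardless of how many slaves respond.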
  4. Typical system (diagram; labels: Firewall, Switch, Traffic Control)
  5. Cluster Key Benefits
       • Single system view
       • Enforces decoupling of parts of O&M from actual traffic processing
  6. Implementing a cluster
       • Cluster->Host->Node->NodeData
       • Cluster global parameters
       • Subscription mechanisms for configuration changes
       • Mnesia as configuration database on master cluster controllers
       • Home-brewed configuration distribution to slave controllers (NOT using Mnesia)
       • (Worker) node supervision
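A minimal sketch of the Cluster->Host->Node->NodeData hierarchy with a subscription mechanism might look as follows. It is written in Python purely for illustration; the `ConfigStore` class, its methods, and the tuple-path layout are assumptions, not the design presented in the talk.

```python
from collections import defaultdict


class ConfigStore:
    """Hierarchical configuration keyed by path tuples such as
    (cluster, host, node, key). Hypothetical structure for illustration."""

    def __init__(self):
        self._data = {}                         # full path -> value
        self._subscribers = defaultdict(list)   # path prefix -> callbacks

    def subscribe(self, prefix, callback):
        """Call callback(path, old, new) for every change at or below prefix."""
        self._subscribers[tuple(prefix)].append(callback)

    def get(self, path, default=None):
        return self._data.get(tuple(path), default)

    def set(self, path, value):
        path = tuple(path)
        old = self._data.get(path)
        if old == value:
            return                              # no change, so no notification
        self._data[path] = value
        # Notify subscribers on every prefix, from the root down to the full path.
        for i in range(len(path) + 1):
            for callback in self._subscribers.get(path[:i], []):
                callback(path, old, value)
```

A cluster global parameter is then simply a short path such as `("cluster1", "log_level")`, while per-node data uses the full four-level path. A worker interested in one host subscribes with the prefix `("cluster1", "hostA")` and is not notified about changes on other hosts.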
  7. Mnesia gotchas
       • First distributed node startup
           - Disallow writes when not all replicas are accessible
           - Use a timeout on table load, then force load
  8. ... BUT ...
       • TCP based distribution
       • Network partitioning
  9. Network parameters
       • Align TCP retransmission intervals with Erlang heartbeats
       • Align TCP and IP rerouting parameters
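To see why the two timers need aligning, the following rough model compares a TCP exponential-backoff schedule with the Erlang heartbeat window. It assumes the documented `net_ticktime` behaviour (an unresponsive node is detected within the tick time plus or minus one quarter) and a simplified doubling retransmission timer; real stacks differ in detail, and all function names and example values are hypothetical.

```python
def tick_window(net_ticktime):
    """Earliest and latest time (seconds) at which a silent peer is declared down."""
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


def retransmit_schedule(initial_rto, max_rto, count):
    """Times (seconds after the first send) of each retransmission,
    assuming the timeout doubles each time and is capped at max_rto."""
    times, elapsed, rto = [], 0.0, initial_rto
    for _ in range(count):
        elapsed += min(rto, max_rto)
        times.append(elapsed)
        rto *= 2
    return times


def recovery_time(outage, schedule):
    """First retransmission at or after the end of the outage, or None."""
    for t in schedule:
        if t >= outage:
            return t
    return None


def survives(outage, schedule, net_ticktime):
    """True if traffic resumes before the earliest possible heartbeat timeout."""
    earliest_timeout, _ = tick_window(net_ticktime)
    t = recovery_time(outage, schedule)
    return t is not None and t < earliest_timeout
```

With a 60 s tick time and an uncapped backoff starting at 1 s, a 32 s outage is not survived in this model: the next retransmission only happens at 63 s, after the peer may already have been declared down at 45 s. Capping the retransmission timeout at 8 s brings the next attempt to 39 s, inside the window.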
  10. Typical system II: Dual backplane (diagram; labels: Firewall, Switch, Traffic Control)
  11. Erlang multi-homing problem (diagram; labels: Host A, Host B, Host C)
  12. Multi-home Erlang with TCP
       • Add an alias interface to the loopback interface
       • Patch the TCP distribution to bind to the alias
       • Publish the alias interface via (all wanted) real hardware interfaces
           - Method 1: Static routes and gratuitous/proxy ARP
           - Method 2: Use a new (routing) protocol
  13. ARP method
       • Implement a utility to:
           - broadcast unsolicited ARP responses
           - respond to ARP requests for the alias interface address
       • Add static routes on all far end systems
       • NOTE: all real interfaces need to be on the same IP subnet
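For reference, an unsolicited (gratuitous) ARP response of the kind such a utility would broadcast can be built as below. This Python sketch only constructs the 42-byte Ethernet frame following the standard ARP layout; the function names and example addresses are hypothetical, and the talk does not say how its utility was implemented.

```python
import socket
import struct


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def gratuitous_arp_reply(alias_ip, iface_mac):
    """Build an Ethernet frame announcing that alias_ip is reachable
    at the MAC address of a real interface."""
    broadcast = b"\xff" * 6
    sender_mac = mac_bytes(iface_mac)
    ip = socket.inet_aton(alias_ip)
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = broadcast + sender_mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4, opcode 2 (reply).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    # Sender and target protocol address are both the alias address.
    arp_body = sender_mac + ip + broadcast + ip
    return eth_header + arp_header + arp_body
```

Actually transmitting the frame needs a raw link-layer socket and the privileges that go with it, which is platform specific and left out here. For example, `gratuitous_arp_reply("10.0.0.1", "02:00:00:00:00:01")` yields a frame that announces the alias address 10.0.0.1 on the interface with that MAC.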
  14. New routing protocol
       • Broadcast (Ethernet frames) what you have, including interface priority
       • Let the far end select a path based on what/when they receive
       • Far end dynamically sets up host routes
       • Use short retransmission intervals
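The far-end selection step can be sketched as follows: keep the latest advertisement per (alias, interface) pair, ignore those that have gone stale, and pick the highest-priority path that is still fresh. This is a Python illustration under assumptions; the `Advert` fields, the "higher priority wins" rule, and the `max_age` parameter are hypothetical, since the slides do not specify the protocol details.

```python
from dataclasses import dataclass


@dataclass
class Advert:
    alias: str       # loopback alias address being advertised
    via: str         # real interface the advertisement arrived on
    priority: int    # assumption: higher value is preferred
    seen_at: float   # receive time in seconds


class RouteSelector:
    """Chooses one host route per alias from periodically broadcast adverts."""

    def __init__(self, max_age):
        self.max_age = max_age
        self._adverts = {}   # (alias, via) -> most recent Advert

    def receive(self, alias, via, priority, now):
        self._adverts[(alias, via)] = Advert(alias, via, priority, now)

    def host_routes(self, now):
        """Return {alias: via} using only adverts not older than max_age."""
        best = {}
        for ad in self._adverts.values():
            if now - ad.seen_at > self.max_age:
                continue                      # path has gone silent
            cur = best.get(ad.alias)
            if cur is None or (ad.priority, ad.seen_at) > (cur.priority, cur.seen_at):
                best[ad.alias] = ad
        return {alias: ad.via for alias, ad in best.items()}
```

Short retransmission intervals matter here because `max_age` can only be a small multiple of the broadcast interval: the shorter the interval, the sooner a dead path ages out and the host route moves to the surviving interface.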
  15. Erlang multi-homing resolved? (diagram; labels: Host A, Host B, Host C)
  16. Summing up
       • Erlang can support multi-homing with some additional work
       • By using a loopback alias interface, link failure becomes a routing problem (the peer-to-peer association is kept intact)
       • Solaris TCP/IP stack parameters are:
           - hard to find (only in out-of-date application notes)
           - hard to set "right"
           - host global
       • A distribution mechanism with built-in support for multi-homing is preferred
  17. Erlang Distribution over SCTP. Per Bergqvist et al. [email_address], Erlang User Conference 2002