Improving Robustness In Distributed Systems

  • 1. Improving Robustness in Distributed Systems
    • Per Bergqvist [email_address]
    • Erlang User Conference 2001 (courtesy CellPoint Systems AB)
  • 2. Design base
    • Cluster of cooperating hosts
    • Erlang and C
    • COTS hardware based
    • Unix based (i.e. Solaris or Linux)
    • 10/100/1000 base-T back plane ("system area network")
  • 3. Cluster
    • Shared, distributed, system configuration
    • Each host has ONE cluster controller
    • Dispatch and supervise worker tasks
    • Master cluster controller: holds configuration database (persistent replica)
    • Slave cluster controller: gets configuration from master cluster controllers
    • Cluster is DOWN when all master cluster controllers are inaccessible
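
The DOWN rule on this slide can be sketched as a small predicate. This is a minimal illustration, not the original implementation: the `cluster_controller` record, the `cc_role` enum, and the `reachable` flag are hypothetical names, and how reachability is actually determined (heartbeats, supervision) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical role of a cluster controller (names are illustrative). */
enum cc_role { CC_MASTER, CC_SLAVE };

/* Hypothetical per-host record: one cluster controller per host. */
struct cluster_controller {
    enum cc_role role;
    bool reachable;   /* last known reachability, however it is measured */
};

/* The cluster is DOWN when no master cluster controller is accessible.
 * Slaves do not count: they only hold configuration fetched from masters. */
bool cluster_is_down(const struct cluster_controller *cc, size_t n)
{
    assert(cc != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cc[i].role == CC_MASTER && cc[i].reachable)
            return false;   /* at least one master still serves the config */
    }
    return true;
}
```

Note that under this reading a cluster with reachable slaves but no reachable master is still DOWN.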
  • 4. Typical system [diagram; labels: Firewall Switch Traffic Control]
  • 5. Cluster Key Benefits
    • Single system view
    • Enforces decoupling of parts of O&M from actual traffic processing
  • 6. Implementing a cluster
    • Cluster->Host->Node->NodeData
    • Cluster global parameters
    • Subscription mechanisms for conf. changes
    • Mnesia as configuration database on master cluster controllers
    • Home-brewed configuration distribution to slave controllers (NOT using Mnesia)
    • (Worker) node supervision
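
A subscription mechanism for configuration changes could look roughly like the callback registry below. This is a sketch under assumptions: the slides do not describe the real interface, so `conf_bus`, `conf_subscribe`, `conf_publish`, and the string key/value model are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SUBS 16   /* arbitrary limit for the sketch */

/* Callback invoked when a subscribed parameter changes. */
typedef void (*conf_cb)(const char *key, const char *value, void *ctx);

struct subscription { const char *key; conf_cb cb; void *ctx; };

struct conf_bus {
    struct subscription subs[MAX_SUBS];
    size_t count;
};

/* Register interest in one parameter. Returns 0, or -1 if the table is full.
 * The key pointer is stored as-is, so it must outlive the subscription. */
int conf_subscribe(struct conf_bus *bus, const char *key, conf_cb cb, void *ctx)
{
    assert(bus && key && cb);
    if (bus->count == MAX_SUBS)
        return -1;
    bus->subs[bus->count++] = (struct subscription){ key, cb, ctx };
    return 0;
}

/* Notify every subscriber of `key`; returns how many were called. */
size_t conf_publish(const struct conf_bus *bus, const char *key, const char *value)
{
    assert(bus && key);
    size_t notified = 0;
    for (size_t i = 0; i < bus->count; i++) {
        if (strcmp(bus->subs[i].key, key) == 0) {
            bus->subs[i].cb(key, value, bus->subs[i].ctx);
            notified++;
        }
    }
    return notified;
}

/* Example subscriber: counts change notifications in an int behind ctx. */
void count_change(const char *key, const char *value, void *ctx)
{
    (void)key;
    (void)value;
    ++*(int *)ctx;
}
```

A worker would call `conf_subscribe` once at startup and react in its callback, instead of polling the configuration.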
  • 7. Mnesia gotchas
    • First distributed node startup
      • Disallow writes unless all replicas are accessible
      • Use a timeout on table load, then force load
  • 8. ... BUT ...
    • TCP based distribution
    • Network partitioning
  • 9. Network parameters
    • Align TCP retransmission intervals w/ Erlang heartbeats
    • Align TCP and IP rerouting parameters
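
One plausible reading of "align" is that TCP should get several retransmission attempts in before the Erlang heartbeat can declare the peer down. The arithmetic below is a simplified model, not a description of any real stack: it assumes plain exponential backoff with a cap, and the function names are hypothetical. It uses the documented Erlang behaviour that a silent node is detected within `net_ticktime` plus or minus a quarter of it (default 60 s).

```c
#include <assert.h>

/* Earliest point (ms) at which Erlang may declare a silent node down:
 * net_ticktime minus one quarter of it. */
unsigned long tick_min_detect_ms(unsigned long net_ticktime_ms)
{
    return net_ticktime_ms - net_ticktime_ms / 4;
}

/* Number of TCP retransmissions sent strictly before deadline_ms, assuming
 * the first retransmit fires after rto_initial_ms and the timeout then
 * doubles each time, capped at rto_max_ms (simplified backoff model). */
unsigned retransmits_before(unsigned long deadline_ms,
                            unsigned long rto_initial_ms,
                            unsigned long rto_max_ms)
{
    assert(rto_initial_ms > 0 && rto_max_ms >= rto_initial_ms);
    unsigned count = 0;
    unsigned long t = rto_initial_ms;    /* time of the next retransmit */
    unsigned long rto = rto_initial_ms;  /* current timeout */
    while (t < deadline_ms) {
        count++;
        rto = (rto * 2 > rto_max_ms) ? rto_max_ms : rto * 2;
        t += rto;
    }
    return count;
}
```

With the default 60 s tick time the earliest detection is 45 s; under this model an initial timeout of 1 s gives five retransmissions in that window, while 3 s gives only three, which is the kind of gap the slide suggests tuning.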
  • 10. Typical system II: Dual back plane [diagram; labels: Firewall Switch Traffic Control]
  • 11. Erlang multi-homing problem [diagram; labels: Host A, Host B, Host C]
  • 12. Multi-home Erlang w/ TCP
    • Add an alias interface to the loopback i/f
    • Patch TCP distribution to bind to the alias
    • Publish the alias interface via (all wanted) real hw i/f's
      • Method 1: Static routes and gratuitous/proxy ARP
      • Method 2: Use new (routing) protocol
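
At the socket level, "bind to alias" means fixing the local address before connecting, so the peer sees the alias rather than whichever hardware interface the route happens to use. The sketch below shows that idea in C only; it is not the Erlang distribution patch the slide refers to, and `alias_sockaddr` and `connect_from_alias` are hypothetical helper names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill `out` with the alias address and port 0 (kernel picks the port).
 * Returns 0 on success, -1 if alias_ip is not a valid IPv4 literal. */
int alias_sockaddr(const char *alias_ip, struct sockaddr_in *out)
{
    assert(alias_ip && out);
    memset(out, 0, sizeof *out);
    out->sin_family = AF_INET;
    out->sin_port = 0;
    return inet_pton(AF_INET, alias_ip, &out->sin_addr) == 1 ? 0 : -1;
}

/* Open a TCP connection to `peer` whose source address is the alias.
 * Returns the connected socket, or -1 on any failure. */
int connect_from_alias(const char *alias_ip, const struct sockaddr_in *peer)
{
    struct sockaddr_in local;
    if (alias_sockaddr(alias_ip, &local) != 0)
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* bind() before connect() pins the local end to the alias address. */
    if (bind(fd, (const struct sockaddr *)&local, sizeof local) != 0 ||
        connect(fd, (const struct sockaddr *)peer, sizeof *peer) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because the association is tied to the alias, a failed hardware link changes only the route taken, not the endpoints of the connection.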
  • 13. ARP method
    • Implement a utility to:
      • broadcast unsolicited ARP responses
      • respond to ARP requests for the alias i/f address
    • Add static routes on all far end systems
    • NOTE: all real i/f's need to be on the same IP subnet
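
The unsolicited ARP response such a utility broadcasts is a fixed 42-byte Ethernet frame. The sketch below only builds the frame in memory, following the standard ARP packet layout. Sending it needs a raw or packet socket and privileges, which is platform specific and omitted. The function name is hypothetical, and setting the target hardware address to broadcast is one common convention, assumed here rather than taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARP_FRAME_LEN 42   /* 14-byte Ethernet header + 28-byte ARP payload */

/* Build a broadcast ARP reply announcing that alias_ip is at `mac`.
 * Returns the number of bytes written (always ARP_FRAME_LEN). */
size_t build_unsolicited_arp_reply(uint8_t frame[ARP_FRAME_LEN],
                                   const uint8_t mac[6],
                                   const uint8_t alias_ip[4])
{
    assert(frame && mac && alias_ip);
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
    uint8_t *p = frame;

    /* Ethernet header */
    memcpy(p, bcast, 6); p += 6;        /* destination: broadcast */
    memcpy(p, mac, 6);   p += 6;        /* source: the real hw i/f */
    *p++ = 0x08; *p++ = 0x06;           /* EtherType: ARP */

    /* ARP payload */
    *p++ = 0x00; *p++ = 0x01;           /* hardware type: Ethernet */
    *p++ = 0x08; *p++ = 0x00;           /* protocol type: IPv4 */
    *p++ = 6;                           /* hardware address length */
    *p++ = 4;                           /* protocol address length */
    *p++ = 0x00; *p++ = 0x02;           /* operation: reply */
    memcpy(p, mac, 6);      p += 6;     /* sender hardware address */
    memcpy(p, alias_ip, 4); p += 4;     /* sender protocol address */
    memcpy(p, bcast, 6);    p += 6;     /* target hardware address */
    memcpy(p, alias_ip, 4); p += 4;     /* target protocol address */

    return (size_t)(p - frame);
}
```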
  • 14. New routing protocol
    • Broadcast (Ethernet frames) what you have, including interface priority
    • Let the far end select path based on what/when they receive
    • Far end dynamically sets up host routes
    • Use short retransmission intervals
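
The far-end selection step can be sketched as "pick the highest-priority interface that was heard from recently". Everything concrete here is an assumption for illustration: the `advert` record, the rule that a larger priority value wins, and the hold time after which an advertisement counts as stale. The slides give neither the wire format nor the exact selection rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the latest advertisement heard per remote i/f. */
struct advert {
    int ifindex;                  /* local interface it arrived on */
    int priority;                 /* advertised priority; larger is preferred */
    unsigned long last_heard_ms;  /* local receive time */
};

/* Return the index of the preferred fresh advertisement, or -1 if every
 * entry is older than hold_ms (no usable path). Entries time-stamped in
 * the future are rejected by the assert below. */
int select_path(const struct advert *ads, size_t n,
                unsigned long now_ms, unsigned long hold_ms)
{
    assert(ads != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        assert(ads[i].last_heard_ms <= now_ms);
        if (now_ms - ads[i].last_heard_ms > hold_ms)
            continue;   /* stale: the link or the sender is presumed gone */
        if (best < 0 || ads[i].priority > ads[best].priority)
            best = (int)i;
    }
    return best;
}
```

The caller would then install a host route for the alias address via the selected interface. Short broadcast intervals let `hold_ms` be short as well, which bounds how long a dead path stays selected.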
  • 15. Erlang multi-homing resolved? [diagram; labels: Host A, Host B, Host C]
  • 16. Summing up
    • Erlang can support multihoming with some additional work
    • By using a loopback alias i/f, link failure becomes a routing problem (peer-peer association is kept intact)
    • Solaris TCP/IP stack parameters are:
      • hard to find (only in out-of-date app. notes)
      • hard to set "right"
      • host global
    • A distribution mechanism with built-in support for multi-homing is preferred
  • 17. Erlang Distribution over SCTP
    • Per Bergqvist et al [email_address]
    • Erlang User Conference 2002