Improving Robustness In Distributed Systems
Transcript

  • 1. Improving Robustness in Distributed Systems Per Bergqvist [email_address] Erlang User Conference 2001 (courtesy CellPoint Systems AB)
  • 2. Design base
    • Cluster of cooperating hosts
    • Erlang and C
    • COTS hardware based
    • Unix-based (e.g. Solaris or Linux)
    • 10/100/1000BASE-T backplane ("system area network")
  • 3. Cluster
    • Shared, distributed, system configuration
    • Each host has ONE cluster controller
    • Dispatch and supervise worker tasks
    • Master cluster controller: holds configuration database (persistent replica)
    • Slave cluster controller: gets configuration from master cluster controllers
    • Cluster is DOWN when all master cluster controllers are inaccessible
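The availability rule on this slide can be written down as a small predicate. This is only a sketch: the `Controller` record, the `Role` enum and the `reachable` flag are hypothetical names introduced for illustration, not part of the original system.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MASTER = "master"  # holds a persistent replica of the configuration database
    SLAVE = "slave"    # gets its configuration from the master controllers


@dataclass(frozen=True)
class Controller:
    host: str        # one cluster controller per host
    role: Role
    reachable: bool  # hypothetical liveness flag, e.g. derived from heartbeats


def cluster_is_up(controllers):
    """The cluster is DOWN only when no master cluster controller is reachable."""
    return any(c.role is Role.MASTER and c.reachable for c in controllers)
```

Under this rule a cluster with several masters survives the loss of all but one of them, while reachable slave controllers alone never keep it up.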
  • 4. Typical system (diagram: Firewall, Switch, Traffic Control)
  • 5. Cluster Key Benefits
    • Single system view
    • Enforces decoupling of parts of O&M from actual traffic processing
  • 6. Implementing a cluster
    • Cluster->Host->Node->NodeData
    • Cluster global parameters
    • Subscription mechanisms for conf. changes
    • Mnesia as configuration database on master cluster controllers
    • Home-brewed configuration distribution to slave controllers (NOT using Mnesia)
    • (Worker) node supervision
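One way to picture the Cluster->Host->Node->NodeData hierarchy together with change subscriptions is the sketch below. It is an in-memory illustration only; the class `ConfigStore` and its method names are assumptions, and the real system kept this data in Mnesia on the master controllers.

```python
from collections import defaultdict


class ConfigStore:
    """Cluster-global parameters plus nested host -> node -> node data."""

    def __init__(self):
        self.global_params = {}                # cluster global parameters
        self.hosts = {}                        # host -> node -> {key: value}
        self._subscribers = defaultdict(list)  # (host, node) -> callbacks

    def subscribe(self, host, node, callback):
        """Register callback(host, node, key, old, new) for one node's data."""
        self._subscribers[(host, node)].append(callback)

    def set_node_data(self, host, node, key, value):
        node_data = self.hosts.setdefault(host, {}).setdefault(node, {})
        old = node_data.get(key)
        if old == value:
            return  # unchanged, so subscribers are not notified
        node_data[key] = value
        for callback in self._subscribers[(host, node)]:
            callback(host, node, key, old, value)
```

A worker would subscribe to its own (host, node) pair and reconfigure itself in the callback instead of polling the configuration.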
  • 7. Mnesia gotchas
    • First distributed node startup
      • Disallow writes when not all replicas are accessible
      • Use timeout on table load and force load
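The two rules above can be sketched as a small policy. In Mnesia itself the relevant calls are `mnesia:wait_for_tables/2` and `mnesia:force_load_table/1`; the Python below only models the decision logic, and the callables `wait_for_tables` and `force_load_table` are hypothetical stand-ins passed in by the caller.

```python
def write_allowed(replica_nodes, reachable_nodes):
    """Permit writes only while every replica of the table is reachable."""
    return set(replica_nodes) <= set(reachable_nodes)


def load_tables(tables, wait_for_tables, force_load_table, timeout_s):
    """Wait up to timeout_s for all tables, then force-load the stragglers.

    wait_for_tables(tables, timeout_s) is assumed to return the list of
    tables that did NOT load in time (an empty list on success).
    Returns the tables that had to be force-loaded.
    """
    pending = wait_for_tables(tables, timeout_s)
    forced = []
    for table in pending:
        force_load_table(table)  # accept the local replica, which may be stale
        forced.append(table)
    return forced
```

Force loading trades consistency for availability, which is presumably why the slide pairs it with refusing writes whenever a replica is missing.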
  • 8. ... BUT ...
    • TCP based distribution
    • Network partitioning
  • 9. Network parameters
    • Align TCP retransmission intervals w/ Erlang heartbeats
    • Align TCP and IP rerouting parameters
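To see what aligning the two timers involves, it helps to compute both. The sketch below uses a simplified TCP model (the timeout doubles after each retransmission, up to a cap); the parameter names are illustrative and real stacks differ in detail. The heartbeat window follows the documented behaviour of the Erlang `kernel` parameter `net_ticktime`: an unresponsive node is detected after between 0.75 and 1.25 tick times.

```python
def tcp_give_up_time(initial_rto_s, max_rto_s, max_retransmits):
    """Seconds spent waiting before TCP gives up, with exponential backoff.

    One wait follows the original send and one follows each retransmission.
    """
    total, rto = 0.0, float(initial_rto_s)
    for _ in range(max_retransmits + 1):
        total += rto
        rto = min(rto * 2, max_rto_s)
    return total


def heartbeat_detection_window(net_ticktime_s):
    """(earliest, latest) time at which Erlang declares a silent node down."""
    return 0.75 * net_ticktime_s, 1.25 * net_ticktime_s


def first_to_fire(tcp_give_up_s, window):
    """Which mechanism notices a dead peer first: 'tcp', 'heartbeat' or 'either'."""
    earliest, latest = window
    if tcp_give_up_s < earliest:
        return "tcp"
    if tcp_give_up_s > latest:
        return "heartbeat"
    return "either"
```

For example, with a 1 s initial timeout, a 60 s cap and 5 retransmissions the TCP side waits 63 s, which falls inside the 45 to 75 s window of the default 60 s `net_ticktime`.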
  • 10. Typical system II: Dual back plane (diagram: Firewall, Switch, Traffic Control)
  • 11. Erlang multi-homing problem (diagram: Host A, Host B, Host C)
  • 12. Multi-home Erlang w/ TCP
    • Add an alias interface to loop back i/f
    • Patch tcp distribution to bind to alias
    • Publish the alias interface via (all wanted) real hw i/fs
      • Method 1: Static routes and gratuitous/proxy arp
      • Method 2: Use new (routing) protocol
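Binding to the alias means both ends of a connection use the alias address rather than whichever physical interface the routing table happens to pick. The talk did this by patching the Erlang TCP distribution; the sketch below only shows the underlying socket-level idea in Python, with hypothetical helper names.

```python
import socket


def listen_on(alias_ip, port=0):
    """Listen only on alias_ip instead of on every interface (0.0.0.0)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((alias_ip, port))  # port 0 lets the OS choose a free port
    server.listen()
    return server


def connect_from(source_ip, dest_ip, dest_port, timeout_s=5.0):
    """Open a TCP connection whose local endpoint is source_ip."""
    return socket.create_connection(
        (dest_ip, dest_port),
        timeout=timeout_s,
        source_address=(source_ip, 0),  # ephemeral source port on the alias
    )
```

Because the peers only ever see the alias addresses, the TCP association does not depend on any one physical interface staying up.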
  • 13. ARP method
    • Implement a utility to:
      • broadcast unsolicited ARP responses
      • respond to ARP requests for the alias i/f address
    • Add static routes on all far end systems
    • NOTE: all real i/fs need to be on the same IP subnet
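The unsolicited ARP response that such a utility broadcasts is a fixed 42-byte frame. A minimal sketch of building one follows; the function name is an assumption, and actually sending the frame needs a raw link-layer socket and elevated privileges, which is left out here.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6


def gratuitous_arp_reply(mac, ip):
    """Build an Ethernet frame carrying an unsolicited ARP reply for ip.

    mac is the 6-byte hardware address of the real interface that should
    receive traffic for the alias address ip.
    """
    ip_bytes = socket.inet_aton(ip)
    ethernet = struct.pack("!6s6sH", BROADCAST_MAC, mac, 0x0806)  # EtherType ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                        # hardware type: Ethernet
        0x0800,                   # protocol type: IPv4
        6, 4,                     # hardware and protocol address lengths
        2,                        # opcode 2: reply
        mac, ip_bytes,            # sender: the interface announcing the alias
        BROADCAST_MAC, ip_bytes,  # target IP equals sender IP when gratuitous
    )
    return ethernet + arp
```

Receivers on the subnet update their ARP caches from the sender fields, so rebroadcasting from another interface after a link failure moves the alias address to that interface.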
  • 14. New routing protocol
    • Broadcast (Ethernet frames) what you have, including interface priority
    • Let the far end select path based on what/when they receive
    • Far end dynamically sets up host routes
    • Use short retransmission intervals
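The far-end selection logic might look like the sketch below: remember the latest advertisement per receiving interface, ignore stale ones, and prefer the highest priority. The record layout, the hold time and the "higher priority wins" convention are all assumptions made for illustration; the slides do not specify the protocol's wire format or tie-breaking.

```python
import time
from dataclasses import dataclass


@dataclass
class Advertisement:
    alias_ip: str        # loopback alias address being announced
    via_interface: str   # local interface the broadcast arrived on
    priority: int        # higher wins (assumed convention)
    received_at: float   # time.monotonic() at reception


class RouteSelector:
    def __init__(self, hold_time_s=1.0):
        self.hold_time_s = hold_time_s  # short, to match short retransmission intervals
        self._seen = {}                 # (alias_ip, via_interface) -> Advertisement

    def receive(self, adv):
        """Remember only the most recent advertisement per path."""
        self._seen[(adv.alias_ip, adv.via_interface)] = adv

    def best_path(self, alias_ip, now=None):
        """Return the interface to route alias_ip through, or None if no path is fresh."""
        now = time.monotonic() if now is None else now
        fresh = [
            a for a in self._seen.values()
            if a.alias_ip == alias_ip and now - a.received_at <= self.hold_time_s
        ]
        if not fresh:
            return None
        return max(fresh, key=lambda a: (a.priority, a.received_at)).via_interface
```

The result of `best_path` is what the far end would install as a host route; when advertisements on the preferred interface stop arriving, the next call falls back to the other interface once the hold time expires.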
  • 15. Erlang multi-homing resolved? (diagram: Host A, Host B, Host C)
  • 16. Summing up
    • Erlang can support multihoming with some additional work
    • By using loop back alias i/f, link failure becomes a routing problem (peer-peer association is kept intact)
    • Solaris TCP/IP stack parameters are:
      • hard to find (only in out-of-date app. notes)
      • hard to set "right"
      • host global
    • A distribution mechanism with built-in support for multi-homing is preferred
  • 17. Erlang Distribution over SCTP Per Bergqvist et al [email_address] Erlang User Conference 2002