High-Availability with Novell Cluster Services™ for Novell® Open Enterprise Server on Linux
Tim Heywood, CTO, NDS8
Martin Weiss, Senior Technical Specialist
Dr. Frieder Schmidt, Senior Technical Specialist
Agenda
High Availability and Fault Tolerance
Novell Cluster Services™
Best Practices Deploying Cluster Services
What is Clusterable?
Demo
Power failures, hardware crashes, software errors, human mistakes...
Unmasked faults show through to the user
How much does downtime of a service cost you?
Even if you can afford a 5 second blip, can you afford a day long outage or worse, loss of data?
Can you afford low availability systems?
If you are selling or depending on a service, service unavailability translates to cost
Mean Time Between Failures (MTBF)
follows a normal distribution
Mean Time To Repair (MTTR)
Percentage of time that a system functions as expected
Always computed for a certain period, e.g. a month or a year
MTBF: 360 days
MTTR: 1 hour
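As a worked example, the two figures above can be turned into an availability value. The sketch below assumes the common steady-state definition, availability = MTBF / (MTBF + MTTR), which the slides do not state explicitly; the function name `availability` is illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, ASSUMING A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Figures from the example: MTBF of 360 days, MTTR of 1 hour.
mtbf_hours = 360 * 24  # convert days to hours so both values share a unit
mttr_hours = 1.0

a = availability(mtbf_hours, mttr_hours)
print(f"Availability: {a:.4%}")
```

Under this assumption the example system is available for roughly 99.988% of the time, i.e. one hour of downtime in every 8641 hours.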
How to Determine Availability?
Availability of a complex system is determined by the availability of its individual components
Two ways to couple components:
Availability of a serial design: $A_{ser} = A_1 \cdot A_2$; with $A_1 = 0.99$ and $A_2 = 0.99$, $A_{ser} = 0.9801$
Availability of a parallel design: $A_{par} = 1 - (1 - A_1)(1 - A_2) = 1 - (1 - 0.99)(1 - 0.99) = 1 - 0.01 \cdot 0.01 = 0.9999$
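The two coupling rules can be checked numerically. The sketch below generalises them to any number of components, assuming the components fail independently (the same assumption that underlies the formulas); the helper names `serial` and `parallel` are illustrative.

```python
import math


def serial(*availabilities: float) -> float:
    """Serial design: the system works only if every component works."""
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Parallel design: the system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


# Values from the slide: two components with 0.99 availability each.
a_ser = serial(0.99, 0.99)
a_par = parallel(0.99, 0.99)
print(f"serial:   {a_ser:.4f}")
print(f"parallel: {a_par:.4f}")
```

Chaining components in series lowers availability below that of the weakest part, while redundancy in parallel raises it above that of the strongest part.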
“3R Rule” for High-Availability Systems
Redundancy, Redundancy, Redundancy
Fault Tolerance
“The ability of a system to respond gracefully to an unexpected hardware or software failure.” (Webopedia)
Computer System Fault Tolerance
“The ability of a computer system to continue to operate correctly even though one or more of its components are malfunctioning.” (Institute for Telecommunication Services, National Telecommunications and Information Administration, US Dept. of Commerce)
Managing Risk: Two Goals
Primary Goal: Increase Mean Time to Failure (MTTF)
Choose reliable hardware
Implement redundant / fault tolerant systems
Easy to implement for some components (power supplies, LAN connectivity, SAN connectivity, RAID, etc.)
Not so easy for other components (main board, memory, processor, etc.)
Establish sound administrative practices
Secondary Goal: Reduce Mean Time to Repair (MTTR)
Keep hardware spares close at hand
Document repair procedures and train personnel
Choose Open Enterprise Server: Linux server with Novell Cluster Services™
High-Availability by Clustering
Redundant setup “clustered” to act as one
Avoid Single Points of Failure (SPOF)
Primary focus is availability, but can allow for increased performance
HA via fail-over: when the failure of a server [or of an application on it] is detected, another server takes over
Results achieved depend on failure detection time and startup delays
The [virtual] hand moves faster than the eye
The fault is masked before the user really notices
Depends on failure detection time, restart time, overhead
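Whether a fault is masked can be estimated with simple arithmetic. The sketch below assumes that detection, restart, and overhead happen one after another and that clients tolerate a fixed interruption; the function names and all numbers are hypothetical illustrations, not measurements from Novell Cluster Services.

```python
def failover_outage_seconds(detection_s: float, restart_s: float,
                            overhead_s: float = 0.0) -> float:
    """User-visible outage of one fail-over, ASSUMING the phases run back to back."""
    return detection_s + restart_s + overhead_s


def is_masked(outage_s: float, client_tolerance_s: float) -> bool:
    """A fault counts as masked if service resumes before clients give up."""
    return outage_s <= client_tolerance_s


# Hypothetical example values.
outage = failover_outage_seconds(detection_s=8.0, restart_s=20.0, overhead_s=2.0)
print(f"outage per fail-over: {outage} s")
print(f"masked for a client tolerating 60 s: {is_masked(outage, 60.0)}")
print(f"masked for a client tolerating 10 s: {is_masked(outage, 10.0)}")
```

The same fail-over can be invisible to one application and a hard error for another, which is why detection time and startup delay need to be tuned per service.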
Novell Cluster Services™
Cluster services allows a resource to be activated on any host in the cluster
Distributes load across multiple servers when there are multiple resources
Monitors LAN and SAN/storage connectivity; in the event of a failure, fences the problematic node and relocates the resource
Supports active-passive clustering
Supports resource monitoring
Supports Linux and Novell® Open Enterprise Server services
Supports up to 32 nodes per cluster
Novell Cluster Services™
Command Line Interface
E-mail and SNMP Notification
Integration with Novell® Open Enterprise Server Services
Cluster Services in Novell® Open Enterprise Server (OES) 2
New features are Linux only
New from OES2 FCS on:
XEN virtualization support
x86_64 platform support
Including mixed 32/64 bit node support
Dynamic Storage Technology
What's New in SP1/2?
Major rewrite of cluster code for SP2
Removed NetWare® translation layer
Much lower system load
Typical load average of 0.2!
New/improved clustering for:
NCP™ virtual server for POSIX filesystem resources :-(
What's New in SP3?
Resource Mutual Exclusion (RME)
Up to 4 resource groups
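The idea behind Resource Mutual Exclusion can be sketched as a placement check. This is an illustration only: it assumes that resources in the same RME group must never run on the same node at the same time, and it does not reproduce any Novell Cluster Services API. The function `can_place` and the resource and node names are hypothetical.

```python
from typing import Dict, List, Set

MAX_RME_GROUPS = 4  # limit stated on the slide


def can_place(resource: str, node: str,
              rme_groups: List[Set[str]],
              running: Dict[str, Set[str]]) -> bool:
    """Return True if `resource` may start on `node` under the ASSUMED RME rule."""
    if len(rme_groups) > MAX_RME_GROUPS:
        raise ValueError(f"at most {MAX_RME_GROUPS} RME groups are supported")
    on_node = running.get(node, set())
    for group in rme_groups:
        # Conflict: another member of the same group already runs on this node.
        if resource in group and (group & on_node) - {resource}:
            return False
    return True


# Hypothetical cluster state: two post offices that must stay apart.
groups = [{"gw_po1", "gw_po2"}]
running = {"node1": {"gw_po1"}, "node2": {"iprint"}}

print(can_place("gw_po2", "node1", groups, running))  # conflicts with gw_po1
print(can_place("gw_po2", "node2", groups, running))  # no group member on node2
```

A fail-over decision would run such a check for each candidate node in the resource's preferred-node list and skip nodes where it fails.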
What's New in SP3? Other Incremental Changes:
Ability to rename resources
Ability to edit resource priority list as text
Various UI improvements
Ability to disable resource monitoring (for Maintenance)
Types of Clusters
Dom0 hosts (nodes)
XEN guests (DomU) resources
Each resource is a server in its own right
Live migration with para-virtualised DomU
[Figure: XEN Cluster Architecture. Three cluster nodes (Xen Dom0) share an OCFS2 LUN holding the DomU files. The DomU resources shown are Linux iPrint (twice), Linux iFolder, Linux GroupWise and NetWare pCounter; resources can be live-migrated between nodes.]
Best Practices Deploying Cluster Services
What Are Our Requirements?
Which services need to be highly available, and to what degree?
Novell® GroupWise®
Novell ZENworks®
With or without SAN / shared storage?
DNS Master Server
Hardware Setup
Availability starts at the lowest layer
LAN / SAN / Power cabling
BIOS / Firmware
Disable what is not required
Local RAID setup
Two logical devices?
Use AutoYaST + ZENworks® Linux Management
All Servers in a cluster must be identical
Install only required patterns
Use local time
Why connect everything in a fault-tolerant way, if we have multiple servers and Novell Cluster Services™ can migrate the resource in case of a failure?
What Must Be Protected?
First: The Data
Without data ...
There has been no service
There is no service
There will be no service
Data corruption must be prevented at all costs
Better no service than risking loss or corruption of data!
Nodes need shared, coordinated access to the data
Second: The Service
Operating system instance
Typically, only one service instance at a time is allowed to run
non-cluster-aware file system mount
If That Sounds Too Easy ...
A cluster of nodes forms a partially synchronous distributed system:
Storage will fail and corrupt data, and nodes will lose access
Nodes will lose power, hardware will die, memory corruption will occur; not even timekeeping is guaranteed
The network loses, corrupts, delays and reorders data
Some nodes will receive a packet, others will not
And then, there are humans – admins as well as attackers
Failures can only be detected after they have occurred
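The last point can be illustrated with a timeout-based failure detector. The sketch below is a generic heartbeat monitor, not the mechanism used by Novell Cluster Services; the class `HeartbeatMonitor`, the 8-second timeout, and the node names are assumptions made for the example. Timestamps are passed in explicitly so that the run is deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Suspects a node once no heartbeat has arrived for longer than the timeout."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=8.0)  # hypothetical timeout
monitor.heartbeat("node1", now=0.0)
monitor.heartbeat("node2", now=0.0)        # node2 falls silent after this
monitor.heartbeat("node1", now=5.0)

early = monitor.suspected(now=7.0)         # node2 silent for 7 s: within the timeout
monitor.heartbeat("node1", now=10.0)
late = monitor.suspected(now=12.0)         # node2 silent for 12 s: suspected

print(early, late)
```

The monitor cannot tell a crashed node from one whose packets were lost or delayed, and it can only raise suspicion after the timeout has passed; this is why fencing the suspected node before relocating its resources matters for protecting the data.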