Novell Brainshare 2010 Amsterdam

Published in: Technology
  • Mean Time Between Failures (MTBF) and Mean Time To Failure (MTTF; time to FIRST failure, for new components) are statistical metrics that are only valid for a large number (batch) of a given component. They follow a normal distribution and give no indication of when a particular individual component (e.g. a hard disk) will fail.
  • Availability and allowable downtime per year (365.2425-day year: 365 + 0.25 - 0.01 + 0.0025)
        98.01%     174.44 h
        99%        87.66 h
        99.5%      43.83 h
        99.9%      8.77 h
        99.99%     52.59 min
        99.999%    5.26 min
    Think of a multi-segmented NSS pool as an example of a serial design. Think of a NIC team as an example of a parallel design. All systems are made up of a combination of serial and parallel components.
  • Talk about creating the two pools and why. DEMO: Create pool1/vol1 Create pool1_shd/vol1_shd
  • Transcript

    • 1. High-Availability with Novell Cluster Services™ for Novell® Open Enterprise Server on Linux
        Tim Heywood, CTO, NDS8
        Martin Weiss, Senior Technical Specialist
        Dr. Frieder Schmidt, Senior Technical Specialist
    • 2. Agenda
        • High Availability and Fault Tolerance
        • Novell Cluster Services™
        • Best Practices Deploying Cluster Services
        • What is Clusterable?
        • Demo
    • 3. High-Availability and Fault Tolerance
    • 4. High-Availability: Motivation
        • Murphy's Law is universal: faults will occur
            • Power failures, hardware crashes, software errors, human mistakes...
        • Unmasked faults show through to the user
        • How much does downtime of a service cost you?
            • Even if you can afford a 5-second blip, can you afford a day-long outage or, worse, loss of data?
        • Can you afford low-availability systems?
      If you are selling or depending on a service, service unavailability translates to cost
    • 6. Definition: Availability
        • Mean Time Between Failures (MTBF)
            • follows a normal distribution
        • Mean Time To Repair (MTTR)
        • Availability
            • Percentage of time that a system functions as expected
            • Always computed for a certain period, e.g. a month or a year
        • Example:
            • MTBF: 360 days
            • MTTR: 1 hour
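The example values can be worked through with the usual steady-state formula A = MTBF / (MTBF + MTTR). The slide gives only the inputs, so the formula is an assumption here, and the function names are illustrative. A minimal Python sketch that also converts an availability target into an annual downtime budget, using the 365.2425-day year from the speaker notes:

```python
# Sketch: availability from MTBF and MTTR, plus allowable downtime per year.
# Assumption: A = MTBF / (MTBF + MTTR); the slide lists only MTBF and MTTR.

HOURS_PER_YEAR = 365.2425 * 24  # average Gregorian year, as in the speaker notes


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowable_downtime_hours(avail: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - avail) * period_hours


# Slide example: MTBF of 360 days, MTTR of 1 hour.
a = availability(mtbf_hours=360 * 24, mttr_hours=1)
print(f"Slide example: availability = {a:.4%}")

for target in (0.99, 0.999, 0.9999, 0.99999):
    hours = allowable_downtime_hours(target)
    print(f"{target:.3%}: {hours:.2f} h ({hours * 60:.1f} min) per year")
```

The loop reproduces the downtime figures quoted in the speaker notes, which makes it easy to check a target such as "four nines" against what an operations team can realistically deliver.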
    • 10. How to Determine Availability?
        • Availability of a complex system is determined by the availability of its individual components
        • Two ways to couple components:
            • serial design
            • parallel design
        • Availability of a serial design: $A_{ser} = A_1 \cdot A_2$; with $A_1 = 0.99$ and $A_2 = 0.99$, $A_{ser} = 0.9801$
        • Availability of a parallel design: $A_{par} = 1 - (1 - A_1)(1 - A_2) = 1 - (1 - 0.99)(1 - 0.99) = 1 - 0.01 \cdot 0.01 = 0.9999$
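The two formulas generalise to any number of components: multiply availabilities for a serial chain, multiply unavailabilities for a redundant group. A short Python sketch (function names are illustrative, and the mixed design at the end is a hypothetical example, not one from the slides):

```python
import math


def serial_availability(*components: float) -> float:
    """Every component must work, so the availabilities multiply."""
    return math.prod(components)


def parallel_availability(*components: float) -> float:
    """One working component is enough, so the unavailabilities multiply."""
    return 1.0 - math.prod(1.0 - a for a in components)


print(round(serial_availability(0.99, 0.99), 4))    # 0.9801, as on the slide
print(round(parallel_availability(0.99, 0.99), 4))  # 0.9999, as on the slide

# Hypothetical mixed design: a redundant pair in series with a single component.
mixed = serial_availability(parallel_availability(0.99, 0.99), 0.99)
print(round(mixed, 6))
```

The mixed case shows why redundancy has to be applied along the whole chain: a single non-redundant component pulls the result back below its own availability.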
    • 14. “3R Rule” for High-Availability Systems: Redundancy, Redundancy, Redundancy
        • Fault Tolerance: “The ability of a system to respond gracefully to an unexpected hardware or software failure.” (Webopedia)
        • Computer System Fault Tolerance: “The ability of a computer system to continue to operate correctly even though one or more of its components are malfunctioning.” (Institute for Telecommunication Sciences, National Telecommunications and Information Administration, US Dept. of Commerce)
    • 15. Managing Risk: Two Goals
      Primary Goal: Increase Mean Time to Failure (MTTF)
        • Choose reliable hardware
        • Implement redundant / fault-tolerant systems
            • Easy to implement for some components (power supplies, LAN connectivity, SAN connectivity, RAID, etc.)
            • Not so easy for other components (main board, memory, processor, etc.)
        • Establish sound administrative practices
      Secondary Goal: Reduce Mean Time to Repair (MTTR)
        • Keep hardware spares close at hand
        • Document repair procedures and train personnel
        • Choose Open Enterprise Server: Linux server with Novell Cluster Services™
    • 20. High-Availability by Clustering
      Redundant setup “clustered” to act as one; avoids a Single Point of Failure (SPOF)
        • Primary focus is availability, but can allow for increased performance
      HA via fail-over: when the failure of [an application on] a server is detected, another server takes over
        • Results achieved depend on failure detection time and startup delays
      The [virtual] hand moves faster than the eye
        • The fault is masked before the user really notices
        • Depends on failure detection time, restart time, overhead
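The dependence on detection time, restart time and overhead can be made concrete with a little arithmetic. The sketch below assumes a simple additive model (outage = detection + overhead + restart) and plugs in hypothetical timings; neither the model nor the numbers come from the slides. It reuses the MTBF of 360 days and the MTTR of 1 hour from the earlier example for comparison.

```python
# Sketch under stated assumptions: additive outage model, hypothetical timings.

def failover_outage_seconds(detection_s: float, overhead_s: float, restart_s: float) -> float:
    """User-visible outage for one failover under the additive model."""
    return detection_s + overhead_s + restart_s


def availability(mtbf_hours: float, outage_seconds: float) -> float:
    """Steady-state availability, with the outage acting as the repair time."""
    outage_hours = outage_seconds / 3600.0
    return mtbf_hours / (mtbf_hours + outage_hours)


MTBF_HOURS = 360 * 24  # MTBF from the earlier example

# Without a cluster: one hour of manual repair per failure.
manual_repair = availability(MTBF_HOURS, outage_seconds=3600)

# With fail-over: hypothetical detection, fencing overhead and resource start times.
outage = failover_outage_seconds(detection_s=10, overhead_s=5, restart_s=45)
with_failover = availability(MTBF_HOURS, outage_seconds=outage)

print(f"outage per failover: {outage} s")
print(f"manual repair: {manual_repair:.6%}")
print(f"with failover: {with_failover:.6%}")
```

Shortening any of the three terms helps, but detection time cuts both ways: a very aggressive timeout also raises the risk of failing over a node that was merely slow.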
    • 22. Novell Cluster Services ™
    • 23. Novell Cluster Services™
        • Cluster Services allows a resource to be activated on any host in the cluster
        • Load distribution over multiple servers when there are multiple resources
        • Monitors LAN and SAN/storage connectivity; in the event of a failure, fences the problematic node and relocates the resource
        • Supports active-passive clustering
        • Supports resource monitoring
        • Supports Linux and Novell® Open Enterprise Server services
        • Supports up to 32 nodes per cluster
    • 30. Novell Cluster Services™
        • Easy Management
        • Easy Configuration
            • Load Script
            • Unload Script
            • Monitoring Script
        • iManager integration
        • Command Line Interface
        • E-mail and SNMP Notification
        • Integration with Novell® Open Enterprise Server Services
        • Integration with XEN
    • 38. Novell Cluster Services™
      [Diagram: Typical NCS 1.8 Architecture. Labels: dual NICs, dual HBAs, LAN fabric (Ethernet), SAN fabric (Fibre Channel or iSCSI), storage arrays with controllers Ctrl 1 and Ctrl 2 and LUN 0, LUN 1, LUN …, Novell iSCSI storage array.]
    • 39. Cluster Services in Novell® Open Enterprise Server (OES) 2
        • New features are Linux only
        • New from OES2 FCS on:
            • Resource monitoring
            • XEN virtualization support
            • x86_64 platform support
                • Including mixed 32/64 bit node support
            • Dynamic Storage Technology
    • 43. What's New in SP1/2?
        • Major rewrite of cluster code for SP2
            • Removed NetWare® translation layer
            • Much faster
            • Much lower system load
            • Typical load average of 0.2!
        • New/improved clustering for:
            • iFolder 3
            • AFP
            • …
            • …
        • NCP™ virtual server for POSIX filesystem resources :-(
    • 50. What's New in SP3?
        • Resource Mutual Exclusion (RME)
            • Up to 4 resource groups
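RME is commonly described as keeping resources that belong to the same exclusion group off the same node at the same time. The slide gives only the feature name and the limit of four groups, so the rule as coded below is an assumption, and none of the names correspond to an actual NCS interface. A minimal Python sketch that checks a placement against such groups:

```python
from collections import defaultdict

MAX_RME_GROUPS = 4  # limit stated on the slide


def rme_conflicts(groups, placement):
    """Return (group_index, node, resources) for every node that runs
    more than one resource from the same exclusion group."""
    if len(groups) > MAX_RME_GROUPS:
        raise ValueError(f"at most {MAX_RME_GROUPS} RME groups are supported")
    conflicts = []
    for index, group in enumerate(groups):
        by_node = defaultdict(list)
        for resource in sorted(group):
            node = placement.get(resource)
            if node is not None:  # resources that are offline cannot conflict
                by_node[node].append(resource)
        for node, resources in sorted(by_node.items()):
            if len(resources) > 1:
                conflicts.append((index, node, resources))
    return conflicts


# Hypothetical resources and nodes, for illustration only.
groups = [{"gw_po1", "gw_po2"}, {"iprint", "ifolder"}]
placement = {"gw_po1": "node1", "gw_po2": "node1", "iprint": "node2", "ifolder": "node3"}
print(rme_conflicts(groups, placement))  # [(0, 'node1', ['gw_po1', 'gw_po2'])]
```

A check like this fits naturally into a written test plan: run it against the intended failover matrix for every single-node failure, and no combination should produce a conflict.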
    • 51. What's New in SP3? Other Incremental Changes:
        • Ability to rename resources
        • Ability to edit resource priority list as text
        • Various UI improvements
        • Ability to disable resource monitoring (for maintenance)
    • 55. Types of Clusters
        • Traditional cluster
            • Servers (nodes)
            • Resources
                • NSS
                • GroupWise®
                • iPrint
        • XEN cluster
            • Dom0 hosts (nodes)
            • XEN guests (DomU) resources
            • Each resource is a server in its own right
            • Live migration with para-virtualised DomU
    • 62. XEN Cluster Architecture
      [Diagram: three cluster nodes (Xen Dom0) and an OCFS2 LUN holding the DomU files. DomU resources shown: Linux iPrint (twice), Linux iFolder, Linux GroupWise, NetWare pCounter. Arrows labelled “Live Migrate”.]
    • 63. Best Practices Deploying Cluster Services
    • 64. What Are Our Requirements?
        • Which services need to be highly available, and to what degree?
            • File, Print
            • DHCP, DNS
            • Novell® GroupWise®
            • Novell ZENworks®
            • XEN VMs
            • Other Services
        • With or without SAN / shared storage?
            • DNS Master Server
    • 70. Hardware Setup
      Availability starts at the lowest layer
        • LAN / SAN / Power cabling
        • BIOS / Firmware
            • Versions
            • Configuration
            • Disable what is not required
        • Local RAID setup
            • Two logical devices?
    • 74. Software Setup
        • Use AutoYaST + ZENworks® Linux Management
        • All servers in a cluster must be identical
        • Filesystem layout
        • Install only required patterns
        • Use local time
        • Default runlevel=3
        • Process:
            • Install
            • Patch
            • Configure
    • 83. Connectivity Rules
        • Why make every connection fault tolerant?
            • After all, we have multiple servers, and Novell Cluster Services™ can migrate the resource in case of a failure
            • -> File-level re-connect and “grey data”
        • Fault tolerant LAN connectivity
            • Bonding: multiple switches
            • Fixed vs. Auto Speed/Duplex
        • Fault tolerant SAN / Storage connectivity
            • Multiple switches / fabrics, multipathing solution (DM-MPIO)
                • Vendor specific configuration
                • /etc/multipath.conf and /etc/multipath/bindings
    • 87. Naming and Addressing
        • Naming
            • Short names (e.g. “cl1-node1” instead of “thisisclusternode1inmycompany”)
            • Lower case where possible (Linux is case sensitive)
        • Addressing
            • A separate VLAN for each cluster
            • Standardize IP addresses
            • Virtual IP (service-lifetime address) instead of secondary IP?
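Naming rules like these are easy to enforce with a small check before nodes or resources are created. The Python sketch below is illustrative only: the 15-character limit and the allowed character set are assumptions, since the slide just asks for short, lower-case names.

```python
import re

MAX_NAME_LENGTH = 15  # hypothetical limit; the slide only says "short"
NAME_PATTERN = re.compile(r"[a-z][a-z0-9-]*")  # assumed character set


def check_node_name(name: str) -> list:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    if name != name.lower():
        problems.append("contains upper-case characters")
    if not NAME_PATTERN.fullmatch(name.lower()):
        problems.append("use only letters, digits and hyphens, starting with a letter")
    return problems


for candidate in ("cl1-node1", "thisisclusternode1inmycompany", "CL1-Node2"):
    print(candidate, check_node_name(candidate) or "ok")
```

Run against the two names from the slide, the short one passes and the long one is rejected for length.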
    • 91. eDirectory™
        • Fault tolerant name resolution
        • Fault tolerant time synchronization
        • Each cluster in its own partition
            • Only the required information for Novell Cluster Services™ and resources
        • LDAP Proxy User and Group for each cluster
            • Security
            • Change Password
        • Fault tolerant LDAP connectivity for Novell Cluster Services
            • Specify multiple LDAP servers (the “local” one plus an eDirectory server outside the cluster)
    • 95. Resource Rules
        • Filesystem
            • “One that rules them all”: one LUN, one partition, one pool, one volume
            • Use additional LUNs to expand existing pools
            • Maximum size (restore speed and allowable recovery time)
            • Define rules (e.g. max. 80% fill rate)
            • Use Novell Storage Services™ where possible
        • Failover Matrix
            • No resource is allowed to migrate to all nodes (“two node cluster is no cluster”)
            • Load Balancing
            • Fan-out Failover
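A failover matrix can be modelled as an ordered list of preferred nodes per resource. The Python sketch below uses hypothetical resource and node names and is not an NCS interface. It checks the rule that no resource may list every node, and shows how staggering the second choices makes the resources of a failed node fan out across the survivors instead of all landing on one neighbour.

```python
def resources_listing_all_nodes(matrix, nodes):
    """Resources whose preferred list covers every node (the slide forbids this)."""
    return sorted(res for res, preferred in matrix.items() if set(nodes) <= set(preferred))


def place(matrix, alive):
    """Put each resource on the first live node of its preferred list."""
    return {
        res: next((node for node in preferred if node in alive), None)
        for res, preferred in matrix.items()
    }


# Hypothetical cluster: all three pools prefer node1, with staggered fallbacks.
nodes = ["node1", "node2", "node3", "node4"]
matrix = {
    "pool_a": ["node1", "node2", "node3"],
    "pool_b": ["node1", "node3", "node4"],
    "pool_c": ["node1", "node4", "node2"],
}

print(resources_listing_all_nodes(matrix, nodes))        # [] means the rule holds
print(place(matrix, alive=set(nodes)))                   # everything on node1
print(place(matrix, alive={"node2", "node3", "node4"}))  # node1 failed: one pool per survivor
```

Putting all three pools on node1 is only there to make the fan-out visible; in a load-balanced layout the first choices would be spread across nodes as well.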
    • 102. 3rd Party Applications
        • Backup
            • SMS
            • Novell Storage Services™
            • Cluster aware
        • Antivirus
            • Exclude /admin and /_admin
            • Verify performance and utilization
    • 106. Always verify and test your cluster! Test, Test, Test
        • Create a written test plan and document test results
        • LAN Tests
            • Bonding + Routing (virtual IP)
        • SAN / Storage Tests
            • Multipathing
            • Storage Control (BCC)
        • Resource migration tests
        • Everything with I/O
    • 110. What is Clusterable?
    • 111. File Sharing Resources
        • Novell Storage Services™ pools
            • Use iManager
            • Use NSSMU
            • iManager offers more granular control
        • Combine Novell Storage Services volumes for DST
            • One resource: mount both volumes
            • Delete resource for shadow
            • Modify load script for primary
    • 116. File Sharing Resources
        • POSIX filesystem based resource with NCP™
            • Easier than Samba to access files
            • Can be used for iPrint, DHCP etc.
            • Use evmsgui to create and format the volume
            • Create the resource in iManager
            • Script to create NCP virtual server
        • EVMS locking problems with large clusters
    • 121. File Sharing Resources
        • Add resource monitoring
        • Add NFS access
            • LUM enablement of target users
            • Novell Storage Services™/POSIX rights
            • exportfs in load script rather than /etc/exports on nodes
            • Use fsid=x for Novell Storage Services
    • 126. NFS Access
      [Diagram. Labels: SHARED1 virtual server, SHARED1 volume, NFSaccess, LUM, NSS rights, NFS; users Iface (UID 1012), Mis (UID 1010), Oracle (UID 60003), Mis-dweeb (UID 1004), Dweeb, Gromit, Wallace; hosts FPC1 to FPC5 (fpc.server.luton).]
    • 127. iPrint
        • Create iPrint on Novell Storage Services™
        • Run iprint_nss_relocate on each node with volume in place
        • NB: only one iPrint resource may run on a node
        • Need to accept certificates in iManager for each node
    • 131. iFolder
        • Create iFolder on POSIX
            • /mnt/cluster/ifolder
        • Run /opt/novell/ifolder3/bin/ifolder_cluster_setup on each node
            • Copy /etc/sysconfig/novell/ifldr3_2_sp2 to nodes first
        • NB: Only one iFolder resource may run on a node
    • 132. DNS
        • DNS must be on Novell Storage Services™, as an NCP™ server is required for eDirectory™ integration
        • Check NCP:NCPServer objects
        • LUM user required for Novell Storage Services rights
    • 135. DHCP
        • Create DHCP on Novell Storage Services™
        • Leases file on Novell Storage Services volume
        • Log file on Novell Storage Services volume
            • Syslog-ng configuration
            • Logrotate configuration
            • The default AppArmor configuration will not allow logging to this location!
    • 140. Novell® GroupWise®
        • Create PO on Novell Storage Services™
        • Ensure no Salvage when creating Volume
        • Set namespace in load script
            • /opt=ns=long
        • Disable atime/diratime on volume
            • Open nsscon
            • Run /noatime=volname
    • 144. OCFS2 Shared Storage
        • Shared disk! Multi-mount, read/write with distributed lock management
        • /etc/ocfs2/cluster.conf automagically created by Novell Cluster Services™
        • Fstab mounting uses /etc/init.d/ocfs2 service
        • Required for XEN guests
        • SLOW!!!
    • 149. DEMO
    • 150. File Sharing Resources
        • A Novell Storage Services™ pool
            • Use iManager
            • Will end up as Primary for DST pair
        • Another Novell Storage Services pool
            • Use NSSMU (just because we can)
            • Will end up as Shadow for DST pair
        • Combine them into one resource
            • Delete resource for shadow
            • Modify load script for primary
    • 154. Novell® GroupWise®
        • PO on Novell Storage Services™ on GroupWise resource
        • Software on resource: pre-install on every node
            • Cluster functions selected on first install
        • Start install script ./install &
            • Select cluster and then install based on previous installation
            • Edit /etc/opt/novell/grpwise/gwha.conf - show=yes
            • Start GroupWise Agents one at a time
            • Comment out gwha.conf change
    • 159. Question and Answer
    • 162. Unpublished Work of Novell, Inc. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary, and trade secret information of Novell, Inc. Access to this work is restricted to Novell employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Novell, Inc. makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for Novell products remains at the sole discretion of Novell. Further, Novell, Inc. reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All Novell marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
    • 163. What Must Be Protected?
      First: The Data
        • Without data ...
            • There has been no service
            • There is no service
            • There will be no service
        • Data corruption must be prevented at all costs
        • Rather no service than risk loss or corruption of data!
        • Nodes need shared, coordinated access to the data
      Second: The Service
        • Operating system instance
        • Application service
        • Typically, only one service instance at a time is allowed to run
        • Examples:
            • IP address
            • non-cluster-aware file system mount
            • database instance
            • SAP R/3
            • ...
    • 175. If That Sounds Too Easy ...
      A cluster of nodes forms a partially synchronous distributed system:
            • Storage will fail and corrupt data, and nodes will lose access
            • Nodes will lose power, hardware will die, memory corruption will occur, not even time keeping is guaranteed
            • The network loses, corrupts, delays and reorders data
            • Some nodes will receive a packet, others will not
            • And then, there are humans: admins as well as attackers
        • Failures can only be detected after they have occurred
        • You cannot trust anything or anyone
        • No 100% perfect solution is possible