Webcast - Failover Cluster Troubleshooting
  • What is a quorum? Put simply, the quorum is the cluster's configuration database. The database resides in a file named \MSCS\quolog.log and is sometimes also referred to as the quorum log. It tells the cluster which node should be active.

Presentation Transcript

  • Failover Cluster Troubleshooting
    10.08.2011
    Hakan YÜKSEL
    hakan.yuksel@turkiyefinans.com.tr
    http://yukselis.wordpress.com
  • Agenda
    • Cluster
    • Concepts, Requirements, Architecture, Log Management, ...
    • Quorum Model
    • Troubleshooting
    • Questions and Answers
  • Cluster Requirements
    Review hardware and infrastructure requirements for a failover cluster.
    • Servers: Microsoft supports a failover cluster solution only if all the hardware components are marked as "Certified for Windows Server 2008 R2." In addition, the complete configuration (servers, network, and storage) must pass all tests in the Validate a Configuration Wizard, which is included in the Failover Cluster Manager snap-in
    • Storage: You must use shared storage that is compatible with Windows Server 2008 R2
    • Network adapters and cable (for network communication): The network hardware, like other components in the failover cluster solution, must be marked as "Certified for Windows Server 2008 R2." If you use iSCSI, your network adapters should be dedicated to either network communication or iSCSI, not both
    • Account for administering the cluster: When you first create a cluster or add servers to it, you must be logged on to the domain with an account that has administrator rights and permissions on all servers in that cluster. The account does not need to be a Domain Admins account—it can be a Domain Users account that is in the Administrators group on each clustered server. In addition, if the account is not a Domain Admins account, the account (or the group that the account is a member of) must be delegated Create Computer Objects and Read All Properties permissions in the domain
    • Failover clustering cannot be enabled on Standard Edition servers; on Windows Server 2008 R2 it requires Enterprise or Datacenter Edition
    • SCSI-3 commands: Persistent Reservations (PRs) required
    • Basic GPT and MBR disks supported
    • Multipath IO (MPIO) recommended
  • Frequently Asked Questions
    Can I build a cluster on virtual machines?
    Yes!
    Can physical and virtual servers be in the same cluster?
    Yes!
    Do the servers need to have identical hardware?
    No.
    If the configuration passes the validation tests, it is supported.
  • Cluster Validate
    • Built into the product
    • Warns when requirements are not met
    • Runs all checks on the servers and storage that make up the cluster
    • Must be run after every change:
    • Create a new cluster
    • Add a node, disk, or network
    • Update system software (drivers, firmware, service packs, MPIO)
    • Configure hardware (HBA, MPIO, Network Adapter, etc)
    • Change any component in your solution
    • It’s the very first thing you do!
    http://technet.microsoft.com/en-us/library/cc732035(WS.10).aspx#BKMK_understanding_tests
  • Quorum and Majority Node Set
    • The quorum is the database that holds the cluster configuration and state information.
    • Windows Server 2008 introduces a new quorum model (Node and Disk Majority) in which the quorum disk is used somewhat differently: it counts as one vote alongside the node votes.
    • Majority Node Set (MNS) is a democratic system. In the quorum-disk model there is a single vote, and whichever node owns it owns the cluster; in MNS the majority owns the cluster. For example, if a split-brain scenario occurs in a 5-node cluster, each node checks how many nodes in total it can communicate with. If a node can communicate with two other nodes, those 3 nodes form a majority of the 5 and take ownership of the cluster. The other two nodes recognize that they are in the minority and assume that the other 3 nodes can communicate with each other.
    • In a split-brain scenario on a Windows Server 2003 cluster, applications ran actively on whichever node owned the quorum disk; whether clients could actually reach that node made no difference.
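    The 5-node split-brain example above can be sketched as a simple vote count. This is only an illustration of the majority rule, assuming one vote per node and no witness; the function names and node names (N1 to N5) are hypothetical and not part of any cluster API.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True when the reachable votes form a strict majority (> 50%)."""
    return 2 * reachable_votes > total_votes


def partition_outcome(partitions: list[list[str]]) -> dict[str, bool]:
    """Map each node name to whether its partition keeps the cluster.

    `partitions` lists the groups of nodes that can still talk to each other.
    Assumption: every node carries exactly one vote and there is no witness.
    """
    total = sum(len(group) for group in partitions)
    outcome = {}
    for group in partitions:
        wins = has_quorum(len(group), total)
        for node in group:
            outcome[node] = wins
    return outcome


# Split brain in a 5-node cluster: three nodes on one side, two on the other.
result = partition_outcome([["N1", "N2", "N3"], ["N4", "N5"]])
print(result)
```

    Note that an even split (for example 2 and 2 in a 4-node cluster) leaves neither side with a majority, which is why a witness vote is added for even node counts.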
  • Quorum Overview
    • Majority is greater than 50%
    • Possible Voters:
    • Nodes (1 each), Disk Witness (1 max), File Share Witness (1 max)
    • 4 Quorum Types
    Node majority
    Node and File Share majority
    Disk only (not recommended)
    Node and Disk majority
  • Choosing the Quorum Model
    Considerations for choosing a quorum mode include:
    • By default, failover clustering chooses:
    - Node Majority if there are an odd number of nodes in the cluster
    - Node and Disk Majority if there are an even number of nodes in the cluster
    • Node and File Share Majority is recommended for geographically dispersed clusters
    • No Majority: Disk Only is not recommended, because of the disk subsystem’s single point of failure
    • Plan changes to the quorum mode carefully to avoid a mode that may result in loss of quorum
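    The odd/even default described above, and the reason behind it, can be sketched as follows. This is a minimal illustration, not how the product selects the mode internally; the function names are hypothetical, and it assumes one vote per node plus at most one witness vote.

```python
def default_quorum_mode(node_count: int) -> str:
    """Return the default quorum mode named on the slide for a node count."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 1:
        return "Node Majority"
    return "Node and Disk Majority"


def tolerated_vote_losses(total_votes: int) -> int:
    """Number of votes that can be lost while a strict majority remains."""
    return (total_votes - 1) // 2


for nodes in (2, 3, 4, 5):
    mode = default_quorum_mode(nodes)
    # A disk witness adds one vote when the node count is even.
    votes = nodes + (1 if mode == "Node and Disk Majority" else 0)
    print(nodes, "nodes:", mode, "- survives loss of", tolerated_vote_losses(votes), "vote(s)")
```

    With 4 nodes and no witness, only one vote could be lost; the extra disk vote makes the total odd and raises that to two.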
  • Failover Cluster Architecture
    Microsoft Cluster Service (MSCS) uses the shared-nothing model. This means that only one server at a time can own a given resource (disk, virtual server, IP address, and so on).
    The clusdb file is loaded into the HKLM\Cluster registry hive. It is stored on the nodes and on the quorum, and contains the most recent update information.
    • The nodes replicate the registry to each other over port 3343.
    • A copy of clusdb is also placed on the File Share Witness.
    When the computer is started, the Cluster Disk Driver (Clusdisk.sys) reads the following local registry key to obtain a list of the signatures of the shared disks under cluster management: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\Signatures
    Recommendation: the private network should carry heartbeat traffic only, and the public network should be set to mixed.
  • Architecture (continued)
    • If no heartbeat arrives within 5 seconds, the Host Manager steps in and continues the checks over the public network.
    • The Preferred Owners list determines which node a group will move to.
    • The Possible Owners list determines which nodes a resource can and cannot move to.
    • All resources must have the same owners.
    • Affect the group: if the resource fails, the group should fail over.
    • For disks, "Affect the group" is selected by default.
    • Pause Node
    • When you pause a node, existing groups and resources stay online, but additional groups and resources cannot be brought online on the node. Pausing a node is usually done when applying software updates to the node
  • SCSI Bus Reset, SCSI-3 Persistent Reservation
    Split-brain scenario: the two nodes lose network communication with each other. In this case the Cluster service (clusdisk.sys) uses the challenge/defense protocol through SCSI reserve commands: it first sends a reset command, then reserves the quorum disk with a reserve command and brings it online, and finally takes ownership and brings all resources online.
    Starting with Windows Server 2008, SCSI bus resets are no longer used; SCSI-3 persistent reservations are used instead. A SCSI bus reset affects not only the target disk but every disk on the same bus. Depending on the configuration, each node could send a bus reset for every disk, which lengthened the time the cluster needed to bring itself online; resources could remain offline and then had to be brought online manually.
  • Resource Monitor
    • Resource monitors check whether the resource groups on the cluster are working correctly. The resource monitor consists of DLLs running under clussvc; dedicated DLLs exist for certain services (Exchange, SQL, etc.). In Windows Server 2008 it is named RHS.exe.
    • If "Turn on maintenance mode for this disk" is selected, the IsAlive and LooksAlive checks are not performed: the cluster does not check the disk's status and does not access the disk (for example, by listing a directory on it), and the Cluster service assumes the disk is always online.
    The Resource Hosting Subsystem (RHS) conducts periodic health checks of all cluster resources to ensure they are functioning properly. This is accomplished by executing IsAlive and LooksAlive processes which are specific to the type of resource
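    The health-check behaviour described above can be modelled roughly as below. This is a simplified sketch, not the RHS implementation: the `Resource` class and `poll` function are hypothetical, and the rule that a failed LooksAlive is confirmed by a full IsAlive check is an assumption about the usual behaviour rather than something stated on the slide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    name: str
    looks_alive: Callable[[], bool]  # cheap, frequent check
    is_alive: Callable[[], bool]     # thorough, less frequent check
    maintenance_mode: bool = False


def poll(resource: Resource, thorough: bool) -> str:
    """Run one health-check cycle and return the resulting state."""
    if resource.maintenance_mode:
        # No checks run in maintenance mode; the resource is assumed online.
        return "online (assumed)"
    if thorough:
        healthy = resource.is_alive()
    else:
        # Assumption: a failed quick check is confirmed by the thorough check.
        healthy = resource.looks_alive() or resource.is_alive()
    return "online" if healthy else "failed"


disk = Resource("Disk Q:", looks_alive=lambda: False, is_alive=lambda: False)
print(poll(disk, thorough=False))   # both checks fail
disk.maintenance_mode = True
print(poll(disk, thorough=False))   # checks are skipped
```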
  • Failover Process
    When 2 nodes cannot reach each other, each tries to access the quorum disk; this is called the arbitration process. Clusdisk.sys manages disk access so that the two nodes cannot both access the disks. With the MNS architecture, quorum information is maintained through registry replication. These files can be found under %windir%\system32\config. When the cluster starts, the clusdb file is loaded into the registry and the cluster begins operating. This configuration file records which disks can be accessed.
  • Cluster Components
    OBJECT MANAGER (clussvc.exe) (OM)
    Holds the current configuration
    HOST MANAGER (HM)
    Handles adding and removing hosts and detecting node failures; works together with the other modules. When the cluster comes up, it talks over port 3343 to whichever node answers
    MEMBERSHIP MANAGER (MM)
    Writes locally under HKLM (clussvc) and then passes the information to the Object Manager, which loads it into RAM
    Records node joins and evictions and shares that information
    GLOBAL UPDATE MANAGER (GUM)
    Responsible for replicating all changes
    Notifies the other nodes that a backup (VSS) is running, which prevents changes from being made on those nodes
    Responsible for all updates
    RESOURCE CONTROL MANAGER (RCM)
    Works with Rhs.exe
    Responsible for dependencies
    The most important module :P
    TOPOLOGY MANAGER
    NETWORK MANAGER (NM) / INTERFACE MANAGER (IM)
    NIC up / fail
    DATABASE MANAGER (DM)
    Responsible for replication
    Performs it through the Global Update Manager
    The DM also keeps the log
    Loads clusdb into the registry
    QUORUM MANAGER
    Determines whether quorum has been achieved
    Checks which quorum model is in use
    Responsible for choosing the correct replica
    Can talk to the RCM: when quorum cannot be formed, it engages the RCM and asks, in effect, "we are one vote short of quorum; can you give us a vote?"
    SECURITY MANAGER
    Encryption and Kerberos relationships
  • Microsoft Failover Cluster Virtual Adapter
    In cluster environments Microsoft creates an interface named "Microsoft Failover Cluster Virtual Adapter". It is a hidden interface that represents the NetFT (Network Fault Tolerant) driver; it carries communication between cluster nodes and provides redundancy for the heartbeat. The interface binds on top of the existing interfaces, and SMB traffic toward the SAN is carried over this adapter. NetFT is visible in ipconfig /all and assigns itself an APIPA address (169.254.1.2). No data is actually transferred over this IP; because the adapter is bound to a physical NIC, the utilization shows up on the physical NIC in Task Manager (TM).
  • Failover Cluster Installation Steps
    Failover Cluster Prerequisites
    Establish a Network Naming Convention
    TCP/IP Network Configuration
    Public Network
    Storage Network
    Heartbeat Network
    Procedures
    Prepare the Failover Cluster
    Create a Domain User Account
    Add Nodes to an Active Directory Domain
    Expose Storage to Cluster Nodes
    Install the Failover Cluster Feature
    Run Cluster Validation
    Create and Configure the Failover Cluster
    Create a Cluster
    Set Cluster Network Properties and Apply Naming Convention
    Create Highly Available Services
    -> Create a Highly Available iSCSI Target
    Configuring Windows Firewall for Microsoft iSCSI Software Target
    Installing the Microsoft iSCSI Software Target
    Create the Failover iSCSI Target Resource Group
    Create an iSCSI Target in the Microsoft iSCSI Target MMC
    Create and Configure Virtual Disks
    Connect Initiators
    Testing Your Failover Cluster Configuration
    Server Core Installation Option of Windows Server 2008 Step-by-Step Guide:
    http://technet2.microsoft.com/windowsserver2008/en/library/47a23a74-e13c-46de-8d30-ad0afb1eaffc1033.mspx?mfr=true
  • Troubleshooting
    Reviewing cluster events
    Reviewing hardware events
    Using the Validate a Configuration Wizard
    Reviewing storage/SAN events
    Troubleshooting methodologies for cluster issues, whether in Windows 2003 or Windows 2008, are fairly similar. Most of the typical support issues in the cluster category fall under the following categories:
    · Cluster Service fails to start.
    · Cluster resources in a failed state or fail to come online.
    · Determine root cause of cluster failure.
    · Initial configuration of the cluster
    The Windows Server 2003 legacy CLUSTER.LOG text file no longer exists. In Windows Server 2008 the cluster log is handled by Event Tracing for Windows (ETW). This is the same logging infrastructure that handles the logs you are already familiar with, such as the System and Application event logs in Event Viewer.
    Command Line
    C:\>cluster log /gen
    PowerShell
    PS C:\> Get-ClusterLog
    ForceQuorum
    net start clussvc /forcequorum (or /fq)
  • Troubleshooting Tips
    • When you encounter a problem, always, always, always start with Cluster Events
    Look at a Cluster wide view of the Cluster events
    Dig into all events in the System Event log
    Check the Application Event log
    • Don’t be distracted by symptoms - focus on root cause
    For example, if you see Cluster IP Address failures, don’t waste lots of time looking at Cluster events
    • Instead look for other networking related errors
    There may be multiple retries after a failure, producing more events. Look for what caused the first failure
    You don’t always need to run a FULL validate
    http://technet.microsoft.com/en-us/library/cc732035(WS.10).aspx
    Don’t “assume” the cluster will work and skip Validate
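    The advice to look for what caused the first failure can be sketched as a merge of events from several logs followed by a search for the earliest error. This is only an illustration: the `Event` type, the sample events, and their timestamps are hypothetical, and exporting real events from the System, Application, and cluster logs is left out.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Event(NamedTuple):
    time: datetime
    log: str       # which log the event came from
    level: str     # e.g. "Information", "Warning", "Error", "Critical"
    message: str


def first_failure(events, levels=("Error", "Critical")) -> Optional[Event]:
    """Return the earliest event whose level marks a failure, or None."""
    failures = [e for e in events if e.level in levels]
    return min(failures, key=lambda e: e.time, default=None)


# Hypothetical events: the network error precedes the repeated cluster errors.
events = [
    Event(datetime(2011, 8, 10, 9, 0, 7), "Cluster", "Error", "IP Address resource failed"),
    Event(datetime(2011, 8, 10, 9, 0, 2), "System", "Error", "Network adapter link down"),
    Event(datetime(2011, 8, 10, 9, 0, 9), "Cluster", "Error", "IP Address resource failed (retry)"),
    Event(datetime(2011, 8, 10, 8, 59, 0), "System", "Information", "Service started"),
]
root = first_failure(events)
print(root.log, "-", root.message)
```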
  • Cluster Events
    Cluster Events
    Recent Cluster Events shows the events from the last 24 hours.
    Monitoring Cluster Events
    Fully featured Failover Cluster Management Packs
    Cluster logging level
    Set-ClusterLog -Level 3
  • Configuring Debug Logging
    Logging enabled by default
    Log files stored as .ETL in:
    %WinDir%\System32\winevt\logs\Microsoft-Windows-FailoverClustering
    Default log size is 100 MB
    Set-ClusterLog -Size 100
    Default log level is 3
    Set-ClusterLog -Level 3
    Up to three log files
    This means log history can be kept for up to three reboots
    The number of logs can be modified via the registry:
    HKLM\Software\Microsoft\Windows\CurrentVersion\WINEVT\Channels\Microsoft-Windows-FailoverClustering/Diagnostic\FileMax
    Default
    Can have performance impact
  • Extended PowerShell Commands
    http://blogs.technet.com/b/josebda/archive/2010/09/19/mapping-cluster-exe-commands-to-windows-powershell-cmdlets-for-failover-clusters-extended-edition.aspx
  • Problems Connecting to Cluster Nodes
    WMI is used by the ‘Create Cluster Wizard’, ‘Validate a Configuration Wizard’, and ‘Add Node Wizard’, so any of the following messages and warnings could be due to WMI issues:
    · "RPC Server Unavailable" error.
    · Access is Denied.
    · The computer ‘Node1’ could not be reached.
    · Failed to retrieve the maximum number of nodes for ‘{0}’.
    · The computer ‘Node1.contoso.com’ does not have the Failover Clustering feature installed. Use Server Manager to install the feature on this computer.
    o Note: first confirm you have installed the Failover Clustering feature on this node
    Troubleshooting Steps
    1) Ensure it is not a DNS Issue
    2) Check that WMI is running on the node (wbemtest)
    3) Check your firewall settings
    4) Reboot the node
    5) Rebuild a corrupt WMI repository
    · In the Services console, manually stop the WMI service to ensure that dependent services are stopped
    · Start the WMI service again
    · Launch an elevated CMD or PowerShell
    · CMD/PS > winmgmt /salvagerepository
    6) Patch WMI for Performance Improvements (974930)
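    Steps 1 and 3 above can be pre-checked from an administrative workstation with a short script. This is a rough sketch under stated assumptions: it only tests name resolution and whether TCP port 135 (the RPC endpoint mapper that WMI's remote calls start from) accepts a connection; it does not test WMI itself. The function names are hypothetical, and Node1.contoso.com is the example node name from the messages above.

```python
import socket


def resolve(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None if DNS lookup fails."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def port_open(address, port=135, timeout=2.0, connect=socket.create_connection):
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with connect((address, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_node(hostname, resolver=socket.gethostbyname,
               connect=socket.create_connection):
    """Report the first connectivity problem found for a cluster node."""
    address = resolve(hostname, resolver)
    if address is None:
        return "DNS: name does not resolve"
    if not port_open(address, connect=connect):
        return "Firewall/RPC: TCP 135 unreachable"
    return "reachable"


if __name__ == "__main__":
    print(check_node("Node1.contoso.com"))
```

    The resolver and connect parameters exist only so the checks can be exercised without a live network; in normal use they keep their defaults.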
  • Antivirus Exclusion
    Is your antivirus software cluster-aware?
    Antivirus software that is not cluster-aware may cause unexpected problems on a server that is running Cluster Services. For example, you may experience resource failures or problems when you try to move a group to a different node.
    If you are troubleshooting failover issues or general problems with Cluster services and antivirus software is installed, temporarily uninstall the antivirus software or check with the manufacturer of the software to determine whether the antivirus software works with Cluster services. Just disabling the antivirus software is insufficient in most cases. Even if you disable the antivirus software, the filter driver is still loaded when you restart the computer.
    How can I disable the antivirus on the system?
    Exclusion List (exclude the following from virus scanning):
    • The Q: (quorum) disk.
    • The %Systemroot%\Cluster folder.
    • The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning (Windows Server 2003).
    http://support.microsoft.com/kb/250355#appliesto
  • Cluster Log Error Meanings
    status 170 - Means "The requested resource is in use." This can be related to Persistent Reservation problems, but also to MPIO, fibre/HBA drivers, or some lower-level file system driver or software such as antivirus, quota management, or an open-file agent for backup software:
    00000c94.000008d4::<date and time>.585 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 33af636f.
    00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 170.
    00000c94.000008d4::<date and time>.616 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 170.
    status 5 - Usually a permissions-related problem. In this case the Cluster Service Account (CSA) username/password was not synchronized between the nodes. It can also happen if the cluster loses its secure channel connection to the DC that authenticates the CSA, or when a domain Group Policy Object (GPO) or a Local Policy Object is missing a User Rights Assignment that the CSA needs to function properly:
    000014a0.00001460::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor <Cluster Name> is invalid, status 5.
    000014a0.000017d0::::<date and time>.629 WARN [JOIN] Unable to get join version data from sponsor 10.7.47.100 using NTLM package, status 5.
    status 1117 - Means ERROR_IO_DEVICE (The request could not be performed because of an I/O device error); seen when Event ID 1123 occurs:
    000015a0.000014a8::<date and time>.511 WARN IP Address <IP Address resource name>: IP Interface 4 (address 10.101.160.65) failed LooksAlive check, status 1117, address 0x10119e0, instance 0xf74d6fb8.
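    When scanning a generated cluster log, the status codes above can be picked out and annotated automatically. The sketch below is an assumption-laden helper, not a full log parser: it covers only the three codes discussed on this slide, and the regular expression is fitted to the sample lines shown here rather than to a documented log format.

```python
import re

# Win32 error codes discussed above, with their symbolic names.
STATUS_MEANINGS = {
    5: "ERROR_ACCESS_DENIED: usually a permissions problem",
    170: "ERROR_BUSY: the requested resource is in use",
    1117: "ERROR_IO_DEVICE: I/O device error",
}

# Matches the level keyword and the first 'status <number>' that follows it.
STATUS_RE = re.compile(r"\b(INFO|WARN|ERR)\b.*?\bstatus (\d+)")


def explain(line: str):
    """Return (level, code, meaning) for a log line, or None if no status."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, STATUS_MEANINGS.get(code, "not in this table")


sample = ("00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: "
          "[DiskArb] Reserve completed, status 170.")
print(explain(sample))
```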
  • What Is a Cluster and Why Do We Use It
    Cluster Blog
    http://blogs.msdn.com/b/clustering/
    Technet Failover Cluster
    http://technet.microsoft.com/en-us/library/cc754482.aspx
    Configuring Auditing for a Windows Server 2008 Failover Cluster
    http://blogs.technet.com/b/askcore/archive/2009/01/19/configuring-auditing-for-a-windows-server-2008-failover-cluster.aspx
    Top Issues for Microsoft Support for Windows 2008 Failover Clusters
    http://blogs.technet.com/b/askcore/archive/2008/10/13/top-issues-for-microsoft-support-for-windows-2008-failover-clusters.aspx
    Checklist: Create a Clustered Virtual Machine
    http://technet.microsoft.com/en-us/library/dd759220.aspx
    Failover Clusters in Windows Server 2008 R2
    http://technet.microsoft.com/en-us/library/ff182338(WS.10).aspx
    Trouble Connecting to Cluster Nodes? Check WMI
    http://blogs.msdn.com/b/clustering/archive/2010/11/23/10095621.aspx
  • Questions & Thanks