Scott Schnoll - Exchange server 2013 high availability and site resilience


Transcript

  • 1. Scott Schnoll Exchange Server 2013 High Availability and Site Resilience
  • 2. Agenda • DAG Architecture: MSExchangeRepl, MSExchangeDAGMgmt, Cluster, Crimson Channel • Witness Server • Dynamic quorum • DAG member maintenance
  • 3. DAG ARCHITECTURE
  • 4. DAG Replication Service • Introduced in Exchange 2007 RTM • Microsoft Exchange Replication service | MSExchangeRepl • MSExchangeRepl.exe • Runs on all Mailbox servers (not just DAG members) • Communicates with Active Directory and other DAG members • Includes 16 components: Active Directory lookup; Replay RPC server wrapper; TPR API manager; Copy status lookup; Remote data provider wrapper; Support API manager; Replay core manager; VssWriter; Server locator manager; Seed manager; Active manager; Health state tracker; Autoreseed manager; Active manager RPC server wrapper; Disk reclaimer manager; Failure item manager
  • 5. DAG Management Service • Introduced in Exchange 2013 RTM CU2 • Microsoft Exchange DAG Management service | MSExchangeDagMgmt • MSExchangeDagMgmt.exe • Runs on all Mailbox servers (not just DAG members) • Communicates with Active Directory and other DAG members • Includes 4 components: Active Directory lookup; Copy status lookup; Monitoring; Tracer instance
  • 6. DAG Management Service • Created for two primary reasons: so the Replication service can have more focused functionality; so Managed Availability actions can kill lower-priority activities • Microsoft Exchange DAG Management service | MSExchangeDagMgmt • MSExchangeDagMgmt.exe • Runs on all Mailbox servers (not just DAG members) • Communicates with Active Directory and other DAG members • Writes events to same place as Replication service • Other functions will move to this service: AutoReseed, Disk reclaimer, Dynamic replay lag playdown; future AutoDAG copy layout and mobility features
  • 7. Cluster service • Introduced in Windows NT Server 4.0, Enterprise Edition (1997) • Cluster Service | ClusSvc • Clussvc.exe • Exchange DAGs use several cluster components: Quorum; Membership and node management; Networks and heartbeating; Cluster registry
  • 8. Cluster service • Quorum is required in order to mount databases • Quorum is based on votes, not membership • Voting can be rigged • Votes can be taken away manually or dynamically • Exchange manages quorum model, not quorum • Exchange management of quorum model based on nodes, not votes • Removing votes requires manual configuration of quorum model • Exchange will make incorrect quorum model management decisions if votes are manually removed at the cluster level
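The "votes, not membership" point on this slide can be illustrated with a little arithmetic. The following is a minimal Python sketch, not anything taken from the deck or from the cluster service itself; the function names `votes_needed` and `has_quorum` are hypothetical, and the model assumes quorum simply means a strict majority of the configured votes.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest vote count that is a strict majority of total_votes."""
    return total_votes // 2 + 1


def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True when the votes that are present form a strict majority."""
    return votes_present >= votes_needed(total_votes)


# Four DAG members plus a witness: 5 votes in total, 3 needed.
print(votes_needed(5))
# The same four members with no witness vote: 4 votes in total, still 3 needed.
print(votes_needed(4))
# Two of four votes is a tie, not a majority.
print(has_quorum(2, 4))
```

Under this model, removing a vote changes the total that the majority is computed from, which is why a manual vote removal at the cluster level can invalidate quorum model decisions that assume one vote per node.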
  • 9. Cluster registry • Active Manager stores database / server information in the cluster registry for DAG members • Registry changes are replicated immediately to all DAG members • Stored information is used as part of Best Copy and Server Selection (BCSS)
  • 10. Cluster registry • Sample entry: IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True* • ActiveServer: name of the server where the database is currently mounted or is expected to be mounted when mount operations complete • LastMountedServer: the name of the server where the database was last successfully mounted • LastMountedTime: the date and time stamp of the last time the database was mounted
  • 11. Cluster registry • Sample entry: IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True* • MountStatus: the current mount status for the database; possible values are mounted / dismounted • IsAdminDismounted: designates whether the current dismounted status of the database is the result of administrator action; possible values are true / false • IsAutomaticActionsAllowed: designates whether the database can be automatically activated by AM; possible values are true / false
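The sample entry on these two slides looks like a flat list of `key?value` pairs, each terminated by `*`. As a rough sketch of how such a string could be split into fields, assuming that layout holds in general (the deck shows only this one sample), the hypothetical helper `parse_entry` below does the split in Python. The timestamp in `SAMPLE` is written as 2013-07-15T22:29:39, which is an assumed reading of the date shown on the slide.

```python
# Sample entry from the slide; the timestamp formatting is an assumption.
SAMPLE = (
    "IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*"
    "LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*"
    "IsAdminDismounted?False*IsAutomaticActionsAllowed?True*"
)


def parse_entry(raw: str) -> dict:
    """Split a 'key?value*key?value*' string into a dict of strings."""
    fields = {}
    for pair in raw.split("*"):
        if not pair:
            continue  # skip the empty piece after the trailing '*'
        key, _, value = pair.partition("?")
        fields[key] = value
    return fields


entry = parse_entry(SAMPLE)
print(entry["ActiveServer"], entry["MountStatus"])  # prints "ex2 Mounted"
```

All values stay as strings here; converting `True`/`False` or the timestamp to richer types is left out because the deck does not describe the value formats beyond this sample.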
  • 12. Crimson channel • Applications and Services logs: area of Windows Server event log used by applications for logging and internal communication • These logs store events from a single application or component rather than events that might have system-wide impact • This is referred to as an application's crimson channel • Exchange 2013 has multiple channels: ActiveMonitoring; HighAvailability; MailboxDatabaseFailureItems; ManagedAvailability; PushNotifications; Troubleshooters
  • 13. Crimson channel
  • 14. WITNESS SERVER
  • 15. Witness Server • A server used by a failover cluster that has an even number of members • Is not a member of the cluster • Does not contain a full copy of quorum data • Represented by File Share Witness resource: if server or share are not available, cluster resources are failed and moved to another node; if another node does not bring the resource online, the resource remains in a Failed state, with restart attempts every 60 minutes; if needed for quorum, but cannot be brought online, quorum will be lost; uses IsAlive check for availability
  • 16. Witness Server • A lock is not actively maintained on the witness • When it becomes necessary to obtain an additional vote to maintain quorum • An SMB file lock is placed on the witness.log file by one node • Node paxos information is incremented by the locking node and the updated paxos tag written to the witness.log file • Lock is released when witness server is no longer needed to maintain quorum
  • 17. Windows Failover Clustering • Node that locks witness.log gets the witness vote • If enough nodes are in contact with the locking node to constitute a majority, they will maintain quorum and continue providing service • Nodes not in contact with the locking node are in the minority and lose quorum • Nodes not owning cluster core resources wait 6 seconds prior to attempting to lock the FSW (arbitrationDelay)
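To make the "node that locks witness.log gets the witness vote" rule concrete, here is a small Python sketch of a network split. It is a simplified model and not the cluster's arbitration code: every node is assumed to carry exactly one vote, the witness adds one more vote to whichever partition contains the locking node, and the function name `partition_keeps_quorum` and the server names are hypothetical.

```python
def partition_keeps_quorum(partition, all_nodes, witness_lock_holder):
    """Return True if this partition holds a strict majority of votes.

    partition: set of node names that can still talk to each other
    all_nodes: set of every node in the cluster (one vote each, assumed)
    witness_lock_holder: name of the node that locked witness.log
    """
    votes = len(partition)
    if witness_lock_holder in partition:
        votes += 1  # the locking node brings the witness vote with it
    total_votes = len(all_nodes) + 1  # node votes plus the witness vote
    return votes * 2 > total_votes


nodes = {"ex1", "ex2", "ex3", "ex4"}
side_a = {"ex1", "ex2"}
side_b = {"ex3", "ex4"}

# ex1 wins the race to lock witness.log during a 2/2 split.
print(partition_keeps_quorum(side_a, nodes, "ex1"))  # True: 3 of 5 votes
print(partition_keeps_quorum(side_b, nodes, "ex1"))  # False: 2 of 5 votes
```

With an even split, the witness vote is the tie-breaker, so only the side in contact with the locking node continues providing service.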
  • 18. Windows Failover Clustering [timeline diagram, ticks 0 to 16, showing cluster core resources, the witness.log lock, and sequence numbers 20, 21, 22] • Cluster state change: node owning cluster core resources locks FSW and updates sequence number • Challenging node attempts witness lock; lock already exists and sequence number is higher, so the challenge is not successful • All nodes available; FSW lock released; changes replicated, sequence numbers in sync
  • 19. Windows Failover Clustering [timeline diagram, ticks 0 to 16, showing cluster core resources, the witness.log lock, and sequence numbers 20, 21, 22] • Cluster state change: node owning cluster core resources unavailable • Challenging node attempts witness lock; no lock exists, lock successful, sequence number updated • All nodes available; FSW lock released; changes replicated, sequence numbers in sync
  • 20. Witness server placement • Basic guidance for Exchange 2010 • “We recommend that you use a Hub Transport server running on Microsoft Exchange Server 2010 in the Active Directory site containing the DAG. This allows the witness server and directory to remain under the control of an Exchange administrator.” • “If your DAG is extended to multiple datacenters, we recommend deploying the witness server in the datacenter that is considered to be the primary datacenter.”
  • 21. Witness server placement • Exchange 2013 guidance more complicated due to new options introduced by architectural changes • Exchange 2013 includes support for new DAG configuration options that are not recommended or possible in previous versions of Exchange • One such option is a third location, such as a third physical datacenter or branch office
  • 22. Witness server placement • Ultimately, the placement of a DAG’s witness server depends on business requirements and the options available to the organization
  • 23. Witness server placement • Deployment scenarios and recommendations:
      Deployment scenario: Single DAG deployed in a single datacenter. Recommendation: Locate witness server in the same datacenter as DAG members.
      Deployment scenario: Single DAG deployed across two datacenters; no additional locations available. Recommendation: Locate witness server in primary datacenter.
      Deployment scenario: Multiple DAGs deployed in a single datacenter. Recommendation: Locate witness server in the same datacenter as DAG members. Additional options include using the same witness server for multiple DAGs, and using a DAG member to act as a witness server for a different DAG.
      Deployment scenario: Multiple DAGs deployed across two datacenters. Recommendation: Locate witness server in the same datacenter as DAG members. Additional options include using the same witness server for multiple DAGs, and using a DAG member to act as a witness server for a different DAG.
      Deployment scenario: Single or multiple DAGs deployed across more than two datacenters. Recommendation: Locate the witness server in the datacenter where you want the majority of quorum votes to exist.
  • 24. Witness server placement • If the organization has a third location, a DAG’s witness server can be deployed there for automatic failover between sites • The witness server location must have network infrastructure and connectivity that is isolated from network failures that affect the two datacenters with Exchange • For all DAGs, the availability of the witness server should be on the Exchange administrator’s radar
  • 25. Witness server placement • Azure is not supported for use as a Witness Server for Exchange DAGs • Investigation into using Azure to host witness server ran into dead end • Azure does not yet support the required underlying network configuration to enable an Azure file server VM to act as a witness server • More info at http://aka.ms/DAGAzure
  • 26. DYNAMIC QUORUM
  • 27. Dynamic Quorum • In Windows Server 2008 R2, quorum majority is fixed, based on the initial cluster configuration • In Windows Server 2012 (and later), cluster quorum majority is determined by the set of nodes that are active members of the cluster at a given time • This new feature is called Dynamic Quorum, and it is enabled for all clusters by default
  • 28. Dynamic Quorum • Cluster dynamically manages the vote assignment to nodes, based on the state of each node • When a node shuts down or crashes, the node loses its quorum vote • When a node successfully rejoins the cluster, it regains its quorum vote • By dynamically adjusting the assignment of quorum votes, the cluster can increase or decrease the number of quorum votes that are required to keep running • This enables the cluster to maintain availability during sequential node failures or shutdowns
  • 29. Dynamic Quorum • With dynamic quorum management, it is also possible for a cluster to run on the last surviving cluster node • By dynamically adjusting the quorum majority requirement, the cluster can sustain sequential node shutdowns to a single node • This is referred to as “Last Man Standing” scenario
  • 30. Dynamic Quorum • Does not allow a cluster to sustain a simultaneous failure of a majority of voting members • To continue running, the cluster must always have a quorum majority at the time of a node shutdown or failure • If you remove a node’s vote, the cluster does not dynamically add the vote back
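The difference between sequential and simultaneous failures can be sketched with a simplified vote model in Python. This is an illustration under stated assumptions, not the Windows implementation: each node has one vote, there is no witness, the function names are hypothetical, and the model stops short of the two-node case, where the cluster adjusts individual node weights.

```python
def fixed_quorum_survives(initial_nodes: int, failed: int) -> bool:
    """Fixed majority: the required majority never changes."""
    surviving = initial_nodes - failed
    return surviving * 2 > initial_nodes


def dynamic_quorum_survives(initial_nodes: int, failure_batches) -> bool:
    """Dynamic majority: votes are recalculated after each batch.

    failure_batches lists how many nodes fail at the same moment in each
    step, for example [1, 1, 1] for three sequential single-node failures.
    """
    votes = initial_nodes
    for batch in failure_batches:
        surviving = votes - batch
        if surviving * 2 <= votes:
            return False  # a majority was lost at the moment of failure
        votes = surviving  # failed nodes lose their votes
    return True


# Seven nodes, four failures one after another: three nodes remain.
print(fixed_quorum_survives(7, 4))                  # False
print(dynamic_quorum_survives(7, [1, 1, 1, 1]))     # True
# Seven nodes, four failures at the same moment.
print(dynamic_quorum_survives(7, [4]))              # False
```

The last call shows the limit stated on the slide: recalculating votes only helps if a majority was still present when each failure happened.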
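  • 31. Dynamic Quorum • DQ = 7
  • 32. Dynamic Quorum • DQ = 4 [diagram: three nodes marked X]
  • 33. Dynamic Quorum • DQ = 3 [diagram: four nodes marked X]
  • 34. Dynamic Quorum • DQ = 2 [diagram: five nodes marked X]
  • 35. Dynamic Quorum • DQ = 2 [diagram: five nodes marked X]
  • 36. Dynamic Quorum • DQ = 2 [diagram: five nodes marked X; the two remaining nodes labeled 0 and 1]
  • 37. Dynamic Quorum • DQ = 2 [diagram: five nodes marked X; the two remaining nodes labeled 1 and 0]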
  • 38. Dynamic Quorum • Use Get-ClusterNode to verify the DynamicWeight common property of a node: 0 = does not have quorum vote; 1 = has quorum vote
      Get-ClusterNode <Name> | ft name, *weight, state
      Name DynamicWeight NodeWeight State
      ---- ------------- ---------- -----
      EX1  1             1          Up
  • 39. Dynamic Quorum and DAGs • Does not change quorum requirements for DAGs • Does work with DAGs • All internal DAG testing done with dynamic quorum enabled • Enabled in Office 365 for servers on Windows Server 2012 • Exchange is not dynamic quorum-aware
  • 40. Dynamic quorum and DAGs • Cluster team guidance on dynamic quorum: “Selecting this option generally increases the availability of the cluster. By default the option is enabled, and it is strongly recommended to not disable this option. This option allows the cluster to continue running in failure scenarios that are not possible when this option is disabled.” • Exchange team guidance on dynamic quorum: leave it enabled for majority of DAG members; don’t factor it into availability plans • The advantage is that, in some cases where 2008 R2 would have lost quorum, 2012 can maintain quorum; this only applies to a few cases, and should not be relied upon when planning a DAG
  • 41. DAG MEMBER MAINTENANCE
  • 42. DAG member maintenance • Basic guidance for DAG member maintenance in Exchange 2010 • Run StartDagServerMaintenance.ps1 to put DAG member in maintenance mode • Perform the maintenance (e.g., install the update rollup) • Run StopDagServerMaintenance.ps1 to take DAG member out of maintenance mode and put it back into production • Optionally rebalance the DAG by using RedistributeActiveDatabases.ps1
  • 43. Exchange 2013 guidance more complicated • Go into maintenance mode:
      Set-ServerComponentState <Server> -Component HubTransport -State Draining -Requester Maintenance
      Restart-Service MSExchangeTransport
      Set-ServerComponentState <Server> -Component UMCallRouter -State Draining -Requester Maintenance
      Redirect-Message -Server <Server> -Target <FQDNTarget>
      Suspend-ClusterNode <Server>
      Set-MailboxServer <Server> -DatabaseCopyActivationDisabledAndMoveNow $True
      Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy Blocked
      Set-ServerComponentState <Server> -Component ServerWideOffline -State Inactive -Requester Maintenance
    Verify maintenance mode:
      Get-ServerComponentState <Server> | ft Component,State -Autosize
      Get-MailboxServer <Server> | ft DatabaseCopy* -Autosize
      Get-ClusterNode <Server> | fl
      Get-Queue
  • 44. Exchange 2013 guidance more complicated • Go into production:
      Set-ServerComponentState <Server> -Component ServerWideOffline -State Active -Requester Maintenance
      Set-ServerComponentState <Server> -Component UMCallRouter -State Active -Requester Maintenance
      Resume-ClusterNode <Server>
      Set-MailboxServer <Server> -DatabaseCopyActivationDisabledAndMoveNow $False
      Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy Unrestricted
      Set-ServerComponentState <Server> -Component HubTransport -State Active -Requester Maintenance
      Restart-Service MSExchangeTransport
    Verify production mode:
      Get-ServerComponentState <Server> | ft Component,State -Autosize
      Get-MailboxServer <Server> | ft DatabaseCopy* -Autosize
      Get-ClusterNode <Server> | fl
      Get-Queue
  • 45. SUMMARY
  • 46. Summary • DAG architecture continues to evolve • More witness server placement options available • Dynamic quorum works with DAGs • DAG member maintenance mode process is new
  • 47. Scott Schnoll scott.schnoll@microsoft.com Twitter: @Schnoll Blog: http://aka.ms/schnoll QUESTIONS?
  • 48. Please evaluate the session before you leave 