Microsoft Exchange Server 2007 High Availability And Disaster Recovery Deep Dive


Published in: Technology
  • DB portability between different OS versions: watch out for the performance impact! An upgrade of the operating system for an Exchange database updates the OS Version value in the database header, and this update triggers a rebuild of the internal database indexes. When database portability is used to move a database from a Mailbox server running WS03 to a Mailbox server running WS08, the Extensible Storage Engine (ESE) detects the operating system upgrade and takes the following actions:
    - During the first database mount operation, all secondary indexes are discarded. A secondary index provides a specific view of the mailbox data (for example, when messages in a mail folder are sorted using Outlook in Online Mode). The database is not mounted and available to clients until this initial operation is complete. The time the operation takes depends largely on the size of the database: the larger the database, the longer the mount operation.
    - Secondary indexes are rebuilt on demand as Outlook users sort their views in Online Mode. In environments with large or extremely large databases, the on-demand rebuilding of indexes initially results in high processor and disk utilization.
  • This illustrates why we believe that CCR is a better solution than SCC. When you lose a database in SCC, you can recover by restoring a VSS clone, but that clone is a point-in-time copy (it could be 10 minutes old, 30 minutes old, and so on, depending on how frequently backups occur).
  • Storage failures for SCC can involve storage for data and the storage hosting the VSS clones. Typically, this is the same storage, so when it fails, you need to use remote data to recover.
  • This is a summary showing the RTO and RPO for these two solutions. To achieve the same RTO/RPO as CCR, an SCC solution also needs to be extended with replication technology, as well as hardware-based VSS (at least two clones).
    - RTO for Data/LUN failure is 15 minutes to 1 hour: while third-party solutions can activate a VSS clone quickly, the Exchange server still has to be brought up and recovery (playing the logs forward) still has to be run once the clone has been activated. This can take several minutes to over an hour, depending on the log backup regimen.
    - RPO: for CCR, the normal RPO can't really be measured by time. The items that can be lost are the items that don't go through Transport.
    - If a deployment has synchronous replication and no geo-clustering, then activating the copy is a manual DR process (expose the LUNs, go through the Exchange DR/Database Portability steps). The Exchange server may or may not be pre-built (this depends on the SLA and how much idle hardware a customer can afford).
    - Geo-clustered synchronous replication solutions are almost always failed over manually (automatic failover between sites is a big deal for customers, and they prefer to hit the “big red button”). RTO is typically ~15 minutes if all works correctly.
    - RPO for LOG LUN: if the log LUN dies, the DB becomes unclean. Jet can't shut down, and all unflushed writes to the DB are lost, leaving the DB in a bad state. Recovery must then be run but can't be, since the LOG LUN is dead; thus, the DB is also lost. If the logs have been synchronously replicated and the replicated copy of the logs is good, it can be used to recover the DB. However, if the LOG LUN was lost because of physical corruption of the logs, and that corruption was replicated to the LOG LUN's replicated copy, then the only option is to recover from a backup.
  • Polls and uses file system notifications to see a new log in a directory.
    - LogInspector verifies that the log is safe to replay (3rd party sync replication cannot provide this type of replicated data verification for logs):
      - Checksum
      - Is this log for this log stream?
      - Recopy on failure
  • If shares do exist, they will not be re-created. If permissions on the shares are messed up, remove the shares manually and cycle the Replication service. Different ReplicaInstance types cannot co-exist.
  • Logs required indicates that some transactions haven't been committed (some pages may have been written to disk, others may not have been). Checkpoint is the minimum log that we need in order to perform recovery. Waypoint is the maximum log needed for recovery, i.e. the last log file that may contain log records whose changes have been written to the physical database. Committed Generation is the last log file generated by ESE for the particular storage group.
  • Logically speaking, the dumpster is a property of the storage group, not of the storage group copy. Loss calculation: now minus the time of the last log inspected. Request dumpster resubmit: from 12 hours before the loss to 1 hour after the loss. A storage group's dumpster cannot grab extra space: every SG has a maximum dumpster size dedicated to that specific SG. Messages are stored only once but are counted against multiple SGs if they happen to be in more than one SG's dumpster. For example, Msg1 is delivered to both SG1 and SG2, so it counts against the dumpster quota for both SGs. Suppose SG1 got lots of messages and had to drop Msg1 from its dumpster (Msg1 is still on the Hub server because it is included in SG2's dumpster). When a dumpster resubmit request comes for SG1, Msg1 will get resubmitted because it happens to be on the server. This is not guaranteed, though.
  • This traffic (200 MB per hour per storage group, at 1 MB per log file) amounts to around 3 (= 20/6) logs/min/SG.

    1. Agenda
       • Solutions for Disaster Recovery
       • Mailbox Server High Availability
       • CCR and SCR: Better Together
       • Why CCR? Why not SCC?
       • Continuous Replication Demystified
    3. Solutions for Disaster Recovery
       • Deleted Item Retention: default 14 days
       • Deleted Mailbox Retention: default 30 days
       • Mailbox Service and Data Recovery
         - Server Recovery: Setup /m:RecoverServer, Setup /recoverCMS
         - Database portability
         - Dial tone portability
         - Continuous replication
       • Backup and Restore
         - Legacy streaming ESE backups
         - Volume Shadow Copy Service (VSS) backups
         - Recovery Storage Groups, alternate restores
       • Edge Transport Server: Cloned Configuration
    4. Solutions for Disaster Recovery
       Augment built-in solutions with other processes:
       • Configuration Management
       • Server build standardization
       • Server build documentation
       • Change management
       • Release management
       • Proactive monitoring
       • Detailed recovery plans
       • Regular integrity checks
       • Regular practice drills
    5. Server Recovery
       • Setup /m:RecoverServer
         - All roles except Edge
         - Fresh install and ImportEdgeConfig for Edge
         - All custom settings on the Client Access server must be recreated
         - Restrictions: can't use this for repairing a failed setup, migrating between different operating systems, or recovering or un-clustering a clustered mailbox server
       • Setup /recoverCMS
         - For CCR and SCC only
         - Restrictions: can't use this for changing from CCR to SCC or vice versa, migrating between different operating systems, clustering a standalone Mailbox server, or splitting or merging clustered Exchange environments
         - Does not trigger Transport Dumpster
         - Windows 2003 clustering has a dependency on the PDC Emulator
    6. Data Recovery
       • Switch to a replicated copy (Activation)
         - Passive copy (LCR/CCR)
         - Target copy (SCR)
       • Restore from backup
         - Same server
         - Database portability on alternate server
         - Database portability from Windows 2003 to Windows 2008 has initial performance impact
       • Dial tone and data merge using RSG
    8. Mailbox Server High Availability
       Built-in features for various levels of availability:
       • Local Continuous Replication (LCR): data availability
       • Single Copy Cluster (SCC): service availability
       • Cluster Continuous Replication (CCR): data and service availability
       • Standby Continuous Replication (SCR): disaster recovery and site resilience
    9. Mailbox Server High Availability: Local Continuous Replication (LCR)
    10. Mailbox Server High Availability: Single Copy Cluster (SCC)
    11. Mailbox Server High Availability: Cluster Continuous Replication (CCR)
    12. Standby Continuous Replication
        • SCR Sources: Standalone, CCR, SCC
        • SCR Targets: Standalone Mailbox Server (w/o LCR), Standby Cluster with Passive Mailbox Role
    14. CCR and SCR: Better Together
        • CCR provides high availability for Mailbox data and services within the datacenter
        • SCR replicates data remotely to provide site resilience for the Mailbox data
        [Diagram: Datacenter A, Datacenter B]
    15. CCR across 2 Sites
        [Diagram: Datacenter A, Datacenter B]
    16. CCR local / SCR to remote Site
        [Diagram: Datacenter A, Datacenter B]
    17. CCR/SCR vs SCC/Sync: 2 sites
        [Diagram: Datacenter A and Datacenter B, with DB and Logs on each side; the SCC side also shows VSS clones and backups.]
        • CCR: physical corruption; log corruption detected immediately on replication; Setup /recovercms, play logs forward at both targets
        • SCC: undetected physical corruption, 1 month later; Exchange Disaster Recovery or 3rd Party Failover; on storage or site failure in the primary site, if corruption is detected (or was not detected and corrected from a test failover), must recover from backup
    19. Why CCR? Why not SCC?
        • Single Point of Failure
          - CCR: none when stretched across sites or combined with SCR for site resiliency
          - SCC: Data, Storage and Site single points of failure. Potential for massive data loss on single failure:
            · Storage device failures can lose collocated backups
            · Hardware replication can propagate physical errors
            · Storage failure requires activation of remote copy if one exists
            · Requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR
        • Simplicity
          - CCR: simple setup; no special storage configuration; built-in Site Resilience; same technology and redundancy model for intra-cluster and inter-site protection
          - SCC: shared storage; storage configuration before and after forming cluster; complex storage stack; complex deployment to get RTO/RPO of 1 CCR
    20. Why CCR? Why not SCC?
        • Backups
          - CCR: backups off passive copy eliminates/reduces backup window
          - SCC: backups must be off active
        • TCO
          - CCR: reduced TCO. Cheaper hardware; no special storage expertise required; in-the-box solution; integrated management; single operations team; reduced backup cost
          - SCC: higher TCO. Additional products needed to achieve equivalent combined RTO/RPO; separate management tools for HA operations may be required; higher-end servers and storage required; storage expertise needed
        • Large Mailboxes
          - CCR: great RTO/RPO, simplicity, no maintenance window, reduced TCO → improved support for larger mailboxes
          - SCC: higher TCO, long recovery times constrain mailbox size
    21. Why CCR? Why not SCC?
        Columns compared: CCR = Stretched CCR or CCR + SCR; SCC = SCC + SCR/3rd party replication + 2 VSS clones (to approach combined RTO/RPO of 1 CCR cluster).
        • RTO
          - Server: CCR ~2 minutes; SCC ~2 minutes
          - Data or LUN: CCR ~2 minutes; SCC 15 min to 1 hour
          - Full Storage: CCR ~2 minutes; SCC ~15 min with synchronous replication, days with VSS clones only
          - Site: CCR ~2 minutes for Stretched CCR, 30-60 minutes for CCR + SCR; SCC ~15 min with synchronous replication, days with VSS clones only
        • RPO
          - Server: CCR 0 for mail* (items that can be lost: appointment, contact, task, draft); SCC 0, uses same copy of data
          - Physical Corrupt, DB: CCR 0; SCC hours to days if sync repl, point in time if VSS
          - Physical Corrupt, Logs: CCR 0 (must reseed passive); SCC N/A if log not needed, same as DB if needed
          - DB LUN dies: CCR 0; SCC 0 with synchronous replication, point-in-time with VSS clones
          - LOG LUN dies: CCR 0 for mail* (items that can be lost: appointment, contact, task, draft); SCC 0 with synchronous replication, point-in-time with VSS clones
          - Full Storage: CCR 0 for mail* (items that can be lost: appointment, contact, task, draft); SCC 0 with synchronous replication, hours to days with VSS clones only
          - Site: CCR same as Server for Stretched CCR, 1 Log**; SCC 0 with synchronous replication, hours to days with VSS clone
        * Assumes following best practice guidance for Transport Dumpster
        ** Assumes replication is keeping up
    22. Why CCR? Why not SCC?
        • Logical Corruption
          - Corruptions caused by the application
          - Logical corruption is replicated by all replication solutions
          - SCR with lag replay can mitigate if detected early
        • Physical Corruption
          - SCC: no mechanism to detect database corruption on the copy replicated by 3rd party solutions
          - SCC: no mechanism to detect log corruption on the copy replicated by 3rd party solutions
          - With hardware-based replication, a deeper stack can lead to corruption caused by: HBA driver/firmware, multi-path driver, server hardware, FC switch firmware, storage controller firmware/OS, target storage controller firmware/OS
    24. Basic Replication Pipeline
        [Diagram: Source DB / Store → Source Log Directory → Log Copier → Inspector Directory → Log Inspector → Replica Log Directory → Log Replayer → Target DB]
    25. Continuous Replication Basics
        • When the current log file is closed, it is copied to the replication target by the Replication service
        • Replication service:
          - at source: creates read-only shares for the log directory
          - at target: reads from the shares and pulls a copy of the log file
          - contains a ReplicaInstance for each storage group
        • Configuration is discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR)
    26. Continuous Replication Basics
        Communication is done via logs, registry, cluster database and RPC:
        • Logs: replicate database changes and backup status
        • Registry: used in LCR and SCR; also in CCR for checkpointing the current log generation value for loss calculation
        • Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
        • RPCs: target Replication service RPCs into Store for log truncation coordination
    27. Lost Log Resilience (LLR)
        • Designed to minimize the need to reseed after a lossy failover
        • Database changes are written to the log file prior to the database, and the database can be updated as soon as the change is logged
        • LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created
        • Utilizes a new log stream marker called the waypoint
          - Minimum Log Required to prevent database divergence
          - No modifications after the waypoint have been written to the database
    28. Log Stream Markers
        • Committed: Log generation 20
        • Checkpoint: Log generation 2
        • Waypoint: Log generation 10
        What this means: only logs 2-10 are needed; logs 11-20 can be discarded.
          Initiating FILE DUMP mode...
          Database: priv1.edb
          ...
          State: Dirty Shutdown
          Log Required: 2-10 (0x2-0xA)
          Log Committed: 0-20 (0x0-0x14)
          ...
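The marker arithmetic on the Log Stream Markers slide can be sketched in a few lines of Python. This is only an illustration of the slide's example (checkpoint 2, waypoint 10, committed 20); the class and function names are hypothetical and are not part of ESE or any Exchange tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogStreamMarkers:
    """Hypothetical container for the three markers named on the slide."""
    checkpoint: int  # minimum log generation needed for recovery
    waypoint: int    # maximum log generation needed for recovery
    committed: int   # last log generation generated by ESE

    def __post_init__(self):
        # The markers are expected to be ordered along the log stream.
        if not (self.checkpoint <= self.waypoint <= self.committed):
            raise ValueError("expected checkpoint <= waypoint <= committed")


def required_generations(m: LogStreamMarkers) -> range:
    """Generations needed for recovery: checkpoint through waypoint."""
    return range(m.checkpoint, m.waypoint + 1)


def discardable_generations(m: LogStreamMarkers) -> range:
    """Generations past the waypoint, which the database does not depend on."""
    return range(m.waypoint + 1, m.committed + 1)


markers = LogStreamMarkers(checkpoint=2, waypoint=10, committed=20)
print("required:", list(required_generations(markers)))
print("discardable:", list(discardable_generations(markers)))
```

With the slide's values, the required range is generations 2 through 10 and the discardable range is 11 through 20, matching "Log Required: 2-10" in the file dump.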
    29. [Diagram: log generations 1-21 on NodeA and NodeB, with the checkpoint at generation 4 and the waypoint at generation 12.] Sequence shown:
        • Healthy CCR
        • NodeA fails and a failover to NodeB occurs
        • Validate database can mount: logs lost < AutoDatabaseMountDial
        • Logs are generated on NodeB (beyond gen 21)
        • NodeA recovers and performs a divergence check
        • NodeA performs incremental reseed and copies logs
        • Healthy CCR
    30. When Do I Need A Full Reseed?
        Rarely:
        • Lost log past current Waypoint
          - Admin accepted a large amount of loss by running Restore-StorageGroupCopy
          - Automatic mount while LLR was “not honored”
          - Automatic lossy mount with “stale” loss window calculation
        • Log corruption prior to log replay
          - ESE cannot skip over logs
        • Database files modified outside of Store or Replication service
          - E.g., offline defrag, eseutil /r
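The "lost log past current Waypoint" condition can be illustrated with a simplified model. The sketch below is an assumption-laden reading of the slides, not the Replication service's actual logic: it assumes that a failover loses the generations that were never copied to the passive node, and that the failed node's database has diverged only if a lost generation is at or below its waypoint. The function name and return values are hypothetical.

```python
def reseed_kind(waypoint: int, last_copied: int, last_generated: int) -> str:
    """Classify the reseed needed on the failed node (simplified model).

    waypoint       -- waypoint generation of the failed node's database
    last_copied    -- last generation copied to the passive node before failure
    last_generated -- last generation generated on the failed node
    """
    if last_copied >= last_generated:
        return "none"          # nothing was lost
    first_lost = last_copied + 1
    if first_lost > waypoint:
        # All lost logs lie beyond the waypoint, so none of their changes
        # were written to the database: incremental reseed is enough.
        return "incremental"
    # A lost log is at or before the waypoint: its changes may already be in
    # the database file, so the copies have diverged.
    return "full"


# Values loosely based on the diagram: waypoint 12, generations up to 21.
print(reseed_kind(waypoint=12, last_copied=18, last_generated=21))
print(reseed_kind(waypoint=12, last_copied=10, last_generated=21))
```

In this model the first call (generations 19-21 lost) needs only an incremental reseed, while the second (generations 11-21 lost) crosses the waypoint and needs a full reseed. The other causes on the slide, such as log corruption or offline changes to the database files, are not modeled.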
    31. • Hub Transport servers retain messages that have been delivered to the destination mailbox until a size or time limit is reached
        • Transport Dumpster is per storage group per Hub Transport server, for servers in the same Active Directory site as the storage group
        • Transport Dumpster statistics: Get-StorageGroupCopyStatus -DumpsterStatistics
          Output:
          DumpsterServersNotAvailable: {HUB1}
          DumpsterStatistics: {HUB2(2/25/2009 10:20:37 PM; 2 ; 1032KB)}
    32. [Diagram: a CCR CMS with MBX1 (active) and MBX2 (passive) hosting SG1 and SG2, and Hub Transport servers HUB1 and HUB2, each with a per-SG "Dumpster Contents" table holding Msg1 to Msg4. "Redeliver SG1,SG2" requests to each Hub Transport server return timeout, retry, or success, and a "Resubmit Required" table lists the Hub Transport servers (HUB1, HUB2) still outstanding for each SG.]
    33. How much data loss can transport dumpster mitigate?
        • 18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
        • [20 MB / 10 hour] x [100 users / SG] = 200 MB message traffic in one hour
        • Putting the above two together gives 60 min x 144 / 200 = 43.2 minutes' worth of data
        • In 43.2 minutes, 144+ logs are created per SG
        • Customize transport dumpster size/time limit:
          Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
        • No time window guarantees: if there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for the destination storage group(s) on a given Hub Transport server
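The sizing arithmetic on this slide can be reproduced with a short Python sketch, which makes it easy to try other dumpster sizes or user counts. It assumes that "20 MB / 10 hour" means 20 MB of mail per user over a 10-hour day, and that each log file is 1 MB; the function and parameter names are hypothetical.

```python
def dumpster_coverage(dumpster_mb_per_sg: float, hub_servers: int,
                      mb_per_user_per_day: float, hours_per_day: float,
                      users_per_sg: int, log_file_mb: float = 1.0):
    """Return (total dumpster MB, traffic MB/hour, minutes covered, logs/minute)."""
    total_mb = dumpster_mb_per_sg * hub_servers           # dumpster space per SG
    traffic_mb_per_hour = mb_per_user_per_day / hours_per_day * users_per_sg
    minutes_covered = 60 * total_mb / traffic_mb_per_hour  # how long until it fills
    logs_per_minute = traffic_mb_per_hour / log_file_mb / 60
    return total_mb, traffic_mb_per_hour, minutes_covered, logs_per_minute


total, traffic, minutes, logs_per_min = dumpster_coverage(
    dumpster_mb_per_sg=18, hub_servers=8,
    mb_per_user_per_day=20, hours_per_day=10, users_per_sg=100)

print(f"{total} MB of dumpster, {traffic:.0f} MB/hour of traffic")
print(f"covers about {minutes:.1f} minutes, at {logs_per_min:.2f} logs/min/SG")
```

With the slide's inputs this gives 144 MB of dumpster, 200 MB per hour of traffic and 43.2 minutes of coverage, at roughly 3.3 logs per minute per storage group (about 144 logs over the window).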
    34. When CCR detects a lossy failover:
        • Expands the loss window by 12 hours back and 4 hours forward
        • Finds all Hub Transport servers in the local Active Directory site
        • Requests transport dumpster redelivery from all detected servers
          - New servers are not added to the redelivery list
          - Inaccessible servers: CCR retries the same request every 30 seconds until the configured MaxDumpsterTime
        • If multiple lossy failovers take place, the new loss window is added to the previous one
        • Restore-StorageGroupCopy on LCR is a one-time request, with no retries
        • Redelivery is not triggered as part of Setup /recoverCMS
        • No other ways to redeliver messages from the transport dumpster
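As a rough illustration of the window expansion described on this slide, the sketch below widens a loss window by 12 hours back and 4 hours forward. It assumes the unexpanded window runs from the time of the last log inspected to the time of the failover; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Expansion values as stated on the slide.
EXPAND_BACK = timedelta(hours=12)
EXPAND_FORWARD = timedelta(hours=4)


def redelivery_window(last_log_inspected: datetime, failover_time: datetime):
    """Return the (start, end) of the expanded redelivery request window."""
    if failover_time < last_log_inspected:
        raise ValueError("failover_time must not precede last_log_inspected")
    return last_log_inspected - EXPAND_BACK, failover_time + EXPAND_FORWARD


# Hypothetical example: last log inspected at 22:00, failover at 22:20.
start, end = redelivery_window(datetime(2009, 2, 25, 22, 0),
                               datetime(2009, 2, 25, 22, 20))
print("request redelivery from", start, "to", end)
```

For this example the request covers 10:00 on 25 February through 02:20 on 26 February. The wide margins reflect that the dumpster offers no time window guarantees, so the request errs on the side of asking for too much.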
    35. Redundant Networks
        • Use for log shipping and seeding in CCR
        • Enable-ContinuousReplicationHostName
        • Seeding: Update-StorageGroupCopy -DataHostNames:Host1,Host2
        • Get-ClusteredMailboxServerStatus reports:
          - OperationalReplicationHostNames
          - FailedReplicationHostNames
          - InUseReplicationHostNames
        • Watch out for a misconfigured hosts file
    36. Circular Logging
        • One configuration setting with two consumers
          - Store service: requires the database to be dismounted and remounted to take effect
          - Replication service: picks up the new setting dynamically
        • In CCR, it's no big deal to switch between on/off/on
        • In some settings, logs are deleted prematurely
          - Example: turn off circular logging, then enable LCR without a dismount/mount of the database
          - ESE is still doing log truncation with circular logging logic
          - Logs will get truncated before making it to the LCR copy
        • To be safe, follow this recipe: suspend, dismount, change setting, mount, resume
    37. © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.