Tackling Disaster in a Sunrise Clinical Manager Environment … Disaster Restart OR Disaster Recovery?

Ziaul Mannan – Sr. Technical DBA
Howard Goldberg – Director, Clinical Systems and Support
Yale-New Haven Hospital
New Haven, Connecticut
 944-bed tertiary teaching facility
 2,600 medical staff
 7,550 employees
 100% CPOE
 Average 350,000 orders monthly
 Average daily census of 724
 7-time “Most Wired” and 3-time “Most Wireless” hospital by Hospitals & Health Networks
Future Clinical Cancer Center at Yale-New Haven Hospital
 112 inpatient beds
 Outpatient treatment rooms
 Expanded operating rooms
 Infusion suites
 Diagnostic imaging services
 Therapeutic radiology
 Specialized Women's Cancer Center
 Yale-New Haven Breast Center/GYN Oncology Center
 Tentative completion date: 2009
Problem
• After the events of 9/11, the hospital realized it needed redundant data centers with the ability to provide “zero” downtime.
• Implemented SCM with server clusters and an EMC SAN situated in data centers on opposite ends of the hospital campus.
Goals
• Provide 24x7x365 uptime
• Minimize downtime
• Faster recovery in a DR situation
• Database must be consistent – or it won't come up

Challenges
• Build a redundant system across data centers more than 2 km apart
• Overcome the limitations of clustering solutions
• Design a system that provides both redundancy and a DR solution
YNHH Production Environment
• SCM 4.5 SP3 RU3, migrating to SCM 4.5 with SMM – 10/03/06
  – Total users defined – 10,000
  – Users logged on at peak hours ~450
  – SCM Reports, HL7 interfaces, CDS, Multum – no CDR
  – Total disk for data: 700 GB (all servers)
  – Total disk for archive: 500 GB
YNHH Production Environment
• MS SQL Server
  – SQL Server 2000 EE Build 2195: SP4
  – Master and Enterprise on their own servers, both clustered
  – MSCS and EMC SRDF/CE used as the clustering solution
• OS and hardware
  – Windows 2000 Advanced Server SP4
  – Local SCSI and EMC disks on Symmetrix
YNHH Production Environment
• Distributed SCM environment
  – Master Server (MSCS cluster using EMC SRDF/CE) ~ 125 GB DB
  – Enterprise Server (MSCS + EMC SRDF/CE)
  – HL7 Server (MSCS + EMC SRDF/CE)
  – Reports Server (MSCS + EMC SRDF/CE)
  – CDS Server (MSCS + EMC SRDF/CE)
  – Multum Server (MSCS + EMC SRDF/CE)
• Compaq servers – DL760 G2, DL380 G3, DL560
• 2–8 CPUs, 3–8 GB RAM
YNHH SCM Production Environment
[Diagram: client workstations and MSMQ DCs on the enterprise network connecting to clustered server pairs – Master DB (XAMASTER1PA/XAMASTERP, cluster XAMASTERCL1), Enterprise (XAENTER1PA/XAENTERP, cluster XAENTERCL1), HL7 Interface (XAHL71PA/XAHL7P, cluster XAHL7CL1), Reports (XAREPORT1P/XAREPORTP, cluster XAREPORTCL1), Notification/CDS/Order Generation (XACOGNS1PA/XACOGNSP, cluster XACOGNSCL1), Multum (XAMULTUM1PA/XAMULTUMP, cluster XAMULTUMCL1) – plus SunriseXA services, Executive and Manager servers (XAAPPS2P), and SCM clients YNHORG2/YNHORG4.]
Solutions/Tools
• Disaster Restart
• Microsoft Cluster Service (MSCS)
• EMC SRDF/CE
Disaster Recovery vs. Disaster Restart
• Disaster recovery
  – The DR process restores database objects to the last good backup
  – The recovery process restores and recovers data
  – Difficult to coordinate recoveries across database systems
  – Long restart time, and data loss can be high
• Disaster restart
  – Disaster restart is inherent in all DBMSs
  – Remote disaster restart is possible using remote mirroring (SRDF)
  – Remote restart has no formal recovery step
  – A remote disaster is handled like a local system power failure
  – Short restart time and low data loss
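The data-loss difference between the two approaches can be illustrated with a minimal sketch. The transaction lists, backup point, and function names below are purely hypothetical, not drawn from the YNHH environment: recovery loses everything after the last backup, while restart against a crash-consistent mirror loses only in-flight (uncommitted) work.

```python
# Hypothetical sketch: disaster recovery vs. disaster restart.
# Names and transaction numbers are illustrative only.

def disaster_recovery(backup):
    """Restore the last good backup; anything committed after it is lost."""
    return list(backup)

def disaster_restart(mirror):
    """Restart from a crash-consistent remote mirror: committed
    transactions survive; in-flight (uncommitted) work is rolled back."""
    return [txn for txn, committed in mirror if committed]

# Transactions 1-5 committed; backup taken after txn 3; txn 6 in flight.
backup = [1, 2, 3]
mirror = [(1, True), (2, True), (3, True), (4, True), (5, True), (6, False)]

recovered = disaster_recovery(backup)   # txns 4 and 5 are lost
restarted = disaster_restart(mirror)    # every committed txn survives
```

This is why the slide characterizes restart as "short restart time and low data loss": there is no restore/roll-forward phase, only the DBMS's normal crash recovery.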
Microsoft Cluster Service (MSCS)
• MSCS is a clustering extension to Windows Server Enterprise and Datacenter editions
• MSCS is a loosely coupled cluster system
• Provides hardware and OS redundancy, but no disk redundancy
• On a failure, resources fail over to the other node along with the disks
• Failover can be triggered manually, by hardware failure, or by application failure
• Relatively quick return to service in the event of a failure
• MSCS provides improved availability, increased scalability, and simplified management of groups of systems
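The failover behavior described above can be sketched in a few lines. This is not MSCS itself – the `Cluster` class, node names, and group names are hypothetical – but it captures the idea that resource groups have a single owning node, and when that node dies its groups move to a survivor:

```python
# Minimal sketch (not MSCS) of active/passive resource-group failover.
# Node and group names are illustrative only.

class Cluster:
    def __init__(self, nodes, groups):
        self.alive = {n: True for n in nodes}   # node -> is it up?
        self.owner = dict(groups)               # resource group -> owning node

    def fail_node(self, node):
        """Mark a node dead and move its resource groups to a survivor."""
        self.alive[node] = False
        survivors = [n for n, up in self.alive.items() if up]
        for grp, owning_node in self.owner.items():
            if owning_node == node and survivors:
                self.owner[grp] = survivors[0]  # fail over with disks/resources

cluster = Cluster(["node1", "node2"],
                  {"SQLGrp": "node1", "QuorumGrp": "node1"})
cluster.fail_node("node1")   # SQLGrp and QuorumGrp now run on node2
```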
  13. 13. A typical two-node MSCS cluster
Limitations of MSCS
• With SCSI, all servers must be within 40 meters of one another
• Each must be less than 20 meters from the storage
• With Fibre Channel connections these distances can be increased
• Does not provide disk redundancy
• Not a fault-tolerant, closely coupled system
• Not a solution for disaster recovery
SRDF
• Symmetrix Remote Data Facility/Cluster Enabler is a disaster-restartable business continuance solution based on Symmetrix arrays from EMC Corporation
• SRDF is a configuration of multiple Symmetrix arrays
• SRDF duplicates data from the production (source) site to a secondary recovery (target) site, transparently to users, applications, databases, and host processors
• If the primary site fails, data at the secondary site is current up to the last completed I/O
• Used for disaster recovery, remote backup, data center migration, and data center decision solutions
  16. 16. Basic SRDF Configuration
SRDF/CE Overview
• Software extension for MSCS
• Cluster nodes can be geographically separated by distances of up to 60 km
• Provides failover for MSCS-handled failures as well as site disasters, Symmetrix failures, and total communication failures (IP + SRDF links lost)
• Up to 64 MSCS clusters per Symmetrix pair
• Protects data from the following types of failure:
  – Storage failures
  – System failures
  – Site failures
A Geographically Distributed 2-Node SRDF/CE Cluster
SRDF/CE Modes of Operation
• Active/Passive
  – Cluster of 2 or more nodes
  – Processing is done on one node (the active node)
  – Processing is picked up by a remaining node (or nodes) only when the active node fails
  – Half of the hardware is normally idle
  – On failure, the application restarts with full performance
• Active/Active
  – Cluster of 2 or more nodes
  – All nodes run application software
• When a node fails, its work is transferred to a remaining node (or nodes)
• The node that picks up the work processes the load of both systems
• The extra load may cause performance degradation

Other generic types of clusters:
• Shared-nothing: no common cluster resources are shared between nodes
• Shared-something: some resources are shared among cluster nodes
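The performance trade-off between the two modes comes down to simple arithmetic, sketched below with an illustrative request rate (the 1,000 req/s figure is hypothetical, not a YNHH measurement): in active/active, a surviving node must absorb the full combined load, which is the degradation the slide warns about.

```python
# Illustrative sketch of the active/active load trade-off.
# The total_load figure is hypothetical.

def per_node_load(total_load, nodes_up):
    """Load each surviving node must handle when work is spread evenly."""
    if nodes_up == 0:
        raise RuntimeError("cluster down")
    return total_load / nodes_up

# 1,000 req/s spread over an active/active pair:
normal = per_node_load(1000, 2)     # each node handles half the load
degraded = per_node_load(1000, 1)   # the survivor absorbs all of it
```

Active/passive avoids this degradation (the application restarts with full performance on the idle node) at the cost of half the hardware sitting idle in normal operation.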
SRDF/CE in YNHH SCM Production Environment
[Diagram: clients on the enterprise LAN/WAN; Host A (Node 1) and Host B (Node 2) joined by a private heartbeat interconnect – 20 km with single-mode FDDI; each node attached via UWD SCSI or FC-AL to its local Symmetrix; R1 and R2 volumes on each array linked by a bi-directional SRDF interconnect.]
SRDF/CE over MSCS
• SRDF/CE protects against more failure scenarios than MSCS can
• It overcomes the distance limitations of MSCS
• Cluster nodes can be geographically separated by distances of up to 60 km (network round-trip latency of less than 300 ms)
• An ideal solution for dealing with disaster
• Critical information is available in minutes
• When disaster happens, the system restarts rather than going through recovery
SRDF/CE and MSCS Common Recovery Behavior
1. LAN link failure
2. Heartbeat link failure
3. SRDF link failure
4. Host NIC failure
5. Server failure
6. Application software failure
7. Host bus adapter failure
8. Symmetrix array failure

SRDF/CE Unique Behavior
The geographic separation and disaster tolerance of SRDF/CE cause unique behavior and provide recovery alternatives.
  24. 24. SRDF/CE failover operation
Complete Site Failure and Recovery
• Site (server and Symmetrix) failure (5+8)
  – A site failure occurs when both the server and the Symmetrix fail, from natural disaster or human error
• Total communication failure (1+2+3) – split-brain?
  – Occurs when all communication between node 1 and node 2 is lost
  – In this type of failure both nodes remain operational; this condition is referred to as split-brain
  – A potential cause of logical data corruption, as each side assumes the other side is dead and begins processing new transactions against its own copy of the data
  – Two separate and irreconcilable copies of the data are created
  26. 26. Complete Site Failure
Response to Complete Site Failure
• Site failure
  – Site failure occurs at Node 2
  – QuorumGrp and SQLGrp continue running on Node 1
  – Manual intervention is required to bring FShareGrp online on Node 1
• Site failure – quorum lost
  – Site failure occurs at Node 1
  – The site failure causes SQLGrp and QuorumGrp to go offline
  – With QuorumGrp offline, W2K takes the whole cluster offline
  – Manual intervention is required to bring the cluster online
• Total communications failure
  – A total communications failure causes the node without the QuorumGrp to go offline
  – This prevents split-brain
  – Manual intervention is required to bring FShareGrp online
  – EMC does not suggest automatic site failover, to prevent split-brain
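The quorum rule above can be sketched as a tiny decision function (the node names are hypothetical). During a partition each node only knows whether it holds the quorum group; taking the non-owner offline guarantees at most one side keeps writing, which is exactly what prevents the two irreconcilable copies of data described on the previous slide:

```python
# Sketch of the quorum tie-break that prevents split-brain.
# Node names are illustrative only.

def survives_partition(node, quorum_owner):
    """On total communication failure, a node stays online only if it
    owns the quorum group; the other node takes itself offline."""
    return node == quorum_owner

node1_online = survives_partition("node1", quorum_owner="node1")  # keeps serving
node2_online = survives_partition("node2", quorum_owner="node1")  # goes offline
```

Without this rule, both partitioned nodes would assume the peer is dead and accept new transactions against their own copy of the data.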
Benefits
• A disaster recovery solution
• Disaster restart provides a short restart time and low data loss
• Ensures data integrity
• SRDF/CE overcomes limitations of traditional cluster solutions like MSCS
Disadvantages
• Cost
• Complex setup
• Lots of disks
• Fail-back needs to be planned, and takes longer than failover
• Synchronous SRDF disaster restart
  – Data must be written to both Symmetrix arrays
  – Consistent, reliable data
  – More I/O overhead
• Asynchronous SRDF disaster restart
  – Data is written asynchronously to the secondary Symmetrix
  – May incur data loss
  – Faster I/O
• Both sites are in the same city, so still prone to a regional disaster
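The synchronous/asynchronous trade-off listed above can be made concrete with a small sketch. The write names and the two-writes-in-transit figure are hypothetical: synchronous replication acknowledges a write only after both arrays have it (no loss, more I/O overhead), while asynchronous replication acknowledges locally first, so writes still in transit are lost if the source site fails:

```python
# Sketch of synchronous vs. asynchronous remote mirroring after a sudden
# source-site loss. Write names and in-transit count are illustrative.

def remote_copy_after_site_loss(writes, mode, in_transit=2):
    """Return what the remote (target) array holds when the source dies."""
    if mode == "sync":
        # Every acknowledged write was already on the remote array.
        return list(writes)
    # async: the last `in_transit` writes had not reached the target yet.
    return writes[:-in_transit] if in_transit else list(writes)

writes = ["w1", "w2", "w3", "w4", "w5"]
sync_remote = remote_copy_after_site_loss(writes, "sync")    # nothing lost
async_remote = remote_copy_after_site_loss(writes, "async")  # w4, w5 lost
```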
Conclusions
• Our DR test exercised the following failure scenarios:
  – Server failure
  – O/S failure
  – HBA/channel failure
  – Application failure
  – Public LAN failure
  – Private LAN failure
  – Complete IP communication failure (public LAN and private LAN)
• All tests passed
• We have achieved uptime (excluding scheduled outages) of almost 100% over the last 3 years
• 2 unplanned failovers so far, due to Windows fluctuations
References
• EMC SRDF/Cluster Enabler for MSCS v2.1 Product Guide, P/N 300-001-286 REV A02, EMC Corporation, Hopkinton, MA 01748-9103, 2006
• GeoSpan Implementation, John Toner, EMC Corporation, 2003

Contact Information
Ziaul Mannan: Ziaul.Mannan@ynhh.org
Howard Goldberg: Howard.Goldberg@ynhh.org
THANK YOU!

Questions?
