Tackling Disaster in a Sunrise Clinical Manager Environment … Disaster Restart OR Disaster Recovery?




Ziaul Mannan - Sr. Technical DBA
Howard Goldberg - Director, Clinical Systems and Support

Yale-New Haven Hospital
New Haven, Connecticut

 944 Bed Tertiary Teaching Facility
 2600 Medical Staff
 7550 Employees
 100% CPOE
 Average 350,000 orders monthly
 Average Daily Census is 724
 7-time “Most Wired” and 3-time “Most Wireless” hospital, per Hospitals & Health Networks
Future Clinical Cancer Center at Yale-New Haven Hospital
 112 inpatient beds
 Outpatient treatment rooms
 Expanded operating rooms
 Infusion suites
 Diagnostic imaging services
 Therapeutic radiology
 Specialized Women's Cancer Center
 Yale-New Haven Breast Center/GYN Oncology Center
 Tentative Completion Date: 2009
Problem
• After the events of 9/11, the hospital realized it needed redundant data
  centers with the ability to provide “zero” downtime.
• Implemented SCM with server clusters and an EMC SAN located in data
  centers on opposite ends of the hospital campus.
Goals
•   Provide 24x7x365 uptime
•   Minimize downtime
•   Faster recovery in a DR situation
•   The database must be consistent (see the consistency-check sketch below)
     – Or it won’t come up
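As an illustration, a minimal sketch of that consistency check, assuming the SQL Server 2000 osql utility, a trusted connection, and a placeholder database name:

    # Hedged sketch: verify database consistency with DBCC CHECKDB via osql.
    # The database name "SXA" is a placeholder, not the actual SCM database name.
    import subprocess

    def check_consistency(database):
        # DBCC CHECKDB checks the logical and physical integrity of the database;
        # a clean result means crash recovery can bring the database online.
        cmd = ["osql", "-E", "-Q", "DBCC CHECKDB ('%s')" % database]
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
        return result.returncode == 0

    check_consistency("SXA")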

Challenges
• Build a redundant system across data centers roughly 2 km apart
• Overcome the limitations of clustering solutions
• Design a system that provides both redundancy and a DR solution
YNHH Production Environment
• SCM 4.5 SP3 RU3, migrating to SCM 4.5 with SMM on 10/03/06
  –   Total users defined – 10,000
  –   Users logged on at peak hours ~450
  –   SCM Reports, HL7 interfaces, CDS, Multum
  –   No CDR
  –   Total disk for data: 700 GB (all servers)
  –   Total disk for archive: 500 GB
YNHH Production Environment
• MS SQL Server
  – SQL Server 2000 EE Build 2195, SP4
  – Master and Enterprise each on their own servers, both clustered
  – MSCS and EMC SRDF/CE used as the clustering solution
• OS and Hardware
  – Windows 2000 Advanced Server SP4
  – Local SCSI disks and EMC disks on Symmetrix
YNHH Production Environment
– Distributed SCM Environment
  • Master Server (MSCS cluster using EMC SRDF/CE) ~ 125 GB DB
  • Enterprise Server (MSCS + EMC SRDF/CE)
  • HL7 Server (MSCS + EMC SRDF/CE)
  • Reports Server (MSCS + EMC SRDF/CE)
  • CDS Server (MSCS + EMC SRDF/CE)
  • Multum Server (MSCS + EMC SRDF/CE)
  • Compaq servers - DL760 G2, DL380 G3, DL560
  • 2-8 CPUs, 3-8 GB RAM per server
YNHH SCM Production Environment

[Architecture diagram] SCM client workstations connect over the network to the
SunriseXA services, which are distributed across clustered servers:
• Master Active DB (SCM Master DB): XAMASTER1PA / XAMASTERP / XAMASTERCL1, with MSMQ DCs YNHORG2 and YNHORG4
• Enterprise Server: XAENTER1PA / XAENTERP / XAENTERCL1
• HL7 Interface Servers - Executive Server (MSMQ): XAHL71PA / XAHL7P / XAHL7CL1; Manager Server: XAAPPS2P
• Notification, CDS and Order Generation Server: XACOGNS1PA / XACOGNSP / XACOGNSCL1
• Multum Server: XAMULTUM1PA / XAMULTUMP / XAMULTUMCL1
• Report Server: XAREPORT1P / XAREPORTP / XAREPORTCL1
Solutions/Tools
• Disaster Restart
• Microsoft Cluster Service (MSCS)
• EMC SRDF/CE
Disaster Recovery vs. Disaster Restart
• Disaster Recovery
  –   The DR process restores database objects to the last good backup
  –   The recovery process restores and recovers data (see the restore sketch after this list)
  –   Difficult to coordinate recoveries across database systems
  –   Long restart time, and data loss can be high

• Disaster Restart
  –   Disaster restart is inherent in every DBMS
  –   Remote disaster restart is possible using remote mirroring (SRDF)
  –   A remote restart involves no formal recovery
  –   A remote disaster is handled like a local power failure
  –   Short restart time and low data loss
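A minimal sketch of the contrast, assuming the SQL Server 2000 osql utility and placeholder database and backup names; it illustrates the idea only, not the YNHH procedure.

    # Hedged sketch: disaster RECOVERY restores from the last good backup, while
    # disaster RESTART simply starts the DBMS and lets normal crash recovery run.
    # The database name "SXA" and the backup path are placeholders.
    import subprocess

    def disaster_recovery(db, backup_file):
        # Recovery: restore data from the last good full backup (changes since
        # that backup are lost unless log backups are also applied).
        sql = "RESTORE DATABASE [%s] FROM DISK = '%s' WITH RECOVERY" % (db, backup_file)
        subprocess.run(["osql", "-E", "-Q", sql], check=True)

    def disaster_restart():
        # Restart: start the SQL Server service; crash recovery rolls the log
        # forward and rolls back in-flight transactions automatically.
        subprocess.run(["net", "start", "MSSQLSERVER"], check=True)

    # A DR exercise would use one approach or the other, e.g.:
    # disaster_recovery("SXA", r"E:\backups\SXA_full.bak")
    # disaster_restart()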
Microsoft Cluster Service (MSCS)
• MSCS is the clustering extension to the Enterprise and Datacenter editions
  of Windows Server
• MSCS is a loosely coupled cluster system
• Provides H/W and OS redundancy, but no disk redundancy
• On a failure, the group fails over to the other node along with its disks
  and resources
• Failover can be triggered manually or by H/W or application failure (see
  the sketch after this list)
• Relatively quick return to service in the event of a failure
• MSCS provides improved availability, increased scalability, and simplified
  management of groups of systems
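A minimal sketch of a manual failover, assuming the cluster.exe command-line tool that ships with MSCS; the group name "SQLGrp" and node name "NODE2" are examples, and exact syntax can vary by Windows release.

    # Hedged sketch: list MSCS groups and trigger a manual failover via cluster.exe.
    # "SQLGrp" and "NODE2" are example names, not necessarily the YNHH names.
    import subprocess

    def show_groups():
        # Lists each cluster group, its owning node, and its current state.
        subprocess.run(["cluster", "group"], check=True)

    def move_group(group, node):
        # Moves (fails over) the group and its disks/resources to another node.
        subprocess.run(["cluster", "group", group, "/moveto:%s" % node], check=True)

    show_groups()
    move_group("SQLGrp", "NODE2")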
A typical two-node MSCS cluster
Limitations of MSCS
• With SCSI, all servers must be within 40 meters of one another
• Each server must be less than 20 meters from the storage
• With Fibre Channel connections these distances can be increased
• Does not provide disk redundancy
• It is not a fault-tolerant, closely coupled system
• Not a solution for disaster recovery
SRDF
• Symmetrix Remote Data Facility (SRDF)/Cluster Enabler is a disaster-
  restartable business continuance solution based on Symmetrix storage from
  EMC Corporation
• SRDF is a configuration of multiple Symmetrix arrays
• SRDF duplicates data from the production (source) site to a secondary
  recovery (target) site transparently to users, applications, databases and
  host processors
• If the primary site fails, data at the secondary site is current up to the
  last completed I/O (see the SYMCLI sketch below)
• Used for disaster recovery, remote backup, data center migration, and
  decision-support solutions
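A minimal sketch of how the mirror state might be checked and a failover initiated with EMC Solutions Enabler SYMCLI; the device group name "scm_dg" is an example, and available options depend on the SYMCLI version.

    # Hedged sketch: query SRDF pair state and fail over to the R2 (target) side.
    # The device group "scm_dg" is a placeholder name.
    import subprocess

    def srdf_state(device_group):
        # Shows the R1/R2 pair states (e.g. Synchronized, Suspended) for the group.
        subprocess.run(["symrdf", "-g", device_group, "query"], check=True)

    def srdf_failover(device_group):
        # Makes the remote (R2) devices read/write so hosts at the target site can
        # restart the application against the mirrored copy of the data.
        subprocess.run(["symrdf", "-g", device_group, "failover", "-noprompt"], check=True)

    srdf_state("scm_dg")
    # srdf_failover("scm_dg")   # only during a real failover or a DR test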
Basic SRDF Configuration
SRDF/CE Overview
• Software extension for MSCS
• Cluster nodes can be geographically separated by distances of up to 60 km
• Provides failover for MSCS-handled failures as well as site disasters,
  Symmetrix failures, and total communication failures (IP + SRDF links lost)
• Up to 64 MSCS clusters per Symmetrix pair
• Protects data from the following types of failure:
   – Storage failures
   – System failures
   – Site failures
A Geographically Distributed 2-Node SRDF/CE Cluster
SRDF/CE modes of operation
• Active/Passive
   – Cluster of 2 or more nodes
   – Processing is done on one node (the active node)
   – Processing is picked up by a remaining node (or nodes) only when the
     active node fails
   – Half of the H/W is normally idle
   – On failure, the application restarts with full performance
• Active/Active
   – Cluster of 2 or more nodes
   – All nodes run application software
   – When a node fails, work is transferred to a remaining node (or nodes)
   – The node that picks up the work processes the load of both systems
   – The extra load may cause performance degradation

Other generic types of clusters:

• Shared-nothing: no common cluster resources are shared between nodes
• Shared-something: some resources are shared among the cluster nodes
SRDF/CE in YNHH SCM Production Environment

[Diagram] Clients connect over the enterprise LAN/WAN to the two cluster
nodes. Host A (Node 1) and Host B (Node 2) are joined by a private
interconnect (heartbeat) running 20 km over single-mode FDDI. Each host
attaches to its local Symmetrix via UWD SCSI or FC-AL, and the two Symmetrix
arrays are linked by a bi-directional SRDF interconnect with R1/R2 device
pairs mirrored in both directions.
SRDF/CE Over MSCS
• SRDF/CE protects against more failure scenarios than MSCS can
• It overcomes the distance limitations of MSCS
• Cluster nodes can be geographically separated by distances of up to 60 km
  (network round-trip latency of less than 300 ms)
• An ideal solution for dealing with disaster
• Critical information is available within minutes
• When a disaster happens, the system is restarted, not recovered
SRDF/CE and MSCS Common Recovery Behavior
1. LAN link failure
2. Heartbeat link failure
3. SRDF link failure
4. Host NIC failure
5. Server failure
6. Application software failure
7. Host bus adapter failure
8. Symmetrix array failure

SRDF/CE Unique Behavior
The geographic separation and disaster tolerance of SRDF/CE cause unique
behavior and provide recovery alternatives.
SRDF/CE failover operation
Complete Site Failure and Recovery
• Site (server and Symmetrix) failure (5+8)
   – A site failure occurs when both the server and the Symmetrix fail, e.g.
     from a natural disaster or human error
• Total communication failure (1+2+3) - split-brain?
   – Occurs when all communication between Node 1 and Node 2 is lost
   – In this type of failure both nodes remain operational; this is referred
     to as split-brain
   – It is a potential cause of logical data corruption, as each side assumes
     the other side is dead and begins processing new transactions against
     its own copy of the data
   – Two separate and irreconcilable copies of the data are created
Complete Site Failure
Response to complete site failure
• Site failure
   – Site failure occurs at Node 2
   – QuorumGrp and SQLGrp continue running on Node 1
   – Manual intervention is required to bring FShareGrp online on Node 1 (see
     the sketch after this list)
• Site failure - quorum lost
   – Site failure occurs at Node 1
   – The site failure causes SQLGrp and QuorumGrp to go offline
   – With QuorumGrp offline, W2K takes the whole cluster offline
   – Manual intervention is required to bring the cluster online
• Total communications failure
   – A total communications failure causes the node without the QuorumGrp to
     go offline
   – This prevents split-brain
   – Manual intervention is required to bring FShareGrp online
   – EMC does not recommend automatic site failover, in order to prevent
     split-brain
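A minimal sketch of that manual step, assuming the cluster.exe tool; the group name FShareGrp comes from the slide, while everything else is illustrative.

    # Hedged sketch: after the failure is understood, an operator brings the
    # file-share group online by hand rather than letting it fail over
    # automatically (which could risk split-brain).
    import subprocess

    def bring_group_online(group):
        # /online starts all resources in the group on its current owner node.
        subprocess.run(["cluster", "group", group, "/online"], check=True)

    bring_group_online("FShareGrp")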
Benefits
• Disaster recovery solution
• Disaster restart provides short restart time and low data
  loss
• Ensures data integrity
• SRDF/CE overcomes limitations in traditional cluster
  solutions like MSCS
Disadvantages
•   Cost
•   Complex Setup
•   Lots of Disks
•   Fail-back must be planned and takes longer than failover
•   Synchronous SRDF disaster restart (see the mode sketch after this list)
    – Data must be written to both Symmetrix arrays
    – Consistent, reliable data
    – More I/O overhead
•   Asynchronous SRDF disaster restart
    – Data is written asynchronously to the secondary Symmetrix
    – May incur data loss
    – Faster I/O
•   Both sites are in the same city, so they remain exposed to a regional
    disaster
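A minimal sketch of switching between the two replication modes with SYMCLI; the device group name "scm_dg" is an example, and the exact mode keywords and options depend on the Solutions Enabler version.

    # Hedged sketch: set the SRDF replication mode for a device group.
    # Synchronous favors consistency at the cost of I/O latency; asynchronous
    # favors I/O speed at the risk of losing the most recent writes.
    import subprocess

    def set_srdf_mode(device_group, mode):
        # mode is typically "sync" or "async" in SYMCLI.
        subprocess.run(["symrdf", "-g", device_group, "set", "mode", mode, "-noprompt"], check=True)

    set_srdf_mode("scm_dg", "sync")    # acknowledge writes only after both arrays have them
    # set_srdf_mode("scm_dg", "async") # acknowledge locally, replicate afterwards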
Conclusions
• In our DR test the following failure scenarios were tested:
   –   Server failure
   –   O/S failure
   –   HBA/channel failure
   –   Application failure
   –   Public LAN failure
   –   Private LAN failure
   –   Complete IP communication failure (public LAN and private LAN)
• All tests passed
• We have achieved uptime of almost 100% (excluding scheduled outages) over
  the last 3 years
• 2 unplanned failovers so far, due to Windows fluctuations
References
• EMC SRDF/Cluster Enabler for MSCS Version 2.1 Product Guide, P/N
  300-001-286 REV A02, EMC Corporation, Hopkinton, MA 01748-9103, 2006
• GeoSpan Implementation, John Toner, EMC Corporation, 2003


           Contact Information
Ziaul Mannan : Ziaul.Mannan@ynhh.org
Howard Goldberg: Howard.Goldberg@ynhh.org
THANK YOU !




 Questions?
