No Advance 9053900678 Chandigarh Call Girls , Indian Call Girls For Full Ni...
Comp8 unit9a lecture_slides
1. Installation and Maintenance of Health
IT Systems
Creating Fault-Tolerant Systems,
Backups, and Decommissioning
Lecture a
his material (Comp 8 Unit 9) was developed by Duke University, funded by the Department of Health and Human
Services, Office of the National Coordinator for Health Information Technology under Award Number
IU24OC000024. This material was updated in 2016 by The University of Texas Health Science Center at
Houston under Award Number 90WT0006.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
2. Creating Fault-Tolerant Systems,
Backups, and Decommissioning
Learning Objectives
1. Define availability, reliability, redundancy, and fault
tolerance (Lecture a)
2. Explain areas and outline rules for implementing fault
tolerant systems (Lecture a)
3. Perform risk assessment (Lecture a)
4. Follow best practice guidelines for common
implementations (Lecture b)
5. Develop strategies for backup and restore of operating
systems, applications, configuration settings, and
databases (Lecture c)
6. Decommission systems and data (Lecture c)
2
3. Redundancy and Fault Tolerance
• Dependence on EHRs is increasing.
• EHR systems require redundant, or “failover”, resources
and fault tolerance to ensure uptime and data integrity so
that it can perform as specified.
• “Failure” vs “fault”: fault is the cause of a failure of the
system to comply with its specifications or precise
requirements.
• Fault tolerance is resilience in a system, or ability to
continue performing to specification despite problems
• Ask vendor how fault tolerance is designed/coded into
the EHR application.
3
4. Creating Fault Tolerance
• Redundancy
– Secondary or “backup”
systems
• Reliability
– Infrequent failure
– Redundant components
• Availability
– Accessible when needed –
no downtime
– Available systems are
reliable and accessible
• Computer hardware
– Servers and workstations
• Data storage
– Hard disks
• Network and Power
– Network switches and
Internet access
– Mains, generators,
batteries
• Virtualization
– Isolation of system from
hardware
4
5. System Failure and Downtime
• Forrester Consulting report on server failure during prior two
years:
– ¾ experienced downtime.
– Only 1% of server outages were resolved within five
minutes.
– 68% had impact on clinical activities.
– 50+% affected administrative processes.
• How much downtime is acceptable?
– Required good understanding of business’ processes
Critical system downtime can have significant negative
impact on patient health
(Forrester Consulting Report, 2010)
5
6. Three Areas for Fault Tolerance
1. Hardware fault tolerance – compensate for hardware failure
– Often simplest to implement
– Extra hardware resources as secondaries or backups
– E.g., secondary network cards, error checking and correcting (ECC)
memory, redundant power supplies, redundant disks / file storage
2. Software fault tolerance – compensate for poor programming or data
– Involves program verification (code review) and assertion checking
– Compensating for faults such as poorly formatted input data
– E.g., sanity check, double-entry comparison, and multiple-version
programs
3. System fault tolerance – compensate for non-computer or inter-device
failures
– Most complex, highest number of variables
– System may include facilities that are not computer-based
– E.g., detection of sensor failure, graceful reaction to intersystem
communication failure, graceful shutdown in unexpected circumstances
(A Conceptual Framework for System Fault Tolerance - 1.1 What is a System?, 1995)
6
7. Six Rules of Fault Tolerance
In “A Conceptual Framework for Systems Fault Tolerance”,
the Center For High Integrity Software Systems Assurance
summarizes 6 rules:
1.Know precisely what the system is supposed to do.
2.Look at what can go wrong.
3.Study your application & determine appropriate fault
containment regions & earliest feasible time to deal with
potential faults.
4.Completely understand application requirements & use
them to make appropriate time/space trade-offs.
5.Concentrate on credible faults first.
6.Determine application failure margins.
(A Conceptual Framework for System Fault Tolerance - 5 Putting It All Together, 1995)
7
8. Six Rules of Fault Tolerance
(cont’d)
• Rule 1: Know precisely what the system is
supposed to do.
– How long can system be allowed to deviate from
specifications before being declared a “failure”?
– What abnormal conditions must be
accommodated?
• Rule 2: Look at what can go wrong.
– Group causes into classes.
– Define “fault floor”.
(A Conceptual Framework for System Fault Tolerance - 5 Putting It All Together, 1995)
8
9. Six Rules of Fault Tolerance
(cont’d – 2)
• Rule 3: Study your application & determine
appropriate fault containment regions & earliest
feasible time to deal with potential faults.
– Fault tolerance generally means more resources
(time & space)
• Rule 4: Completely understand application
requirements & use them to make appropriate
time/space trade-offs.
– Consider costs, & classify faults by likelihood.
(A Conceptual Framework for System Fault Tolerance - 5 Putting It All Together, 1995)
9
10. Six Rules of Fault Tolerance
(cont’d – 3)
• Rule 5: Concentrate on credible faults first.
– Ignore less likely faults unless they require little
additional cost. Mitigate the most likely faults first.
• Rule 6: Determine application failure margins.
– Balance the degree of fault tolerance needed with the
cost of implementation.
– Does a small expenditure now save a great deal
later?
(A Conceptual Framework for System Fault Tolerance - 5 Putting It All Together, 1995)
10
11. Risk Assessment
Risk Assessment
• Identify what is to be protected
– Examples: EHR server, or clinical record
– Include rating of importance
– Types of loss or liability
• Identify risks to each component
– Examples: Power failure, or record alteration
– Risk = Threat x Probability x Impact
– Intentional or Accidental, Human or System, Internal or External
• Identify mitigation strategies for each risk
– Examples: UPS with power monitoring, or automatic backup
– Policies (for people) or Controls (for systems or equipment)
(Benson, n.d., Maniscalchi. 2009) 11
12. Creating Fault-Tolerant Systems,
Backups, and Decommissioning
Summary – Lecture a
• Fault tolerance is running despite
problems
• Implemented using Redundancy to
increase Reliability and provide Availability
• Three areas of Hardware, Software, and
System
• Risk assessment to identify assets, risks,
and mitigation
12
13. Creating Fault-Tolerant Systems,
Backups, and Decommissioning
References – Lecture a
References
Benson C. Security Planning. (n.d.) Available from: http://technet.microsoft.com/en-
us/library/cc723503.aspx
Maniscalchi, J. Threat vs. Vulnerability vs. Risk. (June 2009) Available from:
https://www.pinkerton.com/blog/risk-vs-threat-vs-vulnerability-and-why-you-should-
know-the-differences/
Heimerdinger, W. L., Weinstock, C. B., A Conceptual Framework for System Fault
Tolerance (1992, October). Retrieved from:
http://www.sei.cmu.edu/reports/92tr033.pdf
Server Availability Trends In The Time Of Electronic Health Records. (January 2010)
Forrester Research, Inc. Available at
http://www.stratus.com/assets/ServerAvailabilityTrends_EHR_ForresterPaper.pdf
13
14. Installation and Maintenance of
Health IT Systems
Creating Fault-Tolerant Systems,
Backups, and Decommissioning
Lecture a
This material was developed by Duke University,
funded by the Department of Health and Human
Services, Office of the National Coordinator for
Health Information Technology under Award
Number IU24OC000024. This material was
updated in 2016 by The University of Texas Health
Science Center at Houston under Award Number
90WT0006.
14
Editor's Notes
Welcome to Installation and Maintenance of Health IT Systems, Creating Fault -Tolerant Systems, Backups, and Decommissioning, This is lecture a.
This component, Installation and Maintenance of Health IT Systems, covers fundamentals of selection, installation, and maintenance of typical Electronic Health Records (EHR) systems.
This unit, Creating Fault-Tolerant Systems, Backups, and Decommissioning, will discuss ensuring availability and resiliency through fault tolerance, data reliability through backup, and secure decommissioning of EHR systems.
The objectives for this unit, Creating Fault-Tolerant Systems, Backups, and Decommissioning are to:
Define availability, reliability, redundancy, and fault tolerance
Explain areas and outline rules for implementing fault tolerant systems
Perform risk assessment
Follow best practice guidelines for common implementations
Develop strategies for backup and restore of operating systems, applications, configuration settings, and databases and
Decommission systems and data
As healthcare organizations adopt new technology to improve their efficiency, their dependence on that technology increases exponentially. However, what happens to all of these critical applications if a failure were to occur? What about the integrity of the caregiver’s data in the event of a disaster?
In this lecture, we will discuss the importance of fault tolerance and redundancy and look at areas and rules for creating and maintaining a fault-tolerant system.
As I just mentioned, dependence on EHR technology is increasing greatly. This means there is a critical need for EHRs to have redundant, or “failover" resources, and fault tolerance to ensure uptime and data integrity.
A system is said to have a failure if the service it provides does not meet expected specifications or requirements.
A failure’s cause is called a fault. Fault tolerance is the ability of a system to continue performing to specifications despite problems. So fault tolerance is desirable because it reduces the chance of systems failure.
So let’s focus on a few areas of fault tolerance common to most EHR systems, and identify some best practices for implementation.
Redundancy is an excellent general practice. It is ensuring that you have more than one of whatever component is important. Redundant power, for instance, means that if the primary source of electricity is removed, a secondary source is available. Note that redundant systems can be fully- or partially- redundant; having a second power plant available may be unrealistic, but a battery or generator that can run for some hours is not.
Reliability is another good characteristic. The system overall should not fail. Note that reliable systems may have individual components that fail, but if redundant components take over, then the overall system remains reliable.
Availability means that the system is there and working correctly when you need it. A reliable system needs to be accessible when needed to maintain high availability.
Later slides will introduce fault-tolerant technologies in common system areas, including computer servers and workstations, file and data storage systems, and communications and power networks. Virtualization of these systems will also be discussed.
According to a Forrester Consulting report from 2010, three-quarters of respondents experienced downtime related to a server failure during the past two years.
“Sixty-eight percent had an impact on clinical activities, and greater than half affected administrative processes. Rarely was there no impact. Recovery times were typically measured in hours, not minutes. Only 1 percent of server outages were resolved within five minutes. Providers’ strategies for swapping servers and manual failovers are not medical grade.“
In the healthcare setting, critical system downtime can equate to a significant potential impact on patient health even possibly death.
Below are three areas in which to apply fault tolerance.
Hardware:
The term fault tolerance usually means automatic compensation for faults or failure of computing hardware. This is often the simplest kind of redundancy to implement, though not necessarily the cheapest. Using additional components that are deployed in parallel or other redundant fashion allows the system to operate even if one of the pieces fails. Techniques include redundant power supplies for equipment, error checking and correcting (ECC) memory, additional network interfaces with secondary connections, and redundant data storage.
Software:
A fault-tolerant hardware platform does not guarantee high availability to the system user. The software must still compensate for faults such as poorly formatted input data. Mechanisms such as a sanity check for processed data (is this a non-zero file, is every required field present), double entry comparison, and multiple-version programs are often used at this level.
System:
System fault tolerance will compensate for failures in other system facilities that are not computer-based. If required subcomponents are not available, system fault tolerance ensures that other parts react properly, so that data or time is not lost. As a simple example, a printer that is out of paper should stop running the print motor or printing surface to save wear and tear. Also, the system should be able to stop gracefully, allowing normal operation to resume without duplication of effort or waste of resources, such as halting a lab test if precursor tests have not yet been performed.
Now let’s discuss six rules for approaching fault tolerance in your system.
The Center For High Integrity Software Systems Assurance summarizes 6 rules for ensuring fault tolerance. These are:
Know precisely what the system is supposed to do.
Look at what can go wrong.
Study your application & determine appropriate fault containment regions & earliest feasible time to deal with potential faults.
Completely understand application requirements & use them to make appropriate time/space trade-offs.
Concentrate on credible faults first.
Determine application failure margins.
“Rule 1: Know precisely what the system is supposed to do. Part of this process should be determining how long a system can be allowed to deviate from its specification before the deviation is declared a failure. However, it is not sufficient to know what the system is supposed to do under normal circumstances. It is also necessary to know what abnormal conditions the system must accommodate. It is virtually impossible to enumerate the set of all possible faults that a system might encounter. It is much more manageable to deal with classes of faults.”
This rule requires broad knowledge of the business environment, not just the specific system. For instance, certain deadlines may exist for regulatory paperwork that will result in large fines for late submission.
“Rule 2: Look at what can go wrong, and try to group the causes into classes for easier manageability. This involves defining a fault floor based on your ability to diagnose and repair faults. The goal of fault tolerance is to prevent faults from propagating to the system boundary, where it becomes observable and, hence, a failure. In general, the further a fault has propagated, the harder it is to deal with. Since fault tolerance is redundancy management, however, it becomes a matter of the degree of redundancy desired. For instance, it is almost certainly cheaper to deal with memory faults by using error-correcting memory (that is, redundant bits in a memory location) than by providing a ‘shadow’ memory. Note, however, that dealing with faults earlier rather than later may go counter to the advice given above regarding dealing with classes of faults rather than individual faults. “
This rule may seem boundless, as any pessimist knows that “things can always get worse.” Grouping into general classes of problems can help. Example classes may be: failure of hardware equipment, failure of communication transmission, incorrect data entry.
“Rule 3: Study your application and determine appropriate fault containment regions and the earliest feasible time to deal with potential faults. In general, the price paid for a fault-tolerant system is additional resources, both in terms of time and space. As with most things, these two can be traded off against each other. In some applications, in flight control, for example, timing is everything, even at the cost of extra processors. In general, the comparison approach to fault detection works best in these situations. In other applications, such as a space probe, weight and power consumption is an overriding issue--arguing for a higher reliance on time redundancy and suggesting the use of acceptance tests.”
This rule is the beginning of the tradeoff between different requirements. For instance, an error in recording insurance information could be eliminated by requiring verification with the insurance company before proceeding. The additional time and resources to provide that verification may not be worth it, especially if the problem will be seen and manually corrected during the billing process. In contrast, a problem in recording drug delivery or dosage information could threaten the life of a patient.
“Rule 4: Completely understand the requirements of your application and use them to make appropriate time/space trade-offs. Protecting a system from every conceivable fault can exhaust another resource--money. This is true even if a rational set of fault classes is defined. The trade-off here is fault coverage versus the cost of that coverage. In all systems, it is possible to classify faults by the likelihood of occurrence.”
This rule continues with the idea of appropriate expenditure. An error that is infrequent and has low impact need not receive the kind of attention reserved for an error that happens often or is extremely costly.
“Rule 5: Whenever possible, concentrate on the credible faults and ignore those less likely to occur unless they can be dealt with at little or no additional cost. Time is an essential element in any digital computer system, even in systems that do not claim to be real-time. It is important to define the minimum period of time a system can fail to provide its defined service before a failure is declared. Unnecessarily short failure margins force the system designer to resort to expensive fault-tolerance mechanisms, such as real-time fault masking.”
This rule focuses on the likely faults that may occur. Data on these types of faults may already be available for long-term systems, either formally or anecdotally. Unlikely problems should be addressed, but after addressing the likely ones.
“Rule 6: Carefully determine application failure margins and use the information to balance the degree of fault tolerance needed with the cost of implementation.”
This rule introduces the idea of costs versus benefits. Cost / benefit analysis looks at weighing costs (one-time, ongoing, and indirect) with benefits (savings, greater revenue, higher satisfaction, better patient outcome). Often used in larger budgetary and project decisions, a less formalized version may help decide margins. For instance, a spare PC monitor in storage may cost $150 in initial direct cost, but save money over time if a room that would otherwise be unusable for patients can be brought back up quickly.
Fault tolerance planning and implementation starts with risk assessment, a three step process.
Begin by identification of the assets or processes to be protected. A rating of importance should attend each one, so that the more critical ones may be identified easily. A multi-level categorization is often sufficient; top-to-bottom ranking is usually not necessary.
Then for each one identified, identify the risks. Risks are ways in which the asset or process can be slowed, corrupted, destroyed, or otherwise made unavailable. A risk is composed of a threat, the likelihood of the threat, and the potential resulting impact of the threat. Again, these should be rated so that the biggest risks (most likely and highest damage threats) are identified easily. Note that sources of threat may be intentional or not, from inside or outside, and from people or systems.
Finally, for each of the identified risks, identify how that risk will be mitigated. Note the term “mitigate” – often a risk cannot be eliminated entirely without unrealistic expenditure, so the focus should be on realistic minimization of liability. Identifying the type of liability can help this ranking process – does this affect the ability to continue doing business? Any direct impact on patient health? A drop in the number of patients that can be seen?
This concludes Lecture a of Creating Fault-Tolerant Systems, Backups, and Decommissioning.
In this lecture we defined fault tolerance as a system’s ability to continue performing to specification despite problems. The idea is to minimize downtime and maximize availability through reliability and redundancy.
Hardware, Software, and Systems are three areas that should be fault tolerant
To make systems more fault-tolerant, use risk assessment to evaluate potential problems, then mitigate those potential problems using best practices.