2. 1. Fault tolerant mindset
2. Design Tradeoffs
3. Quality v. Fault Tolerance
4. Keep It Simple
5. Incremental Additions of Reliability
6. Defensive Programming Techniques
1) Faults in Fault Tolerance Code
2) Memory Corruption
3) Data Structure Design
4) Design for Maintainability
5) Coding Standards
6) Redundancy
7) Static Analysis Tools
8) N-Version Programming
9) Redundant Disks [PGK88][MS00]
7. The Role of Verification
8. Fault Insertion Testing
9. Fault Tolerant Design Methodology
3. Look at Techniques to design for fault
tolerance and enhanced reliability and
availability.
4. The mindset of setting out to build a fault tolerant system
Habitually asking what can go wrong, and defining a
solution for it, is called having a Fault Tolerant Mindset.
Questions to ask: What if the stack pointer
becomes negative? What if the wrong subclass is
instantiated? What if the message arrives out of
order?
When is it needed?
Requirements definition, architecture, design,
coding, test development
5. Tradeoff : the act of balancing two things
that you need or want but which are opposed
to each other.
MTTF and MTTR determine the reliability
and availability of a system.
6.
7. The more important factor
MTTR / high availability
- telecommunication system
MTTR is long, MTTF is also long
- Space Shuttle
MTTF(Failure-free) but highly available
- ATM banking system
Therefore:
This requires a thorough analysis during the
design of the hardware and software components
to ensure both high MTTF and low MTTR.
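As a rough illustration of why both figures matter, the widely used steady-state relation availability = MTTF / (MTTF + MTTR) can be sketched as follows. The function name and the sample hour values are assumptions chosen for illustration; they do not come from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: the same MTTF with two different repair times.
slow_repair = availability(mttf_hours=1000.0, mttr_hours=10.0)
fast_repair = availability(mttf_hours=1000.0, mttr_hours=0.1)
print(f"slow repair: {slow_repair:.5f}")
print(f"fast repair: {fast_repair:.5f}")
```

With MTTF held fixed, shortening MTTR raises availability, which is why the design analysis has to address both.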
8. Quality refers to how fault-free the system is.
Fault tolerance is the ability of the system to
execute properly even though there are faults
present.
A fault tolerant system is not necessarily a high
quality system.
Quality is aimed at preventing faults from
entering the system. (= Fault Prevention)
Fault prevention is sometimes considered one of
the phases of fault tolerance.
[Figure: diagram labeled "High Quality" and "Fault tolerant System"]
10. Adding reliability incrementally while the system
is in operation
Many projects add fault tolerance incrementally.
A policy of studying every failure and
implementing any design changes that are
suggested by analysis is a certain way to improve
the system’s fault tolerance.
Ex. The 4ESS™ Switch project, whose system has been
functioning in the US telephone network for over
30 years.
11. Program while always asking whether something can
go wrong, and look for a solution as you go
Programming defensively is done constantly,
in every situation, by asking ‘what can go
wrong here?’, ‘what errors can occur?’, ‘how
might this fail?’, and ‘how can this code be
protected from errors in other parts?’.
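One of these questions, "what if the message arrives out of order?", can be sketched as a defensive check. The `Receiver` class, its fields, and the `OutOfOrderError` exception are hypothetical names used only for this illustration.

```python
class OutOfOrderError(Exception):
    """Raised when a message arrives with an unexpected sequence number."""


class Receiver:
    def __init__(self) -> None:
        self.expected_seq = 0
        self.delivered = []

    def receive(self, seq: int, payload: str) -> None:
        # What can go wrong here? The caller may pass a bad value, or the
        # message may have been duplicated, lost, or reordered in transit.
        if not isinstance(seq, int) or seq < 0:
            raise ValueError(f"invalid sequence number: {seq!r}")
        if seq != self.expected_seq:
            raise OutOfOrderError(f"expected {self.expected_seq}, got {seq}")
        # Only a message that passed every check changes the state.
        self.delivered.append(payload)
        self.expected_seq += 1
```

The state is updated only after all checks pass, so a rejected message leaves the receiver exactly as it was.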
12. Software that performs fault tolerance related
activities can itself contain faults, so
Keep it simple!
Even the software written to perform only fault tolerant
related activities can have faults.
Adding complexity to error handling code increases the risk
of adding faults to the software. Keep it Simple!
Ex. A particular piece of software mysteriously stopped
and restarted itself approximately every 24 days. The
counter was stored in a 32 bit signed integer. The fix
applied was to make the counter a 32 bit unsigned integer,
which only postponed the problem: the counter became
invalid (zero) after twice as long.
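The 24-day figure is consistent with a counter that ticks once per millisecond. That tick rate is an assumption, since the slides do not state it. Under that assumption the arithmetic can be sketched as:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # assumed tick rate: one count per millisecond


def days_until_wrap(max_count: int) -> float:
    """Days of continuous counting before the counter exceeds max_count."""
    return (max_count + 1) / MS_PER_DAY


signed_32_max = 2**31 - 1
unsigned_32_max = 2**32 - 1

print(f"signed 32 bit:   {days_until_wrap(signed_32_max):.1f} days")
print(f"unsigned 32 bit: {days_until_wrap(unsigned_32_max):.1f} days")
```

This gives roughly 24.9 days for the signed counter and 49.7 days for the unsigned one, so widening the range by one bit delays the failure without removing it.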
13. Keep in mind that data storage does not always
hold correct data
Ex.
1. A message type defined by protocol standard
is corrupted.
- Add code that handles a corrupted message type
2. Memory leak
- Take care with memory allocation and deallocation
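A minimal sketch of the first example, handling a message type whose stored or received value matches nothing the protocol defines. The `MessageType` values and both function names are hypothetical; a real protocol standard defines its own types.

```python
from enum import IntEnum


class MessageType(IntEnum):
    # Hypothetical message types, for illustration only.
    SETUP = 1
    DATA = 2
    TEARDOWN = 3


def parse_message_type(raw: int):
    """Return the MessageType for raw, or None if the value is not valid."""
    try:
        return MessageType(raw)
    except ValueError:
        # The value matches no defined type, so treat it as corrupted
        # instead of acting on it.
        return None


def handle(raw: int) -> str:
    msg_type = parse_message_type(raw)
    if msg_type is None:
        return "discarded: corrupted message type"
    return f"handled {msg_type.name}"
```

For instance, `handle(2)` takes the normal path, while `handle(99)` falls into the explicit corrupted-type branch rather than an unhandled error.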
14. Design data structures so that they are easy to audit
Data structures should be designed so that
they can be audited, or checked, for
correctness. Audits should check both for
correct values and to ensure that the data
structure integrity is intact.
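One way to make a structure auditable is to store redundant information that an audit can cross-check. The sketch below assumes a doubly linked list with a stored element count; the class names and the value range passed to `audit` are illustrative choices, not something the slides prescribe.

```python
class Node:
    def __init__(self, value: int) -> None:
        self.value = value
        self.prev = None
        self.next = None


class AuditedList:
    """Doubly linked list that keeps redundant data (back links, count)."""

    def __init__(self) -> None:
        self.head = None
        self.count = 0

    def append(self, value: int) -> Node:
        node = Node(value)
        if self.head is None:
            self.head = node
        else:
            tail = self.head
            while tail.next is not None:
                tail = tail.next
            tail.next = node
            node.prev = tail
        self.count += 1
        return node

    def audit(self, low: int, high: int) -> list:
        """Return a list of problems found; an empty list means it passed."""
        problems = []
        seen = 0
        prev = None
        node = self.head
        # The bound on 'seen' stops the walk if the links form a cycle.
        while node is not None and seen <= self.count:
            if node.prev is not prev:
                problems.append(f"bad back link at position {seen}")
            if not low <= node.value <= high:
                problems.append(f"value {node.value} out of range")
            prev, node = node, node.next
            seen += 1
        if node is not None or seen != self.count:
            problems.append("node count does not match stored count")
        return problems
```

The range check covers correct values, while the back-link and count checks cover structural integrity.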
15. Design for ease of maintenance
1. Use short modules, methods and functions.
2. Make the code readable through the use of
white space and comments.
3. Keep the control flow simple because that
will be easier to understand and to change
if that becomes necessary.
16. Use a reasonable number of coding standards
Coding standards are another way of
improving the quality of software as it is
written.
Having a small number of standards that can be
followed and verified will result in higher quality
software.
17. Replicate the system's capabilities
Purpose: more rapid error recovery and fault treatment.
Redundancy is possible in both the time and space dimensions.
1. Time based redundancy can be the sequential execution of
different versions of a program followed by a selection, or
Voting (21), of which result to use. Other time based
redundancy includes recalculating parameters that don’t
change frequently(or ever) to ensure that the parameter value
in use is always correct.
2. Space based redundancy is usually implemented by executing a
program on different computers. The programs might be
identical, or they might be different versions. The different
computers are typically located within a cluster, and can be
geographically dispersed also.
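Time based redundancy with a voter can be sketched as below. The three `version_*` functions are hypothetical stand-ins for independently written versions, with `version_c` deliberately faulty, and the strict-majority rule is one possible selection policy among several.

```python
from collections import Counter


def vote(results):
    """Return the strict-majority result, or raise if there is none."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among the versions")
    return value


# Three hypothetical versions of the same computation; version_c is faulty.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x + x


def redundant_square(x):
    # Time based redundancy: run the versions one after another, then vote.
    return vote([version(x) for version in (version_a, version_b, version_c)])
```

With an input of 3, two versions agree on 9 and outvote the faulty one. If no value wins a majority, the voter reports an error instead of guessing.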
18. Tools such as lint(1)
Lint Warnings
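Released in version 16.0 of the ADT (Android
Development Tools) Plugin.
Lint warnings check in advance for errors in an
application's resources (layouts, strings, and so on).
In particular, they find items that do not prevent the
application from running but could cause problems later.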
19.
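20. NVP: several independent teams implement multiple
versions from the same specification
Each team uses a different language, different algorithms,
and a different design, with no discussion between teams
Redundancy is used at every level, from design and
development through execution
Not applicable at the scale of an individual or a small
team, so it is not discussed further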
21. RAID, or ‘redundant array of inexpensive disks’, is
a technology for grouping disks together to make
a complex that provides redundancy and hence
fault tolerance that is unavailable on a single
disk.
Five RAID levels
- RAID-1 through RAID-5 increase the level of disk
redundancy.
Additional RAID level: RAID-0
- An alternate method of writing data to a
disk, which does not increase redundancy.
22. Disk striping, used by RAID-0, stores consecutive chunks of data spread
over a number of disks. This enables parallelism in reading and writing
and hence faster access. Disk striping alone does not increase reliability,
but it increases performance.
Disk mirroring is used by RAID-1. The same data is written to multiple
disks simultaneously. It is synchronized between the disks, so that if any
of the disks fails there are redundant, identical copies of the data. This
increases the reliability of the data with only a small performance
increase.
RAID-2 through RAID-5 all use some form of Hamming or parity encoding
to ensure that the stored data can be reconstructed if a disk fails.
The differences between these RAID levels are determined by where the
data and parity encoding are stored and whether they are striped or mirrored.
Table 2.1 shows how these levels are implemented.
Hamming code encoding: allows recovery when an error occurs
Parity bit: used to check whether an error has occurred
RAID technology is common and is included in many commercial products
today. For more information about RAID, refer to product literature or
Marcus and Stern [MS00].
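The parity idea can be illustrated with byte-wise XOR: when it is known which disk has failed, its block can be rebuilt from the surviving blocks plus the parity block. This is a minimal sketch under simplifying assumptions (three tiny hypothetical data blocks, one parity block) and does not model any particular RAID level or product.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks, each stored on its own disk.
data = [b"faul", b"t to", b"lera"]
parity = xor_blocks(data)  # stored on a further disk

# Simulate losing the second disk, then rebuild it from the survivors
# plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

XOR-ing the survivors with the parity cancels every block except the missing one, which is why a single parity block is enough to tolerate one known disk failure.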
23.
24. The importance of testing and verification
Testing and verification also provide the data needed by a project’s
software reliability engineers to compute the expected reliability of a
system.
operational profile testing
An operational profile describes the usage of the system in quantitative
terms and the most typical scenarios that the system will process.
Operational profiles are the scenarios that are used in design,
development, and test. To test the reliability and performance of the
system the operational profile adds quantitative information to the
descriptions of typical scenarios.
For more information:
- On operational profiles, refer to Musa et al. [MFI96].
- The Handbook of Software Reliability Engineering [Lyu96] contains more
detailed information about testing and verification for reliability.
25. A technique that is used during the testing and verification
phase of a project is fault insertion testing.
This testing serves the dual purpose of identifying faults
in the system’s error handling processes and of providing
data for the computation of coverage factors.
It is the only way that a system’s coverage factor can be
determined.
Method: Known faults are introduced into the system, which
is then observed to see if the system was able to handle
the faults automatically. The coverage is computed as the
percentage of cases in which recovery was successful.
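The coverage computation can be sketched with a toy harness. Here `parse_reading` is a hypothetical system under test, the list of inserted faults is invented for illustration, and "recovery was successful" is simplified to "no exception escaped".

```python
def parse_reading(raw):
    """Hypothetical system under test: recovers from bad input with a default."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0  # error handled automatically


def coverage_factor(system, faults):
    """Percentage of inserted faults that the system handled without failing."""
    recovered = 0
    for fault in faults:
        try:
            system(fault)
            recovered += 1
        except Exception:
            pass  # the fault escaped the error handling: not covered
    return 100.0 * recovered / len(faults)


# Known faults to insert: None, text, an empty string, and an integer too
# large to convert to float (raises OverflowError, which is not handled).
inserted_faults = [None, "abc", "", 10**400]
print(f"coverage: {coverage_factor(parse_reading, inserted_faults):.0f}%")
```

Three of the four inserted faults are handled, so the harness reports 75% coverage and points at the overflow case as a gap in the error handling.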
26. The six steps of the fault tolerant design methodology
1. Assess the things that can go wrong with
the system. (fault trees)
2. Strategies must be defined to mitigate the
risks. The project specific pattern language
that will be used during design is identified
in step 2.
3. Create a mental model of your system
identifying the primary system dividing
points and modes of redundancy.
27. 4. The architectural and major design decisions
can be made. (ch.4 patterns for high level
decision)
5. Design in the capabilities for the system to
implement the risk mitigation strategies
identified in step 2. (ch. 5 ~ ch.8)
6. Almost all systems, no matter how fault
tolerant they are, require some provisions that
enable them to be managed and administered
by people.
- See Designing Interfaces by Tidwell [Tid05] or ‘An
Input and Output Pattern Language’ [HS00].
28. Step 1 is where you identify the failures and the risk factors that
can cause them.
Step 5 comes back to the risks and failures to see if mitigation
techniques have been added to the design to cover the risks.
Step 2 gives you a chance to think about the risks and the
patterns and other techniques that will be useful to mitigate the
risks. The pattern language for your project is created in this
step.
Steps 3 and 4 are where the fault tolerant design starts to take
shape.
- Any elements of redundancy present in the system that can be
leveraged to mitigate risks are studied and enhanced in step 3.
- Step 4 continues the design of the error detection and error
processing capabilities of the system.
People will interact with the system being designed both as users
and as operating personnel. Step 6 considers how the human
computer interface can be made more robust and more error-
free.
29. The methodology is adapted from one in Secure Coding
by Graff and van Wyk [GvW03]. It has been put in
terms of fault tolerance and the Fault
Tolerant Mindset.
This methodology will be used in an example
problem, to design a fault tolerant Presence
Server in the Conclusion.
The benefit of this methodology is that it will
get you thinking about what can go wrong
with the system.
30. 1. Pattern
Problem
Context of the problem
Forces
2. Pattern Language
3. Fault -> Error -> Failure (specification)
A system failure occurs when the delivered service no longer complies with the
specification, the latter being an agreed description of the system’s expected
function and/or service. An error is that part of the system state that is liable to
lead to subsequent failure; an error affecting the service is an indication that a
failure occurs or has occurred. The adjudged or hypothesized cause of an error is a
fault. [Lap91, p.4]
4. Reliability
A system’s reliability is the probability that it will perform without deviations from
agreed-upon behavior for a specific period of time, that is, that there will be no failures
during a specific time.
1) MTTF : Mean Time To Failure
the average time from start of operation until the time when the first failure occurs.
2) MTTR : Mean Time To Repair
A measure of the average time required to restore a failing component to operation