EVA
황선희
1. Fault tolerant mindset
2. Design Tradeoffs
3. Quality v. Fault Tolerance
4. Keep It Simple
5. Incremental Additions of Reliability
6. Defensive Programming Techniques
1) Faults in Fault Tolerance Code
2) Memory Corruption
3) Data Structure Design
4) Design for Maintainability
5) Coding Standards
6) Redundancy
7) Static Analysis Tools
8) N-Version Programming
9) Redundant Disks [PGK88][MS00]
7. The Role of Verification
8. Fault Insertion Testing
9. Fault Tolerant Design Methodology
 Look at Techniques to design for fault
tolerance and enhanced reliability and
availability.
 The mindset of being determined to build a fault tolerant system.
 Habitually asking what can go wrong and defining a
solution for it is called having a Fault Tolerant Mindset.
 Example questions: What if the stack pointer
becomes negative? What if the wrong subclass is
instantiated? What if the message arrives out of
order?
 When is it needed?
Requirement definition, Architecture, Design,
Coding, test development
 Tradeoff : the act of balancing two things
that you need or want but which are opposed
to each other.
 MTTF and MTTR determine the reliability
and availability of a system.
The more important factor
 MTTR / high availability
- telecommunication system
 MTTR is long, MTTF is also long
- Space Shuttle
 MTTF(Failure-free) but highly available
- ATM banking system
Therefore!
 This requires a thorough analysis during the
design of the hardware and software components
to ensure both high MTTF and low MTTR.
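As a rough illustration of why both numbers matter, steady-state availability is commonly computed as MTTF / (MTTF + MTTR). The sketch below applies that standard formula; the function name and the sample figures are illustrative assumptions and do not come from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is operational."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical systems: one fails rarely but repairs slowly,
# the other fails more often but recovers quickly.
rare_failures_slow_repair = availability(mttf_hours=10_000, mttr_hours=10)
frequent_failures_fast_repair = availability(mttf_hours=1_000, mttr_hours=0.1)
```

With these made-up figures the system that fails ten times as often still ends up more available, because its repair time is a hundred times shorter.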
 Quality refers to how fault-free the system is.
 Fault tolerance is the ability of the system to
execute properly even though there are faults
present.
 A fault tolerant system is not necessarily a
high quality system.
 Quality is aimed at preventing faults from
entering the system. (= Fault Prevention)
 Fault prevention is sometimes considered one of
the phases of fault tolerance.
[Figure: diagram relating High Quality and Fault Tolerant System]
 KIS (Keep It Simple)
 Extra code will contain unnecessary faults.
Add reliability incrementally as the system is
operated
 Many projects add fault tolerance incrementally.
 A policy of studying every failure and
implementing any design changes that are
suggested by analysis is a certain way to improve
the system’s fault tolerance.
 Ex. The 4ESS™ Switch project, which has been
functioning in the US telephone network for over
30 years.
Program by always asking where problems could
arise and looking for solutions
 Programming defensively is done constantly,
in every situation, by asking ‘what can go
wrong here?’, ‘what errors can occur?’, ‘how
might this fail?’, and ‘how can this code be
protected from errors in other parts?’.
Software that performs fault tolerance related
activities can itself contain faults, so
Keep it simple!
 Even the software written to perform only fault tolerant
related activities can have faults.
 Adding complexity to error handling code increases the risk
of adding faults to the software. Keep it Simple!
 Ex. A particular piece of software mysteriously stopped
and restarted itself approximately every 24 days. The
counter was stored in a 32 bit signed integer. The solution
to the problem was to make the counter a 32 bit
unsigned integer, which became invalid (zero) after twice
as long.
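The 24-day figure is consistent with a counter that ticks once per millisecond. That tick rate is an assumption here, since the slide does not state it. A quick check of the arithmetic under that assumption (the function name is illustrative):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def days_until_wrap(max_count: int, ticks_per_day: int = MS_PER_DAY) -> float:
    """Days until a counter that starts at zero runs past max_count."""
    return (max_count + 1) / ticks_per_day


signed_32_days = days_until_wrap(2**31 - 1)    # roughly 24.9 days
unsigned_32_days = days_until_wrap(2**32 - 1)  # roughly 49.7 days
```

Switching to an unsigned counter only doubles the interval, so the fault is postponed rather than removed.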
Keep in mind that data storage does not always
contain correct data
 Ex.
1. A message type defined by a protocol standard
is corrupted.
- Add code to handle corrupted message types
2. Memory leakage
- Take care with memory allocation and deallocation
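One way to handle a corrupted message type is to validate the received value against the set the protocol defines before dispatching on it. The sketch below is a minimal illustration; `MessageType`, its members, and `handle_message` are hypothetical names, not part of any real protocol.

```python
from enum import IntEnum


class MessageType(IntEnum):
    # Hypothetical message types; a real protocol standard defines its own.
    CONNECT = 1
    DATA = 2
    DISCONNECT = 3


def handle_message(raw_type: int) -> str:
    """Dispatch on the message type, treating unknown values as corruption."""
    try:
        msg_type = MessageType(raw_type)
    except ValueError:
        # The value is outside the set the protocol defines: do not guess,
        # report it so the caller can discard or request retransmission.
        return "corrupted"
    if msg_type is MessageType.CONNECT:
        return "connect"
    if msg_type is MessageType.DATA:
        return "data"
    return "disconnect"
```

The point of the explicit `except` branch is that an out-of-range value never falls through into a handler written for a valid type.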
Design data structures so that they are easy to audit
 Data structures should be designed so that
they can be audited, or checked, for
correctness. Audits should check both for
correct values and to ensure that the data
structure integrity is intact.
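A small sketch of what an auditable structure might look like: a doubly linked list that also stores its element count, so an audit can cross-check the links against the count and the values against an expected range. The class and method names are illustrative assumptions.

```python
from typing import List, Optional


class Node:
    def __init__(self, value: int) -> None:
        self.value = value
        self.prev: Optional["Node"] = None
        self.next: Optional["Node"] = None


class AuditableList:
    def __init__(self) -> None:
        self.head: Optional[Node] = None
        self.count = 0  # redundant with the links; lets an audit cross-check

    def append(self, value: int) -> Node:
        node = Node(value)
        if self.head is None:
            self.head = node
        else:
            tail = self.head
            while tail.next is not None:
                tail = tail.next
            tail.next = node
            node.prev = tail
        self.count += 1
        return node

    def audit(self, min_value: int, max_value: int) -> List[str]:
        """Return the problems found; an empty list means the audit passed."""
        problems = []
        seen = 0
        prev = None
        node = self.head
        # Stop after count + 1 nodes so a corrupted, cyclic list cannot hang us.
        while node is not None and seen <= self.count:
            if node.prev is not prev:
                problems.append(f"node {seen}: back link does not match")
            if not (min_value <= node.value <= max_value):
                problems.append(f"node {seen}: value {node.value} out of range")
            prev, node = node, node.next
            seen += 1
        if seen != self.count:
            problems.append(f"walked {seen} nodes but count is {self.count}")
        return problems
```

The audit covers both halves of the slide's advice: the range check looks at values, while the back-link and count checks look at structural integrity.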
Design for ease of maintenance
1. Use short modules, methods and functions.
2. Make the code readable through the use of
white space and comments.
3. Keep the control flow simple because that
will be easier to understand and to change
if that becomes necessary.
Use a reasonable amount of coding standards
 Coding standards are another way of
improving the quality of software as it is
written.
 Having a small number of standards that can be
followed and verified will result in higher quality
software.
Duplicate the system's capabilities
 Purpose: more rapid error recovery and fault treatment.
 redundancy in both time and space dimensions
1. Time based redundancy can be the sequential execution of
different versions of a program followed by a selection, or
Voting (21), of which result to use. Other time based
redundancy includes recalculating parameters that don’t
change frequently (or ever) to ensure that the parameter value
in use is always correct.
2. Space based redundancy is usually implemented by executing a
program on different computers. The programs might be
identical, or they might be different versions. The different
computers are typically located within a cluster, and can be
geographically dispersed also.
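The time based case, sequential execution of different versions followed by a vote, can be sketched as follows. This is a simplified illustration with made-up versions of a squaring function; `vote` and the three version functions are hypothetical names, and a strict majority rule is only one possible selection policy.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def vote(versions: Sequence[Callable[[int], int]], argument: int) -> Optional[int]:
    """Run each version in turn and return the strict-majority result, if any."""
    results = []
    for version in versions:
        try:
            results.append(version(argument))
        except Exception:
            # A version that crashes simply casts no vote.
            continue
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(versions) // 2 else None


# Three hypothetical versions of "square a number"; the third has a fault.
def square_by_multiplying(n: int) -> int:
    return n * n


def square_by_power(n: int) -> int:
    return n ** 2


def square_with_fault(n: int) -> int:
    return n + n


versions = [square_by_multiplying, square_by_power, square_with_fault]
agreed = vote(versions, 3)
```

Here two of the three versions agree, so the faulty result is outvoted. With only two versions that disagree, `vote` returns `None` rather than picking one arbitrarily.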
Tools such as lint(1)
 Lint Warnings
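Introduced in version 16.0 of the ADT (Android
Development Tools) Plugin.
Lint warnings check in advance for errors in the
resources (layouts, strings, and so on) within an
application. In particular, they find items that do
not prevent the application from running but could
potentially cause problems.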
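 NVP: a method in which several independent teams
implement multiple versions from the same specification.
 Each team uses a different language, different
algorithms, and a different design, with no discussion
between teams.
 Redundancy is used at every level, from design and
development through execution.
 Not applicable at the scale of an individual or a small
team, so it is not discussed further.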
 RAID, or ‘redundant array of inexpensive disks’, is
a technology for grouping disks together to make
a complex that provides redundancy and hence
fault tolerance that is unavailable on a single
disk.
 Five RAID levels
- RAID-1 through RAID-5 increase the level of
disk redundancy.
 Additional RAID level: RAID-0
an alternate method of writing data to a
disk, which does not increase redundancy.
 Disk striping, used by RAID-0, stores consecutive chunks of data spread
over a number of disks. This enables parallelism in reading and writing
and hence faster access. Disk striping alone does not increase reliability,
but it increases performance.
 Disk mirroring is used by RAID-1. The same data is written to multiple
disks simultaneously. It is synchronized between the disks, so that if any
of the disks fails there are redundant, identical copies of the data. This
increases the reliability of the data with only a small performance
increase.
 RAID-2 through RAID-5 all use some form of Hamming or parity encoding
to ensure that the stored data can be reconstructed if a disk fails.
The differences between these RAID levels are determined by where the
data and parity encoding are stored and whether they are striped or mirrored.
Table 2.1 shows how these levels are implemented.
 Hamming code encoding: allows recovery when an error occurs
 Parity bit: used to check whether an error has occurred
 RAID technology is common and is included in many commercial products
today. For more information about RAID, refer to product literature or
Marcus and Stern [MS00].
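The idea behind parity-based reconstruction can be shown with a byte-wise XOR. The sketch below is a simplified illustration of the principle only and does not model the on-disk layout of any particular RAID level; the block contents and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks or any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("blocks must be non-empty and the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks plus one parity block.
data_blocks = [b"faul", b"t to", b"lera"]
parity = xor_blocks(data_blocks)

# Simulate losing the second disk and rebuilding it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the surviving data blocks with the parity block yields exactly the missing block, provided only one block is lost.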
The importance of testing and verification
Testing and verification also provide the data needed by a project’s
software reliability engineers to compute the expected reliability of a
system.
 Operational profile testing
An operational profile describes the usage of the system in quantitative
terms and the most typical scenarios that the system will process.
Operational profiles are the scenarios that are used in design,
development, and test. To test the reliability and performance of the
system, the operational profile adds quantitative information to the
descriptions of typical scenarios.
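One way to put the quantitative part of a profile to work is to choose test scenarios in proportion to their expected usage. The sketch below assumes a profile expressed as usage shares per scenario; the scenario names, the shares, and `pick_scenarios` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def pick_scenarios(profile: Dict[str, float], runs: int, seed: int = 0) -> List[str]:
    """Choose test scenarios at random, weighted by expected usage."""
    rng = random.Random(seed)  # fixed seed keeps the test plan reproducible
    names = list(profile)
    weights = [profile[name] for name in names]
    return rng.choices(names, weights=weights, k=runs)


# Hypothetical operational profile: share of real-world usage per scenario.
profile = {"place_call": 0.70, "check_voicemail": 0.25, "change_settings": 0.05}
plan = pick_scenarios(profile, runs=1000)
counts = Counter(plan)
```

Scenarios that dominate real usage then dominate the test plan as well, so the measured reliability reflects what users are likely to experience.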
 For more information
- operational profiles, refer to Musa et al. [MFI96]
- The Handbook of Software Reliability Engineering [Lyu96] contains more
detailed information about testing and verification for reliability.
 A technique that is used during the testing and verification
phase of a project is fault insertion testing.
 This testing serves the dual purpose of identifying faults
in the system’s error handling processes and of providing
data for the computation of coverage factors.
 It is the only way that a system’s coverage factor can be
determined.
 Method: Known faults are introduced into the system, which
is then observed to see if the system was able to handle
the faults automatically. The coverage is computed as the
percentage of cases in which recovery was successful.
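The coverage computation described above can be sketched in a few lines. The log entries and the `coverage_factor` name are invented for illustration; a real project would feed in the recorded outcomes of its own fault insertion runs.

```python
from typing import Iterable, Tuple


def coverage_factor(outcomes: Iterable[Tuple[str, bool]]) -> float:
    """Fraction of inserted faults from which the system recovered automatically."""
    results = [recovered for _fault, recovered in outcomes]
    if not results:
        raise ValueError("no fault insertion results to compute coverage from")
    return sum(results) / len(results)


# Hypothetical test log: (inserted fault, did the system recover on its own?)
log = [
    ("corrupt message type", True),
    ("kill worker process", True),
    ("exhaust memory pool", False),
    ("drop heartbeat", True),
]
coverage = coverage_factor(log)  # 3 of the 4 inserted faults were handled
```

Multiplying the result by 100 gives the percentage form used on the slide.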
The 6 steps of the fault tolerant design methodology
1. Assess the things that can go wrong with
the system. (fault trees)
2. Strategies must be defined to mitigate the
risks. The project specific pattern language
that will be used during design is identified
in step 2.
3. Create a mental model of your system
identifying the primary system dividing
points and modes of redundancy.
4. The architectural and major design decisions
can be made. (ch. 4: patterns for high level
decisions)
5. Design in the capabilities for the system to
implement the risk mitigation strategies
identified in step 2. (ch. 5 ~ ch.8)
6. Almost all systems, no matter how fault
tolerant they are, require some provisions that
enable them to be managed and administered
by people.
- Designing Interfaces by Tidwell [Tid05] or ‘An
Input and Output Pattern Language’ [HS00].
 Step 1 is where you identify the failures and the risk factors that
can cause them.
 Step 5 comes back to the risks and failures to see if mitigation
techniques have been added to the design to cover the risks.
 Step 2 gives you a chance to think about the risks and the
patterns and other techniques that will be useful to mitigate the
risks. The pattern language for your project is created in this
step.
 Steps 3 and 4 are where the fault tolerant design starts to take
shape.
- Any elements of redundancy present in the system that can be
leveraged to mitigate risks are studied and enhanced in step 3.
- Step 4 continues the design of the error detection and error
processing capabilities of the system.
 People will interact with the system being designed both as users
and as operating personnel. Step 6 considers how the human
computer interface can be made more robust and more error-
free.
 The methodology is adapted from one in Secure
Coding by Graff and van Wyk [GvW03]. It has been
put in terms of fault tolerance and the Fault
Tolerant Mindset.
 This methodology will be used in an example
problem, to design a fault tolerant Presence
Server in the Conclusion.
 The benefit of this methodology is that it will
get you thinking about what can go wrong
with the system.
1. Pattern
 Problem
 Context of the problem
 Forces
2. Pattern Language
3. Fault -> Error -> Failure (specification)
A system failure occurs when the delivered service no longer complies with the
specification, the latter being an agreed description of the system’s expected
function and/or service. An error is that part of the system state that is liable to
lead to subsequent failure; an error affecting the service is an indication that a
failure occurs or has occurred. The adjudged or hypothesized cause of an error is a
fault. [Lap91, p.4]
4. Reliability
A system’s reliability is the probability that it will perform without deviations from
agreed-upon behavior for a specific period of time. That is, there will be no failures
during a specific period of time.
1) MTTF : Mean Time To Failure
The average time from start of operation until the time when the first failure occurs.
2) MTTR : Mean Time To Repair
A measure of the average time required to restore a failing component to operation.

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

02. Mindset for Fault Tolerance Patterns

  • 2. 1. Fault tolerant mindset 2. Design Tradeoffs 3. Quality v. Fault Tolerance 4. Keep It Simple 5. Incremental Additions of Reliability 6. Defensive Programming Techniques 1) Faults in Fault Tolerance Code 2) Memory Corruption 3) Data Structure Design 4) Design for Maintainability 5) Coding Standards 6) Redundancy 7) Static Analysis Tools 8) N-Version Programming 9) Redundant Disks [PGK88][MS00] 7. The Role of Verification 8. Fault Insertion Testing 9. Fault Tolerant Design Methodology
  • 3. A look at techniques for designing for fault tolerance and for enhanced reliability and availability.
  • 4. The mindset of setting out to build a fault-tolerant system. Thinking to ask the "what if" questions and to define the solutions is called having a Fault Tolerant Mindset. Questions and solutions: What if the stack pointer becomes negative? What if the wrong subclass is instantiated? What if the message arrives out of order? When is it needed? During requirements definition, architecture, design, coding, and test development.
  • 5. Tradeoff: the act of balancing two things that you need or want but which are opposed to each other. MTTF and MTTR determine the reliability and availability of a system.
  • 6.
  • 7. Which factor matters more depends on the system. MTTR matters most when high availability is required: telecommunication systems. A long MTTR is acceptable when the MTTF is also long: the Space Shuttle. Some systems must be failure-free (high MTTF) and also highly available: ATM banking systems. Therefore, a thorough analysis is required during the design of the hardware and software components to ensure both high MTTF and low MTTR.
  • 8. Quality refers to how fault-free the system is. Fault tolerance is the ability of the system to execute properly even though faults are present. A fault-tolerant system is not necessarily a high-quality one. Quality is aimed at preventing faults from entering the system (= fault prevention). Fault prevention is sometimes considered one of the phases of fault tolerance. [Diagram labels: High Quality; Fault Tolerant System]
  • 9. KIS (Keep It Simple). Extra code will contain unnecessary faults.
  • 10. Add reliability gradually while the system is in operation. Many projects add fault tolerance incrementally. A policy of studying every failure and implementing any design changes that the analysis suggests is a certain way to improve the system's fault tolerance. Ex. the 4ESS™ Switch project, which has been functioning in the US telephone network for over 30 years.
  • 11. While programming, always ask where problems could arise and look for solutions. Programming defensively is done constantly, in every situation, by asking "what can go wrong here?", "what errors can occur?", "how might this fail?", and "how can this code be protected from errors in other parts?".
  • 12. Software that performs fault-tolerance-related activities can itself contain faults, so keep it simple! Even software written to perform only fault-tolerance-related activities can have faults. Adding complexity to error handling code increases the risk of adding faults to the software. Ex. a particular piece of software mysteriously stopped and restarted itself approximately every 24 days. The counter was stored in a 32-bit signed integer. The solution was to make the counter a 32-bit unsigned integer, which became invalid (zero) after twice as long.
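The 24-day figure in the example above is consistent with a counter that advances once per millisecond. The slide does not state the tick rate, so the millisecond tick in this minimal sketch is an assumption, and `days_until_overflow` is a name made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def days_until_overflow(max_value: int) -> float:
    """Days before a counter that ticks once per millisecond (assumed) exceeds max_value."""
    return max_value / MS_PER_DAY


signed_days = days_until_overflow(2**31 - 1)    # largest 32-bit signed value
unsigned_days = days_until_overflow(2**32 - 1)  # largest 32-bit unsigned value

print(f"signed 32-bit counter:   {signed_days:.2f} days")
print(f"unsigned 32-bit counter: {unsigned_days:.2f} days")
```

Under that assumption the signed counter lasts a little under 25 days and the unsigned one roughly twice as long, so widening the type only postpones the fault rather than removing it.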
  • 13. Keep in mind that data storage does not always hold correct data. Ex. 1. A message type defined by a protocol standard is corrupted: add handling for corrupted message types. 2. Memory leakage: take care with memory allocation and deallocation.
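Handling a corrupted message type, as in the first example above, might look like the following sketch. The `MessageType` values and the `handle` function are hypothetical; a real protocol standard defines its own types.

```python
from enum import IntEnum


class MessageType(IntEnum):
    # Hypothetical message types, for illustration only.
    CONNECT = 1
    DATA = 2
    DISCONNECT = 3


def parse_message_type(raw: int):
    """Return the MessageType for raw, or None if the value is not defined."""
    try:
        return MessageType(raw)
    except ValueError:
        return None


def handle(raw: int) -> str:
    msg_type = parse_message_type(raw)
    if msg_type is None:
        # Corrupted or unknown type: reject explicitly instead of guessing.
        return "rejected"
    return f"handled {msg_type.name}"


print(handle(2))
print(handle(99))
```

The point of the sketch is only that the "impossible" value has an explicit code path instead of falling through to undefined behavior.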
  • 14. Design data structures so that they are easy to check. Data structures should be designed so that they can be audited, or checked, for correctness. Audits should check both for correct values and that the integrity of the data structure is intact.
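One way to make a structure auditable is to store redundant information that an audit can cross-check. The sketch below is an illustration under assumptions, not the deck's design: `AuditableList` is a hypothetical doubly linked list that keeps a stored count and back-links, and its `audit` method checks both values and structural integrity.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class AuditableList:
    """Doubly linked list with a redundant stored count (hypothetical example)."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.count = 0

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        self.count += 1

    def audit(self, is_valid=lambda v: True):
        """Return a list of problems found; an empty list means the audit passed."""
        problems = []
        seen = 0
        prev = None
        node = self.head
        # Bound the walk by the stored count so a cycle cannot loop forever.
        while node is not None and seen <= self.count:
            if node.prev is not prev:
                problems.append(f"broken back-link at position {seen}")
            if not is_valid(node.value):
                problems.append(f"invalid value at position {seen}: {node.value!r}")
            prev, node = node, node.next
            seen += 1
        if node is not None:
            problems.append("list is longer than stored count (possible cycle)")
        elif seen != self.count:
            problems.append(f"stored count {self.count} != actual {seen}")
        if prev is not self.tail:
            problems.append("tail pointer does not match last node")
        return problems


lst = AuditableList()
for v in (1, 2, 3):
    lst.append(v)
print(lst.audit(lambda v: v > 0))  # no problems expected on an intact list
```

The stored count and the back-links are pure overhead during normal operation; they exist so that corruption can be detected.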
  • 15. Design for ease of maintenance. 1. Use short modules, methods and functions. 2. Make the code readable through the use of white space and comments. 3. Keep the control flow simple, because that will be easier to understand and to change if that becomes necessary.
  • 16. Use coding standards in moderation. Coding standards are another way of improving the quality of software as it is written. Having a small number of standards that can be followed and verified will result in higher quality software.
  • 17. Replicate the system's capabilities. Purpose: more rapid error recovery and fault treatment. Redundancy exists in both the time and space dimensions. 1. Time-based redundancy can be the sequential execution of different versions of a program followed by a selection, or Voting (21), of which result to use. Other time-based redundancy includes recalculating parameters that do not change frequently (or ever) to ensure that the parameter value in use is always correct. 2. Space-based redundancy is usually implemented by executing a program on different computers. The programs might be identical, or they might be different versions. The different computers are typically located within a cluster, and can also be geographically dispersed.
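Time-based redundancy with voting, as described above, can be sketched as running several versions one after another and taking the majority result. The three `mean_v*` functions are hypothetical stand-ins for independently written versions (the third is deliberately faulty), and the strict-majority rule is an assumption rather than the only possible voting scheme.

```python
from collections import Counter


def vote(results):
    """Return the strict-majority result, or raise if there is none."""
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value


# Three hypothetical versions of the same computation (integer mean).
def mean_v1(xs):
    return sum(xs) // len(xs)


def mean_v2(xs):
    total = 0
    for x in xs:
        total += x
    return total // len(xs)


def mean_v3(xs):
    return max(xs)  # deliberately faulty version


def redundant_mean(xs):
    # Sequential execution of the versions, followed by a vote.
    results = [version(xs) for version in (mean_v1, mean_v2, mean_v3)]
    return vote(results)


print(redundant_mean([2, 4, 6]))
```

Here the two correct versions outvote the faulty one. If all versions disagree, the voter raises instead of silently picking a result.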
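  • 18. Tools such as lint(1). Lint Warnings: released in version 16.0 of the ADT (Android Development Tools) Plugin. Lint warnings check an application's resources (layouts, strings, etc.) for errors in advance. In particular, they find items that do not prevent the application from running but could potentially cause problems.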
  • 19.
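  • 20. NVP: a method in which several independent teams implement multiple versions from the same specification. Each team uses a different language, different algorithms and a different design, with no discussion between teams. Redundancy is used at every level, from design and development through to execution. It is not applicable at the scale of an individual or a small team, so it is not discussed further.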
  • 21. RAID, or "redundant array of inexpensive disks", is a technology for grouping disks together into a complex that provides redundancy, and hence fault tolerance, that is unavailable on a single disk. Five RAID levels, RAID-1 through RAID-5, increase the level of disk redundancy. An additional RAID level, RAID-0, is an alternate method of writing data to a disk that does not increase redundancy.
  • 22. Disk striping, used by RAID-0, stores consecutive chunks of data spread over a number of disks. This enables parallelism in reading and writing and hence faster access. Disk striping alone does not increase reliability, but it increases performance. Disk mirroring is used by RAID-1. The same data is written to multiple disks simultaneously. It is synchronized between the disks, so that if any of the disks fail there are redundant, identical copies of the data. This increases the reliability of the data with only a small performance increase. RAID-2 through RAID-5 all use some form of Hamming or parity encoding to ensure that the stored data can be reconstructed if a disk fails. The differences between these RAID levels are determined by where the data and parity encoding are stored and whether they are striped or mirrored. Table 2.1 shows how these levels are implemented. Hamming code encoding: allows recovery when an error occurs. Parity bit: used to check whether an error has occurred. RAID technology is common and is included in many commercial products today. For more information about RAID, refer to product literature or Marcus and Stern [MS00].
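How parity lets stored data be reconstructed after a disk failure can be illustrated with byte-wise XOR. This is a simplified sketch, not an implementation of any particular RAID level: the three "disks" are short byte strings, `xor_blocks` is a helper made up for the example, and the failed disk is assumed to be known.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_disks = [b"ABCD", b"EFGH", b"IJKL"]  # hypothetical disk contents
parity = xor_blocks(data_disks)           # stored on a separate parity disk

# Simulate losing disk 1, then rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, XOR-ing the surviving data blocks with the parity block yields the missing block. On its own, without knowing which disk failed, a parity value can only indicate that something is wrong.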
  • 23.
  • 24. The importance of testing and verification. Testing and verification also provide the data needed by a project's software reliability engineers to compute the expected reliability of a system. Operational profile testing: an operational profile describes the usage of the system in quantitative terms and the most typical scenarios that the system will process. Operational profiles are the scenarios that are used in design, development, and test. To test the reliability and performance of the system, the operational profile adds quantitative information to the descriptions of typical scenarios. For more information: on operational profiles, refer to Musa et al. [MFI96]; The Handbook of Software Reliability Engineering [Lyu96] contains more detailed information about testing and verification for reliability.
  • 25. A technique that is used during the testing and verification phase of a project is fault insertion testing. This testing serves the dual purpose of identifying faults in the system's error handling processes and of providing data for the computation of coverage factors. It is the only way that a system's coverage factor can be determined. Method: known faults are introduced into the system, which is then observed to see whether it was able to handle the faults automatically. The coverage is computed as the percentage of cases in which recovery was successful.
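The coverage computation described above can be sketched as follows. Everything here is hypothetical: `safe_divide` stands in for a system under test that recovers from only one kind of error, and the inserted faults are bad inputs rather than real hardware or software faults.

```python
def safe_divide(a, b):
    """Hypothetical system under test; it only recovers from division by zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return 0.0


def recovered(fault_inputs) -> bool:
    """True if the system handled the inserted fault without an escaping exception."""
    try:
        safe_divide(*fault_inputs)
        return True
    except Exception:
        return False


# Known faults deliberately introduced into the system.
inserted_faults = [(1, 0), (1, None), ("x", 2), (5, 0)]
outcomes = [recovered(fault) for fault in inserted_faults]

# Coverage: percentage of inserted faults from which recovery succeeded.
coverage = 100.0 * sum(outcomes) / len(outcomes)
print(f"coverage: {coverage:.0f}%")
```

The two type errors escape the handler, so only half of the inserted faults are covered. A result like this points directly at the error handling paths that still need work.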
  • 26. The six steps of the fault tolerant design methodology. 1. Assess the things that can go wrong with the system (fault trees). 2. Strategies must be defined to mitigate the risks. The project-specific pattern language that will be used during design is identified in step 2. 3. Create a mental model of your system, identifying the primary system dividing points and modes of redundancy.
  • 27. 4. The architectural and major design decisions can be made (ch. 4, patterns for high-level decisions). 5. Design in the capabilities for the system to implement the risk mitigation strategies identified in step 2 (ch. 5 to ch. 8). 6. Almost all systems, no matter how fault tolerant they are, require some provisions that enable them to be managed and administered by people. See Designing Interfaces by Tidwell [Tid05] or "An Input and Output Pattern Language" [HS00].
  • 28. Step 1 is where you identify the failures and the risk factors that can cause them. Step 5 comes back to the risks and failures to see whether mitigation techniques have been added to the design to cover the risks. Step 2 gives you a chance to think about the risks and about the patterns and other techniques that will be useful to mitigate them. The pattern language for your project is created in this step. Steps 3 and 4 are where the fault tolerant design starts to take shape: any elements of redundancy present in the system that can be leveraged to mitigate risks are studied and enhanced in step 3, and step 4 continues the design of the error detection and error processing capabilities of the system. People will interact with the system being designed both as users and as operating personnel. Step 6 considers how the human-computer interface can be made more robust and more error-free.
  • 29. The methodology is adapted from one in Secure Coding, Graff and van Wyk [GvW03]. It has been put in terms of fault tolerance and the Fault Tolerant Mindset. This methodology will be used in an example problem in the Conclusion, to design a fault tolerant Presence Server. The benefit of this methodology is that it gets you thinking about what can go wrong with the system.
  • 30. 1. Pattern: problem, context of the problem, forces. 2. Pattern language. 3. Fault -> Error -> Failure (specification): "A system failure occurs when the delivered service no longer complies with the specification, the latter being an agreed description of the system's expected function and/or service. An error is that part of the system state that is liable to lead to subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault." [Lap91, p.4] 4. Reliability: a system's reliability is the probability that it will perform without deviations from agreed-upon behavior for a specific period of time, that is, that there will be no failures during a specific time. 1) MTTF (Mean Time To Failure): the average time from the start of operation until the first failure occurs. 2) MTTR (Mean Time To Repair): a measure of the average time required to restore a failing component to operation.
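The way MTTF and MTTR together determine availability can be made concrete with the commonly used steady-state formula, availability = MTTF / (MTTF + MTTR). The formula is not quoted from the slides, and the hour values below are made-up inputs for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 1000 hours, takes 1 hour to repair.
a = availability(1000.0, 1.0)
downtime_hours_per_year = (1.0 - a) * HOURS_PER_YEAR

print(f"availability: {a:.5f}")
print(f"expected downtime: {downtime_hours_per_year:.2f} hours/year")

# Halving the repair time improves availability without changing MTTF.
a_faster_repair = availability(1000.0, 0.5)
print(a_faster_repair > a)
```

The last comparison shows why some systems focus on MTTR: availability can be raised by repairing faster even when failures are no rarer.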