ftc1.ppt
ftc1.ppt Presentation Transcript

  • HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Lecture 1 INTRODUCTION Winter Semester 2000/2001 Instructor: Prof. Dr. Miroslaw Malek www.informatik.hu-berlin.de/~rok/ftc DS - IX - NFT - 1
  • FAULT-TOLERANT COMPUTING SYSTEMS Topical Outline: 1. Introduction (Unit I) – Motivation – System views – Dependability rings – Dependable design methodology 2. Dependability Concepts, Measures and Models (UNIT DCMM) – Basic definitions – Dependability measures – Dependability models – Examples – Dependability evaluation tools 3. Testing Techniques (UNIT TT) – Testing techniques principles – Processor testing – Memory testing – Network testing DS - IX - NFT - 2
  • FAULT-TOLERANT COMPUTING SYSTEMS Topical Outline: 4. Fault Diagnosis Techniques (UNIT FST) – Fault detection techniques – Fault location (isolation) methods 5. Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level) – Dynamic techniques – Static techniques – Hybrid techniques 6. Fault-tolerant and Fault-secure Memories (UNIT FRTT) – Fault-tolerant techniques in manufacturing – Replication – Coding – Reconfiguration DS - IX - NFT - 3
  • FAULT-TOLERANT COMPUTING SYSTEMS Topical Outline: 7. Network Fault Tolerance (UNIT NFT) – Computer networks – Basic techniques – Example – multistage networks 8. Case Studies (UNIT CS) – ESS and 3B20 – FTMP – Fault-tolerant Multiprocessor – SIFT – Software-implemented Fault Tolerance – Communication controller – Fault-tolerant Building Block Architecture DS - IX - NFT - 4
  • COURSE ACTIVITIES • PROJECT • PRESENTATION • INVITED SPEAKERS • CONFERENCES AND WORKSHOPS • Some Websites: – www.dependability.org – www.paradise.caltech.edu – www.milan.eas.asu.edu – www.crhc.uiuc.edu DS - IX - NFT - 5
  • Major References on Fault-tolerant Computing (Books/General) 1 • Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970. • Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971. • Breuer, M. A. and A. D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976. • Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981. • Anderson, T. and P. A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982. • Siewiorek, D. P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995. • Lala, P. K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985. • Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986. DS - IX - NFT - 6
  • Major References on Fault-tolerant Computing (Books/General) 2 • Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987. • Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989. • Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989. • Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992. • Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993. • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, System Implementation, Kluwer Academic Publishers, 1994. • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994. DS - IX - NFT - 7
  • Major References on Fault-tolerant Computing (Books/General) 3 • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994. • Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994. • Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems, Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995. • Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995. • Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996. • Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997. • Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999. • Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser Muenchen, 1999. DS - IX - NFT - 8
  • Major References on Fault-tolerant Computing (Books/Reliability Evaluation) • Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976. • Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982. • Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984. • Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987. • Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999. DS - IX - NFT - 9
  • Major References on Fault-tolerant Computing (Books/Coding) • Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968. • Peterson, W. and E. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972. • Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978. • Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983. • Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986. • Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989. DS - IX - NFT - 10
  • Major References on Fault-tolerant Computing (Books/Software) • Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979. • Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982. • Shooman, M. L., Software Engineering, McGraw-Hill, 1983. • Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983. • Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987. • Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993. • Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995. • Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995. DS - IX - NFT - 11
  • Major References on Fault-tolerant Computing (Journals) • Special Issue of Proc. of IEEE, October 1978 • Special Issue of Computer, October 1979 • Special Issue of Computer, March 1980 • Special Issue of Computer, August 1984 • Special Issue of IEEE Software, May 1995 • IEEE Trans. on Reliability • IEEE Trans. on Software Engineering • Computer • Design and Test • Electronics • Proc. of IEEE • Computer Design • Journal of Electronic Testing: Theory and Applications • Journal of Parallel and Distributed Computing • IEEE Trans. on Parallel and Distributed Systems • Real-Time Systems Journal DS - IX - NFT - 12
  • Major References on Fault-tolerant Computing (Conference Proceedings) • Fault-Tolerant Computing Symposium • Reliability and Maintainability Symposium • Reliability in Distributed Software and Database Systems Symposium • Test Conference • Distributed Computing Systems Conference • Parallel Processing Conference • Real-Time Systems Symposium • Computer Architecture Symposium DS - IX - NFT - 13
  • INTRODUCTION • OBJECTIVES: – MOTIVATION FOR FAULT-TOLERANT SYSTEMS – TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY – TO PRESENT BASIC CONCEPTS AND APPROACHES – TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY • CONTENTS: – MOTIVATION – SYSTEM VIEWS – SYSTEM DEPENDABILITY CONCEPTS – APPROACHES TO DEPENDABLE DESIGN – DEPENDABILITY RINGS – DEPENDABLE DESIGN METHODOLOGY DS - IX - NFT - 14
  • TYPES OF SYSTEMS • Dependable (Reliable) System – A system that delivers a required service during its lifetime • Fault-Tolerant Computer System – A system that can continue the correct execution of its programs and input/output functions in the presence of faults • Real-Time Computer System – A system that delivers service to a user within a specified deadline (physical time, duration, etc.) • Responsive Computer System – A fault-tolerant real-time system that delivers satisfactory service in a timely manner DS - IX - NFT - 15
  • MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING • ECONOMIC NECESSITY • LIFE SAVING • NOVICE USERS • HARSH ENVIRONMENTS • MORE COMPLEX SYSTEMS DS - IX - NFT - 16
  • DEVICE RELIABILITY AND SYSTEM RELIABILITY [Figure: Mean Time Between Failures (MTBF) in years, log scale from 1 to 10^6, versus year, 1950 to 1990. Curves: Equivalent Device Reliability, System Reliability, and Minimum Acceptable Reliability. Technology generations along the time axis: Relays – Vacuum Tubes – Semiconductors – SSI – MSI – LSI – VLSI] DS - IX - NFT - 17
  • DEPENDABILITY – PERFORMANCE TRADE-OFF [Figure: Availability (0.9 to 0.99999) versus Throughput in MIPS (1 to 100000, log scale). Labeled regions: Ultra Reliable Systems, Commercial Fault-Tolerant Systems, Massively Parallel/Distributed Systems] DS - IX - NFT - 18
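The availability levels on this slide (0.9 up to 0.99999) are easier to compare when converted into expected downtime. The following is a minimal sketch, not part of the original slides; the function names and the 365-day year are illustrative assumptions.

```python
# Sketch: convert an availability figure into expected downtime per year.
# Assumption (not from the slides): a year of 365 days, i.e. 8760 hours.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR


def downtime_minutes_per_year(availability: float) -> float:
    """Same quantity expressed in minutes."""
    return downtime_hours_per_year(availability) * 60.0


if __name__ == "__main__":
    # Availability levels taken from the axis of the slide's figure.
    for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        print(f"A = {a}: {downtime_hours_per_year(a):.4f} h/year")
```

Under these assumptions, an availability of 0.9 corresponds to roughly 876 hours of downtime per year, while 0.99999 corresponds to a little over five minutes.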
  • EXAMPLES • DEFENSE SYSTEMS • FLIGHT SYSTEMS • AIR TRAFFIC CONTROL • COMMUNICATION SYSTEMS • BANKING SYSTEMS • AIRLINE SEAT RESERVATIONS • TELEPHONE SYSTEMS • HOUSEHOLD APPLIANCES • VIDEO GAMES DS - IX - NFT - 19
  • VIEW 1: SYSTEM LIFE CYCLE [Diagram: SYSTEM OBSOLESCENCE, NEW NEEDS, CONSTRAINTS and TECHNOLOGY feed the sequence CONCEPT FORMULATION → SYSTEM SPECIFICATION → DESIGN → PROTOTYPE → PRODUCTION → INSTALLATION → OPERATIONAL LIFE → MODIFICATION AND RETIREMENT] • Notice that testing, verification or validation should occur after every phase of the life cycle • Very few tools exist, and only for some steps of the cycle DS - IX - NFT - 20
  • VIEW 2: PACKAGING LEVELS OF INTEGRATION • APPLICATIONS • APPLICATIONS MODULES • SPECIAL-PURPOSE LANGUAGES • STANDARD LANGUAGES • OPERATING SYSTEMS • CABINETS/FRAMES • BOXES/CAGES • PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs • INTEGRATED CIRCUITS (CHIPS) • Dependability must be considered at every level • System decomposition (partitioning) may have a significant impact on dependability DS - IX - NFT - 21
  • VIEW 3: WORKLOAD VIEW [Diagram: LIVEWARE and HARDWARE/SOFTWARE workload divided into PREPARATION, USEFUL WORK, SEMI-USEFUL WORK, FAULT SERVICING and IDLING] • ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY DS - IX - NFT - 22
  • VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS (LEVEL / SUBLEVEL: COMPONENTS) – PMS: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os – Program / HLL, ISP (Instruction Set Processor): Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution – Logic / Register Transfer Level (RTL): Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore) – Circuit: Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors – Quantum & Electromagnetic: Disks, Tapes • DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL DS - IX - NFT - 23
  • VIEW 5: COMPUTER SYSTEM • SOFTWARE PACKAGES – ASSEMBLERS – COMPILERS – OPERATING SYSTEMS – UTILITY PROGRAMS – DEBUGGING PROGRAMS – FILE PROCESSING PROGRAMS • FIRMWARE – MICROPROGRAM & MICROPROGRAMMING SYSTEMS • HARDWARE – CPUs – I/O DEVICES – MEMORIES – INTERCONNECTION NETWORKS • LIVEWARE – MAINTENANCE PERSONNEL – OPERATORS – SYSTEM DESIGNERS – SYSTEM ANALYSTS – PROGRAMMERS – USERS • FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%; PEOPLE: 15%-40%; AT&T’s: 20-40-40%; (2/3 applications + 1/3 OS) DS - IX - NFT - 24
  • (WARNING!!!) VIEW 6: IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY YOU MAY END UP WITH THE FOLLOWING: SIX PHASES OF A PROJECT 1. ENTHUSIASM 2. DISILLUSIONMENT 3. PANIC AND HYSTERIA 4. SEARCH FOR THE GUILTY 5. PUNISHMENT OF THE INNOCENT 6. PRAISE AND AWARDS FOR THE NON-PARTICIPANTS (Author unknown – found in one of the computer companies) DS - IX - NFT - 25
  • SYSTEM DEPENDABILITY CONCEPTS • RELIABILITY – is the conditional probability that the system will perform its intended function without failure at time t, provided it was fully operational at time t = 0 • AVAILABILITY – Instantaneous availability is the probability that a system is performing correctly at time t; for non-repairable systems it equals reliability: $A(t) = R(t)$ – Steady-state availability is the probability that a system will be operational at any random point in time and is expressed as the fraction of time a system is operational during its expected lifetime: $A_s = \frac{\text{UPTIME}}{\text{LIFETIME}}$ • SURVIVABILITY – is the probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it DS - IX - NFT - 26
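The definitions above can be illustrated numerically. The sketch below computes steady-state availability as uptime divided by lifetime, as on the slide. The reliability function additionally assumes a constant failure rate (exponential model), which the slide does not state; it is included only as one common illustrative choice, and all names are hypothetical.

```python
import math


def steady_state_availability(uptime: float, lifetime: float) -> float:
    """Fraction of the expected lifetime during which the system is up."""
    if lifetime <= 0:
        raise ValueError("lifetime must be positive")
    if not 0 <= uptime <= lifetime:
        raise ValueError("uptime must lie between 0 and lifetime")
    return uptime / lifetime


def reliability_constant_rate(t: float, failure_rate: float) -> float:
    """R(t) under an ASSUMED constant failure rate: R(t) = exp(-rate * t).

    The slides define reliability only as a probability; the exponential
    form is an assumption made for this example.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


if __name__ == "__main__":
    # Hypothetical numbers: 8750 hours up during an 8760-hour year.
    print(steady_state_availability(8750.0, 8760.0))
    # Hypothetical failure rate of 0.001 failures per hour, mission of 100 h.
    print(reliability_constant_rate(100.0, 0.001))
```

For a non-repairable system, the instantaneous availability at time t would simply be the value returned by the reliability function, matching A(t) = R(t).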
  • APPROACHES • FAULT INTOLERANCE • FAULT TOLERANCE • MAINTAINABILITY • HARDWARE/SOFTWARE TRADE-OFFS DS - IX - NFT - 27
  • HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION [Diagram: continuum from HARDWARE to SOFTWARE, listing instruction classes with example machines] – INTEGER ARITHMETIC: ADD/SUB (M6800), MPY/DIV (MC68000) – FLOATING-POINT ARITHMETIC (VAX-11/780, IBM-30XX) – VECTOR PROCESSING (CRAY-XMP, C-205) – MULTIPROCESSING, e.g., SYSTOLIC ARRAYS, submachine set-up (RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS) • VERTICAL MIGRATION is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa. Vertical Migration improves performance and dependability, and reduces cost. DS - IX - NFT - 28
  • DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE [Diagram: nested Dependability Rings, each separated from the next by an Acceptance Test; rings in the order listed on the slide: Operating System, Languages and Application – System Hardware – Register-Transfer Level – Logic Level] • Each Dependability Ring should provide measures and mechanisms for Fault Tolerance (Detection, Location, Testability and Recovery) DS - IX - NFT - 29
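One way to picture nested rings guarded by acceptance tests is the sketch below. It is an illustration under assumptions, not the design from the lecture: `run_ring`, `RingFailure` and the square-root example are hypothetical, and recovery is modeled simply as trying an alternate variant when a result fails its acceptance test.

```python
import math
from typing import Callable, Sequence


class RingFailure(Exception):
    """Raised when no variant in a ring passes that ring's acceptance test."""


def run_ring(name: str,
             variants: Sequence[Callable[[], float]],
             acceptance_test: Callable[[float], bool]) -> float:
    """Try each variant in turn; return the first result that is accepted."""
    for variant in variants:
        try:
            result = variant()
        except Exception:
            continue  # detection: an exception counts as a detected fault
        if acceptance_test(result):
            return result  # the result may leave this ring
    raise RingFailure(name)  # escalate to the enclosing ring


def sqrt_ring(x: float) -> Callable[[], float]:
    """Inner ring (hypothetical): square root with an injected design fault."""
    def faulty() -> float:
        return x / 2.0  # deliberately wrong for most inputs

    def backup() -> float:
        return math.sqrt(x)

    def accept(r: float) -> bool:
        return abs(r * r - x) <= 1e-9 * max(1.0, x)

    return lambda: run_ring("inner ring", [faulty, backup], accept)


# Outer ring wraps the inner ring and applies its own, coarser test.
result = run_ring("outer ring", [sqrt_ring(9.0)], lambda r: r >= 0.0)
```

Here the faulty variant returns 4.5, which the inner acceptance test rejects, so the backup result 3.0 is passed outward. If every variant in an inner ring fails, the failure propagates to the enclosing ring, which can try its own alternates.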
  • A BOOTSTRAP – TEST RINGS IN A MULTICOMPUTER SYSTEM [Diagram: Test Rings labeled Network, Memories, Processor, and Diagnostic and Maintenance Processor(s) (Hardcore)] DS - IX - NFT - 30
  • DEPENDABLE DESIGN METHODOLOGY • Identify fault classes, fault latency and fault impact • Determine qualitative and quantitative specs for fault tolerance and evaluate your design in specific environment • Identify “weak spots” and assess potential damage • Decompose the system • Develop fault and error detection techniques and algorithms • Develop fault isolation techniques and algorithms • Develop recovery/reintegration/restart • Evaluate degree of fault tolerance • Refine, iterate for improvement; try to eliminate “weak spots” and minimize potential damage DS - IX - NFT - 31
  • REAL-TIME SYSTEMS DESIGN • Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment. • Characterize the timing of the system (hardware and software). • Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring. • Verify and validate the design against quantitative and qualitative specifications. • Refine, iterate and fine-tune the design. DS - IX - NFT - 32
  • RESPONSIVE SYSTEM DESIGN • Determine qualitative and quantitative specifications for fault tolerance and task timeliness that meet user requirements. • Determine system timing (hardware and software), and assess damage, availability and responsiveness. • Develop and time fault and error detection techniques and algorithms. • Develop and time fault isolation techniques and algorithms. • Develop and time recovery/reintegration/restart. • Map the timing specification onto the system timing under appropriate assumptions and incorporate concurrent monitoring. • Evaluate responsiveness. • Refine and iterate for improvement. RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME DS - IX - NFT - 33
  • REFERENCES (TEXTBOOK) • C. G. Bell, J. C. Mudge and J. E. McNamara, “Seven Views of Computer Systems”, Chapter 1 in the book by the same authors titled “Computer Engineering”, Digital Press, 1978. • G. J. Lipovski and M. Malek, “Parallel Computing: Theory and Comparisons”, Wiley-Interscience, New York, 1987. • M. Malek, “Parallel Computer Systems Testing and Integration”, in the book titled “Testing and Diagnosis of VLSI and LSI”, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988. • Pankaj Jalote, Fault Tolerance in Distributed Systems, Prentice-Hall, 1994. • Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996. DS - IX - NFT - 34