Cost-effective software reliability through autonomic tuning of system resources

Lecture given at the IMEC Academy (www.imec.be)

Presentation Transcript

  • IMEC Academy / SSET Seminar: Cost-effective software reliability through autonomic tuning of system resources. Vincenzo De Florio, vincenzo.deflorio@ua.ac.be
  • Agenda (4 May 2011, Imec academy)
    ◦ Introduction
    ◦ Closed world systems
    ◦ Open world software
    ◦ Autonomic management of redundancy
      – Examples
    ◦ Conclusions and next steps
  • Introduction: main actors in the play
    ◦ Software reliability: an important, elusive requirement
    ◦ Redundancy: an effective way to achieve reliability
    ◦ Autonomic software evolution: a cost-effective method to parameterize reliability in software
  • Part 1: Closed world systems. Key problem & a classic solution
    ◦ Given an unreliable “channel,” how do we use it reliably?
    ◦ Common solution: redundancy + upper bounds estimation
      – Off-line analysis of the maximal disturbance
      – Off-line dimensioning of redundancy such that any disturbance is tolerated
    ◦ A closed world approach
  • Closed world systems
    ◦ Systems built on immutable hypotheses regarding their deployment environments & platforms
    ◦ Context-agnostic, ataraxic systems
    ◦ ‘Virtual’ agents that operate irrespective of any physical world property
      – Time, temperature, humidity, user’s quality of experience, attacks, …
  • When does it make sense?
    ◦ Whenever the designer has strong confidence that assumptions will hold
    ◦ Whenever there is “strong and certified control” on
      – The platform
      – The environment
    ◦ E.g. synchronous systems
  • Example
    ◦ I have a problem of interference, but
      – I have full confidence in my platform and its state
      – I have “full control” over the environment: e.g., I can make sure that, during certain critical operational stages, interference will stay minimal
    ◦ Do you recognize which case this is?
  • “Please be advised that all electronic devices must be switched off and remain switched off until further notice.”
    “Most personal devices transmit a signal and all of them emit electromagnetic waves which, in theory, could interfere with the plane’s electronics.”
  • There’s no full control as fools’ control of course…
    1. How can we make sure that passengers will comply?
    2. System & environment compliance refers to the past
      ◦ “…the deterioration of planes and advance or decline of electronic devices over time is the immeasurable factor that is never taken into account by passengers”
      ◦ Or companies! “A plane is designed to the right specs, but nobody goes back and checks if it is still robust.” [p3air]
  • “Nobody goes back & checks again…”
    ◦ Closed world systems are “frozen in time”
    ◦ Their certification implicitly refers to a scenario that may differ from the real one
      – Ariane-5, Therac-25, …
      – You can only rely on the fact that the certification was valid yesterday
      – Scenario = hw/sw/nw technologies, hci
  • Conclusions part 1
    ◦ Closed world systems: “sitting ducks” to change! “Frozen ducks,” actually
    ◦ Service = Platform(t) + Environment(t)
    ◦ Design sometimes results in systematic assumptions hiding and “clashes”
  • Part 2: Open-world Software
    ◦ Other option: Open-world
    ◦ Software that
      – senses endogenous state & exogenous conditions
      – makes use of gathered context to optimize its behavior
    ◦ Choices must be made of what to make translucent and what to leave transparent
    ◦ Certain events will be detected and treated, some others won't
    ◦ Basic feature: detection of assumption vs. context clashes
  • Two typical cases
    ◦ Platform Assumption vs. Context Clashes (PC)
    ◦ Environmental Clashes (EC)
  • PC: clashes related to our assumptions on the platform
    ◦ E.g.
      – Memory chip technology
      – Presence/absence of hw component
  • PC in memory chips
    ◦ Failure semantics may differ considerably
    ◦ CMOS failures: mostly single-bit errors
    ◦ SDRAM failures: Single-Event Effects
      – Single-event latchup → loss of all data on chip
      – Single-event upset → soft errors
      – Single-event functional interrupt → device left in either test mode, halt, or undefined state
    ◦ Even from lot to lot, error and failure rates can vary by more than one order of magnitude [Lad02]
  • PC in memory chips
    ◦ Open-world systems may detect this clash
    ◦ E.g. Configure-like scripts that check hypotheses at compile / deployment time
    ◦ Exploiting hardware / OS support [DF10]
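A configure-like check of design-time hypotheses might look like the following minimal sketch. It is not taken from the lecture or from [DF10]: the function `check_assumptions` and the three sample hypotheses are hypothetical names and choices made up for illustration, and a real deployment would test platform properties relevant to the application (memory technology, presence of an MMU, and so on).

```python
import platform
import sys


def check_assumptions(assumptions):
    """Return the names of the assumptions that clash with the current context.

    `assumptions` maps a human-readable name to a zero-argument predicate.
    A predicate that raises is treated as a clash, since the hypothesis
    could not be confirmed.
    """
    clashes = []
    for name, predicate in assumptions.items():
        try:
            holds = bool(predicate())
        except Exception:
            holds = False
        if not holds:
            clashes.append(name)
    return clashes


# Hypothetical design-time hypotheses about the platform (illustration only).
PLATFORM_ASSUMPTIONS = {
    "interpreter is Python 3": lambda: sys.version_info >= (3, 0),
    "pointers are 64 bits wide": lambda: sys.maxsize > 2**32,
    "machine type is reported": lambda: platform.machine() != "",
}

if __name__ == "__main__":
    for clash in check_assumptions(PLATFORM_ASSUMPTIONS):
        print("assumption vs. context clash:", clash)
```

Running such a check at deployment time turns a silent platform clash into an explicit event that the software, or its operator, can treat.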
  • PC in memory chips: Serial Presence Detect
  • “lshw” on a Dell Inspiron
    *-memory
      description: System Memory
      physical id: 1000
      slot: System board or motherboard
      size: 1536MiB
      *-bank:0
        description: DIMM DDR Synchronous 533 MHz (1.9 ns)
        vendor: CE00000000000000
        physical id: 0
        serial: F504F679
        slot: DIMM_A
        size: 1GiB
        width: 64 bits
        clock: 533MHz (1.9ns)
      *-bank:1
        description: DIMM DDR Synchronous 667 MHz (1.5 ns)
        vendor: CE00000000000000
        physical id: 1
        serial: F33DD2FD
        slot: DIMM_B
        size: 512MiB
        width: 64 bits
        clock: 667MHz (1.5ns)
  • PC: 2nd case
    ◦ Presence/absence of features
      – (MMU)? Access : Deny
      – w/o MMU, memory faults may stay uncovered
    ◦ Policies (e.g. for security issues)
      – Standards, e.g. WS-Policy
  • Second class of clashes: EC
    ◦ Re: assumptions on the environment
    ◦ Two examples:
      1. Choice of protocol
      2. Choice of design pattern
  • EC-1: choice of protocol (figure: client c, service vi, server s)
    ◦ Client c invokes service vi to get an object from server s
  • EC-1: choice of protocol (figure: client c, service vi, server s, transport protocol t)
    ◦ vi uses transport protocol t to transfer that object
    ◦ Nature & properties of t: unknown to c
  • EC-1: a possible scenario (figure: Appl. / TCP / IP protocol stacks between c and s)
    ◦ One momentary disruption breaks all TCP connections
  • EC-1 (figure: network disruption, No / Yes, against protocol, TCP / UDP)
    ◦ Once a clash is suspected / detected, adjustments can be made
    ◦ E.g. transport protocol changed on the fly [HGS11]
  • Clash EC-2: Choice of design pattern
    ◦ Fault-tolerance design patterns can be applied to reach higher reliability
    ◦ Design choices include e.g.
      – Redoing (time redundancy scheme)
      – Reconfiguration (design redundancy scheme)
    ◦ Any choice implies an assumption
      – Here: fault model assumption: transient vs. permanent faults
    ◦ Closed world → “hardwired” assumption
  • EC-2: hardwiring assumptions is hazard hiding (figure: experienced fault, transient / permanent, against design pattern, redoing / reconfiguration)
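One way to avoid hardwiring the fault-model assumption is to let the experienced fault class drive the pattern. The sketch below is an illustration only and is not the ACCADA mechanism: `run_with_adaptive_pattern` is a hypothetical helper, and the rule "a fault that survives `max_attempts` tries is treated as permanent" is an assumption chosen for simplicity.

```python
def run_with_adaptive_pattern(primary, spares, max_attempts=3):
    """Run `primary`, escalating from redoing to reconfiguration.

    Returns (result, pattern), where pattern names what resolved the call:
    "none" (first attempt succeeded), "redoing" (a retry succeeded), or
    "reconfiguration" (a spare component succeeded).
    """
    # Redoing (time redundancy): assume the fault is transient and retry.
    for attempt in range(max_attempts):
        try:
            result = primary()
        except Exception:
            continue
        return result, ("none" if attempt == 0 else "redoing")

    # Reconfiguration (design redundancy): the fault now looks permanent,
    # so switch to an alternative component.
    for spare in spares:
        try:
            return spare(), "reconfiguration"
        except Exception:
            continue

    raise RuntimeError("primary and all spares failed")
```

With a purely hardwired choice, a permanent fault under "redoing" wastes retries forever, and a transient fault under "reconfiguration" wastes spares; the escalation above trades a few retries for not having to commit to either assumption in advance.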
  • EC-2: possible treatment: autonomic revision of component graphs, as e.g. in ACCADA [GD11]
  • Conclusions part 2
    ◦ Depending on the context c(t), the chosen assumptions:
      – May be valid / invalid
      – May clash with other assumptions
    ◦ Clashes → software is bound to
      – Experience failures
      – Waste useful assets
    ◦ → Autonomic revision of assumptions in the face / probability of a context clash
  • Part 3: Autonomic Redundancy Management
    ◦ Autonomic assumptions failure avoidance
    ◦ Context clash avoidance
    ◦ One or (if time allows!) two examples
      – Adaptively redundant data structures
      – Adaptive N-version programming
  • Redundant data structures
    ◦ Goal: tolerate transient faults affecting program memory
    ◦ Method: transparent memory cells replication + voting [TaMB80]
      – Writing to a redundant variable = writing to n replicas [EC], located somewhere and according to some strategy [PC]
      – Reading from a redundant variable = reading the n cells, performing majority voting
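The write and read rules above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `RedundantVariable` is hypothetical, the replicas live in an ordinary Python list (so the placement strategy that matters for platform clashes is not modelled), and values are assumed to be hashable.

```python
from collections import Counter


class RedundantVariable:
    """A variable stored as n replicas and read through majority voting."""

    def __init__(self, value, n=3):
        if n < 1 or n % 2 == 0:
            raise ValueError("n must be a positive odd number")
        self.replicas = [value] * n

    def write(self, value):
        # Writing = writing to all n replicas.
        self.replicas = [value] * len(self.replicas)

    def read(self):
        # Reading = reading the n cells and performing majority voting.
        value, votes = Counter(self.replicas).most_common(1)[0]
        if votes <= len(self.replicas) // 2:
            raise RuntimeError("no majority among replicas")
        # Repair dissenting cells so that faults do not accumulate.
        self.replicas = [value] * len(self.replicas)
        return value
```

With n=3, one corrupted cell is outvoted on the next read; two cells corrupted to different values leave no majority and the read fails.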
  • Design & Contextual Redundancy
    1. Design redundancy: our fixed choice (e.g., n=3 replicas)
    2. Contextual redundancy: the “right choice” at time t
      ◦ A model of the environment → Dynamic system cr(t)
      ◦ cr(34)=5 → “5 replicas is what we need at t=34”
  • EC in RDS (figure: design redundancy n against contextual redundancy cr(t))
    ◦ Undershooting: n < cr(t)
    ◦ n = cr(t)
    ◦ Overshooting: n > cr(t)
  • Tackling EC in RDS
    ◦ Dynamically redundant data structures
      – Autonomic management of redundancy
      – RDS where redundancy is not fixed once and for all, but changes dynamically, following cr(t)
    ◦ How to estimate cr(t)? → Direct measurement or indirect deduction
  • Distance-to-failure (figure labels: dtof = 4, dtof = 3, ??, dtof = 2, dtof = 0: failure!)
  • Distance-to-failure: n (design redundancy) as a function of dtof
    ◦ Under normal conditions, n=3
      – System triplicates cells of redundant variables
      – Up to one memory fault is tolerated
    ◦ Under more critical situations, dtof decreases → amount of redundancy is automatically adjusted
    ◦ Adjustment logic should select the ideal degree of redundancy matching the current disturbances
  • Risk of failure
    ◦ n(i) = redundancy at voting round i = 2p(i)+1 (p(i)>0)
    ◦ m(i) = card {replicas that agree after voting round i}
    ◦ 1 ≤ m(i) ≤ n(i)
    ◦ Then $\mathrm{risk}(i) = \begin{cases} \dfrac{n(i) - m(i)}{p(i)} & \text{when } m(i) > p(i) \\ 1 & \text{otherwise} \end{cases}$
    ◦ Here, linear evolution (not very efficient)
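The risk definition translates directly into code. In this sketch the function name `risk` and the argument checks are illustrative additions; the formula itself follows the slide, with p derived from n = 2p + 1.

```python
def risk(n, m):
    """Risk of failure after a voting round.

    n: redundancy at this round, n = 2p + 1 with p > 0 (so n is odd, n >= 3)
    m: number of replicas that agree after the round, 1 <= m <= n
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be odd and at least 3")
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")
    p = (n - 1) // 2
    # A majority exists only when more than p replicas agree.
    return (n - m) / p if m > p else 1.0
```

For n=5 (p=2) the values are 0.0 when all five replicas agree, 0.5 with four agreeing, and 1.0 with three or fewer: the risk reaches 1 as soon as a single further dissenting replica would destroy the majority.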
  • Evolution engine
    ◦ Algorithm responsible for taking decisions on how/when to adapt
    ◦ In what follows, trivial example:
      – if risk(t-1) was high, then redundancy ← redundancy + 2
      – if risk(t-1000 … t-1) were low, then redundancy ← redundancy - 2
    ◦ A static formulation!
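The trivial rule above could be sketched as below. The slide does not say what counts as "high" or "low" risk, nor whether redundancy is bounded, so the thresholds `high` and `low`, the bounds `n_min` and `n_max`, and the class name `EvolutionEngine` are all assumptions made for this illustration; only the +2 / -2 steps and the 1000-round window come from the slide.

```python
from collections import deque


class EvolutionEngine:
    """Raise redundancy quickly on high risk, lower it after a quiet window."""

    def __init__(self, n=3, high=0.5, low=0.0, window=1000, n_min=3, n_max=9):
        self.n = n
        self.high = high        # assumed threshold for "risk was high"
        self.low = low          # assumed threshold for "risk was low"
        self.window = window    # number of past rounds that must all be low
        self.n_min = n_min      # assumed lower bound on redundancy
        self.n_max = n_max      # assumed upper bound on redundancy
        self.history = deque(maxlen=window)

    def update(self, risk):
        """Record the risk of the latest voting round; return the new redundancy."""
        self.history.append(risk)
        if risk >= self.high:
            self.n = min(self.n + 2, self.n_max)
            self.history.clear()
        elif (len(self.history) == self.window
              and all(r <= self.low for r in self.history)):
            self.n = max(self.n - 2, self.n_min)
            self.history.clear()
        return self.n
```

The asymmetry is deliberate in the slide's rule: one risky round is enough to add replicas, while removing them requires a long run of quiet rounds. Clearing the history after each change is a design choice of this sketch, so that every adjustment is judged on fresh observations.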
  • Fault-injection “little language”
  • Simulations
    ◦ scrambler + aRDS + reader
    ◦ aRDS “protects” 20,000 4-byte variables
      – Fixed allocation stride = 20 (→ no protection against PC in this case)
    ◦ reader: round robin read accesses
    ◦ Experiments record
      – number of scrambled cells
      – number of read failures
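A toy version of the closed-world experiment (fixed n) can be sketched as follows. It is heavily simplified and every parameter is an assumption: the function `simulate`, the uniform random scrambler, the default sizes, and the choice to rewrite a variable after each read are all made up for illustration, and the allocation stride and the lecture's fault-injection language are not modelled.

```python
import random
from collections import Counter


def simulate(n, n_vars=1000, rounds=100, faults_per_round=20,
             reads_per_round=50, seed=1):
    """Return (scrambled_cells, read_failures) for a fixed redundancy n."""
    rng = random.Random(seed)
    memory = [[0] * n for _ in range(n_vars)]   # every variable holds 0
    scrambled = read_failures = 0
    next_var = 0

    for _ in range(rounds):
        # Scrambler: overwrite randomly chosen replica cells with garbage.
        for _ in range(faults_per_round):
            cells = memory[rng.randrange(n_vars)]
            cells[rng.randrange(n)] = rng.randrange(1, 2**32)
            scrambled += 1

        # Reader: round-robin read accesses with majority voting.
        for _ in range(reads_per_round):
            cells = memory[next_var]
            next_var = (next_var + 1) % n_vars
            value, votes = Counter(cells).most_common(1)[0]
            if votes <= n // 2 or value != 0:
                read_failures += 1
            # Assumption: the variable is rewritten after each read, either
            # repaired by the vote or restored after a detected failure.
            cells[:] = [0] * n

    return scrambled, read_failures
```

Calling `simulate(3)` and `simulate(5)` with the same fault load gives a rough feel for the experiments on fixed n=3 and n=5; an adaptive run would replace the fixed `n` with a value driven by the observed risk.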
  • Experiment 1: Closed world, n=3
  • Experiment 2: Closed world, n=5
  • Experiment 3: DTOF, n(0)=5
  • Redundancy evolution (figure: redundancy plotted against time t)
  • A second case: aNVP
    ◦ Generalization of DTOF: Normalized dissent
  • Normalized Dissent
    ◦ Quantifies the detrimental impact of a single version in an NVP/MV composite
    ◦ Two sub-models:
      – Penalization mechanism: ND↑. “Fine” faulty replicas for omission, performance, and value response failures
      – Reward model: ND↓. Reward replicas behaving correctly; weigh down as a function of time (absolution)
  • Conclusions
    ◦ Open-world: in some cases, the only option
    ◦ Transparency vs. translucency: two conflicting requirements
    ◦ Mechanisms are needed to hide complexity without hiding intelligence
    ◦ E.g. via autonomic assumption failure detection and treatment, or policies [DF10]
  • Next steps
    ◦ Tuning the fault-tolerance design pattern to the experienced fault class
    ◦ Mechanisms to express and assert the design time hypotheses about platform and environment
    ◦ Ultimate challenges:
      – Intelligent management of the (dependability) strategies
      – Autonomic tuning of time and design redundancy
      – Resilience (robust evolution)
  • References
    ◦ [TaMB80] David Taylor et al., “Redundancy in Data Structures: Improving Software Fault Tolerance,” IEEE Trans. on Software Engineering 6:6 (1980)
    ◦ [p3air] http://www.p3air.com/2011/electronic-devices-caninterfere-with-aircraft-instrumnts-to-create-perfectstorm?wpmp_switcher=mobile
    ◦ [HGS11] Joe Hoffert, Aniruddha Gokhale, and Douglas C. Schmidt, “Timely Autonomic Adaptation of Publish/Subscribe Middleware in Dynamic Environments,” IJARAS Vol. 2 No. 4 (2011)
    ◦ [GD11] N. Gui, V. De Florio, and C. Blondia, “Toward Architecture-based Context-Aware Deployment and Adaptation,” Journal of Systems and Software, 84:2. Elsevier, February 2011
    ◦ [DF10] V. De Florio, “Software Assumptions Failure Tolerance: Role, Strategies, and Visions,” chapter in Architecting Dependable Systems, Vol. 7, LNCS Vol. 6420, pp. 249-272. Springer, 2010
  • Where to Get More information
    ◦ K. Boulding (1956): “General Systems Theory – The Skeleton of Science”. Management Science, 2(3).
    ◦ V. De Florio (2009): “Application-layer Fault-tolerance Protocols”. Information Science Reference, IGI-Global.
    ◦ V. De Florio & C. Blondia (2010): “Adaptation and dependability and their key role in modern software engineering”, International Journal of Adaptive, Resilient and Autonomic Systems (IJARAS), 1(2).
    ◦ C. Esposito & D. Cotroneo (2010): “Resilient and Timely Event Dissemination in Publish/Subscribe Middleware”, IJARAS, 1(1).
    ◦ N. Gui, V. De Florio, H. Sun & C. Blondia (2009): “ACCADA: A Framework for Continuous Context-Aware Deployment and Adaptation”, Proc. of 11th Int.l Symp. on Stabilization, Safety, and Security of Distributed Systems, Lyon.
    ◦ E. Hollnagel, D. Woods, N.G. Leveson (2006): “Resilience engineering: Concepts and precepts”, Aldershot, UK.
    ◦ J. Horning (1998): “ACM Fellow Profile”, ACM Software Eng. Notes 23(4).
    ◦ N. G. Leveson (1995): “Safeware: Systems Safety and Computers”, Addison.
  • Where to Get More information: www.igi-global.com/reference/details.asp?ID=32917
  • Thank you for your attention Questions?