The PEnDAR project investigated the application of stochastic engineering techniques to the verification and validation of complex and cyber-physical systems.
PEnDAR: Performance ENsurance by Design, Analysing Requirements (TSB reference: 132304)
Why? Cost/performance hazards become visible late in the development process – too late to save some projects! This is a multi-billion-dollar problem worldwide.
Pressure to re-purpose commodity infrastructure for safety/mission-critical objectives; need to be able to articulate a safety case.
A Quantitative Timeliness Agreement (QTA) is a relationship between the demand (the applied load, including its pattern) and the delivered quality impairment (as a probability distribution, ∆Q)
Opportunity costs between systems sharing the same resources, and successive refinements, won’t be considered in this webinar.
Rocket science used to be something only world superpowers could do – now you only need to be a billionaire! It’s well enough understood to be reproducible, and is just (complex) engineering. Brain surgery requires experience, skill and gut feel – not easy to teach! Outcomes are hard to quantify.
Any CDF whose curve is always to the left and above this one represents an outcome that is “acceptable”. If the black line crosses the blue line we have a performance hazard.
This can be combined with a corresponding analysis of the resource consumption
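As a concrete illustration of such a hazard check (a minimal sketch, not PEnDAR tooling – the QTA figures and the measured distribution below are invented), we can compare the observed delay CDF against the required one at each agreed delay bound:

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Empirical CDF of observed delays, evaluated at the QTA's delay points."""
    samples = np.sort(np.asarray(samples))
    return np.searchsorted(samples, grid, side="right") / len(samples)

def qta_hazard(observed_delays, grid, required_cdf):
    """Performance hazard: the observed CDF dips below the required CDF,
    i.e. at some delay bound too little probability mass has been delivered."""
    return np.any(empirical_cdf(observed_delays, grid) < required_cdf)

# Hypothetical QTA: 50% of outcomes within 10 ms, 95% within 50 ms, all within 200 ms.
grid = np.array([10.0, 50.0, 200.0])       # delay bounds (ms)
required = np.array([0.50, 0.95, 1.00])    # probability delivered by each bound

measured = np.random.lognormal(mean=2.5, sigma=0.6, size=10_000)  # stand-in measurements
print("performance hazard!" if qta_hazard(measured, grid, required) else "QTA met")
```

The same style of check applies on the resource side, with consumption bounds in place of delay bounds.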
We’re now going to run through some of the technical dimensions of this challenge
This captures what we have learnt about system delivery problems over the last decade. There’s a lot here so we’re going to break it down!
The key task with shared-resource systems is to find a way to quantify and manage the performance/resource tradeoff.
Quantifying and managing the performance/resource tradeoff (yellow centre) is specific to each particular system; the issues around it can be dealt with by applying generic techniques. Analysis of the central problem is complemented by a synthesis of other techniques.
The three key aspects to consider are:
Scale – how are the resource/performance trades affected by the scale of the system?
Exception/failure – how are these managed, given that they become inevitable in a shared, distributed system?
Variability – how variable are the resources and the demand for outcomes?
Scale has two dimensions:
Space – either in terms of physical distance, affecting transmission times, or in terms of numbers of users/demands on the system, which together create a notion of ‘density’ that can drive the economics of the solution.
Time – on long timescales the question is one of capacity, on short ones of schedulability.
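To make the time dimension concrete, a back-of-envelope sketch (textbook M/M/1 queueing with invented numbers, not a project result) shows why capacity headroom matters on long timescales: mean delay blows up as utilisation approaches 1.

```python
# M/M/1 sanity check: mean sojourn time T = 1 / (mu - lambda).
# Illustrative only - real analysis needs the whole ∆Q distribution, not just the mean.
service_rate = 1000.0  # jobs/s (assumed capacity)

for utilisation in (0.5, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilisation * service_rate
    mean_delay_ms = 1000.0 / (service_rate - arrival_rate)
    print(f"rho = {utilisation:.2f}: mean sojourn time ~ {mean_delay_ms:.1f} ms")
```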
Exception and failure are specifically not a question of ‘coding errors’ or hardware faults (although those are a factor) but more one of temporary shortage of resources, resulting, for example, in the loss of a packet or a deadline being missed.
Two approaches to handling this are mitigation (re-transmitting a packet, for example) and propagation (packet loss resulting in a failed transfer, requiring handling at a higher layer). These interact, and the optimal approach will depend on the frequency and severity of the failures and the costs of handling them in different ways.
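As a hedged sketch of that interaction (all cost and probability figures below are invented), compare the expected cost of mitigating at the lower layer against propagating the failure upwards:

```python
def mitigation_cost(p_fail, retry_cost, max_retries, escalation_cost):
    """Expected cost of retrying locally up to max_retries times, then escalating."""
    cost, p_still_failed = 0.0, p_fail
    for _ in range(max_retries):
        cost += p_still_failed * retry_cost   # a retry happens only if still failed
        p_still_failed *= p_fail              # the retry itself may also fail
    return cost + p_still_failed * escalation_cost

p = 0.01  # assumed per-attempt loss probability
print("mitigate :", mitigation_cost(p, retry_cost=1.0, max_retries=3, escalation_cost=1000.0))
print("propagate:", p * 50.0)  # assumed higher-layer recovery cost per failure
```

With these numbers local retries win easily; push the loss probability up and the balance shifts, which is exactly the tradeoff to quantify.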
Variability applies both to resources and to load, and its key aspect is correlation:
Positively correlated, e.g. by TV advert breaks
Negatively correlated, e.g. use of one part of the system precludes simultaneous use of another
Uncorrelated, basically a random effect.
Correlations can be externally generated or be a result of the operation of the system
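A small simulation sketch (invented load figures) shows why positive correlation matters: the same mean demand produces a far worse delay tail when arrivals are synchronised, e.g. by a TV advert break.

```python
import numpy as np

rng = np.random.default_rng(1)

def p99_backlog(arrivals, capacity=11):
    """Discrete-time queue: backlog carried between slots is a proxy for delay."""
    backlog, history = 0.0, []
    for a in arrivals:
        backlog = max(0.0, backlog + a - capacity)
        history.append(backlog)
    return np.quantile(history, 0.99)

n = 100_000
uncorrelated = rng.poisson(10, n)                       # mean load 10/slot, independent
correlated = np.repeat(rng.poisson(10, n // 100), 100)  # same mean, held for 100 slots
# (synchronised demand - think of a TV advert break)

print("uncorrelated p99 backlog:", p99_backlog(uncorrelated))
print("correlated   p99 backlog:", p99_backlog(correlated))
```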
We need to consider both the impact on individual outcomes and the impact on the ability of the rest of the system to deliver collective outcomes.
Once the core is understood, the rest is manageable with the right tools.
Need to support stages in the SDLC.
In Design:
Feasibility: can you deliver the outcomes with sufficient timeliness and acceptable use of resources?
Hierarchical decomposition
Acceptance criteria
Verification requires checking quantified outcomes, in a way that is ‘cheap’ enough to re-apply during the system lifetime.
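To sketch how hierarchical decomposition supports such ‘cheap’ checks, assume (as in the ∆Q approach) that independent sequential components compose by convolving their delay distributions; the component figures and the acceptance criterion below are invented.

```python
import numpy as np

# Each component ∆Q as a probability mass function over 1 ms delay slots
# (invented figures for two network hops and a server).
network_hop = np.array([0.0, 0.7, 0.2, 0.1])        # P(delay = 0..3 ms)
server      = np.array([0.0, 0.0, 0.5, 0.3, 0.2])   # P(delay = 0..4 ms)

# Sequential composition of independent components = convolution of their ∆Qs.
end_to_end = np.convolve(np.convolve(network_hop, network_hop), server)
cdf = np.cumsum(end_to_end)

# Hypothetical acceptance criterion from the QTA: 95% of outcomes within 8 ms.
print("P(delay <= 8 ms) =", round(cdf[8], 4),
      "->", "pass" if cdf[8] >= 0.95 else "performance hazard")
```

Because each component’s ∆Q can be measured or bounded separately, re-running a check like this over the system lifetime stays cheap.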
Looking at a more formal approach to managing cost/performance hazards – do the benefits and costs of this balance out?
There’s a push to use standard commodity infrastructure for safety/mission critical purposes – saves a lot of costs but also introduces risk. Need to be able to make a safety case! Virtualisation is coming in everywhere – what are the risks?
Case studies within the project show that getting intentions quantified can be hard; however, explaining that allowing some possibility of delay or failure can dramatically reduce delivery costs may encourage engagement.
Even functional verification can be considered ‘too expensive’.