Estimating Risk for Functionality
1) Count defects raised against requirement features (or widget / page functionality). The higher the number of defects, the greater the risk (and the greater the need for regression testing). However, this has to be read against other aspects: for example, more defects might simply reflect improved test coverage, in which case the reverse may apply.
2) Investigate code coverage using a code coverage tool (line/statement and decision/block), or look at the tests linked to requirements. Areas of low coverage will need additional tests and more regression testing.
3) Investigate the risk score. Estimate how likely an error is to occur, based upon the code's cyclomatic complexity, or simply ask Dev to score how difficult the feature is to code (High, Medium, Low or 3, 2, 1). Multiply that score by the impact upon the customer if the feature failed (obtain this from the requirements team as a similar score). A high product flags an area that will need to be included in any regression pack.
4) If code exists and tooling is present to display Halstead Metrics, then it may be possible to:
   a) Extract the Cyclomatic Complexity (C). Values of C above 10 present greater risk, so risk can be updated and reassessed as code is delivered.
   b) Calculate the Bug level (B). This indicates the number of bugs expected in code, based upon proven numeric analysis. While it can underestimate for C/C++ code (unlike Java), it can still help to contrast areas to identify greater risk.
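The risk score in step 3 can be sketched in a few lines of Python. This is only an illustration: the feature names, the individual scores and the threshold of 6 are assumptions made for the example. Only the 3/2/1 scale and the likelihood x impact product come from the slide.

```python
from dataclasses import dataclass

# High / Medium / Low mapped to 3 / 2 / 1, as on the slide.
SCALE = {"High": 3, "Medium": 2, "Low": 1}


@dataclass
class Feature:
    name: str         # hypothetical feature name
    likelihood: str   # Dev's score for how difficult the feature is to code
    impact: str       # requirements team's score for customer impact


def risk_score(feature):
    """Likelihood multiplied by impact, giving a value from 1 to 9."""
    return SCALE[feature.likelihood] * SCALE[feature.impact]


def regression_candidates(features, threshold=6):
    """Features at or above the (assumed) threshold, highest risk first."""
    flagged = [f for f in features if risk_score(f) >= threshold]
    return sorted(flagged, key=risk_score, reverse=True)


features = [
    Feature("Login page", "Medium", "High"),
    Feature("Report export", "Low", "Medium"),
    Feature("Payment widget", "High", "High"),
]

for feature in regression_candidates(features):
    print(feature.name, risk_score(feature))
```

The threshold would normally be agreed in the Test Policy rather than hard coded.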
Techniques for Deciding Sufficient Testing Done
The number of tests done can be a measure of completeness. However, if the test coverage is poor, then even trebling the number of tests, stepping over the same lines of code, will not remove the risk from the part of the code that remains untested. Hence a range of techniques needs to be used to define when to stop testing. These will need to be outlined within the Test Policy, so that the criteria for “Good Enough” are understood and accepted. With this acceptance comes an acceptable level of risk linked to the “Good Enough” decision.
Sufficient Testing Overview
This section looks at the techniques to decide if sufficient testing has been carried out. It provides a way of assessing risk if the product is delivered with no further testing. The techniques however are not suitable for safety critical systems, or where there is high commercial risk and every test case must be tested. They are more appropriate for an Agile delivery where risk is mitigated in line with a project budget.
02/01/2013
Deciding upon Sufficient Testing
Deciding upon the quality of testing is always going to be difficult. No matter how much testing you do, there is a risk that a number of defects will leak through. The decision is therefore not so much about sufficient testing, but sufficient risk mitigation.
The following text is general advice that has to be taken within the context of the project being delivered. If a project has a safety element or critical function then more testing will be required. If there are millions of pounds at stake in a critical business or banking system then again more testing will be required. For some customers there may be an urgent need to go live, but that has to be balanced against the company's reputation and future business if the product fails.
In looking at testing, the traditional approach is to look at the number of defects that are found and fixed. When the number of defects found reaches a plateau, a decision is taken whether or not to test more.
Why Estimate Defects
1. Comparing the number of defects expected with the number found gives an indication of the number left to find, and so the amount of effort needed until the end of the project.
2. A defect prediction, decremented by the number found each day or week, can provide valuable insight into the trend towards product completeness. A trend line (with a gradient and +/- error lines) can indicate when all defects will have been found, and so when the product will actually be delivered (within a range).
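The trend line in point 2 can be sketched as a least-squares fit through the remaining-defect count. The figures below (200 expected defects, 20 found per week) are made up for illustration, and the +/- error lines are not modelled here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_zero(expected_total, found_per_week):
    """Week number at which the trend line of remaining defects reaches zero.

    Returns None when the trend is flat or rising, i.e. no end is in sight.
    """
    remaining = []
    left = expected_total
    for found in found_per_week:
        left -= found              # decrement the prediction by defects found
        remaining.append(left)
    weeks = list(range(1, len(remaining) + 1))
    slope, intercept = fit_line(weeks, remaining)
    if slope >= 0:
        return None
    return -intercept / slope


# Illustrative figures only: 200 defects predicted, 20 found in each of 4 weeks.
print(weeks_until_zero(200, [20, 20, 20, 20]))
```

At least two weeks of data are needed before a line can be fitted.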
Defect Plateau for the Right Reasons – page 1 of 2
In reaching a plateau of rising and fixed defects, does the plateau truly reflect that all defects have been found? An alternative reason may be that the test cases are not exercising the code fully. While we may have a large number of test cases:
- Do our tests penetrate sufficiently within the code?
- Have we considered all appropriate boundary values?
- Have we considered combinations of values and code paths?
- Have we considered negative as well as positive test cases?
- Have we understood the requirements sufficiently to test all requirements at each logical statement?
- Were our tests derived from the requirements, or (not to be recommended) were the tests derived from the code or the programmer (who may have an incorrect understanding of the code)?
- Have we considered all security and performance issues?
Defect Plateau for the Right Reasons – page 2 of 2
We may believe that we have sufficient testing and have done additional exploratory testing, but was that exploratory testing targeted to minimise risk? That is, did we target the testing at our most vulnerable code? How did we decide where that vulnerability was, and how did we measure the risk level and mitigate the risk?
It is not uncommon to reach a plateau in the number of defects detected, then uncover a weakness, only to see the number of defects start to rise again to reach a further plateau.
This part of the presentation is about deciding if the plateau is reached at an expected level and if it represents an acceptable level of remaining risk.
How can you estimate the number of defects in code?
Rules of thumb:
- Based upon number of requirements.
- Based upon number of lines of code, or derivatives leading to that estimate.
- Past experience of a similar project.
Semi-quantitative:
- Number of Use Cases.
Quantitative approach:
- Halstead Metrics.
- Function Point Analysis / Test Point Analysis.
Test Estimate based upon Number of Requirements
Requirements are single logical statements. If a requirement calls in a specification, then a judgement needs to be made as to whether that specification should be treated as a large list of separate requirements that need to be added into the sum.
Assume 1 to 1.5 defects per requirement. This gives you a minimum and maximum value for this approach. So 655 requirements will give a range of 655 to 983 defects.
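The requirement-based range can be reproduced with a short calculation. This is a sketch: the function name is hypothetical, and rounding up to a whole defect is an assumption chosen to match the 983 quoted on the slide (655 x 1.5 = 982.5).

```python
import math


def defects_from_requirements(requirement_count, low_rate=1.0, high_rate=1.5):
    """Return (minimum, maximum) expected defects for a requirement count.

    The 1 to 1.5 defects per requirement come from the slide; rounding up
    to whole defects is an assumption.
    """
    low = math.ceil(requirement_count * low_rate)
    high = math.ceil(requirement_count * high_rate)
    return low, high


print(defects_from_requirements(655))
```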
Test Estimate based upon Number of Lines of Code
Assume that 10% of lines of code will have errors, so number of lines x 0.1 gives an estimate for defects. You can refine this as follows.
Typically 10% to 15% of defect fixes will be rejected. In industry this can be as high as 25% to 30%. However, a low level of failed defect fixes can mean good coding or poor testing, so take care. Taking a 10% margin and a 15% margin as your lower and upper limits gives a range. You can then add your adjustment for a 50% reduction at each iteration, so for N iterations this gives a minimum and maximum total for this estimation approach.
You can estimate the number of lines of code on the basis of the number of developers. Say 8 developers working for time T each produce 5 code modules of 50 lines of code: that is 8 x 5 x 50 = 2000 lines, and so 2000 x 0.1 = 200 defects. With 3 iterations in total and a 50% reduction at each iteration we have 200 + 100 + 50 = 350 defects. Now add in the 10% or 15% defect fix failure, which gives (350 + 35) to (350 + 53), a range of 385 to 403 defects.
You will also need to add in an estimate for existing code that is reused. Do not assume that this will be error free, since interfaces will have changed and potentially some data handling will have changed.
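The worked example above can be checked with a small script. It is a sketch under the slide's own assumptions (10% defect rate, 50% reduction per iteration, 10% to 15% failed fixes). The function names are hypothetical, and rounding up to whole defects is an assumption that matches the slide's 53.

```python
import math
from fractions import Fraction


def loc_defect_estimate(lines_of_code, iterations,
                        defect_rate=Fraction(1, 10), reduction=Fraction(1, 2)):
    """Total defects over all iterations, each iteration finding 50% fewer."""
    first_iteration = lines_of_code * defect_rate
    carried = 1 - reduction
    return sum(first_iteration * carried ** i for i in range(iterations))


def add_fix_failures(defects, low=Fraction(10, 100), high=Fraction(15, 100)):
    """Add the failed-fix margin, rounding up to whole defects (assumption)."""
    return math.ceil(defects * (1 + low)), math.ceil(defects * (1 + high))


# 8 developers x 5 modules each x 50 lines per module, as on the slide.
lines = 8 * 5 * 50
total = loc_defect_estimate(lines, iterations=3)
print(lines, total, add_fix_failures(total))
```

Fractions are used instead of floats so that the percentages stay exact.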
Test Estimate based upon Similar Project
A similar project may have given you a defect count of, say, 225 defects. However, you estimate that the difficulty of the new project is between 2 and 3 times greater. So an estimate might be between 450 and 675 defects respectively, depending upon difficulty.
Difficulty and complexity can be simply assessed to begin with. However, once code is written, the difficulty and complexity can be reassessed quantitatively using Halstead Metrics.
Test Estimate based upon Previous Phases or Code Drops
You may have run a number of code drops and typically expect an average of 10 defects per drop. However, you may be expecting a larger or a smaller delivery and decide to scale the expected defect level accordingly.
You may have done a risk (likelihood x impact) analysis and found that more defects are detected in high risk functionality. So you may decide to reference the risk analysis to refine your prediction of defect levels.
A simple method, however, is to expect that 10% of tests will find defects, assuming adequate test analysis.
The defect level prediction provides an indication of the amount of re-test that is required. It will not, however, account for the number of re-tests in which a defect fix fails. That can be estimated as 5% of all fixes.
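The simple method above (10% of tests find defects, 5% of fixes fail) can be sketched as follows. The figure of 400 planned tests is illustrative, and rounding up to whole defects is an assumption.

```python
def ceil_percent(value, percent):
    """Integer percentage of value, rounded up to a whole number."""
    return (value * percent + 99) // 100


def retest_estimate(planned_tests, defect_percent=10, failed_fix_percent=5):
    """Return (expected defects, expected failed fixes, total re-tests).

    Each defect needs one re-test; each failed fix needs a further re-test.
    """
    defects = ceil_percent(planned_tests, defect_percent)
    failed_fixes = ceil_percent(defects, failed_fix_percent)
    return defects, failed_fixes, defects + failed_fixes


# Illustrative figure only: 400 planned tests.
print(retest_estimate(400))
```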
Test Estimate Prediction - Reducing Error
We added in values for defect fix failure. Did we do this in every estimation approach, and was it at an appropriate level? Should we perhaps have considered 20%?
Are the developers more or less experienced in this project than in the previous project upon which our core estimate was based? If so, the level of complexity might need adjusting.
Did we count all requirements? Perhaps we decided not to include sub-requirements or technical requirements.
The more closely the different methods of prediction align, the narrower the percentage error of the prediction.
Code Coverage - page 1 of 3
Progress in testing may be thought of in terms of the number of test scripts successfully passed, but there are significant risks in this approach. It is important to check the value that testing contributes by checking the proportion of code paths covered by the test set. It may be difficult to check every single path, however a good code coverage tool can be deployed to monitor code coverage.
In monitoring code coverage, remember that 100% coverage by any single code coverage metric does not mean 100% of the code is covered. For full coverage a number of different code coverage metrics need to be deployed. 100% coverage at the following levels will provide 100% true coverage:
- Branch Condition Combination Coverage
- Linear Code Sequence and Jump (LCSAJ)
- Declare and Use Path Coverage (DU-Path)
Note that it can be extremely difficult to reach 100% coverage, since exception conditions may be difficult, and at times almost impossible, to introduce. So targets need to be realistic. Where testing is not possible, test statically with targeted reviews.
Code Coverage - page 2 of 3 Adapted from BCS Sigist working party draft 3.3, dated 28 April 1997, which is the basis of BS7925-2.
Code Coverage - page 3 of 3
Typically one would use a variety of coverage metrics and aim to achieve:
- 67% to 80% Line / Statement Coverage
- 70% to 90% Decision / Block Coverage
Remember however that 100% Decision / Block Coverage can be achieved simply by touching each block. This is not the spirit of what is required, and the effectiveness of block coverage needs to be read alongside the other code coverage metrics.
Investigate structural analysis to ensure that there are no unreferenced or discarded code portions present. Structural analysis tools, e.g. Coverity Prevent, can also help in spotting defects.
Iteration Approach
This works by measuring the time taken to test one iteration. We measure the number of open defects and see the level rise to a plateau, then fall as defects are fixed. An assumption is then made that the next iteration will have fewer defects, say 50% fewer. This is then used to predict the time to completion.
Problem with Iteration Approach The problem with an iteration approach is that it does not take account of unfound defects. If a plateau was reached due to poor testing or low coverage or even not exercising a specific function, then the first plateau should have been higher and so the estimate for the second iteration of testing needs to be adjusted to take into account the unfound defects. As a safeguard for a good, experienced test team, assume a further 10% of defects went undiscovered.
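The adjusted iteration forecast can be sketched as below. Two things here are assumptions rather than part of the slides: the 10% safeguard is applied by scaling up the first plateau before the 50% reduction is applied, and test time is taken to be proportional to the number of defects. The figures (a plateau of 200 defects reached in 20 days) are illustrative.

```python
from fractions import Fraction


def iteration_forecast(first_plateau, iterations,
                       reduction=Fraction(1, 2), undiscovered=Fraction(1, 10)):
    """Expected defects per iteration, starting from the adjusted first plateau."""
    adjusted_first = Fraction(first_plateau) * (1 + undiscovered)
    carried = 1 - reduction
    return [adjusted_first * carried ** i for i in range(iterations)]


def remaining_days(first_plateau, first_iteration_days, forecast):
    """Days for the iterations after the first, assuming time scales with defects."""
    days_per_defect = Fraction(first_iteration_days, first_plateau)
    return float(sum(forecast[1:]) * days_per_defect)


# Illustrative figures only: plateau of 200 defects, first iteration took 20 days.
forecast = iteration_forecast(200, iterations=3)
print([float(d) for d in forecast])
print(remaining_days(200, 20, forecast))
```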
Defect Tracking Tool – Graphical Output
Defect tracking and task management tools like Jira (with GreenHopper) can produce graphical representations of defect records, which can help to indicate time expectations. Adjustment is still needed, however, as previously discussed.
Halstead Metrics
This method is based upon tried and well proven quantitative approaches (in use since the 1970s). It is a way of refining an estimate, based upon qualities within code that can be physically measured. For defect estimation it will underestimate slightly for C and C++ (assume a 20% underestimate as general guidance). However, it will provide guidance and help to get a ball park figure. Other languages provide good levels of indication.
The main measures made are:
- Number of unique Operators n1 used in code (+, -, =, reserved, etc)
- Number of unique Operands n2 used in code (ID, Type, Count, etc)
- Total number of Operators (N1)
- Total number of Operands (N2)
NOTE: Code coverage and static analysis tools routinely provide values for Halstead Metrics; quite often the values just need to be extracted from the tooling deployed. It is simply a matter of knowing that the facility is there and, if a value cannot be extracted in one step, knowing how to create the value from the root metrics.
Halstead Calculations
Halstead Programme Length: $N = N_1 + N_2$
Where:
- $N_1$ = total number of Operators
- $N_2$ = total number of Operands
Halstead Programme Vocabulary: $n = n_1 + n_2$
Where:
- $n_1$ = number of unique Operators used in code (+, -, =, reserved, etc)
- $n_2$ = number of unique Operands used in code (ID, Type, Count, etc)
Halstead Code Difficulty & Effort
Halstead Code Difficulty: $D = \frac{n_1}{2} \times \frac{N_2}{n_2}$
Where:
- $n_1$ = number of unique Operators
- $n_2$ = number of unique Operands
- $N_2$ = total number of Operands
Halstead Code Effort: $E = V \times D$
Where:
- $V$ = Volume
- $D$ = Difficulty
Halstead Code Volume
Halstead Code Volume: $V = N \times \log_2 n$
Where:
- $N$ = Programme Length
- $n$ = Programme Vocabulary
Halstead Delivered Defects
Delivered Bugs: $B = \frac{E^{2/3}}{3000}$
Where:
- $E$ = Effort
The Delivered Bugs value is the total number of defects that would reasonably be expected to be delivered within a measured Volume of code, of measured Difficulty, taking a measured amount of Effort to recreate.
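Where tooling only reports the four root counts, the derived Halstead values can be calculated directly. The sketch below follows the formulas on the preceding slides. The function name and the example counts are hypothetical, and no adjustment is made for the C/C++ underestimate.

```python
import math


def halstead_metrics(n1, n2, N1, N2):
    """Derive the Halstead values from the four root counts.

    n1, n2: unique operators and unique operands.
    N1, N2: total operators and total operands.
    """
    length = N1 + N2                          # N = N1 + N2
    vocabulary = n1 + n2                      # n = n1 + n2
    volume = length * math.log2(vocabulary)   # V = N x log2(n)
    difficulty = (n1 / 2) * (N2 / n2)         # D = (n1 / 2) x (N2 / n2)
    effort = volume * difficulty              # E = V x D
    bugs = effort ** (2 / 3) / 3000           # B = E^(2/3) / 3000
    return {
        "length": length,
        "vocabulary": vocabulary,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "bugs": bugs,
    }


# Hypothetical counts for one small module (illustrative values only).
metrics = halstead_metrics(n1=10, n2=6, N1=30, N2=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Summing the bug value across modules gives a figure that can be contrasted between areas of code, which is how the value is used for risk above.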
Cyclomatic Complexity
Cyclomatic Complexity (C) is a way of measuring the level of risk associated with code through the number of code paths, and so the need to run more test scenarios. Cyclomatic Complexity can also be used to compare code modules, and can help to indicate the areas of code that perhaps need more review effort, more testing and greater attention to general risk. As a general rule, a value for C of 10 or above indicates higher risk for a code module.
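Most static analysis tools report C directly. Where none is available, a rough approximation for Python source is to count decision points and add one, as sketched below with the standard `ast` module. This is a simplified count written for illustration, not the algorithm of any particular tool. The sample function is hypothetical, and passing a whole file lumps all of its functions into one figure.

```python
import ast

# Node types that each add one extra path through the code.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Approximate C as (number of decision points) + 1."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a or b or c" adds two extra paths.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the loop, plus one per "if" filter.
            complexity += 1 + len(node.ifs)
    return complexity


SAMPLE = '''
def classify(value, limit):
    if value < 0 or limit < 0:
        return "invalid"
    for step in range(limit):
        if step == value:
            return "found"
    return "missing"
'''

score = cyclomatic_complexity(SAMPLE)
print(score, "higher risk" if score >= 10 else "lower risk")
```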
Function Point Analysis
A further tool that can be used for estimation is Function Point (FP) Analysis. This is a formula-based approach that takes into account the development language, the number of requirements, the inputs and outputs, and the experience of the development and test teams. Explanation of FP is out of scope of this slide set.
System Integration Testing
Your application will need to sit on a system with other software. This will include other applications, management systems (e.g. enterprise monitoring) and monitoring software such as security protective monitoring. All interfaces will need to be tested. This assumes that the other products are supplied by 3rd parties as fully tested.
Your target risks
In SIT you need to establish which risks you are testing to protect against. Typically you will need to:
- Spend focused time on protective monitoring interface tests.
- Spend time checking enterprise monitoring.
- Test any PKI certificates, revoked certificates, invalid certificates, etc. You need to budget for the certificates and have them ready.
- Test application interfaces.
- Test the creation of users and licences.
- Test system access, valid and invalid.
- Ensure you can test performance. Note: this is covered in another slide set.
Estimating Testing for Protective Monitoring Testing Large systems will have software that will monitor the system for unauthorised activity. This monitoring also needs to be tested.
Overview
Testing of Protective Monitoring (PM) is relatively new. There is no reliable guidance as yet for estimating testing. The level of PM testing needs to be defined within the Test Policy for the project.
Learning For a team with no previous PM testing experience allow 2 weeks learning time and provide budgeted training and support for that period.
Overview of PM Estimating
The best baseline for an estimate is to take the CESG Good Practice Guide 13 (GPG13) as the standard for implementation of the solution. There are 12 protective monitoring controls (PMCs) that would require testing, but the breakdown into test cases is dependent on the system architecture, the boundary definitions, the level of alerting defined, the verbosity of logging and how the overall design of the security model influences the system implementation. GPG13 is a guide and as such is open to interpretation; however, one should at least plan for the following:
Target Protective Monitoring Controls
PMC 1 - Accurate time in logs
PMC 2 - Recording of business traffic crossing a boundary
PMC 3 - Recording relating to suspicious activity at the boundary
PMC 4 - Recording on internal workstation, server or device
PMC 5 - Recording relating to suspicious internal network activity
PMC 6 - Recording relating to network connections
PMC 7 - Recording on session activity by user and workstation
PMC 8 - Recording on data backup status
PMC 9 - Alerting critical events
PMC 10 - Reporting on the status of the audit system
PMC 11 - Production of sanitised and statistical management reports
PMC 12 - Providing a legal framework for Protective Monitoring activities
Information over syslog or to Windows event logs
Note: If the software produced writes information over syslog or to the Windows event logs, then each entry it writes needs to be checked against the PMC it relates to. Standard operating system events also need to be tested.
Totals for Scripts
- Research time to identify events and scripts: typically 10 days.
- Number of scripts required = number of events written to logs by software + reported hardware events (typically 140).
- Time to write a PM test plan, having defined the detailed scripts: around 3 to 5 days of effort minimum (typically 70 pages), plus review time (typically a full day).
- Time to write the test scripts = number of scripts x 5 minutes per script.
- Time to execute the tests = number of scripts x 15 minutes per script.
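The totals above can be rolled up into an overall effort figure. The sketch below uses the typical values from the slide as defaults. The function name is hypothetical, and the 8 hour working day used to convert minutes into days is an assumption.

```python
def pm_test_effort(script_count=140, write_minutes=5, execute_minutes=15,
                   research_days=10, plan_days=(3, 5), review_days=1,
                   hours_per_day=8):
    """Roll up protective monitoring test effort from the typical values.

    hours_per_day is an assumption used only to convert minutes into days.
    """
    writing_minutes = script_count * write_minutes
    execution_minutes = script_count * execute_minutes
    script_days = (writing_minutes + execution_minutes) / 60 / hours_per_day
    fixed_low = research_days + plan_days[0] + review_days
    fixed_high = research_days + plan_days[1] + review_days
    return {
        "writing_minutes": writing_minutes,
        "execution_minutes": execution_minutes,
        "total_days_low": fixed_low + script_days,
        "total_days_high": fixed_high + script_days,
    }


effort = pm_test_effort()
print(effort["writing_minutes"], effort["execution_minutes"])
print(round(effort["total_days_low"], 2), round(effort["total_days_high"], 2))
```

Learning time for a team new to PM testing and any re-test effort are not included and would need to be added separately.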
Mobile Testing
This section looks at testing mobile devices and personal digital assistants (PDAs).
Deciding your target test set You will need to consider which operating systems you plan to test, which devices and if you are going to test on real devices, hardware emulators or software emulators.
Emulation vs Real Device
Real devices will find defects that cannot be found on emulated devices. Emulators will also be sensitive to defects that may not be so simple to find on a real device. Emulation can be quick to set up, while testing on real devices can be slow. It may be worth considering cloud solutions for obtaining remote access to a wide range of devices. Emulators may not be available for some new devices.
Which devices
Create a table of the operating systems you want to test. Create a list of the most popular devices currently in use and the recent releases. Ask your marketing department to help create the list, or check on the web and with phone providers. Decide which devices offer best value for targeting your primary tests, then decide which devices you will run a subset of tests on.
You might decide to run most tests on emulation; however, consider memory, CPU and other hardware related issues that may impact a test when using real devices. For example, failure to delete an item correctly might cause memory to fill up rapidly, so running single tests in isolation may not be appropriate. Try to stretch the infrastructure of the device.
PDA Decisions Having chosen your approach for mobile access, you will need to estimate the number of tests, regression runs and any time to set up the infrastructure. If you are operating within a secure environment you will need to talk to your IT security adviser in setting up appropriate access for testing. This will all need to be budgeted.
Acceptance Testing
You are likely to go through a number of phases for acceptance. Typically this may include:
- Testing of application wire frames.
- Factory Acceptance Testing (FAT) of the application.
- System Acceptance Testing (SAT) of the integrated system (tested as SIT).
- User Acceptance Testing (UAT).
- Operational Acceptance Testing (OAT), showing operational readiness, disaster recovery, etc.
FAT
FAT will have entry criteria based on the functionality delivered and the defects outstanding, with permitted limits for each severity type. FAT is to demonstrate the functionality of the developed application against functional requirements. It might also include some early verification of non-functional readiness. FAT will probably use end to end test cases that map to functional requirements. If using delivery iterations, there may be a FAT for each iteration or group of iterations.
UAT
User Acceptance Testing is usually planned and carried out by the customer or, for an internal project, by a group of users acting as the customer. If you need to plan and carry out UAT, you might be able to save time by using end to end functional tests as the basis for UAT planning.
SAT System Acceptance Testing focuses on the application functionality within the system, demonstrating that the application interacts with the system as a whole. This will include evidence of any monitoring software and interfaces. Penetration testing and Load and Performance Testing may be part of the SAT results. Some SAT activities might be moved to OAT and some OAT activities might get moved to SAT.
OAT
Operational or Readiness Acceptance Testing focuses on the complete system. It will take account of items such as:
- Backup / restore
- Disaster recovery
- Fault reporting and resolution life cycle
- Day time and after hours support
Items to consider in OAT
The following should be considered in OAT and time allowed:
- Each site to be tested.
- Support processes and escalation.
- Issue logging and reporting.
- Environment monitoring.
- Protective monitoring.
- Data backup and recovery.
- Deployment of new builds and recovery.
OAT Planning
Create flow charts for all your support processes, with out of hours activities shown separately, and incorporate them within an OAT plan. Consider the setting up and delivery of pagers for out of hours support, and the setting up of pagers for testing. Tooling may need setting up for cross reporting between application support teams and environment support teams.