HCLT Whitepaper: Landmines of Software Testing Metrics


Published on http://www.hcltech.com/enterprise-transformation-services/overview




1. Abstract

It is not only desirable but also necessary to assess the quality of testing being delivered by a vendor. Specific to software testing, there are some discerning metrics that one can look at; however, it must be kept in mind that there are multiple factors that affect these metrics which are not necessarily under the control of the testing team. The SLAs for testing initiatives can, and should, only be committed after a detailed understanding of the customer’s IT organization in terms of culture and process maturity, and after analyzing the various trends among these metrics. This white paper lists some of the popular testing metrics and the factors one must keep in mind while reading into their values.

2. Introduction

This white paper discusses some of the popular metrics for testing outsourcing engagements and the factors one must keep in mind while looking at the values of these metrics.

Metric 1. Residual defects after a testing stage

Definition
The absolute number of defects that are detected after the testing stage (owned by the vendor’s testing team). The lower the number of defects found after the current testing stage, the better the quality of testing is considered.

Factors to consider

Quality of requirements
Ambiguity in the requirements results in misinterpretations and misunderstandings, leading to ineffectiveness in defect detection. The clearer the requirements, the higher the chances of the testing team understanding them right and hence noticing any deviations or defects in the system under test (SUT).

Quality of development
The planning for testing is usually done with an assumption that the system will be thoroughly unit tested prior to handing it to the testing team. However, if the quality of the development process is poor and the unit testing is not thoroughly done, the testing team is likely to encounter more unit-level defects and might be
pausing their testing for the defects (even in the fundamental processes) to get fixed, and hence wouldn’t be able to devote all of their time to looking for functional-level, system-level, or business-level defects.

Availability of Business users for reviews and clarifications
If the Business users are not available to answer queries in time, the testing team is likely to struggle to establish the intended behavior of the system under test (whether a behavior is a defect or not, etc.); hence some amount of productivity would be lost, and more defects are likely to remain by the time the testing window is used up.

Incomplete test cycles - delay in defect fixes, higher defect fixes, environment availability
The estimates and planning for testing are based on certain assumptions and available historical data. However, if there are more disruptions (than anticipated) to testing in terms of environment unavailability, or a higher number of defects being found and fixed, the quality time available for testing the system would be less, and hence more defects slip through the testing stage.

Metric 2. Test effectiveness of a testing stage (or) Defect Containment effectiveness

Definition
The % of system-level defects that are detected within the testing stage (owned by the vendor’s testing team), rather than slipping through to a later testing stage or to Production. The higher the effectiveness, the lower the chances of defects being found downstream.

Test Stage Effectiveness =
        (Defects detected in the current testing stage)
-------------------------------------------------------------------------
(Defects detected in the current testing stage + defects detected in all subsequent stages)

Factors to consider

Availability of accurate defects data at all stages
We must ensure that the data on defects in all subsequent stages is also available and accurate. Production defects are usually handled by a separate Production support team, and the testing team is
at times not given much insight into this data. Also, since multiple Projects and/or Programs would be going live one after another, there are usually challenges in identifying which defects in Production can be attributed to which Project or Program. Inaccuracies in assignment would lead to an inaccurate measure of test stage effectiveness.

All the factors listed above for ‘Residual defects’ also apply to this metric.

Metric 3. % improvement in test case productivity

Definitions
Test case productivity = # of test cases developed per person per day (or per hour)

% Improvement =
(Test case productivity in current year - last year)
---------------------------------------------------- x 100
      (Test case productivity last year)

Factors to consider

Nature of changes
The nature of changes that the system goes through might not necessarily be comparable all the time. Depending on the nature of changes, the # of test cases required to test a unit of development effort, and the amount of investigation/analysis required prior to deciding or documenting the test cases, would differ drastically.

Test case definition
The very measurement of a ‘test case’ itself could be a challenge. A test case can range from very simple to very complex depending on the specifics of the test objective. Hence a mere count of the test cases might not reflect the actual effort put into testing a particular change or functionality; we must consider the complexity of the test cases as well. Also, different teams might be following different levels of documentation for ‘a test case’. For example, a test case with 12 conditions might be considered a single test case by one team, while another may split it into 12 separate test cases with one condition each. Normalizing the ‘unit’ of a test case is very critical to get this metric representing the real picture.

Ambiguity on effort categorization
Some people might consider test data set-up as part of test case
development effort, while some consider it part of test execution and set-up. Different people following different notations would lead to erroneous values, and the data might not be comparable.

Effectiveness of test cases
Testers might be incorrectly motivated towards creating ‘lots’ of test cases in less time, rather than taking the time to think through the changes and requirements and come up with good test cases (even if they are only a few) that are likely to find defects, or are likely to give the Business people more comfort/confidence.

Increase in experience/SME
Over a period of time a testing resource is likely to become more knowledgeable about the SUT. Due to this, he/she would be in a position to better anticipate which test cases are likely to find a defect, and thus might cut down on test cases that are NOT likely to find defects.

Change in Testing Resources
Whenever a resource is replaced by another, it is likely that the new resource would take more time for the analysis needed to write the test cases. The higher the number of changes, the lower the test case productivity (and, in a way, the test case effectiveness).

Metric 4. % reduction in testing cost

Definitions
(Testing cost per unit development effort in current year - last year)
---------------------------------------------------------------------- x 100
      (Testing cost per unit development effort last year)

Factors to consider

Lack of accurate measurement of development effort
This metric heavily depends on the measurement of actual development effort, which might not be accurate. The number of projects that have formal measurement units such as FP (function points) is relatively few.

Testing effort variance with development effort
Testing effort might not always be directly proportional to the development effort. For example, a slight modification (with very small development effort) to a legacy application might incur a lot of testing effort, factoring in the regression testing etc. Similarly,
there may be a lot of structural changes to the existing code with high development effort (for performance enhancement or some other refactoring need), but the (black box) testing effort might not be as high if the end functionality is not undergoing a drastic change.

Scope of outsourcing
When we engage with a customer for testing projects, we might start with a small set of applications, or modules within them. However, over a period of time we would have started servicing more applications or modules, and the timesheet entries might not be granular enough to distinguish between efforts for different modules. After some point it might be difficult to extract the actual effort spent on a particular module, and hence to calculate the reduction in effort for the same module.

Sharing of resources
Along the same lines as above, it might be difficult to extract information on which resource spent how much time on which application, as they might be working on different modules interchangeably.

Projects’ complexity not comparable
It is also common to encounter projects with complexities widely differing from the projects that were used for baselining. Hence comparing the data between these types of projects might not give the real picture.

Metric 5. % Automated

Definitions
% of test cases automated =
(# of test cases automated)
---------------------------- x 100
  (Total # of test cases)

Factors to consider

Business case
Not every application, and at times not every module within an application (and even not every test case within that module), has a business case for automation; they might continue to be tested manually for multiple business reasons (e.g. possible replacement of the application with a COTS package in the next 6 months).
Feasibility of automation
Depending on the technology constraints and the suitability of the tools that are available, some parts of an application might not be technically feasible to automate even if we wanted/needed to.

Scope
It is advisable to measure % of automation under the revised scope: “For the modules/applications that are known to give good RoI and are technically feasible - what % is automated”.

Resource allocation
Industry surveys reveal that over 60% of automation projects are not successful, and the major causes for this are: automation attempted on an ad-hoc basis, and people not dedicated to automation. People also underestimate the need for effective on-going maintenance after the initial test bed is automated. If automation is approached immaturely, one can expect disruptions in the build and maintenance of automation due to lack of constant focus and supporting skills/resources.

Room for exploratory testing
At times it may be desirable to give allowance for a bit of exploratory testing, to check whether we can detect any anomalies that could not be detected by regular testing techniques. Usually in exploratory testing, test execution is attempted along the lines of a few test objectives. This is then followed by documentation of the attempts and results. This could lead to the addition of a few test cases which might not necessarily be automated, but which are desired to ensure more probing into the application behavior.

Metric 6. Requirements coverage

Definitions
% of requirements that are covered by test cases =
(# of requirements covered by at least 1 test case)
--------------------------------------------------- x 100
         (Total # of requirements)

Factors to consider

Format of requirements, traceability matrix
Not every team might be following standard notations while documenting the requirements, and hence one can expect challenges in mapping test cases to free-format requirements.
Legacy systems with no documented requirements
Most legacy systems might not have any documentation on the basic functionality, thus making the reference point difficult to establish.

Configuration Management
The team may not have access to a configuration management tool that enables keeping the test cases systematically mapped to changing requirements. The challenge lies in managing not only the version of the requirements but also the corresponding version of the test cases.

Metric 7. Test case effectiveness

Definitions
(# of test cases that detected defects)
---------------------------------------
        (Total # of test cases)

Factors to consider

Stability of application
As the application or SUT stabilizes over a period of time, the number of defects in the system goes down, and it requires more effort (test cases) to find the remaining defects.

Risk
Where people over-emphasize this metric and refrain from writing more test cases, there is a risk of not detecting some defects we could otherwise find (thus resulting in poor test stage effectiveness). In order to reduce this risk, people might try to write more and more test cases even though the possibility of finding a defect might be low.

Metric 8. Defect Rejection Ratio

A defect initially raised by a tester could later be rejected for multiple reasons, and the main objective of having this metric is to ensure that testers correctly understand the application/requirements and do their groundwork well before everyone’s time and attention is invested in solving the defect. Too many defects being rejected results in inefficiency (due to time and effort spent on something that wasn’t a problem).
Definitions
(Defects rejected as invalid)
------------------------------
(Total no. of defects raised)

Factors to consider

Cause of rejection
Lack of application knowledge is usually the cause of rejected defects. However, there could be other reasons as well, such as misinterpretations, changes in the environments, and defects becoming non-reproducible. Hence we need to take into consideration the causes of rejection in order to correctly understand the trends.

Defect life cycle culture
In some teams a defect is initially raised in the system as ‘New’, followed by a discussion, and then ‘Rejected’ off the system if it is not considered a defect. However, in other teams, discussions are held between testers and other stakeholders such as Business and Development on the ‘possible’ defects, and a defect is entered into the system only after it is confirmed during the discussions. Hence there won’t be a rejection at all.

Fear of risk
In some cases it might not be very straightforward to decide whether a behavior is a defect or not, and one has to choose between risking a defect rejection but covering the ‘defect slippage’, and not raising the defect (lower defect rejection) but allowing higher defect slippage into the next stage. Usually people are more concerned about the defect slippage, and hence take the route of raising defects even if it means higher defect rejection.

Quality of requirements
The higher the quality of requirements, the higher the chances of eliminating misunderstanding, and the lower the defect rejection rate. Even if the defect rejection rate is higher, it will at least help establish that the tester’s knowledge was not sufficient and that the requirements could not be blamed.

Control of environment
At times the behavior of the system changes between the time a defect is raised and the time the verification of the defect is done (‘it didn’t work when I raised the defect but now it is working. Don’t know
how.’). If the environment is undergoing changes without the notice/control of the testing team, it would be hard to establish the cause of the defect rejection.

Metric 9. Currency of knowledge database

Definitions
% of application knowledge that is documented =
       (Knowledge level documented)
------------------------------------------------
(Total knowledge of the application or module)

Factors to consider

Quantifying ‘Knowledge’
The parameter ‘Knowledge’ is hard to quantify and is mostly a qualitative measure. At times people do graduate their applications/modules into a list of functions/transactions/flows, and the documentation for these is measured relatively objectively. However, not every module might be conducive to this kind of graduation, and care must be taken to ensure that some sort of classification is possible before we attempt to measure this metric.

Depth of details expected in documentation
A functionality/transaction can be documented at different levels of depth - the deeper the documentation, the higher the time involved, ranging from a few minutes to a few days. We must ensure that the expectations on ‘depth’ are understood correctly before we commit to a certain SLA on this metric.

Frequency of changes
During estimation of documentation efforts, a certain amount of change to the baseline requirements/functionality would be factored in. However, if the changes are much higher than estimated, this has an obvious impact on the actual effort for documentation.

Expectation on Time of Documentation
It is also important to know the actual ‘point’ in the life cycle at which the completed documentation is expected. If the documentation is expected to be reviewed/refined at the end of each project (i.e. by phase, and not by a duration such as 3 months or 6 months), then it is very likely that people do not attempt to document things during the middle of a project, thus avoiding any rework (due to changes in
requirements). However, if the expectation is that the documentation is updated every 3 months (or even 6 months), it is possible that the project is going through some changes in requirements, and hence one should anticipate rework in documentation.

Multiple Projects on the same application/module
It is also likely that the same application might be undergoing changes by multiple projects with a little time gap between each. This brings in all the complexity that applies to configuration management of the ‘code’. The documentation must reflect the changes with respect to the Projects, so that people on other projects also know where to apply their changes and by how much. In the absence of a configuration management tool, the documentation might be a little difficult to handle without leading to confusion/rework.

Planning for documentation effort
Testing usually encounters tight schedules, with racing against time becoming the norm. Due to this, whether documentation of application knowledge is a paid service or a value addition, one cannot achieve it without factoring in the time and effort (for documentation) during Project planning.

3. Conclusion

There are some discerning metrics for testing engagements which can be considered for drafting SLAs. However, for each metric one must take into consideration the factors (including other metrics) that influence the value of the metric, and the scope of the testing team in controlling those factors. This white paper lists some of the popular metrics and discusses the factors that affect each of them.

One shouldn’t underestimate the fact that it usually takes some time (a few months to over a year) to establish consistent trends on these metrics. It is thus recommended that enough time is given to assess the trends of these metrics before an SLA is worked out for the engagement.

Where applicable, ‘assumptions’ must clearly be documented to indicate/reflect the influencing factors on the SLAs being committed.
Hello there. I am from HCL Technologies. We work behind the scenes, helping our customers to shift paradigms and start revolutions. We use digital engineering to build superhuman capabilities. We make sure that the rate of progress far exceeds the price. And right now, 59,000 of us bright sparks are busy developing solutions for 500 customers in 20 countries across the world.

How can I help you? transform@hcl.in