When you’re on a long project – 6 months, a year or longer, we need someway to gauge these things.Developing software is a complex system that is mostly intangible. So we use these measurements as a window into that world. What’s going on here? When will we be done? What’s our quality like? Etc.It’s human nature to explain things we can’t see.
What do you think about this metric? Actually it’s a really bad one – there’s correlation/causation errors going on, and overall “project success” is way too complicated a system to judge based on one metric.Chaos Report from 1995 to 2010: project success rate goes from 16% to 30%
In waterfall we need gauges and indicators and ways to predict the future, because it’s scary to be on a project with a really long time horizon.“We all have a need to understand, we all get anxious when we don’t, we all look for ways to explain things that aren’t easy to explain. That’s what these metrics do. And, if you’re the owner of the project, your butt is on the line, so all the more to be anxious about, all the more to try to make the intangible tangible (which is what I think of software development – until you see a product, it is an intangible and intangibles are scary).” –LL
So before we talk about how Agile responds to that, let’s look a bit at how we operate as humans, and how metrics can effect our behavior.This section has one slide of theory and one real life example.
“There are so many possible measures in a software process that a random selection of metrics will not likely turn up something of value” – Watts Humphrey Metrics used in isolation probably don’t measure what you think they do.-System is more complex than this. We’re probably not ever going to be able to measure enough to give us a simple indicator of the system. - Isolated metrics entice people to draw system wide conclusions.-> Primary/Secondary MetricBeware long hanging fruit. Also, old literature praises low hanging fruit!-> Just because we can measure something easily doesn’t actually mean it’s meaningful.
Ask: Does everyone agree this is a easy to gather metric? What is this metric really telling us? Stakeholders: “How come we have less tests than a few sprints ago? That can’t be right. We must not be testing enough.” Stakeholders: “On my last project we had thousands of tests, why are there only a couple hundred? That can’t be right, we must not be testing enough, I bet this thing is littered with bugs.”This is an example of things that are easy to measure, and things measured in isolation. The system – the software development machine – is far too complex to be making broad quality statements based on such isolated measurements. But we’re so used to doing that. So you can start to see that some traditional metrics might not really fit the bill. Let’s go on
Explain Hawthorne Experiment. Select group of workers old they were being studied, and their productivity changed. All the researchers did was minutely change the lighting levels.For example, measuring test pass/fail status always causes pass percentage to rise. But it is an artificial rise, due to people not wanting to fail tests or splitting up tests into smaller and smaller units to drive the percentage calculation up (which is just creating waste). Also called demand characteristics: refers to an experimental artifact where participants form an interpretation of the experiment's purpose and unconsciously change their behavior to fit that interpretation
I’ve changed the exact wording here to protect the innocent. But here’s a good real life example. Read these two statements and think about what may have changed in the time between these two statement.
Robert Austin. Measuring and Managing Performance in Organization. Nucor Steel. Based plant managers salaries on productivity – of ALL plants, not just theirs.The obvious example here is defect counts.Edward Demming, the noted quality expert, insisted that most quality defects are not caused by individuals, but by management systems that make error-free performance all but impossible.Eh… Attributing defects to individuals does little to address the systemic causes of defects, and placing blame on individuals when the problem is systematic perpetuates the problem. By aggregating defect counts into an informational measurement, and hiding individual performance measurements, it becomes easier to address the root causes of defects. If an entire development team, testers and developers alike, feel responsible for the defects, then testers will tend to become involved earlier and provide more timely and useful feedback to developers. Defects caused by code integration will become everyone’s problem, not just the unlucky person who wrote the last bit of code.
So what do you think happened here? What was the result?Perhaps their intense focus on defects per person, lead to no focus on the customer… perhaps not, it’s too complex to really tell, but the point is that they are probably measuring too low. Are defects-per-person-hour really important to your goal? Probably not. Measure one level up, maybe defects reported by customers…
So we’re at the point where we know that waterfall feels risky, and we know there are some behavioral aspects to metrics that we need to consider.
Agile takes all the worry and all that risk and packages it up into cute little time boxes. Agile inherently limits risk. Even if one of these boxes explode, the project isn’t a failure. And every few weeks we produce a valuable increment of product, we have the chance to inspect it and adapt our approach, reprioritize, replan etc. Managers no longer need to be worried about and have this anxiety over predicting project performance over months and months. We have real tangible results every few weeks. We can inspect it and determine the ACTUAL characteristics of the product that we used to use metrics to try to get at. Agile Projects inherently limit riskTime Boxes, WIP, DoD, AC, fast feedback(lead in) So that’s nice, but how do you define quality on this increment and on the product as a whole?
Two ways. In on any single increment we use the above mindset. These are not strict equations, I’m not doing any math here, it’s just a way to think about quality in the agile world. DoD: Shared definition among the team of what “done” means. Typically you see things like coding standards, unit test coverage, tests pass, deployable, reviewed, etc. Every piece of work must adhere to the DoD.AC: Product Owners business-language criteria for how a specific piece of work must function. Sometime written in the GIVEN-WHEN-THEN format, a practice associated with ATDD. So as we string increments of working software together, how do we get at the quality of the product? We use the mindset at the bottom for this.On the product level, it’s no longer so much about defining quality in a quantitative sense as it is about having a development process that can easily react to change. React to negative customer feedback as well as suggestions for new features and what’s most important to the customer at the moment.Stakeholders that don't show up at the Sprint Review will still be nervous, and rightly so. The corollary is: every time a manager/stakeholder/etc. asks for a report, instead of giving it to them stress the importance of showing up at the Sprint Review.
You have clear development principles that help limit risk (DoD) (verification) and clear business objectives that help limit risk (Acceptance Criteria ) (validation). This ensures some base level of quality in your product, and then through frequent stakeholder and customer feedback, we ensure ongoing quality and value of product. Our chief metric in scrum is working software. That said, what other metrics do we need? Right?
Agile does indeed negate the need for many traditional metrics. It certainly helps make the complex rather intangible process of developing software a bit more tangible – one increment at a time.I do suggest starting here. It’s less dangerous than starting with metrics carried over form waterfall. Rip it all down and build it back up.But there are some useful metrics that we could use, so to set that context…
In his 14 Points, Deming said “Eliminate management by numbers and numerical goals. Instead substitute with leadership.” The more we rely on metrics to tell us what happened, the more we distance ourselves from the actual work being done. We realize that measuring a system as complex as the software development machine, doesn’t really provide understanding, just data. Sometimes bad data, sometimes good data. And we realize that the obvious answer isn’t always right – like blaming bad developers for buggy products – “it must be the developers” – we respect that there is likely more going on in the system than any one root cause of anything. Further, if we use metrics the wrong way, we build games and systems that reward paying attention to the metric and not the success of the company.Overall we believe that being agile is important to the goal – our goal being making really good software products that have high value and delight customers. So we will use metrics that help us be agile. That encourage us to embrace lean and XP and good development practices.
Trends over static numbers: tear the labels off the y axisIs this setting up stakeholders to draw a system conclusion based on an isolated metric?No single prescription – figure out what makes sense for you. Take these considerations into account. We’ll go over a bunch of possible metrics next, but I’m not advocating a simple recipe for anyone. I’m certainly not saying you have to use all of these.
Our chief metric is working software. Did we get to the end of the sprint and have potentially shippable product? How do you measure this? A simple thumbs up or thumbs down. Get everyone in a room and do it. Not good enough? Then document it. We keep a running go/nogo document.Why not just do this in waterfall? Get everyone in a room at after a year long project and give it the thumbs up? Well in some sense you do – we often ignore all those other metrics we’ve spent so long gathering. We rationalize sev 1’s down to 2’s, etc. In agile you can do this more safely because YOU HAVE CONTEXT. You have really good context and memory within a timebox. The risk is limited.
Indicates team progress. A way to visualize what’s done and what’s WIP and what’s left to do. Tool to use to see when we’ll be DONE with a particular chunk of value.Don’t like hours? Don’t want a graph? Fine: use a task board, count tasks, stories-to-done, whatever. It’s just a tool so that you as a team know how work is progressing, and can visualize that and discuss it as a team.If it’s not given to management, there is little risk of negative hawthorne effect or gaming.
Not individualsNo comparing across teamsNot really for management, certainly not for incentives (risk of gaming)
Helps the business know when a larger chuck of functionality might be DONE. Not really part of scrum but also something you usually can’t get away without doing. At least this method of planning is based on empirical evidence of past sprints velocity and what’s actually on the backlog now, and also look at the cone of uncertainty there – we’re not promising a date, we’re just giving a forecast as accurately as we can while still being able to sleep at night.Increments are great, and this tells us when enough increments put together will satisfy some large business objective.
Unit Testing is a great development practice. If we measure it, we just might encourage that behavior. Pick a Target, Should never go down
Don’t discourage check-ins by making this visible at too high a level. Individuals need fast feedback, and sometimes teams can use this in-good-spirits, but it can start to deter checkins
Beware the “math” on this one – as software matures and ceases to change, this percentage approaches 0. But 0 in any one sprint indicates a problem. Rapid fluxuation might indicate some churn our lack of vision around testing (or churn in the software)
Etsy – optimize everything for employee happinesshttp://happiily.comEncourages self-awarenessLeading indicatorConfidence? When you check in or move something to doneScale is 1-5. We measure this continuously through a live Google Spreadsheet. People update it approximately once per month.Here are the columns:NameHow happy are you with Crisp? (scale 1-5)Last update of this row (timestamp)What feels best right now?What feels worst right now?What would increase your happiness index?Other comments
Running Tested Features – XP practicePositive Hawthorne Effect: We want to deliver more value (but beware gaming – you still have to be DONE)Measures up: delivered value for the product (not single team or individual)Little’s Law: queue size ~ queue time
Here’s one for your metric walls – what are the top three most common customer complaints. Or the three hottest issues right now? Post these on a wall where everyone can see them.
Size of bubbles are TCO (total cost of ownershipHopefully in a single project you are up in the magic quadrantAcross a program/product there might be some things in other places – “have to do’s” compliance and legal stuff…
For a single feature, you can also drill down one level and look at the number of times per day/week/month a user uses it, and the amount of time spent using the feature.Why measure this? Are we building the right features? Is a bug in feature “C” more critical than a bug in feature “I”? Feature “K” may have more maintenance costs than value – consider dropping it.
Where is there waste in the system?What’s the best time for a nominal task?
Erik Weber @erikjweber Slidesha.re/AgileMAgile MetricsOr: How I Learned toStop Worrying andLove Agile
ABOUT CENTARE Agile/ALM Mobile Cloud Microsoft 2011 Partner of the Year Finalist ALM Gold Competency Azure Circle / Cloud Accelerate Apple / Java / Scrum iOS iPhone/iPad/Android Scrum.org Partner Certified Professional Scrum Trainers
BackgroundAGENDA Why metrics? The Psychology of Metrics Agile Response Examples of Agile Metrics Sources
ABOUT MEWork Stuff Me Stuff Healthcare, Finance, Green Huge foodie and amateur cook Buildings Wearer of bowties Huge Conglomerates, Small Homebrewer and beverage Employee Owned, Fortune imbiber 500 Passionate about Agile (have Tester -> Developer -> multiple kanban boards up in Automation Dude -> QA my living room) Manager -> Project Manager -> Scrum Master -> Scrum Product Owner -> Scrum Coach Consulting and FTE Passionate about Agile
WE NEED TANGIBLES As gauges or indicators - For status, quality, doneness, cost, etc. As predictors - What can we expect in the future? As decision making tools - Can we release yet? A visual way to peer into a mostly non-visual world - Because we don’t completely understand what’s going on in the software/project and we need to
HISTORY TELLS US TO USE METRICS Tons of research. Mostly from the 80’s and 90’s and based upon industrial metrics. Tons of implementation at companies Research + Implementation has grown exponentially Hasn’t really affected project success (what a metric!) Metrics Usage: Papers, Books, Co mpanies, etc. Software Project Success Rate1980 1985 1990 1995 2000 2005 2010 *Chaos Report from 1995 to 2010
WATERFALL IS SCARY WITHOUT THEM “Metrics are used in waterfall because we had no idea what was happening, so we tried to measure anything.” – Ken Schwaber, ALM Chicago Keynote, 2012 Because the system is complex and intangible. So we worry. So we want a way to peer into the system and make predictions. So we take measurements to try to create a window. But we still worry. EVERYTHING STILL FEELS RISKY
THE MEASUREMENT PARADOX “Not everything that can be counted counts, and not everything that counts can be counted” – Albert Einstein Software development is a complex system Metrics used in isolation probably don’t measure what you think they do Beware ‘low hanging fruit’ Value of Measurement = 1/Ease of Measuring
Number of Test Cases 600 500 400 300 200 100 0 December January February MarchReal Life Example In reality, we just started focusing on cleaning up old test cases.
THE HAWTHORNE EFFECT Measuring something will change people’s behavior When you measure something, you influence it You can exploit this effect in a positive way Most traditional metrics have a negative hawthorne effect Gaming = Hawthorne Effect * Deliberate Personal Gain“Tell me how you will measure me and I will tell you how I will behave” -Goldratt
“Test case TC8364 has failed, the customer settings page doesn’t work in Chrome.” “Tests: Passed - But I wrote a bug for not being able to use the customer setting page in Chrome.Real Life Example Same Tester. Same Test. One sprint before test pass/fail percentage metric put in place, and one sprint after.
MEASURING AT THE WRONG LEVEL Austin Corollary: You get what you measure, and only what you measure Austin Corollary: You tend to lose others you cannot measure: collaboration, creativity, happiness, dedication to customer service … Suggests “measuring up” Measure the team, not the individual Measure the business, not the team Helps keep focus on outcomes, not output
Real Life Example Defects per Person-Hour went down! We met our quality goal! Customer Complaints went up. Oops. Pankaj Jalote. Software Project Management in Practice. Tsinghua University Press, 2004. Pages 90-922.
EVERYTHING STILL FEELS RISKY Is it still risky in Agile?
INCREMENTS ARE GAME CHANGERS- Agile projects produce potentially shippable Increments every few weeks - The system is no longer intangible - No need to have tons of predictive metrics- Reviewing the Increment (sprint review) - Enables quick adaptation to customer needs, market concerns, quality issues, etc.
SCRUM APPROACH The only metric that really matters is what I say about your product.
DOES THAT MEAN … No Metrics?! Well, OK; no metrics are better than bad metrics.
OUR AGILE METRICS MANIFESTO We no longer view or use metrics as isolated gauges, predictors, or decision making tools; rather they indicate a need to investigate something and have a conversation, nothing more. We realize now that the system is more complex than could ever be modeled by a discrete set of measurements; we respect this. We understand there are some behavioral psychology concepts associated with measuring [the product of] people’s work; we respect this.
CONSIDERATIONS What really matters? Listen to the customer Understand and respect the complex system Trends over static numbers Are we measuring at the right level? How can we make this measurement a bit less isolated? How can we ensure only the correct audience sees it? Measure up! What behaviors are we trying to nurture (or avoid)? Will this help us be more agile? No Single Prescription
WORKING SOFTWARE Can everybody confidently give the “thumbs up” to the increment?
SUMMARYWaterfall makes me anxiousAgile inherently limits risk, renders manytraditional metrics moot The increment is a game changerMeasuring people influences their behaviorThere are useful metrics in agile Beware traditional metrics and low hanging fruit Leverage the Hawthorne effect Measure up Promote Agile/Lean/XP/good development practices
Scrum.Org Professional Scrum Product Owner Course. http://bit.ly/xOccnM Mike Grifiths- Leading Answers: “Smart Metrics” http://bit.ly/yfV643 Elisabeth Hendrickson – Test Obsessed : “Question from the Mailbox: What Metrics Do You Use in Agile?” http://bit.ly/xtSDdg SOURCES Jason Montague – Observations of a Reflective Commuter: “Systems Thinking and Brain Surgery” http://bit.ly/ylBxIn Ian Spence – Measurements for Agile Software Development Organizations: “Better Faster Cheaper Happier” http://bit.ly/y4UKIt N.E. Fenton – “Software Metrics: Successes, Failures & New Directions” http://bit.ly/ybwUzA Failure Rate - “Statistics over IT projects failure rate.” http://bit.ly/xjBRv0 Chad Albrecht – Ballot Debris: “Simple Scrum Diagram” http://bit.ly/yc7yFW Robert Austin–“Measuring and Managing Performance in Organization” http://amzn.to/wTfgx3These people are Mary Poppendieck– Lean Software Development “Measure Up”much smarter than I, http://bit.ly/zppVTCplease read what they Jeff Sutherland – Scrum Log: “Happiness Metric – The Wave of the Future”have to say! http://bit.ly/xO8ETS
UNIT TEST COVERAGE Encourages teams to write unit tests, good xp/agile/development practice Doesn’t guarantee GOOD tests – careful! 120% 100% 80% Team 1 60% Team 2 Team 3 40% 20% 0% Sprint 1 Sprint 2 Sprint 3 Sprint 4
CONTINUOUS INTEGRATION STATUS Current build status red/green How long has it been broken?
TEST CASE LIVELIHOOD Trend of new or Team 2 changing test cases 18% 16% 14% Shows if tests are 12% 10% keeping up with a 8% 6% 4% growing/changing 2% 0% software Sprint 1 Sprint 2 Sprint 3 Sprint 4 Team 3 Encourages teams to upkeep tests 10% 9% 8% 7% 6% 5% 4% 3% 2% 1% 0% Sprint 1 Sprint 2 Sprint 3 Sprint 4
CUSTOMER REPORTED DEFECTS Make these visible! Customer Happiness Net Promoter Score
STRATEGIC ALIGNMENT INDEX Are the features we’re implementing really the highest value? Are the projects we’re running really the best ROI?
USAGE INDEX Are the features we’ve implemented being used? Where should we focus our attention? Feature Usage Index 1 0.9 Percent of Users Using 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 A B C D E F G H I J K
CYCLES TIMES How long does To-do to Done take?