In his presentation, Martin questions whether the software crisis of 1968 is still with us, answers the question of why we can build bridges reliably yet fail when it comes to building reliable software, and shows what modern agile concepts such as Continuous Delivery and DevOps offer to the rescue. Finally, he sketches two use cases showing how an application monitoring solution like Dynatrace can help reduce costs and improve software quality along the Continuous Delivery build pipeline.
4. #Dynatrace
The “Software Crisis” as of 1968
» projects running over budget
» projects running over time
» software was very inefficient
» software was of low quality
» software often did not meet requirements
» code was complex and difficult to maintain
» software was often never delivered
5. #Dynatrace
The “Software Crisis” as of 1968 – today?
» projects running over budget
» projects running over time
» software was very inefficient
» software was of low quality
» software often did not meet requirements
» code was complex and difficult to maintain
» software was often never delivered
11. #Dynatrace
“We need to create a culture that reinforces the value of taking risks and
learning from failure and the need for repetition and practice to create
mastery.” Gene Kim, The Phoenix Project
A key principle of DevOps
26. #Dynatrace
Rate of Diminishing Returns of Fixing Bugs
» Developers should not spend time here! Low yield!
» Concentrate on these!
29. #Dynatrace
Dynatrace in Automated Testing
Build #    Test Case      Status   # SQL   # Exceptions   CPU
Build 17   testPurchase   OK         12         0         120ms
           testSearch     OK          3         1          68ms
Build 18   testPurchase   FAILED     12         5          60ms   → Regression! The exceptions are probably the reason for the failed test
           testSearch     OK          3         1          68ms
Build 19   testPurchase   OK         75         0         230ms   → Problem fixed, but now we have an architectural regression!
           testSearch     OK          3         1          68ms
Build 20   testPurchase   OK         12         0         120ms   → Problem solved! Now we have the functional and architectural confidence
           testSearch     OK          3         1          68ms
(The Status column comes from the test framework; # SQL, # Exceptions and CPU are architectural data captured by Dynatrace.)
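As a sketch of what such a build-over-build comparison could look like in a CI pipeline, the following Python snippet flags tests whose architectural metrics have grown significantly between two builds. The metric names, data layout and tolerance are illustrative assumptions, not Dynatrace's actual data model:

```python
# Sketch: flag architectural regressions between two builds, even when all
# functional tests pass. Metric names and the tolerance are illustrative.

def architectural_regressions(baseline, current, tolerance=0.2):
    """Return (test, metric, old, new) tuples where a metric grew by more
    than `tolerance` relative to the baseline build."""
    regressions = []
    for test, metrics in current.items():
        base = baseline.get(test)
        if base is None:
            continue  # new test, nothing to compare against
        for name, value in metrics.items():
            if base[name] and value > base[name] * (1 + tolerance):
                regressions.append((test, name, base[name], value))
    return regressions

# Build 17 vs. Build 19 from the table above: the functional bug is fixed,
# but testPurchase now issues 75 SQL statements instead of 12.
build_17 = {"testPurchase": {"sql": 12, "exceptions": 0, "cpu_ms": 120}}
build_19 = {"testPurchase": {"sql": 75, "exceptions": 0, "cpu_ms": 230}}

# flags the SQL count (12 -> 75) and the CPU time (120ms -> 230ms)
print(architectural_regressions(build_17, build_19))
```

A check like this can fail the build on architectural regressions alone, which is exactly the Build 19 situation in the table: functionally green, architecturally broken.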
Let’s look behind the scenes
36. #Dynatrace
“I’ve muddled over the same log files for weeks sometimes
to extrapolate the relationships between different systems
[...] before having my eureka moment.”
RecklessKelly (Operator) on reddit
53. #Dynatrace
Awesome! We have...
» identified whether it was the host, a process or the transactions
» identified which critical business functionality was affected
» been able to prioritize the failure and secure evidence
» gotten the right people around the same table
» taken minutes, not weeks!
I would like to start my talk with a somewhat controversial question – and I promise to give you an answer to it in a bit. Do we have a software crisis?
The term “software crisis” was coined by members of the NATO Software Engineering Conference in 1968. They felt there was a huge gap between what was theoretically possible at the time and what could actually be achieved in software development. At that time, the USA was in the middle of the Cold War with the Soviet Union and feared a loss of expertise in software development.
Notes:
inefficient in terms of resource consumption (run in less time and/or with less memory)
low quality in terms of reliability and stability, maintainability, etc.
complex and difficult to maintain: spaghetti code; code dependencies are twisted and tangled like a bowl of spaghetti. If you pull on one end, something will move at the other end
To me, these points still seem valid today, even if not as severe. Why not have a look at some numbers...
The CHAOS Manifesto by the Standish Group presents the results of research on roughly 50,000 software development projects around the globe and across verticals from 2002 to 2012. The companies they surveyed are:
50% Fortune 1000-type companies (large)
30% mid-range
20% small-range
Success := delivered on time, on budget, with required features and functions
Challenged := late, over budget, and/or with less than required features and functions
Failed := cancelled prior to completion OR delivered and never used (unusable)
If you think that 39% in 2012 are bad, let’s have a look at 2004...
In 2004, only 29% of all researched software projects were successful, which makes 2004 the worst of all years since 2001. However, since 2004 there has been a constant increase in success rates. The authors of the manifesto revealed that this increase was due to much better (agile) project management and the use of agile software development practices, such as test-driven development, pair programming, etc.
At the end of the 1980s, Alfred Spector, now a VP of Research at Google, co-authored a scientific paper in which he tried to answer the following question: “Why are we able to build bridges that finish on time and on budget (and typically do not collapse), but fail to do the same when it comes to writing software?”
Two answers were almost obvious:
Bridges have been built for more than 3,000 years; software for only a few decades
Bridge building relies on the laws of mathematics and statics, which leave little room for flexibility, whereas software is not governed by such strict laws
Most surprising was the following discovery: it has something to do with how we deal with mistakes. Whenever a bridge falls down, the incident is thoroughly investigated and reported so that future bridge builders can learn from previous mistakes. Not so with software: failures are often covered up, ignored or rationalised (“it’s not a bug, it’s a feature”), with the result that we are unable to learn from our mistakes.
How do you best ignore an undesirable situation?
Is it my problem? Probably not. Finger-pointing and blame games do not solve the problem and cost precious time and money. But a problem is always an unpleasant situation – is it not?
My point is that we must establish a culture that accepts errors as part of our daily work, and in which the ability to quickly and efficiently resolve these errors allows us to learn from our mistakes and get better day by day – which lets us outperform those who don’t and be successful in the long term.
And I am certainly not alone with my point of view...
Why is this important?
Companies use Agile and Lean Software Development practices and mindsets to have better control over the outcome of the software development and –delivery process.
Agile project management frameworks, such as Scrum, allow us to react better and more dynamically to customer requirements and to build quality into our products.
It’s essentially about getting features into your users’ hands quickly!
I would now like to present two use cases showing how Dynatrace helps you deal with errors both effectively and efficiently. In this first use case, I will show you how you can proactively uncover issues in your software before they affect your users.
Clearly, the focus should be on fixing bugs in Development and Test, rather than in Operations. There is nothing more inefficient than fixing a bug only after it has already hit your customers, when the developer no longer remembers his or her code.
In this second use case, I’d like to show you how you can identify root causes efficiently when a problem occurs in your production system.
"Do we still need War Rooms?" I claim that war rooms should really be a thing of the past. The term “war room” is used in fire-fighting scenarios when subject matter experts are summoned into a room to fix a critical problem.
Usually, the people in this room know about the symptoms, but they don’t know much about the root cause or who should actually be involved - and if you ever had to gather insights from manually correlating piles of distributed log files you’ll know as much as I do that this can be quite daunting.
Instead, you will want to involve only those people who are really related to the problem and have all the others keep their talents focused on business critical development, testing and operations.
Recently, I participated in a discussion on the usefulness of logging on reddit and one Operator came up with this astonishing insight: he admitted it would sometimes take him weeks to figure out what was going on in his systems based on looking at log files. Hmm. I wonder what the deployment rate looked like? Every 2 weeks? Every 4 weeks? Quarterly? If they deployed into production every two weeks, his insights were most probably outdated at the time he had his “eureka” moment.
Can we do any better? I say, Yes We Can!
Looking at an application through Dynatrace allows you to see the global health status for all transactions, 24/7 no matter the degree of distribution over runtimes and physical or virtual machines.
Right here we see that our application is affected by a failure on the Business Backend Server, which is indicated by the red circular segment. We can immediately observe that the failure does not originate from a problem in the infrastructure – however, if it did, we could dig down deeper here...
...and that it too does not originate from the Backend Server’s process – and we could also dig down deeper here and look at the CPU activity, memory consumption, the number of threads over time, the impact of the garbage collector, etc.
The root cause, however, can be found in the transactions passing the Business Backend Server and as an Operator you may want to show this to a developer now.
Still, you want to know whether any business-critical transactions are affected, such as anything related to a login, a search, a newsletter registration or a purchase.
What we see here is: it’s the logins! They now have a 100% failure rate since the last deployment you made 10 minutes ago.
What about the relevance? How do you assign priority to an issue? Here you observe that more than 100 login attempts by more than 60 users have failed.
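Conceptually, the numbers on this dashboard are an aggregation over individual transaction records. Here is a minimal sketch of that aggregation, assuming a hypothetical record schema with name, user and failed fields – not Dynatrace's actual data model:

```python
# Sketch: aggregate raw transaction records into per-business-transaction
# failure statistics, as a monitoring dashboard might. The record fields
# (name, user, failed) are illustrative assumptions.
from collections import defaultdict

def failure_stats(transactions):
    stats = defaultdict(lambda: {"total": 0, "failed": 0, "users": set()})
    for t in transactions:
        s = stats[t["name"]]
        s["total"] += 1
        if t["failed"]:
            s["failed"] += 1
            s["users"].add(t["user"])  # distinct affected users
    return {
        name: {
            "failure_rate": s["failed"] / s["total"],
            "affected_users": len(s["users"]),
        }
        for name, s in stats.items()
    }

records = [
    {"name": "login", "user": "u1", "failed": True},
    {"name": "login", "user": "u2", "failed": True},
    {"name": "search", "user": "u1", "failed": False},
]
# login: 100% failure rate, 2 affected users; search: 0%, 0 users
print(failure_stats(records))
```

Failure rate answers “is it broken?”; the distinct-user count answers “how much does it matter?” – which is exactly how you assign priority here.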
Ok, you decide not to wait any longer and go talk to a developer. Which one? A backend developer in this case.
Before you leave, you make a right-click and create a Session File in Dynatrace and save it on your disk. The Dynatrace Session File allows you to secure all evidence and share it with your peers for offline analysis. Think of it as a common language for Dev, Test and Ops.
Instead, we get a Backend Developer, a Tester and the Operator around the same table and have them look at the evidence in Dynatrace.
So what would be the takeaways for the particular roles?
The Developer can identify the root-cause by looking at actual method invocations and contextual information in all failed transactions...
What the developer looks at here is a PurePath. A PurePath shows all data Dynatrace has recorded on behalf of a particular transaction. It is a combined tree that shows method invocations across runtime and machine boundaries, from the landing of the web request all the way down to the database and everything in between – no more manual correlation of distributed log files. This is root-cause analysis in minutes, not weeks. What you see combined on this dashboard are...
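To illustrate the idea behind a PurePath – not its actual implementation – here is a minimal sketch of a cross-tier call tree: method invocations nested across tiers, printable from the incoming web request down to the database. Tier names, methods and timings are invented for illustration:

```python
# Sketch: a minimal cross-tier call tree in the spirit of a PurePath.
# All node contents here are illustrative examples.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Invocation:
    tier: str            # e.g. "Frontend", "Backend", "Database"
    method: str
    duration_ms: float
    children: List["Invocation"] = field(default_factory=list)

    def dump(self, depth=0):
        """Render the tree as indented lines, parent before children."""
        lines = [f"{'  ' * depth}[{self.tier}] {self.method} ({self.duration_ms}ms)"]
        for child in self.children:
            lines.extend(child.dump(depth + 1))
        return lines

# A failed login, followed from the web request down to the database:
path = Invocation("Frontend", "doLogin", 480, [
    Invocation("Backend", "AuthService.authenticate", 450, [
        Invocation("Database", "SELECT on users table", 30),
    ]),
])
print("\n".join(path.dump()))
```

The point of such a combined tree is that the correlation across tiers is already done for you – no manual stitching of one log file per machine.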
The Tester can design new or rework existing test scenarios (whether manual or automated ones) and incorporate method arguments, HTTP parameters, and any other data captured by Dynatrace.
And the Operator can configure alerts so that he and the two guys next to him get alerted should this failure ever come back again.
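Conceptually, the operator's alert can be as simple as a threshold rule on the observed failure rate; the threshold below is an illustrative assumption, not a Dynatrace default:

```python
# Sketch: a trivial alert rule on the failure rate of a business
# transaction. The 5% threshold is an illustrative assumption.

def should_alert(failure_rate, threshold=0.05):
    """Fire when the observed failure rate exceeds the threshold."""
    return failure_rate > threshold

print(should_alert(1.0))   # the login incident above: alert fires
print(should_alert(0.01))  # background noise: no alert
```

In practice you would bind such a rule to a notification channel so that the developer, tester and operator are all paged if the failure ever returns.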