“ All we are doing
is looking at the timeline from the moment the customer gives us an order to the point we can collect the cash. And we are reducing that timeline by removing the non-value-added wastes. -‐-‐-‐ Taiichi Ohno, Founder of TPS
They capture a shared understanding
of a problem - often with just pencil and paper. Why are we talking about it? Where do things stand today? What should be happening? What would be a step in the right direction? What causes prevent us from reaching our target condition?
Problem: got a Speeding ticket
Late for work Got up late Alarm clock didn’t work Batteries were flat Why? Why? Why? Why? Therefore Therefore Therefore Therefore … And 5-whys
Once the nature of the
problem is clear, they take steps to move towards the target condition systematically. What countermeasures should neutralize the causes? What steps are required to implement the selected countermeasures? How will you know if the countermeasures work? Based on the results, what’s next?
Traditionally, a mentor would challenge
a problem solver’s line of thought with quick coaching cycles What do you mean by it? (Clarity) Is it always the case? (Assumptions) How do you know? (Evidence) What are you implying by that? (Implications) Would that necessarily happen? (consequences) Do anyone see it another way? (Alternative Point of views)
… Turning into this Hi
all, Here’s a quick summary of the root cause analysis we did this morning. Problem: The Website was offline from 15:31 to 15:40 because: -Website couldn't establish a connection to the database. - Because MySql service crashed. - Because MySql storage engine reached the default configuration memory limit. - Because Apache web server configuration allows threads to request more physical memory than available to MySql. - Because Apache and MySql default configuration settings are not optimised for the RAM currently available on the new virtual server. We failed to detect it because: - New relic didn't notify us that the site was not responsive - We don't know yet, needs further investigation. It might not be configured properly (cont..)
… We took adaptive actions
and the site is now up an running. However these are the further actions to take: Preventive: (future/cause) [now] Expand RAM on new virtual server [now] Review Apache & MySql configuration [later] Investigate moving to nginx Contingent (future/effect): [now] Configure New Relic's monitoring properly (alerts on site down, response time, n of processes, memory) [later] Investigate using New Relic for app profiling