There is No Root Cause!
I’m a big fan of Sidney Dekker. Dekker has done a lot of research into the investigation and analysis the airline industry performs after a crash. When lives are lost, finding out why is paramount. The families of lost loved ones want answers, and the industry needs to figure out how to avoid similar accidents in the future.
Dekker ended up revolutionizing the process. Later, his work came to the attention of the software development industry through a seminal blog post from an Etsy employee writing about blameless postmortems (BPMs). Having helped implement a BPM practice at my company, I read Dekker’s book, “The Field Guide to Understanding Human Error.”
My biggest takeaway was that there is no root cause! All error is, in the end, human error. When a system fails, it is because a human failed to anticipate the scenario that caused the failure. Mechanical parts wear out, but humans either decide when to replace them or design the systems that determine when to replace them.
We often point to systemic failures as the “root cause,” but in doing so, we have just stopped short of following the path to the ultimate source of the problem. It was people! Even “acts of God” or “force majeure,” as insurance companies like to call them, can be mitigated if they are adequately evaluated and prepared for by…you guessed it…humans!
When we evaluate for root cause, we often stop at the borders of our influence. Suppose your toaster oven burns the bread because the timer breaks. You’re going to call the broken toaster the root cause. It is doubtful that you will contact the company to report that your 10-year-old toaster no longer works.
Burnt toast is the epitome of low stakes. After the failure, you will discard the toaster and get a new one. However, when the stakes are higher, replacing aged equipment is something that humans must anticipate. They need to weigh the cost of a failure against the cost of replacement before the failure.
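That weighing can be sketched as a simple expected-cost comparison. Here is a minimal Python sketch of the idea. The function names and every number in it are made up for illustration, and a real decision would involve far more than one probability and one price tag.

```python
def expected_failure_cost(failure_probability: float, failure_cost: float) -> float:
    """Expected cost of running a part until it fails, over some planning period."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    return failure_probability * failure_cost


def should_replace_early(
    replacement_cost: float, failure_probability: float, failure_cost: float
) -> bool:
    """Replace before failure only if that is cheaper than the expected failure."""
    return replacement_cost < expected_failure_cost(failure_probability, failure_cost)


# Hypothetical low-stakes case: a failure costs a couple of slices of bread.
toaster = should_replace_early(
    replacement_cost=40.0, failure_probability=0.5, failure_cost=2.0
)

# Hypothetical high-stakes case: a failure is rare but catastrophic.
jet_part = should_replace_early(
    replacement_cost=50_000.0, failure_probability=0.001, failure_cost=500_000_000.0
)

print(toaster, jet_part)  # False True
```

With these invented numbers, the toaster is not worth replacing early and the jet part is. The point is not the arithmetic; it is that a human chose the inputs and made the call.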
The airline industry is a case in point. Every single part on a jet is inventoried and tracked. Each part has an assigned life based on how critical it is to the functioning of the craft. Critical parts are thus replaced well before they are expected to fail. This is one of the reasons why it is safer to travel by jet than on roads.
What does this mean for you? Next time you participate in a root cause analysis, recognize when you’re stopping at the borders of your influence and ask, “What could we have done to avoid this failure?” Maybe the answer is that we should have purchased a different product. Or maybe we should have asserted influence in a situation in which we thought we had none.
It always comes down to choices that humans made at some point in time prior to the failure. Sometimes, we learn and adapt based on our understanding of those choices. Other times, we may determine that the cost of avoiding the failure was higher than the cost of the failure itself. Knowing that is often good enough to put the issue to rest.