The problem with not seeing the whole problem

Illustration of honeybees working inside a hive

We made the mistake of choosing a Saturday for the quarterly trip to Costco to pick up our cat’s asthma medication.

Low, low warehouse prices had also lured half the metro area, and so there were lines in the parking lot, lines at the door, lines in the pharmacy, lines in the food aisles, lines at the checkout, lines for the lines … you name it.

Complete chaos.

Yet everything kept moving. People were picking up what they needed and queuing in a (mostly) orderly fashion, and the lines didn’t stall out. The system was working despite the high load.

This didn’t happen by magic. There’s a robust set of operational sub-systems working behind the scenes. Some I can think of off the top of my head:

  • Procurement and vendor management
  • Infrastructure and maintenance
  • Supply chain
  • Inventory management
  • Staffing and training
  • Technology
  • Business operations
  • Marketing

Each of these has sub-systems, too, and I’m sure I’m overlooking a bunch more. Systems, when they’re working, are invisible. It’s only when something breaks that we notice.

What if, for example, the self-checkout terminals suddenly lose connectivity to payment systems and a reboot doesn’t fix it? A bottleneck forms while other lines absorb the bereft self-checkers. Wait times get longer. Customers get angry. Staff, too, because they’re bearing the brunt of that anger. Some folks get mad enough that they leave without buying anything, and the store loses a sale. If problems like these happen often enough, the store takes a reputational hit that could hurt the bottom line even harder.

Ideally, backup plans kick in when bad things happen. But no one can anticipate every possible problem, and even if they could, it might be cost-prohibitive to put contingencies in place for what’s likely an edge case.

The cloak of invisibility

When everything works, operations folks (like me) also tend to be invisible. But when something breaks, suddenly the Spanish Inquisition shows up in Slack wondering why you didn’t anticipate it and demanding your full and complete 147-step implementation and communications plan for fixing it before end of day.

<cough> Not that I’ve ever experienced this.

But, for the sake of argument, if I had, my first priority would be defusing the panic. It’s just not helping anyone. This requires projecting calm and confidence that the situation will be handled, and then following through.

Second priority is to stop the bleeding, whatever it may be. This is likely a quick fix to get the system back up and running. It’s also probably a temporary fix. So the third priority is figuring out what the long-term fix is and then implementing it.

Here’s where things tend to go sideways. In the quest to make sure This Never Happens Again, folks tend to focus on the point of failure and don’t see the rest of the system. For example, in the world of publishing, it’s not uncommon to inadvertently publish something inaccurate in a piece of content. Sometimes the fix is an easy “oops, here’s the correct version” and everyone goes about their business.

And sometimes a simple correction isn’t enough and the inaccuracy embarrasses or offends the brand, an exec, a set of customers, you name it, with a massive stink ensuing. When the hubbub dies down, I’d almost guarantee new processes will be put in place post-haste to add more reviews or signoffs and make sure nothing similar slips through again.

The new safety-net reviewers usually have jobs that don’t involve reviewing content. So we’ve just heaped more work on their plates, and it’s probably lower-priority than whatever else they’re doing. If the publishing cadence is infrequent, this might not be a big deal. But in a steady content pipeline, review cycles will soon get slower, and whoever was relying on publishing that content will soon get madder. Like the customers in the never-ending line, they could get so mad that they bypass the system and create their own, thereby defeating the purpose of the new processes.

<cough> Not that I’ve ever experienced this.

Checking under the hood

Tedious as it may be, the best way to avoid purely symptomatic fixes is to check the entire system for other potential root causes before you put any changes in place.

Here’s a trivial example: The first iteration of the honeybee illustration I generated for this blog post was … creepy, with alien-like larvae in each cell of the honeycomb. Diagnosis: My prompt wasn’t detailed enough, which I remedied by asking for the cells to be filled with honey and not larvae. Second round: Half the cells were honey; half were the creepy larvae. Diagnosis: I dunno. I reworded the prompt to ask for “empty” cells. That worked, but if you look closely at each bee, there are varying numbers of legs, antennae, wings, etc. I could have kept going, but I don’t need a Nature-worthy image here. The OK-at-a-glance version suits the purpose. If that weren’t the case, I’d need to address another root cause: I’m a cheapskate and used the free version of the image generator, which limits iterations and also uses a less sophisticated model.

With the broken checkout terminals, it’s easy to default to the sentiment that the equipment is a POS (not meaning Point of Sale) and needs a complete upgrade. Cool, but it won’t fix the problem if the payment-system provider also needs an upgrade but can’t do it until next year. With the inaccurate content, one of the root causes might be goals or incentives that prioritize speed over quality. If so, the new review processes will catch the quality issues but at the expense of someone else’s business goals. Not a great way to promote collaboration.

The more complex the system, the more important it is to check all the possible contributing factors. The 5 Whys exercise is a good and relatively easy start, but there are other methodologies, too.
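To make that concrete, here’s what a 5 Whys chain might look like for the broken self-checkout terminals from earlier. The specific answers below are hypothetical, just to show how the exercise walks past the obvious symptom toward contributing causes:

  Problem: The self-checkout terminals stopped taking payments, and a reboot didn’t fix it.
  Why? They lost connectivity to the payment provider.
  Why? The aging terminals dropped the connection after a provider-side change. (hypothetical)
  Why? The terminal hardware and the provider’s system are both overdue for upgrades.
  Why? The provider can’t schedule its upgrade until next year.
  Why? That dependency sits outside the store’s control, so no one planned around it.

Notice where the chain ends up: “replace the terminals” alone won’t fix the problem, because one of the root causes lives with an external dependency. That’s exactly the kind of thing a symptom-level fix misses.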

Last but not least, it’s very possible you’ll diagnose all the problems, identify the fixes, and then not get approval to fully put them in place. Maybe you’ve got dependencies you can’t control. Maybe the budget isn’t there. Maybe the executive approver just doesn’t like you. Who can say? But if you find yourself in this situation, be as transparent as possible about the risks and consequences, and make sure they’re properly documented. This is more than a CYA. When the next fire drill comes along, you can pull it back out and use it to help your case for making the rest of the fixes.

Pro tip: Resist the urge to say I told you so.


All opinions here are my own. All text is my own, too, including the em dashes. I welcome constructive comments and discussion on LinkedIn and Bluesky.