Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)
Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010

Shell shocking events

Imagine it's the start of your work day and you've decided to dig into a recently filed bug. You go to pull the latest version of a project on GitHub, only to discover that you can't.

Why not? Turns out the company network has decided to block requests to GitHub. No one knows why. There's a bunch of headless chickens running around in ops trying to at least restore service (and figure out why), and they give updates every 30 minutes. Hours will pass before it's resolved.

What do you do during those hours? Technically you don't need the latest version of the project to spin things up; it's unlikely this bug is only present in the latest changes. Some days, you might just soldier on: spin up the local copy you had, debug, and eventually make a pull request with the fix whenever the network issue is resolved.

Some other days though, you might just be shell shocked. You'll end up doing nothing during those hours, because you're so shocked that such a state could come to pass, could last so long, in a company this big, with this much money. Of course depending on how tracked you are, you might have to pretend to do something while recovering from your shock, but the shock is there nonetheless. (Sometimes there are culturally accepted outs to explain a lack of output that day, like "the build tool was broken for me" -- because everyone has been there, experienced that shell shock of not even being able to build the project offline (WTF!!!) because of some stupid issue with a stupid custom build tool that's mostly out of their control.)

This is the benefit of smaller companies and startups. You just don't get into certain dysfunctions. For other dysfunctions, you tend to have no one else to blame but yourself, and then you can go address it directly yourself.

I think a lot of BigCo soul-suckage and eventual departures can be attributed to too many shocking dysfunctions that one can't do anything about. Furthermore, these dysfunctions almost always come from a system-level common cause. One should expect the occasional dysfunction, since variation exists.

But distressingly, upper management at bigger companies ends up treating each thing as a special cause, which leads to more issues when they over-correct for it. For example, maybe the above situation was ultimately caused by someone messing up an iptables command somewhere in the corporate network control stack, and the management solution is that no one is allowed to run iptables commands manually anymore. (They might even fire the poor sap who made the mistake.) Great, now when an attacker is DDoSing a service in a dumb way and could be stopped with a simple iptables command, no one can actually run it. The real cause was at the system level, not at the individual level of who ran what command. Misdiagnosing this leads to further self-inflicted wounds. If this snowballs into a positive feedback loop, it can be company ending.

There is no silver bullet for improving software quality, but there are lots of potential bronze bullets. One of them is a bullet in the form of pure management changes. Software companies would benefit from internalizing the lessons of Deming and learning how to properly manage and appreciate the system, understand variation, understand knowledge, and understand individuals. "Reduce shell shocking events" -- has any software manager ever had that as a goal? Do they even realize that shock is a thing in their system? If they did, could they gather impersonal* data about it to track and improve things? Does "control chart" mean anything?

*By impersonal, I mean that there's no judgment/ranking of people within the distribution, where the distribution shows a pretty smooth range where some employees are shocked once a day (for some a simple meeting is sufficient to induce a shock that necessitates recovery time before productive and thoughtful work can begin again) while others only once a month. Sometimes there will be clear outliers from the distribution -- for example, someone in constant shock. It's still almost always a common cause issue though. Perhaps they really cannot stand open offices, and so a special/personal accommodation can be made (giving them a private office / letting them work from home) so that their rate of shock moves to be within the normal distribution.
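
For anyone who hasn't met a control chart, here is a rough sketch of what tracking this could look like. It assumes you count shell shocking events per week across a whole team (no names attached) and use a standard c-chart, where the limits sit three standard deviations from the mean count and the standard deviation is taken as the square root of the mean. The function names and the weekly numbers are made up for illustration; this is not data from any real company.

```python
import math


def c_chart_limits(counts):
    """Return (center, lower, upper) 3-sigma limits for a c-chart of event counts."""
    if not counts:
        raise ValueError("need at least one count")
    center = sum(counts) / len(counts)
    # For count data a c-chart approximates the standard deviation by sqrt(mean).
    spread = 3 * math.sqrt(center)
    # A count can't be negative, so the lower limit is clamped at zero.
    return center, max(0.0, center - spread), center + spread


def special_cause_points(counts):
    """Indices of counts that fall outside the control limits."""
    _, lower, upper = c_chart_limits(counts)
    return [i for i, c in enumerate(counts) if c < lower or c > upper]


# Hypothetical team-wide shock counts per week (illustrative numbers only).
weekly_shocks = [4, 6, 5, 3, 7, 5, 4, 6, 16, 5, 4, 6]

center, lower, upper = c_chart_limits(weekly_shocks)
print(f"center={center:.2f} lower={lower:.2f} upper={upper:.2f}")
print("weeks worth a special-cause look:", special_cause_points(weekly_shocks))
```

Weeks inside the limits are ordinary variation, and the only way to improve them is to change the system. Only a week outside the limits (like the 16 here) is worth investigating as a possible special cause.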

Posted on 2019-11-27 by Jach

Tags: management, philosophy, programming

