
The bug that didn't fire

Two engineers ship the same mistake. One's system is live, so the gap in their understanding becomes visible. The other's isn't. Thomas Nagel called this circumstantial luck, and software engineering is full of it.

We were all working at the edge of what any of us understood. The technology was new enough that nobody on the team could claim a decade of experience with it, and nobody tried. What my colleagues didn't know, I didn't know either. The difference, the only material difference, was that my system went live first.

When something broke in deployment, the post-mortem had my name on it. The gap in my understanding became a visible thing: a point on a timeline, a cause. My colleagues' identical gaps remained invisible because their systems hadn't been in front of customers yet. They were asked for opinions. I was asked to explain myself.

I have been thinking about this for a while, trying to find the name for what happened. I think the name is Thomas Nagel's.

The drunk driver

In 1976, Nagel published a short paper called "Moral Luck," alongside a companion piece by Bernard Williams in the same journal. The central question is simple: can luck change what someone deserves?

The argument rests on what has come to be called the control principle: you can only be held morally responsible for what you controlled. This sounds obvious. It is also, as Nagel shows, constantly violated by how we actually judge people.

His example is two drunk drivers. Both made the same decision to drive home after drinking. Both were equally reckless. On one driver's route, a child stepped off the kerb. The other arrived home without incident. The first is charged with manslaughter. The second is not charged at all. If the control principle holds, this is a problem: both made the same decision under the same conditions, and the difference in their outcomes was entirely outside their control. But almost nobody actually feels that the two deserve identical treatment.

Nagel identifies four ways luck gets inside moral judgment. The one that names the drunk driver case is resultant luck: luck in how things turn out. Same decision, different consequence, different blame.

But there is an earlier kind, and it is the one that happened to me. He calls it circumstantial luck: luck in the situation you face. Not how things turn out, but whether you were placed in the conditions where your choices could matter morally at all.

The eighth server

In August 2012, Knight Capital Group deployed new software to its trading infrastructure. Eight servers needed updating. Seven received the new code. The eighth was missed: a deployment script had failed silently, and the engineer responsible assumed the job was done.

On the unpatched server, a flag that the new code had repurposed still triggered a piece of old trading logic called Power Peg. When the market opened, seven servers processed trades correctly and one did not. The system distributed orders across all eight without distinguishing between them. In forty-five minutes Knight Capital lost roughly four hundred and sixty million dollars.

The engineers who deployed the seven working servers made the same decision, under the same pressure, as the one who missed the eighth. The deployment process was the same. The understanding of the system was presumably similar. One of them happened to be standing where the gap showed.

That is the drunk driver case, implemented in a deployment script. But it is also something more. The bug that caused the disaster had existed in the codebase for years. Power Peg was old code, never removed, waiting. The engineers who deployed to the seven working servers were not safe from it. They were just not unlucky enough to encounter it.
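
For readers who want the mechanical version of that gap, here is a minimal sketch in Python of the kind of pre-flight check that surfaces a server left behind: compare the build each host reports before any traffic is routed to the fleet. The host names, build identifiers, and both function names are hypothetical, chosen for illustration. Nothing here reflects Knight Capital's actual tooling.

```python
from collections import defaultdict


def find_version_drift(reported):
    """Group hosts by the build identifier each one reports.

    `reported` maps a host name to a build id (both hypothetical).
    Returns {} when every host agrees, otherwise {build_id: [hosts]}.
    """
    by_build = defaultdict(list)
    for host, build in sorted(reported.items()):
        by_build[build].append(host)
    return {} if len(by_build) <= 1 else dict(by_build)


def safe_to_route(reported, expected_build):
    """Allow routing only if every host runs exactly the expected build."""
    return bool(reported) and all(
        build == expected_build for build in reported.values()
    )


# Illustrative fleet: one host was missed during the rollout.
fleet = {"srv-1": "new", "srv-2": "new", "srv-3": "old"}

drift = find_version_drift(fleet)      # groups hosts by what they actually run
can_route = safe_to_route(fleet, "new")  # False while any host is behind
```

The point of the sketch is that the check asks about the whole fleet rather than about the last person to touch it. It reports every host that disagrees, not only the one that happens to fail first.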

A word for the sceptic

I want to be fair to the other side, because it is not a stupid position.

We are outcome-sensitive for real reasons. Results carry information that intentions don't. A near-miss is genuinely less bad than a failure. One person is dead in one case and not in the other, and that difference is real. Criminal law distinguishes attempt from completion partly because the incentive to stop halfway through would disappear if outcomes carried no weight. There is a logic here, not just scapegoating.

The engineer whose bug fires in production made a mistake. The feedback is real. Some degree of outcome-sensitivity is just how risk calibration works.

What the post-mortem leaves unasked

The problem is not outcome-sensitivity. It is what happens when outcome-sensitivity becomes the end of the investigation.

When a post-mortem finds a root cause, it answers a question that feels satisfying and leaves a more useful one unasked. The question it answers: who was responsible for this outcome? The question it leaves unasked: who else was in the same position?

In a team working at the frontier of what anyone knows, the honest answer to the second question is usually most of us. The knowledge gap that became visible in my deployment was not mine alone. It was a team condition. The technology was new. The expertise was thin across the board. My deployment was the first one to put that condition under load.

When the post-mortem names me and praises my colleagues, two things happen. My colleagues receive a signal that their understanding is sound, which it isn't. And I carry a weight proportional not to how wrong my understanding was, but to how exposed it happened to get. Neither is useful information. Neither makes the next deployment safer.

What a team working on new technology needs from a failure is not a verdict about who should have known better. It needs a knowledge audit: where is the gap, how wide is it, and who else is standing on the same ground? That question is uncomfortable for everyone in the room, including the people who haven't shipped yet. That discomfort is the point. It means the lesson is being distributed to the people who need it, not just to the person whose luck ran out first.

What "first to deploy" actually means

There is something worth saying plainly, because it didn't get said in my own post-mortem.

The person who deploys first is not behind their teammates. Deployment is the only test that runs against the real system, with real users, under real conditions. Being the first to face that test does not mean you understood the technology least. It often means the opposite.

The gap in my knowledge became visible because I was the first to put mine at risk. My colleagues' gaps are still there. They just haven't been tested yet.

That is not a criticism of them. It is a description of what being first actually costs, and what it is actually worth.

Sources

  • Thomas Nagel, "Moral Luck," Aristotelian Society Supplementary Volume 50 (1976), revised in Mortal Questions (Cambridge University Press, 1979). See also the Stanford Encyclopedia of Philosophy overview.
  • Bernard Williams, "Moral Luck," companion paper in the same volume, collected in Moral Luck: Philosophical Papers 1973–1980 (Cambridge University Press, 1981).
  • Knight Capital Group incident, August 2012. See the case study by Henrico Dolfing.
