A critical component went into a fault state and stayed there. The data pipeline saw it: the status was sitting right there in the database, updated every poll. A dashboard would have rendered it red. But nobody was looking at the dashboard at 2 a.m., and nothing reached out to say so. The fault was discovered the slow, embarrassing way: a human noticed later that the numbers were wrong.
Nothing was technically broken. Polling worked. Storage worked. The UI worked. What we didn't have was the one thing that mattered at 2 a.m.: a system whose job was to interrupt someone. We had monitoring. We had no alerting. They are not the same thing, and the distance between them is most of what I want to talk about.
Monitoring is a pull. Alerting is a push.
Monitoring assumes a watcher. It's a dashboard, a query, a graph you go and look at. It is wonderful for investigating a problem you already know you have, and nearly useless for discovering one you don't, because it depends on a human being curious at the right moment.
Alerting inverts the responsibility. The system watches itself and reaches out. The hard conceptual shift is that you stop asking "is anything wrong?" and start enumerating, in advance, every specific way the thing can be wrong. Because each one needs its own tripwire.
The failure mode that taught us this was the nastiest kind: silence. When a monitored component stops reporting, a naive monitor shows no errors, because errors are things that get reported, and a silent component reports nothing. A flat, quiet graph looks identical whether the system is healthy or the pipe is dead. So the first real design decision was that we had to alert on the absence of signal, not just the presence of faults. We ended up naming three distinct failure shapes that a single "is it down?" check would have blurred together:
- Component fault: the component reports, and what it reports is bad.
- Connected but silent: the component answers, but sends no usable state ("no data").
- Failed connection: we can't reach the component at all ("polling failure").
Each has a different cause, a different owner, and a different urgency. Collapsing them loses exactly the information the on-call engineer needs at 2 a.m.
The hard part isn't firing alerts. It's not firing the wrong ones.
Anyone can call a pager API in an afternoon. The engineering is entirely in restraint. A pager that fires on every transient blip gets muted within a week, and a muted pager is just an expensive way to be blind again. You've rebuilt the 2 a.m. problem with extra steps.
So most of the system is machinery for suppression. The pieces that earned their keep:
Tiered severity. Not every fault deserves a phone call. We classified components into tiers (critical, platform, noisy, and diagnostics) and mapped each tier-and-state combination to a severity. Only some of them page a human; the rest are logged for later. The classification lives in config, not code, so promoting a component from "log it" to "wake someone" is a one-line change, not a deploy-and-pray:
# illustrative: first match wins
- pattern: "*payments*" tier: critical # FAULT -> page
- pattern: "*metrics*" tier: diagnostics # FAULT -> log only
- pattern: "worker-*" tier: noisy # page only if sustained
Persistence gating. Some components flap by nature. A single bad poll usually means nothing. So noisy-tier faults don't page on the first occurrence. A counter has to cross a threshold of consecutive bad polls first. Real problems persist; noise doesn't. This one change removed the majority of would-be false pages.
Deduplication. One ongoing fault is one incident, not one incident per poll. A stable dedup key per (source, component, level) means the pager rings once and goes quiet until the state actually changes.
Auto-resolve and self-cleaning state. When a component recovers (or simply vanishes for a couple of polls) the system closes its own incident. The counters that drive gating live behind a short time-to-live so a component that goes quiet doesn't leave stale state lying around forever.
None of this is glamorous. All of it is the difference between an alerting system people trust and one they route to a muted channel.
Who watches the watcher?
Here is the part I find genuinely unsettling, and the reason I now treat alerting code as more dangerous than the code it watches: an alerting system can fail silently in exactly the way it exists to prevent.
Two real examples from our own build.
First, a real fault on an important component didn't page anyone, not because the detection broke, but because that component had been classified into the diagnostics tier, where a fault maps to "info" severity, which by design never rings a phone. The logic worked perfectly. It was doing precisely what the configuration told it to do, which happened to be nothing. The alerting system had its own silent-failure mode, one layer up.
Second, a subtler one a review caught. The persistence gate used a counter that ticked up to a threshold and fired. But the counter wasn't reset when the component recovered. So a component that faulted, paged, recovered, and then re-faulted within the time-to-live window would find its counter already past the threshold, and the equality check that fired the alert would never be true again. The re-fault would pass in total silence. A monitoring system carrying a bug whose only symptom is that it stops monitoring.
The only thing more dangerous than no alerting is alerting you've decided to trust.
Two disciplines fell out of this, permanently:
- The watcher needs as many tests as the watched. Our alerting stack ended up roughly one line of test for every line of production code. That ratio felt absurd while writing it and correct the first time a test caught a severity-matrix regression before it shipped. The decision engine was built as a pure function (previous state and current state in, alert decisions out) precisely so every tier-by-transition case could be pinned down in memory without a live database or a real pager.
- Observability must fail safe. The alert dispatch sits inside the polling loop, and the polling loop is sacred: it's how state gets recorded at all. So the dispatch boundary swallows infrastructure errors: if the pager API is down, that must never take the data pipeline down with it. The watcher failing should degrade to "we're flying blind," never to "we crashed the plane while checking the instruments."
Logging is the other half, and you'll discover it's broken at the worst moment
Alerting tells you that something is wrong. Logs are how you find out what. And logging has a failure mode all its own: it tends to be broken precisely when you finally reach for it, because nobody exercises it until an incident.
We learned this trying to debug a watchdog on one of our nodes. I wrote a small harness that subscribed to the node's live log stream, then deliberately tripped the watchdog to capture the error sequence in context. The harness connected, authenticated, subscribed, provoked the fault on schedule, and captured zero log lines. The logs we needed weren't flowing through the path we assumed. We found out our logging was broken in the exact moment we were depending on it, which is the only moment anyone ever finds out.
The takeaways generalize: log the decisions your alerting makes, not just the raw events, so you can answer "why didn't this page?" after the fact. Make logs structured, so they're queryable under pressure rather than grep-and-pray. And test the logging path on purpose, because it will never test itself.
Why this is suddenly more urgent
For most of software history you could lean on a quiet fallback: somebody understood the code. The author was down the hall, the logic fit in a few heads, and a careful reading could substitute for good instrumentation. That fallback is eroding fast.
More than 30% of Google's new code is now AI-generated, per Sundar Pichai on the company's April 2025 earnings call; industry-wide, surveys put AI-assisted code around 40% of the total and climbing. The volume is going up and the share any individual fully understands is going down at the same time.
And the early reliability signals are not reassuring. Google's 2025 DORA report frames AI as an amplifier: it magnifies an organization's existing strengths and weaknesses rather than uniformly improving things, so a team without solid foundations just ships its problems faster. Independent reviews of real-world pull requests have reported AI-authored changes carrying meaningfully more correctness bugs than human-written ones, and analyses of large code corpora have flagged rising churn: code rewritten or reverted within days of being committed. More code, written faster, understood by fewer people, breaking at least as often.
Put those together and the conclusion is hard to dodge. When code is generated faster than anyone can internalize it, the running system becomes the only honest source of truth about what the code actually does. You cannot code-review your way to confidence in a codebase no single person holds in their head. You can only observe it. Alerting and logging stop being hygiene and become the primary interface through which humans supervise an engine increasingly assembled by machines. The AI writes the confident code; the alarms are what tell you the confident code is wrong.
So build the alarms first
The system in this story exists because it didn't: because a fault paged no one and an incident forced the work. That's the most expensive possible order to do it in, and it's the default order, because alerting is invisible when everything is fine and feels like overhead right up until the night it's the only thing that matters.
Retrofitting observability is harder than it sounds, and the reason is subtle: instrumentation encodes intent. To alert well you must state what "normal" is, which faults deserve a human, what each failure mode means. That knowledge is freshest while you're building the feature and evaporating the moment you move on. Bolt alerting on six months later and you're not just writing code. You're archaeologically reconstructing what the system was supposed to do from what it happens to do now.
So the discipline is to treat observability as a property, not a phase. Every component ships with its severity tier. Every failure mode ships with its tripwire. Every module ships with the log lines you'd want during its first outage. It's cheaper to wire the alarms while you still remember what the code was meant to do, and in an era where much of that code arrives pre-written by a model that remembers nothing and will not be on call, that window of understanding closes faster than it ever has.
The dashboard that knew about our fault and told no one is the whole lesson in one image. Monitoring shows you the house. Alerting is the smoke detector. You do not wait until you smell smoke to go buy one.
Sources
- AI Coding Assistant Statistics & Trends, 2025: adoption and share figures (incl. Google's >30% and the ~40% global estimate). secondtalent.com
- State of AI-assisted Software Development 2025 (DORA Report): AI as an amplifier of existing strengths and weaknesses, Google / DORA. dora.dev
- AI-authored code needs more attention, contains worse bugs: The Register, December 17, 2025. theregister.com
Thanks for reading.If this resonated, the easiest way to support the project is to forward it to one person who'd like it, or to subscribe to the letter.