If your coding agent doesn't reduce your maintenance bill, it's costing you twice

Productivity is not lines of code per day. It's the code that's still in the repo six months later, minus the rework, review time, pages, and rewrites it dragged in. James Shore made the same argument last week in An AI coding agent, used to write code, needs to reduce your maintenance costs — still climbing Hacker News, because most of us have felt the speedup from Cursor and Claude Code and quietly suspect we're going to pay for it later. We are. 2x output with 2x maintenance is zero, and most of the agents we're shipping with today land on the wrong side of that math.

Farfield is built on the other side of it. This post is about why the side matters more than the speed.

The maintenance bill nobody put on the invoice

When Cursor Background opens a PR while you're at lunch, the cost shows up in three places the demo doesn't: the reviewer who has to read it line-by-line because it looks confident, the on-call who gets paged at 3am when the "almost right" fix turns out to be almost, and the engineer six months later who finds the same helper duplicated in four files because the agent didn't know the other three existed.

Not hypothetical. A few numbers from the last two quarters:

The first year code duplication (copy-paste churn) overtook moved code in the public commit corpus was 2025 — a 20-year inversion. Agents wrote most of the delta.
A 2026 survey found 43% of AI-generated code changes need debugging in production.
GitHub's own data: agent PRs take 91% longer to review than human PRs of comparable size.
LangChain's State of AI Agents 2026 ranks quality, not capability, as the #1 deployment blocker for the second year running.

extra review time · per agent PR+91%GitHub, "Agent pull requests are everywhere — here's how to review them," 2026. Reviewer time is the biggest hidden cost of AI throughput, and the line item nobody puts in the pitch when they ask for seat budget.

The shape is the same across every team we talk to. Output goes up. Throughput, measured honestly, doesn't.

Two productivity equations

The case for buying more Cursor seats rests on the first equation. The case against most of what's shipping in 2026 rests on the second.

view	equation	what it optimizes for
naive	productivity = output	LoC/day, PRs/week, demos
honest	productivity = (output − rework − review − escalation) ÷ maintenance debt growth	code that survives 6 months

The numerator looks similar in both. The denominator is where it splits. A tool that doubles output and triples the denominator is a regression you're paying for. A tool that lifts output 30% and halves the denominator is the biggest win an engineering team has had this decade. The market hasn't priced the second one yet because it doesn't demo well.

Where the cost actually accrues

Four patterns show up in almost every codebase we scan. None are exotic. All of them are the kind of "looks fine in review" debt that compounds quietly until you can't ship a feature without renaming three files.

Duplicate helpers. The agent writes a utility that already exists three directories away under a slightly different name. The reviewer doesn't catch it — the diff is small, self-contained, looks correct. Six months later there are four versions of formatRelativeTime in the repo and two of them are subtly wrong.

// new file Cursor Background created on a Tuesday
function formatRelativeTime(ts: Date): string {
  const diff = Date.now() - ts.getTime()
  if (diff < 60_000) return "just now"
  if (diff < 3600_000) return `${Math.floor(diff / 60_000)}m ago`
  return `${Math.floor(diff / 3600_000)}h ago`
}
 
// what already lived in src/lib/time.ts, with timezone handling,
// a `now` injection for tests, and 14 unit tests behind it
export function timeAgo(ts: Date, now = new Date()): string { /* ... */ }

The new one ships. The old one rots. Two months later someone files a bug about times being wrong near midnight UTC, and a senior engineer burns half a day figuring out which timeAgo is on the page.

Inconsistent patterns inside one repo. The agent picks a different framework for the same problem each time it sees it — react-query in one route, manual useEffect + state in the next, a custom hook in the third. Each is defensible on its own. The repo becomes ungreppable.

Symptom patches. The visible error is a null deref. Claude Code adds a ?. chain. The actual cause was an upstream API contract that changed three deploys ago and silently started returning null for a previously-required field. The patch ships, the error stops, the misunderstood contract spreads to two more callsites the next week.

Reviewer overload. Every PR looks plausible. You can't trust a clean diff at face value anymore, so you read every line — or you stop reading carefully and merge on vibes. Both failure modes cost you.

What maintenance-subtracting actually looks like

An agent reduces maintenance cost when it does work that would otherwise have been done later, by a human, more expensively. There are four moves that count:

Catch bugs before merge. Ideally before the PR even goes up. The cost of a defect grows by roughly an order of magnitude at every stage it survives — dev, review, staging, prod, customer-visible. A finding that lands as a pre-merge comment is worth ~10x the same one caught by an alert, and ~100x the one that pages the on-call. This is why Farfield runs as a background scanner against the main branch, not as another IDE assistant — the IDE is too early to know what the rest of the codebase looks like, and the PR is too late to redesign.
Find root causes, not symptoms. When we file an issue, the report points at why it's wrong, not just where it crashes. The agent traces upstream until the failure stops being explainable by local code. Symptom-patching agents are net-additive on maintenance — they hide the signal that would have driven a real fix.
Deduplicate across the repo. When the same finding shows up in nine files, we file it once with nine attachments. When a suggested fix overlaps with an existing helper, we say so and link to it. Issue-level dedup is the boring half; the interesting half is fix-level — refusing to introduce the fourth formatRelativeTime when three already exist.
Get smarter from what just happened. Every accepted fix, every dismissed finding, every "we tried that in Q3 and it broke staging" Slack reply has to make the next scan better. An agent that produces the same false positive on the same file three months in a row is charging maintenance interest, not paying it down.

The substitution test on this list is the one we apply to every claim in the post: swap "Farfield" for "Cursor" or "Devin" or "Claude Code" and see if the sentence is still true. If it is, we haven't said anything. Most autocomplete-shaped tools fail (1) — they can't see what hasn't been merged. Do-it-for-you agents fail (2) — they're rewarded for closing the loop locally. Almost none of them clear (3) and (4) — there's no persistent state to compound from.

The asymmetry that breaks the equation

Here's the one paragraph that earns the post. Writing code is now cheap. Reading it, understanding it, debugging it at 2am, operating it, and rewriting it when an assumption changes — all of those cost what they did in 2019. The price of the first derivative collapsed; nothing after it moved. Any tool that compresses the cheap side of that asymmetry while inflating the expensive side is, on a long enough horizon, a tax on the team that ships with it.

Output and maintenance debt tracked roughly together for two decades. They stopped in 2024. The interesting question about any agent is which curve it's accelerating.

How to tell if your agent is making you slower

If you're the engineer or founder making the call on whether to keep Cursor Background, Devin, or Codex in the loop, the question isn't "how fast does it ship features?" Every tool demos well on that axis. The question is what the maintenance bill looks like per merged PR, measured over a window long enough for the second-order costs to land.

What we'd actually want to know before paying for another seat:

How much of the agent's code is still alive in six months. Pull every PR the agent touched, run git blame today. If less than 60% of those lines are still there, you're paying for a deletion-and-rewrite cycle dressed up as throughput.
Did review get slower. Average time-to-review per PR, before rollout vs. after. If it went up by more than the throughput went up, the agent is net-negative and you're funding it with reviewer hours.
How many copies of the same helper. Similarity scan over functions added in the last quarter. Count anything with a >0.85 match to a pre-existing function in the same repo. We see teams with five formatCurrencys and no one wants to be the person who consolidates them.
Which incidents trace back to an agent merge. Of the last 90 days of incidents, how many touch code the agent wrote or substantially edited? Compare to your baseline before adoption. If the on-call rotation has a feel for which days are "Cursor days," that's the answer.
Prove the tool got better. Show one finding from last month that the agent wouldn't have caught the month before. One memory entry, one learned rule, one dismissed-pattern record. If there isn't one, the tool isn't compounding — it's running the same prompt against your code every day and charging you for the privilege.

None of these are hard to answer. None of them are on the dashboards most coding agents ship in 2026, which is itself a signal.

Cloudflare on review at scale and the broader "2026 is the year of AI code quality" thread converge on the same point: the missing piece isn't faster authoring, it's something running in the background that pays for itself in caught defects.

Where we cost the team

We don't get to write this post and skip the limits section. Farfield reduces maintenance cost on net, but it's not zero-cost to operate. The honest list:

Setup. Connecting repos, tuning severity, picking which Slack threads to listen in. A new workspace is about an hour for one engineer. Multi-repo setups take longer.
Triage. Every finding gets looked at by someone, even the dismissed ones. Precision is the metric we obsess over internally, but a real low-severity issue still costs you 30 seconds to mark "won't fix."
Scan compute. Deep scans aren't free. We absorb the LLM cost on credit-based plans, but it's real money and it doesn't scale for free.
What we don't catch. We're strong on cross-file logic bugs, contract drift, duplication, and root-cause traces. We're weak on visual regressions, perf regressions that need load to manifest, and security issues that need fuzzing. We name the boundary so you don't assume coverage you don't have.

If you're a four-person team and the maintenance bill hasn't hit yet, you don't need us yet. The math here is a function of codebase size, team size, and how aggressively you're already merging agent PRs. Below some threshold, careful review and small PRs are still the cheapest answer.

The pile that wins

The next twelve months sort coding agents into two piles. One keeps optimizing the cheap side of the asymmetry — faster authoring, more PRs per week, bigger demos at YC Demo Day. The other subtracts from the expensive side: catching defects before merge, refusing to introduce a fourth duplicate, naming the root cause instead of the symptom, and getting measurably better at the repo it lives in.

We know which pile we're in. The interesting question is which pile yours is in.