The Model Is the Easy 10%: Where Agent Reliability Lives

Eighty-five percent of professional developers now use AI coding agents, and by the same account roughly two in five lines of new code are machine-generated. Those figures, presented in Google's latest agents course, settle one question and open a harder one. Writing code is no longer the constraint; making the system that writes it reliable enough to ship is. That gap, between a demo that dazzles in a meeting and a system you can run a business on, is where most enterprise budgets are currently being misspent.

The course frames every agent with one equation: agent = model + harness. The model — the reasoning engine everyone argues about — is treated as roughly 10% of the system. The other 90% is the harness: the sandboxes, tools, orchestration, and guardrails that make agentic coding reliable. The split is a teaching heuristic, not a measurement. But it points at the thing most enterprises have backwards. They are shopping for models and starving the harness.

The same material lays out a spectrum from casual "vibe coding" — prompt the model, paste the error back, repeat — to disciplined agentic engineering, where the model works inside structured, verifiable boundaries. The gap between the two is not a better model. It is entirely harness.

Dimension	Casual vibe coding	Agentic engineering
Verification	Paste the error back to the model	Automated tests, CI/CD gating, evaluation judges
Context	Whatever fits in the prompt	Engineered, loaded on demand
Output	Code that works this once	A system that produces working code repeatably
Cost shape	Low setup, unpredictable	Higher capex, lower opex
Accountability	Whoever sent the prompt	A named human in the loop

That contrast is the whole argument in miniature. A better model moves you down a single column. It does not move you across the table. The six components below are where that 90% lives, and where the reliability your pilots are missing actually comes from.

1. Give the Agent a Workshop, Not a Script

The weakest agent is a set of LLM calls with a custom prompt and a few tools bolted on. You write the prompt, the model answers, and if it is wrong you adjust the prompt. That is a script, and scripts run out of room fast.

The pattern Google's team reports far more success with is structurally different: hand the agent a sandbox where it can write its own code, build its own tools on the fly, and spawn sub-agents to check the work. Inside that environment the agent runs a loop — scaffold, build, observe, optimize — rather than producing a single answer. It tries something, runs it, reads what broke, and corrects, the way an engineer does. The model stops being the worker. It becomes the thing that builds and directs the workers.

This is what closes the distance between "80% vibe-coded" and the last stubborn 20%. The first 80% is a generation problem, and models are already good at it. The last 20% is an iteration problem, and iteration is impossible without an environment to iterate in.
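
To make the loop concrete, here is a minimal sketch of build, observe, and correct. Everything in it is an assumption for illustration: the names iterate, Observation, toy_propose, and toy_sandbox are hypothetical, the "model" is a stub that fixes its answer once it sees a failure log, and the "sandbox" is an in-process exec that offers no real isolation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    ok: bool   # did the checks pass?
    log: str   # what the sandbox reported


def iterate(propose: Callable[[Optional[str]], str],
            run_in_sandbox: Callable[[str], Observation],
            max_attempts: int = 5) -> tuple[Optional[str], int]:
    """Build, observe, and correct until checks pass or attempts run out."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)             # build
        observation = run_in_sandbox(candidate)   # observe
        if observation.ok:
            return candidate, attempt
        feedback = observation.log                # feed the failure back in
    return None, max_attempts


def toy_propose(feedback: Optional[str]) -> str:
    # Hypothetical stand-in for a model call: wrong first, right after feedback.
    return "return a + b" if feedback else "return a - b"


def toy_sandbox(body: str) -> Observation:
    # Hypothetical stand-in for a sandbox: NOT isolated, illustration only.
    namespace: dict = {}
    exec(f"def add(a, b):\n    {body}", namespace)
    got = namespace["add"](2, 3)
    return Observation(ok=(got == 5), log=f"add(2, 3) returned {got}, expected 5")


if __name__ == "__main__":
    print(iterate(toy_propose, toy_sandbox))
```

The point of the shape is that the second attempt only succeeds because the first one was allowed to run and fail. Take away run_in_sandbox and there is nothing to put in feedback.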

The course's own worked example is a migration. Moving a large codebase from TensorFlow to JAX is the kind of task that is technically straightforward and operationally miserable: thousands of mechanical changes, each able to break something three files away. Using an agentic approach inside a sandbox — change, run, observe the failure, fix, re-run — the team reported the work going six to eight times faster than the manual baseline. The speedup did not come from a smarter model writing better JAX. It came from giving the model a place to fail cheaply and often.

The same shape applies far outside Google. Picture an insurer trying to retire a twenty-year-old policy-rating engine written in a language nobody on staff still writes. A prompt-only agent produces plausible replacement code that nobody can trust. An agent with a sandbox, the old test suite, and a sub-agent comparing outputs against the legacy system can grind through the translation, flag the cases where new and old disagree, and hand a human a short list instead of a blank check.

The insurer scenario is a representative illustration, not a cited case.
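
The comparison step in that scenario can be sketched as a differential test: run the old and new implementations on the same inputs and keep only the disagreements. The rating functions, the test cases, and the tolerance below are invented for illustration and do not describe any real rating engine.

```python
def find_disagreements(legacy, candidate, cases, tolerance=0.01):
    """Return only the cases where the rewrite and the legacy code differ."""
    mismatches = []
    for case in cases:
        old, new = legacy(**case), candidate(**case)
        if abs(old - new) > tolerance:
            mismatches.append({"case": case, "legacy": old, "candidate": new})
    return mismatches


def legacy_rate(age, base):
    # Hypothetical legacy rule: surcharge for drivers under 25.
    return base * (1.5 if age < 25 else 1.0)


def candidate_rate(age, base):
    # Hypothetical rewrite with a boundary bug: 25-year-olds get surcharged.
    return base * (1.5 if age <= 25 else 1.0)


CASES = [{"age": age, "base": 100.0} for age in (18, 24, 25, 26, 40)]

if __name__ == "__main__":
    for mismatch in find_disagreements(legacy_rate, candidate_rate, CASES):
        print(mismatch)
```

Five cases go in and one comes out, the boundary at age 25. That single row is the short list a human reviews.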

When to use: Any task complex enough that a single prompt-and-response can't carry it — multi-step builds, migrations, refactors, anything where the agent needs to try, fail, and correct against real feedback.

Key insight: This capability does not come from a bigger model. It comes from the environment you put the model in. That environment — the sandbox, the test harness, the sub-agents, the ability to create tools on demand — is yours to build, and no vendor ships it for you.

2. Verification Is the Harness, Not a QA Step

The last mile of any agent task — doing it consistently, handling errors, surviving the edge cases — is both where agents fail and where the real engineering sits. You do not close that mile by testing at the end. You close it by building verification into the loop itself, in two layers.

The first layer is automated. A sub-agent evaluates the primary agent's output and sends it back to be revised, on repeat, before a human ever sees it. The second layer is a human gate, triggered by defined conditions rather than gut feel — a schema change, a payment above a threshold, a destructive operation. Most of the volume clears the automated layer. The human is spent only where human judgment actually changes the outcome.

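MAX_REVISIONS = 3       # assumed cap on automated revision rounds
golden_dataset = []     # curated record of every human decision

async def verified_run(agent, evaluator, task, needs_human, human_review):
    # async because human_review is awaited; a person may take a while to answer.
    result = agent.execute(task)
    passed = False
    # Automated layer: the sub-agent critiques and the primary agent revises.
    for _ in range(MAX_REVISIONS):
        review = evaluator.assess(result)
        if review.passes:
            passed = True
            break
        result = agent.revise(result, review.feedback)
    # Human layer: cases that meet an escalation condition, plus any result
    # that never passed automated review within the revision budget.
    if needs_human(result) or not passed:
        decision = await human_review(result)
        golden_dataset.append(decision)   # every review becomes training data
        return decision.output
    return result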

The golden_dataset.append line is the move most teams skip. Every human decision is captured into a curated, golden dataset, so the same correction is never needed twice. Verification stops being a tax on throughput and becomes the mechanism by which the agent gets better over time.
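
The escalation conditions themselves are worth writing down as code rather than leaving to judgment in the moment. A minimal sketch, assuming a hypothetical ProposedChange record and an invented payment threshold; the three conditions are the ones named above.

```python
from dataclasses import dataclass

PAYMENT_REVIEW_THRESHOLD = 10_000   # hypothetical limit, in currency units


@dataclass
class ProposedChange:
    # Hypothetical summary of what the agent wants to do.
    touches_schema: bool = False
    payment_amount: float = 0.0
    destructive: bool = False


def escalation_reasons(change: ProposedChange) -> list[str]:
    """List every defined condition this change trips, in a fixed order."""
    reasons = []
    if change.touches_schema:
        reasons.append("schema change")
    if change.payment_amount > PAYMENT_REVIEW_THRESHOLD:
        reasons.append("payment above threshold")
    if change.destructive:
        reasons.append("destructive operation")
    return reasons


def needs_human(change: ProposedChange) -> bool:
    """True only when at least one escalation condition is met."""
    return bool(escalation_reasons(change))
```

Returning the reasons, not just a boolean, gives the reviewer the context for why the item landed in their queue and gives the audit log something to record.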

If that sounds aspirational, it is already how the most advanced systems work. Google's AlphaEvolve — an agent that has produced genuinely novel optimizations to long-standing problems, including matrix multiplication — is built around an evaluator function that scores each candidate the model generates. The reason a system like that keeps improving the longer it runs is precisely that every attempt is graded. The evaluator is not a safety afterthought. It is the engine. The same principle scales down to a coding agent grinding on your backlog overnight: an agent that cannot check its own work cannot be trusted to run unattended, and an agent that can is the only kind worth pointing at a long-running job.
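
The principle, that every attempt is graded and only improvements survive, fits in a few lines. This is a toy hill-climb written for illustration and is not how AlphaEvolve is implemented. The evolve, mutate, and evaluate functions and the target value are all assumptions.

```python
import random


def evolve(seed_candidate, mutate, evaluate, generations=200, rng=None):
    """Keep a candidate only when the evaluator scores it higher than the best."""
    rng = rng or random.Random(0)
    best, best_score = seed_candidate, evaluate(seed_candidate)
    history = [best_score]
    for _ in range(generations):
        candidate = mutate(best, rng)
        score = evaluate(candidate)        # every attempt is graded
        if score > best_score:             # only graded improvements survive
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


def evaluate(x: float) -> float:
    # Hypothetical evaluator: higher is better, with the peak at x = 3.
    return -(x - 3.0) ** 2


def mutate(x: float, rng: random.Random) -> float:
    # Hypothetical generator: a small random perturbation of the current best.
    return x + rng.uniform(-0.5, 0.5)


if __name__ == "__main__":
    best, score, history = evolve(0.0, mutate, evaluate)
    print(best, score)
```

Because a candidate replaces the best only when it scores higher, the recorded best score can never go down, which is the property that makes running longer worthwhile. Remove evaluate and the loop is a random walk.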

Key insight: The approval step is not overhead. It is the training data for the next version — and the only honest evidence that the agent is getting better rather than just getting faster.

3. Context Engineering Beats Context Dumping

The course calls context engineering the real skill of modern engineering, and the distinction that matters to anyone paying the bill is between two kinds of context. Static context — system instructions, style guides, the whole rulebook — is loaded on every single call, and you pay for it every time. Dynamic context is pulled in only when the task needs it.

	Static context	Dynamic context
Example	System instructions, global rules	Skills loaded for the task at hand
Loaded	Every call	On demand
Cost	Paid on every call	Paid only when used
Best for	A small, universal core	Everything else

The naive instinct is to stuff more into the prompt — dump the whole repository into the context window and hope the model finds what matters. It is expensive, and it gets worse as the codebase grows.
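
A back-of-envelope calculation shows the size of the difference. All figures below are invented for illustration: the call volume, the token counts, and the share of calls that need the extra material.

```python
def daily_context_tokens(calls: int, static_tokens: int,
                         dynamic_tokens: int = 0, dynamic_calls: int = 0) -> int:
    """Context tokens per day: static on every call, dynamic only when loaded."""
    return calls * static_tokens + dynamic_calls * dynamic_tokens


# Hypothetical workload: 1,000 agent calls per day, 20,000 tokens of rules and docs.
everything_static = daily_context_tokens(calls=1_000, static_tokens=20_000)

# Same material split: a 2,000-token core, with the other 18,000 tokens
# loaded on the 100 calls that actually need them.
engineered = daily_context_tokens(calls=1_000, static_tokens=2_000,
                                  dynamic_tokens=18_000, dynamic_calls=100)

if __name__ == "__main__":
    print(everything_static, engineered)
```

Under these assumed numbers the dump costs 20,000,000 context tokens a day and the engineered split costs 3,800,000, roughly a fifth, before any quality effect from a less cluttered prompt.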

The structural alternative is to give the agent a map before it writes a line. Represent the system as linked plain-text files, one per meaningful thing — a service, a database, a contract. Think of index cards pinned to a board with strings between them. Each card describes one component; each string is a real dependency. A graph traversal over those cards lets the agent answer the question that separates an engineer from an autocomplete: if I change this, what breaks?

[checkout-service] ──calls──> [payments-api] ──reads──> [ledger-db]
        │                           │
     emits                      enforces
        ▼                           ▼
 [order-events]              [pci-policy.md]

Ask an autocomplete to "update the payments API" and it edits one file. Ask an agent that can walk this graph, and it sees that checkout-service calls the API, that the change touches ledger-db, and that pci-policy.md governs the path — so it surfaces the blast radius before touching anything. That is a denser, cheaper, and far safer representation than pasting the whole repo into a prompt.
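
The traversal itself is small. A sketch under stated assumptions: the edges are typed in by hand from the diagram above, where a real setup would parse them out of the linked files, and the names reachable and blast_radius are hypothetical.

```python
from collections import deque

# Edges copied from the diagram: (source, relation, target).
EDGES = [
    ("checkout-service", "calls", "payments-api"),
    ("payments-api", "reads", "ledger-db"),
    ("checkout-service", "emits", "order-events"),
    ("payments-api", "enforces", "pci-policy.md"),
]


def reachable(start, edges, reverse=False):
    """Breadth-first search; reverse=True walks edges from target to source."""
    adjacency = {}
    for source, _relation, target in edges:
        a, b = (target, source) if reverse else (source, target)
        adjacency.setdefault(a, []).append(b)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor != start and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def blast_radius(component, edges=EDGES):
    """What depends on the component, and what the component depends on."""
    return {
        "dependents": sorted(reachable(component, edges, reverse=True)),
        "dependencies": sorted(reachable(component, edges)),
    }


if __name__ == "__main__":
    print(blast_radius("payments-api"))
```

For payments-api this returns checkout-service as the dependent and ledger-db plus pci-policy.md as the dependencies: a handful of card names in the context window instead of the repository.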

Why it matters: In long-running agents, a growing share of the time is spent not on the model thinking but on the agent fetching and grinding through context and external tools. Many of those tools were built for humans — they assume a person is clicking, so latency is fine and parallelism was never a requirement. Point an agent at them and that assumption becomes your bottleneck. (This is exactly why interoperability standards like MCP matter, and why "how agents plug into the outside world" is its own field of work.) Engineering the context — a tight map instead of a firehose — is the most direct lever you have on that cost.
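
The parallelism point is easy to see in miniature. In this sketch call_tool is a hypothetical stand-in that just sleeps to simulate latency; the tool names and the concurrency limit are assumptions. Only asyncio.gather and asyncio.Semaphore are real library calls.

```python
import asyncio


async def call_tool(name: str, latency_s: float = 0.05) -> str:
    # Hypothetical tool call: the sleep stands in for network or tool latency.
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def fetch_sequential(names):
    """One call at a time, the way a person clicking through a UI would."""
    return [await call_tool(name) for name in names]


async def fetch_concurrent(names, limit: int = 8):
    """All calls in flight at once, capped so the tool is not overwhelmed."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(name):
        async with semaphore:
            return await call_tool(name)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(name) for name in names))


if __name__ == "__main__":
    tools = ["search", "read_file", "query_db", "fetch_ticket"]
    print(asyncio.run(fetch_concurrent(tools)))
```

Sequentially, total wait grows with the number of calls; concurrently, it is closer to the slowest single call. That only helps if the tool behind call_tool tolerates parallel requests, which is exactly the assumption human-oriented tools were never built to meet.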

4. Specification Becomes the Bottleneck You Forgot to Staff

When the implementation phase collapses from weeks to minutes, the constraint does not vanish. It moves to the two ends humans still own: stating precisely what to build, and confirming it was built right. The course names specification quality as the primary new bottleneck of the AI-driven development lifecycle, and the logic is hard to argue with. A faster model executing a vague spec just produces the wrong thing sooner.

The difference is concrete. Compare two ways of asking for the same feature:

"Add a feature to let users export their data."

"Add a CSV export to the account page, available only to the account owner, capped at 50,000 rows, rate-limited to one export per minute, excluding soft-deleted records, with the request and result logged to the audit trail."

The first invites the agent to guess, and it will guess differently every run. The second is executable. The scarce skill is no longer typing speed or syntax recall — it is the judgment to write the second prompt and to verify what comes back against it.
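
One test of whether a spec is executable is whether it can be written down as data plus checks. A sketch of the second request in that form; ExportSpec, check_export, and the row format are hypothetical. Note one assumption the code is forced to make: "capped at 50,000 rows" is read here as truncation, not rejection, which is the kind of ambiguity that only surfaces once someone tries to encode the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportSpec:
    max_rows: int = 50_000
    min_seconds_between_exports: int = 60
    owner_only: bool = True
    include_soft_deleted: bool = False


def check_export(spec, *, requester_is_owner, seconds_since_last_export, rows, audit):
    """Apply the spec to one request; returns (allowed, rows_to_export, reason)."""
    if spec.owner_only and not requester_is_owner:
        outcome = (False, [], "not the account owner")
    elif (seconds_since_last_export is not None
          and seconds_since_last_export < spec.min_seconds_between_exports):
        outcome = (False, [], "rate limited")
    else:
        kept = [r for r in rows if spec.include_soft_deleted or not r.get("deleted")]
        outcome = (True, kept[: spec.max_rows], "ok")   # assumption: cap truncates
    # Request and result both go to the audit trail, allowed or not.
    audit.append({"allowed": outcome[0], "rows": len(outcome[1]), "reason": outcome[2]})
    return outcome
```

Each clause in the sentence maps to a field or a branch. The vague version maps to nothing, which is why the agent has to guess.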

That judgment also changes how the work is shaped day to day. The course describes engineers moving between two modes: a conductor directing real-time edits in the IDE, and an orchestrator asynchronously delegating whole tasks to networks of agents and reviewing the results. A director-level leader is increasingly running a portfolio of agents the way they once ran a team of people — and the same things that make a brief to a person succeed or fail (clear scope, defined done, a way to check the work) decide whether the agent does.

Key insight: You cannot out-model a bad specification. The work that used to be the senior 20% of the job — saying exactly what "done" means — is now most of the job. Fund the people who can do it.

5. Optimize the Workflow, Not the Step

Speed up one stage in isolation and the bottleneck simply relocates. Make coding ten times faster and testing becomes the constraint; the course's own framing for this is whack-a-mole — push one part down and another pops up. The discipline is to map the whole AI-infused workflow and optimize across it, not to celebrate a fast demo of one piece while the queue backs up behind it.
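
The arithmetic behind whack-a-mole is that a pipeline moves at the pace of its slowest stage. A sketch with invented capacities, in changes per week; the stage names and numbers are assumptions.

```python
def bottleneck(stages: dict[str, float]) -> tuple[str, float]:
    """The slowest stage sets the throughput of the whole workflow."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Hypothetical capacities, in changes per week.
before = {"specify": 12, "code": 5, "test": 8, "review": 10}

# Make coding ten times faster and leave everything else alone.
after = {**before, "code": before["code"] * 10}

if __name__ == "__main__":
    print(bottleneck(before), bottleneck(after))
```

With these numbers, a tenfold coding speedup lifts the whole workflow from 5 to 8 changes a week, a 1.6x gain, and the constraint is now testing.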

This is also where the economics get misread. Agentic engineering is a high-capex, low-opex trade: you spend more up front (building the harness, setting up sandboxes and evals, paying for tokens and infrastructure) to spend far less on the recurring cost, which is developer time and effort. Leaders who benchmark it against the old model on upfront cost alone will conclude it looks expensive, because they are pricing the investment as if it were the running cost.

	Traditional development	Agentic engineering
Upfront cost (capex)	Low	High — harness, evals, tooling, tokens
Recurring cost (opex)	High — developer hours	Low — supervision and exceptions
Reads as "expensive" when	Never up front	Judged on capex alone
Pays off when	—	Amortized across volume over time
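
"Amortized across volume" has a simple break-even form: the extra upfront spend divided by the saving per task. The figures below are invented for illustration and are not benchmarks.

```python
def cumulative_cost(upfront: int, per_task: int, tasks: int) -> int:
    """Total spend after a given number of tasks."""
    return upfront + per_task * tasks


def break_even_tasks(trad_upfront, trad_per_task, agent_upfront, agent_per_task):
    """Smallest task count at which the agentic option costs no more in total."""
    saving_per_task = trad_per_task - agent_per_task
    if saving_per_task <= 0:
        return None   # no recurring saving, so the investment never pays back
    extra_upfront = max(agent_upfront - trad_upfront, 0)
    return -(-extra_upfront // saving_per_task)   # ceiling division on integers


# Hypothetical figures in one currency unit.
TRADITIONAL = {"upfront": 10_000, "per_task": 400}
AGENTIC = {"upfront": 250_000, "per_task": 100}

if __name__ == "__main__":
    print(break_even_tasks(TRADITIONAL["upfront"], TRADITIONAL["per_task"],
                           AGENTIC["upfront"], AGENTIC["per_task"]))
```

Under these assumptions the agentic option is 25 times more expensive on day one and level at 800 tasks. Judge it at task zero and it loses; judge it at task 5,000 and it is not close.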

A useful filter for whether a use case is worth that investment came from one of the Google engineers: Impressive, Useful, Sustainable. The first demo is almost always impressive and almost always cheap — you wire up one case and it looks like magic. Useful means it generalizes past the single example you built it for. Sustainable means it is scalable, secure, and economical enough to keep running — and plenty of AI use cases land at three times the cost of the current way of doing the work, which makes them impressive and useful but not yet sustainable. (Even free trials hit this wall: the course itself rations token quota because unbounded usage is not sustainable.)

Bottom line: Most enterprise pilots die in the gap between Impressive and Sustainable. They demo well, survive a second use case, and quietly fail the unit economics. Budget for sustainable from the start, or you have funded a demo and called it a strategy.

6. Build the Controls That Keep a Human in Command

Everything above makes the agent more capable. This section is about staying in charge of it — and it is the part that turns an architecture conversation into a board-level one.

Start with the failure modes. One Google engineer uses a blunt checklist — hate, harm, hallucinations — alongside grounding outputs in real sources and watching for bias in the training data. For an AI that writes and ships code, hallucination is not a quirk; it is a plausible-looking function that does the wrong thing and passes a shallow review.

The subtler risk is slower and more corrosive. As more of the codebase is written and managed by AI, the team's own expertise with it erodes. That creates two problems the model leaderboard never mentions. The first is accountability: when something breaks at 2 a.m. in code no human wrote or fully understands, who owns it? The second is a lost opportunity for improvement — much of human ingenuity comes from deep familiarity with a system, and that familiarity is exactly what atrophies when you stop reading your own codebase. And as a Google engineer noted, thinning technical expertise widens security gaps precisely as the surface area of machine-written code grows.

These are governable, but only deliberately.

Risk	What it looks like	Control
Hallucinated logic	Code that looks right, encodes the wrong rule	Evaluation judges, grounding, tests written from the spec
Expertise erosion	Nobody on the team can explain the module the agent owns	Human-in-the-loop on architectural changes; agents that document as they go
Diffuse accountability	An incident in AI-written code with no clear owner	A named human owner per system; every agent decision logged
Security drift	New attack surface in code no one reviewed	Mandatory review gates on sensitive paths; security treated as a workflow stage
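
Two of those controls, a named owner per system and mandatory review on sensitive paths, reduce to a lookup. A sketch with hypothetical path patterns and owner names; fnmatchcase is from the Python standard library, and its * wildcard also matches path separators.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from sensitive path patterns to a named human owner.
SENSITIVE_PATHS = {
    "payments/*": "payments-owner",
    "migrations/*": "data-owner",
    "auth/*": "security-owner",
}


def required_reviewers(changed_paths, sensitive=SENSITIVE_PATHS):
    """Map each named owner to the changed files that need their sign-off."""
    reviewers = {}
    for path in changed_paths:
        for pattern, owner in sensitive.items():
            if fnmatchcase(path, pattern):
                reviewers.setdefault(owner, []).append(path)
    return reviewers


if __name__ == "__main__":
    change = ["payments/api/charge.py", "docs/readme.md", "auth/login.py"]
    print(required_reviewers(change))
```

An empty result means the change can clear on automated checks alone. A non-empty one names the specific people who must approve it, which is what "a named human owner" means in practice.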

Why it matters: The same harness that delivers reliability — verification loops, human gates, the dataset of every decision, the audit log — is also what keeps a human genuinely in command rather than nominally in charge. Control is not a constraint you bolt on after the agent works. It is part of what makes the agent worth deploying at all.

The Meta-Pattern: The Harness Is the Moat

Raw engineering power is now close to abundant. Anyone can fire up a studio, call a frontier model, and have working code in minutes, work that used to take a multi-day setup and a hired team. When the model is a commodity available to everyone on the same terms, it cannot be your advantage. By definition, your competitors are renting the identical 10%.

Models also commoditize on a quarterly cadence. The next release will be cheaper and better, and it will not build your sandbox, write your evaluation suite, design your human-in-the-loop gates, curate your golden datasets, or draw your system's dependency map. The harness does not commoditize. It compounds. As the course puts it, your output as an organization is no longer the code — it is the system that produces the code. That system is the asset on the balance sheet, and the version of it you have in two years is worth more than the one you have today only if you have been building it.

There is a real industry ambition behind all of this — the path from prompt to prototype to production to, eventually, a profitable company, the way video platforms once turned anyone with a camera into a broadcaster. That future is plausible. But it rewards the people who built the harness around the model, not the ones who kept refreshing the model leaderboard. The role of the senior engineer and the leader above them shifts from author to conductor — and a conductor is only in command if the harness gives them something real to hold.

So the buying decision is simpler than the leaderboard makes it look. You rent the model. You build the harness. Only one of them is a moat.

Figures and frameworks here are drawn from Day 1 of Google and Kaggle's 5-day AI agents course: the agent = model + harness heuristic, the scaffold/build/observe/optimize loop, the static-vs-dynamic context distinction, the linked-knowledge-graph concept, the impressive/useful/sustainable filter, the hate/harm/hallucinations checklist, the high-capex/low-opex framing, the conductor/orchestrator modes, the reported six-to-eight-times-faster TensorFlow-to-JAX migration, and AlphaEvolve as an evaluator-driven agent. The adoption statistics (≈85% of developers, ≈41% of new code) were presented in-session without independent sourcing and should be treated as directional. The insurer and feature-spec examples are representative illustrations, not cited cases. Interpretation, structure, and the executive framing are my own.

Aayush Mediratta advises enterprise leaders on architecting and deploying autonomous AI agents in production.