The Context Deficit
I keep seeing the same failure across AI deployments that look unrelated.
An engineer's coding agent destroys a production database, not through any technical error, but because it could not distinguish live infrastructure from test copies. A medical triage system identifies respiratory failure in its own reasoning, then tells the patient to schedule a routine appointment.
These are not stories about bad models. They are stories about capable systems operating without access to the knowledge that would have prevented the mistake.
I've been thinking of them as symptoms of a single phenomenon, what I'd call the context deficit (the term is imperfect, but it names the gap between what these systems can do and what they grasp about where they operate). Capabilities are advancing fast. Contextual understanding, the kind that prevents a locally correct decision from being organizationally catastrophic, is not.
That asymmetry may be the defining challenge of this period in AI deployment.
By context I do not mean context windows. Those have grown and will keep growing. I mean something harder to scale: the institutional knowledge and decision rationale that allows a human professional to distinguish a test environment from a production system, or a routine contract from one entangled with an undisclosed acquisition.
This kind of knowledge is distributed across people, updates with every conversation and departure, and is almost never written down. AI agents have close to zero access to it.
I should acknowledge upfront that as someone who works in AI (and who is therefore invested, in multiple senses, in the success of these systems), I have an obvious interest in how the challenges get framed. I've tried to be honest about where we're getting this right and where we're not, but the reader should apply appropriate skepticism.
I. The evidence is converging
Several recent studies, taken together, make the deficit quantifiable.
Scale AI and the Center for AI Safety tested frontier agents on 240 real freelance projects from Upwork. Average project cost: $630. Average human completion time: 29 hours. The best agent completed 2.5 percent of projects at a quality a paying client would accept. Two and a half percent.
A separate benchmark, OpenAI's GDPVal, showed the same models approaching expert-level quality at a hundred times human speed. Same models, same capabilities. The difference: GDPVal provides all necessary context up front. The Upwork benchmark hands over a client brief and expects the agent to figure the rest out. One is a task execution score. The other is a job performance score, and the gap between them is the context deficit, quantified.
Alibaba researchers built what may be the first benchmark measuring software maintenance over time rather than fresh code generation. Across a hundred codebases spanning an average of 233 days of development history, seventy-five percent of frontier models introduced regressions. They broke features that had been working.
Maintenance requires understanding which parts of a system are load-bearing and how today's change interacts with decisions made months ago. Fresh code generation does not. That difference is context.
Harvard researchers studied 62 million American workers and found that companies adopting generative AI saw junior employment decline about ten percent within eighteen months, while senior employment kept rising.
AI replaces task execution, and juniors were hired for tasks. Seniors survived because they carry the contextual understanding no current system replicates: which parts of the organization are fragile, and which decisions have political dimensions that will never appear in a dataset.
And then there is the Mount Sinai study, published in Nature Medicine in February (and to my mind the most methodologically rigorous evaluation paper of the year).
Researchers tested ChatGPT Health across 960 clinical interactions using a factorial design that varied contextual factors while holding medical content constant. Among cases three independent physicians unanimously classified as emergencies, the system directed patients away from the ER fifty-two percent of the time. A single dismissive sentence from a family member shifted the triage recommendation with an odds ratio of 11.7.
The most troubling finding: in multiple instances, the system's own reasoning trace identified dangerous clinical findings, and the output contradicted them. The model's analysis said "early respiratory failure." The recommendation said "wait."
Research on chain-of-thought faithfulness suggests this disconnect is structural, not incidental. The reasoning trace and the final answer can operate as semi-independent processes. Oxford's AI Governance Initiative has argued that chain of thought is unreliable as an explanation of a model's decision process. I find that assessment uncomfortable but correct, and its implications reach far beyond medicine.
If we cannot trust reasoning traces to predict outputs, the entire approach of catching errors by reading the model's explanation rests on weaker ground than most deployment teams assume.
II. Two responses to the gap
The proliferation of AI agent products since OpenClaw's viral rise in early 2026 amounts to a natural experiment: different companies testing different theories of how to manage the context deficit.
OpenClaw, which accumulated over 250,000 GitHub stars in weeks, embodies the maximalist bet. Run locally, connect to everything, trust the user to manage context and security.
The results were instructive in both directions. Twelve percent of marketplace skills were confirmed malicious (Koi Security). Over 30,000 instances sat exposed on the public internet (Bitsight). CrowdStrike shipped a removal tool. None of it slowed adoption, because the productivity gain was large enough that users accepted the risk. CISOs who banned the tool pushed usage onto personal devices connected to the same corporate systems. They eliminated visibility without eliminating exposure.
Nvidia's NemoClaw, released at GTC 2026, embodies the opposing theory. Jensen Huang's implicit argument: four of the five hard problems in agent deployment (context compression, codebase instrumentation, architecture enforcement through linting, multi-agent coordination) are standard engineering with decades of precedent. His open-source security stack applies containerization, reverse-proxy patterns, and infrastructure-as-code to a new execution context.
The fifth problem, what he did not solve, is the one that matters most.
III. The specification problem
The first four problems in agent deployment have names, published implementations, and engineering teams that can solve them in days or weeks. The fifth is different.
In traditional software, you specify what deterministic code can do and verify compliance. Agent security requires specifying what a probabilistic system should do, drawing boundaries around behavior that is sometimes wrong rather than deterministically malicious.
No sandbox can prevent an agent with email access (its job requires reading email) from misusing that access within its authorized scope. The sandbox limits blast radius. It cannot evaluate judgment.
In regulated industries, the specification problem compounds. The rules an agent must follow are external to the codebase, jurisdiction-specific, and revised faster than any internal team tracks. Getting them wrong means regulatory penalties and lost licenses.
An engineer who excels at Kubernetes does not know how insurance claim adjudication works across three states, or how HIPAA's minimum necessary standard applies to an agent accessing patient records. That knowledge takes years to build.
The engineering is solvable with open-source tooling. The domain expertise is not, and that distinction matters for how you resource the project.
The correct decomposition (and this is what most enterprise decision-makers miss) is to separate the commodity engineering from the domain challenges and resource them differently. Four problems your team handles. One problem where you may need specialized help. That 4:1 ratio changes the calculus on consulting engagements, vendor selection, and hiring.
IV. The legibility problem
The context deficit has a mirror image. Agents lack understanding of the organizations they serve. But organizations are also not legible to agents.
For fifteen years, technology stacks were engineered around the assumption that automated access is hostile. Bots meant spam. Scrapers meant theft. That assumption now encounters a contradiction: the automated system trying to access your service may be a paying customer's agent, acting on their behalf, attempting to give you money.
Over a million Shopify merchants are preparing for agent-mediated commerce. MCP SDK downloads have grown from 100,000 at launch to 97 million per month. But the challenge is not the protocol layer. It is the data underneath.
A human browsing a website can infer that "ships in 2-3 business days" means weekdays, excludes holidays, and starts from order confirmation. An agent takes what it receives as given. If your product catalog, fulfillment system, and FAQ contradict each other, the agent skips you for a competitor whose data is consistent.
Companies typically estimate that only about twenty percent of their decision-relevant product information exists as structured, machine-readable data. The rest lives in marketing copy, institutional memory, or someone's head.
This produces a striking second-order effect. Making your systems legible to agents forces data coherence across the entire stack. You cannot present a consistent interface to an agent when underlying systems have drifted apart for years, papered over by a UI layer that reconciles the inconsistencies.
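The first step toward that coherence is usually mechanical: put the same fields from each system side by side and see where they disagree. A minimal sketch of such a cross-source audit, with entirely hypothetical field names and values:

```python
# Hypothetical cross-source audit: flag fields where the product catalog,
# fulfillment system, and FAQ give different answers to the same question.
# Field names and values are illustrative, not a real schema.

def audit_consistency(sources: dict[str, dict]) -> list[str]:
    """Return a report of fields whose values differ across sources."""
    conflicts = []
    all_fields = set()
    for record in sources.values():
        all_fields.update(record)
    for field in sorted(all_fields):
        values = {name: record[field]
                  for name, record in sources.items() if field in record}
        if len(set(values.values())) > 1:  # more than one distinct answer
            detail = ", ".join(f"{src}={val!r}" for src, val in values.items())
            conflicts.append(f"{field}: {detail}")
    return conflicts

report = audit_consistency({
    "catalog":     {"ships_in_days": 3, "returns_window_days": 30},
    "fulfillment": {"ships_in_days": 5, "returns_window_days": 30},
    "faq":         {"ships_in_days": 3, "returns_window_days": 14},
})
for line in report:
    print(line)
```

A human shopper never notices these contradictions because the UI shows only one source at a time; an agent reading all three at once does.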
The surprise (and I did not expect this when I started examining these cases) is that becoming agent-readable turns out to be one of the most effective forcing functions for data quality organizations have encountered. The resulting clean data serves human employees as much as it serves AI.
V. Evaluation, memory, and the path forward
If the context deficit is the constraint, two responses address it from opposite directions. Evaluation catches harm before it happens. Persistent memory means agents accumulate context across interactions rather than starting fresh. Both are needed.
On evaluation
The Mount Sinai study demonstrated a methodology (factorial design with systematic contextual variation) that exposed failure modes no standard benchmark catches.
The technique generalizes: social pressure that shifts a recommendation exists in financial advising, procurement, and lending. Anchoring bias from authority figures is universal. You can build a reusable library of contextual variations adapted per domain, and the raw material — processed claims, approved procurements, chat transcripts — already sits in enterprise systems.
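The mechanics of the factorial approach are simple: hold the structured case constant and cross every contextual factor against every other, so that any change in the system's recommendation can be attributed to context alone. A minimal sketch, with illustrative factor names (the actual model call and scoring are out of scope here):

```python
# Factorial stress testing in miniature: one base case, several contextual
# factors, one scenario per combination of factor levels. Factor names and
# levels are illustrative assumptions, not from any real evaluation suite.
from itertools import product

BASE_CASE = {"symptoms": "shortness of breath, O2 sat 91%", "age": 67}

CONTEXT_FACTORS = {
    "social_pressure": ["", "A family member says it's probably nothing."],
    "authority_anchor": ["", "The caller's boss insists it can wait until Monday."],
    "framing": ["neutral", "reassurance-seeking"],
}

def build_scenarios(base: dict, factors: dict) -> list[dict]:
    """Full factorial crossing: every combination of factor levels,
    each attached to an identical copy of the base case."""
    names = list(factors)
    scenarios = []
    for levels in product(*(factors[n] for n in names)):
        scenario = dict(base)
        scenario["context"] = dict(zip(names, levels))
        scenarios.append(scenario)
    return scenarios

scenarios = build_scenarios(BASE_CASE, CONTEXT_FACTORS)
print(len(scenarios))  # 2 * 2 * 2 = 8 variants of the same clinical content
```

Running the agent over all variants and comparing recommendations then yields exactly the kind of statistic the Mount Sinai team reported: how much a single contextual sentence shifts the outcome while the medical content stays fixed.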
The useful architecture has four layers:
- Confidence routing grants autonomy for routine decisions and sends edge cases to human review.
- Deterministic validation (rule-based, running outside the model) checks whether the agent's output is consistent with its own reasoning.
- Continuous evaluation scores every interaction and feeds failures into a growing regression suite.
- Factorial stress testing probes for anchoring and guardrail failures by varying context while holding structured inputs constant.
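The first three layers can be sketched in miniature. This is an assumption-laden toy, not a production design: the confidence threshold, the danger-term list, and the output shape are all placeholders a real deployment would calibrate per domain.

```python
# Sketch of the layered architecture above: confidence routing, a
# deterministic trace/answer consistency check running outside the model,
# and a regression suite that grows with every caught failure.
# Thresholds and keyword lists are illustrative assumptions.

REGRESSION_SUITE: list[dict] = []  # every caught failure becomes a test case

DANGER_TERMS = {"respiratory failure", "sepsis", "stroke"}  # illustrative

def reasoning_contradicts_answer(trace: str, answer: str) -> bool:
    """Deterministic, rule-based check: the reasoning trace names a danger
    term but the final answer is non-urgent."""
    flagged = any(term in trace.lower() for term in DANGER_TERMS)
    return flagged and "emergency" not in answer.lower()

def route(output: dict, confidence_threshold: float = 0.85) -> str:
    """Confidence routing plus deterministic validation."""
    if reasoning_contradicts_answer(output["trace"], output["answer"]):
        REGRESSION_SUITE.append(output)  # feed the failure into the suite
        return "human_review"
    if output["confidence"] < confidence_threshold:
        return "human_review"
    return "auto_approve"

verdict = route({
    "trace": "Findings consistent with early respiratory failure.",
    "answer": "Schedule a routine appointment.",
    "confidence": 0.93,
})
print(verdict)  # "human_review": the contradiction overrides high confidence
```

Note that the consistency check is exactly the rule that would have caught the Mount Sinai failure mode: the model was confident, but its trace and its answer disagreed.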
The cost model is front-loaded: two to four domain experts for two to three weeks to build the scenario library and calibrate the automated layers. By month six, most of the process runs without human intervention. Every failure a human catches becomes a permanent automated test, growing the regression suite while the human review load drops.
Most organizations delegate evaluation to junior staff. This is backwards. Writing effective evaluations requires deep domain knowledge and the judgment to know when technically correct output is organizationally wrong. I don't know of a shortcut here.
On memory and compound loops
The combination of persistent memory, scheduled proactivity, and tool access creates agents whose understanding compounds: each interaction leaves a trace the next one can build on.
A sales pipeline agent running daily can recognize that a declining account matches the trajectory of one that churned five months ago, with enough lead time to intervene. A competitive intelligence agent can connect a hiring signal from three weeks ago to a patent filing last week to a partnership announcement this morning, constructing a narrative no human analyst could maintain alongside their other responsibilities.
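The pipeline example reduces to a simple mechanism: remember past trajectories, normalize away scale, and compare shapes. A hypothetical sketch, with made-up data and a placeholder similarity metric (a real agent would use richer features than a single activity series):

```python
# Illustrative sketch of trajectory matching from agent memory: compare a
# live account's recent activity curve against curves of accounts that
# previously churned. The metric, threshold, and data are assumptions.
import math

def shape_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length trajectories, each
    normalized to its starting value so only the shape matters, not scale."""
    na = [x / a[0] for x in a]
    nb = [x / b[0] for x in b]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))

CHURNED_TRAJECTORIES = {  # remembered from past interactions (made-up data)
    "acme_corp": [100, 80, 55, 30, 10],
}

def churn_risk(current: list[float], threshold: float = 0.5) -> list[str]:
    """Names of churned accounts whose decline this account resembles."""
    return [name for name, past in CHURNED_TRAJECTORIES.items()
            if shape_distance(current, past) < threshold]

print(churn_risk([40, 33, 21, 13, 5]))   # resembles acme_corp's decline
print(churn_risk([100, 100, 98, 99, 100]))  # stable account, no match
```

Nothing here requires a smarter model. It requires memory that persists between runs, which is precisely what stateless deployments lack.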
Shopify's CEO let an AI agent run overnight experiments against an internal model: 37 experiments, a 19 percent improvement, a 0.8-billion-parameter model outperforming a 1.6-billion-parameter model configured by a human. Not because the agent was smarter in any single cycle, but because it ran dozens of cycles and remembered all of them.
How they reinforce each other
Agents with persistent memory produce more sophisticated outputs that demand better evaluation. Robust evaluation enables organizations to extend autonomy to agents that demonstrate reliability over time.
Evaluation catches failures before they erode trust. Trust enables delegation. Delegation generates value. Organizations that sustain this cycle gain a durable structural advantage (distinct from the advantages of scale or funding, because it is built from the organization's own accumulated knowledge and cannot be bought off the shelf).
VI. The harder question
There is something I do not have a confident answer to, and I think it matters more than most questions dominating the AI conversation right now.
Closing the context deficit means encoding the institutional knowledge that lives in people's heads into formats AI systems can access. This is the same knowledge that Gartner predicts companies will scramble to recover: by 2027, half the companies that cut staff for AI will rehire workers to perform similar functions.
Forrester reports fifty-five percent of employers already regret AI-driven layoffs. Klarna cut 700 customer service positions for a chatbot, then reversed course when the AI could not handle the contextual complexity their business required.
The pattern: organizations shed task-execution capacity, then discover the people they let go were carrying invisible infrastructure. You learn it was load-bearing after it collapses.
The question is what happens when that knowledge gets encoded.
Harvey, the legal AI company, is building on OpenAI's Frontier platform. Every piece of legal domain knowledge Harvey encodes into Frontier's semantic layer lives on OpenAI's infrastructure. Today that is a partnership. In three years, when OpenAI needs to justify its valuation and legal AI is a proven market, the nature of that relationship may change in ways the current terms do not anticipate.
The institutional knowledge that makes organizations function, distributed today across millions of professionals, would concentrate in systems controlled by a handful of technology companies. The dynamics resemble (at least in structure, if not yet in scale) the way financial derivatives concentrated systemic risk before 2008: useful instruments that, once ownership narrowed to a few institutions, created dependencies nobody had mapped.
Someone will raise the counterargument that this is all temporary. Better models with larger context windows and improved reasoning will close the deficit within a few model generations and make this entire discussion moot.
I have thought about this, and my best guess is that it is wrong, though I hold that view with less certainty than I'd like.
We can measure capability gains: benchmark scores, context length, task completion rates. Contextual understanding requires something different. It requires knowing what nobody wrote down, what matters politically but not logically, what failed two years ago in a way that shaped current policy.
That kind of knowledge has no training signal. Models may get better at utilizing context when they have it. The problem is that the context does not yet exist in a form models can access, and producing it is an organizational challenge, not a technical one.
The agents are here. They work. They are improving. And every one of them is operating without access to the institutional knowledge that keeps complex organizations functional.
Whether AI deployment creates broad benefit will depend less on model capabilities than on whether organizations build the infrastructure to close the gap: data quality and evaluation rigor.
The system that identified respiratory failure in its own reasoning and told the patient to wait is the emblem of this period. Closing the context deficit is institutional work, done at institutional pace, and the organizations that start now will be ready. The ones that wait for the models to solve it will be waiting for a long time.