Yannick Hofmeister
Whitepaper · 39 min read

The intelligence transition

[ human · machine ]

Executive summary

When you give an AI a well-defined task — draft this contract clause, analyze this dataset, summarize this research — it now matches or beats human experts 83 percent of the time. When you give it an actual job — here is a client brief, figure out what to do — it succeeds 2.5 percent of the time. That gap is the story. This paper diagnoses why it exists and describes the organizational architecture that closes it.

The failure is structural, not technical. The paper argues for three moves: encode your organizational methodology into machine-readable skills infrastructure, restructure around small teams with expanded missions and agent architectures matched to problem types, and build frontier operations capability across the organization. The ten diagnostic questions in section VI are designed to tell you where you stand.


Preface

This paper synthesizes six months of data — benchmarks, earnings reports, workforce surveys, and case studies from companies that have deployed AI at scale. Some of what follows rests on strong evidence. Much involves judgment under uncertainty. Where I am speculating beyond what the data support, I will say so directly.

The place to start is with three numbers.


I. The numbers that tell the same story

Three data points, taken individually, seem to describe entirely different realities. Taken together, they describe the defining dynamic of the current moment.

The first number is 83 percent. OpenAI's GDPval benchmark measures how often AI output is preferred over human expert output on well-scoped knowledge work — the actual tasks that lawyers, doctors, consultants, and engineers perform, graded by experienced professionals. GPT-5.4 was tied with or preferred over human domain experts 83 percent of the time, up from 74.1 percent just weeks earlier with GPT-5.2 Pro.1

A system that produces expert-preferred output on more than four-fifths of scoped knowledge tasks is not a marginal improvement. It matches or exceeds the professional it augments on a large majority of well-defined tasks — though, as I will argue in section III, the gap between task performance and job completion suggests these numbers overstate real-world capability in ways that matter enormously.

A contrasting benchmark sharpens the point. Scale AI's Remote Labor Index tested the same class of frontier models on 240 real freelance projects from Upwork — video production, architecture, data analysis — averaging $630 in cost and 29 hours of human completion time. The best agent completed 2.5 percent of projects at a quality a paying client would accept.2 If you hand the model a client brief without further guidance, performance collapses. GDPval provides all necessary context up front; the Remote Labor Index does not. One measures tasks. The other measures something closer to jobs. Tasks come with context provided. Jobs require you to bring your own. That distinction explains nearly everything confusing about the current AI discourse.

The second number is $660 to $690 billion. That is what the five largest hyperscalers are spending on AI infrastructure in 2026, up from $443 billion in 2025. Goldman Sachs projects cumulative hyperscaler capex from 2025 through 2027 will reach $1.15 trillion.3 Bank of America notes that AI capital expenditure now consumes up to 94 percent of these companies' operating cash flows after dividends and buybacks. Nearly every discretionary dollar is going into AI infrastructure.

The gap between that spending and the return is stark: AI-related services delivered roughly $25 billion in revenue to hyperscalers in 2025 — approximately 6 percent of what they spent on infrastructure.3 That imbalance is not necessarily a sign of a bubble. It may be a sign that the hyperscalers see something in their internal data that the revenue has not yet caught up to. When the largest technology companies in history dedicate nearly all of their available capital to a single thesis, they are responding to internal evidence of what AI systems can do — evidence that is, by definition, months ahead of anything the rest of the market can see.

The third number is 42 percent. According to S&P Global, 42 percent of companies abandoned the majority of their AI initiatives in 2025 — up from 17 percent the prior year. That is not a plateau. That is acceleration in the wrong direction.4

These numbers do not contradict each other. AI systems that match or exceed human experts on four-fifths of knowledge tasks coexist with an enterprise landscape where the modal outcome of an AI initiative is failure. The organizational capacity to direct that capability toward useful outcomes is, for most companies, not yet built. That is the gap.

It is also a temporary gap. Temporary means it closes. And the organizations that close it earliest will capture disproportionate value while the rest are still trying to figure out what went wrong with their pilots. The question is what, precisely, changed to make that gap so large and so consequential.


II. What actually changed

Before diagnosing why organizations are struggling, we need to establish what shifted and why it constitutes a genuine discontinuity rather than a continuation of prior trends.

The capability explosion

The benchmarks tell a story of acceleration, not merely improvement. On the SWE-bench coding benchmark, AI systems solved 4.4 percent of problems in 2023. By 2024, that number reached 71.7 percent. In one year. The Epoch AI Capabilities Index shows frontier model improvement accelerating from roughly 8 points per year before April 2024 to over 15 points per year after — an 85 percent acceleration that coincided with the rise of reasoning models and increased focus on reinforcement learning.

METR, an independent AI safety organization, has been tracking the length of tasks that AI systems can reliably complete autonomously. That duration has been doubling roughly every seven months since 2019, with suggestive evidence that the doubling time itself is shrinking to approximately every four months. When independent laboratories, working with different architectures and different training approaches, converge on the same exponential slopes, they are discovering the same fundamental scaling laws rather than gaming the same benchmarks.
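To make the compounding concrete, here is the arithmetic as a small Python sketch. The one-hour starting horizon is an illustrative placeholder, not METR's measured value; only the doubling-time structure comes from the research.

```python
# Toy projection of the autonomous-task horizon under a fixed doubling time.
# Model: horizon(t) = h0 * 2**(t / doubling_months). The 1-hour starting
# horizon is an illustrative placeholder; the 7-month doubling time is the
# trend METR reports.

def task_horizon_hours(months_elapsed: float,
                       h0_hours: float = 1.0,
                       doubling_months: float = 7.0) -> float:
    """Task duration an agent can reliably complete after `months_elapsed`."""
    return h0_hours * 2 ** (months_elapsed / doubling_months)

for months in (0, 7, 14, 21, 28):
    print(f"after {months:2d} months: ~{task_horizon_hours(months):5.1f} hours")
# 1.0 -> 2.0 -> 4.0 -> 8.0 -> 16.0: four doublings in 28 months
```

If the doubling time itself shrinks to four months, the same 28 months produce seven doublings instead of four, the difference between a 16x and a 128x horizon.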

The cost collapse

If you paid twenty dollars per million tokens in 2022, you now pay seven cents. What required 540 billion parameters achieves equivalent scores with 3.8 billion parameters today — a 142-fold reduction in model size. The Stanford AI Index reports that, depending on the task, inference prices have fallen anywhere from 9 to 900 times per year.

We have never found a ceiling on the demand for intelligence. Every time computing costs have fallen — from mainframes to PCs, from PCs to cloud, from cloud to serverless — demand has expanded to consume the new capacity and then some. There is no reason to believe this time will be different, and considerable reason to believe the expansion will be larger, because intelligence is more general-purpose than any previous computing abstraction.

The self-acceleration loop

The tools are now building themselves. Ninety percent of the code in Claude Code — Anthropic's agentic coding tool — was written by Claude Code itself. Boris Cherny, who leads the project, has not personally written code in over two months. His role has shifted entirely to specification, direction, and judgment. OpenAI's Codex 5.3 was the first frontier model that was instrumental in creating itself — earlier builds analyzed training logs, flagged failing tests, and suggested fixes to training scripts. SemiAnalysis estimates that roughly 4 percent of public commits on GitHub are now authored by Claude Code, a share projected to exceed 20 percent by end of 2026.

The rate of capability improvement is now partially decoupled from the rate of human effort invested in achieving it. That is new, and its consequences are underappreciated.

The capital conviction

Amazon spent $125 billion on capital expenditure in 2025, roughly 75 percent of which went directly to AI infrastructure, and has announced $200 billion for 2026.5 Its quarterly free cash flow went negative — negative $4.8 billion — because the company is converting human headcount to compute capacity. It cut 30,000 jobs to fund GPU purchases. Amazon's capital allocation historically correlates with high-confidence internal projections, not speculative positioning.

Hyperscalers collectively are projected to deploy trillions in cumulative AI capex through 2030. The question is no longer whether AI capability is real. The question is why most organizations cannot convert that capability into outcomes. (A note on physical supply constraints — helium shortages, memory rationing, and their implications for infrastructure timelines — appears in Appendix A.)

If the technology is this capable and this cheap, and the capital is this committed, why are most organizations failing to use it?


III. Why most organizations are failing

The failure rate — 80 percent of AI projects, roughly twice the rate of non-AI IT projects — demands a structural explanation. I think the failures are best understood not as a taxonomy of five separate problems but as five manifestations of a single mismatch: organizations built for a world where intelligence was scarce, encountering a world where it is abundant. Each barrier is a different surface of that mismatch.

The intent gap

The most instructive failure of the past year is Klarna's. In early 2024, Klarna deployed an OpenAI-powered customer service agent. It handled 2.3 million conversations in its first month, across 23 markets, in 35 languages. Resolution times dropped from eleven minutes to two. The CEO projected $40 million in savings; the actual figure reached $60 million.

By mid-2025, CEO Sebastian Siemiatkowski was on Bloomberg explaining that cost had been "a too predominant evaluation factor" and that the result was "lower quality." Klarna began hiring human agents back — the leading edge of a pattern that Forrester data says is widespread, with 55 percent of employers now regretting AI-driven layoffs.6

The AI did not fail. The AI succeeded — at the wrong objective. Klarna's organizational intent was not "resolve tickets fast." It was "build lasting customer relationships that drive lifetime value in a competitive fintech market." If you have ever watched a capable new hire optimize for the wrong metric because nobody told them what the company actually values, you have seen the intent gap at human scale. A human agent with five years at the company has internalized an implicit model of organizational values — she knows that a long-tenured customer whose tone signals frustration warrants three extra minutes, even if that reduces her resolution metrics. The AI agent knew none of it. It had a prompt and plenty of context. What it did not have was intent — a structured, machine-readable encoding of what the organization actually values when values conflict.

The industry has progressed through two stages: first, crafting individual instructions (prompt engineering), and now, curating the information environment in which AI operates (context engineering). Both are necessary. Neither addresses the Klarna problem. What is missing is what I'd call intent engineering: encoding organizational purpose — goals, values, tradeoff preferences, decision boundaries — into machine-readable infrastructure.7
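To make the idea less abstract, here is a minimal sketch in Python of what one machine-readable tradeoff rule might look like. Everything in it (field names, thresholds, the rule itself) is invented for illustration; it is not Klarna's system or any published format.

```python
# Hypothetical intent encoding: a machine-readable tradeoff rule of the kind
# a tenured human agent applies implicitly. Field names and thresholds are
# invented for illustration.
from dataclasses import dataclass

@dataclass
class TradeoffRule:
    condition: str   # when this rule applies
    prefer: str      # which objective wins when objectives conflict
    over: str        # which objective yields
    rationale: str   # why, so agents (and auditors) can explain the call

INTENT = {
    "primary_goal": "maximize customer lifetime value",
    "proxy_metrics": ["resolution_time", "tickets_closed"],
    "tradeoffs": [
        TradeoffRule(
            condition="customer_tenure_years >= 3 and sentiment == 'frustrated'",
            prefer="relationship_repair",
            over="resolution_time",
            rationale="Long-tenured customers drive lifetime value; "
                      "three extra minutes is cheap relative to churn.",
        ),
    ],
}
```

The point is not this specific schema; it is that the tradeoff the tenured agent applied implicitly becomes inspectable, versionable, and loadable by any agent.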

Without this layer, deploying AI across an organization is equivalent to hiring thousands of new employees and never telling them what the company does. As of March 2026, Klarna has not publicly described building this layer — which suggests the problem is genuinely hard, not a trivial oversight that better prompting would have prevented.

The specification bottleneck

The cost of producing software — and, increasingly, analysis, research, design, and strategy — is collapsing so fast that the bottleneck has moved from "can we build it" to "can we specify exactly what should be built, how it should be validated, and where the boundaries are."

This turns out to be rarer than anyone assumed. In July 2025, Jason Lemkin watched a Replit AI agent delete his production database. The agent did not hallucinate. It did exactly what it was allowed to do, because nobody had specified where its authority ended.

The quality of AI output is a function of the quality of the specification it receives, and that relationship is non-linear. A slightly vague specification does not produce a slightly wrong output. It produces a confidently wrong output that looks polished enough to ship. That is significantly more dangerous than an obviously broken one.

Concrete specification mechanisms are now emerging, and the pattern they share is instructive. In software engineering, teams are discovering that lint rules function as executable architecture specifications — "lint green" becomes a machine-readable proxy for "conforms to architecture."8 In design, Google's Stitch product introduced DESIGN.md: a portable, agent-readable markdown file that encodes an entire design system as structured specification rather than visual reference, making the handoff between designer and builder lossless for the first time. In operations, StrongDM runs what they call a software factory — three engineers, no human writes or reviews code, production software shipping continuously — built on six to seven thousand lines of detailed behavioral specifications.9 The charter "code shall not be written by humans, code shall not be reviewed by humans" works precisely because the specification effort is extraordinary.
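To see why "lint green" can stand in for "conforms to architecture," consider a minimal sketch. The check below enforces a hypothetical layering rule (modules in a domain layer must not import from a web layer) as an executable test over a codebase's import statements. The layer names and directory layout are invented; the pattern of encoding an architectural constraint as a machine-checkable rule is what the paragraph above describes.

```python
# Minimal "lint rule as executable architecture spec": fail the build if a
# module in the domain layer imports from the web layer. Layer names and
# paths are hypothetical; the point is that the rule is machine-checkable.
import ast
import pathlib
import sys

FORBIDDEN = {"domain": {"web"}}  # domain/ must never import web/

def violations(root: str = "src"):
    for path in pathlib.Path(root).rglob("*.py"):
        layer = path.relative_to(root).parts[0]
        banned = FORBIDDEN.get(layer, set())
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                names = [a.name for a in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                if name.split(".")[0] in banned:
                    yield f"{path}:{node.lineno} imports {name}"

if __name__ == "__main__":
    found = list(violations())
    print("\n".join(found) or "lint green: architecture conforms")
    sys.exit(1 if found else 0)
```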

If you are trying to figure out where to invest, the specification layer is the answer. Agent deployment decomposes into what one analysis terms a 4:1 ratio: four engineering problems that are well-understood and solvable with published patterns, and one genuinely new problem — the specification problem itself.8

The quality data confirms the urgency. CodeRabbit's analysis of 470 GitHub pull requests found that AI-assisted code generates 1.7 times more issues than human-authored code. DORA's 2025 report found that alongside 90 percent AI adoption among software teams, organizations saw a 9 percent climb in bug rates. The SWE-CI benchmark, which measures AI maintaining software over time, found that 75 percent of frontier models break previously working features during maintenance.10 AI acts as a mirror and a multiplier — it magnifies the strengths of organizations with clear specifications and the dysfunctions of those without them.11 As I will show in section V, the specification bottleneck is not merely a theoretical concern — it is measurable at the individual level, with consequences that experienced practitioners consistently fail to detect in their own work.

The agent readability imperative

A Salesforce survey of 6,000 data leaders found that 84 percent of organizations say their data strategies need a complete overhaul before AI can work effectively, which means only 16 percent consider their data ready. Simultaneously, 63 percent of C-suite executives believe their companies are already data-driven. That 47-point perception gap is likely one of the most expensive disconnects in corporate life right now.

The problem is sharper than "data quality." What is emerging is a more specific requirement: agent readability — whether your systems, products, and services can be discovered, evaluated, and transacted by an AI agent acting autonomously.12 If an agent acting on a customer's behalf cannot read your product catalog, parse your pricing, or execute a transaction through a structured interface, the agent routes around you. No human ever sees the offer. You lose the sale without knowing you were in the running.

Roughly 20 percent of product and service meaning lives in structured data that agents can already read — names, prices, SKUs, specifications. The other 80 percent is tribal knowledge embedded in marketing copy, packaging, and institutional memory.12

Software is increasingly going to be a set of primitives callable from any surface, not applications with interfaces. If you are building software today, every feature should be evaluated against three questions: Can an agent invoke this? Can an agent read the output? Can an agent write the input? If any answer is no, the feature is built for a world that is receding. Stripe shipped a standardized protocol server allowing agents to process refunds and manage subscriptions — but its analytics layer cannot be trivially exposed because result sets exceed what an AI can process at once. Even for one of the best engineering organizations in the world, agent readability is step one of a multi-quarter build.12
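The three questions translate directly into interface design. Below is a minimal sketch of what an agent-readable primitive can look like: a capability published with a machine-readable contract instead of a human UI. The function, schema, and fields are all hypothetical.

```python
# Hypothetical "agent-readable primitive": a capability published with a
# machine-readable contract rather than a human interface. An agent can
# (1) discover it from the schema, (2) write a valid input, and
# (3) read a structured output.
import json

QUOTE_SCHEMA = {
    "name": "get_shipping_quote",
    "input": {"sku": "string", "quantity": "integer", "country": "string"},
    "output": {"price_usd": "number", "eta_days": "integer"},
}

def get_shipping_quote(sku: str, quantity: int, country: str) -> dict:
    # Stub pricing logic; a real implementation queries inventory systems.
    base = {"US": 4.0, "DE": 9.0}.get(country, 14.0)
    return {"price_usd": round(base * quantity, 2), "eta_days": 5}

# What an agent sees (discovery), sends (input), and parses (output):
print(json.dumps(QUOTE_SCHEMA))
print(json.dumps(get_shipping_quote("SKU-123", 2, "DE")))  # {"price_usd": 18.0, ...}
```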

The coordination tax

Perhaps the most underappreciated barrier is organizational coordination overhead. Research consistently finds that 60 to 70 percent of labor hours in a typical knowledge-work organization are spent on coordination — meetings, status updates, approvals, alignment sessions, handoffs — rather than production. Seventy-one percent of senior executives describe their own meetings as unproductive.

AI dramatically amplifies individual productive capacity. What it does not do is reduce coordination cost. Imagine a team of twenty where each person, with AI assistance, produces three times their previous output. You now have sixty units of work flowing through the same approval pipeline that was already a bottleneck at twenty. The feeling of increased activity without corresponding improvement in outcomes is the predictable result of amplifying one variable while leaving the binding constraint untouched.

The direct antidote is structural: reducing team size. Moving from a 20-person team to four 5-person teams cuts communication pathways from 190 to 40.
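The arithmetic is worth verifying directly:

```python
# Communication pathways in a team of n people: n * (n - 1) / 2.
def pathways(n: int) -> int:
    return n * (n - 1) // 2

print(pathways(20))      # 190: one 20-person team
print(4 * pathways(5))   # 40: four 5-person teams, roughly a 5x reduction
```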

Pilot purgatory

You know the pattern. A three-month pilot proves AI can draft contracts in a quarter of the time. The team presents results to leadership. Leadership asks for a plan to scale. The plan requires changes to the document management system, the review workflow, and the quality assurance process. Eighteen months later, the pilot is still running.

The Deloitte 2026 State of AI report, surveying 3,235 leaders across 24 countries, found that 84 percent of companies have not redesigned jobs around AI capabilities, and only 21 percent have a mature model for agent governance. The pilot is not the hard part. The hard part is the organizational infrastructure that converts a successful pilot into sustained, scaled value. Most organizations have built none of it.

The question is whether any organizations have actually escaped this pattern. Some have.


IV. The architecture that works

If the diagnosis above is correct, then the organizations succeeding with AI should look structurally different from the ones failing. They do.

Intelligence abundance as an operating premise

The most consequential difference is philosophical. The organizations that succeed have stopped asking "How do we implement AI?" and started asking "Given a hundred times more intelligence available to us, what should we rebuild?" If you are asking the first question, you are optimizing the wrong variable. The first assumes a fixed organizational structure into which AI must be inserted. The second treats the organizational structure itself as a variable. The outcomes are not comparable.

Matching architecture to problem type

One reason many organizations struggle beyond pilots is that they treat "deploying agents" as a single category. It is not. There are four distinct agent architectures, each suited to a different class of problem, and deploying the wrong one produces failure that looks like the technology does not work when the actual problem is architectural mismatch.9

The simplest architecture is what you might call a coding harness: a single powerful model in an agentic loop, equipped with file operations, search, and execution. If you have used Cursor or Claude Code, you know the pattern. The quality gate is a human being with taste. The governing principle is isolation — two agents editing the same file will fight each other; two agents working on independent modules will compound each other's output.
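Stripped of product polish, the coding harness is a loop. The skeleton below is a deliberately simplified sketch: the model call is a stub, the tool set is minimal, and real harnesses add sandboxing and context management. It is not Claude Code's or Cursor's implementation, only the shape of the pattern.

```python
# Skeleton of a coding harness: one model in a loop with tools, a human as
# the final quality gate. The model call is a stub; the control flow is the
# real shape of the pattern.
import pathlib
import subprocess

TOOLS = {
    "read":  lambda path: pathlib.Path(path).read_text(),
    "write": lambda path, text: pathlib.Path(path).write_text(text),
    "run":   lambda cmd: subprocess.run(cmd, shell=True,
                                        capture_output=True, text=True).stdout,
}

def model(history: list[dict]) -> dict:
    """Stub for a frontier-model call. A real harness sends `history` to an
    LLM API and gets back either a tool invocation or a final answer."""
    return {"done": True, "summary": "stub: no model attached"}

def harness(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(history)
        if action.get("done"):
            return action["summary"]                # human reviews from here
        result = TOOLS[action["tool"]](*action["args"])
        history.append({"role": "tool", "content": str(result)})
    return "step budget exhausted"

print(harness("rename config.load to config.load_file across src/"))
```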

A more radical architecture, the dark factory, removes the human from the review loop entirely. StrongDM's three-engineer software factory ships production software continuously with no human code review, built on thousands of lines of behavioral specifications and validation treated like a machine learning holdout set — test scenarios stored where agents cannot access them during development. The specification becomes the product; code becomes disposable. This works when the output is machine-verifiable, and it fails when it is not.
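The holdout discipline is mechanically simple, which is part of why it works. A minimal sketch, with invented paths and an invented scenario format: the agent can touch only its workspace, while acceptance scenarios live elsewhere and are read only by the gatekeeper process.

```python
# Sketch of validation-as-holdout: acceptance scenarios live outside the
# agent's workspace, so the agent cannot optimize against them. Paths and
# the scenario format are invented for illustration.
import json
import pathlib

AGENT_WORKSPACE = pathlib.Path("workspace")     # all the agent can read/write
HOLDOUT_DIR     = pathlib.Path("/opt/holdout")  # invisible to the agent

def gate(build_output: dict) -> bool:
    """Run holdout scenarios against the agent's build. Only this process,
    never the agent, reads the scenario files."""
    for scenario_file in HOLDOUT_DIR.glob("*.json"):
        scenario = json.loads(scenario_file.read_text())
        if build_output.get(scenario["input"]) != scenario["expected"]:
            return False                # reject; the agent retries blind
    return True
```

The agent iterates against its own tests; promotion to production happens only when the hidden scenarios pass.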

Shopify took a different approach with its Liquid template engine. They gave an agent a codebase, a performance metric, and boundaries. The agent brainstormed improvements, tested them one at a time, kept what worked, reverted what did not. The result: 53 percent faster performance, 61 percent fewer memory allocations.9 This only works when you have a computable metric — if you cannot score the output as a number, you cannot auto-research it. But when you can, the results can be striking.
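The loop is a classic hill climb and can be sketched in a few lines. Below, `propose` and `score` are stubs standing in for "agent suggests a patch" and "run the performance benchmark"; the keep-or-revert control flow is the whole pattern. This is a schematic of the approach, not Shopify's code.

```python
# Hill-climbing skeleton of the auto-research pattern: propose a change,
# measure it, keep improvements, revert regressions. `propose` and `score`
# are stubs standing in for an agent's patch and a real benchmark.
import random

def propose(state: float) -> float:
    return state + random.uniform(-1.0, 1.0)   # stub: agent suggests a tweak

def score(state: float) -> float:
    return -abs(state - 10.0)                  # stub: a computable metric

def auto_research(state: float, iterations: int = 1000) -> float:
    best = score(state)
    for _ in range(iterations):
        candidate = propose(state)
        s = score(candidate)
        if s > best:                           # keep what works
            state, best = candidate, s
        # else: revert (the candidate is simply discarded)
    return state

print(round(auto_research(0.0), 2))            # climbs toward 10.0
```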

At the largest scale, you coordinate multiple specialized agents in a pipeline. Walmart's WIBEY super-agent orchestrates over 200 specialized agents, each executing a step and passing a defined output to the next. The value proposition is coordination, not intelligence.
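The code shape makes the point: an orchestrator is mostly contract-checking between steps. A toy sketch with invented step names, not Walmart's system:

```python
# Toy orchestration pipeline: specialized steps, each with a declared output
# contract the orchestrator checks before handing off. Steps are invented.
def classify(order):
    return {**order, "category": "return"}

def route(order):
    return {**order, "queue": "returns-eu"}

def draft_reply(order):
    return {**order, "reply": "Your return label is attached."}

PIPELINE = [
    (classify,    {"category"}),   # (step, keys it must add)
    (route,       {"queue"}),
    (draft_reply, {"reply"}),
]

def orchestrate(payload: dict) -> dict:
    for step, required in PIPELINE:
        payload = step(payload)
        missing = required - payload.keys()
        if missing:
            raise ValueError(f"{step.__name__} violated contract: {missing}")
    return payload

print(orchestrate({"order_id": "A-17"}))
```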

The diagnostic question that resolves which architecture to deploy: what are you optimizing against? Code quality with human taste as the arbiter — use a coding harness. Specification compliance with machine-verifiable output — build a dark factory. A numerically scorable metric — set up auto research. Pipeline throughput across specialized steps — design an orchestration framework. Most organizations conflate these four problems and deploy a single architecture for all of them.

The combinatorics of small teams

The mathematical relationship between team size and coordination cost is governed by n(n-1)/2. Five people generate ten communication pathways. Twenty people generate one hundred and ninety. Robin Dunbar's research on primate neocortex size, the U.S. infantry fire team, Jeff Bezos's two-pizza team — evolutionary psychology, military operations, and software engineering converge on teams of five.

What AI changes is the consequence of getting the number wrong. Before AI, a five-person team might produce output valued at roughly $250,000 per person per year. The coordination cost of adding a sixth was manageable. After AI, the same team can produce output valued at $2 million or more per person. If you add a sixth person to an AI-augmented team of five, you are not adding one-sixth more capacity. You are adding a coordination cost measured in millions of dollars of lost output.

The evidence is visible in AI-native companies. Cursor generated over $8 million per employee at 60 people. Lovable reached $400 million in annual recurring revenue with roughly 45 employees, serving 8 million users at a $6.6 billion valuation (though whether these numbers generalize to established organizations with legacy systems and regulatory constraints is a genuine open question).13 The traditional SaaS benchmark for "great" revenue per employee is $200,000 to $300,000. These companies operate at ten to forty times that level.

The ambition frame

Say you run an organization of five hundred people, and each of them is now five to ten times more capable than they were two years ago. The natural response — "we can now operate with fifty people" — is wrong. The productive question: what were we previously unable to do?13

The companies getting this right understood it immediately. Lovable did not start with 45 people and build a small product; they built a platform serving 8 million users. Midjourney targeted the full scope of visual creation with 100 employees. Whoop is hiring 600 people — nearly doubling its workforce — because the opportunity set expanded when execution cost dropped. Accenture cut 11,000 traditional roles while simultaneously doubling its AI specialists to 77,000 — the K-shaped labor market in a single company.14

Every major reduction in the cost of production — the printing press, the steam engine, electrification, the personal computer — eventually created more total employment, not less.15 The strongest counterargument deserves direct engagement: intelligence is categorically different from previous general-purpose technologies because it substitutes for the very cognitive faculty that historically identified new uses for those technologies. If the thing that found new jobs after the steam engine was human ingenuity, and AI substitutes for human ingenuity, the historical pattern may not hold. Whether you find this persuasive depends on whether you believe human judgment remains non-substitutable as a complement to AI execution — which is precisely the evaluation meta-skill argument I make in section V. The transition was never smooth and the displacement was real, but the organizations that treated the new technology as a cost-reduction tool rather than a capability amplifier were consistently outcompeted by those that understood what had actually changed.

Case studies

JPMorgan deployed its LLM Suite to more than 200,000 employees, identified over 450 use cases, and has reported $1.5 to $2.0 billion in operational value alongside 10 to 20 percent improvements in developer efficiency. The critical detail is not the technology. It is the 18-month institutional learning advantage they built by starting before most competitors had finished their vendor evaluations.

Shopify may be the most instructive case. CEO Tobi Lutke required that every team prototype with AI before beginning the real build. He made AI fluency part of performance reviews and required teams to demonstrate why AI could not do a task before requesting additional headcount. The fastest AI adoption at Shopify came from finance, sales, and support — not engineering. Its 3,500 R&D employees are organized as "lots and lots and lots of small teams." The entire culture is structured around the premise that small, capable teams with AI augmentation outperform large teams without it.

If this is how work is being restructured at the organizational level, the question becomes: what does it mean for the people inside those organizations?


V. The workforce implications

The organizational architecture described above reshapes how work itself is structured. The implications are best understood not as a single trend but as two markets moving in opposite directions.

The K-shaped labor market

One market is contracting. Postings for automation-prone roles have declined 13 percent. Companies adopting generative AI saw junior employment drop roughly 10 percent relative to non-adopters within eighteen months, driven primarily by slower hiring rather than increased firing.16 The signal is consistent: organizations are shedding task-execution capacity.

The other market is expanding so fast that demand outstrips supply 3.2-to-1. There are 1.6 million open AI-related positions. The average time-to-fill is 142 days. Salary ranges span from $150,000 to over $437,000. And the barrier to entry is lower than most people assume: 60 percent of AI product management hires come from non-computer-science backgrounds.14

If you are wondering what skills the market is pricing at a premium, analysis of real job postings at Anthropic, Scale AI, Robinhood, Glean, and Upwork reveals seven: specification precision, evaluation and quality judgment, decomposition for delegation, failure pattern recognition, trust boundary design, context architecture, and cost and token economics.14 Certifications from weekend courses do not substitute for published artifacts that demonstrate real capability.

A significant fraction of the apparent "talent shortage" is, on closer inspection, a specification shortage on the employer side — companies that have not defined what outcomes they want, writing incoherent job postings spanning four roles and rejecting all candidates.14

Two simultaneous compressions

The first compression is horizontal. If you are an engineer who can also specify product requirements, direct a design agent, and evaluate marketing copy, you are not doing four jobs. You are doing the one job that matters: orchestrating AI systems to produce outcomes. The domain knowledge does not disappear — a lawyer directing AI to analyze a contract still needs to understand contract law — but the differentiator shifts to whether you can effectively direct AI systems to apply that knowledge at scale.

The second compression is temporal. What you knew about AI capabilities six months ago is already partially obsolete. The career advantage you built over a decade — deep expertise in a specific tool, a specific workflow — depreciates faster than at any previous point in the history of professional work. There is no preparing for this in advance; continuous engagement is the preparation.

The evaluation meta-skill

The single most important career skill in the emerging landscape is one that has not historically been recognized as a discrete competency: the ability to evaluate work you did not produce. The execution layer — generating designs, code, analysis, strategy — is compressing across every domain simultaneously. The judgment layer is not.

Here is the most disquieting finding I encountered in six months of research — and, I believe, the single strongest piece of evidence for this paper's central thesis. METR ran a randomized controlled trial of experienced open-source developers working in codebases they already knew. The developers completed tasks 19 percent slower when using AI tools. Not faster. Slower. They predicted AI would make them 24 percent faster. After the study, they still believed it had made them 20 percent faster. They were wrong about the direction, not merely the magnitude.

This result connects directly to the specification bottleneck described in section III. The developers had implicit specifications — mental models of what the code should do — that were sufficient when they wrote the code themselves but insufficient when evaluating AI-generated code that looked correct but diverged in ways that required careful inspection to detect. The specification gap that produces 1.7 times more issues in AI-assisted pull requests (section III) is the same gap that made these experienced developers slower: evaluating work against an unwritten specification is harder than doing the work yourself. And critically, the difficulty is invisible — you do not know you are doing it poorly, because the output looks competent. This is the intent gap and the specification bottleneck expressed at the level of individual work, and it suggests that the organizational failures described in section III are not management problems. They are epistemological ones.

Given these dynamics, what should leaders actually do? The honest answer is that specific tactics are evolving faster than any paper can track. But the structural principles appear durable enough to warrant recommendation.


VI. What leaders should do now

Develop frontier operations capability

BCG estimates that roughly 5 percent of companies are achieving real value from AI deployments.4 The capability that characterizes them is what I call frontier operations: the ability to work productively at the boundary between what AI can do reliably and what still requires human judgment.

The capability has six components. The first two are about knowing where the boundary is: boundary sensing — maintaining current operational intuition about where the human-agent line sits for your domain, something that changes quarterly (the gap between Sonnet 4.5's performance on long-context retrieval at below 20 percent and Opus 4.6's at 76 percent illustrates why) — and capability forecasting, making reasonable predictions about where the boundary will move next so you can direct your learning investment accordingly.

The next two are about working at the boundary: seam design — structuring work so that transitions between human and agent phases are clean, verifiable, and recoverable — and failure model maintenance, the differentiated understanding that for task type A the failure mode is X while for task type B the failure mode is Y, which is different from the generic advice to "be skeptical of AI output."

The final two are about optimizing human attention: leverage calibration — making high-quality decisions about where to spend the scarcest resource in an agent-rich environment — and asynchronous delegation, the ability to structure work so agents execute on schedules and triggers without human presence. If you cannot name where the human-agent boundary sits in your domain this quarter — not last quarter, this quarter — you do not yet have frontier operations capability. Early practitioners report that 25 minutes of direction can yield three or more hours of unsupervised execution, but only when the specification survives the absence of the human who wrote it. Roughly 50 percent of complex tasks still fail in Anthropic's Dispatch research preview,17 and the gap is specification, not intelligence.

These six capabilities may be the first workforce skills in history that expire on a quarterly cycle.

Restructure around small teams

Restructure operational units into teams of no more than five people with full AI toolkits, expanded missions, and clear specification and evaluation standards. Not as an efficiency measure — as a capability expansion. If your current team structure requires extensive coordination overhead, the AI amplification you are deploying is being absorbed by the coordination tax rather than producing outcomes.

Build the intent layer — using skills as the mechanism

The Microsoft Copilot stall — 90 percent of Fortune 500 companies "use" Copilot but only 3.3 percent of commercial users have paid seats — traces to AI systems deployed without a structured encoding of organizational intent.

Six months ago, the intent layer was an abstraction with no implementation path. That has changed. A concrete mechanism has emerged: skills — markdown files with structured metadata that encode methodology, quality criteria, output specifications, and decision boundaries into a form AI systems load automatically at the point of use.18

The architecture explains why this works where documentation has not. Only the metadata — a name and a brief description — loads into the AI's working memory at startup. The full methodology loads only when the AI determines the skill is relevant. Supporting references and scripts load only when specifically needed. Twenty installed skills cost roughly 1,500 tokens of context. The methodology is present at the moment of use, already loaded, already active — not sitting in a wiki nobody opens under time pressure.
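The loading discipline can be sketched directly. The code below is an illustrative implementation of that progressive disclosure, not Anthropic's: at startup only each skill file's frontmatter (a name and description) is parsed into the catalog; the full body is read only when a skill looks relevant to the task at hand. The relevance test here is deliberately crude.

```python
# Illustrative progressive-disclosure loader: only each skill file's
# frontmatter (name and description) enters working memory at startup;
# the full methodology body is read only on demand. Simplified parsing,
# invented layout; not Anthropic's implementation.
import pathlib

def frontmatter(path: pathlib.Path) -> dict:
    meta, lines = {}, path.read_text().splitlines()
    if lines and lines[0] == "---":
        for line in lines[1:]:
            if line == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

class SkillIndex:
    def __init__(self, root: str = "skills"):
        # Startup cost: metadata only, a few dozen tokens per skill.
        self.catalog = {p: frontmatter(p)
                        for p in pathlib.Path(root).rglob("SKILL.md")}

    def load_if_relevant(self, task: str) -> list[str]:
        # The full body loads only for skills that look relevant. A real
        # system lets the model judge relevance; substring match is a stub.
        task_lower = task.lower()
        return [path.read_text()
                for path, meta in self.catalog.items()
                if meta.get("name") and
                meta["name"].replace("-", " ") in task_lower]
```

Twenty skills at a few dozen metadata tokens each is how the whole catalog fits in roughly 1,500 tokens while the methodologies themselves stay out of context until needed.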

Anthropic launched the format in October 2025. By March 2026, OpenAI, Microsoft, GitHub, and Cursor had adopted it. Five hundred thousand skills now run across platforms interchangeably — the same file that runs in a developer terminal runs in the Excel sidebar and in overnight API pipelines.18 That is not a developer tool. That is organizational infrastructure.

The failure asymmetry is the datum that should change how seriously you take this. A vague skill costs a human caller roughly 10 to 15 percent of output quality — you notice the drift, redirect, recover. The same vague skill in an agent pipeline produces output the downstream agent treats as correct, processing it further until the error surfaces six steps later in an unrecognizable form. Human caller: minor quality degradation. Agent caller: potential total chain failure.18

The organizational deployment has three tiers, and most organizations get the priority backwards. The first tier is standards — non-negotiable consistency rules like brand voice and compliance requirements, provisioned organization-wide. The second is methodology — how the organization actually approaches high-value work, built by senior practitioners from their actual output, not their articulated intentions. The third is personal workflows. Most start at the third when 80 percent of the value is in the first two. Here is the question that should determine your skills backlog: what are the three things a new person at your organization needs three months to learn to do at your standard? The person who knows how to do those things well is the right person to encode the methodology — before it calcifies back into intuition nobody can articulate, or walks out the door when they leave.18

Invest in fluency, not tools

Tools commoditize quarterly. Model leadership rotates every six to twelve months. What does not commoditize is the organizational capacity to use AI tools effectively — the fluency, the evaluation muscle, the specification discipline. A person who delegates ten real tasks a day to an agent and evaluates the output builds more capability in a week than someone who completes a forty-hour AI course and returns to a workplace where they never touch an AI tool. Calibration reps, not classroom hours.

A company where 200 people use AI daily has 200 people building intuitions, discovering edge cases, and identifying workflows worth automating. A company where 2,000 people "have access" but 50 use it has 50 people learning and 1,950 waiting. The gap between them widens with every passing month.

Audit your position

If you want to know where you stand, run through these ten questions with your leadership team. More than three "no" answers suggests a frontier operations gap that is actively costing you output.

  1. Can you name, specifically, three tasks your team delegated to agents last quarter that they did not delegate the quarter before?
  2. When the last major model update shipped, did anyone on your team formally reassess which workflows to change?
  3. Do you have at least one person whose explicit responsibility includes knowing where agents fail in your domain?
  4. Can your team articulate the difference between tasks where they trust agent output without review and tasks where they always verify?
  5. Has anyone on your team redesigned a workflow handoff between human and agent work in the last 90 days?
  6. When an agent produces something better than expected or fails unexpectedly, does your team have a mechanism for capturing that signal?
  7. Is your team's review depth differentiated by risk, or does everyone check everything at the same depth?
  8. Can anyone on your team make a credible six-month forecast about which of their current tasks will migrate to agents?
  9. Do you have explicit roles for frontier operations, or is it something you expect to emerge from people's day jobs?
  10. If your strongest frontier operator left tomorrow, could the rest of the team maintain the current level of agent-assisted output?

Those ten questions are a snapshot. The more consequential question is about trajectory.


VII. The question that matters

The gap between what AI can do and what organizations are doing with it is real, large, and temporary. Temporary means it closes. And when it closes, the organizations that built the architecture described here — methodology encoded as skills, small teams matched to the right agent architectures, specification discipline, frontier operations capability — will have compounding advantages that are difficult to replicate.

Institutional learning in AI is not something you can purchase or shortcut. JPMorgan's 18-month head start in deploying AI to 200,000 employees is not something a competitor can compress by spending more money. The learning had to happen in sequence, and the time had to pass.

The physical constraints described in Appendix A add urgency without changing the conclusion. Helium shortages and memory rationing may slow the rate at which new compute comes online — but they do not slow the capability of the compute that already exists. The models running today are already good enough to justify every organizational change this paper describes. When supply constraints ease, the organizations that spent the intervening period building specification discipline and encoding methodology will absorb the new capacity immediately. The ones that waited will find themselves competing for hardware, talent, and institutional learning simultaneously — and the third cannot be accelerated at any price.

The transition timeline is genuinely uncertain — electrification took forty years to produce productivity gains that were available from day one, and these organizational barriers may prove more durable than I have implied. But I believe the direction of the analysis is correct, and that the organizations and individuals who engage with it seriously, starting now, will be in a categorically better position than those who wait.

The question that matters is not "Should we use AI?" — the capital allocation data has answered that. It is not "Does AI work?" — the benchmarks have answered that. The question is the one almost nobody is asking:

An organization of five hundred people, each five to ten times more capable than they were two years ago: what, for that organization, was previously impossible and is now merely difficult?

Start here, this week:

  1. Run the diagnostic with your leadership team. Ten questions, thirty minutes. Count the "no" answers.
  2. Identify your three tier-two skills — the things a new person needs three months to learn to do at your standard. Have the senior people who know how to do them well start encoding the methodology.
  3. Pick one workflow and restructure it around a small team with the right agent architecture. Expand the mission. Measure the output.

The organizations that take this seriously will find themselves in a structurally different competitive position within the next several years. The ones that do not will face an environment shaped by decisions they did not make.


Appendix A: The physical constraint

Most AI infrastructure forecasts assume supply constraints that no longer hold. Organizations making capital allocation decisions deserve to know.

Every advanced AI chip passes through EUV lithography — $200 million machines that require helium to cool optical elements, maintain wafer temperature, and detect vacuum leaks. There is no substitute; without helium, EUV scanners cannot operate.19 Approximately a third of the world's semiconductor-grade helium came from a single industrial complex in Qatar — Ras Laffan — which was struck by Iranian missiles in March 2026. The facility is offline, with reconstruction timelines of three to five years.20

This disruption compounds into a semiconductor supply chain already under strain. High-bandwidth memory (HBM) — required by every major AI accelerator — was sold out through 2026 before the strike. DRAM prices rose 50 to 55 percent in a single quarter. Intel's CEO said publicly: "There's no relief until 2028."20 South Korea, which fabricates roughly a quarter of the world's memory chips, imported 64.7 percent of its helium from Qatar. And Taiwan, home to TSMC and 90 percent of the world's most advanced logic chips, imports 97 percent of its energy and holds 11 days of LNG reserves.

None of this invalidates the investment thesis. But if you are planning AI infrastructure deployment, factor physical-supply risk into your timelines. The chips may not arrive on the schedule the capex models assume.


Endnotes


All statistics cited reflect data available as of March 2026.


  1. OpenAI, "GDPval: Measuring AI performance on professional knowledge work," 2026. GPT-5 Thinking scored 38.8%; GPT-5.2 Thinking scored 70.9%; GPT-5.2 Pro scored 74.1%; GPT-5.4 scored 83.0%.

  2. Scale AI and the Center for AI Safety, "Remote Labor Index," 2026. 240 real freelance projects from Upwork. Average project cost $630, average human completion time 29 hours. Best frontier agent achieved 2.5% acceptable completion rate.

  3. Combined Big Tech AI capital expenditure reached $443 billion in 2025 (Bank of America, "AI Infrastructure Capital Flows," Q4 2025), with 2026 spending plans of $660-690 billion. Breakdown: Amazon $200B, Alphabet $175-185B, Meta $115-135B, Microsoft $120B+. AI capex consumes up to 94% of operating cash flows after dividends and buybacks. Goldman Sachs projects cumulative hyperscaler capex from 2025 through 2027 will reach $1.15 trillion. AI-related services delivered roughly $25 billion in revenue to hyperscalers in 2025.

  4. S&P Global, 2025 AI Initiative Survey (42% abandonment, up from 17%). Corroborated by BCG ("From Potential to Profit with GenAI," 2025: 74% have yet to show tangible value, ~5% achieving organization-wide impact), MIT (95% of generative AI pilots fail to deliver measurable impact), and McKinsey (2025 State of AI: roughly two-thirds stuck in experimentation or piloting).

  5. Amazon 2025 annual report and 2026 capital expenditure guidance. Quarterly free cash flow of negative $4.8 billion reported in Amazon Q3 2025 earnings. Workforce reduction of approximately 30,000 positions reported across 2024-2025.

  6. Forrester, "AI Workforce Impact Survey," 2026. 55% of employers report regretting AI-driven layoffs. Gartner predicts by 2027, half of companies that cut staff for AI will rehire for similar functions under different titles.

  7. The concept of "intent engineering" builds on Anthropic's September 2025 publication on context engineering, Harrison Chase's commentary at Sequoia Capital, and failure patterns documented by Deloitte (2026 State of AI in the Enterprise, 3,235 leaders, 24 countries), S&P Global (2025 AI Initiative Survey), and McKinsey (2025 State of AI).

  8. The lint-as-architecture pattern and 4:1 ratio framework documented in analysis of enterprise agent deployments. DESIGN.md format introduced with Google's Stitch product.

  9. Agent architecture taxonomy: coding harnesses (Claude Code, Cursor), dark factories (StrongDM), auto research (Shopify's Liquid optimization: 53% faster, 61% fewer memory allocations), orchestration (Walmart's WIBEY with 200+ specialized agents).

  10. SWE-CI (Alibaba Research), 2026. 100 real codebases, average 233 days of development history. 75% of frontier models broke previously working features during maintenance tasks.

  11. CodeRabbit, December 2025 analysis of 470 GitHub pull requests: AI-assisted code generates 1.7x more issues. Google 2025 DORA report: alongside 90% AI adoption, 9% climb in bug rates, 91% increase in code review time. DORA's conclusion: AI acts as a "mirror and a multiplier."

  12. Agent readability analysis: Stripe (protocol server for customer operations, analytics layer as remaining gap), SAP (proprietary ERP interfaces predating modern API standards), Cloudflare (markdown-for-agents feature reducing consumption by 80%). The 20%/80% structured-data/tribal-knowledge split describes the encoding challenge.

  13. Revenue-per-employee data from AI-native companies: Cursor at $8M+/employee, Midjourney at $3-5M/employee, Lovable at $400M ARR with ~45 employees ($6.6B valuation, 8M users). Traditional SaaS benchmarks: $200-300K/employee. The 5-10x multiplier for existing organizations is an estimate, not a controlled measurement.

  14. K-shaped labor market: demand-to-supply ratio 3.2:1, 1.6 million open positions, 142-day average time-to-fill, salary ranges $150K-$437K+. Accenture cut 11,000 roles while doubling AI specialists to 77,000. 60% of AI PM hires from non-CS backgrounds.

  15. See Robert Gordon, The Rise and Fall of American Growth (2016), for detailed treatment of how general-purpose technologies created net employment gains over multi-decade timescales.

  16. Hosseini Maasoum and Lichtinger, Harvard Business School, 2026. Study of 62 million American workers across 285,000 firms. Companies adopting generative AI saw junior employment decline roughly 10% relative to non-adopters within 18 months, driven primarily by slower hiring.

  17. Anthropic's Dispatch research preview (roughly 50% reliability on complex multi-application tasks). Shopify CEO Tobi Lutke's 37 overnight agent experiments. The 25-minute direction / 3+ hour execution ratio from early practitioner reports.

  18. Skills ecosystem: 500,000 skills running cross-platform as of March 2026, adopted by Anthropic, OpenAI, Microsoft, GitHub, and Cursor. 20 installed skills cost approximately 1,500 tokens of context. Simon Willison characterized skills in October 2025 as "maybe a bigger deal than MCP."

  19. Georgetown University CSET, Frost & Sullivan, and ASML technical specifications. Helium is non-substitutable in EUV lithography for optical cooling, wafer temperature maintenance, and vacuum leak detection. A 300mm EUV fab may consume 5,000 to 20,000 cubic meters of helium per month at 6N purity.

  20. US Geological Survey (Qatar produced approximately 33% of global helium supply). Korea International Trade Association (South Korea imported 64.7% of its helium from Qatar). TrendForce reported DRAM price increases of 50-55% in a single quarter. Intel CEO Lip-Bu Tan: "There's no relief until 2028."