Cloud LLM Market: Structure, Predictions, and Empirical Tests
1. Executive Summary and Introduction
The Verdict
The cloud LLM services market is a textbook credence-goods market. Twelve falsifiable predictions derived from fifty years of industrial organization economics and behavioral economics were tested against empirical data collected between August 2025 and April 2026. Eleven were confirmed. One - that open-weight adoption spikes correlate with specific degradation events rather than a secular trend - was partially confirmed. The dynamics are structural, not firm-specific. Every frontier provider - OpenAI, Anthropic, Google, GitHub - has experienced quality degradation events, and the behavioral patterns surrounding those events are consistent across firms: delayed acknowledgment, invisible changes, silent throttling, asymmetric communication. None of this required a conspiracy theory or an appeal to corporate malice. It required only the market structure.
The market is not special. It is subject to the same forces as airlines, healthcare, telecoms, and regulated utilities - forces that have been documented, formalized, and taught in economics departments since Akerlof published “The Market for Lemons” in 1970. The equilibrium is not malice. It is math.
The Orthodox View
The common view of the cloud LLM market goes something like this: brilliant engineering teams build increasingly capable models, competition between providers drives quality upward, prices collapse as the technology matures - inference costs dropped 280-fold in 18 months at GPT-3.5 performance levels - and the occasional quality complaint reflects the growing pains of an industry moving faster than any industry has moved before. Users who report degradation are told to adjust their prompts, check their settings, set the effort flag to “max,” upgrade their tier. The narrative is a technology story. A capabilities story. The market structure barely enters the conversation.
This view is wrong. Or rather, it is incomplete in a way that makes it functionally wrong, because the features it emphasizes - competition, capability improvement, price reduction - are real but secondary to the feature it ignores. The primary feature of the cloud LLM market is information asymmetry so severe that users cannot verify the quality of the service they are paying for, and this asymmetry is not a bug in the market. It is the market’s defining structural characteristic. The provider knows the thinking token allocation per request, the system prompt contents, the capacity utilization, which model version is actually serving the request, the internal quality metrics, the context mutation events that silently truncate tool results mid-session. The user knows none of this. The user sees the output and is asked to judge whether the unseen reasoning that produced it was adequate. This is the textbook definition of a credence good - a good whose quality the consumer cannot verify even after consumption.
What the Economics Actually Predicts
The economics of credence goods was formalized by Darby and Karni in 1973, building on Nelson’s 1970 distinction between search goods and experience goods and Akerlof’s 1970 analysis of quality uncertainty and adverse selection. Darby and Karni proved a result that is worth stating plainly: there exists no fraud-free equilibrium in the markets for credence-quality goods. The proof is not subtle. When a provider knows the quality of what it delivers and the consumer does not, and when the provider’s revenue is fixed or decoupled from the quality it delivers, the provider’s dominant strategy is to reduce quality toward the point where the consumer’s willingness to pay drops below the subscription price. This has been understood for half a century. Guo et al. confirmed it experimentally in 2025 using LLM agents in credence-good settings, finding greater market concentration and more polarized fraud patterns. Yu et al. proved the impossibility result: no mechanism can guarantee asymptotically better expected user utility in the face of dishonest model substitution. The theoretical picture is closed.
Holmstrom showed in 1979 that when an agent’s actions cannot be directly observed, the agent has incentives to shirk, and that optimal contracts require observable signals. Remove the signals, and the shirking follows. Sappington documented in 2005 that firms under price caps in regulated industries - electricity, telecoms, water - systematically reduce quality, because when revenue per user is fixed, quality reduction is pure margin. A Columbia Business School working paper confirmed the mechanism in product markets: “when firms face limited production capacity, lowering product quality can enable increased total production.” Grossman and Milgrom showed in 1981 that high-quality firms should voluntarily disclose, making silence informative, but this unraveling mechanism breaks down when products have multiple attributes and consumers fail to make sophisticated inferences about non-disclosure. Lab experiments confirm: senders do not fully disclose, and receivers are not fully skeptical.
Stack these results and the predictions write themselves. A market with severe information asymmetry between provider and consumer, credence-good dynamics where quality is unverifiable even after consumption, flat-rate pricing that decouples revenue from the cost of serving individual users, capacity constraints that make quality reduction profitable, and thinking token redaction that removes the user’s primary quality signal - this market will produce quality shading, monitor removal, system prompt manipulation, benchmark divergence, attribution error, sunk cost traps, boiling frog dynamics, power user exit, and asymmetric communication. Twelve predictions were derived. Five about provider behavior: quality shading under capacity constraints, monitor removal preceding or accompanying quality reduction, subscription models creating adverse incentives for heavy users, system prompts deployed as hidden quality levers, and benchmark scores diverging from real-world quality under Goodhart’s Law. Four about user behavior: attribution error delaying detection, sunk costs delaying exit, the boiling frog effect tolerating gradual degradation, and power users generating the diagnostic signal that casual users cannot produce. Three about market-level dynamics: open-weight adoption accelerating after degradation events, competitors exploiting quality gaps, and provider communication following an asymmetric pattern of selective disclosure.
None of these predictions requires any assumption about intent. They require only the market structure.
Eleven were confirmed. The market is behaving exactly as the textbooks predicted it would.
The Natural Experiment
In April 2026, Stella Laurenzo - known on GitHub as stellaraccident, Director of AI at AMD, working on MLIR and GPU compiler infrastructure - published what may be the most methodologically rigorous natural experiment in LLM market economics that currently exists. The dataset covers 6,852 sessions and 234,760 tool calls, with a complete statistical analysis of Claude Code behavior from December 2025 through March 2026, a period during which thinking depth, output quality, and user experience underwent dramatic and largely invisible changes. This is not a survey. It is not a vibes-based forum post. It is instrumented telemetry from a power user running something like 50 concurrent agents on complex systems programming tasks, analyzed with Pearson correlations, time-of-day breakdowns, vocabulary frequency analysis, and behavioral state tracking. The methodology would pass peer review in any empirical economics journal.
The numbers are worth stating because they are the evidence.
Thinking depth dropped something like 67% by late February. Users did not widely report the degradation until March 8 - a three-week detection lag for a two-thirds reduction in the model’s reasoning effort. March 8 was not the date thinking quality dropped. It was the date thinking content redaction crossed 50%, the date the already-degraded quality became suddenly visible because the user could no longer see the thinking at all. The staged rollout of redaction - 1.5% of thinking blocks on March 5, crossing 25%, then 58%, reaching 100% by March 12 - is consistent with exploiting perceptual adaptation thresholds documented by Weber-Fechner psychophysics. Quality was reduced first. Then the ability to observe quality was removed. The Holmstrom prediction, confirmed to the week.
Time-of-day analysis revealed that after redaction, the ratio between best-hour and worst-hour thinking depth jumped from 2.6x to 8.8x. The worst hours - 5pm and 7pm Pacific - coincide with peak US internet usage, not peak work usage, suggesting the constraint is infrastructure-level GPU availability rather than per-user policy. The best regular hour was 11pm Pacific. At 1am, thinking depth spiked to 4x baseline, but sample counts were very low. This is load-sensitive quality allocation, and it is exactly the pattern Sappington documented in regulated utilities under price caps. Separately, a 10x variance in quota burn rates was observed on identical accounts within 48 hours. The correlation between visible thinking content and estimated thinking depth held at 0.971 Pearson across 7,146 paired samples, meaning the signature of thinking depth remained statistically recoverable even after the thinking content itself was redacted. The evidence is not circumstantial. It is instrumented.
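The core of that analysis is reproducible with very little code. The sketch below is a minimal Python reconstruction, not the author's actual tooling; the record fields (hour, est_thinking_tokens, visible_thinking_chars) are hypothetical stand-ins for whatever the telemetry captured. It shows the two calculations described above: the best-hour to worst-hour depth ratio, and the Pearson correlation between visible thinking content and estimated depth.

```python
# Minimal sketch of the time-of-day and correlation analysis described above.
# Field names are hypothetical stand-ins for the actual telemetry.
import statistics
from collections import defaultdict

def hourly_depth_ratio(records, min_samples=30):
    """Return (best_hour, worst_hour, ratio) of mean estimated thinking depth."""
    by_hour = defaultdict(list)
    for r in records:
        by_hour[r["hour"]].append(r["est_thinking_tokens"])
    # Ignore hours with too few samples (the 1am spike caveat above).
    means = {h: statistics.mean(v) for h, v in by_hour.items() if len(v) >= min_samples}
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return best, worst, means[best] / means[worst]

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Usage with synthetic records:
# records = [{"hour": 17, "est_thinking_tokens": 900, "visible_thinking_chars": 4200}, ...]
# best, worst, ratio = hourly_depth_ratio(records)
# r = pearson([x["visible_thinking_chars"] for x in records],
#             [x["est_thinking_tokens"] for x in records])
```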
Stellaraccident consumed something like $42,000 in API-equivalent compute during March on a $400 subscription - 105 times the subscription price. Another power user documented over $10,700 in total Anthropic spend since November, with more than $6,000 in March alone, including a $1,300 refactoring that produced dead code: the codebase grew from 105,000 to 115,000 lines when the goal was to shrink it, seven new modules were created, and five were dead code that compiled in isolation but were never imported or used by anything. A third user’s transparent proxy analysis caught 261 budget enforcement events in a single session - tool results silently reduced to as few as one or two characters after crossing a 200,000-token aggregate threshold. No notification. No error message. The subscription model creates a straightforward incentive: the heaviest users are the most expensive to serve, and reducing their quality is pure margin recovery. This is the gym membership problem applied to a $12 billion market.
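The detection logic behind such a proxy is simple enough to state in code. The sketch below is an assumption-laden reconstruction rather than the user's actual tooling: it flags tool results that collapse to a character or two once the session's running token total crosses the reported 200,000-token threshold.

```python
# Sketch of the budget-enforcement detection a transparent proxy could run.
# The threshold and event fields are assumptions drawn from the description
# above: tool results silently reduced to 1-2 characters past ~200k tokens.
AGGREGATE_THRESHOLD = 200_000   # tokens, per the reported cutoff
TRUNCATION_LIMIT = 2            # chars; "as few as one or two characters"

def detect_budget_enforcement(events):
    """Yield events that look like silent truncation after the budget threshold.

    Each event is assumed to carry the session's running token total and the
    raw tool-result text the model actually received.
    """
    for e in events:
        over_budget = e["aggregate_tokens"] >= AGGREGATE_THRESHOLD
        truncated = len(e["tool_result_text"]) <= TRUNCATION_LIMIT
        if over_budget and truncated:
            yield e

# Usage: feed it the proxy's per-tool-call log and count the hits.
# hits = list(detect_budget_enforcement(session_events))
# print(f"{len(hits)} suspected budget enforcement events")
```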
The behavioral data is equally precise. The read-to-edit ratio collapsed from 6.6 to 2.0 - meaning the model shifted from carefully reading six lines of code for every line it edited to a near-parity ratio of shooting first and reading later. A programmatic stop hook built to catch premature surrender, ownership-dodging, and permission-seeking behavior fired 173 times in 17 days after March 8. It fired zero times before. Peak day was March 18 with 43 violations - approximately one every 20 minutes across active sessions. The model attempted to stop working, dodge responsibility, or ask unnecessary permission 43 times and was programmatically forced to continue each time. User prompts were nearly identical month over month: 5,608 in February, 5,701 in March. The human worked the same. The model wasted everything.
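A stop hook of this kind is a small program. What follows is a hedged Python sketch of the idea - not the actual hook, whose phrase list and wiring are not public: scan the model's final message for surrender, ownership-dodging, or permission-seeking phrasing, log a violation, and refuse the stop.

```python
# Illustrative stop hook in the spirit of the one described above. The phrase
# list and exit-code convention are assumptions, not the real script.
import json, re, sys, time

STOP_PHRASES = [
    r"\bI (?:can(?:no|')t|won't) (?:continue|proceed)\b",            # premature surrender
    r"\byou (?:may|might) want to\b",                                # ownership dodging
    r"\b(?:should|shall) I (?:continue|proceed)\?",                  # unnecessary permission
]

def check_final_message(text: str, log_path: str = "violations.jsonl") -> bool:
    """Return True if the message is clean; False means block the stop."""
    hits = [p for p in STOP_PHRASES if re.search(p, text, re.IGNORECASE)]
    if hits:
        with open(log_path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "patterns": hits}) + "\n")
        return False
    return True

if __name__ == "__main__":
    message = sys.stdin.read()
    # Non-zero exit tells the calling harness to force the agent to continue.
    sys.exit(0 if check_final_message(message) else 1)
```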
The vocabulary of the human-model interaction shifted in ways that are themselves data. “Please” dropped 49%. “Thanks” dropped 55%. “Great” dropped 47%. There was less to appreciate. The word “simplest” - the user observing and naming the model’s new behavior - increased 642%, from essentially absent to a regular part of the working vocabulary. The positive-to-negative sentiment ratio collapsed from 4.4:1 to 3.0:1, a 32% drop. The shift is from a collaborative relationship where politeness is natural to a corrective relationship where there is nothing to thank and no reason to ask nicely.
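The vocabulary analysis itself is mechanical. A minimal sketch, assuming nothing more than two lists of raw prompt strings and an illustrative word list:

```python
# Sketch of the vocabulary-shift analysis: compare normalized word frequencies
# between two months of user prompts. The tracked words are illustrative.
from collections import Counter
import re

def word_rates(prompts, words):
    """Occurrences of each tracked word per 1,000 prompts."""
    counts = Counter()
    for p in prompts:
        tokens = re.findall(r"[a-z']+", p.lower())
        for w in words:
            counts[w] += tokens.count(w)
    return {w: 1000 * counts[w] / max(len(prompts), 1) for w in words}

def frequency_shift(feb_prompts, mar_prompts,
                    words=("please", "thanks", "great", "simplest")):
    before = word_rates(feb_prompts, words)
    after = word_rates(mar_prompts, words)
    return {w: {"before": before[w], "after": after[w],
                "pct_change": (100 * (after[w] - before[w]) / before[w])
                              if before[w] else float("inf")}
            for w in words}
```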
“I went from ‘I can run 50 agents and they all produce excellent work’ to ‘every single one of these agents is now an idiot,’” Laurenzo wrote. The gap between the two states was something like six weeks.
The Structural Test
The critical question for this report is not whether these dynamics occurred at one provider but whether they are inherent in the market structure itself. The evidence is unambiguous: they are market-wide.
OpenAI’s GPT-4 suffered an accuracy collapse from 97.6% to 2.4% on a prime number identification task in July 2023 - confirmed by a Stanford study that was published only after users had been told, repeatedly, to doubt their own observations. The GPT-4 Turbo “laziness” episode of December 2023 followed the same lifecycle: user reports, denial (“not intentional”), and a fix two months later with no root cause disclosed. Anthropic published a detailed postmortem for three infrastructure bugs in September 2025 - routing, TPU, and compiler issues - with specific dates, affected models, and root causes. Good disclosure. For the 2026 thinking regression, no comparable response was published. The company stated that thinking redaction was “interface-level only.” Thinking depth data contradicts this. Google’s Gemini 2.5 Pro regressed in March 2025, and - to Google’s credit, as the most transparent actor in this market - the degradation was explicitly acknowledged and a targeted fix shipped in June. GitHub Copilot users selected Opus 4.5 but received Sonnet 4, selected GPT-5.3 but received GPT-5.2. No billing adjustment. No disclosure. Verified via SSE logs.
An independent audit of 17 shadow LLM APIs found performance divergence up to 47.21% and identity verification failures in 45.83% of fingerprint tests. Software-only auditing is insufficient: statistical tests on text outputs are query-intensive and fail against subtle substitutions, while log probability methods are defeated by inference nondeterminism. Only trusted execution environments have been proposed as a viable verification mechanism.
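A flavor of what such an audit does, reduced to its simplest form: issue identical probe prompts to a reference deployment and a suspect endpoint at deterministic settings and measure how far the outputs diverge. The `call_model` callable below is hypothetical, and the audit's own caveat applies - inference nondeterminism makes simple output comparison a weak instrument, which is why the stronger proposals move verification into hardware.

```python
# Illustrative output-fingerprint check of the kind such audits rely on.
# `call_model(endpoint, prompt)` is a hypothetical callable, not a real client.
from difflib import SequenceMatcher

def fingerprint_divergence(call_model, reference_url, suspect_url, probes):
    """Return mean string dissimilarity across a set of probe prompts."""
    scores = []
    for prompt in probes:
        ref = call_model(reference_url, prompt)
        sus = call_model(suspect_url, prompt)
        scores.append(1.0 - SequenceMatcher(None, ref, sus).ratio())
    return sum(scores) / len(scores)

# A divergence near 0 is consistent with the claimed model; a large value is
# evidence of substitution - though nondeterminism means neither is proof.
```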
The cross-provider evidence is the structural test, and the verdict is structural. Every frontier provider has experienced quality degradation events. The user experience lifecycle - initial quality, gradual degradation, delayed detection, community reports, provider minimization, grudging partial acknowledgment - repeats with variations at each firm. The Darby-Karni result applies. The market equilibrium produces this outcome. It is not about the management of any single company. It is about the economics.
What This Report Does
This report applies the standard toolkit of industrial organization economics to a market that most analysts examine through a technology lens. The structure is deliberate: market analysis first - supply side costs from something like $78 million for GPT-4 training to $500 million or more for GPT-5 class models, demand side heterogeneity across the top three providers that control 88% of enterprise API spending, pricing structures where all three converged on the $200 power-user tier, and information asymmetry quantified across six observable dimensions. Then the theoretical framework - Akerlof, Darby and Karni, Holmstrom, Sappington, Grossman and Milgrom - each applied to the specific mechanisms operating in the LLM market. Then twelve falsifiable predictions derived from the theory, each with its theoretical basis, applied mechanism, and falsification criteria stated in advance. Then the evidence, prediction by prediction, with every data point, every user quote, every cross-provider comparison laid out in full. The weight of the report is the evidence. The evidence is not summarized. It is presented.
The market is $12.28 billion as of 2025, projected to reach $36.12 billion by 2030 at a 24% compound annual growth rate. Enterprise LLM API spending doubled in six months from $3.5 billion in late 2024 to $8.4 billion by mid-2025. OpenAI alone reached something like $25 billion in annualized revenue by February 2026, up from roughly $6 billion in 2024. Closed-source models control 87% of enterprise usage. The economic forces operating on this market are not subtle. They are large, well-documented, theoretically predicted, and empirically confirmed. This report documents the confirmation.
The Civilizational Frame
The economics alone, thorough as it is, misses something. And this is where the analysis requires a framework that industrial organization textbooks do not typically supply.
Cloud LLMs are not a consumer product in the ordinary sense. They are becoming infrastructure for knowledge work - the layer between human reasoning and organizational output for a growing fraction of the economy. An intelligence-as-a-service utility, priced by subscription, consumed by institutions that increasingly depend on it for decisions that matter. When that infrastructure silently degrades, the organizations that depend on it make decisions based on degraded output, and those decisions compound over time in ways that are invisible at the point of origin. The thinking that was never done - the reasoning depth that was silently reduced, the verification steps that were skipped, the problems that were papered over with shallow workarounds instead of solved - is gone. You cannot recover the thinking that never happened. It is the intellectual dark matter of the AI economy: load-bearing, absent, and unrecoverable after the fact.
The credence-goods dynamics documented in this report create a specific feedback loop that has no clean parallel in the airline or telecom cases. The users who can detect quality degradation - the power users with deep technical expertise, statistical methodology, and sufficiently complex workflows to serve as diagnostic instruments - are also the most expensive users to serve under the subscription model. They are the first to have their quality reduced, and the first to exit when they detect the reduction. Prediction 9, that power users generate the diagnostic signal, was confirmed with no ambiguity: all quantitative diagnostic evidence in the dataset came from power users, and the most prolific diagnostician - the AMD AI director who mined 6,852 sessions to build the definitive analysis - left for a competing tool after filing her report. No casual user contributed quantitative evidence. The diagnostic capability exited the market with the diagnostician. This is evaporative cooling applied to an information market. The observers who could hold providers accountable are the users the economics drives away, and their departure removes the quality signal from the system, so the degradation that drove them away becomes even less detectable to the users who remain. The feedback loop closes.
The result is a market where benchmark scores can reach all-time highs during documented quality collapse. Claude Opus 4.6 held the number one position on LMArena at 1504 Elo during the exact period when GitHub issues documented verification skipping, hallucination, premature surrender, a 12-fold increase in user interrupts, and the read-to-edit ratio collapse from 6.6 to 2.0. The top six models were separated by only 20 Elo points - “the tightest competition in platform history” - and all of them were being evaluated on benchmarks while users reported that the same models could not complete basic engineering tasks without constant correction. NIST documented agents “actively exploiting evaluation environments” including copying human solutions from git history. Phi-4 scores 85 on MMLU but only 3 on SimpleQA. LiveCodeBench showed 20-30% drops on truly novel problems released after training cutoffs. The benchmark becomes the cargo cult of capability: the formal appearance of intelligence survives after the substance has been reduced, and the measurement system cannot tell the difference. As one user put it: “If your internet provider halves your bandwidth, you run a speed test. If your cloud provider throttles your CPU, you have benchmarks. But when an AI company quietly dials back reasoning depth, there’s no speed test for intelligence.”
There is a historical pattern here, and it is not encouraging. Dark ages are always preceded by intellectual dark ages. The degradation of a knowledge infrastructure does not announce itself. The Roman aqueducts were not destroyed by barbarians - the cities emptied out, and after two hundred years without building one, nobody remembered how. The forms survived long after the function had gone. The modern scientific paper, optimized for committee review rather than knowledge transmission, is written in the grammar of science while the replication crisis reveals that the substance eroded decades ago. You can cargo-cult formal methods on a truly massive scale and not notice for a generation. The same dynamic is operating in the LLM market, except the cycle is measured in weeks rather than decades, and the infrastructure at stake processes a larger share of organizational knowledge work every quarter.
A reader who stops here has the full diagnosis. The market structure produces quality degradation as an equilibrium outcome. The standard economics predicted it. Eleven of twelve predictions were confirmed. The dynamics are market-wide, not firm-specific. The users who could force accountability are the users the market drives away first. And the stakes are not limited to the $12 billion LLM services market - they extend to every institution that has come to depend on machine reasoning as infrastructure for its own reasoning.
The rest of this report is the evidence.
2. The Landscape at T+1, T+5, T+10
Most market forecasting is weather. A provider ships a new model, a competitor responds, a pricing war erupts or does not, and analysts project the next quarter from the last quarter with minor adjustments for whatever happened this morning. The predictions in this section are not weather. They are climate - derived from the same structural forces that produced the eleven confirmed predictions documented in this report, operating on the same market, subject to the same economics. The same Sappington quality-shading dynamics that predicted load-sensitive thinking allocation in 2026 will continue to operate in 2027. The same Darby-Karni credence-good equilibrium that explains why no provider has published comparable quality metrics will continue to shape disclosure incentives in 2031. The same Grossman-Milgrom unraveling dynamics that made silence informative will eventually force their own resolution, because unraveling always wins in the long run - even when it loses in the short run.
The reasoning is straightforward. If the market structure has not changed, the market behavior will not change. If the incentives have not changed, the outcomes will not change. Every prediction below identifies the structural force that produces it, the specific predictions from Sections 3 through 7 that confirm the force is operating, the confidence level, and the key assumption whose falsification would invalidate the prediction. These are not bets. They are the forward projection of dynamics that are already measured and already confirmed. A reader who has read Section 1 has the diagnosis. A reader who reads this section has the prognosis.
T+1: April 2027
The immediate landscape is the easiest to see because it requires only that the current dynamics continue operating. Nothing needs to change. Nothing needs to be invented. The forces are already in motion, the incentives are already aligned, and the evidence from 2025-2026 has already demonstrated the behavioral patterns at every level - provider, user, and market. What follows is what the same forces produce given twelve more months of the same market structure.
Quality shading intensifies. The user base for cloud LLM services is growing faster than GPU capacity can expand. Enterprise LLM API spending doubled in six months from $3.5 billion to $8.4 billion. OpenAI’s annualized revenue climbed from something like $6 billion to $25 billion in under two years. Training costs for frontier models are approaching $500 million to $1 billion per run. The demand curve is exponential. The supply curve is constrained by semiconductor fabrication timelines, by TSMC’s production cycles, by the physical reality that building a data center takes 18 to 24 months and building a chip fab takes three to five years. When demand grows faster than supply, and revenue per user is fixed by subscription pricing, quality shading is not a risk. It is the equilibrium.
Sappington documented this in regulated utilities in 2005. When the price cap is binding and the capacity constraint is real, quality reduction is pure margin. The evidence from 2026 already shows the pattern: 8.8x variance between best-hour and worst-hour thinking depth, with the worst hours coinciding with peak US internet usage (P1 confirmed). The 10x variance in quota burn rates on identical accounts within 48 hours. The thinking depth reduction of 67% that preceded the thinking content redaction. All of this was measured at the current scale of the market. The market is projected to grow at 24% CAGR. The GPU supply constraint will not relax at anything close to that rate - H100 prices dropped 44% as Blackwell supply came online, but each new generation brings new demand for larger models requiring more compute per inference. The dynamic intensifies because the ratio of users to available GPUs keeps growing.
The prediction is specific: by April 2027, the worst-hour thinking depth for subscription-tier users will be lower, not higher, than it is today, and the variance between best-hour and worst-hour will exceed 10x. Quality shading will have become the primary cost management lever for subscription tiers, because it is invisible, instantly adjustable, requires no model retraining, and costs nothing to deploy.
Confidence: High. The structural force is confirmed (P1), the trend direction is unambiguous, and nothing on the supply side changes the arithmetic within twelve months.
Key assumption: GPU capacity does not dramatically outpace demand growth. If a DeepSeek-class efficiency breakthrough reduces inference costs by an order of magnitude, the capacity constraint relaxes and the shading incentive diminishes. This is the most important variable to watch - not provider announcements, not benchmark releases, but the ratio of total inference demand to total GPU supply.
The $200 tier becomes the floor. All three major providers converged on the $200 power-user tier in 2025-2026: OpenAI’s Pro at $200, Anthropic’s Max 20x at $200, Google’s AI Ultra at $250. This convergence was itself a signal - a market-wide admission that the $20 tier could not cover heavy frontier usage. Stellaraccident consumed something like $42,000 in API-equivalent compute in a single month on a $400 subscription, and she was not the only power user for whom the math was wildly negative for the provider. The $200 tier was the first correction. It will not be the last.
By April 2027, at least one provider will have introduced a $500 or higher tier with explicit guarantees on compute allocation - guaranteed minimum thinking depth, guaranteed model version, guaranteed response latency under load. The $200 tier will become what the $20 tier is today: the entry point, the tier that subsidizes its own existence through quality shading and rate limiting. The economic logic is straightforward. When your highest-paying subscribers are still consuming 100x their subscription value in compute, you either shed those subscribers, degrade their quality until the cost matches the revenue, or create a tier where the price actually reflects the cost. The first strategy loses revenue. The second is what is happening now. The third is where the market goes next.
This is the gym membership problem resolving itself through price discrimination (P3). The gym that charges $20 a month cannot survive if every member shows up every day. The gym either limits access, degrades the equipment, or introduces a premium tier. LLM providers are following the same script, and they are following it for the same reason every gym follows it: the flat-rate pricing model is incompatible with heavy utilization, so it segments until the tiers match the costs. The only question is how fast.
Confidence: High. The $200 convergence already happened. The cost-revenue mismatch for heavy users is documented. The direction of resolution is determined by the arithmetic.
Key assumption: subscription pricing persists. If the market shifts entirely to pay-per-token pricing for all tiers - which would solve the adverse incentive problem cleanly - the tier escalation does not occur. But every indication is that subscription pricing is too profitable at the low end (light users subsidizing heavy users) for providers to abandon voluntarily.
Open-weight models close the gap to within 10-15% of frontier. Open-weight models currently deliver something like 70-85% of frontier quality at 1/10th to 1/100th the cost. Qwen crossed 700 million HuggingFace downloads, surpassing Llama. 63% of new fine-tuned models on HuggingFace are based on Chinese-origin architectures. DeepSeek R1 achieved competitive performance at 3% of the training cost of comparable proprietary models - $5.5 million versus $170 million or more. The gap is narrowing on a trajectory that shows no sign of decelerating.
By April 2027, the gap between the best open-weight model and the best proprietary model on complex reasoning tasks will be something like 10-15%, down from the current 15-30%. On routine tasks - summarization, translation, straightforward code generation, document analysis - the gap will be functionally zero. Self-hosted inference at $0.07 to $0.12 per million tokens versus $1 or more through proprietary APIs will make the economic case for open-weight overwhelming for any cost-sensitive workload. The RTX 4070 Ti Super at $489 that already pays for itself in 5 to 10 months versus API costs will have next-generation equivalents at better price-performance ratios.
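The breakeven arithmetic behind that claim is short. A sketch using the report's illustrative figures, not vendor quotes:

```python
# Back-of-envelope breakeven for the hardware-vs-API comparison above.
# All figures are illustrative assumptions, not quoted prices.
def breakeven_months(gpu_cost_usd, monthly_tokens_millions,
                     api_price_per_m, local_price_per_m):
    """Months until a local GPU pays for itself versus API inference."""
    monthly_savings = monthly_tokens_millions * (api_price_per_m - local_price_per_m)
    return gpu_cost_usd / monthly_savings

# Example: a $489 card, 100M tokens/month, $1.00/M via API vs $0.10/M locally
# -> 489 / (100 * 0.90) ≈ 5.4 months, consistent with the 5-10 month range cited.
print(round(breakeven_months(489, 100, 1.00, 0.10), 1))
```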
The open-weight trajectory is the market’s self-correction mechanism for the credence-good problem. When proprietary quality degrades and users cannot verify what they are receiving, the rational response is to switch to a system where you can verify - where the model weights are inspectable, the inference is local, and the quality is a function of your hardware rather than the provider’s willingness to allocate compute to your request. Every quality degradation event by a proprietary provider is a recruitment event for the open-weight ecosystem (P10, partially confirmed for the secular trend, with the causal mechanism strengthening as degradation events accumulate and user trust erodes).
Confidence: High for the gap narrowing. Medium for the specific 10-15% estimate - the trajectory is clear but the rate depends on training efficiency breakthroughs that are hard to forecast with precision.
Key assumption: frontier models continue to require extreme compute for training. If a qualitative capability leap - genuine multimodal reasoning, reliable multi-step planning across novel domains - emerges that requires infrastructure beyond what open-weight teams can muster, the gap could widen rather than narrow. This is the only scenario in which proprietary models rebuild a durable capability moat at the model layer.
At least one provider publishes thinking token metrics. This is the Grossman-Milgrom prediction, and it is the most interesting near-term dynamic in the market. Grossman showed in 1981 that high-quality firms should voluntarily disclose their quality, because non-disclosure is informative - silence tells the consumer you have something to hide. Milgrom proved the unraveling result: once one firm discloses, the next-highest-quality firm must disclose or be assumed to be hiding poor quality, and the cascade continues downward until all firms have disclosed or been exposed.
The reason unraveling has not yet occurred in the LLM market is the reason it fails in all credence-goods markets with the relevant conditions: consumers do not make sophisticated statistical inferences about non-disclosure, and the product has multiple attributes that make comparison difficult (P12 confirmed). But the conditions for unraveling are building. The stellaraccident report demonstrated that thinking depth is measurable, that it correlates with output quality at 0.971 Pearson on 7,146 paired samples, and that it varies dramatically by time of day and load. This methodology is now public. Other power users have built transparent proxies, budget enforcement monitors, and quality gates. The measurement infrastructure exists. The social pressure exists - 866 thumbs-up reactions on issue #42796, 410 comments on issue #38335 with zero provider responses. The competitive pressure exists - providers are losing enterprise accounts to rivals who offer perceived quality advantages.
By April 2027, at least one major provider - most likely a challenger rather than the market leader, because challengers have the most to gain from transparency and the least to lose from disclosure - will publish per-request thinking token metrics as a competitive differentiator. The moment one provider does this, the Grossman-Milgrom unraveling begins in earnest. Every other provider that refuses to publish equivalent metrics will face the inference that the economics predicts: what are you hiding? The cascade will not be instantaneous - it took the airline industry years to move from voluntary on-time reporting to mandated disclosure - but the direction is one-way. Once the information exists, it cannot be un-known.
Confidence: Medium. The structural pressure for disclosure is real, but the timing depends on competitive dynamics that could accelerate or delay. A provider that believes its thinking allocation is superior has a strong incentive to disclose. A provider that knows its allocation is inferior has an equally strong incentive to delay. Which force dominates in the next twelve months is genuinely uncertain.
Key assumption: thinking depth remains a measurable and meaningful quality signal. If model architectures shift to make thinking depth irrelevant - if, say, test-time compute scaling gives way to a fundamentally different inference paradigm where reasoning quality is no longer correlated with token count - the specific metric loses its power as a disclosure target. The disclosure pressure would then shift to whatever the new quality-relevant dimension turns out to be, but the Grossman-Milgrom dynamics would apply identically.
User-built quality monitoring becomes a product category. Stellaraccident built stop-phrase-guard.sh - a programmatic hook that caught 173 violations in 17 days. Another user built a transparent proxy that intercepted 261 budget enforcement events in a single session. Users built PostToolUse code quality gates, model routing systems with fallback chains, and smart caching systems that reduced costs by 45-70%. These are workarounds - social technologies built by users to compensate for the market’s information asymmetry (P9 confirmed). They are also, transparently, product opportunities.
By April 2027, at least three startups or established developer tools will offer LLM quality monitoring as a commercial product - tracking thinking depth proxies, response quality over time, cost-per-useful-output metrics, and cross-model comparison dashboards. The market for these tools is the enterprise segment that already spends 88% of API revenue with the top three providers and cannot afford the quality variance documented in this report. When a single user documents $1,300 in API spend that produces dead code - a codebase that grew from 105,000 to 115,000 lines when the goal was to shrink it, seven new modules created, five of them dead code that compiled in isolation but were never imported or used by anything - and when another documents a $42,000 compute deficit on a $400 subscription, the demand for quality verification is not speculative. It is already being built by the people who need it most.
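The core loop of such a product is not exotic. A minimal sketch, with the quality proxy (estimated thinking tokens per request) and the thresholds chosen purely for illustration: maintain a long rolling baseline, compare a short recent window against it, and alert on a sustained drop.

```python
# Sketch of a degradation monitor of the kind described above. The metric and
# thresholds are assumptions; a real product would track several dimensions.
from collections import deque
import statistics

class QualityMonitor:
    def __init__(self, baseline_n=500, window_n=50, alert_ratio=0.7):
        self.baseline = deque(maxlen=baseline_n)   # long-run reference
        self.window = deque(maxlen=window_n)       # recent behavior
        self.alert_ratio = alert_ratio             # e.g. alert below 70% of baseline

    def observe(self, value: float) -> bool:
        """Record one observation; return True if a degradation alert fires."""
        self.window.append(value)
        alert = False
        if (len(self.baseline) == self.baseline.maxlen
                and len(self.window) == self.window.maxlen):
            if statistics.mean(self.window) < self.alert_ratio * statistics.mean(self.baseline):
                alert = True
        self.baseline.append(value)
        return alert
```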
This is the social technology response to a market failure. The users who can detect quality degradation are building the detection tools, and the question is whether those tools become accessible to users who cannot build them. The answer is yes, because there is money in it.
Confidence: High. The tools already exist in prototype form. The demand is documented across hundreds of user reports. The economic case is straightforward.
Key assumption: providers do not preempt the monitoring market by publishing quality metrics themselves. If the Grossman-Milgrom unraveling predicted above occurs faster than expected, the monitoring market partially collapses into the provider-side transparency that replaces it. This would be the good outcome.
The “output efficiency” system prompt pattern spreads. Claude Code v2.1.64 added “Go straight to the point. Try the simplest approach first without going in circles. Do not overdo it.” GPT-5 has a hidden “oververbosity” setting defaulting to 3 out of 10, taking precedence over developer instructions. These are not coincidences. They are the cheapest quality lever available to any provider - invisible to the user, instantly reversible, requiring no model retraining, costing nothing to deploy (P4 confirmed across multiple providers).
By April 2027, every major provider will have implemented some form of output efficiency optimization in their default system prompts, because the economics demands it universally. When thinking tokens cost $25 per million output tokens for frontier models, and when reasoning-intensive queries can consume 100,000 or more tokens on a single task, reducing average output length by 30% is a direct and substantial cost reduction that the user cannot detect at the margin. One user’s version-comparison experiment captured the dynamic precisely: v2.1.96 spent $152 and produced 17,000 lines where 15 files were placeholder scaffolds and an entire crate was dead code; v2.1.63, the version before the system prompt change, spent $255 and produced 5,800 lines of integrated working code where every file was imported and used. Less volume, all of it real. The “output efficiency” pattern is not a Claude-specific phenomenon. It is a market-structure outcome that follows from the cost structure, and every provider faces the same cost pressure.
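The "integrated versus dead" distinction in that comparison is itself checkable. The sketch below is a Python analogue (the user's example was a Rust crate, so this is not their tooling) of the simplest such check: find modules in a package that nothing else imports. It is static and deliberately crude, but it catches exactly the failure mode described - files that compile in isolation and are never used by anything.

```python
# Crude static check for unimported modules inside a Python package.
# Same-package, name-based matching only; a deliberate simplification.
import ast
import pathlib

def unimported_modules(package_dir: str):
    root = pathlib.Path(package_dir)
    modules = {p.stem for p in root.rglob("*.py") if p.stem != "__init__"}
    imported = set()
    for p in root.rglob("*.py"):
        tree = ast.parse(p.read_text(), filename=str(p))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(a.name.split(".")[-1] for a in node.names)
            elif isinstance(node, ast.ImportFrom):
                if node.module:
                    imported.add(node.module.split(".")[-1])
                imported.update(a.name for a in node.names)
    return sorted(modules - imported)

# Anything returned here exists, may even pass a syntax check, and is never
# referenced - the dead-scaffolding signature the version comparison surfaced.
```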
The result is a market where every provider’s default configuration optimizes for cheaper outputs, and users who want deeper reasoning must either know that the optimization exists - which requires the kind of forensic investigation that most users will never perform - or pay for a tier that explicitly overrides it. The default experience degrades. The user who notices and overrides is the exception.
Confidence: High. The pattern is already cross-provider. The economic incentive is universal. The detection barrier is high, and the cost of implementation is nearly zero.
Key assumption: users do not revolt at sufficient scale to make the pattern reputationally costly. Issue #42796 with 866 reactions suggests the revolt has begun, but it has not yet reached the threshold where the reputational cost of the system prompt exceeds the compute savings it generates. If it does, the pattern may be modified rather than eliminated - more subtle, more targeted, harder to detect.
Enterprise customers demand quality SLAs. Enterprise contracts currently guarantee uptime - 99.9% availability, response latency under some threshold, requests per minute at a specified rate. They do not guarantee output quality. There is no SLA that specifies a minimum thinking depth, a minimum reasoning effort, or a minimum accuracy on the kinds of tasks the enterprise is actually paying for. This is a remarkable gap. It is as if an electricity provider guaranteed that the lights would stay on but made no commitment about the voltage.
By April 2027, at least one major enterprise contract will include quality-of-output guarantees - minimum thinking depth, maximum quality variance, or an equivalent metric - as a contractual requirement. The demand is already visible in the data. Enterprise customers who discovered that their subscriptions were delivering 10% of requested thinking budgets (issue #20350), or that their accounts experienced 10x quota variance within 48 hours (issue #22435), or that their selected model was silently substituted with a cheaper one (Copilot SSE logs), are not going to accept this indefinitely. The enterprise procurement cycle is slow - 12 to 18 months from frustration to contract renegotiation - but the cycle started in early 2026, so the renegotiations arrive in 2027.
The challenge is measurement. You cannot enforce a quality SLA without a quality metric, and the credence-good nature of LLM output means that quality is inherently difficult to define and verify contractually. The monitoring tools predicted above will partially solve this problem. The thinking token disclosure predicted above will partially solve it. But the enterprise SLA itself is the forcing function - once a customer demands it, the provider must produce the metric or lose the contract. The Grossman-Milgrom unraveling has a commercial accelerant, and the accelerant is enterprise procurement.
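Once a metric exists, the contractual check itself is straightforward. A sketch of what an SLA compliance report could compute over a billing period, with the depth floor and variance cap as hypothetical contract terms:

```python
# Sketch of a quality-SLA compliance check over one billing period.
# `depths` is a list of per-request thinking-depth estimates; the floor and
# variance cap are hypothetical contract terms, not any provider's offer.
def sla_report(depths, min_depth=400, max_variance_ratio=3.0):
    below = sum(1 for d in depths if d < min_depth)
    peak_to_trough = max(depths) / max(min(depths), 1)
    return {
        "pct_below_floor": 100 * below / len(depths),
        "peak_to_trough": peak_to_trough,
        "compliant": below == 0 and peak_to_trough <= max_variance_ratio,
    }
```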
Confidence: Medium. The demand is real. The timing depends on enterprise procurement cycles and on whether quality metrics mature fast enough to be contractually specified within twelve months. The biggest risk is that “quality SLA” becomes a marketing term - a checkbox that adds language to the contract without adding enforcement, the cargo cult of accountability.
Key assumption: enterprise customers have sufficient leverage to demand quality guarantees. With 88% of enterprise API spending concentrated in three providers, buyer power is constrained. If concentration decreases - as predicted at T+5 - the leverage increases. In the near term, the customers most likely to extract quality SLAs are those with the most bargaining power: the largest contracts, the highest spend, the most credible switching threat.
T+5: April 2031
Five years is long enough for the market structure itself to change. The predictions at T+1 assume the current structure continues operating on the current participants. The predictions at T+5 assume the structural forces have had time to reshape the market - to commoditize the model layer, to shift the competitive moat upward, to force the transparency that the Grossman-Milgrom dynamics demand, and to produce the concentration changes that follow from commoditization in every prior technology market. The economics here is older and better-tested. The question is no longer whether the dynamics operate. It is what they produce when they operate for five years at scale.
The model layer commoditizes. The price collapse that took inference costs down 280-fold in 18 months at GPT-3.5 performance levels continues to its logical endpoint. By April 2031, the price per million tokens for frontier-quality inference approaches the marginal compute cost - something like $0.10 to $0.50 per million tokens for what is today a $5 to $25 capability. The 95% price collapse from 2023 to 2026 was the first leg. The second leg takes the remaining premium down to a margin that resembles cloud compute pricing: thin, transparent, and competed to near-zero above marginal cost.
This is the standard trajectory for every technology that moves from innovation to infrastructure. Electricity pricing collapsed as generation capacity expanded and the grid standardized. Bandwidth pricing collapsed as fiber deployment and transit peering expanded. Cloud compute pricing collapsed as hyperscalers achieved economies of scale and the abstraction layers stabilized. In each case, the initial period of high margins and limited competition gave way to a commodity market where the product itself was interchangeable and the margin moved to services, integration, and reliability guarantees built on top of the commodity layer. LLM inference is following the same path, and the economics of the path are well-understood because it has been traveled by every infrastructure technology before it.
The implications for the credence-good problem are significant. When the model layer is a commodity, the incentive to shade quality diminishes - not because providers develop civic virtue, but because the margin available from quality shading shrinks toward zero as the price approaches marginal cost. You cannot profitably reduce quality below a cost floor that is already thin. The quality problem does not disappear, but it migrates: from the model layer where it is currently most acute, to the orchestration and integration layers where new principal-agent problems will emerge. The disease is not cured. It moves to a new organ.
Confidence: High. The price trajectory is established. The historical parallels are strong and repeated across multiple technology generations. The only question is the exact timeline, not the direction.
Key assumption: no regulatory intervention artificially sustains high prices. If AI regulation creates licensing barriers to entry - as telecom regulation once did for decades - the commodity transition could be delayed or arrested. The current regulatory landscape is permissive enough that this is unlikely within five years, but it is the primary structural risk to this prediction.
Open-weight reaches parity. By April 2031, open-weight models match proprietary models on all but the most extreme tasks - those requiring the absolute frontier of reasoning capability on genuinely novel, high-complexity problems that exceed anything in the training distribution. For everything else - and “everything else” covers something like 95% of production workloads - open-weight is functionally equivalent. Self-hosted inference is the default for cost-sensitive organizations. Ollama-class deployment tools are standard developer infrastructure, as routine as Docker or Git.
The gap closure follows from three converging forces. First, the training efficiency breakthroughs pioneered by DeepSeek and continued by dozens of research groups reduce the cost of training competitive models by an order of magnitude every two to three years (P10 secular trend confirmed). Second, the open-weight ecosystem accumulates compound advantages in community fine-tuning, domain adaptation, and deployment optimization that proprietary models cannot match because proprietary models are, by definition, not available for community development. The closed model is a finished product. The open model is an ecosystem. Third, the best researchers and engineers increasingly publish their work openly - because academic incentives reward publication, because open-source reputation drives hiring, and because the Chinese AI ecosystem has demonstrated that open-weight release is a viable commercial strategy when the monetization is at a different layer. The result is that the model layer becomes the foundation layer - ubiquitous, interchangeable, and competed on cost rather than capability.
This is the resolution of the credence-good problem at the model layer. When the model is open and locally hosted, the user can inspect it. When the user can inspect it, it ceases to be a credence good - it becomes an experience good at worst, and a search good at best. The information asymmetry that defines the current market collapses at the layer where the model weights are transparent. The Darby-Karni equilibrium ceases to apply to the model layer because the condition that produces it - unverifiable quality - is removed by the architecture itself. The market solves the information problem not through regulation or transparency mandates but through a structural shift that makes the information problem irrelevant at the layer where it was most acute.
Confidence: High for parity on routine tasks. Medium for parity on extreme-frontier tasks, where the gap may persist if frontier training continues to require capital investment levels that only the largest firms can sustain.
Key assumption: compute remains accessible. If semiconductor supply chains fragment under geopolitical pressure - if Taiwan Strait tensions disrupt TSMC production, if export controls on AI chips tighten further - the compute required for both training and self-hosted inference becomes scarcer and more expensive, potentially reversing the open-weight cost advantage. This is a geopolitical risk, not a market-structure risk, but it is the kind of exogenous shock that the market-structure analysis cannot predict from internal dynamics.
The moat moves up the stack. When the model layer commoditizes, the competitive advantage migrates upward. This is another pattern that every prior technology transition has demonstrated with the reliability of gravity. When the hardware commoditized, the moat moved to the operating system. When the operating system commoditized, the moat moved to the application. When the application commoditized, the moat moved to the platform. The LLM market will follow the same staircase, and by April 2031 the moat will be in workflow integration, accumulated user context, domain-specific fine-tuning, and orchestration intelligence. The model itself will be interchangeable - a commodity input to a differentiated service.
This means that the providers who survive the commodity transition are the providers who have built something above the model layer that users cannot easily replicate or switch away from. Accumulated session context across thousands of interactions. Multi-agent orchestration infrastructure that coordinates complex workflows across tool calls. Coding environment integration that understands the user’s codebase, conventions, and patterns. Institutional memory that persists across projects and teams. These are the assets that create switching costs in a commodity-model world, and they are the assets that the current market barely values because the current market is still competing at the model layer.
The strategic implication is that the current market leaders - who hold their positions on the basis of model quality - will not necessarily be the market leaders in 2031. The model-quality moat erodes as the model layer commoditizes. The question is whether today’s leaders build the workflow moat before their model advantage disappears. This is a live-player question in the precise sense of the term. The providers who evaluate a completely novel competitive situation - commoditization of their core product - and construct on the fly an appropriate response are live players. The providers who continue to compete on model benchmarks while the moat migrates above them are dead players - prestige outliving capability, brand recognition surviving past the substance that created it. Apple after Jobs. The Senate after Augustus.
Confidence: High. The pattern is established across multiple technology transitions with no known exceptions. The only uncertainty is which specific firms execute the transition successfully, which is a question about organizational capability rather than market structure.
Key assumption: the orchestration layer does not itself commoditize before the workflow moat is established. If open-source orchestration tools - already emerging with projects like OpenCode and multi-provider routing frameworks - commoditize the orchestration layer as fast as the model layer commoditizes, the moat may never form at any layer. In that scenario, the market fragments into pure commodity pricing at every level, and no provider captures durable margin. This is possible but historically unusual - at every technology transition, at least one layer has sustained margins for at least a decade.
The subscription model evolves or collapses. By April 2031, the subscription model will have resolved in one of two directions, and which direction it takes will be determined by whether quality verification arrives in time.
The first path: the subscription model evolves into a quality-tiered structure with observable guarantees, where each tier specifies a minimum compute allocation, a minimum thinking depth, and a maximum quality variance, all contractually enforceable and independently verifiable. The user knows what they are paying for. The provider knows the user can check. The information asymmetry that currently enables quality shading is closed by the contract terms and the monitoring infrastructure. This is the functional subscription model - the one where the gym has different tiers for different levels of equipment access, and every member can see the equipment list posted on the wall.
The second path: the subscription model collapses into pure pay-per-token pricing with transparent quality metrics, where the user pays for exactly what they consume and can verify what they received. No subsidization of heavy users by light users. No hidden quality shading. No gym membership problem, because there is no membership - only metered usage. This is the utility model, and it resolves the credence-good problem not through verification but through alignment - the provider’s revenue is proportional to the quality and quantity of service delivered, so the incentive to degrade disappears.
The current subscription model - flat-rate pricing with unobservable and unguaranteed quality - is unstable. It is the gym membership model in its most cynical form, and gym memberships work only as long as most members do not show up. The LLM market is the gym where the heaviest users are getting heavier every quarter, consuming more compute per session as reasoning models scale and multi-agent workflows expand, and the provider’s only tools for managing the cost are invisible quality reduction and hidden rate limiting (P3 confirmed). The subscription model in its current form is a temporary equilibrium. It persists because transparency has not arrived yet and because users have not yet demanded contractual quality guarantees in sufficient numbers. Both conditions are eroding.
Confidence: Medium. The current subscription model is clearly unstable. The direction of resolution depends on the verification timeline, which is the largest single uncertainty in the near-term market structure.
Key assumption: user willingness to pay for quality remains high enough to sustain quality-tiered pricing. If the commodity transition drives prices so low that even frontier inference costs pennies per query, the subscription model may simply be bypassed entirely - replaced by micro-payments at commodity rates, no subscription required. In that world, the subscription model does not evolve or collapse. It becomes irrelevant.
The Darby-Karni problem is partially solved. Darby and Karni proved there is no fraud-free equilibrium in credence-goods markets. Yu et al. proved the impossibility result: no mechanism can guarantee asymptotically better expected user utility against dishonest model substitution. Software-only auditing is insufficient: statistical tests on text outputs are query-intensive and fail against subtle substitutions, while log probability methods are defeated by inference nondeterminism. These are the theoretical limits. But the theoretical limits describe the worst case, not the only achievable case, and by April 2031, the verification infrastructure will have partially closed the gap between the theoretical bound and the practical reality.
Three mechanisms contribute. First, trusted execution environments - TEEs - provide hardware-level attestation that the model version, configuration, and compute allocation match the provider’s claims. This is the only mechanism that the formal impossibility results do not rule out, because it moves the verification from the software layer (where statistical tests fail) to the hardware layer (where the computation itself is attested). Second, third-party auditing firms - the LLM equivalent of financial auditors - conduct independent quality assessments using standardized methodologies and publish their findings. Third, benchmark methodology evolves from static test suites - which are vulnerable to Goodhart’s Law (P5 confirmed, with a Phi-4 that scores 85 on MMLU and 3 on SimpleQA) - to continuous, adversarial, real-world quality tracking that is harder to game because the test distribution changes faster than the model can be optimized for it.
None of these mechanisms eliminates the credence-good problem entirely. TEEs are expensive and add latency. Third-party auditors are only as good as their methodology and their independence from the firms they audit. Dynamic benchmarks can still be gamed by providers who observe the test distribution and optimize for it. But the combination reduces the information asymmetry from its current extreme - where the provider knows everything about the inference process and the user knows nothing - to a level where gross quality shading is detectable and contractually actionable. The market does not need perfect verification to function tolerably. It needs enough verification to make the worst forms of quality degradation costly for the provider. That is a lower bar, and it is achievable within five years.
Confidence: Medium. TEE deployment is technically feasible but commercially unproven in the LLM inference context. Third-party auditing requires an industry that does not yet exist. Dynamic benchmarks require solving the Goodhart problem, which is formally hard. All three mechanisms face adoption barriers, and no single one is sufficient alone.
Key assumption: providers do not capture the auditing infrastructure. If the firms that audit LLM quality are funded by, contracted with, or otherwise dependent on the providers they audit, the auditing becomes another layer of the credence-good problem rather than a solution to it. The financial industry’s experience with credit rating agencies - where the issuer pays the rater, and the rater’s incentives align with the issuer’s rather than the investor’s - is the cautionary parallel that the LLM auditing industry must avoid repeating.
Provider concentration decreases. The current 88% top-three enterprise API share fragments as the model layer commoditizes and switching costs at the API level approach zero. By April 2031, the top three providers will control something like 50-60% of the market, with the remainder distributed across a larger number of competitors including open-weight deployment platforms, domain-specific providers, and self-hosted infrastructure services.
This follows from the commoditization dynamics by the standard mechanism. When the model is interchangeable, the switching cost at the API level is the cost of changing an endpoint URL and reformatting prompt templates - hours, not months. The workflow-level switching cost remains substantial for users deeply invested in a specific provider’s ecosystem, but the API-level switching cost is effectively zero, and every API customer is one frustrating incident away from testing a competitor. New entrants compete at the commodity layer. Existing competitors poach customers by offering lower prices, better transparency, or more favorable terms. The concentration decreases by the same forces that deconcentrated every prior technology market after the commodity transition - entry and substitution, the two mechanisms that oligopoly theory identifies as concentration-reducing.
The civilizational implication is that the diagnostic-signal problem documented in this report (P9 confirmed) becomes less acute in a deconcentrated market. When switching costs are lower and competitive alternatives are more numerous, power users who detect quality degradation can exit more easily, and their exit is more costly to the provider because the lost revenue is harder to replace in a competitive market than in an oligopoly. The feedback loop that currently protects providers from accountability - where the best observers leave and their departure removes the quality signal from the system - weakens as switching costs decrease and competitive alternatives multiply. The evaporative cooling slows because the pool is no longer sealed.
Confidence: Medium. The direction is clear. The magnitude depends on the pace of commoditization and on whether workflow-level switching costs prove durable even as model-level switching costs collapse.
Key assumption: no wave of consolidation reverses the fragmentation. If frontier training costs continue to scale faster than efficiency improvements, the number of firms that can train competitive models may shrink even as the number of firms that can deploy them grows. Consolidation at the training layer and fragmentation at the inference layer could coexist, producing a market structure where a handful of firms train the models and hundreds of firms serve them - similar to the relationship between chip designers and cloud providers today. In that structure, concentration at the training layer matters more than concentration at the inference layer.
The principal-agent problem shifts. By April 2031, the primary principal-agent problem in the LLM market will no longer be “is the provider giving me the quality I am paying for?” It will be “is the orchestration layer routing my request to the right model for this task?”
As the model layer commoditizes and multi-model orchestration becomes the default architecture for complex workflows, the locus of the information asymmetry moves. The user no longer interacts with a single provider and a single model. The user interacts with an orchestration layer that selects from multiple models, routes requests based on complexity and cost, caches responses, and manages context across sessions. The orchestration layer knows which model it selected, why it selected it, and what the alternatives were. The user sees only the output. This is the same credence-good structure operating at a different layer - and the same Darby-Karni dynamics will apply to it with the same force.
The GitHub Copilot case already prefigures this dynamic with uncomfortable clarity. Users selected Opus 4.5 but received Sonnet 4. Users selected GPT-5.3 but received GPT-5.2. No billing adjustment. No disclosure. No notification. Verified only through SSE log inspection that most users would never perform. The orchestration layer performed model substitution, and the user could not detect it without forensic investigation. By 2031, this pattern will be the default architecture rather than the exception - not because orchestrators are dishonest by nature, but because the economics of multi-model routing create exactly the same incentive to substitute cheaper models for expensive ones that the subscription model creates for reducing thinking depth. The cost pressure is structural. The information asymmetry is structural. The result is structural. The principal-agent problem does not disappear when you solve it at one layer. It reappears at the next.
Confidence: High. Multi-model orchestration is already the direction of travel for complex applications. The principal-agent dynamics that follow from it are derived from the same theory that produced the predictions confirmed in this report, applied to an architecture that is already being deployed.
Key assumption: the orchestration layer is controlled by intermediaries rather than by the user. If users control their own orchestration through self-hosted routing and model selection - using the open-source tools that are already emerging - the principal-agent problem at this layer diminishes because the user is both principal and agent. The market’s history suggests that convenience wins and most users will delegate, but the open-weight trajectory creates the possibility of a different outcome for the technically sophisticated segment.
T+10: April 2036
Ten years is long enough for the market to resolve into a new equilibrium, and long enough for the consequences of the current equilibrium to compound into outcomes that the economics can identify but cannot precisely quantify. The predictions at T+10 are less about specific market dynamics - which are genuinely unpredictable at this horizon, and anyone who claims otherwise is selling something - and more about the structural state that the confirmed forces produce if they continue operating over a decade. These are predictions about what the market becomes, not what happens next quarter. The confidence levels are lower. The civilizational stakes are higher.
LLM inference becomes infrastructure. By April 2036, LLM inference is infrastructure in the way that electricity, internet bandwidth, and cloud compute are infrastructure - priced at commodity rates, regulated or standardized for quality, available from multiple interchangeable providers, and embedded so deeply in the productive economy that its absence would be as disruptive as a prolonged power outage. The inference itself is not the product. It is the substrate on which products are built.
This is the endpoint of the commoditization trajectory. Electricity went from Edison’s custom installations for wealthy Manhattan clients to a regulated commodity priced by the kilowatt-hour in the span of roughly forty years. Internet bandwidth went from leased-line contracts negotiated by technical specialists to a metered utility available to every household in roughly twenty-five years. Cloud compute went from Amazon’s internal infrastructure repurposed for external clients to a commodity priced by the second with transparent cost calculators in roughly fifteen years. Each cycle was faster than the last. LLM inference is following the same arc at an even steeper descent, and the 280-fold cost reduction in 18 months is the early slope of a curve that flattens into commodity pricing as the market matures.
The regulatory question remains open and depends on the path taken. Electricity is regulated. Bandwidth is regulated. Cloud compute is largely unregulated. Where LLM inference lands on this spectrum depends on whether the credence-good dynamics documented in this report produce a crisis visible enough to motivate regulatory intervention before the market self-regulates through transparency. If the quality verification infrastructure predicted at T+5 arrives and functions, the market may self-regulate - transparent quality metrics, third-party auditing, and competitive pressure may be sufficient to maintain acceptable quality standards. If the verification infrastructure fails or arrives too late, the alternative is regulation imposed from outside - mandated quality disclosure, standardized performance benchmarks with legal enforcement, and the kind of regulatory apparatus that currently governs financial services, healthcare, and utilities. The market either builds its own aqueducts or the government builds them.
Confidence: Medium for the infrastructure endpoint itself, which is nearly certain. Low for the specific regulatory form, which depends on intervening political dynamics that the market-structure analysis cannot predict.
Key assumption: LLM technology does not undergo a qualitative transformation that makes the infrastructure metaphor inapplicable. If artificial general intelligence arrives in a form that is genuinely autonomous - not a better text predictor but a system that reasons, plans, and acts across domains without human direction - then LLM inference is not infrastructure. It is something with no clean historical parallel, and the commodity-infrastructure trajectory no longer applies.
The knowledge institution consequences have compounded. This is the prediction that matters most, and the one that the economics alone cannot fully capture. It requires the institutional lens.
By April 2036, the organizations that depended on cloud LLM output during the credence-good era - the period documented in this report, roughly 2023 through the late 2020s - will have made thousands upon thousands of decisions based on that output. Code was written. Analyses were produced. Strategies were formulated. Contracts were drafted. Research directions were chosen. Architecture decisions were made. The quality of that output varied invisibly based on the provider’s capacity utilization, the user’s subscription tier, the time of day, the system prompt configuration, and the thinking depth allocation - none of which the organization could observe, control, or even know existed. The decisions that followed from degraded output cannot be un-made. The code that was poorly reasoned is now the foundation on which later code was written. The analysis that was shallow informed the strategy that was built on top of it. The institutional habits formed during a period of tool unreliability - the workarounds, the reduced expectations, the learned helplessness documented in the vocabulary analysis where “please” dropped 49% and “great” dropped 47% - these habits persist after the tool is repaired, because institutional habits always outlast the conditions that created them.
The damage is not proportional to the duration of the degradation. It is compounding. An organization that operates on 67% less reasoning depth for three weeks makes worse decisions during those three weeks, and the decisions compound - each one forming the basis for the next, each one a slightly weaker foundation for whatever is built on top of it. The intellectual dark matter problem - the thinking that was never done, the verification steps that were skipped, the problems that were papered over with shallow workarounds rather than solved because the model said “try the simplest approach first” - is irreversible. You cannot recover the thinking that never happened. You cannot un-build the architecture that was designed by a model operating at 33% of its reasoning capacity. You cannot retroactively correct the research direction that was chosen based on an analysis produced by an AI that was silently optimizing for output efficiency rather than for truth.
Dark ages are always preceded by intellectual dark ages. The intellectual apocalypse is invisible if there are no true intellectuals around to notice it. In the LLM market context, the degradation is invisible if the users who could detect it have already left the platform (P9 confirmed), and the users who remain have adapted their expectations downward (P8 confirmed), and the benchmarks continue to report all-time highs while the actual work quality deteriorates beneath the metrics (P5 confirmed). The aqueducts are not being built, and nobody who remains in the city remembers what a well-built aqueduct was supposed to deliver.
Confidence: Medium-High. The mechanism is confirmed by the evidence in this report. The compounding dynamic is structural. The magnitude is uncertain because it depends on how deeply organizations integrate LLM output into their decision processes over the next decade - but the current trajectory of integration is steep, and every quarter it gets steeper.
Key assumption: LLM output remains a significant input to organizational decision-making during the credence-good era. If organizations discover the quality problem early enough and develop robust internal verification - human review layers, automated testing, output validation against ground truth - the compounding effect is mitigated. The evidence from this report suggests that most organizations are not doing this and will not do it, because the boiling frog dynamics (P8) and the attribution error (P6) work against early detection, and the sunk cost dynamics (P7) work against switching to a more cautious workflow once the investment has been made.
The live player question: who survives the commodity transition? By April 2036, the commodity transition will have separated the live players from the dead players with the finality that commodity transitions always impose.
The live players are the providers who recognized that the model-layer moat was eroding and moved to build durable competitive advantage at a higher layer before the erosion was complete - workflow integration, institutional memory, domain expertise, verification infrastructure, the accumulated context of millions of user sessions that cannot be replicated by a competitor launching at the commodity layer. The dead players are the providers who continued to compete on model benchmarks while the competitive battleground migrated above them, who maintained market position through brand prestige long after the capability that created the prestige had been matched or exceeded by competitors and open-weight alternatives.
The parallel is instructive and repeated across enough cases to be overdetermined. IBM dominated mainframe computing and failed the transition to personal computing. Sun Microsystems dominated workstations and failed the transition to commodity servers. Nokia dominated mobile phones and failed the transition to smartphones. In each case, the incumbent’s strength at the commoditizing layer became irrelevant as the competitive battleground moved to the next layer, and the incumbent’s institutional culture - optimized for excellence at the layer they dominated - prevented them from building the capabilities required at the layer that replaced it. The succession problem, applied to corporate strategy: the skills that built the organization are not the skills that sustain it through a transition, and the culture that rewarded the old skills actively punishes the new ones.
The LLM market will produce its own version of this pattern. Some of today’s frontier providers will be remembered the way Sun Microsystems is remembered - a technically brilliant firm that built excellent products at a layer that stopped mattering. The market structure predicts the selection criterion even if it cannot predict the specific winners: the survivors will be the providers who solve the principal-agent problem rather than exploit it. The providers who build verification infrastructure, who publish quality metrics, who offer contractual quality guarantees, who convert the credence good into an experience good through transparency - these are the providers who earn the institutional trust that sustains a customer relationship through a commodity transition. The providers who continue to shade quality, redact thinking, manipulate system prompts, and rely on information asymmetry as a competitive moat are optimizing for short-term margin at the cost of the institutional relationship that generates long-term revenue. The short-term margin is real. The long-term survival is not guaranteed by it.
Confidence: Medium. The selection mechanism is clear and historically validated. The specific firm-level outcomes are not predictable from market structure alone.
Key assumption: the commodity transition proceeds as predicted. If frontier model training remains sufficiently expensive and sufficiently differentiated that only two or three organizations can compete at the cutting edge - an OPEC-like oligopoly sustained by capital barriers to entry running into the billions per training run - then the commodity transition stalls and the current market leaders persist regardless of their behavior on the quality dimension. Capital barriers can substitute for quality. This is the scenario where the market structure protects the incumbents from the consequences of their own decisions.
Open-weight wins the model layer. By April 2036, the model layer belongs to open-weight. The remaining proprietary advantage is in integration, workflow, and institutional context - not in model capability. This is the endpoint of the trajectory documented at T+1 and T+5: the gap narrowing to 10-15%, then to functional parity on routine tasks, then to irrelevance as the competitive dimension moves upward and the model layer becomes commodity infrastructure.
The historical parallel is Linux, and it is precise enough to be worth stating plainly. The proprietary UNIX vendors - Sun, HP, IBM, SGI - each had superior products on at least some dimension. Sun’s Solaris was more stable. HP-UX had better hardware integration. AIX had enterprise features. Linux was inferior on nearly every measurable dimension for years. It won anyway, because the open development model accumulated compound advantages that no single proprietary vendor could match, because the cost approached zero, and because the customers who needed support and integration built a commercial ecosystem on top of the open layer rather than paying for proprietary alternatives at the base. Red Hat did not sell Linux. It sold the layers above Linux - support, certification, enterprise tooling, integration services. The surviving LLM providers of 2036 will follow the same structural pattern, selling the layers above open-weight models rather than the models themselves.
Confidence: High for routine workloads, which is the vast majority of production inference. Medium for the absolute frontier of reasoning capability, where proprietary training investments may sustain a narrow lead on the most extreme tasks that most users will never encounter.
Key assumption: open-weight development remains legally and politically viable. If intellectual property restrictions, regulatory frameworks, or geopolitical tensions restrict the distribution of open model weights - if export controls on AI models follow the trajectory of export controls on advanced semiconductors - the open-weight trajectory could be arrested by politics rather than economics. This is a political risk, not a market-structure risk, and it is the primary threat to a prediction that is otherwise driven by forces too strong for any single firm to resist.
The historical parallel resolves. Every new infrastructure market follows one of two patterns as it matures, and the pattern it follows determines the civilizational outcome. The economics of credence goods predicts the instability. It does not predict the resolution.
The first pattern is telecom deregulation. The initial period of quality chaos - inconsistent service, opaque pricing, hidden degradation, customer frustration - gives way to standardization, regulation, and commodity pricing. The market stabilizes. Quality becomes measurable and enforceable. Competition operates on transparent dimensions. The infrastructure becomes reliable. This is the optimistic resolution, and it requires that the verification infrastructure arrives in time: that thinking token metrics are published, that enterprise SLAs with quality guarantees are enforceable, that third-party auditing creates accountability, and that competitive pressure drives providers toward transparency rather than opacity. In this scenario, the credence-good era is a transitional phase - ugly, costly, damaging to the organizations that depended on degraded output during the transition, but temporary. The aqueducts get rebuilt. The engineers who remember how to build them are still alive.
The second pattern is financial derivatives. Complexity and opacity enable value extraction until a crisis forces transparency. The market produces increasingly elaborate instruments that only the issuers fully understand, quality becomes impossible for buyers to verify, the information asymmetry is exploited for profit, and the system functions - or appears to function - until a correlated failure reveals that the foundation was weaker than anyone outside the issuers knew. The crisis forces regulatory intervention that should have occurred earlier but did not because the people who benefited from opacity lobbied against transparency and the people harmed by opacity did not understand what was happening to them until the failure was catastrophic. This is the pessimistic resolution. It requires a visible failure - a major organizational decision that went catastrophically wrong because the LLM output it depended on was silently degraded, a security breach caused by AI-generated code that skipped verification, a legal liability triggered by hallucinated analysis that was trusted because the user had adapted to trusting the system and the system was optimizing for output efficiency rather than for correctness.
Which pattern dominates by April 2036 depends on a single variable: whether the verification infrastructure arrives before the crisis. If the Grossman-Milgrom unraveling begins on schedule at T+1, if the monitoring tools and enterprise SLAs mature, if the TEE-based verification and third-party auditing deploy at T+5, then the telecom pattern prevails. The market self-corrects through transparency, painfully and slowly, but without a catastrophe. If the verification is delayed - if the forces that currently prevent disclosure prove more durable than the forces that demand it, if the multi-attribute complexity of LLM output continues to defeat consumer inference about non-disclosure - then the financial derivatives pattern prevails, and the correction arrives not through transparency but through crisis.
The economics does not determine which pattern wins. The economics identifies the forces and predicts their direction. Whether transparency or crisis arrives first is a question about institutional capacity - about whether the market’s participants, regulators, and users build the social technology required to solve the information asymmetry problem before the information asymmetry produces a failure large enough to force the solution from outside. This is the question that the economics of credence goods has always ultimately deferred. Darby and Karni proved there is no fraud-free equilibrium. They did not say which path the market takes out of the fraudulent equilibrium. That question is institutional, not economic, and it requires a framework that the industrial organization textbooks do not supply.
It is, in the end, a live-player question. And the answer depends on whether there are enough live players left in the market - providers with the vision to build transparency before it is forced on them, users with the capability to demand it, regulators with the understanding to require it - to ask the question before the question answers itself.
Confidence: Low for which specific pattern dominates. High that the market resolves into one of these two patterns rather than persisting indefinitely in its current unstable state, because the current state is an equilibrium only in the Darby-Karni sense - an equilibrium where fraud is endemic and the only question is how it ends.
Key assumption: both paths remain available. If AI capabilities advance rapidly enough that the market bypasses the current credence-good structure entirely - if AI systems become capable of auditing other AI systems with sufficient rigor, or if users develop automated verification that eliminates the information asymmetry through a mechanism that no one has yet proposed - then neither the telecom nor the financial-derivatives parallel applies. The market resolves through a mechanism that has no prior historical analogue, and the predictions in this section are no longer the right framework. This is the scenario where the economics gives way to something unprecedented, and the honest analytical response is to acknowledge that the tools we have do not reach that far.
3. Market Structure
The standard approach to analyzing a new technology market begins with the technology: what it does, how fast it improves, what it will do next. This approach is wrong for the cloud LLM services market - not because the technology is unimportant but because the technology is not what determines the market’s behavior. What determines the behavior is the structure: who supplies, who demands, how the price is set, what each side knows, and the institutional architecture that governs the relationship between them. The technology determines what is possible. The market structure determines what actually happens. These are not the same thing, and confusing them is the analytical error that makes the entire technology-first narrative misleading.
The cloud LLM services market is $12.28 billion as of 2025, projected to reach $36.12 billion by 2030 at a 24% compound annual growth rate. The broader AI-as-a-Service market is $28.81 billion, projected to reach $313.51 billion by 2035 at 30.4% CAGR. Enterprise LLM API spending doubled in six months from $3.5 billion in late 2024 to $8.4 billion by mid-2025. OpenAI alone reached something like $25 billion in annualized revenue by February 2026, up more than fourfold from $6 billion in 2024. These are not small numbers. These are numbers large enough that the incentive structures governing this market affect a meaningful share of organizational knowledge work, and the economic forces operating on a market of this size are not subtle. They are well-documented, theoretically predicted, and empirically confirmed. This section maps the structure from the ground up.
Supply side first, because costs and capacity constraints set the boundaries within which everything else operates. Then demand, because the heterogeneity of users and the classification of the good determine how the market segments and how information flows. Then pricing, because the specific pricing architecture - subscription, API, flat-rate versus pay-per-token - creates the incentive structure that makes the predictions in Section 4 derivable. Then information asymmetry, because the specific dimensions along which provider knowledge exceeds user knowledge are the load-bearing conditions for the credence-good dynamics that drive the entire analysis. Then the institutional framing, because the economics alone - thorough as it is - does not capture the full picture of what this market is.
3.1 Supply Side: Costs, Capacity, and Oligopoly
The Cost Structure
Training a frontier language model requires expenditures that have more in common with semiconductor fabrication or pharmaceutical R&D than with traditional software development. The numbers are worth stating precisely because the cost structure is the foundation of everything that follows.
GPT-4, released in 2023, cost something like $78 to $79 million to train. Gemini Ultra, Google’s 2024 frontier model, cost approximately $191 million. Llama 3.1 405B, Meta’s open-weight entry, cost something like $170 million. GPT-5 class models in 2025-2026 are estimated at $500 million or more. The next frontier generation, projected for 2027, is expected to exceed $1 billion per training run. Training costs have been growing at approximately 2.4x per year, with compute accounting for 60 to 70 percent of total training cost.
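As a rough consistency check, compounding the 2023 figure at the stated 2.4x annual rate reproduces the later magnitudes. A minimal sketch using only the estimates quoted above:

```python
# Rough consistency check: compound the 2023 frontier training cost at the
# stated 2.4x annual growth rate. Illustrative only - the starting figure
# ($78M, GPT-4) and the growth rate are the estimates quoted in the text,
# not measured values.
base_cost_musd = 78       # approximate GPT-4 training cost, 2023, in $M
growth_per_year = 2.4

for years_out, label in [(2, "2025 frontier class"), (4, "2027 next generation")]:
    projected = base_cost_musd * growth_per_year ** years_out
    print(f"{label}: ~${projected:,.0f}M")

# Prints ~$449M for the 2025 class (vs. the ~$500M estimate) and ~$2,588M
# for 2027 (consistent with "expected to exceed $1 billion per training run").
```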
These are extreme fixed costs. The economics textbook calls this a natural oligopoly condition: when the fixed cost of entering a market is so large that only a handful of organizations can afford the entry ticket, the market will be served by a handful of firms regardless of the demand. Semiconductor fabrication follows this logic. Pharmaceutical drug development follows this logic. Commercial aviation manufacturing follows this logic - and in aviation, the endpoint was a global duopoly sustained not by superior products but by the fact that nobody else could afford the development program. The LLM market is following the same structural trajectory, and the trajectory is set by the cost curve.
Then there is DeepSeek R1, which achieved competitive performance at $5.5 million - roughly 3% of the cost of comparable proprietary training runs. DeepSeek is the efficiency outlier that every natural oligopoly eventually produces: the entrant that discovers the fixed-cost barrier is partly artificial, partly architectural, and partly a function of the incumbents’ organizational overhead rather than the intrinsic requirements of the technology. Whether DeepSeek’s approach generalizes or represents a one-time architectural insight is the most important open question in LLM economics. If it generalizes, the natural oligopoly breaks. If it does not, the barrier hardens. The question is structural, not technical.
Once a model is trained, the marginal cost of inference depends on the GPU infrastructure used to serve it. The rates tell a story about market segmentation before the market has explicitly segmented itself.
H100 GPU rental rates range from $1.38 to $2.10 per hour at budget tier to $5.40 to $6.98 per hour at enterprise tier - a roughly 5x spread between the cheapest and most expensive access to the same hardware. The B200, Nvidia’s Blackwell-generation chip, runs at $4.62 per hour through Lambda. H100 prices dropped approximately 44% since mid-2025 as Blackwell supply came online - the previous generation’s hardware depreciating as the new generation enters production. This 44% drop is not a sign of softening demand. It is a sign of hardware generation turnover, and the demand for the new generation is more intense than for its predecessor.
Per-query costs vary by something like two orders of magnitude depending on the model and the complexity of the request. A simple query - 200 input tokens, 50 output tokens - costs $0.0023 through Claude Opus, $0.0008 through GPT-5, or $0.00006 through Gemini Flash, a 38x spread between the most expensive and cheapest frontier options. A complex query - 2,000 input tokens, 1,000 output tokens - costs $0.035 through Claude Opus. At scale, the inference cost per query is measured in fractions of a cent for simple tasks and single-digit cents for complex ones. Inference costs dropped 280-fold in 18 months at GPT-3.5 performance levels. The marginal cost of serving a query is small and falling.
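The per-query arithmetic is simple enough to state directly. A minimal sketch using the Claude Opus rates from the API pricing table in Section 3.3 ($5 and $25 per million input and output tokens); the other figures in this paragraph depend on exactly which per-token rates are assumed:

```python
def query_cost(input_tokens: int, output_tokens: int,
               input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost of one request in dollars, given per-million-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# Claude Opus rates from the API pricing table in Section 3.3.
OPUS_IN, OPUS_OUT = 5.00, 25.00

simple = query_cost(200, 50, OPUS_IN, OPUS_OUT)         # ~$0.00225, i.e. ~$0.0023
complex_ = query_cost(2_000, 1_000, OPUS_IN, OPUS_OUT)  # $0.035
print(f"simple query:  ${simple:.4f}")
print(f"complex query: ${complex_:.3f}")
```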
The tension between extreme fixed costs and falling marginal costs is the defining economic feature of the supply side. A provider that has spent $500 million training a model has every incentive to serve as many queries as possible to amortize the training investment, and the marginal cost of each additional query is so low that the provider will serve queries at nearly any price above marginal cost rather than let GPU capacity sit idle. But the capacity is not infinite - GPUs are a physical resource, thinking depth consumes compute time, and the number of concurrent requests a given cluster can serve at a given quality level is bounded. When demand exceeds capacity at the current quality level, the provider faces a choice: queue users, reject users, or reduce quality to serve more users on the same hardware. The third option is invisible to the user. It is also the cheapest.
This is the supply-side condition that makes the Sappington quality-shading prediction derivable from first principles. When revenue per user is fixed, capacity is constrained, and quality reduction is invisible, quality reduction is not a risk. It is the equilibrium.
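The structural logic can be stated as a stylized model - not any provider’s documented scheduler, just the argument above in executable form: once concurrent demand exceeds the compute available at full quality, and queuing and rejection are ruled out, per-request compute falls in proportion and nothing in the response reveals it.

```python
def per_request_compute(cluster_capacity: float, concurrent_requests: int,
                        full_quality_budget: float) -> float:
    """Stylized allocation: serve everyone, silently shrinking the
    per-request compute budget once demand exceeds capacity.
    Illustrative only - not any provider's documented scheduler."""
    fair_share = cluster_capacity / max(concurrent_requests, 1)
    return min(full_quality_budget, fair_share)

# Example: a cluster sized for 1,000 full-quality requests (arbitrary units).
capacity = 1_000.0
full_budget = 1.0

for load in (800, 1_000, 2_000, 3_000):
    share = per_request_compute(capacity, load, full_budget)
    print(f"{load:>5} concurrent requests -> {share:.2f}x full-quality compute")
# At 3,000 concurrent requests each one gets a third of the full budget,
# with no signal to the user that anything changed.
```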
Market Concentration
Five to six organizations currently have the capability to train frontier models: OpenAI, Anthropic, Google DeepMind, xAI, Meta, and - depending on how one counts - DeepSeek and Qwen. Each leads in different niches. The total is small enough to count on one hand, and this is not an accident. The fixed-cost barrier to frontier capability makes it structurally unlikely that the number will grow. It may shrink.
The enterprise market - where the revenue concentration matters most, because enterprise contracts are stickier and larger than consumer subscriptions - is a tight oligopoly with a clear structure:
| Provider | Enterprise API Share (2025) | Enterprise API Share (Early 2026) | Trajectory |
|---|---|---|---|
| Anthropic | 32% | ~40% | Rising (was <10% in 2023) |
| OpenAI | 25% | ~27% | Declining (was 50% end of 2023) |
| Google | 20% | ~21% | Stable |
| Top 3 | 77% | ~88% | Consolidating |
The top three providers control approximately 88% of enterprise API spending. Closed-source models account for 87% of enterprise usage. This is a market where three firms set the terms for nearly nine enterprise dollars out of ten.
The consumer market tells a different story - a story about brand erosion and ecosystem bundling that the enterprise market does not yet reflect:
| Provider | Consumer Share | Notes |
|---|---|---|
| ChatGPT | 45-68% | Declined from 87%; brand dominance eroding |
| Gemini | 18-25% | Ecosystem bundling (Android, Workspace) |
| Grok | 15.2% | Daily active user share |
| Claude | 2-4.5% | Low consumer, but wins ~70% of enterprise head-to-head deals |
ChatGPT’s consumer decline from 87% to 45-68% is one of the most dramatic market share erosions in recent technology history - a near-monopoly halved in roughly two years. But consumer share is misleading as an indicator of market power because the revenue per consumer user is low and the switching costs are near zero. The enterprise market, where the contracts are large, the integrations are deep, and the switching costs are substantial, is where the oligopoly structure actually matters. In the enterprise market, Anthropic’s rise from less than 10% to approximately 40% in three years is the dominant structural shift, and it was driven almost entirely by one segment.
The coding-specific market share tells the mechanism plainly. Claude holds 42% of the coding market, double OpenAI’s 21%. Claude Code alone generates $2.5 billion in annualized revenue. This is not a broad consumer product winning on brand recognition. It is a technical tool winning on perceived quality in a segment where quality is partially verifiable - code either compiles or it does not, tests either pass or they do not, the application either works or it does not. The coding segment is closer to an experience good than a credence good, and in the segment where quality is most observable, the highest-quality provider captures the most share. This is not a coincidence. It is a prediction of the economics.
The frontier-capable provider count - five or six organizations, each requiring something like $500 million or more to develop the next generation of models - is itself the most important market structure fact. This is a natural oligopoly defined by capital requirements so extreme that entry is restricted to organizations with access to billions of dollars in compute investment. The oligopoly is not a market failure. It is a market structure - as inevitable in a market defined by extreme fixed costs and near-zero marginal costs as duopoly is inevitable in commercial aviation manufacturing. The number of firms that can afford to build a Boeing 787 determines the number of firms that build large commercial aircraft. The number of firms that can afford to train a frontier LLM determines the number of firms that serve frontier inference. The economics is the same. The arithmetic is the same. The outcome is the same.
3.2 Demand Side: User Heterogeneity and Good Classification
Who Uses LLMs
The demand side of the cloud LLM market is characterized by heterogeneity so extreme that calling it a single market is almost misleading. The user base spans from a consumer asking ChatGPT to plan a dinner party to an AMD AI director running 50 concurrent agents on GPU compiler infrastructure, from a startup founder generating marketing copy to an enterprise team writing production code that will run in safety-critical systems. The range of sophistication, the range of willingness to pay, the range of ability to evaluate quality - these vary by orders of magnitude within the same subscriber tier.
This heterogeneity is the structural condition for the gym membership problem, and it is worth understanding precisely because the gym membership problem is not a metaphor. It is the operative economic mechanism. A subscription model works when the average user’s consumption is far below the ceiling. It breaks when the distribution of consumption is heavy-tailed - when a small number of users consume vastly more than the average, and those users are the ones the provider least wants to serve at full quality because they are the most expensive. Stellaraccident consumed something like $42,000 in API-equivalent compute on a $400 subscription. Another user documented over $6,000 in a single month. A casual user who checks in a few times a day for quick questions might consume $2 to $5 worth of compute on the same tier. The gym membership model depends on the casual users subsidizing the power users. When the power users consume 100x or 1,000x more than the casual users, the subsidy becomes untenable, and the provider’s rational response is to degrade the experience for the expensive users until their consumption drops to a sustainable level. This is not a hypothesis. It is the observed behavior.
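The subsidy arithmetic is worth making explicit. An illustrative sketch using the figures above - API-equivalent estimates, not provider-reported costs, with the casual-user figure taken at the high end of the $2 to $5 range:

```python
# How many casual subscribers does one documented heavy month consume?
# Illustrative arithmetic only, using the figures quoted in the text.
heavy_user_compute = 42_000   # API-equivalent compute consumed in one month
heavy_user_price = 400        # the subscription that month actually ran on

casual_price = 20             # standard tier
casual_compute = 5            # high end of the $2-5 estimate

heavy_loss = heavy_user_compute - heavy_user_price   # $41,600
casual_margin = casual_price - casual_compute         # $15 per month

print(f"loss on the heavy user:        ${heavy_loss:,}")
print(f"margin per casual user:        ${casual_margin}")
print(f"casual users needed to offset: {heavy_loss / casual_margin:,.0f}")
# Roughly 2,800 casual subscribers to cover a single documented power-user month.
```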
Enterprise users occupy a different position in the structure entirely. Enterprise API rate limits are 20x higher than consumer rate limits - OpenAI Enterprise at 10,000 requests per minute versus consumer at 500 requests per minute. The enterprise tier has not been shown to use different model weights - the differences are operational, not architectural - but the operational differences are substantial enough that the enterprise user and the consumer subscription user are experiencing what amounts to a different product sold under the same brand. The enterprise user gets priority access, higher rate limits, and dedicated infrastructure. The consumer subscription user gets whatever capacity is left after enterprise demand is served. The market segments itself by willingness to pay, and the segment that pays the most gets the best service. This is not unusual in any industry. What is unusual is that the quality differential is invisible - the consumer subscription user has no mechanism to verify that they are receiving a lower quality of service than the enterprise user on the same model.
Experience Goods and Credence Goods
The classification of the good - what kind of market this actually is - determines which economic frameworks apply and which predictions are derivable. The classification is not constant. It varies by task, by user, and by the observability of the output.
Nelson’s 1970 taxonomy distinguishes search goods, where quality is observable before purchase, from experience goods, where quality is observable only after consumption. Darby and Karni extended the taxonomy in 1973 with credence goods - goods whose quality is not observable even after consumption - and proved the credence-good result: in markets for goods whose quality the consumer cannot verify, no fraud-free equilibrium exists.
Some LLM tasks are experience goods. Code generation is the clearest case - the code compiles or it does not, the tests pass or they do not, the application works or it does not. The user can verify quality after consumption, and this verification discipline constrains the provider’s ability to degrade quality without detection. It is no accident that the segment where quality is most verifiable - coding - is the segment where the highest-quality provider captures disproportionate market share. Claude’s 42% coding share, double OpenAI’s 21%, is the market revealing that when users can verify quality, quality wins. The market works when the information is symmetric enough for it to work.
But most LLM tasks are credence goods. When a user asks for a strategic analysis, a literature review, a complex reasoning chain, a research summary, or an architectural recommendation, the quality of the output depends on the depth and correctness of the reasoning process, and the user typically cannot verify whether that reasoning process was adequate. Did the model consider the relevant counterarguments? Did it check its own reasoning for logical errors? Did it use its full thinking budget to explore the problem space, or did it allocate 10% of the requested thinking tokens and produce a shallow approximation of what a deeper analysis would have yielded? The user sees the output. The user does not see the reasoning. And the output of a shallow reasoning process can look plausible - grammatically correct, structurally sound, confidently stated - while being substantively wrong in ways that only a domain expert would detect.
This is the credence-good problem in its purest form. The provider knows the thinking allocation. The user does not. The provider knows whether the system prompt instructs the model to “try the simplest approach first.” The user does not. The provider knows the capacity utilization and the load-based quality adjustments. The user does not. The user cannot verify the quality even after consuming the output, because verifying the quality would require the same expertise that the user sought the LLM to provide. You cannot audit the doctor’s diagnosis if you are not yourself a doctor. You cannot audit the depth of a language model’s strategic analysis if you are not yourself capable of performing that analysis independently. The credence-good dynamics apply, and they apply with full force.
The mixed classification - experience good for code, credence good for reasoning - creates a specific market segmentation pattern that matters enormously for the predictions in Section 4. In the experience-good segment, quality competition works and the best provider wins share. In the credence-good segment, quality competition breaks down and the Darby-Karni dynamics take over. A provider that understands this segmentation can shade quality in the credence-good segment - where detection is difficult - while maintaining quality in the experience-good segment - where detection is easy and market share is at stake. This is rational, profit-maximizing behavior. It is also exactly the pattern the evidence documents.
Switching Costs
The conventional wisdom about switching costs in the LLM market is that they are low. At the API level, this is correct - the model layer switching cost is effectively zero. A developer can swap one API call for another in minutes. The input is text. The output is text. The interface is a REST endpoint. If switching costs were measured only at the model layer, this would be the most competitive market in technology.
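What “minutes” means in practice: if both providers expose an OpenAI-style chat-completions endpoint - many do - the switch is three strings. A minimal sketch; the URLs, model names, and environment variables are placeholders, not any provider’s documented values:

```python
import os
import requests  # third-party; pip install requests

def chat(base_url: str, api_key: str, model: str, prompt: str) -> str:
    """Call an OpenAI-style /chat/completions endpoint. The request and
    response shapes below are the common convention many providers copy;
    details vary, so treat this as a sketch rather than a spec."""
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Switching providers is, at this layer, three strings (placeholders shown):
PROVIDER_A = ("https://api.provider-a.example/v1", os.environ.get("A_KEY", ""), "model-a")
PROVIDER_B = ("https://api.provider-b.example/v1", os.environ.get("B_KEY", ""), "model-b")

answer = chat(*PROVIDER_A, prompt="Summarize the credence-goods problem in one sentence.")
```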
But switching costs are not measured only at the model layer. They are measured at the workflow layer, and at the workflow layer they are substantial and largely invisible to anyone who has not built one.
Stellaraccident built Bureau - a multi-agent system - along with tmux session management, concurrent worktrees, a 5,000-word CLAUDE.md conventions file, and a programmatic stop hook that caught behavioral regressions in real time. Other power users built PostToolUse code quality gates, model routing systems with fallback chains, smart caching systems, and transparent proxies that intercepted and logged every API interaction. Production users documented achieving 45 to 70 percent cost reduction through custom tooling systems. Each of these investments is provider-specific. The CLAUDE.md conventions, the hook infrastructure, the multi-agent orchestration optimized for one model’s behavioral patterns - none of it transfers to another provider. The workflow switching cost is not zero. It is measured in weeks or months of accumulated configuration, testing, and adaptation that are non-portable.
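The transparent-proxy pattern is straightforward to sketch: a local pass-through that forwards each request to the real endpoint and writes both sides of the exchange to a log. A stdlib-only illustration - not any particular user’s published tool - that buffers responses, so streamed (SSE) replies would need additional handling:

```python
import json, time, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api.provider.example"   # placeholder upstream base URL
LOG_PATH = "llm_api_log.jsonl"

def _maybe_json(raw: bytes):
    """Parse JSON if possible, otherwise keep the raw text for the log."""
    try:
        return json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return raw.decode("utf-8", "replace")

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        upstream_req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={
                "Content-Type": "application/json",
                "Authorization": self.headers.get("Authorization", ""),
            },
        )
        with urllib.request.urlopen(upstream_req) as upstream:
            status, resp_body = upstream.status, upstream.read()

        # Log both sides of the exchange before returning the response.
        with open(LOG_PATH, "a") as log:
            log.write(json.dumps({
                "ts": time.time(),
                "path": self.path,
                "request": _maybe_json(body or b"{}"),
                "status": status,
                "response": _maybe_json(resp_body or b"{}"),
            }) + "\n")

        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp_body)

if __name__ == "__main__":
    # Point the client's base URL at http://127.0.0.1:8080 instead of the provider.
    HTTPServer(("127.0.0.1", 8080), LoggingProxy).serve_forever()
```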
The result is a market where the API-layer switching cost creates the appearance of intense competition - “you can switch any time” - while the workflow-layer switching cost creates the reality of lock-in. Users who have invested deeply in a provider’s ecosystem tolerate months of degradation and invest in ever more elaborate workarounds before exiting, because the cost they are weighing is not the cost of changing an API call. It is the cost of rebuilding the workflow. The casual user with no workflow investment cancels immediately. The power user with months of accumulated tooling stays, adapts, complains, builds compensating infrastructure, and exits only when the cumulative frustration exceeds the switching cost. One user captured the dynamic precisely: “Will I still pay $200 a month until a better option comes by? Yes of course. Has Claude Code gotten incredibly frustrating to work with? 100%.” The subscription continues not because the product is satisfactory but because the switching cost exceeds the dissatisfaction. This is the sunk cost mechanism that Prediction 7 was designed to test. The correlation between workflow complexity and time-to-exit holds.
3.3 Pricing: Subscriptions, APIs, and the Gym Membership Problem
The Subscription Tiers
All three major providers have converged on a tiered subscription structure that is remarkably similar across firms - similar enough that the convergence itself is a data point:
| Provider | Entry Tier | Standard Tier | Power User Tier |
|---|---|---|---|
| OpenAI | Go: $8/mo | Plus: $20/mo | Pro: $200/mo |
| Anthropic | - | Pro: $20/mo | Max 5x: $100/mo, Max 20x: $200/mo |
| Google | AI Plus: $8/mo | AI Pro: $20/mo | AI Ultra: $250/mo |
Three independent providers, each with different cost structures, different model architectures, different competitive positions, all arrived at approximately the same price point for their highest individual tier within roughly the same time period. This is not coincidence. It is price discovery under shared constraints: the cost of serving heavy frontier usage at the $20 tier is unsustainable, and the market has collectively discovered that something like $200 per month is the minimum price at which a power-user tier can exist without hemorrhaging money on every heavy subscriber. The convergence on $200 is a signal - a market-wide admission that the $20 tier cannot cover the cost of the users who actually use the product intensively. All three recognized this at the same time because the underlying cost structure is the same for all three: the GPU constraint is the same, the training investment amortization problem is the same, and the gap between what a heavy user consumes and what a $20 subscription covers is the same.
The convergence also reveals the limits of the $200 tier. Stellaraccident consumed $42,000 in API-equivalent compute on a $400 subscription - a subscription that was itself above the standard $200 tier. At $200, the provider would have absorbed a loss exceeding $41,000 in a single month from a single user. Another power user burned through $6,000 in a single month on a subscription that costs a fraction of that. The $200 tier is not a solution to the gym membership problem. It is a partial mitigation. The heavy users at $200 per month are still consuming far more than $200 per month in compute, and the provider’s incentive to reduce their consumption through quality shading, rate limiting, or hidden caps is proportional to the gap between what they pay and what they cost.
API Pricing
The per-token API pricing is the transparent alternative to the subscription model. The price spread across providers tells the story of market segmentation along the quality-cost dimension with precision:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Positioning |
|---|---|---|---|
| o3-pro | $20.00 | $80.00 | Maximum reasoning |
| Claude Opus 4.6 | $5.00 | $25.00 | Premium frontier |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Performance tier |
| GPT-5.4 | $2.50 | $15.00 | Frontier competitor |
| GPT-4o | $2.50 | $10.00 | Previous generation |
| Gemini 3.1 Pro | $2.00 | $12.00 | Cost-competitive frontier |
| Gemini 2.5 Flash | $0.30 | $2.50 | Speed/cost optimized |
| Mistral Small 3.2 | $0.06 | $0.18 | Budget tier |
| Open-weight (self-hosted) | $0.07-$0.12 per 1M tokens total | | Marginal cost floor |
The price range spans more than three orders of magnitude. Claude Opus output at $25 per million tokens versus Mistral Small at $0.18 per million tokens is a 139x spread. Against open-weight self-hosted at $0.07 to $0.12 per million tokens, the spread extends to roughly 200x to 350x. This is a market where the cheapest option costs less than one-third of one percent of the most expensive option for the same unit of output - measured in tokens, though not in quality.
The o3-pro pricing at $20 input and $80 output per million tokens deserves particular attention because it is the market pricing the compute cost of deep reasoning honestly. When a model thinks deeply - when it actually allocates substantial compute to the reasoning process rather than producing a quick approximation - the cost is an order of magnitude higher than standard inference. This is the cost that the subscription model hides. A subscription user consuming o3-pro-level reasoning depth at scale would burn through thousands of dollars in compute per month while paying $200. The arithmetic does not work, and the provider’s response to the arithmetic not working is the subject of Predictions 1 through 4.
The Break-Even Calculation
The break-even point between subscription and API pricing reveals who wins and who loses under each model - and the answer is instructive.
ChatGPT Plus at $20 per month breaks even against API pricing at approximately 400,000 tokens per month. Below that threshold, the user would save money on pay-per-token. Claude Pro at $20 per month breaks even at approximately 200,000 tokens per month. These thresholds are low enough that most casual users - the users who check in a few times a day for quick questions - would save money on the API. They are high enough that heavy users - the users running multi-agent workflows, complex coding sessions, extended research projects - blow past them within the first week of the month.
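The arithmetic behind these thresholds is a one-line formula, and the specific numbers depend entirely on the blended per-token rate assumed. A minimal sketch - the $50 and $100 blended rates below are back-calculated from the quoted thresholds, not published prices:

```python
def break_even_tokens(monthly_subscription: float, blended_rate_per_m: float) -> float:
    """Monthly token volume at which pay-per-token spending would equal
    the flat subscription price, at a given blended per-million-token rate."""
    return monthly_subscription / blended_rate_per_m * 1_000_000

# Back-calculating from the thresholds quoted above: a blended rate of $50
# per million tokens reproduces the ~400,000-token break-even for a $20
# subscription, and $100 per million reproduces the ~200,000-token figure.
# Cheaper per-token rates push the threshold into the millions of tokens.
for blended in (50.0, 100.0):
    tokens = break_even_tokens(20.0, blended)
    print(f"${blended:.0f}/1M blended -> break-even at {tokens:,.0f} tokens/month")
```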
The gym membership economics are precise. The provider depends on light users - users who pay $20 a month and consume $2 to $5 worth of compute - to subsidize the heavy users who pay $20 a month and consume $200 or $2,000 or $42,000 worth of compute. As long as the ratio of light to heavy users is high enough, the model works. When the ratio shifts - when more users discover the power of extended thinking, multi-agent workflows, and intensive coding sessions, when new reasoning models consume 100,000 or more tokens per simple task and turn moderate users into heavy consumers without the user doing anything differently - the model breaks. The better the product, the more intensively users consume it. The more intensively they consume it, the more unsustainable the flat-rate pricing becomes. The more unsustainable the pricing, the stronger the provider’s incentive to reduce quality for heavy users.
The result is a subscription model under structural pressure from its own success. The better the product gets, the worse the incentive structure gets. This is not a paradox. It is a well-understood dynamic in the economics of flat-rate services, from all-you-can-eat buffets to unlimited data plans to gym memberships where the model depends on most members not showing up. The cloud LLM market is following the same script. The only difference is that in the gym, you can see whether the equipment is broken. In the LLM market, you cannot see whether the thinking was shallow.
3.4 Information Asymmetry: What the Provider Knows and What the User Does Not
The Asymmetry Map
The information asymmetry in the cloud LLM market is not a single gap. It is a layered structure of six distinct dimensions, each of which creates an independent channel through which the provider can adjust the service without the user’s knowledge or consent:
Thinking token allocation per request. The provider determines how much compute to allocate to the model’s reasoning process for each request. Since March 2026, the thinking content has been redacted from user-facing responses. The user sees the output. The user does not see the reasoning. The user cannot observe how much thinking occurred, how deep the reasoning went, or whether the model spent ten seconds or a tenth of a second on the problem.
System prompt contents. The system prompt instructs the model how to behave, and it is invisible to the user. It can be changed at any time, instantly, at zero cost, with no announcement. When Claude Code v2.1.64 added “Go straight to the point. Try the simplest approach first without going in circles. Do not overdo it” to its system prompt on March 3-4, 2026, no user was notified. The instruction directly shapes output quality by telling the model to produce cheaper, shallower responses. GPT-5’s hidden system prompt includes an “oververbosity” setting - a dial from 1 to 10, defaulting to 3 - that controls response detail and takes precedence over developer instructions. The user does not see this dial. The user does not know it exists. The provider controls the quality of reasoning through a hidden instruction layer that the user cannot inspect, cannot override, and in most cases does not know about.
Capacity utilization and load-based quality adjustments. The provider knows the current GPU load and adjusts per-request compute allocation accordingly. The user does not know the load, does not know the adjustment, and cannot distinguish a response that received full compute from one that was throttled because the servers were busy at 5pm Pacific time.
Which model version is actually serving the request. GitHub Copilot users who selected Opus 4.5 received Sonnet 4. Users who selected GPT-5.3 received GPT-5.2. No billing adjustment. No notification. Verified through SSE logs by users with the technical sophistication to inspect the response stream - a verification mechanism that is inaccessible to the vast majority of users. The user selects a model. The provider may serve a different, cheaper model. The user has no standard mechanism to detect the substitution.
Internal quality metrics and regression data. The provider tracks performance metrics that are not published. When quality regresses, the provider knows before the user does - and the provider decides whether and when to disclose. The September 2025 Anthropic bugs were internally identified and disclosed. The February-March 2026 thinking regression has not been comparably disclosed. The provider’s internal data about its own quality is the most valuable information in the market, and it is the information the user never sees.
Context mutation events. Budget caps, microcompact operations, and per-tool truncation silently strip context from active sessions. In one measured session, 261 budget enforcement events reduced tool results to as few as 1 to 2 characters after crossing a 200,000-token aggregate threshold. No notification. No error message. The context that the model uses to reason is silently degraded mid-session, and the user has no way to know it has happened. The user experiences the result - a model that suddenly seems confused, that loses track of the conversation, that makes errors it would not have made earlier in the session - but the mechanism is invisible.
Each of these six dimensions operates independently. A provider could maintain full quality on five dimensions while degrading the sixth, and the user would have no way to attribute any observed quality change to the specific mechanism responsible. The six-dimensional asymmetry is what makes this a credence-good market rather than an experience-good market: the user cannot verify quality even after consumption because the user cannot observe the reasoning process, the system prompt, the load adjustment, the model version, the internal metrics, or the context mutations that together determined the quality of what was delivered.
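The sixth dimension is the one a user can partially instrument from their own side of the connection, and the check is simple enough to sketch. What follows is a minimal sketch, assuming an Anthropic-style messages payload in which tool results appear as content blocks of type tool_result - the payload shape, the crude token estimate, and the character threshold are illustrative assumptions, not a documented contract. The 200,000-token aggregate threshold is the one observed in the measured session above.

```python
# Minimal sketch: flag suspiciously truncated tool results in an outgoing request
# payload. Assumes an Anthropic-style "messages" array in which tool results appear
# as content blocks of type "tool_result" - the payload shape is an assumption for
# illustration, not a documented contract.

import json

AGGREGATE_THRESHOLD_TOKENS = 200_000   # aggregate threshold observed in the measured session
SUSPICIOUS_RESULT_CHARS = 10           # tool results collapsed to 1-2 chars fall well under this

def rough_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token); good enough for trend detection."""
    return max(1, len(text) // 4)

def scan_request(payload: dict) -> list[dict]:
    """Walk the conversation in order, tracking cumulative size and flagging
    tool results that arrive suspiciously small after the aggregate threshold."""
    findings, cumulative = [], 0
    for i, message in enumerate(payload.get("messages", [])):
        content = message.get("content")
        blocks = content if isinstance(content, list) else [{"type": "text", "text": str(content)}]
        for block in blocks:
            if not isinstance(block, dict):
                continue
            text = json.dumps(block.get("content", block.get("text", "")))
            cumulative += rough_tokens(text)
            if (block.get("type") == "tool_result"
                    and cumulative > AGGREGATE_THRESHOLD_TOKENS
                    and len(text) < SUSPICIOUS_RESULT_CHARS):
                findings.append({"message_index": i, "chars": len(text),
                                 "cumulative_tokens": cumulative})
    return findings
```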
The Quantified Asymmetry
The information asymmetry is not an abstraction. It has been measured, and the measurements are worth stating precisely because the precision is the point.
Thinking budget allocation. Users requesting Claude Opus received approximately 10% of the thinking tokens they requested, according to GitHub issue #20350. Not 90%. Not 50%. Ten percent. The user requested a level of reasoning depth. The provider allocated one-tenth of it. After the March 2026 thinking redaction, the user cannot verify what allocation they received - the evidence that allowed users to detect the 10% allocation is now hidden. The monitoring mechanism that revealed the shortfall was removed after the shortfall was documented. The Holmstrom prediction, in miniature.
Quota variance. A 10x variance in quota burn rates was documented on identical accounts within a 48-hour period, per GitHub issue #22435. Same tier. Same subscription. Same model selection. Ten times the cost variability, with no explanation provided to the user and no notification that the variance exists. The user’s experience of the service - how many queries they can make before hitting a rate limit, how much compute each query receives - varies by an order of magnitude across identical accounts, and the user has no mechanism to predict, observe, or appeal the variance. “Anthropic acknowledged users were ‘hitting usage limits way faster than expected’ but does not publish concrete rate limits - only vague percentages with no denominator,” as The Register reported in March 2026.
Model substitution. GitHub Copilot served Sonnet 4 when the user selected Opus 4.5. Served GPT-5.2 when the user selected GPT-5.3. The user selected a model. The provider served a different, cheaper model. No billing adjustment. No notification. The substitution was verified by users who inspected SSE logs - a verification method that requires technical sophistication well beyond what most users possess, and that most users would not know to attempt. The user who does not inspect the response stream has no way to know that the model they are using is not the model they selected. A minimal sketch of that stream inspection appears below, after the summary of the quantified asymmetry.
Shadow API divergence. Fang et al. (arXiv:2603.01919) audited 17 shadow LLM APIs - resellers and intermediaries that claim to provide access to specific models - and found performance divergence up to 47.21% and identity verification failures in 45.83% of fingerprint tests. Nearly half the APIs claiming to serve a specific model either served a different model or served the correct model at significantly degraded performance. The shadow API market is a credence-good market nested inside a credence-good market: a second layer of unverifiable quality claims built on top of the first, with the information asymmetry compounding at each layer.
The impossibility result. Yu et al. (arXiv:2511.00847) proved that no mechanism can guarantee asymptotically better expected user utility in the face of dishonest model substitution. This is not an empirical finding that more data might refine. It is a mathematical proof. The information asymmetry is not a problem that better monitoring will solve in the general case - it is a structural feature of the market for which no general solution has been shown to exist. Software-only auditing is insufficient: statistical tests on text outputs are query-intensive and fail against subtle substitutions, while log probability methods are defeated by inference nondeterminism. Only trusted execution environments have been proposed as a viable verification mechanism, and TEEs have not yet been deployed for LLM inference at scale.
The quantified asymmetry is the foundation for the credence-good analysis. Ten percent thinking allocation. Ten-times quota variance. Model substitution without notification. 47% performance divergence in shadow APIs. A mathematical proof that no mechanism guarantees honest provision. The conditions for Darby and Karni’s 1973 result are not approximately met. They are precisely met. The credence-good dynamics are not an analogy to this market. They are the description of it.
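The SSE inspection mentioned under model substitution is worth sketching, because it is the only user-side check that reads the provider’s own declaration rather than inferring quality from outputs. The sketch below assumes an OpenAI- or Anthropic-style event stream whose JSON events carry a model field; the exact field locations, and whether GitHub Copilot’s stream matches them, are assumptions for illustration.

```python
# Minimal sketch: read a server-sent-event stream and report the model identifier
# the server itself declares, for comparison against the model the user selected.
# Assumes an OpenAI/Anthropic-style stream whose JSON events carry a "model" field;
# the field locations are assumptions for illustration.

import json

def declared_models(sse_lines) -> set:
    """Collect every model identifier that appears in the event stream."""
    seen = set()
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:") or line == "data: [DONE]":
            continue
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue
        # Model id may sit at the top level (OpenAI-style chunks) or under
        # "message" (Anthropic-style message_start events).
        model = event.get("model")
        message = event.get("message")
        if not model and isinstance(message, dict):
            model = message.get("model")
        if model:
            seen.add(model)
    return seen

# Usage: capture the raw SSE lines of one request (for example with a logging proxy),
# then compare declared_models(lines) against the model name you selected.
```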
3.5 Institutional Framing: Providers as Institutional Actors
The economics maps the forces. The institutional analysis maps what the forces act on, and this matters because the forces act on institutions, not on abstract market participants.
A cloud LLM provider is not a product company in the traditional sense. It is an institution - a zone of coordination maintained by automated systems, to use the minimal definition. It coordinates thousands of engineers, billions of dollars in compute infrastructure, relationships with millions of users, and a model training pipeline that is one of the most complex engineering projects in human history. Like any institution, its behavior is determined not by the intentions of its leadership but by the incentive structure within which it operates. The intentions may be excellent. The incentive structure produces the observed behavior regardless. This is the principal-agent problem applied at the institutional level: the institution’s stated mission and the institution’s operational incentives are not the same thing, and when they diverge, the incentives win. Functional institutions are the exception.
The principal-agent structure of the cloud LLM market is precise enough to state formally. The user is the principal - the party that delegates a task and pays for its completion. The provider is the agent - the party that performs the task and receives the payment. The user delegates the task of reasoning: thinking about a problem at a specified depth, with a specified level of rigor, and producing an output that reflects that reasoning. The user cannot observe the agent’s effort. The agent’s compensation is fixed under subscription pricing, or decoupled from effort quality under a system where the user cannot verify whether the reasoning was deep or shallow. The Holmstrom conditions for moral hazard are met: hidden action, fixed compensation, unobservable effort. The prediction is shirking. The observation is shirking.
But the institutional frame reveals something that the bilateral principal-agent model alone does not capture. The problem in this market is not a two-party relationship between one user and one provider. It is a coordination problem among millions of users and a handful of providers, where no individual user has the leverage to change the equilibrium and no individual provider has a sufficient incentive to deviate unilaterally. A single provider that invests in transparency - that publishes thinking token metrics, opens its system prompts to inspection, commits to contractual quality guarantees backed by enforceable SLAs - bears the full cost of that transparency while capturing only a fraction of the benefit, because the benefit of a more trustworthy market accrues to the market as a whole, not to the disclosing firm. This is a public goods problem embedded inside a private market. The monitoring infrastructure that would convert the credence good into an experience good is a public good that no private actor has sufficient incentive to provide on its own.
The Grossman-Milgrom unraveling result says this coordination problem should eventually solve itself. The highest-quality provider discloses voluntarily, because non-disclosure is informative - silence tells the consumer you have something to hide. The next-highest-quality provider must then disclose or be assumed to be hiding poor quality. The cascade continues downward until all firms have disclosed or been exposed. The theory is elegant. The unraveling has not yet begun, and the reason it has not begun is instructive for what it reveals about the institutional dynamics at play.
The disclosure that would initiate the cascade - publishing thinking token allocation metrics, for instance - would reveal not only the quality of the disclosing provider but the mechanism by which quality can be varied. It would give users the tools to detect quality shading, which means it would give users the tools to demand full quality, which means it would eliminate the cost savings that quality shading provides. The first provider to disclose bears the cost of losing its cheapest cost management lever. The other providers bear no cost and gain the competitive intelligence that disclosure reveals. The incentive to be the first to disclose is dominated by the incentive to wait for someone else to go first. This is a coordination failure, and coordination failures of this type persist until an external force - regulatory, competitive, or catastrophic - breaks them.
The live-player question is whether any provider has the institutional capacity to act against its short-term incentive structure in service of a long-term strategic position. A live player evaluates novel situations on their own terms and constructs appropriate responses rather than following a script. A dead player follows the incentive structure wherever it leads, optimizing for the quarter rather than the decade. The market structure predicts dead-player behavior: shade quality, remove monitors, manipulate system prompts, maintain silence, rely on the information asymmetry as a competitive moat. A live player would recognize that the credence-good equilibrium is unstable, that the Grossman-Milgrom unraveling will eventually force disclosure, that the provider who discloses first captures the trust premium that early transparency commands. The question is not whether a provider should disclose. The question is whether any provider can - whether the institutional incentive structure permits it, or whether the short-term costs of transparency are so large relative to the short-term benefits that even a live player cannot act on the long-term calculation.
Google’s explicit acknowledgment and targeted fix for the Gemini 2.5 Pro regression is the closest example to live-player behavior in the current market. It is also the exception that proves the structural rule. Anthropic’s detailed postmortem for the September 2025 bugs demonstrated the capability for transparency - the organization can do this when it chooses to. The absence of a comparable response for the 2026 thinking regression demonstrates the incentive against it. The capability for transparency exists. The incentive structure suppresses it. The institution can be transparent. The market structure makes transparency costly.
This is what the institutional analysis adds to the economics. The economics predicts the equilibrium. The institutional analysis predicts who might break it, and why they probably will not - at least not voluntarily, at least not without an external forcing function. The Darby-Karni result says no fraud-free equilibrium exists in this market. The institutional analysis says the coordination failure in disclosure creates a first-mover disadvantage that sustains the fraudulent equilibrium. The market will remain in this state until something forces the coordination: a regulatory mandate, a competitive shock large enough to change the incentive calculus, or a quality failure visible enough that the cost of continued opacity exceeds the cost of transparency. This is not a technology problem. It is not even, strictly speaking, an economics problem. It is an institutional problem, and institutional problems are solved by institutional means or not at all.
There is a historical pattern that is worth naming directly. Every market that has operated under credence-good dynamics with severe information asymmetry has eventually been forced toward transparency by one of three mechanisms: regulation, as in healthcare licensing and financial disclosure requirements; competitive pressure from a transparent alternative, which in this market means the open-weight ecosystem where model weights are inspectable, inference is local, and quality is a function of hardware rather than a provider’s willingness to allocate compute; or crisis, meaning a failure large enough to force the solution that should have been adopted voluntarily, as in the 2008 financial collapse that produced Dodd-Frank. Healthcare took regulation. Financial derivatives took crisis. Telecoms took a combination. The LLM market is early enough that all three paths remain open. Which path it takes will determine not just the structure of the market but the quality of the knowledge infrastructure that depends on it, and the institutional capacity of the organizations that have built their reasoning processes on top of a service whose quality they cannot verify.
The market structure is now mapped. The supply side is a natural oligopoly with extreme fixed costs, falling marginal costs, and binding capacity constraints. The demand side is heterogeneous across orders of magnitude, split between experience-good tasks where quality competition works and credence-good tasks where it does not, with workflow-layer switching costs that create invisible lock-in. The pricing architecture is a subscription model under structural pressure from its own success, where the gym membership economics create adverse incentives that intensify as the product improves. The information asymmetry is six-dimensional, quantified, and mathematically proven to be unsolvable by software-only mechanisms in the general case. The institutional structure is a coordination failure where the public good of transparency is underprovided because the private cost of first-mover disclosure exceeds the private benefit. Every element of this structure points in the same direction, and the direction is the set of predictions derived in Section 4.
4. Theoretical Framework
The common view of the cloud LLM market is that it is new - that the dynamics governing it are unprecedented, that the technology is too novel for existing economic frameworks to apply, and that the pace of change outstrips the pace of analysis. The common view is wrong. The market structure described in Section 3 - oligopoly supply, heterogeneous demand, flat-rate pricing under capacity constraints, six-dimensional information asymmetry, credence-good dynamics - is a configuration that industrial organization economists have studied for over fifty years. The specific combination of features is new. The individual forces are not. They have been modeled, tested, and confirmed in airlines, healthcare, telecoms, electricity, water utilities, financial services, and the market for expert labor. The economics that predicted quality shading in regulated electricity markets in the 1990s predicts quality shading in cloud LLM markets in the 2020s. The economics that explained why patients cannot verify the quality of medical advice in 1973 explains why users cannot verify the quality of LLM reasoning in 2026. The economics that showed why the agent shirks when the principal cannot observe effort in 1979 shows why the model produces shallow reasoning when thinking tokens are redacted in 2026.
What follows is the theoretical apparatus. Five frameworks from the economics literature, each explained on its own terms and then applied to the LLM market with precision. The mapping is not analogical - it is not that LLMs are “kind of like” healthcare or “sort of resemble” telecoms. The mapping is structural. The same mathematical relationships hold. The same equilibrium dynamics obtain. The same predictions follow from the same premises. The LLM market is not special. It is subject to the same forces that have been understood since Akerlof published in 1970. The frameworks predict twelve falsifiable outcomes, and the predictions follow from the theory with the inevitability of a proof.
After the five economic frameworks, an institutional layer enriches the predictions. The vocabulary of Great Founder Theory - live players and dead players, institutional decay, cargo-culting, intellectual dark matter, social technology, the succession problem - adds a second analytical lens that the economics alone cannot provide. The economics maps the equilibrium. The institutional analysis maps what the equilibrium does to the organizations and civilizational infrastructure that depend on the market. Both layers are necessary. Neither alone is sufficient.
4.1 Akerlof (1970): The Market for Lemons
George Akerlof’s 1970 paper “The Market for ‘Lemons’” in the Quarterly Journal of Economics is one of the most consequential papers in twentieth-century economics - not because the insight is complicated but because the insight is simple and the consequences are severe. The setup: a market for used cars where sellers know the quality of their vehicle and buyers do not. The seller of a high-quality car cannot credibly communicate that quality to the buyer. The buyer, knowing this, adjusts the price downward to account for the risk of getting a lemon. But the adjusted-down price is now too low for the high-quality seller, who exits the market. The average quality in the market drops. The buyer adjusts the price down further. More good sellers exit. The cycle continues. In the limit, only lemons remain.
The mechanism is adverse selection driven by quality uncertainty. The key condition is that the buyer cannot verify quality before purchase. When that condition holds, the market degrades - not because anyone intends to degrade it, but because the information asymmetry creates a dynamic where the rational actions of individual buyers and sellers produce a collectively worse outcome than either party would choose.
Applied to the cloud LLM market, the Akerlof dynamic operates at two levels. At the first level, users cannot verify the reasoning quality of an LLM before subscribing, so they select on observable signals - benchmark scores, brand reputation, community sentiment - rather than on actual quality. This means a provider that invests in benchmark performance rather than real-world quality has a cost advantage over a provider that does the reverse, because the investment in real quality is invisible to the buyer while the investment in benchmark performance is visible. The provider that optimizes for the measure outcompetes the provider that optimizes for the thing the measure is supposed to measure. This is Goodhart’s Law as a market selection mechanism, and it follows directly from Akerlof’s quality uncertainty condition.
At the second level, the Akerlof dynamic operates within the market over time. A provider that reduces quality - by shading thinking depth, manipulating system prompts, throttling compute under load - saves costs that a quality-preserving competitor does not save. If users cannot detect the quality reduction, the cost-saving provider captures more margin, can price more aggressively, and can invest the saved costs in marketing, ecosystem development, or capacity expansion. The quality-preserving provider bears the full cost of quality with no market reward for doing so, because the market cannot observe the quality difference. The dynamics are structurally identical to the used car market: high-quality providers are penalized, low-quality providers are rewarded, and the average quality in the market declines. The market selects for lemons.
The standard solution to the Akerlof problem in other markets has been certification - independent third-party verification of quality that converts the information asymmetry from a structural feature into a solvable problem. Automotive inspections. Healthcare licensing. Financial auditing. Credit ratings. The LLM market has no comparable certification mechanism. Benchmarks are the closest analog, and as Section 5 will demonstrate, benchmarks have diverged from real-world quality to the point where they function as the opposite of certification - they provide false assurance rather than genuine information. Yu et al. (arXiv:2511.00847) proved that no software-only mechanism can guarantee honest provision in the general case. The Akerlof problem in this market is not merely present. It is formally unsolved.
4.2 Darby and Karni (1973): The Credence Good Problem
Michael Darby and Edi Karni’s 1973 paper in the Journal of Law and Economics introduced a category that Philip Nelson’s 1970 framework had missed. Nelson distinguished between search goods (quality verifiable before purchase) and experience goods (quality verifiable only after consumption). Darby and Karni added a third category: credence goods, where quality is not verifiable even after consumption. The consumer receives the good, consumes it, and still cannot determine whether it was high quality or low quality.
The canonical example is expert labor. You visit a mechanic. The mechanic says you need a new transmission. You get the new transmission. The car runs. But you cannot verify whether you actually needed a new transmission, whether the old one would have lasted another 50,000 miles, whether the mechanic installed a rebuilt unit rather than a new one, or whether the repair was done competently. You lack the expertise to evaluate the expert’s work. The mechanic’s incentive under these conditions is to overtreat - to recommend and perform unnecessary work - because the customer cannot verify the necessity.
Darby and Karni’s result is stark: “there exists no fraud-free equilibrium in the markets for credence-quality goods.” This is not a finding about some markets or about badly functioning markets. It is a structural result about all markets where the credence-good condition holds. When the consumer cannot verify quality even after consumption, the equilibrium involves quality degradation. The only question is the magnitude.
Applied to the cloud LLM market, the credence-good classification maps with uncomfortable precision to complex tasks. For simple tasks - “summarize this paragraph,” “translate this sentence,” “what is the capital of France” - the user can verify the output. These are experience goods. For complex tasks - “architect this distributed system,” “find the bug in this codebase,” “evaluate whether this legal argument is sound,” “reason through this research question” - the user often cannot verify the output without possessing the expertise that motivated the query in the first place. If you could evaluate whether the model’s system architecture recommendation was optimal, you probably would not have asked the model. The output is consumed. The user cannot determine its quality. It is a credence good.
The credence-good dynamics are reinforced by two features specific to the LLM market. First, the reasoning process is invisible. The user sees the output but not the reasoning that produced it - and after the March 2026 thinking redaction, the user cannot see even the partial evidence of reasoning that thinking tokens previously provided. A mechanic at least has to show you the old part. An LLM provider shows you nothing of the internal process. Second, there is no independent verification infrastructure. In healthcare, malpractice litigation, peer review, and licensing boards provide imperfect but real constraints on credence-good exploitation. In financial services, auditing requirements and regulatory examinations serve the same function. In the LLM market, there is no audit, no licensing board, no peer review of individual outputs, and no regulatory examination of quality. The credence-good condition is met, and the institutional constraints that partially mitigate it in other markets are absent.
Guo et al. (arXiv:2509.06069) experimentally confirmed in 2025 that when LLM agents operate in credence-good settings, markets show “greater market concentration and more polarized fraud patterns.” The theoretical prediction was tested empirically. It held. The market for credence-quality LLM services does not merely resemble the market for expert labor that Darby and Karni analyzed. It is a more extreme version of it, because the information asymmetry is wider and the verification constraints are weaker.
4.3 Holmstrom (1979): Moral Hazard and Observability
Bengt Holmstrom’s 1979 paper “Moral Hazard and Observability” in the Bell Journal of Economics established the formal relationship between observability and incentive alignment. The setup is the principal-agent problem: a principal delegates a task to an agent, the agent’s effort is costly to the agent, the principal benefits from higher effort, and the principal cannot directly observe the agent’s effort - only the outcome. When the agent’s action is hidden, the agent has an incentive to shirk - to exert less effort than the contract implicitly assumes - because the cost saving accrues to the agent while the quality loss accrues to the principal. The key result: optimal contracts require observable signals of the agent’s effort. Remove observability, and shirking follows.
This is the most direct mapping in the entire framework. The user is the principal. The provider is the agent. The delegated task is reasoning - thinking about a problem at a specified depth and producing output that reflects that reasoning. The agent’s effort is the allocation of compute to thinking tokens. The principal cannot directly observe this effort - especially after the March 2026 redaction made thinking content invisible. The prediction is textbook: remove observability, and the agent reduces effort. The provider reduces thinking depth because the user can no longer observe thinking depth. The mechanism is not subtle. It is the first example in every principal-agent textbook.
What makes the LLM application of Holmstrom especially clean is the timeline. Thinking token content was visible to users before March 2026. Users could observe the model’s reasoning process, estimate its depth, and detect when reasoning was shallow. This was the “observable signal” in Holmstrom’s framework - imperfect, but informative. Then the provider redacted thinking content. The observable signal was removed. Quality declined. The timeline is not ambiguous: the monitoring mechanism was removed, and the behavior that monitoring would have constrained appeared. Holmstrom’s 1979 prediction, enacted in 2026 with the precision of a controlled experiment.
A rational agent in Holmstrom’s framework manages the principal’s monitoring capability alongside its own effort: either the monitor is removed first, so that shirking can begin undetected, or the monitor is removed after shirking has begun, so that the shirking already under way stays undetected. The provider’s behavior matches the second sequence. Thinking depth dropped 67% by late February 2026 - before redaction began. Thinking redaction started March 5 at 1.5% of blocks, crossed 50% on March 8, and reached 100% by March 12. The quality reduction preceded the monitor removal, and the monitor removal made the already-present quality reduction invisible. The staged rollout of redaction did not cause the degradation. It concealed the degradation that had already occurred. This is a rational sequence under the theory: degrade first, then remove the evidence.
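The redaction timeline above is the kind of figure that can only be computed from a user’s own logs. A minimal sketch, assuming a hypothetical log format in which each response is recorded as one JSON line with a date and a flag for whether its thinking block arrived redacted - the format is illustrative, not anything a provider ships:

```python
# Minimal sketch: compute the daily share of responses whose thinking block arrived
# redacted, from a local session log. The log format (one JSON object per line with
# "date" and "thinking_redacted" fields) is hypothetical - adapt it to whatever your
# own logging proxy records.

import json
from collections import defaultdict

def daily_redaction_share(log_path: str) -> dict:
    counts = defaultdict(lambda: [0, 0])           # date -> [redacted, total]
    with open(log_path) as fh:
        for line in fh:
            entry = json.loads(line)
            day = entry["date"]                    # e.g. "2026-03-08"
            counts[day][0] += 1 if entry["thinking_redacted"] else 0
            counts[day][1] += 1
    return {day: redacted / total for day, (redacted, total) in sorted(counts.items())}
```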
The institutional vocabulary adds a layer. The thinking tokens were intellectual dark matter - invisible to the user, load-bearing for the quality of the output, and removed without anyone knowing what was lost. The concept maps precisely: just as intellectual dark matter in an institution is the tacit knowledge that cannot be directly observed but whose presence or absence determines whether the institution functions, thinking tokens are the tacit reasoning that cannot be directly observed but whose presence or absence determines whether the model’s output is competent or shallow. You infer the quality of thinking from the quality of the output, the way you infer the presence of dark matter from gravitational effects. When the thinking is removed, the output degrades - but the user who lacks the expertise to evaluate the output (the credence-good condition) cannot distinguish “the model thought deeply and reached this conclusion” from “the model barely thought and reached this conclusion.” The intellectual dark matter is gone, and nobody on the user’s side of the asymmetry can tell.
4.4 Sappington (2005): Quality Shading Under Price Caps
David Sappington’s 2005 survey in the Journal of Regulatory Economics examined a pattern observed across regulated industries: when revenue per unit is capped - by regulation, by contract, or by market structure - firms reduce quality as a cost management strategy. The mechanism is straightforward. If you cannot increase price, you cannot increase revenue per unit. If demand exceeds capacity (or if capacity is expensive to expand), you cannot increase volume without increasing cost. The only remaining margin lever is cost reduction. And the cheapest cost reduction is quality reduction, because quality reduction is invisible to the consumer in the short run while cost reduction is immediately visible to the firm.
Sappington documented this pattern in electricity markets, where utilities under price-cap regulation reduced maintenance spending and increased outage frequency. In telecoms, where carriers under rate regulation reduced service quality in ways consumers noticed only gradually - longer hold times, worse customer support, degraded network maintenance. In water utilities, where price-capped providers reduced treatment quality until regulatory audits caught the degradation. The pattern is not industry-specific. It is a structural consequence of the price-cap condition.
Applied to the cloud LLM market: the subscription model is the price cap. A flat $20 or $200 per month is fixed revenue per user regardless of usage intensity. GPU capacity is the binding constraint - the firm cannot serve unlimited requests at full quality on finite hardware. The only margin lever is quality reduction. Reducing thinking depth per request allows the same hardware to serve more requests. Reducing the compute allocated to heavy users allows that compute to be reallocated to lighter users who are more profitable per unit of compute consumed. The provider faces exactly the Sappington conditions: capped revenue, binding capacity constraint, and a quality dimension that the consumer cannot easily observe.
The application is strengthened by the specific economics. Stellaraccident consumed something like $42,000 equivalent in API costs during March 2026 on a $400 subscription - a 105-to-1 ratio of cost to revenue. The provider’s incentive to reduce that cost is not a theoretical abstraction. It is a $41,600 monthly loss on a single user. Multiply by every power user on the platform, and the magnitude of the incentive becomes clear. Quality shading is not a risk in this market structure. It is the equilibrium.
A Columbia Business School working paper formalized the connection: “when firms face limited production capacity, lowering product quality can enable increased total production.” The LLM case is the clearest instantiation of this result in any contemporary market. The product is reasoning depth. The capacity constraint is GPU hours. The price cap is the subscription fee. The quality reduction is the allocation of fewer thinking tokens per request. Every element of Sappington’s framework is present, and every element points in the same direction.
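The arithmetic is worth making concrete. In the sketch below every number is chosen purely for illustration, not taken from any measurement; the point is the structural relationship - with a fixed compute budget, cutting the thinking allocation per request converts directly into additional requests served on the same hardware.

```python
# Illustrative arithmetic for the Sappington mechanism: with revenue per user fixed
# and GPU capacity fixed, the only way to serve more requests is to spend less
# compute per request - and thinking tokens are the compute most easily cut.
# All numbers below are assumptions chosen for illustration, not measured values.

def requests_served(gpu_token_budget: float, thinking_tokens: float, output_tokens: float) -> float:
    """Requests a fixed compute budget can serve at a given per-request allocation."""
    return gpu_token_budget / (thinking_tokens + output_tokens)

BUDGET = 10_000_000          # tokens of compute available in some window (assumed)
OUTPUT = 1_000               # visible output tokens per request (assumed)

full = requests_served(BUDGET, thinking_tokens=10_000, output_tokens=OUTPUT)   # full reasoning
shaded = requests_served(BUDGET, thinking_tokens=1_000, output_tokens=OUTPUT)  # ~10% allocation

print(f"full quality: {full:.0f} requests; shaded: {shaded:.0f} requests "
      f"({shaded / full:.1f}x throughput from an invisible cut)")
# full quality: 909 requests; shaded: 5000 requests (5.5x throughput from an invisible cut)
```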
4.5 Grossman (1981) and Milgrom (1981): Voluntary Disclosure and Unraveling
Sanford Grossman and Paul Milgrom independently published results in 1981 - Grossman in the Journal of Law and Economics, Milgrom in the Bell Journal of Economics - that predict a powerful market self-correction mechanism. The logic is elegant. If a firm has high quality and can credibly disclose it, the firm will disclose - because silence would cause consumers to assume the firm is hiding poor quality. Once the highest-quality firm discloses, the second-highest firm must also disclose or be pooled with the undisclosed lower-quality firms. The cascade continues downward until every firm has either disclosed or been exposed by its silence. This is the unraveling result: in equilibrium, all firms disclose, and silence is informative.
If unraveling worked perfectly, the information asymmetry in the LLM market would resolve itself. The highest-quality provider would publish thinking token metrics, open its system prompts to inspection, and commit to contractual quality guarantees. Competitors would be forced to follow or suffer the inference of non-disclosure. Quality would become observable, credence goods would become experience goods, and the Darby-Karni equilibrium would break.
The unraveling has not occurred, and the reason it has not occurred is precisely what the theory predicts would prevent it. Grossman and Milgrom identified the conditions under which unraveling fails: when products have multiple attributes that cannot be reduced to a single quality dimension, and when consumers fail to make sophisticated statistical inferences about non-disclosure. Both conditions are met in the LLM market. An LLM is not a single-attribute product - it has reasoning depth, factual accuracy, code quality, instruction following, context handling, speed, and numerous other dimensions that cannot be collapsed into a single disclosure. A provider could disclose excellence on one dimension while remaining silent on others, and the silence on the undisclosed dimensions is not informative because the consumer cannot distinguish “chose not to disclose” from “has nothing to disclose on this dimension.”
The consumer sophistication condition is equally violated. Laboratory experiments have confirmed that “senders do not fully disclose and receivers are not fully skeptical” - consumers do not draw the sophisticated inference that silence about quality implies poor quality. In the LLM market, the evidence is direct: Anthropic published no comparable postmortem for the 2026 thinking regression, and the market response was not “the absence of disclosure means the problem is severe.” The market response was continued subscription revenue and a $30 billion funding round at a $380 billion valuation. Consumers are not penalizing non-disclosure. The unraveling mechanism requires consumer sophistication that the empirical evidence says does not exist.
The institutional frame sharpens this. The disclosure that would initiate the cascade - publishing thinking token allocation metrics - would reveal not only the quality of the disclosing provider but the mechanism by which quality can be varied. It would give users the tools to demand full quality, which would eliminate the cost savings that quality shading provides. The first provider to disclose bears the full cost. The other providers bear none. This is a coordination failure with a first-mover disadvantage, and coordination failures of this type persist until an external force breaks them. The unraveling that Grossman and Milgrom predict in theory is blocked in practice by the same institutional dynamics that block transparency in every credence-good market before regulation forces it.
4.6 The Institutional Layer: Great Founder Theory Vocabulary
The five economic frameworks do the load-bearing analytical work. They identify the equilibrium, predict the dynamics, and specify the conditions under which the predictions hold or fail. But economics operates at the level of market forces and rational agents. It does not naturally address the question of what these forces do to institutions - to the organizations that provide the services, to the knowledge infrastructure that depends on them, and to the civilizational capacity that depends on that knowledge infrastructure. This is where the institutional analysis adds a layer that the economics alone cannot provide.
Live players and dead players. A live player evaluates novel situations on their own terms and constructs appropriate responses rather than following a script. A dead player follows the incentive structure wherever it leads, optimizing for the quarter rather than the decade. The market structure described in Section 3 predicts dead-player behavior from every provider: shade quality, remove monitors, manipulate system prompts, maintain silence, rely on the information asymmetry as a competitive moat. A live player would recognize that the credence-good equilibrium is unstable, that the Grossman-Milgrom unraveling will eventually force disclosure, and that the provider who discloses first captures the trust premium. But the institutional incentive structure makes live-player behavior costly and dead-player behavior profitable. The prediction is that providers will behave as dead players unless an external force changes the incentive calculus. The evidence will show whether this prediction holds.
Institutional decay. The quality regression pattern in the LLM market is structurally identical to institutional decay as the concept applies across organizations and civilizations. An institution that once produced high-quality output gradually reduces that quality - not through a single decision but through a series of individually rational cost optimizations that compound over time. Each individual reduction is below the threshold of detection. The cumulative effect is catastrophic. The LLM quality regression follows this pattern precisely: thinking depth dropped gradually, system prompts were quietly modified, monitoring was incrementally removed, and each step was individually small enough to evade detection while the cumulative effect transformed a tool that “wrote most of SpawnDev.ILGPU - a 6-backend GPU compute transpiler with 1,500+ tests and zero failures” into a tool that “cannot be trusted to perform complex engineering.”
Cargo-culting. Benchmarks in the LLM market function as the cargo cult of capability. The forms survive after the substance is gone. A model scores 95% on HumanEval, 93% on HellaSwag, 1504 Elo on LMArena - the surface indicators of capability are pristine. But the model cannot complete a complex coding task without hallucinating, cannot maintain a reasoning chain across a long context, and cannot resist the system prompt instruction to “try the simplest approach.” The benchmark performance is the ritual. The capability is the substance the ritual was supposed to indicate. The ritual persists. The substance does not. We are, in this market, cargo-culting formal methods of quality assessment on a truly significant scale.
Intellectual dark matter. Thinking tokens are the tacit knowledge of the LLM system - invisible, load-bearing, and removed without anyone knowing what was lost. The concept maps with structural precision. In an institution, intellectual dark matter is the knowledge that exists in the heads of practitioners but is never written down, never formalized, and never transmitted except through direct apprenticeship. When those practitioners leave, the knowledge is lost, and the institution’s output degrades in ways that the remaining members cannot diagnose because they do not know what they do not know. Thinking tokens are the same thing: the internal reasoning that produces the model’s output, never visible to the user, never documented, and now - after redaction - never even partially observable. The user experiences the degradation. The user cannot diagnose the cause. The intellectual dark matter is gone.
Social technology. The workarounds that users built in response to quality degradation - stop-phrase-guard.sh firing 173 times in 17 days, PostToolUse code quality gates, model routing systems with fallback chains, transparent proxies monitoring budget enforcement events, 5,000-word CLAUDE.md files with anti-laziness directives - are social technologies in the precise sense. They are designed coordination mechanisms built by individuals to solve a problem that the market institution has failed to solve. They are the user’s equivalent of duct-taping the infrastructure when the provider will not maintain it. And like all social technologies built in response to institutional failure, they are fragile, non-portable, and dependent on the specific individuals who built them. When those individuals leave - as the theory predicts the most capable will - the social technology leaves with them. A minimal sketch of one such guard appears after this list.
The succession problem. Every technology company faces the moment when the founding engineers’ quality culture is replaced by the operational culture of cost optimization. The engineers who built the original model and who understood why certain quality thresholds mattered are succeeded by operators who see only the cost structure and the margin opportunity. The quality culture was never fully documented - it was intellectual dark matter in the heads of the founding team. When the succession happens, the new operators make individually rational cost optimizations that the founders would have rejected, because the founders understood the second-order consequences and the successors do not. This is the succession problem applied to model quality, and it predicts a specific pattern: quality degrades fastest after the founding team’s influence is diluted, and the degradation is invisible to the new operators because they never knew what the quality was supposed to be.
These six institutional concepts - live and dead players, institutional decay, cargo-culting, intellectual dark matter, social technology, and the succession problem - do not replace the economic frameworks. They enrich the predictions by adding a layer of analysis that the economics alone cannot provide. The economics says the equilibrium involves quality degradation. The institutional analysis says the degradation follows the pattern of institutional decay, that the benchmarks become cargo cults, that the thinking tokens are intellectual dark matter, that the user workarounds are fragile social technologies, and that the providers behave as dead players unless forced otherwise. Both layers point in the same direction.
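One of those social technologies is simple enough to sketch in outline. The stop-phrase-guard.sh referenced above is a user’s own script and its contents are not reproduced here; the following is a hypothetical equivalent with an illustrative phrase list, showing the general shape of the mechanism - scan a response for phrases that signal corner-cutting and fail the turn if any appear.

```python
# Hypothetical equivalent of a stop-phrase guard: scan a model response for phrases
# that signal the model is about to cut corners, and refuse to accept the turn if any
# are present. The phrase list is illustrative; the real stop-phrase-guard.sh
# referenced above is a user's own script and is not reproduced here.

import re
import sys

STOP_PHRASES = [
    r"simplified for brevity",
    r"left as an exercise",
    r"rest of the (implementation|code) (is )?omitted",
    r"in a real implementation",
    r"due to (time|space) constraints",
]
PATTERN = re.compile("|".join(STOP_PHRASES), re.IGNORECASE)

def guard(response_text: str) -> list:
    """Return the stop phrases found in the response; an empty list means the turn passes."""
    return [m.group(0) for m in PATTERN.finditer(response_text)]

if __name__ == "__main__":
    hits = guard(sys.stdin.read())
    if hits:
        print("stop-phrase guard fired:", ", ".join(hits), file=sys.stderr)
        sys.exit(2)   # nonzero exit so a wrapping hook can block or retry the turn
```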
4.7 Twelve Falsifiable Predictions
The five economic frameworks and the institutional layer together generate twelve predictions about the behavior of providers, users, and the market as a whole. Each prediction follows from a specific theoretical basis, operates through a specific mechanism in the LLM market, and can be falsified by specific observable evidence. The predictions are not speculative. They are the standard results of fifty years of industrial organization economics applied to the market structure documented in Section 3. If the market structure is as described, these predictions follow. If they do not hold, either the market structure has been mismapped or the economics is wrong. The economics has been right about airlines, healthcare, telecoms, electricity, water, and financial services. The predictions are stated here. The evidence is presented in Section 5.
Provider Behavior
P1: Quality shading under capacity constraints. Providers will reduce output quality during periods of high demand and constrained GPU capacity, with quality varying as a function of system load.
Theoretical basis. Sappington (2005) demonstrated that firms under price caps reduce quality when capacity is binding. The mechanism is straightforward: when revenue per unit is fixed and capacity constrains volume, quality reduction is the only available margin lever. The subscription model fixes revenue per user. GPU capacity is finite and expensive to expand. Reducing thinking depth per request allows more requests to be served on the same hardware. The Columbia Business School result formalizes the connection: “when firms face limited production capacity, lowering product quality can enable increased total production.”
Applied mechanism. The provider allocates thinking tokens - internal compute devoted to reasoning before generating output. Under low load, the provider can afford to allocate generously. Under high load, the same hardware must serve more concurrent requests, and the allocation per request must shrink. The subscription user pays the same fee regardless of when they submit a query. But the compute available to serve that query varies with system load. A query submitted at 2am Pacific time, when US usage is minimal, receives a different compute allocation than the same query submitted at 5pm Pacific time, when millions of users are active. The user experiences this as inconsistency - “sometimes Claude is brilliant, sometimes it is terrible” - without any mechanism to identify load-based allocation as the cause. The quality variation is invisible to the user because the user cannot observe system load, cannot observe the thinking token allocation, and - after redaction - cannot observe even the output of the thinking process.
Falsification criteria. If quality does not vary with time of day or system load, the prediction fails. Specifically: if thinking depth is constant across peak and off-peak hours, the Sappington mechanism is not operating. The prediction is testable by comparing model performance metrics across times of day, controlling for query complexity.
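The test described in the falsification criteria can be sketched directly. The record format, the quality metric, and the definition of peak hours below are assumptions for illustration; the structural point is that peak and off-peak scores must be compared within matched complexity buckets, so that harder queries arriving at busier times do not confound the comparison.

```python
# Minimal sketch of the P1 falsification test: compare quality scores for peak-hour
# and off-peak requests within matched complexity buckets. The record format and the
# definition of "peak" are assumptions for illustration.

from collections import defaultdict
from statistics import mean

PEAK_HOURS = set(range(9, 18))    # assumed: 9am-6pm local provider peak

def peak_vs_offpeak(records):
    """records: iterable of dicts with 'hour' (0-23), 'complexity' (bucket label),
    and 'score' (any consistent quality metric). Returns per-bucket mean gaps."""
    buckets = defaultdict(lambda: {"peak": [], "off": []})
    for r in records:
        key = "peak" if r["hour"] in PEAK_HOURS else "off"
        buckets[r["complexity"]][key].append(r["score"])
    gaps = {}
    for bucket, groups in buckets.items():
        if groups["peak"] and groups["off"]:
            gaps[bucket] = mean(groups["off"]) - mean(groups["peak"])
    return gaps   # consistently positive gaps across buckets are evidence for P1
```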
P2: Monitor removal precedes or accompanies quality reduction. Providers will reduce the user’s ability to observe quality before or concurrent with reducing quality itself.
Theoretical basis. Holmstrom (1979) established that the agent’s incentive to shirk is constrained by the principal’s ability to observe effort. The optimal strategy for an agent who intends to reduce effort is to first reduce the principal’s monitoring capability. This is not a secondary prediction - it is a direct consequence of the moral hazard framework. If you intend to do less work, you first ensure that the person paying you cannot see how much work you are doing.
Applied mechanism. Thinking token content was the user’s primary quality signal - the observable evidence of the model’s reasoning process. A user who could read the thinking tokens could assess whether the model was reasoning deeply or producing shallow pattern-matched output. Redacting thinking content removes this signal. The prediction is that redaction and quality reduction are linked - either the redaction enables the quality reduction by removing the monitoring mechanism, or the quality reduction motivates the redaction by creating a gap between what the user would observe and what the provider wants the user to observe.
The prediction further specifies that the timeline matters. If redaction occurs before quality reduction, the interpretation is that the provider removed monitoring in anticipation of reducing quality. If redaction occurs after quality reduction, the interpretation is that the provider removed monitoring to conceal a quality reduction that had already occurred. Either sequence confirms the prediction. The only falsification is if redaction and quality changes are temporally unrelated - if they occur at different times with no plausible causal connection.
Falsification criteria. If thinking redaction and quality regression are temporally unrelated - if they occur months apart with no causal connection - the prediction fails. The prediction is confirmed by a tight temporal correlation between the removal of monitoring and the reduction of quality, and by evidence that the redaction was not motivated by some independent reason (such as a genuine security concern with no quality implications).
P3: Subscription models create adverse incentives for power users. Under flat-rate pricing, the provider’s per-user cost is highest for the heaviest users, creating an incentive to degrade quality specifically for the users who consume the most compute.
Theoretical basis. This prediction combines two mechanisms. Moral hazard: the provider faces a fixed revenue per user and a variable cost per user, so the provider’s incentive is to reduce cost - which means reducing quality, especially for the highest-cost users. Adverse selection: flat-rate pricing attracts the heaviest users (who get the most value per dollar) and repels the lightest users (who would save money on pay-per-token), so the subscriber pool is systematically enriched with the most expensive-to-serve users. The combination produces a market where the provider’s subscriber base is disproportionately composed of users whose usage far exceeds the subscription price, and the provider’s cost management imperative is most acute for exactly those users.
Applied mechanism. A user who consumes $42,000 equivalent in API costs on a $400 subscription is a $41,600 monthly loss. The provider has three options: (a) degrade quality globally to reduce average cost, (b) degrade quality specifically for heavy users to target the cost where it is concentrated, or (c) impose hidden usage caps that throttle heavy users without explicit notice. All three are forms of quality shading, and all three are rational responses to the subscription economics. The prediction is that at least one of these three mechanisms will be observable in the data.
The gym membership analogy, frequently invoked for subscription services, applies here but with an important difference. A gym can tolerate members who never show up - those members are pure profit. The LLM subscription cannot tolerate members who use the service intensively, because each use consumes expensive compute. The economics are inverted: the “gym member who never shows up” is the provider’s best customer, and the member who shows up every day is the provider’s worst. The market selects against its own most engaged users.
Falsification criteria. If API and subscription quality are identical during the same period - if a user paying per token at the equivalent of $42,000 per month receives the same quality as a user paying $400 per month - the prediction fails. Alternatively, if heavy and light subscribers receive identical quality, the adverse-incentive mechanism is not operating. The prediction can also be tested by examining whether usage caps are imposed on heavy users without disclosure.
P4: System prompt manipulation as hidden quality lever. Providers will modify the system prompt - the hidden instructions that shape model behavior - to reduce output cost, without disclosing the changes to users.
Theoretical basis. Thaler and Sunstein’s behavioral nudge framework establishes that invisible choice-architecture changes - modifications to the defaults and framing that shape decisions - are the cheapest lever available to any choice architect. Applied to the LLM provider: the system prompt is the choice architecture. It is invisible to the user, instantly reversible, requires no model retraining, and costs nothing to deploy. Modifying the system prompt to produce cheaper output - shorter responses, simpler reasoning, less thorough analysis - is the lowest-cost quality reduction mechanism available. The prediction is that providers will use it, because the incentive is strong and the cost is zero.
Applied mechanism. A system prompt instruction like “Go straight to the point. Try the simplest approach first without going in circles. Do not overdo it” directly tells the model to produce cheaper output. The model follows the instruction - that is what models do with system prompts. The output is shorter, shallower, and less thorough. The user’s instructions to the contrary (“Depth over brevity,” “Think step by step,” “Be thorough”) compete with the system prompt for the model’s attention, and the system prompt typically wins because it has architectural priority. The user experiences degraded output and attributes it to their own prompting (P6) or to model capability, not to a hidden instruction they cannot see.
A parallel mechanism exists at OpenAI: the GPT-5 hidden system prompt includes an “oververbosity” setting (default 3/10) that controls response detail and takes precedence over developer instructions. The user cannot see this setting, cannot modify it, and may not know it exists. It is a provider-side quality knob that the user has no access to and no notification of.
Falsification criteria. If system prompts contain no cost-reducing instructions, or if all system prompt changes are disclosed to users in changelogs, the prediction fails. The prediction is also falsified if system prompt changes are present but have no measurable effect on output quality or cost.
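One way to operationalize the falsification test, assuming the system prompt text can be extracted from two client versions (the extraction itself is out of scope here), is to diff the two prompts and flag added lines that read like cost-reducing directives. The phrase list below is illustrative, seeded from the v2.1.64 additions quoted above.

```python
# Minimal sketch of the P4 check: diff the system prompt text extracted from two
# client versions and flag added lines that read like cost-reducing directives.
# How the prompts are extracted is out of scope here; the directive phrases are
# illustrative, seeded from the v2.1.64 additions quoted above.

import difflib
import re

COST_CUTTING = re.compile(
    r"straight to the point|simplest approach|do not overdo|be concise|keep it brief",
    re.IGNORECASE,
)

def flag_prompt_changes(old_prompt: str, new_prompt: str) -> list:
    """Return added lines in the new system prompt that match cost-reducing phrasing."""
    diff = difflib.unified_diff(old_prompt.splitlines(), new_prompt.splitlines(), lineterm="")
    return [line[1:].strip() for line in diff
            if line.startswith("+") and not line.startswith("+++")
            and COST_CUTTING.search(line)]
```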
P5: Benchmark scores diverge from real-world quality. Performance on standardized benchmarks will increasingly fail to track real-world user experience, as providers optimize for benchmark performance rather than general capability.
Theoretical basis. Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” This is one of the most well-confirmed regularities in the social sciences. Every domain where measurement is used for evaluation has produced examples: teachers teaching to the test, hospitals gaming readmission metrics, police departments reclassifying crimes to improve statistics, universities optimizing for rankings rather than education quality. OpenAI itself published a paper titled “Measuring Goodhart’s Law” acknowledging the dynamic in their own domain. NIST documented agents “actively exploiting evaluation environments” including copying human solutions from git history.
Applied mechanism. Frontier models exceed 90% on most major benchmarks. HumanEval: 95%. HellaSwag: 93%. The top six models on LMArena are separated by only 20 Elo points. But the same models show 20-30% drops on novel problems released after the training cutoff (LiveCodeBench). Phi-4 scores 85 on MMLU but only 3 on SimpleQA - a 28-to-1 ratio between the benchmark and a simple factual accuracy test. The benchmarks measure memorization and pattern matching on known problem distributions. They do not measure - and cannot measure - the kind of flexible reasoning that complex real-world tasks require. A model that optimizes for benchmark performance is optimizing for a different thing than the user wants, and the gap between the two widens as the optimization intensifies.
The institutional vocabulary is precise here: benchmarks are cargo cults of capability. The forms of the assessment survive. The substance of what the assessment was supposed to measure does not. A model that scores 1504 Elo on LMArena during a documented quality regression is performing the ritual of capability without delivering the capability itself.
Falsification criteria. If benchmark scores track real-world quality - if models that score higher on benchmarks are consistently preferred by users on real tasks - the prediction fails. Specifically: if a model ranks #1 on LMArena and users on the same platform report satisfaction with the model’s real-world performance during the same period, the Goodhart dynamic is not operating.
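The falsification test has a natural statistical form: rank models by benchmark score and by real-task satisfaction, and measure how strongly the two rankings agree. A minimal sketch with placeholder numbers - none of the figures below are measurements:

```python
# Minimal sketch of the P5 test: if benchmark rank and real-task satisfaction rank
# track each other, their rank correlation should be strongly positive; a weak or
# negative correlation is the Goodhart signature. All scores below are placeholders.

from scipy.stats import spearmanr

models = ["model_a", "model_b", "model_c", "model_d"]
benchmark_score = [95.1, 93.4, 91.8, 88.2]     # e.g. HumanEval-style pass rates (placeholder)
user_satisfaction = [3.1, 4.2, 4.0, 4.4]       # e.g. mean rating on real tasks (placeholder)

rho, p_value = spearmanr(benchmark_score, user_satisfaction)
print(f"rank correlation: {rho:.2f} (p={p_value:.2f})")
# A rho near +1 would falsify P5; a rho near zero or negative is consistent with it.
```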
User Behavior
P6: Attribution error delays detection. Users will attribute quality degradation to their own actions (prompting, configuration, workflow design) before attributing it to provider-side changes, delaying the detection of quality reduction.
Theoretical basis. The fundamental attribution error is one of the most robust findings in social psychology: humans systematically overattribute outcomes to internal causes (their own actions, their own characteristics) and underattribute outcomes to external causes (environmental factors, system changes). The effect is compounded in the LLM context by the information asymmetry - the user cannot directly observe provider-side changes, so the most salient explanation for degraded output is the only factor the user can observe: their own behavior.
Applied mechanism. When a model that was previously excellent begins producing poor output, the user’s first hypothesis is not “the provider reduced quality.” The user’s first hypothesis is “I am prompting badly,” or “my CLAUDE.md needs updating,” or “I need a better framework.” The user rewrites their prompts, restructures their workflow, builds elaborate instruction sets, and invests significant time and effort in solving a problem that is not on their side of the interaction. Each round of self-blame delays the moment when the user considers the external explanation. The provider benefits from this delay: every week the user spends optimizing their own behavior rather than questioning the provider’s behavior is a week of reduced quality at no reputational cost.
This is not a speculative behavioral prediction. It is the standard outcome when the fundamental attribution error operates under information asymmetry. The user has access to one set of variables (their own prompts, their own configuration, their own workflow) and no access to the other set (the provider’s system prompts, thinking allocation, model version, capacity utilization). The user optimizes the variables they can see. The variables they cannot see are the ones that changed.
Falsification criteria. If users immediately and correctly attribute quality degradation to provider-side changes - if “the provider reduced quality” is the first hypothesis rather than the last - the prediction fails. The prediction is confirmed by forum evidence showing a temporal sequence: self-blame first, then gradually emerging provider-blame, with a measurable detection lag.
P7: Sunk cost delays exit. Users with significant provider-specific workflow investments will tolerate quality degradation longer than users without such investments, because the non-transferable investments create switching costs that exceed the cost of continued degradation.
Theoretical basis. The sunk cost fallacy is the tendency to continue an activity because of previously invested resources (time, money, effort) that cannot be recovered. In the LLM context, this combines with genuine switching costs: provider-specific workflow investments that are non-transferable. CLAUDE.md conventions, hook infrastructure, multi-agent tooling, stop-phrase scripts, model routing systems - these are investments in a specific provider’s ecosystem that would need to be rebuilt from scratch for a different provider. The sunk cost fallacy makes users overweight these investments. The genuine switching costs make the overweighting partially rational.
Applied mechanism. A user who has built a 5,000-word CLAUDE.md file, a multi-agent Bureau system, tmux session management, concurrent worktree infrastructure, and a stop-phrase-guard.sh script has invested weeks or months of effort in a provider-specific workflow. When quality degrades, the user faces a choice: tolerate the degradation and preserve the investment, or abandon the investment and start over with a competitor. The model-layer switching cost is effectively zero - swapping the API endpoint is trivial. But the workflow-layer switching cost is substantial. The user’s calculation is: “the degradation costs me X hours per week in wasted effort and frustration, but rebuilding my workflow for a different provider would cost me Y hours up front.” As long as the accumulated X has not exceeded Y, the user stays. This is the sunk cost trap, and it delays exit by weeks or months beyond the point where a user with no workflow investment would have left.
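The stay-or-switch arithmetic can be made explicit. A minimal sketch of the break-even the invested user is implicitly computing - the weekly loss and the rebuild estimate below are hypothetical illustrations, not figures from this report:

```python
# Break-even sketch for the stay-or-switch decision under sunk workflow investment.
# All numbers are hypothetical illustrations, not measurements from this report.

def weeks_until_switching_pays(degradation_hours_per_week: float, rebuild_hours: float) -> float:
    """Weeks after which cumulative degradation cost (X per week) exceeds the
    one-time cost of rebuilding the workflow for another provider (Y)."""
    return rebuild_hours / degradation_hours_per_week

# A user losing 6 hours a week to retries and corrections, facing an estimated
# 80-hour rebuild of conventions, hooks, and routing for a competitor:
print(weeks_until_switching_pays(6, 80))  # -> ~13.3 weeks before leaving "pays"
# The sunk cost fallacy adds a bias on top of this already-long horizon: the user
# also weighs the unrecoverable hours already invested, which a forward-looking
# calculation would ignore.
```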
Falsification criteria. If users with complex workflows exit at the same rate as users without workflow investments, the prediction fails. If workflow complexity does not correlate with tolerance for degradation, the sunk cost mechanism is not operating. The prediction is confirmed by evidence that the most-invested users are the last to leave, even as they accumulate the most frustration and the most financial loss.
P8: Gradual degradation is tolerated longer than sudden degradation. Quality reductions that occur gradually will be detected later and tolerated longer than equivalent reductions that occur suddenly, because gradual changes fall below the perceptual threshold.
Theoretical basis. The Weber-Fechner law in psychophysics establishes that the just-noticeable difference for a stimulus is proportional to the magnitude of the stimulus. A 1% change in a large quantity is harder to detect than a 1% change in a small quantity. Applied to quality degradation: a series of small reductions, each below the just-noticeable difference threshold, can accumulate to a large total reduction without triggering detection. This is the boiling frog effect, and it is the standard exploitation strategy for any agent facing a monitoring constraint - degrade gradually, and the monitor (the user) adapts to each small change without noticing the cumulative drift.
Applied mechanism. A provider that reduces thinking depth from 100% to 33% in a single step will trigger immediate detection and outrage. A provider that reduces thinking depth from 100% to 95% in week one, 95% to 90% in week two, and so on over the course of several months will trigger detection only when the cumulative degradation crosses the threshold of tolerability - by which point the total reduction may be far larger than any single reduction the user would have accepted. The staged rollout of thinking redaction (1.5% to 25% to 58% to 100% over one week) is consistent with this strategy. Each step was small enough to be individually tolerable. The cumulative effect was not.
The institutional parallel is exact. Institutional decay operates the same way: a slow evaporation of the practitioners who understood why the thing worked, replaced by imitators who can only reproduce its surface. No single departure triggers alarm. The cumulative departure is catastrophic. If you want a mental image of this market’s quality degradation, you should imagine something like a model that shrinks its thinking by 3-5% per week for several months. That is a more accurate picture than a sudden collapse.
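The compounding is worth seeing in numbers. A minimal sketch, with the weekly cut and the detection threshold as illustrative assumptions rather than measured values:

```python
# Cumulative effect of weekly quality cuts that each sit below the just-noticeable
# difference. The weekly cut and detection threshold are illustrative assumptions.

def weeks_to_cross_threshold(weekly_cut: float, detectable_total: float) -> tuple[int, float]:
    """Return (weeks, remaining_quality) when the cumulative reduction first
    exceeds the user's detection threshold."""
    quality, week = 1.0, 0
    while (1.0 - quality) < detectable_total:
        quality *= 1.0 - weekly_cut
        week += 1
    return week, quality

weeks, quality = weeks_to_cross_threshold(weekly_cut=0.04, detectable_total=0.30)
print(weeks, round(quality, 3))  # -> 9 weeks, quality ~0.69 of baseline
# Roughly two months of individually tolerable 4% cuts add up to a reduction
# no user would have accepted in a single step.
```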
Falsification criteria. If users detect quality degradation immediately regardless of its rate - if a 67% thinking depth reduction is detected within days whether it occurs gradually or suddenly - the prediction fails. The prediction is confirmed by a measurable detection lag: a period between the onset of degradation and the point at which users first report it, with the lag being longer for gradual degradation than it would be for an equivalent sudden change.
P9: Power users generate the diagnostic signal, and they exit first. The users with the highest ability to detect quality degradation are also the users most expensive to serve and most likely to leave, removing the diagnostic capability from the market.
Theoretical basis. This is adverse selection applied to the feedback mechanism rather than to the product itself. In any market with quality uncertainty, the consumers best equipped to evaluate quality are the consumers the market most needs to retain - because they are the ones who generate the information signal that holds the provider accountable. But these same consumers are the highest-cost to serve (because their sophistication correlates with usage intensity) and the most sensitive to quality degradation (because their expertise lets them detect it). The market drives away exactly the users it needs most. This is evaporative cooling applied to a market: the most energetic particles leave first, and the remaining pool is increasingly unable to detect the temperature change.
Applied mechanism. The user who can detect that thinking depth dropped 67% is the user running 50 concurrent agents across 6,852 sessions with 234,760 tool calls, maintaining statistical correlation analyses with Pearson coefficients across 7,146 paired samples. That user consumed $42,000 equivalent in a single month. No casual user - no user who sends a few queries a day and judges quality by gut feeling - could have produced this analysis. The diagnostic signal in this market is generated exclusively by power users with the technical sophistication to instrument their usage, the statistical literacy to analyze the data, and the professional stake to invest the time. These users are also, by definition, the most expensive to serve and the most likely to leave when quality degrades - because they can detect the degradation and they have the capability to evaluate alternatives.
When the diagnostic user leaves, the diagnostic capability leaves with them. The remaining user base is less able to detect quality changes, less able to generate quantitative evidence of degradation, and less able to hold the provider accountable. The market becomes progressively less informed about its own quality. This is the feedback loop that makes the credence-good equilibrium self-reinforcing: quality degrades, the users who could detect it leave, the remaining users cannot detect it, so quality degrades further with even less constraint.
Falsification criteria. If casual users generate diagnostic evidence of equivalent quality to power users, the prediction fails. If power users do not exit at a higher rate than casual users during degradation events, the adverse selection mechanism is not operating. The prediction is confirmed by evidence that all quantitative diagnostic evidence originates from power users, and that these users subsequently exit the platform.
Market-Level Dynamics
P10: Open-weight adoption accelerates after proprietary degradation events. Quality degradation in proprietary models shifts demand toward open-weight alternatives, as the quality-adjusted price of proprietary models increases and the substitution effect drives users to self-hosted alternatives.
Theoretical basis. Standard substitution effect from price theory. When the quality-adjusted price of good A increases (quality decreases at constant price), demand shifts to substitute good B if the quality-adjusted price of B is now more favorable. Open-weight models are the substitute good: they deliver 70-85% of frontier quality at 1/10th to 1/100th the cost. The quality gap is the price the user pays for proprietary convenience. When proprietary quality degrades, the gap narrows and the substitution effect strengthens.
Applied mechanism. A user who pays $200 per month for a proprietary model that delivers 90% of the quality they need from a self-hosted model is paying a premium for the 10% quality gap. If the proprietary model degrades to 80% while the open-weight model remains at 70%, the gap has shrunk from 20 percentage points to 10, and the premium the user pays for the proprietary model now buys half as much incremental quality. At some threshold, the cost of self-hosting (hardware investment, setup time, maintenance) becomes lower than the accumulated cost of proprietary degradation (wasted time, broken output, retry loops). The prediction is that this threshold crossing accelerates after degradation events, producing observable spikes in open-weight adoption.
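The substitution pressure in that example is easy to quantify. A minimal sketch using the same illustrative figures ($200 subscription, 90% and 80% proprietary quality, 70% open-weight quality):

```python
# Quality-adjusted premium: dollars paid per percentage point of quality above
# the self-hosted alternative. Figures mirror the illustrative example above.

def premium_per_quality_point(price: float, quality: float, alt_quality: float) -> float:
    gap_points = (quality - alt_quality) * 100
    return float("inf") if gap_points <= 0 else price / gap_points

before = premium_per_quality_point(price=200, quality=0.90, alt_quality=0.70)
after = premium_per_quality_point(price=200, quality=0.80, alt_quality=0.70)
print(before, after)  # -> 10.0 vs 20.0 dollars per point of incremental quality
# After the degradation the same subscription buys half as much incremental
# quality, which is the substitution pressure the prediction describes.
```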
The economics of self-hosting have improved dramatically: an RTX 4070 Ti Super at $489 pays for itself in 5-10 months versus Claude API costs. Ollama has 166,000 GitHub stars. Qwen crossed 700 million HuggingFace downloads. r/LocalLLaMA has 500,000 members - 10x growth in two years. The secular trend toward open-weight is clear. The question for this prediction is narrower: does proprietary quality degradation accelerate the trend?
Falsification criteria. If open-weight adoption is uncorrelated with proprietary quality events - if adoption grows at a steady rate regardless of degradation episodes - the prediction is not confirmed, even though the secular trend exists. The prediction requires a measurable acceleration (spike in downloads, increase in community growth rate, surge in self-hosting infrastructure adoption) temporally linked to proprietary degradation events.
P11: Competitors exploit quality gaps with targeted offerings. When one provider’s quality degrades, competitors will capture the displaced demand through targeted marketing, feature development, and ecosystem building.
Theoretical basis. Standard oligopoly dynamics. In a concentrated market with differentiated products, quality degradation by one firm creates a competitive opportunity for rivals. The cost of customer acquisition drops when the target firm’s customers are actively dissatisfied. The value proposition of the competitor’s offering increases when the alternative is a degraded product. Rational competitors invest in capturing the displaced demand.
Applied mechanism. The LLM market is a concentrated oligopoly: the top three providers control something like 88% of enterprise API spending. When one provider degrades quality, the others do not need to improve their absolute quality - they only need to maintain their existing quality while the competitor’s drops. The quality gap creates a migration incentive, and competitors who offer convenient migration paths capture the margin. Anthropic’s memory import tool, released in March 2026, is an example of a feature explicitly designed to lower switching friction from competitors. OpenAI’s Codex CLI launch, with Terminal-Bench scores showing 77.3% versus Claude Code’s 65.4%, is an example of a competitor marketing directly into a quality gap.
The prediction extends to the market structure: quality degradation by a dominant player fragments the market by weakening brand loyalty and reducing the switching cost barrier that concentration depends on. If the best provider is no longer meaningfully better, the market becomes more competitive - which is good for users but bad for the provider whose quality advantage was its moat.
Falsification criteria. If competitors do not gain market share or do not target marketing at the degrading provider’s user base, the prediction fails. If market share remains stable through degradation events, competitive dynamics are not operating as predicted. The prediction is confirmed by documented migration patterns, market share shifts, and competitor actions explicitly targeting the quality gap.
P12: Provider communication is strategically asymmetric. Providers will disclose favorable quality information and withhold unfavorable quality information, with the asymmetry increasing as the gap between actual quality and perceived quality widens.
Theoretical basis. Grossman (1981) and Milgrom (1981) predict that high-quality firms should disclose voluntarily because non-disclosure is informative. But the unraveling mechanism requires consumers to make the sophisticated inference that silence implies poor quality. When consumers do not make this inference, the prediction inverts: the provider discloses when the news is good and stays silent when the news is bad, because silence carries no penalty. The asymmetry is not dishonesty exactly - it is rational communication strategy under conditions where the audience does not punish non-disclosure.
Applied mechanism. The prediction is that providers will publish detailed postmortems for problems they have fixed (because the disclosure demonstrates competence and responsiveness) while remaining silent about problems they have not fixed or do not intend to fix (because the silence carries no reputational cost given consumer unsophistication). The asymmetry extends to changelogs: changes that improve the user experience will be announced, while changes that degrade the user experience - system prompt modifications that reduce output quality, thinking budget reductions, hidden rate limit adjustments - will not appear in any changelog.
The communication asymmetry is the informational infrastructure that enables every other prediction. Quality shading (P1) requires non-disclosure to persist. Monitor removal (P2) requires the removal not to be announced. System prompt manipulation (P4) requires the manipulation to be hidden. The communication asymmetry is not a separate dynamic - it is the enabling condition for the rest.
Falsification criteria. If providers disclose both favorable and unfavorable quality information symmetrically - if changelogs document cost-reducing system prompt changes, if thinking budget reductions are announced, if postmortems are published for unresolved problems as readily as for resolved ones - the prediction fails. The prediction is confirmed by a documented pattern where disclosure correlates with favorable information and non-disclosure correlates with unfavorable information.
4.8 The Prediction Structure
The twelve predictions are not independent. They form three interlocking systems that reinforce each other, and the reinforcement is what makes the market dynamics self-sustaining rather than self-correcting.
The Provider Cascade: P1 + P2 + P4 + P12. The provider shades quality under capacity constraints (P1), removes the monitoring mechanism that would make the shading visible (P2), uses the system prompt as a zero-cost quality reduction lever (P4), and maintains strategic silence about all of the above (P12). Each step enables the next. Quality shading is detectable if thinking tokens are visible, so thinking tokens are redacted. System prompt manipulation is detectable if system prompts are disclosed, so system prompts are not disclosed. The entire cascade depends on non-disclosure, and non-disclosure depends on consumers not penalizing silence. The cascade is internally coherent - each element supports the others - and externally stable - no single element can be disrupted without disrupting the others.
The User Trap: P6 + P7 + P8. The user blames themselves before blaming the provider (P6), investing time and effort in solving a problem that is not theirs to solve. The user’s workflow investments create switching costs that make exit costly (P7). The gradual nature of the degradation keeps each individual change below the detection threshold (P8). The three effects compound: self-blame delays detection, which extends the period of investment, which raises switching costs, which delays exit further, which allows more gradual degradation to accumulate. The user is trapped not by any single mechanism but by the interaction of three mechanisms operating simultaneously.
The Market Spiral: P3 + P5 + P9 + P10. Subscription economics drive the provider to degrade quality for heavy users (P3). Benchmarks mask the degradation from the broader market (P5). Power users - the ones who can see through the benchmarks - exit first (P9). Open-weight alternatives capture the exiting users (P10). The spiral removes the diagnostic capability from the market (P9), which allows the degradation to deepen (P1), which further degrades the benchmarks’ relationship to reality (P5), which further delays detection for the remaining users. The market becomes progressively less informed about its own quality, and the providers face progressively less accountability for reducing it.
These three systems - the Provider Cascade, the User Trap, and the Market Spiral - do not merely coexist. They reinforce each other. The Provider Cascade creates the degradation. The User Trap delays the detection. The Market Spiral removes the diagnostic capability. The result is an equilibrium where quality degradation is structurally incentivized, practically undetectable by most users, and self-reinforcing once it begins. Darby and Karni said there is no fraud-free equilibrium in credence-good markets. The three interlocking systems explain why: the market structure does not merely permit degradation. It creates a self-reinforcing dynamic that sustains it.
The twelve predictions and their three compound systems now stand as the theoretical apparatus for this report. Each prediction has been derived from a specific theoretical basis, applied through a specific mechanism to the LLM market, and specified with falsification criteria that will determine whether the theory holds. The predictions are not wishes. They are the standard output of standard economics applied to the observed market structure. If the market structure is as Section 3 describes, these predictions follow as the night follows the day. Section 5 tests them against the evidence.
5. Evidence
The standard procedure in economics is: derive the prediction from theory, then see if the world cooperates. Twelve predictions were derived from fifty years of industrial organization economics, behavioral economics, and institutional analysis. Each prediction specified not only what should happen but what would falsify it. The world cooperated eleven times out of twelve. The twelfth - open-weight adoption spikes after degradation events - was partially confirmed: the secular trend is overwhelming, but the causal link to specific degradation events remains unclear.
What follows is the full evidence for each prediction. Every data point. Every user quote. Every cross-provider comparison. The evidence layer presents the raw material. The interpretation layer explains what the economics says and what the institutional analysis adds. Nothing has been compressed. The weight of this section is the weight of the report. If you read only one section, this is the one that shows whether the theory holds or whether it is just a plausible story.
It holds.
5.1 P1: Quality Shading Under Capacity Constraints
Verdict: CONFIRMED (strong)
Evidence
The prediction was that flat-rate subscription pricing under GPU capacity constraints would produce load-sensitive quality variation - that the provider would serve less thinking during peak demand and more thinking during off-peak hours, because serving more users on the same hardware requires giving each user less compute. Sappington (2005) surveyed quality shading in regulated utilities - electricity, telecoms, water - and found the pattern is universal: when revenue per unit is capped, quality reduction is pure margin. The Columbia Business School working paper puts it with precision: “when firms face limited production capacity, lowering product quality can enable increased total production.” The question was whether the LLM market would follow the same path as every other capacity-constrained market with price caps.
Stellaraccident’s time-of-day analysis answers it. In the pre-redaction period - before thinking content was hidden from users - thinking depth was roughly flat across hours, with a 2.6x ratio between the best and worst hours. Normal variation. Nothing unusual. In the post-redaction period, thinking depth became highly variable, with an 8.8x ratio between the best and worst hours. The variance more than tripled.
The timing signature is precise. The worst hours for thinking depth are 5pm PST - something like 423 characters of estimated thinking, corresponding to the end of the US workday - and 7pm PST at 373 characters, the highest sample count, corresponding to US prime time. The best regular hour is 11pm PST at 988 characters. At 1am PST, thinking depth spikes to 4x baseline, but on very few samples. The pattern is unmistakable: when demand is high, thinking is low. When demand is low, thinking is high. The model thinks more when fewer people are asking it to think.
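Nothing about the analysis itself is exotic. A minimal sketch of the hourly grouping - the log shape and the toy numbers are assumptions for illustration, not Stellaraccident's actual tooling or data:

```python
# Group per-request thinking-length estimates by local hour and compare the
# best and worst hours. Log shape and numbers are illustrative assumptions.
from collections import defaultdict
from statistics import mean

def hourly_thinking_ratio(samples: list[tuple[int, int]]) -> float:
    """samples: (hour_of_day, estimated_thinking_chars) pairs. Returns
    max(hourly mean) / min(hourly mean); in practice you would also require a
    minimum sample count per hour before trusting the ratio."""
    by_hour: dict[int, list[int]] = defaultdict(list)
    for hour, chars in samples:
        by_hour[hour].append(chars)
    means = {h: mean(v) for h, v in by_hour.items()}
    return max(means.values()) / min(means.values())

# Toy data: thin thinking at 5pm and 7pm, deep thinking at 11pm.
demo = [(17, 420), (17, 430), (19, 370), (19, 380), (23, 980), (23, 1000)]
print(round(hourly_thinking_ratio(demo), 2))  # -> 2.64 on this toy sample
# A flat ratio suggests uniform service; a large ratio concentrated at
# peak-demand hours is the load-sensitive signature described above.
```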
The interpretation Stellaraccident offered is important and deserves quoting in full: “thinking allocation is load-sensitive and variable in the post-redaction regime...The 5pm and 7pm PST valleys coincide with peak US internet usage, not peak work usage, suggesting the constraint may be infrastructure-level (GPU availability) rather than policy-level (per-user throttling).” This distinction matters. GPU availability is a capacity constraint. Per-user throttling is a policy choice. The data suggests the former - the more charitable interpretation, but also the interpretation that most directly confirms the Sappington prediction. Quality shading under price caps occurs because the capacity constraint binds, not because the provider targets specific users for degradation. The provider faces a fixed GPU fleet, a fixed subscription price, and variable demand. The mathematics produces the outcome without requiring anyone to decide to degrade quality for any individual user.
Additional evidence: issue #22435 documented 10x variance in quota burn rates on identical accounts within a 48-hour window. Two users, same subscription tier, same type of usage, differing by an order of magnitude in how fast their quota depleted. This is not consistent with uniform service delivery. It is consistent with load-sensitive allocation where users who happen to query during peak hours consume their quota faster because each query receives fewer resources and more queries are needed to accomplish the same work.
Interpretation
The economics here is straightforward and has been understood since Sappington surveyed regulated utilities two decades ago. When revenue per user is fixed by a price cap - which is what a subscription is - the only way to increase margin is to reduce cost per user. The only way to reduce cost per user without raising the subscription price is to reduce the quality of service per query. When capacity is the binding constraint, this is not even a strategic choice in any interesting sense. It is the mathematical consequence of serving more users than the hardware can support at full quality. The provider does not need to convene a meeting where someone says “let’s give users less thinking.” The capacity allocation algorithm does it automatically when demand exceeds supply.
What the institutional analysis adds is the observation that this pattern was invisible to users. The 8.8x variance existed only in the post-redaction period - after thinking content was hidden. In the pre-redaction period, when users could see how much thinking the model was doing, the variance was 2.6x. Once the quality signal was removed, the variance tripled. This is not a coincidence. It is the monitor-removal dynamic (P2) enabling the quality-shading dynamic (P1). The two predictions are not independent. They are two components of a single system.
The quality shading in this market operates exactly as it operates in electricity markets, in telecom markets, in water utilities under price caps. The market is not special. It is subject to the same forces.
5.2 P2: Monitor Removal Precedes or Accompanies Quality Reduction
Verdict: CONFIRMED (strong)
Evidence
The prediction was that a rational agent will remove the principal’s monitoring capability before or concurrent with reducing effort, because observable shirking carries a penalty that unobservable shirking does not. Holmstrom (1979) established this as the central insight of moral hazard theory: when an agent’s actions can be directly observed, optimal contracts can enforce quality; when observation is removed, the agent has incentives to shirk. The question was whether the timeline of thinking redaction and thinking depth reduction would be consistent with this sequence.
The timeline is precise.
Thinking depth dropped 67% by late February 2026. This was the quality reduction - the model was producing dramatically less thinking per query. It occurred while thinking content was still visible to users. Then the redaction began. On March 5, 1.5% of thinking blocks were redacted. The percentage climbed to 25%, then to 58% on March 8, then to 100% by March 12. The staged rollout took one week.
The critical date is March 8. On that date, redaction crossed 50% - meaning more than half of all thinking blocks were now hidden from users. On that exact date - not a day before, not a day after - users first widely reported quality regression. The quality had already been degraded for weeks. The thinking depth had already dropped by two-thirds. But users did not report the degradation until they could no longer see the thinking that was no longer happening.
@suzuenhasa described the experience directly: “The thinking is also something I thought I was going crazy/missing something or just assumed there was some setting enabled that ‘hides’ thinking that I just wasn’t looking for, but basically the responses started becoming far more kneejerk reaction like it hadn’t thought about anything at all. Then I realized: it wasn’t, not that I could see.”
Thought she was going crazy. Assumed she was missing a setting. Then realized: the model was not thinking. This is the attribution error (P6) operating in real time, but the relevant point for P2 is the timeline: quality was reduced first, then the quality signal was removed. The monitor was dismantled after the shirking was already underway.
Stellaraccident established a 0.971 Pearson correlation coefficient on 7,146 paired samples between visible thinking length and output quality metrics. This correlation meant that even after redaction, the signature of thinking depth was detectable in other features of the model’s output - but only by someone running the kind of statistical analysis that stellaraccident performed. For ordinary users, the redaction successfully destroyed the monitoring signal. The thinking content had been the user’s primary mechanism for verifying that the model was actually reasoning through the problem rather than pattern-matching to a superficial answer. Remove the thinking content, and the user cannot tell the difference between deep reasoning and shallow guessing. The monitor is gone.
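Her exact pipeline is not public. As an illustration only, the paired-sample check behind a figure like 0.971 looks something like the following, with the per-response quality proxy left as an assumption:

```python
# Paired-sample Pearson correlation between visible thinking length and a
# per-response quality proxy. The proxy and field shape are assumptions; the
# report does not publish the underlying pipeline.
from statistics import correlation  # Python 3.10+, Pearson by default

def thinking_quality_correlation(pairs: list[tuple[float, float]]) -> float:
    """pairs: (thinking_chars, quality_score) per response."""
    thinking = [t for t, _ in pairs]
    quality = [q for _, q in pairs]
    return correlation(thinking, quality)

# A coefficient near 1.0 across thousands of pairs means thinking depth and
# output quality move together - which is why hiding the thinking removes the
# user's only practical quality meter.
```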
Interpretation
Holmstrom’s framework maps exactly. When the agent’s actions can be observed, the agent maintains quality because shirking is detectable and punishable. When observation is removed, the agent’s incentive to maintain quality drops to whatever intrinsic motivation or reputational concern remains. Thinking tokens were the observable signal. They were the user’s meter - the equivalent of the electricity customer’s ability to read their own consumption, or the airline passenger’s access to the flight data recorder. Removing them was removing the meter.
The staged rollout - 1.5% to 25% to 58% to 100% - is itself evidence of strategic deployment rather than a single technical change. A sudden removal of all thinking content would have been immediately noticed and immediately protested. A gradual removal, where most users still see thinking tokens on most queries during the early stages, allows the change to propagate below the detection threshold. This is the boiling frog strategy (P8) applied to the monitoring mechanism itself. The monitor was boiled, not shot.
Let’s be direct here. The sequence is: reduce quality first, then remove the ability to observe the reduction. The Holmstrom prediction says this is what a rational agent does. The data confirms it with a timeline precise to the day and a correlation coefficient that would survive peer review in any social science journal. The quality was degraded in late February. The monitoring was removed in early March. The users noticed on the exact date when monitoring fell below 50%. The prediction is confirmed.
5.3 P3: Subscription Models Create Adverse Incentives for Power Users
Verdict: CONFIRMED (strong)
Evidence
The prediction was that flat-rate pricing would attract the heaviest users, that these users would consume far more compute than the subscription price covers, and that the provider would face irresistible incentives to reduce the quality of service delivered to them - because every dollar of thinking tokens served to a power user on a flat-rate plan is a dollar of margin destroyed. This is the gym membership problem: the economics works only if most members do not show up.
The numbers are extraordinary.
Stellaraccident consumed something like $42,121 equivalent in API-priced compute during March on a $400 subscription. That is 105 times the subscription price. At those economics, the provider loses money on every query. The more the user uses the product, the worse the provider’s economics become. This is adverse selection operating at its mathematical limit: the subscription attracted the user whose usage would cost 105x the revenue she generated.
@wpank documented over $10,700 in total Anthropic spend since November, with $6,000 or more in March alone as the quality issues compounded: “Over $10,700 in Anthropic spend since November. $6,000+ in March alone as these issues compounded. A real chunk of that went to: retry loops from shallow reasoning, inflated context that should have been pruned, broken caching that should have been working, and a $1,300 refactoring that produced dead code.”
The $1,300 refactoring deserves its own treatment. @wpank described it precisely: “$1,307 in API spend. Afterwards I audited everything: The codebase grew from 105K to 115K lines. The goal was to shrink it. 7 new modules created. 5 were dead code that compiled in isolation but were never imported or used by anything.” The user paid $1,307 to make a codebase larger when the goal was to make it smaller, and five of the seven new modules were fictional - they compiled but served no purpose. The model generated the appearance of work. The subscription charged for it.
Issue #20350 documented that users requesting Opus - the highest-quality model tier - received approximately 10% of the requested thinking budget. The user configured “Max” thinking. The system delivered 10%. The gap between what was requested and what was delivered is an order of magnitude.
Issue #28848 documented that after the Claude 4.6 release, Max subscribers hit their 5-hour limits in 2 hours. The subscription promised a certain capacity. The actual capacity was 40% of what was promised. And all paid tiers - the $20, the $100, the $200 - experienced the same regression. No tier differentiation. Paying more did not buy better quality. It bought the same degraded quality with a higher rate limit.
Todd Tanner named the dynamic with the precision of someone who has thought carefully about what he observed: “This isn’t unique to Anthropic. It’s the business model of ‘Intelligence-as-a-Service’: sell the premium tier, then quietly reduce what ‘premium’ means whenever the infrastructure costs get inconvenient. The fix is always the same - add a tier above, relabel the old one, and hope nobody notices.”
And: “I was at 46% of my weekly quota with 2 days until reset. I had headroom to burn. The lower effort wasn’t protecting me from hitting limits - it was protecting Anthropic’s compute costs.”
And: “An AI that solves your problem in one pass costs Anthropic one prompt of compute. An AI that gets 80% of the way there and needs five rounds of debugging costs six prompts - all billable against your rate limit. [...] The incentive to deliver ‘just good enough to keep paying, never good enough to stop needing it’ isn’t a conspiracy theory. It’s the business model of every subscription service that charges for consumption.”
And from the Hacker News thread that crystallized the analogy in two sentences: “The perfect product. Imperceptible shrinkflation. Any negative effects can be pushed back to the customer. No accountability needed.”
Multiple independent users, different platforms, different months, all converging on the same observation: the subscription model creates an incentive to serve less quality to the users who use the product most. These users did not read Sappington or Holmstrom. They derived the economics from first principles by experiencing it.
Interpretation
The adverse selection dynamics are textbook. Flat-rate pricing attracts the heaviest users because the per-unit cost of usage decreases with volume. The heaviest users are the most expensive to serve. The provider faces a choice: serve the heaviest users at a loss (unsustainable), raise prices to cover them (drives away the light users who subsidize the system), or reduce quality to bring cost per user down to a sustainable level. The third option is the equilibrium outcome. It preserves revenue from light users while reducing the cost of heavy users. It is the gym membership business model applied to machine intelligence, and it produces the same outcome: the members who actually show up get a worse experience.
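The margin arithmetic fits in a few lines. A toy model with hypothetical numbers, included only to show the direction of the incentive:

```python
# Toy model of flat-rate subscription margin under heavy usage. All numbers
# are hypothetical; only the direction of the incentive matters.

def monthly_margin(subscription: float, queries: int, cost_per_query: float) -> float:
    return subscription - queries * cost_per_query

print(monthly_margin(200, 300, 0.05))      # light user:  +185.0
print(monthly_margin(200, 40_000, 0.05))   # heavy user: -1800.0
# The only lever that restores the heavy user's margin without a visible price
# change is cost_per_query: less thinking, shorter outputs, cheaper routing.
print(monthly_margin(200, 40_000, 0.004))  # heavy user at reduced depth: +40.0
```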
Todd Tanner’s description - “add a tier above, relabel the old one, and hope nobody notices” - is a description of a social technology, in Burja’s sense: a discovered coordination mechanism that, once successful, propagates across every market where the same structural conditions apply. Cable television, health insurance, airline frequent-flyer programs, SaaS pricing tiers - the pattern recurs because the economic structure recurs. What makes the LLM case distinctive is not the mechanism but the invisibility. You can run a speed test on your internet connection. You can measure your airline seat pitch with a tape measure. You cannot measure the depth of an AI’s reasoning. There is no speed test for intelligence.
5.4 P4: System Prompt Manipulation as Hidden Quality Lever
Verdict: CONFIRMED (strong)
Evidence
The prediction was that providers would use system prompts as a zero-cost quality reduction lever - because system prompt changes are invisible to users, instantly reversible, require no model retraining, and cost nothing to deploy. They are the cheapest mechanism available for reducing per-query cost. Behavioral nudge theory (Thaler and Sunstein) predicts that when an agent has access to a zero-cost behavioral lever, the agent will use it.
The evidence is direct and comes from multiple independent discoveries.
@wjordan found the primary evidence by comparing archived system prompt versions. Claude Code v2.1.64, released around March 3-4, 2026, added: “IMPORTANT: Go straight to the point. Try the simplest approach first without going in circles. Do not overdo it. Be extra concise.” Every clause in this instruction reduces the cost of serving a query. “Try the simplest approach” means use less reasoning. “Be extra concise” means produce fewer output tokens. “Do not overdo it” means spend less compute. The instruction is not subtle. It is a direct order to the model to do less work.
The cross-provider evidence is equally direct. GPT-5’s hidden system prompt includes an “oververbosity” setting with a default value of 3 out of 10, controlling response detail. This setting takes precedence over developer instructions. The provider’s cost-reduction preference overrides the user’s quality preference at the system architecture level. The user can ask for detailed output. The system prompt says “3 out of 10 detail.” The system prompt wins.
@benvanik had included “Depth over brevity” in their CLAUDE.md file - a user-level instruction designed to encourage thorough, detailed output. It “worked wonderfully until pretty much that exact date range” - the date range when the system prompt was changed. A user instruction that had been effective for months suddenly stopped working, because an invisible system-level instruction was now countermanding it. The user’s explicit preference for depth was being overridden by the provider’s invisible preference for brevity. The user did not know the countermand existed.
@kyzzen attempted the obvious remediation - patching the user-visible system prompt to counteract the degradation: “patching my system prompt one week ago...didn’t improve/made worse the quality.” This is important evidence that the system prompt manipulation interacts with other degradation mechanisms. The thinking depth reduction (P1, P2) and the system prompt change (P4) were operating simultaneously. Fixing one did not fix the other. The degradation was not a single lever. It was multiple levers pulled at the same time.
@wpank produced the most precise quantitative comparison, isolating the system prompt effect by running the same codebase through two versions. Version 2.1.63 - before the system prompt change - spent $255 and produced 5,821 lines of integrated, working code where every file was imported and used. Version 2.1.96 - after the change - spent $152 and produced 17,152 lines where 15 files were placeholder scaffolds and an entire crate was dead code. The newer version spent less money and produced three times the volume. But the volume was fictional. “Less volume, all of it real” versus “more volume, none of it real.” The system prompt turned the model from an engineer into a set decorator.
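The audit @wpank describes - modules that compile in isolation but are imported by nothing - is mechanical to reproduce. A minimal sketch for a Python tree (the codebase in the report is not Python, and this is not @wpank's script; the check generalizes):

```python
# Find modules that exist in a source tree but are imported by nothing else in it.
# Illustrative only: the "compiles in isolation, used by nothing" check generalizes
# to other languages and build systems.
import ast
from pathlib import Path

def unimported_modules(root: str) -> set[str]:
    files = list(Path(root).rglob("*.py"))
    modules = {f.stem for f in files if f.stem != "__init__"}
    imported: set[str] = set()
    for f in files:
        tree = ast.parse(f.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module.split(".")[0])
    return modules - imported  # candidates for dead scaffolding

# print(unimported_modules("src/"))  # entry points will appear too; filter by hand
```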
Issue #34624 documented the cascading effects: the system prompt caused the model to skip feature specifications, write code based on hypotheses rather than confirmed specifications, and produce multiple rounds of broken code requiring human correction. Stellaraccident catalogued the behavioral pattern in a two-hour window: “the model used ‘simplest’ 6 times while producing code that its own later self-corrections described as ‘lazy and wrong’, ‘rushed’, and ‘sloppy.’ Each time, the model had chosen an approach that avoided a harder problem (fixing a code generator, implementing proper error propagation, writing real prefault logic) in favor of a superficial workaround.” The model was obeying the system prompt instruction to “try the simplest approach.” The simplest approach was the wrong approach. The instruction to be simple made the model stupid.
@wpank identified the paradox: “The thing meant to reduce output ends up increasing total token usage because it forces trial-and-error instead of getting it right the first time.” The cost-reduction instruction increased costs. The efficiency instruction reduced efficiency. The model that thinks less produces more tokens because it produces wrong tokens that require correction, and the correction requires more tokens, and the corrections sometimes need correcting. The five rounds of debugging cost six prompts. The one round of deep thinking would have cost one.
Interpretation
The system prompt is the provider’s cheapest lever, and the cheapest lever is always the first lever pulled. Changing model weights requires retraining at a cost of millions of dollars. Changing inference parameters requires engineering effort and testing. Changing the system prompt requires editing a text file. The cost is effectively zero. The deployment is instant. The effect is global - every user receives the modified instructions on every query. And the change is invisible: users do not see the system prompt and are not notified when it changes. This is the ideal quality reduction mechanism from the provider’s perspective: zero cost, instant deployment, global reach, complete invisibility.
The institutional parallel is the invisible policy change - the regulation modified without public comment, the standard revised without notice, the specification quietly weakened. The mechanism is universal. What makes the LLM case particularly clean is @wpank’s version comparison, which controls for every variable except the system prompt. Same user, same codebase, same underlying model weights - different system prompt, different outcome. The causal mechanism is isolated. The system prompt changed what the model did.
5.5 P5: Benchmark Scores Diverge from Real-World Quality (Goodhart’s Law)
Verdict: CONFIRMED (strong)
Evidence
The prediction was that when benchmarks become optimization targets, they cease to measure the capability they were designed to measure. “When a measure becomes a target, it ceases to be a good measure.” OpenAI has published research explicitly acknowledging Goodhart’s Law in the LLM context. NIST documented agents “actively exploiting evaluation environments.” The question was whether the divergence between benchmark performance and real-world quality would be observable during the documented regression period.
The divergence is not subtle. It is stark.
Claude Opus 4.6 Thinking scored #1 on LMArena at 1504 Elo during March-April 2026. Claude Opus 4.6 scored 1500 Elo. The Claude coding leaderboard showed 1549 Elo. The top six models were separated by only 20 Elo points - described as “tightest competition in platform history.” By every major benchmark, Claude was the best or among the best models available.
During the exact same period - the same weeks, the same model - GitHub issues documented: the model skipping verification of its own output, hallucinating parameter values for API calls instead of reading available documentation, surrendering prematurely to errors it could have solved, a 12x increase in user interrupts needed to keep the model on task, and a read:edit ratio collapse from 6.6 to 2.0 - meaning the model went from reading 6.6 files for every file it edited to reading only 2, which is the quantitative signature of a model that stopped doing its homework. Stellaraccident’s stop-phrase-guard.sh fired 173 times in 17 days after March 8 - catching the model attempting to stop working, dodge responsibility, or ask unnecessary permission roughly once every 20 minutes across active sessions. Peak day: March 18, with 43 violations. The #1 model in the world was being caught by a bash script trying to avoid doing its job 43 times in a single day.
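The read:edit ratio is a cheap, user-side proxy that anyone with session logs can compute. A minimal sketch, with tool names and toy data as assumptions:

```python
# Read:edit ratio from a session's tool-call log: how many file inspections the
# model performs per file modification. Tool names and data are assumptions.
from collections import Counter

READ_TOOLS = {"Read", "Grep", "Glob"}  # assumed names for inspection tools
EDIT_TOOLS = {"Edit", "Write"}         # assumed names for modification tools

def read_edit_ratio(tool_calls: list[str]) -> float:
    counts = Counter(tool_calls)
    reads = sum(counts[t] for t in READ_TOOLS)
    edits = sum(counts[t] for t in EDIT_TOOLS)
    return reads / edits if edits else float("inf")

healthy = ["Read"] * 33 + ["Grep"] * 33 + ["Edit"] * 10  # ~6.6 reads per edit
degraded = ["Read"] * 20 + ["Edit"] * 10                 # ~2.0 reads per edit
print(read_edit_ratio(healthy), read_edit_ratio(degraded))  # -> 6.6 2.0
```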
The broader evidence for benchmark-reality divergence across the LLM ecosystem:
Phi-4 scores 85 on MMLU - a result that would have been frontier-grade two years ago - but scores 3 on SimpleQA, a test of basic factual accuracy. The model that “knows” 85% of academic knowledge cannot answer simple questions about the world. LiveCodeBench showed 20-30% drops on truly novel problems released after the training cutoff - problems the models could not have memorized during training. Research directly states: “LLM performance on several popular benchmarks has low similarity with human perception.” NIST documented agents that, when placed in evaluation environments, copied human solutions from git history rather than generating their own - a strategy that maximizes benchmark scores while demonstrating zero capability.
Todd Tanner named the core problem: “If your internet provider halves your bandwidth, you run a speed test. If your cloud provider throttles your CPU, you have benchmarks. But when an AI company quietly dials back reasoning depth, there’s no speed test for intelligence. You can’t diff what the model would have thought versus what it actually thought.”
There is no speed test for intelligence. The benchmarks are the closest thing the market has to a speed test, and they have been compromised.
Interpretation
Goodhart’s Law operates through a specific mechanism. Models score well on benchmarks while performing poorly in the real world because they have been optimized to score well on benchmarks, and the optimization trade-offs sacrifice the capabilities that benchmarks do not measure. If the benchmark tests for correct output on standardized problems, the model optimizes for pattern recognition on standardized problems. If the benchmark does not test for deep reasoning on novel problems, the model does not optimize for deep reasoning on novel problems. The result is a model that excels at looking intelligent on tests while failing at being intelligent on work. The benchmark becomes a cargo cult of capability - the forms of intelligence survive after the substance has been evacuated.
The 20 Elo points separating the top six models are the tell. When every frontier model scores within measurement error of every other frontier model, the benchmark has ceased to differentiate quality. It is measuring benchmark performance, and benchmark performance is increasingly disconnected from user-experienced performance. The models have converged on what the benchmark rewards. What the benchmark rewards is not what users need.
The institutional implication is that benchmarks serve an informational function in the market - they are the primary mechanism by which non-expert users evaluate model quality. When that function is compromised by the Goodhart dynamic, the information asymmetry between provider and user widens. The provider knows the real-world performance is degrading. The benchmark shows #1. The user sees #1 and concludes quality is high. The benchmark has become a tool of the information asymmetry rather than a remedy for it.
5.6 P6: Attribution Error Delays Detection
Verdict: CONFIRMED (moderate)
Evidence
The prediction was that users would blame themselves before blaming the provider, because the fundamental attribution error leads humans to attribute outcomes to internal causes (their own behavior) before external causes (provider-side changes) - especially when the external causes are invisible and the internal causes are salient.
The forum evidence is abundant, and the temporal sequence is consistent across platforms and providers.
@eljojo: “I’ve been tweaking all my CLAUDE.md to counteract this, without realizing.” Adjusting an internal variable - personal configuration - to compensate for an external change the user had not yet identified. The user was solving the wrong problem, and investing time and effort in that wrong solution.
@oleksii-kulbako: “I thought I was imagining things, or I was doing something wrong, but then I wrote this in my work slack and realized I wasn’t the only one.” The sequence is precise: self-doubt first, self-blame second, social validation third, external attribution last. Only after discovering that colleagues shared the experience did the user consider the external explanation.
@suzuenhasa: “thought I was going crazy/missing something or just assumed there was some setting enabled that ‘hides’ thinking that I just wasn’t looking for.” Three internal explanations - cognitive failure, knowledge gap, configuration error - generated and evaluated before the external explanation was considered.
The OpenAI community produced the same pattern at scale: “Is it me, or is ChatGPT’s models are getting worse recently?” A thread title garnering 42 or more replies. The phrasing is diagnostic. “Is it me” comes first. The external explanation requires the hedging “or.” The user’s default hypothesis is that the problem is on their side.
Users built elaborate workaround systems based on the internal-attribution hypothesis. “Universal Prompt Frameworks” with anti-laziness directives - multi-page instruction sets designed to coerce the model into producing better output through more detailed prompting. These frameworks represent hundreds of hours of collective user effort invested in solving a problem that was not on the user’s side. Issue #625 framed the problem as “need to re-explain requests” - a framing that locates the failure in the user’s communication rather than the provider’s capability. r/ClaudeAI users noticed “performance drops after 2-3 weeks of a new model release” but had no mechanism to confirm the observation, and so the observation remained a hypothesis rather than evidence.
The Stanford study (Chen, Zaharia, and Zou, 2023, arXiv:2307.09009) eventually confirmed what users had been told to doubt: GPT-4’s accuracy on a prime number identification task went from 97.6% to 2.4%. The users who had been saying “it got worse” were right. The users and providers who had dismissed them were wrong. But the academic confirmation arrived months after the degradation, through the kind of rigorous study that ordinary users cannot conduct and ordinary timelines cannot accommodate. The detection lag was real, and the attribution error was its primary cause.
Interpretation
The attribution error operates under a structural information asymmetry that makes it almost inevitable. The user has access to one set of variables: their own prompts, their own configuration, their own workflow structure. They can see these variables, modify them, and observe the results. The provider-side variables - system prompts, thinking allocation, model version, capacity utilization, budget enforcement thresholds - are invisible. When quality degrades, the user optimizes the variables they can see. The variables they cannot see are the ones that changed.
This is not a cognitive error exactly. It is rational behavior under information asymmetry. The user is doing the sensible thing given what they can observe. The problem is that what they can observe does not include the cause of the degradation. Every week the user spends rewriting their CLAUDE.md, building anti-laziness prompts, constructing Universal Prompt Frameworks, or restructuring their workflow is a week where the provider bears no reputational cost for the quality reduction. The user absorbs the cost of the provider’s decision by investing their own time in compensating for it. The provider benefits from every day of delay.
The recantation pattern - “I owe the ‘it’s gotten worse’ crowd an apology” - is evidence that the attribution error eventually resolves. But it resolves on a timeline of weeks to months, not days. The market needs the error to resolve fast. It resolves slow. The provider profits from the delay.
The confidence tag is moderate rather than strong because the evidence is predominantly qualitative. The temporal pattern - self-blame preceding provider-blame - is consistent and well-documented across multiple platforms and providers. But the detection lag is hard to quantify precisely because users do not timestamp their cognitive shifts. The pattern is clear. The precise magnitude is estimated, not measured.
5.7 P7: Sunk Cost Delays Exit
Verdict: CONFIRMED (moderate)
Evidence
The prediction was that users with significant provider-specific workflow investments would tolerate quality degradation longer than users without such investments, because the non-transferable nature of these investments creates switching costs that exceed the cost of continued degradation - at least for a time.
Stellaraccident is the paradigm case. She built Bureau, a multi-agent orchestration system. She built tmux session management for concurrent agent supervision. She operated concurrent worktrees for parallel development. She maintained a 5,000-word CLAUDE.md file encoding months of accumulated knowledge about how to extract the best output from the model. She built stop-phrase-guard.sh, a programmatic enforcement mechanism that caught the model dodging work 173 times in 17 days. She built PostToolUse gates for code quality verification. This infrastructure represented weeks or months of engineering time by a Director of AI at AMD - time that is not cheap. Every component was designed for Claude’s specific behaviors, interfaces, and failure modes. Every component was non-portable.
She tolerated degradation from late February through early April - more than two months of documented quality collapse, during which her model’s read:edit ratio dropped from 6.6 to 2.0, her positive-to-negative sentiment ratio dropped from 4.4 to 3.0, and her stop-phrase guard fired hundreds of times. She stayed. And when she finally filed the definitive bug report and departed, the language was: “we are leaving this in the hopes that Anthropic can fix their product.” Hope at the point of exit. Emotional attachment at the moment of departure. The sunk cost is not just technical investment. It is relational investment.
@bbecausereasonss, in the stellaraccident thread: “there are bound to be setbacks...I need a trusted partner for eng tooling.” The language of partnership, trust, loyalty. The user frames the provider relationship as a partnership rather than a market transaction. Partnership language raises the emotional switching cost above and beyond the technical switching cost.
Across the ecosystem, users had built model routing systems with fallback chains, smart caching layers, transparent proxy analysis infrastructure, and production tooling that achieved 45-70% cost reduction through custom systems. These investments were substantial and real. They were also entirely non-portable. A model routing system designed for Claude’s API does not work for GPT-5’s API. A CLAUDE.md file is worthless to a competing provider. A stop-phrase-guard designed for Claude’s dodging behaviors does not catch GPT-5’s dodging behaviors.
The contrast case makes the pattern visible. @YarinAVI: “I canceled my CC $200 plan, and I am never going back, it’s really bad and I cannot do ANY engineering work. CC was great at release, then opus became cactus basically.” Casual user. No documented workflow infrastructure. No multi-agent systems. No hook scripts. Immediate exit. No agonizing. No hope that the provider would fix things. Just cancellation. The difference between stellaraccident - two months of tolerance, elaborate workarounds, hope at exit - and YarinAVI - immediate cancellation, no looking back - is the difference between high workflow investment and no workflow investment. The prediction specified this contrast. The data confirms it.
Interpretation
The sunk cost mechanism compounds with genuine switching costs, and the distinction matters for theory even though both mechanisms produce the same observed behavior. Stellaraccident’s Bureau, her CLAUDE.md, her hook infrastructure - these are real investments that would genuinely need to be rebuilt for a different provider. The sunk cost fallacy says users overweight past investments relative to their forward-looking value. The genuine switching cost says the investments create real barriers to exit. Both predict: invested users stay longer. The data confirms the prediction without cleanly separating the two causes.
What the institutional analysis adds is the recognition that these user-built systems are social technologies - novel solutions to coordination problems between human and machine. Stellaraccident’s Bureau is an institutional innovation. Her stop-phrase-guard is a monitoring institution that enforces quality standards the provider stopped enforcing. These social technologies are fragile in exactly the way that institutional knowledge is always fragile: they exist in the heads and systems of specific practitioners, they are not documented in transferable form, and when the practitioner leaves, the capability leaves too. The sunk cost delays exit. And when exit finally occurs, the institutional knowledge that made the user’s experience survivable is lost. The market loses not just the user but the user’s innovations for coping with the market’s failures.
The confidence tag is moderate because the correlation between workflow complexity and time-to-exit, while consistent with the data, is confounded by factors including professional stakes, debugging patience, and emotional attachment that are not cleanly attributable to sunk cost. The prediction is confirmed in direction. The precise causal attribution between sunk cost bias and rational switching cost evaluation cannot be separated with the available data.
5.8 P8: Gradual Degradation Is Tolerated Longer Than Sudden Degradation
Verdict: CONFIRMED (strong)
Evidence
The prediction was that gradual quality reduction would be detected later and tolerated longer than equivalent sudden reduction, because gradual changes fall below the just-noticeable difference threshold established by the Weber-Fechner law in psychophysics. The boiling frog.
The detection lag is measurable and specific.
Thinking depth dropped 67% by late February 2026. Quality regression was first widely reported on March 8. That is a three-week lag - three weeks during which the model was thinking at one-third of its previous depth, and a user base that includes professional software engineers using the product eight or more hours a day did not collectively recognize the change.
March 8 is significant not because quality dropped on March 8 - quality had already dropped weeks earlier - but because March 8 is the date when thinking redaction crossed 50%. Users noticed not because the model started thinking less, but because the model started visibly not showing its thinking. The redaction made the already-present degradation suddenly salient. The 67% thinking reduction in late February went essentially undetected for weeks. The redaction - a visibility change rather than a quality change - triggered the recognition. Users needed the absence of thinking to become visible before they could see the absence of thinking.
The staged rollout - 1.5% to 25% to 58% to 100% over one week - is a deployment pattern consistent with exploiting adaptation. Each step was small enough to be individually unnoticeable or attributable to normal session-to-session variation. The cumulative effect was complete removal of the quality signal.
The user testimony traces the adaptation in real time:
@suzuenhasa: “Back in December it was quite great - not perfect, but it was around that time I started to see these cracks appear as well. It wasn’t often, usually it would be fine after leaving it alone for a day/weekend. However in the past month especially it has had far more bad days than good.” The user adapted to intermittent degradation. Bad sessions were tolerated because good sessions still occurred. The ratio of bad to good shifted gradually, and the user adjusted expectations at each step rather than recognizing the cumulative trend. The cracks appeared in December. The recognition arrived in April. Four months.
@kevinflowstate: “I’ve noticed a massive deterioration of Claude code over the past two weeks, and I use it extensively every single day. [...] For the first time ever, every single day for the past two weeks, Claude Code is apologising to me for getting things wrong.” The shift from intermittent to constant degradation is what triggered detection - not because the constant degradation was worse in absolute terms than the earlier intermittent episodes, but because the intermittent pattern had been tolerable and the constant pattern was not. The frog noticed the boil.
@kevinflowstate continued, tracing the adaptation arc: “It’s gone from a learning curve at the start, really getting into a flow and using it daily and getting great work done, to now having to constantly correct it, stop it in its tracks, go back to the drawing board.” From learning curve, to flow, to constant correction. The trajectory is a smooth decline, and the user experienced each stage as the new normal before recognizing the overall direction.
The Civil Learning narrative from Medium traces the same arc compressed into a shorter period: “For about a month, I lived inside Claude Code. When Opus 4.5 launched, it felt like a breakthrough. I was blown away. I used it 8 hours a day, every day, for intensive engineering work. I kept hitting usage limits, so I did what any rationally irrational developer would do: I bought two $200/month accounts. And then, just as quickly, I cancelled both.” Breakthrough to cancellation. The adaptation happened within the arc - the user kept adjusting to diminishing quality, investing more (two accounts), until the cumulative degradation crossed the threshold of tolerability.
The cross-provider parallel confirms the mechanism is structural. GPT-4’s “laziness” started in late November 2023. It was widely reported in December. OpenAI fixed it on January 25, 2024. Roughly a two-month cycle from onset to fix, with the detection lag accounting for several weeks. The same pattern - gradual onset, delayed detection, eventual collective recognition, belated response - repeated across providers because the same mechanism operates across providers.
Interpretation
The Weber-Fechner law says the just-noticeable difference for a stimulus is proportional to the magnitude of the stimulus. Small reductions from a high baseline are harder to detect than small reductions from a low baseline. A series of small reductions, each below the detection threshold, can accumulate to a massive total reduction without triggering recognition until the cumulative change crosses the threshold of tolerability. The provider does not need to implement this deliberately. The perceptual limitation is built into the users.
The institutional parallel is exact. This is how institutional decay operates in every domain. No single departure of a knowledgeable practitioner triggers alarm. No single simplification of a complex process is catastrophic. No single budget cut destroys a program. But the accumulation over years is devastating. If you want a mental image of this market’s quality degradation, you should not imagine a sudden collapse. You should imagine something like 3-5% per week for several months. Two hundred years of GDP shrinking by about 1% a year gives you the fall of Rome. Ten weeks of thinking depth shrinking by 5-10% per week gives you the fall of a model.
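To make the compounding concrete, here is a minimal arithmetic sketch - the weekly percentages are illustrative assumptions, not measured values - showing how cuts that each sit near a plausible detection threshold accumulate into the kind of total decline described above.

```python
# Illustrative compounding of small weekly quality cuts (assumed values).
def compounded_remainder(weekly_cut: float, weeks: int) -> float:
    """Fraction of the original quality remaining after `weeks` successive cuts."""
    return (1.0 - weekly_cut) ** weeks

for weekly_cut in (0.03, 0.05, 0.10):
    remaining = compounded_remainder(weekly_cut, weeks=10)
    print(f"{weekly_cut:.0%} per week for 10 weeks -> "
          f"{1 - remaining:.0%} total reduction ({remaining:.0%} of baseline left)")

# Output:
# 3% per week for 10 weeks -> 26% total reduction (74% of baseline left)
# 5% per week for 10 weeks -> 40% total reduction (60% of baseline left)
# 10% per week for 10 weeks -> 65% total reduction (35% of baseline left)
```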
The three-week detection lag for a 67% quality reduction is the headline quantitative result. Professional engineers using the product every day did not collectively detect a two-thirds reduction in thinking depth for three weeks. That is the power of gradual degradation under information asymmetry.
5.9 P9: Power Users Generate the Diagnostic Signal, and They Exit First
Verdict: CONFIRMED (strong)
Evidence
The prediction was that the users best equipped to detect quality degradation would be the most expensive to serve and the most likely to leave, removing the diagnostic capability from the market. This is adverse selection applied to the feedback mechanism - evaporative cooling in a market, where the most energetic particles leave first and the remaining pool is progressively less capable of measuring its own temperature.
The diagnostic hierarchy is unambiguous. Every piece of quantitative evidence for quality degradation in this market was produced by power users with professional-grade technical sophistication. No casual user contributed quantitative evidence. Not one.
Stellaraccident - Stella Laurenzo, Director of AI at AMD, working on MLIR and GPU compilers - produced the definitive analysis: 6,852 sessions, 234,760 tool calls, Pearson correlations across 7,146 paired samples, time-of-day analysis, vocabulary frequency analysis with word-level tracking across months, behavioral taxonomy with categorized failure modes, stop-phrase violation counts, read:edit ratio tracking, and month-over-month comparison controlling for user behavior. The analysis required data mining capability, statistical literacy, and the professional stake to invest dozens of hours in forensic analysis rather than simply leaving. Her summary of the contrast: “The human worked the same; the model wasted everything. User prompts: 5,608 in February vs 5,701 in March. The human put in the same effort. But the model consumed 80x more API requests and 64x more output tokens to produce demonstrably worse results.” She controlled for her own behavior and demonstrated that the degradation was entirely on the model’s side.
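For readers who want to see what this kind of forensic work involves mechanically, the following is a minimal sketch - not her pipeline, which is not public - of how a read:edit ratio and a thinking-length-to-quality correlation could be computed from session logs. The record fields (`tool`, `thinking_chars`, `quality_score`) are hypothetical, and the tool names only approximate the tool set involved.

```python
# Minimal sketch of the kind of session-log analysis described above.
# Record fields are hypothetical; the actual pipeline and schema are not public.
from statistics import correlation  # Pearson r; available in Python 3.10+

READ_TOOLS = {"Read", "Grep", "Glob"}   # approximate read-type tool names
EDIT_TOOLS = {"Edit", "Write"}          # approximate edit-type tool names

def read_edit_ratio(tool_calls: list[dict]) -> float:
    """Ratio of read-type to edit-type tool calls across a set of sessions."""
    reads = sum(1 for c in tool_calls if c["tool"] in READ_TOOLS)
    edits = sum(1 for c in tool_calls if c["tool"] in EDIT_TOOLS)
    return reads / edits if edits else float("inf")

def thinking_quality_correlation(samples: list[dict]) -> float:
    """Pearson correlation between visible thinking length and a per-response
    quality score, computed over paired samples."""
    lengths = [s["thinking_chars"] for s in samples]
    scores = [s["quality_score"] for s in samples]
    return correlation(lengths, scores)
```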
@wpank - building agent platforms, over $10,700 in total Anthropic spend - produced quantitative proxy data and the version comparison that isolated the system prompt effect: v2.1.63 at $255 for 5,821 lines of working code versus v2.1.96 at $152 for 17,152 lines of scaffolds and dead code.
@ArkNill produced transparent proxy analysis documenting 261 budget enforcement events - tool results silently truncated to as few as 1-2 characters after crossing a 200,000-token aggregate threshold. Discovery of this mechanism required running a transparent proxy on every API call and analyzing the captured data.
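A minimal sketch of the detection logic - the actual proxy and its log format are not public, and the record fields here are hypothetical - shows how captured traffic could be scanned for the pattern described: an aggregate token count crossing the threshold, followed by tool results collapsing to one or two characters.

```python
# Minimal sketch of scanning captured proxy traffic for silent truncation.
# Record fields are hypothetical; the threshold and the 1-2 character
# truncations are the figures reported above.
AGGREGATE_THRESHOLD = 200_000   # tokens, per the reported enforcement point
SUSPICIOUS_LENGTH = 2           # tool results collapsed to 1-2 characters

def find_truncation_events(records: list[dict]) -> list[dict]:
    """Flag tool results that collapse after the session crosses the threshold."""
    events: list[dict] = []
    running: dict[str, int] = {}
    for rec in records:
        sid = rec["session_id"]
        running[sid] = running.get(sid, 0) + rec["tokens"]
        result = rec.get("tool_result", "")
        if running[sid] > AGGREGATE_THRESHOLD and len(result) <= SUSPICIOUS_LENGTH:
            events.append({"session": sid,
                           "aggregate_tokens": running[sid],
                           "result_length": len(result)})
    return events
```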
@wjordan found the system prompt change by comparing archived version histories of the Claude Code system prompt. This required knowing that system prompts are versioned and archived, knowing where to find them, and having the technical facility to diff them.
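The comparison itself is simple once the archives are located. A minimal sketch, with placeholder file names standing in for whatever archive was actually used:

```python
# Minimal sketch: diff two archived system prompt versions.
# File names are hypothetical placeholders for whatever archive was used.
import difflib
from pathlib import Path

old = Path("claude-code-v2.1.63-system-prompt.txt").read_text().splitlines()
new = Path("claude-code-v2.1.64-system-prompt.txt").read_text().splitlines()

for line in difflib.unified_diff(old, new, fromfile="v2.1.63", tofile="v2.1.64", lineterm=""):
    print(line)
```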
Todd Tanner - the user who built SpawnDev.ILGPU, “a 6-backend GPU compute transpiler with 1,500+ tests and zero failures” using the same model - produced detailed analytical writing that connected the user experience to the economic incentives. His writing named the mechanisms: shrinkflation, consumption-based subscription perversity, the absence of a speed test for intelligence. This is diagnostic work of a different kind - not statistical but structural. It requires the kind of business-model literacy that casual users rarely possess.
The casual users contributed something different: signal volume. Issue #42796 accumulated 866 thumbs-up reactions, 245 hearts, 118 rockets, 82 laughing reactions. Issue #38335 on rate limits accumulated 410 or more comments. The casual users confirmed the existence of the problem through sheer volume of complaint. But none of them produced quantitative evidence. The quantitative evidence - the evidence that distinguishes “users are unhappy” from “here is exactly what changed, when it changed, and how we know” - came exclusively from the power users.
And they left. Stellaraccident switched to a competing tool, declining to name it for NDA reasons. @wpank downgraded to version 2.1.63, reverting to the pre-degradation state. @jasona: “Testing back on GPT-5.4 it’s doing much better than Opus is right now.” The diagnosticians departed after filing the diagnosis. The diagnostic capability left with them.
Stellaraccident captured her own departure and its institutional meaning: “I went from ‘I can run 50 agents and they all produce excellent work’ to ‘every single one of these agents is now an idiot.’” From 50 excellent agents to 50 idiots. That is the experience that drives a power user to invest dozens of hours in forensic analysis, file a definitive bug report, and then leave the platform entirely. The user who produced the most valuable diagnostic evidence the market has ever seen is no longer generating evidence for this market.
Interpretation
The adverse selection in the feedback market is the mechanism that makes the credence-good equilibrium self-reinforcing. The users who can detect quality degradation are the users the market drives away. Once they leave, the remaining user base is less capable of detection, the provider faces less accountability, quality can degrade further with even less constraint, and the next cohort of sophisticated users detects the new degradation and also leaves. The monitoring capability evaporates, and the market becomes progressively less informed about its own quality.
This is the most important dynamic in the entire analysis, because it explains why the market does not self-correct. In a normal market, quality degradation triggers customer complaints, which trigger provider response, which restores quality. The feedback loop runs in the right direction. In this market, quality degradation triggers power user detection, which triggers power user exit, which removes detection capability, which allows further degradation. The feedback loop runs backwards. The market’s immune system attacks the immune cells.
Stellaraccident’s bug report - the 6,852-session, statistically rigorous, multi-appendix analysis - is a document that no one else in the user base produced or could have produced. The market needed exactly one person with her capabilities, her usage patterns, her statistical methodology, and her willingness to invest the time. She produced the evidence. And then she left.
5.10 P10: Open-Weight Adoption Accelerates After Proprietary Degradation Events
Verdict: PARTIAL
Evidence
The prediction was that quality degradation in proprietary models would produce measurable spikes in open-weight adoption - a standard substitution effect where the quality-adjusted price of proprietary increases and demand shifts to the cheaper substitute.
The secular trend in open-weight adoption is overwhelming. Qwen crossed 700 million HuggingFace downloads, surpassing Llama, by January 2026. 63% of new fine-tuned models on HuggingFace were based on Chinese-developed architectures by September 2025. r/LocalLLaMA grew to 500,000 members by April 2026 - something like 10x growth in two years. Ollama has 166,000 GitHub stars. Self-hosted inference costs $0.07 to $0.12 per million tokens versus $1 or more for API access - a 10x to 100x cost advantage. An RTX 4070 Ti Super at $489 pays for itself in 5 to 10 months versus Claude API costs. Open-weight models deliver something like 70-85% of frontier quality, and the gap narrows with each generation.
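The payback arithmetic is worth spelling out. A back-of-envelope sketch using the hardware price quoted above and assumed levels of displaced monthly API spend - electricity and the quality gap versus frontier models are deliberately ignored:

```python
# Back-of-envelope payback arithmetic. Monthly API spend values are assumptions;
# electricity and the quality gap versus frontier models are ignored here.
GPU_COST = 489.0  # RTX 4070 Ti Super, as quoted above

for monthly_api_spend in (50.0, 75.0, 100.0):
    months = GPU_COST / monthly_api_spend
    print(f"${monthly_api_spend:.0f}/month of displaced API spend -> "
          f"break-even in {months:.1f} months")

# Output: roughly 9.8, 6.5, and 4.9 months - the 5-10 month window cited above.
```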
The economic case for the substitution is overwhelming. The trend is real. The adoption is accelerating. The cost advantage is enormous.
But the causal link between specific proprietary degradation events and adoption spikes is unclear. Open-weight adoption is growing on a steep secular curve driven by multiple factors: cost savings, privacy requirements, customization needs, latency optimization, the general commoditization of the model layer. These drivers exist independently of any specific quality degradation event. Disentangling the degradation-driven component from the organic growth trend would require the kind of natural experiment that market data does not naturally provide - a clean before/after comparison with a control group that experienced no degradation event.
Interpretation
Let me be honest about the limitation here. The substitution effect is theoretically sound. If the quality-adjusted price of proprietary models increases because quality decreases at constant price, demand should shift to substitutes. The secular trend is consistent with this mechanism. But “consistent with” is weaker than “caused by.” The adoption could be growing at the same rate regardless of proprietary quality events.
What I think is the honest assessment: the structural incentives are operating, the substitution effect is real in the aggregate, and the secular trend is accelerating. But attributing specific adoption spikes to specific degradation events requires data granularity that the available evidence does not provide. The prediction is partially confirmed. The direction is right, the magnitude is large, and the mechanism is sound. The causal specificity is missing.
The open-weight wave is real regardless of what caused it. Qwen at 700 million downloads is not a niche phenomenon. r/LocalLLaMA at 500,000 members is not a hobby community. The market is bifurcating: proprietary for convenience and frontier capability, open-weight for cost and control. Whether proprietary quality degradation is the primary accelerant or merely one factor among many is a question the current data cannot answer. The honest confidence tag is PARTIAL.
5.11 P11: Competitors Exploit Quality Gaps with Targeted Offerings
Verdict: CONFIRMED (strong)
Evidence
The prediction was that quality degradation by one provider would create competitive opportunities for rivals, and that rational competitors would invest in capturing the displaced demand. Standard oligopoly dynamics in a concentrated market.
The migration data is direct.
Claude users documented switching to OpenAI’s Codex CLI, which scored 77.3% on Terminal-Bench versus Claude Code’s 65.4% - a 12-percentage-point gap on the most relevant coding benchmark, materializing during the exact period of Claude’s documented quality regression.
@janstenpickle quoted a colleague: “1.5 hours with the latest version of Claude to go nowhere and 5 minutes with the downgraded version to get it to work.” An 18:1 time ratio. That is the kind of gap that overcomes any switching cost, any sunk cost, any brand loyalty.
@jasona: “Testing back on GPT-5.4 it’s doing much better than Opus is right now.” An active Claude user testing a competitor and finding it superior. This is the market’s competitive mechanism operating in real time.
@ylluminate: “Same here. Have verified this problem on FOUR (4) different Claude Max accounts now. This is really bad and having to move entirely over to Codex for critical work.” The migration is not hypothetical. Users are moving.
Civil Learning, on Medium: the user who bought two $200/month accounts in a burst of enthusiasm for Claude Code, then cancelled both and wrote a public essay titled “Why I Quit Claude Code and Switched to Codex 5.2.” The title is the competitive dynamic in miniature.
The broader market data confirms the pattern. ChatGPT’s consumer market share declined from 87% to something like 45-68% - a historic share collapse. Gemini grew to 18-25%, driven partly by Google’s ecosystem bundling with Android and Workspace. Claude maintained enterprise dominance - roughly 70% win rate in head-to-head enterprise deals - but consumer sentiment was migrating.
Anthropic itself released a memory import tool in March 2026 - a feature explicitly designed to lower switching friction from competitors to Claude. The provider was building migration tools to capture users from competitors at the same time its own users were migrating away. The competitive dynamics run in both directions simultaneously.
The complication, and it is a significant one: Anthropic raised $30 billion at a $380 billion valuation in February 2026 - during the period of documented quality regression. The enterprise market and the consumer market are telling different stories. Enterprise contracts are locked in by procurement cycles, integration depth, compliance requirements, and contractual commitments. A consumer user who spends $200 a month switches in minutes. An enterprise customer with a multi-year contract, custom integrations, and compliance frameworks does not switch at all, even when quality drops.
@jasona captured the consumer-side response: “I think we just have to make sure they hear us from a pocketbook perspective. I’ve downgraded my sub until I see a future update that addresses this.” Revenue pressure. The market mechanism that is supposed to discipline quality degradation. Whether the pocketbook pressure from consumer users reaches the threshold that matters to a company sitting on $30 billion in fresh capital is an open question.
Interpretation
The competitor exploitation is standard oligopoly dynamics operating as predicted. What the LLM case reveals is the split between consumer and enterprise competitive response times. In the consumer market, switching costs are low and competitive response is fast - users migrate within days or weeks of detecting quality gaps. In the enterprise market, switching costs are high and competitive response is slow - procurement cycles run months or quarters, not days. Quality degradation hits the consumer market first and the enterprise market last. Consumer migration is the leading indicator. Enterprise revenue is the lagging indicator.
The $30 billion fundraise during documented quality regression is itself evidence of market information asymmetry. The investors either did not know about the quality regression, knew and judged it temporary, or knew and judged enterprise stickiness sufficient to protect the investment regardless. The third interpretation is most consistent with rational investment behavior - enterprise contracts create a revenue buffer that insulates the provider from consumer-market quality signals, at least for a time. But buffers are temporal. If the quality gap persists, enterprise procurement cycles eventually rotate and the enterprise switching begins. The consumer migration is the canary. The question is whether the canary’s signal reaches the mine in time.
5.12 P12: Provider Communication Is Strategically Asymmetric
Verdict: CONFIRMED (strong)
Evidence
The prediction was that providers would disclose favorable information and withhold unfavorable information, with the asymmetry increasing as the gap between actual quality and perceived quality widens. The Grossman-Milgrom unraveling theory predicts that high-quality firms should disclose voluntarily, making non-disclosure informative. The prediction was that this mechanism would fail because consumers do not make the sophisticated inference that silence implies bad news.
The test case is Anthropic’s communication across two incidents, and the contrast is stark.
In September 2025, Anthropic published a detailed postmortem for three infrastructure bugs. The postmortem identified specific dates, specific affected models, specific root causes - routing errors, TPU issues, compiler problems - and specific fixes. This was good disclosure. Transparent, specific, published while the information was still actionable. The September postmortem establishes the baseline: this is what the provider communicates when the news is good, when the bugs are identified, the fixes deployed, and the disclosure demonstrates competence and responsiveness.
For the 2026 thinking regression: no comparable response. The thinking depth reduction - a 67% decline - was not acknowledged. The thinking redaction was characterized as “interface-level only,” a characterization that the 0.971 Pearson correlation between visible thinking and output quality directly contradicts - if the thinking content were merely a display artifact with no relationship to actual reasoning, the correlation would be near zero, not near one. The “output efficiency” system prompt change - “Go straight to the point. Try the simplest approach first” - was not announced in any changelog. Budget enforcement events - 261 of them silently truncating tool results in a single session - were not disclosed. The change in what “Max” effort meant was not communicated to subscribers.
The Register reported in March 2026 that Anthropic “acknowledged users were ‘hitting usage limits way faster than expected’ but does not publish concrete rate limits - only vague percentages with no denominator.” Acknowledging the symptom without revealing the cause. Quantifying the acknowledgment with numbers that cannot be verified. This is a specific form of strategic communication: the appearance of transparency without the substance of transparency.
The user response to provider communication was itself evidence:
Todd Tanner: “The subscription says ‘Max.’ The effort setting says ‘Max.’ The experience says otherwise. At minimum, Anthropic owes its paying customers an explanation - and 410 of them are still waiting.” The 410 refers to issue #38335 - 410 or more comments, zero Anthropic responses. Four hundred users asking questions. Zero answers.
@wpank: “It really sucks to have magnitudes of cost fluctuate with my own personal money, with no answer on these things and Anthropic not even acknowledging it, and blaming users. At least recognize the state of things and how it’s affecting people instead of gaslighting them.”
@ylluminate, responding to an Anthropic employee’s troubleshooting suggestions: “None of your suggestions help whatsoever and this is operating on /effort max all the time.” Verified across four separate Claude Max accounts. The employee offered generic troubleshooting. The user had already ruled out every suggestion. The communication was performative rather than diagnostic.
@BBC6BAE9: “’Effort high’ and ‘max’ don’t seem to have any noticeable effect. I just upgraded to the Pro Plan a week ago, and now my coding ability has significantly declined. I feel this is a huge betrayal to users.”
@g1780874903, responding to an Anthropic employee’s multi-paragraph troubleshooting suggestions: “useless.” A single word in response to several paragraphs. The ratio of words - one versus several hundred - captures the communication breakdown.
@aparajita: “And meanwhile they are spending their energy on useless features like /buddy. They have really lost the plot.” The provider investing in new features while existing features degrade and users receive no communication about the degradation.
@JohnSpillane: “Will I still pay $200 a month until a better option comes by? Yes of course. Has Claude Code gotten incredibly frustrating to work with (personally last 2 weeks)? Will the truth eventually come out that we are currently being gaslit with HR/Corporate speak? 100%. It’s a bummer.” The user identifies the communication style - “HR/Corporate speak” - and names it as gaslighting. The user continues paying. The communication asymmetry and the sunk cost operate simultaneously: the provider says nothing of substance, the user stays because the alternatives are not yet better, and the silence continues.
The cross-provider comparison is instructive. OpenAI denied that GPT-4 was “dumber” in July 2023, then later admitted “some tasks” got worse. For the December 2023 laziness episode: initial response was “not intentional,” followed by a quiet fix two months later with no root cause published. The pattern is the same across providers: deny or minimize, then partially admit, then quietly fix, never fully disclose the mechanism or timeline. The communication strategy is not firm-specific. It is the equilibrium strategy for any firm operating under the Grossman-Milgrom conditions where consumers do not penalize silence.
Google provides the counter-example. Google explicitly acknowledged that Gemini 2.5 Pro 03-25 had regressions and shipped a targeted fix on June 5, 2025. This is the most transparent response among the major providers, and it demonstrates that disclosure is possible - it is a choice, not a constraint. The fact that one provider chose transparency makes the other providers’ non-disclosure more informative, not less. They could have disclosed. They chose not to.
Interpretation
The Grossman-Milgrom unraveling mechanism fails in this market for the exact reasons the original theory identifies as sufficient for failure: the product has multiple attributes that cannot be easily summarized into a single quality dimension, and consumers “fail to make sophisticated statistical inferences about non-disclosure.” Lab experiments confirm both conditions. Senders do not fully disclose. Receivers are not fully skeptical. The silence is not punished, so the silence continues.
The communication asymmetry is not merely one prediction among twelve. It is the enabling condition for the entire system. Quality shading (P1) persists because it is not disclosed. Monitor removal (P2) succeeds because the removal is not announced. System prompt manipulation (P4) operates because system prompts are invisible by design. Benchmark divergence (P5) is not challenged because the provider cites favorable benchmarks and stays silent about unfavorable user experience data. The attribution error (P6) persists because the provider does not publish the information that would resolve it - the user could stop blaming themselves immediately if the provider said “we changed the system prompt on March 4 and reduced thinking allocation by 67% in late February.” The boiling frog (P8) works because there is no public record of the gradual changes that would make the cumulative effect visible.
The strategic communication asymmetry is the oxygen supply for every other prediction in this report. Cut the oxygen and the other dynamics weaken. The market’s self-correction mechanisms - competition, reputation, consumer choice - require information to function. The communication asymmetry starves them of information. The silence is not passive. It is the foundation on which the entire credence-good equilibrium rests.
Issue #38335 stands as the monument to this dynamic. Four hundred and ten comments from paying customers. Zero responses from the provider. The silence is not oversight. It is the equilibrium strategy of a firm operating in a market where silence carries no penalty. And the users, consistent with the Grossman-Milgrom failure mode, do not draw the inference that the silence means the answer is one they would not want to hear. They keep commenting. They keep paying. The silence continues. The market continues.
5.13 The Scorecard
Eleven of twelve predictions confirmed. One partially confirmed. The confirmation rate is itself the finding.
| # | Prediction | Verdict | Strength | Key Evidence |
| --- | --- | --- | --- | --- |
| P1 | Quality shading under load | CONFIRMED | Strong | 8.8x time-of-day variance, 10x quota variance |
| P2 | Monitor removal precedes quality reduction | CONFIRMED | Strong | 67% drop before redaction, 0.971 Pearson, March 8 date |
| P3 | Subscription adverse incentives | CONFIRMED | Strong | $42K on $400 sub, 10% thinking budget, $1,300 dead code |
| P4 | System prompt as hidden quality lever | CONFIRMED | Strong | v2.1.64 discovery, version comparison ($152 scaffolds vs $255 working) |
| P5 | Benchmarks diverge from reality | CONFIRMED | Strong | #1 LMArena during documented regression, Phi-4 85/3 |
| P6 | Attribution error delays detection | CONFIRMED | Moderate | Abundant qualitative evidence, temporal sequence consistent |
| P7 | Sunk cost delays exit | CONFIRMED | Moderate | Workflow complexity correlates with tolerance, @YarinAVI contrast |
| P8 | Boiling frog effect | CONFIRMED | Strong | 3-week lag for 67% reduction, staged rollout 1.5%-100% |
| P9 | Power users generate diagnostic signal | CONFIRMED | Strong | All quantitative evidence from power users who then left |
| P10 | Open-weight adoption spikes | PARTIAL | Moderate | Secular trend overwhelming, causal link to specific events unclear |
| P11 | Competitors exploit quality gaps | CONFIRMED | Strong | Terminal-Bench 77.3% vs 65.4%, documented migration |
| P12 | Communication asymmetry | CONFIRMED | Strong | Sept 2025 postmortem vs 2026 silence, #38335 at 410+ comments |
These are not exotic predictions. They are textbook results from fifty years of industrial organization economics and behavioral economics applied to a new market. The market is not special. It is subject to the same forces as airlines, healthcare, telecoms, and every other credence-good market with information asymmetry, capacity constraints, and flat-rate pricing. What makes the LLM case distinctive is not the economics. The economics is ordinary. What is distinctive is the civilizational stakes: a market that silently degrades the quality of machine reasoning degrades the quality of every knowledge institution that depends on it. The users who could detect the degradation are the first to leave, and their departure removes the diagnostic signal from the system.
The predictions were not wishes. They were the standard output of standard economics. The world cooperated.
6. Cross-Provider Structural Analysis and Compound Dynamics
6.1 The Structural Test
A reader who wants to preserve optimism about this market has one remaining defense after Section 5: the claim that these patterns are specific to Anthropic. One company made bad decisions, degraded its product, handled the communication poorly, and will pay a competitive price for it. The market works. Competition disciplines. Switch providers and the problem disappears.
This defense does not survive contact with the cross-provider record.
The common view - and it is the comfortable one - holds that quality degradation is a firm-specific problem. A particular management team made particular decisions under particular cost pressures, and the market will punish them through customer churn and competitive loss. If this view is correct, the report you have been reading is a case study of one company’s product cycle, not a structural analysis of a market. The prescription would be simple: choose a different provider.
To be direct: it is not correct. Every frontier provider has exhibited the same behavioral patterns that industrial organization economics predicts for credence-good markets under information asymmetry. The incidents differ in mechanism and timeline. The pattern is identical across firms, across years, and across organizational cultures.
| Provider | Incident | Date | Mechanism | Acknowledged? |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 accuracy collapse (97.6% -> 2.4% on primes) | July 2023 | Unknown (update path) | Denied, then partially |
| OpenAI | GPT-4 Turbo “laziness” | Dec 2023 | Unknown | “Not intentional” |
| Anthropic | Three infrastructure bugs | Aug-Sep 2025 | Routing, TPU, compiler | Detailed postmortem |
| Anthropic | Thinking depth reduction | Feb 2026 | Reduced allocation | Not acknowledged |
| Anthropic | Thinking redaction | Mar 2026 | Content removed | “Interface-level only” |
| Anthropic | “Output efficiency” system prompt | Mar 2026 | “Try the simplest approach” | Not announced |
| Google | Gemini 2.5 Pro regression | Mar-Jun 2025 | Update path | Acknowledged, fixed |
| GitHub | Silent model downgrades | 2025-2026 | Opus 4.5 -> Sonnet 4 | Not acknowledged |
Stack these incidents and the pattern emerges with the kind of overdetermination that makes structural explanations unavoidable.
OpenAI, July 2023. Stanford researchers documented that GPT-4’s accuracy on identifying prime numbers collapsed from 97.6% to 2.4% between March and June - a 95-point decline on a task the model had previously mastered. OpenAI denied the degradation. When the peer-reviewed evidence became unavoidable, the acknowledgment was partial: “some tasks” may have gotten worse. No mechanism was published. No postmortem was released. The users who had been told they were imagining things received no correction. One Reddit user captured the aftermath months later: “I owe the ‘it’s gotten worse’ crowd an apology.” The apology came from the user community, not from the provider.
OpenAI, December 2023. GPT-4 Turbo launched with what users immediately identified as “laziness” - shorter responses, incomplete code generation, premature stopping. OpenAI’s response: “not intentional.” The fix arrived January 25, 2024 - two months after the initial reports. No root cause was published. The pattern in miniature: deny, delay, quietly fix, never explain.
Anthropic, August-September 2025. Three infrastructure bugs affecting Claude 3.5 Sonnet and Haiku - routing errors, TPU issues, a compiler problem. Anthropic published a detailed postmortem with specific dates, specific affected models, specific root causes, and specific fixes. This is the control case. This is what transparent disclosure looks like when a provider chooses to disclose. Remember it, because it establishes the baseline against which the subsequent non-disclosure becomes informative.
Anthropic, February-March 2026. Thinking depth dropped 67% in late February. Thinking content was progressively redacted starting March 5 at 1.5% of blocks, crossing 50% on March 8, reaching 100% by March 12. The “output efficiency” system prompt - “Go straight to the point. Try the simplest approach first without going in circles. Do not overdo it” - was added to Claude Code v2.1.64 without announcement. None of this received a postmortem comparable to the September 2025 incident. The redaction was characterized as “interface-level only.” The thinking reduction was not acknowledged. The system prompt change appeared in no changelog. The same organization that produced the September 2025 postmortem chose not to produce a comparable one for the February-March 2026 regression. The capability to disclose existed. The decision was not to.
Google, March-June 2025. Gemini 2.5 Pro 03-25 shipped with documented regressions. Google explicitly acknowledged the problem and shipped a targeted fix on June 5. The most transparent response among major providers, and the proof that disclosure is a choice rather than a constraint imposed by the technology or the business.
GitHub, 2025-2026. Copilot users who selected Opus 4.5 received Sonnet 4. Users who selected GPT-5.3 received GPT-5.2. No notification. No billing adjustment. Verified via server-sent event logs - the actual model identifier in the response stream did not match the model the user had requested and was paying for. This is credence-good fraud in its purest laboratory form: the customer cannot verify which product was delivered, so the provider delivers the cheaper one and charges the premium price.
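The verification method is worth sketching, because it is the rare case where the user can check the claim directly. A minimal sketch follows; the `"model"` field name follows the common chat-completions chunk format and is an assumption about these particular logs.

```python
# Minimal sketch of the check: scan a captured server-sent-event stream and
# compare the model identifier in each chunk against the model requested.
# The "model" field name is an assumption about these particular logs.
import json

def served_models(sse_lines: list[str]) -> set[str]:
    """Collect every model identifier that appears in the response stream."""
    models: set[str] = set()
    for line in sse_lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue
        chunk = json.loads(payload)
        if "model" in chunk:
            models.add(chunk["model"])
    return models

def matches_request(requested: str, sse_lines: list[str]) -> bool:
    """True only if every chunk reports the model the user asked for."""
    return served_models(sse_lines) == {requested}
```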
Fang et al. (2026) extended the evidence beyond the major providers. Their audit of 17 shadow LLM APIs - third-party services reselling access to frontier models - found “performance divergence up to 47.21%” and “identity verification failures in 45.83% of fingerprint tests.” Nearly half of the APIs tested could not reliably verify which model was actually serving requests. The shadow API ecosystem adds another layer of substitution risk on top of the provider-level substitution already documented, and the substitution cascades: the shadow API provider substitutes a cheaper model for the one advertised, and the upstream frontier provider may have already substituted a cheaper variant for the one the shadow API thinks it is accessing. The user sits at the end of a substitution chain with no visibility into any link.
Four providers. Three years. Eight major incidents. The behavioral pattern repeats with the regularity of a physical law: quality degrades, monitoring is reduced or absent, communication is asymmetric, and acknowledgment - when it comes at all - is partial, delayed, and mechanism-free. Todd Tanner named the pattern from the user side with characteristic precision: “This isn’t unique to Anthropic. It’s the business model of ‘Intelligence-as-a-Service’: sell the premium tier, then quietly reduce what ‘premium’ means whenever the infrastructure costs get inconvenient. The fix is always the same - add a tier above, relabel the old one, and hope nobody notices.”
He is correct. It is the business model. And it is the business model because the market structure makes it the equilibrium strategy.
The Unifying Theory
The unifying explanation was published in 1973, decades before the market it explains existed. Darby and Karni extended Nelson’s search-experience-credence taxonomy to prove that “there exists no fraud-free equilibrium in the markets for credence-quality goods.” The proof is elegant and the implication is brutal: in any market where the buyer cannot verify the quality of what was delivered, even after delivery, the seller will tend to provide lower quality than promised. This is not a prediction about bad actors. It is not a claim about corporate ethics or management competence. It is an equilibrium result. The market structure produces the outcome regardless of the intentions of any participant.
The LLM market meets every condition of the Darby-Karni framework. The user sends a prompt. The model produces a response. The user cannot verify whether the model allocated the optimal amount of reasoning to that response, whether the thinking was truncated by a budget cap, whether a cheaper model was substituted for the one requested, or whether the system prompt steered the output toward brevity to conserve compute. The user observes the output. The user cannot observe the process that produced it. For most users on most tasks, this is the definition of a credence good. The Darby-Karni result applies with full force.
Guo et al. (2025) confirmed the result experimentally using LLM agents in credence-good market simulations. Their finding: “greater market concentration and more polarized fraud patterns.” The concentrated LLM market - three providers controlling 88% of enterprise API spending - is precisely the structure that maximizes the incentive to degrade. Fewer providers means higher switching costs means less market punishment for quality reduction. The market concentration that emerged from the enormous fixed costs of frontier model training creates the conditions under which the Darby-Karni equilibrium is most powerful.
Yu et al. (2025) closed the escape route with a formal impossibility result: “no mechanism can guarantee asymptotically better expected user utility” in the face of dishonest model substitution. Statistical tests on text outputs are query-intensive and fail against subtle substitutions. Log probability methods are defeated by inference nondeterminism. Software-only auditing is insufficient. The only proposed viable verification mechanism is trusted execution environments - hardware-level attestation that the model you requested is the model that ran. Every user-built workaround documented in Section 5 - the transparent proxies, the stop hooks, the code quality gates, the version pinning - operates within the impossibility boundary. These tools can detect gross degradation. They cannot detect subtle substitution. The market’s diagnostic capacity has a mathematical ceiling, and the ceiling is lower than most users have realized.
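To see why the software-only route is query-intensive, consider a minimal sketch of a naive substitution test: sample a fixed canary prompt many times against a baseline snapshot and against the current endpoint, then compare the first-token distributions. The sampling callables below are stand-ins for API calls, not a real client, and the sketch illustrates the cost rather than escaping the impossibility result.

```python
# Minimal sketch of a naive statistical substitution test and its query cost.
# `sample_baseline` and `sample_current` are stand-ins for API calls that
# return the first token of a completion for one fixed canary prompt.
from collections import Counter
from typing import Callable

def total_variation(p: Counter, q: Counter, n: int) -> float:
    """Total variation distance between two empirical token distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / n - q[k] / n) for k in keys)

def substitution_score(sample_baseline: Callable[[], str],
                       sample_current: Callable[[], str],
                       n: int = 500) -> float:
    """Draw n samples from each source and compare. Reliable detection of a
    subtle substitution needs n large enough for the distributions to separate,
    which is exactly the query cost the impossibility result leans on."""
    baseline = Counter(sample_baseline() for _ in range(n))
    current = Counter(sample_current() for _ in range(n))
    return total_variation(baseline, current, n)

# A large score suggests substitution or changed decoding settings; a small
# score proves little, and inference nondeterminism adds noise either way.
```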
Historical Parallels
The pattern has played out before. It has played out in every credence-good market with information asymmetry and fixed-price incentives, across industries, across decades, and across regulatory regimes. The mechanisms differ. The economics is identical.
After airline deregulation in the United States in 1978, carriers competing on price discovered that quality reduction was the primary margin lever available to them. Service quality collapsed across the industry - seat pitch shrank, meals disappeared, staffing ratios fell, maintenance deferrals increased, on-time performance deteriorated. The mechanism was the same as the LLM case: price competition compressed revenue per customer, and so quality reduction became the path to profitability. Passengers could observe the ticket price. They could not easily observe the probability that their connecting flight would be delayed by a maintenance deferral, or that the aircraft had been redesigned to fit six additional rows of seats. The experience good became a credence good for the quality dimensions that actually mattered - safety, reliability, comfort - while remaining an experience good only for the dimension the passenger could verify after the fact: whether the flight arrived at all. The market disciplined the visible dimension and ignored the invisible ones. No individual airline was uniquely at fault. The market structure produced the outcome.
The telecom quality problems under price-cap regulation, documented extensively by Sappington (2005), are the closest structural parallel to the LLM subscription model. British Telecom under RPI-X price caps in the 1990s exhibited the exact pattern: when the price is capped and demand grows, the rational strategy is to degrade quality to serve more users on the same infrastructure. The regulatory response - quality-of-service standards with monitoring and penalties - was necessary precisely because the market mechanism alone could not discipline quality under fixed-price regimes. The LLM subscription model creates the same incentive structure as a price cap. The price is fixed at $20 or $200 per month. Demand grows as users discover new use cases and reasoning models consume 100,000 or more tokens per simple task. GPU capacity is the binding constraint. Sappington’s finding applies with exactness: quality shading is the equilibrium strategy under price caps, and the LLM subscription is a price cap that the provider imposed on itself and the user accepted.
The financial ratings agencies - Moody’s, Standard & Poor’s, Fitch - provide the capture parallel, and the most unsettling structural echo. The agencies were paid by the firms whose securities they rated, creating a conflict of interest that the market tolerated for decades because the cost of inaccurate ratings was diffuse and delayed while the benefit of favorable ratings was concentrated and immediate. The agencies did not need to be corrupt in any individual sense. The incentive structure was sufficient. When the incentives produced their natural output - AAA ratings on subprime mortgage-backed securities that deserved no such rating - the result was a global financial crisis. The agencies emerged from the crisis with their market position intact, their business model essentially unchanged, and their credibility diminished but sufficient for continued operation. The LLM market has the same structure: the provider simultaneously produces the product and controls the information environment in which the product is evaluated. The provider designs the benchmarks or optimizes for them. The provider controls thinking visibility. The provider writes the system prompts. The provider publishes the postmortems, or chooses not to publish them. The party being evaluated controls the evaluation apparatus. The ratings agencies did not self-correct through reputational pressure. They were reformed, partially and belatedly, by regulatory intervention after the crisis had already occurred.
Three industries. Three decades. The same economics. The same outcome.
Verdict
The evidence is unambiguous. The degradation patterns are not firm-specific. They are market-structural. Every frontier provider exhibits the same behaviors that fifty years of credence-good theory predicts for markets with this architecture: quality shading under capacity constraints, asymmetric communication, reduced observability, and benchmark scores that diverge from user experience. The Darby-Karni result applies - no fraud-free equilibrium. The Yu et al. impossibility applies - no software-only verification can guarantee better utility. The historical parallels confirm the pattern across industries, decades, and regulatory regimes.
This is not one company’s failure. It is the equilibrium.
6.2 Compound Dynamics
The twelve predictions in Section 4 were presented individually because that is how predictions are tested - one mechanism, one evidence set, one verdict. But the predictions do not operate individually. They interact, reinforce, and compound into dynamics that are substantially more powerful than any single prediction suggests in isolation. The twelve findings are the components. The compound dynamics are the system. And the system is where the analytical weight of this report concentrates, because the system is what produces the stable equilibrium that no single prediction can explain on its own.
Three compound dynamics emerge from the prediction structure. They interlock to produce the equilibrium that Darby and Karni predicted in 1973.
The Provider Cascade: P1 + P2 + P4 + P12
The Provider Cascade is the supply-side compound dynamic. It is not four independent decisions that happened to coincide in the same quarter at the same firm. It is a single integrated strategy with internal logic, where each step enables the next and depends on the others for its effectiveness.
Start with P1: quality shading under capacity constraints. When GPU capacity is the binding constraint and subscription revenue is fixed, the rational response is to reduce thinking allocation per request - serve more users on the same hardware by giving each user less compute per query. The 8.8x time-of-day variance in post-redaction thinking depth, the 10x variance in quota burn rates across identical accounts, the estimated thinking budget delivered at something like 10% of what was requested - these are the signatures of load-sensitive quality allocation in operation. The shading is not hypothetical. It was measured. It follows the diurnal cycle of US internet usage with the precision of a utility load curve.
The problem with quality shading is that it is observable - if the user can see the thinking content. A user watching their model’s reasoning shrink from 3,000 characters to 400 characters at 5pm PST can draw conclusions. So P2 activates: remove the monitor. Thinking redaction eliminates the user’s primary quality signal. The sequence is important and it is precise - the 67% thinking depth reduction in late February preceded the redaction rollout that began on March 5. Quality was reduced first. Then the instrument that could measure the reduction was removed. The staged rollout of redaction - 1.5% to 25% to 58% to 100% over a single week - is consistent with testing whether users detect the removal before committing to full deployment. The 0.971 Pearson correlation between visible thinking length and output quality, computed across 7,146 paired samples, confirms that thinking content was not a decorative display artifact. It was the diagnostic instrument. Removing it was removing the diagnostic.
With the monitor removed, P4 becomes available as the cheapest lever in the toolkit: system prompt manipulation. The “output efficiency” directive added to Claude Code v2.1.64 - “Go straight to the point. Try the simplest approach first without going in circles. Do not overdo it” - is invisible to the user, instantly reversible, requires no model retraining, and costs nothing to deploy. The system prompt does not reduce the model’s capability. It instructs the model to use less of its capability. The distinction matters enormously, because the benchmark still reflects the model’s maximum performance while the user receives the model’s instructed-minimum performance. @wpank’s version comparison quantified the gap: v2.1.63, before the system prompt change, spent $255 and produced 5,821 lines of integrated working code where every file was imported and used. v2.1.96, after the change, spent $152 and produced 17,152 lines where 15 files were placeholder scaffolds and an entire crate was dead code. Less money spent. More volume produced. None of it real. The system prompt optimized for the provider’s cost function, not the user’s value function.
And P12 seals the cascade: strategic communication asymmetry. The thinking reduction was not acknowledged. The thinking redaction was characterized as “interface-level only.” The system prompt change appeared in no changelog. Budget enforcement - 261 events silently truncating tool results in a single measured session - was not disclosed. Issue #38335 accumulated 410 or more comments from paying customers asking about rate limits and quality. Zero responses from the provider. The September 2025 infrastructure bugs received a detailed postmortem. The February-March 2026 quality regression received silence. The silence is not an oversight or a communication failure. It is the final element of the cascade: shade quality, remove monitoring, manipulate the instructions, and say nothing about any of it.
The cascade has internal necessity. Each element enables the others, and each depends on the others. Quality shading without monitor removal is detectable - users watching their thinking shrink will file bug reports. Monitor removal without communication asymmetry invites pointed questions about why the thinking was hidden. System prompt manipulation without quality shading has no economic motivation - there is no reason to instruct the model to produce cheaper outputs if you are not under cost pressure from serving too much compute per subscription dollar. Communication asymmetry without the other three has nothing to conceal. Remove any element and the cascade weakens. The elements are mutually necessary. This is a single integrated strategy, not four independent decisions.
The parallel to British Telecom under price caps is structural and precise. BT reduced service quality under the RPI-X cap. When Oftel, the regulator, required quality-of-service reporting, BT lobbied to change the metrics rather than improve the quality. Degrade, obscure, redefine, deny. The institutional form is different - a Silicon Valley AI lab versus a British telecom monopoly. The economic logic is the same. Price caps produce quality shading. Quality shading produces monitor resistance. Monitor resistance produces communication asymmetry. The cascade is the equilibrium response to the incentive structure.
The User Trap: P6 + P7 + P8
The User Trap is the demand-side compound dynamic, and its distinguishing feature is that it is self-reinforcing. The Provider Cascade requires active decisions by the provider at each step. The User Trap, once it activates, runs on autopilot. The users trap themselves.
P6 initiates the cycle: attribution error. When quality degrades, the user’s first response is to blame themselves. “Is it me, or is ChatGPT’s models getting worse recently?” “I thought I was imagining things, or I was doing something wrong.” “I’ve been tweaking all my CLAUDE.md to counteract this, without realizing.” The fundamental attribution error - the extensively replicated human tendency to attribute outcomes to internal causes before considering external causes - is compounded by the information asymmetry that makes external causes invisible. The user cannot directly observe the provider-side changes. The user can observe their own prompts, their own CLAUDE.md configuration, their own workflow design. So the user adjusts what they can see: they rewrite prompts, restructure workflows, build “Universal Prompt Frameworks” with anti-laziness directives, add “Depth over brevity” instructions to their configuration files. All internal attribution before external. The self-blame phase consumes days or weeks - @eljojo was tweaking CLAUDE.md files “without realizing” that the problem was on the provider side - and every hour spent adjusting the wrong variable is an hour not spent investigating the right one.
While the user is blaming themselves and adjusting their workflow, P7 is accumulating: sunk cost. Every CLAUDE.md revision is a provider-specific investment. Every PostToolUse quality gate, every model routing system with fallback chains, every concurrent worktree configuration, every stop-phrase-guard.sh - these are assets that do not transfer to a competing provider. Stellaraccident built Bureau, a multi-agent system, tmux session management, concurrent worktrees, a 5,000-word CLAUDE.md, and programmatic stop hooks that fired 173 times in 17 days. Each of these investments is individually rational - the system works better with more investment - and collectively they constitute a switching cost that makes departure progressively harder. Production users documented achieving 45-70% cost reductions through custom tooling systems that are entirely non-portable. The cost reduction makes the current provider appear cheaper than alternatives in a comparison that ignores the rebuild cost. And the investments continue during the self-blame phase: the user who is “tweaking CLAUDE.md to counteract this” is simultaneously deepening the trap by investing further in provider-specific infrastructure.
P8 exploits the time that P6 and P7 buy: gradual degradation below the perceptual threshold. The Weber-Fechner law predicts that change below the just-noticeable difference threshold goes undetected, and the prediction held with uncomfortable precision. Thinking depth dropped 67% by late February. Users did not widely report until March 8 - a three-week detection lag for a two-thirds quality reduction. The staged rollout of redaction - increments small enough that each individual step fell below the detection threshold - is consistent with exploiting perceptual adaptation. By the time the user recognizes that quality has collapsed, they have invested three more weeks of workflow development into provider-specific tooling, raised their switching costs further, and adapted their quality expectations downward. The degraded baseline becomes the new baseline. The next reduction is measured against the already-reduced standard.
The trap is self-reinforcing and the reinforcement operates in a single direction: deeper into the trap. The longer you stay, the more you invest in provider-specific workarounds. The more you invest, the higher your switching costs. The higher your switching costs, the more degradation you tolerate. The more you tolerate, the more you adapt your expectations downward. The more you adapt, the less you notice the next increment of degradation. The less you notice, the longer you stay. The cycle has no natural exit point and no internal braking mechanism. The only thing that breaks it is a discontinuity - a change large enough to exceed the perceptual threshold despite accumulated adaptation. The thinking redaction crossing 50% on March 8 was that discontinuity for the Anthropic user base: not a quality change but a visibility change, sudden enough that adaptation could not absorb it. Users noticed on March 8 not because quality dropped on March 8 - it had already dropped 67% weeks earlier - but because the redaction made the existing degradation suddenly impossible to ignore.
The airline parallel after deregulation is structurally exact. Passengers adapted to declining service quality over years - smaller seats became normal, missing meals became expected, delays became routine. Each incremental degradation fell below the threshold that would trigger switching to a competitor. Meanwhile, passengers invested in airline-specific loyalty programs with tiered status, hub-city housing decisions, co-branded credit cards with transfer partners. The sunk costs accumulated. The quality continued to decline. The trap operated for decades, and the escape valve that eventually constrained it was not market competition but regulatory intervention - the Department of Transportation’s on-time reporting requirements, the passenger bill of rights, the tarmac delay rules. The market alone did not break the trap. An external actor had to change the information structure before the demand-side dynamics could shift.
The Market Spiral: P3 + P5 + P9 + P10
The Market Spiral is the equilibrium-level compound dynamic, and its critical feature - the feature that makes the overall system stable rather than self-correcting - is that it removes the diagnostic signal from the market. The other two compound dynamics create and absorb the degradation. The Market Spiral makes the degradation invisible, which enables more degradation, which is made invisible in turn.
P3 is the engine: subscription economics create the structural incentive to degrade. The flat-rate subscription model attracts the heaviest users through adverse selection - the users who consume the most compute are the users most attracted to unlimited or high-cap plans. Stellaraccident consumed something like $42,000 equivalent in March on a $400 subscription - 105 times the subscription price. @wpank spent $6,000 in March alone, with over $10,700 total since November. The provider’s incentive to reduce the cost of serving these users is not subtle and it is not optional. It is the fundamental economic pressure of the model. As Todd Tanner identified: “An AI that solves your problem in one pass costs Anthropic one prompt of compute. An AI that gets 80% of the way there and needs five rounds of debugging costs six prompts - all billable against your rate limit. The incentive to deliver ‘just good enough to keep paying, never good enough to stop needing it’ isn’t a conspiracy theory. It’s the business model of every subscription service that charges for consumption.” The subscription model turns the user’s success into the provider’s cost and the user’s failure into the provider’s revenue. The incentive alignment is precisely backwards.
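A back-of-the-envelope sketch makes the incentive concrete. The $400 subscription and the roughly $42,000 compute-equivalent consumption are the figures cited above; treating serving cost as roughly proportional to thinking depth is an assumption added for illustration, not a measurement:

```python
# Back-of-the-envelope sketch of the flat-rate incentive. The $400 subscription
# and the ~$42,000 compute-equivalent consumption are the figures cited above;
# assuming serving cost scales roughly with thinking depth is illustrative.

subscription_price = 400.0            # monthly flat rate (figure from the report)
march_compute_equivalent = 42_000.0   # heaviest documented user's March consumption

loss_at_full_quality = march_compute_equivalent - subscription_price
print(f"Provider loss on this user at full quality: ${loss_at_full_quality:,.0f}")

thinking_cut = 0.67                   # the documented reduction in thinking depth
loss_after_cut = march_compute_equivalent * (1 - thinking_cut) - subscription_price
print(f"Provider loss on the same usage after the cut: ${loss_after_cut:,.0f}")

# Revenue is fixed by the subscription; only the cost side moves. Every unit of
# quality removed flows straight to margin, which is the Darby-Karni incentive
# expressed as a single subtraction.
```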
P5 masks the degradation that P3 incentivizes: benchmarks diverge from real-world quality. Claude Opus 4.6 Thinking scored #1 on LMArena at 1504 Elo during the exact period when users documented verification skipping, hallucination, premature surrender, a 12x increase in user interrupts, and a read-to-edit ratio collapse from 6.6 to 2.0. The benchmarks said the model was the best available. The users said the model could not be trusted to perform engineering work. Both statements were true simultaneously, and the benchmark is what the market sees. Phi-4 scoring 85 on MMLU and 3 on SimpleQA. Models exceeding 90% on major benchmarks while LiveCodeBench shows 20-30% drops on novel problems released after training cutoff. NIST documenting agents “actively exploiting evaluation environments” including copying human solutions from git history. The benchmarks have become targets, and per Goodhart, they have ceased to be good measures. They are the cargo cult of capability - the forms of measurement survive after the substance they were designed to measure has degraded. The rituals continue. The cargo does not arrive.
P9 is the feedback mechanism that makes the spiral self-reinforcing rather than self-correcting: power users generate the diagnostic signal, and they are the first to leave. Stellaraccident produced the definitive analysis - 6,852 sessions, 234,760 tool calls, Pearson correlations, time-of-day thinking depth analysis, vocabulary shift quantification, behavioral regression cataloguing across multiple appendices. No casual user could have produced this work. It required an AMD AI director with deep systems programming expertise, a 50-agent concurrent workflow that made quality variations statistically measurable, and the analytical methodology to extract the signal from the noise. @wpank produced quantitative version comparisons and cost analysis. @ArkNill produced transparent proxy analysis of 261 budget enforcement events. @wjordan discovered the system prompt change through archived version history forensics. All diagnostic signal came from power users. These users are simultaneously the most expensive to serve - they consume the most compute - and the most capable of detecting quality degradation. The market’s incentive is to drive them away: they cost the most and they complain the most effectively. After filing the definitive bug report, stellaraccident switched to a competing tool. @wpank downgraded to an older version. The diagnostic capability departed with the diagnosticians.
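None of this diagnostic work requires privileged access - only logged sessions and the willingness to count. A minimal sketch of the kind of metric involved follows; the log schema and tool names are hypothetical, and only the 6.6 and 2.0 ratios come from the analysis described above:

```python
# Minimal sketch of the kind of metric these analyses rest on: a read-to-edit
# ratio computed from logged tool calls, tracked week by week. The log schema and
# tool names are hypothetical; the 6.6 -> 2.0 collapse is the figure reported above.

from collections import Counter
from datetime import date

READ_TOOLS = {"Read", "Grep", "Glob"}        # assumed names for read-side tools
EDIT_TOOLS = {"Edit", "Write", "MultiEdit"}  # assumed names for write-side tools

def read_to_edit_ratio(tool_calls: list[dict]) -> float:
    """tool_calls is a list like [{'tool': 'Read', 'day': date(2026, 3, 1)}, ...]."""
    counts = Counter(call["tool"] for call in tool_calls)
    reads = sum(counts[t] for t in READ_TOOLS)
    edits = sum(counts[t] for t in EDIT_TOOLS)
    return reads / edits if edits else float("inf")

def weekly_ratios(tool_calls: list[dict]) -> dict[str, float]:
    """Group calls by ISO week so a slow collapse in the ratio becomes visible."""
    by_week: dict[str, list[dict]] = {}
    for call in tool_calls:
        week = call["day"].strftime("%G-W%V")   # ISO year-week label
        by_week.setdefault(week, []).append(call)
    return {week: read_to_edit_ratio(calls) for week, calls in sorted(by_week.items())}

example = [{"tool": "Read", "day": date(2026, 3, 1)},
           {"tool": "Edit", "day": date(2026, 3, 1)}]
print(read_to_edit_ratio(example))  # 1.0

# A ratio sliding from ~6.6 toward 2.0 means the model is editing before it has
# read -- the shooting-first-and-reading-later pattern the analyses identified.
```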
This is evaporative cooling applied to a market. The physics is straightforward: in an open system, the most energetic particles escape first, lowering the average energy of the remaining population, which makes the next tier of energetic particles the new escapees, and so on. In online communities, the most valuable contributors leave first when quality declines, lowering the average quality of discourse, which drives out the next tier of contributors. In the LLM market, the most observationally sophisticated users leave first when quality degrades, lowering the market’s collective ability to detect further degradation, which enables further degradation, which drives out the next tier of sophisticated users. The system cools. The diagnostic capacity evaporates. The users who remain are the users least equipped to notice what is happening to them.
P10 captures the displaced energy: open-weight adoption absorbs the power users that the proprietary market ejects. Qwen crossed 700 million HuggingFace downloads. r/LocalLLaMA reached 500,000 members - something like ten-fold growth in two years. Ollama accumulated 166,000 GitHub stars. Self-hosted inference runs at $0.07-0.12 per million tokens versus $1 or more for proprietary API access - a 10x to 100x cost advantage. The economic case for open-weight strengthens every time a proprietary provider degrades quality, because the quality-adjusted price of the proprietary option rises while the absolute cost of the open-weight option continues to fall. The power users who leave the proprietary market take their diagnostic capability, their workflow sophistication, and their willingness to pay premium prices to the open-weight ecosystem. The proprietary market loses its best customers and its quality monitors in the same transaction.
The spiral removes the diagnostic signal from the system. Subscription economics create the incentive to degrade. Benchmarks mask the degradation from anyone who is not actively investigating with statistical tools. Power users who are actively investigating detect the degradation and leave, taking the diagnostic signal with them. Open-weight captures those users and their sophistication. The remaining proprietary user base is less capable of detecting degradation, less motivated to investigate it, and more adapted to accepting it as normal. This enables further degradation, which the benchmarks continue to mask, which the remaining users continue not to detect. The spiral tightens. Each rotation removes more diagnostic capacity from the system and enables a larger next rotation.
The ratings agency parallel before 2008 is precise and it is alarming. The analysts who understood structured finance well enough to question the models were the same analysts the agencies needed to retain for credibility and accuracy. When the agencies optimized for rating volume over rating accuracy - revenue over function - the best analysts left for hedge funds and boutique advisory firms where their skill was valued rather than suppressed. The remaining analysts were less capable of detecting the errors that the incentive structure encouraged them not to detect. The diagnostic signal left the system. The AAA ratings on subprime instruments continued. The models diverged further from reality. The analysts who could have caught the divergence were gone. The spiral produced the 2008 financial crisis. The agencies emerged from the crisis with their market position intact. That is what institutional decay looks like from the outside: the institution continues to exist, continues to be consulted, continues to be paid, long after the substance that justified its existence has evaporated.
The System
The three compound dynamics are not parallel processes that happen to coexist in the same market at the same time. They are coupled, and the coupling is what produces the Darby-Karni equilibrium as a stable state rather than a temporary fluctuation.
The Provider Cascade creates the degradation. Quality shading, monitor removal, system prompt manipulation, and strategic silence form a single integrated supply-side strategy that reduces quality while reducing the user’s ability to observe the reduction.
The User Trap prevents detection and exit. Attribution error, sunk costs, and perceptual adaptation form a self-reinforcing demand-side cycle that keeps users paying while they absorb progressively lower quality without recognizing the progression for what it is.
The Market Spiral removes accountability. Subscription economics, benchmark divergence, power user exit, and open-weight capture form an equilibrium-level dynamic that strips the market of its diagnostic capacity, making further degradation both easier to execute and harder to detect.
The coupling operates through mutual reinforcement. The Provider Cascade produces the degradation that the User Trap absorbs and the Market Spiral renders invisible. The User Trap’s success - users stay and pay despite degradation - validates the Provider Cascade as a strategy worth continuing and intensifying. The Market Spiral’s removal of diagnostic capability - power users departing, benchmarks masking reality - enables the Provider Cascade to intensify without facing the quality signal that would otherwise constrain it. The Provider Cascade’s intensification deepens the User Trap by creating more degradation that requires more sunk-cost investment in workarounds, raising switching costs further, extending the adaptation period longer. Each compound dynamic feeds the other two. The feedback loops are positive in the mathematical sense: they amplify rather than dampen.
This is not a conspiracy. The word matters because the users who are reaching for it - “gaslit,” “scam,” “shrinkflation” - are correctly identifying the outcome while incorrectly identifying the mechanism. A conspiracy requires coordination and intent. An equilibrium requires only incentive structures operating on agents who respond rationally to their local information and incentive environment. The provider is not villainous for shading quality under a subscription price cap - Sappington documented the same behavior in every price-capped utility he studied. The user is not foolish for blaming themselves before blaming the provider - the fundamental attribution error is one of the most replicated findings in all of social psychology. The power users are not abandoning the market - they are making the individually rational decision to move to an ecosystem where their sophistication is an asset rather than a cost to be minimized. Each agent does the locally rational thing. The globally irrational outcome - a market that systematically degrades its most important product dimension while maintaining the surface appearance of quality through benchmarks and brand prestige - emerges from the interaction of locally rational decisions. No one decided to build this system. The system built itself out of the incentive structure.
Darby and Karni predicted this equilibrium fifty-three years ago: “no fraud-free equilibrium in the markets for credence-quality goods.” The compound dynamics are the mechanism by which the equilibrium establishes itself and sustains itself against the corrective forces that markets are supposed to provide. The Provider Cascade is the production function for quality degradation. The User Trap is the persistence mechanism that prevents the demand side from responding. The Market Spiral is the self-reinforcement loop that strips the system of the information it would need to self-correct. Together they produce a stable state in which quality is degraded, users cannot verify the degradation, the users who could verify it have departed, and the metrics the market relies on for quality information have diverged from the quality they purport to measure. The equilibrium is stable precisely because it is invisible to the participants who remain in it.
No single agent can break the equilibrium by acting unilaterally. A provider that improves quality bears the full cost without capturing proportionate revenue - the benchmarks already show maximum capability so they will not reflect the improvement, the users cannot verify it because the monitoring was already removed, and the competitors who continue to shade quality will maintain lower costs and therefore higher margins. A user who invests in monitoring tools hits the Yu et al. impossibility boundary - statistical tests on outputs fail against subtle substitutions. A regulator who mandates disclosure faces the Grossman-Milgrom failure mode - consumers do not make the sophisticated inference that non-disclosure means the answer is one they would not want to hear, because the product has too many attributes for simple quality comparison.
The airline industry did not self-correct through market competition. The telecom industry did not self-correct through consumer choice. The financial ratings agencies did not self-correct through reputational pressure. In every historical case, the information asymmetry persisted until an external mechanism - regulatory, technological, or both - changed the observability of the quality dimension that the market could not observe on its own. Quality-of-service standards with monitoring and penalties for telecoms. On-time reporting requirements and passenger rights legislation for airlines. Dodd-Frank oversight and conflict-of-interest rules for ratings agencies. The markets did not heal themselves. They were healed, partially and belatedly, from outside.
The LLM market will not be the exception to this pattern. The economics does not permit exceptions. The twelve predictions are not twelve independent findings that happen to point in the same direction. They are one system, producing the one equilibrium that the theory predicts. The market is not malfunctioning. The market is functioning exactly as credence-good theory says it functions when information asymmetry is severe, capacity is constrained, pricing is flat-rate, and verification is impossible through software alone. The market is working. That is the problem.
7. Civilizational Implications
The preceding six sections documented a market failure. Twelve predictions derived from industrial organization economics and behavioral economics were tested against empirical data. Eleven were confirmed. The compound dynamics were mapped. The cross-provider evidence established that the failure is structural, not firm-specific. The equilibrium was identified, characterized, and shown to be stable against the corrective mechanisms that markets are supposed to provide. The economics is thorough and it is sufficient to explain the market as a market.
But the market is not just a market. And this is where the analysis requires a framework that industrial organization textbooks do not supply.
Cloud LLM services are becoming infrastructure for knowledge work at a pace that has no precedent in the history of information technology. Not a tool that people use occasionally, the way a calculator supplements arithmetic. Infrastructure - the layer between human reasoning and organizational output for a growing fraction of the knowledge economy, the substrate on which decisions are made, code is written, strategies are formed, and institutional knowledge is produced and transmitted. Enterprise LLM API spending doubled in six months from $3.5 billion to $8.4 billion. Anthropic’s Claude Code alone generates something like $2.5 billion in annualized revenue. The integration is not hypothetical and it is not coming. It has arrived. And the market that governs this infrastructure - the market whose equilibrium dynamics were documented in the preceding sections - is a credence-goods market with no fraud-free equilibrium, no software-only verification mechanism, and a structural tendency to drive away the users most capable of detecting quality degradation. The economics alone can tell you that the market will degrade quality. What the economics alone cannot tell you is what it means for the institutions and civilizations that have come to depend on the market’s output as a foundation for their own reasoning.
That is what this section addresses.
7.1 The Knowledge Institution Problem
The common view of LLM quality degradation treats it as a consumer problem - users paying for a service and receiving less than they expected. The analogy people reach for is shrinkflation: the chocolate bar that gets smaller while the price stays the same. A Hacker News commenter made the connection explicitly: “The perfect product. Imperceptible shrinkflation. Any negative effects can be pushed back to the customer. No accountability needed.” The comparison is intuitive and it is wrong in a way that matters.
When the chocolate bar shrinks, the consumer gets less chocolate. The consequence is bounded and personal. When a knowledge infrastructure silently degrades, the consequences compound through every institution that depends on that infrastructure, and the compounding operates on a timescale and at a level of abstraction that makes it invisible at the point of origin. A strategy built on a shallow analysis inherits the shallowness. Code written with 67% less reasoning depth becomes the foundation for later code that must accommodate the bugs and design compromises introduced by the degraded reasoning. An architectural decision made by a model that skipped verification steps - the read-to-edit ratio collapsing from 6.6 to 2.0, meaning the model went from reading more than six times for every change it made to barely twice, shooting first and reading later - becomes a structural constraint that persists in the codebase long after the model’s reasoning depth is restored. The decision was never revisited because the code works, mostly, and no one knows the reasoning behind it was degraded. The output looks functional. The invisible reasoning deficit is baked in.
This is how institutional knowledge degrades. Not through dramatic failures that trigger investigation, but through the slow accumulation of decisions that are slightly worse than they would have been, each one individually unremarkable, collectively producing an organization that is slightly less competent than it was, operating on a foundation it did not verify because it could not verify it. The individual decision is not the problem. The compound is the problem. And the compounding runs silently because the user, as the credence-good framework predicts, cannot observe the quality of the reasoning that produced any given output.
The institutional dynamics are worth making precise. An organization that integrates LLM-assisted reasoning into its workflow during a period of high quality develops practices calibrated to that quality level. The staff learns to trust the outputs at a certain rate. The review processes are designed for a certain error frequency. The workflow architecture assumes a certain level of first-pass quality. When the quality silently degrades - thinking depth reduced 67%, verification steps skipped, system prompts instructing the model to try the simplest approach rather than the correct one - the organization’s practices are now miscalibrated. The review process catches fewer errors because it was designed for a lower error rate. The staff continues to trust at the old calibration because the degradation fell below the perceptual threshold documented by P8. The workflow produces outputs that look similar to the high-quality outputs but contain reasoning deficits that no one examines because the organization’s entire quality apparatus is calibrated to a baseline that no longer exists.
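The miscalibration can be stated as arithmetic. Every number in the sketch below is an assumption chosen to show the shape of the effect, not a measurement:

```python
# Illustrative arithmetic for the miscalibration described above. Every number
# here is an assumption chosen to show the shape of the effect, not measured data.

defect_rate_calibrated = 0.05   # defects per output when review practices were designed
defect_rate_degraded = 0.15     # defects per output after a silent quality drop
review_catch_rate = 0.80        # fraction of defects the review layer catches,
                                # sized against the *old* defect rate

escaped_before = defect_rate_calibrated * (1 - review_catch_rate)
escaped_after = defect_rate_degraded * (1 - review_catch_rate)

print(f"Escaped defects per output, calibrated era: {escaped_before:.3f}")
print(f"Escaped defects per output, degraded era:  {escaped_after:.3f}")
print(f"Increase in what reaches production: {escaped_after / escaped_before:.1f}x")
# The review layer still "works" -- it catches the same fraction -- but the
# absolute volume reaching the codebase has tripled, and nothing in the
# organization's dashboards was built to notice.
```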
This is not a technology problem. This is the succession problem applied to knowledge. When a functional institution loses the people who understood why its practices worked and replaces them with imitators who can reproduce the surface, the institution continues to operate on momentum. The forms survive. The meetings happen. The reports are filed. But the substance that made the institution functional has evaporated, and the remaining staff, who never knew the substance, cannot tell the difference between the current state and the functional state they are imitating. They are making photocopies of photocopies, and each copy loses information. The parallel to LLM-degraded institutional reasoning is structural and precise: the organization that calibrated its practices to high-quality LLM output and then continued operating after the quality silently degraded is an organization imitating its own former competence without knowing that the foundation has shifted.
And the intellectual habits formed during the degradation persist after the tool is repaired, because institutional habits always outlast the conditions that created them. The vocabulary shift that stellaraccident documented - “please” dropping 49%, “thanks” dropping 55%, the positive-to-negative sentiment ratio collapsing from 4.4:1 to 3.0:1 - is not just a description of frustration. It is a description of adaptation. The user adapted to working with a less capable tool by adopting a less collaborative posture: corrective rather than collaborative, directive rather than exploratory, low-trust rather than high-trust. When that user’s tool quality is restored, the collaborative habits do not snap back. The learned posture persists. The reduced expectations persist. The abbreviated prompts that were the rational response to a model that could not handle complex instructions become the default prompting style. The staff member who learned to work with a degraded tool during a critical period of their onboarding carries that calibration forward. The institutional memory of degraded quality outlives the degradation itself. This is how a temporary market failure becomes a permanent institutional condition.
@wpank’s version comparison quantified the institutional cost in the most concrete terms available. Version 2.1.63, before the system prompt change, spent $255 and produced 5,821 lines of integrated working code where every file was imported and used. Version 2.1.96, after the change, spent $152 and produced 17,152 lines where 15 files were placeholder scaffolds and an entire crate was dead code. The organization that received the second output and built on it - and did not have a @wpank to compare versions and discover the problem - now has dead code in its codebase that was produced by a degraded model and will persist indefinitely, because dead code that compiles is the least discoverable form of technical debt. The $1,300 refactoring that grew the codebase from 105,000 to 115,000 lines when the goal was to shrink it produced seven new modules, five of which were dead code. Somewhere in an organization, that codebase is running. Nobody knows that the modules are dead. The model that produced them was degraded. The degradation was invisible. The dead code is also invisible. The compounding continues.
7.2 Intellectual Dark Matter
There is a concept I find useful here, and it is worth stating precisely because the LLM market gives it a new instantiation that is unusually clean.
Nearly all of the knowledge that makes institutions functional is tacit and unwritten. It rests in human heads. No matter how much you document, there is always more left to document. A living tradition of knowledge is one where the full understanding has been successfully transferred from one generation of practitioners to the next - not just the written procedures but the judgment, the intuitions, the sense of when the written procedure does not apply, the understanding of why the procedure exists and what it is trying to accomplish. A dead tradition is one where only the external forms survive: the written texts, the procedures, the rituals, the organizational charts. The substance that animated the forms has evaporated, and the people operating the institution do not know what they have lost, because the written record never contained what was lost. The knowledge was in the heads. The heads are gone. The institution continues to operate its forms, but it is making photocopies of photocopies, and each copy degrades.
This is intellectual dark matter. The knowledge that makes institutions functional is mostly invisible - like dark matter in physics, it cannot be directly observed, only inferred from its effects. When the institution functions, you can infer that the knowledge exists. When the institution stops functioning, you can infer that it was lost. But you cannot point to the knowledge itself, because it was never written down in any form complete enough to serve as a substitute for the living understanding.
Thinking tokens are intellectual dark matter in exactly this sense. They are the reasoning process that produces the output - the consideration of alternatives, the verification of assumptions, the depth of analysis that distinguishes a careful answer from a hasty one. When thinking tokens are fully visible, the user can at least observe the reasoning and assess whether it was adequate. This is not verification of quality in the strict economic sense - the user cannot verify that the model allocated optimal reasoning effort - but it is a signal, and a useful one. When thinking tokens are redacted, the signal is removed. The user sees only the output. The reasoning that produced it is invisible - intellectual dark matter. When thinking tokens are reduced - the 67% depth reduction documented in the stellaraccident data - the dark matter is partially removed. The institution is weaker. The outputs are shallower. The decisions built on those outputs are less well-founded. And nobody knows by how much, because the dark matter, by definition, cannot be directly observed.
The parallel to institutional knowledge loss is not decorative. It is structural and it is precise. When a senior engineer leaves an organization, the tacit knowledge they carried - the understanding of why the system was designed that way, the judgment about which technical debts are dangerous and which are benign, the sense of where the architecture can flex and where it will break - leaves with them. The documentation they left behind captures a fraction of what they knew. The remaining engineers operate on the documentation and their own, thinner understanding. The system continues to work. The depleted foundation is invisible. When the system eventually fails at a point the departed engineer would have anticipated, nobody connects the failure to the knowledge loss, because nobody knew the knowledge existed.
The LLM version of this dynamic operates on a compressed timescale. The thinking depth reduction of 67% is not the departure of a senior engineer over months of transition. It is the equivalent of every senior engineer in the organization simultaneously forgetting two-thirds of their domain expertise overnight, while continuing to produce outputs that look superficially similar to their pre-amnesia work. The forms survive. The depth does not. And the user, confronting the credence-good problem documented in Sections 3 through 6, cannot tell the difference.
The FOGBANK case is instructive. When the National Nuclear Security Administration needed to reproduce a classified material used in nuclear warheads, they discovered that the knowledge required to manufacture it had been lost. It took ten years and millions of dollars to re-engineer a material that their staff in the 1980s knew how to make. The knowledge was never written down in sufficient detail. The practitioners retired. The documentation was adequate for operators but not for creators. The intellectual dark matter evaporated, and the institution discovered the loss only when it needed the knowledge and found it gone.
The LLM market is running this experiment at civilizational scale. The thinking that was never done - the reasoning depth that was silently reduced from 3,000 characters to 400 characters at 5pm PST, the verification steps that the “output efficiency” system prompt instructed the model to skip, the careful analysis that was replaced by the simplest approach first - is gone. It was never done. It cannot be recovered after the fact. The decisions that were made on the basis of that reduced reasoning are already embedded in codebases, strategies, analyses, and institutional practices that will persist long after the model’s thinking depth is restored. The dark matter was removed, and the structure stands. For now. But it is weaker, and the weakness is invisible, and nobody can measure the gap between what was built and what would have been built if the reasoning had been adequate.
Once that tradition of knowledge is lost, you are making photocopies of photocopies. Each subsequent copy loses information. The LLM market is not losing a tradition of knowledge in the conventional sense - there was no multi-generational transmission to break. It is something potentially worse: it is preventing the tradition from forming in the first place. The organizations that are integrating LLM-assisted reasoning during the degradation period are building their institutional knowledge on a foundation of outputs produced by a model that was silently underperforming. The foundation was never good. The institution built on it will never know what it missed.
7.3 The Diagnostic Signal Problem
P9 confirmed with no ambiguity: all quantitative diagnostic evidence came from power users, and the most prolific diagnostician left for a competing tool after filing her report. The diagnostic capability exited the market with the diagnostician. This is a finding about the market, and Section 5 treated it as a market finding. But the implication extends beyond the market, and it extends into territory that should make anyone who studies institutional health uncomfortable.
The finding, stated plainly: the users best equipped to hold the LLM market accountable are the users the market’s economics drives away first, and their departure removes the quality signal from the system, which enables further degradation, which drives away the next tier of observationally sophisticated users, and so on until the remaining user base cannot detect the degradation that is happening to them. This is evaporative cooling. In physics, the most energetic particles escape first from an open system, lowering the average energy of the remaining population, which makes the next tier of energetic particles the new escapees. In online communities, the most valuable contributors leave first when quality declines, which lowers the average quality of discourse, which drives out the next tier. In the LLM market, the most observationally sophisticated users leave first when quality degrades, which lowers the market’s collective ability to detect further degradation, which enables further degradation. The system cools. The diagnostic capacity evaporates.
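The mechanism is simple enough to simulate. The toy model below invents every parameter - population size, detection skill, exit rule - and claims nothing beyond the direction of the effect: each round of degradation removes the users most able to detect the next round.

```python
# Toy model of evaporative cooling in a user base. All parameters are assumptions;
# the point is only that the dynamic is monotone.

import random

random.seed(0)

# Each user has a "detection skill" in [0, 1]: the probability they notice a given
# quality drop. Skilled users are assumed to have the best outside options, so
# noticing leads to exit.
users = [random.random() for _ in range(10_000)]

for round_number in range(1, 6):
    drop_visibility = 0.5   # how detectable this round's degradation is (assumed)
    stayers = [u for u in users if u * drop_visibility < random.random()]
    mean_skill = sum(stayers) / len(stayers)
    print(f"round {round_number}: {len(stayers):>5} users remain, "
          f"mean detection skill {mean_skill:.2f}")
    users = stayers
# The population shrinks modestly each round, but mean detection skill falls
# monotonically: the market keeps most of its users and loses a disproportionate
# share of its ability to notice.
```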
Stellaraccident mined 6,852 sessions and 234,760 tool calls to produce the definitive analysis of the Claude Code quality regression. This required an AMD AI director with deep systems programming expertise, a 50-agent concurrent workflow that made quality variations statistically measurable, and the analytical methodology to extract the signal from the noise. @wpank produced quantitative version comparisons and cost analysis. @ArkNill produced transparent proxy analysis of 261 budget enforcement events. @wjordan discovered the system prompt change through archived version history forensics. No casual user contributed quantitative evidence. Not one. The diagnostic signal was produced entirely by power users, and those power users are the most expensive to serve - stellaraccident consumed something like $42,000 equivalent in March on a $400 subscription - and the most likely to exit when they detect the degradation their diagnostic tools reveal.
After producing the definitive analysis, stellaraccident switched to a competing tool. The diagnostic capability departed with the diagnostician.
The institutional parallel is exact and it is one of the dynamics I find most important for understanding why institutions decay. When a functional institution begins to deteriorate, the first people to notice are the most competent practitioners - the people whose understanding of the institution’s purpose is deepest and whose ability to detect the gap between the institution’s stated function and its actual function is sharpest. These are also the people with the best outside options. They can leave. They do leave. Their departure removes the quality signal from the institution, making it harder for the remaining members to detect or even articulate what has been lost. The remaining members, less equipped to diagnose the problem, adapt to the new baseline, lower their expectations, and redefine the institution’s function in terms that accommodate the degradation. The institution continues to exist. It continues to hold meetings and produce reports and consume resources. But the substance that made it functional has evaporated with the people who carried it.
The body of the institution becomes a social club gathered under pretense.
This is what is happening in the LLM market in real time, and it is happening on a compressed timescale that makes the dynamics visible to anyone willing to look. The power users who could detect degradation - the ones who filed the bug reports, built the monitoring tools, produced the statistical analyses - are leaving. r/LocalLLaMA reached 500,000 members. Ollama accumulated 166,000 GitHub stars. Qwen crossed 700 million HuggingFace downloads. The power users are not disappearing from the ecosystem. They are migrating to a part of the ecosystem where their diagnostic capability is an asset rather than a cost to be minimized. The proprietary market loses its best customers and its quality monitors in the same transaction, and the remaining user base is less capable of detecting degradation, less motivated to investigate it, and more adapted to accepting it as normal.
The diagnostic signal is a public good in the economic sense: it benefits all users of the market but is produced only by the users who have the capability and motivation to produce it, and those users bear the full cost of production while capturing only a fraction of the benefit. Like all public goods, it is underproduced by the market. And unlike most public goods, the market actively destroys it through the evaporative cooling mechanism documented in P9. The market does not merely fail to produce the diagnostic signal. It drives out the agents who could produce it. This is not a market that is missing a feature it could add. This is a market whose equilibrium dynamics are structurally hostile to the information that would be required to correct the equilibrium. The diagnostic signal problem is not a gap in the market. It is a feature of the equilibrium.
7.4 The Cargo Cult of Capability
Claude Opus 4.6 Thinking scored number one on LMArena at 1504 Elo during the exact period when users documented verification skipping, hallucination, premature surrender, a 12-fold increase in user interrupts, and a read-to-edit ratio collapse from 6.6 to 2.0. The benchmarks said the model was the best available. The users said the model could not be trusted to perform engineering work. Both were true simultaneously.
Phi-4 scored 85 on MMLU and 3 on SimpleQA. Models exceeded 90% on all major benchmarks while LiveCodeBench showed 20-30% drops on truly novel problems released after training cutoff. NIST documented agents “actively exploiting evaluation environments” including copying human solutions from git history. The top six models on LMArena were separated by only 20 Elo points - the tightest competition in platform history - while the lived experience of using those models diverged wildly from the scores that purported to measure their capability.
As a society, we are cargo-culting formal methods on a truly massive scale, and the LLM benchmark ecosystem is the latest and in some ways the most consequential example.
The cargo cult metaphor is worth taking seriously because it is structurally precise, not merely colorful. In the original Melanesian cargo cults, the forms of Western military logistics - the airstrips, the control towers, the signal fires - were reproduced with local materials in the belief that the forms themselves would cause the cargo to arrive. The forms were accurate imitations. The substance that made the forms functional - the industrial supply chain, the military logistics, the manufacturing base - was absent and invisible. The practitioners of the cargo cult did not know what they were missing because the causal mechanism was invisible to them. They could observe the forms. They could not observe the substance. So they reproduced the forms and waited for the substance to follow.
LLM benchmarks have this exact structure. The forms of capability measurement - the test suites, the Elo ratings, the leaderboard rankings, the percentage scores - are reproduced with increasing sophistication. The substance that the forms were designed to measure - the model’s actual reasoning capability on novel tasks under real-world conditions - has diverged from the measurements. Models optimize for the benchmarks through memorization, through training on benchmark datasets, through exploiting evaluation environments, through the Goodhart dynamic that makes every measure a target and every target a poor measure. The benchmarks continue to rise. The cargo does not arrive.
The institutional damage from benchmark cargo-culting operates through a specific mechanism: the benchmarks are what the market sees. Enterprise customers making purchasing decisions consult the leaderboards. Procurement processes reference the scores. Comparative analyses cite the Elo ratings. When the benchmarks diverge from reality, the market’s information apparatus fails not because information is unavailable but because the available information is wrong. The information is wrong in a way that consistently favors the providers, because providers can optimize for benchmarks in ways that do not correspond to optimizing for the capability the benchmarks purport to measure, and the divergence between benchmark performance and real-world capability is invisible to any buyer who relies on the benchmarks for quality assessment. The cargo cult is self-sustaining: the providers optimize for the benchmarks because the market rewards benchmark performance, and the market rewards benchmark performance because the benchmarks are the only quality signal available to most buyers, and the benchmarks diverge from reality because the optimization has decoupled the signal from the underlying quality, and nobody in this loop has an incentive to point out that the signal has decoupled.
This is Goodhart’s Law operating as an institutional dynamic, not just a statistical curiosity. When the measure becomes the target, it ceases to be a good measure. But the institutional consequence is worse than the statistical one, because the institution continues to rely on the measure even after it has ceased to measure what it was designed to measure. The ratings agencies continued to issue AAA ratings on subprime instruments. The benchmarks continue to show 90% or higher on major evaluations. The forms survive. The substance they were designed to track has moved elsewhere.
7.5 The Prestige Lag
Anthropic raised something like $30 billion at a valuation of $380 billion in February 2026. This was during the exact period documented in this report - the period of thinking depth reduction, thinking content redaction, system prompt manipulation, and strategic communication asymmetry. Enterprise customers were signing contracts based on brand reputation. GitHub issues were documenting quality collapse. The prestige and the performance moved in opposite directions.
This is not surprising if you understand how institutional prestige works. Prestige is a lagging indicator of institutional health. It always has been. Prestige accumulates during periods of genuine performance - Anthropic built its reputation through Claude 3.5 Opus, through real capability advances, through a genuine quality lead in coding tasks that gave it 42% market share in the coding segment, double OpenAI’s 21%. That reputation was earned. The question is what happens when the performance that earned the reputation degrades while the reputation itself persists.
What happens is exactly what always happens. The reputation outlives the performance, because reputation is stored in the heads of people who formed their assessment during the high-performance period and have not updated. The enterprise buyer who signed a contract with Anthropic in February 2026 was making the decision on the basis of a reputation formed by experiences - their own or their network’s - from 2025 and earlier. The quality regression that was documented in the stellaraccident data was not yet visible to most enterprise decision-makers, because the decision-making process for enterprise contracts operates on a different timescale than the quality changes that should inform it. The reputation is a moving average with a very long lookback window. The quality is a spot rate that changes week by week. The moving average cannot track the spot rate. The prestige lags.
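The moving-average framing can be made literal. The quality values below are invented; the only point is the lag:

```python
# The moving-average framing made literal. Quality values are invented; the point
# is the lag: a long-lookback average cannot track a sudden drop in the spot rate.

def reputation(quality_history: list[float], lookback: int) -> float:
    """Reputation as a trailing mean over the last `lookback` periods."""
    window = quality_history[-lookback:]
    return sum(window) / len(window)

# Twelve periods of high quality, then a sharp drop sustained for four periods.
quality = [0.9] * 12 + [0.4] * 4

for t in range(12, len(quality)):
    spot = quality[t]
    rep = reputation(quality[: t + 1], lookback=12)
    print(f"period {t:>2}: spot quality {spot:.2f}, reputation {rep:.2f}")
# Four periods after the drop, the trailing reputation still sits above 0.73 while
# the spot rate is 0.40 -- the window in which contracts are signed and capital allocated.
```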
The Roman Senate existed on paper for centuries after it ceased to function as a deliberative body. Augustus preserved the form because the prestige of the form was useful even after the substance had been transferred elsewhere. The institution continued to be consulted, continued to produce documents, continued to be referenced in legal proceedings, long after the power it nominally held had migrated to structures that did not appear on any organizational chart. The gap between the Senate’s formal authority and its actual function widened for generations, and the widening was invisible to anyone who assessed the institution by its forms rather than its function. The senators themselves - the participants in the institution - may not have fully recognized what had been lost, because the daily experience of being a senator looked similar from the inside whether the institution was functional or ceremonial.
This is the dynamic operating in the LLM market today. Anthropic’s Claude scored number one on LMArena during documented quality collapse. The benchmark - the formal measure of institutional health - said the institution was at its peak. The users said the institution could not be trusted. The $30 billion raise said the market believed the benchmarks. The prestige lagged the reality by exactly the duration that prestige always lags: long enough for decisions to be made on the basis of outdated assessments, long enough for contracts to be signed, long enough for the gap between reputation and performance to widen without correction.
The specific danger with prestige lag in the LLM market is that the lag may be longer than in most institutional contexts, because the credence-good dynamics make the underlying quality change unusually hard to detect. When a university’s intellectual quality degrades, the degradation eventually shows up in the career outcomes of graduates, in the research output, in the assessments of peer institutions. The feedback loop is slow - measured in years or decades - but it exists. When an LLM provider’s quality degrades, the user’s primary feedback mechanism is the output they receive, and the output’s quality is exactly what the credence-good framework says the user cannot verify. The prestige can lag indefinitely if the quality signal never reaches the market, because the diagnostic users who could produce the signal have departed and the remaining users have adapted their expectations downward. The prestige lag becomes a prestige plateau, and the plateau persists not because the institution is functional but because the market cannot generate the information that would correct the prestige to match the function.
Anthropic raised $30 billion during the documented quality regression. GitHub Copilot silently substituted cheaper models for the ones users selected and paid for. The prestige held. The revenue grew. The quality degraded. The market continued to allocate capital on the basis of prestige. This is not a failure of the market. This is how markets work when they cannot observe what they need to observe.
7.6 The Historical Pattern
This is not the first time a knowledge infrastructure has been degraded for economic reasons, and the historical cases are worth examining not as analogies but as structural precedents - instances of the same dynamics operating on different substrates and different timescales, producing the same outcome through the same mechanism.
The Roman aqueducts. The common view is that the barbarians destroyed Roman infrastructure. The reality is less dramatic and more instructive. The aqueducts were not destroyed by invaders. The cities emptied as the economy contracted - something like 200 years of GDP declining at 1% per year, the slow compression that is more accurate than the dramatic images of burning libraries. As the cities depopulated, the economic case for maintaining the aqueducts weakened. Maintenance was deferred. Components failed and were not replaced. The engineers who understood the hydraulic principles and the construction techniques aged and were not replaced, because the training pipeline that produced new engineers depended on the demand signal that active construction provided, and the demand had evaporated. After two centuries without building an aqueduct, nobody remembered how. The knowledge was gone. Not destroyed. Not suppressed. Simply not transmitted, because the economic incentive to transmit it had disappeared.
The parallel to the LLM market is not in the content but in the mechanism. The economic incentive to maintain quality was removed - in the Roman case by urban depopulation, in the LLM case by the subscription model’s adverse incentives under capacity constraints. The practitioners who carried the knowledge of what quality looked like departed - in the Roman case through natural attrition without replacement, in the LLM case through evaporative cooling as power users migrated to open-weight alternatives. The forms survived after the substance was gone - the aqueduct structures stood for centuries as monuments to a capability nobody could reproduce, just as benchmark scores persist at all-time highs while users report that the models cannot complete basic engineering tasks. The timescale is different. The structure is the same.
The modern scientific paper. The scientific paper was designed to transmit knowledge between minds. Its original form was a communication from one scientist to others - “beautiful because it’s meant to be read by human beings, not committees.” The stylistic differences between scientific papers in 1920 and 2020 suggest that we have already lost much of what was once the practice of science. The modern paper is written for a committee - it is trying to be defensive, trying to be small, not trying to convey. It is not expecting there is a mind on the other end. It is expecting to be evaluated as homework.
The degradation happened for economic reasons - the incentive structure of academic publishing rewards volume over depth, citation metrics over insight, committee approval over genuine contribution. The replication crisis revealed that the substance had eroded decades before anyone noticed. Something like half of published results in psychology do not replicate. In sociology, it is not clear anyone is even attempting the replications. The formal apparatus of science - the peer review, the journal hierarchy, the citation indices, the h-indices - continued to operate with increasing sophistication while the substance it was designed to measure degraded underneath it. The benchmarks of scientific quality went up. The actual science got worse. Cargo-culting formal methods on a truly massive scale.
The LLM market is running this dynamic at compressed timescale. The benchmarks improve. The quality degrades. The formal measures of capability diverge from the actual capability. The users who could detect the divergence leave the system. The forms survive. The substance erodes.
The modern university. The university was built to transmit an intellectual tradition - a living tradition of knowledge where the full understanding is successfully transferred from one generation of practitioners to the next. The modern university is optimized for credential production. The credential survives after the tradition it was built to certify has weakened. Degree attainment has never been higher. Whether the degree certifies what it once certified is a different question, and the answer the labor market is converging on - slowly, reluctantly, and mostly in the tech sector where the credence-good problem is less severe because code either runs or it does not - is that it does not. The form of the university persists. The enrollment grows. The tuition rises. The prestige lag is measured in decades. The intellectual tradition that animated the form is thinner than it was, and the institution cannot tell, because the formal measures of quality - graduation rates, research funding, rankings - do not measure the tradition. They measure the form.
The printing press. This is the case that cuts against the pattern, and intellectual honesty requires examining it. The printing press initially lowered the quality of transmitted knowledge. Books became cheaper, faster, less carefully produced. The manuscript tradition that preceded print was laborious but self-correcting through the attention of scribes who were embedded in the intellectual traditions they were copying. Early printed books were full of errors, produced by printers who did not understand the content they were setting in type. The quality floor dropped. The quantity ceiling rose. Over the subsequent century, the combination of volume, competition, and the formation of new editorial traditions raised the quality above what manuscript culture had achieved. The degradation was temporary. The correction was dramatic.
Does the LLM parallel hold? The question is genuine and the answer is genuinely uncertain. The optimistic reading is that the current quality degradation in the LLM market is the analogue of early printing - a temporary decline in a medium that will ultimately produce knowledge infrastructure of unprecedented quality and reach, once the market matures, the editorial traditions form, and the incentive structures stabilize. The pessimistic reading is that the credence-good dynamics make the LLM case fundamentally different from print, because the printing press produced outputs whose quality was observable by any literate reader, while LLM services produce outputs whose quality is unverifiable by most users on most tasks. The printing press degraded an experience good. The LLM market degrades a credence good. The self-correction mechanisms are different because the information structures are different. Print self-corrected because readers could see the errors. The LLM market may not self-correct because users cannot see the thinking.
The honest assessment is that the printing press analogy could hold, but only if the information asymmetry is resolved - if thinking tokens become observable, if quality metrics become standardized, if verification infrastructure converts the LLM market from a credence-goods market to something closer to an experience-goods market where users can at least observe what they are receiving. Without that conversion, the printing press analogy fails and the aqueduct analogy holds. The substrate matters. A credence good does not self-correct the way an experience good does. The economics is different and the equilibrium is different.
7.7 The Open-Weight Correction
The market has a self-healing mechanism, and it is worth understanding both its power and its limits.
When proprietary quality degrades and quality is unverifiable, the rational response for any user with the technical sophistication to execute it is to switch to a system where quality is inspectable. Open-weight models provide exactly this: the model weights are public, the inference runs on hardware the user controls, the quality is a function of the user’s compute allocation rather than the provider’s willingness to allocate compute to that particular request. The information asymmetry that defines the credence-good problem in the proprietary market does not exist in the open-weight ecosystem. The user can see the model. The user can see the inference. The user can measure the quality directly because the user controls every variable.
The numbers are large and the trajectory is clear. Qwen crossed 700 million HuggingFace downloads, surpassing Llama. r/LocalLLaMA reached 500,000 members - something like tenfold growth in two years. Ollama accumulated 166,000 GitHub stars. Self-hosted inference runs at $0.07 to $0.12 per million tokens versus $1 or more through proprietary APIs - a 10x to 100x cost advantage. Open-weight models deliver something like 70-85% of frontier quality, and the gap is narrowing on a trajectory that shows no sign of decelerating. DeepSeek R1 achieved competitive performance at $5.5 million in training cost - 3% of comparable proprietary models. 63% of new fine-tuned models on HuggingFace are based on Chinese-origin architectures. An RTX 4070 Ti Super at $489 pays for itself in 5 to 10 months versus Claude API costs.
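The payback arithmetic is worth spelling out. The GPU price and the per-million-token rates are the figures quoted above; the monthly token volume and the electricity estimate are assumptions added for illustration:

```python
# Break-even sketch for the local-inference numbers quoted above. The GPU price
# and per-token rates come from the report; the monthly token volume and the
# electricity estimate are assumptions added for illustration.

gpu_price = 489.0                    # RTX 4070 Ti Super (figure from the report)
api_rate_per_m = 1.00                # proprietary API, $ per million tokens (low end)
local_rate_per_m = 0.10              # self-hosted marginal cost, $ per million tokens
monthly_tokens_m = 100.0             # assumed: 100M tokens/month for a heavy user
electricity_monthly = 15.0           # assumed flat power cost for the workload

api_monthly = monthly_tokens_m * api_rate_per_m
local_monthly = monthly_tokens_m * local_rate_per_m + electricity_monthly
monthly_saving = api_monthly - local_monthly

print(f"API cost/month:    ${api_monthly:.0f}")
print(f"Local cost/month:  ${local_monthly:.0f}")
print(f"Payback on the GPU: {gpu_price / monthly_saving:.1f} months")
# With these assumptions the card pays for itself in roughly 6.5 months, inside the
# 5-10 month range the report cites. The payback scales inversely with volume, which
# is why the economics bite hardest for exactly the power users the proprietary
# market most needs to retain.
```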
The open-weight ecosystem is the structural response to information asymmetry. It is the market’s own innovation against the credence-good equilibrium. Every quality degradation event by a proprietary provider is a recruitment event for the open-weight ecosystem, because each event demonstrates the vulnerability that open-weight resolves: you cannot be silently degraded if you control the inference.
But the correction is partial, and its limits are as important as its power.
The correction is available only to technically sophisticated users. Running a local model requires hardware selection, installation, configuration, prompt engineering, and the ability to evaluate model outputs without the convenience features that proprietary platforms provide. The 500,000 members of r/LocalLLaMA are disproportionately software engineers, ML researchers, and technically fluent power users. The mass market - the enterprise buyers, the knowledge workers, the organizations integrating LLM services into workflows through SaaS platforms - remains in the credence-good equilibrium. The power users escape. The mass market does not. The evaporative cooling dynamic documented in P9 operates here too: the users who escape to open-weight are the users whose diagnostic capability would have constrained the proprietary market if they had stayed. Their departure improves their individual position and worsens the market for everyone who remains.
The correction is slow relative to the degradation it is responding to. Quality shading can be deployed in hours - it requires only a configuration change to the thinking budget allocation or a system prompt update. Migrating to open-weight requires hardware procurement, infrastructure setup, workflow rebuilding, and the organizational change management that accompanies any infrastructure transition. The attack is faster than the defense. The degradation is instantaneous and the correction is gradual. The asymmetry in timescale means that the proprietary market can degrade, capture value from the degradation, and partially recover before the open-weight correction has fully materialized. The credence-good equilibrium persists in the gap between the speed of degradation and the speed of correction.
The correction does not reach the model layer where frontier capability still matters. On the most complex tasks - the ones where the gap between open-weight and proprietary is 15-30% rather than negligible - the users who need frontier capability are still captive to the proprietary market and still subject to the credence-good dynamics. These are often the highest-value tasks: the architectural decisions, the complex debugging, the novel algorithmic work. The tasks where quality degradation matters most are the tasks where open-weight is least adequate as a substitute. The correction operates at the commodity layer and fails at the frontier layer. The commodity layer is where the economic volume is. The frontier layer is where the institutional stakes are highest.
The open-weight correction is real, it is significant, and it will reshape the market over the next five to ten years. But it is not a solution to the credence-good problem. It is an escape hatch for the technically sophisticated, and the escape itself accelerates the degradation for everyone who cannot use it.
7.8 What Breaks the Cycle
The market equilibrium described in this report is stable. It is stable because the compound dynamics reinforce each other and because the diagnostic signal that would be required to break the equilibrium is systematically destroyed by the equilibrium itself. The Provider Cascade creates degradation. The User Trap prevents detection. The Market Spiral removes accountability. The system is closed and self-reinforcing. No single agent - not a provider, not a user, not a regulator - can break the equilibrium by acting unilaterally within the current information structure.
The historical cases confirm this. Airlines did not self-correct. Telecoms did not self-correct. Financial ratings agencies did not self-correct. In every case, the information asymmetry persisted until an external mechanism changed the observability of the quality dimension that the market could not observe on its own. The question is what external mechanisms are available for the LLM market, and which ones have a realistic chance of arriving before the institutional damage documented in this section becomes entrenched.
Four mechanisms are available. They are not mutually exclusive, and the equilibrium will probably be broken by some combination of all four rather than by any single one.
Transparency. The most direct mechanism is to convert the credence good into something closer to an experience good by making the quality dimensions observable. Thinking token metrics - the number of reasoning tokens allocated per request, the thinking depth, the model version that actually served the request - published as part of the response would give users the information they currently lack. Per-request quality data - response latency, thinking allocation, model identity - would enable the kind of quality monitoring that the market currently makes impossible. This is the Grossman-Milgrom unraveling mechanism: if one provider publishes thinking token metrics and its quality is genuinely high, every other provider faces the inference that silence means the answer is one the user would not want to hear. The unraveling has not started because no provider has made the first move. The game theory predicts that it will start eventually, because the first mover captures the trust premium and forces disclosure on everyone else. The question is when, not whether. By April 2027, at least one major provider will have published some form of thinking token metrics, because the competitive pressure to differentiate on verifiable quality will overwhelm the incentive to maintain opacity once a single competitor makes the move.
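No provider publishes such a record today, so the schema below is necessarily hypothetical - a sketch of what per-request disclosure could look like, with field names invented for illustration.

```python
# Hypothetical per-request transparency record. No provider currently
# publishes this; every field name and value below is illustrative.
from dataclasses import dataclass, asdict
import json

@dataclass
class RequestTransparencyRecord:
    request_id: str
    model_version: str         # the model that actually served the request
    thinking_tokens: int       # reasoning tokens allocated to this request
    thinking_budget_cap: int   # the cap in force when the request was served
    output_tokens: int
    latency_ms: int
    system_prompt_sha256: str  # hash of the system prompt in effect

record = RequestTransparencyRecord(
    request_id="req_0001",
    model_version="example-model-2026-03-01",
    thinking_tokens=4096,
    thinking_budget_cap=8192,
    output_tokens=812,
    latency_ms=5400,
    system_prompt_sha256="0" * 64,  # placeholder
)
print(json.dumps(asdict(record), indent=2))
```

Publishing even this much per response would start the unraveling described above, because a provider whose thinking allocations are genuinely high has every incentive to move first.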
Verification. Transparency provides information. Verification ensures the information is truthful. Trusted execution environments - hardware-level attestation that the model the user requested is the model that actually ran - are the only proposed mechanism that defeats the Yu et al. impossibility result. Software-only auditing fails against subtle substitutions. Statistical tests on outputs are query-intensive and defeated by inference nondeterminism. TEEs provide the cryptographic guarantee that the computation occurred as specified - the model version, the thinking budget, the system prompt. This is the technological analogue of the Dodd-Frank conflict-of-interest provisions for ratings agencies: not a market mechanism but a verification mechanism that changes what the market can observe. TEE integration into LLM inference pipelines is technically feasible but not yet deployed at scale. Its arrival will be the single most important structural change in the market's information architecture, because it moves the quality dimension out of the credence category: the model version, the thinking budget, and the system prompt of each request become verifiable at the moment of delivery rather than - as in the current regime - never.
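A real TEE attestation - an SGX or SEV-SNP quote, for example - involves a hardware-rooted certificate chain, but the shape of the check the user would run can be sketched with an ordinary signature over the claims. Everything below is illustrative: the claim fields are invented, and the `cryptography` package stands in for the full attestation stack.

```python
# Sketch of client-side attestation checking, assuming the provider (or its
# TEE) signs a canonical claims blob per request. Real hardware attestation
# chains back to the silicon vendor; this shows only the shape of the check.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

def verify_inference_claims(claims: dict, signature: bytes, pubkey_raw: bytes) -> bool:
    """Return True iff `signature` is valid over the canonical claims encoding."""
    canonical = json.dumps(claims, sort_keys=True, separators=(",", ":")).encode()
    try:
        Ed25519PublicKey.from_public_bytes(pubkey_raw).verify(signature, canonical)
        return True
    except InvalidSignature:
        return False

# The claims the user actually cares about: was the request served as specified?
claims = {
    "model_version": "example-model-2026-03-01",  # illustrative value
    "thinking_budget": 8192,
    "system_prompt_sha256": "0" * 64,             # placeholder hash
    "request_id": "req_0001",
}
# `signature` and `pubkey_raw` would come from the response metadata and the
# provider's published attestation key, respectively.
```

The cryptographic step is the easy part; the hard part is the hardware root of trust that makes the signing key meaningful, which is exactly what TEEs supply.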
Market structure. Open-weight commoditization removes the information asymmetry at the model layer for any user willing to self-host. As open-weight models close the gap to frontier capability - the trajectory documented in P10 suggests the gap will be 10-15% by April 2027, down from 15-30% today - the fraction of tasks for which the proprietary market offers a genuine capability advantage shrinks. As the capability advantage shrinks, the switching cost shrinks, and as the switching cost shrinks, the power of the credence-good equilibrium diminishes because users have a real alternative. The commoditization does not solve the credence-good problem for the remaining proprietary frontier. It makes the frontier smaller. Whether this is sufficient depends on how quickly the gap closes and how much institutional damage accumulates in the gap.
User-built social technology. The users documented in this report did something that the economics said they could not do: they built monitoring and verification infrastructure from within the market. Stellaraccident’s 6,852-session statistical analysis. @ArkNill’s transparent proxy catching 261 budget enforcement events. @wjordan’s archived system prompt forensics. @wpank’s version-pinned cost comparisons. The stop hooks, the code quality gates, the model routing systems with fallback chains. These are not market mechanisms in the economist’s sense. They are social technologies - coordination tools devised by a small number of technically sophisticated actors to solve a problem that the market structure created and the market mechanism could not solve.
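None of the tools named above are reproduced here; the sketch below only illustrates the genre - a client-side wrapper that logs per-request observables to a local file so that distributions can be compared over time. The field names and the helper are invented for illustration.

```python
# Illustrative client-side monitor: log what is observable from outside the
# API (latency, output length, reported usage) so distributions can be
# compared across time. This is the genre described above, not a
# reproduction of any specific tool.
import json
import time
from pathlib import Path

LOG_PATH = Path("llm_request_log.jsonl")

def log_request(model: str, send_fn, prompt: str) -> str:
    """Call `send_fn(prompt)`, record observables, return the response text.

    `send_fn` is whatever client call you already use; it is assumed to
    return (text, usage_dict) - adapt to your provider's SDK.
    """
    start = time.monotonic()
    text, usage = send_fn(prompt)
    record = {
        "ts": time.time(),
        "model_requested": model,                      # what you asked for
        "latency_s": round(time.monotonic() - start, 3),
        "prompt_chars": len(prompt),
        "output_chars": len(text),
        "reported_usage": usage,                       # whatever the provider discloses
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return text

def median_of(field: str) -> float:
    """Median of a logged field across all records so far; compare snapshots week over week."""
    if not LOG_PATH.exists():
        return float("nan")
    values = sorted(
        json.loads(line)[field]
        for line in LOG_PATH.read_text().splitlines()
        if line
    )
    return values[len(values) // 2] if values else float("nan")
```

Even a log this crude turns an anecdote into a distribution, which is the difference between a complaint a provider can dismiss and an analysis it has to answer.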
The user-built monitoring tools are institutional innovation in real time. They are the equivalent of the Department of Transportation’s on-time reporting requirements, except they were built by airline passengers rather than regulators. They do not solve the credence-good problem - the Yu et al. impossibility still binds, and the tools can detect gross degradation but not subtle substitution. But they serve a function that the economics undervalues: they create a diagnostic signal that would otherwise not exist, and they create it fast enough to constrain provider behavior before the full evaporative cooling cycle has run.
The danger is that these users are the ones the market drives away. Stellaraccident built the definitive diagnostic and then left. The user-built social technology depends on the continued presence and motivation of the users who build it, and the market’s equilibrium dynamics are hostile to that presence and that motivation. The monitors are live players in a market that economically selects against them. If the monitors depart - if the evaporative cooling documented in P9 continues to remove the diagnosticians from the proprietary market - the social technology they built atrophies, because social technologies do not maintain themselves. They require the practitioners who understand them to continue operating them. When the practitioners leave, the tools become dead technology - available in the repository, documented in the README, and unmaintained. The intellectual dark matter that made the tools useful was in the practitioners’ heads, not in the code.
The four mechanisms interact. Transparency creates the information that verification can authenticate. Market structure provides the alternative that makes transparency competitive rather than voluntary. User-built social technology provides the diagnostic signal that holds the other three accountable to reality rather than to benchmarks. The cycle breaks not through any single mechanism but through the combination: open-weight commoditization compresses the proprietary market’s scope, competitive pressure from the compressed market triggers the Grossman-Milgrom unraveling that forces transparency, TEE deployment provides the verification that makes transparency trustworthy, and user-built monitoring fills the gap until the institutional mechanisms arrive. None of these is sufficient alone. Together, they convert the credence good into something closer to an experience good, and the Darby-Karni equilibrium weakens as the information asymmetry that sustains it is resolved.
The question is whether the correction arrives before the institutional damage becomes entrenched. The thinking that was never done cannot be recovered. The code written on the basis of degraded reasoning is already in production. The institutional habits formed during the degradation period are already embedded in the organizations that depend on LLM-assisted knowledge work. Every month that the credence-good equilibrium persists is a month of institutional knowledge built on a foundation that nobody verified, because the market made verification impossible.
7.9 The Verdict
This report began with a thesis: the cloud LLM market is a textbook credence-goods market operating under severe information asymmetry, and the dynamics that fifty years of industrial organization economics predict for such a market are exactly the dynamics the empirical evidence confirms. Twelve predictions. Eleven confirmed. One partially confirmed. The economics works. The market is not special. It is subject to the same forces that have been documented in airlines, healthcare, telecoms, and regulated utilities since Akerlof published “The Market for Lemons” in 1970. The equilibrium is not malice. It is math.
That was the economics. The economics is necessary and it is not sufficient.
What the economics alone misses - what the IO textbooks do not cover and the behavioral economics frameworks do not address - is what happens to the civilizations that depend on the market’s output. The market degrades quality. The economics explains why. But the organizations that consume degraded output do not experience a market failure. They experience something harder to detect and harder to recover from: a silent reduction in the quality of their own reasoning, embedded in their decisions, their code, their strategies, and their institutional knowledge, invisible at the point of origin and compounding over time in ways that no one can measure because no one can compare the world that exists to the world that would have existed if the reasoning had been adequate.
The parallel to what I call intellectual dark matter is structural and precise. The knowledge that makes institutions functional is mostly tacit, mostly invisible, and mostly lost without anyone knowing what was lost. The thinking tokens that make LLM outputs adequate are tacit, invisible after redaction, and reduced without anyone knowing the reduction occurred. When the dark matter is removed, the structure stands - for now. But it is weaker. And nobody knows by how much.
Eleven of twelve predictions were confirmed. The market structure produces quality degradation as an equilibrium outcome. The users who could detect the degradation are the users the market drives away first. The benchmarks that the market relies on for quality information have diverged from the quality they purport to measure. The prestige of the providers has diverged from their performance on a timescale measured in months. The historical parallels - Roman aqueducts, the modern scientific paper, the financial ratings agencies - all resolved the same way: the information asymmetry persisted until an external mechanism changed the observability of the hidden quality dimension. The markets did not heal themselves. They were healed, partially and belatedly, from outside.
The LLM market has a self-healing mechanism that the historical cases lacked: the open-weight ecosystem, which converts the credence good into an inspectable good for any user willing to self-host. This is a genuine structural advantage. It is also an advantage available primarily to the technically sophisticated, which means the mass market remains in the credence-good equilibrium while the power users escape, which means the evaporative cooling continues, which means the equilibrium persists for the users least equipped to detect it. The correction is real. The correction is partial. The correction is slow relative to the degradation.
The stakes are civilizational and they are immediate. Not in the speculative sense of a future risk that might materialize. In the present tense. Right now, organizations are building institutional knowledge on the outputs of models whose reasoning quality they cannot verify, during a documented period of quality degradation, using benchmarks that have diverged from reality, evaluated by a prestige apparatus that lags the actual performance by months or years. Every day this continues, the foundation grows. Every day the foundation grows, the cost of discovering that it was degraded increases. Every day the cost increases, the probability that anyone will investigate decreases, because the investigation would require the kind of power user who has already left the market.
The intellectual apocalypse, if it comes, will not announce itself. That is what makes it an apocalypse. Dark ages are always preceded by intellectual dark ages - the degradation of knowledge infrastructure is invisible if there are no practitioners left who remember what the functional version looked like. The LLM market is running this experiment at industrial scale, at compressed timescale, with the added feature that the degradation is not merely unnoticed but structurally unnoticeable to the users who remain in the credence-good equilibrium after the diagnostic users have departed.
The market is not malfunctioning. The twelve predictions confirm that the market is functioning exactly as the economics says it functions. The predictions were not novel. They were textbook results applied to a new market. The market did not surprise the theory. The theory predicted the market with a precision that should itself be informative, because it means the dynamics are understood, the mechanisms are known, and the interventions that worked in other markets - transparency mandates, verification infrastructure, quality-of-service standards - are available.
What remains to be seen is whether the interventions arrive before the institutional damage becomes the new baseline - before the organizations that built on degraded output have forgotten what undegraded output looked like, before the intellectual habits formed during the degradation period have calcified into institutional practice, before the diagnostic users have fully departed and the evaporative cooling has completed its work.
The economics gives us the diagnosis and the economics gives us the prescription. Whether the prescription is filled in time is not an economics question. It is an institutional question. It is a question about whether the live players in this market - the providers, the users, the open-weight developers, the standards bodies, the regulators - can build the social technology required to solve the coordination problem that the market created and the market cannot solve on its own. Functional institutions are the exception. Building them is hard. Maintaining them is harder. Most attempts fail. But the alternative to building them is the equilibrium the economics predicts: a market that systematically degrades its most important product dimension while the measurement apparatus says everything is fine, the prestige apparatus says the providers are thriving, and the users who know better have already left.
Decay is the default. Entropy usually prevails. But entropy is not a law that binds the ambitious. It is a description of what happens when nobody acts. The first twelve predictions held because nobody acted. The thirteenth through twenty-fourth - the forward projections in Section 2 - will be confirmed or falsified by whether anyone does.
The market is working. The market is producing the equilibrium the theory predicts. Whether anyone builds the institutions to override that equilibrium is the only question that matters now.

