Intelligence Curves and Diminishing Returns

Scaling laws describe how model performance tends to improve as resources increase. Intelligence curves describe how the size of those improvements changes as investment grows. This distinction matters because many teams assume that capability increases in a simple and predictable way. In practice, performance gains often slow down as models get larger and training becomes more expensive. This is known as diminishing returns.

This section introduces diminishing returns as a practical planning concept. It helps organisations set realistic expectations, allocate budgets responsibly, and select deployment strategies that produce measurable outcomes. In Stage 3, teams will design real workflows and choose model configurations. A clear understanding of intelligence curves supports better trade-offs between capability, cost, and operational reliability.

2.1 Why Improvement Slows Down

2.1.1 The Intuition Trap: Linear Thinking

A common assumption in technology planning is linear growth. Linear growth means that if a certain investment produces a certain level of improvement, repeating the same investment should produce a similar improvement again. This assumption often feels reasonable because many familiar business levers behave in a roughly proportional way over short time horizons. Adding headcount can increase output capacity. Increasing marketing spend can increase reach. Incremental upgrades to software infrastructure can improve performance by predictable margins. These experiences create an intuition that progress is steady and that returns scale in a consistent pattern.

Large language model scaling follows a different dynamic. When the main development inputs increase, including compute, data, and parameter capacity, performance often improves. However, improvement is not proportional to investment. The improvement gained from each additional unit of investment tends to decline as scale increases. Early increases can produce visible gains because the system moves from low capability to moderate capability and begins to capture common patterns more reliably. Later increases operate in a different region of the capability curve, where the model already performs well on many tasks and the remaining weaknesses are narrower, rarer, and more difficult to eliminate.

This shift changes what an “upgrade” means. Early upgrades can reduce common errors, improve coherence, and noticeably strengthen general task performance. Later upgrades often target subtler limitations. These limitations may involve harder reasoning chains, more complex constraint tracking, improved robustness under ambiguity, and better behaviour on edge cases that appear less frequently. Because the remaining weaknesses are less common and harder to address, larger increases in resources may be required to achieve smaller gains. The gains also become less obvious in casual use, which is why careful measurement and task-specific evaluation become more important in later stages of scaling.

This pattern is known as diminishing returns. Diminishing returns means that improvement continues, but each additional investment delivers a smaller incremental improvement than earlier investments. The concept is not unique to AI. It appears in many systems where early investments resolve broad limitations and later investments focus on difficult refinements. In the context of language models, diminishing returns is an essential planning concept because it affects how organisations interpret progress, how they justify additional investment, and how they decide whether to pursue further scaling or focus on other levers such as workflow design, knowledge grounding, and governance controls.

2.1.2 What Diminishing Returns Means in Model Development

Diminishing returns describes a pattern in which additional investment continues to produce improvement, yet the size of improvement per incremental investment declines over time. In the context of model scaling, the “resources” being increased usually include compute, training data, and parameter capacity. As these resources increase, a model can improve across many tasks. The critical point is that the next increase tends to deliver a smaller performance gain than the previous increase. Progress remains possible, but progress becomes increasingly expensive.

A useful way to picture this behaviour is through the shape of the performance curve. Instead of a straight line that rises at a constant rate, performance often follows a curve that rises quickly at first and then begins to flatten. Early investments produce visible jumps because they address broad limitations. Later investments produce subtler improvements because the remaining limitations are narrower, less frequent, and more difficult to resolve through scaling alone. This flattening curve is a practical representation of diminishing returns.

As scaling continues, three dynamics commonly appear. First, the easiest gains are captured early. Early improvements often reflect the model learning common language structures, basic reasoning templates, and widely occurring patterns in data. These are the high-frequency behaviours that dominate many evaluations and everyday interactions. Once these behaviours become strong, the remaining errors shift toward more challenging cases.

Second, the remaining weaknesses become harder to address. These weaknesses tend to occur in situations that require longer chains of reasoning, careful tracking of multiple constraints, robust interpretation under ambiguity, or accurate handling of rare edge cases. They are also more sensitive to context quality and task framing. Because these weaknesses are not driven by simple gaps in general capability, removing them requires more targeted investment and stronger supporting conditions.

Third, improvement requires larger increases in resources to produce smaller marginal gains. This is a central operational implication. A modest percentage gain on a benchmark or a modest reduction in error rate may require a disproportionately large increase in compute and data, along with careful balancing of training conditions. This creates a widening gap between effort and visible improvement.

This pattern appears across many model families and evaluation settings. It reflects a general property of scaling-based progress in complex systems. Diminishing returns does not imply that scaling stops producing benefits. It describes how the efficiency of investment changes as models become more capable. The model can continue to improve, yet the benefit achieved per unit cost declines, which makes planning, measurement discipline, and prioritisation increasingly important as scale increases.
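The flattening curve described above can be made concrete with a small numerical sketch. It assumes a power-law relationship between compute and error rate, which is one common way to model such curves. The function names and the constants (scale 0.5, exponent 0.3) are illustrative assumptions, not fitted values from any real model.

```python
# Illustrative sketch only: the curve shape and constants are assumptions.

def error_rate(compute, scale=0.5, exponent=0.3):
    """Hypothetical power-law curve: error falls as compute rises."""
    return scale * compute ** (-exponent)


def marginal_gains(doublings):
    """Error reduction obtained from each successive doubling of compute."""
    gains = []
    compute = 1.0
    for _ in range(doublings):
        before = error_rate(compute)
        after = error_rate(compute * 2)
        gains.append(before - after)
        compute *= 2
    return gains


if __name__ == "__main__":
    for step, gain in enumerate(marginal_gains(6), start=1):
        print(f"doubling {step}: error falls by {gain:.4f}")
```

Under these assumptions every doubling still reduces error, but each reduction is smaller than the one before it, while each doubling costs twice as much as the previous one.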

2.1.3 Why Later Improvements Are Harder

Later improvements in model performance are harder to achieve because the training process has already captured many of the common patterns that drive early progress. In earlier stages of scaling, models learn high-frequency structures in language and task behaviour. These include core grammar and syntax, common forms of instruction-following, typical document structures, and widely occurring reasoning templates. Because these patterns appear frequently in training material and evaluation tasks, improvements in these areas can emerge relatively quickly as resources increase.

As the model becomes more capable, the nature of the remaining errors changes. Errors begin to cluster in areas that are less frequent, more nuanced, and more sensitive to context. These are not problems that can be resolved by learning another common pattern. They often require deeper generalisation and more precise control over reasoning and behaviour. Five categories illustrate why later-stage improvement becomes more demanding.

First, remaining errors often involve more complex reasoning across multiple steps. Multi-step reasoning requires the model to maintain intermediate states, track dependencies between steps, and preserve consistency over longer chains of logic. Mistakes can occur when the model loses track of a condition introduced earlier, confuses the order of operations, or fails to preserve a constraint while generating later parts of an answer. These failures are difficult because they depend on maintaining coherent structure over time, not simply recalling a fact or pattern.

Second, later errors frequently involve ambiguous instructions. Ambiguity is common in real communication. Users often omit context, mix objectives, or express requirements imprecisely. Handling ambiguity requires judgement about what the user likely means, when to request clarification, and how to proceed without introducing unsupported assumptions. This judgement is harder than completing a well-defined task because it requires balancing multiple plausible interpretations and choosing a response strategy that remains safe and aligned.

Third, remaining errors often involve robustness when information is incomplete. Many tasks present partial data, missing fields, conflicting statements, or incomplete documents. Robust behaviour requires the model to detect gaps, avoid over-committing to uncertain claims, and ask for the right missing inputs. This is challenging because the model must generate coherent outputs while also recognising when the evidence is insufficient. It requires discipline in uncertainty handling, which is not automatically reinforced by simple pattern learning.

Fourth, later errors often involve rare situations and edge cases. Rare cases matter because they are frequently where risk concentrates in enterprise settings. These cases appear less often in training data, which means the model has fewer examples from which to learn stable behaviour. Edge cases can also combine multiple constraints in unusual ways, making them harder to generalise from typical examples. Improving performance on rare cases therefore requires more targeted learning signals and more careful evaluation than improving performance on common cases.

Fifth, later improvements often involve calibration of confidence and uncertainty. Calibration refers to how well the model’s expressed confidence matches the reliability of its output. Poor calibration can lead to confident statements that are weakly supported or cautious statements when evidence is strong. Improving calibration requires the system to detect uncertainty, represent it consistently, and communicate it in a controlled manner. This is difficult because confidence is expressed through language, and language can appear certain even when underlying evidence is missing. It also requires decision rules for when to stop, when to ask for clarification, and when to escalate.
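Calibration can be checked numerically if each output is given a confidence score. The sketch below computes expected calibration error, a standard summary of the gap between stated confidence and observed accuracy. It assumes the team has already assigned each output a numeric confidence between 0 and 1 and a correct/incorrect label. How those scores are obtained is outside this sketch, and the function name and bin count are illustrative choices.

```python
# Minimal sketch, assuming numeric confidence scores (0 to 1) are available.

def expected_calibration_error(confidences, correct, bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group outputs into equal-width confidence bins.
    buckets = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * bins), bins - 1)
        buckets[index].append((conf, ok))

    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many outputs it holds.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near zero means stated confidence tracks observed accuracy. A large value signals over-confidence or under-confidence that casual review would probably miss.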

These categories share a common characteristic. They involve subtle judgement, longer reasoning chains, and heavier reliance on context. They also tend to be less frequent and more variable, which makes them harder targets for improvement through general scaling alone.

As models scale, the focus of improvement shifts from broad competence to difficult refinements. Broad competence refers to the ability to handle common tasks in a generally correct and coherent way. Refinement refers to reducing the remaining failure modes that appear in complex, ambiguous, or rare scenarios. Refinements require more resources because they target smaller regions of the overall error distribution. Early scaling reduces large clusters of common errors. Later scaling attempts to reduce smaller clusters of specialised errors. This distributional shift changes the economics of progress. The training process must allocate more effort to reduce a smaller portion of remaining errors, which increases the resource cost of each incremental improvement.

2.1.4 Measurement and Visibility

Improvement can appear to slow for a second reason that is distinct from diminishing returns. Later improvements are often less visible without structured measurement. As models become more capable, the remaining gains tend to concentrate in narrower areas of behaviour. These gains can be meaningful, yet they do not always show up in casual day-to-day use. The difference between “visible improvement” and “measured improvement” becomes more important as capability increases.

When an organisation upgrades from a small model to a medium model, the change is often immediately noticeable. Many baseline weaknesses improve at once. Responses become more coherent, instruction-following becomes more stable, and common errors occur less frequently. Users notice this because it affects a large percentage of everyday tasks. The improvements are distributed broadly across writing quality, comprehension, and general task handling, which makes them easy to observe in informal testing.

When an organisation upgrades from a large model to a slightly larger model, improvements often appear in specific categories rather than across the full range of tasks. Examples of category-specific improvements include better handling of long-context prompts, stronger structured planning over multiple steps, more consistent adherence to complex constraints, and improved performance in specialised domain tasks that require careful interpretation. These improvements are harder to observe casually because they are not triggered by every interaction. They appear under particular conditions, such as when the prompt is long, when documents are complex, or when tasks involve subtle ambiguity and constrained reasoning.

In professional settings, informal impressions can therefore become unreliable. If teams evaluate a model upgrade using a small set of routine prompts, they may conclude that nothing has changed. They may also focus on tasks that are already easy for both models, where the difference is small. In these cases, improvement can exist while remaining invisible to casual observation. The challenge is not that progress is absent. The challenge is that progress becomes more conditional, and its visibility depends on whether evaluation probes the behaviours that actually changed.

This is why Stage 3 emphasises evaluation discipline as part of deployment practice. Evaluation discipline refers to the use of structured measurement rather than casual testing. It requires organisations to define what matters in their environment and to evaluate those behaviours explicitly. This includes deciding which task categories carry the most risk, which failure modes are unacceptable, and which outputs must meet strict standards for traceability, formatting, or compliance alignment.

A disciplined measurement framework should reflect the organisation’s own workflows and risk tolerance. Public benchmarks can provide a broad reference point, yet they rarely represent internal documents, organisational terminology, permission constraints, or governance requirements. Enterprise evaluation therefore needs task-specific tests drawn from the organisation’s actual work. It also requires repeatable scoring criteria, consistent prompts, and review standards that can be applied over time. Without this structure, teams rely on subjective impressions and may misinterpret what has changed between model versions, particularly when improvements are concentrated in specialised conditions rather than in everyday general usage.
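One way to make upgrade effects visible is to score the same test set against both model versions and compare pass rates by task category. The sketch below assumes each test result has already been reduced to a (category, passed) pair using the organisation's own scoring criteria. The category names and function names are hypothetical.

```python
# Minimal sketch: per-category comparison of two model versions.
from collections import defaultdict


def pass_rates_by_category(results):
    """results: iterable of (category, passed) pairs from a scored test set."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


def compare_versions(baseline, candidate):
    """Per-category change in pass rate (candidate minus baseline)."""
    base = pass_rates_by_category(baseline)
    cand = pass_rates_by_category(candidate)
    return {category: cand[category] - base[category]
            for category in base if category in cand}


if __name__ == "__main__":
    # Hypothetical scored results for illustration only.
    baseline = [("routine", True)] * 4 + [("long_context", True)] + [("long_context", False)] * 3
    candidate = [("routine", True)] * 4 + [("long_context", True)] * 3 + [("long_context", False)]
    for category, delta in compare_versions(baseline, candidate).items():
        print(f"{category}: {delta:+.2f}")
```

In this illustrative data the routine category shows no change while the long-context category improves. A handful of routine prompts would miss that difference entirely.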

2.2 A Practical Interpretation of Intelligence Curves

2.2.1 The Uneven Experience of Improvement

Diminishing returns produce an experience that many teams recognise:

The jump from a small model to a medium model often feels dramatic.
The jump from a large model to a larger model often feels modest for many business tasks.

This is not a problem of perception alone. It reflects where the model is in its improvement curve.

Small models struggle with many tasks because they lack sufficient representational capacity. Medium models often cross a threshold where they can handle a wide variety of everyday tasks competently. Large models refine and expand those capabilities, but the remaining gaps are smaller and harder to close.

2.2.2 Why “Modest Gains” Still Matter in Certain Areas

Although later improvements may feel modest in general usage, they can still be valuable in specific domains:

Handling complex, multi-step tasks without losing track of goals
Maintaining consistency in long documents
Following multi-constraint instructions more reliably
Reducing hallucinations in specialised contexts when the model is properly grounded
Producing better reasoning traces and clearer justification

The key is that these benefits are not evenly distributed across all tasks. Later scaling improvements tend to concentrate in harder tasks, which may represent a smaller percentage of everyday workload but a large portion of organisational risk.

2.2.3 The Error Distribution View

A useful way to interpret diminishing returns is to think in terms of error distribution.

Early scaling reduces frequent errors that appear across many tasks. Later scaling targets rarer errors that appear in edge cases or under ambiguity.

Examples of frequent errors include:

Basic misunderstanding of instructions
Poor summarisation
Weak grammatical coherence
Inability to produce structured outputs

Examples of rarer errors include:

Subtle policy misinterpretation
Failure to respect complex constraints across a long workflow
Weak performance when key context is missing
Inconsistent judgement under uncertainty

Rarer errors are more expensive to eliminate because they require training to address smaller, more specialised failure modes.

2.3 Why This Matters for Businesses

2.3.1 Enterprise Goals: Reliable Impact Under Constraints

In enterprise deployments, the objective is not to maximise benchmark scores. The objective is to produce reliable impact within real constraints.

Those constraints include:

Budget limitations and cost predictability
Latency expectations and workflow speed
Data security and permission boundaries
Audit readiness and traceability
User adoption and operational change management

Diminishing returns matter because they force discipline in how resources are allocated. If a larger model provides only a small marginal improvement for a given workflow, the organisation may gain more by improving the system around the model.

2.3.2 High-Stakes Work Versus High-Volume Work

Different categories of work benefit differently from scaling.

High-stakes work includes legal review support, compliance analysis, risk reasoning, and financial decision support. These tasks often have high error costs. Even small improvements in reliability can be valuable because they reduce risk, rework, and exposure.

High-volume work includes drafting, summarisation, templated reporting, extraction, and operational checklists. These tasks often benefit more from speed, consistency, and structured workflows than from maximum model size.

In high-volume work, the organisation often gains more from:

Better instruction design and structured prompting
Strong grounding through internal documentation and curated knowledge
Correct integration mapping and permission design
Clear workflow orchestration across specialist agents
Review processes that catch errors and enforce standards

This distinction is central to Stage 3 because clients will design workflows that reflect their organisation’s real work distribution.
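The difference between high-stakes and high-volume work can be expressed as simple expected-cost arithmetic: task volume multiplied by error rate multiplied by cost per error. The sketch below uses entirely hypothetical volumes, error rates, and costs, and its function names are illustrative. It is a planning aid under stated assumptions, not a costing method.

```python
# Minimal sketch: all figures below are hypothetical planning assumptions.

def expected_error_cost(volume, error_rate, cost_per_error):
    """Expected cost of errors over a period."""
    return volume * error_rate * cost_per_error


def saving_from_improvement(volume, old_rate, new_rate, cost_per_error):
    """Expected saving when the error rate falls from old_rate to new_rate."""
    return (expected_error_cost(volume, old_rate, cost_per_error)
            - expected_error_cost(volume, new_rate, cost_per_error))


if __name__ == "__main__":
    # Same one-point reliability gain applied to two kinds of work.
    high_stakes = saving_from_improvement(
        volume=200, old_rate=0.05, new_rate=0.04, cost_per_error=10_000)
    high_volume = saving_from_improvement(
        volume=50_000, old_rate=0.05, new_rate=0.04, cost_per_error=2)
    print(f"high-stakes saving: {high_stakes:,.0f}")
    print(f"high-volume saving: {high_volume:,.0f}")
```

With these assumed figures, the same one-point reduction in error rate is worth far more in the high-stakes workflow, even though the high-volume workflow handles many more tasks.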

2.3.3 System Design as the Primary Lever Beyond a Threshold

Diminishing returns teach a practical lesson: once models reach a certain capability threshold, system design becomes a larger driver of value than further increases in model size.

System design includes:

Prompt structure and instruction quality: Clear task definition, constraints, and expected output format can improve performance substantially without changing model size.
Knowledge grounding through internal documentation: When models have access to reliable internal sources, applied performance improves. This is especially important for policy-driven tasks.
Data access design through integrations and permissions: Many failures occur because the system cannot access the right data or has too much access without control. Well-designed integration and permission layers improve both accuracy and security.
Workflow orchestration across agents: Complex work improves when divided into specialised stages with clear hand-offs, rather than forcing one model call to do everything.
Output review and governance processes: Review practices, version control, and traceability create stability. They also reduce risk by catching errors early.

These levers often produce larger improvements in organisational outcomes than upgrading to the next-largest model.