This section explains why AI spend behaves differently from most software spend, and why many organisations experience unexpected cost growth once AI becomes part of daily work. The key concept is non-linearity. In traditional software, additional usage rarely adds comparable marginal cost. In AI systems, usage directly consumes compute: when the amount of processed text grows, cost grows with it, and when the same text is processed repeatedly, cost can grow faster than teams expect.
Stage 3 treats cost control as a design discipline. The objective is not to reduce usage. The objective is to design workflows that spend compute only where it creates value. To do that, participants must understand what causes cost to increase, how context accumulates, and how retrieval strategies prevent budget overruns.
A. Why AI Costs Scale With Usage
AI systems incur cost whenever they process information. Each time a model receives a request, it must perform computation to read the input, incorporate relevant context, apply instructions, and generate an output. That computation happens at the moment of use.
For this reason, the economics of AI resemble a utility model. The organisation pays in proportion to consumption. The more text processed and generated, the more compute is used. The more compute used, the higher the cost.
In Cyrenza, a typical request includes multiple components, many of which are invisible to the user but still contribute to cost:
- User input text: the prompt, instructions, and any content the user pastes into the system.
- Retrieved context: encyclopedia passages, prior artifacts, excerpts from documents, relevant internal policies, or summaries from earlier steps in the workflow.
- System instructions and policy constraints: guardrails that enforce organisational requirements, security constraints, formatting rules, and workflow policies. Even when the user does not see them, the model still processes them.
- Generated output: the visible response, plus any structured components such as tables, bullet lists, checklists, or draft documents.
- Optional reasoning traces and citations: some workflows include citations, references, confidence indicators, or structured justifications. These are valuable for governance but add additional tokens and additional processing.
Every component increases the total processing volume. Cost therefore reflects the combined size of what the system reads and writes, not only the user’s message.
A practical operational implication follows: cost is driven by workflow design, not only by number of users. A small number of users running large context-heavy workflows can cost more than a large number of users running short, well-scoped tasks.
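The components above can be added up in a short sketch. This is a rough illustration, not any vendor's billing formula: the four-characters-per-token heuristic, the price per 1,000 tokens, and all of the component text are invented for the example.

```python
# Rough sketch of per-request cost as the sum of everything the model
# reads and writes. The four-characters-per-token heuristic and the
# price per 1,000 tokens are illustrative assumptions, not real rates.

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly four characters per token in English."""
    return max(1, len(text) // 4)

def request_cost(components: dict[str, str], price_per_1k: float = 0.01) -> float:
    """Cost is proportional to the combined size of all components."""
    total_tokens = sum(estimate_tokens(t) for t in components.values())
    return total_tokens / 1000 * price_per_1k

request = {
    "user_input": "Summarise the termination clauses in this contract.",
    "retrieved_context": "Clause text retrieved from the contract set. " * 200,
    "system_instructions": "Follow formatting and security policy. " * 20,
    "generated_output": "The contract allows termination when... " * 50,
}
# The retrieved context dominates the total, even though the visible
# question is a single sentence.
cost = request_cost(request)
```

Note how the invisible components (retrieved context, system instructions) outweigh the user's one-line question in the total.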
B. The Token Economy
AI systems measure text in tokens. Tokens represent chunks of text that the model processes. Tokens are not identical to words. A single word may be one token, part of a token, or multiple tokens depending on language, formatting, and the specific text.
Tokens matter because they are the unit of capacity and often the unit of billing. Organisations are billed based on how many tokens the model processes, which includes both input tokens and output tokens.
Practical interpretation for enterprise teams
- The model is billed for what it reads. If the request includes long documents, long instructions, or large retrieved context, those tokens contribute to cost even before the model produces an answer.
- The model is billed for what it writes. Long responses, detailed analyses, and multi-format outputs increase output tokens and therefore cost.
- Repeated context multiplies cost. If the same material is included across multiple requests, the organisation effectively pays repeatedly for the model to read the same content.
- Multi-agent workflows multiply token usage. If a workflow involves multiple agents, each agent call processes its own input context and generates its own output. This can be efficient when each agent has a narrow role, but it can become expensive if every agent receives the same large context.
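The multi-agent point can be made concrete with a toy calculation: four agents that each receive the full shared context, versus four agents that each receive only the slice relevant to their role. All sizes are invented for illustration.

```python
# Toy calculation: total tokens when every agent reads the full shared
# context versus only a narrow, role-specific slice. Sizes are invented.

def multi_agent_tokens(context_per_agent: list[int], output_per_agent: int = 500) -> int:
    """Total tokens when each agent reads its context and writes an output."""
    return sum(ctx + output_per_agent for ctx in context_per_agent)

shared_everything = multi_agent_tokens([40_000] * 4)             # 162,000 tokens
narrow_roles = multi_agent_tokens([5_000, 8_000, 3_000, 4_000])  # 22,000 tokens
```

The outputs are the same size in both cases; the difference comes entirely from how much context each agent is forced to read.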
Here’s a simple rule of thumb: costs increase when the system processes unnecessary text, and costs increase faster when that unnecessary text is repeated across steps.
This rule is foundational to sustainable workflow design.
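A minimal sketch of the rule of thumb, assuming (for illustration) that a scoped workflow reads the large context once, for example to summarise or index it, and afterwards sends only the question:

```python
# Sketch of the rule of thumb: resending a large context with every
# request multiplies its cost by the number of requests. All token
# counts are illustrative.

def total_input_tokens(context_tokens: int, question_tokens: int,
                       n_requests: int, resend_context: bool) -> int:
    """Input tokens consumed across a series of requests."""
    if resend_context:
        return (context_tokens + question_tokens) * n_requests
    return context_tokens + question_tokens * n_requests  # context read once

repeated = total_input_tokens(50_000, 100, 10, resend_context=True)   # 501,000
scoped = total_input_tokens(50_000, 100, 10, resend_context=False)    # 51,000
# Roughly a 10x difference, driven by repetition alone.
```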
C. Context Windows and Why They Matter
A model has a context window, which is the maximum amount of text it can consider at once when generating an output. The context window includes everything the model receives as input for that request, including user prompts, system instructions, retrieved context, and prior conversation history included in the request.
As the context window fills, two operational effects become more likely.
1. Cost growth through repeated inclusion
When a large context is included repeatedly, the organisation pays each time the model reads it. This is a direct cost driver. It becomes especially visible when teams attach full documents, paste long content repeatedly, or carry long conversation histories forward without summarisation and retrieval discipline.
2. Performance degradation through attention dilution
Large contexts can reduce reliability because the model must allocate attention across more text. In long contexts, the model may:
- miss a relevant detail because it is buried in a long document
- over-weight irrelevant details that happen to appear later in the context
- produce outputs that are less focused, because the instruction and the supporting evidence compete for attention
- slow down, because the system must process more text
This is not only a cost issue. It is also a quality issue. The system can become more expensive and less reliable at the same time if context is not managed.
A critical Stage 3 insight is that controlling context improves both cost and output quality.
D. Context Inflation: The Typical Cause of Budget Overruns
Context inflation describes a pattern where the amount of text included in each request gradually increases over the life of a project, a conversation, or a workflow. This often happens unintentionally because people naturally add more background information as work progresses.
Context inflation is common in enterprise environments because teams work with:
- large documents such as contracts, policies, and reports
- iterative review cycles with repeated questions
- multi-stakeholder inputs and appended comments
- long-running projects with accumulated artifacts and meeting notes
A common pattern that triggers inflation

1. A team uploads a large document, such as a policy pack or contract set.
2. Users ask a first question, and the system includes the document to answer it.
3. Users ask follow-up questions.
4. Each follow-up request includes the full document again, because the workflow is not configured for selective retrieval.
5. Over time, additional context is added, including prior outputs, meeting notes, and new attachments.
6. The total context becomes large, and each new question triggers the model to reread a large amount of text.
The result is that later interactions cost much more than earlier interactions, even when later questions are short. This surprises teams because they focus on the length of the question rather than the size of the total context being processed behind the scenes.
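The pattern can be simulated in a few lines. Each turn carries forward the prior context plus newly accumulated material; all token counts are illustrative.

```python
# Simulation of context inflation: later short questions cost far more
# than early ones because the whole accumulated context is resent.

def inflation_curve(base_doc: int, added_per_turn: int,
                    question: int, turns: int) -> list[int]:
    """Input tokens per turn when the whole context is resent each time."""
    costs, context = [], base_doc
    for _ in range(turns):
        costs.append(context + question)
        context += added_per_turn  # prior outputs, notes, attachments pile up
    return costs

curve = inflation_curve(base_doc=30_000, added_per_turn=5_000,
                        question=50, turns=6)
# The question never changes size, yet the final turn processes far more
# text than the first.
```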
Context inflation is one of the most common causes of unexpectedly high AI spend in organisations. It also creates an operational risk: when cost spikes, organisations restrict usage, reduce experimentation, and weaken adoption.
E. Strategic Retrieval as the Primary Control Mechanism
Strategic retrieval is the most effective way to control context inflation while preserving quality. It refers to the deliberate practice of retrieving only the most relevant information needed for a task, rather than feeding full documents or entire histories repeatedly.
In Cyrenza, strategic retrieval is enabled through indexing and semantic search. The goal is to ensure the model reads the right evidence, not the maximum possible evidence.
Strategic retrieval typically includes three practices.
1. Retrieve relevant sections rather than full documents
Instead of attaching a complete policy pack or full contract repeatedly, the system retrieves the specific paragraph, clause, or section that is relevant to the question.
This practice has two effects:
- it reduces token usage and therefore reduces cost
- it increases focus and therefore can improve output quality
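A toy illustration of the practice, using a crude keyword-overlap score as a stand-in for real semantic search. The section names, clause text, and query are invented for the example.

```python
# Toy retriever: pick the single most relevant section instead of sending
# the full document. Keyword overlap stands in for semantic search.

def score(query: str, section: str) -> int:
    """Count shared lowercase words between the query and a section."""
    return len(set(query.lower().split()) & set(section.lower().split()))

sections = {
    "definitions": "terms used in this agreement are defined below",
    "termination": "either party may terminate this agreement with notice",
    "liability": "liability is limited to fees paid in the prior year",
}

query = "how can a party terminate the agreement"
best = max(sections, key=lambda name: score(query, sections[name]))

# The model reads one short section rather than the whole document.
full_doc_words = sum(len(s.split()) for s in sections.values())
retrieved_words = len(sections[best].split())
```

In practice the scorer would be an embedding-based semantic search over an index, but the cost effect is the same: the model processes the retrieved section, not the entire source.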
2. Use indexed knowledge and semantic search instead of manual copying
When documents are indexed, the system can locate relevant content based on meaning. Users do not need to paste large excerpts manually, and the system does not need to reprocess entire documents to find a small section.
This reduces human effort and improves repeatability.
3. Convert stable outputs into artifacts that can be referenced
A mature workflow produces stable intermediate outputs and stores them as artifacts. For example:
- a policy summary for a project
- an extracted list of contract obligations
- a risk classification rubric
- a checklist derived from an SOP
Once these artifacts exist, future tasks can reference them without reprocessing the entire source document. This is a strong cost control method because it reduces repeated reading.
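A minimal sketch of the artifact pattern, where `summarise` is a hypothetical stand-in for a model call and an in-memory dictionary stands in for the artifact store:

```python
# Sketch of artifact reuse: process a large source once, store the stable
# output, and let later tasks reference the artifact instead of rereading
# the source.

artifacts: dict[str, str] = {}

def summarise(document: str) -> str:
    """Placeholder for a model-generated summary of the document."""
    return document[:200]

def get_context(doc_id: str, document: str) -> str:
    """Create the artifact once; every later call reuses it."""
    if doc_id not in artifacts:
        artifacts[doc_id] = summarise(document)  # the only full read
    return artifacts[doc_id]

contract_pack = "Obligation: notify the counterparty of changes. " * 1000
first = get_context("contract-pack", contract_pack)  # full source processed once
later = get_context("contract-pack", contract_pack)  # artifact reused, no rereading
```

Every call after the first pays only for the short artifact, which is why this is such an effective control on repeated reading.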
Controlling retrieval often has more impact on cost than selecting a cheaper model. Model choice matters, but retrieval discipline determines how much text the model is forced to process repeatedly. Many organisations save more by improving context strategy than by changing models.