3.2

Cost, Latency, and Throughput as a System

30 min

Sustainable AI deployment requires systems thinking. In Cyrenza, cost, latency, and throughput are not separate concerns that can be optimised independently. They form a coupled system. Decisions made to improve one property usually affect the other two. Stage 3 introduces this relationship because workflow design is no longer an interface activity. It becomes an operational design discipline, similar to capacity planning in engineering and budget planning in finance.

Organisations that treat cost, latency, and throughput as isolated metrics often experience predictable failures. They may optimise for the highest quality output and later find the system is too expensive to scale. They may optimise for the lowest cost and later find users reject the system because it slows work. They may optimise for instant responses and later discover that system load creates instability during peak usage. A mature approach begins by acknowledging the interdependence of these properties and designing workflows that match business priorities.

This section provides a structured framework for understanding the system and making disciplined trade-offs.

1. The Three Interdependent Properties

1.1 Cost: Financial Consumption of a Workflow

Cost refers to the financial spend created by operating a workflow. In AI systems, cost is primarily driven by:

  • The volume of text processed and generated, usually measured in tokens

  • The computational intensity of the chosen model

  • The number of steps in the workflow, including multi-agent orchestration

  • The amount of retrieved context, including repeated or unnecessary context

  • The frequency of use across teams and time

Cost is therefore a function of design. Two workflows that achieve the same business outcome can have very different cost profiles depending on how much text they process, how many model calls they make, and how efficiently they retrieve information.

In Stage 3, cost is treated as a controllable variable. The objective is cost efficiency, meaning spending is concentrated where it produces meaningful value.
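The claim that cost is a function of design can be made concrete with a back-of-envelope estimator. This is a sketch only: the prices and token counts below are illustrative placeholders, not real vendor rates, and the simple per-1,000-token pricing scheme is an assumption.

```python
# Back-of-envelope cost model for one workflow run. All prices and
# token counts are illustrative placeholders, not real vendor rates.

ASSUMED_PRICE_PER_1K_INPUT = 0.003   # currency units per 1,000 input tokens
ASSUMED_PRICE_PER_1K_OUTPUT = 0.015  # currency units per 1,000 output tokens

def estimate_run_cost(calls):
    """Sum the cost of every model call in a workflow run.

    `calls` is a list of (input_tokens, output_tokens) pairs, one per
    model call, so multi-step orchestration is priced step by step.
    """
    total = 0.0
    for input_tokens, output_tokens in calls:
        total += (input_tokens / 1000) * ASSUMED_PRICE_PER_1K_INPUT
        total += (output_tokens / 1000) * ASSUMED_PRICE_PER_1K_OUTPUT
    return total

# Two hypothetical workflows with the same business outcome:
lean = [(2_000, 500)]                     # one call, tight retrieval
verbose = [(8_000, 500), (8_500, 1_200)]  # two calls, repeated context
```

Pricing every model call separately makes the cost of extra steps and repeated context visible before a workflow is deployed at scale.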

1.2 Latency: Time to a Usable Response

Latency is the time between a request and a usable output. In operational environments, latency is not merely a technical metric. It determines whether the tool fits into the rhythm of work.

Latency is influenced by:

  • The model’s compute requirements

  • The size of the context window being processed

  • The complexity of the task and required output length

  • The number of sequential steps in a workflow

  • System load, traffic spikes, and rate limits

Latency shapes human behaviour. If the system feels slow for everyday work, users avoid it or reduce usage. If it is fast for routine tasks but slow for complex tasks, adoption remains strong because the waiting time aligns with the value of the output.
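One way to operationalise the behavioural point above is to give each task type an explicit latency budget and check responses against it. The task-type names and threshold values here are assumptions made for the sketch, not recommended targets.

```python
import time

# Illustrative latency budgets in seconds, keyed by task type.
# The thresholds are assumptions for this sketch only.
LATENCY_BUDGETS = {"routine": 2.0, "complex": 30.0}

def within_budget(task_type, handler, *args):
    """Run a handler, returning its result and whether the call met
    the latency budget for its task type."""
    start = time.perf_counter()
    result = handler(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed <= LATENCY_BUDGETS[task_type]
```

Tracking budget misses per task type shows whether waiting time stays aligned with the value of the output, which is the condition under which adoption remains strong.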

1.3 Throughput: Work Completed at Scale

Throughput is the amount of work the system can complete per unit time, across many users or large volumes of tasks. Throughput matters because enterprises rarely run AI for a single individual. They run it across departments, time zones, and continuous workflows.

Throughput depends on:

  • How many requests can be processed concurrently

  • How long each request occupies compute resources

  • The distribution of request types across latency tiers

  • The ability to batch non-urgent work

  • The stability of the system under load, including queueing and rate limits

Throughput is a key determinant of reliability. Low throughput can cause queueing delays, timeouts, and workflow interruption. High throughput enables consistent delivery even when usage spikes.
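The first two dependencies above combine into a useful steady-state approximation: sustainable throughput is roughly the number of concurrent slots divided by how long each request occupies one. This is a Little's-law style estimate, and the numbers are illustrative.

```python
def sustainable_throughput(concurrent_slots, avg_seconds_per_request):
    """Requests per second the system can sustain, assuming every slot
    stays busy and arrivals are steady (a Little's-law approximation)."""
    return concurrent_slots / avg_seconds_per_request

# Halving how long each request occupies compute doubles throughput:
# 10 slots at 4 s/request sustain 2.5 req/s; at 2 s/request, 5.0 req/s.
```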

2. Why These Properties Form a System

Cost, latency, and throughput are linked because they share the same underlying resource: compute.

When a workflow consumes more compute per request, it tends to:

  • Increase cost, because compute usage is higher

  • Increase latency, because processing time is longer

  • Reduce throughput, because fewer requests can be processed in the same time window

When a workflow is designed to reduce compute per request, it tends to:

  • Reduce cost

  • Reduce latency

  • Increase throughput

However, the trade-offs become more complex when you include quality requirements, governance needs, and user experience. For example, a workflow designed for maximum speed may shorten context too aggressively and reduce accuracy. A workflow designed for maximum reasoning depth may reduce throughput and create queueing delays.
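The coupling described above can be shown in a toy model where a single shared pool of compute drives all three properties. The model assumes each request runs sequentially in one slot and that a compute-second has a fixed price; both are deliberate simplifications for illustration.

```python
def system_profile(compute_seconds_per_request, slots, price_per_compute_second):
    """Toy coupling model: cost, latency, and throughput all derive
    from one quantity, the compute consumed per request."""
    cost_per_request = compute_seconds_per_request * price_per_compute_second
    latency = compute_seconds_per_request              # sequential processing
    throughput = slots / compute_seconds_per_request   # requests per second
    return cost_per_request, latency, throughput

# Doubling compute per request doubles cost and latency and halves throughput.
```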

3. The Core Trade-Off Principle

You cannot fully optimise cost, latency, and throughput at the same time. The reason is that each optimisation pushes against the others when resources are finite.

A useful mental model is to treat workflow design as the allocation of limited compute capacity across competing goals:

  • If you prioritise quality and depth of reasoning for every task, you spend more compute per task. This increases cost and reduces throughput.

  • If you prioritise minimal cost for every task, you may choose smaller models, reduce retrieval, and shorten outputs. This can reduce quality and may increase rework and human review load.

  • If you prioritise minimal latency for every task, you may choose fast models and restrict context. This can improve user experience but may underperform on complex tasks that require deeper reasoning.

A mature organisation chooses priorities explicitly. This is a governance decision as much as a technical decision.

4. A Practical Framework for Deliberate Trade-Offs

4.1 Classify the Task by Business Characteristics

For each workflow, determine:

  • Urgency: does the user need an immediate answer, or can the output arrive later?

  • Risk: what happens if the output is wrong? Is error cost high or low?

  • Volume: is this a high-volume recurring task or an occasional task?

  • Complexity: does the task require multi-step reasoning and synthesis, or is it a routine structured transformation?

  • Governance requirement: does the output require traceability, citations, and formal formatting?

This classification makes the trade-offs explicit.
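The classification can be captured as a small record per workflow. The field names, allowed values, and the two example tasks below are all assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """One workflow's business characteristics, per the five questions above."""
    urgency: str       # "immediate" or "deferred"
    risk: str          # "high" or "low" cost of error
    volume: str        # "high" or "low" recurrence
    complexity: str    # "multi_step" or "structured"
    governed: bool     # needs traceability, citations, formal formatting

# Hypothetical examples:
support_triage = TaskProfile("immediate", "low", "high", "structured", False)
contract_review = TaskProfile("deferred", "high", "low", "multi_step", True)
```

Writing the profile down before choosing a model or workflow shape is what makes the trade-offs explicit rather than accidental.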

4.2 Map the Task to an Operating Strategy

Once classified:

  • Low-risk, high-volume, high-urgency tasks often require speed and throughput.

  • High-risk, low-volume, lower-urgency tasks often justify deeper reasoning and higher cost.

  • High-volume, low-urgency tasks often benefit from batching for efficiency and reliability.

  • High-stakes tasks often require traceability, structured outputs, and human review.

This mapping is how organisations scale responsibly without uncontrolled spending.
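The mapping above can be sketched as a routing function. The tier names ("fast", "deep"), the rule order, and the returned fields are assumptions for illustration; a real deployment would tune all three.

```python
def operating_strategy(urgency, risk, volume):
    """Map a classified task to an operating strategy, mirroring the
    section 4.2 rules. Tier names are illustrative placeholders."""
    if risk == "high":
        # High-stakes work justifies deeper reasoning plus human review.
        return {"tier": "deep", "batch": False, "human_review": True}
    if urgency == "deferred" and volume == "high":
        # High-volume, non-urgent work is batched for efficiency.
        return {"tier": "fast", "batch": True, "human_review": False}
    # Routine interactive work favours speed and throughput.
    return {"tier": "fast", "batch": False, "human_review": False}
```

Encoding the mapping as an explicit rule set also makes it auditable, which matters when the routing decision is a governance decision as much as a technical one.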

5. Common Failure Modes in Workflow Design

Understanding the system also means recognising predictable design errors.

Failure Mode 1: Using deep reasoning models for routine tasks

This increases cost and latency without meaningful benefit. It also reduces throughput for everyone.

Failure Mode 2: Forcing real-time processing for high-volume non-urgent work

This creates traffic spikes, queueing, and rate limit problems, reducing reliability.

Failure Mode 3: Optimising cost by cutting context without improving retrieval

This reduces grounding and increases errors, which can increase rework and introduce risk.

Failure Mode 4: Ignoring throughput during pilot design

A workflow may appear successful in a pilot group but collapse during organisation-wide rollout due to concurrency and load.

Stage 3 prepares teams to avoid these failure modes by designing from first principles.