How AI Models Generate Text and Why That Process Produces Errors

To understand why AI language models produce errors that are indistinguishable in presentation from accurate outputs, the practitioner needs a working understanding of the mechanism through which these models generate text. This understanding does not require technical expertise in machine learning or computational linguistics. It requires a clear mental model of what the model is actually doing when it produces a response, because that mental model is the foundation for the verification discipline that responsible professional AI use demands. A practitioner who understands the generation mechanism treats AI outputs with the appropriate level of critical scrutiny. A practitioner who does not may develop a misplaced confidence in AI outputs based on their surface quality, which is precisely the confidence that the mechanism is most capable of producing and most likely to betray at a professionally consequential moment.

An AI language model generates text through a process of statistical prediction. The model has been trained on an enormous quantity of text drawn from the internet, published books, academic articles, professional publications, legal databases, financial reports, and the full range of documented human knowledge and expression available in digital form. During training, the model processed this material repeatedly, adjusting billions of internal numerical parameters to become progressively better at predicting what text typically follows what other text across millions of different contexts, topics, styles, and domains. The result of this training process is a system that has absorbed the statistical patterns of human language at a scale and granularity that allows it to generate text that is, in its surface characteristics, virtually indistinguishable from text produced by a knowledgeable human writer.

When a practitioner submits a request to an AI language model, the model does not search a database of verified facts to find the correct answer. It does not retrieve information from a structured knowledge store and assemble it into a response. It generates text one token at a time (a token is roughly a word or a fragment of a word), with each successive token selected according to how statistically likely it is to follow everything that has come before in the sequence, including the practitioner's request, any context documents provided, and the text the model has already generated in its response so far. This prediction process draws on the statistical relationships the model absorbed during training. The model is, in a precise technical sense, completing a very sophisticated pattern based on the patterns it learned from the training data.
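
The prediction loop described above can be illustrated with a deliberately tiny sketch. It is not how a production model is built: real models use neural networks over sub-word units, whereas this toy simply counts which word follows which in a three-sentence corpus invented for the illustration. The names train_bigrams and generate are hypothetical. What the sketch is meant to show is that the loop emits whatever continuation was most frequent in its training text, and that no step checks the result against a source of truth.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words followed it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_words=6):
    """Extend `start` by repeatedly appending the most frequent follower."""
    out = [start]
    for _ in range(max_words):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this word in training
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Invented toy corpus: frequency, not truth, decides the output.
corpus = [
    "the policy excludes flood damage",
    "the policy excludes flood damage",
    "the policy covers fire damage",
]
counts = train_bigrams(corpus)
print(generate(counts, "the"))  # the policy excludes flood damage
```

The output reads "the policy excludes flood damage" only because that wording appeared twice in the toy corpus and the alternative once. Whether any actual policy excludes flood damage plays no part in the result.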

This mechanism is responsible for the remarkable fluency and apparent coherence of AI-generated professional text. The model learned, across millions of documents, how professional writing is structured, how arguments are developed and supported, how conclusions are qualified, how domain-specific vocabulary is used in context, and how different genres of professional output, including legal memoranda, insurance coverage analyses, financial commentaries, and consulting reports, are conventionally organised and expressed. When the model generates a coverage analysis or a contract summary, it applies these learned structural and stylistic patterns to produce output that reads, in its surface characteristics, as though a professionally trained author wrote it with full command of the relevant domain.

The same mechanism that produces this fluency also produces errors, and it produces them through a process that is built into how the system functions rather than representing a malfunction or a failure of the technology to perform as intended. The statistical prediction that drives text generation is accurate when the training data contained dense, consistent, and reliable evidence about the specific fact or claim being generated. Under these conditions, the most statistically likely next word is also the factually correct next word, and the model's output will be accurate. The statistical prediction produces errors when the training data was sparse, inconsistent, or absent with respect to the specific information the generation process requires. Under these conditions, the model still generates the most statistically likely continuation of the text, because that is what the mechanism does. The statistically likely continuation may be factually wrong. The mechanism has no way to distinguish between these two conditions from the inside, and the output it produces in both cases is presented with the same fluency, the same structural coherence, and the same apparent authority.

The practical consequence for professional practitioners is that the surface quality of an AI-generated output provides no reliable signal about its factual accuracy. A fabricated legal citation, meaning a case that does not exist, or that exists but does not stand for the proposition for which it is being cited, is delivered in a grammatically correct sentence formatted to the conventions of legal citation in exactly the same way as a real citation that is accurately represented. An incorrect interpretation of a policy exclusion, one that misreads the scope of the exclusion or fails to account for a qualifying endorsement, is expressed through the same structured analytical reasoning as a correct interpretation. A financial figure that does not appear anywhere in the source data the practitioner provided is presented with the same numerical formatting and contextual explanation as a figure that can be verified directly against the source. The model generated all of these outputs through the same statistical prediction process. The process cannot distinguish between accurate and inaccurate outputs because it does not operate on the basis of factual verification. It operates on the basis of statistical likelihood, and statistical likelihood and factual accuracy are not the same thing.

A related and equally important property of the generation mechanism is that the model has no reliable internal mechanism for recognising the boundaries of its own knowledge. When a human professional is asked a question that falls outside their area of expertise, or that requires information they do not have, they typically recognise the limitation. They may decline to answer, qualify their response with an explicit acknowledgment that they are speculating, refer the questioner to a more appropriate source, or ask for the additional information they need before they can respond reliably. This recognition of the boundary between what one knows and what one does not know is a fundamental feature of how human professional judgment operates, and it is one of the properties that makes expert professional advice trustworthy.

AI language models do not possess this mechanism in a reliable form. The statistical prediction process generates text with equal fluency whether the underlying information is densely represented in the training data, sparsely represented, inconsistently represented, or entirely absent. The model cannot reliably detect from the inside which of these conditions applies to any specific claim it is generating, and it therefore cannot reliably signal to the practitioner when it is operating at the boundary of its training knowledge and moving into the territory where statistically likely text is most likely to diverge from factually accurate text. Some models have been designed to produce hedging language in certain conditions, phrases that acknowledge uncertainty or indicate that the information should be verified, but these hedges are themselves generated through statistical prediction rather than through genuine epistemic assessment, and they are inconsistent in their application. The practitioner who relies on the model to flag its own uncertainty will miss the cases where the model generates confident, unhedged text in precisely the situations where its reliability is lowest.

In AI research and in the broader professional discussion of AI risks, this phenomenon is described as hallucination. The term has become widely used, and it captures something important about the character of the failure, which is that the model produces outputs that are presented with apparent confidence and surface accuracy but that describe things that are factually wrong or that simply do not exist. The term can be misleading, however, if it creates the impression that hallucination is a malfunction, a departure from normal operation, or an anomaly that more careful engineering might eventually eliminate. Hallucination is a predictable consequence of how language models work. The model was designed to produce statistically likely text, and it performs that function consistently and well. When statistically likely text is also factually accurate, the output is useful and trustworthy. When statistically likely text diverges from factual accuracy, the output is wrong but presented in a form that is indistinguishable from trustworthy output. The conditions under which this divergence occurs are not random. They are predictable in their general character, concentrated in areas where training data was sparse or inconsistent, in questions requiring information more recent than the training cutoff, in highly specific factual claims about names, dates, citations, and numerical values, and in questions at the intersection of multiple domains where the relevant patterns in the training data did not consistently overlap. Understanding these patterns allows practitioners to direct their verification effort toward the categories of claim that carry the highest risk of hallucination, rather than treating every sentence in an AI output as equally likely to be accurate or inaccurate.

The professional discipline that this understanding demands is straightforward to state, but it is effective only when applied consistently. Every AI output is a draft that requires verification before it is incorporated into professional work, regardless of how convincing, well-structured, or authoritatively presented it appears. The fluency of the text is a property of the generation mechanism, not a signal of factual accuracy. The apparent confidence with which a claim is made is a property of the generation mechanism, not a signal that the claim has been verified against a reliable source. The structural coherence of the analysis is a property of the generation mechanism, not a signal that the analytical conclusions are correct. The practitioner's professional accountability for the accuracy of their work product cannot be discharged by reading an AI output and finding it plausible. It is discharged by verifying the specific claims in the output against the primary sources that establish whether those claims are accurate, and by correcting or removing the claims that verification reveals to be wrong. This is the discipline that the mechanism demands, and it is the discipline that the sections and modules that follow are designed to make specific, efficient, and practically sustainable in the conditions of professional work.

ay and what they actually say.

A lease abstract that correctly extracts and accurately represents the majority of the commercial terms in a complex lease, but silently omits a co-tenancy clause whose operation could materially affect the tenant's obligations under certain circumstances, produces a document that appears comprehensive and is professionally incomplete in a way that may not surface as a problem until months after the transaction has completed and the clause is triggered. The abstract looks like careful, thorough work. The omission is invisible in the abstract itself because an absent item leaves no visible trace. Detection requires the reviewer to read the actual lease and compare its terms against the abstract systematically, rather than reading the abstract and assessing whether it appears complete.

A consulting report that correctly frames the strategic question, applies a recognisable analytical framework with apparent rigour, and presents conclusions in the structured, evidence-referenced form that professional consulting deliverables conventionally take, but bases its market sizing on figures the model generated from general patterns rather than from the data the client provided, produces advice that sounds rigorous and rests on invented evidence. The structure of the report provides no signal that the specific figures are not drawn from the client's data, because the model has learned how to present generated figures in the format and with the contextual explanation that real figures receive.

The detection challenge with this failure pattern is that surface reading and structural assessment are both insufficient to catch it. The output passes the tests that surface reading applies because its grammar, structure, tone, and apparent reasoning are all appropriate. The error is in the substance, in the specific factual claim that is wrong, the specific provision that is misread, the specific interaction between clauses that is missed, or the specific figures that are invented rather than sourced. Detecting it requires the reviewer to engage with the specific substantive claims in the output at the level of checking them against source material, rather than reading the output as a whole and assessing its general quality. The discipline this demands is more time-consuming than surface reading, and it is the discipline that protects professional work from the category of AI error that is most likely to survive an insufficiently rigorous review.

Invented Specifics in the Absence of Information

The third failure pattern arises from a specific and important property of the generation mechanism: the model generates a response to every question submitted to it, regardless of whether it has access to the specific information needed to answer that question accurately. When a human professional is asked a question and does not have the information required to answer it reliably, the professional response is to acknowledge the gap, request the necessary information, or qualify the response explicitly to reflect the limits of available knowledge. The AI model's generation mechanism does not produce this response reliably, because the mechanism is optimised to generate the most statistically likely continuation of the text, and a continuation that provides a substantive answer is statistically more likely than a continuation that acknowledges a gap in the available information. The result is that the model fills informational gaps with plausible-sounding content rather than flagging them.

A practitioner who asks an AI tool to summarise a specific client's historical claims experience without providing the actual claims data will frequently receive a detailed, structured summary that reads as though it was produced from comprehensive records. The summary may reference specific claim types, approximate frequencies, resolution patterns, and aggregate figures. These details will be presented with the specificity and formatting that characterise a genuine data-based analysis. They will have been generated by the model from the statistical patterns it learned about how claims summaries of that kind are typically structured and what kinds of figures they typically contain, rather than from the client's actual claims history, which was never provided to the model.

A financial analyst who asks the tool to comment on a specific company's revenue trajectory and margin performance without providing the company's actual financial statements may receive commentary that references specific percentage changes in revenue, specific margin figures, and specific period comparisons. These figures will be presented in the confident, precise language of financial analysis. They will have been generated by the model from patterns in financial commentary rather than from the company's actual results. A real estate professional who asks the tool about comparable sales in a specific local market without providing actual transaction data may receive a list of comparables with addresses, sale prices, and sale dates that reads as precisely sourced market evidence and was generated from the model's general knowledge of how comparable sales analyses are structured and what kinds of figures they typically contain.

This failure pattern is particularly prevalent when practitioners submit questions that require specific, current, or organisation-internal information without providing that information as context. The model's willingness to provide a detailed, confident-sounding answer is not evidence that it has access to the information required to answer accurately. It is evidence that the generation mechanism has found a statistically likely continuation of the text that takes the form of a detailed, confident answer. Every specific claim in an AI output that could only be accurate if the model had access to specific data the practitioner did not provide must be treated as potentially invented until verified against the actual source. This applies with particular force to numerical figures, specific dates, named individuals, referenced documents, and any other claim whose accuracy depends on access to specific information rather than on general professional knowledge.
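
As a first-pass aid to this check, a short script can list the numerical figures in an AI draft that appear nowhere in the material the practitioner supplied. The sketch below is illustrative only: the function name unsupported_figures, the number pattern, and both sample texts are assumptions made for the example. It compares figures as text after removing thousands separators, so it cannot tell whether a matching figure has been used correctly, and it does not replace reading the source.

```python
import re

# Matches 1250, 1,250, 12.5 and 12.5% (a simple pattern, not exhaustive).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")

def _normalise(figure):
    """Drop thousands separators so 1,250 and 1250 compare equal."""
    return figure.replace(",", "")

def unsupported_figures(draft_text, source_text):
    """Return figures in the draft that do not occur in the source text."""
    source_numbers = {_normalise(m) for m in NUMBER.findall(source_text)}
    flagged = []
    for figure in NUMBER.findall(draft_text):
        if _normalise(figure) not in source_numbers and figure not in flagged:
            flagged.append(figure)
    return flagged

# Invented sample texts for illustration.
source_text = "Revenue was 1,250 in 2023 and margin was 12.5%."
draft_text = "Revenue rose to 1250 in 2023, a margin of 12.5% and growth of 8%."
print(unsupported_figures(draft_text, source_text))  # ['8%']
```

A flagged figure is not necessarily wrong, since it may be a legitimate calculation from the source, but it is one the practitioner should trace before relying on it. An unflagged figure has only been shown to exist somewhere in the source, not to be correctly attributed.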

Inconsistency Across Long Outputs

The fourth failure pattern is a consequence of the sequential character of the text generation process and becomes more pronounced as the length and complexity of the AI output increases. The model generates text one word at a time, with each successive word predicted on the basis of the full sequence of text that precedes it. Maintaining perfect consistency across a long, complex output requires the model to track and honour all of the constraints, definitions, analytical positions, and factual claims it has already generated as it continues generating additional text. This tracking becomes progressively more demanding as the output grows longer, because the constraint set that must be respected grows with every additional claim or analytical position the model produces.

The failure mode that results from this property takes several forms in professional work. A position taken in an early section of a long output may be qualified, modified, or implicitly reversed in a later section without the modification being explicitly acknowledged, because the model's attention to the earlier constraint has degraded as the generation has continued. A definition established at the beginning of a document may be applied inconsistently as the document proceeds, with the term being used in one sense in some sections and a slightly different sense in others. An analytical conclusion reached in one part of a report may be incompatible with an assumption made in a different part, creating an internal contradiction that undermines the reliability of both conclusions.

In professional work, internal inconsistency is a serious quality problem whose consequences extend beyond the specific sections where the inconsistency appears. A risk assessment that describes a specific risk factor as material in the executive summary and immaterial in the detailed analysis creates a document whose conclusions cannot be relied upon by the decision-makers receiving it, because the document provides internally contradictory guidance about the significance of the factor in question. A contract review that identifies a specific clause as non-standard in the high-level summary but treats the equivalent clause as standard in the detailed section-by-section analysis creates confusion for the solicitor and the client about what the reviewer's professional assessment actually is. A due diligence report that applies different valuation assumptions in different sections, without acknowledging or reconciling the difference, produces overall conclusions that rest on an inconsistent analytical foundation and that cannot be defended if challenged.

The verification discipline that addresses this failure pattern is distinct from the verification disciplines that address the preceding three patterns. Checking for fabricated references requires verification against primary sources. Checking for plausible but incorrect analysis requires substantive engagement with the specific claims in the output. Checking for invented specifics requires confirming that the specific figures, dates, and details in the output correspond to actual data the practitioner provided. Checking for internal inconsistency requires reading the output as a complete document with attention to whether the positions, definitions, and conclusions in each section are compatible with those in every other section. For long professional outputs, this is most efficiently accomplished by reading the complete output in sequence before focusing verification effort on specific substantive claims, so that the overall analytical structure and the consistency of positions across sections can be assessed before the detail-level verification begins. The investment of time this requires is proportionate to the professional consequences of delivering a document whose internal contradictions undermine the reliability of its conclusions, which in the professional domains this programme addresses are consistently significant.

Section 3: Context Windows and Their Limits in Professional Document Processing

Among the technical properties of AI language models that have direct and practical consequences for professional use, the context window is one of the most important and one of the most frequently misunderstood. Understanding what the context window is, how its limits affect the reliability of AI outputs from long professional documents, and how practitioners can design their workflows to work within those limits rather than being constrained by them, is essential knowledge for any practitioner who regularly directs AI assistance toward the document-intensive work that characterises professional practice in legal services, insurance, finance, consulting, and real estate.

The context window is the total volume of text that an AI model can hold in its active processing at any single moment during an interaction. Everything the model reads and reasons about within a single session, including the practitioner's instructions, any documents or reference materials provided as context, any background information supplied through integrated knowledge bases, the conversation history accumulated across multiple exchanges in the session, and the text of the output the model is currently generating, must fit within this window simultaneously. The context window functions as the model's working memory for the interaction. Material that falls within the window is available to the model as it generates its response. Material that exceeds the window cannot be incorporated into the active processing and cannot reliably influence the output.
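
Because everything must fit in the window at once, it can help to estimate the size of a submission before sending it. The sketch below assumes roughly four characters per token, which is a common rule of thumb for English text and not a property of any particular tool. Real tools count tokens with their own tokenisers, and the window size and output reserve used here are illustrative values rather than the limits of a specific product.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is an assumed rule of thumb."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_in_window(parts, window_tokens, reserve_for_output):
    """Check whether all input parts plus space for the reply fit the window.

    Returns (fits, estimated_input_tokens).
    """
    used = sum(estimate_tokens(p) for p in parts)
    return used + reserve_for_output <= window_tokens, used

# Illustrative values: 400 characters of instructions, 4,000 of document text.
parts = ["i" * 400, "d" * 4000]
print(fits_in_window(parts, window_tokens=2000, reserve_for_output=500))
print(fits_in_window(parts, window_tokens=1500, reserve_for_output=500))
```

The reserve matters because the text the model is generating occupies the same window as the material it was given. An estimate of this kind is only a planning aid, and the tool's own documentation is the authority on its actual limits.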

Context windows have expanded substantially over the past several years as AI model development has advanced, and current leading models can process volumes of text that would have exceeded the capabilities of their predecessors by significant margins. For a wide range of professional tasks, the context window available in current AI tools is sufficient to accommodate the relevant materials without constraint. A single commercial contract of typical length, a standard insurance policy document with its principal endorsements, a set of meeting notes from a complex client engagement, a short due diligence report, or a regulatory filing of moderate length can generally be processed within a single interaction without the context window creating a practical limitation. The practitioner who works primarily with documents of these dimensions and who provides focused, relevant context rather than submitting entire document sets by default will encounter the context window as a constraint relatively infrequently in day-to-day professional use.

The context window becomes a material operational constraint when practitioners work with the larger document collections that professional practice regularly generates. A complete due diligence data room assembled for a significant corporate transaction may contain hundreds of documents across multiple categories. A full lease portfolio for a commercial property client may include dozens of individual leases, each with its own schedules and supplemental agreements. A comprehensive regulatory filing with all its supporting appendices and referenced materials may extend to hundreds of pages. An insurance policy pack that includes the master policy, all endorsements issued across the policy period, and the relevant regulatory guidance may constitute a substantial volume of text. When the total volume of material that would ideally inform an AI-assisted analysis of these collections exceeds the context window, the practitioner faces a decision about how to manage the constraint, and the quality of that decision directly affects the reliability of the outputs the AI tool produces.

The practitioner who does not understand the context window tends to discover its limits in one of two ways, neither of which is conducive to good professional practice. The tool may truncate the input, processing only the portion of the submitted material that fits within the window and generating a response that reflects only that portion without signalling clearly which portions of the input were not processed. Or the tool may reject the input entirely with an error indicating that the submission exceeds its capacity. In both cases, the practitioner who was not anticipating the constraint is left uncertain about which parts of their submission the tool actually processed and which parts of its output may be unreliable because they were generated without access to material that fell outside the window. The practitioner who understands the context window in advance can design their submissions to work within it deliberately, making informed decisions about which material to include, which to exclude, and how to structure multi-interaction workflows that address document sets too large for single-interaction processing.

The appropriate response to a context window constraint in professional document work depends on the nature of the task and the structure of the document collection. For tasks involving a single long document, extracting and submitting only the sections directly relevant to the specific analytical question, rather than submitting the complete document, is typically the most effective approach. A practitioner asking an AI tool to analyse the break clause provisions in a long commercial lease does not need to submit the entire lease if the break clause provisions occupy a clearly defined set of sections that can be extracted and submitted independently. Submitting only those sections reduces the volume of text the tool must process, keeps the relevant material well within the context window, reduces the cost of the interaction as discussed in Module 3.2, and, as will be discussed below, substantially improves the reliability of the output by ensuring that the relevant material is not competing for the model's processing attention with dozens of pages of unrelated provisions.
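
The extraction step described above might be sketched as follows. The sketch assumes a hypothetical document layout in which each section begins with a line such as "3. Break Clause", and the names split_sections and select_sections are illustrative. Real leases are rarely this regular, and a keyword match on headings can miss relevant material elsewhere in the document, such as definitions or cross-referenced schedules, so the practitioner still needs to confirm that the selection is complete.

```python
import re

# Assumed heading format: a number, a full stop, then the section title.
HEADING = re.compile(r"^(\d+)\.\s+(.+)$")

def split_sections(document):
    """Split text into sections, each starting at a numbered heading line."""
    sections = []
    for line in document.splitlines():
        match = HEADING.match(line.strip())
        if match:
            sections.append(
                {"number": match.group(1), "title": match.group(2), "body": []}
            )
        elif sections:
            sections[-1]["body"].append(line)
    return sections

def select_sections(document, keywords):
    """Keep only sections whose title contains one of the keywords."""
    wanted = [k.lower() for k in keywords]
    return [
        s for s in split_sections(document)
        if any(k in s["title"].lower() for k in wanted)
    ]

# Invented miniature lease for illustration.
lease = """1. Definitions
"Term" means ten years.
2. Rent Review
Rent is reviewed every five years.
3. Break Clause
The tenant may terminate on six months' notice.
4. Alienation
Assignment requires consent."""

for section in select_sections(lease, ["break"]):
    print(section["number"], section["title"], section["body"])
```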

For tasks requiring analysis across multiple documents or across an entire document collection, a sequential, multi-interaction approach that addresses the collection systematically across a series of focused interactions is typically more reliable than attempting to process the entire collection in a single submission. A practitioner conducting a lease portfolio review might process each lease individually across separate interactions, synthesising the findings after each analysis and building a cumulative picture of the portfolio's terms, rather than attempting to submit all leases simultaneously and asking for a portfolio-level analysis in a single request. A paralegal reviewing a discovery production might process the production in defined batches, maintaining a running summary of findings that informs each subsequent batch analysis, rather than attempting to submit the complete production in a single interaction. These sequential approaches require more interactions to complete the overall task, but they consistently produce more reliable results than single-interaction attempts to process document volumes that strain or exceed the context window.
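
The batch-by-batch pattern with a running summary can be sketched in a few lines. In this sketch the analyse argument is a placeholder for whatever performs the analysis of one batch, whether an AI tool or a human reviewer. It is a hypothetical callable supplied by the reader, not the interface of any real product, and the stand-in used in the example merely counts documents.

```python
def batches(items, size):
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def review_in_batches(documents, size, analyse):
    """Analyse documents batch by batch, carrying a running summary forward.

    `analyse(batch, running_summary)` is a caller-supplied placeholder that
    returns the findings for one batch as text.
    """
    running_summary = ""
    findings = []
    for batch in batches(documents, size):
        result = analyse(batch, running_summary)
        findings.append(result)
        running_summary = (running_summary + "\n" + result).strip()
    return findings, running_summary

# Stand-in analysis step for illustration only.
def count_only(batch, running_summary):
    return f"{len(batch)} documents reviewed"

findings, summary = review_in_batches(["a", "b", "c", "d", "e"], 2, count_only)
print(findings)
```

One design point to note is that the running summary grows with every batch and itself consumes space in the context window. In a long review it would need to be condensed periodically, which this sketch does not attempt.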

The Lost in the Middle Problem

Understanding the context window as a hard limit on what can be processed is important, but it addresses only part of the reliability challenge that document length creates in professional AI use. There is a further and more subtle reliability issue that affects long documents even when they fit comfortably within the context window: the model does not process all parts of the document with equal reliability. Research examining how AI language models attend to information across long texts has repeatedly documented a pattern in which information positioned near the beginning and near the end of the submitted text receives stronger and more reliable processing attention than information positioned in the middle sections. This pattern can appear even when the total document length is well within the model's context window capacity, and it produces errors whose character is specific to the position of the relevant information within the submitted text rather than to the inherent difficulty of the analytical question being asked. The phenomenon is known in AI research as the lost in the middle problem, and its practical implications for professional document processing are significant.

The mechanism that produces this pattern reflects properties of how transformer-based language models, which underlie the current generation of leading AI tools, process and attend to information across long sequences. Without entering into technical detail that exceeds what is professionally useful here, the model's attention to specific positions within the submitted text is not uniform. Positions at the beginning of the text, where the initial framing of the task and the first substantive content appear, and positions at the end of the text, where the most recently processed content appears, tend to receive stronger processing attention than positions in the middle of long sequences. Information at these favoured positions is more likely to be correctly identified, accurately represented, and reliably incorporated into the output. Information in the middle sections is more likely to be underweighted, mischaracterised, selectively attended to, or omitted from the output entirely, even when it is directly relevant to the analytical question being asked.

For professional practitioners working with the kinds of documents that professional practice regularly involves, the lost in the middle problem creates specific and consequential risks that deserve careful consideration in workflow design. A claims analyst who provides a fifty-page policy document and asks the tool to identify all applicable exclusions may receive an output that accurately identifies the exclusions appearing in the opening sections of the policy and the exclusions in the final sections, while missing or inadequately representing exclusions that appear in the middle portions of the document. The output will not signal this selective coverage. It will present the identified exclusions as a complete analysis, formatted with the same apparent thoroughness as an analysis that had genuinely engaged with every section of the document with equal attention.

A financial analyst who provides a lengthy management commentary and asks the tool to extract and summarise the key risk factors discussed may find that risks identified early in the commentary and risks raised toward its conclusion are well represented in the output, while risks discussed in the middle sections receive inadequate attention or are absent from the summary entirely. A paralegal who submits a substantial discovery production in a single interaction and asks the tool to identify communications that may attract privilege claims may find that documents positioned in the middle of the submitted set receive less careful and less reliable analysis than those at the beginning and end, with potentially significant implications for the privilege assessment if relevant documents in the middle sections are incorrectly categorised.

The practical response to the lost in the middle problem is to design submissions so that the most critical material is positioned where the model's processing attention is strongest, and to reduce the volume of surrounding material that competes with the critical content for that attention. For a practitioner asking a specific analytical question about a long document, this means extracting the specific sections of the document most directly relevant to the question and submitting those sections as the primary context, rather than submitting the complete document and relying on the model to identify and attend to the relevant sections across the full volume of text. This approach positions the material that matters at the beginning of the submitted context, where attention is strongest, rather than allowing it to be buried in the middle of a lengthy submission where attention is weakest.
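For practitioners who prepare submissions with a script rather than by hand, the extraction step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended tool: the policy sections, the keywords, and the function names select_relevant_sections and build_submission are all hypothetical, and the document is assumed to have been split into headed sections already. Keyword matching is crude and will miss relevant passages that use different wording, so the practitioner's own reading of the document remains the basis for deciding what to include.

```python
def select_relevant_sections(sections, keywords):
    """Return the (heading, text) pairs that mention any of the keywords."""
    lowered = [k.lower() for k in keywords]
    return [
        (heading, text)
        for heading, text in sections
        if any(k in heading.lower() or k in text.lower() for k in lowered)
    ]


def build_submission(question, selected):
    """Put the question first, followed only by the selected sections."""
    parts = [question.strip(), ""]
    for heading, text in selected:
        parts.append(heading)
        parts.append(text.strip())
        parts.append("")
    return "\n".join(parts).strip()


# Hypothetical policy, already split into sections by the practitioner.
policy_sections = [
    ("1. Definitions", "Terms used in this policy have the meanings set out below."),
    ("7. Exclusions", "This policy does not cover loss caused by gradual deterioration."),
    ("12. Claims Procedure", "Notify the insurer within 30 days of any loss."),
]

selected = select_relevant_sections(policy_sections, ["exclusion", "does not cover"])
submission = build_submission("List every exclusion in the sections below.", selected)
print(submission)
```

The resulting submission contains the question and the single relevant section, so the material that matters is not competing with the rest of the policy for the model's attention.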

The instructions themselves should be positioned prominently at the beginning of the submission rather than embedded within a lengthy context. A practitioner who positions their instructions at the start of the interaction, before the document or context material, ensures that the model's initial and strongest processing attention is directed toward understanding what the task requires before it begins engaging with the substantive material. Instructions embedded in the middle of a long context submission, or interspersed throughout the document material, receive less reliable processing attention and are more likely to be partially followed or inconsistently applied across the output.

Breaking complex document analysis tasks into multiple focused interactions, each addressing a specific and clearly bounded aspect of the overall question, is an effective response both to the lost in the middle problem and to the context window constraint discussed earlier. A practitioner who analyses a complex commercial lease by addressing the rent review provisions in one interaction, the repair and reinstatement obligations in a second, the break clause conditions in a third, and the alienation restrictions in a fourth, submitting only the relevant lease sections in each interaction, achieves two things simultaneously. First, each interaction keeps the relevant material well within the context window, with ample space for the model's full processing attention to be directed toward it. Second, each interaction eliminates the competition between the target provisions and the rest of the lease for the model's processing attention, so that the specific analytical question receives the focused engagement that produces the most reliable output.
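The lease example can be expressed as a small Python sketch for practitioners who assemble their submissions programmatically. Everything in it is an assumption made for illustration: the clause headings and wording, the questions, and the function name build_focused_requests are hypothetical, and the sketch only prepares the text of each request. Sending each request to an AI tool, in a separate interaction, is left to whatever tool the practitioner uses.

```python
def build_focused_requests(lease_sections, tasks):
    """Build one request per task, containing only that task's sections."""
    requests = []
    for task in tasks:
        missing = [name for name in task["sections"] if name not in lease_sections]
        if missing:
            raise KeyError(f"Sections not found in lease: {missing}")
        context = "\n\n".join(
            f"{name}\n{lease_sections[name]}" for name in task["sections"]
        )
        # The question comes first, then only the provisions it concerns.
        requests.append(f"{task['question']}\n\n{context}")
    return requests


# Hypothetical lease provisions, keyed by heading.
lease_sections = {
    "Clause 5: Rent Review": "The rent is reviewed every fifth year to open market value.",
    "Clause 9: Repair": "The tenant must keep the premises in good repair.",
    "Clause 14: Break Clause": "The tenant may terminate on six months' written notice.",
    "Clause 18: Alienation": "Assignment requires the landlord's prior written consent.",
}

# One bounded question per interaction (hypothetical wording).
tasks = [
    {"question": "Summarise the rent review mechanism.", "sections": ["Clause 5: Rent Review"]},
    {"question": "Summarise the repair obligations.", "sections": ["Clause 9: Repair"]},
    {"question": "List the conditions for exercising the break.", "sections": ["Clause 14: Break Clause"]},
    {"question": "Summarise the restrictions on assignment.", "sections": ["Clause 18: Alienation"]},
]

requests = build_focused_requests(lease_sections, tasks)
for request in requests:
    print(request)
    print("-" * 40)
```

Each of the four requests is short, opens with its question, and carries no provisions beyond those the question concerns.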

The lost in the middle problem also reinforces the value of the verification discipline that Section 1 established as the practitioner's primary defence against AI errors in professional work. An output that appears to comprehensively address a long document analysis question but that was produced from a submission where critical material was positioned in the middle of a lengthy context requires verification against the actual document to confirm that the relevant provisions were identified and accurately represented. The practitioner cannot determine from reading the output alone whether the middle sections of the submitted document received reliable processing attention. Only comparison against the source material reveals whether the analysis is complete or whether the lost in the middle effect has produced a selective output that presents itself as comprehensive.
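One narrow part of this comparison can be supported by a simple script, sketched here in Python under stated assumptions. The sketch assumes that the source document refers to its provisions in the form "Clause 7.2" and that the AI output cites provisions in the same form; the sample texts and the function names find_clause_refs and unmentioned_clauses are hypothetical. A check of this kind can only flag provisions the output never mentions. It cannot confirm that a mentioned provision was represented accurately, so it supplements the practitioner's reading of the source rather than replacing it.

```python
import re


def find_clause_refs(text):
    """Return the set of clause numbers written like 'Clause 7.2' in the text."""
    return set(re.findall(r"\bClause\s+(\d+(?:\.\d+)*)", text, flags=re.IGNORECASE))


def unmentioned_clauses(source_text, output_text):
    """List clause numbers that appear in the source but not in the output."""
    missing = find_clause_refs(source_text) - find_clause_refs(output_text)
    # Sort numerically so that 14.2 comes after 3.1.
    return sorted(missing, key=lambda ref: tuple(int(p) for p in ref.split(".")))


# Hypothetical source document and AI output.
source = (
    "Clause 3.1 excludes wear and tear. "
    "Clause 14.2 excludes flood damage. "
    "Clause 27.4 excludes war risks."
)
output = "The policy excludes wear and tear (Clause 3.1) and war risks (Clause 27.4)."

missing = unmentioned_clauses(source, output)
print(missing)  # ['14.2']
```

In this example the omitted provision is the one positioned in the middle of the source, which is the kind of gap the output itself would not reveal.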