1.2

What a Large Language Model Actually Is

15 min

Large language models are the most common form of AI that professionals work with today, and they are the specific systems that drove the capability shift described in Module 1.1. A practitioner who understands what a large language model actually is as a system will understand why it behaves the way it does, what it can reliably do, what it cannot reliably do, and how to work with it effectively. A practitioner who does not understand the underlying system will either over-trust it (treating its outputs as authoritative when they are not) or under-trust it (avoiding it for tasks where it would actually perform well). Both failures are expensive, and both can be avoided by a working understanding of what these systems are.

The System Itself

A large language model is a very large neural network that has been trained to predict the next token in a sequence of text. The network is typically built on the transformer architecture described in Module 1.1. The training process involves showing the network enormous quantities of text (hundreds of billions to trillions of tokens drawn from books, articles, web pages, and other written material) and adjusting its internal parameters so that it produces accurate next-token predictions on the training data. The parameters are the numbers that determine how the network processes its inputs. A current-generation large language model has tens or hundreds of billions of parameters, which is why these systems are called large.

What the trained network contains is not a database of facts, not a set of rules, and not a copy of its training data. What it contains is a very large set of parameter values that together encode statistical patterns across language. When the model processes a request, it uses these parameters to compute how likely each possible token is to come next given everything that has come before. Token by token, the model produces its response by selecting a likely next token (often with a controlled amount of randomness), outputting that token, and then predicting the next token given the now-extended sequence. This is the fundamental mechanism of how current large language models work, and every property of their behaviour traces back to it.
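The predict, output, and extend loop described above can be sketched with a toy model. This is a minimal illustration rather than how a real large language model is built: word-pair counts stand in for the neural network, whole words stand in for tokens, the model looks only at the last word instead of the full sequence, and the training text and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, prompt, max_new_tokens=3):
    """Extend a non-empty prompt one word at a time, taking the likeliest next word."""
    sequence = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(sequence[-1])
        if not followers:
            break  # no word ever followed this one in the training text
        next_word = followers.most_common(1)[0][0]
        sequence.append(next_word)  # the prediction becomes part of the input
    return " ".join(sequence)


# Hypothetical training text, chosen only to keep the counts easy to follow.
TRAINING_TEXT = (
    "the lawyer reviewed the contract and the lawyer reviewed the lease "
    "and the lawyer signed"
)
COUNTS = train_bigram_counts(TRAINING_TEXT)

print(generate(COUNTS, "contract"))  # contract and the lawyer
```

Nothing is looked up in this sketch: the output is computed from counts, and each predicted word is appended to the sequence before the next prediction is made.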

The practical consequences of this mechanism shape how the systems behave. The model does not look anything up. It computes predictions based on patterns learned during training. It does not have a concept of what it knows or does not know, because its outputs are computed rather than retrieved from a structured store. It can produce outputs that sound authoritative on topics where it has incomplete or inaccurate training data, because authoritative-sounding language is a pattern it has learned. It can handle topics that were not explicitly in its training data by drawing on patterns from related topics, which is why these systems generalise well. These properties emerge from the underlying mechanism rather than from any specific design choice, and they are stable across the current generation of these systems.

Tokens

A token is the unit of text that the model processes. The model does not read letter by letter or word by word in the sense that a human reader does. It reads tokens, which are small pieces of text that may be whole words, parts of words, punctuation marks, or special symbols. The sentence "the lawyer reviewed the contract" might be broken into five tokens, one for each word. The word "reviewing" might be broken into two tokens: "review" and "ing." Less common words are typically broken into more tokens than common words, because the system's vocabulary is finite and long or unusual words must be represented as combinations of shorter pieces.

Tokens matter to practitioners for two reasons. The first is that many commercial AI tools are priced by token usage, which means the cost of a task depends on how many tokens are in the input and the output. The second is that context windows, described below, are measured in tokens. A practitioner who wants to understand how much text they can send to a model needs to know roughly how many tokens their text will become. A rough working rule is that one token corresponds to about four characters or three-quarters of a word in English text, though the exact ratio varies with the specific model and the specific content.
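The working rule above can be turned into a rough estimator. This is a sketch under stated assumptions: the four-characters-per-token ratio is only the rule of thumb from the text, the function names are invented for the example, and the prices are made-up placeholders rather than any vendor's actual rates. An exact count requires the tokeniser of the specific model being used.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Estimate a token count from character count using the rough 4:1 rule."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical figures: a 40,000-character contract and a 2,000-token summary,
# priced at made-up rates of 3.00 and 15.00 per million tokens.
contract_tokens = estimate_tokens("x" * 40_000)
cost = estimate_cost(contract_tokens, 2_000, input_price=3.00, output_price=15.00)
print(contract_tokens, round(cost, 4))
```

Input and output are priced separately in this sketch because many tools charge different rates for each; whether a given tool does so is something to check in its own pricing documentation.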

Context Windows

The context window is the bounded amount of text that the model can consider at once when producing its output. Everything the model processes in a single interaction (the request, any attached documents, the conversation history, and the model's own output as it generates) counts against the context window. At about three-quarters of a word per token, a context window of one hundred thousand tokens means the model can consider roughly seventy-five thousand words of text in a single interaction before it runs out of space.
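The budgeting involved can be sketched as simple arithmetic. The function names and the token counts below are illustrative assumptions, and the words-per-token ratio is the same rough rule of thumb given earlier rather than an exact conversion.

```python
def approximate_words(tokens, words_per_token=0.75):
    """Convert a token count to an approximate English word count."""
    return int(tokens * words_per_token)


def remaining_tokens(window, request, documents, history, reserved_output):
    """Tokens left after everything that counts against the window."""
    return window - (request + documents + history + reserved_output)


# Hypothetical interaction: all figures are illustrative token counts.
left = remaining_tokens(
    window=100_000,
    request=500,
    documents=60_000,
    history=8_000,
    reserved_output=4_000,
)
print(left, approximate_words(left))
```

Space is reserved for the output because the model's own response counts against the same window as the input.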

Context windows have grown substantially over the past several years. Early commercial language models had context windows of a few thousand tokens, which was enough for short conversations but too small for most document work. Current commercial models typically have context windows of one hundred thousand tokens or more, and some have windows of several hundred thousand or more than a million tokens. The size of the context window determines how much material the model can work with in a single interaction. A contract that fits within the context window can be reviewed in one pass. A document that exceeds the context window has to be processed in sections, which degrades the model's ability to understand the document as a whole.

Context windows also matter because everything within them influences what the model produces. If the practitioner provides extensive background material at the start of a conversation and then asks a question twenty exchanges later, the background material is still within the context window and still shapes the model's response. If the conversation exceeds the context window, older material falls out, and the model loses access to it. The practical implication is that practitioners benefit from providing relevant context deliberately and from being aware of what has and has not been kept in the conversation.
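One simple way older material can fall out of a conversation is sketched below. This is an assumption about a possible policy rather than a description of any particular product: real tools differ, and some summarise older turns or refuse the request instead of silently dropping messages. The function names and the character-based token count are invented for the example.

```python
def rough_token_count(text):
    """Very rough token estimate: about four characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens, count_tokens=rough_token_count):
    """Keep the most recent messages that fit in max_tokens, dropping the oldest."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older falls out of the window
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical conversation: three messages of about 10 tokens each.
history = ["a" * 40, "b" * 40, "c" * 40]
print(len(trim_history(history, max_tokens=25)))
```

Under this policy, background material supplied at the very start of a long conversation is the first thing to be lost, which is why it can be worth restating important context later on.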

Inference

Inference is the process of the model running in response to a request. When a practitioner sends a question to a large language model and receives a response, the model is performing inference. This is distinct from training, which is the much more expensive process of adjusting the model's parameters based on large quantities of text. Training happens once (or periodically, when the model is updated), typically takes weeks or months, and requires thousands of specialised computers running in parallel. Inference happens every time a user makes a request, typically takes seconds, and can be performed on more modest hardware.

The distinction between training and inference matters because it explains why large language models behave as they do. During inference, the model's parameters are fixed. The model does not learn from the current interaction. Anything the practitioner says during a conversation does not update the model's parameters or change its underlying behaviour. The current interaction is, for the model, just another sequence of tokens to predict continuations for. If the same practitioner returned a week later and asked the same question, the model would compute its response from scratch using the same parameters. Anything it appeared to have learned during the previous conversation is gone from the model's perspective, though the conversation history may be preserved in the interface the practitioner is using.

This also explains why corrections to the model during a conversation do not have a lasting effect. If the practitioner points out that the model gave a wrong answer, the model will adjust its behaviour within the current conversation (the correction becomes part of the context window and influences subsequent predictions), but its parameters will not change. A different user asking the same question the next day is likely to get the same wrong answer unless the model's parameters have been updated through training, which is a separate and much more expensive process described below under fine-tuning.
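The difference between a correction held in the context and a change to the parameters can be illustrated with a deliberately simplified toy. The dictionary here is only a stand-in for fixed parameters, used to keep the example short; a real model stores no such table of answers. The class name, the "correction:" message format, and the example question are all invented for the illustration.

```python
class FrozenToyModel:
    """Toy stand-in: the dictionary plays the role of parameters fixed by training."""

    def __init__(self, parameters):
        self._parameters = dict(parameters)

    def answer(self, question, context):
        # A correction earlier in the same conversation shapes the answer...
        prefix = f"correction: {question} -> "
        for turn in reversed(context):
            if turn.startswith(prefix):
                return turn[len(prefix):]
        # ...otherwise the answer comes from the unchanged parameters.
        return self._parameters.get(question, "unknown")


model = FrozenToyModel({"notice period": "30 days"})  # hypothetical wrong answer

conversation = ["correction: notice period -> 60 days"]
print(model.answer("notice period", conversation))  # 60 days
print(model.answer("notice period", []))            # 30 days
```

The second call starts from an empty context, as a new conversation would, and so returns the original answer: nothing in the first conversation altered what the model itself holds.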

Hallucination

Hallucination is the tendency of a large language model to produce outputs that sound plausible but are factually wrong. The model might invent a case citation, attribute a quotation to a person who did not say it, produce statistics that have no source, or describe events that did not occur. The outputs sound confident because confident language is a pattern the model has learned, and they are sometimes indistinguishable from accurate outputs without independent verification.

Hallucination is a direct consequence of how the underlying mechanism works rather than a bug that engineers will eventually fix. The model produces outputs by predicting the next most likely token given the context. It has no separate mechanism for checking whether its output is true, because it has no concept of truth as distinct from pattern-matching plausibility. If the most plausible next token extends a factually wrong statement, the model will produce that token, because plausibility is what the training process optimised for.

Different kinds of content are more or less prone to hallucination. Content where the model has seen many similar examples during training tends to be more reliable, because the patterns are well-established. Content that requires specific factual claims the model did not encounter extensively during training (obscure case citations, specific statistics, recent events after the training cutoff, details about particular individuals) is more prone to hallucination, because the model is extrapolating from weaker patterns. Content involving reasoning chains is moderately prone to hallucination, because the model may produce a plausible-sounding reasoning chain that contains a logical leap or a false premise.

The practical implication for professionals is significant. Hallucination means that a large language model cannot be treated as an authoritative source of facts. Specific claims the model makes about named sources, statistics, events, or people need to be verified independently if they matter. This reflects a property of what the models are rather than a criticism of them or a reason to avoid using them, and professional practice with these systems is built around accommodating the property. The evaluation and reasoning control discipline developed later in this programme is the applied answer to the hallucination problem, and it works when the practitioner understands why the problem exists and what it looks like in practice.

Fine-Tuning

Fine-tuning is the process of taking a pre-trained large language model and training it further on a specific dataset to adjust its behaviour for a particular purpose. A model pre-trained on general web text might be fine-tuned on a firm's internal documents to produce outputs that match the firm's voice and conventions. A model pre-trained on general language might be fine-tuned on legal contracts to improve its performance on legal analysis tasks. Fine-tuning updates the model's parameters, which means the changes persist across future inferences rather than being confined to a single conversation.
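The idea that further training changes the stored parameters, and therefore the model's default output, can be shown by analogy with a toy word-pair counter. This is not how fine-tuning of a neural network is actually performed (that involves gradient-based updates to the network's weights); the counts simply play the role of parameters, and both training texts are invented for the example.

```python
from collections import Counter


def train(counts, text):
    """Update word-pair counts in place; the counts play the role of parameters."""
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts.setdefault(current, Counter())[following] += 1
    return counts


def most_likely_next(counts, word):
    """Return the word that most often followed `word`, or None if unseen."""
    followers = counts.get(word)
    return followers.most_common(1)[0][0] if followers else None


# "Pre-training" on hypothetical general text.
counts = train({}, "yours sincerely yours sincerely yours faithfully")
before = most_likely_next(counts, "yours")

# "Fine-tuning" on hypothetical firm text changes the stored counts themselves.
train(counts, "yours faithfully yours faithfully yours faithfully")
after = most_likely_next(counts, "yours")

print(before, after)  # sincerely faithfully
```

Because the second round of training altered the counts, every later prediction reflects the firm's preferred sign-off without anyone having to ask for it in the request.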

Fine-tuning is much less expensive than training a model from scratch, because the model starts with the general capabilities learned during pre-training and only needs to be specialised for the target domain. It is still substantially more expensive than inference, and it requires technical expertise, curated training data, and infrastructure that most professional firms do not have in-house. Most professionals will not perform fine-tuning themselves, though they may interact with fine-tuned models through products that firms or vendors have built.

A practical example illustrates what fine-tuning produces. Consider a law firm that wants its associates to use a language model to draft client memos in the firm's specific style. A general-purpose language model will produce competent memos in a generic style, with vocabulary and structure drawn from its broad training data. The firm could fine-tune the model on a curated collection of several thousand of its own past memos, together with some metadata about what each memo was for. The resulting fine-tuned model, when asked to draft a memo, will produce outputs that reflect the firm's characteristic vocabulary, its structural conventions, its typical treatment of common issues, and the tonal register it prefers. The fine-tuning does not make the model more intelligent or more accurate in absolute terms. It makes the model's default outputs more aligned to this specific firm's expectations, which reduces the editing work the associates need to do on each draft.

Understanding what fine-tuning is matters because it distinguishes durable improvements in a model's behaviour (which require fine-tuning or similar training-based adjustments) from temporary improvements within a single conversation (which do not persist beyond the current context window). A practitioner who tells a general-purpose model what voice to use is achieving a temporary adjustment. A firm that fine-tunes a model is achieving a durable one.

Retrieval-Augmented Generation

Retrieval-augmented generation, often shortened to RAG, is a technique for connecting a large language model to external information at inference time. Instead of relying only on what the model learned during training, a RAG system retrieves relevant documents from an external source (a firm's document database, a search index, a knowledge base) and adds them to the context window before the model produces its response. The model then produces its output using both its trained knowledge and the retrieved documents, which allows it to draw on information that was not in its training data and to cite specific sources.
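The retrieve-then-add-to-context flow can be sketched as follows. This is a minimal illustration under stated assumptions: simple keyword overlap stands in for the search component (production systems commonly use more sophisticated retrieval), the knowledge base and document names are hypothetical, and the final call to a language model is omitted because it depends on the tool in use.

```python
def overlap_score(query, document):
    """Count how many distinct query words also appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)


def retrieve(query, knowledge_base, top_k=2):
    """Return up to top_k (name, text) pairs that share words with the query."""
    scored = [
        (overlap_score(query, text), name, text)
        for name, text in knowledge_base.items()
    ]
    relevant = [item for item in scored if item[0] > 0]
    relevant.sort(key=lambda item: item[0], reverse=True)
    return [(name, text) for _score, name, text in relevant[:top_k]]


def build_prompt(query, retrieved):
    """Place the retrieved documents in the context ahead of the question."""
    sources = "\n".join(f"[{name}] {text}" for name, text in retrieved)
    return (
        "Answer using the sources below and cite them by name.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base with invented document names and contents.
KNOWLEDGE_BASE = {
    "memo_a": "pricing strategy for retail clients",
    "memo_b": "supply chain risk review",
    "memo_c": "retail pricing benchmarks",
}

question = "retail pricing approach"
print(build_prompt(question, retrieve(question, KNOWLEDGE_BASE)))
```

The model itself is untouched in this design: all that changes is the text placed in the context window, and labelling each source by name is what makes citation possible.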

RAG addresses several of the practical limitations of large language models simultaneously. It allows the model to work with information that postdates its training cutoff, because the retrieval happens at inference time. It reduces hallucination on topics covered by the retrieved documents, because the model has direct access to the relevant source material rather than having to extrapolate from training patterns. It allows the model to cite specific sources, which supports verification and defensibility. It allows a single base model to be adapted to different knowledge bases without fine-tuning, by simply connecting different retrieval sources.

A practical example shows how RAG changes what a model can do. Consider a consulting firm that has built a knowledge base from its case library, past deliverables, internal methodology documents, and client industry research. A general-purpose language model cannot access any of this material, because the firm's internal documents were not part of the model's training data. A RAG system built on top of the same model can. When a consultant asks a question about how the firm has previously approached a particular type of engagement, the retrieval component searches the knowledge base, identifies the most relevant past deliverables and methodology documents, and adds them to the context window. The model then produces its answer drawing on those specific documents, and it can cite which documents informed which parts of its answer. The model's underlying capability is unchanged, and the information it can work with is substantially expanded.

RAG does not solve all the problems of large language models. The retrieval component has to find the right documents, and if it retrieves the wrong documents the model's output will reflect that. The model still produces its output through the same prediction mechanism, which means it can still hallucinate about material not covered by the retrieved documents. The quality of a RAG system depends on the quality of the retrieval, the quality of the underlying knowledge base, and the quality of the base model, and weaknesses in any of these components appear in the final output. Many of the AI tools professionals encounter in their work, including tools that search the web, tools that retrieve from firm document management systems, and tools that draw on internal databases, are RAG systems built on top of large language models. When a practitioner uses such a tool and the output quality feels inconsistent, the cause is often in the retrieval rather than in the model itself, and understanding the architecture helps diagnose what is actually going wrong.