How the Transformer Works

The transformer is worth understanding at a basic mechanism level because its properties explain what current AI tools can and cannot do, why they work the way they do, and why some of the professional practices developed later in this programme are necessary. The explanation below is simplified, but it is accurate in the ways that matter for a practitioner.

A transformer processes text as a sequence of tokens. A token is a small unit of text, typically a word or a fragment of a word. The sentence "the lawyer reviewed the contract" might be broken into five tokens, one for each word. Longer or less common words are often broken into multiple tokens. The transformer does not read text the way a human reader does, letter by letter or word by word in a single pass. It represents each token as a sequence of numbers, and it operates on those numbers mathematically.
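
The splitting step described above can be sketched in a few lines of Python. This is an illustration only: the vocabulary, the fragment "indemn" and the function names tokenize and encode are all invented for this example, and production tokenizers learn their vocabularies from data rather than using a hand-written table.

```python
# Toy vocabulary. Every entry and ID here is made up for illustration.
VOCAB = {"the": 0, "lawyer": 1, "reviewed": 2, "contract": 3,
         "indemn": 4, "ification": 5, "<unk>": 6}

def tokenize(text):
    """Split on whitespace, then greedily match the longest known fragment."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining fragment first, then shrink it.
            for end in range(len(word), start, -1):
                if word[start:end] in VOCAB:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # No known fragment starts here: emit a placeholder and move on.
                tokens.append("<unk>")
                break
    return tokens

def encode(text):
    """Map each token to its integer ID, which is what the model operates on."""
    return [VOCAB[token] for token in tokenize(text)]

print(tokenize("the lawyer reviewed the contract"))
# ['the', 'lawyer', 'reviewed', 'the', 'contract']
print(tokenize("indemnification"))
# ['indemn', 'ification']
print(encode("the lawyer reviewed the contract"))
# [0, 1, 2, 0, 3]
```

The common words each map to a single token, while the longer word is split into two fragments. In a real model, each integer ID is then looked up in a table of learned vectors, which is the "sequence of numbers" the paragraph refers to.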

The transformer's central mechanism is called attention. When the model processes a token, it examines every other token in the surrounding text and assigns each of them a weight. The weights determine how much each other token influences the meaning the model assigns to the current one. When the model processes the word "bank" in a sentence that also contains "river," attention assigns a high weight to "river" and produces a representation of "bank" that reflects the water-related meaning. When the model processes the word "bank" in a sentence that contains "deposit" and "money," attention assigns high weights to those tokens and produces a representation that reflects the financial meaning. The model weights context to produce a representation of each token that reflects its meaning in this specific sentence, rather than looking up definitions in a dictionary.
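
A minimal sketch of the weighting step, assuming the widely used scaled dot-product form of attention. The two-dimensional vectors are invented so that one axis loosely stands for "water" and the other for "finance". Real models use vectors with hundreds or thousands of learned dimensions, and they pass each token through learned projections before comparing them, which this sketch omits.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return the attention weights and the blended vector for one token."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical vectors: axis 0 is roughly "water", axis 1 is roughly "finance".
vectors = {"the": [0.1, 0.1], "river": [3.0, 0.0], "bank": [1.0, 1.0]}
tokens = ["the", "river", "bank"]
keys = [vectors[t] for t in tokens]

# Simplification: the raw vectors serve as queries, keys and values alike.
weights, blended = attend(vectors["bank"], keys, keys)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
print("representation of 'bank' in context:", blended)
```

With these made-up numbers, "river" receives the largest weight, and the resulting vector for "bank" leans towards the water axis. Replacing "river" with a vector that points along the finance axis would pull the representation the other way.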

The model performs this attention computation many times in parallel, with each attention operation learning to pay attention to different kinds of relationships. One operation may be tracking which words refer to which other words. Another may be tracking what tense the sentence is in. Another may be tracking logical connections between clauses. The model combines all of these attention operations, and the combination produces a rich representation of the meaning of the text.
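
The idea of several attention operations running side by side and then being combined can be sketched as follows. This is a simplification under stated assumptions: each "head" here simply works on its own slice of the token vector, whereas real models give each head its own learned projection matrices. The function names and the example vectors are invented for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(queries, keys, values):
    """Scaled dot-product attention for every token within a single head."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head(embeddings, num_heads):
    """Run several heads independently, then concatenate their outputs."""
    dim = len(embeddings[0])
    assert dim % num_heads == 0, "vector size must divide evenly across heads"
    size = dim // num_heads
    per_head = []
    for h in range(num_heads):
        # Simplification: a head sees only its own slice of each vector.
        part = [e[h * size:(h + 1) * size] for e in embeddings]
        per_head.append(head_attention(part, part, part))
    # Join the per-head results back into one vector per token.
    return [sum((head[t] for head in per_head), []) for t in range(len(embeddings))]

# Three hypothetical tokens, each a 4-number vector, processed by 2 heads.
embeddings = [[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.2, 0.9],
              [1.0, 1.0, 0.7, 0.1]]
combined = multi_head(embeddings, num_heads=2)
print(combined)
```

Because each head computes its own weights, two heads can attend to different tokens for the same input, which is what allows them to specialise in different kinds of relationship during training.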

Training a large language model is conceptually simple, even though the computational scale is enormous. The model is shown a very large quantity of text, often measured in hundreds of billions of tokens drawn from books, articles, web pages, and other written sources. For each position in the text, the model is asked to predict the token that appears at that position. The model makes a prediction based on all the tokens that come before the position. If the prediction is wrong, the model's internal parameters are adjusted slightly to make the correct prediction more likely. This process repeats across the entire training dataset, typically for weeks or months on thousands of specialised computers running in parallel.
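
The predict-and-adjust loop can be shown at toy scale. The sketch below makes two loud simplifications: it predicts each token from the single preceding token rather than from all preceding tokens, and its "dataset" is one five-word sentence. It assumes gradient descent on a cross-entropy loss, which is the standard approach, although the learning rate and step count here are arbitrary choices for the illustration.

```python
import math

text = "the lawyer reviewed the contract".split()
vocab = sorted(set(text))
index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

# Parameters: one adjustable score for every (previous token, next token) pair.
logits = [[0.0] * V for _ in range(V)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Average penalty for the probability assigned to each correct next token."""
    pairs = list(zip(text, text[1:]))
    return sum(-math.log(softmax(logits[index[p]])[index[n]]) for p, n in pairs) / len(pairs)

def train(steps=200, learning_rate=0.5):
    for _ in range(steps):
        for prev, nxt in zip(text, text[1:]):
            probs = softmax(logits[index[prev]])
            for j in range(V):
                # Nudge each score: up for the correct token, down for the others.
                target = 1.0 if j == index[nxt] else 0.0
                logits[index[prev]][j] -= learning_rate * (probs[j] - target)

def predict(prev):
    row = logits[index[prev]]
    return vocab[row.index(max(row))]

loss_before = average_loss()
train()
loss_after = average_loss()
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print("after 'lawyer' the model predicts:", predict("lawyer"))
```

The loss falls as training proceeds, but it does not reach zero: "the" is followed by both "lawyer" and "contract" in the training text, so the best the model can do is split its probability between them. A full-scale model faces the same situation across billions of contexts.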

What emerges from this training is a system that has learned the statistical structure of language at a depth that produces substantively useful behaviour. The model has learned grammar and sentence structure. It has learned the common patterns of argument and explanation that appear in the texts it was trained on. It has learned associations between words, concepts, and the contexts where they appear. Crucially, it has learned patterns that extend beyond the specific examples it was trained on, which is why it can produce reasonable outputs on requests it has never seen before.

The limit on how much text a model can process at once is called the context window. A context window of one hundred thousand tokens means the model can consider roughly seventy thousand words of text in a single interaction. This includes everything the professional provides in the request, any documents they attach, the conversation history, and the model's own output. Context windows have grown substantially over the past several years, and they remain a practical constraint on how AI tools can be used. A document that exceeds the context window cannot be processed in full in a single interaction.
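
The arithmetic behind this constraint can be sketched as a simple budget check. The ratio of roughly seven words per ten tokens is taken from the figures above and is only an average for English text. The function name, the word counts and the output reserve are hypothetical, and a real tool would count tokens with the model's own tokenizer rather than estimate them from words.

```python
# Assumed average from the text: about 7 words for every 10 tokens.
WORDS_PER_10_TOKENS = 7

def estimate_tokens(words):
    """Estimate tokens from a word count, rounding up (integer arithmetic)."""
    return (words * 10 + WORDS_PER_10_TOKENS - 1) // WORDS_PER_10_TOKENS

def fits_in_context(word_counts, context_window, reserved_for_output):
    """Everything shares one budget: request, documents, history and output."""
    used = sum(estimate_tokens(words) for words in word_counts.values())
    total = used + reserved_for_output
    return total <= context_window, total

# Hypothetical interaction with a 100,000-token context window.
interaction = {"request": 500, "attached contract": 60_000, "conversation history": 2_000}
ok, total = fits_in_context(interaction, context_window=100_000, reserved_for_output=4_000)
print(ok, total)
# True 93288

# A longer attachment pushes the same interaction over the limit.
interaction["attached contract"] = 80_000
ok, total = fits_in_context(interaction, context_window=100_000, reserved_for_output=4_000)
print(ok, total)
```

The point of the example is that the attached document is not the only thing consuming the budget: the request, the earlier conversation and the space reserved for the reply all count against the same limit.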

When the model produces output, it does so one token at a time. It computes a probability for each possible next token given everything that has come before, selects one (typically from among the most likely), outputs that token, and then repeats the process on the now-extended context. This sequential generation is why AI outputs appear in the conversational interface as though being typed. The model is literally producing text one token at a time, with each new token computed from the full sequence of tokens that preceded it.
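
The generation loop itself is short enough to sketch directly. In the example below, next_token_probs is a hand-written stand-in for a trained model, and its probabilities are invented. The loop always picks the single most likely token, which is one simple strategy. Deployed systems commonly introduce some randomness at this step, which is one reason the same request can yield different outputs.

```python
def next_token_probs(context):
    """Hypothetical stand-in for a trained model: context in, probabilities out."""
    last = context[-1]
    if last == "the":
        # The prediction depends on the whole context, not only the last token.
        if "reviewed" in context:
            return {"contract": 0.8, "lawyer": 0.2}
        return {"lawyer": 0.7, "contract": 0.3}
    if last == "lawyer":
        return {"reviewed": 0.9, "<end>": 0.1}
    if last == "reviewed":
        return {"the": 0.95, "<end>": 0.05}
    return {"<end>": 1.0}

def generate(prompt, max_new_tokens=10):
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        # Greedy choice: take the single most likely next token.
        token = max(probs, key=probs.get)
        if token == "<end>":
            break
        # The chosen token becomes part of the context for the next step.
        context.append(token)
    return context

print(generate(["the"]))
# ['the', 'lawyer', 'reviewed', 'the', 'contract']
```

Note that "the" appears twice and is followed by a different word each time. The second prediction differs because the context by then contains "reviewed", which illustrates how each new token is computed from everything that preceded it.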

This mechanism has specific implications that matter for professional practice. The model does not have a database of facts. It has statistical patterns from its training data that often produce factually correct outputs and sometimes produce plausible-sounding outputs that are not accurate. The model does not know what it knows, because its predictions are computed rather than retrieved from a structured knowledge store. The model's outputs can be influenced substantially by the specific wording and context provided in the request, because the predictions depend on everything the model sees. These properties are consequences of the underlying mechanism rather than defects of the current implementation that engineers will eventually fix, and professional practice with AI tools is built around accommodating them.