How Large Language Models Actually Work: A Technical Introduction

Large language models (LLMs) like GPT-4, Claude, and Gemini are neural networks. Understanding what that means in practice, and what distinguishes them from earlier machine learning systems, requires looking at a specific architecture called the Transformer and at the training process that produces these models.

The Transformer Architecture

Nearly all modern LLMs are based on the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. (Google).

The core mechanism: self-attention. Self-attention lets a model weigh the importance of different tokens (words or word pieces) in a sequence relative to each other when generating a prediction. When processing the sentence “The bank where I put my money was by the river bank,” self-attention lets the model represent the two uses of “bank” differently by attending to the surrounding context (“money” for the first, “river” for the second).

In technical terms: each token is represented as a vector (embedding); the attention mechanism computes query, key, and value vectors for each token; the attention score between two tokens is the dot product of one token’s query vector with the other token’s key vector, scaled by the square root of the key dimension and normalised with a softmax; the output for each token is a weighted sum of value vectors, where the weights are the attention scores. This allows the model to capture long-range dependencies (relationships between words separated by many others), which recurrent neural networks (RNNs) struggled with.

The layers: a Transformer model stacks multiple layers, each pairing an attention sublayer with a feed-forward network that transforms the representations. Different layers learn to attend to different aspects of the input. GPT-3, for example, has 96 layers; the layer counts of GPT-4 and Claude 3 Opus have not been published. The parameters (weights) of these attention layers and feed-forward networks are what is “learned” during training. GPT-3 has 175 billion parameters, and the largest published models have hundreds of billions.
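The attention computation described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not how any production model is implemented: it assumes the query, key, and value vectors are already given (real models compute them from the embeddings with learned projection matrices), it omits masking and multiple heads, and the function names and toy vectors are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])  # dimension of the key vectors
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key, then scale.
        scores = [
            sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy 2-dimensional vectors for three tokens (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = self_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors, with more weight on tokens whose keys align with the current token's query.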

Training

Pre-training: the initial training phase, in which the model learns from a vast corpus of text (Wikipedia, books, web text, code, and other sources, measured in trillions of tokens for recent models). The objective: next-token prediction. Given the text “The cat sat on the ___”, predict what comes next. The model adjusts its parameters (using gradients computed by backpropagation) to minimise the error in these predictions. This process, applied across trillions of tokens, pushes the model to develop internal representations of language, facts, reasoning, and world knowledge. This is the most computationally expensive phase: pre-training GPT-3 took approximately 3.14×10²³ floating-point operations (FLOPs), and a widely cited third-party estimate put the cost at approximately $4.6 million.

Fine-tuning and RLHF: after pre-training, raw LLMs are not useful as assistants. They will continue whatever pattern they see, including harmful or misleading ones. Additional training phases address this.

Supervised fine-tuning (SFT): the model is trained on examples of desirable behaviour (question → answer pairs written by humans).

Reinforcement Learning from Human Feedback (RLHF): a reward model is trained on human preferences (human labellers rank multiple model outputs from best to worst); the LLM is then trained with reinforcement learning to maximise the reward model’s score. This is what makes the model behave as a “helpful” and “aligned” assistant rather than a raw text predictor.

Constitutional AI (Anthropic): an alignment technique that reduces dependence on human feedback. The model is given a set of principles (a “constitution”) and trained via self-critique: it generates outputs, critiques them against the constitution, and revises them, and the revised outputs are then used as training data. This reduces reliance on human labellers for the critique step.
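The next-token prediction objective used in pre-training can be illustrated with a toy calculation. The sketch below assumes a hypothetical three-word vocabulary and made-up scores (logits) for the blank in “The cat sat on the ___”; it computes the cross-entropy loss for the correct token and applies one gradient step. For simplicity the step is applied directly to the logits, whereas a real model backpropagates the same gradient through its layers to update its weights.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_index])


def gradient_step(logits, target_index, learning_rate):
    """One gradient-descent step applied to the logits.

    The gradient of the cross-entropy loss with respect to the logits is
    (predicted probabilities - one-hot target).
    """
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]


# Hypothetical vocabulary and scores, chosen only for illustration.
vocab = ["mat", "dog", "moon"]
logits = [1.0, 0.5, 0.2]
target = vocab.index("mat")

loss_before = cross_entropy(logits, target)
updated_logits = gradient_step(logits, target, learning_rate=0.5)
loss_after = cross_entropy(updated_logits, target)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The step raises the score of the correct token and lowers the others, so the loss falls. Pre-training repeats this kind of update over an enormous number of tokens.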

What LLMs Can and Cannot Do

What emerges from scale: capabilities that were not explicitly trained for but appeared when models reached sufficient scale, such as arithmetic, translation, code generation, logical reasoning, and few-shot learning (performing a new task from a few examples in the prompt). These “emergent capabilities” are one of the most surprising findings in LLM research.

The context window: the maximum number of tokens the model can process at once. GPT-3 had 2,048; GPT-4 Turbo has 128,000; Claude 3 has 200,000. Larger context windows enable processing of longer documents, entire codebases, or extended conversations.

The fundamental limitation: at their core, LLMs are statistical pattern matchers that predict what text is likely to follow given the context. They do not have access to information beyond their training data and the current prompt (without external tools); they do not have persistent memory between conversations (without explicit memory mechanisms); they can confabulate (generate plausible-sounding but incorrect information), especially when asked about facts outside their training distribution. The distinction between reasoning and pattern matching remains an active research area. Current evidence suggests LLMs do perform forms of reasoning, but the nature and reliability of that reasoning are not fully understood.
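To make the context window concrete: an application that sends a long conversation to a model has to keep the total within the token limit. The sketch below shows one simple strategy, dropping the oldest messages first. It is only an illustration under stated assumptions: count_tokens is a crude stand-in that counts whitespace-separated words (real models use subword tokenizers, so actual counts differ), and fit_to_context is a hypothetical helper, not part of any library.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is exceeded.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


conversation = [
    "Hello, can you summarise this report for me?",
    "Sure, please paste the report text.",
    "Here is the first section of the report.",
]

# With a budget of 15 "tokens", the oldest message no longer fits.
trimmed = fit_to_context(conversation, max_tokens=15)
print(trimmed)
```

Anything dropped this way is invisible to the model, which is one reason larger context windows matter for long documents and extended conversations.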
