1. Building an LLM (like ChatGPT): A Three-Stage Process
- Pre-training:
- Data Acquisition & Processing: This stage involves downloading a massive amount of text from the internet (e.g., Common Crawl) and filtering it rigorously. The goal is to have a large, diverse, and high-quality dataset of text (e.g., the FineWeb dataset is ~44TB). This involves URL filtering (removing unwanted sites), text extraction (getting the relevant text from HTML), language filtering (e.g., focusing on English), and PII removal.
- Tokenization: Raw text is converted into a sequence of "tokens," which are essentially unique IDs representing chunks of text. This is crucial because neural networks operate on numerical sequences. GPT-4, for instance, uses a vocabulary of around 100,277 tokens. The Byte Pair Encoding (BPE) algorithm is used to create these tokens. Tools like Tiktokenizer let you explore this (see the tokenization sketch after this list).
- Neural Network Training: A neural network (specifically, a Transformer) is trained to predict the next token in a sequence, given a "context window" of previous tokens (e.g., 8,000 tokens). The network has billions of parameters (weights) that are initially random. During training, these parameters are iteratively adjusted to improve the model's predictions, matching the statistical patterns of the training data. The "loss" is a key metric that indicates how well the model is performing (a training-step sketch follows this list). Training is computationally intensive, requiring massive data centers with specialized hardware (e.g., Nvidia H100 GPUs).
- Inference: Once trained, the model can generate text by repeatedly predicting the next token, sampling from the probability distribution it outputs (see the sampling-loop sketch after this list). This is how ChatGPT generates responses. It's stochastic (probabilistic), so the same input can lead to different outputs.
- Base Model Output: The product of this stage is a base model: an internet-text simulator that imitates the token sequences it was trained on.
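To make the tokenization step concrete, here is a minimal sketch using the open-source tiktoken library; cl100k_base is the roughly 100k-token vocabulary associated with GPT-4. The example text and printed IDs are illustrative.

```python
# Minimal tokenization sketch using tiktoken's cl100k_base encoding (GPT-4's vocabulary).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello world, this is tokenization."
token_ids = enc.encode(text)          # text -> list of integer token IDs
print(token_ids)
print(enc.decode(token_ids))          # round-trips back to the original text
print(len(token_ids), "tokens for", len(text), "characters")
```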
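The next-token prediction objective itself can be sketched in a few lines of PyTorch. The tiny embedding-plus-linear "model" below is only a stand-in for a real Transformer, and the data is random; the input/target shift and the cross-entropy loss are the point.

```python
# Minimal sketch of the next-token prediction objective in PyTorch.
# The "model" is a toy stand-in for a Transformer; shapes and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, context_len, batch_size = 1000, 8, 4

model = nn.Sequential(
    nn.Embedding(vocab_size, 64),   # token IDs -> vectors
    nn.Linear(64, vocab_size),      # vectors -> logits over the whole vocabulary
)

tokens = torch.randint(0, vocab_size, (batch_size, context_len + 1))  # fake training tokens
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens up to t

logits = model(inputs)                                        # (batch, context, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()      # gradients nudge the weights toward better predictions
print(loss.item())   # the "loss" metric: lower means better next-token predictions
```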
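Inference is then just a sampling loop: feed the tokens so far, get a probability distribution over the next token, sample from it, append, and repeat. This sketch reuses the toy model above; temperature is a standard knob controlling how sharply the distribution is sampled.

```python
# Minimal autoregressive sampling loop; stochastic, so repeated calls give different outputs.
import torch

def generate(model, prompt_tokens, max_new_tokens=20, temperature=1.0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        x = torch.tensor([tokens])                        # (1, sequence_length)
        logits = model(x)[0, -1] / temperature            # logits for the next token only
        probs = torch.softmax(logits, dim=-1)             # probability distribution
        next_token = torch.multinomial(probs, num_samples=1).item()   # sample, don't argmax
        tokens.append(next_token)
    return tokens

print(generate(model, prompt_tokens=[1, 2, 3]))
```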
- Post-training (Supervised Fine-tuning - SFT):
- Creating Conversational Datasets: The goal is to turn the "base model" (which is good at imitating internet text) into a helpful "assistant." This involves creating a dataset of conversations between a human and an assistant. Human labelers (guided by detailed instructions) write both prompts and ideal assistant responses. Modern approaches use LLMs to help create these datasets, but human oversight is still crucial.
- Encoding Conversations as Tokens: Conversations are encoded as token sequences using special tokens (e.g., <|im_start|>, <|im_end|>) and role markers (user, assistant) to denote turns and structure (see the sketch after this list).
- Fine-tuning: The base model is further trained on this conversational dataset. This is computationally much less expensive than pre-training. The model learns to imitate the style and behavior of a helpful assistant.
- The Output: An assistant model that converses in the style of the human-written example conversations it was fine-tuned on.
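A minimal sketch of the conversation-encoding idea, assuming ChatML-style special tokens (<|im_start|>, <|im_end|>); real chat templates differ in their exact special tokens, but the idea is the same: turns are flattened into one long sequence that is then tokenized and trained on like any other text.

```python
# Minimal sketch: flattening a conversation into a ChatML-style string before tokenization.
# The exact special tokens vary by model; these are illustrative.
def render_chat(messages):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")   # at inference time, the model continues from here
    return "".join(parts)

conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
    {"role": "user", "content": "What if it were 3 + 3?"},
]
print(render_chat(conversation))
```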
- Reinforcement Learning (RL):
- Motivation: SFT models are limited by the quality of human-provided examples. Humans may not know the best way for an LLM to solve a problem (its "cognition" is different). RL lets the model discover optimal strategies.
- Process: The model generates many possible solutions to a prompt. For verifiable domains (like math, where there's a clear correct answer), solutions are scored (e.g., is the answer correct?). Solutions that lead to correct answers are "reinforced" – the model is trained to be more likely to produce similar solutions in the future. This is akin to "practice problems" in a textbook (a sketch of this loop follows this list).
- Emergent Reasoning: Remarkably, RL can lead to emergent reasoning abilities. Models learn to "think out loud," re-evaluate steps, try different approaches, etc. – all without being explicitly programmed to do so (e.g., the DeepSeek-R1 model). This is a critical development.
- Unverifiable Domains (e.g., creative writing): It's harder to score solutions (how do you objectively rate a joke?). Reinforcement Learning from Human Feedback (RLHF) is used: a separate "reward model" (another neural network) is trained to imitate human preferences (e.g., rankings of jokes), and the main LLM is then trained to maximize the reward model's score (see the reward-model sketch after this list). However, RLHF has limitations. The reward model can be "gamed" with adversarial examples, so it is useful but is best thought of as a band-aid rather than true RL: optimization against the reward model must be cut off early, because running it too long yields outputs that score highly without actually being good.
- Key Point: RL allows the model to potentially surpass human capabilities, as it's not constrained by imitating human solutions.
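The verifiable-domain loop described above can be sketched as: sample many rollouts for a prompt, keep the ones whose final answer checks out, and train on those so similar token sequences become more likely. Everything model-related here (sample_solution, train_on) is a hypothetical stand-in; the score-and-reinforce structure is the point.

```python
# Minimal sketch of RL on a verifiable task (e.g., math with a known answer).
# sample_solution and train_on are stand-ins for real model calls and gradient updates.
import random

def sample_solution(prompt: str) -> tuple[str, int]:
    """Stand-in for the model generating a chain of thought plus a final answer."""
    answer = random.choice([3, 4, 5])                 # toy: the model is sometimes right
    return f"reasoning steps for '{prompt}'", answer

def train_on(solutions: list[str]) -> None:
    """Stand-in for a gradient update that makes these token sequences more likely."""
    print(f"reinforcing {len(solutions)} correct solution(s)")

prompt, correct_answer = "What is 2 + 2?", 4
candidates = [sample_solution(prompt) for _ in range(16)]         # many independent rollouts
good = [sol for sol, ans in candidates if ans == correct_answer]  # verifiable reward: check the answer
train_on(good)
```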
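For unverifiable domains, the reward model can be sketched as a small network trained with a pairwise ranking loss on human preference data: the response the labeler preferred should receive the higher score. The embed stand-in and the tiny reward_model are assumptions made for illustration; the logsigmoid ranking loss is the standard Bradley-Terry-style formulation.

```python
# Minimal sketch (PyTorch) of a reward model trained on one human preference judgment.
# embed is a stand-in for featurizing (prompt, response); a real reward model reads the tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

def embed(prompt: str, response: str) -> torch.Tensor:
    """Stand-in features; illustrative only."""
    return torch.randn(128)

prompt = "Write a joke about pelicans."
score_preferred = reward_model(embed(prompt, "the joke the labeler ranked first"))
score_other = reward_model(embed(prompt, "the joke the labeler ranked lower"))

# Pairwise ranking loss: push the preferred response's score above the other one.
loss = -F.logsigmoid(score_preferred - score_other).mean()
loss.backward()
print(loss.item())
```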
2. LLM "Psychology" and Limitations
- Hallucinations: LLMs make up facts because they are trained to produce text that statistically resembles their training data, not to be perfectly truthful. Mitigations include:
- Factuality Training: Adding examples where the correct response is "I don't know" when the model is uncertain.
- Tool Use: Allowing the model to access external tools (like web search) to retrieve information (see the sketch below).
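A minimal sketch of the tool-use loop, with hypothetical <SEARCH> markers and a run_search stand-in (neither is a real API): when the model emits a tool-call marker, the surrounding harness runs the tool and pastes the result back into the context before generation continues.

```python
# Minimal sketch of a tool-use harness. The <SEARCH> markers, run_search, and toy_model
# are illustrative stand-ins; the point is that retrieved text is appended to the context
# window, where the model can read it directly instead of relying on its parameters.
def run_search(query: str) -> str:
    """Stand-in for a real web-search tool."""
    return f"(search results for: {query})"

def answer_with_tools(model_generate, prompt: str) -> str:
    context = prompt
    while True:
        output = model_generate(context)
        if "<SEARCH>" in output:                                      # the model asked for a tool call
            query = output.split("<SEARCH>")[1].split("</SEARCH>")[0]
            context += output + "\n" + run_search(query) + "\n"       # inject results into the context
        else:
            return output                                             # final answer, no more tool calls

def toy_model(context: str) -> str:
    """Stand-in model: asks for a search first, then answers from the injected results."""
    if "search results" not in context:
        return "<SEARCH>recent weather in Paris</SEARCH>"
    return "Based on the retrieved results, here is the answer."

print(answer_with_tools(toy_model, "What was the weather in Paris yesterday?"))
```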
- Working Memory vs. Vague Recollection: Information in the context window is directly accessible (like working memory). Information stored in the model's parameters is a vague recollection (like something read long ago). Prompting strategies should leverage the context window whenever possible.
- Knowledge of Self: LLMs don't have a persistent sense of self. Questions like "What model are you?" are nonsensical. Responses are based on patterns in the training data (e.g., hallucinating that they are ChatGPT) or are explicitly programmed (system messages, hardcoded training examples).
- Computational Capabilities: LLMs need "tokens to think." They can't do complex computations in a single token prediction. Solutions should be spread out over multiple tokens. Leaning on tools (like a code interpreter) is preferable to relying on the model's "mental arithmetic."
- Counting and Spelling: LLMs struggle with counting and spelling tasks because they operate on tokens, not characters. Again, using tools helps (see the sketch after this list).
- Swiss Cheese Model: LLMs can be brilliant in some areas and inexplicably fail in others. This is unpredictable.
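A quick way to see the token-versus-character mismatch behind the counting and spelling failures, using the same tiktoken library as above (any BPE tokenizer shows the same effect):

```python
# The model's view of "strawberry" is a handful of token IDs, not individual letters,
# which is why character-level questions like "how many r's?" are surprisingly hard for it.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]
print(token_ids, pieces)                               # the model's view: a few chunk IDs
print(word.count("r"), "r's at the character level")   # trivial with ordinary code (a tool)
```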
3. Future Capabilities
- Multimodality: LLMs will handle text, audio, and images natively.
- Agents: Longer-running tasks, performing complex jobs with human supervision.
- Pervasive and Invisible: Integrated into many tools.
- Action-Taking: Performing actions on a computer (e.g., mouse and keyboard).
- Test-Time Training: Allowing models to keep learning while users interact with them, leading to increased capability.
In essence, LLMs like ChatGPT are incredibly powerful tools, but they are statistical simulators, not sentient beings. Understanding their training process, limitations, and potential helps users interact with them effectively.