How LLMs Work

January 26, 2026

AI in the form of LLMs has become part of everyday life for billions of people around the world. Each day there seems to be a new breakthrough technology or model. Honestly, it is insane how fast things are moving. Since I use AI both personally and at work, I figured I'd better know the ins and outs of how it works. So let's jump into it.

What is an LLM?

LLM stands for Large Language Model. At its core, an LLM is a type of neural network trained on massive amounts of text data. The "large" part refers to both the size of the training data and the number of parameters (think of these as knobs the model can tune) in the network. Modern LLMs like Claude have hundreds of billions of parameters.

But here's the thing that took me a while to wrap my head around: LLMs don't "understand" language the way we do. They're incredibly sophisticated pattern recognition and prediction machines. Given some input text, they predict what text should come next. That's the fundamental operation. Everything else—answering questions, writing code, summarizing documents—emerges from this simple mechanism executed at massive scale.

Tokens

Before an LLM can process your text, it needs to break it down into smaller pieces called tokens. A token isn't exactly a word—it's more like a chunk of text that the model has learned to recognize as a meaningful unit.

For example, the word "understanding" might be broken into two tokens: "under" and "standing". Common words like "the" are typically single tokens. Rare words or technical jargon get split into multiple pieces. On average, one token is roughly 3/4 of a word in English.

Why does this matter? Because LLMs have limits on how many tokens they can process at once. When you see "200K context window" for a model, that's referring to tokens, not words. Understanding tokens helps you work more effectively with these tools—you start to intuit why some prompts work better than others.
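To make the idea concrete, here's a toy greedy longest-match tokenizer. Real LLM tokenizers use byte-pair encoding (BPE) with vocabularies learned from data; the tiny vocabulary here is entirely made up for illustration.

```python
# Hypothetical vocabulary -- real tokenizers learn tens of thousands of entries.
VOCAB = {"under", "standing", "stand", "ing", "the", "cat"}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Greedily take the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("understanding"))  # ['under', 'standing']
```

Note how "understanding" splits into two tokens while "the" stays whole, matching the behavior described above.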

Every token gets converted into a high-dimensional vector (basically a long list of numbers) called an embedding. These embeddings capture the meaning and relationships between tokens. Words with similar meanings end up with similar embeddings. This is how the model represents language in a form it can actually compute with.
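The "similar meanings end up with similar embeddings" claim is usually measured with cosine similarity. A minimal sketch, using tiny 4-dimensional vectors I made up (real models use hundreds or thousands of dimensions):

```python
import math

# Hypothetical embeddings -- the numbers are illustrative, not from a real model.
embeddings = {
    "cat":    [0.9, 0.1, 0.30, 0.00],
    "kitten": [0.8, 0.2, 0.35, 0.05],
    "stock":  [0.0, 0.9, 0.10, 0.80],
}

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["cat"], embeddings["kitten"]))  # high, near 1
print(cosine(embeddings["cat"], embeddings["stock"]))   # much lower
```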

Training

Training an LLM happens in stages, each building on the last.

Pre-training

The first stage is pre-training. This is where the model learns language itself by consuming enormous amounts of text: books, websites, code repositories, scientific papers. The training objective is simple: predict the next token. Given "The cat sat on the", predict "mat" (or "couch" or "floor").

The model does this billions of times, adjusting its parameters slightly each time to get better at prediction. Through this process, it implicitly learns grammar, facts, reasoning patterns, and even some common sense knowledge. It's not being explicitly taught these things—they emerge from the pressure to predict text accurately.
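The simplest possible version of "predict the next token" is a count-based bigram model. Real pre-training adjusts billions of neural-network parameters by gradient descent, but the objective, maximizing the probability of the token that actually came next, is the same idea in miniature:

```python
from collections import Counter, defaultdict

# Tiny made-up "training corpus".
corpus = "the cat sat on the mat the cat sat on the couch".split()

# Count which token follows which.
following: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(token: str) -> str:
    # Return the continuation seen most often during "training".
    return following[token].most_common(1)[0][0]

print(predict("the"))  # 'cat' -- seen twice; 'mat' and 'couch' once each
```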

Pre-training is computationally expensive. We're talking thousands of GPUs running for months, costing tens of millions of dollars. This is why only a handful of organizations can train frontier models from scratch, and why so many new datacenters are being built.

Fine-tuning

After pre-training, you have a model that's good at predicting text but not necessarily good at being helpful. It might continue a conversation by predicting what a typical internet commenter would say rather than providing a thoughtful response.

Fine-tuning addresses this. The model is trained on carefully curated examples of high-quality responses. For instruction-following models, this often involves examples of questions paired with helpful, accurate answers. The model learns to not just predict likely text, but to predict good text.

RLHF (Reinforcement Learning from Human Feedback)

This is where things get interesting. In RLHF, human raters compare different model outputs and indicate which one is better. This feedback trains a "reward model" that can predict what humans would prefer. The LLM is then optimized to produce outputs that score highly according to this reward model.
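The reward model is typically trained with a pairwise preference loss (a Bradley-Terry style objective): given the reward scores for the response a human preferred and the one they rejected, the loss is small when the model already agrees with the human. A minimal sketch with made-up reward values:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): near zero when the chosen
    # response out-scores the rejected one, large when it doesn't.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))  # small: reward model agrees with the rater
print(preference_loss(0.0, 2.0))  # large: disagreement, strong training signal
```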

RLHF is what makes modern assistants feel helpful rather than just capable. It's the difference between a model that can answer your question and one that wants to answer it well.

The Transformer Architecture

Nearly all modern LLMs are built on the transformer architecture, introduced in 2017. The key innovation is the attention mechanism, which allows the model to look at all parts of the input simultaneously and decide which parts are most relevant for each prediction.

Before transformers, models processed text sequentially—one word at a time. This made it hard to capture long-range dependencies. If a pronoun at the end of a paragraph referred to something mentioned at the beginning, older models struggled to make that connection.

Attention solves this by letting the model directly compare any token to any other token, regardless of distance. When generating a response, the model can "attend to" the most relevant parts of your prompt, even if they appeared hundreds of tokens earlier.

The transformer processes input through multiple layers, each containing attention mechanisms and feed-forward neural networks. As information flows through these layers, the model builds increasingly abstract representations of the text, eventually producing a probability distribution over what token should come next.
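The core of the attention mechanism can be written in a few lines. This is scaled dot-product attention on tiny hand-made vectors, in pure Python for readability (real implementations use batched tensor math on GPUs): each query scores every key, the scores become weights via softmax, and the output is a weighted mix of the value vectors.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output: attention-weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three tokens; the query lines up with the first key, so the first
# value dominates the output -- distance in the sequence doesn't matter.
q = [[1.0, 0.0]]
k = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(attention(q, k, v))
```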

Context Windows

The context window is how much text the model can "see" at once. Early LLMs had tiny context windows—maybe 2,000 tokens. This meant they'd forget the beginning of a long document by the time they reached the end.

Modern models have dramatically expanded this. Large AI labs such as Anthropic and OpenAI now have models that can handle 200,000+ tokens. That's enough to fit entire codebases or books. This changes what's possible. You can have long, nuanced conversations without the model losing track. You can ask questions about lengthy documents without summarizing first.

But context windows aren't unlimited, and there are subtleties. Models don't attend to all parts of the context equally well—information in the middle sometimes gets less attention than information at the beginning or end. And more context means slower, more expensive inference. It's a tradeoff.

Inference: How Responses Are Generated

When you send a prompt to an LLM, here's what happens:

  1. Tokenization: Your text is converted into tokens
  2. Embedding: Each token becomes a high-dimensional vector
  3. Forward pass: These vectors flow through the transformer layers
  4. Prediction: The model outputs a probability distribution over all possible next tokens
  5. Sampling: A token is selected from this distribution
  6. Repeat: Steps 2-5 repeat with the new token added, until the model produces an end token or hits a length limit
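The loop above can be sketched in a few lines. Here `next_token_distribution` is a hypothetical stand-in for the entire transformer forward pass, hard-coded so the example runs on its own:

```python
import random

def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    # Fake "model": after "the", prefer "cat" over "dog"; otherwise stop.
    if tokens and tokens[-1] == "the":
        return {"cat": 0.7, "dog": 0.3}
    return {"<end>": 1.0}

def generate(prompt_tokens: list[str], max_tokens: int = 10) -> list[str]:
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)        # forward pass + prediction
        choices, weights = zip(*dist.items())
        token = random.choices(choices, weights)[0]   # sampling
        if token == "<end>":                          # stop condition
            break
        tokens.append(token)                          # repeat with the new token
    return tokens

print(generate(["the"]))  # ['the', 'cat'] or ['the', 'dog']
```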

The sampling step is where "temperature" comes in. At temperature 0, the model always picks the most likely token—deterministic but potentially repetitive. Higher temperatures introduce randomness, making outputs more creative but potentially less coherent. Finding the right temperature for your use case is part of the art of prompting.
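Temperature works by dividing the model's raw scores (logits) before the softmax. The logits below are made up, but the effect is the general one: low temperature sharpens the distribution toward the top token, high temperature flattens it. (Temperature 0 is handled as a special case in practice, since dividing by zero is undefined; the model simply takes the argmax.)

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # sharply peaked on the top token
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness
```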

Why This Matters

Understanding how LLMs work isn't just academic curiosity. It helps you use them better.

Knowing about tokens helps you write more efficient prompts. Understanding training explains why models have certain knowledge cutoffs and biases. Grasping the attention mechanism explains why putting important context at the beginning or end of your prompt can improve results. Context window limits inform how you structure long interactions.

These models aren't magic. They're impressive engineering achievements with specific strengths and limitations. The more you understand the mechanics, the better you can collaborate with them.

And honestly, the more I learn about how these systems work, the more impressed I am that they work at all. The fact that next-token prediction at scale produces systems capable of writing code, analyzing documents, and engaging in nuanced conversation is wild. We're just getting started figuring out what's possible.