LLM Fundamentals Quiz — Study Guide
LLM Fundamentals: How Large Language Models Work
Large Language Models (LLMs) like GPT-4 and Claude power everything from coding assistants to chatbots. Understanding how they work under the hood — from how text is broken up, to how the model decides what word comes next — makes you a far more effective developer and user. This guide covers the core concepts you'll need to ace the LLM Fundamentals Quiz.
Tokenisation and BPE
Before an LLM can process text, it must split the text into tokens: sub-word chunks, each mapped to a numeric ID the model understands.
What is a Token?
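As a rule of thumb, a token is roughly ¾ of a word on average. Common words often map to a single token, while longer or rarer words are split into several sub-word pieces, for example:
["un", "believ", "able"]
Byte-Pair Encoding (BPE)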
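BPE is one of the most common tokenisation algorithms. It works by starting from individual characters (or bytes) and repeatedly merging the most frequent adjacent pair into a single new token:
Initial: ["h", "e", "l", "l", "o"]
After merges: ["he", "ll", "o"] → ["hell", "o"] → ["hello"]
This lets the model handle rare or made-up words by breaking them into known sub-pieces.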
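The merge loop can be sketched in a few lines of Python. This is a minimal illustration, not a production tokeniser: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the three-word corpus are assumptions made for the example, and real implementations usually work on bytes and weight words by frequency.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    """Start from characters and apply up to `num_merges` greedy merges."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges, words

# Hypothetical toy corpus: "hello" appears twice, "help" once.
merges, words = train_bpe(["hello", "hello", "help"], num_merges=4)
print(merges)
print(words)
```

In this toy corpus the frequent word "hello" ends up as a single token after four merges, while the rarer "help" stays split into known sub-pieces.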
Transformers and Attention
The Transformer architecture (introduced in 2017) is the backbone of nearly every modern LLM.
The Attention Mechanism
Attention lets the model focus on relevant parts of the input when generating each token. The core formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

| Symbol | Meaning |
|---|---|
| Q (Query) | What the current token is "looking for" |
| K (Key) | What each token "offers" to match against |
| V (Value) | The actual information to retrieve |
| √d_k | Scaling factor (d_k is the key dimension) that keeps the dot products from growing so large that softmax saturates and gradients vanish |
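The formula can be traced by hand with a small pure-Python sketch. It assumes a single attention head, no masking, and tiny hand-picked Q, K and V matrices chosen only for illustration; real models use batched tensor libraries rather than nested lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) · V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Illustrative 2-token example: each query matches "its own" key best.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a blend of the value vectors, weighted towards the value whose key best matches the query.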
Positional Encoding
Transformers process all tokens simultaneously (not sequentially), so they need a way to know token order. Positional encodings inject position information, either as fixed sine/cosine waves or as learned embeddings, so the model knows "this is the 5th token, that is the 12th."
Parameters and Knowledge Storage
An LLM's parameters are the millions or billions of numerical weights learned during training. Think of them as the model's "memory": they encode what the model has learned from its training data.
A 7-billion parameter model has 7 × 10⁹ individual numbers. Knowledge is stored implicitly in these weights — there's no explicit database. This is why LLMs can "know" facts but also hallucinate: the knowledge is distributed and probabilistic, not looked up.
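To get a feel for what 7 × 10⁹ numbers means in practice, the arithmetic below estimates how much memory the weights alone occupy. The byte sizes (4 bytes for 32-bit floats, 2 bytes for 16-bit floats) are assumptions about the storage format, and the helper name `weight_memory_gb` is made up for this example; activations and other runtime memory are not counted.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

params_7b = 7 * 10**9

fp32_gb = weight_memory_gb(params_7b, 4)  # 32-bit floats: 4 bytes each
fp16_gb = weight_memory_gb(params_7b, 2)  # 16-bit floats: 2 bytes each

print(f"fp32: {fp32_gb} GB, fp16: {fp16_gb} GB")
```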
Autoregressive Generation and the Context Window
Autoregressive Means One Token at a Time
LLMs generate text by predicting the next token, then feeding that token back in to predict the one after, and so on:
Input: "The cat sat on the"
Step 1: → "mat"
Input: "The cat sat on the mat"
Step 2: → "."
Each new token depends on all previous tokens. This is what autoregressive means.
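The loop itself is simple enough to sketch. In the example below, `TOY_MODEL` and `predict_next` are hypothetical stand-ins for a real model: a lookup table keyed on the full token history, which mirrors the idea that each prediction depends on everything generated so far.

```python
# Hypothetical "model": maps the full token history to the next token.
TOY_MODEL = {
    ("The", "cat", "sat", "on", "the"): "mat",
    ("The", "cat", "sat", "on", "the", "mat"): ".",
}

def predict_next(tokens):
    """Stand-in for a real LLM forward pass; unknown contexts end generation."""
    return TOY_MODEL.get(tuple(tokens), "<eos>")

def generate(prompt_tokens, max_new_tokens):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens

result = generate(["The", "cat", "sat", "on", "the"], max_new_tokens=5)
print(result)
```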
Context Window
The context window is the maximum number of tokens the model can "see" at once (input and output combined). Tokens outside the context window are simply forgotten: the model has no access to them.
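One common way applications cope with this limit is to drop the oldest tokens so that the prompt plus the expected output still fits. The sketch below assumes that keep-the-most-recent strategy; the function name `fit_to_context` and its parameters are invented for the example, and other strategies (such as summarising older text) exist.

```python
def fit_to_context(tokens, context_window, reserved_for_output=0):
    """Keep only the most recent tokens that fit in the window.

    `reserved_for_output` sets aside room for the tokens the model will
    generate, since input and output share the same window.
    """
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("no room left for input tokens")
    return tokens[-budget:]

history = list(range(10))  # pretend these are 10 token IDs
visible = fit_to_context(history, context_window=8, reserved_for_output=2)
print(visible)
```

Here the four oldest tokens fall outside the window and are never seen by the model.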
Sampling and Temperature
Once the model computes probabilities for the next token, it must sample one. This is where temperature and sampling strategies come in.
Temperature
Temperature controls randomness:

| Temperature | Effect | Use Case |
|---|---|---|
| 0 | Always picks the highest-probability token (greedy/deterministic) | Factual Q&A, code |
| 0.7 | Balanced creativity | General chat |
| 1.0+ | High randomness, surprising outputs | Creative writing |
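Under the hood, temperature is typically applied by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below assumes that formulation and treats temperature 0 as a special case meaning greedy selection; the logit values and function names are illustrative only.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Return a token index; temperature 0 means greedy (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]               # hypothetical scores for 3 tokens
sharp = apply_temperature(logits, 0.5)  # low temperature: more peaked
flat = apply_temperature(logits, 2.0)   # high temperature: more uniform
print(sharp, flat, sample_token(logits, 0))
```

Lower temperatures concentrate probability on the top token, while higher temperatures spread it across the alternatives.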
Sampling Strategies
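Top-p (nucleus) sampling is adaptive: it keeps the smallest set of tokens whose cumulative probability reaches p, so it uses fewer candidates when the model is confident and more when it's uncertain.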
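The adaptive behaviour of top-p is easy to see in a small sketch. The function below assumes the usual nucleus-sampling rule (keep the highest-probability tokens until their cumulative probability reaches p, then renormalise); the name `top_p_filter` and the two probability lists are made up for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p.

    Returns a dict mapping kept token index -> renormalised probability.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = top_p_filter([0.90, 0.05, 0.03, 0.02], p=0.9)  # one clear winner
uncertain = top_p_filter([0.30, 0.25, 0.25, 0.20], p=0.9)  # no clear winner
print(len(confident), len(uncertain))
```

With the same p, the confident distribution keeps a single candidate while the uncertain one keeps all four.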
Inference, KV-Cache, and Performance
Inference
Inference is the process of running a trained model to generate output (as opposed to training, which updates weights). Every time you send a prompt to an LLM API, you're performing inference.
KV-Cache
During autoregressive generation, a naive implementation would recompute the attention Keys and Values for every previous token at every step, which is expensive. The KV-cache stores the K and V matrices already computed so they don't need to be recalculated:
Without KV-cache: Recompute K and V for ALL tokens every step → O(n²) cost
With KV-cache: Only compute K and V for the NEW token → much faster
This is a critical optimization that makes real-time LLM responses practical.
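The saving can be made concrete by counting how many K/V computations each approach performs. In this sketch, `project_kv` is a hypothetical placeholder for the real key/value projections, and only the number of calls is measured, not actual attention.

```python
def project_kv(token):
    """Placeholder for the real K and V projections of one token."""
    return ("K:" + token, "V:" + token)

def kv_work_without_cache(tokens):
    """Recompute K and V for every token seen so far at each step."""
    computations = 0
    for step in range(1, len(tokens) + 1):
        kv = [project_kv(t) for t in tokens[:step]]
        computations += len(kv)
    return computations

def kv_work_with_cache(tokens):
    """Compute K and V once per token and keep them in a cache."""
    cache, computations = [], 0
    for token in tokens:
        cache.append(project_kv(token))  # only the new token is projected
        computations += 1
    return computations, cache

tokens = ["The", "cat", "sat", "on", "the"]
naive = kv_work_without_cache(tokens)
cached, cache = kv_work_with_cache(tokens)
print(naive, cached)
```

For n tokens the naive count grows as n(n + 1)/2, while the cached count grows linearly with n.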
Performance and Scale
Scale refers to increasing model size (parameters), training data, or compute. Research has shown that scaling up these three dimensions predictably improves performance; this is captured by scaling laws.
Emergent abilities are capabilities that appear suddenly at certain scales: they're absent in smaller models but appear in larger ones without being explicitly trained. Examples include multi-step reasoning and few-shot learning. These abilities are called "emergent" because they weren't directly optimized for; they arise from scale alone.
Key Takeaways
Attention (softmax(QK^T / √d_k) · V) lets the model relate all tokens to each other; the √d_k scaling prevents gradient issues.