LLM Fundamentals Quiz — Study Guide

LLM Fundamentals: How Large Language Models Work

Large Language Models (LLMs) like GPT-4 and Claude power everything from coding assistants to chatbots. Understanding how they work under the hood — from how text is broken up, to how the model decides what word comes next — makes you a far more effective developer and user. This guide covers the core concepts you'll need to ace the LLM Fundamentals Quiz.

Tokenisation and BPE

Before an LLM can process text, it must convert that text into tokens: sub-word chunks, each mapped to an integer ID the model can work with.

What is a Token?

A token is roughly ¾ of a word on average. As a rule of thumb:

1,000 tokens ≈ 750 words in English

A single word like "unbelievable" might be split into ["un", "believ", "able"]

Short common words like "the" or "cat" are usually one token each
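The rule of thumb above can be turned into a quick estimate. This is only a rough sketch: the `estimate_tokens` helper is hypothetical, it counts whitespace-separated words, and the 0.75 ratio is the English-language approximation from this guide. A real tokeniser will give different counts.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (heuristic only)."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


sample = " ".join(["word"] * 750)  # a 750-word dummy text
print(estimate_tokens(sample))     # 1000
```

For billing or context-window checks, counting with the model's actual tokeniser is the safer choice.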

Byte-Pair Encoding (BPE)

BPE is one of the most widely used tokenisation algorithms. Training a BPE vocabulary works by:

Starting with individual characters as the vocabulary

Repeatedly merging the most frequent adjacent pair into a new token

Stopping when the vocabulary reaches a target size (e.g., 50,000 tokens)

Initial: ["h", "e", "l", "l", "o"]
After merges: ["he", "ll", "o"] → ["hell", "o"] → ["hello"]

This lets the model handle rare or made-up words by breaking them into known sub-pieces.
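The merge loop described above can be sketched in a few lines. This is a toy illustration, not a production tokeniser: the function names are made up for this example, the corpus is a hypothetical word-to-count dictionary, ties are broken by whichever pair was seen first, and the loop stops after a fixed number of merges rather than at a target vocabulary size.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a dict of word -> count."""
    words = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe({"hello": 5, "help": 3, "hold": 2}, num_merges=3)
print(merges)
```

Because tie-breaking here is arbitrary, the intermediate merges can differ from the "hello" illustration above even though the idea is the same.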

Transformers and Attention

The Transformer architecture (introduced in 2017) is the backbone of nearly every modern LLM.

The Attention Mechanism

Attention lets the model focus on relevant parts of the input when generating each token. The core formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Symbol	Meaning
Q (Query)	What the current token is "looking for"
K (Key)	What each token "offers" to match against
V (Value)	The actual information to retrieve
√d_k	Scaling factor (d_k is the key dimension) that keeps softmax out of near-zero-gradient regions

The √d_k term is critical: without it, dot products grow in magnitude as the dimension d_k increases, pushing softmax into saturated regions with near-zero gradients and making training unstable.
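To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. It assumes Q, K, and V are plain lists of row vectors and leaves out masking, batching, and multiple heads, all of which a real implementation would include.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) · V for lists of row vectors (no masking)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(row)
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # the first value dominates because q matches the first key
```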

Positional Encoding

Transformers process all tokens simultaneously (not sequentially), so they need a way to know token order. Positional encodings inject position information — either as fixed sine/cosine waves or learned embeddings — so the model knows "this is the 5th token, that is the 12th."
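As a sketch of the fixed sine/cosine option, the function below builds a table with one row per position. It follows the scheme from the original Transformer paper, including its base of 10000; the function name and the list-of-lists layout are assumptions made for this example.

```python
import math


def sinusoidal_positions(num_positions, d_model):
    """Return a num_positions x d_model table of fixed sine/cosine encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


table = sinusoidal_positions(num_positions=3, d_model=4)
print(table[0])  # [0.0, 1.0, 0.0, 1.0]
```

Each row is added to the corresponding token embedding, so two identical tokens at different positions end up with different input vectors.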

Parameters and Knowledge Storage

An LLM's parameters are the millions or billions of numerical weights learned during training. Think of them as the model's "memory" — they encode:

Grammar and syntax

World knowledge (facts, relationships)

Reasoning patterns

A 7-billion-parameter model has 7 × 10⁹ individual numbers. Knowledge is stored implicitly in these weights; there is no explicit database. This is why LLMs can "know" facts but also hallucinate: the knowledge is distributed and probabilistic, not looked up.
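The parameter count also gives a quick feel for how much memory the weights alone need. The arithmetic below is a back-of-the-envelope sketch: the 2 bytes per parameter assumes 16-bit floats, the helper name is made up for this example, and the result ignores activations and any other runtime memory.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(7_000_000_000))                     # 14.0 (16-bit weights)
print(weight_memory_gb(7_000_000_000, bytes_per_param=4))  # 28.0 (32-bit weights)
```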

Autoregressive Generation and the Context Window

Autoregressive Means One Token at a Time

LLMs generate text by predicting the next token, then feeding that token back in to predict the one after, and so on:

Input:  "The cat sat on the"
Step 1: → "mat"
Input:  "The cat sat on the mat"
Step 2: → "."

Each new token depends on all previous tokens. This is what autoregressive means.
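The loop itself is simple enough to sketch. In the example below, `TOY_MODEL` and `next_token` are hypothetical stand-ins for a real model: the toy version only looks at the last token, whereas a real LLM conditions on every token in its context.

```python
# Hypothetical stand-in for a real model: maps the last token to the next one.
TOY_MODEL = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat", "mat": "."}


def next_token(tokens):
    """Pretend model call; a real LLM would use all previous tokens."""
    return TOY_MODEL.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens, stop_token="<eos>"):
    """Predict one token at a time and feed each prediction back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == stop_token:
            break
        tokens.append(token)  # the output becomes part of the next input
    return tokens


print(generate(["The", "cat", "sat", "on", "the"], max_new_tokens=5))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```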

Context Window

The context window is the maximum number of tokens the model can "see" at once (both input and output combined). Example sizes (these vary by model version):

GPT-3.5: ~4,096 tokens

GPT-4 Turbo: ~128,000 tokens

Tokens outside the context window are simply forgotten — the model has no access to them.
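One simple way an application can stay inside the limit is to drop the oldest tokens and reserve room for the reply. The helper below is a hypothetical sketch of that policy only; real applications may instead summarise or selectively retrieve earlier content.

```python
def fit_to_context(tokens, context_window, reserved_for_output):
    """Keep only the most recent tokens that fit alongside the output budget."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output must be smaller than context_window")
    return tokens[-budget:]


history = list(range(10))  # stand-in for 10 token IDs
print(fit_to_context(history, context_window=8, reserved_for_output=3))
# [5, 6, 7, 8, 9]
```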

Sampling and Temperature

Once the model computes probabilities for the next token, it must sample one. This is where temperature and sampling strategies come in.

Temperature

Temperature controls randomness:

Temperature	Effect	Use Case
`0`	Always picks the highest-probability token (greedy/deterministic)	Factual Q&A, code
`0.7`	Balanced creativity	General chat
`1.0+`	High randomness, surprising outputs	Creative writing

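With temperature = 0, the model always picks the top token, so the same input should produce the same output. In practice, hosted APIs can still show small variations because of floating-point and batching effects.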
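Under the hood, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below assumes that convention and treats temperature 0 as a special case, since dividing by zero is undefined; the function name and example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature 0 is treated as greedy."""
    if temperature == 0:
        # Put all probability mass on the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, and higher temperatures flatten the distribution toward uniform.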

Sampling Strategies

Top-k sampling: Only consider the top k most likely tokens

Top-p (nucleus) sampling: Consider the smallest set of tokens whose cumulative probability ≥ p (e.g., p=0.9 means pick from tokens that together cover 90% of the probability mass)

Top-p is adaptive — it uses fewer candidates when the model is confident, more when it's uncertain.
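Both strategies can be sketched as filters over a probability list. The two functions below are illustrative only: their names are made up, they return a dictionary of token index to renormalised probability, and a real sampler would then draw one token from that reduced set.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise; returns {index: prob}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.3, 0.3, 0.2, 0.2]
print(len(top_p_filter(confident, 0.9)))  # 1 candidate when the model is confident
print(len(top_p_filter(uncertain, 0.9)))  # 4 candidates when it is uncertain
```

With the same p, the confident distribution keeps a single candidate while the uncertain one keeps all four, which is the adaptive behaviour described above. Top-k would keep the same number in both cases.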

Inference, KV-Cache, and Performance

Inference

Inference is the process of running a trained model to generate output (as opposed to training, which updates weights). Every time you send a prompt to an LLM API, you're performing inference.

KV-Cache

During autoregressive generation, a naive implementation would recompute the attention Keys and Values for every previous token at every step, which is expensive. The KV-cache stores the K and V vectors already computed for earlier tokens, so each step only has to compute them for the new token:

Without KV-cache: Recompute K/V for ALL tokens every step → O(n²) attention work per step
With KV-cache:    Compute K/V for only the NEW token → O(n) attention work per step

This is a critical optimization that makes real-time LLM responses practical.
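The saving is easy to see by counting how many key/value projections each approach performs. The sketch below is a toy model of the bookkeeping only: `project_kv` is a hypothetical stand-in for the real matrix multiplications, and no actual attention is computed.

```python
def project_kv(token):
    """Hypothetical stand-in for the key/value projections of one token."""
    return (f"K({token})", f"V({token})")


def count_without_cache(tokens):
    """Recompute K and V for the whole prefix at every generation step."""
    projections = 0
    for step in range(1, len(tokens) + 1):
        _ = [project_kv(t) for t in tokens[:step]]
        projections += step
    return projections


def count_with_cache(tokens):
    """Compute K and V once per token and keep them in a cache."""
    cache, projections = [], 0
    for token in tokens:
        cache.append(project_kv(token))  # only the new token is projected
        projections += 1
        # Attention for this step would read every entry already in `cache`.
    return projections


tokens = list("hello")
print(count_without_cache(tokens))  # 15 projections (1 + 2 + 3 + 4 + 5)
print(count_with_cache(tokens))     # 5 projections, one per token
```

The trade-off is memory: the cache grows with every token generated, which is one reason long contexts are expensive to serve.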

Performance and Scale

Scale refers to increasing model size (parameters), training data, or compute. Research has shown that scaling up these three dimensions predictably improves performance — this is captured by scaling laws.

Emergent abilities are capabilities that appear suddenly at certain scales — they're absent in smaller models but appear in larger ones without being explicitly trained. Examples include multi-step reasoning and few-shot learning. These abilities are called "emergent" because they weren't directly optimized for; they arise from scale alone.

Key Takeaways

Tokenisation (BPE) breaks text into sub-word chunks; ~750 words = ~1,000 tokens in English.

Attention (softmax(QK^T / √d_k) · V) lets the model relate all tokens to each other; the √d_k scaling prevents gradient issues.

Autoregressive generation produces one token at a time, feeding each output back as input, within the limits of the context window.

Temperature = 0 is greedy decoding (always picks the top token) and is deterministic in principle; higher temperatures increase randomness. Top-p sampling dynamically selects from the minimum set of tokens covering probability mass ≥ p.

KV-cache speeds up inference by reusing previously computed attention keys and values, while emergent abilities are surprising capabilities that appear only at large scale.