LLM Fundamentals Quiz — Study Guide

LLM Fundamentals: How Large Language Models Work

Large Language Models (LLMs) like GPT-4 and Claude power everything from coding assistants to chatbots. Understanding how they work under the hood — from how text is broken up, to how the model decides what word comes next — makes you a far more effective developer and user. This guide covers the core concepts you'll need to ace the LLM Fundamentals Quiz.


Tokenisation and BPE

Before an LLM can process text, it must split the text into tokens: sub-word chunks that are each mapped to an integer ID the model can work with.

What is a Token?

A token is roughly ¾ of a word on average. As a rule of thumb:
  • 1,000 tokens ≈ 750 words in English
  • A single word like "unbelievable" might be split into ["un", "believ", "able"]
  • Short common words like "the" or "cat" are usually one token each

Byte-Pair Encoding (BPE)

BPE is the most common tokenisation algorithm. It works by:
  • Starting with individual characters as the vocabulary
  • Repeatedly merging the most frequent adjacent pair into a new token
  • Stopping when the vocabulary reaches a target size (e.g., 50,000 tokens)

    Initial: ["h", "e", "l", "l", "o"]
    After merges: ["he", "ll", "o"] → ["hell", "o"] → ["hello"]

This lets the model handle rare or made-up words by breaking them into known sub-pieces.
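
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokeniser: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this guide, and real tokenisers are trained on a large corpus and often operate on bytes rather than characters.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Start from characters and apply up to `num_merges` merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges
```

Calling `train_bpe("hello hello hello", 4)` merges the characters of each word step by step until every "hello" is a single token, while the spaces stay separate.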


Transformers and Attention

The Transformer architecture (introduced in 2017) is the backbone of nearly every modern LLM.

The Attention Mechanism

Attention lets the model focus on relevant parts of the input when generating each token. The core formula is:

    Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Symbol      Meaning
Q (Query)   What the current token is "looking for"
K (Key)     What each token "offers" to match against
V (Value)   The actual information to retrieve
√d_k        Scaling factor (d_k is the key dimension) that keeps gradients from vanishing

The √d_k term is critical: without it, dot products between high-dimensional vectors grow large in magnitude, pushing softmax into regions with near-zero gradients and making training unstable.
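
To make the formula concrete, here is a small pure-Python sketch of scaled dot-product attention for a single head. It assumes Q, K, and V are given as lists of row vectors; the helper names `softmax` and `attention` are illustrative, and real models use batched matrix libraries rather than Python loops.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) · V for lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs
```

When a query matches two keys equally well, the weights are 0.5 each and the output is the plain average of the two value vectors.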

Positional Encoding

Transformers process all tokens simultaneously (not sequentially), so they need a way to know token order. Positional encodings inject position information, either as fixed sine/cosine waves or as learned embeddings, so the model knows "this is the 5th token, that is the 12th."
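
As a rough sketch of the fixed sine/cosine variant, the function below builds the encoding vector for one position. It assumes the formulation from the original Transformer paper (even dimensions use sine, odd dimensions use cosine, with a base of 10000); the function name `sinusoidal_encoding` is made up for this guide.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sine/cosine encoding for one position."""
    encoding = []
    for i in range(d_model):
        # Each sine/cosine pair shares one wavelength; later pairs oscillate more slowly.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding
```

This vector is added to the token's embedding, so two identical tokens at different positions end up with different inputs.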


Parameters and Knowledge Storage

An LLM's parameters are the millions or billions of numerical weights learned during training. Think of them as the model's "memory". They encode:

  • Grammar and syntax
  • World knowledge (facts, relationships)
  • Reasoning patterns

A 7-billion parameter model has 7 × 10⁹ individual numbers. Knowledge is stored implicitly in these weights; there's no explicit database. This is why LLMs can "know" facts but also hallucinate: the knowledge is distributed and probabilistic, not looked up.
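
A quick back-of-the-envelope calculation shows what 7 × 10⁹ numbers means in practice. The sketch assumes common storage formats (2 bytes per weight for 16-bit floats, 4 bytes for 32-bit floats) and counts only the weights, not activations or other runtime memory; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(7e9, 2)  # 16-bit weights
fp32_gb = weight_memory_gb(7e9, 4)  # 32-bit weights
print(f"7B model: {fp16_gb:.0f} GB in 16-bit, {fp32_gb:.0f} GB in 32-bit")
```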


Autoregressive Generation and the Context Window

Autoregressive Means One Token at a Time

LLMs generate text by predicting the next token, then feeding that token back in to predict the one after, and so on:

    Input:  "The cat sat on the"
    Step 1: → "mat"
    Input:  "The cat sat on the mat"
    Step 2: → "."

Each new token depends on all previous tokens. This is what autoregressive means.
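
The generation loop itself is short. The sketch below uses a hypothetical `toy_next_token` lookup table in place of a real model (it only looks at the last token, whereas a real LLM conditions on the whole sequence); the loop in `generate` is the part that illustrates autoregression.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model: picks the next token from a lookup table."""
    table = {"the": "mat", "mat": ".", ".": "<eos>"}
    return table.get(tokens[-1], "<eos>")


def generate(prompt_tokens, next_token_fn, max_new_tokens=10, eos="<eos>"):
    """Append one predicted token at a time, feeding each one back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = next_token_fn(tokens)  # the model sees everything so far
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens
```

With the prompt `["The", "cat", "sat", "on", "the"]`, the loop appends "mat", then ".", and then stops at the end-of-sequence marker.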

Context Window

The context window is the maximum number of tokens the model can "see" at once (both input and output combined). Common sizes:
  • GPT-3.5: ~4,096 tokens
  • GPT-4 Turbo: ~128,000 tokens

Tokens outside the context window are simply forgotten: the model has no access to them.
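
Because input and output share the same budget, applications often trim the prompt before sending it. The sketch below shows one simple strategy, keeping only the most recent tokens; this is an assumption for illustration (other strategies such as summarising older text exist), and `fit_to_context` is a hypothetical helper.

```python
def fit_to_context(token_ids, context_window, reserved_for_output):
    """Keep the most recent tokens that fit once output space is reserved."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output must be smaller than context_window")
    # Older tokens at the front are dropped; the model never sees them.
    return token_ids[-budget:]
```

For example, with a window of 8 tokens and 3 reserved for the reply, a 10-token prompt is cut down to its last 5 tokens.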


Sampling and Temperature

Once the model computes probabilities for the next token, it must sample one. This is where temperature and sampling strategies come in.

Temperature

Temperature controls randomness:

Temperature   Effect                                                               Use Case
0             Always picks the highest-probability token (greedy/deterministic)    Factual Q&A, code
0.7           Balanced creativity                                                  General chat
1.0+          High randomness, surprising outputs                                  Creative writing

Temperature = 0 means the model always picks the top token, so in principle the same input gives the same response. In practice, hosted APIs can still show small run-to-run differences (for example from floating-point and batching effects).
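
Temperature is usually applied by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below illustrates this; since dividing by zero is undefined, temperature 0 is treated as a separate greedy (argmax) case. The function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("use greedy() for temperature 0")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Temperature 0: always return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])
```

For the same logits, a low temperature concentrates probability on the top token, while a high temperature spreads it across the alternatives.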

Sampling Strategies

  • Top-k sampling: Only consider the top k most likely tokens
  • Top-p (nucleus) sampling: Consider the smallest set of tokens whose cumulative probability ≥ p (e.g., p=0.9 means pick from tokens that together cover 90% of the probability mass)

Top-p is adaptive: it uses fewer candidates when the model is confident, more when it's uncertain.
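
Both strategies can be viewed as filters applied to the probability list before sampling. The sketch below is a minimal version that returns the surviving token indices with renormalised probabilities; the function names are made up for this guide.

```python
def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalise their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability is >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}
```

With p = 0.9, a confident distribution such as `[0.92, 0.05, 0.02, 0.01]` keeps a single candidate, while a flat one such as `[0.3, 0.3, 0.2, 0.2]` keeps all four.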


Inference, KV-Cache, and Performance

Inference

Inference is the process of running a trained model to generate output (as opposed to training, which updates weights). Every time you send a prompt to an LLM API, you're performing inference.

KV-Cache

During autoregressive generation, a naive implementation would recompute the attention Keys and Values for every earlier token at every step, which is expensive. The KV-cache stores these computed K and V matrices so they don't need to be recalculated:

    Without KV-cache: Recompute ALL tokens every step → O(n²) attention cost per step
    With KV-cache:    Only compute the NEW token → O(n) attention cost per step, much faster

This is a critical optimization that makes real-time LLM responses practical.
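
The toy class below illustrates the idea for a single attention head. Everything in it is a simplification for this guide: the K and V "projections" are fixed arithmetic instead of learned matrices, the query is the raw embedding, and `kv_computations` is a counter added only to show how much work each approach does.

```python
import math


class ToyAttentionLayer:
    """Single-head attention with toy projections, for illustration only."""

    def __init__(self):
        self.kv_computations = 0  # how many times K and V were computed
        self.cached_keys = []
        self.cached_values = []

    def _key_value(self, x):
        # Toy stand-ins for the learned K and V projections.
        self.kv_computations += 1
        return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]

    def _attend(self, query, keys, values):
        d_k = len(query)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

    def step_without_cache(self, embeddings):
        """Recompute K and V for every token so far, then attend from the last one."""
        keys, values = [], []
        for x in embeddings:
            k, v = self._key_value(x)
            keys.append(k)
            values.append(v)
        return self._attend(embeddings[-1], keys, values)

    def step_with_cache(self, new_embedding):
        """Compute K and V only for the new token and reuse the cached ones."""
        k, v = self._key_value(new_embedding)
        self.cached_keys.append(k)
        self.cached_values.append(v)
        return self._attend(new_embedding, self.cached_keys, self.cached_values)
```

Over a 4-token sequence, the uncached path computes K and V 1 + 2 + 3 + 4 = 10 times, while the cached path computes them 4 times and produces the same output at every step.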

Performance and Scale

Scale refers to increasing model size (parameters), training data, or compute. Research has shown that scaling up these three dimensions predictably improves performance; this is captured by scaling laws.

Emergent abilities are capabilities that appear suddenly at certain scales: they're absent in smaller models but appear in larger ones without being explicitly trained. Examples include multi-step reasoning and few-shot learning. These abilities are called "emergent" because they weren't directly optimized for; they appear to arise from scale alone.


Key Takeaways

  • Tokenisation (BPE) breaks text into sub-word chunks; ~750 words = ~1,000 tokens in English.
  • Attention (softmax(QK^T / √d_k) · V) lets the model relate all tokens to each other; the √d_k scaling prevents gradient issues.
  • Autoregressive generation produces one token at a time, feeding each output back as input, within the limits of the context window.
  • Temperature = 0 is deterministic (always picks the top token); higher temperatures increase randomness. Top-p sampling dynamically selects from the minimum set of tokens covering probability mass ≥ p.
  • KV-cache speeds up inference by reusing previously computed attention keys and values, while emergent abilities are surprising capabilities that appear only at large scale.