Data Engineering

SQL Practice Quiz

SQL Practice Quiz — Study Guide

Advanced SQL: Window Functions, Indexing, CTEs & More

Mastering SQL beyond basic SELECT statements is what separates junior developers from senior engineers. This guide covers the advanced SQL concepts you'll need to write performant, elegant queries — from window functions and recursive CTEs to indexing strategies and common NULL gotchas.


Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them like GROUP BY does.

SELECT
  employee_id,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
  ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
  DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;

Ranking Functions Compared

FunctionGaps in Sequence?Unique per Partition?
ROW_NUMBER()No gapsAlways unique
RANK()Has gaps (1,1,3)No
DENSE_RANK()No gaps (1,1,2)No
Key insight: DENSE_RANK() assigns unique sequential integers with no gaps — the answer to "which function has no gaps?"

LAG and LEAD

Access previous or next row values without a self-join:

SELECT
  order_date,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY order_date) AS prev_revenue,
  LEAD(revenue, 1) OVER (ORDER BY order_date) AS next_revenue
FROM sales;

Window Frames

Control which rows are included in the calculation:

-- Running total using a frame
SUM(revenue) OVER (
  ORDER BY order_date
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Common frame options: ROWS BETWEEN, RANGE BETWEEN, UNBOUNDED PRECEDING, CURRENT ROW, UNBOUNDED FOLLOWING.

Top-N Per Group

Use ROW_NUMBER() in a CTE or subquery to get top N rows per partition:

WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT * FROM ranked WHERE rn <= 3;


CTEs and Recursive Queries

Common Table Expressions (CTEs)

CTEs make complex queries readable by naming subqueries:

WITH high_earners AS (
  SELECT * FROM employees WHERE salary > 100000
)
SELECT department, COUNT(*) FROM high_earners GROUP BY department;

Recursive CTEs

Use WITH RECURSIVE to traverse hierarchical data (org charts, file trees) where depth is unknown:

WITH RECURSIVE org_chart AS (
  -- Base case
  SELECT id, name, manager_id, 0 AS level
  FROM employees WHERE manager_id IS NULL

UNION ALL

-- Recursive case SELECT e.id, e.name, e.manager_id, oc.level + 1 FROM employees e JOIN org_chart oc ON e.manager_id = oc.id ) SELECT * FROM org_chart;


NULL Gotchas

NULL represents unknown, not zero or empty. This causes surprising behavior:

-- This returns NO rows even if NULL exists in the subquery!
SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM blacklist);
-- If blacklist has ANY NULL, the result is always empty

Why? x NOT IN (1, NULL) evaluates as x <> 1 AND x <> NULL → always UNKNOWN.

Fix: Use NOT EXISTS instead:

SELECT * FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM blacklist b WHERE b.id = o.customer_id);

Other NULL rules:

  • NULL = NULL → FALSE (use IS NULL)
  • Aggregate functions like COUNT() ignore NULLs
  • COALESCE(value, fallback) replaces NULLs

  • Indexing & Performance

    Index Types

    Index TypeBest For
    B-TreeEquality, range scans, ORDER BY
    HashEquality only
    GIN/GiSTFull-text, arrays, JSON
    BRINVery large tables ordered by value (e.g., timestamps)
    For a 1-billion row table with range scans on a timestamp, a BRIN index is most efficient — it stores min/max per block range, using minimal space.

    Covering Index

    A covering index includes all columns needed by a query, avoiding a table lookup:

    -- Query: SELECT name, email FROM users WHERE department = 'Engineering'
    CREATE INDEX idx_covering ON users (department, name, email);
    -- All needed columns are IN the index — no heap access needed

    Partial Index

    Index only a subset of rows — great for filtering common conditions:

    CREATE INDEX idx_active_users ON users (email) WHERE active = true;
    -- Much smaller and faster than indexing all users


    EXPLAIN and Query Optimization

    EXPLAIN shows the query plan without running it. EXPLAIN ANALYZE actually executes the query and shows real timing and row counts.

    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

    Look for:

  • Seq Scan → no index used (potential problem on large tables)
  • Index Scan → using an index ✓
  • Hash Join vs Nested Loop → join strategy chosen by planner
  • Best Practices

  • Avoid SELECT * in production
  • Filter early (push WHERE conditions down)
  • Use EXISTS instead of COUNT(*) to check existence
  • Avoid functions on indexed columns in WHERE: WHERE YEAR(created_at) = 2023 kills index usage

  • Joins Deep Dive

    Join Types

    JoinReturns
    INNER JOINMatching rows only
    LEFT JOINAll left rows + matched right
    RIGHT JOINAll right rows + matched left
    FULL OUTER JOINAll rows from both sides
    CROSS JOINCartesian product
    LATERAL JOINSubquery that references outer query columns

    LATERAL Joins

    LATERAL allows a subquery to reference columns from preceding tables — like a correlated subquery in the FROM clause:

    SELECT u.name, recent.order_date
    FROM users u,
    LATERAL (
      SELECT order_date FROM orders o
      WHERE o.user_id = u.id
      ORDER BY order_date DESC LIMIT 1
    ) recent;


    Aggregation, Deduplication & Pivot

    Deduplication

    -- Keep one row per duplicate using ROW_NUMBER
    WITH deduped AS (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
      FROM users
    )
    DELETE FROM users WHERE id IN (SELECT id FROM deduped WHERE rn > 1);

    PIVOT (Conditional Aggregation)

    Most databases use CASE for pivoting:

    SELECT
      department,
      SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS male_count,
      SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS female_count
    FROM employees
    GROUP BY department;

    Date Functions

    DATE_TRUNC('month', created_at)   -- PostgreSQL
    DATE_FORMAT(created_at, '%Y-%m')  -- MySQL
    DATEADD(day, 7, order_date)       -- SQL Server


    Key Takeaways

  • DENSE_RANK() gives sequential ranks with no gaps; ROW_NUMBER() is always unique; RANK() skips numbers after ties.
  • NOT IN with NULLs always returns empty results — use NOT EXISTS as a safe alternative.
  • EXPLAIN ANALYZE actually runs the query and returns real execution stats; plain EXPLAIN only estimates.
  • Recursive CTEs are the go-to tool for hierarchical/tree-structured data of unknown depth.
  • Covering indexes and partial indexes are powerful optimizations: covering indexes eliminate table lookups; partial indexes reduce index size by targeting specific row subsets.