Data Engineering

SQL Practice Quiz

SQL Practice Quiz — Study Guide

Advanced SQL: Window Functions, Indexing, CTEs & More

Mastering SQL beyond basic SELECT statements is what separates junior developers from senior engineers. This guide covers the advanced SQL concepts you'll need to write performant, elegant queries — from window functions and recursive CTEs to indexing strategies and common NULL gotchas.

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them like GROUP BY does.

SELECT
  employee_id,
  department,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
  ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
  DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank
FROM employees;

Ranking Functions Compared

Function	Gaps in Sequence?	Unique per Partition?
`ROW_NUMBER()`	No gaps	Always unique
`RANK()`	Has gaps (1,1,3)	No
`DENSE_RANK()`	No gaps (1,1,2)	No

Key insight: DENSE_RANK() assigns unique sequential integers with no gaps — the answer to "which function has no gaps?"

LAG and LEAD

Access previous or next row values without a self-join:

SELECT
  order_date,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY order_date) AS prev_revenue,
  LEAD(revenue, 1) OVER (ORDER BY order_date) AS next_revenue
FROM sales;

Window Frames

Control which rows are included in the calculation:

-- Running total using a frame
SUM(revenue) OVER (
  ORDER BY order_date
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Common frame options: ROWS BETWEEN, RANGE BETWEEN, UNBOUNDED PRECEDING, CURRENT ROW, UNBOUNDED FOLLOWING.

Top-N Per Group

Use ROW_NUMBER() in a CTE or subquery to get top N rows per partition:

WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
  FROM employees
)
SELECT * FROM ranked WHERE rn <= 3;

CTEs and Recursive Queries

Common Table Expressions (CTEs)

CTEs make complex queries readable by naming subqueries:

WITH high_earners AS (
  SELECT * FROM employees WHERE salary > 100000
)
SELECT department, COUNT(*) FROM high_earners GROUP BY department;

Recursive CTEs

Use WITH RECURSIVE to traverse hierarchical data (org charts, file trees) where depth is unknown:

WITH RECURSIVE org_chart AS (
  -- Base case
  SELECT id, name, manager_id, 0 AS level
  FROM employees WHERE manager_id IS NULL
  UNION ALL  -- Recursive case
  SELECT e.id, e.name, e.manager_id, oc.level + 1
  FROM employees e
  JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart;

NULL Gotchas

NULL represents unknown, not zero or empty. This causes surprising behavior:

-- This returns NO rows even if NULL exists in the subquery!
SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM blacklist);
-- If blacklist has ANY NULL, the result is always empty

Why? x NOT IN (1, NULL) evaluates as x <> 1 AND x <> NULL → always UNKNOWN.

Fix: Use NOT EXISTS instead:

SELECT * FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM blacklist b WHERE b.id = o.customer_id);

Other NULL rules:

NULL = NULL → FALSE (use IS NULL)

Aggregate functions like COUNT() ignore NULLs

COALESCE(value, fallback) replaces NULLs

Indexing & Performance

Index Types

Index Type	Best For
B-Tree	Equality, range scans, ORDER BY
Hash	Equality only
GIN/GiST	Full-text, arrays, JSON
BRIN	Very large tables ordered by value (e.g., timestamps)

For a 1-billion row table with range scans on a timestamp, a BRIN index is most efficient — it stores min/max per block range, using minimal space.

Covering Index

A covering index includes all columns needed by a query, avoiding a table lookup:

-- Query: SELECT name, email FROM users WHERE department = 'Engineering'
CREATE INDEX idx_covering ON users (department, name, email);
-- All needed columns are IN the index — no heap access needed

Partial Index

Index only a subset of rows — great for filtering common conditions:

CREATE INDEX idx_active_users ON users (email) WHERE active = true;
-- Much smaller and faster than indexing all users

EXPLAIN and Query Optimization

EXPLAIN shows the query plan without running it. EXPLAIN ANALYZE actually executes the query and shows real timing and row counts.

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

Look for:

Seq Scan → no index used (potential problem on large tables)

Index Scan → using an index ✓

Hash Join vs Nested Loop → join strategy chosen by planner

Best Practices

Avoid SELECT * in production

Filter early (push WHERE conditions down)

Use EXISTS instead of COUNT(*) to check existence

Avoid functions on indexed columns in WHERE: WHERE YEAR(created_at) = 2023 kills index usage

Joins Deep Dive

Join Types

Join	Returns
INNER JOIN	Matching rows only
LEFT JOIN	All left rows + matched right
RIGHT JOIN	All right rows + matched left
FULL OUTER JOIN	All rows from both sides
CROSS JOIN	Cartesian product
LATERAL JOIN	Subquery that references outer query columns

LATERAL Joins

LATERAL allows a subquery to reference columns from preceding tables — like a correlated subquery in the FROM clause:

SELECT u.name, recent.order_date
FROM users u,
LATERAL (
  SELECT order_date FROM orders o
  WHERE o.user_id = u.id
  ORDER BY order_date DESC LIMIT 1
) recent;

Aggregation, Deduplication & Pivot

Deduplication

-- Keep one row per duplicate using ROW_NUMBER
WITH deduped AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
  FROM users
)
DELETE FROM users WHERE id IN (SELECT id FROM deduped WHERE rn > 1);

PIVOT (Conditional Aggregation)

Most databases use CASE for pivoting:

SELECT
  department,
  SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS male_count,
  SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS female_count
FROM employees
GROUP BY department;

Date Functions

DATE_TRUNC('month', created_at)   -- PostgreSQL
DATE_FORMAT(created_at, '%Y-%m')  -- MySQL
DATEADD(day, 7, order_date)       -- SQL Server

Key Takeaways

DENSE_RANK() gives sequential ranks with no gaps; ROW_NUMBER() is always unique; RANK() skips numbers after ties.

NOT IN with NULLs always returns empty results — use NOT EXISTS as a safe alternative.

EXPLAIN ANALYZE actually runs the query and returns real execution stats; plain EXPLAIN only estimates.

Recursive CTEs are the go-to tool for hierarchical/tree-structured data of unknown depth.

Covering indexes and partial indexes are powerful optimizations: covering indexes eliminate table lookups; partial indexes reduce index size by targeting specific row subsets.