Databricks Data Engineer Associate

Spark SQL & DataFrames — Quiz 3

Spark SQL & DataFrames — Quiz 3 — Study Guide

Spark SQL & DataFrames — Study Guide (Quiz 3)

Spark SQL and DataFrames are the backbone of modern big data processing. Whether you're analyzing millions of customer records or transforming raw logs into insights, knowing how to query, filter, aggregate, and join data in Spark is an essential skill. This guide covers the key concepts you'll need to ace Quiz 3 — from basic SELECT statements to window functions and JSON parsing.

Querying Basics with Spark SQL

Spark SQL lets you write familiar SQL queries against DataFrames by registering them as temporary views.

# Register a DataFrame as a temporary view
df.createOrReplaceTempView("employees")
Query it like standard SQL
spark.sql("SELECT * FROM employees")

SELECT and Filtering

Use SELECT to choose columns and WHERE to filter rows:

# Select all columns
spark.sql("SELECT * FROM employees")
Select specific columns with a filter
spark.sql("SELECT name, salary FROM employees WHERE department = 'Engineering'")
DataFrame API equivalent
df.select("name", "salary").where(df.department == "Engineering")
df.filter(df.salary > 50000)

Analogy: Think of SELECT as choosing which drawers to open, and WHERE as only looking in rooms that match your criteria.

DISTINCT and COUNT

# Get unique departments
spark.sql("SELECT DISTINCT department FROM employees")
df.select("department").distinct()
Count total rows
spark.sql("SELECT COUNT(*) FROM employees")
df.count()        # returns an integer
df.select(count("*"))  # returns a DataFrame

Note: df.count() returns the size (number of rows) as a Python integer, while count() inside select() returns a DataFrame.

Sorting and Aggregation

ORDER BY

# Sort ascending (default)
spark.sql("SELECT * FROM employees ORDER BY salary")
Sort descending
spark.sql("SELECT * FROM employees ORDER BY salary DESC")
DataFrame API
df.orderBy("salary")
df.orderBy(df.salary.desc())

AVG and Other Aggregates

from pyspark.sql.functions import avg, sum, max, min
spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department")
DataFrame API
df.groupBy("department").agg(avg("salary"), sum("salary"))

Function	Purpose
`avg()`	Average of a column
`sum()`	Total sum
`count()`	Number of rows
`max()` / `min()`	Highest / lowest value
`collect_list()`	Collects values into a list

Set Operations: UNION vs UNION ALL

# UNION removes duplicates (like DISTINCT)
df1.union(df2).distinct()
spark.sql("SELECT * FROM table1 UNION SELECT * FROM table2")
UNION ALL keeps all rows including duplicates
df1.union(df2)
spark.sql("SELECT * FROM table1 UNION ALL SELECT * FROM table2")

Key difference: UNION = unique rows only. UNION ALL = everything, faster because no deduplication step.

Column Operations

Renaming Columns

# withColumnRenamed(old_name, new_name)
df.withColumnRenamed("emp_name", "name")
SQL alias
spark.sql("SELECT emp_name AS name FROM employees")

LIKE and Pattern Matching

# Find names starting with 'A'
spark.sql("SELECT * FROM employees WHERE name LIKE 'A%'")
df.filter(df.name.like("A%"))
Contains 'son'
df.filter(df.name.like("%son%"))

Pattern	Meaning
`A%`	Starts with A
`%son`	Ends with "son"
`%son%`	Contains "son"
`_ohn`	Any single char + "ohn"

isin

# Filter rows where department is in a list
df.filter(df.department.isin("Engineering", "Marketing"))
spark.sql("SELECT * FROM employees WHERE department IN ('Engineering', 'Marketing')")

Dates, Nulls, and Conditionals

to_date — Date Conversion

from pyspark.sql.functions import to_datedf = df.withColumn("hire_date", to_date("hire_date_str", "yyyy-MM-dd"))

Handling Null Values with coalesce

coalesce returns the first non-null value from a list of columns:

from pyspark.sql.functions import coalescedf.withColumn("bonus", coalesce(df.bonus, df.default_bonus, lit(0)))

CASE Statements — Conditional Logic

spark.sql("""
  SELECT name,
    CASE WHEN salary > 100000 THEN 'High'
         WHEN salary > 50000  THEN 'Mid'
         ELSE 'Low'
    END AS salary_band
  FROM employees
""")
DataFrame API
from pyspark.sql.functions import when
df.withColumn("salary_band",
    when(df.salary > 100000, "High")
    .when(df.salary > 50000, "Mid")
    .otherwise("Low"))

Joins

# Inner join
df1.join(df2, df1.id == df2.emp_id)
Left outer join — keeps all rows from left, nulls for non-matches on right
df1.join(df2, df1.id == df2.emp_id, "left_outer")spark.sql("""
  SELECT e.name, d.dept_name
  FROM employees e
  LEFT OUTER JOIN departments d ON e.dept_id = d.id
""")

Window Functions: lead, lag, partition by

Window functions compute values across related rows without collapsing them.

from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lag, avg
window = Window.partitionBy("department").orderBy("hire_date")df.withColumn("next_hire", lead("name", 1).over(window))
df.withColumn("prev_hire", lag("name", 1).over(window))
df.withColumn("dept_avg_salary", avg("salary").over(Window.partitionBy("department")))

lead → looks forward to the next row

lag → looks backward to the previous row

partitionBy → defines the group (like GROUP BY but keeps all rows)

JSON Parsing

from pyspark.sql.functions import get_json_object
Extract a field from a JSON string column
df.withColumn("city", get_json_object(df.json_col, "$.address.city"))

Performance Tips

Temporary views (createOrReplaceTempView) are session-scoped and don't persist to disk — great for iterative queries.

UNION ALL is faster than UNION because it skips deduplication.

partitionBy in window functions limits data shuffled across nodes.

Use isin instead of multiple OR conditions for cleaner, optimized filtering.

Key Takeaways

SELECT * retrieves all columns; use WHERE / .filter() to narrow rows, and DISTINCT to remove duplicates.

count() returns the number of rows; avg(), sum(), and collect_list() are common aggregate functions used with groupBy.

UNION deduplicates results while UNION ALL keeps all rows — choose based on whether duplicates matter.

Window functions (lead, lag, avg over partitionBy) let you compute running or comparative values without collapsing the DataFrame.

coalesce handles nulls, to_date converts strings to dates, and get_json_object extracts fields from JSON strings — all critical for real-world data cleaning.