Databricks Data Engineer Associate

Spark SQL & DataFrames — Quiz 3

Spark SQL & DataFrames — Quiz 3 — Study Guide

Spark SQL & DataFrames — Study Guide (Quiz 3)

Spark SQL and DataFrames are the backbone of modern big data processing. Whether you're analyzing millions of customer records or transforming raw logs into insights, knowing how to query, filter, aggregate, and join data in Spark is an essential skill. This guide covers the key concepts you'll need to ace Quiz 3 — from basic SELECT statements to window functions and JSON parsing.


Querying Basics with Spark SQL

Spark SQL lets you write familiar SQL queries against DataFrames by registering them as temporary views.

# Register a DataFrame as a temporary view
df.createOrReplaceTempView("employees")

Query it like standard SQL

spark.sql("SELECT * FROM employees")

SELECT and Filtering

Use SELECT to choose columns and WHERE to filter rows:

# Select all columns
spark.sql("SELECT * FROM employees")

Select specific columns with a filter

spark.sql("SELECT name, salary FROM employees WHERE department = 'Engineering'")

DataFrame API equivalent

df.select("name", "salary").where(df.department == "Engineering") df.filter(df.salary > 50000)

Analogy: Think of SELECT as choosing which drawers to open, and WHERE as only looking in rooms that match your criteria.

DISTINCT and COUNT

# Get unique departments
spark.sql("SELECT DISTINCT department FROM employees")
df.select("department").distinct()

Count total rows

spark.sql("SELECT COUNT(*) FROM employees") df.count() # returns an integer df.select(count("*")) # returns a DataFrame

Note: df.count() returns the size (number of rows) as a Python integer, while count() inside select() returns a DataFrame.


Sorting and Aggregation

ORDER BY

# Sort ascending (default)
spark.sql("SELECT * FROM employees ORDER BY salary")

Sort descending

spark.sql("SELECT * FROM employees ORDER BY salary DESC")

DataFrame API

df.orderBy("salary") df.orderBy(df.salary.desc())

AVG and Other Aggregates

from pyspark.sql.functions import avg, sum, max, min

spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department")

DataFrame API

df.groupBy("department").agg(avg("salary"), sum("salary"))

FunctionPurpose
avg()Average of a column
sum()Total sum
count()Number of rows
max() / min()Highest / lowest value
collect_list()Collects values into a list

Set Operations: UNION vs UNION ALL

# UNION removes duplicates (like DISTINCT)
df1.union(df2).distinct()
spark.sql("SELECT * FROM table1 UNION SELECT * FROM table2")

UNION ALL keeps all rows including duplicates

df1.union(df2) spark.sql("SELECT * FROM table1 UNION ALL SELECT * FROM table2")

Key difference: UNION = unique rows only. UNION ALL = everything, faster because no deduplication step.


Column Operations

Renaming Columns

# withColumnRenamed(old_name, new_name)
df.withColumnRenamed("emp_name", "name")

SQL alias

spark.sql("SELECT emp_name AS name FROM employees")

LIKE and Pattern Matching

# Find names starting with 'A'
spark.sql("SELECT * FROM employees WHERE name LIKE 'A%'")
df.filter(df.name.like("A%"))

Contains 'son'

df.filter(df.name.like("%son%"))

PatternMeaning
A%Starts with A
%sonEnds with "son"
%son%Contains "son"
_ohnAny single char + "ohn"

isin

# Filter rows where department is in a list
df.filter(df.department.isin("Engineering", "Marketing"))
spark.sql("SELECT * FROM employees WHERE department IN ('Engineering', 'Marketing')")


Dates, Nulls, and Conditionals

to_date — Date Conversion

from pyspark.sql.functions import to_date

df = df.withColumn("hire_date", to_date("hire_date_str", "yyyy-MM-dd"))

Handling Null Values with coalesce

coalesce returns the first non-null value from a list of columns:

from pyspark.sql.functions import coalesce

df.withColumn("bonus", coalesce(df.bonus, df.default_bonus, lit(0)))

CASE Statements — Conditional Logic

spark.sql("""
  SELECT name,
    CASE WHEN salary > 100000 THEN 'High'
         WHEN salary > 50000  THEN 'Mid'
         ELSE 'Low'
    END AS salary_band
  FROM employees
""")

DataFrame API

from pyspark.sql.functions import when df.withColumn("salary_band", when(df.salary > 100000, "High") .when(df.salary > 50000, "Mid") .otherwise("Low"))


Joins

# Inner join
df1.join(df2, df1.id == df2.emp_id)

Left outer join — keeps all rows from left, nulls for non-matches on right

df1.join(df2, df1.id == df2.emp_id, "left_outer")

spark.sql(""" SELECT e.name, d.dept_name FROM employees e LEFT OUTER JOIN departments d ON e.dept_id = d.id """)


Window Functions: lead, lag, partition by

Window functions compute values across related rows without collapsing them.

from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lag, avg

window = Window.partitionBy("department").orderBy("hire_date")

df.withColumn("next_hire", lead("name", 1).over(window)) df.withColumn("prev_hire", lag("name", 1).over(window)) df.withColumn("dept_avg_salary", avg("salary").over(Window.partitionBy("department")))

  • lead → looks forward to the next row
  • lag → looks backward to the previous row
  • partitionBy → defines the group (like GROUP BY but keeps all rows)

  • JSON Parsing

    from pyspark.sql.functions import get_json_object

    Extract a field from a JSON string column

    df.withColumn("city", get_json_object(df.json_col, "$.address.city"))


    Performance Tips

  • Temporary views (createOrReplaceTempView) are session-scoped and don't persist to disk — great for iterative queries.
  • UNION ALL is faster than UNION because it skips deduplication.
  • partitionBy in window functions limits data shuffled across nodes.
  • Use isin instead of multiple OR conditions for cleaner, optimized filtering.

  • Key Takeaways

  • SELECT * retrieves all columns; use WHERE / .filter() to narrow rows, and DISTINCT to remove duplicates.
  • count() returns the number of rows; avg(), sum(), and collect_list() are common aggregate functions used with groupBy.
  • UNION deduplicates results while UNION ALL keeps all rows — choose based on whether duplicates matter.
  • Window functions (lead, lag, avg over partitionBy) let you compute running or comparative values without collapsing the DataFrame.
  • coalesce handles nulls, to_date converts strings to dates, and get_json_object extracts fields from JSON strings — all critical for real-world data cleaning.