Spark SQL & DataFrames — Quiz 3
Spark SQL & DataFrames — Quiz 3 — Study Guide
Spark SQL & DataFrames — Study Guide (Quiz 3)
Spark SQL and DataFrames are the backbone of modern big data processing. Whether you're analyzing millions of customer records or transforming raw logs into insights, knowing how to query, filter, aggregate, and join data in Spark is an essential skill. This guide covers the key concepts you'll need to ace Quiz 3 — from basic SELECT statements to window functions and JSON parsing.
Querying Basics with Spark SQL
Spark SQL lets you write familiar SQL queries against DataFrames by registering them as temporary views.
# Register a DataFrame as a temporary view
df.createOrReplaceTempView("employees")Query it like standard SQL
spark.sql("SELECT * FROM employees")SELECT and Filtering
Use SELECT to choose columns and WHERE to filter rows:
# Select all columns
spark.sql("SELECT * FROM employees")Select specific columns with a filter
spark.sql("SELECT name, salary FROM employees WHERE department = 'Engineering'")DataFrame API equivalent
df.select("name", "salary").where(df.department == "Engineering")
df.filter(df.salary > 50000)Analogy: Think ofSELECTas choosing which drawers to open, andWHEREas only looking in rooms that match your criteria.
DISTINCT and COUNT
# Get unique departments
spark.sql("SELECT DISTINCT department FROM employees")
df.select("department").distinct()Count total rows
spark.sql("SELECT COUNT(*) FROM employees")
df.count() # returns an integer
df.select(count("*")) # returns a DataFrameNote: df.count() returns the size (number of rows) as a Python integer, while count() inside select() returns a DataFrame.
Sorting and Aggregation
ORDER BY
# Sort ascending (default)
spark.sql("SELECT * FROM employees ORDER BY salary")Sort descending
spark.sql("SELECT * FROM employees ORDER BY salary DESC")DataFrame API
df.orderBy("salary")
df.orderBy(df.salary.desc())AVG and Other Aggregates
from pyspark.sql.functions import avg, sum, max, minspark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department")
DataFrame API
df.groupBy("department").agg(avg("salary"), sum("salary"))| Function | Purpose |
|---|---|
avg() | Average of a column |
sum() | Total sum |
count() | Number of rows |
max() / min() | Highest / lowest value |
collect_list() | Collects values into a list |
Set Operations: UNION vs UNION ALL
# UNION removes duplicates (like DISTINCT)
df1.union(df2).distinct()
spark.sql("SELECT * FROM table1 UNION SELECT * FROM table2")UNION ALL keeps all rows including duplicates
df1.union(df2)
spark.sql("SELECT * FROM table1 UNION ALL SELECT * FROM table2")Key difference:UNION= unique rows only.UNION ALL= everything, faster because no deduplication step.
Column Operations
Renaming Columns
# withColumnRenamed(old_name, new_name)
df.withColumnRenamed("emp_name", "name")SQL alias
spark.sql("SELECT emp_name AS name FROM employees")LIKE and Pattern Matching
# Find names starting with 'A'
spark.sql("SELECT * FROM employees WHERE name LIKE 'A%'")
df.filter(df.name.like("A%"))Contains 'son'
df.filter(df.name.like("%son%"))| Pattern | Meaning |
|---|---|
A% | Starts with A |
%son | Ends with "son" |
%son% | Contains "son" |
_ohn | Any single char + "ohn" |
isin
# Filter rows where department is in a list
df.filter(df.department.isin("Engineering", "Marketing"))
spark.sql("SELECT * FROM employees WHERE department IN ('Engineering', 'Marketing')")Dates, Nulls, and Conditionals
to_date — Date Conversion
from pyspark.sql.functions import to_datedf = df.withColumn("hire_date", to_date("hire_date_str", "yyyy-MM-dd"))
Handling Null Values with coalesce
coalesce returns the first non-null value from a list of columns:
from pyspark.sql.functions import coalescedf.withColumn("bonus", coalesce(df.bonus, df.default_bonus, lit(0)))
CASE Statements — Conditional Logic
spark.sql("""
SELECT name,
CASE WHEN salary > 100000 THEN 'High'
WHEN salary > 50000 THEN 'Mid'
ELSE 'Low'
END AS salary_band
FROM employees
""")DataFrame API
from pyspark.sql.functions import when
df.withColumn("salary_band",
when(df.salary > 100000, "High")
.when(df.salary > 50000, "Mid")
.otherwise("Low"))Joins
# Inner join
df1.join(df2, df1.id == df2.emp_id)Left outer join — keeps all rows from left, nulls for non-matches on right
df1.join(df2, df1.id == df2.emp_id, "left_outer")spark.sql("""
SELECT e.name, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.id
""")
Window Functions: lead, lag, partition by
Window functions compute values across related rows without collapsing them.
from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lag, avgwindow = Window.partitionBy("department").orderBy("hire_date")
df.withColumn("next_hire", lead("name", 1).over(window))
df.withColumn("prev_hire", lag("name", 1).over(window))
df.withColumn("dept_avg_salary", avg("salary").over(Window.partitionBy("department")))
lead → looks forward to the next rowlag → looks backward to the previous rowpartitionBy → defines the group (like GROUP BY but keeps all rows)JSON Parsing
from pyspark.sql.functions import get_json_objectExtract a field from a JSON string column
df.withColumn("city", get_json_object(df.json_col, "$.address.city"))Performance Tips
createOrReplaceTempView) are session-scoped and don't persist to disk — great for iterative queries.partitionBy in window functions limits data shuffled across nodes.isin instead of multiple OR conditions for cleaner, optimized filtering.Key Takeaways
SELECT * retrieves all columns; use WHERE / .filter() to narrow rows, and DISTINCT to remove duplicates.count() returns the number of rows; avg(), sum(), and collect_list() are common aggregate functions used with groupBy.UNION deduplicates results while UNION ALL keeps all rows — choose based on whether duplicates matter.lead, lag, avg over partitionBy) let you compute running or comparative values without collapsing the DataFrame.coalesce handles nulls, to_date converts strings to dates, and get_json_object extracts fields from JSON strings — all critical for real-world data cleaning.