Data Engineering: Implementing Data Contracts
Your analytics pipeline breaks at 2 AM. The upstream team changed a column name. Again. Sound familiar?
This is the core problem data contracts solve. As data systems grow, the informal handshake between the team that produces data and the team that consumes it stops working. Data contracts replace that handshake with something explicit, versioned, and enforceable.
Why Data Contracts Matter
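Data teams are essentially running distributed systems where the interfaces between components are mostly undocumented. A backend engineer wouldn't ship an API without defining its schema, but data pipelines do it constantly. The result is the failure described above: an upstream change that nobody agreed to breaks consumers or silently corrupts downstream tables.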
Data contracts fix this by making the agreement between producers and consumers a first-class artifact — something you can validate, version, and enforce in CI/CD.
What a Data Contract Actually Is
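A data contract is a formal specification that defines what a dataset looks like and how it should behave. Think of it as an API contract, but for data. It typically includes ownership and version metadata, a schema (field names, types, nullability, allowed values), data quality rules, and SLAs such as freshness and availability.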
Here's a simple contract defined in YAML — a common format for data contracts:
```yaml
# orders_contract.yaml
apiVersion: v1
kind: DataContract
metadata:
  name: orders
  owner: checkout-team
  version: 2.1.0
  description: "Order events emitted after successful payment"
schema:
  - name: order_id
    type: STRING
    nullable: false
    description: "Unique identifier for the order (UUID v4)"
  - name: customer_id
    type: STRING
    nullable: false
    description: "Reference to the customer who placed the order"
  - name: total_amount
    type: DECIMAL(10,2)
    nullable: false
    description: "Total order value in USD"
  - name: status
    type: STRING
    nullable: false
    enum: ["pending", "confirmed", "shipped", "cancelled"]
  - name: created_at
    type: TIMESTAMP
    nullable: false
quality:
  - rule: "order_id IS UNIQUE"
  - rule: "total_amount > 0"
  - rule: "created_at >= '2020-01-01'"
sla:
  freshness: "< 1 hour"
  availability: "99.9%"
```
This YAML lives in a Git repo. It's versioned, reviewable, and can be referenced by both the producing and consuming teams.
Validating Contracts in Practice
Defining a contract is only half the job. You also need to validate data against it, ideally automatically as part of your pipeline. Libraries such as great_expectations can do this for you; the example here takes a lightweight custom approach in Python with pandas:
```python
import pandas as pd
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldContract:
    name: str
    dtype: str  # recorded for readers; validate_contract does not enforce it
    nullable: bool
    allowed_values: Optional[List] = None


def validate_contract(df: pd.DataFrame, contract: List[FieldContract]) -> List[str]:
    """Return human-readable contract violations (an empty list means success)."""
    violations = []
    for field in contract:
        # Check field exists
        if field.name not in df.columns:
            violations.append(f"Missing field: {field.name}")
            continue

        # Check nullability
        null_count = int(df[field.name].isnull().sum())
        if not field.nullable and null_count > 0:
            violations.append(f"Null violation in '{field.name}': {null_count} null values found")

        # Check allowed values (enum); nulls are skipped here because the
        # nullability check above already reports them
        if field.allowed_values is not None:
            values = df[field.name].dropna()
            invalid = values[~values.isin(field.allowed_values)].unique()
            if len(invalid) > 0:
                violations.append(f"Invalid values in '{field.name}': {invalid.tolist()}")
    return violations


# Define the contract
orders_contract = [
    FieldContract("order_id", "str", nullable=False),
    FieldContract("customer_id", "str", nullable=False),
    FieldContract("total_amount", "float", nullable=False),
    FieldContract("status", "str", nullable=False,
                  allowed_values=["pending", "confirmed", "shipped", "cancelled"]),
    FieldContract("created_at", "datetime", nullable=False),
]

# Run validation
df = pd.read_parquet("orders_latest.parquet")
violations = validate_contract(df, orders_contract)

if violations:
    for v in violations:
        print(f"[CONTRACT VIOLATION] {v}")
    raise ValueError("Data contract validation failed. Pipeline halted.")
else:
    print("Contract validation passed. Proceeding with pipeline.")
```
This runs at ingestion time. If the upstream team ships a breaking change, your pipeline fails fast with a clear error message instead of silently corrupting your analytics tables.
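The validator covers field presence, nullability, and enums, but not the contract's `quality` rules or its freshness SLA. Here is a minimal sketch of how those might be checked, using only the standard library and rows held as plain dicts rather than a DataFrame. The function name `check_quality_rules`, the UTC reading of `'2020-01-01'`, and the interpretation of freshness as "the newest `created_at` is less than one hour old" are all assumptions made for illustration, not something the contract format defines.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def check_quality_rules(
    rows: List[Row],
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=1),
) -> List[str]:
    """Check the contract's quality rules and freshness SLA against in-memory rows."""
    now = now or datetime.now(timezone.utc)
    violations: List[str] = []

    # Rule: order_id IS UNIQUE
    seen, duplicates = set(), set()
    for row in rows:
        if row["order_id"] in seen:
            duplicates.add(row["order_id"])
        seen.add(row["order_id"])
    if duplicates:
        violations.append(f"Duplicate order_id values: {sorted(duplicates)}")

    # Rule: total_amount > 0
    non_positive = [row["order_id"] for row in rows if row["total_amount"] <= 0]
    if non_positive:
        violations.append(f"total_amount <= 0 for orders: {non_positive}")

    # Rule: created_at >= '2020-01-01' (assumed to mean midnight UTC)
    cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)
    too_early = [row["order_id"] for row in rows if row["created_at"] < cutoff]
    if too_early:
        violations.append(f"created_at before 2020-01-01 for orders: {too_early}")

    # SLA: freshness < 1 hour (assumed: the newest row must be younger than max_age)
    if rows:
        age = now - max(row["created_at"] for row in rows)
        if age >= max_age:
            violations.append(f"Freshness SLA missed: newest row is {age} old")

    return violations


# Usage with made-up rows: a duplicated ID and a zero amount
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": "a1", "total_amount": Decimal("19.99"), "created_at": now - timedelta(minutes=5)},
    {"order_id": "a1", "total_amount": Decimal("0.00"), "created_at": now - timedelta(minutes=9)},
]
for violation in check_quality_rules(rows, now=now):
    print(f"[CONTRACT VIOLATION] {violation}")
```

Passing `now` explicitly keeps the freshness check deterministic in tests; in a pipeline you would likely leave it unset.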
Integrating Contracts Into Your Workflow
The real value of data contracts comes from making them part of your development workflow, not a one-off documentation exercise.
Put contracts in Git alongside your pipeline code. When the producing team wants to change the schema, they open a PR. Consumers get visibility, can comment, and breaking changes become explicit conversations rather than surprises.
Run contract validation in CI/CD. Add a validation step to your pipeline that runs on every deployment. Use tools like [Great Expectations](https://greatexpectations.io/), [Soda Core](https://www.soda.io/), or [dbt tests](https://docs.getdbt.com/docs/build/tests) to automate this.
```yaml
# Example dbt schema test that enforces contract rules
# models/schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["pending", "confirmed", "shipped", "cancelled"]
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "> 0"
```
Version your contracts semantically. Use semver. A new optional field is a minor version bump. Removing a field or changing a type is a major version bump — and it requires coordination with consumers before it ships.
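To make the versioning rule concrete, here is a rough sketch of a check that compares two schema versions and suggests the bump. The `Schema` shape, the function names, and the `coupon_code` field are hypothetical. Treating a new required field as a major change is a conservative assumption of this sketch; the rule stated above only covers new optional fields, removals, and type changes.

```python
from typing import Dict, Tuple

# Field name -> (type, nullable). A simplified, hypothetical view of a contract schema.
Schema = Dict[str, Tuple[str, bool]]


def classify_change(old: Schema, new: Schema) -> str:
    """Return 'major', 'minor', or 'none' for the change from old to new."""
    removed = old.keys() - new.keys()
    added = new.keys() - old.keys()
    type_changed = [name for name in old.keys() & new.keys() if old[name][0] != new[name][0]]
    # Assumption: a new non-nullable field is treated as breaking.
    added_required = [name for name in added if not new[name][1]]

    if removed or type_changed or added_required:
        return "major"
    if added:
        return "minor"
    # Nullability changes on existing fields are not classified in this sketch.
    return "none"


def bump_version(version: str, change: str) -> str:
    """Apply a 'major' or 'minor' bump to a MAJOR.MINOR.PATCH string."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    return version


# Usage: a trimmed-down orders schema at version 2.1.0
current = {"order_id": ("STRING", False), "status": ("STRING", False)}
with_coupon = {**current, "coupon_code": ("STRING", True)}  # hypothetical optional field
without_status = {"order_id": ("STRING", False)}

print(bump_version("2.1.0", classify_change(current, with_coupon)))     # 2.2.0
print(bump_version("2.1.0", classify_change(current, without_status)))  # 3.0.0
```

A check like this could run on the PR that edits the contract, so a change that needs a major bump fails review unless the version number reflects it.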
Alert on violations, don't just log them. Wire your validation results into your incident management system. A contract violation in production is an incident, not a log message.
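One way to wire this up, sketched under assumptions: turn the list of violations into an incident payload and hand it to whatever client your incident tool provides. The payload fields and the `send` callable are hypothetical placeholders, not the API of any particular incident management product.

```python
import json
from typing import Callable, List


def build_incident(dataset: str, contract_version: str, violations: List[str]) -> dict:
    """Build a generic incident payload; the field names are illustrative only."""
    return {
        "title": f"Data contract violation: {dataset} v{contract_version}",
        "severity": "high",
        "violations": violations,
    }


def alert_on_violations(
    dataset: str,
    contract_version: str,
    violations: List[str],
    send: Callable[[str], None],
) -> bool:
    """Send one incident if there are violations. Returns True if an alert was sent."""
    if not violations:
        return False
    send(json.dumps(build_incident(dataset, contract_version, violations)))
    return True


# Usage: a list stands in for the real incident client
sent: List[str] = []
alert_on_violations("orders", "2.1.0", ["Missing field: status"], send=sent.append)
print(sent[0])
```

Injecting `send` keeps the alerting logic testable without network access and lets you swap incident tools without touching the validation code.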
Common Pitfalls to Avoid
Don't make contracts too granular too soon. Start with the fields that matter most — the ones consumers actually use. An overly strict contract on every field creates friction without proportional value.
Don't let contracts become stale documentation. If they're not validated automatically, they'll drift from reality. The contract is only useful if it's enforced.
Don't skip the semantics. Schema types tell you the shape of data. Semantics tell you what it *means*. "What does total_amount include — taxes? Discounts?" Document it. Future-you will be grateful.
Involve consumers from the start. Contracts written only by producers often miss what consumers actually need. Make it a collaborative spec.
Actionable Next Steps
Here's how to start without boiling the ocean:
Data contracts won't eliminate all data quality issues, but they shift the culture from "someone changed something and nobody noticed" to "changes are explicit, agreed upon, and validated." That's a big deal when your business decisions depend on the data being right.
Start small, automate early, and make the contract the source of truth — not the wiki page nobody reads.