Data Engineering: Implementing Data Contracts

Your analytics pipeline breaks at 2 AM. The upstream team changed a column name. Again. Sound familiar?

This is the core problem data contracts solve. As data systems grow, the informal handshake between the team that produces data and the team that consumes it stops working. Data contracts replace that handshake with something explicit, versioned, and enforceable.

Why Data Contracts Matter

Data teams are essentially running distributed systems where the interfaces between components are mostly undocumented. A backend engineer wouldn't ship an API without defining its schema — but data pipelines do it constantly. The result is:

  • Silent schema changes that break downstream dashboards
  • Null values appearing in fields that were "always populated"
  • Consumers building on data they don't fully understand
  • Hours of debugging to figure out *who* changed *what* and *when*

Data contracts fix this by making the agreement between producers and consumers a first-class artifact: something you can validate, version, and enforce in CI/CD.

What a Data Contract Actually Is

A data contract is a formal specification that defines what a dataset looks like and how it should behave. Think of it as an API contract, but for data. It typically includes:

  • Schema: field names, types, nullability
  • Semantics: what each field actually means
  • SLAs: freshness expectations, update frequency
  • Quality rules: valid ranges, uniqueness constraints, referential integrity
  • Ownership: who produces it, who consumes it, and who to contact when things break

Here's a simple contract defined in YAML, a common format for data contracts:

    # orders_contract.yaml
    apiVersion: v1
    kind: DataContract
    metadata:
      name: orders
      owner: checkout-team
      version: 2.1.0
      description: "Order events emitted after successful payment"

    schema:
      - name: order_id
        type: STRING
        nullable: false
        description: "Unique identifier for the order (UUID v4)"
      - name: customer_id
        type: STRING
        nullable: false
        description: "Reference to the customer who placed the order"
      - name: total_amount
        type: DECIMAL(10,2)
        nullable: false
        description: "Total order value in USD"
      - name: status
        type: STRING
        nullable: false
        enum: ["pending", "confirmed", "shipped", "cancelled"]
      - name: created_at
        type: TIMESTAMP
        nullable: false

    quality:
      - rule: "order_id IS UNIQUE"
      - rule: "total_amount > 0"
      - rule: "created_at >= '2020-01-01'"

    sla:
      freshness: "< 1 hour"
      availability: "99.9%"

This YAML lives in a Git repo. It's versioned, reviewable, and can be referenced by both the producing and consuming teams.
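The `sla` block only helps if something checks it. Here is a minimal sketch of a freshness check in Python, assuming the newest `created_at` value has already been read from the dataset as a timezone-aware timestamp. The names `is_fresh` and `MAX_AGE` are hypothetical and not part of any contract tooling:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The contract's `freshness: "< 1 hour"` expressed as a timedelta (hand-translated here).
MAX_AGE = timedelta(hours=1)


def is_fresh(latest_event: datetime, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """Return True if the newest record is younger than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - latest_event < max_age


# Fixed "now" so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), MAX_AGE, now))  # True
print(is_fresh(datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc), MAX_AGE, now))  # False
```

Passing `now` explicitly keeps the check easy to test. In a real pipeline you would normally omit it and let the function use the current UTC time.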

Validating Contracts in Practice

Defining a contract is only half the job. You also need to validate data against it, ideally automatically, as part of your pipeline. A library such as Great Expectations can do this; the example here takes a lightweight custom approach in Python with pandas:

    import pandas as pd
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class FieldContract:
        name: str
        dtype: str  # recorded for documentation; not checked by validate_contract
        nullable: bool
        allowed_values: Optional[List] = None


    def validate_contract(df: pd.DataFrame, contract: List[FieldContract]) -> List[str]:
        """Return a list of human-readable violations (empty if the data conforms)."""
        violations = []

        for field in contract:
            # Check field exists
            if field.name not in df.columns:
                violations.append(f"Missing field: {field.name}")
                continue

            # Check nullability
            if not field.nullable and df[field.name].isnull().any():
                null_count = df[field.name].isnull().sum()
                violations.append(f"Null violation in '{field.name}': {null_count} null values found")

            # Check allowed values (enum); nulls are left to the nullability check
            if field.allowed_values:
                values = df[field.name].dropna()
                invalid = values[~values.isin(field.allowed_values)].unique()
                if len(invalid) > 0:
                    violations.append(f"Invalid values in '{field.name}': {invalid.tolist()}")

        return violations


    # Define the contract
    orders_contract = [
        FieldContract("order_id", "str", nullable=False),
        FieldContract("customer_id", "str", nullable=False),
        FieldContract("total_amount", "float", nullable=False),
        FieldContract("status", "str", nullable=False,
                      allowed_values=["pending", "confirmed", "shipped", "cancelled"]),
        FieldContract("created_at", "datetime", nullable=False),
    ]

    # Run validation
    df = pd.read_parquet("orders_latest.parquet")
    violations = validate_contract(df, orders_contract)

    if violations:
        for v in violations:
            print(f"[CONTRACT VIOLATION] {v}")
        raise ValueError("Data contract validation failed. Pipeline halted.")
    else:
        print("Contract validation passed. Proceeding with pipeline.")

This check runs at ingestion time. If the upstream team ships a breaking change, your pipeline fails fast with a clear error message instead of silently corrupting your analytics tables.
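The validator covers the schema side of the contract. The three `quality` rules from `orders_contract.yaml` could be checked in a similar spirit. This is a rough sketch using only the standard library, assuming each record arrives as a plain dict. `check_quality` is a hypothetical helper, and the rules are hand-translated rather than parsed from the YAML:

```python
from datetime import datetime
from decimal import Decimal
from typing import Any, Dict, List


def check_quality(rows: List[Dict[str, Any]]) -> List[str]:
    """Apply the three quality rules from orders_contract.yaml to a batch of rows."""
    violations = []

    # Rule: order_id IS UNIQUE
    seen, duplicates = set(), set()
    for row in rows:
        if row["order_id"] in seen:
            duplicates.add(row["order_id"])
        seen.add(row["order_id"])
    if duplicates:
        violations.append(f"Duplicate order_id values: {sorted(duplicates)}")

    # Rule: total_amount > 0
    non_positive = [r["order_id"] for r in rows if r["total_amount"] <= 0]
    if non_positive:
        violations.append(f"total_amount <= 0 for orders: {non_positive}")

    # Rule: created_at >= '2020-01-01'
    cutoff = datetime(2020, 1, 1)
    too_old = [r["order_id"] for r in rows if r["created_at"] < cutoff]
    if too_old:
        violations.append(f"created_at before 2020-01-01 for orders: {too_old}")

    return violations


# Illustrative sample data: the second row breaks all three rules.
rows = [
    {"order_id": "a1", "total_amount": Decimal("19.99"), "created_at": datetime(2024, 3, 1)},
    {"order_id": "a1", "total_amount": Decimal("0.00"), "created_at": datetime(2019, 12, 31)},
]
for violation in check_quality(rows):
    print(f"[CONTRACT VIOLATION] {violation}")
```

Returning a list of messages, as `validate_contract` does, lets both checks feed the same reporting and halting logic.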

Integrating Contracts Into Your Workflow

The real value of data contracts comes from making them part of your development workflow, not a one-off documentation exercise.

Put contracts in Git alongside your pipeline code. When the producing team wants to change the schema, they open a PR. Consumers get visibility, can comment, and breaking changes become explicit conversations rather than surprises.

Run contract validation in CI/CD. Add a validation step to your pipeline that runs on every deployment. Use tools like [Great Expectations](https://greatexpectations.io/), [Soda Core](https://www.soda.io/), or [dbt tests](https://docs.getdbt.com/docs/build/tests) to automate this.

    # Example dbt schema test that enforces contract rules
    # models/schema.yml
    # (expression_is_true comes from the dbt_utils package)

    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ["pending", "confirmed", "shipped", "cancelled"]
          - name: total_amount
            tests:
              - not_null
              - dbt_utils.expression_is_true:
                  expression: "> 0"

Version your contracts semantically. Use semver. A new optional field is a minor version bump. Removing a field or changing a type is a major version bump, and it requires coordination with consumers before it ships.
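These versioning rules are mechanical enough to check in a PR. Here is a small sketch, assuming each schema has been reduced to a mapping from field name to type. `required_bump`, `next_version`, and the `coupon_code` field are hypothetical, and treating every other change as a patch is an assumption rather than something the rules above specify:

```python
from typing import Dict

Schema = Dict[str, str]  # field name -> type, e.g. {"order_id": "STRING"}


def required_bump(old: Schema, new: Schema) -> str:
    """Classify a schema change as 'major', 'minor', or 'patch'."""
    removed = old.keys() - new.keys()
    retyped = [name for name in old.keys() & new.keys() if old[name] != new[name]]
    if removed or retyped:
        return "major"  # removing a field or changing a type breaks consumers
    if new.keys() - old.keys():
        return "minor"  # added field, assumed optional in this sketch
    return "patch"      # assumption: anything else (e.g. descriptions) is a patch


def next_version(current: str, bump: str) -> str:
    """Apply a semver bump to a 'MAJOR.MINOR.PATCH' version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


old = {"order_id": "STRING", "total_amount": "DECIMAL(10,2)"}
new = {"order_id": "STRING", "total_amount": "DECIMAL(10,2)", "coupon_code": "STRING"}
print(next_version("2.1.0", required_bump(old, new)))  # 2.2.0
```

A CI job could compare the bump computed from the diff with the version the PR actually declares, and fail when a breaking change is labelled as minor.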

Alert on violations, don't just log them. Wire your validation results into your incident management system. A contract violation in production is an incident, not a log message.
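One way to wire this up is to post the violations to a chat webhook. The sketch below uses only the standard library and assumes a Slack-style incoming webhook that accepts a JSON body with a `text` field. `build_alert`, `send_alert`, and the URL are placeholders to adapt to your own incident tooling:

```python
import json
import urllib.request
from typing import List


def build_alert(dataset: str, version: str, violations: List[str]) -> dict:
    """Format contract violations as a webhook payload with a single text field."""
    lines = "\n".join(f"- {v}" for v in violations)
    return {"text": f"Data contract violation: {dataset} v{version}\n{lines}"}


def send_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_alert("orders", "2.1.0", ["Missing field: status"])
print(payload["text"])
# send_alert("https://example.com/your-webhook", payload)  # placeholder URL, not called here
```

Keeping payload construction separate from the network call makes the formatting testable without a live endpoint.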

Common Pitfalls to Avoid

Don't make contracts too granular too soon. Start with the fields that matter most: the ones consumers actually use. An overly strict contract on every field creates friction without proportional value.

Don't let contracts become stale documentation. If they're not validated automatically, they'll drift from reality. The contract is only useful if it's enforced.

Don't skip the semantics. Schema types tell you the shape of data. Semantics tell you what it *means*. "What does total_amount include? Taxes? Discounts?" Document it. Future-you will be grateful.

Involve consumers from the start. Contracts written only by producers often miss what consumers actually need. Make it a collaborative spec.

Actionable Next Steps

Here's how to start without boiling the ocean:

  • Pick one critical dataset — the one that breaks most often or has the most downstream consumers. Write a contract for it today.
  • Add basic validation to your pipeline — even a simple null check and schema check is better than nothing.
  • Store the contract in Git — make it reviewable and versionable from day one.
  • Set up an alert — connect validation failures to Slack or PagerDuty so violations surface immediately.
  • Expand incrementally — once the pattern works for one dataset, roll it out to others.

Data contracts won't eliminate all data quality issues, but they shift the culture from "someone changed something and nobody noticed" to "changes are explicit, agreed upon, and validated." That's a big deal when your business decisions depend on the data being right.

Start small, automate early, and make the contract the source of truth, not the wiki page nobody reads.