dbt Analytics Engineer

dbt Fundamentals

Core dbt concepts: models, sources, tests, refs, and the CLI.

dbt Fundamentals — Study Guide

dbt Fundamentals: A Beginner's Guide

Data transformation is a crucial part of any data pipeline. Raw data is rarely in a format ready for analysis. dbt (data build tool) allows you to transform data *in* your data warehouse using SQL and best practices for software development. This means you can leverage the power and scalability of your warehouse for transformations, rather than relying on slower, less manageable ETL processes. This lesson will cover the core concepts you need to understand to get started with dbt and pass a "dbt Fundamentals" quiz.

1. The dbt CLI

The dbt command-line interface (CLI) is your primary way to interact with dbt. Think of it as the control panel for your data transformations. You'll use it to run models, test your code, and manage your dbt project.

  • Installation: You'll need to install dbt using pip: pip install dbt-core
  • Initialization: dbt init creates a new dbt project with a default directory structure.
  • Key Commands:
  • * dbt run: This is the workhorse command. It compiles and executes your models in the order defined by their dependencies. * dbt test: Runs the tests you've defined for your models. * dbt docs generate: Creates documentation for your dbt project. * dbt debug: Helps diagnose connection issues with your data warehouse. * dbt seed: Loads data from CSV files into your data warehouse (useful for small lookup tables).

    2. Models: The Building Blocks

    dbt models are SQL files that define how your data is transformed. They're the core of your dbt project. Each model represents a step in your transformation process.

  • Location: Models are typically stored in the models directory of your dbt project.
  • File Format: Models are written in SQL, but dbt adds powerful features on top of standard SQL (more on that later with Jinja).
  • Example:
  • -- models/staging/stg_customers.sql

    SELECT customer_id, first_name, last_name, email FROM {{ source('raw_data', 'customers') }}

    This model selects data from a source table (explained later) and creates a staging table called stg_customers.

    3. Jinja: SQL with Superpowers

    Jinja is a templating language that dbt uses to add logic and flexibility to your SQL models. It allows you to:

  • Use Variables: Access configuration values defined in your dbt_project.yml file.
  • Control Flow: Use if statements and loops to dynamically generate SQL.
  • Macros: Create reusable code snippets.
  • Referencing other models: (covered in the next section)
  • Example:
  • -- models/intermediate/int_customer_orders.sql

    SELECT c.customer_id, c.first_name, c.last_name, o.order_date, o.total_amount FROM {{ ref('stg_customers') }} c JOIN {{ ref('stg_orders') }} o ON c.customer_id = o.customer_id WHERE o.order_date > '{{ var('cutoff_date') }}'

    Here, {{ ref('stg_customers') }} and {{ ref('stg_orders') }} are Jinja functions, and {{ var('cutoff_date') }} accesses a variable defined in your dbt_project.yml.

    4. refs: Connecting Your Models

    The ref() function is *critical* for defining dependencies between your models. It tells dbt the order in which to run your models. Think of it as drawing arrows between boxes in a data flow diagram.

  • Purpose: ref('model_name') returns the fully qualified name of the model_name model.
  • Dependency Management: dbt uses these references to build a Directed Acyclic Graph (DAG) of your models, ensuring they are executed in the correct order.
  • Example (from above): {{ ref('stg_customers') }} tells dbt that the int_customer_orders model depends on the stg_customers model. stg_customers *must* be run before int_customer_orders.
  • 5. Tests: Ensuring Data Quality

    dbt tests are SQL queries that verify the accuracy and consistency of your data. They are essential for maintaining data quality.

  • Types of Tests:
  • * Generic Tests: Built-in tests like unique, not_null, accepted_values. * Singular Tests: Custom tests you write to validate specific business rules.
  • Location: Tests are typically stored in the tests directory.
  • Example (generic test):
  • # tests/assert_unique_customer_id.yml
    version: 2

    models: - name: stg_customers columns: - name: customer_id tests: - unique

    This test ensures that the customer_id column in the stg_customers model contains only unique values.

    6. Sources: Defining Your Raw Data

    dbt sources represent your raw data tables in your data warehouse. They tell dbt where your data comes from.

  • Purpose: Sources provide a central place to define your raw data and track its lineage.
  • Configuration: Sources are defined in your dbt_project.yml file.
  • Example:
  • # dbt_project.yml
    sources:
      - name: raw_data
        database: your_database
        schema: your_schema
        tables:
          - name: customers
          - name: orders

    This defines a source named raw_data pointing to the customers and orders tables in your your_database.your_schema.

    7. Materializations: How Models are Built

    Materializations control how dbt builds your models in the data warehouse. They determine whether a model is created from scratch each time, incrementally updated, or viewed as a virtual table.

    MaterializationDescriptionWhen to Use
    tableCreates a new table each time the model is run.Most common; good for full refreshes and when data changes significantly.
    viewCreates a view (virtual table) that queries the underlying data.Useful for simple transformations or when you want to avoid storing data.
    incrementalAppends new data to an existing table.For large tables where you only need to process new data.
    ephemeralDoesn't create a table; the model's SQL is inlined into dependent models.For temporary transformations or code reuse.
  • Default: The default materialization is table.
  • Specifying Materialization: You can specify the materialization in your model file:
  • -- models/my_model.sql
    {{ config(materialized='view') }}

    SELECT ...


    Key Takeaways

  • dbt is a transformation tool: It helps you transform data *within* your data warehouse using SQL and software engineering best practices.
  • refs are crucial for dependency management: They define the order in which your models are executed.
  • Tests ensure data quality: Use generic and singular tests to validate your data transformations.
  • Materializations control how models are built: Choose the right materialization based on your data volume and update frequency.
  • Jinja adds power to SQL: Use Jinja to create dynamic and reusable SQL code.