dbt Analytics Engineer

dbt Fundamentals

Core dbt concepts: models, sources, tests, refs, and the CLI.

dbt Fundamentals — Study Guide

dbt Fundamentals: A Beginner's Guide

Data transformation is a crucial part of any data pipeline. Raw data is rarely in a format ready for analysis. dbt (data build tool) allows you to transform data *in* your data warehouse using SQL and best practices for software development. This means you can leverage the power and scalability of your warehouse for transformations, rather than relying on slower, less manageable ETL processes. This lesson will cover the core concepts you need to understand to get started with dbt and pass a "dbt Fundamentals" quiz.

1. The dbt CLI

The dbt command-line interface (CLI) is your primary way to interact with dbt. Think of it as the control panel for your data transformations. You'll use it to run models, test your code, and manage your dbt project.

Installation: You'll need to install dbt using pip: pip install dbt-core

Initialization: dbt init creates a new dbt project with a default directory structure.

Key Commands:

* dbt run: This is the workhorse command. It compiles and executes your models in the order defined by their dependencies. * dbt test: Runs the tests you've defined for your models. * dbt docs generate: Creates documentation for your dbt project. * dbt debug: Helps diagnose connection issues with your data warehouse. * dbt seed: Loads data from CSV files into your data warehouse (useful for small lookup tables).

2. Models: The Building Blocks

dbt models are SQL files that define how your data is transformed. They're the core of your dbt project. Each model represents a step in your transformation process.

Location: Models are typically stored in the models directory of your dbt project.

File Format: Models are written in SQL, but dbt adds powerful features on top of standard SQL (more on that later with Jinja).

Example:

-- models/staging/stg_customers.sqlSELECT
    customer_id,
    first_name,
    last_name,
    email
FROM
    {{ source('raw_data', 'customers') }}

This model selects data from a source table (explained later) and creates a staging table called stg_customers.

3. Jinja: SQL with Superpowers

Jinja is a templating language that dbt uses to add logic and flexibility to your SQL models. It allows you to:

Use Variables: Access configuration values defined in your dbt_project.yml file.

Control Flow: Use if statements and loops to dynamically generate SQL.

Macros: Create reusable code snippets.

Referencing other models: (covered in the next section)

Example:

-- models/intermediate/int_customer_orders.sqlSELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_date,
    o.total_amount
FROM
    {{ ref('stg_customers') }} c
JOIN
    {{ ref('stg_orders') }} o
ON
    c.customer_id = o.customer_id
WHERE
    o.order_date > '{{ var('cutoff_date') }}'

Here, {{ ref('stg_customers') }} and {{ ref('stg_orders') }} are Jinja functions, and {{ var('cutoff_date') }} accesses a variable defined in your dbt_project.yml.

4. `refs`: Connecting Your Models

The ref() function is *critical* for defining dependencies between your models. It tells dbt the order in which to run your models. Think of it as drawing arrows between boxes in a data flow diagram.

Purpose: ref('model_name') returns the fully qualified name of the model_name model.

Dependency Management: dbt uses these references to build a Directed Acyclic Graph (DAG) of your models, ensuring they are executed in the correct order.

Example (from above): {{ ref('stg_customers') }} tells dbt that the int_customer_orders model depends on the stg_customers model. stg_customers *must* be run before int_customer_orders.

5. Tests: Ensuring Data Quality

dbt tests are SQL queries that verify the accuracy and consistency of your data. They are essential for maintaining data quality.

Types of Tests:

* Generic Tests: Built-in tests like unique, not_null, accepted_values. * Singular Tests: Custom tests you write to validate specific business rules.

Location: Tests are typically stored in the tests directory.

Example (generic test):

# tests/assert_unique_customer_id.yml version: 2

models: - name: stg_customers columns: - name: customer_id tests: - unique

This test ensures that the customer_id column in the stg_customers model contains only unique values.

6. Sources: Defining Your Raw Data

dbt sources represent your raw data tables in your data warehouse. They tell dbt where your data comes from.

Purpose: Sources provide a central place to define your raw data and track its lineage.

Configuration: Sources are defined in your dbt_project.yml file.

Example:

# dbt_project.yml
sources:
  - name: raw_data
    database: your_database
    schema: your_schema
    tables:
      - name: customers
      - name: orders

This defines a source named raw_data pointing to the customers and orders tables in your your_database.your_schema.

7. Materializations: How Models are Built

Materializations control how dbt builds your models in the data warehouse. They determine whether a model is created from scratch each time, incrementally updated, or viewed as a virtual table.

Materialization	Description	When to Use
`table`	Creates a new table each time the model is run.	Most common; good for full refreshes and when data changes significantly.
`view`	Creates a view (virtual table) that queries the underlying data.	Useful for simple transformations or when you want to avoid storing data.
`incremental`	Appends new data to an existing table.	For large tables where you only need to process new data.
`ephemeral`	Doesn't create a table; the model's SQL is inlined into dependent models.	For temporary transformations or code reuse.

Default: The default materialization is table.

Specifying Materialization: You can specify the materialization in your model file:

-- models/my_model.sql
{{ config(materialized='view') }}SELECT ...

Key Takeaways

dbt is a transformation tool: It helps you transform data *within* your data warehouse using SQL and software engineering best practices.

refs are crucial for dependency management: They define the order in which your models are executed.

Tests ensure data quality: Use generic and singular tests to validate your data transformations.

Materializations control how models are built: Choose the right materialization based on your data volume and update frequency.

Jinja adds power to SQL: Use Jinja to create dynamic and reusable SQL code.