dbt Fundamentals
Core dbt concepts: models, sources, tests, refs, and the CLI.
dbt Fundamentals — Study Guide
dbt Fundamentals: A Beginner's Guide
Data transformation is a crucial part of any data pipeline. Raw data is rarely in a format ready for analysis. dbt (data build tool) allows you to transform data *in* your data warehouse using SQL and best practices for software development. This means you can leverage the power and scalability of your warehouse for transformations, rather than relying on slower, less manageable ETL processes. This lesson will cover the core concepts you need to understand to get started with dbt and pass a "dbt Fundamentals" quiz.
1. The dbt CLI
The dbt command-line interface (CLI) is your primary way to interact with dbt. Think of it as the control panel for your data transformations. You'll use it to run models, test your code, and manage your dbt project.
pip: pip install dbt-coredbt init creates a new dbt project with a default directory structure.dbt run: This is the workhorse command. It compiles and executes your models in the order defined by their dependencies.
* dbt test: Runs the tests you've defined for your models.
* dbt docs generate: Creates documentation for your dbt project.
* dbt debug: Helps diagnose connection issues with your data warehouse.
* dbt seed: Loads data from CSV files into your data warehouse (useful for small lookup tables).2. Models: The Building Blocks
dbt models are SQL files that define how your data is transformed. They're the core of your dbt project. Each model represents a step in your transformation process.
models directory of your dbt project.-- models/staging/stg_customers.sqlSELECT
customer_id,
first_name,
last_name,
email
FROM
{{ source('raw_data', 'customers') }}
This model selects data from a source table (explained later) and creates a staging table called stg_customers.
3. Jinja: SQL with Superpowers
Jinja is a templating language that dbt uses to add logic and flexibility to your SQL models. It allows you to:
dbt_project.yml file.if statements and loops to dynamically generate SQL.-- models/intermediate/int_customer_orders.sqlSELECT
c.customer_id,
c.first_name,
c.last_name,
o.order_date,
o.total_amount
FROM
{{ ref('stg_customers') }} c
JOIN
{{ ref('stg_orders') }} o
ON
c.customer_id = o.customer_id
WHERE
o.order_date > '{{ var('cutoff_date') }}'
Here, {{ ref('stg_customers') }} and {{ ref('stg_orders') }} are Jinja functions, and {{ var('cutoff_date') }} accesses a variable defined in your dbt_project.yml.
4. refs: Connecting Your Models
The ref() function is *critical* for defining dependencies between your models. It tells dbt the order in which to run your models. Think of it as drawing arrows between boxes in a data flow diagram.
ref('model_name') returns the fully qualified name of the model_name model.{{ ref('stg_customers') }} tells dbt that the int_customer_orders model depends on the stg_customers model. stg_customers *must* be run before int_customer_orders.5. Tests: Ensuring Data Quality
dbt tests are SQL queries that verify the accuracy and consistency of your data. They are essential for maintaining data quality.
unique, not_null, accepted_values.
* Singular Tests: Custom tests you write to validate specific business rules.
tests directory.# tests/assert_unique_customer_id.yml
version: 2models:
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
This test ensures that the customer_id column in the stg_customers model contains only unique values.
6. Sources: Defining Your Raw Data
dbt sources represent your raw data tables in your data warehouse. They tell dbt where your data comes from.
dbt_project.yml file.# dbt_project.yml
sources:
- name: raw_data
database: your_database
schema: your_schema
tables:
- name: customers
- name: ordersThis defines a source named raw_data pointing to the customers and orders tables in your your_database.your_schema.
7. Materializations: How Models are Built
Materializations control how dbt builds your models in the data warehouse. They determine whether a model is created from scratch each time, incrementally updated, or viewed as a virtual table.
| Materialization | Description | When to Use |
|---|---|---|
table | Creates a new table each time the model is run. | Most common; good for full refreshes and when data changes significantly. |
view | Creates a view (virtual table) that queries the underlying data. | Useful for simple transformations or when you want to avoid storing data. |
incremental | Appends new data to an existing table. | For large tables where you only need to process new data. |
ephemeral | Doesn't create a table; the model's SQL is inlined into dependent models. | For temporary transformations or code reuse. |
table.-- models/my_model.sql
{{ config(materialized='view') }}SELECT ...
Key Takeaways
refs are crucial for dependency management: They define the order in which your models are executed.