audit_helper и codegen: миграции и автогенерация
Два уникальных package в dbt экосистеме: dbt_audit_helper (для миграций моделей) и dbt_codegen (для auto-generation YAML и SQL). Они не нужны постоянно — но в нужный момент economically спасают часы работы.
Этот урок — когда какой package нужен, и production workflow patterns.
SCD Type 2: когда миграция модели меняет историческую логикуdbt_audit_helper: сравнение old vs new
Use case: рефакторите модель fct_orders. Новая реализация должна давать те же результаты, что старая (no business logic changes). Как это проверить?
Без audit_helper — пишите вручную:
-- diff query вручную
WITH old AS (
SELECT * FROM analytics.fct_orders_old
),
new AS (
SELECT * FROM analytics.fct_orders_new
)
SELECT 'in_old_not_new' AS where, * FROM old EXCEPT SELECT * FROM new
UNION ALL
SELECT 'in_new_not_old' AS where, * FROM new EXCEPT SELECT * FROM old
audit_helper делает это в одну строку.
Setup
# packages.yml
packages:
- package: dbt-labs/audit_helper
version: [">=0.10.0", "<1.0.0"]
dbt deps
Базовое сравнение: compare_relations
-- analyses/audit_fct_orders_migration.sql
{{ audit_helper.compare_relations(
a_relation=ref('fct_orders_old'),
b_relation=ref('fct_orders_new'),
primary_key='order_id',
exclude_columns=['_loaded_at']
) }}
Compile output:
-- Compare summary
WITH a AS (SELECT * FROM analytics.fct_orders_old),
b AS (SELECT * FROM analytics.fct_orders_new),
diff AS (
SELECT 'in_a_not_b' AS source, * FROM a EXCEPT SELECT 'in_a_not_b', * FROM b
UNION ALL
SELECT 'in_b_not_a' AS source, * FROM b EXCEPT SELECT 'in_b_not_a', * FROM a
)
SELECT source, COUNT(*) AS n_records FROM diff GROUP BY source
Run:
dbt compile --select audit_fct_orders_migration
dbt show --select audit_fct_orders_migration
Result:
source | n_records
in_a_not_b | 0
in_b_not_a | 0
Если обе строки — 0, миграция bit-exact. Если non-zero — есть differences, investigate.
Per-row comparison: compare_queries
Более детальный compare — row-by-row:
-- analyses/audit_fct_orders_per_row.sql
{{ audit_helper.compare_queries(
a_query='SELECT * FROM ' ~ ref('fct_orders_old'),
b_query='SELECT * FROM ' ~ ref('fct_orders_new'),
primary_key='order_id',
exclude_columns=['_loaded_at']
) }}
Возвращает rows со differences и которые колонки отличаются.
Column-level diff: compare_column_values
-- analyses/audit_amount_only.sql
{{ audit_helper.compare_column_values(
a_query='SELECT order_id, amount FROM ' ~ ref('fct_orders_old'),
b_query='SELECT order_id, amount FROM ' ~ ref('fct_orders_new'),
primary_key='order_id',
column_to_compare='amount'
) }}
Возвращает summary:
- Match (одинаковое значение).
- a_only / b_only (есть только в одной).
- Different (в обеих, но разные значения).
Полезно для targeted comparison: ‘мы изменили calculation amount, есть ли differences?’.
Production migration workflow
Реальная миграция модели через audit_helper:
Step 1: build both versions
-- models/marts/fct_orders.sql — OLD (current production)
SELECT ...
-- models/marts/fct_orders_v2.sql — NEW (refactored)
SELECT ... -- new implementation, target same logic
dbt build --select fct_orders fct_orders_v2
Обе модели materialized в warehouse.
Step 2: high-level diff
-- analyses/audit_fct_orders_migration.sql
{{ audit_helper.compare_relations(
a_relation=ref('fct_orders'),
b_relation=ref('fct_orders_v2'),
primary_key='order_id',
exclude_columns=['_loaded_at', '_etl_run_id']
) }}
dbt compile --select audit_fct_orders_migration
dbt show
Если differences — get count by source.
Step 3: investigate differences
Если in_a_not_b > 0 — есть rows в OLD не в NEW (NEW потерял rows).
-- analyses/audit_missing_in_v2.sql
SELECT order_id FROM {{ ref('fct_orders') }}
EXCEPT
SELECT order_id FROM {{ ref('fct_orders_v2') }}
LIMIT 100
Investigate какие order_id missing. Common causes:
- WHERE filter changed.
- JOIN type changed (INNER vs LEFT).
- Source filter slipped.
Step 4: per-column comparison
Если row counts match, но content differs:
{{ audit_helper.compare_column_values(
a_query='SELECT order_id, amount FROM ' ~ ref('fct_orders'),
b_query='SELECT order_id, amount FROM ' ~ ref('fct_orders_v2'),
primary_key='order_id',
column_to_compare='amount'
) }}
Repeat для каждой важной колонки. Identify which calculations diverged.
Step 5: cutover
После 100% match (или acceptable known diff):
- Renaming:
fct_orders.sql-> archive,fct_orders_v2.sql->fct_orders.sql. - Update downstream models if needed.
- PR review.
- Deploy.
audit_helper диагностические analyses можно оставить в репо как proof of equivalence.
Common audit_helper patterns
Tolerance для numeric
Numeric differences могут быть float artifacts. audit_helper не имеет native tolerance, но можно wrap:
{{ audit_helper.compare_queries(
a_query=\"SELECT order_id, ROUND(amount, 2) AS amount FROM \" ~ ref('fct_orders_old'),
b_query=\"SELECT order_id, ROUND(amount, 2) AS amount FROM \" ~ ref('fct_orders_new'),
primary_key='order_id'
) }}
ROUND до 2 decimals = ignores float noise.
Sample testing
На больших tables — diff на subset:
{{ audit_helper.compare_queries(
a_query=\"SELECT * FROM \" ~ ref('fct_orders_old') ~ \" WHERE order_date >= current_date - 7\",
b_query=\"SELECT * FROM \" ~ ref('fct_orders_new') ~ \" WHERE order_date >= current_date - 7\",
primary_key='order_id'
) }}
Last week only — fast diff. Если совпадает, expand window.
Per-environment diff
-- Сравнить dev vs prod
{{ audit_helper.compare_queries(
a_query=\"SELECT * FROM dev_analytics.fct_orders\",
b_query=\"SELECT * FROM prod_analytics.fct_orders\",
primary_key='order_id'
) }}
Catches drift между environments. Useful если dev должен mirror prod.
dbt_codegen: auto-generation
dbt_codegen генерирует YAML / SQL automatically на based on warehouse schema. Use cases:
- Bootstrap source YAML — для new source database с 100 tables.
- Bootstrap staging models — typing
SELECT col1, col2, ...для каждой source table. - Generate base model YAML — column descriptions, типы.
Setup
# packages.yml
packages:
- package: dbt-labs/codegen
version: [">=0.13.0", "<1.0.0"]
dbt deps
generate_source: создание YAML для source
dbt run-operation generate_source --args '{schema_name: app, database_name: raw}'
Output (печатает в stdout):
version: 2
sources:
- name: app
database: raw
tables:
- name: customers
- name: orders
- name: order_items
- name: products
- name: payments
Save into models/sources/_sources.yml. Manual review, добавление descriptions, freshness, tests.
С extras:
dbt run-operation generate_source --args '{
schema_name: app,
database_name: raw,
generate_columns: true,
include_descriptions: true
}'
Generates full YAML с columns:
sources:
- name: app
database: raw
tables:
- name: customers
description: "" # to be filled
columns:
- name: customer_id
description: ""
- name: name
description: ""
- name: email
description: ""
Затем добавляете descriptions manually.
generate_base_model: bootstrap staging
dbt run-operation generate_base_model --args '{
source_name: app,
table_name: customers
}'
Output:
WITH source AS (
SELECT * FROM {{ source('app', 'customers') }}
),
renamed AS (
SELECT
customer_id,
first_name,
last_name,
email,
created_at,
updated_at
FROM source
)
SELECT * FROM renamed
Save в models/staging/stg_app__customers.sql. Modify rename column logic.
С extras:
dbt run-operation generate_base_model --args '{
source_name: app,
table_name: customers,
leading_commas: true
}'
Стиль leading commas (some teams prefer).
generate_model_yaml: создание YAML
После создания model:
dbt run-operation generate_model_yaml --args '{
model_names: [stg_app__customers, stg_app__orders]
}'
Output:
version: 2
models:
- name: stg_app__customers
description: ""
columns:
- name: customer_id
description: ""
- name: first_name
description: ""
- name: email
description: ""
Save в models/staging/_models.yml. Manual add: descriptions, tests.
Production workflow
New project bootstrap:
dbt run-operation generate_source --args '{schema_name: app}'-> save into _sources.yml.- For each table:
dbt run-operation generate_base_model --args '{source_name: app, table_name: customers}' > models/staging/stg_app__customers.sql
dbt run-operation generate_model_yaml --args '{model_names: [...]}'-> save into _models.yml.- Manual: add tests, descriptions, transformations.
Saves hours для project с 50+ tables.
New table addition:
- Source-team added new table
app.payments. dbt run-operation generate_base_model --args '{source_name: app, table_name: payments}'-> save.- Add source entry in _sources.yml.
- Add model YAML.
Without codegen — manual typing all column names from DESCRIBE TABLE.
Gotchas
audit_helper
-
Performance: для huge tables — compare очень тяжёлый. EXCEPT scans both relations. Use sample / where фильтры.
-
Float precision: numeric differences часто float noise. Round before compare.
-
Order independence: audit_helper does not care row order. Если оба относительно одинаковые но в разном порядке — ok.
-
NULL handling: NULL != NULL в SQL. Rows с NULL могут appear в diff даже если semantically одинаковы. Use COALESCE в queries.
-
Schema differences: разные column orders, разные types — compare может fail на implicit cast issues. Align schemas via SELECT explicit columns.
codegen
-
Need warehouse access: codegen reads schema через
adapter.get_columns_in_relation. Warehouse должна быть accessible. -
No business logic: generates structural skeleton. Все business renaming, transformations, tests — manual.
-
Re-generation overwrites: если manually добавили columns, re-run codegen затирает changes. Save outputs cautiously.
-
Output to stdout: codegen prints, не пишет в файлы. Use shell redirection:
dbt run-operation generate_base_model --args '...' > models/staging/stg_xxx.sql
- Pre-1.5 syntax: некоторые codegen versions use older YAML syntax. Проверяйте output на compliance с current dbt version.
Combining patterns
Pattern: large source onboarding.
# Step 1: generate source YAML
dbt run-operation generate_source --args '{schema_name: app, generate_columns: true}' > models/sources/_sources.yml
# Step 2: for each table, generate base model
for table in customers orders order_items products payments; do
dbt run-operation generate_base_model --args "{source_name: app, table_name: $table}" > models/staging/stg_app__$table.sql
done
# Step 3: generate model YAML
dbt run-operation generate_model_yaml --args '{model_names: [stg_app__customers, stg_app__orders, ...]}' > models/staging/_models.yml
# Step 4: manual review, edit, add tests
5 tables × 5 minutes = 25 minutes vs 5 hours of manual typing.
Pattern: model refactor verification.
# 1. Build both versions
dbt build --select fct_orders fct_orders_v2
# 2. Compare
dbt compile --select audit_fct_orders_migration
dbt show
# 3. If diff:
dbt compile --select audit_missing_in_v2
dbt show
# 4. Iterate v2 until 100% match
# 5. Rename, commit, deploy
DuckDB specifics
audit_helper и codegen работают на DuckDB. Особенности:
- codegen read schema через DuckDB system tables. Тип BIGINT, TIMESTAMP — DuckDB-specific defaults.
- audit_helper EXCEPT — DuckDB поддерживает EXCEPT BY NAME (column matching by name, not position) с extensions. Standard EXCEPT работает.
Performance на DuckDB обычно good для these operations (in-memory processing).
Попробуй сам
В labs:
-
audit_helper migration workflow:
- Create fct_orders_v2 — refactored version of fct_orders.
- Use compare_relations для high-level diff.
- Use compare_column_values для targeted column diff.
- Get to 100% equivalence.
-
codegen для new source:
- generate_source для имеющейся schema.
- generate_base_model для каждой table.
- generate_model_yaml для документации.
- Review output, manually добавьте descriptions.
-
Combined workflow: используйте codegen для нового source, проверьте через audit_helper что новые staging models возвращают ту же data что source.
Ключевые выводы
- audit_helper — для migration verification. compare_relations (high-level), compare_queries (per-row), compare_column_values (per-column).
- codegen — для auto-generation YAML / SQL skeletons. generate_source, generate_base_model, generate_model_yaml.
- audit_helper workflow: build both versions -> compare relations -> investigate differences -> iterate -> cutover.
- codegen workflow: generate skeletons -> manually add business logic / descriptions / tests.
- Numeric precision: round before audit_helper compare. NULL handling — COALESCE.
- codegen output to stdout — use shell redirection для files. Don’t overwrite manual edits.
- DuckDB: оба packages работают, performance в memory good.