store_failures и store_failures_as: debugging workflow
Test failed в CI: ‘unique violated на customer_email — 47 rows’. Какие именно 47 rows? Open dbt logs — там только сводка. Re-run? Slow. Re-create test query вручную? Tedious.
store_failures решает эту проблему. На middle важно не «что такое store_failures», а production workflow: incident response, sensitive data, retention, CI artifact upload, cleanup и cost.
Базовый store_failures: true и схема <target>_dbt_test__audit разбирались в dbt-i/07/04. Здесь — production debugging workflow, sensitive data handling, retention и cost.
Recap в одной таблице
| Поведение | Значение | Куда |
|---|---|---|
store_failures: true (per-test) | сохранять failing rows | <schema>_dbt_test__audit |
+store_failures: true (project) | globally on | dbt_project.yml |
| Перезапись | каждый run overwrites table (replace, не append) | автоматически |
| Селективно | per critical test, не project-wide на проде | best practice |
Дефолт: off. Включать per-test для critical (PK, FK, business-invariants), не глобально (overhead на 100+ tests).
store_failures_as (dbt 1.9+)
В 1.9 добавлен store_failures_as — controls materialization audit table:
- unique:
config:
store_failures: true
store_failures_as: 'view' # 'table' (default) или 'view'
| Option | Behavior |
|---|---|
'table' (default) | Materialize as table. Persists. Faster queries. |
'view' | Materialize as view (over the test query). Doesn’t persist data, computed on read. |
Use cases:
'view': production where storage matters. Failures computed on demand. Always fresh.'table': dev / debug, want stable snapshot for investigation.
tests:
+store_failures: true
+store_failures_as: "{{ 'table' if target.name == 'dev' else 'view' }}"
Dev: table (persistent debug). Prod: view (storage saving).
Production debugging workflow
Test failed в CI. Steps:
Step 1: Identify failed test.
dbt test --select my_model
# Output:
# FAIL 47 unique_fct_customers_customer_email
Step 2: Examine audit table.
SELECT * FROM analytics_dbt_test__audit.unique_fct_customers_customer_email
LIMIT 100;
Returns 47 duplicate customer_emails. Investigate patterns.
Step 3: Patterns analysis.
SELECT customer_email, COUNT(*)
FROM analytics_dbt_test__audit.unique_fct_customers_customer_email
GROUP BY customer_email
ORDER BY COUNT(*) DESC
LIMIT 10;
Identify most-duplicated emails. Maybe one email has 20 dups (single user multiple registrations?).
Step 4: Root cause investigation.
Trace failures upstream:
-- Where did these duplicates come from?
SELECT s._loaded_at, s._source_file, c.email, COUNT(*)
FROM raw.app.customers s
JOIN analytics_dbt_test__audit.unique_fct_customers_customer_email c USING (email)
GROUP BY 1, 2, 3
ORDER BY 4 DESC;
Find source / ETL load that introduced duplicates.
Step 5: Fix или document.
- Bug: fix ETL deduplication.
- Acceptable: document tolerance, increase warn_if threshold.
- Data quality issue: notify source-team.
Step 6: Cleanup.
-- Audit tables accumulate. Cleanup if needed.
DROP TABLE analytics_dbt_test__audit.unique_fct_customers_customer_email;
Or scheduled cleanup macro (similar to audit log retention).
Schema naming для audit
By default:
# Custom schema for failures
tests:
+store_failures: true
+schema: 'test_audit' # vs default <target>_dbt_test__audit
# Per environment
tests:
+store_failures: true
+schema: "{{ 'test_audit_' ~ target.name }}"
Custom names — organization. Default is fine for most projects.
Combining с error_if / warn_if
- not_null:
config:
severity: error
warn_if: ">5"
error_if: ">50"
store_failures: true # save failing rows для investigation
Even if warn (5-50 violations), failures stored. Can investigate before they become errors.
Useful для tuning thresholds: ‘what kinds of failures triggered the warn?’.
Sensitive data в failures: PII risk
Audit tables содержат реальные failing rows. Если тест на customer_email, phone, ssn — failure-row имеет эти значения в plain audit table. Это GDPR/CCPA-relevant.
Production patterns:
1. Restrict grants на audit schema:
tests:
+store_failures: true
+grants:
select: ['data_engineering'] # не аналитики, не BI
2. Mask sensitive columns в самом тесте:
-- tests/unique_customer_email.sql (singular test)
SELECT
customer_id,
md5(customer_email) AS customer_email_hash, -- не plain
count(*) AS dup_count
FROM {{ ref('fct_customers') }}
GROUP BY 1, 2
HAVING count(*) > 1
Audit показывает hash вместо email — debugging возможен по hash equivalence, PII не утекает.
3. Retention policy — критично:
on-run-end:
- "{{ cleanup_test_audits(retention_days=7) }}"
Где cleanup_test_audits — macro, который дропает audit-таблицы старше N дней. Без retention аудиты копятся бесконечно (несмотря на overwrite — старые таблицы остаются если тест переименовали).
4. Skip store_failures на PII тестах:
- not_null:
column_name: customer_email
config:
store_failures: false # PII — debug через grep raw source
Идея: PII тесты failing — investigation через source с full DPA / access logging, не через audit.
Retention и cost
Audit tables кажутся «маленькими» (только failing rows), но:
| Сценарий | Audit size |
|---|---|
| Хороший проект (rare failures) | < 1 MB на тест |
| Деградировавший проект (10K+ duplicates) | 100 MB - 1 GB на тест |
| 50 tests × store_failures каждый run | 5-50 GB cumulative |
На Snowflake/BigQuery это billed storage. Retention macro обязателен. Пример:
-- macros/cleanup_test_audits.sql
{% macro cleanup_test_audits(retention_days=7) %}
{% set audit_schema = target.schema ~ '_dbt_test__audit' %}
{% set cutoff = run_started_at - modules.datetime.timedelta(days=retention_days) %}
{% set tables_query %}
SELECT table_name FROM information_schema.tables
WHERE table_schema = '{{ audit_schema }}'
AND last_altered < '{{ cutoff }}'
{% endset %}
{% set results = run_query(tables_query) %}
{% for row in results %}
{% do run_query('DROP TABLE ' ~ audit_schema ~ '.' ~ row[0]) %}
{% do log('Dropped expired audit: ' ~ row[0], info=true) %}
{% endfor %}
{% endmacro %}
Запускается на on-run-end, дропает audit-таблицы старше 7 дней.
CI artifact upload
После test failure в GitHub Actions — нужно сохранить audit для investigation после того как CI environment протухнет:
# .github/workflows/dbt-ci.yml
- name: Run dbt tests
id: dbt_test
run: dbt test --select state:modified+
continue-on-error: true
- name: Export failure audit
if: steps.dbt_test.outcome == 'failure'
run: |
python scripts/export_audit_to_csv.py \
--schema ${{ env.DBT_SCHEMA }}_dbt_test__audit \
--output ./test_failures/
- name: Upload audit as artifact
if: steps.dbt_test.outcome == 'failure'
uses: actions/upload-artifact@v4
with:
name: test-failures-${{ github.run_id }}
path: ./test_failures/
retention-days: 30
- name: Fail step (после upload)
if: steps.dbt_test.outcome == 'failure'
run: exit 1
Логика: continue-on-error чтобы export успел; upload; затем fail step. PR review видит artifact с CSV failing rows без необходимости запускать dbt локально.
Limits и gotchas
1. Audit tables grow
Each dbt test overwrites audit tables (with fresh test query). They don’t accumulate (replace, not append). But on every run, write occurs.
For production with many tests + store_failures everywhere -> overhead. Per-test config (не project-wide) — best practice.
2. View vs table choice
| Option | Pros | Cons |
|---|---|---|
'table' (default) | cheap repeated queries | storage cost, persists stale data до next run |
'view' | no storage, always fresh | recompute test SQL на каждом read |
Для большинства проектов — table дефолт. view имеет смысл когда test SQL дешёвый, investigation редкий, storage критичен.
3. NULL behavior
Test failure semantic: test query returns rows. If query returns NULL ID columns — audit table has NULL ID rows. Can be confusing.
Example: not_null test:
{% test not_null(model, column_name) %}
SELECT * FROM {{ model }} WHERE {{ column_name }} IS NULL
{% endtest %}
Audit table — rows с NULL в column_name. Other columns may have values.
Investigation: ‘which 47 rows have NULL email?’ SELECT * FROM audit, JOIN to model on PK.
DuckDB specifics
DuckDB store_failures:
- Creates schema dbt_test__audit (или custom prefix).
- Tables materialized как DuckDB tables.
- View: standard DuckDB views.
- Single-writer constraint: parallel test runs могут conflict. Single test runner at a time.
Попробуй сам
В labs:
- Add store_failures: true к unique тесту на одной модели.
- Force failure — insert duplicate row в source.
- Run dbt test — должен fail.
- Query audit table — посмотрите failing rows.
- Try store_failures_as: ‘view’ — drop table, run test, query — same data via view.
- Combine с error_if: store failures even for warns.
Ключевые выводы
- store_failures: true — saves failing test rows в audit schema.
- Audit schema:
<target>_dbt_test__auditby default. - store_failures_as (1.9+): ‘table’ (default) или ‘view’.
- Production workflow: identify fail -> examine audit -> investigate patterns -> fix or document.
- Per-test config (don’t project-wide для performance).
- View saves storage but recomputes. Table cheaper queries but persists.
- Sensitive data: access controls + retention policy.
- Cleanup: audit tables accumulate, schedule deletes.