store_failures и store_failures_as: debugging workflow

Test failed в CI: ‘unique violated на customer_email — 47 rows’. Какие именно 47 rows? Open dbt logs — там только сводка. Re-run? Slow. Re-create test query вручную? Tedious.

store_failures решает эту проблему. На middle важно не «что такое store_failures», а production workflow: incident response, sensitive data, retention, CI artifact upload, cleanup и cost.

Privacy: PII в audit-таблицах — compliance-риск

Уже знаешь из dbt-i?

Базовый store_failures: true и схема <target>_dbt_test__audit разбирались в dbt-i/07/04. Здесь — production debugging workflow, sensitive data handling, retention и cost.

Recap в одной таблице

Поведение	Значение	Куда
`store_failures: true` (per-test)	сохранять failing rows	`<schema>_dbt_test__audit`
`+store_failures: true` (project)	globally on	`dbt_project.yml`
Перезапись	каждый run overwrites table (replace, не append)	автоматически
Селективно	per critical test, не project-wide на проде	best practice

Дефолт: off. Включать per-test для critical (PK, FK, business-invariants), не глобально (overhead на 100+ tests).

store_failures_as (dbt 1.9+)

В 1.9 добавлен store_failures_as — controls materialization audit table:

- unique:
    config:
      store_failures: true
      store_failures_as: 'view'  # 'table' (default) или 'view'

Option	Behavior
`'table'` (default)	Materialize as table. Persists. Faster queries.
`'view'`	Materialize as view (over the test query). Doesn’t persist data, computed on read.

Use cases:

'view': production where storage matters. Failures computed on demand. Always fresh.
'table': dev / debug, want stable snapshot for investigation.

tests:
  +store_failures: true
  +store_failures_as: "{{ 'table' if target.name == 'dev' else 'view' }}"

Dev: table (persistent debug). Prod: view (storage saving).

Production debugging workflow

Test failed в CI. Steps:

Step 1: Identify failed test.

dbt test --select my_model
# Output:
# FAIL 47 unique_fct_customers_customer_email

Step 2: Examine audit table.

SELECT * FROM analytics_dbt_test__audit.unique_fct_customers_customer_email
LIMIT 100;

Returns 47 duplicate customer_emails. Investigate patterns.

Step 3: Patterns analysis.

SELECT customer_email, COUNT(*)
FROM analytics_dbt_test__audit.unique_fct_customers_customer_email
GROUP BY customer_email
ORDER BY COUNT(*) DESC
LIMIT 10;

Identify most-duplicated emails. Maybe one email has 20 dups (single user multiple registrations?).

Step 4: Root cause investigation.

Trace failures upstream:

-- Where did these duplicates come from?
SELECT s._loaded_at, s._source_file, c.email, COUNT(*)
FROM raw.app.customers s
JOIN analytics_dbt_test__audit.unique_fct_customers_customer_email c USING (email)
GROUP BY 1, 2, 3
ORDER BY 4 DESC;

Find source / ETL load that introduced duplicates.

Step 5: Fix или document.

Bug: fix ETL deduplication.
Acceptable: document tolerance, increase warn_if threshold.
Data quality issue: notify source-team.

Step 6: Cleanup.

-- Audit tables accumulate. Cleanup if needed.
DROP TABLE analytics_dbt_test__audit.unique_fct_customers_customer_email;

Or scheduled cleanup macro (similar to audit log retention).

Schema naming для audit

By default:

# Custom schema for failures
tests:
  +store_failures: true
  +schema: 'test_audit'  # vs default <target>_dbt_test__audit

# Per environment
tests:
  +store_failures: true
  +schema: "{{ 'test_audit_' ~ target.name }}"

Custom names — organization. Default is fine for most projects.

Combining с error_if / warn_if

- not_null:
    config:
      severity: error
      warn_if: ">5"
      error_if: ">50"
      store_failures: true  # save failing rows для investigation

Even if warn (5-50 violations), failures stored. Can investigate before they become errors.

Useful для tuning thresholds: ‘what kinds of failures triggered the warn?’.

Sensitive data в failures: PII risk

Audit tables содержат реальные failing rows. Если тест на customer_email, phone, ssn — failure-row имеет эти значения в plain audit table. Это GDPR/CCPA-relevant.

Production patterns:

1. Restrict grants на audit schema:

tests:
  +store_failures: true
  +grants:
    select: ['data_engineering']  # не аналитики, не BI

2. Mask sensitive columns в самом тесте:

-- tests/unique_customer_email.sql (singular test)
SELECT
    customer_id,
    md5(customer_email) AS customer_email_hash,  -- не plain
    count(*) AS dup_count
FROM {{ ref('fct_customers') }}
GROUP BY 1, 2
HAVING count(*) > 1

Audit показывает hash вместо email — debugging возможен по hash equivalence, PII не утекает.

3. Retention policy — критично:

on-run-end:
  - "{{ cleanup_test_audits(retention_days=7) }}"

Где cleanup_test_audits — macro, который дропает audit-таблицы старше N дней. Без retention аудиты копятся бесконечно (несмотря на overwrite — старые таблицы остаются если тест переименовали).

4. Skip store_failures на PII тестах:

- not_null:
    column_name: customer_email
    config:
      store_failures: false  # PII — debug через grep raw source

Идея: PII тесты failing — investigation через source с full DPA / access logging, не через audit.

Retention и cost

Audit tables кажутся «маленькими» (только failing rows), но:

Сценарий	Audit size
Хороший проект (rare failures)	< 1 MB на тест
Деградировавший проект (10K+ duplicates)	100 MB - 1 GB на тест
50 tests × store_failures каждый run	5-50 GB cumulative

На Snowflake/BigQuery это billed storage. Retention macro обязателен. Пример:

-- macros/cleanup_test_audits.sql
{% macro cleanup_test_audits(retention_days=7) %}
  {% set audit_schema = target.schema ~ '_dbt_test__audit' %}
  {% set cutoff = run_started_at - modules.datetime.timedelta(days=retention_days) %}

  {% set tables_query %}
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = '{{ audit_schema }}'
      AND last_altered < '{{ cutoff }}'
  {% endset %}

  {% set results = run_query(tables_query) %}
  {% for row in results %}
    {% do run_query('DROP TABLE ' ~ audit_schema ~ '.' ~ row[0]) %}
    {% do log('Dropped expired audit: ' ~ row[0], info=true) %}
  {% endfor %}
{% endmacro %}

Запускается на on-run-end, дропает audit-таблицы старше 7 дней.

CI artifact upload

После test failure в GitHub Actions — нужно сохранить audit для investigation после того как CI environment протухнет:

# .github/workflows/dbt-ci.yml
- name: Run dbt tests
  id: dbt_test
  run: dbt test --select state:modified+
  continue-on-error: true

- name: Export failure audit
  if: steps.dbt_test.outcome == 'failure'
  run: |
    python scripts/export_audit_to_csv.py \
      --schema ${{ env.DBT_SCHEMA }}_dbt_test__audit \
      --output ./test_failures/

- name: Upload audit as artifact
  if: steps.dbt_test.outcome == 'failure'
  uses: actions/upload-artifact@v4
  with:
    name: test-failures-${{ github.run_id }}
    path: ./test_failures/
    retention-days: 30

- name: Fail step (после upload)
  if: steps.dbt_test.outcome == 'failure'
  run: exit 1

Логика: continue-on-error чтобы export успел; upload; затем fail step. PR review видит artifact с CSV failing rows без необходимости запускать dbt локально.

Limits и gotchas

1. Audit tables grow

Each dbt test overwrites audit tables (with fresh test query). They don’t accumulate (replace, not append). But on every run, write occurs.

For production with many tests + store_failures everywhere -> overhead. Per-test config (не project-wide) — best practice.

2. View vs table choice

Option	Pros	Cons
`'table'` (default)	cheap repeated queries	storage cost, persists stale data до next run
`'view'`	no storage, always fresh	recompute test SQL на каждом read

Для большинства проектов — table дефолт. view имеет смысл когда test SQL дешёвый, investigation редкий, storage критичен.

3. NULL behavior

Test failure semantic: test query returns rows. If query returns NULL ID columns — audit table has NULL ID rows. Can be confusing.

Example: not_null test:

{% test not_null(model, column_name) %}
    SELECT * FROM {{ model }} WHERE {{ column_name }} IS NULL
{% endtest %}

Audit table — rows с NULL в column_name. Other columns may have values.

Investigation: ‘which 47 rows have NULL email?’ SELECT * FROM audit, JOIN to model on PK.

DuckDB specifics

DuckDB store_failures:

Creates schema dbt_test__audit (или custom prefix).
Tables materialized как DuckDB tables.
View: standard DuckDB views.
Single-writer constraint: parallel test runs могут conflict. Single test runner at a time.

Попробуй сам

В labs:

Add store_failures: true к unique тесту на одной модели.
Force failure — insert duplicate row в source.
Run dbt test — должен fail.
Query audit table — посмотрите failing rows.
Try store_failures_as: ‘view’ — drop table, run test, query — same data via view.
Combine с error_if: store failures even for warns.

Ключевые выводы

store_failures: true — saves failing test rows в audit schema.
Audit schema: <target>_dbt_test__audit by default.
store_failures_as (1.9+): ‘table’ (default) или ‘view’.
Production workflow: identify fail -> examine audit -> investigate patterns -> fix or document.
Per-test config (don’t project-wide для performance).
View saves storage but recomputes. Table cheaper queries but persists.
Sensitive data: access controls + retention policy.
Cleanup: audit tables accumulate, schedule deletes.

Проверка знанийKnowledge check

store_failures: true vs store_failures: false — какие use cases для каждого?

ОтветAnswer

Decision based on debug needs, storage costs, performance.\n\n**store_failures: true**:\n\nUse cases:\n\n1. **Critical tests** где failure debugging — frequent need.\n - unique на PK columns.\n - not_null на required fields.\n - relationships (FK).\n\n2. **Tests с tolerance** (error_if / warn_if).\n - Want to see what triggered warn before becoming error.\n - Threshold tuning evidence.\n\n3. **Tests на business rules**.\n - 'High value orders missing audit' — need to know which orders.\n - Investigation requires actual rows.\n\n4. **Production debugging**.\n - On CI failure, examine audit table.\n - Faster than re-running test or constructing query manually.\n\n**store_failures: false (default)**:\n\nUse cases:\n\n1. **Trivial tests** где failure obvious.\n - accepted_values с small set — failure обычно known cause.\n\n2. **High-volume tests** где audit storage expensive.\n - Statistical tests at scale.\n - Tests на huge tables where audit может be 10K+ rows.\n\n3. **Sensitive data tests**.\n - Tests на PII columns.\n - Compliance requires not persisting sensitive data even in audit.\n\n4. **Performance-critical CI**.\n - Every store_failures = additional write per test.\n - 100 tests × store = 100 extra writes.\n\n**Hybrid approach (recommended)**:\n\n```yaml\n# Project-level: store for critical tests by default\ntests:\n +store_failures: false # default off\n\n# Critical tests opt-in\n- name: fct_orders\n columns:\n - name: order_id\n data_tests:\n - unique:\n config:\n store_failures: true # critical PK\n - not_null:\n config:\n store_failures: true\n```\n\nExplicit opt-in for critical tests. Default off for performance.\n\n**Per environment**:\n\n```yaml\ntests:\n +store_failures: \"{{ True if target.name == 'dev' else False }}\"\n```\n\nDev: always store (debug friendly). Prod: off (performance).\n\nOr:\n\n```yaml\ntests:\n +store_failures: \"{{ True if target.name in ['dev', 'ci'] else False }}\"\n```\n\nCI also stores (для post-mortem). Prod off.\n\n**Per test config**:\n\n```yaml\n- unique:\n config:\n store_failures: true\n store_failures_as: 'view' # view для prod (no storage)\n- not_null:\n config:\n store_failures: true\n store_failures_as: 'table' # table для critical (cheap queries)\n```\n\nFine-grained control per test.\n\n**Lifecycle considerations**:\n\nAudit tables overwrite per run. Не accumulate. But:\n\n1. If retention needed -> backup audit before run.\n2. If investigation requires historic audits -> query within window.\n3. dbt does not auto-clean — manual или scheduled DROP.\n\n**Decision factor: investigation frequency**:\n\n- Investigate failures often? -> store_failures: true.\n- Failures rare и manageable through logs? -> false (default).\n- Statistical / monitoring tests без actionable failures? -> false.\n\n**Anti-patterns**:\n\n1. **Project-wide store без consideration**: 100 tests storing always — wasted I/O и storage.\n2. **Never store критичные** — when failure happens, lost evidence.\n3. **Forgetting cleanup**: audits accumulate (despite overwrite, accumulate across DROP/CREATE iterations).\n\n**Best practice**:\n\n1. Default: off.\n2. Opt-in для critical tests.\n3. Per-environment Jinja conditional для dev/CI/prod.\n4. Cleanup macro в on-run-end:\n\n```yaml\non-run-end:\n - \"{{ cleanup_test_audits(retention_days=7) }}\"\n```\n\nDelete audits older than 7 days. Persistent for debugging, не infinite growth.\n\nKey: storage vs debug. Critical tests — store. Everything else — default off. Per-env tunability.

Проверка знанийKnowledge check

После test failure, audit table показывает 47 failing rows. Какой систематический investigation workflow?

ОтветAnswer

Systematic root-cause analysis через audit table.\n\n**Step 1: Examine failure structure**.\n\n```sql\nSELECT * FROM analytics_dbt_test__audit.unique_fct_customers_customer_email LIMIT 100;\n```\n\nWhat columns в audit? Usually mirrors model schema. Look for clues — created_at? source? etc.\n\n**Step 2: Identify pattern of duplicates**.\n\n```sql\n-- For unique test failure: which emails most duplicated?\nSELECT customer_email, COUNT(*) AS dup_count\nFROM analytics_dbt_test__audit.unique_fct_customers_customer_email\nGROUP BY customer_email\nORDER BY dup_count DESC\nLIMIT 10;\n```\n\nResults:\n\n```\nemail | dup_count\[email protected] | 23 ← test data leaked?\[email protected] | 8\[email protected] | 2\n...\n```\n\n23 duplicates for [email protected] — likely test data in production. Investigate.\n\n**Step 3: Temporal analysis**.\n\n```sql\n-- When did duplicates appear?\nSELECT \n DATE(created_at) AS created_date,\n COUNT(*) AS dups_on_date\nFROM analytics_dbt_test__audit.unique_fct_customers_customer_email\nGROUP BY 1\nORDER BY 1 DESC;\n```\n\nResults:\n\n```\ncreated_date | dups_on_date\n2026-05-19 | 30 ← yesterday spike\n2026-05-18 | 5\n2026-05-17 | 2\n...\n```\n\n30 duplicates yesterday. Spike. What changed?\n\n**Step 4: Cross-reference с source**.\n\n```sql\n-- Trace failures back to source.\nWITH failures AS (\n SELECT customer_id, customer_email\n FROM analytics_dbt_test__audit.unique_fct_customers_customer_email\n)\nSELECT \n s._loaded_at,\n s._source_file,\n COUNT(*) AS failed_rows\nFROM raw.app.customers s\nJOIN failures f USING (customer_id, customer_email)\nGROUP BY 1, 2\nORDER BY 1 DESC;\n```\n\nResults:\n\n```\n_loaded_at | _source_file | failed_rows\n2026-05-19 02:00 | etl_export_v2.csv | 30 ← bug в new ETL\n2026-05-18 02:00 | etl_export.csv | 5\n```\n\nNew ETL version ('etl_export_v2.csv') on May 19 introduced 30 duplicates.\n\n**Step 5: Confirm hypothesis**.\n\n```sql\n-- Did the v2 ETL have duplicate logic?\nSELECT * FROM raw.app.customers\nWHERE _source_file = 'etl_export_v2.csv'\nLIMIT 100;\n\n-- Look for duplicate customer_id values\nSELECT customer_id, COUNT(*) FROM raw.app.customers\nWHERE _source_file = 'etl_export_v2.csv'\nGROUP BY 1 HAVING COUNT(*) > 1;\n```\n\nIf v2 ETL имеет duplicate IDs — source bug confirmed.\n\n**Step 6: Fix**.\n\nOptions:\n\n**A. Fix source (best)**:\n- Contact ETL team, fix dedup logic.\n- Re-run ETL when fixed.\n\n**B. Fix в staging (temporary)**:\n\n```sql\n-- stg_app__customers.sql\nWITH source AS (\n SELECT * FROM raw.app.customers\n),\ndeduped AS (\n {{ dbt_utils.deduplicate(\n source,\n partition_by='customer_id',\n order_by='_loaded_at DESC'\n ) }}\n)\nSELECT * FROM deduped\n```\n\nDeduplicate in staging. Sacrifices: which row to keep (most recent _loaded_at). Logs warning.\n\n**C. Filter in staging**:\n\n```sql\nWHERE _source_file != 'etl_export_v2.csv' -- exclude broken file\n```\n\nTemporary while ETL fixed.\n\n**Step 7: Verify fix**.\n\n```bash\ndbt build --select stg_app__customers fct_orders\ndbt test --select fct_orders\n# Test should now pass\n```\n\n**Step 8: Document incident**.\n\n```markdown\n## Incident: customer_email duplicates 2026-05-19\n\n**Root cause**: ETL v2 had broken dedup logic, introduced 30 duplicates.\n\n**Fix**: Reverted to ETL v1 (2026-05-19 16:00).\n\n**Prevention**:\n- Added not_null test for _source_file (ensure tracking).\n- Added unique_combination_of_columns test for (customer_id, _loaded_at).\n- Schedule weekly review of ETL changes affecting source quality.\n```\n\nIncident review prevents recurrence.\n\n**Workflow summary**:\n\n1. **Examine** audit (look at rows).\n2. **Pattern** (group by, find common).\n3. **Temporal** (when did they appear?).\n4. **Source trace** (where did они come from?).\n5. **Confirm** hypothesis.\n6. **Fix** (source / staging / filter).\n7. **Verify** (re-run tests).\n8. **Document** (incident notes).\n\nKey: audit table — first step diagnosis. Без store_failures, this workflow слишком tedious (need to manually reconstruct test query).\n\nWithout store_failures, you would:\n1. See 47 failures в log.\n2. Find test SQL в compiled.\n3. Run manually.\n4. Investigate.\n\nMore time-consuming. store_failures saves this step, accelerating debugging.

store_failures и store_failures_as: debugging workflow

Recap в одной таблице

store_failures_as (dbt 1.9+)

Production debugging workflow

Schema naming для audit

Combining с error_if / warn_if

Sensitive data в failures: PII risk

Retention и cost

CI artifact upload

Limits и gotchas

1. Audit tables grow

2. View vs table choice

3. NULL behavior

DuckDB specifics

Попробуй сам

Ключевые выводы

Закончили урок?