where filter: subset testing

dbt-tests applied к whole model by default. На huge tables (millions of rows) это дорого. where config позволяет ограничить test subset rows — performance, focused testing, conditional tests.

В этом уроке — practical patterns where filter.

SQL: FILTER clause и WHERE vs HAVING

Базовое использование

columns:
  - name: customer_email
    data_tests:
      - not_null:
          config:
            where: "order_date >= current_date - 30"

Test runs SELECT * FROM model WHERE customer_email IS NULL AND order_date >= current_date - 30.

Только последний месяц tested. Historical NULLs ignored (могут быть legacy data quality issues).

Performance optimization

- unique:
    config:
      where: "created_at >= current_date - 7"  # last week only
      tags: ['ci_fast']

Last week — fast. Full table test раз в weekly schedule.

# CI (fast)
dbt test --select tag:ci_fast

# Weekly (slow, comprehensive)
dbt test --select tag:weekly_full

Tradeoff: faster CI vs less coverage. Acceptable если recent data — what matters в production.

Recent data focus

Most data quality issues occur с new data. Old data (validated long ago) — likely stable.

- not_null:
    config:
      where: "order_date >= current_date - 30"
      severity: error
- not_null:
    config:
      where: "order_date < current_date - 30"
      severity: warn  # historical — relaxed

Two tests:

Recent (30 days): error severity, block CI.
Historical: warn, don’t block.

Layered quality management.

Exclude PII / sensitive

- expect_column_values_to_match_regex:
    regex: '^[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}$'
    config:
      where: "is_test_account = false"  # exclude test accounts
      severity: error

Test accounts могут have malformed emails (e.g., ‘test@’, ‘user@localhost’). Exclude из validation чтобы не fail legitimately tests.

- not_null:
    config:
      where: "deleted_at IS NULL"  # exclude soft-deleted rows

Soft-deleted records — don’t care about their data quality.

Conditional tests

- expect_column_mean_to_be_between:
    column_name: amount
    min_value: 100
    max_value: 500
    config:
      where: "customer_country = 'US'"  # US-specific expectations

Different expectations per segment:

- expect_column_mean_to_be_between:
    column_name: amount
    min_value: 100
    max_value: 500
    config:
      where: "customer_country = 'US'"

- expect_column_mean_to_be_between:
    column_name: amount
    min_value: 50
    max_value: 200
    config:
      where: "customer_country = 'IN'"

US: $100-500 expected. India:$ 50-200. Multiple tests с different conditions.

Complex where conditions

- not_null:
    config:
      where: |
        order_date >= current_date - 30
        AND status != 'cancelled'
        AND customer_country = 'US'

Multiline. Complex filters. Same SQL syntax as model SELECT WHERE.

where с Jinja

- not_null:
    config:
      where: "order_date >= '{{ var('test_start_date', '2026-01-01') }}'"

Variable-driven:

dbt test --vars 'test_start_date: 2026-05-01'

Or target-conditional:

- not_null:
    config:
      where: "{{ \"order_date >= current_date - 7\" if target.name == 'ci' else \"1=1\" }}"

CI: subset. Prod: full table.

tags + where combination

- unique:
    config:
      where: "order_date >= current_date - 30"
      tags: ['recent']
- unique:
    config:
      tags: ['full']  # no where

dbt test --select tag:recent  # fast PR check
dbt test --select tag:full   # full coverage daily

Multi-level testing strategy.

Common production patterns

Pattern 1: Recent-only PR checks

- unique:
    config:
      where: "order_date >= current_date - 7"
      tags: ['pr_check']
- not_null:
    config:
      where: "order_date >= current_date - 7"
      tags: ['pr_check']

# PR CI
dbt test --select tag:pr_check  # fast

PRs check last 7 days. Recent issues caught quickly. Historical — separate suite.

Pattern 2: Per-environment subsets

- not_null:
    config:
      where: "{{ \"order_date >= current_date - 30\" if target.name == 'dev' else \"1=1\" }}"

Dev: 30 days subset (fast iteration). Prod: all data.

Pattern 3: Hot-data vs cold-data

- unique:
    config:
      where: "order_date >= current_date - 30"
      tags: ['hot_data']
      severity: error
- unique:
    config:
      where: "order_date < current_date - 30"
      tags: ['cold_data']
      severity: warn

Hot data critical. Cold data relaxed (historical, less actionable).

Pattern 4: Sample testing

- unique:
    config:
      where: "MOD(customer_id, 10) = 0"  # 10% sample
      tags: ['sample']

10% sample for fast tests. Full coverage в weekly.

Pattern 5: Tier-specific

- premium_revenue_threshold:
    config:
      where: "tier = 'premium'"
- standard_revenue_threshold:
    config:
      where: "tier = 'standard'"

Different rules per tier. Cleaner than one mega-test со логикой.

Gotchas

1. where с NULL

- unique:
    config:
      where: "customer_email IS NOT NULL"

This excludes NULL emails from uniqueness test. Если NULL emails are duplicates — not caught.

Be intentional. Add separate not_null test:

- not_null:
- unique:
    config:
      where: "customer_email IS NOT NULL"  # composite logic

Both tests cover full picture.

2. where on derived columns

- unique:
    config:
      where: "DATE(created_at) >= current_date - 30"

DATE() function in WHERE — runtime computation. May not use index. Slower.

Better: compare to TIMESTAMP:

- unique:
    config:
      where: "created_at >= (current_date - 30)::timestamp"

Or pre-compute в model:

-- В модели
SELECT *, DATE(created_at) AS created_date FROM ...

- unique:
    config:
      where: "created_date >= current_date - 30"

Pre-computed column — faster WHERE.

3. where не applies к all tests

- relationships:
    to: ref('dim_customers')
    field: customer_id
    config:
      where: "order_date >= current_date - 30"

Test relationships subset. But what about FK violations в older orders? Not tested.

Decide:

Recent FK is what matters (real-time integrity).
Or historical FK too (full coverage).

Layered approach: where=recent (error) + no where (warn) для full coverage.

4. where в singular tests

Singular tests (custom SQL в tests/ directory) don’t get auto where. Add manually:

-- tests/orders_amount_consistency.sql
SELECT 1 WHERE 1=0  -- placeholder
-- where filtering done в test SQL itself, not via config

For generic tests с where config — works. For singular — manual.

Performance impact

where filter helps when:

Table is large (10M+ rows).
Recent subset much smaller (10K rows last week vs 10M total).
Index on filtered column (e.g., order_date).

Doesn’t help когда:

Table small (менее 100K rows).
Filtered column not indexed.
Most rows pass filter (no real subset reduction).

Measure: EXPLAIN test query to verify pruning.

DuckDB specifics

DuckDB optimizer handles WHERE well, often partition pruning if data clustered.
For huge tables -> make sure column is sorted / clustered.

# dbt_project.yml
models:
  +cluster_by: ['order_date']

Some adapters (Snowflake, BigQuery) support clustering. DuckDB has automatic optimization based on access patterns.

Попробуй сам

В labs:

Add where к test: where: “order_date не меньше current_date - 30”. Run test, измените where, observe what changes в compiled SQL.
Create dual tests: recent (error) + historical (warn).
Compare run times: with vs without where on large model.
Add target-conditional: dev subset, prod full.
Sample test: where MOD(id, 100) = 0 — 1% sample, fast.

Ключевые выводы

where config ограничивает test subset rows. Performance + focused.
Recent-only для PR CI fast (last 7-30 days).
Tier-specific tests с where per segment.
Per-environment: dev subset, prod full.
Tags + where для multi-level testing strategy.
Layered: recent error + historical warn для balanced approach.
Gotchas: NULL exclusion может miss issues, derived columns slow, не all tests support where same way.
Performance: where helps на large tables с indexed filter columns.

Проверка знанийKnowledge check

Команда хочет fast PR checks (менее 1 minute). Большой проект — 500K test queries времени 15 minutes. Какая стратегия с where?

ОтветAnswer

Fast PR checks через aggressive subset testing с tags + where filtering.\n\n**Step 1: Categorize tests**.\n\nIdentify slow tests:\n\n```bash\ndbt test --resource-type test --output json | jq '.[] | {name: .name, time: .timing}'\n```\n\nOr через run_results.json analysis. Top 20 slowest tests.\n\n**Step 2: Tag fast vs slow**.\n\n```yaml\n# Critical PK / FK tests с small subset (fast)\n- unique:\n config:\n where: \"order_date не меньше current_date - 7\"\n tags: ['pr_fast']\n\n# Statistical / comprehensive (slow)\n- expect_column_mean_to_be_between:\n config:\n tags: ['daily_slow']\n```\n\n**Step 3: PR CI uses tag fast**.\n\n```yaml\n# .github/workflows/pr.yml\n- name: Run fast tests\n run: dbt test --select tag:pr_fast --target ci\n```\n\nPR CI tests subset. менее 1 minute target.\n\n**Step 4: Daily / weekly comprehensive**.\n\n```yaml\n# .github/workflows/scheduled.yml\n- name: Run full tests\n run: dbt test --target prod # all tests\n schedule:\n - cron: '0 2 * * *' # daily at 2am\n```\n\nComprehensive nightly. CI fast, comprehensive scheduled.\n\n**Step 5: Specific optimizations**.\n\nDifferent strategies per test type:\n\n**Tests на huge tables**:\n\n```yaml\n- unique:\n config:\n where: \"order_date не меньше current_date - 7\" # recent only\n tags: ['pr_fast']\n```\n\nLast week subset.\n\n**Statistical tests**:\n\n```yaml\n- expect_column_mean_to_be_between:\n config:\n where: \"order_date не меньше current_date - 7\"\n severity: warn # don't block PR\n tags: ['pr_fast']\n```\n\nWarn instead error — investigation but не block.\n\n**Complex business tests**:\n\n```yaml\n- complex_business_rule:\n config:\n where: \"customer_id IN (SELECT customer_id FROM critical_segment)\"\n tags: ['pr_fast']\n```\n\nOnly critical customers tested.\n\n**Step 6: Sampling для very slow**.\n\n```yaml\n- complex_aggregate_check:\n config:\n where: \"MOD(customer_id, 100) = 0\" # 1% sample\n tags: ['pr_fast']\n```\n\n1% sample very fast. Full version в daily.\n\n**Step 7: Skip slow tests in PR**.\n\nIf test takes 60+ seconds и не critical для merge — skip from PR:\n\n```yaml\n- comprehensive_audit:\n config:\n tags: ['daily_only'] # NOT in pr_fast\n```\n\nDaily/weekly suite picks up.\n\n**Step 8: Parallel execution**.\n\ndbt 1.5+ supports test parallelism:\n\n```yaml\n# profiles.yml\nci:\n threads: 16 # was 4 — increase для parallel tests\n```\n\nMore threads = faster tests (если warehouse supports).\n\n**Step 9: Measure**.\n\n```bash\ntime dbt test --select tag:pr_fast --target ci\n# Goal: менее 1 minute\n```\n\nIterate: tweak where, drop tests, add sampling, until target reached.\n\n**Configuration example**:\n\n```yaml\nmodels:\n - name: fct_orders\n data_tests:\n # Critical PK — fast, subset\n - dbt_utils.unique_combination_of_columns:\n combination_of_columns: [order_id, line_item_id]\n config:\n where: \"order_date не меньше current_date - 7\"\n tags: ['pr_fast']\n \n # Full version daily\n - dbt_utils.unique_combination_of_columns:\n combination_of_columns: [order_id, line_item_id]\n config:\n tags: ['daily_slow']\n \n # Statistical — sample only в PR\n - dbt_expectations.expect_column_mean_to_be_between:\n column_name: amount\n min_value: 50\n max_value: 200\n config:\n where: \"MOD(order_id, 100) = 0 AND order_date не меньше current_date - 7\"\n severity: warn\n tags: ['pr_fast']\n \n # Full statistical daily\n - dbt_expectations.expect_column_mean_to_be_between:\n column_name: amount\n min_value: 50\n max_value: 200\n config:\n severity: warn\n tags: ['daily_slow']\n```\n\nDuplication, но explicit. Multi-level coverage.\n\n**Trade-offs**:\n\n- **Pro**: PR менее 1 minute -> fast developer iteration -> more PRs merged faster.\n- **Con**: Subset testing -> may miss issues в old data / non-sampled rows.\n- **Mitigation**: Daily/weekly comprehensive catches issues, dev cycle still fast.\n\n**Anti-patterns**:\n\n1. **No tagging strategy** — every PR runs full suite. Slow, lazy developers.\n2. **Tags but no scheduled run** — PR fast, comprehensive never runs. Issues accumulate.\n3. **PR too lax** — missing critical tests, broken data merged.\n\nBalance: critical tests in PR (subset), comprehensive scheduled, both have to pass before deploy.\n\nKey: PR speed важна для productivity. Strategic test subsetting через where + tags makes possible. Pair с comprehensive scheduled tests для full coverage.

Проверка знанийKnowledge check

where filter применил 'WHERE deleted_at IS NULL' к unique тесту. Через месяц обнаружили: deleted records have duplicates но test passes. Что произошло?

ОтветAnswer

Subtle bug: where filter сужает scope. Test passes because filter excludes problematic rows.\n\n**Что произошло**:\n\n```yaml\n- unique:\n config:\n where: \"deleted_at IS NULL\"\n```\n\nThis tests ТОЛЬКО non-deleted rows. Deleted rows (with deleted_at NOT NULL) excluded из test.\n\nIf deleted rows have duplicates (e.g., soft-delete process не handles dedup), test passes despite real issue.\n\nExample data:\n\n```\norder_id | customer_id | email | deleted_at\n1 | 42 | [email protected] | NULL ← tested\n2 | 99 | [email protected] | NULL ← tested\n3 | 42 | [email protected] | 2024-01-15 ← NOT tested\n4 | 42 | [email protected] | 2024-02-20 ← NOT tested\n```\n\nWith where: deleted_at IS NULL test only sees row 1 и 2. unique customer_id = TRUE.\n\nWithout where: test sees rows 1, 2, 3, 4. customer_id=42 appears 3 times. unique = FALSE.\n\n**Why это happened**:\n\n1. **Original intent**: 'tests only matter для active rows'. Logical decision.\n2. **Edge case overlooked**: deleted rows could have data integrity issues that downstream models care about (e.g., audit, GDPR).\n3. **Test scope mismatch**: filter narrowed test scope без considering all downstream uses.\n\n**Fix options**:\n\n**Option 1: Remove where filter** (full coverage):\n\n```yaml\n- unique # no where\n```\n\nTests all rows. Detects duplicates regardless of deleted status.\n\nTrade-off: includes historical deleted records. Slower, but complete.\n\n**Option 2: Two tests (recommended)**:\n\n```yaml\n- unique:\n config:\n where: \"deleted_at IS NULL\"\n severity: error # active duplicates — critical\n- unique:\n config:\n where: \"deleted_at IS NOT NULL\"\n severity: warn # historical duplicates — non-blocking\n```\n\nActive: must be unique. Historical: should be unique but не block CI.\n\nLayered: critical issues block, historical issues alert.\n\n**Option 3: Composite test**:\n\n```yaml\n- unique\n- not_null:\n column_name: customer_id\n # Make sure deleted records also satisfy not_null\n```\n\nFull check + complementary tests.\n\n**Option 4: Investigate root cause**:\n\nWhy deleted records have duplicates?\n\n```sql\n-- Look at duplicates в deleted records\nSELECT customer_id, COUNT(*) FROM fct_orders\nWHERE deleted_at IS NOT NULL\nGROUP BY 1 HAVING COUNT(*) > 1\nLIMIT 100;\n```\n\nPotential causes:\n1. **Soft delete process duplicated rows**: race conditions when marking deleted.\n2. **Source system bug**: pre-existing duplicates marked deleted in bulk.\n3. **ETL re-runs**: same customer marked deleted multiple times.\n\nFix at source if possible.\n\n**Option 5: Schema redesign**:\n\nIf deleted records shouldn't logically exist as separate rows, restructure model:\n\n```sql\n-- Move deleted rows к archive table\nWITH active AS (\n SELECT * FROM source WHERE deleted_at IS NULL\n),\narchive AS (\n SELECT * FROM source WHERE deleted_at IS NOT NULL\n)\n-- Materialize as separate models\n```\n\nActive model has unique constraint (where filter unnecessary). Archive table has different rules.\n\n**General principle: where filters narrow scope**.\n\nWhen using where, ask:\n\n1. **Что excluded** by filter?\n2. **Could excluded rows have issues** we care about?\n3. **Multiple tests** для multiple scopes?\n\nDocument the scope:\n\n```yaml\n- unique:\n config:\n where: \"deleted_at IS NULL\"\n # Tests uniqueness среди ACTIVE rows.\n # Deleted rows не covered — see separate test 'historical_unique'.\n severity: error\n- unique:\n config:\n where: \"deleted_at IS NOT NULL\"\n severity: warn\n tags: ['historical_unique']\n```\n\nExplicit scope in comments.\n\n**Anti-pattern check**:\n\nIf you find yourself adding many tests with same where filter, that's signal that:\n- Model design needs split (active vs archive separate models).\n- Or naming convention для scope (active_, archived_ prefixes).\n\nRefactoring to clean schema beats накопление tests with where clauses.\n\nKey: where filters narrow test scope. Audit periodically: what's outside scope? Are those rows важны? Layered coverage с multiple tests = robust quality.

where filter: subset testing

Базовое использование

Performance optimization

Recent data focus

Exclude PII / sensitive

Conditional tests

Complex where conditions

where с Jinja

tags + where combination

Common production patterns

Pattern 1: Recent-only PR checks

Pattern 2: Per-environment subsets

Pattern 3: Hot-data vs cold-data

Pattern 4: Sample testing

Pattern 5: Tier-specific

Gotchas

1. where с NULL

2. where on derived columns

3. where не applies к all tests

4. where в singular tests

Performance impact

DuckDB specifics

Попробуй сам

Ключевые выводы

Закончили урок?