dbt-tests applied к whole model by default. На huge tables (millions of rows) это дорого. where config позволяет ограничить test subset rows — performance, focused testing, conditional tests.
where config ограничивает test subset rows. Performance + focused.
Recent-only для PR CI fast (last 7-30 days).
Tier-specific tests с where per segment.
Per-environment: dev subset, prod full.
Tags + where для multi-level testing strategy.
Layered: recent error + historical warn для balanced approach.
Gotchas: NULL exclusion может miss issues, derived columns slow, не all tests support where same way.
Performance: where helps на large tables с indexed filter columns.
Проверка знанийKnowledge check
Команда хочет fast PR checks (менее 1 minute). Большой проект — 500K test queries времени 15 minutes. Какая стратегия с where?
ОтветAnswer
Fast PR checks через aggressive subset testing с tags + where filtering.\n\n**Step 1: Categorize tests**.\n\nIdentify slow tests:\n\n```bash\ndbt test --resource-type test --output json | jq '.[] | {name: .name, time: .timing}'\n```\n\nOr через run_results.json analysis. Top 20 slowest tests.\n\n**Step 2: Tag fast vs slow**.\n\n```yaml\n# Critical PK / FK tests с small subset (fast)\n- unique:\n config:\n where: \"order_date не меньше current_date - 7\"\n tags: ['pr_fast']\n\n# Statistical / comprehensive (slow)\n- expect_column_mean_to_be_between:\n config:\n tags: ['daily_slow']\n```\n\n**Step 3: PR CI uses tag fast**.\n\n```yaml\n# .github/workflows/pr.yml\n- name: Run fast tests\n run: dbt test --select tag:pr_fast --target ci\n```\n\nPR CI tests subset. менее 1 minute target.\n\n**Step 4: Daily / weekly comprehensive**.\n\n```yaml\n# .github/workflows/scheduled.yml\n- name: Run full tests\n run: dbt test --target prod # all tests\n schedule:\n - cron: '0 2 * * *' # daily at 2am\n```\n\nComprehensive nightly. CI fast, comprehensive scheduled.\n\n**Step 5: Specific optimizations**.\n\nDifferent strategies per test type:\n\n**Tests на huge tables**:\n\n```yaml\n- unique:\n config:\n where: \"order_date не меньше current_date - 7\" # recent only\n tags: ['pr_fast']\n```\n\nLast week subset.\n\n**Statistical tests**:\n\n```yaml\n- expect_column_mean_to_be_between:\n config:\n where: \"order_date не меньше current_date - 7\"\n severity: warn # don't block PR\n tags: ['pr_fast']\n```\n\nWarn instead error — investigation but не block.\n\n**Complex business tests**:\n\n```yaml\n- complex_business_rule:\n config:\n where: \"customer_id IN (SELECT customer_id FROM critical_segment)\"\n tags: ['pr_fast']\n```\n\nOnly critical customers tested.\n\n**Step 6: Sampling для very slow**.\n\n```yaml\n- complex_aggregate_check:\n config:\n where: \"MOD(customer_id, 100) = 0\" # 1% sample\n tags: ['pr_fast']\n```\n\n1% sample very fast. Full version в daily.\n\n**Step 7: Skip slow tests in PR**.\n\nIf test takes 60+ seconds и не critical для merge — skip from PR:\n\n```yaml\n- comprehensive_audit:\n config:\n tags: ['daily_only'] # NOT in pr_fast\n```\n\nDaily/weekly suite picks up.\n\n**Step 8: Parallel execution**.\n\ndbt 1.5+ supports test parallelism:\n\n```yaml\n# profiles.yml\nci:\n threads: 16 # was 4 — increase для parallel tests\n```\n\nMore threads = faster tests (если warehouse supports).\n\n**Step 9: Measure**.\n\n```bash\ntime dbt test --select tag:pr_fast --target ci\n# Goal: менее 1 minute\n```\n\nIterate: tweak where, drop tests, add sampling, until target reached.\n\n**Configuration example**:\n\n```yaml\nmodels:\n - name: fct_orders\n data_tests:\n # Critical PK — fast, subset\n - dbt_utils.unique_combination_of_columns:\n combination_of_columns: [order_id, line_item_id]\n config:\n where: \"order_date не меньше current_date - 7\"\n tags: ['pr_fast']\n \n # Full version daily\n - dbt_utils.unique_combination_of_columns:\n combination_of_columns: [order_id, line_item_id]\n config:\n tags: ['daily_slow']\n \n # Statistical — sample only в PR\n - dbt_expectations.expect_column_mean_to_be_between:\n column_name: amount\n min_value: 50\n max_value: 200\n config:\n where: \"MOD(order_id, 100) = 0 AND order_date не меньше current_date - 7\"\n severity: warn\n tags: ['pr_fast']\n \n # Full statistical daily\n - dbt_expectations.expect_column_mean_to_be_between:\n column_name: amount\n min_value: 50\n max_value: 200\n config:\n severity: warn\n tags: ['daily_slow']\n```\n\nDuplication, но explicit. Multi-level coverage.\n\n**Trade-offs**:\n\n- **Pro**: PR менее 1 minute -> fast developer iteration -> more PRs merged faster.\n- **Con**: Subset testing -> may miss issues в old data / non-sampled rows.\n- **Mitigation**: Daily/weekly comprehensive catches issues, dev cycle still fast.\n\n**Anti-patterns**:\n\n1. **No tagging strategy** — every PR runs full suite. Slow, lazy developers.\n2. **Tags but no scheduled run** — PR fast, comprehensive never runs. Issues accumulate.\n3. **PR too lax** — missing critical tests, broken data merged.\n\nBalance: critical tests in PR (subset), comprehensive scheduled, both have to pass before deploy.\n\nKey: PR speed важна для productivity. Strategic test subsetting через where + tags makes possible. Pair с comprehensive scheduled tests для full coverage.
Проверка знанийKnowledge check
where filter применил 'WHERE deleted_at IS NULL' к unique тесту. Через месяц обнаружили: deleted records have duplicates но test passes. Что произошло?
ОтветAnswer
Subtle bug: where filter сужает scope. Test passes because filter excludes problematic rows.\n\n**Что произошло**:\n\n```yaml\n- unique:\n config:\n where: \"deleted_at IS NULL\"\n```\n\nThis tests ТОЛЬКО non-deleted rows. Deleted rows (with deleted_at NOT NULL) excluded из test.\n\nIf deleted rows have duplicates (e.g., soft-delete process не handles dedup), test passes despite real issue.\n\nExample data:\n\n```\norder_id | customer_id | email | deleted_at\n1 | 42 | [email protected] | NULL ← tested\n2 | 99 | [email protected] | NULL ← tested\n3 | 42 | [email protected] | 2024-01-15 ← NOT tested\n4 | 42 | [email protected] | 2024-02-20 ← NOT tested\n```\n\nWith where: deleted_at IS NULL test only sees row 1 и 2. unique customer_id = TRUE.\n\nWithout where: test sees rows 1, 2, 3, 4. customer_id=42 appears 3 times. unique = FALSE.\n\n**Why это happened**:\n\n1. **Original intent**: 'tests only matter для active rows'. Logical decision.\n2. **Edge case overlooked**: deleted rows could have data integrity issues that downstream models care about (e.g., audit, GDPR).\n3. **Test scope mismatch**: filter narrowed test scope без considering all downstream uses.\n\n**Fix options**:\n\n**Option 1: Remove where filter** (full coverage):\n\n```yaml\n- unique # no where\n```\n\nTests all rows. Detects duplicates regardless of deleted status.\n\nTrade-off: includes historical deleted records. Slower, but complete.\n\n**Option 2: Two tests (recommended)**:\n\n```yaml\n- unique:\n config:\n where: \"deleted_at IS NULL\"\n severity: error # active duplicates — critical\n- unique:\n config:\n where: \"deleted_at IS NOT NULL\"\n severity: warn # historical duplicates — non-blocking\n```\n\nActive: must be unique. Historical: should be unique but не block CI.\n\nLayered: critical issues block, historical issues alert.\n\n**Option 3: Composite test**:\n\n```yaml\n- unique\n- not_null:\n column_name: customer_id\n # Make sure deleted records also satisfy not_null\n```\n\nFull check + complementary tests.\n\n**Option 4: Investigate root cause**:\n\nWhy deleted records have duplicates?\n\n```sql\n-- Look at duplicates в deleted records\nSELECT customer_id, COUNT(*) FROM fct_orders\nWHERE deleted_at IS NOT NULL\nGROUP BY 1 HAVING COUNT(*) > 1\nLIMIT 100;\n```\n\nPotential causes:\n1. **Soft delete process duplicated rows**: race conditions when marking deleted.\n2. **Source system bug**: pre-existing duplicates marked deleted in bulk.\n3. **ETL re-runs**: same customer marked deleted multiple times.\n\nFix at source if possible.\n\n**Option 5: Schema redesign**:\n\nIf deleted records shouldn't logically exist as separate rows, restructure model:\n\n```sql\n-- Move deleted rows к archive table\nWITH active AS (\n SELECT * FROM source WHERE deleted_at IS NULL\n),\narchive AS (\n SELECT * FROM source WHERE deleted_at IS NOT NULL\n)\n-- Materialize as separate models\n```\n\nActive model has unique constraint (where filter unnecessary). Archive table has different rules.\n\n**General principle: where filters narrow scope**.\n\nWhen using where, ask:\n\n1. **Что excluded** by filter?\n2. **Could excluded rows have issues** we care about?\n3. **Multiple tests** для multiple scopes?\n\nDocument the scope:\n\n```yaml\n- unique:\n config:\n where: \"deleted_at IS NULL\"\n # Tests uniqueness среди ACTIVE rows.\n # Deleted rows не covered — see separate test 'historical_unique'.\n severity: error\n- unique:\n config:\n where: \"deleted_at IS NOT NULL\"\n severity: warn\n tags: ['historical_unique']\n```\n\nExplicit scope in comments.\n\n**Anti-pattern check**:\n\nIf you find yourself adding many tests with same where filter, that's signal that:\n- Model design needs split (active vs archive separate models).\n- Or naming convention для scope (active_, archived_ prefixes).\n\nRefactoring to clean schema beats накопление tests with where clauses.\n\nKey: where filters narrow test scope. Audit periodically: what's outside scope? Are those rows важны? Layered coverage с multiple tests = robust quality.
Доступ закрыт
Требуется вход
Для доступа к материалам курса необходимо войти через Telegram
Проверьте понимание
Результат: 0 из 0
Прикладной
Закончили урок?
Отметьте его как пройденный, чтобы отслеживать свой прогресс