Топ-15 dbt-expectations тестов

dbt-expectations имеет 60+ tests. На middle важно знать топ 15 — что используется реально в production. Этот урок — tour с практическими примерами.

Categories:

Table-level (5)
Column-level (5)
Statistical (5)

Каждый test — example use case + tradeoffs.

Data Quality Dimensions: какие тесты покрывают какие измерения

Table-level tests

1. expect_table_row_count_to_be_between

Sanity check on table size.

- dbt_expectations.expect_table_row_count_to_be_between:
    min_value: 1000
    max_value: 50000
    config:
      where: "order_date = current_date - 1"
      severity: warn

Use: detect sudden data drops (0 rows = source broken) or surges (100K = duplicates / promo).

2. expect_table_column_count_to_equal

Schema validation.

- dbt_expectations.expect_table_column_count_to_equal:
    value: 15

Use: catches accidental schema changes (column added/removed без communication).

3. expect_table_columns_to_match_ordered_list

Strict schema validation including order.

- dbt_expectations.expect_table_columns_to_match_ordered_list:
    column_list: [order_id, customer_id, order_date, amount, status, ...]

Use: critical для unnamed UNION ALL or position-based downstream. dbt 1.5+ model contracts often preferred.

4. expect_compound_columns_to_be_unique

Composite primary key.

- dbt_expectations.expect_compound_columns_to_be_unique:
    column_list: [order_id, line_item_id]

Use: equivalent dbt_utils.unique_combination_of_columns. Pick convention.

5. expect_table_row_count_to_equal_other_table

Compare two tables.

- dbt_expectations.expect_table_row_count_to_equal_other_table:
    compare_model: ref('staging_orders')
    factor: 1.0  # exact match expected

Use: 1-to-1 relationship verification. Marts should have same row count as source.

Column-level tests

6. expect_column_values_to_not_be_null

Null check (alternative к dbt-core not_null).

- dbt_expectations.expect_column_values_to_not_be_null

Use: equivalent dbt-core not_null. Stylistic preference.

7. expect_column_values_to_be_in_set

Enum validation.

- dbt_expectations.expect_column_values_to_be_in_set:
    value_set: ['pending', 'paid', 'shipped', 'cancelled', 'refunded']

Use: equivalent accepted_values. Catches unexpected new statuses.

8. expect_column_values_to_match_regex

Pattern matching.

- dbt_expectations.expect_column_values_to_match_regex:
    regex: '^[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}$'

Use: email format, phone, license plates, SSNs. Pattern-based validation.

9. expect_column_values_to_be_between

Range check.

- dbt_expectations.expect_column_values_to_be_between:
    min_value: 0
    max_value: 10000

Use: amount sanity, dates within window, quantities reasonable. Catches negative values, outliers.

10. expect_column_values_to_be_of_type

Type validation.

- dbt_expectations.expect_column_values_to_be_of_type:
    column_type: NUMERIC

Use: schema drift catch. dbt 1.5+ model contracts often preferred.

Statistical tests

11. expect_column_mean_to_be_between

Mean expectation.

- dbt_expectations.expect_column_mean_to_be_between:
    column_name: amount
    min_value: 50
    max_value: 200
    config:
      where: "order_date >= current_date - 7"
      severity: warn

Use: detect drift в business metrics. Mean shifted = something changed.

12. expect_column_quantile_values_to_be_between

Percentile / quantile check (robust to outliers).

- dbt_expectations.expect_column_quantile_values_to_be_between:
    column_name: amount
    quantile: 0.95
    min_value: 100
    max_value: 5000

Use: SLA monitoring (95% менее 5000ms). More robust than mean.

13. expect_column_stdev_to_be_between

Standard deviation check.

- dbt_expectations.expect_column_stdev_to_be_between:
    column_name: amount
    min_value: 10
    max_value: 100

Use: distribution shape. StdDev=0 = all values identical (corrupt). Sudden change = data quality issue.

14. expect_column_distinct_count_to_be_between

Cardinality check.

- dbt_expectations.expect_column_distinct_count_to_be_between:
    column_name: customer_country
    min_value: 50
    max_value: 250

Use: ‘50-250 unique countries expected’. Catches if source перестал писать country, etc.

15. expect_column_distinct_count_to_equal_other_table

Cross-table consistency.

- dbt_expectations.expect_column_distinct_count_to_equal_other_table:
    column_name: customer_id
    compare_model: ref('dim_customers')
    compare_column_name: customer_id

Use: every customer в orders should be в dim_customers. Orphan detection.

Production patterns

Pattern 1: Schema validation layer

models:
  - name: fct_orders
    data_tests:
      - dbt_expectations.expect_table_column_count_to_equal: {value: 15}
      - dbt_expectations.expect_table_columns_to_match_set:
          column_list: [order_id, customer_id, order_date, amount, status, ...]
    columns:
      - name: order_id
        data_tests:
          - dbt_expectations.expect_column_values_to_be_of_type: {column_type: BIGINT}
      - name: amount
        data_tests:
          - dbt_expectations.expect_column_values_to_be_of_type: {column_type: NUMERIC}

Lightweight schema contract pre-1.5. Native model contracts (1.5+) often preferred.

Pattern 2: Statistical monitoring

columns:
  - name: amount
    data_tests:
      - dbt_expectations.expect_column_mean_to_be_between:
          min_value: 50
          max_value: 200
          config:
            where: "order_date >= current_date - 7"
            severity: warn
      - dbt_expectations.expect_column_quantile_values_to_be_between:
          quantile: 0.95
          min_value: 100
          max_value: 5000
          config:
            severity: warn
      - dbt_expectations.expect_column_stdev_to_be_between:
          min_value: 10
          max_value: 100
          config:
            severity: warn

Three statistical metrics: mean, p95, stddev. Together catches different anomalies.

Pattern 3: Cross-model consistency

data_tests:
  - dbt_expectations.expect_table_row_count_to_equal_other_table:
      compare_model: ref('staging_orders')
  - dbt_expectations.expect_column_distinct_count_to_equal_other_table:
      column_name: customer_id
      compare_model: ref('dim_customers')
      compare_column_name: customer_id

Validates: marts derived correctly от source, no orphans.

Pattern 4: Business rules through value_set

columns:
  - name: status
    data_tests:
      - dbt_expectations.expect_column_values_to_be_in_set:
          value_set: ['pending', 'paid', 'shipped', 'delivered', 'cancelled', 'refunded']
          config:
            severity: error

Strict business rules.

Pattern 5: Range checks

columns:
  - name: amount
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0
          max_value: 100000
          config:
            severity: error
  - name: order_date
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: '2020-01-01'
          max_value: 'CURRENT_DATE + INTERVAL \\\\'1 day\\\\''

Sanity ranges. Amount can’t be negative. Date can’t be in future (with 1-day tolerance).

Decision matrix: которую test когда?

Task	Test
Row count expectations	expect_table_row_count_to_be_between
Schema validation (columns)	expect_table_columns_to_match_set OR model contracts (1.5+)
Type validation	expect_column_values_to_be_of_type OR model contracts
Single column uniqueness	dbt-core unique
Composite uniqueness	expect_compound_columns_to_be_unique OR dbt_utils.unique_combination_of_columns
Null check	expect_column_values_to_not_be_null OR dbt-core not_null
Enum / value set	expect_column_values_to_be_in_set OR dbt-core accepted_values
Pattern (regex)	expect_column_values_to_match_regex
Numeric range	expect_column_values_to_be_between
Mean drift	expect_column_mean_to_be_between
Quantile (SLA)	expect_column_quantile_values_to_be_between
Cardinality	expect_column_distinct_count_to_be_between
Cross-table	expect_table_row_count_to_equal_other_table, expect_column_distinct_count_to_equal_other_table

DuckDB compatibility

All listed tests work on DuckDB. Notes:

Type tests: DuckDB uses широкие types (BIGINT default for integers). column_type: ‘INTEGER’ may fail for BIGINT columns. Use specific DuckDB types or query INFORMATION_SCHEMA.
Quantile: DuckDB supports quantile_cont() — correct results.
Performance: in-memory processing fast для statistical aggregates.

Попробуй сам

В labs, выберите 5 из этого списка и apply к fct_orders:

expect_table_row_count_to_be_between (sanity check).
expect_compound_columns_to_be_unique (composite PK).
expect_column_values_to_be_in_set для status.
expect_column_values_to_be_between для amount.
expect_column_mean_to_be_between для amount (severity warn).

Calibrate thresholds based on actual data.

Run dbt test. Iterate until clean (or tuned warnings).

Ключевые выводы

15 critical dbt-expectations tests. Cover schema, value, statistical needs.
Schema validation: pre-1.5 use dbt-expectations. 1.5+ — native model contracts.
Statistical: mean, quantile, stddev для drift detection. Severity warn typically.
Cross-table: row_count_equal, distinct_count_equal для consistency.
Decision matrix: pick test based on what you’re validating.
Calibrate thresholds based on actual data (run SQL exploration first).
DuckDB compatible с all listed. Notes для type tests (wide defaults).

Проверка знанийKnowledge check

Какие 5 dbt-expectations tests наиболее ценны для starting middle project?

ОтветAnswer

Recommend 5 starter tests на every model. High value, manageable to implement.\n\n**1. expect_table_row_count_to_be_between** — sanity check.\n\n```yaml\n- dbt_expectations.expect_table_row_count_to_be_between:\n min_value: 100 # adjust per model\n max_value: 1000000 # adjust per model\n config:\n severity: warn\n```\n\nPurpose: detect 0-row (broken extraction) or massive surges (duplicates).\nWhy first: simplest, catches catastrophic issues, no calibration required initially.\n\n**2. expect_compound_columns_to_be_unique** — composite PK.\n\n```yaml\n- dbt_expectations.expect_compound_columns_to_be_unique:\n column_list: [order_id, line_item_id]\n config:\n severity: error\n```\n\nPurpose: ensure PK integrity для composite keys.\nWhy second: critical data quality. dbt-core unique only covers single columns.\n\n**3. expect_column_values_to_be_in_set** — enum validation.\n\n```yaml\ncolumns:\n - name: status\n data_tests:\n - dbt_expectations.expect_column_values_to_be_in_set:\n value_set: ['pending', 'paid', 'shipped', 'delivered', 'cancelled', 'refunded']\n config:\n severity: error\n```\n\nPurpose: catch new status values без communication.\nWhy third: business rule violation. Detects source semantic changes.\n\n**4. expect_column_values_to_be_between** — range sanity.\n\n```yaml\ncolumns:\n - name: amount\n data_tests:\n - dbt_expectations.expect_column_values_to_be_between:\n min_value: 0\n max_value: 100000\n config:\n severity: error\n - name: order_date\n data_tests:\n - dbt_expectations.expect_column_values_to_be_between:\n min_value: '2020-01-01'\n max_value: 'CURRENT_DATE + INTERVAL \\\\'1 day\\\\''\n```\n\nPurpose: catch negative amounts, future dates, etc.\nWhy fourth: simple sanity checks с high signal-to-noise.\n\n**5. expect_table_row_count_to_equal_other_table** — cross-table consistency.\n\n```yaml\n- dbt_expectations.expect_table_row_count_to_equal_other_table:\n compare_model: ref('staging_orders')\n factor: 1.0\n```\n\nPurpose: marts должна have same row count as source (или predictable ratio).\nWhy fifth: catches mart-level transformations that lose/duplicate rows.\n\n**Why these 5**:\n\n1. Cover wide aspects: structure, integrity, business rules, sanity, consistency.\n2. Easy to implement: minimal calibration needed for first 4. Last requires source-derived counts.\n3. High-value findings: each catches different category of issue.\n4. Manageable: 5 tests per model = manageable runtime.\n5. Foundation: scale up later с statistical tests, regex, etc.\n\n**What NOT to start with**:\n\n- **Statistical** (mean, stddev): require calibration to data distribution. Add after observation period.\n- **Regex**: useful but specific to data formats. Add per need.\n- **Type validation**: better via model contracts (1.5+).\n- **Cardinality**: useful but specific to columns с known distribution.\n\n**Implementation rollout**:\n\n**Week 1**: Add 5 tests to top 3 critical models.\n**Week 2**: Add to top 10 models.\n**Week 3**: Cover all marts.\n**Week 4+**: Add statistical layer (mean, stddev) for high-priority columns.\n\nGradual expansion. Sustainable.\n\n**Anti-patterns**:\n\n1. **Add 30 tests at once** — overwhelm, many false positives, demotivated team.\n2. **Skip foundation, jump to statistical** — without baseline, statistical tests flap.\n3. **One model fully tested, others nothing** — uneven quality. Spread coverage first.\n\n**Maturity progression**:\n\nStarter (months 1-3): 5 tests above. All models.\nGrowing (months 4-6): Add statistical для critical columns.\nMature (months 7+): Full coverage. Custom business tests.\n\nKey: start small, iterate, expand. Don't try perfect first time.

Проверка знанийKnowledge check

dbt-expectations test 'expect_column_values_to_be_of_type' vs native dbt model contracts. Когда какой использовать?

ОтветAnswer

Both validate types, but different lifecycle и features. Choose based on dbt version и needs.\n\n**expect_column_values_to_be_of_type** (dbt-expectations):\n\n```yaml\ncolumns:\n - name: amount\n data_tests:\n - dbt_expectations.expect_column_values_to_be_of_type:\n column_type: NUMERIC\n```\n\n**Lifecycle**: validation at `dbt test` (post-build).\n\n**Behavior**:\n- Build model — type may be wrong, data в warehouse.\n- Test — detects wrong type.\n- Fail при mismatch.\n- Data already exposed downstream.\n\n**Pros**:\n- Works на all dbt versions (including pre-1.5).\n- Specific assertions per column.\n\n**Cons**:\n- Late detection (after build).\n- Type drift can affect downstream before test catches.\n- No automatic enforcement of types.\n\n**Native model contracts** (dbt 1.5+):\n\n```yaml\nmodels:\n - name: fct_orders\n config:\n contract:\n enforced: true\n columns:\n - name: amount\n data_type: NUMERIC\n - name: order_id\n data_type: BIGINT\n```\n\n**Lifecycle**: validation at `dbt run` / `dbt build` (during build).\n\n**Behavior**:\n- dbt run starts.\n- Before executing SELECT, dbt analyzes inferred type.\n- If mismatch — fail immediately.\n- Data не committed to warehouse.\n\n**Pros**:\n- Early detection (during build).\n- Schema enforcement (data must match contract).\n- Can include constraints (NOT NULL, PRIMARY KEY).\n- Single source of truth (YAML defines schema).\n\n**Cons**:\n- Requires dbt 1.5+.\n- Strict — может require manual cast в SELECT для compliance.\n- Adapter limitations: not all features work on all warehouses.\n\n**Decision matrix**:\n\n| Scenario | Choose |\n|---|---|\n| dbt менее 1.5 project | expect_column_values_to_be_of_type |\n| dbt 1.5+, critical schema | Model contracts |\n| dbt 1.5+, exploratory data | Tests (or both) |\n| Public-facing model (downstream consumers) | Model contracts |\n| Internal staging model | Tests (lighter) |\n| Need constraints (NOT NULL, PK, FK) | Model contracts (native) |\n\n**Use both? Yes, for critical**:\n\n```yaml\n# Use both contract and test for critical models\nmodels:\n - name: fct_orders\n config:\n contract:\n enforced: true # build-time validation\n columns:\n - name: amount\n data_type: NUMERIC\n data_tests:\n - dbt_expectations.expect_column_values_to_be_of_type:\n column_type: NUMERIC\n # additional test-time validation\n - dbt_expectations.expect_column_values_to_be_between:\n min_value: 0\n max_value: 100000\n # value range — contract doesn't cover this\n```\n\nContract = schema enforcement. Test = additional value-level validation.\n\n**Migration path**:\n\nIf project использует expect_column_values_to_be_of_type:\n\n1. Upgrade to dbt 1.5+.\n2. Add contract config к critical models:\n```yaml\nconfig:\n contract:\n enforced: true\n```\n3. Add data_type to columns:\n```yaml\ncolumns:\n - name: amount\n data_type: NUMERIC\n```\n4. Keep test for parallel validation initial period.\n5. Once stable, remove redundant test.\n\n**Realtime example**:\n\nScenario: SQL accidentally returns amount as VARCHAR.\n\n**With expect_column_values_to_be_of_type**:\n1. dbt run: succeeds. amount is VARCHAR в warehouse.\n2. Downstream BI starts using new VARCHAR amount.\n3. dbt test: fails (type mismatch).\n4. Investigate, fix SQL, dbt run again.\n5. Downstream — may have errored on VARCHAR amount в interim.\n\n**With contract enforced**:\n1. dbt run: fails immediately ('Column amount expected NUMERIC, got VARCHAR').\n2. SQL fixed before any data committed.\n3. Downstream not affected.\n\nContract catches earlier. Faster failure = less downstream impact.\n\n**Adapter limitations**:\n\nContracts depend on warehouse:\n- **Snowflake**: full support (types, constraints).\n- **BigQuery**: full support.\n- **Postgres**: most features.\n- **DuckDB**: types supported. NOT NULL и UNIQUE constraints work. PRIMARY KEY и FOREIGN KEY partial. Documentation enforced, runtime enforcement varies.\n\nFor DuckDB: contracts give type guarantees. Constraints may need supplementation through tests (как в module 10).\n\n**Don't replace tests with contracts**:\n\nContracts validate schema. Tests validate **data values**:\n\n- Contract: amount is NUMERIC.\n- Test: amount values between 0 and 10000.\n\nDifferent concerns. Both critical for full coverage.\n\nKey: contracts are **schema-time** validation. Tests are **runtime data** validation. Use complementarily для new projects.

Топ-15 dbt-expectations тестов

Table-level tests

1. expect_table_row_count_to_be_between

2. expect_table_column_count_to_equal

3. expect_table_columns_to_match_ordered_list

4. expect_compound_columns_to_be_unique

5. expect_table_row_count_to_equal_other_table

Column-level tests

6. expect_column_values_to_not_be_null

7. expect_column_values_to_be_in_set

8. expect_column_values_to_match_regex

9. expect_column_values_to_be_between

10. expect_column_values_to_be_of_type

Statistical tests

11. expect_column_mean_to_be_between

12. expect_column_quantile_values_to_be_between

13. expect_column_stdev_to_be_between

14. expect_column_distinct_count_to_be_between

15. expect_column_distinct_count_to_equal_other_table

Production patterns

Pattern 1: Schema validation layer

Pattern 2: Statistical monitoring

Pattern 3: Cross-model consistency

Pattern 4: Business rules through value_set

Pattern 5: Range checks

Decision matrix: которую test когда?

DuckDB compatibility

Попробуй сам

Ключевые выводы

Закончили урок?