Топ-15 dbt-expectations тестов
dbt-expectations имеет 60+ tests. На middle важно знать топ 15 — что используется реально в production. Этот урок — tour с практическими примерами.
Categories:
- Table-level (5)
- Column-level (5)
- Statistical (5)
Каждый test — example use case + tradeoffs.
Data Quality Dimensions: какие тесты покрывают какие измеренияTable-level tests
1. expect_table_row_count_to_be_between
Sanity check on table size.
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 1000
max_value: 50000
config:
where: "order_date = current_date - 1"
severity: warn
Use: detect sudden data drops (0 rows = source broken) or surges (100K = duplicates / promo).
2. expect_table_column_count_to_equal
Schema validation.
- dbt_expectations.expect_table_column_count_to_equal:
value: 15
Use: catches accidental schema changes (column added/removed без communication).
3. expect_table_columns_to_match_ordered_list
Strict schema validation including order.
- dbt_expectations.expect_table_columns_to_match_ordered_list:
column_list: [order_id, customer_id, order_date, amount, status, ...]
Use: critical для unnamed UNION ALL or position-based downstream. dbt 1.5+ model contracts often preferred.
4. expect_compound_columns_to_be_unique
Composite primary key.
- dbt_expectations.expect_compound_columns_to_be_unique:
column_list: [order_id, line_item_id]
Use: equivalent dbt_utils.unique_combination_of_columns. Pick convention.
5. expect_table_row_count_to_equal_other_table
Compare two tables.
- dbt_expectations.expect_table_row_count_to_equal_other_table:
compare_model: ref('staging_orders')
factor: 1.0 # exact match expected
Use: 1-to-1 relationship verification. Marts should have same row count as source.
Column-level tests
6. expect_column_values_to_not_be_null
Null check (alternative к dbt-core not_null).
- dbt_expectations.expect_column_values_to_not_be_null
Use: equivalent dbt-core not_null. Stylistic preference.
7. expect_column_values_to_be_in_set
Enum validation.
- dbt_expectations.expect_column_values_to_be_in_set:
value_set: ['pending', 'paid', 'shipped', 'cancelled', 'refunded']
Use: equivalent accepted_values. Catches unexpected new statuses.
8. expect_column_values_to_match_regex
Pattern matching.
- dbt_expectations.expect_column_values_to_match_regex:
regex: '^[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}$'
Use: email format, phone, license plates, SSNs. Pattern-based validation.
9. expect_column_values_to_be_between
Range check.
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 10000
Use: amount sanity, dates within window, quantities reasonable. Catches negative values, outliers.
10. expect_column_values_to_be_of_type
Type validation.
- dbt_expectations.expect_column_values_to_be_of_type:
column_type: NUMERIC
Use: schema drift catch. dbt 1.5+ model contracts often preferred.
Statistical tests
11. expect_column_mean_to_be_between
Mean expectation.
- dbt_expectations.expect_column_mean_to_be_between:
column_name: amount
min_value: 50
max_value: 200
config:
where: "order_date >= current_date - 7"
severity: warn
Use: detect drift в business metrics. Mean shifted = something changed.
12. expect_column_quantile_values_to_be_between
Percentile / quantile check (robust to outliers).
- dbt_expectations.expect_column_quantile_values_to_be_between:
column_name: amount
quantile: 0.95
min_value: 100
max_value: 5000
Use: SLA monitoring (95% менее 5000ms). More robust than mean.
13. expect_column_stdev_to_be_between
Standard deviation check.
- dbt_expectations.expect_column_stdev_to_be_between:
column_name: amount
min_value: 10
max_value: 100
Use: distribution shape. StdDev=0 = all values identical (corrupt). Sudden change = data quality issue.
14. expect_column_distinct_count_to_be_between
Cardinality check.
- dbt_expectations.expect_column_distinct_count_to_be_between:
column_name: customer_country
min_value: 50
max_value: 250
Use: ‘50-250 unique countries expected’. Catches if source перестал писать country, etc.
15. expect_column_distinct_count_to_equal_other_table
Cross-table consistency.
- dbt_expectations.expect_column_distinct_count_to_equal_other_table:
column_name: customer_id
compare_model: ref('dim_customers')
compare_column_name: customer_id
Use: every customer в orders should be в dim_customers. Orphan detection.
Production patterns
Pattern 1: Schema validation layer
models:
- name: fct_orders
data_tests:
- dbt_expectations.expect_table_column_count_to_equal: {value: 15}
- dbt_expectations.expect_table_columns_to_match_set:
column_list: [order_id, customer_id, order_date, amount, status, ...]
columns:
- name: order_id
data_tests:
- dbt_expectations.expect_column_values_to_be_of_type: {column_type: BIGINT}
- name: amount
data_tests:
- dbt_expectations.expect_column_values_to_be_of_type: {column_type: NUMERIC}
Lightweight schema contract pre-1.5. Native model contracts (1.5+) often preferred.
Pattern 2: Statistical monitoring
columns:
- name: amount
data_tests:
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 50
max_value: 200
config:
where: "order_date >= current_date - 7"
severity: warn
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.95
min_value: 100
max_value: 5000
config:
severity: warn
- dbt_expectations.expect_column_stdev_to_be_between:
min_value: 10
max_value: 100
config:
severity: warn
Three statistical metrics: mean, p95, stddev. Together catches different anomalies.
Pattern 3: Cross-model consistency
data_tests:
- dbt_expectations.expect_table_row_count_to_equal_other_table:
compare_model: ref('staging_orders')
- dbt_expectations.expect_column_distinct_count_to_equal_other_table:
column_name: customer_id
compare_model: ref('dim_customers')
compare_column_name: customer_id
Validates: marts derived correctly от source, no orphans.
Pattern 4: Business rules through value_set
columns:
- name: status
data_tests:
- dbt_expectations.expect_column_values_to_be_in_set:
value_set: ['pending', 'paid', 'shipped', 'delivered', 'cancelled', 'refunded']
config:
severity: error
Strict business rules.
Pattern 5: Range checks
columns:
- name: amount
data_tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 100000
config:
severity: error
- name: order_date
data_tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: '2020-01-01'
max_value: 'CURRENT_DATE + INTERVAL \\\\'1 day\\\\''
Sanity ranges. Amount can’t be negative. Date can’t be in future (with 1-day tolerance).
Decision matrix: которую test когда?
| Task | Test |
|---|---|
| Row count expectations | expect_table_row_count_to_be_between |
| Schema validation (columns) | expect_table_columns_to_match_set OR model contracts (1.5+) |
| Type validation | expect_column_values_to_be_of_type OR model contracts |
| Single column uniqueness | dbt-core unique |
| Composite uniqueness | expect_compound_columns_to_be_unique OR dbt_utils.unique_combination_of_columns |
| Null check | expect_column_values_to_not_be_null OR dbt-core not_null |
| Enum / value set | expect_column_values_to_be_in_set OR dbt-core accepted_values |
| Pattern (regex) | expect_column_values_to_match_regex |
| Numeric range | expect_column_values_to_be_between |
| Mean drift | expect_column_mean_to_be_between |
| Quantile (SLA) | expect_column_quantile_values_to_be_between |
| Cardinality | expect_column_distinct_count_to_be_between |
| Cross-table | expect_table_row_count_to_equal_other_table, expect_column_distinct_count_to_equal_other_table |
DuckDB compatibility
All listed tests work on DuckDB. Notes:
- Type tests: DuckDB uses широкие types (BIGINT default for integers). column_type: ‘INTEGER’ may fail for BIGINT columns. Use specific DuckDB types or query INFORMATION_SCHEMA.
- Quantile: DuckDB supports quantile_cont() — correct results.
- Performance: in-memory processing fast для statistical aggregates.
Попробуй сам
В labs, выберите 5 из этого списка и apply к fct_orders:
- expect_table_row_count_to_be_between (sanity check).
- expect_compound_columns_to_be_unique (composite PK).
- expect_column_values_to_be_in_set для status.
- expect_column_values_to_be_between для amount.
- expect_column_mean_to_be_between для amount (severity warn).
Calibrate thresholds based on actual data.
Run dbt test. Iterate until clean (or tuned warnings).
Ключевые выводы
- 15 critical dbt-expectations tests. Cover schema, value, statistical needs.
- Schema validation: pre-1.5 use dbt-expectations. 1.5+ — native model contracts.
- Statistical: mean, quantile, stddev для drift detection. Severity warn typically.
- Cross-table: row_count_equal, distinct_count_equal для consistency.
- Decision matrix: pick test based on what you’re validating.
- Calibrate thresholds based on actual data (run SQL exploration first).
- DuckDB compatible с all listed. Notes для type tests (wide defaults).