Contracts in CI: the gate, what gets missed, schema metadata limits
An enforced contract is a hard gate during dbt run. But a contract can be used earlier, in CI on a pull request, before the code lands in main. This gives a faster feedback loop: the developer sees a contract violation within a 30-second CI run, not after a full scheduled run.
This lesson is about operationalizing contracts: which checks to run in CI, which scenarios contracts catch, and what they miss. The last part, the limits, is critically important: contracts create a false sense of security if you don't understand what they do not check.
Airflow: CI for DAG tests follows a similar multi-level scheme
Contract as a CI gate
Basic CI workflow:
# .github/workflows/dbt-ci.yml
name: dbt CI
on: [pull_request]
jobs:
  contract-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install dbt-core==1.10.21 dbt-duckdb==1.10.1
      - run: dbt deps
      # Phase 1: Parse, a static check
      - run: dbt parse
      # Phase 2: Compile SQL without connecting to prod
      - run: dbt compile --target ci
      # Phase 3: Run on ephemeral DuckDB; this is where contracts are enforced
      # (state:modified+ needs a comparison manifest passed via --state)
      - run: dbt run --select state:modified+ --target ci
      # Phase 4: Tests, data quality after materialization
      - run: dbt test --select state:modified+ --target ci
What each phase catches:
| Phase | Catches |
|---|---|
| Parse | YAML errors, syntax errors, missing refs |
| Compile | Jinja errors, undefined macros, missing sources |
| Run | Contract violations, SQL errors, materialization failures |
| Tests | Data quality issues (unique, not_null, accepted_values, custom) |
Contract violations are caught in the Run phase: dbt attempts to materialize the model and fails on a mismatch. This requires an actual DuckDB target in CI, not only a parse.
CI gate types
Gate 1: Contract presence on marts
Check that all mart models have a contract:
# scripts/check_contracts.py
import json
import sys

with open('target/manifest.json') as f:
    manifest = json.load(f)

errors = []
for node_id, node in manifest['nodes'].items():
    if node['resource_type'] != 'model':
        continue
    # A model counts as a mart when 'marts' is part of its fully qualified name
    if 'marts' not in node['fqn']:
        continue
    contract = node.get('contract', {})
    if not contract.get('enforced'):
        errors.append(f"Mart model '{node['name']}' missing enforced contract")

if errors:
    print('\n'.join(errors))
    sys.exit(1)
In CI:
- run: dbt parse
- run: python scripts/check_contracts.py
A PR that adds a mart model without a contract fails.
Gate 2: data_type presence per column
Every declared column must have a data_type:
for node in manifest['nodes'].values():
    if node['resource_type'] != 'model':
        continue
    # Only models with an enforced contract are checked
    if not node.get('contract', {}).get('enforced'):
        continue
    for col_name, col in node.get('columns', {}).items():
        if not col.get('data_type'):
            errors.append(f"{node['name']}.{col_name} missing data_type")
A contract without data_type is incomplete, and CI catches it.
Gate 3: No bare types
numeric without precision or varchar without length leaves room for drift:
import re

# Matches a type name with nothing after it (no precision, no length)
BARE_TYPES = re.compile(r'^(numeric|number|varchar|int|float)$', re.IGNORECASE)
for node in manifest['nodes'].values():
    for col_name, col in node.get('columns', {}).items():
        data_type = col.get('data_type') or ''
        if BARE_TYPES.match(data_type):
            errors.append(f"{node['name']}.{col_name}: bare type '{data_type}'")
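To see what the Gate 3 pattern accepts and rejects, here is a minimal sketch that runs it over a few sample type strings. The `is_bare` helper and the sample types are illustrative and not taken from a real project.

```python
import re

# Same pattern as in Gate 3: the whole string must be just the type name
BARE_TYPES = re.compile(r'^(numeric|number|varchar|int|float)$', re.IGNORECASE)

def is_bare(data_type: str) -> bool:
    """Return True when a type is declared without precision or length."""
    return bool(BARE_TYPES.match(data_type.strip()))

samples = ['numeric', 'NUMERIC(12, 2)', 'varchar', 'varchar(255)', 'integer', 'Float']
for t in samples:
    print(f"{t!r}: {'bare' if is_bare(t) else 'ok'}")
```

Note that the pattern is anchored on both ends, so `integer` or `bigint` pass untouched; only the exact names listed in the pattern are flagged.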
Gate 4: Critical constraints on PK / FK / business-critical columns
# Name patterns (regular expressions) for columns a team may treat as critical
CRITICAL_COLUMN_PATTERNS = ['_id$', '^id$', '_pk$', 'email', 'customer_id']
for node in manifest['nodes'].values():
    for col_name, col in node.get('columns', {}).items():
        constraints = col.get('constraints', [])
        constraint_types = {c['type'] for c in constraints}
        # Simplified check: id-like columns should declare primary_key + not_null.
        # Foreign keys such as customer_id also match; narrow the rule to your naming scheme.
        if col_name.endswith('_id') or col_name == 'id':
            if 'primary_key' not in constraint_types:
                errors.append(f"{node['name']}.{col_name}: missing PRIMARY KEY")
            if 'not_null' not in constraint_types:
                errors.append(f"{node['name']}.{col_name}: missing NOT NULL")
Gate 5: Contracts + data tests parity
For each constraint, expect corresponding data test:
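from collections import defaultdict

# In manifest.json generic tests are separate nodes, not keys on the column,
# so first index test names by (attached model id, column name)
tests_by_column = defaultdict(set)
for node in manifest['nodes'].values():
    if node['resource_type'] == 'test' and node.get('test_metadata'):
        key = (node.get('attached_node'), node.get('column_name'))
        tests_by_column[key].add(node['test_metadata']['name'])

for node_id, node in manifest['nodes'].items():
    for col_name, col in node.get('columns', {}).items():
        constraint_types = {c['type'] for c in col.get('constraints', [])}
        test_names = tests_by_column[(node_id, col_name)]
        # A PK constraint should be backed by unique + not_null tests
        if 'primary_key' in constraint_types:
            if 'unique' not in test_names:
                errors.append(f"{node['name']}.{col_name}: PK but no unique test")
            if 'not_null' not in test_names:
                errors.append(f"{node['name']}.{col_name}: PK but no not_null test")
This reinforces the rule: constraints declare, tests enforce.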
Gate 6: Versions consistency
For models with versions:
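# Assumes node['versions'] mirrors the versions list from the model's YAML;
# in manifest.json each version is its own node with version / latest_version fields
for node in manifest['nodes'].values():
    if node.get('versions'):
        if not node.get('latest_version'):
            errors.append(f"{node['name']}: has versions but no latest_version set")
            continue  # nothing to compare against
        for v in node['versions']:
            if v['v'] > node['latest_version'] and not v.get('defined_in'):
                errors.append(f"{node['name']}.v{v['v']}: missing defined_in")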
Gate 7: Deprecated versions not removed yet
Track compliance with deprecation_date:
from datetime import date

warnings = []
for node in manifest['nodes'].values():
    for v in node.get('versions', []):
        dep_date = v.get('deprecation_date')
        # deprecation_date is expected as an ISO date string, e.g. '2025-01-31'
        if dep_date and date.fromisoformat(dep_date) < date.today():
            warnings.append(f"{node['name']}.v{v['v']}: deprecation_date passed, should be removed")
This is a warning, not an error; whether it blocks a PR depends on team policy.
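The deprecation check can also be written against manifest-shaped nodes. The sketch below assumes each versioned model appears as its own node with `version` and `deprecation_date` fields, and that `deprecation_date` is an ISO 8601 string; verify those field names against the manifest your dbt version produces. `overdue_versions` and the sample nodes are hypothetical.

```python
from datetime import date, datetime

def overdue_versions(nodes: dict, today: date) -> list[str]:
    """Return warnings for model versions whose deprecation_date has passed."""
    warnings = []
    for node in nodes.values():
        if node.get('resource_type') != 'model':
            continue
        dep = node.get('deprecation_date')
        if not dep:
            continue
        # Accepts both '2024-01-01' and full timestamps with a UTC offset
        dep_day = datetime.fromisoformat(dep).date()
        if dep_day < today:
            warnings.append(
                f"{node['name']}.v{node.get('version')}: "
                f"deprecation_date {dep_day} passed, should be removed"
            )
    return warnings

# Hypothetical nodes shaped like manifest entries
sample_nodes = {
    'model.shop.fct_orders.v1': {
        'resource_type': 'model', 'name': 'fct_orders',
        'version': 1, 'deprecation_date': '2024-01-01T00:00:00+00:00',
    },
    'model.shop.fct_orders.v2': {
        'resource_type': 'model', 'name': 'fct_orders',
        'version': 2, 'deprecation_date': None,
    },
}
for line in overdue_versions(sample_nodes, today=date(2025, 1, 1)):
    print(line)
```

Passing `today` as a parameter keeps the function deterministic, which makes the gate itself easy to unit test.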
What contracts catch: in detail
Critical: contracts vs data tests vs unit tests
| Aspect | Contracts | Data tests | Unit tests |
|---|---|---|---|
| What checks | Schema (columns, types, structural constraints) | Data quality (values in warehouse) | Logic correctness on mock input |
| When runs | Run-time (build) | Run-time (after build) | Build-time / CI (no warehouse) |
| Enforcement | Build fails on mismatch | Test fails if data is invalid | Test fails if logic is wrong |
| Strength | Strong cross-warehouse (DDL applied) | Strong (actual data query) | Strong (deterministic) |
| Weakness | Mostly metadata-only on Snowflake/BigQuery (only NOT NULL is enforced) | Slow (warehouse query) | Doesn't catch real data drift |
| Production use | Marts + public APIs | All models (basic), critical (extensive) | Critical models (revenue, churn, attribution) |
All three layers are complementary. A production-grade model has a contract, data tests, and unit tests.
Specific limits in production
1. Snowflake / BigQuery: most constraints are metadata-only
What contracts declare:
constraints:
  - type: not_null
  - type: check
    expression: "revenue >= 0"
What the warehouse does:
- Snowflake: enforces NOT NULL; stores PRIMARY KEY, FOREIGN KEY, and UNIQUE as metadata without enforcing them; does not support CHECK
- BigQuery: similar. NOT NULL is enforced, PRIMARY KEY and FOREIGN KEY are metadata only, CHECK and UNIQUE are not supported
What is NOT caught:
- Insert negative revenue where CHECK declared
- Insert duplicate where PK declared
- Insert orphan FK where FOREIGN KEY declared
Real enforcement comes from data tests:
data_tests:
  - not_null
  - dbt_utils.expression_is_true:
      expression: ">= 0"
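A data test is, in essence, a query that returns the rows violating a rule. The sketch below uses Python's built-in `sqlite3` as a stand-in for a warehouse that does not enforce the declared CHECK: the table is created without it, a bad row gets in, and a test-style query finds it. The table name, column names, and the `failing_rows` helper are hypothetical, and SQLite is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# No CHECK in the DDL: mimics a warehouse that does not enforce the constraint
conn.execute('CREATE TABLE fct_orders (order_id INTEGER, revenue REAL)')
conn.executemany(
    'INSERT INTO fct_orders VALUES (?, ?)',
    [(1, 10.0), (2, -5.0), (3, 0.0)],
)

def failing_rows(conn, table: str, expression: str) -> list:
    """Return rows where the expression is not true, like an expression test."""
    # table and expression are trusted constants in this sketch, not user input
    return conn.execute(
        f'SELECT * FROM {table} WHERE NOT ({expression})'
    ).fetchall()

bad = failing_rows(conn, 'fct_orders', 'revenue >= 0')
print(f'{len(bad)} failing row(s): {bad}')
```

The insert of the negative value succeeds silently; only the query afterwards reveals it, which is the role data tests play next to metadata-only constraints.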
2. DuckDB: full enforcement locally, partial cloud
What works locally:
- All constraints enforced
- FK works
What doesn't work:
- MotherDuck FK: not supported
- :memory: database: constraints are lost when the process ends
3. Postgres: full enforcement, but cost
Что works:
- All constraints enforced (PK, FK, CHECK, NOT NULL, UNIQUE)
Cost:
- FK check on every INSERT slows bulk loads
- For dbt's incremental loads: acceptable
- For huge initial backfills: can become a bottleneck
4. Data quality is not the same as schema integrity
Contracts say "the schema is what we declared". They do not say "the data is correct".
Example:
- name: revenue
  data_type: numeric(12, 2)
  constraints:
    - type: not_null
Contract passes:
- Column is numeric(12, 2) [x]
- All values NOT NULL [x]
Data still broken:
- All revenue values = 0 (formula bug)
- Or all revenue values = $1 (currency error)
- Or all revenue values from yesterday (freshness issue)
Contract: ok. Data quality: terrible. You still need data tests, freshness checks, and business validation.
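The failure modes above pass the contract but can be caught by simple value-level checks. A minimal sketch, assuming the values and load timestamps have already been fetched from the warehouse; `sanity_issues`, its 24-hour threshold, and the sample data are illustrative.

```python
from datetime import datetime, timedelta

def sanity_issues(revenues, loaded_at, now, max_age=timedelta(hours=24)):
    """Return a list of value-level problems a schema contract cannot see."""
    issues = []
    if revenues and all(v == 0 for v in revenues):
        issues.append('all revenue values are 0')
    elif len(set(revenues)) == 1:
        issues.append(f'all revenue values are identical ({revenues[0]})')
    # Freshness: the newest row must be younger than max_age
    if loaded_at and now - max(loaded_at) > max_age:
        issues.append(f'newest row is older than {max_age}')
    return issues

now = datetime(2025, 1, 2, 12, 0)
print(sanity_issues([0, 0, 0], [datetime(2025, 1, 2, 9, 0)], now))
print(sanity_issues([1, 1, 1], [datetime(2024, 12, 31, 9, 0)], now))
print(sanity_issues([10, 25, 7], [datetime(2025, 1, 2, 9, 0)], now))
```

In a dbt project the same ideas would normally live in data tests and source freshness checks rather than in a standalone script.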
5. Schema changes without notification
A contract states "this is the current schema". It does not notify consumers when the schema changes.
Example:
- Day 1: data_type: numeric(12, 2), and consumers rely on it
- Day 2: a PR changes it to data_type: numeric(18, 4) (more precision)
- Day 2: the PR merges
- Consumers don't see the change
Solution: model versions (previous lesson) for breaking changes, or a notification process for non-breaking ones (added column, increased precision).
CI optimization: only modified
Don't run contract checks against all models on every PR; that is slow. Use state:modified+:
- name: Contract checks (modified only)
  run: |
    dbt run --select state:modified+ --target ci --defer --state ./prod-manifest
    python scripts/check_contracts.py --modified-only
state:modified+ selects only the changed models plus their downstream models. Faster CI.
This requires state comparison: the production manifest versus the PR manifest, passed via --state. See module 13 (CI/CD GitHub).
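State comparison boils down to diffing two manifests. As a rough approximation of what `state:modified` selects, the sketch below flags models that are new or whose file checksum differs from production. It assumes each node carries a `checksum` object with a `checksum` field; dbt's own comparison looks at more than file contents (configs, for example) and the `+` suffix also pulls in downstream models, so treat `modified_models` as a hypothetical helper for a `--modified-only` style flag, not a replacement for the selector.

```python
def modified_models(prod_nodes: dict, pr_nodes: dict) -> set[str]:
    """Return ids of models that are new in the PR or whose checksum changed."""
    changed = set()
    for node_id, node in pr_nodes.items():
        if node.get('resource_type') != 'model':
            continue
        prod_node = prod_nodes.get(node_id)
        if prod_node is None:
            changed.add(node_id)  # model does not exist in production yet
            continue
        old = prod_node.get('checksum', {}).get('checksum')
        new = node.get('checksum', {}).get('checksum')
        if old != new:
            changed.add(node_id)
    return changed

# Hypothetical, heavily trimmed manifest fragments
prod = {
    'model.shop.fct_orders': {'resource_type': 'model', 'checksum': {'checksum': 'aaa'}},
    'model.shop.dim_users': {'resource_type': 'model', 'checksum': {'checksum': 'bbb'}},
}
pr = {
    'model.shop.fct_orders': {'resource_type': 'model', 'checksum': {'checksum': 'aaa2'}},
    'model.shop.dim_users': {'resource_type': 'model', 'checksum': {'checksum': 'bbb'}},
    'model.shop.fct_refunds': {'resource_type': 'model', 'checksum': {'checksum': 'ccc'}},
}
print(sorted(modified_models(prod, pr)))
```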
Slim CI with contracts
Full workflow:
# .github/workflows/dbt-ci.yml
on: pull_request
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      # (checkout, Python setup, and dbt install steps as in the basic workflow)
      # Download production manifest for state:modified+
      - name: Download prod manifest
        run: aws s3 cp s3://my-bucket/dbt-manifest/manifest.json ./prod-manifest/
      # Parse
      - name: Parse
        run: dbt parse
      # Static checks
      - name: Custom contract checks
        run: python scripts/check_contracts.py
      # Slim CI build with contracts enforcement
      - name: Build modified
        run: dbt build --select state:modified+ --defer --state ./prod-manifest --target ci
      # Compile diff report
      - name: Schema diff report
        run: python scripts/schema_diff.py --before prod --after ci
dbt build runs models and tests together in DAG order (plus seeds and snapshots). With contracts enforced, a mismatch fails the build.
The schema diff report is an extra nicety: it shows the diff between the prod schema and the PR schema for review.
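The lesson does not show `scripts/schema_diff.py`. One possible shape for it is a comparison of declared columns and `data_type` values between two manifests. The sketch below works on already loaded `nodes` dictionaries; `schema_diff`, its output format, and the sample data are assumptions for illustration.

```python
def schema_diff(before: dict, after: dict) -> list[str]:
    """Compare declared columns and data types of models in two manifests."""
    lines = []
    for node_id, new_node in sorted(after.items()):
        if new_node.get('resource_type') != 'model':
            continue
        old_cols = before.get(node_id, {}).get('columns', {})
        new_cols = new_node.get('columns', {})
        for col in sorted(set(old_cols) | set(new_cols)):
            old_type = old_cols.get(col, {}).get('data_type')
            new_type = new_cols.get(col, {}).get('data_type')
            if col not in old_cols:
                lines.append(f"+ {new_node['name']}.{col}: {new_type}")
            elif col not in new_cols:
                lines.append(f"- {new_node['name']}.{col}: {old_type}")
            elif old_type != new_type:
                lines.append(f"~ {new_node['name']}.{col}: {old_type} -> {new_type}")
    return lines

# Hypothetical prod and PR fragments for a single model
before = {'model.shop.fct_orders': {
    'resource_type': 'model', 'name': 'fct_orders',
    'columns': {'order_id': {'data_type': 'integer'},
                'revenue': {'data_type': 'numeric(12, 2)'},
                'legacy_flag': {'data_type': 'boolean'}},
}}
after = {'model.shop.fct_orders': {
    'resource_type': 'model', 'name': 'fct_orders',
    'columns': {'order_id': {'data_type': 'integer'},
                'revenue': {'data_type': 'numeric(18, 4)'},
                'currency': {'data_type': 'varchar(3)'}},
}}
print('\n'.join(schema_diff(before, after)))
```

A report like this, posted as a PR comment, also gives reviewers a place to notice non-breaking changes that consumers should hear about.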
Cost of contracts in production
Per-run overhead:
- Constraint DDL is applied per relation
- 50 marts × 20 columns × 3 constraints = 3000 constraint declarations in DDL
- On Snowflake: roughly $0.01-0.05 per run in DDL credits
- Negligible for most teams
Storage:
- Constraints are stored in metadata: minimal
- Negligible
Compute:
- Enforcement (where it applies): Postgres FK checks can slow inserts
- For dbt's batch loads: acceptable
Development:
- Time to write YAML declarations: 5-15 min per model the first time
- Time to maintain: 1-2 min per PR touching the schema
- ROI: orders of magnitude saved downstream
Reverting contract
It happens: a contract turns out to be too restrictive and needs to be disabled temporarily.
- name: fct_orders
  config:
    contract:
      enforced: false  # toggle off
A single-line change. dbt run then passes without enforcement.
When to revert:
- Production fire — need flexibility immediately
- Migration in progress (between phases)
- Discovered schema drift, need to align before re-enabling
After fix:
- Re-enable, ensure CI passes
- Document why it was disabled (incident log)
CI gate: catch a disabled contract on a critical mart:
CRITICAL_MARTS = ['fct_orders', 'customer_metrics', 'revenue_daily']
for node in manifest['nodes'].values():
    if node['name'] in CRITICAL_MARTS:
        if not node.get('contract', {}).get('enforced'):
            errors.append(f"CRITICAL: {node['name']} has contract.enforced=false")
This is a strict policy: critical marts must have contracts. Toggling one off requires PR approval.
Try it yourself
- Enable a contract on one mart model.
- Test breaking changes:
  - Change a type in SQL (numeric -> float): dbt run should fail
  - Add an extra column in SQL: fail
  - Remove a column from YAML: fail
- Write a CI gate script:
  - Parse manifest.json
  - Check: all marts have an enforced contract
  - Check: all columns have data_type
  - Check: no bare types (numeric, varchar)
  - Run it in pre-commit or CI
- Test the gate:
  - Add a mart without a contract: the script should fail
  - Add a column without data_type: the script should fail
- Combine:
  - Contract + data tests (not_null + unique) + unit test (logic) on one model
  - All three layers active
Key takeaways
- Contract as a CI gate: dbt run with contract enforcement means the build fails when SQL and YAML mismatch. Faster feedback than a scheduled prod run.
- CI gate types: presence on marts, data_type per column, no bare types, critical constraints, contract+test parity, version consistency, deprecation enforcement.
- What contracts catch: type drift, extra column, missing column, precision change. What they don't catch: wrong data, wrong logic, semantic change, performance regression.
- Metadata-only constraints: on Snowflake/BigQuery, PK/FK are declarative, not enforced, and CHECK is not supported; only NOT NULL is enforced. DuckDB locally: enforced. Postgres: enforced.
- Complement with data tests: constraints metadata + data tests = actual enforcement. Belt and suspenders.
- CI optimization: state:modified+ for changed models only. Slim CI.
- Cost: minimal compute / storage. Mostly development time for YAML declarations. The ROI for production stability is large.
- Reverting: the enforced: false toggle is for emergencies. A CI gate should catch the reversion on critical marts.