Learning Platform
Глоссарий Troubleshooting
Урок 11.02 · 26 мин
Средний
ContractsConstraintsData typesPRIMARY KEYFOREIGN KEYDuckDB

Defining contract: columns, data_type, constraints (PK/FK/CHECK/NOT NULL)

В прошлом уроке мы разобрали концепт contract. Этот урок — про синтаксис: точно как объявлять colonki, типы, constraints, и что supports каждый warehouse.

Production-grade contract = full declaration: каждая колонка с типом, PK/FK/CHECK/NOT NULL constraints где meaningful. Это схема как код для downstream consumers.

SQL DDL: constraints — PK/FK/CHECK/NOT NULL на уровне базы

Минимальный contract: только columns + types

- name: dim_customers
  config:
    contract:
      enforced: true
  columns:
    - name: customer_id
      data_type: bigint

    - name: first_name
      data_type: varchar(100)

    - name: last_name
      data_type: varchar(100)

    - name: email
      data_type: varchar(255)

    - name: created_at
      data_type: timestamp

SQL должен материализовать колонки точно с этими типами. Casts в SQL обязательны:

SELECT
    id::bigint AS customer_id,
    first_name::varchar(100),
    last_name::varchar(100),
    email::varchar(255),
    created_at::timestamp
FROM {{ ref('stg_customers') }}

Это base level — schema enforced, без constraints.


data_type: какие типы supports

dbt принимает SQL data types, специфичные для каждого warehouse. Common типы:

Тип семантикаDuckDBSnowflakeBigQueryPostgres
Integer 64-bitbigintbigint / number(19,0)int64bigint
Integer 32-bitintegerinteger / number(10,0)int64 (no 32-bit)integer
Decimal (точный)numeric(p, s)number(p, s)numeric / bignumericnumeric(p, s)
Float doubledoublefloatfloat64double precision
Variable textvarchar(N)varchar(N)stringvarchar(N)
Booleanbooleanbooleanboolboolean
Datedatedatedatedate
Timestamptimestamptimestamp_ntztimestamptimestamp
Timestamp + TZtimestamptztimestamp_tztimestamptimestamp with time zone
Arraybigint[]arrayarray<int64>bigint[]
JSON / Structjson / struct(...)variant / objectjson / struct<>jsonb

Tip: на DuckDB основные типы pretty standard SQL. На Snowflake/BigQuery — некоторые отличия (например, BigQuery only int64, no smaller integers).


Точная precision для decimal

Для финансов обязательно указывать precision/scale:

- name: order_total
  data_type: numeric(12, 2)
  # 12 digits total, 2 after decimal point
  # max value: 9_999_999_999.99

numeric без precision — драйфует. Compatibility issues между warehouses + потеря точности при aggregation.

При нарушении в SQL:

-- SQL
SELECT order_total::numeric(10, 4) AS order_total
Contract enforcement failure:
- Column 'order_total': SQL type is numeric(10, 4), YAML declares numeric(12, 2)

Это caught — alignment строгий.


Constraints: PRIMARY KEY

- name: customer_id
  data_type: bigint
  constraints:
    - type: primary_key

или (multi-column PK):

columns:
  - name: order_id
    data_type: bigint
  - name: line_item_id
    data_type: bigint

# на уровне модели — composite key
constraints:
  - type: primary_key
    columns: [order_id, line_item_id]

Что происходит при materialization:

  • DuckDB: CREATE TABLE ... PRIMARY KEY (customer_id) — enforced при insert
  • Snowflake: PRIMARY KEY (...) declared — но NOT enforced (warehouse design)
  • BigQuery: PK declared в metadata, optimizer hint — не enforced
  • Postgres: enforced

Production реальность: на cloud warehouses (Snowflake/BigQuery) PK declarative. Не полагайся на enforcement — дублируй через data test unique для real check.

- name: customer_id
  data_type: bigint
  constraints:
    - type: primary_key   # declarative на Snowflake
  data_tests:
    - unique               # actual enforcement через test
    - not_null

Это belt and suspenders approach — оба слоя для critical FK.


Constraints: FOREIGN KEY

- name: customer_id
  data_type: bigint
  constraints:
    - type: foreign_key
      expression: ref('dim_customers')
      references_field: customer_id

(синтаксис может слегка отличаться в разных версиях dbt — проверять docs your version)

Support:

  • DuckDB locally: enforced ok
  • DuckDB MotherDuck: NOT supported
  • Snowflake: declarative, не enforced
  • BigQuery: declarative, не enforced
  • Postgres: enforced

Production: используй contract FK для documentation/optimizer + data test relationships для actual check:

- name: customer_id
  constraints:
    - type: foreign_key
      expression: ref('dim_customers')
      references_field: customer_id
  data_tests:
    - relationships:
        to: ref('dim_customers')
        field: customer_id

Constraints: CHECK

Custom validation expressions:

- name: order_total
  data_type: numeric(12, 2)
  constraints:
    - type: check
      expression: "order_total >= 0"

- name: status
  data_type: varchar(20)
  constraints:
    - type: check
      expression: "status IN ('completed', 'pending', 'cancelled', 'refunded')"

- name: email
  data_type: varchar(255)
  constraints:
    - type: check
      expression: "email LIKE '%@%'"

Support:

  • DuckDB: enforced
  • Postgres: enforced
  • Snowflake: declarative, не enforced
  • BigQuery: declarative

Для critical CHECK логики используй data test через dbt_utils.expression_is_true:

- name: order_total
  data_tests:
    - dbt_utils.expression_is_true:
        expression: ">= 0"

Это работает в любом warehouse как actual check.


Constraints: NOT NULL

- name: order_id
  data_type: bigint
  constraints:
    - type: not_null

Support:

  • DuckDB: enforced
  • Postgres: enforced
  • Snowflake: enforced (одно из немногих enforced constraints в Snowflake)
  • BigQuery: enforced through STRUCT mode=‘REQUIRED’

NOT NULL — один из best-enforced constraints across warehouses. Большая часть warehouses фактически рекомендуют это для primary keys и timestamp columns.

Production:

- name: order_id
  data_type: bigint
  constraints:
    - type: not_null
    - type: primary_key
  data_tests:
    - not_null   # дополнительная safety
    - unique

Constraints: UNIQUE

- name: email
  data_type: varchar(255)
  constraints:
    - type: unique

Support:

  • DuckDB: enforced via UNIQUE index
  • Postgres: enforced
  • Snowflake: declarative
  • BigQuery: не supported в DDL

Используй data_tests: [unique] для cross-warehouse.


Full production contract пример

- name: fct_orders
  description: "Production fact for orders. PII-clean, used by 15 dashboards."
  config:
    contract:
      enforced: true
    materialized: table

  columns:
    # Primary key
    - name: order_id
      data_type: bigint
      description: "Unique order ID"
      constraints:
        - type: not_null
        - type: primary_key
      data_tests: [unique, not_null]

    # Foreign key
    - name: customer_id
      data_type: bigint
      description: "FK on dim_customers"
      constraints:
        - type: not_null
        - type: foreign_key
          expression: ref('dim_customers')
          references_field: customer_id
      data_tests:
        - not_null
        - relationships:
            to: ref('dim_customers')
            field: customer_id

    # Date
    - name: order_date
      data_type: date
      description: "Date of order (UTC)"
      constraints:
        - type: not_null
        - type: check
          expression: "order_date >= DATE '2000-01-01' AND order_date <= CURRENT_DATE + INTERVAL '1 year'"
      data_tests: [not_null]

    # Financial decimal
    - name: order_total
      data_type: numeric(12, 2)
      description: "Order total in USD"
      constraints:
        - type: not_null
        - type: check
          expression: "order_total >= 0"
      data_tests:
        - not_null
        - dbt_utils.expression_is_true:
            expression: ">= 0"

    # Categorical
    - name: status
      data_type: varchar(20)
      description: "One of: completed, pending, cancelled, refunded"
      constraints:
        - type: not_null
        - type: check
          expression: "status IN ('completed', 'pending', 'cancelled', 'refunded')"
      data_tests:
        - not_null
        - accepted_values:
            values: ['completed', 'pending', 'cancelled', 'refunded']

    # Optional text
    - name: notes
      data_type: varchar(500)
      description: "Customer notes (optional)"
      # no constraints, can be NULL
-- models/marts/fct_orders.sql
SELECT
    o.order_id::bigint,
    o.customer_id::bigint,
    o.order_date::date,
    o.order_total::numeric(12, 2),
    o.status::varchar(20),
    o.notes::varchar(500)
FROM {{ ref('stg_orders') }} o
WHERE o.order_id IS NOT NULL
  AND o.customer_id IS NOT NULL

Когда dbt run:

  1. SQL компилируется
  2. Resulting types vs YAML declared — checked
  3. DDL constraints applied (where supports)
  4. Material runs
  5. Data tests run после material

При нарушении — error message:

Contract enforcement failure for fct_orders:
- Column 'order_id': SQL produces 'integer', YAML declares 'bigint'
- Column 'status': SQL produces 'varchar', YAML declares 'varchar(20)'
- Unexpected column 'customer_email' in SQL not declared in YAML

Producer обязан align.


Constraints на model level vs column level

Иногда constraint касается нескольких колонок — composite PK, composite UNIQUE, complex CHECK:

- name: order_line_items
  config:
    contract:
      enforced: true
  constraints:   # на уровне модели
    - type: primary_key
      columns: [order_id, line_number]
    - type: unique
      columns: [order_id, line_number]
    - type: check
      expression: "quantity > 0"
      name: positive_quantity
  columns:
    - name: order_id
      data_type: bigint
      constraints: [{type: not_null}]   # column-level
    - name: line_number
      data_type: integer
      constraints: [{type: not_null}]
    - name: quantity
      data_type: integer

Comp PK / composite UNIQUE объявляются на model level через columns array. CHECK на multiple columns — также model level.


Type alignment gotchas

1. Implicit casts на DuckDB

DuckDB liberal с casting. SELECT 1 AS x создаст x INTEGER, но contract x BIGINT — fails. Always explicit:

SELECT 1::bigint AS x

2. Functions с unclear return types

-- Что возвращает? Зависит от input
SELECT SUM(order_total) AS revenue
FROM orders

Может вернуть BIGINT, NUMERIC, HUGEINT — зависит от input column type. Phantom drift.

Решение — explicit cast:

SELECT SUM(order_total)::numeric(12, 2) AS revenue

3. Date / timestamp granularity

data_type: timestamp   # vs timestamp(0)

DuckDB timestamp = microsecond precision. PostgreSQL — same. Если SQL делает now()::timestamp — works. Но если now()::timestamptz — fails contract.

4. Varchar(N) vs varchar

data_type: varchar    # без N

vs

data_type: varchar(100)

Some warehouses allow varchar без length, others require. Snowflake — varchar = varchar(16777216) (max). DuckDB — flexible.

Production: always specify length explicitly. Catches truncation bugs.


Migration: добавление contract на existing model

Phased approach (см. урок 01):

# PR 1: declare без enforce
- name: fct_orders
  config:
    contract:
      enforced: false   # declarations only
  columns:
    - name: order_id
      data_type: bigint
    ...

# PR 2: align SQL — explicit casts, remove extras
# (SQL changes)

# PR 3: enable enforce
- name: fct_orders
  config:
    contract:
      enforced: true
  columns:
    ...

Каждый PR — separate review. После PR 3 — pre-existing model under contract.


Антипаттерны

  1. Generic numeric без precision: data_type: numeric — drift между warehouses, precision loss. Always numeric(p, s).

  2. No length для varchar: varchar без N — truncation bugs могут проходить silently. Always varchar(N).

  3. FK через contract вместо relationships test: на Snowflake/BigQuery FK declarative — не enforced. Polagis on data test для real check.

  4. CHECK через contract вместо expression test: на Snowflake/BigQuery CHECK declarative. Data test expression_is_true works cross-warehouse.

  5. Lots of CHECKs: 10 CHECK constraints на колонке — fragile при изменениях. Группируй где возможно.

  6. Implicit casts в SQL: полагание на DuckDB liberal casting — breaks при migration to Snowflake. Always explicit ::type for contract columns.

  7. No data tests + contract only: contract = schema. Data quality нужны через data tests. Не дублирование, а complementary.


Попробуй сам

  1. Выбрать mart-модель с 5-10 колонками.

  2. Добавить contract declarations в YAML:

    • data_type для каждой
    • PK на ID column
    • FK на FK columns
    • CHECK на critical fields (revenue не меньше 0)
    • NOT NULL на required
  3. Align SQL — explicit casts.

  4. Enable enforce: contract.enforced: true.

  5. Test breaks:

    • Изменить тип в SQL — должен fail
    • Add extra column в SQL — should fail
    • Remove column from YAML — should fail
    • Negative revenue (если CHECK) — DuckDB blocks (Snowflake — passes silently)
  6. Test data tests — отдельный layer, проверяют data quality после material.


Ключевые выводы

  1. Полный contract = data_type + constraints. Декларация в YAML, alignment в SQL через explicit casts.
  2. Constraint types: not_null, primary_key, foreign_key, check, unique. Каждый с разной enforcement по warehouses.
  3. Enforcement matrix: DuckDB locally — most enforced. Snowflake/BigQuery — mostly metadata-only. Postgres — most enforced. Production = constraint + data test для real check.
  4. DuckDB FK partial: works locally, не on MotherDuck. Snowflake/BigQuery FK — only declarative.
  5. Always specify precision: numeric(12, 2), varchar(100), не bare numeric / varchar.
  6. Type alignment: explicit ::type casts в SQL для каждой declared column. DuckDB liberal casting — fails при migration.
  7. Composite constraints — на model level через columns: [...] array.
  8. Migration: declare -> align SQL -> enable enforce. Phased для safety.
Проверка знанийKnowledge check
Senior говорит: 'на Snowflake contract заявляет CHECK 'revenue не меньше 0', но дашборд показал -$500. CHECK не сработал?'. Что случилось?
ОтветAnswer
На Snowflake (и BigQuery) **CHECK constraints declarative, не enforced**. Это **fundamental architecture choice**: warehouse focuses на bulk loads и query performance, не на per-row insertion validation.\n\nЧто delivers `CHECK expression` в Snowflake:\n- DDL syntax accepted: `CONSTRAINT my_check CHECK (revenue не меньше 0)` — successfully created\n- Visible в `SHOW IMPORTED KEYS` / metadata queries\n- Optimizer hints (rare usage)\n- **NOT enforced** при INSERT / UPDATE / MERGE / dbt run\n\nЭто **false sense of security** — common gotcha при переходе с Postgres / DuckDB local.\n\n**Решения**:\n\n**1. Data test `expression_is_true`**:\n```yaml\n- name: revenue\n constraints:\n - type: check\n expression: "revenue не меньше 0" # declarative для documentation\n data_tests:\n - dbt_utils.expression_is_true:\n expression: "не меньше 0" # actual check\n```\n\nПосле `dbt build` тест runs query `SELECT * FROM revenue WHERE NOT (revenue не меньше 0)` — fails if any rows. Это **real enforcement**.\n\n**2. Custom logic в SQL модели**:\n```sql\nSELECT\n customer_id,\n GREATEST(revenue, 0) AS revenue -- floor at 0\nFROM ...\n```\n\nИли (если негативные revenue = bug data) — фильтр:\n```sql\nWHERE revenue не меньше 0\n OR raise_error("Negative revenue detected")\n```\n\nЭто catches на load, не post-hoc.\n\n**3. Source-level validation** — в staging применяй CHECK в data tests, не пропускай invalid в downstream.\n\n**Production принцип**: \n- Contract constraints = documentation + DDL\n- Data tests = actual enforcement\n- Source filters = preventive\n\nНа DuckDB и Postgres — CHECK enforced (real). На Snowflake/BigQuery — нет. Always **complement** constraints с data tests cross-warehouse.
Проверка знанийKnowledge check
Команда пишет contract: `data_type: numeric` без precision. dbt parse ok, run ok. Через 3 месяца на проде revenue ушёл в дисбаланс. Что произошло?
ОтветAnswer
**Numeric drift** — silent precision loss или платформенная divergence.\n\n**Сценарии**:\n\n**1. DuckDB default**:\n`numeric` без precision = `DECIMAL(18, 3)` (default). \n\n**2. Snowflake default**:\n`number` без precision = `NUMBER(38, 0)` (integer-like, scale=0). Это **MASSIVE precision loss**: 50.75 становится 50 при cast.\n\n**3. BigQuery**:\nNUMERIC = NUMERIC(38, 9). Scale 9 — OK для money, но different от DuckDB default.\n\nЕсли модель переехала с DuckDB на Snowflake (migration или CI vs prod different):\n- DuckDB: revenue 50.75 stored as DECIMAL(18, 3)\n- Snowflake: revenue 50 (truncated to scale 0 because of default)\n\nDashboards aggregate Snowflake values -> revenue dropped by 50% из-за silent truncation.\n\n**Решение** — always explicit precision:\n\n```yaml\n- name: revenue\n data_type: numeric(12, 2) # explicit\n```\n\n12 = total digits, 2 = after decimal point. Max value 9_999_999_999.99 (~$10B). Suficient для большинства money use cases.\n\nДля больших аггрегатов: `numeric(18, 2)` (max $9999... trillion).\n\nДля sub-cent precision (FX rates, fractional shares): `numeric(18, 6)` (6 после decimal).\n\n**Правило**: **никогда не используй bare `numeric` или `number` в contracts**. Always specify precision and scale. Аналогично:\n\n- `varchar` -> `varchar(N)` — truncation safety\n- `int` (без size) -> `bigint` (explicit 64-bit)\n- `timestamp` -> `timestamp` (default ms) или `timestamp(6)` (microsecond) — depend on need\n\n**Production CI gate**:\n\n```python\nfor col in manifest_columns:\n if col['data_type'] in ('numeric', 'number', 'varchar', 'int'):\n errors.append(f'Bare type {col["data_type"]} in {col["name"]}: specify precision')\n```\n\nЭто catches bare types на PR. Prevent silent drift on production.

Проверьте понимание

Результат: 0 из 0
Аналитический
Вопрос 1 из 5. Команда пишет `data_type: numeric` (без precision) для revenue колонки. После 3 месяцев на проде revenue ушёл в дисбаланс. Корень?

Закончили урок?

Отметьте его как пройденный, чтобы отслеживать свой прогресс

Войдите чтобы оценить урок

Прогресс модуля
0 из 5