Data Quality & Observability
Why Data Quality Is a System Design Problem
Bad data → wrong decisions → loss of trust in the data platform. Data quality is not an afterthought but an architectural layer.
Cost of bad data:
Caught in Bronze: $1 (automatic fix)
Caught in Silver: $10 (reprocessing)
Caught in Gold: $100 (dashboard downtime)
Caught by the business: $1000 (loss of trust)
→ Shift-left: catch issues as early as possible
Data Quality Dimensions
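The six dimensions are Freshness, Completeness, Accuracy, Consistency, Uniqueness, and Schema; the table below gives a sample check for each. Shift-left applies to all of them: the earlier an issue is detected, the cheaper it is to fix.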
| Dimension | What we check | Example check |
|---|---|---|
| Freshness | Is the data fresh? | Last record timestamp is no older than 1 hour |
| Completeness | Is all the data present? | Row count within ±10% of yesterday's |
| Accuracy | Is the data correct? | amount is non-negative, email is valid |
| Consistency | Is the data consistent? | SUM(order_items) = order.total |
| Uniqueness | Are there no duplicates? | COUNT(DISTINCT id) = COUNT(id) |
| Schema | Is the structure correct? | All expected columns present, types correct |
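Several of the checks in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a production validator: the function name `check_dimensions`, the `EXPECTED_COLUMNS` mapping, and the sample rows are assumptions made for the example. Consistency and email validation are left out because they need a second table and a validation rule that the text does not specify.

```python
from datetime import datetime, timedelta

# Hypothetical expected schema for an orders batch (an assumption for this sketch).
EXPECTED_COLUMNS = {"order_id": str, "amount": float, "created_at": datetime}

def check_dimensions(rows, now, yesterday_count):
    """Return {dimension: passed} for a non-empty batch of rows (list of dicts)."""
    ids = [r["order_id"] for r in rows]
    newest = max(r["created_at"] for r in rows)
    return {
        # Freshness: the newest record is no older than 1 hour.
        "freshness": now - newest <= timedelta(hours=1),
        # Completeness: row count within ±10% of yesterday's count.
        "completeness": abs(len(rows) - yesterday_count) <= 0.10 * yesterday_count,
        # Accuracy: amounts are non-negative.
        "accuracy": all(r["amount"] >= 0 for r in rows),
        # Uniqueness: COUNT(DISTINCT id) = COUNT(id).
        "uniqueness": len(set(ids)) == len(ids),
        # Schema: every expected column is present with the expected type.
        "schema": all(
            set(EXPECTED_COLUMNS) <= set(r)
            and all(isinstance(r[c], t) for c, t in EXPECTED_COLUMNS.items())
            for r in rows
        ),
    }

now = datetime(2024, 1, 1, 12, 0)
rows = [
    {"order_id": "a1", "amount": 10.0, "created_at": now - timedelta(minutes=5)},
    {"order_id": "a2", "amount": 25.5, "created_at": now - timedelta(minutes=30)},
]
report = check_dimensions(rows, now=now, yesterday_count=2)
```

Returning one boolean per dimension keeps the result easy to turn into a pass rate or an alert later in the pipeline.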
Data Quality Tools
Great Expectations
# Declarative expectations (simplified: real suites nest parameters under "kwargs")
expectation_suite = {
    "expectations": [
        {"type": "expect_column_to_exist", "column": "order_id"},
        {"type": "expect_column_values_to_not_be_null", "column": "order_id"},
        {"type": "expect_column_values_to_be_unique", "column": "order_id"},
        {"type": "expect_column_values_to_be_between",
         "column": "amount", "min_value": 0, "max_value": 1000000},
        {"type": "expect_table_row_count_to_be_between",
         "min_value": 1000, "max_value": 1000000},
    ]
}
# Run (pseudocode): validator.validate(df, suite) → validation results + Data Docs
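To make the declarative idea concrete, here is a minimal sketch of how a suite in the simplified flat format above could be interpreted. It is not the Great Expectations API: `run_suite` and its internals are hypothetical, and the handling of `None` values is an assumption of this sketch.

```python
def run_suite(rows, suite):
    """Evaluate a simplified declarative suite against rows (list of dicts)."""
    def column(name):
        return [r.get(name) for r in rows]

    checks = {
        "expect_column_to_exist":
            lambda e: all(e["column"] in r for r in rows),
        "expect_column_values_to_not_be_null":
            lambda e: all(v is not None for v in column(e["column"])),
        "expect_column_values_to_be_unique":
            lambda e: len(set(column(e["column"]))) == len(rows),
        # None values are skipped here; nulls are left to the not_null check.
        "expect_column_values_to_be_between":
            lambda e: all(e["min_value"] <= v <= e["max_value"]
                          for v in column(e["column"]) if v is not None),
        "expect_table_row_count_to_be_between":
            lambda e: e["min_value"] <= len(rows) <= e["max_value"],
    }
    # Key each result by (expectation type, column) so failures are easy to report.
    return {(e["type"], e.get("column")): checks[e["type"]](e)
            for e in suite["expectations"]}

suite = {"expectations": [
    {"type": "expect_column_values_to_not_be_null", "column": "order_id"},
    {"type": "expect_column_values_to_be_unique", "column": "order_id"},
    {"type": "expect_column_values_to_be_between",
     "column": "amount", "min_value": 0, "max_value": 1000000},
]}
rows = [{"order_id": "a1", "amount": 10}, {"order_id": "a1", "amount": -5}]
results = run_suite(rows, suite)
failed = [key for key, ok in results.items() if not ok]
```

The point of the declarative form is that the suite is plain data: it can be versioned, reviewed, and reused across tables while the evaluation engine stays the same.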
See also: Great Expectations: production deployment
dbt Tests
# schema.yml
models:
  - name: orders_silver
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
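A dbt data test is a query that selects failing rows, and the test passes when the query returns nothing. The sketch below imitates that idea with SQLite from the Python standard library. The SQL is an approximation written for this example, not dbt's exact compiled output, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id TEXT);
    CREATE TABLE orders_silver (order_id TEXT, amount REAL, customer_id TEXT);
    INSERT INTO dim_customer VALUES ('c1');
    INSERT INTO orders_silver VALUES
        ('o1', 10.0, 'c1'), ('o1', 5.0, 'c1'), ('o2', 7.0, 'c9');
""")

# Each query selects the rows that violate the corresponding test.
TESTS = {
    "not_null(order_id)":
        "SELECT * FROM orders_silver WHERE order_id IS NULL",
    "unique(order_id)": """
        SELECT order_id FROM orders_silver
        WHERE order_id IS NOT NULL
        GROUP BY order_id HAVING COUNT(*) > 1""",
    "accepted_range(amount >= 0)":
        "SELECT * FROM orders_silver WHERE amount < 0",
    "relationships(customer_id)": """
        SELECT o.customer_id FROM orders_silver AS o
        LEFT JOIN dim_customer AS c ON o.customer_id = c.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL""",
}

# A test passes when its query returns no rows.
failures = {name: conn.execute(sql).fetchall() for name, sql in TESTS.items()}
status = {name: not rows for name, rows in failures.items()}
```

With the sample data, the duplicate `o1` fails the unique test and the orphan `c9` fails the relationships test, while the other two pass.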
See also: dbt quality tests: a deep dive
Data Contracts (ODCS)
# Open Data Contract Standard (ODCS)
# Simplified illustration; field names do not follow the spec exactly
dataContract:
  name: orders
  version: 2.0.0
  owner: checkout-team
  schema:
    - name: order_id
      type: string
      required: true
      unique: true
    - name: amount
      type: decimal
      required: true
      constraints:
        minimum: 0
    - name: created_at
      type: timestamp
      required: true
  sla:
    freshness: 1h
    completeness: 99.5%
    availability: 99.9%
  consumers:
    - team: analytics
      usage: dashboard
    - team: ml-team
      usage: feature-engineering
TIP
Data Contracts = an API for data. The producer guarantees the schema and the SLA; the consumer knows what to expect. Changes ship through versioning, not as breaking changes.
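One way to enforce "versioning, not breaking changes" is to diff two versions of a contract's schema before publishing. The sketch below is a minimal illustration under stated assumptions: the helper names `breaking_changes` and `next_version` are hypothetical, the field dictionaries mirror the simplified contract above, and semantic versioning (major bump for breaking changes) is an assumed policy.

```python
def breaking_changes(old_fields, new_fields):
    """List changes that could break consumers of the old schema."""
    new = {f["name"]: f for f in new_fields}
    problems = []
    for field in old_fields:
        name = field["name"]
        if name not in new:
            problems.append(f"removed: {name}")
        elif new[name]["type"] != field["type"]:
            problems.append(f"type changed: {name}")
        elif field.get("required") and not new[name].get("required"):
            # Consumers may now see nulls where they were guaranteed values.
            problems.append(f"no longer required: {name}")
    return problems

def next_version(version, old_fields, new_fields):
    """Assumed policy: breaking change -> major bump, otherwise minor bump."""
    major, minor, _patch = (int(p) for p in version.split("."))
    if breaking_changes(old_fields, new_fields):
        return f"{major + 1}.0.0"
    return f"{major}.{minor + 1}.0"

v2 = [
    {"name": "order_id", "type": "string", "required": True},
    {"name": "amount", "type": "decimal", "required": True},
    {"name": "created_at", "type": "timestamp", "required": True},
]
# Adding an optional field is backward compatible for consumers.
additive = v2 + [{"name": "currency", "type": "string", "required": False}]
# Dropping a field is not.
dropped = [f for f in v2 if f["name"] != "created_at"]

print(next_version("2.0.0", v2, additive))  # 2.1.0
print(next_version("2.0.0", v2, dropped))   # 3.0.0
```

Running a check like this in the producer's CI turns the contract from documentation into a gate: an incompatible schema change cannot ship without a new major version.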
Observability: Monitoring Pipeline Health
What to monitor:
1. Freshness: MAX(event_time) per table, alert if stale
2. Volume: row count per partition, alert on anomaly
3. Schema: column count, types — alert on drift
4. Quality: % rows passing checks, alert below threshold
5. Latency: end-to-end pipeline duration
6. Cost: compute + storage per pipeline
Alert thresholds:
Freshness: table not updated in 2x SLA
Volume: ±30% from 7-day moving average
Quality: below 99% pass rate on critical checks
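The three thresholds above can be expressed as one small function. This is a sketch under stated assumptions: `evaluate_thresholds` and its parameters are hypothetical names, and the metric values are passed in directly rather than read from a real metadata store.

```python
from datetime import datetime, timedelta
from statistics import mean

def evaluate_thresholds(last_update, now, sla, daily_counts, today_count,
                        passed, total):
    """Return the names of the alerts that fire for one table."""
    alerts = []
    # Freshness: table not updated within 2x the freshness SLA.
    if now - last_update > 2 * sla:
        alerts.append("freshness")
    # Volume: more than ±30% away from the 7-day moving average.
    baseline = mean(daily_counts[-7:])
    if abs(today_count - baseline) > 0.30 * baseline:
        alerts.append("volume")
    # Quality: pass rate on critical checks below 99%.
    if total and passed / total < 0.99:
        alerts.append("quality")
    return alerts

now = datetime(2024, 1, 1, 12, 0)
alerts = evaluate_thresholds(
    last_update=now - timedelta(hours=3),  # 3h since last update, SLA is 1h
    now=now,
    sla=timedelta(hours=1),
    daily_counts=[100_000] * 7,
    today_count=65_000,                    # 35% below the 7-day average
    passed=995, total=1000,                # 99.5% pass rate
)
print(alerts)  # ['freshness', 'volume']
```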
Anomaly Detection
Statistical approaches:
Z-score: flag if the metric is more than 3σ from the mean in either direction (|z| > 3)
Moving average: compare to 7d/30d rolling average
Seasonal decomposition: account for day-of-week patterns
Example:
Daily order count: avg=100K, σ=10K
Today: 45K orders
Z-score: (45K - 100K) / 10K = -5.5 → ALERT
But: is it a holiday? Check seasonal pattern first
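The example above can be reproduced in a few lines, together with the seasonal caveat. This is a minimal sketch: `z_score` and `is_anomaly` are hypothetical helpers, and the two history lists are invented to contrast an ordinary-day baseline with a baseline built from comparable low-traffic days.

```python
from statistics import mean, stdev

def z_score(value, mu, sigma):
    """Number of standard deviations between value and the mean."""
    return (value - mu) / sigma

def is_anomaly(value, history, threshold=3.0):
    """Flag value if it is more than `threshold` sample std devs from the mean of history."""
    return abs(z_score(value, mean(history), stdev(history))) > threshold

# Worked example from the text: avg=100K, σ=10K, today=45K.
global_z = z_score(45_000, 100_000, 10_000)  # -5.5

# Hypothetical baselines: ordinary days vs. comparable low-traffic days.
ordinary_days = [99_000, 101_000, 98_000, 102_000, 100_000]
comparable_days = [44_000, 46_000, 44_000, 46_000]

naive_alert = is_anomaly(45_000, ordinary_days)       # True
seasonal_alert = is_anomaly(45_000, comparable_days)  # False
```

The same value is an outlier against one baseline and perfectly normal against the other, which is why the choice of comparison window matters as much as the threshold.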
WARNING
Anti-pattern: alert fatigue. Do not put an alert on every metric. Use tiered alerting instead: P1 (page on-call) = freshness SLA breach, P2 (Slack) = volume anomaly, P3 (ticket) = schema drift.
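The tiers can be kept as a small routing table so that severity is decided in one place. A minimal sketch: the alert type keys and the `route` helper are hypothetical, and sending unknown alert types to the lowest tier is an assumption chosen so that a new check never pages anyone by default.

```python
# Alert type -> (priority, channel), following the tiers described above.
ROUTES = {
    "freshness_sla_breach": ("P1", "page on-call"),
    "volume_anomaly": ("P2", "Slack"),
    "schema_drift": ("P3", "ticket"),
}

def route(alert_type):
    """Return (priority, channel); unknown types fall back to the lowest tier."""
    return ROUTES.get(alert_type, ("P3", "ticket"))

print(route("freshness_sla_breach"))  # ('P1', 'page on-call')
print(route("row_count_warning"))     # ('P3', 'ticket')
```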