Governance, Cataloging & Lineage
Governance: architecture, not bureaucracy
Data governance is the system of rules and tools that answers these questions: what data exists, where it comes from, who has access to it, where the PII lives, and how to delete a user's data in response to a GDPR request.
What is data governance: the foundation
DMBOK and frameworks: governance theory
Without governance:
"Where is customer email?" → grep across 200 tables
"Where do these dashboard numbers come from?" → nobody knows
"GDPR deletion request" → 2 weeks of manual work
With governance:
Catalog → find it in seconds
Lineage → trace from source to dashboard
PII tagging → automatic masking
GDPR → automated deletion workflow
Data Catalog: OpenMetadata vs DataHub
| Feature | OpenMetadata | DataHub |
|---|---|---|
| Architecture | Monolith (Java + React) | Microservices (Java + React, 5+ services) |
| Metadata Store | MySQL/PostgreSQL | Graph DB (Neo4j/RDBMS) |
| Lineage | SQL-parsed + API | SQL-parsed + API, graph-based (Neo4j) |
| Auto-discovery | Ingestion connectors | Ingestion recipes |
| Data Quality | Built-in (Great Expectations-based) | Via integrations |
| Collaboration | Conversations, tasks | Discussions |
| Deployment | Simpler | More complex |
| Community | Growing fast | Large, LinkedIn-backed |
How to choose:
Small/medium team → OpenMetadata (simpler to deploy)
Enterprise with complex lineage → DataHub (graph-based)
Cloud-first → Managed solutions (Atlan, Select Star)
Catalog fundamentals: theory and implementations
Data lineage: a deep dive
Lineage Tracking
Column-level lineage example:
bronze.raw_orders.amount
↓ (CAST to DECIMAL)
silver.orders.amount
↓ (SUM GROUP BY customer)
gold.customer_revenue.total_revenue
↓ (JOIN + format)
dashboard.top_customers.revenue_column
Question: "Why is the revenue in the dashboard wrong?"
Answer: trace the lineage backwards → find the bug in the CAST
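The backward trace in the example above can be sketched as a walk over a lineage graph. This is a minimal illustration, not how any particular catalog stores lineage: the `LINEAGE` dictionary and the `trace_upstream` helper are hypothetical names, and the edges simply mirror the column chain shown above.

```python
from collections import deque

# Hypothetical lineage store: downstream column -> list of (upstream column, transformation).
LINEAGE = {
    "dashboard.top_customers.revenue_column": [
        ("gold.customer_revenue.total_revenue", "JOIN + format"),
    ],
    "gold.customer_revenue.total_revenue": [
        ("silver.orders.amount", "SUM GROUP BY customer"),
    ],
    "silver.orders.amount": [
        ("bronze.raw_orders.amount", "CAST to DECIMAL"),
    ],
}


def trace_upstream(column, lineage=LINEAGE):
    """Breadth-first walk from a column back to its sources.

    Returns a list of (downstream, transformation, upstream) hops,
    nearest hop first. The `seen` set guards against cycles.
    """
    hops = []
    queue = deque([column])
    seen = {column}
    while queue:
        current = queue.popleft()
        for parent, transform in lineage.get(current, []):
            hops.append((current, transform, parent))
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return hops


hops = trace_upstream("dashboard.top_customers.revenue_column")
for child, transform, parent in hops:
    print(f"{child} <- [{transform}] <- {parent}")
```

Each printed hop names one transformation to inspect, so the suspect CAST shows up as the last hop in the chain.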
Lineage Sources
Automatic lineage extraction:
1. SQL parsing: parse dbt/Spark SQL → extract table/column refs
2. Orchestrator: Airflow/Dagster DAG structure → task dependencies
3. Runtime: Spark query plan → actual data flow
4. BI tools: Tableau/Looker → which tables feed dashboards
Manual lineage:
5. API calls: register custom lineage for Python/API pipelines
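To make the SQL-parsing source (item 1) concrete, here is a deliberately naive sketch that pulls table-level edges out of a single statement with regular expressions. It is an assumption-laden illustration: real catalogs rely on full SQL parsers, and this version will mistake CTE names for tables and ignore subqueries. The function name `extract_table_lineage` and the sample query are hypothetical.

```python
import re

# Matches the table being written by INSERT INTO / CREATE [OR REPLACE] TABLE.
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+INTO|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE)\s+([\w.]+)",
    re.IGNORECASE,
)
# Matches tables being read after FROM or JOIN.
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(sql):
    """Return {'target': table or None, 'sources': sorted list of tables read}."""
    target = TARGET_RE.search(sql)
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return {"target": target.group(1) if target else None, "sources": sources}


sql = """
INSERT INTO gold.customer_revenue
SELECT o.customer_id, SUM(o.amount) AS total_revenue
FROM silver.orders AS o
JOIN silver.customers AS c ON c.customer_id = o.customer_id
GROUP BY o.customer_id
"""

edges = extract_table_lineage(sql)
print(edges)
```

The resulting target/sources pair is the kind of edge that would then be registered in the catalog, alongside edges from the orchestrator, runtime, and BI sources.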
Access Control
Role-based access (RBAC):
Roles:
data_engineer: read/write all layers
analyst: read Silver + Gold
ml_engineer: read Silver + write feature tables
business_user: read Gold only
Column-level:
analyst: SELECT * EXCEPT (email, phone) FROM customers
data_engineer: SELECT * FROM customers
Row-level:
region_manager_eu: WHERE region = 'EU'
region_manager_us: WHERE region = 'US'
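The three levels above (role, column, row) can be combined in one policy lookup. The sketch below is only an illustration of that idea: `Policy`, `POLICIES`, and `build_query` are hypothetical names, only read access is modeled, and the layer assigned to `region_manager_eu` is an assumption. In practice these rules would usually be enforced by the warehouse itself (grants, views, row access policies) rather than by building SQL strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    layers: frozenset                        # layers the role may read
    hidden_columns: frozenset = frozenset()  # column-level restriction
    row_filter: Optional[str] = None         # row-level restriction


# Read policies mirroring the roles listed above.
POLICIES = {
    "data_engineer": Policy(frozenset({"bronze", "silver", "gold"})),
    "analyst": Policy(frozenset({"silver", "gold"}), frozenset({"email", "phone"})),
    "ml_engineer": Policy(frozenset({"silver"})),
    "business_user": Policy(frozenset({"gold"})),
    # Assumption: regional managers read Gold only.
    "region_manager_eu": Policy(frozenset({"gold"}), row_filter="region = 'EU'"),
}


def build_query(role, layer, table, columns):
    """Build a SELECT that respects the role's layer, column, and row rules.

    Identifiers are assumed to come from trusted catalog metadata,
    never from user input.
    """
    policy = POLICIES[role]
    if layer not in policy.layers:
        raise PermissionError(f"{role} may not read {layer}")
    visible = [c for c in columns if c not in policy.hidden_columns]
    sql = f"SELECT {', '.join(visible)} FROM {layer}.{table}"
    if policy.row_filter:
        sql += f" WHERE {policy.row_filter}"
    return sql


customer_cols = ["customer_id", "email", "phone", "region"]
analyst_sql = build_query("analyst", "silver", "customers", customer_cols)
eu_sql = build_query(
    "region_manager_eu", "gold", "customer_revenue",
    ["customer_id", "region", "total_revenue"],
)
print(analyst_sql)
print(eu_sql)
```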
RBAC implementation: a practical walkthrough
ABAC and Policy-as-Code (OPA)
PII Handling
PII Classification:
Direct PII: email, phone, SSN, passport → MASK or ENCRYPT
Quasi PII: zip_code, birth_date, IP → GENERALIZE
Sensitive: salary, health_data → RESTRICT access
Masking strategies:
Hashing: SHA256(email) → deterministic, joinable
Tokenization: email → token_abc123 (reversible via vault)
Redaction: email → ***@***.com
Generalization: age=34 → age_group="30-40"
Where to mask:
Bronze: raw (encrypted at rest)
Silver: PII masked/hashed
Gold: no PII (aggregated)
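The four masking strategies listed above can be sketched with the standard library alone. All function and class names here are hypothetical. The `salt` parameter is an added assumption: an unsalted hash of an email can be reversed with a dictionary of known addresses, so some secret input is usually mixed in. `TokenVault` is an in-memory stand-in for what would be a secured vault service.

```python
import hashlib
import secrets


def hash_email(email, salt=""):
    """Hashing: deterministic, so the same email always maps to the same
    digest and masked tables remain joinable."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


class TokenVault:
    """Tokenization: reversible, but only for callers with vault access."""

    def __init__(self):
        self._by_value = {}
        self._by_token = {}

    def tokenize(self, value):
        if value not in self._by_value:
            token = "token_" + secrets.token_hex(8)
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token):
        return self._by_token[token]


def redact_email(email):
    """Redaction: keep only the top-level domain."""
    tld = email.rsplit(".", 1)[-1]
    return f"***@***.{tld}"


def generalize_age(age, width=10):
    """Generalization: replace an exact age with its bucket [low, low + width)."""
    low = (age // width) * width
    return f"{low}-{low + width}"


vault = TokenVault()
token = vault.tokenize("alice@example.com")
print(hash_email("alice@example.com", salt="demo-salt"))
print(token)
print(redact_email("alice@example.com"))
print(generalize_age(34))
```

Hashing and tokenization keep a stable key for joins in Silver, while redaction and generalization discard information and cannot be reversed.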
GDPR Compliance Architecture
Right to be Forgotten workflow:
1. User submits deletion request
2. Catalog lookup: find all tables with user's PII
(lineage → which tables contain customer_id = 12345)
3. Delete/anonymize in each table:
Bronze: mark as deleted (soft delete, retain for audit)
Silver: DELETE WHERE customer_id = 12345
Gold: re-aggregate without user's data
4. Audit log: record deletion for compliance proof
5. Confirm to user within one month of the request (GDPR Art. 12(3), commonly tracked as 30 days)
Technical: Delta Lake/Iceberg support DELETE operations
Without table format: full partition rewrite to remove rows
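Steps 2 to 4 of the workflow above can be sketched end to end against an in-memory SQLite database. This is a toy model under stated assumptions: the table names, the `PII_TABLES` list (standing in for the catalog lookup), the `is_deleted` flag, and the `deletion_audit` table are all hypothetical, and SQLite stands in for a table format such as Delta Lake or Iceberg.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical result of the catalog lookup: tables holding the user's PII
# and the operation to apply in each layer.
PII_TABLES = [
    ("bronze_raw_orders", "soft_delete"),
    ("silver_orders", "hard_delete"),
]


def forget_customer(conn, customer_id):
    """Apply the per-layer deletion and write one audit row per table."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # single transaction: commit everything or roll back
        for table, operation in PII_TABLES:
            if operation == "soft_delete":
                cur = conn.execute(
                    f"UPDATE {table} SET is_deleted = 1 WHERE customer_id = ?",
                    (customer_id,),
                )
            else:
                cur = conn.execute(
                    f"DELETE FROM {table} WHERE customer_id = ?", (customer_id,)
                )
            conn.execute(
                "INSERT INTO deletion_audit VALUES (?, ?, ?, ?, ?)",
                (customer_id, table, operation, cur.rowcount, now),
            )
        # Gold: re-aggregate from what remains in Silver.
        conn.execute("DELETE FROM gold_customer_revenue")
        conn.execute(
            "INSERT INTO gold_customer_revenue "
            "SELECT customer_id, SUM(amount) FROM silver_orders GROUP BY customer_id"
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bronze_raw_orders (customer_id INTEGER, amount REAL, is_deleted INTEGER DEFAULT 0);
CREATE TABLE silver_orders (customer_id INTEGER, amount REAL);
CREATE TABLE gold_customer_revenue (customer_id INTEGER, total_revenue REAL);
CREATE TABLE deletion_audit (customer_id INTEGER, table_name TEXT, operation TEXT,
                             rows_affected INTEGER, deleted_at TEXT);
INSERT INTO bronze_raw_orders (customer_id, amount) VALUES (12345, 10.0), (777, 5.0);
INSERT INTO silver_orders VALUES (12345, 10.0), (777, 5.0);
INSERT INTO gold_customer_revenue VALUES (12345, 10.0), (777, 5.0);
""")

forget_customer(conn, 12345)
print(conn.execute("SELECT * FROM gold_customer_revenue").fetchall())
print(conn.execute("SELECT table_name, operation, rows_affected FROM deletion_audit").fetchall())
```

The audit rows record what was touched without storing any of the deleted values, which is what makes them usable as compliance proof.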
WARNING
PII in the Bronze layer. Bronze stores raw data, including PII. Protect it with encryption at rest, strict access control, and audit logging. Do not mask Bronze: it is needed for replay and debugging.
NOTE
Cross-reference: Data Governance course. This lesson covers governance from a System Design perspective. For a deeper dive (policy frameworks, data stewardship, metadata management), see the Data Governance course on this platform.