Secrets best practices — production checklist
The previous lessons covered the mechanics: Fernet, Vault, cloud backends. This lesson is about operational excellence: how to organize secrets management in production so that you don't end up with a 3 a.m. incident. It collects patterns from real deployments: internal documents from Astronomer, Shopify, Lyft, and Airbnb, plus community wisdom.
Cardinal rule: never use the UI for production secrets
The most important rule, and the one broken most often:
DO NOT use Airflow UI to add real production secrets.
Why the UI is an anti-pattern for production:
- No version history: the UI overwrites silently. There is no diff and no audit of who changed what and when.
- No code review: the secret lands in the DB without a PR or approval.
- Manual = error-prone: copy-paste from chat, typos in credentials, accidentally committed screenshots.
- No env separation: it is easy to accidentally put a prod secret into dev Airflow.
- Fernet is a single point of failure: if the key is lost, the passwords are unrecoverable (see lesson 03).
OK use cases for the UI:
- Local development connections (localhost:5432 to a local Postgres)
- Isolation testing: a temporary connection for debugging
- Emergency override (with an audit trail post-incident)
Production path: Secrets Backend (Vault / AWS SM / GCP SM / Azure KV) + Infrastructure-as-Code (Terraform / Pulumi):
# terraform/airflow_secrets.tf
resource "vault_kv_secret_v2" "pg_prod_warehouse" {
  mount = "airflow"
  name  = "connections/pg_prod_warehouse"
  data_json = jsonencode({
    conn_type = "postgres"
    login     = var.pg_prod_login
    password  = var.pg_prod_password # from a CI/CD secret
    host      = aws_rds_cluster.warehouse.endpoint
    port      = 5432
    schema    = "warehouse"
  })
}
Git history = audit trail. PR review = approval. Terraform plan = deterministic apply.
Separate paths / namespaces per environment
Vault policy example for prod:
# policies/airflow-prod-read.hcl
path "airflow/data/prod/connections/*" {
  capabilities = ["read"]
}
path "airflow/data/prod/variables/*" {
  capabilities = ["read"]
}
# No create/update/delete capabilities are granted, and Vault policies
# are deny-by-default, so writes are already refused.
Each Airflow cluster (dev/staging/prod) uses its own Vault role with the corresponding policy.
In airflow.cfg for each env:
# dev cluster
backend_kwargs = '{..., "kubernetes_role": "airflow-dev", "variables_path": "dev/variables"}'
# prod cluster
backend_kwargs = '{..., "kubernetes_role": "airflow-prod", "variables_path": "prod/variables"}'
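To keep the per-environment settings from drifting apart, the backend_kwargs JSON can be generated instead of hand-edited. The following is a minimal sketch, not a definitive implementation: `build_backend_kwargs` is a hypothetical helper, the `connections_path` key is an assumption that mirrors the `variables_path` convention shown above, and the settings elided with `...` in the config are deliberately not reconstructed.

```python
import json

ENVIRONMENTS = ("dev", "staging", "prod")


def build_backend_kwargs(env: str) -> str:
    """Return a JSON string for [secrets] backend_kwargs scoped to one environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env!r}")
    kwargs = {
        "kubernetes_role": f"airflow-{env}",
        "variables_path": f"{env}/variables",
        # Assumption: connections follow the same per-environment prefix.
        "connections_path": f"{env}/connections",
    }
    return json.dumps(kwargs, sort_keys=True)


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(f"# {env} cluster")
        print(f"backend_kwargs = {build_backend_kwargs(env)}")
```

Deriving every value from the single `env` argument makes it hard for a prod cluster to end up with a dev role or path by copy-paste.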
Secret rotation and external secret managers on K8s
Rotation cadence
Industry standard rotation policies:
| Secret type | Cadence | Method |
|---|---|---|
| Fernet key | Quarterly (90 days) | airflow rotate-fernet-key + multi-key window |
| DB passwords (static) | Quarterly | Manual via DBA + update Vault |
| DB passwords (Vault dynamic) | Per-task (1h TTL) | Automatic via Vault database/ engine |
| API tokens (third-party) | Annually | Manual rotation in the provider portal + update Vault |
| Cloud IAM access keys | NEVER use long-lived; use IRSA/WI/MI | N/A |
| Internal service tokens | 30 days | Auto-rotation via Vault token renewal |
Critical: rotation must be planned and tested, not reactive. Test the rotation in staging quarterly:
# Quarterly rotation playbook (staging dry-run)
./scripts/rotate-fernet.sh staging
./scripts/rotate-db-creds.sh staging
./scripts/verify-airflow-health.sh staging
If the staging rotation completes without an incident, proceed with the production rotation. If it fails, fix the procedure before attempting production.
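The cadence table above can also be turned into an automated check that flags secrets whose rotation is overdue. This is a sketch under stated assumptions: the `SecretRecord` inventory format and the type names are hypothetical, "Quarterly" is taken as 90 days and "Annually" as 365 days, and the last-rotated dates would have to come from your own inventory or backend metadata.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Maximum age per secret type, following the cadence table
# (assumption: Quarterly = 90 days, Annually = 365 days).
MAX_AGE = {
    "fernet_key": timedelta(days=90),
    "db_password_static": timedelta(days=90),
    "api_token": timedelta(days=365),
    "internal_service_token": timedelta(days=30),
}


@dataclass(frozen=True)
class SecretRecord:
    name: str
    secret_type: str
    last_rotated: date


def overdue(records, today):
    """Return the names of secrets older than the cadence for their type."""
    return [
        r.name
        for r in records
        if today - r.last_rotated > MAX_AGE[r.secret_type]
    ]


if __name__ == "__main__":
    inventory = [
        SecretRecord("airflow-fernet", "fernet_key", date(2024, 1, 1)),
        SecretRecord("pg_prod_warehouse", "db_password_static", date(2024, 3, 1)),
    ]
    print(overdue(inventory, today=date(2024, 4, 15)))
```

Running a check like this on a schedule turns a missed rotation into an alert rather than a surprise during an incident.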
Compromised secret — emergency procedure
Scenario: an AWS access key leaked in a public GitHub commit (a classic).
Don't wait for perfect: fast revocation comes before complete forensics. A compromised key revoked within 15 minutes means bounded damage; one revoked after 24 hours means potentially massive impact.
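The "revoke first" step can be scripted ahead of time so that nobody has to improvise during the incident. The sketch below is one possible shape, not a complete runbook: it calls the IAM `update_access_key` operation with `Status="Inactive"` on whatever client is passed in. The user name, key ID, and the `RecordingIamClient` stand-in are hypothetical; in a real incident the client would be `boto3.client("iam")`.

```python
from datetime import datetime, timezone


def revoke_access_key(iam_client, user_name, access_key_id):
    """Deactivate a leaked key first; deletion and forensics can follow later."""
    iam_client.update_access_key(
        UserName=user_name,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )
    # Return a small record for the incident timeline.
    return {
        "action": "access_key_deactivated",
        "user": user_name,
        "access_key_id": access_key_id,
        "at": datetime.now(timezone.utc).isoformat(),
    }


class RecordingIamClient:
    """Stand-in for boto3.client("iam") so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def update_access_key(self, **kwargs):
        self.calls.append(kwargs)


if __name__ == "__main__":
    client = RecordingIamClient()  # real incident: boto3.client("iam")
    print(revoke_access_key(client, "airflow-legacy", "AKIAEXAMPLEKEYID"))
```

Deactivating rather than deleting keeps the key ID available for the later forensic pass while still cutting off access immediately.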
Audit logging
Airflow itself has limited audit capabilities for secrets. For production you need:
Backend-level audit (Vault, AWS, GCP, Azure):
- Vault: enable an audit device and log every read/write: vault audit enable file file_path=/var/log/vault/audit.log
- AWS Secrets Manager: CloudTrail logs every GetSecretValue. Set up a CloudWatch alarm on anomalies.
- GCP Secret Manager: Cloud Audit Logs (data access logs must be enabled explicitly).
- Azure Key Vault: diagnostic logs to Log Analytics.
Application-level audit via the Listener API (Module 12):
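# plugins/secret_audit_listener.py
import logging
from datetime import datetime, timezone

from airflow.listeners import hookimpl

audit_log = logging.getLogger("secret_audit")

def send_to_audit_log(event):
    # Placeholder sink (hypothetical): replace with your audit pipeline
    audit_log.info("%s", event)

class SecretAuditListener:
    @hookimpl
    def on_task_instance_running(self, previous_state, task_instance, session):
        # Could also log which connections/variables were loaded
        send_to_audit_log({
            "event": "task_started",
            "task_id": task_instance.task_id,
            "dag_id": task_instance.dag_id,
            "executor_user": "airflow",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })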
This gives the application-side perspective: who/what/when ran with credentials. Combined with the backend audit, it gives the full picture.
Monitoring failed secret lookups
Failed secret lookup = Hook initialization failure = task failure. This is already visible in Airflow, but timing matters: it is better to catch it before a DAG run is blocked.
Metrics for monitoring:
# Via Airflow metrics export
# /metrics: Prometheus endpoint for Airflow 2.x (metric names depend on the exporter in use)
airflow_dag_run_failed_total{dag_id="my_dag"} 5
airflow_task_instance_failed_total{task_id="extract", dag_id="my_dag"} 3
These are generic: a task can fail for any reason. For secret-specific monitoring:
Vault metrics:
- vault_secret_kv_count: total secrets
- vault_audit_log_request: read rate, can be filtered by path
- vault_token_create_count: auth rate (an increase means either new pods or a token leak)
AWS Secrets Manager:
- CloudWatch metric GetSecretValueRequestCount: rate per secret
- CloudWatch alarm on the 4XXError rate (access denied = misconfig)
Custom alerting:
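# In a Vault wrapper backend
import time

from airflow.providers.hashicorp.secrets.vault import VaultBackend

# `statsd` is assumed to be a tag-capable StatsD client configured elsewhere

class MonitoredVaultBackend(VaultBackend):
    def get_connection(self, conn_id):
        start = time.monotonic()
        try:
            result = super().get_connection(conn_id)
            # Lookup duration in seconds
            statsd.timing("secret.lookup.duration", time.monotonic() - start, tags=[f"conn_id:{conn_id}"])
            return result
        except Exception as e:
            # Count failures by connection and exception type, then re-raise
            statsd.incr("secret.lookup.failure", tags=[f"conn_id:{conn_id}", f"error:{type(e).__name__}"])
            raise
This feeds into Datadog / Grafana / CloudWatch: alert on an error spike or on latency growth.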
CI/CD secret pipeline
Production-grade secret deployment workflow:
Developer/Owner → PR with a Terraform change
↓
CI runs: terraform plan, security scan (gitleaks, snyk)
↓
Required reviewers approve (2+ for prod)
↓
Merge → CI/CD pipeline runs terraform apply
↓
Pipeline uses sealed credentials (CI secret store)
to talk to the Vault/AWS/GCP API
↓
Apply succeeds → the secret is now in Vault
↓
Airflow picks it up when the cache expires (or on a manual rollout)
Important details:
- CI runner credentials ≠ secret value. The runner has permission to write to the Vault path but NOT to read it (separation of duties).
- Secret value source: either a developer enters it into a protected (sealed) CI variable, or it is generated by the Terraform random_password resource, or it is retrieved through the provider's own API (for example AWS IAM CreateUser).
- Audit: every PR → diff in git → known author + reviewers.
Anti-patterns to avoid:
- A plain-text secret committed in a Terraform .tfvars file: rotate it IMMEDIATELY.
- Manual aws secretsmanager put-secret-value through the CLI on prod: bypasses review and leaves no git history.
- Sharing access via Slack / email: not audited, not reviewable.
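The first anti-pattern can be caught mechanically before a commit lands. The sketch below is a deliberately simple illustration and not a substitute for a dedicated scanner such as gitleaks: it flags string literals assigned to keys whose names look secret-like. The list of key-name fragments and the `find_plaintext_secrets` helper are assumptions to adapt to local naming.

```python
import re

# Flags assignments such as  db_password = "hunter2"  in .tfvars content.
# Assumption: secret-like keys contain one of these fragments.
SUSPICIOUS = re.compile(
    r'^\s*(\w*(?:password|secret|token|api_key)\w*)\s*=\s*"([^"]+)"',
    re.IGNORECASE,
)


def find_plaintext_secrets(tfvars_text):
    """Return (line_number, key) pairs for literals assigned to secret-like keys."""
    findings = []
    for lineno, line in enumerate(tfvars_text.splitlines(), start=1):
        match = SUSPICIOUS.match(line)
        if match:
            findings.append((lineno, match.group(1)))
    return findings


if __name__ == "__main__":
    sample = 'region = "eu-west-1"\npg_prod_password = "example-only"\n'
    for lineno, key in find_plaintext_secrets(sample):
        print(f"line {lineno}: {key} looks like a plain-text secret")
```

The function reports only the key name and line number, never the value, so its output is safe to print in CI logs.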
DAG-side hygiene
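Production DAGs should follow this pattern:
from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable

# anti-pattern
PG_PASSWORD = Variable.get("pg_password")  # top-level: runs on every DAG parse

# pattern
@task
def my_task():
    # OK: lookup inside the task, executed only at run time
    conn = BaseHook.get_connection("pg_prod_warehouse")
    # Hook handles password decryption internally
    ...
Linting: enforce this via a flake8 plugin or a custom AST checker: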
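# scripts/check_dags.py
import ast

FORBIDDEN = {"Variable.get", "BaseHook.get_connection"}

def has_call_to(node, names):
    # True if the expression contains a call to any dotted name in `names`
    return any(
        isinstance(n, ast.Call) and ast.unparse(n.func) in names
        for n in ast.walk(node)
    )

def check_top_level_secrets(tree):
    # `tree` is the ast.Module from ast.parse(); only module-level assignments are checked
    for child in tree.body:
        if isinstance(child, ast.Assign) and has_call_to(child.value, FORBIDDEN):
            raise ValueError(f"Top-level secret lookup: line {child.lineno}")
This scanner runs in a pre-commit hook and in CI, and fails the PR if anything is found.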
DR procedure: Vault cluster lost
Worst-case: Vault primary cluster destroyed (region outage), no replica online.
Recovery options (in priority order):
- Failover to a Vault DR cluster: if you have a Vault DR replication secondary in another region, promote it. RPO ~ seconds, RTO ~ minutes. Airflow keeps working with a brief blip.
- Restore from a Vault snapshot: HCV daily snapshots stored in S3 / GCS. Restore: provision a new Vault cluster, then run vault operator raft snapshot restore snapshot.snap. RPO = up to 24h, RTO = hours.
- Failover along a Secrets Backend chain: if you have configured backend = vault,aws_secrets_manager, the AWS backup automatically becomes primary, provided the secrets were replicated to AWS in advance. Note that stock Airflow accepts a single class in [secrets] backend, so a chain like this requires a custom wrapper backend.
- DB fallback (emergency): if the ENV or DB stage of the lookup chain holds working connections, the system continues in degraded mode. Strategy: keep a minimal critical subset of connections in the DB as an emergency fallback.
- Full manual recovery: re-create the connections in a new Vault from external sources (the DBA team manually provides DB passwords, the cloud team re-issues IAM keys, etc.). Hours to days; requires coordination.
Lesson: Vault — critical infrastructure. Treat it like DB: backups, monitoring, DR plan, regular DR tests (quarterly chaos exercise).
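The backend-chain option can be sketched as a small wrapper that tries each backend in order. Airflow's [secrets] backend setting takes one class, so the chain has to live inside a custom backend. The version below is a plain-Python illustration under stated assumptions: the method name `get_conn_value` mirrors Airflow's `BaseSecretsBackend`, but the class does not subclass it, and `DownBackend` and `StaticBackend` are hypothetical stand-ins for Vault and a replicated backup.

```python
import logging

log = logging.getLogger(__name__)


class FallbackSecretsChain:
    """Try each backend in order and return the first non-None value."""

    def __init__(self, backends):
        self.backends = list(backends)

    def get_conn_value(self, conn_id):
        for backend in self.backends:
            try:
                value = backend.get_conn_value(conn_id)
            except Exception as exc:
                # An outage in one backend must not block the fallback.
                log.warning("%s failed for %s: %s", type(backend).__name__, conn_id, exc)
                continue
            if value is not None:
                return value
        return None


class DownBackend:
    """Simulates an unreachable primary."""

    def get_conn_value(self, conn_id):
        raise ConnectionError("primary unreachable")


class StaticBackend:
    """Simulates a backup that holds replicated secrets."""

    def __init__(self, values):
        self.values = values

    def get_conn_value(self, conn_id):
        return self.values.get(conn_id)


if __name__ == "__main__":
    chain = FallbackSecretsChain([
        DownBackend(),
        StaticBackend({"pg_prod_warehouse": "postgres://replicated-copy"}),
    ])
    print(chain.get_conn_value("pg_prod_warehouse"))
```

The fallback only helps if the backup already holds current copies of the secrets, so replication belongs in the same pipeline that writes to the primary.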
Production checklist (final)
- Secrets Backend configured, no real secrets in the metadata DB
- Separate paths/namespaces per environment (dev/staging/prod)
- IaC (Terraform / Pulumi) for secret management, no manual UI changes
- [secrets] use_cache = True, cache_ttl_seconds = 600 (or an appropriate value)
- connections_lookup_pattern to restrict backend lookups to the prod prefix
- IRSA / Workload Identity / Managed Identity (no long-lived cloud creds)
- Fernet key managed through KMS with deletion protection
- Audit logging enabled in the backend (Vault audit / CloudTrail / Cloud Audit / KV diagnostic)
- Monitoring + alerting on failed secret lookups
- Rotation playbook documented and tested quarterly in staging
- CI/CD pipeline for secret deployment with required reviews
- Pre-commit hooks / CI checks for top-level Variable.get / BaseHook.get_connection
- DR plan for a Vault outage, tested quarterly
- Incident response runbook for a compromised secret (target: < 15 min revoke)