Lesson 11.07 · 24 min
Advanced
Production · Rotation · Audit · Monitoring · Disaster Recovery

Secrets best practices — production checklist

Previous lessons covered the mechanics: Fernet, Vault, and cloud backends. This lesson is about operational excellence: how to organize secrets management in production so that you do not end up with a 3 a.m. incident. It collects patterns from real deployments: internal documents from Astronomer, Shopify, Lyft, and Airbnb, plus community wisdom.


Cardinal rule: never use the UI for production secrets

The most important rule, and the one broken most often:

DO NOT use Airflow UI to add real production secrets.

Why the UI is an anti-pattern for production:

  1. No version history: the UI overwrites silently. There is no diff and no audit of who changed what and when.
  2. No code review: the secret lands in the DB without a PR or approval.
  3. Manual means error-prone: copy-paste from chat, typos in credentials, accidentally committed screenshots.
  4. No env separation: it is easy to accidentally put a prod secret into dev Airflow.
  5. Fernet is a single point of failure: if the key is lost, the passwords are unrecoverable (see lesson 03).

OK use cases for the UI:

  • Local development connections (localhost:5432 to a local Postgres)
  • Isolation testing: a temporary connection for debugging
  • Emergency override (with an audit trail written post-incident)

The production path is a Secrets Backend (Vault / AWS SM / GCP SM / Azure KV) plus Infrastructure-as-Code (Terraform / Pulumi):

# terraform/airflow_secrets.tf
resource "vault_kv_secret_v2" "pg_prod_warehouse" {
  mount = "airflow"
  name  = "connections/pg_prod_warehouse"
  data_json = jsonencode({
    conn_type = "postgres"
    login     = var.pg_prod_login
    password  = var.pg_prod_password  # from a CI/CD secret
    host      = aws_rds_cluster.warehouse.endpoint
    port      = 5432
    schema    = "warehouse"
  })
}

Git history = audit trail. PR review = approval. Terraform plan = deterministic apply.


Separate paths / namespaces per environment

Env separation in Vault

  • airflow/ : Top-level mount. Each environment has its own sub-path. Vault policies can restrict access per path, so the dev-team policy grants no read on prod/. This is critical for preventing "we accidentally ran DROP TABLE in prod" incidents.
  • airflow/dev/ : Dev environment. Connections to the dev DB (sandbox Postgres), the dev S3 bucket, test API tokens. Policy: dev team read-write. If dev secrets leak, the impact is contained.
  • airflow/staging/ : A mirror of prod with minimal differences. Realistic but isolated infrastructure. Policy: dev team read-only; a dedicated 'staging-deployer' handles writes via CI/CD.
  • airflow/prod/ : Production. Policy: ONLY the Airflow prod ServiceAccount can read. Writes happen only through an approved Terraform apply in a CI/CD pipeline with required reviews. Audit log on every read/write.

Vault policy example for prod:

# policies/airflow-prod-read.hcl
path "airflow/data/prod/connections/*" {
  capabilities = ["read"]
}
path "airflow/data/prod/variables/*" {
  capabilities = ["read"]
}
# No create/update/delete capability is granted anywhere in this policy.
# Vault policies are deny-by-default, so omitting them is what blocks writes;
# no extra stanza is needed.

Each Airflow cluster (dev/staging/prod) uses its own Vault role with the matching policy.

In each environment's airflow.cfg:

# dev cluster
backend_kwargs = '{..., "kubernetes_role": "airflow-dev", "variables_path": "dev/variables"}'

# prod cluster
backend_kwargs = '{..., "kubernetes_role": "airflow-prod", "variables_path": "prod/variables"}'
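The per-environment settings above differ only in the environment name, so they can be generated rather than hand-edited. The following is a minimal sketch of that idea, not the lesson's own tooling: the Vault URL and mount point are hypothetical placeholders, and connections_path is assumed to follow the same env prefix as variables_path.

```python
import json

# Hypothetical values for illustration; replace with your own.
VAULT_URL = "https://vault.example.internal:8200"
MOUNT_POINT = "airflow"


def backend_kwargs_for(env: str) -> str:
    """Return a JSON string for [secrets] backend_kwargs scoped to one environment."""
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"unknown environment: {env!r}")
    kwargs = {
        "url": VAULT_URL,
        "mount_point": MOUNT_POINT,
        "auth_type": "kubernetes",
        "kubernetes_role": f"airflow-{env}",
        # Assumption: connections live under the same env prefix as variables.
        "connections_path": f"{env}/connections",
        "variables_path": f"{env}/variables",
    }
    return json.dumps(kwargs, sort_keys=True)


if __name__ == "__main__":
    for env in ("dev", "staging", "prod"):
        print(env, backend_kwargs_for(env))
```

Rejecting unknown environment names up front keeps a typo from silently pointing a cluster at a path that no policy covers.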

Secret rotation and external secret managers on K8s

Rotation cadence

Industry standard rotation policies:

Secret type                  | Cadence                                | Method
Fernet key                   | Quarterly (90 days)                    | airflow rotate-fernet-key + multi-key window
DB passwords (static)        | Quarterly                              | Manual via DBA + update Vault
DB passwords (Vault dynamic) | Per-task (1h TTL)                      | Automatic via Vault database/ engine
API tokens (third-party)     | Annually                               | Manual rotate in provider portal + update Vault
Cloud IAM access keys        | NEVER use long-lived; use IRSA/WI/MI   | N/A
Internal service tokens      | 30 days                                | Auto-rotation via Vault token renewal

Critical: rotation must be planned and tested, not reactive. Test rotation in staging quarterly:

# Quarterly rotation playbook (staging dry-run)
./scripts/rotate-fernet.sh staging
./scripts/rotate-db-creds.sh staging
./scripts/verify-airflow-health.sh staging

If the staging rotation completes without an incident, proceed with the production rotation. If it fails, fix the procedure before attempting production.
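A rotation cadence is only useful if something checks it. As a rough sketch, the cadences from the table can be encoded and compared against last-rotation dates to list what is overdue. The inventory format and secret names here are hypothetical; a real version would read rotation timestamps from the backend's metadata.

```python
from datetime import date, timedelta

# Cadences mirror the rotation table; the type keys are illustrative names.
ROTATION_CADENCE = {
    "fernet_key": timedelta(days=90),
    "db_password_static": timedelta(days=90),
    "api_token_third_party": timedelta(days=365),
    "internal_service_token": timedelta(days=30),
}


def overdue_secrets(last_rotated: dict[str, tuple[str, date]], today: date) -> list[str]:
    """Return names of secrets whose age exceeds the cadence for their type."""
    overdue = []
    for name, (secret_type, rotated_on) in last_rotated.items():
        if today - rotated_on > ROTATION_CADENCE[secret_type]:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical inventory: name -> (secret type, date of last rotation)
inventory = {
    "airflow_fernet": ("fernet_key", date(2024, 1, 1)),
    "pg_prod_warehouse": ("db_password_static", date(2024, 3, 15)),
    "svc_token": ("internal_service_token", date(2024, 3, 1)),
}
print(overdue_secrets(inventory, today=date(2024, 4, 15)))
```

Running a check like this on a schedule turns the cadence table from a policy statement into an alert.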


Compromised secret — emergency procedure

Scenario: an AWS access key leaked in a public GitHub commit (a classic).

Compromised secret response

t=0: detection (target: < 15 minutes)
  Source: GitGuardian alert, AWS GuardDuty, or a manual report. Confirm the compromise: check git history, and check CloudTrail for suspicious API calls made with this key (region access, EC2 launches, S3 download patterns).

t=15m: revoke the compromised key
  AWS IAM: deactivate the access key immediately. Do not delete it right away; leave it deactivated for CloudTrail forensics. For a Vault token: vault token revoke <token>. For a DB password: ALTER USER ... PASSWORD '<random>' immediately.

t=20m: deploy new credentials
  Generate new credentials. Update Vault / Secrets Manager. Airflow picks them up when the cache expires (with cache_ttl_seconds=600, at most 10 minutes after the update). Or force-restart the pods for immediate effect.

t=30m: forensics
  CloudTrail audit: which API calls were made with the compromised key? Was there actual misuse? What was created or deleted? Which resources were accessed? This audit feeds the incident report.

t=2h: incident report
  Internal write-up: timeline, impact, root cause (how the secret leaked), what was accessed, remediation steps, prevention. Share it with the security team and affected stakeholders. Public disclosure if data was involved.

t=1w: post-mortem + prevention
  Why did this happen? For example: secret in git → install a pre-commit hook (detect-secrets / gitleaks). Manual rotation → automate via a Terraform pipeline. Long-lived AWS key → migrate to IRSA. Each incident should produce a systemic fix, not just a patch.

Do not wait for perfect: fast revocation beats complete forensics first. A compromised key revoked within 15 minutes means bounded damage. Revoked after 24 hours, the impact is potentially massive.
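To make the pre-commit prevention step concrete, here is a minimal sketch of the kind of check that tools such as gitleaks or detect-secrets perform. It only looks for strings shaped like AWS access key IDs (the AKIA/ASIA prefix followed by 16 uppercase alphanumerics) and is an illustration, not a replacement for those tools. The sample text uses the placeholder key from AWS documentation.

```python
import re

# Access key IDs start with AKIA (long-lived) or ASIA (temporary), then 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, match) pairs for strings shaped like AWS access key IDs."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group()))
    return hits


sample = 'region = "us-east-1"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
print(find_aws_key_ids(sample))
```

A real hook would run this over staged files and exit non-zero on any hit, so the commit is blocked before the key ever reaches a remote.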


Audit logging

Airflow itself has limited auditing for secrets. Production needs:

Backend-level audit (Vault, AWS, GCP, Azure):

  • Vault: audit device enabled, log every read/write — vault audit enable file file_path=/var/log/vault/audit.log
  • AWS Secrets Manager: CloudTrail logs every GetSecretValue. Set up a CloudWatch alarm on anomalies.
  • GCP Secret Manager: Cloud Audit Logs (data access logs need explicit enable).
  • Azure Key Vault: diagnostic logs to Log Analytics.
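Backend audit logs become useful once they are aggregated. As a sketch, the snippet below counts read requests per secret path from Vault file audit device output. It assumes the JSON-lines format in which each entry carries a type field ("request" or "response") and a request object with operation and path; verify the field names against your Vault version before relying on it.

```python
import json
from collections import Counter


def reads_per_path(audit_lines, prefix="airflow/data/prod/"):
    """Count read requests per secret path under `prefix` in Vault audit log lines."""
    counts = Counter()
    for raw in audit_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if entry.get("type") != "request":
            continue  # each call is logged as request + response; count it once
        request = entry.get("request", {})
        path = request.get("path", "")
        if request.get("operation") == "read" and path.startswith(prefix):
            counts[path] += 1
    return counts
```

In practice the input would be an open file handle on the audit log (for example /var/log/vault/audit.log); a sudden jump in reads for one path is a signal worth alerting on.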

Application-level audit via the Listener API (Module 12):

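# plugins/secret_audit_listener.py
import logging
from datetime import datetime, timezone

from airflow.listeners import hookimpl

audit_log = logging.getLogger("airflow.secret_audit")


def send_to_audit_log(event: dict) -> None:
    # Placeholder sink: replace with your SIEM / audit pipeline client.
    audit_log.info("%s", event)


# Register an instance via an AirflowPlugin: listeners = [SecretAuditListener()]
class SecretAuditListener:
    @hookimpl
    def on_task_instance_running(self, previous_state, task_instance, session):
        # Record which task started; loaded connections/variables can be logged here too.
        send_to_audit_log({
            "event": "task_started",
            "task_id": task_instance.task_id,
            "dag_id": task_instance.dag_id,
            "executor_user": "airflow",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })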

This gives the application-side perspective: who/what/when ran with credentials. Combined with the backend audit, it gives the full picture.


Monitoring failed secret lookups

Failed secret lookup = Hook initialization failure = task failure. This is already visible in Airflow, but timing matters: it is better to catch it before a DAG run is blocked.

Metrics to monitor:

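# Via Airflow's metrics export (StatsD or OpenTelemetry)
# Airflow 2.x has no built-in /metrics endpoint; a statsd_exporter exposes these to Prometheus,
# and the exact metric names depend on its mapping config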
airflow_dag_run_failed_total{dag_id="my_dag"} 5
airflow_task_instance_failed_total{task_id="extract", dag_id="my_dag"} 3

This is generic: a task can fail for any reason. For secret-specific monitoring:

Vault metrics:

  • vault_secret_kv_count: total secrets
  • vault_audit_log_request: read rate, can be filtered by path
  • vault_token_create_count: auth rate (an increase means either new pods or a token leak)

AWS Secrets Manager:

  • CloudWatch metric GetSecretValueRequestCount — rate per secret
  • CloudWatch alarm on the 4XXError rate (access denied = misconfig)

Custom alerting:

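# In a wrapper around the Vault backend
import time

from airflow.providers.hashicorp.secrets.vault import VaultBackend
from datadog import statsd  # assumes the DogStatsD client, which supports tags


class MonitoredVaultBackend(VaultBackend):
    def get_connection(self, conn_id):
        start = time.monotonic()
        try:
            result = super().get_connection(conn_id)
            elapsed_ms = (time.monotonic() - start) * 1000  # timing metrics are in milliseconds
            statsd.timing("secret.lookup.duration", elapsed_ms, tags=[f"conn_id:{conn_id}"])
            return result
        except Exception as e:
            statsd.increment("secret.lookup.failure", tags=[f"conn_id:{conn_id}", f"error:{type(e).__name__}"])
            raise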

This feeds into Datadog / Grafana / CloudWatch: alert on an error spike or latency growth.
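The "alert on an error spike" rule can be sketched as a sliding window over recent lookup outcomes. This is an illustrative in-process version with hypothetical thresholds; in practice the same rule would usually be expressed as a monitor in the metrics platform itself.

```python
from collections import deque


class FailureRateWindow:
    """Track the last `size` lookup outcomes and flag a failure-rate spike."""

    def __init__(self, size: int = 100, threshold: float = 0.2, min_samples: int = 10):
        self.outcomes = deque(maxlen=size)  # oldest outcomes fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Require a minimum sample count so one early failure does not page anyone.
        return len(self.outcomes) >= self.min_samples and self.failure_rate() > self.threshold
```

The min_samples guard is the main design choice: without it, a single failed lookup right after startup would read as a 100% failure rate.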


CI/CD secret pipeline

Production-grade secret deployment workflow:

Developer/Owner → PR with Terraform change
                ↓
                CI runs: terraform plan, security scan (gitleaks, snyk)
                ↓
                Required reviewers approve (2+ for prod)
                ↓
                Merge → CI/CD pipeline runs terraform apply
                ↓
                Pipeline uses sealed credentials (CI secret store)
                to talk to the Vault/AWS/GCP API
                ↓
                Apply succeeds → secret is now in Vault
                ↓
                Airflow picks it up when the cache expires (or after a manual rollout)

Important details:

  • CI runner credentials ≠ secret value. The runner has permission to write to the Vault path but NOT to read it (separation of duties).
  • Secret value source: either a developer enters it into a protected (sealed) CI variable, or Terraform generates it with the random_password resource, or it is retrieved through the provider's own API (for example AWS IAM CreateAccessKey).
  • Audit: every PR → diff in git → known author + reviewers.
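The "generated, never typed" option for the secret value can be illustrated outside Terraform as well. This sketch uses Python's standard secrets module as a stand-in for what random_password does conceptually; the alphabet and the minimum length are assumptions, not a documented policy.

```python
import secrets
import string

# Assumption: letters, digits, and two URL-safe symbols are acceptable to the target system.
ALPHABET = string.ascii_letters + string.digits + "-_"


def generate_password(length: int = 32) -> str:
    """Generate a random password from a CSPRNG so no human ever has to type it."""
    if length < 16:
        raise ValueError("refusing to generate a password shorter than 16 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Because the value is produced inside the pipeline and written straight to the backend, it never passes through chat, email, or a clipboard.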

Anti-patterns to avoid:

  • Secret in plain text in a committed Terraform .tfvars file → treat it as leaked and rotate IMMEDIATELY.
  • Manual aws secretsmanager put-secret-value via the CLI on prod → no review and no versioned record of the change.
  • Sharing access via Slack / email → not audited, not reviewable.

DAG-side hygiene

Production DAGs should follow this pattern:

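from airflow.decorators import task
from airflow.hooks.base import BaseHook
from airflow.models import Variable

# anti-pattern
PG_PASSWORD = Variable.get("pg_password")  # top-level: runs on every DAG parse

# pattern
@task
def my_task():
    # OK: lookup inside the task, executed only at run time
    conn = BaseHook.get_connection("pg_prod_warehouse")
    # Hook handles password decryption internally
    ...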

Linting: enforce this with a flake8 plugin or a custom AST checker:

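# scripts/check_dags.py
import ast

SECRET_CALLS = ("Variable.get", "BaseHook.get_connection")


def has_call_to(node, names):
    """Return True if any call inside `node` targets one of the dotted `names`."""
    return any(
        isinstance(sub, ast.Call) and ast.unparse(sub.func) in names
        for sub in ast.walk(node)
    )


def check_top_level_secrets(tree):
    """Raise if a module-level assignment performs a secret lookup."""
    for child in tree.body:  # module-level statements only
        if isinstance(child, ast.Assign) and has_call_to(child.value, SECRET_CALLS):
            raise ValueError(f"Top-level secret lookup: line {child.lineno}")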

This scanner runs in a pre-commit hook and in CI, and fails the PR if anything is found.
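One possible extension of such a scanner, sketched here under its own assumptions, is to report every offending line instead of stopping at the first, and to catch lookups nested in module-level blocks such as with statements while still ignoring function bodies. As a simplification, decorators and default arguments of functions are skipped too, although they are evaluated at definition time.

```python
import ast

SECRET_CALLS = {"Variable.get", "BaseHook.get_connection"}


def _calls_outside_functions(node):
    """Yield Call nodes under `node` without descending into function bodies."""
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda)):
            continue  # runs only when called, not at DAG parse time
        if isinstance(current, ast.Call):
            yield current
        stack.extend(ast.iter_child_nodes(current))


def top_level_secret_lookups(source: str) -> list[int]:
    """Return sorted line numbers of secret lookups that run at import time."""
    tree = ast.parse(source)
    return sorted({
        call.lineno
        for call in _calls_outside_functions(tree)
        if ast.unparse(call.func) in SECRET_CALLS
    })


dag_source = "\n".join([
    "from airflow.models import Variable",
    'PG_PASSWORD = Variable.get("pg_password")',
    "def my_task():",
    '    return BaseHook.get_connection("pg_prod_warehouse")',
    'with open("settings.txt") as fh:',
    '    token = Variable.get("token")',
])
print(top_level_secret_lookups(dag_source))
```

The source is only parsed, never imported, so the check works in CI without Airflow installed.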


DR procedure: Vault cluster lost

Worst-case: Vault primary cluster destroyed (region outage), no replica online.

Recovery options (in priority order):

  1. Failover to the Vault DR cluster: if you run a Vault DR replication secondary in another region, promote it. RPO ~ seconds, RTO ~ minutes. Airflow keeps working after a brief blip.

  2. Restore from a Vault snapshot: daily HashiCorp Vault snapshots stored in S3 / GCS. Restore: provision a new Vault cluster, then vault operator raft snapshot restore snapshot.snap. RPO = up to 24h, RTO = hours.

  3. Failover along a Secrets Backend chain: Airflow's [secrets] backend option takes a single class, so a Vault → AWS Secrets Manager chain needs a custom wrapper backend. With one in place, the AWS copy takes over automatically, provided the secrets were replicated to AWS beforehand.

  4. DB fallback (emergency): if the environment-variable or metadata-DB stage of the lookup chain holds working connections, the system continues in degraded mode. Strategy: keep a minimal critical subset of connections in the DB as an emergency fallback.

  5. Full manual recovery: re-create connections in a new Vault from external sources (the DBA team manually provides DB passwords, the cloud team re-issues IAM keys, etc.). Hours to days, requires coordination.

Lesson: Vault is critical infrastructure. Treat it like a DB: backups, monitoring, a DR plan, and regular DR tests (a quarterly chaos exercise).
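The fallback idea behind options 3 and 4 can be sketched as an ordered chain of secret sources that falls through on outages. This is a generic illustration, not Airflow's API: the class and the source callables are hypothetical, and in Airflow the same logic would have to live inside a custom secrets backend class.

```python
from typing import Callable, Optional, Sequence

Lookup = Callable[[str], Optional[str]]


class FallbackChain:
    """Try each secret source in order; fall through on errors or missing values."""

    def __init__(self, sources: Sequence[tuple[str, Lookup]]):
        self.sources = list(sources)

    def get(self, key: str) -> tuple[str, str]:
        """Return (source_name, value) from the first source that has the key."""
        errors = []
        for name, lookup in self.sources:
            try:
                value = lookup(key)
            except Exception as exc:  # an outage in one source must not block the rest
                errors.append(f"{name}: {type(exc).__name__}")
                continue
            if value is not None:
                return name, value
        detail = "; ".join(errors) or "no errors"
        raise KeyError(f"{key!r} not found in any source ({detail})")


def vault_down(key: str) -> Optional[str]:
    # Simulates the primary backend during an outage.
    raise ConnectionError("vault unreachable")


aws_copy = {"pg_prod_warehouse": "postgresql://replicated-from-vault"}
chain = FallbackChain([("vault", vault_down), ("aws", aws_copy.get)])
print(chain.get("pg_prod_warehouse"))
```

Returning the source name alongside the value makes it easy to emit a metric whenever a lookup is served by a fallback, which is itself a signal that the primary is unhealthy.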


Production checklist (final)

  • Secrets Backend configured, no real secrets in the metadata DB
  • Separate paths/namespaces per environment (dev/staging/prod)
  • IaC (Terraform / Pulumi) for secret management, no manual UI changes
  • [secrets] use_cache = True, cache_ttl_seconds = 600 (or appropriate value)
  • connections_lookup_pattern to restrict backend lookups to the prod prefix
  • IRSA / Workload Identity / Managed Identity (no long-lived cloud creds)
  • Fernet key managed through KMS with deletion protection
  • Audit logging enabled in the backend (Vault audit / CloudTrail / Cloud Audit / KV diagnostic)
  • Monitoring + alerting on failed secret lookups
  • Rotation playbook documented and tested quarterly in staging
  • CI/CD pipeline for secret deployment with required reviews
  • Pre-commit hooks / CI checks for top-level Variable.get / BaseHook.get_connection
  • DR plan for a Vault outage, tested quarterly
  • Incident response runbook for a compromised secret (target: < 15 min revoke)

Knowledge check
Production Airflow on EKS. A Vault Performance Standby runs in the same region. AWS Secrets Manager is configured as a secondary backend. It is 2:47 a.m.: the Vault primary is down (storage failure), and the Vault standby was not promoted automatically (misconfiguration). All Airflow tasks are failing with 'Connection lookup failed'. You are on call. What is your plan for the next 30 minutes?
Answer
The plan for the next 30 minutes is survival mode, in priority order.

  1. 0-5 min: assess and stabilize. Acknowledge the PagerDuty incident. Confirm scope: all Airflow envs or only prod? All tasks or only some? Check Vault status: kubectl get pods -n vault. Is the primary in CrashLoopBackOff? Is the standby up? Is AWS Secrets Manager up?
  2. 5-10 min: emergency failover. If the Performance Standby is healthy: vault operator raft snapshot inspect, then promote the standby with vault operator step-down (graceful), or force-promote if raft quorum is lost. After the promotion, verify cluster health with vault status. Airflow starts recovering on the next cache refresh (with a 10 min TTL, that is the maximum wait).
  3. 10-15 min: speed up recovery. To avoid waiting out the 10 min cache expiry: kubectl rollout restart deployment/airflow-scheduler -n airflow, plus workers and triggerer. Each pod restart invalidates the cache, so new tasks pick up the recovered Vault.
  4. 15-20 min: monitor recovery. Watch the task success rate in the Airflow UI. Check that running DAGs are not stuck in queued. Verify that the AWS Secrets Manager backup kicks in for the remaining requests (if secrets are also replicated to AWS, it is a secondary safety net).
  5. 20-30 min: bridge announcement. Send an update to the incident channel: recovery in progress, ETA. Update business stakeholders. Document the timeline and actions taken for the post-mortem.
  6. In parallel: page the Vault SRE team to investigate the root cause of the primary failure (storage corruption, OOM, network partition). Do not try to restore the primary yourself; that is their job.

Preventive lessons:

  1. Auto-promotion of the standby: fix the misconfiguration ASAP.
  2. Test the DR procedure quarterly in staging; with that practice you would already have known the procedure during this incident.
  3. Reduce cache_ttl_seconds if this latency is unacceptable, keeping in mind the trade-off with Vault load.
  4. Run Vault clusters in multiple regions for true HA.
