Lesson 15.01 · 22 min
Level: Intermediate
Tags: Cost Optimization, Cloud Costs, Spot Instances, Auto-scaling, Data Lifecycle, Right-sizing

Cost Optimization & Tuning

Why Cost Is a Design Decision

[Figure: Data Platform Cost Breakdown: Compute vs Storage vs Network. Cost shares: Compute (50-70%), Storage (15-30%), Network (5-15%), Managed Services. Optimization levers: Right-sizing, Auto-scaling, Query Optimization.]

A data platform without cost optimization is a bottomless pit. A typical mistake is launching without budget awareness: three months later the cloud bill is 10x what was expected.

Типичные cost drivers:
  1. Compute: Spark clusters, Flink jobs, OLAP engines
  2. Storage: S3/GCS/ADLS, warehouse storage
  3. Network: cross-region transfer, egress
  4. Managed services: BigQuery slots, Snowflake credits
  5. Licenses: Databricks DBUs, Confluent CKUs

Cost Comparison: Compute

Resource            On-demand      Spot/Preemptible    Reserved (1yr)     Savings
AWS m5.2xlarge      $0.384/hr      $0.12/hr            $0.24/hr           69% spot
GCP n2-standard-8   $0.39/hr       $0.12/hr            $0.24/hr           69% spot
Databricks DBU      $0.55/DBU      $0.30/DBU (jobs)    $0.37/DBU          45% jobs
Snowflake credit    $2-3/credit    N/A                 $1.8-2.5/credit    20% prepay
TIP

Use spot instances for batch jobs. A Spark batch pipeline tolerates interruptions (checkpointing + retry), so spot capacity can cut compute cost by roughly 70%. Keep streaming jobs (Flink) on on-demand instances, where an interruption carries a risk of data loss.
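
The savings column in the table above follows from simple price ratios. A minimal sketch, assuming the illustrative prices listed there (real spot prices vary by region and over time); the dictionary and function names are hypothetical:

```python
# Illustrative prices copied from the comparison table above.
# Spot prices change constantly; treat these values as placeholders.
PRICES = {
    "AWS m5.2xlarge": {"on_demand": 0.384, "discounted": 0.12},
    "GCP n2-standard-8": {"on_demand": 0.39, "discounted": 0.12},
    "Databricks DBU": {"on_demand": 0.55, "discounted": 0.30},
}


def savings_pct(on_demand: float, discounted: float) -> float:
    """Return the percentage saved relative to the on-demand price."""
    if on_demand <= 0:
        raise ValueError("on_demand price must be positive")
    return (1 - discounted / on_demand) * 100


for name, price in PRICES.items():
    pct = savings_pct(price["on_demand"], price["discounted"])
    print(f"{name}: {pct:.0f}% cheaper than on-demand")
```

The same function can be pointed at current price quotes to check whether the spot discount still justifies the interruption handling.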

Storage Optimization

S3 Storage Tiers:
  Standard:           $0.023/GB/month    (hot data, last 30 days)
  Infrequent Access:  $0.0125/GB/month   (warm, 30-90 days)
  Glacier Instant:    $0.004/GB/month    (cold, 90d-1yr)
  Glacier Deep:       $0.00099/GB/month  (archive, 1yr+)

Data Lifecycle Policy:
  Bronze (raw):
    0-30 days:  Standard (frequent replay)
    30-90 days: Infrequent Access
    90-365 days: Glacier Instant
    365+ days:  Glacier Deep or DELETE

  Silver (cleaned):
    0-90 days:  Standard
    90-365 days: Infrequent Access
    365+ days:  Glacier Instant

  Gold (aggregated):
    Always Standard (frequently queried)

Savings example (100 TB Bronze):
  Without lifecycle: $2,300/month (everything in Standard)
  With lifecycle:    ~$400/month (82% saving; the exact figure depends on how much data has aged into colder tiers)
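
The savings figure depends on how the 100 TB is spread across tiers, which the example does not state. A rough sketch of the arithmetic, using the tier prices above and a hypothetical age distribution chosen purely for illustration (storage cost only; retrieval, request, and minimum-duration charges are ignored):

```python
# Prices per GB-month from the S3 tier list above.
TIER_PRICE = {
    "standard": 0.023,
    "infrequent_access": 0.0125,
    "glacier_instant": 0.004,
    "glacier_deep": 0.00099,
}

# HYPOTHETICAL age distribution of a 100 TB Bronze layer (GB, 1 TB = 1000 GB).
# The lesson does not give one; these numbers are an assumption.
BRONZE_GB_BY_TIER = {
    "standard": 5_000,            # 0-30 days
    "infrequent_access": 10_000,  # 30-90 days
    "glacier_instant": 30_000,    # 90-365 days
    "glacier_deep": 55_000,       # 365+ days
}


def monthly_cost(gb_by_tier: dict) -> float:
    """Sum the storage cost across tiers."""
    return sum(gb * TIER_PRICE[tier] for tier, gb in gb_by_tier.items())


total_gb = sum(BRONZE_GB_BY_TIER.values())
flat = total_gb * TIER_PRICE["standard"]  # everything kept in Standard
tiered = monthly_cost(BRONZE_GB_BY_TIER)
saving = (1 - tiered / flat) * 100
print(f"flat: ${flat:,.0f}/month, tiered: ${tiered:,.0f}/month, saving: {saving:.0f}%")
```

With this assumed distribution the tiered bill comes to about $414/month, in line with the roughly 82% saving quoted above. A layer with more recent data would save less.
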
See also: TTL and tiered storage in ClickHouse; Zero-copy replication for S3

Compute Right-sizing

Anti-pattern: "let's use r5.4xlarge for everything"

Right-sizing framework:
  1. Measure actual utilization (CloudWatch, Prometheus)
  2. Identify waste:
     CPU util < 30% → downsize
     Memory util < 40% → switch instance family
     GPU util < 20% → remove GPU instances
  
  3. Match workload to instance:
     Compute-heavy (Spark transforms) → c-family (c5, c6g)
     Memory-heavy (large JOINs, caches) → r-family (r5, r6g)
     Balanced → m-family (m5, m6g)
     ARM-compatible workloads → Graviton (c6g, m6g, r6g; roughly 20% cheaper, performance varies by workload)
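
The waste-detection thresholds in step 2 can be expressed as a small rule check. This is a sketch only: the `Utilization` shape and function name are hypothetical, and the inputs are assumed to be averages already pulled from CloudWatch or Prometheus.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utilization:
    """Average utilization in percent over an observation window (hypothetical shape)."""
    cpu: float
    memory: float
    gpu: Optional[float] = None  # None when the instance has no GPU


def rightsizing_hints(u: Utilization) -> List[str]:
    """Apply the thresholds from the framework above and return readable hints."""
    hints = []
    if u.cpu < 30:
        hints.append("CPU < 30%: downsize")
    if u.memory < 40:
        hints.append("memory < 40%: switch instance family")
    if u.gpu is not None and u.gpu < 20:
        hints.append("GPU < 20%: remove GPU instances")
    return hints


print(rightsizing_hints(Utilization(cpu=22.0, memory=71.0)))
print(rightsizing_hints(Utilization(cpu=65.0, memory=35.0, gpu=5.0)))
```

Averages hide peaks, so in practice it is worth checking a high percentile as well before downsizing.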

Spark cluster sizing:
  Over-provisioned: 50 executors × 16 GB → 800 GB total, using 200 GB
  Right-sized: 20 executors × 12 GB → 240 GB, 83% utilized
  Savings: ~60% compute cost reduction (50 → 20 executors)
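
The sizing numbers above can be checked with a few lines of arithmetic. The sketch below only restates the example figures; the helper name is hypothetical.

```python
def cluster_memory_gb(executors: int, gb_per_executor: int) -> int:
    """Total executor memory requested by the cluster."""
    return executors * gb_per_executor


used_gb = 200                            # memory the job actually uses
over = cluster_memory_gb(50, 16)         # over-provisioned cluster
right = cluster_memory_gb(20, 12)        # right-sized cluster

print(f"over-provisioned: {over} GB, {used_gb / over:.0%} utilized")
print(f"right-sized:      {right} GB, {used_gb / right:.0%} utilized")
print(f"executor reduction: {1 - 20 / 50:.0%}")
print(f"memory reduction:   {1 - right / over:.0%}")
```

The 60% figure tracks the drop in executor count; requested memory falls by a larger share because each executor is also smaller.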

Query Optimization

BigQuery:
  [NO] SELECT * FROM large_table
  [OK] SELECT col1, col2 FROM large_table WHERE date = '2024-01-01'
  → Columnar: scan only needed columns
  → Partitioning: scan only needed partitions
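  → Cost: on-demand queries are billed per TB scanned ($5/TB in this example; check the current rate for your region) → fewer columns and partitions = less cost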
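
To make the effect of column and partition pruning concrete, here is a rough estimate of scan cost. It assumes bytes are spread evenly across partitions and uses a hypothetical 10 TB table with 365 daily partitions; the function name and the fractions are illustrative, and the price is the lesson's example figure.

```python
def scan_cost_usd(table_tb: float, column_fraction: float,
                  partition_fraction: float, price_per_tb: float) -> float:
    """Estimate on-demand query cost from the share of the table actually scanned.

    column_fraction: share of table bytes held by the selected columns (0-1)
    partition_fraction: share of partitions matched by the filter (0-1)
    """
    for name, value in (("column_fraction", column_fraction),
                        ("partition_fraction", partition_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return table_tb * column_fraction * partition_fraction * price_per_tb


PRICE = 5.0  # USD per TB scanned, the example rate used in this lesson

full_scan = scan_cost_usd(10, 1.0, 1.0, PRICE)     # SELECT * over all partitions
pruned = scan_cost_usd(10, 0.1, 1 / 365, PRICE)    # two columns, one day
print(f"SELECT *: ${full_scan:.2f} per query")
print(f"pruned:   ${pruned:.4f} per query")
```

For a real query, a dry run in BigQuery reports the bytes that would be scanned, which is more reliable than this kind of estimate.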

Snowflake:
  [NO] warehouse = 'X-LARGE' for all queries
  [OK] Size warehouses per workload:
     Small queries → XS warehouse (1 credit/hour)
     Heavy ETL → L warehouse (8 credits/hour)
     Auto-suspend after 60 seconds of idle
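
Snowflake warehouses consume credits per hour of running time, doubling with each size step (XS = 1, L = 8, XL = 16). The sketch below compares one always-on XL warehouse with a split setup under a hypothetical daily workload; the busy hours are assumptions, and idle time before auto-suspend and the per-resume minimum billing are ignored.

```python
# Credits consumed per hour of running time, by warehouse size.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def daily_credits(size: str, running_hours: float) -> float:
    """Credits used per day by a warehouse that runs for `running_hours`."""
    return CREDITS_PER_HOUR[size] * running_hours


# HYPOTHETICAL workload: 6 h/day of small queries, 1 h/day of heavy ETL.
always_on_xl = daily_credits("XL", 24)                      # never suspends
split = daily_credits("XS", 6) + daily_credits("L", 1)      # suspends when idle

print(f"always-on XL: {always_on_xl} credits/day")
print(f"XS + L with auto-suspend: {split} credits/day")
```

Multiplying by the contracted price per credit turns these numbers into dollars.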

Spark:
  [NO] spark.sql.shuffle.partitions = 200 (default)
  [OK] AQE (Adaptive Query Execution) auto-tunes:
     spark.sql.adaptive.enabled = true
     spark.sql.adaptive.coalescePartitions.enabled = true
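
One way to apply these settings is to pass them as `--conf` flags to `spark-submit`. A small sketch that renders the two AQE settings above into such flags; the helper name and `job.py` are hypothetical.

```python
# The two AQE settings listed above.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def to_submit_flags(conf: dict) -> list:
    """Render a config mapping as repeated `--conf key=value` arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# job.py is a placeholder for the actual application file.
print(" ".join(["spark-submit", *to_submit_flags(AQE_CONF), "job.py"]))
```
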
See also: Cost optimization in Spark (hands-on)

Auto-scaling Strategies

Spark on K8s:
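  Dynamic allocation:
    spark.dynamicAllocation.enabled = true
    spark.dynamicAllocation.shuffleTracking.enabled = true   (needed on K8s, where there is no external shuffle service)
    spark.dynamicAllocation.minExecutors = 2
    spark.dynamicAllocation.maxExecutors = 100
    → Scale up for heavy stages, down for light ones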
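
The scale-up and scale-down behavior can be pictured with a simplified model: request enough executors to run all outstanding tasks at once, clamped to the configured bounds. This is not Spark's actual implementation (which also uses backlog and idle timeouts); it assumes one core per task, and the function name is hypothetical.

```python
import math


def target_executors(pending_tasks: int, running_tasks: int, cores_per_executor: int,
                     min_executors: int = 2, max_executors: int = 100) -> int:
    """Executors needed to run every task concurrently, clamped to [min, max]."""
    needed = math.ceil((pending_tasks + running_tasks) / cores_per_executor)
    return max(min_executors, min(max_executors, needed))


# Heavy stage: thousands of tasks, so the request hits the upper bound.
print(target_executors(pending_tasks=2000, running_tasks=400, cores_per_executor=4))
# Light stage: a handful of tasks, so the cluster shrinks to the lower bound.
print(target_executors(pending_tasks=0, running_tasks=3, cores_per_executor=4))
```

The defaults mirror the `minExecutors` and `maxExecutors` values from the configuration above.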

Kafka:
  Consumers: K8s HPA based on consumer lag
    if lag > 10000 messages → add consumer
    if lag < 100 → remove consumer (respect min replicas)
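
The lag thresholds above amount to a simple step rule. The sketch below is an illustration of that rule only, not how the Kubernetes HPA computes replicas internally (the HPA works from a metric ratio); the function and parameter names are hypothetical. It also caps the count at the number of partitions, since extra consumers in a group beyond that would sit idle.

```python
def desired_consumers(lag: int, current: int, min_replicas: int, partitions: int,
                      scale_up_lag: int = 10_000, scale_down_lag: int = 100) -> int:
    """Return the consumer count after one scaling step based on consumer lag."""
    if lag > scale_up_lag:
        # Add a consumer, but never exceed the partition count.
        return min(current + 1, partitions)
    if lag < scale_down_lag:
        # Remove a consumer, but respect the minimum replica count.
        return max(current - 1, min_replicas)
    return current


print(desired_consumers(lag=50_000, current=3, min_replicas=1, partitions=12))
print(desired_consumers(lag=10, current=1, min_replicas=1, partitions=12))
```

The gap between the two thresholds acts as a dead band that keeps the consumer group from flapping when lag hovers around a single value.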

OLAP:
  ClickHouse: read replicas for query scaling
  Druid: auto-scale middle managers based on ingestion load
WARNING

Monitor cost BEFORE optimizing. Without visibility there is nothing to optimize against. Set up cost dashboards (AWS Cost Explorer, GCP Billing, Kubecost) broken down by team/pipeline/layer. Set budget alerts at 80% and 100% of the threshold.
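
The 80% and 100% alert rule is easy to express in code. A minimal sketch, assuming month-to-date spend per team has already been exported from a billing tool; the team names, amounts, and function name are hypothetical.

```python
def budget_alerts(spend: float, budget: float, thresholds=(0.8, 1.0)) -> list:
    """Return one message for each threshold the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [f"spend at {ratio:.0%} of budget, reached {t:.0%} threshold"
            for t in thresholds if ratio >= t]


# HYPOTHETICAL month-to-date spend per team against a $10,000 budget.
SPEND_BY_TEAM = {"ingest": 6_500, "analytics": 8_400, "ml": 10_300}

for team, spend in SPEND_BY_TEAM.items():
    print(team, budget_alerts(spend, 10_000))
```

Managed budget features in the cloud billing consoles cover the same rule without custom code; the sketch only shows the logic.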

Knowledge check

Question 1 of 3 (applied). A daily Spark batch job on AWS runs on 50 × m5.2xlarge on-demand instances for 2 hours. Monthly cost: $1,152. The job tolerates interruptions (checkpointing). How would you optimize it?
