Cost Optimization & Tuning
Why Cost Is a Design Decision
[Diagram: cost areas Compute (50-70%), Storage (15-30%), Network (5-15%), Managed Services; optimization levers Right-sizing, Auto-scaling, Query Optimization]
A data platform without cost optimization is a bottomless pit. A typical mistake is launching without budget awareness: three months later the cloud bill is 10x what was expected.
Typical cost drivers:
1. Compute: Spark clusters, Flink jobs, OLAP engines
2. Storage: S3/GCS/ADLS, warehouse storage
3. Network: cross-region transfer, egress
4. Managed services: BigQuery slots, Snowflake credits
5. Licenses: Databricks DBUs, Confluent CKUs
Cost Comparison: Compute
| Resource | On-demand | Spot/Preemptible | Reserved (1yr) | Savings |
|---|---|---|---|---|
| AWS m5.2xlarge | $0.384/hr | $0.12/hr | $0.24/hr | 69% spot |
| GCP n2-standard-8 | $0.39/hr | $0.12/hr | $0.24/hr | 69% spot |
| Databricks DBU | $0.55/DBU | $0.30/DBU (jobs) | $0.37/DBU | 45% jobs |
| Snowflake credit | $2-3/credit | N/A | $1.80-2.50/credit | 20% prepay |
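The Savings column can be reproduced from the hourly prices in the table. The snippet below is a minimal sketch of that arithmetic; the prices are copied from the table, and the function name `savings_pct` is only illustrative.

```python
# Hourly prices in USD, copied from the comparison table.
PRICES = {
    "AWS m5.2xlarge": {"on_demand": 0.384, "spot": 0.12, "reserved_1yr": 0.24},
    "GCP n2-standard-8": {"on_demand": 0.39, "spot": 0.12, "reserved_1yr": 0.24},
}


def savings_pct(on_demand: float, discounted: float) -> float:
    """Percentage saved relative to the on-demand price."""
    return (1 - discounted / on_demand) * 100


for name, p in PRICES.items():
    spot = savings_pct(p["on_demand"], p["spot"])
    reserved = savings_pct(p["on_demand"], p["reserved_1yr"])
    print(f"{name}: spot saves {spot:.0f}%, 1yr reserved saves {reserved:.0f}%")
```

Spot prices fluctuate by region and over time, so the percentages are a snapshot rather than a guarantee.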
TIP
Use spot instances for batch jobs. A Spark batch pipeline tolerates interruptions (checkpointing + retry), so spot capacity can cut compute cost by about 70%. Keep streaming (Flink) on on-demand instances, where an interruption carries a risk of data loss.
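To illustrate why checkpointing plus retry makes a batch job safe to run on interruptible capacity, here is a generic sketch in plain Python. It is not Spark's own recovery mechanism: `run_batch`, `Interrupted`, and the JSON checkpoint file are hypothetical stand-ins for the idea.

```python
import json
import os
import tempfile


class Interrupted(Exception):
    """Stands in for a spot instance being reclaimed mid-run."""


def run_batch(partitions, process, checkpoint_path, max_attempts=5):
    """Process partitions in order, saving progress after each one.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        done = 0
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                done = json.load(f)["done"]  # resume after the last finished partition
        try:
            for i in range(done, len(partitions)):
                process(partitions[i])
                with open(checkpoint_path, "w") as f:
                    json.dump({"done": i + 1}, f)
            return attempt
        except Interrupted:
            continue  # retry; finished partitions are skipped
    raise RuntimeError("batch did not finish within max_attempts")


# Simulate one interruption while processing the third partition.
processed = []
pending_failures = {"p2": 1}


def process(partition):
    if pending_failures.get(partition, 0) > 0:
        pending_failures[partition] -= 1
        raise Interrupted()
    processed.append(partition)


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
attempts = run_batch(["p0", "p1", "p2", "p3"], process, checkpoint)
print(attempts, processed)
```

The interrupted run costs one extra attempt but no partition is processed twice, which is the property that makes spot pricing attractive for batch work.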
Storage Optimization
S3 Storage Tiers:
  Standard: $0.023/GB/month (hot data, last 30 days)
  Infrequent Access: $0.0125/GB/month (warm, 30-90 days)
  Glacier Instant: $0.004/GB/month (cold, 90d-1yr)
  Glacier Deep: $0.00099/GB/month (archive, 1yr+)
Data Lifecycle Policy:
  Bronze (raw):
    0-30 days: Standard (frequent replay)
    30-90 days: Infrequent Access
    90-365 days: Glacier Instant
    365+ days: Glacier Deep or DELETE
  Silver (cleaned):
    0-90 days: Standard
    90-365 days: Infrequent Access
    365+ days: Glacier Instant
  Gold (aggregated):
    Always Standard (frequently queried)
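The policy above could be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration dictionary; the `bronze/` and `silver/` prefixes and the helper `lifecycle_rule` are assumptions for illustration, and Gold gets no rule because it stays in Standard.

```python
def lifecycle_rule(rule_id, prefix, transitions, expire_after_days=None):
    """Build one S3 lifecycle rule from (days, storage_class) pairs."""
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in transitions
        ],
    }
    if expire_after_days is not None:
        # Optional deletion instead of keeping an archive copy forever.
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


LIFECYCLE = {
    "Rules": [
        lifecycle_rule(
            "bronze-tiering",
            "bronze/",  # hypothetical prefix
            [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")],
        ),
        lifecycle_rule(
            "silver-tiering",
            "silver/",  # hypothetical prefix
            [(90, "STANDARD_IA"), (365, "GLACIER_IR")],
        ),
    ]
}

print(LIFECYCLE["Rules"][0]["Transitions"])
```

With boto3, a dictionary of this shape would be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`; the bucket name and credentials are left out here on purpose.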
Savings example (100 TB Bronze):
  Without lifecycle: $2,300/month
  With lifecycle: about $400/month (roughly 82% saving, depending on how the data is distributed across age tiers)
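The savings figure depends on how the 100 TB is spread across age tiers, which the example does not state. The sketch below uses a hypothetical distribution (5 / 10 / 35 / 50 TB) that lands close to the quoted number; it ignores retrieval and transition request fees.

```python
# Per-GB monthly prices from the S3 tier list (USD).
PRICE_PER_GB = {
    "standard": 0.023,
    "infrequent": 0.0125,
    "glacier_instant": 0.004,
    "glacier_deep": 0.00099,
}
GB_PER_TB = 1000  # decimal TB, matching the $2,300 baseline


def monthly_cost(tb_by_tier):
    """Monthly storage cost in USD for a {tier: terabytes} mapping."""
    return sum(tb * GB_PER_TB * PRICE_PER_GB[tier] for tier, tb in tb_by_tier.items())


baseline = monthly_cost({"standard": 100})
# Assumed age distribution of the Bronze layer; not given in the text.
tiered = monthly_cost(
    {"standard": 5, "infrequent": 10, "glacier_instant": 35, "glacier_deep": 50}
)
saving_pct = (1 - tiered / baseline) * 100

print(f"baseline ${baseline:,.0f}, tiered ${tiered:,.0f}, saving {saving_pct:.0f}%")
```

A layer with more recent data in Standard would save less, so it is worth plugging in the real age histogram before quoting a number.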
TTL and tiered storage in ClickHouse
Zero-copy replication for S3
Compute Right-sizing
Anti-pattern: "let's use r5.4xlarge for everything"
Right-sizing framework:
1. Measure actual utilization (CloudWatch, Prometheus)
2. Identify waste:
   CPU util < 30% → downsize
   Memory util < 40% → switch instance family
   GPU util < 20% → remove GPU instances
3. Match workload to instance:
   Compute-heavy (Spark transforms) → c-family (c5, c6g)
   Memory-heavy (large JOINs, caches) → r-family (r5, r6g)
   Balanced → m-family (m5, m6g)
   ARM → Graviton (about 20% cheaper at comparable performance for many workloads)
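The waste thresholds from step 2 can be turned into a small check over collected metrics. This is a sketch only: the `Utilization` type and `recommendations` function are hypothetical, and the inputs are assumed to be average utilization percentages over a representative window.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utilization:
    cpu_pct: float
    memory_pct: float
    gpu_pct: Optional[float] = None  # None when the node has no GPU


def recommendations(u: Utilization) -> List[str]:
    """Apply the thresholds from the right-sizing framework."""
    recs = []
    if u.cpu_pct < 30:
        recs.append("downsize")
    if u.memory_pct < 40:
        recs.append("switch instance family")
    if u.gpu_pct is not None and u.gpu_pct < 20:
        recs.append("remove GPU instances")
    return recs


print(recommendations(Utilization(cpu_pct=22, memory_pct=35)))
print(recommendations(Utilization(cpu_pct=70, memory_pct=80, gpu_pct=10)))
```

Averages can hide short peaks, so it is usually safer to feed this kind of check with a high percentile (for example p95) rather than the mean.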
Spark cluster sizing:
  Over-provisioned: 50 executors × 16 GB → 800 GB total, using 200 GB (25% utilized)
  Right-sized: 20 executors × 12 GB → 240 GB, 83% utilized
  Savings: roughly 60% compute cost reduction (60% fewer executors)
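The sizing numbers can be checked with a few lines of arithmetic. The sketch assumes cost scales with executor count, which is where the 60% figure comes from; measured by memory alone, the reduction is 70%.

```python
used_gb = 200

over_executors, over_gb_each = 50, 16
right_executors, right_gb_each = 20, 12

over_gb = over_executors * over_gb_each      # 800 GB provisioned
right_gb = right_executors * right_gb_each   # 240 GB provisioned

utilization_before = used_gb / over_gb
utilization_after = used_gb / right_gb
executor_reduction = 1 - right_executors / over_executors
memory_reduction = 1 - right_gb / over_gb

print(f"utilization: {utilization_before:.0%} -> {utilization_after:.0%}")
print(f"executors: -{executor_reduction:.0%}, memory: -{memory_reduction:.0%}")
```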
Query Optimization
BigQuery:
  [NO] SELECT * FROM large_table
  [OK] SELECT col1, col2 FROM large_table WHERE date = '2024-01-01'
  → Columnar: scan only needed columns
  → Partitioning: scan only needed partitions
  → Cost: on-demand queries are billed per data scanned ($5/TB here; $6.25/TiB under current US list pricing) → fewer columns and partitions = less cost
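A rough estimate shows how much column and partition pruning matters. The sketch assumes equal-width columns and evenly sized partitions, which real tables rarely have; the table size, column count, and the function `scan_cost_usd` are hypothetical, and the default price is the $5/TB figure quoted above.

```python
def scan_cost_usd(table_tb, columns_read, total_columns,
                  partitions_read, total_partitions, price_per_tb=5.0):
    """Approximate on-demand cost of one query from the fraction of data scanned."""
    scanned_tb = (
        table_tb
        * (columns_read / total_columns)
        * (partitions_read / total_partitions)
    )
    return scanned_tb * price_per_tb


# Hypothetical 10 TB table: 50 columns, 365 daily partitions.
full_scan = scan_cost_usd(10, 50, 50, 365, 365)  # SELECT * with no filter
pruned = scan_cost_usd(10, 2, 50, 1, 365)        # two columns, one day

print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.4f}")
```

For an exact figure on a real query, the `google-cloud-bigquery` client supports a dry run (`QueryJobConfig(dry_run=True)`) that reports `total_bytes_processed` without executing the query.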
Snowflake:
  [NO] warehouse = 'X-LARGE' for all queries
  [OK] Multi-cluster auto-scaling:
    Small queries → XS warehouse (1 credit/hr)
    Heavy ETL → L warehouse (8 credits/hr)
    Auto-suspend after 60 seconds of idle
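Snowflake warehouses consume credits per hour, and the rate doubles with each size step (XS = 1, L = 8, XL = 16). The sketch below estimates cost from running time; the $3/credit price is an assumption taken from the upper end of the range in the table, and `warehouse_cost` is a hypothetical helper.

```python
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def warehouse_cost(size, running_seconds, usd_per_credit=3.0):
    """Cost of one warehouse run, billed per second with a 60-second minimum."""
    billed_seconds = max(running_seconds, 60)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600 * usd_per_credit


# One hour of small queries on XS versus the same hour on XL.
xs_hour = warehouse_cost("XS", 3600)
xl_hour = warehouse_cost("XL", 3600)
# A 10-second query on XL still pays for the 60-second minimum.
xl_short = warehouse_cost("XL", 10)

print(f"XS hour ${xs_hour:.2f}, XL hour ${xl_hour:.2f}, XL 10s query ${xl_short:.2f}")
```

A larger warehouse often finishes heavy queries faster, so the comparison only holds for workloads that do not benefit from the extra capacity, which is the case for small queries.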
Spark:
  [NO] spark.sql.shuffle.partitions = 200 (default)
  [OK] AQE (Adaptive Query Execution) auto-tunes:
    spark.sql.adaptive.enabled = true
    spark.sql.adaptive.coalescePartitions.enabled = true
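To see why a fixed 200 partitions is a poor fit, it helps to compute the partition count a size-based target would give. The sketch uses a 64 MiB target, which to my knowledge matches the Spark 3.x default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`; the function `partitions_for` is illustrative and is not how AQE is implemented internally.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def partitions_for(shuffle_bytes, target_partition_bytes=64 * MIB):
    """Partition count that keeps each shuffle partition near the target size."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


small = partitions_for(100 * MIB)   # a small aggregation
large = partitions_for(1 * TIB)     # a large join

# With the fixed default of 200 partitions, per-partition size swings widely.
small_fixed_mib = 100 * MIB / 200 / MIB
large_fixed_gib = 1 * TIB / 200 / GIB

print(small, large)
print(f"fixed 200: {small_fixed_mib:.1f} MiB vs {large_fixed_gib:.2f} GiB per partition")
```

The small job wastes scheduling overhead on tiny tasks, while the large job risks spills and out-of-memory errors; coalescing at runtime addresses the first case, and raising the initial partition count addresses the second.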
Cost optimization in Spark: practice
Auto-scaling Strategies
Spark on K8s:
  Dynamic allocation:
    spark.dynamicAllocation.enabled = true
    spark.dynamicAllocation.minExecutors = 2
    spark.dynamicAllocation.maxExecutors = 100
  → Scale up for heavy stages, down for light ones
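The dynamic allocation settings can be passed to `spark-submit` as `--conf` flags. The sketch below assembles them from a dictionary; `job.py` is a placeholder, and the shuffle-tracking setting is an addition not listed above, included on the assumption that the cluster has no external shuffle service (the usual situation on Kubernetes).

```python
import shlex

DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "100",
    # Assumption, not from the text: lets executors be released safely
    # when no external shuffle service is available.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}


def to_submit_args(conf):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(DYNAMIC_ALLOCATION)
# "job.py" is a hypothetical application file.
print(shlex.join(["spark-submit", *submit_args, "job.py"]))
```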
Kafka:
  Consumers: K8s HPA based on consumer lag
    if lag > 10000 messages → add consumer
    if lag < 100 → remove consumer (respect min replicas)
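The lag thresholds can be written as a small decision function. This is a sketch of the rule as stated, not the formula the Kubernetes HPA applies to external metrics; `desired_consumers` is hypothetical, and the cap at the partition count is added because consumers beyond that number in one group would sit idle.

```python
def desired_consumers(current, lag, partitions, min_replicas=1,
                      scale_up_lag=10_000, scale_down_lag=100):
    """Next replica count for a consumer group, given its total lag."""
    if lag > scale_up_lag:
        target = current + 1
    elif lag < scale_down_lag:
        target = current - 1
    else:
        target = current
    # Never drop below the minimum, never exceed the partition count.
    return max(min_replicas, min(target, partitions))


print(desired_consumers(current=3, lag=25_000, partitions=12))   # scale up
print(desired_consumers(current=3, lag=50, partitions=12))       # scale down
print(desired_consumers(current=12, lag=25_000, partitions=12))  # capped
```

The gap between the two thresholds acts as a dead band, which keeps the group from flapping when lag hovers around a single value.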
OLAP:
  ClickHouse: read replicas for query scaling
  Druid: auto-scale middle managers based on ingestion load
WARNING
Monitor cost BEFORE optimizing. Without visibility you cannot optimize. Set up cost dashboards (AWS Cost Explorer, GCP Billing, Kubecost) broken down by team/pipeline/layer. Set budget alerts at 80% and 100% of the threshold.
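The 80% and 100% alert levels amount to a simple comparison of spend against budget per cost owner. The sketch below shows that check; the team names and amounts are made up, and in practice the spend figures would come from a billing export rather than a literal dictionary.

```python
def budget_alerts(spend, budget, thresholds=(0.8, 1.0)):
    """Return the thresholds (as fractions of budget) that spend has reached."""
    return [t for t in thresholds if spend >= t * budget]


# Hypothetical month-to-date spend and budgets per team, in USD.
SPEND = {"ingestion": 850, "analytics": 1200, "ml": 100}
BUDGET = {"ingestion": 1000, "analytics": 1000, "ml": 1000}

alerts = {team: budget_alerts(SPEND[team], BUDGET[team]) for team in SPEND}

for team, fired in alerts.items():
    if fired:
        levels = ", ".join(f"{t:.0%}" for t in fired)
        print(f"{team}: reached {levels} of budget")
```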