Spark Architecture для System Design
Почему Spark — default для batch
Apache Spark — de-facto стандарт для batch processing больших данных. Понимание его архитектуры критично для system design:
Driver
Executor 1
Executor 2
Executor N
DAG и Stages
Spark превращает ваш код в DAG (Directed Acyclic Graph) операций:
DAG Scheduler — устройство и stagesКод:
df = spark.read.parquet("orders") # Stage 0: read
df = df.filter(col("amount") > 100) # Stage 0: filter (narrow)
df = df.groupBy("region").sum("amount") # Stage 1: shuffle + aggregate
df = df.orderBy("sum(amount)") # Stage 2: shuffle + sort
df.write.parquet("output") # Stage 2: write
DAG Stages:
Stage 0: [read + filter] — narrow transforms, no shuffle
↓ shuffle (groupBy requires data redistribution)
Stage 1: [aggregate]
↓ shuffle (orderBy requires global sort)
Stage 2: [sort + write]
Design Insight: Shuffle = Performance Killer
Shuffle — перераспределение данных между executors. Самая дорогая операция:
Shuffle cost:
1. Serialize data on source executor
2. Write to local disk (shuffle files)
3. Network transfer to target executors
4. Read from disk on target executor
5. Deserialize
Network I/O + Disk I/O = bottleneck
Shuffle в Spark — глубокое погружение
Оптимизация shuffle
Partitioning: ключ к performance
Input Partitioning
Файлы на S3/HDFS → Spark partitions:
1 file (1 GB Parquet) → 1 partition → 1 task
Правило: partition size ≈ 128 MB
1 GB file → ~8 partitions → 8 parallel tasks
100 GB dataset → ~800 partitions → 800 tasks
Shuffle Partitioning
spark.sql.shuffle.partitions = 200 (default)
Проблема:
10 GB data → 200 partitions = 50 MB/partition [OK]
1 TB data → 200 partitions = 5 GB/partition [NO] (OOM!)
1 MB data → 200 partitions = 5 KB/partition [NO] (overhead!)
Rule: partition_count ≈ total_data_size / 128 MB
10 GB → 80 partitions
1 TB → 8,000 partitions
Data Skew
Data Skew Problem:
groupBy("country") →
partition "US": 50M records (99%)
partition "LU": 10K records (1%)
Task for "US": 30 minutes
Task for "LU": 1 second
Total: 30 minutes (limited by slowest task = straggler)
Solutions:
1. Salting: add random suffix to key → distribute "US" across N partitions
2. Broadcast Join: if one side is small (< 10 MB), broadcast to all executors
3. AQE Skew Join: Spark 3+ automatically splits skewed partitions
Skewed
Salted
System Design Decisions для Spark Jobs
1. Cluster Sizing
Formula:
total_data = 100 GB
target_partition_size = 128 MB
partitions = ceil(100 GB / 128 MB) = 800
cores_per_executor = 4 (sweet spot)
executor_memory = 8 GB (4 GB per core)
executors_needed = ceil(800 / 4) = 200 (at once)
With 20 executors: 800 tasks / (20 × 4 cores) = 10 waves
Time ≈ 10 waves × task_duration
2. Storage Format
| Format | Use Case | Compression | Columnar |
|---|---|---|---|
| Parquet | Default for analytics | Snappy/Zstd | Yes |
| ORC | Hive ecosystem | Zlib | Yes |
| Avro | Streaming ingestion | Snappy | No (row) |
| JSON | Never for production | None | No |
| CSV | Legacy only | None | No |
3. Join Strategies
| Strategy | When | Performance |
|---|---|---|
| Broadcast Join | One side fits in memory (< 10 MB default) | Fastest (no shuffle) |
| Sort-Merge Join | Both sides large, sorted | Good (one shuffle) |
| Shuffle Hash Join | Both sides large, unsorted | OK (hash + shuffle) |
TIP
Cross-reference: Apache Spark Deep-Dive
Этот курс покрывает Spark с позиции System Design. Для глубокого погружения в Spark internals (Catalyst, Tungsten, AQE, UDFs, memory management) — курс Apache Spark Deep-Dive на этой платформе.