Spark Architecture для System Design

Почему Spark — default для batch

Apache Spark — de-facto стандарт для batch processing больших данных. Понимание его архитектуры критично для system design:

Spark Architecture: Driver + Executors

Driver

Executor 1

Executor 2

Executor N

DAG и Stages

Spark превращает ваш код в DAG (Directed Acyclic Graph) операций:

DAG Scheduler — устройство и stages

Код:
  df = spark.read.parquet("orders")     # Stage 0: read
  df = df.filter(col("amount") > 100)   # Stage 0: filter (narrow)
  df = df.groupBy("region").sum("amount") # Stage 1: shuffle + aggregate
  df = df.orderBy("sum(amount)")         # Stage 2: shuffle + sort
  df.write.parquet("output")             # Stage 2: write

DAG Stages:
  Stage 0: [read + filter] — narrow transforms, no shuffle
     ↓ shuffle (groupBy requires data redistribution)
  Stage 1: [aggregate]
     ↓ shuffle (orderBy requires global sort)
  Stage 2: [sort + write]

Design Insight: Shuffle = Performance Killer

Shuffle — перераспределение данных между executors. Самая дорогая операция:

Shuffle cost:
  1. Serialize data on source executor
  2. Write to local disk (shuffle files)
  3. Network transfer to target executors
  4. Read from disk on target executor
  5. Deserialize

Network I/O + Disk I/O = bottleneck

Shuffle в Spark — глубокое погружение Оптимизация shuffle

Partitioning: ключ к performance

Input Partitioning

Файлы на S3/HDFS → Spark partitions:
  1 file (1 GB Parquet) → 1 partition → 1 task
  
  Правило: partition size ≈ 128 MB
  1 GB file → ~8 partitions → 8 parallel tasks
  100 GB dataset → ~800 partitions → 800 tasks

Shuffle Partitioning

spark.sql.shuffle.partitions = 200 (default)

Проблема: 
  10 GB data → 200 partitions = 50 MB/partition [OK]
  1 TB data → 200 partitions = 5 GB/partition [NO] (OOM!)
  1 MB data → 200 partitions = 5 KB/partition [NO] (overhead!)

Rule: partition_count ≈ total_data_size / 128 MB
  10 GB → 80 partitions
  1 TB → 8,000 partitions

Data Skew

Data Skew Problem:
  groupBy("country") → 
    partition "US": 50M records (99%)
    partition "LU": 10K records (1%)
  
  Task for "US": 30 minutes
  Task for "LU": 1 second
  Total: 30 minutes (limited by slowest task = straggler)

Solutions:
  1. Salting: add random suffix to key → distribute "US" across N partitions
  2. Broadcast Join: if one side is small (< 10 MB), broadcast to all executors
  3. AQE Skew Join: Spark 3+ automatically splits skewed partitions

Data Skew: Straggler Task

Skewed

Salted

System Design Decisions для Spark Jobs

1. Cluster Sizing

Formula:
  total_data = 100 GB
  target_partition_size = 128 MB
  partitions = ceil(100 GB / 128 MB) = 800
  
  cores_per_executor = 4 (sweet spot)
  executor_memory = 8 GB (4 GB per core)
  executors_needed = ceil(800 / 4) = 200 (at once)
  
  With 20 executors: 800 tasks / (20 × 4 cores) = 10 waves
  Time ≈ 10 waves × task_duration

2. Storage Format

Format	Use Case	Compression	Columnar
Parquet	Default for analytics	Snappy/Zstd	Yes
ORC	Hive ecosystem	Zlib	Yes
Avro	Streaming ingestion	Snappy	No (row)
JSON	Never for production	None	No
CSV	Legacy only	None	No

3. Join Strategies

Strategy	When	Performance
Broadcast Join	One side fits in memory (< 10 MB default)	Fastest (no shuffle)
Sort-Merge Join	Both sides large, sorted	Good (one shuffle)
Shuffle Hash Join	Both sides large, unsorted	OK (hash + shuffle)

TIP

Cross-reference: Apache Spark Deep-Dive

Этот курс покрывает Spark с позиции System Design. Для глубокого погружения в Spark internals (Catalyst, Tungsten, AQE, UDFs, memory management) — курс Apache Spark Deep-Dive на этой платформе.