Learning Platform
Глоссарий Troubleshooting
Урок 04.01 · 22 мин
Средний
SparkArchitectureDAGShuffleExecutorPartitioning

Spark Architecture для System Design

Почему Spark — default для batch

Apache Spark — de-facto стандарт для batch processing больших данных. Понимание его архитектуры критично для system design:

Spark Architecture: Driver + Executors
Driver
Executor 1
Executor 2
Executor N

DAG и Stages

Spark превращает ваш код в DAG (Directed Acyclic Graph) операций:

DAG Scheduler — устройство и stages
Код:
  df = spark.read.parquet("orders")     # Stage 0: read
  df = df.filter(col("amount") > 100)   # Stage 0: filter (narrow)
  df = df.groupBy("region").sum("amount") # Stage 1: shuffle + aggregate
  df = df.orderBy("sum(amount)")         # Stage 2: shuffle + sort
  df.write.parquet("output")             # Stage 2: write

DAG Stages:
  Stage 0: [read + filter] — narrow transforms, no shuffle
     ↓ shuffle (groupBy requires data redistribution)
  Stage 1: [aggregate]
     ↓ shuffle (orderBy requires global sort)
  Stage 2: [sort + write]

Design Insight: Shuffle = Performance Killer

Shuffle — перераспределение данных между executors. Самая дорогая операция:

Shuffle cost:
  1. Serialize data on source executor
  2. Write to local disk (shuffle files)
  3. Network transfer to target executors
  4. Read from disk on target executor
  5. Deserialize

Network I/O + Disk I/O = bottleneck
Shuffle в Spark — глубокое погружение Оптимизация shuffle

Partitioning: ключ к performance

Input Partitioning

Файлы на S3/HDFS → Spark partitions:
  1 file (1 GB Parquet) → 1 partition → 1 task
  
  Правило: partition size ≈ 128 MB
  1 GB file → ~8 partitions → 8 parallel tasks
  100 GB dataset → ~800 partitions → 800 tasks

Shuffle Partitioning

spark.sql.shuffle.partitions = 200 (default)

Проблема: 
  10 GB data → 200 partitions = 50 MB/partition [OK]
  1 TB data → 200 partitions = 5 GB/partition [NO] (OOM!)
  1 MB data → 200 partitions = 5 KB/partition [NO] (overhead!)

Rule: partition_count ≈ total_data_size / 128 MB
  10 GB → 80 partitions
  1 TB → 8,000 partitions

Data Skew

Data Skew Problem:
  groupBy("country") → 
    partition "US": 50M records (99%)
    partition "LU": 10K records (1%)
  
  Task for "US": 30 minutes
  Task for "LU": 1 second
  Total: 30 minutes (limited by slowest task = straggler)

Solutions:
  1. Salting: add random suffix to key → distribute "US" across N partitions
  2. Broadcast Join: if one side is small (< 10 MB), broadcast to all executors
  3. AQE Skew Join: Spark 3+ automatically splits skewed partitions
Data Skew: Straggler Task
Skewed
Salted

System Design Decisions для Spark Jobs

1. Cluster Sizing

Formula:
  total_data = 100 GB
  target_partition_size = 128 MB
  partitions = ceil(100 GB / 128 MB) = 800
  
  cores_per_executor = 4 (sweet spot)
  executor_memory = 8 GB (4 GB per core)
  executors_needed = ceil(800 / 4) = 200 (at once)
  
  With 20 executors: 800 tasks / (20 × 4 cores) = 10 waves
  Time ≈ 10 waves × task_duration

2. Storage Format

FormatUse CaseCompressionColumnar
ParquetDefault for analyticsSnappy/ZstdYes
ORCHive ecosystemZlibYes
AvroStreaming ingestionSnappyNo (row)
JSONNever for productionNoneNo
CSVLegacy onlyNoneNo

3. Join Strategies

StrategyWhenPerformance
Broadcast JoinOne side fits in memory (< 10 MB default)Fastest (no shuffle)
Sort-Merge JoinBoth sides large, sortedGood (one shuffle)
Shuffle Hash JoinBoth sides large, unsortedOK (hash + shuffle)
TIP

Cross-reference: Apache Spark Deep-Dive

Этот курс покрывает Spark с позиции System Design. Для глубокого погружения в Spark internals (Catalyst, Tungsten, AQE, UDFs, memory management) — курс Apache Spark Deep-Dive на этой платформе.

Проверка знанийKnowledge check
ОтветAnswer

Проверьте понимание

Результат: 0 из 0
Прикладной
Вопрос 1 из 2. Spark job обрабатывает 1 TB данных. spark.sql.shuffle.partitions = 200 (default). Что произойдёт?

Закончили урок?

Отметьте его как пройденный, чтобы отслеживать свой прогресс

Войдите чтобы оценить урок

Прогресс модуля
0 из 2