Lesson 10.04 · 28 min
Intermediate
Tags: Binary formats, Parquet, ORC, Avro, Arrow IPC, columnar, row-based, schema evolution, Cross-course storage-formats, Cross-course datafusion-course, Pitfall 30

Binary formats overview: Parquet, ORC, Avro, Arrow IPC

Lessons 01-03 covered text formats: CSV and JSON. They are human-readable, schema-flexible, and universal, but inefficient for analytics: parse-heavy (the string→int conversion is repeated for every row), storage-bloated (3-10x the size of Parquet), and without built-in compression. Production data lakes use binary columnar formats.

This lesson is a conceptual matrix: which format to choose when. pyarrow and fastavro are not available in the browser (Pitfall 31: pyarrow is 50MB+ and not bundled in Pyodide), so the challenges are concentrated in lesson 02 (CSV) and lesson 03 (JSON). Here you get a decision tree plus heavy cross-course references to the Storage Formats course (27 lessons of deep dives, the primary deep-dive route per Pitfall 30).

In this lesson:

  1. Why binary formats: text-format limitations.
  2. Layout dichotomy: columnar vs row-based.
  3. Format matrix: Parquet / ORC / Avro / Arrow IPC.
  4. Decision tree: which format when.
  5. Pitfall 30: don’t deep-dive Parquet internals here.
  6. Cross-course references: Storage Formats M02/M03/M04/M07; DataFusion M01.
  7. Newer formats in brief: Iceberg / Delta / Hudi / Paimon (out of scope).

Why binary formats — text-format limitations

CSV and JSON are good for interchange between systems with different stacks. For analytical workloads, however:

| Constraint | CSV/JSON impact | Binary format solution |
|---|---|---|
| Parse cost | Every read re-parses string → typed value (int(row['age'])). Compute-heavy | Schema-aware: values are stored in typed binary form, so there is no text parsing; zero-copy reads where the format allows it (Arrow IPC) |
| Storage size | UTF-8 text + JSON braces + repeated keys → 3-10x bloat | Compressed (gzip / lz4 / zstd / snappy) + dictionary-encoded repeated values |
| Predicate pushdown | Impossible: the whole file is always scanned | Min/max statistics per row group → skip irrelevant blocks |
| Column projection | Impossible: reading a JSON object reads all keys | Columnar layout: read only the requested columns |
| Schema enforcement | Schema-less: runtime errors on bad data | Schema-on-write: declared at ingest; readers validate |

Production rule: for analytical / OLAP workloads, use binary columnar (Parquet / ORC). For streaming / Kafka pipelines, use Avro (schema evolution). For inter-process / cross-library zero-copy, use Arrow IPC.
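The parse-cost and storage-size rows of the table can be illustrated with the standard library alone. This is a minimal sketch, not a real file format: the sample data and the "little-endian int32" schema are assumptions made up for the illustration.

```python
import json
import struct

# Hypothetical sample: 1,000 user ages (not data from the lesson).
ages = [20 + (i % 50) for i in range(1000)]

# Text route: one JSON object per line, the key is repeated in every record.
text_blob = "\n".join(json.dumps({"age": a}) for a in ages).encode("utf-8")

# Binary route: the schema ("little-endian int32") is declared once,
# and the payload holds only the raw values.
binary_blob = struct.pack(f"<{len(ages)}i", *ages)

# Reading text means parsing every record back into a typed value.
parsed_from_text = [json.loads(line)["age"] for line in text_blob.splitlines()]

# Reading binary means reinterpreting bytes according to the schema.
parsed_from_binary = list(struct.unpack(f"<{len(ages)}i", binary_blob))

print("text bytes:  ", len(text_blob))
print("binary bytes:", len(binary_blob))
```

Both routes recover the same integers; the difference is how many bytes are stored and how much work each read repeats. Real binary formats add compression and encodings on top of this.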


Layout dichotomy — columnar vs row-based

Row-based (CSV, JSON, Avro)

Each row is stored contiguously:

Row 1: [name=alice, age=30, role=dev]
Row 2: [name=bob,   age=25, role=qa]
Row 3: [name=carol, age=35, role=pm]

Use case: OLTP-like — read entire record (e.g., user profile lookup). Streaming — append individual records (Kafka).

Columnar (Parquet, ORC, Arrow)

Each column is stored contiguously:

name:  [alice, bob, carol]
age:   [30, 25, 35]
role:  [dev, qa, pm]

Use case: OLAP. SELECT AVG(age) FROM users reads only the age column and skips name and role (column projection). Compression is also dramatically better: homogeneous values within a column compress better than mixed-type rows.

Trade-off: writing a single row requires updating three separate column regions (one per column) instead of one contiguous record. Hence columnar formats are append-only and batch-written.
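The two layouts and the write trade-off can be sketched with plain Python containers. The helper names to_columns and append_row are hypothetical, introduced only for this illustration; the three records are the ones shown above.

```python
rows = [
    {"name": "alice", "age": 30, "role": "dev"},
    {"name": "bob", "age": 25, "role": "qa"},
    {"name": "carol", "age": 35, "role": "pm"},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {key: [r[key] for r in records] for key in records[0]}

def append_row(columns, record):
    """Appending one record touches every column; return how many were written."""
    for key, value in record.items():
        columns[key].append(value)
    return len(record)

columns = to_columns(rows)

# Column projection: AVG(age) reads one list and never looks at name or role.
avg_age_columnar = sum(columns["age"]) / len(columns["age"])

# Row layout: the same answer requires visiting every whole record.
avg_age_rows = sum(r["age"] for r in rows) / len(rows)

print(avg_age_columnar, avg_age_rows)  # 30.0 30.0
```

Calling append_row(columns, {"name": "dave", "age": 40, "role": "dev"}) returns 3, which is the single-row write cost described above: one write per column rather than one write per record.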

Cross-course → Storage Formats M02 lesson 01, Parquet row groups, covers row group internals (column chunks → pages → encoded values).


Format matrix — Parquet / ORC / Avro / Arrow IPC

| Format | Layout | Compression | Schema | Evolution | Ecosystem | Use case |
|---|---|---|---|---|---|---|
| Parquet | columnar | row-group + column-chunk + page-level (snappy is a common default; zstd modern) | required, on-write | yes (add columns; rename via field-id) | Spark / Hive / Impala / Trino / DuckDB / PyArrow | Default for analytical data lakes |
| ORC | columnar | stripe-level + index | required, on-write | limited (Hive ecosystem only) | Hive primary, Trino, Spark | Hadoop / Hive deep-integration |
| Avro | row-based | block-level | required, embedded in file header | strong (forward + backward compat) | Kafka primary, Flink, Hadoop | Streaming + Kafka schema registry |
| Arrow IPC | columnar in-memory | optional (lz4 / zstd) | required | yes | DataFusion / pandas / Polars / cuDF | In-memory transfer + cross-library |

Parquet (Apache Parquet)

  • Layout: file → row groups → column chunks → pages.
  • Compression: codec chosen per column chunk (snappy is a common writer default; zstd is the modern choice; gzip is legacy).
  • Predicate pushdown: min/max statistics per row group → skip blocks where filter excludes range.
  • Cross-course deep dive: Storage Formats M02, Parquet (7 lessons): row groups, column chunks, pages, encodings, metadata, nested data (Dremel), file viewer.
Interactive Parquet file viewer
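The idea behind min/max skipping can be shown with a toy model that stays at the selection-criteria level. This is a conceptual sketch only: RowGroup, make_row_groups and scan_greater_than are hypothetical names, and nothing here reflects how Parquet actually stores or evaluates statistics (that belongs to Storage Formats M02).

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Toy stand-in for a row group: values plus min/max statistics."""
    values: list
    min_value: int
    max_value: int

def make_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups

def scan_greater_than(groups, threshold):
    """Return matching values and the number of groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove no value can match: skip the block
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read

ages = list(range(18, 78))  # sorted toy data: 60 values, 6 groups of 10
groups = make_row_groups(ages, group_size=10)
matches, groups_read = scan_greater_than(groups, threshold=65)

print(len(groups), groups_read)  # 6 2
```

Only two of the six groups are read for the filter age > 65. The benefit depends on the data being clustered: with randomly ordered values, most groups would span the whole range and few could be skipped.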

ORC (Optimized Row Columnar)

  • Layout: file → stripes → row groups (within stripe) → indexes + column data.
  • Compression: stripe-level zlib/snappy/zstd; index for skip.
  • Born: Hortonworks, 2013, for the Hive ecosystem.
  • Limited evolution: rename / type promotion support is weaker than in Parquet.
  • Cross-course deep dive: Storage Formats M03, ORC (7 lessons): stripes, indexes, encodings, ACID semantics, Bloom filters.

Avro (Apache Avro)

  • Layout: row-based, blocks of N records, with the schema embedded in the file header.
  • Compression: block-level and optional (the Avro spec requires only the null and deflate codecs; snappy and zstandard are optional codecs).
  • Schema evolution: STRONG — explicit forward/backward compat rules (Avro spec § Schema Resolution). Reader schema differs from writer schema → automatic projection / default value fill / type promotion.
  • Use case: streaming (Kafka) — schema registry stores writer schema, consumers pin reader schema, evolution managed without breaking pipelines.
  • Cross-course deep dive: Storage Formats M04, Avro (6 lessons): schema type system, evolution rules, schema registry, Kafka integration.
Figures: Avro container format (file structure); Avro in Kafka (runtime schema evolution)
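The three behaviours named in the schema evolution bullet (projection, default value fill, type promotion) can be sketched in pure Python. This mirrors the spirit of Avro schema resolution, not the full specification: the resolve function and the dict-based schema shape are assumptions for illustration, and fastavro is not used because it is unavailable in the browser.

```python
def resolve(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields.

    Each schema is a dict: field name -> {"type": ..., optional "default": ...}.
    """
    out = {}
    for name, spec in reader_fields.items():
        if name in writer_fields:
            value = record[name]
            # Type promotion: an int from the writer is read as a double.
            if writer_fields[name]["type"] == "int" and spec["type"] == "double":
                value = float(value)
            out[name] = value
        elif "default" in spec:
            # Field exists only in the reader schema: fill in its default.
            out[name] = spec["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that exist only in the writer schema are dropped (projection).
    return out

writer_v1 = {
    "name": {"type": "string"},
    "age": {"type": "int"},
    "legacy_id": {"type": "int"},
}
reader_v2 = {
    "name": {"type": "string"},
    "age": {"type": "double"},
    "role": {"type": "string", "default": "unknown"},
}

record = {"name": "alice", "age": 30, "legacy_id": 7}
resolved = resolve(record, writer_v1, reader_v2)
print(resolved)  # {'name': 'alice', 'age': 30.0, 'role': 'unknown'}
```

A reader field without a default that the writer never produced raises an error, which is the case compatibility checks in a schema registry are meant to catch before deployment.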

Arrow IPC (Apache Arrow Inter-Process Communication)

  • Layout: in-memory columnar buffers — same memory representation across libraries.
  • Compression: optional — usually skipped (zero-copy is the point).
  • Format: streaming (a sequence of record batches without a footer) or file (with a footer holding a copy of the schema and the record-batch offsets).
  • Killer feature: zero-copy between pandas / Polars / DuckDB / DataFusion / Spark Arrow UDFs, where the same memory buffers are shared without a serialize/deserialize cycle.
  • Cross-course deep dive: Storage Formats M07, Arrow (7 lessons): memory layout, type system, IPC format, Feather, Flight protocol, ecosystem, memory viewer.
  • Cross-course → DataFusion: DataFusion 01/02 — Arrow memory layout — DataFusion uses Arrow as in-memory representation.
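The difference between sharing a buffer and running a serialize/deserialize cycle can be demonstrated with the standard library. This is an analogy rather than Arrow itself: array and memoryview stand in for a typed columnar buffer and a second library's view of it.

```python
from array import array

# One "library" produces a column as a contiguous typed buffer.
ages = array("i", [30, 25, 35])

# A second "library" receives a view over the same memory: no copy is made.
shared_view = memoryview(ages)

# A serialize/deserialize cycle, by contrast, yields an independent copy.
copied = array("i")
copied.frombytes(ages.tobytes())

ages[1] = 26  # the producer updates the buffer in place

print(shared_view[1])  # 26: the view observes the change
print(copied[1])       # 25: the copy does not
```

The view costs nothing regardless of column size, while the copy costs time and memory proportional to the data. That is why compression is usually skipped for Arrow IPC: decompressing would reintroduce a copy.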

Decision tree: which format when

Workload?
├── Analytical reads (OLAP) — column projections, aggregations
│   ├── Spark / Trino / DuckDB / Impala — Parquet (default)
│   └── Hive — ORC (deep ecosystem integration)

├── Streaming ingestion + schema evolution
│   └── Kafka / Flink — Avro (schema registry + forward/backward compat)

├── In-memory cross-library transfer (no disk persistence)
│   └── pandas → Polars → DuckDB → DataFusion — Arrow IPC (zero-copy)

└── Interchange / debugging / human-readable
    └── CSV / JSON (M09 lessons 02-03)

Production rule: do not over-think it. Default to Parquet for new data lakes. Choose ORC only under Hive lock-in, and Avro for schema-evolution-heavy streaming. Arrow IPC is an implementation detail between libraries and is rarely chosen explicitly by the user.
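The decision tree above is small enough to write down as a function. The workload and engine labels below are hypothetical strings chosen for this sketch; the branches follow the tree as stated in this lesson.

```python
def choose_format(workload, engine=None):
    """Return the format suggested by the lesson's decision tree."""
    if workload == "analytical":
        # Hive deployments favour ORC; every other engine defaults to Parquet.
        return "ORC" if engine == "hive" else "Parquet"
    if workload == "streaming":
        return "Avro"
    if workload == "in-memory":
        return "Arrow IPC"
    if workload == "interchange":
        return "CSV / JSON"
    raise ValueError(f"unknown workload: {workload!r}")

print(choose_format("analytical", engine="trino"))  # Parquet
print(choose_format("analytical", engine="hive"))   # ORC
print(choose_format("streaming", engine="kafka"))   # Avro
```

Raising on an unknown workload, rather than silently falling back to Parquet, keeps the default an explicit choice.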


Pitfall 30 — don’t deep-dive Parquet internals here

Anti-pattern: lesson 04 tries to explain Parquet predicate pushdown internals, dictionary encoding, RLE, bit-packing, page indexes…

Why it goes wrong: these deep-dive topics already take up seven lessons in the Storage Formats course (M02). Duplication leads to:

  • divergent explanations (one course is updated first; the other goes stale);
  • pragmatic-DEEP overrun: a 28 min lesson turns into 90 min.

Correct framing: lesson 04 is a matrix plus a decision tree. Each format gets a short paragraph and a cross-course link to the deep dive. The Storage Formats course covers internals; M09 lesson 04 covers selection criteria.

Production rule: at any point where you are about to write “Parquet automatically does X internally…”, STOP and cross-link to Storage Formats. M09 is a bridge phase, not a competing deep dive.


Cross-course references — Storage Formats deep dives

The main feature of Phase 68 is heavy cross-course bridging. The Storage Formats course ships 27 lessons across 4 modules describing exactly these binary formats:

Storage Formats M02: Parquet (7 lessons)

Storage Formats M03: ORC (7 lessons)

Storage Formats M04: Avro (6 lessons)

Storage Formats M07: Arrow (7 lessons)

DataFusion — Arrow foundation

Three-layer cross-course bridge: M09 урок 04 (decision tree) → Storage Formats (format internals) → DataFusion / Spark / ClickHouse (engine integration). Each layer adds depth.


Newer formats — brief mention

Modern data lakes use table formats on top of Parquet:

| Format | What | Status |
|---|---|---|
| Apache Iceberg | Open table format: schema evolution + time travel + ACID + hidden partitioning | Industry standard 2024+; AWS / Snowflake / Databricks adopted |
| Delta Lake | Databricks-originated; ACID + schema evolution + time travel | Strong Databricks ecosystem |
| Apache Hudi | Streaming-friendly upserts + change data capture | Used at Uber / Robinhood |
| Apache Paimon | Flink-native, stream-batch unified | Newer (2023+); Flink ecosystem |

These table formats are NOT file formats; they manage Parquet/ORC files (manifests, snapshots, statistics). They are out of scope for M09 lesson 04; Storage Formats M11-M14 covers them (if those modules have shipped in the course catalog).

Production guidance 2026: for new projects, use Iceberg + Parquet; ecosystem support spans Spark / Trino / DuckDB / Snowflake / AWS Athena.


What's in the next lesson

Lesson 05: Compression formats (gzip / bzip2 / lz4 / zstd / snappy). These codecs are reused across the binary formats of lesson 04: Parquet column-chunk compression (zstd or snappy), ORC stripe compression, Avro block compression. Lesson 05 offers a speed-vs-ratio tradeoff matrix plus a Pyodide-safe gzip.decompress demo via io.BytesIO. Cross-course → Storage Formats M09 (compression internals: Btrblocks / Fastlanes / ALP / FSST).

Pragmatic-DEEP principle: selection criteria live in M09; internals live in Storage Formats. Don’t compete with the Storage Formats course; augment it via cross-links (Pitfall 30).

Check your understanding

Conceptual
Question 1 of 4. How does columnar layout (Parquet, ORC, Arrow) fundamentally differ from row-based layout (CSV, JSON, Avro) for analytical workloads?
