Binary formats overview: Parquet, ORC, Avro, Arrow IPC

Lessons 01-03 covered text formats: CSV and JSON. They are human-readable, schema-flexible, and universal, but inefficient for analytics: parse-heavy (the string→int conversion repeats for every row), storage-bloated (3-10x the size of the equivalent Parquet), and without built-in compression. Production data lakes use binary columnar formats.

This lesson is a conceptual matrix: which format to choose when. pyarrow and fastavro are not available in the browser (Pitfall 31: pyarrow is 50 MB+ and is not bundled in Pyodide), so the challenges are concentrated in lessons 02 (CSV) and 03 (JSON). Here you get a decision tree plus heavy cross-course references to the Storage Formats course (27 lessons of deep dives, the primary deep-dive route per Pitfall 30).

In this lesson:

Why binary formats: text-format limitations.
Layout dichotomy: columnar vs row-based.
Format matrix: Parquet / ORC / Avro / Arrow IPC.
Decision tree: which format when.
Pitfall 30: don’t deep-dive Parquet internals here.
Cross-course references: Storage Formats M02/M03/M04/M07; DataFusion M01.
Newer formats in brief: Iceberg / Delta / Hudi / Paimon (out of scope).

Why binary formats — text-format limitations

CSV and JSON work well for interchange between systems with different stacks. For analytical workloads, however:

Constraint	CSV/JSON impact	Binary format solution
Parse cost	Every read re-parses string → typed value (`int(row['age'])`), which is compute-heavy	Schema-aware: values are stored typed, so no text parsing is needed; Arrow IPC additionally allows zero-copy reads
Storage size	UTF-8 text + JSON braces + repeated keys → 3-10x bloat	Compressed (gzip / lz4 / zstd / snappy) + dictionary-encoded repeated values
Predicate pushdown	Not possible: the whole file is always scanned	Min/max statistics per row group → skip irrelevant blocks
Column projection	Not possible: reading a JSON object reads all of its keys	Columnar layout: read only the requested columns
Schema enforcement	Schema-less: runtime errors on bad data	Schema-on-write: declared at ingest; readers validate

Production rule: for analytical / OLAP workloads, use a binary columnar format (Parquet / ORC). For streaming / Kafka pipelines, use Avro (schema evolution). For inter-process / cross-library zero-copy transfer, use Arrow IPC.
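The parse-cost and storage-size rows of the table above can be illustrated with the standard library alone. This is a minimal sketch, not how Parquet or ORC actually encode data: the sample values are made up, and `struct` with fixed-width int32 stands in for a real typed binary encoding.

```python
import json
import struct

# Hypothetical sample: 1,000 records with a single integer field.
ages = [20 + (i % 50) for i in range(1000)]

# Text encoding: JSON Lines repeats the key and stores digits as characters.
text_blob = "\n".join(json.dumps({"age": a}) for a in ages).encode("utf-8")

# Binary encoding: one fixed-width little-endian int32 per value, no keys.
binary_blob = struct.pack(f"<{len(ages)}i", *ages)

# Reading text re-parses every value on every read.
parsed_from_text = [
    json.loads(line)["age"] for line in text_blob.decode("utf-8").split("\n")
]

# Reading binary decodes typed values directly from known offsets.
parsed_from_binary = list(struct.unpack(f"<{len(ages)}i", binary_blob))

print("text bytes:", len(text_blob), "binary bytes:", len(binary_blob))
```

Both reads return the same integers; the difference is the number of bytes stored and the work done per value. Real columnar formats add encodings and compression on top of this idea.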

Layout dichotomy — columnar vs row-based

Row-based (CSV, JSON, Avro)

Each row is stored contiguously:

Row 1: [name=alice, age=30, role=dev]
Row 2: [name=bob,   age=25, role=qa]
Row 3: [name=carol, age=35, role=pm]

Use case: OLTP-like — read entire record (e.g., user profile lookup). Streaming — append individual records (Kafka).

Columnar (Parquet, ORC, Arrow)

Each column is stored contiguously:

name:  [alice, bob, carol]
age:   [30, 25, 35]
role:  [dev, qa, pm]

Use case: OLAP. SELECT AVG(age) FROM users reads only the age column and skips name and role (column projection). Compression is also dramatically better, because homogeneous values within a column compress better than mixed-type rows.

Trade-off: writing a single row means updating three separate column regions (one per column) instead of appending one contiguous record. Hence columnar formats are append-only and written in batches.
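The two layouts and the write trade-off can be sketched with plain Python containers. This is only an illustration of the idea using the three sample rows above; `append_row` is a hypothetical helper, and real formats store columns as encoded byte ranges, not Python lists.

```python
rows = [
    {"name": "alice", "age": 30, "role": "dev"},
    {"name": "bob", "age": 25, "role": "qa"},
    {"name": "carol", "age": 35, "role": "pm"},
]

# Row-based: each record is contiguous, so averaging age visits every record.
avg_from_rows = sum(r["age"] for r in rows) / len(rows)

# Columnar: pivot once at write time so that each column is contiguous.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# AVG(age) touches only the age column; name and role are never read.
avg_from_columns = sum(columns["age"]) / len(columns["age"])


def append_row(cols, record):
    """Appending one record has to touch every column (the write trade-off)."""
    for key, values in cols.items():
        values.append(record[key])


append_row(columns, {"name": "dave", "age": 40, "role": "dev"})
```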

Cross-course → Storage Formats M02 lesson 01 (Parquet row groups) covers row group internals (column chunks → pages → encoded values).

Format matrix — Parquet / ORC / Avro / Arrow IPC

Format	Layout	Compression	Schema	Evolution	Ecosystem	Use case
Parquet	columnar	codec set per column chunk, applied page by page (snappy is a common writer default; zstd in newer setups)	required, on-write	yes (add columns; renames rely on field IDs, as used by table formats such as Iceberg)	Spark / Hive / Impala / Trino / DuckDB / PyArrow	Default for analytical data lakes
ORC	columnar	stripe-level + index	required, on-write	limited (Hive ecosystem only)	Hive primary, Trino, Spark	Hadoop / Hive deep integration
Avro	row-based	block-level	required, embedded in the file header	strong (forward + backward compat)	Kafka primary, Flink, Hadoop	Streaming + Kafka schema registry
Arrow IPC	columnar in-memory	optional (lz4 / zstd)	required	yes	DataFusion / pandas / Polars / cuDF	In-memory transfer + cross-library

Parquet (Apache Parquet)

Layout: file → row groups → column chunks → pages.
Compression: per column chunk (snappy is the default in many writers; zstd in newer setups; gzip legacy).
Predicate pushdown: min/max statistics per row group → skip blocks where the filter excludes the range.
Cross-course deep dive: Storage Formats M02, Parquet (7 lessons): row groups, column chunks, pages, encodings, metadata, nested data (Dremel), file viewer.
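As a conceptual aid only (the real mechanics belong to Storage Formats M02), the row-group skipping idea can be sketched in a few lines. `RowGroup`, `make_row_groups`, and `scan_greater_than` are hypothetical names for this toy model, not part of any Parquet library.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a row group: values plus min/max statistics."""
    values: list
    min_value: int
    max_value: int


def make_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove that no value in this group can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


ages = list(range(100))  # sorted data makes the statistics selective
groups = make_row_groups(ages, group_size=25)
matches, groups_read = scan_greater_than(groups, threshold=80)
print(f"read {groups_read} of {len(groups)} groups, {len(matches)} matches")
```

Note that the benefit depends on data ordering: if values were shuffled, every group's min/max range would likely overlap the filter and nothing could be skipped.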

Interactive Parquet file viewer

ORC (Optimized Row Columnar)

Layout: file → stripes → row groups (within stripe) → indexes + column data.
Compression: stripe-level zlib/snappy/zstd; index for skip.
Origin: Hortonworks, 2013, for the Hive ecosystem.
Limited evolution: rename / type promotion support is weaker than in Parquet.
Cross-course deep dive: Storage Formats M03, ORC (7 lessons): stripes, indexes, encodings, ACID semantics, Bloom filters.

Avro (Apache Avro)

Layout: row-based. Blocks of N records, with the schema embedded in the file header.
Compression: block-level (the spec requires the null and deflate codecs, and null, i.e. uncompressed, is assumed when no codec is set; snappy is an optional codec that is widely used in practice).
Schema evolution: strong. Explicit forward/backward compatibility rules (Avro spec § Schema Resolution). When the reader schema differs from the writer schema, the reader applies automatic projection, default-value fill, and type promotion.
Use case: streaming (Kafka). A schema registry stores the writer schema, consumers pin a reader schema, and evolution is managed without breaking pipelines.
Cross-course deep dive: Storage Formats M04, Avro (6 lessons): schema type system, evolution rules, schema registry, Kafka integration.
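The three resolution behaviours named above (projection, default-value fill, type promotion) can be sketched on plain dictionaries. This is a toy model under stated assumptions: `resolve` and `PROMOTIONS` are hypothetical names, only flat records with primitive type names are handled, and no Avro library or binary encoding is involved.

```python
# Numeric promotions permitted by Avro schema resolution (writer -> reader).
PROMOTIONS = {
    ("int", "long"): int,
    ("int", "float"): float,
    ("int", "double"): float,
    ("long", "float"): float,
    ("long", "double"): float,
    ("float", "double"): float,
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_types = {f["name"]: f["type"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name, reader_type = field["name"], field["type"]
        if name in writer_types:
            writer_type = writer_types[name]
            if writer_type == reader_type:
                resolved[name] = record[name]
            elif (writer_type, reader_type) in PROMOTIONS:
                convert = PROMOTIONS[(writer_type, reader_type)]
                resolved[name] = convert(record[name])
            else:
                raise TypeError(f"cannot read {writer_type} as {reader_type}")
        elif "default" in field:
            resolved[name] = field["default"]  # field added after the write
        else:
            raise KeyError(f"reader field {name!r} has no default")
    return resolved  # writer-only fields are dropped (projection)


writer_schema = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "role", "type": "string"},
]}
reader_schema = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "long"},                           # promoted
    {"name": "team", "type": "string", "default": "unknown"},  # filled
]}

old_record = {"name": "alice", "age": 30, "role": "dev"}
new_view = resolve(old_record, writer_schema, reader_schema)
```

Here the reader never sees `role`, receives `team` from its default, and reads `age` as a long, which is why old data stays readable after the schema changes.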

Avro container format: file structure. Avro in Kafka: runtime schema evolution.

Arrow IPC (Apache Arrow Inter-Process Communication)

Layout: in-memory columnar buffers, with the same memory representation across libraries.
Compression: optional, usually skipped (zero-copy is the point).
Format: streaming (a sequence of record batches without a footer) or file (with a footer that repeats the schema and records batch offsets for random access).
Killer feature: zero-copy sharing between Arrow-aware libraries such as pandas / Polars / DuckDB / DataFusion / Spark Arrow UDFs. The same memory buffers are shared without a serialize/deserialize cycle, although conversions into non-Arrow structures may still copy.
Cross-course deep dive: Storage Formats M07, Arrow (7 lessons): memory layout, type system, IPC format, Feather, Flight protocol, ecosystem, memory viewer.
Cross-course → DataFusion: DataFusion 01/02 (Arrow memory layout). DataFusion uses Arrow as its in-memory representation.
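The difference between sharing buffers and a serialize/deserialize cycle can be shown with the standard library. This sketch does not use Arrow: `array` and `memoryview` stand in for a columnar buffer and a second consumer, and `pickle` stands in for any serialization round trip.

```python
import pickle
from array import array

# One contiguous column of signed 64-bit integers (typecode "q").
ages = array("q", [30, 25, 35])

# Zero-copy: a memoryview exposes the same bytes without duplicating them.
shared = memoryview(ages)
shared[0] = 31  # a write through the view is visible to the owner
# ages[0] is now 31, because both names refer to the same memory

# Serialize/deserialize: the consumer gets an independent copy.
copied = pickle.loads(pickle.dumps(ages))
copied[1] = 99  # does not affect the original column

print(ages.tolist(), copied.tolist())
```

The shared view costs nothing regardless of column size, while the round trip scales with the number of bytes, which is the cost a common memory layout is meant to avoid.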

Decision tree: which format when

Workload?
├── Analytical reads (OLAP) — column projections, aggregations
│   ├── Spark / Trino / DuckDB / Impala — Parquet (default)
│   └── Hive — ORC (deep ecosystem integration)
│
├── Streaming ingestion + schema evolution
│   └── Kafka / Flink — Avro (schema registry + forward/backward compat)
│
├── In-memory cross-library transfer (no disk persistence)
│   └── pandas → Polars → DuckDB → DataFusion — Arrow IPC (zero-copy)
│
└── Interchange / debugging / human-readable
    └── CSV / JSON (M09 lessons 02-03)

Production rule: don’t over-think it. Default to Parquet for new data lakes. Use ORC only under Hive lock-in, and Avro for schema-evolution-heavy streaming. Arrow IPC is an implementation detail between libraries and is rarely chosen explicitly by the user.
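The tree above is small enough to write down as a function. This is one possible encoding; the function name `choose_format` and the workload and engine labels are hypothetical and exist only for this sketch.

```python
def choose_format(workload, engine=None):
    """Return a format name for a workload, following the decision tree."""
    if workload == "analytical":
        # Hive deployments lean on ORC; everything else defaults to Parquet.
        return "ORC" if engine == "hive" else "Parquet"
    if workload == "streaming":
        return "Avro"
    if workload == "in_memory":
        return "Arrow IPC"
    if workload == "interchange":
        return "CSV/JSON"
    raise ValueError(f"unknown workload: {workload!r}")


print(choose_format("analytical", engine="duckdb"))
print(choose_format("analytical", engine="hive"))
print(choose_format("streaming"))
```

Raising on an unknown workload, instead of silently falling back to Parquet, keeps the default an explicit decision.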

Pitfall 30 — don’t deep-dive Parquet internals here

Anti-pattern: lesson 04 tries to explain Parquet predicate pushdown internals, dictionary encoding, RLE, bit-packing, page indexes…

Why it goes wrong: these deep-dive topics already take seven lessons in the Storage Formats course (M02). Duplication means:

divergent explanations (one course gets updated first; the other goes stale);
pragmatic-DEEP overrun: a 28-minute lesson turns into 90 minutes.

Correct framing: lesson 04 is a matrix plus a decision tree. Each format gets a short paragraph and a cross-course link to its deep dive. The Storage Formats course covers internals; M09 lesson 04 covers selection criteria.

Production rule: at any point where you are about to write “Parquet automatically does X internally…”, STOP and cross-link to Storage Formats. M09 is a bridge phase, not a competing deep dive.

Cross-course references — Storage Formats deep dives

Phase 68 main feature: heavy cross-course bridges. The Storage Formats course ships 27 lessons across 4 modules describing exactly these binary formats:

Storage Formats M02: Parquet (7 lessons)

01 — Row groups — file structure
02 — Column chunks — chunk layout
03 — Pages — page-level encoding
04 — Encodings — RLE, bit-packing, dictionary
05 — Metadata + statistics — predicate pushdown
06 — Nested data (Dremel)
07 — Parquet file viewer — interactive

Storage Formats M03: ORC (7 lessons)

01 — Stripes — stripe layout
02-07 — indexes, encodings, ACID, Bloom filters

Storage Formats M04: Avro (6 lessons)

02 — Schema type system — types + evolution rules
remaining lessons: schema registry, Kafka integration

Storage Formats M07: Arrow (7 lessons)

01 — Memory layout — buffer structure
02 — Type system — physical/logical types
03 — IPC format — streaming + file
04 — Feather format — IPC subset
05 — Flight protocol — gRPC streaming
06-07 — ecosystem + viewer

DataFusion — Arrow foundation

DataFusion 01/02 (Arrow memory layout): DataFusion as an Arrow consumer

Three-layer cross-course bridge: M09 lesson 04 (decision tree) → Storage Formats (format internals) → DataFusion / Spark / ClickHouse (engine integration). Each layer adds depth.

Newer formats — brief mention

Modern data lakes use table formats on top of Parquet:

Format	What	Status
Apache Iceberg	Open table format — schema evolution + time travel + ACID + hidden partitioning	Industry standard 2024+; AWS / Snowflake / Databricks adopted
Delta Lake	Databricks-originated; ACID + schema evolution + time travel	Strong Databricks ecosystem
Apache Hudi	Streaming-friendly upserts + change data capture	Used at Uber / Robinhood
Apache Paimon	Flink-native — stream-batch unified	Newer (2023+); Flink ecosystem

These table formats are NOT file formats; they manage Parquet/ORC files (manifests, snapshots, statistics). They are out of scope for M09 lesson 04; Storage Formats M11-M14 covers them (if those modules have shipped in the course catalog).

Production guidance 2026: for new projects, use Iceberg + Parquet, which has ecosystem support across Spark / Trino / DuckDB / Snowflake / AWS Athena.

What’s in the next lesson

Lesson 05: Compression formats (gzip / bzip2 / lz4 / zstd / snappy). These codecs are reused across the binary formats of lesson 04: Parquet column-chunk compression (zstd or snappy), ORC stripe compression, and Avro block compression. Lesson 05 covers a speed-vs-ratio tradeoff matrix plus a Pyodide-safe gzip.decompress demo via io.BytesIO. Cross-course → Storage Formats M09 (compression internals: BtrBlocks / FastLanes / ALP / FSST).

Pragmatic-DEEP principle: selection criteria live in M09; internals live in Storage Formats. Don’t compete with the Storage Formats course; augment it via cross-links (Pitfall 30).
