Binary formats overview: Parquet, ORC, Avro, Arrow IPC
Lessons 01-03 covered the text formats CSV and JSON. They are human-readable, schema-flexible, and universal, but inefficient for analytics: parse-heavy (the string→int conversion repeats for every row), storage-bloated (roughly 3-10x the size of the equivalent Parquet file), and without built-in compression. Production data lakes use binary columnar formats instead.
This lesson is a conceptual matrix: which format to choose when. pyarrow and fastavro are not available in the browser (Pitfall 31: pyarrow is 50MB+ and is not bundled in Pyodide), so the challenges are concentrated in lesson 02 (CSV) and lesson 03 (JSON). This lesson provides a decision tree plus heavy cross-course references to the Storage Formats course (27 lessons of deep dives, the primary deep-dive route per Pitfall 30).
In this lesson:
- Why binary formats — text-format limitations.
- Layout dichotomy — columnar vs row-based.
- Format matrix — Parquet / ORC / Avro / Arrow IPC.
- Decision tree: which format when.
- Pitfall 30 — don’t deep-dive Parquet internals here.
- Cross-course references — Storage Formats M02/M03/M04/M07; DataFusion M01.
- Newer formats brief — Iceberg / Delta / Hudi / Paimon — out-of-scope.
Why binary formats — text-format limitations
CSV and JSON work well for interchange between systems with different stacks. For analytical workloads, however:
| Constraint | CSV/JSON impact | Binary format solution |
|---|---|---|
| Parse cost | Every read re-parses strings into typed values (int(row['age'])), which is compute-heavy | Schema-aware: values are stored as typed binary, so there is no per-row string parsing; some formats (Arrow) allow zero-copy reads |
| Storage size | UTF-8 text + JSON braces + repeated keys → 3-10x bloat | Compressed (gzip / lz4 / zstd / snappy) + dictionary-encoded repeated values |
| Predicate pushdown | Not possible: the whole file is always scanned | Min/max statistics per row group → skip irrelevant blocks |
| Column projection | Not possible: reading a JSON object reads all of its keys | Columnar layout: read only the requested columns |
| Schema enforcement | Schema-less: bad data surfaces as runtime errors | Schema-on-write: declared at ingest; readers validate |
Production rule: for analytical / OLAP workloads, use binary columnar formats (Parquet / ORC). For streaming / Kafka pipelines, use Avro (schema evolution). For inter-process / cross-library zero-copy, use Arrow IPC.
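The parse-cost and storage-size rows of the table can be illustrated with the standard library alone. This is a minimal sketch, not how any of the formats above actually encode data: the `user_id` column and its values are hypothetical, and `struct` stands in for a schema-aware binary encoding.

```python
import csv
import io
import struct

# Hypothetical sample column: 10-digit IDs that still fit in a signed int32.
user_ids = [1700000001, 1700000002, 1700000003, 1700000004, 1700000005]

# Text path: values are written as characters and re-parsed on every read.
text_buf = io.StringIO()
writer = csv.writer(text_buf)
writer.writerow(["user_id"])
writer.writerows([uid] for uid in user_ids)
csv_bytes = text_buf.getvalue().encode("utf-8")

reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
parsed_from_text = [int(row["user_id"]) for row in reader]  # string -> int per row

# Binary path: one fixed-width little-endian int32 per value, decoded in one call.
layout = f"<{len(user_ids)}i"
binary_bytes = struct.pack(layout, *user_ids)
parsed_from_binary = list(struct.unpack(layout, binary_bytes))

print("csv bytes:   ", len(csv_bytes))
print("binary bytes:", len(binary_bytes))
```

Both paths recover the same integers; the binary one does so without any per-row string parsing and in fewer bytes, before compression is even applied.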
Layout dichotomy — columnar vs row-based
Row-based (CSV, JSON, Avro)
Each row is stored contiguously:
Row 1: [name=alice, age=30, role=dev]
Row 2: [name=bob, age=25, role=qa]
Row 3: [name=carol, age=35, role=pm]
Use case: OLTP-like — read entire record (e.g., user profile lookup). Streaming — append individual records (Kafka).
Columnar (Parquet, ORC, Arrow)
Each column is stored contiguously:
name: [alice, bob, carol]
age: [30, 25, 35]
role: [dev, qa, pm]
Use case: OLAP. SELECT AVG(age) FROM users reads only the age column and skips name and role (column projection). Compression is also dramatically better, because homogeneous values within a column compress better than mixed-type rows.
Trade-off: writing a single row means touching three separate column regions (one per column) instead of one contiguous record. Hence columnar files are typically append-only and written in batches.
Cross-course → Storage Formats M02 lesson 01 (Parquet row groups) covers row group internals (column chunks → pages → encoded values).
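The two layouts can be modeled with plain Python containers. The sketch below is only an in-memory analogy for the dichotomy (the helper names `to_columnar` and `append_row` are hypothetical); it says nothing about how Parquet or ORC lay out bytes on disk.

```python
rows = [
    {"name": "alice", "age": 30, "role": "dev"},
    {"name": "bob", "age": 25, "role": "qa"},
    {"name": "carol", "age": 35, "role": "pm"},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


def append_row(columns, record):
    """Appending one row has to touch every column list."""
    for key, value in record.items():
        columns[key].append(value)


columns = to_columnar(rows)

# Row layout: AVG(age) walks every record, including name and role.
avg_from_rows = sum(record["age"] for record in rows) / len(rows)

# Columnar layout: AVG(age) reads a single contiguous list.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_columns)
```

The `append_row` helper makes the write-side trade-off visible: one logical insert fans out into one write per column, which is why batch writes are the norm for columnar data.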
Format matrix — Parquet / ORC / Avro / Arrow IPC
| Format | Layout | Compression | Schema | Evolution | Ecosystem | Use case |
|---|---|---|---|---|---|---|
| Parquet | columnar | codec set per column chunk, applied per page (snappy is a common writer default; zstd in newer setups) | required, on-write | yes (add columns; renames rely on field IDs and engine support) | Spark / Hive / Impala / Trino / DuckDB / PyArrow | Default for analytical data lakes |
| ORC | columnar | stripe-level + index | required, on-write | limited (Hive ecosystem only) | Hive primary, Trino, Spark | Hadoop / Hive deep-integration |
| Avro | row-based | block-level | required, embedded in the file header | strong (forward + backward compat) | Kafka primary, Flink, Hadoop | Streaming + Kafka schema registry |
| Arrow IPC | columnar in-memory | optional (lz4 / zstd) | required | yes | DataFusion / pandas / Polars / cuDF | In-memory transfer + cross-library |
Parquet (Apache Parquet)
- Layout: file → row groups → column chunks → pages.
- Compression: codec chosen per column chunk (snappy is the default in many writers; zstd in newer setups; gzip is legacy).
- Predicate pushdown: min/max statistics per row group → skip blocks where filter excludes range.
- Cross-course deep dive: Storage Formats M02, Parquet (7 lessons): row groups, column chunks, pages, encodings, metadata, nested data (Dremel), file viewer.
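The idea behind min/max skipping fits in a few lines of plain Python. This is a toy model under stated assumptions: `RowGroup`, `build_row_groups`, and `scan_greater_than` are hypothetical names, and real Parquet readers work on encoded pages and footer metadata rather than Python lists (see Storage Formats M02 for the actual mechanics).

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list
    min_value: int
    max_value: int


def build_row_groups(values, group_size):
    """Split a column into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of groups actually scanned."""
    matches, scanned = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove that no value in this group can match
        scanned += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, scanned


ages = [21, 23, 25, 27, 30, 33, 36, 39, 41, 45, 52, 60]
groups = build_row_groups(ages, group_size=4)
matches, scanned = scan_greater_than(groups, threshold=40)
print(matches, f"scanned {scanned} of {len(groups)} groups")
```

Skipping works well here because the column is sorted, so each group covers a narrow range; on randomly ordered data the min/max ranges overlap and far fewer groups can be skipped.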
ORC (Optimized Row Columnar)
- Layout: file → stripes → row groups (within stripe) → indexes + column data.
- Compression: stripe-level zlib/snappy/zstd; index for skip.
- Born: Hortonworks 2013 для Hive ecosystem.
- Limited evolution: rename and type promotion support is weaker than in Parquet.
- Cross-course deep dive: Storage Formats M03, ORC (7 lessons): stripes, indexes, encodings, ACID semantics, Bloom filters.
Avro (Apache Avro)
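- Layout: row-based, in blocks of N records, with the schema embedded in the file header.
- Compression: block-level. The Avro spec requires only the null (no compression) and deflate codecs; snappy is an optional codec that is widely used in practice.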
- Schema evolution: STRONG — explicit forward/backward compat rules (Avro spec § Schema Resolution). Reader schema differs from writer schema → automatic projection / default value fill / type promotion.
- Use case: streaming (Kafka) — schema registry stores writer schema, consumers pin reader schema, evolution managed without breaking pipelines.
- Cross-course deep dive: Storage Formats M04, Avro (6 lessons): schema type system, evolution rules, schema registry, Kafka integration.
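The three behaviors named under schema evolution (projection, default value fill, type promotion) can be sketched without fastavro. The `resolve` function and the dict-based schema shape below are hypothetical simplifications; the Avro spec's Schema Resolution rules cover many more cases (unions, aliases, nested records).

```python
def resolve(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields (toy model)."""
    resolved = {}
    for name, spec in reader_fields.items():
        if name in writer_fields:
            value = record[name]
            written_type = writer_fields[name]
            if written_type == "int" and spec["type"] == "double":
                value = float(value)  # type promotion: int -> double
            elif written_type != spec["type"]:
                raise TypeError(f"cannot read {written_type} as {spec['type']}")
            resolved[name] = value
        elif "default" in spec:
            resolved[name] = spec["default"]  # field unknown to the writer
        else:
            raise ValueError(f"field {name!r} has no writer value and no default")
    # Writer fields absent from reader_fields (here: role) are dropped.
    return resolved


writer_fields = {"name": "string", "age": "int", "role": "string"}
reader_fields = {
    "name": {"type": "string"},
    "age": {"type": "double"},
    "team": {"type": "string", "default": "unknown"},
}

record = {"name": "alice", "age": 30, "role": "dev"}
print(resolve(record, writer_fields, reader_fields))
```

A reader field with neither a writer value nor a default is the case that breaks compatibility, which is why the sketch raises instead of guessing a value.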
Arrow IPC (Apache Arrow Inter-Process Communication)
- Layout: in-memory columnar buffers — same memory representation across libraries.
- Compression: optional — usually skipped (zero-copy is the point).
- Format: streaming (a sequence of record batches with no footer) or file (with a footer that repeats the schema and records batch offsets for random access).
- Killer feature: zero-copy sharing between pandas / Polars / DuckDB / DataFusion / Spark Arrow UDFs, where the same memory buffers are shared without a serialize/deserialize cycle.
- Cross-course deep dive: Storage Formats M07, Arrow (7 lessons): memory layout, type system, IPC format, Feather, Flight protocol, ecosystem, memory viewer.
- Cross-course → DataFusion: DataFusion 01/02 — Arrow memory layout — DataFusion uses Arrow as in-memory representation.
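Zero-copy is easier to picture with a small buffer-sharing experiment. The sketch below uses Python's own buffer protocol (`array` and `memoryview`) purely as an analogy: it is not Arrow IPC, but it shows the difference between two consumers sharing one typed buffer and a consumer that receives a copy.

```python
from array import array

# One contiguous, typed buffer owned by the "producer".
ages = array("i", [30, 25, 35])

# Zero-copy consumer: a view over the same memory, no bytes are duplicated.
view = memoryview(ages)

# Copying consumer: a new, independent Python list.
copied = ages.tolist()

# Mutate the producer's buffer after both consumers were created.
ages[0] = 31

print("view sees:  ", view[0])
print("copy sees:  ", copied[0])
print("view bytes: ", view.nbytes)
```

The view reflects the update because it never owned separate data, while the copy keeps the old value. Arrow applies the same principle across libraries by standardizing the buffer layout itself.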
Decision tree: which format when
Workload?
├── Analytical reads (OLAP): column projections, aggregations
│   ├── Spark / Trino / DuckDB / Impala: Parquet (default)
│   └── Hive: ORC (deep ecosystem integration)
│
├── Streaming ingestion + schema evolution
│   └── Kafka / Flink: Avro (schema registry + forward/backward compat)
│
├── In-memory cross-library transfer (no disk persistence)
│   └── pandas → Polars → DuckDB → DataFusion: Arrow IPC (zero-copy)
│
└── Interchange / debugging / human-readable
    └── CSV / JSON (M09 lessons 02-03)
Production rule: don’t overthink it. Default to Parquet for new data lakes. Use ORC only under Hive lock-in, and Avro for streaming that is heavy on schema evolution. Arrow IPC is an implementation detail between libraries and is rarely chosen explicitly by the user.
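The decision tree can also be written down as a small lookup function, which is convenient for checking a pipeline design against the rules of this lesson. The function name `choose_format` and the workload labels are hypothetical; the branches mirror the tree and nothing more.

```python
def choose_format(workload, engine=None):
    """Map a workload (and optionally an engine) to the format the tree suggests."""
    if workload == "analytical":
        # ORC only under Hive lock-in; Parquet is the default otherwise.
        return "ORC" if engine == "hive" else "Parquet"
    if workload == "streaming":
        return "Avro"
    if workload == "in_memory":
        return "Arrow IPC"
    if workload == "interchange":
        return "CSV/JSON"
    raise ValueError(f"unknown workload: {workload!r}")


for workload, engine in [
    ("analytical", "spark"),
    ("analytical", "hive"),
    ("streaming", "kafka"),
    ("in_memory", None),
    ("interchange", None),
]:
    print(workload, engine, "->", choose_format(workload, engine))
```

Raising on an unknown workload is deliberate: a silent fallback to Parquet would hide cases the tree does not cover.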
Pitfall 30 — don’t deep-dive Parquet internals here
Anti-pattern: lesson 04 tries to explain Parquet predicate pushdown internals, dictionary encoding, RLE, bit-packing, page indexes…
Why it goes wrong: these deep-dive topics already take seven lessons in the Storage Formats course (M02). Duplication leads to:
- divergent explanations (one course is updated first; the other goes stale);
- pragmatic-DEEP overrun: a 28 min lesson turns into a 90 min one.
Correct framing: lesson 04 is a matrix plus a decision tree. Each format gets a short paragraph and a cross-course link to its deep dive. The Storage Formats course covers internals; M09 lesson 04 covers selection criteria.
Production rule: at any point where you are about to write “Parquet automatically does X internally…”, STOP and cross-link to Storage Formats. M09 is a bridge phase, not a competing deep dive.
Cross-course references — Storage Formats deep dives
Phase 68 main feature: heavy cross-course bridges. The Storage Formats course ships 27 lessons across 4 modules describing exactly these binary formats:
Storage Formats M02: Parquet (7 lessons)
- 01 — Row groups — file structure
- 02 — Column chunks — chunk layout
- 03 — Pages — page-level encoding
- 04 — Encodings — RLE, bit-packing, dictionary
- 05 — Metadata + statistics — predicate pushdown
- 06 — Nested data (Dremel)
- 07 — Parquet file viewer — interactive
Storage Formats M03: ORC (7 lessons)
- 01 — Stripes — stripe layout
- 02-07 — indexes, encodings, ACID, Bloom filters
Storage Formats M04: Avro (6 lessons)
- 02 — Schema type system — types + evolution rules
- remaining lessons: schema registry, Kafka integration
Storage Formats M07: Arrow (7 lessons)
- 01 — Memory layout — buffer structure
- 02 — Type system — physical/logical types
- 03 — IPC format — streaming + file
- 04 — Feather format — IPC subset
- 05 — Flight protocol — gRPC streaming
- 06-07 — ecosystem + viewer
DataFusion — Arrow foundation
- DataFusion 01/02: Arrow memory layout, with DataFusion as an Arrow consumer
Three-layer cross-course bridge: M09 lesson 04 (decision tree) → Storage Formats (format internals) → DataFusion / Spark / ClickHouse (engine integration). Each layer adds depth.
Newer formats — brief mention
Modern data lakes use table formats on top of Parquet:
| Format | What | Status |
|---|---|---|
| Apache Iceberg | Open table format — schema evolution + time travel + ACID + hidden partitioning | Industry standard 2024+; AWS / Snowflake / Databricks adopted |
| Delta Lake | Databricks-originated; ACID + schema evolution + time travel | Strong Databricks ecosystem |
| Apache Hudi | Streaming-friendly upserts + change data capture | Used at Uber / Robinhood |
| Apache Paimon | Flink-native — stream-batch unified | Newer (2023+); Flink ecosystem |
These table formats are NOT file formats; they manage Parquet/ORC files (manifests, snapshots, statistics). They are out of scope for M09 lesson 04; Storage Formats M11-M14 covers them (if those modules have shipped in the course catalog).
Production guidance 2026: for new projects, use Iceberg + Parquet; ecosystem support spans Spark / Trino / DuckDB / Snowflake / AWS Athena.
What’s in the next lesson
Lesson 05: Compression formats (gzip / bzip2 / lz4 / zstd / snappy). These codecs are reused across the binary formats of lesson 04: Parquet column-chunk compression (zstd or snappy), ORC stripe compression, and Avro block compression. Lesson 05 offers a speed-vs-ratio tradeoff matrix plus a Pyodide-safe gzip.decompress demo via io.BytesIO. Cross-course → Storage Formats M09 (compression internals: BtrBlocks / FastLanes / ALP / FSST).
Pragmatic-DEEP principle: selection criteria live in M09; internals live in Storage Formats. Don’t compete with the Storage Formats course; augment it via cross-links (Pitfall 30).