PyArrow — Arrow memory model, zero-copy, Table/RecordBatch
PyArrow (Apache Arrow Python docs, since 2016) — Python bindings к Apache Arrow. Arrow — universal columnar in-memory format для cross-library zero-copy data exchange. Это не библиотека для DataFrame manipulation (как pandas / Polars) — это memory layer под ними. Понимание Arrow — ключ к пониманию почему modern DE stack (pandas 2.0+ / Polars / DataFusion / Spark Arrow / DuckDB / ClickHouse Arrow output) interoperates без serialize-deserialize overhead.
В этом уроке:
- Why Arrow — universal columnar standard.
- Arrow memory model — fixed-width primitives + variable-length offsets + null bitmap.
pa.Table/pa.RecordBatch/pa.Schema— three core types.- Zero-copy semantics — same memory accessed by multiple libraries.
- PyArrow API surface —
parquet,ipc,compute,csv,json. - Cross-link M02 урок 01 — PyListObject contiguity → Arrow cross-library extension.
- Run-on-Your-Machine #3 —
pa.Table.from_pylist+ Parquet metadata. - HEAVY cross-course — Storage Formats M07 Arrow + Spark M11 Arrow Module + Spark M06 Pandas UDFs.
Why Arrow — universal columnar standard
До Arrow (pre-2016): каждая library имела собственный binary format. pandas pickle ≠ NumPy .npy ≠ Spark internal ≠ DataFusion internal. Cross-library data exchange требовал serialize → wire → deserialize cycle (CPU + memory cost).
С Arrow (2016+): universal columnar in-memory format с published spec (Apache Arrow Format spec). Любая library, реализующая Arrow specification, читает чужие Arrow buffers без копирования. Это zero-copy interop.
Production реализации:
- pandas 2.0+ (PyArrow-backed dtypes opt-in)
- Polars (Arrow always — backbone)
- DuckDB (Arrow-compatible result format)
- Spark 3.0+ (Arrow для PySpark Pandas UDFs + columnar shuffle)
- ClickHouse (Arrow input/output format)
- DataFusion (Arrow native)
Key insight: Arrow — это NOT disk format. Arrow IPC можно serialize на disk (или transfer via network — Flight protocol), но primary purpose — in-memory layout for zero-copy access. Disk formats (Parquet, ORC) — separate concern (cross-link Storage Formats M02 / M03).
Arrow memory model — primitives + variable-length + null bitmap
Arrow стандартизует memory layout per data type. Three categories:
1. Fixed-width primitives (int8/16/32/64, float32/64, bool):
column = [int32: 1, 2, 3, 4, 5]
memory = [01 00 00 00 | 02 00 00 00 | 03 00 00 00 | 04 00 00 00 | 05 00 00 00]
(5 contiguous 4-byte slots — little-endian)
Это тот же memory layout что NumPy 1D array. SIMD-friendly. Cross-link M02 урок 01: аналогично PyListObject ob_item[] — contiguous memory; разница — Arrow хранит сами values (как numpy), а PyListObject хранит PyObject pointers (extra indirection). Arrow extends contiguity insight cross-library — same values читаемы pandas / Polars / DuckDB / Spark без copy.
2. Variable-length (string, binary, list):
column = [string: "alice", "bob", "carol"]
values_buffer = [a l i c e b o b c a r o l]
offsets_buffer = [0, 5, 8, 13]
(start offsets per element, length = N+1)
# Read element i: values[offsets[i] : offsets[i+1]]
# Read element 1 ("bob"): values[5:8] = "bob"
Two buffers: values (concatenated) + offsets (start positions). Reading element i — values[offsets[i] : offsets[i+1]] — O(1) per element.
3. Null bitmap — separate from value buffer:
column = [int32: 1, NULL, 3, NULL, 5]
values_buffer = [01 00 00 00 | 00 00 00 00 | 03 00 00 00 | 00 00 00 00 | 05 00 00 00]
↑ junk
null_bitmap = [10101] (LSB-first: index 0=valid, 1=null, 2=valid, 3=null, 4=valid)
Bit-per-element null indicator — separate buffer. Это позволяет efficient null-aware operations (mask AND value comparison) без re-scanning memory.
Cite Apache Arrow Format spec — Layout.
pa.Table / pa.RecordBatch / pa.Schema — three core types
import pyarrow as pa
# Schema — metadata describing column names + types
schema = pa.schema([
pa.field('id', pa.int64()),
pa.field('name', pa.string()),
pa.field('amount', pa.float64()),
])
# RecordBatch — single contiguous chunk of all columns
batch = pa.RecordBatch.from_pylist([
{'id': 1, 'name': 'alice', 'amount': 100.5},
{'id': 2, 'name': 'bob', 'amount': 200.0},
])
# Table — collection of named ChunkedArrays (one per column)
# Может состоять из multiple RecordBatches (если данных много — chunked)
table = pa.Table.from_pylist([
{'id': 1, 'name': 'alice', 'amount': 100.5},
{'id': 2, 'name': 'bob', 'amount': 200.0},
{'id': 3, 'name': 'carol', 'amount': 50.0},
])
print(table.schema)
# id: int64
# name: string
# amount: double
Three-tier hierarchy:
| Type | Структура | Use case |
|---|---|---|
pa.Schema | Metadata (names + types) | Описание format, без data |
pa.RecordBatch | Single chunk — все columns same length, contiguous | Streaming chunks (Arrow IPC) |
pa.Table | Multiple RecordBatches concatenated logically (column = ChunkedArray) | DataFrame-like collection |
Pitfall: pa.Table — collection of ChunkedArray (logical concatenation; не physical). Operations возвращают новые Table объекты — immutable. Это совпадает с Polars’ immutable model. Вернёмся к immutability в M10 урок 04 (Arrow C Data Interface).
Zero-copy semantics — same memory accessed by multiple libraries
import pyarrow as pa
# Build Arrow Table once
table = pa.Table.from_pylist([{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}])
# Convert к pandas — zero-copy when types align
df_pd = table.to_pandas(zero_copy_only=False) # zero_copy_only=True для strict
# Convert к Polars — zero-copy через Arrow C Data Interface
import polars as pl
df_pl = pl.from_arrow(table)
# Convert к DuckDB — zero-copy через Arrow integration
import duckdb
con = duckdb.connect()
con.register('arrow_view', table)
result = con.execute("SELECT * FROM arrow_view WHERE a > 1").arrow()
Critical insight для DE production pipelines: save CPU + memory bandwidth.
Старая модель (без Arrow):
pandas DataFrame → pickle.dumps → bytes → wire/disk → bytes → pickle.loads → Polars DataFrame
^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
CPU cost CPU cost
+ 2x memory (orig + serialized) + 2x memory (serialized + orig)
Новая модель (Arrow):
pandas DataFrame ← (pointer) → Arrow buffers ← (pointer) → Polars DataFrame
same RAM
Один и тот же memory accessed by multiple libraries — zero-copy. Это работает потому что все трое (pandas 2.0+ Arrow-backed mode, Polars, DuckDB) используют тот же Arrow specification для buffers.
Cite Apache Arrow Format spec + PyArrow Python docs.
PyArrow API surface — parquet, ipc, compute, csv, json
Top-level package pyarrow — core types (Table, RecordBatch, Schema, Array). Submodules для IO + compute:
| Submodule | Назначение | Key APIs |
|---|---|---|
pyarrow.parquet | Read/write Parquet | pq.read_table, pq.write_table, pq.read_metadata, pq.ParquetDataset |
pyarrow.ipc | Arrow IPC stream/file format | ipc.new_stream, ipc.open_file, ipc.RecordBatchStreamReader |
pyarrow.compute | Arrow-native compute kernels (Acero engine) | pc.greater, pc.add, pc.match_substring, pc.aggregate |
pyarrow.csv | Multi-threaded CSV reader (faster than pandas) | csv.read_csv, csv.write_csv, csv.ParseOptions |
pyarrow.json | Multi-threaded JSON reader | json.read_json, json.ReadOptions |
pyarrow.feather | Feather V2 format (Arrow IPC file) | feather.read_table, feather.write_feather |
pyarrow.dataset | Partitioned dataset abstraction | ds.dataset(...) — multi-file scan |
pyarrow.flight | Arrow Flight RPC protocol | client/server primitives |
Parquet workflow example:
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
pq.write_table(table, 'demo.parquet') # write
restored = pq.read_table('demo.parquet') # read full table
metadata = pq.read_metadata('demo.parquet') # metadata-only (fast)
print(metadata.num_rows, metadata.num_row_groups) # 3, 1
print(metadata.schema) # full schema
pq.read_metadata — без чтения data buffers. Используется для catalog inspection / partition pruning. Cross-link М09 урок 04 (binary-formats-overview matrix) — Parquet row group statistics enable predicate pushdown — теперь видим как PyArrow exposes metadata API.
Run-on-Your-Machine — PyArrow Table + Parquet metadata
Run-on-Your-Machine: PyArrow Table + Parquet metadata
Установите локально (PyArrow 50MB+; C++/Rust dependencies; не bundled в Pyodide default):
pip install 'pyarrow>=15.0'Создайте файл pyarrow_demo.py:
import pyarrow as pa
import pyarrow.parquet as pq
# Create Arrow Table from Python list of dicts
data = [
{'id': 1, 'name': 'alice', 'amount': 100.5},
{'id': 2, 'name': 'bob', 'amount': 200.0},
]
table = pa.Table.from_pylist(data)
print(table.schema)
# Write + re-read Parquet
pq.write_table(table, 'demo.parquet')
metadata = pq.read_metadata('demo.parquet')
print(f"row groups: {metadata.num_row_groups}, rows: {metadata.num_rows}")
# Zero-copy → pandas (если pandas 2.0+ Arrow-backed)
df = table.to_pandas()
print(df)Запустите:
python3 pyarrow_demo.pyОжидаемый вывод:
id: int64
name: string
amount: double
row groups: 1, rows: 2
id name amount
0 1 alice 100.5
1 2 bob 200.0Deep dive Parquet internals (column chunks, encodings, footer) — cross-course → Storage Formats M02 — 7 уроков deep dive. Version pin >=15.0 (Pitfall 32) — PyArrow 15+ stable C Data Interface support; older versions have schema evolution gaps.
Cross-link M02 урок 01 — PyListObject contiguous → Arrow columnar cross-library
Memory layout ladder:
| Level | Структура | Memory layout | M_module |
|---|---|---|---|
| 1. Python list | PyListObject->ob_item[] array of PyObject* | contiguous pointers; values heap-scattered | M02 урок 01 |
| 2. NumPy ndarray | numpy.ndarray->data array of values | contiguous values | external (NumPy ~ pandas backbone) |
| 3. Arrow column buffer | values_buffer (+ offsets если variable-length) | contiguous values + cross-library standard | M10 урок 03 (this) |
PyListObject (M02 урок 01) уже преподал contiguous-memory insight — почему list[i] is O(1), почему amortized append is O(1). Arrow extends этот insight cross-library: columns layout совместимы между pandas / Polars / DuckDB / Spark, потому что все используют Arrow specification.
Pedagogical bridge: M02 урок 01 → M10 урок 03 — same architectural primitive (contiguous memory) применяется на разных уровнях. Single-language → cross-library standard.
HEAVY cross-course — Storage Formats M07 Arrow + Spark M11 Arrow Module + Spark M06 Pandas UDFs
Storage Formats course M07 arrow — deep dive Arrow internals (7 уроков):
- 07/01 — memory-layout — full memory model spec.
- 07/02 — type-system — Arrow type system (primitives, nested, dictionary encoding, extensions).
- 07/03 — IPC format — IPC stream/file format spec — wire layout для cross-process zero-copy.
- 07/04 — feather format — Feather V2 (Arrow IPC file).
- 07/05 — flight protocol — Arrow Flight RPC.
Spark M11 arrow-module — Spark’s Arrow integration (5 lessons):
- 11/05 — pyarrow-spark — explicit PyArrow ↔ Spark interop (
spark.createDataFrame(pa.Table)zero-copy).
Spark M06 udf-performance — UDF performance:
- 06/03 — pandas-udfs-arrow — Pandas UDFs use Arrow for batch transfer (массovo ускоряет user-defined functions vs. row-by-row Python UDFs).
Three-layer cross-course bridge:
- Python lens (this lesson) — PyArrow API + memory model basics.
- Format internals lens (Storage Formats M07) — full Arrow spec deep-dive.
- Engine integration lens (Spark M06/M11) — Arrow ускоряет distributed compute.
Cross-course path: изучите M10 урок 03 → Storage Formats M07 (deep) → Spark M11 (production integration) → DataFusion 01 (Rust engine native Arrow). Это полный educational path для Arrow expertise.
Что в следующем уроке
М10 урок 04 — Arrow C Data Interface + PEP 749 (Python DataFrame interchange protocol). Объяснит как zero-copy работает через language boundaries (C → Python → Rust → Java) — ABI-stable C struct + reference counting. Закроет concept от disk (Parquet) до memory (Arrow Table) до cross-language wire (C Data Interface).