Learning Platform
Глоссарий Troubleshooting
Урок 11.03 · 25 мин
Средний
pyarrowArrowcolumnarzero-copyTableRecordBatchSchemaChunkedArraymemory-layoutArrow-IPCRun-on-Your-Machinecross-course

PyArrow — Arrow memory model, zero-copy, Table/RecordBatch

PyArrow (Apache Arrow Python docs, since 2016) — Python bindings к Apache Arrow. Arrow — universal columnar in-memory format для cross-library zero-copy data exchange. Это не библиотека для DataFrame manipulation (как pandas / Polars) — это memory layer под ними. Понимание Arrow — ключ к пониманию почему modern DE stack (pandas 2.0+ / Polars / DataFusion / Spark Arrow / DuckDB / ClickHouse Arrow output) interoperates без serialize-deserialize overhead.

В этом уроке:

  1. Why Arrow — universal columnar standard.
  2. Arrow memory model — fixed-width primitives + variable-length offsets + null bitmap.
  3. pa.Table / pa.RecordBatch / pa.Schema — three core types.
  4. Zero-copy semantics — same memory accessed by multiple libraries.
  5. PyArrow API surfaceparquet, ipc, compute, csv, json.
  6. Cross-link M02 урок 01 — PyListObject contiguity → Arrow cross-library extension.
  7. Run-on-Your-Machine #3pa.Table.from_pylist + Parquet metadata.
  8. HEAVY cross-course — Storage Formats M07 Arrow + Spark M11 Arrow Module + Spark M06 Pandas UDFs.

Why Arrow — universal columnar standard

До Arrow (pre-2016): каждая library имела собственный binary format. pandas pickle ≠ NumPy .npy ≠ Spark internal ≠ DataFusion internal. Cross-library data exchange требовал serialize → wire → deserialize cycle (CPU + memory cost).

С Arrow (2016+): universal columnar in-memory format с published spec (Apache Arrow Format spec). Любая library, реализующая Arrow specification, читает чужие Arrow buffers без копирования. Это zero-copy interop.

Production реализации:

  • pandas 2.0+ (PyArrow-backed dtypes opt-in)
  • Polars (Arrow always — backbone)
  • DuckDB (Arrow-compatible result format)
  • Spark 3.0+ (Arrow для PySpark Pandas UDFs + columnar shuffle)
  • ClickHouse (Arrow input/output format)
  • DataFusion (Arrow native)

Key insight: Arrow — это NOT disk format. Arrow IPC можно serialize на disk (или transfer via network — Flight protocol), но primary purpose — in-memory layout for zero-copy access. Disk formats (Parquet, ORC) — separate concern (cross-link Storage Formats M02 / M03).


Arrow memory model — primitives + variable-length + null bitmap

Arrow стандартизует memory layout per data type. Three categories:

1. Fixed-width primitives (int8/16/32/64, float32/64, bool):

column = [int32: 1, 2, 3, 4, 5]
memory =  [01 00 00 00 | 02 00 00 00 | 03 00 00 00 | 04 00 00 00 | 05 00 00 00]
          (5 contiguous 4-byte slots — little-endian)

Это тот же memory layout что NumPy 1D array. SIMD-friendly. Cross-link M02 урок 01: аналогично PyListObject ob_item[] — contiguous memory; разница — Arrow хранит сами values (как numpy), а PyListObject хранит PyObject pointers (extra indirection). Arrow extends contiguity insight cross-library — same values читаемы pandas / Polars / DuckDB / Spark без copy.

2. Variable-length (string, binary, list):

column = [string: "alice", "bob", "carol"]

values_buffer  =  [a l i c e b o b c a r o l]
offsets_buffer =  [0, 5, 8, 13]
                  (start offsets per element, length = N+1)

# Read element i: values[offsets[i] : offsets[i+1]]
# Read element 1 ("bob"): values[5:8] = "bob"

Two buffers: values (concatenated) + offsets (start positions). Reading element i — values[offsets[i] : offsets[i+1]] — O(1) per element.

3. Null bitmap — separate from value buffer:

column = [int32: 1, NULL, 3, NULL, 5]

values_buffer = [01 00 00 00 | 00 00 00 00 | 03 00 00 00 | 00 00 00 00 | 05 00 00 00]
                                ↑ junk
null_bitmap   = [10101]                 (LSB-first: index 0=valid, 1=null, 2=valid, 3=null, 4=valid)

Bit-per-element null indicator — separate buffer. Это позволяет efficient null-aware operations (mask AND value comparison) без re-scanning memory.

Cite Apache Arrow Format spec — Layout.


pa.Table / pa.RecordBatch / pa.Schema — three core types

import pyarrow as pa

# Schema — metadata describing column names + types
schema = pa.schema([
    pa.field('id', pa.int64()),
    pa.field('name', pa.string()),
    pa.field('amount', pa.float64()),
])

# RecordBatch — single contiguous chunk of all columns
batch = pa.RecordBatch.from_pylist([
    {'id': 1, 'name': 'alice', 'amount': 100.5},
    {'id': 2, 'name': 'bob',   'amount': 200.0},
])

# Table — collection of named ChunkedArrays (one per column)
# Может состоять из multiple RecordBatches (если данных много — chunked)
table = pa.Table.from_pylist([
    {'id': 1, 'name': 'alice', 'amount': 100.5},
    {'id': 2, 'name': 'bob',   'amount': 200.0},
    {'id': 3, 'name': 'carol', 'amount': 50.0},
])

print(table.schema)
# id: int64
# name: string
# amount: double

Three-tier hierarchy:

TypeСтруктураUse case
pa.SchemaMetadata (names + types)Описание format, без data
pa.RecordBatchSingle chunk — все columns same length, contiguousStreaming chunks (Arrow IPC)
pa.TableMultiple RecordBatches concatenated logically (column = ChunkedArray)DataFrame-like collection

Pitfall: pa.Table — collection of ChunkedArray (logical concatenation; не physical). Operations возвращают новые Table объекты — immutable. Это совпадает с Polars’ immutable model. Вернёмся к immutability в M10 урок 04 (Arrow C Data Interface).


Zero-copy semantics — same memory accessed by multiple libraries

import pyarrow as pa

# Build Arrow Table once
table = pa.Table.from_pylist([{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}])

# Convert к pandas — zero-copy when types align
df_pd = table.to_pandas(zero_copy_only=False)  # zero_copy_only=True для strict

# Convert к Polars — zero-copy через Arrow C Data Interface
import polars as pl
df_pl = pl.from_arrow(table)

# Convert к DuckDB — zero-copy через Arrow integration
import duckdb
con = duckdb.connect()
con.register('arrow_view', table)
result = con.execute("SELECT * FROM arrow_view WHERE a > 1").arrow()

Critical insight для DE production pipelines: save CPU + memory bandwidth.

Старая модель (без Arrow):

pandas DataFrame → pickle.dumps → bytes → wire/disk → bytes → pickle.loads → Polars DataFrame
                  ^^^^^^^^^^^^^^                              ^^^^^^^^^^^^^^
                  CPU cost                                    CPU cost
                  + 2x memory (orig + serialized)             + 2x memory (serialized + orig)

Новая модель (Arrow):

pandas DataFrame ← (pointer) → Arrow buffers ← (pointer) → Polars DataFrame
                                  same RAM

Один и тот же memory accessed by multiple libraries — zero-copy. Это работает потому что все трое (pandas 2.0+ Arrow-backed mode, Polars, DuckDB) используют тот же Arrow specification для buffers.

Cite Apache Arrow Format spec + PyArrow Python docs.


PyArrow API surface — parquet, ipc, compute, csv, json

Top-level package pyarrow — core types (Table, RecordBatch, Schema, Array). Submodules для IO + compute:

SubmoduleНазначениеKey APIs
pyarrow.parquetRead/write Parquetpq.read_table, pq.write_table, pq.read_metadata, pq.ParquetDataset
pyarrow.ipcArrow IPC stream/file formatipc.new_stream, ipc.open_file, ipc.RecordBatchStreamReader
pyarrow.computeArrow-native compute kernels (Acero engine)pc.greater, pc.add, pc.match_substring, pc.aggregate
pyarrow.csvMulti-threaded CSV reader (faster than pandas)csv.read_csv, csv.write_csv, csv.ParseOptions
pyarrow.jsonMulti-threaded JSON readerjson.read_json, json.ReadOptions
pyarrow.featherFeather V2 format (Arrow IPC file)feather.read_table, feather.write_feather
pyarrow.datasetPartitioned dataset abstractionds.dataset(...) — multi-file scan
pyarrow.flightArrow Flight RPC protocolclient/server primitives

Parquet workflow example:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

pq.write_table(table, 'demo.parquet')              # write
restored = pq.read_table('demo.parquet')           # read full table
metadata = pq.read_metadata('demo.parquet')        # metadata-only (fast)

print(metadata.num_rows, metadata.num_row_groups)   # 3, 1
print(metadata.schema)                              # full schema

pq.read_metadata — без чтения data buffers. Используется для catalog inspection / partition pruning. Cross-link М09 урок 04 (binary-formats-overview matrix) — Parquet row group statistics enable predicate pushdown — теперь видим как PyArrow exposes metadata API.


Run-on-Your-Machine — PyArrow Table + Parquet metadata

TIP

Run-on-Your-Machine: PyArrow Table + Parquet metadata

Установите локально (PyArrow 50MB+; C++/Rust dependencies; не bundled в Pyodide default):

pip install 'pyarrow>=15.0'

Создайте файл pyarrow_demo.py:

import pyarrow as pa
import pyarrow.parquet as pq

# Create Arrow Table from Python list of dicts
data = [
    {'id': 1, 'name': 'alice', 'amount': 100.5},
    {'id': 2, 'name': 'bob', 'amount': 200.0},
]
table = pa.Table.from_pylist(data)
print(table.schema)

# Write + re-read Parquet
pq.write_table(table, 'demo.parquet')
metadata = pq.read_metadata('demo.parquet')
print(f"row groups: {metadata.num_row_groups}, rows: {metadata.num_rows}")

# Zero-copy → pandas (если pandas 2.0+ Arrow-backed)
df = table.to_pandas()
print(df)

Запустите:

python3 pyarrow_demo.py

Ожидаемый вывод:

id: int64
name: string
amount: double
row groups: 1, rows: 2
   id   name  amount
0   1  alice   100.5
1   2    bob   200.0

Deep dive Parquet internals (column chunks, encodings, footer) — cross-course → Storage Formats M02 — 7 уроков deep dive. Version pin >=15.0 (Pitfall 32) — PyArrow 15+ stable C Data Interface support; older versions have schema evolution gaps.


Memory layout ladder:

LevelСтруктураMemory layoutM_module
1. Python listPyListObject->ob_item[] array of PyObject*contiguous pointers; values heap-scatteredM02 урок 01
2. NumPy ndarraynumpy.ndarray->data array of valuescontiguous valuesexternal (NumPy ~ pandas backbone)
3. Arrow column buffervalues_buffer (+ offsets если variable-length)contiguous values + cross-library standardM10 урок 03 (this)

PyListObject (M02 урок 01) уже преподал contiguous-memory insight — почему list[i] is O(1), почему amortized append is O(1). Arrow extends этот insight cross-library: columns layout совместимы между pandas / Polars / DuckDB / Spark, потому что все используют Arrow specification.

Pedagogical bridge: M02 урок 01 → M10 урок 03 — same architectural primitive (contiguous memory) применяется на разных уровнях. Single-language → cross-library standard.


HEAVY cross-course — Storage Formats M07 Arrow + Spark M11 Arrow Module + Spark M06 Pandas UDFs

Storage Formats course M07 arrowdeep dive Arrow internals (7 уроков):

Spark M11 arrow-module — Spark’s Arrow integration (5 lessons):

Spark M06 udf-performance — UDF performance:

  • 06/03 — pandas-udfs-arrow — Pandas UDFs use Arrow for batch transfer (массovo ускоряет user-defined functions vs. row-by-row Python UDFs).

Three-layer cross-course bridge:

  1. Python lens (this lesson) — PyArrow API + memory model basics.
  2. Format internals lens (Storage Formats M07) — full Arrow spec deep-dive.
  3. Engine integration lens (Spark M06/M11) — Arrow ускоряет distributed compute.

Cross-course path: изучите M10 урок 03 → Storage Formats M07 (deep) → Spark M11 (production integration) → DataFusion 01 (Rust engine native Arrow). Это полный educational path для Arrow expertise.


Что в следующем уроке

М10 урок 04 — Arrow C Data Interface + PEP 749 (Python DataFrame interchange protocol). Объяснит как zero-copy работает через language boundaries (C → Python → Rust → Java) — ABI-stable C struct + reference counting. Закроет concept от disk (Parquet) до memory (Arrow Table) до cross-language wire (C Data Interface).

Проверьте понимание

Результат: 0 из 0
Прикладной
Вопрос 1 из 4. **Apply scenario — zero-copy:** Production pipeline: Polars читает Parquet → передаёт результат pandas (PyArrow-backed dtypes pandas 2.0+) → DuckDB SQL query. Сколько раз data копируется в memory?

Закончили урок?

Отметьте его как пройденный, чтобы отслеживать свой прогресс

Войдите чтобы оценить урок

Прогресс модуля
0 из 7