Decision matrix: when to use pandas / Polars / PyArrow / DuckDB / Spark / DataFusion

Engineering choice between data libraries / engines depends на workload + constraints. No one-size-fits-all — pandas хорош для interactive REPL, Polars для production single-machine ETL, Spark для distributed big data. Этот урок предлагает factor matrix + decision tree — pragmatic guidance для production teams. Включает optional code-challenge py-m10-06-code-1 (Pattern 4 variant — top-K via Counter.most_common).

В этом уроке:

Why decision matrix — engineering choice depends on workload + constraints.
Factor matrix — six tools across five factors.
Decision tree — walkthrough.
Brief DuckDB mention — embedded analytical SQL.
Code-challenge py-m10-06-code-1 — Pattern 4 variant top-K.
Cross-course — Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules).

Why decision matrix

Production team faces вопросы:

Workload — analytical reads / ETL / streaming / interactive Jupyter / SQL OLAP?
Data size — KB / MB / GB / TB / PB?
Compute — single-machine / distributed cluster?
Ecosystem — Python-only / Rust / JVM / multi-language?
Existing stack — already на Spark / Snowflake / ClickHouse?

Без structured frame — choice ad-hoc, technology proliferation, maintenance burden grows. Decision matrix даёт shared vocabulary + repeatable framework.

Factor matrix — six tools across five factors

Tool	Data size	Mode	Memory model	Primary use case
pandas	<1GB single-machine	eager	numpy ndarray (default); Arrow opt-in 2.0+	interactive REPL / Jupyter / quick scripts
Polars	<100GB single-machine	lazy (recommend)	Arrow always	production ETL replace pandas at scale
PyArrow	any size	format/interop layer	Arrow native	data exchange between libraries; Parquet/IPC IO
DuckDB	<1TB single-machine	SQL embedded	columnar custom + Arrow-compatible	embedded analytical SQL; medium-data analytics
chDB	<1TB single-machine	SQL embedded	ClickHouse columnar (MergeTree) + Arrow bridge	embedded ClickHouse engine in-process; OLAP без сервера
DataFusion-Python	<1TB embedded; cluster через Ballista	SQL + DataFrame	Arrow native (Rust)	Rust query engine с Python bindings; embeddable + customizable
Spark	TB-PB distributed	SQL + DataFrame distributed	Spark SQL Catalyst + Arrow shuffle	big-data distributed ETL/analytics
DataFusion	<1TB embedded; cluster-distributed via Ballista	SQL + DataFrame	Arrow native (Rust)	Rust-native query engine; embeddable + customizable

Factor explanations:

Data size threshold — приближённый ceiling per tool. Polars может handle 100GB+ если sufficient RAM; Spark scales к PB через cluster.
Mode — eager / lazy / SQL / format-layer. Влияет на debugging (eager — simpler stack traces) vs optimization (lazy — query optimizer).
Memory model — какой in-memory representation. Arrow-based tools cross-compatible (zero-copy interop, M10 урок 04).
Primary use case — opinionated default; tools могут use overlapping use-cases.

Decision tree walkthrough

Q: Какой у вас workload?

Interactive analysis / Jupyter exploration → pandas (eager, REPL-friendly)
Production single-machine ETL pipeline (medium data, <100 GB)
- Нужны lazy + optimizer? → Polars (recommended default 2024+)
- Предпочитаете SQL-синтаксис? → DuckDB
Multi-library zero-copy pipeline (pandas → Polars → DuckDB → ClickHouse) → PyArrow (format/interop layer; не tool сам по себе)
Big data, distributed cluster (TB–PB)
- Battle-tested production / Hadoop / EMR ecosystem → Spark
- Rust-native + embeddable engine → DataFusion (или Ballista для distributed)
Analytical SQL OLAP (medium–large data)
- Embedded в Python application → DuckDB
- Standalone server, multi-user, multi-tenancy → ClickHouse (cross-course)

Production rule: default к Polars для new single-machine ETL workloads (post-2024); pandas остаётся для notebooks / interactive analysis. Spark когда уже на distributed cluster или PB-scale data. DataFusion для engine-embedding (custom analytical features в Rust app).

Brief DuckDB mention

DuckDB (duckdb.org) — embedded analytical OLAP database. Не имеет dedicated course в этой platform v2.4 — Open Question 7 расколол: brief mention здесь, full coverage в possible v3 milestone.

Why mention:

Embedded — like SQLite для analytics; ноль setup; single-process.
Columnar storage + vectorized execution — analytical queries fast.
Arrow-compatible result format — zero-copy в Polars / pandas (cross-link M10 урок 04).
Read-write Parquet, CSV, JSON natively через read_parquet, read_csv SQL functions.

When DuckDB beats Polars:

Workload — predominantly SQL (легче maintain SQL queries чем method-chained DataFrames).
Need joins / window functions / SQL features which DataFrame DSL не surface elegantly.
Embedded в Python application — import duckdb; con = duckdb.connect() — ноль setup.

When Polars beats DuckDB:

Workload — predominantly Python DataFrame manipulation (filters / aggregations / mutations).
Need columnar lazy plan inspection через .explain().
Multi-step pipeline composes natural в Python (not SQL string templating).

Production reality: оба часто co-existing. Polars для transformation, DuckDB для analytical SQL queries on результат. Arrow zero-copy interop означает minimal handoff cost.

Brief chDB mention — embedded ClickHouse в Python

chDB — embedded ClickHouse (PyPI pip install chdb) — full ClickHouse engine в одном Python process, без сервера. Аналог DuckDB но на ClickHouse codebase.

# (Run-on-Your-Machine — НЕ Pyodide; chDB requires native binaries)
import chdb

# In-process SQL execution — full ClickHouse engine
result = chdb.query("SELECT count(), uniq(user_id) FROM file('events.parquet', Parquet)", "Pretty")
print(result)

# Stateful session — temporary tables, MergeTree, persistence
sess = chdb.session.Session("/path/to/data")
sess.query("CREATE TABLE events (id UInt64, ts DateTime) ENGINE = MergeTree ORDER BY id")

Когда выбирать chDB:

OLAP-heavy — ClickHouse aggregation engine — один из самых быстрых in-class (HyperLogLog, MergeTree partitioning, vectorized execution).
ClickHouse SQL dialect — компатибельность с production ClickHouse cluster (one-codepath dev/prod queries).
Streaming aggregation — topK, quantileTDigest, uniqCombined — production-grade approximate algorithms из коробки.
Data engineering pipelines — Parquet/CSV/JSON ingestion + columnar MergeTree storage в одном процессе.

chDB vs DuckDB:

chDB — ClickHouse SQL dialect, MergeTree storage, large analytical functions library. Heavier binary (~150 MB chdb wheel vs ~30 MB duckdb).
DuckDB — standard SQL (PostgreSQL-compatible), smaller footprint, broader integration ecosystem.

Cite chDB GitHub + chDB Python docs.

Brief DataFusion-Python mention — Rust query engine с Python bindings

datafusion (PyPI pip install datafusion) — Python bindings для Apache DataFusion (Rust SQL/DataFrame engine, see dedicated DataFusion course). Arrow-native; zero-copy interop с Polars / pandas / pyarrow.

# (Run-on-Your-Machine — нативные Rust binaries; не Pyodide)
import datafusion

ctx = datafusion.SessionContext()
ctx.register_parquet("events", "events.parquet")

df = ctx.sql("SELECT user_id, count(*) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10")
print(df.to_pandas())          # zero-copy Arrow → pandas

Когда выбирать DataFusion-Python:

Arrow-native pipeline — zero-copy interop с Polars / pandas / pyarrow, без сериализации между шагами.
Кастомные UDF на Python с Rust-производительностью основной части — UDF/UDAF можно писать на Python через register_udf, остальное — DataFusion Rust.
Embedded query engine в Python application — встраиваемый, customizable execution; кастомные TableProvider (см. DataFusion course).
SQL + DataFrame одновременно — оба API в одном инструменте, единый optimizer.

DataFusion-Python vs DuckDB / chDB:

DataFusion-Python — Arrow-native, Rust extensibility (custom optimizer rules / table providers), best для embedded Rust-quality engine.
DuckDB — broader SQL coverage, larger Python data ecosystem integration (Pandas-style API через duckdb.df()).
chDB — ClickHouse OLAP функции + MergeTree persistence; best для ClickHouse-dialect pipelines.

Cite datafusion.apache.org/python + DataFusion Python GitHub.

Code-challenge — `py-m10-06-code-1` (optional Pattern 4 variant)

В quiz JSON ниже — py-m10-06-code-1: top-K tags по частоте через collections.Counter.most_common(N). Pandas equivalent: pd.Series(tags).value_counts().head(N). ClickHouse equivalent: topK(N)(tag) (approximate). Spark equivalent: df.groupBy('tag').count().orderBy(desc('count')).limit(N) (distributed).

from collections import Counter

def solve(tags):
    return Counter(tags).most_common(3)

Algorithm — heap-based O(n log N): Counter.most_common(N) строит counts dict O(n), затем использует heapq.nlargest(N, counts.items(), key=lambda x: x[1]) — O(n log N) heap-based sort. Без N — fall-back O(n log n) full sort.

Production cross-course mapping:

Layer	Implementation	Complexity
Pure-stdlib (in this challenge)	`Counter(tags).most_common(N)`	O(n log N)
pandas single-machine	`pd.Series(tags).value_counts().head(N)`	O(n log n) sort + head
Polars single-machine	`pl.col('tag').value_counts(sort=True).head(N)`	O(n log n) sort + head
Spark distributed	`df.groupBy('tag').count().orderBy(desc('count')).limit(N)`	distributed shuffle + sort
ClickHouse distributed	`SELECT topK(N)(tag) FROM ...`	distributed approximate (HyperLogLog-based)

Cite Python collections.Counter.

Cross-course → Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules)

Каждая courseware platform имеет dedicated decision-frame lesson:

Spark M00 course-intro — when-to-use Spark; expected data size / cluster size; comparison к other tools (включая pandas, DuckDB).
DataFusion M00 course-intro — DataFusion positioning; Rust embedded vs distributed (Ballista).
ClickHouse M00 course-intro — analytical OLAP positioning; comparison к Spark / DuckDB / BigQuery.

Cross-course bridge insight: decision matrix (this lesson) — Python lens на ecosystem. Каждый dedicated course предоставляет own lens (Spark — distributed lens; DataFusion — Rust embedded lens; ClickHouse — OLAP server lens). Reading several intro modules даёт rounded-perspective.

Cumulative cross-course course exposure после Phase 68:

Storage Formats course (Phase 56) — disk layer.
Spark course (Phase 02-03) — distributed processing.
DataFusion course (Phase 10) — alternative engines.
ClickHouse course (Phase 56) — analytical OLAP.

Spark DataFrame — schema и lazy evaluation DataFusion-Python — Arrow-native single-node engine clickhouse-local — embedded OLAP без сервера

М10 урок 07 (next) — full bridge table суммирующий все cross-course refs Phase 68 + forward-link к Phase 69/70.

Что в следующем уроке

М10 урок 07 — Phase 68 recap + cross-course bridges + forward-link к Phase 69/70. Cumulative cross-course bridge table (≥10 references) + forward-link к Phase 69 (Production Skills — logging file rotation / performance profiling / packaging) + Phase 70 (Launch Polish — final exam ASMT-04..08 closure).

Decision matrix: when to use pandas / Polars / PyArrow / DuckDB / Spark / DataFusion

Why decision matrix

Factor matrix — six tools across five factors

Decision tree walkthrough

Brief DuckDB mention

Brief chDB mention — embedded ClickHouse в Python

Brief DataFusion-Python mention — Rust query engine с Python bindings

Code-challenge — py-m10-06-code-1 (optional Pattern 4 variant)

Cross-course → Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules)

Что в следующем уроке

Закончили урок?

Code-challenge — `py-m10-06-code-1` (optional Pattern 4 variant)