Decision matrix: when to use pandas / Polars / PyArrow / DuckDB / Spark / DataFusion
Engineering choice between data libraries / engines depends на workload + constraints. No one-size-fits-all — pandas хорош для interactive REPL, Polars для production single-machine ETL, Spark для distributed big data. Этот урок предлагает factor matrix + decision tree — pragmatic guidance для production teams. Включает optional code-challenge py-m10-06-code-1 (Pattern 4 variant — top-K via Counter.most_common).
В этом уроке:
- Why decision matrix — engineering choice depends on workload + constraints.
- Factor matrix — six tools across five factors.
- Decision tree — walkthrough.
- Brief DuckDB mention — embedded analytical SQL.
- Code-challenge
py-m10-06-code-1— Pattern 4 variant top-K. - Cross-course — Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules).
Why decision matrix
Production team faces вопросы:
- Workload — analytical reads / ETL / streaming / interactive Jupyter / SQL OLAP?
- Data size — KB / MB / GB / TB / PB?
- Compute — single-machine / distributed cluster?
- Ecosystem — Python-only / Rust / JVM / multi-language?
- Existing stack — already на Spark / Snowflake / ClickHouse?
Без structured frame — choice ad-hoc, technology proliferation, maintenance burden grows. Decision matrix даёт shared vocabulary + repeatable framework.
Factor matrix — six tools across five factors
| Tool | Data size | Mode | Memory model | Primary use case |
|---|---|---|---|---|
| pandas | <1GB single-machine | eager | numpy ndarray (default); Arrow opt-in 2.0+ | interactive REPL / Jupyter / quick scripts |
| Polars | <100GB single-machine | lazy (recommend) | Arrow always | production ETL replace pandas at scale |
| PyArrow | any size | format/interop layer | Arrow native | data exchange between libraries; Parquet/IPC IO |
| DuckDB | <1TB single-machine | SQL embedded | columnar custom + Arrow-compatible | embedded analytical SQL; medium-data analytics |
| chDB | <1TB single-machine | SQL embedded | ClickHouse columnar (MergeTree) + Arrow bridge | embedded ClickHouse engine in-process; OLAP без сервера |
| DataFusion-Python | <1TB embedded; cluster через Ballista | SQL + DataFrame | Arrow native (Rust) | Rust query engine с Python bindings; embeddable + customizable |
| Spark | TB-PB distributed | SQL + DataFrame distributed | Spark SQL Catalyst + Arrow shuffle | big-data distributed ETL/analytics |
| DataFusion | <1TB embedded; cluster-distributed via Ballista | SQL + DataFrame | Arrow native (Rust) | Rust-native query engine; embeddable + customizable |
Factor explanations:
- Data size threshold — приближённый ceiling per tool. Polars может handle 100GB+ если sufficient RAM; Spark scales к PB через cluster.
- Mode — eager / lazy / SQL / format-layer. Влияет на debugging (eager — simpler stack traces) vs optimization (lazy — query optimizer).
- Memory model — какой in-memory representation. Arrow-based tools cross-compatible (zero-copy interop, M10 урок 04).
- Primary use case — opinionated default; tools могут use overlapping use-cases.
Decision tree walkthrough
Q: Какой у вас workload?
- Interactive analysis / Jupyter exploration →
pandas(eager, REPL-friendly) - Production single-machine ETL pipeline (medium data, <100 GB)
- Нужны lazy + optimizer? → Polars (recommended default 2024+)
- Предпочитаете SQL-синтаксис? → DuckDB
- Multi-library zero-copy pipeline (pandas → Polars → DuckDB → ClickHouse) → PyArrow (format/interop layer; не tool сам по себе)
- Big data, distributed cluster (TB–PB)
- Battle-tested production / Hadoop / EMR ecosystem → Spark
- Rust-native + embeddable engine → DataFusion (или Ballista для distributed)
- Analytical SQL OLAP (medium–large data)
- Embedded в Python application → DuckDB
- Standalone server, multi-user, multi-tenancy → ClickHouse (cross-course)
Production rule: default к Polars для new single-machine ETL workloads (post-2024); pandas остаётся для notebooks / interactive analysis. Spark когда уже на distributed cluster или PB-scale data. DataFusion для engine-embedding (custom analytical features в Rust app).
Brief DuckDB mention
DuckDB (duckdb.org) — embedded analytical OLAP database. Не имеет dedicated course в этой platform v2.4 — Open Question 7 расколол: brief mention здесь, full coverage в possible v3 milestone.
Why mention:
- Embedded — like SQLite для analytics; ноль setup; single-process.
- Columnar storage + vectorized execution — analytical queries fast.
- Arrow-compatible result format — zero-copy в Polars / pandas (cross-link M10 урок 04).
- Read-write Parquet, CSV, JSON natively через
read_parquet,read_csvSQL functions.
When DuckDB beats Polars:
- Workload — predominantly SQL (легче maintain SQL queries чем method-chained DataFrames).
- Need joins / window functions / SQL features which DataFrame DSL не surface elegantly.
- Embedded в Python application —
import duckdb; con = duckdb.connect()— ноль setup.
When Polars beats DuckDB:
- Workload — predominantly Python DataFrame manipulation (filters / aggregations / mutations).
- Need columnar lazy plan inspection через
.explain(). - Multi-step pipeline composes natural в Python (not SQL string templating).
Production reality: оба часто co-existing. Polars для transformation, DuckDB для analytical SQL queries on результат. Arrow zero-copy interop означает minimal handoff cost.
Brief chDB mention — embedded ClickHouse в Python
chDB — embedded ClickHouse (PyPI pip install chdb) — full ClickHouse engine в одном Python process, без сервера. Аналог DuckDB но на ClickHouse codebase.
# (Run-on-Your-Machine — НЕ Pyodide; chDB requires native binaries)
import chdb
# In-process SQL execution — full ClickHouse engine
result = chdb.query("SELECT count(), uniq(user_id) FROM file('events.parquet', Parquet)", "Pretty")
print(result)
# Stateful session — temporary tables, MergeTree, persistence
sess = chdb.session.Session("/path/to/data")
sess.query("CREATE TABLE events (id UInt64, ts DateTime) ENGINE = MergeTree ORDER BY id")
Когда выбирать chDB:
- OLAP-heavy — ClickHouse aggregation engine — один из самых быстрых in-class (HyperLogLog, MergeTree partitioning, vectorized execution).
- ClickHouse SQL dialect — компатибельность с production ClickHouse cluster (one-codepath dev/prod queries).
- Streaming aggregation —
topK,quantileTDigest,uniqCombined— production-grade approximate algorithms из коробки. - Data engineering pipelines — Parquet/CSV/JSON ingestion + columnar MergeTree storage в одном процессе.
chDB vs DuckDB:
- chDB — ClickHouse SQL dialect, MergeTree storage, large analytical functions library. Heavier binary (~150 MB chdb wheel vs ~30 MB duckdb).
- DuckDB — standard SQL (PostgreSQL-compatible), smaller footprint, broader integration ecosystem.
Cite chDB GitHub + chDB Python docs.
Brief DataFusion-Python mention — Rust query engine с Python bindings
datafusion (PyPI pip install datafusion) — Python bindings для Apache DataFusion (Rust SQL/DataFrame engine, see dedicated DataFusion course). Arrow-native; zero-copy interop с Polars / pandas / pyarrow.
# (Run-on-Your-Machine — нативные Rust binaries; не Pyodide)
import datafusion
ctx = datafusion.SessionContext()
ctx.register_parquet("events", "events.parquet")
df = ctx.sql("SELECT user_id, count(*) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10")
print(df.to_pandas()) # zero-copy Arrow → pandas
Когда выбирать DataFusion-Python:
- Arrow-native pipeline — zero-copy interop с Polars / pandas / pyarrow, без сериализации между шагами.
- Кастомные UDF на Python с Rust-производительностью основной части — UDF/UDAF можно писать на Python через
register_udf, остальное — DataFusion Rust. - Embedded query engine в Python application — встраиваемый, customizable execution; кастомные
TableProvider(см. DataFusion course). - SQL + DataFrame одновременно — оба API в одном инструменте, единый optimizer.
DataFusion-Python vs DuckDB / chDB:
- DataFusion-Python — Arrow-native, Rust extensibility (custom optimizer rules / table providers), best для embedded Rust-quality engine.
- DuckDB — broader SQL coverage, larger Python data ecosystem integration (Pandas-style API через
duckdb.df()). - chDB — ClickHouse OLAP функции + MergeTree persistence; best для ClickHouse-dialect pipelines.
Cite datafusion.apache.org/python + DataFusion Python GitHub.
Code-challenge — py-m10-06-code-1 (optional Pattern 4 variant)
В quiz JSON ниже — py-m10-06-code-1: top-K tags по частоте через collections.Counter.most_common(N). Pandas equivalent: pd.Series(tags).value_counts().head(N). ClickHouse equivalent: topK(N)(tag) (approximate). Spark equivalent: df.groupBy('tag').count().orderBy(desc('count')).limit(N) (distributed).
from collections import Counter
def solve(tags):
return Counter(tags).most_common(3)
Algorithm — heap-based O(n log N): Counter.most_common(N) строит counts dict O(n), затем использует heapq.nlargest(N, counts.items(), key=lambda x: x[1]) — O(n log N) heap-based sort. Без N — fall-back O(n log n) full sort.
Production cross-course mapping:
| Layer | Implementation | Complexity |
|---|---|---|
| Pure-stdlib (in this challenge) | Counter(tags).most_common(N) | O(n log N) |
| pandas single-machine | pd.Series(tags).value_counts().head(N) | O(n log n) sort + head |
| Polars single-machine | pl.col('tag').value_counts(sort=True).head(N) | O(n log n) sort + head |
| Spark distributed | df.groupBy('tag').count().orderBy(desc('count')).limit(N) | distributed shuffle + sort |
| ClickHouse distributed | SELECT topK(N)(tag) FROM ... | distributed approximate (HyperLogLog-based) |
Cite Python collections.Counter.
Cross-course → Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules)
Каждая courseware platform имеет dedicated decision-frame lesson:
- Spark M00 course-intro — when-to-use Spark; expected data size / cluster size; comparison к other tools (включая pandas, DuckDB).
- DataFusion M00 course-intro — DataFusion positioning; Rust embedded vs distributed (Ballista).
- ClickHouse M00 course-intro — analytical OLAP positioning; comparison к Spark / DuckDB / BigQuery.
Cross-course bridge insight: decision matrix (this lesson) — Python lens на ecosystem. Каждый dedicated course предоставляет own lens (Spark — distributed lens; DataFusion — Rust embedded lens; ClickHouse — OLAP server lens). Reading several intro modules даёт rounded-perspective.
Cumulative cross-course course exposure после Phase 68:
- Storage Formats course (Phase 56) — disk layer.
- Spark course (Phase 02-03) — distributed processing.
- DataFusion course (Phase 10) — alternative engines.
- ClickHouse course (Phase 56) — analytical OLAP.
М10 урок 07 (next) — full bridge table суммирующий все cross-course refs Phase 68 + forward-link к Phase 69/70.
Что в следующем уроке
М10 урок 07 — Phase 68 recap + cross-course bridges + forward-link к Phase 69/70. Cumulative cross-course bridge table (≥10 references) + forward-link к Phase 69 (Production Skills — logging file rotation / performance profiling / packaging) + Phase 70 (Launch Polish — final exam ASMT-04..08 closure).