Learning Platform
Глоссарий Troubleshooting
Урок 11.06 · 22 мин
Средний
decision-matrixpandaspolarspyarrowduckdbchdbdatafusiondatafusion-pythonsparksingle-machinedistributedOLAPRun-on-Your-MachinePattern 4 variantcross-course
Требуемые знания:

Decision matrix: when to use pandas / Polars / PyArrow / DuckDB / Spark / DataFusion

Engineering choice between data libraries / engines depends на workload + constraints. No one-size-fits-all — pandas хорош для interactive REPL, Polars для production single-machine ETL, Spark для distributed big data. Этот урок предлагает factor matrix + decision tree — pragmatic guidance для production teams. Включает optional code-challenge py-m10-06-code-1 (Pattern 4 variant — top-K via Counter.most_common).

В этом уроке:

  1. Why decision matrix — engineering choice depends on workload + constraints.
  2. Factor matrix — six tools across five factors.
  3. Decision tree — walkthrough.
  4. Brief DuckDB mention — embedded analytical SQL.
  5. Code-challenge py-m10-06-code-1 — Pattern 4 variant top-K.
  6. Cross-course — Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules).

Why decision matrix

Production team faces вопросы:

  • Workload — analytical reads / ETL / streaming / interactive Jupyter / SQL OLAP?
  • Data size — KB / MB / GB / TB / PB?
  • Compute — single-machine / distributed cluster?
  • Ecosystem — Python-only / Rust / JVM / multi-language?
  • Existing stack — already на Spark / Snowflake / ClickHouse?

Без structured frame — choice ad-hoc, technology proliferation, maintenance burden grows. Decision matrix даёт shared vocabulary + repeatable framework.


Factor matrix — six tools across five factors

ToolData sizeModeMemory modelPrimary use case
pandas<1GB single-machineeagernumpy ndarray (default); Arrow opt-in 2.0+interactive REPL / Jupyter / quick scripts
Polars<100GB single-machinelazy (recommend)Arrow alwaysproduction ETL replace pandas at scale
PyArrowany sizeformat/interop layerArrow nativedata exchange between libraries; Parquet/IPC IO
DuckDB<1TB single-machineSQL embeddedcolumnar custom + Arrow-compatibleembedded analytical SQL; medium-data analytics
chDB<1TB single-machineSQL embeddedClickHouse columnar (MergeTree) + Arrow bridgeembedded ClickHouse engine in-process; OLAP без сервера
DataFusion-Python<1TB embedded; cluster через BallistaSQL + DataFrameArrow native (Rust)Rust query engine с Python bindings; embeddable + customizable
SparkTB-PB distributedSQL + DataFrame distributedSpark SQL Catalyst + Arrow shufflebig-data distributed ETL/analytics
DataFusion<1TB embedded; cluster-distributed via BallistaSQL + DataFrameArrow native (Rust)Rust-native query engine; embeddable + customizable

Factor explanations:

  • Data size threshold — приближённый ceiling per tool. Polars может handle 100GB+ если sufficient RAM; Spark scales к PB через cluster.
  • Mode — eager / lazy / SQL / format-layer. Влияет на debugging (eager — simpler stack traces) vs optimization (lazy — query optimizer).
  • Memory model — какой in-memory representation. Arrow-based tools cross-compatible (zero-copy interop, M10 урок 04).
  • Primary use case — opinionated default; tools могут use overlapping use-cases.

Decision tree walkthrough

Q: Какой у вас workload?

  • Interactive analysis / Jupyter explorationpandas (eager, REPL-friendly)
  • Production single-machine ETL pipeline (medium data, <100 GB)
    • Нужны lazy + optimizer? → Polars (recommended default 2024+)
    • Предпочитаете SQL-синтаксис? → DuckDB
  • Multi-library zero-copy pipeline (pandas → Polars → DuckDB → ClickHouse) → PyArrow (format/interop layer; не tool сам по себе)
  • Big data, distributed cluster (TB–PB)
    • Battle-tested production / Hadoop / EMR ecosystem → Spark
    • Rust-native + embeddable engine → DataFusion (или Ballista для distributed)
  • Analytical SQL OLAP (medium–large data)
    • Embedded в Python application → DuckDB
    • Standalone server, multi-user, multi-tenancy → ClickHouse (cross-course)

Production rule: default к Polars для new single-machine ETL workloads (post-2024); pandas остаётся для notebooks / interactive analysis. Spark когда уже на distributed cluster или PB-scale data. DataFusion для engine-embedding (custom analytical features в Rust app).


Brief DuckDB mention

DuckDB (duckdb.org) — embedded analytical OLAP database. Не имеет dedicated course в этой platform v2.4 — Open Question 7 расколол: brief mention здесь, full coverage в possible v3 milestone.

Why mention:

  • Embedded — like SQLite для analytics; ноль setup; single-process.
  • Columnar storage + vectorized execution — analytical queries fast.
  • Arrow-compatible result format — zero-copy в Polars / pandas (cross-link M10 урок 04).
  • Read-write Parquet, CSV, JSON natively через read_parquet, read_csv SQL functions.

When DuckDB beats Polars:

  • Workload — predominantly SQL (легче maintain SQL queries чем method-chained DataFrames).
  • Need joins / window functions / SQL features which DataFrame DSL не surface elegantly.
  • Embedded в Python application — import duckdb; con = duckdb.connect() — ноль setup.

When Polars beats DuckDB:

  • Workload — predominantly Python DataFrame manipulation (filters / aggregations / mutations).
  • Need columnar lazy plan inspection через .explain().
  • Multi-step pipeline composes natural в Python (not SQL string templating).

Production reality: оба часто co-existing. Polars для transformation, DuckDB для analytical SQL queries on результат. Arrow zero-copy interop означает minimal handoff cost.


Brief chDB mention — embedded ClickHouse в Python

chDBembedded ClickHouse (PyPI pip install chdb) — full ClickHouse engine в одном Python process, без сервера. Аналог DuckDB но на ClickHouse codebase.

# (Run-on-Your-Machine — НЕ Pyodide; chDB requires native binaries)
import chdb

# In-process SQL execution — full ClickHouse engine
result = chdb.query("SELECT count(), uniq(user_id) FROM file('events.parquet', Parquet)", "Pretty")
print(result)

# Stateful session — temporary tables, MergeTree, persistence
sess = chdb.session.Session("/path/to/data")
sess.query("CREATE TABLE events (id UInt64, ts DateTime) ENGINE = MergeTree ORDER BY id")

Когда выбирать chDB:

  • OLAP-heavy — ClickHouse aggregation engine — один из самых быстрых in-class (HyperLogLog, MergeTree partitioning, vectorized execution).
  • ClickHouse SQL dialect — компатибельность с production ClickHouse cluster (one-codepath dev/prod queries).
  • Streaming aggregationtopK, quantileTDigest, uniqCombined — production-grade approximate algorithms из коробки.
  • Data engineering pipelines — Parquet/CSV/JSON ingestion + columnar MergeTree storage в одном процессе.

chDB vs DuckDB:

  • chDB — ClickHouse SQL dialect, MergeTree storage, large analytical functions library. Heavier binary (~150 MB chdb wheel vs ~30 MB duckdb).
  • DuckDB — standard SQL (PostgreSQL-compatible), smaller footprint, broader integration ecosystem.

Cite chDB GitHub + chDB Python docs.


Brief DataFusion-Python mention — Rust query engine с Python bindings

datafusion (PyPI pip install datafusion) — Python bindings для Apache DataFusion (Rust SQL/DataFrame engine, see dedicated DataFusion course). Arrow-native; zero-copy interop с Polars / pandas / pyarrow.

# (Run-on-Your-Machine — нативные Rust binaries; не Pyodide)
import datafusion

ctx = datafusion.SessionContext()
ctx.register_parquet("events", "events.parquet")

df = ctx.sql("SELECT user_id, count(*) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10")
print(df.to_pandas())          # zero-copy Arrow → pandas

Когда выбирать DataFusion-Python:

  • Arrow-native pipeline — zero-copy interop с Polars / pandas / pyarrow, без сериализации между шагами.
  • Кастомные UDF на Python с Rust-производительностью основной части — UDF/UDAF можно писать на Python через register_udf, остальное — DataFusion Rust.
  • Embedded query engine в Python application — встраиваемый, customizable execution; кастомные TableProvider (см. DataFusion course).
  • SQL + DataFrame одновременно — оба API в одном инструменте, единый optimizer.

DataFusion-Python vs DuckDB / chDB:

  • DataFusion-Python — Arrow-native, Rust extensibility (custom optimizer rules / table providers), best для embedded Rust-quality engine.
  • DuckDB — broader SQL coverage, larger Python data ecosystem integration (Pandas-style API через duckdb.df()).
  • chDB — ClickHouse OLAP функции + MergeTree persistence; best для ClickHouse-dialect pipelines.

Cite datafusion.apache.org/python + DataFusion Python GitHub.


Code-challenge — py-m10-06-code-1 (optional Pattern 4 variant)

В quiz JSON ниже — py-m10-06-code-1: top-K tags по частоте через collections.Counter.most_common(N). Pandas equivalent: pd.Series(tags).value_counts().head(N). ClickHouse equivalent: topK(N)(tag) (approximate). Spark equivalent: df.groupBy('tag').count().orderBy(desc('count')).limit(N) (distributed).

from collections import Counter

def solve(tags):
    return Counter(tags).most_common(3)

Algorithm — heap-based O(n log N): Counter.most_common(N) строит counts dict O(n), затем использует heapq.nlargest(N, counts.items(), key=lambda x: x[1]) — O(n log N) heap-based sort. Без N — fall-back O(n log n) full sort.

Production cross-course mapping:

LayerImplementationComplexity
Pure-stdlib (in this challenge)Counter(tags).most_common(N)O(n log N)
pandas single-machinepd.Series(tags).value_counts().head(N)O(n log n) sort + head
Polars single-machinepl.col('tag').value_counts(sort=True).head(N)O(n log n) sort + head
Spark distributeddf.groupBy('tag').count().orderBy(desc('count')).limit(N)distributed shuffle + sort
ClickHouse distributedSELECT topK(N)(tag) FROM ...distributed approximate (HyperLogLog-based)

Cite Python collections.Counter.


Cross-course → Spark M00 + DataFusion M00 + ClickHouse M00 (when-to-use intro modules)

Каждая courseware platform имеет dedicated decision-frame lesson:

  • Spark M00 course-intro — when-to-use Spark; expected data size / cluster size; comparison к other tools (включая pandas, DuckDB).
  • DataFusion M00 course-intro — DataFusion positioning; Rust embedded vs distributed (Ballista).
  • ClickHouse M00 course-intro — analytical OLAP positioning; comparison к Spark / DuckDB / BigQuery.

Cross-course bridge insight: decision matrix (this lesson) — Python lens на ecosystem. Каждый dedicated course предоставляет own lens (Spark — distributed lens; DataFusion — Rust embedded lens; ClickHouse — OLAP server lens). Reading several intro modules даёт rounded-perspective.

Cumulative cross-course course exposure после Phase 68:

  • Storage Formats course (Phase 56) — disk layer.
  • Spark course (Phase 02-03) — distributed processing.
  • DataFusion course (Phase 10) — alternative engines.
  • ClickHouse course (Phase 56) — analytical OLAP.
Spark DataFrame — schema и lazy evaluation DataFusion-Python — Arrow-native single-node engine clickhouse-local — embedded OLAP без сервера

М10 урок 07 (next) — full bridge table суммирующий все cross-course refs Phase 68 + forward-link к Phase 69/70.


Что в следующем уроке

М10 урок 07 — Phase 68 recap + cross-course bridges + forward-link к Phase 69/70. Cumulative cross-course bridge table (≥10 references) + forward-link к Phase 69 (Production Skills — logging file rotation / performance profiling / packaging) + Phase 70 (Launch Polish — final exam ASMT-04..08 closure).

Проверьте понимание

Результат: 0 из 0
Прикладной
Вопрос 1 из 4. **Apply scenario — production single-machine ETL:** workload = ежедневная Parquet aggregation (~30GB), single-server, CPU-bound, post-2024 production. Между pandas (eager) vs Polars (lazy) выбрать?

Закончили урок?

Отметьте его как пройденный, чтобы отслеживать свой прогресс

Войдите чтобы оценить урок

Прогресс модуля
0 из 7