Lesson 10.05 · 22 min · Intermediate

Compression: gzip, bzip2, lz4, zstd, snappy

Lesson 04 showed that Parquet, ORC, and Avro use column-chunk or block-level compression. Which codec should you pick? The trade-off is speed versus ratio: fast codecs compress weakly, slow codecs compress well. The production question is where the bottleneck lies: CPU or disk/network? Pragmatic-DEEP scope: a 5-codec matrix, a decision tree, and a Pyodide-safe gzip demo. Internals (BtrBlocks, FastLanes, ALP, FSST) get a deep dive in the cross-course Storage Formats M09 module.

In this lesson:

  1. Why compression: disk I/O, network, and storage cost.
  2. Codec matrix: gzip / bzip2 / lz4 / zstd / snappy.
  3. CPU-bound vs disk-bound calculus: where the bottleneck is.
  4. Pyodide-safe demo: gzip.compress / gzip.decompress via io.BytesIO.
  5. Pitfall 33: gzip.decompress requires bytes, not str.
  6. Cross-course → Storage Formats M09: BtrBlocks, FastLanes, ALP, FSST.
  7. Spark shuffle compression: a brief on production tuning.

Why compression — disk / network / storage cost

Three production drivers:

  1. Storage cost: S3 / GCS / Azure Blob bill per GB. Compressing 10 TB raw to 2 TB is a 5x cost reduction.
  2. Network bandwidth: Spark shuffle, cross-region replication, and streaming ingestion. Bytes on the wire dominate latency for big workloads.
  3. Disk I/O: analytical scans are bottlenecked on disk (disk throughput is measured in MB/s while CPU decompression reaches GB/s, roughly a 10x gap). Compression trades CPU for reduced disk I/O, which is usually a net win.

Counter-examples: in-memory pipelines (Arrow IPC zero-copy), where compression only adds CPU work, and high-throughput Kafka, where only a codec with minimal CPU overhead such as lz4 is affordable. Compression is a per-workload decision, not a universal default.
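The "usually a net win" claim can be checked with a rough model. The sketch below assumes disk reads and decompression are pipelined, so the slower stage bounds the scan; the function name and all throughput figures are hypothetical, not measurements.

```python
def scan_seconds(raw_gb: float, ratio: float,
                 disk_mb_s: float, decompress_mb_s: float) -> float:
    """Estimate scan time when disk reads and decompression overlap.

    The slower of the two stages bounds the whole scan.
    """
    raw_mb = raw_gb * 1000
    disk_time = (raw_mb / ratio) / disk_mb_s   # compressed bytes read from disk
    cpu_time = raw_mb / decompress_mb_s        # speed measured on decompressed output
    return max(disk_time, cpu_time)


# Hypothetical hardware: 200 MB/s disk, one decompression thread.
uncompressed = scan_seconds(100, ratio=1.0, disk_mb_s=200,
                            decompress_mb_s=float("inf"))
zstd_like = scan_seconds(100, ratio=4.0, disk_mb_s=200, decompress_mb_s=1000)

print(f"uncompressed: {uncompressed:.0f} s, zstd-like codec: {zstd_like:.0f} s")
# uncompressed: 500 s, zstd-like codec: 125 s
```

With these assumed numbers the compressed scan is disk-bound at 125 s. If decompression were slower than 800 MB/s, the CPU stage would become the limit instead, which is the case the counter-examples describe.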


Codec matrix — 5 codecs

| Codec | Compress speed | Decompress speed | Ratio | Pyodide stdlib? | Use case |
|---|---|---|---|---|---|
| gzip | medium (~50 MB/s) | fast (~250 MB/s) | medium-high (3-4x) | yes (gzip) | universal HTTP / log files / archives; Content-Encoding: gzip |
| bzip2 | slow (~10 MB/s) | medium (~50 MB/s) | high (4-6x) | yes (bz2) | archival storage; rare in production analytics |
| lzma (xz) | very slow (~5 MB/s) | slow (~30 MB/s) | very high (5-8x) | yes (lzma) | archival; Linux kernel tarballs |
| lz4 | very fast (~500 MB/s) | very fast (~3000 MB/s) | low-medium (1.5-2x) | no (pip install lz4) | high-throughput pipelines (Kafka, Spark shuffle) |
| zstd | fast (~250 MB/s) | fast (~1000 MB/s) | high (3-5x) | no (pip install zstandard; Python 3.14+ compression.zstd stdlib) | modern default for analytics (Parquet, ClickHouse) |
| snappy | very fast (~600 MB/s) | very fast (~2000 MB/s) | low (1.5x) | no (pip install python-snappy) | Hadoop ecosystem (HDFS, ORC, BigTable) |

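Key insight: zstd has been displacing gzip and snappy in production since around 2024, because it combines near-snappy speed with near-gzip ratio. Adoption: Parquet writers widely support zstd (many still default to snappy, so set it explicitly), ClickHouse offers ZSTD codecs (the default in ClickHouse Cloud; self-managed installs default to LZ4), the Linux kernel can compress its image and modules with zstd, and btrfs and squashfs support zstd as an option.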

See the zstd README for comprehensive benchmarks.
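The speed-vs-ratio trade-off also shows up inside a single codec. As a minimal sketch, the snippet below times stdlib gzip at three levels on a synthetic, log-like payload (the payload is invented for illustration). Absolute timings depend on the machine and the zlib build, so treat the output as relative only.

```python
import gzip
import time

# Synthetic log-like payload (hypothetical data, moderately repetitive).
payload = b"".join(
    f"2024-01-01T00:00:{i % 60:02d} INFO request id={i} status=200\n".encode("utf-8")
    for i in range(20_000)
)

results = {}  # level -> (ratio, seconds)
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    elapsed = time.perf_counter() - start
    results[level] = (len(payload) / len(compressed), elapsed)
    print(f"level {level}: ratio {results[level][0]:.1f}x in {elapsed * 1000:.1f} ms")
```

Higher levels typically buy a somewhat better ratio for noticeably more CPU time, which is the same trade-off the matrix shows across codecs.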


CPU-bound vs disk-bound calculus

Production rule: pick codec based on bottleneck profile.

Disk-bound workload

  • Symptom: CPU idle, disk saturated reading raw data.
  • Solution: high-ratio codec — fewer bytes from disk, CPU has spare cycles to decompress.
  • Pick: zstd, bzip2 (if archival), gzip (legacy compatibility).

CPU-bound workload

  • Symptom: CPU saturated decompressing, disk underutilized.
  • Solution: low-CPU-cost codec with fast decompression, even at a low ratio.
  • Pick: lz4, snappy, zstd-1 (lowest standard level).

Network-bound workload

  • Symptom: cross-region transfer; bytes-on-wire dominate.
  • Solution: balanced. Mid-level zstd (level 6-12) gives good compression with acceptable decompression speed.
  • Pick: zstd.

Archival workload

  • Symptom: write once, rarely read; storage size paramount.
  • Solution: maximum ratio ignoring compress speed.
  • Pick: lzma (xz), zstd-22 (max level), bzip2.
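The four profiles above amount to a small lookup. A minimal sketch, assuming a hypothetical helper name `pick_codec` and hypothetical profile labels; the candidate lists are copied from the "Pick" lines:

```python
def pick_codec(bottleneck: str) -> list[str]:
    """Return candidate codecs for a bottleneck profile, in the lesson's order."""
    choices = {
        "disk": ["zstd", "bzip2", "gzip"],
        "cpu": ["lz4", "snappy", "zstd-1"],
        "network": ["zstd"],
        "archival": ["lzma", "zstd-22", "bzip2"],
    }
    try:
        return choices[bottleneck]
    except KeyError:
        raise ValueError(f"unknown bottleneck profile: {bottleneck!r}") from None


print(pick_codec("cpu"))  # ['lz4', 'snappy', 'zstd-1']
```

In practice the hard part is diagnosing which profile a workload has (CPU vs disk utilization during a scan), not the lookup itself.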

Production calculus example (Parquet):

  • 1 TB raw data, gzip → ~250 GB (4x).
  • 1 TB raw data, zstd-3 → ~250 GB (4x), but decompresses 4x faster than gzip.
  • 1 TB raw data, snappy → ~670 GB (1.5x), but decompresses 8x faster than gzip.

A common modern choice for Parquet is zstd (level 1-3 for balance). Many writers still default to snappy, so set the codec explicitly.
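The arithmetic behind the list above, as a small sketch. The ratios and relative decompression speeds are the lesson's illustrative figures, not measurements.

```python
RAW_GB = 1000  # 1 TB raw

# codec -> (compression ratio, decompress speed relative to gzip)
codecs = {
    "gzip": (4.0, 1),
    "zstd-3": (4.0, 4),
    "snappy": (1.5, 8),
}

stored = {name: RAW_GB / ratio for name, (ratio, _) in codecs.items()}
relative_time = {name: 1 / speed for name, (_, speed) in codecs.items()}

for name in codecs:
    print(f"{name:7s} stored ~{stored[name]:.0f} GB, "
          f"decompress time {relative_time[name]:.3f}x of gzip")
```

zstd-3 stores the same 250 GB as gzip at a quarter of the decompression time, while snappy decompresses fastest but stores roughly 667 GB.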


Pyodide-safe demo: gzip via io.BytesIO

The stdlib gzip module works in Pyodide. It operates on bytes (binary data), so you need io.BytesIO (not io.StringIO as in lessons 01-03):

import gzip
import io

# Compress bytes
data = b"hello, world! This is some text we'll compress."
compressed = gzip.compress(data)
print(len(data), '->', len(compressed))
# 47 -> about 65-70  (larger: the gzip header and trailer add 18 bytes, which dominates on tiny inputs)

# Larger sample — compression actually helps
sample = b"hello, world! " * 1000  # 14000 bytes
compressed = gzip.compress(sample)
print(len(sample), '->', len(compressed))
# 14000 -> on the order of 100 bytes  (over 100x for highly repetitive data)

# Decompress
restored = gzip.decompress(compressed)
print(restored == sample)  # True

Streaming variant via gzip.GzipFile:

import gzip
import io

# Write compressed
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b"line 1\n")
    gz.write(b"line 2\n")

compressed_bytes = buf.getvalue()
print(len(compressed_bytes))  # ~30 bytes

# Read compressed
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    print(gz.read())  # b'line 1\nline 2\n'

The same pattern applies to bz2 and lzma:

  • import bz2; bz2.compress(...) / bz2.BZ2File(...)
  • import lzma; lzma.compress(...) / lzma.LZMAFile(...)

The API is analogous across the three modules (compress / decompress / a *File streaming class). See docs.python.org/3/library/gzip, plus the bz2 and lzma pages.
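A minimal sketch of that shared pattern, running the three stdlib modules over the same data. One caveat: the *File constructors differ slightly (GzipFile takes the buffer as the fileobj= keyword, while BZ2File and LZMAFile take it as the first argument), so the streaming half uses each module's open() function, which accepts an in-memory file object in all three.

```python
import bz2
import gzip
import io
import lzma

sample = b"hello, world! " * 1000  # 14000 bytes

# One-shot API: compress() / decompress() on bytes.
sizes = {}
for module in (gzip, bz2, lzma):
    packed = module.compress(sample)
    assert module.decompress(packed) == sample
    sizes[module.__name__] = len(packed)
    print(f"{module.__name__:5s} {len(sample)} -> {len(packed)} bytes")

# Streaming API: module.open() wraps an in-memory binary buffer.
streamed = {}
for module in (gzip, bz2, lzma):
    buf = io.BytesIO()
    with module.open(buf, "wb") as f:
        f.write(b"line 1\n")
        f.write(b"line 2\n")
    buf.seek(0)  # the wrapper does not close a caller-supplied buffer
    with module.open(buf, "rb") as f:
        streamed[module.__name__] = f.read()
```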


Pitfall 33: gzip.decompress requires bytes, not str

What goes wrong:

import gzip

gzip.decompress("compressed hex string")  # TypeError
# TypeError: a bytes-like object is required, not 'str'

Why: gzip operates on binary data. Compressed output is bytes (a raw byte sequence, not valid UTF-8). Passing a str is a common mistake for learners coming from the text I/O context of lessons 01-03.

How to avoid: always work with bytes literals (b"...") and io.BytesIO (NOT io.StringIO) in compression workflows:

# Correct
import gzip
data: bytes = b"hello"           # bytes literal
compressed: bytes = gzip.compress(data)
restored: bytes = gzip.decompress(compressed)

# Convert text → bytes
text: str = "hello, мир"
data: bytes = text.encode('utf-8')  # encode → bytes
compressed = gzip.compress(data)
restored = gzip.decompress(compressed).decode('utf-8')  # decode → str

Production rule: respect the encoding boundary. Encode str → bytes BEFORE compression; decode bytes → str AFTER decompression. Compression operates only on bytes.
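A related case is compressed data that arrives as text, for example a hex or base64 string inside a JSON field. A minimal sketch, assuming the payload was produced with one of those two encodings (variable names are illustrative): convert the text form back to bytes first, then decompress.

```python
import base64
import gzip

compressed = gzip.compress("hello, мир".encode("utf-8"))

# Text-safe transport forms of the compressed bytes
as_hex: str = compressed.hex()
as_b64: str = base64.b64encode(compressed).decode("ascii")

# Convert the text form back to bytes BEFORE calling gzip.decompress
from_hex = gzip.decompress(bytes.fromhex(as_hex)).decode("utf-8")
from_b64 = gzip.decompress(base64.b64decode(as_b64)).decode("utf-8")

print(from_hex, from_b64)  # hello, мир hello, мир
```

Passing as_hex or as_b64 directly to gzip.decompress would raise the TypeError shown in this pitfall.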


Cross-course → Storage Formats M09 compression internals

The Storage Formats course covers compression internals in M09 (7 lessons).

Three-layer cross-course bridge (continuing from lesson 04):

  • Stdlib gzip.compress (M09 lesson 05): single-process Python.
  • Parquet column-chunk zstd-compression — distributed analytical workload.
  • ClickHouse compression_codec='ZSTD(3)' settings — vectorized OLAP DB.

Same algorithm family (LZ77 + entropy coding), three execution layers.


Spark shuffle compression — production tuning

Spark 03/04 (groupBy aggregations) covers shuffle mechanics: the intermediate data exchanged between map and reduce stages. The default compression codec is lz4, because fast decompression matters most here:

from pyspark.sql import SparkSession

# Spark config (pyspark or spark-submit --conf)
spark = (
    SparkSession.builder
    # The default codec is lz4; uncomment to set it explicitly:
    # .config('spark.io.compression.codec', 'lz4')
    # Switch to zstd for cross-region clusters
    .config('spark.io.compression.codec', 'zstd')
    .config('spark.io.compression.zstd.level', '3')
    .getOrCreate()
)

Production tuning rule:

  • Local cluster (single rack): lz4 (the network is not saturated, so cheap CPU matters more than ratio).
  • Cross-region cluster: zstd (network bytes dominate).
  • Spillable shuffle (large datasets): zstd-1 (faster compress for one-shot writes).
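The tuning rule can be expressed as a small helper that emits spark-submit flags. This is a sketch only: the function name and the topology labels are hypothetical, and the only Spark settings used are spark.io.compression.codec and spark.io.compression.zstd.level.

```python
def shuffle_compression_conf(topology: str) -> dict[str, str]:
    """Map a cluster topology label (hypothetical labels) to Spark settings."""
    if topology == "single-rack":
        return {"spark.io.compression.codec": "lz4"}
    if topology == "cross-region":
        return {
            "spark.io.compression.codec": "zstd",
            "spark.io.compression.zstd.level": "3",
        }
    if topology == "spill-heavy":
        return {
            "spark.io.compression.codec": "zstd",
            "spark.io.compression.zstd.level": "1",
        }
    raise ValueError(f"unknown topology: {topology!r}")


conf = shuffle_compression_conf("cross-region")
flags = " ".join(f"--conf {key}={value}" for key, value in conf.items())
print(flags)
# --conf spark.io.compression.codec=zstd --conf spark.io.compression.zstd.level=3
```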

Recipe: production gzip pipeline with error handling

End-to-end: encode str → compress → simulated transmission → decompress → decode str.

import gzip
import io
import zlib

def compress_text(text: str, level: int = 6) -> bytes:
    """Compress text → gzip bytes. Default level 6 (balance)."""
    return gzip.compress(text.encode('utf-8'), compresslevel=level)

def decompress_text(data: bytes) -> str:
    """Decompress gzip bytes → text. Raises ValueError on corrupt data."""
    try:
        return gzip.decompress(data).decode('utf-8')
    except (OSError, EOFError, zlib.error) as e:
        # BadGzipFile is an OSError subclass; truncated input raises EOFError;
        # a damaged deflate stream raises zlib.error.
        raise ValueError(f'Invalid gzip data: {e}') from e

# Usage: simulate over-the-wire compression
text = "Hello, мир! " * 100   # 1200 characters, 1500 UTF-8 bytes
compressed = compress_text(text)
print(len(text.encode('utf-8')), '->', len(compressed))  # 1500 -> roughly 50

restored = decompress_text(compressed)
assert restored == text, 'round-trip failed'

# Streaming variant — write compressed log
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    for line in ['log line 1\n', 'log line 2\n', 'log line 3\n']:
        gz.write(line.encode('utf-8'))

print(f'Compressed log size: {len(buf.getvalue())} bytes')
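To complete the round trip, the compressed log can be read back as text. A minimal sketch: gzip.open accepts an in-memory file object, and mode 'rt' handles the bytes → str decoding, so lines come back as str. The log lines are the same illustrative ones used in the recipe.

```python
import gzip
import io

# Write a compressed log into an in-memory buffer (same pattern as the recipe)
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    for line in ['log line 1\n', 'log line 2\n', 'log line 3\n']:
        gz.write(line.encode('utf-8'))

# Read it back line by line as text
buf.seek(0)
with gzip.open(buf, mode='rt', encoding='utf-8') as reader:
    lines = [line.rstrip('\n') for line in reader]

print(lines)  # ['log line 1', 'log line 2', 'log line 3']
```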

Next lesson

Lesson 06: the pathlib API (Path / PurePath / PurePosixPath / PureWindowsPath). Code challenge py-m09-06-code-1: Pattern 3 (pathlib path arithmetic via PurePosixPath). Run-on-Your-Machine callout #1: real disk operations via Path.iterdir / Path.glob (browser MEMFS is limited, per Pitfall 22).

Pragmatic-DEEP principle: we do not deep-dive into compression algorithms (LZ77, Huffman); that is Storage Formats course territory. Selection criteria plus a Pyodide-safe demo are sufficient for M09. In production: zstd as the analytical default in 2024+; lz4 for high-throughput pipelines.

Check your understanding

Question 1 of 4 (applied). Codec selection scenario: the workload is analytical Parquet reads (S3 → Spark → DuckDB), 10 TB of raw data per day, with cross-region transfer dominating latency. Which codec is the modern default in 2024+?
