Compression: gzip, bzip2, lz4, zstd, snappy

Lesson 04 showed that Parquet, ORC, and Avro use column-chunk or block-level compression. Which codec should you choose? The trade-off is speed vs ratio: fast codecs compress weakly, slow codecs compress well. The production question is where the bottleneck sits: CPU or disk/network? Pragmatic-DEEP: a codec matrix, a decision tree, and a Pyodide-safe gzip demo. Internals (Btrblocks / Fastlanes / ALP / FSST) are covered in the cross-course Storage Formats M09 deep dive.

In this lesson:

Why compression: disk I/O / network / storage cost.
Codec matrix: gzip / bzip2 / lz4 / zstd / snappy.
CPU-bound vs disk-bound calculus: where the bottleneck is.
Pyodide-safe demo: gzip.compress / gzip.decompress via io.BytesIO.
Pitfall 33: gzip.decompress requires bytes, not str.
Cross-course → Storage Formats M09: Btrblocks, Fastlanes, ALP, FSST.
Spark shuffle compression: production tuning brief.

Why compression — disk / network / storage cost

Three production drivers:

Storage cost: S3 / GCS / Azure Blob are pay-per-GB. Compressing 10 TB raw to 2 TB cuts the storage bill 5x.
Network bandwidth: Spark shuffle, cross-region replication, streaming ingestion. Bytes on the wire dominate latency for big workloads.
Disk I/O: analytical scans are often bottlenecked on disk, because disk throughput (MB/s) is roughly an order of magnitude below what the CPU can process (GB/s). Compression trades CPU for reduced disk I/O; usually a net positive.

Counter-examples: in-memory pipelines (Arrow IPC zero-copy), where compression only adds CPU work, and high-throughput Kafka, where only a minimal-overhead codec such as lz4 is acceptable. Compression is a per-workload decision, not a universal default.

Codec matrix — 5 codecs

Codec	Compress speed	Decompress speed	Ratio	Pyodide stdlib?	Use case
gzip	medium (~50 MB/s)	fast (~250 MB/s)	medium-high (3-4x)	yes (`gzip`)	universal HTTP / log files / archives — `Content-Encoding: gzip`
bzip2	slow (~10 MB/s)	medium (~50 MB/s)	high (4-6x)	yes (`bz2`)	archival storage; production analytics rare
lzma (xz)	very slow (~5 MB/s)	slow (~30 MB/s)	very high (5-8x)	yes (`lzma`)	archival; Linux kernel tarballs
lz4	very fast (~500 MB/s)	very fast (~3000 MB/s)	low-medium (1.5-2x)	no (`pip install lz4`)	high-throughput pipelines (Kafka, Spark shuffle)
zstd	fast (~250 MB/s)	fast (~1000 MB/s)	high (3-5x)	no (`pip install zstandard`; Python 3.14+ `compression.zstd` stdlib)	modern default — analytics (Parquet, ClickHouse)
snappy	very fast (~600 MB/s)	very fast (~2000 MB/s)	low (1.5x)	no (`pip install python-snappy`)	Hadoop ecosystem (HDFS, ORC, BigTable)

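Key insight: zstd has been displacing gzip and snappy in production since roughly 2024, because it combines near-snappy speed with gzip-class (or better) ratio. Adoption: a widely used Parquet codec (many writers still default to snappy, so set it explicitly), ClickHouse (ZSTD codec; LZ4 remains the default in self-managed installs), Linux kernel module compression, and btrfs / squashfs as a selectable option.

See the zstd README for comprehensive benchmarks.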
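
The figures in the matrix are rough, hardware-dependent numbers. A minimal sketch for measuring the three stdlib codecs yourself follows; the synthetic payload is an assumption chosen for illustration, and real data will give different ratios and timings. lz4, zstd, and snappy are left out because they need third-party packages.

```python
import bz2
import gzip
import lzma
import time

# Synthetic, moderately repetitive CSV-like payload (an assumption; real data differs).
payload = b"".join(
    f"{i},user_{i % 50},status_{i % 3}\n".encode("utf-8") for i in range(20_000)
)

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    t0 = time.perf_counter()
    packed = compress(payload)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == payload  # every codec must round-trip losslessly
    results[name] = {
        "ratio": len(payload) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }

for name, r in results.items():
    print(
        f"{name:5s} ratio={r['ratio']:.1f}x "
        f"compress={r['compress_s']:.3f}s decompress={r['decompress_s']:.3f}s"
    )
```

Timings from a single run are noisy; for a real comparison, repeat each measurement several times on a sample of your own data.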

CPU-bound vs disk-bound calculus

Production rule: pick codec based on bottleneck profile.

Disk-bound workload

Symptom: CPU idle, disk saturated reading raw data.
Solution: high-ratio codec — fewer bytes from disk, CPU has spare cycles to decompress.
Pick: zstd, bzip2 (if archival), gzip (legacy compatibility).

CPU-bound workload

Symptom: CPU saturated decompressing, disk underutilized.
Solution: low-CPU-cost codec with fast decompression, even at a low ratio.
Pick: lz4, snappy, zstd-1 (lowest level).

Network-bound workload

Symptom: cross-region transfer; bytes-on-wire dominate.
Solution: balanced. zstd-mid (level 6-12) for good compression with acceptable decompression speed.
Pick: zstd.

Archival workload

Symptom: write once, rarely read; storage size paramount.
Solution: maximum ratio ignoring compress speed.
Pick: lzma (xz), zstd-22 (max level), bzip2.
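
The four profiles above can be sketched as a small decision function. The function name, the profile labels, and the choice to return only the first pick from each list are illustrative assumptions, not part of any library.

```python
def pick_codec(bottleneck: str, legacy_compat: bool = False) -> str:
    """Map a bottleneck profile to the first pick from the rules above.

    Hypothetical helper: profile labels are 'disk', 'cpu', 'network', 'archival'.
    """
    if bottleneck == "disk":
        # High ratio; gzip only when older consumers cannot read zstd.
        return "gzip" if legacy_compat else "zstd"
    if bottleneck == "cpu":
        # Cheapest decompression wins, even at a low ratio.
        return "lz4"
    if bottleneck == "network":
        # Balanced: zstd at a mid level (6-12).
        return "zstd"
    if bottleneck == "archival":
        # Maximum ratio, compression speed ignored.
        return "lzma"
    raise ValueError(f"unknown bottleneck profile: {bottleneck!r}")


for profile in ("disk", "cpu", "network", "archival"):
    print(profile, "->", pick_codec(profile))
```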

Production calculus example (Parquet):

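1 TB raw data, gzip → ~250 GB (4x).
1 TB raw data, zstd-3 → ~250 GB (4x), but it decompresses about 4x faster than gzip.
1 TB raw data, snappy → ~700 GB (1.5x), but it decompresses about 8x faster than gzip.

A common modern choice for Parquet is zstd (level 1-3 for balance). Many writers still default to snappy, so set the codec explicitly.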
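
The bottleneck reasoning can be turned into back-of-the-envelope arithmetic. The sketch below assumes a deliberately simple model: a scan delivers raw data at disk throughput times compression ratio, capped by the codec's decompression speed. The ratio and speed values are midpoints of the ranges in the codec matrix above, treated as single-core MB/s of decompressed output; both the model and the numbers are assumptions for illustration only.

```python
# codec -> (compression ratio, decompression speed in MB/s), taken from the matrix above.
CODECS = {
    "none": (1.0, float("inf")),
    "gzip": (3.5, 250.0),
    "zstd": (4.0, 1000.0),
    "lz4": (1.75, 3000.0),
    "snappy": (1.5, 2000.0),
}


def effective_scan_mb_s(codec: str, disk_mb_s: float) -> float:
    """Raw-data throughput of a scan: disk-limited or decompression-limited."""
    ratio, decompress_mb_s = CODECS[codec]
    return min(disk_mb_s * ratio, decompress_mb_s)


def best_codec(disk_mb_s: float) -> str:
    """Codec with the highest modeled scan throughput (first one wins ties)."""
    return max(CODECS, key=lambda c: effective_scan_mb_s(c, disk_mb_s))


for disk in (200.0, 3000.0):  # hypothetical slow disk vs fast NVMe
    print(f"disk = {disk:.0f} MB/s")
    for codec in CODECS:
        print(f"  {codec:6s} {effective_scan_mb_s(codec, disk):7.0f} MB/s")
```

With these assumed numbers, a 200 MB/s disk favors zstd (800 MB/s effective), while gzip is already capped at 250 MB/s by its own decompression speed. On a fast disk the cap moves to the CPU, and the low-cost codecs pull ahead.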

Pyodide-safe demo: `gzip` via `io.BytesIO`

The stdlib gzip module works in Pyodide. It operates on bytes (binary data), so you need io.BytesIO (not io.StringIO as in lessons 01-03):

import gzip
import io

# Compress bytes
data = b"hello, world! This is some text we'll compress."
compressed = gzip.compress(data)
print(len(data), '->', len(compressed))
# 47 -> larger than the input (small-data overhead: gzip header + trailer add about 18 bytes)

# Larger sample — compression actually helps
sample = b"hello, world! " * 1000  # 14000 bytes
compressed = gzip.compress(sample)
print(len(sample), '->', len(compressed))
# 14000 -> ~117  (roughly 120x ratio for repeating data; exact size depends on the zlib build)

# Decompress
restored = gzip.decompress(compressed)
print(restored == sample)  # True

Streaming variant via gzip.GzipFile:

import gzip
import io

# Write compressed
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b"line 1\n")
    gz.write(b"line 2\n")

compressed_bytes = buf.getvalue()
print(len(compressed_bytes))  # ~30 bytes

# Read compressed
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    print(gz.read())  # b'line 1\nline 2\n'

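The same pattern works for bz2 and lzma:

import bz2, lzma
assert bz2.decompress(bz2.compress(b"data")) == b"data"    # streaming class: bz2.BZ2File
assert lzma.decompress(lzma.compress(b"data")) == b"data"  # streaming class: lzma.LZMAFile

The API is nearly identical (compress / decompress / a *File streaming class). See docs.python.org/3/library/gzip, bz2, and lzma.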
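
The streaming classes also let you read a compressed stream in fixed-size chunks instead of loading everything at once. A minimal sketch follows; the 64 KiB chunk size and the generated log lines are arbitrary assumptions.

```python
import gzip
import io

# Build a compressed in-memory "file" with many lines.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for i in range(10_000):
        gz.write(f"log line {i}\n".encode("utf-8"))

# Read it back chunk by chunk so memory use stays bounded by the chunk size.
buf.seek(0)
total_bytes = 0
line_count = 0
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    # gz.read(n) returns b"" at end of stream, which stops the iterator.
    for chunk in iter(lambda: gz.read(64 * 1024), b""):
        total_bytes += len(chunk)
        line_count += chunk.count(b"\n")

print(f"{len(buf.getvalue())} compressed bytes -> {total_bytes} bytes, {line_count} lines")
```

The same loop should work unchanged with bz2.BZ2File or lzma.LZMAFile, since both expose read(size).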

Pitfall 33: `gzip.decompress` requires `bytes`, not `str`

What goes wrong:

import gzip

gzip.decompress("compressed hex string")  # TypeError
# TypeError: a bytes-like object is required, not 'str'

Why: gzip operates on binary data. Compressed output is bytes (a raw byte sequence, generally not valid UTF-8). Passing a str is an easy mistake for learners coming from the text I/O context of lessons 01-03.

How to avoid: always work with bytes literals (b"...") and io.BytesIO (NOT io.StringIO) in compression workflows:

# Correct
import gzip
data: bytes = b"hello"           # bytes literal
compressed: bytes = gzip.compress(data)
restored: bytes = gzip.decompress(compressed)

# Convert text → bytes
text: str = "hello, мир"
data: bytes = text.encode('utf-8')  # encode → bytes
compressed = gzip.compress(data)
restored = gzip.decompress(compressed).decode('utf-8')  # decode → str

Production rule: mind the encoding boundary. Encode str → bytes BEFORE compression; decode bytes → str AFTER decompression. Compression operates only on bytes.
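
If you prefer not to call encode / decode by hand, gzip.open in text mode ("wt" / "rt") places the encoding boundary inside the file object. A minimal sketch, assuming UTF-8 and an in-memory buffer:

```python
import gzip
import io

text = "hello, мир\n" * 3

buf = io.BytesIO()
# Mode "wt" wraps the gzip stream in a text layer that encodes str -> bytes.
with gzip.open(buf, mode="wt", encoding="utf-8") as f:
    f.write(text)

compressed = buf.getvalue()  # still raw bytes underneath
print(type(compressed).__name__, len(compressed))

buf.seek(0)
# Mode "rt" decompresses and decodes bytes -> str on the way out.
with gzip.open(buf, mode="rt", encoding="utf-8") as f:
    restored = f.read()

print(restored == text)
```

The boundary itself does not go away: the buffer still holds bytes, and only the wrapper performs the conversion.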

Cross-course → Storage Formats M09 compression internals

The Storage Formats course covers compression internals in M09 (7 lessons):

01 - Compression internals: algorithm fundamentals (LZ77, Huffman, arithmetic coding); how gzip/zstd/snappy actually work
02 - Compression tuning: speed vs ratio tuning per workload; level selection rules
03 - Btrblocks: modern columnar compression scheme (cascading codecs)
04 - Fastlanes: SIMD-friendly bit-packing
05 - ALP: Adaptive Lossless floating-Point compression
06 - FSST: Fast Static Symbol Table (string-aware compression)
07 - Future encoding: research frontier

Three-layer cross-course bridge (carried over from lesson 04):

Stdlib gzip.compress (M09 lesson 05): single-process Python.
Parquet column-chunk zstd compression: distributed analytical workload.
ClickHouse column codec CODEC(ZSTD(3)): vectorized OLAP DB.

Same algorithm family (LZ77 + entropy coding), three execution layers.

Spark shuffle compression — production tuning

Spark 03/04 (groupBy aggregations) covers shuffle mechanics: the intermediate data exchanged between map and reduce stages. The default shuffle codec is lz4, because decompression speed matters most here:

from pyspark.sql import SparkSession

# Spark config (pyspark or spark-submit --conf)
spark = (
    SparkSession.builder
    # 'lz4' is the default; uncomment to set it explicitly:
    # .config('spark.io.compression.codec', 'lz4')
    # Switch to zstd for cross-region clusters
    .config('spark.io.compression.codec', 'zstd')
    .config('spark.io.compression.zstd.level', '3')
    .getOrCreate()
)

Production tuning rule:

Local cluster (single rack): lz4 (the network is not saturated, so low CPU cost matters more than ratio).
Cross-region cluster: zstd (network bytes dominate).
Spillable shuffle (large datasets): zstd-1 (faster compress for one-shot writes).
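
The tuning rule can be written down as a small lookup that produces Spark conf entries. The helper name and the topology labels are hypothetical; the two conf keys are the ones used in the config snippet above, and level 3 for the cross-region case follows that snippet.

```python
def shuffle_compression_conf(topology: str) -> dict[str, str]:
    """Return Spark conf entries for a cluster profile (hypothetical helper)."""
    if topology == "single_rack":
        return {"spark.io.compression.codec": "lz4"}
    if topology == "cross_region":
        return {
            "spark.io.compression.codec": "zstd",
            "spark.io.compression.zstd.level": "3",
        }
    if topology == "spill_heavy":
        return {
            "spark.io.compression.codec": "zstd",
            "spark.io.compression.zstd.level": "1",
        }
    raise ValueError(f"unknown topology: {topology!r}")


# Print the entries as spark-submit flags.
for key, value in shuffle_compression_conf("cross_region").items():
    print(f"--conf {key}={value}")
```

In a PySpark session the same dictionary can be applied by calling .config(key, value) on the builder for each entry.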

Recipe: production gzip pipeline with error handling

End-to-end: encode str → compress → simulated transmission → decompress → decode str.

import gzip
import io
import zlib

def compress_text(text: str, level: int = 6) -> bytes:
    """Compress text → gzip bytes. Default level 6 (balance)."""
    return gzip.compress(text.encode('utf-8'), compresslevel=level)

def decompress_text(data: bytes) -> str:
    """Decompress gzip bytes → text. Raises ValueError on corrupt data."""
    try:
        return gzip.decompress(data).decode('utf-8')
    # BadGzipFile is an OSError; truncated or damaged streams raise EOFError / zlib.error
    except (OSError, EOFError, zlib.error) as e:
        raise ValueError(f'Invalid gzip data: {e}') from e

# Usage: simulate over-the-wire compression
text = "Hello, мир! " * 100   # 1200 characters, 1500 UTF-8 bytes
compressed = compress_text(text)
print(len(text.encode('utf-8')), '->', len(compressed))  # 1500 -> well under 100

restored = decompress_text(compressed)
assert restored == text, 'round-trip failed'

# Streaming variant — write compressed log
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    for line in ['log line 1\n', 'log line 2\n', 'log line 3\n']:
        gz.write(line.encode('utf-8'))

print(f'Compressed log size: {len(buf.getvalue())} bytes')

What's in the next lesson

Lesson 06: pathlib API (Path / PurePath / PurePosixPath / PureWindowsPath). Code challenge py-m09-06-code-1: Pattern 3 (pathlib path arithmetic via PurePosixPath). Run-on-Your-Machine callout #1: real disk operations via Path.iterdir / Path.glob (browser MEMFS is limited, per Pitfall 22).

Pragmatic-DEEP principle: we don't deep-dive into compression algorithms (LZ77, Huffman); that is Storage Formats course territory. Selection criteria plus a Pyodide-safe demo are sufficient for M09. Production: zstd as the analytical default from 2024 on; lz4 for high-throughput pipelines.
