Compression: gzip, bzip2, lz4, zstd, snappy
Lesson 04 showed that Parquet / ORC / Avro use column-chunk or block-level compression. Which codec should you choose? The trade-off is speed vs ratio: fast codecs compress weakly, slow codecs compress well. The production calculus: is the bottleneck CPU or disk/network? Pragmatic-DEEP: a 5-codec matrix + decision tree + Pyodide-safe gzip demo. Internals (Btrblocks / Fastlanes / ALP / FSST) are covered in the cross-course Storage Formats M09 deep dive.
In this lesson:
- Why compression: disk I/O / network / storage cost.
- Codec matrix: gzip / bzip2 / lz4 / zstd / snappy.
- CPU-bound vs disk-bound calculus: where the bottleneck is.
- Pyodide-safe demo: gzip.compress / gzip.decompress via io.BytesIO.
- Pitfall 33: gzip.decompress requires bytes, not str.
- Cross-course → Storage Formats M09: Btrblocks, Fastlanes, ALP, FSST.
- Spark shuffle compression: production tuning brief.
Why compression — disk / network / storage cost
Three production drivers:
- Storage cost: S3 / GCS / Azure Blob are pay-per-GB. Compress 10 TB raw → 2 TB → 5x cost reduction.
- Network bandwidth: Spark shuffle / cross-region replication / streaming ingestion. Bytes on the wire dominate latency for big workloads.
- Disk I/O: analytical scans are often bottlenecked on disk (roughly a 10x gap between disk throughput in MB/s and CPU throughput in GB/s). Compression trades CPU for reduced disk I/O, which is usually a net positive.
Counter-examples: in-memory pipelines (Arrow IPC zero-copy) and high-throughput Kafka (lz4 for minimal CPU overhead). Compression is a per-workload decision, not a universal default.
Codec matrix: 5 codecs plus lzma for comparison
| Codec | Compress speed | Decompress speed | Ratio | Pyodide stdlib? | Use case |
|---|---|---|---|---|---|
| gzip | medium (~50 MB/s) | fast (~250 MB/s) | medium-high (3-4x) | yes (gzip) | universal HTTP / log files / archives — Content-Encoding: gzip |
| bzip2 | slow (~10 MB/s) | medium (~50 MB/s) | high (4-6x) | yes (bz2) | archival storage; production analytics rare |
| lzma (xz) | very slow (~5 MB/s) | slow (~30 MB/s) | very high (5-8x) | yes (lzma) | archival; Linux kernel tarballs |
| lz4 | very fast (~500 MB/s) | very fast (~3000 MB/s) | low-medium (1.5-2x) | no (pip install lz4) | high-throughput pipelines (Kafka, Spark shuffle) |
| zstd | fast (~250 MB/s) | fast (~1000 MB/s) | high (3-5x) | no (pip install zstandard; Python 3.14+ compression.zstd stdlib) | modern default — analytics (Parquet, ClickHouse) |
| snappy | very fast (~600 MB/s) | very fast (~2000 MB/s) | low (1.5x) | no (pip install python-snappy) | Hadoop ecosystem (HDFS, ORC, BigTable) |
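Key insight: zstd is displacing gzip / snappy in production (2024+) because it combines near-snappy speed with near-gzip ratio. Adoption: zstd is a standard Parquet codec and a common choice for new analytical datasets, ClickHouse ships a ZSTD codec, the Linux kernel supports zstd-compressed modules, and btrfs / squashfs offer zstd as an option. Defaults vary by tool and version, so check the writer you actually use.
Source: the zstd README, which includes comprehensive benchmarks.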
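The numbers in the matrix are rough and depend on data and hardware, so it can help to measure on your own sample. Below is a minimal sketch that times the three stdlib codecs (the only ones available in Pyodide without extra packages). The sample payload and the `measure` helper are illustrative assumptions, not part of any library.

```python
import bz2
import gzip
import lzma
import time

# Hypothetical sample: repetitive log-like text. Real data will behave differently.
sample = b"2024-01-01 INFO request handled in 12 ms\n" * 20_000

# Only stdlib codecs are measured; lz4 / zstd / snappy need third-party packages.
codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def measure(compress, decompress, data):
    """Return (ratio, compress_seconds, decompress_seconds) for one codec."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # round-trip sanity check
    return len(data) / len(packed), t1 - t0, t2 - t1


results = {name: measure(c, d, sample) for name, (c, d) in codecs.items()}

for name, (ratio, c_sec, d_sec) in results.items():
    print(f"{name:5} ratio={ratio:8.1f}x "
          f"compress={c_sec * 1000:7.1f} ms decompress={d_sec * 1000:7.1f} ms")
```

Timings from a single run are noisy; treat them as a relative ordering on your machine rather than as absolute throughput figures.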
CPU-bound vs disk-bound calculus
Production rule: pick codec based on bottleneck profile.
Disk-bound workload
- Symptom: CPU idle, disk saturated reading raw data.
- Solution: high-ratio codec — fewer bytes from disk, CPU has spare cycles to decompress.
- Pick: zstd, bzip2 (if archival), gzip (legacy compatibility).
CPU-bound workload
- Symptom: CPU saturated decompressing, disk underutilized.
- Solution: low-CPU-cost codec with fast decompression, even at a low ratio.
- Pick: lz4, snappy, zstd-1 (lowest level).
Network-bound workload
- Symptom: cross-region transfer; bytes-on-wire dominate.
- Solution: a balanced setting, zstd-mid (level 6-12), for good compression with acceptable decompression speed.
- Pick: zstd.
Archival workload
- Symptom: write once, rarely read; storage size paramount.
- Solution: maximum ratio ignoring compress speed.
- Pick: lzma (xz), zstd-22 (max level), bzip2.
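The four profiles above can be sketched as a small lookup. This is only an illustration of the lesson's rules: the `Bottleneck` enum, the `pick_codec` function, and its `available` parameter are hypothetical names, and the first codec listed for each profile is assumed to be the preferred one.

```python
from enum import Enum
from typing import Optional, Set


class Bottleneck(Enum):
    DISK = "disk"
    CPU = "cpu"
    NETWORK = "network"
    ARCHIVAL = "archival"


# Candidates in the order the lesson lists them; the first entry is
# treated as the preferred pick (an assumption of this sketch).
CANDIDATES = {
    Bottleneck.DISK: ["zstd", "bzip2", "gzip"],
    Bottleneck.CPU: ["lz4", "snappy", "zstd-1"],
    Bottleneck.NETWORK: ["zstd"],
    Bottleneck.ARCHIVAL: ["lzma", "zstd-22", "bzip2"],
}


def pick_codec(bottleneck: Bottleneck, available: Optional[Set[str]] = None) -> str:
    """Return the first listed codec for the profile that is also available.

    `available=None` means every codec is assumed to be installed.
    """
    for codec in CANDIDATES[bottleneck]:
        if available is None or codec in available:
            return codec
    raise LookupError(f"no available codec for {bottleneck.value}-bound workload")


print(pick_codec(Bottleneck.CPU))                       # lz4
print(pick_codec(Bottleneck.DISK, {"gzip", "bzip2"}))   # bzip2
```

Passing `available` models environments such as Pyodide, where only the stdlib codecs exist.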
Production calculus example (Parquet):
- 1 TB raw data, gzip → ~250 GB (4x).
- 1 TB raw data, zstd-3 → ~250 GB (4x), but it decompresses 4x faster than gzip.
- 1 TB raw data, snappy → ~670 GB (1.5x), but it decompresses 8x faster than gzip.
A good modern choice for Parquet is zstd (level 1-3 for balance). Some writers still default to snappy, so set the codec explicitly.
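To see why the bottleneck decides the winner, here is a back-of-the-envelope model of a full scan. It is a sketch under loud assumptions: the disk bandwidth of 200 MB/s is hypothetical, decompression speeds are taken from the codec matrix and assumed to be per MB of decompressed output on one core, and reading and decompressing are assumed to overlap perfectly, so scan time is the slower of the two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    ratio: float            # raw size / compressed size
    decompress_mb_s: float  # assumed: MB of decompressed output per second


def scan_seconds(raw_mb: float, disk_mb_s: float, codec: Codec) -> float:
    """Estimated scan time when disk reads and decompression overlap."""
    read = (raw_mb / codec.ratio) / disk_mb_s
    decompress = raw_mb / codec.decompress_mb_s
    return max(read, decompress)


RAW_MB = 1_000_000   # 1 TB of raw data
DISK_MB_S = 200.0    # hypothetical disk bandwidth

codecs = [
    Codec("none", 1.0, float("inf")),   # no decompression cost
    Codec("gzip", 4.0, 250.0),
    Codec("zstd-3", 4.0, 1000.0),
    Codec("snappy", 1.5, 2000.0),
]

estimates = {c.name: scan_seconds(RAW_MB, DISK_MB_S, c) for c in codecs}
for name, seconds in estimates.items():
    print(f"{name:7} {seconds:8.0f} s")
# none 5000 s, gzip 4000 s (CPU-bound), zstd-3 1250 s, snappy 3333 s (disk-bound)
```

Under these assumptions gzip is limited by decompression and snappy by disk reads, while zstd-3 keeps both under control. A faster disk or more cores shifts the balance, which is the point of the calculus.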
Pyodide-safe demo: gzip via io.BytesIO
The stdlib gzip module works in Pyodide. It operates on bytes (binary), so you need io.BytesIO (not io.StringIO as in lessons 01-03):
import gzip
import io
# Compress bytes
data = b"hello, world! This is some text we'll compress."
compressed = gzip.compress(data)
print(len(data), '->', len(compressed))
# 47 -> a larger number (! small-data overhead: the gzip header + trailer add 18 bytes)
# Larger sample — compression actually helps
sample = b"hello, world! " * 1000 # 14000 bytes
compressed = gzip.compress(sample)
print(len(sample), '->', len(compressed))
# 14000 -> ~117 (about 120x ratio for repeating data; exact size depends on the zlib build)
# Decompress
restored = gzip.decompress(compressed)
print(restored == sample) # True
Streaming variant via gzip.GzipFile:
import gzip
import io
# Write compressed
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b"line 1\n")
    gz.write(b"line 2\n")
compressed_bytes = buf.getvalue()
print(len(compressed_bytes)) # ~30 bytes
# Read compressed
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    print(gz.read()) # b'line 1\nline 2\n'
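The same pattern applies to bz2 and lzma:
- import bz2; bz2.compress(...) / bz2.BZ2File(...)
- import lzma; lzma.compress(...) / lzma.LZMAFile(...)
The API is nearly identical (compress / decompress / *File streaming class); the main difference is that lzma takes preset where gzip and bz2 take compresslevel. See docs.python.org/3/library/gzip + bz2 + lzma.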
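A minimal sketch of that shared shape, using only stdlib calls. The `stream_roundtrip` helper and the payload are illustrative names introduced here.

```python
import bz2
import io
import lzma

payload = b"line 1\nline 2\n" * 500

# One-shot API: compress / decompress, the same shape as gzip.
for module in (bz2, lzma):
    packed = module.compress(payload)
    assert module.decompress(packed) == payload
    print(module.__name__, len(payload), "->", len(packed))


def stream_roundtrip(file_cls, data: bytes) -> bytes:
    """Write data through a *File class into memory, then read it back."""
    buf = io.BytesIO()
    # BZ2File / LZMAFile take the file object as the first argument
    # (gzip.GzipFile expects it as the fileobj= keyword instead).
    with file_cls(buf, mode="wb") as f:
        f.write(data)
    buf.seek(0)
    with file_cls(buf, mode="rb") as f:
        return f.read()


print(stream_roundtrip(bz2.BZ2File, payload) == payload)    # True
print(stream_roundtrip(lzma.LZMAFile, payload) == payload)  # True
```

Closing the wrapper does not close the underlying `io.BytesIO`, which is why the buffer can be rewound and reused for reading.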
Pitfall 33: gzip.decompress requires bytes, not str
What goes wrong:
import gzip
gzip.decompress("compressed hex string") # TypeError
# TypeError: a bytes-like object is required, not 'str'
Why: gzip operates on binary data. Compressed output is bytes (a raw byte sequence, not valid UTF-8). Calling it with str is a common mistake for learners coming from the text I/O context of lessons 01-03.
How to avoid: always work with bytes literals (b"...") and io.BytesIO (NOT io.StringIO) in compression workflows:
# Correct
import gzip
data: bytes = b"hello" # bytes literal
compressed: bytes = gzip.compress(data)
restored: bytes = gzip.decompress(compressed)
# Convert text → bytes
text: str = "hello, мир"
data: bytes = text.encode('utf-8') # encode → bytes
compressed = gzip.compress(data)
restored = gzip.decompress(compressed).decode('utf-8') # decode → str
Production rule (the encoding boundary): encode str → bytes BEFORE compression; decode bytes → str AFTER decompression. Compression operates only on bytes.
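If you would rather not manage the boundary by hand, the stdlib can do it for you: `gzip.open` accepts a file object and a text mode (`'wt'` / `'rt'`), and then encodes and decodes on your behalf. A minimal sketch, with an illustrative sample string:

```python
import gzip
import io

text = "hello, мир\n"

buf = io.BytesIO()
# 'wt' = write text: gzip.open wraps the stream in a text layer that
# encodes str -> bytes before the data reaches the compressor.
with gzip.open(buf, mode="wt", encoding="utf-8") as f:
    f.write(text)

raw = buf.getvalue()  # what is stored or sent is still bytes
print(type(raw).__name__)

buf.seek(0)
# 'rt' = read text: bytes are decompressed first, then decoded to str.
with gzip.open(buf, mode="rt", encoding="utf-8") as f:
    restored = f.read()

print(restored == text)  # True
```

The rule itself does not change: the compressor still sees only bytes, the text layer just sits on top of it.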
Cross-course → Storage Formats M09 compression internals
The Storage Formats course covers compression internals in M09 (7 lessons):
- 01. Compression internals: algorithm fundamentals (LZ77, Huffman, arithmetic coding); how gzip/zstd/snappy actually work
- 02. Compression tuning: speed vs ratio tuning per workload; level selection rules
- 03. Btrblocks: modern columnar compression scheme (cascading codecs)
- 04. Fastlanes: SIMD-friendly bit-packing
- 05. ALP: Adaptive Lossless floating-Point compression (float compression)
- 06. FSST: Fast Static Symbol Table (string-aware compression)
- 07. Future encoding: research frontier
Three-layer cross-course bridge (carried over from lesson 04):
- Stdlib gzip.compress (M09 lesson 05): single-process Python.
- Parquet column-chunk zstd compression: distributed analytical workload.
- ClickHouse compression_codec='ZSTD(3)' settings: vectorized OLAP DB.
Same algorithm family (LZ77 + entropy coding), three execution layers.
Spark shuffle compression — production tuning
Spark 03/04 (groupBy aggregations) covers shuffle mechanics: the intermediate data exchanged between map and reduce stages. The default compression codec is lz4, because fast decompression matters most here:
# Spark config (pyspark or spark-submit --conf)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # The default codec is 'lz4'; nothing needs to be set to keep it.
    # Switch to zstd for cross-region clusters:
    .config('spark.io.compression.codec', 'zstd')
    .config('spark.io.compression.zstd.level', '3')
    .getOrCreate()
)
Production tuning rule:
- Local cluster (single rack): lz4 (the network is not saturated, so low CPU cost matters more than fewer bytes).
- Cross-region cluster: zstd (network bytes dominate).
- Spillable shuffle (large datasets): zstd-1 (faster compress for one-shot writes).
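The tuning rule can be captured as a plain mapping from cluster topology to Spark settings. This is a sketch: the topology labels and the `shuffle_compression_conf` function are hypothetical, while the two configuration keys are the ones used in the snippet above.

```python
from typing import Dict


def shuffle_compression_conf(topology: str) -> Dict[str, str]:
    """Map a (hypothetical) topology label to shuffle compression settings."""
    if topology == "local":
        # Single rack: the network is not the bottleneck, keep CPU cost low.
        return {"spark.io.compression.codec": "lz4"}
    if topology == "cross-region":
        # Bytes on the wire dominate: spend CPU for a better ratio.
        return {
            "spark.io.compression.codec": "zstd",
            "spark.io.compression.zstd.level": "3",
        }
    if topology == "spill":
        # Large spillable shuffles: lowest zstd level for faster writes.
        return {
            "spark.io.compression.codec": "zstd",
            "spark.io.compression.zstd.level": "1",
        }
    raise ValueError(f"unknown topology: {topology!r}")


conf = shuffle_compression_conf("cross-region")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The same dictionary can be applied to a session builder by calling `.config(key, value)` for each pair.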
Recipe: production gzip pipeline with error handling
End-to-end: encode str → compress → simulated transmission → decompress → decode str.
import gzip
import io
import zlib

def compress_text(text: str, level: int = 6) -> bytes:
    """Compress text → gzip bytes. Default level 6 (balance)."""
    return gzip.compress(text.encode('utf-8'), compresslevel=level)

def decompress_text(data: bytes) -> str:
    """Decompress gzip bytes → text. Raises ValueError on corrupt data."""
    try:
        return gzip.decompress(data).decode('utf-8')
    except (OSError, EOFError, zlib.error) as e:
        # OSError covers gzip.BadGzipFile; EOFError = truncated stream;
        # zlib.error = corrupt deflate data.
        # UnicodeDecodeError is already a ValueError and propagates as-is.
        raise ValueError(f'Invalid gzip data: {e}') from e

# Usage: simulate over-the-wire compression
text = "Hello, мир! " * 100  # 1200 characters, 1500 UTF-8 bytes
compressed = compress_text(text)
print(len(text.encode('utf-8')), '->', len(compressed))  # 1500 -> a few dozen bytes
restored = decompress_text(compressed)
assert restored == text, 'round-trip failed'

# Streaming variant: write compressed log
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    for line in ['log line 1\n', 'log line 2\n', 'log line 3\n']:
        gz.write(line.encode('utf-8'))
print(f'Compressed log size: {len(buf.getvalue())} bytes')
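The recipe promises error handling, so it is worth exercising the failure path as well. The sketch below is self-contained: `safe_decompress` is a hypothetical helper that mirrors the idea of converting low-level errors into `ValueError`, and the two broken inputs are made up for illustration.

```python
import gzip
import zlib


def safe_decompress(data: bytes) -> bytes:
    """Decompress gzip bytes, converting low-level errors to ValueError."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error) as e:
        # OSError includes gzip.BadGzipFile (bad magic, failed CRC);
        # EOFError signals a truncated stream; zlib.error a corrupt one.
        raise ValueError(f"Invalid gzip data: {e}") from e


good = gzip.compress(b"payload " * 100)

broken_inputs = {
    "truncated": good[:-5],               # part of the trailer is missing
    "not gzip": b"plain text, not gzip",  # wrong magic bytes
}

outcomes = {}
for label, blob in broken_inputs.items():
    try:
        safe_decompress(blob)
        outcomes[label] = "ok"
    except ValueError as e:
        outcomes[label] = "rejected"
        print(f"{label}: {e}")

print(outcomes)
```

Catching a single `ValueError` at the call site keeps transport code free of codec-specific exception types.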
What's in the next lesson
Lesson 06: pathlib API (Path / PurePath / PurePosixPath / PureWindowsPath). Code-challenge py-m09-06-code-1: Pattern 3 (pathlib path arithmetic via PurePosixPath). Run-on-Your-Machine callout #1: real disk operations via Path.iterdir / Path.glob (browser MEMFS limited per Pitfall 22).
Pragmatic-DEEP principle: we do not deep-dive into compression algorithms (LZ77, Huffman); that is Storage Formats course territory. Selection criteria + a Pyodide-safe demo are sufficient for M09. Production: zstd as the analytical default (2024+); lz4 for high-throughput pipelines.