Data Lakehouse Architecture
Lakehouse: the best of Lake + Warehouse
A lakehouse combines the cheap storage of a data lake with the ACID guarantees and query performance of a warehouse. The key enabling technology is open table formats.
Layers, top to bottom:
Engines: Batch (Spark) | Streaming (Flink) | SQL (Trino/Presto)
Table format: Delta / Iceberg / Hudi
Object storage: S3 / GCS / ADLS
Table Formats: Delta vs Iceberg vs Hudi
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Creator | Databricks | Netflix → Apache | Uber → Apache |
| Transaction Log | _delta_log/ (JSON commits + Parquet checkpoints) | metadata/ (JSON metadata + Avro manifests) | .hoodie/ (Timeline) |
| ACID | Optimistic concurrency | Snapshot isolation | MVCC |
| Time Travel | By version/timestamp | By snapshot ID or timestamp | By commit timeline |
| Schema Evolution | Add/rename columns | Full (add/drop/rename/reorder) | Add/rename columns |
| Partition Evolution | Manual rewrite | Hidden partitioning (no rewrite) | Manual rewrite |
| Ecosystem Lock-in | Databricks-centric | Vendor-neutral | Moderate |
| Best For | Databricks users | Multi-engine, vendor-neutral | Upsert-heavy (CDC) |
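To make the "hidden partitioning (no rewrite)" entry more concrete, here is a toy Python model of the idea. It assumes each data file records which partition spec it was written under, so a spec change only affects new files. The names `SPECS`, `files`, and `files_for` are hypothetical and are not part of any Iceberg API.

```python
from datetime import datetime


def month(ts):
    return ts.strftime("%Y-%m")


def day(ts):
    return ts.strftime("%Y-%m-%d")


# Spec 0 partitioned by month; the table later evolved to spec 1 (by day).
# Files written under spec 0 are left in place and keep their old layout.
SPECS = {0: month, 1: day}

files = [
    {"path": "a.parquet", "spec_id": 0, "partition": "2024-01"},
    {"path": "b.parquet", "spec_id": 1, "partition": "2024-02-10"},
    {"path": "c.parquet", "spec_id": 1, "partition": "2024-02-11"},
]


def files_for(ts):
    """Return files that may contain rows with this timestamp.

    The query filters on the raw timestamp; the partition value is derived
    per file using the spec that file was written with.
    """
    return [f["path"] for f in files if SPECS[f["spec_id"]](ts) == f["partition"]]


february_hit = files_for(datetime(2024, 2, 10, 15, 30))
january_hit = files_for(datetime(2024, 1, 5))
```

The point of the sketch is that the reader never mentions a partition column: the filter is on the timestamp itself, and old and new layouts coexist in one table.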
Table Format Internals
How a commit works
Write operation:
1. Write new Parquet files to storage
2. Create commit entry in transaction log
3. Atomic commit (rename / conditional put)
Read operation:
1. Read transaction log → list of valid files
2. Apply partition pruning + file pruning
3. Read only relevant Parquet files
Time Travel:
SELECT * FROM orders VERSION AS OF 42
→ Read log up to version 42 → get file list at that point
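The write, read, and time travel steps above can be sketched with a small in-memory model. This is a simplified illustration, not the actual Delta, Iceberg, or Hudi implementation: `ObjectStore`, `Table`, and `CommitConflict` are hypothetical names, and `put_if_absent` stands in for whatever atomic rename or conditional put the storage layer offers.

```python
import json


class CommitConflict(Exception):
    """Raised when another writer has already claimed the target version."""


class ObjectStore:
    """Hypothetical in-memory stand-in for object storage with a conditional put."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, body):
        # Models an atomic "create only if the key does not exist" operation.
        if key in self._objects:
            raise CommitConflict(key)
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


class Table:
    def __init__(self, store, name):
        self.store = store
        self.log_prefix = f"{name}/_log/"

    def latest_version(self):
        return len(self.store.list_keys(self.log_prefix)) - 1  # -1 means empty

    def commit(self, read_version, add=(), remove=()):
        # The data files are assumed to be written already (step 1).
        # Steps 2 and 3: create the log entry for the next version atomically.
        version = read_version + 1
        entry = json.dumps({"add": list(add), "remove": list(remove)})
        self.store.put_if_absent(f"{self.log_prefix}{version:020d}.json", entry)
        return version

    def files_as_of(self, version=None):
        # Replay the log up to `version` to get the set of live data files.
        if version is None:
            version = self.latest_version()
        live = set()
        for key in self.store.list_keys(self.log_prefix)[: version + 1]:
            entry = json.loads(self.store.get(key))
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)


orders = Table(ObjectStore(), "orders")
v0 = orders.commit(-1, add=["part-0.parquet"])
v1 = orders.commit(v0, add=["part-1.parquet"])
# A rewrite replaces both earlier files with one larger file.
v2 = orders.commit(v1, add=["part-2.parquet"],
                   remove=["part-0.parquet", "part-1.parquet"])
```

Calling `orders.files_as_of(1)` replays only the first two log entries, which is the time travel case; a second writer that still believes the latest version is 1 gets `CommitConflict` on commit and has to re-read the log and retry, which is the optimistic concurrency case.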
Compaction
Small File Problem:
Streaming writes 1 file/minute → 1440 files/day
Query: scan 1440 small files = slow (file open overhead)
Compaction:
1440 × 1 MB files → 12 files of up to 128 MB each
Run as background job: OPTIMIZE table
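The arithmetic behind compaction can be checked with a short sketch. The greedy grouping below is an assumption chosen for illustration; real engines use their own bin-packing strategies, and `plan_compaction` is a hypothetical helper name.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into output files of at most target_mb."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current output file once the next input would overflow it.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


# One day of streaming output: 1440 files of 1 MB each.
plan = plan_compaction([1] * 1440)
output_sizes = [sum(group) for group in plan]
```

Under these assumptions the plan yields eleven full 128 MB files plus one 32 MB remainder, so a query opens 12 files instead of 1440.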
Z-Order / Hilbert Clustering:
Co-locate related data within files
OPTIMIZE table ZORDER BY (date, customer_id)
→ Queries filtering by date+customer scan fewer files
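To show why Z-ordering helps, the sketch below interleaves the bits of two integer keys (a Morton code), sorts a tiny 4 × 4 dataset by that value, and counts how many files survive min/max pruning. All names here (`z_value`, `split_into_files`, `files_scanned`) are hypothetical, and the integer `date` index is a simplification of a real date column.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y: bit i of x lands at 2i, bit i of y at 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def split_into_files(sorted_rows, rows_per_file=4):
    return [sorted_rows[i:i + rows_per_file]
            for i in range(0, len(sorted_rows), rows_per_file)]


def files_scanned(files, date=None, customer=None):
    """Count files whose min/max statistics cannot rule out the filter."""
    count = 0
    for rows in files:
        dates = [r[0] for r in rows]
        customers = [r[1] for r in rows]
        date_ok = date is None or min(dates) <= date <= max(dates)
        customer_ok = customer is None or min(customers) <= customer <= max(customers)
        if date_ok and customer_ok:
            count += 1
    return count


# Rows are (date index, customer_id) pairs.
rows = [(d, c) for d in range(4) for c in range(4)]
linear = split_into_files(sorted(rows))                               # sort by date, then customer
zordered = split_into_files(sorted(rows, key=lambda r: z_value(*r)))  # sort by Z-value
```

In this toy layout a filter on `customer` alone scans all 4 linearly sorted files but only 2 Z-ordered ones, while a filter on `date` alone goes from 1 file to 2. Z-ordering trades some pruning on the leading column for usable pruning on every clustered column.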
Schema Evolution
Iceberg schema evolution (no rewrite needed):
ALTER TABLE orders ADD COLUMN discount DOUBLE
ALTER TABLE orders RENAME COLUMN qty TO quantity
ALTER TABLE orders DROP COLUMN legacy_field
Old files: lack the new column, so it is read as NULL
New files: include new column
→ Zero-downtime schema changes
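One way to see why no rewrite is needed: Iceberg tracks columns by stable field IDs rather than by name. The sketch below is a heavily simplified model of that idea using plain dictionaries. The field IDs, the sample rows, and the `read` helper are all hypothetical, and real data files are Parquet rather than dicts.

```python
# Each schema version maps a stable field ID to the current column name.
schema_v1 = {1: "order_id", 2: "qty", 3: "legacy_field"}
# After ADD COLUMN discount, RENAME qty TO quantity, DROP COLUMN legacy_field:
schema_v2 = {1: "order_id", 2: "quantity", 4: "discount"}

# Data files store values keyed by field ID, so existing files stay untouched.
old_file = [{1: 100, 2: 3, 3: "x"}]   # written under schema_v1
new_file = [{1: 101, 2: 5, 4: 0.1}]   # written under schema_v2


def read(file_rows, schema):
    """Project stored rows onto the current schema by field ID.

    Missing IDs become None (NULL); IDs no longer in the schema are ignored.
    """
    return [{name: row.get(field_id) for field_id, name in schema.items()}
            for row in file_rows]


old_rows = read(old_file, schema_v2)
new_rows = read(new_file, schema_v2)
```

Reading the old file with the new schema resolves the renamed column through ID 2, returns NULL for `discount`, and never surfaces `legacy_field`, all without touching the file.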
Medallion Architecture in a Lakehouse
Bronze (Raw):
- Ingested as-is from sources
- Append-only, no transformations
- Schema: source schema (may be a JSON string)
- Retention: 90 days to 1 year
Silver (Cleaned):
- Deduplicated, validated, typed
- Business keys resolved
- SCD applied to dimensions
- Retention: 1-3 years
Gold (Aggregated):
- Business-level aggregates
- Optimized for BI queries
- Pre-joined dimensions
- Retention: indefinite
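The three layers can be illustrated with a minimal in-memory pipeline. This is a sketch under simplifying assumptions: the sample records, field names, and the `to_silver` / `to_gold` helpers are hypothetical, and a real pipeline would read and write tables rather than Python lists.

```python
import json
from collections import defaultdict

# Bronze: raw JSON strings, appended exactly as received.
bronze = [
    '{"order_id": "1", "customer": "a", "amount": "10.5"}',
    '{"order_id": "1", "customer": "a", "amount": "10.5"}',  # duplicate delivery
    '{"order_id": "2", "customer": "b", "amount": "oops"}',  # fails validation
    '{"order_id": "3", "customer": "a", "amount": "4.5"}',
]


def to_silver(raw_rows):
    """Type, validate, and deduplicate; keep rejects so they can be inspected."""
    seen, silver, rejected = set(), [], []
    for raw in raw_rows:
        record = json.loads(raw)
        try:
            typed = {
                "order_id": int(record["order_id"]),
                "customer": record["customer"],
                "amount": float(record["amount"]),
            }
        except (KeyError, ValueError):
            rejected.append(raw)
            continue
        if typed["order_id"] in seen:
            continue  # drop repeated deliveries of the same order
        seen.add(typed["order_id"])
        silver.append(typed)
    return silver, rejected


def to_gold(silver_rows):
    """Business-level aggregate: total order amount per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
```

Because the rejected record is kept alongside the Silver output, a surprising Gold total can be traced back to a specific raw row.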
WARNING
Anti-pattern: Gold without Silver. If Gold aggregates are computed directly from raw Bronze data, there is no way to debug data quality issues. The Silver layer is the source of truth for business logic.
TIP
Trend: Iceberg is becoming the de facto standard. Snowflake, BigQuery, Databricks, and AWS Athena all support Iceberg tables. The vendor-neutral format is winning.