Data Lakehouse Architecture

Lakehouse: The Best of Lake + Warehouse

A lakehouse combines the cheap storage of a data lake with the ACID guarantees and query performance of a warehouse. The key enabling technology is open table formats.

Lakehouse Architecture: Storage + Table Format + Engine

Engines:       Batch (Spark) | Streaming (Flink) | SQL (Trino/Presto)
Table format:  Delta / Iceberg / Hudi
Storage:       Object storage (S3 / GCS / ADLS)

Lakehouse architecture in Spark

Table Formats: Delta vs Iceberg vs Hudi

| Feature             | Delta Lake             | Apache Iceberg                           | Apache Hudi         |
|---------------------|------------------------|------------------------------------------|---------------------|
| Creator             | Databricks             | Netflix → Apache                         | Uber → Apache       |
| Transaction Log     | _delta_log/ (JSON)     | metadata/ (JSON metadata, Avro manifests) | .hoodie/ (Timeline) |
| ACID                | Optimistic concurrency | Snapshot isolation                       | MVCC                |
| Time Travel         | By version/timestamp   | By snapshot ID                           | By commit timeline  |
| Schema Evolution    | Add/rename columns     | Full (add/drop/rename/reorder)           | Add/rename columns  |
| Partition Evolution | Manual rewrite         | Hidden partitioning (no rewrite)         | Manual rewrite      |
| Ecosystem Lock-in   | Databricks-centric     | Vendor-neutral                           | Moderate            |
| Best For            | Databricks users       | Multi-engine, vendor-neutral             | Upsert-heavy (CDC)  |

Iceberg catalog architecture: a detailed walkthrough
Delta transaction log: how _delta_log is organized
Hudi timeline: the basis of MVCC

Table Format Internals

How a commit works

Write operation:
  1. Write new Parquet files to storage
  2. Create commit entry in transaction log
  3. Atomic commit (rename / conditional put)
  
Read operation:
  1. Read transaction log → list of valid files
  2. Apply partition pruning + file pruning
  3. Read only relevant Parquet files

Time Travel:
  SELECT * FROM orders VERSION AS OF 42
  → Read log up to version 42 → get file list at that point
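The write, read, and time-travel steps above can be sketched as a toy transaction log. This is a minimal illustration, not how any particular table format is implemented: the `TinyTableLog` class, the `_log` directory, and the file names are hypothetical, and the local file mode `"x"` (create only if absent) stands in for the atomic rename or conditional put mentioned in step 3.

```python
import json
import os
import tempfile


class TinyTableLog:
    """Toy transaction log: one JSON entry per table version (hypothetical layout)."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, add=(), remove=()):
        """Publish the next version; raises FileExistsError if another writer won."""
        version = self.latest_version() + 1
        entry = {"add": list(add), "remove": list(remove)}
        # "x" fails if the file already exists, so only one writer can
        # create a given version. This emulates the atomic commit step.
        with open(self._path(version), "x") as f:
            json.dump(entry, f)
        return version

    def files_as_of(self, version=None):
        """Replay the log up to `version` and return the valid data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                entry = json.load(f)
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)


log = TinyTableLog(tempfile.mkdtemp())
log.commit(add=["part-0.parquet"])  # version 0
log.commit(add=["part-1.parquet"])  # version 1
log.commit(add=["part-2.parquet"],  # version 2: rewrite into one file
           remove=["part-0.parquet", "part-1.parquet"])

print(log.files_as_of(1))  # ['part-0.parquet', 'part-1.parquet']
print(log.files_as_of())   # ['part-2.parquet']
```

Time travel falls out of the design: old data files are never modified in place, so reading an earlier version only means replaying fewer log entries.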

Compaction

Small File Problem:
  Streaming writes 1 file/minute → 1440 files/day
  Query: scan 1440 small files = slow (file open overhead)

Compaction:
  1440 × 1 MB files → 12 × 128 MB files
  Run as background job: OPTIMIZE table
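The arithmetic behind the compaction example can be checked with a small bin-packing sketch. It assumes a simple greedy planner with a 128 MB target; the function name `plan_compaction` is hypothetical, and real OPTIMIZE jobs may group files differently.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into bins of at most target_mb each."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current bin when the next file would overflow it.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


# One day of streaming output: 1440 files of 1 MB each.
plan = plan_compaction([1] * 1440)
print(len(plan))       # 12 output files
print(sum(plan[-1]))   # 32 (the last file is only partially full)
```

Eleven output files reach the full 128 MB and the twelfth holds the remaining 32 MB, which is where the figure of 12 files comes from.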

Z-Order / Hilbert Clustering:
  Co-locate related data within files
  OPTIMIZE table ZORDER BY (date, customer_id)
  → Queries filtering by date+customer scan fewer files
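To see why Z-ordering lets a query skip files, here is a minimal sketch using bit interleaving (a Morton code) on a synthetic 8 × 8 grid of (day, customer) pairs. The helper names, the grid, and the 4-rows-per-file layout are assumptions made for illustration; file skipping is modeled with per-file min/max statistics only.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into one Z-order (Morton) code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def split_into_files(rows, rows_per_file=4):
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_scanned(files, day=None, customer=None):
    """Count files that min/max statistics cannot rule out for the filter."""
    hits = 0
    for rows in files:
        days = [d for d, _ in rows]
        customers = [c for _, c in rows]
        day_ok = day is None or min(days) <= day <= max(days)
        cust_ok = customer is None or min(customers) <= customer <= max(customers)
        if day_ok and cust_ok:
            hits += 1
    return hits


rows = [(day, cust) for day in range(8) for cust in range(8)]
linear_files = split_into_files(sorted(rows))  # sorted by day, then customer
z_files = split_into_files(sorted(rows, key=lambda r: z_value(*r)))

print(files_scanned(linear_files, customer=5))   # 8
print(files_scanned(z_files, customer=5))        # 4
print(files_scanned(z_files, day=3, customer=5)) # 1
```

The trade-off is visible in the same data: the linear sort still wins for a filter on its leading column alone (2 files for `day=3` versus 4 with Z-order), while Z-order keeps both columns reasonably selective.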

Schema Evolution

Iceberg schema evolution (no rewrite needed):
  ALTER TABLE orders ADD COLUMN discount DOUBLE
  ALTER TABLE orders RENAME COLUMN qty TO quantity
  ALTER TABLE orders DROP COLUMN legacy_field
  
  Old files: read without new column (NULL default)
  New files: include new column
  → Zero-downtime schema changes
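Schema changes without rewrites are possible because Iceberg tracks columns by stable field IDs rather than by name. The sketch below models that idea with plain dictionaries; the schemas, IDs, and the `read_row` helper are illustrative assumptions, not Iceberg's actual reader.

```python
# Each schema maps a stable column ID to the column's current name.
schema_v1 = {1: "order_id", 2: "qty", 3: "legacy_field"}

# After ADD COLUMN discount, RENAME qty TO quantity, DROP legacy_field.
# IDs are never reused: the new column gets ID 4, the dropped ID 3 disappears.
schema_v2 = {1: "order_id", 2: "quantity", 4: "discount"}


def read_row(stored_row, current_schema):
    """Project a row stored as {column_id: value} onto the current schema."""
    # Missing IDs (columns added after the file was written) read as None.
    return {name: stored_row.get(col_id) for col_id, name in current_schema.items()}


old_file_row = {1: 1001, 2: 3, 3: "x"}  # written under schema_v1
new_file_row = {1: 1002, 2: 5, 4: 0.1}  # written under schema_v2

print(read_row(old_file_row, schema_v2))
# {'order_id': 1001, 'quantity': 3, 'discount': None}
print(read_row(new_file_row, schema_v2))
# {'order_id': 1002, 'quantity': 5, 'discount': 0.1}
```

Because the old file is resolved through IDs, the rename is a metadata-only change and the dropped column is simply never projected.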

Medallion Architecture in the Lakehouse Context

Bronze (Raw):
  - Ingested as-is from sources
  - Append-only, no transformations
  - Schema: source schema (may be a raw JSON string)
  - Retention: 90 days to 1 year

Silver (Cleaned):
  - Deduplicated, validated, typed
  - Business keys resolved
  - SCD applied to dimensions
  - Retention: 1-3 years

Gold (Aggregated):
  - Business-level aggregates
  - Optimized for BI queries
  - Pre-joined dimensions
  - Retention: indefinite
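The three layers can be sketched as a tiny in-memory pipeline. This is only an illustration under assumed data: the order records, field names, and the `to_silver` / `to_gold` helpers are hypothetical, and a real pipeline would write each layer to its own table.

```python
import json
from collections import defaultdict

# Bronze: raw JSON strings, appended exactly as received.
bronze = [
    '{"order_id": "1", "customer": "a", "amount": "10.5"}',
    '{"order_id": "1", "customer": "a", "amount": "10.5"}',  # duplicate delivery
    '{"order_id": "2", "customer": "b", "amount": "oops"}',  # fails validation
    '{"order_id": "3", "customer": "a", "amount": "4.5"}',
]


def to_silver(raw_rows):
    """Parse, validate, type, and deduplicate by order_id."""
    seen, silver = set(), []
    for raw in raw_rows:
        rec = json.loads(raw)
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # invalid amount: keep it out of Silver
        if rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        silver.append({"order_id": int(rec["order_id"]),
                       "customer": rec["customer"],
                       "amount": amount})
    return silver


def to_gold(silver_rows):
    """Business-level aggregate: total amount per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(silver))  # 2
print(gold)         # {'a': 15.0}
```

Keeping `silver` as a separate step means a surprising Gold number can be traced back to the cleaned rows that produced it, rather than to unparsed raw input.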

WARNING

Anti-pattern: Gold without Silver. If Gold aggregates are computed directly from raw Bronze data, there is no way to debug data quality issues. The Silver layer is the point of truth for business logic.

TIP

Trend: Iceberg is becoming the de facto standard. Snowflake, BigQuery, Databricks, and AWS Athena all support Iceberg tables. The vendor-neutral format is winning.

ClickHouse + Iceberg: a query engine for the lakehouse
