Lesson 07.01 · 25 min
Intermediate
Tags: Lakehouse, Delta Lake, Iceberg, Hudi, Table Formats, Medallion, Time Travel

Data Lakehouse Architecture

Lakehouse: the best of Lake + Warehouse

A lakehouse combines the cheap storage of a data lake with the ACID guarantees and performance of a warehouse. The key enabling technology is open table formats.

Lakehouse Architecture: Storage + Table Format + Engine

  Engines (top):        Batch (Spark) | Streaming (Flink) | SQL (Trino/Presto)
  Table format (middle): Delta / Iceberg / Hudi
  Storage (bottom):     Object Storage (S3 / GCS / ADLS)

See also: Lakehouse architecture in Spark

Table Formats: Delta vs Iceberg vs Hudi

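| Feature             | Delta Lake                                       | Apache Iceberg                                          | Apache Hudi          |
|---------------------|--------------------------------------------------|---------------------------------------------------------|----------------------|
| Creator             | Databricks → Linux Foundation                    | Netflix → Apache                                        | Uber → Apache        |
| Transaction log     | _delta_log/ (JSON commits + Parquet checkpoints) | metadata/ (JSON table metadata + Avro manifests)        | .hoodie/ (timeline)  |
| ACID                | Optimistic concurrency                           | Snapshot isolation                                      | MVCC                 |
| Time travel         | By version/timestamp                             | By snapshot ID/timestamp                                | By commit timeline   |
| Schema evolution    | Add/rename columns                               | Full (add/drop/rename/reorder)                          | Add/rename columns   |
| Partition evolution | Manual rewrite                                   | In-place spec change + hidden partitioning (no rewrite) | Manual rewrite       |
| Ecosystem lock-in   | Databricks-centric                               | Vendor-neutral                                          | Moderate             |
| Best for            | Databricks users                                 | Multi-engine, vendor-neutral                            | Upsert-heavy (CDC)   |
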
Related lessons:
  - Iceberg catalog architecture: a detailed walkthrough
  - Delta transaction log: how _delta_log is structured
  - Hudi timeline: the basis of MVCC

Table Format Internals

How a commit works

Write operation:
  1. Write new Parquet files to storage
  2. Create commit entry in transaction log
  3. Atomic commit (rename / conditional put)
  
Read operation:
  1. Read transaction log → list of valid files
  2. Apply partition pruning + file pruning
  3. Read only relevant Parquet files
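
The read path above can be illustrated with a minimal Python sketch. It assumes each file entry in the log carries a partition value and min/max statistics for one column; the `DataFile` class, the `plan_scan` function, and the file paths are hypothetical names made up for this example, not the API of any table format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str  # partition value (here: an order date), hypothetical
    min_id: int     # smallest customer_id stored in the file
    max_id: int     # largest customer_id stored in the file


def plan_scan(files, date, customer_id):
    """Return the paths of files that may contain matching rows."""
    # Partition pruning: skip files that belong to other partitions.
    in_partition = [f for f in files if f.partition == date]
    # File pruning: skip files whose min/max range cannot contain the value.
    return [f.path for f in in_partition if f.min_id <= customer_id <= f.max_id]


files = [
    DataFile("d=2024-01-01/a.parquet", "2024-01-01", 1, 500),
    DataFile("d=2024-01-01/b.parquet", "2024-01-01", 501, 1000),
    DataFile("d=2024-01-02/c.parquet", "2024-01-02", 1, 1000),
]

print(plan_scan(files, "2024-01-01", 742))  # ['d=2024-01-01/b.parquet']
```

Only one of the three files has to be opened; the other two are eliminated from metadata alone.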

Time Travel:
  SELECT * FROM orders VERSION AS OF 42
  → Read log up to version 42 → get file list at that point
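
To make the commit and time travel steps concrete, here is a toy transaction log in Python. It is a sketch under simplifying assumptions, not how Delta, Iceberg, or Hudi actually lay out their logs: each commit is one JSON file on the local filesystem, and exclusive file creation (`O_EXCL`) stands in for the atomic rename or conditional put. The `commit` and `files_as_of` helpers are hypothetical.

```python
import json
import os
import tempfile


def commit(log_dir, version, add=(), remove=()):
    """Write commit <version>; fail if another writer already claimed it."""
    path = os.path.join(log_dir, f"{version:08d}.json")
    # O_EXCL makes file creation atomic: a stand-in for a conditional put.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    with os.fdopen(fd, "w") as f:
        json.dump({"add": list(add), "remove": list(remove)}, f)


def files_as_of(log_dir, version):
    """Replay commits 0..version and return the live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        if int(name.split(".")[0]) > version:
            break
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        live |= set(entry["add"])
        live -= set(entry["remove"])
    return sorted(live)


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, add=["a.parquet"])
commit(log_dir, 1, add=["b.parquet"])
# A compaction commit: replaces two small files with one larger file.
commit(log_dir, 2, add=["ab.parquet"], remove=["a.parquet", "b.parquet"])

print(files_as_of(log_dir, 1))  # ['a.parquet', 'b.parquet']
print(files_as_of(log_dir, 2))  # ['ab.parquet']

try:
    commit(log_dir, 2, add=["c.parquet"])  # a second writer loses the race
except FileExistsError:
    print("conflict: version 2 already committed")
```

Reading "as of version 1" still returns the two original files, because old data files are not deleted when a later commit marks them as removed.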

Compaction

Small File Problem:
  Streaming writes 1 file/minute → 1440 files/day
  Query: scan 1440 small files = slow (file open overhead)

Compaction:
  1440 × 1 MB files → 12 × 128 MB files
  Run as background job: OPTIMIZE table (Delta Lake syntax; Iceberg on Spark provides the rewrite_data_files procedure)
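
The arithmetic behind the compaction example can be checked with a small planning sketch. It assumes a simple greedy strategy that groups consecutive files until the 128 MB target is reached; real engines use more elaborate bin-packing, and `plan_compaction` is a hypothetical helper.

```python
TARGET_MB = 128


def plan_compaction(sizes_mb, target_mb=TARGET_MB):
    """Greedily group consecutive small files into bins of at most target_mb."""
    bins, current, current_size = [], [], 0
    for size in sizes_mb:
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [1] * 1440  # one 1 MB file per minute for a day
bins = plan_compaction(small_files)

print(len(bins))      # 12
print(sum(bins[0]))   # 128
print(sum(bins[-1]))  # 32
```

The result is 11 full 128 MB files plus one 32 MB remainder, which is where the "12 files" figure comes from.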

Z-Order / Hilbert Clustering:
  Co-locate related data within files
  OPTIMIZE table ZORDER BY (date, customer_id)
  → Queries filtering by date+customer scan fewer files
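
A rough illustration of why Z-ordering helps: the sketch below interleaves the bits of two integer columns (a Morton code), sorts rows by that code, and splits them into files. The 4 × 4 grid of `(day, customer_id)` values and the four-rows-per-file split are assumptions chosen to keep the output small; production implementations differ in detail.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the low `bits` bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical rows: (day_number, customer_id)
rows = [(d, c) for d in range(4) for c in range(4)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))

# Split the sorted rows into 4 "files" of 4 rows each.
files = [rows[i:i + 4] for i in range(0, len(rows), 4)]
for f in files:
    print(f)
# [(0, 0), (1, 0), (0, 1), (1, 1)]
# [(2, 0), (3, 0), (2, 1), (3, 1)]
# [(0, 2), (1, 2), (0, 3), (1, 3)]
# [(2, 2), (3, 2), (2, 3), (3, 3)]
```

Each file now covers a narrow range in both columns, so min/max statistics can rule out most files for a filter on either column or on both.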

Schema Evolution

Iceberg schema evolution (no rewrite needed):
  ALTER TABLE orders ADD COLUMN discount DOUBLE
  ALTER TABLE orders RENAME COLUMN qty TO quantity
  ALTER TABLE orders DROP COLUMN legacy_field
  
  Old files: read without new column (NULL default)
  New files: include new column
  → Zero-downtime schema changes
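
One way to see why no rewrite is needed: Iceberg tracks columns by stable field IDs rather than by name, so a rename or drop only changes metadata. The sketch below models that idea with plain dictionaries; the `read_row` helper and the specific ID numbers are illustrative assumptions, not Iceberg's actual file layout.

```python
def read_row(stored, schema):
    """Project a stored row (keyed by field ID) onto the current schema."""
    # Missing IDs (columns added later) come back as None, i.e. NULL.
    return {name: stored.get(field_id) for field_id, name in schema.items()}


# Schema = mapping of stable field ID -> current column name.
schema_v1 = {1: "order_id", 2: "qty", 3: "legacy_field"}
# After ADD COLUMN discount, RENAME COLUMN qty TO quantity, DROP COLUMN legacy_field:
schema_v2 = {1: "order_id", 2: "quantity", 4: "discount"}

old_file_row = {1: 100, 2: 3, 3: "x"}  # written under schema_v1
new_file_row = {1: 101, 2: 5, 4: 0.1}  # written under schema_v2

print(read_row(old_file_row, schema_v2))
# {'order_id': 100, 'quantity': 3, 'discount': None}
print(read_row(new_file_row, schema_v2))
# {'order_id': 101, 'quantity': 5, 'discount': 0.1}
```

The old file is read correctly under the new schema: the renamed column resolves through its ID, the dropped column is ignored, and the added column reads as NULL.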

Medallion Architecture in the Lakehouse context

Bronze (Raw):
  - Ingested as-is from sources
  - Append-only, no transformations
  - Schema: source schema (may be a JSON string)
  - Retention: 90 days to 1 year

Silver (Cleaned):
  - Deduplicated, validated, typed
  - Business keys resolved
  - SCD applied to dimensions
  - Retention: 1-3 years

Gold (Aggregated):
  - Business-level aggregates
  - Optimized for BI queries
  - Pre-joined dimensions
  - Retention: indefinite
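
The three layers can be sketched end to end in a few lines of plain Python. This is a toy model with in-memory lists instead of tables; the sample events, the `to_silver` and `to_gold` functions, and the choice to silently skip invalid rows are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Bronze: raw events stored as-is (JSON strings), append-only.
bronze = [
    '{"order_id": 1, "customer": "a", "amount": "10.5"}',
    '{"order_id": 1, "customer": "a", "amount": "10.5"}',  # duplicate
    '{"order_id": 2, "customer": "b", "amount": "oops"}',  # invalid amount
    '{"order_id": 3, "customer": "a", "amount": "4.5"}',
]


def to_silver(raw_rows):
    """Parse, type, validate, and deduplicate by order_id."""
    seen, silver = set(), []
    for raw in raw_rows:
        row = json.loads(raw)
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine this row
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(
            {"order_id": row["order_id"], "customer": row["customer"], "amount": amount}
        )
    return silver


def to_gold(silver_rows):
    """Aggregate revenue per customer for BI queries."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(silver))  # 2
print(gold)         # {'a': 15.0}
```

Because cleaning lives in its own step, a surprising Gold number can be traced back to the Silver rows that produced it and then to the raw Bronze records.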
WARNING

Anti-pattern: Gold without Silver. If aggregates (Gold) are computed directly from raw data (Bronze), there is no way to debug data quality issues. The Silver layer is the point of truth for business logic.

TIP

Trend: Iceberg is becoming the de facto standard. Snowflake, BigQuery, Databricks, and AWS Athena all support Iceberg tables. The vendor-neutral format is winning.

See also: ClickHouse + Iceberg, a query engine for the lakehouse
Knowledge check

Question: A streaming pipeline writes 1 file/minute to an Iceberg table. After a week, query performance has degraded. What is the cause, and what is the fix?
