Capstone: Design Data Platform для E-commerce

Задача

Спроектировать полную data platform для e-commerce компании: от source systems до serving layer. Применить все концепции из предыдущих модулей.

Requirements

Бизнес-контекст:
  - E-commerce: 1M orders/day, 10M active users
  - 50 product categories, 3 warehouses (fulfillment)
  - Payment processing через внешний API

Source Systems:
  1. PostgreSQL (OLTP): orders, customers, products, inventory
  2. Clickstream: user behavior events (page views, clicks, searches)
  3. Payment API: payment status, refunds, chargebacks

Consumers:
  1. BI Dashboards: daily revenue, conversion funnel, cohort analysis
  2. ML Models: recommendation engine, churn prediction, fraud detection
  3. Real-time: live order tracking, inventory alerts, real-time metrics

SLAs:
  - BI: refresh every 1 hour, query time under 10 seconds
  - ML: features updated daily, training weekly
  - Real-time: end-to-end latency under 30 seconds
  - Data quality: 99.5% completeness, 99.9% accuracy

Architecture Design

E-commerce Data Platform: End-to-End Architecture

PostgreSQL (CDC)

Clickstream API

Payment API

Apache Kafka

Lakehouse (Iceberg + S3 + REST catalog)

Real-time

Layer-by-Layer Design

Ingestion Layer

1. PostgreSQL → Kafka (Debezium CDC):
   - Log-based CDC: reads WAL, minimal load on OLTP
   - Topic per table: cdc.public.orders, cdc.public.customers
   - Schema Registry: Avro schemas, backward compatibility
   - Exactly-once: Kafka transactions + LSN tracking

2. Clickstream → Kafka:
   - SDK → HTTP API → Kafka producer
   - Topic: clickstream.events (partitioned by user_id)
   - Schema: {user_id, event_type, page_url, timestamp, metadata}
   - Volume: ~100M events/day

3. Payment API → Kafka:
   - Webhook receiver → Kafka producer
   - Topic: payments.status
   - Idempotency: dedup by payment_id

Storage Layer (Lakehouse foundation)

Iceberg + S3 + REST Catalog:
  S3                    → physical layer (Parquet data files + manifests)
  Iceberg table format  → metadata, schema evolution, time travel, ACID
  REST Catalog          → atomic metadata pointer, RBAC, credential vending

  Iceberg БЕЗ catalog в production не работает:
    - Нет atomic commit (race conditions при concurrent writers)
    - Нет single source of truth о latest snapshot
    - Multi-engine concurrency (Spark + Trino + Flink + DuckDB) ломается
    - Нет fine-grained RBAC и credential vending для query engines

  Production-ready REST catalogs (2025):
    - Apache Polaris (Snowflake/Dremio, Apache incubation, vendor-neutral)
    - Snowflake Open Catalog (managed Polaris, GA)
    - Unity Catalog OSS (Databricks, multi-format Delta+Iceberg+Hudi+ML)
    - Lakekeeper (Rust, OpenFGA-based RBAC, лёгкий footprint)
    - Nessie (git-style branching по таблицам)
    - AWS Glue (managed, basic, без credential vending для не-AWS engines)

  Выбор для capstone: Polaris OSS (vendor-neutral, full Iceberg REST spec)
                    или Unity Catalog OSS (если нужен multi-format + ML assets)

Iceberg catalog metadata — фундамент Apache Iceberg в Spark — production

Processing Layer

Bronze (Spark Streaming, micro-batch every 5 min):
  Kafka → append to Iceberg tables (через REST catalog)
  No transformations, preserve raw format
  Retention: 90 days

Silver (Spark batch, hourly):
  CDC: apply MERGE for latest state + SCD Type 2 for dimensions
  Clickstream: sessionize (30-min window), enrich with user profile
  Payments: join with orders, resolve final status
  
  Data Quality checks (Great Expectations):
    - orders: amount >= 0, customer_id NOT NULL
    - completeness: row count ± 10% from previous day
    - freshness: MAX(updated_at) within 2 hours

Gold (dbt, hourly after Silver):
  Star Schema:
    fact_orders: order metrics with dimension keys
    dim_customer: SCD Type 2
    dim_product: SCD Type 1
    dim_date: static calendar
  
  Aggregates:
    daily_revenue_by_region
    hourly_conversion_funnel
    customer_cohort_analysis

Serving Layer

BI (Looker / Metabase):
  Source: Gold tables (Iceberg → Trino/Spark SQL)
  Refresh: hourly
  Caching: pre-warm top-20 dashboards

ML (Feast Feature Store):
  Offline: Silver → Spark batch → feature tables (daily)
  Online: materialize to Redis (latest values)
  Features:
    customer_order_count_365d, customer_avg_order_value
    product_view_to_purchase_ratio, session_duration_avg
  Models: recommendations (daily retrain), fraud (real-time scoring)

Real-time (ClickHouse):
  Source: Kafka → ClickHouse (Kafka engine)
  Use cases: live order count, real-time revenue, inventory alerts
  Materialized Views: pre-aggregate by minute/hour
  Latency: p99 under 100ms

MV + AggregatingMergeTree паттерн Kafka table engine в ClickHouse DataFusion как embedded engine для Iceberg

Orchestration

Dagster (asset-centric):
  Assets:
    bronze_orders → silver_orders → gold_fact_orders → bi_dashboard
    bronze_clickstream → silver_sessions → gold_conversion_funnel
  
  Sensors:
    - Kafka lag sensor: trigger processing when lag > threshold
    - S3 sensor: trigger on new file arrival
  
  Schedules:
    - Bronze: every 5 minutes (micro-batch)
    - Silver: hourly (after Bronze completes)
    - Gold: hourly (after Silver completes)
    - Features: daily (after Gold completes)

Governance и Quality

Data Catalog: OpenMetadata
  - Auto-discovery: Iceberg tables, Kafka topics, dbt models
  - Column-level lineage: source → bronze → silver → gold → dashboard
  - PII tagging: customer email, phone → masked in Silver

Data Quality: Great Expectations + dbt tests
  - Bronze: schema validation, freshness
  - Silver: business rules, referential integrity
  - Gold: aggregation accuracy, SLA freshness

Alerting (tiered):
  P1 (PagerDuty): pipeline failure, SLA breach
  P2 (Slack): quality check failure, anomaly
  P3 (ticket): schema drift, cost spike

NOTE

Это reference architecture. В реальном проекте: начните с MVP (1 source → Bronze → Silver → Gold → 1 dashboard), затем итерируйте. Не стройте всё сразу — complexity kills projects.

WARNING

Iceberg REST catalog — production requirement, не «nice to have». Типичная ошибка SD-интервью: «Iceberg + S3» без упоминания catalog. Это equivalent «PostgreSQL без WAL» — теоретически работает, на практике падает на первом concurrent writer. Назовите конкретный catalog: Polaris (vendor-neutral), Unity Catalog OSS (multi-format + ML), Lakekeeper (Rust, lightweight), AWS Glue (managed, basic). Опишите почему: atomic commit, RBAC, credential vending, multi-engine concurrency.

TIP

Capstone checklist. Проверьте свой дизайн по всем модулям курса: pipeline patterns (M02), batch (M03), streaming (M04), warehouse (M05), lakehouse (M06), modeling (M07), CDC (M08), quality (M09), orchestration (M10), governance (M11), features (M12), real-time (M13), cost (M14).