Capstone: Design Data Platform для E-commerce
Задача
Спроектировать полную data platform для e-commerce компании: от source systems до serving layer. Применить все концепции из предыдущих модулей.
Requirements
Бизнес-контекст:
- E-commerce: 1M orders/day, 10M active users
- 50 product categories, 3 warehouses (fulfillment)
- Payment processing через внешний API
Source Systems:
1. PostgreSQL (OLTP): orders, customers, products, inventory
2. Clickstream: user behavior events (page views, clicks, searches)
3. Payment API: payment status, refunds, chargebacks
Consumers:
1. BI Dashboards: daily revenue, conversion funnel, cohort analysis
2. ML Models: recommendation engine, churn prediction, fraud detection
3. Real-time: live order tracking, inventory alerts, real-time metrics
SLAs:
- BI: refresh every 1 hour, query time under 10 seconds
- ML: features updated daily, training weekly
- Real-time: end-to-end latency under 30 seconds
- Data quality: 99.5% completeness, 99.9% accuracy
Architecture Design
Layer-by-Layer Design
Ingestion Layer
1. PostgreSQL → Kafka (Debezium CDC):
- Log-based CDC: reads WAL, minimal load on OLTP
- Topic per table: cdc.public.orders, cdc.public.customers
- Schema Registry: Avro schemas, backward compatibility
- Exactly-once: Kafka transactions + LSN tracking
2. Clickstream → Kafka:
- SDK → HTTP API → Kafka producer
- Topic: clickstream.events (partitioned by user_id)
- Schema: {user_id, event_type, page_url, timestamp, metadata}
- Volume: ~100M events/day
3. Payment API → Kafka:
- Webhook receiver → Kafka producer
- Topic: payments.status
- Idempotency: dedup by payment_id
Storage Layer (Lakehouse foundation)
Iceberg + S3 + REST Catalog:
S3 → physical layer (Parquet data files + manifests)
Iceberg table format → metadata, schema evolution, time travel, ACID
REST Catalog → atomic metadata pointer, RBAC, credential vending
Iceberg БЕЗ catalog в production не работает:
- Нет atomic commit (race conditions при concurrent writers)
- Нет single source of truth о latest snapshot
- Multi-engine concurrency (Spark + Trino + Flink + DuckDB) ломается
- Нет fine-grained RBAC и credential vending для query engines
Production-ready REST catalogs (2025):
- Apache Polaris (Snowflake/Dremio, Apache incubation, vendor-neutral)
- Snowflake Open Catalog (managed Polaris, GA)
- Unity Catalog OSS (Databricks, multi-format Delta+Iceberg+Hudi+ML)
- Lakekeeper (Rust, OpenFGA-based RBAC, лёгкий footprint)
- Nessie (git-style branching по таблицам)
- AWS Glue (managed, basic, без credential vending для не-AWS engines)
Выбор для capstone: Polaris OSS (vendor-neutral, full Iceberg REST spec)
или Unity Catalog OSS (если нужен multi-format + ML assets)
Iceberg catalog metadata — фундамент
Apache Iceberg в Spark — production
Processing Layer
Bronze (Spark Streaming, micro-batch every 5 min):
Kafka → append to Iceberg tables (через REST catalog)
No transformations, preserve raw format
Retention: 90 days
Silver (Spark batch, hourly):
CDC: apply MERGE for latest state + SCD Type 2 for dimensions
Clickstream: sessionize (30-min window), enrich with user profile
Payments: join with orders, resolve final status
Data Quality checks (Great Expectations):
- orders: amount >= 0, customer_id NOT NULL
- completeness: row count ± 10% from previous day
- freshness: MAX(updated_at) within 2 hours
Gold (dbt, hourly after Silver):
Star Schema:
fact_orders: order metrics with dimension keys
dim_customer: SCD Type 2
dim_product: SCD Type 1
dim_date: static calendar
Aggregates:
daily_revenue_by_region
hourly_conversion_funnel
customer_cohort_analysis
Serving Layer
BI (Looker / Metabase):
Source: Gold tables (Iceberg → Trino/Spark SQL)
Refresh: hourly
Caching: pre-warm top-20 dashboards
ML (Feast Feature Store):
Offline: Silver → Spark batch → feature tables (daily)
Online: materialize to Redis (latest values)
Features:
customer_order_count_365d, customer_avg_order_value
product_view_to_purchase_ratio, session_duration_avg
Models: recommendations (daily retrain), fraud (real-time scoring)
Real-time (ClickHouse):
Source: Kafka → ClickHouse (Kafka engine)
Use cases: live order count, real-time revenue, inventory alerts
Materialized Views: pre-aggregate by minute/hour
Latency: p99 under 100ms
MV + AggregatingMergeTree паттерн
Kafka table engine в ClickHouse
DataFusion как embedded engine для Iceberg
Orchestration
Dagster (asset-centric):
Assets:
bronze_orders → silver_orders → gold_fact_orders → bi_dashboard
bronze_clickstream → silver_sessions → gold_conversion_funnel
Sensors:
- Kafka lag sensor: trigger processing when lag > threshold
- S3 sensor: trigger on new file arrival
Schedules:
- Bronze: every 5 minutes (micro-batch)
- Silver: hourly (after Bronze completes)
- Gold: hourly (after Silver completes)
- Features: daily (after Gold completes)
Governance и Quality
Data Catalog: OpenMetadata
- Auto-discovery: Iceberg tables, Kafka topics, dbt models
- Column-level lineage: source → bronze → silver → gold → dashboard
- PII tagging: customer email, phone → masked in Silver
Data Quality: Great Expectations + dbt tests
- Bronze: schema validation, freshness
- Silver: business rules, referential integrity
- Gold: aggregation accuracy, SLA freshness
Alerting (tiered):
P1 (PagerDuty): pipeline failure, SLA breach
P2 (Slack): quality check failure, anomaly
P3 (ticket): schema drift, cost spike
Это reference architecture. В реальном проекте: начните с MVP (1 source → Bronze → Silver → Gold → 1 dashboard), затем итерируйте. Не стройте всё сразу — complexity kills projects.
Iceberg REST catalog — production requirement, не «nice to have». Типичная ошибка SD-интервью: «Iceberg + S3» без упоминания catalog. Это equivalent «PostgreSQL без WAL» — теоретически работает, на практике падает на первом concurrent writer. Назовите конкретный catalog: Polaris (vendor-neutral), Unity Catalog OSS (multi-format + ML), Lakekeeper (Rust, lightweight), AWS Glue (managed, basic). Опишите почему: atomic commit, RBAC, credential vending, multi-engine concurrency.
Capstone checklist. Проверьте свой дизайн по всем модулям курса: pipeline patterns (M02), batch (M03), streaming (M04), warehouse (M05), lakehouse (M06), modeling (M07), CDC (M08), quality (M09), orchestration (M10), governance (M11), features (M12), real-time (M13), cost (M14).