Feature Store & ML Pipeline

Зачем Feature Store

ML pipeline без feature store: каждая команда дублирует feature engineering, features в training не совпадают с production, нет версионирования.

Без Feature Store:
  Team A: customer_ltv = SUM(orders) last 365 days
  Team B: customer_ltv = SUM(orders) last 360 days
  Production: customer_ltv = AVG(orders) last 30 days (другая логика!)
  
  → Training-serving skew → модель деградирует

С Feature Store:
  Одно определение: customer_ltv = SUM(amount) OVER last 365 days
  Offline (training): batch compute → stored in Parquet/Iceberg
  Online (serving): precomputed → Redis/DynamoDB (p99 под 10ms)
  → Consistency guaranteed

Online vs Offline Feature Store

Feature Store Architecture: Online vs Offline

Offline Store (Training)

Materialize (feast materialize)

Online Store (Serving)

Batch Features (daily Spark)

Feature Repository (Git)

Streaming Features (Flink → Online)

Training: get_historical_features()

Serving: get_online_features()

Аспект	Offline Store	Online Store
Use case	Model training	Real-time inference
Latency	Секунды-минуты	Миллисекунды (p99 под 10ms)
Storage	S3/GCS (Parquet, Iceberg)	Redis, DynamoDB, Cassandra
Volume	TB+ (полная история)	GB (latest values)
Compute	Spark batch (daily/hourly)	Pre-materialized или streaming
Access	DataFrame API	Key-value lookup

Feature computation pipeline:
  
  Batch features (daily):
    Source tables → Spark job → Offline store (Parquet)
                             → Materialize to Online store (Redis)
  
  Streaming features (real-time):
    Kafka events → Flink → Online store (Redis)
    
  Example:
    user_total_orders_365d → batch (daily Spark)
    user_orders_last_1h → streaming (Flink window)

Feast Architecture

Feast components:
  1. Feature Repository (Git):
     - Feature definitions (Python/YAML)
     - Entity definitions
     - Data source references
  
  2. Offline Store:
     - BigQuery, Snowflake, Redshift, file-based
     - Historical feature values
     - Used for: get_historical_features(entity_df, features)
  
  3. Online Store:
     - Redis, DynamoDB, SQLite
     - Latest feature values per entity
     - Used for: get_online_features(entity_rows, features)
  
  4. Materialization:
     - Batch: offline → online (feast materialize)
     - Streaming: Kafka → online (feast push)

Feature Definition

# feature_repo/features.py
from feast import Entity, Feature, FeatureView, FileSource
from feast.types import Float64, Int64

customer = Entity(name="customer", join_keys=["customer_id"])

customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    schema=[
        Feature(name="total_orders", dtype=Int64),
        Feature(name="total_revenue", dtype=Float64),
        Feature(name="avg_order_value", dtype=Float64),
        Feature(name="days_since_last_order", dtype=Int64),
    ],
    source=FileSource(path="s3://features/customer_features.parquet"),
    ttl=timedelta(days=1),  # freshness SLA
)

Training-Serving Skew

Skew types:
  1. Feature skew: training features computed differently than serving
     Fix: single feature definition (feature store)
  
  2. Data skew: training data distribution differs from production
     Fix: monitor feature distributions, retrain on fresh data
  
  3. Time-travel skew: training uses future data (data leakage)
     Fix: point-in-time joins
     
Point-in-time join:
  Training: for each training example at time T,
            use feature values AS OF time T (not latest!)
  
  Without: model "sees future" → inflated metrics → fails in prod
  With:    model sees only past data → realistic training

WARNING

Point-in-time correctness — критически важно. Без point-in-time joins модель использует “будущие” features при training → метрики завышены → модель деградирует в production. Feature store решает это: get_historical_features() автоматически делает point-in-time join.

Feature Reuse

Feature reuse across models:
  customer_features.total_revenue:
    - Churn prediction model
    - Upsell recommendation model
    - Credit scoring model
    - Customer segmentation
  
  4 models × 1 feature definition = consistency
  Without feature store: 4 different implementations

TIP

Feature Store для Data Engineer. DE отвечает за: batch/streaming pipelines для compute features, materialization offline → online, monitoring freshness и data quality features. ML Engineer отвечает за: feature selection, model training, inference pipeline.