Feature Store & ML Pipeline
Зачем Feature Store
ML pipeline без feature store: каждая команда дублирует feature engineering, features в training не совпадают с production, нет версионирования.
Без Feature Store:
Team A: customer_ltv = SUM(orders) last 365 days
Team B: customer_ltv = SUM(orders) last 360 days
Production: customer_ltv = AVG(orders) last 30 days (другая логика!)
→ Training-serving skew → модель деградирует
С Feature Store:
Одно определение: customer_ltv = SUM(amount) OVER last 365 days
Offline (training): batch compute → stored in Parquet/Iceberg
Online (serving): precomputed → Redis/DynamoDB (p99 под 10ms)
→ Consistency guaranteed
Online vs Offline Feature Store
| Аспект | Offline Store | Online Store |
|---|---|---|
| Use case | Model training | Real-time inference |
| Latency | Секунды-минуты | Миллисекунды (p99 под 10ms) |
| Storage | S3/GCS (Parquet, Iceberg) | Redis, DynamoDB, Cassandra |
| Volume | TB+ (полная история) | GB (latest values) |
| Compute | Spark batch (daily/hourly) | Pre-materialized или streaming |
| Access | DataFrame API | Key-value lookup |
Feature computation pipeline:
Batch features (daily):
Source tables → Spark job → Offline store (Parquet)
→ Materialize to Online store (Redis)
Streaming features (real-time):
Kafka events → Flink → Online store (Redis)
Example:
user_total_orders_365d → batch (daily Spark)
user_orders_last_1h → streaming (Flink window)
Feast Architecture
Feast components:
1. Feature Repository (Git):
- Feature definitions (Python/YAML)
- Entity definitions
- Data source references
2. Offline Store:
- BigQuery, Snowflake, Redshift, file-based
- Historical feature values
- Used for: get_historical_features(entity_df, features)
3. Online Store:
- Redis, DynamoDB, SQLite
- Latest feature values per entity
- Used for: get_online_features(entity_rows, features)
4. Materialization:
- Batch: offline → online (feast materialize)
- Streaming: Kafka → online (feast push)
Feature Definition
# feature_repo/features.py
from feast import Entity, Feature, FeatureView, FileSource
from feast.types import Float64, Int64
customer = Entity(name="customer", join_keys=["customer_id"])
customer_features = FeatureView(
name="customer_features",
entities=[customer],
schema=[
Feature(name="total_orders", dtype=Int64),
Feature(name="total_revenue", dtype=Float64),
Feature(name="avg_order_value", dtype=Float64),
Feature(name="days_since_last_order", dtype=Int64),
],
source=FileSource(path="s3://features/customer_features.parquet"),
ttl=timedelta(days=1), # freshness SLA
)
Training-Serving Skew
Skew types:
1. Feature skew: training features computed differently than serving
Fix: single feature definition (feature store)
2. Data skew: training data distribution differs from production
Fix: monitor feature distributions, retrain on fresh data
3. Time-travel skew: training uses future data (data leakage)
Fix: point-in-time joins
Point-in-time join:
Training: for each training example at time T,
use feature values AS OF time T (not latest!)
Without: model "sees future" → inflated metrics → fails in prod
With: model sees only past data → realistic training
Point-in-time correctness — критически важно. Без point-in-time joins модель использует “будущие” features при training → метрики завышены → модель деградирует в production. Feature store решает это: get_historical_features() автоматически делает point-in-time join.
Feature Reuse
Feature reuse across models:
customer_features.total_revenue:
- Churn prediction model
- Upsell recommendation model
- Credit scoring model
- Customer segmentation
4 models × 1 feature definition = consistency
Without feature store: 4 different implementations
Feature Store для Data Engineer. DE отвечает за: batch/streaming pipelines для compute features, materialization offline → online, monitoring freshness и data quality features. ML Engineer отвечает за: feature selection, model training, inference pipeline.