Путь C: Large-project optimization case
Если выбрали этот путь — это наиболее production-applicable из трёх. Цель — взять synthetic large dbt project (1000-1500 моделей) и применить все техники из модуля 10 для измеримого performance improvement. Realistic timeline: 20-30 часов.
Что получите:
- Hands-on опыт со всеми performance techniques.
- Reproducible benchmark suite в portfolio.
- Skills directly applicable к real production projects.
- Document с numbers, который impressive в собеседовании.
В этом уроке мы пройдём процесс step-by-step.
timeit / time.perf_counter — microbenchmarks done right
Step 1: Generate synthetic project
We provide template generator. Если у вас нет — простой Python скрипт можно написать.
#!/usr/bin/env python3
"""generate_project.py: Создаёт synthetic dbt project с N моделями."""
import os
import random
from pathlib import Path
NUM_SOURCES = 10
NUM_STAGING = 50
NUM_INTERMEDIATE = 200
NUM_MARTS = 1240 # together = 1500 models
def generate_project(output_dir: Path):
"""Создаёт synthetic project."""
output_dir.mkdir(exist_ok=True)
(output_dir / "models").mkdir(exist_ok=True)
(output_dir / "models" / "sources").mkdir(exist_ok=True)
(output_dir / "models" / "staging").mkdir(exist_ok=True)
(output_dir / "models" / "intermediate").mkdir(exist_ok=True)
(output_dir / "models" / "marts").mkdir(exist_ok=True)
# dbt_project.yml
(output_dir / "dbt_project.yml").write_text("""
name: synthetic_perf
version: '1.0.0'
config-version: 2
profile: synthetic_perf
model-paths: ["models"]
target-path: "target"
models:
synthetic_perf:
staging:
+materialized: view
intermediate:
+materialized: ephemeral
marts:
+materialized: table
""")
# profiles.yml setup (DuckDB для simplicity)
(output_dir / "profiles.yml").write_text("""
synthetic_perf:
target: dev
outputs:
dev:
type: duckdb
path: ':memory:'
threads: 4
""")
# Sources (raw data)
sources = []
for i in range(NUM_SOURCES):
sources.append(f"raw_table_{i:03d}")
sources_yml = "version: 2\nsources:\n - name: raw\n tables:\n"
for s in sources:
sources_yml += f" - name: {s}\n"
(output_dir / "models" / "sources" / "_sources.yml").write_text(sources_yml)
# Staging (1:1 with sources... mostly)
for i in range(NUM_STAGING):
source_idx = i % NUM_SOURCES
sql = f"""SELECT * FROM {{{{ source('raw', 'raw_table_{source_idx:03d}') }}}}
WHERE id IS NOT NULL
"""
(output_dir / "models" / "staging" / f"stg_table_{i:03d}.sql").write_text(sql)
# Intermediate (each refs 1-3 staging)
for i in range(NUM_INTERMEDIATE):
num_refs = random.randint(1, 3)
refs = random.sample(range(NUM_STAGING), num_refs)
joins = " LEFT JOIN ".join([f"{{{{ ref('stg_table_{r:03d}') }}}} t{j} ON t0.id = t{j}.id"
for j, r in enumerate(refs)])
sql = f"""SELECT t0.*\nFROM {joins}\n"""
(output_dir / "models" / "intermediate" / f"int_model_{i:03d}.sql").write_text(sql)
# Marts (each refs 1-5 intermediate)
for i in range(NUM_MARTS):
num_refs = random.randint(1, 5)
refs = random.sample(range(NUM_INTERMEDIATE), num_refs)
joins = " UNION ALL\n".join([f"SELECT * FROM {{{{ ref('int_model_{r:03d}') }}}}"
for r in refs])
sql = f"""WITH all_data AS (\n{joins}\n)\nSELECT * FROM all_data WHERE 1=1\n"""
(output_dir / "models" / "marts" / f"mart_table_{i:04d}.sql").write_text(sql)
print(f"Generated {NUM_SOURCES + NUM_STAGING + NUM_INTERMEDIATE + NUM_MARTS} models в {output_dir}")
if __name__ == "__main__":
generate_project(Path("synthetic_dbt_project"))
Run:
python generate_project.py
cd synthetic_dbt_project
# Verify
dbt deps # if any
dbt parse # должен работать
ls models/marts/ | wc -l # 1240 files
You now have 1500-моделей dbt project для benchmarking.
Real data optional. Synthetic project имеет models with no actual data (CTAs over CTEs over CTEs). DuckDB :memory: создаёт empty tables. Это OK для measuring dbt engine performance, не data movement performance.
Если хотите real data — генерируйте seed CSVs:
import pandas as pd
for s in sources:
df = pd.DataFrame({"id": range(1000), "value": [random.random() для _ в range(1000)]})
df.to_csv(f"seeds/{s}.csv", index=False)Then dbt seed to load.
Step 2: Baseline measurements
# Time dbt parse
time dbt parse
# Time dbt run
time dbt run
# Get detailed timing
dbt run --profile # generates target/profile.json с timings per node
Record baseline в BENCHMARK.md:
# Benchmark — synthetic_dbt_project
## Day 0 (baseline) — 2026-05-19
**Setup:**
- 1500 models (10 sources, 50 staging, 200 intermediate, 1240 marts)
- DuckDB :memory:
- 4 threads
- MacBook Pro M2, 16GB RAM
**Measurements:**
| Phase | Time | Notes |
|-------|------|-------|
| dbt parse (cold) | 25 sec | first invocation, no partial_parse |
| dbt parse (warm) | 6 sec | partial_parse.msgpack helps |
| dbt compile | 18 sec | after parse, generate target SQL |
| dbt run (full) | 45 sec | builds all models |
| dbt run (--select stg_table_001+) | 12 sec | single staging + downstream |
**Total CI run (parse + compile + run):** ~70 sec.
This is baseline. We will optimize from here.
Step 3: Technique 1 — Partial parsing tuning
Partial parsing should speed up subsequent runs. Verify it’s working:
# First parse (cold)
rm -rf target/
time dbt parse
# Should create target/partial_parse.msgpack
# Second parse (warm)
time dbt parse
# If still slow as first — partial parsing not working.
Common partial parsing breakers:
- vars в dbt_project.yml — changing var invalidates entire parse cache.
- Macros that reference env_var(…) — environment changes invalidate.
current_timestampin node SQL — non-deterministic, can invalidate.- Source freshness in
--statemode — can break partial parse.
Optimization: minimize var changes. Keep vars stable. Use profiles.yml env_vars instead of inline vars.
# Check what's invalidating
dbt parse --vars '{"my_var": "test"}'
# vs
dbt parse --vars '{"my_var": "test2"}'
# If parse is "cold" both times, vars are invalidating cache.
Modify project to not use dynamic vars в model logic.
Record improvement:
## Optimization 1 — Partial parsing tuning
Action: removed inline vars used in models. Moved to env_var() в profiles.yml.
Before:
- dbt parse (warm): 6 sec
After:
- dbt parse (warm): 1.5 sec (4x improvement)
Why: partial_parse.msgpack now valid between runs. Hash signature stable.
Step 4: Technique 2 — Threads tuning
for t in 1 2 4 8 16; do
echo "=== Threads: $t ==="
rm -rf target/
time dbt run --threads $t
done
DuckDB is single-process — threads benefit limited. На Snowflake/BigQuery threads important. Real warehouse:
# В profiles.yml
dev:
type: snowflake
threads: 8 # try this first
Measure throughput:
Threads | Run time | Throughput (models/sec)
--------|---------|------------------------
1 | 180s | 8 models/sec
2 | 120s | 12 models/sec
4 | 75s | 20 models/sec
8 | 50s | 30 models/sec
12 | 45s | 33 models/sec
16 | 48s | 31 models/sec (saturation, contention)
Sweet spot for Snowflake: 8-12 threads. More — contention на warehouse, slower.
Document:
## Optimization 2 — Threads tuning
Tested 1-16 threads on Snowflake (real warehouse, не synthetic):
| Threads | Run time | Improvement vs threads=1 |
|---------|----------|--------------------------|
| 1 | 180s | baseline |
| 4 | 75s | 2.4x |
| 8 | 50s | 3.6x (recommended) |
| 16 | 48s | 3.75x (saturated) |
Sweet spot: 8 threads. Beyond, no improvement, possibly worse.
Action: set `threads: 8` в production profiles.yml.
Step 5: Technique 3 — State-based selection (Slim CI)
Currently CI builds entire project on every PR. Move к state:modified+:
# Generate baseline manifest
dbt build # creates target/manifest.json
# Save it (in real CI — to S3 / GCS)
cp target/manifest.json baseline_manifest.json
# Modify a single model
echo "" >> models/marts/mart_table_0001.sql
# Run with state-based selection
time dbt build --select state:modified+ --state .
# Should be MUCH faster — only modified + downstream
Measure:
## Optimization 3 — State-based selection
CI scenario: 1 model modified в PR.
Before (full build):
- dbt build: 45 sec
After (slim CI):
- dbt build --select state:modified+ --state baseline_manifest_dir/: 4 sec
- (modified model + ~5 downstream)
Improvement: 11x faster CI.
Action: configure CI to:
1. Download production manifest.json before run.
2. Use `--select state:modified+ --state .`.
3. Upload new manifest.json after successful build.
This single technique often gives biggest CI speedup.
Step 6: Technique 4 — Defer для CI
When CI runs dbt build, it needs upstream sources/models. If they don’t exist в CI environment — build fails. Defer = read upstream from production manifest.
# CI scenario: CI environment не имеет marts built
# Modified model: mart_table_0001 (which refs int_model_005)
# Without defer — CI errors: int_model_005 not found
dbt run --select state:modified+ --state baseline_manifest_dir/
# With defer — CI reads int_model_005 from production
dbt run --select state:modified+ --state baseline_manifest_dir/ --defer
# Compiled SQL refers к production schema for non-modified upstream
# Modified model + downstream — build в CI schema
Document:
## Optimization 4 — Defer для CI
Without defer:
- CI must build entire DAG upstream from modified model.
- mart_table_0001 modified -> must rebuild int_model_005 -> its sources -> staging -> sources.
- Total: 20+ models even for single modification.
- Time: 8 sec.
With defer:
- CI builds only modified + downstream.
- mart_table_0001 refs int_model_005 from production manifest.
- Total: ~3 models built.
- Time: 2 sec.
Improvement: 4x faster.
Action: CI uses --defer + state-based selection together. Standard slim CI pattern.
Step 7: Technique 5 — Microbatch для time-series
Если есть time-series models — try microbatch incremental:
-- models/marts/mart_table_0050.sql
{{
config(
materialized='incremental',
incremental_strategy='microbatch',
event_time='created_at',
batch_size='day',
lookback=2,
begin='2025-01-01'
)
}}
SELECT
id,
created_at,
value
FROM {{ ref('int_model_050') }}
# First run — initial batches
time dbt run --select mart_table_0050
# Subsequent runs — only new batches
time dbt run --select mart_table_0050
Document:
## Optimization 5 — Microbatch incremental
For time-series models в project:
Before (full incremental):
- Daily run: 60 sec (rebuilds incrementally, but full predicate scan)
After (microbatch):
- Daily run: 5 sec (only yesterday's batch, parallel-eligible)
Improvement: 12x faster for time-series models.
Action: convert time-series fact tables к microbatch where event_time available.
Step 8: Technique 6 — Materialization review
Review каждый model — is materialization optimal?
# List all models с materialization
dbt ls --select "config.materialized:table" --output json > tables.json
dbt ls --select "config.materialized:view" --output json > views.json
dbt ls --select "config.materialized:incremental" --output json > incrementals.json
Common improvements:
- Staging models: should be view (cheap, easy iteration).
- Intermediate: ephemeral в dev (CTE inlining), view в prod.
- Marts (small): table в prod, view в dev.
- Facts (large): incremental в prod, table в dev.
## Optimization 6 — Materialization tuning
Current materialization mix:
- All marts: table в prod, table в dev (slow dev iteration)
Recommendations applied:
- Staging: view (no change).
- Intermediate: ephemeral dev / view prod (was view both).
- Marts: table prod / view dev (was table both).
- Fact tables (5 large): incremental prod / table dev (was table both).
Before:
- dbt run --target dev (modified mart): 30 sec
After:
- dbt run --target dev (modified mart): 3 sec (view)
Improvement: 10x для dev iteration.
Step 9: Combined effects
Combine all optimizations:
## Final benchmark — combined optimizations
| Scenario | Baseline | Optimized | Improvement |
|----------|----------|-----------|-------------|
| dbt parse (warm) | 6s | 1.5s | 4x |
| dbt run (production, 8 threads) | 180s | 50s | 3.6x |
| dbt run (dev iteration, single mart) | 30s | 3s | 10x |
| dbt build (slim CI, 1 model modified) | 45s | 4s | 11x |
| Daily prod run (microbatch) | 60s | 5s | 12x |
**Average improvement: ~8x.**
Estimated annual savings (assuming team of 5 engineers, 50 dbt runs/day each):
- Time saved per run: ~30 sec average.
- Total time saved: 30 sec × 50 runs × 5 engineers × 250 work days = 187 hours/year.
- At $100/hr engineering = $18,700/year.
Plus warehouse cost savings, faster CI feedback loops, etc.
Step 10: Document и publish
BENCHMARK.md — finalized:
# Synthetic dbt Project — Performance Optimization Case Study
## Summary
Optimized 1500-моделей synthetic dbt project. Reduced parse time 4x, prod run 3.6x, dev iteration 10x, slim CI 11x.
## Methodology
Applied 6 techniques from dbt III course:
1. Partial parsing tuning (4x)
2. Threads tuning (3.6x on Snowflake)
3. State-based selection / Slim CI (11x)
4. Defer для CI (4x)
5. Microbatch incremental (12x для time-series)
6. Materialization review (10x для dev)
## Reproducibility
```bash
# Setup
git clone https://github.com/yourname/synthetic-dbt-perf
cd synthetic-dbt-perf
python generate_project.py # generates project
pip install -e .
# Baseline
./scripts/baseline.sh > baseline_results.txt
# Apply optimization N
./scripts/apply_optimization_1.sh
./scripts/measure.sh > optimization_1_results.txt
# Repeat для 2-6
Detailed numbers
[Tables, charts, methodology details …]
Push к GitHub:
```bash
git init
git add .
git commit -m "Initial: synthetic dbt project with optimization benchmarks"
git remote add origin https://github.com/yourname/synthetic-dbt-perf.git
git push -u origin main
Optional: Charts and visualizations
# scripts/plot.py
import matplotlib.pyplot as plt
phases = ['Parse', 'Compile', 'Run']
baseline = [25, 18, 45]
optimized = [1.5, 6, 12]
x = range(len(phases))
plt.figure(figsize=(10, 6))
plt.bar([i - 0.2 для i в x], baseline, 0.4, label='Baseline', color='#ff6b6b')
plt.bar([i + 0.2 для i в x], optimized, 0.4, label='Optimized', color='#4ecdc4')
plt.ylabel('Time (seconds)')
plt.title('dbt Performance Optimization Results')
plt.xticks(x, phases)
plt.legend()
plt.savefig('benchmark.png', dpi=150)
Include в README as image. Visual impact >> text-only.
Итого
- Path C capstone = synthetic large project + documented optimization improvements.
- Generator для 1500-моделей project предоставлен (или write own).
- Baseline measurements — first thing. Document numbers.
- 6 techniques to apply: partial parsing tuning, threads tuning, state-based selection, defer, microbatch, materialization review.
- Measure after each — incremental documentation.
- Combined results — usually 5-10x average improvement.
- Deliverable: GitHub repo с project + BENCHMARK.md + reproducibility scripts.
- Optional: charts/visualizations через matplotlib.
- Synthetic ≠ real production — translate carefully для production claims.
- Scope flexible: full (20-30 hr, 6 techniques) -> micro (10-15 hr, 3 techniques) -> mini (5-7 hr, 1 technique).
- Quality > breadth — better single technique deep than 6 shallow.
Next: урок 05 — finalize и portfolio packaging.