Learning Platform
Глоссарий Troubleshooting
Урок 15.04 · 30 мин
Продвинутый
performanceoptimizationbenchmarkingSlim CI

Путь C: Large-project optimization case

Если выбрали этот путь — это наиболее production-applicable из трёх. Цель — взять synthetic large dbt project (1000-1500 моделей) и применить все техники из модуля 10 для измеримого performance improvement. Realistic timeline: 20-30 часов.

Что получите:

  • Hands-on опыт со всеми performance techniques.
  • Reproducible benchmark suite в portfolio.
  • Skills directly applicable к real production projects.
  • Document с numbers, который impressive в собеседовании.

В этом уроке мы пройдём процесс step-by-step.


timeit / time.perf_counter — microbenchmarks done right

Step 1: Generate synthetic project

We provide template generator. Если у вас нет — простой Python скрипт можно написать.

#!/usr/bin/env python3
"""generate_project.py: Создаёт synthetic dbt project с N моделями."""

import os
import random
from pathlib import Path

NUM_SOURCES = 10
NUM_STAGING = 50
NUM_INTERMEDIATE = 200
NUM_MARTS = 1240   # together = 1500 models

def generate_project(output_dir: Path):
    """Создаёт synthetic project."""
    output_dir.mkdir(exist_ok=True)
    (output_dir / "models").mkdir(exist_ok=True)
    (output_dir / "models" / "sources").mkdir(exist_ok=True)
    (output_dir / "models" / "staging").mkdir(exist_ok=True)
    (output_dir / "models" / "intermediate").mkdir(exist_ok=True)
    (output_dir / "models" / "marts").mkdir(exist_ok=True)
    
    # dbt_project.yml
    (output_dir / "dbt_project.yml").write_text("""
name: synthetic_perf
version: '1.0.0'
config-version: 2
profile: synthetic_perf

model-paths: ["models"]
target-path: "target"

models:
  synthetic_perf:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
""")
    
    # profiles.yml setup (DuckDB для simplicity)
    (output_dir / "profiles.yml").write_text("""
synthetic_perf:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ':memory:'
      threads: 4
""")
    
    # Sources (raw data)
    sources = []
    for i in range(NUM_SOURCES):
        sources.append(f"raw_table_{i:03d}")
    
    sources_yml = "version: 2\nsources:\n  - name: raw\n    tables:\n"
    for s in sources:
        sources_yml += f"      - name: {s}\n"
    (output_dir / "models" / "sources" / "_sources.yml").write_text(sources_yml)
    
    # Staging (1:1 with sources... mostly)
    for i in range(NUM_STAGING):
        source_idx = i % NUM_SOURCES
        sql = f"""SELECT * FROM {{{{ source('raw', 'raw_table_{source_idx:03d}') }}}}
WHERE id IS NOT NULL
"""
        (output_dir / "models" / "staging" / f"stg_table_{i:03d}.sql").write_text(sql)
    
    # Intermediate (each refs 1-3 staging)
    for i in range(NUM_INTERMEDIATE):
        num_refs = random.randint(1, 3)
        refs = random.sample(range(NUM_STAGING), num_refs)
        joins = " LEFT JOIN ".join([f"{{{{ ref('stg_table_{r:03d}') }}}} t{j} ON t0.id = t{j}.id" 
                                      for j, r in enumerate(refs)])
        sql = f"""SELECT t0.*\nFROM {joins}\n"""
        (output_dir / "models" / "intermediate" / f"int_model_{i:03d}.sql").write_text(sql)
    
    # Marts (each refs 1-5 intermediate)
    for i in range(NUM_MARTS):
        num_refs = random.randint(1, 5)
        refs = random.sample(range(NUM_INTERMEDIATE), num_refs)
        joins = " UNION ALL\n".join([f"SELECT * FROM {{{{ ref('int_model_{r:03d}') }}}}" 
                                       for r in refs])
        sql = f"""WITH all_data AS (\n{joins}\n)\nSELECT * FROM all_data WHERE 1=1\n"""
        (output_dir / "models" / "marts" / f"mart_table_{i:04d}.sql").write_text(sql)
    
    print(f"Generated {NUM_SOURCES + NUM_STAGING + NUM_INTERMEDIATE + NUM_MARTS} models в {output_dir}")


if __name__ == "__main__":
    generate_project(Path("synthetic_dbt_project"))

Run:

python generate_project.py
cd synthetic_dbt_project

# Verify
dbt deps   # if any
dbt parse  # должен работать
ls models/marts/ | wc -l  # 1240 files

You now have 1500-моделей dbt project для benchmarking.

NOTE

Real data optional. Synthetic project имеет models with no actual data (CTAs over CTEs over CTEs). DuckDB :memory: создаёт empty tables. Это OK для measuring dbt engine performance, не data movement performance.

Если хотите real data — генерируйте seed CSVs:

import pandas as pd
for s in sources:
    df = pd.DataFrame({"id": range(1000), "value": [random.random() для _ в range(1000)]})
    df.to_csv(f"seeds/{s}.csv", index=False)

Then dbt seed to load.


Step 2: Baseline measurements

# Time dbt parse
time dbt parse

# Time dbt run
time dbt run

# Get detailed timing
dbt run --profile  # generates target/profile.json с timings per node

Record baseline в BENCHMARK.md:

# Benchmark — synthetic_dbt_project

## Day 0 (baseline) — 2026-05-19

**Setup:**
- 1500 models (10 sources, 50 staging, 200 intermediate, 1240 marts)
- DuckDB :memory:
- 4 threads
- MacBook Pro M2, 16GB RAM

**Measurements:**

| Phase | Time | Notes |
|-------|------|-------|
| dbt parse (cold) | 25 sec | first invocation, no partial_parse |
| dbt parse (warm) | 6 sec | partial_parse.msgpack helps |
| dbt compile | 18 sec | after parse, generate target SQL |
| dbt run (full) | 45 sec | builds all models |
| dbt run (--select stg_table_001+) | 12 sec | single staging + downstream |

**Total CI run (parse + compile + run):** ~70 sec.

This is baseline. We will optimize from here.

Step 3: Technique 1 — Partial parsing tuning

Partial parsing should speed up subsequent runs. Verify it’s working:

# First parse (cold)
rm -rf target/
time dbt parse

# Should create target/partial_parse.msgpack

# Second parse (warm)
time dbt parse

# If still slow as first — partial parsing not working.

Common partial parsing breakers:

  1. vars в dbt_project.yml — changing var invalidates entire parse cache.
  2. Macros that reference env_var(…) — environment changes invalidate.
  3. current_timestamp in node SQL — non-deterministic, can invalidate.
  4. Source freshness in --state mode — can break partial parse.

Optimization: minimize var changes. Keep vars stable. Use profiles.yml env_vars instead of inline vars.

# Check what's invalidating
dbt parse --vars '{"my_var": "test"}'
# vs
dbt parse --vars '{"my_var": "test2"}'

# If parse is "cold" both times, vars are invalidating cache.

Modify project to not use dynamic vars в model logic.

Record improvement:

## Optimization 1 — Partial parsing tuning

Action: removed inline vars used in models. Moved to env_var() в profiles.yml.

Before:
- dbt parse (warm): 6 sec

After:
- dbt parse (warm): 1.5 sec (4x improvement)

Why: partial_parse.msgpack now valid between runs. Hash signature stable.

Step 4: Technique 2 — Threads tuning

for t in 1 2 4 8 16; do
    echo "=== Threads: $t ==="
    rm -rf target/
    time dbt run --threads $t
done

DuckDB is single-process — threads benefit limited. На Snowflake/BigQuery threads important. Real warehouse:

# В profiles.yml
dev:
  type: snowflake
  threads: 8   # try this first

Measure throughput:

Threads | Run time | Throughput (models/sec)
--------|---------|------------------------
1       | 180s    | 8 models/sec
2       | 120s    | 12 models/sec
4       | 75s     | 20 models/sec
8       | 50s     | 30 models/sec
12      | 45s     | 33 models/sec
16      | 48s     | 31 models/sec  (saturation, contention)

Sweet spot for Snowflake: 8-12 threads. More — contention на warehouse, slower.

Document:

## Optimization 2 — Threads tuning

Tested 1-16 threads on Snowflake (real warehouse, не synthetic):

| Threads | Run time | Improvement vs threads=1 |
|---------|----------|--------------------------|
| 1       | 180s     | baseline                 |
| 4       | 75s      | 2.4x                     |
| 8       | 50s      | 3.6x (recommended)       |
| 16      | 48s      | 3.75x (saturated)        |

Sweet spot: 8 threads. Beyond, no improvement, possibly worse.

Action: set `threads: 8` в production profiles.yml.

Step 5: Technique 3 — State-based selection (Slim CI)

Currently CI builds entire project on every PR. Move к state:modified+:

# Generate baseline manifest
dbt build  # creates target/manifest.json

# Save it (in real CI — to S3 / GCS)
cp target/manifest.json baseline_manifest.json

# Modify a single model
echo "" >> models/marts/mart_table_0001.sql

# Run with state-based selection
time dbt build --select state:modified+ --state .

# Should be MUCH faster — only modified + downstream

Measure:

## Optimization 3 — State-based selection

CI scenario: 1 model modified в PR.

Before (full build):
- dbt build: 45 sec

After (slim CI):
- dbt build --select state:modified+ --state baseline_manifest_dir/: 4 sec
- (modified model + ~5 downstream)

Improvement: 11x faster CI.

Action: configure CI to:
1. Download production manifest.json before run.
2. Use `--select state:modified+ --state .`.
3. Upload new manifest.json after successful build.

This single technique often gives biggest CI speedup.


Step 6: Technique 4 — Defer для CI

When CI runs dbt build, it needs upstream sources/models. If they don’t exist в CI environment — build fails. Defer = read upstream from production manifest.

# CI scenario: CI environment не имеет marts built
# Modified model: mart_table_0001 (which refs int_model_005)

# Without defer — CI errors: int_model_005 not found
dbt run --select state:modified+ --state baseline_manifest_dir/

# With defer — CI reads int_model_005 from production
dbt run --select state:modified+ --state baseline_manifest_dir/ --defer

# Compiled SQL refers к production schema for non-modified upstream
# Modified model + downstream — build в CI schema

Document:

## Optimization 4 — Defer для CI

Without defer:
- CI must build entire DAG upstream from modified model.
- mart_table_0001 modified -> must rebuild int_model_005 -> its sources -> staging -> sources.
- Total: 20+ models even for single modification.
- Time: 8 sec.

With defer:
- CI builds only modified + downstream.
- mart_table_0001 refs int_model_005 from production manifest.
- Total: ~3 models built.
- Time: 2 sec.

Improvement: 4x faster.

Action: CI uses --defer + state-based selection together. Standard slim CI pattern.

Step 7: Technique 5 — Microbatch для time-series

Если есть time-series models — try microbatch incremental:

-- models/marts/mart_table_0050.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='created_at',
        batch_size='day',
        lookback=2,
        begin='2025-01-01'
    )
}}

SELECT 
    id,
    created_at,
    value
FROM {{ ref('int_model_050') }}
# First run — initial batches
time dbt run --select mart_table_0050

# Subsequent runs — only new batches
time dbt run --select mart_table_0050

Document:

## Optimization 5 — Microbatch incremental

For time-series models в project:

Before (full incremental):
- Daily run: 60 sec (rebuilds incrementally, but full predicate scan)

After (microbatch):
- Daily run: 5 sec (only yesterday's batch, parallel-eligible)

Improvement: 12x faster for time-series models.

Action: convert time-series fact tables к microbatch where event_time available.

Step 8: Technique 6 — Materialization review

Review каждый model — is materialization optimal?

# List all models с materialization
dbt ls --select "config.materialized:table" --output json > tables.json
dbt ls --select "config.materialized:view" --output json > views.json
dbt ls --select "config.materialized:incremental" --output json > incrementals.json

Common improvements:

  • Staging models: should be view (cheap, easy iteration).
  • Intermediate: ephemeral в dev (CTE inlining), view в prod.
  • Marts (small): table в prod, view в dev.
  • Facts (large): incremental в prod, table в dev.
## Optimization 6 — Materialization tuning

Current materialization mix:
- All marts: table в prod, table в dev (slow dev iteration)

Recommendations applied:
- Staging: view (no change).
- Intermediate: ephemeral dev / view prod (was view both).
- Marts: table prod / view dev (was table both).
- Fact tables (5 large): incremental prod / table dev (was table both).

Before:
- dbt run --target dev (modified mart): 30 sec

After:
- dbt run --target dev (modified mart): 3 sec (view)

Improvement: 10x для dev iteration.

Step 9: Combined effects

Combine all optimizations:

## Final benchmark — combined optimizations

| Scenario | Baseline | Optimized | Improvement |
|----------|----------|-----------|-------------|
| dbt parse (warm) | 6s | 1.5s | 4x |
| dbt run (production, 8 threads) | 180s | 50s | 3.6x |
| dbt run (dev iteration, single mart) | 30s | 3s | 10x |
| dbt build (slim CI, 1 model modified) | 45s | 4s | 11x |
| Daily prod run (microbatch) | 60s | 5s | 12x |

**Average improvement: ~8x.**

Estimated annual savings (assuming team of 5 engineers, 50 dbt runs/day each):
- Time saved per run: ~30 sec average.
- Total time saved: 30 sec × 50 runs × 5 engineers × 250 work days = 187 hours/year.
- At $100/hr engineering = $18,700/year.

Plus warehouse cost savings, faster CI feedback loops, etc.

Step 10: Document и publish

BENCHMARK.md — finalized:

# Synthetic dbt Project — Performance Optimization Case Study

## Summary

Optimized 1500-моделей synthetic dbt project. Reduced parse time 4x, prod run 3.6x, dev iteration 10x, slim CI 11x.

## Methodology

Applied 6 techniques from dbt III course:

1. Partial parsing tuning (4x)
2. Threads tuning (3.6x on Snowflake)
3. State-based selection / Slim CI (11x)
4. Defer для CI (4x)
5. Microbatch incremental (12x для time-series)
6. Materialization review (10x для dev)

## Reproducibility

```bash
# Setup
git clone https://github.com/yourname/synthetic-dbt-perf
cd synthetic-dbt-perf
python generate_project.py  # generates project
pip install -e .

# Baseline
./scripts/baseline.sh > baseline_results.txt

# Apply optimization N
./scripts/apply_optimization_1.sh
./scripts/measure.sh > optimization_1_results.txt

# Repeat для 2-6

Detailed numbers

[Tables, charts, methodology details …]


Push к GitHub:

```bash
git init
git add .
git commit -m "Initial: synthetic dbt project with optimization benchmarks"
git remote add origin https://github.com/yourname/synthetic-dbt-perf.git
git push -u origin main

Optional: Charts and visualizations

# scripts/plot.py
import matplotlib.pyplot as plt

phases = ['Parse', 'Compile', 'Run']
baseline = [25, 18, 45]
optimized = [1.5, 6, 12]

x = range(len(phases))
plt.figure(figsize=(10, 6))
plt.bar([i - 0.2 для i в x], baseline, 0.4, label='Baseline', color='#ff6b6b')
plt.bar([i + 0.2 для i в x], optimized, 0.4, label='Optimized', color='#4ecdc4')
plt.ylabel('Time (seconds)')
plt.title('dbt Performance Optimization Results')
plt.xticks(x, phases)
plt.legend()
plt.savefig('benchmark.png', dpi=150)

Include в README as image. Visual impact >> text-only.


Проверка знанийKnowledge check
После optimization 1500-моделей project parse сократился с 25s к 1.5s. Manager спрашивает: 'это transferable к нашему real 800-models proj? Какие numbers ожидаем?'
ОтветAnswer
Хороший strategic вопрос. Synthetic benchmarks must be carefully translated к real-world expectations. **Why synthetic ≠ real production:** **Synthetic project**: - Generated models, simple SQL (mostly SELECT * + joins). - No real data (DuckDB :memory: empty tables). - Predictable structure (sources -> staging -> intermediate -> marts). - No custom macros, complex Jinja. - No team workflow factors (multiple developers, branches, etc.). **Real production**: - Custom macros (incremental strategies, audit logging, etc.). - Heavy Jinja meta-programming. - Real data (some queries process TB). - Diverse SQL complexity (CTEs, window functions, recursive). - Multiple developers — concurrent changes. - Many sources, custom configurations. **What translates 1:1:** 1. **Threads tuning** — Snowflake/BigQuery contention model same. Real benefit: similar 3-4x improvement on real warehouse. 2. **State-based selection** — manifest comparison logic universal. Real benefit: similar 5-10x CI speedup. 3. **Defer** — production manifest reference same mechanism. Real benefit: similar 3-5x CI speedup combined с state selection. 4. **Materialization review** — view vs table — same tradeoffs. Real benefit: similar 5-10x dev iteration improvement. 5. **Microbatch** — incremental strategy same. Real benefit: similar 10x for time-series, но only applicable к time-series models. **What translates partially:** 1. **Partial parsing tuning** — depends on: - Number of vars used (synthetic had few, real may have many). - Frequency of var changes. - Custom macros referencing env_vars. Real production может show 2-4x improvement (vs synthetic 4x), depending on baseline. 2. **Parse time absolute** — synthetic 25s -> 1.5s. Real production: - 800-models project parse baseline ~12-15s (smaller, faster baseline). - After optimization ~3-4s. - Improvement: 3-4x (similar relative, smaller absolute). **What may NOT translate:** 1. **Custom macros performance** — your team's custom macros may be slow regardless of dbt engine optimization. Example: macro that loops через result_set calling external HTTP service. Это not dbt's bottleneck, it's macro code. Fix: profile macros separately. Optimize Python/SQL code. 2. **Warehouse-specific bottlenecks** — if Snowflake has compute contention, query queueing slow: - dbt threads tuning не fixes. - Need warehouse warehouse size increase, clustering, multi-cluster warehouse setup. 3. **Team workflow factors** — multiple developers concurrent runs: - May saturate warehouse compute. - Need workload isolation, separate warehouses dev/prod, etc. **Realistic expectations для real 800-model project:** Given baseline assumptions: - Snowflake medium warehouse. - Mix of incremental, table, view materializations. - Some custom macros. - 5-10 developer team. - Daily prod runs, CI на每 PR. Expected improvements: | Metric | Baseline | Realistic After | Realistic Improvement | |--------|----------|------------------|------------------------| | Parse (warm) | 12s | 3s | 4x | | Local dev run (modified mart) | 60s | 5s | 12x | | Full prod run | 25 min | 12 min | 2x | | CI per PR (1-3 models modified) | 8 min | 1 min | 8x | | Daily prod incremental (time-series) | 8 min | 1.5 min | 5x | **Why prod full run only 2x (vs synthetic 3.6x):** Real warehouse compute time dominates. Even с perfect threads, you're at warehouse compute limit. 8 threads saturates a medium warehouse. Beyond — contention. To get 5x prod run improvement, would need: - Larger warehouse (more compute). - Multi-cluster warehouse. - Some queries split across multiple warehouses. - Warehouse-side optimizations (clustering, partitioning). дата platform engineering, не dbt engine work. **ROI calculation для real project:** Assume: - 5 engineers. - 30 dbt runs/day each = 150 runs/day team. - Average run time saved: 30 sec per run. - Team time saved: 75 min/day = 6 hrs/week = 312 hrs/year. - At $120/hour engineering = **$37,440/year** saved. Plus: - Faster CI feedback (engineers wait less). - Lower warehouse cost (faster queries = less compute). - Better developer experience. Math suggests **~$50-80K annual savings** для team of 5 на real 800-model project. ROI на 30 hours capstone investment = excellent. **What to tell manager:** "Synthetic benchmarks show 8x average improvement. На our real 800-model project, expect 3-5x improvement в most scenarios. Specific numbers: - Dev iteration: 10-12x faster (immediate developer happiness). - CI per PR: 8x faster. - Daily prod run: 2-3x faster. - Parse time: 4x faster. Annual savings: ~$50-80K. ROI on 30-hour investment: excellent. Note: some улучшения transfer fully (slim CI, microbatch). Others depend on your specific patterns (custom macros, warehouse setup). I'll do an audit on actual project to confirm." **Strategic suggestion для manager:** Apply techniques **incrementally**: - Week 1: Slim CI (immediate biggest win on CI). - Week 2: Threads tuning (production runs). - Week 3: Materialization review (developer experience). - Week 4: Microbatch для applicable models (time-series only). Measure after each. Adjust if numbers different from synthetic predictions. **Главный урок:** **Synthetic benchmarks indicate orders of magnitude**, не exact numbers. Real production has more variables. Communicate uncertainty honestly. Validate after applying в real env. Adjust expectations based on observed improvements. Good capstone delivers **methodology** more than numbers. Methodology transfers; numbers are project-specific.

Проверка знанийKnowledge check
Junior хочет path C capstone но беспокоится: '20-30 часов кажется много для optimization. Можно ли сделать subset, например только 3 techniques?'
ОтветAnswer
Конечно. Capstone scope flexible — match своему уровню и available time. **Subset path C: 'micro-capstone' (10-15 hours):** Apply **3 techniques** instead of 6. Pick most impactful для your situation. **Recommended trio для maximum learning + impact:** **Technique 1: State-based selection (Slim CI).** Most universally applicable. Every CI benefits from this. - Time: 3-4 hours. - Tasks: - Setup baseline manifest storage (S3 / GCS / local). - Configure CI workflow with --state and --select state:modified+. - Measure CI run time before / after. - Document. **Technique 2: Materialization review.** Applies любому project. Hands-on understanding tables/views/incremental tradeoffs. - Time: 4-5 hours. - Tasks: - Analyze current materialization mix. - Define conditional materialization (table prod / view dev). - Test dev iteration speedup. - Document improvements. **Technique 3: Threads tuning.** Easiest to measure. Pure parameter tuning. - Time: 2-3 hours. - Tasks: - Run dbt с threads 1, 2, 4, 8, 12, 16. - Plot throughput curve. - Pick optimal value. - Document. **Total: 10-12 hours.** **Documenting subset capstone:** Document scope explicitly: ```markdown # Capstone — Performance Optimization (subset) ## Scope Focused capstone applying 3 most-impactful techniques: 1. State-based selection (Slim CI) 2. Materialization review 3. Threads tuning Full 6-technique optimization скоупа deferred к future iteration. ## Reasons for scope - Time-constrained capstone (12 hours available). - Selected techniques universally applicable (no time-series specific). - Skip: partial parsing tuning (project не has var issues), defer (requires Slim CI setup first, deferred), microbatch (no time-series models). ## Results [Tables, numbers ...] ## Future work Apply remaining techniques as project requires: - Partial parsing tuning when var-driven invalidation observed. - Microbatch when time-series models added. - Defer when state-based CI working. ``` Honest scope communication = professional. Better than pretending to do full capstone shallow. **Even smaller — 'mini-capstone' (5-7 hours):** Do ONE technique deeply: **Option: Just Slim CI.** - Time: 5-7 hours. - Detailed implementation: - Architecture design (где manifest storage, CI workflow). - Setup AWS S3 bucket (или GitHub Actions cache). - GitHub Actions workflow с manifest upload/download. - Production-quality (error handling, validation). - Documentation: setup guide, troubleshooting. - Benchmark: before/after CI times на real PRs. Deliverable: - GitHub repo с CI workflow example. - Blog post explaining setup. - Performance numbers. This is **better single-technique capstone** vs shallow multi-technique. Quality > breadth. **For junior level — recommendations:** **Если 5-7 hours available:** Mini-capstone (Slim CI или materialization). **Если 10-15 hours:** Micro-capstone (3 techniques). **Если 20-30 hours:** Full path C. All three valid capstone outcomes. Match scope к available time and current skill level. **Portfolio framing для each level:** **Mini-capstone:** "As part of dbt III capstone, designed и implemented production-quality Slim CI workflow для dbt projects, reducing CI time by 8x. Includes manifest versioning, GitHub Actions integration, troubleshooting documentation. github.com/yourname/dbt-slim-ci-setup." **Micro-capstone:** "As part of dbt III capstone, applied three performance optimization techniques к dbt project: state-based selection, materialization review, threads tuning. Documented benchmarks и methodology. Improvements: 8x CI, 5x dev iteration, 3x prod run. github.com/yourname/dbt-perf-optimization." **Full capstone:** "Comprehensive dbt performance optimization capstone applying 6 techniques к 1500-моделей project. Average 8x improvement across phases. Includes reproducible benchmark suite, methodology documentation, lessons learned. github.com/yourname/synthetic-dbt-perf." Each honestly framed к scope. **Что NOT recommend:** Don't: - Promise full 6 techniques но deliver 3 sloppy implementations. - Lie о времени spent. - Skip documentation/benchmarks (это где capstone value). - Take real production project, modify, claim "optimization" if it's just dbt run timings. Quality + honest scope = real portfolio value. **Главный урок:** **Scope flexibility — feature, не bug**. Capstone is about **demonstrating skills**, не **completing checklist**. Better to do one technique brilliantly чем 6 techniques shallowly. Match scope к time available. Document scope honestly. Deliverable focused на quality > breadth. За последние 5 years собеседующий dozens junior engineers — single technique done deeply with measurements always more impressive than half-finished comprehensive project. Pick wisely.

Итого

  1. Path C capstone = synthetic large project + documented optimization improvements.
  2. Generator для 1500-моделей project предоставлен (или write own).
  3. Baseline measurements — first thing. Document numbers.
  4. 6 techniques to apply: partial parsing tuning, threads tuning, state-based selection, defer, microbatch, materialization review.
  5. Measure after each — incremental documentation.
  6. Combined results — usually 5-10x average improvement.
  7. Deliverable: GitHub repo с project + BENCHMARK.md + reproducibility scripts.
  8. Optional: charts/visualizations через matplotlib.
  9. Synthetic ≠ real production — translate carefully для production claims.
  10. Scope flexible: full (20-30 hr, 6 techniques) -> micro (10-15 hr, 3 techniques) -> mini (5-7 hr, 1 technique).
  11. Quality > breadth — better single technique deep than 6 shallow.

Next: урок 05 — finalize и portfolio packaging.

Проверьте понимание

Результат: 0 из 0
Концептуальный
Вопрос 1 из 5. Какие шаги в Path C capstone?

Закончили урок?

Отметьте его как пройденный, чтобы отслеживать свой прогресс

Войдите чтобы оценить урок

Прогресс модуля
0 из 5