Capstone: three paths for a senior project

Congratulations: you have completed all 13 modules of dbt III. You know the dbt-core architecture, the parsing pipeline, Jinja contexts, the manifest structure, custom materializations, the adapter API, performance optimization for large projects, dbt Mesh, the Fusion engine, and MetricFlow internals.

The capstone is where all of this comes together in a real project that you can show in a portfolio, discuss in an interview, or use at work. It is not a box-ticking exercise: it is 20-40 hours of senior-engineer work that produces a concrete deliverable.

The capstone has three possible paths. This lesson helps you choose the one that fits your goals, experience, and available time.


Three paths

[Figure: the three capstone options]

Each path is a complete project with a deliverable. You do not need to do all three.

Decision matrix: which path suits you?

[Figure: choosing a capstone by your profile]

Path A: Contribute to dbt-core

What it is

A real contribution to the open source dbt-core project (github.com/dbt-labs/dbt-core).

Scope:

Find an issue labeled good_first_issue or good_first_pr.
Reproduce the bug or understand the feature request.
Write a failing test.
Implement the fix or feature.
Submit a pull request.
Iterate on reviewer feedback.
Get it merged.
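
The "write a failing test, then implement the fix" steps above can be sketched in a self-contained way. The helper `format_missing_ref_error` and its behavior are hypothetical and are not part of dbt-core. In a real contribution the test would live in the dbt-core test suite and target the code named in the issue.

```python
def format_missing_ref_error(model_name: str, ref_name: str) -> str:
    """Hypothetical helper (not a dbt-core function) that builds an error message.

    The imagined bug: the message used to omit the name of the referencing model.
    """
    return f"Model '{model_name}' depends on '{ref_name}', which was not found."


def test_message_names_both_models() -> None:
    # Written first: this test fails until the message names both models.
    message = format_missing_ref_error("orders", "stg_orders")
    assert "Model 'orders'" in message
    assert "'stg_orders'" in message
```

Running the test with pytest before the fix shows the failure that the PR is meant to remove. After the fix, the same test documents the expected behavior for reviewers.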

Deliverable:

A merged PR in dbt-core (your name in the commit history).
A closed issue.
Possibly a mention in the dbt release notes.

Time: 15-30 hours (depending on the complexity of the issue).

When to choose it

You want community recognition in the dbt ecosystem.
You enjoy the open source workflow.
You are fluent in Python, pytest, and git.
You are ready to interact with maintainers (responding to reviews).

When NOT to choose it

Your deadlines are tight: PR review can take weeks.
You are new to open source (the learning curve is steeper than on the other paths).
You want full solo control of the project (a PR requires maintainer approval).

Pre-requisites

Python 3.10-3.13 (supported by current dbt-core).
Comfort with the Git and GitHub workflow.
Comfort reading dbt-core source code (covered in modules 1-7).
Patience for review iterations.

Path B: Your own adapter

What it is

Write a functional dbt adapter from scratch for a warehouse that lacks first-class dbt support.

Suggested targets:

dbt-sqlite: a simple SQLite database for local development and learning.
dbt-pglite: Postgres-compatible, embedded, and WASM-based (modern "in-browser dbt").
dbt-clickhouse-light: an alternative to the community dbt-clickhouse, focused on a specific use case.
dbt-yourwarehouse: for an exotic warehouse that your team already uses.

Scope:

Implement the Credentials dataclass.
Implement the ConnectionManager (open, get_response, cancel, exception_handler).
Write the required adapter macros (15+).
Implement the Relation and Column classes.
Pass the dbt-tests-adapter suite.
Package it as pip-installable.
Optionally, apply to the Trusted Adapter Program.
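
To make the first two scope items more concrete, here is a minimal standalone sketch for the SQLite case. It deliberately does not import dbt: `SQLiteCredentials` and `open_connection` are illustrative names, and the members `type`, `unique_field`, and `_connection_keys` mirror what a dbt Credentials subclass is generally expected to provide. Treat the exact contract as an assumption to verify against the adapter documentation for your dbt version.

```python
import sqlite3
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SQLiteCredentials:
    """Standalone stand-in for an adapter's Credentials dataclass.

    Assumption: in a real adapter this class would subclass dbt's Credentials
    base class; here it is plain Python so the sketch runs without dbt.
    """

    database: str          # path to the SQLite file, or ":memory:"
    schema: str = "main"   # SQLite's default schema name

    @property
    def type(self) -> str:
        # Adapter type name, as it would appear under `type:` in profiles.yml.
        return "sqlite"

    @property
    def unique_field(self) -> str:
        # A value that identifies the target database.
        return self.database

    def _connection_keys(self) -> Tuple[str, ...]:
        # Fields that are safe to print in debug output.
        return ("database", "schema")


def open_connection(credentials: SQLiteCredentials) -> sqlite3.Connection:
    """Roughly the job of ConnectionManager.open: credentials in, live handle out."""
    return sqlite3.connect(credentials.database)


creds = SQLiteCredentials(database=":memory:")
handle = open_connection(creds)
handle.execute("create table t (id integer)")
handle.execute("insert into t values (1), (2)")
row_count = handle.execute("select count(*) from t").fetchone()[0]
handle.close()
```

A real ConnectionManager also has to wrap driver exceptions and report query status, which is where most of the adapter-specific work tends to be.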

Deliverable:

A GitHub repo with the adapter code and a README.
A pip-installable package.
Passing tests on CI (GitHub Actions).
(Bonus) PyPI publication.
(Bonus) Trusted Adapter Program application.

Time: 30-40 hours.

When to choose it

You want the deepest technical learning of the three paths.
You enjoy database internals.
You want a potentially reusable artifact (if the adapter is useful, the community can adopt it).
You are ready to spend 30+ hours.

When NOT to choose it

You have deadlines.
You dislike low-level database programming.
The warehouse has no standard DBAPI driver (you would need to implement a custom transport).

Pre-requisites

Strong Python OOP.
DBAPI or database client experience.
Comfort with pytest and mocking.
Knowledge of SQL DDL for the chosen warehouse.

Path C: Optimization case

What it is

Create a large synthetic dbt project (1000-1500 models) and apply all the techniques from module 10 to achieve a measurable performance improvement.

Scope:

Generate the synthetic project (we provide a template generator).
Take baseline measurements (parse, compile, and run time).
Apply optimizations (partial parsing, threads tuning, microbatch, deferral, selectors).
Measure the improvements.
Document the numbers in a structured report.
(Bonus) Migrate part of the project to dbt Fusion for comparison.
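
To give a feel for what generating a synthetic project involves, here is a minimal sketch. It is not the course's template generator: `generate_models`, the file naming scheme, and the layered DAG shape are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def generate_models(models_dir: Path, layers: int, per_layer: int) -> int:
    """Write layers * per_layer model files; each non-root model refs two parents."""
    models_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for layer in range(layers):
        for i in range(per_layer):
            name = f"model_l{layer}_{i}"
            if layer == 0:
                # Root models select a literal, so no source data is needed.
                sql = f"select {i} as id\n"
            else:
                # Reference two models from the previous layer to form a DAG.
                left = f"model_l{layer - 1}_{i}"
                right = f"model_l{layer - 1}_{(i + 1) % per_layer}"
                sql = (
                    f"select a.id from {{{{ ref('{left}') }}}} a\n"
                    f"join {{{{ ref('{right}') }}}} b on a.id = b.id\n"
                )
            (models_dir / f"{name}.sql").write_text(sql)
            count += 1
    return count


# Small demo in a temporary directory; layers=10, per_layer=100 would give 1000 models.
with tempfile.TemporaryDirectory() as tmp:
    total = generate_models(Path(tmp) / "models", layers=3, per_layer=4)
    sample = (Path(tmp) / "models" / "model_l1_0.sql").read_text()
```

A uniform DAG like this is easy to scale, but it will not reproduce the uneven fan-out of a real project, so benchmark results from it should be read with that limitation in mind.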

Deliverable:

A GitHub repo with the synthetic project.
A BENCHMARK.md report with before/after numbers.
Reproducible benchmarks (scripts for re-running the tests).
A recommendations document for real-world projects.
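
A reproducible benchmark script can be as small as the sketch below. It assumes wall-clock time over a few repeats is a good enough metric. The helper names are illustrative, and the stand-in command should be replaced with the dbt command you want to measure.

```python
import statistics
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(command: Sequence[str], repeats: int = 3) -> List[float]:
    """Run a command several times and return wall-clock seconds per run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: Sequence[float]) -> dict:
    """Median is less sensitive to one slow outlier than the mean."""
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


# Stand-in command so the sketch runs anywhere; for a real baseline,
# replace it with e.g. ["dbt", "parse"] run inside the project directory.
demo = summarize(time_command([sys.executable, "-c", "pass"], repeats=3))
```

Keeping the raw durations alongside the summary makes it easier to spot warm-cache effects, for example a first run that is much slower than the rest.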

Time: 20-30 hours.

When to choose it

You work on a real dbt project and want to apply what you learned immediately.
You want production-grade skills.
You want to avoid the risks of open source dependencies (no PR review delays).
You like numbers, benchmarks, and charts.

When NOT to choose it

You want community visibility (this path is solo work).
You already work on a large dbt project and can optimize it instead of a synthetic one (in that case, do that instead of the capstone, and document it).

Pre-requisites

A solid understanding of module 10 (performance at scale).
Comfort with benchmarking tools (cProfile, dbt's --record-timing-info flag).
Markdown writing for the report.

Comparison table

Aspect                          Path A           Path B           Path C
─────────────────────────────────────────────────────────────────────────
Time                           15-30 hrs        30-40 hrs        20-30 hrs
Risk                           Medium           Medium           Low
Community visibility           High             Medium           Low
Technical depth                Medium           High             Medium
Production applicability       Low              Medium           High
Portfolio appeal               High             High             Medium
Deliverable                    Merged PR        Functional pkg   Benchmark
Solo control                   Low              High             High
Risk of failure                Medium           Medium           Low
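
One way to turn the table into a personal decision is a weighted score. The sketch below is only an illustration: the mapping of Low/Medium/High to 1/2/3, the choice of rows, and the example weights are assumptions, not part of the course material.

```python
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

# Ratings copied from the comparison table, in the order (Path A, Path B, Path C).
RATINGS = {
    "Risk": ("Medium", "Medium", "Low"),
    "Community visibility": ("High", "Medium", "Low"),
    "Technical depth": ("Medium", "High", "Medium"),
    "Production applicability": ("Low", "Medium", "High"),
    "Portfolio appeal": ("High", "High", "Medium"),
    "Solo control": ("Low", "High", "High"),
}

# For these criteria a lower rating is better, so the score is inverted.
LOWER_IS_BETTER = {"Risk"}


def score_paths(weights: dict) -> dict:
    """Return a weighted score per path; criteria without a weight count as 0."""
    totals = {"A": 0.0, "B": 0.0, "C": 0.0}
    for criterion, ratings in RATINGS.items():
        weight = weights.get(criterion, 0.0)
        for path, rating in zip("ABC", ratings):
            value = LEVELS[rating]
            if criterion in LOWER_IS_BETTER:
                value = 4 - value
            totals[path] += weight * value
    return totals


# Hypothetical weights for someone who mainly wants production impact.
example = score_paths({"Production applicability": 3, "Risk": 2, "Solo control": 1})
best = max(example, key=example.get)
```

With these particular weights the scores favor Path C. Changing the weights to emphasize community visibility or technical depth shifts the result, which is the point of the exercise.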

What to do after choosing

Once you have chosen a path, continue to the matching lesson:

Path A -> lesson 02 (contribute to dbt-core)
Path B -> lesson 03 (write an adapter)
Path C -> lesson 04 (optimization case)

All three paths end with lesson 05: finalizing the project and the portfolio. There we help you package your capstone deliverable as:

A GitHub repository (README, structure).
A resume bullet point.
Talking points for interviews.
A blog post (optional).

Can you combine paths?

Some of you may be thinking, "I'll do both the contribution and the optimization." If you have 60+ hours, yes. Usually, though, it is better to focus on one path and go deep. In a senior portfolio, depth matters far more than breadth.

If you are very ambitious, an adapter plus benchmarks (Path B with a minor Path C) is an interesting combination: "I wrote an adapter, benchmarked it against the official adapter, and found areas for improvement." That is conference-talk material.

Finer points of each path

Path A: what to look for in issues

Not every good_first_issue is equal. A good first issue has:

Clear reproduction steps in the issue description.
Recent activity (within the last 6 months); older issues may be outdated.
A single author and few commenters (little "noise" in the comments).
Labels: good_first_issue plus triage:accepted (maintainers have confirmed it is valid).
A low-complexity area: documentation, error messages, an edge case in a single function.

Avoid:

Issues older than 1 year with no commits.
Issues containing the words "refactor" or "redesign": that is deep architectural work.
Issues whose last comment says "needs design discussion".
Issues opened against major feature areas (adapter API, runtime, etc.).
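
The checklist above can be expressed as a small filter. The sketch works on plain dictionaries rather than the GitHub API, the sample issues are made up for illustration, and the 183-day cutoff is just one way to approximate "6 months".

```python
from datetime import date, timedelta

AVOID_WORDS = ("refactor", "redesign")
REQUIRED_LABELS = {"good_first_issue", "triage:accepted"}


def is_good_candidate(issue: dict, today: date) -> bool:
    """Apply the checklist to one issue described as a plain dict."""
    recent = today - issue["updated"] <= timedelta(days=183)  # about 6 months
    labeled = REQUIRED_LABELS <= set(issue["labels"])
    title = issue["title"].lower()
    architectural = any(word in title for word in AVOID_WORDS)
    needs_design = "needs design discussion" in issue["last_comment"].lower()
    return recent and labeled and not architectural and not needs_design


# Made-up issues for illustration only; they are not real dbt-core issues.
today = date(2025, 6, 1)
issues = [
    {"title": "Clarify error message for missing profile",
     "updated": date(2025, 4, 10),
     "labels": ["good_first_issue", "triage:accepted"],
     "last_comment": "Happy to review a PR."},
    {"title": "Refactor the runtime context",
     "updated": date(2025, 5, 1),
     "labels": ["good_first_issue", "triage:accepted"],
     "last_comment": "Needs design discussion."},
    {"title": "Typo in docs",
     "updated": date(2023, 1, 5),
     "labels": ["good_first_issue"],
     "last_comment": ""},
]
shortlist = [issue["title"] for issue in issues if is_good_candidate(issue, today)]
```

A filter like this only narrows the list. Reading the full thread of each shortlisted issue is still needed before committing 15-30 hours to it.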

Path B: which warehouse to choose?

dbt-sqlite: the best learning experience. SQLite is well documented, simple, and needs no setup. Recommended for a first adapter.
dbt-pglite: trendy (browser-based, in-memory Postgres). Higher community interest, but more setup.
dbt-clickhouse-light: dbt-clickhouse already exists, so you need differentiation. Niche.
An existing private warehouse: if you have a warehouse with no dbt support, this is the best use case (real value to your team).

Path C: synthetic vs real project

Synthetic: easier setup (we provide a generator), but less realistic. Good for learning.
Real project (with your employer's permission): more impressive results, but it requires careful scrubbing (no proprietary data, no business names).

If you work on a real dbt project, get permission and optimize a fork or a synthetic copy. Don't optimize production directly without authorization.

Ready to choose?

Think about:

Your career goals: where do you want to work? Which skills matter?
Your time: how many hours are you ready to invest?
Your strengths: Python? Database internals? Performance?
Your appetite for risk: community PR review or solo work?

Made your choice? Skip to the matching lesson.

Knowledge check

A senior engineer works at a fintech company on a dbt project of ~600 models on Snowflake. Parsing currently takes 25 seconds, which annoys the team. Which capstone path is best?

Answer

For this specific scenario I would recommend **Path C (optimization)**, based on the **real project rather than a synthetic one** (if permissions allow).

**Why Path C:**

1. **Direct production value.** The current pain point (a 25-second parse) affects the team every day. The capstone work directly improves the working environment. That is **immediate ROI** rather than theoretical learning.
2. **Directly applicable skills.** The optimization techniques (partial parsing, threads tuning, microbatch, deferral) can be used every day. A senior engineer who took an internal dbt project from a 25s parse to a 5s parse has **demonstrable impact** to show the employer.
3. **Stakeholder buy-in is available.** The manager already knows the pain point exists, so the capstone needs no justification. "I will improve our dbt project performance" is an easy sell for time allocation.
4. **Real data is available** (with permission). A synthetic project is limited because its patterns do not match real ones. With the real project you can address actual bottlenecks.
5. **A portfolio piece that suits finance.** Fintech companies value stability, performance, and measurable improvements. "Reduced dbt parse time from 25s to 5s, saving $X/year in developer time" is exactly what finance leadership wants to hear.

**Implementation suggestion:**

Step 1: Get permission from your manager (1 hour).

- "I want to do a senior capstone project on dbt performance improvement. The goal is to reduce parse time, which benefits the team. I need 20-30 hours of work time."
- Most managers happily approve (a free improvement to team productivity).

Step 2: Baseline measurements (2 hours).

- Profile the current state: `dbt --record-timing-info profile.out parse`, plus run_results.json analysis.
- Identify bottlenecks: which phases dominate the time?
- Document the baseline in BENCHMARK.md.

Step 3: Apply techniques systematically (15-20 hours).

**Technique 1: Partial parsing tuning.** If changing vars between runs invalidates the partial parse, that is a problem. Move volatile values into a stable place and keep `--vars` identical between invocations, because changes to `--vars` or to `profiles.yml` trigger a full re-parse. Measure after each fix.

**Technique 2: Threads tuning.** For Snowflake, the optimal thread count is usually 8-16. Test:

```bash
# Sweep thread counts and time a full build at each setting.
for t in 4 8 12 16 24; do
  time dbt build --threads "$t" --target dev
done
```

Document the throughput curve. You will likely see the familiar Snowflake contention pattern.

**Technique 3: State-based selection (Slim CI).** Does CI currently build everything? Move to `--select state:modified+`. This needs manifest storage (S3, GCS). On its own it usually gives a 5-10x CI speedup.

**Technique 4: Microbatch for time-series models.** If you have large incremental fact tables, try microbatch. It reduces per-batch time and allows parallel batches.

**Technique 5: Materialization tuning.** Review each model: should it be a table or a view? Some prod tables can be views (less storage, similar performance). Some views could be tables (faster reads).

**Technique 6: Source freshness and source_status.** Source freshness checks add warehouse round-trips. Make them pay off by pairing them with the `source_status:fresher+` selector, so that only models downstream of refreshed sources are rebuilt.

**Technique 7: Defer for CI runs.** CI does not need to build the full DAG. Defer to prod artifacts:

```bash
# Build only modified models and their children; resolve everything else from prod.
dbt build --target ci --select state:modified+ --defer --state prod_manifest/
```

This builds only modified models and their downstream dependents. A massive CI speedup.

Step 4: Document the numbers (3-5 hours).

- BENCHMARK.md: before/after for each technique.
- Charts (matplotlib, or simply markdown tables).
- Reproducible scripts.

Step 5: Present to the team (1 hour).

- Slack/Confluence post: "Reduced dbt parse time from 25s to 5s through techniques X, Y, Z. Savings: $X/year of team time."
- This is a **visible win** that earns recognition.

**Deliverable for the portfolio:**

- Internal Confluence post (links removed for the public version).
- Sanitized version with anonymized numbers for a GitHub blog (if the employer agrees).
- LinkedIn post: "Optimized our dbt project parse time by 5x".
- Resume line: "Reduced internal dbt project parse time from 25s to 5s and cut CI time through state-based selection, threads tuning, and partial parsing optimization."

**Why not Path A or B:**

**Path A (contribute):** Requires finding a good_first_issue, writing a fix, waiting for review (possibly weeks), and iterating. The time is potentially open-ended, and the existing pain point stays unaddressed.

**Path B (adapter):** A Snowflake adapter already exists, so writing dbt-snowflake-light does not help. You could write an adapter for something else, but the learning value is less direct than the business value of Path C.

**Edge case:** If the finance company restricts public open source contributions (security policy), Path A may be impossible. Path B (an internal adapter for something) is possible. Path C is always allowed (internal work).

**Main lesson:** Choosing a capstone is not just about "what is cool"; it is about "what **benefits me most right now**". For someone in a working dbt environment with a known pain point, Path C wins on practical grounds. It is better to convert 25 hours into a real production improvement than into a theoretical contribution.

That said, Paths A and B have value too, for a different career stage or different goals. The right answer depends on your situation. This is an example of matching the capstone to the situation, not a universal best choice.
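
Step 4 of the answer (documenting before/after numbers in BENCHMARK.md) can be sketched as a small table renderer. The function names and column layout are assumptions, and the 25s and 5s figures are the target from the scenario, not a measured result.

```python
from typing import List, Tuple


def percent_reduction(before_s: float, after_s: float) -> float:
    """Share of the original time that was removed, in percent."""
    return (before_s - after_s) / before_s * 100


def render_benchmark_table(rows: List[Tuple[str, float, float]]) -> str:
    """Render (label, before, after) rows as a markdown table."""
    lines = [
        "| Measurement | Before (s) | After (s) | Reduction | Speedup |",
        "|---|---|---|---|---|",
    ]
    for label, before_s, after_s in rows:
        lines.append(
            f"| {label} | {before_s:.1f} | {after_s:.1f} "
            f"| {percent_reduction(before_s, after_s):.0f}% "
            f"| {before_s / after_s:.1f}x |"
        )
    return "\n".join(lines)


# Scenario target (25s to 5s), used here only as sample input.
table = render_benchmark_table([("dbt parse", 25.0, 5.0)])
print(table)
```

Reporting both the percentage reduction and the speedup factor avoids a common confusion: going from 25s to 5s is an 80% reduction and a 5x speedup, and both describe the same change.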

Knowledge check

An intern has finished the dbt III course and wants a capstone, but has limited experience (6 months on a small dbt project with 50 models). Which path is realistic, and how should the project be scoped so that it is not overwhelming?

Answer

This is a **scope-matching exercise**: the capstone you choose should match your level. An honest assessment: 6 months of experience on a small project is a **mid-junior level**, not senior. The capstone should fit you, not crush you.

**Recommendation: Path C (optimization), but adapted.**

**Why Path C is generally best for a mid-junior:**

1. **Low risk.** No PR review process, no PyPI publishing, a self-contained project. If something does not work, you can adjust the scope. On Path A the PR can be rejected. On Path B the adapter may not pass dbt-tests-adapter.
2. **Builds confidence.** Optimization techniques are concrete and measurable. Either parse time went down or it did not. A black-and-white feedback loop is good for learning.
3. **Transferable skills.** Performance optimization applies to any dbt project. Path A skills (PR workflow) are less broadly applicable. Path B skills (adapter internals) are niche.

**Why generic Path C can be overwhelming for a mid-junior:**

The default Path C asks for "a 1500-model synthetic project, all the module 10 techniques, and a measured 50% improvement". For a mid-junior that scope is too large. In reality:

- 1500 models is a lot to comprehend.
- Applying ALL the techniques means a deep dive into each one.
- A 50% improvement may not be achievable on synthetic data (real bottlenecks differ).

**Modified Path C for a mid-junior:**

**Scope reduction step 1: a smaller synthetic project.**

- 100-200 models instead of 1500.
- Still enough to feel scale effects (parse time is visible).
- Faster experimentation (each test is fast).
- A less overwhelming structure.

**Scope reduction step 2: focus on 3-4 techniques, not all of them.**

Pick the 3-4 most impactful for small projects:

1. **Partial parsing tuning**: universally applicable.
2. **Threads tuning**: easy to measure, clear results.
3. **State-based selection**: important for CI.
4. **Materialization review**: applies to any project size.

Skip the more advanced ones:

- Microbatch (only if you have time-series data).
- Cross-project ref (only if dbt Mesh is used).
- Fusion migration (an advanced topic, separate from core optimization).

**Scope reduction step 3: a smaller improvement target.**

Replace "50% improvement" with "measurable, documented improvements". For a 100-model project:

- Baseline parse: 3 seconds.
- After optimizations: 1 second.
- "Reduced parse time by 67%" is a perfectly valid achievement.

Focus on the learning experience, not on arbitrary big numbers.

**Suggested timeline for modified Path C:**

**Week 1: Project generation + baseline (5 hours)**

- Generate a 100-model project from the template.
- Create 5-10 sources, 10-20 staging models, and 5-10 marts; the rest are intermediate models.
- Set up local DuckDB (a zero-cost warehouse).
- Run `dbt build`. Baseline measurements: `dbt parse` timing, `dbt run` timing.
- Create BENCHMARK.md with "Day 0" numbers.

**Week 2: Partial parsing optimization (4 hours)**

- Read the partial parsing docs.
- Identify what invalidates the parse in your project.
- Move dynamic vars into stable places.
- Measure parse time before/after.
- Document in BENCHMARK.md.

**Week 3: Threads tuning (3 hours)**

- Test `--threads` 1, 2, 4, 8.
- Measure run time at each setting.
- Document the throughput curve.
- Choose the optimum for your DuckDB setup.

**Week 4: State-based selection (4 hours)**

- Run `dbt build` to get a manifest.
- Commit the manifest.
- Modify 1 model.
- Run `dbt build --select state:modified+ --state manifest_dir/`.
- Measure: how much faster is it than a full build?
- Document.

**Week 5: Materialization review + cleanup (4 hours)**

- Review each model: is its materialization optimal?
- Switch some tables to views (or vice versa).
- Run `dbt build` again.
- Compare the numbers.
- Final BENCHMARK.md polish.
- Write a README explaining the project.

Total: ~20 hours over 5 weeks. A comfortable pace for a mid-junior.

**Why not Path A for a mid-junior:**

- Reading dbt-core source code is dense. Modules 1-7 of dbt III cover it, but reading the code yourself takes time.
- Finding a good first issue requires understanding dbt-core internals.
- PR review iterations with senior maintainers are intimidating early in a career.
- High risk that the PR is not merged after weeks of work.

**Why not Path B for a mid-junior:**

- Adapter development needs a deep understanding of dbt internals (modules 8-9 of dbt III).
- DBAPI and database client programming is a different domain.
- Edge cases in Jinja macros are tricky.
- A 30-40 hour commitment is large.
- Risk: an incomplete adapter that does not pass dbt-tests-adapter feels like a failure.

**Future paths:**

After a successful mid-junior Path C, in 1-2 years at senior level, re-attempt:

- Path A: contribute now that you know dbt-core more deeply.
- Path B: write an adapter for a specialized warehouse.

**Portfolio framing for a mid-junior:**

Don't claim "senior optimization expert". Claim **honestly**: "As part of the dbt III course capstone, optimized a 100-model synthetic dbt project. Reduced parse time by 67% (3s -> 1s) and CI build time by 80% through partial parsing tuning, threads optimization, and state-based selection. Documented benchmarks and a reproducible methodology."

This is a **strong portfolio piece** appropriate to the experience level. It does not overstate.

**Main lesson:** Scope **must** match experience level. A senior capstone path does not mean "stupidly difficult". It means "ambitious enough to demonstrate skills, achievable enough to complete well". Adapting the default scope is encouraged. A modified Path C **for a mid-junior** is appropriate, achievable, and builds confidence for future senior projects.
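
To build intuition for the Week 4 step, the sketch below imitates what `state:modified+` selects, using tiny dictionaries shaped like the `nodes` and `child_map` sections of manifest.json. It is a simplification under stated assumptions: it compares only file checksums, whereas dbt's own comparison also considers other changes such as configuration, and the node names are invented.

```python
from collections import deque
from typing import Dict, List, Set


def modified_nodes(old_nodes: Dict[str, dict], new_nodes: Dict[str, dict]) -> Set[str]:
    """Nodes that are new or whose file checksum differs from the saved state."""
    changed = set()
    for node_id, node in new_nodes.items():
        old = old_nodes.get(node_id)
        if old is None or old["checksum"]["checksum"] != node["checksum"]["checksum"]:
            changed.add(node_id)
    return changed


def with_descendants(start: Set[str], child_map: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk over child_map: the '+' suffix in state:modified+."""
    selected = set(start)
    queue = deque(start)
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


# Invented three-model project: stg_orders feeds orders; customers is independent.
old = {
    "model.demo.stg_orders": {"checksum": {"checksum": "aaa"}},
    "model.demo.orders": {"checksum": {"checksum": "bbb"}},
    "model.demo.customers": {"checksum": {"checksum": "ccc"}},
}
new = {**old, "model.demo.stg_orders": {"checksum": {"checksum": "zzz"}}}
child_map = {
    "model.demo.stg_orders": ["model.demo.orders"],
    "model.demo.orders": [],
    "model.demo.customers": [],
}
selected = with_descendants(modified_nodes(old, new), child_map)
```

Here only `stg_orders` and its child `orders` are selected, while `customers` is skipped. That skipped work is where the speedup over a full build comes from.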

Summary

The capstone is where knowledge consolidates into a real project: 20-40 hours of senior-engineer work.
Three paths:
- A: Contribute to dbt-core (merged PR, community visibility).
- B: Your own adapter (deep technical work, a packageable artifact).
- C: Optimization case (production-applicable, measurable).
Decision criteria: career goals, available time, technical strengths, risk appetite.
Your industry context (finance, tech, or enterprise) influences the choice. Fintech: Path C is usually best. Open source culture: Path A. Database infrastructure: Path B.
You can scope down for a mid-junior level. The default scoping (1500 models, 50% improvement) is for seniors. Adjust it to your level.
Each path has a dedicated next lesson (02-04). After choosing a path, go to the matching lesson.
Lesson 05 covers finalizing and portfolio packaging for all three paths.

Ready to choose your path? Move on to the next lesson.
