pathlib: cross-platform paths, Path.glob, / operator

pathlib (PEP 428, Python 3.4+, the prevailing choice since 3.6) is the modern object-oriented API for file paths. It replaces the verbose os.path.join(os.path.dirname(...), 'subdir', 'file.csv') with the concise Path(...).parent / 'subdir' / 'file.csv'. Paths are cross-platform aware (POSIX vs Windows separators), immutable, and introspectable. Pragmatic-DEEP rule: use pathlib everywhere that path handling is not a performance bottleneck; the readability is worth it.

In this lesson:

Why pathlib — vs os.path.
Class hierarchy — Path / PurePath / PurePosixPath / PureWindowsPath.
API tour — disk operations + path arithmetic.
/ operator — concatenation idiom.
Pitfall 22 — Pyodide MEMFS limited disk.
Pitfall 34 — / operator absolute right-side override.
Code-challenge py-m09-06-code-1 — Pattern 3 path arithmetic.
Run-on-Your-Machine callout — real disk operations.

Why pathlib — vs `os.path`

Compare the verbose os.path approach with pathlib:

# Old style — os.path (Python 2 + 3.0-3.3 era)
import os
data_dir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(data_dir, 'data', 'users.csv')
parquet_path = os.path.splitext(csv_path)[0] + '.parquet'  # replace extension
parent = os.path.dirname(csv_path)
exists = os.path.isfile(csv_path)

# Modern style — pathlib (Python 3.4+)
from pathlib import Path
csv_path = Path(__file__).parent / 'data' / 'users.csv'
parquet_path = csv_path.with_suffix('.parquet')
parent = csv_path.parent
exists = csv_path.is_file()

Wins:

/ operator instead of os.path.join;
.parent / .suffix / .stem properties instead of os.path.dirname / os.path.splitext;
.is_file() / .exists() methods instead of os.path.isfile;
immutable: every operation returns a new Path (functional style, chain freely);
introspectable: repr(path) shows PosixPath('...') or WindowsPath('...').

Production rule: treat stdlib os.path as legacy. In modern Python (3.6+), reach for pathlib first.
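As a quick sanity check that the two styles agree, the sketch below builds the same path both ways. It uses posixpath and PurePosixPath so the result is identical on every OS; the /srv/app directory is a made-up example, not a path from this course.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical base directory, used only for illustration.
base = '/srv/app'

# os.path style (posixpath is the POSIX flavour of os.path).
old_csv = posixpath.join(base, 'data', 'users.csv')
old_parquet = posixpath.splitext(old_csv)[0] + '.parquet'

# pathlib style.
new_csv = PurePosixPath(base) / 'data' / 'users.csv'
new_parquet = new_csv.with_suffix('.parquet')

print(str(new_csv) == old_csv)          # True
print(str(new_parquet) == old_parquet)  # True
```

Both styles produce the same strings; the pathlib version simply keeps the intermediate values as path objects that can be chained further.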

Class hierarchy — `Path` / `PurePath` / `Pure[Posix|Windows]Path`

PurePath (abstract — string arithmetic only, no FS access)
├── PurePosixPath  — forced POSIX (`/` separator)
├── PureWindowsPath — forced Windows (`\` separator)
└── Path (concrete — adds FS operations)
    ├── PosixPath   — auto-selected on Linux/macOS
    └── WindowsPath — auto-selected on Windows

Rules:

Class	Filesystem?	Separator
`Path`	yes	platform-specific (Posix on Linux/macOS, Windows on Windows)
`PosixPath`	yes	`/`
`WindowsPath`	yes	`\`
`PurePath`	no	platform-specific
`PurePosixPath`	no	`/` (forced)
`PureWindowsPath`	no	`\` (forced)
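To make the table concrete, here is a small sketch that parses one Windows-style string with both forced flavours. The path string is a made-up example; everything else is standard pathlib.

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath

raw = r'C:\data\raw\events.json'   # hypothetical Windows-style path

win = PureWindowsPath(raw)
posix = PurePosixPath(raw)

print(win.parts)       # ('C:\\', 'data', 'raw', 'events.json')
print(win.name)        # events.json
print(win.as_posix())  # C:/data/raw/events.json

# On POSIX a backslash is an ordinary character, so the whole string is one part.
print(len(posix.parts))  # 1

# Concrete paths are pure paths plus filesystem methods.
print(isinstance(Path('.'), PurePath))  # True
print(hasattr(posix, 'exists'))         # False
```

The pure classes give the same answers on any OS, which is why they suit tests and serialized paths.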

When to use which:

Path — actual disk operations (.read_text(), .glob(), .exists()).
PurePath — cross-platform path manipulation without FS access (testing, web URL parsing, cross-platform serialization).
PurePosixPath — explicitly POSIX (URL-like paths, Git repo paths, S3/GCS keys).

In the browser challenges (M09 lesson 06), use PurePosixPath: string-only path arithmetic with NO FS access (Pitfall 22: Pyodide MEMFS holds none of your files by default).

from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath

# Concrete — actual FS lookup
p = Path('users.csv')
print(p.exists())   # False (depending on FS state)

# Abstract — pure string arithmetic
pp = PurePosixPath('/data/raw/events.json')
print(pp.parent)    # PurePosixPath('/data/raw')
print(pp.suffix)    # '.json'
# pp.exists() → AttributeError — PurePath has no FS methods

Reference: docs.python.org/3/library/pathlib.

API tour — disk operations + path arithmetic

Disk operations (`Path` only)

from pathlib import Path

p = Path('users.csv')

p.exists()          # bool
p.is_file()         # bool
p.is_dir()          # bool
p.stat()            # os.stat_result — size, mtime, permissions
p.read_text(encoding='utf-8')   # str — full content
p.read_bytes()                  # bytes
p.write_text('content', encoding='utf-8')
p.write_bytes(b'content')
p.unlink(missing_ok=False)      # delete file
p.touch(exist_ok=True)          # create empty file
p.mkdir(parents=True, exist_ok=True)  # create directory
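The calls above can be tried safely in a throwaway directory. This is a minimal sketch, assuming the environment allows writing to its temporary directory; the file name report.txt is hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / 'out' / 'report.txt'   # hypothetical file name

    # Create the parent directory, then write, read back, and inspect the file.
    report.parent.mkdir(parents=True, exist_ok=True)
    written = report.write_text('hello', encoding='utf-8')  # returns characters written
    content = report.read_text(encoding='utf-8')
    size = report.stat().st_size
    was_file = report.is_file()

    # Delete the file and confirm that it is gone.
    report.unlink()
    exists_after = report.exists()

print(written, content, size, was_file, exists_after)
# 5 hello 5 True False
```

The temporary directory is removed when the with block exits, so nothing is left on disk.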

Listing / globbing

p = Path('.')

p.iterdir()                   # iterator over the directory's contents
p.glob('*.csv')               # iterator matching pattern (one level)
p.rglob('*.csv')              # recursive: same as glob('**/*.csv'), matches subdirectories too
list(p.glob('data/*.json'))   # subpath glob
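The difference between iterdir, glob, and rglob is easiest to see on a tiny tree. The sketch below builds one in a temporary directory; the file names are made up for the demo.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    for rel in ('a.csv', 'sub/b.csv', 'sub/c.json'):
        (root / rel).touch()

    # Direct children only, files and directories alike.
    children = sorted(p.name for p in root.iterdir())
    # Pattern match in the top level only.
    top_level = sorted(p.name for p in root.glob('*.csv'))
    # Pattern match at any depth.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.csv'))

print(children)   # ['a.csv', 'sub']
print(top_level)  # ['a.csv']
print(recursive)  # ['a.csv', 'sub/b.csv']
```

The results are sorted because directory iteration order is not guaranteed.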

Path arithmetic (on any subclass)

p = Path('/data/raw/events.json')

p.parent            # Path('/data/raw')
p.parents           # sequence of ancestors: Path('/data/raw'), Path('/data'), Path('/')
p.parents[0]        # Path('/data/raw')   — same as p.parent
p.parents[1]        # Path('/data')

p.name              # 'events.json'
p.stem              # 'events'           (name without the last suffix)
p.suffix            # '.json'            (last suffix, including the dot)
p.suffixes          # ['.json']          (all suffixes; for a multi-dot name, `app.tar.gz` → ['.tar', '.gz'])

p.parts             # ('/', 'data', 'raw', 'events.json')   (a tuple)

p.with_suffix('.csv')           # Path('/data/raw/events.csv')
p.with_name('output.parquet')   # Path('/data/raw/output.parquet')
p.with_stem('summary')          # Path('/data/raw/summary.json')   (Python 3.9+)
p.relative_to('/data')          # Path('raw/events.json')
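Multi-suffix names and relative_to failures are the two places where this API tends to surprise. A short sketch, using a hypothetical archive path:

```python
from pathlib import PurePosixPath

archive = PurePosixPath('/backups/2024/app.tar.gz')   # hypothetical path

print(archive.suffix)           # .gz
print(archive.suffixes)         # ['.tar', '.gz']
print(archive.stem)             # app.tar
print(archive.with_suffix(''))  # /backups/2024/app.tar

# Strip every suffix at once (str.removesuffix needs Python 3.9+).
base_name = archive.name.removesuffix(''.join(archive.suffixes))
print(base_name)                # app

# relative_to raises ValueError when the path is not under the given base.
try:
    archive.relative_to('/data')
    outside = False
except ValueError:
    outside = True
print(outside)                  # True
```

Note that .stem and .with_suffix only ever act on the last suffix, so stripping a compound extension needs an explicit step like the one shown.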

`/` operator — concatenation

base = Path('/data')
csv = base / 'raw' / 'events.csv'   # Path('/data/raw/events.csv')

# Multiple constructor args — equivalent
csv = Path('/data', 'raw', 'events.csv')
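The following sketch checks that the common ways of joining segments all give the same result. It uses PurePosixPath so the output does not depend on the OS.

```python
from pathlib import PurePosixPath

base = PurePosixPath('/data')

a = base / 'raw' / 'events.csv'                      # operator form
b = base.joinpath('raw', 'events.csv')               # method form
c = PurePosixPath('/data', 'raw', 'events.csv')      # constructor form
d = '/data' / PurePosixPath('raw') / 'events.csv'    # a plain str may sit on the left

print(a == b == c == d)   # True
print(a)                  # /data/raw/events.csv
```

At least one operand of / must already be a path object; two plain strings cannot be joined this way.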

Pitfall 34 — `/` operator with absolute right-side overrides

Counterintuitive: if the right-hand operand starts with /, it replaces the left-hand path entirely (the same rule os.path.join follows):

from pathlib import PurePosixPath

p = PurePosixPath('/foo') / '/bar'
print(p)   # /bar    ← '/foo' discarded!

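Why: /bar is an absolute path, and appending an absolute path to another path has no sensible meaning, so pathlib keeps the right-hand side and discards the left, exactly as os.path.join does. RFC 3986 reference resolution for URLs behaves analogously, but pathlib does not implement that RFC.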

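Avoid: keep the leading / off every segment after the first. The constructor form applies the same rule, so it is only safe with relative segments:

p = PurePosixPath('/foo', 'bar')    # PurePosixPath('/foo/bar')   correct
p = PurePosixPath('/foo', '/bar')   # PurePosixPath('/bar')       same override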
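When a segment comes from user input or configuration, one possible guard is to strip its leading slashes before joining. The helper name safe_join below is hypothetical, and this is only a sketch: it does not defend against '..' segments.

```python
from pathlib import PurePosixPath

def safe_join(base: PurePosixPath, segment: str) -> PurePosixPath:
    """Append segment to base even if segment starts with '/'.

    Hypothetical helper for illustration; it does not protect against '..'.
    """
    return base / segment.lstrip('/')

base = PurePosixPath('/foo')

print(base / '/bar')                    # /bar
print(PurePosixPath('/foo', '/bar'))    # /bar
print(safe_join(base, '/bar'))          # /foo/bar
print(safe_join(base, 'bar/baz.csv'))   # /foo/bar/baz.csv
```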

Quiz question py-m09-06-q1 is designed to expose this pitfall.

Pitfall 22 — Pyodide MEMFS limited disk operations

What goes wrong:

from pathlib import Path
list(Path('./data').glob('*.csv'))   # [] in the browser: empty list

Why: Pyodide MEMFS is an in-memory filesystem that starts without any of your files. The browser cannot see the learner's real disk (security boundary: sandbox). Path('./users.csv').read_text() → FileNotFoundError.

How to avoid in browser challenges:

Use PurePosixPath for string-based path arithmetic; it never touches the FS.
The Pattern 3 challenge (M09 lesson 06) demonstrates this: solve(path_str) returns a tuple of computed components.

For real disk operations, see the Run-on-Your-Machine callout (below) and execute locally on a real OS.

# In a browser challenge: string arithmetic only
from pathlib import PurePosixPath
def solve(path_str: str) -> tuple:
    p = PurePosixPath(path_str)
    return (str(p.parent), p.stem, p.suffix, str(p.with_suffix('.csv')))

solve('/data/raw/events.json')
# ('/data/raw', 'events', '.json', '/data/raw/events.csv')
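When code must still read a file that may be absent, one option is to catch FileNotFoundError and fall back to a default. This is a minimal sketch: read_or_default is a hypothetical helper, and the demo assumes the environment allows writing to its temporary directory.

```python
import tempfile
from pathlib import Path

def read_or_default(path: Path, default: str = '') -> str:
    """Hypothetical helper: return the file's text, or default if it is missing."""
    try:
        return path.read_text(encoding='utf-8')
    except FileNotFoundError:
        return default

with tempfile.TemporaryDirectory() as tmp:
    users = Path(tmp) / 'users.csv'           # does not exist yet
    before = read_or_default(users, 'no data')
    users.write_text('id,name', encoding='utf-8')
    after = read_or_default(users, 'no data')

print(before)  # no data
print(after)   # id,name
```

Catching the exception avoids the race between an exists() check and the read that follows it.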

Code-challenge `py-m09-06-code-1` — Pattern 3 setup

The quiz JSON 06-pathlib.json embeds a challenge:

Given a path string, use pathlib.PurePosixPath to return a tuple (parent, stem, suffix, with_suffix_csv), where with_suffix_csv is the path with its extension replaced by .csv.

Solution skeleton (revealed after submission):

from pathlib import PurePosixPath

def solve(path_str: str) -> tuple:
    p = PurePosixPath(path_str)
    return (str(p.parent), p.stem, p.suffix, str(p.with_suffix('.csv')))

Test cases (3 — 2 visible + 1 hidden):

Standard JSON file: /data/raw/events.json → ('/data/raw', 'events', '.json', '/data/raw/events.csv').
Multi-dot stem: logs/app.2024-12-01.log → ('logs', 'app.2024-12-01', '.log', 'logs/app.2024-12-01.csv').
Hidden — no extension: README → ('.', 'README', '', 'README.csv').

This is the canonical Pattern 3: string arithmetic without disk access.
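The three listed test cases can be checked locally with a few assertions. The sketch below repeats the solve skeleton and compares it with the expected tuples quoted above; the cases dictionary is just a convenient way to hold them, not the grader's actual format.

```python
from pathlib import PurePosixPath

def solve(path_str: str) -> tuple:
    p = PurePosixPath(path_str)
    return (str(p.parent), p.stem, p.suffix, str(p.with_suffix('.csv')))

# Input path mapped to the expected tuple, copied from the listed test cases.
cases = {
    '/data/raw/events.json': ('/data/raw', 'events', '.json', '/data/raw/events.csv'),
    'logs/app.2024-12-01.log': ('logs', 'app.2024-12-01', '.log', 'logs/app.2024-12-01.csv'),
    'README': ('.', 'README', '', 'README.csv'),
}

for path_str, expected in cases.items():
    assert solve(path_str) == expected, path_str

print('all cases pass')
```

The README case works because a path with no directory part has '.' as its parent, and with_suffix simply appends when there is no existing suffix.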

Run-on-Your-Machine — pathlib real disk operations

TIP

Run-on-Your-Machine: pathlib real disk operations

Pyodide MEMFS does not contain your system's files, so for real disk inspection run locally (Python 3.10+ works with no extra installs):

python3 -c "from pathlib import Path; [print(p, p.stat().st_size) for p in Path('.').iterdir()]"

Or as a script, disk_inspect.py:

# disk_inspect.py
from pathlib import Path

# Find all Python files under the current directory
for p in Path('.').rglob('*.py'):
    print(p, '->', p.stat().st_size, 'bytes')

# Walk parents: useful for finding the project root
here = Path(__file__).resolve()
for parent in here.parents:
    if (parent / 'pyproject.toml').exists():
        print(f'Project root: {parent}')
        break

Run it:

python3 disk_inspect.py

Expected output (depends on your directory):

setup.py -> 1234 bytes
src/main.py -> 5678 bytes
tests/test_foo.py -> 890 bytes
Project root: /Users/you/myproject

The browser challenge above uses PurePosixPath for string-based path arithmetic; it does NOT touch the real FS. Real disk operations (.iterdir, .glob, .stat, .read_text) require a local Python where the OS provides actual file descriptors. This follows the standing convention of the Pyodide environment (M00 lesson 03 pre-promised Run-on-Your-Machine; M08 lesson 02 was its first occurrence).
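The parent-walking loop from disk_inspect.py can be wrapped in a reusable function. The sketch below is one possible shape, not part of the lesson's code: the name find_project_root and its marker parameter are assumptions, and the demo builds its own directory tree in a temporary location.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_project_root(start: Path, marker: str = 'pyproject.toml') -> Optional[Path]:
    """Return the nearest directory at or above start that contains marker."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None

# Demo in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / 'myproject'
    nested = root / 'src' / 'pkg'
    nested.mkdir(parents=True)
    (root / 'pyproject.toml').touch()

    found = find_project_root(nested)
    missing = find_project_root(nested, marker='no-such-marker.xyz')
    found_ok = (found == root)

print(found_ok)   # True
print(missing)    # None
```

Returning None instead of raising leaves the decision about a missing marker to the caller.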

Recipe — production pathlib pipeline

End-to-end: walk directory tree, collect files matching pattern, transform paths.

from pathlib import Path

def find_data_files(root: Path, pattern: str = '*.csv') -> list[Path]:
    """Recursively collect files matching pattern, sorted by size."""
    if not root.is_dir():
        raise NotADirectoryError(root)
    files = list(root.rglob(pattern))
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    return files

def derive_output_path(input_path: Path, output_root: Path) -> Path:
    """Mirror input → output structure, changing the extension to .parquet."""
    rel = input_path.relative_to(input_path.anchor)   # strip leading /
    return (output_root / rel).with_suffix('.parquet')

# Usage (Run-on-Your-Machine demo, not in the browser)
# files = find_data_files(Path('/data/raw'), '*.json')
# for f in files[:5]:
#     out = derive_output_path(f, Path('/data/processed'))
#     print(f'{f.name} ({f.stat().st_size} bytes) -> {out}')
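Note that derive_output_path keeps the whole original path under output_root: for /data/raw/events.json and /data/processed it yields /data/processed/data/raw/events.parquet. If the goal is to mirror only the part below the input root, a variant along these lines may fit better. The name mirror_output_path and its signature are hypothetical, and pure paths are used so the sketch also runs in the browser.

```python
from pathlib import PurePosixPath

def mirror_output_path(
    input_path: PurePosixPath,
    input_root: PurePosixPath,
    output_root: PurePosixPath,
    suffix: str = '.parquet',
) -> PurePosixPath:
    """Map input_root/<rel> to output_root/<rel> with a new suffix (illustrative variant)."""
    rel = input_path.relative_to(input_root)   # ValueError if input_path is outside input_root
    return (output_root / rel).with_suffix(suffix)

src = PurePosixPath('/data/raw/2024/events.json')   # hypothetical input file
out = mirror_output_path(src, PurePosixPath('/data/raw'), PurePosixPath('/data/processed'))
print(out)   # /data/processed/2024/events.parquet
```

Which behaviour is right depends on the pipeline; the variant trades generality for shorter, more predictable output paths.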

What's in the next lesson

Lesson 07: M09 module recap + bridge to M10. We taught I/O fundamentals first because pandas / Polars / PyArrow CANNOT run in the browser (50MB+ each, C/Rust dependencies, micropip experimental). M10 covers the libraries conceptually + cross-course bridges. It also includes a cumulative cross-course summary table + the Phase 69 forward-link (logging file rotation + production performance).

Pragmatic-DEEP principle: pathlib is a fundamental Python 3 idiom. Every path manipulation in production should use pathlib, not string concatenation. Memorize: /, .parent, .suffix, .stem, .with_suffix, .glob, .iterdir; these cover the vast majority of production cases.
