pathlib: cross-platform paths, Path.glob, / operator
pathlib (PEP 428, Python 3.4+, the dominant choice since 3.6+) is the modern object-oriented API for file paths. It replaces the verbose os.path.join(os.path.dirname(...), 'subdir', 'file.csv') with the concise Path(...) / 'subdir' / 'file.csv'. Paths are cross-platform aware (POSIX vs Windows separators), immutable, and introspectable. Pragmatic-DEEP rule: use pathlib everywhere performance is not critical; the readability is worth it.
In this lesson:
- Why pathlib: vs os.path.
- Class hierarchy: Path / PurePath / PurePosixPath / PureWindowsPath.
- API tour: disk operations + path arithmetic.
- / operator: concatenation idiom.
- Pitfall 22: Pyodide MEMFS limited disk.
- Pitfall 34: / operator absolute right-side override.
- Code-challenge py-m09-06-code-1: Pattern 3 path arithmetic.
- Run-on-Your-Machine callout: real disk operations.
Why pathlib — vs os.path
Compare the verbose os.path approach with pathlib:
# Old style — os.path (Python 2 + 3.0-3.3 era)
import os
data_dir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(data_dir, 'data', 'users.csv')
parquet_path = os.path.splitext(csv_path)[0] + '.parquet' # replace extension
parent = os.path.dirname(csv_path)
exists = os.path.isfile(csv_path)
# Modern style — pathlib (Python 3.4+)
from pathlib import Path
csv_path = Path(__file__).parent / 'data' / 'users.csv'
parquet_path = csv_path.with_suffix('.parquet')
parent = csv_path.parent
exists = csv_path.is_file()
Wins:
- / operator instead of os.path.join;
- .parent / .suffix / .stem properties instead of os.path.dirname / os.path.splitext;
- .is_file() / .exists() methods instead of os.path.isfile;
- immutable: every operation returns a new Path (functional style, chain freely);
- introspectable: repr(path) shows PosixPath('...') or WindowsPath('...').
Production rule: treat stdlib os.path as legacy. In modern Python (3.6+), reach for pathlib first.
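The immutability point and the os.path equivalence can be checked side by side. The following is a minimal sketch, assuming a hypothetical /srv/app directory; it uses PurePosixPath and the posixpath module (what os.path resolves to on POSIX systems) so that the result is the same on any OS.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical location, used only for illustration.
csv_path = PurePosixPath('/srv/app') / 'data' / 'users.csv'

# Every operation returns a new object; csv_path itself never changes.
parquet_path = csv_path.with_suffix('.parquet')
print(csv_path)      # /srv/app/data/users.csv
print(parquet_path)  # /srv/app/data/users.parquet

# The old-style spelling produces the same strings.
old_csv = posixpath.join('/srv/app', 'data', 'users.csv')
old_parquet = posixpath.splitext(old_csv)[0] + '.parquet'
print(old_csv == str(csv_path))          # True
print(old_parquet == str(parquet_path))  # True
```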
Class hierarchy — Path / PurePath / Pure[Posix|Windows]Path
PurePath (pure: string arithmetic only, no FS access)
├── PurePosixPath: forced POSIX (`/` separator)
├── PureWindowsPath: forced Windows (`\` separator)
└── Path (concrete: adds FS operations)
    ├── PosixPath: auto-selected on Linux/macOS
    └── WindowsPath: auto-selected on Windows
Rules:
| Class | Filesystem? | Separator |
|---|---|---|
| Path | yes | platform-specific (POSIX on Linux/macOS, Windows on Windows) |
| PosixPath | yes | `/` |
| WindowsPath | yes | `\` |
| PurePath | no | platform-specific |
| PurePosixPath | no | `/` (forced) |
| PureWindowsPath | no | `\` (forced) |
When to use which:
- Path: actual disk operations (.read_text(), .glob(), .exists()).
- PurePath: cross-platform path manipulation without FS access (testing, web URL parsing, cross-platform serialization).
- PurePosixPath: explicitly POSIX (URL-like paths, Git repo paths, S3/GCS keys).
In browser challenges (M09 lesson 06): PurePosixPath gives string-only path arithmetic with NO FS access (Pitfall 22: Pyodide MEMFS is empty by default).
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath
# Concrete — actual FS lookup
p = Path('users.csv')
print(p.exists()) # False (depending on FS state)
# Pure: string arithmetic only, no FS access
pp = PurePosixPath('/data/raw/events.json')
print(pp.parent) # PurePosixPath('/data/raw')
print(pp.suffix) # '.json'
# pp.exists() → AttributeError — PurePath has no FS methods
Reference: docs.python.org/3/library/pathlib.
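To see why the forced flavours matter, it may help to parse one string with both of them. This is an illustrative sketch with a hypothetical Windows-style path; because only pure classes are used, it behaves the same on every OS.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical Windows-style path string (raw string keeps the backslashes).
raw = r'C:\data\raw\events.json'

win = PureWindowsPath(raw)
posix = PurePosixPath(raw)

# The Windows flavour understands the drive and the backslash separator.
print(win.parts)       # ('C:\\', 'data', 'raw', 'events.json')
print(win.name)        # events.json
print(win.as_posix())  # C:/data/raw/events.json

# The POSIX flavour treats a backslash as an ordinary character,
# so the whole string is a single path component.
print(len(posix.parts))   # 1
print(posix.name == raw)  # True
```

Choosing the flavour explicitly is what makes results reproducible when the data (URLs, object-store keys, paths from another machine) does not follow the conventions of the OS running the code.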
API tour — disk operations + path arithmetic
Disk operations (Path only)
from pathlib import Path
p = Path('users.csv')
p.exists() # bool
p.is_file() # bool
p.is_dir() # bool
p.stat() # os.stat_result — size, mtime, permissions
p.read_text(encoding='utf-8') # str — full content
p.read_bytes() # bytes
p.write_text('content', encoding='utf-8')
p.write_bytes(b'content')
p.unlink(missing_ok=False) # delete file
p.touch(exist_ok=True) # create empty file
p.mkdir(parents=True, exist_ok=True) # create directory
Listing / globbing
p = Path('.')
p.iterdir() # iterator over the directory's entries
p.glob('*.csv') # iterator over matches (one level)
p.rglob('*.csv') # recursive: also matches in subdirectories (same as glob('**/*.csv'))
list(p.glob('data/*.json')) # subpath glob
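The difference between iterdir, glob and rglob is easiest to see on a small throwaway tree. The sketch below builds one in a temporary directory (the file names are hypothetical), so it does not depend on what is already on disk.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a tiny tree: two CSV files (one nested) and one text file.
    (root / 'nested').mkdir()
    (root / 'a.csv').write_text('x', encoding='utf-8')
    (root / 'nested' / 'b.csv').write_text('y', encoding='utf-8')
    (root / 'notes.txt').write_text('z', encoding='utf-8')

    # glob: only the directory itself; rglob: the directory and everything below it.
    top_level = sorted(p.name for p in root.glob('*.csv'))
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.csv'))
    # iterdir: every direct entry, files and directories alike.
    entries = sorted(p.name for p in root.iterdir())

print(top_level)  # ['a.csv']
print(recursive)  # ['a.csv', 'nested/b.csv']
print(entries)    # ['a.csv', 'nested', 'notes.txt']
```

All three methods return lazy iterators in no guaranteed order, which is why the sketch sorts the results before printing.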
Path arithmetic (on any subclass)
p = Path('/data/raw/events.json')
p.parent # Path('/data/raw')
p.parents # sequence of ancestors: Path('/data/raw'), Path('/data'), Path('/')
p.parents[0] # Path('/data/raw') — same as p.parent
p.parents[1] # Path('/data')
p.name # 'events.json'
p.stem # 'events' — name without last suffix
p.suffix # '.json' (last suffix, including the dot)
p.suffixes # ['.json']: all suffixes (for a multi-dot name, `app.tar.gz` gives ['.tar', '.gz'])
p.parts # ('/', 'data', 'raw', 'events.json') — tuple
p.with_suffix('.csv') # Path('/data/raw/events.csv')
p.with_name('output.parquet') # Path('/data/raw/output.parquet')
p.with_stem('summary') # Path('/data/raw/summary.json') (Python 3.9+)
p.relative_to('/data') # Path('raw/events.json')
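Multi-dot names are where .stem, .suffix and .with_suffix tend to surprise, because all three look only at the last suffix. A short sketch, assuming a hypothetical archive path; the "strip every suffix" step is one possible approach, not a pathlib built-in.

```python
from pathlib import PurePosixPath

p = PurePosixPath('/backups/app.tar.gz')  # hypothetical archive path

print(p.suffix)    # .gz
print(p.suffixes)  # ['.tar', '.gz']
print(p.stem)      # app.tar  (only the last suffix is removed)

# with_suffix also replaces only the last suffix.
print(p.with_suffix('.bz2'))  # /backups/app.tar.bz2

# To drop every suffix, cut the combined suffix length off the name.
base_name = p.name[: len(p.name) - len(''.join(p.suffixes))]
print(base_name)                        # app
print(p.with_name(base_name + '.zip'))  # /backups/app.zip
```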
/ operator — concatenation
base = Path('/data')
csv = base / 'raw' / 'events.csv' # Path('/data/raw/events.csv')
# Multiple constructor args — equivalent
csv = Path('/data', 'raw', 'events.csv')
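Two related spellings are worth knowing next to the / operator: joinpath, which is convenient when the segments arrive as a list, and a plain string on the left-hand side. A minimal sketch with hypothetical segment names:

```python
from pathlib import PurePosixPath

base = PurePosixPath('/data')
segments = ['raw', '2024', 'events.csv']  # hypothetical segments

via_operator = base / 'raw' / '2024' / 'events.csv'
via_joinpath = base.joinpath(*segments)  # same result, handy for dynamic lists

# A str on the left works as long as the other operand is a path object.
reversed_operand = 'archive' / PurePosixPath('raw')

print(via_operator == via_joinpath)  # True
print(via_joinpath)                  # /data/raw/2024/events.csv
print(reversed_operand)              # archive/raw
```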
Pitfall 34 — / operator with absolute right-side overrides
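Counterintuitive: if the right-hand operand starts with /, it replaces everything to its left (the same behaviour as os.path.join):
from pathlib import PurePosixPath
p = PurePosixPath('/foo') / '/bar'
print(p) # /bar ← '/foo' discarded!
Why: /bar is an absolute path, and an absolute path already says where to start, so prepending /foo to it has no meaningful result. pathlib deliberately mirrors os.path.join here: when several absolute segments are given, the last one becomes the anchor and everything before it is dropped.
Avoid: make sure every right-hand segment is relative (no leading /). The constructor form reads well for multi-part paths, but it applies the same rule, so PurePosixPath('/foo', '/bar') is also /bar: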
p = PurePosixPath('/foo', 'bar') # PurePosixPath('/foo/bar') ← correct
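When the right-hand segment comes from outside (user input, a config file), one defensive option is to strip leading slashes before joining. The helper below, join_under, is a hypothetical name introduced for this sketch; it is not part of pathlib.

```python
from pathlib import PurePosixPath

def join_under(base: PurePosixPath, segment: str) -> PurePosixPath:
    """Join `segment` below `base`, treating a leading '/' as relative."""
    return base / segment.lstrip('/')

base = PurePosixPath('/foo')
print(base / '/bar')             # /bar      (pitfall: base discarded)
print(join_under(base, '/bar'))  # /foo/bar
print(join_under(base, 'bar'))   # /foo/bar
```

This only neutralises the leading slash. It does not protect against '..' components, so it should not be treated as a complete path-traversal guard.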
Quiz question py-m09-06-q1 is designed to expose this pitfall.
Pitfall 22 — Pyodide MEMFS limited disk operations
What goes wrong:
from pathlib import Path
list(Path('./data').glob('*.csv')) # [] in the browser: empty list
Why: Pyodide MEMFS is an in-memory filesystem that is empty by default. The browser cannot see the learner's real disk (security boundary: sandbox). Path('./users.csv').read_text() raises FileNotFoundError.
How to avoid in browser challenges:
- Use PurePosixPath for string-based path arithmetic; it never touches the FS.
- The Pattern 3 challenge (M09 lesson 06) demonstrates this: solve(path_str) returns a tuple of computed components.
For real disk operations, see the Run-on-Your-Machine callout (below) and execute locally on a real OS.
# In a browser challenge: string arithmetic only
from pathlib import PurePosixPath
def solve(path_str: str) -> tuple:
    p = PurePosixPath(path_str)
    return (str(p.parent), p.stem, p.suffix, str(p.with_suffix('.csv')))
solve('/data/raw/events.json')
# ('/data/raw', 'events', '.json', '/data/raw/events.csv')
Code-challenge py-m09-06-code-1 — Pattern 3 setup
The quiz JSON 06-pathlib.json embeds a challenge:
Given a path string, use pathlib.PurePosixPath to return the tuple (parent, stem, suffix, with_suffix_csv), where with_suffix_csv is the path with its extension replaced by .csv.
Solution skeleton (revealed after submission):
from pathlib import PurePosixPath
def solve(path_str: str) -> tuple:
    p = PurePosixPath(path_str)
    return (str(p.parent), p.stem, p.suffix, str(p.with_suffix('.csv')))
Test cases (3: 2 visible + 1 hidden):
- Standard JSON file: /data/raw/events.json → ('/data/raw', 'events', '.json', '/data/raw/events.csv').
- Multi-dot stem: logs/app.2024-12-01.log → ('logs', 'app.2024-12-01', '.log', 'logs/app.2024-12-01.csv').
- Hidden, no extension: README → ('.', 'README', '', 'README.csv').
This is the canonical Pattern 3: string arithmetic without disk access.
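The three test cases listed above can be checked locally with a few assertions. This is a self-check sketch, not the grader used by the quiz; the cases dictionary simply restates the expected tuples.

```python
from pathlib import PurePosixPath

def solve(path_str: str) -> tuple:
    p = PurePosixPath(path_str)
    return (str(p.parent), p.stem, p.suffix, str(p.with_suffix('.csv')))

# Input path -> expected tuple, copied from the test-case list.
cases = {
    '/data/raw/events.json': ('/data/raw', 'events', '.json', '/data/raw/events.csv'),
    'logs/app.2024-12-01.log': ('logs', 'app.2024-12-01', '.log', 'logs/app.2024-12-01.csv'),
    'README': ('.', 'README', '', 'README.csv'),
}

for path_str, expected in cases.items():
    assert solve(path_str) == expected, path_str

print('all cases pass')
```

The README case shows two edge behaviours at once: a bare file name has parent '.', and with_suffix appends the extension when there is none to replace.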
Run-on-Your-Machine — pathlib real disk operations
Pyodide MEMFS has none of your system's files, so for real disk inspection run locally (Python 3.10+ works with no extra installs):
python3 -c "from pathlib import Path; [print(p, p.stat().st_size) for p in Path('.').iterdir()]"
Or use the script disk_inspect.py:
# disk_inspect.py
from pathlib import Path
# Find all Python files under the current directory
for p in Path('.').rglob('*.py'):
    print(p, '->', p.stat().st_size, 'bytes')
# Walk parents: useful for finding the project root
here = Path(__file__).resolve()
for parent in here.parents:
    if (parent / 'pyproject.toml').exists():
        print(f'Project root: {parent}')
        break
Run it:
python3 disk_inspect.py
Expected output (depends on your directory):
setup.py -> 1234 bytes
src/main.py -> 5678 bytes
tests/test_foo.py -> 890 bytes
Project root: /Users/you/myproject
In the browser challenge above we use PurePosixPath for string-based path arithmetic; it does NOT touch the real FS. Real disk operations (.iterdir, .glob, .stat, .read_text) require a local Python where the OS provides actual file descriptors. This is a standing convention of the Pyodide environment (M00 lesson 03 announced Run-on-Your-Machine in advance; M08 lesson 02 was its first occurrence).
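The parent-walking loop from disk_inspect.py can be wrapped in a reusable function. The sketch below is one possible shape: find_project_root and its marker parameter are hypothetical names, and the demo builds a temporary project tree so that the result does not depend on your own directories.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_project_root(start: Path, marker: str = 'pyproject.toml') -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / 'pyproject.toml').touch()  # mark the fake project root
    nested = root / 'src' / 'pkg'
    nested.mkdir(parents=True)

    found = find_project_root(nested)
    found_matches = (found == root)

print(found_matches)  # True
```

Returning None instead of raising leaves the decision about a missing marker to the caller.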
Recipe — production pathlib pipeline
End-to-end: walk directory tree, collect files matching pattern, transform paths.
from pathlib import Path
def find_data_files(root: Path, pattern: str = '*.csv') -> list[Path]:
    """Recursively collect files matching pattern, sorted by size (largest first)."""
    if not root.is_dir():
        raise NotADirectoryError(root)
    files = list(root.rglob(pattern))
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    return files
def derive_output_path(input_path: Path, output_root: Path) -> Path:
    """Mirror input -> output structure, change extension to .parquet."""
    rel = input_path.relative_to(input_path.anchor) # strip leading /
    return (output_root / rel).with_suffix('.parquet')
# Usage (Run-on-Your-Machine demo, not in the browser)
# files = find_data_files(Path('/data/raw'), '*.json')
# for f in files[:5]:
#     out = derive_output_path(f, Path('/data/processed'))
#     print(f'{f.name} ({f.stat().st_size} bytes) -> {out}')
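Note what relative_to(input_path.anchor) does in the recipe: it strips only the leading /, so the full input path is re-created under the output root. If the goal is to mirror only the part below the input root, a variant can take that root explicitly. The sketch below uses a hypothetical mirror_path helper and pure paths, so it runs without touching the disk.

```python
from pathlib import PurePosixPath

def mirror_path(input_path: PurePosixPath, input_root: PurePosixPath,
                output_root: PurePosixPath, suffix: str = '.parquet') -> PurePosixPath:
    """Re-create the part of input_path below input_root under output_root."""
    # relative_to raises ValueError if input_path is not inside input_root.
    rel = input_path.relative_to(input_root)
    return (output_root / rel).with_suffix(suffix)

src = PurePosixPath('/data/raw/2024/events.json')  # hypothetical input file
out_root = PurePosixPath('/data/processed')

mirrored = mirror_path(src, PurePosixPath('/data/raw'), out_root)
print(mirrored)  # /data/processed/2024/events.parquet

# For comparison, the anchor-based variant keeps the whole input path.
anchored = (out_root / src.relative_to(src.anchor)).with_suffix('.parquet')
print(anchored)  # /data/processed/data/raw/2024/events.parquet
```

Which layout is preferable depends on the pipeline: the anchor-based form never collides across different input roots, while the explicit-root form gives shorter, cleaner output paths.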
What's in the next lesson
Lesson 07: M09 module recap + bridge to M10. We taught I/O fundamentals first because pandas / Polars / PyArrow CANNOT run in the browser (50MB+ each, C/Rust dependencies, micropip experimental). M10 covers the libraries conceptually, plus cross-course bridges. Also included: a cumulative cross-course summary table and a Phase 69 forward-link (logging file rotation + production performance).
Pragmatic-DEEP principle: pathlib is a fundamental Python 3 idiom. Every path manipulation in production should use pathlib, not string concatenation. Memorize: /, .parent, .suffix, .stem, .with_suffix, .glob, .iterdir; together they cover 95% of production cases.