JSON: json.loads, json.dumps, JSONL streaming, custom encoders

JSON (JavaScript Object Notation, RFC 8259) is one of the most common data interchange formats, alongside CSV. It is schema-less, self-describing, text-based, and dominant in HTTP APIs, NoSQL stores, and streaming pipelines. The stdlib json module is production-grade and C-accelerated (the _json extension in CPython; it also works in Pyodide). Pragmatic-DEEP rule: never write your own JSON parser; json.loads already covers RFC 8259, UTF-8, and escape sequences.

In this lesson:

json.loads / json.dumps: string-based API.
json.load / json.dump: file-based API.
Type mapping: Python ↔ JSON.
Custom encoders: cls=JSONEncoder for datetime / Decimal / dataclass.
JSONL streaming: line-at-a-time processing for large datasets.
JSONDecoder.raw_decode: advanced streaming.
Pitfalls 23 / 25: JSONDecodeError class hierarchy, tuple round-trip loss.
Code-challenge py-m09-03-code-1: Pattern 2 (JSON nested traversal).
Cross-course → Spark spark.read.json schema inference.
Cross-course → ClickHouse JSONEachRow format.

`json.loads` / `json.dumps` — string-based API

json.loads(s) — parse JSON string → Python object. json.dumps(obj) — serialize Python object → JSON string.

import json

# loads — parse string → object
s = '{"name": "alice", "age": 30, "tags": ["dev", "qa"]}'
data = json.loads(s)
print(data)  # {'name': 'alice', 'age': 30, 'tags': ['dev', 'qa']}
print(type(data))   # <class 'dict'>
print(type(data['tags']))  # <class 'list'>

# dumps — serialize object → string
obj = {'name': 'bob', 'age': 25, 'verified': True, 'manager': None}
print(json.dumps(obj))
# {"name": "bob", "age": 25, "verified": true, "manager": null}

indent — pretty-print:

print(json.dumps(obj, indent=2))
# {
#   "name": "bob",
#   "age": 25,
#   "verified": true,
#   "manager": null
# }

sort_keys=True gives deterministic output, which matters for diffs, hashing, and reproducible tests:

print(json.dumps({'b': 2, 'a': 1}, sort_keys=True))
# {"a": 1, "b": 2}
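To make the hashing use case concrete, here is a minimal sketch of a content fingerprint built on sort_keys=True. The function name fingerprint and the choice of SHA-256 with compact separators are illustrative assumptions, not something the lesson prescribes.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Hash a JSON-serializable object independently of dict key order."""
    # sort_keys fixes the key order; compact separators remove optional spaces.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"b": 2, "a": 1}
b = {"a": 1, "b": 2}
print(json.dumps(a) == json.dumps(b))    # False: insertion order differs
print(fingerprint(a) == fingerprint(b))  # True: keys are sorted before hashing
```

Without sort_keys, two equal dicts built in a different order can produce different strings and therefore different hashes.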

ensure_ascii=False preserves non-ASCII characters (by default they are escaped as \uXXXX):

print(json.dumps({'city': 'Москва'}))                       # {"city": "\u041c\u043e\u0441\u043a\u0432\u0430"}
print(json.dumps({'city': 'Москва'}, ensure_ascii=False))   # {"city": "Москва"}

Reference: docs.python.org/3/library/json.html#json.dumps.

`json.load` / `json.dump` — file-based API

A symmetric API that reads from and writes to file-like objects:

import json
import io

# load: parse file-like → object (io.StringIO stands in for a file in browser challenges)
buf = io.StringIO('{"users": [{"id": 1, "name": "alice"}]}')
data = json.load(buf)
print(data)  # {'users': [{'id': 1, 'name': 'alice'}]}

# dump — serialize object → file-like
out = io.StringIO()
json.dump({'count': 42}, out)
print(out.getvalue())  # {"count": 42}

Production note: json.load(f) reads the whole file into memory, the same risk as f.read() in lesson 01. To stream large JSON files, use JSONL (below) or JSONDecoder.raw_decode.

Type mapping — Python ↔ JSON

Python	JSON	Comment
`dict`	`object`	keys must be `str`; other basic key types are auto-converted (e.g. `int` → `str`)
`list`	`array`
`tuple`	`array`	one-way — round-trip loses tuple-ness (Pitfall 25)
`str`	`string`	UTF-8
`int`	`number`	arbitrary precision (Python int is unbounded; some JS parsers lose precision above 2^53)
`float`	`number`	IEEE-754 double; `nan`/`inf` are non-standard (not permitted by RFC 8259 § 6)
`True` / `False`	`true` / `false`
`None`	`null`

NOT supported natively: datetime, Decimal, set, frozenset, bytes, dataclass, custom classes. These need a custom encoder (below).

import json

# Auto-conversion int keys → str
print(json.dumps({1: 'a', 2: 'b'}))
# {"1": "a", "2": "b"}

# Set → TypeError
try:
    json.dumps({1, 2, 3})
except TypeError as e:
    print(e)  # Object of type set is not JSON serializable
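Two rows of the table are easy to check directly: the non-standard nan/inf handling and the unbounded int. The sketch below uses only the stdlib; allow_nan is a real json.dumps parameter, and the value 2**53 + 1 is just an illustrative number above the double-precision integer range.

```python
import json

# Python emits NaN / Infinity by default, which strict JSON parsers reject.
print(json.dumps(float("nan")))   # NaN
print(json.dumps(float("inf")))   # Infinity

# allow_nan=False makes the encoder refuse such values instead.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as e:
    print(type(e).__name__)       # ValueError

# Integers round-trip exactly in Python, even beyond 2**53.
big = 2**53 + 1
print(json.loads(json.dumps(big)) == big)   # True
```

Setting allow_nan=False is a reasonable default when the consumer is a strict parser in another language.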

Pitfall 25 — tuple round-trip loss

JSON has no tuple type. json.dumps((1, 2)) → "[1, 2]" (array). json.loads("[1, 2]") → [1, 2] (a list, not a tuple).

import json

original = (1, 2, 3)
serialized = json.dumps(original)
print(serialized)  # [1, 2, 3]

restored = json.loads(serialized)
print(type(restored))  # <class 'list'>     ← NOT a tuple
print(restored == [1, 2, 3])   # True
print(restored == (1, 2, 3))   # False      ← list != tuple

Implications:

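A dict with tuple keys (M02 lesson 06: immutable, hashable) cannot be serialized at all: json.dumps raises TypeError because keys must be str, int, float, bool, or None. Tuples stored as values do serialize, but they come back as lists, which are unhashable.
The caller must re-wrap values in tuple() after json.loads when tuple semantics are needed.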
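One possible workaround for tuple keys, sketched under assumptions: the dict is stored as a list of [key, value] pairs and the keys are re-wrapped on load. The grid data and the pair-list layout are hypothetical choices for illustration; other encodings (for example joining the key into a string) work too.

```python
import json

# Hypothetical data: grid coordinates used as dict keys.
grid = {(0, 0): "start", (2, 3): "goal"}

try:
    json.dumps(grid)
except TypeError as e:
    print(type(e).__name__)   # TypeError: tuple keys are rejected

# Encode as a list of [key, value] pairs instead.
serialized = json.dumps([[list(k), v] for k, v in grid.items()])
print(serialized)  # [[[0, 0], "start"], [[2, 3], "goal"]]

# Decode and re-wrap each key in tuple() to make it hashable again.
restored = {tuple(k): v for k, v in json.loads(serialized)}
print(restored == grid)  # True
```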

Custom encoders — `cls=JSONEncoder`

For datetime, Decimal, dataclass, and custom classes, extend JSONEncoder:

import json
from datetime import datetime
from decimal import Decimal

class ProductionEncoder(json.JSONEncoder):
    """Handle datetime, Decimal, fallback to default."""
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve precision as a string
        return super().default(obj)  # raises TypeError for unknowns

obj = {
    'created': datetime(2026, 4, 29, 12, 0),
    'price': Decimal('19.99'),
    'name': 'widget',
}
print(json.dumps(obj, cls=ProductionEncoder, indent=2))
# {
#   "created": "2026-04-29T12:00:00",
#   "price": "19.99",
#   "name": "widget"
# }

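Alternative: the default= parameter takes a plain function, no class needed:

def to_jsonable(o):
    if isinstance(o, datetime):
        return o.isoformat()
    if isinstance(o, Decimal):
        return str(o)
    # Returning None here would silently emit null for unknown types
    raise TypeError(f'Object of type {type(o).__name__} is not JSON serializable')

print(json.dumps(obj, default=to_jsonable))
# {"created": "2026-04-29T12:00:00", "price": "19.99", "name": "widget"}

Cross-link M03 lesson 04 (closure): default= accepts any callable, so a lambda or a closure works as well. Production rule: use default= for simple cases and a cls= subclass for a complex hierarchy.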
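The encoders above cover only the serializing direction. A minimal sketch of the reverse, assuming the consumer knows which fields carry dates: parse_float and object_hook are real json.loads parameters, while restore_types and the "created" field name are hypothetical.

```python
import json
from datetime import datetime
from decimal import Decimal

# parse_float receives the raw numeric text, so no binary float rounding occurs.
doc = json.loads('{"price": 19.99, "qty": 3}', parse_float=Decimal)
print(doc)  # {'price': Decimal('19.99'), 'qty': 3}

def restore_types(d: dict) -> dict:
    """object_hook: turn an ISO string under 'created' back into a datetime."""
    if isinstance(d.get("created"), str):
        d["created"] = datetime.fromisoformat(d["created"])
    return d

event = json.loads(
    '{"created": "2026-04-29T12:00:00", "name": "widget"}',
    object_hook=restore_types,
)
print(event["created"])  # 2026-04-29 12:00:00
```

JSON itself carries no type tags, so the decoder has to rely on a convention such as a known field name.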

Dataclass + JSON

Combine with dataclasses.asdict (carried over from M07 lesson 04):

import json
from dataclasses import dataclass, asdict

@dataclass
class User:
    name: str
    age: int

users = [User('alice', 30), User('bob', 25)]
print(json.dumps([asdict(u) for u in users]))
# [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]
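Going back from JSON to dataclass instances can be sketched with ** unpacking. This assumes the JSON keys match the dataclass field names exactly; no type validation happens, so a string in "age" would pass through unchanged.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class User:
    name: str
    age: int

payload = json.dumps([asdict(User("alice", 30)), asdict(User("bob", 25))])

# Each decoded dict has exactly the dataclass field names, so ** unpacking works.
restored = [User(**d) for d in json.loads(payload)]
print(restored)  # [User(name='alice', age=30), User(name='bob', age=25)]
print(restored[0] == User("alice", 30))  # True
```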

Reference: docs.python.org/3/library/json.html#json.JSONEncoder.

JSONL — line-streaming format

JSONL (JSON Lines, a.k.a. NDJSON, Newline-Delimited JSON) is a sequence of \n-separated JSON values. Each line is an independent, valid JSON document. Streaming benefit: parse one line at a time, so memory use is bounded by the largest single record rather than by the file size:

{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}
{"id": 3, "name": "carol"}

Recipe: process JSONL without loading it all into memory (cross-link M05 lesson 02, generators):

import json
import io
from collections.abc import Iterator

def parse_jsonl(buf: io.IOBase) -> Iterator[dict]:
    """Yield one dict per line. Skip empty + invalid lines."""
    for line in buf:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # log + skip in production

# Usage
data = '{"id": 1, "name": "alice"}\n{"id": 2, "name": "bob"}\n'
buf = io.StringIO(data)
for record in parse_jsonl(buf):
    print(record)
# {'id': 1, 'name': 'alice'}
# {'id': 2, 'name': 'bob'}
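The writing side is the mirror image of parse_jsonl. The helper below is a sketch; the name write_jsonl and the returned record count are illustrative choices.

```python
import io
import json
from collections.abc import Iterable

def write_jsonl(records: Iterable[dict], out: io.TextIOBase) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    for record in records:
        # Without indent=, json.dumps escapes newlines inside strings,
        # so every record occupies exactly one physical line.
        out.write(json.dumps(record))
        out.write("\n")
        count += 1
    return count

buf = io.StringIO()
n = write_jsonl([{"id": 1, "name": "alice"}, {"id": 2, "note": "line\nbreak"}], buf)
print(n)                            # 2
print(buf.getvalue().count("\n"))   # 2
```

Passing indent= here would break the format, because a pretty-printed record spans several lines.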

Production formats supporting JSONL:

CloudWatch Logs / Elasticsearch ingestion — streaming log records.
OpenAI fine-tuning datasets — one prompt/completion per line.
BigQuery LOAD DATA — JSONL native ingestion.
ClickHouse JSONEachRow is semantically identical (see the cross-course section below).

`JSONDecoder.raw_decode` — advanced streaming

For multi-document JSON (no newline separators), raw_decode parses one value and returns it together with the index where it ended:

import json

decoder = json.JSONDecoder()
s = '{"a": 1}{"b": 2}{"c": 3}'

pos = 0
while pos < len(s):
    obj, end = decoder.raw_decode(s, idx=pos)
    print(obj)
    pos = end
    # Skip whitespace between documents (if any)
    while pos < len(s) and s[pos].isspace():
        pos += 1

# {'a': 1}
# {'b': 2}
# {'c': 3}

When to use it: RPC streams, concatenated JSON dumps without separators, custom protocols. For ordinary pipelines JSONL is preferable (clearer, tooling-friendly).
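The loop above needs the whole string in memory. For a stream that arrives in pieces, one possible approach is to buffer text and retry raw_decode whenever more data arrives. This sketch assumes every top-level value is an object or array, so a truncated document always fails to decode instead of decoding prematurely; iter_concatenated is a hypothetical helper name.

```python
import json
from collections.abc import Iterable, Iterator

def iter_concatenated(chunks: Iterable[str]) -> Iterator[object]:
    """Yield JSON values from concatenated documents split across chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # incomplete document: wait for the next chunk
            yield obj
            buffer = buffer[end:]
    if buffer.strip():
        raise ValueError(f"trailing incomplete JSON: {buffer!r}")

chunks = ['{"a": 1}{"b"', ': 2} {"c":', ' 3}']
print(list(iter_concatenated(chunks)))
# [{'a': 1}, {'b': 2}, {'c': 3}]
```

A genuinely malformed document is indistinguishable from an incomplete one here, so it surfaces only as the final ValueError once the input ends.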

Pitfall 23 — `JSONDecodeError` is `ValueError` subclass

json.loads raises json.JSONDecodeError on invalid input. It is a subclass of ValueError (backward compatibility: before Python 3.5 a plain ValueError was raised).

import json

# Old code — still works
try:
    json.loads('{invalid')
except ValueError as e:
    print(type(e).__name__)  # JSONDecodeError

# New code — preferred (specific exception)
try:
    json.loads('{invalid')
except json.JSONDecodeError as e:
    print(e.msg)   # 'Expecting property name enclosed in double quotes'
    print(e.lineno, e.colno, e.pos)  # 1 2 1

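Cross-link M07 lesson 06 (PYTH-09 typed exceptions): the specific subclass gives a typed error API (e.msg, e.lineno, e.colno, e.pos) with a debuggable position. A plain except ValueError still receives the same object at runtime, but it hides those attributes from type checkers and also swallows unrelated ValueErrors raised inside the try block. Production rule: catch JSONDecodeError directly for precise diagnostics.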

def safe_parse(json_str: str) -> dict | None:
    try:
        return json.loads(json_str)
    except json.JSONDecodeError as e:
        # Log the precise position
        print(f'JSON error at line {e.lineno} col {e.colno}: {e.msg}')
        return None
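The position attributes can also drive a more readable report. The sketch below uses e.doc (the original input, also available on JSONDecodeError) to print the failing line with a caret; describe_error is a hypothetical helper, not part of the json module.

```python
import json

def describe_error(e: json.JSONDecodeError) -> str:
    """Return the message, the offending line, and a caret under the column."""
    lines = e.doc.splitlines()
    bad_line = lines[e.lineno - 1] if lines else ""
    caret = " " * (e.colno - 1) + "^"
    return f"{e.msg} (line {e.lineno}, column {e.colno})\n{bad_line}\n{caret}"

try:
    json.loads('{"a": 1,\n "b": }')
except json.JSONDecodeError as e:
    report = describe_error(e)
    print(report)
# Expecting value (line 2, column 7)
#  "b": }
#       ^
```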

Code-challenge `py-m09-03-code-1` — Pattern 2 setup

The quiz file 03-json-stdlib.json embeds a challenge:

Given a JSON string {"users": [{"name": ..., "emails": [...]}, ...]}, return all email addresses flattened into a single list.

Solution skeleton (revealed after submission):

import json

def solve(json_str: str) -> list[str]:
    data = json.loads(json_str)
    return [email for user in data['users'] for email in user['emails']]

Test cases (3 — 2 visible + 1 hidden):

2 users with 1 + 2 emails = 3 total.
Empty users list → [].
Hidden: a user with no emails (empty inner list) → [].

This is the canonical Pattern 2: nested traversal with a double-for list comprehension. Pedagogically it illustrates: (1) json.loads returns plain Python dicts and lists, so the usual iteration patterns apply; (2) the flatten one-liner is idiomatic Python (cross-link M05 lesson 03, itertools/comprehensions).
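Real API payloads often omit keys entirely, which the skeleton above would answer with a KeyError. A tolerant variant might look like the sketch below; solve_tolerant is a hypothetical name and is not the function the challenge grades.

```python
import json
from itertools import chain

def solve_tolerant(json_str: str) -> list[str]:
    """Flatten emails, treating a missing 'users' or 'emails' key as empty."""
    data = json.loads(json_str)
    users = data.get("users", [])
    # chain.from_iterable flattens one level without building nested lists.
    return list(chain.from_iterable(user.get("emails", []) for user in users))

sample = (
    '{"users": [{"name": "alice", "emails": ["a@x.io"]}, '
    '{"name": "bob"}, '
    '{"name": "carol", "emails": ["c1@x.io", "c2@x.io"]}]}'
)
print(solve_tolerant(sample))           # ['a@x.io', 'c1@x.io', 'c2@x.io']
print(solve_tolerant('{"users": []}'))  # []
```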

Cross-course → Spark `spark.read.json` schema inference

Spark is the distributed evolution of the same idea. Spark 03/01 (DataFrame creation + schema) covers spark.read.json for reading single-line or multi-line JSON files.

# Spark equivalent (Run-on-Your-Machine, NOT Pyodide-runnable)
df = (
    spark.read
    .option('multiLine', 'false')   # ← JSONL is the default; 'true' for pretty-printed JSON
    .json('s3://bucket/events/')
)
df.printSchema()
# Spark infers the nested struct schema from a sample (samplingRatio defaults to 1.0)

Difference vs stdlib:

json.loads parses one document.
Spark spark.read.json parses thousands of JSONL files in parallel, infers a schema, and returns a DataFrame (column-oriented, vectorized).

Bridge insight: the multiLine=false option in Spark gives JSONL semantics (the streaming pattern from this lesson, M09 lesson 03). Production migration path: prototype in Python with parse_jsonl(...) → scale out in Spark with spark.read.json('s3://...'), using the same input format.
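To build intuition for what "infers a schema" means, here is a toy stdlib-only imitation that scans JSONL and records which Python types appear per top-level field. It is deliberately naive and is not Spark's algorithm: Spark merges types into a nested struct schema, which this sketch does not attempt. The name infer_schema is hypothetical.

```python
import json
from collections import defaultdict

def infer_schema(jsonl: str) -> dict[str, set[str]]:
    """Map each top-level field to the set of Python type names seen for it."""
    schema: dict[str, set[str]] = defaultdict(set)
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        for key, value in json.loads(line).items():
            schema[key].add(type(value).__name__)
    return dict(schema)

sample = '{"id": 1, "name": "alice"}\n{"id": 2, "name": null, "tags": ["dev"]}\n'
for field, types in sorted(infer_schema(sample).items()):
    print(field, sorted(types))
# id ['int']
# name ['NoneType', 'str']
# tags ['list']
```

Even this toy version shows why inference needs to see many records: the "tags" field and the nullable "name" only appear in the second line.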

Cross-course → ClickHouse `JSONEachRow` format

ClickHouse 11/07 (FORMAT clause) describes JSONEachRow, which is semantically identical to JSONL.

INSERT INTO events FORMAT JSONEachRow
{"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}
{"timestamp": "2026-04-29T12:00:01", "user_id": 43, "event": "logout"}

ClickHouse automatically maps JSON keys → table columns; missing columns → defaults; extra keys → ignored (or an error if input_format_skip_unknown_fields=0).
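The mapping rules above can be imitated in plain Python to see their effect on a single row. This sketch does not talk to ClickHouse; COLUMNS, map_row, and the skip_unknown_fields flag are hypothetical stand-ins for a table definition and the server setting.

```python
import json

# Hypothetical table definition: column name -> default value.
COLUMNS = {"timestamp": "", "user_id": 0, "event": ""}

def map_row(line: str, skip_unknown_fields: bool = True) -> dict:
    """Imitate the key -> column mapping for one JSONEachRow line."""
    row = json.loads(line)
    unknown = set(row) - set(COLUMNS)
    if unknown and not skip_unknown_fields:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # Missing columns fall back to defaults; unknown keys are dropped.
    return {col: row.get(col, default) for col, default in COLUMNS.items()}

print(map_row('{"user_id": 42, "event": "login", "ip": "10.0.0.1"}'))
# {'timestamp': '', 'user_id': 42, 'event': 'login'}
```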

Three-layer bridge:

Stdlib json.loads(line) per line — single Python process.
Spark spark.read.json(multiLine=False) — distributed JVM cluster.
ClickHouse INSERT FORMAT JSONEachRow — vectorized columnar OLAP DB.

Same data format (JSONL), three execution layers. A production pipeline often combines them: producers emit JSONL → Spark batch ingest → ClickHouse storage.

Recipe: production JSON pipeline with error handling

End-to-end: parse JSON → validate → typed records → re-serialize subset.

import io
import json
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class Event:
    timestamp: datetime
    user_id: int
    event: str

class EventEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)

def parse_events(jsonl_str: str) -> list[Event]:
    """Parse JSONL → typed Event records. Skip malformed."""
    out: list[Event] = []
    for line in jsonl_str.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            d = json.loads(line)
            out.append(Event(
                timestamp=datetime.fromisoformat(d['timestamp']),
                user_id=int(d['user_id']),
                event=d['event'],
            ))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # skip malformed
    return out

def emit_logins(events: list[Event]) -> str:
    """Re-serialize only the login events."""
    out = io.StringIO()
    for e in events:
        if e.event == 'login':
            out.write(json.dumps(asdict(e), cls=EventEncoder))
            out.write('\n')
    return out.getvalue()

# Usage
data = '''{"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}
{"timestamp": "2026-04-29T12:00:05", "user_id": 42, "event": "logout"}
{"timestamp": "2026-04-29T12:00:10", "user_id": 43, "event": "login"}
'''
events = parse_events(data)
print(len(events))   # 3
print(emit_logins(events))
# {"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}
# {"timestamp": "2026-04-29T12:00:10", "user_id": 43, "event": "login"}
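The recipe above skips malformed lines silently. In production it is usually more useful to know what was dropped, so here is a sketch of a variant that returns a rejection report. The name parse_with_report and the (line number, exception name) report shape are illustrative assumptions; plain dicts are used instead of Event to keep the block self-contained.

```python
import json
from datetime import datetime

def parse_with_report(jsonl_str: str) -> tuple[list[dict], list[tuple[int, str]]]:
    """Return (valid records, [(line number, reason)]) instead of skipping silently."""
    good: list[dict] = []
    bad: list[tuple[int, str]] = []
    for lineno, line in enumerate(jsonl_str.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            d = json.loads(line)
            record = {
                "timestamp": datetime.fromisoformat(d["timestamp"]),
                "user_id": int(d["user_id"]),
                "event": d["event"],
            }
        except (KeyError, TypeError, ValueError) as e:
            # JSONDecodeError is a ValueError subclass, so it is covered here too.
            bad.append((lineno, type(e).__name__))
            continue
        good.append(record)
    return good, bad

data = (
    '{"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}\n'
    '{broken\n'
    '{"timestamp": "2026-04-29T12:00:10", "event": "login"}\n'
)
good, bad = parse_with_report(data)
print(len(good))  # 1
print(bad)        # [(2, 'JSONDecodeError'), (3, 'KeyError')]
```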

What's in the next lesson

Lesson 04: Binary formats overview (Parquet / ORC / Avro / Arrow IPC). Conceptual only, since pyarrow / fastavro / orc-python are not available in the browser. Matrix comparison + decision tree + heavy cross-course references to the Storage Formats course (27 lessons of deep dives).

Pragmatic-DEEP principle: we do not deep-dive into the internals of the _json C extension. Stdlib json.loads is battle-tested and RFC 8259 compliant. If JSON becomes a measured bottleneck, third-party libraries such as orjson or ujson are options in production, but the stdlib is fast enough for most pipelines.
