JSON: json.loads, json.dumps, JSONL streaming, custom encoders
JSON (JavaScript Object Notation, RFC 8259) is one of the most common data interchange formats, alongside CSV. It is schema-less, self-describing, text-based, and dominant in HTTP APIs, NoSQL stores, and streaming pipelines. The stdlib json module is C-accelerated (the _json extension in CPython, which also works in Pyodide) and production-grade. Pragmatic-DEEP rule: never write your own JSON parser; json.loads covers RFC 8259, UTF-8, and escape sequences.
In this lesson:
- json.loads / json.dumps: string-based API.
- json.load / json.dump: file-based API.
- Type mapping: Python ↔ JSON.
- Custom encoders: cls=JSONEncoder for datetime / Decimal / dataclass.
- JSONL streaming: line-at-a-time for large datasets.
- JSONDecoder.raw_decode: advanced streaming.
- Pitfalls 23 / 25: JSONDecodeError class hierarchy, tuple round-trip loss.
- Code-challenge py-m09-03-code-1: Pattern 2 (JSON nested traversal).
- Cross-course → Spark spark.read.json schema inference.
- Cross-course → ClickHouse JSONEachRow format.
json.loads / json.dumps — string-based API
json.loads(s) — parse JSON string → Python object. json.dumps(obj) — serialize Python object → JSON string.
import json
# loads — parse string → object
s = '{"name": "alice", "age": 30, "tags": ["dev", "qa"]}'
data = json.loads(s)
print(data) # {'name': 'alice', 'age': 30, 'tags': ['dev', 'qa']}
print(type(data)) # <class 'dict'>
print(type(data['tags'])) # <class 'list'>
# dumps — serialize object → string
obj = {'name': 'bob', 'age': 25, 'verified': True, 'manager': None}
print(json.dumps(obj))
# {"name": "bob", "age": 25, "verified": true, "manager": null}
indent — pretty-print:
print(json.dumps(obj, indent=2))
# {
#   "name": "bob",
#   "age": 25,
#   "verified": true,
#   "manager": null
# }
sort_keys=True: deterministic output (important for diffs, hashing, and reproducible tests):
print(json.dumps({'b': 2, 'a': 1}, sort_keys=True))
# {"a": 1, "b": 2}
ensure_ascii=False: preserve non-ASCII characters (by default they are escaped as \uXXXX):
print(json.dumps({'city': 'Москва'}))  # {"city": "\u041c\u043e\u0441\u043a\u0432\u0430"}
print(json.dumps({'city': 'Москва'}, ensure_ascii=False))  # {"city": "Москва"}
See docs.python.org/3/library/json.html#json.dumps.
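The sort_keys note mentions hashing. A minimal sketch of that use follows, assuming a hypothetical helper name stable_digest and SHA-256 as the hash (neither comes from the lesson). Sorted keys plus fixed separators make equal dicts serialize to identical bytes.

```python
import hashlib
import json


def stable_digest(obj) -> str:
    """Hash a JSON-serializable object independently of dict key order.

    stable_digest is a hypothetical helper name used for illustration.
    """
    # sort_keys fixes the key order; separators removes optional whitespace.
    payload = json.dumps(obj, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


a = {'b': 2, 'a': 1}
b = {'a': 1, 'b': 2}
print(stable_digest(a) == stable_digest(b))  # True
print(json.dumps(a, sort_keys=True, separators=(',', ':')))  # {"a":1,"b":2}
```

Without sort_keys the two dicts above would serialize in insertion order and hash differently.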
json.load / json.dump — file-based API
Symmetric: read from a file-like object, write to a file-like object:
import json
import io
# load: parse file-like → object (browser challenges use io.StringIO)
buf = io.StringIO('{"users": [{"id": 1, "name": "alice"}]}')
data = json.load(buf)
print(data) # {'users': [{'id': 1, 'name': 'alice'}]}
# dump — serialize object → file-like
out = io.StringIO()
json.dump({'count': 42}, out)
print(out.getvalue()) # {"count": 42}
Production note: json.load(f) reads the whole file into memory, the same risk as f.read() in lesson 01. To stream large JSON files, use JSONL (below) or JSONDecoder.raw_decode.
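Outside the browser the same two calls work on real files. The sketch below assumes a temporary directory and a made-up file name, config.json; the explicit encoding='utf-8' is a suggested habit so the result does not depend on the platform default.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'config.json'  # hypothetical file name

    # Write: json.dump streams the serialized text into the open file.
    with path.open('w', encoding='utf-8') as f:
        json.dump({'city': 'Москва', 'count': 42}, f, ensure_ascii=False, indent=2)

    # Read it back with the same encoding.
    with path.open(encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded)  # {'city': 'Москва', 'count': 42}
```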
Type mapping — Python ↔ JSON
| Python | JSON | Comment |
|---|---|---|
| dict | object | keys must be str, int, float, bool, or None; non-str keys are converted to str (1 → "1") |
| list | array | |
| tuple | array | one-way: a round trip loses tuple-ness (Pitfall 25) |
| str | string | Unicode text; non-ASCII is escaped as \uXXXX unless ensure_ascii=False |
| int | number | arbitrary precision (Python int is unbounded; some JS parsers lose precision above 2^53) |
| float | number | IEEE-754 double; NaN / Infinity are non-standard (forbidden by RFC 8259 § 6) but emitted by default; allow_nan=False rejects them |
| True / False | true / false | |
| None | null | |
NOT supported natively: datetime, Decimal, set, frozenset, bytes, dataclass, custom classes. For these, use a custom encoder (below).
import json
# Auto-conversion int keys → str
print(json.dumps({1: 'a', 2: 'b'}))
# {"1": "a", "2": "b"}
# Set → TypeError
try:
    json.dumps({1, 2, 3})
except TypeError as e:
    print(e)  # Object of type set is not JSON serializable
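The int and float rows of the table have edge cases that are easy to check directly. The sketch below shows them using only stdlib parameters (allow_nan on dumps, parse_float on loads); the sample values are made up for illustration.

```python
import json
from decimal import Decimal

# By default Python emits the non-standard token NaN.
print(json.dumps(float('nan')))  # NaN

# allow_nan=False makes the encoder refuse such values instead.
try:
    json.dumps(float('inf'), allow_nan=False)
except ValueError as e:
    print(type(e).__name__)  # ValueError

# parse_float=Decimal keeps the decimal digits exactly when reading.
data = json.loads('{"price": 0.1}', parse_float=Decimal)
print(type(data['price']).__name__)  # Decimal

# Integers beyond 2**53 survive a Python round trip unchanged.
big = 2**53 + 1
print(json.loads(json.dumps(big)) == big)  # True
```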
Pitfall 25 — tuple round-trip loss
JSON has no tuple type. json.dumps((1, 2)) → "[1, 2]" (array). json.loads("[1, 2]") → [1, 2] (a list, not a tuple).
import json
original = (1, 2, 3)
serialized = json.dumps(original)
print(serialized)  # [1, 2, 3]
restored = json.loads(serialized)
print(type(restored))  # <class 'list'> ← NOT a tuple
print(restored == [1, 2, 3]) # True
print(restored == (1, 2, 3)) # False ← list != tuple
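Implications:
- A tuple used as a dict key (M02 lesson 06: immutable, hashable) cannot be serialized at all: json.dumps raises TypeError because keys must be str, int, float, bool, or None. Tuples stored as values do serialize, but they come back as lists, which are unhashable.
- The caller must re-wrap the value in tuple(...) after json.loads if tuple semantics are needed.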
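One way to carry tuple keys through JSON is sketched below. Storing the mapping as a list of [key, value] pairs is an illustrative convention chosen here, not a standard, and the grid data is made up.

```python
import json

# A dict keyed by tuples cannot be dumped directly.
grid = {(0, 0): 'start', (2, 3): 'goal'}
try:
    json.dumps(grid)
except TypeError as e:
    print(type(e).__name__)  # TypeError

# Workaround: serialize as a list of [key, value] pairs.
serialized = json.dumps([[list(k), v] for k, v in grid.items()])
print(serialized)  # [[[0, 0], "start"], [[2, 3], "goal"]]

# Re-wrap the keys into tuples after loading.
restored = {tuple(k): v for k, v in json.loads(serialized)}
print(restored == grid)  # True
```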
Custom encoders — cls=JSONEncoder
For datetime, Decimal, dataclass, and custom classes, extend JSONEncoder:
import json
from datetime import datetime
from decimal import Decimal
class ProductionEncoder(json.JSONEncoder):
    """Handle datetime, Decimal, fallback to default."""
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve precision as a string
        return super().default(obj)  # raises TypeError for unknowns
obj = {
'created': datetime(2026, 4, 29, 12, 0),
'price': Decimal('19.99'),
'name': 'widget',
}
print(json.dumps(obj, cls=ProductionEncoder, indent=2))
# {
#   "created": "2026-04-29T12:00:00",
#   "price": "19.99",
#   "name": "widget"
# }
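Alternative: the default= parameter takes a plain callable, no class needed. The hook must raise TypeError for types it does not handle; returning None would silently serialize them as null:
def to_jsonable(o):
    if isinstance(o, datetime):
        return o.isoformat()
    if isinstance(o, Decimal):
        return str(o)
    raise TypeError(f'Object of type {type(o).__name__} is not JSON serializable')
print(json.dumps(obj, default=to_jsonable))
Cross-link M03 lesson 04 (closures): default= accepts any callable, so a lambda or a closure that captures configuration works as well. Production rule: for simple cases use default=; for a complex hierarchy use a cls= subclass.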
Dataclass + JSON
Combine with dataclasses.asdict (carried over from M07 lesson 04):
import json
from dataclasses import dataclass, asdict
@dataclass
class User:
    name: str
    age: int
users = [User('alice', 30), User('bob', 25)]
print(json.dumps([asdict(u) for u in users]))
# [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]
See docs.python.org/3/library/json.html#json.JSONEncoder.
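Encoding is only half of a round trip: json.loads returns plain dicts, so a dataclass has to be rebuilt explicitly. The sketch below pairs a default= hook with an object_hook. Signup, encode_record, and revive are hypothetical names, and matching on the exact key set is a simplifying assumption that would be too loose for real schemas.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from datetime import datetime


@dataclass
class Signup:  # hypothetical record type for illustration
    name: str
    created: datetime


def encode_record(o):
    """default= hook: called only for objects json cannot encode itself."""
    if is_dataclass(o) and not isinstance(o, type):
        return asdict(o)  # datetime fields inside are handled on the next call
    if isinstance(o, datetime):
        return o.isoformat()
    raise TypeError(f'Object of type {type(o).__name__} is not JSON serializable')


def revive(d: dict):
    """object_hook: rebuild a Signup from a JSON object with exactly its fields."""
    if d.keys() == {'name', 'created'}:
        return Signup(d['name'], datetime.fromisoformat(d['created']))
    return d


original = Signup('alice', datetime(2026, 4, 29, 12, 0))
text = json.dumps(original, default=encode_record)
print(text)  # {"name": "alice", "created": "2026-04-29T12:00:00"}
print(json.loads(text, object_hook=revive) == original)  # True
```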
JSONL — line-streaming format
JSONL (JSON Lines, a.k.a. NDJSON, Newline-Delimited JSON) is a sequence of \n-separated JSON values. Each line is an independent, valid JSON document. Streaming benefit: you parse one line at a time, so memory use is bounded by the longest line rather than by the file size:
{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}
{"id": 3, "name": "carol"}
Recipe: process JSONL without loading it all into memory (cross-link M05 lesson 02, generators):
import json
import io
from collections.abc import Iterator
def parse_jsonl(buf: io.TextIOBase) -> Iterator[dict]:
    """Yield one dict per line. Skip empty + invalid lines."""
    for line in buf:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # log + skip in production
# Usage
data = '{"id": 1, "name": "alice"}\n{"id": 2, "name": "bob"}\n'
buf = io.StringIO(data)
for record in parse_jsonl(buf):
    print(record)
# {'id': 1, 'name': 'alice'}
# {'id': 2, 'name': 'bob'}
Production formats supporting JSONL:
- CloudWatch Logs / Elasticsearch ingestion: streaming log records.
- OpenAI fine-tuning datasets: one prompt/completion per line.
- BigQuery LOAD DATA: native JSONL ingestion.
- ClickHouse JSONEachRow: semantically identical (cross-course section below).
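The recipe above covers reading; the writing side is symmetric. Below is a minimal sketch with a hypothetical helper name, write_jsonl. It relies on the fact that json.dumps without indent escapes newlines inside strings, so each record stays on a single line.

```python
import io
import json
from collections.abc import Iterable
from typing import TextIO


def write_jsonl(records: Iterable[dict], out: TextIO) -> int:
    """Write one compact JSON document per line; return the record count."""
    count = 0
    for record in records:
        # Newlines inside string values are escaped by json.dumps,
        # so the only raw newline is the record separator written below.
        out.write(json.dumps(record))
        out.write('\n')
        count += 1
    return count


buf = io.StringIO()
n = write_jsonl([{'id': 1, 'note': 'line1\nline2'}, {'id': 2, 'note': 'ok'}], buf)
print(n)  # 2
print(len(buf.getvalue().splitlines()))  # 2
```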
JSONDecoder.raw_decode — advanced streaming
For multi-document JSON (without newline separators), raw_decode parses one value and returns it together with the position where it ended:
import json
decoder = json.JSONDecoder()
s = '{"a": 1}{"b": 2}{"c": 3}'
pos = 0
while pos < len(s):
    obj, end = decoder.raw_decode(s, idx=pos)
    print(obj)
    pos = end
    # Skip whitespace between documents (if any)
    while pos < len(s) and s[pos].isspace():
        pos += 1
# {'a': 1}
# {'b': 2}
# {'c': 3}
When to use: RPC streams, concatenated JSON dumps without separators, custom protocols. For ordinary pipelines JSONL is preferable (clearer, tooling-friendly).
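The same loop can be packaged as a generator so callers iterate over values without tracking positions. This is a sketch with a hypothetical name, iter_concatenated; it skips whitespace before each value because raw_decode itself does not accept leading whitespace.

```python
import json
from collections.abc import Iterator


def iter_concatenated(s: str) -> Iterator[object]:
    """Yield each top-level JSON value from back-to-back documents in s."""
    decoder = json.JSONDecoder()
    pos = 0
    n = len(s)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < n and s[pos].isspace():
            pos += 1
        if pos >= n:
            return
        obj, pos = decoder.raw_decode(s, pos)
        yield obj


print(list(iter_concatenated(' {"a": 1} [2, 3]\n"x"')))
# [{'a': 1}, [2, 3], 'x']
```

A malformed document still raises json.JSONDecodeError from inside the generator, so the caller decides whether to stop or resynchronize.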
Pitfall 23 — JSONDecodeError is ValueError subclass
json.loads raises json.JSONDecodeError on invalid input. It is a subclass of ValueError (backward compatibility: before Python 3.5, plain ValueError was raised).
import json
# Old code: still works
try:
    json.loads('{invalid')
except ValueError as e:
    print(type(e).__name__)  # JSONDecodeError
# New code: preferred (specific exception)
try:
    json.loads('{invalid')
except json.JSONDecodeError as e:
    print(e.msg)  # Expecting property name enclosed in double quotes
    print(e.lineno, e.colno, e.pos)  # 1 2 1
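Cross-link M07 lesson 06 (PYTH-09 typed exceptions): the specific subclass gives a typed error API (e.msg, e.lineno, e.colno, e.pos) with a debuggable position. A plain except ValueError still receives the same object, but it also catches unrelated ValueErrors that lack these attributes, so the handler cannot rely on them. Production rule: always catch JSONDecodeError directly for precise diagnostics.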
import json
def safe_parse(json_str: str) -> dict | None:
    try:
        return json.loads(json_str)
    except json.JSONDecodeError as e:
        # Log the precise position
        print(f'JSON error at line {e.lineno} col {e.colno}: {e.msg}')
        return None
Code-challenge py-m09-03-code-1 — Pattern 2 setup
The quiz JSON 03-json-stdlib.json embeds a challenge:
Given a JSON string
{"users": [{"name": ..., "emails": [...]}, ...]}, return a list of all email addresses flattened into one list.
Solution skeleton (revealed after submission):
import json
def solve(json_str: str) -> list[str]:
    data = json.loads(json_str)
    return [email for user in data['users'] for email in user['emails']]
Test cases (3: 2 visible + 1 hidden):
- 2 users with 1 + 2 emails = 3 total.
- Empty users list → [].
- Hidden: a user without emails (empty inner list) → [].
This is the canonical Pattern 2: nested traversal through a double-for list comprehension. Pedagogically it illustrates: (1) json.loads returns plain Python dicts and lists, so the usual iteration patterns apply; (2) the flatten one-liner is idiomatic Python (cross-link M05 lesson 03, itertools / comprehensions).
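Since the cross-link mentions itertools, here is the same flatten written with itertools.chain.from_iterable, run against the three cases listed above. The function name solve_chain and the concrete sample inputs are assumptions made for illustration; the actual quiz inputs are not shown in the lesson.

```python
import json
from itertools import chain


def solve_chain(json_str: str) -> list[str]:
    """Same result as the comprehension, written with itertools.chain."""
    data = json.loads(json_str)
    return list(chain.from_iterable(user['emails'] for user in data['users']))


# Hypothetical inputs shaped like the three described test cases.
two_users = (
    '{"users": [{"name": "a", "emails": ["a@x.io"]},'
    ' {"name": "b", "emails": ["b@x.io", "b2@x.io"]}]}'
)
print(solve_chain(two_users))  # ['a@x.io', 'b@x.io', 'b2@x.io']
print(solve_chain('{"users": []}'))  # []
print(solve_chain('{"users": [{"name": "c", "emails": []}]}'))  # []
```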
Cross-course → Spark spark.read.json schema inference
Spark is the distributed evolution of the same idea. Spark 03/01 (DataFrame creation + schema) covers spark.read.json reading single-line or multi-line JSON files.
# Spark equivalent (Run-on-Your-Machine, NOT Pyodide-runnable)
df = (
    spark.read
    .option('multiLine', 'false')  # JSONL is the default; 'true' for pretty-printed JSON
    .json('s3://bucket/events/')
)
df.printSchema()
# Spark infers the nested struct schema from the data (samplingRatio defaults to 1.0, i.e. all records)
Difference vs stdlib:
- json.loads: parses one document.
- Spark spark.read.json: parses thousands of JSONL files in parallel, infers a schema, and returns a DataFrame (a distributed table with named, typed columns).
Bridge insight: the multiLine=false option in Spark means JSONL semantics (the streaming pattern from M09 lesson 03). Production migration path: prototype in Python with parse_jsonl(...), then scale in Spark with spark.read.json('s3://...'); the input format is the same.
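To make "schema inference" concrete without a Spark cluster, the sketch below scans JSONL lines and records which JSON types appear under each top-level key. It is a toy model written for this lesson's browser setting, not Spark's algorithm; infer_schema and the sample lines are hypothetical, and real inference also has to handle nested structs and conflicting types.

```python
import json
from collections import defaultdict


def infer_schema(lines):
    """Map each top-level key to the set of JSON type names seen for it."""
    names = {dict: 'object', list: 'array', str: 'string', bool: 'boolean',
             int: 'number', float: 'number', type(None): 'null'}
    schema = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            # Exact type lookup, so True maps to 'boolean' and not 'number'.
            schema[key].add(names[type(value)])
    return dict(schema)


sample = [
    '{"id": 1, "name": "alice"}',
    '{"id": 2, "name": null, "tags": ["dev"]}',
]
print(infer_schema(sample))
```

A key that shows more than one type name (here name is both string and null) is exactly the kind of case where an explicit schema is safer than inference.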
Cross-course → ClickHouse JSONEachRow format
ClickHouse 11/07 (FORMAT clause) describes JSONEachRow, which is semantically identical to JSONL.
INSERT INTO events FORMAT JSONEachRow
{"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}
{"timestamp": "2026-04-29T12:00:01", "user_id": 43, "event": "logout"}
ClickHouse automatically maps JSON keys to table columns; missing columns get their defaults; extra keys are ignored (or raise an error if input_format_skip_unknown_fields=0).
Three-layer bridge:
- Stdlib json.loads(line) per line: a single Python process.
- Spark spark.read.json(multiLine=False): a distributed JVM cluster.
- ClickHouse INSERT FORMAT JSONEachRow: a vectorized columnar OLAP DB.
The data format is the same (JSONL) across three execution layers. A production pipeline often combines them: producers emit JSONL → Spark batch ingest → ClickHouse storage.
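The key-to-column mapping described for JSONEachRow can be modeled in a few lines of Python. This is only an illustration of the rules stated above (defaults for missing columns, unknown keys skipped or rejected); map_row, its defaults argument, and skip_unknown are hypothetical stand-ins and not ClickHouse APIs.

```python
import json


def map_row(line: str, defaults: dict, skip_unknown: bool = True) -> dict:
    """Model how one JSONEachRow line is mapped onto table columns.

    defaults maps column name → default value (illustrative only).
    """
    record = json.loads(line)
    unknown = record.keys() - defaults.keys()
    if unknown and not skip_unknown:
        raise ValueError(f'unknown fields: {sorted(unknown)}')
    # Take each column from the record if present, else use its default.
    return {col: record.get(col, default) for col, default in defaults.items()}


columns = {'timestamp': None, 'user_id': 0, 'event': ''}
row = map_row('{"user_id": 42, "event": "login", "debug": true}', columns)
print(row)  # {'timestamp': None, 'user_id': 42, 'event': 'login'}
```

Passing skip_unknown=False makes the same line fail, which mirrors the strict setting mentioned in the text.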
Recipe: production JSON pipeline with error handling
End-to-end: parse JSON → validate → typed records → re-serialize subset.
import io
import json
from dataclasses import dataclass, asdict
from datetime import datetime
@dataclass
class Event:
    timestamp: datetime
    user_id: int
    event: str
class EventEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)
def parse_events(jsonl_str: str) -> list[Event]:
    """Parse JSONL → typed Event records. Skip malformed."""
    out: list[Event] = []
    for line in jsonl_str.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            d = json.loads(line)
            out.append(Event(
                timestamp=datetime.fromisoformat(d['timestamp']),
                user_id=int(d['user_id']),
                event=d['event'],
            ))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # TypeError covers lines that are valid JSON but not an object
            continue  # skip malformed
    return out
def emit_logins(events: list[Event]) -> str:
    """Re-serialize only login events."""
    out = io.StringIO()
    for e in events:
        if e.event == 'login':
            out.write(json.dumps(asdict(e), cls=EventEncoder))
            out.write('\n')
    return out.getvalue()
# Usage
data = '''{"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}
{"timestamp": "2026-04-29T12:00:05", "user_id": 42, "event": "logout"}
{"timestamp": "2026-04-29T12:00:10", "user_id": 43, "event": "login"}
'''
events = parse_events(data)
print(len(events)) # 3
print(emit_logins(events))
# {"timestamp": "2026-04-29T12:00:00", "user_id": 42, "event": "login"}
# {"timestamp": "2026-04-29T12:00:10", "user_id": 43, "event": "login"}
What's in the next lesson
Lesson 04: Binary formats overview (Parquet / ORC / Avro / Arrow IPC). Conceptual only: pyarrow / fastavro / orc-python are not available in the browser. It consists of a comparison matrix, a decision tree, and heavy cross-course references to the Storage Formats course (27 lessons of deep dives).
Pragmatic-DEEP principle: we do not deep-dive into the internals of the _json C extension. Stdlib json.loads is battle-tested and RFC 8259 compliant. If you need a >10x speedup, look at orjson or ujson (production), but the stdlib is already fast enough for most pipelines.