perf: make Node subclass bytes (C-level hash/equality) by AmitMY · Pull Request #43 · sign-language-processing/complex-tokenization

AmitMY · 2026-06-27T16:07:23Z

Answers "is there a faster counter?" — the bottleneck isn't Counter (it's already C); it's hashing/comparing the candidate node-tuples it's fed. Counter(graph.get_merges()) re-hashes them every step, and merge scans compare them. With Node a frozen dataclass, __hash__/__eq__ are Python calls.

Make a Node be its bytes:

class Node(bytes, GraphVertex):
    __str__ = GraphVertex.__str__   # bytes wins the MRO; keep our str()
    def __new__(cls, value): return super().__new__(cls, value)
    @property
    def value(self): return self[:]
    ...

Now __hash__/__eq__/__len__ are bytes' C-level operations (and tuples of bytes hash ~4× faster than tuples of dataclass instances). It's net simpler (drops the custom __eq__/__hash__/__len__).

Why this one matters for the whole pipeline: it speeds the recount path, so unlike the disconnected-only incremental work (#42) it also helps Boundless and SuperBPE — and the two compose.

Output identical (merge digests unchanged at 50/200 and 100/400), 138 tests pass, memory unchanged (nodes are interned, so the slightly larger object is negligible):

	main	this PR
BPE	0.384s / 1.33MB	0.220s / 1.33MB — 1.7×
BNE n=4	0.768s / 2.63MB	0.337s / 2.63MB — 2.3×
Boundless	0.469s / 1.42MB	0.264s / 1.42MB — 1.8×

Note: Node now compares equal to a raw bytes of the same value (bytes __eq__), but in the graph only Node/NodesSequence are ever compared, so behavior is unchanged in practice.

🤖 Generated with Claude Code

The training loop's cost is hashing and comparing candidate node-tuples: Counter(graph.get_merges()) hashes them every step, and merge scans compare them. With Node a frozen dataclass, __hash__/__eq__ were Python calls. Making Node a bytes subclass (a Node *is* its bytes) gives those operations C-level implementations — and tuples of bytes hash ~4x faster than tuples of dataclass instances. This speeds the recount path too, so unlike the disconnected-only incremental work it also helps Boundless/SuperBPE. It's net simpler: __eq__/__hash__/__len__ are inherited from bytes, and Node no longer needs a `value` field/property at all (a Node *is* its bytes) — Tree.merge just returns the token. bytes wins the MRO for the dunders we want; we keep GraphVertex.__str__ explicitly. Memory is unchanged (nodes are interned). ~1.7x BPE, ~2.3x BNE, ~1.8x Boundless; memory unchanged; output identical (digests unchanged); 138 tests pass. Note: drops the public Node.value attribute — a Node is now its own bytes, so use the node directly or bytes(node). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY force-pushed the perf/node-subclass-bytes branch from 4c4bec0 to a0d00c9 Compare June 28, 2026 04:03

AmitMY force-pushed the perf/node-subclass-bytes branch from a0d00c9 to 0e651f2 Compare June 28, 2026 04:17

AmitMY merged commit 587df0f into main Jun 28, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: make Node subclass bytes (C-level hash/equality)#43

perf: make Node subclass bytes (C-level hash/equality)#43
AmitMY merged 1 commit into
mainfrom
perf/node-subclass-bytes

AmitMY commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

AmitMY commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant