perf: incremental candidate counting for disconnected training by AmitMY · Pull Request #42 · sign-language-processing/complex-tokenization

AmitMY · 2026-06-27T08:29:30Z

The training loop rebuilt Counter(graph.get_merges()) from scratch every step — re-counting and re-hashing every candidate in the whole forest, even the subgraphs the chosen merge never touched. Profiling showed this recount is ~70% of training time. For a disconnected forest (BPE/BNE — each merge hits only the few words containing the pair) almost all of it is wasted.

_train_incremental keeps each subgraph's candidate counts plus a running global total, and after a merge updates only the subgraphs that contained it (located via an index):

for i in list(index[nodes]):
    _index_remove(total, index, i, comp_counts[i])      # subtract old
    components[i] = components[i].merge(token, nodes)
    comp_counts[i] = Counter(components[i].get_merges())
    _index_add(total, index, i, comp_counts[i])          # add new

Correctness: total is summed in subgraph order, so picking the first max-score candidate reproduces max(Counter(graph.get_merges()), key=score) exactly, including tie-breaks. Output is byte-identical — merge digests unchanged, 138 tests pass.

Gated to the disconnected case (Tokenizer passes incremental=not connected); connected / draw / verbose keep the plain recount loop, so Boundless and SuperBPE are unchanged (incremental doesn't help when a merge touches most of the few large components).

	main	this PR
BPE	0.384s / 1.33MB	0.091s / 4.21MB — 4.2×, +mem
BNE n=4	0.768s / 2.63MB	0.205s / 8.93MB — 3.7×, +mem
Boundless	0.469s / 1.42MB	0.464s / 1.42MB — unchanged

Trade-off: the persistent per-subgraph count state (comp_counts + index + total) raises peak memory ~3× for the disconnected tokenizers — a deliberate speed-for-memory choice for the dominant BPE/BNE path.

🤖 Generated with Claude Code

The training loop rebuilt Counter(graph.get_merges()) from scratch every step — recounting the whole forest, even the subgraphs the chosen merge never touched. For a disconnected forest (BPE/BNE, where each merge hits only the few words containing the pair) almost all of it is wasted. _train_incremental keeps each subgraph's candidate counts plus a running global total, and after a merge updates only the subgraphs that contained it (located via an index). total is summed in subgraph order, so picking the first max-score candidate reproduces max(Counter(...), key=score) exactly, including tie-breaks — output is byte-identical. Gated to the disconnected case (Tokenizer passes incremental=not connected); connected/draw/verbose keep the plain recount loop, so Boundless/SuperBPE are unchanged. The two loops share _steps() for the progress scaffolding; their cores differ (Counter-rebuild vs total+index update). Trade-off: the persistent count state raises peak memory. Stacked on the bytes-Node change: BPE ~6.6x, BNE ~7.4x over the original baseline. 138 tests pass; digests identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY mentioned this pull request Jun 27, 2026

perf: make Node subclass bytes (C-level hash/equality) #43

Merged

AmitMY force-pushed the perf/incremental-counting branch from c92c053 to 6627325 Compare June 28, 2026 04:50

AmitMY mentioned this pull request Jun 28, 2026

chore: tidy Trainer.train loop + drop __main__ dead code #44

Merged

AmitMY force-pushed the perf/incremental-counting branch from 6627325 to 25b3773 Compare June 28, 2026 04:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: incremental candidate counting for disconnected training#42

perf: incremental candidate counting for disconnected training#42
AmitMY wants to merge 1 commit into
mainfrom
perf/incremental-counting

AmitMY commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

AmitMY commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant