Skip to content

perf: incremental candidate counting for disconnected training#42

Open
AmitMY wants to merge 1 commit into
mainfrom
perf/incremental-counting
Open

perf: incremental candidate counting for disconnected training#42
AmitMY wants to merge 1 commit into
mainfrom
perf/incremental-counting

Conversation

@AmitMY

@AmitMY AmitMY commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

The training loop rebuilt Counter(graph.get_merges()) from scratch every step — re-counting and re-hashing every candidate in the whole forest, even the subgraphs the chosen merge never touched. Profiling showed this recount is ~70% of training time. For a disconnected forest (BPE/BNE — each merge hits only the few words containing the pair) almost all of it is wasted.

_train_incremental keeps each subgraph's candidate counts plus a running global total, and after a merge updates only the subgraphs that contained it (located via an index):

for i in list(index[nodes]):
    _index_remove(total, index, i, comp_counts[i])      # subtract old
    components[i] = components[i].merge(token, nodes)
    comp_counts[i] = Counter(components[i].get_merges())
    _index_add(total, index, i, comp_counts[i])          # add new

Correctness: total is summed in subgraph order, so picking the first max-score candidate reproduces max(Counter(graph.get_merges()), key=score) exactly, including tie-breaks. Output is byte-identical — merge digests unchanged, 138 tests pass.

Gated to the disconnected case (Tokenizer passes incremental=not connected); connected / draw / verbose keep the plain recount loop, so Boundless and SuperBPE are unchanged (incremental doesn't help when a merge touches most of the few large components).

main this PR
BPE 0.384s / 1.33MB 0.091s / 4.21MB — 4.2×, +mem
BNE n=4 0.768s / 2.63MB 0.205s / 8.93MB — 3.7×, +mem
Boundless 0.469s / 1.42MB 0.464s / 1.42MB — unchanged

Trade-off: the persistent per-subgraph count state (comp_counts + index + total) raises peak memory ~3× for the disconnected tokenizers — a deliberate speed-for-memory choice for the dominant BPE/BNE path.

🤖 Generated with Claude Code

The training loop rebuilt Counter(graph.get_merges()) from scratch every
step — recounting the whole forest, even the subgraphs the chosen merge
never touched. For a disconnected forest (BPE/BNE, where each merge hits
only the few words containing the pair) almost all of it is wasted.

_train_incremental keeps each subgraph's candidate counts plus a running
global total, and after a merge updates only the subgraphs that contained
it (located via an index). total is summed in subgraph order, so picking
the first max-score candidate reproduces max(Counter(...), key=score)
exactly, including tie-breaks — output is byte-identical.

Gated to the disconnected case (Tokenizer passes incremental=not connected);
connected/draw/verbose keep the plain recount loop, so Boundless/SuperBPE
are unchanged. The two loops share _steps() for the progress scaffolding;
their cores differ (Counter-rebuild vs total+index update). Trade-off: the
persistent count state raises peak memory.

Stacked on the bytes-Node change: BPE ~6.6x, BNE ~7.4x over the original
baseline. 138 tests pass; digests identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AmitMY AmitMY force-pushed the perf/incremental-counting branch from 6627325 to 25b3773 Compare June 28, 2026 04:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant