feat: unify shared-prefix Megatron execution by bradhilton · Pull Request #739 · OpenPipe/ART

bradhilton · 2026-06-25T22:01:05Z

Summary

Adds support for deeper shared prefixes and multi-LoRA slots while simplifying the code. Also adds a dev script for validation, plus tests.

What changed

- Replaced depth-specialized shared-prefix/GDN paths with one generic tree representation and execution path.
- Added pack_shared_prefixes(...) and shared-prefix tree/parser coverage from depth 0 through deeper, branching trees.
- Simplified CP/GDN planning and execution around tree metadata, parent indices, and depth buckets.
- Added direct block-mask metadata parity tests against PyTorch create_block_mask, covering full/partial block metadata rather than only mask_mod behavior.
- Kept LoRA/vLLM codec/export support aligned with the Qwen3/Qwen3.5 A3B layouts.
- Added dev/megatron_review_perf.py for Austin-focused block-mask, FlexAttention, and CP planning measurements.
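
For readers unfamiliar with the terms in the list above, here is a minimal pure-Python sketch of what a shared-prefix tree with parent indices, depth buckets, and full/partial block classification could look like. It is not the code in this PR: Segment, depth_buckets, mask_mod, classify_block, and the example tree are all hypothetical names chosen for illustration, and the attention rule (causal, restricted to a token's own root-to-leaf path) is an assumption about what shared-prefix attention means here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One node of a hypothetical shared-prefix tree (illustrative only)."""
    parent: int  # index of the parent segment, -1 for a root
    length: int  # number of packed tokens in this segment


def depths(segments):
    """Depth of each segment, assuming parents are listed before children."""
    out = []
    for seg in segments:
        out.append(0 if seg.parent < 0 else out[seg.parent] + 1)
    return out


def depth_buckets(segments):
    """Group segment indices by depth so one code path can handle any depth."""
    buckets = defaultdict(list)
    for idx, depth in enumerate(depths(segments)):
        buckets[depth].append(idx)
    return dict(buckets)


def token_segments(segments):
    """Segment index of every packed token, in packing order."""
    return [i for i, seg in enumerate(segments) for _ in range(seg.length)]


def is_ancestor_or_self(segments, ancestor, node):
    """Walk parent indices from node up to the root, looking for ancestor."""
    while node >= 0:
        if node == ancestor:
            return True
        node = segments[node].parent
    return False


def mask_mod(segments, tok_seg, q_idx, kv_idx):
    """Assumed rule: causal attention along the query's root-to-leaf path."""
    return kv_idx <= q_idx and is_ancestor_or_self(
        segments, tok_seg[kv_idx], tok_seg[q_idx]
    )


def classify_block(segments, tok_seg, q_block, kv_block, block_size):
    """Label one (q_block, kv_block) tile as 'empty', 'partial', or 'full'."""
    n = len(tok_seg)
    hits = total = 0
    for q in range(q_block * block_size, min((q_block + 1) * block_size, n)):
        for kv in range(kv_block * block_size, min((kv_block + 1) * block_size, n)):
            total += 1
            hits += mask_mod(segments, tok_seg, q, kv)
    if hits == 0:
        return "empty"
    return "full" if hits == total else "partial"


# Example tree: a 4-token root prefix, two 2-token branches, and one
# 2-token continuation under the first branch (depth 2).
SEGMENTS = [Segment(-1, 4), Segment(0, 2), Segment(0, 2), Segment(1, 2)]
TOK_SEG = token_segments(SEGMENTS)

print(depth_buckets(SEGMENTS))
print(classify_block(SEGMENTS, TOK_SEG, q_block=4, kv_block=2, block_size=2))
```

In this sketch a "full" tile is one where every query/key pair is allowed, so a kernel could skip evaluating the mask inside it, while a "partial" tile still needs the per-element check. That distinction is why comparing only mask_mod outputs would not catch a tile that is mislabeled as full or partial.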

Validation

Local on this branch:

- uv run --no-sync prek run --all-files: passed.
- uv run pytest tests/unit/test_shared_prefix_attention_builder.py tests/unit/test_shared_prefix_tree.py tests/unit/test_shared_prefix_packing.py tests/unit/test_shared_prefix_grad_parity.py tests/integration/megatron/lora/test_lora_disk_codecs.py: 30 passed, 9 skipped locally; GPU/Megatron/vLLM portions are skipped on local macOS.

4x H200 SkyPilot (codex-trainer-rank-h200) on this branch:

- Shared-prefix/block-mask/tree/grad-parity subset: 48 passed.
- vLLM LoRA disk codec suite with GPU allocation: 17 passed.
- Austin-style review perf, Qwen/Qwen3.5-35B-A3B-style packed shape, CP=4, 198k packed tokens, 2.448M logical tokens:
  - cold CP planning: 60.54ms
  - warm-cache CP planning: 1.865ms
  - full block-mask build, 7 masks: 136.503ms
  - 8192-token small reference variant: block-mask build 3.478ms, CP planning 12.916ms
In a head-to-head performance check against main, throughput was statistically indistinguishable: both ran at about 33k tok/s.
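
To put the review-perf figures above in proportion, the following sketch derives a few ratios from them. The inputs are copied from the notes ("198k" and "2.448M" are rounded there, so the ratios are approximate), and the per-mask average assumes the 136.503ms figure is the total for all seven masks. The variable names are illustrative and do not come from dev/megatron_review_perf.py.

```python
# Figures copied from the validation notes; token counts are rounded.
packed_tokens = 198_000
logical_tokens = 2_448_000
cold_plan_ms = 60.54
warm_plan_ms = 1.865
mask_build_ms = 136.503  # assumed to be the total for all masks
num_masks = 7

# How many logical tokens each packed token stands for, on average.
dedup_ratio = logical_tokens / packed_tokens
# How much faster CP planning is once the cache is warm.
cache_speedup = cold_plan_ms / warm_plan_ms
# Average block-mask build time per mask.
per_mask_ms = mask_build_ms / num_masks

print(f"logical/packed tokens: {dedup_ratio:.1f}x")      # 12.4x
print(f"warm-cache planning speedup: {cache_speedup:.1f}x")  # 32.5x
print(f"average build time per mask: {per_mask_ms:.1f} ms")  # 19.5 ms
```

Under these assumptions, prefix sharing packs roughly 12 logical tokens into each packed token, and a warm cache cuts CP planning time by a factor of about 32.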


feat: unify shared-prefix megatron execution

a2414ab

This was referenced Jun 25, 2026

feat: add TrainerRank API #740

Draft

feat: add TrainerRank #731

Closed

bradhilton added 4 commits June 25, 2026 17:56

Merge main into gdn tree core

0e85b0b

fix sparse block mask metadata parity

4afa197

use native wandb sdk types

48bdd15

fix wandb boundary test mock

46a9d61

bradhilton marked this pull request as ready for review June 26, 2026 01:49

bradhilton requested a review from FurtherAI June 26, 2026 01:50

bradhilton added 21 commits June 26, 2026 17:29

test: cover tree gdn trainability

6c05e09

test: keep tree trainability smoke small

498bcd8

test: relax tree trainability planner shape

6f5c379

test: use production dtype for tree trainability

d5b9088

test: check tree trainability by parameter update

4c2b3df

fix: allow partial lora target coverage

e944f81

fix: avoid qwen35 dense te compile crash

1c79ace

fix: disable te layernorm linear compile path

b039a28

fix: graph break qwen35 dense lora fc1

0527d76

fix: handle empty lora fc1 slices

2acd4cb

fix: normalize gemma4 shared expert lora keys

efe01ba

style: apply megatron formatting

32b14a1

test: cover fused fc1 lora sensitivity

61e4d2b

test: normalize chat template tool calls

6089a2a

test: strengthen flash lse sensitivity topology

d0d1681

fix: preserve gdn tree segment order

5474538

Revert "fix: preserve gdn tree segment order"

9470337

This reverts commit 5474538.

fix: chunk-align local tree gdn forks

eccdefb

test: cover chunk-aligned tree gdn planning

61dba04

test: run flash lse sensitivity under flash backend

e7497a4

test: use bf16 for flash lse sensitivity

6053920

bradhilton added 7 commits June 26, 2026 22:47

test: stabilize qwen35 length trainability

c72552d

test: tune qwen35 length learning rate

c104a2c

test: lower train-inf vllm memory reservation

4ebb865

fix: generalize gemma4 flex attention workaround

c39a379

fix: remove gemma4 shared expert lora rescale

c8cfaf1

fix: align gemma4 shared expert lora conversion

31a7954

test: expose gdn prefill backend for train-inf validation

e8d17f2

bradhilton force-pushed the feat/megatron-gdn-tree-core branch from 5f1a400 to e8d17f2 June 27, 2026 09:00

test: stabilize model support workflow probes

8326658

bradhilton wants to merge 34 commits into main from feat/megatron-gdn-tree-core
