feat: unify shared-prefix Megatron execution #739

Open
bradhilton wants to merge 34 commits into main from feat/megatron-gdn-tree-core

@bradhilton bradhilton commented Jun 25, 2026

Collaborator

Summary

Adds support for deeper shared prefixes and multi-LoRA slots while simplifying the code. Also adds tests and a dev script for validation.

What changed

  • Replaced depth-specialized shared-prefix/GDN paths with one generic tree representation and execution path.
  • Added pack_shared_prefixes(...) and shared-prefix tree/parser coverage for depth 0 through deeper branchy trees.
  • Simplified CP/GDN planning and execution around tree metadata, parent indices, and depth buckets.
  • Added direct block-mask metadata parity tests against PyTorch create_block_mask, including full/partial block metadata rather than only mask_mod behavior.
  • Kept LoRA/vLLM codec/export support aligned with the Qwen3/Qwen3.5 A3B layouts.
  • Added dev/megatron_review_perf.py for Austin-focused block-mask, FlexAttention, and CP planning measurements.
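For reviewers who want a mental model of the tree metadata, here is a minimal pure-Python sketch of a parent-index tree, its depth buckets, an ancestor-based causal mask, and a full/partial/empty block classification. It is an illustration under stated assumptions, not the code in this branch: `parent`, `seg_len`, and every function name below are hypothetical, and the visibility rule (a token sees its own segment and its ancestor segments, causally) is my reading of how shared-prefix attention usually works rather than something taken from the diff.

```python
from collections import defaultdict

# Hypothetical layout (not from this PR): each packed segment is a tree node.
# parent[i] is the index of node i's parent segment, or -1 for a root prefix.
parent = [-1, 0, 0, 1, 1]
seg_len = [4, 2, 3, 2, 1]  # tokens per segment, in packed order


def node_depths(parent):
    """Depth of every node; assumes each parent is listed before its children."""
    depth = []
    for p in parent:
        depth.append(0 if p < 0 else depth[p] + 1)
    return depth


def depth_buckets(parent):
    """Group node indices by depth so one level can be handled at a time."""
    buckets = defaultdict(list)
    for node, d in enumerate(node_depths(parent)):
        buckets[d].append(node)
    return dict(buckets)


def ancestors_or_self(parent, node):
    """Walk parent links from node up to its root."""
    out = set()
    while node >= 0:
        out.add(node)
        node = parent[node]
    return out


def token_segments(seg_len):
    """Map each packed token position to the node (segment) it belongs to."""
    return [node for node, n in enumerate(seg_len) for _ in range(n)]


def can_attend(parent, tok_seg, q_idx, kv_idx):
    """Assumed rule: kv is not later in the pack than q, and kv's segment is
    q's segment or one of its ancestors."""
    visible = ancestors_or_self(parent, tok_seg[q_idx])
    return kv_idx <= q_idx and tok_seg[kv_idx] in visible


def classify_blocks(parent, tok_seg, block):
    """Label each (q_block, kv_block) tile as 'full', 'partial', or 'empty'."""
    n = len(tok_seg)
    num_blocks = (n + block - 1) // block
    kinds = {}
    for qb in range(num_blocks):
        for kb in range(num_blocks):
            cells = [
                can_attend(parent, tok_seg, q, k)
                for q in range(qb * block, min(n, (qb + 1) * block))
                for k in range(kb * block, min(n, (kb + 1) * block))
            ]
            if all(cells):
                kinds[(qb, kb)] = "full"
            elif any(cells):
                kinds[(qb, kb)] = "partial"
            else:
                kinds[(qb, kb)] = "empty"
    return kinds


tok_seg = token_segments(seg_len)
print(depth_buckets(parent))
print(classify_blocks(parent, tok_seg, block=4))
```

With a block size of 4, the tile whose queries are tokens 4-7 and whose keys are the root prefix (tokens 0-3) comes out as "full", because every branch token sees the whole prefix. Tiles that straddle sibling branches come out as "partial". That distinction is the kind of metadata the parity tests compare against `create_block_mask`, beyond checking `mask_mod` alone.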

Validation

Local on this branch:

  • uv run --no-sync prek run --all-files: passed.
  • uv run pytest tests/unit/test_shared_prefix_attention_builder.py tests/unit/test_shared_prefix_tree.py tests/unit/test_shared_prefix_packing.py tests/unit/test_shared_prefix_grad_parity.py tests/integration/megatron/lora/test_lora_disk_codecs.py: 30 passed, 9 skipped (the GPU/Megatron/vLLM portions are skipped on local macOS).

4x H200 SkyPilot (codex-trainer-rank-h200) on this branch:

  • Shared-prefix/block-mask/tree/grad-parity subset: 48 passed.
  • vLLM LoRA disk codec suite with GPU allocation: 17 passed.
  • Austin-style review perf, Qwen/Qwen3.5-35B-A3B-style packed shape, CP=4, 198k packed tokens, 2.448M logical tokens:
    • cold CP planning: 60.54ms
    • warm-cache CP planning: 1.865ms
    • full block-mask build, 7 masks: 136.503ms
    • 8192-token small reference variant: block-mask build 3.478ms, CP planning 12.916ms
  • In a head-to-head performance check against main, throughput was statistically indistinguishable: both branches ran at about 33k tok/s.
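To put the H200 figures in proportion, the ratios can be derived directly from the numbers reported above. This is only arithmetic on those values: the variable names are mine, "198k" is rounded in the source so the first ratio is approximate, and reading logical/packed as a prefix-sharing factor is an assumption about how the two token counts relate.

```python
# Figures copied from the validation notes; "198k" is rounded in the source.
packed_tokens = 198_000
logical_tokens = 2_448_000
cold_plan_ms = 60.54
warm_plan_ms = 1.865
mask_build_ms = 136.503
num_masks = 7

# Logical tokens represented per packed token (assumed prefix-sharing factor).
sharing_ratio = logical_tokens / packed_tokens

# How much faster CP planning is once the cache is warm.
cache_speedup = cold_plan_ms / warm_plan_ms

# Average build time per block mask in the full build.
per_mask_ms = mask_build_ms / num_masks

print(f"logical/packed ratio: {sharing_ratio:.2f}x")
print(f"warm-cache planning speedup: {cache_speedup:.1f}x")
print(f"average block-mask build: {per_mask_ms:.2f} ms per mask")
```

Roughly: each packed token stands in for about 12 logical tokens, warm-cache planning is about 32x faster than cold, and each of the 7 masks takes about 19.5 ms to build on average.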

This was referenced Jun 25, 2026
@bradhilton bradhilton marked this pull request as ready for review June 26, 2026 01:49
@bradhilton bradhilton requested a review from FurtherAI June 26, 2026 01:50
@bradhilton bradhilton force-pushed the feat/megatron-gdn-tree-core branch from 5f1a400 to e8d17f2 Compare June 27, 2026 09:00