dflash: enable Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-122B-A10B, Qwen3-Coder-Next, Gemma 4 31B on Vulkan #79
Verified full Vulkan build on native AMD hardware:
- AMD Ryzen AI MAX+ 395 / Radeon 8060S (RADV GFX1151, Mesa 25.0.7)
- GCC 14.2.0, CMake 3.31.6, Vulkan 1.4.309, glslc
- GGML_VULKAN=ON, GGML_NATIVE=ON, Release mode
- 100% build success, all binaries produced, working tree clean
Full Vulkan build verified on a second host after supplying glslc and Vulkan-Headers 1.4.309 locally (Debian 11 ships neither):
- Debian 11 (bullseye), Linux 5.10.0-43-amd64
- AMD Ryzen 9 5950X (32 threads), 62 GiB RAM
- NVIDIA GeForce RTX 3090 (host)
- GCC 10.2.1, CMake 3.31.11, Ninja 1.10.1
- glslc: shaderc v2023.2 (built from source -> ~/.local/bin)
- Vulkan headers 1.4.309 (KhronosGroup Vulkan-Headers -> ~/.local)
- GGML_VULKAN=ON, GGML_NATIVE=ON, Release, -j32
- Full build rc=0, all binaries produced, test-dflash-plumbing rc=0
- 0 errors; only benign -Wdouble-promotion/-Wmissing-field-initializers warnings

No PR source changes required; host-side additions only.
ecc90d2 to
648c0e9
Detail 1: CUDA results

If CUDA is available, the recommendation is simple: use CUDA, not Vulkan. On RTX 3090, CUDA remains faster than Vulkan for Qwen3.6:

| Backend | No drafter | DFlash ring=0 | DFlash ring=1 |
|---|---|---|---|
| CUDA | 40.56 tok/s | 54.36 tok/s | 34.46 tok/s |
| Vulkan | 32.13 tok/s | 28.24 tok/s | 28.14 tok/s |

Implications:
Recommended CUDA dense config:

So for CUDA systems: use CUDA. For dense Qwen3.6, use DFlash with the GPU ring disabled / CPU hidden capture. Do not use the DFlash GPU ring yet.
…97d)

Targeted sync of the 11 DFlash code files onto PR Anbeeld#79 (no investigation docs/scripts); PR-79 VULKAN_BUILD_VERIFICATION.md preserved. Brings the llama-context.cpp sched/CPU-TOPK-fallback that makes reduced-verify argmax correct on Vulkan (Vulkan has no general GGML_OP_TOPK). Flawed TOPK probe NOT included; 648c0e9 range-check + fail-closed streak kept. Diagnostics guarded behind #ifdef GGML_DFLASH_DEBUG.

Verified on AMD Strix Halo (RADV, 96GB UMA), Vulkan Release, greedy, -c4096 -b1024 -ub512, draft-n-max 4, cross-ctx 1024, 256 tok:
- Qwen3.6-27B (dense): base 12.42 -> ring0 23.38 -> ring1 19.63 tok/s; ring0/ring1 coherent (PR-79 garbage divergence FIXED).
- Qwen3-Coder-Next (MoE): base 52.95 -> ring0 33.76 -> ring1 47.34 tok/s; ring1 no crash (PR-79 GGML_ASSERT crash FIXED).
- Qwen3.5-122B-A10B (MoE): base 21.98 -> ring0 16.23 -> ring1 18.80 tok/s; ring1 still repetition-broken (open, ring0 coherent).
Detail 2: Vulkan-only results

For Vulkan-only systems, this PR is valuable, especially for dense Qwen3.6 on AMD UMA systems. On Strix Halo / RADV GFX1151 (tok/s, with speedup over the no-drafter baseline):

| Model | No drafter | DFlash ring=0 | DFlash ring=1 |
|---|---|---|---|
| Qwen3.6-27B dense | 12.45 | 28.26 / 2.27× | 29.43 / 2.36× |
| Qwen3-Coder-Next MoE | 52.86 | 34.32 / 0.65× | 47.30 / 0.89× |
| Qwen3.5-122B-A10B MoE | 21.96 | 16.18 / 0.74× | 18.78 / 0.85×, broken repetition |

Implications:
Recommended Vulkan dense config:

Recommended Vulkan MoE config:

unless doing experiments.
Detail 3: Policy recommendation and PR implication

CUDA available

| Model type | Recommendation |
|---|---|
| Dense Qwen3.6 | CUDA + DFlash, GGML_DFLASH_GPU_RING=0 |
| MoE | CUDA baseline unless separately proven |
| Vulkan | Avoid if CUDA is available |

Overall, this PR turns Vulkan DFlash from a mostly diagnostic / fragile path into a usable path for the main dense Qwen3.6 DFlash target, especially on AMD UMA hardware. The Vulkan GPU ring is not universally faster, but on Strix Halo it provides the best dense-model result and avoids the CPU hidden-capture penalty.

This PR does not make DFlash universally beneficial for MoE models. For MoE, the drafter overhead and/or acceptance behavior still loses to the no-drafter baseline, and the 122B
Remove the n_rs_seq=0 clamp for LLM_ARCH_QWEN35MOE that was added by b9e118f alongside the GPU cross-ring.

The clamp forces checkpoint-restore for draft rejection (n_rs_seq=0), but under the GPU cross-ring (GGML_DFLASH_GPU_RING=1) the target's DeltaNet recurrent-state cells desync from committed token positions (dozens of "non-consecutive token position" warnings in llama-memory-recurrent.cpp) -> wrong drafts accepted -> repetition (rep=12) for Qwen3.5-122B-A10B.

n_rs_seq is DFlash-specific (common.cpp need_n_rs_seq() == draft.n_max, e.g. 4), so this only affects the DFlash target; the non-DFlash baseline keeps n_rs_seq=0 regardless. Keeping n_rs_seq=4 (RS partial rollback) makes both ring=0 (CPU capture) and ring=1 (GPU ring) coherent (rep=0, zero desync warnings), and the GPU ring beats ring=0 (~16.0 vs ~14.6 tok/s on AMD Strix Halo / Vulkan).

The earlier "n_rs_seq>0 -> garbage" note (commit 9d09310cc, DeltaNet RS-rollback snapshot bug under split_equal) is not reproduced in the single-sequence DFlash path (-np 1); split_equal multi-sequence concerns should be fixed at the DeltaNet snapshot logic, not by clamping n_rs_seq=0 here (which breaks the GPU cross-ring).

Verified on AMD Strix Halo (RADV GFX1151, 96GB UMA), Vulkan Release, greedy, -c4096 -b1024 -ub512, draft-n-max 4, cross-ctx 1024, 128-token decode:
- Qwen3.5-122B-A10B ring=0: 11.32 tok/s rep=0 (was coherent, still coherent)
- Qwen3.5-122B-A10B ring=1: 15.96 tok/s rep=0 (was rep=12 broken -> FIXED)
- Qwen3.6-27B ring=1: 26.07 tok/s rep=0 (no regression; n_rs_seq=8 untouched)
- Qwen3-Coder-Next ring=1: 44.61 tok/s rep=0 (no crash, no regression)
Vulkan DFlash comparison: main vs PR fixed

Strix Halo / Vulkan results. Gemma 4 31B was tested with chat template,

Can someone verify and run on more hardware?

now fixed on PR
…ic-sampling corruption)
build_recurrent_attn (delta-net-base.cpp) keep-path (n_rs_seq>0) wrote all K
gdn_out snapshot slots into ssm_states_all every step. For a full verify batch
(n_seq_tokens>=K) every gdn_out slot is populated, but for a decode / short batch
(n_seq_tokens<K) the gated_delta_net kernel only writes the last
min(n_seq_tokens,K) slots; the leading gdn_out slots are uninitialised. The old
loop wrote that garbage into state slots 1..K-1 on every decode step.
Under DFlash with stochastic sampling, the adaptive draft-max controller drives
spec depth to 0 once acceptance drops, so every step becomes a 1-token decode ->
snapshot slots 1..K-1 stay garbage -> repeated draft-rejection rollbacks read
stale recurrent state -> output degrades to multilingual gibberish (0.000
acceptance, HTTP 500 parse error). Reproduced on Qwen3.6-35B-A3B (qwen35moe,
n_rs_seq=4). Greedy / high-acceptance runs were unaffected because verify
batches (n_seq_tokens=K) overwrite the corruption.
Fix: write only the kernel-populated gdn_out slots (newest n_pop states -> state
slots 0..n_pop-1) and rotate the older snapshots down by n_pop slots
(old slot s -> slot s+n_pop) so the recurrent-state history stays correct for
rollback. Iterating k_i ascending reads each old slot before it is overwritten,
so the in-place shift is safe (copies to the same persistent buffer are
serialised by the scheduler). For n_seq_tokens>=K the behaviour is unchanged.
Verified on AMD Strix Halo / Vulkan (single-sequence DFlash, -np 1):
Qwen3.6-35B-A3B (qwen35moe, n_rs_seq=4): stochastic 0.000->0.711 acc, rep=0
(was 500 parse-error gibberish; greedy 0.788 acc unchanged)
Qwen3.5-122B-A10B (qwen35moe, n_rs_seq=4): greedy ring=1 16.7 tok/s rep=0
(no regression); stochastic ring=1 now coherent 0.209 acc (bonus)
Qwen3.6-27B (qwen35 dense, n_rs_seq=8): greedy 0.771 / stochastic 0.519 acc,
rep=0 (no regression, keep path)
Qwen3-Coder-Next (qwen3next, n_rs_seq=0, !keep path): unaffected, rep=0
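The slot mapping described in this commit message can be sketched in a few lines of Python. This is a toy model, not the code in delta-net-base.cpp: the names `writeback_plan` and `apply_plan` are hypothetical, and the ordering (state slot 0 holds the newest state and is fed by the last gdn_out slot) is inferred from the description above, where a 1-token decode left garbage in state slots 1..K-1. The sketch builds a new history list instead of shifting in place, so it makes no claim about scheduler ordering.

```python
from typing import List, NamedTuple


class Writeback(NamedTuple):
    cache_slot: int  # destination slot in the persistent snapshot history
    from_gdn: bool   # True: fresh kernel output, False: shifted old snapshot
    src_slot: int    # gdn_out slot if from_gdn, otherwise the old cache slot


def writeback_plan(k_slots: int, n_seq_tokens: int) -> List[Writeback]:
    """Decide, per state slot, where its new content comes from."""
    # The kernel only populates the last n_pop gdn_out slots.
    n_pop = min(n_seq_tokens, k_slots)
    plan = []
    for cache_slot in range(k_slots):
        if cache_slot < n_pop:
            # Assumed ordering: newest state sits in the last gdn_out slot.
            plan.append(Writeback(cache_slot, True, k_slots - 1 - cache_slot))
        else:
            # Older snapshots rotate down by n_pop slots (old s -> s + n_pop).
            plan.append(Writeback(cache_slot, False, cache_slot - n_pop))
    return plan


def apply_plan(old_state: list, gdn_out: list, plan: List[Writeback]) -> list:
    """Build the new history from the old one; every read sees pre-update data."""
    return [gdn_out[w.src_slot] if w.from_gdn else old_state[w.src_slot]
            for w in plan]


if __name__ == "__main__":
    K = 4
    history = ["s0", "s1", "s2", "s3"]          # s0 is the newest snapshot
    decode_out = [None, None, None, "new"]      # 1-token decode: one valid slot
    print(apply_plan(history, decode_out, writeback_plan(K, 1)))
```

With K=4 and a 1-token decode, no uninitialised (`None`) entry reaches the history, and for n_seq_tokens >= K every slot is taken from gdn_out, which matches the "behaviour is unchanged" statement for full verify batches.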
Post-fix performance

PR fixed

Interpretation

Bottom line

This fix turns Qwen3.6-35B-A3B DFlash from broken under stochastic sampling into a correct 1.48× speedup with about … It also preserves correctness and speed across Qwen3.6-27B, Qwen3.5-122B-A10B, and Qwen3-Coder-Next, while making the 122B stochastic path coherent as a bonus.

Tested on the Strix Halo (AMD Ryzen AI MAX+ 395) platform.
Rename enable-dflash-qwen3-coder-next-on-vulkan.md -> enable-and-fix-dflash-on-vulkan-various-models.md and expand it from the Coder-Next-only enablement rationale to the full multi-model enable+fix effort: Coder-Next enablement, Qwen3.6-27B (dense, works), Qwen3.5-122B-A10B GPU cross-ring n_rs_seq fix, Qwen3.6-35B-A3B DeltaNet RS snapshot-rotation fix, and Gemma 4 31B status. Adds the post-fix performance/correctness matrix and the gdb root-cause evidence for the 35B-A3B stochastic-sampling corruption.
Re-verify the Vulkan build on AMD Strix Halo against the current PR head, which layers the Coder-Next enablement, the vulkan-dflash-cross-ring sync, the 122B n_rs_seq fix, and the 35B-A3B snapshot-rotation fix. Fresh out-of-source configure + build: 0 errors, 50 benign warnings (no -Werror), all key binaries, runtime DFlash sanity across all tested models. Adds a PR-head-history table.
Add unit-testable regression coverage for the PR Anbeeld#79 DFlash behavior changes that are reachable without a full model:
- 122B fix: llm_arch_supports_rs_rollback gate keeps QWEN35MOE on RS partial rollback (test_llm_arch_supports_rs_rollback).
- DFlash enable: common_params_speculative::need_n_rs_seq() returns draft.n_max for DFLASH (test_need_n_rs_seq).
- 35B-A3B fix: RS snapshot write-back slot mapping (rotation for n_seq_tokens<K) (test_rs_writeback_slot_mapping) and the gated_delta_net kernel snapshot contract (only the last min(n_seq_tokens,K) output slots are written; leading slots untouched) (test_gated_delta_net_snapshot_contract, CPU).
- Coverage gap: the existing pure _for_test helpers in llama-context.h (capture_tokens_per_seq, replay_gdn_supported_s, replay_state_shape_valid, view_span_in_bounds, suppress_callback_for_prefill_ubatch, prefill_plan_needs_staging) now have direct unit tests.

Minimal production change (behavior-identical): extract the build_recurrent_attn RS write-back slot decision into a pure static-inline helper llama_dflash_rs_writeback_slot_for_test() in llama-context.h, following the existing _for_test pattern (e.g. llama_dflash_replay_gdn_supported_s_for_test, which production already calls). build_recurrent_attn now calls the helper in its write-back loop so the production path and the unit test share one source of truth. The loop produces the same cache_slot / from_gdn / gdn_slot / old_slot as before.

Guard verification: each new test was confirmed to FAIL when its fix is reverted (35B-A3B rotation -> pre-fix bug; 122B QWEN35MOE -> off; need_n_rs_seq DFLASH -> off), and PASS on the fixed code.

Rebenchmark (Vulkan, all rep=0, no crash): q36 r1 23.76 tok/s, 35b r1 greedy 60.81, 35b r1 stoch 66.86 (original failing case now coherent C++), cn r1 47.48, 122b r1 16.75. Output-token differences vs the pre-refactor binary are FP recompilation noise (a trivial comment in delta-net-base.cpp reproduces the same divergence).
RTX 3090 PR performance: main vs PR fixed

Benchmarked latest main

Metric format:

Summary table

Per-model conclusions

Qwen3.6 27B

Benefits: CUDA

On CUDA, PR

However, CUDA

On Vulkan, DFlash does not benefit this model on RTX 3090. PR

Vulkan baseline is

Recommendation: use CUDA

Qwen3.6 35B-A3B

Benefits: correctness improves, but performance does not beat baseline.

The PR fixes acceptance/correctness for DFlash paths:

But DFlash is still slower than the no-drafter baseline:

CUDA

Recommendation: PR is valuable for correctness, but DFlash is not a performance win for this model on RTX 3090. Use baseline unless specifically testing DFlash correctness.

Gemma 4 31B

Benefits: CUDA

CUDA

CUDA

On Vulkan, main had a DFlash speedup around

Recommendation: use CUDA

Overall conclusion

On RTX 3090, the PR is not a universal performance win.

Clear benefits:

Clear non-benefits / regressions:

Policy for RTX 3090 from this data:
Main vs PR #79

| Model | Case | Main | PR #79 | PR DFlash vs PR baseline | Conclusion |
|---|---|---|---|---|---|
| Qwen3.6-27B qwen35 dense | base | 124.5 / 12.45 / — | 124.2 / 12.42 / — | — | Baseline parity |
| | DFlash r0 | 117.2 / 20.84 / 0.667 | 117.3 / 23.26 / 0.798 | 1.87× | ✅ Strong PR win |
| | DFlash r1 | 117.6 / 21.02 / 0.667 | 118.5 / 23.34 / 0.798 | 1.88× | ✅ Strong PR win; r0≈r1 |
| Qwen3.6-35B-A3B qwen35moe | base | 227.7 / 52.03 / — | 219.7 / 51.65 / — | — | Baseline parity |
| | DFlash r0 | 208.5 / 39.68 / 0.036 | 207.9 / 46.25 / 0.656 | 0.90× | ✅ Fixed correctness, ❌ not faster |
| | DFlash r1 | 213.8 / 40.04 / 0.036 | 214.5 / 46.33 / 0.658 | 0.90× | ✅ Fixed correctness, ❌ not faster |
| Qwen3-Coder-Next qwen3next MoE | base | 160.7 / 53.36 / — | 162.0 / 53.13 / — | — | Baseline parity |
| | DFlash r0 | CRASH | 150.4 / 31.97 / 0.614 | 0.60× | ✅ Crash fixed, ❌ slower |
| | DFlash r1 | CRASH | 151.4 / 35.69 / 0.614 | 0.67× | ✅ Crash fixed, ❌ slower |
| Qwen3.5-122B-A10B qwen35moe | base | 75.0 / 13.11 / — | 74.9 / 22.05 / — | — | |
| | DFlash r0 | 72.8 / 16.94 / 0.018 | 72.5 / 17.66 / 0.055 | 0.80× | ❌ Slower than PR baseline |
| | DFlash r1 | 72.1 / 18.54 / 0.018 | 71.7 / 18.39 / 0.055 | 0.83× | ❌ Slower than PR baseline |
| Gemma 4 31B gemma4 | base | 118.2 / 11.76 / — | 117.3 / 11.73 / — | — | Baseline parity |
| | DFlash r0 | 115.5 / 10.97 / 0.087 | 114.8 / 27.31 / 0.716 | 2.33× | ✅ Major PR win under greedy |
| | DFlash r1 | 115.4 / 10.92 / 0.081 | 115.6 / 27.40 / 0.716 | 2.33× | ✅ Major PR win under greedy; r0≈r1 |
"own own own" repetition, identical on main baseline. This inflates DFlash acceptance. Throughput is still a valid greedy-generation measurement; for coherent Gemma 4 output, use stochastic + --reasoning on.
Findings
- No crashes on PR #79. Coder-Next DFlash crash is fixed.
- Qwen3.6-27B: clear DFlash win on PR, about 1.88× over PR baseline.
- Gemma 4 31B: transformed under greedy, from broken/slow DFlash on main to 2.33× over PR baseline.
- Qwen3.6-35B-A3B: acceptance fixed, but DFlash remains slower than baseline: about 0.90×.
- Qwen3-Coder-Next: crash fixed, but DFlash is slower than baseline: 0.60–0.67×.
- Qwen3.5-122B-A10B: PR baseline itself improves strongly, 22.05 vs 13.11 tok/s; DFlash is slower than that new baseline.
- ring=0 ≈ ring=1 across the PR results; the restructure appears to narrow the ring-mode difference.
Bottom line
PR #79 fixes the broken Vulkan DFlash paths and gives strong wins for Qwen3.6-27B and Gemma 4 31B on Strix Halo.
For the larger / cheap-active MoE models, DFlash is now mostly correct but not a speed win; baseline remains better for Qwen3.6-35B-A3B, Qwen3-Coder-Next, and Qwen3.5-122B-A10B.
256K context stress test on Strix Halo

Run complete: all 20 runs passed, with no crash and no OOM. The crash guard holds at

Setup:
Conclusions
Bottom line: this is a strong stability result. The 256K stress path survives all tested models and stochastic cases. Performance benefit is model-dependent: clear win for Qwen3.6-27B, mostly stability/correctness value for the larger MoE cases.
…ode fails)

The drafter per-slot ctx defaults to 256 and draft batches use absolute positions on the target's timeline (draft_pos_base = committed_len), so the oldest accepted-prefix cells must be evicted once committed_len grows past the window, or every later flat/tree draft fails slot allocation (LLAMA_MEMORY_STATUS_FAILED_PREPARE -> llama_decode returns 1).

trim_drafter_prefix_window() was only reachable through update_drafter_kv_cache(), whose guard `if (!gpu_ring_handle || n_written <= 0) return;` skipped the entire function, including the slide, whenever the GPU cross-ring was disabled (GGML_DFLASH_GPU_RING=0, the CPU hidden-capture path). With the slide gone, the 256-cell drafter KV filled at committed_len ~240 and every subsequent draft failed; the align_drafter_seq_or_clear fallback then wiped the KV and drafting resumed, producing periodic "drafter decode failed with 1" spam during long reasoning.

The slide is a pure KV-cache operation with no ring dependency, and the DFlash drafter's self-attention is block-local (non-causal over the query block) plus cross-attention to target hiddens, so evicting old accepted-prefix cells is safe: that history is never attended. Move the trim above the gpu_ring_handle guard so it runs in both paths; the GPU-ring-specific updates stay gated.

Verified on RTX 3090 (Qwen3.6-27B + DFlash-Q4_K_M, GGML_DFLASH_GPU_RING=0): 11 "drafter decode failed with 1" -> 0 across a 5k-token reasoning run, ~70% draft acceptance and ~51 t/s retained at the default 256 ctx (no --spec-draft-ctx-size bump needed). The GPU-ring-on path is logically unchanged (re-tested: 0 decode/slide errors, same acceptance).
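The effect of moving the trim above the guard can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `DrafterKV`, `update_old`, `update_fixed`, and `run` are hypothetical names, the cache is reduced to a cell counter, and the keep size (ctx minus the draft length) and the 3 accepted tokens per step are illustrative values rather than numbers from the PR.

```python
CTX = 256            # default drafter per-slot context from the commit message
N_DRAFT = 4          # draft length used in the benchmarks above
KEEP = CTX - N_DRAFT  # hypothetical number of prefix cells kept after a trim


class DrafterKV:
    """Counts occupied drafter KV cells; a stand-in for the real cache."""

    def __init__(self, ctx: int = CTX) -> None:
        self.ctx = ctx
        self.used = 0
        self.failures = 0

    def try_draft(self, n_draft: int) -> bool:
        if self.used + n_draft > self.ctx:
            # Slot allocation fails; the fallback wipes the drafter KV.
            self.failures += 1
            self.used = 0
            return False
        return True

    def commit(self, n_accepted: int) -> None:
        self.used += n_accepted

    def trim(self, keep: int) -> None:
        # Evict the oldest accepted-prefix cells beyond the window.
        self.used = min(self.used, keep)


def update_old(kv: DrafterKV, gpu_ring: bool, n_written: int) -> None:
    if not gpu_ring or n_written <= 0:
        return  # early return also skips the trim (the bug described above)
    kv.trim(KEEP)


def update_fixed(kv: DrafterKV, gpu_ring: bool, n_written: int) -> None:
    kv.trim(KEEP)  # pure KV operation, runs in both paths
    if not gpu_ring or n_written <= 0:
        return
    # GPU-ring-specific updates would stay gated here.


def run(update, gpu_ring: bool, steps: int = 2000, accepted: int = 3) -> int:
    kv = DrafterKV()
    for _ in range(steps):
        if kv.try_draft(N_DRAFT):
            kv.commit(accepted)
        update(kv, gpu_ring, accepted)
    return kv.failures


if __name__ == "__main__":
    print("old, ring off:", run(update_old, gpu_ring=False))
    print("fixed, ring off:", run(update_fixed, gpu_ring=False))
```

In this model the old ordering only fails with the ring disabled, and the failures recur periodically after each wipe, which mirrors the periodic "drafter decode failed with 1" pattern reported above.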
256K Regression Report: DFlash Slide Fix Verification

Commit:

Test configuration

Results

Correctness

Gemma 4 note: the

Fix verification

No
| Model | Baseline | DFlash ring=1 | Ratio | Verdict |
|---|---|---|---|---|
| Qwen3.6-27B | 12.36 | 21.42 | 1.73× | Strong win |
| 35B-A3B | 42.80 | 54.97 | 1.28× | Solid win |
| Coder-Next | 53.22 | 46.68 | 0.88× | Expected; cheap active decode |
| Gemma 4 31B | 11.33 | 10.83 | 0.96× | Near baseline |
| 122B-A10B | 19.04 | 13.22 | 0.69× | Expected; weak drafter |
DFlash wins on dense Qwen3.6 and small-active MoE 35B-A3B. It is slower on cheap-active MoE models such as Coder-Next and 122B-A10B, where drafter overhead exceeds the speedup from low acceptance.
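As a quick check of the Ratio column, the arithmetic is just DFlash ring=1 throughput divided by baseline throughput. The sketch below uses only the tok/s values from the table above; the names `RESULTS` and `speedup` are illustrative.

```python
# (baseline tok/s, DFlash ring=1 tok/s), copied from the table above
RESULTS = {
    "Qwen3.6-27B": (12.36, 21.42),
    "35B-A3B": (42.80, 54.97),
    "Coder-Next": (53.22, 46.68),
    "Gemma 4 31B": (11.33, 10.83),
    "122B-A10B": (19.04, 13.22),
}


def speedup(baseline: float, dflash: float) -> float:
    """Ratio above 1.0 means DFlash is faster than the no-drafter baseline."""
    return dflash / baseline


if __name__ == "__main__":
    for model, (base, dflash) in RESULTS.items():
        ratio = speedup(base, dflash)
        verdict = "faster" if ratio > 1.0 else "slower"
        print(f"{model}: {ratio:.2f}x ({verdict})")
```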
Expected warnings
These warnings are expected and are not test failures:
- DFlash reduced verify failed (no argmax output, top_k=N): Vulkan lacks GGML_OP_TOPK for k > 1, so it falls back to full-logits rejection sampling.
- discarding 2 draft tokens and recovering: reduced-verify fallback recovery. Happens once per session.
- clearing stale drafter KV for seq=0: this is the fix in action, the sliding-window trim of drafter KV.
- target/draft SWA window mismatch: Gemma 4 target SWA window is 1024 while drafter SWA window is 2048. Warning only.
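When scanning server logs from a stress run, it can help to separate these expected warnings from the two messages this report treats as real failures ("drafter decode failed with 1" and "failed to slide"). The following is a minimal sketch that assumes plain substring matching is enough; the function names and the sample log lines in the demo are made up for illustration and are not actual llama-server output.

```python
from typing import Dict, Iterable

# Substrings of warnings listed above as expected.
EXPECTED = (
    "DFlash reduced verify failed",
    "draft tokens and recovering",
    "clearing stale drafter KV",
    "target/draft SWA window mismatch",
)

# Substrings this report treats as failures.
FAILURES = (
    "drafter decode failed with",
    "failed to slide",
)


def classify(line: str) -> str:
    """Return 'failure', 'expected', or 'other' for one log line."""
    if any(marker in line for marker in FAILURES):
        return "failure"
    if any(marker in line for marker in EXPECTED):
        return "expected"
    return "other"


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    counts = {"failure": 0, "expected": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts


if __name__ == "__main__":
    sample = [  # illustrative lines only
        "W clearing stale drafter KV for seq=0",
        "W target/draft SWA window mismatch",
        "I slot released",
    ]
    print(summarize(sample))
```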
Commands to reproduce
All commands run from:
/crypt/tmp/worktrees/pr-79-sync

Baseline, no DFlash
./build-vulkan/bin/llama-server -m <MODEL> \
-np 1 -ngl all -c 262144 -b 1024 -ub 512 \
--flash-attn on --reasoning off \
--cache-type-k q4_0 --cache-type-v q4_1 \
--no-cache-idle-slots --no-cache-prompt \
--device Vulkan0 --port 18080

DFlash ring=0, CPU capture
GGML_DFLASH_GPU_RING=0 ./build-vulkan/bin/llama-server -m <MODEL> \
-np 1 -ngl all -c 262144 -b 1024 -ub 512 \
--flash-attn on --reasoning off \
--cache-type-k q4_0 --cache-type-v q4_1 \
--no-cache-idle-slots --no-cache-prompt \
--device Vulkan0 \
--spec-type dflash --spec-draft-model <DRAFT> \
--spec-draft-n-max 4 --spec-dflash-cross-ctx 1024 --spec-branch-budget 0 \
--spec-draft-ngl all --spec-draft-type-k q4_0 --spec-draft-type-v q4_1 \
--spec-draft-device Vulkan0 \
--port 18080

DFlash ring=1, GPU ring
./build-vulkan/bin/llama-server -m <MODEL> \
-np 1 -ngl all -c 262144 -b 1024 -ub 512 \
--flash-attn on --reasoning off \
--cache-type-k q4_0 --cache-type-v q4_1 \
--no-cache-idle-slots --no-cache-prompt \
--device Vulkan0 \
--spec-type dflash --spec-draft-model <DRAFT> \
--spec-draft-n-max 4 --spec-dflash-cross-ctx 1024 --spec-branch-budget 0 \
--spec-draft-ngl all --spec-draft-type-k q4_0 --spec-draft-type-v q4_1 \
--spec-draft-device Vulkan0 \
--port 18080

Stochastic variant
Append this to the ring=1 command:
--temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --seed 42

Request
curl -s http://localhost:18080/completion \
-H "Content-Type: application/json" \
-d @/tmp/test_payload.json

Greedy payload:
{
"prompt": "<hard_prompt.txt content>",
"n_predict": 512,
"temperature": 0,
"repeat_penalty": 1.0
}

Stochastic payload:
{
"prompt": "<hard_prompt.txt content>",
"n_predict": 512,
"temperature": 1.0,
"top_k": 64,
"top_p": 0.95,
"min_p": 0.0,
"seed": 42,
"repeat_penalty": 1.0
}

Model paths
- Qwen3.6-27B
  Target: /crypt/models/Qwen3.6-27B-Q4_K_M.gguf
  Draft: /crypt/models/Qwen3.6-27B-DFlash-Q4_K_M.gguf
- Qwen3-Coder-Next
  Target: /crypt/models/Qwen3-Coder-Next-Q4_K_M.gguf
  Draft: /crypt/models/qwen3-coder-next-dflash-q4_k_m.gguf
- Qwen3.5-122B-A10B
  Target: /crypt/models/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf
  Draft: /crypt/models/qwen35-122b-a10b-dflash-Q4_K_M.gguf
- Qwen3.6-35B-A3B
  Target: /crypt/models/Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf
  Draft: /crypt/models/qwen36-35b-a3b-dflash-Q5_K_M.gguf
- Gemma 4 31B
  Target: /crypt/models/gemma-4-31B-it-Q4_K_S.gguf
  Draft: /crypt/models/gemma4-31b-it-dflash-Q5_K_M.gguf
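For anyone reproducing the runs, the greedy and stochastic payloads shown earlier differ only in their sampling fields, so the file passed to curl can be generated from one helper. This is a convenience sketch: `build_payload` and `write_payload` are hypothetical names, and the prompt text has to be supplied by the caller (the prompt itself is reproduced further below).

```python
import json


def build_payload(prompt: str, stochastic: bool = False,
                  n_predict: int = 512) -> dict:
    """Return the request body used above for a greedy or stochastic run."""
    payload = {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": 0,
        "repeat_penalty": 1.0,
    }
    if stochastic:
        # Same sampling settings as the stochastic payload above.
        payload.update({
            "temperature": 1.0,
            "top_k": 64,
            "top_p": 0.95,
            "min_p": 0.0,
            "seed": 42,
        })
    return payload


def write_payload(path: str, payload: dict) -> None:
    """Write the payload as JSON so curl can send it with -d @<path>."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)


if __name__ == "__main__":
    print(json.dumps(build_payload("example prompt", stochastic=True), indent=2))
```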
Bottom line
This fix verifies the 256K DFlash slide path under stress.
All 20 runs pass with no crash, no OOM, no decode failure, and no failed slide warnings. The CPU-path drafter KV trimming issue is fixed.
Performance is model-dependent:
- Strong win: Qwen3.6-27B
- Solid win: 35B-A3B
- Near baseline: Gemma 4 31B
- Slower but stable: Coder-Next and 122B-A10B
Note on test prompt used
You are a senior C++ library engineer. Implement a production-quality, header-only, generic C++17 component and a thorough self-contained test harness, then critically review your own work. Follow every constraint exactly.
## Component: `concurrent_skip_list_map<K,V,Compare=std::less<K>>`
A thread-safe, ordered associative container backed by a probabilistic skip list with the following requirements:
1. **Concurrency**: multiple concurrent readers and a single exclusive writer at a time (reader-writer semantics). Writers must not starve readers. Use only the standard library (no external deps). Provide `try_*` non-blocking variants for every mutating operation.
2. **Operations**: `insert(k,v)` returning whether a new node was added; `erase(k)`; `find(k)`; `contains(k)`; `operator[](k)` (inserts default-constructed V if absent); `lower_bound`/`upper_bound`; `clear`; `size`; `empty`.
3. **Iteration**: forward and reverse bidirectional iterators (`begin/end/rbegin/rend`), `const_iterator`, and a `snapshot()` returning a `std::vector<std::pair<K,V>>` that is a consistent point-in-time copy taken under a read lock. Iterators must remain valid across concurrent mutation of *other* keys (document
the exact invalidation rules).
4. **Range queries**: `range(low, high)` returning a lazy view over `[low, high)`; `count(low, high)`.
5. **Memory**: node-level fine-grained locking per level with hand-over-hand (crabbing) locking; nodes are reference-counted so that iterators holding a node prevent its reclamation. No use-after-free under concurrent erase+iteration.
6. **Metrics**: atomic counters for inserts, erases, finds (hits/misses), range scans, and max level reached. Provide `metrics()` returning a struct snapshot.
7. **Configurability**: max level, probability (default 0.5), allocator via an `Allocator` template parameter defaulting to `std::allocator`.
8. **Correctness edge cases**: empty container, single-element, duplicate keys (overwrite semantics for `insert`), `operator[]` default construction cost, TTL-less (no expiration), comparator consistency, move semantics, swap.
## Deliverables
A) The complete header-only implementation with Doxygen-style comments on every public symbol, exception-safety guarantees (basic or strong) stated per operation, and `noexcept` annotations.
B) A `main()` test harness exercising: concurrency (spawn N reader threads + M writer threads on K distinct keys, join, assert no data races via a logical invariant checker), iterator-stability under erase, range correctness, metrics sanity, and a randomized differential test against `std::map` over 100k
operations with seed=42.
C) A critical review: identify at least four concrete weaknesses (correctness, performance, memory, API ergonomics), explain the failure scenario for each, and propose a specific fix with a code sketch.
Be rigorous. Do not hand-wave. Show real code, not pseudocode, for parts A and B.
…karound

- Gemma 4 + DFlash + --flash-attn off: fully coherent (64% unique words)
- Gemma 4 + DFlash + --flash-attn on: 'own' repetition (3% unique words)
- Gemma 4 baseline + FA: coherent (66% unique words)

The Vulkan FA kernel produces incorrect attention weights during the DFlash draft-verification batch for Gemma 4's ISWA layers. This is a Vulkan backend limitation (not a DFlash logic bug). Known upstream issues: Vulkan FA mask stride (ggml-org#12720, ggml-org#12853), Vulkan + Gemma 4 MMQ precision (ollama#15261, fixed in v0.30.0-rc31).

Workaround: use --flash-attn off for Gemma 4 DFlash on Vulkan, or stochastic sampling, which is less sensitive to the precision issue.
Gemma 4
| Configuration | Result | Vocabulary diversity |
|---|---|---|
| Gemma 4 baseline + FA on | ✅ Coherent C++ code | 66% unique words |
| Gemma 4 + DFlash + FA off | ✅ Coherent C++ code | 64% unique words |
| Gemma 4 + DFlash + FA on | ❌ own own own repetition | 3% unique words |
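The "unique words" figures are not defined in this thread. One plausible reading, assumed here, is the number of distinct whitespace-separated words divided by the total word count. A minimal sketch of that assumed metric (the function name is hypothetical):

```python
def unique_word_ratio(text: str) -> float:
    """Distinct words / total words, case-insensitive, split on whitespace.

    Assumed definition; the thread does not state how the percentage
    was actually computed.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


if __name__ == "__main__":
    degenerate = "own " * 100
    print(f"{unique_word_ratio(degenerate):.0%}")
```

Under this definition, heavily repeated output scores close to 0%, while varied text such as code with many identifiers scores much higher, which is the contrast the table draws.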
Why this happens
- Gemma 4 has ISWA layers: mixed SWA and FULL attention, swa=4,full=2.
- During normal decoding, the Vulkan FA kernel produces correct results.
- During DFlash draft verification, the FA kernel computes wrong attention weights for the ISWA layers.
- The corrupted attention weights cause the target model to accept bad draft tokens.
- Under greedy sampling, the model converges to own repetition because the top-1 token becomes own.
Workaround
For Gemma 4 + DFlash on Vulkan, use:
--flash-attn off

With FA disabled, Gemma 4 DFlash is coherent and still faster than baseline:
Gemma 4 baseline: ~11–12 tok/s
Gemma 4 DFlash FA off: ~17–19 tok/s
Commit
d6531e021b628e58c3d39c7256a964b0dd1f1bd9
…st status

- Added PR head history table (b788b4a → d6531e0)
- Updated Gemma 4 section: evidence table, broadened root cause (GLM review), 256K stress test results, GLM second-opinion findings
- Added drafter KV slide fix (5b0e753) and unit tests (da88067) sections
- Updated final status with verification matrix and PR head link
- Updated summary table with current Gemma 4 status
- Added Gemma 4 to post-fix performance matrix
Cross-backend comparison: Strix Halo iGPU vs W7800 eGPU

Platform / context

Decode = generated tokens/sec.

Qwen3.6-27B

Conclusion for Qwen3.6-27B:

Qwen3.6-35B-A3B

Conclusion for Qwen3.6-35B-A3B:

Qwen3-Coder-Next Q4_K_M

Conclusion for Qwen3-Coder-Next:

Gemma-4-31B

Conclusion for Gemma-4-31B:

Best result per model

Backend/device conclusions

iGPU Vulkan

Best overall for Qwen3.6-27B with DFlash:

iGPU ROCm

Generally not as strong as iGPU Vulkan for decode speed:

W7800 Vulkan

Strongest baseline backend for the larger models:

W7800 ROCm

Mixed:

Overall conclusion

DFlash benefit is highly model-, backend-, and device-dependent. The clearest DFlash win is Qwen3.6-27B on iGPU Vulkan, especially

For the larger or cheap-active MoE models, DFlash usually does not beat baseline. In particular:

Practical recommendation from this matrix:
PR #79 — Executive Summary
What this PR is
dflash: enable Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-122B-A10B, Qwen3-Coder-Next, Gemma 4 31B on Vulkan

This PR makes Vulkan DFlash functional across multiple model architectures and fixes the Vulkan DFlash paths so they can run without crashes, decode failures, or 256K-context OOM failures on the tested Strix Halo setup.
Development history
The PR started as:
dflash: enable Qwen3-Coder-Next on Vulkan

It then evolved through several phases.

Phase 1: original Coder-Next enablement

Phase 2: sync from vulkan-dflash-cross-ring

- vulkan-dflash-cross-ring branch.
- ring=1 regression: rep=12.

Phase 3: bug fixes
Three major fixes were needed:
- ab2be17fd: 122B fix. Removed the n_rs_seq=0 clamp for QWEN35MOE, which forced checkpoint/restore for draft rejection and caused DeltaNet recurrent-state desync → repetition.
- efb07137c: 35B-A3B fix. Rotated DeltaNet RS snapshots for n_seq_tokens < K. Under stochastic sampling, every step became a 1-token decode, clobbering snapshot history → multilingual gibberish with 0% acceptance.
- ecc90d226 / later 5b0e7533c: DFlash slide/decode failure fixes. Added crash guards and fixed the drafter KV slide path. This addressed decode failures such as drafter decode failed with 1, caused by drafter KV fill-up after failed sliding on hybrid memory.

Phase 4: testing and verification

- tests/test-dflash-plumbing.cpp.
- ring=0 and ring=1.

Phase 5: documentation

- enable-dflash-qwen3-coder-next-on-vulkan.md to enable-and-fix-dflash-on-vulkan-various-models.md.
- --flash-attn off.

Final verified status
Validated on:
- 5b0e7533c
- pr-79-sync
- -c 262144 / 256K

Result: all 20 runs passed.

- No drafter decode failed with 1
- No failed to slide warnings

Final performance results

| ring=0 | ring=1 | ring=1 stochastic |
|---|---|---|
| acc .746 | acc .743 | acc .434 |
| acc .705 | acc .710 | acc .444 |
| acc .328 | acc .478 | acc .284 |
| acc .215 | acc .185 | acc .564 |
| acc .144 | acc .146 | acc .164 |
Key findings
- Qwen3.6-27B: 21.42 / 12.36 = 1.73×.
- Qwen3.6-35B-A3B: 54.97 / 42.80 = 1.28×.
- Qwen3-Coder-Next: 46.68 / 53.22 = 0.88×.
- Gemma 4 31B: ring=1 is slightly above baseline: 12.37 / 11.33 = 1.09×.
- Qwen3.5-122B-A10B: 13.22 / 19.04 = 0.69×.

Recommended usage

- ring=0 and ring=1 are both good.
- ring=1 was best in final testing.
- --flash-attn off to avoid the Vulkan FA / Gemma 4 own repetition issue.

Current status

The PR includes:

- vulkan-dflash-cross-ring sync
- tests/test-dflash-plumbing.cpp

All 5 target models are now functional on Vulkan with DFlash.
Personal note
I have been dog-fooding this heavily on my own Strix Halo machine, including 256K-context runs across all 5 target models. The current PR is crash-safe and usable in real testing, not just in narrow benchmark cases. Performance benefit is model-dependent, but the Vulkan DFlash path itself is now functional.
Bottom line
This PR turns Vulkan DFlash from a fragile / diagnostic path into a crash-safe and usable feature on Strix Halo.
It gives clear speedups for Qwen3.6-27B and Qwen3.6-35B-A3B, while providing stability and correctness value for Qwen3-Coder-Next, Qwen3.5-122B-A10B, and Gemma 4.