Raise qmv batch limit for large matrices on M5-class GPUs #3791
Open
pierre427 wants to merge 1 commit into
On g17 hardware the qmv_wide kernels stay ahead of the qmm path well past
the generic non-'d' fallback limit of 10. Measured on M5 Max (g17s) with
affine 4-bit/8-bit weights (group_size 64) at D=8192,
O in {1024, 8192, 28672, 250112}: qmv_wide is ~18% faster at M=12
(241.5ms vs 294.6ms whole-model forward on a 70B) and still ahead at M=16.
Raises the large-matrix limit to 16 for gen>=17 non-'d' GPUs; small and
mid shapes keep the existing fallback values. Mitigates the small-M
speculative-decoding verification cost discussed in ml-explore#3553.
Proposed changes
get_qmv_batch_limit currently sends gen-17 (M5-class) GPUs down the generic
non-'d' fallback, which caps the qmv/qmv_wide path at M<10 for large
matrices. Measured on an M5 Max (applegpu_g17s), the qmv_wide kernels stay
ahead of the qmm path well past that limit on large shapes. This PR raises
the large-matrix limit to 16 for arch_gen >= 17 non-'d' devices; small and
mid shapes keep the existing fallback values (no data to justify changing
them).
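As a rough illustration of the rule described above (not the actual patch), the large-matrix bucket could be modelled as below. The names parse_arch and large_matrix_qmv_limit are hypothetical, the two-digit generation parsing is an assumption based on names like applegpu_g17s, and the existing fallback value is passed in rather than hard-coded because only the large-matrix non-'d' value (10) is stated here.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of MLX: splits an architecture name such as
// "applegpu_g17s" into a generation number and a trailing size letter.
struct ArchInfo {
  int gen;    // e.g. 17 for "applegpu_g17s"; -1 if it cannot be parsed
  char size;  // trailing size letter, e.g. 's' or 'd'
};

ArchInfo parse_arch(const std::string& arch) {
  ArchInfo info{-1, '\0'};
  if (arch.size() < 3) {
    return info;
  }
  info.size = arch.back();
  // Assumption: the generation is the two characters before the size letter.
  const std::string digits = arch.substr(arch.size() - 3, 2);
  if (std::isdigit(static_cast<unsigned char>(digits[0])) &&
      std::isdigit(static_cast<unsigned char>(digits[1]))) {
    info.gen = std::stoi(digits);
  }
  return info;
}

// Sketch of the large-matrix bucket only: gen >= 17 non-'d' devices get 16,
// everything else keeps whatever limit the caller already uses.
int large_matrix_qmv_limit(const std::string& arch, int fallback_limit) {
  const ArchInfo info = parse_arch(arch);
  if (info.gen >= 17 && info.size != 'd') {
    return 16;
  }
  return fallback_limit;
}
```

With the non-'d' fallback of 10, large_matrix_qmv_limit("applegpu_g17s", 10) yields 16, while an older generation or a 'd' device keeps the value passed in.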
This mitigates the small-M quantized-matmul cost step discussed in #3553 for
the M=11–16 band, which matters for speculative decoding: verification of
multi-candidate / tree / long prompt-lookup proposals lands exactly in that
window. #2031 notes these limits are empirical and machine-dependent; gen-17
hardware postdates the tunings there.
Measurements
Apple M5 Max, 128 GB (applegpu_g17s), macOS 26. Whole-model forward cost vs
row count M against a warm 400-token KV cache, min of 6 reps. Model: 70B dense
llama-arch (LLM360 K2-V2), affine 4-bit, group_size 64 (D=8192; O spans 1024 /
8192 / 28672 / 250112 across layers).
Same model at affine 8-bit: M=12 326.7 → 304.1 (−7%).
For reference, the 0.31.2 release (no qmv_wide kernels) costs 298.5 ms at
M=12 on the same 4-bit model, and end-to-end speculative decoding on this
hardware improves from ~18 tok/s (release) to 24.5 tok/s (main + k=3 chain
with a 0.6B draft). Wider verification schemes land in the M=11–16 window,
which this change dispatches to the faster path.
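For anyone wanting to compare against their own hardware, the "min of N reps" timing used for these numbers can be sketched generically as follows. This is not the script behind the measurements; min_time_ms is a hypothetical helper, and it assumes the callable blocks until the GPU work has actually finished (lazy evaluation would otherwise make the timing meaningless).

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>

// Hypothetical helper: runs `fn` `reps` times and returns the fastest
// wall-clock time in milliseconds. Taking the minimum filters out noise
// from scheduling and thermal effects.
// Assumption: `fn` synchronizes, i.e. returns only once its work is done.
double min_time_ms(const std::function<void()>& fn, int reps = 6) {
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < reps; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    best = std::min(best, ms);
  }
  return best;
}
```

One forward pass per row count M would be wrapped in the callable, after a warm-up run so that kernel compilation and cache fill are excluded.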
Repro
Happy to run additional shapes/configs on this hardware if useful for tuning
the small/mid buckets too.