Fast LLM speculative inference server for consumer hardware.
Updated Jun 27, 2026 - C++
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
DFlash & TurboQuant in llama.cpp, with up to 3x faster generation and 7.5x more KV cache in the same VRAM
Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container, tuned for long-context draft acceptance on DGX Spark. 6 HF variants (BF16/NVFP4/MTP/MTP-XS), docker-compose, and QuickStart.
vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2x A5000/3090) — Qwen3.6-35B-A3B FP8 ~244 tok/s, 27B-int4 hybrid GDN+Mamba, Gemma4 26B/31B AWQ, 256K ctx. 321 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN. vLLM pin dev424 + SNDR Control Center GUI.
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
Local AI workstation — discover, run, chat, benchmark, and generate images from open-weight models. DFlash/DDTree speculative decoding, TurboQuant & TriAttention cache compression strategies, MLX + llama.cpp + vLLM + MTPLX backends.
llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP
Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.
Qwen3.6-27B BF16+DFlash 13-config parameter sweep on repne/vllm:v2. Stage A (3x3 buffer/graph) + Stage B (num_speculative_tokens) + Quality (HumanEval/MBPP). 7h, 421 problems, 195 cells.
CLI for building and testing DFlash-style speculative decoding draft models.
GGUF-native DFlash speculative decoding runtime for local models
vLLM v0.21 + DFlash + thinking_token_budget for Gemma 4 & Qwen 3.6 on Blackwell GB10 (sm_121a / sm_120)
vLLM v0.21 + Qwen 3.6 DFlash + real thinking_token_budget enforcement on Blackwell (sm_120 / sm_121a)