tgi
Here are 16 public repositories matching this topic...
Kubernetes operator for self-hosted LLM inference across a heterogeneous GPU fleet: NVIDIA CUDA, AMD Vulkan, and Apple Silicon Metal. Runtimes: llama.cpp, vLLM, TGI, mlx-server. Multi-GPU sharding, model caching, OpenAI-compatible endpoints. Apache-2.0 licensed, runs across homelab and on-prem fleets, and is actively developed.
Updated Jun 27, 2026 - Go
Acceleration for large-model inference frameworks: make LLMs fly
Updated May 10, 2024 - Python
Bench360 is a modular benchmarking suite for local LLM deployments. It offers a full-stack, extensible pipeline to evaluate the latency, throughput, quality, and cost of LLM inference on consumer and enterprise GPUs. Bench360 supports flexible backends, tasks and scenarios, enabling fair and reproducible comparisons for researchers & practitioners.
Updated Feb 18, 2026 - Python
Curated list of open-source LLM-serving runtimes, routers, evaluators, and standards. Run LLMs without locking into one vendor.
Updated Jun 20, 2026
LLM Inference performance harness
Updated Dec 29, 2025 - Python
AWS deployment stack for Gemma 3 on SageMaker with HuggingFace TGI, OpenAI-compatible API (Lambda + API Gateway), and OpenWebUI chat interface
Updated Mar 25, 2026 - Python
Throughput + latency benchmark for OpenAI-compatible LLM endpoints (vLLM, TGI, llama.cpp, Ollama). TTFT, TPOT, throughput, percentiles. Runtime-agnostic.
Updated Jun 20, 2026 - Python
An nvtop for local LLM inference: zero-config autodiscovery of vLLM, llama.cpp, Ollama, TGI, SGLang + live GPU and serving metrics in a Textual TUI. Unified-memory (GB10/Jetson) aware. MIT.
Updated Jun 20, 2026 - Python
Bridge GitHub Copilot Chat with local vLLM/TGI servers and HuggingFace cloud models. Enterprise-ready VS Code extension for air-gapped AI coding.
Updated Sep 29, 2025 - TypeScript
GPU-aware LLM proxy for Ollama, vLLM, TGI, and llama.cpp. Warm-first routing, cloud fallback to OpenAI/Anthropic, per-token cost tracking. Single Go binary, zero deps.
Updated Jun 26, 2026 - Go
Self-hosted FastAPI gateway exposing OpenAI and Anthropic Messages APIs in front of any open-source LLM runtime (vLLM, Ollama, llama.cpp, TGI, SGLang, LocalAI, LM Studio). Streaming, embeddings, metrics, auth, rate limiting.
Updated Apr 22, 2026 - Python
Lightweight HTML form with Python Flask app and accompanying scripts for swift testing of interactions with SEA-LION family of LLMs.
Updated Aug 2, 2024 - Python