tgi

Here are 16 public repositories matching this topic...

Kubernetes operator for self-hosted LLM inference across a heterogeneous GPU fleet: NVIDIA CUDA, AMD Vulkan, and Apple Silicon Metal. Runtimes: llama.cpp, vLLM, TGI, mlx-server. Multi-GPU sharding, model caching, OpenAI-compatible endpoints. Apache-2.0 licensed, runs on homelab and on-prem fleets, actively developed.

  • Updated Jun 27, 2026
  • Go

Bench360 is a modular benchmarking suite for local LLM deployments. It offers a full-stack, extensible pipeline to evaluate the latency, throughput, quality, and cost of LLM inference on consumer and enterprise GPUs. Bench360 supports flexible backends, tasks and scenarios, enabling fair and reproducible comparisons for researchers & practitioners.

  • Updated Feb 18, 2026
  • Python

Self-hosted FastAPI gateway exposing OpenAI and Anthropic Messages APIs in front of any open-source LLM runtime (vLLM, Ollama, llama.cpp, TGI, SGLang, LocalAI, LM Studio). Streaming, embeddings, metrics, auth, rate limiting.

  • Updated Apr 22, 2026
  • Python
