kube-ai

Self-aware Kubernetes workloads with pod-local diagnostics, RAG context, and AI-assisted recovery.

Why this project matters

Most incident response is still human glue: logs, dashboards, hunches, repeat.

kube-ai aims to make each workload more autonomous:

detect failure signals in real time,
reason using local + hosted AI,
ground recommendations using troubleshooting docs,
and move toward safe self-healing with minimal manual intervention.

The novelty

Calling an AI API is easy. Building a pod that can interpret its own failure patterns and improve over time is the real challenge.

What kube-ai is trying to do differently:

run as a pod-local sidecar, not only as a cluster-level observer,
keep an incident memory close to the workload,
use lightweight local reasoning first, then escalate to stronger models,
make troubleshooting knowledge portable with the deployment.

Think of it as workload-level immune response, not only platform-level monitoring.

Architecture at a glance

[ pod-local agent ]
  heartbeat -> signal collection -> anomaly detection -> incident ledger
                    |
                    v
               [ RAG pipeline ]
         retrieve relevant troubleshooting context
                    |
                    v
               [ AI routing ]
      local Ollama first, Foundry fallback for depth

Project layout:

agent/ pod-side runtime, AI routing, health endpoint
rag-pipeline/ ingest and query for troubleshooting knowledge
helm/ deployment packaging (in progress)
tsg-samples/ seed troubleshooting guides
internal-docs/ execution plan, testing notes, phase status

Quick start

Set up Python 3.12 and create the virtual environment.

cd agent
python -m venv .venv
.\.venv\Scripts\python.exe -m pip install -r requirements.txt

Configure AI mode.

Local mode: Ollama at http://127.0.0.1:11434/v1
Foundry mode: set AI_FOUNDRY_ENDPOINT_URL and AI_FOUNDRY_API_KEY in local env

Run the agent.

$env:AGENT_CONFIG_FILE='config/agent.yaml'
.\.venv\Scripts\python.exe -m app.main

Verify service health.

curl http://localhost:9090/health

Where we are now

Agent heartbeat + rule engine + health endpoint
Incident ledger persistence
Local Ollama diagnostics path
Azure Foundry diagnostics path
RAG ingest/query MVP
Kubernetes restart/OOM collector
Notification transport + execution policy layer
Helm-based production packaging
End-to-end demo and validation flow

Execution plan

Finish pod-level signal coverage and AI-assisted diagnosis loop.
Strengthen RAG grounding for higher-confidence recommendations.
Add notification and safe action orchestration.
Package for installable Kubernetes deployment via Helm.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
agent		agent
docs/scripts		docs/scripts
helm/scripts		helm/scripts
internal-docs		internal-docs
rag-pipeline		rag-pipeline
tsg-samples		tsg-samples
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kube-ai

Why this project matters

The novelty

Architecture at a glance

Quick start

Where we are now

Execution plan

Useful links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

kube-ai

Why this project matters

The novelty

Architecture at a glance

Quick start

Where we are now

Execution plan

Useful links

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages