Self-aware Kubernetes workloads with pod-local diagnostics, RAG context, and AI-assisted recovery.
Most incident response is still human glue: logs, dashboards, hunches, repeat.
kube-ai aims to make each workload more autonomous:
- detect failure signals in real time,
- reason using local + hosted AI,
- ground recommendations using troubleshooting docs,
- and move toward safe self-healing with minimal manual intervention.
Calling an AI API is easy. Building a pod that can interpret its own failure patterns and improve over time is the real challenge.
What kube-ai is trying to do differently:
- run as a pod-local sidecar, not only as a cluster-level observer,
- keep an incident memory close to the workload,
- use lightweight local reasoning first, then escalate to stronger models,
- make troubleshooting knowledge portable with the deployment.
Think of it as workload-level immune response, not only platform-level monitoring.
[ pod-local agent ]
heartbeat -> signal collection -> anomaly detection -> incident ledger
|
v
[ RAG pipeline ]
retrieve relevant troubleshooting context
|
v
[ AI routing ]
local Ollama first, Foundry fallback for depth
Project layout:
agent/pod-side runtime, AI routing, health endpointrag-pipeline/ingest and query for troubleshooting knowledgehelm/deployment packaging (in progress)tsg-samples/seed troubleshooting guidesinternal-docs/execution plan, testing notes, phase status
- Set up Python 3.12 and create the virtual environment.
cd agent
python -m venv .venv
.\.venv\Scripts\python.exe -m pip install -r requirements.txt- Configure AI mode.
- Local mode: Ollama at
http://127.0.0.1:11434/v1 - Foundry mode: set
AI_FOUNDRY_ENDPOINT_URLandAI_FOUNDRY_API_KEYin local env
- Run the agent.
$env:AGENT_CONFIG_FILE='config/agent.yaml'
.\.venv\Scripts\python.exe -m app.main- Verify service health.
curl http://localhost:9090/health- Agent heartbeat + rule engine + health endpoint
- Incident ledger persistence
- Local Ollama diagnostics path
- Azure Foundry diagnostics path
- RAG ingest/query MVP
- Kubernetes restart/OOM collector
- Notification transport + execution policy layer
- Helm-based production packaging
- End-to-end demo and validation flow
- Finish pod-level signal coverage and AI-assisted diagnosis loop.
- Strengthen RAG grounding for higher-confidence recommendations.
- Add notification and safe action orchestration.
- Package for installable Kubernetes deployment via Helm.