I build evaluation harnesses and agent infrastructure. The part of AI I love most is the boring, honest half: making outputs measurable, falsifiable, and hard to fake. I was an AI evaluation team lead at Turing, working for frontier AI lab clients, where I designed grading rubrics, caught reward hacking, and rewrote prompts that moved agreement from "throwing darts" to "actually scores reasoning."

    class Aaweg:
        role = "GenAI Engineer"
        focus = ["LLM evaluation", "multimodal RAG", "agentic systems"]
        philosophy = "is it actually good, or just demo-good?"
        currently = "shipping eval pipelines and breaking my own agents"
        fun_fact = "wrote a C++ inference engine just to feel the tokens move"

priorityjudge is a multi-pass agent that grades plans and PRDs on priority-definition quality. Extractor → 5 independent dimension scorers → deterministic citation verifier → synthesizer.
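
For a feel of what a deterministic citation verifier can look like, here is a minimal sketch. It is not priorityjudge's actual code: the `Citation` type, the function names, and the normalized-substring matching rule are all assumptions made for illustration. The point it shows is that the check involves no model call, so a quote the scorer invented cannot pass.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical shape: the quote a scorer claims is in the plan,
    # plus the scoring dimension that cited it.
    quote: str
    dimension: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not cause false misses."""
    return " ".join(text.lower().split())


def verify_citations(source: str, citations: list[Citation]):
    """Split citations into (verified, rejected) by normalized substring match.

    Assumed rule: a citation counts as verified only if its quote appears
    in the source text. Same input, same result, every time.
    """
    haystack = normalize(source)
    verified, rejected = [], []
    for c in citations:
        needle = normalize(c.quote)
        if needle and needle in haystack:
            verified.append(c)
        else:
            rejected.append(c)
    return verified, rejected


def citation_precision(verified, rejected) -> float:
    """Fraction of emitted citations that survived verification."""
    total = len(verified) + len(rejected)
    return len(verified) / total if total else 0.0


# Made-up plan text and citations, purely for demonstration.
plan = (
    "Q3 goal: cut checkout latency below 200 ms.\n"
    "P0 items ship before any P1 work starts."
)
cites = [
    Citation("cut checkout latency below 200 ms", "measurability"),
    Citation("P0 items ship before   any P1 work", "ordering"),
    Citation("revenue will double by Q4", "impact"),  # not in the plan
]
ok, bad = verify_citations(plan, cites)
precision = citation_precision(ok, bad)
```

Rejected citations can be dropped before the synthesizer sees them, which is why the verifier sits between the scorers and the final step.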
9887 / 10000 on a hand-rated calibration set · Spearman ρ = 1.0 against the human ranking · 96.5% citation precision (the verifier caught the 3.5% the LLM hallucinated) · test-retest CV = 0.4%. The win comes from architecture, not the model.
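
The rank-agreement and stability numbers above are standard statistics. A minimal sketch of how they can be computed with the standard library alone follows; the helper names and the sample data are hypothetical, and this is not the project's evaluation script.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def coefficient_of_variation(scores):
    """Test-retest spread: sample standard deviation divided by the mean."""
    return stdev(scores) / mean(scores)


# Made-up data: human ranking of five plans vs. the scores a judge gave them.
human_rank = [1, 2, 3, 4, 5]
judge_score = [12.0, 35.5, 41.0, 77.2, 90.1]
rho = spearman_rho(human_rank, judge_score)

# Made-up repeated runs of the judge on one plan.
repeat_scores = [79.8, 80.1, 80.3, 79.9]
cv = coefficient_of_variation(repeat_scores)
```

Spearman only compares orderings, so ρ = 1.0 means the judge ranked the items exactly as the human did, not that the raw scores matched.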
| Project | What it is | Stack |
|---|---|---|
| priorityjudge | APO-native plan scorer with verified citations | Python · Gemini · GH Actions · Docker |
| nibblecore | 4-bit quantization kernels for Apple Silicon, benchmarked vs llama.cpp | C++ · SIMD · Metal |
| Story-Character-Extractor | RAG pipeline pulling structured character profiles from stories | Embeddings · Vector DB · LLM |
| Melody-Generation-using-LSTM | LSTM trained on monophonic MIDI to generate new melodies | PyTorch · LSTM · MIDI |

also: LangGraph · LlamaIndex · MLflow · Airflow · ChromaDB · pytest · SIMD/Metal · Next.js when a UI is genuinely the right answer