Hybrid Evaluation Pipeline
A self-built local-plus-cloud pipeline for evaluating AI coding, agentic and data-science systems against frozen, versioned rubrics – built for VRAM efficiency on a single 32 GB GPU, reduced judge bias, and a roughly 10× cut in token cost.
View project →
Contamination-Resistant Code Evaluation
A reproducible pipeline that synthesises evaluation tasks from a real, versioned codebase via AST analysis – so they can't have leaked into a model's training set the way public benchmarks do. Piloted with Qwen3.6-35B-A3B on the Cerberus framework, single-stream on one 32 GB GPU.
View project →
Local Three-Tier Agent Workstation
A single-GPU operator setup that matches agent framework and open-weight model to the workload – Hermes, OpenCode and OpenClaw sharing one 32 GB card, with only one model resident at a time. Status: design intent.
View project →
Multi-Agent Safety Evaluation
A research agenda for measuring the emergent risks of multi-agent systems – a failure-mode taxonomy, quantitative risk metrics and an instrumented testbed – toward a reusable pre-deployment safety harness, built on the sandboxed discipline of the contamination-resistant evaluation pipeline.
View project →
Sovereign Personal AI Assistant
A personal assistant with the polish of a modern chat app, whose inference and history stay on owned hardware and are reached from a phone over an encrypted tunnel. One architecture scales from a single GPU to a multi-GPU team host by adding hardware, not re-architecting.
View project →