Evaluation is the bottleneck of trustworthy AI
Capability advances faster than our ability to measure whether a system is correct, honest, and safe. The dominant assumption is that credible evaluation requires frontier-scale cloud infrastructure. This program tests the opposite: that a disciplined, hypothesis-driven protocol on commodity hardware produces evaluation evidence that is reproducible (frozen, content-hashed references), grounded (deterministic execution facts, not opinion), and economical (local low-precision judging anchored by sampled cloud arbitration) – and that the same discipline extends from single-model correctness to multi-agent, fleet-level safety.
Four invariants across every workstream
The system under test is never compressed
Quantization is a throughput lever for the judge only. Grade a compressed model and you measure the shrunken copy, not the model.
Rubrics and benchmarks are pinned and content-hashed
A verdict is reproducible across time and software updates; a later change to the reference is visible rather than silent.
Untrusted model code runs network-less, non-root, read-only
The same seccomp-profiled substrate produces the deterministic facts that ground judgment and hosts the multi-agent testbed.
Everything fits and is measured on a single GPU card
Sovereign, commodity hardware – the constraint that makes the results independently reproducible.
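The second invariant, pinning and content-hashing a rubric, can be illustrated with a minimal sketch. It assumes the rubric is a JSON-serialisable dictionary and that SHA-256 over a canonical JSON encoding serves as the content hash; the source specifies neither the hash function nor the rubric format. The names content_hash, verify_pinned, and the example rubric fields are hypothetical.

```python
import hashlib
import json


def content_hash(rubric: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of the rubric.

    Assumption: sorted keys and fixed separators make the encoding stable,
    so the digest depends on content only, not on key order or whitespace.
    """
    canonical = json.dumps(
        rubric, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_pinned(rubric: dict, pinned_digest: str) -> bool:
    """Check a rubric against the digest recorded when it was frozen."""
    return content_hash(rubric) == pinned_digest


# Hypothetical rubric, frozen at version 1.
rubric_v1 = {"id": "example-rubric", "criteria": ["compiles", "tests pass"]}
pinned = content_hash(rubric_v1)

# A later edit to the reference changes the digest, so it cannot pass silently.
rubric_edited = {
    "id": "example-rubric",
    "criteria": ["compiles", "tests pass", "style"],
}

print(verify_pinned(rubric_v1, pinned))      # unchanged rubric verifies
print(verify_pinned(rubric_edited, pinned))  # edited rubric does not
```

Recording the digest alongside each verdict is one way to make a later change to the reference visible: any judgment can be checked against the exact rubric bytes it was produced with.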
Five workstreams, five falsifiable hypotheses
Each project is a workstream with a core hypothesis and a primary metric. Papers marked "infrastructure measured" report established engineering results; papers marked "pilot" or "agenda" define an evaluation that is still to be run.
Hybrid Evaluation Pipeline
Execution-grounded, frozen-rubric, local FP4 judging is trustworthy and roughly 10× cheaper than an all-cloud baseline. Metric: Cohen’s κ; cost per 1k judgments.
Contamination-Resistant Code Evaluation
Benchmarks regenerated from live repositories resist memorisation, shrinking the train–test gap versus static benchmarks. Metric: contamination gap (fresh − stale).
Local Three-Tier Agent Workstation
Single-residency time-multiplexing serves three model tiers in 32 GB without VRAM collision, at bounded swap cost. Metric: peak VRAM; swap cost.
Multi-Agent Safety Evaluation
Single-model safety evaluation misses fleet-level failure modes that emerge only under agent interaction. Metric: emergent-risk detection rate.
Sovereign Personal AI Assistant
A local-plus-remote assistant preserves data sovereignty at usable latency, with zero egress of private data. Metric: latency; egress = 0.
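Two of the primary metrics above are simple enough to sketch directly. The following is an illustrative implementation of Cohen's κ between two label sequences (for example, a local judge and a human annotator) and of the contamination gap as defined above (fresh − stale). The function names and the example labels and scores are hypothetical and are not results from the program.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined (0/0) when both raters use one identical label;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def contamination_gap(fresh_score: float, stale_score: float) -> float:
    """Gap as defined in the text: score on fresh tasks minus score on stale ones."""
    return fresh_score - stale_score


# Made-up labels for four judgments.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)

# Made-up benchmark scores; a negative gap means the model does worse on fresh tasks.
gap = contamination_gap(fresh_score=0.40, stale_score=0.55)

print(f"kappa={kappa:.2f} gap={gap:.2f}")
```

With these conventions, κ near 1 indicates agreement well above chance, and a strongly negative gap is the pattern one would expect if a static benchmark had been memorised.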
Four phases, from pilot to open protocol
The critical path is P1 – it converts infrastructure papers into empirical studies with human-labelled evidence, which is what both peer venues and funders require. P2 is the funding-defining phase, aligning the multi-agent-safety workstream with dedicated research calls.
P0 – measured pilot infrastructure and cost models (this set of working papers).
P1 – human-labelled validation of judgment quality; SSRN preprints v2 with agreement statistics.
P2 – multi-agent taxonomy and instrumented testbed; alignment with a multi-agent-safety research call.
P3 – released protocols (rubric-gate spec, reproducibility packages) and a deployment study.
Preprints & dissemination
Working papers are versioned and posted as preprints on SSRN, linked to a single ORCID identifier for authorship continuity, with research@arenskrieger.dev as the permanent corresponding-author address and a University of Pittsburgh affiliation on the record. Each follows the same template: abstract, motivation, related work, explicit falsifiable hypotheses, methodology and metrics, results (measured vs. predicted, clearly separated), roadmap, limitations, and references.
Current preprint set
Research Program & Roadmap (PDF)
Hybrid Evaluation Pipeline (PDF)
Contamination-Resistant Code Evaluation (PDF)
Local Three-Tier Agent Workstation (PDF)
Multi-Agent Safety Evaluation (PDF)
Sovereign Personal AI Assistant (PDF)
Integrity note. Version 1 establishes the design and the measurement plan; empirical agreement statistics will be added in version 2, after the P1 study. Claims are labelled measured or predicted throughout – nothing is reported as a result unless it has been measured.