Research Program — CTC AI Operations

§01 Thesis

Evaluation is the bottleneck of trustworthy AI

Capability advances faster than our ability to measure whether a system is correct, honest, and safe. The dominant assumption is that credible evaluation requires frontier-scale cloud infrastructure. This program tests the opposite: that a disciplined, hypothesis-driven protocol on commodity hardware produces evaluation evidence that is reproducible (frozen, content-hashed references), grounded (deterministic execution facts, not opinion), and economical (local low-precision judging anchored by sampled cloud arbitration) – and that the same discipline extends from single-model correctness to multi-agent, fleet-level safety.

Trustworthy, reproducible, low-cost evaluation – on hardware anyone can own.

§02 Invariants

Four invariants across every workstream

Full-weight SUT

The system under test is never compressed

Quantization is a throughput lever for the judge only. Grade a compressed model and you measure the shrunken copy, not the model.

Frozen references

Rubrics and benchmarks are pinned and content-hashed

A verdict is reproducible across time and software updates; a later change to the reference is visible rather than silent.
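As a minimal sketch of what pinning and content-hashing a reference could look like, the snippet below hashes a canonical JSON encoding of a rubric with SHA-256 and checks it against a pinned digest. The rubric fields and the function names `content_hash` and `verify` are illustrative assumptions, not the program's actual format.

```python
import hashlib
import json


def content_hash(reference: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace, so only a real content change alters it.
    """
    canonical = json.dumps(
        reference, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(reference: dict, pinned_digest: str) -> bool:
    """True only if the reference still matches the digest pinned earlier."""
    return content_hash(reference) == pinned_digest


# Hypothetical rubric, for illustration only.
rubric = {"id": "example-rubric", "version": 1, "criteria": ["compiles", "tests pass"]}
pinned = content_hash(rubric)

# A later edit to the rubric no longer matches the pinned digest.
edited = dict(rubric, criteria=["compiles"])
```

Storing the digest next to each verdict means a changed reference shows up as a failed check rather than a silently different score.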

Hardened sandbox

Untrusted model code runs network-less, non-root, read-only

The same seccomp-profiled substrate produces the deterministic facts that ground judgment and hosts the multi-agent testbed.
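One way such a sandbox might be launched is sketched below. It assumes Docker as the container runtime, which the text does not specify. The image name, profile path, and UID are hypothetical placeholders. The function only builds the command line and does not execute anything.

```python
def sandbox_command(image: str, seccomp_profile: str, argv: list[str]) -> list[str]:
    """Build a `docker run` invocation for untrusted code (assumed runtime)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                 # network-less
        "--user", "65534:65534",             # non-root (conventionally "nobody")
        "--read-only",                       # read-only root filesystem
        "--cap-drop", "ALL",                 # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={seccomp_profile}",
        image, *argv,
    ]


# Hypothetical image and profile names, for illustration only.
cmd = sandbox_command("eval-sandbox:pinned", "profile.json", ["python", "candidate.py"])
```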

32 GB discipline

Everything fits and is measured within one card

Sovereign, commodity hardware – the constraint that makes the results independently reproducible.

§03 Workstreams

Five workstreams, five falsifiable hypotheses

Each project is a workstream with a core hypothesis and a primary metric. Papers marked infrastructure measured have established engineering results; papers marked pilot, agenda, or design define an evaluation to be run.

W1 · Infrastructure measured

Hybrid Evaluation Pipeline

Execution-grounded, frozen-rubric, local FP4 judging is trustworthy and roughly 10× cheaper than an all-cloud baseline. Metric: Cohen’s κ; cost per 1k judgments.

Project · Whitepaper (PDF)
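The two W1 metrics can be computed as in the sketch below. It assumes two raters (for example, the local judge and a human or cloud arbiter) labelling the same items; the label values and cost figures are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def cost_per_1k(total_cost: float, n_judgments: int) -> float:
    """Normalise a total cost to cost per 1,000 judgments."""
    return 1000.0 * total_cost / n_judgments


# Illustrative labels only: 3 of 4 items agree, chance agreement is 0.5.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)  # 0.5
```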

W2 · Pilot

Contamination-Resistant Code Evaluation

Benchmarks regenerated from live repositories resist memorisation, shrinking the train–test gap versus static benchmarks. Metric: contamination gap (fresh − stale).

Project · Whitepaper (PDF)
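The contamination gap can be written down directly, as in this sketch. It assumes the underlying score is a pass rate over per-task outcomes, which is an illustrative choice. The outcome lists are invented.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks solved."""
    return sum(outcomes) / len(outcomes)


def contamination_gap(fresh: list[bool], stale: list[bool]) -> float:
    """Score on freshly regenerated tasks minus score on the static benchmark."""
    return pass_rate(fresh) - pass_rate(stale)


# Illustrative outcomes only.
fresh_outcomes = [True, False, True, False]   # pass rate 0.50
stale_outcomes = [True, True, True, False]    # pass rate 0.75
gap = contamination_gap(fresh_outcomes, stale_outcomes)  # -0.25
```

Under this sign convention, a clearly negative gap means the model does better on the stale benchmark than on fresh tasks, which is the pattern memorisation would produce.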

W3 · Infrastructure measured

Local Three-Tier Agent Workstation

Single-residency time-multiplexing serves three model tiers in 32 GB without VRAM collision, at bounded swap cost. Metric: peak VRAM; swap cost.

Project · Whitepaper (PDF)
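The idea of single-residency time-multiplexing can be sketched as a scheduler that keeps at most one tier loaded at a time and records the two W3 metrics. The tier names and sizes below are hypothetical, and actual model loading and unloading is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SingleResidencyScheduler:
    budget_gb: float
    sizes_gb: dict
    resident: Optional[str] = None
    swaps: int = 0
    peak_gb: float = 0.0

    def acquire(self, tier: str) -> str:
        """Make `tier` the only resident model, evicting any other tier first."""
        size = self.sizes_gb[tier]
        if size > self.budget_gb:
            raise MemoryError(f"{tier} needs {size} GB, budget is {self.budget_gb} GB")
        if self.resident != tier:
            if self.resident is not None:
                self.swaps += 1  # an eviction followed by a load counts as one swap
            self.resident = tier
        # Only one tier is ever resident, so peak usage is the largest tier used.
        self.peak_gb = max(self.peak_gb, size)
        return tier


# Hypothetical tier sizes, for illustration only.
sched = SingleResidencyScheduler(
    budget_gb=32, sizes_gb={"small": 4, "medium": 12, "large": 28, "huge": 40}
)
for tier in ["small", "small", "large", "medium"]:
    sched.acquire(tier)
```

Because tiers never coexist, peak VRAM is bounded by the largest single tier rather than by the sum of all three, at the price of a swap on each tier change.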

W4 · Agenda

Multi-Agent Safety Evaluation

Single-model safety evaluation misses fleet-level failure modes that emerge only under agent interaction. Metric: emergent-risk detection rate.

Project · Whitepaper (PDF)

W5 · Design

Sovereign Personal AI Assistant

A local-plus-remote assistant preserves data sovereignty at usable latency, with zero egress of private data. Metric: latency; egress = 0.

Project · Whitepaper (PDF)

§04 Roadmap

Four phases, from pilot to open protocol

The critical path is P1 – it converts infrastructure papers into empirical studies with human-labelled evidence, which is what both peer venues and funders require. P2 is the funding-defining phase, aligning the multi-agent-safety workstream with dedicated research calls.

P0 – measured pilot infrastructure and cost models (this set of working papers).
P1 – human-labelled validation of judgment quality; SSRN preprints v2 with agreement statistics.
P2 – multi-agent taxonomy and instrumented testbed; alignment with a multi-agent-safety research call.
P3 – released protocols (rubric-gate spec, reproducibility packages) and a deployment study.

§05 Publications

Preprints & dissemination

Working papers are versioned and posted as preprints on SSRN, linked to a single ORCID identifier for authorship continuity, with research@arenskrieger.dev as the permanent corresponding-author address and a University of Pittsburgh affiliation on the record. Each follows the same template: abstract, motivation, related work, explicit falsifiable hypotheses, methodology and metrics, results (measured vs. predicted, clearly separated), roadmap, limitations, and references.

Working papers · v1 · in validation

Current preprint set

Research Program & Roadmap (PDF)
Hybrid Evaluation Pipeline (PDF)
Contamination-Resistant Code Evaluation (PDF)
Local Three-Tier Agent Workstation (PDF)
Multi-Agent Safety Evaluation (PDF)
Sovereign Personal AI Assistant (PDF)

Integrity note. Version 1 establishes the design and measurement plan; empirical agreement statistics are added in version 2 after the P1 study. Claims are labelled measured or predicted throughout – nothing is reported as a result unless it has been measured.