Profile // Evaluation Report · 2026

Marian E. Arenskrieger

CTC AI Operations

AI evaluation & data-quality on frontier-model projects — auditing, building, and stress-testing the datasets behind agentic AI.

DOMAIN: AI Eval · Data Quality · SWE (ML)
PATH: Finance → Data Science
Summary

AI-evaluation specialist who builds and audits the datasets frontier models are trained and tested on — with a steady focus on correctness, reliability, and reproducibility.

My background bridges two worlds: a banking apprenticeship and a B.A. in Financial Management, several years of self-employed quantitative trading, and a deliberate move into data science. That combination — financial rigor plus a data-science toolkit — is what I bring to evaluating agentic and function-calling AI systems.

What I do

FOCUS / 04

Four threads run through every engagement — frontier-model evaluation, the data-quality discipline behind it, the tooling that keeps it repeatable, and the markets background that grounds the judgment.

01 Evaluation

AI Evaluation & RLHF

Rubric-based scoring for correctness, reasoning and instruction-following — rubrics frozen under semantic versioning and content hashes so a verdict stays reproducible months later. Pairwise model comparisons, multi-turn prompt design, adversarial-robustness checks and documented failure modes.
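
As a minimal sketch of what freezing a rubric under a semantic version and a content hash could look like, assuming a rubric is a plain JSON-serialisable dict. The names `FrozenRubric`, `freeze_rubric` and `verify` are hypothetical illustrations, not taken from the tooling described here:

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenRubric:
    """Hypothetical record pairing a rubric with its version and digest."""
    version: str   # semantic version label, e.g. "1.2.0"
    digest: str    # SHA-256 over the canonical JSON form
    payload: str   # the canonical JSON text itself


def freeze_rubric(rubric: dict, version: str) -> FrozenRubric:
    # Canonical form: sorted keys and fixed separators, so key order
    # or whitespace differences do not change the hash.
    payload = json.dumps(
        rubric, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return FrozenRubric(version=version, digest=digest, payload=payload)


def verify(frozen: FrozenRubric) -> bool:
    # Recompute the digest; a mismatch means the rubric text has drifted.
    recomputed = hashlib.sha256(frozen.payload.encode("utf-8")).hexdigest()
    return recomputed == frozen.digest
```

Storing the version and digest next to every verdict is one way to make it possible to check, months later, that a score was produced against exactly the same rubric text.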

02 Data Quality

Auditing & QA

QA and rubric-based audits of contributors' datasets for function-calling and agentic-AI projects — verifying correctness, format compliance and consistency before delivery, and grading the training signal itself against a fixed reference rather than shifting human judgment.

03 Engineering

Tooling & Environments

Designing and operating a hybrid local-plus-cloud evaluation stack — local FP4 batch judging via vLLM, a hardened sandbox for untrusted model-generated code, prefix-cached frozen rubrics, and forking open-source tooling: JSON support in Cerberus, multi-layer validation and error detection in Haystack.

04 Quant

Finance & Markets

A decade across capital markets, proprietary trading and financial modelling — the quantitative backbone behind the data work, plus applied AI in financial consulting, where a hallucinated figure costs real money.

Experience

LOG / REVERSE-CHRON
[01] Jan 2026 – Present

AI Evaluation, Data Quality & Software Engineering

Labelbox · Remote · Clients: leading AI labs
Agentic AI Master Reviewer · Software Engineer – Machine Learning · Senior Machine Learning Expert
  • QA and rubric-based auditing of other contributors' datasets for function-calling and agentic-AI projects — verifying correctness, format compliance, and consistency before delivery, and enforcing quality standards within the Master Review team.
  • Deployment and configuration of local model environments to run frontier models against real tasks and generate datasets; construction of training and evaluation datasets, including HFI problem sets, for frontier-model coding tasks.
  • Forking and internal extension of open-source tooling — JSON support in Cerberus; multi-layer validation and error detection in Haystack — delivered as part of the dataset.
  • RLHF evaluation and multi-turn prompt design for agentic-coding tasks, with pairwise comparisons across frontier models and calibrated, rubric-based scoring for correctness, reasoning, and instruction-following; systematic documentation of failure modes and edge cases.
[02] Oct 2025 – May 2026

Finance & AI Intern

MLP SE · Wiesloch, Germany · Part-time
Capstone: AI for Financial Consulting & Recruiting
  • Analysis and conceptual design of AI use cases to support and personalize financial advisory.
  • Data-driven approaches to increase qualified applicants via AI targeting.
[03] Jun 2025 – Mar 2026

Machine Learning Specialist

Scale AI · Freelance, Remote
  • Mathematical evaluation of ML models for correctness, reasoning quality, and quantitative accuracy.
  • Rubric-based rating of model outputs on quantitative and reasoning tasks; identification of errors in model-generated reasoning and solutions.
[04] Sep 2019 – Jun 2025

Trader & Market Analyst

BraveTrade · Self-Employed, Remote
  • Proprietary trading across cryptocurrencies, equities, and options on a commercial basis.
  • Development and backtesting of trading strategies across spot and derivatives markets using statistical modeling.
  • Data-driven market and risk analysis; market analysis and trading coaching for private clients.
[05] Jan 2018 – Oct 2021

Cryptocurrency Mining Operator

BraveTrade · Self-Employed, Remote
  • Commercial operation of a cryptocurrency-mining business; procurement (leasing) and operation of mining hardware.
  • Continuous profitability (ROI) and energy-cost optimization; configuration, uptime monitoring, and documentation for tax compliance.
[06] Aug 2016 – May 2018

Bank Clerk & Banking Apprenticeship

VR-Bank eG Osnabrücker Nordland · Fürstenau, Germany
  • Banking operations and client work alongside a formal apprenticeship — the foundation of the finance side of my profile.

Selected work

PROJECTS / 05
Evaluation Infrastructure · CTC AI Operations

Hybrid Evaluation Pipeline

A self-built local-plus-cloud pipeline for evaluating AI coding, agentic and data-science systems against frozen, versioned rubrics — built for VRAM efficiency on a single 32 GB GPU, reduced judge bias, and a roughly 10× cut in token cost.

Two models in 32 GB | Judge ≠ system under test | ≈95% off repeats
Evaluation Method · CTC AI Operations

Contamination-Resistant Code Evaluation

A reproducible pipeline that synthesises evaluation tasks from a real, versioned codebase via AST analysis — so they can't have leaked into a model's training set the way public benchmarks do. Piloted with Qwen3.6-35B-A3B on the Cerberus framework, single-stream on one 32 GB GPU.

Tasks from live AST | ~3B-active MoE on 32 GB | Method-first pilot
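
To illustrate the general idea of deriving tasks from a codebase's AST, here is a minimal sketch using Python's standard `ast` module. It assumes Python source and a deliberately simple task shape (signature plus docstring as the prompt, original function as the reference); `Task` and `tasks_from_source` are hypothetical names, and this is not the project's actual pipeline:

```python
import ast
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record: prompt for the model, reference for grading."""
    name: str
    prompt: str
    reference: str


def tasks_from_source(source: str) -> list[Task]:
    tasks = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        doc = ast.get_docstring(node)
        if not doc:
            continue  # skip undocumented functions: no spec to prompt with
        # Reference = the original function text, kept for grading only.
        reference = ast.get_source_segment(source, node)
        # Prompt = signature plus docstring, with the body withheld.
        # Simplification: return annotations and decorators are dropped.
        signature = f"def {node.name}({ast.unparse(node.args)}):"
        prompt = f'{signature}\n    """{doc}"""\n    ...'
        tasks.append(Task(node.name, prompt, reference))
    return tasks
```

Because the tasks are generated from a specific revision of the code rather than copied from a published benchmark, regenerating them after each upstream change keeps the task set tied to text that is unlikely to appear verbatim in older training data.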
Local Infrastructure · CTC AI Operations

Local Three-Tier Agent Workstation

A single-GPU operator setup that matches agent framework and open-weight model to workload: Hermes, OpenCode and OpenClaw share one 32 GB card, with only one model resident at a time. Status: design intent.

One GPU, three roles | One model resident | No cloud dependency
Research Agenda · CTC AI Operations

Multi-Agent Safety Evaluation

A research agenda for measuring the emergent risks of multi-agent systems — a failure-mode taxonomy, quantitative risk metrics and an instrumented testbed — toward a reusable pre-deployment safety harness, built on the sandboxed, contamination-resistant discipline of the pipeline.

Risk in the edges: k(k−1) | Taxonomy · metrics · testbed | Research agenda
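
One reading of the k(k−1) figure, offered here as an assumption rather than the agenda's stated definition, is the number of directed sender-to-receiver channels among k agents. A short sketch with the hypothetical helper `directed_channels`:

```python
from itertools import permutations


def directed_channels(agents: list[str]) -> list[tuple[str, str]]:
    # Every ordered (sender, receiver) pair of distinct agents.
    # For k agents this yields k * (k - 1) pairs.
    return list(permutations(agents, 2))


channels = directed_channels(["planner", "coder", "reviewer", "executor"])
print(len(channels))  # 4 agents -> 4 * 3 = 12 directed channels
```

Under that reading, the count of interaction channels grows quadratically with the number of agents, while the count of agents grows only linearly.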
Product Design · CTC AI Operations

Sovereign Personal AI Assistant

A personal assistant with the polish of a modern chat app, whose inference and history stay on owned hardware — reached from a phone over an encrypted tunnel — and one architecture that scales from a single GPU to a multi-GPU team host by adding hardware, not re-architecting.

Inference and history stay local | Personal rig, paired with a smartphone | Consumer → enterprise

Capabilities

MATRIX / 04 DOMAINS
AI / Machine Learning · 06
Rubric design, model evaluation and RLHF assessment for frontier and agentic systems — the core of the day-to-day work.
Frontier Model Eval | LLM-as-Judge | RLHF Assessment | Rubric Design | Adversarial Testing | Multi-Agent Safety
Data Science & Statistics · 06
Turning noisy evaluation output into defensible, reproducible signal, with the statistics to stand behind the conclusions.
Statistical Analysis | Reproducibility | Data Modeling | Variance Analysis | Self-Consistency | Predictive Models
Engineering & Tooling · 06
Local frontier-model pipelines and forked open-source eval tooling — the infrastructure the evaluation runs on.
Hardened Sandbox | Batch Inference | vLLM / FP4 | Prefix Caching | Local Deployment | Function-Calling
Finance & Business · 06
A decade across capital markets and proprietary trading — the quantitative backbone behind the AI work and its cost discipline.
Financial Modelling | Capital Markets | Consulting | Technical Analysis | Quant Analysis | Cost Modelling

Education & certifications

VERIFIED / CREDENTIALS

Education

Master of Data Science (MDS)
University of Pittsburgh, USA · Remote
NOV 2024 – PRESENT · Grade A · GPA 3.8
Applied Data Science Program
MIT Professional Education, USA · Remote
MAR 2025 – JUN 2025 · Grade A
Mathematics for Machine Learning
Imperial College London, UK · Remote
SEP 2024 – NOV 2024 · Grade A · 98.58%
Financial Management, B.A.
IU International University of Applied Sciences, Germany
AUG 2018 – JUL 2022 · Grade B
Apprenticeship in Banking
Genossenschaftsakademie, Rastede, Germany
AUG 2016 – MAY 2018 · Grade B

Certifications

Google Cloud Certified — Machine Learning Engineer
Applied Data Science Program, MIT Professional Education · JUN 2025
Mathematics for Machine Learning, Imperial College London · NOV 2024
Career Essentials in Data Analysis, Microsoft · JUN 2025
Generative AI in Software Development, Microsoft · JUN 2025
Microsoft Azure AI Fundamentals · JUN 2025
Certified Blockchain & Finance Professional™ · FEB 2020

Extracurricular

The Agentmakers: collaboration & knowledge transfer in agentic AI · 2026 – present
Academic Mentor, University of Pittsburgh: support for assigned freshmen · 2025 – present
Code for Germany, Open Knowledge Foundation DE: open-source projects · 2024 – present
Languages
German · NATIVE
English · C2
Japanese · BASIC
Off the clock
Reading | Sci-Fi | Programming | Swimming | Gaming | Traveling

Born 15 May 1998. A long-standing fascination with science fiction is part of what drew me to AI in the first place — and it keeps me curious about where these systems are headed.