— Present
AI Evaluation, Data Quality & Software Engineering
- QA and rubric-based auditing of other contributors' datasets for function-calling and agentic-AI projects — verifying correctness, format compliance, and consistency before delivery, and enforcing quality standards within the Master Review team.
- Deployment and configuration of local model environments to run frontier models against real tasks and generate datasets; construction of training and evaluation datasets, including HFI problem sets, for frontier-model coding tasks.
- Forking and internal extension of open-source tooling (JSON support in Cerberus; multi-layer validation and error detection in Haystack), delivered as part of the dataset.
- RLHF evaluation and multi-turn prompt design for agentic-coding tasks, with pairwise comparisons across frontier models and calibrated, rubric-based scoring for correctness, reasoning, and instruction-following; systematic documentation of failure modes and edge cases.