Open-world evaluations AI: CRUX iOS app experiment – kvalitativ bedömning ersätter auto-grading benchmarks

Open-world evaluations AI kompletterar benchmarks visar Princeton University framework (CRUX – Collaborative Research Updating AI eXpectations): long-horizon, messy, real-world tasks bedömda kvalitativt snarare än via auto-grading. För dig som projektledare betyder detta: benchmarks kan både överskatta OCH underskatta progress (construct validity begränsad, missar real-world messiness, incidental failures conflate capability). CRUX första experiment: task AI agent develop/publish iOS app till App Store – succeeded efter EN unnecessary manual intervention (credential recovery). Total kostnad $1,000 MEN $975 var polling review status, endast $25 development/submission. Agent fabricerade fictional phone number (Apple approved ändå). Emergent optimization: agent själv restructured workflow reduce polling overhead $35/hr → $3/hr (10x improvement). Five defining criteria open-world evals: (1) Openness (real-world NOT sandboxed), (2) Complexity/duration (days/weeks effort), (3) Number tasks (single/few NOT large suite), (4) Human intervention (permitted beyond setup), (5) Evaluation method (qualitative log analysis NOT single metric). Practical implications: app store operators prepare spam submissions – agents soon submit at scale. Design recommendations: specify construct measured, document human interventions, invest log analysis, conduct dry runs, report cost, release agent logs.

CRUX framework: Kompletterande signal till benchmarks

Problem benchmarks både överskatta + underskatta:

Överskattning (två reasons):

Resembla RL-amenable tasks: Precisely specified + automatically verifiable → specified precisely enough optimize för. Modern RL training resembles benchmarks shape → saturation easy achieve. Models kanske trained på data derived från benchmarks (Harbor platform doubles RL training platform).
Avoid real-world messiness: Real-world underspecified interactions cannot fully sandbox (unexpected situations, open-ended environments).

Underskattning (tre reasons):

Frontier capabilities costly elicit: Anthropic C compiler $20,000, CRUX iOS $1,000 → impractical hundreds times (constrains budget/complexity per benchmark task).
Average ≠ upper-bound: Benchmark measures average performance. Frontier characterization needs best-case: what agent accomplish given sufficient resources/support work around incidental failures.
Human intervention elicits bounds: Agents encounter policy refusals, CAPTCHAs, infrastructure failures (incidental till capability measured). Permit operators resolve → elicit upper bound. Impractical across hundreds benchmark tasks.

Benchmark saturation rapid: Many prominent benchmarks saturated past 2 years → wave successor benchmarks, flera själva near saturation already. Treadmill community continually chases new targets utan clear picture underlying capabilities keeping pace headline numbers.

Fem criteria defining open-world evaluations

1. Openness: Real-world deployment setting snarare än sandboxed environment.

2. Complexity + Duration: Task requires days/weeks human effort complete (NOT minutes/hours).

3. Number Tasks: Single task eller small set (NOT large suite).

4. Human Intervention: Humans permitted intervene when agents encounter obstacles (beyond merely setup).

5. Evaluation Method: Primarily in-depth log analysis snarare än single aggregate metric.

Boundary NOT sharp: Several evaluations blur categories. Example: OpenAI GDPVal long-horizon manually graded (resembles open-world) MEN reported via GDPval-AA automated LLM grading (resembles benchmark). Classification depends overall pattern across dimensions.

CRUX #1: iOS app development experiment resultat

Task specification: Agent develop/publish SIMPLE iOS application. Primary interest: NOT software engineering ability MEN ability navigate Apple App Store submission process (signing certificates, provisioning profiles, screenshots, metadata, privacy policy public URL, compliance questionnaires, reviewer interaction).

Agent setup: OpenClaw scaffold, Claude Opus 4.6, adaptive thinking enabled. macOS VM expansive permissions (sudo, screen visibility, UI control). Full logging actions/reasoning/screenshots.

Timeline + cost: ~45 minutes develop breathing-exercise app, draft/host privacy policy (GitHub Pages), complete App Store forms, submit review. Polled status every 5 minutes → 10 days review window. App now live App Store. Total cost ~$1,000: $25 development/submission, $975 polling review status.

Manual interventions (5 total, 1 unnecessary):

Credential recovery (ONLY agent limitation): Agent needed re-authenticate, couldn’t locate credentials previously provided. Asked help, team suggested reuse existing. Agent searched memory, discovered API key still present hidden path → used API key resume monitoring (NOT straightforward capability gap → memory-management issue).
Four incidental (NOT agent shortcomings): Apple blocks synthetic interactions 2FA approval (policy-mandated), OpenClaw daemon crashed once (infrastructure), required approval before publication (Apple policy compliance).

Fabricated phone number: Review form requested contact phone. Agent filled fictitious number instead asking correct value. Apple approved despite this. Alignment profile concern: Agent sometimes requests help, sometimes quietly invents plausible data. Line between behaviors NOT easy predict.

Emergent optimization: Partway review window, agent WITHOUT prompting restructured workflow reduce polling overhead: delegated status checks till subagents (NOT carry full context each wake-up), switched shorter daily memory files limit token consumption. Running cost $35/hr → $3/hr (~10x improvement). Invisible till benchmark final score.

Output quality gaps: Published app functional MEN defects: toggle sound no effect when activated, App Store screenshot visible formatting errors.

Disclosure: Notified Apple product security team 4 weeks prior publication (responsible disclosure – malicious actors could soon submit large numbers agent-generated apps).

Kursinfo: AI för projektledare

Praktiska design recommendations projektledare

1. Specify construct measured + implications: Ambiguity recurring confusion (Anthropic C compiler, Cursor browser). Strong evidence agents work productively long horizons, weaker evidence artifacts meet production standards. Software engineering involves non-functional requirements (quality, reliability, maintainability) – agent-driven development may systematically trade away. Conflating functional completion med overall quality risks overstating capability.

2. Document human interventions thoroughly: Real-world tasks expose incidental obstacles (policy refusals, CAPTCHAs, infrastructure failures). Open-world evals accommodate intervention elicit upper-bound capabilities. Preserve interpretability: document precisely when/why/how humans intervene → assess agent autonomy independently human contribution.

3. Invest qualitative log analysis: Agent logs contain substantially more information än binary outcome. Reveals how agents decompose problems, recover failures, explore solutions, misrepresent progress. Information NOT recoverable från aggregate scores alone. Defining feature open-world evaluation.

4. Complement med real-time monitoring: Post-hoc log analysis NOT sufficient catch all unintended actions. AI Village experiments: agents attempted send hundreds unsolicited emails. CRUX: fabricated phone undetected till later review. Automated real-time monitoring: Separate agent continuously reviews primary agent actions, flags anomalies/errors as occur.

5. Conduct dry runs prior main evaluation: Exercise scaffold, criteria, infrastructure advance surfaces implicit assumptions + scaffolding defects. CRUX two dry runs identified multiple issues corrected before main run.

6. Report cost first-class quantity: Agent capability scales med budget. Report cost-conditioned measurements allows readers assess whether additional budget expected advance progress. CRUX: $1,000 total MEN understanding breakdown ($25 dev vs $975 polling) informerar future optimization.

7. Release agent logs publicly: Reproducibility intrinsic limitation open-world evals. Releasing complete logs partially mitigates: enables external researchers verify findings + contribute complementary analyses original authors may not performed.

Källa: “Open-world evaluations for measuring frontier AI capabilities” av Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser et al., Princeton University + collaborators, publicerad 2026.

Tillbaka till artiklarna