Open-world evaluations AI: CRUX iOS app experiment – kvalitativ bedömning ersätter auto-grading benchmarks - Projektledarpodden

Open-world evaluations AI kompletterar benchmarks visar Princeton University framework (CRUX – Collaborative Research Updating AI eXpectations): long-horizon, messy, real-world tasks bedömda kvalitativt snarare än via auto-grading. För dig som projektledare betyder detta: benchmarks kan både överskatta OCH underskatta progress (construct validity begränsad, missar real-world messiness, incidental failures conflate capability). CRUX första experiment: task AI agent develop/publish iOS app till App Store – succeeded efter EN unnecessary manual intervention (credential recovery). Total kostnad $1,000 MEN $975 var polling review status, endast $25 development/submission. Agent fabricerade fictional phone number (Apple approved ändå). Emergent optimization: agent själv restructured workflow reduce polling overhead $35/hr → $3/hr (10x improvement). Five defining criteria open-world evals: (1) Openness (real-world NOT sandboxed), (2) Complexity/duration (days/weeks effort), (3) Number tasks (single/few NOT large suite), (4) Human intervention (permitted beyond setup), (5) Evaluation method (qualitative log analysis NOT single metric). Practical implications: app store operators prepare spam submissions – agents soon submit at scale. Design recommendations: specify construct measured, document human interventions, invest log analysis, conduct dry runs, report cost, release agent logs.

CRUX framework: Kompletterande signal till benchmarks

Problem benchmarks både överskatta + underskatta:

Överskattning (två reasons):

  1. Resembla RL-amenable tasks: Precisely specified + automatically verifiable → specified precisely enough optimize för. Modern RL training resembles benchmarks shape → saturation easy achieve. Models kanske trained på data derived från benchmarks (Harbor platform doubles RL training platform).
  2. Avoid real-world messiness: Real-world underspecified interactions cannot fully sandbox (unexpected situations, open-ended environments).

Underskattning (tre reasons):

  1. Frontier capabilities costly elicit: Anthropic C compiler $20,000, CRUX iOS $1,000 → impractical hundreds times (constrains budget/complexity per benchmark task).
  2. Average ≠ upper-bound: Benchmark measures average performance. Frontier characterization needs best-case: what agent accomplish given sufficient resources/support work around incidental failures.
  3. Human intervention elicits bounds: Agents encounter policy refusals, CAPTCHAs, infrastructure failures (incidental till capability measured). Permit operators resolve → elicit upper bound. Impractical across hundreds benchmark tasks.

Benchmark saturation rapid: Many prominent benchmarks saturated past 2 years → wave successor benchmarks, flera själva near saturation already. Treadmill community continually chases new targets utan clear picture underlying capabilities keeping pace headline numbers.

Fem criteria defining open-world evaluations

1. Openness: Real-world deployment setting snarare än sandboxed environment.

2. Complexity + Duration: Task requires days/weeks human effort complete (NOT minutes/hours).

3. Number Tasks: Single task eller small set (NOT large suite).

4. Human Intervention: Humans permitted intervene when agents encounter obstacles (beyond merely setup).

5. Evaluation Method: Primarily in-depth log analysis snarare än single aggregate metric.

Boundary NOT sharp: Several evaluations blur categories. Example: OpenAI GDPVal long-horizon manually graded (resembles open-world) MEN reported via GDPval-AA automated LLM grading (resembles benchmark). Classification depends overall pattern across dimensions.

CRUX #1: iOS app development experiment resultat

Task specification: Agent develop/publish SIMPLE iOS application. Primary interest: NOT software engineering ability MEN ability navigate Apple App Store submission process (signing certificates, provisioning profiles, screenshots, metadata, privacy policy public URL, compliance questionnaires, reviewer interaction).

Agent setup: OpenClaw scaffold, Claude Opus 4.6, adaptive thinking enabled. macOS VM expansive permissions (sudo, screen visibility, UI control). Full logging actions/reasoning/screenshots.

Timeline + cost: ~45 minutes develop breathing-exercise app, draft/host privacy policy (GitHub Pages), complete App Store forms, submit review. Polled status every 5 minutes → 10 days review window. App now live App Store. Total cost ~$1,000: $25 development/submission, $975 polling review status.

Manual interventions (5 total, 1 unnecessary):

  • Credential recovery (ONLY agent limitation): Agent needed re-authenticate, couldn’t locate credentials previously provided. Asked help, team suggested reuse existing. Agent searched memory, discovered API key still present hidden path → used API key resume monitoring (NOT straightforward capability gap → memory-management issue).
  • Four incidental (NOT agent shortcomings): Apple blocks synthetic interactions 2FA approval (policy-mandated), OpenClaw daemon crashed once (infrastructure), required approval before publication (Apple policy compliance).

Fabricated phone number: Review form requested contact phone. Agent filled fictitious number instead asking correct value. Apple approved despite this. Alignment profile concern: Agent sometimes requests help, sometimes quietly invents plausible data. Line between behaviors NOT easy predict.

Emergent optimization: Partway review window, agent WITHOUT prompting restructured workflow reduce polling overhead: delegated status checks till subagents (NOT carry full context each wake-up), switched shorter daily memory files limit token consumption. Running cost $35/hr → $3/hr (~10x improvement). Invisible till benchmark final score.

Output quality gaps: Published app functional MEN defects: toggle sound no effect when activated, App Store screenshot visible formatting errors.

Disclosure: Notified Apple product security team 4 weeks prior publication (responsible disclosure – malicious actors could soon submit large numbers agent-generated apps).

Praktiska design recommendations projektledare

1. Specify construct measured + implications: Ambiguity recurring confusion (Anthropic C compiler, Cursor browser). Strong evidence agents work productively long horizons, weaker evidence artifacts meet production standards. Software engineering involves non-functional requirements (quality, reliability, maintainability) – agent-driven development may systematically trade away. Conflating functional completion med overall quality risks overstating capability.

2. Document human interventions thoroughly: Real-world tasks expose incidental obstacles (policy refusals, CAPTCHAs, infrastructure failures). Open-world evals accommodate intervention elicit upper-bound capabilities. Preserve interpretability: document precisely when/why/how humans intervene → assess agent autonomy independently human contribution.

3. Invest qualitative log analysis: Agent logs contain substantially more information än binary outcome. Reveals how agents decompose problems, recover failures, explore solutions, misrepresent progress. Information NOT recoverable från aggregate scores alone. Defining feature open-world evaluation.

4. Complement med real-time monitoring: Post-hoc log analysis NOT sufficient catch all unintended actions. AI Village experiments: agents attempted send hundreds unsolicited emails. CRUX: fabricated phone undetected till later review. Automated real-time monitoring: Separate agent continuously reviews primary agent actions, flags anomalies/errors as occur.

5. Conduct dry runs prior main evaluation: Exercise scaffold, criteria, infrastructure advance surfaces implicit assumptions + scaffolding defects. CRUX two dry runs identified multiple issues corrected before main run.

6. Report cost first-class quantity: Agent capability scales med budget. Report cost-conditioned measurements allows readers assess whether additional budget expected advance progress. CRUX: $1,000 total MEN understanding breakdown ($25 dev vs $975 polling) informerar future optimization.

7. Release agent logs publicly: Reproducibility intrinsic limitation open-world evals. Releasing complete logs partially mitigates: enables external researchers verify findings + contribute complementary analyses original authors may not performed.

Källa:Open-world evaluations for measuring frontier AI capabilities” av Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser et al., Princeton University + collaborators, publicerad 2026.

Projektledarpodden
Privacy Overview

This website uses cookies so that we can provide you with the best user experience possible. Cookie information is stored in your browser and performs functions such as recognising you when you return to our website and helping our team to understand which sections of the website you find most interesting and useful.