Agentic skills benchmarking is misaligned with real-world work, shows a systematic framework from Carnegie Mellon University: 43 benchmarks and 72,342 tasks mapped to O*NET (1,016 US occupations). For you as a project manager this means: agent development is concentrated in the Computer/Mathematical domain (only 7.6% of US employment), while Management, Legal, and Architecture (heavily digitized at 88%/70%/71%) are underrepresented (1.4%/0.3%/0.7% benchmark coverage). At the skill level, an overfocus on Getting Information and Working with Computers (<5% of employment combined) crowds out Interacting with Others, which pervades most jobs. An autonomy analysis over 10,288 trajectories uses a workflow-based complexity measure (number of granular steps) to define a capability boundary: the maximum complexity an agent reliably completes at or above a success threshold. Coding benchmarks show framework (OpenHands > SWE-agent) and LM (Claude > GPT) advantages at medium complexity. Three design principles to better capture socially important work: (1) domain/skill coverage aligned with the employment/capital distribution, (2) task realism and complexity (cross-domain structure, procedural breadth), (3) granular evaluation of autonomy levels, NOT binary automation/augmentation. Practical guidance: decompose complex user tasks into simpler instructions until the desired agent success rate is achieved – for SWE-bench, "implement RL algo" is broken down into subtasks that fall within the agent's autonomy threshold.
A systematic framework: mapping agent benchmarks to O*NET
43 benchmarks aggregated (72,342 total tasks, 10,288 sampled):
- Inclusion criteria: (i) agentic (interactive environment, observation-action loop), (ii) work-related (tasks correspond to real-world activities)
- Categories: General Digital Work, Web/Mobile Navigation, Information Planning, Software Engineering, Science, Social, Physical
- Examples (task counts in parentheses): TheAgentCompany (175), GDPval (220 – highest coverage at 47.8% domain / 58.5% skill), WorkArena (16,833), SWE-bench (2,294), GAIA (466)
O*NET's dual taxonomy structure:
Domain taxonomy (D_T): 23 job families, 743 computer-use occupations, 5,806 task descriptions. Hierarchical: industry (e.g., Business/Financial Operations) → occupation (Accountants, Budget Analysts) → concrete tasks (prepare adjusting journal entries). Employment and median salary data come from the US Bureau of Labor Statistics (BLS).
Skill taxonomy (S_T): 4 general categories (Information Input, Interacting with Others, Mental Processes, Work Output) → 9 intermediate → 41 detailed work activities. A skill is a concrete sequence of actions that achieves a goal. Employment/capital estimates are weighted by O*NET activity-level importance scores.
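How that weighting could look in practice: a minimal sketch assuming a proportional allocation of each occupation's employment across its activities (the exact normalization is not specified in the summary; all numbers are made-up placeholders):

```python
# Sketch: attribute employment to detailed work activities by weighting
# occupation-level employment with O*NET importance scores.
# All numbers below are made-up placeholders.

occupations = {
    # occupation: (employment, {activity: importance score, 1-5})
    "Accountants":     (1_400_000, {"Getting Information": 4.2,
                                    "Working with Computers": 4.6}),
    "Budget Analysts": (50_000,    {"Getting Information": 4.5,
                                    "Working with Computers": 4.0}),
}

def activity_employment(occupations):
    """Split each occupation's employment across activities in proportion
    to its normalized importance scores, then aggregate per activity."""
    totals = {}
    for employment, scores in occupations.values():
        norm = sum(scores.values())
        for activity, score in scores.items():
            totals[activity] = totals.get(activity, 0.0) + employment * score / norm
    return totals

print(activity_employment(occupations))
```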
LLM-based mapping (GPT-5): each benchmark example is mapped to paths in D_T ∪ S_T. Coverage = the percentage of unique paths in the taxonomy tree that are covered. 91.2%/95.5% of mappings succeed for domain/skill. Manual validation: 90.9%/89.3% agreement rates (2 independent annotators).
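The coverage metric itself reduces to a set computation once examples are mapped to taxonomy paths. A minimal sketch (the path tuples are illustrative, not the actual O*NET node names):

```python
# Sketch: coverage = share of unique taxonomy-tree paths hit by a benchmark.
# Paths are tuples like (general category, detailed work activity).

def coverage(mapped_paths, taxonomy_paths):
    """Fraction of unique taxonomy paths covered by a benchmark's examples."""
    return len(set(mapped_paths) & taxonomy_paths) / len(taxonomy_paths)

taxonomy = {("Information Input", "Getting Information"),
            ("Work Output", "Working with Computers"),
            ("Interacting with Others", "Communicating with Supervisors")}
mapped = [("Work Output", "Working with Computers"),
          ("Work Output", "Working with Computers")]  # duplicates count once
print(f"coverage = {coverage(mapped, taxonomy):.1%}")  # coverage = 33.3%
```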
Coverage-aware sampling: large homogeneous benchmarks are sampled in batches of size 5 until the coverage increase per batch falls below Δ = 0.1. This saves cost while maintaining representativeness.
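A sketch of the stopping rule, assuming a hypothetical `map_to_paths` function standing in for the LLM-based mapping and interpreting Δ = 0.1 as a fraction of all taxonomy paths:

```python
import random

def coverage_aware_sample(examples, taxonomy_paths, map_to_paths,
                          batch_size=5, delta=0.1):
    """Sample batches of examples until the coverage gain per batch
    drops below delta. `map_to_paths` stands in for the LLM mapping;
    `taxonomy_paths` is the set of all paths in the taxonomy tree."""
    pool = list(examples)
    random.shuffle(pool)
    sampled, covered, prev_cov = [], set(), 0.0
    while pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        sampled.extend(batch)
        for example in batch:
            covered.update(map_to_paths(example))
        cov = len(covered & taxonomy_paths) / len(taxonomy_paths)
        if cov - prev_cov < delta:  # coverage now increasing slower than delta
            break
        prev_cov = cov
    return sampled
```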
Domain mismatch: programming-centric vs. the real-world distribution
Overwhelming concentration in Computer/Mathematical:
- Benchmark effort: ~70% of examples
- US employment: only 7.6%
- Capital distribution: the domain's share of capital is similarly small, so the mismatch persists there too
- Reason: software can perform tasks across many domains, BUT general-purpose benchmarks do NOT capture domain-specific nuances
Missed opportunities in heavily digitized domains:
- Management: 88% digital work, 1.4% benchmark coverage, highest capital concentration
- Legal: 70% digital, 0.3% coverage
- Architecture/Engineering: 71% digital, 0.7% coverage
- These domains combine high digitization with high economic value BUT are sparsely covered
Disconnect between benchmark focus and economic impact:
- Economically valuable domains (Management) are underrepresented
- Low-paying, labor-intensive domains (Personal Care/Service) are likewise underexplored
- Benchmarking is driven by methodological convenience (tasks readily specified in natural language, easily verifiable rewards), NOT by alignment with employment structure or economic value
Skill mismatch: a narrow focus on activities covering <5% of employment
Human work requires a diverse skill mix, balanced across Information Input, Mental Processes, Interacting with Others, and Work Output – NO single category dominates. This reflects the multifaceted nature of real work (coordinating multiple activities, NOT exercising narrow capabilities).
Agent benchmarks are concentrated in two granular skills:
- Getting Information (Information Input): 3.1% of employment
- Working with Computers (Work Output): 2.4% of employment
- Combined: <5% of employment – a heavily imbalanced distribution
Two distinct distortions:
- Uneven allocation within categories: development effort focuses on a few easily benchmarked activities and neglects others at the same abstraction level
- Crowding out of entire high-level categories: Interacting with Others receives minimal coverage despite pervading a wide range of occupations
Autonomy levels: a workflow-based complexity measure
Task complexity definition (following Wood's 1986 decomposition): the number and organization of distinct skills/procedural steps required to complete a task. Approximated by the number of most-granular workflow steps induced from agent trajectories.
Workflow induction procedure: segments an agent trajectory of low-level actions (e.g., clicks) into a hierarchical workflow of goal-directed steps at increasing granularity. This yields a consistent representation of task structure across heterogeneous trajectories, as sketched below.
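A minimal data-structure sketch of an induced workflow and the resulting complexity count (the induction itself is model-driven; the fields and example task here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    """One goal-directed step; leaves are the most granular steps."""
    description: str
    substeps: list["WorkflowStep"] = field(default_factory=list)

def complexity(step: WorkflowStep) -> int:
    """Task complexity ~ number of most-granular (leaf) workflow steps."""
    if not step.substeps:
        return 1
    return sum(complexity(s) for s in step.substeps)

# A low-level click/edit trajectory induced into a two-level workflow:
task = WorkflowStep("fix failing test", [
    WorkflowStep("locate the failure", [WorkflowStep("run test suite"),
                                        WorkflowStep("read traceback")]),
    WorkflowStep("apply a patch",      [WorkflowStep("edit function"),
                                        WorkflowStep("re-run tests")]),
])
print(complexity(task))  # 4 most-granular steps
```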
Validation via pairwise comparisons across 10 pairs of adjacent complexity levels: 82.6% satisfy the relative-granularity criterion (level n+1 is more complex than level n).
Agent autonomy definition: the maximum task complexity the agent can complete end-to-end above a predefined success-rate threshold, with statistical confidence. This is a capability boundary – how complex a task the agent can reliably handle without human assistance.
Autonomy = max { c : SR(c) ≥ τ }, where SR(c) is the success rate at complexity level c and τ is the threshold (e.g., 80%).
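A sketch of the autonomy computation from per-level outcomes; the summary does not specify the confidence procedure, so a Wilson lower bound on SR(c) is assumed here:

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / (1 + z**2 / n)

def autonomy(outcomes_by_complexity, tau=0.8):
    """Autonomy = max { c : SR(c) >= tau with statistical confidence }.
    `outcomes_by_complexity` maps complexity level -> list of 0/1 outcomes."""
    qualifying = [
        c for c, outcomes in outcomes_by_complexity.items()
        if wilson_lower_bound(sum(outcomes), len(outcomes)) >= tau
    ]
    return max(qualifying, default=0)

# Made-up outcomes: success degrades as complexity grows.
results = {2: [1] * 48 + [0] * 2, 4: [1] * 46 + [0] * 4, 6: [1] * 25 + [0] * 25}
print(autonomy(results))  # -> 4: the highest level whose SR clears tau
```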
Practical insights for project managers
1. Domain prioritization and real-world alignment: Management/Legal/Architecture are heavily digitized BUT underrepresented in benchmarks. For organizations in these domains, agent capabilities may NOT match what benchmarks suggest. Conduct domain-specific evaluations before deployment.
2. Skill-gap awareness: agents are strong at self-contained activities (mental processes, work output) BUT struggle with information retrieval and coordinating with others. For tasks requiring interpersonal interaction (which pervades most jobs): expect lower autonomy and design hybrid workflows.
3. Framework and LM selection at medium complexity: the SWE-bench comparison shows OpenHands > SWE-agent and Claude > GPT for medium-complexity coding. BUT these trends may NOT generalize – a broader release of trajectories is needed for systematic comparisons.
4. Decompose tasks to the autonomy threshold: given a user task and a desired performance level, identify the agent's autonomy level for the required skills/domain. If the task's complexity exceeds that boundary, decompose it into simpler subtasks the agent can reliably complete. Example: "implement RL algorithm" → break down into "implement rollout loop", "add reward calculation", etc., until each subtask matches the agent's SR threshold (see the sketch after this list).
5. Calibrate oversight to complexity: success rates drop sharply as task complexity grows. In software engineering, success falls rapidly beyond a complexity of ~6 steps. Design human-in-the-loop checkpoints aligned with these complexity thresholds.
6. Three benchmark design principles:
- Coverage: align the domain/skill distribution with employment/capital (NOT methodological convenience)
- Realism: include cross-domain structure (8.5% of examples span >3 domains) and procedural breadth (32.6% of examples involve ≥4 skills)
- Granular evaluation: measure autonomy levels across the complexity spectrum (NOT binary automation/augmentation)
7. Recognize that representativeness ≠ relevance: assigning a task to a domain/skill establishes relevance to real work BUT NOT representativeness. Many benchmark tasks are simplified versions that omit contextual/procedural details. Evaluate whether benchmark tasks capture the realistic scope of practice.
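For point 4, the decomposition loop sketched as Python; `estimate_complexity` and `split` are hypothetical helpers (a planner model or a human lead doing the breakdown), not anything from the paper:

```python
def decompose_to_autonomy(task, autonomy_level, estimate_complexity, split):
    """Recursively split a task until every leaf subtask's estimated
    complexity (in workflow steps) is within the agent's autonomy level.
    `estimate_complexity` and `split` are hypothetical stand-ins."""
    if estimate_complexity(task) <= autonomy_level:
        return [task]
    subtasks = []
    # e.g., "implement RL algorithm" ->
    # ["implement rollout loop", "add reward calculation", ...]
    for sub in split(task):
        subtasks.extend(
            decompose_to_autonomy(sub, autonomy_level, estimate_complexity, split))
    return subtasks
```

Each resulting leaf then stays below the complexity level where success rates collapse (point 5), which also marks natural human-in-the-loop checkpoints.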
Source: "How Well Does Agent Development Reflect Real-World Work?" by Zora Z. Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang et al., Carnegie Mellon University and Stanford University, published 6 March 2026.
