Agentic skills benchmarking: CMU maps 72,342 tasks to O*NET and finds a mismatch – the dominant domain holds only 7.6% of US employment, while Management and Legal are underrepresented – Projektledarpodden

Agentic skills benchmarking is misaligned with real-world work, according to a systematic framework from Carnegie Mellon University: 43 benchmarks with 72,342 tasks mapped to O*NET (1,016 US occupations). For you as a project manager, this means: agent development is concentrated in the Computer/Mathematical domain (only 7.6% of US employment), while Management, Legal, and Architecture/Engineering – heavily digitized at 88%/70%/71% – are underrepresented (1.4%/0.3%/0.7% benchmark coverage). At the skill level, an overfocus on Getting Information and Working with Computers (together under 5% of employment) crowds out Interacting with Others, which pervades most jobs. An autonomy analysis of 10,288 trajectories uses a workflow-based complexity measure (number of granular steps) to define a capability boundary: the maximum complexity at which an agent reliably completes tasks at or above a success threshold. Coding benchmarks show framework (OpenHands > SWE-agent) and LM (Claude > GPT) advantages at medium complexity. Three design principles would better capture socially important work: (1) domain/skill coverage aligned with the distribution of employment and capital, (2) task realism and complexity (cross-domain structure, procedural breadth), (3) granular evaluation of autonomy levels rather than a binary automation/augmentation label. Practical guidance: decompose complex user tasks into simpler instructions until the desired agent success rate is reached – e.g., break "implement RL algo" on SWE-bench into subtasks that fall within the agent's autonomy threshold.

A systematic framework: mapping agent benchmarks to O*NET

43 benchmarks aggregated (72,342 tasks in total, 10,288 sampled):

  • Inclusion criteria: (i) agentic (interactive environment, observation–action loop), (ii) work-related (tasks correspond to real-world activities)
  • Categories: General Digital Work, Web/Mobile Navigation, Information Planning, Software Engineering, Science, Social, Physical
  • Examples: TheAgentCompany (175 tasks), GDPval (220 – highest coverage at 47.8% domain / 58.5% skill), WorkArena (16,833), SWE-bench (2,294), GAIA (466)

O*NET dual taxonomy structure:

Domain taxonomy (D_T): 23 job families, 743 occupations involving computer use, 5,806 task descriptions. Hierarchical: industry (e.g., Business/Financial Operations) → occupation (Accountants, Budget Analysts) → concrete tasks (prepare adjusting journal entries). Employment and median salary data come from the US Bureau of Labor Statistics (BLS).

Skill taxonomy (S_T): 4 general categories (Information Input, Interacting with Others, Mental Processes, Work Output) → 9 intermediate groups → 41 detailed activities. A skill is a concrete sequence of actions that achieves a goal. Employment and capital estimates are weighted by O*NET activity-level importance scores.
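
The importance-weighted skill estimate can be sketched as follows, assuming each occupation's employment is attributed to activities in proportion to its O*NET importance scores. This is a simplified reading of the weighting; the paper's exact formula may differ, and all numbers below are illustrative.

```python
# Hypothetical sketch: split each occupation's employment across its
# activities in proportion to importance scores, then sum per skill.

def skill_employment_share(occupations, skill):
    """Fraction of total employment attributed to one skill."""
    total = sum(o["employment"] for o in occupations)
    weighted = 0.0
    for o in occupations:
        score_sum = sum(o["importance"].values())
        weighted += o["employment"] * o["importance"].get(skill, 0) / score_sum
    return weighted / total

# made-up occupations with made-up importance scores
occupations = [
    {"employment": 100,
     "importance": {"Getting Information": 2, "Working with Computers": 2}},
    {"employment": 300,
     "importance": {"Getting Information": 1, "Interacting with Others": 3}},
]
print(skill_employment_share(occupations, "Getting Information"))  # 0.3125
```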

LLM-based mapping (GPT-5): each benchmark example is mapped to paths in D_T ∪ S_T. Coverage = the percentage of unique taxonomy-tree paths covered. Mappings succeed for 91.2%/95.5% of examples (domain/skill). Manual validation with two independent annotators shows 90.9%/89.3% agreement rates.
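
The coverage metric itself is simple; a minimal sketch, assuming coverage is just the fraction of unique taxonomy paths touched by at least one mapped example (path strings and data below are illustrative, not from the paper's release):

```python
# Coverage = unique taxonomy paths hit by the benchmark / all paths in the tree.

def coverage(mapped_paths, taxonomy_paths):
    """Share of taxonomy paths covered by at least one benchmark example."""
    covered = set(mapped_paths) & set(taxonomy_paths)
    return len(covered) / len(taxonomy_paths)

domain_taxonomy = {
    "Business/Financial > Accountants > prepare journal entries",
    "Computer/Math > Software Developers > fix bug",
    "Legal > Lawyers > draft contract",
    "Management > Managers > plan budget",
}
benchmark_mappings = [
    "Computer/Math > Software Developers > fix bug",
    "Computer/Math > Software Developers > fix bug",   # duplicates count once
    "Business/Financial > Accountants > prepare journal entries",
]

print(coverage(benchmark_mappings, domain_taxonomy))   # 0.5
```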

Coverage-aware sampling: large homogeneous benchmarks are sampled in batches of 5 until the coverage increase falls below Δ = 0.1. This saves cost while maintaining representativeness.
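
The sampling loop can be sketched as below. The `map_to_paths` function is a stand-in for the paper's LLM-based mapper, and the stopping rule is one plausible reading of "until coverage increases slower than Δ":

```python
import random

def coverage_aware_sample(examples, map_to_paths, taxonomy, batch=5, delta=0.1):
    """Sample batches of examples until the marginal coverage gain < delta."""
    pool = list(examples)
    random.shuffle(pool)                  # homogeneous pool: order-agnostic
    sampled, covered = [], set()
    while pool:
        chunk, pool = pool[:batch], pool[batch:]
        new_paths = set().union(*(map_to_paths(x) for x in chunk))
        gain = len(new_paths - covered) / len(taxonomy)
        sampled.extend(chunk)
        covered |= new_paths
        if gain < delta:                  # coverage curve has flattened: stop
            break
    return sampled
```

With a perfectly homogeneous benchmark (every example maps to the same path), the loop stops after the second batch, since the second batch adds no new paths.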

Domain mismatch: programming-centric benchmarks vs the real-world distribution

An overwhelming concentration in Computer/Mathematical:

  • Benchmark effort: ~70% of examples
  • US employment: only 7.6%
  • Capital distribution: a similarly small share
  • Reason: software can perform tasks across domains, but general-purpose benchmarks fail to capture domain-specific nuances

Missed opportunities in heavily digitized domains:

  • Management: 88% digital work, 1.4% benchmark coverage, highest capital concentration
  • Legal: 70% digital, 0.3% coverage
  • Architecture/Engineering: 71% digital, 0.7% coverage
  • These domains combine high digitization and economic value, yet remain sparsely covered

A disconnect between benchmark focus and economic impact:

  • Economically valuable domains (e.g., Management) are underrepresented
  • Low-paying, labor-intensive domains (Personal Care/Service) are likewise underexplored
  • Benchmarking is driven by methodological convenience (tasks easily specified in natural language, easily verifiable rewards) rather than alignment with employment structure or economic value

Skill mismatch: a narrow focus covering <5% of employment

Human work requires a diverse skill mix, balanced across Information Input, Mental Processes, Interacting with Others, and Work Output – no single category dominates. This reflects the multifaceted nature of real work: coordinating multiple activities rather than exercising narrow capabilities.

Agent benchmarks concentrate on two granular skills:

  • Getting Information (Information Input): 3.1% of employment
  • Working with Computers (Work Output): 2.4% of employment
  • Combined, under 5% of employment – a heavily imbalanced distribution

Two distinct distortions:

  1. Uneven allocation within categories: development effort focuses on a few easily benchmarked activities and neglects others at the same abstraction level
  2. Crowding out of entire high-level categories: Interacting with Others receives minimal coverage despite pervading a wide range of occupations

Autonomy levels: a workflow-based complexity measure

Task complexity definition (following Wood's 1986 decomposition): the number and organization of distinct skills/procedural steps required to complete a task, approximated by the number of most-granular workflow steps induced from agent trajectories.

Workflow induction procedure: an agent trajectory of low-level actions (e.g., clicks) is segmented into a hierarchical workflow of goal-directed steps at increasing granularity. This yields a consistent representation of task structure across heterogeneous trajectories.

Validation via pairwise comparisons: across 10 pairs of adjacent complexity levels, 82.6% satisfy the relative-granularity criterion (level n+1 is more complex than level n).

Agent autonomy definition: the maximum task complexity an agent can complete end-to-end above a predefined success-rate threshold with statistical confidence – a capability boundary for how complex a task the agent can reliably handle without human assistance.

Autonomy = max { c | SR(c) ≥ τ }, where SR(c) is the success rate at complexity level c and τ is the success-rate threshold (e.g., 80%).

Practical insights for project managers

1. Domain prioritization and real-world alignment: Management, Legal, and Architecture are heavily digitized but underrepresented in benchmarks. For organizations in these domains, agent capabilities may not match what benchmarks suggest – conduct domain-specific evaluations before deployment.

2. Skill-gap awareness: agents are strong on self-contained activities (mental processes, work output) but struggle with information retrieval and coordinating with others. For tasks requiring interpersonal interaction – which pervades most jobs – expect lower autonomy and design hybrid workflows.

3. Framework and LM selection at medium complexity: the SWE-bench comparison shows OpenHands > SWE-agent and Claude > GPT for medium-complexity coding. These trends may not generalize, however – broader trajectory releases are needed for systematic comparisons.

4. Decompose tasks down to the autonomy threshold: given a user task and a desired performance level, identify the agent's autonomy level for the required skills and domain. If the task's complexity exceeds that boundary, decompose it into simpler subtasks the agent can reliably complete. Example: "implement RL algorithm" → break down into "implement rollout loop", "add reward calculation", etc., until each subtask falls within the agent's success-rate threshold.
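
This decomposition loop can be sketched as below. The `decompose` function stands in for a human or planner model breaking one task into smaller ones, and measuring a subtask's complexity in workflow steps is an assumption for illustration – none of these names come from the paper.

```python
# Split a task until every piece fits within the agent's autonomy level.

def plan_for_agent(task_complexity, agent_autonomy, decompose):
    """Return subtask complexities, each within the agent's autonomy."""
    queue, plan = [task_complexity], []
    while queue:
        c = queue.pop()
        if c <= agent_autonomy:
            plan.append(c)              # agent completes this piece reliably
        else:
            queue.extend(decompose(c))  # too complex: split further
    return plan

halve = lambda c: [c // 2, c - c // 2]  # toy splitter: halve the step count
print(plan_for_agent(20, 6, halve))     # [5, 5, 5, 5]
```

A 20-step task handed to an agent with a 6-step autonomy boundary thus becomes four 5-step subtasks, each of which the agent can be expected to complete at or above the success threshold.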

5. Calibrate oversight by complexity: success rates drop sharply as task complexity grows. In software engineering, success falls rapidly beyond roughly 6 workflow steps. Design human-in-the-loop checkpoints aligned with these complexity thresholds.

6. Three benchmark design principles:

  • Coverage: align the domain/skill distribution with employment and capital (not methodological convenience)
  • Realism: include cross-domain structure (8.5% of examples span >3 domains) and procedural breadth (32.6% of examples require ≥4 skills)
  • Granular evaluation: measure autonomy levels across the complexity spectrum (not a binary automation/augmentation label)

7. Recognize that relevance ≠ representativeness: assigning a task to a domain/skill establishes work relevance but not representativeness. Many benchmarks are simplified versions that omit contextual and procedural details. Evaluate whether the benchmark's tasks capture the realistic scope of practice.

Source: "How Well Does Agent Development Reflect Real-World Work?" by Zora Z. Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang et al., Carnegie Mellon University and Stanford University, published March 6, 2026.
