Agentic skills benchmarking is misaligned with real-world work, shows a systematic framework from Carnegie Mellon University: 43 benchmarks and 72,342 tasks mapped to O*NET (1,016 US occupations). For you as a project manager this means: agent development is concentrated in the Computer/Mathematical domain (only 7.6% of US employment), while Management, Legal, and Architecture (heavily digitized at 88%/70%/71%) are underrepresented (1.4%/0.3%/0.7% benchmark coverage). At the skill level, an overfocus on Getting Information and Working with Computers (<5% of employment combined) crowds out Interacting with Others, which pervades most jobs. An autonomy analysis over 10,288 trajectories uses a workflow-based complexity measure (number of granular steps) to define a capability boundary: the maximum complexity an agent reliably completes at or above a success threshold. Coding benchmarks show framework (OpenHands > SWE-agent) and LM (Claude > GPT) advantages at medium complexity. Three design principles to better capture socially important work: (1) domain/skill coverage aligned with the employment/capital distribution, (2) task realism and complexity (cross-domain structure, procedural breadth), (3) granular evaluation of autonomy levels, NOT binary automation/augmentation. Practical guidance: decompose complex user tasks into simpler instructions until the desired agent success rate is achieved – for SWE-bench, "implement RL algo" is broken down into subtasks that fall within the agent's autonomy threshold.
A systematic framework: mapping agent benchmarks to O*NET
43 benchmarks aggregated (72,342 total tasks, 10,288 sampled):
- Inclusion criteria: (i) agentic (interactive environment, observation-action loop), (ii) work-related (tasks correspond to real-world activities)
- Categories: General Digital Work, Web/Mobile Navigation, Information Planning, Software Engineering, Science, Social, Physical
- Examples (task counts in parentheses): TheAgentCompany (175), GDPval (220 – highest coverage at 47.8% domain / 58.5% skill), WorkArena (16,833), SWE-bench (2,294), GAIA (466)
O*NET's dual taxonomy structure:
Domain taxonomy (D_T): 23 job families, 743 computer-use occupations, 5,806 task descriptions. Hierarchical: industry (e.g., Business/Financial Operations) → occupation (Accountants, Budget Analysts) → concrete tasks (prepare adjusting journal entries). Employment and median salary data come from the US Bureau of Labor Statistics (BLS).
Skill taxonomy (S_T): 4 general categories (Information Input, Interacting with Others, Mental Processes, Work Output) → 9 intermediate → 41 detailed work activities. A skill is a concrete sequence of actions that achieves a goal. Employment/capital estimates are weighted by O*NET activity-level importance scores.
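How that weighting could look in practice: a minimal sketch assuming a proportional allocation of each occupation's employment across its activities (the exact normalization is not specified in the summary; all numbers are made-up placeholders):

```python
# Sketch: attribute employment to detailed work activities by weighting
# occupation-level employment with O*NET importance scores.
# All numbers below are made-up placeholders.

occupations = {
    # occupation: (employment, {activity: importance score, 1-5})
    "Accountants":     (1_400_000, {"Getting Information": 4.2,
                                    "Working with Computers": 4.6}),
    "Budget Analysts": (50_000,    {"Getting Information": 4.5,
                                    "Working with Computers": 4.0}),
}

def activity_employment(occupations):
    """Split each occupation's employment across activities in proportion
    to its normalized importance scores, then aggregate per activity."""
    totals = {}
    for employment, scores in occupations.values():
        norm = sum(scores.values())
        for activity, score in scores.items():
            totals[activity] = totals.get(activity, 0.0) + employment * score / norm
    return totals

print(activity_employment(occupations))
```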
LLM-based mapping (GPT-5): each benchmark example is mapped to paths in D_T ∪ S_T. Coverage = the percentage of unique paths in the taxonomy tree that are covered. 91.2%/95.5% of mappings succeed for domain/skill. Manual validation: 90.9%/89.3% agreement rates (2 independent annotators).
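The coverage metric itself reduces to a set computation once examples are mapped to taxonomy paths. A minimal sketch (the path tuples are illustrative, not the actual O*NET node names):

```python
# Sketch: coverage = share of unique taxonomy-tree paths hit by a benchmark.
# Paths are tuples like (general category, detailed work activity).

def coverage(mapped_paths, taxonomy_paths):
    """Fraction of unique taxonomy paths covered by a benchmark's examples."""
    return len(set(mapped_paths) & taxonomy_paths) / len(taxonomy_paths)

taxonomy = {("Information Input", "Getting Information"),
            ("Work Output", "Working with Computers"),
            ("Interacting with Others", "Communicating with Supervisors")}
mapped = [("Work Output", "Working with Computers"),
          ("Work Output", "Working with Computers")]  # duplicates count once
print(f"coverage = {coverage(mapped, taxonomy):.1%}")  # coverage = 33.3%
```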
Coverage-aware sampling: large homogeneous benchmarks are sampled in batches of size 5 until the coverage increase per batch falls below Δ = 0.1. This saves cost while maintaining representativeness.
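A sketch of the stopping rule, assuming a hypothetical `map_to_paths` function standing in for the LLM-based mapping and interpreting Δ = 0.1 as a fraction of all taxonomy paths:

```python
import random

def coverage_aware_sample(examples, taxonomy_paths, map_to_paths,
                          batch_size=5, delta=0.1):
    """Sample batches of examples until the coverage gain per batch
    drops below delta. `map_to_paths` stands in for the LLM mapping;
    `taxonomy_paths` is the set of all paths in the taxonomy tree."""
    pool = list(examples)
    random.shuffle(pool)
    sampled, covered, prev_cov = [], set(), 0.0
    while pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        sampled.extend(batch)
        for example in batch:
            covered.update(map_to_paths(example))
        cov = len(covered & taxonomy_paths) / len(taxonomy_paths)
        if cov - prev_cov < delta:  # coverage now increasing slower than delta
            break
        prev_cov = cov
    return sampled
```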
Domain mismatch: programming-centric vs. the real-world distribution
Overwhelming concentration in Computer/Mathematical:
- Benchmark effort: ~70% of examples
- US employment: only 7.6%
- Capital distribution: the domain's share of capital is similarly small, so the mismatch persists there too
- Reason: software can perform tasks across many domains, BUT general-purpose benchmarks do NOT capture domain-specific nuances
Missed opportunities in heavily digitized domains:
- Management: 88% digital work, 1.4% benchmark coverage, highest capital concentration
- Legal: 70% digital, 0.3% coverage
- Architecture/Engineering: 71% digital, 0.7% coverage
- These domains combine high digitization with high economic value BUT are sparsely covered
Disconnect between benchmark focus and economic impact:
- Economically valuable domains (Management) are underrepresented
- Low-paying, labor-intensive domains (Personal Care/Service) are likewise underexplored
- Benchmarking is driven by methodological convenience (tasks readily specified in natural language, easily verifiable rewards), NOT by alignment with employment structure or economic value
Skill mismatch: a narrow focus on activities covering <5% of employment
Human work requires a diverse skill mix, balanced across Information Input, Mental Processes, Interacting with Others, and Work Output – NO single category dominates. This reflects the multifaceted nature of real work (coordinating multiple activities, NOT exercising narrow capabilities).
Agent benchmarks are concentrated in two granular skills:
- Getting Information (Information Input): 3.1% of employment
- Working with Computers (Work Output): 2.4% of employment
- Combined: <5% of employment – a heavily imbalanced distribution
Two distinct distortions:
- Uneven allocation within categories: development effort focuses on a few easily benchmarked activities and neglects others at the same abstraction level
- Crowding out of entire high-level categories: Interacting with Others receives minimal coverage despite pervading a wide range of occupations
Autonomy levels: a workflow-based complexity measure
Task complexity definition (following Wood's 1986 decomposition): the number and organization of distinct skills/procedural steps required to complete a task. Approximated by the number of most-granular workflow steps induced from agent trajectories.
Workflow induction procedure: segments an agent trajectory of low-level actions (e.g., clicks) into a hierarchical workflow of goal-directed steps at increasing granularity. This yields a consistent representation of task structure across heterogeneous trajectories, as sketched below.
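A minimal data-structure sketch of an induced workflow and the resulting complexity count (the induction itself is model-driven; the fields and example task here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    """One goal-directed step; leaves are the most granular steps."""
    description: str
    substeps: list["WorkflowStep"] = field(default_factory=list)

def complexity(step: WorkflowStep) -> int:
    """Task complexity ~ number of most-granular (leaf) workflow steps."""
    if not step.substeps:
        return 1
    return sum(complexity(s) for s in step.substeps)

# A low-level click/edit trajectory induced into a two-level workflow:
task = WorkflowStep("fix failing test", [
    WorkflowStep("locate the failure", [WorkflowStep("run test suite"),
                                        WorkflowStep("read traceback")]),
    WorkflowStep("apply a patch",      [WorkflowStep("edit function"),
                                        WorkflowStep("re-run tests")]),
])
print(complexity(task))  # 4 most-granular steps
```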
Validation via pairwise comparisons across 10 pairs of adjacent complexity levels: 82.6% satisfy the relative-granularity criterion (level n+1 is more complex than level n).
Agent autonomy definition: the maximum task complexity the agent can complete end-to-end above a predefined success-rate threshold, with statistical confidence. This is a capability boundary – how complex a task the agent can reliably handle without human assistance.
Autonomy = max { c : SR(c) ≥ τ }, where SR(c) is the success rate at complexity level c and τ is the threshold (e.g., 80%).
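A sketch of the autonomy computation from per-level outcomes; the summary does not specify the confidence procedure, so a Wilson lower bound on SR(c) is assumed here:

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / (1 + z**2 / n)

def autonomy(outcomes_by_complexity, tau=0.8):
    """Autonomy = max { c : SR(c) >= tau with statistical confidence }.
    `outcomes_by_complexity` maps complexity level -> list of 0/1 outcomes."""
    qualifying = [
        c for c, outcomes in outcomes_by_complexity.items()
        if wilson_lower_bound(sum(outcomes), len(outcomes)) >= tau
    ]
    return max(qualifying, default=0)

# Made-up outcomes: success degrades as complexity grows.
results = {2: [1] * 48 + [0] * 2, 4: [1] * 46 + [0] * 4, 6: [1] * 25 + [0] * 25}
print(autonomy(results))  # -> 4: the highest level whose SR clears tau
```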
Practical insights for project managers
1. Domain prioritization and real-world alignment: Management/Legal/Architecture are heavily digitized BUT underrepresented in benchmarks. For organizations in these domains, agent capabilities may NOT match what benchmarks suggest. Conduct domain-specific evaluations before deployment.
2. Skill-gap awareness: agents are strong at self-contained activities (mental processes, work output) BUT struggle with information retrieval and coordinating with others. For tasks requiring interpersonal interaction (which pervades most jobs): expect lower autonomy and design hybrid workflows.
3. Framework and LM selection at medium complexity: the SWE-bench comparison shows OpenHands > SWE-agent and Claude > GPT for medium-complexity coding. BUT these trends may NOT generalize – a broader release of trajectories is needed for systematic comparisons.
4. Decompose tasks to the autonomy threshold: given a user task and a desired performance level, identify the agent's autonomy level for the required skills/domain. If the task's complexity exceeds that boundary, decompose it into simpler subtasks the agent can reliably complete. Example: "implement RL algorithm" → break down into "implement rollout loop", "add reward calculation", etc., until each subtask matches the agent's SR threshold (see the sketch after this list).
5. Calibrate oversight to complexity: success rates drop sharply as task complexity grows. In software engineering, success falls rapidly beyond a complexity of ~6 steps. Design human-in-the-loop checkpoints aligned with these complexity thresholds.
6. Three benchmark design principles:
- Coverage: align the domain/skill distribution with employment/capital (NOT methodological convenience)
- Realism: include cross-domain structure (8.5% of examples span >3 domains) and procedural breadth (32.6% of examples involve ≥4 skills)
- Granular evaluation: measure autonomy levels across the complexity spectrum (NOT binary automation/augmentation)
7. Recognize that representativeness ≠ relevance: assigning a task to a domain/skill establishes relevance to real work BUT NOT representativeness. Many benchmark tasks are simplified versions that omit contextual/procedural details. Evaluate whether benchmark tasks capture the realistic scope of practice.
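For point 4, the decomposition loop sketched as Python; `estimate_complexity` and `split` are hypothetical helpers (a planner model or a human lead doing the breakdown), not anything from the paper:

```python
def decompose_to_autonomy(task, autonomy_level, estimate_complexity, split):
    """Recursively split a task until every leaf subtask's estimated
    complexity (in workflow steps) is within the agent's autonomy level.
    `estimate_complexity` and `split` are hypothetical stand-ins."""
    if estimate_complexity(task) <= autonomy_level:
        return [task]
    subtasks = []
    # e.g., "implement RL algorithm" ->
    # ["implement rollout loop", "add reward calculation", ...]
    for sub in split(task):
        subtasks.extend(
            decompose_to_autonomy(sub, autonomy_level, estimate_complexity, split))
    return subtasks
```

Each resulting leaf then stays below the complexity level where success rates collapse (point 5), which also marks natural human-in-the-loop checkpoints.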
Source: "How Well Does Agent Development Reflect Real-World Work?" by Zora Z. Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang et al., Carnegie Mellon University and Stanford University, published 6 March 2026.
