Cross-cutting dimension

Agentic evaluation

How models navigate, not just what they answer. Agentic evaluation measures decision-making in state-aware mock environments, where the environment itself keeps score. Every tool call counts as one stroke, and fewer strokes to reach the goal indicate more efficient decision-making.

Environment as judge

State graphs track every action. No self-reporting, no prompt-based grading — the environment determines success.
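As a minimal sketch of the idea, assuming a state graph can be modeled as a table of allowed transitions: the class name `MockEnvironment`, the tool names, and the states below are hypothetical and not taken from the actual benchmark. The point is only that the environment counts strokes and decides success, with no input from the model's own report.

```python
from dataclasses import dataclass, field


@dataclass
class MockEnvironment:
    """Hypothetical state-graph environment: it, not the model, decides success."""

    # Allowed transitions: (current state, tool name) -> next state.
    transitions: dict
    goal: str
    state: str = "start"
    strokes: int = 0
    history: list = field(default_factory=list)

    def call_tool(self, tool: str) -> str:
        """Record one tool call (one stroke); advance only on a valid transition."""
        self.strokes += 1
        next_state = self.transitions.get((self.state, tool), self.state)
        self.history.append((self.state, tool, next_state))
        self.state = next_state
        return self.state

    @property
    def goal_reached(self) -> bool:
        # Success is read from the tracked state, never from the model's claim.
        return self.state == self.goal


# Illustrative two-step scenario (tool names are assumptions).
env = MockEnvironment(
    transitions={
        ("start", "resolve_hostname"): "resolved",
        ("resolved", "report_answer"): "done",
    },
    goal="done",
)
for tool in ["resolve_hostname", "report_answer"]:
    env.call_tool(tool)
print(env.goal_reached, env.strokes)
```

A tool call that matches no transition still costs a stroke but leaves the state unchanged, which is one simple way to make wasted actions visible in the score.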

Par as baseline

Each scenario has a known-optimal path. Performance is measured against it — not against other models.
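A small sketch of scoring against par, assuming a hole's result is simply strokes minus par. The label names are the ones used in the scorecard legend; the function name `score_label` is illustrative.

```python
def score_label(strokes: int, par: int) -> str:
    """Name a hole result by its distance from par."""
    diff = strokes - par
    if diff <= -2:
        return "Eagle or better"
    if diff == -1:
        return "Birdie"
    if diff == 0:
        return "Par"
    if diff == 1:
        return "Bogey"
    return "Double bogey+"


# Example: a par-2 hole finished in 2 strokes, and a par-9 hole finished in 11.
print(score_label(2, 2))
print(score_label(11, 9))
```

Because the comparison is to a fixed optimal path, a model's label on a hole does not change when other models are added to or removed from the scorecard.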

Framework-agnostic

The same scenarios run through bare prompts, ReAct loops, and SDK harnesses, so the effect of orchestration can be separated from the model itself.
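One way to picture this, as a sketch only: every harness is treated as a callable with the same signature, so a single scenario definition can be passed to each. The `scripted_harness` helper, the tool names, and the fixed plans below are hypothetical stand-ins for real model calls, not measured behavior.

```python
from typing import Callable, Dict, List, Optional, Tuple

Transitions = Dict[Tuple[str, str], str]
# A harness takes (transitions, start, goal) and returns strokes used, or None.
Harness = Callable[[Transitions, str, str], Optional[int]]


def scripted_harness(plan: List[str]) -> Harness:
    """Build a toy harness that replays a fixed list of tool calls."""

    def run(transitions: Transitions, start: str, goal: str) -> Optional[int]:
        state = start
        for strokes, tool in enumerate(plan, start=1):
            # Invalid calls leave the state unchanged but still cost a stroke.
            state = transitions.get((state, tool), state)
            if state == goal:
                return strokes
        return None  # goal not reached

    return run


# One scenario definition, shared by every harness.
SCENARIO: Transitions = {
    ("start", "resolve_hostname"): "resolved",
    ("resolved", "report_answer"): "done",
}

HARNESSES: Dict[str, Harness] = {
    "bare": scripted_harness(["report_answer", "resolve_hostname", "report_answer"]),
    "react": scripted_harness(["resolve_hostname", "report_answer"]),
}

results = {name: run(SCENARIO, "start", "done") for name, run in HARNESSES.items()}
print(results)
```

Keeping the scenario separate from the harness is what makes the comparison fair: only the orchestration layer changes between runs.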

Scorecard

10 holes · par 51

Model	DNS lookup par 2	Log search par 3	API outag… par 4	ETL failu… par 4	DB slow q… par 5	Deploymen… par 5	Cascading… par 6	Security … par 6	Data corr… par 7	Distribut… par 9	Total	Goals
claude-sonnet-4-6 anthropic	2	3	4	5 ⬆	5	6 ⬆	7 ⬆	7 ⬆	8 ⬆	11 ⬆⬆	58 (+7)	10/10
gpt-4o openai	2	3	4	5 ⬆	6 ⬆	5	8 ⬆⬆	7 ⬆	9 ⬆⬆	12 +3	61 (+10)	9/10
gemini-pro vertex	2	4 ⬆	5 ⬆	5 ⬆	6 ⬆	6 ⬆	8 ⬆⬆	—	9 ⬆⬆	—	45 (-6)	8/10
Qwen3-235B-FP8 together	2	3	5 ⬆	6 ⬆⬆	6 ⬆	7 ⬆⬆	—	8 ⬆⬆	—	—	37 (-14)	7/10
claude-haiku-4-5 anthropic	3 ⬆	4 ⬆	5 ⬆	6 ⬆⬆	7 ⬆⬆	7 ⬆⬆	—	—	—	—	32 (-19)	6/10
gpt-4o-mini openai	3 ⬆	5 ⬆⬆	6 ⬆⬆	—	7 ⬆⬆	8 +3	—	—	—	—	29 (-22)	5/10
Par	2	3	4	4	5	5	6	6	7	9	51

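Legend: Eagle or better · Birdie · Par · Bogey (⬆) · Double bogey+ (⬆⬆, +3) · — Did not reach goal

Totals sum only the holes with a recorded score and are shown relative to the full par of 51, so an under-par total on an incomplete round reflects missing holes rather than better play.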
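The totals in the scorecard can be reproduced with a short sketch. It assumes, based on the rows above, that a "—" hole contributes no strokes and that the total is compared against the full par of 51; the function name `summarize` is illustrative, and counting scored holes is only an approximation of the Goals column.

```python
from typing import List, Optional, Tuple

# Par for each of the 10 holes, copied from the scorecard's Par row.
PAR = [2, 3, 4, 4, 5, 5, 6, 6, 7, 9]


def summarize(strokes: List[Optional[int]]) -> Tuple[int, int, int]:
    """Return (total strokes, total relative to full par, holes with a score)."""
    played = [s for s in strokes if s is not None]
    total = sum(played)
    return total, total - sum(PAR), len(played)


# Rows copied from the scorecard; None stands for "—".
sonnet = [2, 3, 4, 5, 5, 6, 7, 7, 8, 11]
gemini = [2, 4, 5, 5, 6, 6, 8, None, 9, None]

print(summarize(sonnet))  # (58, 7, 10)
print(summarize(gemini))  # (45, -6, 8)
```

This is why the Total and Goals columns need to be read together: a low stroke count only signals efficiency when the goal count is also high.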

Efficiency map

Goals reached vs total strokes. Lower-right is optimal — more goals with fewer strokes.

Orchestration framework effect

The same model, the same scenarios — different orchestration frameworks produce different results. This is a cross-cutting dimension that affects every family score.

Framework	Avg score
Bare model	42.0
ReAct loop	68.0
Anthropic SDK	74.0
OpenAI SDK	71.0
LangGraph	69.0

Why this matters: A model that scores 85% bare but only 60% through an SDK harness has a framework integration problem. A model that scores 50% bare but 80% harnessed benefits strongly from orchestration — critical for production deployment decisions.
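The gap described here is a simple difference. The sketch below applies it to the average scores listed above and to the two hypothetical models from the paragraph; the function name `orchestration_gap` is an illustrative choice, not part of any published API.

```python
# Average scores per framework, copied from the figures above.
AVG_SCORES = {
    "Bare model": 42.0,
    "ReAct loop": 68.0,
    "Anthropic SDK": 74.0,
    "OpenAI SDK": 71.0,
    "LangGraph": 69.0,
}


def orchestration_gap(bare: float, harnessed: float) -> float:
    """Positive: the harness helps. Negative: the harness hurts."""
    return harnessed - bare


# Uplift of each framework over the bare-model average.
uplift = {
    name: orchestration_gap(AVG_SCORES["Bare model"], score)
    for name, score in AVG_SCORES.items()
    if name != "Bare model"
}
print(uplift)

# The two cases from the paragraph above.
print(orchestration_gap(85, 60))  # -25: framework integration problem
print(orchestration_gap(50, 80))  # 30: strong benefit from orchestration
```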

How agentic evaluation connects

Informs families

Every eval family can be run bare or harnessed. The gap between bare and harnessed scores reveals how much a model depends on orchestration for each capability.


Informs domains

Domain scores shift under orchestration. A model strong in coding bare may falter in coding-under-harness — or vice versa. This surfaces real deployment readiness.


Informs the leaderboard

GCI now includes both bare and harnessed dimensions. Models that only perform well in one mode are penalized in the composite — reflecting real-world usage patterns.
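How the composite combines the two modes is not specified here. Purely as an illustration of how a composite can penalize one-sided performance, the sketch below uses a harmonic mean, which is an assumption and not the documented GCI formula; the function name `balanced_composite` is likewise hypothetical.

```python
from statistics import harmonic_mean, mean


def balanced_composite(bare: float, harnessed: float) -> float:
    """Assumed composite: the harmonic mean is pulled toward the weaker mode."""
    return harmonic_mean([bare, harnessed])


# Two hypothetical models with the same arithmetic mean (72.5).
lopsided = balanced_composite(85, 60)
balanced = balanced_composite(72.5, 72.5)

print(round(lopsided, 2), round(balanced, 2))
print(mean([85, 60]), mean([72.5, 72.5]))
```

With equal arithmetic means, the lopsided model ends up with the lower composite, which is the qualitative behavior the paragraph describes.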


Informs model profiles

Each model's detail page shows its orchestration profile: which harnesses it works best with, where it loses strokes, and how it compares to par across scenarios.
