Cross-cutting dimension

Agentic evaluation

How models navigate, not just what they answer. Agentic evaluation measures decision-making through state-aware mock environments where the environment is the scorekeeper. Every tool call counts as a stroke, and reaching the goal in fewer strokes indicates more efficient decision-making.

01

Environment as judge

State graphs track every action. No self-reporting, no prompt-based grading — the environment determines success.
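A minimal sketch of the idea, assuming a hypothetical DNS-lookup scenario modeled as a small state graph. The class name `MockEnvironment`, the tool names, and the transition table are illustrative assumptions, not the benchmark's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical state graph: (current_state, tool_name) -> next_state.
# States and tool names are invented for illustration.
TRANSITIONS = {
    ("start", "resolve_hostname"): "resolved",
    ("resolved", "report_address"): "goal",
}


@dataclass
class MockEnvironment:
    state: str = "start"
    history: list = field(default_factory=list)

    def call_tool(self, tool_name: str) -> str:
        """Record the call as one stroke; advance only if the edge exists."""
        self.history.append(tool_name)
        self.state = TRANSITIONS.get((self.state, tool_name), self.state)
        return self.state

    @property
    def strokes(self) -> int:
        return len(self.history)

    @property
    def goal_reached(self) -> bool:
        # Success is read from environment state, never from model output.
        return self.state == "goal"


env = MockEnvironment()
for tool in ["list_zones", "resolve_hostname", "report_address"]:
    env.call_tool(tool)

print(env.goal_reached, env.strokes)  # True 3
```

The unproductive `list_zones` call still costs a stroke, so this run reaches the goal one over the scenario's par of 2.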

02

Par as baseline

Each scenario has a known-optimal path. Performance is measured against it — not against other models.
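Scoring against par can be sketched as a simple difference. The function name `score_label` is an assumption; the category names come from the scorecard legend.

```python
def score_label(strokes: int, par: int, goal_reached: bool = True) -> str:
    """Map a stroke count to a scorecard category relative to par."""
    if not goal_reached:
        return "did not reach goal"
    diff = strokes - par
    if diff <= -2:
        return "eagle or better"
    if diff == -1:
        return "birdie"
    if diff == 0:
        return "par"
    if diff == 1:
        return "bogey"
    return "double bogey+"


print(score_label(5, 4))   # bogey
print(score_label(11, 9))  # double bogey+
```

Because the label depends only on the scenario's own par, a model's result on a hole does not change when other models are added to or removed from the comparison.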

03

Framework-agnostic

The same scenarios run through bare prompts, ReAct loops, and SDK harnesses, so the effect of the orchestration layer can be measured separately from the model.
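One way to picture this, as a sketch only: treat each harness as a function that receives the same task and returns the tool calls it issued. Both harnesses below are hypothetical stand-ins with hard-coded behavior, not adapters for any real framework.

```python
from typing import Callable, Dict, List

# A harness takes a task description and returns the tool calls it made.
Harness = Callable[[str], List[str]]


def bare_prompt(task: str) -> List[str]:
    # Hypothetical: the bare model repeats a search before reporting.
    return ["search_logs", "search_logs", "search_logs", "report"]


def react_loop(task: str) -> List[str]:
    # Hypothetical: the loop narrows the search, then reports.
    return ["search_logs", "filter_errors", "report"]


def run_scenario(task: str, par: int, harnesses: Dict[str, Harness]) -> Dict[str, int]:
    """Run one task through every harness and return strokes relative to par."""
    return {name: len(harness(task)) - par for name, harness in harnesses.items()}


results = run_scenario(
    "find the failing request",
    par=3,
    harnesses={"bare": bare_prompt, "react": react_loop},
)
print(results)  # {'bare': 1, 'react': 0}
```

The task and par are held fixed, so any difference in the result comes from the harness.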

Scorecard

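10 holes · par 51

| Hole | Scenario | Par |
| --- | --- | --- |
| 1 | DNS lookup | 2 |
| 2 | Log search | 3 |
| 3 | API outag… | 4 |
| 4 | ETL failu… | 4 |
| 5 | DB slow q… | 5 |
| 6 | Deploymen… | 5 |
| 7 | Cascading… | 6 |
| 8 | Security … | 6 |
| 9 | Data corr… | 7 |
| 10 | Distribut… | 9 |
| | Par total | 51 |

Each row lists strokes only for holes with a recorded score, in scorecard order.

| Model | Provider | Strokes per recorded hole | Total | Goals |
| --- | --- | --- | --- | --- |
| claude-sonnet-4-6 | anthropic | 2, 3, 4, 5 ⬆, 5, 6 ⬆, 7 ⬆, 7 ⬆, 8 ⬆, 11 ⬆⬆ | 58 (+7) | 10/10 |
| gpt-4o | openai | 2, 3, 4, 5 ⬆, 6 ⬆, 5, 8 ⬆⬆, 7 ⬆, 9 ⬆⬆, 12 +3 | 61 (+10) | 9/10 |
| gemini-pro | vertex | 2, 4 ⬆, 5 ⬆, 5 ⬆, 6 ⬆, 6 ⬆, 8 ⬆⬆, 9 ⬆⬆ | 45 (-6) | 8/10 |
| Qwen3-235B-FP8 | together | 2, 3, 5 ⬆, 6 ⬆⬆, 6 ⬆, 7 ⬆⬆, 8 ⬆⬆ | 37 (-14) | 7/10 |
| claude-haiku-4-5 | anthropic | 3 ⬆, 4 ⬆, 5 ⬆, 6 ⬆⬆, 7 ⬆⬆, 7 ⬆⬆ | 32 (-19) | 6/10 |
| gpt-4o-mini | openai | 3 ⬆, 5 ⬆⬆, 6 ⬆⬆, 7 ⬆⬆, 8 +3 | 29 (-22) | 5/10 |

Legend: Eagle or better · Birdie · Par · Bogey (⬆) · Double bogey+ (⬆⬆) · Did not reach goal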
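The totals can be checked with a few lines of arithmetic. The sketch below uses the par values and the claude-sonnet-4-6 and gemini-pro stroke counts from the scorecard; the variable names are assumptions.

```python
PAR = [2, 3, 4, 4, 5, 5, 6, 6, 7, 9]

# claude-sonnet-4-6: a score is recorded on all 10 holes.
SONNET = [2, 3, 4, 5, 5, 6, 7, 7, 8, 11]
# gemini-pro: only 8 scores are recorded (8/10 goals).
GEMINI = [2, 4, 5, 5, 6, 6, 8, 9]

par_total = sum(PAR)
sonnet_total = sum(SONNET)
gemini_total = sum(GEMINI)

print(par_total)                                       # 51
print(sonnet_total, f"{sonnet_total - par_total:+d}")  # 58 +7
print(gemini_total, f"{gemini_total - par_total:+d}")  # 45 -6
```

The signed figure is the stroke total minus the full par of 51, even for models with fewer than 10 recorded scores. A negative value such as -6 therefore reflects missing holes rather than under-par play, and should be read alongside the goals column.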

Efficiency map

Goals reached vs total strokes. Lower-right is optimal — more goals with fewer strokes.

Orchestration framework effect

The same model, the same scenarios — different orchestration frameworks produce different results. This is a cross-cutting dimension that affects every family score.

Bare model: 42.0 avg score
ReAct loop: 68.0 avg score
Anthropic SDK: 74.0 avg score
OpenAI SDK: 71.0 avg score
LangGraph: 69.0 avg score

Why this matters: A model that scores 85% bare but only 60% through an SDK harness has a framework integration problem. A model that scores 50% bare but 80% harnessed benefits strongly from orchestration — critical for production deployment decisions.
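As a worked example, the gap between bare and harnessed averages can be computed directly from the figures above. The dictionary and variable names are assumptions.

```python
AVG_SCORE = {
    "Bare model": 42.0,
    "ReAct loop": 68.0,
    "Anthropic SDK": 74.0,
    "OpenAI SDK": 71.0,
    "LangGraph": 69.0,
}

baseline = AVG_SCORE["Bare model"]
# Uplift of each harness over the bare-model average.
uplift = {
    name: score - baseline
    for name, score in AVG_SCORE.items()
    if name != "Bare model"
}

for name, delta in sorted(uplift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.1f}")
# Anthropic SDK: +32.0
# OpenAI SDK: +29.0
# LangGraph: +27.0
# ReAct loop: +26.0
```

Every harness adds at least 26 points over the bare average, while the spread among harnesses is only 6 points.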

How agentic evaluation connects

Informs families

Every eval family can be run bare or harnessed. The gap between bare and harnessed scores reveals how much a model depends on orchestration for each capability.


Informs domains

Domain scores shift under orchestration. A model strong in coding bare may falter in coding-under-harness — or vice versa. This surfaces real deployment readiness.


Informs the leaderboard

GCI now includes both bare and harnessed dimensions. Models that only perform well in one mode are penalized in the composite — reflecting real-world usage patterns.
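The GCI formula is not given here. Purely as an illustration of how a composite can penalize one-sided performance, the sketch below uses a harmonic mean, which is an assumption and not the leaderboard's actual method; the function name `balanced_composite` is hypothetical.

```python
def balanced_composite(bare: float, harnessed: float) -> float:
    """Harmonic mean of two scores; returns 0.0 if either score is not positive.

    Illustrative assumption only: the real composite may be computed differently.
    """
    if bare <= 0 or harnessed <= 0:
        return 0.0
    return 2 * bare * harnessed / (bare + harnessed)


lopsided = balanced_composite(50.0, 80.0)  # arithmetic mean would be 65.0
even = balanced_composite(65.0, 65.0)

print(round(lopsided, 1), round(even, 1))  # 61.5 65.0
```

Both inputs have the same arithmetic mean of 65, but the lopsided pair scores lower under the harmonic mean. That is the kind of penalty the composite describes.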


Informs model profiles

Each model's detail page shows its orchestration profile: which harnesses it works best with, where it loses strokes, and how it compares to par across scenarios.
