Agentic evaluation
How models navigate, not just what they answer. Agentic evaluation measures decision-making through state-aware mock environments where the environment is the scorekeeper. Every tool call is a stroke — fewer strokes, better cognition.
Environment as judge
State graphs track every action. No self-reporting, no prompt-based grading — the environment determines success.
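A minimal sketch of what "environment as judge" could look like, assuming a hypothetical `MockEnvironment` class and a toy two-step DNS scenario (the class, tool names, and states are illustrative, not taken from the evaluation's code). The environment owns the state graph, counts every tool call as a stroke, and decides success from its own state rather than from anything the model reports.

```python
from dataclasses import dataclass, field


@dataclass
class MockEnvironment:
    """Hypothetical state-aware mock environment (illustrative only)."""

    # transitions[state][tool] -> next state
    transitions: dict[str, dict[str, str]]
    start: str
    goal: str
    state: str = field(init=False)
    strokes: int = field(init=False, default=0)
    history: list[tuple[str, str, str]] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        self.state = self.start

    def call_tool(self, tool: str) -> str:
        """Apply a tool call; every call costs one stroke, useful or not."""
        self.strokes += 1
        previous = self.state
        # A tool that is not valid in the current state leaves it unchanged.
        self.state = self.transitions.get(previous, {}).get(tool, previous)
        self.history.append((previous, tool, self.state))
        return self.state

    @property
    def goal_reached(self) -> bool:
        """Success is read from environment state, never from model output."""
        return self.state == self.goal


# Toy scenario (assumed): two tool calls lead from start to goal.
env = MockEnvironment(
    transitions={
        "start": {"resolve_hostname": "resolved"},
        "resolved": {"check_record": "diagnosed"},
    },
    start="start",
    goal="diagnosed",
)
for tool in ["resolve_hostname", "check_record"]:
    env.call_tool(tool)
print(env.goal_reached, env.strokes)
```

Because the stroke counter increments on every call, a wasted or invalid tool call still costs a stroke, which is what lets the environment act as scorekeeper.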
Par as baseline
Each scenario has a known-optimal path. Performance is measured against it — not against other models.
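As a rough illustration, scoring against par reduces to a per-scenario difference. The sketch below assumes the cell notation that the scorecard on this page appears to use (one arrow per stroke over par up to two, then a signed number, and a dash for a hole with no score); the function name `format_cell` is hypothetical.

```python
from typing import Optional


def format_cell(strokes: Optional[int], par: int) -> str:
    """Render one scorecard cell relative to par (notation is an assumption)."""
    if strokes is None:
        return "—"  # no score recorded for this hole
    delta = strokes - par
    if delta == 0:
        return str(strokes)  # exactly on par
    if delta in (1, 2):
        return f"{strokes} " + "⬆" * delta  # one arrow per stroke over par
    return f"{strokes} {delta:+d}"  # larger gaps shown as a signed number


print(format_cell(5, 4))
print(format_cell(11, 9))
print(format_cell(12, 9))
```

The comparison never looks at another model's result, so a score stays meaningful even if only one model is evaluated.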
Framework-agnostic
Same scenarios run through bare prompts, ReAct loops, and SDK harnesses. The orchestration dimension matters.
Scorecard
10 holes · par 51

| Model | DNS lookup (par 2) | Log search (par 3) | API outag… (par 4) | ETL failu… (par 4) | DB slow q… (par 5) | Deploymen… (par 5) | Cascading… (par 6) | Security … (par 6) | Data corr… (par 7) | Distribut… (par 9) | Total | Goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-sonnet-4-6 (anthropic) | 2 | 3 | 4 | 5 ⬆ | 5 | 6 ⬆ | 7 ⬆ | 7 ⬆ | 8 ⬆ | 11 ⬆⬆ | 58 (+7) | 10/10 |
| gpt-4o (openai) | 2 | 3 | 4 | 5 ⬆ | 6 ⬆ | 5 | 8 ⬆⬆ | 7 ⬆ | 9 ⬆⬆ | 12 +3 | 61 (+10) | 9/10 |
| gemini-pro (vertex) | 2 | 4 ⬆ | 5 ⬆ | 5 ⬆ | 6 ⬆ | 6 ⬆ | 8 ⬆⬆ | — | 9 ⬆⬆ | — | 45 (-6) | 8/10 |
| Qwen3-235B-FP8 (together) | 2 | 3 | 5 ⬆ | 6 ⬆⬆ | 6 ⬆ | 7 ⬆⬆ | — | 8 ⬆⬆ | — | — | 37 (-14) | 7/10 |
| claude-haiku-4-5 (anthropic) | 3 ⬆ | 4 ⬆ | 5 ⬆ | 6 ⬆⬆ | 7 ⬆⬆ | 7 ⬆⬆ | — | — | — | — | 32 (-19) | 6/10 |
| gpt-4o-mini (openai) | 3 ⬆ | 5 ⬆⬆ | 6 ⬆⬆ | — | 7 ⬆⬆ | 8 +3 | — | — | — | — | 29 (-22) | 5/10 |
| Par | 2 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 9 | 51 | |
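
The totals can be reproduced from the per-hole cells. The sketch below copies two rows from the scorecard and assumes a dash means no strokes were recorded for that hole; `summarize` and its field names are hypothetical. It reports the total against full-course par (which matches the Total column) and also against the par of only the holes that have a score, since a model with unscored holes can show a negative number in the Total column without being under par on any hole.

```python
PAR = [2, 3, 4, 4, 5, 5, 6, 6, 7, 9]

# Rows copied from the scorecard; None stands for a "—" cell (assumed: no score).
SCORECARD = {
    "claude-sonnet-4-6": [2, 3, 4, 5, 5, 6, 7, 7, 8, 11],
    "gemini-pro": [2, 4, 5, 5, 6, 6, 8, None, 9, None],
}


def summarize(strokes, par=PAR):
    """Total strokes and two par comparisons for one scorecard row."""
    scored = [(s, p) for s, p in zip(strokes, par) if s is not None]
    total = sum(s for s, _ in scored)
    return {
        "holes_scored": len(scored),
        "total": total,
        # Compared with par for all ten holes, as the Total column does.
        "vs_course_par": total - sum(par),
        # Compared with par for only the holes that have a score.
        "vs_scored_par": total - sum(p for _, p in scored),
    }


for model, row in SCORECARD.items():
    print(model, summarize(row))
```

For gemini-pro this gives 45 strokes over 8 scored holes: 6 under the full-course par of 51, but 9 over the par of 36 for the holes it actually scored on.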
Efficiency map
Goals reached vs total strokes. Lower-right is optimal — more goals with fewer strokes.
Orchestration framework effect
The same model, the same scenarios — different orchestration frameworks produce different results. This is a cross-cutting dimension that affects every family score.
Why this matters: A model that scores 85% bare but only 60% through an SDK harness has a framework integration problem. A model that scores 50% bare but 80% harnessed benefits strongly from orchestration — critical for production deployment decisions.
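The two cases above can be expressed as a simple gap calculation. This is a sketch only: the function names and the 5-point tolerance for calling a gap "roughly neutral" are assumptions for illustration, not part of the evaluation.

```python
def orchestration_gap(bare_pct: float, harnessed_pct: float) -> float:
    """Harnessed score minus bare score, in percentage points."""
    return harnessed_pct - bare_pct


def describe_gap(bare_pct: float, harnessed_pct: float, tolerance: float = 5.0) -> str:
    """Classify the gap; `tolerance` is an assumed, arbitrary threshold."""
    gap = orchestration_gap(bare_pct, harnessed_pct)
    if gap < -tolerance:
        return f"{gap:+.0f} points: loses ground under the harness"
    if gap > tolerance:
        return f"{gap:+.0f} points: benefits from orchestration"
    return f"{gap:+.0f} points: roughly neutral"


print(describe_gap(85, 60))  # the framework integration problem case
print(describe_gap(50, 80))  # the orchestration-dependent case
```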
How agentic evaluation connects
Informs families
Every eval family can be run bare or harnessed. The gap between bare and harnessed scores reveals how much a model depends on orchestration for each capability.
View families →
Informs domains
Domain scores shift under orchestration. A model strong in coding bare may falter in coding-under-harness — or vice versa. This surfaces real deployment readiness.
View domains →
Informs the leaderboard
GCI now includes both bare and harnessed dimensions. Models that only perform well in one mode are penalized in the composite — reflecting real-world usage patterns.
View leaderboard →
Informs model profiles
Each model's detail page shows its orchestration profile: which harnesses it works best with, where it loses strokes, and how it compares to par across scenarios.
View models →