Grounded reasoning for enterprise work. The benchmark evaluates model performance on document synthesis, citation discipline, and multi-source decision-making across three simulated company environments.
Overall Performance
Ebla-1 High: 25.4%
Opus 4.6 High: 20.1%
Sonnet 4.6 High: 19.3%
GPT-5.4 High: 17.8%
Gemini 3.1 Pro High: 12.2%
GPT-5.2 High: 11.3%
Grok 4.1 Fast High: 8.2%
GPT-OSS-120b High: 7.1%
Gemini 3 Flash High: 6.3%