Introducing and Ebla-1
A benchmark on grounded reasoning and a state-of-the-art model.
Knowledge workers spend much of their time searching for, synthesizing, and verifying information across internal documents. AI agents are increasingly deployed for this work, but frontier models fail in consistent and often consequential ways: they confidently cite real documents that don't contain the claimed information, misread visual content like diagrams and tables, and fabricate values when reconciling evidence across sources.
Today, we're releasing a benchmark that measures grounded reasoning¹ in enterprise² environments with dense, multimodal document corpora, alongside Ebla-1, a 120B model that scores 25.4% on it, beating every frontier model we tested.
The benchmark evaluates on four axes (Correctness, Completeness, Composition, and Citations) with a rubric that penalizes fabrication directly. This penalty structure doubles as an RL training signal: we built Ebla-1 by reinforcement fine-tuning GPT-OSS-120b High via OpenAI's RFT API on a separate set of 30 training tasks, using the rubric score as reward. The model learns not just to ground claims in specific documents, but to reject nearby wrong substitutes and flag what it cannot verify, because fabricating costs it points.
We built 40 tasks across three simulated³ enterprise environments: a SaaS analytics platform, a financial services firm, and a chemical manufacturing company. Each has a realistic document corpus and platform clones (Salesforce, ServiceNow, Workday, and others). Domain specialists authored all tasks and verified every rubric criterion. We tested eight frontier models at high reasoning effort. The best, Opus 4.6 High, scores 20.1%. Ebla-1 scores 25.4%.
Only 6.1% of frontier task-model pairs are full solves. Each model runs every task multiple times at high reasoning effort; the chart above shows Pass@1 and Pass@8 scores.
The diagram below shows one representative benchmark task and the path a human expert follows to solve it — searching a multimodal contract corpus, identifying which clauses control the answer, reconciling dependencies across documents, and rejecting plausible but irrelevant evidence.
Methodology
Environments
Domain specialists authored all corpora, tasks, and gold outputs. Most tasks require cross-source reasoning: for example, calculating a team's budget utilization from an org chart, a compliance report, and a Workday export. Every expert performed their own tasks end-to-end, ensuring each task is solvable and each rubric criterion is grounded in a verified answer.
Environment Creation
Step-by-step flow from simulated company setup to validated tasks, rubrics, and environment delivery
The benchmark spans three simulated enterprise environments: a SaaS analytics platform, a financial services firm, and a chemical manufacturing company. Each is generated from a knowledge graph that serves as the single source of truth. The graph encodes entities (people, teams, products, customers, policies, financial records) and their relationships; both the document corpus and the platform clones (e.g., Salesforce, ServiceNow, Workday) are populated directly from it.
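To make the "single source of truth" idea concrete, here is a minimal sketch of one graph feeding both a document and a platform export. Every name in it (`Entity`, `KnowledgeGraph`, `render_org_chart`, `export_hr_records`, the `has_member` relation, the demo team) is hypothetical and chosen for illustration; it is not the pipeline used to build the environments.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    kind: str                      # e.g. "person", "team", "policy"
    attrs: dict = field(default_factory=dict)


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)   # id -> Entity
    edges: list = field(default_factory=list)      # (src_id, relation, dst_id)

    def add(self, entity):
        self.entities[entity.id] = entity

    def relate(self, src_id, relation, dst_id):
        self.edges.append((src_id, relation, dst_id))

    def neighbors(self, src_id, relation):
        return [self.entities[d] for s, r, d in self.edges
                if s == src_id and r == relation]


def render_org_chart(graph, team_id):
    """Document view: a plain-text org chart for one team."""
    team = graph.entities[team_id]
    lines = [f"Org chart: {team.attrs['name']}"]
    for member in graph.neighbors(team_id, "has_member"):
        lines.append(f"- {member.attrs['name']} ({member.attrs['title']})")
    return "\n".join(lines)


def export_hr_records(graph, team_id):
    """Platform view: rows shaped like a hypothetical HR-system export."""
    team = graph.entities[team_id]
    return [
        {"employee_id": m.id, "name": m.attrs["name"], "team": team.attrs["name"]}
        for m in graph.neighbors(team_id, "has_member")
    ]


def build_demo_graph():
    """Tiny illustrative graph; the people and team are made up."""
    graph = KnowledgeGraph()
    graph.add(Entity("team-1", "team", {"name": "Platform"}))
    graph.add(Entity("emp-1", "person", {"name": "Ada", "title": "Engineer"}))
    graph.add(Entity("emp-2", "person", {"name": "Lin", "title": "Manager"}))
    graph.relate("team-1", "has_member", "emp-1")
    graph.relate("team-1", "has_member", "emp-2")
    return graph
```

Because both views read from the same graph, a fact such as team membership cannot drift between the org chart and the export, which is what lets a cross-source task have a single verifiable answer.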
Agent
The agent follows a search-fetch-answer loop with a fixed budget of 36 tool-use steps. The search tool returns up to 5 results ranked by semantic similarity. The fetch tool retrieves and analyzes a document. For PDFs, an isolated LLM call to the native PDF API analyzes the document against the query, so the raw content never enters the main context. The agent submits its final response with page-level citations once it has acquired sufficient information.
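The loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the harness used in the benchmark: the action dictionary format, the `policy` callable, and the toy `toy_search` and `toy_fetch` tools are all hypothetical. Only the 36-step budget and the 5-result cap come from the text.

```python
from dataclasses import dataclass, field

MAX_STEPS = 36   # tool-use budget stated in the text
TOP_K = 5        # search returns up to 5 results


@dataclass
class Answer:
    text: str
    citations: list = field(default_factory=list)   # (doc_id, page) pairs


def run_agent(question, policy, search, fetch):
    """Run the loop until the policy answers or the step budget is spent.

    Returns the final Answer and the history of tool calls.
    """
    history = []
    while len(history) < MAX_STEPS:
        action = policy(question, history)
        if action["tool"] == "search":
            results = search(action["query"])[:TOP_K]
            history.append(("search", action["query"], results))
        elif action["tool"] == "fetch":
            # Assumed: fetch returns an analysis of the document against the
            # query, so raw document content never enters the history.
            analysis = fetch(action["doc_id"], action["query"])
            history.append(("fetch", action["doc_id"], analysis))
        elif action["tool"] == "answer":
            return Answer(action["text"], action["citations"]), history
        else:
            raise ValueError(f"unknown tool: {action['tool']}")
    # Budget exhausted without an answer: return a blank response.
    return Answer(""), history


def toy_search(query):
    """Stand-in search tool; a real one would rank by semantic similarity."""
    return [f"doc-{i}" for i in range(1, 8)]


def toy_fetch(doc_id, query):
    """Stand-in fetch tool that returns a short analysis string."""
    return f"analysis of {doc_id}"
```

A policy that only rephrases its search query never reaches the `answer` branch and returns a blank response after 36 calls, which mirrors the non-converging behavior reported for the baseline later in the post.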
Evaluation
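Each task has a rubric of binary criteria partitioned across the four axes (Correctness, Completeness, Composition, and Citations). Let v_i ∈ {0, 1} be the LLM judge's verdict for criterion i and w_i its signed weight (positive for rewards, negative for penalties such as fabrication). The final score is a normalized sum of the signed weights w_i over the criteria with v_i = 1.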
We weight the rubric most heavily toward Correctness, then Completeness, then Citations, and finally Composition.
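As a rough illustration of how such a rubric could be scored, the sketch below sums the signed weights of the criteria the judge marks as triggered and divides by the total positive weight. The normalization, the `Criterion` class, and the example weights are assumptions made for this example; the text only establishes that criteria are binary, weights are signed, and Correctness is weighted most heavily, followed by Completeness, Citations, and Composition.

```python
from dataclasses import dataclass

AXES = ("Correctness", "Completeness", "Composition", "Citations")


@dataclass(frozen=True)
class Criterion:
    axis: str
    weight: float    # signed: positive for rewards, negative for penalties
    verdict: bool    # judge's binary verdict: was this criterion triggered?


def rubric_score(criteria):
    """Signed weighted sum of triggered criteria over the total positive weight.

    The denominator is an assumption; the source does not state it.
    """
    max_positive = sum(c.weight for c in criteria if c.weight > 0)
    if max_positive <= 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    earned = sum(c.weight for c in criteria if c.verdict)
    return earned / max_positive


def axis_breakdown(criteria):
    """Signed weight earned on each axis, penalties included."""
    totals = {axis: 0.0 for axis in AXES}
    for c in criteria:
        if c.verdict:
            totals[c.axis] += c.weight
    return totals


# Hypothetical rubric: the response gets the main fact right and is well
# formatted, but also triggers a fabrication penalty.
EXAMPLE = [
    Criterion("Correctness", 4.0, True),
    Criterion("Completeness", 3.0, False),
    Criterion("Citations", 2.0, False),
    Criterion("Composition", 1.0, True),
    Criterion("Correctness", -3.0, True),   # fabrication penalty
]
```

In this made-up rubric the response earns 4 + 1 - 3 = 2 of a possible 10 points. Dropping the fabricated claim would raise it to 5 of 10, so a single penalty can erase most of the credit for a correct finding.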
Results
Axes
Performance by evaluation axis
Correctness is the primary discriminator across models, driven by differences in retrieval accuracy and visual document extraction. Composition is the most uniform axis — even poor responses tend to be well-formatted. Citations varies substantially: weaker models provide vague document references while the strongest cite exact document IDs and page numbers.
Categories & Modalities
Performance across document categories and input modalities
AI agents fail along several dimensions: retrieval, reasoning, perception, and calibration. The hardest tasks most often fail on retrieval, multi-hop logic, or visual parsing.
During evaluation, we identified several failure modes across frontier models. The two most consequential are failures of abstention and visual misinterpretation.
When the corpus lacks the requested information, every model generates a plausible-sounding answer, citing real documents that don't contain the claimed information. The output is indistinguishable from a well-sourced response without manually checking every citation. In compliance or legal contexts, a confident wrong answer is worse than no answer.
Models retrieve documents containing diagrams but consistently misread them. A safety evacuation diagram in one of our environments is found by every model but described with incorrect spatial relationships and missing elements, producing outputs that would be unsafe to act on.
When combining values across sources, models invent intermediate numbers and perform arithmetic on fabricated inputs with full confidence. When documents contradict each other, models fabricate governance hierarchies or override rules rather than flagging the conflict. On documents exceeding 15 pages, models cite nearby pages rather than the actual source page.
Training
The rubric provides dense partial credit, making it well-suited as an RL training signal, not just an evaluation. Each task's rubric contains binary criteria with signed weights, producing a near-continuous reward distribution rather than the sparse binary signal typical of agentic benchmarks. The four axes provide independent reward dimensions, and the simulated corpora are contamination-free by construction.
Our base model is GPT-OSS-120b High, a 120B open-weight model that scores 7.1% on the benchmark before training, below nearly every frontier model we tested. Most of its runs produce near-zero scores: the model either fabricates answers that trigger penalty criteria or hedges so aggressively that it fails to answer the question.
We used OpenAI's RFT API on 30 training tasks (10 per environment) with the rubric score as the RL reward signal. The training tasks are stratified by difficulty into anchor tasks (baseline score 60–75%), main tasks (25–60%), and stretch tasks (<25%). Training ran for 30 epochs with 8 rollouts per task per epoch (7,200 rollouts total). We used no SFT warmup, no curated demonstrations, and no human preference labels. Fabrication penalties were retained in the reward signal.
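The rollout count and the difficulty tiers follow from the numbers above, as this small sketch shows. The function names are hypothetical, and assigning a baseline score of exactly 60% to the anchor tier is an assumption, since the stated ranges share that boundary.

```python
ENVIRONMENTS = 3
TASKS_PER_ENVIRONMENT = 10
EPOCHS = 30
ROLLOUTS_PER_TASK_PER_EPOCH = 8


def total_rollouts():
    """Training tasks x epochs x rollouts per task per epoch."""
    tasks = ENVIRONMENTS * TASKS_PER_ENVIRONMENT
    return tasks * EPOCHS * ROLLOUTS_PER_TASK_PER_EPOCH


def difficulty_tier(baseline_score_pct):
    """Map a task's baseline score (in percent) to its training tier."""
    if 60 <= baseline_score_pct <= 75:
        return "anchor"
    if 25 <= baseline_score_pct < 60:
        return "main"
    if 0 <= baseline_score_pct < 25:
        return "stretch"
    # Scores above 75% fall outside the ranges given in the text.
    raise ValueError(f"no tier defined for {baseline_score_pct}%")
```

With 30 tasks, 30 epochs, and 8 rollouts per task per epoch, `total_rollouts()` reproduces the 7,200 figure.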
Each trace receives its rubric score as reward, mapped to [0, 1]. Because the numerator includes negative-weight criteria, triggering fabrication penalties can yield a lower reward than a blank response. That made fabrication expensive enough that early training favored safe refusal. But because the rubric also rewarded verified partial progress, continued RL pushed the model toward calibrated commitment: answer what is supported, reject nearby unsupported substitutes, and abstain only on the unresolved parts.
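One way to see why fabrication can score below a blank response is to pick a mapping to [0, 1] that preserves the ordering of raw scores. The sketch below uses an affine rescale between the worst and best achievable raw scores. That mapping is an assumption made for illustration (the source does not say which mapping was used), and the weights are invented.

```python
def raw_score(weights, verdicts):
    """Sum of signed weights over the criteria the judge marked as triggered."""
    return sum(w for w, v in zip(weights, verdicts) if v)


def reward(weights, verdicts):
    """Affine map of the raw score onto [0, 1] (assumed mapping).

    0 means every penalty was triggered and nothing was earned;
    1 means every positive criterion was met with no penalties.
    """
    worst = sum(w for w in weights if w < 0)
    best = sum(w for w in weights if w > 0)
    if best == worst:
        raise ValueError("rubric has no nonzero weights")
    return (raw_score(weights, verdicts) - worst) / (best - worst)


# Hypothetical rubric: four positive criteria and two fabrication penalties.
WEIGHTS = [4, 3, 2, 1, -3, -2]

BLANK = [False] * 6                                     # nothing triggered
FABRICATED = [False, False, False, True, True, True]    # tidy but invented
PARTIAL = [True, False, False, True, False, False]      # verified part only
```

Under these assumptions the blank response maps to 1/3, the fabricated one to 1/15, and the partial but verified one to 2/3, which matches the ordering the paragraph describes. Clipping negative raw scores to zero would not reproduce it, because fabrication and a blank response would then tie.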
The post-trained model, Ebla-1, scores a mean of 25.4% on the full 40-task benchmark, 5.3 PP above the best frontier model (Opus 4.6 High at 20.1%) and 18.3 PP above the 7.1% baseline.
Completeness gained most (+18.8 PP, 4.0× baseline): the model learned to decompose questions and retrieve evidence for all parts. Correctness gained +14.6 PP (2.4×), reflecting broader improvements in retrieval, verification, and calibration under a reward that also penalizes fabrication. Citations gained +14.5 PP (2.8×), with the model learning to ground claims in specific document pages. Composition gained +13.0 PP (1.8×), the smallest relative gain since the baseline was already highest there. All evaluation runs used a fixed step budget of 36 tool-use calls with no test-time compute scaling; reported scores reflect single-pass inference.
Beyond aggregate scores, RL training produced qualitative shifts in how Ebla-1 interacts with the search environment. The baseline rephrases the same query with minor variations and frequently exhausts its full 36-step budget without converging. Ebla-1 decomposes multi-part questions into distinct sub-queries targeting different document categories, then fetches specific documents to verify claims before committing. It is more likely to reject plausible nearby wrong evidence, keep fields scoped to the right source, and follow a controlling condition through to its downstream consequence rather than stopping at the first plausible rule.
Perhaps most strikingly, Ebla-1 learned to commit under partial evidence. When evidence for one sub-question is unavailable, it answers what it can verify and explicitly states what it could not find, earning partial credit without triggering fabrication penalties. The baseline either fabricates confidently or refuses entirely. This calibrated commitment was not explicitly trained; it emerged from the interaction between the multi-axis rubric and the penalty structure.
Total inference costs ranged from $0.35 to $24.74, a 70× spread, when each model ran the full benchmark. Ebla-1 ran the full 40-task benchmark for $1.10 in total inference cost, about 22× cheaper than Opus 4.6 High, the highest-scoring frontier model at $24.74, while scoring 5.3 PP higher. The benchmark's tasks are hard enough that no frontier model solves more than a handful, but structured enough that a 120B model can learn to outperform all of them with a single RL training run on commodity compute.
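The ratios quoted above can be rechecked from the reported figures. The snippet uses only numbers stated in this post; the variable and function names are illustrative.

```python
CHEAPEST_RUN_USD = 0.35
EBLA_1_USD = 1.10
OPUS_46_HIGH_USD = 24.74

EBLA_1_SCORE = 25.4
OPUS_46_HIGH_SCORE = 20.1


def ratio(numerator, denominator):
    """Plain ratio, rounded to one decimal place for reporting."""
    return round(numerator / denominator, 1)


cost_spread = ratio(OPUS_46_HIGH_USD, CHEAPEST_RUN_USD)      # most vs least expensive
ebla_cost_advantage = ratio(OPUS_46_HIGH_USD, EBLA_1_USD)    # Opus cost over Ebla-1 cost
score_gap_pp = round(EBLA_1_SCORE - OPUS_46_HIGH_SCORE, 1)   # percentage points

print(cost_spread, ebla_cost_advantage, score_gap_pp)
```

The unrounded values are roughly 70.7 and 22.5, so the "70×" and "22×" figures in the text are rounded down.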
Access
For more details, contact founders@aviro.ai.
Footnotes
1. Grounded reasoning: Multi-step inference where every claim must be traceable to a specific source document, page, and passage, as opposed to free-form generation or closed-book QA.
2. Enterprise: A large, structured organization with governance, repeatable processes, compliance, complex IT, and cross-functional coordination. Enterprise data refers to the organization's authoritative, shared information used across teams and systems. Enterprise documentation refers to internal authored files (policies, specs, roadmaps, notes, training materials, and presentations) that teams rely on for operations and decisions.
3. Simulated: Created manually by domain experts; no proprietary corporate data is used.