Introducing $C^4$ and Ebla-1


A benchmark on grounded reasoning and a state-of-the-art model.

Research · March 10, 2026

Knowledge workers spend much of their time searching for, synthesizing, and verifying information across internal documents. AI agents are increasingly deployed for this work, but frontier models fail in consistent and often consequential ways: they confidently cite real documents that don't contain the claimed information, misread visual content like diagrams and tables, and fabricate values when reconciling evidence across sources.

Today, we're releasing $C^4$, a benchmark that measures grounded reasoning¹ in enterprise² environments with dense, multimodal document corpora, and Ebla-1, a 120B model that scores 25.4% on it, beating every frontier model we tested.

$C^4$ evaluates on four axes (Correctness, Completeness, Composition, and Citations) with a rubric that penalizes fabrication directly. This penalty structure doubles as an RL training signal: we built Ebla-1 by reinforcement fine-tuning GPT-OSS-120b High via OpenAI's RFT API on a separate set of 30 training tasks, using the rubric score as reward. The model learns not just to ground claims in specific documents, but to reject nearby wrong substitutes and flag what it cannot verify, because fabricating costs it points.

We built 40 tasks across three simulated³ enterprise environments — a SaaS analytics platform, a financial services firm, and a chemical manufacturing company — each with realistic document corpora and platform clones (Salesforce, ServiceNow, Workday, and others). Domain specialists authored all tasks and verified every rubric criterion. We tested eight frontier models at high reasoning effort. The best, Opus 4.6 High, scores 20.1%. Ebla-1 scores 25.4%.

Only 6.1% of frontier task-model pairs are full solves. Each model runs every task multiple times at high reasoning effort; the chart above shows Pass@1 and Pass@8 scores.

The diagram below shows one representative benchmark task and the path a human expert follows to solve it — searching a multimodal contract corpus, identifying which clauses control the answer, reconciling dependencies across documents, and rejecting plausible but irrelevant evidence.

Methodology

Environments

Domain specialists authored all corpora, tasks, and gold outputs. Most tasks require cross-source reasoning: for example, calculating a team's budget utilization from an org chart, a compliance report, and a Workday export. Every expert performed their own tasks end-to-end, ensuring each task is solvable and each rubric criterion is grounded in a verified answer.

Environment Creation

Step-by-step flow from simulated company setup to validated tasks, rubrics, and environment delivery

1. Define Simulated Company

A lead domain expert designs a simulated company's structure, relationships, and document plan.

- Select industry vertical
- Make knowledge graph
- Make document templates
- Plan document mix

Setup review: 2 reviewers

2. Create Platform Clones

Platform engineers build replicas of platforms, populated by data from the knowledge graph.

Platforms: Salesforce, Google Analytics, Zuora, AWS, HubSpot, Workday, ServiceNow, Tableau, SAP, Power BI, Internal Tools

3. Create Corpus

Domain experts author a diverse set of documents representing the company knowledge graph.

Formats: PDF, Sheets, Slides, Docs, PPTX, DOCX, XLSX

Consistency review: 2 reviewers

4. Task Specification

Domain experts define the prompt, tone, ontology, failure profile, verifier bundle, and metadata before calibrating against frontier models.

- Write prompt with realistic tone, voice, and ambiguity
- Specify expected deliverable and answer format
- Assign ontology category and task type
- Set failure timing tier and target surfaces
- Build verifier bundle and scoring logic
- Attach metadata, artifact pointers, and tags
- Test with frontier models and iterate

5. Gold Outputs & Rubrics

A separate expert solves each task, writes the gold reference, and verifies the binary rubric plus end-to-end checks. The gold output must score perfectly; otherwise the prompt and rubric are revised.

- Separate expert solves task end-to-end
- Produce gold output in the required format
- Record root cause, correct fix, and expected checks
- Write weighted binary rubric criteria
- Validate gold path and end-to-end checks

Alignment QA: 3 reviewers

6. Human Baselining

New experts execute a sample of tasks to verify feasibility and fix any issues found.

- Fresh experts execute sample
- Verify solvability and calibrate time
- Iterate prompts and rubrics

7. Judge Calibration

Build an expert-labeled validation set to verify automated judge reliability.

- Sample diverse model outputs
- Build expert-labeled set
- Validate judge accuracy

8. Final Deliverables

Environment delivery includes artifacts, validated tasks, verifier bundles, and the ontology and metadata needed to run them correctly.

- Artifacts
- Tasks & Tone Specs
- Verifiers & Gold Paths
- Ontology & Metadata

$C^4$ spans three simulated enterprise environments: a SaaS analytics platform, a financial services firm, and a chemical manufacturing company. Each is generated from a knowledge graph that serves as the single source of truth. The graph encodes entities (people, teams, products, customers, policies, financial records) and their relationships; both the document corpus and platform clones (e.g., Salesforce, ServiceNow, Workday) are populated directly from it.
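As a rough illustration of the single-source-of-truth idea, the sketch below derives both a prose document and a platform-style record from one toy graph. Everything in it is hypothetical: the `KnowledgeGraph` class, the `render_account_memo` and `to_crm_record` helpers, the `owned_by` relation, and the sample entities are invented for this example and are not the actual generation pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy graph: entities keyed by id, plus (subject, relation, object) edges."""
    entities: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_entity(self, entity_id, kind, **attrs):
        self.entities[entity_id] = {"kind": kind, **attrs}

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

    def objects(self, subject, relation):
        return [o for s, r, o in self.edges if s == subject and r == relation]


def render_account_memo(graph, customer_id):
    """Hypothetical document generator: the prose uses only facts from the graph."""
    customer = graph.entities[customer_id]
    owner = graph.entities[graph.objects(customer_id, "owned_by")[0]]
    return f"{customer['name']} (ARR ${customer['arr']:,}) is managed by {owner['name']}."


def to_crm_record(graph, customer_id):
    """Hypothetical platform-clone row: the same facts in structured form."""
    customer = graph.entities[customer_id]
    owner = graph.entities[graph.objects(customer_id, "owned_by")[0]]
    return {"account": customer["name"], "arr": customer["arr"], "owner": owner["name"]}


# Invented sample data, for illustration only.
graph = KnowledgeGraph()
graph.add_entity("cust-1", "customer", name="Acme Labs", arr=120000)
graph.add_entity("emp-7", "person", name="Dana Reyes")
graph.relate("cust-1", "owned_by", "emp-7")

memo = render_account_memo(graph, "cust-1")
record = to_crm_record(graph, "cust-1")
```

Because both outputs read from the same graph, a fact such as the account owner cannot drift between the document and the platform record, which is what makes cross-source rubric criteria checkable.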

Agent

The agent follows a search-fetch-answer loop with a fixed budget of 36 tool-use steps. The search tool returns up to 5 results ranked by semantic similarity. The fetch tool retrieves and analyzes a document. For PDFs, an isolated LLM call to the native PDF API analyzes the document against the query, so the raw content never enters the main context. The agent submits its final response with page-level citations once it has gathered sufficient information.
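The loop described above could look roughly like the following sketch. Only the 36-step budget and the 5-result limit come from the text; the `run_agent` function, the action dictionary format, and the `search`, `fetch`, and `plan_next` callables are hypothetical stand-ins, and the real harness may be organized quite differently.

```python
from dataclasses import dataclass, field

MAX_STEPS = 36  # fixed tool-use budget stated in the text
TOP_K = 5       # search returns up to 5 results


@dataclass
class AgentState:
    steps_used: int = 0
    notes: list = field(default_factory=list)  # (doc_id, page, finding) tuples


def run_agent(question, search, fetch, plan_next, max_steps=MAX_STEPS):
    """Search-fetch-answer loop; all three callables are hypothetical."""
    state = AgentState()
    while state.steps_used < max_steps:
        action = plan_next(question, state)  # the policy picks the next move
        if action["type"] == "answer":
            break
        if action["type"] == "search":
            hits = search(action["query"])[:TOP_K]
            state.notes.extend((doc_id, None, "candidate") for doc_id in hits)
        elif action["type"] == "fetch":
            # fetch is assumed to return (page, finding) only, so the raw
            # document content never enters the main context.
            page, finding = fetch(action["doc_id"], action["query"])
            state.notes.append((action["doc_id"], page, finding))
        state.steps_used += 1
    # Only findings tied to a specific page can back a page-level citation.
    cited = [note for note in state.notes if note[1] is not None]
    return {"findings": cited, "steps_used": state.steps_used}


# Stub tools and a scripted policy, for illustration only.
def demo_search(query):
    return ["doc-12", "doc-40"]


def demo_fetch(doc_id, query):
    return 3, f"{doc_id} addresses '{query}'"


def demo_plan(question, state):
    if state.steps_used == 0:
        return {"type": "search", "query": question}
    if state.steps_used == 1:
        return {"type": "fetch", "doc_id": "doc-12", "query": question}
    return {"type": "answer"}


result = run_agent("What is the renewal notice period?", demo_search, demo_fetch, demo_plan)
exhausted = run_agent("q", demo_search, demo_fetch,
                      lambda q, s: {"type": "search", "query": q})
```

The second call shows the budget acting as a hard stop: a policy that only ever rephrases its search runs out of steps without collecting a single citable finding.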

Evaluation

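Each task has a rubric of binary criteria partitioned across the four axes (Correctness, Completeness, Composition, Citations). Let $v_i \in \{0, 1\}$ be the LLM judge's verdict for criterion $i$ and $w_i$ its signed weight (positive for rewards, negative for penalties such as fabrication). The final score is:

$$S = \frac{\sum_i w_i v_i}{\sum_{i : w_i > 0} w_i}$$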

We weight the rubric most heavily toward Correctness, then Completeness, then Citations, and finally Composition.
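A minimal sketch of this kind of rubric scoring is shown below. It assumes the weighted sum of verdicts is normalized by the total positive weight; that normalizer, the `Criterion` class, and the example weights are our assumptions for illustration, chosen only to respect the stated ordering (Correctness, then Completeness, then Citations, then Composition).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    axis: str
    weight: float  # signed: > 0 for rewards, < 0 for penalties such as fabrication
    verdict: int   # judge's binary verdict: 1 if the criterion fired, else 0


def rubric_score(criteria):
    """Weighted sum of verdicts divided by total positive weight (assumed normalizer)."""
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    if max_points <= 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    earned = sum(c.weight * c.verdict for c in criteria)
    return earned / max_points


# Hypothetical rubric: three of four reward criteria met, no penalty triggered.
grounded = [
    Criterion("correctness", 4.0, 1),
    Criterion("completeness", 3.0, 0),
    Criterion("citations", 2.0, 1),
    Criterion("composition", 1.0, 1),
    Criterion("correctness", -3.0, 0),  # fabrication penalty, not triggered
]

# Same response, except the judge also flags a fabricated value.
fabricated = grounded[:-1] + [Criterion("correctness", -3.0, 1)]

score_grounded = rubric_score(grounded)
score_fabricated = rubric_score(fabricated)
```

With these invented weights, the penalty removes almost half of the credit the response had otherwise earned, which is the property that makes fabrication costly under a signed-weight rubric.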

Results

$C^4$ Axes

Figure: performance by evaluation axis (Correctness, Completeness, Composition, Citations), with scores on a 0% to 30% scale, for Ebla-1 High, Opus 4.6 High, Sonnet 4.6 High, GPT-5.4 High, Gemini 3.1 Pro High, GPT-5.2 High, Grok 4.1 Fast High, GPT-OSS-120b High, and Gemini 3 Flash High.

Correctness is the primary discriminator across models, driven by differences in retrieval accuracy and visual document extraction. Composition is the most uniform axis — even poor responses tend to be well-formatted. Citations varies substantially: weaker models provide vague document references while the strongest cite exact document IDs and page numbers.

Categories & Modalities

Performance across document categories and input modalities

Columns: GRC, People, Finance, Sales, Engineering, Operations (document categories); Text, Tabular, Visual (input modalities). Each row lists the cell values available for that model, left to right; some cells are missing.

Ebla-1 High: 28%, 33%, 23%, 39%, 20%, 29%, 12%, 24%
Opus 4.6 High: 15%, 37%, 54%, 21%, 14%, 26%
Sonnet 4.6 High: 34%, 23%, 10%, 28%, 17%, 25%, 17%
GPT-5.4 High: 30%, 18%, 27%, 14%, 23%, 14%
Gemini 3.1 Pro High: 20%, 24%, 10%, 20%, 16%, 12%
GPT-5.2 High: 18%, 35%, 13%, 16%
Grok 4.1 Fast High: 14%, 12%
GPT-OSS-120b High: 14%, 13%, 10%
Gemini 3 Flash High: 16%

AI agents fail along several dimensions: retrieval, reasoning, perception, and calibration. The hardest tasks often fail on retrieval, multi-hop logic, or visual parsing.

During evaluation, we identified several failure modes across frontier models. The two most consequential are failures of abstention and visual misinterpretation.

When the corpus lacks the requested information, every model generates a plausible-sounding answer, citing real documents that don't contain the claimed information. The output is indistinguishable from a well-sourced response without manually checking every citation. In compliance or legal contexts, a confident wrong answer is worse than no answer.

Models retrieve documents containing diagrams but consistently misread them. A safety evacuation diagram in one of our environments is found by every model but described with incorrect spatial relationships and missing elements, producing outputs that would be unsafe to act on.

When combining values across sources, models invent intermediate numbers and perform arithmetic on fabricated inputs with full confidence. When documents contradict each other, models fabricate governance hierarchies or override rules rather than flagging the conflict. On documents exceeding 15 pages, models cite nearby pages rather than the actual source page.

Training

The rubric provides dense partial credit, making it well-suited as an RL training signal, not just an evaluation. Each task's rubric contains binary criteria with signed weights, producing a near-continuous reward distribution rather than the sparse binary signal typical of agentic benchmarks. The four axes provide independent reward dimensions, and the simulated corpora are contamination-free by construction.

We trained GPT-OSS-120b High, a 120B open-weight model that scores 7.1% on $C^4$, below nearly every frontier model we tested. Most runs produce near-zero scores: the model either fabricates answers that trigger penalty criteria or hedges so aggressively that it fails to answer the question.

We used OpenAI's RFT API on 30 training tasks (10 per environment) with the $C^4$ score as the RL reward signal. The training tasks are stratified by difficulty across anchor tasks (baseline score 60–75%), main tasks (25–60%), and stretch tasks (<25%). Training ran for 30 epochs with 8 rollouts per task per epoch (7,200 rollouts total): no SFT warmup, no curated demonstrations, no human preference labels. Fabrication penalties were retained in the reward signal.
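The rollout count and the difficulty tiers can be checked with a few lines of arithmetic. In the sketch below, the constants come from the text, while the `difficulty_tier` function and its boundary handling are assumptions: the stated ranges share the endpoints 25% and 60%, and we arbitrarily assign each shared endpoint to the higher tier.

```python
ENVIRONMENTS = 3
TASKS_PER_ENVIRONMENT = 10
EPOCHS = 30
ROLLOUTS_PER_TASK_PER_EPOCH = 8


def total_rollouts(tasks, epochs, rollouts_per_task):
    """Every task is rolled out a fixed number of times in every epoch."""
    return tasks * epochs * rollouts_per_task


def difficulty_tier(baseline_score):
    """Map a baseline score in [0, 0.75] to a tier (boundary handling assumed)."""
    if not 0.0 <= baseline_score <= 0.75:
        raise ValueError("outside the 0-75% range covered by the three tiers")
    if baseline_score >= 0.60:
        return "anchor"   # 60-75%
    if baseline_score >= 0.25:
        return "main"     # 25-60%
    return "stretch"      # below 25%


training_tasks = ENVIRONMENTS * TASKS_PER_ENVIRONMENT
rollouts = total_rollouts(training_tasks, EPOCHS, ROLLOUTS_PER_TASK_PER_EPOCH)
tiers = [difficulty_tier(s) for s in (0.10, 0.40, 0.70)]
```

Here `rollouts` reproduces the 7,200 figure quoted above: 30 tasks times 30 epochs times 8 rollouts.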

Each trace receives its $C^4$ score as reward, mapped to [0, 1]. Because the numerator includes negative-weight criteria, triggering fabrication penalties can yield a lower reward than a blank response. That made fabrication expensive enough that early training favored safe refusal. But because the rubric also rewarded verified partial progress, continued RL pushed the model toward calibrated commitment: answer what is supported, reject nearby unsupported substitutes, and abstain only on the unresolved parts.
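The text does not say how the score is mapped to [0, 1]. One mapping that preserves the ordering described here, offered purely as an assumption, is an affine rescale from the lowest achievable raw score (every penalty fires, nothing is earned) to the highest (every reward criterion met, no penalty). The function names and weights below are hypothetical.

```python
def raw_score(weights, verdicts):
    """Signed weighted sum of verdicts over total positive weight (assumed form)."""
    positive = sum(w for w in weights if w > 0)
    return sum(w * v for w, v in zip(weights, verdicts)) / positive


def reward(weights, verdicts):
    """Affine map of the raw score onto [0, 1]; this mapping is an assumption."""
    positive = sum(w for w in weights if w > 0)
    negative = sum(w for w in weights if w < 0)
    lowest = negative / positive  # raw score when only penalties fire
    return (raw_score(weights, verdicts) - lowest) / (1.0 - lowest)


# Hypothetical rubric: four reward criteria and one fabrication penalty.
WEIGHTS = [4.0, 3.0, 2.0, 1.0, -3.0]

blank = reward(WEIGHTS, [0, 0, 0, 0, 0])        # nothing earned, nothing penalized
fabricated = reward(WEIGHTS, [0, 0, 0, 0, 1])   # only the penalty fires
partial = reward(WEIGHTS, [1, 0, 1, 0, 0])      # verified partial answer
perfect = reward(WEIGHTS, [1, 1, 1, 1, 0])      # every reward criterion met
```

Under this mapping the ordering is `fabricated < blank < partial < perfect`, matching the incentive described above: refusing beats fabricating, and a verified partial answer beats refusing.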

The post-trained model, Ebla-1, scores 25.4% mean on the full 40-task benchmark, 5.3 PP above the best frontier model (Opus 4.6 High at 20.1%) and a +18.3 PP gain over the 7.1% baseline.

Completeness gained most (+18.8 PP, 4.0× baseline): the model learned to decompose questions and retrieve evidence for all parts. Correctness gained +14.6 PP (2.4×), reflecting broader improvements in retrieval, verification, and calibration under a reward that also penalizes fabrication. Citations gained +14.5 PP (2.8×), with the model learning to ground claims in specific document pages. Composition gained +13.0 PP (1.8×), the smallest relative gain since the baseline was already highest there. All evaluation runs used a fixed step budget of 36 tool-use calls with no test-time compute scaling; reported scores reflect single-pass inference.
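The per-axis baselines are not listed, but they can be backed out from each gain and ratio: if the post-training score is `ratio` times the baseline and exceeds it by `gain` points, the baseline is `gain / (ratio - 1)`. The sketch below applies that identity to the figures quoted above; because the published ratios are rounded, the results are approximations, and the `implied_scores` helper is ours.

```python
# (gain in percentage points, post-training score as a multiple of baseline)
AXIS_GAINS = {
    "Completeness": (18.8, 4.0),
    "Correctness": (14.6, 2.4),
    "Citations": (14.5, 2.8),
    "Composition": (13.0, 1.8),
}


def implied_scores(gain_pp, ratio):
    """Solve post = ratio * base and post = base + gain for (base, post)."""
    baseline = gain_pp / (ratio - 1.0)
    return baseline, baseline + gain_pp


implied = {axis: implied_scores(*vals) for axis, vals in AXIS_GAINS.items()}
highest_baseline_axis = max(implied, key=lambda axis: implied[axis][0])
smallest_ratio_axis = min(AXIS_GAINS, key=lambda axis: AXIS_GAINS[axis][1])

for axis, (base, post) in implied.items():
    print(f"{axis}: baseline ~{base:.1f}%, post-training ~{post:.1f}%")
```

The implied baselines agree with the text's explanation: Composition has both the highest implied baseline and the smallest relative gain.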

Beyond aggregate scores, RL training produced qualitative shifts in how Ebla-1 interacts with the search environment. The baseline rephrases the same query with minor variations and frequently exhausts its full 36-step budget without converging. Ebla-1 decomposes multi-part questions into distinct sub-queries targeting different document categories, then fetches specific documents to verify claims before committing. It is more likely to reject plausible nearby wrong evidence, keep fields scoped to the right source, and follow a controlling condition through to its downstream consequence rather than stopping at the first plausible rule.

Perhaps most strikingly, Ebla-1 learned to commit under partial evidence. When evidence for one sub-question is unavailable, it answers what it can verify and explicitly states what it could not find, earning partial credit without triggering fabrication penalties. The baseline either fabricates confidently or refuses entirely. This calibrated commitment was not explicitly trained; it emerged from the interaction between the multi-axis rubric and the penalty structure.

Total inference costs ranged from $0.35 to $24.74, a 70× spread, when each model ran the full benchmark. Ebla-1 ran the full 40-task benchmark for $1.10 in total inference cost while scoring 5.3 PP higher than Opus 4.6 High, the highest-scoring frontier model, which cost $24.74: roughly 22× cheaper. $C^4$ tasks are hard enough that no frontier model solves more than a handful, but structured enough that a 120B model can learn to outperform all of them with a single RL training run on commodity compute.

Access

For more details, contact founders@aviro.ai.


¹ Grounded reasoning: multi-step inference where every claim must be traceable to a specific source document, page, and passage, as opposed to free-form generation or closed-book QA.
² Enterprise: a large, structured organization with governance, repeatable processes, compliance, complex IT, and cross-functional coordination. Enterprise data refers to the organization's authoritative, shared information used across teams and systems. Enterprise documentation refers to internal authored files (policies, specs, roadmaps, notes, training materials, and presentations) that teams rely on for operations and decisions.
³ Simulated: created manually by domain experts; no proprietary corporate data is used.
