Introducing \(C^4\) and Ebla-1

A benchmark on grounded reasoning and a state-of-the-art model.

Research, March 10, 2026

Knowledge workers spend much of their time searching for, synthesizing, and verifying information across internal documents. AI agents are increasingly deployed for this work, but frontier models fail in consistent and often consequential ways: they confidently cite real documents that don't contain the claimed information, misread visual content like diagrams and tables, and fabricate values when reconciling evidence across sources.

Today, we're releasing \(C^4\), a benchmark that measures grounded reasoning¹ in enterprise² environments with dense, multimodal document corpora, and Ebla-1, a 120B model that scores 25.4% on it, beating every frontier model we tested.

Overall Performance (all models evaluated at high reasoning effort)

| Model | Pass@1 | Pass@8 |
|---|---|---|
| Ebla-1 High | 25.4% | 37.1% |
| Opus 4.6 High | 20.1% | 36.6% |
| Sonnet 4.6 High | 19.3% | 35.0% |
| GPT-5.4 High | 17.8% | 32.9% |
| Gemini 3.1 Pro High | 12.2% | 31.2% |
| GPT-5.2 High | 11.3% | 22.0% |
| Grok 4.1 Fast High | 8.2% | 22.5% |
| GPT-OSS-120b High | 7.1% | 21.8% |
| Gemini 3 Flash High | 6.3% | 21.6% |

\(C^4\) evaluates on four axes (Correctness, Completeness, Composition, and Citations) with a rubric that penalizes fabrication directly. This penalty structure doubles as an RL training signal: we built Ebla-1 by reinforcement fine-tuning GPT-OSS-120b High via OpenAI's RFT API on a separate set of 30 training tasks, using the rubric format as reward. The model learns not just to ground claims in specific documents, but to reject nearby wrong substitutes and flag what it cannot verify, because fabricating costs it points.

We built 40 tasks across three simulated³ enterprise environments (a SaaS analytics platform, a financial services firm, and a chemical manufacturing company), each with realistic document corpora and platform clones (Salesforce, ServiceNow, Workday, and others). Domain specialists authored all tasks and verified every rubric criterion. We tested eight frontier models at high reasoning effort. The best, Opus 4.6 High, scores 20.1%. Ebla-1 scores 25.4%.

Only 6.1% of frontier task-model pairs are full solves. Each model runs every task multiple times at high reasoning effort; the chart above shows Pass@1 and Pass@8 scores.
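The chart reports graded rubric scores rather than binary solves. One way to read the two metrics, assuming Pass@1 is the mean score across runs and Pass@8 is the best score among eight runs (both averaged over tasks), is sketched below. These definitions, the function names, and the sample scores are assumptions for illustration, not values from the benchmark.

```python
from statistics import mean

def pass_at_1(runs_per_task: dict[str, list[float]]) -> float:
    """Mean over tasks of the mean score across runs (assumed definition)."""
    return mean(mean(runs) for runs in runs_per_task.values())

def pass_at_k(runs_per_task: dict[str, list[float]], k: int) -> float:
    """Mean over tasks of the best score among the first k runs (assumed definition)."""
    return mean(max(runs[:k]) for runs in runs_per_task.values())

# Made-up scores in [0, 1] for two hypothetical tasks, eight runs each.
example_runs = {
    "task_a": [0.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.0, 0.25],
    "task_b": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

p1 = pass_at_1(example_runs)     # 0.0625
p8 = pass_at_k(example_runs, 8)  # 0.25
```

Under this reading, Pass@8 can never be lower than Pass@1, which matches the ordering in the chart.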

The diagram below shows one representative benchmark task and the path a human expert follows to solve it — searching a multimodal contract corpus, identifying which clauses control the answer, reconciling dependencies across documents, and rejecting plausible but irrelevant evidence.

Methodology

Environments

Domain specialists authored all corpora, tasks, and gold outputs. Most tasks require cross-source reasoning: for example, calculating a team's budget utilization from an org chart, a compliance report, and a Workday export. Every expert performed their own tasks end-to-end, ensuring each task is solvable and each rubric criterion is grounded in a verified answer.

Environment Creation

Step-by-step flow from simulated company setup to validated tasks, rubrics, and environment delivery

1. Define Simulated Company
   A lead domain expert designs a simulated company's structure, relationships, and document plan.
   - Select industry vertical
   - Make knowledge graph
   - Make document templates
   - Plan document mix
   - Setup review: 2 reviewers

2. Create Platform Clones
   Platform engineers build replicas of platforms, populated by data from the knowledge graph.
   - Platforms: Salesforce, Google Analytics, Zuora, AWS, HubSpot, Workday, ServiceNow, Tableau, SAP, Power BI, Internal Tools

3. Create Corpus
   Domain experts author a diverse set of documents representing the company knowledge graph.
   - Formats: PDF, Sheets, Slides, Docs, PPTX, DOCX, XLSX
   - Consistency review: 2 reviewers

4. Task Specification
   Domain experts define the prompt, tone, ontology, failure profile, verifier bundle, and metadata before calibrating against frontier models.
   - Write prompt with realistic tone, voice, and ambiguity
   - Specify expected deliverable and answer format
   - Assign ontology category and task type
   - Set failure timing tier and target surfaces
   - Build verifier bundle and scoring logic
   - Attach metadata, artifact pointers, and tags
   - Test with frontier models and iterate

5. Gold Outputs & Rubrics
   A separate expert solves each task, writes the gold reference, and verifies the binary rubric plus end-to-end checks. Gold must score perfectly; otherwise the prompt and rubric are fixed.
   - Separate expert solves task end-to-end
   - Produce gold output in the required format
   - Record root cause, correct fix, and expected checks
   - Write weighted binary rubric criteria
   - Validate gold path and end-to-end checks
   - Alignment QA: 3 reviewers

6. Human Baselining
   New experts execute some tasks to verify feasibility and fix any issues found.
   - Fresh experts execute sample
   - Verify solvability & calibrate time
   - Iterate prompts & rubrics

7. Judge Calibration
   Build an expert-labeled validation set to verify automated judge reliability.
   - Sample diverse model outputs
   - Build expert-labeled set
   - Validate judge accuracy

8. Final Deliverables
   Environment delivery includes artifacts, validated tasks, verifier bundles, and the ontology and metadata needed to run them correctly.
   - Artifacts
   - Tasks & Tone Specs
   - Verifiers & Gold Paths
   - Ontology & Metadata
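For the judge calibration step, a minimal sketch of checking an automated judge against expert labels might look like the following. It assumes binary verdicts per rubric criterion; raw accuracy and Cohen's kappa are a common choice of agreement metrics, not necessarily the ones used here, and the labels are made up.

```python
def judge_agreement(expert: list[int], judge: list[int]) -> tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for two lists of binary labels."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(expert)
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Chance agreement from each rater's marginal rate of positive labels.
    p_expert, p_judge = sum(expert) / n, sum(judge) / n
    expected = p_expert * p_judge + (1 - p_expert) * (1 - p_judge)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical expert and judge verdicts for eight criteria.
expert_labels = [1, 0, 1, 1, 0, 0, 1, 0]
judge_labels = [1, 0, 1, 0, 0, 0, 1, 1]
accuracy, kappa = judge_agreement(expert_labels, judge_labels)  # 0.75, 0.5
```

Kappa is reported alongside accuracy because a judge that always answers "not met" can look accurate on a rubric where most criteria fail.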

\(C^4\) spans three simulated enterprise environments: a SaaS analytics platform, a financial services firm, and a chemical manufacturing company. Each is generated from a knowledge graph that serves as the single source of truth. The graph encodes entities (people, teams, products, customers, policies, financial records) and their relationships; both the document corpus and platform clones (e.g., Salesforce, ServiceNow, Workday) are populated directly from it.
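To illustrate the single-source-of-truth idea, the sketch below renders both a document line and a platform-style record from the same small graph. Every class name, field, relation, and entity here is hypothetical; the actual graph schema is not described in this post.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    id: str
    kind: str  # e.g. "person" or "team"
    name: str

@dataclass
class KnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (source, relation, target)

    def add(self, entity: Entity) -> None:
        self.entities[entity.id] = entity

    def relate(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def targets(self, source: str, relation: str) -> list[Entity]:
        return [self.entities[t] for s, r, t in self.edges if s == source and r == relation]

def org_chart_line(graph: KnowledgeGraph, person_id: str) -> str:
    """Render a sentence for a corpus document (hypothetical template)."""
    person = graph.entities[person_id]
    team = graph.targets(person_id, "member_of")[0]
    return f"{person.name} is a member of {team.name}."

def hr_record(graph: KnowledgeGraph, person_id: str) -> dict[str, str]:
    """Render a row for a platform clone (hypothetical record shape)."""
    person = graph.entities[person_id]
    team = graph.targets(person_id, "member_of")[0]
    return {"employee": person.name, "team": team.name}

graph = KnowledgeGraph()
graph.add(Entity("p1", "person", "Dana Example"))
graph.add(Entity("t1", "team", "Platform Reliability"))
graph.relate("p1", "member_of", "t1")

doc_line = org_chart_line(graph, "p1")
record = hr_record(graph, "p1")
```

Because both renderers read from the same graph, a document and a platform record cannot disagree about a fact unless the inconsistency is introduced on purpose.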

Agent

The agent follows a search-fetch-answer loop with a fixed budget of 36 tool-use steps. The search tool returns up to 5 results ranked by semantic similarity. The fetch tool retrieves and analyzes a document. For PDFs, an isolated LLM call to the native PDF API analyzes the document against the query without the raw content entering the main context. The agent submits its final response with page-level citations once it has acquired sufficient information.
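The loop can be sketched as follows. The `search` and `fetch` callables, the `policy` function, and the action format are placeholders introduced for illustration; only the 36-step budget and the 5-result cap come from the description above. Whether the final submission counts against the budget is not stated, so this sketch counts only search and fetch calls.

```python
from typing import Any, Callable

MAX_STEPS = 36   # fixed tool-use budget
MAX_RESULTS = 5  # results returned per search

def run_agent(
    question: str,
    policy: Callable[[str, list], dict[str, Any]],
    search: Callable[[str], list[str]],
    fetch: Callable[[str, str], str],
) -> dict[str, Any]:
    """Run a search-fetch-answer loop until the policy answers or the budget runs out."""
    history: list[tuple[dict, Any]] = []
    for step in range(MAX_STEPS):
        action = policy(question, history)
        if action["type"] == "search":
            observation = search(action["query"])[:MAX_RESULTS]
        elif action["type"] == "fetch":
            observation = fetch(action["doc_id"], action["query"])
        elif action["type"] == "answer":
            return {"answer": action["text"], "citations": action["citations"], "steps": step}
        else:
            raise ValueError(f"unknown action type: {action['type']}")
        history.append((action, observation))
    return {"answer": None, "citations": [], "steps": MAX_STEPS}  # budget exhausted

def toy_policy(question: str, history: list) -> dict[str, Any]:
    """Search once, fetch the top hit, then answer with a (document, page) citation."""
    if not history:
        return {"type": "search", "query": question}
    if len(history) == 1:
        return {"type": "fetch", "doc_id": history[0][1][0], "query": question}
    return {"type": "answer", "text": history[1][1], "citations": [("doc-1", 3)]}

result = run_agent(
    "What is the renewal term?",
    toy_policy,
    search=lambda query: [f"doc-{i}" for i in range(1, 9)],
    fetch=lambda doc_id, query: f"summary of {doc_id}",
)
```

A policy that keeps rephrasing the same search without ever answering falls through the loop and returns no answer, which mirrors running out of budget.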

Evaluation

Each task has a rubric of binary criteria partitioned across the four axes (Correctness, Completeness, Composition, and Citations). Let \(v_i \in \{0, 1\}\) be the LLM judge's verdict for criterion \(i\) and \(w_i\) its signed weight (positive for rewards, negative for penalties such as fabrication). The final score is:
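One form consistent with this description, writing \(v_i \in \{0, 1\}\) for the judge's verdict on criterion \(i\) and \(w_i\) for its signed weight, is shown below. The normalizer (the total positive weight) is an assumption; the text only establishes that the numerator includes negative-weight criteria.

\[
\text{score} = \frac{\sum_{i} w_i \, v_i}{\sum_{i \,:\, w_i > 0} w_i}
\]

Under this form a response that meets every reward criterion and triggers no penalty scores 1, and triggered penalties can push the value below 0.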

We weight the rubric most heavily toward Correctness, then Completeness, then Citations, and finally Composition.
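A minimal scoring sketch, assuming binary verdicts, signed weights, and normalization by the total positive weight (the normalizer is an assumption). The `Criterion` type, the criteria text, and the example weights are invented for illustration; the weights merely follow the stated ordering of the axes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    axis: str
    description: str
    weight: float  # positive = reward, negative = penalty
    verdict: bool  # judge's binary decision: was this criterion triggered?

def rubric_score(criteria: list[Criterion]) -> float:
    """Sum of triggered signed weights divided by the total positive weight."""
    positive_total = sum(c.weight for c in criteria if c.weight > 0)
    if positive_total <= 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    earned = sum(c.weight for c in criteria if c.verdict)
    return earned / positive_total

def example_rubric(fabricated: bool) -> list[Criterion]:
    return [
        Criterion("Correctness", "States the controlling clause", 4.0, True),
        Criterion("Completeness", "Covers every sub-question", 3.0, False),
        Criterion("Citations", "Cites document ID and page", 2.0, True),
        Criterion("Composition", "Uses the requested format", 1.0, True),
        Criterion("Correctness", "Fabricates an unsupported value", -3.0, fabricated),
    ]

clean_score = rubric_score(example_rubric(fabricated=False))      # 0.7
penalized_score = rubric_score(example_rubric(fabricated=True))   # 0.4
```

The two calls differ only in whether the penalty criterion fires, which shows how a single fabrication can erase most of the credit earned elsewhere.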

Results

Axes

[Figure: Performance by evaluation axis. Score (0% to 30%) for each model (Ebla-1 High, Opus 4.6 High, Sonnet 4.6 High, GPT-5.4 High, Gemini 3.1 Pro High, GPT-5.2 High, Grok 4.1 Fast High, GPT-OSS-120b High, Gemini 3 Flash High) on Correctness, Completeness, Composition, and Citations.]

Correctness is the primary discriminator across models, driven by differences in retrieval accuracy and visual document extraction. Composition is the most uniform axis — even poor responses tend to be well-formatted. Citations varies substantially: weaker models provide vague document references while the strongest cite exact document IDs and page numbers.

Categories & Modalities

Performance across document categories and input modalities

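| Model | GRC | People | Finance | Sales | Engineering | Operations | Text | Tabular | Visual |
|---|---|---|---|---|---|---|---|---|---|
| Ebla-1 High | 8% | 28% | 33% | 23% | 39% | 20% | 29% | 12% | 24% |
| Opus 4.6 High | 0% | 15% | 37% | 0% | 54% | 0% | 21% | 14% | 26% |
| Sonnet 4.6 High | 7% | 34% | 23% | 10% | 28% | 17% | 25% | 0% | 17% |
| GPT-5.4 High | 1% | 3% | 30% | 18% | 27% | 14% | 23% | 2% | 14% |
| Gemini 3.1 Pro High | 0% | 20% | 24% | 0% | 10% | 20% | 16% | 0% | 12% |
| GPT-5.2 High | 0% | 2% | 18% | 3% | 35% | 0% | 13% | 16% | 0% |
| Grok 4.1 Fast High | 0% | 0% | 14% | 14% | 12% | 0% | 12% | 0% | 0% |
| GPT-OSS-120b High | 0% | 0% | 14% | 7% | 13% | 0% | 10% | 0% | 2% |
| Gemini 3 Flash High | 2% | 0% | 16% | 9% | 0% | 2% | 7% | 9% | 0% |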

AI agents fail along several dimensions: retrieval, reasoning, perception, and calibration. On the hardest tasks, they most often fail on retrieval, multi-hop logic, or visual parsing.

During evaluation, we identified several failure modes across frontier models. The two most consequential are failures of abstention and visual misinterpretation.

When the corpus lacks the requested information, every model generates a plausible-sounding answer, citing real documents that don't contain the claimed information. The output is indistinguishable from a well-sourced response without manually checking every citation. In compliance or legal contexts, a confident wrong answer is worse than no answer.

Models retrieve documents containing diagrams but consistently misread them. A safety evacuation diagram in one of our environments is found by every model but described with incorrect spatial relationships and missing elements, producing outputs that would be unsafe to act on.

When combining values across sources, models invent intermediate numbers and perform arithmetic on fabricated inputs with full confidence. When documents contradict each other, models fabricate governance hierarchies or override rules rather than flagging the conflict. On documents exceeding 15 pages, models cite nearby pages rather than the actual source page.
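A page-level citation check of the kind these failures call for can be sketched as below. The corpus layout (a document ID mapped to a list of page texts), the function name, the status strings, and the plain substring match are simplifying assumptions; a real verifier would need fuzzier matching.

```python
def check_citation(
    corpus: dict[str, list[str]],
    doc_id: str,
    page: int,
    quoted_value: str,
    window: int = 2,
) -> str:
    """Classify a (document, page) citation for a quoted value. Pages are 1-indexed."""
    pages = corpus.get(doc_id)
    if pages is None:
        return "unknown_document"
    if not 1 <= page <= len(pages):
        return "page_out_of_range"
    if quoted_value in pages[page - 1]:
        return "supported"
    # Look for the nearby-page misattribution pattern described above.
    first = max(1, page - window)
    last = min(len(pages), page + window)
    for nearby in range(first, last + 1):
        if nearby != page and quoted_value in pages[nearby - 1]:
            return f"wrong_page:{nearby}"
    if any(quoted_value in text for text in pages):
        return "wrong_page:elsewhere"
    return "unsupported"

# Hypothetical three-page document.
toy_corpus = {
    "contract-7": [
        "Term: 24 months.",
        "Renewal notice period: 60 days.",
        "Governing law: Delaware.",
    ]
}

ok = check_citation(toy_corpus, "contract-7", 2, "60 days")
off_by_one = check_citation(toy_corpus, "contract-7", 3, "60 days")
invented = check_citation(toy_corpus, "contract-7", 2, "90 days")
```

Separating "wrong page" from "unsupported" matters for scoring: the first is a citation error on a true claim, while the second is a real document cited for a value it does not contain.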

Training

The rubric provides dense partial credit, making it well-suited as an RL training signal, not just an evaluation. Each task's rubric contains binary criteria with signed weights, producing a near-continuous reward distribution rather than the sparse binary signal typical of agentic benchmarks. The four axes provide independent reward dimensions, and the simulated corpora are contamination-free by construction.

We trained GPT-OSS-120b High, a 120B open-weight model that scores 7.1% on \(C^4\), below nearly every frontier model we tested. Most baseline runs produce near-zero scores: the model either fabricates answers that trigger penalty criteria or hedges so aggressively that it fails to answer the question.

We used OpenAI's RFT API on 30 training tasks (10 per environment) with the \(C^4\) score as the RL reward signal. The training tasks are stratified by difficulty across anchor tasks (baseline score 60–75%), main tasks (25–60%), and stretch tasks (<25%). Training ran for 30 epochs with 8 rollouts per task per epoch (7,200 rollouts total), with no SFT warmup, no curated demonstrations, and no human preference labels. Fabrication penalties were retained in the reward signal.
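The rollout count and the difficulty tiers follow directly from the numbers above. In this sketch the tier boundaries are the stated ones, while the handling of scores that fall exactly on a boundary (25% and 60%) and the function name are assumptions.

```python
TASKS = 30              # training tasks (10 per environment)
EPOCHS = 30
ROLLOUTS_PER_TASK = 8   # per epoch

total_rollouts = TASKS * EPOCHS * ROLLOUTS_PER_TASK  # 7200

def difficulty_tier(baseline_score: float) -> str:
    """Map a baseline score in [0, 1] to a training tier."""
    if not 0.0 <= baseline_score <= 1.0:
        raise ValueError("baseline score must be in [0, 1]")
    if baseline_score < 0.25:
        return "stretch"
    if baseline_score < 0.60:
        return "main"
    if baseline_score <= 0.75:
        return "anchor"
    raise ValueError("baseline score above the stated anchor range")

tiers = [difficulty_tier(s) for s in (0.10, 0.40, 0.70)]
```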

Each trace receives its \(C^4\) score as reward, mapped to [0, 1]. Because the numerator includes negative-weight criteria, triggering fabrication penalties can yield a lower reward than a blank response. That made fabrication expensive enough that early training favored safe refusal. But because the rubric also rewarded verified partial progress, continued RL pushed the model toward calibrated commitment: answer what is supported, reject nearby unsupported substitutes, and abstain only on the unresolved parts.
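How the score is mapped to [0, 1] is not specified. One mapping that preserves the property described above (a fabricated answer earns less than a blank one) is an affine rescaling from the lowest attainable raw score to the highest. The sketch below assumes that mapping, normalization by the total positive weight, and made-up weights.

```python
def raw_score(weights: list[float], verdicts: list[bool]) -> float:
    """Triggered signed weights over total positive weight (assumed normalizer)."""
    positive_total = sum(w for w in weights if w > 0)
    if positive_total <= 0:
        raise ValueError("rubric needs at least one positive weight")
    return sum(w for w, v in zip(weights, verdicts) if v) / positive_total

def reward(weights: list[float], verdicts: list[bool]) -> float:
    """Affine map from [worst raw score, 1] onto [0, 1] (assumed mapping)."""
    positive_total = sum(w for w in weights if w > 0)
    worst = sum(w for w in weights if w < 0) / positive_total  # every penalty, no rewards
    return (raw_score(weights, verdicts) - worst) / (1.0 - worst)

# Four reward criteria and one hypothetical fabrication penalty.
weights = [4.0, 3.0, 2.0, 1.0, -5.0]

blank = [False, False, False, False, False]      # answers nothing
fabricated = [True, False, False, False, True]   # one supported claim plus a fabrication
partial = [True, False, True, False, False]      # supported claims only

r_blank = reward(weights, blank)
r_fabricated = reward(weights, fabricated)
r_partial = reward(weights, partial)
```

With these weights the ordering is fabricated < blank < partial, so refusing beats fabricating and verified partial answers beat refusing, which is the gradient toward calibrated commitment described above.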

The post-trained model, Ebla-1, scores a mean of 25.4% on the full 40-task benchmark, 5.3 PP above the best frontier model (Opus 4.6 High at 20.1%) and 18.3 PP above the baseline.

Completeness gained most (+18.8 PP, 4.0× baseline): the model learned to decompose questions and retrieve evidence for all parts. Correctness gained +14.6 PP (2.4×), reflecting broader improvements in retrieval, verification, and calibration under a reward that also penalizes fabrication. Citations gained +14.5 PP (2.8×), with the model learning to ground claims in specific document pages. Composition gained +13.0 PP (1.8×), the smallest relative gain since the baseline was already highest there. All evaluation runs used a fixed step budget of 36 tool-use calls with no test-time compute scaling; reported scores reflect single-pass inference.

Beyond aggregate scores, RL training produced qualitative shifts in how Ebla-1 interacts with the search environment. The baseline rephrases the same query with minor variations and frequently exhausts its full 36-step budget without converging. Ebla-1 decomposes multi-part questions into distinct sub-queries targeting different document categories, then fetches specific documents to verify claims before committing. It is more likely to reject plausible nearby wrong evidence, keep fields scoped to the right source, and follow a controlling condition through to its downstream consequence rather than stopping at the first plausible rule.

Perhaps most strikingly, Ebla-1 learned to commit under partial evidence. When evidence for one sub-question is unavailable, it answers what it can verify and explicitly states what it could not find, earning partial credit without triggering fabrication penalties. The baseline either fabricates confidently or refuses entirely. This calibrated commitment was not explicitly trained; it emerged from the interaction between the multi-axis rubric and the penalty structure.

Total inference costs for running the full benchmark ranged from $0.35 to $24.74 per model, a 70× spread. Ebla-1 ran the full 40-task benchmark for $1.10 in total inference cost while scoring 5.3 PP higher than Opus 4.6 High, the highest-scoring frontier model, which cost $24.74 (22× more). \(C^4\) tasks are hard enough that no frontier model solves more than a handful, but structured enough that a 120B model can learn to outperform all of them with a single RL training run on commodity compute.
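The ratios quoted above can be reproduced from the dollar figures and scores reported in this post. The variable names are illustrative.

```python
cheapest_cost, highest_cost = 0.35, 24.74  # USD, full benchmark run
ebla_cost, opus_cost = 1.10, 24.74         # USD, full benchmark run
ebla_score, opus_score = 25.4, 20.1        # Pass@1, percent

cost_spread = highest_cost / cheapest_cost  # about 70.7
cost_ratio = opus_cost / ebla_cost          # about 22.5
score_gap_pp = ebla_score - opus_score      # about 5.3
```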

Access

For more details, contact founders@aviro.ai.

Footnotes

  1. Grounded reasoning — Multi-step inference where every claim must be traceable to a specific source document, page, and passage, as opposed to free-form generation or closed-book QA.

  2. Enterprise — A large, structured organization with governance, repeatable processes, compliance, complex IT, and cross-functional coordination. Enterprise data refers to the organization's authoritative, shared information used across teams and systems. Enterprise documentation refers to internal authored files — policies, specs, roadmaps, notes, training materials, and presentations — that teams rely on for operations and decisions.

  3. Simulated — Created manually by domain experts; no proprietary corporate data is used.