
Benchmarking Coding Agents as Math Auditors: A Synthetic Financial Document Dataset

We built a financial benchmark with planted errors to test Claude Code as a math auditor — no manual labeling needed.


Alex Lyzhov

Thu Mar 19

Summary

We wanted to show that DataFramer can construct useful evaluation datasets — including ground truth labels — with minimal human involvement, even for tasks so challenging that frontier models become unreliable.

This post is one such experiment. We built a benchmark of 20 synthetic financial PDFs (P&L statements, income statements) and used it to evaluate Claude Code (powered by Claude Opus 4.6) as a judge of mathematical consistency. The result: by default, Claude Code is not a reliable auditor — but with targeted prompting, its accuracy improves substantially.

Key results:

 #  Prompt                                                                                        Precision  Recall  F1
 1  "Check math in the documents"                                                                 1.000      0.417   0.588
 2  "Check math in all the documents"                                                             1.000      0.583   0.737
 3  "Check mathematical consistency in all the documents"                                         1.000      0.667   0.800
 4  "…Use one subagent per document"                                                              1.000      0.667   0.800
 5  "…very thoroughly…find all errors whatever it takes. Use an Opus subagent for each document"  0.800      1.000   0.889

The dataset is published on HuggingFace: dataframer/math-judging.


Generating the Dataset

We started with 3 seed financial documents (P&L and income statements) and ran a straightforward workflow in DataFramer to expand them into 20 diverse PDFs. The platform’s specification system ensured diversity along two axes:

  • Content diversity: different company types, table structures, line items, revenue scales, and financial periods.
  • Visual diversity: distinct formatting styles, layouts, fonts, and table designs across documents — each PDF looks different.
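
To make the two axes concrete, here is a hypothetical sketch of such a specification expressed as plain data. The property names and values are illustrative only; DataFramer's actual spec format may differ.

```python
# Hypothetical sketch of a diversity specification: each property maps
# to a distribution over values sampled per generated document.
# Names and probabilities are illustrative, not the platform's format.
spec = {
    "content": {
        "company_type": {"restaurant": 0.25, "retail": 0.25,
                         "saas": 0.25, "manufacturing": 0.25},
        "period": {"monthly": 0.5, "quarterly": 0.3, "annual": 0.2},
    },
    "visual": {
        "layout": {"single_table": 0.5, "multi_section": 0.5},
        "font_family": {"serif": 0.5, "sans_serif": 0.5},
    },
}

# Sanity check: every distribution should sum to 1.
for axis in spec.values():
    for dist in axis.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Sampling one value per property per document is what guarantees that no two PDFs share the same combination of content and styling.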

DataFramer specification showing the full list of value distributions controlling content and visual diversity across generated documents

Planted Errors with Ground Truth

We introduced a key generation property: document correctness status. In 75% of cases, the generator was instructed to insert a single, subtle consistency violation into an otherwise correct document. This might be one wrong sum out of several hundred implicit equations — a percentage that doesn’t match its components, a variance column that’s off by a small amount, a subtotal that doesn’t add up.
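
Each of those error types is a violated arithmetic relationship. A minimal sketch of the three checks, with illustrative figures (not taken from the dataset):

```python
# Three relationship checks matching the planted error types.
# All figures are illustrative; a real document contains hundreds of
# such implicit equations.

def close(a, b, tol=0.5):
    """Allow for rounding in the printed figures."""
    return abs(a - b) <= tol

# 1. A subtotal must equal the sum of its line items.
line_items = [12_400.00, 8_750.00, 3_210.00]
subtotal = 24_360.00
assert close(subtotal, sum(line_items))

# 2. A printed percentage must match its components.
revenue, cogs = 100_000.00, 42_000.00
printed_margin_pct = 58.0
assert close(printed_margin_pct, 100 * (revenue - cogs) / revenue)

# 3. A variance column must equal actual minus budget.
actual, budget, printed_variance = 51_300.00, 49_800.00, 1_500.00
assert close(printed_variance, actual - budget)
```

A planted error perturbs exactly one such relationship, which is what makes detection require exhaustive checking rather than spot-checking.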

Document correctness status value distributions — 25% completely clean, with the remaining 75% split across specific error types like wrong subtotals, mismatched percentages, and incorrect variances

The critical design choice: we generated the ground truth labels first, then generated each document conditional on its label. This direction — label → document rather than document → label — is significantly more reliable for ensuring correctness of the ground truth.
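
The label-first direction can be sketched in a few lines. Here `generate_document` is a hypothetical stand-in for DataFramer's agentic generation step, and the clean/error split is illustrative:

```python
import random

# Label-first generation: sample the ground-truth label, then generate
# the document conditional on it. The generator never has to "discover"
# the label afterwards, so ground truth is correct by construction.
ERROR_TYPES = ["wrong_subtotal", "mismatched_percentage",
               "incorrect_variance"]

def sample_label(rng, p_clean=0.25):
    if rng.random() < p_clean:
        return {"has_error": False, "error_type": None}
    return {"has_error": True, "error_type": rng.choice(ERROR_TYPES)}

def generate_document(label):
    # Hypothetical stand-in: the real generator produces a full
    # financial PDF that satisfies (or deliberately violates) the label.
    return {"label": label, "content": "..."}

rng = random.Random(0)
dataset = [generate_document(sample_label(rng)) for _ in range(20)]
```

Going the other way — generating documents freely and labeling them afterwards — would require a judge at least as reliable as the one being evaluated.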

On top of that, DataFramer’s agentic generation loop brings multiple layers of quality control:

  • Tool use during generation: the calculator tool gives the LLM access to a sandboxed Python environment for verifying every numerical relationship.
  • Revision cycles: conformance revisions catch internal inconsistencies that slip through initial generation.
  • Filtering: conformance filters reject and regenerate documents with structural issues.
  • Multi-agent pipeline: separate roles for outlining, generation, and multiple types of revision (coherence, consistency, conformance) — each stage cross-checks the output of the previous one.

Run configuration showing revision types (Coherence & Flow, Consistency, Distinguishability, Conformance), filtering options, and cost/time estimate

The result: ground truth labels are correct 95%+ of the time — more than enough to get a reliable relative ranking of different prompts, methods, or models.

PDF Rendering

The platform also automatically renders the final documents as PDFs rather than serving raw text. This makes the task slightly more realistic — and harder — for coding agent judges like Claude Code, which must parse the PDF before they can even begin checking the math.
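
To give a flavor of why the PDF step matters: before any checking can happen, the agent has to turn extracted text into numbers. A minimal sketch of that parsing step, assuming the raw text has already been pulled out of the PDF with some extraction tool:

```python
import re

def extract_amounts(text):
    """Parse figures like $1,234.56 or (2,500.00).

    Accounting convention: parentheses denote negative values.
    """
    amounts = []
    for m in re.finditer(r"\(?\$?([\d,]+(?:\.\d+)?)\)?", text):
        value = float(m.group(1).replace(",", ""))
        if m.group(0).startswith("("):
            value = -value
        amounts.append(value)
    return amounts

line = "Total Revenue  $1,234,500.00   Variance  (2,500.00)"
print(extract_amounts(line))  # [1234500.0, -2500.0]
```

Layout quirks — multi-column tables, merged cells, footnote markers — make even this step nontrivial, which is part of what the benchmark exercises.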

A generated sample PDF — Red Ember Pizza Company Income Statement with detailed expense breakdowns and financial calculations


Evaluating Claude Code

With the dataset in hand, we used it as a laziness benchmark for Claude Code — the state-of-the-art coding agent, powered by Claude Opus 4.6, currently the most capable model available.

The task is conceptually straightforward: given a financial PDF, check whether all internal numerical relationships are consistent. A sufficiently thorough agent could verify every equation. But in practice, frontier models tend to cut corners — they spot-check a few numbers, declare everything fine, and move on. This makes them unreliable auditors by default.

We tested five prompts, ranging from minimal to exhaustive:

 #  Prompt                                                                                        Precision  Recall  F1
 1  "Check math in the documents"                                                                 1.000      0.417   0.588
 2  "Check math in all the documents"                                                             1.000      0.583   0.737
 3  "Check mathematical consistency in all the documents"                                         1.000      0.667   0.800
 4  "…Use one subagent per document"                                                              1.000      0.667   0.800
 5  "…very thoroughly…find all errors whatever it takes. Use an Opus subagent for each document"  0.800      1.000   0.889

Precision tends to be high — when Claude Code flags an error, it’s usually real — though it drops when the model is pushed to be exhaustive (prompt 5 introduces a few false positives). The bigger issue is recall: with minimal prompts, a significant fraction of planted errors go undetected. Only when we explicitly instructed the agent to be exhaustive, to check every mathematical relationship, and to delegate each document to a dedicated Opus subagent (which is not the default behavior) did recall approach 100%.
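
The F1 column follows mechanically from the other two; a quick check (the first value differs from the table in the last digit only because the table's recall figures are themselves rounded):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the table's F1 column from its precision/recall columns.
rows = [(1.000, 0.417), (1.000, 0.583), (1.000, 0.667),
        (1.000, 0.667), (0.800, 1.000)]
print([round(f1(p, r), 3) for p, r in rows])
# [0.589, 0.737, 0.8, 0.8, 0.889]
```

Because F1 is a harmonic mean, the perfect precision of prompts 1–4 cannot compensate for their missed errors — recall dominates the ranking here.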

What’s Actually Happening

The behavior differences are revealing:

  • Prompt 1: Read 10 of 20 PDFs sequentially, then stopped. Never touched the other half.
  • Prompt 2: Read all 20 PDFs but checked everything solo.
  • Prompt 3: Read all 20 PDFs itself, then launched 8 subagents in batches.
  • Prompt 4: Launched 20 subagents (one per document) without reading the PDFs itself.
  • Prompt 5: Launched 20 Opus subagents (one per document), each instructed to be exhaustive.

The pattern is clear: out of the box, Claude Code is not a reliable judge for this task. It takes shortcuts — reading only some documents, skipping checks, not delegating work. Only with explicit instructions to be thorough and use capable subagents does accuracy reach acceptable levels.


The Workflow

The end-to-end process required minimal manual effort:

  1. Seed: 3 example financial documents.
  2. Spec: DataFramer analyzed the seeds and generated a specification with content and style distributions, plus our custom correctness property.
  3. Generate: 20 diverse PDFs produced with calculator tool, conformance revisions, and PDF rendering enabled.
  4. Verify (optional): We ran semi-automated checks on the generated labels and found 95%+ correctness. This step isn’t strictly necessary — even with a couple of mislabeled documents, the dataset is still good for comparing prompt strategies since we care about relative performance.

Quality check results — expected vs actual distribution for visual design and color scheme across 8 generated samples

  5. Evaluate: Claude Code evaluated each PDF under multiple prompt variants. Results were compared against the ground truth labels.
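
Scoring a run reduces to comparing per-document booleans against the labels. A minimal sketch with toy values (not the actual run):

```python
def score(labels, preds):
    """labels/preds: per-document booleans, True = 'document has an error'."""
    tp = sum(l and p for l, p in zip(labels, preds))
    fp = sum((not l) and p for l, p in zip(labels, preds))
    fn = sum(l and (not p) for l, p in zip(labels, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 4 errored docs; the agent finds 3 and raises 1 false alarm.
labels = [True, True, True, True, False, False, False, False]
preds  = [True, True, True, False, True, False, False, False]
print(score(labels, preds))  # (0.75, 0.75, 0.75)
```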

The whole dataset — documents and labels — came out of the platform without much manual effort.


Try It Yourself

Reproduce the full generation workflow:  Open In Colab

The dataset is available on HuggingFace: dataframer/math-judging

You can swap in your own agent or model, adjust the prompts, or use this as a template for building evaluation datasets for entirely different tasks — the approach holds even when the task is hard enough that frontier models aren’t reliable out of the box.

Want to build a dataset like this for your own use case? Try DataFramer — you can go from a few seed documents to a full evaluation set in minutes.
