
How a 3B Model Outperformed GPT-4o on Hallucination Detection: The Training, Evals, Validation, and Benchmark Synthetic Data Pipeline Behind HDM-2

A 3B open-source model beat GPT-4o at hallucination detection, built entirely on DataFramer-generated training and eval data.


Alex Lyzhov

Tue Apr 15

DataFramer built the full data foundation behind AIMon Labs’ HDM-2 (training data, evaluation sets, validation pipelines, and the HDM-Bench benchmark), powering an open-source hallucination detection model that outperformed GPT-4o on hallucination detection benchmarks.

F1 Score on TruthfulQA: 83.7
Model Parameters: 3B
GPT-4o (est. parameters): ~200B
Inference latency on L4 GPU: <500ms

Background

At DataFramer, we believe the bottleneck for the next generation of AI models isn’t compute. It’s data quality. The story of AIMon Labs’ HDM-2 model is a concrete proof point. When their team set out to build an enterprise-grade hallucination detection model, they needed a data partner who could own the entire data lifecycle: from generating training examples to building evaluation sets to designing the benchmark itself. That’s where we came in.

HDM-2-3B, an open-source hallucination detection model, outperformed GPT-4o and GPT-4o-mini on hallucination detection tasks and did so at a fraction of the compute cost. The model and HDM-Bench have together crossed 7,500 downloads on HuggingFace. This article tells the story from our vantage point: what we built, why it mattered, and what it made possible.

The Problem: Hallucination Remains Unsolved at Enterprise Scale

Despite years of research, hallucination in large language models remains one of the most persistent and costly failure modes in production AI. Even the latest frontier models from OpenAI, Google, and Anthropic self-report hallucination rates approaching 20% in certain evaluation settings.

The standard industry response has been to use a large general-purpose model, typically GPT-4o, as an LLM-as-a-judge for hallucination evaluation. LLM judges like GPT-4o can work, but they are expensive, slow (often several seconds per query), inconsistent across prompt variations, and introduce a circular dependency on the very models you are trying to validate.

The core tension: Enterprises need hallucination detection that runs in real-time, costs pennies per call, and doesn’t rely on the same models it’s trying to evaluate. A specialized, lightweight model trained on high-quality domain-specific data is the natural answer, but only if the data pipeline behind it is robust enough at every stage: training, validation, and evaluation.

AIMon Labs understood this clearly. Building HDM-2 wasn’t just a modeling challenge. It was fundamentally a data challenge. They needed a partner who could generate the right training data, build the validation sets to iterate against, and design a benchmark rigorous enough to prove the model worked. That full-stack data problem is what DataFramer was built to solve.

Our Contribution: The Full Data Pipeline

DataFramer’s involvement with AIMon Labs spanned every stage of the data lifecycle: training, evaluation, validation, and benchmarking. This wasn’t a narrow dataset contribution. It was the data foundation that made HDM-2 possible.

Training Data

The model needed to learn what hallucinations look like across a wide range of enterprise contexts: not just clean academic examples, but the messy, subtle deviations that appear in context-grounded responses in real production RAG pipelines. DataFramer generated a domain-specific synthetic SFT dataset covering Finance, Healthcare, Legal, and Insurance scenarios, with phrase-level ground truth labels that gave the model the fine-grained signal it needed to learn detection at the token level.
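To make "phrase-level ground truth" concrete, here is a minimal sketch of what such a training record could look like. The field names, tag names, and example text are illustrative assumptions, not the actual DataFramer schema:

```python
# Hypothetical phrase-level SFT record. All field names and values are
# illustrative; the real DataFramer schema may differ.
record = {
    "context": "The policy covers water damage up to $5,000 per incident.",
    "response": "The policy covers water damage up to $10,000 per incident.",
    "labels": [
        # (start, end) are character offsets into `response`
        {"start": 37, "end": 44, "tag": "context_hallucination"},
    ],
}

def hallucinated_phrases(rec):
    """Extract the labeled hallucinated spans from a record."""
    return [rec["response"][l["start"]:l["end"]] for l in rec["labels"]]

print(hallucinated_phrases(record))  # ['$10,000']
```

Character-offset labels like these are what allow a detector to be trained and scored at the token level rather than on whole responses.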

Evaluation & Validation Sets

Iterating toward a production-grade model requires held-out evaluation sets that are genuinely independent from training data, and validation pipelines that expose specific failure modes rather than just tracking aggregate metrics. DataFramer built these ground-truth evaluation and validation sets in parallel with the training data, ensuring the AIMon team had a reliable feedback loop at every stage of development.

HDM-Bench: The Public Benchmark

The public-facing output of this collaboration is HDM-Bench, an open-source benchmark dataset hosted under the DataFramer HuggingFace organization and central to the HDM-2 research paper. HDM-Bench is a phrase-level, multi-domain hallucination detection benchmark, not a standard true/false factual recall dataset. It was built from the ground up for the way hallucinations actually appear in enterprise RAG pipelines: not as obvious fabrications, but as subtle deviations from grounding context. A wrong number here, an unsupported claim there, an enterprise-specific assertion that cannot be verified against public knowledge.

What makes HDM-Bench different:

1. Domain Coverage: Samples span Finance, Healthcare, Legal, and Insurance, the highest-stakes domains where hallucination has real business and regulatory consequences.

2. Phrase-Level Annotation: Every hallucinated span is annotated at the character level, not just flagged at the sentence or document level, enabling token-level model training and evaluation.

3. Taxonomy-Aligned Labels: Labels align with HDM-2’s novel response taxonomy: context-based hallucinations, common knowledge violations, and innocuous statements are each tagged distinctly.

4. Two-Pass Human Annotation: Every example went through a two-reviewer process with subject-matter-expert review (a first-pass annotation followed by a second-pass quality check) to maximize label reliability and minimize noise. This human-in-the-loop process ensures ground truth labels reflect real expert judgment rather than automated assumptions.

The result is 1,320 carefully curated examples across two distinct data splits: a 1,120-row synthetic split generated by DataFramer, and a 199-row mr split. For a specialized evaluation benchmark designed to stress-test a detection model’s precision and recall in the most difficult edge cases, quality and diversity matter far more than volume.
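A phrase-level benchmark like this is typically scored by matching predicted spans against gold spans. The following is our own simplified exact-match sketch of span-level precision, recall, and F1, not the official HDM-Bench scorer:

```python
def span_f1(predicted, gold):
    """Exact-match span-level precision/recall/F1.

    `predicted` and `gold` are sets of (start, end) character spans.
    Simplified illustration only; a real scorer may also credit
    partial overlaps or weight spans by length.
    """
    tp = len(predicted & gold)  # spans the model got exactly right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# One correct span, one false positive, one miss -> P = R = F1 = 0.5
print(span_f1({(0, 5), (10, 15)}, {(0, 5), (20, 25)}))
```

Scoring at the span level is what lets a benchmark distinguish a model that finds the exact hallucinated phrase from one that merely flags the whole response.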

What HDM-2 Achieved

Trained on DataFramer’s data and evaluated against HDM-Bench, HDM-2 set a new standard on each hallucination detection benchmark it was tested on. Here are the headline results:

Hallucination Detection Leaderboard: F1 Scores Across LLM Benchmarks

| Model | RAGTruth F1 | TruthfulQA F1 | HDM-Bench F1 |
| --- | --- | --- | --- |
| GPT-4o (as judge) | – | 53.8 | 58.7 |
| GPT-4o-mini (as judge) | – | 56.2 | 57.7 |
| LLaMA-2-13B (fine-tuned) | 78.7 | – | – |
| HDM-2-3B (DataFramer data) | 85.0 | 83.7 | 73.6 |

On TruthfulQA, HDM-2 achieves an F1 of 83.7 against GPT-4o’s 53.8 and GPT-4o-mini’s 56.2, a 27-to-30-point F1 advantage. On RAGTruth, HDM-2 reaches 85.0 F1, more than 6 points ahead of the next best fine-tuned model (LLaMA-2-13B at 78.7), with a fraction of the parameters.

HDM-2, trained and iteratively validated against DataFramer’s phrase-level annotations, closes much of the gap between general-purpose LLM judges and reliable, fine-grained hallucination detection.

Why the Full Data Stack Matters

The success of HDM-2 is a case study in what becomes possible when model architecture and data pipeline are co-designed end to end. The HDM-2 team built a novel multi-task architecture with separate context-grounding and common-knowledge verification modules. But that architecture can only be trained well, evaluated honestly, and optimized reliably if the data at every stage (training, validation, evaluation, and benchmark) provides the right granularity of signal.

A training dataset that lacks domain diversity produces a brittle model. A validation set that isn’t independent of training produces false confidence. A benchmark that only labels entire responses as “hallucinated” or “not hallucinated” tells you nothing about where detection logic breaks down. DataFramer solved all three simultaneously, which is why the results look the way they do.

High-quality training data builds a capable model. Rigorous validation exposes failure modes. Targeted fixes improve performance. A credible benchmark proves it works. Then the cycle repeats. DataFramer is built to power every stage of that loop.

What This Means for Enterprise AI Teams

HDM-2, a fine-tuned open-source hallucination detection model, is available on HuggingFace under a CC BY-NC-SA license, and HDM-Bench is publicly available for any team to use as an evaluation baseline. For enterprise AI teams building RAG pipelines, here is what that means in practice:

Real-time guardrails become economically viable. At sub-500ms inference on a single L4 GPU, HDM-2 can be deployed inline as a verifier running continuously in your serving path, flagging hallucinations before responses reach end users rather than in a post-hoc audit.
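An inline guardrail can be a thin wrapper around the detector in the serving path. Here is a hedged sketch in which `detect_hallucinated_spans` stands in for a real HDM-2 inference call; its name, signature, and toy heuristic are our assumptions, not the actual AIMon API:

```python
def detect_hallucinated_spans(context: str, response: str) -> list[tuple[int, int]]:
    """Placeholder for a real HDM-2 call; returns (start, end) spans.

    Toy heuristic for illustration only: flag the whole response
    unless it appears verbatim in the grounding context.
    """
    return [] if response in context else [(0, len(response))]

def guarded_answer(context: str, response: str) -> str:
    """Block a flagged response before it reaches the end user."""
    spans = detect_hallucinated_spans(context, response)
    if spans:
        return "[withheld: possible hallucination detected]"
    return response

print(guarded_answer("Paris is in France.", "Paris is in Spain."))
```

In production, the flagged path would more likely trigger a retry, a citation check, or a human-review queue than a hard block, but the control flow is the same.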

You no longer need GPT-4o as an LLM-as-a-judge for GPT-4o outputs. A specialized 3B model trained on purpose-built data can outperform a ~200B generalist on this specific task, at a fraction of the API cost and latency.

The data pipeline is the competitive moat. The teams that invest in rigorous, domain-specific training data, validation sets, and evaluation benchmarks will build better models faster than those relying on public datasets alone. HDM-2 demonstrates what that looks like in practice.

At DataFramer, we work with AI teams who understand that the path to a better model runs through better data at every stage. Training data, evaluation sets, validation pipelines, and benchmarks are not separate concerns. They are a single system that determines what your model can and cannot do.

The HDM-2 story is one we’re proud to have built from the ground up. If your team is facing a similar challenge, whether fine-tuning, evaluating, or trying to prove your model works in production, we’d like to talk.

HDM-Bench dataset available on HuggingFace · HDM-2 model on HuggingFace · Research paper: arXiv 2504.07069

The benchmark data is not just an evaluation artifact. It shapes what the model learns to care about. If your benchmark is shallow, your model will be too.

Get started

Ready to build better AI with better data?