Research from the DataFramer team
Our work on AI evaluation, synthetic data benchmarks, and LLM reliability — published at peer-reviewed venues and on arXiv.
All Required, In Order: Phase-Level Evaluation for AI–Human Dialogue in Healthcare and Beyond
Introduces OIP-SCE, an evaluation framework that assesses whether conversational AI systems meet all necessary clinical requirements in the proper sequence, so that an AI dialogue system's adherence to healthcare workflows can be measured and audited for clinical review.
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
A benchmark dataset of real and synthetic insurance benefit verification calls annotated for compliance auditing — enabling evaluation of voice agents' ability to detect call phases and verify procedural and informational compliance.
HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification
Introduces HDM-2, a hallucination detection system that identifies inaccurate outputs from large language models by verifying responses against both provided context and general knowledge facts.
How to Generate 50K-Token Documents: Same LLM, Different Results
We compared Dataframer against raw Claude Sonnet 4.5 for long-form text generation. Dataframer won decisively on diversity, style fidelity, output length, and quality.
Generation of Synthetic Text2SQL LLM Data with 100% Validity Using Dataframer
How we used Dataframer to generate diverse, complex text-to-SQL samples using only Claude Haiku, and how you can do the same for LLM evaluation and training with minimal effort.