Research from the DataFramer team
Our work on AI evaluation, synthetic data benchmarks, and LLM reliability — published at peer-reviewed venues and on arXiv.
All Required, In Order: Phase-Level Evaluation for AI–Human Dialogue in Healthcare and Beyond
Introduces OIP-SCE, an evaluation framework that assesses whether conversational AI systems meet all necessary clinical requirements in the proper sequence, so that an AI dialogue system's adherence to healthcare workflows can be measured and audited for clinical review.
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
A benchmark dataset of real and synthetic insurance benefit verification calls annotated for compliance auditing — enabling evaluation of voice agents' ability to detect call phases and verify procedural and informational compliance.
HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification
Introduces HDM-2, a hallucination detection system that identifies inaccurate outputs from large language models by verifying responses against both provided context and general knowledge facts.
How to Generate 50K-Token Documents: Same LLM, Different Results
We compared Dataframer against raw Claude Sonnet 4.5 for long-form text generation. Dataframer won decisively on diversity, style fidelity, output length, and quality.
Generation of Synthetic Text2SQL LLM Data with 100% Validity Using Dataframer
How we used Dataframer to generate diverse, complex text-to-SQL samples using only Claude Haiku, and how you can do the same for LLM evaluation and training with minimal effort.