Use case — Engineering, PM
Evaluate with what you have.
Generate what you're missing.
Two ways to evaluate in DataFramer: score your real production traces with calibrated judges, or generate synthetic test cases from those same traces and evaluate those.
Why this is hard
Generic benchmarks miss your actual failure modes.
Most eval datasets were built for general capabilities. Your failures are specific to your prompts, your retrieval, your users. A benchmark that doesn't reflect that tells you almost nothing.
Off-the-shelf judges haven't seen your domain.
We found that uncalibrated judges were consistently wrong on a whole class of domain-specific failures. The teams relying on them didn't catch it until a human reviewer flagged it weeks later.
Real traces can't cover failures you haven't seen yet.
Production data is useful for known failures. But if a failure mode hasn't shown up yet, you have no test cases for it. Waiting for it to appear in production is not a testing strategy.
Evals rarely connect to the fixes they're supposed to validate.
Customers told us evals and fixes happened in separate workflows. By the time a fix shipped, nobody could say whether it had actually addressed the failure that triggered the eval.
Two paths, one platform
Evaluate with real traces. Generate what production can't give you.
Path 1
Evaluate with your traces
Pull traces from production and score them against rubrics your team defined, using judges calibrated to your human reviewers. Measure agreement. Build regression suites from cases that mattered. Before a fix ships, test it against the real failures that caused the problem.
Path 2
Generate what production can't give you
Pick real traces as the starting point. DataFramer generates synthetic test cases that reflect your actual domain. Add them to eval runs, test against known failure patterns, and cover edge cases before they show up in production.
How it works
From traces to tested fixes, end to end.
Bring in your traces
Connect Langfuse or LangSmith, or send traces directly via the DataFramer SDK. User feedback, corrections, and ratings can come in alongside traces.
Ingest
Pick traces as the starting point
From the Traces table, choose the rows that best represent your domain. Add them to a seed dataset in one step.
Seed datasets
Describe what you want to generate
A spec captures the structure, properties, and distributions of the dataset. You can define it yourself or let DataFramer infer it from your example data.
Specs
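To make "structure, properties, and distributions" concrete, here is a minimal sketch in plain Python. It is not DataFramer's spec format: the `spec` dictionary, its keys, the example property values, and the `sample_property_plan` helper are all hypothetical, chosen only to show how a target distribution can decide which kinds of cases get generated.

```python
import random

# Hypothetical spec, for illustration only (not DataFramer's actual schema).
# "fields" describes the structure of each test case; "properties" maps each
# property to the share of generated cases that should carry each value.
spec = {
    "fields": ["user_query", "expected_behavior"],
    "properties": {
        "intent": {"refund_request": 0.5, "billing_question": 0.3, "account_closure": 0.2},
        "difficulty": {"typical": 0.7, "edge_case": 0.3},
    },
}


def sample_property_plan(spec, n, seed=0):
    """Draw n property combinations according to the spec's distributions.

    Each combination says what kind of test case to generate next; the
    generation step itself is out of scope for this sketch.
    """
    rng = random.Random(seed)  # seeded so the plan is reproducible
    plan = []
    for _ in range(n):
        row = {}
        for name, distribution in spec["properties"].items():
            values = list(distribution)
            weights = list(distribution.values())
            row[name] = rng.choices(values, weights=weights, k=1)[0]
        plan.append(row)
    return plan


if __name__ == "__main__":
    for row in sample_property_plan(spec, n=5, seed=1):
        print(row)
```

Raising the weight on a rare value such as `edge_case` is one way to ask for more of the cases production hasn't shown you yet.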
Generate synthetic test cases
Run generation from the spec. Output formats include JSONL, CSV, XLSX, PDF, and multi-folder samples. Generated data can cover failure patterns your real traces don't include yet.
Runs
Assemble an eval dataset for your judge
Combine real traces with generated ones, or use expert-reviewed traces only. Each dataset ties specific rubrics to the traces being scored.
Judge datasets
Score outputs and measure agreement
Pick your model and judge prompt. DataFramer scores each trace and shows how closely the judge agrees with your human reviewers. Check that agreement before relying on the judge at scale.
Judge runs
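For readers who want to see what "agreement" can mean in numbers, the sketch below computes two common measures over paired judge and human labels: raw percent agreement and Cohen's kappa, which discounts the agreement expected by chance. This is an assumption about how one might measure it, not a description of DataFramer's own metric; the function names and sample labels are made up for illustration.

```python
from collections import Counter


def percent_agreement(judge_labels, human_labels):
    """Fraction of traces where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(judge_labels)
    observed = percent_agreement(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own observed rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up labels for eight traces, for illustration only.
    judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
    human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
    print(percent_agreement(judge, human))
    print(cohens_kappa(judge, human))
```

Kappa is worth reading alongside raw agreement: when most traces pass, a judge that always says "pass" can agree with reviewers most of the time while still missing every failure.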