How to Generate Multi-file EHR Datasets for 1,000 Patients with Exact Distributions
Turn two patient samples into 1,000 privacy-safe EHR records with precisely controlled distributions in five steps.
Puneet Anand
Wed Oct 15
Download the 1,000 patient records generated in this video, or explore the publicly available EHR dataset on HuggingFace with 7,000+ downloads to see what multi-file patient records look like before generating your own.
Why EHR/EMR data is hard to access for healthcare AI
Real EHR/EMR data is extremely difficult to access. Privacy laws such as HIPAA and GDPR protect sensitive information, and while they are essential, they also restrict how much data researchers and AI developers can use.
Even when EHR datasets or other medical datasets become available, they’re often incomplete, de-identified, or stored in silos across departments. A researcher might have a lab report but not the corresponding imaging study or discharge summary. This fragmentation makes it hard to train machine learning models that reflect the diversity and complexity of real-world patients.
Teams spend months requesting access, cleaning EHR data, and managing compliance, only to end up with small, narrow datasets that can’t support robust AI systems. Innovation slows not because of a lack of ideas, but because usable data remains out of reach.
What counts as EHR/EMR data in a usable dataset?
When teams talk about EHR data, EMR data, or EHR datasets, they usually mean more than just a flat table. Usable healthcare and insurance medical datasets typically include:
- Structured data: demographics, vitals, diagnosis codes, procedure codes, medications, allergies, lab panels, and problem lists.
- Unstructured data: discharge summaries, operative notes, radiology reports, ICU notes, progress notes, and referral letters.
- Longitudinal datasets covering the full patient journey: multiple encounters over time, with linked events such as admissions, follow-ups, readmissions, and procedures.
- Multi-document patient folders: each patient has a set of files (ECG traces, stress tests, lab reports, imaging narratives, and summaries) that together form a single clinical story.
- Standard data exports: FHIR/HL7-based exports, PDFs, text documents, CSVs, and other formats that analytics and AI systems consume in production.
A practical synthetic workflow needs to reproduce this richness so that your EHR datasets are realistic enough for model development, evaluation, and downstream analytics.
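To make the multi-file idea concrete, here is a minimal sketch of what one patient folder might look like, represented as a Python dictionary. All file names and descriptions are hypothetical, invented for illustration only:

```python
# A hypothetical multi-file patient folder: one clinical story split
# across several documents that share a single patient identifier.
patient_folder = {
    "patient_id": "P-0001",  # shared across all files in the folder
    "files": {
        "demographics.csv": "structured demographics and problem list",
        "labs_2024-01-10.csv": "lab panel results",
        "ecg_2024-01-10.pdf": "ECG trace report",
        "stress_test_2024-01-12.pdf": "stress test narrative",
        "discharge_summary.txt": "unstructured discharge summary",
    },
}

# Every document maps back to the same patient, so cross-file
# relationships (e.g., labs referenced in the discharge summary)
# stay intact.
assert all(isinstance(name, str) for name in patient_folder["files"])
```

A synthetic generation workflow has to preserve exactly this linkage: each generated folder is one coherent patient, not a bag of unrelated documents.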
Synthea datasets vs real-seeded synthetic EHR datasets
Open-source Synthea data and Synthea datasets are widely used to experiment with healthcare AI and analytics. DataFramer has also published a free EHR dataset on HuggingFace with 1000 multi-file patient samples and over 7,000 downloads, which you can use as a reference or starting seed if you don’t have your own samples available.
They’re excellent for:
- Getting started quickly with standardized, simulated patient records.
- Demonstrating pipelines, dashboards, and basic models.
- Teaching or prototyping when you have no access to real-world EHR data.
However, many teams eventually find that Synthea alone isn’t enough:
- It may not match your institution’s documentation style, templates, or clinical workflows.
- It may not reflect your specialty mix, comorbidities, or real-world coding patterns.
- It can be difficult to tune to your exact distributions or to mirror how your clinicians actually write notes.
DataFramer takes a complementary approach: you start from a small set of real EHR/EMR samples and use them as seeds. If you don’t have them available, you can use the “Seedless” generation feature to first craft your required structures and formats.
Side note: You drive the entire workflow through a UI or an API.
The platform then generates synthetic EHR datasets that:
- Are tuned to your target distributions (e.g., disease prevalence, age bands, comorbidity profiles, physician notes, markers, and tests).
- Support dependent or conditional distributions; for example, you can set a higher prevalence of diabetes in older male age groups.
- Inherit realistic structure and language from your own environment.
- Remain privacy-safe by decoupling generated data from real identities.
In practice, teams often use Synthea datasets for early experiments and then switch to real-seeded synthetic EHR datasets when they need higher realism and closer alignment with production data.
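The conditional-distribution idea described above can be sketched in plain Python: sample demographics first, then draw a condition whose prevalence depends on them. The probabilities below are illustrative assumptions, not DataFramer's actual spec format or real epidemiological figures:

```python
import random

# Illustrative conditional distribution: diabetes prevalence depends
# on sex and age band (all numbers are made up for this sketch).
DIABETES_PREVALENCE = {
    ("male", "65+"): 0.30,
    ("male", "under_65"): 0.10,
    ("female", "65+"): 0.22,
    ("female", "under_65"): 0.08,
}

def sample_patient(rng: random.Random) -> dict:
    sex = rng.choice(["male", "female"])
    age_band = rng.choice(["under_65", "65+"])
    has_diabetes = rng.random() < DIABETES_PREVALENCE[(sex, age_band)]
    return {"sex": sex, "age_band": age_band, "diabetes": has_diabetes}

rng = random.Random(0)
cohort = [sample_patient(rng) for _ in range(10_000)]

def prevalence(sex: str, age_band: str) -> float:
    group = [p for p in cohort if p["sex"] == sex and p["age_band"] == age_band]
    return sum(p["diabetes"] for p in group) / len(group)

# Older men show a higher empirical prevalence, as configured.
assert prevalence("male", "65+") > prevalence("male", "under_65")
```

The same pattern (condition a draw on previously drawn properties) is what lets a generated cohort behave like a real population rather than a set of independent dice rolls.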
Synthetic data offers a practical path forward
Synthetic data is artificially generated information that preserves the structure, patterns, and statistical relationships of real healthcare datasets without exposing any real identities. Unlike data anonymization techniques that redact or mask fields from real records, synthetic generation produces entirely new records that have no link to real individuals.
In healthcare AI, this approach helps teams safely create training and evaluation EHR datasets, EMR data extracts, and broader medical datasets that reflect genuine, real-world clinical complexity while staying compliant.
Instead of negotiating large data transfers or exposing live systems, you can work with synthetic EHR/EMR data that behaves like the real thing for modeling and analytics, but is designed to protect patient privacy and reduce regulatory friction.
Introducing DataFramer
DataFramer is a synthetic data platform designed to help organizations build, test, and deploy AI systems without exposing sensitive real-world data. Teams use it to generate synthetic data from real data samples, or describe what they need from scratch when no seed data is available. It lets you:
- Turn a handful of real patient folders into large synthetic EHR datasets.
- Generate structured and unstructured medical datasets for healthcare AI.
- Support adjacent use cases in health insurance datasets and life insurance datasets without accessing raw production systems.
Here’s a breakdown of how it can be used, especially in domains like healthcare and insurance.
How the five steps work for EHR datasets in practice
You will follow a clear five-step workflow that mirrors the demo transcript. Here are the highlights:
Step 1: Uploading a few representative patient folders as seed data
Start with a handful of representative seed records that you upload to DataFramer. Make sure each one includes the kinds of documents your models will eventually see in production, such as stress tests, ECGs, lab results, imaging, and discharge summaries. Validate that each patient’s files share a consistent identifier so relationships across documents remain intact. These seeds define the structure and content patterns of your target EHR dataset.
Note: As mentioned before, if you don’t have the seed samples available, you can use the “Seedless” generation feature to first craft your required structures and formats.
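Before uploading, it can help to sanity-check that every file in a seed folder references one consistent patient identifier. Here is a minimal sketch, assuming the identifier appears as a prefix in each file name (just one possible convention; adapt it to however your own seed files encode the ID):

```python
from pathlib import Path

def check_folder_ids(folder: Path) -> str:
    """Verify every file in a seed folder shares one patient ID prefix.

    Assumes a hypothetical naming convention like 'P0001_labs.csv';
    adjust the parsing to match your own files.
    """
    ids = {f.name.split("_")[0] for f in folder.iterdir() if f.is_file()}
    if len(ids) != 1:
        raise ValueError(f"Inconsistent patient IDs in {folder}: {ids}")
    return ids.pop()
```

Running a check like this on each seed folder before upload catches broken cross-file links early, while they are still cheap to fix.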
Step 2: Create a spec (the blueprint)
DataFramer analyzes the seed data and builds a blueprint of the structure and properties. This spec captures the document types, expected structure, and baseline distributions that will guide synthetic generation.
Step 3: Edit the spec to match your target distributions and requirements
Now you can refine what gets generated. Require unique names, add or refine data properties, and control distributions. For example, you can increase coronary artery disease prevalence, boost diabetes prevalence, and shift the demographic mix toward female and elderly patients.
You can also define conditional rules, like generating more stress test reports when the condition is coronary artery disease.
Step 4: Run generations to create a larger synthetic dataset
Next, generate a thousand synthetic patient records, one folder per patient, each with new names, realistic histories, and structure that follows your target distributions. The output is ready for testing, training, validation, and demos.
Step 5: Evaluate and iterate (with humans if needed)
Finally, evaluate the generated dataset, chat with your dataset, and involve human experts as needed. This makes it easy to validate whether your synthetic EHR dataset matches your targets, and to iterate until it does.
Let’s visit these steps in more detail in this demo video, where we generate 1000 realistic samples.
Detailed Walkthrough of The 5 Step Workflow
Here is a step-by-step walkthrough of DataFramer from this demo.
Prerequisites and recommended inputs
- A small, representative set of seed files for each subject or entity. Examples in healthcare include stress tests, ECG reports, lab results, imaging, discharge summaries, and patient profiles.
- Note: As mentioned before, if needed, DataFramer can generate new samples from scratch to use as seeds. This feature is called “Seedless” generation.
- Clear target goals for distributions and attributes to control during generation. Examples include disease prevalence, gender balance, age groups, and other medical dataset features.
Step 1 Upload EHR/EMR seed data
Purpose
Ingest and organize EHR/EMR seed samples that define the structure and context for synthetic generation.
What you do
- Select dataset mode. For multi-file subjects, choose multi-folder so each subject can include multiple documents.
- Upload the root folder or select folders for each subject.
- Provide a dataset name and description that reflects your EHR dataset use case (e.g., Patient history seed samples).
What DataFramer does
- Stores relevant files.
- Prepares the dataset for analysis and specification creation.
Tips
- Include the document types your models must see later: clinical notes datasets, lab reports, imaging summaries, and discharge documents. DataFramer supports PDF, XLSX, CSV, plain text, and multi-folder structures out of the box.
Step 2 Create specs for your synthetic EHR dataset
Purpose
Generate the initial blueprint that controls how synthetic data will be generated, including structure, properties, baseline, and even conditional distributions inferred from your seed data.
What you do
- Click “Create spec” on the dataset.
- Review the auto-populated spec that summarizes structure, file counts per subject, content types, and detected properties such as demographics, medical history, and clinical findings.
What DataFramer does
- Analyzes the seed dataset to infer structure and candidate properties.
- Pre-populates distributions and relationships discovered in seed data.
- If indicated by the user, DataFramer also expands the existing set of properties and their possible values.
Outputs
- An initial specification that describes structure and baseline properties, ready for refinement.
Step 3 Edit the spec to control properties, requirements, and distributions
Purpose
Refine the blueprint so the generated synthetic dataset matches your target populations, document patterns, and clinical logic.
What you do
- Configure target distributions for key properties. Some examples:
- Increase coronary disease prevalence.
- Add diabetes as a new value of the medical condition property and raise its prevalence.
- Emphasize elderly female representation.
- Add or refine target dataset requirements.
- Include medical condition as an explicit data property.
- Require all first and last names to be unique.
- Add additional records such as physician histories or evidence fields.
- Encode conditional relationships so the dataset behaves like real journeys. Example rule:
- When medical condition is coronary artery disease, primary report type is stress test about 40 percent of the time and operative report about 15 percent of the time.
Read more: Base Distributions, Conditional Distributions
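The example rule above can be expressed as a small conditional sampler. The 40 and 15 percent figures come from the rule itself; how the remaining 45 percent is split among other report types, and the behavior for non-CAD patients, are assumptions made for this sketch:

```python
import random

# Conditional rule from the text: given coronary artery disease,
# the primary report type is a stress test ~40% of the time and an
# operative report ~15% of the time. The "other" bucket is assumed.
REPORT_TYPE_GIVEN_CAD = {
    "stress_test": 0.40,
    "operative_report": 0.15,
    "other": 0.45,
}

def sample_report_type(condition: str, rng: random.Random) -> str:
    if condition == "coronary_artery_disease":
        table = REPORT_TYPE_GIVEN_CAD
    else:
        table = {"other": 1.0}  # assumed default for non-CAD patients
    types, weights = zip(*table.items())
    return rng.choices(types, weights=weights, k=1)[0]

rng = random.Random(42)
draws = [sample_report_type("coronary_artery_disease", rng) for _ in range(10_000)]
share = draws.count("stress_test") / len(draws)
# The empirical share converges toward the configured 40 percent.
assert 0.37 < share < 0.43
```

This is the behavior the spec encodes declaratively: the generator draws each patient's documents from a distribution conditioned on that patient's other properties.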
What DataFramer does
- Validates edits to ensure constraints are consistent.
- Updates the spec so the generated dataset adheres to your requirements and logical relationships.
Outputs
- A finalized specification that fully describes the target structure, properties, distributions, and conditional logic for generating your realistic synthetic EHR datasets and other medical datasets.
Good practices
- Prefer conditional rules for any property that depends on another property.
- Keep distributions realistic enough to preserve utility while achieving your research goals.
Step 4 Create runs and generate synthetic EHR datasets
Purpose
Execute a run with the specification to produce a synthetic EHR dataset at the desired scale.
What you do
- Click Create run from the saved specification.
- Select the spec version, choose the model, and set the number of samples to generate.
- Models can be proprietary or open source based on your environment.
- Choose whether to enable revisions.
- Revisions perform additional passes to check whether outputs meet your requirements and distributions before finalizing.
- Start the run and monitor progress.
What DataFramer does
- Applies your spec to generate new patients with multiple files per patient.
- Preserves cross-file consistency and adheres to target distributions and conditional rules.
- Performs revision cycles if desired to improve fit to targets.
Outputs
- A generated EHR dataset with one folder per synthetic subject. Typical contents in this healthcare demo included operative notes, ICU sheets, lab results, discharge summaries, imaging or test narratives, and patient demographics, but other files can be added as required, for example insurance applications or submissions.
Performance notes
- Time to completion scales with sample count, model selection, and revision settings.
- Larger sample sizes converge more closely to your target distributions.
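The convergence note above can be illustrated with a quick simulation: the gap between a binary property's empirical frequency and its configured target shrinks as the run size grows. The 30 percent target and seed counts here are arbitrary choices for the sketch:

```python
import random

def empirical_error(n: int, target: float, seed: int = 0) -> float:
    """Absolute gap between the empirical frequency of a binary
    property (e.g., 'condition present' at a target prevalence)
    and its configured target, for n generated samples."""
    rng = random.Random(seed)
    hits = sum(rng.random() < target for _ in range(n))
    return abs(hits / n - target)

def mean_error(n: int, target: float = 0.3, seeds: int = 20) -> float:
    # Average over several seeds to smooth out single-run luck.
    return sum(empirical_error(n, target, s) for s in range(seeds)) / seeds

# A 10,000-sample run lands much closer to target than a 100-sample run.
assert mean_error(10_000) < mean_error(100)
```

This is the usual square-root behavior of sampling error: roughly 100x more samples buys roughly 10x tighter adherence to the target distribution.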
Step 5 Evaluate the generated dataset and iterate
DataFramer automatically evaluates the output against the targets and expectations set by you.
For example, if you configured it that way, the evaluation can confirm that all samples are female, that 75 percent are elderly, and that conditions like type 2 diabetes, hypertension, and coronary artery disease appear with the highest frequencies, as intended.
You can also use the chat feature to query the dataset directly, asking, for example, how many samples were generated, or requesting a table of diseases by frequency, and receiving instant structured replies. This makes it easier to validate whether your synthetic EHR dataset matches your clinical or business hypotheses.
You can also involve human experts as needed to review and annotate outputs for realism, consistency, and safety, and then iterate on the spec and rerun generations until the dataset meets your standards.
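The "diseases by frequency" question mentioned above is also easy to answer directly from generated metadata outside the platform. A minimal sketch, assuming each generated patient folder carries a single condition label (the counts below are toy stand-ins, not the demo's actual output):

```python
from collections import Counter

# Toy metadata standing in for 1,000 generated patient folders.
generated = (
    ["type_2_diabetes"] * 450
    + ["hypertension"] * 350
    + ["coronary_artery_disease"] * 200
)

# "How many samples were generated?"
assert len(generated) == 1000

# "Table of diseases by frequency."
freq = Counter(generated).most_common()
for disease, count in freq:
    print(f"{disease:28s} {count:4d}")
```

Cross-checking the platform's evaluation with a few lines like this is a cheap way to build trust in the generated dataset before handing it to downstream teams.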
Using synthetic health datasets for insurance AI and Analytics
Beyond hospital and research settings, synthetic medical datasets are increasingly valuable for insurers:
- Health insurance datasets
- Simulate claims-like records based on synthetic EHR/EMR journeys.
- Model utilization patterns, chronic disease burden, and cost drivers.
- Test care management, risk adjustment, and network design strategies without exposing member PHI.
- Life insurance datasets
- Generate synthetic underwriting-style summaries that incorporate comorbidities, risk factors, and lifestyle indicators derived from clinical context.
- Explore how changes in age, condition prevalence, or treatment adherence affect mortality and morbidity assumptions.
- Share synthetic life insurance datasets across actuarial, underwriting, and data science teams to prototype new products and risk models.
Because DataFramer starts from a small, well-governed seed of EHR/EMR data, it becomes possible to create realistic, privacy-safe health insurance datasets and life insurance datasets that still behave like real populations.
The impact of synthetic data in healthcare
Synthetic data gives researchers and developers freedom to experiment, share, and iterate without risking privacy. Because no real identifiers are used, teams can share datasets freely and balance them for demographic diversity. Development timelines shrink from months to days, and collaboration across institutions no longer requires navigating complex data-sharing agreements.
DataFramer turns small, carefully governed EHR/EMR seed datasets into large synthetic datasets that teams across healthcare and insurance can use directly for model development and analytics.
FAQ: EHR datasets, EMR data, Synthea, and insurance use cases
What is the difference between EHR data and EMR data?
EHR data usually refers to a longitudinal view of a patient’s health across multiple encounters and care settings, while EMR data often refers to the digital chart within a single organization or encounter. In practice, most AI teams work with both, and DataFramer can generate synthetic versions of either as multi-file patient folders.
What are EHR datasets used for in machine learning?
EHR datasets and other medical datasets are used to:
- Train and evaluate prediction models (readmission, mortality, length of stay, risk scores).
- Power clinical decision support tools.
- Build phenotyping, cohort selection, and trial-matching systems.
- Support downstream analytics for providers, payers, and life sciences.
Synthetic EHR datasets also serve as evaluation datasets and ground truth datasets for clinical AI. Teams building medical NLP models, ICD coding tools, or clinical summarization systems need representative patient records to benchmark performance. A well-structured synthetic EHR dataset with known distributions gives you a controlled golden dataset for model evaluation without the compliance overhead of using real records.
Synthetic datasets let you do this work without exposing production systems.
How does synthetic EHR data compare to Synthea datasets?
Synthea datasets are open and standardized, making them ideal for early experimentation and teaching. Real-seeded synthetic EHR datasets generated with DataFramer:
- Are tailored to your specialty mix and workflows.
- Use language and formatting closer to your real documentation.
- Let you tune distributions and rules to match your target population.
Many teams combine both: Synthea for quick demos, and real-seeded synthetic EHR data for serious model development.
How can synthetic EHR data be used inside healthcare organizations?
Healthcare organizations can use synthetic EHR/EMR data to:
- Prototype and validate new AI tools in a safe sandbox before touching live records.
- Share realistic datasets with vendors, startups, and research partners without moving PHI, in the desired formats like FHIR.
- Run quality-improvement and operations experiments (e.g., capacity planning, triage flows) on realistic but de-identified journeys.
- Train clinicians, analysts, and data science teams on lifelike cases without compliance hurdles.
Because the data is synthetic, these use cases become much easier to approve and govern.
Can I generate datasets for underwriting or insurance risk modeling?
Yes. By starting from carefully governed clinical seeds, you can create synthetic health insurance datasets and life insurance datasets that:
- Capture realistic condition combinations, treatments, and outcomes.
- Support underwriting, pricing, and product analytics.
- Stay privacy-safe because no real policyholder or patient identities are exposed.
How large should an EHR dataset be for model evaluation?
It depends on the task, but in general:
- Hundreds of samples can be enough for exploratory models.
- Thousands to tens of thousands of samples are often used for robust evaluation.
- With synthetic data, you can scale your EHR datasets to these sizes and beyond while still anchoring them in a small, carefully curated real-seed cohort.
A future where privacy and innovation coexist
In this demo, starting from just two real patients, DataFramer created 1000 complete, realistic samples. This process can be scaled to thousands, and the resulting EHR datasets and medical datasets can serve rapid, safe AI training and evaluation.
Synthetic EHR/EMR data gives hospitals, health insurers, and life insurers a practical path to building AI on realistic datasets without the access and compliance barriers that slow teams down today.