Reality-grounded, requirements-tuned datasets that unlock AI.
Take your Databricks data further. Generate, anonymize, and simulate diverse datasets for training, fine-tuning, and evaluation, including eval sets, context docs, and golden labels. Powered by your Databricks Model Serving endpoints, with results landing directly in Unity Catalog.
Where teams get stuck
Problem
Your seed data isn't enough.
Resolution
Generate diverse, scaled datasets without starting from scratch. DataFramer works from your existing Databricks tables.
Problem
Your real data is off-limits.
Resolution
Anonymize it with structure and constraints intact and sensitive content removed, producing privacy-safe synthetic alternatives to PHI and PII.
Problem
Your data doesn't cover what your model will face.
Resolution
Simulate the edge cases and scenarios your real data never captured, with full control over distributions and constraints.
Problem
You don't have enough data to know where your model actually fails.
Resolution
Generate targeted eval sets and edge cases from your real data structure, with golden labels included when the spec determines the answer.
What makes DataFramer unique
Starts from your real data
Seed-based generation learns from your existing Unity Catalog tables. Outputs inherit your schema, constraints, and value patterns, not invented ones.
You control the distribution
Spec-driven targeting lets you define exactly which values, edge cases, and scenarios get generated, and how often, including conditional distributions across properties.
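Conceptually, a distribution spec pairs weighted target values with conditional rules. The sketch below is a hypothetical, simplified illustration of that idea (the `spec` dict, `sample_row`, and field names are invented for this example and are not DataFramer's actual spec format):

```python
import random

# Hypothetical spec: weighted values for one property, plus a
# conditional distribution for a second property. Not DataFramer's
# real spec format, just an illustration of the concept.
spec = {
    "plan": {"free": 0.6, "pro": 0.3, "enterprise": 0.1},
    # The support tier's distribution depends on the sampled plan.
    "support_tier_given_plan": {
        "free": {"community": 1.0},
        "pro": {"standard": 0.8, "priority": 0.2},
        "enterprise": {"priority": 1.0},
    },
}

def sample_row(spec, rng):
    plan = rng.choices(list(spec["plan"]), weights=list(spec["plan"].values()))[0]
    tiers = spec["support_tier_given_plan"][plan]
    tier = rng.choices(list(tiers), weights=list(tiers.values()))[0]
    return {"plan": plan, "support_tier": tier}

rng = random.Random(0)
rows = [sample_row(spec, rng) for _ in range(1000)]
```

The conditional table is what lets you say things like "free-plan rows always get community support" while still controlling how often each plan appears overall.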
Automatic golden labels
When the spec fully determines the answer, golden labels are generated alongside the data, giving you eval sets and test suites without manual annotation.
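The idea can be shown with a toy generator (this is an illustration of the principle, not DataFramer's API): when the generated fields fully determine the answer, the golden label is computed alongside the input instead of being annotated by hand.

```python
import random
import re

# Toy example of spec-determined golden labels: the label is
# derived from the same values that produced the input, so no
# manual annotation is needed.
def generate_eval_item(rng):
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return {
        "input": f"What is {a} + {b}?",
        "golden_label": str(a + b),  # fully determined by the generated values
    }

rng = random.Random(42)
eval_set = [generate_eval_item(rng) for _ in range(100)]
```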
Quality enforced, not assumed
Automatic revision loops and conformance filtering check outputs before they reach your pipelines. Optional human expert review for regulated or high-stakes domains.
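A revise-and-filter loop can be sketched in a few lines. This is a conceptual illustration (the `conforms` check, the noisy generator, and the retry budget are invented for the example), not the product's internal implementation:

```python
import random

# Conformance check: only rows passing this reach the pipeline.
def conforms(row):
    return 0 <= row["age"] <= 120 and row["email"].endswith("@example.com")

# Deliberately noisy generator, standing in for a model that
# sometimes produces out-of-range or malformed values.
def generate_candidate(rng):
    return {
        "age": rng.randint(-5, 130),
        "email": rng.choice(["a@example.com", "b@example.com", "oops"]),
    }

def generate_with_revision(rng, n_rows, max_revisions=3):
    accepted = []
    for _ in range(n_rows):
        for _ in range(max_revisions):      # revision loop: retry on failure
            row = generate_candidate(rng)
            if conforms(row):               # conformance filter
                accepted.append(row)
                break
    return accepted

rng = random.Random(7)
rows = generate_with_revision(rng, 200)
```

Rows that still fail after the retry budget are dropped rather than passed through, which is the "enforced, not assumed" part.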
Any format, any complexity
Works with tabular data, documents, multi-file structures, and nested formats. Handles complexity that generic generation tools can't.
Privacy-safe by design
Anonymize PHI, PII, and other sensitive data with structure intact and sensitive content removed. Generation runs through your own Databricks Model Serving endpoints.
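Structure-preserving anonymization can be illustrated with a toy record (this is a simplified sketch, not DataFramer's algorithm; the field names are invented): sensitive values are replaced with synthetic ones that keep the same schema and format, while non-sensitive fields pass through unchanged.

```python
import random
import re

def anonymize_record(record, rng):
    out = dict(record)  # schema (keys) is preserved
    # Replace the name with a synthetic identifier.
    out["name"] = f"Patient-{rng.randint(1000, 9999)}"
    # Keep the NNN-NN-NNNN format, replace the digits themselves.
    out["ssn"] = re.sub(r"\d", lambda _: str(rng.randint(0, 9)), record["ssn"])
    return out

rng = random.Random(0)
real = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis_code": "E11.9"}
fake = anonymize_record(real, rng)
```

Downstream code that expects the original schema and formats keeps working, but the sensitive content is gone.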
How it works
Install the pydataframer-databricks connector, point it at any Unity Catalog table, and generate synthetic datasets that land back as Delta tables.
```python
# Imports from the connector package (the module path here is an
# assumption; check the pydataframer-databricks docs for the exact path)
from pydataframer_databricks import DatabricksConnector, DatasetType, FileType

# Connect to your Databricks workspace
connector = DatabricksConnector(dbutils, scope="dataframer")

# Fetch seed data from any Unity Catalog table
seed_df = connector.fetch_sample_data(
    table_name="catalog.schema.my_table",
    num_items_to_select=25,
)

# ... generate synthetic data via DataFramer ...

# Load results back into a Delta table
connector.load_generated_data(
    table_name="catalog.schema.synthetic_output",
    downloaded_zip=generated_zip,
    dataset_type=DatasetType.SINGLE_FILE,
    file_type=FileType.CSV,
)
```
Service principal auth
OAuth M2M tokens via Databricks Secrets.
Your models, your data
Spec and sample generation run through Databricks Model Serving. Data never leaves your environment.
Standard catalog permissions
Uses existing USE CATALOG, SELECT, and MODIFY grants. No special setup.
Full round-trip
Read from any catalog table, generate synthetic data, and write back as Delta, all in one workflow.
Arbitrarily large samples
Generate as much high-quality synthetic data as you need for ML training, analytics, and testing.
CSV, JSON, and JSONL
Supports single-file and multi-file dataset structures across common file formats.
Ready to generate synthetic data in Databricks?
Follow the step-by-step guide or dive into the full documentation.