Reality-grounded, requirements-tuned datasets that unlock AI.
Take your Databricks data further. Generate, anonymize, and simulate diverse datasets for training, fine-tuning, and evaluation, including eval sets, context docs, and golden labels. Powered by your Databricks Model Serving endpoints, with results landing directly in Unity Catalog.
Where teams get stuck
Problem
Your seed data isn't enough.
Resolution
Generate diverse, scaled datasets without starting from scratch. DataFramer works from your existing Databricks tables.
Problem
Your real data is off-limits.
Resolution
Anonymize it with structure and constraints intact and sensitive content removed, producing privacy-safe synthetic alternatives to PHI and PII.
Problem
Your data doesn't cover what your model will face.
Resolution
Simulate the edge cases and scenarios your real data never captured, with full control over distributions and constraints.
Problem
You don't have enough data to know where your model actually fails.
Resolution
Generate targeted eval sets and edge cases from your real data structure, with golden labels included when the spec determines the answer.
What makes DataFramer unique
Starts from your real data
Seed-based generation learns from your existing Unity Catalog tables. Outputs inherit your schema, constraints, and value patterns, not invented ones.
You control the distribution
Spec-driven targeting lets you define exactly which values, edge cases, and scenarios get generated, and how often, including conditional distributions across properties.
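Conceptually, a distribution spec pairs weighted target values with conditional rules. The sketch below is a hypothetical, simplified illustration of that idea (the `spec` dict, `sample_row`, and field names are invented for this example and are not DataFramer's actual spec format):

```python
import random

# Hypothetical spec: weighted values for one property, plus a
# conditional distribution for a second property. Not DataFramer's
# real spec format, just an illustration of the concept.
spec = {
    "plan": {"free": 0.6, "pro": 0.3, "enterprise": 0.1},
    # The support tier's distribution depends on the sampled plan.
    "support_tier_given_plan": {
        "free": {"community": 1.0},
        "pro": {"standard": 0.8, "priority": 0.2},
        "enterprise": {"priority": 1.0},
    },
}

def sample_row(spec, rng):
    plan = rng.choices(list(spec["plan"]), weights=list(spec["plan"].values()))[0]
    tiers = spec["support_tier_given_plan"][plan]
    tier = rng.choices(list(tiers), weights=list(tiers.values()))[0]
    return {"plan": plan, "support_tier": tier}

rng = random.Random(0)
rows = [sample_row(spec, rng) for _ in range(1000)]
```

The conditional table is what lets you say things like "free-plan rows always get community support" while still controlling how often each plan appears overall.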
Automatic golden labels
When the spec fully determines the answer, golden labels are generated alongside the data, giving you eval sets and test suites without manual annotation.
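The idea can be shown with a toy generator (this is an illustration of the principle, not DataFramer's API): when the generated fields fully determine the answer, the golden label is computed alongside the input instead of being annotated by hand.

```python
import random
import re

# Toy example of spec-determined golden labels: the label is
# derived from the same values that produced the input, so no
# manual annotation is needed.
def generate_eval_item(rng):
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return {
        "input": f"What is {a} + {b}?",
        "golden_label": str(a + b),  # fully determined by the generated values
    }

rng = random.Random(42)
eval_set = [generate_eval_item(rng) for _ in range(100)]
```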
Quality enforced, not assumed
Automatic revision loops and conformance filtering check outputs before they reach your pipelines. Optional human expert review for regulated or high-stakes domains.
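A revise-and-filter loop can be sketched in a few lines. This is a conceptual illustration (the `conforms` check, the noisy generator, and the retry budget are invented for the example), not the product's internal implementation:

```python
import random

# Conformance check: only rows passing this reach the pipeline.
def conforms(row):
    return 0 <= row["age"] <= 120 and row["email"].endswith("@example.com")

# Deliberately noisy generator, standing in for a model that
# sometimes produces out-of-range or malformed values.
def generate_candidate(rng):
    return {
        "age": rng.randint(-5, 130),
        "email": rng.choice(["a@example.com", "b@example.com", "oops"]),
    }

def generate_with_revision(rng, n_rows, max_revisions=3):
    accepted = []
    for _ in range(n_rows):
        for _ in range(max_revisions):      # revision loop: retry on failure
            row = generate_candidate(rng)
            if conforms(row):               # conformance filter
                accepted.append(row)
                break
    return accepted

rng = random.Random(7)
rows = generate_with_revision(rng, 200)
```

Rows that still fail after the retry budget are dropped rather than passed through, which is the "enforced, not assumed" part.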
Any format, any complexity
Works with tabular data, documents, multi-file structures, and nested formats. Handles complexity that generic generation tools can't.
Privacy-safe by design
Anonymize PHI, PII, and other sensitive data with structure intact and sensitive content removed. Generation runs through your own Databricks Model Serving endpoints.
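Structure-preserving anonymization can be illustrated with a toy record (this is a simplified sketch, not DataFramer's algorithm; the field names are invented): sensitive values are replaced with synthetic ones that keep the same schema and format, while non-sensitive fields pass through unchanged.

```python
import random
import re

def anonymize_record(record, rng):
    out = dict(record)  # schema (keys) is preserved
    # Replace the name with a synthetic identifier.
    out["name"] = f"Patient-{rng.randint(1000, 9999)}"
    # Keep the NNN-NN-NNNN format, replace the digits themselves.
    out["ssn"] = re.sub(r"\d", lambda _: str(rng.randint(0, 9)), record["ssn"])
    return out

rng = random.Random(0)
real = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis_code": "E11.9"}
fake = anonymize_record(real, rng)
```

Downstream code that expects the original schema and formats keeps working, but the sensitive content is gone.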
How it works
Install the pydataframer-databricks connector, point it at any Unity Catalog table, and generate synthetic datasets that land back as Delta tables.
```python
# Imports from the connector package (the module path here is an
# assumption; check the pydataframer-databricks docs for the exact path)
from pydataframer_databricks import DatabricksConnector, DatasetType, FileType

# Connect to your Databricks workspace
connector = DatabricksConnector(dbutils, scope="dataframer")

# Fetch seed data from any Unity Catalog table
seed_df = connector.fetch_sample_data(
    table_name="catalog.schema.my_table",
    num_items_to_select=25,
)

# ... generate synthetic data via DataFramer ...

# Load results back into a Delta table
connector.load_generated_data(
    table_name="catalog.schema.synthetic_output",
    downloaded_zip=generated_zip,
    dataset_type=DatasetType.SINGLE_FILE,
    file_type=FileType.CSV,
)
```
Service principal auth
OAuth M2M tokens via Databricks Secrets.
Your models, your data
Spec and sample generation run through Databricks Model Serving. Data never leaves your environment.
Standard catalog permissions
Uses existing USE CATALOG, SELECT, and MODIFY grants. No special setup.
Full round-trip
Read from any catalog table, generate synthetic data, and write back as Delta, all in one workflow.
Arbitrarily large samples
Generate as much high-quality synthetic data as you need for ML training, analytics, and testing.
CSV, JSON, and JSONL
Supports single-file and multi-file dataset structures across common file formats.
Ready to generate synthetic data in Databricks?
Follow the step-by-step guide or dive into the full documentation.