How we built a synthetic data platform that cut ML development time by 60%
Machine learning teams waste weeks waiting for production data access. Legal reviews, anonymization pipelines, compliance approvals—by the time data arrives, project timelines have slipped. Development environments sit empty while engineers context-switch to other work.
We saw this problem across multiple client engagements and decided to build a solution: Fabri, a synthetic data generation platform that gives ML teams realistic, privacy-safe datasets in minutes instead of weeks.

The Problem
Every ML project hits the same bottleneck: data access.
Production data contains PII. Legal needs to review access requests. Security requires audit trails. Compliance teams want to verify anonymization. For regulated industries—healthcare, finance, insurance—these reviews can take months.
Meanwhile, ML engineers need data to:
- Build and test feature pipelines
- Train initial model versions
- Validate preprocessing logic
- Load test inference systems
- Demo prototypes to stakeholders
Using scrambled or subsetted production data creates its own problems. The distributions don’t match reality. Edge cases disappear. Models trained on sanitized data fail when they hit production.
Teams need data that looks and behaves like production without being production.
Our Approach
We built Fabri as an internal tool first, then refined it across several client projects before productizing it.
Domain-Specific Templates
The key insight: synthetic data is only useful if it matches real-world patterns. Random data with correct column types isn’t enough—you need realistic distributions, correlations, and edge cases.
We built templates for common domains:
Healthcare
- Patient demographics with realistic age/gender/location distributions
- Diagnosis codes following ICD-10 frequency patterns
- Lab results with clinically plausible ranges and correlations
- Appointment schedules respecting business hours and provider capacity
Financial Services
- Transaction histories with realistic spending patterns
- Account balances following wealth distribution curves
- Credit scores correlated with income, age, and account history
- Fraud patterns at realistic (low) frequencies
Retail & E-commerce
- Customer profiles with segmentation-ready attributes
- Purchase histories with seasonal patterns and category affinities
- Inventory levels with realistic stockout scenarios
- Pricing data with promotion cycles
HR & Workforce
- Employee records with org hierarchy relationships
- Compensation data following market benchmarks
- Performance reviews with calibrated rating distributions
- Tenure patterns matching industry turnover rates
Each template encodes domain expertise—the kind of knowledge that takes years to accumulate and is usually locked in the heads of senior data scientists.
Schema Configuration
Templates provide starting points, but every dataset is unique. Fabri lets users customize:
Field-level controls
- Data types (string, integer, float, date, boolean, categorical)
- Distributions (normal, uniform, exponential, custom)
- Constraints (min/max, regex patterns, enum values)
- Null rates and missing data patterns
Cross-field relationships
- Correlations (salary increases with experience)
- Dependencies (state determines valid zip codes)
- Hierarchies (department → team → employee)
- Temporal sequences (events ordered correctly)
Statistical properties
- Target mean, median, standard deviation
- Outlier frequency and magnitude
- Cardinality for categorical fields
- Skewness and kurtosis for distributions
# Example schema configuration
schema = {
    "customer_id": {
        "type": "uuid",
        "unique": True
    },
    "age": {
        "type": "integer",
        "distribution": "normal",
        "mean": 42,
        "std": 15,
        "min": 18,
        "max": 95
    },
    "income": {
        "type": "float",
        "distribution": "lognormal",
        "median": 55000,
        "correlation": {"age": 0.4, "education_years": 0.6}
    },
    "churn_risk": {
        "type": "float",
        "distribution": "beta",
        "alpha": 2,
        "beta": 8,
        "correlation": {"tenure_months": -0.5, "support_tickets": 0.3}
    }
}
Generation Engine
The core engine uses a multi-pass approach:
Pass 1: Independent fields. Generate fields with no dependencies using their specified distributions.
Pass 2: Correlated fields. Use copulas to generate fields with specified correlation structures while preserving marginal distributions.
Pass 3: Dependent fields. Apply business rules and constraints (e.g., end_date > start_date).
Pass 4: Validation. Verify statistical properties match specifications within tolerance.
For large datasets (millions of rows), generation runs in parallel with chunked processing and progress tracking.
# Simplified generation flow
import random

import numpy as np

def generate_dataset(schema, num_rows, seed=None):
    if seed is not None:  # seed=0 is a valid seed; don't treat it as falsy
        random.seed(seed)
        np.random.seed(seed)

    # Pass 1: independent fields from their specified distributions
    data = {}
    for field, config in schema.items():
        if not config.get('correlation'):
            data[field] = generate_field(config, num_rows)

    # Pass 2: correlated fields using a Gaussian copula
    correlated_fields = [f for f, c in schema.items() if c.get('correlation')]
    if correlated_fields:
        correlation_matrix = build_correlation_matrix(schema, correlated_fields)
        correlated_data = generate_correlated(correlation_matrix, num_rows)
        for i, field in enumerate(correlated_fields):
            # Transform each column to its target marginal distribution
            data[field] = transform_marginal(
                correlated_data[:, i],
                schema[field]
            )

    # Pass 3: apply cross-field constraints and business rules
    data = apply_constraints(data, schema)

    # Pass 4: verify statistical properties within tolerance
    validation_report = validate_statistics(data, schema)
    return data, validation_report
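The copula step above can be sketched concretely. This is a minimal illustration of a Gaussian copula with SciPy, assuming the helper names from the flow (`generate_correlated`, `transform_marginal`) and supporting only two marginal families; Fabri's real implementation handles more distributions and constraint interactions.

```python
import numpy as np
from scipy import stats

def generate_correlated(correlation_matrix, num_rows, rng=None):
    """Draw correlated standard normals, then map them to uniforms via the normal CDF."""
    rng = rng or np.random.default_rng()
    dim = correlation_matrix.shape[0]
    z = rng.multivariate_normal(np.zeros(dim), correlation_matrix, size=num_rows)
    return stats.norm.cdf(z)  # uniforms in (0, 1) carrying the copula's dependence

def transform_marginal(uniforms, field_config):
    """Map copula uniforms onto the field's target marginal via the inverse CDF."""
    dist = field_config["distribution"]
    if dist == "normal":
        return stats.norm.ppf(uniforms, loc=field_config["mean"], scale=field_config["std"])
    if dist == "lognormal":
        # Parameterized by the median: for a lognormal, exp(mu) equals the median
        return field_config["median"] * np.exp(stats.norm.ppf(uniforms))
    raise ValueError(f"unsupported distribution: {dist}")
```

The inverse-CDF transform is what lets each field keep its configured marginal shape while the shared Gaussian draw supplies the correlation structure.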
Version Control
Synthetic datasets evolve with projects. Schema changes, distribution tweaks, bug fixes—teams need to track what changed and reproduce previous versions.
Fabri includes built-in version control:
- Schema versioning: Track changes to field definitions
- Seed management: Reproduce exact datasets with stored seeds
- Diff visualization: See what changed between versions
- Branching: Experiment with schema variants without losing the original
- Tagging: Mark stable versions for production use
fabri-dataset-v1.0.0
├── schema.json
├── config.yaml
├── seed: 42
├── generated: 2025-01-15T10:30:00Z
├── rows: 1,000,000
└── validation_report.json
fabri-dataset-v1.1.0
├── schema.json (modified: added churn_risk field)
├── config.yaml
├── seed: 42
├── parent: v1.0.0
├── generated: 2025-01-20T14:15:00Z
├── rows: 1,000,000
└── validation_report.json
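Seed-based reproducibility, the property the `seed: 42` entries above record, can be sanity-checked with a pattern like this (a simplified sketch using a single NumPy field, not Fabri's actual API):

```python
import numpy as np

def generate_ages(num_rows, seed):
    """Regenerating with the stored seed must yield byte-identical data."""
    rng = np.random.default_rng(seed)
    ages = rng.normal(loc=42, scale=15, size=num_rows)
    return np.clip(ages, 18, 95)

first = generate_ages(1000, seed=42)
second = generate_ages(1000, seed=42)
assert np.array_equal(first, second)  # same seed, same dataset
```

Storing the seed alongside the schema version is what makes a dataset version a true snapshot: anyone can regenerate it exactly without shipping the data itself.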
Data Visualization
Before exporting, users can explore generated data visually:
- Distribution plots: Histograms and density curves for each field
- Correlation heatmaps: Verify cross-field relationships
- Scatter matrices: Spot unexpected patterns
- Outlier detection: Identify and adjust extreme values
- Sample preview: Inspect raw rows
The visualization layer catches configuration errors before they propagate downstream.
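The numbers behind the correlation heatmap reduce to a few lines of NumPy. This sketch computes the matrix the view would render (the rendering layer itself is omitted; field names here are illustrative):

```python
import numpy as np

def correlation_report(data):
    """Compute the pairwise correlation matrix that a heatmap view would render."""
    fields = list(data)
    matrix = np.corrcoef(np.column_stack([data[f] for f in fields]), rowvar=False)
    return fields, matrix

rng = np.random.default_rng(7)
age = rng.normal(42, 15, 10000)
income = 1000 * age + rng.normal(0, 20000, 10000)  # income loosely tracks age
fields, corr = correlation_report({"age": age, "income": income})
```

Comparing this realized matrix against the correlations declared in the schema is exactly the kind of configuration error the visualization layer surfaces early.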
Multi-Format Export
ML pipelines consume data in different formats. Fabri exports to:
- CSV — Universal compatibility
- JSON — Nested structures, API mocking
- Parquet — Columnar storage, efficient for large datasets
- SQL — Direct database insertion scripts
- Delta Lake — Versioned data lake integration
Export includes optional compression, partitioning, and schema metadata.
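As an illustration of the export path, here is a minimal CSV serializer using only the standard library (the Parquet and Delta Lake exporters depend on external libraries and are not shown; the column-oriented input shape mirrors the generation engine's output):

```python
import csv
import io

def export_csv(data, fieldnames):
    """Serialize column-oriented data (dict of equal-length lists) to a CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    num_rows = len(next(iter(data.values())))
    for i in range(num_rows):
        writer.writerow({f: data[f][i] for f in fieldnames})
    return buf.getvalue()

payload = export_csv({"customer_id": ["c1", "c2"], "age": [34, 51]},
                     ["customer_id", "age"])
```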
Technical Implementation
Tech Stack
- Backend: FastAPI + Python for the generation engine
- Database: PostgreSQL for schema/version storage, DuckDB for data processing
- Generation: NumPy, SciPy for distributions; Faker for realistic values
- Frontend: React dashboard for schema configuration and visualization
- Storage: S3-compatible object storage for generated datasets
Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Dashboard │────▶│ API Server │────▶│ Generation │
│ (React) │ │ (FastAPI) │ │ Engine │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────┐
│ PostgreSQL │ │ Object Storage │
│ (Metadata) │ │ (Datasets) │
└──────────────────┘ └─────────────────┘
Performance
For a 1 million row dataset with 20 fields:
- Generation time: ~45 seconds
- Validation time: ~10 seconds
- Export (Parquet): ~5 seconds
- Peak memory: ~2GB
Generation scales linearly with row count and parallelizes across available cores.
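The chunked, parallel approach can be sketched as follows. This uses threads and per-chunk seeds spawned from a `SeedSequence` so results stay reproducible regardless of scheduling; it is an illustration of the pattern, not Fabri's production engine:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def generate_chunk(args):
    """Each chunk gets an independent child seed so workers never duplicate values."""
    chunk_seed, chunk_rows = args
    rng = np.random.default_rng(chunk_seed)
    return rng.normal(42, 15, chunk_rows)

def generate_parallel(num_rows, base_seed=0, chunk_size=250_000):
    """Split generation into fixed-size chunks and fan them out across workers."""
    sizes = [min(chunk_size, num_rows - start)
             for start in range(0, num_rows, chunk_size)]
    seeds = np.random.SeedSequence(base_seed).spawn(len(sizes))
    with ThreadPoolExecutor() as pool:
        # map() preserves chunk order, so concatenation is deterministic
        chunks = list(pool.map(generate_chunk, zip(seeds, sizes)))
    return np.concatenate(chunks)
```

Because `map` returns chunks in submission order and each chunk's RNG is seeded independently, the same base seed always reproduces the same dataset, which is what makes chunked generation compatible with the version-control features above.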
API Endpoints
@app.post("/api/datasets")
async def create_dataset(schema: DatasetSchema):
    """Create a new dataset configuration"""
    dataset_id = generate_dataset_id()
    await db.datasets.insert_one({
        "id": dataset_id,
        "schema": schema.dict(),
        "version": "1.0.0",
        "created_at": datetime.utcnow()
    })
    return {"dataset_id": dataset_id}

@app.post("/api/datasets/{dataset_id}/generate")
async def generate(dataset_id: str, config: GenerationConfig):
    """Generate synthetic data from schema"""
    dataset = await db.datasets.find_one({"id": dataset_id})
    if dataset is None:
        raise HTTPException(status_code=404, detail="dataset not found")
    # Queue the generation job; long-running work stays off the request path
    job_id = await queue.enqueue(
        generate_dataset,
        schema=dataset["schema"],
        num_rows=config.num_rows,
        seed=config.seed
    )
    return {"job_id": job_id, "status": "queued"}

@app.get("/api/datasets/{dataset_id}/export")
async def export_dataset(dataset_id: str, format: str = "parquet"):
    """Export generated dataset"""
    dataset = await storage.get_dataset(dataset_id)
    if format == "parquet":
        return StreamingResponse(
            dataset.to_parquet(),
            media_type="application/octet-stream"
        )
    if format == "csv":
        return StreamingResponse(
            dataset.to_csv(),
            media_type="text/csv"
        )
    raise HTTPException(status_code=400, detail=f"unsupported format: {format}")
The Results
We deployed Fabri internally first, then rolled it out to three client projects.
Development Velocity
- Data access time dropped from 3 weeks to 30 minutes — Legal and compliance reviews eliminated for development environments
- ML iteration cycles shortened by 60% — Engineers could test changes immediately instead of waiting for data refreshes
- Onboarding time for new team members cut in half — No more waiting for production access approvals
Data Quality
- Model performance on synthetic data matched production within 5% — Distributions and correlations preserved key patterns
- Edge cases caught earlier — Synthetic data could include rare scenarios that barely appear in sampled production data
- A/B test simulations became possible — Generate counterfactual datasets to predict experiment outcomes
Compliance Benefits
- Zero PII in development environments — Audit-friendly by design
- GDPR/CCPA concerns eliminated for dev/test — No real user data means no data subject rights to manage
- Cross-border data transfer simplified — Synthetic data can move freely between regions
Cost Savings
- Reduced production database load — Development queries moved to synthetic datasets
- Smaller staging environments — No need to replicate full production data
- Fewer data breach vectors — Less real data means less exposure
Key Takeaways
Synthetic Data is an Infrastructure Problem
Most teams treat synthetic data as a one-off script. That approach fails at scale. You need versioning, validation, reproducibility, and governance—the same rigor applied to production data pipelines.
Domain Knowledge is the Differentiator
Generic random data is easy. Realistic data that captures domain-specific patterns, edge cases, and correlations requires expertise. Templates that encode this knowledge are the real value.
Statistical Validation is Non-Negotiable
Generated data must be verified against specifications. Drift between intended and actual distributions compounds through ML pipelines. Automated validation catches problems before they reach models.
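The tolerance check this takeaway describes can be sketched in a few lines. The function name, spec shape, and 5% default tolerance are illustrative assumptions, not Fabri's actual validator:

```python
import numpy as np

def validate_field(values, spec, tolerance=0.05):
    """Compare realized statistics against the schema spec within a relative tolerance."""
    failures = []
    if "mean" in spec and abs(np.mean(values) - spec["mean"]) > tolerance * abs(spec["mean"]):
        failures.append("mean")
    if "std" in spec and abs(np.std(values) - spec["std"]) > tolerance * spec["std"]:
        failures.append("std")
    return failures

rng = np.random.default_rng(1)
ages = rng.normal(42, 15, 100_000)
assert validate_field(ages, {"mean": 42, "std": 15}) == []      # within tolerance
assert validate_field(ages, {"mean": 60, "std": 15}) == ["mean"]  # drift detected
```

Running a check like this on every generated dataset, and failing the job on any mismatch, is what keeps distribution drift from silently propagating into model training.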
Self-Service Unlocks Adoption
If engineers need to file tickets to get synthetic data, they’ll find workarounds. Fabri’s dashboard lets teams generate what they need, when they need it, without dependencies on data engineering.
What’s Next
Fabri started as an internal tool to accelerate our own ML projects. After seeing the impact across multiple engagements, we’re exploring:
- Hosted SaaS version for teams without infrastructure to self-host
- Additional domain templates based on client requests
- Integration with feature stores for seamless ML pipeline connections
- Differential privacy options for extra-sensitive use cases
The gap between “we need data” and “we have data” shouldn’t be weeks. It should be minutes.
Need realistic synthetic data for your ML projects? Get in touch to discuss how Fabri can accelerate your development cycles.