How we built a synthetic data platform that cut ML development time by 60%
Machine learning teams waste weeks waiting for production data access. Legal reviews, anonymization pipelines, compliance approvals—by the time data arrives, project timelines have slipped. Development environments sit empty while engineers context-switch to other work.
We saw this problem across multiple client engagements and decided to build a solution: Fabri, a synthetic data generation platform that gives ML teams realistic, privacy-safe datasets in minutes instead of weeks.

The Problem
Every ML project hits the same bottleneck: data access.
Production data contains PII. Legal needs to review access requests. Security requires audit trails. Compliance teams want to verify anonymization. For regulated industries—healthcare, finance, insurance—these reviews can take months.
Meanwhile, ML engineers need data to:
- Build and test feature pipelines
- Train initial model versions
- Validate preprocessing logic
- Load test inference systems
- Demo prototypes to stakeholders
Using scrambled or subsetted production data creates its own problems. The distributions don’t match reality. Edge cases disappear. Models trained on sanitized data fail when they hit production.
Teams need data that looks and behaves like production without being production.
Our Approach
We built Fabri as an internal tool first, then refined it across several client projects before productizing it.
Domain-Specific Templates
The key insight: synthetic data is only useful if it matches real-world patterns. Random data with correct column types isn’t enough—you need realistic distributions, correlations, and edge cases.
We built templates for common domains:
Healthcare
- Patient demographics with realistic age/gender/location distributions
- Diagnosis codes following ICD-10 frequency patterns
- Lab results with clinically plausible ranges and correlations
- Appointment schedules respecting business hours and provider capacity
Financial Services
- Transaction histories with realistic spending patterns
- Account balances following wealth distribution curves
- Credit scores correlated with income, age, and account history
- Fraud patterns at realistic (low) frequencies
Retail & E-commerce
- Customer profiles with segmentation-ready attributes
- Purchase histories with seasonal patterns and category affinities
- Inventory levels with realistic stockout scenarios
- Pricing data with promotion cycles
HR & Workforce
- Employee records with org hierarchy relationships
- Compensation data following market benchmarks
- Performance reviews with calibrated rating distributions
- Tenure patterns matching industry turnover rates
Each template encodes domain expertise—the kind of knowledge that takes years to accumulate and is usually locked in the heads of senior data scientists.
Schema Configuration
Templates provide starting points, but every dataset is unique. Fabri lets users customize:
Field-level controls
- Data types (string, integer, float, date, boolean, categorical)
- Distributions (normal, uniform, exponential, custom)
- Constraints (min/max, regex patterns, enum values)
- Null rates and missing data patterns
Cross-field relationships
- Correlations (salary increases with experience)
- Dependencies (state determines valid zip codes)
- Hierarchies (department → team → employee)
- Temporal sequences (events ordered correctly)
Statistical properties
- Target mean, median, standard deviation
- Outlier frequency and magnitude
- Cardinality for categorical fields
- Skewness and kurtosis for distributions
# Example schema configuration
schema = {
    "customer_id": {
        "type": "uuid",
        "unique": True
    },
    "age": {
        "type": "integer",
        "distribution": "normal",
        "mean": 42,
        "std": 15,
        "min": 18,
        "max": 95
    },
    "income": {
        "type": "float",
        "distribution": "lognormal",
        "median": 55000,
        "correlation": {"age": 0.4, "education_years": 0.6}
    },
    "churn_risk": {
        "type": "float",
        "distribution": "beta",
        "alpha": 2,
        "beta": 8,
        "correlation": {"tenure_months": -0.5, "support_tickets": 0.3}
    }
}
Generation Engine
The core engine uses a multi-pass approach:
Pass 1: Independent fields. Generate fields with no dependencies using their specified distributions.
Pass 2: Correlated fields. Use copulas to generate fields with specified correlation structures while preserving marginal distributions.
Pass 3: Dependent fields. Apply business rules and constraints (e.g., end_date > start_date).
Pass 4: Validation. Verify statistical properties match specifications within tolerance.
For large datasets (millions of rows), generation runs in parallel with chunked processing and progress tracking.
# Simplified generation flow
import random

import numpy as np

def generate_dataset(schema, num_rows, seed=None):
    if seed is not None:  # seed=0 is a valid seed; don't treat it as falsy
        random.seed(seed)
        np.random.seed(seed)

    # Pass 1: independent fields from their specified distributions
    data = {}
    for field, config in schema.items():
        if not config.get('correlation'):
            data[field] = generate_field(config, num_rows)

    # Pass 2: correlated fields using a Gaussian copula
    correlated_fields = [f for f, c in schema.items() if c.get('correlation')]
    if correlated_fields:
        correlation_matrix = build_correlation_matrix(schema, correlated_fields)
        correlated_data = generate_correlated(correlation_matrix, num_rows)
        for i, field in enumerate(correlated_fields):
            # Transform each column to its target marginal distribution
            data[field] = transform_marginal(
                correlated_data[:, i],
                schema[field]
            )

    # Pass 3: apply cross-field constraints and business rules
    data = apply_constraints(data, schema)

    # Pass 4: verify statistical properties within tolerance
    validation_report = validate_statistics(data, schema)
    return data, validation_report
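The copula step above can be sketched concretely. This is a minimal illustration of a Gaussian copula with SciPy, assuming the helper names from the flow (`generate_correlated`, `transform_marginal`) and supporting only two marginal families; Fabri's real implementation handles more distributions and constraint interactions.

```python
import numpy as np
from scipy import stats

def generate_correlated(correlation_matrix, num_rows, rng=None):
    """Draw correlated standard normals, then map them to uniforms via the normal CDF."""
    rng = rng or np.random.default_rng()
    dim = correlation_matrix.shape[0]
    z = rng.multivariate_normal(np.zeros(dim), correlation_matrix, size=num_rows)
    return stats.norm.cdf(z)  # uniforms in (0, 1) carrying the copula's dependence

def transform_marginal(uniforms, field_config):
    """Map copula uniforms onto the field's target marginal via the inverse CDF."""
    dist = field_config["distribution"]
    if dist == "normal":
        return stats.norm.ppf(uniforms, loc=field_config["mean"], scale=field_config["std"])
    if dist == "lognormal":
        # Parameterized by the median: for a lognormal, exp(mu) equals the median
        return field_config["median"] * np.exp(stats.norm.ppf(uniforms))
    raise ValueError(f"unsupported distribution: {dist}")
```

The inverse-CDF transform is what lets each field keep its configured marginal shape while the shared Gaussian draw supplies the correlation structure.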
Version Control
Synthetic datasets evolve with projects. Schema changes, distribution tweaks, bug fixes—teams need to track what changed and reproduce previous versions.
Fabri includes built-in version control:
- Schema versioning: Track changes to field definitions
- Seed management: Reproduce exact datasets with stored seeds
- Diff visualization: See what changed between versions
- Branching: Experiment with schema variants without losing the original
- Tagging: Mark stable versions for production use
fabri-dataset-v1.0.0
├── schema.json
├── config.yaml
├── seed: 42
├── generated: 2025-01-15T10:30:00Z
├── rows: 1,000,000
└── validation_report.json
fabri-dataset-v1.1.0
├── schema.json (modified: added churn_risk field)
├── config.yaml
├── seed: 42
├── parent: v1.0.0
├── generated: 2025-01-20T14:15:00Z
├── rows: 1,000,000
└── validation_report.json
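Seed-based reproducibility, the property the `seed: 42` entries above record, can be sanity-checked with a pattern like this (a simplified sketch using a single NumPy field, not Fabri's actual API):

```python
import numpy as np

def generate_ages(num_rows, seed):
    """Regenerating with the stored seed must yield byte-identical data."""
    rng = np.random.default_rng(seed)
    ages = rng.normal(loc=42, scale=15, size=num_rows)
    return np.clip(ages, 18, 95)

first = generate_ages(1000, seed=42)
second = generate_ages(1000, seed=42)
assert np.array_equal(first, second)  # same seed, same dataset
```

Storing the seed alongside the schema version is what makes a dataset version a true snapshot: anyone can regenerate it exactly without shipping the data itself.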
Data Visualization
Before exporting, users can explore generated data visually:
- Distribution plots: Histograms and density curves for each field
- Correlation heatmaps: Verify cross-field relationships
- Scatter matrices: Spot unexpected patterns
- Outlier detection: Identify and adjust extreme values
- Sample preview: Inspect raw rows
The visualization layer catches configuration errors before they propagate downstream.
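The numbers behind the correlation heatmap reduce to a few lines of NumPy. This sketch computes the matrix the view would render (the rendering layer itself is omitted; field names here are illustrative):

```python
import numpy as np

def correlation_report(data):
    """Compute the pairwise correlation matrix that a heatmap view would render."""
    fields = list(data)
    matrix = np.corrcoef(np.column_stack([data[f] for f in fields]), rowvar=False)
    return fields, matrix

rng = np.random.default_rng(7)
age = rng.normal(42, 15, 10000)
income = 1000 * age + rng.normal(0, 20000, 10000)  # income loosely tracks age
fields, corr = correlation_report({"age": age, "income": income})
```

Comparing this realized matrix against the correlations declared in the schema is exactly the kind of configuration error the visualization layer surfaces early.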
Multi-Format Export
ML pipelines consume data in different formats. Fabri exports to:
- CSV — Universal compatibility
- JSON — Nested structures, API mocking
- Parquet — Columnar storage, efficient for large datasets
- SQL — Direct database insertion scripts
- Delta Lake — Versioned data lake integration
Export includes optional compression, partitioning, and schema metadata.
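As an illustration of the export path, here is a minimal CSV serializer using only the standard library (the Parquet and Delta Lake exporters depend on external libraries and are not shown; the column-oriented input shape mirrors the generation engine's output):

```python
import csv
import io

def export_csv(data, fieldnames):
    """Serialize column-oriented data (dict of equal-length lists) to a CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    num_rows = len(next(iter(data.values())))
    for i in range(num_rows):
        writer.writerow({f: data[f][i] for f in fieldnames})
    return buf.getvalue()

payload = export_csv({"customer_id": ["c1", "c2"], "age": [34, 51]},
                     ["customer_id", "age"])
```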
Technical Implementation
Tech Stack
- Backend: FastAPI + Python for the generation engine
- Database: PostgreSQL for schema/version storage, DuckDB for data processing
- Generation: NumPy, SciPy for distributions; Faker for realistic values
- Frontend: React dashboard for schema configuration and visualization
- Storage: S3-compatible object storage for generated datasets
Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Dashboard │────▶│ API Server │────▶│ Generation │
│ (React) │ │ (FastAPI) │ │ Engine │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────┐
│ PostgreSQL │ │ Object Storage │
│ (Metadata) │ │ (Datasets) │
└──────────────────┘ └─────────────────┘
Performance
For a 1 million row dataset with 20 fields:
- Generation time: ~45 seconds
- Validation time: ~10 seconds
- Export (Parquet): ~5 seconds
- Peak memory: ~2GB
Generation scales linearly with row count and parallelizes across available cores.
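The chunked, parallel approach can be sketched as follows. This uses threads and per-chunk seeds spawned from a `SeedSequence` so results stay reproducible regardless of scheduling; it is an illustration of the pattern, not Fabri's production engine:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def generate_chunk(args):
    """Each chunk gets an independent child seed so workers never duplicate values."""
    chunk_seed, chunk_rows = args
    rng = np.random.default_rng(chunk_seed)
    return rng.normal(42, 15, chunk_rows)

def generate_parallel(num_rows, base_seed=0, chunk_size=250_000):
    """Split generation into fixed-size chunks and fan them out across workers."""
    sizes = [min(chunk_size, num_rows - start)
             for start in range(0, num_rows, chunk_size)]
    seeds = np.random.SeedSequence(base_seed).spawn(len(sizes))
    with ThreadPoolExecutor() as pool:
        # map() preserves chunk order, so concatenation is deterministic
        chunks = list(pool.map(generate_chunk, zip(seeds, sizes)))
    return np.concatenate(chunks)
```

Because `map` returns chunks in submission order and each chunk's RNG is seeded independently, the same base seed always reproduces the same dataset, which is what makes chunked generation compatible with the version-control features above.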
API Endpoints
@app.post("/api/datasets")
async def create_dataset(schema: DatasetSchema):
    """Create a new dataset configuration"""
    dataset_id = generate_dataset_id()
    await db.datasets.insert_one({
        "id": dataset_id,
        "schema": schema.dict(),
        "version": "1.0.0",
        "created_at": datetime.utcnow()
    })
    return {"dataset_id": dataset_id}

@app.post("/api/datasets/{dataset_id}/generate")
async def generate(dataset_id: str, config: GenerationConfig):
    """Generate synthetic data from schema"""
    dataset = await db.datasets.find_one({"id": dataset_id})
    if dataset is None:
        raise HTTPException(status_code=404, detail="dataset not found")
    # Queue the generation job; long-running work stays off the request path
    job_id = await queue.enqueue(
        generate_dataset,
        schema=dataset["schema"],
        num_rows=config.num_rows,
        seed=config.seed
    )
    return {"job_id": job_id, "status": "queued"}

@app.get("/api/datasets/{dataset_id}/export")
async def export_dataset(dataset_id: str, format: str = "parquet"):
    """Export generated dataset"""
    dataset = await storage.get_dataset(dataset_id)
    if format == "parquet":
        return StreamingResponse(
            dataset.to_parquet(),
            media_type="application/octet-stream"
        )
    if format == "csv":
        return StreamingResponse(
            dataset.to_csv(),
            media_type="text/csv"
        )
    raise HTTPException(status_code=400, detail=f"unsupported format: {format}")
The Results
We deployed Fabri internally first, then rolled it out to three client projects.
Development Velocity
- Data access time dropped from 3 weeks to 30 minutes — Legal and compliance reviews eliminated for development environments
- ML iteration cycles shortened by 60% — Engineers could test changes immediately instead of waiting for data refreshes
- Onboarding time for new team members cut in half — No more waiting for production access approvals
Data Quality
- Model performance on synthetic data matched production within 5% — Distributions and correlations preserved key patterns
- Edge cases caught earlier — Synthetic data could include rare scenarios that barely appear in sampled production data
- A/B test simulations became possible — Generate counterfactual datasets to predict experiment outcomes
Compliance Benefits
- Zero PII in development environments — Audit-friendly by design
- GDPR/CCPA concerns eliminated for dev/test — No real user data means no data subject rights to manage
- Cross-border data transfer simplified — Synthetic data can move freely between regions
Cost Savings
- Reduced production database load — Development queries moved to synthetic datasets
- Smaller staging environments — No need to replicate full production data
- Fewer data breach vectors — Less real data means less exposure
Key Takeaways
Synthetic Data is an Infrastructure Problem
Most teams treat synthetic data as a one-off script. That approach fails at scale. You need versioning, validation, reproducibility, and governance—the same rigor applied to production data pipelines.
Domain Knowledge is the Differentiator
Generic random data is easy. Realistic data that captures domain-specific patterns, edge cases, and correlations requires expertise. Templates that encode this knowledge are the real value.
Statistical Validation is Non-Negotiable
Generated data must be verified against specifications. Drift between intended and actual distributions compounds through ML pipelines. Automated validation catches problems before they reach models.
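The tolerance check this takeaway describes can be sketched in a few lines. The function name, spec shape, and 5% default tolerance are illustrative assumptions, not Fabri's actual validator:

```python
import numpy as np

def validate_field(values, spec, tolerance=0.05):
    """Compare realized statistics against the schema spec within a relative tolerance."""
    failures = []
    if "mean" in spec and abs(np.mean(values) - spec["mean"]) > tolerance * abs(spec["mean"]):
        failures.append("mean")
    if "std" in spec and abs(np.std(values) - spec["std"]) > tolerance * spec["std"]:
        failures.append("std")
    return failures

rng = np.random.default_rng(1)
ages = rng.normal(42, 15, 100_000)
assert validate_field(ages, {"mean": 42, "std": 15}) == []      # within tolerance
assert validate_field(ages, {"mean": 60, "std": 15}) == ["mean"]  # drift detected
```

Running a check like this on every generated dataset, and failing the job on any mismatch, is what keeps distribution drift from silently propagating into model training.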
Self-Service Unlocks Adoption
If engineers need to file tickets to get synthetic data, they’ll find workarounds. Fabri’s dashboard lets teams generate what they need, when they need it, without dependencies on data engineering.
What’s Next
Fabri started as an internal tool to accelerate our own ML projects. After seeing the impact across multiple engagements, we’re exploring:
- Hosted SaaS version for teams without infrastructure to self-host
- Additional domain templates based on client requests
- Integration with feature stores for seamless ML pipeline connections
- Differential privacy options for extra-sensitive use cases
The gap between “we need data” and “we have data” shouldn’t be weeks. It should be minutes.
Need realistic synthetic data for your ML projects? Get in touch to discuss how Fabri can accelerate your development cycles.