How we built a data obfuscation platform that made production data shareable

How intelligent data obfuscation unlocked collaboration without compromising privacy


Companies sit on valuable production data they can’t share—with vendors, partners, or even internal teams. PII regulations, contractual obligations, and security policies lock data behind approval processes that take weeks.

A healthcare company can’t share patient data with an ML vendor. A bank can’t give transaction logs to a fraud detection partner. An e-commerce platform can’t export customer behavior to their analytics consultants.

The data exists. The need is clear. But privacy concerns create an impasse.

We built Incog to solve this: transform sensitive datasets into shareable assets while preserving analytical value.

The Problem

Data sharing fails for predictable reasons.

PII is everywhere. Names, emails, phone numbers, addresses—they appear in columns you expect and columns you don’t. Manual identification misses fields. Automated detection has false negatives.

Anonymization breaks utility. Naive approaches—replacing names with “User_1”, scrambling emails—destroy the patterns that make data valuable. If you can’t analyze the obfuscated data the same way you’d analyze the original, what’s the point?

Consistency matters. If [email protected] appears in three tables, it needs to map to the same obfuscated value in all three. Otherwise, joins break and cross-table analysis becomes impossible.

The process is manual. Data engineers write one-off scripts for each sharing request. Scripts are error-prone, undocumented, and non-reproducible. Six months later, no one remembers how a dataset was transformed.

Companies need a systematic approach: detect sensitive data automatically, transform it consistently, preserve statistical properties, and document everything.

Our Approach

Incog evolved from Fabri, our synthetic data platform. We realized that many teams didn’t need fully synthetic data—they needed their actual data, just made safe to share.

Automatic PII Detection

The first step is finding sensitive data. Incog scans uploaded datasets and flags potential PII using multiple detection methods:

Pattern matching

  • Email addresses (regex)
  • Phone numbers (country-specific formats)
  • Social Security Numbers, National IDs
  • Credit card numbers (Luhn validation)
  • IP addresses, MAC addresses

Named Entity Recognition

  • Person names (using NLP models)
  • Organization names
  • Geographic locations
  • Dates that might indicate birthdays

Column name heuristics

  • Common PII column names (email, phone, ssn, dob, address)
  • Variations and abbreviations
  • Multi-language support

Statistical analysis

  • High cardinality string columns (likely identifiers)
  • Columns with unique values matching row count
  • Date columns with age-appropriate ranges

import pandas as pd

# EMAIL_REGEX, PHONE_REGEX, and PII_KEYWORDS are module-level constants:
# regex patterns plus a {pii_type: [keywords]} map.

def detect_pii(df):
    """Scan dataframe for potential PII columns"""
    detections = []
    
    for column in df.columns:
        column_lower = column.lower()
        sample = df[column].dropna().head(1000)
        if sample.empty:
            continue
        
        # Pattern-based detection
        if sample.dtype == 'object':
            email_match = sample.str.match(EMAIL_REGEX).mean()
            if email_match > 0.8:
                detections.append({
                    'column': column,
                    'type': 'email',
                    'confidence': email_match,
                    'method': 'pattern'
                })
                continue
            
            phone_match = sample.str.match(PHONE_REGEX).mean()
            if phone_match > 0.7:
                detections.append({
                    'column': column,
                    'type': 'phone',
                    'confidence': phone_match,
                    'method': 'pattern'
                })
                continue
        
        # Column name heuristics
        for pii_type, keywords in PII_KEYWORDS.items():
            if any(kw in column_lower for kw in keywords):
                detections.append({
                    'column': column,
                    'type': pii_type,
                    'confidence': 0.9,
                    'method': 'column_name'
                })
                break
        
        # High cardinality detection
        if sample.dtype == 'object':
            cardinality = sample.nunique() / len(sample)
            if cardinality > 0.95:
                detections.append({
                    'column': column,
                    'type': 'potential_identifier',
                    'confidence': cardinality,
                    'method': 'cardinality'
                })
    
    return detections

Detection results are presented in a review interface where users can:

  • Confirm or dismiss each detection
  • Manually flag columns the system missed
  • Assign obfuscation rules per column

Consistent Hashing

The core innovation: deterministic obfuscation. The same input always produces the same output.

import hashlib

import pandas as pd

def consistent_hash(value, salt, output_format='hex'):
    """Generate consistent hash for a value"""
    if pd.isna(value):
        return value
    
    # Combine value with salt
    salted = f"{salt}:{str(value)}"
    
    # Generate hash
    hash_bytes = hashlib.sha256(salted.encode()).digest()
    
    if output_format == 'hex':
        return hash_bytes[:8].hex()
    elif output_format == 'email':
        # Preserve email format
        local = hash_bytes[:6].hex()
        domain = hash_bytes[6:9].hex()
        return f"{local}@{domain}.example.com"
    elif output_format == 'phone':
        # Preserve phone format
        digits = ''.join(str(b % 10) for b in hash_bytes[:10])
        return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:10]}"
    # Fall back to plain hex for unrecognized formats
    return hash_bytes[:8].hex()

This means:

  • [email protected] always produces the same obfuscated address, on every run
  • The same email in different tables maps to the same hash
  • Joins and aggregations work correctly on obfuscated data
  • Original values cannot be recovered from the hashes without the salt

The salt is stored securely and can be rotated for different sharing contexts—the same source data can produce different obfuscated versions for different recipients.
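
In practice that looks something like this (vault_salt_for is a hypothetical lookup helper, not part of the code above):

# Each sharing context gets its own salt, so each recipient sees a
# different but internally consistent obfuscated view
salt_vendor_a = vault_salt_for("vendor_a")
salt_vendor_b = vault_salt_for("vendor_b")

consistent_hash("[email protected]", salt_vendor_a, "email")
# -> always the same obfuscated address in vendor A's exports
consistent_hash("[email protected]", salt_vendor_b, "email")
# -> a different address, equally stable, in vendor B's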

Numerical Scaling

PII isn’t just strings. Salaries, ages, account balances—numerical fields can be identifying too.

Incog provides several transformation options:

Range scaling: Transform values to a different range while preserving relative ordering.

# Original: salaries from $35,000 to $250,000
# Scaled: values from 0 to 100
scaled = (original - original.min()) / (original.max() - original.min()) * 100

Noise injection: Add random noise while preserving distribution shape.

# Add Gaussian noise (5% of standard deviation)
noisy = original + np.random.normal(0, original.std() * 0.05, len(original))

Bucketing: Replace exact values with ranges.

# Ages: 18-25, 26-35, 36-45, 46-55, 56-65, 65+
bucketed = pd.cut(ages, bins=[18, 25, 35, 45, 55, 65, 100])

Statistical preservation: Maintain mean, standard deviation, and distribution shape while obfuscating individual values.

import numpy as np
import pandas as pd

def preserve_statistics(values, seed):
    """Obfuscate while preserving statistical properties"""
    rng = np.random.default_rng(seed)
    
    # Fit distribution
    mean, std = values.mean(), values.std()
    
    # Generate replacement values with the same statistics, pre-sorted
    replacement = np.sort(rng.normal(mean, std, len(values)))
    
    # Rank with deterministic tie-breaking so every row gets a unique integer rank
    ranks = values.rank(method='first').astype(int)
    
    # The i-th smallest original value receives the i-th smallest replacement,
    # preserving ordering relationships
    return pd.Series(replacement[(ranks - 1).to_numpy()], index=values.index)
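
A quick sanity check of the sketch above (numbers invented for illustration): ordering survives exactly, and for larger samples the mean and standard deviation land close to the originals.

salaries = pd.Series([52_000, 61_000, 87_000, 45_000, 120_000])
obfuscated = preserve_statistics(salaries, seed=42)

# Relative ordering is identical, so rank-based analysis still works
assert (salaries.rank() == obfuscated.rank()).all()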

Feature Name Obfuscation

Sometimes column names themselves are sensitive—they reveal schema design, business logic, or proprietary categorizations.

Incog can obfuscate column headers while maintaining a mapping file:

Original → Obfuscated
───────────────────────
customer_id → col_a1
email → col_b2  
annual_income → col_c3
churn_risk_score → col_d4
premium_tier → col_e5

The mapping file stays internal; the obfuscated dataset can be shared freely.
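
The processing pipeline shown later calls an obfuscate_column_names helper. A minimal sketch that produces labels like the table above (and is naive beyond 26 columns) might be:

def obfuscate_column_names(df):
    """Rename columns to neutral labels; return the frame and the internal mapping."""
    mapping = {
        col: f"col_{chr(ord('a') + i)}{i + 1}"
        for i, col in enumerate(df.columns)
    }
    return df.rename(columns=mapping), mapping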

Before/After Comparison

Users need confidence that obfuscation worked correctly. Incog provides side-by-side comparison:

Original                 Obfuscated
[email protected]         [email protected]
[email protected]           [email protected]
[email protected]         [email protected]

Note: the same original value ([email protected]) appears twice and maps to the same hash both times.

Statistical comparison shows distributions are preserved:

Column: annual_income
────────────────────
                Original    Obfuscated
Mean            $72,450     $72,512
Std Dev         $28,340     $28,295
Min             $32,000     $31,847
Max             $195,000    $196,234
Median          $68,500     $68,621
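
A report like this takes only a few lines with pandas; a sketch, assuming the original and obfuscated dataframes from the pipeline:

import pandas as pd

def compare_column(original, obfuscated, column):
    """Side-by-side summary statistics for one numeric column."""
    return pd.DataFrame({
        'Original': original[column].describe(),
        'Obfuscated': obfuscated[column].describe(),
    }).loc[['mean', 'std', 'min', 'max', '50%']]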

Mapping Export

For some use cases, you need to reverse the obfuscation—or at least understand what was done. Incog exports transformation mappings:

{
  "dataset_id": "ds_abc123",
  "created_at": "2025-01-15T10:30:00Z",
  "salt": "stored_separately_in_vault",
  "transformations": [
    {
      "column": "email",
      "original_name": "customer_email",
      "method": "consistent_hash",
      "format": "email"
    },
    {
      "column": "salary",
      "original_name": "annual_compensation",
      "method": "statistical_preserve",
      "original_stats": {"mean": 72450, "std": 28340}
    },
    {
      "column": "col_d4",
      "original_name": "churn_risk_score",
      "method": "range_scale",
      "original_range": [0.0, 1.0],
      "scaled_range": [0, 100]
    }
  ]
}

Mappings are encrypted and access-controlled separately from the obfuscated data.
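
To make that concrete, here is a minimal AES-256-GCM sketch using the cryptography package; it illustrates the storage rule rather than Incog's exact code, and assumes the 32-byte key itself lives in Vault:

import json
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_mapping(mapping: dict, key: bytes) -> bytes:
    """Encrypt a transformation mapping with AES-256-GCM (key must be 32 bytes)."""
    nonce = os.urandom(12)  # unique per encryption; stored with the ciphertext
    ciphertext = AESGCM(key).encrypt(nonce, json.dumps(mapping).encode(), None)
    return nonce + ciphertext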

Technical Implementation

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Upload UI     │────▶│   API Server     │────▶│  Detection      │
│                 │     │   (FastAPI)      │     │  Engine         │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                │                        │
                                ▼                        ▼
                        ┌──────────────────┐     ┌─────────────────┐
                        │   Config UI      │────▶│  Obfuscation    │
                        │   (Rules Setup)  │     │  Engine         │
                        └──────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                                                 ┌─────────────────┐
                                                 │  Export +       │
                                                 │  Validation     │
                                                 └─────────────────┘

Tech Stack

  • Backend: FastAPI + Python
  • Processing: Pandas, NumPy for transformations; spaCy for NER
  • Storage: PostgreSQL for metadata, S3-compatible for datasets
  • Security: HashiCorp Vault for salt management, AES-256 for mapping encryption
  • Frontend: React dashboard with before/after visualization

Core Processing Pipeline

import hashlib

import numpy as np
import pandas as pd

class ObfuscationPipeline:
    def __init__(self, config, salt):
        self.config = config
        self.salt = salt
        self.mappings = {}
    
    def process(self, df):
        """Run full obfuscation pipeline"""
        result = df.copy()
        
        for column, rules in self.config['columns'].items():
            if column not in result.columns:
                continue
            
            method = rules['method']
            
            if method == 'consistent_hash':
                result[column] = result[column].apply(
                    lambda x: consistent_hash(x, self.salt, rules.get('format', 'hex'))
                )
            
            elif method == 'statistical_preserve':
                # Derive a stable integer seed; Python's built-in hash() is
                # salted per process and would break reproducibility
                digest = hashlib.sha256(f"{self.salt}:{column}".encode()).digest()
                result[column] = preserve_statistics(
                    result[column], 
                    seed=int.from_bytes(digest[:4], 'big')
                )
            
            elif method == 'range_scale':
                original_min, original_max = result[column].min(), result[column].max()
                target_min, target_max = rules['target_range']
                span = original_max - original_min
                if span == 0:
                    # Constant column: collapse to the target minimum
                    result[column] = target_min
                else:
                    result[column] = (
                        (result[column] - original_min) / span
                        * (target_max - target_min) + target_min
                    )
            
            elif method == 'bucket':
                result[column] = pd.cut(
                    result[column], 
                    bins=rules['bins'],
                    labels=rules.get('labels')
                )
            
            elif method == 'drop':
                result = result.drop(columns=[column])
            
            # Track mapping
            self.mappings[column] = {
                'method': method,
                'config': rules
            }
        
        # Optionally obfuscate column names
        if self.config.get('obfuscate_columns'):
            result, column_mapping = obfuscate_column_names(result)
            self.mappings['_columns'] = column_mapping
        
        return result
    
    def validate(self, original, obfuscated):
        """Validate obfuscation preserved required properties"""
        report = {}
        
        for column in original.columns:
            if column not in obfuscated.columns:
                continue
            
            report[column] = {
                'original_stats': {
                    'mean': original[column].mean() if pd.api.types.is_numeric_dtype(original[column]) else None,
                    'std': original[column].std() if pd.api.types.is_numeric_dtype(original[column]) else None,
                    'unique': original[column].nunique(),
                    'nulls': original[column].isna().sum()
                },
                'obfuscated_stats': {
                    'mean': obfuscated[column].mean() if pd.api.types.is_numeric_dtype(obfuscated[column]) else None,
                    'std': obfuscated[column].std() if pd.api.types.is_numeric_dtype(obfuscated[column]) else None,
                    'unique': obfuscated[column].nunique(),
                    'nulls': obfuscated[column].isna().sum()
                }
            }
            
            # Check consistency
            if self.config['columns'].get(column, {}).get('method') == 'consistent_hash':
                # Same inputs must collapse to the same outputs: the group-size
                # multisets match even though the keys sort differently
                original_sizes = np.sort(original.groupby(column).size().to_numpy())
                obfuscated_sizes = np.sort(obfuscated.groupby(column).size().to_numpy())
                report[column]['consistency_check'] = bool(
                    np.array_equal(original_sizes, obfuscated_sizes)
                )
        
        return report
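
Putting the pipeline together, with a config whose shape is inferred from the dispatch above:

config = {
    'columns': {
        'customer_email': {'method': 'consistent_hash', 'format': 'email'},
        'annual_compensation': {'method': 'statistical_preserve'},
        'churn_risk_score': {'method': 'range_scale', 'target_range': [0, 100]},
    }
}

pipeline = ObfuscationPipeline(config, salt='salt-from-vault')
obfuscated_df = pipeline.process(df)          # df is the original dataframe
report = pipeline.validate(df, obfuscated_df)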

API Endpoints

@app.post("/api/datasets/upload")
async def upload_dataset(file: UploadFile):
    """Upload dataset for obfuscation"""
    dataset_id = generate_id()
    
    # Parse file
    if file.filename.endswith('.csv'):
        df = pd.read_csv(file.file)
    elif file.filename.endswith('.parquet'):
        df = pd.read_parquet(file.file)
    else:
        raise HTTPException(status_code=400, detail="Unsupported file type")
    
    # Run PII detection
    detections = detect_pii(df)
    
    # Store for configuration
    await storage.save_temp(dataset_id, df)
    
    return {
        "dataset_id": dataset_id,
        "columns": list(df.columns),
        "rows": len(df),
        "detections": detections
    }

@app.post("/api/datasets/{dataset_id}/configure")
async def configure_obfuscation(dataset_id: str, config: ObfuscationConfig):
    """Set obfuscation rules for dataset"""
    await db.configs.update_one(
        {"dataset_id": dataset_id},
        {"$set": {"config": config.dict()}},
        upsert=True
    )
    return {"status": "configured"}

@app.post("/api/datasets/{dataset_id}/obfuscate")
async def run_obfuscation(dataset_id: str):
    """Execute obfuscation pipeline"""
    df = await storage.load_temp(dataset_id)
    config = await db.configs.find_one({"dataset_id": dataset_id})
    
    # Get or generate salt
    salt = await vault.get_or_create_salt(dataset_id)
    
    # Run pipeline
    pipeline = ObfuscationPipeline(config['config'], salt)
    obfuscated = pipeline.process(df)
    validation = pipeline.validate(df, obfuscated)
    
    # Store results
    await storage.save_obfuscated(dataset_id, obfuscated)
    await db.mappings.insert_one({
        "dataset_id": dataset_id,
        "mappings": pipeline.mappings,
        "validation": validation,
        "created_at": datetime.utcnow()
    })
    
    return {
        "status": "complete",
        "validation": validation
    }

@app.get("/api/datasets/{dataset_id}/export")
async def export_obfuscated(dataset_id: str, format: str = "csv"):
    """Export obfuscated dataset"""
    df = await storage.load_obfuscated(dataset_id)
    
    if format == "csv":
        return StreamingResponse(
            iter([df.to_csv(index=False)]),
            media_type="text/csv",
            headers={"Content-Disposition": f"attachment; filename={dataset_id}_obfuscated.csv"}
        )
    elif format == "parquet":
        buffer = io.BytesIO()
        df.to_parquet(buffer)
        buffer.seek(0)
        return StreamingResponse(
            buffer,
            media_type="application/octet-stream",
            headers={"Content-Disposition": f"attachment; filename={dataset_id}_obfuscated.parquet"}
        )
    raise HTTPException(status_code=400, detail="Unsupported export format")

Security Considerations

Salt management

  • Salts stored in HashiCorp Vault, not in the application database (see the sketch below)
  • Per-dataset salts enable different obfuscated versions for different recipients
  • Salt rotation for periodic re-obfuscation
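
One plausible shape for the vault.get_or_create_salt helper used in the API code, sketched with the hvac client (the mount path, env vars, and error handling are assumptions, not Incog's exact implementation):

import os

import hvac

def get_or_create_salt(dataset_id: str) -> str:
    """Fetch the per-dataset salt from Vault, creating one on first use."""
    client = hvac.Client(url=os.environ['VAULT_ADDR'], token=os.environ['VAULT_TOKEN'])
    path = f"incog/salts/{dataset_id}"
    try:
        secret = client.secrets.kv.v2.read_secret_version(path=path)
        return secret['data']['data']['salt']
    except hvac.exceptions.InvalidPath:
        salt = os.urandom(32).hex()
        client.secrets.kv.v2.create_or_update_secret(path=path, secret={'salt': salt})
        return salt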

Mapping protection

  • Mappings encrypted with AES-256 before storage
  • Access logged and audited
  • Optional: mappings stored separately from obfuscated data

Data handling

  • Original data deleted after obfuscation (configurable)
  • All processing in memory, no temp files
  • TLS for all data in transit

The Results

We deployed Incog for internal use and two client engagements.

Time Savings

  • Data sharing approval time dropped from 3 weeks to 2 hours — Automated obfuscation replaced manual review cycles
  • Data engineering time per sharing request reduced 80% — No more one-off scripts
  • Vendor onboarding accelerated by 70% — Partners got data access in days, not months

Data Utility Preserved

  • Analytical queries produced identical results on 94% of test cases (obfuscated vs. original)
  • ML models trained on obfuscated data performed within 3% of models trained on original data
  • Join operations worked correctly across all obfuscated tables thanks to consistent hashing

Compliance Benefits

  • Zero PII in shared datasets — Automated detection caught fields manual review missed
  • Audit trail for all transformations — Every obfuscation logged with method and config
  • GDPR data minimization satisfied — Only necessary fields shared, sensitive fields transformed

Consistency Wins

  • Cross-table joins maintained — Same customer appears with same hash across all exports
  • Time-series analysis preserved — Temporal patterns intact despite value obfuscation
  • Reproducible transformations — Same input + same salt = same output, every time

Key Takeaways

Detection Must Be Multi-Method

Pattern matching alone misses PII. Column name heuristics catch common cases but miss custom schemas. NER handles names but not all identifier types. Statistical analysis spots high-cardinality fields. You need all of them.

Consistency is Non-Negotiable

If obfuscation isn’t deterministic, the data becomes useless for any analysis requiring joins or aggregations. Consistent hashing is the foundation everything else builds on.

Statistical Preservation Requires Care

Naive obfuscation destroys distributions. Preserving mean and standard deviation isn’t enough—you need to maintain correlations, preserve ordering where appropriate, and validate that analytical queries still work.

The Mapping is as Sensitive as the Data

Obfuscated data with the mapping file is equivalent to original data. Mapping storage, access control, and encryption require the same rigor as PII handling.

What’s Next

Incog and Fabri form complementary halves of a data enablement strategy:

  • Fabri: Generate synthetic data when you need test/development datasets from scratch
  • Incog: Obfuscate real data when you need to share production data safely

We’re exploring:

  • Differential privacy integration for mathematical privacy guarantees
  • Streaming obfuscation for real-time data pipelines
  • Cross-dataset consistency ensuring the same entity hashes identically across separate uploads
  • Automated compliance reporting generating documentation for auditors

The goal: make data sharing as easy as data storage.


Need to share sensitive data safely? Get in touch to discuss how Incog can unlock collaboration without compromising privacy.