How we built a data obfuscation platform that made production data shareable
How intelligent data obfuscation unlocked collaboration without compromising privacy
Companies sit on valuable production data they can’t share—with vendors, partners, or even internal teams. PII regulations, contractual obligations, and security policies lock data behind approval processes that take weeks.
A healthcare company can’t share patient data with an ML vendor. A bank can’t give transaction logs to a fraud detection partner. An e-commerce platform can’t export customer behavior to their analytics consultants.
The data exists. The need is clear. But privacy concerns create an impasse.
We built Incog to solve this: transform sensitive datasets into shareable assets while preserving analytical value.

The Problem
Data sharing fails for predictable reasons.
PII is everywhere. Names, emails, phone numbers, addresses—they appear in columns you expect and columns you don’t. Manual identification misses fields. Automated detection has false negatives.
Anonymization breaks utility. Naive approaches—replacing names with “User_1”, scrambling emails—destroy the patterns that make data valuable. If you can’t analyze the obfuscated data the same way you’d analyze the original, what’s the point?
Consistency matters. If [email protected] appears in three tables, it needs to map to the same obfuscated value in all three. Otherwise, joins break and cross-table analysis becomes impossible.
The process is manual. Data engineers write one-off scripts for each sharing request. Scripts are error-prone, undocumented, and non-reproducible. Six months later, no one remembers how a dataset was transformed.
Companies need a systematic approach: detect sensitive data automatically, transform it consistently, preserve statistical properties, and document everything.
Our Approach
Incog evolved from Fabri, our synthetic data platform. We realized that many teams didn’t need fully synthetic data—they needed their actual data, just made safe to share.
Automatic PII Detection
The first step is finding sensitive data. Incog scans uploaded datasets and flags potential PII using multiple detection methods:
Pattern matching
- Email addresses (regex)
- Phone numbers (country-specific formats)
- Social Security Numbers, National IDs
- Credit card numbers (Luhn validation)
- IP addresses, MAC addresses
Named Entity Recognition
- Person names (using NLP models)
- Organization names
- Geographic locations
- Dates that might indicate birthdays
Column name heuristics
- Common PII column names (email, phone, ssn, dob, address)
- Variations and abbreviations
- Multi-language support
Statistical analysis
- High cardinality string columns (likely identifiers)
- Columns with unique values matching row count
- Date columns with age-appropriate ranges
```python
import re

import pandas as pd

EMAIL_REGEX = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_REGEX = re.compile(r"\+?[\d\s().-]{7,15}")

# Abbreviated keyword map; the full table covers more PII types,
# spelling variations, and languages
PII_KEYWORDS = {
    'email': ('email', 'e_mail', 'mail'),
    'phone': ('phone', 'mobile', 'tel'),
    'ssn': ('ssn', 'social_security', 'national_id'),
    'dob': ('dob', 'birth'),
    'address': ('address', 'street', 'zip', 'postcode'),
}

def detect_pii(df):
    """Scan a dataframe for potential PII columns."""
    detections = []
    for column in df.columns:
        column_lower = column.lower()
        sample = df[column].dropna().head(1000)
        if sample.empty:
            continue

        # Pattern-based detection on string columns
        if sample.dtype == 'object':
            values = sample.astype(str)
            email_match = values.str.fullmatch(EMAIL_REGEX).mean()
            if email_match > 0.8:
                detections.append({
                    'column': column,
                    'type': 'email',
                    'confidence': email_match,
                    'method': 'pattern',
                })
                continue
            phone_match = values.str.fullmatch(PHONE_REGEX).mean()
            if phone_match > 0.7:
                detections.append({
                    'column': column,
                    'type': 'phone',
                    'confidence': phone_match,
                    'method': 'pattern',
                })
                continue

        # Column name heuristics
        for pii_type, keywords in PII_KEYWORDS.items():
            if any(kw in column_lower for kw in keywords):
                detections.append({
                    'column': column,
                    'type': pii_type,
                    'confidence': 0.9,
                    'method': 'column_name',
                })
                break

        # High cardinality: near-unique string columns are likely identifiers
        if sample.dtype == 'object':
            cardinality = sample.nunique() / len(sample)
            if cardinality > 0.95:
                detections.append({
                    'column': column,
                    'type': 'potential_identifier',
                    'confidence': cardinality,
                    'method': 'cardinality',
                })
    return detections
```
Detection results are presented in a review interface where users can:
- Confirm or dismiss each detection
- Manually flag columns the system missed
- Assign obfuscation rules per column
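Confirmed rules become a per-dataset config that drives the pipeline shown later. A minimal sketch (column names and rule values are illustrative):

```python
# Hypothetical output of the review step; keys follow the pipeline's config schema
config = {
    'obfuscate_columns': False,   # set True to also rename headers (see below)
    'columns': {
        'customer_email': {'method': 'consistent_hash', 'format': 'email'},
        'phone':          {'method': 'consistent_hash', 'format': 'phone'},
        'annual_income':  {'method': 'statistical_preserve'},
        'age':            {'method': 'bucket', 'bins': [18, 25, 35, 45, 55, 65, 100]},
        'internal_note':  {'method': 'drop'},
    },
}
```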
Consistent Hashing
The core innovation: deterministic obfuscation. The same input always produces the same output.
```python
import hashlib

import pandas as pd

def consistent_hash(value, salt, output_format='hex'):
    """Generate a deterministic hash for a value; nulls pass through."""
    if pd.isna(value):
        return value
    # Combine the value with a salt so hashes can't be precomputed without it
    salted = f"{salt}:{str(value)}"
    hash_bytes = hashlib.sha256(salted.encode()).digest()
    if output_format == 'email':
        # Preserve email shape: local@domain.example.com
        local = hash_bytes[:6].hex()
        domain = hash_bytes[6:9].hex()
        return f"{local}@{domain}.example.com"
    if output_format == 'phone':
        # Preserve phone shape: +1-NNN-NNN-NNNN
        digits = ''.join(str(b % 10) for b in hash_bytes[:10])
        return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:10]}"
    # Default: short hex token
    return hash_bytes[:8].hex()
```
This means:
- The same email always becomes the same hashed address (e.g., `jane@example.com` always maps to something like `5f2a8c1b9d3e@4a7c2f.example.com`)
- The same email in different tables maps to the same hash
- Joins and aggregations work correctly on obfuscated data
- Original values cannot feasibly be recovered from the hashes without access to the salt
The salt is stored securely and can be rotated for different sharing contexts—the same source data can produce different obfuscated versions for different recipients.
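For example (salt strings illustrative), rotating the salt gives each recipient an unlinkable view while staying deterministic within a view:

```python
email = "user@example.com"

# Different salts produce unlinkable outputs for different recipients
vendor_view = consistent_hash(email, salt="salt-for-vendor-a", output_format='email')
partner_view = consistent_hash(email, salt="salt-for-partner-b", output_format='email')
assert vendor_view != partner_view

# Within one salt, the mapping stays deterministic
assert vendor_view == consistent_hash(email, salt="salt-for-vendor-a", output_format='email')
```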
Numerical Scaling
PII isn’t just strings. Salaries, ages, account balances—numerical fields can be identifying too.
Incog provides several transformation options:
**Range scaling**: transform values to a different range while preserving relative ordering.

```python
# Original: salaries from $35,000 to $250,000; scaled: values from 0 to 100
scaled = (original - original.min()) / (original.max() - original.min()) * 100
```

**Noise injection**: add random noise while preserving distribution shape.

```python
# Add Gaussian noise (5% of standard deviation)
noisy = original + np.random.normal(0, original.std() * 0.05, len(original))
```

**Bucketing**: replace exact values with ranges.

```python
# Ages: 18-25, 26-35, 36-45, 46-55, 56-65, 65+
bucketed = pd.cut(ages, bins=[18, 25, 35, 45, 55, 65, 100])
```

**Statistical preservation**: maintain mean, standard deviation, and distribution shape while obfuscating individual values.
```python
import numpy as np
import pandas as pd

def preserve_statistics(values, seed):
    """Obfuscate numeric values while preserving mean, spread, and ordering."""
    rng = np.random.default_rng(seed)
    # Fit the two moments we want to keep
    mean, std = values.mean(), values.std()
    # Generate replacement values with the same statistics, sorted once
    replacement = np.sort(rng.normal(mean, std, values.notna().sum()))
    # Rank each original value (ties broken by position) and assign the
    # replacement holding the same rank, so ordering relationships survive
    ranks = values.rank(method='first') - 1  # nulls keep null ranks
    result = values.astype(float)
    mask = ranks.notna()
    result[mask] = replacement[ranks[mask].astype(int).to_numpy()]
    return result
```
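A quick sanity check (synthetic input, illustrative thresholds) confirms the statistics survive:

```python
incomes = pd.Series(np.random.default_rng(0).lognormal(11, 0.4, 10_000))
obfuscated = preserve_statistics(incomes, seed=42)

# Mean and spread agree within sampling error; ordering is preserved exactly
assert abs(obfuscated.mean() - incomes.mean()) < 0.02 * incomes.mean()
assert incomes.corr(obfuscated, method='spearman') > 0.999
```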
Feature Name Obfuscation
Sometimes column names themselves are sensitive—they reveal schema design, business logic, or proprietary categorizations.
Incog can obfuscate column headers while maintaining a mapping file:
```
Original          → Obfuscated
──────────────────────────────
customer_id       → col_a1
email             → col_b2
annual_income     → col_c3
churn_risk_score  → col_d4
premium_tier      → col_e5
```
The mapping file stays internal; the obfuscated dataset can be shared freely.
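The pipeline later calls an `obfuscate_column_names` helper that isn't shown in full in this post; a minimal sketch matching the `col_a1`, `col_b2` scheme above:

```python
import string

def obfuscate_column_names(df):
    """Rename headers to neutral labels; return the frame plus the mapping."""
    renamed = {
        column: f"col_{string.ascii_lowercase[i % 26]}{i + 1}"
        for i, column in enumerate(df.columns)
    }
    # The mapping (original -> obfuscated) stays internal, like the table above
    return df.rename(columns=renamed), renamed
```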
Before/After Comparison
Users need confidence that obfuscation worked correctly. Incog provides side-by-side comparison:
| Original | Obfuscated |
|---|---|
| jane@example.com | 5f2a8c1b9d3e@4a7c2f.example.com |
| mark@example.com | 8c4b0e7a1f6d@2d9b3a.example.com |
| jane@example.com | 5f2a8c1b9d3e@4a7c2f.example.com |

Note: the same original value (jane@example.com, rows 1 and 3) maps to the same hash consistently. Addresses here are illustrative.
Statistical comparison shows distributions are preserved:
```
Column: annual_income
───────────────────────────────────
            Original     Obfuscated
Mean        $72,450      $72,512
Std Dev     $28,340      $28,295
Min         $32,000      $31,847
Max         $195,000     $196,234
Median      $68,500      $68,621
```
Mapping Export
For some use cases, you need to reverse the obfuscation—or at least understand what was done. Incog exports transformation mappings:
```json
{
  "dataset_id": "ds_abc123",
  "created_at": "2025-01-15T10:30:00Z",
  "salt": "stored_separately_in_vault",
  "transformations": [
    {
      "column": "email",
      "original_name": "customer_email",
      "method": "consistent_hash",
      "format": "email"
    },
    {
      "column": "salary",
      "original_name": "annual_compensation",
      "method": "statistical_preserve",
      "original_stats": {"mean": 72450, "std": 28340}
    },
    {
      "column": "col_d4",
      "original_name": "churn_risk_score",
      "method": "range_scale",
      "original_range": [0.0, 1.0],
      "scaled_range": [0, 100]
    }
  ]
}
```
Mappings are encrypted and access-controlled separately from the obfuscated data.
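As an illustration of the at-rest protection, a sketch using the `cryptography` package's AES-256-GCM primitive (key handling is simplified here; in practice the key would itself live in Vault, like the salts):

```python
import json
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_mapping(mapping: dict, key: bytes) -> bytes:
    """Encrypt a transformation mapping with AES-256-GCM (32-byte key)."""
    nonce = os.urandom(12)              # unique nonce per encryption
    plaintext = json.dumps(mapping).encode()
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ciphertext           # store nonce alongside the ciphertext

def decrypt_mapping(blob: bytes, key: bytes) -> dict:
    """Reverse of encrypt_mapping; raises if the blob was tampered with."""
    nonce, ciphertext = blob[:12], blob[12:]
    return json.loads(AESGCM(key).decrypt(nonce, ciphertext, None))
```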
Technical Implementation
Architecture
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Upload UI  │────▶│  API Server  │────▶│  Detection  │
│             │     │  (FastAPI)   │     │   Engine    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │                    │
                           ▼                    ▼
                    ┌──────────────┐     ┌─────────────┐
                    │  Config UI   │────▶│ Obfuscation │
                    │ (Rules Setup)│     │   Engine    │
                    └──────────────┘     └─────────────┘
                                                │
                                                ▼
                                        ┌─────────────┐
                                        │   Export +  │
                                        │  Validation │
                                        └─────────────┘
```
Tech Stack
- Backend: FastAPI + Python
- Processing: Pandas, NumPy for transformations; spaCy for NER
- Storage: PostgreSQL for metadata, S3-compatible for datasets
- Security: HashiCorp Vault for salt management, AES-256 for mapping encryption
- Frontend: React dashboard with before/after visualization
Core Processing Pipeline
```python
import hashlib

import numpy as np
import pandas as pd

def stable_seed(salt, column):
    """Derive a reproducible per-column seed (built-in hash() varies between runs)."""
    digest = hashlib.sha256(f"{salt}:{column}".encode()).digest()
    return int.from_bytes(digest[:4], 'big')

class ObfuscationPipeline:
    def __init__(self, config, salt):
        self.config = config
        self.salt = salt
        self.mappings = {}

    def process(self, df):
        """Run the full obfuscation pipeline."""
        result = df.copy()
        for column, rules in self.config['columns'].items():
            if column not in result.columns:
                continue
            method = rules['method']
            if method == 'consistent_hash':
                result[column] = result[column].apply(
                    lambda x: consistent_hash(x, self.salt, rules.get('format', 'hex'))
                )
            elif method == 'statistical_preserve':
                result[column] = preserve_statistics(
                    result[column],
                    seed=stable_seed(self.salt, column)
                )
            elif method == 'range_scale':
                original_min, original_max = result[column].min(), result[column].max()
                target_min, target_max = rules['target_range']
                result[column] = (
                    (result[column] - original_min) / (original_max - original_min)
                    * (target_max - target_min) + target_min
                )
            elif method == 'bucket':
                result[column] = pd.cut(
                    result[column],
                    bins=rules['bins'],
                    labels=rules.get('labels')
                )
            elif method == 'drop':
                result = result.drop(columns=[column])
            # Track what was done for the mapping export
            self.mappings[column] = {
                'method': method,
                'config': rules
            }
        # Optionally obfuscate column names as well
        if self.config.get('obfuscate_columns'):
            result, column_mapping = obfuscate_column_names(result)
            self.mappings['_columns'] = column_mapping
        return result

    def validate(self, original, obfuscated):
        """Validate that obfuscation preserved the required properties."""
        report = {}
        for column in original.columns:
            if column not in obfuscated.columns:
                continue
            numeric_orig = pd.api.types.is_numeric_dtype(original[column])
            numeric_obf = pd.api.types.is_numeric_dtype(obfuscated[column])
            report[column] = {
                'original_stats': {
                    'mean': original[column].mean() if numeric_orig else None,
                    'std': original[column].std() if numeric_orig else None,
                    'unique': original[column].nunique(),
                    'nulls': int(original[column].isna().sum())
                },
                'obfuscated_stats': {
                    'mean': obfuscated[column].mean() if numeric_obf else None,
                    'std': obfuscated[column].std() if numeric_obf else None,
                    'unique': obfuscated[column].nunique(),
                    'nulls': int(obfuscated[column].isna().sum())
                }
            }
            # Consistency check: hashing must preserve the value-frequency profile
            # (group sizes compared in key order would fail, since hashed keys sort differently)
            if self.config['columns'].get(column, {}).get('method') == 'consistent_hash':
                report[column]['consistency_check'] = np.array_equal(
                    np.sort(original[column].value_counts().to_numpy()),
                    np.sort(obfuscated[column].value_counts().to_numpy())
                )
        return report
```
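Putting it together with the config sketch from earlier (salt value illustrative):

```python
pipeline = ObfuscationPipeline(config, salt="per-dataset-salt-from-vault")
safe_df = pipeline.process(df)
report = pipeline.validate(df, safe_df)
assert report['customer_email']['consistency_check']
```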
API Endpoints
@app.post("/api/datasets/upload")
async def upload_dataset(file: UploadFile):
"""Upload dataset for obfuscation"""
dataset_id = generate_id()
# Parse file
if file.filename.endswith('.csv'):
df = pd.read_csv(file.file)
elif file.filename.endswith('.parquet'):
df = pd.read_parquet(file.file)
# Run PII detection
detections = detect_pii(df)
# Store for configuration
await storage.save_temp(dataset_id, df)
return {
"dataset_id": dataset_id,
"columns": list(df.columns),
"rows": len(df),
"detections": detections
}
@app.post("/api/datasets/{dataset_id}/configure")
async def configure_obfuscation(dataset_id: str, config: ObfuscationConfig):
"""Set obfuscation rules for dataset"""
await db.configs.update_one(
{"dataset_id": dataset_id},
{"$set": {"config": config.dict()}},
upsert=True
)
return {"status": "configured"}
@app.post("/api/datasets/{dataset_id}/obfuscate")
async def run_obfuscation(dataset_id: str):
"""Execute obfuscation pipeline"""
df = await storage.load_temp(dataset_id)
config = await db.configs.find_one({"dataset_id": dataset_id})
# Get or generate salt
salt = await vault.get_or_create_salt(dataset_id)
# Run pipeline
pipeline = ObfuscationPipeline(config['config'], salt)
obfuscated = pipeline.process(df)
validation = pipeline.validate(df, obfuscated)
# Store results
await storage.save_obfuscated(dataset_id, obfuscated)
await db.mappings.insert_one({
"dataset_id": dataset_id,
"mappings": pipeline.mappings,
"validation": validation,
"created_at": datetime.utcnow()
})
return {
"status": "complete",
"validation": validation
}
@app.get("/api/datasets/{dataset_id}/export")
async def export_obfuscated(dataset_id: str, format: str = "csv"):
"""Export obfuscated dataset"""
df = await storage.load_obfuscated(dataset_id)
if format == "csv":
return StreamingResponse(
iter([df.to_csv(index=False)]),
media_type="text/csv",
headers={"Content-Disposition": f"attachment; filename={dataset_id}_obfuscated.csv"}
)
elif format == "parquet":
buffer = io.BytesIO()
df.to_parquet(buffer)
buffer.seek(0)
return StreamingResponse(
buffer,
media_type="application/octet-stream"
)
Security Considerations
Salt management
- Salts stored in HashiCorp Vault, not in application database
- Per-dataset salts enable different obfuscated versions for different recipients
- Salt rotation for periodic re-obfuscation
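The `vault.get_or_create_salt` helper used in the obfuscation endpoint isn't shown in the post; a minimal sketch with the `hvac` client (the KV mount and path are assumptions):

```python
import secrets

import hvac

client = hvac.Client(url="https://vault.internal:8200")  # token via env or agent

async def get_or_create_salt(dataset_id: str) -> str:
    """Fetch the per-dataset salt from Vault's KV store, creating it on first use."""
    path = f"incog/salts/{dataset_id}"  # hypothetical KV path
    try:
        secret = client.secrets.kv.v2.read_secret_version(path=path)
        return secret["data"]["data"]["salt"]
    except hvac.exceptions.InvalidPath:
        salt = secrets.token_hex(32)
        client.secrets.kv.v2.create_or_update_secret(path=path, secret={"salt": salt})
        return salt
```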
Mapping protection
- Mappings encrypted with AES-256 before storage
- Access logged and audited
- Optional: mappings stored separately from obfuscated data
Data handling
- Original data deleted after obfuscation (configurable)
- All processing in memory, no temp files
- TLS for all data in transit
The Results
We deployed Incog for internal use and two client engagements.
Time Savings
- Data sharing approval time dropped from 3 weeks to 2 hours — Automated obfuscation replaced manual review cycles
- Data engineering time per sharing request reduced 80% — No more one-off scripts
- Vendor onboarding accelerated by 70% — Partners got data access in days, not months
Data Utility Preserved
- Analytical queries produced identical results on 94% of test cases (obfuscated vs. original)
- ML models trained on obfuscated data performed within 3% of models trained on original data
- Join operations worked correctly across all obfuscated tables thanks to consistent hashing
Compliance Benefits
- Zero PII in shared datasets — Automated detection caught fields manual review missed
- Audit trail for all transformations — Every obfuscation logged with method and config
- GDPR data minimization satisfied — Only necessary fields shared, sensitive fields transformed
Consistency Wins
- Cross-table joins maintained — Same customer appears with same hash across all exports
- Time-series analysis preserved — Temporal patterns intact despite value obfuscation
- Reproducible transformations — Same input + same salt = same output, every time
Key Takeaways
Detection Must Be Multi-Method
Pattern matching alone misses PII. Column name heuristics catch common cases but miss custom schemas. NER handles names but not all identifier types. Statistical analysis spots high-cardinality fields. You need all of them.
Consistency is Non-Negotiable
If obfuscation isn’t deterministic, the data becomes useless for any analysis requiring joins or aggregations. Consistent hashing is the foundation everything else builds on.
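To make this concrete, a small sketch (toy data) showing that joins on hashed keys return exactly the rows they would on the raw keys:

```python
import pandas as pd

customers = pd.DataFrame({'email': ['a@x.com', 'b@y.com'], 'tier': ['gold', 'free']})
orders = pd.DataFrame({'email': ['a@x.com', 'a@x.com', 'b@y.com'], 'total': [30, 45, 12]})

# Hash the join key in both tables with the same salt
salt = "shared-dataset-salt"
for df in (customers, orders):
    df['email'] = df['email'].apply(lambda e: consistent_hash(e, salt, 'email'))

# The join still matches three rows, exactly as it would on the raw emails
assert len(orders.merge(customers, on='email')) == 3
```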
Statistical Preservation Requires Care
Naive obfuscation destroys distributions. Preserving mean and standard deviation isn’t enough—you need to maintain correlations, preserve ordering where appropriate, and validate that analytical queries still work.
The Mapping is as Sensitive as the Data
Obfuscated data with the mapping file is equivalent to original data. Mapping storage, access control, and encryption require the same rigor as PII handling.
What’s Next
Incog and Fabri form complementary halves of a data enablement strategy:
- Fabri: Generate synthetic data when you need test/development datasets from scratch
- Incog: Obfuscate real data when you need to share production data safely
We’re exploring:
- Differential privacy integration for mathematical privacy guarantees
- Streaming obfuscation for real-time data pipelines
- Cross-dataset consistency ensuring the same entity hashes identically across separate uploads
- Automated compliance reporting generating documentation for auditors
The goal: make data sharing as easy as data storage.
Need to share sensitive data safely? Get in touch to discuss how Incog can unlock collaboration without compromising privacy.