Privacy-Preserving Synthesis: From Fingerprint to Dataset
VynFi's fingerprint-to-synthesis pipeline extracts differentially private statistical summaries from real data, then generates synthetic datasets that match those summaries without ever seeing the original records. This post walks through the full pipeline.
The fundamental tension in synthetic data is fidelity versus privacy. High-fidelity synthetic data that closely mirrors real distributions is more useful for analytics and model training, but it is also more likely to leak information about individuals in the source dataset. Low-fidelity data is safe but of little practical use. VynFi addresses this tension by separating the privacy boundary from the generation process: a fingerprint extraction step runs on-premises against real data and produces a differentially private statistical summary. That summary, not the raw data, is sent to VynFi's generation engine, which synthesizes a full dataset matching the summary's distributional properties.
What Is a Fingerprint?
A fingerprint is a structured statistical summary of a dataset. It captures marginal distributions (histograms, quantiles, moments), pairwise correlations (Pearson, Spearman, mutual information), conditional distributions (for example, P(amount | category)), temporal patterns (autocorrelation, seasonality indices), and structural properties (null rates, cardinality, value ranges). Critically, every statistic in the fingerprint is computed with calibrated noise injection so that the fingerprint as a whole satisfies (epsilon, delta)-differential privacy. The noise is calibrated so that adding or removing any single record in the source dataset changes the probability of producing any given fingerprint by at most a factor of e^epsilon, apart from a small failure probability delta.
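To make "calibrated noise injection" concrete, here is a minimal sketch for a single statistic: a histogram over a bounded column, protected with the standard Laplace mechanism. It is an illustration under stated assumptions, not VynFi's extractor. The function names are hypothetical, the whole epsilon is spent on one histogram, and the result is pure epsilon-DP (delta = 0). The bin edges come from the declared bounds rather than from the data, because data-dependent edges would themselves leak information.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, lower, upper, bins, epsilon, rng):
    """Return noisy bin counts for `values` clamped to [lower, upper]."""
    width = (upper - lower) / bins
    counts = [0] * bins
    for v in values:
        clamped = min(max(v, lower), upper)
        # The top edge belongs to the last bin.
        index = min(int((clamped - lower) / width), bins - 1)
        counts[index] += 1
    # Adding or removing one record changes exactly one count by 1,
    # so the L1 sensitivity is 1 and the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


rng = random.Random(0)  # seeded only to make the demo repeatable
amounts = [120.0, 80.5, 9500.0, 430.0, 15.0, 2200.0, 61.0, 780.0]
noisy_counts = dp_histogram(amounts, 0.0, 10_000.0, bins=4, epsilon=1.0, rng=rng)
print([round(c, 2) for c in noisy_counts])
```

Python's `random` module and naive floating-point Laplace sampling are fine for showing the idea but are not suitable for a real privacy boundary; a production extractor would rely on a vetted differential privacy library.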
Extracting a Fingerprint On-Premises
The fingerprint extractor runs as a standalone CLI tool or Docker container inside your network. No data leaves your environment — only the resulting fingerprint file (a JSON document typically 50-200 KB) is exported.
    # On-premises: extract a fingerprint from real data.
    # This runs locally; no data is sent to VynFi.
    import pandas as pd
    from vynfi.fingerprint import FingerprintExtractor

    extractor = FingerprintExtractor(
        epsilon=1.0,  # privacy budget (lower = more private, noisier)
        delta=1e-5,   # failure probability
        columns={
            "amount": {"type": "continuous", "bounds": (0, 10_000_000)},
            "account_code": {"type": "categorical", "max_cardinality": 500},
            "posting_date": {"type": "timestamp", "granularity": "day"},
            "entity_id": {"type": "categorical", "max_cardinality": 1000},
            "is_debit": {"type": "boolean"},
        },
    )

    # Read from your database, CSV, or DataFrame
    real_data = pd.read_parquet("/secure/data/journal_entries_2025.parquet")
    fingerprint = extractor.extract(real_data)
    fingerprint.save("journal_entries_fingerprint.json")

    print(f"Fingerprint size: {fingerprint.size_bytes / 1024:.1f} KB")
    print(f"Privacy guarantee: ({fingerprint.epsilon}, {fingerprint.delta})-DP")
    print(f"Columns: {len(fingerprint.columns)}")
    print(f"Correlation pairs: {len(fingerprint.correlations)}")

Generating from a Fingerprint
Once the fingerprint is extracted, you upload it to VynFi and use it as the basis for generation. The generation engine reads the marginal distributions, correlations, and structural properties from the fingerprint and produces synthetic data that matches those statistics — without ever seeing the underlying records.
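As a rough illustration of how a generator can work from summary statistics alone, the sketch below draws synthetic values from a noisy histogram without touching any source records. It is a simplified stand-in, not VynFi's generation engine: the function name and the example counts are made up, and it reproduces only a single marginal, with no correlations or constraints.

```python
import random


def sample_from_noisy_histogram(noisy_counts, lower, upper, n, rng):
    """Draw n synthetic values using only noisy bin counts."""
    # Noise can push counts below zero; clipping is post-processing,
    # which does not weaken a differential privacy guarantee.
    weights = [max(c, 0.0) for c in noisy_counts]
    if sum(weights) <= 0:
        raise ValueError("all bins were clipped to zero")
    width = (upper - lower) / len(weights)
    # Pick a bin in proportion to its weight, then a uniform point inside it.
    chosen = rng.choices(range(len(weights)), weights=weights, k=n)
    return [lower + (b + rng.random()) * width for b in chosen]


rng = random.Random(42)  # seeded only to make the demo repeatable
noisy_counts = [6.8, -0.4, 0.3, 1.2]  # made-up fingerprint histogram
synthetic_amounts = sample_from_noisy_histogram(
    noisy_counts, 0.0, 10_000.0, n=1000, rng=rng
)
print(len(synthetic_amounts), min(synthetic_amounts), max(synthetic_amounts))
```

Because the sampler only ever reads the noisy counts, anything it produces inherits the privacy guarantee of the fingerprint itself.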
    import pandas as pd
    import vynfi

    client = vynfi.VynFi()

    # Upload the fingerprint (the only artifact that leaves your network)
    fp = client.fingerprints.upload("journal_entries_fingerprint.json")
    print(f"Fingerprint ID: {fp.id}")

    # Generate synthetic data matching the fingerprint
    job = client.jobs.create(
        mode="generate",
        fingerprint_id=fp.id,
        rows=100_000,
        periods=4,
        constraints=["balanced_entries", "sequential_dates"],
    )
    result = client.jobs.wait(job.id)
    archive = client.jobs.download_archive(result.id)

    # Compare synthetic vs. fingerprint statistics
    synthetic = pd.read_parquet(archive.file("journal_entries.parquet"))

    # The quality report compares synthetic marginals against the fingerprint
    quality = archive.json("fingerprint_fidelity.json")
    print(f"Marginal KS distance (mean): {quality['mean_ks_distance']:.4f}")
    print(f"Correlation RMSE: {quality['correlation_rmse']:.4f}")
    print(f"Category JS divergence (mean): {quality['mean_js_divergence']:.4f}")

Privacy Guarantees and the Epsilon Budget
The epsilon parameter controls the privacy-utility trade-off. At epsilon=0.1, the fingerprint is very noisy — marginal distributions are approximate, rare categories may be suppressed, and correlations below the noise floor are dropped. The resulting synthetic data is highly private but may miss subtle distributional features. At epsilon=10.0, the fingerprint is nearly exact and the synthetic data closely mirrors the source, but the privacy guarantee is weaker. For most financial applications, epsilon values between 0.5 and 2.0 provide a practical balance.
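A back-of-the-envelope calculation shows why rare categories suffer first at low epsilon. The sketch below assumes a single count query protected by Laplace noise with sensitivity 1, so the noise scale is 1 / epsilon. A real fingerprint splits its budget across many statistics, so the per-statistic noise would be larger than shown here. The helper names and the rare-category count of 5 are illustrative.

```python
import math


def laplace_scale(epsilon, sensitivity=1.0):
    """Scale b of the Laplace noise for a query with the given sensitivity."""
    return sensitivity / epsilon


def prob_noise_at_least(threshold, epsilon):
    """P(|noise| >= threshold) for Laplace noise with scale 1 / epsilon."""
    return math.exp(-threshold * epsilon)


rare_count = 5  # hypothetical category with only five records
rows = []
for epsilon in (0.1, 0.5, 1.0, 2.0, 10.0):
    rows.append(
        (epsilon, laplace_scale(epsilon), prob_noise_at_least(rare_count, epsilon))
    )

for epsilon, scale, p in rows:
    # The mean absolute error of Laplace noise equals its scale.
    print(
        f"epsilon={epsilon:>4}: typical error ~{scale:.1f} rows, "
        f"P(noise >= {rare_count}) = {p:.3f}"
    )
```

At epsilon = 0.1 the typical error is about 10 rows, and the noise is at least as large as the five-record category with probability exp(-0.5), roughly 0.61. At epsilon = 1.0 that probability falls to exp(-5), below 0.01.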
Importantly, the privacy guarantee composes. If you extract multiple fingerprints from overlapping datasets, the privacy costs add up: under basic sequential composition, the total cost is bounded by the sum of the individual epsilons (and, likewise, the sum of the deltas). VynFi tracks the cumulative privacy budget per fingerprint source and warns when the total exceeds a configurable threshold. Enterprise customers can enforce hard budget limits that prevent further fingerprint extraction once the limit is reached.
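The bookkeeping behind budget tracking can be sketched as a small ledger. This is a hypothetical illustration, not VynFi's implementation: the class names, the "ok"/"warn" return values, and the choice to reject a charge that would exceed the hard limit are all assumptions. It uses basic composition, where epsilons and deltas simply add; tighter composition theorems exist but are out of scope here.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push a source past its hard limit."""


class PrivacyLedger:
    """Tracks cumulative (epsilon, delta) per data source, basic composition."""

    def __init__(self, warn_at, hard_limit=None):
        self.warn_at = warn_at
        self.hard_limit = hard_limit
        # source name -> [total epsilon, total delta]
        self.spent = defaultdict(lambda: [0.0, 0.0])

    def charge(self, source, epsilon, delta=0.0):
        """Record one extraction; return "warn" once past the soft threshold."""
        eps_total = self.spent[source][0] + epsilon
        if self.hard_limit is not None and eps_total > self.hard_limit:
            # Reject before recording, so the ledger stays unchanged.
            raise BudgetExceeded(f"{source}: {eps_total} > {self.hard_limit}")
        self.spent[source][0] = eps_total
        self.spent[source][1] += delta
        return "warn" if eps_total > self.warn_at else "ok"


ledger = PrivacyLedger(warn_at=1.5, hard_limit=2.0)
statuses = [
    ledger.charge("journal_entries", 0.5, 1e-5),
    ledger.charge("journal_entries", 0.5, 1e-5),
    ledger.charge("journal_entries", 1.0, 1e-5),
]
print(statuses, ledger.spent["journal_entries"])
```

In this sketch the third extraction crosses the soft threshold and returns "warn", and any further charge against the same source raises `BudgetExceeded` because it would exceed the hard limit of 2.0.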
Use Cases
- <strong>Cross-firm benchmarking</strong> — Multiple firms extract fingerprints from their data and upload them to a shared VynFi workspace. Synthetic data matching each fingerprint can be analyzed jointly without any firm exposing raw records.
- <strong>Vendor evaluation</strong> — Share a fingerprint-derived synthetic dataset with a prospective analytics vendor instead of real data. The vendor can evaluate their tools against realistic distributions without triggering data-sharing agreements or privacy reviews.
- <strong>Regulatory sandbox</strong> — Regulators receive fingerprint-matched synthetic data that preserves the statistical properties needed for supervisory analysis without exposing individual transactions or customer identities.
- <strong>Model development</strong> — Data science teams work with fingerprint-matched synthetic data during development, switching to real data only for final validation. This reduces the attack surface and simplifies access control.