Neural Diffusion for Tabular Financial Data
VynFi 3.0 adds a score-based diffusion model for tabular financial data generation. This post covers the architecture — score networks, denoising score matching, classifier-free guidance — and shows how hybrid mode combines neural and rule-based generation.
Rule-based generation has carried VynFi through two major versions. The three-layer knowledge model — domain constraints, statistical distributions, and structural relationships — produces data that passes Benford analysis, maintains balanced entries, and respects document flow integrity. But rules have a ceiling. Some distributional patterns in real financial data arise from emergent behavior that is difficult to encode explicitly: the heavy-tailed correlation structure between transaction amounts and counterparty diversity, the non-stationary seasonal patterns in expense categorization, the subtle clustering in payment timing that reflects human workflow habits.
VynFi 3.0 adds a neural diffusion model that learns these patterns directly from tabular financial distributions. The model is a score-based generative model (sometimes called a denoising diffusion model) adapted for mixed-type tabular data — continuous amounts, categorical codes, timestamps, and boolean flags coexisting in the same row.
**DataSynth 3.1 update:** Hybrid diffusion tuning landed — `diffusion.neural.hybrid_strategy` accepts `weighted_average`, `column_select`, or `threshold`, and `neural_columns` lets you route specific columns through the neural backend while statistical generation handles the rest. Use `column_select` to apply the neural model only to heavy-tailed amount distributions (where it most outperforms rules) while keeping the deterministic statistical engine for structural columns like GL accounts and document IDs. See neural_diffusion.py for the calibration pattern.
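To make the `column_select` idea concrete, here is a minimal sketch of how two generated tables might be merged column by column. It is an illustration only: the `column_select` function, the row-as-dict layout, and the sample column names are assumptions for this post, not the product's internal implementation.

```python
def column_select(statistical_rows, neural_rows, neural_columns):
    """Take `neural_columns` from the neural output, everything else from the statistical output.

    Hypothetical helper: both inputs are lists of dicts with one dict per row.
    """
    if len(statistical_rows) != len(neural_rows):
        raise ValueError("both backends must produce the same number of rows")
    merged = []
    for stat_row, neural_row in zip(statistical_rows, neural_rows):
        row = dict(stat_row)  # start from the statistical engine's values
        for column in neural_columns:
            row[column] = neural_row[column]  # overwrite only the routed columns
        merged.append(row)
    return merged


# Illustrative rows; the column names are made up for this example.
statistical = [
    {"gl_account": "4000", "document_id": "D-1", "amount": 100.00},
    {"gl_account": "5000", "document_id": "D-2", "amount": 250.00},
]
neural = [
    {"gl_account": "9999", "document_id": "X-1", "amount": 97.13},
    {"gl_account": "9999", "document_id": "X-2", "amount": 1843.50},
]
merged = column_select(statistical, neural, neural_columns=["amount"])
```

In this sketch the structural columns (`gl_account`, `document_id`) always come from the statistical side, so the neural model's output for those columns is simply discarded.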
Score-Based Diffusion for Tabular Data
The core idea is denoising score matching. Starting from a clean data sample x, we progressively add Gaussian noise across T timesteps until the data distribution converges to pure noise. A neural network (the score network) is trained to estimate the gradient of the log probability density at each noise level: given a noisy sample, predict the direction that moves toward higher data density. At generation time, we start from pure noise and iteratively denoise using the learned score function.
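The training objective can be sketched on a one-dimensional toy problem. The code below is not VynFi's training loop: it uses a single noise level, scalar data, and closed-form stand-ins for the score network, all of which are simplifying assumptions. It shows the two ingredients named above: forward noising, and regression onto the score of the Gaussian perturbation kernel, weighted by sigma squared (a common choice).

```python
import random


def dsm_loss(score_fn, clean_samples, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss at one noise level."""
    total = 0.0
    for x in clean_samples:
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)  # forward noising step
        # Score of the perturbation kernel N(x_noisy; x, sigma^2) w.r.t. x_noisy.
        target = -(x_noisy - x) / sigma**2
        total += sigma**2 * (score_fn(x_noisy, sigma) - target) ** 2
    return total / len(clean_samples)


def exact_score(x_noisy, sigma):
    # For N(0, 1) data the noisy marginal is N(0, 1 + sigma^2), so its score is known.
    return -x_noisy / (1.0 + sigma**2)


def zero_score(x_noisy, sigma):
    # A deliberately uninformative "network" for comparison.
    return 0.0


data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]

# Same noise seed for both so the comparison is like for like.
loss_exact = dsm_loss(exact_score, data, 1.0, random.Random(1))
loss_zero = dsm_loss(zero_score, data, 1.0, random.Random(1))
```

For this toy setup the loss is 1 / (1 + sigma^2) in expectation for the exact score and 1 for the zero score, so the exact score should come out clearly lower. A real score network would replace `exact_score` and be trained by minimizing this loss across many noise levels.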
For tabular data, the standard image-domain architecture does not apply. VynFi's score network uses a transformer encoder with per-column type embeddings. Continuous columns pass through a learned normalization layer; categorical columns use embedding lookups; timestamps are encoded as (day-of-week, day-of-month, month, hour) cyclic features. The transformer processes all columns jointly, capturing cross-column dependencies that factored models miss.
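The cyclic timestamp features can be illustrated with a sine/cosine pair per component. This is a sketch under assumptions: the function names are hypothetical, and using a fixed period of 31 for day-of-month is a simplification that ignores varying month lengths.

```python
import math
from datetime import datetime


def cyclic(value, period):
    """Map a periodic integer value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))


def encode_timestamp(ts):
    """Encode a datetime as eight cyclic features (hypothetical layout)."""
    features = []
    features.extend(cyclic(ts.weekday(), 7))   # day-of-week, Monday = 0
    features.extend(cyclic(ts.day - 1, 31))    # day-of-month, shifted to start at 0
    features.extend(cyclic(ts.month - 1, 12))  # month, shifted to start at 0
    features.extend(cyclic(ts.hour, 24))       # hour of day
    return features


features = encode_timestamp(datetime(2024, 1, 1, 0, 0))  # a Monday at midnight
late = cyclic(23, 24)
midnight = cyclic(0, 24)
```

The point of the encoding is that values adjacent on the clock stay adjacent in feature space: `late` (23:00) and `midnight` (00:00) are close on the unit circle, which a raw integer hour would not capture.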
Classifier-Free Guidance
Raw diffusion sampling produces samples that match the training distribution on average, but may not satisfy domain constraints (e.g., debits equal credits within a journal entry). VynFi uses classifier-free guidance to steer generation toward constraint-satisfying regions. During training, the model is conditioned on domain labels (sector, table type, constraint set) with random dropout. At inference, the conditional and unconditional scores are combined with a guidance scale parameter that controls the trade-off between distributional fidelity and constraint satisfaction.
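The score combination and the training-time label dropout can be sketched in a few lines. The function names are hypothetical, and the parameterization is an assumption: here a scale of 1.0 recovers the plain conditional score, which matches how `guidance_scale=1.0` is described as minimal guidance in this post. Some papers use a shifted convention in which 0 means no guidance.

```python
import random


def guided_score(cond_score, uncond_score, guidance_scale):
    """Combine per-feature scores: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_score, uncond_score)]


def maybe_drop_label(label, drop_prob, rng):
    """Training-time dropout: replace the conditioning label with None at random."""
    return None if rng.random() < drop_prob else label


cond = [1.0, 2.0]    # toy conditional score for two features
uncond = [0.0, 1.0]  # toy unconditional score

plain = guided_score(cond, uncond, 1.0)     # equals the conditional score
stronger = guided_score(cond, uncond, 2.5)  # extrapolates past the conditional score

rng = random.Random(0)
labels = [maybe_drop_label("financial_statements", 0.1, rng) for _ in range(1000)]
```

Scales above 1.0 push samples further in the direction that distinguishes the conditional from the unconditional model, which is where the trade-off between distributional fidelity and constraint satisfaction comes from.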
```python
import vynfi

client = vynfi.VynFi()

# Pure diffusion mode: maximum distributional fidelity
job_pure = client.jobs.create(
    mode="diffusion",
    sector="financial_statements",
    rows=50_000,
    periods=4,
    guidance_scale=1.0,  # minimal guidance, closest to learned distribution
    hybrid=False,
)

# Hybrid mode: diffusion + rule-based structural constraints
job_hybrid = client.jobs.create(
    mode="diffusion",
    sector="financial_statements",
    rows=50_000,
    periods=4,
    guidance_scale=2.5,  # stronger guidance toward constraint satisfaction
    hybrid=True,  # post-process with rule engine for structural validity
    hybrid_constraints=[
        "balanced_entries",
        "benford_compliance",
        "sequential_dates",
        "document_reference_integrity",
    ],
)

result = client.jobs.wait(job_hybrid.id)
```

Hybrid Mode: Best of Both Approaches
Hybrid mode is the recommended default for most use cases. The diffusion model generates the initial data with realistic distributional properties — amount correlations, category co-occurrence patterns, temporal dynamics — and the rule engine applies a constraint-satisfaction pass that enforces structural validity. The result is data that has the statistical richness of a learned model with the structural guarantees of the rule-based engine.
The constraint-satisfaction pass is lightweight because the guided diffusion output is already close to satisfying constraints. In benchmarks, fewer than 3% of rows require adjustment in hybrid mode with `guidance_scale=2.5`, compared to 15–20% with `guidance_scale=1.0`.
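As an illustration of what a constraint-satisfaction pass for balanced entries could look like, the sketch below rebalances each journal entry by topping up one line and counts how many rows it touched. The repair strategy, the `balance_entries` name, and the row layout are all assumptions made for this example; the rule engine's actual logic is not described in this post.

```python
from collections import defaultdict
from decimal import Decimal


def balance_entries(lines):
    """Make debits equal credits per entry, in place; return the number of rows adjusted."""
    by_entry = defaultdict(list)
    for line in lines:
        by_entry[line["entry_id"]].append(line)

    adjusted = 0
    for entry_lines in by_entry.values():
        debits = sum(line["debit"] for line in entry_lines)
        credits = sum(line["credit"] for line in entry_lines)
        imbalance = debits - credits
        if imbalance == 0:
            continue  # already balanced, nothing to do
        if imbalance > 0:
            # Debits exceed credits: top up the largest credit line.
            target = max(entry_lines, key=lambda line: line["credit"])
            target["credit"] += imbalance
        else:
            # Credits exceed debits: top up the largest debit line.
            target = max(entry_lines, key=lambda line: line["debit"])
            target["debit"] += -imbalance
        adjusted += 1
    return adjusted


lines = [
    {"entry_id": "A", "debit": Decimal("50.00"), "credit": Decimal("0.00")},
    {"entry_id": "A", "debit": Decimal("0.00"), "credit": Decimal("50.00")},
    {"entry_id": "B", "debit": Decimal("100.00"), "credit": Decimal("0.00")},
    {"entry_id": "B", "debit": Decimal("0.00"), "credit": Decimal("99.97")},
]
rows_adjusted = balance_entries(lines)
```

Using `Decimal` avoids floating-point residue in the equality check. Because the sketch only ever increases an amount, it never produces a negative debit or credit.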
Evaluating Output Quality
```python
import pandas as pd

archive = client.jobs.download_archive(result.id)
df = pd.read_parquet(archive.file("journal_entries.parquet"))

# Quality metrics included in every diffusion job
metrics = archive.json("quality_metrics.json")

print(f"Rows generated: {len(df)}")
print(f"Benford MAD: {metrics['benford_mad']:.4f}")  # < 0.006 = close conformity
print(f"Column correlation RMSE: {metrics['corr_rmse']:.4f}")  # vs. reference distribution
print(f"Category Jensen-Shannon: {metrics['cat_js']:.4f}")  # lower = better match
print(f"Constraint violations: {metrics['constraint_violations']}")  # should be 0 in hybrid mode
print(f"Rows adjusted by rule engine: {metrics['hybrid_adjustments']}")
```

The diffusion model is pre-trained on VynFi's internal reference distributions (calibrated against 155 real-world financial datasets, as described in our methodology paper). Customers on Scale and Enterprise tiers can fine-tune the model on their own fingerprint data (the privacy-preserving statistical summaries described in a separate post in this series) to produce synthetic data that matches their specific distributional characteristics without ever exposing raw records.
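For readers who want to sanity-check the Benford figure from the quality metrics independently, a first-digit mean absolute deviation can be computed as sketched below. This assumes `benford_mad` refers to the first-digit MAD against the Benford distribution; the exact definition used in `quality_metrics.json` is not spelled out here, and the function names are hypothetical.

```python
import math
import random


def first_digit(amount):
    """Leading nonzero digit of a nonzero number, via scientific notation."""
    return int(f"{abs(amount):.15e}"[0])


def benford_mad(amounts):
    """Mean absolute deviation between observed and Benford first-digit frequencies."""
    counts = [0] * 9
    n = 0
    for amount in amounts:
        if amount == 0:
            continue  # zero has no leading digit
        counts[first_digit(amount) - 1] += 1
        n += 1
    if n == 0:
        raise ValueError("need at least one nonzero amount")
    expected = [math.log10(1.0 + 1.0 / d) for d in range(1, 10)]
    return sum(abs(c / n - e) for c, e in zip(counts, expected)) / 9


rng = random.Random(0)
# Log-uniform amounts over whole decades follow Benford's law.
benford_like = [10 ** rng.uniform(0.0, 5.0) for _ in range(50_000)]
# Uniform amounts in [1, 10) do not: every first digit is about equally likely.
uniform_like = [rng.uniform(1.0, 10.0) for _ in range(50_000)]

mad_benford = benford_mad(benford_like)
mad_uniform = benford_mad(uniform_like)
```

On these synthetic inputs the log-uniform sample should land well under the 0.006 threshold quoted in the snippet above, while the uniform sample should land far above it.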