
Neural Diffusion for Tabular Financial Data

VynFi 3.0 adds a score-based diffusion model for tabular financial data generation. This post covers the architecture — score networks, denoising score matching, classifier-free guidance — and shows how hybrid mode combines neural and rule-based generation.

VynFi Team · Engineering · April 17, 2026 · 10 min read

Rule-based generation has carried VynFi through two major versions. The three-layer knowledge model — domain constraints, statistical distributions, and structural relationships — produces data that passes Benford analysis, maintains balanced entries, and respects document flow integrity. But rules have a ceiling. Some distributional patterns in real financial data arise from emergent behavior that is difficult to encode explicitly: the heavy-tailed correlation structure between transaction amounts and counterparty diversity, the non-stationary seasonal patterns in expense categorization, the subtle clustering in payment timing that reflects human workflow habits.

VynFi 3.0 adds a neural diffusion model that learns these patterns directly from tabular financial distributions. The model is a score-based generative model (sometimes called a denoising diffusion model) adapted for mixed-type tabular data — continuous amounts, categorical codes, timestamps, and boolean flags coexisting in the same row.

**DataSynth 3.1 update:** Hybrid diffusion tuning has landed. `diffusion.neural.hybrid_strategy` accepts `weighted_average`, `column_select`, or `threshold`, and `neural_columns` lets you route specific columns through the neural backend while statistical generation handles the rest. Use `column_select` to apply the neural model only to heavy-tailed amount distributions (where it most outperforms rules) while keeping the deterministic statistical engine for structural columns like GL accounts and document IDs. See `neural_diffusion.py` for the calibration pattern.
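
As a rough illustration of what `column_select` implies, the sketch below merges two row sets so that only the columns listed in `neural_columns` come from the neural backend. The function name, the row format (a list of dicts), and the sample column names are assumptions made for this example; it is not DataSynth's implementation.

```python
def column_select(statistical_rows, neural_rows, neural_columns):
    """Take the listed columns from the neural rows, everything else from the statistical rows.

    Hypothetical helper that only illustrates the routing idea.
    """
    if len(statistical_rows) != len(neural_rows):
        raise ValueError("both backends must produce the same number of rows")
    merged = []
    for stat_row, neural_row in zip(statistical_rows, neural_rows):
        row = dict(stat_row)                  # structural columns stay deterministic
        for column in neural_columns:
            row[column] = neural_row[column]  # routed columns come from the neural model
        merged.append(row)
    return merged


# Illustrative data; the column names are made up for this example.
statistical = [{"gl_account": "4000", "document_id": "JE-0001", "amount": 120.00}]
neural = [{"gl_account": "9999", "document_id": "??", "amount": 18734.52}]

rows = column_select(statistical, neural, neural_columns=["amount"])
print(rows[0])
```

Only `amount` is taken from the neural row here; the GL account and document ID keep their statistically generated values.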

Score-Based Diffusion for Tabular Data

The core idea is denoising score matching. Starting from a clean data sample x, we progressively add Gaussian noise across T timesteps until the data distribution converges to pure noise. A neural network (the score network) is trained to estimate the gradient of the log probability density at each noise level: given a noisy sample, predict the direction that moves toward higher data density. At generation time, we start from pure noise and iteratively denoise using the learned score function.
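
The mechanics can be sketched on a one-dimensional toy problem. The example below assumes the clean data follows a known Gaussian, so the exact score of the noised distribution is available in closed form and can stand in for a trained score network. All names (`perturb`, `dsm_target`, `analytic_score`, `sample`), the noise levels, and the fixed Langevin step size are choices made for this illustration, not VynFi internals.

```python
import math
import random

MU, S = 3.0, 0.5  # toy clean-data distribution N(MU, S^2), chosen so the true score is known


def perturb(x_clean, sigma, rng):
    """Forward process: add Gaussian noise with standard deviation sigma."""
    return x_clean + sigma * rng.gauss(0.0, 1.0)


def dsm_target(x_noisy, x_clean, sigma):
    """Denoising score matching target: score of N(x_clean, sigma^2) evaluated at x_noisy."""
    return (x_clean - x_noisy) / sigma**2


def analytic_score(x_noisy, sigma):
    """Exact score of the noised marginal N(MU, S^2 + sigma^2); stands in for a trained network."""
    return (MU - x_noisy) / (S**2 + sigma**2)


def dsm_loss(score_fn, sigma, n, rng):
    """Monte Carlo estimate of the sigma^2-weighted denoising score matching loss."""
    total = 0.0
    for _ in range(n):
        x_clean = rng.gauss(MU, S)
        x_noisy = perturb(x_clean, sigma, rng)
        total += sigma**2 * (score_fn(x_noisy, sigma) - dsm_target(x_noisy, x_clean, sigma)) ** 2
    return total / n


def sample(score_fn, sigmas, steps_per_level, step, rng):
    """Annealed Langevin sampling: start from noise, follow the score at decreasing noise levels."""
    x = rng.gauss(0.0, sigmas[0])
    for sigma in sigmas:
        for _ in range(steps_per_level):
            x += step * score_fn(x, sigma) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
sigmas = [10.0, 3.0, 1.0, 0.3, 0.1, 0.03, 0.01]
samples = [sample(analytic_score, sigmas, steps_per_level=50, step=0.05, rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
print(f"sample mean: {mean:.2f} (clean data mean is {MU})")
```

In a real model, `analytic_score` is replaced by a network trained to minimize `dsm_loss` across noise levels. Production samplers usually also adapt the step size per noise level, which this sketch omits for brevity.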

For tabular data, the standard image-domain architecture does not apply. VynFi's score network uses a transformer encoder with per-column type embeddings. Continuous columns pass through a learned normalization layer; categorical columns use embedding lookups; timestamps are encoded as (day-of-week, day-of-month, month, hour) cyclic features. The transformer processes all columns jointly, capturing cross-column dependencies that factored models miss.
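
To make the timestamp encoding concrete, here is a minimal sketch of one common way to build cyclic features: each component is mapped to a sine/cosine pair so that, for example, 23:00 and 00:00 end up close together. The source only names the four components; the sine/cosine choice, the use of the actual month length as the day-of-month period, and the function names are assumptions.

```python
import calendar
import math
from datetime import datetime


def cyclic(value, period):
    """Map a value on a cycle of the given period to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))


def encode_timestamp(ts):
    """Encode (day-of-week, day-of-month, month, hour) as four sine/cosine pairs."""
    days_in_month = calendar.monthrange(ts.year, ts.month)[1]
    components = [
        (ts.weekday(), 7),            # Monday = 0
        (ts.day - 1, days_in_month),  # zero-based day within the month
        (ts.month - 1, 12),           # zero-based month
        (ts.hour, 24),
    ]
    features = []
    for value, period in components:
        features.extend(cyclic(value, period))
    return features


features = encode_timestamp(datetime(2026, 4, 17, 9, 30))
print(len(features), [round(f, 3) for f in features])
```

Each timestamp becomes eight bounded floats, which can be fed to the network alongside the other column embeddings.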

Classifier-Free Guidance

Raw diffusion sampling produces samples that match the training distribution on average, but may not satisfy domain constraints (e.g., debits equal credits within a journal entry). VynFi uses classifier-free guidance to steer generation toward constraint-satisfying regions. During training, the model is conditioned on domain labels (sector, table type, constraint set) with random dropout. At inference, the conditional and unconditional scores are combined with a guidance scale parameter that controls the trade-off between distributional fidelity and constraint satisfaction.
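
The two halves of classifier-free guidance can be sketched in a few lines. The post does not give VynFi's exact formula, so the code below uses a common formulation in which the guided score starts at the unconditional score and moves toward, and for scales above 1 past, the conditional one. The function names and the `None` null label are assumptions for illustration.

```python
import random

NULL_LABEL = None  # placeholder meaning "no conditioning"


def drop_condition(label, drop_prob, rng):
    """Training-time label dropout: sometimes hide the label so that a single
    network learns both the conditional and the unconditional score."""
    return NULL_LABEL if rng.random() < drop_prob else label


def guided_score(score_cond, score_uncond, guidance_scale):
    """Inference-time combination of the conditional and unconditional scores."""
    return [u + guidance_scale * (c - u) for c, u in zip(score_cond, score_uncond)]


cond = [1.0, 2.0]
uncond = [0.0, 1.0]
print(guided_score(cond, uncond, 1.0))  # scale 1.0 reproduces the conditional score
print(guided_score(cond, uncond, 2.5))  # larger scales extrapolate beyond it
```

Under this formulation a scale of 1.0 applies no extra push beyond plain conditioning, while larger values trade some distributional fidelity for stronger adherence to the conditioning signal.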

Python

import vynfi

client = vynfi.VynFi()

# Pure diffusion mode: maximum distributional fidelity
job_pure = client.jobs.create(
    mode="diffusion",
    sector="financial_statements",
    rows=50_000,
    periods=4,
    guidance_scale=1.0,    # minimal guidance, closest to learned distribution
    hybrid=False,
)

# Hybrid mode: diffusion + rule-based structural constraints
job_hybrid = client.jobs.create(
    mode="diffusion",
    sector="financial_statements",
    rows=50_000,
    periods=4,
    guidance_scale=2.5,    # stronger guidance toward constraint satisfaction
    hybrid=True,           # post-process with rule engine for structural validity
    hybrid_constraints=[
        "balanced_entries",
        "benford_compliance",
        "sequential_dates",
        "document_reference_integrity",
    ],
)

# Block until the hybrid job finishes
result = client.jobs.wait(job_hybrid.id)

Hybrid Mode: Best of Both Approaches

Hybrid mode is the recommended default for most use cases. The diffusion model generates the initial data with realistic distributional properties — amount correlations, category co-occurrence patterns, temporal dynamics — and the rule engine applies a constraint-satisfaction pass that enforces structural validity. The result is data that has the statistical richness of a learned model with the structural guarantees of the rule-based engine.

The constraint-satisfaction pass is lightweight because the guided diffusion output is already close to satisfying constraints. In benchmarks, fewer than 3% of rows require adjustment in hybrid mode with guidance_scale=2.5, compared to 15-20% with guidance_scale=1.0.
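
To give a feel for what such a pass might do for one constraint, the sketch below repairs a journal entry whose debits and credits differ slightly. The repair rule (absorb the gap in the largest line on the short side, or append a balancing line), the row format, and the function name are hypothetical; the post does not describe how VynFi's rule engine adjusts rows.

```python
from decimal import Decimal


def balance_entry(lines):
    """Return (lines, adjusted) such that total debits equal total credits.

    Hypothetical repair rule, for illustration only.
    """
    debit = sum((line["debit"] for line in lines), Decimal("0"))
    credit = sum((line["credit"] for line in lines), Decimal("0"))
    gap = debit - credit
    if gap == 0:
        return lines, False

    short = "credit" if gap > 0 else "debit"   # the side that needs topping up
    other = "debit" if short == "credit" else "credit"
    fixed = [dict(line) for line in lines]     # copy so the input is left untouched
    candidates = [line for line in fixed if line[short] > 0]
    if candidates:
        largest = max(candidates, key=lambda line: line[short])
        largest[short] += abs(gap)
    else:
        fixed.append({short: abs(gap), other: Decimal("0")})
    return fixed, True


entry = [
    {"debit": Decimal("100.00"), "credit": Decimal("0")},
    {"debit": Decimal("0"), "credit": Decimal("99.97")},
]
fixed, adjusted = balance_entry(entry)
print(adjusted, fixed[1]["credit"])
```

Counting how often `adjusted` is true across all entries gives a figure comparable in spirit to the adjustment rates quoted above.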

Evaluating Output Quality

Python

import pandas as pd

# `client` and `result` come from the job submitted in the previous example
archive = client.jobs.download_archive(result.id)
df = pd.read_parquet(archive.file("journal_entries.parquet"))

# Quality metrics included in every diffusion job
metrics = archive.json("quality_metrics.json")

print(f"Rows generated: {len(df)}")
print(f"Benford MAD: {metrics['benford_mad']:.4f}")       # < 0.006 = close conformity
print(f"Column correlation RMSE: {metrics['corr_rmse']:.4f}")  # vs. reference distribution
print(f"Category Jensen-Shannon: {metrics['cat_js']:.4f}")     # lower = better match
print(f"Constraint violations: {metrics['constraint_violations']}")  # should be 0 in hybrid mode
print(f"Rows adjusted by rule engine: {metrics['hybrid_adjustments']}")
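
For readers who want to sanity-check the Benford figure independently, here is a minimal sketch of a first-digit mean absolute deviation (MAD). It assumes the metric is the average absolute gap between observed and expected first-digit frequencies; how VynFi computes `benford_mad` internally is not specified in this post, and the function names are illustrative.

```python
import math
import random
from collections import Counter
from decimal import Decimal

# Expected first-digit frequencies under Benford's law
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(amount):
    """Leading significant digit of a number (0 for zero)."""
    return Decimal(str(abs(amount))).as_tuple().digits[0]


def benford_mad(amounts):
    """Mean absolute deviation between observed and Benford first-digit frequencies."""
    digits = [d for d in map(first_digit, amounts) if d != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - BENFORD[d]) for d in range(1, 10)) / 9


rng = random.Random(0)
log_uniform = [10 ** rng.uniform(0, 4) for _ in range(20_000)]  # follows Benford's law
flat = [rng.uniform(100, 999) for _ in range(20_000)]            # first digits roughly uniform

print(f"log-uniform amounts: {benford_mad(log_uniform):.4f}")
print(f"flat amounts:        {benford_mad(flat):.4f}")
```

Running the same function over `df` (on whatever column holds the amounts in your archive) gives a figure to compare against the reported metric, bearing in mind that the two may be defined differently.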

The diffusion model is pre-trained on VynFi's internal reference distributions (calibrated against 155 real-world financial datasets, as described in our methodology paper). Customers on Scale and Enterprise tiers can fine-tune the model on their own fingerprint data — the privacy-preserving statistical summaries described in a separate post in this series — to produce synthetic data that matches their specific distributional characteristics without ever exposing raw records.
