Research

The Science Behind Synthetic Financial Data

An exploration of the methodology, statistical foundations, and practical applications behind VynFi's synthetic data engine, designed for enterprises, fintechs, and researchers.

Published · SSRN · April 2026

DataSynth: Reference Knowledge Graphs for Enterprise Audit Analytics through Synthetic Data Generation with Provable Statistical Properties

Michael Ivertowski

This paper introduces the forward generation paradigm for enterprise audit analytics. It demonstrates that recovering ground truth from enterprise data is computationally infeasible, and presents a three-layer knowledge model calibrated against 155 real-world datasets (364M journal entries, 2.4B line items) that produces synthetic data with provable statistical properties.

364M journal entries calibrated

< 0.006 Benford MAD score

200K+ entries per second

130+ labeled anomaly subtypes

Our Approach

Four pillars powering the DataSynth engine

Statistical Modeling

DataSynth uses calibrated statistical distributions, Benford's Law compliance, and inter-column correlation matrices to produce data that is structurally indistinguishable from real financial records.

Sector Calibration

Each of VynFi's 8 sector models is tuned against empirical benchmarks from real-world financial data. Retail transactions, banking ledgers, and healthcare billing each have distinct statistical signatures that our engine reproduces faithfully.

Quality Validation

Every generated dataset undergoes automated quality checks: distribution fidelity scoring, correlation preservation tests, anomaly frequency validation, and Benford's Law compliance. Datasets that fail thresholds are rejected and regenerated.
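A Benford's Law compliance check of the kind listed above can be sketched as follows. This is a minimal illustration, not VynFi's validator: the function names are hypothetical, it assumes the first-digit variant of the mean absolute deviation (MAD) test, and it reuses the 0.006 figure quoted on this page as the pass threshold.

```python
import math
from collections import Counter

# Assumed pass threshold, taken from the "< 0.006 Benford MAD score" figure.
MAD_THRESHOLD = 0.006


def first_digit(value):
    """Return the leading significant digit of a non-zero number."""
    for ch in repr(abs(value)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("value has no non-zero digit")


def benford_mad(amounts):
    """Mean absolute deviation between observed and Benford first-digit shares."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    n = len(digits)
    total = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)  # Benford probability of digit d
        observed = counts.get(d, 0) / n
        total += abs(observed - expected)
    return total / 9


def passes_benford(amounts, threshold=MAD_THRESHOLD):
    """Hypothetical gate: accept a dataset only if its MAD is under the threshold."""
    return benford_mad(amounts) < threshold
```

Under this sketch, a dataset whose amounts fail `passes_benford` would be the kind that is rejected and regenerated.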

Financial Coherence

Every generated dataset passes 32+ internal consistency checks — trial balance proof, FG rollforward, cash flow reconciliation, equity rollforward, segment-to-consolidated reconciliation, and intercompany elimination. Data that fails your audit tests is a bug, not a feature.
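To make one of these checks concrete, here is a minimal sketch of a trial balance proof at the journal-entry level: debits must equal credits for every entry. The `(entry_id, debit, credit)` tuple layout and the function name are assumptions made for illustration, not the DataSynth schema.

```python
from collections import defaultdict
from decimal import Decimal


def unbalanced_entries(lines):
    """Return the sorted IDs of journal entries whose debits and credits differ.

    `lines` is an iterable of (entry_id, debit, credit) tuples with Decimal
    amounts (a hypothetical layout chosen for this sketch).
    """
    net = defaultdict(lambda: Decimal("0"))
    for entry_id, debit, credit in lines:
        net[entry_id] += debit - credit
    return sorted(entry_id for entry_id, balance in net.items() if balance != 0)


sample_lines = [
    ("JE-001", Decimal("1200.00"), Decimal("0")),
    ("JE-001", Decimal("0"), Decimal("1200.00")),
    ("JE-002", Decimal("500.00"), Decimal("0")),
    ("JE-002", Decimal("0"), Decimal("499.99")),  # off by one cent
]
```

`Decimal` is used instead of `float` so that a one-cent difference is detected exactly rather than lost to rounding.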

Audit Methodology Benchmarks

Cross-firm comparison and validation across industry-standard audit methodologies.

Big 4 Methodology Coverage

KPMG Clara, PwC Aura, Deloitte Omnia, and EY GAM blueprints with procedure-level comparison across 518 standards.

Blueprint Testing Framework

Automated validation of blueprint completeness, coverage metrics, and step consistency across methodologies.

Progressive Difficulty Benchmarks

Curriculum generation with graduated complexity for auditor training and AI model evaluation.

Anomaly Injection Framework

33 anomaly types across 5 categories, each with configurable difficulty, severity, and confidence scoring.

Timing (7 types): Weekend posting, off-hours transactions, holiday entries, backdating, future-dating, period-end clustering, unusual frequency

Amount (8 types): Round numbers, just-below-threshold, duplicate amounts, outlier values, Benford violations, split transactions, structuring, unusual ratios

Relationship (6 types): Ghost vendors, missing approvals, circular references, orphan entries, mismatched counterparties, self-dealing patterns

Pattern (7 types): Duplicate payments, sequential invoices, round-trip flows, gradual increases, clustering behavior, layering sequences, channel switching

Structural (5 types): Missing fields, schema violations, referential integrity breaks, encoding anomalies, metadata inconsistencies

Each anomaly is tagged with difficulty (how hard it is to detect), severity (financial impact level), and confidence (certainty the record is truly anomalous) scores for supervised ML training.
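The three scores lend themselves to a small label record that a training pipeline can filter on. The sketch below is illustrative only: the class, the field names, and the 0-to-1 score range are assumptions, since the page does not document the actual label schema. Only the five category names come from the list above.

```python
from dataclasses import dataclass

CATEGORIES = {"timing", "amount", "relationship", "pattern", "structural"}


@dataclass(frozen=True)
class AnomalyLabel:
    """Hypothetical per-record anomaly label with scores assumed to lie in [0, 1]."""
    record_id: str
    category: str
    anomaly_type: str
    difficulty: float   # how hard the anomaly is to detect
    severity: float     # financial impact level
    confidence: float   # certainty the record is truly anomalous

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        for name in ("difficulty", "severity", "confidence"):
            score = getattr(self, name)
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {score}")


def training_subset(labels, min_confidence=0.8, max_difficulty=1.0):
    """Keep confident labels up to a difficulty cap, ordered easiest first."""
    kept = [
        label for label in labels
        if label.confidence >= min_confidence and label.difficulty <= max_difficulty
    ]
    return sorted(kept, key=lambda label: label.difficulty)
```

Raising `max_difficulty` step by step is one simple way to build a graduated curriculum from the same labeled set.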

ML Evaluation Results

Models trained on VynFi synthetic data reach F1 scores within 3% of their real-data counterparts across three detection model families

Model	Type	F1 (Real Data)	F1 (VynFi Data)	Relative Delta
Isolation Forest	Unsupervised	`0.82`	`0.80`	-2.4%
XGBoost	Supervised	`0.91`	`0.89`	-2.2%
GCN (Graph)	Graph Neural Net	`0.88`	`0.86`	-2.3%

Models were trained on VynFi synthetic data and evaluated on held-out real-world test sets. The relative drop in F1 of 2.2% to 2.4% indicates that synthetic-trained models generalize effectively to production data.
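The delta column is a relative difference, (F1 on VynFi data minus F1 on real data) divided by F1 on real data. The short check below reproduces it from the two F1 columns. The variable and function names are illustrative.

```python
# F1 scores copied from the table: (real data, VynFi data).
results = {
    "Isolation Forest": (0.82, 0.80),
    "XGBoost": (0.91, 0.89),
    "GCN (Graph)": (0.88, 0.86),
}


def relative_delta_pct(f1_real, f1_synthetic):
    """Relative change in F1, in percent, rounded to one decimal place."""
    return round((f1_synthetic - f1_real) / f1_real * 100, 1)


deltas = {model: relative_delta_pct(*scores) for model, scores in results.items()}
# deltas == {"Isolation Forest": -2.4, "XGBoost": -2.2, "GCN (Graph)": -2.3}
```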

Open-Source Showcases

Interactive demos on Hugging Face

Four interactive Spaces and a trained model — built on VynFi synthetic data and published under permissive licenses. No API key, no login, just click and explore.

Streamlit Space

Accounting Network Explorer

Interactive ISO 21378 Level-2 account-class graph from je_network.parquet. Pan, zoom, and click any node to see the underlying journal-entry flows by class.


Docker Space

Data Explorer

Browse VynFi reference datasets in your browser — column profiles, schema view, sample rows, and side-by-side comparison across the published parquet artifacts.


Gradio Space

Fraud-GNN Demo

Three tabs in a single demo: an edge fraud predictor, a node anomaly explorer, and a live ROC curve that re-renders as you change the decision threshold.


Streamlit Space

Process Mining Demo

pm4py directly-follows graph (DFG) over a supply-chain OCEL event log. Filter by activity, see variant frequencies, export the discovered process model.


Reference Datasets

Pre-baked datasets on Hugging Face

Skip the generation step. Seven curated reference datasets ship the latest DataSynth outputs with ISO 21378 fields, fraud propagation labels, and microsecond-precision OCEL timestamps. Load directly via datasets.load_dataset("VynFi/<slug>").
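The `datasets.load_dataset("VynFi/<slug>")` call mentioned above can be wrapped as shown here. This is a sketch under stated assumptions: the helper names are hypothetical, the Hugging Face `datasets` package must be installed, and the `"train"` split name is a guess that should be checked against each dataset card.

```python
def repo_id(slug: str) -> str:
    """Build the Hugging Face repository ID for a VynFi dataset slug."""
    return f"VynFi/{slug}"


def load_reference(slug: str, split: str = "train"):
    """Download one reference dataset from the Hugging Face Hub.

    The import is local so this module can be imported without the
    `datasets` package; the split name is an assumption.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id(slug), split=split)
```

For example, `load_reference("vynfi-aml-100k")` would request the AML banking dataset, provided that split exists.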

JE · Fraud

VynFi/vynfi-journal-entries-1m

2.1M JE line items, manufacturing sector, ~7% fraud with propagation labels, 12 periods.


AML · Banking

VynFi/vynfi-aml-100k

Banking + AML labels with 0.857 typology coverage, 38× denser network, mule_link / shell_link edges.


Group Audit

VynFi/vynfi-group-audit-enterprise-2000

Audit-ready 100-entity consolidated group dataset under IFRS 3 / 10 / 28 / 21 + ISA 600.


Process Mining

VynFi/vynfi-ocel-manufacturing

OCEL 2.0 manufacturing event log, microsecond timestamps, 162 variants, 55% happy-path concentration.


P2P · Audit Trail

VynFi/vynfi-audit-p2p

P2P document chain (PO → GR → invoice → payment, 234 docs) with is_fraud_propagated and fraud_source_document_id for end-to-end audit-trail walks.


OCEL · Supply Chain

VynFi/vynfi-supply-chain-ocel

Cross-process mining OCEL event log spanning ordering, fulfilment, and returns with realistic imperfection rates (rework 15% / skip 10% / out-of-order 8%).


AML · NLP

VynFi/vynfi-sar-narratives

Suspicious-activity-report narratives for AML model training, paired with banking-flow ground-truth labels.


Browse all VynFi datasets and Spaces on Hugging Face
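As an illustration of the audit-trail walk that the P2P dataset's `is_fraud_propagated` and `fraud_source_document_id` fields enable, the sketch below groups affected documents under the document that seeded the fraud. Two things are assumptions: the `document_id` key is hypothetical, and the reading of the flags (a propagated document points back to its source) is inferred from the field names rather than from a published schema.

```python
from collections import defaultdict


def propagation_map(documents):
    """Map each fraud source document ID to the documents it propagated to."""
    by_source = defaultdict(list)
    for doc in documents:
        source = doc.get("fraud_source_document_id")
        if doc.get("is_fraud_propagated") and source:
            by_source[source].append(doc["document_id"])
    return dict(by_source)


# Toy PO -> GR -> invoice -> payment chain; the values are made up.
sample_chain = [
    {"document_id": "PO-1", "is_fraud_propagated": False, "fraud_source_document_id": None},
    {"document_id": "GR-1", "is_fraud_propagated": False, "fraud_source_document_id": None},
    {"document_id": "INV-1", "is_fraud_propagated": False, "fraud_source_document_id": None},
    {"document_id": "PAY-1", "is_fraud_propagated": True, "fraud_source_document_id": "INV-1"},
]
```

On the toy chain, the payment is traced back to the invoice that introduced the fraud.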

Privacy-Preserving Fingerprints

4 configurable privacy levels with differential privacy guarantees

Standard (Default), epsilon 1.0: Balanced fidelity and privacy. Suitable for most development and testing use cases.

Enhanced (Recommended), epsilon 0.5: Stronger privacy with moderate utility loss. Recommended for sensitive financial domains.

Strict (Regulated), epsilon 0.1: Strong differential privacy. For regulated environments requiring formal privacy guarantees.

Maximum, epsilon 0.01: Highest privacy protection. Near-zero re-identification risk with some statistical fidelity trade-off.

Lower epsilon values provide stronger privacy guarantees. VynFi's fingerprint system captures statistical distributions without storing any individual records, and differential privacy noise is applied before fingerprint export.
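To show why a lower epsilon costs fidelity, the sketch below applies the Laplace mechanism to a single count at each of the four levels. The choice of mechanism is an assumption made for illustration, since this page does not say which one VynFi uses. The function names and the sensitivity of 1 are likewise hypothetical.

```python
import random

# Epsilon values for the four privacy levels listed above.
PRIVACY_LEVELS = {"standard": 1.0, "enhanced": 0.5, "strict": 0.1, "maximum": 0.01}


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize_count(true_count, level, sensitivity=1.0, rng=None):
    """Return a noisy count for the named privacy level (assumed mechanism)."""
    rng = rng or random.Random()
    scale = laplace_scale(sensitivity, PRIVACY_LEVELS[level])
    return true_count + laplace_noise(scale, rng)
```

With a sensitivity of 1, the noise scale grows from 1 at the Standard level to 100 at the Maximum level, which is the fidelity trade-off described above.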

Use Cases

How organizations leverage VynFi's synthetic data


Start building with synthetic financial data

5,000 free credits to start. No credit card required.


Use Cases

Audit Training

Fintech Testing

Academic Research

Audit Firm Training
