FRAUD ML

Labeled Synthetic Fraud Data for Model Training

Journal entries and transactions with ground-truth fraud and anomaly labels — multi-class typology, controllable rate, fully synthetic and shareable.

No card required. Credits are the only meter — every feature is open on every account.

Ground-truth labels
Multi-class fraud typology
Controllable fraud rate
Fully synthetic and shareable

Labeled fraud data is the bottleneck for most fraud-detection models. Real fraud data is rare, sensitive, seldom labeled, and usually cannot be shared. VynFi generates synthetic journal entries and transactions with ground-truth fraud and anomaly labels, so you can train, benchmark, and stress-test models on data you own and can share.

The labeled-data problem, solved

Real companies rarely release labeled fraud data, and the few public datasets are stale, narrow, and heavily class-imbalanced. Because VynFi injects fraud synthetically, the labels are ground truth by construction: you know exactly which entries are fraudulent and why.

  • Per-row fraud and anomaly labels you can train and evaluate against
  • Multi-class typology: management override, revenue recognition, fictitious expense, journal-entry manipulation, and more
  • Control the fraud rate and class balance to match your modeling needs
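As a rough illustration of working with per-row labels, the sketch below reads a small labeled CSV and reports the observed fraud rate and the count per fraud type. The column names `is_fraud` and `fraud_type`, and the sample rows, are assumptions made for this example and are not taken from VynFi's documented schema.

```python
import csv
import io
from collections import Counter

# Hypothetical sample export. The columns `is_fraud` and `fraud_type`
# are illustrative assumptions, not a documented VynFi schema.
SAMPLE_CSV = """entry_id,amount,is_fraud,fraud_type
JE-001,1200.00,0,
JE-002,98000.00,1,revenue_recognition
JE-003,450.25,0,
JE-004,15000.00,1,fictitious_expense
JE-005,310.00,0,
"""


def label_summary(csv_text):
    """Return (fraud_rate, Counter of fraud types) for labeled CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fraud_rows = [r for r in rows if r["is_fraud"] == "1"]
    rate = len(fraud_rows) / len(rows) if rows else 0.0
    classes = Counter(r["fraud_type"] for r in fraud_rows)
    return rate, classes


rate, classes = label_summary(SAMPLE_CSV)
print(f"fraud rate: {rate:.2%}")
print(dict(classes))
```

Checking the observed rate and class counts before training is a cheap way to confirm that an export matches the balance you asked for.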

Behaviorally faithful, so it transfers

Statistical similarity isn't enough: a model trained on superficially realistic data can fail on real data. VynFi's design targets behavioral fidelity (per Sajja et al., 2026, arXiv:2604.13125): the data reproduces process variants, control patterns, and anomaly signatures, not just column distributions. Every run emits Benford analysis and quality reports so you can verify rather than trust.
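For readers who want to repeat a Benford check on their own export, here is a minimal sketch of a first-digit test. It compares observed leading-digit frequencies with the Benford expectation log10(1 + 1/d) and reports the mean absolute deviation. This is a generic version of the technique, not VynFi's report implementation, and the function names are hypothetical.

```python
import math
from collections import Counter


def first_digit(amount):
    """Leading non-zero digit of a non-zero amount."""
    # Scientific notation always puts the leading digit first, e.g. '1.23e+03'.
    return int(f"{abs(amount):.15e}"[0])


def benford_expected(d):
    """Benford's law: expected share of values whose first digit is d."""
    return math.log10(1 + 1 / d)


def benford_deviation(amounts):
    """Mean absolute deviation between observed and expected digit shares."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    n = len(digits)
    return sum(
        abs(counts.get(d, 0) / n - benford_expected(d)) for d in range(1, 10)
    ) / 9


# Powers of two are a classic sequence that follows Benford's law closely.
benford_like = [2 ** k for k in range(1, 101)]
# Amounts with evenly spread first digits do not.
uniform_like = [d * 100 for d in range(1, 10)] * 10

print(benford_deviation(benford_like))
print(benford_deviation(uniform_like))
```

A low deviation is a sanity check on amount distributions only. It says nothing about process variants or control patterns, which need their own validation.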

Document-flow context, not isolated rows

Fraud lives in relationships — an invoice that flows to a payment that flows to a posting. VynFi emits the document-flow graph linking entries, so models can learn structural fraud signals that flat row-level data can't express.
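To make the idea of structural signals concrete, the sketch below rebuilds document chains from an edge list and flags chains that skip a step. The edge format, the document IDs, and the rule that a complete chain has three documents (invoice, payment, posting) are all assumptions for illustration; the actual graph export may be shaped differently.

```python
from collections import defaultdict

# Hypothetical (source_document, target_document) edges.
EDGES = [
    ("INV-100", "PAY-200"),
    ("PAY-200", "JE-300"),
    ("INV-101", "PAY-201"),  # invoice paid but never posted
    ("PAY-202", "JE-302"),   # payment posted with no invoice
]


def flow_chains(edges):
    """Return every root-to-leaf chain. Assumes the graph has no cycles."""
    children = defaultdict(list)
    has_parent = set()
    for src, dst in edges:
        children[src].append(dst)
        has_parent.add(dst)

    nodes = {node for edge in edges for node in edge}
    roots = sorted(node for node in nodes if node not in has_parent)
    chains = []

    def walk(node, path):
        path = path + [node]
        next_docs = children.get(node, [])
        if not next_docs:
            chains.append(path)
        for child in next_docs:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return chains


def incomplete_chains(edges, expected_length=3):
    """Chains shorter than the assumed invoice -> payment -> posting flow."""
    return [c for c in flow_chains(edges) if len(c) < expected_length]


for chain in flow_chains(EDGES):
    print(" -> ".join(chain))
print(incomplete_chains(EDGES))
```

Chain length is only one possible feature. The same traversal can feed counts, timing gaps, or amounts along the chain into a model.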

Any format, any scale

Export CSV, JSON, or Parquet at whatever scale your training pipeline needs, from a quick evaluation set to millions of labeled rows. Built on the open-source DataSynth engine (Rust, 100k+ rows/sec).

Frequently asked questions

How are the fraud labels defined?

Fraud is injected synthetically against a multi-class typology (management override, revenue recognition, fictitious expense, journal-entry manipulation, and others as the taxonomy grows). Because it is injected, the per-row labels are ground truth — there is no labeling ambiguity.

Can I control the fraud rate and class balance?

Yes. You can set the overall fraud rate and shape the mix across fraud types, which is useful for handling the class-imbalance problem that plagues real fraud datasets.
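The arithmetic behind a fraud rate and a class mix can be sketched as follows: given a total row count, an overall rate, and a share per fraud type, compute how many rows each class receives. The function name, parameter names, and example values are hypothetical and do not describe VynFi's actual configuration options.

```python
def class_counts(total_rows, fraud_rate, mix):
    """Split total_rows into legitimate rows and per-type fraud rows.

    `mix` maps fraud type -> share of all fraud rows; shares must sum to 1.
    """
    if not 0 <= fraud_rate <= 1:
        raise ValueError("fraud_rate must be between 0 and 1")
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")

    fraud_total = round(total_rows * fraud_rate)
    counts = {name: int(fraud_total * share) for name, share in mix.items()}

    # Rows lost to truncation go to the largest classes first.
    remainder = fraud_total - sum(counts.values())
    for name in sorted(mix, key=mix.get, reverse=True)[:remainder]:
        counts[name] += 1

    counts["legitimate"] = total_rows - fraud_total
    return counts


# Illustrative values only.
example = class_counts(
    total_rows=10_000,
    fraud_rate=0.02,
    mix={
        "management_override": 0.4,
        "revenue_recognition": 0.3,
        "fictitious_expense": 0.2,
        "journal_entry_manipulation": 0.1,
    },
)
print(example)
```

Working out the per-class counts up front shows whether the rarest fraud type will have enough examples to train and evaluate on at a given dataset size.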

Will a model trained on it transfer to real data?

The engine targets behavioral fidelity — reproducing the behaviors (process variants, control patterns, anomaly signatures) that determine transfer, not just column-level statistics. Every run emits Benford and quality reports so you can validate fidelity for your use case.

What formats are available?

CSV, JSON, and Parquet, plus the document-flow graph that links entries. The free tier includes 5,000 non-expiring credits with no card required; one-time packs start at $19.


Try it in 30 seconds — no signup

Generate a sample in the playground, or create a free account for 5,000 credits. Built on the open-source DataSynth engine (Apache 2.0).