DataSynth 4.2.1 Lands: The Foundation for VynFi 1.0
VynFi's engine just upgraded from DataSynth 3.1.1 to 4.2.1 — 29 upstream releases in one jump. This post walks through what landed: five-copula rank-preserving correlations, business-day temporal snapping across 11 regions, user-uploaded template packs, runtime LLM enrichment, 6 audit-optimizer endpoints, and the roadmap to 1.0 GA.
Since VynFi pinned DataSynth 3.1.1 in mid-April, the upstream engine shipped 29 releases: 3.1.2 through 3.5.4 and then the first major-version bump to 4.0.0 through 4.2.1. Today we adopted 4.2.1 end-to-end — new crates, new CPU image, new config surface, new endpoints. Production has been live on the new binary for a few hours. This post covers what changed that actually matters for VynFi customers.
**TL;DR**: Five copula types with rank-preserving inverse-CDF sampling (empirical Kendall-τ now matches theoretical τ within ±0.15). Business-day + holiday snapping across 11 regions. Pareto heavy-tails and regime-change events. User-uploaded YAML template packs. Runtime LLM enrichment via OpenRouter (Scale+). Six audit-optimizer endpoints. Existing jobs are byte-identical by default.
The distributions story
DataSynth 3.4.0 wired the first advanced amount distributions (LogNormal and Gaussian mixtures). 3.4.4 added Pareto heavy tails, the right distribution for the 'one strategic contract dominates the quarter' shape that retail and manufacturing audit data actually exhibits. Then 3.5.x wired everything else: regime-change events (acquisitions, price shocks, product launches, policy changes, economic cycles), conditional rules (month/quarter-aware amount shaping), Gaussian copulas for amount↔line_count correlation, and statistical validation (Benford, χ², KS).
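To make the mixture and Benford ideas concrete, here is a minimal Python sketch. It is not DataSynth's implementation: it assumes `mu` and `sigma` are natural-log parameters of each component, reuses the three components from the retail example config in this post, and scores first-digit conformity with a simple mean absolute deviation, which is only one of several ways a Benford check can be scored. The helper names are hypothetical.

```python
import math
import random

# Components taken from the retail example config in this post.
# Assumption: mu and sigma are natural-log parameters of each log-normal.
COMPONENTS = [
    {"weight": 0.6, "mu": 3.5, "sigma": 0.8, "label": "pos"},
    {"weight": 0.3, "mu": 6.0, "sigma": 1.0, "label": "wholesale"},
    {"weight": 0.1, "mu": 9.0, "sigma": 1.2, "label": "bulk_order"},
]


def sample_amounts(n, components, seed=0):
    """Draw n amounts from a log-normal mixture: pick a component by weight, then sample it."""
    rng = random.Random(seed)
    weights = [c["weight"] for c in components]
    picks = rng.choices(components, weights=weights, k=n)
    return [rng.lognormvariate(c["mu"], c["sigma"]) for c in picks]


def first_digit(x):
    """Leading significant digit of a positive number (scientific notation puts it first)."""
    return int(f"{x:e}"[0])


def benford_mad(amounts):
    """Mean absolute deviation between observed first-digit frequencies and Benford's law."""
    counts = [0] * 10
    for x in amounts:
        counts[first_digit(x)] += 1
    n = len(amounts)
    deviations = [
        abs(counts[d] / n - math.log10(1 + 1 / d))  # Benford: P(d) = log10(1 + 1/d)
        for d in range(1, 10)
    ]
    return sum(deviations) / 9


if __name__ == "__main__":
    amounts = sample_amounts(20_000, COMPONENTS, seed=1)
    print(f"Benford first-digit MAD: {benford_mad(amounts):.4f}")
```

A mixture that spans several orders of magnitude, like this one, tends to sit close to Benford's first-digit law, which is part of why the test is a useful sanity check on synthetic amounts.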
4.1.0 lit up the remaining four copulas (Clayton, Gumbel, Frank, Student-t) and expanded conditional rules to 10 input fields including period-end, quarter-end, and year-end flags. But the breakthrough was 4.1.6: rank-preserving inverse-CDF sampling. Before, amounts were drawn from their marginal and the copula only 'nudged' them afterwards, so the resulting empirical Kendall-τ was a fraction of the theoretical τ. Now the copula is sampled first and each amount is read off the marginal's inverse CDF at the copula's uniform value, which preserves ranks, and empirical τ tracks theory within ±0.15. If you're training correlation-aware models on synthetic data, this is the difference between 'captures the signal' and 'accidentally creates a weaker signal than real data has'.
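The rank-preserving idea can be sketched in a few lines of Python. This is an illustration of the general technique, not DataSynth's code. It assumes a Clayton copula sampled by conditional inversion, uses the textbook relation τ = θ / (θ + 2) for Clayton, and picks a log-normal marginal for the amount plus a continuous Pareto marginal as a stand-in for the second field. All function names and marginal parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def clayton_pair(rng, theta):
    """One (u, v) pair from a Clayton copula via conditional inversion."""
    u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so u ** -theta is finite
    w = 1.0 - rng.random()
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return u, v


def clamp(p, eps=1e-12):
    """Keep probabilities strictly inside (0, 1) for the inverse CDFs."""
    return min(max(p, eps), 1.0 - eps)


def lognormal_inv_cdf(p, mu, sigma):
    return math.exp(mu + sigma * NormalDist().inv_cdf(clamp(p)))


def pareto_inv_cdf(p, x_min, alpha):
    return x_min * (1.0 - clamp(p)) ** (-1.0 / alpha)


def sample_correlated(n, tau, seed=0):
    """Sample the copula first, then push each uniform through its marginal's inverse CDF.

    Both inverse CDFs are increasing, so the ranks (and Kendall's tau) of the
    copula sample carry over to the generated values.
    """
    theta = 2.0 * tau / (1.0 - tau)  # Clayton: tau = theta / (theta + 2)
    rng = random.Random(seed)
    amounts, seconds = [], []
    for _ in range(n):
        u, v = clayton_pair(rng, theta)
        amounts.append(lognormal_inv_cdf(u, mu=6.0, sigma=1.0))
        seconds.append(pareto_inv_cdf(v, x_min=1.0, alpha=2.0))
    return amounts, seconds


def kendall_tau(xs, ys):
    """Plain O(n^2) Kendall tau: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    amounts, seconds = sample_correlated(800, tau=0.6, seed=42)
    print(f"target tau = 0.60, empirical tau = {kendall_tau(amounts, seconds):.3f}")
```

Because the dependence lives entirely in the uniforms, swapping either marginal leaves the rank correlation untouched, which is the property the 'nudging' approach lacked.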
Portal surface
All of it exposed through `distributions.*` on `PortalGenerationConfig`: `amounts` (mixture), `pareto`, `correlations` (five copulas), `conditional`, `regimeChanges`, `validation`. Tier gating splits responsibility: Developer gets mixtures and Pareto, Team gets basic validation (Benford, χ², KS), Scale gets copulas, conditionals, regimes, industry profiles, Anderson-Darling, and Spearman correlation-check. Credit multipliers stack: amounts +5%, Pareto +5%, correlations +10%, conditional +5%, regime changes +10%, validation +15%, capped at 10× total.

    {
      "rows": 50000,
      "sector": "retail",
      "distributions": {
        "enabled": true,
        "industryProfile": "retail",
        "amounts": {
          "enabled": true,
          "components": [
            { "weight": 0.6, "mu": 3.5, "sigma": 0.8, "label": "pos" },
            { "weight": 0.3, "mu": 6.0, "sigma": 1.0, "label": "wholesale" },
            { "weight": 0.1, "mu": 9.0, "sigma": 1.2, "label": "bulk_order" }
          ]
        },
        "correlations": {
          "enabled": true,
          "copulaType": "clayton",
          "fields": ["amount", "line_count"],
          "matrix": [0.6]
        },
        "validation": {
          "enabled": true,
          "tests": ["benford_first_digit", "chi_squared", "anderson_darling", "correlation_check"]
        }
      }
    }

Temporal patterns across 11 regions
DataSynth 3.4.1 through 3.4.3 wired a `TemporalContext` through every generator that emits a date: P2P (PO / GR / invoice / payment), O2C (SO / delivery / customer-invoice / receipt / due-date), HR (time entries, expense-report submission / approval / paid / line-item dates), manufacturing (planned/actual start+end, routing operations), and period close (accrual reversals). When enabled, every date snaps to the next business day using multi-year holiday calendars.
VynFi's portal exposes 11 ISO-2 region codes: US, DE, UK, CH, FR, IT, ES, JP, CN, IN, BR. Free tier gets one region; Developer and up get multi-region (common for cross-border audit engagements). Multiple regions merge holidays — a US+DE engagement won't post on July 4 *or* on German Unity Day.
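A rough Python sketch of the snapping rule with merged calendars follows. The calendar below is a deliberately tiny, hypothetical stand-in for the multi-year holiday data, and the sketch assumes a Saturday/Sunday weekend for every region. The function names are illustrative, not DataSynth's.

```python
from datetime import date, timedelta

# Hypothetical two-entry calendar for illustration; the real calendars span multiple years.
HOLIDAYS = {
    "US": {date(2025, 7, 4)},   # Independence Day
    "DE": {date(2025, 10, 3)},  # German Unity Day
}


def merged_holidays(regions):
    """Union of the holiday sets of every selected region."""
    merged = set()
    for region in regions:
        merged |= HOLIDAYS[region]
    return merged


def snap_to_business_day(d, holidays):
    """Move d forward until it is neither a weekend day nor a holiday."""
    while d.weekday() >= 5 or d in holidays:  # weekday() 5 and 6 are Saturday and Sunday
        d += timedelta(days=1)
    return d


if __name__ == "__main__":
    calendar = merged_holidays(["US", "DE"])
    for raw in (date(2025, 7, 4), date(2025, 10, 3), date(2025, 7, 8)):
        print(raw, "->", snap_to_business_day(raw, calendar))
```

With the merged US+DE calendar, both Friday holidays move to the following Monday, while an ordinary Tuesday is left alone.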
Template packs: bring your own names
Another upstream feature we productized today dates back to 3.2.0: template packs. DataSynth has embedded pools for vendor names, customer names, bank names, material descriptions, asset descriptions, department names, and audit finding titles and narratives. For every industry- or region-specific engagement we've shipped, customers have wanted to override those with their own domain-specific pools. Now they can.
At `/dashboard/templates` (Team+), users create a pack, pick one of three merge strategies (extend, replace, merge_prefer_file), and upload YAML per category. There's a per-category editor, a validate button that re-runs every pool against the same check DataSynth does at startup, and a 1 MB-per-category cap so packs stay fast. When a job references a pack, the worker materialises every category to a job-scoped tempdir and stamps `templates.path` + `templates.merge_strategy` into the DataSynth config before invoking the binary.
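The worker-side step might look roughly like the Python sketch below. Only the `templates.path` and `templates.merge_strategy` keys, the three strategy names, and the 1 MB cap come from the description above. The function name, the one-file-per-category `<category>.yaml` layout, the `templates` subdirectory, and the reading of 1 MB as 2**20 bytes are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

MERGE_STRATEGIES = {"extend", "replace", "merge_prefer_file"}
MAX_CATEGORY_BYTES = 1024 * 1024  # assumption: the 1 MB cap means 2**20 bytes


def materialise_pack(categories, merge_strategy, config, job_dir):
    """Write each category's YAML under job_dir and return a config with the pack stamped in.

    categories maps a category name to its YAML text.
    """
    if merge_strategy not in MERGE_STRATEGIES:
        raise ValueError(f"unknown merge strategy: {merge_strategy!r}")

    pack_dir = Path(job_dir) / "templates"
    pack_dir.mkdir()
    for name, yaml_text in categories.items():
        if not name.isidentifier():  # keeps names like "../x" out of the file path
            raise ValueError(f"invalid category name: {name!r}")
        data = yaml_text.encode("utf-8")
        if len(data) > MAX_CATEGORY_BYTES:
            raise ValueError(f"category {name!r} exceeds the 1 MB cap")
        (pack_dir / f"{name}.yaml").write_bytes(data)

    stamped = dict(config)  # shallow copy so the caller's config is not mutated
    stamped["templates"] = {"path": str(pack_dir), "merge_strategy": merge_strategy}
    return stamped


if __name__ == "__main__":
    # TemporaryDirectory gives the job-scoped, self-cleaning directory.
    with tempfile.TemporaryDirectory() as job_dir:
        cfg = materialise_pack(
            {"vendor_names": "- Acme GmbH\n- Borealis AG\n"},  # hypothetical pool
            "extend",
            {"rows": 1000},
            job_dir,
        )
        print(cfg["templates"])
```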
**Storage choice**: we went with Postgres, not blob. Packs are small (tens of KB per category, typically under 100 KB total), and Postgres gives us transactional CRUD, FK-enforced tenant isolation, and no new SDK dependency. If any tenant outgrows this we'll promote to blob; the migration path is a swap on the writer side of `TemplatePackCategory::upsert`.
LLM enrichment: OpenRouter on tap
DataSynth 3.5.1 shipped an HTTP LLM backend targeting OpenRouter; 4.1.1 broadened it to vendors, customers, materials, and audit finding titles. Today we wired both: a runtime `llm.*` config block that enriches at generate-time (Scale+ only, 1.25× credit multiplier), and a pack-scoped enrichment endpoint (`POST /v1/template-packs/{id}/enrich`) that appends N LLM-generated entries to a pack category.
Model whitelist (portal-enforced at API layer): `anthropic/claude-sonnet-4.5` (default), `anthropic/claude-opus-4.7`, `openai/gpt-4o`, `openai/gpt-4o-mini`, `meta-llama/llama-4`. Per-call cap of 500 enrichments. The binary runs with `env_clear()` and a single re-injected `OPENROUTER_API_KEY` — no other secrets leak into the subprocess. Enriched YAML is validated before persisting so a malformed LLM response never corrupts a pack.
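For readers who want the shape of those guards, here is a hedged Python sketch. The model list, the default, the 500 cap, and the `OPENROUTER_API_KEY` name come from the paragraph above; the function names are hypothetical. Passing `env=` to `subprocess.run` replaces the child's environment instead of extending it, which is the Python analogue of calling `env_clear()` and then re-adding one variable.

```python
import subprocess

ALLOWED_MODELS = (
    "anthropic/claude-sonnet-4.5",
    "anthropic/claude-opus-4.7",
    "openai/gpt-4o",
    "openai/gpt-4o-mini",
    "meta-llama/llama-4",
)
DEFAULT_MODEL = "anthropic/claude-sonnet-4.5"
MAX_ENRICHMENTS_PER_CALL = 500


def validate_enrichment_request(count, model=None):
    """Reject models outside the whitelist and counts outside 1..500; return the model to use."""
    model = model or DEFAULT_MODEL
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model not allowed: {model!r}")
    if not 1 <= count <= MAX_ENRICHMENTS_PER_CALL:
        raise ValueError(f"count must be between 1 and {MAX_ENRICHMENTS_PER_CALL}")
    return model


def child_env(api_key):
    """The complete environment handed to the subprocess: exactly one variable."""
    return {"OPENROUTER_API_KEY": api_key}


def run_isolated(argv, api_key, timeout_s=None):
    """Run argv with a replaced environment so no other parent secrets are inherited."""
    return subprocess.run(
        argv,
        env=child_env(api_key),  # env= replaces the parent's environment, it does not extend it
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
```

One practical consequence of a fully cleared environment is that `PATH` is gone too, so the binary has to be addressed by an absolute path.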
We build our own `datasynth-data-llm:4.2.1` image (CPU + `--features llm`) alongside the plain `datasynth-data:4.2.1`. The main worker pod stays on the plain image to keep its hot path lean; the enrichment endpoint targets the LLM variant via `DATASYNTH_LLM_BINARY_PATH`.
Audit-optimizer CLI: six new endpoints
DataSynth 4.1.2 added an `optimizer` subcommand with six operations: `risk-scope` (rank in-scope accounts by residual risk), `portfolio` (allocate audit hours under a budget), `resources` (phase-resource allocation), `conformance` (compare observed audit trace against FSM blueprint), `monte-carlo` (risk-weighted cost/duration simulation), and `calibration` (fit weights against historical findings). The CLI is a thin wrapper over `datasynth-audit-optimizer`; deeper analytics light up incrementally in 4.1.x and 4.2.x.
We exposed all six at `/v1/optimizer/*` (Scale+). Each endpoint takes a JSON body (serialised to YAML for the CLI), invokes the binary in a self-cleaning tempdir with a 10-minute timeout, reads the JSON output, and streams it back. The schemas are stable so portal UI and customer SDKs can be built today — when DataSynth wires up the real analytics, our endpoints auto-upgrade with no client code changes.
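The request path described above can be sketched in Python as follows. This is an assumption-laden illustration rather than our handler: the optimizer's actual command-line flags are not shown in this post, so the sketch takes a caller-supplied `build_argv` callable instead of guessing them, and the file names are made up. It also leans on YAML 1.2 being designed as a superset of JSON, so the body is written as JSON text into the `.yaml` input file to avoid a third-party YAML dependency.

```python
import json
import subprocess
import tempfile
from pathlib import Path

TIMEOUT_S = 10 * 60  # the 10-minute limit


def run_optimizer(body, build_argv, timeout_s=TIMEOUT_S):
    """Serialise body, run the CLI in a self-cleaning tempdir, and return its parsed JSON output.

    build_argv(in_path, out_path) must return the full argv list. It is a
    hypothetical hook: the real subcommand flags are not assumed here.
    """
    with tempfile.TemporaryDirectory() as tmp:  # removed on exit, even if the run fails
        in_path = Path(tmp) / "input.yaml"
        out_path = Path(tmp) / "output.json"
        # JSON text is also parseable as YAML 1.2, so this file is valid YAML input.
        in_path.write_text(json.dumps(body), encoding="utf-8")
        subprocess.run(build_argv(in_path, out_path), check=True, timeout=timeout_s)
        return json.loads(out_path.read_text(encoding="utf-8"))
```

`subprocess.run` raises `TimeoutExpired` when the limit is hit and `CalledProcessError` on a non-zero exit, so an endpoint built this way can map both to an error response before the tempdir is cleaned up.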
Audit + accounting: five new config sections
DataSynth 3.3.0 wired seven previously-dormant L1 generators (OrganizationalProfile, LegalDocument, ItControls, PriorYear, IndustryBenchmark, ManagementReport, DriftEvent). 3.3.1 shipped new accounting-standards generators (Lease per IFRS 16 / ASC 842, FairValue per IFRS 13 / ASC 820, FrameworkReconciliation across 12 canonical US GAAP ↔ IFRS difference areas). 3.3.2 closed out five audit-config fields that were previously schema-only. 4.1.3 added post-hoc vendor/customer interconnectivity labelling.
All exposed today through `PortalGenerationConfig`: `analyticsMetadata`, `audit` (with IT controls + fine-grained team/review config), `complianceRegulations.legalDocuments`, `accountingStandards` (with five accounting frameworks including dual_reporting for framework-reconciliation runs), and `interconnectivity`. Tier split: Team gets audit + analytics + legal docs, Scale gets accounting standards + interconnectivity. Multipliers: 1.05×, 1.10×, 1.05×, 1.15×, 1.05× respectively.
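As a worked example of the tier split and the multipliers, here is a small Python sketch. It assumes the tiers order as Free < Developer < Team < Scale, that 'respectively' follows the order in which the five sections are listed, and that multipliers combine by multiplication. The post does not spell out the stacking rule, so treat that last point in particular as an assumption. The names are hypothetical.

```python
TIERS = ["free", "developer", "team", "scale"]  # assumed ordering, lowest to highest

# section -> (minimum tier, credit multiplier), in the order listed above
SECTIONS = {
    "analyticsMetadata": ("team", 1.05),
    "audit": ("team", 1.10),
    "complianceRegulations.legalDocuments": ("team", 1.05),
    "accountingStandards": ("scale", 1.15),
    "interconnectivity": ("scale", 1.05),
}


def credit_multiplier(enabled_sections, tier):
    """Check tier gates, then combine multipliers (assumed multiplicative)."""
    rank = TIERS.index(tier)
    multiplier = 1.0
    for section in enabled_sections:
        min_tier, factor = SECTIONS[section]
        if rank < TIERS.index(min_tier):
            raise PermissionError(f"{section} requires the {min_tier} tier or higher")
        multiplier *= factor
    return multiplier


if __name__ == "__main__":
    print(f"{credit_multiplier(list(SECTIONS), 'scale'):.4f}")
```

Under these assumptions a Scale job that enables all five sections costs about 1.46× the baseline credits (1.05 × 1.10 × 1.05 × 1.15 × 1.05).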
What's live, what's queued
Live in production right now: baseline DataSynth 4.2.1 engine, all the fraud-bias sweep + AML rebalancing + parallel-split fixes that land free with the version bump, and the new clippy-1.95-compatible API image.
Shipped to main (API surface only, no UI yet): every config section above, all tier gates, all credit multipliers, 6 audit-optimizer endpoints, the pack-enrichment endpoint. Portal CI is green; these deploy as soon as the CI-side auto-commit promotes the new API image tag.
Queued for the next engineering sessions: dashboard UI for LLM enrichment and audit engagement pages, GPU node pool + `datasynth-data-gpu:4.2.1` for Scale+ neural-diffusion jobs, OpenAPI spec regen, one pilot customer run per Scale+ feature. 1.0 GA ships when that last item lands.
A note on what's *not* exposed
A few DataSynth features deliberately stayed off the portal surface today. Deep mixture-component matrices, full conditional-rule DAGs, and regime-event event lists are available via the existing `overrides` blob (Scale+ raw-YAML passthrough) but not typed in `PortalGenerationConfig`. We'd rather light these up when a real customer asks for them than guess at the right shape — the `overrides` path is exactly the escape hatch that keeps us honest until we have evidence.
Similarly, adversarial model testing has a Scale tier placeholder but the portal surface moves to 1.1 so we can co-design it with the first fraud-team customer. Neural diffusion in `phase_diffusion_enhancement` is upstream-deferred pending DS 4.2.x / 4.3.x orchestrator wiring; the `neural-cuda` Cargo feature works end-to-end in isolation (DS 4.2.0 validated training + sampling on an 800-sample log-normal), but the production pipeline still uses statistical diffusion until DS ties the knot.
How to try it
If you're on Team or Scale, new fields are opt-in — existing jobs stay byte-identical. Drop any of the snippets above into your next job config, or use the raw YAML passthrough on Scale+ to exercise features we haven't yet typed. If you're on Free or Developer and want to test-drive the advanced features, email support — we're happy to open a short-lived Scale-tier trial against a specific use case.
Full release notes per upstream version live in the DataSynth `CHANGELOG.md`. Our adoption plan doc (commit-by-commit, workstream-by-workstream) is at `docs/plans/2026-04-21-datasynth-4.2-vynfi-1.0-adoption-plan.md`. Questions or corrections: grab someone on the eng team via the usual channels.