TPC-DS in the browser - all 99 queries on the real Spark engine

This demo runs all 99 official TPC-DS queries against a genuine TPC-DS dataset, using the real spark-rust Spark SQL engine compiled to WebAssembly -- parse, plan, optimize, execute -- entirely inside the browser tab. No server, no backend, no data leaving the sandbox.

It's the multi-table, multi-query sibling of ../browser (which is a single-query scratchpad). Where that demo proves "Spark dialect in a tab", this one proves the engine is complete enough to run an industry-standard analytical benchmark end to end.

+---------------------------- browser tab -----------------------------+
|  app.js                                  spark-wasm (WebAssembly)    |
|  +--------------------+  registerIpc()   +------------------------+  |
|  | tier picker        | ---------------> | SparkEngine            |  |
|  | 99-query gallery   |                  |  * 24 TPC-DS MemTables |  |
|  | SQL viewer         |  runJson(sql)    |  * 391 Spark UDFs      |  |
|  | result table       | <--------------- |  * DataFusion executor |  |
|  | "Run all 99 >"     |   JSON rows      +------------------------+  |
|  +--------------------+                                              |
|         ^ fetch + gunzip (DecompressionStream)                       |
|   data/tiers.json  +  data/<tier>/*.arrow.gz  (24 tables / tier)    |
+----------------------------------------------------------------------+
        the engine .wasm is fetched + compiled ONCE; switching tier
        re-streams only the Arrow tables (7-20 MB), never the engine.
        nothing leaves the origin; nothing leaves the tab.

Three dataset tiers, loaded on demand

The engine binary (~13 MB gz) and the dataset are fully decoupled, so the demo ships three fixture tiers and loads the smallest first for a fast first paint. You switch tier at runtime from the picker; the resident engine just re-registers its 24 tables from the new tier (registerIpc replaces a table in place — no engine reload, no page reload):

tier	knobs (cust/item/inv/mc)	gzipped	rows	queries returning rows
Tiny (default)	1200 / 480 / 100 / 48	~7 MB	~243 K	75 / 99
Medium	600 / 240 / 50 / 24	~12 MB	~444 K	80 / 99
Rich	300 / 120 / 25 / 12	~20 MB	~918 K	85 / 99

All three are dsdgen'd at full sf=1 (rich, correlated histories) then pruned to a deterministic entity-subset — denser tiers keep more seed entities, so more of the selective / multi-channel-self-join queries land on rows. The ladder is a clean ~2× geometric step in payload; the wins taper (a 3× payload buys +10 non-empty queries) because the remaining empties are genuine low-cardinality needles (see below). data/tiers.json is the index the UI reads to build the picker; each tier lives in its own data/<tier>/ dir with its own queries.json (the expected row counts are tier-specific).

Generated tier (no download) — the default

Alongside the three dsdgen fixture tiers, the demo offers a Generated tier and defaults to it, so the first paint downloads zero bytes of data — only the already-loaded engine binary. Picking it calls the engine's loadTpcds(sf) export, which synthesises all 24 tables in-browser via the in-memory tpc_gen generator (moved into spark-sql-core so it is reachable from wasm). sf=0.01 generates ~71K rows in well under a second in the wasm sandbox — no dsdgen, no FK-prune, no data/ fetch.

The generator is schema-faithful with referential integrity (returns reference real parent-sales rows; channel FKs are valid dimension keys), but uses uniform distributions rather than dsdgen's skews, so it has no fixed answer set — the UI suppresses the "matches reference" comparison for this tier and shows the actual results. It runs 72/99 queries non-empty, 0 errors. The ~27 empty queries are not bugs: 18 of them are empty on the bundled real-dsdgen tiny fixture too (and Trino's tpcds.tiny virtual catalog, a byte-for-byte dsdgen port, has the same empties) — at sf≈0.01 density those deeply-selective correlated-threshold / window-HAVING-ratio shapes simply have no qualifying rows, so an empty result matches the reference.

Largest synthesised tables at sf=0.01 (shown live in the demo's schema note via COUNT(*)):

table	rows
store_sales	28,800
catalog_sales	14,400
web_sales	7,200
household_demographics	7,200
store_returns	2,880
date_dim	2,192
customer_demographics	1,920
total (24 tables)	71,251

Smoke it headlessly (drives the real wasm module, needs only pkg/):

node gen-smoke.mjs

Quick start

bash build.sh          # compile wasm + generate/validate/stage ALL THREE tiers (~10 min cold)
bash serve.sh          # serve on http://localhost:8000/  (HOST=0.0.0.0 by default)
# open the URL; pick a tier (Tiny loads by default), click "Run all 99 >"

build.sh stages every tier by default. To iterate on just one:

STAGE=data TIER=tiny bash build.sh    # regenerate only the tiny tier
STAGE=wasm bash build.sh              # recompile only the engine

Headless end-to-end check (no browser needed -- drives the same wasm, across all three tiers, exercising the exact resident-engine hot-swap the picker uses):

node smoke.mjs              # representative subset of every tier (fast)
node smoke.mjs --all        # all 99 of every tier vs each tier's native reference
node smoke.mjs --tier rich  # restrict to one tier

How the data problem is solved

TPC-DS is a big-schema benchmark: 24 tables, 7 of them facts. Shipping a browser-sized dataset that still makes the 99 queries return rows takes two ideas, because two different things fight you.

1. Fixed-size mega-dimensions. Three dimension tables are the same size at every scale factor, so they don't shrink when you lower the scale:

table	rows at any SF	after FK-pruning
`customer_demographics`	1,920,800	~66,000
`time_dim`	86,400	~23,000
`date_dim`	73,049	~3,900

The generator FK-prunes each oversized dimension down to only the rows reachable from the fact tables (and from customer). A dimension row that no fact references can never affect any join result, so dropping it is loss-free. customer, customer_address and catalog_page are pruned the same way.

2. Coverage vs. size. A naive dsdgen(sf=0.01) is small, but its fact tables are so sparse that 27 of the 99 queries match zero rows — a single i_manufact_id = 128, or a customer who must be active in store and catalog and web across two years, lands on nothing. Raising the scale uniformly fixes coverage but blows the browser budget and still leaves the multi-channel self-joins empty.

The fix decouples the two: dsdgen at full sf=1 (rich, correlated histories where every year/month/category/state cell is populated), then keep a small, deterministic entity subset — a sparse uniform sample of customers/items plus a dense sample of the customers active in all three channels — together with all of each kept entity's facts across every channel and year. A kept customer therefore carries their entire cross-channel history, which is exactly what the q4/q11/q23/q31/q54/q64 self-joins need. The sample density is the tier knob: the Tiny tier keeps ~1/1200 of customers (~sf-0.02-sized, ~7 MB gz), the Rich tier ~1/300 (~sf-0.05-sized, ~20 MB gz, largest single file ~4.6 MB, under Cloudflare Pages' 25 MiB cap) — each with sf-1-like coverage. The knobs live in build.sh's TIERS ladder and are swept by scripts/tune-tpcds-wasm-fixtures.sh.

Data ships as Arrow IPC (typed), not JSON, so DATE/DECIMAL columns survive -- many TPC-DS queries filter on d_date BETWEEN cast('...' AS date), which JSON's stringly-typed coercion would break. The browser inflates the .arrow.gz files with the native DecompressionStream API (zero JS deps) and hands the raw IPC bytes to the engine.

Why some queries still return 0 rows

Every tier leaves some queries empty, and that's correct, not a bug: running DuckDB's own templates against full sf=1 shows those queries return ≤ 6 rows even at full scale (q17 is empty even there). They're genuine low-cardinality "needles" — e.g. q24 returns 1 row at sf=1 — that no browser-sized sample reliably preserves, which is why denser tiers recover them only gradually (Tiny 75 → Medium 80 → Rich 85 non-empty). The UI shows the native reference row count next to each result and a per-query dot in the gallery (green = rows, amber = empty). On the Rich tier, "Run all 99" reports 86 with rows / 13 empty / 0 engine-gap.

q59 divide-by-zero — fixed. Earlier builds hit a hard Arrow error: Divide by zero on q59 at the Rich tier: its stddev/mean coefficient-of-variation expression forms a zero denominator when a group's sum totals 0, and DataFusion's decimal/float divide kernels raise where Spark's non-ANSI semantics yield NULL. The engine now applies a spark-sql-core::spark_divide::NonAnsiDivideRule (masking a/b into a/nullif(b,0)) so divide-by-zero is NULL across int/decimal/float — q59 returns rows at every tier. (The sparser Tiny/Medium tiers never formed the zero-sum group, so q59 already returned 100 rows there; the fix makes Rich agree.)

Architecture: the persistent `SparkEngine`

The engine exposes a SparkEngine class (see crates/spark-wasm/src/wasm.rs) that registers tables once and runs many queries against the live context -- the right primitive for a 99-query workload. The one-shot run_sql_json_multi re-registers every table on every call, which would be pathological here.

const engine = new SparkEngine();                                   // build ONCE
for (const t of TABLES) engine.registerIpc(t, await fetchInflate(`./data/${tier}/${t}.arrow.gz`));
const rows = JSON.parse(engine.runJson(sql));                       // run any of the 99

// Switch tier: re-register the same 24 names with the new tier's bytes.
// registerIpc replaces a table in place — the engine is NOT rebuilt.
for (const t of TABLES) engine.registerIpc(t, await fetchInflate(`./data/${newTier}/${t}.arrow.gz`));

Three real wasm engine bugs this demo surfaced

End-to-end browser testing caught three execution bugs that the native test harness (which has a full tokio runtime, and historically only ever registered a table once) masked:

Parallel-plan panic. DataFusion parallelizes across target_partitions (>= CPU count) by default, spawning work onto a tokio::task::JoinSet. wasm32 is single-threaded with no such runtime, so any repartitioning plan trapped. Fixed by forcing target_partitions = 1 in build_context -- also the correct model for single-threaded wasm (mirrors DuckDB-WASM).
Missing reactor. Even single-partition, window / grouping-sets / multi-join plans still spawn onto a JoinSet, which needs a runtime context. The old block_on used futures::executor (no context) -> 10 of the 99 queries trapped with unreachable. Fixed by driving execution on a tokio current-thread runtime (the rt feature is wasm-clean -- no mio/socket reactor), which installs the context spawned tasks need.
Re-registration errored ("table already exists"). The tier hot-swap re-registers all 24 names on the resident engine. DataFusion's register_table errors on a duplicate name, so the second tier load trapped — Engine::register_batches had documented "replaces the earlier table" semantics it didn't actually implement. Fixed by deregistering any prior provider before re-registering (an upsert), making the documented contract true and the hot-swap real. Covered by the engine_reregister_replaces_in_place unit test.

After all three fixes the whole suite executes on the wasm engine with exact row-count parity against each tier's native reference, and tiers hot-swap on one resident engine. (q59 was historically the lone non-clean query on the Rich tier — a divide-by-zero in its stddev_samp/mean coefficient-of-variation expression when a group has a zero sum. That is now fixed by spark-sql-core::spark_divide::NonAnsiDivideRule, which gives divide-by-zero Spark's non-ANSI NULL semantics instead of raising, so q59 returns rows at every tier.)

Files

file	role
`index.html`	UI: tier picker, query gallery, run-all progress, SQL viewer, result table
`app.js`	loads engine once, fetches `tiers.json`, hot-swaps tiers, runs queries
`build.sh`	`STAGE=wasm` compiles the engine; `STAGE=data` generates all three tiers (`TIER=<id>` for one)
`serve.sh`	static HTTP server (python3)
`smoke.mjs`	headless Node E2E -- the wasm path across all tiers, asserted against each tier's native ref
`pkg/`	wasm-bindgen output (gitignored -- `build.sh` regenerates)
`data/tiers.json`	tier index the UI reads (default + per-tier size/row/coverage stats)
`data/<tier>/`	one dir per tier: `*.arrow.gz` + `queries.json` + `manifest.json` (gitignored)

The fixtures and query catalog are produced by two native bins in crates/spark-benchmarks:

gen-tpcds-wasm-fixtures -- dsdgen at sf=1 + coverage-preserving entity-subset + FK-prune + export parquet/IPC + manifest. The --keep-*-modulus knobs are what define each tier.
validate-tpcds-wasm -- load the IPC fixtures into the real SparkEngine, run all 99 queries natively (the exact browser code path), and emit queries.json (SQL + expected rows + table/feature tags + any native engine_error) plus a pass-rate report. This is the source of truth the browser's "matches reference" check and the engine-gap parity check compare against.

Privacy

The engine binary and the fixtures are fetched once, from this origin. Every query runs inside the wasm sandbox; results are computed in linear memory and rendered locally. Your queries and the data never leave the tab -- the same capability-based guarantee described in ~/dev-macrohard/jaaiworks/WASM-DUCKDB.md.

← back to the demo