TPC-DS in the browser - all 99 queries on the real Spark engine

This demo runs all 99 official TPC-DS queries against a genuine TPC-DS dataset, using the real spark-rust Spark SQL engine compiled to WebAssembly -- parse, plan, optimize, execute -- entirely inside the browser tab. No server, no backend, no data leaving the sandbox.

It's the multi-table, multi-query sibling of ../browser (which is a single-query scratchpad). Where that demo proves "Spark dialect in a tab", this one proves the engine is complete enough to run an industry-standard analytical benchmark end to end.

+---------------------------- browser tab -----------------------------+
|  app.js                                  spark-wasm (WebAssembly)    |
|  +--------------------+  registerIpc()   +------------------------+  |
|  | tier picker        | ---------------> | SparkEngine            |  |
|  | 99-query gallery   |                  |  * 24 TPC-DS MemTables |  |
|  | SQL viewer         |  runJson(sql)    |  * 391 Spark UDFs      |  |
|  | result table       | <--------------- |  * DataFusion executor |  |
|  | "Run all 99 >"     |   JSON rows      +------------------------+  |
|  +--------------------+                                              |
|         ^ fetch + gunzip (DecompressionStream)                       |
|   data/tiers.json  +  data/<tier>/*.arrow.gz  (24 tables / tier)    |
+----------------------------------------------------------------------+
        the engine .wasm is fetched + compiled ONCE; switching tier
        re-streams only the Arrow tables (7-20 MB), never the engine.
        nothing leaves the origin; nothing leaves the tab.

Three dataset tiers, loaded on demand

The engine binary (~13 MB gz) and the dataset are fully decoupled, so the demo ships three fixture tiers and loads the smallest first for a fast first paint. You switch tier at runtime from the picker; the resident engine just re-registers its 24 tables from the new tier (registerIpc replaces a table in place — no engine reload, no page reload):

tierknobs (cust/item/inv/mc)gzippedrowsqueries returning rows
Tiny (default)1200 / 480 / 100 / 48~7 MB~243 K75 / 99
Medium600 / 240 / 50 / 24~12 MB~444 K80 / 99
Rich300 / 120 / 25 / 12~20 MB~918 K85 / 99

All three are dsdgen'd at full sf=1 (rich, correlated histories) then pruned to a deterministic entity-subset — denser tiers keep more seed entities, so more of the selective / multi-channel-self-join queries land on rows. The ladder is a clean ~2× geometric step in payload; the wins taper (a 3× payload buys +10 non-empty queries) because the remaining empties are genuine low-cardinality needles (see below). data/tiers.json is the index the UI reads to build the picker; each tier lives in its own data/<tier>/ dir with its own queries.json (the expected row counts are tier-specific).

Quick start

bash build.sh          # compile wasm + generate/validate/stage ALL THREE tiers (~10 min cold)
bash serve.sh          # serve on http://localhost:8000/  (HOST=0.0.0.0 by default)
# open the URL; pick a tier (Tiny loads by default), click "Run all 99 >"

build.sh stages every tier by default. To iterate on just one:

STAGE=data TIER=tiny bash build.sh    # regenerate only the tiny tier
STAGE=wasm bash build.sh              # recompile only the engine

Headless end-to-end check (no browser needed -- drives the same wasm, across all three tiers, exercising the exact resident-engine hot-swap the picker uses):

node smoke.mjs              # representative subset of every tier (fast)
node smoke.mjs --all        # all 99 of every tier vs each tier's native reference
node smoke.mjs --tier rich  # restrict to one tier

How the data problem is solved

TPC-DS is a big-schema benchmark: 24 tables, 7 of them facts. Shipping a browser-sized dataset that still makes the 99 queries return rows takes two ideas, because two different things fight you.

1. Fixed-size mega-dimensions. Three dimension tables are the same size at every scale factor, so they don't shrink when you lower the scale:

tablerows at any SFafter FK-pruning
customer_demographics1,920,800~66,000
time_dim86,400~23,000
date_dim73,049~3,900

The generator FK-prunes each oversized dimension down to only the rows reachable from the fact tables (and from customer). A dimension row that no fact references can never affect any join result, so dropping it is loss-free. customer, customer_address and catalog_page are pruned the same way.

2. Coverage vs. size. A naive dsdgen(sf=0.01) is small, but its fact tables are so sparse that 27 of the 99 queries match zero rows — a single i_manufact_id = 128, or a customer who must be active in store and catalog and web across two years, lands on nothing. Raising the scale uniformly fixes coverage but blows the browser budget and still leaves the multi-channel self-joins empty.

The fix decouples the two: dsdgen at full sf=1 (rich, correlated histories where every year/month/category/state cell is populated), then keep a small, deterministic entity subset — a sparse uniform sample of customers/items plus a dense sample of the customers active in all three channels — together with all of each kept entity's facts across every channel and year. A kept customer therefore carries their entire cross-channel history, which is exactly what the q4/q11/q23/q31/q54/q64 self-joins need. The sample density is the tier knob: the Tiny tier keeps ~1/1200 of customers (~sf-0.02-sized, ~7 MB gz), the Rich tier ~1/300 (~sf-0.05-sized, ~20 MB gz, largest single file ~4.6 MB, under Cloudflare Pages' 25 MiB cap) — each with sf-1-like coverage. The knobs live in build.sh's TIERS ladder and are swept by scripts/tune-tpcds-wasm-fixtures.sh.

Data ships as Arrow IPC (typed), not JSON, so DATE/DECIMAL columns survive -- many TPC-DS queries filter on d_date BETWEEN cast('...' AS date), which JSON's stringly-typed coercion would break. The browser inflates the .arrow.gz files with the native DecompressionStream API (zero JS deps) and hands the raw IPC bytes to the engine.

Why some queries still return 0 rows

Every tier leaves some queries empty, and that's correct, not a bug: running DuckDB's own templates against full sf=1 shows those queries return ≤ 6 rows even at full scale (q17 is empty even there). They're genuine low-cardinality "needles" — e.g. q24 returns 1 row at sf=1 — that no browser-sized sample reliably preserves, which is why denser tiers recover them only gradually (Tiny 75 → Medium 80 → Rich 85 non-empty). The UI shows the native reference row count next to each result and a per-query dot in the gallery (green = rows, amber = empty, violet = engine-gap). On the Rich tier, "Run all 99" reports 85 with rows / 13 empty / 1 engine-gap: q59 hits a divide-by-zero in a stddev/mean coefficient-of-variation expression when a group's mean is zero — and the same error is produced by the native reference engine, so the demo renders it as a faithfully-reproduced engine gap (violet pill), not a wasm bug. The sparser Tiny/Medium tiers happen to avoid forming that zero-mean group, so q59 returns 100 rows there. Re-tuning empties vs payload is the tier knob (build.sh's TIERS ladder); see scripts/tune-tpcds-wasm-fixtures.sh.

Architecture: the persistent SparkEngine

The engine exposes a SparkEngine class (see crates/spark-wasm/src/wasm.rs) that registers tables once and runs many queries against the live context -- the right primitive for a 99-query workload. The one-shot run_sql_json_multi re-registers every table on every call, which would be pathological here.

const engine = new SparkEngine();                                   // build ONCE
for (const t of TABLES) engine.registerIpc(t, await fetchInflate(`./data/${tier}/${t}.arrow.gz`));
const rows = JSON.parse(engine.runJson(sql));                       // run any of the 99

// Switch tier: re-register the same 24 names with the new tier's bytes.
// registerIpc replaces a table in place — the engine is NOT rebuilt.
for (const t of TABLES) engine.registerIpc(t, await fetchInflate(`./data/${newTier}/${t}.arrow.gz`));

Three real wasm engine bugs this demo surfaced

End-to-end browser testing caught three execution bugs that the native test harness (which has a full tokio runtime, and historically only ever registered a table once) masked:

  1. Parallel-plan panic. DataFusion parallelizes across target_partitions (>= CPU count) by default, spawning work onto a tokio::task::JoinSet. wasm32 is single-threaded with no such runtime, so any repartitioning plan trapped. Fixed by forcing target_partitions = 1 in build_context -- also the correct model for single-threaded wasm (mirrors DuckDB-WASM).
  2. Missing reactor. Even single-partition, window / grouping-sets / multi-join plans still spawn onto a JoinSet, which needs a runtime context. The old block_on used futures::executor (no context) -> 10 of the 99 queries trapped with unreachable. Fixed by driving execution on a tokio current-thread runtime (the rt feature is wasm-clean -- no mio/socket reactor), which installs the context spawned tasks need.
  3. Re-registration errored ("table already exists"). The tier hot-swap re-registers all 24 names on the resident engine. DataFusion's register_table errors on a duplicate name, so the second tier load trapped — Engine::register_batches had documented "replaces the earlier table" semantics it didn't actually implement. Fixed by deregistering any prior provider before re-registering (an upsert), making the documented contract true and the hot-swap real. Covered by the engine_reregister_replaces_in_place unit test.

After all three fixes the whole suite executes on the wasm engine with exact row-count parity against each tier's native reference, and tiers hot-swap on one resident engine. (q59 is the lone non-clean query on the Rich tier — a genuine divide-by-zero in its stddev_samp/mean coefficient-of-variation expression when a group has a zero mean. The native reference engine errors identically, so queries.json records it as a known engine_error and both the browser and smoke.mjs treat it as cross-driver parity, not a regression. Tiny/Medium never form that zero-mean group, so q59 returns 100 rows there.)

Files

filerole
index.htmlUI: tier picker, query gallery, run-all progress, SQL viewer, result table
app.jsloads engine once, fetches tiers.json, hot-swaps tiers, runs queries
build.shSTAGE=wasm compiles the engine; STAGE=data generates all three tiers (TIER=<id> for one)
serve.shstatic HTTP server (python3)
smoke.mjsheadless Node E2E -- the wasm path across all tiers, asserted against each tier's native ref
pkg/wasm-bindgen output (gitignored -- build.sh regenerates)
data/tiers.jsontier index the UI reads (default + per-tier size/row/coverage stats)
data/<tier>/one dir per tier: *.arrow.gz + queries.json + manifest.json (gitignored)

The fixtures and query catalog are produced by two native bins in crates/spark-benchmarks:

Privacy

The engine binary and the fixtures are fetched once, from this origin. Every query runs inside the wasm sandbox; results are computed in linear memory and rendered locally. Your queries and the data never leave the tab -- the same capability-based guarantee described in ~/dev-macrohard/jaaiworks/WASM-DUCKDB.md.

← back to the demo