The hidden cost of Nx parallelism: when two layers compete

April 21, 2026

Increasing Nx’s parallelism looks like a free speedup, but it’s not as simple as it seems. Most modern build and test tools already parallelize internally, and Nx’s parallel flag stacks on top of that. Past a certain threshold, you pay the cost of oversubscription without getting any speedup in return, and sometimes you actually regress.

The setup

A synthetic Nx workspace with 96 packages (60 source files and 12 test files each), pinned to 16 logical CPUs via taskset to reflect a typical mid-tier CI runner. The build target calls esbuild directly, and the test target runs Vitest with its default inner thread pool of 7 workers. Each data point is a warmup run plus 3 measured runs, with the Nx cache disabled.

edbzn/nx-parallelism-benchmark

I ran the benchmark on a Ryzen 9 9950X3D (16 physical cores, 32 threads) with 64 GB of RAM running Ubuntu 25.10.

Results

Build with esbuild, on 16 CPUs, 96 packages

nx run-many -t build (lower is better)

--parallel=1

4,807 ms

--parallel=2

3,018 ms

--parallel=4

2,061 ms

--parallel=8

1,634 ms

--parallel=16

1,480 ms

--parallel=24

1,491 ms

--parallel=32

1,477 ms

The curve bends hard at parallel=16 and then flatlines. esbuild already saturates all 16 CPUs from inside a single process, so adding more Nx slots just duplicates work the Go runtime was going to do anyway. You pay RAM and scheduling overhead for no wall-clock win.

Test with Vitest (inner pool of 7 threads, its default), on 16 CPUs, 96 packages

nx run-many -t test (lower is better)

--parallel=1

38,764 ms

--parallel=2

22,124 ms

--parallel=4

15,545 ms

--parallel=8

14,552 ms

--parallel=16

14,780 ms

--parallel=24

14,960 ms

--parallel=32

15,051 ms

This one is worse than the build. Notice the sweet spot is at parallel=8, not 16, even though we have 16 CPUs: Vitest’s own 7-thread pool already fills the machine. At parallel=8 we’re running 8 × 7 = 56 threads on 16 cores, and that’s already the optimum. Past there, the curve climbs back up monotonically and parallel=32 ends up ~3.5% slower than parallel=8, with 32 × 7 = 224 threads fighting over 16 cores.

Note 📌

These numbers come from a synthetic workspace. On a real large monorepo with hundreds of packages, heavier bundles and bigger test suites, the degradation past the sweet spot is typically steeper: more RAM pressure, more GC, and more process startup cost. Treat the shape of the curve as the signal, not the absolute percentages.

Why it happens

When you set Nx’s parallel flag to N, Nx launches N task processes at the same time. Each of those processes is a full tool invocation that then does its own thing:

esbuild spawns a Go runtime that uses all available CPUs by default.
Vitest and Jest spawn their own pool of worker threads or forked processes.
tsc in build mode parses single-threaded but fans out I/O.
Vite build is just Rollup plus esbuild under the hood, so same story.

Each outer slot is a fresh Node process that spins up its own worker pool. The CPUs at the bottom don’t get more of themselves; they just get more guests fighting over the same seats.

What to do instead

Think of parallelism as a budget split between two layers: Nx on the outside and the tool on the inside. On CI you get better, more predictable results by keeping the inner layer quiet and letting Nx schedule across tasks.

Pin the tool to a single worker and let Nx do the parallelization:

Jest: --runInBand (runs all tests in the current process, no worker pool).
Vitest: --no-file-parallelism (runs test files sequentially inside a single worker).
esbuild: not really tunable, but since it’s already CPU-bound inside its own process, dropping Nx’s parallel value is usually the right lever.

With the inner pool pinned to 1, --parallel=N is the only knob that matters, and N ≈ your CPU count becomes a safe default. Measure from there.

Here’s the same Vitest sweep as above, but with --no-file-parallelism:

nx run-many -t test --no-file-parallelism (lower is better)

--parallel=1

128,705 ms

--parallel=2

64,672 ms

--parallel=4

33,987 ms

--parallel=8

19,265 ms

--parallel=16

13,881 ms

--parallel=24

13,859 ms

--parallel=32

13,905 ms

Two things to notice. First, the best time is ~5% lower than the best time of the stacked setup (13859 ms vs 14552 ms), so pinning the inner pool isn’t just about predictability, it’s also faster. Second, the curve is monotonically decreasing up to the CPU count and then flat, no regression past the sweet spot. That’s the shape you want on CI: one knob, no cliff.

Past a single machine’s sweet spot, the next speedup comes from horizontal distribution (Nx Agents, Nx Cloud, CI matrix jobs), not from cranking a single-machine knob. Watch memory too: each extra slot holds another Node process in RAM, which is a common cause of CI OOM-kills.

Tip 💡

On CI, start with jest --runInBand or vitest --no-file-parallelism and tune --parallel alone. One knob is easier to reason about than two.

TL;DR

Nx’s parallel flag doesn’t compose with tool-level parallelism, it multiplies it. On CI, pin the tool to a single worker (jest --runInBand, vitest --no-file-parallelism) and let --parallel do the scheduling. Measure, don’t guess.

Blog ideas

Upvote what you'd like me to write about next.

Nx DTE vs CI job matrix
Nx Distributed Task Execution (DTE) vs a traditional CI job matrix with manual sharding: a comparison of developer experience, performance, and reliability.
Are you ready for a monorepo?
An opinionated checklist to help you decide whether a monorepo is the right choice for your organization, and how to prepare for the transition.