Generally when I want to run something with so much parallelism I just write a small Go program instead, and let Go's runtime handle the scheduling. It works remarkably well, and there's no execve() overhead either.
So, there are a few reasons why forkrun might work better than this, depending on the situation:
1. if what you want to run is built to be called from a shell (including multi-step shell functions) and not Go. This is the main appeal of forkrun in my opinion - extreme performance without needing to rewrite anything.
2. if you are running on NUMA hardware. Forkrun deals with NUMA hardware remarkably well - it distributes work between nodes almost perfectly with almost 0 cross-node traffic.
Actually, the thing that might need that would be make -j. Maybe it would make sense to make a BSD or GNU make version that integrates those optimizations. I am actually running a lot of stuff in Cygwin, and there are huge fork penalties on Windows, so my hope would be to actually get some real payloads faster...
I hate to say it, but forkrun probably won't work in Cygwin. I haven't tried it, but forkrun makes heavy use of Linux-only syscalls that I suspect aren't available in Cygwin.
forkrun might work under WSL2, as it's my understanding that WSL2 runs a full Linux kernel in a hypervisor.
AFAIK, the Go runtime is pretty NUMA-oblivious. The mcache helps a bit with locality of small allocations, but otherwise you aren't going to get the same benefits (though I absolutely hear you about avoiding execve overhead).
So...yes, the execve overhead is real. BUT there's still a lot you can accomplish with pure bash builtins (which don't have the execve overhead). And, if you're open to rewriting things (which would probably be required to some extent if you were to make something intended for shell to run in Go) you can port whatever you need to run into a bash builtin and bypass the execve overhead that way. In fact, doing that is EXACTLY what forkrun does, and is a big part of why it is so fast.
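For anyone who hasn't seen one: a bash loadable builtin is just a small C shared object that you load with `enable -f`, after which the command runs in-process with no fork and no execve. Below is a minimal skeleton of one, modeled on the examples/loadables/ directory in the bash source tree. The header locations and build flags depend on where your distro installs the bash headers, and this is not forkrun's actual builtin, just the general shape:

    /* hello.c - minimal bash loadable builtin skeleton (illustrative only) */
    #include <config.h>      /* bash's own config header; must come first */
    #include <stdio.h>
    #include "builtins.h"    /* struct builtin, BUILTIN_ENABLED */
    #include "shell.h"       /* WORD_LIST, EXECUTION_SUCCESS */

    /* The function bash calls when the builtin is invoked. */
    int hello_builtin(WORD_LIST *list)
    {
        printf("hello from inside bash - no fork, no execve\n");
        fflush(stdout);
        return EXECUTION_SUCCESS;
    }

    char *hello_doc[] = {
        "Print a greeting without forking.",
        (char *)NULL
    };

    /* The symbol bash looks for when you run `enable -f ./hello hello`. */
    struct builtin hello_struct = {
        "hello",            /* builtin name */
        hello_builtin,      /* implementing function */
        BUILTIN_ENABLED,    /* initial flags */
        hello_doc,          /* long documentation */
        "hello",            /* usage synopsis */
        0                   /* reserved */
    };

Build it as a shared object against the bash headers (something like `gcc -shared -fPIC -I<bash-headers> -o hello hello.c`; the include path varies by distro), then `enable -f ./hello hello` and it dispatches like any other builtin.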
0. vfork (which is sometimes better than CoW fork) + execve if exec is the only outcome of the spawned child. Or, use posix_spawn where available (a minimal sketch follows this list).
1. Inner-loop hot path code {sh,c}ould be made a bash built-in after proving that it's the source of a real performance bottleneck. (Just say "no" to premature optimization.) Otherwise, rewrite the whole thing in something performant enough like C, C++, Rust, etc.
2. I'm curious about the performance is of forkrun "echo ." in a billion jobs vs. say pure C doing it in 1 thread worker per core.
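Re: point 0, here is roughly what the posix_spawn path looks like. This is a generic sketch, nothing forkrun-specific; glibc typically implements posix_spawn with a vfork-style clone internally, so the caller never pays for a full CoW page-table copy:

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void)
    {
        pid_t pid;
        char *argv[] = { "echo", "hello", NULL };

        /* posix_spawnp searches PATH like execvp; no explicit fork in the caller. */
        int rc = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);
        if (rc != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(rc));
            return 1;
        }

        int status;
        waitpid(pid, &status, 0);
        return 0;
    }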
> I'm curious about the performance is of forkrun "echo ." in a billion jobs vs. say pure C
Short answer: in its fastest mode, forkrun gets very close to the practical dispatch limit for this kind of workload. A tight C loop would still be faster, but at that point you're no longer comparing “parallel job dispatch”—you're comparing raw in-process execution.
Let me try and at least show what kind of performance forkrun gives here. Let's set up 1 billion newlines in a file on a tmpfs:
cd /tmp
yes $'\n' | head -n 1000000000 > f1
Now let's try frun echo:
time { frun echo <f1 >/dev/null; }
real 0m43.779s
user 20m3.801s
sys 0m11.017s
forkrun in its "standard mode" hits about 25 million lines per second running newlines through a no-op (:), and ever so slightly less (23 million lines a second) running them through echo. The vast majority of this time is bash overhead. forkrun breaks up the lines into batches of (up to) 4096 (but for 1 billion lines the average batch size is probably 4095). Then for each batch, a worker-specific data-reading fd is advanced to the correct byte offset where the data starts, and the worker runs
mapfile -t -n $N -u $fd A # N is typically 4096 here
echo "${A[@]}"
The second command (specifically the array expansion into a long list of quoted empty args) is what is taking up the vast majority of the time. frun has a flag (-U) that causes it to replace `"${A[@]}"` with `${A[*]}`, which (in the case of all-empty inputs) collapses the long string of quoted empty args into a long run of spaces -> 0 args. This considerably speeds things up when inputs are all empty.
time { frun -U echo <f1 >/dev/null; }
real 0m13.295s
user 6m0.567s
sys 0m7.267s
And now we are at 75 million lines per second. But we are still largely limited by passing data through bash....which is why forkrun also has a mode (`-s`) where it bypasses bash mapfile + array expansion altogether and instead splices (via one of the forkrun loadable builtins) data directly to the stdin of whatever you are parallelizing. If you are parallelizing a bash builtin (where there is no execve cost) forkrun gets REALLY fast.
time { frun -s : < f1; }
real 0m0.985s
user 0m13.894s
sys 0m12.398s
which means it is delimiter scanning, dynamically batching, and distributing (in batches of up to 4096 lines) at a rate of OVER 1 BILLION LINES A SECOND, or ~250,000 batches per second.
At that point the bottleneck is basically just delimiter scanning and kernel-level data movement. There’s very little “scheduler overhead” left to remove—whether you write it in bash+C hybrids (like forkrun) or pure C.
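To make the `-s` path a bit more concrete: "splice data directly to the stdin of whatever you are parallelizing" is essentially a file-to-pipe splice from the shared data file into the pipe feeding the child's stdin. A rough sketch of that inner step (names and error handling are illustrative, not forkrun's actual code):

    #define _GNU_SOURCE
    #include <fcntl.h>      /* splice, SPLICE_F_MORE */
    #include <sys/types.h>

    /* Move one batch's bytes [off, off+len) from the shared data memfd straight
     * into the pipe feeding the worker's child stdin. The data never passes
     * through bash (no mapfile, no "${A[@]}" expansion) and never touches a
     * userspace buffer in the dispatcher. */
    static ssize_t feed_batch(int data_memfd, int child_stdin_pipe_wr,
                              loff_t off, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = splice(data_memfd, &off, child_stdin_pipe_wr, NULL,
                               len - done, SPLICE_F_MORE);
            if (n <= 0)
                return n;   /* error or EOF; real code handles this properly */
            done += (size_t)n;
        }
        return (ssize_t)done;
    }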
to a NUMA-Aware Contention-Free Dynamically-Auto-Tuning Bash-Native Streaming Parallelization Engine. I dare say 10 years is about the norm for going from "beginner" to "PhD-level" work.
I like it, and I hope it's soon going to be available in various Linux distributions, along with other modern tools such as fd instead of find, ripgrep instead of grep, and fzf, for instance.
I guess I've never really used parallel for anything that was bound by the dispatch speed of parallel itself. I've always used parallel for running stuff like ffmpeg on a folder of 200+ videos, and the speed with which parallel decides to queue up the jobs is going to be very thoroughly eaten by the cost of ffmpeg itself.
Still, worth a shot.
I have to ask, was this vibe-coded though? I ask because I see multiple em dashes in your description here, and a lot of no X, no Y... notation that Codex seems to be fond of.
ETA: Not vibe coded, I see stuff from four years ago...my mistake!
> I ask because I see multiple em dashes in your description here, and a lot of no X, no Y... notation that Codex seems to be fond of.
I asked a few LLMs for tips on writing the HN post. The post is my own words, but their style may have rubbed off on me a little bit. I'm admittedly better at the technical aspects than I am at "writing good catchy posts that don't turn into 20 pages of technical writing", so...
Have you ever run GNU Parallel on a powerful machine just to find one core pegged at 100% while the rest sit mostly idle?
I hit that wall...so I built forkrun.
forkrun is a self-tuning, drop-in replacement for GNU Parallel (and xargs -P) designed for high-frequency, low-latency shell workloads on modern and NUMA hardware (e.g., log processing, text transforms, HPC data prep pipelines).
On my 14-core/28-thread i9-7940x it achieves:
- 200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)
- ~95–99% CPU utilization across all 28 logical cores (vs ~6% for GNU Parallel)
- Typically 50×–400× faster on real high-frequency low-latency workloads (vs GNU Parallel)
These benchmarks are intentionally worst-case (near-zero work per task), where dispatch overhead dominates. This is exactly the regime where GNU Parallel and similar tools struggle — and where forkrun is designed to perform.
A few of the techniques that make this possible:
- Born-local NUMA: stdin is splice()'d into a shared memfd, then pages are placed on the target NUMA node via set_mempolicy(MPOL_BIND) before any worker touches them, making the memfd NUMA-striped.
- SIMD scanning: per-node indexers use AVX2/NEON to find line boundaries at memory bandwidth and publish byte-offsets and line-counts into per-node lock-free rings.
- Lock-free claiming: workers claim batches with a single atomic_fetch_add — no locks, no CAS retry loops; contention is reduced to a single atomic on one cache line.
- Memory management: a background thread uses fallocate(PUNCH_HOLE) to reclaim space without breaking the logical offset system (see the sketch just below).
…and that’s just the surface. The implementation uses many additional systems-level techniques (phase-aware tail handling, adaptive batching, early-flush detection, etc.) to eliminate overhead at every stage.
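The memory-management bullet above boils down to one call: punch a hole over byte ranges that every worker has already consumed. Roughly (a simplified sketch, not forkrun's actual code):

    #define _GNU_SOURCE
    #include <fcntl.h>   /* fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
    #include <errno.h>

    /* Free the pages backing [off, off+len) in the shared memfd once every batch
     * in that range has been consumed. KEEP_SIZE is mandatory with PUNCH_HOLE and
     * is exactly what we want: the file's logical size (and therefore every
     * already-published byte offset) stays valid; only the physical pages go away. */
    static int reclaim_range(int memfd, off_t off, off_t len)
    {
        if (fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len) < 0)
            return -errno;
        return 0;
    }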
In its fastest (-b) mode (fixed-size batches, minimal processing), it can exceed 1B lines/sec. In typical streaming workloads it's often 50×–400× faster than GNU Parallel.
forkrun ships as a single bash file with an embedded, self-extracting C extension — no Perl, no Python, no install, full native support for parallelizing arbitrary shell functions. The binary is built in public GitHub Actions so you can trace it back to CI (see the GitHub "Blame" on the line containing the base64 embeddings).
- Benchmarking scripts and raw results: https://github.com/jkool702/forkrun/blob/main/BENCHMARKS
- Architecture deep-dive: https://github.com/jkool702/forkrun/blob/main/DOCS
- Repo: https://github.com/jkool702/forkrun
Trying it is literally two commands: download frun.bash, then source it.
Happy to answer questions.
> Have you ever run GNU Parallel on a powerful machine just to find one core pegged at 100% while the rest sit mostly idle?
Not exactly, but maybe I haven't used large enough NUMA machines to run tiny jobs?
I think usually parallel saturates my CPU and I'd guess most CPU schedulers are NUMA-aware at this point.
If you care about short tasks maybe parallel is the wrong tool, but if picking the task to run is the slow part AND you prefer throughput over latency maybe you need batching instead of a faster job scheduling tool.
I'm pretty sure parallel has some flags to allow batching up to K-elements, so maybe your process can take several inputs at once. Alternatively you can also bundle inputs as you generate them, but that might require a larger change to both the process that runs tasks and the one that generates the inputs for them.
parallel works fine so long as the time per job is on the order of seconds or longer.
Let me give you an example of a "worst-case" scenario for parallel. Start by making a file on a tmpfs with 10 million newlines
yes $'\n' | head -n 10000000 > /tmp/f1
So, now let's see how long it takes parallel to push all these lines through a no-op. This measures the pure "overhead of distributing 10 million lines in batches". I'll set it to use all my CPU cores (`-j $(nproc)`) and to use multiple lines per batch (`-m`).
time { parallel -j $(nproc) -m : <f1; }
real 2m51.062s
user 2m52.191s
sys 0m6.800s
Average CPU utilization here (on my 14c/28t i9-7940x) is CPU time / real time = (2m52.191s + 6.800s) / 2m51.062s ≈ 1.05 cores.
Note that there is 1 process pegged at 100% usage the entire time that isn't doing any "work" in terms of processing lines - it's just distributing lines to workers. If we assume that thread averaged about 0.98 cores utilized, it means that throughout the run parallel managed to keep only around 0.066 out of 28 CPUs saturated with actual work.
Now let's try with frun
. ./frun.bash
time { frun : <f1; }
real 0m0.559s
user 0m10.409s
sys 0m0.201s
CPU utilization for frun is (10.409s + 0.201s) / 0.559s ≈ 18.98 cores. Let's compare the wall clock times: 171.062s / 0.559s ≈ 306x faster. Interestingly, if we look at the ratio of CPU utilization spent on real work:
18.98 / 0.066 ≈ 287x more CPU usage doing actual work
which gives a pretty straightforward story - forkrun is ~300x faster here because it is utilizing ~300x more CPU actually doing work.
This regime of "high frequency low latency tasks" - millions or billions of tasks that take milliseconds or microseconds each - is the regime where forkrun excels and tools like parallel fall apart.
Side note: if I bump it to 100 million newlines:
time { frun : <f1; }
real 0m4.212s
user 1m52.397s
sys 0m1.019s
CPU utilization: (112.397s + 1.019s) / 4.212s ≈ 27 out of 28 logical cores, which on a 14c/28t CPU doing no-ops...isn't bad.
Yes, to my extreme frustration. Thank you, I'm installing this right now while I read the rest of your comment.
I’m not a parallels kind of user but I can appreciate your craft and know how rewarding these odysseys can be :)
What was the biggest “aha” moment when you worked out how things interlock, or when you needed to make both change A and change B at the same time because either one on its own slowed it down? Etc. And what is the single biggest-impact design choice?
And if you’re objective, what could be done to other tools to make them competitive?
So, in forkrun's development there have been a few "AHA!" moments. Most of them were accompanied by a full re-write (current forkrun is v3).
The 1st AHA, and the basis for the original forkrun, was that you could eliminate a HUGE amount of the overhead of parallelizing things in shell if you use persistent workers, have them run things for you in a loop, and distribute data to them. This is why the project is called "forkrun" - it's short for "first you FORK, then you RUN".
The 2nd AHA, which spawned forkrun v2, was that you could distribute work without a central coordinator thread (which inevitably becomes the bottleneck). forkrun v2 did this by having 1 process dump data into a tmpfile on a ramdisk, then all the workers read from this file using a shared file descriptor and a lightweight pipe-based lock: write a newline into a shared anonymous pipe, read from pipe to acquire lock, write newline back to pipe to release it. FIFO naturally queues up waiters. This version actually worked really well, but it was a "serial read, parallel execute" design. Furthermore, the time it took to acquire and release a lock meant the design topped out at ~7 million lines per second. Nothing would make it faster, since that was the locking overhead.
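For reference, that v2 pipe lock is the classic "token in a pipe" pattern. The real thing was done with bash's read/echo on a shared fd; in C it would look something like this (error handling omitted):

    #include <unistd.h>

    /* One byte of "token" lives in the pipe. read() blocks until the token is
     * available (acquire); writing it back releases the lock and wakes one
     * blocked reader. All workers share the same pipe fds. */
    static void lock_init(int fd[2]) { pipe(fd); write(fd[1], "\n", 1); }
    static void lock_take(int fd[2]) { char c; read(fd[0], &c, 1); }
    static void lock_drop(int fd[2]) { write(fd[1], "\n", 1); }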
The 3rd AHA was that I could make a very fast (SIMD-accelerated) delimiter scanner, post the byte offsets where lines (or batches of lines) started in the global data file, and then workers could claim batches and read data in parallel, making the design fully "parallel read + parallel execute".
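To give a feel for what that scanner does: 32 bytes per compare, with the match mask turned into byte offsets. An illustrative sketch only (forkrun's real indexer also records batch boundaries, has a NEON path, and publishes offsets into the per-node rings):

    #include <immintrin.h>   /* AVX2 intrinsics; compile with -mavx2 */
    #include <stddef.h>
    #include <stdint.h>

    /* Call emit(offset, ctx) for every '\n' in buf[0..len). */
    static void scan_newlines(const char *buf, size_t len,
                              void (*emit)(size_t off, void *ctx), void *ctx)
    {
        const __m256i nl = _mm256_set1_epi8('\n');
        size_t i = 0;
        for (; i + 32 <= len; i += 32) {
            __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, nl));
            while (mask) {                       /* one iteration per newline found */
                emit(i + (size_t)__builtin_ctz(mask), ctx);
                mask &= mask - 1;                /* clear lowest set bit */
            }
        }
        for (; i < len; i++)                     /* scalar tail */
            if (buf[i] == '\n')
                emit(i, ctx);
    }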
The 4th AHA was regarding NUMA. It was "instead of reactively re-shuffling data between nodes, just put it on the right node to begin with". Furthermore, determine the "right node" using real-time backpressure from the nodes, with a 3-chunk buffer to ensure the nodes are always fed with data. This one didn't need a rewrite, but it is why forkrun scales SO WELL with NUMA.
> And if you’re objective, what could be done to other tools to make them competitive?
I wanted to reply separately to this bit, because I needed a bit of time to think about and respond to it.
To be frank, parallel optimizes for "breadth of features" and has, for example, the ability to coordinate distributed computing over ssh. But it fundamentally assumes that the workload itself will take dramatically longer than the coordination.
To really be competitive in "high-frequency low-latency workloads", where you have millions of inputs and each only takes microseconds, you would need a complete rewrite with an entirely different way of thinking.
Let me drop a few numbers just to drive this point home. Parallel is capable of batching and distributing around 500 batches of work a second. forkrun, in its "pass arguments via quoted cmdline args" mode is capable of batching and distributing around 10,000 batches a second. This is mostly limited by how fast bash can assemble long strings of quoted arguments to pass via the command line. In forkrun's `-s` mode, which bypasses bash entirely and splices data directly to the stdin of what you are parallelizing, forkrun is capable of batching and distributing over 200,000 batches a second.
The biggest architectural hurdle most existing tools have that makes it impossible to achieve forkrun's batch distribution rate is that almost all use a central distributor thread that forks each individual call (which is very expensive) and that is ALWAYS the bottleneck in high-frequency low-latency workloads. Pushing past this requires moving to a persistent worker model without a central coordinator. This alone necessitates a complete rewrite for basically all the existing tools.
That said, forkrun takes it so much further:
* It uses a SIMD-accelerated delimiter scanner + lock-free async IO so that workers not only execute in parallel but also read their inputs in parallel.
* It doesn't just use a standard "lock-free" design with CAS retry loops everywhere - it treats the problem like a physical pipeline of data flow and structurally eliminates contention between workers. The literal only "contention" is a single atomic on a single cache line - namely when a worker claims a batch by running `atomic_fetch_add` on a global monotonically increasing index (`read_idx`).
* It doesn't use heuristics - it uses a proper closed-loop control system. There is a 3-stage ramp-up (saturate workers -> geometric ramp -> backpressure-guided PID) to dynamically determine the batch size and the number of workers that works extremely well for all input types with 0 manual tuning.
* It keeps complexity in the slow path. Claiming a batch of lines literally just involves reading a couple shared mmap'ed vars and an `atomic_fetch_add` op in the fast path, which is why it can break 1 billion lines a second. The complexity is all so the slow path degrades gracefully, which is where it smartly trades latency for throughput (but only when throughput is limited by stdin to begin with).
* It treats NUMA as 1st class and chooses the "obvious in hindsight" path to just put data on the correct NUMA node from the very start instead of re-shuffling it between nodes reactively later.
I could go on, but the TL;DR is: to be competitive, other tools would really need to try and solve the "high-frequency low-latency stream parallelization" problem from first principles like forkrun did.
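To make the "single atomic on one cache line" point concrete, the worker-side claim boils down to something like the following. Field names and the ring layout are illustrative, not forkrun's actual structs, and end-of-input and wraparound handling are elided:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Lives in a shared, mmap'ed region visible to the indexer and all workers. */
    typedef struct {
        _Atomic uint64_t write_idx;                 /* batches published so far */
        _Atomic uint64_t read_idx;                  /* next batch to be claimed */
        struct { uint64_t off, len; } batch[4096];  /* ring of byte ranges      */
    } batch_ring;

    /* Fast path: one atomic_fetch_add claims a batch slot; no locks, no CAS
     * retry loop. The worker then reads its bytes independently via pread/splice. */
    static void claim_batch(batch_ring *r, uint64_t *off, uint64_t *len)
    {
        uint64_t idx = atomic_fetch_add_explicit(&r->read_idx, 1,
                                                 memory_order_relaxed);
        /* Slow path: wait for the indexer to publish this slot (real code also
         * detects end-of-input and backs off instead of spinning). */
        while (idx >= atomic_load_explicit(&r->write_idx, memory_order_acquire))
            ;
        *off = r->batch[idx % 4096].off;
        *len = r->batch[idx % 4096].len;
    }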
There's no "install" - you just need to source the `frun.bash` file. Downloading frun.bash and sourcing it works just fine. Directly sourcing a curl stream that grabs frun.bash from the git repo is just an alternate approach. It is not "required" by any means.
Forkrun is part of a vanishingly small number of projects written since the 1990s that get real work done as far as multicore computing goes.
I'm not super-familiar with NUMA, but hopefully its concepts might be applicable to other architectures. I noticed that you mentioned things like atomic add in the readme, so that gives me confidence that you really understand this stuff at a deep level.
My use case might eventually be to write a self-parallelizing programming language where higher-order methods run as isolated processes. Everything would be const by default to make imperative code available in a functional runtime. Then the compiler could turn loops and conditionals into higher-order methods since there are no side effects. Any mutability could be provided by monads enforcing the imperative shell, functional core pattern so that we could track state changes and enumerate all exceptional cases.
Basically we could write JavaScript/C-style code having MATLAB-style matrix operators that runs thousands of times faster than current languages, without the friction/limitations of shaders or the cognitive overhead of OpenCL/CUDA.
-
I feel that pretty much all modern computer architectures are designed incorrectly, which I've ranted about countless times on HN. The issue is that real workloads mostly wait for memory, since the CPU can run hundreds of times faster than load/store, especially for cache and branch prediction misses. So fabs invested billions of dollars into cache and branch prediction (that was the incorrect part).
They should have invested in multicore with local memories acting together as a content-addressable memory. Then fork with copy-on-write would have provided parallelism for free.
Instead, CPU progress (and arguably Moore's law itself) ended around 2007 with the arrival of the iPhone and Android, which sent R&D money to low-cost and low-power embedded chips. So the world was forced to jump on the GPU bandwagon, doubling down endlessly on SIMD instead of giving us MIMD.
Leaving us with what we have today: a dumpster fire of incompatible paradigms like OpenGL, Direct3D, Vulkan, Metal, TPUs, etc.
When we could have had transputers with unlimited compute and memory, scaling linearly with cost, that could run 3D and AI libraries as abstraction layers. Sadly that's only available in cloud computing currently.
We just got lucky that neural nets can run on GPUs. It would have been better to have access to the dozen or so other machine learning algorithms, especially genetic algorithms (which run poorly on GPUs).
Maybe your work can help bridge that gap.
forkrun's NUMA approach is really largely based on the idea that, as you said, "real workloads mostly wait for memory". The waiting for memory gets worse on NUMA systems because accessing memory from a different chiplet or a different socket means reaching data that is physically farther from the CPU and thus has higher latency. forkrun takes a somewhat unique approach to dealing with this: instead of taking data in, putting it somewhere, and reshuffling it around based on demand, forkrun immediately puts it on the correct NUMA node's memory as it comes in. This creates a NUMA-striped global data memfd. On NUMA systems forkrun duplicates most of its machinery (indexer + scanner + worker pool) per node, and each node's machinery is only offered chunks from the global data memfd that are already in node-local memory.
This directly aims to solve (or at least reduce the effect from) "CPUs waiting for memory" on NUMA systems, where the wait (if memory has to cross sockets) can be substantial.
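In code terms, the "born-local" trick is roughly: bind the allocating thread's memory policy to the target node, then splice the next chunk of stdin into the memfd, so the tmpfs pages backing that chunk are created on that node from the start. A simplified sketch (link with -lnuma for the set_mempolicy wrapper; node selection via backpressure, error handling, and restoring the previous policy are all glossed over, and this is not forkrun's actual code):

    #define _GNU_SOURCE
    #include <fcntl.h>       /* splice */
    #include <numaif.h>      /* set_mempolicy, MPOL_BIND, MPOL_DEFAULT */
    #include <sys/types.h>

    /* Splice the next `len` bytes of stdin into the shared data memfd at `off`,
     * with page allocation bound to `node` so the pages are born node-local. */
    static ssize_t splice_chunk_to_node(int stdin_fd, int data_memfd,
                                        loff_t off, size_t len, int node)
    {
        unsigned long nodemask = 1UL << node;        /* assumes node < 64 */
        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) != 0)
            return -1;
        ssize_t n = splice(stdin_fd, NULL, data_memfd, &off, len, SPLICE_F_MOVE);
        set_mempolicy(MPOL_DEFAULT, NULL, 0);        /* back to the default policy */
        return n;
    }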
curl isn't required - you just need to source the `frun.bash` file. Downloading frun.bash and sourcing it works just fine. Directly sourcing a curl stream that grabs frun.bash from the GitHub repo is just an alternate approach. It is not "required" by any means.
The difference you are seeing in this specific usage is because frun is dynamically adjusting batch size and worker count (by default it always begins with a batch size of 1 and a single worker). It is pretty darn good at pinning these down quickly, but with only 14k total inputs you are probably ending up with 2-3 times as many jq calls as you would with the batch size set to 100 inputs from the start, and you may not be fully spawning 32 workers.
If you want an apples-to-apples comparison, try running the following. This tells frun to use 100 lines per batch (-l 100) and 32 workers (-j 32). Please let me know how this one compares to the rush invocation in terms of runtime.
Side note: you should be able to use a space as the delimiter (-d ' ') as well.
NOTE: when I posted this reply, using a space as the delimiter was broken. I just pushed a PR to the forkrun main branch that fixes this. If you re-download frun.bash and source it in a new bash instance, then the space-delimited variant should work as well, and is the most direct apples-to-apples comparison to your rush command.
So I thought about this for a bit, and it actually doesn't surprise me all that much. It makes sense when you consider the following 2 things:
First, 14k items in batches of 100 is only 140 batches. 140 batches in 160 ms is not even 1000 batches per second. For reference, parallel tops out at around 500 per second (but is dreadfully slow), and forkrun, in its normal "passing quoted arguments via the cmdline" mode, can do about 10,000 batches per second. I have no doubt rush is far more capable of distributing batches quickly than parallel, so there's a good chance that "how fast the parallelization engine can distribute work" isn't the main bottleneck for either frun or rush in this particular workload.
Second, the way frun distributes batches is very efficient but requires setting up a substantial amount of supporting machinery. This puts (on my system) the "no-load run time" of forkrun at about 80 ms.
time { echo | frun :; }
real 0m0.078s
user 0m0.027s
sys 0m0.064s
And this 80 ms difference is pretty close to the time difference you are seeing. I'd bet the "minimum no-load time" for rush is considerably lower - perhaps a couple of ms.
forkrun is optimized for plowing through MASSIVE amounts of very fast-running inputs...it is capable of plowing through a billion (empty) inputs a second in its fastest mode. 14k inputs just isn't enough to amortize the startup of all the lock-free machinery.
I would venture to guess that if you repeat the same test but with 100x more inputs, the relative difference between frun and rush would be considerably less.
forkrun complements things like SLURM (and even MPI). forkrun is intra-node, and is all about utilizing all the resources of any given node as efficiently as possible, including when the node has a deep NUMA topology (e.g., it's EPYC-based). This allows SLURM and MPI to focus on inter-node work distribution and coordinating who gets to run things on which node and things like that.
tl;dr: forkrun takes over the "last mile" of actually running things on a given single node, so SLURM can focus on what it does best: efficiently allocating and distributing work to different nodes across the cluster.
I'll have to look into what would be required to package forkrun for the various distros. I'll try to make it happen in the near-ish future.
FWIW, I actually did play with Forkrun now, and it seems pretty neat!
Given the expertise on display and the 10yr on/off journey in building Forkrun, I'm sure there are folks like me who'd be glad to read those 20 pages!
Thanks for sharing your work.
This is the kind of buzz I search out in my own programming :)
Have fun and keep challenged :)