Ask HN: Why does single-node DDP sometimes get slower with more GPUs?

Hi,

I keep running into a frustrating issue with PyTorch DDP on a single node (2–8 GPUs): adding GPUs sometimes makes training slower rather than scaling anywhere near proportionally, and it is hard to tell what is actually gating each step.

In practice I see:

one rank silently becomes the “worst rank” and gates every step (rough sketch of how I check for this below this list)

step time spikes while the GPUs look idle, but it's unclear whether the culprit is dataloader stalls, CPU contention, batch/sequence-length imbalance, or NCCL sync
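
For the “worst rank” symptom, here is roughly how I check today. This is a minimal sketch, not my tool, assuming torch.distributed is already initialized (e.g. via torchrun) and each rank has its CUDA device set; train_step is a placeholder for the usual forward/backward/optimizer call. The all_gather is itself a sync point, so I only turn it on while diagnosing.

    import time
    import torch
    import torch.distributed as dist

    def timed_step(train_step, batch):
        # Wall-clock time for this rank's step, with GPU work flushed on both sides.
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        train_step(batch)  # placeholder: fwd/bwd/optimizer for this rank
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    def gather_step_times(step_time):
        # Share every rank's step time with every rank (diagnostic-only extra sync).
        t = torch.tensor([step_time], device="cuda")
        out = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
        dist.all_gather(out, t)
        return [x.item() for x in out]

    # In the training loop:
    # times = gather_step_times(timed_step(train_step, batch))
    # if dist.get_rank() == 0:
    #     worst = max(range(len(times)), key=times.__getitem__)
    #     print(f"step times: {times} -> rank {worst} gated this step")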

Questions for folks who run multi-GPU training:

What do you suspect first when scaling regresses on a single node?

What signals do you look at to distinguish data vs compute vs comms/sync? (my current crude approach is sketched after this list)

Any repeatable workflow / checklist that gets you to root cause fast?
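
For the second question, this is the crude per-step attribution I do now, and where it falls short. It is a sketch assuming a standard DDP loop, with model, criterion, optimizer, and the loader iterator as placeholders.

    import time
    import torch

    def attribute_step(loader_iter, model, criterion, optimizer):
        # 1) Data wait: time spent blocked on the dataloader (CPU side).
        t0 = time.perf_counter()
        inputs, targets = next(loader_iter)
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        data_wait_s = time.perf_counter() - t0

        # 2) GPU step: forward/backward/optimizer measured with CUDA events.
        #    DDP overlaps the gradient all-reduce with backward, so this bucket
        #    mixes compute and NCCL time, which is the part I can't cleanly
        #    separate without a full profiler run.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        end.record()
        torch.cuda.synchronize()
        gpu_step_ms = start.elapsed_time(end)

        return {"data_wait_s": data_wait_s, "gpu_step_ms": gpu_step_ms}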

Context: I am building a small OSS tool that shows live per-rank step timing + stall attribution (always-on; not a replacement for PyTorch Profiler/Nsight). If you have a workload where DDP scaling is weird and you are willing to run a ~10-minute test, I am happy to help interpret results and prioritize support for your setup.

Repo: https://github.com/traceopt-ai/traceml
