I keep running into a frustrating issue with PyTorch DDP on a single node (2–8 GPUs): adding GPUs sometimes makes training slower instead of scaling proportionally, and it's hard to tell what's actually gating each step.
In practice I see:
- one rank silently becomes the “worst rank” and gates every step (a rough way I check for this is sketched below)
- step time spikes where GPUs look idle, but it's unclear whether the culprit is dataloader stalls, CPU contention, batch/sequence-length imbalance, or NCCL sync
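For context, this is roughly how I check for the worst-rank pattern today. A minimal sketch, assuming torch.distributed is already initialized (e.g. launched with torchrun) and `run_step` is a placeholder for the usual forward/backward/optimizer step; it is not the tool's code.

```python
# Hypothetical sketch (not the tool): per-rank step timing via CUDA events,
# gathered so every rank can see which rank is gating the step.
import torch
import torch.distributed as dist

def worst_rank_this_step(run_step, device):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    run_step()                               # placeholder: forward + backward + optimizer.step()
    end.record()
    torch.cuda.synchronize(device)           # make sure this step's GPU work has finished

    step_ms = torch.tensor([start.elapsed_time(end)], device=device)
    gathered = [torch.zeros_like(step_ms) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, step_ms)       # note: this adds its own per-step sync

    times = [t.item() for t in gathered]
    worst = max(range(len(times)), key=lambda r: times[r])
    if dist.get_rank() == 0:
        print(f"per-rank step ms: {[round(t, 1) for t in times]} -> worst rank {worst}")
    return worst
```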
Questions for folks who run multi-GPU training:
- What do you suspect first when scaling regresses on a single node?
- What signals do you look at to distinguish data vs compute vs comms/sync? (A crude version of what I measure today is sketched after this list.)
- Any repeatable workflow / checklist that gets you to root cause fast?
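For what it's worth, my current crude breakdown looks like the sketch below: wall-clock the dataloader fetch, wall-clock the GPU portion, and use a barrier to see how long each rank sits waiting on the slowest one. All names (`model`, `optimizer`, `batch_iter`) are placeholders, and because DDP overlaps the gradient allreduce with backward, the "gpu" bucket still mixes compute and comms; separating those cleanly is where I fall back to PyTorch Profiler or Nsight.

```python
# Hypothetical sketch of crude per-step attribution in a standard single-node DDP loop.
# "data_s" = time blocked on the dataloader, "gpu_s" = compute + overlapped gradient allreduce,
# "wait_s" = time this rank spends waiting for the slowest rank at a barrier.
import time
import torch
import torch.distributed as dist

def timed_step(batch_iter, model, optimizer, device):
    t0 = time.perf_counter()
    inputs, targets = next(batch_iter)          # blocks if the dataloader is behind
    t1 = time.perf_counter()

    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    loss = model(inputs, targets)               # placeholder: assumes the model returns a scalar loss
    loss.backward()                             # DDP overlaps gradient allreduce with backward
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize(device)              # flush queued kernels so the timing is real
    t2 = time.perf_counter()

    dist.barrier()                              # blocks until every rank reaches this point
    t3 = time.perf_counter()

    return {"data_s": t1 - t0, "gpu_s": t2 - t1, "wait_s": t3 - t2}
```

Rough reading: if `data_s` grows as you add GPUs, that points at dataloader/CPU contention; if `wait_s` is large on most ranks but small on one, that one rank is probably the gate.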
Context: I am building a small OSS tool that shows live per-rank step timing + stall attribution (always-on; not a replacement for PyTorch Profiler/Nsight). If you have a workload where DDP scaling is weird and you are willing to run a ~10-minute test, I am happy to help interpret results and prioritize support for your setup.
Repo: https://github.com/traceopt-ai/traceml