6 comments

  • Uriopass 2 hours ago
    > We couldn't just deploy and pray. 50,000 goroutines don't just disappear.

    They do once you restart the server. I'm unsure what the "phase 3 monitoring" is showing when the goroutines go down gradually. If you have new code to deploy you must recompile and therefore restart, and those goroutines are gone anyway.

    This feels like an AI made-up story but I'm not sure I understand the point of making this up.

    However, the goroutine leak is interesting! I hope what I learned from this post isn't a hallucination. For example, how could the subscriber send messages/heartbeats to the closed goroutine without an error...

    • cassonmars 2 hours ago
      I had the same confusion reading this – what kind of Go-based webservice framework lets you discretely deploy new handlers without restarts/redeploys? Would be a really awesome thing to have!
      • WJW 1 hour ago
        TBH it sounds like it would be extremely against the whole "simplicity, even at high costs" philosophy the Golang people strive for. Deploying new handlers into a running web service seems much more like something the Erlang people would be interested in.

        (and lo, the BEAM does indeed allow hot code reloads. I don't think this is commonly used in BEAM-based Erlang/Elixir web services though. Certainly the Gleam people don't seem to like hot reloads very much...)

  • tim-kt 2 hours ago
    This seems like an interesting problem and an interesting fix, but there is so much code and so little explanation that I am lost after "The Code That Looked Perfectly Fine". It also reads very much like AI. And FYI the "output" code blocks are (at least for me on Firefox) a dark gray on a darker gray background, so very unreadable.
  • sgt 33 minutes ago
    > DevOps noted memory was "trending up but within limits."

    I see this mentality now and again in DevOps teams. It reminds me of that "Not great, not terrible." quote from Chernobyl.

    There are just so many things being monitored, but holistic understanding is not always there. And that's why I believe more and more that the developers of a product should also (partly) be DevOps, rather than having a completely separate team with zero overlap.

  • mono442 2 hours ago
    This post seems AI generated.
    • LatencyKills 1 hour ago
      As a longtime Go developer, I found the bug and its fix interesting.

      If you found something wrong in the post, I'd really appreciate hearing about it.

      • tim-kt 1 hour ago
        While I agree that it's not important whether or not someone uses AI to improve a blog post or create code examples, this blog post seems like the output of the prompt "Write an interesting blog post about a goroutine leak". I don't have the expertise to verify whether what is written is actually correct or makes sense, but based on the other comments there seems to be some confusion about whether it's actual content or also AI-generated output.
        • LatencyKills 1 hour ago
          I do have expertise in Go. The bug was real, and the fix makes sense (though I couldn't verify it, of course).

          I just hope HN gets over the "but it might be AI!!" crap sooner rather than later and focuses on the actual content because these types of posts are never going away.

          • tim-kt 1 hour ago
            Personally, I just don't like the way this is written. As I said though, I am not an expert, so I may be outside the target group. I think the original "this is AI" comment is an automatic response that effectively means "this is low-effort", and in that sense I still think it is valid criticism.
            • LatencyKills 1 hour ago
              Fair enough - I appreciate your thoughts. I'll keep the "this is low-effort" == "this is AI" equivalence in mind moving forward.
          • ilogik 17 minutes ago
            I've done a similar fix, even a bit more interesting; however, I wouldn't consider it worthy of a blog post, let alone submitting it to HN.
      • MD87 1 hour ago
        The bug is somewhat interesting.

        The entire "Gradual recovery" part of the post makes absolutely no sense, and is presumably an LLM fabrication. That's just... not how anything works. And deploying three different weird little mitigations flies in the face of the earlier "We couldn't just restart production. Too many active users."

  • its-kostya 1 hour ago
    Having been plagued by Go's anti-pattern that is goroutine + channels* and having debugged far too many leaked goroutines myself, I'd suggest using the net/http/pprof package, which exposes the /debug/pprof endpoint for your Go process. Specifically, it serves runtime profiling data over HTTP in pprof format so you can collect and inspect CPU, heap, goroutine, block profiles, etc.

    Debugging becomes: hit the debug endpoint, get a list of all goroutines and their call stacks. A leaked goroutine shows up many times with the same call stack, and that's it. There is also a fancy graph you can make that visualizes allocated objects, if you have a mem leak and aren't sure whether it's memory or goroutines.
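
    A minimal sketch of wiring that up (the side port is just the usual convention, nothing from the post):

        package main

        import (
            "log"
            "net/http"
            _ "net/http/pprof" // imported for its side effect: registers /debug/pprof/* on http.DefaultServeMux
        )

        func main() {
            // Expose the profiling endpoints on a side port.
            go func() {
                log.Println(http.ListenAndServe("localhost:6060", nil))
            }()

            // ... the rest of the service runs as usual ...

            // Dump every goroutine with its full call stack:
            //   curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
            // Or open the allocation graph in a browser:
            //   go tool pprof -http=:8081 http://localhost:6060/debug/pprof/heap
            select {}
        }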

    * Anti-pattern because it is so easy to forgo good program design (like a solid state machine) and reach for a goroutine and communicate back with channels. Do that a few times and your code becomes spaghetti.

    • butvacuum 1 hour ago
      I'm surprised whatever IDE got used, or some stage of the build, didn't throw a warning for leaving an object that has a dispose interface undisposed.

      But I haven't touched Go. Unexciting .NET dev...

  • assbuttbuttass 1 hour ago
    > Writers kept sending to sub.messages. The channel grew. Memory grew.

    Channel buffers in Go have a fixed capacity; they will not grow unbounded. Once the buffer is full, senders simply block.
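
    A tiny sketch of that behaviour, for anyone skimming (not code from the article):

        package main

        import "fmt"

        func main() {
            ch := make(chan int, 2) // capacity is fixed at 2 and never grows

            ch <- 1
            ch <- 2
            fmt.Println(len(ch), cap(ch)) // 2 2 -- buffer is full

            select {
            case ch <- 3:
                fmt.Println("sent")
            default:
                // A third send would block rather than grow the buffer,
                // so this non-blocking attempt takes the default branch.
                fmt.Println("buffer full, send would block")
            }
        }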

    > Tickers Are Not Garbage Collected

    It used to be necessary in older versions to call ticker.Stop() to avoid leaking the underlying timer, but in recent versions (Go 1.23 and later) it's no longer necessary.

        // Start goroutines
        go s.pumpMessages(ctx, sub)
        go s.heartbeat(ctx, sub)
        
        // Monitor the connection
        go s.monitorConnection(ctx, sub)
    
    The "fixed" code is still using the fire-and-forget pattern for goroutines which encourages this kind of leak. Go makes it easy to add concurrency on the caller side, so it's usually better to write blocking functions that clean up all their goroutines before returning.

    In general this article screams AI, with most of the conclusions being hallucinated. Goroutine leaks are real, but it's hard to trust any of the article's conclusions.