GLM-5.1: Towards Long-Horizon Tasks

(z.ai)

181 points | by zixuanlimit 1 hour ago

13 comments

  • RickHull 1 hour ago
I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting Chinese characters. It is useless now for any serious coding work.
    • unicornfinder 45 minutes ago
I'm on their Pro plan and I respectfully disagree: it's genuinely excellent with GLM 5.1, so long as you remember to /compact once it hits around 100k tokens. Past that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways arguably better.
      • airstrike 31 minutes ago
100k tokens is basically nothing these days. Claude Opus 4.6 with its 1M context window is just a different ball game.
        • operatingthetan 11 minutes ago
The context windows of these Chinese open-source subscriptions (GLM, MiniMax, Kimi) are too small, and I'm guessing it's because they are trying to keep them cheap to run. Fine for OpenClaw, not so much for coding.
      • kay_o 21 minutes ago
Is manual compaction absolutely mandatory?
        • jauntywundrkind 15 minutes ago
I haven't screenshotted it, alas, but it goes from being a perfectly reasonable, chatty LLM to suddenly spewing words and nonsense characters around this threshold, at least for me as a z.ai Pro (mid-tier) user.

          For around a month the limit seemed to be a little over 60k! I was despondent!!

What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model itself is stable but z.ai is doing something wonky with infrastructure: trying to move from one context window size to another, or hitting KV cache issues, or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like another hint about KV caching, maybe the cache not porting well between differently shaped systems.

A more malicious read: this artificial limit also gives them a way to dial in system load. Simply not delivering the full context window the model supports reduces the work they have to host?

But to the question: yes, compaction is absolutely required. The AI can't even speak; it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could build this into the harness, so no; it's a limitation of our tooling that it doesn't work around the stated context window being (effectively) a lie.

          I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.

There's a thread at https://news.ycombinator.com/item?id=47678279, and I have more extensive history/comments on what I've seen there.

The question is: will this reproduce on other hosts, now that GLM-5.1 is released? I expect the issue is going to be z.ai-specific, given what I've seen (200k working -> 60k -> 100k context windows on glm-5.1).
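The harness-side auto-compaction suggested above can be sketched in a few lines. This is a hypothetical illustration, not z.ai's or OpenCode's actual mechanism: the 4-characters-per-token estimate, the threshold fraction, and the summary stub are all assumptions.

```python
# Sketch of harness-side auto-compaction: estimate the token count of the
# conversation and fold the oldest turns into a summary before the provider's
# effective context limit is reached. The chars/4 estimate and the 100k
# default limit are rough assumptions, not measured values.

def estimate_tokens(messages):
    """Crude token estimate: roughly 4 characters per token for English text."""
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages, limit=100_000, keep_recent=10):
    """If the estimated size nears `limit`, replace old turns with a stub."""
    if estimate_tokens(messages) < int(limit * 0.8):
        return messages  # still comfortably under the effective window
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # In a real harness this summary would come from an LLM call, not a stub.
    summary = {"role": "system",
               "content": f"[Summary of {len(old)} earlier messages elided]"}
    return [summary] + recent
```

Running this on every turn would compact automatically well before the observed ~100k breakdown point, at the cost of an extra summarization call.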

    • kay_o 58 minutes ago
I am on the mid-tier Coding plan, trying it out for the sake of curiosity.

During off-peak hours, a simple 3-line CSS change took over 50 minutes, and it routinely times out mid-tool-call, leaving dangling XML and tool calls everywhere, overwriting files badly, or patching duplicate lines into files.

    • wolttam 41 minutes ago
      This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent
    • LaurensBER 10 minutes ago
      I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.

      For the price this is a pretty damn impressive model.

    • Mashimo 41 minutes ago
      I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.

      But it's all casual side projects.

Edit: I often /compact at around 100,000 tokens or switch to a new session. Maybe that is why.

    • satvikpendem 42 minutes ago
Every model seems to go that way, back to even GPT-3 and 4: the company comes out with a very impressive model that then regresses over a few months as they try to rein in inference costs through quantization and other methods.
    • benterix 27 minutes ago
      > Obvious quantization issues

      Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?

    • esafak 16 minutes ago
      I'm on their Lite plan and I see some of this too. It is also slow. I use it as a backup.
    • margorczynski 42 minutes ago
It has been useless for a long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap, but that doesn't matter if it can't do what I want even after many repeated tries and attempts to push it toward a correct solution.
  • Yukonv 1 hour ago
Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB at 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run, even with high-end hardware.

    [0] https://huggingface.co/unsloth/GLM-5.1-GGUF

    • zozbot234 9 minutes ago
      SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.

Meanwhile we're even seeing emerging "engram" and "inner-layer embedding parameter" techniques, where the possibility of SSD offload is planned for in advance when the architecture is developed.
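The SSD-offload idea above can be illustrated in miniature: memory-map the weights file and touch only the slices you need, letting the OS page bytes in from disk on demand, so a file far larger than RAM can still be read from. (llama.cpp memory-maps GGUF files by default in the same spirit; the file layout and sizes below are made up for illustration.)

```python
import mmap
import struct

# Miniature illustration of mmap-based weight access: the OS pages data in
# from disk/SSD only when it is touched, instead of loading the whole file.
# The "weights" here are a stand-in: sequential float32 values.

def write_fake_weights(path, n_floats):
    """Write n_floats little-endian float32 values (0.0, 1.0, ...) to path."""
    with open(path, "wb") as f:
        for i in range(n_floats):
            f.write(struct.pack("<f", float(i)))

def read_weight_slice(path, start, count):
    """Read `count` float32 values starting at index `start` without
    reading the whole file into memory."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[start * 4:(start + count) * 4]
    return list(struct.unpack(f"<{count}f", raw))
```

With a 361 GB file the same pattern "works", but every cold page fault costs an SSD read, which is why the result is crawling rather than running.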

  • johnfn 7 minutes ago
    GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.
  • alex7o 58 minutes ago
To be honest I am a bit sad, as GLM 5.1 is producing much better TypeScript than Opus or Codex imo, but no matter what, it does sometimes go into schizo mode at some point over longer contexts. Not always though; I have had multiple sessions go over 200k and be fine.
    • disiplus 8 minutes ago
When it works and it's not slow, it can impress. Yesterday it solved something that Kimi K2.5 could not, and Kimi was the best open source model for me. But it's still slow sometimes. I have z.ai and Kimi subscriptions for when I run out of tokens for Claude (Max) and Codex (Plus).

I have a feeling it's nearing Opus 4.5 level, if they could fix it going crazy after like 100k tokens.

    • MegagramEnjoyer 28 minutes ago
      Why is that sad? A free and open source model outperforming their closed source counterparts is always a win for the users
    • DeathArrow 9 minutes ago
      After the context gets to 100k tokens you should open a new session or run /compact.
  • winterqt 10 minutes ago
Commenters here seem to be talking like they've used this model for longer than a few hours -- is this true, or are y'all just sharing your initial thoughts?
    • BeetleB 3 minutes ago
      It's been out for a while.
  • kirby88 25 minutes ago
I wonder how that compares to harness methods like MAKER: https://www.cognizant.com/us/en/ai-lab/blog/maker
  • gavinray 10 minutes ago
I find the "8 hour Linux Desktop" bit disingenuous; in the fine print it's a browser page:

      > "build a Linux-style desktop environment as a web application"
    
    They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements.

    We all know that building a spec-compliant browser alone is a herculean task.

  • DeathArrow 15 minutes ago
    I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.
  • jaggs 12 minutes ago
    How does it compare to Kimi 2.5 or Qwen 3.6 Plus?
    • eis 6 minutes ago
The blog post has a benchmark comparison table with these two in it.
    • DeathArrow 8 minutes ago
      Compared to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.
  • bigyabai 1 hour ago
    It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.

    For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.

    • cassianoleal 27 minutes ago
      I've done some very long sessions on OpenCode with Dynamic Context Pruning. Highly recommend it.

      https://github.com/Opencode-DCP/opencode-dynamic-context-pru...

    • embedding-shape 1 hour ago
      > It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts

Since the entire purpose, focus, and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue make it not an OK model? It's bad at the thing it's supposed to be good at, no?

      • wolttam 1 hour ago
        long(er) contexts (than the previous model)

        It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.

        It's a fine model

        • verdverm 53 minutes ago
          Have you tried gemma4?

          I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.

    • whimblepop 1 hour ago
That's not much, at least for the way I'm currently using LLMs. I have them do some Nix work (both debugging and coding) where accuracy and quality matter to me, so they're instructed to behave as I would when it comes to docs, always consulting certain docs and source code in a specific order. It's not unusual for them to chew through 200k-600k tokens in a single session before they solve everything I want them to. That's what I currently think of when I think of "long horizon within a single context window".

      So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.

    • azuanrb 1 hour ago
      Have you compared it with using Claude Code as the harness? It performs much better than opencode for me.
    • jauntywundrkind 59 minutes ago
      Chiming in to second this issue. It is wildly frustrating.

I suspect that this isn't the model, but something z.ai is doing with hosting it. At launch I was elated to find GLM-5.1 was stable even as the context window filled all the way up (~200k). Whereas GLM-5, while it could still talk and think, had forgotten the finer points of tool use to the point where it was making grievous errors as it went (burning gobs of tokens to fix duplicate-code problems).

However, real brutal changes happened sometime in the last two or three months: the parent's problem emerged, and emerged hard, out of nowhere. Worse, for me, it seemed to hit around a 60k context window, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless, that I could only work on small problems.

Thankfully the coherency barrier rose significantly around three weeks ago. It now seems to lose its mind and emit chaotic non-sentence gibberish around 100k for me. GLM-5 was already getting pretty shaky at this point, so I feel like I at least have some kind of parity. But at least GLM-5 was speaking and thinking in real sentences, and I could keep conversing with it somewhat, whereas GLM-5.1 seems to go from perfectly level-headed and working fine to all of a sudden total breakdown, a hard switch, at such a predictable context window size.

It seems so probable to me that it isn't the model making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from a small-context serving pool to a big-context serving pool, or something infrastructure-wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope and misery.

I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see, especially having tasted a working GLM-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation not to reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...

All such a shame, because aside from totally going mad and speaking unpunctuated gibberish, GLM-5.1 is clearly very, very good and I trust it enormously.

      • esseph 20 minutes ago
> "aside from totally going mad & speaking unpunctuated gibberish [...] I trust it enormously."

        The bar is very low :(

  • dang 1 hour ago
    [stub for offtopicness]

    [[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]

    • smith7018 1 hour ago
      Hmm, three spam comments posted within 9 minutes of each other. The accounts were created 15 minutes ago, 51 days ago, and 3 months ago.

      Interesting.

      Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.

      • dang 1 hour ago
        These comments are probably either by friends of the OP or perhaps associated with the project somehow, which is against HN's rules but not the kind of attack we're mostly concerned with these days. Old-fashioned voting rings and booster comments aren't existential threats and actually bring up somewhat nostalgic feelings at the moment!

        Thanks for watching out for the quality of HN...

        • ray__ 38 minutes ago
          Would love to read a Tell HN post about the kinds of attacks you are concerned with!
      • tadfisher 1 hour ago
        I moderate a medium-sized development subreddit. The sheer volume of spam advertising some AI SaaS company has skyrocketed over the past few months, like 10000%. Comment spam is now a service you can purchase [0][1], and I would not be surprised if Z.ai engaged some marketing firm which ended up purchasing this service.

        There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.

        [0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...

        [1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...

        [2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...

      • greenavocado 1 hour ago
        Z.ai Discord is filled to the brim with people experiencing capacity issues. I had to cancel my subscription with Z.ai because the service was totally unusable. Their Discord is a graveyard of failures. I switched to Alibaba Cloud for GLM but now they hiked their coding plan to $50 a month which is 2.5x more expensive than ChatGPT Plus. Totally insane.
    • zendi 1 hour ago
      [flagged]
    • louszbd 1 hour ago
      [flagged]
    • seven2928 1 hour ago
      [flagged]
  • aplomb1026 51 minutes ago
    [dead]
  • andrewmcwatters 59 minutes ago
    [dead]