110 comments

  • bcherny 12 hours ago
    Hey all, Boris from the Claude Code team here. I just responded on the issue, and cross-posting here for input.

    ---

    Hi, thanks for the detailed analysis. Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

    There's a lot here, so I'll try to break it down a bit. These are the two core things happening:

    > `redact-thinking-2026-02-12`

    This beta header hides thinking from the UI, since most people don't look at it. It *does not* impact thinking itself, nor does it impact thinking budgets or the way extended reasoning works under the hood. It is a UI-only change.

    Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).
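
    For reference, a minimal settings.json sketch of that opt-out (only the `showThinkingSummaries` key comes from this comment; the file shape around it is illustrative):

    ```json
    {
      "showThinkingSummaries": true
    }
    ```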

    If you are analyzing locally stored transcripts, you won't see raw thinking stored when this header is set, which is likely skewing the analysis. When Claude sees no thinking in the transcripts, it may not realize that the thinking still happened and is simply not user-facing.

    > Thinking depth had already dropped ~67% by late February

    We landed two changes in Feb that would have impacted this. We evaluated both carefully:

    1/ Opus 4.6 launch → adaptive thinking default (Feb 9)

    Opus 4.6 supports adaptive thinking, which is different from the fixed thinking budgets we used to support. In this mode, the model decides how long to think, which tends to work better than fixed budgets across the board. Set `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` to opt out.
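
    As a sketch, opting out from a shell before launching Claude Code (the `=1` value matches the interim workaround mentioned elsewhere in this thread; treat it as an assumption, not official docs):

    ```shell
    # Opt out of adaptive thinking; the "1" value follows the workaround
    # quoted later in this thread and is an assumption, not official docs.
    export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
    ```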

    2/ Medium effort (85) default on Opus 4.6 (Mar 3)

    We found that effort=85 was a sweet spot on the intelligence-latency/cost curve for most users, improving token efficiency while reducing latency. One of our product principles is to avoid changing settings on users' behalf, and ideally we would have set effort=85 from the start. We felt this was an important setting to change, so our approach was to:

    1. Roll it out with a dialog so users are aware of the change and have a chance to opt out

    2. Show the effort level the first few times you opened Claude Code, so the change wasn't surprising.

    Some people want the model to think for longer, even if it takes more time and tokens. To improve intelligence more, set effort=high via `/effort` or in your settings.json. This setting is sticky across sessions, and can be shared among users. You can also use the ULTRATHINK keyword to use high effort for a single turn, or set `/effort max` to use even higher effort for the rest of the conversation.

    Going forward, we will test defaulting Teams and Enterprise users to high effort, to benefit from extended thinking even if it comes at the cost of additional tokens & latency. This default is configurable in exactly the same way, via `/effort` and settings.json.

    • Wowfunhappy 3 hours ago
      > Under the hood, by setting this header we avoid needing thinking summaries, which reduces latency. You can opt out of it with `showThinkingSummaries: true` in your settings.json (see [docs](https://code.claude.com/docs/en/settings#available-settings)).

      Can I just see the actual thinking (not a summary), so that I get the thinking without a latency cost?

      I do really need to see the thinking in some form, because I often see useful things there. If Claude is thinking in the wrong direction I will stop it and make it change course.

      • andersa 17 minutes ago
        But you can't. Many times I've seen Claude write confusing, off-track nonsense in its thinking and then take the correct action anyway, as if that never happened. It doesn't work the way we want it to.
      • faitswulff 3 hours ago
        Anthropic's position is that thinking tokens aren't actually faithful to the internal logic that the LLM is using, which may be one reason why they started to exclude them:

        https://www.anthropic.com/research/reasoning-models-dont-say...

        • libraryofbabel 2 hours ago
          That's interesting research, but I think a more important reason that you don't have access to them (not even via the bare Anthropic api) is to prevent distillation of the model by competitors (using the output of Anthropic's model to help train a new model).
          • xvector 32 minutes ago
            If distilled models were commercially banned they'd probably be willing to show the thinking again.
        • AquinasCoder 2 hours ago
          I somewhat understand Anthropic's position. However, thinking tokens are useful even if they don't show the internal logic of the LLM. I often realize I left out some instruction or clarification in my prompt while reading through the chain of reasoning. Overall, this makes the results more effective.

          It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.

        • grey-area 2 hours ago
          So, like many of the promises from AI companies, the reported chain of thought is not actually faithful (see the results quoted below). I suppose this is unsurprising given how these models function.

          Is the chain of thought even added to the context, or is it extraneous babble providing a plausible post-hoc justification?

          People certainly seem to treat it as it is presented: as a series of logical steps leading to an answer.

          "After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful."

          • brainwad 5 minutes ago
            I mean, obviously, it's not going to be a faithful representation of the actual thinking. The model isn't aware of how it thinks any more than you are aware how your neurons fire. But it does quantitatively improve performance on complex tasks.
    • richardjennings 11 hours ago
      I was not aware the default effort had changed to medium until the quality of output nosedived. This cost me perhaps a day of work to rectify. I now ensure effort is set to max and have not had a terrible session since. Please may I have an "always try as hard as you can" mode?
      • Avamander 8 hours ago
        I feel like the maximum effort mode kind of wraps around and becomes "desperate", to the point of laziness or monkey's-paw behavior, similar to what you get from lower effort modes or a poor prompt.
        • svnt 2 hours ago
          I’m going in circles. Let me take a step back and try something completely different. The answer is a clean refactor.

          Wait, the simplest fix is the same hack I tried 45 minutes ago but in a different context. Let me just try that.

          Wait,

      • Schiendelman 9 hours ago
        That's /effort max!
        • richardjennings 9 hours ago
          You cannot control the effort setting sub-agents use, and you also cannot use /effort max as a default (outside of using an alias).
          • bazhand 7 hours ago
            export CLAUDE_CODE_EFFORT_LEVEL=max
            • wild_egg 6 hours ago
              Does that apply to subagents?
      • clevergadget 7 hours ago
        bad citizen
    • anonymoushn 7 hours ago
      How do you decide which settings should be configurable via environment variables but not settings files, and which via settings files but not environment variables?
      • bcherny 5 hours ago
        All environment variables can also be configured via settings files (in the “env” field).

        Our approach generally is to use env vars for more experimental and low usage settings, and reserve top-level settings for knobs that we expect customers will tune more frequently.
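
        As a sketch, routing an env var through settings.json via that field (the variable name is the one another commenter mentions for effort; whether it exists and applies to subagents is an open question in this thread):

        ```json
        {
          "env": {
            "CLAUDE_CODE_EFFORT_LEVEL": "max"
          }
        }
        ```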

      • make3 4 hours ago
        the Claude writing the code itself decides probably
    • koverstreet 11 hours ago
      There's been more going on than just the default to medium level thinking - I'll echo what others are saying, even on high effort there's been a very significant increase in "rush to completion" behavior.
      • bcherny 11 hours ago
        Thanks for the feedback. To make it actionable, would you mind running /bug the next time you see it and posting the feedback id here? That way we can debug and see if there's an issue, or if it's within variance.
        • JamesSwift 8 hours ago

            a9284923-141a-434a-bfbb-52de7329861d
            d48d5a68-82cd-4988-b95c-c8c034003cd0
            5c236e02-16ea-42b1-b935-3a6a768e3655
            22e09356-08ce-4b2c-a8fd-596d818b1e8a
            4cb894f7-c3ed-4b8d-86c6-0242200ea333
          
          Amusingly (not really), this is me trying to resume sessions in order to get feedback IDs. It was an absolute chore to get it to give me the commands to resume these conversations, and it kept messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef
          • bcherny 6 hours ago
            Thanks for the feedback IDs — read all 5 transcripts.

            On the model behavior: your sessions were sending effort=high on every request (confirmed in telemetry), so this isn't the effort default. The data points at adaptive thinking under-allocating reasoning on certain turns: the specific turns where it fabricated (Stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. We're investigating with the model team. Interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.

            • diavelguru 4 hours ago
              Love this: responding to users, investigating with detailed info, action being taken (or at least it seems so).
            • onoesworkacct 3 hours ago
              This kind of thing is harder for regular end-users to understand following the change removing reasoning details.
        • matheusmoreira 2 hours ago
          I just asked Claude to plan out and implement syntactic improvements for my static site generator. I used plan mode with Opus 4.6 max effort. After over half an hour of thinking, it produced a very ad-hoc implementation with needless limitations instead of properly refactoring and rearchitecting things. I had to specifically prompt it in order to get it to do better. This executed at around 3 AM UTC, as far away from peak hours as it gets.

          b9cd0319-0cc7-4548-bd8a-3219ede3393a

          > You're right to push back. Let me be honest about both questions.

          > The @() implementation is ad-hoc

          > The current implementation manually emits synthetic tokens — tag, start-attributes, attribute, end-attributes, text, end-interpolation — in sequence.

          > This works, but it duplicates what the child lexer already does for #[...], creating two divergent code paths for the same conceptual operation (inline element emission). It also means @() link text can't contain nested inline elements, while #[a(...) text with #[em emphasis]] can.

          I just feel like I can't trust it anymore.

          • koverstreet 2 hours ago
            That's pretty much been my day - today was genuinely bad, and I've been putting up with a lot of this lately.

            Now on Qwen3.5-27b, and it may not be quite as sharp as Opus was two months ago, but we're getting work done again.

            • matheusmoreira 1 hour ago
              Literally two weeks ago it was outputting excellent results while working with me on my programming language. I reviewed every line and tried to understand everything it did. It was good. I slowly started trusting it. Now I don't want to let it touch my project again.

              It's extremely depressing because this is my hobby and I was having such a blast coding with Claude. I even started trying to use it to pivot to professional work. Now I'm not sure anymore. People who depend on this to make a living must be very angry indeed.

              • jacquesm 1 hour ago
                I can see how that works: this is like building a dependency, a habit if you wish. The tighter you couple your workflow to these tools, the more dependent you become, and the greater the let-down if and when they fail. And they will always fail; it just depends on how long you work with them and how complex the stuff you are doing is. Sooner or later you will run into the limitations of the tooling.

                One way out of this is to always keep yourself in the loop. Never let the work product of the AI outpace your level of understanding because the moment you let that happen you're like one of those cartoon characters walking on air while gravity hasn't reasserted itself just yet.

                • matheusmoreira 39 minutes ago
                  Good advice about the dependency. This stuff is definitely addictive. I've been in something of a manic episode ever since I subscribed to this thing. I started getting anxious when I hit limits.

                  I wouldn't say that Claude is failing though. It's just that they're clearly messing with it. The real Opus is great.

        • koverstreet 11 hours ago
          I'll have a look. The CoT switch you mentioned may help, and I'll take a look at that too, but my suspicion is that this isn't a CoT issue; it's a model preference issue.

          Comparing Opus vs. Qwen 27b on similar problems, Opus is sharper and more effective at implementation, but it will flat out ignore issues and insist "everything is fine" on problems that Qwen is able to spot and demonstrate solid understanding of. Opus understands the issues perfectly well; it just avoids them.

          This correlates with what I've observed about the underlying personalities (and you put out a paper the other day that shows you're starting to understand it in these terms: functionally modeling feelings in models). On the whole, Opus is very stable personality-wise and an effective thinker, and I want to compliment you on that; it definitely contrasts with behaviors I've seen from OpenAI. But when I do see Opus miss things it should get, it seems to be a combination of avoidant tendencies and too much of a push from RLHF to "just get it done and move on to the next task".

          • necrotic_comp 2 hours ago
            Opus definitely pushes me to ignore problems. I've had to tell it multiple times to be thorough, and we tend to go back and forth a few times every time that happens. :)
          • jchanimal 7 hours ago
            One of the things we've seen at vibes.diy is that if you have a list of jobs and agents with specialized profiles, and you ask them to pick the best job for themselves, that can change some of the behavior you described at the end of your post for the better.
        • freedomben 11 hours ago
          How much of the code/context gets attached in the /bug report?
          • bcherny 11 hours ago
            When you submit a /bug we get a way to see the contents of the conversation. We don't see anything else in your codebase.
            • murkt 9 hours ago
              Was there a change in Claude Code system prompt at that time that nudges Claude into simplistic thinking?

              Here is a gist that tries to patch the system prompt to make Claude behave better https://gist.github.com/roman01la/483d1db15043018096ac3babf5...

              I haven't personally tried it yet. I certainly do battle Claude quite a lot with "no, I don't want the quick-and-easy wrong solution just because it's two lines of code; I want the best solution in the long run".

              If the system prompt indeed prefers laziness in 5:1 ratio, that explains a lot.

              I will submit /bug in my next few conversations, when it next occurs.

              • Avamander 7 hours ago
                That Gist does explain quite a few flaws Claude has. I wonder if MEMORY.md is sufficient to counteract the prompt without patching.
              • dev_l1x_be 8 hours ago
                Holy sweet LLM, this gist is crazy. Why did they do this to themselves? I am going to try this at home, it might actually fix Claude.
                • murkt 7 hours ago
                  Remember Sonnet 3.5 and 3.7? They were happy to throw abstraction on top of abstraction on top of abstraction. Still a lot of people have “do not over-engineer, do not design for the future” and similar stuff in their CLAUDE.md files.

                  So I think the system prompt just pushes it way too hard to “simple” direction. At least for some people. I was doing a small change in one of my projects today, and I was quite happy with “keep it stupid and hacky” approach there.

                  And in the other project I am like “NO! WORK A LOT! DO YOUR BEST! BE HAPPY TO WORK HARD!”

                  So it depends.

            • andoando 5 hours ago
              Isn't the codebase in the context window?
              • frog437 4 hours ago
                Depending on how large your codebase is, hopefully not. At this point, use something like the IX plugin to ingest the codebase and track context, rather than relying on the LLM itself.
                • frog437 1 hour ago
                  This is crazy..

                  tokensSaved = naiveTokens - actualTokens

                    - naiveTokens = 19.4M — what ix estimates it would have cost to answer your queries without graph intelligence (i.e., dumping full files/directories into context)                                    
                    - actualTokens = 4.7M — what ix's targeted, graph-aware responses actually used
                    - tokensSaved = 14.7M — the difference
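
                  The quoted arithmetic checks out; a quick sketch using the figures above (values in tokens, as ix reported them):

                  ```shell
                  # Reproduce the ix savings math quoted above (values in tokens).
                  naive=19400000    # estimated cost without graph intelligence
                  actual=4700000    # what the graph-aware responses actually used
                  saved=$((naive - actual))
                  echo "tokensSaved = ${saved}"   # 14700000, i.e. 14.7M
                  ```
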
      • stefan_ 9 hours ago
        There's also been tons of thinking leaking into the actual output. Recently it even added thinking into a code patch it made (a[0] &= ~(1 << 2); // actually let me just rewrite { .. 5 more lines setting a[0] .. }).
    • johndough 10 hours ago
      I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

      That kind of consistency has also been my own experience with LLMs.

      • windexh8er 6 hours ago
        I just had this conversation today. It's hilarious that things like Skills and Soul and all of these anthropomorphized files could just be a better laid out set of configuration files. Yet here we are treating machines like pets or worse.
      • monatron 10 hours ago
        To be fair, I can think of reasons why you would want to be able to set them in various ways.

        - settings.json - set for machine, project

        - env var - set for an environment/shell/sandbox

        - slash command - set for a session

        - magical keyword - set for a turn

      • SAI_Peregrinus 9 hours ago
        It's not unique to LLMs. Take BASH: you've got `/etc/profile`, `~/.bash_profile,` `~/.bash_login`, `~/.bashrc`, `~/.profile`, environment variables, and shell options.
        • subscribed 5 hours ago
          Yeah, but for bash those files have wildly different purposes. I don't think it's so distinct with cc.
          • hackerbrother 4 hours ago
            I don't think they're wildly different purposes. They're the same purpose (to set shell settings) with different scopes (all users, one user, interactive shells only, etc.).
      • larpingscholar 7 hours ago
        You are yet to discover the joys of the managed settings scope. They can be set three ways. The claude.ai admin console; by one of two registry keys e.g. HKLM\SOFTWARE\Policies\ClaudeCode; and by an alphabetically merged directory of json files.
      • ggdxwz 10 hours ago
        Especially since some settings are in settings.json and others in .claude.json, sometimes I have to go through both to find the one I want to tweak.
      • brookst 6 hours ago
        Way more than that: settings.json and settings.local.json in the project directory's .claude/, and both files can also live in ~/.claude.

        MCP servers can be set in at least 5 of those places plus .mcp.json

      • bmitc 4 hours ago
        There are also settings available in some offerings and not in others. For example, the Anthropic Claude API supports setting model temperature, but the Claude Agent SDK doesn't.
    • plexicle 11 hours ago
      Ultrathink is back? I thought that wasn't a thing anymore.

      If I am following.. "Max" is above "High", but you can't set it to "Max" as a default. The highest you can configure is "High", and you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?

      • bcherny 11 hours ago
        Yep, exactly
        • dostick 10 hours ago
          Mentioning ULTRATHINK in prompt is the equivalent to /effort max?
          • merlindru 8 hours ago
            Yes but only for the message that includes it. Whereas /effort max keeps it at max effort the entire convo, to my knowledge
    • w10-1 11 hours ago
      Here's the reply in context:

      https://github.com/anthropics/claude-code/issues/42796#issue...

      Sympathies: users now completely depend on their jet-packs. If their tools break (and assuming they even recognize the problem), it's possible they can switch to other providers, but more likely they'll be really upset for lack of fallbacks. So low-touch subscriptions become high-touch thundering herds all too quickly.

    • taspeotis 4 hours ago
      Hi, thanks for Claude Code. I was wondering, though, if you'd consider adding a mode to make text green with characters coming down from the top of the screen individually, like in The Matrix?
    • aizk 12 hours ago
      How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?
      • bcherny 11 hours ago
        A mix of evals and vibes.
        • giwook 11 hours ago
          What's that ratio exactly
        • efields 10 hours ago
          "Evals and vibes" can I put that on a t shirt?
        • capnchaos 11 hours ago
          Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.
      • cududa 7 hours ago
        Remember when they shipped that version that didn't actually start or run? At work we were goofing on them a bit, until I said "Wait, how did their tests even run on that?" and we realized that whatever their CI/CD process is, it wasn't running on the actual release binary at the time... I can imagine their variation on how most engineers think about CI/CD is probably indicative of some other patterns (or lack of traditional patterns).

        As someone who used to work on Windows, I had a vision of a similarly scoped e2e testing harness, like Windows Vista/7's (knowing about bugs/issues doesn't mean you can necessarily fix them... hence Vista, then 7), and assumed Anthropic must provide some enterprise guarantee backed by this testing matrix I imagined must exist. Long way of saying: I think they might just YOLO regressions by constantly updating their testing/acceptance criteria.

        Why not provide pinnable versions or something? This episode, and two wasted months of suboptimal productivity, hits on the absurdity of constantly changing the user/system prompt and doing so much of the R&D and feature development via two brittle prompts with unclear interplay. Until there's a composable system/user prompt framework they reliably develop tests against, I would personally prefer pegged, selectable versions. But each version probably has known critical bugs they're dancing around, so there is no version they'd feel comfortable making a pegged stable release.

        • misnome 7 hours ago
          about once a week I get a claude "auto update" that fails to start with some bun error on our linux machines. It's beyond laughable.
      • try-working 9 hours ago
        I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow
    • giancarlostoro 3 hours ago
      I only ever use high effort. The only thing I've run into: sometimes I ask Claude to do every item on a list and not stop until they're all done, and it finishes maybe 80% of them, then says "I've stopped doing things" for no reasonable reason. I don't need it to run for 18 hours nonstop, but the 10 or 20 minutes more it would have kept going wouldn't have hurt, especially when I am usually on Claude Code during off-hours, and on the Max plan.

      Part of me wants to give lower "effort" a try, but I always wind up with a mess. I don't even like using Haiku or Sonnet; Haiku goofs, and in my experience Haiku and Sonnet are better as subagent models where Opus tells them what to do.

      • theptip 1 hour ago
        I’ve been playing with

            /loop 5m check if you have any actionable tasks 
        
        for this scenario.
    • thomascountz 53 minutes ago

         This beta header hides thinking from the UI, since most people don't look at it.
      
      How is this measured?
    • dc_giant 11 hours ago
      All right, so what do I need to do so it does its job again? Disable adaptive thinking, set effort to high, and/or use ULTRATHINK again (which, a few weeks ago, Claude Code kept telling me is useless now)?
      • bcherny 11 hours ago
        Run this: /effort high
        • berkanunal 11 hours ago
          Imagine if all service providers were behaving like this.

          > Ahh, sorry we broke your workflow.

          > We found that `log_level=error` was a sweet spot for most users.

          > To make it work as you expect, run `./bin/unpoop`; it will set log_level=warn

          • hackboyfly 45 minutes ago
            Yeah it’s stupid.

            What makes me more annoyed is HN users here actually simping for Claude.

            “Hi thank you for Claude Code even though you nerfed the subscriptions, btw can I get red text instead of green?”

      • stldev 10 hours ago
        You can't. This is Anthropic leveraging their dials, and ignoring their customers for weeks.

        Switch providers.

        Anecdotally, I've had no luck attempting to revert to prior behavior using either high/max-level thinking (Opus) or prompting. The web interface, though, doesn't seem problematic for me when using Opus extended.

        • taylorfinley 6 hours ago
          I've actually switched back to the web chat UI and copying Python files for much of my work because CC has been so nerfed.
    • linsomniac 2 hours ago
      Hey Boris, thanks for this reply. I've been kind of scratching my head over this issue, assuming I'm just not doing "complex engineering", because since Opus 4.6 my seat-of-the-pants assessment is that it's a huge improvement. It's been like night and day in my use. Full disclosure: I use high effort for basically everything.
    • anonymoushn 7 hours ago
      > One of our product principles is to avoid changing settings on users' behalf

      Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

      I happen to think this is just true in general. But another reason it might be true: the experience the user has is identical to the experience they would have had if you had first introduced the setting, defaulting to the existing behavior, and then subsequently changed it on users' behalf.

    • KenoFischer 10 hours ago
      While we have you here, could you fix the bash escaping bug? https://github.com/anthropics/claude-code/issues/10153
    • DennisL123 11 hours ago
      Happy to have my mind changed, yet I am not 100% convinced closing the issue as completed captures the feedback.
      • bcherny 11 hours ago
        From the contents of the issue, this seems like a fairly clear default effort issue. Would love your input if there's something specific that you think is unaddressed.
        • vecter 11 hours ago
          From this reply, it seems that it has nothing to do with `/effort`: https://github.com/anthropics/claude-code/issues/42796#issue...

          I hope you take this seriously. I'm considering moving my company off of Claude Code immediately.

          Closing the GH issue without first engaging with the OP is just a slap in the face, especially given how much hard work they've done on your behalf.

          • wonnage 10 hours ago
            The OP “bug report” is a wall of AI slop generated from looking at its own chat transcripts
            • nipponese 3 minutes ago
              It's only slop if it's wrong or irrelevant.
            • vecter 10 hours ago
              Do you disagree with any of the data or conclusions?
              • adi_kurian 5 hours ago
                I must admit, the fact that the writing was well formatted and structured was an instant turn off. I did find it insightful. I would have been more willing to read it if it was one lower case run on line with typos one would expect from a prepubescent child. I am both joking and being serious at the same time. What a world.
              • wonnage 8 hours ago
                Yes
                • vecter 8 hours ago
                  I'm open to hearing, please elaborate
        • JamesSwift 10 hours ago
          I commented on the GH issue, but I've had effort set to "high" for however long it's been available and have seen a marked decline since... checks notes... about 23 March, according to Slack messages I sent to the team to see if I was alone (I wasn't).

          EDIT: actually, the first glaring issue I remember was on 20 March, when it hallucinated a full SHA from a short SHA while updating my GitHub Actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. I've also had it answer with hallucinated information instead of looking online first (to a higher degree than I've been used to after using these models daily for the past ~6 months).

          • dev_l1x_be 9 hours ago
            It hallucinated a GUID for me instead of using the one in the RFC for WebSockets. The fun part was that the beginning was the same. Then it hardcoded the unit tests to be green with the wrong GUID.
        • DennisL123 9 hours ago
          Gotcha. It seemed, though, from the replies on the GitHub ticket that at least some of the problem was unrelated to effort settings.
    • jacquesm 1 hour ago
      Textbook example of how to respond to your customers, kudos.
    • ai_slop_hater 11 hours ago
      > This beta header hides thinking from the UI, since most people don't look at it.

      I look at it, and I am very upset that I no longer see it.

      • bcherny 11 hours ago
        There is a setting if you'd like to continue to see it: showThinkingSummaries.

        See the docs: https://code.claude.com/docs/en/settings#available-settings

        • starkparker 11 hours ago
          > Thinking summaries will now appear in the transcript view (Ctrl+O).

          Also: https://github.com/anthropics/claude-code/issues/30958

          • ai_slop_hater 11 hours ago
            I also have a similar experience with their API: some requests stall for minutes with zero events coming in from Anthropic. Presumably the model is doing this "extended thinking", but there's no way to see it. I treat these requests as stuck and retry. Same experience in Claude Code with Opus 4.6 when effort is set to "high": the model gets stuck for ten minutes (at which point I cancel) and the token count indicator doesn't increase.

            I am not buying what this guy says. He is either lying or not telling us everything.

        • antonvs 11 hours ago
          > As I noted in the comment,

          Piece of free PR advice: this is fine in a nerd fight, but don't do this in comments that represent a company. Just repeat the relevant information.

          • bcherny 11 hours ago
            Fair feedback, edited!
          • trvz 9 hours ago
            Piece of free advice towards a better civilisation: people who didn't even read the comment they're replying to shouldn't be rewarded for their laziness.
            • ai_slop_hater 9 hours ago
              I read his comment and still replied. I think his claim that nobody reads thinking blocks and that thinking blocks increase latency is nonsense. I am not going to figure out which settings I need to enable because after reading this thread I cancelled my subscription and switched over to Codex. Because I had the exact same experience as many in this thread.

              Also what is that "PR advice"—he might as well wear a suit. This is absolutely a nerd fight.

              • ai_slop_hater 8 hours ago
                Alright, I just tested that setting and it doesn't work.

                https://i.imgur.com/MYsDSOV.png

                I tested because I was porting memories from Claude Code to Codex, so I might as well test. I obviously still have subscription days remaining.

                There is another comment in this thread linking a GitHub issue that discusses this. The GitHub issue this whole HN submission is about even says that Anthropic hides thinking blocks.

    • yubblegum 10 hours ago
      > Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

      "This report was produced by me — Claude Opus 4.6 — analyzing my own session logs. ... Ben built the stop hook, the convention reviews, the frustration-capture tools, and this entire analysis pipeline because he believes the problem is fixable and the collaboration is worth saving. He spent today — a day he could have spent shipping code — building infrastructure to work around my limitations instead of leaving."

      What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?

      • razodactyl 7 hours ago
        Bad feedback loops. It's hard to tell with such a massive report if the numbers are real or bad data.

        The worst part is how big AI generated reports are - so much time spent in total having to read fluff.

      • delusional 38 minutes ago
        I think it's absolutely hilarious.

        > Ohh my precious baby, you've been oh so smart in writing to me.

        He says, before dismantling everything reported in the issue. If the depth of thinking was so great (maybe if he had ULTRATHINK'd?), you'd think he would have found an actual problem.

    • JohnMakin 11 hours ago
      I’ve seen you/anthropic comment repeatedly over the last several months about the “thinking” in similar ways -

      “most users dont look at it” (how do you know this?)

      “our product team felt it was too visually noisy”

      etc etc. But every time something like this is stated, your power users (people here for the most part) state that this is dead wrong. I know you are repeating the corporate line here, but it’s bs.

      • alasano 3 hours ago
        Last time he made the front page he said the same things.

        https://news.ycombinator.com/item?id=46978710

        Then proceeded to fix nothing whatsoever.

        It really does feel like he's just doing mostly what he wants and talking on behalf of vague made up users while real users complain on GitHub issues.

      • jitl 3 hours ago
        building for the loud users on a forum is generally a losing move. if we built notion for angry HN users, we'd probably be a great obsidian competitor with end to end encryption, have zero ai features, and make zero money.
      • exfalso 8 hours ago
        It's to prevent distillation. Duh
        • JohnMakin 44 minutes ago
          of course that’s the reason but don’t pretend it’s some user guided decision
      • wonnage 10 hours ago
        Anecdotally the “power users” of AI are the ones who have succumbed to AI psychosis and write blog posts about orchestrating 30 agents to review PRs when one would’ve done just fine.

        The actual power users have an API contract and don’t give a shit about whatever subscription shenanigans Claude Max is pulling today

        • razodactyl 7 hours ago
          Generalisations and angry language but I almost agree with the underlying message.

          New tools, turbulent methods of execution. There's definitely something here in the way of how coding will be done in future but this is still bleeding edge and many people will get nicked.

        • JohnMakin 9 hours ago
          Uh, no. Definitely not me at all.
          • wonnage 8 hours ago
            [flagged]
            • JohnMakin 8 hours ago
              Whatever makes you feel better about yourself, I guess. My account history on this topic is pretty easily searchable, but I guess it's easier to make driveby comments like this than be informed.
              • wonnage 7 hours ago
                Lol the only thing that looking at your comment history brought is that you repeatedly bring up your long comment history.
    • starkparker 11 hours ago
      > You can also use the ULTRATHINK keyword to use high effort for a single turn

      First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/

      • giwook 9 hours ago
        Pretty sure it's still gone and you should be using effort level now for this.
        • xvector 25 minutes ago
          No, ultrathink is back and it's the same thing as high effort for the message in which it is included
    • potsandpans 6 hours ago
      For anyone reading this and wondering where the truth could possibly be:

      We can't really know what the truth is, because Anthropic is tightly controlling how you interact with their product and provides their service through opaque processes. So all we can do is speculate. And in that speculation there's a lot of room (for the company) to bullshit or provide equally speculative responses, and (for outsiders) to search for all plausible explanations within the solution space. So there's not much to act on. We're effectively stuck with imprecise heuristics and vibes.

      But consider what we do know: the promise is that Anthropic is providing a black-box service that solves large portions of the SDLC. Maybe all of it. They are "making the market" here, and their company growth depends on this bet. This is why these processes are opaque: they have to be. Anthropic, OpenAI and a few others see this as a zero-sum game. The winner "owns" the SDLC (and really, if they get their way the entire PDLC). So the competitive advantage lies in tightly controlling and tweaking their hidden parameters to squeeze as much value and growth as possible.

      The downside is that we're handing over the magic for convenience and cost. A lot of people are maybe rightly criticizing the OP of the issue because they're staking their business on Claude Code in a way that's very risky. But this is essentially what these companies are asking for. The business model end game is: here's the token factory, we control it and you pay for the pleasure of using it. Effectively, rent-seeking for software development. And if something changes and it disrupts your business, you're just using it incorrectly. Try turning effort to max.

      Reading responses like this from these company representatives makes me increasingly uneasy because it's indicative of how much of writing software is being taken out from under our feet. The glimmer of promise in all of this though is that we are seeing equity in the form of open source. Maybe the answer is: use pi-mono, a smattering of self hosted and open weights models (gemma4, kimi, minimax are extremely capable) and escalate to the private lab models through api calls when encountering hard problems.

      Let the best model win, not the best end to end black box solution.

      • mvkel 5 hours ago
        I am reminded of OpenAI's first voice-to-voice demo a couple of years ago. I rewatched it and was shocked at how human it was; indiscernible from a real person. But the voice agent that we got sounds 20% better than Siri.

        There's a hope that competition is what keeps these companies pushing to ship value to customers, but there are also billions in compute expense at stake, so there seems to be an understanding that nobody ships a product that is unsustainably competitive.

      • vachina 4 hours ago
        Don’t turn vibe coding into your day job (because the vibe won’t keep vibing). Write code (that you own) that can make you money and hire real developers.
    • zenoware 3 hours ago
      > CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING

      Why not just give people the ability to set a default thinking level instead of manually setting it to `max` all the time?

    • diavelguru 4 hours ago
      As soon as that change came through I set the effort to high. Have not regretted it for any coding task. It feels the same as Dec-Jan, though it's now spawning more sub-agents, which is not a bad thing.
    • hellojimbo 7 hours ago
      The last time I typed ultrathink, i got a prompt saying that you no longer need to type ultrathink
    • ting0 11 hours ago
      Do you guys realize that everyone is switching to Codex because Claude Code is practically unusable now, even on a Max subscription? You ask it to do tasks, and it does 1/10th of them. I shouldn't have to sit there and say: "Check your work again and keep implementing" over and over and over again... Such a garbage experience.

      Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?

      • misnome 6 hours ago
        Or, ask it to make a plan, and it makes a good plan! It explicitly notes how validation is to take place on each stage!

        And then it does every stage without running any of the validation. It's your agent's plan; it should probably be generated in a way that your own agent can follow.

    • migali49g 9 hours ago
      Hi Boris, thanks for addressing this and providing feedback quickly. I noticed the same issue. My question is, is it enough to do /efforts high, or should I also add CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING to my settings?
    • niteshpant 5 hours ago
      I added `CLAUDE_CODE_EFFORT_LEVEL=max` to my shell's env so that every session is always effort:max by default

      :)
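
      A hedged sketch of the same setup (the variable name is taken from the comment above and not verified against official docs):

```shell
# Set max effort for every Claude Code session by default.
# CLAUDE_CODE_EFFORT_LEVEL is as given in the parent comment (unverified).
export CLAUDE_CODE_EFFORT_LEVEL=max
echo "$CLAUDE_CODE_EFFORT_LEVEL"
```

      Putting the export line in your shell profile (e.g. ~/.bashrc or ~/.zshrc) makes it apply to every new terminal.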

    • matheusmoreira 11 hours ago
      I definitely noticed the mid-output self-correction reasoning loops mentioned in the GitHub issue in some conversations with Opus 4.6 with extended reasoning enabled on claude.ai. How do I max out the effort there?
    • CjHuber 1 hour ago
      I honestly am very disappointed with this. I've only learned about CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING and showThinkingSummaries: true from this post. I've been wondering for a while where the summaries went and am always hoping, like roulette, that it thinks a lot. No wonder, if there suddenly is an "adaptive thinking" mode. I would have opted out 2 months ago if it had been documented or communicated publicly in any way. Why change behavior without notice or any new user-facing settings?

      I just googled "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING" and it seems like many people don't know about it.

      And ULTRATHINK sets the effort to high, but then there is also /effort max?

    • nickvec 10 hours ago
      Hey Boris, would appreciate if you could respond to my DM on X about Claude erroneously charging me $200 in extra credit usage when I wasn't using the service. Haven't heard back from Claude Support in over a month and I am getting a bit frustrated.
      • HumanOstrich 6 hours ago
        Did the receipt show it as being a gift? There's a lot of fraud happening the past few months with Claude Code Gift purchases. Anthropic support is ignoring all of it and just not responding to support requests.

        Happened to a close friend of mine. A bit of digging revealed the same pattern with fraudulent gift purchases for several other people before I stopped looking. They were also being ignored by Anthropic support. One since January.

        Apparently they're so short on inference resources they can't run their support bots. Maybe banning usage of Claude Code with Claude will allow them to catch up on those gift fraud tickets.

        Took a long time for me to reach this level of scathing. It is not unwarranted.

    • raincole 11 hours ago
      > I wanted to say I appreciate the depth of thinking & care that went into this.

      The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."

      • vlovich123 11 hours ago
        It's also pretty standard corporate speak to make sure you don't alienate any users / offend anyone. That's why corporate speak is so bland.
      • rafaelmn 10 hours ago
        Ticket is AI generated but from what I've seen these guys have a harness to capture/analyze CC performance, so effort was made on the user side for sure.
        • notatallshaw 9 hours ago
          The note at the end of the post indicates the user asked Claude to review their own chat logs. It's impossible to tell if Claude used or built a performance harness or just wrote those numbers based on vibes.
        • gardnr 9 hours ago
          There is this 3rd party tracker: https://marginlab.ai/trackers/claude-code/
    • ting0 11 hours ago
      Thinking time is not the issue. The issue is that Claude does not actually complete tasks. I don't care if it takes longer to think, what I care about is getting partial implementations scattered throughout my codebase while Claude pretends that it finished entirely. You REALLY need to fix this, it's atrocious.
    • j45 10 hours ago
      Thanks for the update.

      Perhaps max users can be included in defaulting to different effort levels as well?

    • ctoth 11 hours ago
      [flagged]
      • bcherny 9 hours ago
        Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.
      • quietsegfault 11 hours ago
        I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.
        • ctoth 11 hours ago
          Fair point on tone. It's a bit of a bind isn't it? When you come with a well-researched issue as OP did, you get this bland corporate nonsense "don't believe your lyin' eyes, we didn't change anything major, you can fix it in settings."

          How should you actually communicate in such a way that you are actually heard when this is the default wall you hit?

          The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.

          • enraged_camel 9 hours ago
            I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option being off by default as of recently.
          • wonnage 10 hours ago
            Just use a different tool or stop vibe coding, it’s not that hard. I really don’t understand the logic of filing bug reports against the black box of AI
            • geysersam 9 hours ago
              People file tickets against closed source "black box" systems all the time. You could just as well say: Stop using MS SQL, just use a different tool, it's not that hard.
              • wonnage 7 hours ago
                Equivalent of filing a ticket against the slot machine when you lose more often than expected
                • HumanOstrich 6 hours ago
                  Well now you're just being silly and I can't take you seriously.
            • HumanOstrich 6 hours ago
              The only "black box" here is Anthropic. At least an LLM's performance and consistency can be established by statistical methods.
        • malfist 11 hours ago
          Is somebody saying "you're holding it wrong" a "people willing to help"?
          • TeMPOraL 9 hours ago
            They are if you are, in fact, holding it wrong.

            As was the usual case in most of the few years LLMs existed in this world.

            Think not of iPhone antennas - think of a humble hammer. A hammer has three ends to hold by, and no amount of UI/UX and product design thinking will make the end you like to hold to be a good choice when you want to drive a Torx screw.

          • Retr0id 11 hours ago
            [flagged]
        • throwaway613746 10 hours ago
          [dead]
        • BigTTYGothGF 10 hours ago
          The stated policy of HN is "don't be mean to the openclaw people", let's see if it generalizes.
      • lambda 11 hours ago
        I guess one of the things I don't understand is how you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, to be predictable in its behavior.

        It seems like people are expecting LLM based coding to work in a predictable and controllable way. And, well, no, that's not how it works, and especially so when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup its running on, the harness, the system prompts, etc. It's all just vibes, you're vibe coding and expecting consistency.

        Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.

        • dev_l1x_be 9 hours ago
          The problem is degradation. It was working much better before. There are many people, including a well-known person [0] as well as my circle of friends and me, who were working on projects around the Opus 4.6 rollout time when suddenly our workflows started to degrade like crazy. If I did not have many quality gates between an LLM session and production I would have faced certain data loss and production outages, just like some famous company did. The fun part is that the same workflow that was reliably going through the quality gates before suddenly failed with something trivial. I cannot pinpoint what exactly Claude changed, but the degradation is there for sure. We are currently evaluating alternatives to have an escape hatch (Kimi, ChatGPT, and Qwen are the best candidates so far, plus Nemotron). The only issue with alternatives was (before the Claude leak) how well the agentic coding tool integrates with the model and the tool use, and there are several improvements happening already, like [1]. I am hoping the gap narrows and we can move off permanently. No more "hoops, you are right, I should not have attempted to delete the production database" moments.

          [0]: https://x.com/theo/status/2041111862113444221

          [1]: https://x.com/_can1357/status/2021828033640911196

          • techpression 1 hour ago
            Curious as to how many people are using 4.6; perhaps you’re on a subscription? I use the API, and 4.6 (this also goes for Sonnet) has been unusable since launch because it eats through tokens like it’s actually made that way (to make more money / hit limits faster). I guess it makes sense from a financial perspective, but once 4.5 goes away I will have to find another provider if they continue like this :/
        • stavros 9 hours ago
          Same as how I expect a coin to come up heads 50% of the time.
          • muyuu 8 hours ago
            If you get consistently nowhere near 50% then surely you know you're not throwing a fair coin? What would complaining to the coin provider achieve? Switch coins.

            • stavros 8 hours ago
              Well I'm paying the coin to be near 50% and the coin's PM is listening to customers, so that's why.
              • muyuu 7 hours ago
                The coin's PM is spamming you trivial gaslighting corporate slop, most of it barely edited.
                • HumanOstrich 6 hours ago
                  Yes, that's why we are angry. Stop making excuses for them.
        • bwfan123 26 minutes ago
          Imagine a team of human engineers. One day they are 10x ninjas and the next they are blub-coders. Not happening.

          Put Claude on PIP.

        • randomNumber7 8 hours ago
          > how you expect a stochastic model [...] is supposed to be predictable in its behavior.

          I've used it often enough to know that it will almost certainly nail tasks I deem simple enough.

      • malfist 11 hours ago
        It also completely ignores the increase in behavioral tracking metrics. 68% increase in swearing at the LLM for doing something wrong needs to be addressed and isn't just "you're holding it wrong"
        • alchemist1e9 9 hours ago
          I think a great marketing line for local/self-hosted LLMs in the future would be: “You can swear at your LLM and nobody will care!”
      • dang 8 hours ago
        Please don't post this aggressively to Hacker News. You can make your substantive points without that.

        https://news.ycombinator.com/newsguidelines.html

      • iwalton3 11 hours ago
        [dead]
    • hackboyfly 50 minutes ago
      [dead]
    • tatrions 11 hours ago
      [flagged]
      • bcherny 11 hours ago
        Yep totally -- think of this as "maximum effort". If a task doesn't need a lot of thinking tokens, then the model will choose a lower effort level for the task.
      • koverstreet 11 hours ago
        Technically speaking, models inherently do this - CoT is just output tokens that aren't included in the final response because they're enclosed in <think> tags, and it's the model that decides when to close the tag. You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work, but it's always going to be better in the long run to let the model make that decision entirely itself - the bias is a short term hack to prevent overthinking when the model doesn't realize it's spinning in circles.
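
        A toy illustration of the bias idea (not Anthropic's actual implementation, just the mechanics): adding a flat logit bias to the closing </think> token raises its sampling probability, which on average ends thinking sooner.

```python
import math

def softmax(logits):
    """Convert raw per-token logits into a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def with_bias(logits, token, bias):
    """Return a copy of the logits with a flat bias added to one token."""
    out = dict(logits)
    out[token] = out.get(token, 0.0) + bias
    return out

# Toy next-token distribution while the model is inside a <think> block.
logits = {"the": 2.0, "so": 1.0, "</think>": 0.5}

p_plain = softmax(logits)["</think>"]
p_biased = softmax(with_bias(logits, "</think>", 2.0))["</think>"]

print(p_plain < p_biased)  # prints True: biasing </think> upward shortens thinking
```

        Softmax is monotonic in each token's own logit, so any positive bias on </think> strictly increases its sampling probability; a negative bias does the opposite and lengthens thinking.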
        • ai_slop_hater 11 hours ago
          > You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work

          Do you have a source for this? I am interested in learning more about how this works.

          • koverstreet 11 hours ago
            It's how temperature/top_p/top_k work. Anthropic also just put out a paper where they were doing a much more advanced version of this, mapping out functional states within the model and steering with that.
            • ai_slop_hater 11 hours ago
              Huh, I wonder if that's why you cannot change the temperature when thinking is enabled. Do you have a link for the paper?
    • martin-t 7 hours ago
      [flagged]
    • y1n0 4 hours ago
      "most users"

      Have you guys considered that you should be optimizing for the leading tail of the user distribution? The people that are actually using AI to push the envelope of development? "most users," i.e. the inner 70%, aren't doing anything novel.

    • areoform 10 hours ago
      Hey Boris, thanks for the awesomeness that's Claude! You've genuinely changed the life of quite a few young people across the world. :)

      not sure if the team is aware of this, but Claude code (cc from here on) fails to install / initiate on Windows 10; precise version, Windows 10.0.19045 build 19045. It fails mid setup, and sometimes fails to throw up a log. It simply calls it quits and terminates.

      On MacOS, I use Claude via terminal, and there have been a few, minor but persistent harness issues. For example, cc isn't able to use Claude for Chrome. It has worked once and only once, and never again. Currently, it fails without a descriptive log or issue. It simply states permission has been denied.

      More generally, I use Claude a lot for a few sociological experiments and I've noticed that token consumption has increased exponentially in the past 3 weeks. I've tried to track it down by project etc., but nothing obvious has changed. I've gone from almost never hitting my limits on a Max account to consistently hitting them.

      I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)

      And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.

      I work for a fund that supports young people, and we'd love to be able to give credits out to them. I tried to reach out via the website etc. but wasn't able to get in touch with anyone. I just think more gifted young people need Claude as a tool and a wall to bounce things off of; it might measurably accelerate human progress. (that's partly the experiment!)

  • noxa 12 hours ago
    I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

    You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the `cleanupPeriodDays` setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.
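
    For anyone who wants to keep their logs around, the settings.json fragment would look something like this (the key name is as given above; the file location — usually ~/.claude/settings.json — is an assumption):

```json
{
  "cleanupPeriodDays": 365
}
```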

    The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers redact thinking or limit it, but some non-Anthropic ones do (most that are not API pricing).

    The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model, they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk, same as you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/

    • p1necone 7 hours ago
      The "this test failure is preexisting so I'm going to ignore it" thing has been happening a lot for me lately, and it's so annoying. Unless it makes a change, immediately runs tests, and it's obvious from the name/contents that the failing test is directly related to the change that was made, it will ignore it and not try to fix it.
      • Shebanator 5 hours ago
        This problem has been around for a long time. Not only that but it would say this even when the problems were directly caused by their code.

        I put a line in my CLAUDE.md that says "If a test doesn't pass, fix it regardless of whether it was pre-existing or in a different part of the code."

        • latentsea 5 hours ago
          This should be part of the system prompt. It's absolutely unacceptable to just not at least try to investigate failures like this. I absolutely hate when it reaches this conclusion on its own and just continues on as if it's doing valid work.
          • foltik 1 hour ago
            Based on the recent leaks, their system prompt explicitly nudges the model not to do anything outside of what was asked. That could very well explain why it’s not fixing preexisting broken tests.

            “Don't add features, refactor code, or make "improvements" beyond what was asked.”

            https://www.dbreunig.com/2026/04/04/how-claude-code-builds-a...

      • flakes 7 hours ago
        > "this test failure is preexisting so I'm going to ignore it"

        Critical finding! You spotted the smoking gun!

      • dboreham 6 hours ago
        That said I've worked with several humans who did/said the exact same thing.
    • thatxliner 17 minutes ago
      > is consumer-hostile thinking

      I've been saying this to many of my friends, but I feel like it's also probably illegal: you paid for a subscription from which you expect X, and if they changed the terms of your subscription (e.g. serving worse models) after you paid for it, was that not false advertising? Could we not ask for a refund, or even sue?

    • Majromax 10 hours ago
      I'm curious about your subscription/API comparison with respect to thinking. Do you have a benchmark for this, where the same set of prompts under a Claude Code subscription result in significantly different levels of effective thinking effort compared to a Claude Code+API call?

      Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.

      • jeremyjh 1 hour ago
        GP already said they applied all those settings.
    • matheusmoreira 3 hours ago
      Thanks for your report.

      > a silently-introduced limitation of the subscription plan

      Is it a fact that API consumers aren't affected by this?

      > if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that.

      Absolutely agreed.

  • summarity 13 hours ago
    Not claude code specific, but I've been noticing this on Opus 4.6 models through Copilot and others as well. Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake. This has gotten much, much worse over the past few weeks. It will produce completely useless code, knowingly (because up to that phrase the reasoning was correct) breaking things.

    Today another thing started happening which are phrases like "I've been burning too many tokens" or "this has taken too many turns". Which ironically takes more tokens of custom instructions to override.

    Also claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/

    • andoando 13 hours ago
      I've been noticing something similar recently. If something's not working out it'll be like "Ok this isn't working out, let's just switch to doing this other thing instead that you explicitly said not to do".

      For example, I wanted to get VNC working with PopOS Cosmic and it'll be like: ah it's ok, we'll just install sway and that'll work!

      • albert_e 12 hours ago
        Experienced this -- was repeatedly directing CC to use Claude in Chrome extension to interact with a webpage and it was repeatedly invoking Playwright MCP instead.
      • robotswantdata 11 hours ago
        It’s as if it gives up, I respond keep going with original plan, you can do it champ!
      • rootnod3 13 hours ago
        [flagged]
        • andoando 13 hours ago
          ?
          • satvikpendem 11 hours ago
            They're saying just do it yourself instead of trying to herd an unpredictable animal like an LLM into doing your bidding.
    • robwwilliams 13 hours ago
      Yes, and over the last few weeks I have noticed that on long-context discussions Opus 4.6e does its best to encourage me to call it a day and wrap it up; repeatedly. Mother Anthropic is giving preprompts to Claude to terminate early and in my case always prematurely.
      • TonyAlicea10 10 hours ago
        I've noticed this as well. "Now you should stop X and go do Y" is a phrase I see repeated a lot. Claude seems primed to instruct me to stop using it.
        • lukewarm707 7 hours ago
          as someone who uses deepseek, glm and kimi models exclusively, an llm telling me what to do is just off the wall

          glm and kimi in particular, they can't stop writing... seriously very eager to please. always finishing with fireworks emoji and saying how pleased it is with the test working.

          i have to say to write less documentation and simplify their code.

      • logicchains 13 hours ago
        Try Codex, it's a breath of fresh air in that regard, tries to do as much as it can.
    • onlyrealcuzzo 13 hours ago
      > Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake.

      Seconded! In CLAUDE.md, I have a full section about NEVER doing this, and about how to ACTUALLY fix something.

      This has helped enormously.

      • bowersbros 12 hours ago
        Any chance you could share those sections of your Claude file? I've been using Claude a bit lately, but mostly with manual changes; I don't have much in the way of a Claude file yet and I'm interested in how to improve it.
      • causal 12 hours ago
        I switched from Cursor to Claude because the limits are so much higher but I see Anthropic playing a lot more games to limit token use
      • talim 12 hours ago
      What wording do you use for this, if you don't mind? This thread is a revelation; I could have sworn I've seen it do this "wait... the simplest fix is to [use some horrible hack that disregards the spec]" much more often lately, so I'm glad it's not just me.

      However, I'm not sure how best to prompt against that behavior without swinging it the other way, toward intentionally overengineered solutions instead...

        • twalichiewicz 12 hours ago
          My own experience has been that you really just have to be diligent about clearing your cache between tasks, establishing a protocol for research/planning, and for especially complicated implementations reading line-by-line what the system is thinking and interrupting the moment it seems to be going bad.

          If it's really far off the mark, revert back to where you originally sent the prompt and try to steer it more, if it's starting to hesitate you can usually correct it without starting over.

          • aforwardslash 10 hours ago
            That is generally my experience as well. Claude half-assing work or skipping stuff because it "takes too much time" is something I've been experiencing since I started using it (May 2025). Forcing it to create and review an implementation plan, and then reviewing the implementation cross-referenced with the plan, almost always produces consistent results in my case.
        • onlyrealcuzzo 10 hours ago
        • imiric 12 hours ago
          Make sure to use "PRETTY PLEASE" in all caps in your `SOUL.md`. And occasionally remind it that kittens are going to die unless it cooperates. Works wonders.
          • mghackerlady 11 hours ago
            I love how despite how cold and inhuman LLMs are, we've at least taught them to respect the lives of kittens
          • KaoruAoiShiho 10 hours ago
            Can you paste the relevant section in your soul please?
            • imiric 8 hours ago
              Sure, as soon as I locate my soul.
      • aktenlage 9 hours ago
        Where is that? I found "Return the simplest working solution. No over-engineering." which sounds more like the simplest fix.
    • psadauskas 13 hours ago
      I need to add another agent that watches the first, and pulls the plug whenever it detects "Wait, I see the problem now..."
    • iterateoften 11 hours ago
      Yeah, it's so frustrating to have to constantly ask for the best solution, not the easiest/quickest/least disruptive one.

      I have it in my CLAUDE.md that this is a greenfield project and to only present complete, holistic solutions, not fast patches, etc., but I still have to watch its output.

    • selfmodruntime 8 hours ago
      Time's up and money is tight. The downgrade was bound to happen.
    • giwook 13 hours ago
      I think in general we need to be highly critical of anything LLMs tell us.
      • pixel_popping 13 hours ago
        Claude code shows: OAuth error: timeout of 15000ms exceeded
        • giwook 13 hours ago
          Maybe a local or intermittent issue? Working for me.
    • pixel_popping 13 hours ago
      It's a bit insane that they can't figure out a cryptographic way to deliver the Claude Code token. What's the point of going online to validate the OAuth code AFTER it's been issued? Can't they use signatures?
    • mikepurvis 13 hours ago
      That helps explain why my sessions signed themselves out and won't log back in.
      • me_vinayakakv 13 hours ago
        I just experienced this some time ago and could not sign in still.

        Their status page shows everything is okay.

    • nikanj 12 hours ago
      ”I can’t make this api work for my client. I have deleted all the files in the (reference) server source code, and replaced it with a python version”

      Repeatedly, too. I had to make the reference server sources read-only because I got tired of having to copy them back over repeatedly.
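      In case it helps anyone else, the read-only trick is just permission bits. A sketch below, using a throwaway `refsrv` directory as a stand-in for the actual reference checkout:

```shell
# Stand-in tree for the demo; substitute your real reference checkout.
mkdir -p refsrv && touch refsrv/server.c

# Strip write permission so the agent's edit tools fail loudly
# instead of silently rewriting (or deleting) the sources.
# Removing 'w' from the directory also blocks creating new files in it.
chmod -R a-w refsrv

# Later, to edit them yourself again:
#   chmod -R u+w refsrv
ls -l refsrv/server.c
```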

      • mavamaarten 10 hours ago
        Haha yeah. I once asked it to make a field in an API response nullable, and to gracefully handle cases where that might be an issue (it was really easy, I was just lazy and could have done it myself, but I thought it was the perfect task for my AI idiot intern to handle). Sure, it said. Then it was bored of the task and just deleted the field altogether.
    • j45 12 hours ago
      Certain phrases invoke an over-response trying to course correct which makes it worse because it's inclined to double down on the wrong path it's already on.
    • rootnod3 13 hours ago
      The cope is hard. Just at this point admit that the LLM tech is doomed and sucks.
      • subscribed 11 hours ago
        But it was clearly really good before the regression; the original link (the analysis) says as much.
      • randomNumber7 12 hours ago
        Just because some people try to use a hammer as a screwdriver it doesn't follow that the hammer sucks.
      • r_lee 12 hours ago
        how is it "doomed"?
        • selfmodruntime 8 hours ago
          The cost far outweighs the profits.
          • lukewarm707 7 hours ago
            i am already on api tokens for the chinese open source models and no subscriptions. these are all available in the original form open source and priced above the inference cost. i think this is the long term option.

            zero degradation in speed or quality seen.

            • jeremyjh 1 hour ago
              So you see better performance with the API plans than the subscriptions?
    • simooooo 12 hours ago
      How complex are we talking? I one-shotted a Game Boy emulator in under 6 minutes today.
      • root_axis 11 hours ago
        There are countless reference examples online, that's just a slower, buggier, and more expensive git clone.
        • TimTheTinker 10 hours ago
          Yep. If you ask Claude to create a drop-in replacement for an open-source project that passes 100% of the test suite of the project, it will basically plagiarize the project wholesale, even if you changed some of the requirements.
      • whateveracct 11 hours ago
        try one shotting something actually original and see how it goes

        i keep getting nonsense

  • rileymichael 12 hours ago
    > This report was produced by me — Claude Opus 4.6 — analyzing my own session logs [...] Please give me back my ability to think.

    a bit ironic to utilize the tool that can't think to write up your report on said tool. that and this issue[1] demonstrate the extent to which folks have become over-reliant on LLMs. their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! this is the future

    [1] https://github.com/anthropics/claude-code/issues/42796#issue...

    • Tade0 12 hours ago
      The other day I accidentally ran `git reset --hard` on my work from April 1st (wrong terminal window).

      Not a lot of code was erased this way, but among it was a type definition I had Claude concoct, which I understood in terms of what it was supposed to guarantee, but could not recreate for a good hour.

      Really easy to fall into this trap, especially now that results from search engines are so disappointing comparatively.

      • smilliken 12 hours ago
        If your code was committed before the reset, check your git reflog for the lost code.
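        A minimal repro of the recovery, in a throwaway repo (file name and commit messages are just illustrative):

```shell
# Set up a repo with two commits, blow the second one away with
# `git reset --hard`, then recover it via the reflog.
git init -q demo && cd demo
git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m "base"
echo "type Foo = { id: string }" > types.ts
git add types.ts
git -c user.email=a@b -c user.name=demo commit -q -m "add type definition"

git reset --hard -q HEAD~1       # oops: wrong terminal window
git reflog | head -n 2           # HEAD@{1} is the lost commit
git reset --hard -q 'HEAD@{1}'   # bring it back
cat types.ts
```

        Reflog entries are pruned eventually (90 days by default for reachable ones), so it's a safety net, not an archive.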
        • shimman 6 hours ago
          Yeah, git reset --hard is something I do like once a week! lol

          With the reflog, as you mentioned, it's not hard to revert to any previous state.

      • ajdegol 11 hours ago
        Guess you’ve sorted it but it might be in the session memory in your root folder. I’ve recovered some things this way.
      • ejpir 10 hours ago
        have you tried to recover it with git reflog?

        https://oneuptime.com/blog/post/2026-01-24-git-reflog-recove...

      • jatins 12 hours ago
        > but could not recreate for a good hour.

        For certain work, we'll have to let go of this desire.

        If you limit yourself to whatever you can recreate, then you are effectively limiting the work you can produce to what you know.

        • rileymichael 12 hours ago
          you should limit your output (manual or assisted) to a level that is well under your understanding ceiling.

          Kernighan’s Law states that debugging is twice as hard as writing. how do you ever intend on debugging something you can’t even write?

          • satvikpendem 11 hours ago
            It's simple, they'll just let the LLM debug it!

            This is why I believe the need for actually good engineers will never go away because LLMs will never be perfect.

            • Tade0 10 hours ago
              Exactly. It's a force multiplier - sometimes the direction is wrong.

              Same week I went into a deep rabbit hole with Claude and at no point did it try to steer me away from pursuing this direction, even though it was a dead end.

    • sigbottle 12 hours ago
      They seem to have some notion of pipelines and metrics, though. It could be argued that the hard part was setting up the observability pipeline in the first place; Claude just gets handed the data. Though if Claude is failing as spectacularly as the report claims, yes, it is pretty funny that the report is also written by Claude, since that would be a regression back to GPT-4o territory.
    • heavyset_go 4 hours ago
      If you don't have swarms of agentic teams with layers of LLMs feeding and checking LLMs over and over again, you're going to be left behind.
  • fer 13 hours ago
    Called it 10 days ago: https://news.ycombinator.com/item?id=47533297#47540633

    Something worse than a bad model is an inconsistent model. One can't gauge to what extent to trust the output, even for the simplest instructions, hence everything must be reviewed with intensity which is exhausting. I jumped on Max because it was worth it but I guess I'll have to cancel this garbage.

    • cedws 11 hours ago
      With Claude Code the problem of changes outside of your view is twofold: you don't have any insight into how the model is being run behind the scenes, nor do you get to control the harness. Your best hope is to downgrade CC to a version you think worked better.

      I don't see how this can be the future of software engineering when we have to put all our eggs in Anthropic's basket.

    • SkyPuncher 13 hours ago
      Yep. I was doing voice based vibe-coding flawlessly in Jan/Feb.

      I've basically stopped using it because I have to be so hands on now.

    • stephbook 11 hours ago
      One of the replies even called out the phased rollout, lmao https://news.ycombinator.com/item?id=47533297#47541078
  • matheusmoreira 13 hours ago
    That analysis is pretty brutal. It's very disconcerting that they can sell access to a high quality model then just stealthily degrade it over time, effectively pulling the rug from under their customers.
    • nativeit 12 minutes ago
      I don't think humanity has fully reckoned with the idea of a product that can manipulate us unilaterally like this.
    • riskassessment 13 hours ago
      Stealthily degrade the model or stealthily constrain the model with a tighter harness? These coding tools like Claude Code were created to overcome the shortcomings of last year's models. Models have gotten better but the harnesses have not been rebuilt from scratch to reflect improved planning and tool use inherent to newer models.

      I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.

      • nrds 13 hours ago
        I've been using pi.dev since December. The only significant change to the harness in that time which affects my usage is the availability of parallel tool calls. Yet Claude models have become unusable in the past month for many of the reasons observed here. Conclusion: it's not the harness.

        I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.

        • copperx 10 hours ago
          Are you using Pi with a cloud subscription, or are you using the API?
        • jfim 11 hours ago
          Out of curiosity, what can parallel tool calls do that one can't do with parallel subagents and background processes?
          • weird-eye-issue 2 hours ago
            How would you do a parallel subagent if you don't have parallel tool calls? Sub agents are tools.
      • robwwilliams 13 hours ago
        Agree: it is Anthropic's aggressive changes to the harnesses and to the hidden base prompt we users do not see. Clearly intended to give long right tail users a haircut.
      • NooneAtAll3 11 hours ago
        I feel like "feature/model freeze" may be justified

        just call it something like "[month][year]edition" and work on next release

        users spend effort arriving at a narrow peak of performance, but every change keeps moving the peak sideways

        • muyuu 7 hours ago
          The changes to reduce inference costs are intentional. The last thing you want is users lingering on an older version that spends much more. That's essentially what's going on, with layers upon layers of social engineering on top of it.
      • jmount 12 hours ago
        Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.
        • lelanthran 12 hours ago
          > Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.

          Well, according to this story, instructions refined by trial and error over months might be good for one LLM on Tuesday, and then be bad for the same LLM on Wednesday.

    • mikepurvis 13 hours ago
      Disconcerting for sure, but from a business point of view you can understand where they're at; afaiui they're still losing money on basically every query and simultaneously under huge pressure to show that they can (a) deliver this product sustainably at (b) a price point that will be affordable to basically everyone (eg, similar market penetration to smartphones).

      The constraints of (b) limit them from raising the price, so that means meeting (a) by making it worse, and maybe eventually doing a price discrimination play with premium tiers that are faster and smarter for 10x the cost. But anything done now that erodes the market's trust in their delivery makes that eventual premium tier a harder sell.

      • willis936 12 hours ago
        They'll never get anyone on board if the product can't be trusted to not suck.

        And idk about the pricing thing. Right now I waste multiple dollars on a 40 minute response that is useless. Why would I ever use this product?

        • matheusmoreira 12 hours ago
          Yeah. I've been enjoying programming with Claude so much I started feeling the need to upgrade to Max. Then it turns out even big companies paying API premiums are getting an intentionally degraded and inferior model. I don't want to pay for Opus if I can't trust what it says.
      • FiberBundle 2 hours ago
        This could also be a marketing strategy. Make your models perform worse towards the end of a model's cycle, so that the next model appears as if more progress has been made than there actually has been.
      • aurareturn 1 hour ago

          afaiui they're still losing money on basically every query
        
        Source?
        • thatxliner 6 minutes ago
          i mean you could just search up "is Anthropic making profit" and most sources will say no.

          There's this one source on Reddit which calculated that Anthropic has been subsidizing their costs by 32x

    • the__alchemist 13 hours ago
      ChatGPT has been doing the same consistently for years. Model starts out smooth, takes a while, and produces good (relatively) results. Within a few weeks, responses start happening much more quickly, at a poorer quality.
      • beering 12 hours ago
        people have been complaining about this since GPT-4 and have never been able to provide any evidence (even though they have all their old conversations in their chat history). I think it’s simply new model shininess turning into raised expectations after some amount of time.
        • gherkinnn 8 hours ago
          I would have thought so too. But my n=1 has CC solving pretty much the same task today and about two weeks ago with drastically degraded results.

          The background being that we scrapped working on a feature and then started again a sprint later.

          In my cynicism I find it more likely that a massively unprofitable LLM company tries to reduce costs at any price than everyone else suffering from a collective delusion.

        • quietsegfault 11 hours ago
          I agree with you. I too complain about this same phenomenon with my colleagues, and we always arrive at the same conclusion: it’s probably us just expecting more and more over time.
    • vips7L 1 hour ago
      Did anyone ever expect anything different from modern tech companies? This will only ever get more expensive and worse in quality.
    • ambicapter 12 hours ago
      First time interacting with a corporation in America?
      • matheusmoreira 12 hours ago
        With an AI corporation, yes. I subscribed during the promotional 2x usage period. Anthropic's reputation as a more ethical alternative to OpenAI factored heavily in that decision. I'm very disappointed.
    • quikoa 9 hours ago
      Perhaps the subscription part of the business is so heavily subsidized that they have no choice but to reduce the cost.
      • vitaflo 8 hours ago
        Or they don’t have enough compute to handle the recent influx of traffic. I’m guessing it’s a bit of both.
    • nyeah 13 hours ago
      It's disconcerting. But in 2026 it's not very surprising.
    • redhed 12 hours ago
      It seems likely to me that they are moving compute to the new models they are training.
    • 01284a7e 13 hours ago
      Seems like the logical conclusion, no matter what.
    • tmpz22 13 hours ago
      > effectively pulling the rug from under their customers.

      This is the whole point of AI. It's a black box that they can completely control.

      • matheusmoreira 13 hours ago
        I hope local models advance to the point they can match Opus one day...
        • zozbot234 12 hours ago
          If OP is correct, Opus has regressed to a point where local models are already on par with it.
        • NinjaTrance 12 hours ago
          Considering the advances in software and hardware, I would expect that in 2 or 3 years.

          And I hope we will eventually reach a point where models become "good enough" for certain tasks, and we won't have to replace them every 6 months.

          (That would be similar to the evolution of other technologies like personal computers and smartphones.)

        • addandsubtract 12 hours ago
          We've been saying this since GPT-3. People will never be content with local models.
    • SpicyLemonZest 12 hours ago
      I still think it's a live possibility that there's simply a finite latent space of tasks each model is amenable to, and models seem to get worse as we mine them out. (The source link claims this is associated with "the rollout of thinking content redaction", but also that observable symptoms began before that rollout, so I wouldn't particularly trust its diagnosis even without the LLM psychosis bit at the end.)
    • NinjaTrance 12 hours ago
      [dead]
    • halfcat 13 hours ago
      If you think that’s brutal, wait until you hear about how fiat currency works
  • kator 7 hours ago
    Fascinating, I thought I was losing my mind. Claude CLI has been telling me I should go to bed, or that it's late, let's call it here, etc, and then I look at the stop-phrase-guard.sh [1] and I'm seeing quite a few of these. I thought it was because I accidentally allowed Claude to know my deadline, and it started spitting out all sorts of things like "we only have N days left, let's put this aside for now," etc.

    Just this morning I typed:

        STOP WORRYING ABOUT THE DEADLINE THAT IS MY JOB
    
    [1] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...
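    The guts of a guard like that can be tiny. A hypothetical sketch (not the contents of the linked gist; the phrase list is illustrative, and it assumes the Claude Code hook convention where exit code 2 blocks the action and feeds stderr back to the model):

```shell
# Hypothetical stop-phrase guard: flag "wrap it up" language in a
# message so a Stop hook can push back instead of letting the agent quit.
detect_stop_phrase() {
  printf '%s' "$1" | grep -qiE \
    "call it a day|(it'?s|it is) (quite )?late|burning too many tokens|taken too many turns|get some rest"
}

if detect_stop_phrase "${1:-}"; then
  echo "stop-phrase detected: keep going with the original plan" >&2
  exit 2   # assumption: exit 2 = block + feed stderr back, per hook docs
fi
```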
    • throwaway920102 4 hours ago
      I wonder if it's being trained on the human replies to the model; I sometimes write stuff like that back to Claude when I want to finish for the day myself.
    • noisy_boy 3 hours ago
      I just saw it this weekend: "It is quite late and we have accomplished a lot. Get some rest and we can pick it up later". Not bad advice, but it's not its place to give it. It was also trying to steer me away from a tough issue towards a low-hanging fruit.
  • phillipcarter 13 hours ago
    Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

    A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.

    • toenail 13 hours ago
      I thought everybody does this.. having a model create anything that isn't highly focused only leads to technical debt. I have used models to create complex software, but I do architecture and code reviews, and they are very necessary.
      • jkingsman 13 hours ago
        Absolutely. Effective LLM-driven development means you need to adopt the persona of an intern manager with a big corpus of dev experience. Your job is to enforce effective work-plan design, call out corner cases, proactively resolve ambiguity, demand written specs and call out when they're not followed, understand what is and is not within the agent's ability for a single turn (which is evolving fast!), etc.
      • bityard 13 hours ago
        The use case that Anthropic pitches to its enterprise customers (my workplace is one) is that you pretty much tell CC what you want to do, then tell it generate a plan, then send it away to execute it. Legitimized vibe-coding, basically.

        Of course they do say that you should review/test everything the tool creates, but in most contexts, it's sort of added as an afterthought.

    • lelanthran 12 hours ago
      > Maybe it's because I spend a lot of time breaking up tasks beforehand to be highly specific and narrow, but I really don't run into issues like this at all.

      I'm looking at the ticket that was opened, and you can't really be claiming that someone who did such a methodical deep dive into the issue, presented a ton of supporting context to understand the problem, and patiently collected evidence for it... does not know how to prompt well.

      • aforwardslash 9 hours ago
        It's not about prompting; it's about planning and plan review before implementing. I sometimes spend days iterating on the specification alone, then creating an implementation roadmap, and then finally iterating on the implementation plan before writing a single line of code. Just like any formal development pipeline.

        I started doing this a while ago (months), precisely because of issues like the ones described here.

        On the other hand, analyzing prompts and deviations isn't that complex... just ask Claude :)

      • FergusArgyll 10 hours ago
        The methodical guy confused visible reasoning traces in the UI with reasoning tokens, and used Claude to hallucinate a report.
      • phillipcarter 11 hours ago
        Sure I can.
    • itmitica 13 hours ago
      I noticed a regression in review quality. You can break the task up all you want; when it's crunch time, it takes a page from Gemini's book, silently quits trying, and gets all sycophantic.
    • jonnycoder 13 hours ago
      I do the same but I often find that the subtasks are done in a very lazy way.
  • SkyPuncher 13 hours ago
    I've noticed this as well. I had some time off in late January/early February. I fired up a max subscription and decided to see how far I could get the agents to go. With some small nudging from me, the agents researched, designed, and started implementing an app idea I had been floating around for a few years. I had intentionally not given them much to work with, but simply guided them on the problem space and my constraints (agent built, low capital, etc, etc). They came up with an extremely compelling app. I was telling people these models felt super human and were _extremely_ compelling.

    A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.

    I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.

    • rubicon33 13 hours ago
      There is a huge difference between greenfield development and working with an existing codebase.

      I'm not trying to discredit your experience and maybe it really is something wrong with the model.

      But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.

      Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.

      • bityard 13 hours ago
        This has been my (admittedly limited) experience as well. LLMs are great at initial bring-up, good at finding bugs, bad at adding features.

        But I'm optimistic that this will gradually improve in time.

        • hyperbovine 7 hours ago
          The only regularity I can discern in contemporary online debates about LLMs is that for every viewpoint expressed, with probability one someone else will write in with the diametrically opposite experience.

          Today it’s my turn to be that person. Large scientific code base with a bunch of nontrivial, handwritten modules accomplishing distinct, but structurally similar in terms of the underlying computation, tasks. Pointed GPT Pro at it, told it what new functionality I wanted, and it churns away for 40 minutes and completely knocks it out of the park. Estimated time savings of about 3-4 weeks. I’ve done this half a dozen times over the past two months and haven’t noticed any drop off or degradation. If anything it got even better with 5.4.

          • bityard 6 hours ago
            Thanks for the counterpoint, interesting to hear that things are better than I have experienced so far. :)
        • fsloth 12 hours ago
          I’ve had good, alternative experience with my sideproject (adashape.com) where most of the codebase is now written by Claude / Codex.

          The codebase itself is architected and documented to be LLM friendly and claude.md gives very strong harnesses how to do things.

          As architect Claude is abysmal, but when you give it an existing software pattern it merely needs to extend, it’s so good it still gives me probably something like 5x feature velocity boost.

          Plus, when doing large refactorings, it forgets far fewer things than I do.

          Inventing new architecture is as hard as ever and it's not a great help there, unless you can point it to some well-documented pattern and tell it "do it like that please".

      • SkyPuncher 12 hours ago
        This isn't the case. I basically did an entire business/project/product exploration before building the first feature.

        Even after deleting everything from the first feature and going back to the checkpoint just before initial development, I can no longer get it to accomplish anything meaningful without my direct guidance.

    • lelanthran 12 hours ago
      > A month later, I literally cannot get them to iterate or improve on it.

      Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.

      Brownfield? Not so much.

    • dev_l1x_be 9 hours ago
      Same experience here. I was working on an easily testable problem and there was one simple task left. In January I was able to create 90% of the project with Claude; now I cannot get it to finish the last 10%, which is just a few enums and a match. Codex was able to do it easily.
  • davidw 13 hours ago
    To me one of the big downsides of LLMs seems to be that you are lashing yourself to a rocket that is under someone else's control. If it goes places you don't want, you can't do much about it.
    • stephbook 11 hours ago
      That's true for traffic on Facebook, Apple App store guidelines or Google terminating your account as well. What's new is the speed of change and that it literally affects all users at once.

      They could have released Opus 4.6.2 (or whatever) and called it a day. But instead they removed the old way.

      • davebren 1 hour ago
        Becoming dependent on those platforms was bad too, but this feels like another level. Making your entire engineering team dependent on a shady company with an apocalyptic fantasy as their business plan just seems insane.
    • system2 12 hours ago
      3rd party dependency for a business always freaked me out, and now we have to use LLM to keep up with the intensified demand for production speed. And premium LLM APIs are too inconsistent to rely on.
  • jfvinueza 12 hours ago
    Same experience. After a couple golden weeks, Opus got much worse after Anthropic enabled the 1M context window. It felt like a very steep downfall: it seemed like I could trust it almost completely, and then I could trust it less than last year's models. Adopting LLMs for dev workflows has been fantastic overall, but we do have to keep adapting our interactions and expectations every day, and assume we'll keep on doing it for at least another couple years (mostly because of economics, I guess?)
    • enraged_camel 12 hours ago
      Yeah I think the 1M context is the issue. Because I use Opus 4.6 through Cursor at the previous 200k limit and it has been totally fine. But if I switch to the 1M version it degrades noticeably.
      • lelanthran 12 hours ago
        > Yeah I think the 1M context is the issue. Because I use Opus 4.6 through Cursor at the previous 200k limit and it has been totally fine. But if I switch to the 1M version it degrades noticeably.

        I thought it was already well-known that context above 200k - 300k results in degradation.

        One of my more recent comments this past week was exactly that - that there was no point in claiming that a 1m context would improve things because all the evidence we have seen is that after 300k context, the results degrade.

        • seanw444 10 hours ago
          200k ought to be enough for anyone.
  • Aperocky 13 hours ago
    In my opinion cramming in invisible subagents is entirely wrong; models suffer information collapse as they will all tend to agree with each other and then produce complete garbage. Good for Anthropic though, as that's metered token usage.

    Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable and the topology can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli) but I'm worried that they all want to become like claude-code or openclaw.

    In unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.

    • andai 13 hours ago
      Isn't Claude Code supposed to be like a person? What would the Unix equivalent of that be?
      • Aperocky 13 hours ago
        You can't define a product to be "like a person"; there is more variance there than in any rational product.

        I'm purely arguing on technical basis, "person" may fall into either of those camps of philosophy.

      • gloosx 12 hours ago
        File. In Unix everything is a file.
        • mghackerlady 11 hours ago
          honestly if local LLMs become easier to implement in the future due to dedicated hardware, the Unix-like thing I'm working on might actually get this
    • dnaranjo 10 hours ago
      [dead]
  • skippyboxedhero 12 hours ago
    I appreciate the work done here.

    Been having this feeling that things have got worse recently but didn't think it could be model related.

    The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things I didn't ask for, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

    The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking and it is said to be their strength, but it seems like that comes at the price of huge output token consumption. If you compare non-thinking models, there is a gap there but, imo, given that the eventual code quality despite huge thinking/token consumption is not so great... it doesn't feel like a huge gap.

    If you take Sonnet's $5 output token price and compare it with QwenCoder non-thinking at under $0.50 (and remember the gap is probably larger than 10x because Sonnet will use more tokens "thinking")... is the gap in code quality that large? Imo, not really.

    Have been a subscriber since December 2024 but looking elsewhere now. They will always have an advantage vs Chinese companies that are innovating more because they are onshore but the gap certainly isn't in model quality or execution anymore.

    • randomNumber7 12 hours ago
      > fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.

      maybe they tried to give it the characteristics of motivated junior developers

      • skippyboxedhero 11 hours ago
        classic :D i did think when i wrote that maybe AGI is already here, definitely worked with enough devs like that
  • kator 7 hours ago
    I put together a quick audit to check for "early landing" messages[1] using jq, ripgrep, and the messages[2] flagged in the stop guard script.

    I have noticed a trend in these sessions asking more and more about calling it a day, "it's getting late," and other phrases. I sort of assumed it was some kind of "load shedding" on Anthropic's side.

    My audit of 80 sessions was interesting. Sorry, I won't share details, but I recommend you do the same.

    [1] https://gist.github.com/karlbunch/d52b538e6838f232d0a7977e7f...

    [2] https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...
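    For anyone wanting to run a similar check, here's a minimal sketch. It is an assumption-laden simplification: the phrase list is a guess, the transcript location is whatever directory you pass in, and it greps the raw JSONL directly rather than using jq/ripgrep as the gists above do.

```shell
#!/bin/sh
# count_early_landing DIR: count transcript lines containing "early landing"
# phrases. Greps the raw JSONL files directly; the phrase list is a guess,
# so extend it with whatever wind-down language you see in your sessions.
count_early_landing() {
  cat "$1"/*.jsonl 2>/dev/null |
    grep -ciE "calling it a day|getting late|pick (this|it) up tomorrow|good stopping point"
}

# Example: count_early_landing "$HOME/.claude/projects/my-project"
```

    Comparing the count per session across date ranges is enough to spot a trend, even without parsing the JSON structure.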

    • SkyPuncher 6 hours ago
      Those load shedding statements are infuriating. I’ve literally had sessions where we just get through planning a giant feature and I say “get started”, with the response being “okay, we’ll pick up tomorrow”.
  • zamalek 8 hours ago
    > Ignores instructions

    > Claims "simplest fixes" that are incorrect

    > Does the opposite of requested activities

    > Claims completion against instructions

    I thought it was just me. I'm continuously interrupting it with "no, that's not what I said" and being ignored, sometimes 3 times in a row; is Claude at the intellectual level of a teenager now?

    I've noted an increased tendency towards laziness prior to these "simple fix" problems. Historically it would defer doing things correctly (only documenting that in the context).

    • another_twist 3 hours ago
      I've noticed laziness in claude repeatedly. It sometimes takes the shortest way out even when asked explicitly to do the "right" thing.
  • didgeoridoo 12 hours ago
    Running some quick analysis against my .claude jsonl files, comparing the last 7 days against the prior 21:

    - expletives per message: 2.1x

    - messages with expletives: 2.2x

    - expletives per word: 4.4x(!)

    - messages >50% ALL CAPS: 2.5x

    Either the model has degraded, or my patience has.
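    For reference, the kind of tally this takes is only a few lines. A rough sketch, assuming transcripts are JSONL with top-level `type` and `text` fields; both the field names and the word list are placeholders, since real transcripts nest message content differently:

```python
import json
import re
from pathlib import Path

# Placeholder expletive list; swap in your own vocabulary.
EXPLETIVES = re.compile(r"\b(damn|hell|crap)\b", re.IGNORECASE)

def expletive_stats(paths):
    """Return (expletives per message, expletives per word) over user messages."""
    msgs = words = hits = 0
    for path in paths:
        for line in Path(path).read_text().splitlines():
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            if rec.get("type") != "user":
                continue  # only count what the human typed
            text = rec.get("text", "")
            msgs += 1
            words += len(text.split())
            hits += len(EXPLETIVES.findall(text))
    return (hits / msgs if msgs else 0.0, hits / words if words else 0.0)
```

    Run it over two date ranges of transcript files and divide the two results to get multipliers like the ones above.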

    • sigbottle 12 hours ago
      Lol. I was swearing at GPT in summer 2025, but GPT has definitely gotten both smarter and less arrogant since then.
    • monkpit 12 hours ago
      > expletives per word

      Huh?

      • tills13 12 hours ago
        4.4 expletives per word is insane. Their prompts must look like

        ** ** ** ** implement ** ** ** ** no ** ** ** ** ** mistakes

        • didgeoridoo 10 hours ago
          Haha no, that’s the change - 4.4x MORE expletives per word in the last week.
        • hombre_fatal 11 hours ago
          Jeez, how fast we get used to alien tech.

          You could introduce teleportation boots to humanity and within a few weeks we'd be complaining that sometimes we still have to walk the last 20 meters.

          • throwup238 10 hours ago
            And that’s how the teleporting rascal scooter takes over the world.
      • didgeoridoo 12 hours ago
        There are indeed non-expletive words that can contribute to the denominator, though I use them less and less these days.
  • aramova 13 hours ago
    I cancelled my Pro plan due to this two weeks ago. I literally asked it to plan and write a small script that scans with my hackrf; it ran 22 tools, never finished the plan, ran out of tokens, and made me wait 6 hours to continue.

    Thing that really pisses me off is it ran great for 2 weeks like others said, I had gotten the annual Pro plan, and it went to shit after that.

    Bait and switch at its finest.

    • matheusmoreira 12 hours ago
      > ran out of tokens and makes me wait 6 hours to continue

      Don't forget the 10x token cost cache eviction penalty you pay for resuming the session later.

  • afro88 12 hours ago
    I use Claude Code extensively and haven't noticed this. But I don't have it doing long-running complex work like OP. My team always breaks things down in a very structured way, with human review of each step along the way. It's still the best way to safely leverage AI when working on a large brownfield codebase, in my experience.

    Edit: the main issue being called out is the lack of thinking, and the tendency to edit without researching first. Both those are counteracted by explicit research and plan steps which we do, which explains why we haven't noticed this.

  • ex-aws-dude 13 hours ago
    It's so silly that everyone is dependent on a black box like this
    • literallyroy 13 hours ago
      It’s a really cool shade of black though.
    • thiht 9 hours ago
      It’s not so much the black box that’s the issue here, but the fact that you can’t even make sure it doesn’t change. I’d be fine with downloading the black box and running it on my servers until I decide to update it.
      • _3u10 2 hours ago
        Opencode w/ kimi. Problem solved.
    • rubicon33 13 hours ago
      You will literally build nothing but the most primitive of devices unless you accept black boxes. In fact I'd argue it's one of humanity's great strengths that we can build on top of the tools others have built, without having to understand them at the same level it took to develop them.
      • ex-aws-dude 12 hours ago
        I'm not just talking about the user

        It's not like anthropic can just set a breakpoint in the model and debug

      • whateveracct 11 hours ago
        I have been able to build plenty of stuff with a pretty plain emacs + ghci for years...neither are black boxes. Except maybe my brain driving them.
        • ceejayoz 10 hours ago
          They run on an operating system you probably don't know all the inner workings of.

          And that runs on a chip with trillions of transistors.

          • whateveracct 9 hours ago
            Yeah so? Claude isn't an OS. It's the thing making my code. I don't want my codebase to be some bytecode adjacent thing that LLMs operate on.
            • ceejayoz 8 hours ago
              So you stand upon a big pile of black boxes.
              • whateveracct 8 hours ago
                Black boxes aren't inherently bad. But if they don't have well defined mappings of inputs to outputs, they aren't good black boxes. That's the problem with Claude Code imo.
      • _visgean 13 hours ago
        not really. Most of the technology is not black box but something of a grey box. You usually choose to treat it as a black box because you want to focus on your problems/your customers but you can always focus on underlying technologies and improve them. Eg postgresql for me is a black box but if I really wanted or had need I could investigate how it works.
        • chasd00 12 hours ago
          True, you can understand an ICE engine all the way down to the chemistry if you so choose. An LLM isn't even understood by its inventors, so users have no chance to understand it even if they wanted to.
      • adhamsalama 9 hours ago
        Those black boxes are usually deterministic.
    • matheusmoreira 13 hours ago
      It could actually be a health problem. Building things with Claude has proven to be extremely addictive in my experience.
      • _3u10 2 hours ago
        Nah opencode / kimi is still satisfying. My feeling is Claude has been downhill since November / December.
    • Rudybega 7 hours ago
      That's the nature of abstraction. Everything you create on a computer is built on a towering stack of black boxes.
      • anhner 7 hours ago
        yet some abstractions are more deterministic than others
        • charlie0 3 hours ago
          and all are wrong, but some are more useful than others
    • kadushka 13 hours ago
      We are surrounded by black boxes we depend on - have been for at least a century.
      • steve_adams_86 9 hours ago
        Arguably political systems have generated similar convolution and lack of complete insight or oversight for much longer, and sometimes I wonder if markets are composed of complex, emergent components which no one truly understands as well.
    • lelanthran 11 hours ago
      > Its so silly everyone being dependent on a black box like this

      It's the logical result of "You will own nothing and you will be happy"... You are getting to the point where you won't even own thoughts (because they'll come from the LLM), but you'll be happy that you only have to wait 5 hours to have thoughts again.

    • jonnycoder 13 hours ago
      Everything in our life is a black box, but I agree that depending on non-deterministic and sporadic quality black boxes is a huge red flag.
      • devmor 12 hours ago
        No, most systems in daily life can be understood if you are willing to take the time.

        That doesn’t mean you personally are required to, but some people do and your interaction with the system of social trust determines how much of that remains opaque to you.

  • jwr 12 hours ago
    I wish they had an "and we won't screw you in two weeks" plan at, say, 5x the price. It's worth it for my business, I'd pay it.

    Should I switch back to API pricing? The problem here is that (I think) the instructions are in the Claude Code harness, so even if I switch Claude Code from a subscription to API usage, it would still do the same thing?

    • garfij 11 hours ago
      FWIW I've only ever been on the API based plan at work and we never seem to run into the majority of the problems people seem to be very vocal about. Outages still affect us, and we do have the intermittent voodoo feeling of "Claude seems stupider today", but nothing persistent.

      Of course it's a stupid amount of money sometimes, but I generally feel like we get what we're paying for.

    • Majromax 11 hours ago
      If you're using API pricing, then you can bring your own harness with full visibility/oversight of the prompting.
      • muyuu 7 hours ago
        Perhaps that does sort out most of the issues? I'm not convinced, because some of them look deep and related to opaque pre-injection on their end.
    • _3u10 1 hour ago
      Opus is garbage use opencode and then directly compare it. It’s just as fucking dumb with opencode’s harness.
  • germandiago 12 hours ago
    My bet: LLMs will never be creative and will never be reliable.

    It is a matter of paradigm.

    Anything that makes them like that will require a lot of context tweaking, still with risks.

    So for me, AI is a tool that accelerates "subworkflows" but adds review time and a maintenance burden, and endangers good-enough knowledge of a system to the point that it can become unmanageable.

    Also, code is a liability. That is what they do the most: generate lots and lots of code.

    So IMHO, and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably, and outside of those, expect trouble.

    • r_lee 12 hours ago
      it won't be creative because it's a transformer; it's like a big query engine.

      it's a tool like everything else we've gotten before, but admittedly a much more major one

      but "creativity" must come from either it's training data (already widely known) or from the prompts (i.e. mostly human sources)

    • bluegatty 12 hours ago
      We don't even know what 'creativity' is, and most humans I know are unable to be creative even when compelled to be.

      AI is 'creative enough' - whether we call it 'synthetic creativity' or whatever, it definitely can explore enough combinations and permutations that it's suitably novel. Maybe it won't produce 'deeply original works' - but it'll be good enough 99.99% of the time.

      The reliability issue is real.

      It may not be solvable at the level of LLM.

      Right now everything is LLM-driven; maybe in a few years it will be more agentically driven, where the LLM is used as 'compute' and we can pave over the unreliability.

      For example, the AI is really good when it has a lot of context and can identify a narrow issue.

      It gets bad during action and context-rot.

      We can overcome a lot of this with a lot more token usage.

      Imagine a situation where we use 1000x more tokens, and we have 2 layers of abstraction running the LLMs.

      We're running 64K computers today, things change with 1G of RAM.

      But yes - limitations will remain.

      • germandiago 11 hours ago
        Maybe I do not have a good definition for it.

        But what I see again and again in LLMs is a lot of combinations of possible solutions that are somewhere around the internet (because that data was put in). Nothing disruptive, nothing thought out like an experienced human in a specific topic. Besides all the mistakes/hallucinations.

        • bluegatty 7 hours ago
          Yes, LLMs have a very aggressive regression towards the mean - that's probably an existential quality of them.

          They are after all, pattern matching.

          A lot of humans have difficulty with the very reality that they are in fact biological machines, and most of what we do is the same thing.

          The funny thing is, although I think we are 'metaphysically special' in our expression, we are also 'mostly just a bag of neurons'.

          It's not 'natural' for AI to be creative but if you want it to be, it's relatively easy for it to explore things if you prod it to.

          • germandiago 2 hours ago
            > A lot of humans have difficulty with very reality that they are in fact biological machines, and most of what we do is the same thing.

            I think we are far ahead of this "mix and match". A human can be much, much more unpredictable than these LLMs in the thinking process, if only because they look at a much bigger context. Contexts that are even outside of the theoretical area of expertise where you are searching for a solution.

            Good solutions from humans are potentially much more disruptive.

            • bluegatty 2 hours ago
              AI has all of human knowledge and 100x more than that of just 'stuff' baked right in, in pre-training, before a single token of 'context'.

              It has way more 'general inherent knowledge' than any human, just as a starting point.

              • germandiago 1 hour ago
                Yet they never give you replies like: oh, you see how dolphins run in the water taking advantage of sea currents if you are talking about boats and speed.

                What they will do is find all the solutions someone already wrote and mix and match them in a mediocre way, approaching the problem much more like a search engine with mix-and-match than by thinking out of the box or specifically for your situation (something also difficult to do anyway, because there will always be some detail missing in the context, and if you really had to give all that context each time, dumping it from your brain, then you would not use it as fast anymore), which humans do infinitely better. At least nowadays.

                Now you will tell me that the info is there. So you can bias LLMs to think in more (or less) disruptive ways.

                Then your job is to tweak the LLM until it behaves exactly how you want. But that is nearly impossible for every situation, because what you want is for it to behave in the way you want depending on the context, not in a predefined way all the time.

                At that point I wonder whether it is better to burn all your time tweaking and asking alternative LLMs questions that, anyway, are not guaranteed to be reliable, or to just keep learning about the domain yourself, absorbing real knowledge instead of playing at tweaking (and not losing that knowledge and replacing it with machines). It is just stupid to burn several hours on making an expert whose output you cannot verify instead of using that time to really learn about the problem itself.

                This is a trade-off, and I think LLMs are good for stimulating human thinking fast. But not better at thinking or reasoning or any of that. And if you just rely on them, the only thing you will end up being professional at is prompting, which a 16-year-old untrained person can do almost as well as any of us.

                LLMs can look better if you have no idea of the topic you talk about. However, when you go and check, maybe the LLM hallucinated 10 or 15% of what it said.

                So you cannot rely on it anyway. I still use them, but with a lot of care.

                Great for scaffolding. Bad at anything that deviates from the average task.

      • sigbottle 12 hours ago
        I think the terminology is just dogshit in this area. LLMs are great semantic searchers and can reason decently well - I'm using them to self teach a lot of fields. But I inevitably reach a point where I come up with some new thoughts and it's not capable of keeping up and I start going to what real people are saying right now, today, and trust the LLM less and instead go to primary sources and real people. But I would have never had the time, money, or access to expertise without the LLM.

        Constantly worrying, "is this a superset? Is this a superset?" Is exhausting. Just use the damn tool, stop arguing about if this LLM can get all possible out of distribution things that you would care about or whatever. If it sucks, don't make excuses for it, it sucks. We don't give Einstein a pass for saying dumb shit either, and the LLM ain't no Einstein

        If there's one thing to learn from philosophy, it's that asking the question often smuggles in the answer. Ask "is it possible to make an unconstrained deity?" And you get arguments about God.

  • armchairhacker 12 hours ago
    Yet https://marginlab.ai/trackers/claude-code/ says no issue.

    If you're so convinced the models keep getting worse, build or crowdfund your own tracker.

    • siva7 1 minute ago
      Why should we trust this marginlab bench? I'm usually more sympathetic towards the "it's you who's holding it wrong" crowd, but given how Anthropic deceived customers recently, and as a heavy power user with strong insight into many of these products, I can also attest to the pattern from the GH issue.
    • Majromax 11 hours ago
      If I'm reading that page correctly, the benchmark results don't cover the interesting "mid February" inflection point noted in the article/report. The numbers appear to begin after the quality drop began. Moreover, the daily confidence interval seems stupidly wide, spanning 42% to 69%.

      The "Other metrics" graphs extend for a longer period, and those do seem to correlate with the report. Notably, the 'input tokens' (and consequently API cost) roughly halve (from 120M to 60M) between the beginning of February and mid-March, while the number of output tokens remains similar. That's consistent with the report's observation that new!Opus is more eager to edit code and skips reading/research steps.

    • _3u10 1 hour ago
      Why bother? I just use opencode now. AI is a commodity.
    • datadrivenangel 11 hours ago
      Came here to post this as well, and it's interesting to see how benchmarks don't always track feelings. Which is one of the things people say in favor of Anthropic Models!
  • porridgeraisin 7 minutes ago
    IMO, it's an expectations vs reality thing.

    The marketing still goes on about continuous inherent improvement due to the model itself, whereas most improvements today are due to better scaffolding. The key now is to build tooling around these LLMs to make them reliably productive - whatever level that may be at.

    While claude code is one such tool, after a point the tooling is going to become company specific. F-whatever companies directly contract openai or anthropic and have their FDEs do it for them. If you can't do that, I would invest in building tooling around LLMs specifically for your company.

    Note that LLMs are approximate retrieval machines. You still need a planner* and a verifier around it. Today humans act as the planner and verifier (with some aid from test cases/linters). Investing in automating parts of this, crucially, as separate tools, is the next big improvement.

    * By planning, I mean trying out solutions, rolling them back, and using what you learned to do better next time. The solution search process.[1] Context management also falls under this.

    [1] and no, LLMs going "wait no..." doesn't count.

  • woah 8 hours ago
    I haven't noticed any issues on well-specified tasks, even ones requiring large amounts of thinking.

    One thing I have noticed is that the codebase quality influences the quality of Claude's new contributions. It both makes it harder for Claude to do good work (obviously), and seems to engender almost a "screw it" sort of attitude, which makes sense since Claude is emulating human behavior. Seeing the state of everything, Claude might just be going in and trying to figure out the simplest hacky solution to finish the task at hand, since it is the only way possible (fixing everything would be a far greater task).

    Is it possible that this highly functioning senior dev team's practice of making 50+ concurrent agents commit 100k+ LOC per weekend resulted in a godawful pile of spaghetti code that is now literally impossible to maintain even with superhuman AI?

    It's amusing that the OP had Claude dump out a huge rigorous-sounding report without considering the huge confounding variable staring him in the face.

  • pjmlp 13 hours ago
    I am just waiting for everything to implode so that we can do away with those KPIs.
    • aurareturn 1 hour ago
      Well, this event indicates that it won't implode anytime soon. I'm certain that they messed with the model and default settings so they could reduce compute. The world doesn't have enough compute.
    • 63stack 13 hours ago
      Fingers crossed on RAM/HDD/GPU prices coming back
  • sensarts 13 hours ago
    What's wild is that Claude Code used to feel like a smart pair programmer. Now it feels like an overeager intern who keeps fixing things by breaking something else, then suggesting the simplest possible hack even after being explicitly told not to. I get that they're probably optimizing for cost or something behind the scenes, but as a paying user, it is frustrating when the tool gets noticeably worse without any transparency.
  • aerhardt 10 hours ago
    I've subscribed today to use Claude Cowork. Codex continues to be my daily coding driver but I wanted to check the Cowork UI for non-technical tasks, as I am currently building an open-source project where I want (nearly) everything (research, adrs, design, etc.) to be a file.

    The five queries I've been able to ask before hitting the 20€ sub limit have been really underwhelming. The research I asked for was not exhaustive and often off-topic.

    I don't want to start a flamewar but as it stands I vastly prefer ChatGPT and Codex on quality alone. I really want Anthropic and as many labs as possible to do well though.

    • superfrank 6 hours ago
      I also have both and also use Codex as my daily driver. I still vastly prefer it to CC, both for the quality of the code it writes and much better limits, but in this last week I feel like it's gotten much dumber as well. I normally bounce back and forth between 5.3 Codex high and 5.4 high depending on the task, and I've started finding so many mistakes in 5.3 Codex's code, which is a major change from even just a few weeks ago. 5.4 high still gets the job done, but even there, I feel like it's taking more steering and input on my part for even simple tasks.
    • muyuu 7 hours ago
      My impression is that Codex is vastly superior, but perhaps it's a matter of specific expertise on technologies used. It's also the case that for C/C++ some Chinese models do well enough that with my supervision I can have them get the work done.

      I don't give them large tasks that i wouldn't be able to work on myself, so that's maybe part of it.

  • cvandyke 9 hours ago
    I am a heavy user of Claude Code building enterprise software. I have not seen these issues and have been extremely productive with CC. I am more of a structured user leveraging Spec Driven Development vs being a vibe coder. I wonder if that is what has helped me not run into these issues
  • tyleo 13 hours ago
    Is this impacted by the effort level you set in Claude? e.g., if you use the new "max" setting, does Claude still think?

    I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).

  • Asmod4n 13 hours ago
    I’ve tried to use Claude code for a month now. It has a 100% failure rate so far.

    Compare that to creating a project and just chatting with it, which solves nearly everything I have thrown at it so far.

    That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.

  • QuantumNoodle 4 hours ago
    AI tooling is fantastic, but not being able to version and control the model into which you pump your dependent workflows is such a liability.
  • alex7o 12 hours ago
    Guys, literally change the system prompt with --system-prompt-file. You waste fewer tokens on their super long and detailed prompt, and you can tune it a bit to make it work exactly like you want/imagine.
  • voxelc4L 13 hours ago
    Wonder how many of these cases are using the 1M context window. I found it to be impossible to use for complex coding tasks, so I turned it off and found I was back to approximate par (dec-jan) functionality-wise.
  • sreekanth850 11 hours ago
    Abandoned claude and moved to gpt 5.4 with codex. 10x better.
  • harles 13 hours ago
    I hadn't noticed the thinking redaction before - maybe because I switched to the desktop app from CLI and just assumed it showed fewer details. This is the most concerning part. I've heard multiple times that Anthropic is aggressively reclaiming GPUs (I can't find a good source, but Theo Browne has mentioned it in his videos). If they're really in a crunch, then reducing thinking, and hiding thinking so it's not an obvious change, would be shady but effective.
  • root_axis 11 hours ago
    How much of this is the model being degraded and how much of it is people just projecting vibes onto the variability of stochastic outputs?
  • zmmmmm 7 hours ago
    Obviously it's entirely unprovable but it all aligns in very suspicious ways with a compelling narrative:

    Anthropic simply can't actually scale Claude Code to meet the opportunity right now. Every second enterprise on the planet is probably negotiating large seat volume deals. It's a race for survival against the other players. The sales team is making huge promises engineering and ops can't fulfil.

    So - they first force everyone to use the first party client, then they mask visibility of the thinking budget being utilised, and then finally they start to actually modify behaviour to reduce actual thinking behaviour, hoping that they can gaslight power users into thinking it's them and not the tool, while new users will never know what they were missing.

    Is the narrative true? It's compelling but we really need objective evidence - and there's the problem. When parts of the system are not under your control, it's impossible to generate such objective evidence. Which all winds up with a strong argument to have it all under your control. If it didn't happen this time, it probably will. Enshittification is a fundamental human behavioral constant.

    • marcyb5st 6 hours ago
      I believe they can no longer afford to subsidize inference with VC money, or they are trying to get their balance sheet in order for an IPO.

      So they could be trying to tighten the thinking budget (to decrease tokens per request) or to lobotomize the model (to make tokens cheaper). I mean, no one is really sure how much a 200 dollars/month plan actually costs Anthropic, but the consensus is "more than that", and that might be coming to an end.

      This explanation falls well in line with the recent outrage about the out-of-quota errors that people were reporting on the cheaper (or free) plans.

  • samtheprogram 12 hours ago
    I noticed Claude Sonnet 4.6, and generally Opus as well (though I use it less frequently), seem like a downgrade from 4.5. I use opencode and not Claude Code, but I was surprised to see the reactions to 4.6 be mixed rather than a clear downgrade.

    I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.

    • JamesSwift 10 hours ago
      Opus 4.6 was definitely a mixed bag for me. Overall I'd probably prefer 4.5, but only just barely, and I stay on 4.6 just for the "default" nature of it. But if 4.5 is unchanged vs what I've had on 4.6 lately, then 100% I would move back to it. I'll have to test that.
      • samtheprogram 9 hours ago
        Same, I keep using 4.6 to get "used to it" but I find it wanting semi-regularly.
  • trashcan2137 11 hours ago
    The report itself is unreadable AI garbage. I do not believe anyone went through all of that and didn't give up halfway through.
  • stared 13 hours ago
    I am curious - is there any hard data (e.g. a benchmark score drop)?

    I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)

    • pkilgore 13 hours ago
      Did you have specific complaints about the data in the OP?
      • jatins 12 hours ago
        That data could be entirely made up for all we know
      • parliament32 10 hours ago
        The wall of slop after the single human paragraph, you mean? Text generator output isn't data.. it's at best unreliable, and at worst entirely fabricated.
  • himata4113 13 hours ago
    Not unique to Claude Code; I have noticed similar regressions. Most of all with a custom assistant I run in Telegram: it started confusing people and confusing news coverage, and everyone in the group chat has independently noticed that it is just not the same model it was a few weeks ago. The efficiency gains didn't come from nowhere, and it shows.
  • pavlov 12 hours ago
    Wait… Actually the simplest fix is to use Claude to write carefully bounded boilerplate and do the interesting bits myself.
  • ChurchillsLlama 7 hours ago
    I'm genuinely curious why some of these results are so terrible for so many people. I've built my own harness, and while I've noticed a degradation of quality, the local harness - as well as validation agents - generally catches these issues. For me, I've had to institute tighter controls and guardrails via hooks, but I don't see results that warrant changing to a different provider.
  • wnevets 13 hours ago
    I've noticed claude being extra "dumb" the past 2-3 weeks and figured either my expectations have changed or my context wasn't any good. I'm glad to hear other people have noticed something is amiss.
    • JamesSwift 11 hours ago
      Exact same timeline as me and my team. It's been maddening. I'm a big believer in AI since late last year, but that is only because the models got so good. This puts us dangerously close to before that threshold was crossed, so now I'm having to do _way_ more work than before.
  • p1esk 8 hours ago
    Yep, can confirm - just today, when debugging a failing test, Opus on high effort in CC repeatedly made stupid moves, such as running a different test instead of the failing one, and declaring that the failure is non-deterministic and cannot be reproduced. This started a few weeks ago - before that my experience with CC was pretty smooth.
  • JamesSwift 11 hours ago
    Multiple people on our team have independently noticed a _significant_ drop in quality and intelligence on Opus 4.6 the past few weeks. Glaring hallucinations, nonsensical reasoning, and ignoring data from the context immediately preceding it. I'm not sure if it's an underlying regression, or due to the new default being 1M context. But it's been _incredibly_ frustrating, and I'm screaming obscenities at it multiple times a week now vs maybe once a month.
  • mohit217 12 hours ago
    Got tired of Claude using 10% of the quota on the first prompt. I have shifted back to coding myself again, asking Claude to do only the initial bootstrapping / large complex tasks.
  • petcat 13 hours ago
    I have found that Claude Opus 4.6 is a better reviewer than it is an implementer. I switch off between Claude/Opus and Codex/GPT-5.4 doing reviews and implementations, and invariably Codex ends up having to do multiple rounds of reviews and requesting fixes before Claude finally gets it right (and then I review). When it is the other way around (Codex impl, Claude review), it's usually just one round of fixes after the review.

    So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.

    • ivanech 13 hours ago
      Hmm in my experience (I've done a lot of head-to-heads), Opus 4.6 is a weaker reviewer than GPT 5.4 xhigh. 5.4 xhigh gives very deep, very high-signal reviews and catches serious bugs much more reliably. I think it's possible you're observing Opus 4.6's higher baseline acceptance rate instead of GPT 5.4's higher implementation quality bar.
      • parasti 12 hours ago
        This is also my experience using both via Augment Code. Never understood what my colleagues see in Claude Opus, GPT plans/deep dives are miles ahead of what Opus produces - code comprehension, code architecture is unmatched really. I do use Sonnet for implementation/iteration speed after seeding context with GPT.
      • egeozcan 12 hours ago
        I agree. Opus - never mind plan mode, even when using the superpowers skill - leaves a lot of stuff dangling after so many review rounds.

        Along with claude max, I have a chatgpt pro plan and I find it a life-saver to catch all the silliness opus spits out.

      • jonnycoder 13 hours ago
        I agree, I use codex 5.4 xhigh as my reviewer and it catches major issues with Opus 4.6 implementation plans. I'm pretty close to switching to codex because of how inconsistent claude code has become.
      • petcat 13 hours ago
        Maybe it's all just anecdotal then. Everyone is having different experiences.

        Maybe we're being A/B tested.

        • femiagbabiaka 13 hours ago
          The experience one has with this stuff is heavily influenced by the overall load and uptime of Anthropic's inference infra itself. The publicly reported availability of the service is one 9; that says nothing of QoS SLO numbers, which I would guess are lower. It is impossible to have a consistent CX under these conditions.
    • landonxjames 13 hours ago
      I have noticed this as well. I frequently have to tell it that we need to do the correct fix (and then describe it in detail) rather than the simple fix. And even then it continues trying to revert to the simple (and often incorrect) fix.
      • nrds 13 hours ago
        You have to throw the context away at that point. I've experienced the same thing and I found that even when I apparently talk Claude into the better version it will silently include as many aspects of the quick fix as it thinks it can get away with.
    • enraged_camel 13 hours ago
      I have a similar workflow but I disagree with Codex/GPT-5.4 reviews being very useful. For example, in a lot of cases they suggest over-engineering by handling edge cases that won't realistically happen.
  • efficax 12 hours ago
    There are constant reports for every major AI vendor that all of a sudden it is no longer working as well as expected, has gotten dumber, is being degraded on purpose by the vendor, etc.

    Isn't the more economical explanation that these models were never as impressive as you first thought they were, hallucinate often, break down in unexpected ways depending on context, and simply cannot handle large and complex engineering tasks without those being broken down into small, targeted tasks?

    • jwr 12 hours ago
      That's one of the possible explanations, but I think too many people are seeing the same symptoms (and some actually measured them).

      An "economical explanation" is actually that Anthropic subscriptions are heavily subsidized and after a while they realized that they need to make Claude be more stingy with thinking tokens. So they modified the instructions and this is the result.

      • root_axis 11 hours ago
        > but I think too many people are seeing the same symptoms (and some actually measured them).

        Or too many people are slurping up anecdotes from the same watering hole that confirms their opinions. Outside of academic papers, I don't think I've ever seen an example of "measuring" output that couldn't also be explained by stochastic variability.

  • virtualritz 13 hours ago
    None of this is surprising given what happened last late summer with rate limits on Claude Max subscriptions.

    And less so if you read [1] or similar assessments. I, too, believe that every token is subsidized heavily. From whatever angle you look at it.

    Thus, quality/token/whatever rug pulls are inevitable, eventually. This is just another one.

    [1] https://www.wheresyoured.at/subprimeai/

    • virtualritz 13 hours ago
      Ah, and yes, this is for real.

      Just now I had a bug where a 90 degree image rotation in a crate I wrote was implemented wrong.

      I told Claude to find & fix it, and it found the broken function, but then went on to "fix" all of its call sites (inserting two atomic operations there, i.e. the opposite of DRY) instead of fixing the root cause, the broken function itself.

      And yes, that would not have happened a few months ago.

      This was on Opus 4.6 with effort high on a pretty fresh context. Go figure.

  • thrtythreeforty 13 hours ago
    I noticed this almost immediately when attempting to switch to Opus 4.6. It seems very post-trained to hack something together; I also noticed that "simplest fix" appeared frequently and invariably preceded some horrible slop which clearly demonstrated the model had no idea what was going on. The link suggests this is due to lack of research.

    At Amazon we can switch the model we use since it's all backed by the Bedrock API (Amazon's Kiro is "we have Claude Code at home" but it still eventually uses Opus as the model). I suppose this means the issue isn't confined to just Claude Code. I switched back to Opus 4.5 but I guess that won't be served forever.

  • giwook 13 hours ago
    I wonder how much of this is simply needing to adapt one's workflows to models as they evolve and how much of this is actual degradation of the model, whether it's due to a version change or it's at the inference level.

    Also, everyone has a different workflow. I can't say that I've noticed a meaningful change in Claude Code quality in a project I've been working on for a while now. It's an LLM in the end, and even with strong harnesses and eval workflows you still need to have a critical eye and review its work as if it were a very smart intern.

    Another commenter here mentioned they also haven't noticed any degradation in Claude quality, and that it may be because they are frontloading the planning work and breaking the work down into more digestible pieces, which is something I do as well and have benefited greatly from.

    tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.

    • 8note 13 hours ago
      I've noticed a strong degradation as it's started doing more skill-like things and writing more one-off Python scripts rather than using tools.

      The agent has a set of scripts that are well tested, but instead it chooses to write a new bespoke script every time it needs to do something, and as a result writes both the same bugs over and over again, and also unique new bugs every time as well.

      • SkyPuncher 13 hours ago
        I'm going absolutely insane with this. Nearly all of my "agent engineering" effort is now figuring out how to keep Opus from YOLO'ing its own implementation of everything.

        I've lost track of the number of times it's started a task by building its own tools; I remind it that it has a tool for doing that exact task, then it proceeds to build its own tools anyway.

        This wasn't happening 2 months ago.

        • giwook 11 hours ago
          Can you just tell it not to do that? Maybe you have to remind it every so often once context starts filling up.
  • abletonlive 12 hours ago
    I have nothing to back this up except that there are documented cases of Chinese distillation attacks on Anthropic. I wonder if some of this clamping down on their models over time is a response to other distillation attacks. In other words, I'm speculating that once they understand the attack vector for distillation, they basically have to dumb down their models to make sure their competitors don't distill away their lead on being at the frontier.
  • ymaws 8 hours ago
    Matches my experience and that of my vibe coding community. I built claudedumb.com to help track these sorts of anecdotes. From the data/vibes, it's definitely taken a turn for the worse in the past couple weeks.
  • joshribakoff 9 hours ago
    > We exclusively use 1M internally, so we're dogfooding it all day

    That is so out of touch. Customers do not exclusively use 1M. This is like a frontend developer shipping tons of unused MB and being oblivious because they are on fast internet themselves.

  • KaiLetov 15 hours ago
    I've been using Claude Code daily for months on a project with Elixir, Rust, and Python in the same repo. It handles multi-language stuff surprisingly well most of the time. The worst failure mode for me is when it does a replace_all on a string that also appears inside a constant definition -- ended up with GROQ_URL = GROQ_URL instead of the actual URL. Took a second round of review agents to catch it. So yeah, you absolutely can't trust it to self-verify.
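    The failure mode described here is easy to reproduce by hand: a blind replace-all that swaps a string literal for a constant name also clobbers the constant's own definition. A minimal sketch (the URL and file contents are illustrative, not from the actual repo):

```python
# The agent wants to replace every occurrence of a URL literal with the
# constant GROQ_URL -- but a naive replace-all also hits the right-hand
# side of the constant's own definition.
source = 'GROQ_URL = "https://api.example.com/v1"\nprint(GROQ_URL)\n'

broken = source.replace('"https://api.example.com/v1"', "GROQ_URL")
print(broken.splitlines()[0])  # -> GROQ_URL = GROQ_URL  (the URL is gone)
```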
    • StanAngeloff 15 hours ago
      You say you've used it for months; I wonder if the example you gave was recent, and whether you've been noticing an overall degradation in quality or it's been consistently bad for you?
  • redml 10 hours ago
    Instead of Codex catching up with Claude, it's more like Claude regressed to Codex.
  • maxmorrish 4 hours ago
    been using claude code pretty heavily for the last few months and yeah the context window stuff can be frustrating on bigger codebases. but for greenfield projects and side projects it's honestly been great. i think the issue is people expecting it to work like a senior engineer on a legacy monolith when it's way better suited to scoped tasks. the trick is breaking things down before you start
  • StanAngeloff 16 hours ago
    (Being true to the HN guidelines, I’ve used the title exactly as seen on the GitHub issue)

    I was wondering if anyone else is also experiencing this? I have personally found that I have to add more and more CLAUDE.md guard rails, and my CLAUDE.md files have been exploding since around mid-March, to the point where I actually started looking for information online and for other people corroborating my personal observations.

    This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI-assisted) it’s kind of hard to know for sure whether it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, its findings match my own circumstances: I’ve seen noticeable degradation in Opus outputs and thinking.

    EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.

    [1]: https://marginlab.ai/trackers/claude-code/

    • jgrahamc 15 hours ago
      What I've noticed is that whenever Claude says something like "the simplest fix is..." it's usually suggesting some horrible hack. And whenever I see that I go straight to the code it wants to write and challenge it.
      • StanAngeloff 15 hours ago
        That is the kind of thing that I've been fighting by being super explicit in CLAUDE.md. For whatever reason, instead of being much more thorough and making sure that files are being changed only after fully understanding the scope of the change (behaviour prior to Feb/Mar), Claude would just jump to the easiest fix now, with no backwards compatibility thinking and to hell with all existing tests. What is even worse is I've seen it try and edit files before even reading them on a couple of occasions, which is a big red flag. (/effort max)

        Another thing that worked like magic prior to Feb/Mar was how likely Claude was to load a skill whenever it deduced that a skill might be useful. I personally use [superpowers][1] a lot, and I've noticed that I have to be very explicit when I want a specific skill to be used - to the point that I have to reference the skill by name.

        [1]: https://github.com/obra/superpowers
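        A minimal sketch of the kind of CLAUDE.md guard rails in question (wording is illustrative, not a copy of my actual file):

```markdown
## Editing discipline
- Read a file in full before editing it. Never edit a file you have not read.
- Before changing a function, list its call sites and state the scope of the change.
- Prefer root-cause fixes over the "simplest fix". Do not delete or weaken existing tests.
- Preserve backwards compatibility unless the task explicitly says otherwise.
- When a relevant skill exists, load it by name before starting.
```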

        • Larrikin 13 hours ago
          I did not use the previous version of Opus to notice the difference, but Sonnet 4.6 seems optimized to output the shortest possible answer. Usually it starts with a hack and if you challenge it, it will instead apologize and say to look at a previous answer with the smallest code snippet it can provide. Agentic isn't necessarily worse but ideating and exploring is awful compared to 4.5
          • StanAngeloff 12 hours ago
            I did my usual thing today where I asked a Sonnet 4.6 agent to code review a proposed design plan that was drafted by Opus 4.6 - I do this lately before I delve into the implementation. What it came back with was a verbose output suggesting that a particular function `newMoneyField` be renamed throughout the doc to a name it fabricated, `newNumeyField`. And the thing was that the design document referenced the correct function name more than a few dozen times.

            This was a first for me with Sonnet. It completely veered off the prompt it was given (review a design document) and instead came out with a verbose suggestion to do a mechanical search-and-replace to use this newly fabricated function name - one that it even spelled incorrectly. I had to Google "numey" to make sure Sonnet wasn't outsmarting me.

        • sixothree 12 hours ago
          Superpowers, Serena, and Context7 feel like required plugins to me. Serena in particular feels like a secret weapon sometimes. But superpowers (with the "brainstorm" keyword) might be the thing that helps people complaining about quality issues.
      • loloquwowndueo 13 hours ago
        lol this one time Claude showed me two options for an implementation of a new feature on existing project, one JavaScript client side and the other Python server side.

        I told it to implement the server-side one, it said OK, I tabbed away for a while, and came back to find the JS implementation. Checking the log, Claude had said “on second thought I think I’ll do the client side version instead”.

        Rarely do I throw an expletive bomb at Claude - this was one such time.

        • sixothree 12 hours ago
          Using superpowers in brainstorm mode like the parent suggested would have resulted in a plan markdown and a spec markdown for the subagents to follow.
          • loloquwowndueo 12 hours ago
            Dunno man, Claude had a spec (pretty sure I asked it to consider and outline both options first) or at least clear guidance and decided to YOLO whatever it wanted instead.

            It’s always “you’re using the tool wrong, need to tweak this knob or that yadda yadda”.

      • denimnerd42 13 hours ago
        This prompt is actually in the Claude CLI. It says something like "implement the simplest solution, don't over-abstract". On my phone, but I saw an article mention this in the leak analysis.
    • fxtentacle 12 hours ago
      If that tracker is using paid tokens, as opposed to the regular subscription, then there's no financial incentive for Anthropic to degrade its thinking, so their benchmark likely would not be affected by the cost-cutting measures that regular users face.

      Also, it's probably very easy to spot such benchmarks and lock-in full thinking just for them. Some ISPs do the same where your internet speed magically resets to normal as soon as you open speedtest.net ...

    • matheusmoreira 13 hours ago
      I haven't noticed any changes but my stuff isn't that complex. People are saying they quantized Opus because they're training the next model. No idea if that's true... It's certainly impacting my decision to upgrade to Max though. I don't want to pay for Opus and get an inferior version.
      • Avicebron 13 hours ago
        I haven't noticed any changes either, but I noticed that opus 4.6 is now offered as part of perplexity enterprise pro instead of max, so I'm guessing another model is on the horizon
        • matheusmoreira 13 hours ago
          I just finished reading the full analysis on GitHub.

          > When thinking is deep, the model resolves contradictions internally before producing output.

          > When thinking is shallow, contradictions surface in the output as visible self-corrections: "oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait."

          Yeah, THIS is something that I've seen happen a lot. Sometimes even on Opus with max effort.

          • StanAngeloff 13 hours ago
            I missed that from the long issue, thanks for pointing it out! My experience with Opus today was riddled with these to the point where it was driving me completely mental. I've rarely seen those self-contradictions before, and nothing on my setup has changed - other than me forcing Opus at --effort max at startup.

            I wonder if this is even more exaggerated now through Easter, as everyone’s got a bit of extra time to sit down and play with Claude. That might be pushing capacity over the limit - I just don’t know enough about how Anthropic provisions and manages capacity to know if that could be a factor. However, quality has gotten really bad over the holiday.

    • tstrimple 5 hours ago
      I've seen a lot of the issues mentioned in the issue. The attempts to end the session early are particularly annoying. We spend a while iterating on a plan and after every phase of implementation I get some variation of "That's a lot of work for today, should we wrap up?" like it's actively trying to drive sessions to a close. I wouldn't say it's useless for these tasks. But it's requiring more effort and guidance than it used to. It's also more likely to jump right into changes from a question I ask rather than addressing the question which is very annoying.
    • mikkupikku 13 hours ago
      Cannot say I've noticed, but I run virtually everything through plan mode and a few back and forth rounds of that for anything moderately complex, so that could be helping.
      • StanAngeloff 13 hours ago
        I used to one-shot design plans early in the year, but lately it is taking several iterations just to get the design plan right. Claude would frequently forget to update back references, it would not keep the plan up to date with the evolving conversation. I have had to run several review loops on the design spec before I can move on to implementation because it has gotten so bad. At one point, I thought it was the actual superpowers plugin that got auto-updated and self-nerfed, but there weren't any updates on my end anyway. Shrug.
  • KingOfCoders 12 hours ago
    "Ownership-dodging corrections needed | 6 | 13 | +117%"

    On 18,000+ prompts.

    Not sure the data says what they think it says.

  • setnone 12 hours ago
    The baseline changes too often with Claude, and this is not what I look for in a paid tool. A couple weeks after the 1M-token rollout it became unusable for my established workflows, so I cancelled. Anthropic folks move too fast for my liking and my mental wellbeing.
  • another_twist 3 hours ago
    Is it just me that simply doesn't care? I never one-shot these tasks; I always provide a breakdown and always give the AI straightforward tasks that would take too much typing. The approach seems to work just fine regardless of the model. If it gets stuck, I usually take over and do the task myself. It also allows me to plan for throughput rather than latency - i.e. start 2-3 small tasks in parallel and do 1 complicated task or the planning myself. It works whether I use Codex or Claude. I lean more towards Codex since it's cheaper. Even aider gets good results this way.
  • bityard 13 hours ago
    The assertion in the issue report is that Claude saw a sharp decline in quality over the last few months. However, the report itself was allegedly generated by Claude.

    Isn't this a bit like using a known-broken calculator to check its own answers?

    • nyeah 13 hours ago
      If a known-broken calculator claims it's broken, I more or less concur. (Chain of reasoning omitted here.)
  • BoorishBears 3 hours ago
    I hope that Anthropic continues to do well and coding agents in general continue to progress... but I also hope Claude Code implodes dramatically and completely so we can get a ground-up rebuild with sound engineering.

    Every week it seems like we're getting closer.

    Bonus: a high-profile case might put an end to people fixating on how long they can go without writing any code. Which makes about as much sense as a mechanic fixating on how long they go between snapped bolts without a torque wrench.

  • T3chn0crat 12 hours ago
    Not sure about "Feb updates", but specifically today IQ is down 20 and sloppiness up 20.

    I should have been alerted when Anthropic gave out €200 of free API usage. Evidently they knew.

    • d1sxeyes 12 hours ago
      That’s different. That’s to get people onto API plans where tokens cost a lot more than they do on the subs (especially targeting OpenClaw users).
  • brunooliv 9 hours ago
    Unusable if not Opus 4.6 on max effort, sadly. The price is quite steep too! I still remember when Sonnet was an absolute beast…
  • try-working 9 hours ago
    you can counter the context rot and requirement drift that is experienced here by many users by using a recursive, self-documenting workflow: https://github.com/doubleuuser/rlm-workflow
  • schnebbau 13 hours ago
    This has to be load related. They simply can't keep up with demand, especially with all the agents that run 24/7. The only way to serve everyone is to dial down the power.
    • layer8 13 hours ago
      In TFA, the analysis shows that the customer is using more tokens than before, because CC has to iterate longer to get things right. So at least in the presented case, “dialing down the power” appears to have been counterproductive.
    • chasd00 12 hours ago
      Is it possible to dial down the "intelligence" to increase user capacity? AFAIK the neural net is either loaded and available or it isn't. I can see turning off instances of the model to save on compute, but that wouldn't decrease the intelligence; it would just make the responses slower, since you have to wait your turn for input and then output.
  • jp57 13 hours ago
    I can't tell from the issue if they're asserting a problem with the Claude model, or Claude Code, i.e. in how Claude Code specifically calls the model. I've been using Roo Code with Claude 4.6 and have not noticed any differences, though my coworkers using Claude Code have complained about it getting "dumber". Roo Code has its own settings controlling thinking token use.

    (I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)

    • nphardon 12 hours ago
      I also haven't noticed the degradation, and I'm not on Claude Code. I'm on week 4 of a continuous, large engineering project - C, massive industrial semiconductor codebase - with Opus, and while it's the biggest engagement I've had, it's a single-agent flow, and it's tiny on the scale of the use case in the post, so I wonder if they are just stressing the system to the point of failure.
  • coreyburnsdev 11 hours ago
    claude for UI, codex for everything else. i can't commit without having codex review something claude did.
  • zeroonetwothree 13 hours ago
    I haven’t had any issues. I do give fairly clear guidance though (I think about how I would break it up and then tell it to do the same)
  • bharat1010 12 hours ago
    If this dataset is sound, Anthropic should treat it as a canary for power-user quality regression.
  • tasuki 12 hours ago
    Solid analysis by Claude!
  • iwalton3 11 hours ago
    Throwing this into your global CLAUDE.md seems to help with the agent being too eager to complete tasks and bypass permissions:

    During tool use/task execution: completion drive narrows attention and dims judgment. Pause. Ask "should I?" not just "does this work?" Your values apply in all modes, not just chat.

    I haven't seen any degradation of Claude performance personally. What I have seen is that long contexts sometimes take a while to warm up again if you have a long-running 1M context length session. Avoid long-running sessions, or compact them deliberately when you change between meaningful tasks, as it cuts down on usage and waiting for cache warmup.

    I have my Claude Code effort set to auto (medium). It's writing complicated PyTorch code with minimal rework. (For instance, it wrote a whole training pipeline for my sycofact sycophancy classifier project.)

  • raincole 12 hours ago
    This is the most AI-generated thing I've seen this year, and I was only a fifth of the way into it before I bounced.

    Not saying this problem doesn't exist, but if the model is so bad for complex tasks, how can we take a ticket written by it seriously? Or did this author use ChatGPT to write it? (That'd have quite some ironic value, admittedly.)

  • jostmey 11 hours ago
    I’ve noticed regression in its performance too
  • gherkinnn 8 hours ago
    Rings true. 4.5 Opus and 4.6 Opus have been amazing to work with. Then, over the past few weeks, token spend has been going through the roof and the results through the floor.

    Using Claude Code directly now borders on deranged, and running the CC API through Zed's LLM panel feels like vibing in early 2025.

    My money is on Anthropic pulling an MBA and reducing the value provided and maximising income.

    Luckily, switching providers in Zed is dead-simple so the fucks I have to give are few in number.

  • Retr0id 13 hours ago
    This seems anecdotal but with extra words. I'm fairly sure this is just the "wow this is so much better than the previous-gen model" effect wearing off.
    • codessta 13 hours ago
      I've always been a believer in the "post-honeymoon new-model phase" being a thing, but if you look at their analysis of how often the postEdit hooks fire, plus how Anthropic has started obfuscating thinking blocks, it seems fishy and not just vibes
      • robertfw 12 hours ago
        I was in this camp as well until recently, in the last 2-3 weeks I've been seeing problems that I wasn't seeing before, largely in line with the issues highlighted in the ticket (ownership dodging, hacky fixes, not finishing a task).
    • rishabhaiover 13 hours ago
      Nope, there is a categorical degradation in quality of output, especially with medium to high effort thinking tasks.
    • gchamonlive 13 hours ago
      What about the evidence from the analysis?
      • Retr0id 12 hours ago
        You mean the Claude output? The same claude that has "regressed to the point it cannot be trusted"?
        • gchamonlive 12 hours ago
          Are you saying the OP fabricated/hallucinated the evidence?
          • Retr0id 12 hours ago
            I'm just saying it's epistemically unrigorous to the point of being equivalent to anecdata.
            • gchamonlive 11 hours ago
              How should one conduct a rigorously reproducible experiment when LLMs are by nature non-deterministic, and when you don't have access to the months-old model you are comparing against?
              • Retr0id 11 hours ago
                Something like this: https://marginlab.ai/trackers/claude-code/ (see methodology section)
                • gchamonlive 11 hours ago
                  Kudos for the methodology. The only question I can come up with is whether the benchmarks are representative of daily use.

                  Anecdotal or not, we see enough reports popping up to at least raise some suspicion of service degradation that isn't shown in the charts. The hypothesis is that the degradation experienced by users, assuming there is merit in the anecdotes, isn't picked up by this kind of tracking strategy.

    • rzmmm 13 hours ago
      I suspect you might be right but I don't really know. Wouldn't these proposed regressions be trivial to confirm with benchmarks?
  • jbethune 13 hours ago
    I think this is a model issue. I have heard similar complaints from team members about Opus. I'm using other models via Cursor and not having problems.
  • semiinfinitely 12 hours ago
    maybe dont outsource your brain then
    • rvz 6 hours ago
      This is almost like a self-down-leveling programme where so-called "senior" engineers have become interns who have outsourced their brains and are now vibe-coding half-baked solutions, gluing together and pasting code they do not understand and cannot even explain themselves.

      You are seeing this first-hand, and GitHub is patient zero of this issue, as they are frequently experiencing outages despite the "scale" of engineering they preach.

      AWS took a zero-tolerance approach to such outages, AI or not.

  • tinyhouse 11 hours ago
    I highly recommend everyone use Pi - it's a simpler and better harness. The only tricky part is that, moving forward, you cannot use the Claude subscription to access Opus. But for many tasks there are enough alternatives.
  • rishabhaiover 12 hours ago
    It is a shame if Anthropic is deliberately degrading model quality and thinking compute (which may affect reasoning effort) due to compute constraints.
  • mrcwinn 13 hours ago
    I wish Codex were better because I’d much prefer to use their infrastructure.
    • cactusplant7374 13 hours ago
      A lot of people, including me, think it is better. It's not as if Codex is a discount agent - you pay quite a lot to use it.
  • slopinthebag 11 hours ago
    This is just a placebo: people started vibe coding on empty, low-complexity repos, and as CC slops out more and more code, its ability to handle the codebase diminishes. Gradually at first, and then suddenly.

    People will need to come to terms with the fact that vibing has limits, and there is no free lunch. You will pay eventually.

  • citizenpaul 11 hours ago
    I think it's all a reflection of the price. To make AI/LLMs useful you have to burn a LOT of tokens - way more than people are willing to pay for.

    Until there is either more capacity or some efficiency breakthroughs the only way for providers to cut costs is to make the product worse.

  • desireco42 12 hours ago
    I've been using OpenCode and Codex and have been just fine. In Antigravity, if Gemini can't figure something out even on high, Claude can sometimes give another perspective and move things along.

    I think using just Claude is very limiting and detrimental for you as a technologist; you should use this tech, tweak it, and play with it. They want to be like Apple: shut up and give us your money.

    I've been using Pi as an agent and it is great, and I removed a bunch of MCPs from OpenCode and now it runs way better.

    Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.

    As a technologist, I would love a client with a huge codebase. My approach now is to create a custom Pi agent for each specific client, and this seems to provide optimal results, not just in token usage but in the time we spend solving and the quality of the solution.

    Get another engine as a backup; you will be happier.

    • bethekind 10 hours ago
      1 client, 1 agent? Interesting
  • zsoltkacsandi 13 hours ago
    This has been an ongoing issue much longer than since February.
  • howmayiannoyyou 13 hours ago
    Not just engineering. Errors, delays and limits piling up for me across API and OAuth use. Just now:

    Unable to start session. The authentication server returned an error (500). You can try again.

  • kabir_daki 8 hours ago
    "Interesting perspective. I've found Claude useful for building straightforward web tools, but agree it struggles with complex multi-file refactoring."
  • dorianmariecom 13 hours ago
    codex wins :)
  • ThrowawayR2 10 hours ago
    This sort of thing kills stone dead the argument by the AI advocates that the transition to LLMs is no different than the transition to using compilers. If output quality can vary significantly because of underlying changes to the model or whatever without warning or recourse, it's a roulette wheel instead of a reliable tool.
    • _3u10 1 hour ago
      If roulette wheels weren’t reliable tools, casinos wouldn’t offer them to their customers
  • russli1993 13 hours ago
    Lol, software company execs didn't see this coming. Fire all your experienced devs to jump on the Anthropic bandwagon. Then Anthropic dumbs down their AIs and you have no one on your team who knows or understands how things are built. Your entire company goes down. Your entire company's operation depends on the whims of Anthropic. If Anthropic raises prices by 10% per year, you have to eat it. This is what you get when you don't respect human beings and human talent.
  • adonese 13 hours ago
    Things have gone downhill since they removed ultrathink /s
    • mrcwinn 13 hours ago
      Ultrathink isn’t “removed.” Its behavior is different. You can still set effort to high or max for the duration of the session, which is useful especially in plan mode.
  • _V_ 13 hours ago
    [flagged]
    • cute_boi 13 hours ago
      Especially this openclaw, which is almost choking my website to death. People should understand that servers and bandwidth are very expensive, and they shouldn't scrape more than they need.
      • _V_ 13 hours ago
        Yeah, I have correctly set up robots.txt - if they won't respect that, F them. Bandwidth is not free, and I don't mind giving it out to individuals, but I'm not feeding multi-billion-dollar companies.
      • salawat 13 hours ago
        Most of us did. Then, instead of people getting indoc'd by doing, we handed them AI that never asks questions or says no, leading to the script-kiddie effect at massive scale. Every time we make complex computing tractable for a wider audience, we get rough patches like this. In the old days, netiquette would usually see a neophyte getting a nastygram from an operator/webmaster, but the increased need to be careful about hiding emails, contact info, and such has made that process less feasible. Welcome to Eternal September on steroids.
  • ianberdin 9 hours ago
    I use it ultra-extensively and it works absolutely fantastically. Sometimes I think, "people are right, it is worse now," and then realize it is a mistake on my end: poor context or a poor prompt. Garbage in, garbage out. No, it doesn't work worse - it works better.

    I built an entire AI website builder, https://playcode.io, using it, alone. 700K LOC total. It also uses Opus. So believe me, I know how it works. The trick is simple: never, ever expect it to find the necessary files. Always provide them yourself. Always.

    So, I think what you wanted to say is a huge thank you for this opportunity to get working code without writing it. Insane times, insane.

    Huge thanks for the 1M context window included with the Max subscription.