Been using these tools for the last few years. Can already tell Opus and Sonnet 4 have set a completely new benchmark, especially using Claude Code.
They just work: less hallucination, fewer infinite loops of confusion. You can set it off and come back with 80-90% confidence it’s done what you asked. Three to four iterations maximum to get website/app component styling perfect (vs 5-10 before).
I’ve already seen too many of the classic ‘omg this doesn’t work for me they suck, overhyped’ posts. Fair enough if that’s your experience, but I completely disagree and can’t help but think your prompting is the problem.
Without using too much stereotypical AI hyperbole, I think this is the biggest step change since GPT-3.
Source: Code with Claude Opening Keynote
The 200k context window is deflating, especially when GPT and Gemini are eating it for lunch. Even going to 500k would be better.
Benchmarks at this point in the AI game are negligible at best, and you sure don't "feel" a 1% difference between the three. It feels like we are getting to the point of diminishing returns.
We as programmers should be able to see the forest for the trees here. We think differently than most people. We think outside the box. We don't get caught up in hype, since we exist in the realm of research, facts, and practicality.
This Claude release is more hype than practical.
Been using ChatGPT Plus with o3 and Gemini 2.5 Pro for coding for the past few months. Both are decent, but it always felt like something was missing, you know? Like they'd get me 80% there, but then I'd waste time fixing their weird quirks, explaining context over and over, or running in an endless error loop.
Just tried Claude 4 Opus and... damn. This is what I expected AI coding to be like.
The difference is night and day:

- Actually understands my existing codebase instead of giving generic solutions that don't fit
- Debugging is scary good - it literally found a memory leak in my React app that I'd been hunting for days (a sketch of that kind of leak is after this list)
- Code quality is just... clean. Like actually readable, properly structured code
- Explains trade-offs instead of just spitting out the first solution
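The post doesn't say what the leak actually was, so here's a minimal sketch of the kind of React memory leak an assistant typically flags: an effect that starts an interval and never cleans it up, so the timer and its closure outlive the component. The component and the `fetchPrice` prop are hypothetical, purely for illustration.

```tsx
import { useEffect, useState } from "react";

// Hypothetical component; fetchPrice stands in for whatever async data
// source the real app used.
function PriceTicker({ fetchPrice }: { fetchPrice: () => Promise<number> }) {
  const [price, setPrice] = useState<number | null>(null);

  useEffect(() => {
    const id = setInterval(async () => {
      setPrice(await fetchPrice()); // keeps firing after unmount if never cleared
    }, 1000);
    // The fix: return a cleanup function so the interval (and the closure it
    // holds) is released when the component unmounts.
    return () => clearInterval(id);
  }, [fetchPrice]);

  return <span>{price ?? "loading"}</span>;
}

export default PriceTicker;
```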
Real example: Had this mess of nested async calls in my Express API. ChatGPT kept suggesting Promise.all which wasn't what I needed. Gemini gave me some overcomplicated rxjs nonsense. Claude 4 looked at it for 2 seconds and suggested a clean async/await pattern with proper error boundaries. Worked perfectly.
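For anyone curious what that pattern looks like, here's a minimal sketch, not the poster's actual code: the route and the two service calls are made up, but the shape (sequential awaits plus a central error boundary) matches what's described above.

```typescript
import express, { NextFunction, Request, Response } from "express";

const app = express();

// Stand-ins for the real data-access calls in the poster's API.
async function loadUser(id: string): Promise<{ id: string }> {
  return { id };
}
async function loadOrders(userId: string): Promise<string[]> {
  return [`order-for-${userId}`];
}

// Express 4 doesn't forward rejected promises to error middleware on its own,
// so wrap async handlers and pass failures to next().
const asyncHandler =
  (fn: (req: Request, res: Response, next: NextFunction) => Promise<void>) =>
  (req: Request, res: Response, next: NextFunction) =>
    fn(req, res, next).catch(next);

app.get(
  "/users/:id/orders",
  asyncHandler(async (req, res) => {
    const user = await loadUser(req.params.id); // sequential awaits replace the
    const orders = await loadOrders(user.id);   // nested-callback pyramid
    res.json({ user, orders });
  }),
);

// Central error boundary: one place to log and shape error responses.
app.use((err: Error, _req: Request, res: Response, _next: NextFunction) => {
  console.error(err);
  res.status(500).json({ error: "internal error" });
});

app.listen(3000);
```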
The context window is massive too - I can literally paste my entire project and it gets it. No more "remember we discussed X in our previous conversation" BS.
I'm not trying to shill here but if you're doing serious development work, this thing is worth every penny. Been more productive this week than the entire last month.
Got an invite link if anyone wants to try it: https://claude.ai/referral/6UGWfPA1pQ
Anyone else tried it yet? Curious how it compares for different languages/frameworks.
EDIT: Just to be clear - I've tested basically every major AI coding tool out there. This is the first one that actually feels like it gets programming, not just text completion that happens to be code. This also takes Cursor to a whole new level!
I have been extensively using Gemini 2.5 Pro for coding-related stuff and o3 for everything else, and it’s crazy that within a month or so they look kind of obsolete. Claude Opus 4 is the best overall model available right now.
I ran a quick coding test: Opus against Gemini 2.5 Pro and OpenAI o3. The intention was to create visually appealing and bug-free code.
Here are my observations:

- Claude Opus 4 leads in raw performance and prompt adherence.
- It understands user intentions better, reminiscent of 3.6 Sonnet.
- High taste. The generated outputs are tasteful. Retains the Opus 3 personality to an extent.
- Though unrelated to code, the model feels nice; I never enjoyed talking to Gemini and o3.
- Gemini 2.5 is more affordable and uses far fewer API credits than Opus.
- The one-million-token context in Gemini is unbeatable for large-codebase understanding.
- Opus is the slowest in time to first token; you have to be patient with thinking mode.
Check out the blog post for the complete comparison analysis with code: Claude 4 Opus vs. Gemini 2.5 vs. OpenAI o3
The vibes with Opus are the best; it has a personality and is stupidly capable. But it's too pricey; it's best used through the Claude app, since the API cost will burn a hole in your pocket. Gemini will always be your friend, with free access and the cheapest SOTA model.
Would love to know your experience with Claude 4 Opus and how you would compare it with o3 and Gemini 2.5 Pro in coding and non-coding tasks.
Starting off: Don't get me wrong, Sonnet 4 is a legendary model for coding. It's so good, maybe even too good. It has zero-shotted basically every one of my personal tests in Cursor and a couple of complex Rust problems I always test LLMs with.
I believe most people have hugely praised Sonnet 4 with good reason. It's extremely good at coding, and because lots of people in this sub are coders, they feel their whole day gets more productive. What they don't realize is that this model is kinda bad for normies. On a personal note, it feels severely overtrained on code, which likely caused catastrophic forgetting; it feels lobotomized on non-code tasks.
Opus 4, however, seems to be fine; it has gone through my math tasks without any issues. Just too expensive to be a daily driver, though.
Here is one of the grade 9 math problems from math class I recently had to do (yes, I'm in high school). I decided to try Sonnet 4 on it.
Math problem (attached as an image): I gave Sonnet 4 (non-reasoning) this exact prompt, "Teach me how to do this question step-by-step for High School Maths", and gave GPT-4.1 the same prompt with the image attached.
Results:
Sonnet 4 Response:
Sonnet 4 got completely confused, started doing random operations, and got lost. Then it gave me some vague steps and tried to get me to solve it???? Sonnet 4 very rarely gets it right; it either tries to make the user solve it or gives out answers like 3.10, 3.30, 3.40, etc.
GPT-4.1 Response:
I have rerun the same test on GPT-4.1 many times, and it gets it right every single time. This is one of dozens of questions I have found Sonnet 4 getting consistently wrong or just rambling about, whereas GPT-4.1 nails it right away.
People in AI all believe these models are improving so much (they are), but normies don't experience much of that. I believe the most substantial recent improvements in these models were in code, and normies don't code; they can tell it improved a bit, but not a mind-blowing amount.
I know, I know, whenever a model comes out you get people saying this, but it's about very concrete things for me; I'm not just biased against it. For reference, I'm comparing 4 Sonnet (concise) with 3.7 Sonnet (concise), no reasoning for either.
I asked it to calculate the total markup I paid at a gas station relative to the supermarket. I gave it quantities in a way I thought was clear ("I got three protein bars and three milks, one of the others each. What was the total markup I paid?", but that's later in the conversation after it searched for prices). And indeed, 3.7 understands this without any issue (and I regenerated the message to make sure it wasn't a fluke). But with 4, even with much back and forth and several regenerations, it kept interpreting this as 3 milk, 1 protein bar, 1 [other item], 1 [other item], until I very explicitly laid it out as I just did.
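To make the two readings concrete, here's a tiny worked example with invented markups (the real prices came from the model's web search and aren't in the post):

```typescript
// Hypothetical per-item markups (gas-station price minus supermarket price);
// the numbers are made up purely to show how the two readings diverge.
const markup = { proteinBar: 0.8, milk: 0.5, itemA: 0.6, itemB: 0.7 };

// Intended reading: 3 protein bars + 3 milks + 1 of each of the other items.
const intended =
  3 * markup.proteinBar + 3 * markup.milk + markup.itemA + markup.itemB; // 5.20

// Claude 4's reading: 3 milks + 1 protein bar + 1 of each of the other items.
const misread =
  3 * markup.milk + 1 * markup.proteinBar + markup.itemA + markup.itemB; // 3.60

console.log(intended.toFixed(2), misread.toFixed(2));
```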
And then, in another conversation, I ask it, "Does this seem correct, or too much?" with a photo of food and macro estimates for the meal in a screenshot. Again, 3.7 understands this fine, as asking whether the figures seem to be an accurate estimate, whereas 4, again with a couple of regenerations to test, seems to think I'm asking whether it's an appropriate meal (as in, not too much food for dinner or whatever). And in one instance, it misreads the screenshot (taking the number of calories I will have cumulatively eaten after that meal for the number of calories of that meal).
Is anyone else seeing any issues like this?
Today, Anthropic is introducing the next generation of Claude models: Claude Opus 4 and Claude Sonnet 4, setting new standards for coding, advanced reasoning, and AI agents. Claude Opus 4 is the world’s best coding model, with sustained performance on complex, long-running tasks and agent workflows. Claude Sonnet 4 is a drop-in replacement for Claude Sonnet 3.7, delivering superior coding and reasoning while responding more precisely to your instructions.
Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning. Both models can also alternate between reasoning and tool use—like web search—to improve responses.
Both Claude 4 models are available today for all paid plans. Additionally, Claude Sonnet 4 is available on the free plan.
Read more here: https://www.anthropic.com/news/claude-4
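If you're hitting these through the API, the two modes described above map onto a single Messages endpoint; below is a minimal sketch using the @anthropic-ai/sdk TypeScript client. The model IDs and token budgets shown are assumptions based on Anthropic's naming scheme, so verify them against the current docs before relying on them.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function main() {
  // Near-instant mode: a regular request, no thinking block requested.
  const quick = await client.messages.create({
    model: "claude-sonnet-4-20250514", // assumed model ID; check Anthropic's docs
    max_tokens: 1024,
    messages: [{ role: "user", content: "Summarize what a mutex does in one sentence." }],
  });
  console.log(quick.content);

  // Extended thinking mode: grant a reasoning budget before the final answer.
  const deep = await client.messages.create({
    model: "claude-opus-4-20250514", // assumed model ID; check Anthropic's docs
    max_tokens: 4096, // must be larger than the thinking budget
    thinking: { type: "enabled", budget_tokens: 2048 },
    messages: [{ role: "user", content: "Walk through how you'd design a rate limiter." }],
  });
  console.log(deep.content);
}

main();
```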
I mean, how the heck is Qwen 3 literally better than Claude 4 (the Claude that used to dog-walk everyone)? This is just disappointing 🫠
Anthropic recently unveiled Claude 4 (Opus and Sonnet), achieving a record-breaking 72.7% on SWE-bench Verified and surpassing OpenAI’s latest models. Benchmarks aside, I wanted to see how Claude 4 holds up on real-world software engineering tasks, so I spent the last 24 hours putting it through intensive testing with challenging refactoring scenarios.
I tested Claude 4 using a Rust codebase featuring complex, interconnected issues following a significant architectural refactor. These problems included asynchronous workflows, edge-case handling in parsers, and multi-module dependencies. Previous versions, such as Claude Sonnet 3.7, struggled here—often resorting to modifying test code rather than addressing the root architectural issues.
Claude 4 impressed me by resolving these problems correctly in just one attempt, never modifying tests or taking shortcuts. Both Opus and Sonnet variants demonstrated genuine comprehension of architectural logic, providing solutions that improved long-term code maintainability.
Key observations from practical testing:
Claude 4 consistently focused on the deeper architectural causes, not superficial fixes.
Both variants successfully fixed the problems on their first attempt, editing around 15 lines across multiple files, all relevant and correct.
Solutions were clear, maintainable, and reflected real software engineering discipline.
I was initially skeptical about Anthropic’s claims regarding their models' improved discipline and reduced tendency toward superficial fixes. However, based on this hands-on experience, Claude 4 genuinely delivers noticeable improvement over earlier models.
For developers seriously evaluating AI coding assistants—particularly for integration in more sophisticated workflows—Claude 4 seems to genuinely warrant attention.
A detailed write-up and deeper analysis are available here: Claude 4 First Impressions: Anthropic’s AI Coding Breakthrough
Interested to hear others' experiences with Claude 4, especially in similarly challenging development scenarios.
I’ve been using Claude 4 Sonnet in agent mode for the past month and a half, and compared to the other models it worked better, getting the job done 80% of the time with little debugging.
Recently I’ve noticed that it’s starting to act more like GPT-4.1: it’s making a lot of mistakes. When it says it has “fixed the mistake and understands why the bug is happening and assures it 100% works now,” it actually didn’t fix anything; nothing has changed, or it has in fact made the code worse. That’s something it rarely ever did, but now it’s doing it frequently.
Is anyone else having this issue?