🌐
Reddit
reddit.com › r/claudeai › i tested gpt-5.1 codex against sonnet 4.5, and it's about time anthropic bros take pricing seriously.
r/ClaudeAI on Reddit: I tested GPT-5.1 Codex against Sonnet 4.5, and it's about time Anthropic bros take pricing seriously.
November 15, 2025 -

I've used Claude Sonnets the most among LLMs, for the simple reason that they are so good at prompt-following and absolute beasts at tool execution. That also partly explains why most of Anthropic's revenue comes from its API (code agents, to be precise). They have an insane first-mover advantage, and developer loyalty that's to die for.

But GPT-5.1 Codex has been insanely good. One of the first things I do when a promising new model drops is run small tests to decide which models to stick with until the next significant release. It also lets me dogfood our product while building these tests.

I ran a quick competition among Claude 4.5 Sonnet, GPT-5, GPT-5.1 Codex, and Kimi K2 Thinking.

  • Test 1 involved building a system that learns baseline error rates, uses z-scores and moving averages, catches rate-of-change spikes, and handles 100k+ logs/minute with under 10ms latency (a minimal sketch of what this asks for follows this list).

  • Test 2 involved fixing race conditions when multiple processors detect the same anomaly. Handle ≤3s clock skew and processor crashes. Prevent duplicate alerts when processors fire within 5 seconds of each other.
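
For anyone who wants the shape of Test 1 in code, here's a minimal TypeScript sketch. It's my own illustrative version, not any model's output; the class names (`RollingStats`, `AnomalyDetector`) and all thresholds are made-up assumptions.

```typescript
// Minimal sketch of the Test 1 brief: rolling baseline, z-score test,
// and a rate-of-change check. Names and thresholds are illustrative.
class RollingStats {
  private window: number[] = [];
  constructor(private size: number) {}

  push(value: number): void {
    this.window.push(value);
    if (this.window.length > this.size) this.window.shift();
  }

  get count(): number {
    return this.window.length;
  }

  get mean(): number {
    return this.window.reduce((a, b) => a + b, 0) / (this.count || 1);
  }

  get stdDev(): number {
    const m = this.mean;
    const variance =
      this.window.reduce((acc, v) => acc + (v - m) ** 2, 0) / (this.count || 1);
    return Math.sqrt(variance);
  }
}

class AnomalyDetector {
  private stats = new RollingStats(60); // e.g. the last 60 one-second buckets
  private lastRate = 0;

  // Feed one aggregated error rate per time bucket; returns true on anomaly.
  check(errorRate: number): boolean {
    const warmedUp = this.stats.count >= 10; // don't alert on a cold baseline
    const z =
      this.stats.stdDev > 0
        ? (errorRate - this.stats.mean) / this.stats.stdDev
        : 0;
    const delta = errorRate - this.lastRate; // rate-of-change spike check
    const spiking = delta > Math.max(1, this.stats.mean) * 2;
    this.lastRate = errorRate;
    this.stats.push(errorRate);
    return warmedUp && (Math.abs(z) > 3 || spiking);
  }
}
```

The latency budget is the easy part: 100k logs/minute is under 2k logs/second, so if you aggregate logs into per-second buckets and score the bucket rather than every line, even this naive O(window) update sits far below 10ms.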

The setup paired each model with its own CLI agent inside Cursor:

  • Claude Code with Sonnet 4.5

  • GPT-5 and GPT-5.1 Codex with Codex CLI

  • Kimi K2 Thinking with Kimi CLI

Here's what I found out:

  • Test 1 - Advanced Anomaly Detection: Both GPT-5 and GPT-5.1 Codex shipped working code. Claude and Kimi both had critical bugs that would crash in production. GPT-5.1 improved on GPT-5's architecture and was faster (11m vs 18m).

  • Test 2 - Distributed Alert Deduplication: Codexes won again with actual integration. Claude had solid architecture but didn't wire it up. Kimi had good ideas but broken duplicate-detection logic (a sketch of the dedup pattern in question follows this list).
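
For reference, the kind of time-window dedup Test 2 calls for looks roughly like this. It's a hedged sketch, not any model's output: `AtomicStore` is a hypothetical stand-in for any store with an atomic set-if-absent plus TTL (e.g. Redis SET NX PX).

```typescript
// Sketch of the Test 2 requirement: suppress alerts for the same anomaly
// fired by different processors within 5s, tolerating ≤3s clock skew and
// processor crashes. `AtomicStore` is a hypothetical interface.
interface AtomicStore {
  // Resolves true only for the first caller to claim `key` before it expires.
  setIfAbsent(key: string, ttlMs: number): Promise<boolean>;
}

const WINDOW_MS = 5_000;
const MAX_SKEW_MS = 3_000;

async function shouldAlert(
  store: AtomicStore,
  anomalyId: string,
  detectedAtMs: number,
): Promise<boolean> {
  // Bucket timestamps so processors with skewed clocks converge on the same
  // or an adjacent bucket for the same anomaly.
  const bucket = Math.floor(detectedAtMs / WINDOW_MS);
  // Claim the previous bucket too, or two events 1ms apart across a bucket
  // boundary would both alert. The TTL outlives window + skew, so a crashed
  // processor's claim expires on its own instead of wedging dedup forever.
  for (const b of [bucket - 1, bucket]) {
    const claimed = await store.setIfAbsent(
      `alert:${anomalyId}:${b}`,
      WINDOW_MS + MAX_SKEW_MS,
    );
    if (!claimed) return false; // someone else already alerted in this window
  }
  return true; // we won the race; fire the alert
}
```

Bucketing errs toward suppressing slightly too much near boundaries (two events 6-7s apart can share a claim), which is the usual tradeoff for O(1) dedup; the point is the atomic claim, not the exact windowing.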

Codex cost me $0.95 total (GPT-5) vs Claude's $1.68. That's 43% cheaper for code that actually works. GPT-5.1 was even more efficient at $0.76 total ($0.39 for test 1, $0.37 for test 2).

I have written up the complete comparison. Check it out here: Codexes vs Sonnet vs Kimi

And, honestly, I see a similar performance delta in other tasks as well. For many quick tasks I still use Haiku, and Opus for hardcore reasoning, but the GPT-5 variants have become great workhorses.

OpenAI is certainly after those juicy Anthropic enterprise margins, and Anthropic really needs to rethink its pricing.

Would love to know your experience with GPT-5.1 and how you rate it against Claude 4.5 Sonnet.

🌐
Composio
composio.dev › blog › claude-sonnet-4-5-vs-gpt-5-codex-best-model-for-agentic-coding
Claude Sonnet 4.5 vs. GPT-5 Codex: Best model for agentic coding - Composio
Struggled more with lint fixes and schema edge cases in this project. GPT‑5 Codex + Codex: Strongest at iterative execution, refactoring, and debugging; reliably shipped a working recommendation pipeline with minimal lint errors.
🌐
Builder.io
builder.io › blog › codex-vs-claude-code
Codex vs Claude Code: which is the better AI coding agent?
October 1, 2025 - The key thing: GPT-5 is significantly more efficient under the hood than Claude Sonnet, and especially Opus. In recent production usage, quality feels comparable by most anecdotes and public benchmarks, but GPT-5 costs roughly half of Sonnet, ...
🌐
Composio
composio.dev › blog › kimi-k2-thinking-vs-claude-4-5-sonnet-vs-gpt-5-codex-tested-the-best-models-for-agentic-coding
GPT-5.1 Codex vs. Claude 4.5 Sonnet vs. Kimi K2 Thinking : Tested the best models for agentic coding - Composio
It delivers integrated code that handles edge cases, costs 43% less than Claude, and needs minimal polish. GPT-5 was already solid, but GPT-5.1's improvements in speed and architecture make it the clear choice for new work.
🌐
Medium
medium.com › @leucopsis › how-gpt-5-codex-compares-to-claude-sonnet-4-5-1c1c0c2120b0
How GPT-5-Codex Compares to Claude Sonnet 4.5 | by Barnacle Goose | Medium
November 15, 2025 - The locally working GPT-5-Codex version uses PowerShell (on PC in an IDE extension for VS Code) to run tests and can indeed take hours to finish. Anthropic has positioned Claude Sonnet 4.5 as a Codex direct challenger, labeling it the world’s best model for coding.
🌐
Reddit
reddit.com › r/claudeai › 24 hours with claude code (opus 4.1) vs codex (gpt-5)
r/ClaudeAI on Reddit: 24 Hours with Claude Code (Opus 4.1) vs Codex (GPT-5)
August 8, 2025 -

Been testing both for a full day now, and I've got some thoughts. Also want to make sure I'm not going crazy.

Look, maybe I'm biased because I'm used to it, but Claude Code just feels right in my terminal. I actually prefer it over the Claude desktop app most of the time because of the granular control. Want to crank up thinking? Use "ultrathink". Need agents? Just ask.

Now, GPT-5. Man, I had HIGH hopes. OpenAI's marketing this as the "best coding model" and I was expecting that same mind-blown feeling I got when Claude Code (Opus 4) first dropped. But honestly? Not even close. And yes, before anyone asks, I'm using GPT-5 on Medium as a Plus user, so maybe the heavy thinking version is much different (though I doubt it).

What's really got me scratching my head is seeing the Cursor CEO singing its praises. Like, am I using it wrong? Is GPT-5 somehow way better in Cursor than in Codex CLI? Because with Claude, the experience is much better in Claude Code vs. Cursor, imo (which is why I don't use Cursor anymore).

The Torture Test: My go-to new model test is having them build complex 3D renders from scratch. After Opus 4.1 was released, I had Claude Code tackle a biochemical mechanism visualization with multiple organelles, proteins, substrates, the whole nine yards. Claude picked Vite + Three.js + GSAP, and while it didn't one-shot it (they never do), I got damn close to a viable animation in a single day. That's impressive, especially considering the little effort I intentionally put forth.
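
For anyone who hasn't touched that stack: Three.js renders the scene and GSAP drives the keyframed motion. A minimal toy scaffold of that division of labor looks like this; it's my illustrative sketch, nothing to do with the actual visualization Claude produced.

```typescript
import * as THREE from "three";
import gsap from "gsap";

// Basic scene: Three.js renders, GSAP animates.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(75, innerWidth / innerHeight, 0.1, 100);
camera.position.z = 5;

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(innerWidth, innerHeight);
document.body.appendChild(renderer.domElement);

// A sphere standing in for one "organelle".
const organelle = new THREE.Mesh(
  new THREE.SphereGeometry(1, 32, 32),
  new THREE.MeshStandardMaterial({ color: 0x44aa88 }),
);
scene.add(organelle);

const light = new THREE.DirectionalLight(0xffffff, 1);
light.position.set(1, 2, 3);
scene.add(light);

// GSAP tweens the mesh's position; repeat: -1 loops forever.
gsap.to(organelle.position, { x: 2, duration: 2, yoyo: true, repeat: -1 });

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```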

So naturally, I thought I'd let GPT-5 take a crack at fixing some lingering bugs. Key word: thought.

Not only could it NOT fix them, it actively broke working parts of the code. Features it claimed to implement? Either missing or broken. I specifically prompted Codex to carefully read the files, understand the existing architecture, and exercise caution. The kind of instructions that would have Claude treating my code like fine china. GPT-5? Went full bull in a china shop.

Don't get me wrong, I've seen Claude break things too. But after extensive testing across different scenarios, here's my take:

  • Simple stuff (basic features, bug fixes): GPT-5 holds its own

  • Complex from-scratch projects: Claude by a mile

  • Understanding existing codebases: Claude handles context better (it's always been like this)

I'm continuing to test GPT-5 in various scenarios, but right now I can't confidently build anything complex from scratch with it.

Curious what everyone else's experience has been. Am I missing something here, or is the emperor wearing no clothes?

Top answer (1 of 5)
I don’t think it’ll be that long before OpenAI just goes full consumer and stops trying to be the AI company for everyone. Already, removing the model choice on ChatGPT has broken the app for me. Coding is still better in Claude, despite those models not reaching the same benchmarks. Each major lab is going to have to start specialising.
Answer 2 of 5
Claude Code as a CLI platform is infinitely better than Codex. However, using GPT-5 inside of Cursor I’ve found it to be at least as capable as Claude 4.1. My criticism is that while it does execute well, the code it writes is very difficult to read, whereas Claude provides readable code out of the box without having to be asked. I do find that GPT-5 is better at planning out what it’s going to do before doing it, but you can get Claude to do this too by having it write a plan.md file and reviewing it with Claude. I’ve also found GPT-5 to forget fewer specific details. Claude 4.1 will often forget to add #includes in my C++ code, for example, while GPT-5 seems to do this a bit less. They both tend to screw up, though, and I usually end up with about 2-3 rounds of compilation failures before I get code that compiles. 90% of the time, if it compiles, it also works as intended. Overall, I think it is at least equivalent (as long as you use it inside Cursor, NOT Codex). I’ve not thrown problems at it that Claude cannot already solve, though, so I'm interested to run some more difficult tests with it.
🌐
Hacker News
news.ycombinator.com › item
GPT-5.2-Codex | Hacker News
5 days ago - Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math) · Yes, it takes longer (quality, not speed please!)
🌐
Bind AI IDE
blog.getbind.co › 2025 › 09 › 16 › gpt-5-codex-vs-claude-code-vs-cursor-which-is-best-for-coding
GPT-5 Codex vs Claude Code vs Cursor – Which is best for coding?
September 17, 2025 - If your priority is deep, autonomous support for a large, complex codebase (tests, refactoring, reviews), then GPT-5-Codex currently seems to be the most advanced; its improvements in code review, visual inputs, and dependency reasoning give ...
🌐
Reddit
reddit.com › r/chatgptcoding › codex cli + gpt-5-codex still a more effective duo than claude code + sonnet 4.5
r/ChatGPTCoding on Reddit: Codex CLI + GPT-5-codex still a more effective duo than Claude Code + Sonnet 4.5
October 8, 2025 -

I have been using Codex for a while (since Sonnet 4 was nerfed), and it has so far been a great experience. Now that Sonnet 4.5 is here, I really wanted to test which of the two, Sonnet 4.5 or GPT-5-codex, offers more value.

So, I built an e-com app (I named it vibeshop, as it is vibe-coded) twice: once with Claude Code and once with Codex CLI, each paired with its respective model. I also added MCP to the mix for a complete agentic coding setup.

I created a monorepo and used various packages to see how well the models could handle context. I built a clothing recommendation engine in TypeScript for a serverless environment to test performance under realistic constraints (I was really hoping that these models would make the architectural decisions on their own, and tell me that this can't be done in a serverless environment because of the computational load). The app takes user preferences, ranks outfits, and generates clean UI layouts for web and mobile.
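
To give a sense of the core task (stripped of the serverless and monorepo constraints), the heart of such a ranker is just a stateless scoring function. Here's a rough TypeScript sketch; the field names and weights are my own assumptions, not either model's output.

```typescript
// Toy version of the outfit-ranking core; types and weights are illustrative.
interface Outfit {
  id: string;
  style: string;
  color: string;
  priceUsd: number;
}

interface Preferences {
  styles: string[];
  colors: string[];
  maxPriceUsd: number;
}

function rankOutfits(outfits: Outfit[], prefs: Preferences): Outfit[] {
  const score = (o: Outfit): number =>
    (prefs.styles.includes(o.style) ? 2 : 0) +
    (prefs.colors.includes(o.color) ? 1 : 0) +
    (o.priceUsd <= prefs.maxPriceUsd ? 1 : -2);
  // Sort descending by score; stateless, so it's serverless-friendly.
  return [...outfits].sort((a, b) => score(b) - score(a));
}
```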

Here's what I found out.

Observations on Claude perf

Claude Sonnet 4.5 started strong. It handled the design beautifully, with pixel-perfect layouts, proper hierarchy, and clear explanations of each step. I could never have done this lol. But as the project grew, it struggled with smaller details, like schema relations and handling HttpOnly tokens mapped to opaque IDs with TTL/cleanup to prevent spoofing or cross-user issues.
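
For readers unfamiliar with that pattern: the HttpOnly cookie carries only an opaque ID, which maps server-side to the real session and expires on a TTL. A minimal sketch of the idea follows; the names are illustrative, and a real app would back this with Redis rather than an in-memory Map.

```typescript
// Opaque session IDs with TTL/cleanup, as described above. Illustrative only.
import { randomUUID } from "node:crypto";

const SESSION_TTL_MS = 30 * 60 * 1000; // 30 minutes

interface Session {
  userId: string;
  expiresAt: number;
}

const sessions = new Map<string, Session>();

function createSession(userId: string): string {
  const opaqueId = randomUUID(); // never expose the userId itself
  sessions.set(opaqueId, { userId, expiresAt: Date.now() + SESSION_TTL_MS });
  return opaqueId; // set as an HttpOnly, Secure, SameSite cookie
}

function resolveSession(opaqueId: string): string | null {
  const s = sessions.get(opaqueId);
  if (!s || s.expiresAt < Date.now()) {
    sessions.delete(opaqueId); // lazy cleanup prevents stale-token reuse
    return null;
  }
  return s.userId;
}
```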

Observations on GPT-5-codex

GPT-5 Codex, on the other hand, handled the situation better. It maintained context well, refactored safely, and produced working code almost immediately (though it still had some linter errors, like unused variables). It understood file dependencies, handled cross-module logic cleanly, and seemed to “get” the project structure. The only downside was the developer experience of Codex: the docs are still unclear and there is limited control, but the output quality made up for it.

Both models still produced long-running queries that would be problematic in a serverless setup. It would’ve been nice if they flagged that upfront, but it shows that architectural choices still require a human designer to make the final call. By the end, Codex delivered the entire recommendation engine with fewer retries and far fewer context errors. Claude’s output looked cleaner on the surface, but Codex’s results actually held up in production.

Claude outdid GPT-5 in frontend implementation, and GPT-5 outshone Claude in debugging and backend implementation.

Cost comparison:

  • Claude Sonnet 4.5 + Claude Code: ~18M input + 117k output tokens, cost around $10.26. Produced more lint errors, but the UI looked clean.

  • GPT-5 Codex + Codex Agent: ~600k input + 103k output tokens, cost around $2.50. Fewer errors, clean UI, and better schema handling.

I wrote a full breakdown: Claude 4.5 Sonnet vs GPT-5 Codex.

Would love to know what combination of coding agent and models you use and how you found Sonnet 4.5 in comparison to GPT-5.

🌐
Reddit
reddit.com › r/claudecode › gpt 5.1-codex in vs studio outperforming claude code by a country mile
r/ClaudeCode on Reddit: GPT 5.1-Codex in VS Studio outperforming Claude Code by a country mile
November 14, 2025 -

Over the last couple of days I’ve been running GPT-5.1-Codex and Claude Code side-by-side in VS Code on actual project work, not the usual throwaway examples. The difference has surprised me. GPT-5.1-Codex feels noticeably quicker, keeps track of what’s going on across multiple files, and actually updates the codebase without making a mess. Claude Code is still fine for small refactors or explaining what a block of code does, but once things get a bit more involved it starts losing context, mixing up files, or spitting out diffs that don’t match anything. Curious if others are seeing the same thing.

🌐
DEV Community
dev.to › blamsa0mine › claude-code-vs-gpt-5-codex-which-one-should-you-use-and-when--4092
Claude Code vs GPT‑5 Codex: which one should you use — and when ? - DEV Community
September 17, 2025 - TL;DR — Use both. Reach for GPT‑5 Codex when you need fast, precise diffs and short‑cycle code‑gen inside your IDE; switch to Claude Code for deep repo understanding, multi‑step refactors, and disciplined terminal workflows.
🌐
Medium
medium.com › @leucopsis › gpt-5-1-codex-max-vs-claude-opus-4-5-ad995359231b
GPT-5.1-Codex-Max vs Claude Opus 4.5 | by Barnacle Goose | Dec, 2025 | Medium
3 weeks ago - The GPT-5.1 family performs exceptionally well here, and Codex-Max benefits from that foundation. Its reported score is about 89.4 percent on the Pro-level variant, significantly ahead of Claude Opus 4.5 at roughly 82.4 percent.
🌐
PromptLayer
blog.promptlayer.com › codex-vs-claude-code
Codex vs Claude Code
September 29, 2025 - Independent testing reveals nuanced differences. GPT-5 proved faster and more token-efficient, using approximately 90% fewer tokens than Claude Opus 4.1 and completing tasks more quickly.
🌐
Cursor IDE
cursor-ide.com › blog › gpt-51-vs-claude-45
GPT-5/5.1 vs Claude Sonnet 4.5: Complete 2025 Comparison Guide - Cursor IDE Blog
November 13, 2025 - Claude Sonnet 4.5 operates with a 200K token context window (expandable to 1M tokens for specific applications), processing both input and maintaining conversational state within this limit. While smaller than GPT-5's 400K window, the model's context management proves more efficient—maintaining ...
🌐
X
x.com › iannuttall › status › 1962910312430215307
Ian Nuttall on X: "Comparing Codex CLI vs Claude Code side-by-side" / X
TLDR; Claude Code is more mature and has features like subagents, custom slash commands, and hooks that make you more productive. Codex with GPT-5 High is catching up fast though.
🌐
Hacker News
news.ycombinator.com › item
Building more with GPT-5.1-Codex-Max | Hacker News
November 23, 2025 - One huge difference I notice between Codex and Claude code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that i've seen it work for 30 minutes to ...
🌐
Cursor
forum.cursor.com › discussions
ChatGPT 5.1 Codex High vs Gemini 3 Pro vs Claude Sonnet 4.5 for coding - Discussions - Cursor - Community Forum
November 20, 2025 - What’s everyone’s experience so far with these three models? I have seen reports ChatGPT 5.1 Codex High is testing highest for coding and seen people report good results with Gemini 3 Pro. But Sonnet is pretty reliable f…
🌐
Composio
composio.dev › blog › claude-4-5-opus-vs-gemini-3-pro-vs-gpt-5-codex-max-the-sota-coding-model
Claude 4.5 Opus vs. Gemini 3 Pro vs. GPT-5-codex-max: The SOTA coding model - Composio
Opus 4.5 (Claude): Outstanding at ... they hit the metal. GPT-5.1 Codex: The most dependable for real-world development, integrates cleanly, handles edge cases, and produces code that holds up under load....
🌐
Skywork
skywork.ai › home › claude code sdk vs gpt 5 codex — which one thinks like a dev
Claude Code SDK vs GPT 5 Codex — Which one thinks like a dev - Skywork ai
October 15, 2025 - It often skipped the explicit “why” unless prompted. If you need someone to justify decisions or walk through trade-offs, Claude wins. If you need runnable code now and will vet it yourself, GPT-5 Codex is usually faster.