I've used Claude Sonnet models the most among LLMs, for the simple reason that they are so good at prompt-following and absolute beasts at tool execution. That also partly explains why the largest share of Anthropic's revenue comes from its API (code agents, to be precise). They have an insane first-mover advantage and developer love to die for.
But GPT-5.1 Codex has been insanely good. One of the first things I do when a promising new model drops is run small tests to decide which models to stick with until the next significant drop. It also lets us dogfood our product while building these.
I ran a quick competition among Claude 4.5 Sonnet, GPT-5, GPT-5.1 Codex, and Kimi K2 Thinking.
Test 1 involved building a system that learns baseline error rates, uses z-scores and moving averages, catches rate-of-change spikes, and handles 100k+ logs/minute with under 10ms latency.
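For anyone who wants a feel for what Test 1 asks for, here is a minimal sketch of the detection logic only: a rolling baseline, a z-score check, and a rate-of-change check. It is not what any of the models produced. The class name, window size, and both thresholds are assumptions picked for illustration, and it makes no attempt at the 100k+ logs/minute or sub-10ms requirements.

```python
from collections import deque
from math import sqrt


class ErrorRateDetector:
    """Rolling-window detector for a stream of per-interval error rates (illustrative)."""

    def __init__(self, window=60, z_threshold=3.0, roc_threshold=0.5):
        self.window = deque(maxlen=window)  # recent rates; acts as the learned baseline
        self.z_threshold = z_threshold      # assumed cutoff, in standard deviations
        self.roc_threshold = roc_threshold  # assumed cutoff: +50% versus previous sample
        self.prev = None

    def observe(self, rate):
        """Return the list of checks this sample tripped (empty if it looks normal)."""
        reasons = []
        n = len(self.window)
        if n >= 5:  # wait for a minimal baseline before scoring
            mean = sum(self.window) / n  # moving average over the window
            std = sqrt(sum((x - mean) ** 2 for x in self.window) / n)
            if std > 0 and abs(rate - mean) / std > self.z_threshold:
                reasons.append("z-score")
        if self.prev is not None and self.prev > 0:
            if (rate - self.prev) / self.prev > self.roc_threshold:
                reasons.append("rate-of-change")
        self.prev = rate
        self.window.append(rate)
        return reasons


detector = ErrorRateDetector(window=10)
baseline = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.010, 0.012]
flags = [detector.observe(r) for r in baseline]  # steady traffic trips nothing
spike = detector.observe(0.050)                  # jump to 5% errors trips both checks
```

Recomputing the mean and variance on every sample is O(window); a version aimed at the stated throughput would more likely keep running sums.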
Test 2 involved fixing race conditions when multiple processors detect the same anomaly. The solution had to handle clock skew of up to 3 seconds, survive processor crashes, and prevent duplicate alerts when processors fire within 5 seconds of each other.
The setup used each model with its own CLI agent inside Cursor:
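The core of Test 2 can be sketched as an atomic check-and-record keyed by anomaly. This is a single-process illustration under assumptions: the class name and key format are made up, the suppression horizon is taken as the 5-second window plus the 3-second skew allowance, and a lock stands in for whatever shared store a real multi-processor setup would use. It does not cover crash recovery.

```python
import threading


class AlertDeduplicator:
    """Suppress repeat alerts for the same anomaly inside a skew-tolerant window."""

    def __init__(self, window_s=5.0, max_skew_s=3.0):
        # Widen the window by the skew allowance so two processors with
        # drifting clocks still land inside the same suppression horizon.
        self.horizon = window_s + max_skew_s
        self.last_fired = {}  # anomaly key -> timestamp of the alert that fired
        self._lock = threading.Lock()

    def should_fire(self, anomaly_key, timestamp):
        # Check and record under one lock so two threads cannot both
        # observe "not seen yet" and both fire.
        with self._lock:
            last = self.last_fired.get(anomaly_key)
            # abs() because a skewed clock can report an earlier timestamp.
            if last is not None and abs(timestamp - last) <= self.horizon:
                return False
            self.last_fired[anomaly_key] = timestamp
            return True


dedup = AlertDeduplicator()
first = dedup.should_fire("svc-a:error-rate", 100.0)     # first report fires
repeat = dedup.should_fire("svc-a:error-rate", 104.0)    # 4s later: suppressed
skewed = dedup.should_fire("svc-a:error-rate", 98.5)     # slow clock: suppressed
other = dedup.should_fire("svc-b:error-rate", 104.0)     # different anomaly fires
later = dedup.should_fire("svc-a:error-rate", 120.0)     # outside horizon: fires
```

Across separate processes the same idea would need an atomic set-if-absent with expiry in a shared store rather than an in-memory dict.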
Claude Code with Sonnet 4.5
GPT 5 and 5.1 Codex with Codex CLI
Kimi K2 Thinking with Kimi CLI
Here's what I found out:
Test 1 - Advanced Anomaly Detection: Both GPT-5 and GPT-5.1 Codex shipped working code. Claude and Kimi both had critical bugs that would crash in production. GPT-5.1 improved on GPT-5's architecture and was faster (11m vs 18m).
Test 2 - Distributed Alert Deduplication: The Codex models won again with actual integration. Claude had solid architecture but didn't wire it up. Kimi had good ideas but broken duplicate-detection logic.
Codex cost me $0.95 total (GPT-5) vs Claude's $1.68. That's 43% cheaper for code that actually works. GPT-5.1 was even more efficient at $0.76 total ($0.39 for test 1, $0.37 for test 2).
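The percentages follow directly from the totals above. A quick check, using only the dollar figures from this post (the helper name is mine, and the GPT-5.1 vs Claude figure is derived here rather than stated in the post):

```python
def percent_cheaper(cost, baseline):
    """How much cheaper `cost` is than `baseline`, as a percentage of baseline."""
    return (baseline - cost) / baseline * 100


claude_total = 1.68
gpt5_total = 0.95
gpt51_total = 0.39 + 0.37  # test 1 + test 2

gpt5_vs_claude = percent_cheaper(gpt5_total, claude_total)
gpt51_vs_claude = percent_cheaper(gpt51_total, claude_total)

print(f"GPT-5 vs Claude:   {gpt5_vs_claude:.0f}% cheaper")
print(f"GPT-5.1 vs Claude: {gpt51_vs_claude:.0f}% cheaper")
```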
I've written up the complete comparison. Check it out here: Codexes vs Sonnet vs Kimi
And, honestly, I can see a similar performance delta in other tasks as well. I still use Haiku for many quick tasks and Opus for hardcore reasoning, but the GPT-5 variants have become great workhorses.
OpenAI is certainly after those juicy Anthropic enterprise margins, and Anthropic really needs to rethink its pricing.
Would love to know your experience with GPT 5.1 and how you rate it against Claude 4.5 Sonnet.
Been testing both for a full day now, and I've got some thoughts. Also want to make sure I'm not going crazy.
Look, maybe I'm biased because I'm used to it, but Claude Code just feels right in my terminal. I actually prefer it over the Claude desktop app most of the time bc of the granular control. Want to crank up thinking? Use "ultrathink". Need agents? Just ask.
Now, GPT-5. Man, I had HIGH hopes. OpenAI's marketing this as the "best coding model" and I was expecting that same mind-blown feeling I got when Claude Code (Opus 4) first dropped. But honestly? Not even close. And yes, before anyone asks, I'm using GPT-5 on Medium as a Plus user, so maybe the heavy thinking version is much different (though I doubt it).
What's really got me scratching my head is seeing the Cursor CEO singing its praises. Like, am I using it wrong? Is GPT-5 somehow way better in Cursor than in Codex CLI? Because with Claude, the experience is much better in Claude Code vs Cursor imo (which is why I don't use Cursor anymore).
The Torture Test: My go-to new model test is having them build complex 3D renders from scratch. After Opus 4.1 was released, I had Claude Code tackle a biochemical mechanism visualization with multiple organelles, proteins, substrates, the whole nine yards. Claude picked Vite + Three.js + GSAP, and while it didn't one-shot it (they never do), I got damn close to a viable animation in a single day. That's impressive, especially considering the little effort I intentionally put forth.
So naturally, I thought I'd let GPT-5 take a crack at fixing some lingering bugs. Key word: thought.
Not only could it NOT fix them, it actively broke working parts of the code. Features it claimed to implement? Either missing or broken. I specifically prompted Codex to carefully read the files, understand the existing architecture, and exercise caution. The kind of instructions that would have Claude treating my code like fine china. GPT-5? Went full bull in a china shop.
Don't get me wrong, I've seen Claude break things too. But after extensive testing across different scenarios, here's my take:
Simple stuff (basic features, bug fixes): GPT-5 holds its own
Complex from-scratch projects: Claude by a mile
Understanding existing codebases: Claude handles context better (it's always been like this)
I'm continuing to test GPT-5 in various scenarios, but right now I can't confidently build anything complex from scratch with it.
Curious what everyone else's experience has been. Am I missing something here, or is the emperor wearing no clothes?
I have been using Codex for a while (since Sonnet 4 was nerfed), and so far it has been a great experience. Now that Sonnet 4.5 is here, I really wanted to test which of the two, Sonnet 4.5 or GPT-5-codex, offers more value.
So I built an e-com app (I named it vibeshop, as it is vibe coded) with both models, using CC and Codex CLI with their respective LLMs. I also added MCP to the mix for a complete agentic coding setup.
I created a monorepo and used various packages to see how well the models could handle context. I built a clothing recommendation engine in TypeScript for a serverless environment to test performance under realistic constraints (I was really hoping that these models would make the architectural decisions on their own, and tell me that this can't be done in a serverless environment because of the computational load). The app takes user preferences, ranks outfits, and generates clean UI layouts for web and mobile.
Here's what I found out.
Observations on Claude perf
Claude Sonnet 4.5 started strong. It handled the design beautifully, with pixel-perfect layouts, proper hierarchy, and clear explanations of each step. I could never have done this lol. But as the project grew, it struggled with smaller details, like schema relations and handling HttpOnly tokens mapped to opaque IDs with TTL/cleanup to prevent spoofing or cross-user issues.
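For context on that last detail, the pattern is roughly this: the browser only ever holds a random opaque ID in an HttpOnly cookie, and the server maps that ID to a user with an expiry. Below is a minimal sketch of the idea, not the code either model wrote. The app itself was TypeScript; this uses Python for brevity, and the class name, cookie name, and TTL are all assumptions.

```python
import secrets
import time


class SessionStore:
    """Maps opaque session IDs to users with a TTL (in-memory, illustrative)."""

    def __init__(self, ttl_s=900, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock    # injectable so expiry can be exercised in tests
        self._sessions = {}   # opaque id -> (user_id, expires_at)

    def create(self, user_id):
        sid = secrets.token_urlsafe(32)  # unguessable, carries no user data
        self._sessions[sid] = (user_id, self.clock() + self.ttl_s)
        return sid

    def resolve(self, sid):
        """Return the user for a live session, or None if unknown or expired."""
        entry = self._sessions.get(sid)
        if entry is None:
            return None
        user_id, expires_at = entry
        if self.clock() >= expires_at:
            del self._sessions[sid]
            return None
        return user_id

    def cleanup(self):
        """Drop expired sessions; returns how many were removed."""
        now = self.clock()
        expired = [s for s, (_, exp) in self._sessions.items() if now >= exp]
        for s in expired:
            del self._sessions[s]
        return len(expired)


def cookie_header(sid):
    # "session" is a hypothetical cookie name; the attributes are standard.
    return f"session={sid}; HttpOnly; Secure; SameSite=Lax; Path=/"


now = [0.0]
store = SessionStore(ttl_s=10, clock=lambda: now[0])
sid = store.create("user-1")
live_user = store.resolve(sid)        # "user-1" while the session is fresh
forged_user = store.resolve("forged") # unknown IDs never map to a user
now[0] = 11.0                         # advance the fake clock past the TTL
removed = store.cleanup()
expired_user = store.resolve(sid)
```

Because the ID is random and looked up server-side, a client cannot forge one that resolves to another user, and the TTL plus cleanup keeps stale entries from piling up.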
Observations on GPT-5-codex
GPT-5 Codex, on the other hand, handled the situation better. It maintained context better, refactored safely, and produced working code almost immediately (though it still had some linter errors like unused variables). It understood file dependencies, handled cross-module logic cleanly, and seemed to “get” the project structure better. The only downside was the developer experience of Codex: the docs are still unclear and there is limited control, but the output quality made up for it.
Both models still produced long-running queries that would be problematic in a serverless setup. It would’ve been nice if they had flagged that upfront, but it shows that architectural choices still require a human designer to make the final calls. By the end, Codex delivered the entire recommendation engine with fewer retries and far fewer context errors. Claude’s output looked cleaner on the surface, but Codex’s results actually held up in production.
Claude outdid GPT-5 in frontend implementation, and GPT-5 outshone Claude in debugging and backend implementation.
Cost comparison:
Claude Sonnet 4.5 + Claude Code: ~18M input + 117k output tokens, cost around $10.26. Produced more lint errors but UI looked clean.
GPT-5 Codex + Codex Agent: ~600k input + 103k output tokens, cost around $2.50. Fewer errors, clean UI, and better schema handling.
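One way to read those numbers is to split the gap into token volume and price per token. The sketch below uses only the approximate figures above; the "blended" rate simply divides total cost by total tokens, which is my simplification and ignores input/output price differences and any caching.

```python
def blended_cost_per_million(cost_usd, input_tokens, output_tokens):
    """Total cost divided by total tokens, in USD per million tokens."""
    return cost_usd / ((input_tokens + output_tokens) / 1_000_000)


claude_tokens = 18_000_000 + 117_000
codex_tokens = 600_000 + 103_000

claude_per_m = blended_cost_per_million(10.26, 18_000_000, 117_000)
codex_per_m = blended_cost_per_million(2.50, 600_000, 103_000)
token_ratio = claude_tokens / codex_tokens

print(f"Claude: ${claude_per_m:.2f} per 1M tokens")
print(f"Codex:  ${codex_per_m:.2f} per 1M tokens")
print(f"Claude used about {token_ratio:.0f}x the tokens")
```

On these figures Codex was actually pricier per token; the lower bill comes from it using far fewer tokens for the same job.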
I wrote a full breakdown: Claude 4.5 Sonnet vs GPT-5 Codex.
Would love to know what combination of coding agent and models you use and how you found Sonnet 4.5 in comparison to GPT-5.
Over the last couple of days I’ve been running GPT-5.1-Codex and Claude Code side-by-side in VS Code on actual project work, not the usual throwaway examples. The difference has surprised me. GPT-5.1-Codex feels noticeably quicker, keeps track of what’s going on across multiple files, and actually updates the codebase without making a mess. Claude Code is still fine for small refactors or explaining what a block of code does, but once things get a bit more involved it starts losing context, mixing up files, or spitting out diffs that don’t match anything. Curious if others are seeing the same thing.