I've used Claude Sonnet models the most among LLMs, for the simple reason that they are so good at prompt-following and absolute beasts at tool execution. That also partly explains why most of Anthropic's revenue comes from its API (coding agents, to be precise). They have an insane first-mover advantage and developer love to die for.
But GPT-5.1 Codex has been insanely good. One of the first things I do when a promising new model drops is run small tests to decide which models to stick with until the next significant release. It also lets me dogfood our product while building these tests.
I did a quick competition among Claude Sonnet 4.5, GPT-5, GPT-5.1 Codex, and Kimi K2 Thinking.
Test 1 involved building a system that learns baseline error rates, uses z-scores and moving averages, catches rate-of-change spikes, and handles 100k+ logs/minute with under 10ms latency.
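For concreteness, here's a rough sketch of the kind of detector that brief describes (an exponentially weighted baseline plus z-score and rate-of-change checks). The class name, thresholds, and smoothing factor are my own illustrative choices, not anything the models produced:

```ts
// Rough illustration of the Test 1 brief, not any model's output.
// Keeps an exponentially weighted mean/variance of the error rate and flags
// z-score outliers and sudden rate-of-change spikes in O(1) per batch.
class AnomalyDetector {
  private mean = 0;
  private variance = 0;
  private lastRate = 0;
  private batches = 0;

  constructor(
    private readonly alpha = 0.05,       // EWMA smoothing factor (assumed)
    private readonly zThreshold = 3,     // flag beyond 3 sigma (assumed)
    private readonly rocThreshold = 2.0, // flag if the rate doubles batch-to-batch (assumed)
  ) {}

  // errorRate: errors per minute observed in the latest batch of logs.
  check(errorRate: number): { zSpike: boolean; rocSpike: boolean } {
    this.batches++;

    // Rate-of-change check against the previous batch.
    const rocSpike = this.lastRate > 0 && errorRate / this.lastRate >= this.rocThreshold;
    this.lastRate = errorRate;

    // Z-score against the learned baseline (skip the warm-up period).
    const std = Math.sqrt(this.variance);
    const zSpike =
      this.batches > 30 && std > 0 &&
      Math.abs(errorRate - this.mean) / std >= this.zThreshold;

    // Update the exponentially weighted baseline (moving average + variance).
    const delta = errorRate - this.mean;
    this.mean += this.alpha * delta;
    this.variance = (1 - this.alpha) * (this.variance + this.alpha * delta * delta);

    return { zSpike, rocSpike };
  }
}
```

Everything here is O(1) per batch, which is the property that keeps you under a 10ms budget at 100k+ logs/minute.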
Test 2 involved fixing race conditions when multiple processors detect the same anomaly. Handle ≤3s clock skew and processor crashes. Prevent duplicate alerts when processors fire within 5 seconds of each other.
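And the core of the Test 2 brief, boiled down: alerts for the same anomaly from different processors should collapse into one, even when their clocks disagree. Another sketch of mine, with the key naming and window padding assumed for illustration:

```ts
// Rough illustration of the Test 2 brief, not any model's output.
// Two processors that detect the same anomaly within 5s of each other should
// produce a single alert, even if their clocks disagree by up to 3s.
const DEDUP_WINDOW_MS = 5_000;   // "within 5 seconds" from the brief
const MAX_CLOCK_SKEW_MS = 3_000; // "≤3s clock skew" from the brief

class AlertDeduplicator {
  // anomalyKey -> event timestamp of the alert we last emitted
  private lastAlerted = new Map<string, number>();

  // Returns true if the caller should emit the alert, false if it's a duplicate.
  shouldAlert(anomalyKey: string, eventTimestampMs: number): boolean {
    const last = this.lastAlerted.get(anomalyKey);

    // Pad the dedup window by the allowed skew, so two processors whose clocks
    // are 3s apart still agree that alerts fired 5s apart are duplicates.
    if (
      last !== undefined &&
      Math.abs(eventTimestampMs - last) < DEDUP_WINDOW_MS + MAX_CLOCK_SKEW_MS
    ) {
      return false;
    }

    this.lastAlerted.set(anomalyKey, eventTimestampMs);
    return true;
  }
}
```

This only shows the windowing logic; the hard part of the brief is making that check-and-set atomic across processors and surviving crashes, which in practice means a shared store with expiring entries rather than an in-memory map.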
The setup used each model with its own CLI agent inside Cursor:
Claude Code with Sonnet 4.5
GPT 5 and 5.1 Codex with Codex CLI
Kimi K2 Thinking with Kimi CLI
Here's what I found out:
Test 1 - Advanced Anomaly Detection: Both GPT-5 and GPT-5.1 Codex shipped working code. Claude and Kimi both had critical bugs that would crash in production. GPT-5.1 improved on GPT-5's architecture and was faster (11m vs 18m).
Test 2 - Distributed Alert Deduplication: The Codex models won again with actual integration. Claude had solid architecture but didn't wire it up. Kimi had good ideas, but its duplicate-detection logic was broken.
Codex with GPT-5 cost me $0.95 total vs Claude's $1.68. That's 43% cheaper for code that actually works. GPT-5.1 was even more efficient at $0.76 total ($0.39 for Test 1, $0.37 for Test 2).
I have written up a complete comparison of all of this. Check it out here: Codexes vs Sonnet vs Kimi
And, honestly, I see a similar performance delta on other tasks as well. For many quick tasks I still use Haiku, and Opus for hardcore reasoning, but the GPT-5 variants have become great workhorses.
OpenAI is certainly after those juicy Anthropic enterprise margins, and Anthropic really needs to rethink its pricing.
Would love to know your experience with GPT-5.1 and how you rate it against Claude Sonnet 4.5.
I am in the final phases of a very complicated multi-ecosystem project. Since the other models couldn't dig deep enough, I'm standing between those two giants. Which one would you go for?
Been testing both for a full day now, and I've got some thoughts. Also want to make sure I'm not going crazy.
Look, maybe I'm biased because I'm used to it, but Claude Code just feels right in my terminal. I actually prefer it over the Claude desktop app most of the time because of the granular control. Want to crank up thinking? Use "ultrathink". Need agents? Just ask.
Now, GPT-5. Man, I had HIGH hopes. OpenAI is marketing this as the "best coding model," and I was expecting that same mind-blown feeling I got when Claude Code (with Opus 4) first dropped. But honestly? Not even close. And yes, before anyone asks, I'm using GPT-5 on medium reasoning as a Plus user, so maybe the heavy-thinking version is much different (though I doubt it).
What's really got me scratching my head is seeing the Cursor CEO singing its praises. Like, am I using it wrong? Is GPT-5 somehow way better in Cursor than in Codex CLI? Because with Claude, the experience is much better in Claude Code than in Cursor, IMO (which is why I don't use Cursor anymore).
The Torture Test: My go-to test for a new model is having it build complex 3D renders from scratch. After Opus 4.1 was released, I had Claude Code tackle a biochemical mechanism visualization with multiple organelles, proteins, substrates, the whole nine yards. Claude picked Vite + Three.js + GSAP, and while it didn't one-shot it (they never do), I got damn close to a viable animation in a single day. That's impressive, especially considering how little effort I intentionally put in.
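If you haven't touched that stack, the skeleton looks roughly like this (a generic minimal scene I'm sketching for illustration, not Claude's actual output; the real visualization layers many more meshes and GSAP timelines on top):

```ts
// Minimal Vite + Three.js + GSAP entry file, just to show the stack's shape.
import * as THREE from 'three';
import { gsap } from 'gsap';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 100);
camera.position.z = 5;

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(innerWidth, innerHeight);
document.body.appendChild(renderer.domElement);

scene.add(new THREE.AmbientLight(0xffffff, 0.4));
const light = new THREE.DirectionalLight(0xffffff, 1);
light.position.set(3, 5, 4);
scene.add(light);

// Stand-in for one "organelle": a sphere whose motion GSAP will drive.
const organelle = new THREE.Mesh(
  new THREE.SphereGeometry(1, 32, 32),
  new THREE.MeshStandardMaterial({ color: 0x44aa88 }),
);
scene.add(organelle);

// GSAP handles the keyframed animation; Three.js just renders every frame.
gsap.to(organelle.position, { x: 2, duration: 2, yoyo: true, repeat: -1, ease: 'power1.inOut' });

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```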
So naturally, I thought I'd let GPT-5 take a crack at fixing some lingering bugs. Key word: thought.
Not only could it NOT fix them, it actively broke working parts of the code. Features it claimed to implement? Either missing or broken. I specifically prompted Codex to carefully read the files, understand the existing architecture, and exercise caution. The kind of instructions that would have Claude treating my code like fine china. GPT-5? Went full bull in a china shop.
Don't get me wrong, I've seen Claude break things too. But after extensive testing across different scenarios, here's my take:
Simple stuff (basic features, bug fixes): GPT-5 holds its own
Complex from-scratch projects: Claude by a mile
Understanding existing codebases: Claude handles context better (it's always been like this)
I'm continuing to test GPT-5 in various scenarios, but right now I can't confidently build anything complex from scratch with it.
Curious what everyone else's experience has been. Am I missing something here, or is the emperor wearing no clothes?