Well, without rehashing the whole Claude vs. Codex drama again, we're basically in the same situation, except this time, somehow, the Claude Code + Sonnet 4.5 combo actually shows real strength.
I asked something I thought would be super easy and straightforward for Gemini 3.0 Pro.
I work in a fully dockerized environment, meaning every little Python module I have runs inside its own container, and they all share the same database. Nothing too complicated, right?
It was late at night, I was tired, and I asked Gemini 3.0 Pro to apply a small patch to one of the containers, redeploy it for me, and test the endpoint.
Well… bad idea. It completely messed up the DB container (no worries, I had backups, and it didn't actually delete the volumes). It spun up a brand-new container, created a new database, and set a new password, "postgres123". Then it kept starting and stopping the module I had asked it to refactor, and since it had changed the database, of course the module couldn't connect anymore. Long story short: even with precise instructions, it failed, ran out of tokens, and hit the 5-hour limit.
So I reverted everything and asked Claude Code the exact same thing.
Five to ten minutes later: everything was smooth. No issues at all.
The refactor worked perfectly.
Conclusion:
Maybe everyone already knows this, but the best benchmarks, even the agentic ones, are NOT good indicators of real-world performance. This all comes down to orchestration, and that's exactly why so many companies like Factory.AI are investing heavily in this space.
I've been switching back and forth between Claude Sonnet 4.5, Composer 1, and Gemini 3.0, and I'm trying to figure out which model actually performs better for real-world coding tasks inside Cursor AI. I'm not looking for a general comparison.
I want feedback specifically in the context of how these models behave inside the Cursor IDE.
I ran all three models on a coding task just to see how they behave when things aren't clean or nicely phrased.
The goal was just to see who performs like a real dev.
Here's my takeaway:
Opus 4.5 handled real repo issues the best. It fixed things without breaking unrelated parts and didn't hallucinate new abstractions. Felt the most "engineering-minded."
GPT-5.1 was close behind. It explained its reasoning step by step and sometimes added improvements I never asked for. Helpful when you want safety, annoying when you want precision.
Gemini solved most tasks but tended to optimize or simplify decisions I explicitly constrained. Good output, but sometimes too “creative.”
On refactoring and architecture-level tasks:
Opus delivered the most complete refactor with consistent naming, updated dependencies, and documentation.
GPT-5.1 took longer because it analyzed first, but the output was maintainable and defensive.
Gemini produced clean code but missed deeper security and design patterns.
Context windows (because they matter at repo scale):
Opus 4.5: ~200K tokens usable, handles large repos better without losing track
GPT-5.1: ~128K tokens but strong long-reasoning even near the limit
Gemini 3 Pro: ~1M tokens which is huge, but performance becomes inconsistent as input gets massive
I used these frontier models side by side in my multi-agent AI setup with Anannas LLM Provider, and the results were interesting. What's your experience been with these three?
Have you run your own comparisons, and if so, what setup are you using?
Three new coding models dropped almost at the same time, so I ran a quick real-world test inside my observability system. These weren't playground experiments; I had each model implement the same two components directly in my repo (rough sketches of both patterns follow below):
Statistical anomaly detection (EWMA, z-scores, spike detection, 100k+ logs/min)
Distributed alert deduplication (clock skew, crashes, 5s suppression window)
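To give a sense of what the first component boils down to, here's a minimal TypeScript sketch of an EWMA-plus-z-score spike detector. This is my own illustration of the pattern (the names, ALPHA, and the 3-sigma threshold are arbitrary), not any model's actual output:

```typescript
// Minimal EWMA spike detector: one small state object per metric, O(1) per sample.
interface EwmaState {
  mean: number;
  variance: number;
}

const ALPHA = 0.3;       // smoothing factor (illustrative, not tuned)
const Z_THRESHOLD = 3.0; // flag samples more than 3 sigma from the running mean

const states = new Map<string, EwmaState>();

function observe(metric: string, value: number): boolean {
  // Defensive: drop non-finite samples so Infinity/NaN never poison the state.
  if (!Number.isFinite(value)) return false;

  const s = states.get(metric);
  if (!s) {
    states.set(metric, { mean: value, variance: 0 });
    return false; // first sample, nothing to compare against yet
  }

  const std = Math.sqrt(s.variance);
  const z = std > 0 ? (value - s.mean) / std : 0;
  const isSpike = Math.abs(z) > Z_THRESHOLD;

  // Exponentially weighted updates of mean and variance.
  const diff = value - s.mean;
  s.mean += ALPHA * diff;
  s.variance = (1 - ALPHA) * (s.variance + ALPHA * diff * diff);

  return isSpike;
}
```

The point is that the whole hot path is constant time per log line, which is what makes 100k+ logs/min feasible without batching.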
Here’s the simplified summary of how each behaved.
Claude Opus 4.5
Super detailed architecture, tons of structure, very “platform rewrite” energy.
But one small edge case (Infinity.toFixed) crashed the service, and the restored state came back corrupted.
Great design, not immediately production-safe.
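For reference, a guard along these lines would likely have avoided that edge case. My guess at the failure mode (I haven't traced the exact call path): a divide-by-zero produces Infinity, toFixed turns it into a non-numeric string, and that string ends up in serialized state. A sketch of the defensive version, purely my own illustration:

```typescript
// Non-finite values survive toFixed() as the strings "Infinity" / "NaN",
// which can then leak into serialized or persisted state. Guarding with
// Number.isFinite keeps the formatted output well-formed.
function formatMetric(value: number, digits = 2): string {
  return Number.isFinite(value) ? value.toFixed(digits) : "0";
}
```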
GPT-5.1 Codex
Most stable output.
Simple O(1) anomaly loop, defensive math, clean Postgres-based dedupe with row locks.
Integrated into my existing codebase with zero fixes required.
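For anyone unfamiliar with the row-lock pattern, here's roughly what that dedupe path looks like with node-postgres. It's a sketch under my own assumptions (a hypothetical alerts(fingerprint, last_fired) table and a 5-second window), not GPT-5.1's actual code:

```typescript
import { Pool } from "pg";

// Hypothetical schema: alerts(fingerprint text primary key, last_fired timestamptz).
// SELECT ... FOR UPDATE serializes concurrent workers on the same fingerprint,
// and comparing against the database clock (now()) sidesteps clock skew
// between worker hosts.
const pool = new Pool();

async function shouldFire(fingerprint: string): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const existing = await client.query(
      `SELECT (now() - last_fired) < interval '5 seconds' AS suppressed
         FROM alerts WHERE fingerprint = $1 FOR UPDATE`,
      [fingerprint]
    );
    if (existing.rows[0]?.suppressed) {
      await client.query("COMMIT");
      return false; // another worker fired this alert inside the window
    }
    await client.query(
      `INSERT INTO alerts (fingerprint, last_fired) VALUES ($1, now())
         ON CONFLICT (fingerprint) DO UPDATE SET last_fired = now()`,
      [fingerprint]
    );
    await client.query("COMMIT");
    return true;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

Keeping the window comparison on the database clock rather than each worker's local clock is the part that handles the clock-skew requirement.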
Gemini 3 Pro
Fastest output and cleanest code.
Compact EWMA, straightforward ON CONFLICT dedupe.
Needed a bit of manual edge-case review but great for fast iteration.
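The ON CONFLICT variant compresses the same idea into one statement: the upsert only touches the row when it's outside the suppression window, so the row count tells you whether this worker gets to fire. Again a sketch against the same hypothetical alerts table, not Gemini's actual output:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Single-statement dedupe: the upsert only updates the row when it is outside
// the 5-second suppression window, so a row count of zero means "suppress".
async function shouldFireOnConflict(fingerprint: string): Promise<boolean> {
  const result = await pool.query(
    `INSERT INTO alerts (fingerprint, last_fired)
       VALUES ($1, now())
     ON CONFLICT (fingerprint) DO UPDATE
       SET last_fired = now()
       WHERE alerts.last_fired < now() - interval '5 seconds'`,
    [fingerprint]
  );
  return (result.rowCount ?? 0) > 0;
}
```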
TL;DR
| Model | Cost | Time | Notes |
|---|---|---|---|
| Gemini 3 Pro | $0.25 | ~5-6 mins | Very fast, clean |
| GPT-5.1 Codex | $0.51 | ~5-6 mins | Most reliable in my tests |
| Claude Opus 4.5 | $1.76 | ~12 mins | Strong design, needs hardening |
I also wired Composio’s tool router in one branch for Slack/Jira/PagerDuty actions, which simplified agent-side integrations.
Not claiming any "winner," just sharing how each behaved inside a real codebase.
If you want to know more, check out the complete analysis in the full blog post.