I've been switching back and forth between Claude Sonnet 4.5 (or Composer 1) and Gemini 3.0, and I'm trying to figure out which model actually performs better for real-world coding tasks inside Cursor AI. I'm not looking for a general comparison.
I want feedback specifically in the context of how these models behave inside the Cursor IDE.
Claude Code-Sonnet 4.5 >>>>>>> Gemini 3.0 Pro - Antigravity
It seems Opus 4.5 is just too amazing even compared to Gemini 3
Is it just me or is Claude 4.5 better than Gemini 3 Pro on Antigravity?
Have some fun with Gemini 3 vs Sonnet 4.5
Well, without rehashing the whole Claude vs. Codex drama again: we're basically in the same situation, except this time, somehow, the Claude Code + Sonnet 4.5 combo actually shows real strength.
I asked something I thought would be super easy and straightforward for Gemini 3.0 Pro.
I work in a fully dockerized environment, meaning every little Python module I have runs inside its own container, and they all share the same database. Nothing too complicated, right?
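Roughly, each module grabs its database credentials from environment variables that docker-compose injects, something like the sketch below (the service names and variable names here are just illustrative, not my actual config):

```python
import os
import psycopg2  # every module container talks to the same shared Postgres service


def get_db_connection():
    # Credentials come from environment variables injected at container start,
    # so every module shares the same database settings.
    return psycopg2.connect(
        host=os.environ.get("DB_HOST", "db"),        # compose service name (illustrative)
        dbname=os.environ.get("DB_NAME", "appdb"),
        user=os.environ.get("DB_USER", "postgres"),
        password=os.environ["DB_PASSWORD"],          # breaks for every module if someone resets this
    )
```

Which is exactly why silently rotating the database password is about the worst thing an agent can do in this kind of setup.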
It was late at night, I was tired, and I asked Gemini 3.0 Pro to apply a small patch to one of the containers, redeploy it for me, and test the endpoint.
Well… bad idea. It completely messed up the DB container (no worries, I had backups, and it didn't actually delete the volumes). It spun up a brand-new container, created a new database, and set a new password, "postgres123". Then it kept starting and stopping the module I had asked it to refactor… and since it had changed the database, of course the module couldn't connect anymore. Long story short: even with precise instructions, it failed, ran out of tokens, and hit the 5-hour limit.
So I reverted everything and asked Claude Code the exact same thing.
Five to ten minutes later: everything was smooth. No issues at all.
The refactor worked perfectly.
Conclusion:
Maybe everyone already knows this, but even the best benchmarks, agentic ones included, are NOT good indicators of real-world performance. It all comes down to orchestration, and that's exactly why so many companies like Factory.AI are investing heavily in this space.
Gemini 3 Pro is quite slow and keeps making more errors than Claude Sonnet 4.5 on Antigravity. It was fine at the start, but the more I use it, the more it produces malformed edits; now it isn't even able to edit a single file.
I don't know if this is a bug or whether it's just that bad. Is anyone else facing problems?
Edit: FYI, I'm experiencing this on both the Low and High versions on Fast. It is SO slow, taking up to a few minutes just to give me an initial response.
Have some fun with Gemini 3. I'm pretty sure Google trained the model to tell you it is the smartest model with the most raw compute (hard-coded).
You can basically get it to prove itself wrong a few different ways, then have it compare itself to another model, and it will always say it has more 'raw' compute or is smarter and that it only failed because of XYZ - kind of like how, when you ask Claude who Dario is, you get Dario the CEO (they stuck it in the system prompt).
I had it come up with some instructions that included writing a Python script. It proceeded to import a non-existent library, and when a different model actually fixed this, Gemini still claimed that it had more 'raw' compute.
My Sonnet 4.5 + memory setup pretty consistently beat Gemini.
TLDR: Gemini 3 SOUNDS great and has a lot of the agentic capabilities where it reads the 'intent' of a user's query rather than the literal meaning, but its actual brain is not as big of a leap as I thought it would be.
What have people found?
Sonnet > Gemini > Opus > GPT-5.1 in terms of long-running ability on complex tasks.
Claude Code for me is still beating Gemini/Antigravity. (Also note that if you are using the research preview of Antigravity, it's entirely possible that they take ALL of your data for training purposes; read the terms of service carefully.)
Ran these three models through three real-world coding scenarios to see how they actually perform.
The tests:
Prompt adherence: Asked for a Python rate limiter with 10 specific requirements (exact class names, error messages, etc.). Basically testing whether they follow instructions or treat them as "suggestions" (a rough sketch of the task follows this list).
Code refactoring: Gave them a messy legacy API with security holes and bad practices. Wanted to see if they'd catch the issues and fix the architecture, plus whether they'd add safeguards we didn't explicitly ask for.
System extension: Handed over a partial notification system and asked them to explain the architecture first, then add an email handler (a sketch of this one follows the list too). Testing comprehension before implementation.
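To give a rough idea of what Test 1 was asking for, here's a minimal sliding-window rate limiter in Python. The class and exception names below are placeholders, not the ones from the actual prompt (those are in the full write-up linked at the end):

```python
import time


class RateLimitExceeded(Exception):
    """Raised when a caller goes over its allowance (placeholder name)."""


class SlidingWindowRateLimiter:
    """Allow at most `max_calls` per `window_seconds`, tracked per key."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls: dict[str, list[float]] = {}

    def check(self, key: str) -> None:
        # Drop timestamps that have fallen out of the window, then count the rest.
        now = time.monotonic()
        window_start = now - self.window_seconds
        recent = [t for t in self._calls.get(key, []) if t > window_start]
        if len(recent) >= self.max_calls:
            raise RateLimitExceeded(f"rate limit exceeded for {key!r}")
        recent.append(now)
        self._calls[key] = recent
```

The real prompt pinned down ten details like these (names, error messages, behavior), which is what makes it a useful adherence test: there's exactly one right answer for each requirement.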
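And for Test 3, the extension being asked for looks roughly like this: a hypothetical email handler slotted into a generic notification interface. None of these names or interfaces come from the real test harness, they're just to show the shape of the task:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Notification:
    event_type: str
    recipient: str
    payload: dict


class NotificationHandler(Protocol):
    """Anything the dispatcher can route a notification to."""

    def handle(self, notification: Notification) -> None: ...


class EmailHandler:
    """Formats a notification as an email and hands it to an injected sender."""

    def __init__(self, sender):
        self._sender = sender  # anything with a send(to, subject, body) method

    def handle(self, notification: Notification) -> None:
        subject = f"[{notification.event_type}] notification"
        body = "\n".join(f"{k}: {v}" for k, v in notification.payload.items())
        self._sender.send(to=notification.recipient, subject=subject, body=body)
```

The point of the test is that the model has to understand the existing dispatcher and handler contract before bolting on the new handler, not just dump email-sending code somewhere.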
Results:
Test 1 (Prompt Adherence): Gemini followed instructions most literally. Opus stayed close to spec with cleaner docs. GPT-5.1 went into defensive mode and added validation and safeguards that weren't requested.
Test 2 (TypeScript API): Opus delivered the most complete refactoring (all 10 requirements). GPT-5.1 hit 9/10 and caught security issues like missing auth and unsafe DB ops. Gemini got 8/10 with cleaner, faster output but missed some architectural flaws.
Test 3 (System Extension): Opus gave the most complete solution with templates for every event type. GPT-5.1 went deep on the understanding phase (identified bugs, created diagrams), then built out rich features like CC/BCC and attachments. Gemini understood the basics but delivered a "bare minimum" version.
Takeaways:
Opus was fastest overall (7 min total) while producing the most thorough output. Stayed concise when the spec was rigid, wrote more when thoroughness mattered.
GPT-5.1 consistently wrote 1.5-1.8x more code than Gemini because of JSDoc comments, validation logic, error handling, and explicit type definitions.
Gemini is cheapest overall but actually cost more than GPT in the complex system task - seems like it "thinks" longer even when the output is shorter.
Opus is most expensive ($1.68 vs $1.10 for Gemini) but if you need complete implementations on the first try, that might be worth it.
Full methodology and detailed breakdown here: https://blog.kilo.ai/p/benchmarking-gpt-51-vs-gemini-30-vs-opus-45
What's your experience been with these three? Have you run your own comparisons, and if so, what setup are you using?