🌐
Anthropic
anthropic.com › news › claude-4
Introducing Claude 4
Claude 4 models lead on SWE-bench Verified, a benchmark for performance on real software engineering tasks.
🌐
DataCamp
datacamp.com › blog › claude-4
Claude 4: Tests, Features, Access, Benchmarks & More | DataCamp
May 23, 2025 - Claude Opus 4 is designed for ... but I find that claim a bit empty. Yes, it’s currently the top performer on the SWE-bench Verified benchmark....
Discussions

Aider coding benchmarks for Claude 4 Sonnet & Opus
Sonnet 4 think < Sonnet 3.7 think? Sonnet 4 no think < Sonnet 3.7 no think? How? Regression? More on reddit.com
🌐 r/singularity
27
102
May 26, 2025
At last, Claude 4’s Aider Polyglot Coding Benchmark results are in (the benchmark many call the top "real-world" test).
Sonnet 4 worse than 3.7? More on reddit.com
🌐 r/ClaudeAI
66
161
May 26, 2025
Claude 4 benchmarks
Just tried Sonnet 4 on a toy problem, hit the context limit instantly. Demis Hassabis has made me become a big fat context pig. More on reddit.com
🌐 r/singularity
237
890
May 22, 2025
Introducing Claude 4
Here are the benchmarks:
Benchmark | Claude Opus 4 | Claude Sonnet 4 | Claude Sonnet 3.7 | OpenAI o3 | OpenAI GPT-4.1 | Gemini 2.5 Pro (Preview 05-06)
Agentic coding (SWE-bench Verified 1,5) | 72.5% / 79.4% | 72.7% / 80.2% | 62.3% / 70.3% | 69.1% | 54.6% | 63.2%
Agentic terminal coding (Terminal-bench 2,5) | 43.2% / 50.0% | 35.5% / 41.3% | 35.2% | 30.2% | 30.3% | 25.3%
Graduate-level reasoning (GPQA Diamond 5) | 79.6% / 83.3% | 75.4% / 83.8% | 78.2% | 83.3% | 66.3% | 83.0%
Agentic tool use (TAU-bench, Retail/Airline) | 81.4% / 59.6% | 80.5% / 60.0% | 81.2% / 58.4% | 70.4% / 52.0% | 68.0% / 49.4% | —
Multilingual Q&A (MMMLU 3) | 88.8% | 86.5% | 85.9% | 88.8% | 83.7% | —
Visual reasoning (MMMU validation) | 76.5% | 74.4% | 75.0% | 82.9% | 74.8% | 79.6%
HS math competition (AIME 2025 4,5) | 75.5% / 90.0% | 70.5% / 85.0% | 54.8% | 88.9% | — | 83.0%
More on reddit.com
🌐 r/ClaudeAI
208
830
April 19, 2025
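The flattened table above is easier to query once it is put in a structure. Below is a minimal Python sketch that encodes the quoted SWE-bench Verified row; where the table lists two values, Anthropic's announcement describes the higher one as using extra test-time compute (parallel sampling / extended thinking). The dictionary layout and helper name are illustrative only, not from any official source.

```python
# Illustrative only: scores copied from the quoted Reddit table above.
# Where two values appear, the first is the standard run and the second
# is the boosted run with parallel test-time compute / extended thinking.
SWE_BENCH_VERIFIED = {
    "Claude Opus 4": (72.5, 79.4),
    "Claude Sonnet 4": (72.7, 80.2),
    "Claude Sonnet 3.7": (62.3, 70.3),
    "OpenAI o3": (69.1, None),
    "OpenAI GPT-4.1": (54.6, None),
    "Gemini 2.5 Pro (05-06)": (63.2, None),
}

def standard_score(model: str) -> float:
    """Return the standard (non-boosted) SWE-bench Verified score."""
    return SWE_BENCH_VERIFIED[model][0]

if __name__ == "__main__":
    for model, (base, boosted) in sorted(
        SWE_BENCH_VERIFIED.items(), key=lambda kv: kv[1][0], reverse=True
    ):
        extra = f" (up to {boosted}% with extra test-time compute)" if boosted else ""
        print(f"{model}: {base}%{extra}")
    gap = standard_score("Claude Opus 4") - standard_score("Claude Sonnet 3.7")
    print(f"Opus 4 vs Sonnet 3.7 gap: {gap:.1f} points")
```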
🌐
Reddit
reddit.com › r/claudeai › claude 4 benchmarks - we eating!
r/ClaudeAI on Reddit: Claude 4 Benchmarks - We eating!
March 1, 2025 - Introducing the next generation: Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is our most powerful model yet, and the world’s best coding model. Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.
🌐
Vertu
vertu.com › best post › gemini 3 pro vs claude opus 4.5: the ultimate 2025 ai model comparison
Gemini 3 Pro vs Claude Opus 4.5: Benchmarks, Coding, Multimodal, and Cost Comparison
2 weeks ago - Agent Excellence: Superior performance on agentic benchmarks including: ... Extended Thinking: Advanced chain-of-thought execution with more stable reasoning across complex, multi-step problems · Tool Use Mastery: Exceptional at orchestrating multiple tools (bash, file editing, browser automation) and managing subagents via Claude Agent SDK
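The tool orchestration the Vertu piece describes can be approximated with the standard anthropic Python SDK; the sketch below is not the Claude Agent SDK it mentions, just the basic pattern of declaring one bash-style tool and reading back the model's tool-use request. The model ID and tool name are assumptions for illustration.

```python
# Minimal tool-use sketch with the anthropic Python SDK (pip install anthropic).
# NOT the Claude Agent SDK referenced above; only the core request/response shape.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID; substitute your own
    max_tokens=1024,
    tools=[{
        "name": "run_bash",  # hypothetical tool name for this sketch
        "description": "Run a shell command and return its stdout.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }],
    messages=[{"role": "user", "content": "List the files in the current directory."}],
)

# The model answers with a tool_use block naming the tool and its arguments;
# a real agent loop would execute the command and send back a tool_result.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```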
🌐
Getpassionfruit
getpassionfruit.com › blog › gpt-5-1-vs-claude-4-5-sonnet-vs-gemini-3-pro-vs-deepseek-v3-2-the-definitive-2025-ai-model-comparison
GPT 5.1 vs Claude 4.5 vs Gemini 3: 2025 AI Comparison
2 weeks ago - Gemini 3 Pro leads overall reasoning benchmarks with an unprecedented 1501 LMArena Elo, becoming the first model to break the 1500 barrier, while Claude 4.5 Sonnet dominates real-world coding at 77.2% SWE-bench and DeepSeek-V3.2 delivers ...
🌐
OpenCV
opencv.org › home › news › claude 4: the next generation of ai assistants
Claude 4 - Introduction, Benchmark & Applications
May 29, 2025 - This isn’t just a cool story; it’s a preview of the serious firepower Anthropic is unleashing today with Claude Opus 4 and Claude Sonnet 4. Forget incremental updates. These aren’t just upgrades; they’re setting new industry benchmarks for coding prowess, advanced reasoning capabilities, and the sophisticated operation of AI agents.
🌐
Vellum
vellum.ai › blog › claude-opus-4-5-benchmarks
Claude Opus 4.5 Benchmarks (Explained)
2 weeks ago - Claude Opus 4.5 achieves a remarkable score of 37.6%, a massive improvement that is more than double the score of GPT-5.1 (17.6%) and significantly higher than Gemini 3 Pro (31.1%). This points to a fundamental improvement in non-verbal, abstract ...
Find elsewhere
🌐
Composio
composio.dev › blog › claude-4-5-opus-vs-gemini-3-pro-vs-gpt-5-codex-max-the-sota-coding-model
Claude 4.5 Opus vs. Gemini 3 Pro vs. GPT-5-codex-max: The SOTA coding model - Composio
2 weeks ago - Claude 4.5 consistently demonstrated the strongest architectural reasoning and long-horizon thinking, but its outputs usually required additional effort to integrate and stabilise.
🌐
Artificial Analysis
artificialanalysis.ai › models › claude-4-opus
Claude 4 Opus - Intelligence, Performance & Price Analysis | Artificial Analysis
Analysis of Anthropic's Claude 4 Opus (Non-reasoning) and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first token), context window & more.
🌐
Entelligence
entelligence.ai › blogs › claude-4-detailed-analysis
Claude 4.0: A Detailed Analysis | Entelligence Blog
May 30, 2025 - It tops nearly all the benchmarks, the most important being SWE-bench Verified, where it scores 72.5% and reaches up to 79.4% with parallel test-time compute. Here are some other benchmarks you might be interested in: For this Claude ...
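The "parallel test-time compute" behind that 79.4% figure amounts to sampling several candidate solutions and keeping the best one, rather than a single greedy pass. A rough, model-agnostic best-of-N sketch follows; the generator and scorer are hypothetical stand-ins, not Anthropic's actual selection procedure.

```python
# Rough best-of-N sketch of "parallel test-time compute": draw several
# candidate solutions for the same task and keep the highest-scoring one.
# generate_candidate and score_candidate are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def best_of_n(
    generate_candidate: Callable[[int], str],
    score_candidate: Callable[[str], float],
    n: int = 8,
) -> str:
    """Generate n candidates in parallel and return the best-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate_candidate, range(n)))
    return max(candidates, key=score_candidate)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs on its own.
    samples = ["patch-a", "patch-bb", "patch-ccc"]
    pick = best_of_n(
        generate_candidate=lambda i: samples[i % len(samples)],
        score_candidate=len,  # e.g. a unit-test pass rate in a real setup
        n=3,
    )
    print("selected:", pick)
```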
🌐
Leanware
leanware.co › insights › claude-opus4-vs-gemini-2-5-pro-vs-openai-o3-comparison
Claude 4 Opus vs Gemini 2.5 Pro vs OpenAI o3 | Full Comparison
Recently, Anthropic and Google ... practical use cases to understand which one fits different development and business needs. TL;DR: Claude Opus 4 leads coding benchmarks at 72.5% SWE-bench, Gemi... Compare Claude Opus 4, Gemini 2.5 Pro, and OpenAI o3 to find the best AI for coding, document processing, ...
Rating: 5
🌐
Interconnects
interconnects.ai › p › claude-4-and-anthropics-bet-on-code
Claude 4 and Anthropic's bet on code - by Nathan Lambert
May 27, 2025 - Buried in the system card was an ... rather than provide real usefulness, that showed Claude 4 dramatically outperforming the 3.7 model riddled with user headaches....
🌐
Medium
medium.com › @bob.mashouf › claude-4-vs-llama-4-benchmarking-technical-capabilities-55b99c17d3f7
Claude 4 vs LLaMA 4: Benchmarking Technical Capabilities
May 26, 2025 - The dynamic reasoning module in Claude 4 gives a big boost on tasks that require more thinking, up to 15% better than Claude 3.7 in “extended thinking” mode. The multimodal capabilities shine on VQA tasks, 10% better than Claude 3.7 on MMMU.
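The "extended thinking" mode this Medium post compares against is exposed as a thinking parameter in the Anthropic Messages API. A minimal sketch of enabling it is below; the model ID and token budget are assumptions for illustration, so check the current docs for the values that apply to you.

```python
# Minimal sketch of enabling extended thinking via the anthropic Python SDK.
# The model ID and budget_tokens value are assumptions for illustration;
# max_tokens must exceed the thinking budget.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID; substitute your own
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "How many primes are there below 1000?"}],
)

# The response interleaves "thinking" blocks (the reasoning trace) with the
# final "text" answer; print only the visible answer here.
for block in response.content:
    if block.type == "text":
        print(block.text)
```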
🌐
Analytics Vidhya
analyticsvidhya.com › home › grok 4 vs claude 4: which is better?
Grok 4 vs Claude 4: Which is Better?
In this section, we will contrast Grok 4 and Claude 4 on some major available public benchmarks. The table below illustrates their differences and some important performance metrics, including reasoning, coding, latency, and context window size.
Published July 12, 2025
🌐
Digital Watch Observatory
dig.watch › home › updates › claude opus 4 sets a benchmark in ai coding as anthropic’s revenue doubles
Claude Opus 4 sets a benchmark in AI coding as Anthropic's revenue doubles | Digital Watch Observatory
May 25, 2025 - The Claude 4 models, backed by Amazon and developed by former OpenAI executives, feature improvements in coding, autonomous task execution, and reasoning. Opus 4 leads in the SWE-bench coding benchmark at 72.5 percent, outperforming OpenAI’s ...
🌐
Gocodeo
gocodeo.com › post › claude-4-reasoning-memory-benchmarks-tools-and-use-cases
Claude 4: Reasoning, Memory, Benchmarks, Tools, and Use Cases
May 25, 2025 - Anthropic’s Claude 4 models were benchmarked across a range of tasks in coding, reasoning, and agentic tool use.
🌐
16x Eval
eval.16x.engineer › blog › claude-4-opus-sonnet-evaluation-results
Claude Opus 4 and Claude Sonnet 4 Evaluation Results
May 25, 2025 - Both Claude models produced notably more concise code compared to other models in the test while maintaining correct functionality. When tested on creating benchmark visualizations from data, both Claude models scored 8.5/10, ...
🌐
Medium
medium.com › @leucopsis › claude-sonnet-4-and-opus-4-a-review-db68b004db90
Claude Sonnet 4 and Opus 4, a Review | by Barnacle Goose | Medium
May 29, 2025 - In the SWE-bench Verified software engineering benchmark (500 real-world coding challenges), Claude 4 models achieved the top scores: 72.5% accuracy for Opus 4 and 72.7% for Sonnet 4, significantly outperforming Claude 3.7’s 62.3% and also ...
🌐
DEV Community
dev.to › nodeshiftcloud › claude-4-opus-vs-sonnet-benchmarks-and-dev-workflow-with-claude-code-11fa
Claude 4: Opus vs Sonnet, Benchmarks, and Dev Workflow with Claude Code - DEV Community
May 24, 2025 - Claude 4 models set new highs in these tasks using only 500 problems, while OpenAI’s scores reflect a slightly smaller 477-task subset. For extended thinking benchmarks—like GPQA Diamond, TAU-bench, MMMLU, and AIME—performance surged when ...