🌐
Anthropic
anthropic.com › news › claude-4
Introducing Claude 4
Claude 4 models lead on SWE-bench Verified, a benchmark for performance on real software engineering tasks.
🌐
DataCamp
datacamp.com › blog › claude-4
Claude 4: Tests, Features, Access, Benchmarks & More | DataCamp
May 23, 2025 - Claude Opus 4 is designed for ... but I find that claim a bit empty. Yes, it’s currently the top performer on the SWE-bench Verified benchmark....
Discussions

Aider coding benchmarks for Claude 4 Sonnet & Opus
Sonnet 4 think < Sonnet 3.7 think? Sonnet 4 no think < Sonnet 3.7 no think? How? Regression? More on reddit.com
🌐 r/singularity
27
102
May 26, 2025
At last, Claude 4’s Aider Polyglot Coding Benchmark results are in (the benchmark many call the top "real-world" test).
Sonnet 4 worse than 3.7? More on reddit.com
🌐 r/ClaudeAI
66
161
May 26, 2025
Claude 4 benchmarks
Just tried Sonnet 4 on a toy problem, hit the context limit instantly. Demis Hassabis has made me become a big fat context pig. More on reddit.com
🌐 r/singularity
237
890
May 22, 2025
Introducing Claude 4
Here are the benchmarks:
Benchmark | Claude Opus 4 | Claude Sonnet 4 | Claude Sonnet 3.7 | OpenAI o3 | OpenAI GPT-4.1 | Gemini 2.5 Pro (Preview 05-06)
Agentic coding (SWE-bench Verified 1,5) | 72.5% / 79.4% | 72.7% / 80.2% | 62.3% / 70.3% | 69.1% | 54.6% | 63.2%
Agentic terminal coding (Terminal-bench 2,5) | 43.2% / 50.0% | 35.5% / 41.3% | 35.2% | 30.2% | 30.3% | 25.3%
Graduate-level reasoning (GPQA Diamond 5) | 79.6% / 83.3% | 75.4% / 83.8% | 78.2% | 83.3% | 66.3% | 83.0%
Agentic tool use (TAU-bench, Retail/Airline) | 81.4% / 59.6% | 80.5% / 60.0% | 81.2% / 58.4% | 70.4% / 52.0% | 68.0% / 49.4% | —
Multilingual Q&A (MMMLU 3) | 88.8% | 86.5% | 85.9% | 88.8% | 83.7% | —
Visual reasoning (MMMU validation) | 76.5% | 74.4% | 75.0% | 82.9% | 74.8% | 79.6%
HS math competition (AIME 2025 4,5) | 75.5% / 90.0% | 70.5% / 85.0% | 54.8% | 88.9% | — | 83.0%
More on reddit.com
🌐 r/ClaudeAI
208
830
April 19, 2025
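The flattened table above is easier to query once it is put in a structure. Below is a minimal Python sketch that encodes the quoted SWE-bench Verified row; where the table lists two values, Anthropic's announcement describes the higher one as using extra test-time compute (parallel sampling / extended thinking). The dictionary layout and helper name are illustrative only, not from any official source.

```python
# Illustrative only: scores copied from the quoted Reddit table above.
# Where two values appear, the first is the standard run and the second
# is the boosted run with parallel test-time compute / extended thinking.
SWE_BENCH_VERIFIED = {
    "Claude Opus 4": (72.5, 79.4),
    "Claude Sonnet 4": (72.7, 80.2),
    "Claude Sonnet 3.7": (62.3, 70.3),
    "OpenAI o3": (69.1, None),
    "OpenAI GPT-4.1": (54.6, None),
    "Gemini 2.5 Pro (05-06)": (63.2, None),
}

def standard_score(model: str) -> float:
    """Return the standard (non-boosted) SWE-bench Verified score."""
    return SWE_BENCH_VERIFIED[model][0]

if __name__ == "__main__":
    for model, (base, boosted) in sorted(
        SWE_BENCH_VERIFIED.items(), key=lambda kv: kv[1][0], reverse=True
    ):
        extra = f" (up to {boosted}% with extra test-time compute)" if boosted else ""
        print(f"{model}: {base}%{extra}")
    gap = standard_score("Claude Opus 4") - standard_score("Claude Sonnet 3.7")
    print(f"Opus 4 vs Sonnet 3.7 gap: {gap:.1f} points")
```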
🌐
Reddit
reddit.com › r/claudeai › claude 4 benchmarks - we eating!
r/ClaudeAI on Reddit: Claude 4 Benchmarks - We eating!
March 1, 2025 - Introducing the next generation: Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is our most powerful model yet, and the world’s best coding model. Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.
🌐
Vertu
vertu.com › best post › gemini 3 pro vs claude opus 4.5: the ultimate 2025 ai model comparison
Gemini 3 Pro vs Claude Opus 4.5: Benchmarks, Coding, Multimodal, and Cost Comparison
2 weeks ago - Agent Excellence: Superior performance on agentic benchmarks including: ... Extended Thinking: Advanced chain-of-thought execution with more stable reasoning across complex, multi-step problems · Tool Use Mastery: Exceptional at orchestrating multiple tools (bash, file editing, browser automation) and managing subagents via Claude Agent SDK
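The tool orchestration the Vertu piece describes can be approximated with the standard anthropic Python SDK; the sketch below is not the Claude Agent SDK it mentions, just the basic pattern of declaring one bash-style tool and reading back the model's tool-use request. The model ID and tool name are assumptions for illustration.

```python
# Minimal tool-use sketch with the anthropic Python SDK (pip install anthropic).
# NOT the Claude Agent SDK referenced above; only the core request/response shape.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID; substitute your own
    max_tokens=1024,
    tools=[{
        "name": "run_bash",  # hypothetical tool name for this sketch
        "description": "Run a shell command and return its stdout.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }],
    messages=[{"role": "user", "content": "List the files in the current directory."}],
)

# The model answers with a tool_use block naming the tool and its arguments;
# a real agent loop would execute the command and send back a tool_result.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```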
🌐
Getpassionfruit
getpassionfruit.com › blog › gpt-5-1-vs-claude-4-5-sonnet-vs-gemini-3-pro-vs-deepseek-v3-2-the-definitive-2025-ai-model-comparison
GPT 5.1 vs Claude 4.5 vs Gemini 3: 2025 AI Comparison
2 weeks ago - Gemini 3 Pro leads overall reasoning benchmarks with an unprecedented 1501 LMArena Elo, becoming the first model to break the 1500 barrier, while Claude 4.5 Sonnet dominates real-world coding at 77.2% SWE-bench and DeepSeek-V3.2 delivers ...
🌐
OpenCV
opencv.org › home › news › claude 4: the next generation of ai assistants
Claude 4 - Introduction, Benchmark & Applications
May 29, 2025 - This isn’t just a cool story; it’s a preview of the serious firepower Anthropic is unleashing today with Claude Opus 4 and Claude Sonnet 4. Forget incremental updates. These aren’t just upgrades; they’re setting new industry benchmarks for coding prowess, advanced reasoning capabilities, and the sophisticated operation of AI agents.
🌐
Vellum
vellum.ai › blog › claude-opus-4-5-benchmarks
Claude Opus 4.5 Benchmarks (Explained)
2 weeks ago - Claude Opus 4.5 achieves a remarkable score of 37.6%, a massive improvement that is more than double the score of GPT-5.1 (17.6%) and significantly higher than Gemini 3 Pro (31.1%). This points to a fundamental improvement in non-verbal, abstract ...
Find elsewhere
🌐
Composio
composio.dev › blog › claude-4-5-opus-vs-gemini-3-pro-vs-gpt-5-codex-max-the-sota-coding-model
Claude 4.5 Opus vs. Gemini 3 Pro vs. GPT-5-codex-max: The SOTA coding model - Composio
2 weeks ago - Claude 4.5 consistently demonstrated the strongest architectural reasoning and long-horizon thinking, but its outputs usually required additional effort to integrate and stabilise.
🌐
Artificial Analysis
artificialanalysis.ai › models › claude-4-opus
Claude 4 Opus - Intelligence, Performance & Price Analysis | Artificial Analysis
Analysis of Anthropic's Claude 4 Opus (Non-reasoning) and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first token), context window & more.
🌐
Entelligence
entelligence.ai › blogs › claude-4-detailed-analysis
Claude 4.0: A Detailed Analysis | Entelligence Blog
May 30, 2025 - It tops nearly all the benchmarks, the most important being SWE-bench Verified, where it scores 72.5% and reaches up to 79.4% with parallel test-time compute. Here are some other benchmarks you might be interested in: For this Claude ...
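The "parallel test-time compute" behind that 79.4% figure amounts to sampling several candidate solutions and keeping the best one, rather than a single greedy pass. A rough, model-agnostic best-of-N sketch follows; the generator and scorer are hypothetical stand-ins, not Anthropic's actual selection procedure.

```python
# Rough best-of-N sketch of "parallel test-time compute": draw several
# candidate solutions for the same task and keep the highest-scoring one.
# generate_candidate and score_candidate are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def best_of_n(
    generate_candidate: Callable[[int], str],
    score_candidate: Callable[[str], float],
    n: int = 8,
) -> str:
    """Generate n candidates in parallel and return the best-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate_candidate, range(n)))
    return max(candidates, key=score_candidate)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs on its own.
    samples = ["patch-a", "patch-bb", "patch-ccc"]
    pick = best_of_n(
        generate_candidate=lambda i: samples[i % len(samples)],
        score_candidate=len,  # e.g. a unit-test pass rate in a real setup
        n=3,
    )
    print("selected:", pick)
```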
🌐
Leanware
leanware.co › insights › claude-opus4-vs-gemini-2-5-pro-vs-openai-o3-comparison
Claude 4 Opus vs Gemini 2.5 Pro vs OpenAI o3 | Full Comparison
Recently, Anthropic and Google ... practical use cases to understand which one fits different development and business needs. TL;DR: Claude Opus 4 leads coding benchmarks at 72.5% SWE-bench, Gemi... Compare Claude Opus 4, Gemini 2.5 Pro, and OpenAI o3 to find the best AI for coding, document processing, ...
Rating: 5
🌐
Interconnects
interconnects.ai › p › claude-4-and-anthropics-bet-on-code
Claude 4 and Anthropic's bet on code - by Nathan Lambert
May 27, 2025 - Buried in the system card was an ... rather than provide real usefulness, that showed Claude 4 dramatically outperforming the 3.7 model riddled with user headaches....
🌐
Medium
medium.com › @bob.mashouf › claude-4-vs-llama-4-benchmarking-technical-capabilities-55b99c17d3f7
Claude 4 vs LLaMA 4: Benchmarking Technical Capabilities
May 26, 2025 - The dynamic reasoning module in Claude 4 gives a big boost on tasks that require more thinking, up to 15% better than Claude 3.7 in “extended thinking” mode. The multimodal capabilities shine on VQA tasks, 10% better than Claude 3.7 on MMMU.
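The "extended thinking" mode this Medium post compares against is exposed as a thinking parameter in the Anthropic Messages API. A minimal sketch of enabling it is below; the model ID and token budget are assumptions for illustration, so check the current docs for the values that apply to you.

```python
# Minimal sketch of enabling extended thinking via the anthropic Python SDK.
# The model ID and budget_tokens value are assumptions for illustration;
# max_tokens must exceed the thinking budget.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model ID; substitute your own
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "How many primes are there below 1000?"}],
)

# The response interleaves "thinking" blocks (the reasoning trace) with the
# final "text" answer; print only the visible answer here.
for block in response.content:
    if block.type == "text":
        print(block.text)
```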
🌐
Analytics Vidhya
analyticsvidhya.com › home › grok 4 vs claude 4: which is better?
Grok 4 vs Claude 4: Which is Better?
In this section, we will contrast Grok 4 and Claude 4 on some major available public benchmarks. The table below illustrates their differences and some important performance metrics, including reasoning, coding, latency, and context window size.
Published July 12, 2025
🌐
Digital Watch Observatory
dig.watch › home › updates › claude opus 4 sets a benchmark in ai coding as anthropic’s revenue doubles
Claude Opus 4 sets a benchmark in AI coding as Anthropic's revenue doubles | Digital Watch Observatory
May 25, 2025 - The Claude 4 models, backed by Amazon and developed by former OpenAI executives, feature improvements in coding, autonomous task execution, and reasoning. Opus 4 leads in the SWE-bench coding benchmark at 72.5 percent, outperforming OpenAI’s ...
🌐
Gocodeo
gocodeo.com › post › claude-4-reasoning-memory-benchmarks-tools-and-use-cases
Claude 4: Reasoning, Memory, Benchmarks, Tools, and Use Cases
May 25, 2025 - Anthropic’s Claude 4 models were benchmarked across a range of tasks in coding, reasoning, and agentic tool use.
🌐
16x Eval
eval.16x.engineer › blog › claude-4-opus-sonnet-evaluation-results
Claude Opus 4 and Claude Sonnet 4 Evaluation Results
May 25, 2025 - Both Claude models produced notably more concise code compared to other models in the test while maintaining correct functionality. When tested on creating benchmark visualizations from data, both Claude models scored 8.5/10, ...
🌐
Medium
medium.com › @leucopsis › claude-sonnet-4-and-opus-4-a-review-db68b004db90
Claude Sonnet 4 and Opus 4, a Review | by Barnacle Goose | Medium
May 29, 2025 - In the SWE-bench Verified software engineering benchmark (500 real-world coding challenges), Claude 4 models achieved the top scores: 72.5% accuracy for Opus 4 and 72.7% for Sonnet 4, significantly outperforming Claude 3.7’s 62.3% and also ...
🌐
DEV Community
dev.to › nodeshiftcloud › claude-4-opus-vs-sonnet-benchmarks-and-dev-workflow-with-claude-code-11fa
Claude 4: Opus vs Sonnet, Benchmarks, and Dev Workflow with Claude Code - DEV Community
May 24, 2025 - Claude 4 models set new highs in these tasks using only 500 problems, while OpenAI’s scores reflect a slightly smaller 477-task subset. For extended thinking benchmarks—like GPQA Diamond, TAU-bench, MMMLU, and AIME—performance surged when ...