🌐
Artificial Analysis
artificialanalysis.ai › models › deepseek-r1-qwen3-8b
DeepSeek R1 0528 Qwen3 8B - Intelligence, Performance & Price Analysis
Analysis of DeepSeek's DeepSeek R1 0528 Qwen3 8B and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first token), context window & more.
Top answer
1 of 5
65
Those benchmarks are a meme. Artificial Analysis uses benchmarks established by other research groups, which are often old and overtrained, so they aren't reliable. They carefully show or hide models on the default list to paint a picture of bigger models doing better, but when you enable Qwen3 8B and 32B with reasoning to be shown, this all falls apart. It's nice enough to brag about a model on LinkedIn, and they are somewhat useful - they seem to be independent, and the image and video arenas are great - but they're not capable of maintaining leak-proof expert benchmarks.

Look at math reasoning:
DeepSeek R1 0528 (May '25) - 94
Qwen3 14B (reasoning) - 86
Qwen3 8B (reasoning) - 83
DeepSeek R1 (Jan '25) - 82
DeepSeek R1 0528 Qwen3 8B - 79
Claude 3.7 Sonnet (thinking) - 72

Overall bench (Intelligence Index):
DeepSeek R1 (Jan '25) - 60
Qwen3 32B (reasoning) - 59

Do you believe it makes sense for Qwen3 8B to score above DeepSeek R1, or for Claude 3.7 Sonnet to be outclassed by DeepSeek R1 0528 Qwen3 8B by a big margin?

Another bench - LiveCodeBench:
Qwen3 14B (reasoning) - 52
Claude 3.7 Sonnet (thinking) - 47

Why are devs using Claude 3.7/4 in Windsurf/Cursor/Roo/Cline/Aider and not Qwen3 14B? Qwen3 14B is apparently a much better coder lmao. I can't call it benchmark contamination, but it's definitely overfit to benchmarks. For god's sake, when you let base Qwen 2.5 32B non-Instruct generate random tokens from a trash prompt, it will often produce MMLU-style question-and-answer pairs on its own. It's trained to do well at the benchmarks they test on.
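The "trash prompt" probe the commenter describes is easy to reproduce. A minimal sketch, assuming the Qwen/Qwen2.5-32B base checkpoint on Hugging Face and enough VRAM to load it; the generation settings and the crude multiple-choice detection heuristic are illustrative assumptions, not from the comment:

```python
# Sample freely from the *base* (non-Instruct) model with a meaningless prompt
# and eyeball how often MMLU-style multiple-choice Q&A pairs emerge.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-32B"  # base model, per the comment
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "asdf qwerty"  # deliberately meaningless "trash prompt"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for _ in range(5):
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=1.0)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    # Crude heuristic: benchmark-style continuations tend to contain lettered options.
    flagged = all(opt in text for opt in ("A.", "B.", "C.", "D."))
    print("possible MMLU-style output" if flagged else "ordinary text", "|", text[:120])
```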
2 of 5
13
I really don't trust Artificial Analysis rankings these days, since they just aggregate other people's old benchmarks - and they still use SciCode or whatever, which is literally beyond saturated by now; all models score 99% on it.
🌐
Hugging Face
huggingface.co › deepseek-ai › DeepSeek-R1-0528-Qwen3-8B
deepseek-ai/DeepSeek-R1-0528-Qwen3-8B · Hugging Face
This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking.
🌐
LM Studio
lmstudio.ai › models › deepseek › deepseek-r1-0528-qwen3-8b
deepseek/deepseek-r1-0528-qwen3-8b
This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking.
🌐
Hugging Face
huggingface.co › deepseek-ai › DeepSeek-R1-0528
deepseek-ai/DeepSeek-R1-0528 · Hugging Face
This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking.
🌐
OpenRouter
openrouter.ai › deepseek › deepseek-r1-0528-qwen3-8b:free
DeepSeek R1 0528 Qwen3 8B - API, Providers, Stats | OpenRouter
The distilled variant, DeepSeek-R1-0528-Qwen3-8B, transfers this chain-of-thought into an 8 B-parameter form, beating standard Qwen3 8B by +10 pp and tying the 235 B “thinking” giant on AIME 2024.
🌐
Skywork
skywork.ai › home › models › deepseek: deepseek r1 0528 qwen3 8b free chat online
DeepSeek: DeepSeek R1 0528 Qwen3 8B Free Chat Online - Skywork ai
November 1, 2025 - The 8B model excels at code generation, debugging, explanation, and multi-language programming tasks. Its hybrid reasoning system is particularly effective for algorithmic problem-solving and complex code refactoring tasks that require deep ...
🌐
Artificial Analysis
artificialanalysis.ai › models › deepseek-r1-qwen3-8b › providers
DeepSeek R1 0528 Qwen3 8B: API Provider Performance Benchmarking & Price Analysis | Artificial Analysis
Analysis of API providers for DeepSeek R1 0528 Qwen3 8B across performance metrics including latency (time to first token), output speed (output tokens per second), price and others. API providers benchmarked include Novita.
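For reference, metrics of the kind this page reports - time to first token (TTFT) and output tokens per second - can be approximated against any OpenAI-compatible endpoint. A minimal sketch, assuming the openai Python client and an OpenRouter API key in the environment; the model slug is the one from the OpenRouter listing above, and counting streamed chunks only approximates tokens (a tokenizer would be more precise):

```python
import os, time
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

start = time.perf_counter()
first_token_at = None
chunks = []
stream = client.chat.completions.create(
    model="deepseek/deepseek-r1-0528-qwen3-8b:free",
    messages=[{"role": "user", "content": "Explain chain-of-thought distillation in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content received = TTFT
        chunks.append(delta)
elapsed = time.perf_counter() - start

ttft = first_token_at - start
print(f"TTFT: {ttft:.2f}s, output speed: {len(chunks) / (elapsed - ttft):.1f} chunks/s")
```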
🌐
Reddit
reddit.com › r/localllama › deepseek-r1-0528 official benchmarks released!!!
r/LocalLLaMA on Reddit: DeepSeek-R1-0528 Official Benchmarks Released!!!
May 29, 2025 - AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking. We believe that the chain-of-thought from DeepSeek-R1-0528 will hold significant importance for both academic research on reasoning models and industrial ...
Find elsewhere
🌐
Ollama
ollama.com › library › deepseek-r1:8b-0528-qwen3-q4_K_M
deepseek-r1:8b-0528-qwen3-q4_K_M
In this update, DeepSeek R1 has significantly improved its reasoning and inference capabilities. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic.
🌐
Ollama
ollama.com › library › deepseek-r1:8b
deepseek-r1:8b
In this update, DeepSeek R1 has significantly improved its reasoning and inference capabilities. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic.
🌐
Hugging Face
huggingface.co › unsloth › DeepSeek-R1-0528-Qwen3-8B-GGUF
unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF · Hugging Face
This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking.
🌐
Clarifai
clarifai.com › deepseek-ai › deepseek-chat › models › DeepSeek-R1-0528-Qwen3-8B
DeepSeek-R1-0528-Qwen3-8B model | Clarifai - The World's AI
DeepSeek-R1-0528 improves reasoning and logic via better computation and optimization, nearing the performance of top models like o3 and Gemini 2.5 Pro.
🌐
LM Studio
lmstudio.ai › home › models › deepseek-r1
deepseek-r1
This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking.
🌐
Cordatus
blog.cordatus.ai › home › deepseek-r1-0528-qwen3-8b: think like a 235b model
DeepSeek-R1-0528-Qwen3-8B: Think Like a 235B Model - Cordatus
July 25, 2025 - On the AIME 2024 benchmark, the model achieves state-of-the-art (SOTA) performance among all open-source models, surpassing the standard Qwen3 8B by an impressive +10.0% in accuracy.
🌐
Unsloth
unsloth.ai › blog › deepseek-r1-0528
How to Run Deepseek-R1-0528 Locally
The distill achieves the same performance as Qwen3 (235B). Qwen3 GGUF: DeepSeek-R1-0528-Qwen3-8B-GGUF. You can also fine-tune the Qwen3 model with Unsloth, and you can run the model using Unsloth's 1.78-bit Dynamic 2.0 GGUFs on your favorite inference frameworks. We quantized DeepSeek's R1 671B-parameter model from 720 GB down to 185 GB - a 75% size reduction.
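As a quick sanity check on the quoted figures, 720 GB down to 185 GB works out to roughly the claimed 75%:

```python
# Verify Unsloth's quoted size reduction for the 671B R1 model.
original_gb = 720   # full checkpoint size quoted by Unsloth
quantized_gb = 185  # 1.78-bit Dynamic 2.0 GGUF size quoted by Unsloth

reduction = 1 - quantized_gb / original_gb
print(f"{reduction:.1%}")  # -> 74.3%, i.e. roughly the "75% size reduction" claimed
```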
🌐
Runpod
runpod.io › home › blog › the 'minor upgrade' that’s anything but: deepseek r1 0528 deep dive
The 'Minor Upgrade' That’s Anything But: DeepSeek R1 0528 Deep Dive | Runpod Blog
Perhaps most impressively, the chain-of-thought from DeepSeek-R1-0528 was distilled to post-train Qwen3 8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B. This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching ...
🌐
Ollama
ollama.com › sam860 › deepseek-r1-0528-qwen3:8b
sam860/deepseek-r1-0528-qwen3:8b
For benchmarks requiring sampling, the model uses:
- Temperature: 0.6
- Top-p: 0.95
- 16 responses per query to estimate pass@1
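Those settings match the common pass@k protocol: pass@1 from n samples is the fraction of correct completions, which is the k=1 special case of the unbiased pass@k estimator from the HumanEval paper. A minimal sketch of that estimator; the 9-of-16 correct count is invented purely for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws from n samples (c correct) is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 responses per query at temperature 0.6 / top-p 0.95, 9 correct.
print(pass_at_k(n=16, c=9, k=1))  # -> 0.5625, i.e. c/n for k=1
```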
🌐
Reddit
reddit.com › r/localllama › deepseek-r1-0528-qwen3-8b
r/LocalLLaMA on Reddit: DeepSeek-R1-0528-Qwen3-8B
April 10, 2025 - (Edit: source: https://livebench.ai/#/) But I got the impression QwQ 32B is worth trying, because based on https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87 it performs almost at the level of DeepSeek R1 at larger context; it is even better than DeepSeek R1 0528 at larger context... You should try it yourself and compare the results for your use case. ... I don't see the R1 Qwen3 8B distill on that site, and neither do I find Data Analysis... so I am not sure what you are talking about here? ... The site I gave the link to is the fiction benchmark; it basically tests coherence at various context lengths.