Artificial Analysis
artificialanalysis.ai › models › deepseek-r1-qwen3-8b
DeepSeek R1 0528 Qwen3 8B - Intelligence, Performance & Price Analysis
Analysis of DeepSeek's DeepSeek R1 0528 Qwen3 8B and comparison to other AI models across key metrics including quality, price, performance (tokens per second & time to first token), context window & more.
Reddit
reddit.com › r/localllama
r/LocalLLaMA on Reddit: DeepSeek’s new R1-0528-Qwen3-8B is the most intelligent 8B parameter model yet, but not by much: Alibaba’s own Qwen3 8B is just one point behind
June 5, 2025 -
source: https://x.com/ArtificialAnlys/status/1930630854268850271
amazing to have a local 8b model so smart like this in my machine!
what are your thoughts?
Top answer 1 of 5
65
Those benchmarks are a meme. ArtificialAnalysis uses benchmarks established by other research groups, which are often old and overtrained, so they aren't reliable. They carefully show or hide models on the default list to paint a picture of bigger models doing better, but when you enable Qwen3 8B and 32B with reasoning to be shown, this all falls apart. It's nice enough to brag about a model on LinkedIn, and they are somewhat useful - they seem to be independent and the image and video arenas are great - but they're not capable of maintaining leak-proof expert benchmarks. Look at math reasoning:

DeepSeek R1 0528 (May '25) - 94
Qwen3 14B (Reasoning) - 86
Qwen3 8B (Reasoning) - 83
DeepSeek R1 (Jan '25) - 82
DeepSeek R1 0528 Qwen3 8B - 79
Claude 3.7 Sonnet (thinking) - 72

Overall bench (Intelligence Index):

DeepSeek R1 (Jan '25) - 60
Qwen3 32B (Reasoning) - 59

Do you believe it makes sense for Qwen3 8B to score above DeepSeek R1, or for Claude 3.7 Sonnet to be outclassed by DeepSeek R1 0528 Qwen3 8B by a big margin? Another bench - LiveCodeBench:

Qwen3 14B (Reasoning) - 52
Claude 3.7 Sonnet (thinking) - 47

Why are devs using Claude 3.7/4 in Windsurf/Cursor/Roo/Cline/Aider and not Qwen3 14B? Qwen3 14B is apparently a much better coder lmao. I can't call it benchmark contamination, but it's definitely overfit to benchmarks. For god's sake, when you let base Qwen 2.5 32B non-Instruct generate random tokens with a trash prompt, it will often generate MMLU-style question-and-answer pairs on its own. It's trained to do well on the benchmarks that they test on.
2 of 5
13
i really don't trust Artificial Analysis rankings these days since they just aggregate other people's old benchmarks, and they still use SciCode or whatever even though it's completely saturated - all models score 99% on it
Videos
41:46
DeepSeek R1 0528 : 8B vs 671B (Live Test) - YouTube
17:15
DeepSeek R1 0528 Qwen3 8B - Small Upgraded Student Model - Install ...
r/LocalLLaMA on Reddit: DeepSeek-R1-0528-Qwen3-8B on iPhone 16 Pro
11:57
Deepseek R1 0528 685b Q4 Full Local Ai Review - YouTube
06:04
DeepSeek R1 0528 in 6 Minutes - YouTube
10:25
Run DeepSeek-R1-0528-Qwen3-8B Locally with Gaia (Easy Tutorial!)
Artificial Analysis
artificialanalysis.ai › models › deepseek-r1-qwen3-8b › providers
DeepSeek R1 0528 Qwen3 8B: API Provider Performance Benchmarking & Price Analysis | Artificial Analysis
Analysis of API providers for DeepSeek R1 0528 Qwen3 8B across performance metrics including latency (time to first token), output speed (output tokens per second), price and others. API providers benchmarked include Novita.
Ollama
ollama.com › library › deepseek-r1:8b
deepseek-r1:8b
In this update, DeepSeek R1 has significantly improved its reasoning and inference capabilities. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic.
LM Studio
lmstudio.ai › home › models › deepseek-r1
deepseek-r1
This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching the performance of Qwen3-235B-thinking.
Unsloth
unsloth.ai › blog › deepseek-r1-0528
How to Run Deepseek-R1-0528 Locally
The distill achieves the same performance as Qwen3 (235B). Qwen3 GGUF: DeepSeek-R1-0528-Qwen3-8B-GGUF. You can also fine-tune the Qwen3 model with Unsloth. You can run the model using Unsloth's 1.78-bit Dynamic 2.0 GGUFs on your favorite inference frameworks. We quantized DeepSeek's R1 671B parameter model from 720GB down to 185GB - a 75% size reduction.
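As a quick sanity check on the quoted sizes (the 720GB and 185GB figures come from the snippet above; the arithmetic is ours):

```python
# Verify the claimed ~75% size reduction from Unsloth's quantization.
original_gb = 720   # full-precision DeepSeek R1 671B weights, per the snippet
quantized_gb = 185  # 1.78-bit Dynamic 2.0 GGUF, per the snippet

reduction_pct = (original_gb - quantized_gb) / original_gb * 100
print(round(reduction_pct, 1))  # 74.3 - i.e. roughly the advertised 75%
```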
Runpod
runpod.io › home › blog
The 'Minor Upgrade' That’s Anything But: DeepSeek R1 0528 Deep Dive | Runpod Blog
Perhaps most impressively, the chain-of-thought from DeepSeek-R1-0528 was distilled to post-train Qwen3 8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B. This model achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024, surpassing Qwen3 8B by +10.0% and matching ...
Ollama
ollama.com › sam860 › deepseek-r1-0528-qwen3:8b
sam860/deepseek-r1-0528-qwen3:8b
For benchmarks requiring sampling, the model uses:
- Temperature: 0.6
- Top-p: 0.95
- 16 responses per query to estimate pass@1
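The pass@1 estimation described above can be sketched as follows; the sampling parameters are from the listing, while the function name and the correctness flags are hypothetical illustrations:

```python
# Sketch of estimating pass@1 from n sampled responses per query:
# sample n responses (here n=16, temperature 0.6, top-p 0.95),
# grade each, and take the fraction that are correct.

def pass_at_1(correct_flags):
    """Estimate pass@1 for one query as the mean correctness over its samples."""
    return sum(correct_flags) / len(correct_flags)

# Hypothetical grading of 16 sampled responses to one query: 12 correct.
flags = [True] * 12 + [False] * 4
print(pass_at_1(flags))  # 0.75
```

Averaging this per-query estimate over the whole benchmark gives the reported score; using many samples per query just reduces the variance of the estimate.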
Reddit
reddit.com › r/localllama › deepseek-r1-0528-qwen3-8b
r/LocalLLaMA on Reddit: DeepSeek-R1-0528-Qwen3-8B
April 10, 2025 - (Edit: source: https://livebench.ai/#/) But I got the impression qwq 32b is worth trying because, based on https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87, it performs almost at the level of deepseek r1 at larger context; it is even better than deepseek r1 0528 at larger context... You should try it yourself and compare the results for your use case. ... I don't see the R1 Qwen3 8B distill on that site, and neither do I find Data Analysis... so I'm not sure what you are talking about here. ... The site I linked to is the fiction benchmark; it basically tests coherence at various context lengths.