Is there a difference between them? I only just saw the R1 Zero one today, curious if anyone’s tried it or not.
I found the guide using Ollama and Chatbox very helpful but I am very interested in the CoT R1-zero version (vs R1 that still uses SFT) and can't seem to find a distilled version anywhere. Has anyone figured this out?
Over the weekend I wanted to learn how DeepSeek-R1 was trained and what was so revolutionary about it. So I ended up reading the paper and wrote down my thoughts. The linked article is (hopefully) written in a way that's easy for everyone to understand -- no PhD required!
Here's a "quick" summary:
1/ DeepSeek-R1-Zero is trained with pure reinforcement learning (RL), without using labeled data. It's the first time someone has tried this and succeeded. (That we know of; the o1 report didn't show much.)
2/ Traditional RL frameworks (like PPO) have something like an 'LLM coach or critic' that tells the model whether the answer was good or bad, based on given examples (labeled data). DeepSeek uses GRPO, a pure-RL framework that skips the critic and instead scores each answer relative to the group average of sampled answers, using predefined rules.
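If it helps, here's a minimal sketch of that group-relative idea in Python. The function name and example rewards are mine, not DeepSeek's code; the point is just that each answer is scored relative to the other answers sampled for the same prompt, so no learned critic is needed.

```python
import statistics

def group_relative_advantages(rewards):
    """Toy version of GRPO's core trick: score each sampled answer
    relative to the group it was sampled with, instead of asking a critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

# e.g. 4 answers sampled for the same prompt, scored by rule-based rewards
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```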
3/ But how can you evaluate performance if you don't have labeled data to test against? With this framework, the rules aren't perfect -- they're just a best guess at what "good" looks like. The RL process tries to optimize for things like:
Does the answer make sense? (Coherence)
Is it in the right format? (Completeness)
Does it match the general style we expect? (Fluency)
For example, on mathematical tasks, DeepSeek-R1-Zero could be rewarded for producing outputs that align with mathematical principles or logical consistency.
It makes sense... and it works... to some extent!
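To make the rule idea concrete, here's a toy rule-based reward in Python. The <think>/<answer> tag check mirrors the format reward the paper describes, but the specific checks and weights are illustrative assumptions, not DeepSeek's published reward function.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: no learned critic, only checks we can verify automatically."""
    reward = 0.0
    # Format check: did the model produce the expected <think>...</think><answer>...</answer> layout?
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.DOTALL):
        reward += 0.5
    # Accuracy check (for math): compare the final answer against a known result.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.5
```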
4/ This model (R1-Zero) had issues with poor readability and language mixing -- something you'd expect from pure RL. So the authors put it through a multi-stage training process, combining several training methods:
5/ The DeepSeek-R1 model goes through a sequence of training methods, each with a different purpose:
(i) the cold-start data lays a structured foundation, fixing issues like poor readability;
(ii) pure RL develops reasoning almost on autopilot;
(iii) rejection sampling + SFT provides top-tier training data that improves accuracy; and
(iv) a final RL stage adds another level of generalization.
And with that, they're doing as well as or better than the o1 models.
Lmk if you have any questions (i might be able to answer them).
There's been a lot of excitement around DeepSeek R1 obviously, but I was wondering if anyone has had success running DeepSeek R1 Zero? I don't think there are as many quantizations or distillations of R1-ZERO out there. Is my only option to rent an 8xA100 cluster?
https://huggingface.co/deepseek-ai/DeepSeek-R1
https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero
I am waiting for this. Hopefully today.
Recently just added to OpenRouter. Anyone tried it? How is it?
The AI world is losing its mind over DeepSeek-R1-Zero, a model that skipped supervised fine-tuning (SFT) entirely and learned purely through reinforcement learning (RL). Unlike its sibling R1—which uses some SFT data to stay "human-readable"—R1-Zero’s training mirrors AlphaZero’s trial-and-error self-play. The result? Jaw-dropping performance (AIME math scores jumped from 15.6% → 86.7%) paired with bizarre, uninterpretable reasoning. Researchers observed "aha moments" where it autonomously rechecked flawed logic mid-process and allocated more compute to harder problems—without human guidance. But here’s the kicker: its outputs are riddled with garbled language mixes (e.g., Chinese/English spaghetti code) and logic leaps that even its creators can’t fully explain.
Meanwhile, R1 (the SFT-hybrid version) achieves similar performance without the chaos, proving that human-curated data still tames the beast. But at what cost? R1-Zero’s pure RL approach hints at a terrifying possibility: minds that optimize truth beyond human comprehension. And with API costs 50x cheaper than OpenAI’s, scaling this could democratize superintelligence—or unleash unreadable black-box AI.
If R1-Zero’s "alien logic" solves problems we can’t, does readability even matter… or is this how alignment dies?
For about a month I'd been looking for a free model that I really like. There were several with pros and cons among Cohere, Mistral, and Gemini, but when I tried DeepSeek R1 Zero (free) I was very satisfied with the responses, as it handles both NSFW and SFW. Sometimes it becomes repetitive, but it is easy to get out of it. Maybe I'm not demanding, but I like it when a model is aware of the scenario and character descriptions.
Hey guys, DeepSeek seems to only provide an API for R1 and not for R1-Zero, so is there another platform where I can find an API for R1-Zero?
If there's no API available, what GPUs do I need to run inference on R1-Zero?
Finally, there is a model worthy of the hype it has been getting since Claude 3.6 Sonnet. DeepSeek has released something hardly anyone expected: a reasoning model on par with OpenAI's o1 within a month of the v3 release, with an MIT license and at 1/20th of o1's cost.
This is easily the best release since GPT-4. It's wild; the general public seems excited about this, while the big AI labs are probably scrambling. It feels like things are about to speed up in the AI world. And it's all thanks to this new DeepSeek-R1 model and how they trained it.
Some key details from the paper:
- Pure RL (GRPO) on v3-base to get r1-zero. (No Monte Carlo Tree Search or Process Reward Modelling.)
- The model uses "Aha moments" as pivot tokens to reflect and reevaluate answers during CoT.
- To overcome r1-zero's readability issues, v3 was SFTd on cold-start data.
- Distillation works: small models like Qwen and Llama trained on r1-generated data show significant improvements.
Here's the overall r1-zero pipeline:
- v3 base + RL (GRPO) → r1-zero

And the r1 training pipeline (a toy sketch of both follows the list):
- DeepSeek-V3 Base + SFT (Cold Start Data) → Checkpoint 1
- Checkpoint 1 + RL (GRPO + Language Consistency) → Checkpoint 2
- Checkpoint 2 used to Generate Data (Rejection Sampling)
- DeepSeek-V3 Base + SFT (Generated Data + Other Data) → Checkpoint 3
- Checkpoint 3 + RL (Reasoning + Preference Rewards) → DeepSeek-R1
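Here's the toy sketch mentioned above, assuming stub stage functions (the function names and return values are placeholders of mine; they only show how the checkpoints feed into one another, not real training code):

```python
def sft(base_model: str, data: str) -> str:
    """Supervised fine-tuning stage (stub)."""
    return f"{base_model} + SFT({data})"

def rl(model: str, rewards: str) -> str:
    """GRPO reinforcement-learning stage (stub)."""
    return f"{model} + RL({rewards})"

def rejection_sample(model: str) -> str:
    """Generate reasoning traces and keep only verified, readable ones (stub)."""
    return f"filtered data from {model}"

# r1-zero: pure RL straight on the base model
r1_zero = rl("DeepSeek-V3-Base", "rule-based rewards")

# r1: the multi-stage recipe listed above
ckpt1 = sft("DeepSeek-V3-Base", "cold-start CoT data")
ckpt2 = rl(ckpt1, "rule-based + language-consistency rewards")
reasoning_data = rejection_sample(ckpt2)
ckpt3 = sft("DeepSeek-V3-Base", reasoning_data + " + general SFT data")
deepseek_r1 = rl(ckpt3, "reasoning + preference rewards")
```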
We know the benchmarks, but just how good is it?
Deepseek r1 vs OpenAI o1.
So, for this, I tested r1 and o1 side by side on complex reasoning, math, coding, and creative writing problems. These are questions that previously only o1 could solve, or that no model could.
Here’s what I found:
- Reasoning: Much better than any previous SOTA model short of o1. It is better than o1-preview but a notch below o1. This also shows on the ARC-AGI bench.
- Mathematics: Same story; r1 is a killer, but o1 is better.
- Coding: I didn't get to play much, but on first look, it's up there with o1, and the fact that it costs 20x less makes it the practical winner.
- Writing: This is where R1 takes the lead. It gives the same vibes as early Opus. It's free, less censored, has much more personality, is easy to steer, and is very creative compared to the rest, even o1-pro.
What interested me was how free the model's outputs and thought traces sounded, akin to a human internal monologue. Perhaps this is because of less stringent RLHF, unlike the US models.
The fact that you can get r1 from v3 via pure RL was the most surprising.
For in-depth analysis, commentary, and remarks on the Deepseek r1, check out this blog post: Notes on Deepseek r1
What are your experiences with the new Deepseek r1? Did you find the model useful for your use cases?
Link to blog post: An Analysis of DeepSeek's R1-Zero and R1
From Mike Knoop (ARC-Prize Cofounder) on X:
just published my full u/arcprize analysis of deepseek's r1-zero and r1. link below. key points:
r1-zero is more important than r1.
both r1-zero and r1 score ~15% on ARC-AGI-1. this is fascinating. it matches deepseek's own benchmarking showing comparable results in logical domains like math and coding across r1-zero and r1.
r1-zero removes the final human input bottleneck -- "expert CoT labeling", e.g. supervised fine-tuning ("SFT"). from there to AGI, it's all about efficiency.
deepseek says r1-zero suffers from incoherence and language mixing. this has been corroborated online. but we saw no evidence in our testing. all this suggests:
SFT is not necessary for accurate and legible CoT reasoning in domains with strong verification.
the r1-zero training process is capable of creating its own internal domain specific language (DSL) in token space via RL optimization.
SFT is currently necessary for increasing CoT reasoning domain generality with these LLM architectures
this makes intuitive sense, as language itself is effectively a reasoning DSL. The exact same "words" can be learned in one domain and applied in another, like a program. the pure RL approach cannot yet discover a broad shared vocabulary, and I expect this will be a strong focus for future research.
ultimately r1-zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
more broadly, the public is very under-informed about impending inference demand. o3 beating ARC-AGI-1 (75%/86% on low/high compute) was barely reported mainstream. expect more market whiplash as the frontier progress isn't disseminated fast enough. mainstream press has important work to do.
o1/o3/r1 benchmark accuracy scores are exciting but the real practical impact will be massively improved reliability, leading agents to finally start working in 2025.
we'll also start seeing "synthetic data" (low quality) becoming "real data" (high quality) -- and the end user is paying for it! there is a legit power concentration potential feedback loop here to understand.
r1-zero and r1 being open is great for the world, deepseek has moved the science forward. many folks have told me they plan to use r1's ideas for ARC Prize 2025, which i'm excited to see. we are going to rapidly find the limits of LLMs + CoT search.
- DeepSeek's R1-Zero is significant because it achieves strong reasoning performance without human-labeled data (SFT); it relies only on reinforcement learning (RL).
- This overcomes the friction of human data bottlenecks.
- "Inference as training": reasoning systems can generate high-quality data during inference, which can then be used to further train and improve the model (a toy sketch of this loop follows below).
- This creates a powerful feedback loop and a potential runaway effect for companies with large user bases.
https://arcprize.org/blog/r1-zero-r1-results-analysis
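As a rough illustration of that "inference as training" loop (all the functions below are hypothetical stubs of mine, not anything ARC or DeepSeek published), the feedback cycle looks something like this:

```python
def generate_candidates(model: str, prompt: str, n: int = 8) -> list[str]:
    """Sample several reasoning traces at inference time (placeholder)."""
    return [f"trace {i} from {model} for {prompt!r}" for i in range(n)]

def verify(trace: str) -> bool:
    """Automatic check, e.g. a unit test or exact-match math verifier (placeholder)."""
    return "trace" in trace  # stand-in: a real verifier would check the actual answer

def fine_tune(model: str, data: list[str]) -> str:
    """Fold verified inference outputs back into training (placeholder)."""
    return f"{model} (+{len(data)} verified samples)"

model = "r1-like-model"
for _ in range(3):  # each round of user traffic yields new verified training data
    traces = generate_candidates(model, "some user prompt")
    verified = [t for t in traces if verify(t)]
    model = fine_tune(model, verified)
print(model)
```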
This is my personal experience. Small R1 models that can run fast enough generate too much output; effectively they end up being very slow compared to something like Llama 3.2. Even if you are OK with the speed, R1 fails to stick to simple output instructions.
Regarding the chain-of-thought concept: I am not convinced that it yields a significant improvement. Retrospection works when you have external feedback or a reference, not by going over your own thoughts like a schizophrenic exclaiming "wait, no" every now and then.
R1 gives the impression of a student who doesn't know the answer and is hoping to wing it by accidentally stumbling on something acceptable while stalling the teacher.
I decided I wanted to do a lit review of everything the deepseek team had published so far and try to get a sense of what they did differently. "Just a copy/rip-off of GPT" didn't really compute for me. Here's my plain-language, 5-minute analysis. Think of it as a warm-start to "how do I explain this to my dad?" then go read the papers cited.
On January 20th, 2025, a little-known firm operating out of the PRC open-sourced a model known as DeepSeek-R1, claimed to be a frontier-level reasoning model incorporating features such as long chains-of-thought. This advancement represents the first such model to be produced by researchers within the PRC and was accomplished without on-premises use of the NVIDIA H100 GPU, instead making use of the lower-clocked (1.75 vs 1.83 GHz) and lower-memory (80 vs 96 GB) H800 GPU (estimated 5% lower computational throughput). Performance of R1 was benchmarked by DeepSeek and found to be near that of OpenAI's o1-0912 across each of six benchmarks.
This level of performance on its own is not necessarily impressive. DeepSeek-V3 and R1 join a growing group of highly performant AI "chat" models available to the public. DeepSeek researchers were able, however, to write, train, distill and deploy a set of state-of-the-art models for a small fraction of the cost of American-led efforts. DeepSeek's self-published cost estimates for training the V3 LLM are in the range of 2.788M GPU-hours costing an estimated $5.576M USD, for a model of around 671B total parameters (DeepSeek-AI, 2024). This is in contrast to Sam Altman (CEO of OpenAI) estimating that GPT-4 cost over $100M USD to train at over 1 trillion parameters, with GPT-5 costs running into the billions (Buchholz, 2024). While DeepSeek utilized only 2,048 H800 GPUs, Meta AI (the publisher of the open-source Llama model family) is estimated to own "350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s" (Kevin Lee, 2024).
The task now is understanding what innovations led to this massive leap in training efficiency. Undoubtedly, having use of preexisting models substantially lowered the training costs for the DeepSeek venture; the DeepSeek team made ample use of the QwQ model published by the Alibaba Qwen team. Speedups came from technical expertise, such as using 8-bit floating-point precision (FP8), striking a middle ground between the larger FP16 and lower-precision INT4. Further speedups were gained from a novel load-balancing strategy, a multi-token prediction objective, and "co-design of algorithms, frameworks and hardware [to] overcome the communication bottleneck in cross-node MoE training". Great pains were clearly taken to optimize the training strategy for efficiency, with several other novel techniques not mentioned here that can be found in the DeepSeek-V3 technical report (DeepSeek-AI, 2024).
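As a toy illustration of what dropping to 8-bit floats involves (this assumes a recent PyTorch exposing the float8_e4m3fn dtype and has nothing to do with DeepSeek's actual FP8 training kernels), you can inspect the rounding error the narrower format introduces:

```python
import torch

# Cast a bf16 tensor to FP8 (e4m3) and back to see the precision loss FP8 training must manage.
x = torch.randn(8, dtype=torch.bfloat16)
x_fp8 = x.to(torch.float8_e4m3fn)   # requires PyTorch >= 2.1
x_back = x_fp8.to(torch.bfloat16)
print((x - x_back).abs().max())     # worst-case rounding error from the 8-bit format
```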
The key advancement offered by the DeepSeek-R1 training strategy was the shift from large, human-compiled datasets to an unsupervised strategy. DeepSeek-R1 was trained using only a small amount of supervised data and conducted the bulk of its learning through unsupervised reinforcement learning (RL). DeepSeek-R1-Zero, meanwhile, was trained using no supervised data, in a strategy reminiscent of the chess and shogi training of AlphaZero (Silver, 2017).
As detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Shao Z., 2024), DeepSeek researchers used a mixture-of-experts model which they trained under a strategy they call "Group Relative Policy Optimization" (GRPO). Under GRPO, computational costs are sharply reduced by eliminating the need for a second "critic" model to judge the reasoning of the model in training.
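For readers who want the math, the group-relative advantage at the heart of GRPO can be written roughly as follows (notation paraphrased from the DeepSeekMath paper, with per-token averaging omitted for brevity):

```latex
% For each question q, sample a group of G outputs o_1,...,o_G with rule-based rewards r_1,...,r_G.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% PPO-style clipped objective averaged over the group, with a KL penalty to a reference policy
% (no learned value/critic model is needed to compute \hat{A}_i).
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,\hat{A}_i,\;
      \operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)
    - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right]
```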
DeepSeek had, by 2025, published several papers and open-source models approaching state-of-the-art performance in mathematical reasoning and coding. While the DeepSeek team did have use of existing open-source models and public APIs, to dismiss the real innovations in their techniques would be a mistake. DeepSeek-R1 and the strategies behind it represent a shift in priorities common in any industry where a resource becomes limited – a shift away from “scale is all you need” or “no replacement for displacement” and towards an optimization for efficiency.
References
Buchholz, K. (2024, August 23). The Extreme Cost of Training AI Models. Forbes.
DeepSeek-AI, A. L. (2024). DeepSeek-V3 Technical Report. arXiv.org.
Kevin Lee, A. G. (2024). Building Meta's GenAI Infrastructure. Engineering at Meta.
Shao Z., W. P. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv.org.
Silver, D. H. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv.org.