Videos
- How was DeepSeek-R1 built; For dummies
- Notes on Deepseek r1: Just how good it is compared to OpenAI o1
Over the weekend I wanted to learn how DeepSeek-R1 was trained and what was so revolutionary about it. So I ended up reading the paper and wrote down my thoughts. The linked article is (hopefully) written in a way that makes it easy for everyone to understand -- no PhD required!
Here's a "quick" summary:
1/ DeepSeek-R1-Zero is trained with pure reinforcement learning (RL), without using labeled data. It's the first time someone tried this and succeeded -- that we know of, at least; the o1 report didn't reveal much.
2/ Traditional RL frameworks (like PPO) have something like an 'LLM coach' or critic that tells the model whether an answer was good or bad, based on given examples (labeled data). DeepSeek uses GRPO, a pure-RL framework that skips the critic: it samples a group of answers from the LLM, scores them with predefined rules, and compares each answer against the group average.
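The group-scoring idea can be sketched in a few lines. This is a toy illustration of the group-relative advantage only -- the full GRPO objective (clipped policy ratios, KL penalty) is omitted, and the function name is mine:

```python
import statistics

def grpo_advantages(rewards):
    """Score each sampled answer relative to its own group: the
    advantage is the reward's z-score within the group, so no
    learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All answers scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four sampled answers to the same prompt, scored by rule-based rewards:
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Answers above the group mean get a positive advantage and are reinforced; answers below it get a negative one and are discouraged.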
3/ But how can you evaluate performance without labeled data to test against? With this framework, the rules aren't perfect -- they're just a best guess at what "good" looks like. The RL process tries to optimize for things like:
Does the answer make sense? (Coherence)
Is it in the right format? (Completeness)
Does it match the general style we expect? (Fluency)
For example, for mathematical tasks, the DeepSeek-R1-Zero model could be rewarded for producing outputs that align with mathematical principles or logical consistency.
It makes sense... and it works -- to some extent!
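As a toy sketch of what such rule-based rewards might look like (the `<think>` tag and `\boxed{}` conventions, and the weights, are illustrative guesses, not the paper's exact implementation):

```python
import re

def rule_based_reward(output: str, reference_answer: str) -> float:
    """Score an answer with programmatic checks instead of a
    learned reward model. Patterns and weights are made up."""
    reward = 0.0
    # Format check: reasoning wrapped in <think>...</think> tags.
    if re.search(r"<think>.+?</think>", output, re.DOTALL):
        reward += 0.5
    # Accuracy check (verifiable tasks like math): compare the
    # final boxed answer against the known result.
    match = re.search(r"\\boxed\{([^}]*)\}", output)
    if match and match.group(1).strip() == reference_answer:
        reward += 1.0
    return reward

good = rule_based_reward("<think>2 + 2 = 4</think> \\boxed{4}", "4")
bad = rule_based_reward("The answer is four.", "4")
```

Because the checks are purely programmatic, they scale to millions of samples without any human labeling -- which is exactly what lets pure RL work here.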
4/ This model (R1-Zero) had issues with poor readability and language mixing -- something you'd expect from pure RL. So the authors moved to a multi-stage training process, doing something that feels like stacking various training methods:
5/ The DeepSeek-R1 model goes through this list of training methods, each serving a different purpose:
(i) the cold-start data lays a structured foundation, fixing issues like poor readability;
(ii) pure RL develops reasoning almost on auto-pilot;
(iii) rejection sampling + SFT provides top-tier training data that improves accuracy; and
(iv) a final RL stage ensures an additional level of generalization.
And with that, they perform as well as or better than the o1 models.
Lmk if you have any questions (I might be able to answer them).
Finally, there is a model worthy of the hype it has been getting since Claude 3.6 Sonnet. Deepseek has released something hardly anyone expected: a reasoning model on par with OpenAI's o1, within a month of the v3 release, with an MIT license and at 1/20th of o1's cost.
This is easily the best release since GPT-4. It's wild; the general public seems excited about this, while the big AI labs are probably scrambling. It feels like things are about to speed up in the AI world. And it's all thanks to this new DeepSeek-R1 model and how they trained it.
Some key details from the paper
- Pure RL (GRPO) on v3-base to get r1-zero (no Monte Carlo Tree Search or Process Reward Modelling).
- The model uses "Aha moments" as pivot tokens to reflect on and reevaluate answers during CoT.
- To overcome r1-zero's readability issues, v3 was SFT'd on cold-start data.
- Distillation works: small models like Qwen and Llama trained on r1-generated data show significant improvements.
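The distillation recipe in that last point boils down to: sample reasoning traces from the big model, keep the verified ones, and fine-tune a small model on them. A minimal sketch of the data-collection step, where `teacher_generate`, `is_correct`, and the record format are hypothetical stand-ins, not real APIs:

```python
def build_distillation_set(prompts, references, teacher_generate, is_correct):
    """Collect teacher (r1) completions that pass a correctness
    check, producing an SFT dataset for a small student model."""
    dataset = []
    for prompt, ref in zip(prompts, references):
        completion = teacher_generate(prompt)   # teacher's CoT + answer
        if is_correct(completion, ref):         # keep verified samples only
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# Tiny usage example with stand-in callables:
ds = build_distillation_set(
    prompts=["What is 2 + 2?"],
    references=["4"],
    teacher_generate=lambda p: "<think>2 + 2 = 4</think> Answer: 4",
    is_correct=lambda completion, ref: ref in completion,
)
```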
Here's the overall r1-zero pipeline:
- v3 base + RL (GRPO) → r1-zero
The r1 training pipeline:
- DeepSeek-V3 Base + SFT (Cold Start Data) → Checkpoint 1
- Checkpoint 1 + RL (GRPO + Language Consistency) → Checkpoint 2
- Checkpoint 2 used to Generate Data (Rejection Sampling)
- DeepSeek-V3 Base + SFT (Generated Data + Other Data) → Checkpoint 3
- Checkpoint 3 + RL (Reasoning + Preference Rewards) → DeepSeek-R1
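The "Language Consistency" reward in the second stage can be pictured as the share of chain-of-thought words that are in the target language, added on top of the reasoning reward to curb r1-zero's language mixing. A toy sketch -- the classifier is caller-supplied, and the paper's exact formula may differ:

```python
def language_consistency_reward(cot_words, is_target_lang):
    """Fraction of chain-of-thought words the classifier accepts
    as target-language; 1.0 means no language mixing at all."""
    if not cot_words:
        return 0.0
    return sum(1 for w in cot_words if is_target_lang(w)) / len(cot_words)

# Crude stand-in classifier: treat pure-ASCII words as English.
mixed = ["the", "answer", "是", "four"]
score = language_consistency_reward(mixed, lambda w: w.isascii())
```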
We know the benchmarks, but just how good is it?
Deepseek r1 vs OpenAI o1.
So, for this, I tested r1 and o1 side by side on complex reasoning, math, coding, and creative-writing problems -- the kind of questions that, until now, only o1 could solve, or that no model could.
Here's what I found:
- Reasoning: much better than any previous SOTA model before o1. It is better than o1-preview but a notch below o1, which the ARC-AGI bench also shows.
- Mathematics: same story here; r1 is a killer, but o1 is better.
- Coding: I didn't get to play with it much, but at first look it's up there with o1, and the fact that it costs 20x less makes it the practical winner.
- Writing: this is where r1 takes the lead. It gives the same vibes as early Opus. It's free, less censored, has much more personality, is easy to steer, and is very creative compared to the rest, even o1-pro.
What interested me most was how free the model sounded -- its thought traces read like a human internal monologue. Perhaps this is due to less stringent RLHF than in the US models.
The most surprising part was that you can get r1 from v3 via pure RL.
For in-depth analysis, commentary, and remarks on the Deepseek r1, check out this blog post: Notes on Deepseek r1
What are your experiences with the new Deepseek r1? Did you find the model useful for your use cases?