🌐
arXiv
arxiv.org › html › 2501.12948v1
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
October 13, 2025 - In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question ...
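For reference, the objective this snippet alludes to can be written roughly as follows, paraphrasing the paper's formulation (notation simplified; treat this as a sketch rather than the exact equation):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
    \left[ \frac{1}{G} \sum_{i=1}^{G}
      \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
        \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
        - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right) \right],
\qquad
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}.
```

The advantage A_i is computed purely from the group of sampled rewards, which is what lets GRPO drop the critic model that PPO would otherwise need.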
🌐
Phil Schmid
philschmid.de › deepseek-r1
Bite: How Deepseek R1 was trained
January 17, 2025 - DeepSeek AI released DeepSeek-R1, an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach.
🌐
Medium
medium.com › yugen-ai-technology-blog › understanding-the-math-behind-grpo-deepseek-r1-zero-9fb15e103a0a
Understanding the Math Behind GRPO — DeepSeek-R1-Zero | by Yugen.ai | Yugen.ai Technology Blog | Medium
February 8, 2025 - While there have been numerous ... DeepSeek-R1-Zero and DeepSeek-R1 i.e. Reinforcement Learning, specifically the Group Relative Policy Optimisation (GRPO)....
🌐
GitHub
github.com › philschmid › deep-learning-pytorch-huggingface › blob › main › training › mini-deepseek-r1-aha-grpo.ipynb
deep-learning-pytorch-huggingface/training/mini-deepseek-r1-aha-grpo.ipynb at main · philschmid/deep-learning-pytorch-huggingface
Well, DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach.
Author   philschmid
🌐
GitHub
github.com › huggingface › open-r1
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1
The project is simple by design and mostly consists of: src/open_r1: contains the scripts to train models as well as generate synthetic data: grpo.py: trains a model with GRPO on a given dataset.
Starred by 25.8K users
Forked by 2.4K users
Languages   Python 89.4% | Shell 10.0% | Makefile 0.6%
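open-r1's grpo.py builds on TRL's GRPOTrainer. A minimal sketch of that usage (the model name, dataset, and reward function below are illustrative placeholders, not the repo's actual configuration):

```python
# Minimal GRPO fine-tuning sketch using TRL's GRPOTrainer.
# Model, dataset, and reward are placeholders for illustration only.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset

def reward_len(completions, **kwargs):
    # Toy rule-based reward: prefer completions near 50 characters.
    return [-abs(50 - len(c)) for c in completions]

args = GRPOConfig(output_dir="grpo-demo", num_generations=8)  # 8 samples per prompt
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

The key knob here is num_generations, i.e. the group size G over which the group-relative baseline is computed.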
🌐
DataOps Labs
blog.dataopslabs.com › deepseek-r1-efficient-reinforcement-learning-with-grpo
Efficient Learning: DeepSeek R1 with GRPO
January 28, 2025 - DeepSeek R1 uses GRPO for cost-efficient AI training, boosting reasoning capabilities and reducing hardware expenses across diverse tasks
🌐
Oxen
oxen.ai › blog › how-deepseek-r1-grpo-and-previous-deepseek-models-work
How DeepSeek R1, GRPO, and Previous DeepSeek Models Work | Oxen.ai
Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
🌐
Unsloth
unsloth.ai › blog › r1-reasoning
Train your own R1 reasoning model locally (GRPO)
DeepSeek’s R1 research revealed an “aha moment” where R1-Zero autonomously learned to allocate more thinking time without human feedback by using Group Relative Policy Optimization (GRPO).
🌐
GitHub
github.com › FareedKhan-dev › train-deepseek-r1
GitHub - FareedKhan-dev/train-deepseek-r1: Building DeepSeek R1 from Scratch
But DeepSeek uses GRPO to train their initial model (R1-Zero). GRPO does things differently: it figures out a baseline, a kind of reference point for good actions, directly from the results it gets from a group of actions.
Starred by 730 users
Forked by 118 users
Languages   Jupyter Notebook
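A minimal sketch of the "baseline from a group of actions" idea this snippet describes, kept deliberately simple and not taken from the repo itself:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Score each sampled answer against its own group's average reward,
    which plays the role of the critic's baseline in PPO."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all answers tied: no learning signal
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one question, two judged correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [0.87, -0.87, -0.87, 0.87]
```

Answers above the group average get a positive advantage, answers below it a negative one; no separate value model is ever trained.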
🌐
Hugging Face
huggingface.co › learn › llm-course › en › chapter12 › 3
Understanding the DeepSeek R1 Paper - Hugging Face LLM Course
DeepSeek R1 represents a significant ... learning. The paper introduces a new reinforcement learning algorithm called Group Relative Policy Optimization (GRPO)....
🌐
Artificial Intelligence in Plain English
ai.plainenglish.io › deepseek-r1-understanding-grpo-and-multi-stage-training-5e0bbc28a281
DeepSeek R1: Understanding GRPO and Multi-Stage Training | by BavalpreetSinghh | Artificial Intelligence in Plain English
February 4, 2025 - Artificial intelligence has taken a significant leap forward with the release of DeepSeek R1, an open model that challenges OpenAI’s o1 in advanced reasoning tasks. Developed using an innovative technique called Group Relative Policy Optimisation (GRPO) and a multi-stage training approach, ...
🌐
Reddit
reddit.com › r/llmdevs › how was deepseek-r1 built; for dummies
r/LLMDevs on Reddit: How was DeepSeek-R1 built; For dummies
January 27, 2025 -

Over the weekend I wanted to learn how was DeepSeek-R1 trained, and what was so revolutionary about it. So I ended up reading the paper, and wrote down my thoughts. < the article linked is (hopefully) written in a way that it's easier for everyone to understand it -- no PhD required!

Here's a "quick" summary:

1/ DeepSeek-R1-Zero is trained with pure reinforcement learning (RL), without using labeled data. It's the first time someone tried and succeeded at doing that (that we know of; the o1 report didn't show much).

2/ Traditional RL frameworks (like PPO) have something like an 'LLM coach or critic' that tells the model whether an answer was good or bad, based on given examples (labeled data). DeepSeek uses GRPO, a pure-RL framework that skips the critic: it scores a group of sampled LLM answers with predefined rules and uses the group average as the baseline.

3/ But how can you evaluate performance if you don't have labeled data to test against? With this framework, the rules aren't perfect; they're just a best guess at what "good" looks like. The RL process tries to optimize for things like:

Does the answer make sense? (Coherence)

Is it in the right format? (Completeness)

Does it match the general style we expect? (Fluency)

For example, on mathematical tasks, the DeepSeek-R1-Zero model could be rewarded for producing outputs that align with mathematical principles or are logically consistent.

It makes sense.. and it works... to some extent!
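Concretely, the "predefined rules" tend to be simple programmatic checks; the DeepSeek-R1 paper describes rule-based accuracy and format rewards for R1-Zero. A hedged sketch of that kind of reward (the exact patterns and weights below are illustrative assumptions, not the paper's implementation):

```python
import re

# Illustrative rule-based rewards in the spirit of R1-Zero's accuracy +
# format rewards; patterns and weights are assumptions, not the real code.
THINK_ANSWER = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the output follows the expected <think>/<answer> template."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference string."""
    match = ANSWER.search(completion)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def total_reward(completion: str, gold: str) -> float:
    # Illustrative weighting of the two rule-based signals.
    return accuracy_reward(completion, gold) + 0.5 * format_reward(completion)

print(total_reward("<think>7 * 6 = 42</think> <answer>42</answer>", "42"))  # 1.5
```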

4/ This model (R1-Zero) had issues with poor readability and language mixing, something you'd expect from pure RL. So the authors went through a multi-stage training process, stitching together several training methods:

5/ The resulting DeepSeek-R1 model goes through a sequence of training stages, each serving a different purpose:

(i) the cold-start data lays a structured foundation, fixing issues like poor readability;
(ii) pure RL develops reasoning almost on autopilot;
(iii) rejection sampling + SFT brings in top-tier training data that improves accuracy; and
(iv) a final RL stage adds a further level of generalization.

And with that, they're doing as well as or better than the o1 models.

Lmk if you have any questions (i might be able to answer them).

🌐
DEV Community
dev.to › aws › takeaways-from-the-deepseek-r1-model-2dli
Takeaways from the DeepSeek-R1 model - DEV Community
January 22, 2025 - The GRPO algorithm (Group Relative Policy Optimization), first introduced with DeepSeekMath and used for DeepSeek-R1, streamlines RL by eliminating a key bottleneck: the “critic” model.
🌐
Hugging Face
huggingface.co › blog › NormalUhr › grpo
DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge
Using the elementary school exam analogy, we’ve moved step by step from raw absolute scores to PPO’s full mechanism (Critic, Advantage, Clip, Reference Model), and finally to GRPO (leveraging multiple outputs’ average scores to eliminate the value function).
🌐
Substack
iaee.substack.com › p › deepseek-r1-intuitively-and-exhaustively
DeepSeek-R1 — Intuitively and Exhaustively Explained
February 3, 2025 - We start with DeepSeek-V3-Base, ... high quality chains of thought. Then we apply reinforcement learning using Group Relative Policy Optimization (GRPO)....
🌐
Towards AI
pub.towardsai.net › grpo-and-deepseek-r1-zero-9e81f15c6ba2
GRPO and DeepSeek-R1-Zero. 📚 Table of Contents | by Shakti Wadekar | Towards AI
March 15, 2025 - It is trained using a reinforcement learning technique called Group Relative Policy Optimization (GRPO)....
🌐
Oxen
ghost.oxen.ai › how-deepseek-r1-grpo-and-previous-deepseek-models-work
How DeepSeek R1, GRPO, and Previous DeepSeek Models Work
February 4, 2025 - Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
🌐
Analytics Vidhya
analyticsvidhya.com › home › decoding deepseek r1’s advanced reasoning capabilities
Decoding DeepSeek R1's Advanced Reasoning Capabilities
March 20, 2025 - Understand DeepSeek-R1’s advanced reasoning capabilities and its impact on the LLM landscape. Learn how Group Relative Policy Optimization (GRPO) enhances reinforcement learning without a Critic model.
🌐
GitHub
github.com › policy-gradient › GRPO-Zero
GitHub - policy-gradient/GRPO-Zero: Implementing DeepSeek R1's GRPO algorithm from scratch
Implementing DeepSeek R1's GRPO algorithm from scratch - policy-gradient/GRPO-Zero
Starred by 1.7K users
Forked by 81 users
Languages   Python