🌐
arXiv
arxiv.org › html › 2501.12948v1
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
October 13, 2025 - In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question ...
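For reference, the objective this snippet alludes to can be written roughly as follows, paraphrasing the paper's formulation (notation simplified; treat this as a sketch rather than the exact equation):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
    \left[ \frac{1}{G} \sum_{i=1}^{G}
      \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
        \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
        - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right) \right],
\qquad
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}.
```

The advantage A_i is computed purely from the group of sampled rewards, which is what lets GRPO drop the critic model that PPO would otherwise need.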
🌐
Phil Schmid
philschmid.de › deepseek-r1
Bite: How Deepseek R1 was trained
January 17, 2025 - DeepSeek AI released DeepSeek-R1, an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach.
🌐
Medium
medium.com › yugen-ai-technology-blog › understanding-the-math-behind-grpo-deepseek-r1-zero-9fb15e103a0a
Understanding the Math Behind GRPO — DeepSeek-R1-Zero | by Yugen.ai | Yugen.ai Technology Blog | Medium
February 8, 2025 - While there have been numerous ... DeepSeek-R1-Zero and DeepSeek-R1 i.e. Reinforcement Learning, specifically the Group Relative Policy Optimisation (GRPO)....
🌐
GitHub
github.com › philschmid › deep-learning-pytorch-huggingface › blob › main › training › mini-deepseek-r1-aha-grpo.ipynb
deep-learning-pytorch-huggingface/training/mini-deepseek-r1-aha-grpo.ipynb at main · philschmid/deep-learning-pytorch-huggingface
Well, DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach.
Author   philschmid
🌐
GitHub
github.com › huggingface › open-r1
GitHub - huggingface/open-r1: Fully open reproduction of DeepSeek-R1
The project is simple by design and mostly consists of: src/open_r1: contains the scripts to train models as well as generate synthetic data: grpo.py: trains a model with GRPO on a given dataset.
Starred by 25.8K users
Forked by 2.4K users
Languages   Python 89.4% | Shell 10.0% | Makefile 0.6%
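open-r1's grpo.py builds on TRL's GRPOTrainer. A minimal sketch of that usage (the model name, dataset, and reward function below are illustrative placeholders, not the repo's actual configuration):

```python
# Minimal GRPO fine-tuning sketch using TRL's GRPOTrainer.
# Model, dataset, and reward are placeholders for illustration only.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset

def reward_len(completions, **kwargs):
    # Toy rule-based reward: prefer completions near 50 characters.
    return [-abs(50 - len(c)) for c in completions]

args = GRPOConfig(output_dir="grpo-demo", num_generations=8)  # 8 samples per prompt
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

The key knob here is num_generations, i.e. the group size G over which the group-relative baseline is computed.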
🌐
DataOps Labs
blog.dataopslabs.com › deepseek-r1-efficient-reinforcement-learning-with-grpo
Efficient Learning: DeepSeek R1 with GRPO
January 28, 2025 - DeepSeek R1 uses GRPO for cost-efficient AI training, boosting reasoning capabilities and reducing hardware expenses across diverse tasks
🌐
Oxen
oxen.ai › blog › how-deepseek-r1-grpo-and-previous-deepseek-models-work
How DeepSeek R1, GRPO, and Previous DeepSeek Models Work | Oxen.ai
Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
🌐
Unsloth
unsloth.ai › blog › r1-reasoning
Train your own R1 reasoning model locally (GRPO)
DeepSeek’s R1 research revealed an “aha moment” where R1-Zero autonomously learned to allocate more thinking time without human feedback by using Group Relative Policy Optimization (GRPO).
🌐
GitHub
github.com › FareedKhan-dev › train-deepseek-r1
GitHub - FareedKhan-dev/train-deepseek-r1: Building DeepSeek R1 from Scratch
But DeepSeek uses GRPO to train their initial model (R1-Zero). GRPO does things differently: it figures out a baseline, a kind of reference point for good actions, directly from the results it gets from a group of actions.
Starred by 730 users
Forked by 118 users
Languages   Jupyter Notebook
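A minimal sketch of the "baseline from a group of actions" idea this snippet describes, kept deliberately simple and not taken from the repo itself:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Score each sampled answer against its own group's average reward,
    which plays the role of the critic's baseline in PPO."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all answers tied: no learning signal
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one question, two judged correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [0.87, -0.87, -0.87, 0.87]
```

Answers above the group average get a positive advantage, answers below it a negative one; no separate value model is ever trained.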
🌐
Hugging Face
huggingface.co › learn › llm-course › en › chapter12 › 3
Understanding the DeepSeek R1 Paper - Hugging Face LLM Course
DeepSeek R1 represents a significant ... learning. The paper introduces a new reinforcement learning algorithm called Group Relative Policy Optimization (GRPO)....
🌐
Artificial Intelligence in Plain English
ai.plainenglish.io › deepseek-r1-understanding-grpo-and-multi-stage-training-5e0bbc28a281
DeepSeek R1: Understanding GRPO and Multi-Stage Training | by BavalpreetSinghh | Artificial Intelligence in Plain English
February 4, 2025 - Artificial intelligence has taken a significant leap forward with the release of DeepSeek R1, an open model that challenges OpenAI’s o1 in advanced reasoning tasks. Developed using an innovative technique called Group Relative Policy Optimisation (GRPO) and a multi-stage training approach, ...
🌐
Reddit
reddit.com › r/llmdevs › how was deepseek-r1 built; for dummies
r/LLMDevs on Reddit: How was DeepSeek-R1 built; For dummies
January 27, 2025 -

Over the weekend I wanted to learn how was DeepSeek-R1 trained, and what was so revolutionary about it. So I ended up reading the paper, and wrote down my thoughts. < the article linked is (hopefully) written in a way that it's easier for everyone to understand it -- no PhD required!

Here's a "quick" summary:

1/ DeepSeek-R1-Zero is trained with pure reinforcement learning (RL), without using labeled data. It's the first time someone tried and succeeded at doing that (that we know of; the o1 report didn't show much).

2/ Traditional RL frameworks (like PPO) have something like an 'LLM coach or critic' that tells the model whether an answer was good or bad, based on given examples (labeled data). DeepSeek uses GRPO, a pure-RL framework that skips the critic: it scores a group of sampled LLM answers with predefined rules and uses the group average as the baseline.

3/ But how can you evaluate performance if you don't have labeled data to test against? With this framework, the rules aren't perfect; they're just a best guess at what "good" looks like. The RL process tries to optimize for things like:

Does the answer make sense? (Coherence)

Is it in the right format? (Completeness)

Does it match the general style we expect? (Fluency)

For example, on mathematical tasks, the DeepSeek-R1-Zero model could be rewarded for producing outputs that align with mathematical principles or are logically consistent.

It makes sense.. and it works... to some extent!
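Concretely, the "predefined rules" tend to be simple programmatic checks; the DeepSeek-R1 paper describes rule-based accuracy and format rewards for R1-Zero. A hedged sketch of that kind of reward (the exact patterns and weights below are illustrative assumptions, not the paper's implementation):

```python
import re

# Illustrative rule-based rewards in the spirit of R1-Zero's accuracy +
# format rewards; patterns and weights are assumptions, not the real code.
THINK_ANSWER = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the output follows the expected <think>/<answer> template."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference string."""
    match = ANSWER.search(completion)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def total_reward(completion: str, gold: str) -> float:
    # Illustrative weighting of the two rule-based signals.
    return accuracy_reward(completion, gold) + 0.5 * format_reward(completion)

print(total_reward("<think>7 * 6 = 42</think> <answer>42</answer>", "42"))  # 1.5
```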

4/ This model (R1-Zero) had issues with poor readability and language mixing, something you'd expect from pure RL. So the authors went through a multi-stage training process, stitching together several training methods:

5/ The resulting DeepSeek-R1 model goes through a sequence of training stages, each serving a different purpose:

(i) the cold-start data lays a structured foundation, fixing issues like poor readability;
(ii) pure RL develops reasoning almost on autopilot;
(iii) rejection sampling + SFT brings in top-tier training data that improves accuracy; and
(iv) a final RL stage adds a further level of generalization.

And with that, they're doing as well as or better than the o1 models.

Lmk if you have any questions (i might be able to answer them).

🌐
DEV Community
dev.to › aws › takeaways-from-the-deepseek-r1-model-2dli
Takeaways from the DeepSeek-R1 model - DEV Community
January 22, 2025 - The GRPO algorithm (Group Relative Policy Optimization), first introduced with DeepSeekMath and used for DeepSeek-R1, streamlines RL by eliminating a key bottleneck: the “critic” model.
🌐
Hugging Face
huggingface.co › blog › NormalUhr › grpo
DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge
Using the elementary school exam analogy, we’ve moved step by step from raw absolute scores to PPO’s full mechanism (Critic, Advantage, Clip, Reference Model), and finally to GRPO (leveraging multiple outputs’ average scores to eliminate the value function).
🌐
Substack
iaee.substack.com › p › deepseek-r1-intuitively-and-exhaustively
DeepSeek-R1 — Intuitively and Exhaustively Explained
February 3, 2025 - We start with DeepSeek-V3-Base, ... high quality chains of thought. Then we apply reinforcement learning using Group Relative Policy Optimization (GRPO)....
🌐
Towards AI
pub.towardsai.net › grpo-and-deepseek-r1-zero-9e81f15c6ba2
GRPO and DeepSeek-R1-Zero. 📚 Table of Contents | by Shakti Wadekar | Towards AI
March 15, 2025 - It is trained using a reinforcement learning technique called Group Relative Policy Optimization (GRPO)....
🌐
Oxen
ghost.oxen.ai › how-deepseek-r1-grpo-and-previous-deepseek-models-work
How DeepSeek R1, GRPO, and Previous DeepSeek Models Work
February 4, 2025 - Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
🌐
Analytics Vidhya
analyticsvidhya.com › home › decoding deepseek r1’s advanced reasoning capabilities
Decoding DeepSeek R1's Advanced Reasoning Capabilities
March 20, 2025 - Understand DeepSeek-R1’s advanced reasoning capabilities and its impact on the LLM landscape. Learn how Group Relative Policy Optimization (GRPO) enhances reinforcement learning without a Critic model.
🌐
GitHub
github.com › policy-gradient › GRPO-Zero
GitHub - policy-gradient/GRPO-Zero: Implementing DeepSeek R1's GRPO algorithm from scratch
Implementing DeepSeek R1's GRPO algorithm from scratch - policy-gradient/GRPO-Zero
Starred by 1.7K users
Forked by 81 users
Languages   Python