🌐
arXiv
arxiv.org › abs › 2501.12948
[2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
January 22, 2025 - View a PDF of the paper titled DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, by DeepSeek-AI and 199 other authors.
🌐
arXiv
arxiv.org › pdf › 2501.12948 pdf
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ ...
🌐
Hugging Face
huggingface.co › deepseek-ai › DeepSeek-R1
deepseek-ai/DeepSeek-R1 · Hugging Face
DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities ...
🌐
GitHub
github.com › deepseek-ai › DeepSeek-R1 › blob › main › DeepSeek_R1.pdf
DeepSeek-R1/DeepSeek_R1.pdf at main · deepseek-ai/DeepSeek-R1
deepseek-ai / DeepSeek-R1 Public · Star 91.6k · Fork 11.8k
Author: deepseek-ai
🌐
AI Papers Academy
aipapersacademy.com › home › deepseek-r1 paper explained – a new rl llms era in ai?
DeepSeek-R1 Paper Explained – A New RL LLMs Era in AI? - AI Papers Academy
July 3, 2025 - The paper, titled “DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning”, presents a state-of-the-art, open-source reasoning model and a detailed recipe for training such models using large-scale ...
🌐
Hugging Face
huggingface.co › papers › 2501.12948
Paper page - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/deepseek-r1-incentivizing-reasoning-capability-in-llms-via-reinforcement-learning
🌐
Nature
nature.com › articles › article
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning | Nature
September 17, 2025 - A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.
🌐
GitHub
github.com › deepseek-ai › DeepSeek-R1
GitHub - deepseek-ai/DeepSeek-R1
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning.
Starred by 91.6K users
Forked by 11.8K users
🌐
Medium
medium.com › data-science-in-your-pocket › understanding-deepseek-r1-paper-beginners-guide-e86f83fda796
Understanding DeepSeek-R1 paper: Beginner’s guide | by Mehul Gupta | Data Science in Your Pocket | Medium
January 31, 2025 - The paper explores a new way to improve reasoning using pure reinforcement learning (RL) — meaning no supervised data (human-labeled examples). Instead, the model learns by itself through an RL framework called GRPO (we will discuss this in ...
🌐
DeepSeek
api-docs.deepseek.com › deepseek-r1 release 2025/01/20
DeepSeek-R1 Release | DeepSeek API Docs
🛠️ DeepSeek-R1: Technical Highlights · 📈 Large-scale RL in post-training · 🏆 Significant performance boost with minimal labeled data · 🔢 Math, code, and reasoning tasks on par with OpenAI-o1 · 📄 More details: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf ·
🌐
Sean Goedecke
seangoedecke.com › deepseek-r1
What did DeepSeek figure out about reasoning with DeepSeek-R1?
The Chinese AI lab DeepSeek recently released their new reasoning model R1, which is supposedly (a) better than the current best reasoning models (OpenAI’s o1 series), and (b) was trained on a GPU cluster a fraction of the size of those used by the big western AI labs. Unlike the big western AI labs, they’ve released a paper ...
🌐
Reddit
reddit.com › r/singularity › a summary of deepseek-r1's paper by deepseek-r1
r/singularity on Reddit: A summary of DeepSeek-R1's paper by DeepSeek-R1
December 2, 2024 -
  • Aha moments emerged naturally in RL: Self-correction behaviors like "Wait, let’s reevaluate..." arose without SFT.

  • Cold-start SFT fixed readability: ~1k structured examples resolved language mixing.

  • GRPO cut RL costs by 30%: Group-wise reward normalization outperformed PPO.

  • RL increased CoT length autonomously: Reasoning steps grew from 100→1k tokens without penalties.

  • Distillation beat direct RL in small models: SFT on R1 data outperformed RL-trained base models.

  • Process rewards failed; outcome rewards worked better: Rule-based final-answer checks stabilized training.

  • XML tags reduced hallucinations 15%: Structured <think>/<answer> improved reward clarity (a minimal format-check sketch follows this list).

  • Language mixing fixed via consistency rewards: Penalized code-switching in multilingual outputs.
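
The <think>/<answer> point above is easy to make concrete. Below is a minimal sketch of what a rule-based format check along those lines could look like; the tag names come from the paper, but the reward value and the regex are illustrative assumptions, not DeepSeek's published rules.

```python
import re

# Illustrative reward magnitude; the paper does not publish exact values.
FORMAT_REWARD = 1.0

def format_reward(completion: str) -> float:
    """Return FORMAT_REWARD if the completion puts its reasoning inside
    <think>...</think> followed by a final result inside <answer>...</answer>."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return FORMAT_REWARD if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

# A well-formed completion earns the reward; free-form text does not.
print(format_reward("<think>2 + 2 = 4</think><answer>4</answer>"))  # 1.0
print(format_reward("The answer is 4."))                            # 0.0
```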

I find it funny that I've seen multiple AI YouTubers explain papers and they just go to another AI to help them in the video, but hey, it does a good job.

https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

🌐
Medium
medium.com › @mayadakhatib › deepseek-r1-a-short-summary-73b6b8ced9cf
DeepSeek R1 — a short summary
January 25, 2025 - The DeepSeek R1 model stands out for multiple reasons: It’s a free, open source SOTA reasoning model that is trained using direct Reinforcement Learning without supervised finetuning.
🌐
Reddit
reddit.com › r/llmdevs › how was deepseek-r1 built; for dummies
r/LLMDevs on Reddit: How was DeepSeek-R1 built; For dummies
January 27, 2025 -

Over the weekend I wanted to learn how DeepSeek-R1 was trained and what was so revolutionary about it. So I ended up reading the paper and wrote down my thoughts. The article linked is (hopefully) written in a way that makes it easy for everyone to understand -- no PhD required!

Here's a "quick" summary:

1/ DeepSeek-R1-Zero is trained with pure reinforcement learning (RL), without using labeled data. It's the first time someone has tried this and succeeded (that we know of; the o1 report didn't show much).

2/ Traditional RL frameworks (like PPO) have something like an 'LLM coach or critic' that tells the model whether the answer was good or bad, based on given examples (labeled data). DeepSeek uses GRPO, a pure-RL framework that skips the critic: it scores a group of sampled answers with predefined rules and normalizes each answer's reward against the group average.
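
To make the "group average" idea concrete, here is a minimal sketch of the group-normalized advantage at the core of GRPO; the real objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled answer's reward against its own group:
    the group statistics stand in for the learned critic that PPO would use."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four answers to the same prompt, scored 1.0 (passes the rules) or 0.0.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # two positive, two negative advantages
```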

3/ But how can you evaluate performance if you don't have labeled data to test against? With this framework, the rules aren't perfect -- they're just a best guess at what "good" looks like. The RL process tries to optimize for things like:

Does the answer make sense? (Coherence)

Is it in the right format? (Completeness)

Does it match the general style we expect? (Fluency)

For example, with the DeepSeek-R1-Zero model on mathematical tasks, the model could be rewarded for producing outputs that align with mathematical principles or logical consistency.
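
As a rough illustration of such a rule, a math "accuracy reward" can be as simple as extracting the final answer from the tagged output and comparing it to a known reference; the normalization below is an assumption for illustration, not the exact check used in the paper.

```python
import re

def math_accuracy_reward(completion: str, reference: str) -> float:
    """Reward 1.0 when the text inside <answer>...</answer> matches the known
    reference after light normalization, else 0.0. Purely illustrative."""
    found = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if found is None:
        return 0.0

    def normalize(s: str) -> str:
        return s.strip().replace(" ", "")

    return 1.0 if normalize(found.group(1)) == normalize(reference) else 0.0

print(math_accuracy_reward("<think>3 * 4 = 12</think><answer>12</answer>", "12"))  # 1.0
```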

It makes sense... and it works, to some extent!

4/ This model (R1-Zero) had issues with poor readability and language mixing -- something you'd expect from pure RL. So the authors went through a multi-stage training process, stitching together various training methods:

5/ The resulting DeepSeek-R1 model goes through a sequence of training stages, each serving a different purpose (a rough pseudocode sketch follows the list):

(i) the cold-start data lays a structured foundation, fixing issues like poor readability,
(ii) pure RL develops reasoning almost on auto-pilot,
(iii) rejection sampling + SFT uses top-tier training data to improve accuracy, and
(iv) another, final RL stage ensures an additional level of generalization.
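
Read as pseudocode, the recipe is roughly the following; the function names are hypothetical stand-ins for the stages, not an actual DeepSeek API, and only the order of operations is being illustrated.

```python
# Hypothetical stand-ins for the four stages; each stub just records what ran,
# so only the order of the recipe is visible.
def sft(model, data):
    return model + ["SFT"]

def rl(model, prompts):
    return model + ["RL"]

def rejection_sample(model, prompts):
    return [f"best-of-N completion for: {p}" for p in prompts]

def train_r1(base_model, cold_start_cots, reasoning_prompts, general_prompts):
    model = sft(base_model, cold_start_cots)                  # (i) cold start on curated CoT data
    model = rl(model, reasoning_prompts)                      # (ii) reasoning-oriented RL (GRPO)
    new_data = rejection_sample(model, reasoning_prompts)     # (iii) keep only the best samples...
    model = sft(model, new_data + general_prompts)            #      ...and run SFT on them
    model = rl(model, reasoning_prompts + general_prompts)    # (iv) final RL across all scenarios
    return model

print(train_r1([], ["~1k cold-start CoTs"], ["math prompt"], ["chat prompt"]))
# ['SFT', 'RL', 'SFT', 'RL']
```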

And with that, they're doing as well as or better than the o1 models.

Lmk if you have any questions (i might be able to answer them).

🌐
X
x.com › omarsar0 › status › 1881479496466927632
elvis on X: "The DeepSeek-R1 paper is a gem! Highly encourage everyone to read it. It's clear that LLM reasoning capabilities can be learned in different ways. RL, if applied correctly and at scale, can lead to some really powerful and interesting scaling and emergent properties. There https://t.co/egcmnWyBqp" / X
Here is my breakdown of the paper along with a few tests: https://youtu.be/3GlFd3doO3U?si=SVOCGhhMSY2xqR_2… The multi-stage training might not make sense initially, but it provides clues on optimizations that we can continue to tap into. Data quality is still very important for enhancing the usability of the LLM. Unlike other reasoning LLMs, DeepSeek-R1's training recipe and weights are open, so we can build on top of it.
🌐
Interconnects
interconnects.ai › p › deepseek-r1-recipe-for-o1
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
January 21, 2025 - The DeepSeek R1 report has an entire other subsection dedicated to its distillation experiments, where it took completions from the R1 model and finetuned existing open-weight models with them to boost performance. Releasing this is a fantastic service and provides a solid baseline for RL experiments on smaller models to try to match in the near future. The discussion in the paper of how large models are required to see the biggest reasoning gains (and to generate effective synthetic data) is likely the biggest open question:
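
As a rough sketch of what that distillation step amounts to, finetuning a small student on teacher completions is just supervised learning on the teacher's text; the Qwen checkpoint, learning rate, and toy data below are assumptions for illustration, not the paper's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: a small open-weight student (the Qwen checkpoint is just an
# example) and (prompt, completion) pairs previously sampled from the R1 teacher.
model_name = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

teacher_pairs = [
    ("What is 7 * 8?", "<think>7 * 8 = 56</think><answer>56</answer>"),
]

student.train()
for prompt, completion in teacher_pairs:
    batch = tokenizer(prompt + "\n" + completion, return_tensors="pt")
    # Plain next-token cross-entropy on the teacher's text; passing labels makes
    # the causal LM compute the shifted loss itself. A real run would also mask
    # the prompt tokens out of the loss and batch/pad many examples.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```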
🌐
Ponder
ponder.ing › researches › deepseek-r1-paper-explained
DeepSeek R1 Paper Explained: What is it and How does it work? - Ponder
The paper introduces DeepSeek R1, a large language model trained on a massive dataset with up to 8K context length.
🌐
arXiv
arxiv.org › abs › 2502.02523
[2502.02523] Brief analysis of DeepSeek R1 and its implications for Generative AI
February 7, 2025 - View a PDF of the paper titled Brief analysis of DeepSeek R1 and its implications for Generative AI, by Sarah Mercer and 2 other authors.