Towards AI
pub.towardsai.net › deepseek-r1-model-architecture-853fefac7050
DeepSeek-R1: Model Architecture. This article provides an in-depth… | by Shakti Wadekar | Towards AI
March 13, 2025 - DeepSeek-R1 employs Multi-Head Latent Attention (MLA) layers in place of standard multi-head attention across all transformer layers. The first three transformer layers differ from the rest in that they use a standard Feed-Forward Network (FFN) layer; from layer 4 to layer 61, a Mixture-of-Experts (MoE) layer replaces the FFN.
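That layer layout is compact enough to sketch. A minimal illustrative stub in PyTorch (the class names here are placeholders, not DeepSeek's actual modules, which live in the DeepSeek-V3 repository):

```python
import torch.nn as nn

NUM_LAYERS = 61   # decoder blocks reported for DeepSeek-R1 (DeepSeek-V3 config)
NUM_DENSE = 3     # the first three blocks keep a standard dense FFN

# Stub sublayers, just to show the layout.
class MLAttention(nn.Identity): pass   # Multi-Head Latent Attention (stub)
class DenseFFN(nn.Identity): pass      # standard feed-forward network (stub)
class MoEFFN(nn.Identity): pass        # Mixture-of-Experts feed-forward (stub)

def build_block(i: int) -> nn.Module:
    # Every block uses MLA; the FFN sublayer is dense for blocks 0-2 and
    # MoE for blocks 3-60 (layers 4-61 in 1-indexed terms).
    ffn = DenseFFN() if i < NUM_DENSE else MoEFFN()
    return nn.Sequential(MLAttention(), ffn)

layers = nn.ModuleList(build_block(i) for i in range(NUM_LAYERS))
```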
HiddenLayer
hiddenlayer.com › home › research › innovation hub › analysing deepseek-r1’s architecture
Analysing DeepSeek-R1’s Architecture
March 25, 2025 - For the purposes of our analysis, our team converted the DeepSeek R1 model hosted on HuggingFace to the ONNX file format, enabling us to examine its computational graph. We used this, along with a review of associated technical papers and code, to identify shared characteristics and subgraphs observed within other models and piece together the defining features of its architecture.
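The article doesn't name the conversion tooling; one way to reproduce such an export, assuming Hugging Face Optimum, might look like the sketch below (a checkpoint of R1's size would need far more care in practice):

```python
# Sketch of an ONNX export via Hugging Face Optimum; the HiddenLayer team does
# not specify their exact tooling, and a 671B-parameter checkpoint would need
# sharding and careful memory handling beyond what this implies.
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1",   # the HF repo analysed in the article
    export=True,                 # convert the PyTorch weights to ONNX on load
    trust_remote_code=True,      # the DeepSeek-V3 architecture ships as remote code
)
model.save_pretrained("deepseek-r1-onnx")   # writes the .onnx graph + config
```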
Discussions

Kimi K2 Thinking and DeepSeek R1 Architectures Side by Side
Really nice breakdown, super clean visual. Kinda wild how similar they look; guessing the real sauce is in the training and tuning. Appreciate you putting this together.
r/LocalLLaMA
November 6, 2025
Notes on Deepseek r1: Just how good it is compared to OpenAI o1
Aside from the model itself, this shows that OpenAI isn't that far ahead of the others anymore. OpenAI still has the money and the hype, but a year ago no one could beat them. The game has changed, surely. Of course OpenAI is gonna make moves, but this is a huge W for LLMs in general.
r/LocalLLaMA
October 25, 2024
Fireworks AI
fireworks.ai › blog › deepseek-model-architecture
DeepSeek v3 and R1 Model Architecture: Why it's powerful and economical
DeepSeek v3 and R1 continue to use the traditional Transformer block, incorporating SwiGLU, RoPE, and RMSNorm. They also inherit the Multi-head Latent Attention (MLA) and radical Mixture-of-Experts (MoE) designs introduced by DeepSeek v2.
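Two of those components are compact enough to write out. A reference-style sketch of RMSNorm and a SwiGLU feed-forward (dimensions and names are illustrative, not DeepSeek's actual sizes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of activations (no mean-centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down( SiLU(gate(x)) * up(x) )."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```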
Medium
medium.com › @namnguyenthe › deepseek-r1-architecture-and-training-explain-83319903a684
DeepSeek-R1: Architecture and training explain | by The Nam | Medium
January 25, 2025 - But does DeepSeek-R1 rely entirely on RL? The answer is both yes and no. The authors released two distinct models: DeepSeek-R1-Zero and DeepSeek-R1. The former used only RL in the post-training process. While it performed on par with OpenAI's o1 on certain reasoning benchmarks, it struggled with poor readability and occasional language mixing.
Hugging Face
huggingface.co › deepseek-ai › DeepSeek-R1
deepseek-ai/DeepSeek-R1 · Hugging Face
DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to the DeepSeek-V3 repository.
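A minimal sketch of pulling that checkpoint with transformers; actually loading the full 671B-parameter model needs a multi-GPU node, so treat this as illustrative:

```python
# The architecture ships as remote code on the Hub, so trust_remote_code is
# required; device_map="auto" (via accelerate) shards layers across whatever
# accelerators are present.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    device_map="auto",
)
```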
arXiv
arxiv.org › pdf › 2501.12948
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning question during training. ... The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: accuracy rewards and format rewards.
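A hedged sketch of those two reward types as the paper describes them; the helper functions are illustrative, and the real checks are more elaborate:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the Table 1 template, i.e. reasoning inside
    <think></think> followed by the answer inside <answer></answer>; else 0.0."""
    ok = re.fullmatch(r"<think>.*?</think>\s*<answer>.*?</answer>",
                      completion.strip(), flags=re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the tagged answer matches the reference. The paper uses
    deterministic rule-based checks (e.g., boxed final answers for math,
    test cases for code); plain string equality is a stand-in here."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0
```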
GeeksforGeeks
geeksforgeeks.org › artificial intelligence › deepseek-r1-technical-overview-of-its-architecture-and-innovations
DeepSeek-R1: Technical Overview of its Architecture and Innovations - GeeksforGeeks
February 3, 2025 - At its core, DeepSeek-R1 distinguishes itself through an architecture built on two foundational pillars: a cutting-edge Mixture of Experts (MoE) framework and an advanced transformer-based design.
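The MoE side of that design routes each token to a handful of experts. A toy sketch of top-k routing (DeepSeek's production MoE adds shared experts, fine-grained expert segmentation, and load balancing):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy MoE layer: send each token to its top-k experts and mix the expert
    outputs with the gate weights. Sizes here are illustrative."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, dim)
        weights = self.gate(x).softmax(dim=-1)             # routing probabilities
        top = weights.topk(self.k, dim=-1)                 # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx, w = top.indices[:, slot], top.values[:, slot]
            for e in idx.unique().tolist():                # run each chosen expert once
                mask = idx == e
                out[mask] += w[mask, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE(dim=16)
print(moe(torch.randn(5, 16)).shape)   # torch.Size([5, 16])
```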
Languagemodels
newsletter.languagemodels.co › p › the-illustrated-deepseek-r1
The Illustrated DeepSeek-R1 - by Jay Alammar
January 27, 2025 - Just like previous models going back to GPT-2 and GPT-3, DeepSeek-R1 is a stack of Transformer decoder blocks. It's made up of 61 of them. The first three are dense, but the rest are mixture-of-experts layers (see my co-author Maarten's ...
Sebastian Raschka
magazine.sebastianraschka.com › p › the-big-llm-architecture-comparison
The Big LLM Architecture Comparison
July 19, 2025 - As you have probably heard more than once by now, DeepSeek R1 made a big impact when it was released in January 2025. DeepSeek R1 is a reasoning model built on top of the DeepSeek V3 architecture, which was introduced in December 2024.
Reddit
reddit.com › r/localllama › kimi k2 thinking and deepseek r1 architectures side by side
r/LocalLLaMA on Reddit: Kimi K2 Thinking and DeepSeek R1 Architectures Side by Side
November 6, 2025

Kimi K2 is based on the DeepSeek V3/R1 architecture, and here's a side-by-side comparison.

- 2× fewer attention heads (64 vs. 128)
- ~1.5× more experts per MoE layer (384 vs. 256)
- Bigger vocabulary (160k vs. 129k)
- K2 activates ~32B parameters per token (vs. 37B in DeepSeek R1)
- Fewer dense FFN blocks before MoE
- 2× longer supported context

In short, Kimi K2 is a slightly rescaled DeepSeek V3/R1, and the gains are in the data and training recipes. Hopefully, we will see some details on those soon, too.
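For quick reference, the headline numbers above collected side by side (values taken from the post, not from official config files):

```python
# Convenience sketch of the post's comparison; nothing here is authoritative.
specs = {
    "DeepSeek R1": dict(attn_heads=128, experts_per_moe_layer=256,
                        vocab="129k", active_params_per_token="37B"),
    "Kimi K2":     dict(attn_heads=64,  experts_per_moe_layer=384,
                        vocab="160k", active_params_per_token="~32B"),
}
for model, s in specs.items():
    print(f"{model:12} {s}")
```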

NVIDIA
build.nvidia.com › deepseek-ai › deepseek-r1 › modelcard
deepseek-r1 Model by Deepseek-ai | NVIDIA NIM
Runtime Engine(s): vLLM and SGLang
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, and NVIDIA Volta architectures [Preferred/Supported]
Operating System(s): Linux
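A minimal offline-inference sketch against one of the listed engines, vLLM; the parallelism and sampling values are illustrative assumptions, not model-card requirements:

```python
# Serving the full R1 checkpoint realistically needs a multi-GPU node;
# tensor_parallel_size=8 and the sampling settings are placeholder choices.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)
params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Why is the sky blue? Think step by step."], params)
print(outputs[0].outputs[0].text)
```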
Insights
netsetsoftware.com › home › artificial intelligence & ml
DeepSeek R1 Open Source Models Selecting the Right Architecture with RAG - Insights
February 19, 2025 - DeepSeek R1: Selecting the right open-source model architecture with RAG. A guide to building efficient and powerful AI applications.
Thelmbook
thelmbook.com › articles
DeepSeek R1 and R1-Zero Explained
Fireworks AI
fireworks.ai › blog › deepseek-r1-deepdive
DeepSeek-R1 Overview: Features, Capabilities, Parameters
DeepSeek R1 excels at tasks demanding logical inference, chain-of-thought reasoning, and real-time decision-making. Whether it’s solving high-level mathematics, generating sophisticated code, or breaking down complex scientific questions, DeepSeek R1’s RL-based architecture allows it to self-discover and refine reasoning strategies over time.
Nature
nature.com › articles › article
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning | Nature
September 17, 2025 - The prompts used in this dataset ... improvement. The architecture of our reward model is consistent with that of DeepSeek-R1, with the addition of a reward head designed to predict scalar preference scores.
Neosage
blog.neosage.io › p › inside-deepseek-r1-a-masterclass
Inside DeepSeek-R1: A Masterclass in Incentivising Intelligence
May 15, 2025 - So from this point on, we stop looking at R1 as “a strong open model” and start looking at it as a system architecture, one that happens to make strong reasoning emerge with lower training burden, lower inference cost, and far better alignment with engineering constraints. Let's unpack that system. Note for the reader: this breakdown has been intentionally kept accessible, not to simplify the work, but to sharpen your intuition. The goal isn't just to understand DeepSeek-R1, but to update your mental model so you can take these ideas to the application layer.
YouTube
youtube.com › watch
DeepSeek R1 Theory Tutorial – Architecture, GRPO, KL Divergence - YouTube
Learn about DeepSeek R1's innovative AI architecture from @deeplearningexplained. The course explores how R1 achieves exceptional reasoning through reinforcement learning.
Published March 11, 2025
Milvus
milvus.io › ai-quick-reference › what-is-the-architecture-of-deepseeks-r1-model
What is the architecture of DeepSeek's R1 model?
DeepSeek’s R1 model is a transformer-based architecture designed for efficiency and scalability, optimized for both training and inference. Like most modern large language models, it relies on the transformer’s self-attention mechanism to ...
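The attention variant R1 actually uses, MLA, compresses keys and values through a latent bottleneck to shrink the KV cache. A much-simplified sketch of the idea (not DeepSeek's implementation; RoPE and the decoupled key path are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Simplified take on Multi-Head Latent Attention: K and V are
    reconstructed from a small shared latent, so a KV cache would only need
    to store d_latent values per token instead of full per-head K/V."""
    def __init__(self, d_model: int, d_latent: int, n_heads: int):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # decompress keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # decompress values

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        b, t, _ = x.shape
        latent = self.kv_down(x)   # the only thing a KV cache would keep
        heads = lambda z: z.view(b, t, self.h, self.d).transpose(1, 2)
        q, k, v = heads(self.q_proj(x)), heads(self.k_up(latent)), heads(self.v_up(latent))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return y.transpose(1, 2).reshape(b, t, -1)
```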
Deep AI
deepai.tn › papers › deepseek-r1-architecture-workflow
DeepSeek R1 Architecture and Training Workflow from Scratch – Deep AI — Leading Generative AI-powered Solutions for Business
Now that we've covered the main ideas, let's dive into how the reward modeling works for R1-Zero. They kept things straightforward: instead of using a complex neural network to evaluate answers, they opted for a simple rule-based reward system. Take our math problem: “What is 2 + 3 * 4?” The system knows the correct answer is 14. It checks the output from DeepSeek V3 (our reinforcement learning agent) and looks specifically at the final answer the model produces.
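A sketch of that check for the "2 + 3 * 4" example, assuming answers arrive in the R1-Zero template's <answer> tags (the real parsing rules are more involved than a string comparison):

```python
import re

def rule_based_reward(completion: str, gold: str = "14") -> float:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0                                  # unparseable -> no reward
    return 1.0 if m.group(1).strip() == gold else 0.0

print(rule_based_reward("<think>3*4 = 12, 2+12 = 14</think><answer>14</answer>"))  # 1.0
print(rule_based_reward("<answer>20</answer>"))                                    # 0.0
```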
Reddit
reddit.com › r/localllama › notes on deepseek r1: just how good it is compared to openai o1
r/LocalLLaMA on Reddit: Notes on Deepseek r1: Just how good it is compared to OpenAI o1
October 25, 2024

Finally, there is a model worthy of the hype it has been getting since Claude 3.6 Sonnet. Deepseek has released something hardly anyone expected: a reasoning model on par with OpenAI’s o1 within a month of the v3 release, with an MIT license and at 1/20th of o1’s cost.

This is easily the best release since GPT-4. It's wild; the general public seems excited about this, while the big AI labs are probably scrambling. It feels like things are about to speed up in the AI world. And it's all thanks to this new DeepSeek-R1 model and how they trained it. 

Some key details from the paper

  • Pure RL (GRPO) on v3-base to get r1-zero. (No Monte-Carlo Tree Search or Process Reward Modelling)

  • The model uses “Aha moments” as pivot tokens to reflect and reevaluate answers during CoT.

  • To overcome r1-zero’s readability issues, v3 was SFTd on cold start data.

  • Distillation works: small models like Qwen and Llama trained on r1-generated data show significant improvements.

Here’s the overall r1-zero pipeline:

  • v3 base + RL (GRPO) → r1-zero

The r1 training pipeline (sketched in code after this list):

  1. DeepSeek-V3 Base + SFT (Cold Start Data) → Checkpoint 1

  2. Checkpoint 1 + RL (GRPO + Language Consistency) → Checkpoint 2

  3. Checkpoint 2 used to Generate Data (Rejection Sampling)

  4. DeepSeek-V3 Base + SFT (Generated Data + Other Data) → Checkpoint 3

  5. Checkpoint 3 + RL (Reasoning + Preference Rewards) → DeepSeek-R1
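
A runnable trace of those five steps, with named stubs standing in for the actual training stages (illustrative only, not DeepSeek's code):

```python
# Each function below is a stub that returns a label for the stage it mimics.
def sft(base, data):        return f"SFT[{base} <- {data}]"
def rl_grpo(ckpt, reward):  return f"RL[{ckpt}, reward={reward}]"
def rejection_sample(ckpt): return f"samples~{ckpt}"

v3 = "DeepSeek-V3-Base"
ckpt1 = sft(v3, "cold-start data")                          # step 1
ckpt2 = rl_grpo(ckpt1, "GRPO + language consistency")       # step 2
data = rejection_sample(ckpt2)                              # step 3
ckpt3 = sft(v3, data + " + other data")                     # step 4
r1 = rl_grpo(ckpt3, "reasoning + preference rewards")       # step 5
print(r1)
```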

We know the benchmarks, but just how good is it?

Deepseek r1 vs OpenAI o1.

So, for this, I tested r1 and o1 side by side on complex reasoning, math, coding, and creative writing problems. These are questions that previously only o1 could solve, or that no model could.

Here’s what I found:

  • For reasoning, it is much better than any SOTA model before o1. It is better than o1-preview but a notch below o1. This also shows up in the ARC-AGI bench.

  • Mathematics: same story; r1 is a killer, but o1 is better.

  • Coding: I didn’t get to play much, but on first look, it’s up there with o1, and the fact that it costs 20x less makes it the practical winner.

  • Writing: This is where R1 takes the lead. It gives the same vibes as early Opus. It’s free, less censored, has much more personality, is easy to steer, and is very creative compared to the rest, even o1-pro.

What interested me was how free the model sounded and how its thought traces read, akin to a human internal monologue. Perhaps this is because of less stringent RLHF, unlike US models.

The fact that you can get r1 from v3 via pure RL was the most surprising.

For in-depth analysis, commentary, and remarks on the Deepseek r1, check out this blog post: Notes on Deepseek r1

What are your experiences with the new Deepseek r1? Did you find the model useful for your use cases?